<h1>Generalized Linear Model (GLM)</h1>

Unlike linear regression, where the output is assumed to follow a Gaussian distribution, 
in [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) the response variable $Y_i$ follows some distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).
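
As a reminder of what this means in symbols, one common way to write a GLM is given below. The notation is an assumption of this note rather than something fixed by the text above: $\theta_i$ is the natural parameter, $\phi$ a dispersion parameter, $a$, $b$ and $c$ are known functions that depend on the family, $x_i$ is the feature vector, $\beta$ the coefficient vector and $g$ the link function.

$$
f(y_i; \theta_i, \phi) = \exp\!\left( \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right), \qquad
\mu_i = \mathbb{E}[Y_i] = b'(\theta_i), \qquad
g(\mu_i) = \eta_i = x_i^{\top} \beta
$$

With the Gaussian family and the identity link $g(\mu_i) = \mu_i$, this reduces to $\mu_i = x_i^{\top} \beta$, which is ordinary linear regression.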

We carry out the usual initialization and the relevant imports.

In [1]:
val sparkVersion = "2.0.1"
val scalaVersion = scala.util.Properties.versionNumberString

sparkVersion: String = "2.0.1"
scalaVersion: String = "2.11.8"

In [2]:
classpath.add(
    "org.apache.spark" %% "spark-yarn" % sparkVersion,
    "org.apache.spark" %% "spark-mllib" % sparkVersion
)

146 new artifact(s)
146 new artifacts in macro
146 new artifacts in runtime
146 new artifacts in compile

In [3]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.ml.evaluation.RegressionEvaluator

[32mimport [36morg.apache.spark.sql.SparkSession[0m
[32mimport [36morg.apache.spark.mllib.util.MLUtils[0m
[32mimport [36morg.apache.spark.ml.regression.GeneralizedLinearRegression[0m
[32mimport [36morg.apache.spark.ml.evaluation.RegressionEvaluator[0m

As mentioned above, we can view linear regression as a [generalized linear model](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.regression.GeneralizedLinearRegression) using the Gaussian family. The following families (and supported links) are available: 

<table>
<tr><td><b>Family</b></td><td><b>Response Type</b></td><td><b>Supported Links</b></td></tr>
<tr><td>Gaussian</td><td>Continuous</td><td>Identity, Log, Inverse</td></tr>
<tr><td>Binomial</td><td>Binary</td><td>Logit, Probit, CLogLog</td></tr>
<tr><td>Poisson</td><td>Count</td><td>Log, Identity, Sqrt</td></tr>
<tr><td>Gamma</td><td>Continuous</td><td>Inverse, Identity, Log</td></tr>
</table>
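
To make the link names in the table concrete, here is a minimal plain-Scala sketch of most of them as pairs of functions: one maps the mean $\mu$ to the linear predictor $\eta$, the other maps back. The names `Link`, `Links`, `toEta` and `toMean` are hypothetical helpers for illustration and are not part of the Spark API. Probit is left out because it needs the normal quantile function, which the Scala standard library does not provide.

```scala
// Hypothetical helper types for illustration only; not part of Spark.
// toEta is the link g(mu); toMean is its inverse g^{-1}(eta).
final case class Link(name: String, toEta: Double => Double, toMean: Double => Double)

object Links {
  val identity = Link("identity", mu => mu, eta => eta)
  val log      = Link("log", mu => math.log(mu), eta => math.exp(eta))
  val inverse  = Link("inverse", mu => 1.0 / mu, eta => 1.0 / eta)
  val sqrt     = Link("sqrt", mu => math.sqrt(mu), eta => eta * eta)
  val logit    = Link("logit",
    mu => math.log(mu / (1.0 - mu)),
    eta => 1.0 / (1.0 + math.exp(-eta)))
  val cloglog  = Link("cloglog",
    mu => math.log(-math.log(1.0 - mu)),
    eta => 1.0 - math.exp(-math.exp(eta)))

  val all: Seq[Link] = Seq(identity, log, inverse, sqrt, logit, cloglog)
}
```

Each link constrains which means are possible: the log link, for example, always produces a positive mean, while the logit and cloglog links produce a mean strictly between 0 and 1.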

And a Gaussian example:

In [10]:
val sparkSession = SparkSession
  .builder()
  .master("local[1]")
  .appName("GLM")
  .getOrCreate()

// Load training data
val dataset = sparkSession.read.format("libsvm")
  .load("files/sample_linear_regression_data.txt")
    
dataset.show()
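
// Gaussian family with the identity link, i.e. ordinary linear regression.
// The labels in this dataset can be negative, so the gamma family (which
// requires strictly positive labels) is not applicable here.
val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")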
  .setMaxIter(10)
  .setRegParam(0.3)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = dataset.randomSplit(Array(0.7, 0.3))

// Fit the model to the training data
val model = glr.fit(trainingData)

// Print the coefficients and intercept for generalized linear regression model
println(s"Coefficients: ${model.coefficients}")
println(s"Intercept: ${model.intercept}")

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println("Root Mean Squared Error (RMSE) on test data = " + rmse)


+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
| -9.490009878824548|(10,[0,1,2,3,4,5,...|
| 0.2577820163584905|(10,[0,1,2,3,4,5,...|
| -4.438869807456516|(10,[0,1,2,3,4,5,...|
|-19.782762789614537|(10,[0,1,2,3,4,5,...|
| -7.966593841555266|(10,[0,1,2,3,4,5,...|
| -7.896274316726144|(10,[0,1,2,3,4,5,...|
| -8.464803554195287|(10,[0,1,2,3,4,5,...|
| 2.1214592666251364|(10,[0,1,2,3,4,5,...|
| 1.0720117616524107|(10,[0,1,2,3,4,5,...|
|-13.772441561702871|(10,[0,1,2,3,4,5,...|
| -5.082010756207233|(10,[0,1,2,3,4,5,...|
|  7.887786536531237|(10,[0,1,2,3,4,5,...|
| 14.323146365332388|(10,[0,1,2,3,4,5,...|
|-20.057482615789212|(10,[0,1,2,3,4,5,...|
|-0.8995693247765151|(10,[0,1,2,3,4,5,...|
| -19.16829262296376|(10,[0,1,2,3,4,5,...|
|  5.601801561245534|(10,[0,1,2,3,4,5,...|
|-3.2256352187273354|(10,[0,1,2,3,4,5,...|
| 1.5299675726687754|(10,[0,1,2,3,4,5,...|
| -0.250102447941961|(10,[0,1,2,3,4,5,...|
+-------------------+--------------------+

<h2>Exercise 1</h2>

Implement GLM as a standalone program that you can run on the HPC. Make sure that the family and the link function can be passed in as inputs. Using the Gaussian family (i.e. carrying out linear regression), explore different link functions for the datasets from Notebook 6. Do any link functions make more sense than others for these datasets?
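
One possible starting point for the input handling is sketched below in plain Scala, without any Spark dependency. It validates a family and an optional link against the table of supported combinations given earlier. The names `GlmArgs`, `Config` and `parse` are hypothetical, and falling back to the first link listed for a family is a choice made for this sketch.

```scala
// Hypothetical argument handling for a standalone GLM program; not part of Spark.
object GlmArgs {
  // Supported links per family, copied from the table in this notebook.
  val supportedLinks: Map[String, Seq[String]] = Map(
    "gaussian" -> Seq("identity", "log", "inverse"),
    "binomial" -> Seq("logit", "probit", "cloglog"),
    "poisson"  -> Seq("log", "identity", "sqrt"),
    "gamma"    -> Seq("inverse", "identity", "log")
  )

  final case class Config(family: String, link: String)

  // Expects: <family> [link]. Returns an error message on the Left if the
  // family is unknown or the link is not listed for that family.
  def parse(args: Array[String]): Either[String, Config] =
    args.map(_.toLowerCase) match {
      case Array(family) =>
        supportedLinks.get(family) match {
          // Assumption of this sketch: use the first listed link when none is given.
          case Some(links) => Right(Config(family, links.head))
          case None        => Left(s"Unknown family: $family")
        }
      case Array(family, link) =>
        supportedLinks.get(family) match {
          case Some(links) if links.contains(link) => Right(Config(family, link))
          case Some(links) =>
            Left(s"Link $link is not supported for $family; choose one of ${links.mkString(", ")}")
          case None => Left(s"Unknown family: $family")
        }
      case _ => Left("Usage: <family> [link]")
    }
}
```

In the actual program, the validated `family` and `link` values would then be passed to `setFamily` and `setLink` on `GeneralizedLinearRegression`, as in the example cell above.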

<h2>Exercise 2</h2>

Explore different link functions for Gamma regression for the same datasets.

<h2>Exercise 3</h2>

Compare optimization algorithms: LinearRegression uses L-BFGS or OWL-QN, depending on the type of regularization, and GeneralizedLinearRegression uses IRLS (iteratively reweighted least squares).
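
To give a feel for what IRLS does, here is a minimal plain-Scala sketch for one special case: Poisson regression with the log link, a single feature and an intercept. It is not Spark's implementation. The object `PoissonIrls`, its parameters and the choice of starting values are assumptions made for illustration. Each iteration builds a working response and weights from the current fit and then solves a weighted least squares problem, here in closed form because there are only two parameters.

```scala
// Illustrative IRLS for log(mu) = b0 + b1 * x with Poisson responses.
// Hypothetical helper, not Spark code.
object PoissonIrls {
  /** Returns the fitted (intercept, slope). */
  def fit(x: Array[Double], y: Array[Double],
          maxIter: Int = 25, tol: Double = 1e-10): (Double, Double) = {
    require(x.length == y.length && x.nonEmpty, "x and y must be non-empty and of equal length")
    require(x.distinct.length > 1, "x needs at least two distinct values")
    require(y.forall(_ >= 0.0) && y.sum > 0.0, "y must be non-negative and not all zero")

    // Start from the intercept-only fit (an assumption of this sketch).
    var b0 = math.log(y.sum / y.length)
    var b1 = 0.0
    var iter = 0
    var converged = false

    while (iter < maxIter && !converged) {
      // Accumulate the weighted sums needed for the 2x2 normal equations.
      var sw = 0.0; var swx = 0.0; var swxx = 0.0; var swz = 0.0; var swxz = 0.0
      for (i <- x.indices) {
        val eta = b0 + b1 * x(i)
        val mu = math.exp(eta)
        val w = mu                      // IRLS weight for Poisson with log link
        val z = eta + (y(i) - mu) / mu  // working response
        sw += w
        swx += w * x(i)
        swxx += w * x(i) * x(i)
        swz += w * z
        swxz += w * x(i) * z
      }
      // Solve the weighted least squares problem for z on (1, x).
      val det = sw * swxx - swx * swx
      val newB0 = (swxx * swz - swx * swxz) / det
      val newB1 = (sw * swxz - swx * swz) / det
      converged = math.abs(newB0 - b0) + math.abs(newB1 - b1) < tol
      b0 = newB0
      b1 = newB1
      iter += 1
    }
    (b0, b1)
  }
}
```

Unlike L-BFGS, which only needs gradients, each IRLS step solves a least squares system whose size grows with the number of features, which is one aspect worth keeping in mind for the comparison.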

<h2>Exercise 4</h2>

Compare linear regression and Gamma regression for the following two datasets: 

1. [Airline dataset](http://stat-computing.org/dataexpo/2009/the-data.html). Design your program to accept multiple years' worth of input.

2. [Allstate claim prediction challenge](https://www.kaggle.com/c/ClaimPredictionChallenge/data)

Take a look at the datasets before you apply Gamma regression, and consider whether it makes sense to apply it and what the alternative options are.
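
One check that can be automated: the Gamma distribution is only defined for strictly positive values, so any zero or negative label rules it out unless the data is filtered or transformed first. The plain-Scala sketch below counts such labels. The names `LabelCheck`, `Support` and `summarize` are hypothetical, and the input is assumed to be a local sequence of label values (for example a sample collected from the dataset).

```scala
// Hypothetical helper for inspecting labels before choosing a family.
object LabelCheck {
  final case class Support(n: Int, negative: Int, zero: Int) {
    // Gamma regression needs at least one label and all labels strictly positive.
    def gammaCompatible: Boolean = n > 0 && negative == 0 && zero == 0
  }

  def summarize(labels: Seq[Double]): Support =
    Support(labels.size, labels.count(_ < 0.0), labels.count(_ == 0.0))
}
```

For instance, the labels of the sample data shown earlier in this notebook include negative values, so `gammaCompatible` would be false for them.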