## Generate Dummy Data

Generate data according to $y = x_1 + 2x_2 + 3x_3$, with each $x_i$ drawn uniformly from $[0, 100)$.

In [1]:
import scala.math._
val sqlContext = org.apache.spark.sql.SQLContext.getOrCreate(sc)
import sqlContext.implicits._


// Generate one random row (x1, x2, x3, y) with y = x1 + 2*x2 + 3*x3
def genRanRow: (Double, Double, Double, Double) = {
    val (x1, x2, x3) = (100*random, 100*random, 100*random)
    (x1, x2, x3, x1 + 2*x2 + 3*x3)
}

val myDF = sc.parallelize((0 to 10000).map(x => genRanRow)).toDF("x1", "x2", "x3", "y")

myDF.show

+------------------+------------------+------------------+------------------+
|                x1|                x2|                x3|                 y|
+------------------+------------------+------------------+------------------+
|  64.4817542443203|  64.9836020835302| 42.69495350096511|  322.533818914276|
|  79.6847253074154| 42.24299337729658|24.270577087512045| 236.9824433245447|
| 96.19210956565894| 47.67103248508631|  34.3704759118884| 294.6456022714967|
| 73.64834716438214| 33.55295811971977|14.885980608569572| 185.4122052295304|
| 90.52472359657864| 56.01089283416708| 74.75058796401082|426.79827315694524|
| 32.88804555707674| 68.33729286349431| 44.35036380289458| 302.6137226927491|
|18.372687963745683| 95.33755341720891|  15.7354010856253| 256.2539980550394|
| 80.66991985399748|  38.0787423257596| 40.14828039937336|277.27224570363677|
| 98.70467443622748|   7.4301428466089| 81.07031801233549|356.77591416645174|
| 16.58016823391368| 97.13707824345154| 46.03651140560293|348.96
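
The generator above draws from the global random source, so every run produces different rows. As a minimal sketch, assuming reproducibility is wanted, a seeded variant in plain Scala (no Spark needed) could look like the following. The names `rng`, `genSeededRow`, `seededRows`, and `maxResidual` and the seed value `42` are illustrative and not part of the original notebook.

```scala
import scala.util.Random

// Hypothetical seeded random source; the seed value 42 is arbitrary.
val rng = new Random(42L)

// Seeded counterpart of genRanRow: each feature is uniform on [0, 100).
def genSeededRow(): (Double, Double, Double, Double) = {
  val (x1, x2, x3) = (100 * rng.nextDouble(), 100 * rng.nextDouble(), 100 * rng.nextDouble())
  (x1, x2, x3, x1 + 2 * x2 + 3 * x3)
}

val seededRows = Seq.fill(5)(genSeededRow())

// Largest deviation of any row from y = x1 + 2*x2 + 3*x3.
val maxResidual = seededRows.map { case (x1, x2, x3, y) =>
  math.abs(y - (x1 + 2 * x2 + 3 * x3))
}.max

println(s"max residual: $maxResidual")
```

The snippet is written to run as a notebook cell or REPL paste; the resulting sequence can be passed to `sc.parallelize(...).toDF(...)` in the same way as the rows above.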

## Run Linear Regression

Here are the steps to running linear regression:

1. Split the data into training and testing sets.
2. Make a `VectorAssembler`, which combines our feature columns into a single `features` column.
3. Make a `LinearRegression` object with label column `y`.
4. Put the `VectorAssembler` and `LinearRegression` objects into a single `Pipeline` object.
5. Fit the `Pipeline` to the training data. This runs the `VectorAssembler` and then fits the `LinearRegression`.
6. Transform the testing data with the fitted `Pipeline` to make predictions. This runs the `VectorAssembler` and applies the learned `LinearRegressionModel`.
7. Extract the `LinearRegressionModel` from the fitted `Pipeline` to check the training Root Mean Square Error, $\displaystyle\sqrt{\displaystyle\frac{\sum_i (y_i - \hat{y}_i)^2}{n}}$, and the fitted coefficients of $x_1$, $x_2$, and $x_3$.
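
To make the Root Mean Square Error in the last step concrete, here is a small plain-Scala sketch of the quantity $\sqrt{\sum_i (y_i - \hat{y}_i)^2 / n}$. The helper `rmse` and the toy values are made up for illustration; Spark computes this metric internally, and this is not its implementation.

```scala
// Hypothetical helper: RMSE between labels y and predictions yHat.
def rmse(y: Seq[Double], yHat: Seq[Double]): Double = {
  require(y.nonEmpty && y.length == yHat.length, "inputs must be non-empty and the same length")
  // Sum of squared residuals (y_i - yHat_i)^2.
  val sumSq = y.zip(yHat).map { case (yi, yHatI) => math.pow(yi - yHatI, 2) }.sum
  math.sqrt(sumSq / y.length)
}

// Made-up values for illustration only.
val toyLabels      = Seq(1.0, 2.0, 3.0)
val toyPredictions = Seq(1.0, 2.0, 5.0)

// The residuals are 0, 0, and -2, so the result is sqrt(4/3), roughly 1.1547.
println(rmse(toyLabels, toyPredictions))
```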

In [2]:
/* Import needed classes*/
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.ml.Pipeline

/* (1) Split the Data */
val Array(trainingData, testData) = myDF.randomSplit(Array(0.7, 0.3))

/* (2) Make your feature vector assembler */
val assembler = new VectorAssembler().
    setInputCols(Array("x1", "x2", "x3")).
    setOutputCol("features")

/* (3) Make your Linear Regression model */
val lr = new LinearRegression().setLabelCol("y")

/* (4) Put the assembler and regression model into a pipeline */
val pipeline = new Pipeline().setStages(Array(assembler, lr))

/* (5) Run the pipeline on your training data */
val model = pipeline.fit(trainingData)

/* (6) Make the predictions */
val predictions = model.transform(testData).persist

/* (7) Pull out the linear regression model from the pipeline, generate summary info */
val lrModel = model.stages(1).asInstanceOf[LinearRegressionModel]
val trainingSummary = lrModel.summary

/* (7.1) Print the Root Mean Square Error on the training data */
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")

/* (7.2) Print the Coefficients */
println(lrModel.coefficients)

RMSE: 4.3049474205807575E-13
[0.999999999999993,2.0000000000000053,2.999999999999988]


## Conclusion

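As we can see, the model recovered the correct coefficients (to within a very small $\varepsilon$), with a training RMSE of about $4.3 \times 10^{-13}$. This is expected: the data were generated without noise, so the linear relationship is exact.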
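
The RMSE reported above comes from the training summary, while the `predictions` DataFrame for the held-out test data is never scored. As a hedged sketch, assuming the `predictions` value from the cell above is still in scope and that the pipeline wrote its output to the default `prediction` column, Spark's `RegressionEvaluator` can compute the test RMSE. The names `evaluator` and `testRmse` are illustrative.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Assumes `predictions` from the cell above, containing the label column "y"
// and the default "prediction" column added by the fitted pipeline.
val evaluator = new RegressionEvaluator().
    setLabelCol("y").
    setPredictionCol("prediction").
    setMetricName("rmse")

// RMSE on the held-out test rows.
val testRmse = evaluator.evaluate(predictions)
println(s"Test RMSE: $testRmse")
```

Because the data are noise-free, the test RMSE should be of a similar tiny magnitude to the training value, though the exact number depends on the random split.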