## Generate Dummy Data

Generate data according to $y = x_1 + 2x_2 + 3x_3 + \xi$, where $\xi$ is random noise with $\xi \sim U[-10,10]$ and each feature $x_i$ is drawn uniformly from $[0, 100)$.

In [5]:
import scala.math._
val sqlContext = org.apache.spark.sql.SQLContext.getOrCreate(sc)
import sqlContext.implicits._


// gen data according to y = x1 + 2*x2 + 3*x3 + random noise
def genRanRow: (Double, Double, Double, Double) = {
    val (x1, x2, x3) = (100*random, 100*random, 100*random)
    (x1, x2, x3, x1 + 2*x2 + 3*x3 + (20*random - 10))
}

// 0 to 10000 is inclusive, so this builds 10,001 rows
val myDF = sc.parallelize((0 to 10000).map(_ => genRanRow)).toDF("x1", "x2", "x3", "y")

myDF.show

+------------------+------------------+------------------+------------------+
|                x1|                x2|                x3|                 y|
+------------------+------------------+------------------+------------------+
| 46.52689119609921|24.327819287759354|  79.0476322364853| 340.4882827237372|
| 67.08943710680401|25.232985240496607| 38.48523378105699| 236.4988103991865|
|11.674995654017506|31.006456175387598|14.597980548332002|112.34261900289309|
|27.691692128385213|  37.8236750041249|17.815721014295573| 150.4252984305596|
|10.551002908145934| 34.63097721151326| 99.05640815930421| 375.4659166042033|
| 34.76411479712841| 68.74288501576116|10.245467974502976|193.92369292606014|
|  69.4740640445639| 32.71061010585551| 52.65989806303608| 294.4560694057112|
| 59.34707067918383| 33.48781157143556| 47.38970612201261| 273.2186789668586|
|  49.8910798266533| 60.09534545944313| 5.666297852299628| 189.2823263464727|
|50.039032825920984| 61.78759952278872| 85.88950039187218|428.78

## Run Linear Regression

Here are the steps for running linear regression:

1. Split the data into training and testing sets
2. Make a `VectorAssembler`, which combines our feature columns into a single `features` column
3. Make a `LinearRegression` object with label column `y`
4. Put the `VectorAssembler` and `LinearRegression` objects into a single `Pipeline` object
5. Fit the `Pipeline` to the training data. This runs the `VectorAssembler` and fits the `LinearRegression`
6. Transform the testing data with our fitted `Pipeline` to make predictions. This runs the `VectorAssembler` and applies the learned `LinearRegressionModel`
7. Extract the `LinearRegressionModel` from our fitted `Pipeline` to check the training Root Mean Square Error $= \displaystyle\sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2}$ and the coefficients of $x_1$, $x_2$, and $x_3$
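
To make the Root Mean Square Error in the last step concrete, here is a minimal sketch in plain Scala that computes it from (label, prediction) pairs. The helper `rmse` and the toy values are hypothetical and exist only for illustration. Spark computes this metric for us, so nothing below is needed to run the pipeline.

```scala
// Hypothetical helper (not part of Spark): RMSE over (label, prediction) pairs.
def rmse(pairs: Seq[(Double, Double)]): Double = {
  require(pairs.nonEmpty, "need at least one (label, prediction) pair")
  // Sum of squared residuals (y - yHat)^2
  val sumSq = pairs.map { case (y, yHat) => math.pow(y - yHat, 2) }.sum
  math.sqrt(sumSq / pairs.size)
}

// Toy values chosen for illustration only: the residuals are 1, -1, and 2.
val toy = Seq((3.0, 2.0), (5.0, 6.0), (10.0, 8.0))
println(rmse(toy)) // sqrt((1 + 1 + 4) / 3) = sqrt(2)
```

Squaring the residuals means over- and under-predictions both count against the model, and taking the square root puts the result back in the units of `y`.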

In [6]:
/* Import needed classes*/
import org.apache.spark.ml.feature.{VectorAssembler, StandardScaler}
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.ml.Pipeline

/* (1) Split the Data */
val Array(trainingData, testData) = myDF.randomSplit(Array(0.7, 0.3))

/* (2) Make your feature vector assembler */
val assembler = new VectorAssembler().
    setInputCols(Array("x1", "x2", "x3")).
    setOutputCol("features")

/* (3) Make your Linear Regression model */
val lr = new LinearRegression().
    setLabelCol("y"). // Label (target) column name
    setFeaturesCol("features"). // Features column name
    setStandardization(true) // Standardize features before fitting

/* (4) Put the assembler and regression model into a pipeline */
val pipeline = new Pipeline().setStages(Array(assembler, lr))

/* (5) Run the pipeline on your training data */
val model = pipeline.fit(trainingData)

/* (6) Make the predictions */
val predictions = model.transform(testData).persist

/* (7) Pull out the linear regression model from the pipeline, generate summary info */
val lrModel = model.stages(1).asInstanceOf[LinearRegressionModel]
val trainingSummary = lrModel.summary

/* (7.1) Print the training Root Mean Square Error */
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")

/* (7.2) Print the Coefficients */
println(lrModel.coefficients)

RMSE: 5.741902848522907
[1.000852160795363,2.000315472723747,3.003152979678295]
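
The RMSE printed above comes from the model's training summary. The `predictions` DataFrame built from the test data is not scored anywhere in the cell. As a sketch of how the held-out error could be checked, the cell below uses Spark ML's `RegressionEvaluator`. It assumes `predictions` is still in scope from the previous cell and that the pipeline wrote its output to the default `prediction` column. The name `testRmse` is introduced here for illustration.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Assumes `predictions` from the cell above, with the default "prediction" output column
val evaluator = new RegressionEvaluator().
    setLabelCol("y").
    setPredictionCol("prediction").
    setMetricName("rmse")

// RMSE on the held-out test split
val testRmse = evaluator.evaluate(predictions)
println(s"Test RMSE: $testRmse")
```

If the model generalizes well, the test RMSE should land close to the training RMSE. The exact value varies from run to run because `randomSplit` and the data generator are both unseeded.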


## Conclusion

As we can see, the model recovered the true coefficients $(1, 2, 3)$ to within a small $\varepsilon$, with a training RMSE of $5.74$. This error is essentially irreducible, since it comes from the noise term $\xi$.
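
To see why an RMSE near $5.74$ is about the best we can expect, we can work out the standard deviation of the noise. This assumes $\xi$ is exactly uniform on $[a, b] = [-10, 10]$, as in the data generator:

$$\sigma_\xi = \sqrt{\frac{(b-a)^2}{12}} = \frac{20}{\sqrt{12}} \approx 5.77$$

The fitted RMSE sits close to this value, which is consistent with the model having captured the linear signal and left only the noise unexplained.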