# Linear Regression Example

We will walk through the linear regression example from the official Spark documentation.

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('lr_example').getOrCreate()

In [3]:
from pyspark.ml.regression import LinearRegression

In [10]:
# Load training data
training = spark.read.format("libsvm").load("file:///home/erin/Downloads/spark-3.0.1-bin-hadoop2.7/SparkFolder/spark/Data/sample_linear_regression_data.txt")

We load the data with the libsvm reader because the sample file from the Spark documentation is stored in that format.
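To make the format concrete: each line of a libsvm file holds a label followed by sparse `index:value` pairs. Here is a minimal plain-Python sketch of parsing one such line. The helper `parse_libsvm_line` and the sample line are made up for illustration, and the sketch assumes the usual libsvm convention of one-based indices, which get shifted to zero-based as in the `features` column Spark displays.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, indices, values).

    Hypothetical helper for illustration only. Indices in the file are
    assumed to be one-based and are shifted to zero-based here.
    """
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        idx, val = item.split(":")
        indices.append(int(idx) - 1)  # one-based -> zero-based
        values.append(float(val))
    return label, indices, values


# Made-up sample line, not taken from the data file
label, indices, values = parse_libsvm_line("-1.5 1:0.25 3:-0.5 10:1.0")
print(label, indices, values)
```

Only the non-zero entries are stored, which is why the features show up as sparse vectors in the output.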

In [11]:
training.show()

+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
| -9.490009878824548|(10,[0,1,2,3,4,5,...|
| 0.2577820163584905|(10,[0,1,2,3,4,5,...|
| -4.438869807456516|(10,[0,1,2,3,4,5,...|
|-19.782762789614537|(10,[0,1,2,3,4,5,...|
| -7.966593841555266|(10,[0,1,2,3,4,5,...|
| -7.896274316726144|(10,[0,1,2,3,4,5,...|
| -8.464803554195287|(10,[0,1,2,3,4,5,...|
| 2.1214592666251364|(10,[0,1,2,3,4,5,...|
| 1.0720117616524107|(10,[0,1,2,3,4,5,...|
|-13.772441561702871|(10,[0,1,2,3,4,5,...|
| -5.082010756207233|(10,[0,1,2,3,4,5,...|
|  7.887786536531237|(10,[0,1,2,3,4,5,...|
| 14.323146365332388|(10,[0,1,2,3,4,5,...|
|-20.057482615789212|(10,[0,1,2,3,4,5,...|
|-0.8995693247765151|(10,[0,1,2,3,4,5,...|
| -19.16829262296376|(10,[0,1,2,3,4,5,...|
|  5.601801561245534|(10,[0,1,2,3,4,5,...|
|-3.2256352187273354|(10,[0,1,2,3,4,5,...|
| 1.5299675726687754|(10,[0,1,2,3,4,5,...|
| -0.250102447941961|(10,[0,1,2,3,4,5,...|
+-------------------+--------------------+

This is the format that Spark expects: two columns named "label" and "features".

The "label" column holds the numeric target: either a continuous value for regression, or a numeric code that maps to a class for classification.

The "features" column holds a vector of all the features that belong to that row. In practice, we usually end up combining the various feature columns we have into a single "features" column using Spark's data transformations.
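As a rough picture of what that combining step does, here is a plain-Python sketch that collapses several numeric columns of one row into a single feature list. In Spark itself this job is typically handled by a transformer such as `VectorAssembler` from `pyspark.ml.feature`. The function `assemble_features` and the column names below are hypothetical, chosen only for illustration.

```python
def assemble_features(row, input_cols, label_col):
    """Collapse several numeric columns of one row into a single feature list.

    Hypothetical helper that mimics the idea of assembling columns;
    it is not Spark code.
    """
    return {
        "label": float(row[label_col]),
        "features": [float(row[c]) for c in input_cols],
    }


# Made-up raw row with separate feature columns
raw_row = {"sqft": 1200, "bedrooms": 3, "age": 15, "price": 250000}
assembled = assemble_features(raw_row, ["sqft", "bedrooms", "age"], "price")
print(assembled)
```

The result has exactly the two fields the model needs, regardless of how many raw columns went in.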

In [12]:
# These are the default values for featuresCol, labelCol, and predictionCol
lr = LinearRegression(featuresCol='features', labelCol='label', predictionCol='prediction')

In [13]:
# You could also pass in additional parameters for regularization.
# Read ISLR to fully understand them.
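For a feel of what those regularization parameters control, here is a small plain-Python sketch of an elastic-net penalty. It assumes the form described in Spark's documentation, where `regParam` scales the whole penalty and `elasticNetParam` mixes the L1 and L2 parts. The function name `elastic_net_penalty` and the weights are made up for illustration.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss, assuming the elastic-net form
    reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2),
    with alpha = elastic_net_param.
    """
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (
        elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    )


weights = [1.0, -2.0]  # made-up coefficients

# elastic_net_param = 1.0 keeps only the L1 (lasso) part
lasso_penalty = elastic_net_penalty(weights, reg_param=0.5, elastic_net_param=1.0)

# elastic_net_param = 0.0 keeps only the L2 (ridge) part
ridge_penalty = elastic_net_penalty(weights, reg_param=0.5, elastic_net_param=0.0)

print(lasso_penalty, ridge_penalty)
```

Larger `reg_param` values penalize large coefficients more heavily, and values of `elastic_net_param` between 0 and 1 blend the two behaviours.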


In [14]:
# Fit the model
lrModel = lr.fit(training)

In [15]:
# Print the coefficients and intercept for linear regression
print("coefficients: {}".format(lrModel.coefficients))  # one coefficient per feature
print('\n')
print("Intercept:{}".format(lrModel.intercept))

coefficients: [0.0073350710225801715,0.8313757584337543,-0.8095307954684084,2.441191686884721,0.5191713795290003,1.1534591903547016,-0.2989124112808717,-0.5128514186201779,-0.619712827067017,0.6956151804322931]


Intercept:0.14228558260358093
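These numbers define the fitted line: a prediction is the intercept plus each coefficient times its feature value. Here is a minimal plain-Python sketch of that calculation for one sparse row. The `predict` helper and the toy numbers are made up for illustration and are not the model's actual values.

```python
def predict(coefficients, intercept, indices, values):
    """Linear prediction for one sparse row.

    Only the non-zero features (given by indices/values) contribute.
    """
    return intercept + sum(coefficients[i] * v for i, v in zip(indices, values))


# Made-up toy model and row
coefficients = [0.5, -1.0, 2.0]
intercept = 0.25
prediction = predict(coefficients, intercept, indices=[0, 2], values=[2.0, 1.5])
print(prediction)  # 0.25 + 0.5*2.0 + 2.0*1.5
```

With ten features, as in this data set, the same sum simply runs over ten coefficients.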


In [16]:
# Summarize the model over the training set and print out some metrics
trainingSummary = lrModel.summary

In [17]:
trainingSummary.residuals.show()
print("RMSE: {}".format(trainingSummary.rootMeanSquaredError))
print("r2: {}".format(trainingSummary.r2))

+-------------------+
|          residuals|
+-------------------+
|-11.011130022096554|
| 0.9236590911176538|
|-4.5957401897776675|
|  -20.4201774575836|
|-10.339160314788181|
|-5.9552091439610555|
|-10.726906349283922|
|  2.122807193191233|
|  4.077122222293811|
|-17.316168071241652|
| -4.593044343959059|
|  6.380476690746936|
| 11.320566035059846|
|-20.721971774534094|
| -2.736692773777401|
| -16.66886934252847|
|  8.242186378876315|
|-1.3723486332690233|
|-0.7060332131264666|
|-1.1591135969994064|
+-------------------+
only showing top 20 rows

RMSE: 10.16309157133015
r2: 0.027839179518600154
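For reference, here is how these two metrics are usually defined, sketched in plain Python on made-up numbers. The function names `rmse` and `r2` are illustrative, and this is the standard textbook definition rather than Spark's internal code.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: sqrt of the mean squared residual."""
    n = len(labels)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, predictions)) / n)


def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


# Made-up labels and predictions
labels = [1.0, 2.0, 3.0, 4.0]
predictions = [1.5, 1.5, 3.5, 3.5]

rmse_value = rmse(labels, predictions)
r2_value = r2(labels, predictions)
print(rmse_value, r2_value)
```

Read this way, the r2 of roughly 0.028 above says the model explains only about 3% of the variance in the label, so it is not a strong fit for this sample data.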


### Train/Test Splits
But wait! We've committed a big mistake: we never separated our data set into a training set and a test set. Instead we trained on ALL of the data, something we generally want to avoid doing. Read ISLR and check out the theory lecture for more info on this, but remember that we won't get a fair evaluation of our model by judging how well it does on the same data it was trained on!

Luckily, Spark DataFrames have an almost too convenient method for splitting the data! Let's see it:

In [19]:
all_data = spark.read.format("libsvm").load("file:///home/erin/Downloads/spark-3.0.1-bin-hadoop2.7/SparkFolder/spark/Data/sample_linear_regression_data.txt")

In [20]:
# Pass in the split between training/test as a list.
# There is no single correct split, but 70/30 or 60/40 splits are generally used,
# depending on how much data you have and how unbalanced it is.
train_data, test_data = all_data.randomSplit([0.7, 0.3])
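To see why the two pieces come out only approximately 70/30, here is a plain-Python sketch of a weighted random split. It illustrates the idea only and is not Spark's implementation. The `random_split` helper is hypothetical: each row gets an independent random draw and lands in the split whose cumulative weight range contains it. Spark's `randomSplit` also accepts an optional `seed` argument if you want reproducible splits.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split using independent random draws.

    Hypothetical helper. Weights are normalized so they do not need
    to sum to 1.
    """
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any floating-point leftovers
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Because every row is assigned independently, the split sizes vary a little from run to run unless you fix the seed.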

In [21]:
train_data.show()

+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
|-28.571478869743427|(10,[0,1,2,3,4,5,...|
|-28.046018037776633|(10,[0,1,2,3,4,5,...|
|-26.805483428483072|(10,[0,1,2,3,4,5,...|
|-26.736207182601724|(10,[0,1,2,3,4,5,...|
|-22.837460416919342|(10,[0,1,2,3,4,5,...|
|-21.432387764165806|(10,[0,1,2,3,4,5,...|
|-20.212077258958672|(10,[0,1,2,3,4,5,...|
|-20.057482615789212|(10,[0,1,2,3,4,5,...|
|-19.884560774273424|(10,[0,1,2,3,4,5,...|
| -19.16829262296376|(10,[0,1,2,3,4,5,...|
|-18.845922472898582|(10,[0,1,2,3,4,5,...|
| -18.27521356600463|(10,[0,1,2,3,4,5,...|
|-17.803626188664516|(10,[0,1,2,3,4,5,...|
|-17.494200356883344|(10,[0,1,2,3,4,5,...|
|-17.428674570939506|(10,[0,1,2,3,4,5,...|
| -17.32672073267595|(10,[0,1,2,3,4,5,...|
|-17.026492264209548|(10,[0,1,2,3,4,5,...|
|-16.692207021311106|(10,[0,1,2,3,4,5,...|
|-16.151349351277112|(10,[0,1,2,3,4,5,...|
|-15.951512565794573|(10,[0,1,2,3,4,5,...|
+-------------------+--------------------+

In [22]:
test_data.show()

+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
| -23.51088409032297|(10,[0,1,2,3,4,5,...|
|-23.487440120936512|(10,[0,1,2,3,4,5,...|
|-22.949825936196074|(10,[0,1,2,3,4,5,...|
|-19.872991038068406|(10,[0,1,2,3,4,5,...|
|-19.782762789614537|(10,[0,1,2,3,4,5,...|
| -19.66731861537172|(10,[0,1,2,3,4,5,...|
|-19.402336030214553|(10,[0,1,2,3,4,5,...|
|-17.065399625876015|(10,[0,1,2,3,4,5,...|
| -16.71909683360509|(10,[0,1,2,3,4,5,...|
| -16.26143027545273|(10,[0,1,2,3,4,5,...|
| -16.08565904102149|(10,[0,1,2,3,4,5,...|
|-15.780685032623301|(10,[0,1,2,3,4,5,...|
| -15.72351561304857|(10,[0,1,2,3,4,5,...|
|-15.359544879832677|(10,[0,1,2,3,4,5,...|
|-15.310980589416289|(10,[0,1,2,3,4,5,...|
|-15.056482974542433|(10,[0,1,2,3,4,5,...|
|-14.762758252931127|(10,[0,1,2,3,4,5,...|
|-13.976130931152703|(10,[0,1,2,3,4,5,...|
|-13.772441561702871|(10,[0,1,2,3,4,5,...|
| -13.15333560636553|(10,[0,1,2,3,4,5,...|
+-------------------+--------------------+

In [23]:
unlabeled_data = test_data.select('features')

In [24]:
unlabeled_data.show()

+--------------------+
|            features|
+--------------------+
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
+--------------------+
only showing top 20 rows



In [25]:
correct_model = lr.fit(train_data)

In [26]:
test_results = correct_model.evaluate(test_data)

In [27]:
test_results.residuals.show()
print("RMSE: {}".format(test_results.rootMeanSquaredError))

+-------------------+
|          residuals|
+-------------------+
|-20.918440820585737|
| -20.52596929525588|
|-26.741084239283406|
| -19.49734786204372|
|-20.054444022555153|
|-19.498729359123526|
|-20.759862379583343|
|  -18.3305280315326|
|-18.621427393487174|
|-18.881232365094867|
|-14.101946647216732|
|-18.387384988036324|
| -18.40863858912507|
|-16.948843993698567|
| -13.85024611639944|
|-17.132141220354008|
| -16.84671658853824|
|-14.297789492863492|
| -19.24217590865316|
| -12.86162428590023|
+-------------------+
only showing top 20 rows

RMSE: 10.974553801626298
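A quick sanity check on what a residual means: assuming a residual is the label minus the prediction, we can recover the prediction the model made for a row from its label and its residual. This sketch uses the first row of the `test_data` output and the first row of the residuals output shown above. The variable names are just for illustration.

```python
# Values copied from the first rows of the test_data and residuals outputs
label = -23.51088409032297
residual = -20.918440820585737

# If residual = label - prediction, then prediction = label - residual
implied_prediction = label - residual
print(implied_prediction)
```

A large negative residual like this one means the model predicted a value well above the true label for that row.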


Well, that is nice, but realistically we will eventually want to run this model on unlabeled data; after all, that is the whole point of building the model in the first place. We can again do this with a convenient method call, in this case transform(), which was actually being called within the evaluate() method. Let's see it in action:

In [28]:
predictions = correct_model.transform(unlabeled_data)

In [29]:
predictions.show()

+--------------------+--------------------+
|            features|          prediction|
+--------------------+--------------------+
|(10,[0,1,2,3,4,5,...|  -2.592443269737236|
|(10,[0,1,2,3,4,5,...|  -2.961470825680633|
|(10,[0,1,2,3,4,5,...|  3.7912583030873335|
|(10,[0,1,2,3,4,5,...| -0.3756431760246853|
|(10,[0,1,2,3,4,5,...|  0.2716812329406135|
|(10,[0,1,2,3,4,5,...|-0.16858925624819415|
|(10,[0,1,2,3,4,5,...|  1.3575263493687884|
|(10,[0,1,2,3,4,5,...|  1.2651284056565852|
|(10,[0,1,2,3,4,5,...|  1.9023305598820865|
|(10,[0,1,2,3,4,5,...|  2.6198020896421363|
|(10,[0,1,2,3,4,5,...|  -1.983712393804759|
|(10,[0,1,2,3,4,5,...|   2.606699955413024|
|(10,[0,1,2,3,4,5,...|  2.6851229760765007|
|(10,[0,1,2,3,4,5,...|  1.5892991138658892|
|(10,[0,1,2,3,4,5,...| -1.4607344730168492|
|(10,[0,1,2,3,4,5,...|  2.0756582458115758|
|(10,[0,1,2,3,4,5,...|   2.083958335607112|
|(10,[0,1,2,3,4,5,...| 0.32165856171078894|
|(10,[0,1,2,3,4,5,...|   5.469734346950286|
|(10,[0,1,2,3,4,5,...| -0.291711