# Linear Regression Code Along

This notebook accompanies the video lecture for the Linear Regression Code Along. We examine a dataset of Ecommerce customer data for a company's website and mobile app, then build a regression model that predicts each customer's yearly spend on the company's product.

The first step is to start a Spark session.

In [53]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('lr_example').getOrCreate()

In [54]:
from pyspark.ml.regression import LinearRegression

In [55]:
# Use Spark to read in the Ecommerce Customers csv file.
data = spark.read.csv("/dataset/linear_reg/Ecommerce_Customers.csv",inferSchema=True,header=True)

In [56]:
# Print the Schema of the DataFrame
data.printSchema()

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)



In [57]:
data.show(10)

+--------------------+--------------------+----------------+------------------+------------------+------------------+--------------------+-------------------+
|               Email|             Address|          Avatar|Avg Session Length|       Time on App|   Time on Website|Length of Membership|Yearly Amount Spent|
+--------------------+--------------------+----------------+------------------+------------------+------------------+--------------------+-------------------+
|mstephenson@ferna...|835 Frank TunnelW...|          Violet| 34.49726772511229| 12.65565114916675| 39.57766801952616|  4.0826206329529615|  587.9510539684005|
|   hduke@hotmail.com|4547 Archer Commo...|       DarkGreen| 31.92627202636016|11.109460728682564|37.268958868297744|    2.66403418213262|  392.2049334443264|
|    pallen@yahoo.com|24645 Valerie Uni...|          Bisque|33.000914755642675|11.330278057777512|37.110597442120856|   4.104543202376424| 487.54750486747207|
|riverarebecca@gma...|1414 David Throug...|   

In [58]:
type(data.collect()), type(data.head())

(list, pyspark.sql.types.Row)

In [59]:
for item in data.head(2)[1]:
    print(item)

hduke@hotmail.com
4547 Archer CommonDiazchester, CA 06566-8576
DarkGreen
31.92627202636016
11.109460728682564
37.268958868297744
2.66403418213262
392.2049334443264


## Setting Up DataFrame for Machine Learning 

In [60]:
# A few things we need to do before Spark can accept the data!
# Spark ML expects the data in the form of two columns, ("label", "features"),
# where "features" is a single vector column holding all the input values.

# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [61]:
data.columns

['Email',
 'Address',
 'Avatar',
 'Avg Session Length',
 'Time on App',
 'Time on Website',
 'Length of Membership',
 'Yearly Amount Spent']

In [62]:
assembler = VectorAssembler(
    inputCols=["Avg Session Length", "Time on App", 
               "Time on Website",'Length of Membership'],
    outputCol="features")

In [63]:
output = assembler.transform(data)

In [64]:
output.select("features").show(10)

+--------------------+
|            features|
+--------------------+
|[34.4972677251122...|
|[31.9262720263601...|
|[33.0009147556426...|
|[34.3055566297555...|
|[33.3306725236463...|
|[33.8710378793419...|
|[32.0215955013870...|
|[32.7391429383803...|
|[33.9877728956856...|
|[31.9365486184489...|
+--------------------+
only showing top 10 rows



In [66]:
output.collect()[0]

Row(Email='mstephenson@fernandez.com', Address='835 Frank TunnelWrightmouth, MI 82180-9605', Avatar='Violet', Avg Session Length=34.49726772511229, Time on App=12.65565114916675, Time on Website=39.57766801952616, Length of Membership=4.0826206329529615, Yearly Amount Spent=587.9510539684005, features=DenseVector([34.4973, 12.6557, 39.5777, 4.0826]))
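
The `features` entry in the row above is simply the four input columns packed into one vector, in the order given to `inputCols`. A minimal plain-Python sketch of that idea follows; `assemble_features` is a hypothetical helper written for illustration, not part of PySpark, and the row values are copied from the output above.

```python
# Sketch only: mimics what the assembled "features" value holds for one row.
# `assemble_features` is a hypothetical helper, not a PySpark function.

INPUT_COLS = ["Avg Session Length", "Time on App",
              "Time on Website", "Length of Membership"]

# First row of the dataset, copied from output.collect()[0] above.
row = {
    "Email": "mstephenson@fernandez.com",
    "Avatar": "Violet",
    "Avg Session Length": 34.49726772511229,
    "Time on App": 12.65565114916675,
    "Time on Website": 39.57766801952616,
    "Length of Membership": 4.0826206329529615,
    "Yearly Amount Spent": 587.9510539684005,
}


def assemble_features(row, input_cols):
    """Return the values of the input columns as one list, in the given order."""
    return [float(row[col]) for col in input_cols]


features = assemble_features(row, INPUT_COLS)

# Rounded to four decimals for comparison with the DenseVector display above.
print([round(x, 4) for x in features])
```

The string columns (Email, Address, Avatar) are left out because only numeric inputs were listed in `inputCols`.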

In [67]:
output.show(5)

+--------------------+--------------------+----------------+------------------+------------------+------------------+--------------------+-------------------+--------------------+
|               Email|             Address|          Avatar|Avg Session Length|       Time on App|   Time on Website|Length of Membership|Yearly Amount Spent|            features|
+--------------------+--------------------+----------------+------------------+------------------+------------------+--------------------+-------------------+--------------------+
|mstephenson@ferna...|835 Frank TunnelW...|          Violet| 34.49726772511229| 12.65565114916675| 39.57766801952616|  4.0826206329529615|  587.9510539684005|[34.4972677251122...|
|   hduke@hotmail.com|4547 Archer Commo...|       DarkGreen| 31.92627202636016|11.109460728682564|37.268958868297744|    2.66403418213262|  392.2049334443264|[31.9262720263601...|
|    pallen@yahoo.com|24645 Valerie Uni...|          Bisque|33.000914755642675|11.330278057777512|37

In [68]:
final_data = output.select("features",'Yearly Amount Spent')

In [69]:
final_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                500|
|   mean|  499.3140382585909|
| stddev|   79.3147815497068|
|    min| 256.67058229005585|
|    max|  765.5184619388373|
+-------+-------------------+



In [70]:
final_data.show(10)

+--------------------+-------------------+
|            features|Yearly Amount Spent|
+--------------------+-------------------+
|[34.4972677251122...|  587.9510539684005|
|[31.9262720263601...|  392.2049334443264|
|[33.0009147556426...| 487.54750486747207|
|[34.3055566297555...|  581.8523440352177|
|[33.3306725236463...|  599.4060920457634|
|[33.8710378793419...|   637.102447915074|
|[32.0215955013870...|  521.5721747578274|
|[32.7391429383803...|  549.9041461052942|
|[33.9877728956856...|  570.2004089636196|
|[31.9365486184489...|  427.1993848953282|
+--------------------+-------------------+
only showing top 10 rows



In [79]:
train_data,test_data = final_data.randomSplit([0.7,0.3])
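
Note that `randomSplit` assigns rows at random, so the split sizes are only approximately 70/30 and change from run to run unless its optional `seed` argument is passed. The plain-Python sketch below illustrates the idea of a seeded, per-row random assignment. It is an assumption-laden illustration, not Spark's actual implementation; `random_split` and the seed value 42 are made up for this example.

```python
import random


def random_split(rows, train_weight, seed):
    """Assign each row to train with probability train_weight, else to test.

    Hypothetical helper for illustration; not how Spark is implemented.
    """
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        if rng.random() < train_weight:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(500))  # stand-in for the 500 customers in the dataset
train_rows, test_rows = random_split(rows, 0.7, seed=42)

# The sizes are close to 350/150 but usually not exact.
print(len(train_rows), len(test_rows))

# Re-running with the same seed reproduces the same split.
print(random_split(rows, 0.7, seed=42) == (train_rows, test_rows))
```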

In [80]:
train_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                364|
|   mean|  500.1237159630984|
| stddev|  79.42539036926115|
|    min| 256.67058229005585|
|    max|  765.5184619388373|
+-------+-------------------+



In [81]:
test_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                136|
|   mean|  497.1469596965265|
| stddev|  79.26993766503048|
|    min|   266.086340948469|
|    max|  712.3963268096637|
+-------+-------------------+



In [82]:
# Create a Linear Regression Model object
lr = LinearRegression(labelCol='Yearly Amount Spent', featuresCol='features', predictionCol='prediction')

In [83]:
# Fit the model to the training data and call this model lrModel
lrModel = lr.fit(train_data)

In [84]:
# Print the coefficients and intercept for linear regression
print("Coefficients: {} Intercept: {}".format(lrModel.coefficients,lrModel.intercept))

Coefficients: [25.541345628115728,38.819402785727355,0.18585604213245885,61.052234503080506] Intercept: -1035.7553381378154
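
To make the printed numbers concrete: a linear regression prediction is the dot product of the coefficients with the feature vector, plus the intercept. The plain-Python sketch below applies the coefficients printed above to the first row of the dataset. The `predict` helper is hypothetical, and this row may or may not have landed in the training split, so treat the comparison as illustrative only.

```python
# Coefficients and intercept copied from the output above.
coefficients = [25.541345628115728, 38.819402785727355,
                0.18585604213245885, 61.052234503080506]
intercept = -1035.7553381378154

# First row of the dataset: Avg Session Length, Time on App,
# Time on Website, Length of Membership.
features = [34.49726772511229, 12.65565114916675,
            39.57766801952616, 4.0826206329529615]
actual_spend = 587.9510539684005


def predict(features, coefficients, intercept):
    """Dot product of coefficients and features, plus the intercept."""
    return sum(c * x for c, x in zip(coefficients, features)) + intercept


prediction = predict(features, coefficients, intercept)
print("prediction:", prediction)
print("residual (actual - prediction):", actual_spend - prediction)
```

The coefficients are in the same order as `inputCols`, so in this fit "Time on Website" (about 0.19) contributes far less per unit than "Time on App" (about 38.8) or "Length of Membership" (about 61.1).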


In [91]:
# Residuals (label minus prediction) from the training summary
lrModel.summary.residuals.show(5)

+-------------------+
|          residuals|
+-------------------+
|-12.754498888490957|
| -6.602582252704963|
|0.38541534528843613|
| 10.173240776771195|
| -4.792826261087782|
+-------------------+
only showing top 5 rows



In [95]:
# Root mean squared error on the training data
lrModel.summary.rootMeanSquaredError

9.801351512413538

In [96]:
lrModel.summary.r2

0.9847296858295227
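
For reference, these two summary values follow the standard definitions below, where $y_i$ is the actual label, $\hat{y}_i$ the prediction, $\bar{y}$ the mean label, and $n$ the number of rows. This is the usual textbook form, stated here as an aid to reading the numbers rather than as a description of Spark's internals.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

An $R^2$ of about 0.985 therefore means the model's squared error on the training data is roughly 1.5% of the squared error obtained by always predicting the mean.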

In [97]:
test_results = lrModel.evaluate(test_data)

In [99]:
test_results.predictions.show()

+--------------------+-------------------+------------------+
|            features|Yearly Amount Spent|        prediction|
+--------------------+-------------------+------------------+
|[29.5324289670579...|  408.6403510726275|398.05164408575206|
|[30.8162006488763...|   266.086340948469| 284.4752135755966|
|[31.1239743499119...|  486.9470538397658| 508.4741550753963|
|[31.1695067987115...|  427.3565308022928|418.47482818962703|
|[31.3091926408918...|  432.7207178399336| 430.1368234083459|
|[31.3895854806643...|  410.0696110599829| 409.1820463547906|
|[31.4252268808548...|  530.7667186547619|  534.557598833363|
|[31.5257524169682...|  443.9656268098819| 449.2374574621401|
|[31.5316044825729...| 436.51560572936256| 433.2223258891338|
|[31.5702008293202...|  545.9454921414049| 563.4317074682631|
|[31.6253601348306...|  376.3369007569242| 382.0659059436864|
|[31.7216523605090...| 347.77692663187264|350.40057252484917|
|[31.8124825597242...|  392.8103449837972|  396.898417857876|
|[31.909

In [100]:
# Residuals (label minus prediction) on the test set
test_results.residuals.show()

+-------------------+
|          residuals|
+-------------------+
| 10.588706986875422|
|-18.388872627127625|
| -21.52710123563054|
|  8.881702612665777|
|  2.583894431587737|
| 0.8875647051922897|
| -3.790880178601128|
| -5.271830652258245|
|  3.293279840228763|
|  -17.4862153268582|
| -5.729005186762208|
|-2.6236458929765263|
| -4.088072874078762|
| 12.488403953228385|
| 11.685791579243698|
|-4.3399589170556965|
| 23.088522389861566|
|  19.62273268028366|
| -4.989802860777729|
|  4.294179396750906|
+-------------------+
only showing top 20 rows
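
Each residual is the actual label minus the prediction. As a quick sanity check, the plain-Python sketch below recomputes the first three residuals from the first three rows of `test_results.predictions` shown earlier; all values are copied from the outputs in this notebook.

```python
import math

# (Yearly Amount Spent, prediction) for the first three test rows shown above.
rows = [
    (408.6403510726275, 398.05164408575206),
    (266.086340948469, 284.4752135755966),
    (486.9470538397658, 508.4741550753963),
]

# First three values printed by test_results.residuals.show().
reported = [10.588706986875422, -18.388872627127625, -21.52710123563054]

# Residual = label - prediction.
residuals = [label - prediction for label, prediction in rows]

for computed, expected in zip(residuals, reported):
    print(computed, math.isclose(computed, expected, abs_tol=1e-9))
```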



In [101]:
unlabeled_data = test_data.select('features')

In [102]:
predictions = lrModel.transform(unlabeled_data)

In [103]:
predictions.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[29.5324289670579...|398.05164408575206|
|[30.8162006488763...| 284.4752135755966|
|[31.1239743499119...| 508.4741550753963|
|[31.1695067987115...|418.47482818962703|
|[31.3091926408918...| 430.1368234083459|
|[31.3895854806643...| 409.1820463547906|
|[31.4252268808548...|  534.557598833363|
|[31.5257524169682...| 449.2374574621401|
|[31.5316044825729...| 433.2223258891338|
|[31.5702008293202...| 563.4317074682631|
|[31.6253601348306...| 382.0659059436864|
|[31.7216523605090...|350.40057252484917|
|[31.8124825597242...|  396.898417857876|
|[31.9096268275227...| 550.9576317200108|
|[31.9262720263601...| 380.5191418650827|
|[31.9453957483445...| 661.3598828547076|
|[32.0498393904573...| 455.6308344843537|
|[32.1253868972878...| 438.2249632646649|
|[32.1878120459321...| 457.3054783408131|
|[32.1898447292735...| 529.1023743900919|
+--------------------+------------------+

In [104]:
print("RMSE: {}".format(test_results.rootMeanSquaredError))
print("MSE: {}".format(test_results.meanSquaredError))

RMSE: 10.333181636468845
MSE: 106.77464273225696
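
The two metrics are related: MSE is the mean of the squared residuals and RMSE is its square root. The plain-Python sketch below checks that relationship on the reported values and also computes both metrics from the 20 test residuals displayed earlier. Because it uses only those 20 rows rather than all 136, its result is not expected to match the full test-set values.

```python
import math

# Values reported above for the full test set.
reported_rmse = 10.333181636468845
reported_mse = 106.77464273225696

# RMSE squared should reproduce MSE (up to floating-point rounding).
print(math.isclose(reported_rmse ** 2, reported_mse, rel_tol=1e-6))

# The 20 test residuals displayed by test_results.residuals.show().
shown_residuals = [
    10.588706986875422, -18.388872627127625, -21.52710123563054,
    8.881702612665777, 2.583894431587737, 0.8875647051922897,
    -3.790880178601128, -5.271830652258245, 3.293279840228763,
    -17.4862153268582, -5.729005186762208, -2.6236458929765263,
    -4.088072874078762, 12.488403953228385, 11.685791579243698,
    -4.3399589170556965, 23.088522389861566, 19.62273268028366,
    -4.989802860777729, 4.294179396750906,
]

# Mean of squared residuals, then its square root, over these 20 rows only.
mse_20 = sum(r * r for r in shown_residuals) / len(shown_residuals)
rmse_20 = math.sqrt(mse_20)
print("MSE over the 20 shown rows:", mse_20)
print("RMSE over the 20 shown rows:", rmse_20)
```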


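Excellent results! The test RMSE of about 10.3 is small compared with the mean yearly spend (about 499) and its standard deviation (about 79), and it is close to the training RMSE of about 9.8. Let's see how you handle more realistic data in the Consulting Project!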