# Linear Regression Code Along

This notebook accompanies the video lecture on the Linear Regression Code Along. We examine a dataset of Ecommerce customer data for a company's website and mobile app, then build a regression model that predicts each customer's yearly spend on the company's product.

The first step is to start a Spark session:

In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('lr_example').getOrCreate()

In [0]:
from pyspark.ml.regression import LinearRegression

In [0]:
# Use Spark to read in the Ecommerce Customers csv file.
data = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/shared_uploads/arunyuvi0206@gmail.com/Ecommerce_Customers-3.csv")

In [0]:
# Print the Schema of the DataFrame
data.printSchema()

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: string (nullable = true)
 |-- Time on App: string (nullable = true)
 |-- Time on Website: string (nullable = true)
 |-- Length of Membership: string (nullable = true)
 |-- Yearly Amount Spent: string (nullable = true)



In [0]:
data.show()

+--------------------+--------------------+----------------+------------------+------------------+------------------+--------------------+-------------------+
|               Email|             Address|          Avatar|Avg Session Length|       Time on App|   Time on Website|Length of Membership|Yearly Amount Spent|
+--------------------+--------------------+----------------+------------------+------------------+------------------+--------------------+-------------------+
|mstephenson@ferna...|835 Frank TunnelW...|          Violet| 34.49726772511229| 12.65565114916675| 39.57766801952616|  4.0826206329529615|  587.9510539684005|
|   hduke@hotmail.com|4547 Archer Commo...|       DarkGreen| 31.92627202636016|11.109460728682564|37.268958868297744|    2.66403418213262|  392.2049334443264|
|    pallen@yahoo.com|24645 Valerie Uni...|          Bisque|33.000914755642675|11.330278057777512|37.110597442120856|   4.104543202376424| 487.54750486747207|
|riverarebecca@gma...|1414 David Throug...|   

In [0]:
data.head()

Out[43]: Row(Email='mstephenson@fernandez.com', Address='835 Frank TunnelWrightmouth, MI 82180-9605', Avatar='Violet', Avg Session Length='34.49726772511229', Time on App='12.65565114916675', Time on Website='39.57766801952616', Length of Membership='4.0826206329529615', Yearly Amount Spent='587.9510539684005')

In [0]:
for item in data.head():
    print(item)

mstephenson@fernandez.com
835 Frank TunnelWrightmouth, MI 82180-9605
Violet
34.49726772511229
12.65565114916675
39.57766801952616
4.0826206329529615
587.9510539684005


## Setting Up DataFrame for Machine Learning 

In [0]:
# A few things we need to do before Spark can accept the data!
# It needs to be in the form of two columns
# ("label","features")

# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [0]:
data.columns

Out[46]: ['Email',
 'Address',
 'Avatar',
 'Avg Session Length',
 'Time on App',
 'Time on Website',
 'Length of Membership',
 'Yearly Amount Spent']

In [0]:
assembler = VectorAssembler(
    inputCols=["Avg Session Length", "Time on App", 
               "Time on Website",'Length of Membership'],
    outputCol="features")

In [0]:
# The CSV was read without schema inference, so every column is a string.
# Cast the four feature columns to float before assembling them.
data = data.withColumn("Avg Session Length", data["Avg Session Length"].cast("float"))
data = data.withColumn("Time on App", data["Time on App"].cast("float"))
data = data.withColumn("Time on Website", data["Time on Website"].cast("float"))
data = data.withColumn("Length of Membership", data["Length of Membership"].cast("float"))
# Cast the label column, Yearly Amount Spent, to float as well
data = data.withColumn("Yearly Amount Spent", data["Yearly Amount Spent"].cast("float"))



In [0]:
output = assembler.transform(data)

In [0]:
output.select("features").show()

+--------------------+
|            features|
+--------------------+
|[34.4972686767578...|
|[31.9262714385986...|
|[33.0009155273437...|
|[34.3055572509765...|
|[33.3306732177734...|
|[33.8710365295410...|
|[32.0215950012207...|
|[32.7391433715820...|
|[33.9877738952636...|
|[31.9365482330322...|
|[33.9925727844238...|
|[33.8793601989746...|
|[29.5324287414550...|
|[33.1903343200683...|
|[32.3879776000976...|
|[30.7377204895019...|
|[32.1253852844238...|
|[32.3388977050781...|
|[32.1878128051757...|
|[32.6178550720214...|
+--------------------+
only showing top 20 rows
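
The assembled values above (for example `34.4972686767578...`) differ slightly from the CSV text (`34.49726772511229`) because the columns were cast to `float`, which is a 32-bit type in Spark. The sketch below reproduces that rounding in plain Python with the standard `struct` module. It only illustrates the effect and is not what Spark runs internally; `to_float32` is a hypothetical helper. Casting to `"double"` instead would likely preserve the original digits.

```python
import struct


def to_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


# "Avg Session Length" of the first row, as stored in the CSV
original = 34.49726772511229
as_float32 = to_float32(original)

print(original)
print(as_float32)                   # compare with the first entry of the features column
print(abs(original - as_float32))   # size of the rounding error
```

The error is tiny for this dataset, but it is avoidable, so a 64-bit cast is usually the safer default for model inputs.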



In [0]:
output.show()

+--------------------+--------------------+----------------+------------------+-----------+---------------+--------------------+-------------------+--------------------+
|               Email|             Address|          Avatar|Avg Session Length|Time on App|Time on Website|Length of Membership|Yearly Amount Spent|            features|
+--------------------+--------------------+----------------+------------------+-----------+---------------+--------------------+-------------------+--------------------+
|mstephenson@ferna...|835 Frank TunnelW...|          Violet|          34.49727|  12.655651|      39.577667|           4.0826206|          587.95105|[34.4972686767578...|
|   hduke@hotmail.com|4547 Archer Commo...|       DarkGreen|         31.926271|  11.109461|       37.26896|           2.6640341|          392.20493|[31.9262714385986...|
|    pallen@yahoo.com|24645 Valerie Uni...|          Bisque|         33.000916|  11.330278|      37.110596|            4.104543|          487.54752|[3

In [0]:
final_data = output.select("features",'Yearly Amount Spent')

In [0]:
train_data,test_data = final_data.randomSplit([0.7,0.3])

In [0]:
train_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                331|
|   mean|  498.6559886183263|
| stddev|  79.99649311217311|
|    min|           256.6706|
|    max|          744.22186|
+-------+-------------------+



In [0]:
test_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                169|
|   mean|  500.6028807149131|
| stddev|     78.18200501426|
|    min|          266.08633|
|    max|          765.51843|
+-------+-------------------+
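
The two counts (331 and 169) add up to 500 rows but are not an exact 350/150 split, because `randomSplit` assigns rows at random according to the weights rather than cutting at a fixed index. No seed is passed in the call above, so the counts will change from run to run; `randomSplit` accepts an optional `seed` argument for reproducibility. The plain-Python sketch below imitates that behaviour under the assumption that each row is assigned independently. `random_split` is a hypothetical helper, not Spark's implementation.

```python
import random


def random_split(rows, train_weight=0.7, seed=None):
    """Assign each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(500))  # stand-in for the 500 customers

train, test = random_split(rows, 0.7, seed=42)
print(len(train), len(test))    # close to 350/150, but rarely exact

# The same seed reproduces the same split
train_again, _ = random_split(rows, 0.7, seed=42)
print(train == train_again)
```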



In [0]:
# Create a Linear Regression Model object
lr = LinearRegression(labelCol='Yearly Amount Spent')

In [0]:
# Fit the model to the data and call this model lrModel
lrModel = lr.fit(train_data)

In [0]:
# Print the coefficients and intercept for linear regression
print("Coefficients: {} Intercept: {}".format(lrModel.coefficients,lrModel.intercept))

Coefficients: [24.965775690432555,38.35111896972482,0.0892767832178607,61.686539190833216] Intercept: -1009.6551640365732
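
Each coefficient is the estimated change in yearly spend for a one-unit increase in the matching feature, in the order passed to `VectorAssembler`. Note how small the "Time on Website" coefficient (about 0.09) is compared with "Time on App" (about 38.35) and "Length of Membership" (about 61.69). As a rough sketch, the prediction for a single customer is the intercept plus the dot product of coefficients and features. The example uses the first row from `data.show()` above; that row may have landed in the training split, so treat this as an arithmetic illustration rather than an evaluation.

```python
# Values copied from the model output above
coefficients = [24.965775690432555, 38.35111896972482,
                0.0892767832178607, 61.686539190833216]
intercept = -1009.6551640365732

# First customer: Avg Session Length, Time on App,
# Time on Website, Length of Membership
features = [34.49726772511229, 12.65565114916675,
            39.57766801952616, 4.0826206329529615]
actual_spend = 587.9510539684005

# prediction = intercept + sum(coefficient_i * feature_i)
prediction = intercept + sum(c * x for c, x in zip(coefficients, features))

# Residual as label minus prediction
residual = actual_spend - prediction

print("prediction:", prediction)
print("residual:", residual)
```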


In [0]:
test_results = lrModel.evaluate(test_data)

In [0]:
# Interesting results....
test_results.residuals.show()

+-------------------+
|          residuals|
+-------------------+
|-18.324100640499523|
|-23.191743232620297|
| -8.196278349790532|
| 18.204201160931348|
|  3.008844331281466|
| 2.4042996409535817|
| -4.543428993398379|
|  2.423362272814188|
| -5.347231144446937|
| -4.968178715333238|
| -6.756633563790388|
|-2.7680550511737465|
| 16.246690499690317|
|  -4.81881865858611|
| -2.122864132210566|
| -27.19444779351909|
| -7.571860967156113|
| -6.849286562411066|
| 0.5629351656847348|
| -4.947046952060532|
+-------------------+
only showing top 20 rows



In [0]:
unlabeled_data = test_data.select('features')

In [0]:
predictions = lrModel.transform(unlabeled_data)

In [0]:
predictions.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[30.8162002563476...|284.41043486901515|
|[31.1239738464355...| 510.1387952345734|
|[31.1280899047851...| 565.4489638966655|
|[31.3123493194580...|445.38722950313115|
|[31.3584766387939...| 492.1671200241873|
|[31.3662128448486...|428.18459806412454|
|[31.4252262115478...| 535.3101526262109|
|[31.4459724426269...| 482.4535908521858|
|[31.5147380828857...|495.15973114444694|
|[31.5171222686767...| 280.8866052290051|
|[31.5257530212402...|450.72227077082164|
|[31.5761318206787...| 543.9946175511737|
|[31.6098403930664...| 428.2988722444503|
|[31.6253604888916...| 381.1557327210861|
|[31.6610488891601...|418.48123205213244|
|[31.6739158630371...|502.91951493219096|
|[31.7207698822021...| 546.3468243460624|
|[31.7242031097412...| 510.2371649803798|
|[31.7366352081298...|496.37050599642464|
|[31.7656192779541...|501.50112410049803|
+--------------------+------------

In [0]:
print("RMSE: {}".format(test_results.rootMeanSquaredError))
print("MSE: {}".format(test_results.meanSquaredError))

RMSE: 10.375876305167704
MSE: 107.6588091001406
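
RMSE is the square root of MSE and is expressed in the same units as the label, so an RMSE of about 10.4 can be compared directly with the test set's standard deviation of about 78.2 reported by `test_data.describe()`. The sketch below checks both relationships using only numbers printed in this notebook. The R² estimate assumes the `stddev` from `describe()` is the sample standard deviation (divisor n - 1); the evaluation summary also exposes `test_results.r2`, which would be the direct way to get this figure.

```python
import math

# Values copied from the outputs above
rmse = 10.375876305167704
mse = 107.6588091001406
n = 169                          # test set row count
label_stddev = 78.18200501426    # stddev of Yearly Amount Spent in the test set

# RMSE should equal sqrt(MSE)
print(math.isclose(rmse, math.sqrt(mse), rel_tol=1e-9))

# R^2 = 1 - SSE / SST, rebuilt from the summary statistics
sse = n * mse                        # sum of squared errors
sst = (n - 1) * label_stddev ** 2    # total sum of squares (assumes sample stddev)
r2_estimate = 1 - sse / sst

print("approximate R^2:", round(r2_estimate, 4))
```

An estimate this close to 1 suggests the four features explain most of the variation in yearly spend on the test set.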


Excellent results! Let's see how you handle some more realistically modeled data in the Consulting Project!