# Linear Regression Example Code Along

In the resources download, the relevant notebook is called:
- ```linear_regression_code_along.ipynb```

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lr_example").getOrCreate()

In [2]:
from pyspark.ml.regression import LinearRegression

In [3]:
data = spark.read.csv("Ecommerce_Customers.csv", inferSchema=True, header=True)

In [4]:
data.printSchema()

# The actual value that we are trying to predict is: `Yearly Amount Spent`

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)



In [8]:
for element in data.head(1)[0]:
    print(element)

# Inspecting the elements reveals that `Avatar` is just a color.

# For this project, we just care about the numerical data:
# --- Avg Session Length
# --- Time on App
# --- Time on Website
# --- Length of Membership

# And we care about predicting:
# --- Yearly Amount Spent.

mstephenson@fernandez.com
835 Frank TunnelWrightmouth, MI 82180-9605
Violet
34.49726772511229
12.65565114916675
39.57766801952616
4.0826206329529615
587.9510539684005


**Now we need to set up the DataFrame for machine learning. This is an important part of using PySpark MLlib, so make sure you fully understand what the following code does.**

In [9]:
# Firstly, we need to import the vector assembler and vectors.
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [10]:
data.columns

['Email',
 'Address',
 'Avatar',
 'Avg Session Length',
 'Time on App',
 'Time on Website',
 'Length of Membership',
 'Yearly Amount Spent']

In [11]:
# Proceed by creating an assembler object.
assembler = VectorAssembler(inputCols=['Avg Session Length',
                                       'Time on App',
                                       'Time on Website',
                                       'Length of Membership'],
                           outputCol="features")  
# VectorAssembler is a feature transformer that merges multiple columns into a vector column.
# inputCols is a list of strings of the actual columns that you want to include.

In [12]:
# Secondly, we want to transform the data.
output = assembler.transform(data)  # Notice that we are passing in ALL the data (not just a train/test split).
output.printSchema()  # All the original columns are kept, plus a new `features` vector column.

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)
 |-- features: vector (nullable = true)



In [16]:
output.head(1)[0][-1]  # Notice that we have here a DenseVector:
# DenseVector([34.4973, 12.6557, 39.5777, 4.0826])

DenseVector([34.4973, 12.6557, 39.5777, 4.0826])
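
To make the assembler step concrete, here is a minimal plain-Python sketch (not PySpark itself) of what happens to a single row: the values of the chosen input columns are collected, in order, into one feature vector. The `assemble` helper is a hypothetical name used only for illustration; the row values are the ones printed earlier in this notebook.

```python
# Plain-Python sketch of the per-row effect of VectorAssembler (illustration only).
input_cols = [
    "Avg Session Length",
    "Time on App",
    "Time on Website",
    "Length of Membership",
]

# First row of the dataset, as printed earlier in this notebook.
row = {
    "Email": "mstephenson@fernandez.com",
    "Avatar": "Violet",
    "Avg Session Length": 34.49726772511229,
    "Time on App": 12.65565114916675,
    "Time on Website": 39.57766801952616,
    "Length of Membership": 4.0826206329529615,
    "Yearly Amount Spent": 587.9510539684005,
}


def assemble(row, input_cols):
    """Hypothetical helper: return the input column values, in order, as one list."""
    return [row[col] for col in input_cols]


features = assemble(row, input_cols)
print([round(x, 4) for x in features])  # [34.4973, 12.6557, 39.5777, 4.0826]
```

The rounded values match the `DenseVector` shown above; the non-numeric columns and the label are simply left out of the vector.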

In [17]:
# Next, we create our final_data:
final_data = output.select(["features", "Yearly Amount Spent"])
final_data.show()

# Notice that the DataFrame now has just two columns: `features` and the label (`Yearly Amount Spent`).

+--------------------+-------------------+
|            features|Yearly Amount Spent|
+--------------------+-------------------+
|[34.4972677251122...|  587.9510539684005|
|[31.9262720263601...|  392.2049334443264|
|[33.0009147556426...| 487.54750486747207|
|[34.3055566297555...|  581.8523440352177|
|[33.3306725236463...|  599.4060920457634|
|[33.8710378793419...|   637.102447915074|
|[32.0215955013870...|  521.5721747578274|
|[32.7391429383803...|  549.9041461052942|
|[33.9877728956856...|  570.2004089636196|
|[31.9365486184489...|  427.1993848953282|
|[33.9925727749537...|  492.6060127179966|
|[33.8793608248049...|  522.3374046069357|
|[29.5324289670579...|  408.6403510726275|
|[33.1903340437226...|  573.4158673313865|
|[32.3879758531538...|  470.4527333009554|
|[30.7377203726281...|  461.7807421962299|
|[32.1253868972878...| 457.84769594494855|
|[32.3388993230671...| 407.70454754954415|
|[32.1878120459321...|  452.3156754800354|
|[32.6178560628234...|   605.061038804892|
+--------------------+-------------------+
only showing top 20 rows

In [18]:
# Next, we need to split up the final_data into a train/test split:
train_data, test_data = final_data.randomSplit([0.7, 0.3])  # A 70/30 split is a common starting point; pass seed=<int> to make the split reproducible.

In [19]:
train_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                358|
|   mean|  498.4636864063505|
| stddev|  80.06444722704694|
|    min| 256.67058229005585|
|    max|  765.5184619388373|
+-------+-------------------+



In [20]:
test_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                142|
|   mean| 501.45788306916813|
| stddev|  77.63034026924589|
|    min|  304.1355915788555|
|    max|  744.2218671047146|
+-------+-------------------+
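
The counts above (358 and 142) are close to, but not exactly, 70% and 30% of 500 rows. One way to see why is the plain-Python sketch below, which assumes each row is assigned to a split by its own independent random draw. The `random_split` function is a hypothetical stand-in written for illustration, not Spark's implementation.

```python
import random


def random_split(rows, weights, seed):
    """Hypothetical helper: assign each row to one split by an independent random draw."""
    total = sum(weights)
    # Cumulative boundaries, roughly [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any draw left over by float rounding.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(500))  # stand-in for the 500 rows of final_data
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Because every row is drawn separately, the split sizes vary around 350 and 150 from run to run unless the seed is fixed.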



**Now, we create our Linear Regression Model.**

In [21]:
lr = LinearRegression(labelCol="Yearly Amount Spent")  # featuresCol defaults to "features", which matches our column name, so only labelCol needs to be set.

In [22]:
lr_model = lr.fit(train_data)

**Let us see how well our model actually performed.**

In [23]:
test_results = lr_model.evaluate(test_data)

In [24]:
test_results.residuals.show()

+-------------------+
|          residuals|
+-------------------+
|0.28038733185297815|
| 10.709307835746472|
| -6.351051578016779|
| 2.9216968110291646|
| 19.063815641797248|
|  3.800463888802483|
| -5.315671165363938|
|  6.657952018951335|
| -7.377146922995848|
|  -4.81905280428947|
| -9.353148325667178|
|   1.07830180297492|
|-2.2770975477609454|
| -10.39882845281511|
| -8.975074456393429|
|-17.489441897049858|
| 10.812557456158515|
|-1.8995175961796917|
|  8.113243129177022|
|  5.408183605730301|
+-------------------+
only showing top 20 rows
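
Each residual is the label minus the model's prediction for that row. As a rough sketch of how residuals turn into a root mean squared error, the snippet below applies the formula to just the 20 residuals displayed above. This is not the full test-set metric, since the test set has 142 rows, but the result should land in the same range.

```python
import math

# The 20 residuals (label minus prediction) shown in the output above.
residuals = [
    0.28038733185297815, 10.709307835746472, -6.351051578016779,
    2.9216968110291646, 19.063815641797248, 3.800463888802483,
    -5.315671165363938, 6.657952018951335, -7.377146922995848,
    -4.81905280428947, -9.353148325667178, 1.07830180297492,
    -2.2770975477609454, -10.39882845281511, -8.975074456393429,
    -17.489441897049858, 10.812557456158515, -1.8995175961796917,
    8.113243129177022, 5.408183605730301,
]

# RMSE: square each residual, average the squares, then take the square root.
rmse_shown = math.sqrt(sum(r * r for r in residuals) / len(residuals))
print(round(rmse_shown, 2))
```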



In [25]:
test_results.rootMeanSquaredError

9.357332800814493

In [26]:
test_results.r2
# An R-squared of about 0.985 means the model explains roughly 98.5% of the variance in the test labels.

0.9853677738859922
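
As a sanity check, R-squared can be recovered from numbers already printed in this notebook, using R² = 1 − SS_res / SS_tot. The sketch below assumes that the `stddev` reported by `describe()` is the sample standard deviation (dividing by n − 1), so that SS_tot = (n − 1) · stddev², and that SS_res = n · RMSE².

```python
# Values copied from the outputs in this notebook.
rmse = 9.357332800814493          # test_results.rootMeanSquaredError
n = 142                           # row count of test_data
label_stddev = 77.63034026924589  # stddev of the label in test_data.describe()

ss_res = n * rmse ** 2                # sum of squared residuals
ss_tot = (n - 1) * label_stddev ** 2  # total sum of squares about the mean (assumes sample stddev)

r2 = 1 - ss_res / ss_tot
print(round(r2, 4))  # 0.9854
```

Under those assumptions the result agrees with `test_results.r2`.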

In [27]:
# Compare the RMSE with the scale of the label (its mean and standard deviation) in the following report:
final_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                500|
|   mean|  499.3140382585909|
| stddev|   79.3147815497068|
|    min| 256.67058229005585|
|    max|  765.5184619388373|
+-------+-------------------+
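
To put the comparison in numbers, the short sketch below divides the test RMSE by the mean and by the standard deviation of `Yearly Amount Spent` taken from the report above. The variable names are illustrative only.

```python
# Values copied from the outputs in this notebook.
rmse = 9.357332800814493         # test_results.rootMeanSquaredError
label_mean = 499.3140382585909   # mean from final_data.describe()
label_stddev = 79.3147815497068  # stddev from final_data.describe()

rmse_vs_mean = rmse / label_mean
rmse_vs_stddev = rmse / label_stddev
print(f"{rmse_vs_mean:.1%} of the mean, {rmse_vs_stddev:.1%} of the stddev")
# 1.9% of the mean, 11.8% of the stddev
```

A typical error of about 2% of the average yearly spend suggests the fit is tight on this dataset.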



**Let us review how to deploy this model on data that has only features (and no labels).**

In [28]:
# We mimic deployment by keeping only the `features` column of the test data:
unlabeled_data = test_data.select("features")
unlabeled_data.show()

+--------------------+
|            features|
+--------------------+
|[30.5743636841713...|
|[30.7377203726281...|
|[31.0613251567161...|
|[31.3091926408918...|
|[31.3123495994443...|
|[31.3584771924370...|
|[31.5147378578019...|
|[31.6548096756927...|
|[31.7207699002873...|
|[31.7656188210424...|
|[31.8279790554652...|
|[31.8293464559211...|
|[31.8627411090001...|
|[31.8648325480987...|
|[31.8854062999117...|
|[31.9048571310136...|
|[31.9096268275227...|
|[31.9120759292006...|
|[31.9549038566348...|
|[31.9764800614612...|
+--------------------+
only showing top 20 rows



In [29]:
predictions = lr_model.transform(unlabeled_data)

In [30]:
predictions.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[30.5743636841713...| 441.7840264262127|
|[30.7377203726281...| 451.0714343604834|
|[31.0613251567161...| 493.9065096359184|
|[31.3091926408918...|429.79902102890446|
|[31.3123495994443...|444.52760238614337|
|[31.3584771924370...| 491.3754865606729|
|[31.5147378578019...|495.12815916182535|
|[31.6548096756927...| 468.6054717085972|
|[31.7207699002873...| 546.1520804010188|
|[31.7656188210424...| 501.3731344398966|
|[31.8279790554652...| 449.3558958726087|
|[31.8293464559211...|384.07403618500007|
|[31.8627411090001...| 558.5752387218076|
|[31.8648325480987...| 450.2901089296288|
|[31.8854062999117...|399.07834742886894|
|[31.9048571310136...|  491.439299319866|
|[31.9096268275227...| 552.6334782170807|
|[31.9120759292006...| 389.4342339018874|
|[31.9549038566348...|431.88463681074995|
|[31.9764800614612...| 325.1862624283699|
+--------------------+------------------+
only showing top 20 rows
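
Conceptually, each value in the `prediction` column is the intercept plus a weighted sum of the feature vector. The sketch below shows that arithmetic in plain Python. The coefficients, intercept, and feature values are made-up toy numbers for illustration, not the fitted values from this notebook; the fitted ones can be read from `lr_model.coefficients` and `lr_model.intercept`.

```python
def predict(features, coefficients, intercept):
    """Linear model prediction: intercept + sum of coefficient * feature."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))


# Toy numbers for illustration only (NOT the fitted model from this notebook).
toy_coefficients = [1.0, 2.0, 0.0, 3.0]
toy_intercept = 10.0
toy_features = [1.0, 2.0, 3.0, 4.0]

toy_prediction = predict(toy_features, toy_coefficients, toy_intercept)
print(toy_prediction)  # 27.0  (10 + 1*1 + 2*2 + 0*3 + 3*4)
```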

**Coming up next is a consulting project.**