# Linear Regression (Advanced Example)

Objective: Examine a dataset of e-commerce customer data for a company's website and mobile application, then build a regression model that predicts each customer's yearly spend on the company's products.

In [1]:
# Must be included at the beginning of each new notebook. Remember to change the app name.
import findspark
findspark.init('/home/ubuntu/spark-3.2.1-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('linear_regression_adv').getOrCreate()

# If you're getting an error with numpy, please type 'sudo pip install numpy --user' into the EC2 console.
from pyspark.ml.regression import LinearRegression

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/02 01:55:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/10/02 01:55:22 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [2]:
# Use Spark to read in the Ecommerce Customers csv file. You can infer csv schemas. 
data = spark.read.csv("Datasets/ecommerce_data.csv",inferSchema=True,header=True)

In [3]:
# Print the schema of the DataFrame. You can see the potential features as well as the label (Yearly Amount Spent).
data.printSchema()

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)



In [4]:
# Let's focus on one row to make it easier to read.
data.head()

Row(Email='mstephenson@fernandez.com', Address='835 Frank TunnelWrightmouth, MI 82180-9605', Avatar='Violet', Avg Session Length=34.49726772511229, Time on App=12.65565114916675, Time on Website=39.57766801952616, Length of Membership=4.0826206329529615, Yearly Amount Spent=587.9510539684005)

In [5]:
# A simple for loop allows us to make it even clearer. 
for item in data.head():
    print(item)

mstephenson@fernandez.com
835 Frank TunnelWrightmouth, MI 82180-9605
Violet
34.49726772511229
12.65565114916675
39.57766801952616
4.0826206329529615
587.9510539684005


## Setting Up a DataFrame for Machine Learning (MLlib)

We need to do a few things before Spark can accept the data for machine learning. First of all, it needs to be in the form of two columns: label and features. Unlike the documentation example, this data is messy. We'll need to combine all of the features into a single vector. VectorAssembler simplifies the process.
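
As a rough illustration of what the assembler produces, the plain-Python sketch below packs the four numeric columns of the first row into one feature list and keeps the label separate. The `assemble` helper is hypothetical and is not part of Spark; the real `VectorAssembler` works on whole DataFrame columns and emits Spark vectors.

```python
# Hypothetical helper: mimics, for a single record, what VectorAssembler does per row.
FEATURE_COLS = ["Avg Session Length", "Time on App",
                "Time on Website", "Length of Membership"]
LABEL_COL = "Yearly Amount Spent"


def assemble(row, feature_cols=FEATURE_COLS, label_col=LABEL_COL):
    """Return (features, label) for one record given as a dict."""
    features = [float(row[name]) for name in feature_cols]
    return features, float(row[label_col])


# First row of the dataset, copied from the data.head() output above.
first_row = {
    "Email": "mstephenson@fernandez.com",
    "Avatar": "Violet",
    "Avg Session Length": 34.49726772511229,
    "Time on App": 12.65565114916675,
    "Time on Website": 39.57766801952616,
    "Length of Membership": 4.0826206329529615,
    "Yearly Amount Spent": 587.9510539684005,
}

features, label = assemble(first_row)
print(features)  # the four numeric columns, in the order given by FEATURE_COLS
print(label)     # the value the model will learn to predict
```

The string columns (Email, Address, Avatar) are simply left out, which is also what happens in the notebook: only the columns listed in `inputCols` end up in the features vector.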

In [6]:
# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [7]:
# The input columns are the feature column names, and the output column is what you'd like the new column to be named. 
assembler = VectorAssembler(
    inputCols=["Avg Session Length", "Time on App", 
               "Time on Website",'Length of Membership'],
    outputCol="features")

In [8]:
# Now that we've created the assembler variable, let's actually transform the data.
output = assembler.transform(data)

In [9]:
# Using print schema, you see that the features output column has been added. 
output.printSchema()

# You can see that the features column is a dense vector that combines the various features as expected.
output.head(1)

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)
 |-- features: vector (nullable = true)



[Row(Email='mstephenson@fernandez.com', Address='835 Frank TunnelWrightmouth, MI 82180-9605', Avatar='Violet', Avg Session Length=34.49726772511229, Time on App=12.65565114916675, Time on Website=39.57766801952616, Length of Membership=4.0826206329529615, Yearly Amount Spent=587.9510539684005, features=DenseVector([34.4973, 12.6557, 39.5777, 4.0826]))]

In [10]:
# Let's select two columns (the features and the label).
# This is now in the appropriate format to be processed by Spark.
final_data = output.select("features",'Yearly Amount Spent')
final_data.show()

+--------------------+-------------------+
|            features|Yearly Amount Spent|
+--------------------+-------------------+
|[34.4972677251122...|  587.9510539684005|
|[31.9262720263601...|  392.2049334443264|
|[33.0009147556426...| 487.54750486747207|
|[34.3055566297555...|  581.8523440352177|
|[33.3306725236463...|  599.4060920457634|
|[33.8710378793419...|   637.102447915074|
|[32.0215955013870...|  521.5721747578274|
|[32.7391429383803...|  549.9041461052942|
|[33.9877728956856...|  570.2004089636196|
|[31.9365486184489...|  427.1993848953282|
|[33.9925727749537...|  492.6060127179966|
|[33.8793608248049...|  522.3374046069357|
|[29.5324289670579...|  408.6403510726275|
|[33.1903340437226...|  573.4158673313865|
|[32.3879758531538...|  470.4527333009554|
|[30.7377203726281...|  461.7807421962299|
|[32.1253868972878...| 457.84769594494855|
|[32.3388993230671...| 407.70454754954415|
|[32.1878120459321...|  452.3156754800354|
|[32.6178560628234...|   605.061038804892|
+--------------------+-------------------+
only showing top 20 rows

In [11]:
# Let's do a randomised 70/30 split. 
# Remember, you can use other splits depending on how easy/difficult it is to train your model.
train_data,test_data = final_data.randomSplit([0.7,0.3])

In [12]:
# Let's see our training data.
train_data.describe().show()

# And our testing data.
test_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                360|
|   mean|  498.0971226357352|
| stddev|  79.73375726559242|
|    min| 256.67058229005585|
|    max|  725.5848140556806|
+-------+-------------------+

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                140|
|   mean| 502.44324986021957|
| stddev|  78.42377012673346|
|    min| 302.18954780965197|
|    max|  765.5184619388373|
+-------+-------------------+
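
Notice that the counts above are 360 and 140 rather than exactly 350 and 150. That is consistent with each row being assigned to a split at random, so the 70/30 weights are only approximate targets. The sketch below illustrates the idea in plain Python; `random_split` is a hypothetical helper, not Spark's implementation, and the seed value is an arbitrary choice.

```python
import random


def random_split(rows, train_fraction=0.7, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(500))  # stand-in for the 500 customer records
train, test = random_split(rows)
print(len(train), len(test))  # close to 350/150, but usually not exactly that
```

In the notebook itself, `randomSplit` accepts an optional `seed` argument, for example `final_data.randomSplit([0.7, 0.3], seed=42)`, which is worth setting if you want the same split (and therefore the same metrics) each time you re-run the cells.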



Now we can create a Linear Regression Model object. Because the feature column is named 'features', which is the default, we don't have to specify it. However, as the label column doesn't use the default name, we have to specify its name (Yearly Amount Spent) via labelCol.
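
To give a feel for what fitting does, here is a minimal sketch of ordinary least squares for a single feature, written in plain Python. It is a simplified stand-in, not Spark's solver: the notebook's model has four features, and the toy data below is made up for illustration.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_simple_ols(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

With several features the same idea applies, except that one coefficient is estimated per feature, which is what `lr.fit(train_data)` returns in the next cells.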

In [13]:
lr = LinearRegression(labelCol='Yearly Amount Spent')

In [14]:
# Fit the model to the data.
lrModel = lr.fit(train_data)

23/10/02 01:57:30 WARN Instrumentation: [9134012c] regParam is zero, which might cause numerical instability and overfitting.
23/10/02 01:57:30 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
23/10/02 01:57:30 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS
23/10/02 01:57:30 WARN InstanceBuilder$NativeLAPACK: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK


In [15]:
# Print the coefficients and intercept for linear regression.
print("Coefficients: {} Intercept: {}".format(lrModel.coefficients,lrModel.intercept))

Coefficients: [25.71956724841068,39.043619650661086,0.45276721049979984,62.463702868266985] Intercept: -1059.3620709335287
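
The coefficients are listed in the same order as the columns passed to `inputCols`, so they can be paired with the feature names to make them easier to read. The sketch below does that in plain Python using the values printed above. Treat the ranking as a rough guide only: raw coefficients depend on the units and spread of each feature.

```python
# Feature names in the order given to VectorAssembler's inputCols.
feature_names = ["Avg Session Length", "Time on App",
                 "Time on Website", "Length of Membership"]

# Coefficients copied from the model output above.
coefficients = [25.71956724841068, 39.043619650661086,
                0.45276721049979984, 62.463702868266985]

# Sort by absolute size, largest first.
ranked = sorted(zip(feature_names, coefficients),
                key=lambda pair: abs(pair[1]), reverse=True)

for name, coef in ranked:
    # Each coefficient is the change in predicted yearly spend for a
    # one-unit increase in that feature, holding the others fixed.
    print(f"{name:<22} {coef:8.2f}")
```

Read this way, one extra unit of Length of Membership is associated with roughly \$62 more yearly spend, while one extra unit of Time on Website is associated with well under \$1.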


In [16]:
# Let's evaluate the model against the test data.
test_results = lrModel.evaluate(test_data)

In [17]:
# Interesting results! This shows the difference between the predicted value and the test data.
test_results.residuals.show()

# Let's get some evaluation metrics (as discussed in the previous linear regression notebook).
print("RMSE: {}".format(test_results.rootMeanSquaredError))

+-------------------+
|          residuals|
+-------------------+
| -9.845237739732625|
| -8.334245930444922|
|  3.747652549159625|
|  4.094557252741765|
|  4.442266821418002|
|  3.107264740466178|
|-15.074228757230458|
|-2.5796769839276976|
|  18.59969332175001|
|  7.207409651807382|
|-26.472190429289526|
| -7.839883789678765|
|-1.6609697003750057|
|    8.3829877691266|
|  -9.35510773574481|
|-18.566025324536668|
| 0.5740960177401462|
| -7.992582650199665|
|  4.244427677161525|
|  23.49701816499811|
+-------------------+
only showing top 20 rows

RMSE: 10.541561536710073


In [18]:
# We can also get the R2 value. 
print("R2: {}".format(test_results.r2))

R2: 0.98180183079438


Looking at the RMSE and R2, we can see that the model is quite accurate. The RMSE shows that predictions are typically off by only about \$10.50. Compared with the average amount spent (\$499) and the standard deviation (\$79) in the table below, an error of that size is small.
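
For reference, RMSE is the square root of the mean of the squared residuals. The sketch below computes it in plain Python for the first five residuals shown earlier. Because it uses only five of the 140 test rows, the result will not match the reported RMSE; it is only meant to show the calculation.

```python
import math


def rmse(residuals):
    """Root mean squared error: sqrt of the average squared residual."""
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


# First five residuals copied from the residuals table above (a sample only).
sample_residuals = [-9.845237739732625, -8.334245930444922,
                    3.747652549159625, 4.094557252741765,
                    4.442266821418002]

print(rmse(sample_residuals))
```

Squaring before averaging means large misses count for more than small ones, so RMSE is a little harsher than a plain average of the absolute errors.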

The R2 also shows that the model accounts for about 98% of the variance in Yearly Amount Spent on the test data.
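
As a sanity check, R2 can be recovered from numbers already printed in this notebook, since R2 = 1 - SSE / SST. The sketch below assumes that the RMSE divides by n and that the standard deviation reported by `describe()` is the sample standard deviation (dividing by n - 1); under those assumptions the result agrees with the R2 value above.

```python
# Values copied from the test-set outputs earlier in the notebook.
n = 140                            # number of rows in test_data
test_rmse = 10.541561536710073     # rootMeanSquaredError on test_data
test_stddev = 78.42377012673346    # stddev of Yearly Amount Spent in test_data

sse = n * test_rmse ** 2           # sum of squared errors of the model
sst = (n - 1) * test_stddev ** 2   # total sum of squares around the mean

r2 = 1 - sse / sst
print(round(r2, 4))  # 0.9818
```

In other words, the model's squared error is under 2% of what you would get by always predicting the mean yearly spend.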

In [19]:
final_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                500|
|   mean|  499.3140382585909|
| stddev|   79.3147815497068|
|    min| 256.67058229005585|
|    max|  765.5184619388373|
+-------+-------------------+



## But what if we didn't have the label data?

This isn't really relevant to your assignment, but it is useful in a real-world scenario. What if you have all of these features but no label data? How do you actually use the model you've created? Check out the example below.

In [20]:
# Let's just select the features column (removing the label column).
unlabeled_data = test_data.select('features')
unlabeled_data.show()

+--------------------+
|            features|
+--------------------+
|[30.3931845423455...|
|[31.1280900496166...|
|[31.3091926408918...|
|[31.3584771924370...|
|[31.3662121671876...|
|[31.4459724827577...|
|[31.5741380228732...|
|[31.5761319713222...|
|[31.6005122003032...|
|[31.6548096756927...|
|[31.6739155032749...|
|[31.7207699002873...|
|[31.8186165667690...|
|[31.8209982016720...|
|[31.8648325480987...|
|[31.9563005605233...|
|[32.0047530203648...|
|[32.0085045178551...|
|[32.0305497162129...|
|[32.0498393904573...|
+--------------------+
only showing top 20 rows



In [21]:
# Now we can transform the unlabeled data.
predictions = lrModel.transform(unlabeled_data)

In [22]:
# It worked! Feeding the unlabeled data features into the model results in a prediction, 
# which is the amount someone with those features is likely to spend in a year.
predictions.show()
predictions.head(1)

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[30.3931845423455...|329.77410754292623|
|[31.1280900496166...| 565.5869326774996|
|[31.3091926408918...|  428.973065290774|
|[31.3584771924370...|491.08139319673364|
|[31.3662121671876...| 426.1466157350669|
|[31.4459724827577...| 481.7697001946624|
|[31.5741380228732...| 559.4835009178173|
|[31.5761319713222...|  543.806260973256|
|[31.6005122003032...| 460.5731581693469|
|[31.6548096756927...|468.05601407574113|
|[31.6739155032749...|502.19725833917073|
|[31.7207699002873...| 546.6148172677017|
|[31.8186165667690...|448.07964307051066|
|[31.8209982016720...|416.29229324408675|
|[31.8648325480987...| 449.2463882125585|
|[31.9563005605233...| 565.6919570717355|
|[32.0047530203648...| 463.1718851028893|
|[32.0085045178551...|451.18980367895506|
|[32.0305497162129...| 590.0300557414503|
|[32.0498393904573...| 455.2223387092172|
+--------------------+------------------+
only showing top 20 rows

[Row(features=DenseVector([30.3932, 11.803, 36.3158, 2.0838]), prediction=329.77410754292623)]
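
To see where that prediction comes from, you can reproduce it by hand: a linear regression prediction is the dot product of the coefficients and the features, plus the intercept. The plain-Python sketch below uses the coefficients and intercept printed earlier and the feature values from the row above. The `predict` helper is hypothetical, and because the displayed features are rounded to four decimals, the result only matches Spark's prediction to within a small rounding error.

```python
# Coefficients and intercept copied from the model output earlier in the notebook.
coefficients = [25.71956724841068, 39.043619650661086,
                0.45276721049979984, 62.463702868266985]
intercept = -1059.3620709335287

# Features of the first unlabeled row, as displayed (rounded) in the Row output.
features = [30.3932, 11.803, 36.3158, 2.0838]


def predict(features, coefficients, intercept):
    """Linear model: sum of coefficient * feature, plus the intercept."""
    return sum(c * x for c, x in zip(coefficients, features)) + intercept


prediction = predict(features, coefficients, intercept)
print(round(prediction, 2))  # 329.77
```

This is all `lrModel.transform` needs to do for each row, which is why it works on data that has no label column.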