<a href="https://colab.research.google.com/github/Ricardo-Jaramillo/PySpark/blob/main/05_Linear_regression_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Regression. A more complex example

In [1]:
# Install pyspark
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 316.9/316.9 MB 4.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425344 sha256=0b121552c69958706393f57d722a1f9dab677d04b895962ade6b903e9539bba4
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


In [2]:
# Download the file
!wget https://raw.githubusercontent.com/Ricardo-Jaramillo/PySpark/main/datasets/Ecommerce_Customers.csv

--2023-10-03 14:17:13--  https://raw.githubusercontent.com/Ricardo-Jaramillo/PySpark/main/datasets/Ecommerce_Customers.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 86871 (85K) [text/plain]
Saving to: ‘Ecommerce_Customers.csv’


2023-10-03 14:17:13 (3.83 MB/s) - ‘Ecommerce_Customers.csv’ saved [86871/86871]



In [3]:
# Import the libraries
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression

In [4]:
# Create a spark session
spark = SparkSession.builder.appName('lr_example').getOrCreate()

In [5]:
# Read in the file
data = spark.read.csv('Ecommerce_Customers.csv', inferSchema=True, header=True)

In [6]:
# Print schema
data.printSchema()

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)



In [7]:
# Show data
data.show()

+--------------------+--------------------+----------------+------------------+------------------+------------------+--------------------+-------------------+
|               Email|             Address|          Avatar|Avg Session Length|       Time on App|   Time on Website|Length of Membership|Yearly Amount Spent|
+--------------------+--------------------+----------------+------------------+------------------+------------------+--------------------+-------------------+
|mstephenson@ferna...|835 Frank TunnelW...|          Violet| 34.49726772511229| 12.65565114916675| 39.57766801952616|  4.0826206329529615|  587.9510539684005|
|   hduke@hotmail.com|4547 Archer Commo...|       DarkGreen| 31.92627202636016|11.109460728682564|37.268958868297744|    2.66403418213262|  392.2049334443264|
|    pallen@yahoo.com|24645 Valerie Uni...|          Bisque|33.000914755642675|11.330278057777512|37.110597442120856|   4.104543202376424| 487.54750486747207|
|riverarebecca@gma...|1414 David Throug...|   

In [8]:
# Print each item in the first row to inspect the data
for item in data.head(1)[0]:
  print(item)

mstephenson@fernandez.com
835 Frank TunnelWrightmouth, MI 82180-9605
Violet
34.49726772511229
12.65565114916675
39.57766801952616
4.0826206329529615
587.9510539684005


## Assemble data
Group all numeric feature values into a single `features` vector column, which is the input format Spark ML models expect.

In [16]:
# Import libraries
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [17]:
# Take the column names that contain our feature values
data.columns

['Email',
 'Address',
 'Avatar',
 'Avg Session Length',
 'Time on App',
 'Time on Website',
 'Length of Membership',
 'Yearly Amount Spent']

In [18]:
# Create an assembler object
assembler = VectorAssembler(inputCols=['Avg Session Length', 'Time on App', 'Time on Website', 'Length of Membership'],
                            outputCol='features')
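
The feature names above are typed by hand. As a hedged alternative, they could be derived from the schema by keeping every `double` column except the label. The sketch below is plain Python: the `dtypes` list is copied from the `printSchema()` output and stands in for what `data.dtypes` returns in the notebook, and `label_col` and `feature_cols` are illustrative names, not part of the original code.

```python
# (name, type) pairs copied from the printSchema() output above.
# In the notebook, data.dtypes returns pairs in this same form.
dtypes = [
    ('Email', 'string'),
    ('Address', 'string'),
    ('Avatar', 'string'),
    ('Avg Session Length', 'double'),
    ('Time on App', 'double'),
    ('Time on Website', 'double'),
    ('Length of Membership', 'double'),
    ('Yearly Amount Spent', 'double'),
]

# The column we want to predict must not leak into the features
label_col = 'Yearly Amount Spent'

# Keep the numeric columns, in schema order, excluding the label
feature_cols = [name for name, dtype in dtypes
                if dtype == 'double' and name != label_col]
print(feature_cols)
```

The resulting list could then be passed as `inputCols` to the assembler, which avoids typos when a dataset has many numeric columns.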

In [19]:
# Transform the data with the assembler object we just created
output = assembler.transform(data)

Let's inspect the new features column we just created

In [21]:
# Check the schema. See the new 'features' column
output.printSchema()

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)
 |-- features: vector (nullable = true)



In [22]:
# Show the features column values
output.select('features').show()

+--------------------+
|            features|
+--------------------+
|[34.4972677251122...|
|[31.9262720263601...|
|[33.0009147556426...|
|[34.3055566297555...|
|[33.3306725236463...|
|[33.8710378793419...|
|[32.0215955013870...|
|[32.7391429383803...|
|[33.9877728956856...|
|[31.9365486184489...|
|[33.9925727749537...|
|[33.8793608248049...|
|[29.5324289670579...|
|[33.1903340437226...|
|[32.3879758531538...|
|[30.7377203726281...|
|[32.1253868972878...|
|[32.3388993230671...|
|[32.1878120459321...|
|[32.6178560628234...|
+--------------------+
only showing top 20 rows



In [23]:
# Show all values in the first row
output.head(1)

[Row(Email='mstephenson@fernandez.com', Address='835 Frank TunnelWrightmouth, MI 82180-9605', Avatar='Violet', Avg Session Length=34.49726772511229, Time on App=12.65565114916675, Time on Website=39.57766801952616, Length of Membership=4.0826206329529615, Yearly Amount Spent=587.9510539684005, features=DenseVector([34.4973, 12.6557, 39.5777, 4.0826]))]
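
To make the `DenseVector` in the row above less opaque, here is a minimal plain-Python sketch of what the assembler conceptually does for one row: it picks the input columns, in the order given, and packs their values into one vector. The dictionary `first_row` is copied from the first row printed earlier; a Python list stands in for Spark's vector type, so this illustrates the idea and is not Spark's implementation.

```python
# First row of the dataset, copied from the output above (numeric columns only)
first_row = {
    'Avg Session Length': 34.49726772511229,
    'Time on App': 12.65565114916675,
    'Time on Website': 39.57766801952616,
    'Length of Membership': 4.0826206329529615,
    'Yearly Amount Spent': 587.9510539684005,
}

# Same order as the inputCols passed to the VectorAssembler
input_cols = ['Avg Session Length', 'Time on App',
              'Time on Website', 'Length of Membership']

# "Assemble": collect the selected values into one list, preserving order
features = [first_row[col] for col in input_cols]

# Round only for display, which is what the DenseVector repr appears to do
rounded = [round(v, 4) for v in features]
print(rounded)
```

The rounded values agree with the `DenseVector([34.4973, 12.6557, 39.5777, 4.0826])` shown in the row, and the label `Yearly Amount Spent` is left out of the vector.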

In [24]:
# Save the features and labels in a variable
final_data = output.select('features', 'Yearly Amount Spent')

In [26]:
# Show the data
final_data.show()

+--------------------+-------------------+
|            features|Yearly Amount Spent|
+--------------------+-------------------+
|[34.4972677251122...|  587.9510539684005|
|[31.9262720263601...|  392.2049334443264|
|[33.0009147556426...| 487.54750486747207|
|[34.3055566297555...|  581.8523440352177|
|[33.3306725236463...|  599.4060920457634|
|[33.8710378793419...|   637.102447915074|
|[32.0215955013870...|  521.5721747578274|
|[32.7391429383803...|  549.9041461052942|
|[33.9877728956856...|  570.2004089636196|
|[31.9365486184489...|  427.1993848953282|
|[33.9925727749537...|  492.6060127179966|
|[33.8793608248049...|  522.3374046069357|
|[29.5324289670579...|  408.6403510726275|
|[33.1903340437226...|  573.4158673313865|
|[32.3879758531538...|  470.4527333009554|
|[30.7377203726281...|  461.7807421962299|
|[32.1253868972878...| 457.84769594494855|
|[32.3388993230671...| 407.70454754954415|
|[32.1878120459321...|  452.3156754800354|
|[32.6178560628234...|   605.061038804892|
+----------

## Prepare the data. Split

In [27]:
# Split the data into train and test
train_data, test_data = final_data.randomSplit([0.7, 0.3])

In [32]:
# Let's describe our train and test data
train_data.describe().show()

test_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                355|
|   mean|  495.7368059724206|
| stddev|   78.0425794272199|
|    min|   266.086340948469|
|    max|  744.2218671047146|
+-------+-------------------+

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                145|
|   mean| 508.07208971783524|
| stddev|  81.96262746126975|
|    min| 256.67058229005585|
|    max|  765.5184619388373|
+-------+-------------------+
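
Note that the counts are 355 and 145 rather than exactly 350 and 150. The weights passed to `randomSplit` behave like probabilities, not exact quotas, and without a `seed` argument the split changes on every run. The plain-Python sketch below illustrates that behavior under simplifying assumptions: `random_split` is a hypothetical helper, not Spark's implementation, and the seed value 42 is arbitrary.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one of two groups with probability given by weights."""
    cutoff = weights[0] / sum(weights)   # normalize, e.g. 0.7 / (0.7 + 0.3)
    rng = random.Random(seed)            # seeded generator -> repeatable split
    first, second = [], []
    for row in rows:
        # Each row is decided independently, so group sizes are only approximate
        (first if rng.random() < cutoff else second).append(row)
    return first, second

rows = list(range(500))                  # stand-in for the 500 customers
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))             # close to 350 / 150, rarely exact

# The same seed reproduces the same split
assert random_split(rows, [0.7, 0.3], seed=42) == (train, test)
```

In the notebook, passing a seed, for example `final_data.randomSplit([0.7, 0.3], seed=42)`, should make the split and the metrics that follow reproducible across runs.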



## Create the Linear Regression Model and Evaluate

In [33]:
# Create the model
lr = LinearRegression(labelCol='Yearly Amount Spent')

In [34]:
# Fit the model with our train_data
lr_model = lr.fit(train_data)

In [35]:
# Evaluate the model
test_results = lr_model.evaluate(test_data)

In [36]:
# Show the errors
test_results.residuals.show()

+--------------------+
|           residuals|
+--------------------+
|  0.8044391036755201|
|   10.63730036508673|
|  -0.320160611330607|
|   19.31050914941153|
|   1.095946136028033|
|  -3.570703206576127|
|  -7.105245271921035|
|  -8.585304436876811|
|   4.001100119648811|
|-0.34408220228021946|
|  -6.685050218826632|
|   -5.69727218458064|
| -10.709884351235587|
|  -2.826345661160758|
|  -8.882980844623432|
|  -5.348978283240854|
|   5.730201083637326|
|   8.771473062603377|
|  -17.45708406267454|
|   4.044408609978234|
+--------------------+
only showing top 20 rows
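
Each residual is the difference between the true label and the model's prediction (label minus prediction), and the summary metrics used below are built from them. The following sketch computes residuals, RMSE and R squared from their definitions in plain Python. The `labels` and `predictions` values are made-up toy numbers chosen for easy arithmetic, not data from this notebook.

```python
import math

# Hypothetical toy values, not taken from the dataset
labels = [3.0, 5.0, 7.0, 9.0]
predictions = [2.5, 5.5, 7.0, 9.5]

# Residual = label - prediction
residuals = [y - p for y, p in zip(labels, predictions)]

# RMSE: square root of the mean of the squared residuals
ss_res = sum(r * r for r in residuals)
rmse = math.sqrt(ss_res / len(residuals))

# R squared: 1 - (residual sum of squares / total sum of squares)
mean_label = sum(labels) / len(labels)
ss_tot = sum((y - mean_label) ** 2 for y in labels)
r2 = 1 - ss_res / ss_tot

print(residuals, rmse, r2)
```

RMSE is expressed in the same units as the label, which is why it can be compared directly with the mean and standard deviation of `Yearly Amount Spent`.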



In [40]:
# Show the RMSE (root mean squared error)
test_results.rootMeanSquaredError

9.26238200584349

In [38]:
# Show R squared
test_results.r2

0.9871406462647594

In [41]:
# Compare with the summary of the total data
final_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                500|
|   mean|  499.3140382585909|
| stddev|   79.3147815497068|
|    min| 256.67058229005585|
|    max|  765.5184619388373|
+-------+-------------------+



The model performs well: the RMSE on the test set is about 9.3, which is small compared with the mean (about 499) and the standard deviation (about 79) of `Yearly Amount Spent`.
In other words, a typical prediction is off by roughly 2% of the mean, or about 12% of one standard deviation.

The R squared is about 0.987, which means the model explains roughly 98.7% of the variance in the test data. This is a very good fit.
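
The numbers printed above can be cross-checked against each other. The sketch below copies the reported RMSE, R squared and summary statistics from the outputs, expresses the RMSE relative to the mean and standard deviation, and then rebuilds R squared from the RMSE and the test-set standard deviation. It assumes that the second `describe()` table (count 145) belongs to `test_data` and that `describe()` reports the sample standard deviation (divisor n - 1).

```python
# Values copied from the outputs above
rmse = 9.26238200584349
r2_reported = 0.9871406462647594
n_test = 145                      # count in the test_data summary
std_test = 81.96262746126975      # stddev in the test_data summary
mean_all = 499.3140382585909      # mean of the full dataset
std_all = 79.3147815497068        # stddev of the full dataset

# Error relative to the scale of the target
rel_to_mean = rmse / mean_all     # roughly 0.02
rel_to_std = rmse / std_all       # roughly 0.12

# Rebuild R squared: 1 - SS_res / SS_tot
ss_res = n_test * rmse ** 2               # RMSE^2 is the mean squared residual
ss_tot = (n_test - 1) * std_test ** 2     # sample variance times (n - 1)
r2_from_rmse = 1 - ss_res / ss_tot

print(rel_to_mean, rel_to_std, r2_from_rmse)
```

Under those assumptions the rebuilt value agrees with the reported `r2` to several decimal places, which is a quick sanity check that the two metrics describe the same fit.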

## Predict data results

In [42]:
# Take some unlabeled data
unlabeled_data = test_data.select('features')

In [43]:
# Show the unlabeled data
unlabeled_data.show()

+--------------------+
|            features|
+--------------------+
|[30.5743636841713...|
|[31.1695067987115...|
|[31.2606468698795...|
|[31.3123495994443...|
|[31.3895854806643...|
|[31.4252268808548...|
|[31.4474464941278...|
|[31.5261978982398...|
|[31.5316044825729...|
|[31.6610498227460...|
|[31.7207699002873...|
|[31.7242025238451...|
|[31.8093003166791...|
|[31.8124825597242...|
|[31.8279790554652...|
|[31.9453957483445...|
|[31.9480174211613...|
|[31.9549038566348...|
|[31.9563005605233...|
|[32.0123007682454...|
+--------------------+
only showing top 20 rows



In [46]:
# Get the predictions
predictions = lr_model.transform(unlabeled_data)

In [47]:
# Show predictions
predictions.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[30.5743636841713...|441.25997465439013|
|[31.1695067987115...| 416.7192304372061|
|[31.2606468698795...|  421.646791868282|
|[31.3123495994443...| 444.2809088785291|
|[31.3895854806643...|408.97366492395486|
|[31.4252268808548...|  534.337421861338|
|[31.4474464941278...|425.70798736714505|
|[31.5261978982398...| 417.6798306292146|
|[31.5316044825729...|432.51450560971375|
|[31.6610498227460...|416.70243578218106|
|[31.7207699002873...| 545.4599836968496|
|[31.7242025238451...|509.08515947254114|
|[31.8093003166791...| 547.4817837140768|
|[31.8124825597242...|  395.636690644958|
|[31.8279790554652...|448.88572839156495|
|[31.9453957483445...| 662.3689022208928|
|[31.9480174211613...| 456.1906758092605|
|[31.9549038566348...| 431.2264068773236|
|[31.9563005605233...| 564.5830158098734|
|[32.0123007682454...|   488.90064445598|
+--------------------+------------
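
Each value in the `prediction` column is a weighted sum of the four features plus an intercept. The sketch below shows that arithmetic in plain Python. The `coefficients`, `intercept` and `features` values are made-up placeholders, since the notebook never prints the fitted ones; in the notebook they would come from `lr_model.coefficients` and `lr_model.intercept`.

```python
def predict(features, coefficients, intercept):
    """Linear model: dot product of coefficients and features, plus intercept."""
    return sum(w * x for w, x in zip(coefficients, features)) + intercept

# Hypothetical values for illustration only, not the fitted model
coefficients = [2.0, 3.0, 0.5, 10.0]
intercept = -100.0
features = [30.0, 12.0, 40.0, 4.0]

# 2*30 + 3*12 + 0.5*40 + 10*4 - 100
prediction = predict(features, coefficients, intercept)
print(prediction)
```

Printing the real coefficients alongside the feature names is a common way to see which feature contributes most to the predicted yearly spend.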