# Linear Regression - E-commerce Customer Data

Aim - Examine e-commerce customer data from a company's website and mobile app, and build a regression model that predicts each customer's yearly expenditure on the company's products.

Steps to follow: 

1. Create a Spark Session and import LinearRegression
2. Load data and check if it's in the format - label, features (if not, assemble the features using an assembler)
3. Split data into training and testing set (7:3)
4. Create an instance of Linear Regression
5. Create a model by using the instance to train/fit training data 
6. Use trained model to obtain prediction results by evaluating on testing data
7. Select label and predictions from prediction results
8. Create an evaluator instance (or use the model's evaluate method on the testing data)
9. Get regression metrics such as RMSE and R-squared by evaluating the predictions against the label

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('lr_example').getOrCreate()

In [2]:
from pyspark.ml.regression import LinearRegression

If this raises an import error such as No module named 'numpy', close Jupyter Notebook, install NumPy for Python 3 from a terminal with 'pip3 install numpy', and then restart the notebook.

In [3]:
# Use Spark to read in the Ecommerce Customers csv file.
data = spark.read.csv("Ecommerce_Customers.csv",inferSchema=True,header=True)

In [4]:
# Print the Schema of the DataFrame
data.printSchema()

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)



In [6]:
data.head()

Row(Email='mstephenson@fernandez.com', Address='835 Frank TunnelWrightmouth, MI 82180-9605', Avatar='Violet', Avg Session Length=34.49726772511229, Time on App=12.65565114916675, Time on Website=39.57766801952616, Length of Membership=4.0826206329529615, Yearly Amount Spent=587.9510539684005)

In [7]:
# Convert the first Row to a dictionary to inspect each value by column name
data.head().asDict()
# Note: the keys appear in alphabetical order in the output below

{'Address': '835 Frank TunnelWrightmouth, MI 82180-9605',
 'Avatar': 'Violet',
 'Avg Session Length': 34.49726772511229,
 'Email': 'mstephenson@fernandez.com',
 'Length of Membership': 4.0826206329529615,
 'Time on App': 12.65565114916675,
 'Time on Website': 39.57766801952616,
 'Yearly Amount Spent': 587.9510539684005}

In [8]:
for item in data.head():
    print(item)

mstephenson@fernandez.com
835 Frank TunnelWrightmouth, MI 82180-9605
Violet
34.49726772511229
12.65565114916675
39.57766801952616
4.0826206329529615
587.9510539684005


In [9]:
# Spark ML expects the data as two columns: a label column and a features (vector) column

# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [10]:
data.columns

['Email',
 'Address',
 'Avatar',
 'Avg Session Length',
 'Time on App',
 'Time on Website',
 'Length of Membership',
 'Yearly Amount Spent']

In [11]:
# Use VectorAssembler to assemble the independent variables into a single features column
assembler = VectorAssembler(
    inputCols=["Avg Session Length", "Time on App", 
               "Time on Website",'Length of Membership'],
    outputCol="features")

In [12]:
output = assembler.transform(data)

In [13]:
#To understand how it's formed we will check the first row
output.head(1)[0].asDict()

{'Address': '835 Frank TunnelWrightmouth, MI 82180-9605',
 'Avatar': 'Violet',
 'Avg Session Length': 34.49726772511229,
 'Email': 'mstephenson@fernandez.com',
 'Length of Membership': 4.0826206329529615,
 'Time on App': 12.65565114916675,
 'Time on Website': 39.57766801952616,
 'Yearly Amount Spent': 587.9510539684005,
 'features': DenseVector([34.4973, 12.6557, 39.5777, 4.0826])}

As we can see, features is a DenseVector holding the four input values in the order given in inputCols.
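Conceptually, the assembler collects the listed input columns, in order, into one vector per row. A minimal plain-Python sketch of that idea for the first row follows. It is an illustration only, not how `VectorAssembler` is implemented, and `first_row` and `input_cols` are names introduced here.

```python
# Plain-Python illustration of what the assembler does for one row.
# `first_row` copies the numeric values of the first row shown above.
first_row = {
    "Avg Session Length": 34.49726772511229,
    "Time on App": 12.65565114916675,
    "Time on Website": 39.57766801952616,
    "Length of Membership": 4.0826206329529615,
    "Yearly Amount Spent": 587.9510539684005,
}

# Same column order as the inputCols passed to VectorAssembler.
input_cols = ["Avg Session Length", "Time on App",
              "Time on Website", "Length of Membership"]

# Collect the input values, in order, into a single list (the "features").
features = [first_row[col] for col in input_cols]

# Rounded to four decimals for display, like the DenseVector output above.
print([round(value, 4) for value in features])
```

The label column (Yearly Amount Spent) is deliberately left out of the vector, since it is what the model has to predict.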

In [14]:
output.select("features").show()

+--------------------+
|            features|
+--------------------+
|[34.4972677251122...|
|[31.9262720263601...|
|[33.0009147556426...|
|[34.3055566297555...|
|[33.3306725236463...|
|[33.8710378793419...|
|[32.0215955013870...|
|[32.7391429383803...|
|[33.9877728956856...|
|[31.9365486184489...|
|[33.9925727749537...|
|[33.8793608248049...|
|[29.5324289670579...|
|[33.1903340437226...|
|[32.3879758531538...|
|[30.7377203726281...|
|[32.1253868972878...|
|[32.3388993230671...|
|[32.1878120459321...|
|[32.6178560628234...|
+--------------------+
only showing top 20 rows



In [15]:
final_data = output.select("features",'Yearly Amount Spent')

In [16]:
train_data,test_data = final_data.randomSplit([0.7,0.3])

In [17]:
train_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                363|
|   mean|   496.690127954307|
| stddev|  77.77836580528425|
|    min| 256.67058229005585|
|    max|  725.5848140556806|
+-------+-------------------+



In [18]:
test_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                137|
|   mean|  506.2664429334457|
| stddev|  83.14124908040189|
|    min|  275.9184206503857|
|    max|  765.5184619388373|
+-------+-------------------+
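Note that the split is only approximately 7:3: here 363 of the 500 rows (72.6 percent) landed in the training set. `randomSplit` assigns rows at random, so the counts vary from run to run unless its optional `seed` argument is set. The plain-Python sketch below imitates that behaviour under the assumption that each row is assigned independently with probability 0.7. It is an illustration only, and `simulate_split` is a hypothetical helper, not part of Spark.

```python
import random

def simulate_split(n_rows, train_fraction, seed):
    """Assign each row to train or test independently at random."""
    rng = random.Random(seed)
    train_count = sum(1 for _ in range(n_rows) if rng.random() < train_fraction)
    return train_count, n_rows - train_count

# Split observed in this notebook: 363 training rows and 137 testing rows.
print("observed train share:", 363 / 500)

# Different seeds give slightly different counts around the 70/30 target.
for seed in (1, 2, 3):
    print(seed, simulate_split(500, 0.7, seed))
```

Fixing the seed makes the counts, and therefore the metrics reported later, reproducible across runs.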



In [19]:
# Create a Linear Regression Model object
lr = LinearRegression(labelCol='Yearly Amount Spent')

In [20]:
# Fit the model to the data and call this model lrModel
lrModel = lr.fit(train_data)

In [21]:
# Print the coefficients and intercept for linear regression
print("Coefficients: {} Intercept: {}".format(lrModel.coefficients,lrModel.intercept))

Coefficients: [25.812210401866317,38.629487335826944,0.38104904469444756,61.491956147395754] Intercept: -1051.3968799204345
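These numbers define the fitted model: a prediction is the intercept plus the dot product of the coefficients with the feature vector. As a rough check, the sketch below applies the printed coefficients by hand to the first row of the dataset (feature values copied from the features vector shown earlier). Whether that row ended up in the training or the testing set is not known, and `predict` is a helper name introduced here.

```python
# Coefficients and intercept as printed above.
coefficients = [25.812210401866317, 38.629487335826944,
                0.38104904469444756, 61.491956147395754]
intercept = -1051.3968799204345

# Feature values of the first row of the dataset, in the same column order.
features = [34.49726772511229, 12.65565114916675,
            39.57766801952616, 4.0826206329529615]
actual = 587.9510539684005  # Yearly Amount Spent for that row

def predict(features, coefficients, intercept):
    """Linear model: intercept + sum of coefficient * feature."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

prediction = predict(features, coefficients, intercept)
print("prediction:", prediction)
print("residual (label - prediction):", actual - prediction)
```

The coefficient for Time on Website (about 0.38) is far smaller than the other three, which suggests that, under this fit, it contributes little to the predicted amount per unit of change.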


In [22]:
test_results = lrModel.evaluate(test_data)

In [23]:
# Residuals (label minus prediction) for the test data
test_results.residuals.show()

+-------------------+
|          residuals|
+-------------------+
| -11.10871185927192|
| 11.343712521435123|
|  10.79019763892643|
| -3.322829177428275|
| 22.917219621288268|
|  4.685404159482118|
| 3.5883999393801673|
|-3.9349419278362916|
|   -8.3582750073096|
|  7.334647198627351|
| -5.496541539562202|
|-3.1833594398226523|
| 1.4768389571602256|
|-1.0837096622209401|
|-1.4456586190696612|
|  12.34341567023995|
| -3.702441190750392|
| 0.7363566636153678|
|  5.945674154703397|
| 17.123674052708964|
+-------------------+
only showing top 20 rows



In [24]:
print("RMSE: {}".format(test_results.rootMeanSquaredError))
print("R2: {}".format(test_results.r2))

RMSE: 10.21685930664927
R2: 0.9847881023468451


An R-squared of about 0.985 means that lrModel explains roughly 98.5\% of the variance in Yearly Amount Spent on the test data, which indicates a very good fit.
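R-squared is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the label from its mean. As a sanity check, the sketch below rebuilds R-squared from numbers already printed in this notebook: the test RMSE, and the count and standard deviation from test_data.describe(). It assumes that describe() reports the sample standard deviation (divisor n - 1) and that RMSE uses divisor n.

```python
# Values copied from the outputs above.
n = 137                          # rows in test_data
label_std = 83.14124908040189    # stddev of Yearly Amount Spent in test_data
rmse = 10.21685930664927         # test RMSE

# Sum of squared residuals: RMSE^2 is the mean squared error over n rows.
ss_res = n * rmse ** 2

# Total sum of squares, assuming a sample standard deviation (n - 1).
ss_tot = (n - 1) * label_std ** 2

r2 = 1 - ss_res / ss_tot
print("reconstructed R2:", r2)
```

Under those assumptions the reconstructed value agrees with the reported R2 to several decimal places, which confirms that the two printed metrics are consistent with each other.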

In [25]:
final_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                500|
|   mean|  499.3140382585909|
| stddev|   79.3147815497068|
|    min| 256.67058229005585|
|    max|  765.5184619388373|
+-------+-------------------+



We can infer that Yearly Amount Spent has a mean of about \$500 and a standard deviation of about \$80, and that lrModel predicts the amount with an RMSE of about \$10. Combined with an R-squared of about 98\%, we can reasonably conclude that this is a good model.
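To put the error in context, the RMSE can be expressed relative to the mean and to the standard deviation of the label. A quick sketch using the numbers printed above:

```python
rmse = 10.21685930664927          # test RMSE
label_mean = 499.3140382585909    # mean over all 500 rows
label_std = 79.3147815497068      # stddev over all 500 rows

# RMSE as a share of the typical yearly amount.
print("RMSE / mean:   {:.1%}".format(rmse / label_mean))
# RMSE compared with the spread a mean-only prediction would leave.
print("RMSE / stddev: {:.1%}".format(rmse / label_std))
```

The error is roughly 2 percent of the mean and about 13 percent of the standard deviation, so the model does far better than always predicting the average amount.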

In [26]:
#Predicting Yearly amount spent on test data
unlabeled_data = test_data.select('features')
unlabeled_data.show()

+--------------------+
|            features|
+--------------------+
|[30.3931845423455...|
|[30.7377203726281...|
|[31.1695067987115...|
|[31.2681042107507...|
|[31.2834474760581...|
|[31.3584771924370...|
|[31.4459724827577...|
|[31.5171218025062...|
|[31.5261978982398...|
|[31.6548096756927...|
|[31.7242025238451...|
|[31.8124825597242...|
|[31.8293464559211...|
|[31.8627411090001...|
|[31.9120759292006...|
|[31.9262720263601...|
|[31.9673209478824...|
|[32.0047530203648...|
|[32.0478009788678...|
|[32.0478146331398...|
+--------------------+
only showing top 20 rows



In [27]:
predictions = lrModel.transform(unlabeled_data)

In [28]:
predictions.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[30.3931845423455...| 331.0375816624655|
|[30.7377203726281...|450.43702967479476|
|[31.1695067987115...| 416.5663331633664|
|[31.2681042107507...| 426.7933623512522|
|[31.2834474760581...| 568.8638698043792|
|[31.3584771924370...| 490.4905462899933|
|[31.4459724827577...| 481.2885649957484|
|[31.5171218025062...|  279.853362578222|
|[31.5261978982398...| 417.4528011996474|
|[31.6548096756927...|467.92877652892116|
|[31.7242025238451...| 508.8844288275227|
|[31.8124825597242...| 395.9937044236199|
|[31.8293464559211...|383.67549903081476|
|[31.8627411090001...| 557.3818508362676|
|[31.9120759292006...| 388.9803749247774|
|[31.9262720263601...|379.86151777408645|
|[31.9673209478824...|449.45228243040265|
|[32.0047530203648...|463.00962445701407|
|[32.0478009788678...|507.50489703139306|
|[32.0478146331398...|480.26588370613445|
+--------------------+------------