# Linear Regression Sample

In this sample, we examine a dataset of e-commerce customer data for a company's website and mobile app. We then see whether we can build a regression model that predicts each customer's yearly spend on the company's product.

The first thing to do is start a Spark session.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('lr_example').getOrCreate()

# In Jupyter/IPython, append "?" to a class name such as SparkSession to see its documentation
SparkSession?

In [None]:
from pyspark.ml.regression import LinearRegression

# Alternatively, use help() to see the documentation of a class, for example LinearRegression

help(LinearRegression)

In [None]:
# Use Spark to read in the Ecommerce Customers csv file.
data = spark.read.csv("Ecommerce_Customers.csv",inferSchema=True,header=True)

In [None]:
# Print the schema of the DataFrame

# To see which methods an object such as "data" supports,
# type "data." in a cell (without the "#") and press the Tab key

data.printSchema()

In [None]:
data.show()



In [None]:
data.head()

In [None]:
for item in data.head():
    print(item)

## Setting Up DataFrame for Machine Learning 

In [None]:
# A few things we need to do before Spark can accept the data!
# It needs to be in the form of two columns:
# ("features", "label")

# Import VectorAssembler and Vectors

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [None]:
data.columns

In [None]:
# Define the input columns that are combined into the "features" column

assembler = VectorAssembler(
    inputCols=["Avg Session Length", "Time on App", 
               "Time on Website",'Length of Membership'],
    outputCol="features")

In [None]:
output = assembler.transform(data)
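
As a rough mental model of what the assembling step does, the sketch below uses plain Python (not Spark) to collect the four input columns of one row into a single feature list, next to the label. The row values are made up for illustration and `assemble` is a hypothetical helper; the real `VectorAssembler` produces a Spark ML vector column rather than a Python list.

```python
# Conceptual sketch only: plain Python, not Spark. The row values are made up.
feature_cols = ["Avg Session Length", "Time on App",
                "Time on Website", "Length of Membership"]

def assemble(row, input_cols):
    """Collect the chosen columns of one row into a single list of floats."""
    return [float(row[c]) for c in input_cols]

# One hypothetical customer record, keyed by column name
example_row = {
    "Avg Session Length": 33.0,
    "Time on App": 12.5,
    "Time on Website": 37.0,
    "Length of Membership": 3.5,
    "Yearly Amount Spent": 550.0,
}

features = assemble(example_row, feature_cols)
label = example_row["Yearly Amount Spent"]
print(features, label)
```

The order of the feature list follows the order of the input column names, which is also the order in which the fitted model's coefficients should be read.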

In [None]:
output.select("features").show()

In [None]:
output.show()

In [None]:
# Keep only the "features" column and the label column (Yearly Amount Spent)

final_data = output.select("features",'Yearly Amount Spent')

In [None]:
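# Divide the data into training data (about 70%) and test data (about 30%).
# The split is random and the proportions are approximate; pass seed=<int> for a reproducible split.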

train_data,test_data = final_data.randomSplit([0.7,0.3])

In [None]:
# Look at the summary statistics of the training data and the test data

train_data.describe().show()

In [None]:
test_data.describe().show()

In [None]:
# Create a Linear Regression Model object

lr = LinearRegression(labelCol='Yearly Amount Spent')

In [None]:
# Fit (train) the model on the training data
# and call the fitted model lrModel

lrModel = lr.fit(train_data)

In [None]:
# Print the coefficients and intercept (or bias) for linear regression

print("Coefficients: {} Intercept: {}".format(lrModel.coefficients,lrModel.intercept))

In [None]:
# Evaluate the model on the test data

test_results = lrModel.evaluate(test_data)

In [None]:
# Interesting results: the residuals on the test data

test_results.residuals.show()

# See the definition of "residuals" after you run the current cell

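What is a "residual"?

The residual (e) is the difference between the observed value of the dependent variable (y) and the predicted value (ŷ): e = y - ŷ.

Each data point has one residual.

For a least-squares fit with an intercept, both the sum and the mean of the residuals on the training data are equal to zero. Residuals on held-out test data, such as the ones shown above, will generally not sum to exactly zero.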
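
To make the zero-sum property concrete, here is a minimal sketch in plain Python (not Spark) that fits a one-feature least-squares line to made-up numbers and sums the residuals of the points it was fitted on. The data and variable names are illustrative assumptions. The property relies on the model having an intercept and on the residuals being computed for the same data the model was fitted to.

```python
# Illustration with made-up numbers; plain Python, no Spark required.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Closed-form least-squares fit of y = slope * x + intercept
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean

# Residual e = y - y_hat for every point the line was fitted on
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(sum(residuals))  # zero up to floating-point rounding
```

Evaluating the same line on points it has not seen would generally give residuals that do not cancel out exactly.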



In [None]:
# Take only the features of the test data (no labels) for prediction

unlabeled_data = test_data.select('features')

In [None]:
predictions = lrModel.transform(unlabeled_data)

In [None]:
predictions.show()

In [None]:
# Evaluate the performance of the model using RMSE and MSE

print("RMSE: {}".format(test_results.rootMeanSquaredError))
print("MSE: {}".format(test_results.meanSquaredError))
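
For reference, the two metrics can be reproduced by hand. The sketch below uses plain Python with made-up labels and predictions (not values from this dataset) to show that MSE is the mean of the squared residuals and RMSE is its square root.

```python
import math

# Made-up labels and predictions, for illustration only
labels = [500.0, 620.0, 410.0, 555.0]
predicted = [510.0, 600.0, 400.0, 560.0]

# Squared residual for each data point
squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predicted)]

mse = sum(squared_errors) / len(squared_errors)  # mean squared error
rmse = math.sqrt(mse)                            # root mean squared error
print("MSE: {} RMSE: {}".format(mse, rmse))
```

Because RMSE is expressed in the same units as the label, it is easiest to judge against the mean and standard deviation of the label column reported by `describe()`.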

Excellent results! Let's see how you handle some more realistically modeled data in the Consulting Project!