## Spark MLlib uses the Spark DataFrame syntax

One of the main quirks of MLlib is that you need to format the data so that it ends up with just one or two columns:
- Features, Labels (Supervised Learning)
- Features (Unsupervised Learning)

Notes to self:
- The Spark documentation is really well formatted: [Spark](https://spark.apache.org)
- Book recommendation: An Introduction to Statistical Learning by Gareth James et al.

### Some common evaluation metrics for Regression methods

* Mean Absolute Error
* Mean Squared Error
* Root Mean Squared Error
* R-squared (R2) value

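MSE squares each error, so large errors stand out more than they do with MAE, which makes MSE more popular. The drawback is that its units are the square of the label's units.

RMSE takes the square root of MSE, which puts the metric back in the label's units; this makes it the most popular metric for regression.

The R-squared value is not really an error metric; it is a statistical measure of your regression model. R2 measures the proportion of the variance in the label that your model accounts for. For a least-squares fit with an intercept it lies between 0 and 1 on the training data, but it can be negative on held-out data when the model does worse than always predicting the mean.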
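
To make the four metrics concrete, here is a minimal plain-Python sketch that computes them from a list of labels and a list of predictions. The function name `regression_metrics` and the sample numbers are made up for illustration; in Spark these values come from the model summary instead.

```python
import math


def regression_metrics(labels, predictions):
    """Return MAE, MSE, RMSE and R2 for two equal-length lists of numbers."""
    n = len(labels)
    # Error per observation: actual label minus predicted value
    errors = [y - p for y, p in zip(labels, predictions)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # R2 = 1 - (residual sum of squares / total sum of squares)
    mean_label = sum(labels) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical labels and predictions, only for illustration
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 12.0])
print(metrics)
```

In this made-up example the single error of size 3 contributes 9 of the 11 units of squared error, which is what "MSE makes larger errors more notable" means in practice.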

### Linear Regression

Least squares method:
Linear regression is usually fitted with the least squares method,
which picks the line that minimizes the sum of the squared residuals.
The residual for an observation is the difference between the observed y-value and the value given by the fitted line.
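
As a rough illustration of the least squares idea, the sketch below fits a line to a single feature using the standard closed-form formulas and then computes the residuals. The helper name `fit_line` and the data points are hypothetical; Spark's `LinearRegression` handles many features and uses its own solvers, so this is only a one-feature sketch of the principle.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data points, only for illustration
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)

# Residual = observed y-value minus the value on the fitted line
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(slope, intercept, residuals)
```

With an intercept in the model, the least squares residuals sum to zero (up to floating-point rounding), which is a quick sanity check on a fit.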

**1. An example from the official Spark documentation**

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("Jerry_LR_sample").getOrCreate()

In [4]:
from pyspark.ml.regression import LinearRegression

In [5]:
training = spark.read.format('libsvm').load('datasets/sample_linear_regression_data.txt')

The output below shows the format we will use from now on when doing machine learning in Spark: a `label` column and a `features` vector column.

In [7]:
# Sample data is already formatted
training.show(2)

+------------------+--------------------+
|             label|            features|
+------------------+--------------------+
|-9.490009878824548|(10,[0,1,2,3,4,5,...|
|0.2577820163584905|(10,[0,1,2,3,4,5,...|
+------------------+--------------------+
only showing top 2 rows



In [8]:
# Initialise the Linear Regression estimator
# By default the feature/label columns are expected to be named features/label
lr = LinearRegression(featuresCol='features',labelCol='label',predictionCol='prediction')

In [10]:
# Now we fit the model on our training dataset
lr_model = lr.fit(training)

In [11]:
lr_model.coefficients

DenseVector([0.0073, 0.8314, -0.8095, 2.4412, 0.5192, 1.1535, -0.2989, -0.5129, -0.6197, 0.6956])

In [12]:
lr_model.intercept

0.14228558260358093
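
The coefficients and the intercept are all a linear model needs to make a prediction: the prediction is the dot product of the coefficients with the feature vector, plus the intercept. Below is a minimal plain-Python sketch of that formula, using the (rounded) coefficients printed above. The `predict` helper and the two input rows are hypothetical and only illustrate the arithmetic.

```python
# Coefficients as displayed above (rounded to 4 decimals) and the intercept
coefficients = [0.0073, 0.8314, -0.8095, 2.4412, 0.5192,
                1.1535, -0.2989, -0.5129, -0.6197, 0.6956]
intercept = 0.14228558260358093


def predict(features):
    """Linear model prediction: dot(coefficients, features) + intercept."""
    return sum(w * x for w, x in zip(coefficients, features)) + intercept


# Hypothetical inputs: an all-zero row and a row with only feature 3 set to 1
zero_row = [0.0] * 10
one_hot_row = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(predict(zero_row))     # equals the intercept
print(predict(one_hot_row))  # intercept plus the coefficient of feature 3
```

Because the coefficients here are the rounded values shown in the output, results will differ slightly from what the fitted Spark model returns.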

In [13]:
training_summary = lr_model.summary

In [14]:
training_summary.r2

0.027839179518600154

In [15]:
training_summary.rootMeanSquaredError

10.16309157133015

### Split data into training & testing data

In [16]:
all_data = spark.read.format('libsvm').load('datasets/sample_linear_regression_data.txt')

In [17]:
# The weights indicate how the dataset is split (roughly 70% / 30% here)
split_object = all_data.randomSplit([0.7,0.3])

In [18]:
split_object

[DataFrame[label: double, features: vector],
 DataFrame[label: double, features: vector]]

In [19]:
# Usually we do
training_data, test_data = all_data.randomSplit([0.7,0.3])
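
The split sizes are only approximately 70/30, and they change from run to run unless a seed is passed (`randomSplit` accepts an optional `seed` argument). The plain-Python sketch below shows the general idea of a weighted random split, under the assumption that each row is assigned independently by a random draw against the normalized weights. It is a conceptual illustration with a made-up `random_split` helper, not Spark's actual implementation.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split by a random draw against normalized weights."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative upper bounds of each split, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding at the top end
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


# 501 hypothetical row ids, split with the same weights as above
train, test = random_split(list(range(501)), [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Every row lands in exactly one split, but the counts are not forced to be an exact 70/30 division, and reusing the same seed reproduces the same split.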

In [20]:
training_data.show(1)

+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
|-28.571478869743427|(10,[0,1,2,3,4,5,...|
+-------------------+--------------------+
only showing top 1 row



In [21]:
# Note: this describes `training` (the full dataset loaded earlier), not the `training_data` split
training.describe().show()

+-------+-------------------+
|summary|              label|
+-------+-------------------+
|  count|                501|
|   mean|0.25688882219498976|
| stddev| 10.317884030544564|
|    min|-28.571478869743427|
|    max|  27.78383192005107|
+-------+-------------------+



In [22]:
test_data.describe().show()

+-------+-------------------+
|summary|              label|
+-------+-------------------+
|  count|                149|
|   mean|-0.8253147832832288|
| stddev|  9.635891688501527|
|    min|-28.046018037776633|
|    max| 20.874246167942125|
+-------+-------------------+



In [24]:
split_model = lr.fit(training_data)

In [25]:
# Evaluate the fitted model on the held-out test data
test_results = split_model.evaluate(test_data)

In [26]:
test_results.residuals.show()

+-------------------+
|          residuals|
+-------------------+
|-28.370129994491126|
|-28.516278118222427|
|-21.533356037904316|
|-19.830850537094786|
| -18.16071966175852|
|-19.281523587878905|
|-18.641883457094423|
|  -19.2618007333354|
|-17.125444851315443|
|-16.961068020758777|
|-14.780148269426743|
|-16.601543885661577|
|-14.872080695071702|
|-14.996203091758307|
|-17.945562205720012|
|-15.287215336421413|
|-12.920129064312961|
|-16.714585704385076|
| -13.54941866426533|
| -9.944510707009712|
+-------------------+
only showing top 20 rows



In [27]:
test_results.rootMeanSquaredError

9.80614462530596

In [28]:
unlabeled_data= test_data.select('features')

In [29]:
unlabeled_data.show(1)

+--------------------+
|            features|
+--------------------+
|(10,[0,1,2,3,4,5,...|
+--------------------+
only showing top 1 row



In [30]:
predictions = split_model.transform(unlabeled_data)

In [31]:
# Predicted values based on the feature values
predictions.show()

+--------------------+--------------------+
|            features|          prediction|
+--------------------+--------------------+
|(10,[0,1,2,3,4,5,...|   0.324111956714493|
|(10,[0,1,2,3,4,5,...|  1.7107946897393558|
|(10,[0,1,2,3,4,5,...|  1.8660374225325946|
|(10,[0,1,2,3,4,5,...|   1.555636971090154|
|(10,[0,1,2,3,4,5,...|  0.3570934730940021|
|(10,[0,1,2,3,4,5,...|  2.2161239620028907|
|(10,[0,1,2,3,4,5,...|   1.949676435783316|
|(10,[0,1,2,3,4,5,...|  3.0003704578826667|
|(10,[0,1,2,3,4,5,...|  0.9740955000383327|
|(10,[0,1,2,3,4,5,...|  0.8754089797372868|
|(10,[0,1,2,3,4,5,...| -1.1713642963678303|
|(10,[0,1,2,3,4,5,...|  0.7395345580910166|
|(10,[0,1,2,3,4,5,...| -0.5653040983595159|
|(10,[0,1,2,3,4,5,...| -0.3796546315539897|
|(10,[0,1,2,3,4,5,...|   2.586017325887334|
|(10,[0,1,2,3,4,5,...| 0.46506242667022407|
|(10,[0,1,2,3,4,5,...|-0.11979899979165398|
|(10,[0,1,2,3,4,5,...|  3.7923626010146565|
|(10,[0,1,2,3,4,5,...|  1.0701384528138322|
|(10,[0,1,2,3,4,5,...|  -2.52314

## Linear Regression 2

In [33]:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("Jerry_LR_sample").getOrCreate()
# from pyspark.ml.regression import LinearRegression

In [34]:
data = spark.read.csv("datasets/Ecommerce_Customers.csv",inferSchema=True,header=True)

In [36]:
data.show(1)

+--------------------+--------------------+------+------------------+-----------------+-----------------+--------------------+-------------------+
|               Email|             Address|Avatar|Avg Session Length|      Time on App|  Time on Website|Length of Membership|Yearly Amount Spent|
+--------------------+--------------------+------+------------------+-----------------+-----------------+--------------------+-------------------+
|mstephenson@ferna...|835 Frank TunnelW...|Violet| 34.49726772511229|12.65565114916675|39.57766801952616|  4.0826206329529615|  587.9510539684005|
+--------------------+--------------------+------+------------------+-----------------+-----------------+--------------------+-------------------+
only showing top 1 row



In [38]:
data.printSchema()
# In this case we are going to predict the `yearly amount spent`

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)



In [40]:
for i in data.head(1)[0]:
    print(i)

mstephenson@fernandez.com
835 Frank TunnelWrightmouth, MI 82180-9605
Violet
34.49726772511229
12.65565114916675
39.57766801952616
4.0826206329529615
587.9510539684005


### Set up the DataFrame to be used for MLlib

In [41]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [42]:
data.columns

['Email',
 'Address',
 'Avatar',
 'Avg Session Length',
 'Time on App',
 'Time on Website',
 'Length of Membership',
 'Yearly Amount Spent']

In [43]:
'''
Init signature: VectorAssembler(inputCols=None, outputCol=None, handleInvalid='error')
Docstring: A feature transformer that merges multiple columns into a vector column.
'''
# The idea here is to use the VectorAssembler transformer to take the columns you want
# and merge them into a single vector column named 'features'
assembler_col = VectorAssembler(inputCols=['Avg Session Length',
                                           'Time on App',
                                           'Time on Website',
                                           'Length of Membership'],
                                outputCol='features')

In [44]:
# Now we want to transform our data
output = assembler_col.transform(data)
output.printSchema()
# We can see there is a new column called 'features'

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)
 |-- features: vector (nullable = true)



In [45]:
output.head(1)

[Row(Email='mstephenson@fernandez.com', Address='835 Frank TunnelWrightmouth, MI 82180-9605', Avatar='Violet', Avg Session Length=34.49726772511229, Time on App=12.65565114916675, Time on Website=39.57766801952616, Length of Membership=4.0826206329529615, Yearly Amount Spent=587.9510539684005, features=DenseVector([34.4973, 12.6557, 39.5777, 4.0826]))]
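
Conceptually, the assembler just collects the chosen numeric columns of each row, in order, into one vector. The plain-Python sketch below mimics that on the first row shown above, using a dict for the row and a list for the vector. The `assemble` helper is hypothetical and only illustrates the idea; it is not how `VectorAssembler` is implemented.

```python
def assemble(row, input_cols):
    """Collect the selected columns of a row, in order, into one feature list."""
    return [float(row[col]) for col in input_cols]


# First row of the dataset, as printed above (Address and Avatar left out)
row = {
    "Email": "mstephenson@fernandez.com",
    "Avg Session Length": 34.49726772511229,
    "Time on App": 12.65565114916675,
    "Time on Website": 39.57766801952616,
    "Length of Membership": 4.0826206329529615,
    "Yearly Amount Spent": 587.9510539684005,
}

input_cols = ["Avg Session Length", "Time on App",
              "Time on Website", "Length of Membership"]

features = assemble(row, input_cols)
print(features)
```

The label column (`Yearly Amount Spent`) is deliberately left out of `input_cols`, since it is what the model is supposed to predict.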

In [46]:
final_data = output.select('features','Yearly Amount Spent')

In [47]:
final_data.show(2)

+--------------------+-------------------+
|            features|Yearly Amount Spent|
+--------------------+-------------------+
|[34.4972677251122...|  587.9510539684005|
|[31.9262720263601...|  392.2049334443264|
+--------------------+-------------------+
only showing top 2 rows



In [49]:
# Train Test Split
train_data, test_data = final_data.randomSplit([0.8,0.2])

In [50]:
train_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                414|
|   mean|    500.28523404033|
| stddev|  79.19305190240036|
|    min| 256.67058229005585|
|    max|  765.5184619388373|
+-------+-------------------+



In [51]:
test_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                 86|
|   mean| 494.63874693719566|
| stddev|  80.19904202231011|
|    min|   266.086340948469|
|    max|  725.5848140556806|
+-------+-------------------+



In [52]:
lr = LinearRegression(featuresCol='features',labelCol='Yearly Amount Spent')

In [53]:
lr_model = lr.fit(train_data)

In [54]:
test_results = lr_model.evaluate(test_data)

In [55]:
# Residuals are the actual label from test_data minus the predicted value
test_results.residuals.show(3)

+-------------------+
|          residuals|
+-------------------+
|-10.602109731667156|
| -16.16771778351324|
|  7.882524938824417|
+-------------------+
only showing top 3 rows



In [56]:
test_results.r2

0.9824941960796217

In [57]:
test_results.rootMeanSquaredError

10.549222387277165

### How to apply the model to data that has only a 'features' column (no label)

In [58]:
unlabel_data = test_data.select('features')

In [59]:
unlabel_data.show(2)

+--------------------+
|            features|
+--------------------+
|[30.3931845423455...|
|[30.8162006488763...|
+--------------------+
only showing top 2 rows



In [60]:
predictions = lr_model.transform(unlabel_data)
predictions.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[30.3931845423455...|330.53097953486076|
|[30.8162006488763...|282.25405873198224|
|[30.9716756438877...| 486.7560848180683|
|[31.0613251567161...|492.61345638953935|
|[31.1695067987115...| 416.3089240051486|
|[31.2681042107507...| 426.4780116017662|
|[31.3091926408918...|428.89299948792177|
|[31.3123495994443...|443.77737837200857|
|[31.3584771924370...| 490.2313154390522|
|[31.3662121671876...|425.95264574533485|
|[31.4252268808548...| 533.7239067652235|
|[31.4459724827577...| 481.0956541511846|
|[31.8124825597242...| 395.7856615582391|
|[31.9262720263601...| 379.7647316622447|
|[31.9453957483445...|  661.664166239279|
|[31.9673209478824...| 449.4154678025375|
|[32.0180740106320...|339.86516196499224|
|[32.0542618511847...| 556.2535131672357|
|[32.0609143984100...|  610.100716858474|
|[32.1223647957977...| 531.6059588728715|
+--------------------+------------