<a href="https://colab.research.google.com/github/Ricardo-Jaramillo/PySpark/blob/main/04_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Methods

Now let's implement some of the most common machine learning methods for supervised and unsupervised learning. Refer to the Spark MLlib guide whenever needed: https://spark.apache.org/docs/latest/ml-guide.html

To distinguish between the two kinds of models:

## Supervised Models
If the data has historical labels, the problem is likely a good fit for a supervised learning model.

Supervised models learn patterns from labeled data to predict the label values of additional unlabeled data. They are commonly used where historical data can predict likely future events.

Common supervised methods include:
  * Classification
  * Regression
  * Prediction
  * Gradient boosting
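
To make the idea of learning from labeled data concrete, here is a minimal sketch in plain Python (no Spark). It fits a straight line to a handful of made-up labeled points with ordinary least squares and then predicts the label of a new point. The data and the helper name `fit_line` are assumptions for illustration only; they are not part of the notebook's dataset or of the Spark API.

```python
# Toy labeled data (made up): each x comes with a known label y, roughly y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(xs, ys)

# Use the learned pattern to predict the label of a new, unlabeled point.
new_x = 5.0
predicted_y = slope * new_x + intercept
print(slope, intercept, predicted_y)
```

The labels are what make this supervised: the fit is judged against known answers, and the same measure can later be applied to data the model has not seen.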

## Unsupervised Models
If the data has no historical labels, the problem may need an unsupervised learning model.

There's no "right answer". The algorithm must figure out the structure of the data on its own.

It can be difficult to evaluate the results of an unsupervised model.

Some unsupervised methods are:
  * Self-organizing maps
  * Nearest-neighbor mapping
  * K-means clustering
  * Singular value decomposition
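
As a rough illustration of an algorithm working without labels, the sketch below runs a bare-bones k-means on one-dimensional, made-up points in plain Python. The function name `kmeans_1d`, the data, and the starting centers are assumptions chosen for the example; a real workload would more likely use a library implementation.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain k-means on 1-D data, starting from the given initial centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Made-up points with two visible groups; no labels are provided.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

Nothing here says which grouping is "correct", which is why evaluating an unsupervised result usually takes domain knowledge or an indirect quality measure.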

We'll be following the Spark MLlib documentation to code the next steps. Here's the link for future reference: https://spark.apache.org/docs/latest/ml-guide.html

In [1]:
# Install pyspark
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425344 sha256=e462047c4f2b62b7e8217a5d8de3f0b692e6dd40f0bb85563d5284e1a1828289
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


In [3]:
# Download data file
!wget https://raw.githubusercontent.com/Ricardo-Jaramillo/PySpark/main/datasets/sample_linear_regression_data.txt

--2023-10-03 05:02:30--  https://raw.githubusercontent.com/Ricardo-Jaramillo/PySpark/main/datasets/sample_linear_regression_data.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 119069 (116K) [text/plain]
Saving to: ‘sample_linear_regression_data.txt’


2023-10-03 05:02:30 (4.71 MB/s) - ‘sample_linear_regression_data.txt’ saved [119069/119069]



In [4]:
# Import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression

In [5]:
# Init Spark Session
spark = SparkSession.builder.appName('lrex').getOrCreate()

In [7]:
# Read in the file
training = spark.read.format('libsvm').load('sample_linear_regression_data.txt')
training.show()

+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
| -9.490009878824548|(10,[0,1,2,3,4,5,...|
| 0.2577820163584905|(10,[0,1,2,3,4,5,...|
| -4.438869807456516|(10,[0,1,2,3,4,5,...|
|-19.782762789614537|(10,[0,1,2,3,4,5,...|
| -7.966593841555266|(10,[0,1,2,3,4,5,...|
| -7.896274316726144|(10,[0,1,2,3,4,5,...|
| -8.464803554195287|(10,[0,1,2,3,4,5,...|
| 2.1214592666251364|(10,[0,1,2,3,4,5,...|
| 1.0720117616524107|(10,[0,1,2,3,4,5,...|
|-13.772441561702871|(10,[0,1,2,3,4,5,...|
| -5.082010756207233|(10,[0,1,2,3,4,5,...|
|  7.887786536531237|(10,[0,1,2,3,4,5,...|
| 14.323146365332388|(10,[0,1,2,3,4,5,...|
|-20.057482615789212|(10,[0,1,2,3,4,5,...|
|-0.8995693247765151|(10,[0,1,2,3,4,5,...|
| -19.16829262296376|(10,[0,1,2,3,4,5,...|
|  5.601801561245534|(10,[0,1,2,3,4,5,...|
|-3.2256352187273354|(10,[0,1,2,3,4,5,...|
| 1.5299675726687754|(10,[0,1,2,3,4,5,...|
| -0.250102447941961|(10,[0,1,2,3,4,5,...|
+-------------------+--------------------+
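
The `features` column above is displayed as a sparse vector: the vector size (10), followed by the list of non-zero indices. In the LIBSVM text format each line is `label index:value index:value ...` with 1-based indices, and Spark shifts them to 0-based when loading. The parser below is a minimal sketch of that convention in plain Python. The function name `parse_libsvm_line` and the sample line are hypothetical and are not taken from the data file.

```python
def parse_libsvm_line(line):
    """Parse 'label index:value ...' into (label, indices, values).

    LIBSVM indices are 1-based in the file; they are shifted to 0-based
    here, matching the indices shown in the `features` column.
    """
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        index, value = item.split(":")
        indices.append(int(index) - 1)
        values.append(float(value))
    return label, indices, values


# Hypothetical line with three non-zero features (not from the data file).
label, indices, values = parse_libsvm_line("2.5 1:0.5 3:-1.0 10:0.25")
print(label, indices, values)
```

Only the non-zero entries are stored, which keeps wide, mostly empty feature vectors compact.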

## First, this is how NOT to create a model
Including all of the data in the training dataset leaves nothing held out to evaluate the model on unseen data.

In [9]:
# Define the Linear Regression object
lr = LinearRegression(featuresCol='features', labelCol='label', predictionCol='prediction')

In [10]:
# Train the model on the full dataset (no held-out test set)
lrModel = lr.fit(training)

In [12]:
# Show the learned model coefficients
lrModel.coefficients

DenseVector([0.0073, 0.8314, -0.8095, 2.4412, 0.5192, 1.1535, -0.2989, -0.5129, -0.6197, 0.6956])

In [14]:
# Show the y-intercept
lrModel.intercept

0.14228558260358093
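
The coefficients and the intercept fully describe the fitted model: a prediction is the intercept plus the dot product of the coefficients and the feature vector. The sketch below reproduces that formula in plain Python using the rounded values printed above. The helper `predict` and the two feature vectors are assumptions for illustration, and because the coefficients are rounded the results will differ slightly from what Spark computes.

```python
# Coefficients (rounded) and intercept copied from the outputs above.
coefficients = [0.0073, 0.8314, -0.8095, 2.4412, 0.5192,
                1.1535, -0.2989, -0.5129, -0.6197, 0.6956]
intercept = 0.14228558260358093


def predict(features):
    """Linear regression prediction: intercept + sum(w_i * x_i)."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))


# Hypothetical feature vectors, for illustration only.
all_zeros = [0.0] * 10
only_fourth = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(predict(all_zeros))    # with all features at zero, only the intercept remains
print(predict(only_fourth))  # intercept plus the fourth coefficient
```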

In [17]:
# Save the model summary in a variable
training_summary = lrModel.summary

In [18]:
# Print the RMSE (root mean squared error)
training_summary.rootMeanSquaredError

10.16309157133015
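
Note that `rootMeanSquaredError` is the square root of the mean squared error, so it is expressed in the same units as the label. As a small sketch of how the two metrics relate, the plain-Python functions below compute both from residuals (label minus prediction). The function names and the sample numbers are made up for illustration.

```python
import math


def mse(labels, predictions):
    """Mean squared error: average of the squared residuals (label - prediction)."""
    residuals = [y - p for y, p in zip(labels, predictions)]
    return sum(r * r for r in residuals) / len(residuals)


def rmse(labels, predictions):
    """Root mean squared error: square root of the MSE, in the label's units."""
    return math.sqrt(mse(labels, predictions))


# Hypothetical labels and predictions, for illustration only.
labels = [3.0, -1.0, 2.0, 0.0]
predictions = [1.0, 1.0, 2.0, -2.0]

print(mse(labels, predictions))
print(rmse(labels, predictions))
```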

## Now this is the right way to create and train the model
Split the data into train and test sets

In [35]:
# Read in the data file again
all_data = spark.read.format('libsvm').load('sample_linear_regression_data.txt')

In [23]:
# Split into train (70%) and test (30%)
train_data, test_data = all_data.randomSplit([0.7, 0.3])
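
Two things are worth knowing about a random split like this one: the weights are target proportions rather than exact counts, and without a fixed seed the split changes on every run. PySpark's `randomSplit` accepts an optional `seed` argument for reproducibility. The plain-Python sketch below illustrates both points under a simplifying assumption: each row is assigned by an independent random draw. It is not Spark's actual implementation, and the name `random_split` is hypothetical.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)  # fixed seed -> the same split on every run
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(500))  # stand-in for DataFrame rows
train_a, test_a = random_split(rows, 0.7, seed=42)
train_b, test_b = random_split(rows, 0.7, seed=42)

# Sizes are close to 70/30 but not guaranteed to be exact.
print(len(train_a), len(test_a))
# The same seed reproduces the same split.
print(train_a == train_b)
```

This is why the row count of a split can differ from an exact 70% of the data.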

In [24]:
# Show the train_data
train_data.show()

+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
|-28.571478869743427|(10,[0,1,2,3,4,5,...|
|-28.046018037776633|(10,[0,1,2,3,4,5,...|
|-26.805483428483072|(10,[0,1,2,3,4,5,...|
| -23.51088409032297|(10,[0,1,2,3,4,5,...|
|-22.837460416919342|(10,[0,1,2,3,4,5,...|
|-21.432387764165806|(10,[0,1,2,3,4,5,...|
|-20.212077258958672|(10,[0,1,2,3,4,5,...|
|-20.057482615789212|(10,[0,1,2,3,4,5,...|
|-19.884560774273424|(10,[0,1,2,3,4,5,...|
|-19.782762789614537|(10,[0,1,2,3,4,5,...|
| -19.66731861537172|(10,[0,1,2,3,4,5,...|
|-19.402336030214553|(10,[0,1,2,3,4,5,...|
| -19.16829262296376|(10,[0,1,2,3,4,5,...|
|-18.845922472898582|(10,[0,1,2,3,4,5,...|
| -18.27521356600463|(10,[0,1,2,3,4,5,...|
|-17.803626188664516|(10,[0,1,2,3,4,5,...|
|-17.494200356883344|(10,[0,1,2,3,4,5,...|
|-17.428674570939506|(10,[0,1,2,3,4,5,...|
|-17.065399625876015|(10,[0,1,2,3,4,5,...|
|-17.026492264209548|(10,[0,1,2,3,4,5,...|
+-------------------+--------------------+

In [36]:
# Describe the train_data
train_data.describe().show()

+-------+-------------------+
|summary|              label|
+-------+-------------------+
|  count|                365|
|   mean|0.16556937384948994|
| stddev| 10.197775067613874|
|    min|-28.571478869743427|
|    max|  27.78383192005107|
+-------+-------------------+



In [37]:
# Train the model ONLY on the train_data
correct_model = lr.fit(train_data)

In [39]:
# Evaluate on the test data (data that has not been seen by the model yet)
test_results = correct_model.evaluate(test_data)

In [53]:
# Show the residuals (label - prediction) on the test data
test_results.residuals.show()

+-------------------+
|          residuals|
+-------------------+
|-22.803378454221637|
| -22.19547804100331|
|-27.360655037131597|
|-20.382473623411446|
|-17.294080603857708|
|-17.126928380878944|
|-18.829610556326138|
| -16.43412388719373|
| -19.65586108802081|
|-14.245286217629841|
|-15.576967483611172|
|-14.164669751487567|
| -20.46574485782032|
|-14.194536366505776|
|-17.544789986964652|
|-14.113655153292717|
|-14.948296411131482|
|-14.167173536885606|
|-14.479775478929724|
| -7.303395935959501|
+-------------------+
only showing top 20 rows



In [41]:
# Show the MSE of the model on the test data
test_results.meanSquaredError

120.38478771428925
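
This value is a mean squared error, while the earlier training summary reported a root mean squared error, so the two numbers are not directly comparable. Taking the square root puts them on the same scale. The snippet below is a small sketch using only the two values printed in this notebook; the variable names are chosen for illustration, and the exact figures will change with a different random split.

```python
import math

# Values copied from the notebook outputs above.
test_mse = 120.38478771428925       # test_results.meanSquaredError
full_fit_rmse = 10.16309157133015   # training_summary.rootMeanSquaredError (first model)

# Convert the test MSE to an RMSE so both errors are in the label's units.
test_rmse = math.sqrt(test_mse)
gap = test_rmse - full_fit_rmse

print(f"test RMSE = {test_rmse:.3f}, gap vs. first model = {gap:.3f}")
```

A somewhat higher error on held-out data is expected, since the first model was scored on the same rows it was trained on.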

## Make some predictions

In [43]:
# Now let's extract the features we're going to make predictions for (from the test data)
unlabeled_data = test_data.select('features')
unlabeled_data.show()

+--------------------+
|            features|
+--------------------+
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
|(10,[0,1,2,3,4,5,...|
+--------------------+
only showing top 20 rows



In [44]:
# Make predictions
predictions = correct_model.transform(unlabeled_data)

In [46]:
# Show predictions
predictions.show()

+--------------------+--------------------+
|            features|          prediction|
+--------------------+--------------------+
|(10,[0,1,2,3,4,5,...| -3.9328287283800862|
|(10,[0,1,2,3,4,5,...| -1.2919620799332014|
|(10,[0,1,2,3,4,5,...|   4.410829100935522|
|(10,[0,1,2,3,4,5,...|  0.5094825853430384|
|(10,[0,1,2,3,4,5,...|-0.03264012881823944|
|(10,[0,1,2,3,4,5,...|  0.4078315472738539|
|(10,[0,1,2,3,4,5,...|   2.137403535015033|
|(10,[0,1,2,3,4,5,...|  0.3484648461722426|
|(10,[0,1,2,3,4,5,...|   3.875176055397509|
|(10,[0,1,2,3,4,5,...| -0.5768666921213478|
|(10,[0,1,2,3,4,5,...|  0.8142092306800462|
|(10,[0,1,2,3,4,5,...|  0.2975818563287984|
|(10,[0,1,2,3,4,5,...|   6.693303296117446|
|(10,[0,1,2,3,4,5,...|  1.1546083024011606|
|(10,[0,1,2,3,4,5,...|   4.622566883594231|
|(10,[0,1,2,3,4,5,...|   1.612881367937663|
|(10,[0,1,2,3,4,5,...|   2.456854333585069|
|(10,[0,1,2,3,4,5,...|  1.9690769722241936|
|(10,[0,1,2,3,4,5,...|  2.6016079789628828|
|(10,[0,1,2,3,4,5,...|  -4.33715