## PySpark ML - MLlib

## Overview

In this article we will work with PySpark's MLlib, the machine learning library of Spark. It offers many of the algorithms you may know from scikit-learn (sklearn), along with the tools needed for the other stages of a complete ML pipeline, such as feature transformation, model training and evaluation.

Read my previous blogs on PySpark before going on with this one:
1. Getting started with PySpark using Python
2. Data Preprocessing using PySpark - PySpark's DataFrame
3. Data preprocessing using PySpark - Handling missing values
4. Data Preprocessing using PySpark - Aggregate and GroupBy functions

## What we will cover in this article

1. Setting up the Spark Session and reading the dataset
2. PySpark's Vector Assembler
3. Transforming the dataset
4. Train Test Split
5. Model Building
6. Coefficients and Intercept of linear regression
7. Predicting the results

In [None]:
!pip install pyspark

Successfully installed py4j-0.10.9.3 pyspark-3.2.1


## Starting the Spark session

The very first step before working with PySpark is to set up and start the Spark session. For that we first import the **`SparkSession`** class from the **`pyspark.sql`** package.

In [None]:
from pyspark.sql import SparkSession

df_ml = SparkSession.builder.appName('Machine learning example').getOrCreate()
df_ml

**Inference:** After importing `SparkSession` we used its **`builder`** attribute to configure the session, gave the session a name with the **`appName()`** method, and finally created it with the **`getOrCreate()`** method. Note that the variable `df_ml` holds the Spark session itself, not a DataFrame.

## Reading the dataset

Before reading the data, let's understand what our dataset actually is.

This is the bank note authentication dataset from Kaggle. It holds statistical details of both real and fake notes. If you want to know more about this dataset, follow this link:
`https://www.kaggle.com/code/dsabhis04/bank-note-detection-data-set/data`

Feature columns:
1. **Variance**
2. **Skewness**
3. **Curtosis**
4. **Entropy**

Target column:
1. **Target**

Now that the SparkSession is set up, it's time to **read the dataset**. We will first apply **data preprocessing** techniques to it with **PySpark** and then the **machine learning** operations.




In [None]:
training_dataset  = df_ml.read.csv('/content/bank_notes.csv', header=True, inferSchema=True)
training_dataset

DataFrame[variance: double, skewness: double, curtosis: double, entropy: double, Target: int]

**Inference:** Here, with the help of the **`read.csv()`** function, we have read the CSV formatted dataset. We set **header as True** so that the first row is used for the column names, and **inferSchema as True** so that each column gets its real data type instead of being read as a string.
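To make the effect of these two options more concrete, here is a small plain-Python sketch of the idea. It is only a conceptual illustration and not Spark's actual implementation: the helper `infer_type` and the in-memory sample are assumptions made for this example, and the three data rows are copied from the dataset preview in this article.

```python
import csv
import io

# First three rows of the bank note data, as shown in this article.
SAMPLE_CSV = """variance,skewness,curtosis,entropy,Target
3.6216,8.6661,-2.8073,-0.44699,0
4.5459,8.1674,-2.4586,-1.4621,0
3.866,-2.6383,1.9242,0.10645,0
"""


def infer_type(values):
    """Return the narrowest of 'int', 'double', 'string' that fits every value.

    Hypothetical helper: a simplified stand-in for what inferSchema=True does.
    """
    for type_name, caster in (("int", int), ("double", float)):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue
    return "string"


rows = list(csv.reader(io.StringIO(SAMPLE_CSV)))
header, data = rows[0], rows[1:]   # like header=True: the first row holds the names
columns = list(zip(*data))         # transpose the rows into columns
schema = {name: infer_type(col) for name, col in zip(header, columns)}
print(schema)
```

Without `header=True` Spark would name the columns `_c0`, `_c1`, and so on, and without `inferSchema=True` every column would be read as a string.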

**Let's see our dataset now**

In [None]:
training_dataset.show()

+--------+--------+--------+--------+------+
|variance|skewness|curtosis| entropy|Target|
+--------+--------+--------+--------+------+
|  3.6216|  8.6661| -2.8073|-0.44699|     0|
|  4.5459|  8.1674| -2.4586| -1.4621|     0|
|   3.866| -2.6383|  1.9242| 0.10645|     0|
|  3.4566|  9.5228| -4.0112| -3.5944|     0|
| 0.32924| -4.4552|  4.5718| -0.9888|     0|
|  4.3684|  9.6718| -3.9606| -3.1625|     0|
|  3.5912|  3.0129| 0.72888| 0.56421|     0|
|  2.0922|   -6.81|  8.4636|-0.60216|     0|
|  3.2032|  5.7588|-0.75345|-0.61251|     0|
|  1.5356|  9.1772| -2.2718|-0.73535|     0|
|  1.2247|  8.7779| -2.2135|-0.80647|     0|
|  3.9899| -2.7066|  2.3946| 0.86291|     0|
|  1.8993|  7.6625| 0.15394| -3.1108|     0|
| -1.5768|  10.843|  2.5462| -2.9362|     0|
|   3.404|  8.7261| -2.9915|-0.57242|     0|
|  4.6765| -3.3895|  3.4896|  1.4771|     0|
|  2.6719|  3.0646| 0.37158| 0.58619|     0|
| 0.80355|  2.8473|  4.3439|  0.6017|     0|
|  1.4479| -4.8794|  8.3428| -2.1086|     0|
|  5.2423|

**Inference:** Here we can see the **top 20 rows of the dataset** with the help of the **`show()`** function.

Now we will look at the schema of our bank note dataset, i.e. what data type each column holds and whether it is allowed to contain null values. Let's answer this with the **`printSchema()`** function.

In [None]:
training_dataset.printSchema()

root
 |-- variance: double (nullable = true)
 |-- skewness: double (nullable = true)
 |-- curtosis: double (nullable = true)
 |-- entropy: double (nullable = true)
 |-- Target: integer (nullable = true)



**Inference:** After calling the `printSchema()` method we can see the data type of each column:

* The **Variance, Skewness, Curtosis and Entropy** columns hold double type values. These are our independent columns, i.e. the **features**.
* The **Target** column holds integer type values. This is our dependent column, i.e. the value we want to predict.


So far we have seen the complete schema of our dataset, but often we only want to know **which columns are there**, so let's figure that out!

In [None]:
training_dataset.columns

['variance', 'skewness', 'curtosis', 'entropy', 'Target']

**Inference:** By using the `columns` attribute we can see **which columns are in the data**. The names are returned as a Python **list**.

## Vector Assembler

`VectorAssembler` is a feature transformer that brings all the independent columns, i.e. the **features, into one column**. In short, it **stacks the feature columns together** into a single column of **`vector type`**. Instead of dealing with multiple columns we then only need to care about that one column, because it holds all the data we need to train our model.

In [None]:
## ["variance", "skewness","curtosis", "entropy"] -------> new feature -------> independent feature
from pyspark.ml.feature import VectorAssembler

featassembler = VectorAssembler(inputCols=["variance", "skewness","curtosis", "entropy"], outputCol = "Independent Features" )
featassembler

VectorAssembler_835df549891f

Code breakdown:

1. First we imported **`VectorAssembler`** from the **`pyspark.ml.feature`** module.
2. Then we used `VectorAssembler` to stack our independent features together with the help of the following parameters.
  * **inputCols:** This parameter holds, as a list, the names of all the feature columns we want to use for the ML operations.
  * **outputCol:** Here we give the name of the new column into which all the features are grouped.
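The idea behind the two parameters can be sketched in plain Python. This is only a conceptual stand-in that uses dictionaries instead of a Spark DataFrame; the function `assemble` is a hypothetical helper written for this illustration and is not part of PySpark. The two rows are taken from the dataset preview above.

```python
# Two rows of the dataset, represented as plain dictionaries.
rows = [
    {"variance": 3.6216, "skewness": 8.6661, "curtosis": -2.8073,
     "entropy": -0.44699, "Target": 0},
    {"variance": 4.5459, "skewness": 8.1674, "curtosis": -2.4586,
     "entropy": -1.4621, "Target": 0},
]


def assemble(row, input_cols, output_col):
    """Return a copy of `row` with the input columns packed into one list.

    Hypothetical helper that mimics the idea of VectorAssembler.
    """
    assembled = dict(row)  # the original columns are kept
    assembled[output_col] = [float(row[name]) for name in input_cols]
    return assembled


input_cols = ["variance", "skewness", "curtosis", "entropy"]
result = [assemble(row, input_cols, "Independent Features") for row in rows]
print(result[0]["Independent Features"])
```

Two things are worth noticing: the values appear in the vector in the same order as the names in `inputCols`, and the original columns are still present next to the new one.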

## Transforming the dataset

In this section we will transform our dataset, i.e. we will add the assembled feature column to the original dataset.

In [None]:
result = featassembler.transform(training_dataset)
result.show()

+--------+--------+--------+--------+------+--------------------+
|variance|skewness|curtosis| entropy|Target|Independent Features|
+--------+--------+--------+--------+------+--------------------+
|  3.6216|  8.6661| -2.8073|-0.44699|     0|[3.6216,8.6661,-2...|
|  4.5459|  8.1674| -2.4586| -1.4621|     0|[4.5459,8.1674,-2...|
|   3.866| -2.6383|  1.9242| 0.10645|     0|[3.866,-2.6383,1....|
|  3.4566|  9.5228| -4.0112| -3.5944|     0|[3.4566,9.5228,-4...|
| 0.32924| -4.4552|  4.5718| -0.9888|     0|[0.32924,-4.4552,...|
|  4.3684|  9.6718| -3.9606| -3.1625|     0|[4.3684,9.6718,-3...|
|  3.5912|  3.0129| 0.72888| 0.56421|     0|[3.5912,3.0129,0....|
|  2.0922|   -6.81|  8.4636|-0.60216|     0|[2.0922,-6.81,8.4...|
|  3.2032|  5.7588|-0.75345|-0.61251|     0|[3.2032,5.7588,-0...|
|  1.5356|  9.1772| -2.2718|-0.73535|     0|[1.5356,9.1772,-2...|
|  1.2247|  8.7779| -2.2135|-0.80647|     0|[1.2247,8.7779,-2...|
|  3.9899| -2.7066|  2.3946| 0.86291|     0|[3.9899,-2.7066,2...|
|  1.8993|

**Inference:** By calling **`transform()`** on the assembler object we have successfully added the independent features column as the last column of the dataset.

Our dataset should now hold one more column, the independent features column. Let's check that using the **columns** attribute of the variable that holds the resulting dataset.


In [None]:
result.columns

['variance',
 'skewness',
 'curtosis',
 'entropy',
 'Target',
 'Independent Features']

Yes it does! The new column is the last one in the dataset. But do we still need the other columns like *curtosis, variance, skewness and entropy*?

No, because their values are already contained in the column we created with the **Vector Assembler**. So in the end we should only keep 2 columns from the dataset:

1. **Independent features:** Holds all the features the machine learning algorithm is trained on.
2. **Target:** Holds the actual result, against which we will check our predictions.

So next we simply build a final dataset that consists of only these 2 columns.

In [None]:
final_data = result.select("Independent features", "Target")
final_data.show()

+--------------------+------+
|Independent features|Target|
+--------------------+------+
|[3.6216,8.6661,-2...|     0|
|[4.5459,8.1674,-2...|     0|
|[3.866,-2.6383,1....|     0|
|[3.4566,9.5228,-4...|     0|
|[0.32924,-4.4552,...|     0|
|[4.3684,9.6718,-3...|     0|
|[3.5912,3.0129,0....|     0|
|[2.0922,-6.81,8.4...|     0|
|[3.2032,5.7588,-0...|     0|
|[1.5356,9.1772,-2...|     0|
|[1.2247,8.7779,-2...|     0|
|[3.9899,-2.7066,2...|     0|
|[1.8993,7.6625,0....|     0|
|[-1.5768,10.843,2...|     0|
|[3.404,8.7261,-2....|     0|
|[4.6765,-3.3895,3...|     0|
|[2.6719,3.0646,0....|     0|
|[0.80355,2.8473,4...|     0|
|[1.4479,-4.8794,8...|     0|
|[5.2423,11.0272,-...|     0|
+--------------------+------+
only showing top 20 rows



**Inference:** With the help of **`select()`** we kept only the assembled feature column and the target column. Our dataset now has just 2 columns, and these are the only ones we care about from here on.

## Train Test Split

The **`Train Test split`** is one of the well-known steps in a machine learning pipeline. We divide the dataset into a training set and a testing set so that the model can be evaluated on data it has not seen. If we trained and evaluated the model on the whole dataset, we would have no way to notice **`overfitting of the model`**, hence we should always divide the data into a training and a testing set.

In PySpark we will be using the **`randomSplit()`** function to divide the data into training and testing set.

In [None]:
train_data, test_data = final_data.randomSplit([0.75, 0.25])

**Inference:** Here we split the data into roughly **75% training** and **25% testing** data using the **randomSplit()** function, and store the two parts in the **train_data** and **test_data** variables respectively. Because the split is random, the rows in each part (and the results further down) can differ from run to run.
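The plain-Python sketch below illustrates why the proportions are only approximate and how a seed makes the split repeatable. It is a simplified model of a weighted random split and makes no claim to match Spark's internals; `random_split` is a hypothetical helper written for this example.

```python
import random


def random_split(rows, weights, seed):
    """Assign every row to one split by drawing a uniform random number.

    Hypothetical helper: a simplified model of a weighted random split.
    """
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.75, 1.0] for the weights [0.75, 0.25].
    bounds, running = [], 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)  # a fixed seed gives the same split every time
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            if draw < bound:
                splits[index].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound.
            splits[-1].append(row)
    return splits


rows = list(range(1000))
train, test = random_split(rows, [0.75, 0.25], seed=42)
print(len(train), len(test))
```

In PySpark itself, `randomSplit()` also accepts an optional `seed` argument, for example `final_data.randomSplit([0.75, 0.25], seed=42)`, which is useful when you want the same split on every run.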

## Model building

Now that we have **split our dataset** and have our training set, it's time to **`build our model`** on the training data and then test the same model against the testing data. Strictly speaking the target is binary (0 or 1), which makes this a classification task. In this article we treat the label as a numeric value and fit the **`Linear Regression algorithm`** to walk through the MLlib workflow.

In [None]:
from pyspark.ml.regression import LinearRegression

model = LinearRegression(featuresCol = 'Independent features', labelCol='Target')
model = model.fit(train_data)

**Code breakdown:**

1. First we imported the **`LinearRegression`** algorithm from the **`pyspark.ml.regression`** package.
2. Then we defined our **independent features** and **target** columns by specifying **`featuresCol`** and **`labelCol`** respectively.
3. After defining the feature and target columns we **`fit`** the model on our training data.

## Coefficients and Intercept

Readers who know the **mathematical intuition of linear regression** will easily pick up what the coefficients and the intercept represent. For those who are not familiar with it, we will discuss it in a nutshell.

Equation of linear regression: **`y = c + b*x`**

Where:
* "y" is the dependent variable i.e. target variable.
* "x" is the independent variable i.e. features.
* "b" is the **slope** of the line and also known as **regression coefficient**
* "c" is the **intercept** which is also known as **constant**.

With more than one feature, as in our dataset, each feature gets its own coefficient: **`y = c + b1*x1 + b2*x2 + ...`**

**Coefficients**

In [None]:
model.coefficients

DenseVector([-0.142, -0.0786, -0.1014, 0.0])

**Inference:** The output gives us the regression coefficients as a **DenseVector**, with one coefficient per feature, in the order in which the features were assembled.

**Intercept**

In [None]:
model.intercept

0.7994252652531315

**Inference:** The output shows the intercept of our model. It is the value the model predicts for the target when all the feature variables are **zero**.
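To see how the coefficients and the intercept work together, here is a short plain-Python sketch that applies the linear regression equation by hand to the first row of the dataset. It assumes the rounded coefficient values exactly as they are displayed above, so the result is only approximate, and the function name `predict` is a hypothetical helper for this illustration.

```python
# Values copied from the outputs above. The coefficients are the rounded
# numbers displayed by the DenseVector, so the result is approximate.
coefficients = [-0.142, -0.0786, -0.1014, 0.0]
intercept = 0.7994252652531315


def predict(features):
    """Apply y = c + b1*x1 + b2*x2 + ... to one row of features."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))


# First row of the dataset: variance, skewness, curtosis, entropy (Target is 0).
first_row = [3.6216, 8.6661, -2.8073, -0.44699]
prediction = predict(first_row)
print(round(prediction, 4))
```

The result lies slightly below 0, which is close to the actual `Target` of 0 for this row. The entropy coefficient is displayed as 0.0, so in this rounded form that feature does not change the prediction.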

## Predicting the results

We have now come to the section where we see how our model performs after training. We call it the **Prediction section**.

In [None]:
prediction_result = model.evaluate(test_data)
prediction_result

<pyspark.ml.regression.LinearRegressionSummary at 0x7f45ec9551d0>

**Inference:** To check the model on unseen data we pass the testing data to the **evaluate()** method. It returns an **ml.regression.LinearRegressionSummary** object, which holds the predictions along with error metrics.

In [None]:
prediction_result.predictions.show()



+--------------------+------+------------------+
|Independent features|Target|        prediction|
+--------------------+------+------------------+
|[-6.651,6.7934,0....|     1|1.1404038856208492|
|[-6.5235,9.6014,-...|     1|0.9969175247975464|
|[-6.3364,9.2848,0...|     1|0.9680322282277627|
|[-6.1632,8.7096,-...|     1|1.0120857344568213|
|[-5.9034,6.5679,0...|     1|1.0529166027113548|
|[-5.525,6.3258,0....|     1|0.9957800288837861|
|[-5.3857,9.1214,-...|     1| 0.889829131214498|
|[-5.2406,6.6258,-...|     1|1.0430280860585202|
|[-5.2049,7.259,0....|     1|  0.96080342143579|
|[-5.119,6.6486,-0...|     1|  1.00885077845283|
|[-4.8392,6.6755,-...|     1|0.9865451974451206|
|[-4.7331,-6.1789,...|     1|0.8024320566742837|
|[-4.577,3.4515,0....|     1|1.1105526948591118|
|[-4.5046,-5.8126,...|     1| 0.792035762049575|
|[-4.4779,7.3708,-...|     1|0.8876080853643694|
|[-4.2932,3.3419,0...|     1|1.0681676832868743|
|[-4.2333,4.9166,-...|     1|1.0640529769769747|
|[-4.211,-12.4736,..

**Inference:** With the help of the `show()` method we can compare **how close the predicted values are to the actual values**. The returned DataFrame shows the actual `Target` and the `prediction` side by side for the rows of the testing data.
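Since the target only takes the values 0 and 1, one simple way to compare the two columns is to round each continuous prediction to the nearer class. The sketch below does this in plain Python for the first five rows of the output above. The cut-off of 0.5 is an assumption made for this illustration and is not something the model or the article prescribes.

```python
# (actual Target, prediction) pairs copied from the first rows of the output.
rows = [
    (1, 1.1404038856208492),
    (1, 0.9969175247975464),
    (1, 0.9680322282277627),
    (1, 1.0120857344568213),
    (1, 1.0529166027113548),
]

THRESHOLD = 0.5  # assumed cut-off between class 0 and class 1


def to_label(prediction):
    """Map a continuous prediction to the class label 0 or 1."""
    return 1 if prediction >= THRESHOLD else 0


labels = [to_label(prediction) for _, prediction in rows]
correct = sum(label == actual for label, (actual, _) in zip(labels, rows))
accuracy = correct / len(rows)
print(labels, accuracy)
```

For these five rows every prediction falls on the correct side of the cut-off. This is only a small sample, so it says nothing about the accuracy on the whole testing set.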

## Key takeaways from this article

1. First we completed the mandatory steps of starting the Spark session and reading the bank note dataset.
2. Then we used **Vector Assembler** to stack all our feature columns into one.
3. Then we **transformed the dataset** and kept only the assembled feature column and the target column.
4. Later we used the **randomSplit() method** to split our dataset into training and testing data.
5. Then we built our **linear regression model** using the **fit** method and looked at the coefficients and the intercept of the model.
6. Finally we looked at the **predictions** our model made on the testing data.

In [None]:
prediction_result.meanAbsoluteError, prediction_result.meanSquaredError

(0.13659314366111264, 0.03228711994476193)
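The two numbers above are the mean absolute error (MAE) and the mean squared error (MSE) of the model on the testing data. As a rough illustration of what these metrics measure, the plain-Python sketch below computes both by hand for the first five prediction rows shown earlier. Because it only uses five rows, its results will not match the values reported for the whole testing set.

```python
# (actual Target, prediction) pairs copied from the prediction output.
rows = [
    (1, 1.1404038856208492),
    (1, 0.9969175247975464),
    (1, 0.9680322282277627),
    (1, 1.0120857344568213),
    (1, 1.0529166027113548),
]

errors = [prediction - actual for actual, prediction in rows]

# MAE: average size of the error, ignoring its sign.
mae = sum(abs(error) for error in errors) / len(errors)

# MSE: average of the squared errors, which penalises large errors more.
mse = sum(error ** 2 for error in errors) / len(errors)

print(mae, mse)
```

For both metrics a lower value means the predictions are closer to the actual values.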