# Introduction to Linear Regression using MLlib

## Introduction

In this article we will learn about **linear regression using MLlib**, and everything will be hands-on: we will build an end-to-end linear regression model that predicts a **customer's yearly spend on the company's product**. The dataset is a dummy dataset, generated purely to help understand the concepts of **model building for continuous data** using **MLlib**.

In [None]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=b1421377ef53b43d724cfef41ed551f2f530eccd4473ea920ddc7a9d4ea468bc
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


## Mandatory Steps:

Before getting into the machine learning process and predicting the customer's yearly spend, we need to initialize the Spark session and read our dummy dataset from an e-commerce website, which has all the relevant features.

1. Initializing the Spark Session
2. Reading the dataset 

**Setting up the spark session**

In this section we will set up the Spark session object, which is the entry point for the DataFrame operations that Spark supports and manages.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('E-commerce').getOrCreate()

**Inference:** With the above two lines of code we have imported the **SparkSession** class from PySpark's sql package, named the application **"E-commerce"** through the **builder** attribute's **appName**() method, and then created the session using the **getOrCreate**() method, which returns an existing session if one is already running and creates a new one otherwise.

**Reading the dataset**

In this section we will **read the dummy dataset** which I've created, so that we can perform the **ML operations** along with **data preprocessing using PySpark**.

In [None]:
data = spark.read.csv("Ecommerce_Customers.csv",inferSchema=True,header=True)

**Inference:** In the above line of code we have read the e-commerce data with the **inferSchema** parameter set to **True**, so that Spark infers each column's data type from the values instead of reading everything as a string, and **header** set to **True**, so that the first row of the file is treated as the column names.
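
To make the effect of **header** and **inferSchema** more concrete, here is a minimal plain-Python sketch of the idea. It is only an illustration and not Spark's actual inference algorithm: the helper name infer_type, the trimmed four-column sample and the three type names are assumptions made for this example, while the values are copied from the dataset.

```python
import csv
import io

# A few records from the dataset, trimmed to four columns for brevity.
SAMPLE_CSV = """Email,Avatar,Time on App,Yearly Amount Spent
mstephenson@fernandez.com,Violet,12.65565114916675,587.9510539684005
hduke@hotmail.com,DarkGreen,11.109460728682564,392.2049334443264
pallen@yahoo.com,Bisque,11.330278057777512,487.54750486747207
"""


def infer_type(values):
    """Hypothetical helper: return the narrowest type that fits every value."""
    for type_name, caster in (("int", int), ("double", float)):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue
    return "string"


# DictReader uses the first line as column names, like header=True.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Inspect each column's values and pick a type, like inferSchema=True.
schema = {column: infer_type([row[column] for row in rows]) for column in rows[0]}
print(schema)
```

The two text columns come out as string and the two numeric columns as double, which is the same split we expect to see in the schema of the real DataFrame.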

**Showing the Schema of our dataset**

Here we print the **schema** of the dataset so that we can see what kind of data each column holds, which makes the later analysis more precise.

In [None]:
data.printSchema()

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)



**Inference:** We have used the **printSchema**() function to show information about each column in the dataset; the output lists each column's data type and whether it is nullable.

Now we will look at the dataset in three different ways, so that all the methods of inspecting it are covered.

1. show() function
2. head() function
3. Iterating through each item

First, the **show() function**, which by default prints the **top 20 rows** of the data.

In [None]:
data.show()

+--------------------+--------------------+----------------+------------------+------------------+------------------+--------------------+-------------------+
|               Email|             Address|          Avatar|Avg Session Length|       Time on App|   Time on Website|Length of Membership|Yearly Amount Spent|
+--------------------+--------------------+----------------+------------------+------------------+------------------+--------------------+-------------------+
|mstephenson@ferna...|835 Frank TunnelW...|          Violet| 34.49726772511229| 12.65565114916675| 39.57766801952616|  4.0826206329529615|  587.9510539684005|
|   hduke@hotmail.com|4547 Archer Commo...|       DarkGreen| 31.92627202636016|11.109460728682564|37.268958868297744|    2.66403418213262|  392.2049334443264|
|    pallen@yahoo.com|24645 Valerie Uni...|          Bisque|33.000914755642675|11.330278057777512|37.110597442120856|   4.104543202376424| 487.54750486747207|
|riverarebecca@gma...|1414 David Throug...|   

Next comes the head function, which is similar to the **head function used in pandas**, except that when called without arguments it returns only the first record. In the output of the code below we can see that head returned a **Row** object, which holds one complete **record/tuple**.

In [None]:
data.head()

Row(Email='mstephenson@fernandez.com', Address='835 Frank TunnelWrightmouth, MI 82180-9605', Avatar='Violet', Avg Session Length=34.49726772511229, Time on App=12.65565114916675, Time on Website=39.57766801952616, Length of Membership=4.0826206329529615, Yearly Amount Spent=587.9510539684005)

Now let's look at a more readable view of the same data: a Row object is **iterable**, so combining a for loop with the head function prints each value of the first **Row object** on its own line.

In [None]:
for item in data.head():
    print(item)

mstephenson@fernandez.com
835 Frank TunnelWrightmouth, MI 82180-9605
Violet
34.49726772511229
12.65565114916675
39.57766801952616
4.0826206329529615
587.9510539684005
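
Iterating over a Row yields only the values, so the column names are lost in the output above. The plain-Python sketch below pairs each value with its column name using zip. The tuple first_row stands in for the Row object (a Row behaves like a tuple), and the variable names are assumptions for this illustration; in PySpark itself, the Row's asDict() method gives a similar name-to-value mapping.

```python
# Column names as returned by data.columns.
columns = [
    "Email", "Address", "Avatar", "Avg Session Length",
    "Time on App", "Time on Website", "Length of Membership",
    "Yearly Amount Spent",
]

# Stand-in for data.head(): the values of the first record, in column order.
first_row = (
    "mstephenson@fernandez.com",
    "835 Frank TunnelWrightmouth, MI 82180-9605",
    "Violet",
    34.49726772511229,
    12.65565114916675,
    39.57766801952616,
    4.0826206329529615,
    587.9510539684005,
)

# Pair every value with its column name.
record = dict(zip(columns, first_row))

for name, value in record.items():
    print(f"{name}: {value}")
```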


## Importing Linear Regression library

As mentioned earlier, we are going to predict the customer's yearly expenditure on products, which means we are dealing with **continuous data**. For this type of target, a **linear regression** model is a natural starting point.

For that reason we will import the **LinearRegression** class from the **ML** library of **PySpark**.

In [None]:
from pyspark.ml.regression import LinearRegression

## Data preprocessing for Machine Learning

In this section we will perform the data preprocessing needed to make the dataset ready for the ML pipeline, so that the model can consume it directly.

We import **Vectors** and **VectorAssembler** so that we can easily separate the **features** column from the **label** column, i.e. all the independent columns will be stacked together into a single feature column, and the dependent column (the value we want to predict) will be the label column.

In [None]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

Let's have a look at which columns are present in our dataset

In [None]:
data.columns

['Email',
 'Address',
 'Avatar',
 'Avg Session Length',
 'Time on App',
 'Time on Website',
 'Length of Membership',
 'Yearly Amount Spent']

**Inference:** The above output lists all the column names as a Python list, but this does not give us enough information about which columns to select, so we will also use the **describe** method.

In [None]:
data.describe()

DataFrame[summary: string, Email: string, Address: string, Avatar: string, Avg Session Length: string, Time on App: string, Time on Website: string, Length of Membership: string, Yearly Amount Spent: string]

**Inference:** Called without **show**(), **describe**() only returns a new DataFrame of summary statistics, so the output above lists its columns rather than the statistics themselves (every summary column is reported as a string). Going back to the schema printed earlier, the columns whose data type is **string** (Email, Address and Avatar) **will have no role in the model development** phase here, because linear regression works on numerical values, so only the double columns are used as features.

Based on the above discussion, the columns selected to be part of the machine learning pipeline are as follows:
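
For a numeric column, the statistics reported by **describe** are the count, mean, standard deviation, minimum and maximum. The sketch below reproduces them with Python's statistics module for three "Yearly Amount Spent" values taken from the rows shown earlier. It assumes the reported standard deviation is the sample standard deviation (dividing by n - 1), and the names yearly_spent and summary are made up for this example.

```python
import statistics

# Three "Yearly Amount Spent" values from the first rows of the dataset.
yearly_spent = [587.9510539684005, 392.2049334443264, 487.54750486747207]

summary = {
    "count": len(yearly_spent),
    "mean": statistics.mean(yearly_spent),
    # Sample standard deviation (n - 1 in the denominator).
    "stddev": statistics.stdev(yearly_spent),
    "min": min(yearly_spent),
    "max": max(yearly_spent),
}

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```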

1. Average Session Length
2. Time on App
3. Time on Website
4. Length of Membership

In [None]:
assembler = VectorAssembler(
    inputCols=["Avg Session Length", "Time on App", 
               "Time on Website",'Length of Membership'],
    outputCol="features")

**Inference:** In the above code we used the **VectorAssembler** transformer to stack all our feature columns together into a single vector column, named "features" through the **outputCol** parameter.

In [None]:
output = assembler.transform(data)

Here, the **transform** function applies the assembler to the real data: it returns a new DataFrame that contains all the original columns plus the **features** column defined by the VectorAssembler. The original DataFrame itself is left unchanged.

In [None]:
output.select("features").show()

+--------------------+
|            features|
+--------------------+
|[34.4972677251122...|
|[31.9262720263601...|
|[33.0009147556426...|
|[34.3055566297555...|
|[33.3306725236463...|
|[33.8710378793419...|
|[32.0215955013870...|
|[32.7391429383803...|
|[33.9877728956856...|
|[31.9365486184489...|
|[33.9925727749537...|
|[33.8793608248049...|
|[29.5324289670579...|
|[33.1903340437226...|
|[32.3879758531538...|
|[30.7377203726281...|
|[32.1253868972878...|
|[32.3388993230671...|
|[32.1878120459321...|
|[32.6178560628234...|
+--------------------+
only showing top 20 rows



With the **select** function we selected only the **features** column from the dataset and displayed it as a DataFrame using the **show**() function.
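
Conceptually, assembling the features for one record just means picking the values of the input columns, in the given order, and packing them into one vector. The plain-Python sketch below shows this for the first record of the dataset. The function assemble is a hypothetical stand-in written for this illustration, not the VectorAssembler API, and a plain list is used in place of Spark's vector type.

```python
input_cols = [
    "Avg Session Length", "Time on App",
    "Time on Website", "Length of Membership",
]

# First record of the dataset as a column-name -> value mapping.
row = {
    "Email": "mstephenson@fernandez.com",
    "Address": "835 Frank TunnelWrightmouth, MI 82180-9605",
    "Avatar": "Violet",
    "Avg Session Length": 34.49726772511229,
    "Time on App": 12.65565114916675,
    "Time on Website": 39.57766801952616,
    "Length of Membership": 4.0826206329529615,
    "Yearly Amount Spent": 587.9510539684005,
}


def assemble(record, columns):
    """Hypothetical helper: stack the chosen columns into one feature list."""
    return [float(record[column]) for column in columns]


features = assemble(row, input_cols)
label = row["Yearly Amount Spent"]

print("features:", features)
print("label:   ", label)
```

The string columns are simply never picked up, and the order of the values in the vector follows the order of the input columns, which matters later when reading the model's coefficients.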

In [None]:
final_data = output.select("features",'Yearly Amount Spent')

With the above code we keep only the stacked independent features (the **features** column) and the **dependent** label column (**Yearly Amount Spent**), and name the result **final_data**. This DataFrame will be used for the rest of the process.

## Train Test Split

In this step of model building we will divide our data into a **training set and a testing set**. The training data is what the model is built on, while the testing data is held back to check how well the model performs on records it has not seen.

In MLlib, to divide the data into training and testing sets we use the DataFrame's **randomSplit**() function, which takes its input as a **list** of weights.

In [None]:
train_data,test_data = final_data.randomSplit([0.7,0.3])

**Inference:** With the help of tuple unpacking we have stored the training set (about **70**%) in train_data and the remaining **30**% or so of the dataset in test_data. Note that the weights are passed to the **randomSplit**() method as a list, and because the split is random the proportions are approximate.

In [None]:
train_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                349|
|   mean| 501.61565245873084|
| stddev|  79.05343373222998|
|    min|   266.086340948469|
|    max|  765.5184619388373|
+-------+-------------------+



In [None]:
test_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                151|
|   mean|  493.9944133854194|
| stddev|  79.92486453175475|
|    min| 256.67058229005585|
|    max|  744.2218671047146|
+-------+-------------------+



**Inference:** The **describe** method is a handy way to compare the training and testing data: we can see that the training set has **349** records while the testing set has **151**, and that the mean and standard deviation of the label are similar in both.
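
The counts are 349 and 151 rather than exactly 350 and 150 because a random split assigns each record independently. The plain-Python sketch below illustrates that behaviour for 500 records. It is a simplified model written for this article, not Spark's implementation, and the function name random_split and the seed value are assumptions. PySpark's randomSplit also accepts an optional seed argument if a reproducible split is needed.

```python
import random


def random_split(rows, weights, seed):
    """Hypothetical helper: assign each row to a part with probability ~ weight."""
    total = sum(weights)
    # Cumulative boundaries of the normalized weights, e.g. [0.7, 1.0].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)  # fixed seed -> the same split every run
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            if draw < bound:
                parts[index].append(row)
                break
        else:
            # Guard against floating-point rounding in the last boundary.
            parts[-1].append(row)
    return parts


rows = list(range(500))  # stand-ins for the 500 records
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Every record lands in exactly one part, the sizes are close to but usually not exactly 70/30, and re-running with the same seed gives the same split.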

## Model Development

Finally we have reached the step where we build our linear regression model. For that we use the **LinearRegression** class that we imported earlier, and pass the **"Yearly Amount Spent"** column to the **labelCol** parameter, since it is our **dependent** (target) column. The feature column does not need to be specified because the default value of **featuresCol** is "features", which matches the name we gave the assembled column.

In [None]:
lr = LinearRegression(labelCol='Yearly Amount Spent')

Now that we have created our **LinearRegression** object, we can fit it to our data, i.e. do the **model training**, by passing the training data to the **fit** method.

In [None]:
lrModel = lr.fit(train_data)

Now let's print the **coefficients** of each feature and the **intercept** of the model trained on the training dataset. The coefficients tell us how much the predicted yearly spend changes for a one-unit increase in each independent variable, holding the others fixed.

In [None]:
print("Coefficients: {} Intercept: {}".format(lrModel.coefficients,lrModel.intercept))

Coefficients: [25.324513354618116,38.880247333555445,0.20347373150823037,61.82593066961652] Intercept: -1031.8607952442187
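
To see how these numbers turn into a prediction, the worked example below applies the linear regression formula, prediction = intercept + sum of (coefficient x feature value), by hand in plain Python. The coefficients and intercept are copied from the output above, and the feature values are those of the first record of the dataset. Whether that record ended up in the training or the testing set depends on the random split, so this is only an illustration of the arithmetic.

```python
# Values printed by the trained model above.
coefficients = [25.324513354618116, 38.880247333555445,
                0.20347373150823037, 61.82593066961652]
intercept = -1031.8607952442187

# First record of the dataset, in the same order as the assembler's inputCols:
# Avg Session Length, Time on App, Time on Website, Length of Membership.
features = [34.49726772511229, 12.65565114916675,
            39.57766801952616, 4.0826206329529615]
actual = 587.9510539684005  # Yearly Amount Spent for that record

# Linear regression prediction: intercept plus the weighted sum of features.
prediction = intercept + sum(w * x for w, x in zip(coefficients, features))
residual = actual - prediction

print(f"prediction: {prediction:.2f}")
print(f"residual:   {residual:.2f}")
```

The prediction comes out at roughly 594 against an actual value of about 588, so the model is off by a few units for this customer.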


## Model Evaluation

In this step we will be **evaluating our model**, i.e. analyzing how well it performed. This is the stage of model building where we decide whether the existing model is good enough to carry forward to the **model deployment stage**.

For evaluation we call the model's **evaluate** function on the test data and store the result in the **test_results** variable, as we will use it for further analysis.

In [None]:
test_results = lrModel.evaluate(test_data)

Those who know the mathematics behind linear regression will be aware that **residual = Original result - Predicted result**, i.e. the difference between the original value of the label column and the result predicted by the model.

In [None]:
test_results.residuals.show()



+-------------------+
|          residuals|
+-------------------+
|   8.93880303335402|
| -6.031583754482824|
| -7.850241028538278|
| -9.098072078293853|
| -5.381642839990491|
| -0.193321226719263|
| -4.383507943842062|
| -6.621851079839303|
| -9.652981447308719|
|-19.370052426012762|
|  16.93656694674837|
| -5.377740911170633|
|  5.678080576919854|
| -2.432170378950275|
| -7.623945670593628|
|-12.779065649375752|
|  -4.78248762240969|
|-19.039173653509636|
|-10.188722409023455|
| -3.275297621379309|
+-------------------+
only showing top 20 rows
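
Individual residuals are easier to judge when they are condensed into a single error figure. The plain-Python sketch below computes the mean absolute error (MAE) and the root mean squared error (RMSE) from the 20 residuals displayed above. Since only the top 20 rows are shown, these figures describe that sample and not the whole test set; the summary object returned by evaluate exposes attributes such as rootMeanSquaredError and r2 for the full test data.

```python
import math

# The 20 residuals displayed above (actual minus predicted).
residuals = [
    8.93880303335402, -6.031583754482824, -7.850241028538278,
    -9.098072078293853, -5.381642839990491, -0.193321226719263,
    -4.383507943842062, -6.621851079839303, -9.652981447308719,
    -19.370052426012762, 16.93656694674837, -5.377740911170633,
    5.678080576919854, -2.432170378950275, -7.623945670593628,
    -12.779065649375752, -4.78248762240969, -19.039173653509636,
    -10.188722409023455, -3.275297621379309,
]

# Mean absolute error: the average size of a miss, ignoring its sign.
mae = sum(abs(r) for r in residuals) / len(residuals)

# Root mean squared error: penalizes large misses more heavily than MAE.
rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```

Both figures are small compared with the label's mean of roughly 500 reported by describe, which suggests the model fits this sample reasonably well.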



Now it's time to make predictions with our model. For that we will first store the unlabelled data, i.e. only the features column, and then transform it with the model so that a prediction is added for each record.

In [None]:
unlabeled_data = test_data.select('features')

In [None]:
predictions = lrModel.transform(unlabeled_data)
predictions.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[30.7377203726281...|452.84193916287586|
|[30.8364326747734...|473.53348418147243|
|[31.0613251567161...| 495.4056990864399|
|[31.1280900496166...| 566.3507588253485|
|[31.2681042107507...| 428.8521760138144|
|[31.3895854806643...|410.26293228670215|
|[31.5171218025062...| 280.3019285942278|
|[31.5257524169682...| 450.5874778897212|
|[31.5261978982398...| 418.7475076396465|
|[31.5702008293202...| 565.3155445674176|
|[31.6005122003032...|462.23628454434856|
|[31.6253601348306...| 381.7146416680948|
|[31.6548096756927...|469.58534315062866|
|[31.7216523605090...| 350.2090970108229|
|[31.7242025238451...| 511.0118329585541|
|[31.8093003166791...| 549.5509650122169|
|[31.8124825597242...| 397.5928326062069|
|[31.8164283341993...|  520.161665157166|
|[31.8279790554652...|  450.191469955965|
|[31.8627411090001...|  559.573438795426|
+--------------------+------------------+

**Inference:** From the above output we can see that it returned a DataFrame with two columns: one is the stacked **features** column and the other is the **prediction** column.

## Conclusion

To wrap up, let's summarize what we have learnt in this article. In a nutshell, we went through a complete machine learning pipeline for the linear regression algorithm.

1. We started the Spark session and read the dataset on top of which everything was performed.
2. Then we performed each data preprocessing step required to make the data ready for an ML algorithm to accept.
3. After data preprocessing we moved on to splitting the data and then to model building, where we built the linear regression model.
4. At the end we evaluated the model using the relevant functions and predicted the results.