![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/PySpark/5.PySpark_Regression.ipynb)

# **PySpark Tutorial-5 Regression**

## **Overview**


In this notebook, linear regression is performed for the Advertising dataset using PySpark.

### **LINEAR REGRESSION**

[Spark documentation: linear regression](https://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression)

Regression analysis is a family of statistical methods for estimating the relationship between variables. It is used to understand how a change in an independent variable (predictor) affects the dependent variable (response). In short:

* Regression analysis fits an equation that describes the relationship between one or more predictors and a response variable, and that can be used to estimate the response for given observations.

* The results indicate the direction, size, and statistical significance of the relationship between the predictors and the response. Depending on the technique, the dependent variable can be numerical or discrete.

Linear regression is the simplest regression technique used for predictive analysis. It takes a linear approach to modelling the relationship between the response and the predictors (also called explanatory variables), and it models the conditional distribution of the response given the values of the predictors.

**Y = bX + C**, where Y is the dependent variable and X is the independent variable. The equation describes the best-fitting straight line (the regression line), with b as the slope and C as the intercept.

[Source: Analytics Steps blog on regression techniques](https://www.analyticssteps.com/blogs/7-types-regression-technique-you-should-know-machine-learning)
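
To make the line equation concrete, here is a minimal pure-Python sketch that fits Y = bX + C by ordinary least squares to the first five (TV, sales) rows of the Advertising data used in this notebook. The helper name `fit_line` is hypothetical and is not part of PySpark; the notebook itself relies on `LinearRegression` from `pyspark.ml`.

```python
def fit_line(xs, ys):
    """Return slope b and intercept c of the least-squares line y = b*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    c = mean_y - b * mean_x
    return b, c

# First five rows of the Advertising data: TV budget and sales
tv = [230.1, 44.5, 17.2, 151.5, 180.8]
sales = [22.1, 10.4, 9.3, 18.5, 12.9]

b, c = fit_line(tv, sales)
# Residual = actual value minus the value predicted by the line
residuals = [y - (b * x + c) for x, y in zip(tv, sales)]
print(f"slope b = {b:.4f}, intercept C = {c:.4f}")
```

With an intercept in the model, the residuals of a least-squares fit sum to zero (up to floating-point error), which is a quick sanity check on the result.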


### **Install PySpark**

In [None]:
!pip install pyspark


### **Import Library**

> ###### Start a Spark session, read the CSV file, and explore the data



In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.appName('linear').getOrCreate()  ## start spark session

spark

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/PySpark/data/Advertising.csv

In [None]:
adv = spark.read.csv('Advertising.csv', inferSchema=True, header=True)
## read csv file 

## header – uses the first line as names of columns. If None is set, it uses the default value, false.
## inferSchema – infers the input schema automatically from data. It requires one extra pass over the data. If None is set, it uses the default value, false.

***Header:*** If the CSV file has a header (column names in the first row), set header=True. Spark then uses the first row as the DataFrame's column names. With header=False (the default), the DataFrame gets default column names: _c0, _c1, _c2, etc. Choose the value based on your input file.

***Schema:*** The schema referred to here is the set of column types. A column can be of type String, Double, Long, etc. With inferSchema=False (the default), every column is read as a string (StringType). Depending on what you want to do, strings may not work. For example, to add numbers from different columns, those columns must have a numeric type.
By setting inferSchema=True, Spark goes through the CSV file and infers the type of each column. This requires an extra pass over the file, so reading with inferSchema=True is slower. In return, the DataFrame will most likely have the correct schema for its input. [Source: Stack Overflow answer on inferSchema vs header](https://stackoverflow.com/questions/56927329/spark-option-inferschema-vs-header-true/56933052#:~:text=By%20setting%20inferSchema%3Dtrue%20%2C%20Spark,correct%20schema%20given%20its%20input.)
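
As a rough plain-Python analogy (this is not Spark code), the standard `csv` module behaves much like reading without schema inference: every field arrives as a string, so arithmetic needs an explicit numeric conversion first. The inline sample below reuses two rows and the column names shown in this notebook.

```python
import csv
import io

# Two rows shaped like the Advertising data (values taken from the notebook output)
sample = io.StringIO(
    "TV,radio,newspaper,sales\n"
    "230.1,37.8,69.2,22.1\n"
    "44.5,39.3,45.1,10.4\n"
)

rows = list(csv.DictReader(sample))  # the first line is used as column names
first = rows[0]

# Without type inference, every value is a string
as_text = first["TV"] + first["radio"]                  # string concatenation
as_number = float(first["TV"]) + float(first["radio"])  # numeric addition

print(type(first["TV"]).__name__, repr(as_text), as_number)
```

In Spark, `inferSchema=True` performs this kind of conversion for whole columns, which is why the columns in this notebook show up as `double`.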

In [None]:
adv.printSchema()   # show the name and type of every column in the dataset

# Prints out the schema in the tree format.

root
 |-- TV: double (nullable = true)
 |-- radio: double (nullable = true)
 |-- newspaper: double (nullable = true)
 |-- sales: double (nullable = true)



In [None]:
adv.head(5)    # getting first five rows from dataset

[Row(TV=230.1, radio=37.8, newspaper=69.2, sales=22.1),
 Row(TV=44.5, radio=39.3, newspaper=45.1, sales=10.4),
 Row(TV=17.2, radio=45.9, newspaper=69.3, sales=9.3),
 Row(TV=151.5, radio=41.3, newspaper=58.5, sales=18.5),
 Row(TV=180.8, radio=10.8, newspaper=58.4, sales=12.9)]

In [None]:
adv.show()   # getting first 20 rows from dataset ---- default 20

+-----+-----+---------+-----+
|   TV|radio|newspaper|sales|
+-----+-----+---------+-----+
|230.1| 37.8|     69.2| 22.1|
| 44.5| 39.3|     45.1| 10.4|
| 17.2| 45.9|     69.3|  9.3|
|151.5| 41.3|     58.5| 18.5|
|180.8| 10.8|     58.4| 12.9|
|  8.7| 48.9|     75.0|  7.2|
| 57.5| 32.8|     23.5| 11.8|
|120.2| 19.6|     11.6| 13.2|
|  8.6|  2.1|      1.0|  4.8|
|199.8|  2.6|     21.2| 10.6|
| 66.1|  5.8|     24.2|  8.6|
|214.7| 24.0|      4.0| 17.4|
| 23.8| 35.1|     65.9|  9.2|
| 97.5|  7.6|      7.2|  9.7|
|204.1| 32.9|     46.0| 19.0|
|195.4| 47.7|     52.9| 22.4|
| 67.8| 36.6|    114.0| 12.5|
|281.4| 39.6|     55.8| 24.4|
| 69.2| 20.5|     18.3| 11.3|
|147.3| 23.9|     19.1| 14.6|
+-----+-----+---------+-----+
only showing top 20 rows



##### **Import ML Libraries**

In [None]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [None]:
adv.columns  ## get the column names of the dataset

['TV', 'radio', 'newspaper', 'sales']

In [None]:
assembler = VectorAssembler(inputCols=['TV', 'radio', 'newspaper'], outputCol='features')   
## VectorAssembler = A feature transformer that merges multiple columns into a vector column.

In [None]:
output = assembler.transform(adv)

In [None]:
output.head(1)

[Row(TV=230.1, radio=37.8, newspaper=69.2, sales=22.1, features=DenseVector([230.1, 37.8, 69.2]))]

In [None]:
final_data = output.select('features', 'sales')

In [None]:
final_data.show()

+-----------------+-----+
|         features|sales|
+-----------------+-----+
|[230.1,37.8,69.2]| 22.1|
| [44.5,39.3,45.1]| 10.4|
| [17.2,45.9,69.3]|  9.3|
|[151.5,41.3,58.5]| 18.5|
|[180.8,10.8,58.4]| 12.9|
|  [8.7,48.9,75.0]|  7.2|
| [57.5,32.8,23.5]| 11.8|
|[120.2,19.6,11.6]| 13.2|
|    [8.6,2.1,1.0]|  4.8|
| [199.8,2.6,21.2]| 10.6|
|  [66.1,5.8,24.2]|  8.6|
| [214.7,24.0,4.0]| 17.4|
| [23.8,35.1,65.9]|  9.2|
|   [97.5,7.6,7.2]|  9.7|
|[204.1,32.9,46.0]| 19.0|
|[195.4,47.7,52.9]| 22.4|
|[67.8,36.6,114.0]| 12.5|
|[281.4,39.6,55.8]| 24.4|
| [69.2,20.5,18.3]| 11.3|
|[147.3,23.9,19.1]| 14.6|
+-----------------+-----+
only showing top 20 rows



In [None]:
final_data.describe().show() # describe: Computes basic statistics for numeric and string columns.

+-------+------------------+
|summary|             sales|
+-------+------------------+
|  count|               200|
|   mean|14.022500000000003|
| stddev| 5.217456565710477|
|    min|               1.6|
|    max|              27.0|
+-------+------------------+



#### **Train Test Split**

In [None]:
train_data1, test_data1 = final_data.randomSplit([0.7,0.3])  

## split data to train and test for machine learning

In [None]:
train_data1.describe().show()

+-------+------------------+
|summary|             sales|
+-------+------------------+
|  count|               133|
|   mean|13.846616541353391|
| stddev| 5.239311614396937|
|    min|               1.6|
|    max|              26.2|
+-------+------------------+



In [None]:
test_data1.describe().show()

+-------+------------------+
|summary|             sales|
+-------+------------------+
|  count|                67|
|   mean|14.371641791044775|
| stddev| 5.195301081197088|
|    min|               4.8|
|    max|              27.0|
+-------+------------------+
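
The split sizes above are 133 and 67 rather than exactly 140 and 60. That is expected: `randomSplit` treats the weights as probabilities, so each row is assigned at random and the counts vary from run to run unless a `seed` is passed. The pure-Python sketch below imitates that idea with a hypothetical `random_split` helper; it is a conceptual illustration under that assumption, not Spark's implementation.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to a split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point rounding
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits

train, test = random_split(range(200), [0.7, 0.3], seed=42)
print(len(train), len(test))  # the two sizes always add up to 200
```

Passing the same seed reproduces the same split, which is useful when the reported metrics need to be comparable between runs.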



#### **Linear regression**

In [None]:
from pyspark.ml.regression import LinearRegression

In [None]:
lr = LinearRegression(labelCol='sales')

In [None]:
lr_model = lr.fit(train_data1)

In [None]:
test_results_new = lr_model.evaluate(test_data1)

In [None]:
test_results_new.residuals.show()



+--------------------+
|           residuals|
+--------------------+
|  1.2827466322635814|
|  1.6853048458467361|
|  -2.801278949614172|
| 0.08078926532971664|
| -1.0337119307416138|
|   2.417937615312418|
|  1.0835103841096299|
|  1.1617021062303312|
|   2.076519487927907|
|   0.601843699307226|
|  1.7633542585985538|
|  1.5162876241525858|
|   1.501816135378677|
|   1.278209833603169|
|  0.8305184242427952|
|  0.7178826100144313|
|-0.02819732122089391|
|   1.473227975240258|
|  1.0999621864904263|
|  1.2455163303850032|
+--------------------+
only showing top 20 rows



In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
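# Collect predictions to pandas (the test set is small) and plot residuals against them
pred_pdf = test_results_new.predictions.select('prediction', 'sales').toPandas()
plt.figure(figsize = (10,6))
sns.scatterplot(x = pred_pdf['prediction'], y = pred_pdf['sales'] - pred_pdf['prediction'])
plt.axhline(y = 0, color ="r", linestyle = "--")  # zero line: perfect predictions
plt.xlabel('predicted sales')
plt.ylabel('residual (actual - predicted)')
plt.show()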

In [None]:
test_results_new.rootMeanSquaredError

# Root mean square error (RMSE) is a method of measuring the difference between values predicted by a model and their actual values.

1.74814617038268
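
As a small illustration of the formula, RMSE is the square root of the mean of the squared residuals (actual minus predicted). The sketch below applies it to the first five residuals printed earlier only; the value reported by `rootMeanSquaredError` uses every row of the test set, so the two numbers are not expected to match.

```python
import math

# First five residuals from the residuals table (actual - predicted)
residuals = [1.2827466322635814, 1.6853048458467361, -2.801278949614172,
             0.08078926532971664, -1.0337119307416138]

mse = sum(r ** 2 for r in residuals) / len(residuals)  # mean squared error
rmse = math.sqrt(mse)                                   # back in the units of sales
print(f"RMSE over these five rows: {rmse:.4f}")
```

Because the residuals are squared before averaging, large errors weigh more heavily than small ones.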

In [None]:
test_results_new.r2

# R2: the proportion of the variance in the dependent variable that the independent variables explain collectively.
# The R2 score of the model trained here is ~0.88, which indicates a good fit on the test data.
# An R2 of 1 means the predictions match the actual values exactly; an R2 of 0 means the model does no better than always predicting the mean.
# The closer R2 is to 1, the more of the variance in sales the model explains.

0.8850616624704153
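
The definition can be checked by hand: R2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the actual values from their mean. In the sketch below, the `r2_score` helper and the prediction lists are hypothetical; the actual values are the first five sales figures from the dataset.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

sales = [22.1, 10.4, 9.3, 18.5, 12.9]
mean_sales = sum(sales) / len(sales)

perfect = r2_score(sales, sales)                         # predictions equal the actual values
baseline = r2_score(sales, [mean_sales] * len(sales))    # always predict the mean
worse = r2_score(sales, [s + 10 for s in sales])         # systematically off by 10
print(perfect, baseline, worse)
```

A negative value, as in the last case, means the predictions fit worse than always predicting the mean, which can happen when a model is evaluated on unseen data.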