# Expedition to Data Science and Machine Learning
## Module 4: Machine Learning with Python
### Lecture 2: Supervised Learning: Linear Regression

Instructor: Md Shahidullah Kawsar
<br>Data Scientist, IDARE, Houston, TX, USA

#### Objectives:
- Supervised Learning: Linear Regression
- train data, test data
- Understanding the equation of a straight line
- feature coefficient (slope, gradient, m)
- bias coefficient (y-intercept, c)
- domain: x-axis, independent variable
- range: y-axis, dependent variable
- loss function, cost function, objective function, error function
- bias-variance tradeoff, overfitting, underfitting
- ordinary least squares method
- gradient descent method
- residual, error, squared error, RMSE - Root Mean Squared Error

#### References:
[1] A Gentle Introduction to Machine Learning: https://www.youtube.com/watch?v=Gv9_4yMHFhI&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&ab_channel=StatQuestwithJoshStarmer
<br>[2] Linear Regression, Clearly Explained!!!: https://www.youtube.com/watch?v=nk2CQITm_eo&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=10&ab_channel=StatQuestwithJoshStarmer
<br>[3] Linear Regression scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
<br>[4] Data Splitting: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
<br>[5] Mean Squared Error: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
<br>[6] RMSE calculation: https://www.youtube.com/watch?v=zMFdb__sUpw&ab_channel=KhanAcademy
<br>[7] Regression coefficients: https://statisticsbyjim.com/glossary/regression-coefficient/
<br>[8] Machine Learning Quiz 01: Linear Regression https://kawsar34.medium.com/machine-learning-quiz-01-a2fac2712a55
<br>[9] Linear Regression Assumptions: https://www.statology.org/linear-regression-assumptions/
<br>[10] Constant Variance: https://stats.stackexchange.com/questions/52089/what-does-having-constant-variance-in-a-linear-regression-model-mean
<br>[11] Multiple Regression: https://www.youtube.com/watch?v=zITIFTsivN8&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=11&ab_channel=StatQuestwithJoshStarmer
<br>[12] Linear Regression Simplified - Ordinary Least Square vs Gradient Descent: https://towardsdatascience.com/linear-regression-simplified-ordinary-least-square-vs-gradient-descent-48145de2cf76

#### Terminologies:

- equation of a straight line: y=mx+c
<br> Straight lines: https://github.com/SKawsar/Data_Visualization_with_Python/blob/main/Lecture_4.ipynb
- feature coefficient (slope, gradient, m)
- bias coefficient (y-intercept, c)
- domain: x-axis, independent variable
- range: y-axis, dependent variable
- loss function, cost function, objective function, error function
- bias-variance tradeoff, overfitting, underfitting
- ordinary least squares method
- gradient descent method
- residual, error, squared error
- train data, test data
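
To make several of these terms concrete, here is a minimal sketch that fits a straight line y = mx + c in two ways: with the ordinary least squares closed form and with gradient descent on the mean squared error. The five data points, the learning rate, and the number of iterations are made-up values chosen for illustration; they are not part of the lecture's dataset.

```python
import numpy as np

# Toy data lying roughly on y = 2x + 1 (illustrative values only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Ordinary least squares: closed-form slope (m) and intercept (c)
x_mean, y_mean = x.mean(), y.mean()
m_ols = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
c_ols = y_mean - m_ols * x_mean

# Gradient descent: start at m = 0, c = 0 and step downhill on the MSE loss
m_gd, c_gd = 0.0, 0.0
learning_rate = 0.05  # assumed step size, small enough to converge here
for _ in range(5000):
    residual = y - (m_gd * x + c_gd)       # actual - predicted
    grad_m = -2.0 * np.mean(residual * x)  # derivative of MSE w.r.t. m
    grad_c = -2.0 * np.mean(residual)      # derivative of MSE w.r.t. c
    m_gd -= learning_rate * grad_m
    c_gd -= learning_rate * grad_c

# RMSE of the OLS line on the same points
rmse = np.sqrt(np.mean((y - (m_ols * x + c_ols)) ** 2))

print("OLS:              m =", m_ols, " c =", c_ols)
print("Gradient descent: m =", m_gd, " c =", c_gd)
print("RMSE =", rmse)
```

Both methods minimize the same loss function, so on this small problem they should land on essentially the same slope and intercept; OLS gets there in one step, gradient descent iteratively.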


#### Import required Libraries

In [108]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

#### Load data

In [109]:
df = pd.read_csv("vw.csv")

display(df.head())
print(df.shape)

| | model | year | price | transmission | mileage | fuelType | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|
| 0 | T-Roc | 2019 | 25000 | Automatic | 13904 | Diesel | 49.6 | 2.0 |
| 1 | T-Roc | 2019 | 26883 | Automatic | 4562 | Diesel | 49.6 | 2.0 |
| 2 | T-Roc | 2019 | 20000 | Manual | 7414 | Diesel | 50.4 | 2.0 |
| 3 | T-Roc | 2019 | 33492 | Automatic | 4825 | Petrol | 32.5 | 2.0 |
| 4 | T-Roc | 2019 | 22900 | Semi-Auto | 6500 | Petrol | 39.8 | 1.5 |


(15157, 8)


In [110]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15157 entries, 0 to 15156
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         15157 non-null  object 
 1   year          15157 non-null  int64  
 2   price         15157 non-null  int64  
 3   transmission  15157 non-null  object 
 4   mileage       15157 non-null  int64  
 5   fuelType      15157 non-null  object 
 6   mpg           15157 non-null  float64
 7   engineSize    15157 non-null  float64
dtypes: float64(2), int64(3), object(3)
memory usage: 947.4+ KB


In [111]:
df.describe()

| | year | price | mileage | mpg | engineSize |
|---|---|---|---|---|---|
| count | 15157.0 | 15157.0 | 15157.0 | 15157.0 | 15157.0 |
| mean | 2017.255789 | 16838.952365 | 22092.785644 | 53.753355 | 1.600693 |
| std | 2.053059 | 7755.015206 | 21148.941635 | 13.642182 | 0.461695 |
| min | 2000.0 | 899.0 | 1.0 | 0.3 | 0.0 |
| 25% | 2016.0 | 10990.0 | 5962.0 | 46.3 | 1.2 |
| 50% | 2017.0 | 15497.0 | 16393.0 | 53.3 | 1.6 |
| 75% | 2019.0 | 20998.0 | 31824.0 | 60.1 | 2.0 |
| max | 2020.0 | 69994.0 | 212000.0 | 188.3 | 3.2 |


In [112]:
120000*-0.08

-9600.0

In [113]:
df["model"].value_counts()

 Golf               4863
 Polo               3287
 Tiguan             1765
 Passat              915
 Up                  884
 T-Roc               733
 Touareg             363
 Touran              352
 T-Cross             300
 Golf SV             268
 Sharan              260
 Arteon              248
 Scirocco            242
 Amarok              111
 Caravelle           101
 CC                   95
 Tiguan Allspace      91
 Beetle               83
 Shuttle              61
 Caddy Maxi Life      59
 Jetta                32
 California           15
 Caddy Life            8
 Eos                   7
 Caddy                 6
 Caddy Maxi            4
 Fox                   4
Name: model, dtype: int64

In [114]:
df.columns

Index(['model', 'year', 'price', 'transmission', 'mileage', 'fuelType', 'mpg',
       'engineSize'],
      dtype='object')

In [115]:
df["transmission"].value_counts()

Manual       9417
Semi-Auto    3780
Automatic    1960
Name: transmission, dtype: int64

In [116]:
df["fuelType"].value_counts()

Petrol    8553
Diesel    6372
Hybrid     145
Other       87
Name: fuelType, dtype: int64

#### Correlation plot: 
https://github.com/SKawsar/Data_Analysis_with_Python/blob/main/Lecture_8.ipynb

#### Separating the features and target variable

In [117]:
features = ['year', 'mpg', 'engineSize']
target = ['price']

X = df[features]
y = df[target]

print("Shape of X = ", X.shape)
print("Shape of y = ", y.shape)

Shape of X =  (15157, 3)
Shape of y =  (15157, 1)


#### Create train and test set

In [118]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    test_size=0.2, 
                                                    random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(12125, 3) (3032, 3) (12125, 1) (3032, 1)
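
The row counts above follow directly from `test_size=0.2`. The sketch below reproduces them by hand, assuming (as the printed shapes suggest) that the test share is rounded up to a whole row and the remainder goes to the train set.

```python
import math

n_samples = 15157  # rows in the dataset, from df.shape
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Fixing `random_state` does not change these counts; it only makes the choice of which rows land in each set reproducible.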


#### Linear Regression

In [119]:
model = LinearRegression()
model = model.fit(X_train, y_train)

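With three features, the fitted model has the form y = m1*x1 + m2*x2 + m3*x3 + c, where x1, x2, x3 are year, mpg, and engineSize, m1, m2, m3 are the feature coefficients, and c is the bias coefficient (intercept).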

In [120]:
coefficients = model.coef_
print(coefficients)

[[2114.02170162 -106.30742038 8759.3771598 ]]


In [121]:
c = model.intercept_
print(c)

[-4255990.43023728]


In [122]:
X.columns

Index(['year', 'mpg', 'engineSize'], dtype='object')

In [123]:
coef_df = pd.DataFrame({"features": X.columns, 
                       "coefficients": np.squeeze(coefficients)})

display(coef_df)

| | features | coefficients |
|---|---|---|
| 0 | year | 2114.021702 |
| 1 | mpg | -106.30742 |
| 2 | engineSize | 8759.37716 |


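- A positive coefficient indicates that as the feature increases, the predicted target also increases, holding the other features constant. Here, each additional model year adds about 2,114 to the predicted price.
- A negative coefficient indicates that as the feature increases, the predicted target decreases, holding the other features constant. Here, each additional unit of mpg lowers the predicted price by about 106.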
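
To see how the coefficients and the intercept combine into a prediction, the sketch below applies the fitted equation by hand to the first row shown by `df.head()` (a 2019 T-Roc with 49.6 mpg and a 2.0 engine, listed at 25000). The numbers are copied from the printed outputs; the variable names are only illustrative, and whether this particular row ended up in the train or the test set is not known.

```python
# Coefficients and intercept copied from the printed model outputs
coefficients = {"year": 2114.02170162, "mpg": -106.30742038, "engineSize": 8759.3771598}
intercept = -4255990.43023728

# Feature values and actual price of the first row in df.head()
car = {"year": 2019, "mpg": 49.6, "engineSize": 2.0}
actual_price = 25000

# y = m1*x1 + m2*x2 + m3*x3 + c
predicted_price = intercept + sum(coefficients[name] * car[name] for name in car)

# residual = actual value - predicted value
residual = actual_price - predicted_price

print(f"predicted = {predicted_price:.2f}")
print(f"residual  = {residual:.2f}")
```

The prediction comes out at roughly 24,465, so the residual for this car is positive: the model under-predicts its price by a few hundred.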

#### Prediction

In [124]:
y_pred = model.predict(X_test)
print(y_pred)

[[13304.86271129]
 [24646.01422239]
 [18355.60686375]
 ...
 [ 8375.19033353]
 [ 5771.45345778]
 [ 6261.1686319 ]]


In [125]:
print(y_test)

       price
7342   14450
10328  23950
14992  10495
8466    9990
10347  21998
...      ...
8211   17250
8401   10450
9810   10290
7872    7499
9399    7290

[3032 rows x 1 columns]


In [126]:
# residual = actual value - predicted value (positive when the model under-predicts)

#### Prediction Error

In [127]:
MAE = mean_absolute_error(y_test, y_pred)
print("MAE = ", MAE)

MSE = mean_squared_error(y_test, y_pred)
print("MSE = ", MSE)

# RMSE is the square root of MSE; np.sqrt avoids the `squared` argument,
# which newer scikit-learn releases no longer accept
RMSE = np.sqrt(MSE)
print("RMSE = ", RMSE)

r2 = r2_score(y_test, y_pred)
print("r_squared = ", r2)

MAE =  2681.5051965262132
MSE =  16459163.341451917
RMSE =  4056.9894431033363
r_squared =  0.7226377192887656
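
For intuition about what these four numbers measure, the following sketch recomputes the same metrics by hand with NumPy on just the first three test rows, using the actual prices and predictions printed earlier. Three rows are far too few to say anything about the model, so the results will not match the full-test-set values above; the point is only to show the formulas.

```python
import numpy as np

# First three actual prices (y_test) and predictions (y_pred) from the outputs
y_true = np.array([14450.0, 23950.0, 10495.0])
y_hat = np.array([13304.86271129, 24646.01422239, 18355.60686375])

residuals = y_true - y_hat  # actual - predicted

mae = np.mean(np.abs(residuals))  # Mean Absolute Error
mse = np.mean(residuals ** 2)     # Mean Squared Error
rmse = np.sqrt(mse)               # Root Mean Squared Error

# R squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MAE  =", mae)
print("MSE  =", mse)
print("RMSE =", rmse)
print("R^2  =", r2)
```

MAE and RMSE are in the same units as the target (price), which makes them easier to interpret than MSE. RMSE is never smaller than MAE because squaring weights large residuals more heavily, as the third row does here.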


In [128]:
# mean_absolute_error, r2_score