# Expedition to Data Science and Machine Learning
## Module 4: Machine Learning with Python
### Lecture 2: Supervised Learning: Linear Regression

Instructor: Md Shahidullah Kawsar
<br>Data Scientist, IDARE, Houston, TX, USA

#### Objectives:
- Supervised Learning: Linear Regression
- train data, test data
- Understanding the equation of a straight line
- feature coefficient (slope, gradient, m)
- bias coefficient (y-intercept, c)
- domain: x-axis, independent variable
- range: y-axis, dependent variable
- loss function, cost function, objective function, error function
- bias-variance tradeoff, overfitting, underfitting
- ordinary least squares method
- gradient descent method
- residual, error, squared error, RMSE - Root Mean Squared Error

#### References:
[1] A Gentle Introduction to Machine Learning: https://www.youtube.com/watch?v=Gv9_4yMHFhI&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&ab_channel=StatQuestwithJoshStarmer
<br>[2] Linear Regression, Clearly Explained!!!: https://www.youtube.com/watch?v=nk2CQITm_eo&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=10&ab_channel=StatQuestwithJoshStarmer
<br>[3] Linear Regression scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
<br>[4] Data Splitting: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
<br>[5] Mean Squared Error: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
<br>[6] RMSE calculation: https://www.youtube.com/watch?v=zMFdb__sUpw&ab_channel=KhanAcademy
<br>[7] Regression coefficients: https://statisticsbyjim.com/glossary/regression-coefficient/
<br>[8] Machine Learning Quiz 01: Linear Regression https://kawsar34.medium.com/machine-learning-quiz-01-a2fac2712a55
<br>[9] Linear Regression Assumptions: https://www.statology.org/linear-regression-assumptions/
<br>[10] Constant Variance: https://stats.stackexchange.com/questions/52089/what-does-having-constant-variance-in-a-linear-regression-model-mean
<br>[11] Multiple Regression: https://www.youtube.com/watch?v=zITIFTsivN8&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=11&ab_channel=StatQuestwithJoshStarmer
<br>[12] Linear Regression Simplified - Ordinary Least Square vs Gradient Descent: https://towardsdatascience.com/linear-regression-simplified-ordinary-least-square-vs-gradient-descent-48145de2cf76

#### Terminologies:

- equation of a straight line: y=mx+c
<br> Straight lines: https://github.com/SKawsar/Data_Visualization_with_Python/blob/main/Lecture_4.ipynb
- feature coefficient (slope, gradient, m)
- bias coefficient (y-intercept, c)
- domain: x-axis, independent variable
- range: y-axis, dependent variable
- loss function, cost function, objective function, error function
- bias-variance tradeoff, overfitting, underfitting
- ordinary least squares method
- gradient descent method
- residual, error, squared error
- train data, test data
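
Several of the terms above (slope, intercept, residual, ordinary least squares, gradient descent) can be seen together in one small example. The sketch below is only an illustration on made-up points that lie exactly on y = 2x + 1; the data, the learning rate and the iteration count are assumptions chosen for the demo, not values from this lecture. It fits the line once with the closed-form least squares formulas and once with gradient descent on the mean squared error, and both should arrive at roughly the same m and c.

```python
# Toy data lying exactly on y = 2x + 1 (made-up values for illustration)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
n = len(xs)

# --- Ordinary least squares: closed-form slope (m) and intercept (c) ---
x_mean = sum(xs) / n
y_mean = sum(ys) / n
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)
m_ols = s_xy / s_xx
c_ols = y_mean - m_ols * x_mean

# --- Gradient descent: minimise the mean squared error step by step ---
m_gd, c_gd = 0.0, 0.0      # start from a flat line through the origin
learning_rate = 0.05       # assumed step size, small enough to converge here
for _ in range(5000):
    # residual = actual value - predicted value
    residuals = [y - (m_gd * x + c_gd) for x, y in zip(xs, ys)]
    # partial derivatives of the MSE with respect to m and c
    grad_m = -2 / n * sum(r * x for r, x in zip(residuals, xs))
    grad_c = -2 / n * sum(residuals)
    m_gd -= learning_rate * grad_m
    c_gd -= learning_rate * grad_c

print("OLS:             ", m_ols, c_ols)
print("Gradient descent:", round(m_gd, 4), round(c_gd, 4))
```

The closed form gives the answer in one step, while gradient descent approaches it iteratively; with more features or very large datasets the iterative route is often the more practical one.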


#### Import required Libraries

In [27]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import pandas as pd

#### Load data

In [28]:
df = pd.read_csv("bmw.csv")

display(df.head(10))
print(df.shape)

| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 Series | 2014 | 11200 | Automatic | 67068 | Diesel | 125 | 57.6 | 2.0 |
| 1 | 6 Series | 2018 | 27000 | Automatic | 14827 | Petrol | 145 | 42.8 | 2.0 |
| 2 | 5 Series | 2016 | 16000 | Automatic | 62794 | Diesel | 160 | 51.4 | 3.0 |
| 3 | 1 Series | 2017 | 12750 | Automatic | 26676 | Diesel | 145 | 72.4 | 1.5 |
| 4 | 7 Series | 2014 | 14500 | Automatic | 39554 | Diesel | 160 | 50.4 | 3.0 |
| 5 | 5 Series | 2016 | 14900 | Automatic | 35309 | Diesel | 125 | 60.1 | 2.0 |
| 6 | 5 Series | 2017 | 16000 | Automatic | 38538 | Diesel | 125 | 60.1 | 2.0 |
| 7 | 2 Series | 2018 | 16250 | Manual | 10401 | Petrol | 145 | 52.3 | 1.5 |
| 8 | 4 Series | 2017 | 14250 | Manual | 42668 | Diesel | 30 | 62.8 | 2.0 |
| 9 | 5 Series | 2016 | 14250 | Automatic | 36099 | Diesel | 20 | 68.9 | 2.0 |


(10781, 9)


#### Correlation plot: 
https://github.com/SKawsar/Data_Analysis_with_Python/blob/main/Lecture_8.ipynb

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10781 entries, 0 to 10780
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         10781 non-null  object 
 1   year          10781 non-null  int64  
 2   price         10781 non-null  int64  
 3   transmission  10781 non-null  object 
 4   mileage       10781 non-null  int64  
 5   fuelType      10781 non-null  object 
 6   tax           10781 non-null  int64  
 7   mpg           10781 non-null  float64
 8   engineSize    10781 non-null  float64
dtypes: float64(2), int64(4), object(3)
memory usage: 758.2+ KB


In [30]:
print(df.columns)

Index(['model', 'year', 'price', 'transmission', 'mileage', 'fuelType', 'tax',
       'mpg', 'engineSize'],
      dtype='object')


#### Separating the features and target variable

In [31]:
# only numeric features
# features = ['mileage']
# features = ['mileage', 'year']
# features = ['mileage', 'year', 'tax']
# features = ['mileage', 'year', 'tax', 'mpg']
features = ['mileage', 'year', 'tax', 'mpg', 'engineSize']
target = ['price']

X = df[features]
y = df[target]

print(X.shape, y.shape)

(10781, 5) (10781, 1)


#### Create train and test set

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(8624, 5) (2157, 5) (8624, 1) (2157, 1)
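
The row counts in the shapes above follow from `test_size=0.2`. As a small arithmetic check, assuming (as scikit-learn appears to do for a fractional `test_size`) that the test count is rounded up and the remaining rows go to the training set:

```python
import math

n_samples = 10781   # number of rows, taken from df.shape above
test_size = 0.2     # fraction passed to train_test_split

# Assumption: a fractional test_size is rounded up to a whole number of rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the train and test sets on every run.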


#### Linear Regression

In [33]:
model = LinearRegression()
model = model.fit(X_train, y_train)
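
After fitting, the learned feature coefficients and the bias coefficient are stored on the model as `coef_` and `intercept_`. The sketch below uses synthetic data rather than the BMW data, so every name in it (`X_demo`, `y_demo`, `demo_model`) is hypothetical; it generates a target from a known linear equation, checks that the fit recovers it, and shows that `predict` is just that equation applied to each row.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (not the BMW data): y = 3*x1 - 2*x2 + 5, with no noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5

demo_model = LinearRegression().fit(X_demo, y_demo)

print(demo_model.coef_)       # one feature coefficient (slope) per column
print(demo_model.intercept_)  # bias coefficient (y-intercept)

# predict() evaluates the fitted linear equation for every row
manual = X_demo @ demo_model.coef_ + demo_model.intercept_
print(np.allclose(manual, demo_model.predict(X_demo)))
```

In this notebook `y_train` is a one-column DataFrame, so `model.coef_` would have shape `(1, 5)` and `model.intercept_` shape `(1,)` rather than the flat shapes in the sketch.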

#### Prediction

In [34]:
y_pred = model.predict(X_test)

In [35]:
print(y_pred)

[[16398.89900556]
 [11476.30940459]
 [36842.58091838]
 ...
 [27366.84175204]
 [28603.79956238]
 [37725.87896555]]


In [36]:
print(y_test)

       price
8728   15300
761    15495
7209   39875
6685   21730
8548   13799
...      ...
10677  12000
8418   11759
1702   21460
6965   52991
3125   52490

[2157 rows x 1 columns]


#### Prediction Error

In [37]:
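# The `squared=False` argument is deprecated in recent scikit-learn releases,
# so take the square root of the mean squared error explicitly.
RMSE = mean_squared_error(y_test, y_pred) ** 0.5
print(RMSE)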

6801.826076944502
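
To make the chain residual, squared error, mean squared error, RMSE concrete, the sketch below repeats the calculation by hand for the first three test rows printed above. It is only an illustration: three rows are far too few to say anything about the model, and the result will differ from the full test-set RMSE.

```python
import math

# First three rows from the printed outputs above (same row order)
y_true = [15300, 15495, 39875]                               # from y_test
y_hat = [16398.89900556, 11476.30940459, 36842.58091838]     # from y_pred

# residual (error) = actual price - predicted price
residuals = [t - p for t, p in zip(y_true, y_hat)]

# square each residual so positive and negative errors do not cancel
squared_errors = [r ** 2 for r in residuals]

# mean squared error, then its square root to get back to price units
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

print(residuals)
print(rmse)
```

Because of the square root, RMSE is expressed in the same unit as the target, so the value above can be read directly as a typical prediction error in price.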
