<a href="https://colab.research.google.com/github/kokchun/Machine-learning-AI22/blob/main/Lecture_code/L2-scikit-learn.ipynb" target="_parent"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> &nbsp; for interacting with the code


---
# Lecture notes - Scikit-learn intro
---

This is the lecture note for **scikit-learn**.

<p class = "alert alert-info" role="alert"><b>Note</b> that this lecture note gives a brief introduction to scikit-learn. I encourage you to read further about scikit-learn. </p>

Read more

- [scikit-learn](https://scikit-learn.org/stable/)
- [train-test split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train%20test#sklearn.model_selection.train_test_split)
- [scaling data](https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/)
- [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- [SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html)
- [mean_absolute_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html?highlight=mean%20absolute#sklearn.metrics.mean_absolute_error)
- [mean_squared_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html?highlight=mean%20squared#sklearn.metrics.mean_squared_error)
- [OLS](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares)
---


## Advertisement data

We will perform multiple linear regression on the same [advertisement data](https://www.statlearning.com/resources-second-edition) that we worked on in lecture 0. 



In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv("../data/Advertising.csv", index_col=0)

print(f"{df.shape[0]} samples")
print(f"{df.shape[1]-1} features") # subtract one as Sales is the label and not a feature

df.head()


200 samples
3 features


      TV  Radio  Newspaper  Sales
1  230.1   37.8       69.2   22.1
2   44.5   39.3       45.1   10.4
3   17.2   45.9       69.3    9.3
4  151.5   41.3       58.5   18.5
5  180.8   10.8       58.4   12.9


In [3]:
X, y = df.drop("Sales", axis="columns"), df["Sales"]
X.head(2), y.head(2)


(      TV  Radio  Newspaper
 1  230.1   37.8       69.2
 2   44.5   39.3       45.1,
 1    22.1
 2    10.4
 Name: Sales, dtype: float64)

In [4]:
X.shape, y.shape

((200, 3), (200,))

---
## Scikit-learn steps

Most algorithms in scikit-learn follow the same few steps. Depending on the situation, algorithm and dataset, some steps can be omitted and additional steps may be required.

Steps: 
1. train|test split (in some cases a train|validation|test split)
2. Scale the dataset
    - many algorithms require scaling, some don't
    - which type of scaling to use?
    - fit the scaler on the training data only, then use it to transform both the training data and the test data, to avoid data leakage
3. Fit the algorithm to the scaled training data
4. Predict on the scaled test data
5. Calculate evaluation metrics

When a validation dataset is used, there are some additional steps for fine-tuning hyperparameters, but we will come back to this later.
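
The whole workflow can be sketched end to end in a few lines. This is a minimal sketch on synthetic data, not the advertisement data; the names `X_demo`, `y_demo` and the coefficients used to generate the data are assumptions made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# synthetic data (hypothetical, only for illustration)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = 3 + X_demo @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1, size=200)

# split before anything else so the test data stays unseen
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# fit the scaler on the training data only, then transform both sets
demo_scaler = MinMaxScaler().fit(X_train_demo)
scaled_train_demo = demo_scaler.transform(X_train_demo)
scaled_test_demo = demo_scaler.transform(X_test_demo)

# fit the model on the scaled training data
demo_model = LinearRegression().fit(scaled_train_demo, y_train_demo)

# predict on the scaled test data and evaluate
y_pred_demo = demo_model.predict(scaled_test_demo)
mae_demo = mean_absolute_error(y_test_demo, y_pred_demo)
print(f"MAE: {mae_demo:.2f}")
```

The important detail is that `fit` is only ever called with training data; the test data is only passed to `transform` and `predict`.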

In [5]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((140, 3), (60, 3), (140,), (60,))

---
## Feature scaling

Two popular scaling techniques are normalization and feature standardization.

Normalization (min-max feature scaling)

- $X' = \frac{X-X_{min}}{X_{max}-X_{min}}$

Feature standardization (standard score scaling)

- $X' = \frac{X - \mu}{\sigma}$
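
Here $\mu$ is the mean and $\sigma$ the standard deviation of each feature. To make the two formulas concrete, the sketch below applies them with NumPy to a small made-up feature column (`X_toy` is a hypothetical example, not part of the dataset) and compares the result with `MinMaxScaler` and `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# a small made-up feature column (hypothetical values)
X_toy = np.array([[10.0], [20.0], [30.0], [50.0]])

# normalization: (X - X_min) / (X_max - X_min), computed per column
X_minmax = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))

# standardization: (X - mu) / sigma, using the population standard deviation (ddof=0)
X_standard = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_minmax.ravel())
print(X_standard.ravel())

# compare with the scikit-learn scalers
print(np.allclose(X_minmax, MinMaxScaler().fit_transform(X_toy)))
print(np.allclose(X_standard, StandardScaler().fit_transform(X_toy)))
```

Min-max scaling maps the values 10, 20, 30, 50 to 0, 0.25, 0.5 and 1, while standardization gives a column with mean 0 and standard deviation 1.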

In [11]:
# we use normalization here
# instantiate an object from the class MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train) # use the training data to fit the scaler

scaled_X_train = scaler.transform(X_train)
scaled_X_test = scaler.transform(X_test)

print(f"{scaled_X_train.min():.2f} ≤ scaled_X_train ≤ {scaled_X_train.max():.2f}")
print(f"{scaled_X_test.min():.2f} ≤ scaled_X_test ≤ {scaled_X_test.max():.2f}") # natural that it isn't [0,1] since we fit to training data 

# we do not scale our target variable y in this lecture 

scaled_X_test

0.00 ≤ scaled_X_train ≤ 1.00
0.01 ≤ scaled_X_test ≤ 1.13


array([[0.54988164, 0.63709677, 0.52286282],
       [0.65843761, 0.96169355, 0.52286282],
       [0.98816368, 0.57056452, 0.42644135],
       [0.03719986, 0.74395161, 0.44632207],
       [0.74264457, 0.98790323, 0.02882704],
       [0.25160636, 0.70564516, 0.52087475],
       [0.73080825, 0.88508065, 0.26739563],
       [0.16672303, 0.23387097, 0.17992048],
       [0.74974636, 0.06854839, 0.12723658],
       [0.58978695, 0.45362903, 0.31013917],
       [0.10415962, 0.49596774, 0.01888668],
       [0.18769023, 0.11491935, 0.29224652],
       [0.79066622, 0.06854839, 0.83996024],
       [0.01589449, 0.60282258, 0.09045726],
       [0.46939466, 0.04233871, 0.26143141],
       [0.5732161 , 0.15725806, 0.34691849],
       [0.02231992, 0.56653226, 0.40854871],
       [0.66587758, 0.46975806, 0.13817097],
       [0.25228272, 0.40927419, 0.32007952],
       [0.80047345, 0.55443548, 0.10636183],
       [0.77375719, 0.65120968, 0.73459245],
       [0.22691917, 0.73790323, 1.13021869],
       ...

---
## Linear regression algorithm

### LinearRegression()

In [7]:
from sklearn.linear_model import LinearRegression

# this model solves the ordinary least squares problem with an SVD-based solver instead of inverting the normal equation directly
model_SVD = LinearRegression()
model_SVD.fit(scaled_X_train, y_train)
print(f"Parameters: {model_SVD.coef_}")
print(f"Intercept: {model_SVD.intercept_}")

Parameters: [13.02832938  9.88465985  0.69237469]
Intercept: 2.7418553248528124
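
To see what `LinearRegression` computes, the sketch below solves the same kind of least squares problem with `np.linalg.lstsq`, which relies on an SVD-based routine. The data is synthetic and the names `X_ols`, `y_ols` and `X_design` are assumptions for illustration; this is not a copy of scikit-learn's internals, only a check that both approaches agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data with known parameters (hypothetical, for illustration)
rng = np.random.default_rng(0)
X_ols = rng.uniform(0, 1, size=(100, 3))
y_ols = 2.0 + X_ols @ np.array([13.0, 10.0, 1.0]) + rng.normal(0, 0.5, size=100)

# prepend a column of ones so the first parameter acts as the intercept
X_design = np.column_stack([np.ones(len(X_ols)), X_ols])

# least squares solution of X_design @ beta ≈ y_ols
beta, *_ = np.linalg.lstsq(X_design, y_ols, rcond=None)

reference = LinearRegression().fit(X_ols, y_ols)

print(f"lstsq   intercept: {beta[0]:.4f}, parameters: {beta[1:]}")
print(f"sklearn intercept: {reference.intercept_:.4f}, parameters: {reference.coef_}")
```

Both lines should print the same numbers up to floating point precision.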


### Stochastic gradient descent 

In [8]:
from sklearn.linear_model import SGDRegressor

model_SGD = SGDRegressor(loss = "squared_error", learning_rate="invscaling", max_iter = 10000)
model_SGD.fit(scaled_X_train, y_train)
print(f"Parameters: {model_SGD.coef_}")
print(f"Intercept: {model_SGD.intercept_}")

Parameters: [11.94609987  8.9907313   1.35755551]
Intercept: [3.58624155]
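
The idea behind stochastic gradient descent can be sketched by hand: take one sample at a time and move the parameters a small step against the gradient of the squared error for that sample. The code below is a minimal sketch on synthetic data with a constant learning rate `eta` (a hypothetical choice). It is not how `SGDRegressor` is implemented, which among other things adds regularization, a decaying learning rate and a stopping criterion.

```python
import numpy as np

# synthetic data with known parameters (hypothetical, for illustration)
rng = np.random.default_rng(1)
X_sgd = rng.uniform(0, 1, size=(200, 3))
true_w, true_b = np.array([12.0, 9.0, 1.0]), 3.0
y_sgd = true_b + X_sgd @ true_w + rng.normal(0, 0.1, size=200)

w, b = np.zeros(3), 0.0  # start with all parameters at zero
eta = 0.05               # constant learning rate (assumed value)

for epoch in range(100):
    # visit the samples in a new random order every epoch
    for i in rng.permutation(len(X_sgd)):
        error = X_sgd[i] @ w + b - y_sgd[i]  # prediction error for one sample
        # gradient of 0.5 * error**2 with respect to w and b
        w -= eta * error * X_sgd[i]
        b -= eta * error

print(f"Parameters: {w}")
print(f"Intercept: {b:.3f}")
```

Since the updates are based on single samples in random order, the result is only approximately equal to the least squares solution and changes with the random seed. The same holds for `SGDRegressor` unless `random_state` is set.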


### Manual test

We predict on a single test sample as a manual sanity check.

In [13]:
test_sample_features = scaled_X_test[0].reshape(1,-1)
test_sample_target = y_test.values[0]

print(f"Scaled features {test_sample_features}, label {test_sample_target}")
print(f"Prediction SVD: {model_SVD.predict(test_sample_features)[0]:.2f}")
print(f"Prediction SGD: {model_SGD.predict(test_sample_features)[0]:.2f}")

Scaled features [[0.54988164 0.63709677 0.52286282]], label 16.9
Prediction SVD: 16.57
Prediction SGD: 16.59



---
## Evaluation

Previously we calculated MAE, MSE and RMSE ourselves using NumPy. Now we will use scikit-learn to do it.

In [10]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# first predict on our test data 
y_pred_SVD = model_SVD.predict(scaled_X_test)
y_pred_SGD = model_SGD.predict(scaled_X_test)

mae_SVD = mean_absolute_error(y_test, y_pred_SVD)
mse_SVD = mean_squared_error(y_test, y_pred_SVD)
rmse_SVD = np.sqrt(mse_SVD)

mae_SGD = mean_absolute_error(y_test, y_pred_SGD)
mse_SGD = mean_squared_error(y_test, y_pred_SGD)
rmse_SGD = np.sqrt(mse_SGD)

print(f"SVD, MAE: {mae_SVD:.2f}, MSE: {mse_SVD:.2f}, RMSE: {rmse_SVD:.2f}")
print(f"SGD, MAE: {mae_SGD:.2f}, MSE: {mse_SGD:.2f}, RMSE: {rmse_SGD:.2f}")


SVD, MAE: 1.51, MSE: 3.80, RMSE: 1.95
SGD, MAE: 1.52, MSE: 4.10, RMSE: 2.02


---

Kokchun Giang

[LinkedIn][linkedIn_kokchun]

[GitHub portfolio][github_portfolio]

[linkedIn_kokchun]: https://www.linkedin.com/in/kokchungiang/
[github_portfolio]: https://github.com/kokchun/Portfolio-Kokchun-Giang

---
