<a href="https://colab.research.google.com/github/kokchun/Machine-learning-AI22/blob/main/Lecture_code/L2-scikit-learn.ipynb" target="_parent"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> &nbsp; for interacting with the code


---
# Scikit-learn
---

This is the lecture note for **scikit-learn**.

<p class = "alert alert-info" role="alert"><b>Note</b> that this lecture note gives a brief introduction to scikit-learn. I encourage you to read further about scikit-learn. </p>

Read more

- [scikit-learn](https://scikit-learn.org/stable/)
- [train-test split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train%20test#sklearn.model_selection.train_test_split)
- [scaling data](https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/)
- [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- [SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html)
- [mean_absolute_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html?highlight=mean%20absolute#sklearn.metrics.mean_absolute_error)
- [mean_squared_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html?highlight=mean%20squared#sklearn.metrics.mean_squared_error)
- [OLS](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares)
---


## Advertisement data

We will perform multiple linear regression on the same [advertisement data](https://www.statlearning.com/resources-second-edition) that we worked on in lecture 0. 



In [20]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv("../data/Advertising.csv", index_col=0)

print(f"{df.shape[0]} samples")
print(f"{df.shape[1]-1} features") # subtract one as Sales is not a feature but a label

df.head()



200 samples
3 features


| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 5 | 180.8 | 10.8 | 58.4 | 12.9 |


In [21]:
number_of_samples, number_of_features = df.shape[0], df.shape[1]-1 #-1 because Sales is label and not a feature
number_of_samples, number_of_features

(200, 3)

In [22]:
X, y = df.drop("Sales", axis="columns"), df["Sales"]
X.head()


| | TV | Radio | Newspaper |
|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 |
| 2 | 44.5 | 39.3 | 45.1 |
| 3 | 17.2 | 45.9 | 69.3 |
| 4 | 151.5 | 41.3 | 58.5 |
| 5 | 180.8 | 10.8 | 58.4 |


In [23]:
y.head()

1    22.1
2    10.4
3     9.3
4    18.5
5    12.9
Name: Sales, dtype: float64

In [24]:
X.shape, y.shape

((200, 3), (200,))

---
## Scikit-learn  - typical steps

Most scikit-learn workflows follow the same few steps. Depending on the algorithm and the dataset, some steps can be skipped and additional steps may be required.

Steps: 
1. train|test split (in some cases a train|validation|test split)
2. Scale the data, if the algorithm requires it
    - many algorithms require scaling, some don't
    - which type of scaling to use?
    - examples of scaling are min-max scaling (normalization), standardization, ...
    - **fit the scaler on the training data only, then apply that same fitted scaler to the test data, to avoid data leakage**
3. Train the model: fit the algorithm to the training data
4. Predict on the test data
5. Evaluate

*When a validation dataset is used, there are additional steps for fine-tuning hyperparameters, but we will come back to this later.*
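
The five steps above can be sketched end to end as follows. This is a minimal sketch on synthetic stand-in data (an assumption, not the advertisement data), and every name ending in `_demo` is made up for the illustration. The choice of `MinMaxScaler`, `LinearRegression` and MAE is just one possible combination.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): 200 samples, 3 features,
# a linear signal plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = 3.0 + X_demo @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1, size=200)

# 1. train|test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# 2. scale: fit the scaler on the training data only, then transform both sets
scaler_demo = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# 3. train the model on the (scaled) training data
model_demo = LinearRegression().fit(X_tr_scaled, y_tr)

# 4. predict on the (scaled) test data
y_pred_demo = model_demo.predict(X_te_scaled)

# 5. evaluate against the true test labels
mae_demo = mean_absolute_error(y_te, y_pred_demo)
print(f"MAE on test data: {mae_demo:.2f}")
```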

---
## Train|test split

In [25]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((140, 3), (60, 3), (140,), (60,))
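
To see what `test_size` and `random_state` do, here is a small sketch on hypothetical toy data (the `*_toy` names are made up for the illustration). It checks that the same `random_state` reproduces the same split and that no sample ends up in both the training and the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 200 rows so the sizes mirror the split above.
X_toy = np.arange(600).reshape(200, 3)
y_toy = np.arange(200)  # each label is unique, so it identifies its sample

split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

X_toy_train, X_toy_test, y_toy_train, y_toy_test = split_a

# Same random_state -> identical split on every call
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# Train and test never share a sample
overlap = set(y_toy_train) & set(y_toy_test)

print(X_toy_train.shape, X_toy_test.shape, same_split, len(overlap))
```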

## Feature scaling

Two popular scaling techniques are normalization and feature standardization. <br>
$X'$ is the transformed (scaled) $X$ matrix; the scaling is executed for each feature separately.

Normalization (min-max feature scaling)


- $X' = \frac{X-X_{min}}{X_{max}-X_{min}}$

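Feature standardization (standard score scaling), which gives each feature mean 0 and standard deviation 1 (z-scores); $\mu$ and $\sigma$ are the mean and standard deviation of the feature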

- $X' = \frac{X - \mu}{\sigma}$
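
Both formulas can be written directly in NumPy. The sketch below applies them column by column to a hypothetical 4x2 matrix (`X_small` is made up for the illustration) and compares the result with scikit-learn's `MinMaxScaler` and `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical 4x2 matrix: rows are samples, columns are features.
X_small = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [5.0, 40.0]])

# Normalization: column-wise min and max (axis=0)
col_min = X_small.min(axis=0)
col_max = X_small.max(axis=0)
X_minmax = (X_small - col_min) / (col_max - col_min)

# Standardization: column-wise mean and standard deviation.
# np.std uses the population standard deviation (ddof=0) by default,
# which is also what StandardScaler uses.
X_standard = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)

print(np.allclose(X_minmax, MinMaxScaler().fit_transform(X_small)))
print(np.allclose(X_standard, StandardScaler().fit_transform(X_small)))
```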

In [26]:
from sklearn.preprocessing import MinMaxScaler # the function minmax_scale is another possibility, but MinMaxScaler is preferred

# we use normalization here
# instantiate an object from the class MinMaxScaler
scaler = MinMaxScaler()

# fit(): for MinMaxScaler this finds the min and max value of each feature
scaler.fit(X_train) # Important: fit the scaler on the training data only, not on all data
                    # NOTE do not call fit() on X_test, only transform()

# transform() applies the scaling using the min and max values found by fit()
scaled_X_train = scaler.transform(X_train)  # executes the calculation described in the Markdown above; NOTE fit_transform() is OK here
scaled_X_test = scaler.transform(X_test)    # the same scaling on the test data; NOTE transform() only, do not use fit_transform()

print(f"{scaled_X_train.min():.2f} ≤ scaled_X_train ≤ {scaled_X_train.max():.2f}")
print(f"{scaled_X_test.min():.2f} ≤ scaled_X_test ≤ {scaled_X_test.max():.2f}")

# 0.00 ≤ scaled_X_train ≤ 1.00 - all values are between 0 and 1 since the scaler was fit on this data
# 0.01 ≤ scaled_X_test ≤ 1.13 - in general scaled_X_test.min() != 0 and scaled_X_test.max() != 1, since the scaler was fit on the training data

# we do not scale our target variable y in this lecture 

0.00 ≤ scaled_X_train ≤ 1.00
0.01 ≤ scaled_X_test ≤ 1.13
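
The reason the scaled test data can fall outside the interval from 0 to 1 is easier to see on a tiny example. The sketch below uses hypothetical one-feature data (`X_fit` and `X_new` are made up for the illustration) where one new value lies above the training maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature data; X_new holds a value above the training maximum.
X_fit = np.array([[10.0], [20.0], [30.0]])
X_new = np.array([[15.0], [36.0]])

scaler_ab = MinMaxScaler()
scaled_fit = scaler_ab.fit_transform(X_fit)  # same as fit(X_fit) followed by transform(X_fit)
scaled_new = scaler_ab.transform(X_new)      # reuses the training min (10) and max (30)

print(scaler_ab.data_min_, scaler_ab.data_max_)
print(scaled_fit.ravel())
print(scaled_new.ravel())
```

With a training minimum of 10 and maximum of 30, the value 36 maps to (36 - 10) / (30 - 10) = 1.3, which is above 1. Refitting the scaler on `X_new` would hide this and would let information from the new data leak into the preprocessing.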


In [27]:
scaled_X_train.shape, scaled_X_test.shape

((140, 3), (60, 3))

In [51]:
scaled_X_train.shape, y_train.shape

((140, 3), (140,))

---
## Linear regression algorithm

### LinearRegression() - OLS

In [28]:
from sklearn.linear_model import LinearRegression

model_OLS = LinearRegression()
model_OLS.fit(scaled_X_train, y_train) # linear regression (OLS) does not require scaled features

print(f"Parameters: {model_OLS.coef_}")             # beta_1, beta_2, beta_3
print(f"Intercept: {model_OLS.intercept_}")         # beta_0

# analytical solution

Parameters: [13.02832938  9.88465985  0.69237469]
Intercept: 2.7418553248528124
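
The "analytical solution" refers to the fact that OLS has a closed-form solution, the normal equation $\hat{\beta} = (X^TX)^{-1}X^Ty$. Below is a minimal NumPy sketch of it on synthetic data (an assumption; the names `X_ne`, `y_ne` and `true_beta` are made up for the illustration). It is meant to show the idea and is not necessarily how `LinearRegression` computes the solution internally.

```python
import numpy as np

# Synthetic data (assumption): 50 samples, 3 features, known parameters plus noise.
rng = np.random.default_rng(1)
X_ne = rng.uniform(0, 1, size=(50, 3))
true_beta = np.array([2.0, 10.0, 5.0, 1.0])  # beta_0 (intercept), beta_1..beta_3
y_ne = true_beta[0] + X_ne @ true_beta[1:] + rng.normal(0, 0.1, size=50)

# Prepend a column of ones so beta_0 is estimated together with the slopes
X_design = np.column_stack([np.ones(len(X_ne)), X_ne])

# Normal equation: solve (X^T X) beta = X^T y instead of inverting X^T X explicitly
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_ne)

# Cross-check against NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X_design, y_ne, rcond=None)

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))
```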


### Stochastic gradient descent 

In [29]:
from sklearn.linear_model import SGDRegressor #regression since our prediction will be a number (and not a label which would be classification)

model_SGD = SGDRegressor(loss = "squared_error", learning_rate="invscaling", max_iter = 10000)
model_SGD.fit(scaled_X_train, y_train) # note that needs to be on scaled data
print(f"Parameters: {model_SGD.coef_}")
print(f"Intercept: {model_SGD.intercept_}")

Parameters: [11.96247559  9.00809618  1.34237759]
Intercept: [3.57317956]
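
To give a feeling for what stochastic gradient descent does, here is a bare-bones NumPy sketch for the squared error loss. It is a simplification under stated assumptions: synthetic data, a constant learning rate `eta` and a fixed number of epochs. `SGDRegressor` adds more on top, such as a learning rate schedule, regularisation and stopping criteria.

```python
import numpy as np

# Synthetic data (assumption): features already in [0, 1], known parameters plus noise.
rng = np.random.default_rng(2)
X_sgd = rng.uniform(0, 1, size=(200, 3))
y_sgd = 3.0 + X_sgd @ np.array([12.0, 9.0, 1.0]) + rng.normal(0, 0.5, size=200)

weights = np.zeros(3)  # beta_1, beta_2, beta_3
bias = 0.0             # beta_0
eta = 0.01             # constant learning rate (hypothetical choice)

for epoch in range(200):
    for i in rng.permutation(len(X_sgd)):  # visit the samples in random order
        error = (X_sgd[i] @ weights + bias) - y_sgd[i]
        # gradient of 0.5 * error**2 with respect to the weights and the bias
        weights -= eta * error * X_sgd[i]
        bias -= eta * error

mse_sgd_sketch = np.mean((X_sgd @ weights + bias - y_sgd) ** 2)
print(weights, bias, mse_sgd_sketch)
```

Each update uses a single sample, so the parameters keep jittering around the least-squares solution instead of landing exactly on it. This is one reason the SGD parameters above differ slightly from the OLS ones.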


### Manual prediction

We predict a single test sample as a manual sanity check.

In [30]:
test_sample_features = scaled_X_test[0].reshape(1,-1) # reshape to 2D (1 sample, 3 features), as predict() expects
test_sample_label = y_test.values[0]

test_sample_features, test_sample_label

#print(f"Scaled features {test_sample_features}, label {test_sample_label}")
#print(f"Prediction OLS: {model_OLS.predict(test_sample_features)[0]:.2f}")
#print(f"Prediction SGD: {model_SGD.predict(test_sample_features)[0]:.2f}")

(array([[0.54988164, 0.63709677, 0.52286282]]), 16.9)

In [48]:
test_sample_features.shape, test_sample_label.shape

((1, 3), ())

In [32]:
# OLS prediction (model fitted with the closed-form solution)
model_OLS.predict(test_sample_features)[0]

16.56539629743484
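
The OLS prediction is simply $\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$, so we can reproduce it by hand. The sketch below copies the parameters, the intercept and the scaled test sample from the printed outputs above; since those are rounded as displayed, agreement is only expected up to rounding.

```python
import numpy as np

# Values copied from the printed outputs above (rounded as displayed)
coef = np.array([13.02832938, 9.88465985, 0.69237469])    # beta_1, beta_2, beta_3
intercept = 2.7418553248528124                            # beta_0
sample = np.array([0.54988164, 0.63709677, 0.52286282])   # scaled test sample

# intercept plus the dot product of features and parameters
manual_prediction = intercept + sample @ coef
print(f"{manual_prediction:.4f}")
```

This prints 16.5654, which agrees with the value returned by `model_OLS.predict()` for the same sample.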

In [33]:
# SGD prediction (model fitted iteratively)
model_SGD.predict(test_sample_features)[0]

16.592033567268214

In [34]:
X_test.iloc[0].to_numpy()

array([163.3,  31.6,  52.9])

---
## Prediction

In [35]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# predict on our test data
y_pred_OLS = model_OLS.predict(scaled_X_test) # note scaled data
y_pred_SGD = model_SGD.predict(scaled_X_test) # note scaled data



In [36]:
y_pred_SGD[:5]

array([16.59203357, 20.81463066, 21.1062088 , 11.31890245, 21.39487116])

In [40]:
y_test.shape

(60,)

In [50]:
y_test[:5].values

array([16.9, 22.4, 21.4,  7.3, 24.7])

---
## Evaluation

Previously we calculated MAE, MSE and RMSE ourselves using numpy. Now we will use scikit-learn to do it.

In [38]:
mae_OLS = mean_absolute_error(y_test, y_pred_OLS)
mae_SGD = mean_absolute_error(y_test, y_pred_SGD)

mse_OLS = mean_squared_error(y_test, y_pred_OLS)
mse_SGD = mean_squared_error(y_test, y_pred_SGD)

rmse_OLS = np.sqrt(mse_OLS)
rmse_SGD = np.sqrt(mse_SGD)

print(f"OLS, MAE: {mae_OLS:.2f}, MSE: {mse_OLS:.2f}, RMSE: {rmse_OLS:.2f}")
print(f"SGD, MAE: {mae_SGD:.2f}, MSE: {mse_SGD:.2f}, RMSE: {rmse_SGD:.2f}")
print('Choose OLS since lower values')

OLS, MAE: 1.51, MSE: 3.80, RMSE: 1.95
SGD, MAE: 1.52, MSE: 4.09, RMSE: 2.02
Choose OLS since lower values
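
As a cross-check, the metrics can also be computed by hand with NumPy. The sketch below uses only the first five test labels and SGD predictions shown earlier, so its numbers differ from the full test-set results above; the point is only that the manual formulas and the scikit-learn functions agree.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# First five test labels and SGD predictions, copied from the outputs above
y_true_5 = np.array([16.9, 22.4, 21.4, 7.3, 24.7])
y_pred_5 = np.array([16.59203357, 20.81463066, 21.1062088, 11.31890245, 21.39487116])

residuals = y_true_5 - y_pred_5
mae_manual = np.mean(np.abs(residuals))  # mean absolute error
mse_manual = np.mean(residuals**2)       # mean squared error
rmse_manual = np.sqrt(mse_manual)        # root mean squared error

print(np.isclose(mae_manual, mean_absolute_error(y_true_5, y_pred_5)))
print(np.isclose(mse_manual, mean_squared_error(y_true_5, y_pred_5)))
print(f"MAE: {mae_manual:.2f}, MSE: {mse_manual:.2f}, RMSE: {rmse_manual:.2f}")
```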
