<div style="max-width:66ch;">

# Lecture notes - linear regression

These are the lecture notes for **linear regression** using scikit-learn.

<p class = "alert alert-info" role="alert"><b>Note</b> that this lecture note gives a brief introduction to linear regression and to using scikit-learn. You are encouraged to read more about this topic; see the resources for this week to find out more.</p>

- [scikit-learn](https://scikit-learn.org/stable/)
- [train-test split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train%20test#sklearn.model_selection.train_test_split)
- [scaling data](https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/)
- [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- [SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html)
- [mean_absolute_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html?highlight=mean%20absolute#sklearn.metrics.mean_absolute_error)
- [mean_squared_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html?highlight=mean%20squared#sklearn.metrics.mean_squared_error)
- [OLS](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares)

</div>


<div style="max-width:66ch;">

## Advertisement data

We will perform multiple linear regression on the same [advertisement data](https://www.statlearning.com/resources-second-edition) that we worked on in lecture 0. 

</div>

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv("../data/Advertising.csv", index_col=0)

print(f"{df.shape[0]} samples")
print(f"{df.shape[1]-1} features") # subtract one as Sales is the label and not a feature

df.head()


200 samples
3 features


Unnamed: 0,TV,Radio,Newspaper,Sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9


<div style="max-width:66ch;">

### Dependent and independent variable

The columns TV, Radio and Newspaper are the independent variables, or **features**, and Sales is the dependent variable, as its value depends on the values of the features. Sales can also be called the **label**. 

The features can be represented by a matrix X and the label by a vector y. In this case y consists of continuous data, so the problem is a **regression** problem.  

</div>

In [3]:
X, y = df.drop("Sales", axis="columns"), df["Sales"]
X.head(2), y.head(2)

(      TV  Radio  Newspaper
 1  230.1   37.8       69.2
 2   44.5   39.3       45.1,
 1    22.1
 2    10.4
 Name: Sales, dtype: float64)

<div style="max-width:66ch;">

## Scikit-learn steps

Instead of coding linear regression by hand, we will use a very common library called scikit-learn. If you are interested, coding it manually is a great exercise, as it gives you a deeper understanding of how the algorithm works. Note that it requires some basic knowledge of linear algebra. 

scikit-learn works for most classical machine learning algorithms, and the steps below will soon become very familiar to you. However, depending on the situation, algorithm and dataset, some steps may need to be omitted and additional steps may be required. 

Steps: 
1. Train|test split (in some cases a train|validation|test split)
2. Scale the dataset 
    - many algorithms require scaling, some don't
    - which type of scaling to use?
    - fit the scaler on the training data and use it to transform both the training and the test data, to avoid data leakage
3. Fit the algorithm to the training data
4. Predict on the test data
5. Calculate evaluation metrics

If you use a validation dataset, there are some additional steps for fine-tuning hyperparameters.

</div>
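The steps above can be sketched end to end with a scikit-learn `Pipeline`, which is one possible way to make sure the scaler is only ever fitted on training data. This is a minimal sketch, not the approach used in the rest of this lecture: the data is synthetic (an assumption, so that the example runs without the CSV file), and the names `X_demo`, `y_demo` and `pipe` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in data (assumption): 200 samples, 3 features, linear target plus noise
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = 3.0 + X_demo @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1, size=200)

# 1. train|test split
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# 2-3. fit: the pipeline fits the scaler on the training data only, then fits the model
pipe = make_pipeline(MinMaxScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

# 4. predict: the test data is transformed with the scaler parameters learned from training data
y_hat = pipe.predict(X_te)

# 5. evaluation metric
demo_mae = mean_absolute_error(y_te, y_hat)
print(f"MAE on synthetic test data: {demo_mae:.2f}")
```

The sections below perform the same steps one at a time, which makes it easier to see what each step does.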

<div style="max-width:66ch;">

### 1. Train|test split

How well does a model perform? There are several evaluation metrics that can answer this question, but they should not be computed on the training data. Evaluating on training data is a form of **data leakage**, because at prediction time we would not have this data available. Data leakage leads to an overestimate of the performance, since the model has already been trained on the data it is evaluated on.

We split the data into a training set and a test set, where the test set is only used to evaluate the model. In practice we randomly sample the dataset without replacement, taking a given proportion for the training set and the rest for the test set.

</div>

In [4]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((140, 3), (60, 3), (140,), (60,))
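Random sampling without replacement can be sketched by hand with NumPy. This only illustrates the idea and is not how `train_test_split` is implemented internally; the helper name `manual_train_test_split` is hypothetical.

```python
import numpy as np

def manual_train_test_split(n_samples, test_size=0.3, seed=42):
    """Return disjoint index arrays (train_idx, test_idx) for a random split."""
    rng = np.random.default_rng(seed)
    # a permutation contains every index exactly once, i.e. sampling without replacement
    shuffled = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_size))
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = manual_train_test_split(200, test_size=0.3)
print(len(train_idx), len(test_idx))  # 140 60
```

With a pandas DataFrame the index arrays could then be used as `X.iloc[train_idx]` and `X.iloc[test_idx]`.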

<div style="max-width:66ch;">

### 2. Feature scaling

Two popular scaling techniques are normalization and feature standardization:

Normalization (min-max feature scaling)

- $X' = \frac{X-X_{min}}{X_{max}-X_{min}}$

Feature standardization (standard score scaling)

- $X' = \frac{X - \mu}{\sigma}$


When should you use one over the other? Here are some considerations: 


<table border="1" style="text-transform: lowercase; display:inline-block; text-align:left;">
    <tr style="background-color: #174A7E; color: white;">
        <th>Scaling technique</th>
        <th>When to use</th>
        <th>Benefits</th>
    </tr>
    <tr>
        <td>Feature standardization</td>
        <td>Scales to zero mean and unit variance. Good when the algorithm assumes normally distributed data or when the data is approximately normally distributed.</td>
        <td>Preserves shape of original distribution.</td>
    </tr>
    <tr>
        <td>Normalization (min-max)</td>
        <td>Scales features to fall within the range 0-1. Good when you do not assume a specific distribution of the data, as with neural networks, and for algorithms sensitive to scale such as KNN.</td>
        <td>Transforms features to a common scale without distorting differences in the ranges of values.</td>
    </tr>
</table>

Note that you can also treat the type of scaling as a hyperparameter: choose one and see how it affects the performance on your validation data.


</div>
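The two formulas above can be checked by hand against scikit-learn's scalers. The sketch below uses a small made-up array `X_small` (an assumption for illustration) and applies each formula column by column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# tiny made-up dataset: 3 samples, 2 features on very different scales
X_small = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [4.0, 800.0]])

# normalization: (X - X_min) / (X_max - X_min), computed per column
X_minmax = (X_small - X_small.min(axis=0)) / (X_small.max(axis=0) - X_small.min(axis=0))

# standardization: (X - mu) / sigma, computed per column (population standard deviation)
X_standard = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)

# both should agree with the scikit-learn scalers fitted on the same data
print(np.allclose(X_minmax, MinMaxScaler().fit_transform(X_small)))
print(np.allclose(X_standard, StandardScaler().fit_transform(X_small)))
```

Note that `np.std` and `StandardScaler` both use the population standard deviation (division by the number of samples), which is why the results match.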

In [5]:
# we use normalization here
# instantiate an object from the class MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train) # use the training data to fit the scaler

# very important: fit the scaler to the training data only, i.e. use the training data's parameters
# to transform both training and test data. If we used the test data's parameters to scale the
# test data, we would leak data, which might give misleading results
scaled_X_train = scaler.transform(X_train)
scaled_X_test = scaler.transform(X_test)

print(f"{scaled_X_train.min():.2f} ≤ scaled_X_train ≤ {scaled_X_train.max():.2f}")
print(f"{scaled_X_test.min():.2f} ≤ scaled_X_test ≤ {scaled_X_test.max():.2f}") # natural that it isn't [0,1] since we fit to training data 

# we do not scale our target variable y in this lecture 

0.00 ≤ scaled_X_train ≤ 1.00
0.01 ≤ scaled_X_test ≤ 1.13
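Why the scaled test data can fall outside [0, 1] is easiest to see on a single made-up column. In the sketch below the values and the name `demo_scaler` are assumptions for illustration: the scaler remembers the minimum and maximum of the training column, so a test value above the training maximum is mapped above 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_col = np.array([[0.0], [5.0], [10.0]])  # training values span [0, 10]
test_col = np.array([[2.0], [12.0]])          # 12 lies outside the training range

demo_scaler = MinMaxScaler().fit(train_col)

# the parameters learned from the training data
print(demo_scaler.data_min_, demo_scaler.data_max_)

# test data is scaled with the training parameters: (x - 0) / (10 - 0)
scaled_test_col = demo_scaler.transform(test_col).ravel()
print(scaled_test_col)
```

Here 2 is mapped to 0.2 and 12 is mapped to 1.2, which is expected behaviour and not an error.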


### 3. Linear regression

Fit a linear regression model to the training data.

In [6]:
from sklearn.linear_model import LinearRegression

# under the hood this model solves the least squares problem with SciPy's lstsq solver (SVD-based by default)
model = LinearRegression()
model.fit(scaled_X_train, y_train)
print(f"Parameters: {model.coef_}")
print(f"Intercept parameter: {model.intercept_}")

Parameters: [13.02832938  9.88465985  0.69237469]
Intercept parameter: 2.741855324852814
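To get a feeling for what `LinearRegression` computes, the same least squares problem can be solved directly with NumPy. This is a minimal sketch on synthetic data (an assumption, as are the names `X_syn`, `y_syn` and `theta`), and it is not meant to mirror scikit-learn's internal implementation line by line.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data (assumption): 50 samples, 3 features already in [0, 1]
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 1, size=(50, 3))
y_syn = 2.0 + X_syn @ np.array([13.0, 10.0, 0.7]) + rng.normal(0, 0.5, size=50)

# prepend a column of ones so that the first parameter acts as the intercept
X_design = np.column_stack([np.ones(len(X_syn)), X_syn])

# least squares solution: theta minimizes the sum of squared residuals of X_design @ theta - y_syn
theta, *_ = np.linalg.lstsq(X_design, y_syn, rcond=None)

# compare with scikit-learn fitted on the same data
reference = LinearRegression().fit(X_syn, y_syn)
print(np.allclose(theta[0], reference.intercept_))
print(np.allclose(theta[1:], reference.coef_))
```

The first element of `theta` corresponds to `intercept_` and the remaining elements to `coef_`.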


In [7]:
test_sample_features = scaled_X_test[0].reshape(1,-1)
test_sample_target = y_test.values[0]

print(f"Scaled features {test_sample_features}, label {test_sample_target}")
print(f"Prediction: {model.predict(test_sample_features)[0]:.2f}")

Scaled features [[0.54988164 0.63709677 0.52286282]], label 16.9
Prediction: 16.57
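The prediction is simply the intercept plus the dot product of the parameters and the scaled features. As a sanity check, the sketch below redoes this by hand using the rounded values printed in the outputs above; the variable names are made up for illustration.

```python
import numpy as np

# values copied from the printed outputs above
coef = np.array([13.02832938, 9.88465985, 0.69237469])
intercept = 2.741855324852814
x_scaled = np.array([0.54988164, 0.63709677, 0.52286282])

# y_hat = intercept + coef_1 * x_1 + coef_2 * x_2 + coef_3 * x_3
manual_prediction = intercept + coef @ x_scaled
print(f"{manual_prediction:.2f}")  # 16.57
```

This agrees with the value returned by `model.predict` for the same sample.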


<div style="max-width:66ch;">

### 4. Predict on test data

Transform the test data with the scaler that was fitted to X_train, then predict on it. 

</div>

In [9]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# first predict on our test data
y_pred = model.predict(scaled_X_test)



<div style="max-width:66ch;">

### 5. Evaluate performance

How well did we predict $\bf{y}$ (label) with $\hat{\bf{y}}$ (y_pred)?

To answer this question we use several **evaluation metrics** or **loss functions**: 

- Mean Absolute Error (MAE) - mean of the absolute errors between $\bf{y}$ and ${\hat{\bf{y}}}$. The unit is the same as that of the measured quantity.

$$MAE = \frac{1}{m}\sum_{i=1}^m |y_i - \hat{y}_i|$$

- Mean Squared Error (MSE) - mean of the squared errors between $\bf{y}$ and ${\hat{\bf{y}}}$. It punishes large errors, and the unit is the square of the unit of the measured quantity.

$$MSE = \frac{1}{m}\sum_{i=1}^m (y_i - \hat{y}_i)^2$$

- Root Mean Squared Error (RMSE) - square root of the MSE between $\bf{y}$ and ${\hat{\bf{y}}}$. It punishes large errors, and the unit is the same as that of the measured quantity, hence it is easier to interpret.

$$RMSE = \sqrt{\frac{1}{m}\sum_{i=1}^m (y_i - \hat{y}_i)^2}$$




</div>

In [10]:

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}")

MAE: 1.51, MSE: 3.80, RMSE: 1.95
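The three formulas can also be written out directly in NumPy, which may help in connecting them to the scikit-learn functions. The sketch below uses a tiny made-up example (the arrays `y_true_demo` and `y_pred_demo` are assumptions) where the arithmetic is easy to follow by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# made-up labels and predictions
y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true_demo - y_pred_demo       # [1, 0, -2, -1]

mae_manual = np.mean(np.abs(errors))     # (1 + 0 + 2 + 1) / 4 = 1.0
mse_manual = np.mean(errors ** 2)        # (1 + 0 + 4 + 1) / 4 = 1.5
rmse_manual = np.sqrt(mse_manual)        # square root of 1.5

# the manual values should match scikit-learn's
print(np.isclose(mae_manual, mean_absolute_error(y_true_demo, y_pred_demo)))
print(np.isclose(mse_manual, mean_squared_error(y_true_demo, y_pred_demo)))
```

Note how the single error of size 2 contributes 4 of the total 6 in the MSE sum, which is what is meant by punishing large errors.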


### Next steps

Now you can train the model on all available data and use it to predict new data points.
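One possible way to do this is to refit both the scaler and the model on the full dataset and then predict on a new, unseen sample. The sketch below uses synthetic data in place of the advertising data (an assumption), and `X_all`, `y_all`, `final_scaler`, `final_model` and `new_point` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in for the full dataset (assumption)
rng = np.random.default_rng(1)
X_all = rng.uniform(0, 300, size=(200, 3))
y_all = 3.0 + X_all @ np.array([0.045, 0.19, 0.0]) + rng.normal(0, 1.5, size=200)

# refit the scaler and the model on all available data
final_scaler = MinMaxScaler().fit(X_all)
final_model = LinearRegression().fit(final_scaler.transform(X_all), y_all)

# a new data point must be transformed with the same scaler before predicting
new_point = np.array([[150.0, 25.0, 30.0]])  # hypothetical feature values
new_prediction = final_model.predict(final_scaler.transform(new_point))
print(f"Predicted target for the new point: {new_prediction[0]:.2f}")
```

Keep in mind that after training on all data there is no held-out test set left, so the evaluation metrics from the previous step serve as the performance estimate.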

<div style="max-width:66ch;">

## Summary

In this lecture we've covered the very basics of scikit-learn and its use for linear regression. These steps are crucial for many machine learning algorithms, so try learning them, and try to understand the details of why we do them.

</div>

<div style="background-color: #FFF; color: #212121; border-radius: 1px; width:22ch; box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px; display: flex; justify-content: center; align-items: center;">
<div style="padding: 1.5em 0; width: 70%;">
    <h2 style="font-size: 1.2rem;">Kokchun Giang</h2>
    <a href="https://www.linkedin.com/in/kokchungiang/" target="_blank" style="display: flex; align-items: center; gap: .4em; color:#0A66C2;">
        <img src="https://content.linkedin.com/content/dam/me/business/en-us/amp/brand-site/v2/bg/LI-Bug.svg.original.svg" width="20"> 
        LinkedIn profile
    </a>
    <a href="https://github.com/kokchun/Portfolio-Kokchun-Giang" target="_blank" style="display: flex; align-items: center; gap: .4em; margin: 1em 0; color:#0A66C2;">
        <img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" width="20"> 
        Github portfolio
    </a>
    <span>AIgineer AB</span>
</div>
</div>
