# Data Modelling with `scikit-learn`

In this tutorial, we will learn basic data modelling techniques (regression only) using the `scikit-learn` library in Python. We will cover two commonly used machine learning models:

- Linear Regression
- Random Forest

Modelling is a very broad area, so we cannot cover every detail here. You can find more [material](https://scikit-learn.org/stable/tutorial/basic/tutorial.html) afterwards.

In [12]:
# Importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

#### Quiz

We will use the Boston Housing Dataset again! So can you import the dataset by yourself this time? We will try to predict the MEDV based on other variables as features.

In [5]:
boston_housing = pd.read_csv(r".\datasets\BostonHousing.csv")
boston_housing.head()

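|   | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |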


Generally, a complete modeling pipeline includes the steps as below:

1. Data collection and loading
2. Data pre-processing: handling missing values, removing outliers, building a structured dataset
3. Feature extraction/engineering
4. Dataset splitting
5. Train the model on the training set so that it fits the data
6. Validate the model on the validation set to tune the "hyperparameters"
7. Test the model on the test set and report the performance

But on this occasion, as a tutorial, we provide a clean dataset and you are not required to do feature engineering. 

Hence, we will not focus on steps 1-3, although in a real-world project these first three steps are usually the most challenging part and often take more than 70% of the time.

## 1. Data Splitting

When building a machine learning model, it's crucial to evaluate how well the model performs on unseen data. To do this effectively, we split our dataset into separate parts: a training set, validation set and a testing set.

- Typically, we use a larger portion of the data for training and validation (e.g., 80%) and a smaller portion for testing (e.g., 20%).
- A validation set can be further separated from the training set, but this reduces the number of samples available for training. For this reason, K-fold cross-validation is normally used instead.
- This split ensures that we have enough data to train the model while also reserving enough data to test it.
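
To make the K-fold idea mentioned above more concrete, here is a minimal plain-Python sketch of how the sample indices can be divided into folds. `k_fold_indices` is a hypothetical helper written only for this illustration (it does not shuffle the data); in practice you would normally rely on `sklearn.model_selection.KFold` rather than writing this yourself.

```python
# Plain-Python sketch of K-fold splitting (no shuffling).
# `k_fold_indices` is a hypothetical helper for illustration only,
# not a scikit-learn function.

def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, val_indices) once for each fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


folds = list(k_fold_indices(n_samples=10, n_folds=3))
for i, (train, val) in enumerate(folds):
    print(f"fold {i}: train={train} val={val}")
```

Every sample is used for validation exactly once and for training in all the other folds, which is why K-fold makes better use of a small dataset than a single held-out validation set.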

In [6]:
X = boston_housing.drop('medv', axis=1)
y = boston_housing['medv']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### Quiz

1. Can you check the shape of `X_train`, `X_test`, `y_train` and `y_test`?

In [7]:
print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of y_test: ", y_test.shape)

Shape of X_train:  (404, 13)
Shape of X_test:  (102, 13)
Shape of y_train:  (404,)
Shape of y_test:  (102,)
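
The shapes above can be checked with a little arithmetic. The sketch below assumes that, for a fractional `test_size`, `train_test_split` rounds the test set size up and assigns the remaining rows to the training set; this assumption agrees with the printed shapes. The total of 506 rows is simply 404 + 102.

```python
import math

n_samples = 506   # 404 + 102 rows, taken from the shapes printed above
test_size = 0.2

# Assumption: the test set size is rounded up and the remaining rows
# form the training set.
n_test = math.ceil(test_size * n_samples)   # 0.2 * 506 = 101.2, rounded up
n_train = n_samples - n_test

print(n_train, n_test)  # 404 102
```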


## 2. Train and Test the ML model

### 2.1 Linear Regression

`scikit-learn` includes most of the commonly used machine learning models. The OLS-based linear regression in `sklearn.linear_model` is one of the simplest examples.

Let's import the OLS linear regressor and then instantiate it.

In [8]:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()

Then we can train the model by calling the `.fit()` method.

In [9]:
lr_model.fit(X_train, y_train)

LinearRegression()

Let's check the fitted model's coefficients and intercept:

In [20]:
coefficients = lr_model.coef_
intercept = lr_model.intercept_

# Build a readable equation: one "coefficient*feature" term per column
variable_names = X.columns
model_str = "medv = "
for coef, name in zip(coefficients, variable_names):
    model_str += str(coef) + "*" + name + " + "
model_str += str(intercept)
print(model_str)

medv = -0.11305592398537863*crim + 0.030110464145648223*zn + 0.04038072041333297*indus + 2.7844382035079644*chas + -17.202633391781003*nox + 4.438835199513043*rm + -0.0062963622109808905*age + -1.447865368530779*dis + 0.26242973558508303*rad + -0.010646786275308219*tax + -0.9154562404680713*ptratio + 0.012351334729969164*b + -0.5085714244487934*lstat + 30.246750993923495
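
To see how this equation is used, the sketch below applies it by hand to the first row of the dataset shown earlier. The coefficients are copied from the output above and rounded to six decimals, so the result is only approximate; the dictionary names are chosen for this illustration.

```python
# Coefficients and intercept copied from the fitted model above,
# rounded to six decimals (so the result is approximate).
coefficients = {
    "crim": -0.113056, "zn": 0.030110, "indus": 0.040381, "chas": 2.784438,
    "nox": -17.202633, "rm": 4.438835, "age": -0.006296, "dis": -1.447865,
    "rad": 0.262430, "tax": -0.010647, "ptratio": -0.915456, "b": 0.012351,
    "lstat": -0.508571,
}
intercept = 30.246751

# First row of the dataset, as displayed by boston_housing.head()
first_row = {
    "crim": 0.00632, "zn": 18.0, "indus": 2.31, "chas": 0, "nox": 0.538,
    "rm": 6.575, "age": 65.2, "dis": 4.09, "rad": 1, "tax": 296,
    "ptratio": 15.3, "b": 396.9, "lstat": 4.98,
}

# A linear model predicts: intercept + sum(coefficient * feature value)
prediction = intercept + sum(
    coefficients[name] * first_row[name] for name in coefficients
)
print(round(prediction, 2))
```

The printed value is close to 30, while the observed `medv` for this row is 24.0. The difference between the two is the residual that the metrics in the next step summarise.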


Check the OLS linear regression model's performance on the training set:

**Note**: We evaluate the model with three common metrics for regression tasks, i.e., R-square score ($R^2$), Mean Squared Error (MSE) and Mean Absolute Error (MAE).
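
As a small worked example of what these three metrics measure, the sketch below computes them in plain Python on four made-up values (the numbers are illustrative only and are not taken from the housing data). It uses the usual definitions: MSE is the mean of the squared residuals, MAE is the mean of the absolute residuals, and $R^2 = 1 - SS_{res} / SS_{tot}$.

```python
# Toy values, made up purely for illustration
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
n = len(y_true)

residuals = [t - p for t, p in zip(y_true, y_pred)]

mse = sum(r ** 2 for r in residuals) / n    # mean of squared residuals
mae = sum(abs(r) for r in residuals) / n    # mean of absolute residuals

mean_true = sum(y_true) / n
ss_res = sum(r ** 2 for r in residuals)               # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
r2 = 1 - ss_res / ss_tot

print(mse, mae, r2)  # 0.5 0.5 0.9
```

Lower MSE and MAE are better, while $R^2$ is better the closer it gets to 1.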

In [13]:
y_train_pred = lr_model.predict(X_train)

R2_score_train = r2_score(y_train, y_train_pred)
mse_train = mean_squared_error(y_train, y_train_pred)
mae_train = mean_absolute_error(y_train, y_train_pred)

print("OLS-LR Model performance on the training set:")
print("R2 Score: ", R2_score_train)
print("Mean Squared Error: ", mse_train)
print("Mean Absolute Error: ", mae_train)

OLS-LR Model performance on the training set:
R2 Score:  0.7508856358979672
Mean Squared Error:  21.641412753226316
Mean Absolute Error:  3.314771626783226


For OLS linear regression, we normally don't need the validation step, because the model is simple and has no "hyperparameter" to tune for model selection. Hence, we evaluate the model directly on the test set.

Now test the trained model on the test set using the same metrics as above:

In [14]:
y_test_pred = lr_model.predict(X_test)

R2_score_test = r2_score(y_test, y_test_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
mae_test = mean_absolute_error(y_test, y_test_pred)

print("OLS-LR Model performance on the test set:")
print("R2 Score: ", R2_score_test)
print("Mean Squared Error: ", mse_test)
print("Mean Absolute Error: ", mae_test)

OLS-LR Model performance on the test set:
R2 Score:  0.6687594935356318
Mean Squared Error:  24.291119474973534
Mean Absolute Error:  3.189091965887842


### 2.2 Random Forest

A random forest is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
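
The averaging described above can be checked directly. The following sketch fits a small forest on synthetic data (generated with `make_regression`, purely for illustration) and compares the forest's prediction with the mean of the predictions of its individual trees, which are available in the `estimators_` attribute after fitting. The variable names and parameter values are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

forest = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=0)
forest.fit(X_demo, y_demo)

# Predict with every fitted tree, then average the predictions by hand
tree_predictions = np.array(
    [tree.predict(X_demo) for tree in forest.estimators_]
)
manual_average = tree_predictions.mean(axis=0)

# The forest's own prediction matches the hand-computed average
print(np.allclose(manual_average, forest.predict(X_demo)))  # True
```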

Let's import the random forest regressor and instantiate it. For the hyperparameters that can be tuned, see the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).

In [15]:
from sklearn.ensemble import RandomForestRegressor

rf_regressor = RandomForestRegressor(n_estimators=200, max_depth=5, min_samples_split=10)

Many hyperparameters of the Random Forest regressor can be tuned, so we need a validation set to find the "best" values.

Let's start with the simplest approach: hold out a validation set from the training set. (In practice, K-fold cross-validation is a common choice, but due to time limitations we will skip it on this occasion.)

#### Quiz

1. Split the training set into a new training set and a validation set in the proportion 90:10. Name the resulting objects `X_new_train`, `X_val`, `y_new_train` and `y_val`. Specify `random_state = 3` to make the split reproducible.

In [32]:
X_new_train, X_val, y_new_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=3)

2. Train the random forest on the new training set:

In [33]:
rf_regressor.fit(X_new_train, y_new_train)

RandomForestRegressor(max_depth=5, min_samples_split=10, n_estimators=200)

3. Check the Random Forest regressor's performance on the validation set and tune the hyperparameters in the Random Forest model specification until you are satisfied with the performance.

In [34]:
y_val_pred = rf_regressor.predict(X_val)

R2_score_val = r2_score(y_val, y_val_pred)
mse_val = mean_squared_error(y_val, y_val_pred)
mae_val = mean_absolute_error(y_val, y_val_pred)

print("RF Model performance on the validation set:")
print("R2 Score: ", R2_score_val)
print("Mean Squared Error: ", mse_val)
print("Mean Absolute Error: ", mae_val)

RF Model performance on the validation set:
R2 Score:  0.8730372108028974
Mean Squared Error:  6.12985711731928
Mean Absolute Error:  1.7600920546203451
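
Tuning by hand usually means repeating the fit-and-validate step for several candidate settings and keeping the best one. The sketch below shows one possible way to organise such a loop. It runs on synthetic data from `make_regression` so that it is self-contained, and the candidate `max_depth` values, variable names and other settings are arbitrary choices for illustration, not recommendations for the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

candidate_depths = [2, 5, 10]   # hypothetical candidate values
val_scores = {}
for depth in candidate_depths:
    model = RandomForestRegressor(
        n_estimators=50, max_depth=depth, random_state=0
    )
    model.fit(X_tr, y_tr)
    # Score each candidate on the validation set, never on the test set
    val_scores[depth] = r2_score(y_va, model.predict(X_va))

best_depth = max(val_scores, key=val_scores.get)
print(val_scores)
print("Best max_depth on the validation set:", best_depth)
```

The test set is left untouched during this loop, so the final test score remains an honest estimate of performance on unseen data.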


4. Test the selected model on the test set and report the outcome metrics.

In [36]:
y_test_pred = rf_regressor.predict(X_test)

R2_score_test = r2_score(y_test, y_test_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
mae_test = mean_absolute_error(y_test, y_test_pred)

print("RF Model performance on the test set:")
print("R2 Score: ", R2_score_test)
print("Mean Squared Error: ", mse_test)
print("Mean Absolute Error: ", mae_test)

RF Model performance on the test set:
R2 Score:  0.8623367245404001
Mean Squared Error:  10.095368791694103
Mean Absolute Error:  2.239529165891398
