# Data Modelling with `scikit-learn`

In this tutorial, we will learn basic data modelling techniques (regression only) using the scikit-learn library in Python. We will cover three commonly used machine learning models:

- Linear Regression
- Support Vector Machine
- Random Forest

Modelling is a very broad area, so we cannot cover every detail here. You can find more [material](https://scikit-learn.org/stable/tutorial/basic/tutorial.html) in the scikit-learn documentation later.

In [1]:
# Importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

#### Quiz

We will use the Boston Housing Dataset again! Can you import the dataset by yourself this time? We will try to predict MEDV (the `medv` column) using the other variables as features.

In [5]:
# Load the dataset (forward slashes work on Windows, macOS, and Linux)
boston_housing = pd.read_csv("./datasets/BostonHousing.csv")
boston_housing.head()

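|   | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |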


Generally, a complete modelling pipeline includes the following steps:

1. Data collection and loading
2. Data pre-processing: handling missing values, removing outliers, and organizing the data into a structured dataset
3. Feature extraction/engineering
4. Dataset splitting
5. Train the model on the training set so that it fits the data
6. Validate the model on the validation set to tune the "hyperparameters"
7. Test the model on the test set and report the performance

For this tutorial, however, we provide a clean dataset, and you are not required to do any feature engineering.

Hence, we will not focus on steps 1-3, although these first three steps are often the most challenging part and can take more than 70% of the time in a real-world project.
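
To make the later steps of the pipeline (splitting, training, validation, and testing) more concrete, here is a minimal sketch. It is only an illustration under stated assumptions: the data is synthetic (generated with `make_regression`), the 60/20/20 split is arbitrary, and the model (`SVR` with a linear kernel) and the candidate values of its `C` hyperparameter are chosen purely for demonstration.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for a clean, pre-processed dataset (assumption for illustration)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Dataset splitting: 60% training, 20% validation, 20% test
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

# Training and validation: fit one model per candidate hyperparameter value
# and score each of them on the validation set
candidate_C = [0.1, 1.0, 10.0, 100.0]  # hypothetical candidate values
val_mse = {}
for C in candidate_C:
    model = SVR(kernel="linear", C=C).fit(X_train, y_train)
    val_mse[C] = mean_squared_error(y_val, model.predict(X_val))
best_C = min(val_mse, key=val_mse.get)

# Testing: refit on training + validation data with the selected value,
# then evaluate on the test set exactly once
final_model = SVR(kernel="linear", C=best_C).fit(X_trainval, y_trainval)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print(f"Best C: {best_C}, test MSE: {test_mse:.2f}")
```

The test set is touched only at the very end, so the reported error is not influenced by the hyperparameter search.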

## 1. Data Splitting

When building a machine learning model, it's crucial to evaluate how well the model performs on unseen data. To do this effectively, we split our dataset into separate parts: a training set, a validation set, and a test set.

- Typically, we use a larger portion of the data for training and validation (e.g., 80%) and a smaller portion for testing (e.g., 20%).
- A validation set can be further separated from the training set, but this reduces the number of samples available for training. For this reason, we normally use K-fold cross-validation instead.
- This split ensures that we have enough data to train the model while also reserving enough data to test it.
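
As a rough illustration of K-fold cross-validation, the sketch below splits a dataset into 5 folds, trains on 4 of them, validates on the remaining one, and repeats this so that each fold is used for validation once. The synthetic data, the choice of 5 folds, and the use of `LinearRegression` are assumptions made only for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for a training set (assumption)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 5 folds; shuffling with a fixed seed keeps the folds reproducible
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn scorers follow "higher is better", so MSE is returned negated
scores = cross_val_score(
    LinearRegression(), X, y, cv=kfold, scoring="neg_mean_squared_error"
)
mse_per_fold = -scores

print("MSE per fold:", [round(float(m), 2) for m in mse_per_fold])
print(f"Mean MSE: {mse_per_fold.mean():.2f}")
```

In practice, cross-validation would be run on the training portion only, so that the test set stays untouched until the final evaluation.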

In [9]:
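# Separate the features (X) from the target (y)
X = boston_housing.drop('medv', axis=1)
y = boston_housing['medv']

# Hold out 20% of the samples for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)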

#### Quiz

1. Can you check the shape of `X_train`, `X_test`, `y_train` and `y_test`?

In [10]:
print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of y_test: ", y_test.shape)

Shape of X_train:  (404, 13)
Shape of X_test:  (102, 13)
Shape of y_train:  (404,)
Shape of y_test:  (102,)
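
The shapes above follow from simple arithmetic: 404 + 102 = 506 samples in total, and 20% of 506 is 101.2, which is rounded up to 102 test samples, leaving 404 for training. The sketch below reproduces this on a placeholder array of the same size. The arrays `X_demo` and `y_demo` are hypothetical stand-ins filled with zeros, not the real dataset.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the feature matrix and target (assumption)
n_samples, n_features = 506, 13
X_demo = np.zeros((n_samples, n_features))
y_demo = np.zeros(n_samples)

# The test-set size is rounded up; the training set gets the remainder
expected_test = math.ceil(0.2 * n_samples)   # 101.2 rounds up to 102
expected_train = n_samples - expected_test   # 506 - 102 = 404

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # (404, 13) (102, 13)
```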


## 2. Train the model

In [None]:
from sklearn.linear_model import LinearRegression