# 2.3. Predictive Model Selection

Module: Artificial Intelligence for Aviation Engineering

Instructor: prof. Dmitry Pavlyuk

## Statistical model: loss function

In order to quantify how well a model performs, we define a loss or error function. A common loss function for quantitative outcomes is the Mean Squared Error (MSE):

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

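Here $y_i$ is the observed response, $\hat{y}_i$ is the model's prediction, and $n$ is the number of observations. The quantity $y_i - \hat{y}_i$ is called a residual and measures the error of the $i$-th prediction.

Taking the square root of the MSE gives the RMSE, which is expressed in the same units as the response: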

$$
\text{RMSE}  = \sqrt{\text{MSE}}
$$

Alternatively, we can compare the squared error with the variance of the response around its mean $\bar{y}$ and construct the coefficient of determination:

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} {\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$
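
The three formulas above can be checked by hand on a tiny example. The sketch below uses made-up values for `y` and `y_hat` (they are assumptions for illustration, not data from this lecture) and computes each quantity directly with NumPy.

```python
import numpy as np

# Hypothetical observed and predicted values, chosen only for illustration
y = np.array([2.0, 4.0, 6.0])
y_hat = np.array([2.0, 5.0, 5.0])

residuals = y - y_hat                    # residuals: 0, -1, 1
mse = np.mean(residuals ** 2)            # (0 + 1 + 1) / 3
rmse = np.sqrt(mse)                      # same units as y

ss_res = np.sum(residuals ** 2)          # residual sum of squares = 2
ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares around the mean = 8
r2 = 1 - ss_res / ss_tot                 # 1 - 2/8

print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, R2={r2:.3f}")
# MSE=0.667, RMSE=0.816, R2=0.750
```

Note that $R^2$ becomes negative whenever the residual sum of squares exceeds the total sum of squares, i.e. when the model predicts worse than the constant $\bar{y}$.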

## Toy example

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x_train = np.array(range(1,11))
y_train = np.array([2, 2, 4, 3, 5, 7, 7, 5, 9, 8])
data = pd.DataFrame({'x': x_train, 'y': y_train})
data.T

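|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| y | 2 | 2 | 4 | 3 | 5 | 7 | 7 | 5 | 9 | 8 |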


## Models: kNN, Linear regression, Decision tree

In [7]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, plot_tree

# All models are fitted on the full sample; the target is passed as a 1-D Series
knn_1 = KNeighborsRegressor(n_neighbors=1)
knn_1.fit(data[["x"]], data["y"])

knn_2 = KNeighborsRegressor(n_neighbors=2)
knn_2.fit(data[["x"]], data["y"])

knn_3 = KNeighborsRegressor(n_neighbors=3)
knn_3.fit(data[["x"]], data["y"])

linear_model = LinearRegression()
linear_model.fit(data[["x"]], data["y"])

tree = DecisionTreeRegressor(max_depth=3)
tree.fit(data[["x"]], data["y"])

## Models: RMSE

In [13]:
from sklearn.metrics import mean_squared_error, r2_score

models = {
    "1-NN regression": knn_1,
    "2-NN regression": knn_2,
    "3-NN regression": knn_3,
    "Linear regression": linear_model,
    "Decision Tree  ": tree
}

for name, model in models.items():
    model.fit(data[["x"]], data["y"])
    y_pred = model.predict(data[["x"]])  # predictions on the same data used for fitting
    rmse = np.sqrt(mean_squared_error(data["y"], y_pred))
    print(f"{name}:\tRMSE={rmse:.3f},\tR2={r2_score(data['y'], y_pred):.3f}")

1-NN regression:	RMSE=0.000,	R2=1.000
2-NN regression:	RMSE=0.922,	R2=0.847
3-NN regression:	RMSE=1.049,	R2=0.802
Linear regression:	RMSE=1.025,	R2=0.811
Decision Tree  :	RMSE=0.516,	R2=0.952


__Question:__ Is it a good idea to choose the best model using $RMSE$ or $R^2$ computed on the same sample that was used to fit the models?

## Models: Overfitting

Overfitting occurs when a model becomes overly complex, capturing random noise in the observations instead of the true relationship between the predictors and the response.

Overfitting can occur when:
- the model is too flexible
- the model has too many parameters relative to the amount of data

A sign of overfitting may be a high training $R^2$ or low $RMSE$, accompanied by unexpectedly poor performance on testing data.
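
This symptom can be reproduced on the toy data. The sketch below fits a 1-NN regressor and compares its error on the points it was fitted on with its error on held-out points. The held-out points $x \in \{4, 5, 8, 9\}$ are hand-picked here (an assumption, made only so that the example is deterministic and every held-out point has a unique nearest neighbour).

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

x = np.arange(1, 11).reshape(-1, 1)
y = np.array([2, 2, 4, 3, 5, 7, 7, 5, 9, 8])

# Hand-picked hold-out points (an assumption, not a random split)
test_mask = np.isin(x.ravel(), [4, 5, 8, 9])
x_tr, y_tr = x[~test_mask], y[~test_mask]
x_te, y_te = x[test_mask], y[test_mask]

knn = KNeighborsRegressor(n_neighbors=1).fit(x_tr, y_tr)

# Error on the data the model has seen vs. data it has not seen
train_rmse = np.sqrt(mean_squared_error(y_tr, knn.predict(x_tr)))
test_rmse = np.sqrt(mean_squared_error(y_te, knn.predict(x_te)))
print(f"train RMSE={train_rmse:.3f}, test RMSE={test_rmse:.3f}")
```

The training RMSE is exactly zero because every training point is its own nearest neighbour, while the error on the held-out points is clearly positive.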

Note: There is no definitive test for overfitting, nor is there a foolproof method to prevent it. Instead, a combination of techniques can be employed to mitigate overfitting and various methods can be used to detect it.

## Models: Train/Test split

The train-test split is a fundamental technique used in machine learning to evaluate the performance of a model. It involves dividing a dataset into two distinct subsets: one for training the model and one for testing its performance on unseen data.

Purpose
- To assess how well a machine learning model generalizes to new, unseen data.
- To detect overfitting by ensuring the model is not evaluated on the same data it was trained on.

Procedure
- Divide the Dataset: The dataset is split into two parts:
  - Training Set: Used to train the model.
  - Test Set: Used to evaluate the model's performance after training.
- Choose the proportions: A common practice is to allocate 70-80% of the data for training and 20-30% for testing.
- Randomization: It's important to randomize the split to ensure that both subsets represent the overall dataset well.

Advantages
- Simple and easy to implement.
- Provides a quick estimate of model performance on unseen data.

Disadvantages
- Depending on the random split, the results can vary significantly. A single split might not capture the model's performance accurately.
- If the dataset is small, setting data aside for testing leaves fewer observations for training, and the small test set itself gives a noisy performance estimate.
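
The first disadvantage is easy to observe on the toy data. The following sketch repeats the same 70/30 split with several random seeds and records the test RMSE of a linear regression each time. The seeds 0 to 4 are arbitrary choices made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

x = np.arange(1, 11).reshape(-1, 1)
y = np.array([2, 2, 4, 3, 5, 7, 7, 5, 9, 8])

rmses = []
for seed in range(5):  # arbitrary seeds; each one gives a different split
    x_tr, x_te, y_tr, y_te = train_test_split(
        x, y, test_size=0.3, random_state=seed
    )
    model = LinearRegression().fit(x_tr, y_tr)
    rmses.append(np.sqrt(mean_squared_error(y_te, model.predict(x_te))))

print("test RMSE per split:", np.round(rmses, 3))
print(f"min={min(rmses):.3f}, max={max(rmses):.3f}")
```

The same model on the same data receives a different score for each split, so a single split should be read as one noisy draw rather than a definitive estimate.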

## Models: Train/Test split illustration

In [14]:
from sklearn.model_selection import train_test_split
# No random_state is set, so the split (and the numbers below) change from run to run
x_train, x_test, y_train, y_test = train_test_split(data[["x"]], data["y"], test_size=0.3)

for name, model in models.items():
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)  # predictions on the held-out 30%
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"{name}:\tRMSE={rmse:.3f},\tR2={r2_score(y_test, y_pred):.3f}")

1-NN regression:	RMSE=1.633,	R2=-0.714
2-NN regression:	RMSE=1.443,	R2=-0.339
3-NN regression:	RMSE=1.610,	R2=-0.667
Linear regression:	RMSE=1.210,	R2=0.058
Decision Tree  :	RMSE=1.633,	R2=-0.714


## Models: Cross-validation

Cross-validation is a statistical technique used to assess the generalization performance of a model. By repeatedly evaluating the model on data it was not trained on, it helps to detect issues like overfitting.

Purpose
- To estimate how well a model will generalize to an independent dataset.
- To identify potential issues like overfitting or underfitting.

Steps in Cross-Validation
- Split the dataset: If applicable, first divide the dataset into training and testing sets; cross-validation is then run on the training set.
- Select $k$: Choose the number of folds for k-fold cross-validation.
- Train and validate: For each fold, train the model on the other $k-1$ folds and validate it on the remaining fold.
- Calculate metrics: Collect performance metrics (e.g., accuracy, MSE) for each fold.
- Average the results: Compute the mean of the metrics across all folds to assess the model's performance.
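
The steps above can be written out as an explicit loop. The sketch below is one possible manual implementation using `KFold.split`. The choices $k=5$, `random_state=0`, linear regression as the model and MSE as the metric are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X = np.arange(1, 11).reshape(-1, 1)
y = np.array([2, 2, 4, 3, 5, 7, 7, 5, 9, 8])

k = 5  # number of folds (an arbitrary choice for this sketch)
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_mse = []
for train_idx, val_idx in kf.split(X):
    # Train on k-1 folds, validate on the remaining fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[val_idx])
    fold_mse.append(mean_squared_error(y[val_idx], y_pred))

# Average the per-fold metrics into a single estimate
cv_mse = np.mean(fold_mse)
print(f"MSE per fold: {np.round(fold_mse, 3)}")
print(f"cross-validated MSE={cv_mse:.3f}, RMSE={np.sqrt(cv_mse):.3f}")
```

Every observation is used for validation exactly once, which is what distinguishes k-fold cross-validation from repeating a random train/test split.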

Advantages
- Provides a better estimate of model performance than a single train-test split.
- Helps in hyperparameter tuning and model selection.

Disadvantages
- Computationally intensive, especially for large datasets and complex models.
- Can still give over-optimistic estimates if not done carefully, e.g., when many models or hyperparameters are tuned against the same folds.

## Models: Cross-validation illustration

In [16]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score 

kf = KFold(n_splits=2, shuffle=True)
for name, model in models.items():
    scores = cross_val_score(model, data[["x"]], data["y"], cv=kf, scoring='r2')
    print(f"{name}:\tR2={np.mean(scores):.3f}")

1-NN regression:	R2=0.397
2-NN regression:	R2=0.575
3-NN regression:	R2=0.146
Linear regression:	R2=0.430
Decision Tree  :	R2=-1.537


## Models: Repeated sampling

In [21]:
from sklearn.model_selection import ShuffleSplit

kf = ShuffleSplit(n_splits=10, test_size=0.2)
for name, model in models.items():
    scores = cross_val_score(model, data[["x"]], data["y"], cv=kf, scoring='r2')
    print(f"{name}:\tR2={np.mean(scores):.3f}")

1-NN regression:	R2=-1.460
2-NN regression:	R2=-4.423
3-NN regression:	R2=-1.941
Linear regression:	R2=-0.627
Decision Tree  :	R2=-0.896
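
With `test_size=0.2` each test set above contains only two of the ten observations, so $R^2$ is computed from two points and can swing to large negative values. As a hedged alternative, the sketch below repeats the same sampling scheme but scores with RMSE and also reports the spread across splits. The fixed `random_state=0` and the use of linear regression alone are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X = np.arange(1, 11).reshape(-1, 1)
y = np.array([2, 2, 4, 3, 5, 7, 7, 5, 9, 8])

# 10 independent random splits, each holding out 20% (2 points) for testing
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

neg_rmse = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_root_mean_squared_error"
)
rmse = -neg_rmse  # scikit-learn negates error metrics so that higher is better

print(f"RMSE: mean={rmse.mean():.3f}, std={rmse.std():.3f}")
```

Reporting the standard deviation next to the mean shows how much the estimate depends on which points happen to be held out.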

