# Thursday, April 11

## Regularization and Hyperparameter Optimization

* See https://app.jedha.co/course/regularization-and-hyperparameter-optimization-ft/regularization-ft



# Regularization and hyperparameter optimization with scikit-learn

## What will you learn in this course?

Let's come back to yesterday's multiple linear regression model. Now, our objective is to implement a **regularized** linear regression. In this process, we'll use:
* A Ridge linear regression model
* cross-validation to estimate how the generalized $R^2$ score varies depending on the choice of the validation set
* cross-validated grid search to tune the value of the regularization strength

It's quite an ambitious program, isn't it?

But don't worry: with scikit-learn's dedicated classes, it will be quite easy and straightforward to implement.

* Training Pipeline
    * Cross-validated score for a Ridge model (with default value of $\lambda$)
    * Grid search : tune $\lambda$
* Test pipeline
* Final remarks
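
For reference, the quantity that a Ridge regression minimizes can be written as below. This is the standard textbook formulation; the symbols $X$ (feature matrix), $y$ (target vector) and $w$ (coefficients) are notation introduced here for illustration, and $\lambda$ is the regularization strength, which scikit-learn exposes as the `alpha` parameter of `Ridge`:

$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2
$$

With $\lambda = 0$ the penalty vanishes and we recover ordinary least squares; larger values of $\lambda$ shrink the coefficients toward zero.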

In [14]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge                                   # ! NEW
from sklearn.model_selection import cross_val_score, GridSearchCV        # ! NEW
# from sklearn.metrics import r2_score

In [15]:
# Import dataset
print("Loading dataset...")
dataset = pd.read_csv("../12_assets/05_supervised_ML/Data.csv")
print("...Done.")
dataset.head()

Loading dataset...
...Done.


   Country   Age  Salary Purchased
0   France  44.0   72000        No
1    Spain  27.0   48000       Yes
2  Germany  30.0   54000        No
3    Spain  38.0   61000        No
4  Germany  40.0   69000       Yes


In [16]:
# Separate target variable Y from features X
print("Separating labels from features...")
target_variable = "Salary"

X = dataset.drop(target_variable, axis = 1)
Y = dataset.loc[:,target_variable]

print("...Done.")
print()

print('Y : ')
print(Y.head())
print()
print('X :')
print(X.head())

Separating labels from features...
...Done.

Y : 
0    72000
1    48000
2    54000
3    61000
4    69000
Name: Salary, dtype: int64

X :
   Country   Age Purchased
0   France  44.0        No
1    Spain  27.0       Yes
2  Germany  30.0        No
3    Spain  38.0        No
4  Germany  40.0       Yes


In [17]:
# Automatically detect names of numeric/categorical columns
numeric_features = []
categorical_features = []
for i,t in X.dtypes.items():
    if ('float' in str(t)) or ('int' in str(t)) :
        numeric_features.append(i)
    else :
        categorical_features.append(i)

print('Found numeric features ', numeric_features)
print('Found categorical features ', categorical_features)

Found numeric features  ['Age']
Found categorical features  ['Country', 'Purchased']
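
As an alternative to looping over `X.dtypes`, pandas can perform the same split with `select_dtypes`. The sketch below runs on a small hypothetical `X_demo` DataFrame that mimics the columns of `X`; it is only an illustration, not the method used in the course.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same columns as X
X_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, np.nan],
    "Purchased": ["No", "Yes", "No"],
})

# Numeric dtypes (int, float) on one side, everything else on the other
numeric_features = X_demo.select_dtypes(include="number").columns.tolist()
categorical_features = X_demo.select_dtypes(exclude="number").columns.tolist()

print("Found numeric features ", numeric_features)
print("Found categorical features ", categorical_features)
```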


In [18]:
# Divide dataset into train set & test set
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print("...Done.")
print()

Dividing into train and test sets...
...Done.



### Preprocessing

In [19]:
# Create pipeline for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')), # missing values will be replaced by columns' mean
    ('scaler', StandardScaler())
])

In [20]:
# Create pipeline for categorical features
categorical_transformer = Pipeline(
    steps=[
    ('encoder', OneHotEncoder(drop='first')) # first category of each feature is dropped to avoid perfectly collinear dummy columns
    ])

In [21]:
# Use ColumnTransformer to make a preprocessor object that describes all the treatments to be done
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [22]:
# Preprocessings on train set
print("Performing preprocessings on train set...")
print(X_train.head())
X_train = preprocessor.fit_transform(X_train)
print('...Done.')
print(X_train[0:5]) # MUST use this syntax because X_train is a numpy array and not a pandas DataFrame anymore
print()

# Preprocessings on test set
print("Performing preprocessings on test set...")
print(X_test.head()) 
X_test = preprocessor.transform(X_test) # Don't fit again !!
print('...Done.')
print(X_test[0:5,:]) # MUST use this syntax because X_test is a numpy array and not a pandas DataFrame anymore
print()

Performing preprocessings on train set...
   Country   Age Purchased
4  Germany  40.0       Yes
9   France  37.0       Yes
1    Spain  27.0       Yes
6    Spain   NaN        No
7   France  48.0       Yes
...Done.
[[ 0.27063731  1.          0.          1.        ]
 [-0.24603392  0.          0.          1.        ]
 [-1.96827133  0.          1.          1.        ]
 [ 0.          0.          1.          0.        ]
 [ 1.64842723  0.          0.          1.        ]]

Performing preprocessings on test set...
   Country   Age Purchased
2  Germany  30.0        No
8  Germany  50.0        No
...Done.
[[-1.4516001   1.          0.          0.        ]
 [ 1.99287472  1.          0.          0.        ]]



### Cross-validated score for a Ridge model (with default value of $\lambda$)

In [24]:
# Perform 3-fold cross-validation to evaluate the generalized R2 score obtained with a Ridge model
print("3-fold cross-validation...")

# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
regressor = Ridge()

# No explicit fit here: cross_val_score fits a fresh copy of the model on each training fold
scores = cross_val_score(regressor, X_train, Y_train, cv=3) # Note that we only pass X_train: the validation folds are carved out of it automatically
print('The cross-validated R2-score is : ', scores.mean())
print('The standard deviation is       : ', scores.std())

3-fold cross-validation...
The cross-validated R2-score is :  0.7148695556594437
The standard deviation is       :  0.09273865480146797
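
To make explicit what `cross_val_score` does, the loop below reproduces it by hand. It runs on synthetic data generated with `make_regression` (an assumption made only so the example is self-contained). For a regressor and an integer `cv`, scikit-learn uses an unshuffled `KFold`, so the manual scores should match the library's.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for (X_train, Y_train)
X_demo, y_demo = make_regression(n_samples=30, n_features=3, noise=10.0, random_state=0)

regressor_demo = Ridge()
manual_scores = []
for train_idx, val_idx in KFold(n_splits=3).split(X_demo):
    model = clone(regressor_demo)                      # fresh, unfitted copy for each fold
    model.fit(X_demo[train_idx], y_demo[train_idx])    # fit on the 2 training folds
    manual_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))  # R2 on the held-out fold
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(regressor_demo, X_demo, y_demo, cv=3)

print("Manual scores  : ", manual_scores)
print("Library scores : ", library_scores)
```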


### Grid search : tune $\lambda$

In [26]:
# Perform grid search
print("Grid search...")
regressor = Ridge()
# Grid of values to be tested
params = {
    'alpha': [0.0, 0.1, 0.5, 1.0]                                               # 0 corresponds to no regularization
}
gridsearch = GridSearchCV(regressor, param_grid = params, cv = 3)               # cv : the number of folds to be used for CV
gridsearch.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", gridsearch.best_params_)
print("Best R2 score        : ", gridsearch.best_score_)                         # Mean cross-validated R² of the best setting, averaged over the CV folds

Grid search...
...Done.
Best hyperparameters :  {'alpha': 1.0}
Best R2 score        :  0.7148695556594437
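
One possible variation, not used in this course: the preprocessing can be placed inside a `Pipeline` that is itself passed to `GridSearchCV`, so that the scaler is refit on each training fold instead of once on the whole training set. In the sketch below, the step names `scaler` and `regressor`, the grid values and the synthetic data are all hypothetical; a hyperparameter of a step is addressed as `<step>__<param>`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the raw (unscaled) training data
X_demo, y_demo = make_regression(n_samples=60, n_features=5, noise=15.0, random_state=0)

# Preprocessing and model chained together
model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", Ridge()),
])

# "regressor__alpha" targets the alpha parameter of the "regressor" step
param_grid = {"regressor__alpha": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(model, param_grid=param_grid, cv=3)
search.fit(X_demo, y_demo)

print("Best hyperparameters : ", search.best_params_)
print("Best R2 score        : ", search.best_score_)
```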


### Performance assessment

In [29]:
# Print R^2 scores
print("R2 score on training set : ", gridsearch.score(X_train, Y_train))
print("R2 score on test set     : ", gridsearch.score(X_test, Y_test))

R2 score on training set :  0.8859961574542502
R2 score on test set     :  0.9316887810489011


## Final remarks
* Here, we can see that the model's generalized performance was improved by using a Ridge regression and tuning the value of the regularization strength. 
* Indeed, without regularization, the $R^2$ typically varies between 0.6 and 0.8, whereas with a regularized model we achieve a test score greater than 0.9.

In [27]:
gridsearch.cv_results_

{'mean_fit_time': array([0.0012792 , 0.0012931 , 0.        , 0.00072805]),
 'std_fit_time': array([0.00180906, 0.00182872, 0.        , 0.00043035]),
 'mean_score_time': array([0.        , 0.00129104, 0.        , 0.        ]),
 'std_score_time': array([0.       , 0.0018258, 0.       , 0.       ]),
 'param_alpha': masked_array(data=[0.0, 0.1, 0.5, 1.0],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'alpha': 0.0}, {'alpha': 0.1}, {'alpha': 0.5}, {'alpha': 1.0}],
 'split0_test_score': array([0.5574292 , 0.61985315, 0.76254335, 0.84245348]),
 'split1_test_score': array([-4.14353202,  0.66362103,  0.66261998,  0.62476325]),
 'split2_test_score': array([0.48396732, 0.56311698, 0.64474707, 0.67739194]),
 'mean_test_score': array([-1.03404517,  0.61553039,  0.68997013,  0.71486956]),
 'std_test_score': array([2.19894376, 0.0411443 , 0.05183315, 0.09273865]),
 'rank_test_score': array([4, 3, 2, 1])}
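
The raw `cv_results_` dictionary is easier to read as a table. A minimal sketch, assuming synthetic data in place of the course dataset; the columns selected are keys that appear in the dictionary above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (X_train, Y_train)
X_demo, y_demo = make_regression(n_samples=40, n_features=3, noise=10.0, random_state=0)

search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 0.5, 1.0]}, cv=3)
search.fit(X_demo, y_demo)

# One row per tested value of alpha, best-ranked first
results = pd.DataFrame(search.cv_results_)
summary = results[["param_alpha", "mean_test_score", "std_test_score", "rank_test_score"]]
summary = summary.sort_values("rank_test_score")
print(summary)
```

Reading the course results this way, `alpha = 0.0` ranks last because of one fold with a score of about -4.14, while `alpha = 1.0` ranks first with a mean score of about 0.71.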