# Regularization and hyperparameter optimization with scikit-learn 🎯🎯

## What will you learn in this course? 🧐🧐

Let's come back to yesterday's multiple linear regression model. Now, our objective is to implement a **regularized** linear regression. In this process, we'll use:
* a Ridge linear regression model
* cross-validation to estimate how the generalized $R^2$ score varies depending on the choice of the validation set
* cross-validated grid search to tune the value of the regularization strength

It's quite an ambitious program, isn't it? 🥵🥵

But don't worry, with scikit-learn's dedicated classes, it will be quite easy and straightforward to implement 😌😌

* Training Pipeline
    * Cross-validated score for a Ridge model (with default value of $\lambda$)
    * Grid search : tune $\lambda$
* Test pipeline
* Final remarks
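
As a reminder of what "regularized" means here: Ridge adds an L2 penalty $\lambda \sum_j \beta_j^2$ to the least-squares loss (scikit-learn calls $\lambda$ `alpha`). The NumPy sketch below is only an illustration on synthetic data; the helper name `ridge_closed_form` and the data are made up, and the intercept is left out to keep the formula short. It suggests how the coefficients shrink as $\lambda$ grows:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y (no intercept, illustration only)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data, made up for this illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# The norm of the coefficient vector decreases as lambda increases
for lam in [0.0, 1.0, 10.0, 100.0]:
    beta = ridge_closed_form(X_demo, y_demo, lam)
    print(f"lambda = {lam:6.1f} -> ||beta|| = {np.linalg.norm(beta):.3f}")
```

With `lam = 0.0` the formula reduces to ordinary least squares, which is why 0 is described later as "no regularization".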

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import r2_score

In [2]:
# Import dataset
print("Loading dataset...")
dataset = pd.read_csv("Data.csv")
print("...Done.")
print()

Loading dataset...
...Done.



In [3]:
# Separate target variable Y from features X
print("Separating labels from features...")
features_list = ["Country", "Age", "Purchased"]
target_variable = "Salary"

X = dataset.loc[:,features_list]
Y = dataset.loc[:,target_variable]

print("...Done.")
print()

print('Y : ')
print(Y.head())
print()
print('X :')
print(X.head())

Separating labels from features...
...Done.

Y : 
0    72000
1    48000
2    54000
3    61000
4    69000
Name: Salary, dtype: int64

X :
   Country   Age Purchased
0   France  44.0        No
1    Spain  27.0       Yes
2  Germany  30.0        No
3    Spain  38.0        No
4  Germany  40.0       Yes


In [4]:
# Automatically detect positions of numeric/categorical features
idx = 0
numeric_features = []
numeric_indices = []
categorical_features = []
categorical_indices = []
for i,t in X.dtypes.items():  # Series.iteritems() was removed in pandas 2.0
    if ('float' in str(t)) or ('int' in str(t)) :
        numeric_features.append(i)
        numeric_indices.append(idx)
    else :
        categorical_features.append(i)
        categorical_indices.append(idx)

    idx = idx + 1

print('Found numeric features ', numeric_features,' at positions ', numeric_indices)
print('Found categorical features ', categorical_features,' at positions ', categorical_indices)

Found numeric features  ['Age']  at positions  [1]
Found categorical features  ['Country', 'Purchased']  at positions  [0, 2]
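
The same detection can probably be written more compactly with `DataFrame.select_dtypes`. This is a sketch on a small made-up frame that reuses the column names above (`X_demo` is a hypothetical stand-in for `X`), not a replacement for the cell above:

```python
import pandas as pd

# Small made-up frame with the same column names as the dataset above
X_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Purchased": ["No", "Yes", "No"],
})

# Column names grouped by dtype
numeric_features = X_demo.select_dtypes(include="number").columns.tolist()
categorical_features = X_demo.select_dtypes(exclude="number").columns.tolist()

# Positions of these columns in the DataFrame
numeric_indices = [X_demo.columns.get_loc(c) for c in numeric_features]
categorical_indices = [X_demo.columns.get_loc(c) for c in categorical_features]

print("Found numeric features ", numeric_features, " at positions ", numeric_indices)
print("Found categorical features ", categorical_features, " at positions ", categorical_indices)
```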


In [5]:
# Divide dataset into train set & test set
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, 
                                                    random_state=0)
print("...Done.")
print()

Dividing into train and test sets...
...Done.



In [6]:
# Convert pandas DataFrames to numpy arrays before using scikit-learn
print("Convert pandas DataFrames to numpy arrays...")
X_train = X_train.values
X_test = X_test.values
Y_train = Y_train.tolist()
Y_test = Y_test.tolist()
print("...Done")

print(X_train[0:5,:])
print(X_test[0:2,:])
print()
print(Y_train[0:5])
print(Y_test[0:2])

Convert pandas DataFrames to numpy arrays...
...Done
[['Germany' 40.0 'Yes']
 ['France' 37.0 'Yes']
 ['Spain' 27.0 'Yes']
 ['Spain' nan 'No']
 ['France' 48.0 'Yes']]
[['Germany' 30.0 'No']
 ['Germany' 50.0 'No']]

[69000, 67000, 48000, 52000, 79000]
[54000, 83000]


## Training pipeline

In [7]:
# Missing values
print("Imputing missing values...")
print(X_train[0:5,:])
print()
imputer = SimpleImputer(strategy="mean")
X_train[:,numeric_indices] = imputer.fit_transform(X_train[:,numeric_indices])
print("...Done!")
print(X_train[0:5,:]) 
print() 

Imputing missing values...
[['Germany' 40.0 'Yes']
 ['France' 37.0 'Yes']
 ['Spain' 27.0 'Yes']
 ['Spain' nan 'No']
 ['France' 48.0 'Yes']]

...Done!
[['Germany' 40.0 'Yes']
 ['France' 37.0 'Yes']
 ['Spain' 27.0 'Yes']
 ['Spain' 38.42857142857143 'No']
 ['France' 48.0 'Yes']]
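
To make explicit what the imputer learns, here is a minimal sketch on made-up ages (the arrays `ages_train` and `ages_test` are hypothetical). It illustrates that `fit_transform` stores the mean of the non-missing training values, and that `transform` reuses this stored mean on new data instead of recomputing it:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up ages for illustration (one missing value in each array)
ages_train = np.array([[40.0], [37.0], [27.0], [np.nan], [48.0]])
ages_test = np.array([[np.nan], [50.0]])

imputer_demo = SimpleImputer(strategy="mean")
ages_train_filled = imputer_demo.fit_transform(ages_train)

# The learned statistic is the mean of the non-missing training values
print(imputer_demo.statistics_)
print(np.nanmean(ages_train))  # same value computed by hand

# transform() fills missing test values with the training mean
ages_test_filled = imputer_demo.transform(ages_test)
print(ages_test_filled)
```

This is the reason why the test pipeline only calls `transform`: the test set must not influence the statistics used for preprocessing.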



In [8]:
# Encoding categorical features and standardizing numerical features
print("Encoding categorical features and standardizing numerical features...")
print()
print(X_train[0:5,:])

# Standardization (zero mean, unit variance)
numeric_transformer = StandardScaler()

# One-hot encoding (dummy variables); the first category of each feature is dropped
categorical_transformer = OneHotEncoder(drop='first')

featureencoder = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_indices),    
        ('num', numeric_transformer, numeric_indices)
        ]
    )

X_train = featureencoder.fit_transform(X_train)
print("...Done")
print(X_train[0:5,:])

Encoding categorical features and standardizing numerical features...

[['Germany' 40.0 'Yes']
 ['France' 37.0 'Yes']
 ['Spain' 27.0 'Yes']
 ['Spain' 38.42857142857143 'No']
 ['France' 48.0 'Yes']]
...Done
[[ 1.          0.          1.          0.27063731]
 [ 0.          0.          1.         -0.24603392]
 [ 0.          1.          1.         -1.96827133]
 [ 0.          1.          0.          0.        ]
 [ 0.          0.          1.          1.64842723]]
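
As a possible alternative to applying each step by hand, the imputation, encoding, scaling and model can be chained in a scikit-learn `Pipeline`. The sketch below is an assumption about how this could look, not the approach used in this notebook; the rows are made up and `model`, `preprocessor` and `numeric_pipeline` are illustrative names:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with the same columns as the dataset above
X_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany", "France"],
    "Age": [44.0, 27.0, 30.0, np.nan, 40.0, 35.0],
    "Purchased": ["No", "Yes", "No", "No", "Yes", "Yes"],
})
y_demo = [72000, 48000, 54000, 61000, 69000, 58000]

# Numeric columns: impute missing values, then standardize
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Route each group of columns to its own transformer
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(drop="first"), ["Country", "Purchased"]),
    ("num", numeric_pipeline, ["Age"]),
])

# Preprocessing and Ridge regression in a single estimator
model = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", Ridge()),
])

model.fit(X_demo, y_demo)
print(model.predict(X_demo.iloc[:2]))
```

One design benefit of this layout is that the whole pipeline can be passed to `cross_val_score` or `GridSearchCV`, so the preprocessing is refit on the training part of each fold.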


### Cross-validated score for a Ridge model (with default value of $\lambda$)

In [9]:
# Perform 3-fold cross-validation to evaluate the generalized R2 score obtained with a Ridge model
print("3-fold cross-validation...")
regressor = Ridge()
scores = cross_val_score(regressor, X_train, Y_train, cv=3)
print('The cross-validated R2-score is : ', scores.mean())
print('The standard deviation is : ', scores.std())

3-fold cross-validation...
The cross-validated R2-score is :  0.7148695556594437
The standard deviation is :  0.09273865480146797
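
To see what `cross_val_score` does with `cv=3` for a regressor, the loop below reproduces it by hand with `KFold`. It is a sketch on synthetic data (`X_demo`, `y_demo` and `regressor_demo` are made up for the illustration):

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, made up for this illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=30)

regressor_demo = Ridge()
manual_scores = []
for train_idx, val_idx in KFold(n_splits=3).split(X_demo):
    fold_model = clone(regressor_demo)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # score() returns the R2 on the held-out fold
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

print("Manual R2 per fold : ", manual_scores)
print("cross_val_score    : ", cross_val_score(regressor_demo, X_demo, y_demo, cv=3))
print("Mean / std         : ", np.mean(manual_scores), np.std(manual_scores))
```

Each fold serves once as the validation set, which is why the result is reported as a mean together with a standard deviation.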


### Grid search : tune $\lambda$

In [10]:
# Perform grid search
print("Grid search...")
regressor = Ridge()
# Grid of values to be tested
params = {
    'alpha': [0.0, 0.1, 0.5, 1.0] # 0 corresponds to no regularization
}
gridsearch = GridSearchCV(regressor, param_grid = params, cv = 3) # cv : the number of folds to be used for CV
gridsearch.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", gridsearch.best_params_)
print("Best R2 score : ", gridsearch.best_score_)

Grid search...
...Done.
Best hyperparameters :  {'alpha': 1.0}
Best R2 score :  0.7148695556594437
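
Beyond `best_params_` and `best_score_`, the fitted grid search keeps the score of every candidate in `cv_results_`. The sketch below shows one way to inspect it, using synthetic data and an illustrative grid (`gridsearch_demo` and its values of `alpha` are assumptions, not the grid used above):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, made up for this illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=30)

gridsearch_demo = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 0.5, 1.0, 10.0]}, cv=3)
gridsearch_demo.fit(X_demo, y_demo)

# One row per candidate value of alpha, with its cross-validated R2
results = pd.DataFrame(gridsearch_demo.cv_results_)
print(results[["param_alpha", "mean_test_score", "std_test_score", "rank_test_score"]])
```

In the run above, the selected value `alpha = 1.0` is the largest one in the grid, so it may be worth extending the grid towards larger values and checking whether the score keeps improving.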


In [11]:
# Predictions on training set
# The model has already been re-trained on the whole training set at the end of the grid search, so we can directly use it!
print("Predictions on training set...")
Y_train_pred = gridsearch.predict(X_train)
print("...Done.")
print(Y_train_pred)
print()

Predictions on training set...
...Done.
[68252.37511415 63990.49954339 47911.64524901 58977.5379539
 76876.08538807 58475.50214177 69868.68885626 61647.66575345]



## Test pipeline

In [12]:
# Missing values
print("Imputing missing values...")
print(X_test[0:5,:])
print()

X_test[:,numeric_indices] = imputer.transform(X_test[:,numeric_indices])
print("...Done!")
print(X_test[0:5,:]) 
print() 

Imputing missing values...
[['Germany' 30.0 'No']
 ['Germany' 50.0 'No']]

...Done!
[['Germany' 30.0 'No']
 ['Germany' 50.0 'No']]



In [13]:
# Encoding categorical features and standardizing numerical features
print("Encoding categorical features and standardizing numerical features...")
print()
print(X_test[0:5,:])

X_test = featureencoder.transform(X_test)
print("...Done")
print(X_test[0:5,:])

Encoding categorical features and standardizing numerical features...

[['Germany' 30.0 'No']
 ['Germany' 50.0 'No']]
...Done
[[ 1.          0.          0.         -1.4516001 ]
 [ 1.          0.          0.          1.99287472]]


In [14]:
# Predictions on test set
print("Predictions on test set...")
Y_test_pred = gridsearch.predict(X_test)
print("...Done.")
print(Y_test_pred)
print()

Predictions on test set...
...Done.
[54216.47721253 77644.81511193]



In [15]:
# Print R^2 scores on train/test sets for the Ridge model with optimal value of the regularization strength
print("R2 score on training set : ", r2_score(Y_train, Y_train_pred))
print("R2 score on test set : ", r2_score(Y_test, Y_test_pred))

R2 score on training set :  0.8859961574542502
R2 score on test set :  0.9316887810489011
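
For reference, the $R^2$ returned by `r2_score` is $1 - SS_{res} / SS_{tot}$. The sketch below recomputes the test score by hand from the values printed in the outputs above (the variable names ending in `_demo` are introduced only for this illustration):

```python
import numpy as np

# Values copied from the outputs above
Y_test_demo = np.array([54000, 83000])
Y_test_pred_demo = np.array([54216.47721253, 77644.81511193])

ss_res = np.sum((Y_test_demo - Y_test_pred_demo) ** 2)    # residual sum of squares
ss_tot = np.sum((Y_test_demo - Y_test_demo.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot
print("R2 score on test set (by hand) : ", r2_manual)
```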


## Final remarks
Here, we can see that the model's generalization performance was improved by using a Ridge regression and tuning the value of the regularization strength. Indeed, without regularization, the $R^2$ typically varies between 0.6 and 0.8, whereas with a regularized model we achieve a test score greater than 0.9 🥳🥳 Keep in mind that the test set contains only two observations here, so this test score is a rough estimate.