<img src="images/regularization.png" alt="drawing"/>

# **Regularization**

## Data Definition

Import all the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import Ridge, RidgeCV, LassoCV, ElasticNetCV
from sklearn.metrics import mean_absolute_error, mean_squared_error


In [2]:
def error_evaluation(y_test, y_test_predictions):
    MAE = mean_absolute_error(y_test, y_test_predictions)
    RMSE = np.sqrt(mean_squared_error(y_test, y_test_predictions))
    print(f'MAE = {round(MAE, 2)}')
    print(f'RMSE = {round(RMSE, 2)}')
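
To make the two metrics concrete, here is a small hand-checkable sketch that computes them directly with NumPy instead of the scikit-learn helpers. The three values in <code>y_true</code> and <code>y_pred</code> are made up for illustration and are not taken from the advertising data.

```python
import numpy as np

# Made-up values, chosen so the arithmetic is easy to follow
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

errors = y_true - y_pred                # [ 1.,  0., -2.]
mae = np.mean(np.abs(errors))           # (1 + 0 + 2) / 3
rmse = np.sqrt(np.mean(errors ** 2))    # sqrt((1 + 0 + 4) / 3)

print(f'MAE = {round(mae, 2)}')
print(f'RMSE = {round(rmse, 2)}')
```

Because the errors are squared before averaging, RMSE penalises the single large error more heavily than MAE does.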

Read in the advertising data and define the features $X$ and response $y$

In [3]:
advertising = pd.read_csv('data/Advertising.csv')
X = advertising.drop('sales', axis="columns")
y = advertising['sales']
advertising.head()

|   | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 180.8 | 10.8 | 58.4 | 12.9 |


Generate third-degree polynomial features and split them into training and test data

In [4]:
polynomial_converter = PolynomialFeatures(degree=3, include_bias=False)
poly_features = polynomial_converter.fit_transform(X, y)

X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)

print(f"Raw data shape: {poly_features.shape}")
print(f"Train data shape: {X_train.shape}")


Raw data shape: (200, 19)
Train data shape: (140, 19)
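
The 19 columns can be checked by counting: with 3 input features and <code>degree=3</code>, every product of up to three features (with repetition) becomes one column, and <code>include_bias=False</code> drops the constant term. The sketch below counts those terms with the standard library only. The <code>names</code> list and the <code>terms</code> labels are illustrative and are not the column names scikit-learn would generate.

```python
from itertools import combinations_with_replacement
from math import comb

n_features = 3   # TV, radio, newspaper
degree = 3

# Closed form: monomials of total degree <= 3 in 3 variables, minus the constant term
closed_form = comb(n_features + degree, degree) - 1

# Explicit enumeration: each term is a multiset of d features, for d = 1..degree
names = ["TV", "radio", "newspaper"]
terms = [
    "*".join(combo)
    for d in range(1, degree + 1)
    for combo in combinations_with_replacement(names, d)
]

print(closed_form, len(terms))
print(terms[:6])
```

Both counts come out at 19: 3 linear terms, 6 terms of degree two, and 10 terms of degree three.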


## Feature Scaling

Set up a <code>StandardScaler</code> object from the scikit-learn library and scale the training and test data. Fit the scaler on the training data only, so that no information from the test set leaks into the model.

In [5]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled[0]

array([ 0.49300171, -0.33994238,  1.61586707,  0.28407363, -0.02568776,
        1.49677566, -0.59023161,  0.41659155,  1.6137853 ,  0.08057172,
       -0.05392229,  1.01524393, -0.36986163,  0.52457967,  1.48737034,
       -0.66096022, -0.16360242,  0.54694754,  1.37075536])
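
To show what fitting on the training data only means in practice, the sketch below repeats the scaling by hand with NumPy: the mean and standard deviation are taken from the training rows and then reused for the test rows. The arrays <code>train</code> and <code>test</code> are synthetic stand-ins, not the advertising data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the train and test feature matrices
train = rng.normal(loc=50.0, scale=10.0, size=(140, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(60, 3))

# Statistics come from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0).round(6))  # column means of the scaled training data
print(train_scaled.std(axis=0).round(6))   # column standard deviations of the scaled training data
print(test_scaled.mean(axis=0).round(2))   # test means are close to, but not exactly, zero
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The test columns do not, and that is expected, since their statistics were never used.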

## L2 Regularization - Ridge Regression

According to the scikit-learn documentation for the <code>Ridge</code> class, the model minimizes the objective function

$||y - Xw||^2_2 + \alpha ||w||^2_2$

Where 

$y$ = response

$X$ = feature data

$w$ = coefficient vector (the estimated weights)

$\alpha$ = tuning parameter

In [6]:
ridge_model = Ridge(alpha=10)
ridge_model.fit(X_train_scaled, y_train)

y_test_predictions_ridge = ridge_model.predict(X_test_scaled)
error_evaluation(y_test, y_test_predictions_ridge)

MAE = 0.58
RMSE = 0.89
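
As a rough illustration of what the penalty term does, the sketch below solves the ridge objective in closed form, $w = (X^T X + \alpha I)^{-1} X^T y$, with NumPy on synthetic data. It assumes there is no intercept term, which differs from the <code>Ridge</code> default. The helper <code>ridge_closed_form</code>, the data, and <code>true_w</code> are made up for illustration.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Minimise ||y - Xw||^2 + alpha * ||w||^2 (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                  # synthetic features
true_w = np.array([3.0, -2.0, 0.5, 0.0])      # hypothetical coefficients
y = X @ true_w + rng.normal(scale=0.1, size=50)

norms = []
for alpha in (0.0, 10.0, 1000.0):
    w = ridge_closed_form(X, y, alpha)
    norms.append(np.linalg.norm(w))
    print(f"alpha = {alpha}: ||w|| = {round(norms[-1], 3)}")
```

The norm of the coefficient vector shrinks as $\alpha$ grows, which is the trade-off that the tuning parameter controls.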


We do not know beforehand which value of $\alpha$ to use. Fortunately, there is a separate <code>RidgeCV</code> class that builds cross-validation over several $\alpha$ values into the model.

By default <code>alphas = (0.1, 1.0, 10.0)</code> and <code>cv = None</code>. The <code>cv</code> parameter sets the cross-validation strategy, for example the number of folds in K-fold cross-validation. If it is left at <code>None</code>, leave-one-out cross-validation is used. Leave-one-out is usually computationally expensive, but <code>RidgeCV</code> uses an efficient formulation of it.

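With leave-one-out cross-validation, each training sample is held out once as a single-sample validation set while the model is fitted on the remaining samples. Every training sample is therefore used for both fitting and validation, just never in the same fold.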
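
The sketch below spells out leave-one-out selection of $\alpha$ by brute force: for each candidate it refits a model $n$ times, each time leaving one row out, and averages the absolute errors on the held-out rows. This is only meant to show the idea. It refits explicitly instead of using the efficient formulation in <code>RidgeCV</code>, it ignores the intercept, and <code>ridge_fit</code>, <code>loo_mae</code>, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    # Closed-form ridge solution without an intercept (a simplification)
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def loo_mae(X, y, alpha):
    """Mean absolute error over n leave-one-out folds."""
    n = len(y)
    abs_errors = []
    for i in range(n):
        keep = np.arange(n) != i            # every row except row i
        w = ridge_fit(X[keep], y[keep], alpha)
        abs_errors.append(abs(y[i] - X[i] @ w))
    return float(np.mean(abs_errors))

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))                # synthetic features
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=30)

alphas = (0.1, 1.0, 10.0)
scores = {alpha: loo_mae(X, y, alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)    # lowest validation error wins

print(scores)
print("Best alpha:", best_alpha)
```

scikit-learn scorers follow a higher-is-better convention, which is why the corresponding metric is named <code>neg_mean_absolute_error</code>; minimising the MAE here is the same choice.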

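The best $\alpha$ value is chosen according to a scoring metric. The available metric names can be listed with <code>SCORERS.keys()</code>; recent scikit-learn releases have removed <code>SCORERS</code> in favour of <code>sklearn.metrics.get_scorer_names()</code>.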

In [7]:
ridge_cv_model = RidgeCV(alphas=(0.1, 1.0, 10.0), scoring='neg_mean_absolute_error')
ridge_cv_model.fit(X_train_scaled, y_train)
print(f"Best alpha: {round(ridge_cv_model.alpha_,2)} \n")  

y_test_predictions_ridge_cv = ridge_cv_model.predict(X_test_scaled)
error_evaluation(y_test, y_test_predictions_ridge_cv)

Best alpha: 0.1 

MAE = 0.43
RMSE = 0.62


## L1 Regularization - LASSO Regression

According to the scikit-learn documentation for the <code>Lasso</code> class, the model minimizes the objective function

$\frac{1}{2 n_{samples}} ||y - Xw||^2_2 + \alpha ||w||_1$

Where 

$n_{samples}$ = number of rows in the data

$y$ = response

$X$ = feature data

$w$ = coefficient vector (the estimated weights)

$\alpha$ = tuning parameter
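
To see how the pieces of the LASSO objective fit together, here is a tiny worked example in NumPy with numbers that can be checked by hand. The helper <code>lasso_objective</code> and all values are made up for illustration. The function only evaluates the objective for a given $w$ and does not fit anything.

```python
import numpy as np

def lasso_objective(X, y, w, alpha):
    """Evaluate 1 / (2 * n_samples) * ||y - Xw||^2 + alpha * ||w||_1."""
    n_samples = X.shape[0]
    residual = y - X @ w
    return residual @ residual / (2 * n_samples) + alpha * np.sum(np.abs(w))

X = np.array([[1.0, 0.0],
              [0.0, 1.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, 1.0])

# residual = [0, 1], so the data term is 1 / (2 * 2) = 0.25
# ||w||_1 = 2, so the penalty term is 0.5 * 2 = 1.0
value = lasso_objective(X, y, w, alpha=0.5)
print(value)  # 1.25
```

Note that the penalty sums absolute values of the coefficients, whereas the ridge penalty sums their squares.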

As with ridge regression, we do not know beforehand which value of $\alpha$ to use in the LASSO regression. <code>LassoCV</code> either takes an explicit list of candidate values through <code>alphas</code>, or builds its own grid of <code>n_alphas</code> values whose range is set by <code>eps</code>.

You may get a warning that the fit did not converge for some $\alpha$ values. To resolve this, you can increase the number of iterations <code>max_iter</code>, increase the alpha ratio <code>eps</code> ($\epsilon$ in the formula that follows), or adjust the tolerance <code>tol</code>.

$\epsilon = \frac{\alpha_{min}}{\alpha_{max}}$

*Note: This is only a warning; the model is still fitted and can be used.*
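
The ratio $\epsilon$ can be pictured with a short NumPy sketch that builds a grid of candidate $\alpha$ values from <code>eps</code> and <code>n_alphas</code>. The value of <code>alpha_max</code> is hypothetical (in practice it depends on the data), and the logarithmic spacing is an assumption about how the candidates are laid out, so treat this as an illustration and not as the exact scikit-learn internals.

```python
import numpy as np

eps = 0.001        # alpha_min / alpha_max, as passed to LassoCV
n_alphas = 100     # number of candidate values
alpha_max = 1.0    # hypothetical value; in practice it depends on the data

alpha_min = eps * alpha_max
# Log-spaced candidates from the largest to the smallest alpha
alphas = np.geomspace(alpha_max, alpha_min, num=n_alphas)

print(alphas[0], alphas[-1])
print(alphas[-1] / alphas[0])   # recovers eps up to floating-point error
```

A larger <code>eps</code> raises the smallest candidate $\alpha$, which keeps the search away from the weakly regularized fits that tend to need the most iterations.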

In [8]:
lasso_cv_model = LassoCV(eps=0.001, n_alphas=100, cv=5, max_iter=1_000_000)
lasso_cv_model.fit(X_train_scaled, y_train)
print(f"Best alpha: {round(lasso_cv_model.alpha_, 4)} \n") 

y_test_predictions_lasso_cv = lasso_cv_model.predict(X_test_scaled)
error_evaluation(y_test, y_test_predictions_lasso_cv)

Best alpha: 0.0049 

MAE = 0.43
RMSE = 0.61


## L1 & L2 Regularization - Elastic Net

According to the scikit-learn documentation for the <code>ElasticNet</code> class, the model minimizes the objective function

$\frac{1}{2 n_{samples}} ||y - Xw||^2_2 + \alpha \gamma ||w||_1 + \frac{1}{2} \alpha (1 - \gamma) ||w||^2_2$

Where:

$\gamma$ = L1 ratio, a mixing weight between 0 and 1 ($\gamma = 1$ gives a pure L1 penalty, $\gamma = 0$ a pure L2 penalty)
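
The role of $\gamma$ is easiest to see by evaluating the penalty part of the objective on its own. The sketch below follows the scikit-learn <code>ElasticNet</code> documentation, where the L1 term is weighted by $\gamma$ and the L2 term by $(1 - \gamma)$. The helper <code>elastic_net_penalty</code> and the coefficient values are made up for illustration.

```python
import numpy as np

def elastic_net_penalty(w, alpha, l1_ratio):
    """alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2"""
    l1 = np.sum(np.abs(w))
    l2_squared = np.sum(w ** 2)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2_squared

w = np.array([3.0, -4.0])   # made-up coefficients: ||w||_1 = 7, ||w||_2^2 = 25
alpha = 0.1

pure_l1 = elastic_net_penalty(w, alpha, l1_ratio=1.0)   # 0.1 * 7 = 0.7
pure_l2 = elastic_net_penalty(w, alpha, l1_ratio=0.0)   # 0.5 * 0.1 * 25 = 1.25
mixed = elastic_net_penalty(w, alpha, l1_ratio=0.5)     # 0.35 + 0.625 = 0.975

print(pure_l1, pure_l2, mixed)
```

With <code>l1_ratio=1.0</code> the L2 term vanishes and the penalty is exactly the LASSO penalty $\alpha ||w||_1$.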


This works the same way as the <code>LassoCV</code> class. In addition, the L1 ratio $\gamma$ (<code>l1_ratio</code>) is chosen through cross-validation from a list of candidate values. The scikit-learn documentation recommends putting more candidates close to 1 (towards LASSO) than close to 0 (towards ridge).

In [9]:
elastic_net_cv_model = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], 
                                    eps=0.001, n_alphas=100, max_iter=1_000_000)
elastic_net_cv_model.fit(X_train_scaled, y_train)
print(f"Best L1 ratio: {round(elastic_net_cv_model.l1_ratio_, 2)}") 
print(f"Best alpha: {round(elastic_net_cv_model.alpha_, 4)} \n") 

y_test_predictions_elastic_net_cv = elastic_net_cv_model.predict(X_test_scaled)
error_evaluation(y_test, y_test_predictions_elastic_net_cv)

Best L1 ratio: 1.0
Best alpha: 0.0049 

MAE = 0.43
RMSE = 0.61


As the output shows, the selected L1 ratio $\gamma$ is 1.0, which means the model uses only the L1 penalty (LASSO). Consequently, the tuning parameter $\alpha$ and the MAE/RMSE errors of the Elastic Net model match those of the LASSO model.