## <center>Regularization with SciKit-Learn</center>

Previously we created a new polynomial feature set and then applied our standard linear regression on it, but we can be smarter about model choice and utilize regularization.

Regularization minimizes the RSS (residual sum of squares) *plus* a penalty term. This penalty term penalizes models whose coefficients are too large. Some regularization methods will actually drive the coefficients of unhelpful features to exactly zero, in which case the model ignores those features.

Let's explore two methods of regularization, Ridge Regression and Lasso, followed by Elastic Net, which combines them. We'll apply these to the polynomial feature set, since regularization would not be as effective on the small original feature set X.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Data and Setup

In [2]:
df = pd.read_csv("C:/Users/Lenovo/Desktop/Python/Machine Learning/Supervised Learning/Linear regression/Data Sets/Advertising.csv")
X = df.drop('sales',axis=1)
y = df['sales']

#### Polynomial Conversion

In [3]:
from sklearn.preprocessing import PolynomialFeatures

In [4]:
polynomial_converter = PolynomialFeatures(degree=3,include_bias=False)
poly_features = polynomial_converter.fit_transform(X)

#### Train, Test and Split

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=42)

#### Scaling the Data

While the original features in this data set are all on the same scale (thousands of dollars spent), that is typically not the case, and the polynomial features created above differ in magnitude anyway. Because the penalty term in regularized models sums the coefficients together, it's important to standardize the features first. Review the theory videos for more info, as well as a discussion on why we only **fit** the scaler to the training data and then **transform** both sets separately.

In [7]:
from sklearn.preprocessing import StandardScaler

StandardScaler applies z-score standardization, using the mean $\bar{X}$ and standard deviation $\sigma$ of each feature:

$$X_{scaled} = \frac{X-\bar{X}}{\sigma}$$

In [8]:
scaler = StandardScaler()
scaler.fit(X_train)  # To avoid data leakage, the scaler is fitted only on the training set.
                     # y_train and y_test do not need to be scaled.

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
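As a quick check of the formula, the sketch below reproduces `StandardScaler` by hand. It uses synthetic arrays (`X_train_demo` and `X_test_demo` are stand-ins, not the Advertising data) and shows that the test set is scaled with the *training* mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for the train/test feature matrices (assumption, not the real data)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

demo_scaler = StandardScaler().fit(X_train_demo)

# Manual z-score using statistics computed on the training set only
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

manual_test = (X_test_demo - train_mean) / train_std
sklearn_test = demo_scaler.transform(X_test_demo)

print(np.allclose(manual_test, sklearn_test))
# The scaled test set is close to, but not exactly, zero-mean,
# because the statistics come from the training set.
print(sklearn_test.mean(axis=0))
```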

### Ridge Regression

Ridge regression is also a linear model for regression, so the formula it uses to make predictions is the same one used for ordinary least squares. In ridge regression, though, the coefficients (w, written as β in the formula below) are chosen not only so that they predict well on the training data, but also to fit an additional constraint. We also want the magnitude of coefficients to be as small as possible; in other words, all entries of w should be close to zero. Intuitively, this means each feature should have as little effect on the outcome as possible (which translates to having a small slope), while still predicting well. This constraint is an example of what is called regularization. Regularization means explicitly restricting a model to avoid overfitting. The particular kind used by ridge regression is known as L2 regularization.

Ridge regression minimizes the objective function:

$$Error = \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_{j}x_{ij})^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

scikit-learn's Ridge solves a regression problem where the loss function is the linear least squares function and the regularization is given by the L2 norm of the coefficients. The penalty strength λ is set through the alpha parameter. This approach is also known as Tikhonov regularization.
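To connect the objective to the library, here is a minimal sketch on synthetic data (all `*_demo` names and `lam` are illustrative). It solves the ridge problem in closed form, with the intercept handled by centering so that it is not penalized, and compares the result with `Ridge`.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
# Synthetic data (assumption): 60 samples, 4 features, known coefficients plus noise
X_demo = rng.normal(size=(60, 4))
true_beta = np.array([3.0, -2.0, 0.5, 0.0])
y_demo = X_demo @ true_beta + 1.5 + rng.normal(scale=0.3, size=60)

lam = 10.0  # plays the role of lambda in the formula, alpha in scikit-learn

# Centering X and y removes the (unpenalized) intercept from the problem
X_c = X_demo - X_demo.mean(axis=0)
y_c = y_demo - y_demo.mean()

# Closed-form ridge solution: beta = (X^T X + lambda * I)^(-1) X^T y
n_features = X_c.shape[1]
beta_closed = np.linalg.solve(X_c.T @ X_c + lam * np.eye(n_features), X_c.T @ y_c)
intercept_closed = y_demo.mean() - X_demo.mean(axis=0) @ beta_closed

ridge_demo = Ridge(alpha=lam).fit(X_demo, y_demo)

print(np.allclose(beta_closed, ridge_demo.coef_, atol=1e-6))
print(np.isclose(intercept_closed, ridge_demo.intercept_, atol=1e-6))
```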

In [9]:
from sklearn.linear_model import Ridge

In [10]:
ridge_model = Ridge(alpha=10)
ridge_model.fit(X_train,y_train)

Ridge(alpha=10)

In [11]:
y_pred = ridge_model.predict(X_test)

In [12]:
from sklearn.metrics import mean_absolute_error,mean_squared_error

In [13]:
MAE = mean_absolute_error(y_test,y_pred)
MSE = mean_squared_error(y_test,y_pred)
RMSE = np.sqrt(MSE)

In [14]:
MAE

0.6296591346758597

In [15]:
RMSE

0.8916327541710891

How did it perform on the training set? (This will be used later on for comparison)

In [16]:
# Training Set Performance
train_predictions = ridge_model.predict(X_train)
MAE = mean_absolute_error(y_train,train_predictions)
MAE

0.5230070613645759

The Ridge model makes a trade-off between the simplicity of the model (near-zero coefficients) and its performance on the training set. How much importance the model places on simplicity versus training set performance can be specified by the user, using the alpha parameter. Increasing alpha forces coefficients to move more toward zero, which decreases training set performance but might help generalization. For very small values of alpha, coefficients are barely restricted at all, and we end up with a model that resembles LinearRegression.
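The effect of alpha can be seen directly in the size of the coefficient vector. The sketch below uses synthetic data and an arbitrary list of alphas (both are assumptions for illustration): the coefficient norm shrinks as alpha grows, and approaches the LinearRegression norm as alpha goes to zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(2)
# Synthetic data (assumption), unrelated to the Advertising data set
X_demo = rng.normal(size=(50, 5))
y_demo = X_demo @ np.array([4.0, -3.0, 2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=50)

ols_norm = np.linalg.norm(LinearRegression().fit(X_demo, y_demo).coef_)

alphas_demo = [0.001, 0.1, 10.0, 1000.0]
ridge_norms = [
    np.linalg.norm(Ridge(alpha=a).fit(X_demo, y_demo).coef_) for a in alphas_demo
]

for a, norm in zip(alphas_demo, ridge_norms):
    print(f"alpha={a:<8} ||coef|| = {norm:.4f}")
print(f"OLS            ||coef|| = {ols_norm:.4f}")
```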


#### Choosing an alpha value with Cross-Validation

In [17]:
from sklearn.linear_model import RidgeCV

In [18]:
# Choosing a scoring: https://scikit-learn.org/stable/modules/model_evaluation.html
ridge_cv_model = RidgeCV(alphas=(0.1, 1.0, 10.0), scoring='neg_mean_absolute_error')

# Important: cv is None by default, which uses an efficient form of leave-one-out cross-validation.
# On large datasets, consider setting cv explicitly (e.g. cv=5) to keep the computational cost under control.

# The more alpha options you pass, the longer this will take.
ridge_cv_model.fit(X_train,y_train)

RidgeCV(alphas=array([ 0.1,  1. , 10. ]), scoring='neg_mean_absolute_error')

In [19]:
ridge_cv_model.alpha_

0.1
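Note that 0.1 is the smallest of the three candidates. When the selected alpha sits on the edge of the grid, the best value may lie outside it, so a wider grid is worth trying. A minimal sketch on synthetic data, with a log-spaced grid chosen purely for illustration:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(3)
# Synthetic data (assumption), used only to keep the example self-contained
X_demo = rng.normal(size=(100, 6))
y_demo = X_demo @ np.array([2.0, -1.0, 0.0, 0.5, 0.0, 3.0]) + rng.normal(scale=1.0, size=100)

# 20 candidates spaced evenly on a log scale between 0.001 and 100
alpha_grid = np.logspace(-3, 2, 20)
wide_ridge_cv = RidgeCV(alphas=alpha_grid, scoring='neg_mean_absolute_error')
wide_ridge_cv.fit(X_demo, y_demo)

best_alpha = wide_ridge_cv.alpha_
# If the winner is the smallest or largest candidate, consider extending the grid
on_edge = bool(
    np.isclose(best_alpha, alpha_grid.min()) or np.isclose(best_alpha, alpha_grid.max())
)
print(best_alpha, on_edge)
```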

In [20]:
y_pred_cv = ridge_cv_model.predict(X_test)

In [21]:
MAE = mean_absolute_error(y_test,y_pred_cv)
MSE = mean_squared_error(y_test,y_pred_cv)
RMSE = np.sqrt(MSE)

In [22]:
MAE

0.46671241131523317

In [23]:
RMSE

0.5945136671816592

In [24]:
# Training Set Performance
train_predictions = ridge_cv_model.predict(X_train)
MAE = mean_absolute_error(y_train,train_predictions)
MAE

0.3240172629005777

In [25]:
ridge_cv_model.coef_

array([ 5.90523815,  0.46316396,  0.68028713, -6.17743395,  3.73671928,
       -1.40708382,  0.00624704,  0.11128917, -0.2617823 ,  2.17135744,
       -0.51480159,  0.70587211,  0.60311504, -0.53271216,  0.5716495 ,
       -0.34685826,  0.36744388, -0.03938079, -0.12192939])

### Lasso Regression - least absolute shrinkage and selection operator

An alternative to Ridge for regularizing linear regression is Lasso. As with ridge regression, using the lasso also restricts coefficients to be close to zero, but in a slightly different way, called L1 regularization. The consequence of L1 regularization is that when using the lasso, some coefficients are exactly zero. This means some features are entirely ignored by the model. This can be seen as a form of automatic feature selection. Having some coefficients be exactly zero often makes a model easier to interpret, and can reveal the most important features of your model.

Lasso regression minimizes the objective function:

$$Error = \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_{j}x_{ij})^2 + \lambda \sum_{j=1}^{p} |{\beta_j}|$$
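The difference between the two penalties shows up in the number of coefficients that are exactly zero. The sketch below fits both models on a synthetic problem in which only 3 of 20 features are informative (the data set and the alpha values are assumptions chosen for illustration).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic problem (assumption): 20 features, only 3 carry signal
X_demo, y_demo = make_regression(
    n_samples=100, n_features=20, n_informative=3, noise=1.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

lasso_demo = Lasso(alpha=1.0).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

# L1 sets some coefficients exactly to zero; L2 only shrinks them
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
print("Zero coefficients, Lasso:", n_zero_lasso)
print("Zero coefficients, Ridge:", n_zero_ridge)
```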



In [26]:
from sklearn.linear_model import LassoCV

In [27]:
lasso_cv_model = LassoCV(eps=0.1,n_alphas=100,cv=5, max_iter=1000)
# alphas=None --> the alpha grid is generated automatically: n_alphas values between alpha_max and eps * alpha_max,
# spaced evenly on a log scale (like np.logspace rather than np.linspace)

# If a ConvergenceWarning appears --> increase the max_iter parameter

In [28]:
lasso_cv_model.fit(X_train,y_train)



LassoCV(cv=5, eps=0.1)

In [29]:
lasso_cv_model.alpha_

0.4924531806474871
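The selected alpha is one point on the grid that LassoCV built from eps. As a sketch on synthetic data (an assumption, as is leaving n_alphas at its default of 100), the fitted `alphas_` attribute can be inspected to confirm how that grid is laid out:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic data (assumption), used only to inspect the generated alpha grid
X_demo, y_demo = make_regression(
    n_samples=80, n_features=10, n_informative=4, noise=2.0, random_state=0
)

lasso_cv_demo = LassoCV(eps=0.1, cv=5).fit(X_demo, y_demo)
grid = lasso_cv_demo.alphas_

print(len(grid))                # number of candidate alphas
print(grid.min() / grid.max())  # ratio of smallest to largest alpha, set by eps

# Constant steps in log10(alpha) mean the grid is log-spaced, not linearly spaced
log_steps = np.diff(np.log10(grid))
print(np.allclose(log_steps, log_steps[0]))
```

With eps=0.1 the grid only spans one order of magnitude below the largest alpha, so a smaller eps lets the search reach weaker regularization.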

In [30]:
y_pred = lasso_cv_model.predict(X_test)

In [31]:
MAE = mean_absolute_error(y_test,y_pred)
MSE = mean_squared_error(y_test,y_pred)
RMSE = np.sqrt(MSE)

In [32]:
MAE

0.6811456342837983

In [33]:
RMSE

1.034912736547873

In [34]:
# Training Set Performance
train_predictions = lasso_cv_model.predict(X_train)
MAE = mean_absolute_error(y_train,train_predictions)
MAE

0.6860946674187012

In [35]:
lasso_cv_model.coef_

array([0.97675148, 0.        , 0.        , 0.        , 3.8148913 ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])

Only two coefficients are kept; the rest are set to zero. The MAE and RMSE show that this model performs worse than the cross-validated Ridge model. However, reaching these scores with only 2 features is impressive, and it may make the model more interpretable.

### Elastic Net

Elastic Net combines the penalties of Ridge regression and Lasso in an attempt to get the best of both worlds. The regularization term is a simple mix of the Ridge and Lasso regularization terms, and you can control the mix ratio r (the l1_ratio parameter in scikit-learn). When r = 0, Elastic Net is equivalent to Ridge Regression, and when r = 1, it is equivalent to Lasso Regression.

$$Error = \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_{j}x_{ij})^2 + \lambda_1 \sum_{j=1}^{p} \beta_j^2 + \lambda_2 \sum_{j=1}^{p} |{\beta_j}| $$
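In scikit-learn the two penalty weights are not set separately: `alpha` controls the overall strength and `l1_ratio` controls the mix. A minimal sketch on synthetic data (the data and the alpha value are assumptions) checks the r = 1 case, where Elastic Net reduces to Lasso:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data (assumption), used only to compare the estimators
X_demo, y_demo = make_regression(
    n_samples=100, n_features=8, n_informative=4, noise=1.0, random_state=0
)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
enet_as_lasso = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X_demo, y_demo)
enet_mixed = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X_demo, y_demo)

# l1_ratio=1.0 keeps only the L1 penalty, so the coefficients match Lasso
print(np.allclose(lasso_demo.coef_, enet_as_lasso.coef_))
# l1_ratio=0.5 adds an L2 component, so the coefficients differ
print(np.allclose(lasso_demo.coef_, enet_mixed.coef_))
```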



In [36]:
from sklearn.linear_model import ElasticNetCV

In [37]:
elastic_model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1],tol=0.01)

# l1_ratio --> pass in a list of different ratios
# alphas and eps --> are also present, like in Lasso

In [38]:
elastic_model.fit(X_train,y_train)

ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1], tol=0.01)

In [39]:
elastic_model.l1_ratio_

# Model is very much a Lasso model

0.95

In [40]:
y_pred = elastic_model.predict(X_test)

In [41]:
MAE = mean_absolute_error(y_test,y_pred)
MSE = mean_squared_error(y_test,y_pred)
RMSE = np.sqrt(MSE)

In [42]:
MAE

0.6383683427025824

In [43]:
RMSE

0.7707991875643563

In [44]:
# Training Set Performance
train_predictions = elastic_model.predict(X_train)
MAE = mean_absolute_error(y_train,train_predictions)
MAE

0.40444244371312293

In [45]:
elastic_model.coef_

array([ 3.95312352,  0.98671224,  0.2194859 , -1.01798785,  1.97463372,
       -0.3782983 , -0.12009502,  0.07739924,  0.02861239, -1.10946628,
        0.49812979, -0.        ,  0.95685199, -0.05702842,  0.04842821,
       -0.36288403,  0.1257612 ,  0.00643697,  0.        ])