# Regularization with Scikit-Learn

Previously we created a new polynomial feature set and then applied our standard linear regression on it, but we can be smarter about model choice and utilize regularization.

Regularization minimizes the RSS (residual sum of squares) *plus* a penalty term. The penalty grows with the size of the coefficients, so it discourages models whose coefficients are too large. Some regularization methods can shrink the coefficients of unhelpful features all the way to zero, in which case the model effectively ignores those features.
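To make the idea of "RSS plus a penalty" concrete, here is a minimal sketch. The helper names (`rss`, `l2_penalty`, `l1_penalty`) and the toy numbers are assumptions made up for illustration, and the exact scaling of each term differs between estimators, so treat this as the general shape of the objective rather than the formula any particular library uses.

```python
import numpy as np

def rss(X, y, coef, intercept=0.0):
    """Residual sum of squares for a linear model."""
    residuals = y - (X @ coef + intercept)
    return float(np.sum(residuals ** 2))

def l2_penalty(coef, alpha):
    """Ridge-style penalty: alpha times the sum of squared coefficients."""
    return float(alpha * np.sum(coef ** 2))

def l1_penalty(coef, alpha):
    """Lasso-style penalty: alpha times the sum of absolute coefficients."""
    return float(alpha * np.sum(np.abs(coef)))

# Toy data (made up for illustration, not the advertising data)
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]])
y_toy = np.array([5.0, 4.0, 11.0])
coef_toy = np.array([1.0, 2.0])

toy_rss = rss(X_toy, y_toy, coef_toy)
ridge_objective = toy_rss + l2_penalty(coef_toy, alpha=10)
lasso_objective = toy_rss + l1_penalty(coef_toy, alpha=10)
```

Even though these coefficients fit the toy data perfectly (the RSS is zero), both objectives are still positive, which is what pushes a regularized fit toward smaller coefficients.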

Let's explore two methods of regularization, Ridge Regression and Lasso, followed by Elastic Net, which combines them. We'll apply them to the polynomial feature set, since regularization has little to gain on the original X, which has only three features.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
tv = np.array([181,9,58,120,9,200,66,215,24,98,204,195,68,281,69,147,218,237,13,228,62,263,143,240,249])
radio = np.array([11,49,33,20,2,3,6,24,35,8,33,48,37,40,21,24,28,5,16,17,13,4,29,17,27])
newspaper = np.array([58,75,24,12,1,21,24,4,66,7,46,53,114,56,18,19,53,24,50,26,18,20,13,23,23])
sales = np.array([13,7,12,13,5,11,9,17,9,10,19,22,13,24,11,15,18,13,6,16,10,12,15,16,19])

df = pd.DataFrame({'tv': tv, 'radio': radio, 'newspaper': newspaper, 'sales': sales})
df.head()

| | tv | radio | newspaper | sales |
|---|---|---|---|---|
| 0 | 181 | 11 | 58 | 13 |
| 1 | 9 | 49 | 75 | 7 |
| 2 | 58 | 33 | 24 | 12 |
| 3 | 120 | 20 | 12 | 13 |
| 4 | 9 | 2 | 1 | 5 |


In [3]:
X = df.drop('sales', axis=1) # keep only the features
y = df['sales']

## Polynomial Conversion

In [4]:
from sklearn.preprocessing import PolynomialFeatures

In [5]:
polynomial_converter = PolynomialFeatures(degree=3, include_bias=False)

In [6]:
poly_features = polynomial_converter.fit_transform(X)
poly_features.shape

(25, 19)
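The 19 columns in the shape above can be checked by counting the polynomial terms directly. The sketch below uses only the standard library. The term labels it builds are illustrative and are not claimed to match the names or ordering scikit-learn reports.

```python
from itertools import combinations_with_replacement
from math import comb

n_features = 3   # tv, radio, newspaper
degree = 3

# Closed form: all monomials of total degree <= 3, minus the constant (bias) term
expected_columns = comb(n_features + degree, degree) - 1

# Same count by enumeration: one term per multiset of features of size 1..degree
names = ['tv', 'radio', 'newspaper']
terms = [
    ' * '.join(combo)
    for d in range(1, degree + 1)
    for combo in combinations_with_replacement(names, d)
]
```

There are 3 linear terms, 6 terms of degree two, and 10 terms of degree three, and `include_bias=False` drops the constant column.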

## Train and Test Splits

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=99)

## Scaling the Data

Our particular data set has features on similar scales, but that is usually not the case. Because the penalty term in a regularized model adds the coefficients together (as squares or absolute values), a feature's scale directly affects how strongly its coefficient is penalized, so it's important to standardize the features first. For example, we should not mix one feature in the range 0.0-1.0 with another in the range 100-1000. Note that the scaler is fit on the training set only and then applied to both sets, so no information from the test set leaks into training.
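As a rough illustration of what standardization does, the sketch below scales two toy features by hand with NumPy. The arrays are made-up values, not the advertising data, and the statistics come from the training rows only.

```python
import numpy as np

# Toy features on very different scales (values made up for illustration)
train = np.array([[0.1, 100.0], [0.5, 400.0], [0.9, 1000.0]])
test = np.array([[0.3, 250.0]])

# Learn the statistics from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)   # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics
```

After scaling, each training column has mean 0 and standard deviation 1, so the penalty treats both features on an equal footing.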

In [9]:
from sklearn.preprocessing import StandardScaler

In [10]:
scaler = StandardScaler()

In [11]:
scaler.fit(X_train)

In [12]:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

## Ridge Regression (RR)

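RR is a regularization method for linear regression.

The goal of RR is to prevent overfitting by adding a penalty term proportional to the sum of the squared coefficients. The `alpha` parameter controls how strong that penalty is.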
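To see how the penalty shrinks coefficients, here is a minimal sketch of the textbook closed-form ridge solution on centered data. The function name `ridge_closed_form` and the random demo data are assumptions for illustration. This is not how scikit-learn's `Ridge` is implemented internally.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Ridge coefficients for centered data: (X^T X + alpha * I)^-1 X^T y."""
    X_centered = X - X.mean(axis=0)
    y_centered = y - y.mean()
    n_features = X.shape[1]
    gram = X_centered.T @ X_centered + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X_centered.T @ y_centered)

# Synthetic data, made up for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)

coef_ols = ridge_closed_form(X_demo, y_demo, alpha=0.0)     # no penalty
coef_ridge = ridge_closed_form(X_demo, y_demo, alpha=10.0)  # penalized
```

With `alpha=0` this reduces to ordinary least squares. Increasing `alpha` pulls the coefficient vector toward zero, trading a little bias for lower variance.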

In [13]:
from sklearn.linear_model import Ridge

In [14]:
ridge_model = Ridge(alpha=10)

In [15]:
ridge_model.fit(X_train, y_train)

In [16]:
test_predictions = ridge_model.predict(X_test)

In [17]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [18]:
MAE = mean_absolute_error(y_test, test_predictions)
MSE = mean_squared_error(y_test, test_predictions)
RMSE = np.sqrt(MSE)

print(MAE)
print(RMSE)

1.8138035254398992
2.7440780352431458
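For reference, the two metrics printed above can be computed by hand. The sketch below uses small made-up arrays rather than the model's predictions.

```python
import numpy as np

# Made-up targets and predictions for illustration
y_true = np.array([10.0, 12.0, 15.0, 9.0])
y_pred = np.array([11.0, 12.0, 13.0, 9.0])

errors = y_true - y_pred
mae = float(np.mean(np.abs(errors)))   # average absolute miss
mse = float(np.mean(errors ** 2))      # average squared miss
rmse = float(np.sqrt(mse))             # back in the units of the target
```

RMSE is never smaller than MAE, and the gap widens when a few predictions miss by a lot, because squaring weights large errors more heavily.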


How did it perform on the training set? (This will be used later on for comparison)

In [19]:
# Training Set Performance
train_predictions = ridge_model.predict(X_train)
MAE = mean_absolute_error(y_train, train_predictions)
MAE

0.8238580019402281

### Choosing an alpha value with Cross-Validation

In [20]:
from sklearn.linear_model import RidgeCV

In [21]:
# Negative MAE, so that the score follows the convention "higher is better"
ridge_cv_model = RidgeCV(alphas=(0.1, 1.0, 10.0), scoring='neg_mean_absolute_error')

In [22]:
# The more alpha options you pass, the longer this will take.
# Fortunately our data set is still pretty small
ridge_cv_model.fit(X_train, y_train)

In [23]:
# Report the alpha that performed best
ridge_cv_model.alpha_

1.0
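The selection step can be pictured as a brute-force loop: fit one model per candidate alpha on all rows but one, score the held-out row, and keep the alpha with the lowest average error. The sketch below does this with hypothetical helpers (`fit_ridge`, `loo_mae`) on synthetic data. `RidgeCV` uses an efficient leave-one-out formulation by default, so this is an illustration of the idea, not of its implementation, and the numbers need not match.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Return (coef, intercept) of a ridge fit via the closed-form solution."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def loo_mae(X, y, alpha):
    """Mean absolute error over leave-one-out folds."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i          # drop row i from the training fold
        coef, intercept = fit_ridge(X[keep], y[keep], alpha)
        errors.append(abs(y[i] - (X[i] @ coef + intercept)))
    return float(np.mean(errors))

# Synthetic data, made up for illustration
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(15, 4))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=15)

alphas = (0.1, 1.0, 10.0)
scores = {alpha: loo_mae(X_demo, y_demo, alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)       # lowest MAE wins
```

Minimizing MAE here is equivalent to maximizing the negated MAE, which is why the scoring string carries the `neg_` prefix.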

In [24]:
test_predictions = ridge_cv_model.predict(X_test)

In [25]:
MAE = mean_absolute_error(y_test, test_predictions)
MSE = mean_squared_error(y_test, test_predictions)
RMSE = np.sqrt(MSE)

print(MAE)
print(RMSE)

1.3918676889454737
2.807367730096967


In [26]:
train_predictions = ridge_cv_model.predict(X_train)
MAE = mean_absolute_error(y_train, train_predictions)
MAE

0.45129349026560445

In [27]:
ridge_cv_model.coef_

array([ 1.79972516,  1.45597279, -0.01397653,  0.33290289,  1.11280043,
        0.49527754,  0.37164088, -0.11889539, -0.38236088, -0.3194373 ,
        0.28229854, -0.20185405,  0.39730557,  0.10997819,  0.31880074,
       -0.17510455, -0.30388327, -0.34539895, -0.32681256])

## Lasso Regression (LR)

In [28]:
from sklearn.linear_model import LassoCV

In [29]:
lasso_cv_model = LassoCV(eps=0.1, n_alphas=100, cv=5)

In [30]:
lasso_cv_model.fit(X_train, y_train)

In [31]:
lasso_cv_model.alpha_

0.4468488012467369

In [32]:
test_predictions = lasso_cv_model.predict(X_test)

In [33]:
MAE = mean_absolute_error(y_test, test_predictions)
MSE = mean_squared_error(y_test, test_predictions)
RMSE = np.sqrt(MSE)

print(MAE)
print(RMSE)

0.8780206334666918
0.993315051950396


In [34]:
train_predictions = lasso_cv_model.predict(X_train)
MAE = mean_absolute_error(y_train, train_predictions)
MAE

0.681087125524813

In [35]:
lasso_cv_model.coef_

array([ 1.4070715 ,  0.        , -0.        ,  0.        ,  2.96900953,
        0.        ,  0.        , -0.        , -0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        , -0.        , -0.        , -0.        ])
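Most of the coefficients above are exactly zero, which is the behaviour the introduction described for some regularization methods. One way to see why an absolute-value penalty produces exact zeros is the soft-thresholding operator that appears in coordinate-descent updates for the lasso. The sketch below applies it to made-up values. The function name and the threshold are assumptions for illustration.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Shrink z toward zero by `threshold`; values within the threshold become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

# Made-up unpenalized coefficient values
unpenalized = np.array([3.0, 0.4, -0.2, -1.5])
shrunk = soft_threshold(unpenalized, threshold=0.5)
# The two large values move 0.5 closer to zero; the two small ones are zeroed out
```

A squared penalty, by contrast, only rescales coefficients, so ridge coefficients become small but rarely land exactly on zero.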

## Elastic Net

Elastic Net combines the penalties of ridge regression and lasso in an attempt to get the best of both worlds.
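The combined penalty can be sketched as a weighted blend of the two, using the `alpha` and `l1_ratio` parameterization documented for scikit-learn's `ElasticNet`. The helper name `elastic_net_penalty` and the demo coefficients are made up for illustration.

```python
import numpy as np

def elastic_net_penalty(coef, alpha, l1_ratio):
    """Blend of L1 and L2 penalties, weighted by l1_ratio."""
    l1 = np.sum(np.abs(coef))   # lasso part
    l2 = np.sum(coef ** 2)      # ridge part
    return float(alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2)

# Made-up coefficients for illustration
coef_demo = np.array([1.0, -2.0, 0.0])
pure_lasso = elastic_net_penalty(coef_demo, alpha=1.0, l1_ratio=1.0)
pure_ridge = elastic_net_penalty(coef_demo, alpha=1.0, l1_ratio=0.0)
blended = elastic_net_penalty(coef_demo, alpha=1.0, l1_ratio=0.5)
```

Setting `l1_ratio=1` leaves only the lasso term and `l1_ratio=0` leaves only the ridge term, so the cross-validated `l1_ratio_` indicates which side of the blend the data favoured.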

In [36]:
from sklearn.linear_model import ElasticNetCV

In [37]:
elastic_model = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], tol=0.01)

In [38]:
elastic_model.fit(X_train, y_train)

In [39]:
elastic_model.l1_ratio_

0.9

In [40]:
test_predictions = elastic_model.predict(X_test)

In [41]:
MAE = mean_absolute_error(y_test, test_predictions)
MSE = mean_squared_error(y_test, test_predictions)
RMSE = np.sqrt(MSE)

print(MAE)
print(RMSE)

0.7898847775634377
1.0689134876248785


In [42]:
train_predictions = elastic_model.predict(X_train)
MAE = mean_absolute_error(y_train, train_predictions)
MAE

0.32935578333943694

In [43]:
elastic_model.coef_

array([ 3.36849631,  1.79643198, -0.31355123, -0.46702517,  1.16568403,
        0.19052654, -0.46976126, -0.03771459, -0.09687643, -1.106768  ,
        0.68609938,  0.06660989,  0.40632385,  0.15853151,  0.01051156,
       -0.50007448, -0.        , -0.04295381, -0.01053247])