<img src="CampQMIND_banner.png">

# Regularization

This notebook covers L1 and L2 regularization by explaining what each does and implementing them through __lasso__, __ridge__ and __elastic net__ regression.

Author: [Umur Gokalp](https://github.com/uGokalp)

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Regularization" data-toc-modified-id="Regularization-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Regularization</a></span><ul class="toc-item"><li><span><a href="#Why-Regularize?" data-toc-modified-id="Why-Regularize?-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Why Regularize?</a></span></li></ul></li><li><span><a href="#L1-Regularization:-Least-Absolute-Shrinkage-and-Selection-Operator-(Lasso)" data-toc-modified-id="L1-Regularization:-Least-Absolute-Shrinkage-and-Selection-Operator-(Lasso)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>L1 Regularization: Least Absolute Shrinkage and Selection Operator (Lasso)</a></span></li><li><span><a href="#L2-Regularization-(Ridge)" data-toc-modified-id="L2-Regularization-(Ridge)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>L2 Regularization (Ridge)</a></span></li><li><span><a href="#Elastic-Net" data-toc-modified-id="Elastic-Net-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Elastic Net</a></span></li><li><span><a href="#Resources" data-toc-modified-id="Resources-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Resources</a></span></li></ul></div>

In [9]:
from IPython.display import IFrame, HTML
from IPython.display import display
display(HTML('<h2>Short video</h2>'),
IFrame('https://www.youtube.com/embed/sO4ZirJh9ds',560,315),
HTML("<h2>Long video</h2>"),
IFrame('https://www.youtube.com/embed/ne-MfRfYs_c',560,315))

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings("ignore")

## Why Regularize?

As we introduce more features to the model, the model can become too __complex__. 

A model that is too __complex__ is more likely to learn the __noise__ in the dataset (__overfitting__). Added complexity reduces __bias__ but tends to increase __variance__.

One way to counteract overfitting is __regularization__, which accepts a little more __bias__ in exchange for lower __variance__. The goal is a model where both are low.
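The trade-off described above can be sketched with a toy example. This is a minimal illustration on made-up data (a noisy sine curve, not the housing data), fitting polynomials of increasing degree with NumPy. The sample sizes, noise level and degrees are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical data: a smooth signal plus Gaussian noise.
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def mse(coefs, x, y):
    # Mean squared error of the polynomial with the given coefficients.
    return np.mean((np.polyval(coefs, x) - y) ** 2)

results = {}
for degree in (1, 3, 12):
    coefs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coefs, x_train, y_train), mse(coefs, x_test, y_test))
    print(f"degree {degree:2d}: train MSE {results[degree][0]:.4f}, "
          f"test MSE {results[degree][1]:.4f}")
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. Test error typically stops improving and then gets worse once the model starts fitting noise.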

In [2]:
df = pd.read_csv("houses.csv",index_col=0)
df.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,LotConfig,LandSlope,Neighborhood,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,3,65.0,8450,1,3,3,4,0,5,...,0,3,4,1,0,2,2008,8,4,12.247694
2,20,3,80.0,9600,1,3,3,2,0,24,...,0,3,4,1,0,5,2007,8,4,12.109011
3,60,3,68.0,11250,1,0,3,4,0,5,...,0,3,4,1,0,9,2008,8,4,12.317167
4,70,3,60.0,9550,1,0,3,0,0,6,...,0,3,4,1,0,2,2006,8,0,11.849398
5,60,3,84.0,14260,1,0,3,2,0,15,...,0,3,4,1,0,12,2008,8,4,12.429216


In [3]:
X_train, X_test, y_train, y_test = train_test_split(df.drop("SalePrice",axis=1),df.SalePrice, random_state=2)

lr = LinearRegression(normalize=True)
lr.fit(X_train,y_train)
preds = lr.predict(X_test)
print("R Score",r2_score(y_test,preds))
print("MSE Score",mean_squared_error(y_test,preds))

R Score 0.6869027391795381
MSE Score 0.05125808226754255


# L1 Regularization: Least Absolute Shrinkage and Selection Operator (Lasso)

$$\beta^{Lasso} = \arg\min_{w} \sum_{i=1}^{N}\left(y_i - \sum_{j=1}^{p}x_{ij}\cdot w_{j}\right)^2 + \lambda\sum_{j=1}^{p}|w_j|$$


The implication of this expression is that the penalty term, scaled by $\lambda$, constrains how large the weights $w$ can get.

__Key Ideas__:
- Large coefficient values are penalized.
- Some coefficients become exactly zero, which allows Lasso to be used as a feature selection tool.
- Deals with model complexity and multicollinearity.
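To see why the L1 penalty produces exact zeros, consider a simplified special case. Assuming the feature columns are orthonormal ($X^TX = I$), the Lasso expression above separates into one problem per coefficient, and each solution is the ordinary least squares coefficient soft-thresholded by $\lambda/2$. The sketch below uses made-up least squares coefficients, and `soft_threshold` is an illustrative helper, not a library function.

```python
import numpy as np

def soft_threshold(b, t):
    # Shrink each value toward zero by t; values within t of zero become zero.
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

# Hypothetical OLS coefficients for an orthonormal design (illustrative values).
ols_w = np.array([3.0, -1.5, 0.4, -0.2])
lam = 1.0

# Minimizing (w - b)^2 + lam * |w| for each coefficient gives a threshold of lam / 2.
lasso_w = soft_threshold(ols_w, lam / 2)

print("OLS coefficients:  ", ols_w)
print("Lasso coefficients:", lasso_w)
print("Non-zero coefficients kept:", np.count_nonzero(lasso_w))
```

Coefficients smaller in magnitude than the threshold are set exactly to zero, while the larger ones are shrunk by a constant amount. With correlated features there is no such closed form, which is why scikit-learn solves Lasso iteratively.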

In [4]:
from sklearn.linear_model import LassoCV
alphas = np.array([1e-15, 1e-10, 1e-8, 1e-5,1e-4, 1e-3,1e-2, 1, 5, 10])
lasso = LassoCV(alphas=alphas,normalize=True, max_iter=100000)
lasso.fit(X_train,y_train)
preds = lasso.predict(X_test)

In [5]:
print("R Score",r2_score(y_test,preds))
print("MSE Score",mean_squared_error(y_test,preds))

R Score 0.710883146708897
MSE Score 0.0473321785444374


In this case, LASSO performs better than unregularized linear regression on this test split, with a higher $R^2$ score (0.711 vs. 0.687) and a lower MSE (0.0473 vs. 0.0513).
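A note on running these cells today: the `normalize` argument used throughout this notebook was removed from scikit-learn's linear models in version 1.2. A common replacement is to scale the features explicitly inside a pipeline. The sketch below shows that pattern on synthetic data from `make_regression` (a stand-in, since it does not load `houses.csv`). `StandardScaler` is not numerically identical to the old `normalize=True`, so the selected `alpha` and the scores will differ from the outputs shown in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (sizes and noise level are assumptions).
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

# Scaling is fit on the training data only, then applied before the Lasso step.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=100000))
model.fit(X_tr, y_tr)

lasso_step = model.named_steps["lassocv"]
r2 = model.score(X_te, y_te)            # R^2 on the held-out split
best_alpha = lasso_step.alpha_          # penalty strength chosen by cross-validation
n_kept = int((lasso_step.coef_ != 0).sum())

print("R^2:", r2)
print("Chosen alpha:", best_alpha)
print("Features kept:", n_kept)
```

The same pattern should work for `RidgeCV` and `ElasticNetCV` by swapping the final step of the pipeline.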

In [6]:
X_train.iloc[:,lasso.coef_>0].columns  # Features with positive coefficients; use lasso.coef_ != 0 to list every feature Lasso kept

Index(['LotFrontage', 'LotArea', 'Street', 'Condition1', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterCond', 'Foundation', 'BsmtFinSF1', 'BsmtFinSF2',
       'TotalBsmtSF', 'CentralAir', 'Electrical', '1stFlrSF', 'GrLivArea',
       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'EnclosedPorch', '3SsnPorch',
       'ScreenPorch', 'SaleCondition'],
      dtype='object')
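Note that the mask `coef_ > 0` keeps only features with positive coefficients. A feature that Lasso retained with a negative weight would be left out of the list. The small sketch below, using made-up coefficient values for four of the column names, shows the difference between the two masks.

```python
import numpy as np

# Column names from the dataset; the coefficient values are made up for illustration.
feature_names = np.array(["LotArea", "OverallQual", "MoSold", "PoolArea"])
coef = np.array([0.02, 0.11, -0.03, 0.0])

positive_only = feature_names[coef > 0]   # drops the negative-weight feature
selected = feature_names[coef != 0]       # every feature with a non-zero weight

print("coef > 0 :", list(positive_only))
print("coef != 0:", list(selected))
```

For feature selection, the `!= 0` mask is usually the one that matches the idea of "features the model kept".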

# L2 Regularization (Ridge)

$$\beta^{Ridge} = \arg\min_{w} \sum_{i=1}^{N}\left(y_i - \sum_{j=1}^{p}x_{ij}\cdot w_{j}\right)^2 + \lambda\sum_{j=1}^{p}w_j^2$$




Similar to LASSO, the implication of this expression is that the penalty term, scaled by $\lambda$, constrains $w$ when $w$ gets large.

__Key Ideas__:
- Large coefficient values are penalized.
- Coefficients become small but not exactly zero.
- Deals with model complexity and multicollinearity.
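Unlike Lasso, the Ridge expression has a closed-form solution, $w = (X^TX + \lambda I)^{-1}X^Ty$, when there is no intercept term (as in the formula above). The sketch below applies it to made-up data to show the shrinking behaviour. `ridge_fit` is an illustrative helper, and the data and $\lambda$ values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 4 features, one of which has a true weight of zero.
X = rng.normal(size=(50, 4))
true_w = np.array([3.0, -2.0, 0.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=50)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution without an intercept.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
weights = [ridge_fit(X, y, lam) for lam in lambdas]
norms = [np.linalg.norm(w) for w in weights]

for lam, w, norm in zip(lambdas, weights, norms):
    print(f"lambda={lam:7.1f}  ||w||={norm:.4f}  w={np.round(w, 4)}")
```

As $\lambda$ grows the coefficient vector shrinks toward zero, but the individual coefficients stay non-zero, which is why Ridge is not used for feature selection.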

In [7]:
from sklearn.linear_model import RidgeCV
alphas = np.linspace(0.01, 100)
ridge = RidgeCV(alphas = alphas, normalize=True)
ridge.fit(X_train,y_train)
preds = ridge.predict(X_test)

In [8]:
print("R Score",r2_score(y_test,preds))
print("MSE Score",mean_squared_error(y_test,preds))

R Score 0.6909287387246152
MSE Score 0.05059897392737413


In this case, Ridge performs slightly better than unregularized linear regression, with a higher $R^2$ score and a lower MSE, but worse than LASSO.

# Elastic Net

Elastic net applies the L1 and L2 penalties at the same time, with a mixing parameter that controls the balance between them.
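In the notation used for Lasso and Ridge, the combined objective can be written as follows. The symbols $\lambda_1$ and $\lambda_2$ are introduced here for illustration and denote the strengths of the L1 and L2 penalties.

$$\beta^{ElasticNet} = \arg\min_{w} \sum_{i=1}^{N}\left(y_i - \sum_{j=1}^{p}x_{ij}\cdot w_{j}\right)^2 + \lambda_1\sum_{j=1}^{p}|w_j| + \lambda_2\sum_{j=1}^{p}w_j^2$$

Setting $\lambda_2 = 0$ recovers Lasso and setting $\lambda_1 = 0$ recovers Ridge. scikit-learn parametrizes the same idea with `alpha` (overall strength) and `l1_ratio` (the share given to the L1 penalty), and it scales the terms slightly differently, minimizing

$$\frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \sum_{j=1}^{p}x_{ij}\cdot w_{j}\right)^2 + \alpha\,\rho\sum_{j=1}^{p}|w_j| + \frac{\alpha(1-\rho)}{2}\sum_{j=1}^{p}w_j^2$$

where $\rho$ stands for `l1_ratio`. An `l1_ratio` of 1.0 therefore corresponds to pure Lasso.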

In [1]:
IFrame('https://www.youtube.com/embed/1dKRdX9bfIo',560,315)

In [25]:
from sklearn.linear_model import ElasticNetCV

In [39]:
l1 = np.linspace(0.2, 1.0, 10)
alphas = np.exp(np.linspace(-6, 5, 250))
elas = ElasticNetCV(l1_ratio=l1,alphas=alphas, cv=3, normalize=True)
elas.fit(X_train,y_train)
preds = elas.predict(X_test)

In [40]:
print("R Score",r2_score(y_test,preds))
print("MSE Score",mean_squared_error(y_test,preds))

R Score 0.7347718363730049
MSE Score 0.0434212902253959


On this test split, the combination of L1 and L2 regularization performs best, with the highest $R^2$ score (0.735) and the lowest MSE (0.0434) of the four models.

# Resources
- https://towardsdatascience.com/regularization-an-important-concept-in-machine-learning-5891628907ea