# Elastic Net Regression

## Limitations of Ridge and Lasso Regression

Lasso Regression addresses a limitation of Ridge Regression: Ridge shrinks the coefficients towards zero but never sets them exactly to zero. Still, Lasso has limitations of its own:
1. It tends to fail at group selection (i.e., it picks one variable from a group of highly correlated variables and ignores the rest). Some of the ignored variables might be important in predicting the target.
2. In a dataset with more variables ($d$) than data samples ($n$), i.e. ${d>n}$, Lasso Regression selects at most $n$ variables. That means the generated model cannot incorporate the characteristics of all features.


These shortcomings were addressed by Elastic Net Regression. It is a regularized linear model that combines the lasso and ridge regression techniques to improve the model. It penalizes the coefficients based on their $L_1$ norm and $L_2$ norm.

$$\text{Loss}_{elastic}= RSS + \lambda \left((1-\alpha)\cdot \text{squared } L_2 \text{ norm} + \alpha \cdot L_1 \text{ norm}\right)$$

$$\text{Loss}_{elastic}=\sum_{i=1}^{n} (y_i-\hat{y}_i)^2+ \lambda\left((1-\alpha)\cdot\sum_{j=1}^{d} \beta_j^2+\alpha \cdot \sum_{j=1}^{d}\mid\beta_j\mid\right)$$

The lasso penalty creates a sparse model. The ridge penalty, in turn, keeps the model from ignoring highly correlated variables and stabilizes the $L_1$ regularization path.

Notice that the equation can be tuned to form Lasso as well as Ridge Regression.
1. If $\alpha=0$, the $L_2$ penalty remains and the equation reduces to Ridge Regression.
2. If $\alpha=1$, the $L_1$ penalty remains and the equation reduces to Lasso Regression.
3. If ${0<\alpha<1}$, we get the combination of Ridge and Lasso Regression.

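Therefore, $\alpha$ is a tuning parameter: we choose a value between 0 and 1 (in practice together with $\lambda$, e.g. by cross-validation) that gives the best-performing elastic net.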
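
The three cases above can be checked numerically. The following is a minimal sketch in plain Python; the function name `elastic_net_loss` and the toy numbers are illustrative assumptions, not part of any library.

```python
def elastic_net_loss(y, y_hat, beta, lam, alpha):
    """RSS + lam * ((1 - alpha) * sum(beta_j^2) + alpha * sum(|beta_j|))."""
    rss = sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat))
    l2_penalty = sum(b ** 2 for b in beta)   # ridge part
    l1_penalty = sum(abs(b) for b in beta)   # lasso part
    return rss + lam * ((1 - alpha) * l2_penalty + alpha * l1_penalty)


# Toy values, chosen only for illustration
y = [3.0, 1.0, 2.0]
y_hat = [2.5, 1.5, 2.0]
beta = [1.0, -2.0]

ridge_loss = elastic_net_loss(y, y_hat, beta, lam=0.1, alpha=0.0)  # only the L2 term remains
lasso_loss = elastic_net_loss(y, y_hat, beta, lam=0.1, alpha=1.0)  # only the L1 term remains
mixed_loss = elastic_net_loss(y, y_hat, beta, lam=0.1, alpha=0.5)  # blend of both
print(ridge_loss, lasso_loss, mixed_loss)
```

With $\alpha=0.5$ the penalty is the average of the ridge and lasso penalties, so the mixed loss falls between the other two values.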

The elastic net cost in *argmin* or argument minimum representation is

$$\boldsymbol{\beta_{elastic}} = \underset{\boldsymbol{\beta}\in\mathbb{R}^d}{\arg\min}\;||y-X\beta||_2^2+\lambda(\alpha||\beta||_1+(1-\alpha)||\beta||_2^2)$$


## Geometrical Interpretation of Elastic Net

We have seen that elastic net regression is a combination of Ridge and Lasso Regression. Similarly, the geometric shape of the elastic net penalty can be visualized with the help of the Ridge and Lasso penalties. The elastic net penalty region lies between the $L_2$ norm circle and the $L_1$ norm diamond, and looks like a square with rounded sides, as shown in the figure below.


<figure align="center">
<img src="https://i.postimg.cc/Qx9PMX1Z/shape-of-regularization-penalty.png" height="300" width="380">
<figcaption>Figure 1: Shape of Regularisation Penalty </figcaption>
</figure>


Now, let's interpret the geometry of the elastic net penalty along with the OLS contours with the help of the figure below.

<figure align="center">
<img src="https://i.postimg.cc/1zJMHTwK/Geometry-of-Elastic-Net-Regression.png" height="599" width="688">
<figcaption>Figure 2: Geometry of Elastic Net Regression</figcaption>
</figure>

As discussed before, the Elastic Net penalty is a convex combination of the Lasso and Ridge penalties (dashed lines), so its shape is neither a complete circle nor a complete diamond. Let's call this an elastic ball. If $d=2$ and $0<\alpha<1$, then the inequality
$$(1-\alpha)\cdot\sum_{j=1}^{d}{\beta_j^2}+\alpha\cdot\sum_{j=1}^{d}|\beta_j|\leq c$$
forms the elastic ball as shown in the figure. Similarly, the term $(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$ is the OLS objective (the RSS), whose contours are ellipses. The centre of the ellipses is the OLS solution, the point at which the residual sum of squares (RSS) is minimum. The RSS increases quadratically from the inner ellipse to the outer ellipse. Likewise, the penalty increases from the origin outwards to the boundary of the elastic ball.
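
To make the shape of the elastic ball concrete, the sketch below computes how far its boundary lies from the origin along a given direction in the $(\beta_1, \beta_2)$ plane. It is only an illustration: the helper name `elastic_ball_radius` and the choice $c=1$ are assumptions. Along the unit direction $(\cos\theta, \sin\theta)$ the boundary condition becomes a quadratic in the radius $r$, namely $(1-\alpha)r^2 + \alpha r(|\cos\theta|+|\sin\theta|) = c$, and the function returns its positive root.

```python
import math


def elastic_ball_radius(theta, alpha, c=1.0):
    """Distance from the origin to the boundary of
    (1 - alpha) * (b1^2 + b2^2) + alpha * (|b1| + |b2|) <= c
    along the direction (cos(theta), sin(theta))."""
    s = abs(math.cos(theta)) + abs(math.sin(theta))  # L1 norm of the unit direction
    a, b = 1.0 - alpha, alpha * s
    if a == 0.0:                       # pure lasso: the equation is linear, b * r = c
        return c / b
    # positive root of a*r^2 + b*r - c = 0
    return (-b + math.sqrt(b * b + 4.0 * a * c)) / (2.0 * a)


for deg in (0, 45):
    t = math.radians(deg)
    radii = [elastic_ball_radius(t, alpha) for alpha in (0.0, 0.5, 1.0)]
    print(deg, [round(r, 3) for r in radii])
```

For $c=1$ the circle ($\alpha=0$), the elastic ball ($\alpha=0.5$) and the diamond ($\alpha=1$) all meet on the axes, while along the 45 degree diagonal the elastic ball boundary lies strictly between the diamond and the circle.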


The goal of this optimization problem is to find the point where the cost function is minimum, that is, the combination of $\beta_1$ and $\beta_2$ for which the penalized loss function is minimum. In the figure, the ridge estimate lies at the point where the ellipse touches the circle (red point), the lasso estimate lies at the point where the ellipse touches the diamond (purple point) and the elastic net estimate lies at the point where the ellipse touches the elastic ball (green point).


## Implementation on Real World Dataset

To implement Elastic Net Regression we will use the [Boston House Prices Dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html).

It is one of the datasets provided by sklearn. It has 506 instances with 13 numerical/categorical features describing housing in the Boston area. `MEDV`, the median value of owner-occupied homes in units of \$1000, is the target variable. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in version 1.2, so the cells below require an older release.

### Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
from matplotlib import pyplot as plt
from IPython.display import display, HTML

In [None]:
# Importing the dataset.
from sklearn.datasets import load_boston

#Load the dataset
boston_df=load_boston()

#Create dataframe of dataset
boston=pd.DataFrame(boston_df.data, columns= boston_df.feature_names)
boston['MEDV']=boston_df.target

#Print the first five samples
boston.head()

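| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |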


In [None]:
# We train the model with features other than MEDV as it is the target variable
X=boston.drop(columns=['MEDV'])
y=boston['MEDV'].values.reshape(-1,1)
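
The target above is stored as a column of shape `(n, 1)`. scikit-learn accepts this, but hand-written NumPy arithmetic that mixes such a column with a flat vector of shape `(n,)` can broadcast in an unintended way. A small sketch with made-up numbers (not the Boston data) shows the pitfall:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0]).reshape(-1, 1)  # column vector, shape (3, 1)
y_pred = np.array([1.5, 2.0, 2.0])                 # flat vector, shape (3,)

# Broadcasting (3, 1) against (3,) silently builds a (3, 3) matrix
wrong = (y_true - y_pred) ** 2
print(wrong.shape)

# Flattening the target first gives the intended element-wise residuals
right = (y_true.ravel() - y_pred) ** 2
print(right.shape, right.mean())
```

Using `sklearn.metrics.mean_squared_error`, as this notebook does, avoids the issue; the sketch only matters when the error is computed by hand.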

We first scale the features to zero mean and unit standard deviation. Then we fit sklearn's [ElasticNet](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html) and inspect the coefficient values.
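
Note that sklearn's `ElasticNet` uses its own parameter names and scaling. According to its documentation, it minimizes the objective below, where `alpha` (written $a$ here) is the overall penalty strength and `l1_ratio` (written $\rho$ here) is the mixing parameter. The symbols $a$, $\rho$ and $w$ are introduced only for this comparison.

$$\frac{1}{2n}||y-Xw||_2^2 + a\,\rho\,||w||_1 + \frac{a\,(1-\rho)}{2}\,||w||_2^2$$

Multiplying by $2n$ and matching terms with the elastic net loss defined earlier gives $\lambda\alpha = 2na\rho$ and $\lambda(1-\alpha) = na(1-\rho)$, that is,

$$\lambda = na(1+\rho), \qquad \alpha = \frac{2\rho}{1+\rho}$$

So sklearn's `alpha` plays the role of $\lambda$ and `l1_ratio` plays the role of $\alpha$, but the numerical values are not identical because of the $\frac{1}{2n}$ and $\frac{1}{2}$ factors.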


In [None]:
# We scale the data to mean value of zero and standard deviation of 1.
from sklearn.preprocessing import StandardScaler

X=StandardScaler().fit_transform(X)

In [None]:
# Split the dataset in train and test set
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)

In [None]:
#Implementation of Elastic Net Regression

from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error

# ElasticNet with regularization strength alpha=0.1 (the role of lambda) and l1_ratio=0.5 (the L1/L2 mix).
# The `normalize` argument is omitted: False was its default and it was removed in scikit-learn 1.2.
elastic=ElasticNet(alpha=0.1,l1_ratio=0.5,fit_intercept=True)
elastic.fit(X_train,y_train)

# Predict on the test set with the fitted model.
y_pred=elastic.predict(X_test)


In [None]:
# Formatting to display 2 decimal places only
pd.options.display.float_format = "{:,.2f}".format

In [None]:
# Calculating the mean squared error and coefficient values
mean_sq_error = mean_squared_error(y_test, y_pred)
print('MSE: ',mean_sq_error)

coefficients=elastic.coef_

# The beta values corresponding to the column
index=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT']
pd.DataFrame(coefficients,columns=['Beta value'],index=index)

MSE:  35.68988731868601


| | Beta value |
|---|---|
| CRIM | -0.77 |
| ZN | 0.71 |
| INDUS | -0.26 |
| CHAS | 0.63 |
| NOX | -1.19 |
| RM | 2.75 |
| AGE | -0.09 |
| DIS | -2.06 |
| RAD | 0.75 |
| TAX | -0.78 |


We obtained the coefficients using Elastic Net with `l1_ratio=0.5`, i.e. an even mix of the lasso and ridge penalties in sklearn's parameterization. We used the regularization strength `alpha=0.1` (the role played by $\lambda$) and obtained a mean squared error of 35.69. Elastic Net Regression often performs better than Lasso and Ridge Regression because it both shrinks coefficients and tends to select entire groups of highly correlated variables, addressing the limitations of both methods. The coefficient values from Ridge, LASSO and Elastic Net are visualized in the bar plot below.


<figure align="center">
<img src="https://i.postimg.cc/RZ0T0dj6/Comparison-of-coefficients-values-of-Ridge-LASSO-and-Elastic-Net-Regression.png" height="300" width="450">
<figcaption>Figure 3: Comparison of coefficients values of Ridge, LASSO and ElasticNet Regression </figcaption>
</figure>
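
The grouping behaviour can also be seen on a small synthetic example. The sketch below is an illustration under stated assumptions, not the analysis behind the figure: the data are simulated, two of the three columns are exact copies of each other (an extreme case of a correlated group), and the penalty settings are arbitrary.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
noise_feature = rng.normal(size=n)

# Columns 0 and 1 are exact copies (a perfectly correlated group); column 2 is unrelated
X = np.column_stack([x, x, noise_feature])
y = 3.0 * x + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.5).fit(X, y)
# A tight tolerance so that the two copies converge to (numerically) the same value
enet = ElasticNet(alpha=0.5, l1_ratio=0.5, tol=1e-10, max_iter=100_000).fit(X, y)

print("Lasso      :", np.round(lasso.coef_, 3))
print("ElasticNet :", np.round(enet.coef_, 3))
```

Because the ridge part makes the elastic net objective strictly convex, the two identical columns receive equal coefficients. With exact copies the Lasso minimizer is not unique, so which copy receives the weight depends on the solver; coordinate descent typically loads one copy and leaves the other at zero.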



## Bias Variance TradeOff

Remember the concept of bias and variance discussed in the previous course. Let's go through this concept mathematically.

Consider a model $y_i= f(x_i)+ \epsilon_i$ where $\mathbb{E}[\epsilon]=0$ and $Var[\epsilon]=\sigma^2$. We estimate $f$ by $\hat f$, which is obtained by minimizing the loss function on the training data. Applying $\hat f$ to a new observation $y$ at $x_0$, the expected MSE equals

$$\text{MSE}=\mathbb{E}[(y-\hat{f}(x_0))^2] = [Bias(\hat{f}(x_0))]^2+Var(\hat{f}(x_0))+\sigma^2$$

where,
$$Bias(\hat{f}(x_0))=\mathbb{E}[\hat{f}(x_0)]-f(x_0)$$
$$Var(\hat{f}(x_0))=\mathbb{E}[(\hat{f}(x_0)-\mathbb{E}[\hat{f}(x_0)])^2]$$
$$\sigma^2=\text{Noise}$$

Here,
- $\mathbb{E}[\hat{f}(x_0)]$ is the expected value of the predicted output for $x_0$,
- $\hat{f}(x_0)$ is the predicted output for $x_0$, and
- $f(x_0)$ is the true output for $x_0$.
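
A short derivation shows where the three terms of the decomposition come from. It assumes, as is standard, that the noise $\epsilon$ in the new observation $y=f(x_0)+\epsilon$ is independent of $\hat{f}(x_0)$.

$$\mathbb{E}[(y-\hat{f}(x_0))^2]=\mathbb{E}[(f(x_0)-\hat{f}(x_0))^2]+2\,\mathbb{E}[\epsilon]\,\mathbb{E}[f(x_0)-\hat{f}(x_0)]+\mathbb{E}[\epsilon^2]=\mathbb{E}[(f(x_0)-\hat{f}(x_0))^2]+\sigma^2$$

$$\mathbb{E}[(f(x_0)-\hat{f}(x_0))^2]=(f(x_0)-\mathbb{E}[\hat{f}(x_0)])^2+\mathbb{E}[(\hat{f}(x_0)-\mathbb{E}[\hat{f}(x_0)])^2]=[Bias(\hat{f}(x_0))]^2+Var(\hat{f}(x_0))$$

The middle term of the first line vanishes because $\mathbb{E}[\epsilon]=0$, and the cross term of the second line vanishes because $\mathbb{E}[\hat{f}(x_0)-\mathbb{E}[\hat{f}(x_0)]]=0$.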

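Bias measures how far the model's average prediction is from the true output; a high-bias model fails to fit even the training data well (underfitting). Variance measures how much the prediction changes from one training set to another; a high-variance model fits the training data closely but predicts test data poorly (overfitting). The Bias Variance Tradeoff is about finding the balance of bias and variance that minimizes the error of the model.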

Remember that, under the Gauss-Markov assumptions, OLS is the best linear unbiased estimator: the OLS estimates are unbiased and have the least variance among all linear unbiased estimators. But if there is multicollinearity in the dataset (two or more variables are highly correlated) or the dataset has a large number of variables, the variance of the OLS estimates becomes very high, as shown in the figure below. As a result, the model generalizes poorly to test data.

<figure align="center">
<img src="https://i.postimg.cc/s2d1LD9j/Bias-Variance-Trade-Off.png" height="356" width="481">
<figcaption>Figure 4: Bias Variance TradeOff</figcaption>
</figure>

The unbiased OLS lies at the rightmost side of the figure shown above. With the help of regularisation, we reduce the variance at the cost of increasing the bias of the model, with the aim of moving left towards the optimal model in the figure.

From the figure, we can see that the model complexity can be decreased by decreasing the number of features or, more generally, by shrinking the coefficients, $\beta$'s, towards zero. Regularised regression does this by penalizing the model through the penalty factor $\lambda$: the greater the value of $\lambda$, the more the coefficient values are penalized. However, $\lambda$ should be increased only up to a point, because the model loses its important characteristics if $\lambda$ is too high. A high value of $\lambda$ also increases the bias of the model and may result in an underfit model. Since lasso regression shrinks some coefficients to exactly zero, it creates a sparse model with lower variance than OLS at the cost of an increase in bias.
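
The effect of $\lambda$ on bias and variance can be illustrated with a small simulation. The sketch below is a hedged example under simplifying assumptions: a single feature with no intercept, a made-up true slope and noise level, and a ridge penalty (chosen because its one-dimensional solution has a closed form). Lasso and elastic net are expected to show the same qualitative pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

true_beta, sigma = 0.5, 2.0          # assumed true slope and noise standard deviation
x = np.linspace(-2.0, 2.0, 20)       # fixed design, single feature, no intercept
x0 = 1.5                             # query point
f_x0 = true_beta * x0                # true output f(x0)
repeats = 5000                       # number of simulated training sets


def simulate(lam):
    """Return (bias^2, variance) of the ridge prediction at x0 for penalty lam."""
    noise = rng.normal(scale=sigma, size=(repeats, x.size))
    y = true_beta * x + noise                  # one training set per row
    beta_hat = (y @ x) / (x @ x + lam)         # 1-D ridge solution; lam=0 is OLS
    preds = beta_hat * x0
    return (preds.mean() - f_x0) ** 2, preds.var()


for lam in (0.0, 16.0, 200.0):
    bias_sq, variance = simulate(lam)
    print(f"lambda={lam:6.1f}  bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"expected MSE={bias_sq + variance + sigma**2:.4f}")
```

As $\lambda$ grows, the variance falls while the squared bias rises. In this setup the intermediate value should give the smallest expected MSE, whereas the largest value over-regularizes and underfits.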





## Key Takeaways

- Elastic Net Regression is the combination of Lasso and Ridge Regression.
- Unlike Lasso, it tends to select all features of a group of highly correlated variables.
- It works well with datasets where the number of features is greater than the number of samples.
- It often performs better than Lasso and Ridge Regression.



