# **Regularization (Ridge - LASSO - ElasticNet)**

> What is **Regularization**?

* It is one of the most important concepts in machine learning. The technique helps prevent overfitting by adding extra information to the model, in the form of a penalty on its complexity.

* For linear models, it is a form of regression that shrinks the coefficient estimates towards zero. In other words, it discourages the model from becoming more complex or flexible than the data supports, which helps avoid overfitting.

* How is the flexibility of a model represented?
 * For regression problems, an increase in the flexibility of a model shows up as an increase in the magnitude of its coefficients, which are estimated when fitting the regression line.

* In simple words: regularization reduces the magnitude of the coefficients while keeping the same number of variables. It helps the model generalize without giving up much accuracy.
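
The link between flexibility and coefficient size can be illustrated with a small sketch. The synthetic sine data, the random seed, and the helper `coef_norm` are assumptions made only for this illustration; they are not part of the dataset used later in the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumed for illustration): a noisy sine curve on [0, 1]
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.2, size=15)

def coef_norm(degree):
    """Fit an unregularized polynomial model and return the L2 norm of its coefficients."""
    feats = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(x)
    model = LinearRegression().fit(feats, y)
    return np.linalg.norm(model.coef_)

norm_deg1 = coef_norm(1)   # rigid model: a straight line
norm_deg9 = coef_norm(9)   # flexible model: a degree-9 polynomial

print("Coefficient norm, degree 1:", norm_deg1)
print("Coefficient norm, degree 9:", norm_deg9)
```

The more flexible model needs much larger coefficients to chase the noise, which is exactly what a penalty on coefficient size discourages.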

> **Why Regularization ?**

Sometimes a machine learning model performs well on the training data but poorly on unseen or test data. This happens when the model has learned the noise in the training data along with the underlying pattern, so it cannot predict the target column accurately for unseen data. Such a model is called an overfitted model.

Let’s briefly look at what “noise” means:

By noise we mean data points in the dataset that don’t represent the true properties of the data and are only there due to random chance.

So, to deal with the problem of overfitting we take the help of regularization techniques.

> **How does Regularization Work ?**

Regularization works by adding a penalty term (also called a complexity or shrinkage term) to the Residual Sum of Squares (RSS) of the model.

Let’s consider the **linear regression** equation with several predictors:

Here Y represents the dependent feature or response, and the learned relation is:

Y ≈ β0 + β1X1 + β2X2 + … + βpXp

Here, **X1, X2, … ,Xp** are the independent features or predictors for Y, and

**β0, β1, … ,βp** represent the coefficient estimates for the intercept and the predictors (X), i.e. the weights attached to the respective features.

In linear regression, the optimization function or loss function is the **residual sum of squares (RSS)**.

We choose the set of coefficients that minimizes the following loss function:

<img src="https://s4.uupload.ir/files/r_l1_1_lpn.png" border="0" alt="Residual sum of squares (RSS) loss function" />
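
Written out in its standard textbook form, the residual sum of squares is given below. The indices are assumptions of this sketch: n observations indexed by i and p predictors indexed by j, so the image may use slightly different notation.

```latex
\mathrm{RSS} = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
```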

Minimizing this loss adjusts the coefficient estimates to the training data. If there is noise present in the training data, the estimated coefficients won’t generalize well and will not predict future data accurately.

This is where regularization comes into the picture: it shrinks, or regularizes, these learned estimates towards zero by adding a penalty term with a tuning parameter to the loss function, so that the model predicts Y more accurately on unseen data.

> **Techniques of Regularization**

There are three types of regularization techniques, which are given below:

L1 Regularization
* Lasso Regression

L2 Regularization
* Ridge Regression

Combining L1 & L2
* Elastic Net
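
The three penalties can be written side by side. This is a sketch in the usual textbook notation, assuming λ, λ₁, λ₂ ≥ 0 are tuning parameters and that the intercept β0 is not penalized:

```latex
\begin{aligned}
\text{Lasso (L1):} \quad & \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j| \\
\text{Ridge (L2):} \quad & \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2 \\
\text{Elastic Net:} \quad & \mathrm{RSS} + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2
\end{aligned}
```

Setting the tuning parameters to zero recovers ordinary least squares in each case.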



> **L1 Regularization --> Lasso Regression**

* Lasso regression is a regularization technique used to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.

* It is similar to Ridge Regression, except that the penalty term uses the absolute values of the weights instead of their squares. Therefore, the optimization function becomes:

<img src="https://s4.uupload.ir/files/r_l1_3_9jc4.png" border="0" alt="Lasso regression cost function" />


* In statistics, this penalty is known as the L1 norm of the coefficients.

* In this technique, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly zero when the tuning parameter λ is sufficiently large, which means some of the features are removed from the model completely. Therefore, the lasso method also performs feature selection and is said to yield sparse models.

* Limitations of Lasso Regression:

 * Problems with some types of dataset: if the number of predictors (p) is greater than the number of data points (n), Lasso will pick at most n predictors as non-zero, even if all predictors are relevant.

 * Multicollinearity problem: if two or more variables are highly collinear, Lasso tends to keep one of them more or less arbitrarily, which is not good for the interpretation of the model.
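
The feature-selection effect described above can be seen in a minimal sketch. The synthetic data, the choice `alpha=0.5`, and the variable names are assumptions for illustration only; in scikit-learn the tuning parameter λ is called `alpha`.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (assumed): 10 features, but only the first 3 influence y
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
true_coef = np.array([5.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)   # no penalty
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty

n_nonzero_ols = np.count_nonzero(ols.coef_)
n_nonzero_lasso = np.count_nonzero(lasso.coef_)

print("Non-zero coefficients, OLS:  ", n_nonzero_ols)
print("Non-zero coefficients, Lasso:", n_nonzero_lasso)
print("Lasso coefficients:", lasso.coef_.round(3))
```

Ordinary least squares assigns a small non-zero weight to every irrelevant feature, while the L1 penalty sets coefficients that contribute too little exactly to zero.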

> **L2 Regularization --> Ridge Regression**

* Ridge regression is a type of linear regression in which we introduce a small amount of bias, known as the **Ridge regression** penalty, so that we can get better long-term predictions.

* In statistics, this penalty is known as the squared **L2 norm** of the coefficients.

* In this technique, the cost function is altered by adding the penalty term (shrinkage term), which multiplies lambda by the squared weight of each individual feature and sums the results. Therefore, the optimization function (cost function) becomes:

<img src="https://s4.uupload.ir/files/r_l1_2_x69b.png" border="0" alt="Ridge regression cost function" />

In the above equation, the penalty term regularizes the coefficients of the model: ridge regression reduces the magnitudes of the coefficients, which decreases the complexity of the model.
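
The shrinking effect of the ridge penalty can be checked with a short sketch. The synthetic data and the list of `alpha` values (scikit-learn's name for λ) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumed): 5 features that all influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([4.0, -2.0, 1.0, 0.5, 3.0]) + rng.normal(scale=1.0, size=100)

# Fit ridge regression with increasingly strong penalties
alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in alphas]

for a, norm in zip(alphas, coef_norms):
    print(f"alpha={a:>8}: ||coef|| = {norm:.4f}")

# Even with a very strong penalty the coefficients are small but not exactly zero
strong_ridge = Ridge(alpha=10000.0).fit(X, y)
n_nonzero_strong = np.count_nonzero(strong_ridge.coef_)
print("Non-zero coefficients at alpha=10000:", n_nonzero_strong)
```

The coefficient norm falls as `alpha` grows, but unlike lasso, ridge does not remove features: it shrinks them all together.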

# **Now it's time to code.**

**📤 Import & Install Libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import ElasticNetCV, LassoCV, Ridge, RidgeCV
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

%matplotlib inline

**Import the Data**

In [None]:
df= pd.read_csv('../input/real-estate-price-prediction/Real estate.csv')

**💾 Check out the Data**

In [None]:
df.head()

In [None]:
print("The Dataset has",df.shape[0],'Rows.')

print("\nThe Dataset has",df.shape[1],'Columns.')

In [None]:
df.info()

In [None]:
df.corr()

In [None]:
sns.heatmap(df.corr(), annot=True,cmap='Reds')

**📊 Exploratory Data Analysis (EDA)**


In [None]:
sns.pairplot(df)

**✔️ Determine the Features & Target Variable (Label)**

**X and y arrays**

In [None]:
X= df.drop('Y house price of unit area', axis=1)
y=df['Y house price of unit area']

**✔️ Preprocessing (Polynomial Conversion)**

In [None]:
polynomial_converter= PolynomialFeatures(degree=3, include_bias=False)

In [None]:
poly_features= polynomial_converter.fit_transform(X)

In [None]:
poly_features.shape

**🧱 Train Test Split**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)

**✔️ Scaling the Data**

In [None]:
scaler= StandardScaler()

In [None]:
scaler.fit(X_train)

In [None]:
X_train= scaler.transform(X_train)
X_test= scaler.transform(X_test)
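
The preprocessing steps above (polynomial expansion, split, scaling fitted on the training data only) can also be chained in a scikit-learn pipeline. This is a minimal sketch on synthetic data; the names `X_demo`, `y_demo`, and `pipe` and the chosen `degree` and `alpha` are assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (assumed) standing in for the real-estate features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 3))
y_demo = X_demo[:, 0] ** 2 + 2 * X_demo[:, 1] + rng.normal(scale=0.3, size=150)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

# The pipeline fits every step on the training data only,
# so the scaler never sees statistics from the test set.
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
)
pipe.fit(X_tr, y_tr)

# pipe[:-1] is the preprocessing part (everything except the final model)
train_features = pipe[:-1].transform(X_tr)
print("Number of polynomial features:", train_features.shape[1])
print("Test R^2:", pipe.score(X_te, y_te))
```

Scaling matters for regularized models because the penalty treats all coefficients alike, so features on larger scales would otherwise be penalized differently from features on smaller scales.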

**Regularization**

1. Ridge Regression

In [None]:
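#Train the Model

# alpha sets the regularization strength. 10 is only an initial guess here;
# a better value is chosen with cross-validation in the next section.
ridge_model= Ridge(alpha=10)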

In [None]:
ridge_model.fit(X_train, y_train)

In [None]:
#predict Test Data
y_pred= ridge_model.predict(X_test)

In [None]:
#Evaluating the Model

MAE= mean_absolute_error(y_test, y_pred)
MSE= mean_squared_error(y_test, y_pred)
RMSE= np.sqrt(MSE)

In [None]:
pd.DataFrame([MAE, MSE, RMSE], index=['MAE', 'MSE', 'RMSE'], columns=['metrics'])

**✔️ Ridge Regression (Choosing an alpha value with Cross-Validation)**

In [None]:
#Train the Model

ridge_cv_model=RidgeCV(alphas=(0.1, 1.0, 40.0), scoring='neg_mean_absolute_error')

In [None]:
ridge_cv_model.fit(X_train, y_train)

print("The optimal value of alpha for Ridge Regression is: ", ridge_cv_model.alpha_)

In [None]:
#Predicting Test Data
y_pred_ridge= ridge_cv_model.predict(X_test)

In [None]:
MAE_ridge= mean_absolute_error(y_test, y_pred_ridge)
MSE_ridge= mean_squared_error(y_test, y_pred_ridge)
RMSE_ridge= np.sqrt(MSE_ridge)

In [None]:
pd.DataFrame([MAE_ridge, MSE_ridge, RMSE_ridge], index=['MAE', 'MSE', 'RMSE'], columns=['Ridge Metrics'])

In [None]:
ridge_cv_model.coef_
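
With only three candidate values, the selected alpha may sit at the edge of the grid. A possible refinement, sketched here on synthetic data, is to search a logarithmic grid; the grid bounds and the names `alpha_grid` and `ridge_cv` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic data (assumed) for a self-contained example
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(120, 8))
y_demo = X_demo @ rng.normal(size=8) + rng.normal(scale=1.0, size=120)

# 13 candidate values spread evenly on a log scale from 0.001 to 1000
alpha_grid = np.logspace(-3, 3, 13)
ridge_cv = RidgeCV(alphas=alpha_grid, scoring='neg_mean_absolute_error')
ridge_cv.fit(X_demo, y_demo)

print("Selected alpha:", ridge_cv.alpha_)
if ridge_cv.alpha_ in (alpha_grid[0], alpha_grid[-1]):
    print("Selected alpha is on the edge of the grid; consider widening it.")
```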

**Regularization**

2. Lasso Regression

In [None]:
lasso_cv_model= LassoCV(eps=0.1, n_alphas=10000, cv=5)

In [None]:
lasso_cv_model.fit(X_train, y_train)

In [None]:
print("The optimal value of alpha for Lasso Regression is: ", lasso_cv_model.alpha_)

In [None]:
y_pred_lasso= lasso_cv_model.predict(X_test)

In [None]:
MAE_Lasso= mean_absolute_error(y_test, y_pred_lasso)
MSE_Lasso= mean_squared_error(y_test, y_pred_lasso)
RMSE_Lasso= np.sqrt(MSE_Lasso)

In [None]:
pd.DataFrame([MAE_Lasso, MSE_Lasso, RMSE_Lasso], index=['MAE', 'MSE', 'RMSE'], columns=['Lasso Metrics'])

In [None]:
# Coefficients of the fitted Lasso model (some may be exactly zero)
List_lasso_cv_model_coef = lasso_cv_model.coef_
List_lasso_cv_model_coef

In [None]:
# Count the coefficients that Lasso did not shrink to zero
count = np.count_nonzero(List_lasso_cv_model_coef)
print("After Lasso Regression, we have only", count, "non-zero coefficients.")

**Regularization**

3. Elastic Net

In [None]:
elastic_model= ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1],cv=5, max_iter=100000)

In [None]:
elastic_model.fit(X_train, y_train)

In [None]:
elastic_model.l1_ratio_
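
In scikit-learn, `l1_ratio` controls the mix of the two penalties: `l1_ratio=1` gives a pure L1 (Lasso) penalty, and smaller values add more of the L2 penalty. A small sketch on synthetic data (the data and `alpha=0.1` are assumptions) shows that `ElasticNet` with `l1_ratio=1.0` reproduces `Lasso`:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data (assumed): 6 features, 3 of them irrelevant
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(100, 6))
y_demo = X_demo @ np.array([2.0, 0.0, -1.5, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
enet_as_lasso = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)  # pure L1
enet_mixed = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_demo, y_demo)     # half L1, half L2

print("Lasso:                   ", lasso.coef_.round(3))
print("ElasticNet, l1_ratio=1.0:", enet_as_lasso.coef_.round(3))
print("ElasticNet, l1_ratio=0.5:", enet_mixed.coef_.round(3))
```

If `ElasticNetCV` selects an `l1_ratio_` of 1, the cross-validated model is therefore effectively a Lasso model.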

In [None]:
y_pred_elastic=elastic_model.predict(X_test)

In [None]:
MAE_Elastic= mean_absolute_error(y_test, y_pred_elastic)
MSE_Elastic= mean_squared_error(y_test, y_pred_elastic)
RMSE_Elastic= np.sqrt(MSE_Elastic)

In [None]:
pd.DataFrame([MAE_Elastic, MSE_Elastic, RMSE_Elastic], index=['MAE', 'MSE', 'RMSE'], columns=['Elastic Metrics'])

In [None]:
# Coefficients of the fitted Elastic Net model
list_elastic_model_coef = elastic_model.coef_
list_elastic_model_coef

In [None]:
# Count the coefficients that Elastic Net did not shrink to zero
count = np.count_nonzero(list_elastic_model_coef)
print("After Elastic Net, we have only", count, "non-zero coefficients.")
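
To compare the three techniques at a glance, the metrics can be collected in one table. The sketch below runs on synthetic data so that it is self-contained; the data, the alpha and `l1_ratio` candidates, and the names `models` and `comparison` are assumptions, and the numbers it prints say nothing about the real-estate dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed): 10 features, only the first 3 influence y
rng = np.random.default_rng(11)
X_demo = rng.normal(size=(300, 10))
y_demo = X_demo[:, :3] @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)
scaler = StandardScaler().fit(X_tr)            # fit the scaler on training data only
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    'Ridge': RidgeCV(alphas=(0.1, 1.0, 10.0)),
    'Lasso': LassoCV(cv=5),
    'Elastic Net': ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5),
}

# One column per model, one row per metric
rows = {}
for name, model in models.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    mse = mean_squared_error(y_te, y_hat)
    rows[name] = {'MAE': mean_absolute_error(y_te, y_hat), 'MSE': mse, 'RMSE': np.sqrt(mse)}

comparison = pd.DataFrame(rows)
print(comparison)
```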