### Regularization in Machine Learning

# What is regularization in ML?

- Regularization is a technique that prevents a model from overfitting by adding extra information (a penalty) to it.
- It keeps all variables or features in the model but reduces the magnitude of their coefficients.
- Hence, it maintains accuracy while improving the generalization of the model.
- In simple words: "In regularization, we reduce the magnitude of the feature coefficients while keeping the same number of features."
- It mainly shrinks the coefficients of the features toward zero.

In [1]:
# Basics of regularization

- Regularization works by adding a penalty (complexity) term to the loss function of a complex model.
- The penalty reduces the magnitude of the coefficients, so all variables or features are kept.
- This maintains accuracy while improving the generalization of the model.

In [2]:
# How does Regularization Work?

Let's consider the simple linear regression equation:

$$y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_n x_n + b$$

$y$ represents the value to be predicted.
$x_1, x_2, \ldots, x_n$ are the features used to predict $y$.

$\beta_1, \beta_2, \ldots, \beta_n$ are the weights (coefficients) of the features.
$b$ represents the intercept.

The loss function for linear regression is the RSS, or residual sum of squares.
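
As a concrete illustration of this loss, the residual sum of squares can be computed directly with NumPy. The arrays `y_true` and `y_hat` below are made-up toy values, not taken from any dataset in this notebook.

```python
import numpy as np

# Toy values, invented for illustration only
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.0, 8.0])

# RSS: sum of squared differences between observed and predicted values
residuals = y_true - y_hat
rss = np.sum(residuals ** 2)
print(rss)  # 0.25 + 0.0 + 1.0 = 1.25
```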

Techniques of Regularization:
• Ridge Regression
• Lasso Regression
• Elastic Net (combines the Ridge and Lasso penalties)

In [3]:
# Ridge regression

- A small amount of bias is added to the model.
- This reduces the complexity of the model.
- It is also called L2 regularization.
- The cost function is altered by adding a penalty term to it.
- The amount of bias added to the model is called the Ridge Regression penalty.

From the cost function of Ridge Regression, we can see that as λ tends to zero, the equation reduces to the cost function of the ordinary linear regression model.
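
The ridge cost function is commonly written in the following form. The symbols $m$ (number of training samples) and $\hat{y}_i$ (prediction for sample $i$) are introduced here for illustration; $\beta_j$ and $\lambda$ are as used in the text.

$$J(\beta) = \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{n} \beta_j^2$$

The first term is the RSS and the second is the L2 penalty. Setting $\lambda = 0$ removes the penalty and leaves only the RSS.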

Ordinary linear or polynomial regression becomes unstable when there is high collinearity between the independent variables: the coefficient estimates become very sensitive to small changes in the data. Ridge regression can be used to address such problems.
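
To make the collinearity point concrete, the sketch below compares the conditioning of the matrix that ordinary least squares has to invert, `X.T @ X`, with the ridge version `X.T @ X + alpha * I` (the intercept-free case). The data is synthetic and `alpha` is chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two almost identical (highly collinear) synthetic features
x1 = rng.normal(size=100)
x2 = x1 + 1e-6 * rng.normal(size=100)
X_col = np.column_stack([x1, x2])

gram = X_col.T @ X_col
alpha = 1.0  # arbitrary penalty strength for the illustration

# A large condition number means the matrix is close to singular
cond_ols = np.linalg.cond(gram)
cond_ridge = np.linalg.cond(gram + alpha * np.eye(2))
print(cond_ols, cond_ridge)
```

The ridge matrix is far better conditioned, which is why its coefficient estimates are more stable in this situation.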

In [4]:
# Lasso regression

- Lasso stands for Least Absolute Shrinkage and Selection Operator.
- It is also called L1 regularization.
- It reduces the complexity of the model.
- It is similar to Ridge Regression, except that the penalty term contains the absolute values of the weights instead of their squares.
- Because it uses absolute values, it can shrink a coefficient exactly to zero, whereas Ridge Regression can only shrink it close to zero.
- As a result, some features are dropped from the model entirely.
- Hence, Lasso helps reduce overfitting and also performs feature selection.

Lasso Regression adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function (L).
Ridge Regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function (L).
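
The two penalty terms can be written as small NumPy functions. This is a minimal sketch: the helper names `l1_penalty` and `l2_penalty`, the coefficient vector and `lam` are all made up for the example.

```python
import numpy as np

def l1_penalty(coef, lam):
    """Lasso penalty: lam times the sum of absolute coefficient values."""
    return lam * np.sum(np.abs(coef))

def l2_penalty(coef, lam):
    """Ridge penalty: lam times the sum of squared coefficient values."""
    return lam * np.sum(coef ** 2)

coef = np.array([0.5, -2.0, 0.0, 3.0])  # made-up coefficients
lam = 0.1                                # made-up penalty strength

l1_value = l1_penalty(coef, lam)  # 0.1 * (0.5 + 2.0 + 0.0 + 3.0)
l2_value = l2_penalty(coef, lam)  # 0.1 * (0.25 + 4.0 + 0.0 + 9.0)
print(l1_value, l2_value)
```

Note how the squared penalty is dominated by the largest coefficient, while the absolute penalty grows linearly with each one.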

In [5]:
### Implementation of Ridge regression using sklearn

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

We are going to use the Boston house price dataset, which is a built-in dataset in sklearn (available through `load_boston` in versions before 1.2, where it was removed).

In [7]:
from sklearn.datasets import load_boston
boston=load_boston()
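
On scikit-learn 1.2 or later, where `load_boston` is no longer available, a bundled regression dataset such as `load_diabetes` can serve as a stand-in for following the steps. This is a different dataset, so the numbers in the rest of this notebook will not match; the names `X_alt` and `y_alt` are chosen here only for illustration.

```python
from sklearn.datasets import load_diabetes

# Stand-in dataset (NOT the Boston data): 442 samples, 10 numeric features
diabetes = load_diabetes()
X_alt, y_alt = diabetes.data, diabetes.target

print(X_alt.shape, y_alt.shape)
print(diabetes.feature_names)
```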

In [8]:
# Getting attributes of boston
dir(boston)

['DESCR', 'data', 'feature_names', 'filename', 'target']

In [9]:
# printing description
boston.DESCR

".. _boston_dataset:\n\nBoston house prices dataset\n---------------------------\n\n**Data Set Characteristics:**  \n\n    :Number of Instances: 506 \n\n    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.\n\n    :Attribute Information (in order):\n        - CRIM     per capita crime rate by town\n        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.\n        - INDUS    proportion of non-retail business acres per town\n        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n        - NOX      nitric oxides concentration (parts per 10 million)\n        - RM       average number of rooms per dwelling\n        - AGE      proportion of owner-occupied units built prior to 1940\n        - DIS      weighted distances to five Boston employment centres\n        - RAD      index of accessibility to radial highways\n        - TAX      full-value property-tax rate per $10,000

In [10]:
# Printing the "data" attribute of the dataset; it is our input
boston.data

array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        5.6400e+00],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]])

In [11]:
# Getting the feature names of the dataset
boston.feature_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

In [12]:
# Printing first 10 values of target 
boston.target[0:10]

array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9])

In [13]:
# Creating a DataFrame from the data
df=pd.DataFrame(boston.data,columns=boston.feature_names)

In [14]:
# Printing first 2 rows of the dataframe 'df'
df.head(2)

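|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |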


In [15]:
# adding a new column 'target' from boston.target
df['target']=boston.target

In [16]:
df.head(2)

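|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|--------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |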


In [17]:
# Printing a concise summary of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  target   506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB


- We have 13 independent variables and one dependent variable (the house price).

In [18]:
X=df.iloc[:,:-1].values
y=df.iloc[:,-1].values

In [19]:
from sklearn.model_selection import train_test_split

In [20]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25)

In [21]:
print(X_train.shape,y_train.shape)

(379, 13) (379,)


In [22]:
print(X_test.shape,y_test.shape)

(127, 13) (127,)
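
Because `train_test_split` shuffles the rows randomly, the split above (and every MSE value that follows) changes from run to run. Passing `random_state` makes the split reproducible. The sketch below uses synthetic data with the same shape as the Boston data; the seed `42` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the Boston data (506 rows, 13 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(506, 13))
y_demo = rng.normal(size=506)

# The same random_state gives an identical split on every call
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
print(np.array_equal(split_a[0], split_b[0]))
```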


In [23]:
# Now we train a multiple linear regression model
from sklearn.linear_model import LinearRegression
lr=LinearRegression()

In [24]:
lr.fit(X_train, y_train)

LinearRegression()

In [25]:
lr_pred=lr.predict(X_test)

In [26]:
# Calculating the mean squared error
mse=np.mean((lr_pred-y_test)**2)
mse

19.73771080470582

In [27]:
# Putting together the coefficients and their feature names
# (df.columns[:-1] skips 'target', which has no coefficient)

lr_coeff=pd.DataFrame()
lr_coeff['Columns']=df.columns[:-1]
lr_coeff['Coefficient Values']=pd.Series(lr.coef_)

print(lr_coeff)

    Columns  Coefficient Values
0      CRIM           -0.066498
1        ZN            0.053051
2     INDUS            0.041127
3      CHAS            3.502430
4       NOX          -18.380600
5        RM            3.456135
6       AGE            0.012149
7       DIS           -1.543379
8       RAD            0.296151
9       TAX           -0.012449
10  PTRATIO           -0.890911
11        B            0.011632
12    LSTAT           -0.606322


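- We can see that several columns (for example AGE, B and TAX) have coefficients close to zero, which suggests they contribute little to the prediction. Keep in mind that the size of a coefficient also depends on the scale of its feature.
- This motivates regularizing the model.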
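
To illustrate how feature scale affects raw coefficients, the sketch below fits a linear regression on synthetic data in which two features contribute equally to the target but are measured on very different scales. All names and values are made up; standardizing with `StandardScaler` is one common way to make coefficients comparable, not a step the notebook itself performs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
x_small = rng.normal(scale=1.0, size=n)      # feature measured in small units
x_large = rng.normal(scale=1000.0, size=n)   # feature measured in large units
X_demo = np.column_stack([x_small, x_large])

# Both features contribute about equally to y
y_demo = 2.0 * x_small + 0.002 * x_large + rng.normal(scale=0.1, size=n)

raw = LinearRegression().fit(X_demo, y_demo)
scaled = LinearRegression().fit(StandardScaler().fit_transform(X_demo), y_demo)

print(raw.coef_)     # second coefficient looks negligible
print(scaled.coef_)  # after standardizing, both are of similar size
```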

In [28]:
# Regularizing using ridge regression
from sklearn.linear_model import Ridge

In [29]:
ridge_reg=Ridge(alpha=1)
# Here the alpha parameter sets the regularization strength; it must be a non-negative float

In [30]:
ridge_reg.fit(X_train,y_train)

Ridge(alpha=1)

In [31]:
y_pred=ridge_reg.predict(X_test)

In [32]:
ridge_coeff=pd.DataFrame()
# df.columns[:-1] skips 'target', which has no coefficient
ridge_coeff['columns']=df.columns[:-1]
ridge_coeff['Coefficient estimates']=pd.Series(ridge_reg.coef_)
print(ridge_coeff)

    columns  Coefficient estimates
0      CRIM              -0.059764
1        ZN               0.053677
2     INDUS               0.004674
3      CHAS               3.309944
4       NOX              -9.918291
5        RM               3.558169
6       AGE               0.003945
7       DIS              -1.419434
8       RAD               0.273208
9       TAX              -0.012888
10  PTRATIO              -0.790406
11        B               0.012675
12    LSTAT              -0.614542


- Comparing the two coefficient tables above, we can see that `alpha` shrinks the overall magnitude of the coefficients; most visibly, NOX drops from -18.38 to -9.92.
- Some coefficients, such as INDUS (0.004674) and AGE (0.003945), become very small. In Ridge regularization, however, coefficients are shrunk toward 0 but are generally not set exactly to 0.
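
The effect of `alpha` can be seen more directly by fitting `Ridge` with increasing values and tracking the size of the coefficient vector. This is a minimal sketch on synthetic data (not the Boston data); the list of `alpha` values and `true_coef` are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.0, 0.5, 0.0])  # made-up generating coefficients
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

norms = []
for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    coef = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_
    # Euclidean length of the coefficient vector for this alpha
    norms.append(np.linalg.norm(coef))
    print(alpha, np.round(coef, 3), "exact zeros:", int(np.sum(coef == 0)))
```

The norm decreases as `alpha` grows, while the count of exactly-zero coefficients typically stays at 0.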

### Implementation of lasso regression using sklearn

- The penalty is the sum of the absolute values of the coefficients instead of the sum of their squares.
- Unlike Ridge Regression, Lasso regression can completely eliminate a variable by reducing its coefficient to 0.
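
One way to see why Lasso yields exact zeros: in the simplified case of orthonormal features, the L1 penalty turns each unregularized coefficient into its soft-thresholded value, whereas the L2 penalty only rescales it (the exact constants depend on how the penalty is scaled). The sketch below assumes that simplified setting; the function names, `ols_coef` and `lam` are made up for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """L1-style shrinkage: move z toward 0 by lam, clipping at exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    """L2-style shrinkage: rescale z by the constant factor 1 / (1 + lam)."""
    return z / (1.0 + lam)

ols_coef = np.array([3.0, -0.4, 0.2, -1.5])  # made-up unregularized coefficients
lam = 0.5

lasso_like = soft_threshold(ols_coef, lam)
ridge_like = ridge_shrink(ols_coef, lam)

print(lasso_like)  # entries with |z| <= lam become exactly zero
print(ridge_like)  # every entry is smaller but still non-zero
```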

In [33]:
from sklearn.linear_model import Lasso
lasso=Lasso(alpha=1)

In [34]:
lasso.fit(X_train,y_train)
y_pred1=lasso.predict(X_test)

In [35]:
lasso_mse=np.mean((y_pred1-y_test)**2)

In [36]:
print(lasso_mse)

25.283708842642042


In [37]:
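lasso_coef=pd.DataFrame()
# df.columns[:-1] skips 'target', which has no coefficient
lasso_coef['columns']=df.columns[:-1]
lasso_coef['coefficient values']=pd.Series(lasso.coef_)

In [38]:
lasso_coef

|   | columns | coefficient values |
|---|---------|--------------------|
| 0 | CRIM | -0.0 |
| 1 | ZN | 0.052337 |
| 2 | INDUS | -0.0 |
| 3 | CHAS | 0.0 |
| 4 | NOX | -0.0 |
| 5 | RM | 0.905588 |
| 6 | AGE | 0.030446 |
| 7 | DIS | -0.74375 |
| 8 | RAD | 0.219849 |
| 9 | TAX | -0.014176 |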


In [39]:
type(lasso_coef)

pandas.core.frame.DataFrame

### Python implementation of Elastic Net 

In [40]:
from sklearn.linear_model import ElasticNet
elastic=ElasticNet(alpha=1)

In [41]:
elastic.fit(X_train,y_train)

ElasticNet(alpha=1)

In [42]:
y_pred2=elastic.predict(X_test)

In [43]:
elastic_mse=np.mean((y_pred2-y_test)**2)
# As a reminder, mean squared error is the mean of the squared differences between y_predicted and y_test

print(elastic_mse)

24.422988143894155


In [44]:
# Making a dataframe of the column-wise coefficients of Elastic Net
# (df.columns[:-1] skips 'target', which has no coefficient)

elastic_coeff=pd.DataFrame()
elastic_coeff['columns']=df.columns[:-1]
elastic_coeff['coeff values']=pd.Series(elastic.coef_)

print(elastic_coeff)

    columns  coeff values
0      CRIM     -0.022867
1        ZN      0.055481
2     INDUS     -0.000000
3      CHAS      0.000000
4       NOX     -0.000000
5        RM      0.926176
6       AGE      0.029873
7       DIS     -0.802898
8       RAD      0.261508
9       TAX     -0.015532
10  PTRATIO     -0.648044
11        B      0.011629
12    LSTAT     -0.823327


In [45]:
type(elastic_coeff)

pandas.core.frame.DataFrame

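- Elastic Net is a combination of both of the above regularization techniques: its penalty term contains both the L1 and the L2 penalty.
- It can perform better than Ridge or Lasso alone, but this depends on the data and on tuning. In this run its test MSE (24.42) is slightly lower than Lasso's (25.28) but higher than that of plain linear regression (19.74), which suggests that `alpha=1` regularizes too strongly for this dataset.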
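
In scikit-learn, the mix between the two penalties is controlled by the `l1_ratio` parameter of `ElasticNet` (default `0.5`, which is what `ElasticNet(alpha=1)` above uses). The sketch below, on synthetic data with arbitrary settings, checks that `l1_ratio=1.0` reproduces `Lasso` and then fits the mixed default.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
true_coef = np.array([4.0, 0.0, -3.0, 0.0, 0.0, 1.0])  # made-up generating coefficients
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

# l1_ratio=1.0 means a pure L1 penalty, which should match Lasso
enet_l1 = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
print(np.allclose(enet_l1.coef_, lasso_demo.coef_))

# l1_ratio=0.5 mixes the L1 and L2 penalties
enet_mix = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X_demo, y_demo)
print(np.round(enet_mix.coef_, 3))
```

In practice both `alpha` and `l1_ratio` are usually tuned with cross-validation rather than fixed in advance.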