# Regularization

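Regularization is a technique for preventing overfitting and improving how well a machine learning model generalizes. Two common regularization techniques for linear models are:

- L1 regularization (used by Lasso) - adds a penalty proportional to the sum of the absolute values of the model coefficients, which can shrink some coefficients to exactly zero.
- L2 regularization (used by Ridge) - adds a penalty proportional to the sum of the squared coefficients, which shrinks them toward zero, usually without eliminating them.

These should not be confused with L1 and L2 normalization, which rescale each sample (row) so that the sum of its absolute values (L1) or the square root of the sum of its squared values (L2) equals 1. That is what `Normalizer` does later in this notebook.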
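
When Lasso and Ridge are used as regression models, the L1 and L2 terms are penalties on the fitted coefficients that get added to the training loss. The sketch below only computes those penalty terms for a made-up coefficient vector. The function names and the sample values are illustrative assumptions, not part of scikit-learn.

```python
def l1_penalty(coefs, alpha=1.0):
    """L1 (Lasso-style) penalty: alpha times the sum of absolute coefficients."""
    return alpha * sum(abs(w) for w in coefs)


def l2_penalty(coefs, alpha=1.0):
    """L2 (Ridge-style) penalty: alpha times the sum of squared coefficients."""
    return alpha * sum(w * w for w in coefs)


# Hypothetical coefficient vector, chosen only for easy arithmetic.
coefs = [3.0, -4.0, 0.0]

print("L1 penalty:", l1_penalty(coefs))  # |3| + |-4| + |0|
print("L2 penalty:", l2_penalty(coefs))  # 3^2 + (-4)^2 + 0^2
```

A larger `alpha` makes either penalty weigh more against the fit to the data, which is what pushes the coefficients toward zero.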

In [59]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, Normalizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge

## 1. Lasso

In [60]:
df = sns.load_dataset("flights")
df.head()

|  | year | month | passengers |
|---|---|---|---|
| 0 | 1949 | Jan | 112 |
| 1 | 1949 | Feb | 118 |
| 2 | 1949 | Mar | 132 |
| 3 | 1949 | Apr | 129 |
| 4 | 1949 | May | 121 |


In [61]:
df.shape

(144, 3)

In [62]:
df.isnull().sum()

year          0
month         0
passengers    0
dtype: int64

In [63]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   year        144 non-null    int64   
 1   month       144 non-null    category
 2   passengers  144 non-null    int64   
dtypes: category(1), int64(2)
memory usage: 2.9 KB


### train_test_split()

In [64]:
x = df.drop("passengers",axis=1)
y = df["passengers"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15)

In [65]:
x

|  | year | month |
|---|---|---|
| 0 | 1949 | Jan |
| 1 | 1949 | Feb |
| 2 | 1949 | Mar |
| 3 | 1949 | Apr |
| 4 | 1949 | May |
| ... | ... | ... |
| 139 | 1960 | Aug |
| 140 | 1960 | Sep |
| 141 | 1960 | Oct |
| 142 | 1960 | Nov |


In [66]:
y

0      112
1      118
2      132
3      129
4      121
      ... 
139    606
140    508
141    461
142    390
143    432
Name: passengers, Length: 144, dtype: int64

In [67]:
x_train

|  | year | month |
|---|---|---|
| 11 | 1949 | Dec |
| 114 | 1958 | Jul |
| 117 | 1958 | Oct |
| 61 | 1954 | Feb |
| 132 | 1960 | Jan |
| ... | ... | ... |
| 47 | 1952 | Dec |
| 75 | 1955 | Apr |
| 128 | 1959 | Sep |
| 139 | 1960 | Aug |


### ColumnTransformer

In [68]:
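# One-hot encode the month column (index 1), dropping the first category;
# the remaining column (year) is passed through unchanged.
c1 = ColumnTransformer(transformers=[
    ("oneHotEncoder", OneHotEncoder(sparse_output=False, drop="first"), [1])
], remainder="passthrough")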

In [69]:
# Scale every column produced by c1 to the [0, 1] range.
c2 = ColumnTransformer(transformers=[
    ("minMaxScaler", MinMaxScaler(), slice(0,None))
])
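
To make the two preprocessing steps concrete, here is a minimal pure-Python sketch of what one-hot encoding with the first category dropped and min-max scaling do to a few values. The helpers `one_hot_drop_first` and `min_max_scale` and the sample data are hypothetical, written only for illustration; they are not how scikit-learn implements `OneHotEncoder` or `MinMaxScaler`.

```python
def one_hot_drop_first(values, categories):
    """One-hot encode values, omitting the column for the first category."""
    kept = categories[1:]
    return [[1.0 if v == c else 0.0 for c in kept] for v in values]


def min_max_scale(column):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    span = hi - lo
    # A constant column has no spread, so map it to 0.0 to avoid dividing by zero.
    return [(v - lo) / span if span else 0.0 for v in column]


# Toy data with three categories instead of the twelve months in the dataset.
categories = ["Jan", "Feb", "Mar"]
encoded = one_hot_drop_first(["Jan", "Mar", "Feb"], categories)
scaled_years = min_max_scale([1949, 1960, 1954])

print(encoded)       # "Jan" becomes all zeros because its column was dropped
print(scaled_years)  # earliest year maps to 0.0, latest to 1.0
```

Dropping one category keeps the encoded columns from being perfectly collinear, which matters for an unregularized linear model.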

### Pipeline

In [70]:
pipe = Pipeline([
    ("1", c1),
    ("2", c2),
    ("3", Lasso())
])

In [71]:
pipe.fit(x_train, y_train)

In [72]:
pipe.score(x_test, y_test)

0.9106041477494408

In [73]:
pipe.score(x_train, y_train)

0.9522809914027824

## 2. Ridge

The pipeline is already defined, so only its third step needs to change.

In [78]:
pipe = Pipeline([
    ("1", c1),
    ("2", c2),
    ("3", Ridge())
])

pipe.fit(x_train, y_train)

In [79]:
pipe.score(x_train, y_train)

0.9538131219282071

In [80]:
pipe.score(x_test, y_test)

0.904861917492905

## 3. LinearRegression

In [81]:
pipe = Pipeline([
    ("1", c1),
    ("2", c2),
    ("3", LinearRegression())
])

pipe.fit(x_train, y_train)

In [82]:
pipe.score(x_test, y_test)

0.9267202567686732

In [83]:
pipe.score(x_train, y_train)

0.9602298138032782

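## Note:
- Here I tried three algorithms: LinearRegression, Lasso (L1), and Ridge (L2). L1 and L2 regularization are meant to reduce overfitting, but here both scored slightly lower on the test set than plain LinearRegression (about 0.911 and 0.905 versus 0.927), most likely because the dataset is simple, with only two input columns, so there is little overfitting to correct. Both models were run with their default `alpha=1.0`, and the split has no fixed `random_state`, so the exact scores will vary between runs.

- `Normalizer` also has `l1` and `l2` options, but it is a preprocessing transformer, not a model: it only returns a rescaled dataset, whereas the Lasso and Ridge pipelines scale the data and then fit a regularized regressor.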
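
To see why the penalized models can score lower when there is little overfitting, it helps to look at how the penalty strength shrinks a coefficient. The sketch below assumes the simplest possible case, a single feature with no intercept, and uses the objectives documented by scikit-learn: Ridge minimizes `||y - Xw||^2 + alpha * ||w||^2`, and Lasso minimizes `(1 / (2 * n)) * ||y - Xw||^2 + alpha * ||w||_1`. The function names and toy data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Move value toward zero by threshold, returning 0.0 if it would cross zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def ridge_1d(x, y, alpha):
    """Closed-form Ridge coefficient for one feature and no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)


def lasso_1d(x, y, alpha):
    """Closed-form Lasso coefficient for one feature and no intercept."""
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return soft_threshold(sxy / n, alpha) / (sxx / n)


# Toy data where the unpenalized least-squares coefficient is exactly 2.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for alpha in (0.0, 1.0, 10.0):
    print(alpha, ridge_1d(x, y, alpha), lasso_1d(x, y, alpha))
```

With `alpha=0.0` both functions recover the least-squares coefficient. As `alpha` grows, the Ridge coefficient shrinks smoothly, while the Lasso coefficient drops to exactly zero once `alpha` is large enough. On data that a plain linear model already fits well, that shrinkage adds bias without much benefit.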

## Using the Normalizer class

In [85]:
# L1 normalizer
pipe = Pipeline([
    ("1", c1),
    ("2", c2),
    ("3", Normalizer(norm="l1"))
])

pipe.fit_transform(x_train, y_train)

array([[0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.45      ],
       [0.        , 0.        , 0.        , ..., 0.55      , 0.        ,
        0.45      ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.52380952,
        0.47619048],
       [0.5       , 0.        , 0.        , ..., 0.        , 0.        ,
        0.5       ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.26666667]])

In [86]:
# L2 normalizer
pipe = Pipeline([
    ("1", c1),
    ("2", c2),
    ("3", Normalizer(norm="l2"))
])

pipe.fit_transform(x_train, y_train)

array([[0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.63323779],
       [0.        , 0.        , 0.        , ..., 0.7739573 , 0.        ,
        0.63323779],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.73994007,
        0.67267279],
       [0.70710678, 0.        , 0.        , ..., 0.        , 0.        ,
        0.70710678],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.34174306]])
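
The two outputs above differ only in which norm each row is divided by. A minimal pure-Python sketch of row-wise normalization follows; `normalize_row` is a hypothetical helper, and leaving an all-zero row unchanged is a choice made here to avoid dividing by zero.

```python
import math


def normalize_row(row, norm="l2"):
    """Divide a row by its L1 or L2 norm so the chosen norm of the result is 1."""
    if norm == "l1":
        size = sum(abs(v) for v in row)
    elif norm == "l2":
        size = math.sqrt(sum(v * v for v in row))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    # An all-zero row has norm 0; return it unchanged instead of dividing by zero.
    if size == 0:
        return list(row)
    return [v / size for v in row]


# A row with one active one-hot column and a scaled year of 1.0.
row = [1.0, 0.0, 1.0]
l1_row = normalize_row(row, norm="l1")
l2_row = normalize_row(row, norm="l2")

print(l1_row)  # the absolute values sum to 1
print(l2_row)  # the squared values sum to 1
```

This row reproduces the `0.5, ..., 0.5` and `0.70710678, ..., 0.70710678` pattern visible in the two arrays above. Note that the operation works on each row independently and involves no target `y` and no model coefficients.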

## Conclusion:

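The `l1` and `l2` options appear in two different places, and they do different jobs:
- Lasso and Ridge: these are regularized regression models. In the pipelines above, the data is first encoded and scaled, then the model is fit with an L1 or L2 penalty on its coefficients and gives us predictions and scores.
- `Normalizer`: this is a preprocessing step. It only returns a rescaled dataset, as MinMaxScaler and other scalers do, except that it rescales each row to unit norm instead of rescaling each column. This is normalization, not regularization.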