# **Motivation:**

* One thing should be clear to everyone: **overfitting is more dangerous than underfitting** because it is deceptive. An overfit model performs well on the training data and makes us happy temporarily, but it performs worse on the test dataset, which is what actually matters.

* Underfitting at least gives us a chance to improve, because it clearly shows poor results even during training.

So, what can we do to prevent overfitting?

**Use regularization!**

# **Introduction:**

* Regularization is a way to reduce model overfitting and variance.

* What is overfitting? It is too much variance and too little bias. So if we use a little common sense, we can judge that 'to decrease variance, we can accept a little more bias so that both (variance and bias) are balanced.' Woah!

* What does the model do when it overfits? It tries to pass through every point of the training data and drive the training loss to (nearly) zero. Which is a trap!

* With regularization, we add a penalty to the loss so the model cannot fit the training data at any cost. We want it to behave sensibly!

* So we can say, **regularization reduces model overfitting by adding a penalty to the loss function.**
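The trap described above can be seen on a small toy problem. The following is a minimal sketch, assuming a hypothetical noisy sine curve (not this notebook's dataset), a degree-9 polynomial, and an arbitrarily chosen Ridge penalty of `alpha=1.0`. The unpenalized fit reaches the lower training error by construction; on data like this it usually pays for that with a higher test error, though the exact numbers depend on the random seed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical toy data: 20 noisy samples of a sine curve.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y_train = np.sin(2 * np.pi * x_train).ravel() + rng.normal(0, 0.2, 20)
x_test = np.linspace(0, 1, 200).reshape(-1, 1)
y_test = np.sin(2 * np.pi * x_test).ravel()


def fit_and_score(regressor):
    """Fit a degree-9 polynomial model and return (train MSE, test MSE)."""
    model = make_pipeline(
        PolynomialFeatures(degree=9, include_bias=False),
        StandardScaler(),
        regressor,
    )
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))
    test_mse = mean_squared_error(y_test, model.predict(x_test))
    return train_mse, test_mse


# Same features, with and without a penalty on the coefficients.
plain_train, plain_test = fit_and_score(LinearRegression())
ridge_train, ridge_test = fit_and_score(Ridge(alpha=1.0))

print('No penalty: train MSE = {:.4f}, test MSE = {:.4f}'.format(plain_train, plain_test))
print('Ridge     : train MSE = {:.4f}, test MSE = {:.4f}'.format(ridge_train, ridge_test))
```

The training error alone would make the unpenalized model look like the better one, which is exactly why it should not be trusted on its own.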

# **Types of Regularization:**

## **1. L1 Regularization - Lasso Regularization:**

1. It adds a penalty equal to the **'absolute value of the magnitude of each coefficient'** to the residual sum of squares (RSS).

  **L1 = RSS +  λ Σ |$\beta_j$|**

2. The L1 regularization term is the sum of the absolute values of the coefficients multiplied by a regularization parameter (lambda).

3. Here, lambda decides the severity of the penalty. It can take any value from 0 to ∞.

4. It limits the size of the coefficients and forces some of them to become exactly zero. This happens when the tuning parameter lambda is sufficiently large. So it produces a sparse model, and models generated by LASSO are easier to interpret.

5. As lambda increases, the model seeks the coefficients that minimize the loss while keeping the sum of the absolute values of the coefficients within a budget determined by lambda (a larger lambda means a smaller budget).

   If a coefficient does not reduce the loss by more than the penalty it incurs, it becomes advantageous to set that coefficient to zero. The feature then no longer influences the predictions, and the overall regularization penalty goes down.

6. This is possible because the absolute value function has a sharp corner at zero, allowing coefficients to become exactly zero when the regularization penalty outweighs their contribution to reducing the loss.
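The "exactly zero" behaviour can be sketched in a few lines. This example assumes the simplified textbook case of orthonormal features, where the Lasso solution is the unpenalized coefficient pushed toward zero by a fixed threshold (soft-thresholding), while the Ridge solution only rescales it. The coefficient values and the helper names `soft_threshold` and `ridge_shrink` are made up for illustration.

```python
import numpy as np


def soft_threshold(z, threshold):
    """Lasso-style update: move z toward zero by `threshold`, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)


def ridge_shrink(z, lam):
    """Ridge-style update (orthonormal features): rescale, never exactly zero."""
    return z / (1.0 + lam)


# Hypothetical unpenalized (least-squares) coefficients.
ols_coefs = np.array([3.0, -0.5, 1.0, -2.5])

lasso_coefs = soft_threshold(ols_coefs, threshold=1.0)
ridge_coefs = ridge_shrink(ols_coefs, lam=1.0)

print('Least squares:', ols_coefs)
print('Lasso-style  :', lasso_coefs)   # small coefficients are set to exactly zero
print('Ridge-style  :', ridge_coefs)   # every coefficient is halved, none is zero
```

Coefficients whose magnitude is at or below the threshold are removed entirely, which is the sparsity described above.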

**When to use:**

Use L1 regularization (Lasso) when feature selection or sparsity is important, as it tends to shrink the coefficients of irrelevant features to exactly zero.

In [91]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [92]:
df = pd.read_csv('/content/sample_data/Advertising.csv')
df.head()

|   | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 180.8 | 10.8 | 58.4 | 12.9 |


In [93]:
X = df.drop('sales', axis = 1)
y = df['sales']

In [94]:
from sklearn.preprocessing import PolynomialFeatures

polyconverter = PolynomialFeatures(degree = 3, include_bias = False)

new_features = polyconverter.fit_transform(X)

print('Original data size: {}'.format(X.shape))
print('New data size: {}'.format(new_features.shape))

Original data size: (200, 3)
New data size: (200, 19)


In [95]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(new_features, y, test_size = 0.3, random_state = 101)

In [96]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, so no test-set statistics leak into training
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

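* Sklearn calls lambda `alpha`. The reason is that different models in sklearn have different regularization hyperparameters. Instead of naming them all differently, sklearn uses `alpha` for the penalty strength in its regularized linear regression models (Lasso, Ridge, ElasticNet). Cool!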
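One detail worth keeping in mind when reading the numbers in this notebook: as far as the scikit-learn documentation states the objectives, `alpha` is not scaled the same way in `Lasso` and `Ridge`. With $n$ the number of training samples, the two objectives can be written as:

```latex
\text{Lasso:}\quad \frac{1}{2n}\,\mathrm{RSS} + \alpha \sum_j |\beta_j|
\qquad\qquad
\text{Ridge:}\quad \mathrm{RSS} + \alpha \sum_j \beta_j^2
```

Multiplying the Lasso objective by $2n$ gives $\mathrm{RSS} + 2n\alpha \sum_j |\beta_j|$, so in the RSS form used in this notebook λ corresponds to $2n\alpha$ for Lasso and to $\alpha$ for Ridge. The same numeric value, such as alpha = 10, therefore need not mean the same penalty strength in both models.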

In [108]:
from sklearn.linear_model import Lasso

# Start with an arbitrarily chosen value of lambda/alpha
lasso_model = Lasso(alpha = 10)
lasso_model.fit(X_train, y_train)

In [109]:
predictions = lasso_model.predict(X_test)

In [110]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

MAE = mean_absolute_error(y_test, predictions).round(2)
RMSE = np.sqrt(mean_squared_error(y_test, predictions)).round(2)

print('Mean Absolute ERROR:  {}'.format(MAE))
print('Root Mean Squared ERROR:  {}'.format(RMSE))

Mean Absolute ERROR:  4.62
Root Mean Squared ERROR:  5.4


**Using cross validation on alpha**

* We don't know the best value of lambda/alpha in advance. So, we use cross-validation to pick it from a set of candidate values.


In [120]:
from sklearn.linear_model import LassoCV

lasso_cv_model = LassoCV(alphas = (0.1, 1.0, 10.0))

lasso_cv_model.fit(X_train, y_train)

In [115]:
predictions_cv = lasso_cv_model.predict(X_test)

In [116]:
lasso_cv_model.alpha_

0.1

In [117]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

MAE = mean_absolute_error(y_test, predictions_cv).round(2)
RMSE = np.sqrt(mean_squared_error(y_test, predictions_cv)).round(2)

print('Mean Absolute ERROR using CV on alpha:  {}'.format(MAE))
print('Root Mean Squared ERROR  using CV on alpha:  {}'.format(RMSE))

Mean Absolute ERROR using CV on alpha:  0.58
Root Mean Squared ERROR  using CV on alpha:  0.88


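Performance improved with CV! MAE fell from 4.62 to 0.58 and RMSE from 5.4 to 0.88, which suggests the hand-picked alpha = 10 penalized the coefficients far too heavily. Note that the selected alpha (0.1) is the smallest candidate in the grid, so even smaller values may be worth trying.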

In [121]:
lasso_cv_model.coef_

array([ 1.71353441,  0.15167646,  0.        , -0.        ,  3.86916277,
        0.        ,  0.        ,  0.        ,  0.        , -0.40990134,
       -0.        , -0.        ,  0.        ,  0.        , -0.        ,
        0.        ,  0.        ,  0.        ,  0.        ])

Look! Most of the coefficients have been set to exactly zero: only 4 of the 19 features are kept. This is how L1 works!

## **2. L2 Regularization - Ridge Regularization:**

1. It adds a penalty equal to the **'square of the magnitude of each coefficient'** to the residual sum of squares (RSS).

  **L2 = RSS +  λ Σ $\beta_j^2$**

2. Here, lambda decides the severity of the penalty. It can take any value from 0 to ∞.

3. It shrinks all coefficients toward zero, because the regularization term penalizes their squared magnitudes, **encouraging them to be smaller overall without eliminating any particular coefficient**. (Only in the special case of orthonormal features is every coefficient shrunk by exactly the same factor.) This leads to a more balanced regularization effect across all coefficients.

4. It does not eliminate any coefficient entirely: coefficients get close to zero but do not become exactly zero.

5. Trick to remember:

    Linear Regression: RSS

    Ridge Regression: RSS + penalty

6. If lambda is zero, the penalty vanishes and we are back to RSS, i.e. ordinary linear regression.

7. Because the penalty squares each coefficient, large coefficients are punished much more heavily than small ones.

8. To choose lambda, use cross-validation and pick the value with the best score, based on RMSE, MAE, etc.
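The points above can be checked numerically. With no intercept, the objective RSS + λ Σ β² has the closed-form solution β = (XᵀX + λI)⁻¹ Xᵀy. The sketch below, on hypothetical random data, compares that formula with sklearn's `Ridge(fit_intercept=False)` and shows the coefficient vector getting shorter as lambda grows. The helper name `ridge_closed_form` is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical synthetic data with known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
true_beta = np.array([2.0, -1.0, 0.5, 0.0])
y = X @ true_beta + rng.normal(scale=0.1, size=50)


def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for beta (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


lam = 10.0
beta_manual = ridge_closed_form(X, y, lam)
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

# Length of the coefficient vector for increasing penalties.
# lam = 0 is plain least squares.
lambdas = (0.0, 1.0, 10.0, 100.0)
norms = [np.linalg.norm(ridge_closed_form(X, y, value)) for value in lambdas]

print('Closed form:', beta_manual.round(4))
print('sklearn    :', beta_sklearn.round(4))
print('Coefficient norms for lambda in {}: {}'.format(lambdas, np.round(norms, 4)))
```

The coefficients shrink steadily but none of them is set to exactly zero, in contrast to Lasso.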

**When to use:**

Use L2 regularization (Ridge) when you want to control the overall magnitude of the coefficients and avoid large variations between them, promoting a more balanced influence of all features.

In [97]:
from sklearn.linear_model import Ridge

# Start with an arbitrarily chosen value of lambda/alpha
ridge_model = Ridge(alpha = 10)
ridge_model.fit(X_train, y_train)

In [98]:
predictions = ridge_model.predict(X_test)

In [99]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

MAE = mean_absolute_error(y_test, predictions).round(2)
RMSE = np.sqrt(mean_squared_error(y_test, predictions)).round(2)

print('Mean Absolute ERROR:  {}'.format(MAE))
print('Root Mean Squared ERROR:  {}'.format(RMSE))

Mean Absolute ERROR:  0.58
Root Mean Squared ERROR:  0.89


**Using cross validation on alpha to find best alpha value**

In [100]:
from sklearn.linear_model import RidgeCV

ridge_cv_model = RidgeCV(alphas = (0.1, 1.0, 10.0))

ridge_cv_model.fit(X_train, y_train)

In [101]:
predictions_cv = ridge_cv_model.predict(X_test)

In [102]:
ridge_cv_model.alpha_

0.1

In [103]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

MAE = mean_absolute_error(y_test, predictions_cv).round(2)
RMSE = np.sqrt(mean_squared_error(y_test, predictions_cv)).round(2)

print('Mean Absolute ERROR using CV on alpha:  {}'.format(MAE))
print('Root Mean Squared ERROR  using CV on alpha:  {}'.format(RMSE))

Mean Absolute ERROR using CV on alpha:  0.43
Root Mean Squared ERROR  using CV on alpha:  0.62


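This shows that 0.1 is the best alpha/lambda value among the three candidates for our dataset/model. Since it is the smallest value in the grid, smaller values may be worth adding.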


* For cross-validation of lambda/alpha in L2 regularization, sklearn uses a 'scorer object'. All scorer objects follow the convention that a higher return value is better. But error metrics such as RMSE are something we want to lower, not raise. To handle this, the scorer reports the negative error (e.g. 'negative RMSE'), so the rule stays the same: the higher, the better.
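The sign convention is easy to see with `cross_val_score`. This is a minimal sketch on hypothetical synthetic data from `make_regression`, with an arbitrary `alpha=1.0`: the scorer returns negative RMSE values, and flipping the sign recovers the usual RMSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic regression data.
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# One score per fold; every score is a NEGATIVE RMSE, so higher (closer to 0) is better.
neg_rmse = cross_val_score(
    Ridge(alpha=1.0), X, y, cv=5, scoring='neg_root_mean_squared_error'
)

# Flip the sign to read the scores as ordinary RMSE values.
rmse = -neg_rmse

print('Scorer output (negative RMSE):', neg_rmse.round(3))
print('Mean RMSE across folds       :', rmse.mean().round(3))
```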


In [104]:
# Performance Evaluation

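# Note: SCORERS is only available in older scikit-learn versions;
# newer versions provide sklearn.metrics.get_scorer_names() instead.
from sklearn.metrics import SCORERS

SCORERS.keys()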

dict_keys(['explained_variance', 'r2', 'max_error', 'matthews_corrcoef', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'accuracy', 'top_k_accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'positive_likelihood_ratio', 'neg_negative_likelihood_ratio', 'adjusted_rand_score', 'rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weig

From all of these scorer keys, we will use 'neg_mean_squared_error'.

In [105]:
# Rebuild the model, specifying the scoring metric up front so the best CV score can be inspected afterwards

ridge_model = RidgeCV(alphas = (0.1, 1.0, 10.0), scoring = 'neg_mean_squared_error')

ridge_model.fit(X_train, y_train)


In [106]:
predictions = ridge_model.predict(X_test)

In [107]:
# best_score_ is the highest cross-validated score, i.e. the negative MSE
# of the selected alpha (closer to zero is better)
ridge_model.best_score_

-0.37770668487073433

## **3. Elastic Net - Combining L1 and L2:**

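1. It combines L1 and L2 and adds a mixing parameter α that decides the ratio between them. (Note: this α is not sklearn's `alpha`. In sklearn, λ is called `alpha` and the mixing parameter is called `l1_ratio`.)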

 **Elastic Net = RSS + λ { (1 - α)/2 . [Σ $\beta_j^2$] + α . [Σ |$\beta_j$|] }**

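2. **If α = 0: the L1 term becomes zero**, leaving only the L2 penalty (Ridge). (The L2 term carries a factor of 0.5, but minimizing half of something is the same as minimizing the whole, up to a rescaling of λ.)

  **If α = 1: the L2 term becomes zero**, leaving only the L1 penalty (Lasso).

3. When α is tuned by cross-validation, the best value can turn out to be exactly 0 or 1, in which case elastic net reduces to pure Ridge or pure Lasso.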
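The two special cases can be verified by evaluating the penalty term of the formula above directly. This is a small sketch with made-up coefficient values; `elastic_net_penalty` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np


def elastic_net_penalty(beta, lam, alpha):
    """Penalty from the formula above: lam * ((1 - alpha)/2 * sum(b^2) + alpha * sum(|b|))."""
    beta = np.asarray(beta, dtype=float)
    l1 = np.sum(np.abs(beta))   # 3 + 4 + 0 = 7 for the example below
    l2 = np.sum(beta ** 2)      # 9 + 16 + 0 = 25 for the example below
    return lam * ((1 - alpha) / 2 * l2 + alpha * l1)


beta = [3.0, -4.0, 0.0]  # hypothetical coefficients

pure_l2 = elastic_net_penalty(beta, lam=2.0, alpha=0.0)  # only the L2 term is left
pure_l1 = elastic_net_penalty(beta, lam=2.0, alpha=1.0)  # only the L1 term is left
mixed = elastic_net_penalty(beta, lam=2.0, alpha=0.5)    # a blend of both

print('alpha = 0.0 ->', pure_l2)  # 2 * (0.5 * 25)            = 25.0
print('alpha = 1.0 ->', pure_l1)  # 2 * 7                     = 14.0
print('alpha = 0.5 ->', mixed)    # 2 * (0.25 * 25 + 0.5 * 7) = 19.5
```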

**When to use:**

Elastic Net is used when you have many predictors (the variables or features used to predict or explain the target variable) and want to find the important ones while avoiding overfitting.

In [122]:
from sklearn.linear_model import ElasticNetCV

It needs the following hyperparameters:

1. **l1_ratio:** If we set l1_ratio to 0, only the L2 penalty is used, and if we set it to 1, only the L1 penalty is used.

  Its value must be between 0 and 1. If we give some value 0<l1_ratio<1, then the penalty will be combination of L1 and L2.

  l1_ratio can also be given in form of list, then cross validation will be performed on those values and one best will be chosen by model. This is good approach!

2. **eps:** It is up to us whether to supply the lambda/alpha values ourselves. If we leave `alphas = None`, the model builds the alpha grid automatically from eps and n_alphas.

  eps is the ratio between the smallest and the largest alpha on that grid:
  
  **eps = alpha_min/alpha_max**

  alpha_max is derived from the training data (it is the smallest penalty at which every coefficient becomes zero), and alpha_min = eps × alpha_max. n_alphas only sets how many values are placed between the two.

3. **n_alphas:** The number of alphas we want to try. E.g. n_alphas = 100 tries 100 alphas, spaced evenly on a log scale between alpha_min and alpha_max.

  If we increase n_alphas, more candidate models have to be fitted, which means it is going to take longer to run.

4. **max_iter:** The maximum number of iterations the solver may run.
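The automatically generated grid can be inspected after fitting through the `alphas_` attribute. The sketch below uses hypothetical synthetic data and leaves the grid size at scikit-learn's default (100 values) rather than setting it explicitly. It checks the two properties described above: the ratio of the smallest to the largest alpha equals eps, and the values are evenly spaced on a log scale.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Hypothetical synthetic data, used only to look at the alpha grid.
X, y = make_regression(n_samples=100, n_features=5, noise=1.0, random_state=0)

eps = 1e-3
# No alphas are supplied, so the grid is built automatically from the data and eps.
model = ElasticNetCV(l1_ratio=0.5, eps=eps, cv=5).fit(X, y)

grid = model.alphas_                         # candidate alphas that were tried
ratio = grid.min() / grid.max()              # alpha_min / alpha_max
log_steps = np.diff(np.log(np.sort(grid)))   # constant steps => log-spaced grid

print('Number of alphas tried :', len(grid))
print('alpha_min / alpha_max  :', ratio)
print('Log-spaced grid        :', np.allclose(log_steps, log_steps[0]))
print('Alpha selected by CV   :', model.alpha_)
```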


In [125]:
elastic_net_model = ElasticNetCV(l1_ratio = [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0], 
                                 eps = 0.001, n_alphas = 100, max_iter = 1000000)

In [127]:
elastic_net_model.fit(X_train, y_train)

In [128]:
predictions = elastic_net_model.predict(X_test)

In [129]:
elastic_net_model.alpha_

0.004943070909225833

This is the lambda/alpha chosen by cross-validation. The selected mixing ratio is available as `elastic_net_model.l1_ratio_`.

In [132]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

MAE = mean_absolute_error(y_test, predictions).round(2)
RMSE = np.sqrt(mean_squared_error(y_test, predictions)).round(2)

print('Mean Absolute ERROR:  {}'.format(MAE))
print('Root Mean Squared ERROR:  {}'.format(RMSE))

Mean Absolute ERROR:  0.43
Root Mean Squared ERROR:  0.61
