In [59]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [60]:
df = pd.read_csv('./Advertising.csv')

### Ridge Regression(L2 Regularization)

Ridge regression, also known as L2 regularization, is a linear regression technique that adds a penalty term to the cost function to prevent overfitting. It does this by adding the sum of squared coefficients multiplied by a hyperparameter alpha (λ) to the mean squared error (MSE) cost function.

The ridge regression cost function is given by:

Cost = MSE + α * Σ(coefficients^2)

MSE = cost function of linear regression

The bias-variance trade-off refers to the balance between two types of errors that affect the performance of a machine learning model:

- Bias: This represents the error introduced by approximating a real-world problem with a simplified model. High bias models tend to underfit the data, meaning they cannot capture the underlying patterns in the data.

- Variance: This represents the error introduced by the model's sensitivity to the training data. High variance models tend to overfit the data, meaning they perform well on the training set but fail to generalize to new, unseen data.

** Ridge regression introduces a regularization term that penalizes large coefficients, which helps to reduce variance. However, it may introduce some bias because it constrains the coefficients, which can lead to a less flexible model.

In [61]:
df.head()

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


In [62]:
X = df.drop('sales', axis=1)
y = df['sales']

In [63]:
from sklearn.preprocessing import PolynomialFeatures

In [64]:
poly_converter = PolynomialFeatures(degree=3, include_bias=False)

In [65]:
poly_features = poly_converter.fit_transform(X)

In [66]:
from sklearn.model_selection import train_test_split

-------
### Scaling the data

While our particular data set has all the values in the same order of magnitude ($1000s of dollars spent), typically that won't be the case on a dataset, and since the mathematics behind regularized models will sum coefficients together, its important to standardize the features. Review the theory videos for more info, as well as a discussion on why we only **fit** to the training data, and **transform** on both sets separately.

In [67]:
from sklearn.preprocessing import StandardScaler

In [68]:
scaler = StandardScaler()

In [69]:
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.25, random_state=101)

In [70]:
scaler.fit(X_train)

In [71]:
X_train = scaler.fit_transform(X_train)

In [72]:
X_test = scaler.fit_transform(X_test)

In [73]:
from sklearn.linear_model import Ridge

In [74]:
ridge_model = Ridge(alpha=5)

In [75]:
ridge_model.fit(X_train, y_train)

In [76]:
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error
def report_model(model):
    model.fit(X_train, y_train)
    model_pred = model.predict(X_test)
    print(f'the mean absolute percentage error of the {model} is {mean_absolute_percentage_error(model_pred, y_test)}')
    print(f'the mean absolute error of the {model} is {mean_absolute_error(model_pred, y_test)}')
    print(f'the mean squared error of the {model} is { np.sqrt(mean_squared_error(model_pred, y_test))}')

In [77]:
report_model(ridge_model)

the mean absolute percentage error of the Ridge(alpha=5) is 0.07556074613735148
the mean absolute error of the Ridge(alpha=5) is 0.8717177668237119
the mean squared error of the Ridge(alpha=5) is 1.104917772313037


For aplha value of 5 we get a result of 9% error.. 

 Now we have to find the optimal value of alpha value for the Ridge model
* To find the optimal value of alpha we can go for RidgeCV model

In [78]:
from sklearn.linear_model import RidgeCV

In [79]:
alpha_values = (1,5,10,15,20,100)

In [80]:
import sklearn
sklearn.metrics.SCORERS.keys()

dict_keys(['explained_variance', 'r2', 'max_error', 'matthews_corrcoef', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'accuracy', 'top_k_accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'positive_likelihood_ratio', 'neg_negative_likelihood_ratio', 'adjusted_rand_score', 'rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weig

In [81]:
ridge_cv_model = RidgeCV(alphas=alpha_values,scoring='neg_mean_squared_error')

In [82]:
report_model(ridge_cv_model)

the mean absolute percentage error of the RidgeCV(alphas=(1, 5, 10, 15, 20, 100), scoring='neg_mean_squared_error') is 0.07571805492371339
the mean absolute error of the RidgeCV(alphas=(1, 5, 10, 15, 20, 100), scoring='neg_mean_squared_error') is 0.8826842510398787
the mean squared error of the RidgeCV(alphas=(1, 5, 10, 15, 20, 100), scoring='neg_mean_squared_error') is 1.0668541375647713


In [83]:
ridge_cv_model.alpha_

1

In [84]:
ridge_cv_model.coef_

array([ 3.60873935,  0.6566454 ,  0.38489177, -1.39401938,  2.97728855,
       -0.23246345, -0.20976442,  0.15771253, -0.21564651, -0.30566009,
       -0.3998741 , -0.33322327,  0.98645905,  0.18337382,  0.17585539,
       -0.1763479 , -0.2276781 ,  0.05800896,  0.03271645])

Using the RidgeCV model the error is increased, So this problem is not suitable for Ridge model


### Lasso Regression (L1 Regularization)

Lasso regression, also known as L1 regularization or Lasso regularization, is a linear regression technique that incorporates a penalty term to the cost function in order to prevent overfitting and promote feature selection.
- The term "lasso" stands for "Least Absolute Shrinkage and Selection Operator."
- Cost = MSE + λ * Σ(coefficients)

Lasso regression is particularly useful when dealing with high-dimensional datasets, where there are many features, and not all of them are relevant for prediction.
By setting some coefficients to zero, it helps in identifying the most important predictors and simplifies the model, leading to improved generalization and reduced risk of overfitting.

In [85]:
from sklearn.linear_model import LassoCV

In [98]:
lasso_cv_model = LassoCV(eps=0.01, n_alphas=100,cv=10)

eps (epsilon):

eps is a small positive value that determines the range of alpha values to explore. The range of alpha values considered by LassoCV is given by alphas = np.logspace(np.log10(alpha_max * eps), np.log10(alpha_max), n_alphas), where alpha_max is the maximum alpha value for the problem. The default value of eps is 1e-3.

n_alphas (number of alphas):

n_alphas specifies the number of alpha values to consider in the range defined by eps. Increasing n_alphas will lead to a more finely spaced set of alpha values to explore. This parameter determines the granularity of the search for the optimal alpha. The default value of n_alphas is 100.

In [99]:
report_model(lasso_cv_model)

the mean absolute percentage error of the LassoCV(cv=10, eps=0.01) is 0.07520417057575057
the mean absolute error of the LassoCV(cv=10, eps=0.01) is 0.8825080134840829
the mean squared error of the LassoCV(cv=10, eps=0.01) is 1.080777583648209


In [88]:
lasso_cv_model.coef_

array([ 2.45179387,  0.30513505,  0.02911108, -0.        ,  3.716142  ,
       -0.        ,  0.        ,  0.02042388,  0.        , -1.02855505,
       -0.        , -0.        ,  0.        ,  0.        , -0.        ,
        0.        ,  0.        ,  0.        ,  0.        ])

In [89]:
lasso_cv_model.alpha_

0.04955891288263107

----------
### Elastic Net

Elastic Net is a regularization technique used in linear regression and machine learning to address the issues of multicollinearity (when predictor variables are highly correlated) and prevent overfitting. It combines both L1 (Lasso) and L2 (Ridge) regularization penalties. The elastic net regression equation can be represented as:

In [101]:
from sklearn.linear_model import ElasticNet

We will perfom GridSearchCV for finding the best parameter of elastic net

In [102]:
elastic_model = ElasticNet()

In [103]:
from sklearn.model_selection import GridSearchCV

In [104]:
param_grid = {
    'alpha': [0.1,0.2,0.3,0.4,0.5,0.99,1.0],       
    'l1_ratio': [0.1,0.2, 0.5, 0.8,0.9,0.99],       
    'max_iter': [1000],               
    'tol': [1e-4],                      
}

In [105]:
grid_model = GridSearchCV(estimator=elastic_model, param_grid=param_grid,cv=10)

In [106]:
grid_model.fit(X_train, y_train)

In [107]:
report_model(grid_model)

the mean absolute percentage error of the GridSearchCV(cv=10, estimator=ElasticNet(),
             param_grid={'alpha': [0.1, 0.2, 0.3, 0.4, 0.5, 0.99, 1.0],
                         'l1_ratio': [0.1, 0.2, 0.5, 0.8, 0.9, 0.99],
                         'max_iter': [1000], 'tol': [0.0001]}) is 0.07705699112307726
the mean absolute error of the GridSearchCV(cv=10, estimator=ElasticNet(),
             param_grid={'alpha': [0.1, 0.2, 0.3, 0.4, 0.5, 0.99, 1.0],
                         'l1_ratio': [0.1, 0.2, 0.5, 0.8, 0.9, 0.99],
                         'max_iter': [1000], 'tol': [0.0001]}) is 0.9023087263536199
the mean squared error of the GridSearchCV(cv=10, estimator=ElasticNet(),
             param_grid={'alpha': [0.1, 0.2, 0.3, 0.4, 0.5, 0.99, 1.0],
                         'l1_ratio': [0.1, 0.2, 0.5, 0.8, 0.9, 0.99],
                         'max_iter': [1000], 'tol': [0.0001]}) is 1.1476730891560782


In [108]:
grid_model.best_estimator_

- Here in the result we can see that error % is nearly equal to the value of lasso regression
- The best estimator of this Grid model is aplha = 0.1 and l1_ratio = 0.99 ~ 1
- From the l1_ratio we can say that it is a Lasso regression and that's why the error is equal to the value of lasso regression