## Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('advertising.csv')

## Feature Data and Labels

In [3]:
X = df.drop('sales', axis=1)
y = df['sales']

In [4]:
from sklearn.preprocessing import PolynomialFeatures

In [6]:
polynomial_converter = PolynomialFeatures(degree=3, include_bias=False)

In [7]:
poly_features = polynomial_converter.fit_transform(X)

In [8]:
poly_features.shape

(200, 19)
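
The 19 columns follow from counting every monomial of degree 1 to 3 in three input features. The sketch below checks that count on placeholder data; the random array is a stand-in with the same shape as the feature frame, not the advertising data itself.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features, degree = 3, 3  # three input columns, cubic expansion

# Monomials of total degree <= 3 in 3 variables, minus the constant (bias) term
expected_cols = comb(n_features + degree, degree) - 1

# Placeholder data (assumption): 200 rows, 3 columns of random numbers
demo_X = np.random.default_rng(0).random((200, n_features))
demo_poly = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(demo_X)

print(expected_cols, demo_poly.shape)
```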

### Train | Test Split

In [9]:
from sklearn.model_selection import train_test_split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)
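
As a quick check of what this split produces, the sketch below uses placeholder arrays of the same shape (an assumption, not the real data). With 200 rows and test_size=0.3, 140 rows go to training and 60 to testing, and a fixed random_state makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays (assumption): same shape as poly_features and y
demo_X = np.zeros((200, 19))
demo_y = np.arange(200)

a_train, a_test, b_train, b_test = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=101
)
# Repeating the call with the same random_state selects the same rows
c_train, c_test, d_train, d_test = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=101
)

print(a_train.shape, a_test.shape)
print(np.array_equal(b_test, d_test))
```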

## Scaling the Data

In [12]:
from sklearn.preprocessing import StandardScaler

In [13]:
scaler = StandardScaler() 

The scaler should be fit on the training set only, so that the test set remains unseen. The same scaler, with the mean and standard deviation learned from the training set, is then used to rescale the test set.

In [15]:
scaler.fit(X_train)

In [16]:
X_train = scaler.transform(X_train)

In [17]:
X_test = scaler.transform(X_test)
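
To illustrate what fitting on the training set only means, here is a minimal sketch on synthetic data (the arrays and their sizes are assumptions). The training columns end up with mean 0 and standard deviation 1, while the test columns are shifted and scaled with the training statistics, so their mean is only approximately 0.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for the train and test feature matrices
demo_train = rng.normal(loc=50.0, scale=10.0, size=(140, 3))
demo_test = rng.normal(loc=50.0, scale=10.0, size=(60, 3))

demo_scaler = StandardScaler().fit(demo_train)   # statistics come from training rows only
train_scaled = demo_scaler.transform(demo_train)
test_scaled = demo_scaler.transform(demo_test)   # reuses the training mean and std

# The same transformation written out by hand
manual_test = (demo_test - demo_train.mean(axis=0)) / demo_train.std(axis=0)

print(train_scaled.mean(axis=0))
print(test_scaled.mean(axis=0))
```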

## Ridge Regression

The Ridge regression loss function is defined as:

$$
\text{Ridge Loss} = \text{MSE} + \alpha \sum_{j=1}^{p} \beta_j^2
$$

Where:
- MSE is the Mean Squared Error over the $n$ training samples.
- $\alpha$ is the regularization parameter.
- $\beta_j$ are the $p$ model coefficients.


$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
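
A penalized least-squares loss of this kind has a closed-form minimizer. scikit-learn's Ridge documents its objective as `||y - Xw||^2_2 + alpha * ||w||^2_2`, that is, with the sum rather than the mean of the squared errors, which only changes the scale of alpha. The sketch below uses synthetic data and no intercept (both assumptions) to compare the closed-form solution with the fitted coefficients.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# Synthetic regression problem (assumption): 50 samples, 4 features
demo_X = rng.normal(size=(50, 4))
demo_y = demo_X @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.1, size=50)
alpha = 10.0

# Closed-form minimizer of ||y - Xw||^2 + alpha * ||w||^2 (no intercept)
w_closed = np.linalg.solve(demo_X.T @ demo_X + alpha * np.eye(4), demo_X.T @ demo_y)

# The same problem solved by scikit-learn
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(demo_X, demo_y).coef_

# Ordinary least squares for comparison: the ridge solution is shrunk toward zero
w_ols = np.linalg.lstsq(demo_X, demo_y, rcond=None)[0]

print(w_closed)
print(w_sklearn)
print(np.linalg.norm(w_closed), np.linalg.norm(w_ols))
```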


In [21]:
from sklearn.linear_model import Ridge

In [22]:
help(Ridge)

Help on class Ridge in module sklearn.linear_model._ridge:

class Ridge(sklearn.base.MultiOutputMixin, sklearn.base.RegressorMixin, _BaseRidge)
 |  Ridge(alpha=1.0, *, fit_intercept=True, copy_X=True, max_iter=None, tol=0.0001, solver='auto', positive=False, random_state=None)
 |  
 |  Linear least squares with l2 regularization.
 |  
 |  Minimizes the objective function::
 |  
 |  ||y - Xw||^2_2 + alpha * ||w||^2_2
 |  
 |  This model solves a regression model where the loss function is
 |  the linear least squares function and regularization is given by
 |  the l2-norm. Also known as Ridge Regression or Tikhonov regularization.
 |  This estimator has built-in support for multi-variate regression
 |  (i.e., when y is a 2d-array of shape (n_samples, n_targets)).
 |  
 |  Read more in the :ref:`User Guide <ridge_regression>`.
 |  
 |  Parameters
 |  ----------
 |  alpha : {float, ndarray of shape (n_targets,)}, default=1.0
 |      Constant that multiplies the L2 term, controlling regula

In [23]:
ridge_model = Ridge(alpha=10)

In [25]:
ridge_model.fit(X_train, y_train)

In [26]:
y_pred = ridge_model.predict(X_test)

## Performance Results

In [27]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [29]:
MAE = mean_absolute_error(y_test, y_pred)
MAE

0.5774404204714162

In [30]:
MSE = mean_squared_error(y_test, y_pred)
MSE

0.8003783071528354
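
For reference, the sketch below writes these metrics out by hand on four made-up values (the numbers are illustrative, not taken from the model) and compares them with the scikit-learn functions. RMSE is the square root of MSE, which brings the error back to the units of the target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions (assumption), chosen for easy arithmetic
demo_true = np.array([10.0, 12.0, 9.0, 15.0])
demo_pred = np.array([11.0, 11.0, 9.5, 13.0])

errors = demo_true - demo_pred      # [-1.0, 1.0, -0.5, 2.0]
mae = np.mean(np.abs(errors))       # (1 + 1 + 0.5 + 2) / 4 = 1.125
mse = np.mean(errors ** 2)          # (1 + 1 + 0.25 + 4) / 4 = 1.5625
rmse = np.sqrt(mse)                 # 1.25

print(mae, mean_absolute_error(demo_true, demo_pred))
print(mse, mean_squared_error(demo_true, demo_pred))
print(rmse)
```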

## Choosing an alpha value with Cross-Validation

In [31]:
from sklearn.linear_model import RidgeCV

In [32]:
help(RidgeCV)

Help on class RidgeCV in module sklearn.linear_model._ridge:

class RidgeCV(sklearn.base.MultiOutputMixin, sklearn.base.RegressorMixin, _BaseRidgeCV)
 |  RidgeCV(alphas=(0.1, 1.0, 10.0), *, fit_intercept=True, scoring=None, cv=None, gcv_mode=None, store_cv_values=False, alpha_per_target=False)
 |  
 |  Ridge regression with built-in cross-validation.
 |  
 |  See glossary entry for :term:`cross-validation estimator`.
 |  
 |  By default, it performs efficient Leave-One-Out Cross-Validation.
 |  
 |  Read more in the :ref:`User Guide <ridge_regression>`.
 |  
 |  Parameters
 |  ----------
 |  alphas : array-like of shape (n_alphas,), default=(0.1, 1.0, 10.0)
 |      Array of alpha values to try.
 |      Regularization strength; must be a positive float. Regularization
 |      improves the conditioning of the problem and reduces the variance of
 |      the estimates. Larger values specify stronger regularization.
 |      Alpha corresponds to ``1 / (2C)`` in other linear models such as
 |

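RidgeCV tries each candidate value in alphas (default=(0.1, 1.0, 10.0)) and keeps the one with the best cross-validation score. The cv parameter, excerpted from the same help text, sets the splitting strategy: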

cv :
    int, cross-validation generator or an iterable, default=None
    Determines the cross-validation splitting strategy.
    Possible inputs for cv are:

    - None, to use the efficient Leave-One-Out cross-validation
    - integer, to specify the number of folds.
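
A minimal sketch of the two cv options described above, on synthetic data (the data and the choice of `neg_mean_squared_error` as the scoring metric are assumptions for illustration). With cv left at None, RidgeCV uses the efficient Leave-One-Out scheme; with cv=5 it uses 5-fold cross-validation. The two schemes can pick different alpha values.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
# Synthetic regression problem (assumption): 140 samples, 5 features
demo_X = rng.normal(size=(140, 5))
demo_y = demo_X @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=140)

candidate_alphas = (0.1, 1.0, 10.0)

# cv=None (the default): efficient Leave-One-Out cross-validation
loo_model = RidgeCV(alphas=candidate_alphas, scoring="neg_mean_squared_error")
loo_model.fit(demo_X, demo_y)

# cv=5: 5-fold cross-validation
kfold_model = RidgeCV(alphas=candidate_alphas, scoring="neg_mean_squared_error", cv=5)
kfold_model.fit(demo_X, demo_y)

print(loo_model.alpha_, kfold_model.alpha_)
```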

In [33]:
ridge_cv_model = RidgeCV(alphas=(0.1, 1.0, 10.))

In [34]:
ridge_cv_model.fit(X_train, y_train)

The best performing alpha:

In [35]:
ridge_cv_model.alpha_

0.1

In [36]:
y_pred = ridge_cv_model.predict(X_test)

In [37]:
MAE = mean_absolute_error(y_test,y_pred)
MSE = mean_squared_error(y_test,y_pred)
RMSE = np.sqrt(MSE)

In [38]:
MAE

0.4273774884345441

In [39]:
MSE

0.38201298815347423

In [40]:
RMSE

0.6180719926946004

In [41]:
ridge_cv_model.coef_

array([ 5.40769392,  0.5885865 ,  0.40390395, -6.18263924,  4.59607939,
       -1.18789654, -1.15200458,  0.57837796, -0.1261586 ,  2.5569777 ,
       -1.38900471,  0.86059434,  0.72219553, -0.26129256,  0.17870787,
        0.44353612, -0.21362436, -0.04622473, -0.06441449])


-----

## Lasso Regression

#### Least Absolute Shrinkage and Selection Operator

In [43]:
from sklearn.linear_model import LassoCV

In [44]:
lasso_cv_model = LassoCV(eps=0.001, n_alphas=100, cv=5)

In [45]:
lasso_cv_model.fit(X_train, y_train)

  model = cd_fast.enet_coordinate_descent(


To fix the convergence warning, we can increase the maximum number of iterations (max_iter):

In [56]:
lasso_cv_model = LassoCV(eps=0.001, n_alphas=100, cv=5, max_iter=1000000)

In [57]:
lasso_cv_model.fit(X_train, y_train)

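Another way to fix this convergence warning is to narrow the search range by increasing eps, the ratio of the smallest to the largest alpha tried (alpha_min / alpha_max).
The smaller eps is, the wider the range of alpha values we are checking.

Note that the cells In [56] and In [57] above ran after In [48] and In [49], so the results below come from the model with eps=0.001 and max_iter=1000000.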
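
To make the role of eps concrete, the sketch below fits LassoCV on synthetic data (an assumption) with two eps values and inspects the fitted `alphas_` grid. The ratio of the smallest to the largest alpha on the grid matches eps, so eps=0.001 searches a range 100 times wider than eps=0.1.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
# Synthetic regression problem (assumption): 100 samples, 5 features
demo_X = rng.normal(size=(100, 5))
demo_y = demo_X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=100)

ratios = {}
for eps in (0.1, 0.001):
    demo_model = LassoCV(eps=eps, cv=5).fit(demo_X, demo_y)
    # alphas_ holds the grid of alpha values that was searched
    ratios[eps] = demo_model.alphas_.min() / demo_model.alphas_.max()
    print(eps, ratios[eps])
```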

In [48]:
lasso_cv_model = LassoCV(eps=0.1, n_alphas=100, cv=5)

In [49]:
lasso_cv_model.fit(X_train, y_train)

In [58]:
lasso_cv_model.alpha_

0.004943070909225831

In [59]:
y_pred = lasso_cv_model.predict(X_test)

In [60]:
MAE = mean_absolute_error(y_test,y_pred)
MSE = mean_squared_error(y_test,y_pred)
RMSE = np.sqrt(MSE)

In [61]:
MAE

0.4335034618590074

In [62]:
MSE

0.367616757419907

In [63]:
RMSE

0.6063140748984036

In [64]:
lasso_cv_model.coef_

array([ 4.86023329,  0.12544598,  0.20746872, -4.99250395,  4.38026519,
       -0.22977201, -0.        ,  0.07267717, -0.        ,  1.77780246,
       -0.69614918, -0.        ,  0.12044132, -0.        , -0.        ,
       -0.        ,  0.        ,  0.        , -0.        ])

**Many of these coefficients are exactly 0. The L1 penalty drives the coefficients of less useful features (or feature interactions) to zero, so Lasso effectively selects the most influential terms. (Most likely, the largest remaining coefficients belong to terms involving TV and radio.)**

## Elastic Net

Elastic Net combines the penalties of ridge regression and lasso in an attempt to get the best of both. 
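
For reference, scikit-learn's ElasticNet parametrizes the combined penalty with an overall strength $\alpha$ and a mixing weight $\rho$ (the l1_ratio parameter). Written with the $\beta$ notation used for Ridge above, where $X$ is the feature matrix and $n$ the number of samples, the objective is:

$$
\frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \rho \lVert \beta \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert \beta \rVert_2^2
$$

Setting $\rho = 1$ leaves only the L1 (Lasso) penalty, while $\rho = 0$ leaves only the L2 (Ridge) penalty.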

In [65]:
from sklearn.linear_model import ElasticNetCV

In [72]:
elastic_model = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], eps=.001, n_alphas=100, max_iter=1000)

In [68]:
elastic_model.fit(X_train, y_train)

In [69]:
elastic_model.l1_ratio_

1.0

**An l1_ratio_ of 1.0 means that cross-validation selected a pure L1 penalty: for this data, Elastic Net reduces to Lasso and the Ridge (L2) component is dropped entirely.**
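
The equivalence can be checked directly. The sketch below fits Lasso and ElasticNet with l1_ratio=1.0 on synthetic data (the data and the alpha value are assumptions) and compares the coefficients.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
# Synthetic regression problem (assumption): 100 samples, 6 features
demo_X = rng.normal(size=(100, 6))
demo_y = demo_X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.1).fit(demo_X, demo_y)
# With l1_ratio=1.0 the L2 part of the Elastic Net penalty vanishes
enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(demo_X, demo_y)

print(lasso.coef_)
print(enet.coef_)
```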

In [70]:
y_pred = elastic_model.predict(X_test)

In [71]:
MAE = mean_absolute_error(y_test, y_pred)
MSE = mean_squared_error(y_test, y_pred)
RMSE = np.sqrt(MSE)

In [73]:
MAE

0.566326211756945

In [74]:
MSE

0.5603340214638839

In [75]:
RMSE

0.7485546215633726