## Regularization using Ridge and Lasso 

Ridge and Lasso Regression are techniques used to address some of the limitations of standard linear regression, specifically overfitting. Overfitting occurs when a model captures not only the underlying data pattern but also the noise, leading to poor performance on new data. Both methods add a regularization term to the cost function, which penalizes large coefficients and helps to keep the model complexity in check.

## Ridge Regression
### Concept:
Ridge regression adds a penalty term to the cost function proportional to the sum of the squared coefficients (an L2 penalty).

The cost function for ridge regression can be written as:  
$ \text{Cost Function} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} m_j^2 $  
Where:
- $ y_i $ is the observed value and $ \hat{y}_i $ is the predicted value for observation $ i $ (of $ n $ observations).
- $ m_j $ are the $ p $ coefficients of the model.
- $ \lambda $ is the regularization parameter that controls the strength of the penalty.
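
As a minimal sketch of the formula above, the function below evaluates the ridge cost with NumPy. The name `ridge_cost` and the toy numbers are illustrative assumptions and are not part of the auto-mpg example.

```python
import numpy as np

def ridge_cost(y, y_hat, m, lam):
    """Sum of squared residuals plus lam times the sum of squared coefficients."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    m = np.asarray(m, dtype=float)
    residual_term = np.sum((y - y_hat) ** 2)
    penalty_term = lam * np.sum(m ** 2)
    return residual_term + penalty_term

# Toy values (hypothetical): three observations, two coefficients
y = [3.0, 5.0, 7.0]
y_hat = [2.0, 5.0, 9.0]
m = [2.0, -1.0]

cost_no_penalty = ridge_cost(y, y_hat, m, lam=0.0)    # residuals only: 1 + 0 + 4
cost_with_penalty = ridge_cost(y, y_hat, m, lam=0.5)  # adds 0.5 * (4 + 1)
print(cost_no_penalty, cost_with_penalty)
```

With the penalty switched off the cost is just the sum of squared residuals. Raising `lam` makes the same coefficients more expensive without changing the fit.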

### Mechanism:
- **Penalty Term:** The penalty term $ \lambda \sum_{j=1}^{p} m_j^2 $ discourages large coefficients. Larger values of $ \lambda $ increase the penalty, shrinking the coefficients towards zero but not exactly zero.
- **Regularization Parameter:** $ \lambda $ is a hyperparameter that must be set before training the model. A small $ \lambda $ means a weak penalty (similar to ordinary least squares regression), while a large $ \lambda $ means a strong penalty.
- **Shrinkage Effect:** Ridge regression shrinks the coefficients but does not eliminate any, thus retaining all features in the model.
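
The shrinkage described above can be sketched with the closed-form ridge solution on synthetic data. This assumes a model without an intercept. The data, the helper `ridge_closed_form`, and the chosen values of `lam` are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # synthetic features
true_m = np.array([4.0, -2.0, 0.5])               # hypothetical true coefficients
y = X @ true_m + rng.normal(scale=0.5, size=50)   # noisy target

def ridge_closed_form(X, y, lam):
    """Minimizer of ||y - X m||^2 + lam * ||m||^2 (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
coefs = [ridge_closed_form(X, y, lam) for lam in lambdas]
for lam, m in zip(lambdas, coefs):
    print(f"lambda={lam:>7}: m={np.round(m, 3)}  ||m||={np.linalg.norm(m):.3f}")
```

As `lam` grows, the norm of the coefficient vector falls steadily, yet every coefficient stays non-zero. With `lam=0.0` the result is the ordinary least squares fit.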

## Lasso Regression
### Concept:
Lasso regression adds a penalty term to the cost function proportional to the sum of the absolute values of the coefficients (an L1 penalty).

The cost function for lasso regression can be written as:  
$ \text{Cost Function} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |m_j| $  
Where:
- $ y_i $ is the observed value and $ \hat{y}_i $ is the predicted value for observation $ i $ (of $ n $ observations).
- $ m_j $ are the $ p $ coefficients of the model.
- $ \lambda $ is the regularization parameter that controls the strength of the penalty.
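
To make the difference between the absolute-value penalty and a squared penalty concrete, the short sketch below computes both per coefficient for a made-up coefficient vector. The values in `m` are hypothetical.

```python
import numpy as np

m = np.array([3.0, -0.5, 0.1])  # hypothetical coefficients

l1_terms = np.abs(m)  # per-coefficient lasso penalty, before multiplying by lambda
l2_terms = m ** 2     # per-coefficient squared penalty, before multiplying by lambda

print("L1 terms:", l1_terms, "sum =", l1_terms.sum())
print("L2 terms:", l2_terms, "sum =", l2_terms.sum())
```

For a coefficient with magnitude below 1, the squared term is smaller than the absolute term, so a squared penalty exerts very little pressure on coefficients that are already small. The absolute-value penalty keeps pressing on them at a constant rate, which is one intuition for why lasso can push coefficients all the way to zero.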

### Mechanism:
- **Penalty Term:** The penalty term $ \lambda \sum_{j=1}^{p} |m_j| $ encourages sparsity, meaning it can shrink some coefficients to zero.
- **Regularization Parameter:** Similar to ridge regression, $ \lambda $ is a hyperparameter that must be set before training. A small $ \lambda $ means a weak penalty, while a large $ \lambda $ means a strong penalty.
- **Feature Selection:** Lasso regression can effectively select features by shrinking some coefficients exactly to zero, eliminating non-informative features from the model.
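
The sparsity effect can be seen exactly in a simplified setting. If the feature columns are assumed to be orthonormal, both penalized problems separate per coefficient and have closed-form solutions in terms of the least squares coefficients `b`. The sketch below relies on that assumption; the function names and numbers are illustrative.

```python
import numpy as np

def lasso_orthonormal(b, lam):
    """Minimizer of sum((m - b)**2) + lam * sum(|m|): soft-thresholding at lam / 2."""
    return np.sign(b) * np.maximum(np.abs(b) - lam / 2.0, 0.0)

def ridge_orthonormal(b, lam):
    """Minimizer of sum((m - b)**2) + lam * sum(m**2): uniform rescaling."""
    return b / (1.0 + lam)

b = np.array([3.0, -0.4, 0.2, -2.0])  # hypothetical least squares coefficients
lam = 1.0

m_lasso = lasso_orthonormal(b, lam)
m_ridge = ridge_orthonormal(b, lam)
print("lasso:", m_lasso)
print("ridge:", m_ridge)
```

In this setting lasso subtracts a fixed amount from each coefficient's magnitude and sets any coefficient smaller than that amount to exactly zero. Ridge divides every coefficient by the same factor, so none reaches zero.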


## Examples Using Our auto-mpg Data

Let's go through an example with the auto-mpg dataset to see how these regressions work.

### 1. Import Libraries:

In [2]:
import pandas as pd
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures


### 2. Load Dataset and Split:

Load the dataset, separate the target variable (mpg) from the features, and split the data into training and test sets.

In [3]:
data = pd.read_csv('auto-mpg.csv')
y = data[['mpg']]
X = data.drop(['mpg', 'car name', 'origin'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)


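### 3. Scale Data:

Use MinMaxScaler to rescale each feature to the range 0 to 1. (Strictly speaking, this is min-max normalization rather than standardization, which would center each feature at 0 with unit variance.) Fit the scaler on the training data only, then transform the test data with the same fitted scaler to prevent data leakage.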

In [4]:
scale = MinMaxScaler()
X_train_transformed = scale.fit_transform(X_train)
X_test_transformed = scale.transform(X_test)
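
To illustrate what the fitted scaler does to unseen data, here is a small sketch on made-up single-feature arrays. It compares MinMaxScaler with the equivalent manual computation using the training minimum and maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (hypothetical values)
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[5.0], [25.0], [40.0]])

scaler = MinMaxScaler().fit(X_train)

# Manual equivalent: subtract the training minimum, divide by the training range
train_min = X_train.min(axis=0)
train_max = X_train.max(axis=0)
X_test_manual = (X_test - train_min) / (train_max - train_min)

X_test_scaled = scaler.transform(X_test)
print(X_test_scaled.ravel())
```

Because the minimum and maximum come from the training set alone, test values below the training minimum or above the training maximum land outside the 0 to 1 range. That is expected and is not a sign of leakage.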

### 4. Fit Models:

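Fit three regression models (ridge, lasso, and standard linear regression) to the scaled training data. In scikit-learn, the alpha parameter plays the role of $ \lambda $ and controls the strength of the penalty in ridge and lasso regression. Note that scikit-learn's Lasso divides the squared-error term by $ 2n $ while Ridge does not, so the same alpha value does not imply the same penalty strength in the two models.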

In [5]:
ridge = Ridge(alpha=0.5)
ridge.fit(X_train_transformed, y_train)

lasso = Lasso(alpha=0.5)
lasso.fit(X_train_transformed, y_train)

lin = LinearRegression()
lin.fit(X_train_transformed, y_train)


### 5. Generate Predictions:

Generate predictions for both the training and test sets using each of the fitted models.

In [6]:
y_h_ridge_train = ridge.predict(X_train_transformed)
y_h_ridge_test = ridge.predict(X_test_transformed)

y_h_lasso_train = lasso.predict(X_train_transformed)
y_h_lasso_test = lasso.predict(X_test_transformed)

y_h_lin_train = lin.predict(X_train_transformed)
y_h_lin_test = lin.predict(X_test_transformed)

### 6. Evaluate Models:

In [7]:
print('Train Error Ridge Model', mean_squared_error(y_train, y_h_ridge_train))
print('Test Error Ridge Model', mean_squared_error(y_test, y_h_ridge_test))
print('\n')

print('Train Error Lasso Model', mean_squared_error(y_train, y_h_lasso_train))
print('Test Error Lasso Model', mean_squared_error(y_test, y_h_lasso_test))
print('\n')

print('Train Error Unpenalized Linear Model', mean_squared_error(y_train, y_h_lin_train))
print('Test Error Unpenalized Linear Model', mean_squared_error(y_test, y_h_lin_test))


Train Error Ridge Model 9.798079515529828
Test Error Ridge Model 17.52369243383445


Train Error Lasso Model 16.244450797081786
Test Error Lasso Model 30.034636315030966


Train Error Unpenalized Linear Model 9.700888480581275
Test Error Unpenalized Linear Model 16.74802531396471


Ridge is clearly better than lasso here (test MSE of about 17.5 versus 30.0), but the unpenalized model performs best (test MSE of about 16.7). This makes sense: a linear regression model with only these six features is probably not overfitting, so adding regularization just pushes the model toward underfitting.

Let's see how including ridge and lasso changed our parameter estimates.

In [8]:
print('Ridge parameter coefficients:', ridge.coef_)
print('Lasso parameter coefficients:', lasso.coef_)
print('Linear model parameter coefficients:', lin.coef_)

Ridge parameter coefficients: [[ -2.06904445  -2.88593443  -1.81801505 -15.23785349  -1.45594148
    8.1440177 ]]
Lasso parameter coefficients: [-9.09743525 -0.         -0.         -4.02703963  0.          3.92348219]
Linear model parameter coefficients: [[ -1.33790698  -1.05300843  -0.08661412 -19.26724989  -0.37043697
    8.56051229]]


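Did you notice that lasso shrank three of the six coefficients to exactly 0? Ridge regression mostly affected the fourth coefficient, which has the largest magnitude: it is estimated at about -19.27 by the linear regression model and shrinks to about -15.24 under ridge. Several of the smaller coefficients actually grew in magnitude, because ridge penalizes the coefficients jointly rather than shrinking each one independently.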
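
Raw coefficient arrays like the ones printed above are easier to compare when lined up by feature name. The sketch below shows one way to build such a table. It uses a small synthetic DataFrame with hypothetical columns `a` and `b` rather than the auto-mpg data, and its `alpha` values are arbitrary.

```python
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Small synthetic frame (hypothetical columns); 'b' has no effect on the target
df = pd.DataFrame({
    "a": [0.0, 0.1, 0.2, 0.4, 0.5, 0.7, 0.9, 1.0],
    "b": [0.3, 0.9, 0.1, 0.6, 0.0, 1.0, 0.4, 0.7],
})
target = 5.0 * df["a"] + 2.0

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=0.5),
    "lasso": Lasso(alpha=0.05),
}

# ravel() keeps this robust: Ridge and LinearRegression return a 2-D coef_
# when the target is passed as a one-column DataFrame
coef_table = pd.DataFrame(
    {name: model.fit(df, target).coef_.ravel() for name, model in models.items()},
    index=df.columns,
)
print(coef_table.round(3))
```

Each row is a feature and each column a model, so shrinkage and zeroed-out coefficients can be read off directly.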

### Regularized Polynomial Regression vs. Polynomial Regression
Now let's compare this to a model built using PolynomialFeatures, which has more complexity than an ordinary multiple regression.

In [9]:
# Prepare data
poly = PolynomialFeatures(degree=6)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

X_train_transformed = scale.fit_transform(X_train_poly)
X_test_transformed = scale.transform(X_test_poly)

# Fit models
ridge.fit(X_train_transformed, y_train)
lasso.fit(X_train_transformed, y_train)
lin.fit(X_train_transformed, y_train)

# Generate predictions
y_h_ridge_train = ridge.predict(X_train_transformed)
y_h_ridge_test = ridge.predict(X_test_transformed)
y_h_lasso_train = lasso.predict(X_train_transformed)
y_h_lasso_test = lasso.predict(X_test_transformed)
y_h_lin_train = lin.predict(X_train_transformed)
y_h_lin_test = lin.predict(X_test_transformed)

# Display results
print('Train Error Polynomial Ridge Model', mean_squared_error(y_train, y_h_ridge_train))
print('Test Error Polynomial Ridge Model', mean_squared_error(y_test, y_h_ridge_test))
print('\n')
print('Train Error Polynomial Lasso Model', mean_squared_error(y_train, y_h_lasso_train))
print('Test Error Polynomial Lasso Model', mean_squared_error(y_test, y_h_lasso_test))
print('\n')
print('Train Error Unpenalized Polynomial Model', mean_squared_error(y_train, y_h_lin_train))
print('Test Error Unpenalized Polynomial Model', mean_squared_error(y_test, y_h_lin_test))
print('\n')
print('Polynomial Ridge Parameter Coefficients:', len(ridge.coef_[ridge.coef_ != 0]), 
      'non-zero coefficient(s) and', len(ridge.coef_[ridge.coef_ == 0]), 'zeroed-out coefficient(s)')
print('Polynomial Lasso Parameter Coefficients:',  len(lasso.coef_[lasso.coef_ != 0]), 
      'non-zero coefficient(s) and', len(lasso.coef_[lasso.coef_ == 0]), 'zeroed-out coefficient(s)')
print('Polynomial Model Parameter Coefficients:',  len(lin.coef_[lin.coef_ != 0]), 
      'non-zero coefficient(s) and', len(lin.coef_[lin.coef_ == 0]), 'zeroed-out coefficient(s)')

Train Error Polynomial Ridge Model 5.498365263214847
Test Error Polynomial Ridge Model 10.705099905649291


Train Error Polynomial Lasso Model 16.429632826093172
Test Error Polynomial Lasso Model 30.384937999587347


Train Error Unpenalized Polynomial Model 2.610329107681425e-18
Test Error Unpenalized Polynomial Model 184189.346043964


Polynomial Ridge Parameter Coefficients: 923 non-zero coefficient(s) and 1 zeroed-out coefficient(s)
Polynomial Lasso Parameter Coefficients: 3 non-zero coefficient(s) and 921 zeroed-out coefficient(s)
Polynomial Model Parameter Coefficients: 924 non-zero coefficient(s) and 0 zeroed-out coefficient(s)


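In this case, the unpenalized model was severely overfitting: its training MSE is essentially zero (about 2.6e-18) while its test MSE is about 184,189. Applying ridge or lasso regularization removed this overfitting. Ridge benefited from the extra polynomial features (test MSE of about 10.7, down from about 17.5), whereas lasso kept only 3 non-zero coefficients and its test MSE (about 30.4) is essentially unchanged from the non-polynomial fit. The best model we have seen so far is the polynomial + ridge model, which seems to strike the best balance between bias and variance.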
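
The 924 coefficients reported above follow from the number of polynomial terms of degree at most 6 in 6 input features. The sketch below checks that count with a binomial coefficient and confirms it against PolynomialFeatures on placeholder random data (not the auto-mpg data).

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features, degree = 6, 6

# Number of monomials of total degree <= degree in n_features variables,
# including the constant (bias) term
expected_columns = comb(n_features + degree, degree)

# Check against scikit-learn on placeholder data with the same number of columns
X_dummy = np.random.default_rng(0).normal(size=(4, n_features))
actual_columns = PolynomialFeatures(degree=degree).fit_transform(X_dummy).shape[1]

print(expected_columns, actual_columns)
```

Assuming the standard auto-mpg file with roughly 400 rows, the expanded design has more columns than there are training observations. That is consistent with the unpenalized model fitting the training set almost perfectly.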

If we were to continue tweaking our models, we might want to reduce the alpha ($ \lambda $) for the lasso model, because it seems to be underfitting compared to the ridge model. Reducing alpha would reduce the strength of the regularization, allowing for more non-zero coefficients.
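
A sweep over alpha values might look like the sketch below. It uses synthetic data from make_regression as a stand-in for the auto-mpg features, so the specific numbers it prints say nothing about the real dataset. The alpha grid is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 20 features, only 5 of which drive the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=12)

scale = MinMaxScaler()
X_train_s = scale.fit_transform(X_train)
X_test_s = scale.transform(X_test)

results = []  # (alpha, number of non-zero coefficients, test MSE)
for alpha in [1e6, 10.0, 1.0, 0.1, 0.01]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_train_s, y_train)
    n_nonzero = int(np.count_nonzero(model.coef_))
    test_mse = mean_squared_error(y_test, model.predict(X_test_s))
    results.append((alpha, n_nonzero, test_mse))
    print(f"alpha={alpha:<10g} non-zero={n_nonzero:>2}  test MSE={test_mse:.1f}")
```

In practice alpha should be chosen on a validation set or by cross-validation rather than on the test set. scikit-learn's LassoCV and GridSearchCV automate that search.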