# 07.04 - Regularisation (Elastic Net Regression)

Elastic Net is a hybrid regularization technique that combines the strengths of both Ridge and Lasso regression. It adds both of their penalties to the loss function, so it can shrink coefficients (as Ridge does) and set some of them exactly to zero (as Lasso does).

The equation to minimize in Elastic Net is given as:

### $$ \text{minimize:}\; \text{RSS} + \text{Lasso} + \text{Ridge} = \sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p\beta_j x_{ij}\right)\right)^2 + \alpha\rho\sum_{j=1}^p |\beta_j| + \alpha(1-\rho)\sum_{j=1}^p \beta_j^2$$

This equation has three main parts:

- The Residual Sum of Squares (RSS), the sum of squared differences between the observed and predicted values, which measures the variation in the data that the model did not account for.
- The Lasso (L1) Penalty, $\alpha\rho\sum_{j=1}^p |\beta_j|$, which is used to induce sparsity by setting some coefficients to zero.
- The Ridge (L2) Penalty, $\alpha(1-\rho)\sum_{j=1}^p \beta_j^2$, which is used to prevent overfitting by shrinking the magnitude of the coefficients.
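
As a rough illustration of how the three parts add up, the objective can be evaluated by hand for a fixed set of coefficients. This is only a sketch: the function name `elastic_net_objective` and the toy arrays are made up for this example and are not part of sklearn.

```python
import numpy as np

def elastic_net_objective(X, y, beta0, beta, alpha, rho):
    """Evaluate RSS + alpha*rho*sum|beta| + alpha*(1-rho)*sum(beta^2)."""
    residuals = y - (beta0 + X @ beta)
    rss = np.sum(residuals ** 2)
    lasso_penalty = alpha * rho * np.sum(np.abs(beta))
    ridge_penalty = alpha * (1 - rho) * np.sum(beta ** 2)
    return rss + lasso_penalty + ridge_penalty

# Toy data (hypothetical): 3 observations, 2 features.
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_demo = np.array([1.0, 2.0, 3.0])
beta_demo = np.array([1.0, 2.0])

# These coefficients reproduce y_demo exactly, so RSS is 0 and only the penalties remain:
# Lasso part = 1.0 * 0.5 * (1 + 2) = 1.5, Ridge part = 1.0 * 0.5 * (1 + 4) = 2.5.
objective = elastic_net_objective(X_demo, y_demo, beta0=0.0, beta=beta_demo, alpha=1.0, rho=0.5)
print(objective)  # 4.0
```

Note that sklearn's `ElasticNet` minimizes a rescaled version of this objective: the squared-error term is divided by `2 * n_samples` and the L2 term carries an extra factor of 0.5, so the same `alpha` does not produce the same penalty strength as in the formula above.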

The balance between the Ridge and Lasso penalties in Elastic Net is controlled by the ρ parameter. This parameter, which lies between zero and one, sets the share of the total penalty that is assigned to Lasso; the remaining share, 1 − ρ, is assigned to Ridge.

The `ElasticNet` class in sklearn controls the penalty through two main parameters:

- `alpha`: This is the regularization strength, which dictates the amount of shrinkage. Higher values of alpha result in greater regularization which can help prevent overfitting.
- `l1_ratio`: This is equivalent to ρ and it determines the mix of Lasso vs Ridge penalties. An `l1_ratio` of 0 is equivalent to Ridge regression, while an `l1_ratio` of 1 is equivalent to Lasso regression.
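
The effect of `l1_ratio` can be checked on a small synthetic problem. The sketch below uses made-up data (`X_demo`, `y_demo`, `true_coef` are assumptions for illustration only) and compares `ElasticNet` with `l1_ratio=1.0` against sklearn's `Lasso` at the same `alpha`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data (hypothetical): 5 features, two of which have no effect on y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
true_coef = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=100)

# With l1_ratio=1.0 only the L1 penalty is active, which is the Lasso objective.
enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
same_as_lasso = np.allclose(enet_l1.coef_, lasso.coef_)
print(same_as_lasso)

# A mixed penalty for comparison; inspect how the coefficients change.
enet_mixed = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_demo, y_demo)
print(enet_l1.coef_)
print(enet_mixed.coef_)
```

Printing the two coefficient vectors side by side is a quick way to see how moving `l1_ratio` away from 1 changes the amount of shrinkage and the number of exact zeros.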

In [1]:
# Import pandas library which provides data structures and data analysis tools
import pandas as pd

# Read the dataset from the 'wine.csv' file using the read_csv function provided by pandas.
# The file is assumed to be in the same directory as this script ('./' denotes current directory).
# The result is stored in the DataFrame `df`.
df: pd.DataFrame = pd.read_csv('./wine.csv')

# Replace spaces in column names and convert all columns to lowercase:
# This is done by iterating over each column name in df.columns,
# replacing spaces with underscores using the replace function,
# and converting to lowercase using the lower function.
# The resulting list of modified column names is then assigned back to df.columns.
df.columns = [x.lower().replace(' ','_') for x in df.columns]

# Display the first 5 rows of the DataFrame using the head function.
# This is useful for quickly testing if your object has the right type of data in it.
df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,ph,sulphates,alcohol,quality,red_wine
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,1
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,1
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1


In [2]:
# We are defining our target or dependent variable that our model will predict.
# In machine learning, the target variable is the variable that the algorithm aims to predict or forecast.
# In our case, we are predicting the 'quality' of the wine based on other features in our dataset.
# We store the column name of our target variable as a string in the variable 'target'.
target: str = 'quality'

In [3]:
# Import necessary module
from typing import List

# We first want to select all columns that are not our target column.
# We create a list comprehension that iterates over all column names in our DataFrame,
# and only keeps the column name if it is not equal to our target.
# The resulting list 'nc' (non-class columns) contains all column names that are not our target.
nc: List[str] = [x for x in df.columns if x != target]

# Next, we want to standardize our data (z-score scaling).
# Here, we subtract the mean and divide by the standard deviation for each non-target column in our DataFrame.
# This scales all predictor variables to have a mean of 0 and a standard deviation of 1.
# This is important for regularized models such as Elastic Net: the penalty acts on the size of the coefficients,
# so predictors measured on very different scales would otherwise be penalized unevenly.
# By using the mean() and std() functions, we are able to easily calculate the mean and standard deviation for each column.
# Note that these statistics are computed on the full dataset, before the train/test split below.
# Note: We could have also used sklearn's StandardScaler for this step.
df[nc] = (df[nc] - df[nc].mean()) / df[nc].std()
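
The `StandardScaler` alternative mentioned in the cell above could look roughly like the sketch below. It runs on a small made-up DataFrame (`demo` is hypothetical, not the wine data). One detail worth knowing: pandas' `std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`), so the two approaches differ by a constant factor per column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small made-up DataFrame standing in for the predictor columns.
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 50.0]})

# Manual z-score as in the cell above (pandas std() uses ddof=1 by default).
manual = (demo - demo.mean()) / demo.std()

# StandardScaler returns a NumPy array scaled with the population std (ddof=0).
scaled = StandardScaler().fit_transform(demo)

# The two agree once the manual version also uses ddof=0.
manual_ddof0 = (demo - demo.mean()) / demo.std(ddof=0)
print(np.allclose(scaled, manual_ddof0.values))
```

A scaler object also makes it straightforward to fit on the training rows only and then apply the same transformation to the test rows, which the manual one-liner does not do.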

In [4]:
# Define `X` as the values in the non-class columns of the DataFrame
# `.values` is used to return the numpy representation of the DataFrame.
X = df[nc].values

# Define `y` as the values in the target column of the DataFrame
y = df[target].values

In [5]:
# Import the PolynomialFeatures class from sklearn.preprocessing
# PolynomialFeatures is a class in the sklearn.preprocessing package that generates polynomial and interaction features.
from sklearn.preprocessing import PolynomialFeatures

# Initialise a PolynomialFeatures object
# The parameters for PolynomialFeatures are:
# - degree: The degree of the polynomial features. Default = 2.
# - interaction_only: If true, only interaction features are produced: features that are products of at most degree distinct input features (so not x[1]**2, x[0]*x[2]**3, etc.).
# - include_bias: If True (default), then include a bias column, the feature in which all polynomial powers are zero (i.e. a column of ones - acts as an intercept term in a linear model).
pf = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)

# Fit the PolynomialFeatures object to the `X` data
# For PolynomialFeatures, fit only inspects the number of input features to work out which output columns to generate; no statistics are learned from the values.
pf = pf.fit(X)

# Transform the `X` data to its polynomial features representation.
# This method applies the fitted transformation to the `X` data, returning the transformed data.
Xoverfit = pf.transform(X)
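
To see what this transformation produces, here is a minimal sketch on a single made-up row with three features (`X_demo` and `pf_demo` are names chosen for this example). With `interaction_only=True` and `degree=2`, the output keeps the original features and appends one product per pair of distinct features, so `p` inputs become `p + p(p-1)/2` columns.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up sample with three features: x0=1, x1=2, x2=3.
X_demo = np.array([[1.0, 2.0, 3.0]])

pf_demo = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_demo_poly = pf_demo.fit_transform(X_demo)

# Columns: x0, x1, x2, x0*x1, x0*x2, x1*x2 (no squared terms, no bias column).
print(X_demo_poly)                  # [[1. 2. 3. 2. 3. 6.]]
print(pf_demo.n_output_features_)   # 6
```

Assuming the 13 non-target columns visible in the `df.head()` output are all the predictors, the same settings would give 13 + 78 = 91 columns in `Xoverfit`.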

In [6]:
# Import train_test_split function from sklearn.model_selection
# train_test_split is a function in Sklearn model selection for splitting data arrays into two subsets: for training data and for testing data.
# With this function, you don't need to divide the dataset manually.
from sklearn.model_selection import train_test_split

# Now we use the train_test_split function to split our dataset into training and testing sets.
# X is the array of predictor variables, and y is the array of response variables.
# The test_size parameter specifies the proportion of the dataset to include in the test split (30% in this case).
# The random_state parameter is a seed for the random number generator; using a fixed number ensures that the output will be the same each time this code is run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)

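# We perform the same operation on the overfit data, Xoverfit.
# Because we reuse the same test_size and random_state on an array with the same number of rows, the rows are shuffled identically,
# so Xo_train and Xo_test line up with y_train and y_test from the split above.
# That is why we do not need a corresponding 'yo_train' or 'yo_test': the target values remain the same; only the predictor variables have been transformed.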
Xo_train, Xo_test = train_test_split(Xoverfit, test_size=0.3, random_state = 42)
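
The alignment between the two separate splits relies on the shared `random_state`. A possible alternative, sketched here on made-up arrays (`X_demo`, `Xo_demo`, `y_demo` are hypothetical), is to pass all arrays to a single `train_test_split` call, which applies one set of shuffled indices to every array.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up arrays with the same number of rows; y_demo doubles as the row index.
X_demo = np.arange(20).reshape(10, 2)
Xo_demo = np.arange(30).reshape(10, 3)
y_demo = np.arange(10)

# One call splits every array with the same shuffled indices.
X_tr, X_te, Xo_tr, Xo_te, y_tr, y_te = train_test_split(
    X_demo, Xo_demo, y_demo, test_size=0.3, random_state=42
)

# A separate call with the same seed and row count selects the same rows.
Xo_tr_sep, Xo_te_sep = train_test_split(Xo_demo, test_size=0.3, random_state=42)
print(np.array_equal(Xo_tr, Xo_tr_sep), np.array_equal(Xo_te, Xo_te_sep))
```

Both approaches give the same rows here; the single call simply removes the need to keep the seeds in sync by hand.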

In [8]:
# Import ElasticNet from sklearn.linear_model
# ElasticNet is a linear regression model trained with both L1- and L2-norm regularization of the coefficients.
from sklearn.linear_model import ElasticNet

# Define an ElasticNet regression model with specified `alpha` and `l1_ratio` parameters.
# `alpha` is the constant that multiplies the penalty terms and `l1_ratio` is the ElasticNet mixing parameter with `0 <= l1_ratio <= 1`.
# `l1_ratio` of 0 corresponds to L2 penalty, 1 to L1.
enet_reg = ElasticNet(alpha = 1, l1_ratio = 0.05)

# Fit the ElasticNet model on the training data.
enet_reg.fit(X_train, y_train)

# Use the trained model to predict the target variable on the test data.
y_pred = enet_reg.predict(X_test)

In [10]:
import numpy as np

# Import mean_squared_error function from sklearn.metrics
# mean_squared_error is a risk metric corresponding to the expected value of the squared (quadratic) error or loss.
from sklearn.metrics import mean_squared_error as mse

# Calculate the root mean squared error (RMSE) which is the square root of the mean squared error.
rmse = np.sqrt(mse(y_test, y_pred))

# Print the calculated RMSE.
print(rmse)

0.7685849129418407
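
For reference, the RMSE is the square root of the average squared prediction error, and it is expressed in the same units as the target (here, quality points). The sketch below checks the formula by hand against sklearn on a few made-up values (`y_true_demo` and `y_pred_demo` are hypothetical).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted quality scores.
y_true_demo = np.array([5.0, 6.0, 5.0, 7.0])
y_pred_demo = np.array([5.5, 5.5, 5.0, 6.0])

# Manual RMSE: errors are -0.5, 0.5, 0.0, 1.0, so the mean squared error is 0.375.
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

# The same quantity via sklearn's mean_squared_error.
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print(rmse_manual, rmse_sklearn)
```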


In [11]:
# Define and fit a new ElasticNet model on the overfit training data.
enet_reg = ElasticNet(alpha = 1, l1_ratio = 0.05)
enet_reg.fit(Xo_train, y_train)

# Predict the target variable on the overfit test data.
yo_pred = enet_reg.predict(Xo_test)

# Calculate the RMSE for the overfit data.
rmse = np.sqrt(mse(y_test, yo_pred))

# Print the calculated RMSE for the overfit data.
print(rmse)

0.7604368453549241


In [12]:
# Import r2_score from sklearn.metrics
# r2_score computes the coefficient of determination, a measure of how well observed outcomes are replicated by the model.
from sklearn.metrics import r2_score

# Calculate the R-squared score for both the original and overfit data.
enet_r2 = r2_score(y_test, y_pred)
overfit_r2 = r2_score(y_test, yo_pred)

# Calculate the number of samples in the test set.
n = len(y_test)

# Calculate the number of features in the original and overfit training sets.
p = X_train.shape[1]
po = Xo_train.shape[1]

# Calculate the adjusted R-squared for both the original and overfit data.
# The adjusted R-squared compensates for the addition of variables and only increases if the new variable improves the model more than would be expected by chance.
enet_adj_r2 = 1 - (1 - enet_r2) * ((n - 1) / (n - p - 1))
overfit_adj_r2 = 1 - (1 - overfit_r2) * ((n - 1) / (n - po - 1))

# Print the adjusted R-squared for both the original and overfit data.
print(f'Adjusted R-squared for Elastic Net Regression: {enet_adj_r2}')
print(f'Adjusted R-squared for Elastic Net Regression Overfit: {overfit_adj_r2}')

Adjusted R-squared for Elastic Net Regression: 0.185375795883868
Adjusted R-squared for Elastic Net Regression Overfit: 0.1744265161404044


The Adjusted R-squared value is a statistical measure of the goodness-of-fit of a regression model. Unlike the R-squared value, the Adjusted R-squared takes into account the number of predictors in the model: through the factor (n - 1) / (n - p - 1) it penalizes every additional predictor. It therefore increases only if a new variable improves the model more than would be expected by chance, which makes it better suited than R-squared for comparing models with different numbers of predictors.
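
The adjustment can be wrapped in a small helper to make the penalty visible. This is a sketch only; the function name `adjusted_r2` and the numbers passed to it are made up for illustration.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("n_samples must be larger than n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof

# Same raw R-squared, different numbers of predictors (hypothetical values).
few_predictors = adjusted_r2(0.5, n_samples=11, n_features=2)   # 1 - 0.5 * 10 / 8
many_predictors = adjusted_r2(0.5, n_samples=11, n_features=5)  # 1 - 0.5 * 10 / 5
print(few_predictors, many_predictors)  # 0.375 0.0
```

With an identical raw R-squared, the model with more predictors receives the lower adjusted score.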

In the given output, we have two Adjusted R-squared values:

1. For the Elastic Net Regression model, the Adjusted R-squared is approximately 0.185. After adjusting for the number of predictors, the model explains roughly 18.5% of the variability in the dependent variable (wine quality) on the test set.
2. For the Overfit Elastic Net Regression model (with the extra interaction features generated using PolynomialFeatures), the Adjusted R-squared is approximately 0.174. Its test RMSE is in fact slightly lower than that of the original model (0.760 versus 0.769), so the lower adjusted score comes from the penalty for the larger number of predictors rather than from worse predictions. In other words, the additional predictors improve the fit too little to justify their number.

Both values are relatively low, suggesting that there is a lot of variability in the wine quality that these models are not capturing. This may suggest the need for further refinement of the models, potentially through the inclusion of more relevant predictors, better feature engineering, or the use of a different modeling approach.
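
One possible refinement, not used in this notebook, is to choose `alpha` and `l1_ratio` by cross-validation instead of fixing them by hand. The sketch below uses sklearn's `ElasticNetCV` on synthetic data from `make_regression`; the data, the candidate `l1_ratios`, and the variable names are assumptions for illustration, and no claim is made about what it would select on the wine data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Synthetic regression problem (hypothetical stand-in for real data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=4, noise=10.0, random_state=0
)

# Candidate mixing values; ElasticNetCV builds its own grid of alphas for each one.
l1_ratios = [0.05, 0.5, 0.95]
enet_cv = ElasticNetCV(l1_ratio=l1_ratios, cv=5, random_state=0)
enet_cv.fit(X_demo, y_demo)

# The selected hyperparameters after 5-fold cross-validation.
print(enet_cv.alpha_, enet_cv.l1_ratio_)
```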