# 07.03 - Regularisation (Lasso Regression)

## Defining and Visualizing the Lasso Regression

In this lab, we look at Lasso Regression, which takes a different approach to regularisation from the Ridge Regression covered previously. The main difference lies in how the coefficients change as the penalty grows.

Lasso Regression, much like Ridge Regression, is a type of linear regression that uses shrinkage. Shrinkage here means that the estimated coefficients are pulled towards a central value, which for both methods is zero. The key difference in Lasso Regression is the way it penalizes the coefficients of the predictors.

In Lasso Regression, the loss function is altered by adding the sum of the absolute values of the coefficients (the L1 norm), scaled by a penalty strength $\alpha$, to the Residual Sum of Squares (RSS). The objective is:

### $$ \text{minimize:}\; RSS + \text{Lasso penalty} = \sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p\beta_j x_{ij}\right)\right)^2 + \alpha\sum_{j=1}^p |\beta_j|$$

Here, $|\beta_j|$ is the absolute value of the coefficient $\beta_j$ for the variable $x_j$, and $x_{ij}$ is the value of that variable for observation $i$.
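
As a minimal sketch, the objective above can be evaluated directly with NumPy. The function name `lasso_objective` and the toy arrays are illustrative assumptions, not part of the lab. Note that scikit-learn's `Lasso` minimizes a rescaled version in which the squared-error term is divided by $2n$, so an `alpha` passed to scikit-learn is not directly comparable with the $\alpha$ in this formula.

```python
import numpy as np

def lasso_objective(X, y, beta0, beta, alpha):
    """RSS plus alpha times the sum of absolute coefficients (intercept unpenalized)."""
    residuals = y - (beta0 + X @ beta)
    rss = np.sum(residuals ** 2)
    penalty = alpha * np.sum(np.abs(beta))
    return rss + penalty

# Toy data (hypothetical values, chosen only to make the arithmetic easy to follow)
X_toy = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y_toy = np.array([1.0, 2.0, 3.0])
beta_toy = np.array([0.5, -0.25])

# Residuals are [1.0, 1.5, 2.0], so RSS = 7.25; penalty = 2.0 * 0.75 = 1.5
value = lasso_objective(X_toy, y_toy, beta0=0.0, beta=beta_toy, alpha=2.0)
print(value)  # 8.75
```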

Meanwhile, $\alpha$ is a parameter we need to choose that sets the strength of the penalty term. As with Ridge Regression, increasing $\alpha$ shrinks the coefficients more. Unlike Ridge Regression, which shrinks coefficients towards zero without generally making them exactly zero, Lasso can set some coefficients exactly to zero once $\alpha$ is large enough. This acts as a form of feature selection: the model drops less important features by shrinking their coefficients to zero.
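
The effect of $\alpha$ on sparsity can be illustrated with a small sketch. The data below are synthetic (an assumption made so the example is self-contained; it does not use the wine dataset), with only the first two of five features actually influencing the target.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only features 0 and 1 carry signal (hypothetical setup)
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 5))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=200)

# Fit Lasso for increasing penalty strengths and count surviving coefficients
alphas = [0.01, 0.1, 1.0, 5.0]
nonzero_counts = []
for a in alphas:
    model = Lasso(alpha=a).fit(X_syn, y_syn)
    nonzero_counts.append(int(np.count_nonzero(model.coef_)))
    print(f"alpha={a}: coefficients={np.round(model.coef_, 3)}")

print(nonzero_counts)
```

With a small penalty the informative coefficients stay close to their true values; with a large enough penalty every coefficient is driven to zero and the model predicts only the intercept.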

In [1]:
# Import the pandas library as pd
import pandas as pd

# Read the dataset from the file 'wine.csv' into a pandas DataFrame
# The file path './wine.csv' indicates that the file is in the same directory as the current script
# The DataFrame is stored in the variable df
df = pd.read_csv('./wine.csv')

# The column names of the DataFrame are updated
# For each column name, the spaces are replaced with underscores and the text is converted to lowercase
# This is done using a list comprehension, where x is the name of each column in the DataFrame
# The replace() function replaces spaces with underscores, and the lower() function converts the text to lowercase
df.columns = [x.lower().replace(' ','_') for x in df.columns]

# The head() function is called on the DataFrame to display the first 5 rows
# This is often done for a quick overview of the data after it is loaded
df.head()

|   | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | ph | sulphates | alcohol | quality | red_wine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 1 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 1 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 1 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 1 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 1 |


In [2]:
# Now, we will identify the target variable, which is the variable we aim to predict.
# In machine learning, the target variable is also often referred to as the dependent variable.
# In this case, our target variable is 'quality'.
# We store the name of the target variable in the string variable `target`.
# We can use this `target` variable later in our code to refer to the column in the DataFrame that contains the target values.
target: str = 'quality'

In [3]:
from typing import List

# First, we want to select all the columns that are not the target variable.
# To do this, we create a list comprehension that iterates over the column names in the wine DataFrame.
# If the column name is not equal to the target variable, it is added to the list.
# The list of non-target column names is stored in the variable `nc`.
nc: List[str] = [x for x in df.columns if x != target]

# We then want to standardize our predictor variables (often loosely called normalizing).
# Standardizing puts values measured on different scales onto a common scale.
# This matters for Lasso because the penalty treats all coefficients alike, so predictors need comparable scales.
# Here, each predictor is rescaled to have a mean of 0 and a standard deviation of 1,
# by subtracting its mean and dividing by its standard deviation.
# We apply this to all columns in `nc`. Note: We could also use sklearn's StandardScaler for this task.
df[nc] = (df[nc] - df[nc].mean()) / df[nc].std()
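
The comment above mentions `StandardScaler` as an alternative. The sketch below, on a small hypothetical DataFrame, shows one detail worth knowing: pandas' `.std()` uses the sample standard deviation (`ddof=1`) by default, while `StandardScaler` divides by the population standard deviation (`ddof=0`), so the two results differ slightly unless `ddof=0` is passed to pandas.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small hypothetical frame standing in for the predictor columns
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 50.0]})

# Manual standardization as in the cell above (pandas default: ddof=1)
manual = (toy - toy.mean()) / toy.std()

# StandardScaler uses the population standard deviation (ddof=0)
scaled = StandardScaler().fit_transform(toy)

# Matching StandardScaler requires ddof=0 on the pandas side
manual_ddof0 = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual_ddof0.values))  # same result
print(np.allclose(scaled, manual.values))        # differs by a constant factor
```

For a dataset with thousands of rows the difference is negligible. A scaler object also makes it straightforward to fit on the training rows only and reuse the same parameters on the test rows, which is a common way to avoid leaking test-set statistics into training.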

In [4]:
# We define our predictor matrix X as the values of all non-target columns in the DataFrame
# The .values attribute of a DataFrame returns a Numpy array of the DataFrame values
X = df[nc].values

# Similarly, we define our target vector y as the values of the target column in the DataFrame
y = df[target].values

In [5]:
# Import the PolynomialFeatures class from sklearn.preprocessing
# PolynomialFeatures is used to generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree
from sklearn.preprocessing import PolynomialFeatures

# We initialize a PolynomialFeatures object with degree=2
# The degree parameter determines the maximum degree of the polynomial transformations
# interaction_only=True means that only interaction features are produced: products of at most `degree` distinct input features (so no squared terms such as x1^2)
# include_bias=False means that a bias column (a column of ones) is not added
pf = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)

# We fit the PolynomialFeatures object to our predictor matrix X
# This calculates the number of output features, which is necessary before we can use the transform method
pf = pf.fit(X)

# We transform our predictor matrix X using the fitted PolynomialFeatures object
# This applies the polynomial feature transformations to our data, creating a new overfit predictor matrix Xoverfit
Xoverfit = pf.transform(X)
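
To make the transformation concrete, here is a minimal sketch on a tiny hypothetical array with three features. With `degree=2` and `interaction_only=True`, the output keeps the original columns and appends one column per pair of distinct features, giving $n + \binom{n}{2}$ columns for $n$ inputs.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two rows, three features (hypothetical values)
X_small = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0]])

pf_demo = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = pf_demo.fit_transform(X_small)

# 3 original columns + 3 pairwise products = 6 columns
print(X_inter.shape)

# Column order: x0, x1, x2, x0*x1, x0*x2, x1*x2
print(X_inter[0])
```

The number of columns grows quadratically with the number of inputs, which is why the expanded matrix in this lab has many more predictors than the original one.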

In [6]:
# Import the train_test_split function from sklearn.model_selection
# This function is used to split the data into training and test sets
from sklearn.model_selection import train_test_split

# The train_test_split function is called with four arguments:
# - X: the predictors matrix
# - y: the target vector
# - test_size: the proportion of the dataset to include in the test split
# - random_state: a seed used by the random number generator to ensure the same random split each time the script is run
# The function returns four outputs:
# - X_train: the predictors matrix for the training set
# - X_test: the predictors matrix for the test set
# - y_train: the target vector for the training set
# - y_test: the target vector for the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)

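# The train_test_split function is called again to split the overfit predictors matrix into training and test sets
# Because Xoverfit has the same number of rows as X and we reuse test_size and random_state, the rows line up with y_train and y_test
# The outputs are stored in Xo_train and Xo_test, which are the overfit predictors matrices for the training and test sets, respectively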
Xo_train, Xo_test = train_test_split(Xoverfit, test_size=0.3, random_state = 42)
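
The second split relies on the same seed producing the same row selection. A sketch of an alternative, using small hypothetical arrays, is to pass every array to a single `train_test_split` call, which guarantees that all outputs share the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for X, Xoverfit and y (10 rows each)
X_a = np.arange(20).reshape(10, 2)
X_b = X_a * 10
y_demo = np.arange(10)

# One call splits all three arrays with the same row selection
Xa_tr, Xa_te, Xb_tr, Xb_te, y_tr, y_te = train_test_split(
    X_a, X_b, y_demo, test_size=0.3, random_state=42
)

# A separate call with the same seed and test_size selects the same rows
Xb_tr2, Xb_te2 = train_test_split(X_b, test_size=0.3, random_state=42)

print(np.array_equal(Xb_te, Xb_te2))
print(np.array_equal(Xa_te * 10, Xb_te))
```

Both approaches give aligned splits here; the single call simply removes the dependence on remembering to repeat the seed.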

In [7]:
# Import the Lasso class from sklearn.linear_model
# Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of features upon which the given solution is dependent.
from sklearn.linear_model import Lasso

alpha = 0.15

# Initialize a Lasso object
lasso_reg = Lasso(alpha = alpha)

# Fit the Lasso model to the training data
# The .fit() method estimates the intercept and coefficients (β₀, β₁, …, βₚ) from the training inputs and targets (X_train and y_train)
lasso_reg.fit(X_train, y_train)

# Predict the response for the test dataset
# We use the .predict() method with X_test as the argument
y_pred = lasso_reg.predict(X_test)

In [8]:
import numpy as np

# Import the mean_squared_error function from sklearn.metrics
# Mean Squared Error (MSE) is the average squared difference between the predicted and actual values
from sklearn.metrics import mean_squared_error as mse

# Calculate the root mean squared error (RMSE) which is the square root of MSE
rmse = np.sqrt(mse(y_test, y_pred))

# Print the RMSE
print(rmse)

0.7616484497594653
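
For reference, RMSE is simple enough to compute by hand. The sketch below uses a few hypothetical values (not taken from the wine data) and checks that the manual calculation agrees with the scikit-learn route used above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted quality scores
y_true_demo = np.array([5.0, 6.0, 7.0, 5.0])
y_pred_demo = np.array([5.5, 5.5, 6.0, 5.0])

# Errors are [-0.5, 0.5, 1.0, 0.0]; mean squared error = 1.5 / 4 = 0.375
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print(rmse_manual, rmse_sklearn)
```

Because RMSE is in the same units as the target, a value around 0.76 on this lab's data can be read as a typical error of roughly three quarters of a quality point.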


In [9]:
# Initialize a Lasso object
lasso_reg = Lasso(alpha = alpha)

# Fit the Lasso model to the overfit training data
lasso_reg.fit(Xo_train, y_train)

# Predict the response for the overfit test dataset
yo_pred = lasso_reg.predict(Xo_test)

# Calculate the root mean squared error (RMSE) for the overfit data
rmse = np.sqrt(mse(y_test, yo_pred))

# Print the RMSE
print(rmse)

0.7616484497594653
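
This RMSE matches the one from the original predictors. That would be consistent with the Lasso penalty having set every interaction coefficient to zero, although the printed outputs alone do not confirm it. One way to check is to inspect the fitted `coef_` array. The sketch below does this on synthetic data (an assumption made so the example runs without `wine.csv`); on the lab data the same check would be applied to the model fitted on `Xo_train`.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: the target depends linearly on feature 0 only (hypothetical setup)
rng = np.random.default_rng(1)
X_base = rng.normal(size=(500, 4))
y_lin = 2.0 * X_base[:, 0] + rng.normal(scale=0.5, size=500)

# Expand with pairwise interactions, as in the lab
X_wide = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False
).fit_transform(X_base)

fit_base = Lasso(alpha=0.15).fit(X_base, y_lin)
fit_wide = Lasso(alpha=0.15).fit(X_wide, y_lin)

# The first columns of X_wide are the original features; the rest are interactions
n_base = X_base.shape[1]
interaction_coefs = fit_wide.coef_[n_base:]
n_active_interactions = int(np.count_nonzero(interaction_coefs))

# If no interaction survives, both models should make the same predictions
same_predictions = np.allclose(
    fit_base.predict(X_base), fit_wide.predict(X_wide), atol=1e-6
)
print(n_active_interactions, same_predictions)
```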


In [10]:
# Import the r2_score function from sklearn.metrics
# R-squared is a statistical measure that represents the goodness of fit of a regression model
from sklearn.metrics import r2_score

# Calculate the R-squared score for the Lasso Regression model
lasso_r2 = r2_score(y_test, y_pred)

# Calculate the R-squared score for the Lasso Regression model on the overfit data
overfit_r2 = r2_score(y_test, yo_pred)

# Calculate the number of samples (n) in the test set
n = len(y_test)

# Calculate the number of predictors (p) in the training set
p = X_train.shape[1]

# Calculate the number of predictors (po) in the overfit training set
po = Xo_train.shape[1]

# Calculate the adjusted R-squared for the Lasso Regression model
# Adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model
lasso_adj_r2 = 1 - (1 - lasso_r2) * ((n - 1) / (n - p - 1))

# Calculate the adjusted R-squared for the Lasso Regression model on the overfit data
overfit_adj_r2 = 1 - (1 - overfit_r2) * ((n - 1) / (n - po - 1))

# Print the adjusted R-squared for the Lasso Regression model
print(f'Adjusted R-squared for Lasso Regression: {lasso_adj_r2}')

# Print the adjusted R-squared for the Lasso Regression model on the overfit data
print(f'Adjusted R-squared for Lasso Regression Overfit: {overfit_adj_r2}')

Adjusted R-squared for Lasso Regression: 0.20001337822754262
Adjusted R-squared for Lasso Regression Overfit: 0.17179364704796918


The adjusted R-squared for the Lasso Regression on the original predictors is approximately 0.20, while for the overfit (interaction-expanded) predictors it is approximately 0.17. R-squared is the proportion of the variance in the dependent variable that the model explains, measured here on the test set; adjusted R-squared is that proportion penalized for the number of predictors.

The R-squared statistic measures how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model. However, it has one drawback: for a least-squares fit evaluated on its own training data, it either stays the same or increases whenever more variables are added, even if those variables are only weakly associated with the response.

This is where adjusted R-squared comes in. The adjusted R-squared compensates for the addition of variables and only increases if the new variable improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance.
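
The adjustment can be sketched as a small helper. The function name `adjusted_r2` and the numbers below are hypothetical, chosen only to show how the penalty grows with the number of predictors when R-squared itself is held fixed.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n samples and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("adjusted R-squared needs n > p + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical values: same R-squared, different predictor counts
r2_demo = 0.20
n_demo = 1000

few = adjusted_r2(r2_demo, n_demo, p=10)    # 1 - 0.8 * 999 / 989
many = adjusted_r2(r2_demo, n_demo, p=100)  # 1 - 0.8 * 999 / 899

print(round(few, 4), round(many, 4))
```

With R-squared unchanged, the model with more predictors always receives the lower adjusted score, so a drop in adjusted R-squared does not by itself show that the predictions got worse.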

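In this case, both models produced the same test RMSE (about 0.762), so their unadjusted R-squared values are identical. The lower adjusted R-squared for the overfit data therefore comes entirely from the penalty for its larger predictor count (`po` versus `p`): the additional interaction features add complexity to the model without improving its performance on the test set.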