# Chapter 24
# Taming Model Behavior with Regularization   



## Introduction  

In modern computational statistics and machine learning, we are often faced with a large number of independent variables, or features, which can be collinear or simply uninformative. Given the scale, we need algorithms to reduce the number of independent variables to just the informative ones. Methods like stepwise selection are known to fail at scale as a result of the multiple hypothesis testing problem. The solution to this problem leads us to regularization and sparse models.

As a general rule, an overfit model has learned the training data too well. The overfitting likely involved learning noise present in the training data. This noise can arise from uninformative or weakly informative variables or features being used in the model. A related problem arises from using collinear independent variables or features. The collinearity confounds model training, amplifying random variation in the model fitting. Regardless of the source, the random noise causes the fitted model to exhibit high levels of random variation, or high variance. When a new data case is presented to such a model it may produce unexpected results, since the random noise will be different.    

To prevent overfitting, we seek to find **sparse models**. A sparse model uses just the most informative variables required to produce accurate outputs. The term sparse is used since uninformative variables are excluded from the model. Use of sparse models follows the principle of Occam's razor. In simple terms, Occam's razor is a scientific principle that the simplest of competing theories is preferred. A sparse model then is the simplest model that explains the behavior of the data.         

So, what is one to do to prevent overfitting of machine learning models? The most widely used set of tools for preventing overfitting are known as **regularization methods**. Regularization methods take a number of forms, but all have the same goal, to prevent overfitting of machine learning models. The sparse models resulting from the application of regularization methods will generalize better and be useful in production. 


## The Bias-Variance Trade-off

Regularization is not free, however. While regularization reduces the **variance** in the model results, it introduces **bias**. An overfit model exhibits low bias, but its variance is high. The high variance leads to unpredictable results when the model is exposed to new data cases. On the other hand, the stronger the regularization of a model the lower the variance, but the greater the bias. This all means that when applying regularization you will need to contend with the **bias-variance trade-off**. 

To better understand the bias-variance trade-off consider the following examples of extreme model cases:

- If the prediction for all cases is just the mean (or median), variance is minimized. The estimate for all cases is the same, so the variance of the estimates is close to zero. However, there is likely considerable bias in these estimates. 
- On the other hand, consider what happens when the data are fit with a kNN model with k=1. The training data will fit this model perfectly, since each training case is its own nearest neighbor. The bias will be low. However, the model will have considerable variance when applied to test data. 

In either case, these extreme models will not generalize well and will exhibit large errors on any independent test data. Any practical model must come to terms with the trade-off between bias and variance to make accurate predictions. 

To better understand this trade-off you can consider the example of the mean square error, which can be decomposed into its components. The mean square error can be written as:

$$\Delta y^2 = E \Big[ \big(Y - \hat{f}(X) \big)^2 \Big] = \frac{1}{N} \sum_{i=1}^N \big(y_i - \hat{f}(x_i) \big)^2 $$

Where,
$Y = $ the label vector.  
$X = $ the feature matrix.   
$\hat{f}(x) = $ the trained model.   

Expanding the representation of the mean square error:

$$\Delta y^2 = \big( E[ \hat{f}(X)] - f(X) \big)^2 + E \big[ ( \hat{f}(X) - E[ \hat{f}(X)])^2 \big] + \sigma^2\\
\Delta y^2 = Bias^2 + Variance + Irreducible\ Error$$

Where $f(X)$ is the true, but unknown, function that generates the labels, so that $Y = f(X) + \epsilon$ with noise variance $\sigma^2$.

The foregoing looks a bit intimidating. How can we interpret this relationship?     

- The first term, $\Big( E[ \hat{f}(X)] - f(X) \Big)^2$, is the squared difference between the expected model output and the true function, or the squared **bias** of the model. For an **unbiased model**, $E \big[ \hat{f}(X) \big] - f(X) = 0$. For example: an OLS model with $residuals \sim N(0, \sigma^2)$ is unbiased.        
- The second term, $E \Big[ \big( \hat{f}(X) - E[ \hat{f}(X)] \big)^2 \Big]$, is the expected squared difference between the model output and the expected model output, or the **variance** of the model. For a low variance model, $E \Big[ \big( \hat{f}(X) - E[ \hat{f}(X)] \big)^2 \Big] \rightarrow 0$. A low variance model generalizes since variance is low for each prediction, $\hat{f}(X)$.     
- Finally, $\sigma^2$ is the inherent or irreducible error in the data. We cannot do anything to improve this random variation.        
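
The bias and variance terms can also be estimated numerically. The following is a minimal sketch on simulated data, not part of the auto price example. The true function `f`, the noise level `sigma`, the sample sizes and the helper names `one_nn_predict` and `bias2_and_variance` are all assumptions made for illustration. The code refits a constant mean predictor and a 1-nearest-neighbor predictor on many simulated training sets and measures how the predictions behave on a fixed set of test points.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                      # assumed noise standard deviation
n_train, n_trials = 30, 500      # assumed training size and number of refits
x_test = np.linspace(0.05, 0.95, 50)

def f(x):
    """Assumed true function generating the labels."""
    return np.sin(2 * np.pi * x)

def one_nn_predict(x_train, y_train, x_new):
    """Predict with the label of the single nearest training point (kNN, k=1)."""
    idx = np.abs(x_new[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]

preds_mean = np.zeros((n_trials, x_test.size))
preds_1nn = np.zeros_like(preds_mean)
for t in range(n_trials):
    # Draw a fresh training set for each trial
    x_train = rng.uniform(0, 1, n_train)
    y_train = f(x_train) + rng.normal(0, sigma, n_train)
    preds_mean[t] = y_train.mean()                       # constant mean model
    preds_1nn[t] = one_nn_predict(x_train, y_train, x_test)

def bias2_and_variance(preds):
    """Average squared bias and variance of the predictions over the test points."""
    bias2 = np.mean((preds.mean(axis=0) - f(x_test))**2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

bias2_mean, var_mean = bias2_and_variance(preds_mean)
bias2_1nn, var_1nn = bias2_and_variance(preds_1nn)
print('Mean model: bias^2 = {0:6.3f}  variance = {1:6.3f}'.format(bias2_mean, var_mean))
print('1-NN model: bias^2 = {0:6.3f}  variance = {1:6.3f}'.format(bias2_1nn, var_1nn))
```

Under these assumptions the constant model should show the larger squared bias and the 1-nearest-neighbor model the larger variance.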

The relationship between bias and variance is illustrated in the figure below.

<img src="../images/BiasVariance.png" alt="Drawing" style="width:600px; height:400px"/>
<center> Trade-off between bias and variance for machine learning model </center>      
    

Study this relationship. Notice that as regularization reduces variance, bias increases. The irreducible error will remain unchanged. Regularization parameters are chosen to minimize $\Delta y^2$. In many cases, this will prove challenging. 

## Load and Split the Dataset

With the above bit of theory in mind, it is time to try an example. In this example you will compute and compare linear regression models using different levels and types of regularization. 

Execute the code in the cell below to load the packages required for the rest of this notebook.

In [None]:
import pandas as pd
import numpy as np
import numpy.random as nr
import statsmodels.api as sm
import statsmodels.formula.api as smf  
import scipy.stats as ss
import patsy 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import normalize, StandardScaler
import sklearn.metrics as sklm
from patsy import dmatrices
from sklearn import metrics
from math import sqrt

%matplotlib inline
sns.set(style='ticks', palette='Set2')

The code below does the following to load and prepare the data set:    
1. Load the data set.   
2. Scale the numeric columns except the columns used as labels for the examples. 
3. Split the 195 cases into 100 training cases and 95 test cases.   

In [None]:
## Load the data frame   
auto_data = pd.read_csv('../data/AutoPricesClean.csv')

## Remove unwanted columns and scale numeric columns 
auto_data.drop(auto_data.columns[:3], axis=1, inplace=True)
numeric_columns = [col for col_type,col in zip(auto_data.iloc[:,:-3].dtypes,auto_data.iloc[:,:-3].columns) if col_type in ['int64','float64']]
auto_data.loc[:,numeric_columns] = StandardScaler().fit_transform(auto_data.loc[:,numeric_columns])

## Create a mask and use it to split the data into a train and test set   
nr.seed(6665)
mask = nr.choice(auto_data.index, size = 100, replace=False)
auto_data_train = auto_data.iloc[mask,:].reset_index()
auto_data_test = auto_data.drop(mask, axis=0).reset_index()

In [None]:
## Check the sizes of the training and test subsets without re-sampling,
## since drawing a new training sample here would overlap the test set
print(auto_data_train.shape)
print(auto_data_test.shape)

## A first linear regression model

First you will create a model of city MPG using both categorical and real-valued features or independent variables and no regularization. In the terminology used before this model has high variance and low bias. In other words, this model is over-fit and provides a baseline for comparison with regularized models. 

### Dependency structure 

To get a feel for the relationship between the independent variables used in the model, we will compute and display a correlation matrix. A few notes about the code:    
1. We compute correlation using the design matrix. This approach is required, since the categorical variables must be one-hot encoded.    
2. The intercept term is not included in the design matrix. The intercept term is represented by a column of all 1s, which has 0 variance and therefore has undefined correlation coefficients.      

Execute the code and examine the results.

In [None]:
## Define model formula with no intercept - column of all 1s with 0 variance 
formula = 'city_mpg ~ -1 + C(fuel_type) + C(aspiration) + C(drive_wheels) + horsepower + compression_ratio + curb_weight + engine_size'

## Compute correlation matrix 
y, design_matrix = patsy.dmatrices(formula, data=auto_data)
corr_mat = np.corrcoef(np.transpose(design_matrix))

## Display correlation matrix  
c_names = list(design_matrix.design_info.column_name_indexes)
fig,ax = plt.subplots(figsize=(10,8))
sns.heatmap(corr_mat, xticklabels=c_names, yticklabels=c_names, ax=ax);

There are several important observations we can make about the relationship between these variables.    
1. There are several strong positive correlations. Not surprisingly, horsepower and engine size are highly correlated. Diesel fuel type is highly correlated with high compression, owing to the nature of diesel engines. And as a further example, curb weight is correlated with engine size.     
2. There are also variable pairs with strong negative correlation. Some of these are simply a result of coding, such as gas fuel type and compression ratio, or front and rear drive wheels. These relationships are expected from the one-hot encoding of the design matrix. Another example is front wheel drive and engine size, since cars of this era with large engines tended to have rear wheel drive.    

In summary, we can say that these variables are far from independent of one another, because of the high correlation. This collinearity makes the coefficient estimates unstable, so any model fit with these variables will likely be quite over-fit. As we progress with applying regularization methods it will help to keep these relationships in mind.

### An initial linear model

The code in the cell below should be familiar. In summary, it performs the following processing:
1. Define and train the linear regression model using the training features and labels.
2. Display the summary of the model. 

Execute this code and examine the results for the linear regression model. 

In [None]:
formula = 'city_mpg ~ C(fuel_type) + C(aspiration) + C(drive_wheels) + horsepower + compression_ratio + curb_weight + engine_size'
base_model = smf.ols(formula, data=auto_data_train).fit()
base_model.summary()

Based on the adjusted $R^2$ and F-statistic, this model seems to do a reasonable job of explaining the variance of the label. However, it is clear from the confidence intervals and p-values of the coefficients that this model is over-fit.   

Next, execute the code below to display fit metrics, the distribution of residuals, and the residuals plotted against the predicted values.  

In [None]:
def compute_metrics(y_true, y_predicted):
    ## Compute the usual metrics
    mse = sklm.mean_squared_error(y_true, y_predicted)
    rmse = sqrt(mse)
    ## Note: MAE here is the median absolute error, not the mean absolute error
    mae = sklm.median_absolute_error(y_true, y_predicted)
    return mse, rmse, mae

def print_metrics(df_test, model, label_col='city_mpg'):   
    df_test['predicted'] = model.predict(df_test)
    mse, rmse, mae = compute_metrics(df_test.loc[:,label_col],df_test.loc[:,'predicted'])   
    print('MSE  = {0:6.3f}'.format(mse))
    print('RMSE = {0:6.3f}'.format(rmse))
    print('MAE  = {0:6.3f}'.format(mae))    


def residual_plot(df):
    plt.rc('font', size=12)
    fig, ax = plt.subplots(figsize=(8, 3), ) 
    RMSE = np.std(df.resids)
    sns.scatterplot(x='predicted', y='resids', data=df, ax=ax);
    plt.axhline(0.0, color='red', linewidth=1.0);
    plt.axhline(2.0*RMSE, color='red', linestyle='dashed', linewidth=1.0);
    plt.axhline(-2.0*RMSE, color='red', linestyle='dashed', linewidth=1.0);
    plt.title('Residuals vs. predicted');
    plt.xlabel('Predicted values');
    plt.ylabel('Residuals');
    plt.show()

def plot_resid_dist(df):
    resids = df.loc[:,'resids']
    plt.rc('font', size=12)
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 3));
    ## Plot a histogram
    sns.histplot(x=resids, bins=20, kde=True, ax=ax[0]);
    ax[0].set_title('Histogram of residuals');
    ax[0].set_xlabel('Residual values');
    ## Plot the Q-Q Normal plot
    ss.probplot(resids, plot = ax[1]);
    ax[1].set_title('Q-Q Normal plot of residuals');
    plt.show();    

def compute_residuals(df, model, label_col='city_mpg'):
    df['predicted'] = model.predict(df)
    df['resids'] = np.subtract(df.loc[:,'predicted'], df.loc[:,label_col]) 
    return df
    
print_metrics(auto_data_test, base_model)
auto_data_train = compute_residuals(auto_data_train, base_model)
plot_resid_dist(auto_data_train)
residual_plot(auto_data_train)

Overall the residuals look well-behaved. The distribution of the residuals is a bit skewed toward the negative, but not too seriously. Further, the residuals are approximately homoskedastic. 

### Testing the Box-Cox transform   

We might be able to improve these results using the **Box-Cox transform** of the label column `city_mpg`. The goal of the Box-Cox transform is to transform the distribution of the label values (dependent variable) to be closer to Normal. The residuals of the resulting model should also be closer to Normally distributed, a key assumption of the least-squares method.       

The Box-Cox transform is a **power law transform**, related to the **logarithmic transform**. The Box-Cox transform finds a power, $\lambda$, that creates a transform of the values to be as close to Normal as possible. The Box-Cox algorithm is defined by the following relations:   
$$
x^{(\lambda)}_i = 
\begin{cases}
      \frac{x^{\lambda}_i - 1}{\lambda},\ if \lambda \ne 0 \\
      ln(x_i),\ if \lambda = 0
\end{cases}  
$$

Note that in this formulation $\lambda=0$ is the logarithm, since $\frac{x^{\lambda} - 1}{\lambda} \rightarrow ln(x)$ as $\lambda \rightarrow 0$. Values $0 \lt \lambda \lt 1$ correspond to roots of the values $x$. For example, for the square root we get $\lambda = 0.5$. For values $1 \lt \lambda \lt \infty$ the variable is raised to a power greater than 1. As another example, if the values are squared, then $\lambda = 2$. Negative values of $\lambda$ correspond to reciprocal powers, for example $\lambda = -1$ for the reciprocal. 
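
The behavior of the transform for a few values of $\lambda$ can be checked directly from the definition. This is a minimal sketch using only numpy. The helper `box_cox` and the sample values in `x` are hypothetical and serve only to illustrate the formula; this is not the scipy implementation.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform of positive values x for a given power lam (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x**lam - 1.0) / lam

x = np.array([0.5, 1.0, 2.0, 10.0])   # assumed positive sample values

# lam = 0.5 is a shifted and scaled square root
sqrt_case = box_cox(x, 0.5)
# lam = 2 is a shifted and scaled square
square_case = box_cox(x, 2.0)
# As lam approaches 0 the transform approaches the natural logarithm
near_zero = box_cox(x, 1e-6)

print(np.allclose(sqrt_case, 2.0 * (np.sqrt(x) - 1.0)))
print(np.allclose(square_case, (x**2 - 1.0) / 2.0))
print(np.allclose(near_zero, np.log(x), atol=1e-4))
```

In every case the value $x = 1$ maps to 0, which is why the transform is continuous in $\lambda$ at $\lambda = 0$.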

The code in the cell below applies the Box-Cox transform to the label column and prints the power, $\lambda$, computed. The code uses the [boxcox](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html) function from scipy.stats. The data sample is split again into training and test subsets. Execute this code. 

In [None]:
## Apply Box-Cox transform to the label and print the power used   
auto_data.loc[:,'city_mpg'], power = ss.boxcox(auto_data.loc[:,'city_mpg'])
print('The power of the Box-Cox transform = {0:6.3f}'.format(power))

## Split the data using the transformed label values
## Create a mask and use it to split the data into a train and test set   
nr.seed(654566)
mask = nr.choice(auto_data.index, size = 100, replace=False)
auto_data_train = auto_data.iloc[mask,:].reset_index()
auto_data_test = auto_data.drop(mask, axis=0).reset_index() 

To compute a new OLS model, execute the code below and examine the results.   

In [None]:
## Compute the model using the transformed label and display the summary
base_model = smf.ols(formula, data=auto_data_train).fit()

## Display diagnostics using the training data
print_metrics(auto_data_test, base_model)   
print(base_model.summary())
auto_data_train = compute_residuals(auto_data_train, base_model)
plot_resid_dist(auto_data_train)
residual_plot(auto_data_train)

The value of $\lambda$ is very close to 0, indicating the Box-Cox transform is close to a logarithm. As a result, the values of adjusted $R^2$, the F-statistic and the log-likelihood have all improved. The distribution of the residuals has indeed changed, but is still noticeably non-Normal in the tail. Given the improved least-squares fit, we will continue with the transformed label.   

This model is still significantly over-fit. We employ regularization methods to address this situation. We will use the metrics displayed as a basis of comparison with regularized models.  

> **Note:** No one regularization method works in all cases. In the running example you will see deviations from ideal behavior. Do not be surprised if not all methods improve the models. Further, do not generalize the behavior you observe in the exercises to other models. The effectiveness of a regularization method is problem and model dependent.  

## Apply l2 regularization

Now, you will apply **l2 regularization** to constrain the model parameter values. Constraining the model parameters prevents over-fitting of the model. This method is also known as **Ridge Regression**. 

But, how does this work? l2 regularization applies a **penalty** proportional to the **l2** or **Euclidean norm** of the model weights to the loss function. For linear regression using squared error as the metric, the total **loss function** is the sum of the squared error and the regularization term. The total loss function can then be written as follows:  

$$J(\beta) = ||A \beta - b||^2 + \alpha^2 ||\beta||^2$$

Where the penalty term on the model coefficients, $\beta_i$, is based on the Euclidean norm:

$$|| \beta|| =  \big(\beta_1^2 + \beta_2^2 + \ldots + \beta_n^2 \big)^{\frac{1}{2}} = \Big( \sum_{i=1}^n \beta_i^2 \Big)^{\frac{1}{2}}$$

We call $||\beta||$ the **l2 norm** of the coefficients, since we raise each coefficient to the power of 2, sum the squares and then raise the sum to the power of $\frac{1}{2}$. The penalty in the loss function is the square of this norm, $||\beta||^2$, scaled by $\alpha^2$. 

You can think of this penalty as constraining the l2 or Euclidean norm of the model weight vector. The value of $\alpha$ determines how much the norm of the coefficient vector constrains the solution. You can see a geometric interpretation of the l2 penalty constraint in the figure below.  

<img src="../images/L2.jpg" alt="Drawing" style="width:750px; height:400px"/>
<center>Geometric view of l2 regularization</center>

Notice that for a constant value of the l2 norm, the values of the model parameters $B_1$ and $B_2$ are related. The Euclidean or l2 norm of the coefficients is shown as the dotted circle. The constant value of the l2 norm is a constant value of the penalty. Along this circle the coefficients change in relation to each other to maintain a constant l2 norm. For example, if $B_1$ is maximized then $B_2 \sim 0$, or vice versa. It is important to note that l2 regularization is a **soft constraint**. Coefficients are driven close to, but likely not exactly to, zero.    
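
The soft constraint can be seen in a small numerical experiment. This is a minimal sketch on simulated data with two nearly collinear features. The data, the helper `ridge_coefs` and the values of `alphas` are assumptions made for illustration. The helper minimizes the loss $||A \beta - b||^2 + \alpha^2 ||\beta||^2$ in closed form rather than with the statsmodels method used later in the exercises.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly collinear with x1
A = np.column_stack([x1, x2])
b = 1.0 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=n)

def ridge_coefs(A, b, alpha):
    """Minimize ||A beta - b||^2 + alpha^2 ||beta||^2 in closed form (hypothetical helper)."""
    n_coefs = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha**2 * np.eye(n_coefs), A.T @ b)

alphas = [0.0, 1.0, 10.0, 100.0]
norms = []
for alpha in alphas:
    beta = ridge_coefs(A, b, alpha)
    norms.append(np.linalg.norm(beta))
    print('alpha = {0:6.1f}  beta = {1}  l2 norm = {2:6.3f}'.format(alpha, beta, norms[-1]))
```

The l2 norm of the coefficient vector decreases as $\alpha$ grows, yet the individual coefficients stay non-zero even for the largest value of $\alpha$.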

### Review of Eigenvalue Decomposition

**Eigenvalues** are characteristic roots or characteristic values of a linear system of equations. The **eigenvalue-eigenvector** decomposition is a factorization of a matrix. 

Let's start with a **square matrix**, $A$:

$$A = 
\begin{bmatrix}
   a_{11}  & a_{12} & \dots & a_{1n} \\
    a_{21}  & a_{22} & \dots & a_{2n} \\
    \vdots &\vdots &\vdots & \vdots \\
    a_{n1} & a_{n2} &  \dots & a_{nn}
\end{bmatrix}$$

Next define a vector, $x$: 

$$x = 
\begin{bmatrix}
   x_{1}\\
    x_{2}\\
    \vdots\\
    x_{n}
\end{bmatrix}$$

Then an **eigenvalue**, $\lambda$, of the matrix $A$, with corresponding **eigenvector** $x$, has the property: 

$$A x = \lambda x$$

Or,   

$$
\begin{bmatrix}
   a_{11}  & a_{12} & \dots & a_{1n} \\
    a_{21}  & a_{22} & \dots & a_{2n} \\
    \vdots &\vdots &\vdots & \vdots \\
    a_{n1} & a_{n2} &  \dots & a_{nn}
\end{bmatrix}  
\begin{bmatrix}
   x_{1}\\
    x_{2}\\
    \vdots\\
    x_{n}
\end{bmatrix} 
= 
\lambda 
\begin{bmatrix}
   x_{1}\\
    x_{2}\\
    \vdots\\
    x_{n}
\end{bmatrix}
$$



To see that the eigenvalue, $\lambda$, is a root of the matrix, $A$, we can rearrange the above as follows:   

\begin{align}
Ax - \lambda x &= 0 \\
(A - I \lambda) x &= 0
\end{align}

Where, $I$ is the **identity matrix** of 1 on the diagonal and 0 elsewhere. These relationships can be written as follows:  

$$
\begin{bmatrix}
   a_{11} - \lambda  & a_{12} & \dots & a_{1n} \\
    a_{21}  & a_{22} - \lambda  & \dots & a_{2n} \\
    \vdots &\vdots &\vdots & \vdots \\
    a_{n1} & a_{n2} &  \dots & a_{nn} - \lambda 
\end{bmatrix}  
\begin{bmatrix}
   x_{1}\\
    x_{2}\\
    \vdots\\
    x_{n}
\end{bmatrix} 
= 
0
$$


The foregoing shows that the eigenvalue, $\lambda$, is a root of the matrix, $A$.

For an $n \times n$ matrix, $A$, there are $n$ eigenvalues or roots. These can be found by solving the following equation, using the determinant:  

$$det(A - \lambda I) = 0$$

You can find more information on the determinant in this [article](https://en.wikipedia.org/wiki/Determinant).
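
These properties are easy to verify for a small matrix. This is a minimal sketch using [numpy.linalg.eig](https://numpy.org/doc/stable/reference/generated/numpy.linalg.eig.html). The $2 \times 2$ matrix is an arbitrary example chosen for illustration.

```python
import numpy as np

# An arbitrary symmetric 2 x 2 matrix used for illustration
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eig returns the eigenvalues and a matrix whose columns are the eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)

for lam, x in zip(eigenvalues, eigenvectors.T):
    # Each eigenpair satisfies A x = lambda x
    assert np.allclose(A @ x, lam * x)
    # Each eigenvalue is a root of det(A - lambda I) = 0
    assert np.isclose(np.linalg.det(A - lam * np.eye(2)), 0.0)

print(np.sort(eigenvalues))
```

For this matrix the characteristic equation is $(2 - \lambda)^2 - 1 = 0$, which has the roots $\lambda = 1$ and $\lambda = 3$.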

### Eigenvalues and the Normal Equations

Let's start by examining the **normal equation** formulation of the linear regression problem. The goal is to compute a vector of **model coefficients** or weights which minimize the mean squared residuals, given a vector of data $x$ and a **model matrix** $A$. We can write our model as:

$$x = A b$$

To solve this problem we would ideally like to compute:

$$b = A^{-1}x$$

The commonly used normal equation form can help with this problem:

$$b = (A^TA)^{-1}A^Tx$$

Now, $A^TA$ is a symmetric $m \times m$ covariance matrix, where $m$ is the number of model coefficients. This is a significant reduction in size when compared to $A$. 

Now, we can perform eigenvalue-eigenvector decomposition of $A^TA$:

$$A^TA = Q \Lambda Q^{-1}$$

Where,
$Q = $ unitary matrix of orthonormal **eigenvectors**, and
$\Lambda =$ diagonal matrix of **eigenvalues**. The eigenvalue matrix is diagonal:  

$$\Lambda = 
\begin{bmatrix}
    \lambda_1  & 0 & 0 & \dots & 0 \\
    0  & \lambda_2 & 0 & \dots & 0 \\
    \vdots &\vdots &\vdots & & \vdots \\
    0 & 0 & 0 & \dots & \lambda_m
\end{bmatrix}$$


Since Q is unitary (unit norm), the inverse of $A^TA$ is easily computed:

$$(A^TA)^{-1} = Q \Lambda^{-1} Q^{-1}$$

Where,
$$\Lambda^{-1} = 
\begin{bmatrix}
    \frac{1}{\lambda_1}  & 0 & 0 & \dots & 0 \\
    0  & \frac{1}{\lambda_2} & 0 & \dots & 0 \\
    \vdots &\vdots &\vdots & & \vdots \\
    0 & 0 & 0 & \dots & \frac{1}{\lambda_m}
\end{bmatrix}$$
and $\lambda_i$ is the ith eigenvalue. 

But, **$A^TA$ can still be rank deficient!** By rank deficient we mean that there are fewer non-zero eigenvalues of $A^TA$ than the dimension, $m$. Even if the ith eigenvalue is only close to zero, $\frac{1}{\lambda_i}$ becomes very large and destabilizes the inverse. 

The basic idea of $l_2$ regularization, Tikhonov regularization, or ridge regression is to stabilize the inverse eigenvalue matrix, $\Lambda^{-1}$, by **adding a small bias term**, $\alpha^2$, to each of the eigenvalues. We can write this operation in matrix notation as follows. We start with a modified form of the normal equations (also known as the **L2 or Euclidean norm** minimization problem):

$$min_b \big[ \parallel A b - x \parallel^2 + \alpha^2 \parallel b \parallel^2 \big]\\  or \\
b = (A^TA + \alpha^2 \cdot I)^{-1}A^Tx$$

In this way, the inverse values of small eigenvalues do not blow up when we compute the inverse. You can see this by writing out the $\Lambda^+$ matrix with the bias term.

$$\Lambda_{Tikhonov}^+  = \begin{bmatrix}
    \frac{1}{\lambda_1 + \alpha^2}  & 0 & 0 & \dots & 0 \\
    0  & \frac{1}{\lambda_2 + \alpha^2} & 0 & \dots & 0 \\
    \vdots &\vdots &\vdots & & \vdots \\
    0 & 0 & 0 & \dots & \frac{1}{\lambda_m + \alpha^2}
\end{bmatrix}$$

Adding this bias term ensures there are no zero eigenvalues, and that the inverse of $A^TA + \alpha^2 \cdot I$ exists. You can also see that adding the bias term adds a 'ridge' along the diagonal of the eigenvalue matrix, giving this method one of its names. 
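
The effect of the bias term on the eigenvalues can be checked numerically. This is a minimal sketch on a simulated model matrix with two almost identical columns. The data and the value of `alpha` are assumptions made for illustration. The code uses [numpy.linalg.eigvalsh](https://numpy.org/doc/stable/reference/generated/numpy.linalg.eigvalsh.html), which applies because $A^TA$ is symmetric.

```python
import numpy as np

rng = np.random.default_rng(7)
n_cases = 50
a1 = rng.normal(size=n_cases)
a2 = a1 + 1e-3 * rng.normal(size=n_cases)   # almost a copy of a1
A = np.column_stack([a1, a2])

alpha = 0.5                                  # assumed regularization parameter
AtA = A.T @ A

# Eigenvalues before and after adding alpha^2 to the diagonal
lam = np.linalg.eigvalsh(AtA)
lam_ridge = np.linalg.eigvalsh(AtA + alpha**2 * np.eye(2))

# Ratio of the largest to the smallest eigenvalue in each case
cond_before = lam.max() / lam.min()
cond_after = lam_ridge.max() / lam_ridge.min()

print('Eigenvalues of A^T A             = {}'.format(lam))
print('Eigenvalues of A^T A + alpha^2 I = {}'.format(lam_ridge))
print('Largest/smallest ratio before = {0:10.1f}  after = {1:10.1f}'.format(cond_before, cond_after))
```

Each eigenvalue is shifted up by $\alpha^2$, so the largest inverse eigenvalue can be no greater than $\frac{1}{\alpha^2}$, and the ratio of the largest to the smallest eigenvalue drops sharply.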

### The Bayesian Interpretation

Another way to view l2 regularization is using a Bayesian formulation of the problem. Let's start with the posterior distribution of the weight vector $W$ given the features $X$ and labels $Y$. Using Bayes rule we can write this posterior distribution as:

$$p( W\ |\ \{X,Y\} ) = \frac{p(W)\ p(\{X,Y\}\  |\  W\ )}{p( \{X,Y\})}$$

where,

$p(W) = $ the prior distribution of $W$.   
$p(\{X,Y\}\  |\  W\ ) = $ the likelihood of $\{X,Y\}$ given $W$.

We want the **maximum a posteriori** or **MAP** of the weights, $W$. Taking the log of both sides:

$$max_W log \big( p( W\ |\ \{X,Y\} ) \big) \propto max_W\ \Big[ log \big( p(W) \big)\  + log \big( p(\{X,Y\}\  |\  W\ ) \big) \Big]\\
= max_W \Big[ log\big(prior(W)\big) + log \big( likelihood(\{X,Y\}\  |\  W\ ) \big) \Big]$$

For a Gaussian likelihood (l2 loss) and a Gaussian prior (l2 regularization), maximizing the log posterior is equivalent, up to additive constants and a positive scale factor, to minimizing the penalized loss:

$$max_W log \big( p( W\ |\ \{X,Y\} ) \big) \Leftrightarrow min_W \Big[\frac{1}{n} \sum_{i=1}^n (f_W(x_i) - y_i)^2 + \lambda || W ||^2 \Big]$$  

where,   
$\frac{1}{n} \sum_{i=1}^n (f_W(x_i) - y_i)^2 = $ the negative log likelihood, up to constants.    
$\lambda || W ||^2 = $ the negative log prior, acting as a regularization penalty. 


So how are we to interpret this prior? It is useful to view the prior as constraining the norm of the weight vector close to zero. In other words, we think of the weights as **shrinking** toward zero. Thus, the shrinkage process prevents the weights from reaching extreme values. 
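
The link between the MAP estimate and ridge regression can be checked numerically. This is a minimal sketch on simulated data. It assumes a linear model $f_W(x) = x \cdot W$, Gaussian noise with standard deviation `sigma`, and a zero-mean Gaussian prior on the weights with standard deviation `tau`. All of these values are chosen for illustration. Under these assumptions the penalty weight on the sum-of-squares loss works out to $\sigma^2 / \tau^2$, and the gradient of the negative log posterior should vanish at the ridge solution.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
X = rng.normal(size=(n, p))
w_true = np.array([1.5, -2.0, 0.5])
sigma, tau = 1.0, 2.0            # assumed noise and prior standard deviations
Y = X @ w_true + rng.normal(scale=sigma, size=n)

def neg_log_posterior(w):
    """Negative log posterior, dropping terms that do not depend on w."""
    neg_log_likelihood = np.sum((Y - X @ w)**2) / (2 * sigma**2)
    neg_log_prior = np.sum(w**2) / (2 * tau**2)
    return neg_log_likelihood + neg_log_prior

# Ridge solution with penalty weight sigma^2 / tau^2 on the sum-of-squares loss
penalty = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + penalty * np.eye(p), X.T @ Y)

# Unregularized least squares solution for comparison
w_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Gradient of the negative log posterior evaluated at the ridge solution
gradient = -X.T @ (Y - X @ w_ridge) / sigma**2 + w_ridge / tau**2

print('Ridge weights = {}'.format(w_ridge))
print('OLS weights   = {}'.format(w_ols))
print('Largest gradient component at ridge solution = {0:.2e}'.format(np.max(np.abs(gradient))))
```

A tighter prior, that is a smaller `tau`, gives a larger penalty weight and pulls the weights more strongly toward zero.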

### An example    

How can we use the eigenvalue decomposition of the covariance matrix to better understand how stable a linear model is? One way to summarize stability is to compute the **condition number**, which is the ratio of the largest to the smallest eigenvalue:   

$$condition\ number = \frac{largest\ eigenvalue}{smallest\ eigenvalue}$$   

A linear model with a covariance matrix having a large condition number is unstable and the model coefficients will be poorly determined. In other words, when the condition number is large, we can expect high variance and poor generalization. As a rule of thumb, the condition number of a **well-posed** linear model should be less than about 100. Otherwise, we say the model is **ill-posed**.    

> **Exercise 24-1:**  You will now compute the covariance matrix for the model specified above and evaluate its eigenvalues.     
> 1. Construct the design (model) matrix using the training data subset. You can do this with the [patsy.dmatrix](https://patsy.readthedocs.io/en/latest/API-reference.html) function.      
> 2. Use [numpy.transpose](https://numpy.org/devdocs/reference/generated/numpy.transpose.html) and [numpy.matmul](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html) functions to compute the covariance matrix of the design matrix. Make sure you normalize by dividing by the dimension of the covariance matrix!     
> 3. Use the [numpy.linalg.eigvals](https://numpy.org/devdocs/reference/generated/numpy.linalg.eigvals.html) function to compute the eigenvalues. Save the real part of these complex numbers using [numpy.real](https://numpy.org/devdocs/reference/generated/numpy.real.html) and print the result.   
> 4. Compute and print the condition number.   

In [None]:
### Your code goes here





> What does this condition number tell you about the stability of the coefficient estimates for this model? Is this consistent with what you learned from the model summary?   

> **Answer:**  

### Example of ridge regression   

When L2 regularization is applied to regression the resulting algorithm is often referred to as **ridge regression**. For this algorithm, the regularization parameter, $\alpha$, must be selected. For a given problem, there are limited theory-based options to find the best value of $\alpha$. Therefore, one typically resorts to a **hyperparameter** search. 

There are several possible approaches to hyperparameter search. An error metric, such as MSE or MAE, is used to determine the optimal value.    

The simplest approach is to search a grid (or line) of regularly spaced hyperparameter values. At each grid point the model performance metric is computed. The best model is considered the one with the best metric. We will use this simpler approach here.       

An alternative is to randomly sample the space of hyperparameter values. This latter approach is often more efficient, particularly when there are several hyperparameters.     
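
The two search strategies can be compared side by side on a toy problem. This is a minimal sketch on simulated data with a closed-form ridge fit. The data, the search range and the helper `ridge_rmse` are assumptions made for illustration, and the sketch does not use the statsmodels method from the exercise that follows.

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 60, 5
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=1.0, size=n)
X_train, y_train, X_test, y_test = X[:40], y[:40], X[40:], y[40:]

def ridge_rmse(alpha):
    """Fit closed-form ridge on the training split and return the test RMSE (hypothetical helper)."""
    coefs = np.linalg.solve(X_train.T @ X_train + alpha**2 * np.eye(p), X_train.T @ y_train)
    return np.sqrt(np.mean((y_test - X_test @ coefs)**2))

# Grid search: regularly spaced values of the hyperparameter
grid = np.linspace(0.0, 5.0, 51)
grid_rmse = np.array([ridge_rmse(a) for a in grid])
best_grid_alpha = grid[grid_rmse.argmin()]

# Random search: the same range sampled uniformly at random
random_alphas = rng.uniform(0.0, 5.0, size=20)
random_rmse = np.array([ridge_rmse(a) for a in random_alphas])
best_random_alpha = random_alphas[random_rmse.argmin()]

print('Grid search:   best alpha = {0:5.2f}  test RMSE = {1:6.3f}'.format(best_grid_alpha, grid_rmse.min()))
print('Random search: best alpha = {0:5.2f}  test RMSE = {1:6.3f}'.format(best_random_alpha, random_rmse.min()))
```

With a single hyperparameter both strategies cover the range well. In practice the metric should be computed on a validation subset or by cross validation, so that the test data remain untouched for the final evaluation.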

> **Exercise 24-2:** You will now perform a search to find a best regularization hyperparameter by the following steps:      
> 1. Complete the code in the `regularized_coefs` function to compute a regularized OLS model using the value of alpha and L1_wt using [statsmodels.regression.linear_model.OLS.fit_regularized](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.fit_regularized.html). For each value of alpha searched, this model will be computed and evaluated.      
> 2. In the space provided below, provide code to create an array of alpha values from 0.0 to 0.005 in steps of 0.00005.  
> 3. Using the `regularized_coefs` function, define code for the search over the values of $\alpha$.   
> 4. Execute the code and examine the results.  

In [None]:
def regularized_coefs(df_train, df_test, alphas, L1_wt=0.0, n_coefs=8,
                      formula = formula, label='city_mpg'):
    '''Function that computes a linear model for each value of the regularization 
    parameter alpha and returns an array of the coefficient values. The L1_wt 
    determines the trade-off between L1 and L2 regularization'''
    coefs = np.zeros((len(alphas),n_coefs + 1))
    MSE_train = []
    MSE_test = []
    for i,alpha in enumerate(alphas):
        ## First compute the training MSE
        #### Complete the line of code below

        
        ## Save the coefficient values   
        coefs[i,:] = temp_mod.params
        ## Compute and save the training RMSE
        MSE_train.append(sqrt(np.mean(np.square(df_train[label] - temp_mod.predict(df_train)))))
        ## Then compute the test RMSE
        MSE_test.append(sqrt(np.mean(np.square(df_test[label] - temp_mod.predict(df_test)))))
        
    return coefs, MSE_train, MSE_test


def plot_coefs(coefs, alphas, MSE_train, MSE_test, ylim=None, \
               title='RMSE vs. regularization parameter',\
               ylab='Root mean squared error', location='lower right'):
    fig, ax = plt.subplots(1,2, figsize=(12, 5)) # define axis
    for i in range(coefs.shape[1]): # Iterate over coefficients
        ax[0].plot(alphas, coefs[:,i])
    ax[0].axhline(0.0, color='red', linestyle='--', linewidth=0.5)
    ax[0].set_ylabel('Partial slope values')
    ax[0].set_xlabel('alpha')
    ax[0].set_title('Partial slopes vs. regularization parameter')
    if ylim is not None: ax[0].set_ylim(ylim)
    
    ax[1].plot(alphas, MSE_train, label='Training error')
    ax[1].plot(alphas, MSE_test, label='Test error')
    ax[1].set_ylabel(ylab)
    ax[1].set_xlabel('alpha')
    ax[1].set_title(title)
    plt.legend(loc=location)
    plt.show()

np.random.seed(12856)
### Your code goes here


plot_coefs(Betas, alphas, MSE_train, MSE_test) 

> Examine these plots and answer these questions:     
> 1. Notice how the training error increases in value with increasing regularization hyperparameter. This is expected, since as the coefficient values of the model are forced toward 0, the training bias increases. Notice however, the behavior of the MSE for the test data. To single digit precision, approximately at which value is the test MSE minimized?         
> 2. The parameters with the largest magnitude are intercept and gas fuel. Describe how these parameters change with $\alpha$.

> **Answers:**       
> 1.                
> 2.          

> **Exercise 24-3:** Now you will evaluate the L2 regularized model using a value of $\alpha = 0.001$. This value of $\alpha$ provides only mild regularization and will provide a basis for comparison with subsequent models.       
> 1. Compute a regularized OLS model using the training data and the [statsmodels.regression.linear_model.OLS.fit_regularized](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.fit_regularized.html) method.     
> 2. Print the coefficient comparison between the base model and the regularized model.   
> 3. Compute and print the MSE, RMSE and MAE for the model, using the test data.   
> 4. Compute the residuals, using the test data, and display distribution plots and the plot of residuals vs. predicted values.  

In [None]:
def print_coefficient_comparision(mod, compare_mod):
    df = pd.DataFrame(compare_mod.params)  
    df = pd.concat([df, pd.Series(mod.params, index=df.index)], axis=1)
    df.columns = ['Compare model','Model']                
    print(df)
    comp_mag = np.linalg.norm(df.loc[:,'Compare model'])
    mag = np.linalg.norm(df.loc[:,'Model'])
    df.drop('Intercept', axis=0, inplace=True)
    comp_mag_nointercept = np.linalg.norm(df.loc[:,'Compare model'])
    mag_nointercept = np.linalg.norm(df.loc[:,'Model'])

    print('\nMagnitude of base model = {0:4.2f}  Without intercept = {1:4.2f}'.format(comp_mag, comp_mag_nointercept))
    print('Magnitude of new model = {0:4.2f}  Without intercept = {1:4.2f}'.format(mag, mag_nointercept))

### Your code goes here


## Display results   
print_coefficient_comparision(L2_model, base_model)
print('\n')
print_metrics(auto_data_test, L2_model, label_col='city_mpg')
auto_data_test = compute_residuals(auto_data_test, L2_model)
plot_resid_dist(auto_data_test)
residual_plot(auto_data_test)

> Examine these results and answer these questions:   
> 1. Examine the comparison of the model coefficients. What does the change in the norm of the parameter vector tell you about the regularization?        
> 2. Compare the RMSE and MAE of the regularized model to the same metrics for the unregularized model. In terms of which of these metrics is the regularized model better and worse? Keeping in mind that the model is a constrained least squares fit, do the results make sense?   
> 3. How does the distribution of the residuals compare to that of the unregularized model in terms of changes of skewness and the outlier? Do the residuals still appear approximately homoskedastic? 
> **End of exercise.**  

> **Answers:**     
> 1.            
> 2.           
> 3.              

> **Exercise 24-4:** You will now compare the condition number of the regularized covariance matrix to that of the unregularized covariance matrix you computed earlier.     
> 1. Add a diagonal matrix, with the square root of the estimated optimal value of $\alpha$ along the diagonal and 0 elsewhere, to the covariance matrix you computed in Exercise 24-2. Use [numpy.diag]() to instantiate the diagonal matrix.    
> 2. Compute and display the eigenvalues of the regularized covariance matrix.  
> 3. Compute and display the condition number of the regularized covariance matrix. 

In [None]:
### Your code goes here





> Compare the condition numbers you have computed for the regularized and unregularized covariance matrices. Has regularization made a significant difference? Does the regularized model still appear to have an undesirably large condition number?  

> **Answer:**           
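
The effect of adding a diagonal term on the eigenvalues and the condition number can be illustrated with a small sketch. This is not a solution for the running example. The data here are synthetic, and the names `penalty`, `cov` and `cov_reg` are assumptions made only for illustration. Adding a constant to the diagonal shifts every eigenvalue by that constant, which helps most when the smallest eigenvalue is close to zero.

```python
import numpy as np

# Synthetic, nearly colinear features (illustrative data, not the auto dataset)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 1e-3 * rng.normal(size=200)   # almost an exact copy of x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

cov = X.T @ X / X.shape[0]              # unregularized covariance-like matrix
penalty = 0.01                          # hypothetical value added along the diagonal
cov_reg = cov + np.diag(np.full(cov.shape[0], penalty))

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order
eig = np.linalg.eigvalsh(cov)
eig_reg = np.linalg.eigvalsh(cov_reg)

# Condition number as the ratio of the largest to the smallest eigenvalue
cond = eig[-1] / eig[0]
cond_reg = eig_reg[-1] / eig_reg[0]

print('Eigenvalues, unregularized:', eig)
print('Eigenvalues, regularized:  ', eig_reg)
print('Condition numbers: {0:.1f} vs. {1:.1f}'.format(cond, cond_reg))
```

In this sketch the regularized condition number is much smaller, since the near-zero eigenvalue created by the colinear pair is lifted by the diagonal term.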

## Apply l1 regularization

Regularization can be performed using norms other than l2. The **l1 regularization** or **Lasso** method limits the sum of the absolute values of the model coefficients. The l1 norm is sometimes known as the **Manhattan norm**, since distances are measured as if you were traveling on a rectangular grid of streets. This is in contrast to the l2 norm that measures distance 'as the crow flies'. 

We can compute the l1 norm of the model coefficients as follows:

$$||\beta||^1 = \big( |\beta_1| + |\beta_2| + \ldots + |\beta_n| \big) = \Big( \sum_{i=1}^n |\beta_i| \Big)^1$$

where $|\beta_i|$ is the absolute value of $\beta_i$. 

Notice that to compute the l1 norm, we raise the sum of the absolute values to the first power.
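
As a quick numerical check of the two norms, the following sketch computes both for a small coefficient vector. The vector `beta` is a hypothetical example, not taken from the running model.

```python
import numpy as np

beta = np.array([3.0, -4.0, 0.0])      # hypothetical coefficient vector

l1_norm = np.sum(np.abs(beta))         # |3| + |-4| + |0| = 7
l2_norm = np.sqrt(np.sum(beta**2))     # sqrt(9 + 16 + 0) = 5

print('l1 norm = {0:.1f}, l2 norm = {1:.1f}'.format(l1_norm, l2_norm))
```

The same values can be obtained with `np.linalg.norm(beta, 1)` and `np.linalg.norm(beta, 2)`.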

As with l2 regularization, l1 regularization adds a penalty term to the loss function. The penalty term is the l1 norm of the model coefficients multiplied by a penalty multiplier. The penalty multiplier, $\alpha$, determines how much the norm of the coefficient vector constrains the values of the weights. The complete loss function is the sum of the squared errors plus the penalty term, which becomes: 

$$J(\beta) = ||A \beta - b||^2 + \alpha^2 ||\beta||^1$$
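
The loss function above can be written directly in a few lines of NumPy. This is a minimal sketch; the function name `l1_loss` and the small matrices are assumptions used only for illustration.

```python
import numpy as np

def l1_loss(A, b, beta, alpha):
    '''Sum of squared errors plus the l1 penalty, alpha^2 * ||beta||^1.'''
    residuals = A @ beta - b
    return np.sum(residuals**2) + alpha**2 * np.sum(np.abs(beta))

# Hypothetical design matrix and labels, chosen so that beta fits b exactly
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, 2.0])

print(l1_loss(A, b, beta, alpha=0.0))  # no penalty and a perfect fit: 0.0
print(l1_loss(A, b, beta, alpha=2.0))  # penalty only: 2^2 * (1 + 2) = 12.0
```

Even with a perfect fit, the loss grows with $\alpha$, so the minimizer is pulled toward coefficient vectors with a smaller l1 norm.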

You can see a geometric interpretation of the l1 norm penalty in the figure below.  

<img src="../images/L1.jpg" alt="Drawing" style="width:700px; height:400px"/>
<center> Geometric view of L1 regularization </center>

The l1 norm is constrained by the sum of the absolute values of the coefficients. This fact means that the value of one parameter highly constrains the values of the others. The dotted line in the figure above looks as though someone has pulled a rope or lasso around pegs on the axes. This behavior leads to the name lasso for l1 regularization.  

Notice in the figure above that if $B_1 = 0$ then $B_2$ has a value at the limit, or vice versa. In other words, using an l1 norm constraint forces some weight values to zero to allow other coefficients to take non-zero values. Thus, you can think of the l1 norm constraint as **knocking out** some weights from the model altogether. In contrast to l2 regularization, l1 regularization does drive some coefficients to exactly zero.
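
The difference between driving coefficients exactly to zero and merely shrinking them can be seen in a simplified one-coefficient setting. The sketch below assumes the simplified objectives $\frac{1}{2}(z - \beta)^2 + \lambda |\beta|$ and $\frac{1}{2}(z - \beta)^2 + \frac{\lambda}{2} \beta^2$, where $z$ is the unregularized estimate. The l1 solution is the soft-thresholding operator, while the l2 solution is a proportional shrinkage. The names `soft_threshold`, `ridge_shrink`, `z` and `lam` are hypothetical, and the parameterization is not the same as the loss function above.

```python
import numpy as np

def soft_threshold(z, lam):
    '''Minimizer of 0.5*(z - b)**2 + lam*|b|: shrink by lam, then clip at zero.'''
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    '''Minimizer of 0.5*(z - b)**2 + 0.5*lam*b**2: proportional shrinkage.'''
    return z / (1.0 + lam)

z = np.array([2.0, 0.5, -0.2, -1.5])   # hypothetical unregularized estimates
lam = 0.6                              # hypothetical penalty strength

b_l1 = soft_threshold(z, lam)
b_l2 = ridge_shrink(z, lam)

print('Nonzero coefficients with l1 penalty:', np.count_nonzero(b_l1))
print('Nonzero coefficients with l2 penalty:', np.count_nonzero(b_l2))
```

Estimates smaller in magnitude than `lam` are set to zero by the l1 penalty, whereas the l2 penalty only scales every estimate toward zero.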

> **Exercise 24-5:** Continuing with the running example you will now apply L1 regularization to the model.    
> 1. Create an array with values of $\alpha$ from 0.0 to 0.05 in steps of 0.005.   
> 2. Using the hyperparameter search function you created for exercise 24-2, compute the model performance metrics for each value of $\alpha$ and with `L1_wt` set to 1.0; all weight on L1 regularization. 
> 3. Plot the results using the `plot_coefs` function. It will help your understanding to set `ylim=[-0.3,0.3]` for the `plot_coefs` function. 
> 4. Execute your code.  

In [None]:
### Your code goes here




> Examine these plots and answer these questions.    
> 1. Examine the change in parameter values with increasing $\alpha$. What evidence do you see that the L1 regularization is working as expected?    
> 2. Notice how the training error increases with increasing regularization hyperparameter. This is expected, since as the coefficient values of the model are forced toward 0, the training bias increases. Notice the behavior of the MSE for the test data. Where is the minimum? What does this tell you about the bias-variance trade-off?              

> **Answers:**      
> 1.        
> 2.       

> **Exercise 24-6:** Now you will evaluate the L1 regularized model using the optimal value of $\alpha$, where there is a minimum in the test error curve.      
> 1. Compute a regularized OLS model using the training data and your estimate of the optimal value of $\alpha = 0.015$ and `L1_wt=1.0`.     
> 2. Compute and print the MSE, RMSE and MAE for the model, using the test data.   
> 3. Compute the residuals, using the test data, and display distribution plots and the plot of residuals vs. predicted values.   
> 4. Print the model coefficients. These coefficients are the `params` attribute of the model object.   

In [None]:
### Your code goes here





> Examine these results and answer these questions.   
> 1. Examine the comparison of the model coefficients. What does the change in the norm of the parameter vector tell you about the regularization?    
> 2. Examine the model coefficients, noticing that some are 0.0 as expected with L1 regularization. What does this tell you about the usefulness of some of the model features?     
> 3. Compare the RMSE and MAE of the regularized model to the same metrics for the unregularized model. In terms of which of these metrics is the regularized model better and worse and is this outcome expected?    
> 4. How does the distribution of the residuals compare to those of the unregularized models in terms of changes of skewness, kurtosis and the outlier?

> **Answers:**    
> 1.          
> 2.           
> 3.               
> 4.       

## Elastic Net Regularization    

We have now examined a bit of theory and examples of L2 and L1 regularization. We can compare the characteristics of these methods as follows:  

- L2 regularization works well for **colinear features**    
   - Down-weights colinear features   
   - But soft constraint so poor model selection 

- L1 regularization provides **good model selection** by hard constraint    
   - But poor selection for colinear features     

But, we do not always have to choose between soft constraint of L2 and hard constraint of L1. The **elastic net regularization** combines the behavior of both methods. The loss function for elastic net is expressed:         

$$\min_{\beta} \Big[ ||A \beta - b||^2 +\ \lambda\ \alpha^2 ||\beta||^1 +\ (1- \lambda)\ \alpha^2 ||\beta||^2 \Big]$$        

This model has two hyperparameters:    
- $\lambda$ weights L1 vs. L2 regularization.      
- $\alpha$ sets strength of regularization.   

Tuning this model requires a 2-dimensional hyperparameter search. This search can be done on a grid or by random sampling, as was discussed previously.           
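
A 2-dimensional grid search can be sketched as follows. To keep the example self-contained, it uses a small coordinate descent solver written in NumPy rather than a library routine, together with synthetic data. The function `elastic_net_cd`, its objective (scaled by the number of cases, with the L2 term halved) and all data values are assumptions made for illustration, so the numerical values of `alpha` are not directly comparable to those in the loss function above.

```python
import numpy as np
from itertools import product

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net_cd(X, y, alpha, l1_wt, n_iter=200):
    '''Coordinate descent sketch minimizing
       0.5*mean((y - X b)^2) + alpha*(l1_wt*|b|_1 + 0.5*(1 - l1_wt)*|b|_2^2).'''
    n, p = X.shape
    b = np.zeros(p)
    col_ss = np.sum(X**2, axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with the contribution of coefficient j removed
            partial_resid = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ partial_resid / n
            b[j] = soft_threshold(rho, alpha * l1_wt) / (col_ss[j] + alpha * (1.0 - l1_wt))
    return b

# Synthetic data: only the first two features are informative
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 6))
true_beta = np.array([1.5, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_beta + rng.normal(scale=0.5, size=120)
X_train, y_train, X_test, y_test = X[:80], y[:80], X[80:], y[80:]

# Grid over regularization strength and L1 weight
alphas = [0.0, 0.01, 0.05, 0.1]
l1_wts = [0.0, 0.5, 1.0]
results = {}
for alpha, l1_wt in product(alphas, l1_wts):
    b = elastic_net_cd(X_train, y_train, alpha, l1_wt)
    results[(alpha, l1_wt)] = np.mean((y_test - X_test @ b)**2)

best = min(results, key=results.get)
print('Best (alpha, L1 weight):', best, ' test MSE = {0:.3f}'.format(results[best]))
```

The cost of a grid search grows with the product of the number of values tried for each hyperparameter, which is one reason random sampling is often preferred when more hyperparameters are involved.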

> **Exercise 24-7:**  Continuing with the running example you will now apply elastic net regularization to the model.    
> 1. Create an array with values of $\alpha$ from 0.0 to 0.03 in steps of 0.0005.   
> 2. Using the hyperparameter search function you created for exercise 24-2 compute the model performance metrics for each value of $\alpha$ and with `L1_wt` set to 0.5. 
> 3. Plot the results using the `plot_coefs` function. It will help your understanding to set `ylim=[-0.5,1.0]` for the `plot_coefs` function.    
> 4. Execute your code.  

**Note:** In this case, we weight L2 and L1 regularization equally in order to simplify the hyperparameter search.
 Performance could possibly be improved if a 2-hyperparameter search was performed.       

In [None]:
### Your code goes here




> Once again, the curve of training error does not have a well defined minimum, except at $\alpha = 0$. 
> 1. How are the parameter values changing as the regularization parameter increases? In particular, are coefficients driven to 0?    
> 2. Is there any well defined minimum for the test error?     
> 3. Based on this behavior, do you expect elastic net regularization to improve model generalization?      

> **Answers:**     
> 1.          
> 2.            
> 3.        

> **Exercise 24-8:** Now you will evaluate the elastic net regularized model using an arbitrary value of $\alpha$.      
> 1. Compute a regularized OLS model using the training data with $\alpha = 0.015$ and `L1_wt=0.5`, putting equal weight on L2 and L1.     
> 2. Compute and print the MSE, RMSE and MAE for the model, using the test data.   
> 3. Compute the residuals, using the test data, and display distribution plots and the plot of residuals vs. predicted values.   
> 4. Print the model coefficients. These coefficients are the `params` attribute of the model object.    

In [None]:
### Your code goes here







> Examine these results and answer these questions.   
> 1. Compare the RMSE and MAE of the regularized model to the same metrics for the unregularized model. In terms of which of these metrics is the regularized model better and worse and is this behavior expected? 
> 2. Do the residuals still appear approximately Normal and homoscedastic?    
> 3. Examine the model coefficients. Are any of the coefficients 0? Is this behavior expected from the soft constraint of L2 regularization or the hard constraint of L1 regularization?        

> **Answers:**    
> 1.            
> 2.              
> 3.            

## Extended Example: GLM with ElasticNet     

Let's try an end-to-end example. In previous chapters we worked with an HR dataset with the objective of classifying employees who are likely to leave a company. We will now apply ElasticNet regularization to a generalized linear model (GLM) for this problem.       

We will start by loading and preparing the dataset. Execute the code in the cell below.  

In [None]:
hr_data = pd.read_csv('../data/HR_comma_sep.csv')
## Standardize the numeric features; assigning the whole column lets integer columns hold float values
for col in ['satisfaction_level','average_montly_hours','last_evaluation', 'number_project', 'time_spend_company']:
    hr_data[col] = (hr_data.loc[:,col] - np.mean(hr_data.loc[:,col]))/np.std(hr_data.loc[:,col])
## Create a mask and use it to split the data into a train and test set  
frac = 0.6
nr.seed(665)
mask = nr.choice(hr_data.index, size = int(frac * hr_data.shape[0]), replace=False)
hr_train = hr_data.iloc[mask,:]
hr_test = hr_data.drop(mask, axis=0)         
hr_train.columns

There are quite a few columns in this dataset. We use most of these columns as independent variables for our GLM model. The code in the cell below constructs the model and prints the summary. Execute this code. 

In [None]:
formula = 'left ~ satisfaction_level + average_montly_hours + last_evaluation + C(salary) + C(promotion_last_5years) +\
          C(Work_accident) + number_project + time_spend_company + C(sales)'
hr_glm = smf.glm(formula=formula, data=hr_train, family=sm.families.Binomial()).fit()
hr_glm.summary()

Examine the model summary. The deviance and Pearson $\chi^2$ indicate the model makes significant predictions. However, notice that the model is significantly overfit, with nearly all coefficients not being significant.      

To evaluate this model execute the code in the cell below.  

In [None]:
def display_metrics(hr_test, threshold):
    print('\n\nPrediction of leaving:')
    print(hr_test.loc[:5,'predicted'])
    
    print('\nConfusion Matrix')
    Confusion_Matrix = metrics.confusion_matrix(hr_test.loc[:,'left'], hr_test.loc[:,'predicted'])
    accuracy = metrics.accuracy_score(hr_test.loc[:,'left'], hr_test.loc[:,'predicted'])
    precision = metrics.precision_score(hr_test.loc[:,'left'], hr_test.loc[:,'predicted'])
    recall = metrics.recall_score(hr_test.loc[:,'left'], hr_test.loc[:,'predicted'])
    Confusion_Matrix = pd.DataFrame(Confusion_Matrix, index=['True Stay', 'True Leave'], columns = ['Predicted Stay', 'Predicted Leaving'])
    print(Confusion_Matrix)
    print(f"\nAccuracy = {round(accuracy, 3)}")
    print(f"Precision = {round(precision, 3)}")
    print(f"Recall = {round(recall, 3)}")

threshold = 0.4
hr_test.loc[:,'predicted_prob'] = hr_glm.predict(hr_test)
hr_test.loc[:,'predicted'] = np.where(hr_test.loc[:,'predicted_prob'] > threshold, 1, 0)
display_metrics(hr_test, threshold)

We will use these performance metrics as a basis of comparison.     
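
To see how the classification threshold affects these metrics, the following sketch computes accuracy, precision and recall from predicted probabilities with plain NumPy. The labels and probabilities are made-up values, and the function `classification_metrics` is a hypothetical helper, not part of the lab code.

```python
import numpy as np

def classification_metrics(y_true, prob, threshold):
    '''Threshold the probabilities, then compute accuracy, precision and recall.'''
    pred = np.where(prob > threshold, 1, 0)
    tp = np.sum((pred == 1) & (y_true == 1))   # true positives
    fp = np.sum((pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((pred == 0) & (y_true == 1))   # false negatives
    tn = np.sum((pred == 0) & (y_true == 0))   # true negatives
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall

# Made-up labels (1 = left the company) and predicted probabilities
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0])
prob = np.array([0.9, 0.45, 0.42, 0.1, 0.6, 0.3, 0.2, 0.05])

for threshold in (0.4, 0.5):
    acc, prec, rec = classification_metrics(y_true, prob, threshold)
    print('threshold = {0:.1f}: accuracy = {1:.3f}, precision = {2:.3f}, recall = {3:.3f}'.format(
        threshold, acc, prec, rec))
```

With these made-up values, lowering the threshold from 0.5 to 0.4 raises recall from 1/3 to 2/3 while accuracy and precision are unchanged. In general, a lower threshold trades precision for recall.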

Next, we will test a number of regularization parameter values. Execute the code in the cell below and examine the results.   

In [None]:
def regularized_coefs_glm(df_train, df_test, alphas, L1_wt=0.0, n_coefs=18,
                      formula = formula, label='left', threshold = 0.4):
    '''Function that computes a regularized GLM for each value of the regularization 
    parameter alpha and returns an array of the coefficient values, along with the 
    training and test classification error rates. The L1_wt determines the trade-off 
    between L1 and L2 regularization'''
    coefs = np.zeros((len(alphas),n_coefs + 1))
    err_train = []
    err_test = []
    for i,alpha in enumerate(alphas):
        ## Fit the regularized binomial GLM for this value of alpha
        temp_mod = smf.glm(formula, data=df_train, family=sm.families.Binomial()).fit_regularized(alpha=alpha,L1_wt=L1_wt)
        
        ## Save the model coefficients
        coefs[i,:] = temp_mod.params
        ## Compute the training error rate, the fraction of misclassified training cases
        err_train.append(np.sum(np.abs(df_train.loc[:,label] - np.where(temp_mod.predict(df_train) > threshold, 1, 0)))/float(len(df_train)))
        ## Then compute the test error rate, normalized by the number of test cases
        err_test.append(np.sum(np.abs(df_test.loc[:,label] - np.where(temp_mod.predict(df_test) > threshold, 1, 0)))/float(len(df_test)))        
    return coefs, err_train, err_test


alphas = np.arange(0.0, 0.3, step = 0.02)
Betas, err_train, err_test = regularized_coefs_glm(hr_train, hr_test, alphas, L1_wt=0.5, formula = formula)

plot_coefs(Betas, alphas, err_train, err_test, title='Error vs. regularization parameter',\
           ylab='Classification error rate', location='upper left')

The test classification error is minimized at $\alpha=0.16$. Notice that at $\alpha=0.16$ only 7 of the original 19 model coefficients are nonzero.  
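
Rather than reading these values off the plot, the minimizing value of $\alpha$ and the number of nonzero coefficients can be found programmatically. The sketch below uses small made-up arrays that mirror the shapes of `alphas`, `err_test` and `Betas`; the numbers are hypothetical and are not the results from the HR model.

```python
import numpy as np

# Made-up stand-ins for the outputs of the hyperparameter search
alphas = np.array([0.0, 0.1, 0.2])
err_test = [0.25, 0.21, 0.23]
Betas = np.array([[0.5, -0.3, 0.1, 0.02],
                  [0.4, -0.1, 0.0, 0.0],
                  [0.3,  0.0, 0.0, 0.0]])   # one row of coefficients per alpha

best = int(np.argmin(err_test))              # index of the smallest test error
n_nonzero = np.count_nonzero(Betas[best, :]) # coefficients not driven to zero

print('Best alpha = {0:.2f} with {1} of {2} coefficients nonzero'.format(
    alphas[best], n_nonzero, Betas.shape[1]))
```

Note that `np.argmin` returns the first minimum if several values of $\alpha$ tie, so a flat error curve deserves a visual check as well.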

Finally, we can evaluate this model at the optimal value of $\alpha$. Execute the code in the cell below to construct and evaluate the regularized model. 

In [None]:
hr_glm_elastic = smf.glm(formula=formula, data=hr_train, family=sm.families.Binomial()).fit_regularized(alpha=0.16,L1_wt=0.5)

print_coefficient_comparision(hr_glm_elastic, hr_glm)

threshold = 0.4
hr_test.loc[:,'predicted_prob'] = hr_glm_elastic.predict(hr_test)
hr_test.loc[:,'predicted'] = np.where(hr_test.loc[:,'predicted_prob'] > threshold, 1, 0)
display_metrics(hr_test, threshold)

Compare these results to the unregularized model. These metrics are somewhat improved over the unregularized model. This result is a bit unusual, but seems to arise from the fact that the unregularized model was so severely overfit. The regularization has improved the generalization, hence the better performance with the test dataset.         

## Summary

In this lab you have explored the basics of regularization. Regularization can prevent machine learning models from being overfit. Regularization is required to help machine learning models generalize when placed in production. Selection of regularization strength involves consideration of the bias-variance trade-off. 

L2 and L1 regularization constrain model coefficients to prevent overfitting. L2 regularization constrains model coefficients using a Euclidean norm. L2 regularization can drive some coefficients toward zero, but usually not exactly to zero. On the other hand, L1 regularization can drive model coefficients exactly to zero.    

The elastic net algorithm provides weighted behavior between the L1 and L2 methods. 

#### Copyright 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024 Stephen F Elston. All rights reserved. 