# Comparison of Shrinkage and Selection Methods for Linear Regression

In this article we will look at seven popular methods for subset selection and shrinkage in linear regression. After an introduction to the topic justifying the need for such methods, we will look at each approach one by one, covering both mathematical properties and a Python application.

## Why shrink or subset and what does this mean?

In the linear regression context, subsetting means choosing a subset from available variables to include in the model, thus reducing its dimensionality. Shrinkage, on the other hand, means reducing the size of the coefficient estimates (shrinking them towards zero). Note that if a coefficient gets shrunk to exactly zero, the corresponding variable drops out of the model. Consequently, such a case can also be seen as a kind of subsetting.

Shrinkage and selection aim at improving upon the simple linear regression. There are two main reasons why it could need an improvement:

* **Prediction accuracy:** Linear regression estimates tend to have low bias and high variance. Reducing model complexity (the number of parameters that need to be estimated) results in reducing the variance at the cost of introducing more bias. If we could find the sweet spot where the total error, so the error resulting from bias plus the one from variance, is minizmized, we can improve the model's predictions.


* **Model's interpretability:** With too many predictors it is hard for a human to grasp all the relations between the variables. In some cases we would be willing to determing a small subset of variables with the strongest impact, thus sacrificing some details in order to get the big picture.

## Setup & Data Load

Before jumping straight to the methods themselves, let us first look at the data set we will be analysing. It comes from a study by Stamey et al. (1989) who investigated the impact of different clinical measurements on the level of prostate specific antigen (PSA). The task is to identify the risk factors for prostate cancer, based on a set if clinical and demographic variables. The data, together with some desciptions of the variables, can be found [on the website of Hastie's et al. "The elements of statistical learning" textbook](http://web.stanford.edu/~hastie/ElemStatLearn/), in the Data section.

We will start by importing the modules used throughout this article, loading the data and splitting it into training and testing sets, keeping the targets and the features separately. We will then discuss each of the shrinkage and selection methods, fit it to the training data and use the test set to check how well can it predict the PSA levels on new data.

In [1]:
# Import necessary modules and set options
import pandas as pd
import numpy as np
import itertools

from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV, LarsCV
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings("ignore")

# Load data
data = pd.read_csv("prostate_data", sep = "\t")
print(data.head())

# Train-test split
y_train = np.array(data[data.train == "T"]['lpsa'])
y_test = np.array(data[data.train == "F"]['lpsa'])
X_train = np.array(data[data.train == "T"].drop(['lpsa', 'train'], axis=1))
X_test = np.array(data[data.train == "F"].drop(['lpsa', 'train'], axis=1))

     lcavol   lweight  age      lbph  svi       lcp  gleason  pgg45      lpsa  \
0 -0.579818  2.769459   50 -1.386294    0 -1.386294        6      0 -0.430783   
1 -0.994252  3.319626   58 -1.386294    0 -1.386294        6      0 -0.162519   
2 -0.510826  2.691243   74 -1.386294    0 -1.386294        7     20 -0.162519   
3 -1.203973  3.282789   58 -1.386294    0 -1.386294        6      0 -0.162519   
4  0.751416  3.432373   62 -1.386294    0 -1.386294        6      0  0.371564   

  train  
0     T  
1     T  
2     T  
3     T  
4     T  


## Linear Regression

Let us start with the simple linear regression, which will constitute our benchmark. It models the target variable, _y_, as a linear combination of _p_ predictors, or features _X_:

<p align="center">
<img src="img/linreg_model.png" width=200 style="display: block; margin: auto;" />
</p>

This model has _p_ + 2 parameters that have to be estimated from the training data:

* The _p_ feature $\beta$-coefficients, one per viariable, denoting their impacts on the target;
* One intercept parameter, denoted as $\beta_0$  above, which is the prediction in case all Xs are zero. It is not necessary to include it in the model, and indeed in some cases it should be dropped (e.g. if one wants to include a full set of dummies denoting levels of a categorical variable) but in general it gives the model more flexibility, as you will see in the next paragraph;
* One variance parameter of the Gaussian error term. 

These parameters are typically estimated using the Ordinary Least Square (OLS). OLS minimizes the sum of squared residuals, given by

<p align="center">
<img src="img/linreg_rss.png" width=300 style="display: block; margin: auto;" />
</p>

It is helpful to think about this minimization criterion graphically. With only one predictor _X_, we are in a 2D space, formed by this predictor and the target. In this setting, the model fits such a line in the _X-Y_ space that is the closest to all data points, with the proximity measured as the sum of squared vertical distances of all data points - see the left panel below. If there are two predictors, $X_1$ and $X_2$, the space grows to 3D and now the model fits a plane that is closest to all points in the 3D space - see the right panel below. With more than two features, the plane becomes the somewhat abstract hyperplane, but the idea is still the same.

<p align="center">
<img src="img/linreg_3d_pic.png" width=600 style="display: block; margin: auto;" />
</p>

The minimization problem described above turns out to have an analytical solution, and the $\beta$-parameters can be calculated as

<p align="center">
<img src="img/linreg_coefs.png" width=175 style="display: block; margin: auto;" />
</p>

Including a column of ones in the _X_ matrix allows to express the intercept part of the $\hat{\beta}$ vector in the formula above. The "hat" above the $\beta$ denotes that it is an estimated value, based on the training data.

In statistics, there are two critical characteristics of estimators to be considered: the bias and the variance. The bias is the difference between the true population parameter and the expected estimator. It measures the inaccuracy of the estimates. The variance, on the other hand, measures the spread between them.

<p align="center">
<img src="img/bias_vs_variance.jpg" width=400 style="display: block; margin: auto;" />
</p>

Clearly, both bias and variance can harm the model's predictive performance if they are too large. The linear regression, however, tends to suffer from variance, while having a low bias. This is especially the case if there are many predictive features in the model or if they are highly correlated with each other. **This is where subsetting and regularization come to rescue. They allow to reduce the variance at the cost of introducing some more bias, ulimately reducing the total error of the model.**

Before discussing these methods in detail, let us fit linear regression to out prostate data and check it's out-of-sample Mean Prediction Error (MAE).

In [2]:
linreg_model = LinearRegression(normalize=True).fit(X_train, y_train)
linreg_prediction = linreg_model.predict(X_test)
linreg_mae = np.mean(np.abs(y_test - linreg_prediction))
linreg_coefs = dict(
    zip(['Intercept'] + data.columns.tolist()[:-1], 
        np.round(np.concatenate((linreg_model.intercept_, linreg_model.coef_), axis=None), 3))
)

print('Linear Regression MAE: {}'.format(np.round(linreg_mae, 3)))
print('Linear Regression coefficients:')
linreg_coefs

Linear Regression MAE: 0.523
Linear Regression coefficients:


{'Intercept': 0.429,
 'lcavol': 0.577,
 'lweight': 0.614,
 'age': -0.019,
 'lbph': 0.145,
 'svi': 0.737,
 'lcp': -0.206,
 'gleason': -0.03,
 'pgg45': 0.009}

## Best Subset Regression

A straightforward approach to choose a subset of variables for linear regression is to try all possible combinations and pick one that minimizes some criterion. This is what Best Subset Regression aims for. For every $k \in {1, 2, ..., p}$, where _p_ is the total number of available features, it picks the subset of size _k_ that gives the smallest residual sum of squares. However, sum of squares cannot be used as a criterion to determine _k_ itself, as it is necessarily decreasing with _k_: the more variables are included in the model, the smaller its residuals. It does not guarantee better predictive performance though. That's why another criterion should be used to select the final model. For models focused on prediction, a (possibly cross-validated) error on test data is a common choice.

As far as I know, Best Subset Regression is not implemented in any Python package, so we have to loop over _k_ and all subsets of size _k_ manually. The following chunk of code does the job.

In [3]:
results = pd.DataFrame(columns=['num_features', 'features', 'MAE'])

# Loop over all possible numbers of features to be included
for k in range(1, X_train.shape[1] + 1):
    # Loop over all possible subsets of size k
    for subset in itertools.combinations(range(X_train.shape[1]), k):
        subset = list(subset)
        linreg_model = LinearRegression(normalize=True).fit(X_train[:, subset], y_train)
        linreg_prediction = linreg_model.predict(X_test[:, subset])
        linreg_mae = np.mean(np.abs(y_test - linreg_prediction))
        results = results.append(pd.DataFrame([{'num_features': k,
                                                'features': subset,
                                                'MAE': linreg_mae}]))

# Inspect best combinations
results = results.sort_values('MAE').reset_index()
print(results.head())

# Fit best model
best_subset_model = LinearRegression(normalize=True).fit(X_train[:, results['features'][0]], y_train)
best_subset_coefs = dict(
    zip(['Intercept'] + data.columns.tolist()[:-1], 
        np.round(np.concatenate((best_subset_model.intercept_, best_subset_model.coef_), axis=None), 3))
)

print('Best Subset Regression MAE: {}'.format(np.round(results['MAE'][0], 3)))
print('Best Subset Regression coefficients:')
best_subset_coefs

   index       MAE            features num_features
0      0  0.466876     [0, 1, 2, 4, 7]            5
1      0  0.467043  [0, 1, 2, 4, 6, 7]            6
2      0  0.471730     [0, 1, 2, 4, 6]            5
3      0  0.478344        [0, 1, 4, 7]            4
4      0  0.479609        [0, 1, 4, 6]            4
Best Subset Regression MAE: 0.467
Best Subset Regression coefficients:


{'Intercept': -0.599,
 'lcavol': 0.497,
 'lweight': 0.81,
 'age': -0.012,
 'lbph': 0.413,
 'svi': 0.005}

## Ridge Regression

One drawback of Best Subset Regression is that it does not tell us anything about the impact of the variables that are excluded from the model on the response variable. Ridge Regression provides an alternative to this hard selection of variables that splits them into incldued in and excluded from the model. Instead, it penalizes the coefficients to shrink them towards zero. Not exactly zero, as that would mean exlusion from the model, but in the direction of zero, which can be viewed as decreasing model's complexity in a continuous way, while keeping all variables in the model.

In Ridge Regression, the Linear Regression loss function is augmented in such a way to not only minimize the sum of squared residuals but also to penalize the size of parameter estimates:

<p align="center">
<img src="img/ridgereg_loss.png" width=400 style="display: block; margin: auto;" />
</p>

Solving this minimization problem results in an analytical formula for the $\beta$s:

<p align="center">
<img src="img/ridgereg_coefs.png" width=250 style="display: block; margin: auto;" />
</p>

where _I_ denotes an identity matrix. The penalty term $\lambda$ is the hyperparameter to be chosen: the larger its value, the more are the coefficients shrinked towards zero. One can see from the formula above that as $\lambda$ goes to zero, the additive penalty vanishes and $\hat{\beta}^{ridge}$ becomes the same as $\hat{\beta}^{OLS}$ from linear regression. On the other hand, as $\lambda$ grows to infinity, $\hat{\beta}^{ridge}$ approaches zero: with high enough penalty, coefficients can be shrinked arbitrarily close to zero. 

But does this shrinkage really result in reducing the variance of the model at the cost of introducing some bias as promised? Yes, it does, which is clear from the formulas for ridge regression estimates' bias and variance: as $\lambda$ increases, so does the bias, while the variance goes down!

<p align="center">
<img src="img/ridgereg_bias_variance.png" width=400 style="display: block; margin: auto;" />
</p>

Now, how to choose to best value for $\lambda$? Run cross-validation trying a set of different values and pick one that minimizes cross-validated error on test data. Luckily, Python's scikit-learn can do this for us.

In [4]:
ridge_cv = RidgeCV(normalize=True, alphas=np.logspace(-10, 1, 400))
ridge_model = ridge_cv.fit(X_train, y_train)
ridge_prediction = ridge_model.predict(X_test)
ridge_mae = np.mean(np.abs(y_test - ridge_prediction))
ridge_coefs = dict(
    zip(['Intercept'] + data.columns.tolist()[:-1], 
        np.round(np.concatenate((ridge_model.intercept_, ridge_model.coef_), axis=None), 3))
)

print('Ridge Regression MAE: {}'.format(np.round(ridge_mae, 3)))
print('Ridge Regression coefficients:')
ridge_coefs

Ridge Regression MAE: 0.517
Ridge Regression coefficients:


{'Intercept': 0.155,
 'lcavol': 0.51,
 'lweight': 0.605,
 'age': -0.016,
 'lbph': 0.14,
 'svi': 0.692,
 'lcp': -0.134,
 'gleason': 0.009,
 'pgg45': 0.008}

## LASSO

Lasso, or Least Absolute Shrinkage and Selection Operator, is very similar in spirit to Ridge Regression. It also adds a penalty for non-zero coefficients to the loss function, but unlike Ridge Regression which penalizes sum of squared coefficients (the so-called L2 penalty), LASSO penalizes the sum of their absolute values (L1 penalty). As a result, for high values of $\lambda$, many coefficients are exactly zeroed under LASSO, which is never the case in Ridge Regression. 

Another important differences between them is how they tackle the issue of multicollinearity between the features. In Ridge Regression, the coefficients of correlated variables tend be similar, while in LASSO one of them is usually zeroed and the other receives the entire impact. Because of this, Ridge Regression is expected to work better if there are many large parameters of about the same value, i.e. when most predictors truly impact the response. LASSO, on the other hand, is expected to come on top when there are a small number of significant parameters and the others are close to zero, i.e. when only a few predictors actually influence the response.

In practice, however, one doesn't know the true values of the parameters. So, the choice between Ridge Regression and LASSO can be based on out-of-sample prediction error. Another option is to combine these two approaches in one - see the next section!

LASSO's loss function looks as follows:

<p align="center">
<img src="img/lasso_loss.png" width=300 style="display: block; margin: auto;" />
</p>

Unlike in Ridge Regression, this minimization problem cannot be solved analytically. Fortunately, there are numerical algorithms able to deal with it.

In [5]:
lasso_cv = LassoCV(normalize=True, alphas=np.logspace(-10, 1, 400))
lasso_model = lasso_cv.fit(X_train, y_train)
lasso_prediction = lasso_model.predict(X_test)
lasso_mae = np.mean(np.abs(y_test - lasso_prediction))
lasso_coefs = dict(
    zip(['Intercept'] + data.columns.tolist()[:-1], 
        np.round(np.concatenate((lasso_model.intercept_, lasso_model.coef_), axis=None), 3))
)

print('LASSO MAE: {}'.format(np.round(lasso_mae, 3)))
print('LASSO coefficients:')
lasso_coefs

LASSO MAE: 0.5
LASSO coefficients:


{'Intercept': 0.074,
 'lcavol': 0.459,
 'lweight': 0.456,
 'age': -0.0,
 'lbph': 0.05,
 'svi': 0.352,
 'lcp': 0.0,
 'gleason': 0.0,
 'pgg45': 0.002}

## Elastic Net

Elastic Net first emerged as a result of critique on LASSO, whose variable selection can be too dependent on data and thus unstable. Its solution is to combine the penalties of Ridge Regression and LASSO to get the best of both worlds. Elastic Net aims at minimizing the loss function that includes both the L1 and L2 penalties:

<p align="center">
<img src="img/enet_loss.png" width=400 style="display: block; margin: auto;" />
</p>

where $\alpha$ is the mixing paramter between Ridge Regression (when it is zero) and LASSO (when it is one). The best $\alpha$ can be chosen with scikit-learn's cross-validation-based hyperparaneter tuning.

In [6]:
elastic_net_cv = ElasticNetCV(normalize=True, alphas=np.logspace(-10, 1, 400), l1_ratio=np.linspace(0, 1, 100))
elastic_net_model = elastic_net_cv.fit(X_train, y_train)
elastic_net_prediction = elastic_net_model.predict(X_test)
elastic_net_mae = np.mean(np.abs(y_test - elastic_net_prediction))
elastic_net_coefs = dict(
    zip(['Intercept'] + data.columns.tolist()[:-1], 
        np.round(np.concatenate((elastic_net_model.intercept_, elastic_net_model.coef_), axis=None), 3))
)

print('Elastic Net MAE: {}'.format(np.round(elastic_net_mae, 3)))
print('Elastic Net coefficients:')
elastic_net_coefs

Elastic Net MAE: 0.5
Elastic Net coefficients:


{'Intercept': 0.074,
 'lcavol': 0.459,
 'lweight': 0.456,
 'age': -0.0,
 'lbph': 0.05,
 'svi': 0.352,
 'lcp': 0.0,
 'gleason': 0.0,
 'pgg45': 0.002}

## Least Angle Regression

So far we have discussed one subsetting method, Best Subset Regression, and three shrinkage methods: Ridge Regression, LASSO and their combination, Elastic Net. This section is devoted to an approach located somewhere in between subsetting and shrinking: Least Angle Regression (LAR). This algorithm starts with a null model, with all coefficients equal to zero, and then works iteratively, at each step moving the coefficient of one of the variables towards its least squares value. 

More specifically, LAR starts with identifying the variable most correlated with the response. Then it moves the coefficient of this variable continuously toward its leasts quares value, thus decreasing its correlation with the evolving residual. As soon as another variable “catches up” in terms of correlation with the residual, the process is paused. The second variable then joins the active set, i.e. the set of variables with non-zero coefficients, and their coefficients are moved together in a way that keeps their correlations tied and decreasing. This process is continued until all the variables are in the model, and ends at the full least-squares fit. The name "Least Angle Regression" comes from the geometrical interpretation of the algorithm in which the new fit direction at a given step makes the smallest angle with each of the features that already have non-zero coefficents.

The code chunk below applies LAR to the prostate data.

In [9]:
LAR_cv = LarsCV(normalize=True)
LAR_model = LAR_cv.fit(X_train, y_train)
LAR_prediction = LAR_model.predict(X_test)
LAR_mae = np.mean(np.abs(y_test - LAR_prediction))
LAR_coefs = dict(
    zip(['Intercept'] + data.columns.tolist()[:-1], 
        np.round(np.concatenate((LAR_model.intercept_, LAR_model.coef_), axis=None), 3))
)

print('Least Angle Regression MAE: {}'.format(np.round(LAR_mae, 3)))
print('Least Angle Regression coefficients:')
LAR_coefs

Least Angle Regression MAE: 0.499
Least Angle Regression coefficients:


{'Intercept': 0.05,
 'lcavol': 0.46,
 'lweight': 0.46,
 'age': 0.0,
 'lbph': 0.054,
 'svi': 0.362,
 'lcp': 0.0,
 'gleason': 0.0,
 'pgg45': 0.002}

## Principal Components Regression

We have already discussed methods for choosing variables (subsetting) and decreasing their coefficients (shrinkage). The last two methods explained in this article take a slightly different approach: they squeeze the input space of the original features into a lower-dimensional space. Mainly, they use *X* to create a small set of new features _Z_ that are linear combinations of *X* and then use those in regression models.

The first of these two methods is Principal Components Regression. It applies the PCA to obtain principal components with high variance, so that they can explain the variance of the target, and then uses them as features in simple linear regression. This makes it similar to Ridge Regression, as both of them operate on the principal components space of the original features. The difference is that PCR discards the components with least informative power, while Ridge Regression simply shrinks them stronger. 

The number of components to reatain can be viewed as a hyperparameter and tuned via cross-validation, as is the case in the code chunk below.

In [7]:
regression_model = LinearRegression(normalize=True)
pca_model = PCA()
pipe = Pipeline(steps=[('pca', pca_model), ('least_squares', regression_model)])
param_grid = {'pca__n_components': range(1, 9)}
search = GridSearchCV(pipe, param_grid)
pcareg_model = search.fit(X_train, y_train)
pcareg_prediction = pcareg_model.predict(X_test)
pcareg_mae = np.mean(np.abs(y_test - pcareg_prediction))
n_comp = list(pcareg_model.best_params_.values())[0]
pcareg_coefs = dict(
   zip(['Intercept'] + ['PCA_comp_' + str(x) for x in range(1, n_comp + 1)], 
       np.round(np.concatenate((pcareg_model.best_estimator_.steps[1][1].intercept_, 
                                pcareg_model.best_estimator_.steps[1][1].coef_), axis=None), 3))
)

print('Principal Components Regression MAE: {}'.format(np.round(pcareg_mae, 3)))
print('Principal Components Regression coefficients:')
pcareg_coefs

Principal Components Regression MAE: 0.551
Principal Components Regression coefficients:


{'Intercept': 2.452,
 'PCA_comp_1': 0.019,
 'PCA_comp_2': -0.018,
 'PCA_comp_3': -0.114,
 'PCA_comp_4': 0.495,
 'PCA_comp_5': 0.513,
 'PCA_comp_6': -0.46}

## Partial Least Squares

The final method discussed in this artical is Partial Least Squares (PLS). Similarly to Principal Components Regression, it also uses a small set of linear combinations of the original features. The difference is in how these combinations are constructed. While Principal Components Regression uses only *X* themselves to create the derived features _Z_, Partial Least Squares additionally uses the target *y*. Hence, while constructing _Z_, PLS seeks directions that have high variance (as these can explain variance in the target) and high correlation with the target. This stays in contrast to the principal components appraoch, which focuses on high variance only.

Under the hood of the algorithm, the first of the new features, $z_1$, is created as a linear combination of all features _X_, where each of the *X*s is weighted by its inner prodcut with the target *y*. Then, *y* is regressed on $z_1$ giving PLS $\beta$-coefficients. Finally, all *X* are orthogonalized with respect to $z_1$. Then the process starts anew for $z_2$ and goes on until the desired numbers of components in *Z* is obtained. This number, as usual, can be chosen via cross-validation.

It can be shown that although PLS shrinks the low-variance components in *Z* as desired, it can sometimes inflate the high-variance ones, which might lead to higher prediction errors in some cases. This seems to be the case for our prostate data: PLS performs the worst of all discussed methods.

In [8]:
pls_model_setup = PLSRegression(scale=True)
param_grid = {'n_components': range(1, 9)}
search = GridSearchCV(pls_model_setup, param_grid)
pls_model = search.fit(X_train, y_train)
pls_prediction = pls_model.predict(X_test)
pls_mae = np.mean(np.abs(y_test - pls_prediction))
pls_coefs = dict(
  zip(data.columns.tolist()[:-1], 
      np.round(np.concatenate((pls_model.best_estimator_.coef_), axis=None), 3))
)

print('Partial Least Squares Regression MAE: {}'.format(np.round(pls_mae, 3)))
print('Partial Least Squares Regression coefficients:')
pls_coefs

Partial Least Squares Regression MAE: 1.008
Partial Least Squares Regression coefficients:


{'lcavol': 0.281,
 'lweight': 0.186,
 'age': 0.087,
 'lbph': 0.101,
 'svi': 0.213,
 'lcp': 0.187,
 'gleason': 0.131,
 'pgg45': 0.171}

## Recap & Conclusion

With many, possibly correlated features, linear models fail in terms of prediction accuracy and model's interpretability due to large variance of model's parameters. This can be alleviated by reducing the variance, which can only happen at the cost of introducing some bias. Yet, finding the best bias-variance trade-off can optimize model's performance. 

Two broad classes of approaches allowing to achieve this are subsetting and shrinkage. The former selects a subset of variables, while the latter shrinks the coefficients of the model towards zero. Both approaches results in a reduction of model's complexity, which leads to the desired decrease in parameters' variance.

This article discussed a couple of subsetting and shrinkage methods:

* __Best Subset Regression__ iterates over all possible feature combination to select the best one;
* __Ridge Regression__ penalizes the squared coefficient values (L2 penalty) enforcing them to be small;
* __LASSO__ penalizes the absolute values of the coefficients (L1 penalty) which can force some of them to be exactly zero;
* __Elastic Net__ combines the L1 and L2 penalties, enjoying the best of Ridge and Lasso;
* __Least Angle Regression__ fits in between subsetting and shrinkage: it works iteratively, adding "some part" of one of the features at each step;
* __Principal Components Regression__ performs PCA to squeeze the original features into a small subset of new features and then uses those as predictors;
* __Partial Least Squares__ also summarizes orignal features into a smaller subset of new ones, but unlike PCR, it also makes use of the targets to construct them.

As you can see from the applications to the prostate data, most of these methods perform similarly in terms of prediction accuracy. The first 5 methods' errors range between 0.467 and 0.517, beating least squares' error of 0.523. The last two, PCR and PLS, perform worse, possbily due to the fact that there are not that many features in the data, hence gains from dimensionality reduction are limited.

## Sources

1. Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction. 2nd ed. New York: Springer.
2. https://www.datacamp.com/community/tutorials/tutorial-ridge-lasso-elastic-net