# Shrinkage models

1. Ridge Regression
2. LASSO
3. Elastic Net

The lecture draws from Chapter 6 of James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). "An Introduction to Statistical Learning: with Applications in R."

---
# 1. Ridge Regression

Here we return to the issue of high dimensionality, but with a slightly different focus. Last class we talked about best subset & stepwise selection methods. The logic of these methods was simple.

1. Compare model fits while adjusting for number of parameters.
2. Find the best model by fitting a range of model configurations.

Both subset and stepwise selection approaches are _post hoc_ methods (i.e., they compare models after they have been fit) and are computationally expensive. A more efficient way to do _feature selection_ (i.e., identify which parameters to keep and which to exclude) is to incorporate constraints on model dimensionality into the fitting process itself.

This is the logic of so-called **shrinkage** models. They are a class of linear models that penalize dimensionality in the fitting process itself, so that only the most robust predictor variables keep substantial weight in your model.

The first shrinkage model we'll explore is [**ridge regression**](https://en.wikipedia.org/wiki/Tikhonov_regularization) which penalizes weak regression coefficients so that they become so small that they are functionally ineffective.

Remember that the objective function for least squares regression tries to minimize the residual sum of squares (RSS).

$$ RSS = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \sum_{j=1}^{p} \hat{\beta}_jX_{i,j})^2 $$

Ridge regression starts with the least squares objective function (i.e., RSS) except that the coefficients are estimated with an additional term added to the objective function.

$$ \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \sum_{j=1}^{p} \hat{\beta}_jX_{i,j})^2 + \lambda \sum_{j=1}^{p} \hat{\beta}_{j}^2$$

Here $\lambda$ is always greater than or equal to zero (when $\lambda=0$ ridge regression becomes least squares regression). $\lambda$ is called a _tuning parameter_ and it is another free parameter in the model that needs to be estimated. This means that for different values of $\lambda$ you'll get different estimates of the regression coefficients $\beta_{\lambda}^R$ (the term $\beta_{\lambda}^R$ refers to the ridge (R) regression parameters you get with a specific value of $\lambda$).
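To make the objective concrete, here is a minimal sketch of a ridge fit in NumPy. It assumes the standard closed-form minimizer of the objective above, $(X^TX + \lambda I)^{-1}X^Ty$ on centered data, with the intercept left unpenalized. The function name `ridge_fit` and the simulated data are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize RSS + lam * sum(beta_j^2); the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean            # centering takes the intercept out of the problem
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta             # recover the intercept afterwards
    return beta0, beta

# Simulated data (hypothetical): four predictors, the last one irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 1.0 + X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=50)

beta0_ols, beta_ols = ridge_fit(X, y, lam=0.0)   # lam = 0 reproduces least squares
beta0_r, beta_r = ridge_fit(X, y, lam=10.0)      # lam > 0 shrinks the coefficients

print("least squares:", np.round(beta_ols, 3))
print("ridge:        ", np.round(beta_r, 3))
```

Setting `lam=0.0` recovers the least squares solution, and any positive value gives a coefficient vector with a smaller l2-norm.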

The term 

$$\lambda \sum_{j=1}^{p} \hat{\beta}_{j}^2$$

is called the _shrinkage penalty_ because it is small when the $\beta$'s are close to zero, so minimizing the objective reduces the magnitude of the model parameters: coefficients with small magnitudes are “shrunk” to be functionally negligible (“toward zero”). Thus the tuning parameter $\lambda$ controls the degree of this penalty on the regression coefficients during estimation. Higher values of $\lambda$ produce more heavily constrained models (i.e., with smaller coefficients). This is illustrated in the plot on the left below. As $\lambda$ increases, the estimated regression coefficients get smaller and smaller in magnitude.

![Ridge Sparsity](imgs/L17_RidgeRegressionSparsity.png)


We call the shrinkage penalty in ridge regression an _l2-penalty_ because it penalizes the squared l2-norm of the coefficient vector (i.e., the sum of the squared regression coefficients). Because the pull of a squared penalty weakens as a coefficient approaches zero, the value itself never actually reaches zero, but just gets so small as to be functionally removed from the model.

The right hand panel of the figure above shows how the ridge regression parameters, measured by their l2-norm $||\beta_{\lambda}^R||_2$, diverge from the least squares regression coefficients ($||\beta||_2$). When $\lambda=0$, the ratio $||\beta_{\lambda}^R||_2 / ||\beta||_2$ is 1, indicating the case where the solutions to the ridge model and least squares model are the same. As $\lambda$ increases, the ratio $||\beta_{\lambda}^R||_2 / ||\beta||_2$ decreases toward zero because the magnitude of the estimated ridge coefficients shrinks.
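The behavior of this ratio can be sketched numerically. The example below uses simulated, centered data (an assumption for illustration) and a hypothetical helper `ridge_coefs` that applies the closed-form ridge solution without an intercept.

```python
import numpy as np

def ridge_coefs(X, y, lam):
    """Closed-form ridge estimate for centered data (no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Simulated data (hypothetical), centered so that no intercept is needed.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)
y = X @ np.array([4.0, -3.0, 2.0, 0.5, 0.0]) + rng.normal(size=100)
y = y - y.mean()

beta_ls = ridge_coefs(X, y, 0.0)               # least squares reference
lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
ratios = [np.linalg.norm(ridge_coefs(X, y, lam)) / np.linalg.norm(beta_ls)
          for lam in lambdas]

for lam, ratio in zip(lambdas, ratios):
    print(f"lambda={lam:7.1f}  ratio={ratio:.3f}")
```

The ratio starts at 1 and falls as $\lambda$ grows, yet the individual coefficients stay non-zero even for the largest $\lambda$ in the list.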

<br>

## Finding $\lambda$

There are two critical advantages of ridge regression over least squares regression.

1. Ridge better manages the bias-variance tradeoff in contexts where overfitting is a concern. As $\lambda$ increases, the flexibility of ridge regression fits decreases, leading to lower variance and higher bias.

2. Ridge works best in situations where the least squares estimates have high variance (e.g., when $p$ is close to $n$ or the predictors are highly correlated).

However, because the tuning parameter $\lambda$ is a free parameter (i.e., a parameter that we need to fit), we have to find the best value of $\lambda$ empirically. Thus we have to use cross validation to "tune" the ridge regression model. So before we can estimate our regression coefficients we have to use validation sets to find the right sparsity constraint. 

Just like using cross-validation to assess the bias-variance tradeoff, to tune the ridge model we'll want to take a subset of the data, split it into training and testing sets, and use test error to determine the best value of $\lambda$. The figure below shows the variance (green), bias (red), and test error (black) for a simulated data set with different levels of $\lambda$. As you can see, variance falls and bias rises as $\lambda$ increases, so there is an intermediate $\lambda$ that best balances the bias-variance tradeoff and minimizes test error; past that point, test error increases with larger $\lambda$s.

![Ridge Bias-Variance](imgs/L17_RidgeVarianceBias.png)

**Critically, the subset of data you use to tune the $\lambda$ parameter cannot be used for the subsequent parameter fitting to find the best regression coefficients.** Thus you have to be careful in how you search for the best $\lambda$ to avoid circular inference.
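One way this tuning procedure might look in practice is sketched below. It assumes k-fold cross-validation on a dedicated tuning subset, a log-spaced grid of candidate $\lambda$ values, and simulated data with no intercept; the helper names `ridge_coefs` and `cv_error` are hypothetical.

```python
import numpy as np

def ridge_coefs(X, y, lam):
    """Closed-form ridge estimate (no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, k=5):
    """Average held-out mean squared error of ridge across k folds."""
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False            # everything outside the fold is training data
        beta = ridge_coefs(X[train_mask], y[train_mask], lam)
        errors.append(np.mean((y[val_idx] - X[val_idx] @ beta) ** 2))
    return float(np.mean(errors))

# Simulated data (hypothetical): 30 predictors, only 5 of which matter.
rng = np.random.default_rng(2)
n, p = 200, 30
X = rng.normal(size=(n, p))
beta_true = np.concatenate([rng.normal(size=5), np.zeros(p - 5)])
y = X @ beta_true + rng.normal(scale=2.0, size=n)

# Keep the tuning subset separate from the data used to fit the final coefficients.
X_tune, y_tune = X[:100], y[:100]
X_fit, y_fit = X[100:], y[100:]

lambdas = np.logspace(-2, 3, 20)
cv_errors = [cv_error(X_tune, y_tune, lam) for lam in lambdas]
best_lam = float(lambdas[int(np.argmin(cv_errors))])

beta_hat = ridge_coefs(X_fit, y_fit, best_lam)   # fit on data not used for tuning
print(f"best lambda: {best_lam:.3g}")
```

The split into `X_tune` and `X_fit` mirrors the warning above: the observations that choose $\lambda$ play no role in estimating the final coefficients.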

<br>

## Concerns

There are several concerns to keep in mind with ridge regression.

1. Ridge regression does not actually remove any model terms, it simply makes them small in magnitude. Thus ridge regression will always generate a model with all p predictors. However, because of the shrinkage penalty, weak predictor variables are reduced to a level that makes them _functionally_ removed from the model.

2. Unlike regression coefficients in least squares models, the regression coefficients in ridge regression are not _scale equivariant_. Scale equivariance means that if you multiply a predictor variable by a scalar _c_ (i.e., $cX_j$), then the least squares coefficient simply rescales by the inverse factor (i.e., $\hat{\beta}_j / c$), so the product $X_j\hat{\beta}_j$ is unchanged. However, the shrinkage penalty depends on the magnitude of each coefficient, so rescaling a predictor changes how strongly its coefficient is penalized. Thus, you cannot account for a rescaling of $X$ by simply rescaling the ridge regression coefficients, and it is standard practice to standardize the predictors before fitting a ridge model.
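The scale issue in the second concern can be checked with a small sketch. Multiplying a predictor by $c$ divides its least squares coefficient by $c$, so the fit is unchanged, but the same is not true for ridge. The data, the scalar `c`, and the helper `ridge_coefs` are illustrative assumptions.

```python
import numpy as np

def ridge_coefs(X, y, lam):
    """Closed-form ridge estimate (no intercept); lam = 0 gives least squares."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=60)

c = 10.0
X_scaled = X.copy()
X_scaled[:, 0] *= c                              # rescale the first predictor only

ls, ls_scaled = ridge_coefs(X, y, 0.0), ridge_coefs(X_scaled, y, 0.0)
rd, rd_scaled = ridge_coefs(X, y, 5.0), ridge_coefs(X_scaled, y, 5.0)

print("least squares:", ls[0], c * ls_scaled[0])   # these agree up to rounding
print("ridge:        ", rd[0], c * rd_scaled[0])   # these do not agree

# Standardizing puts every predictor on the same scale before the penalty is applied.
X_std = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)
```

After standardization, the ridge solution no longer depends on the units in which each predictor happened to be recorded.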





---

# 2. LASSO

As mentioned above, ridge regression doesn't actually do feature selection because it doesn't actually remove any predictor variables from the model. Thus given $p$ variables, a ridge model will always return $p$ coefficients. Increasing $\lambda$ will reduce the magnitude of all coefficients (including weak predictors), but it will not remove anything from the model.

Another method that includes a shrinkage penalty _and_ will do feature selection is the Least Absolute Shrinkage and Selection Operator (LASSO). The LASSO objective function has a very similar structure to the ridge objective function in that it starts with RSS and includes an additional shrinkage term. However, the form of this shrinkage term is different.

$$ \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \sum_{j=1}^{p} \hat{\beta}_jX_{i,j})^2 + \lambda \sum_{j=1}^{p}|\hat{\beta}_{j}|$$

Notice that the only difference is that we changed the $\hat{\beta}^2$ term from ridge regression to be the absolute value of the regression coefficient ($|\hat{\beta}|$). Thus LASSO uses an _l1-penalty_ (the l1-norm of the coefficient vector, $||\beta||_1 = \sum_{j}|\beta_j|$). Just like ridge regression, LASSO shrinks the estimates of the regression coefficients toward zero based on the tuning parameter $\lambda$; however, the l1-penalty allows the solutions for weak predictor variables to be exactly zero (particularly when $\lambda$ is large). When a solution is exactly zero, that parameter is removed from the model.
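To see exact zeros appear, here is a minimal sketch of a LASSO solver. It assumes cyclic coordinate descent with soft-thresholding, which is one common way to minimize this objective and is not necessarily the method used to make the figures in this lecture. The names `soft_threshold` and `lasso_cd`, the simulated data, and the choice $\lambda=150$ are all illustrative assumptions; the model has no intercept.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize sum((y - X b)^2) + lam * sum(|b_j|) by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # residual with predictor j left out
            # The 1/2 appears because the RSS term is not halved in this objective.
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return beta

# Simulated data (hypothetical): two strong predictors, four irrelevant ones.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
y = X @ np.array([5.0, -4.0, 0.0, 0.0, 0.0, 0.0]) + rng.normal(size=100)

beta_ls = lasso_cd(X, y, lam=0.0)        # no penalty: least squares
beta_lasso = lasso_cd(X, y, lam=150.0)   # penalty large enough to drop weak predictors

print("least squares:", np.round(beta_ls, 3))
print("LASSO:        ", np.round(beta_lasso, 3))
```

With the penalty switched on, the coefficients of the irrelevant predictors are exactly `0.0` rather than merely small, while the two strong predictors keep shrunken, non-zero estimates.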

An example of this effect of tuning to select parameters is shown in the image below. As $\lambda$ increases (left panel), not only do the estimated regression coefficients get smaller, but eventually the weaker effects converge to zero. As with ridge regression, the ratio of norms, here $||\beta_{\lambda}^L||_1 / ||\beta||_1$, is 1 when $\lambda=0$, meaning that the LASSO and least squares regression produce the same solutions.

![LASSO Sparsity](imgs/L17_LASSOSparsity.png)

Thus, just like best subset selection, given _p_ predictor variables, with $\lambda >0$, LASSO will return $\leq p$ parameters. We will now compare the two shrinkage models with best subset selection.

<br>

## The symmetry of ridge and LASSO

Let's consider a special case. Let's say we have an $X$ where _n=p_ (thus a very high dimensional model), where $X$ is a diagonal matrix with 1's along the diagonal and 0's off the diagonal (i.e., the identity matrix), and where the model is fit without an intercept.

In this context it is easy to show that the least squares objective function

$$ \sum_{j=1}^{p}(y_j - \hat{\beta}_j)^2$$

is minimized by the regression coefficients

$$\hat{\beta}_j = y_j$$

This is because there is only one observation for each of the _p_ variables. Thus the best estimate is the observed value.

In this setting (with no intercept), the ridge objective function reduces to

$$ \sum_{j=1}^{p} (y_j - \hat{\beta}_j)^2 + \lambda \sum_{j=1}^{p} \hat{\beta}_{j}^2$$

which is minimized by the regression coefficients

$$\hat{\beta}_{j}^R = \frac{y_j}{1+\lambda}$$
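As a quick check on this result, here is a sketch of the derivation. It assumes the special case described above ($X$ is the identity matrix and no intercept is fit), so the objective separates into one term per coefficient and each term can be minimized on its own:

$$ \frac{\partial}{\partial \hat{\beta}_j}\left[(y_j - \hat{\beta}_j)^2 + \lambda\hat{\beta}_j^2\right] = -2(y_j - \hat{\beta}_j) + 2\lambda\hat{\beta}_j = 0 \quad\Rightarrow\quad \hat{\beta}_j^R = \frac{y_j}{1+\lambda} $$

Every coefficient is therefore scaled by the same factor $1/(1+\lambda)$, which is below 1 for any $\lambda > 0$ but never zero.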

Similarly, the LASSO objective function reduces to

$$ \sum_{j=1}^{p} (y_j - \hat{\beta}_j)^2 + \lambda \sum_{j=1}^{p}| \hat{\beta}_{j}|$$

which is minimized by the regression coefficients

\begin{equation}
  \hat{\beta}_{j}^L =\left\{
  \begin{array}{@{}ll@{}}
    y_j - \frac{\lambda}{2}, & \text{if}\ y_j > \frac{\lambda}{2} \\
    y_j + \frac{\lambda}{2}, & \text{if}\ y_j < \frac{-\lambda}{2} \\
    0, & \text{if}\ |y_j| \leq \frac{\lambda}{2} \\
  \end{array}\right.
\end{equation} 

When looked at in this context it is easy to see what ridge regression and LASSO are doing to the fits because we can look at how the coefficients change compared to $Y$. This is illustrated in the figure below.

![Ridge vs. LASSO](imgs/L17_RidgeLASSOComparison.png)

On the left panel we see the fit for ridge regression with a specific, non-zero $\lambda$ (red line) compared to the least squares regression fit (dashed black line). Notice that the regression coefficients for ridge are generally smaller than for least squares. The degree of this reduced magnitude of the estimated coefficients gets stronger as $\lambda$ increases.

In contrast, for a similar $\lambda$, the LASSO (right panel) both shrinks the non-zero coefficients by a constant amount $\frac{\lambda}{2}$ _and_ zeros out every coefficient whose least squares estimate satisfies $|y_j| \leq \frac{\lambda}{2}$. As $\lambda$ increases, both the width of this "zero zone" increases _and_ the magnitude of the non-zero coefficients decreases.
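The two closed-form solutions for this special case are easy to compare numerically. The sketch below assumes the identity-design setting with no intercept; the function names and the example values of `y` and `lam` are hypothetical. It also checks each formula against a brute-force grid search over the one-dimensional objective.

```python
import numpy as np

def ridge_identity(y, lam):
    """Ridge solution when X is the identity: proportional shrinkage."""
    return y / (1.0 + lam)

def lasso_identity(y, lam):
    """LASSO solution when X is the identity: soft-thresholding at lam / 2."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

y = np.array([-3.0, -1.0, -0.25, 0.0, 0.4, 2.0])
lam = 1.0
beta_ridge = ridge_identity(y, lam)
beta_lasso = lasso_identity(y, lam)

# Brute-force check: minimize each one-dimensional objective on a fine grid.
grid = np.linspace(-4.0, 4.0, 8001)
ridge_grid = np.array([grid[np.argmin((yj - grid) ** 2 + lam * grid ** 2)] for yj in y])
lasso_grid = np.array([grid[np.argmin((yj - grid) ** 2 + lam * np.abs(grid))] for yj in y])

print("ridge:", beta_ridge)
print("lasso:", beta_lasso)
```

Ridge rescales every value of `y` by the same factor, whereas LASSO subtracts a fixed amount and sets anything inside the "zero zone" to zero.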


<br>

## Comparing ridge & LASSO to best subset selection

Let's return to the best subset selection method and see its symmetry with shrinkage models. We can write the best subset selection problem in the same form as LASSO. If we replace the l1-penalty in the LASSO objective function with a count of the non-zero coefficients,

$$ \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \sum_{j=1}^{p} \hat{\beta}_jX_{i,j})^2 + \lambda \sum_{j=1}^{p}I( \hat{\beta}_{j} \neq 0)$$

then the penalized problem _is_ the best subset selection problem. Here $I(\hat{\beta}_{j} \neq 0)$ is an indicator variable that is 1 if the regression coefficient $\hat{\beta}_j$ is not 0 and 0 otherwise.
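A brute-force sketch of this penalized form is shown below. It assumes an exhaustive search over all $2^p$ subsets, which is only feasible for small $p$, and a model without an intercept. The function name `best_subset`, the simulated data, and the value of $\lambda$ are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, lam):
    """Exhaustively minimize RSS + lam * (number of non-zero coefficients)."""
    n, p = X.shape
    best = (float(np.sum(y ** 2)), ())           # empty model: RSS = sum(y^2), no penalty
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            cols = list(subset)
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = float(np.sum((y - X[:, cols] @ beta) ** 2))
            score = rss + lam * k                # each retained predictor costs lam
            if score < best[0]:
                best = (score, subset)
    return best

# Simulated data (hypothetical): two strong predictors, four irrelevant ones.
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 6))
y = X @ np.array([5.0, -4.0, 0.0, 0.0, 0.0, 0.0]) + rng.normal(size=100)

score, subset = best_subset(X, y, lam=30.0)
print("selected predictors:", subset)
```

The nested loops make the cost of the search explicit: the number of candidate models doubles with every added predictor, which is what motivates the shrinkage penalties as cheaper alternatives.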

So why not always use LASSO instead of ridge regression or best subset selection? Well there are several cases where one method is preferred over the others.

* When $X$ has a lot of collinearity between variables, LASSO will pick the "best" variable out of the set of correlated variables to be the "representative" variable. In many cases the "best" variable can be arbitrary (particularly in cases of high correlation). Thus, the other variables, which may have meaningful interpretive values, are removed.

* LASSO can often be more conservative than ridge or best subset selection because it always assumes that, when $\lambda>0$, there is always some set of variables that must be zero. Thus in conditions where all _p_ variables are relevant to $Y$, LASSO will remove meaningful predictors. However, in cases where only a subset of terms in $X$ are meaningful, then the ridge regression and best subset selection routines will be too lenient and leave in unnecessary predictors.





---
# 3. Elastic net

As mentioned above, ridge regression and LASSO play off of each other. If you have a large degree of correlation in the predictor variables, ridge regression is preferred. If you do not and have reason to think that only a sparse set of predictors will relate to $Y$, then LASSO is preferred.

**But what do you do if your dataset contains both clusters of correlated variables _and_ sparsity between the clusters?**

An example of this type of problem can be found in fMRI data. In fMRI, you typically have clusters of observations (i.e., voxels) that relate to a task condition of interest, separated by large regions of task-irrelevant voxels. If you use ridge regression in this context, you risk overfitting your data with too flexible a model. If you use LASSO, on the other hand, you would likely identify only a single voxel of interest within any given cluster.

In situations like this it is ideal to have a way to mix the mutual advantages of the l2- and l1-shrinkage penalties. This is exactly what the _elastic net_ does. Elastic net includes a new tuning parameter, called $\alpha$, that sets the relative weight of the l1- and l2-penalties.

The objective function for elastic net thus looks like this.

$$ \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \sum_{j=1}^{p} \hat{\beta}_jX_{i,j})^2 + \alpha\lambda \sum_{j=1}^{p} \hat{\beta}_{j}^2 +  (1-\alpha)\lambda \sum_{j=1}^{p}| \hat{\beta}_{j}|$$

When $\alpha=1$ the solution boils down to a pure ridge regression problem. When $\alpha=0$ the solution becomes a LASSO problem. For any other scenario where $0 \lt \alpha \lt 1$, you get a mixture of the two constraints.

Just like $\lambda$, $\alpha$ is a tuning parameter, so you'll need to find it using cross-validation. The validation set used to find $\alpha$ must be independent from: a) the validation set used to tune $\lambda$ and b) the data used to fit the regression coefficients.

If you can independently tune both $\alpha$ and $\lambda$, then elastic net gives you the best of both worlds because it doesn't assume anything about the sparseness of the relationship between $X$ and $Y$. If your data is really sparse, then the tuned $\alpha$ will be close to zero. If your data is highly correlated and not sparse, then your tuned $\alpha$ will be close to 1. _Thus, with elastic net, you let your data define the sparsity of the model._
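The elastic net objective can be sketched with the same kind of coordinate descent often used for LASSO. The example below follows the convention in these notes ($\alpha=1$ is pure ridge, $\alpha=0$ is pure LASSO); note that some software packages parameterize the mix in the opposite direction (for example, in `glmnet` `alpha = 1` corresponds to LASSO). The names `soft_threshold` and `elastic_net_cd`, the simulated data, and the tuning values are assumptions for illustration, and the model has no intercept.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam, alpha, n_iter=500):
    """Minimize RSS + alpha*lam*sum(b_j^2) + (1-alpha)*lam*sum(|b_j|)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # residual with predictor j left out
            z = X[:, j] @ r_j
            # The l1 part sets the threshold; the l2 part inflates the denominator.
            beta[j] = soft_threshold(z, (1.0 - alpha) * lam / 2.0) / (col_sq[j] + alpha * lam)
    return beta

# Simulated data (hypothetical): two strong predictors, four irrelevant ones.
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 6))
y = X @ np.array([5.0, -4.0, 0.0, 0.0, 0.0, 0.0]) + rng.normal(size=100)

lam = 150.0
beta_ridge_like = elastic_net_cd(X, y, lam, alpha=1.0)   # pure l2-penalty
beta_mixed = elastic_net_cd(X, y, lam, alpha=0.5)        # equal mix of both penalties
beta_lasso_like = elastic_net_cd(X, y, lam, alpha=0.0)   # pure l1-penalty

for name, b in [("alpha=1.0", beta_ridge_like), ("alpha=0.5", beta_mixed),
                ("alpha=0.0", beta_lasso_like)]:
    print(name, np.round(b, 3))
```

In practice $\alpha$ and $\lambda$ would be chosen by cross-validation rather than fixed by hand as they are here.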