# Linear Model Selection and Regularization

The linear model
$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon
$$
is usually fitted by least squares. In many cases this method gives good results, but not always. Fitting the model with an alternative procedure can yield better *prediction accuracy* and *model interpretability*.

* **Prediction accuracy:** Provided that the true relationship between the
response and the predictors is approximately linear, the least squares
estimates will have low bias. If $n \gg p$, they also tend to have low variance, and hence will generalize well to test data. If $n$ is not much larger than $p$, the least squares fit can have high variance, which leads to overfitting and poor predictions on future observations. If $p > n$, there
is no longer a unique least squares coefficient estimate: the variance
is infinite, so the method cannot be used at all.

* **Model interpretability:**  It is often the case that some or many of the
variables used in a multiple regression model are in fact not associated with the response. Including such irrelevant variables leads to
unnecessary complexity in the resulting model. By removing these
variables—that is, by setting the corresponding coefficient estimates
to zero—we can obtain a model that is more easily interpreted. 
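The $p > n$ case in the prediction accuracy point can be illustrated with a small numerical sketch. It assumes NumPy and uses synthetic pure-noise data with no intercept; the variable names are illustrative only. Two different coefficient vectors both fit the training data perfectly, so the least squares solution is not unique.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 10                       # fewer observations than predictors
X = rng.normal(size=(n, p))
y = rng.normal(size=n)             # pure noise: y is unrelated to X

# lstsq returns the minimum-norm solution, one of infinitely many minimizers.
beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# The last right-singular vector lies in the null space of X (rank n < p),
# so adding any multiple of it leaves the fitted values unchanged.
_, _, vt = np.linalg.svd(X)
null_vec = vt[-1]
beta_alt = beta + 3.0 * null_vec

rss = float(np.sum((y - X @ beta) ** 2))
rss_alt = float(np.sum((y - X @ beta_alt) ** 2))
print(rank, rss, rss_alt)          # both RSS values are zero up to rounding
```

A perfect fit to noise is exactly the overfitting that the methods below try to avoid.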

There are many alternatives, both classical and modern, to fitting by least
squares. Three important classes of methods are:
* **Subset Selection:** This approach involves identifying a subset of the p
predictors that we believe to be related to the response. We then fit
a model using least squares on the reduced set of variables.
* **Shrinkage:** This approach involves fitting a model involving all p predictors. However, the estimated coefficients are shrunken towards zero
relative to the least squares estimates.
* **Dimension Reduction:** This approach involves projecting the p predictors into an $M$-dimensional subspace, where $M < p$. This is achieved
by computing M different linear combinations, or projections, of the
variables. Then these M projections are used as predictors to fit a
linear regression model by least squares.

Let's learn them one by one.

## Subset Selection

### Best Subset Selection

To perform best subset selection, we fit a separate least squares regression
for each possible combination of the p predictors. That is, we fit all p models
that contain exactly one predictor, all $\binom{p}{2} = p(p-1)/2$ models that contain
exactly two predictors, and so forth. We then look at all of the resulting
models, with the goal of identifying the one that is best.
The problem of selecting the best model from among the $2^p$ possibilities
considered by best subset selection is not trivial. This is usually broken up
into two stages, as described in the following algorithm:

1. Let $M_0$ denote the null model, which contains no predictors. This
model simply predicts the sample mean for each observation.
2. For $k = 1, 2, \ldots, p$:
    - Fit all $\binom{p}{k}$ models that contain exactly k predictors.
    - Pick the best among these $\binom{p}{k}$ models, and call it $M_k$. Here best is defined as having smallest RSS or, equivalently, highest $R^2$.

3. Select a single best model from among $M_0, M_1, \cdots, M_p$ using cross-validated prediction error.

As there are $2^p$ possible models, this type of selection becomes computationally expensive for large $p$.
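Steps 1 and 2 of the algorithm above can be sketched as follows. This is a minimal illustration, assuming NumPy and synthetic data; the helper names `rss` and `best_subset_path` are hypothetical and not from any library. Step 3 (choosing among $M_0, \ldots, M_p$ by cross-validation) is left out to keep the sketch short.

```python
from itertools import combinations
import numpy as np

def rss(X, y, cols):
    """RSS of a least squares fit on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def best_subset_path(X, y):
    """Return {k: (columns of M_k, RSS of M_k)} for k = 0, ..., p."""
    p = X.shape[1]
    path = {0: ((), rss(X, y, ()))}            # M_0: intercept only
    for k in range(1, p + 1):
        # Exhaustive search over all C(p, k) subsets of size k.
        best = min(combinations(range(p), k), key=lambda cols: rss(X, y, cols))
        path[k] = (best, rss(X, y, best))
    return path

# Synthetic example: only columns 0 and 2 are related to the response.
rng = np.random.default_rng(1)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 2] + 0.1 * rng.normal(size=n)

path = best_subset_path(X, y)
for k, (cols, value) in path.items():
    print(k, cols, round(value, 3))
```

Because training RSS can only decrease as predictors are added, it cannot be used to compare models of different sizes, which is why step 3 relies on an estimate of test error.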

### Stepwise Selection

For computational reasons, best subset selection cannot be applied with
very large p. Best subset selection may also suffer from statistical problems
when p is large. The larger the search space, the higher the chance of finding
models that look good on the training data, even though they might not
have any predictive power on future data. Thus an enormous search space
can lead to overfitting and high variance of the coefficient estimates.
For both of these reasons, stepwise methods, which explore a far more
restricted set of models, are attractive alternatives to best subset selection.

#### Forward Stepwise Selection

Forward stepwise selection
begins with a model containing no predictors, and then adds predictors
to the model, one-at-a-time, until all of the predictors are in the model.
In particular, at each step the variable that gives the greatest additional
improvement to the fit is added to the model. More formally, the forward
stepwise selection procedure is as follows:

1. Let $M_0$ denote the null model, which contains no predictors.
2. For $k = 0, \ldots, p - 1$:
    - Consider all $p - k$ models that augment the predictors in $M_k$ with one additional predictor.
    - Choose the best among these $p - k$ models, and call it $M_{k+1}$. Here best is defined as having smallest RSS or highest $R^2$.
3. Select a single best model from among $M_0, \ldots, M_p$ using cross-validated prediction error.

In total, this algorithm fits $1 + \sum_{k=0}^{p-1} (p-k) = 1 + p(p+1)/2$ models, compared with the $2^p$ models required by best subset selection. This is a substantial difference: when
p = 20, best subset selection requires fitting 1,048,576 models, whereas
forward stepwise selection requires fitting only 211 models.
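The two counts for p = 20 can be checked with a few lines of Python using only the standard library. The function names are illustrative.

```python
from math import comb

def n_best_subset(p):
    # One model for every subset of the p predictors: sum of C(p, k) = 2**p.
    return sum(comb(p, k) for k in range(p + 1))

def n_forward_stepwise(p):
    # The null model, plus p - k candidate models at each step k = 0, ..., p - 1.
    return 1 + sum(p - k for k in range(p))

print(n_best_subset(20), n_forward_stepwise(20))  # 1048576 211
```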

Though forward stepwise tends to do well in practice,
it is not guaranteed to find the best possible model out of all $2^p$ models containing subsets of the p predictors. For instance, suppose that in a
given data set with $p = 3$ predictors, the best possible one-variable model
contains $X_1$, and the best possible two-variable model instead contains $X_2$
and $X_3$. Then forward stepwise selection will fail to select the best possible
two-variable model, because $M_1$ will contain $X_1$, so $M_2$ must also contain
$X_1$ together with one additional variable.
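The forward stepwise procedure, and the limitation just described, can be sketched on synthetic data. This assumes NumPy; the helpers `rss` and `forward_stepwise` are hypothetical names, and the data are constructed on purpose so that column 0 (playing the role of $X_1$) is nearly the sum of columns 1 and 2 ($X_2$ and $X_3$). The final cross-validation step is omitted.

```python
from itertools import combinations
import numpy as np

def rss(X, y, cols):
    """RSS of a least squares fit on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_stepwise(X, y):
    """Return [(columns of M_k, RSS of M_k)] for k = 0, ..., p."""
    p = X.shape[1]
    selected = []
    path = [((), rss(X, y, ()))]               # M_0: intercept only
    for _ in range(p):
        remaining = [j for j in range(p) if j not in selected]
        # Add the single predictor that reduces RSS the most.
        best_j = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best_j)
        path.append((tuple(selected), rss(X, y, selected)))
    return path

rng = np.random.default_rng(2)
n = 200
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = x2 + x3 + 0.1 * rng.normal(size=n)        # noisy copy of x2 + x3
X = np.column_stack([x1, x2, x3])
y = x2 + x3 + 0.01 * rng.normal(size=n)

fwd = forward_stepwise(X, y)
best_pair = min(combinations(range(3), 2), key=lambda cols: rss(X, y, cols))

print("forward M_2:", fwd[2][0], "RSS:", round(fwd[2][1], 4))
print("best pair:  ", best_pair, "RSS:", round(rss(X, y, best_pair), 4))
```

Forward stepwise commits to column 0 in its first step, so its two-variable model must contain it, while the exhaustive search is free to pick columns 1 and 2.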

#### Backward Stepwise Selection

Backward stepwise selection works in the opposite direction: it begins with the full least squares model containing all p predictors, and then removes the least useful predictor, one at a time. The algorithm is:

1. Let $M_p$ denote the full model, which contains all p predictors.
2. For $k = p, p - 1, \ldots, 1$:
    - Consider all k models that contain all but one of the predictors in $M_k$, for a total of $k - 1$ predictors.
    - Choose the best among these k models, and call it $M_{k-1}$. Here best is defined as having smallest RSS or highest $R^2$.
3. Select a single best model from among $M_0, M_1, \ldots, M_p$ using cross-validated prediction error.
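Steps 1 and 2 of the backward procedure can be sketched in the same way. This is a minimal illustration assuming NumPy and synthetic data with $n > p$, so that the full model can be fitted; `rss` and `backward_stepwise` are hypothetical helper names, and the cross-validation step is again omitted.

```python
import numpy as np

def rss(X, y, cols):
    """RSS of a least squares fit on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def backward_stepwise(X, y):
    """Return {k: (columns of M_k, RSS of M_k)} for k = p, ..., 0."""
    p = X.shape[1]
    current = list(range(p))
    path = {p: (tuple(current), rss(X, y, current))}   # M_p: full model
    for k in range(p, 0, -1):
        # Drop the predictor whose removal increases RSS the least.
        drop = min(current,
                   key=lambda j: rss(X, y, [c for c in current if c != j]))
        current = [c for c in current if c != drop]
        path[k - 1] = (tuple(current), rss(X, y, current))
    return path

# Synthetic example: only columns 0 and 2 are related to the response.
rng = np.random.default_rng(3)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 2] + 0.1 * rng.normal(size=n)

path = backward_stepwise(X, y)
for k in sorted(path, reverse=True):
    print(k, path[k][0], round(path[k][1], 3))
```

On this data the two irrelevant columns are removed first, and the weaker of the two true predictors is removed before the stronger one.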