# 06 Linear Model Selection & Regularization

Prediction Accuracy: least squares will tend to have low bias when the true relationship is approximately linear. If $n \gg p$ then the least squares estimates will also have low variance and therefore perform well on test observations. If $n$ is not much larger than $p$ then there can be a lot of variability in the least squares fit, which results in overfitting and poor predictions on future observations. If $p > n$ then there is no longer a unique least squares coefficient estimate: the variance is infinite and the method cannot be used at all.

## Subset Selection

#### Best Subset Selection

###### Best Subset Selection
Fit a separate least squares regression for each possible combination of the $p$ predictors. We fit all $p$ models that contain exactly one predictor, all
$$\binom{p}{2} = \frac{p(p - 1)}{2}$$

models that contain exactly two predictors, and so on. Then we select the best model from all of these possible models.

###### Algorithm 
- Let $M_0$ denote the null model, which contains no predictors (predicts sample mean for each observation).
- For $k = 1, ..., p$
    - Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors
    - Pick the best among these and call it $M_k$. Here best is defined as having the smallest RSS, or largest $R^2$.
- Select a single best model from among the $M_0, ..., M_p$ using cross-validated prediction error, $C_p$, AIC, BIC, or adjusted $R^2$.

Best subset selection requires fitting $2^p$ models in total, so it is only computationally practical when $p$ is small (roughly $p < 40$).
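
The algorithm above can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available; the helper names `rss` and `best_subset` and the synthetic data are hypothetical and only serve the example. It picks each $M_k$ by smallest RSS and leaves the final choice among $M_0, ..., M_p$ to a separate criterion.

```python
from itertools import combinations

import numpy as np


def rss(X, y, cols):
    """RSS of a least squares fit on the given columns plus an intercept."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def best_subset(X, y):
    """Return {k: (columns, rss)} describing the models M_0, ..., M_p."""
    p = X.shape[1]
    best = {}
    for k in range(p + 1):
        # Fit all (p choose k) models of size k and keep the smallest RSS.
        candidates = ((cols, rss(X, y, cols)) for cols in combinations(range(p), k))
        best[k] = min(candidates, key=lambda pair: pair[1])
    return best


# Synthetic data (an assumption for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

models = best_subset(X, y)
for k, (cols, value) in models.items():
    print(k, cols, round(value, 2))
```

Training RSS never increases as $k$ grows, which is why RSS alone cannot be used to choose among $M_0, ..., M_p$.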

#### Stepwise Selection

###### Forward Stepwise Selection
- Let $M_0$ denote the null model, which contains no predictors
- For $k = 0, ..., p - 1$
    - Consider all $p - k$ models that augment the predictors in $M_k$ with one additional predictor
    - Choose the best among these $p - k$ models, and call it $M_{k + 1}$. Here best is defined as having the smallest RSS or highest $R^2$.
- Select a single best model from among $M_0, ..., M_p$ using cross-validated prediction error, $C_p$, AIC, BIC, or adjusted $R^2$.

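Doing this we fit the null model plus $p - k$ models at each step $k$, so in total we end up with
$$1 + \sum_{k = 0}^{p - 1} (p - k) = 1 + \frac{p(p + 1)}{2}$$

models, far fewer than the $2^p$ models of best subset selection.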
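
A minimal sketch of forward stepwise selection, assuming NumPy; `rss`, `forward_stepwise`, and the synthetic data are hypothetical names and values chosen for illustration. The function also counts how many models it fits, which should come out to $1 + \frac{p(p + 1)}{2}$ (the null model plus $p - k$ fits at each step).

```python
import numpy as np


def rss(X, y, cols):
    """RSS of a least squares fit on the given columns plus an intercept."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def forward_stepwise(X, y):
    """Return the nested models M_0, ..., M_p and the number of models fit."""
    p = X.shape[1]
    selected = []
    path = [list(selected)]  # M_0, the null model
    n_fits = 1
    for k in range(p):
        remaining = [j for j in range(p) if j not in selected]
        # Try each of the p - k candidates and keep the one with the smallest RSS.
        scores = {j: rss(X, y, selected + [j]) for j in remaining}
        n_fits += len(remaining)
        selected.append(min(scores, key=scores.get))
        path.append(list(selected))
    return path, n_fits


# Synthetic data (an assumption for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

path, n_fits = forward_stepwise(X, y)
print(path)
print(n_fits, 2 ** X.shape[1])  # stepwise fits vs. best subset fits
```

Because each model must contain the previous one, forward stepwise is not guaranteed to find the best model of every size, only a good nested sequence.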

###### Backward Stepwise Selection
- Let $M_p$ denote the full model, which contains all $p$ predictors.
- For $k = p, p-1, ..., 1$
    - Consider all $k$ models that contain all but one of the predictors in $M_k$, for a total of $k - 1$ predictors.
    - Choose the best among these $k$ models and call it $M_{k - 1}$. Here best is defined as having the smallest RSS or highest $R^2$.
- Select a single best model from among $M_0, ..., M_p$ using cross-validated prediction error, $C_p$, AIC, BIC, or adjusted $R^2$.

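Backward selection requires that the number of observations $n$ is larger than the number of predictors $p$, so that the full model can be fit. Forward stepwise selection can still be used when $n < p$.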
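
The backward procedure can be sketched the same way. This is an illustrative sketch assuming NumPy; `rss`, `backward_stepwise`, and the synthetic data are hypothetical. It starts from the full model and repeatedly drops the predictor whose removal leaves the smallest RSS.

```python
import numpy as np


def rss(X, y, cols):
    """RSS of a least squares fit on the given columns plus an intercept."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def backward_stepwise(X, y):
    """Return the nested models M_p, M_{p-1}, ..., M_0 as lists of column indices."""
    n, p = X.shape
    if n <= p:
        raise ValueError("backward selection needs n > p to fit the full model")
    current = list(range(p))
    path = [list(current)]  # M_p, the full model
    while current:
        # Drop the predictor whose removal leaves the smallest RSS.
        scores = {j: rss(X, y, [c for c in current if c != j]) for j in current}
        current.remove(min(scores, key=scores.get))
        path.append(list(current))
    return path


# Synthetic data (an assumption for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

path = backward_stepwise(X, y)
print(path)
```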

#### $C_p$, AIC, BIC, Adjusted $R^2$
$$C_p = \frac{(RSS + 2d \hat \sigma^2)}{n}$$

where $d$ is the number of predictors in the model and $\hat \sigma^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement. Usually $\hat \sigma^2$ is estimated using the full model containing all predictors. The penalty $2d \hat \sigma^2$ increases as the number of predictors increases, to adjust for the corresponding decrease in training RSS.

$$AIC = \frac{(RSS + 2d \hat \sigma^2)}{n \hat \sigma^2}$$

AIC is defined for a large class of models fit by maximum likelihood. It is important to note that for least squares models Mallows' $C_p$ and AIC are proportional to each other.

$$BIC = \frac{(RSS + \log(n)\, d \hat \sigma^2)}{n \hat \sigma^2}$$

Like $C_p$ and AIC, BIC tends to take on a small value for a model with low test error. Since $\log(n) > 2$ for any $n > 7$, BIC places a heavier penalty on models with many variables than $C_p$ does, and therefore favors smaller models.

$$\text{Adjusted } R^2 = 1 - \frac{ \frac{RSS}{(n - d - 1)} }{ \frac{TSS}{(n - 1)} }$$

A large adjusted $R^2$ indicates a model with a small test error. In theory, the model with the largest adjusted $R^2$ will contain only correct variables and no noise variables.
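
The four criteria can be computed directly from the formulas in this section. The sketch below uses only the standard library; the function name `selection_criteria` and the numbers for the two candidate models are hypothetical and chosen only to show the criteria disagreeing.

```python
import math


def selection_criteria(rss, tss, n, d, sigma2_hat):
    """C_p, AIC, BIC and adjusted R^2 for a least squares model with d predictors."""
    cp = (rss + 2 * d * sigma2_hat) / n
    aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (rss + math.log(n) * d * sigma2_hat) / (n * sigma2_hat)
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return {"cp": cp, "aic": aic, "bic": bic, "adj_r2": adj_r2}


# Hypothetical values: model B adds three predictors for a small drop in RSS.
n, tss, sigma2_hat = 100, 1000.0, 2.0
a = selection_criteria(rss=120.0, tss=tss, n=n, d=3, sigma2_hat=sigma2_hat)
b = selection_criteria(rss=115.0, tss=tss, n=n, d=6, sigma2_hat=sigma2_hat)

for name in ("cp", "aic", "bic", "adj_r2"):
    print(name, round(a[name], 4), round(b[name], 4))
```

With these made-up numbers, $C_p$, AIC, and BIC are lower for the smaller model A, while adjusted $R^2$ is slightly higher for the larger model B. Note also that AIC equals $C_p / \hat \sigma^2$ here, which is the proportionality mentioned above.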

#### Validation and Cross-Validation
These methods have an advantage over the criteria above because they make fewer assumptions about the true underlying model and provide a direct estimate of the test error, computed on held-out data.

###### One-Standard-Error Rule
Calculate the standard error of the estimated test MSE for each model size, then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve. If a set of models appears to be more or less equally good, it makes sense to choose the simplest one.
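
As a small sketch of the rule, assuming the cross-validated MSE and its standard error have already been computed for each model size; the function name and the numbers are hypothetical.

```python
def one_standard_error_choice(cv_mean, cv_se):
    """cv_mean[k], cv_se[k]: estimated test MSE and its standard error for model size k."""
    best = min(range(len(cv_mean)), key=lambda k: cv_mean[k])
    threshold = cv_mean[best] + cv_se[best]
    # Smallest model size whose estimated error is within one SE of the minimum.
    return next(k for k in range(len(cv_mean)) if cv_mean[k] <= threshold)


# Hypothetical cross-validation results for model sizes 0..6.
cv_mean = [10.0, 4.0, 2.45, 2.3, 2.2, 2.25, 2.4]
cv_se = [0.8, 0.5, 0.3, 0.3, 0.3, 0.3, 0.3]

choice = one_standard_error_choice(cv_mean, cv_se)
print(choice)
```

Here the lowest estimated error is at size 4, but size 2 is already within one standard error of it, so the rule selects the two-predictor model.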

## Shrinkage Methods

Shrinking the coefficient estimates toward 0 can significantly reduce their variance.

#### Ridge Regression
The coefficient estimates $\hat \beta^R$ are the values that minimize:
$$\sum_{i = 1}^n\ (y_i - \beta_0 - \sum_{j = 1}^p\ \beta_j\ x_{ij})^2 + \lambda\ \sum_{j = 1}^p\ \beta_j^2\ = RSS + \lambda\ \sum_{j = 1}^p\ \beta_j^2$$

where $\lambda \ge 0$ is a tuning parameter. The shrinkage penalty $\lambda\ \sum_{j = 1}^p\ \beta_j^2$ is small when the coefficients $\beta_1, ..., \beta_p$ are near 0, so it has the effect of shrinking the estimates toward 0 (the null model). When $\lambda = 0$ the penalty term has no effect and ridge regression produces the least squares estimates. As $\lambda \to \infty$ the impact of the shrinkage penalty grows and the coefficient estimates approach 0.

Ridge Regression will produce a separate set of coefficient estimates $\hat \beta^R_\lambda$ for each value of $\lambda$.
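
To see how the estimates change with $\lambda$, here is a minimal sketch assuming NumPy. It relies on the standard closed-form solution $(X^T X + \lambda I)^{-1} X^T y$ on centered data, which is not derived in these notes; the function name `ridge_coefficients` and the synthetic data are hypothetical.

```python
import numpy as np


def ridge_coefficients(X, y, lam):
    """Minimize RSS + lam * sum(beta_j^2); the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    # Closed-form ridge solution on the centered data.
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta


# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

lambdas = [0.0, 1.0, 10.0, 100.0, 1e6]
norms = []
for lam in lambdas:
    intercept, beta = ridge_coefficients(X, y, lam)
    norms.append(float(np.linalg.norm(beta)))
    print(lam, np.round(beta, 3))

# Least squares fit with an explicit intercept column, to compare with lam = 0.
design = np.column_stack([np.ones(len(y)), X])
ols, *_ = np.linalg.lstsq(design, y, rcond=None)
```

At $\lambda = 0$ the ridge coefficients coincide with the least squares ones, and the length of the coefficient vector shrinks steadily toward 0 as $\lambda$ grows.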

The shrinkage penalty is applied to $\beta_1, ..., \beta_p$ but not to the intercept $\beta_0$. If the predictors have been centered to have mean zero before ridge regression is performed, then the estimated intercept takes the form $\hat \beta_0 = \bar y = \sum_{i = 1}^n\ \frac{y_i}{n}$.

In Ridge Regression, multiplying a given predictor by a constant can change the coefficient estimates substantially. The standard least squares coefficient estimates discussed in Chapter 3 are scale equivariant: multiplying a predictor $X_j$ by a constant $c$ simply scales the corresponding least squares coefficient estimate by a factor of $\frac{1}{c}$, so $X_j \hat \beta_j$ stays the same. In contrast, $X_j \hat \beta_{j, \lambda}^R$ will depend not only on the value of $\lambda$, but also on the scaling of the $j$th predictor. It may even depend on the scaling of the other predictors.

This is why it is important to apply Ridge Regression after standardizing the predictors using the following formula:
$$\tilde x_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\ \sum_{i = 1}^n\ (x_{ij} - \bar x_{j})^2}}$$

The denominator is the estimated standard deviation of the $j$th predictor. As a result, all of the standardized predictors have a standard deviation of 1, so the final fit does not depend on the scale on which the predictors are measured.
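
The effect of scaling can be checked numerically. The sketch below assumes NumPy and uses the standard closed-form ridge solution on centered data; `ridge_fitted`, `standardize`, and the synthetic data are hypothetical. One predictor is multiplied by 1000 (as if recorded in different units) and the ridge fitted values are compared with and without standardization.

```python
import numpy as np


def ridge_fitted(X, y, lam):
    """Fitted values of ridge regression with an unpenalized intercept."""
    Xc = X - X.mean(axis=0)
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - y.mean()))
    return y.mean() + Xc @ beta


def standardize(X):
    """Divide each column by its standard deviation (the 1/n version)."""
    return X / X.std(axis=0)


# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0  # same predictor, different units

lam = 50.0
raw_gap = np.max(np.abs(ridge_fitted(X, y, lam) - ridge_fitted(X_rescaled, y, lam)))
std_gap = np.max(np.abs(ridge_fitted(standardize(X), y, lam)
                        - ridge_fitted(standardize(X_rescaled), y, lam)))
print(raw_gap, std_gap)
```

On the raw predictors the two fits differ noticeably, while after standardization they agree up to floating point error.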

As $\lambda$ increases, the flexibility of the Ridge Regression fit decreases, leading to decreased variance but increased bias. So a small $\lambda$ gives high variance and low bias, and as $\lambda$ increases the variance decreases at the expense of a slight increase in bias.
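
This trade-off can be illustrated with a small simulation. It is a sketch under stated assumptions: NumPy is available, the predictors are centered so no intercept is fit, the true coefficients `beta_true` and all other names and values are hypothetical, and the ridge estimate uses the standard closed-form solution. For each $\lambda$ it refits the model on many noisy responses and summarizes the squared bias and variance of the coefficient estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 5
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)  # centered predictors, so no intercept is needed below
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 0.0])  # hypothetical true coefficients
lambdas = [0.0, 10.0, 100.0, 1000.0]


def ridge(X, y, lam):
    """Closed-form ridge estimate without an intercept."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


# Refit on 500 simulated responses that share the same X.
estimates = {lam: [] for lam in lambdas}
for _ in range(500):
    y = X @ beta_true + rng.normal(scale=1.0, size=n)
    for lam in lambdas:
        estimates[lam].append(ridge(X, y, lam))

summary = {}
for lam in lambdas:
    est = np.array(estimates[lam])
    sq_bias = float(((est.mean(axis=0) - beta_true) ** 2).sum())
    variance = float(est.var(axis=0).sum())
    summary[lam] = (sq_bias, variance)
    print(lam, round(sq_bias, 4), round(variance, 4))
```

In this setup the total variance falls and the squared bias rises as $\lambda$ increases. Which $\lambda$ gives the lowest test error depends on the data and is typically chosen by cross-validation.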

Ridge Regression works best in situations where the least squares estimates have high variance, for example when the number of predictors $p$ is close to the number of observations $n$.

#### Lasso
