# Introduction to Statistical Learning - Chapter 5

- [5. Linear Model Selection and Regularization](#5.-Linear-Model-Selection-and-Regularization)
    * [5.1. Subset Selection](#5.1.-Subset-Selection)
        + [5.1.1. Best Subset Selection](#5.1.1.-Best-Subset-Selection)
        + [5.1.2. Stepwise Selection](#5.1.2.-Stepwise-Selection)
        + [5.1.3. Choosing the Optimal Model](#5.1.3.-Choosing-the-Optimal-Model)
    * [5.2. Shrinkage Methods](#5.2.-Shrinkage-Methods)
        + [5.2.1. Ridge Regression](#5.2.1.-Ridge-Regression)
        + [5.2.2. The Lasso](#5.2.2.-The-Lasso)
    * [5.3. Dimension Reduction Methods](#5.3.-Dimension-Reduction-Methods)
        + [5.3.1. Principal Components Regression](#5.3.1.-Principal-Components-Regression)     

# 5. Linear Model Selection and Regularization

$$ Y = \beta_{0} + \beta_{1}X_{1} + ... + \beta_{p}X_{p} + \epsilon $$

- While the linear model has distinct advantages in terms of inference, most problems are non-linear
    * Replacing plain least squares with alternative fitting procedures can yield better `prediction accuracy` and `interpretability`

**Prediction Accuracy**
- Provided the true relationship between the response and the predictors is approximately linear, the least squares estimates will have low bias
    * If $n \gg p$, the least squares estimates will also tend to have low variance

**Model Interpretability**
- Often, some or many of the variables used in a multiple regression model are in fact not associated with the response
    * Including irrelevant variables leads to unnecessary complexity in the resulting model
        + Need for feature selection or variables selection to exclude irrelevant variables
        
**Classes of Feature Selection**
- Subset Selection
    * Identifying a subset of the p predictors that we believe to be related to the response
        + Fit the model on this reduced set of variables
- Shrinkage
    * Fitting a model with all $p$ predictors
        + However, the estimated coefficients are shrunken towards zero relative to the least squares estimates, which reduces variance
        + Also known as regularization
- Dimension Reduction
    * Projecting the $p$ predictors into an $M$-dimensional subspace where $M < p$
        + Computing M different linear combinations or projections of the variables

## 5.1. Subset Selection

### 5.1.1. Best Subset Selection

- Fit a separate least squares regression for each combination of the $p$ predictors
    * Obtain the best model from each model size according to $RSS$ or $R^{2}$
    * For other types of models, like logistic regression, `deviance` can be used as a measure of comparison
        + Deviance is $-2 \times$ the maximized log-likelihood; the smaller the deviance, the better the fit
- Methodology:
    * Let $M_{0}$ denote the null model with no predictors
    * For $k = 1, 2, \ldots, p$:
        + Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors
        + Pick the best among these $\binom{p}{k}$ models, defined as the one with the smallest RSS or largest $R^{2}$, and call it $M_{k}$
    * Select a single best model from among $M_{0}, \ldots, M_{p}$ using cross-validated prediction error, $C_{p}$, AIC, BIC or adjusted $R^{2}$
- The RSS decreases monotonically and the $R^{2}$ increases monotonically as the number of features included in the model increases
    * However, we are interested in the lowest test error rate and not the lowest training error rate
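
The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper names `rss` and `best_subset` and the simulated data are assumptions made for this example, and the exhaustive loop is only practical for small $p$.

```python
from itertools import combinations

import numpy as np


def rss(X, y, cols):
    """Training RSS of a least squares fit (with intercept) on the columns in `cols`."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def best_subset(X, y):
    """Return {k: (best_columns, rss)} for k = 0..p by exhaustive search."""
    p = X.shape[1]
    best = {}
    for k in range(p + 1):
        # All C(p, k) subsets of size k; k = 0 is the null model M_0.
        fits = ((cols, rss(X, y, cols)) for cols in combinations(range(p), k))
        best[k] = min(fits, key=lambda t: t[1])
    return best


# Simulated data (hypothetical): only predictors 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=100)

models = best_subset(X, y)
for k, (cols, value) in models.items():
    print(k, cols, round(value, 2))
```

The training RSS of the best model never increases with $k$, which is why the final choice among $M_{0}, \ldots, M_{p}$ has to rely on an estimate of test error rather than on RSS.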
    
**Limitations of Best Subset Selection**
- Becomes computationally infeasible for values of $p$ greater than about 40, since $2^{p}$ models must be considered
- Computational shortcuts such as branch-and-bound only work for least squares linear regression
- The larger the search space, the higher the chance of finding models that look good on the training data

### 5.1.2. Stepwise Selection

**Forward Stepwise Selection**
- Computationally efficient alternative to best subset selection as it only fits $1 + p(p+1)/2$ models
- Methodology:
    * Begin with a null model with no predictors
    * Add predictors one-at-a-time, until all of the predictors are in the model
    * At each step, the variable that gives the greatest additional improvement to the fit is added to the model
    * Select a single best model using cross-validated prediction error, $C_{p}$, AIC, BIC or adjusted $R^{2}$
- Limitations of Forward Selection
    * Not guaranteed to find the best possible model
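
A minimal sketch of forward stepwise selection, under the assumption that "improvement to the fit" is measured by training RSS. The function names and the simulated data are hypothetical. The sketch also counts how many models it fits, which should match the $1 + p(p+1)/2$ figure quoted above.

```python
import numpy as np


def rss(X, y, cols):
    """Training RSS of a least squares fit (with intercept) on the columns in `cols`."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def forward_stepwise(X, y):
    """Return the greedy path M_0, ..., M_p and the number of models fitted."""
    p = X.shape[1]
    selected = []
    path = [((), rss(X, y, []))]  # M_0: intercept only
    n_fits = 1
    while len(selected) < p:
        remaining = [j for j in range(p) if j not in selected]
        # Try adding each remaining predictor to the current model.
        scores = {j: rss(X, y, selected + [j]) for j in remaining}
        n_fits += len(remaining)
        best_j = min(scores, key=scores.get)
        selected.append(best_j)
        path.append((tuple(selected), scores[best_j]))
    return path, n_fits


# Simulated data (hypothetical): only predictors 0 and 3 matter.
rng = np.random.default_rng(1)
p = 6
X = rng.normal(size=(80, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=80)

path, n_fits = forward_stepwise(X, y)
print(n_fits, 1 + p * (p + 1) // 2)
```

Because each step only extends the previous model, the path is nested, which is the reason the search can miss the best subset of a given size.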

**Backward Stepwise Selection**
- Like forward stepwise selection, searches through only $1 + p(p+1)/2$ models
- Methodology:
    * Begin with the full least squares model containing all $p$ predictors
    * Iteratively remove the least useful predictor one-at-a-time
        + Compare using smallest RSS or highest $R^{2}$
    * Select a single best model using cross validated prediction error, $C_{p}$, $AIC$, $BIC$ or adjusted $R^{2}$
- Requires that the number of samples $n$ is larger than the number of variables $p$, so that the full model can be fit
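
Backward stepwise selection can be sketched in the same style. As before, this is an illustration under assumptions: "least useful" is taken to mean the predictor whose removal increases training RSS the least, and the names and data are hypothetical.

```python
import numpy as np


def rss(X, y, cols):
    """Training RSS of a least squares fit (with intercept) on the columns in `cols`."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def backward_stepwise(X, y):
    """Return the path M_p, M_{p-1}, ..., M_0 obtained by greedy removal."""
    n, p = X.shape
    if n <= p:
        raise ValueError("backward selection needs n > p so the full model can be fit")
    current = list(range(p))
    path = [(tuple(current), rss(X, y, current))]
    while current:
        # RSS of the model obtained by dropping each predictor in turn.
        scores = {j: rss(X, y, [c for c in current if c != j]) for j in current}
        drop = min(scores, key=scores.get)
        current.remove(drop)
        path.append((tuple(current), scores[drop]))
    return path


# Simulated data (hypothetical): only predictors 0 and 3 matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=80)

path = backward_stepwise(X, y)
```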

**Hybrid Approaches**
- After adding each new variable, the method may also remove any variables that no longer provide an improvement in the model fit
    * More closely mimics best subset selection while retaining the computational advantages of forward and backward stepwise selection

### 5.1.3. Choosing the Optimal Model

- Approaches to choosing optimal model:
    * Indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting
    * Directly estimate the test error, using either a validation set approach or a cross-validation approach

**$C_{p}$**
$$ C_{p} = \frac{1}{n}(RSS + 2d\hat{\sigma}^{2}) $$

where $d$ is the number of predictors in the model and $\hat{\sigma}^{2}$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement
- The $C_{p}$ statistic adds a penalty of $2d\hat{\sigma}^{2}$ to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error

**AIC**
$$ AIC = \frac{1}{n\hat{\sigma}^{2}}(RSS + 2d\hat{\sigma}^{2}) $$
- Defined for a large class of models fit by maximum likelihood

**BIC**
$$ BIC = \frac{1}{n\hat{\sigma}^{2}}(RSS + \log(n)d\hat{\sigma}^{2}) $$
- Derived from a Bayesian point of view
- Like $C_{p}$, BIC tends to take on a small value for a model with a low test error
- Since $\log n > 2$ for any $n > 7$, the BIC statistic places a heavier penalty on models with many variables than $C_{p}$ does, and so tends to select smaller models

**Adjusted $R^{2}$**

$$ \text{Adjusted } R^{2} = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}$$

where $R^{2} = 1 - RSS/TSS$ and $TSS = \sum(y_{i}-\bar{y})^{2}$
- A larger value of adjusted $R^{2}$ indicates a model with a small test error
- Once all the correct variables have been included, additional noise variables will only lead to a very small decrease in RSS
    * Because $d$ also increases, this results in an increase in $\frac{RSS}{n-d-1}$ and consequently a decrease in the adjusted $R^{2}$
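
The four criteria translate directly into code. The sketch below follows the formulas as written in these notes; the function names are assumptions, and the RSS values for the three candidate models are made-up numbers chosen only to show the mechanics.

```python
import math


def cp(rss, n, d, sigma2):
    """Mallows' C_p: (RSS + 2 d sigma^2) / n."""
    return (rss + 2 * d * sigma2) / n


def aic(rss, n, d, sigma2):
    """AIC as defined above: proportional to C_p for least squares models."""
    return (rss + 2 * d * sigma2) / (n * sigma2)


def bic(rss, n, d, sigma2):
    """BIC: the factor 2 is replaced by log(n)."""
    return (rss + math.log(n) * d * sigma2) / (n * sigma2)


def adjusted_r2(rss, tss, n, d):
    """Adjusted R^2 = 1 - [RSS / (n - d - 1)] / [TSS / (n - 1)]."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))


# Hypothetical inputs: n observations, total sum of squares, error-variance
# estimate, and the training RSS of candidate models with d predictors.
n, tss, sigma2 = 100, 500.0, 1.0
candidates = {2: 120.0, 3: 100.0, 4: 99.0}  # d -> RSS

best_by_cp = min(candidates, key=lambda d: cp(candidates[d], n, d, sigma2))
best_by_bic = min(candidates, key=lambda d: bic(candidates[d], n, d, sigma2))
best_by_adj_r2 = max(candidates, key=lambda d: adjusted_r2(candidates[d], tss, n, d))
print(best_by_cp, best_by_bic, best_by_adj_r2)
```

In this toy input the fourth predictor lowers RSS by only 1, so every criterion prefers the three-predictor model even though its training RSS is higher.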
    
**Validation and Cross-Validation**
- AIC, BIC, $C_{p}$ and adjusted $R^{2}$ were attractive approaches when cross-validation was computationally prohibitive
- Another way to choose the best model is the one-standard-error rule
    * Calculate the standard error of the estimated test MSE for each model size
    * Select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve
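
A small sketch of the one-standard-error rule, assuming the usual reading: take the standard error at the minimum of the curve as the tolerance, then pick the smallest model whose estimated test MSE falls within it. The function name and the cross-validation numbers are hypothetical.

```python
def one_standard_error_choice(cv_mse, cv_se):
    """Index of the smallest model within one standard error of the minimum.

    cv_mse[i] and cv_se[i] are the estimated test MSE and its standard error
    for the i-th model, with models ordered from smallest to largest.
    """
    best = min(range(len(cv_mse)), key=lambda i: cv_mse[i])
    threshold = cv_mse[best] + cv_se[best]
    for i, mse in enumerate(cv_mse):
        if mse <= threshold:
            return i
    return best  # unreachable in practice: `best` itself satisfies the test


# Made-up cross-validation results for model sizes 0..4.
cv_mse = [10.0, 4.0, 2.2, 2.0, 2.1]
cv_se = [0.8, 0.4, 0.3, 0.3, 0.3]
chosen = one_standard_error_choice(cv_mse, cv_se)
print(chosen)
```

Here the minimum is at size 3, but size 2 is within one standard error of it, so the simpler model is preferred.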

## 5.2. Shrinkage Methods

- Fit a model containing all predictors using a technique that constrains and regularizes the coefficient estimates or equivalently that shrinks the coefficient estimates towards zero
    * Shrinking the coefficient estimates can significantly reduce their variance

### 5.2.1. Ridge Regression

$$ \sum^{n}_{i=1}\left(y_{i}-\beta_{0}-\sum^{p}_{j=1}\beta_{j}x_{ij}\right)^{2} + \lambda\sum^{p}_{j=1}\beta^{2}_{j} = RSS + \lambda\sum^{p}_{j=1}\beta^{2}_{j}$$
where $\lambda \ge0$ is a tuning parameter
- Ridge regression seeks coefficient estimates that fit the data well, by making the RSS small
- The second term, $\lambda\sum_{j}\beta_{j}^{2}$, called a `shrinkage penalty`, is small when $\beta_{1},...,\beta_{p}$ are close to zero, so it has the effect of shrinking the estimates towards zero
- The tuning parameter $\lambda$ controls the relative impact of these two terms on the regression coefficient estimates
    * When $\lambda = 0$, the penalty term has no effect and ridge regression produces the least squares estimates
    * As $\lambda \rightarrow \infty$, the impact of the shrinkage penalty grows and the ridge regression coefficient estimates approach zero
- Ridge regression produces a different set of coefficient estimates, $\hat{\beta}^{R}_{\lambda}$, for each value of $\lambda$
- The shrinkage penalty is applied to $\beta_{1},...,\beta_{p}$ but not to the intercept $\beta_{0}$
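
For a fixed $\lambda$ the ridge criterion has a closed-form minimizer. The sketch below is one way to compute it, assuming the predictors and response are centered first so that the intercept stays unpenalized; the function name `ridge` and the simulated data are illustrative assumptions.

```python
import numpy as np


def ridge(X, y, lam):
    """Minimise RSS + lam * sum(beta_j ** 2), leaving the intercept unpenalised."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean           # centring removes the intercept
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta            # recover the intercept
    return beta0, beta


# Simulated data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=50)

lambdas = [0.0, 1.0, 10.0, 100.0, 1e6]
norms = [float(np.linalg.norm(ridge(X, y, lam)[1])) for lam in lambdas]
print(norms)
```

With `lam = 0` this reduces to least squares, and the length of the coefficient vector shrinks towards zero as `lam` grows, matching the two limiting cases listed above.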

**Least Squares vs Ridge Regression**
- Least squares coefficient estimates are scale equivariant
    * Multiplying $X_{j}$ by a constant $c$ simply scales the least squares coefficient estimate by a factor of $1/c$, so $X_{j}\hat{\beta}_{j}$ is unchanged
- Ridge regression coefficient estimates can change substantially when a given predictor is multiplied by a constant
    * $X_{j}\hat{\beta}^{R}_{j,\lambda}$ depends on $\lambda$ and on the scaling of the $j$th predictor, and possibly on the scaling of the other predictors
        + Best to apply ridge regression after standardizing the predictors, so that they are all on the same scale, using the formula
$$ \tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum^{n}_{i=1}(x_{ij}-\bar{x}_{j})^{2}}}$$
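
The standardization formula maps onto a single NumPy call. A minimal sketch, with the function name `standardize` assumed for illustration; note that, like the formula, it divides by the standard deviation (computed with $1/n$) without subtracting the mean.

```python
import numpy as np


def standardize(X):
    """Divide each column by its standard deviation (1/n convention, ddof=0)."""
    return X / X.std(axis=0)


# Simulated data (hypothetical); the second copy measures column 0 in other units.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(30, 3))
X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0

print(np.allclose(standardize(X), standardize(X_rescaled)))
```

After standardizing, the units in which a predictor was recorded no longer matter, so a ridge fit on the standardized columns does not depend on them either.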

**Improvement of Least Squares using Ridge Regression**
- Ridge regression's advantage lies in the *bias-variance trade-off*
    * As $\lambda$ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias
    * For moderate values of $\lambda$, the shrinkage of the ridge coefficient estimates leads to a substantial reduction in the variance of the predictions at the expense of a slight increase in bias
        + In the textbook's simulated example, for $\lambda$ up to about 10 the variance decreases rapidly with very little increase in bias, so the test MSE drops
- If $p > n$, the least squares estimates do not have a unique solution, whereas ridge regression can still perform well by trading off a small increase in bias for a large decrease in variance
    * Ridge regression works best in situations where the least squares estimates have high variance
- Ridge regression has a computational advantage over best subset selection: for any fixed value of $\lambda$ it only fits a single model, and the model-fitting procedure can be performed quickly

### 5.2.2. The Lasso

**Disadvantage of Ridge Regression**
- Ridge regression will shrink all the coefficients towards zero but not set any to exactly zero (unless $\lambda = \infty$)
- Creates a challenge in model interpretation when the number of variables $p$ is quite large

**Lasso Regression**
$$ \sum^{n}_{i=1}\left(y_{i}-\beta_{0}-\sum^{p}_{j=1}\beta_{j}x_{ij}\right)^{2} + \lambda\sum^{p}_{j=1}|\beta_{j}| = RSS + \lambda\sum^{p}_{j=1}|\beta_{j}|$$

where $\lambda\sum^{p}_{j=1}|\beta_{j}|$ is the lasso penalty

- The difference between ridge regression and the lasso is that the lasso penalty, which uses $|\beta_{j}|$ in place of $\beta_{j}^{2}$, forces some of the coefficient estimates to be exactly zero when the tuning parameter $\lambda$ is sufficiently large
- When $\lambda = 0$, the lasso gives the least squares fit; when $\lambda$ becomes sufficiently large, it gives the null model in which all coefficient estimates equal zero
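
Unlike ridge regression, the lasso has no closed-form solution in general. One common way to minimize the criterion is cyclic coordinate descent with soft-thresholding; the sketch below uses that technique as an assumption of this example (the notes do not prescribe a solver), with hypothetical function names and simulated data.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink z towards zero by t, returning exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise RSS + lam * sum(|beta_j|) by cyclic coordinate descent."""
    Xc = X - X.mean(axis=0)                   # centring keeps the intercept unpenalised
    yc = y - y.mean()
    p = X.shape[1]
    beta = np.zeros(p)
    col_ss = (Xc ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove the fit of every predictor except j.
            r_j = yc - Xc @ beta + Xc[:, j] * beta[j]
            beta[j] = soft_threshold(Xc[:, j] @ r_j, lam / 2) / col_ss[j]
    beta0 = y.mean() - X.mean(axis=0) @ beta
    return beta0, beta


# Simulated data (hypothetical): only predictors 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.3, size=100)

_, beta_sparse = lasso_cd(X, y, lam=100.0)
print(beta_sparse)
```

The threshold `lam / 2` comes from differentiating the criterion exactly as written above (RSS without a $1/2$ factor); libraries that scale the RSS term differently use a correspondingly different threshold.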

**Comparing the Lasso and Ridge Regression**
- Advantages of Lasso over Ridge Regression
    * More interpretable models that involve only a subset of the predictors
    * When only a relatively small subset of predictors is truly related to the response, the lasso tends to outperform ridge regression
- Advantages of Ridge over Lasso Regression
    * Ridge regression tends to perform better when the response is a function of many predictors, all with coefficients of roughly equal size

## 5.3. Dimension Reduction Methods

- Reduces the problem of estimating $p + 1$ coefficients to the simpler problem of estimating $M + 1$ coefficients, where $M < p$
- Dimension reduction serves to constrain the estimated $\beta_{j}$ coefficients 
    * However, it has the potential to bias the coefficient estimates
- Dimension reduction works in two steps:
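    * Obtain the transformed predictors $Z_{1},...,Z_{M}$ as linear combinations of the original predictors
    * Fit the model by least squares using these $M$ transformed predictors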
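
The two steps can be sketched as follows. The weight matrix `Phi` is an arbitrary illustration (simple averages of pairs of predictors), not a recommended choice, and the symbols $\theta$ and $\phi_{jm}$ are notation assumed for this example.

```python
import numpy as np

# Simulated data (hypothetical).
rng = np.random.default_rng(0)
n, p, M = 60, 4, 2
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 1.0, -1.0, -1.0]) + rng.normal(scale=0.3, size=n)

# Step 1: transform. Column m of Phi holds the weights phi_{jm} that define Z_m.
Phi = np.array([[0.5, 0.0],
                [0.5, 0.0],
                [0.0, 0.5],
                [0.0, 0.5]])
Z = X @ Phi                                    # n x M matrix of new predictors

# Step 2: least squares on the M transformed predictors plus an intercept.
A = np.column_stack([np.ones(n), Z])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)  # M + 1 coefficients

# Implied coefficients on the original predictors: beta_j = sum_m theta_m * phi_{jm}.
beta = Phi @ theta[1:]
print(theta.shape, beta)
```

Only $M + 1$ numbers are estimated, and the implied $\beta_{j}$ are forced to lie in the span of the columns of `Phi`, which is the constraint (and the potential source of bias) mentioned above.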

### 5.3.1. Principal Components Regression

- Popular approach for deriving a low-dimensional set of features from a large set of variables

**An Overview of Principal Components Analysis**
- PCA is a technique for reducing the dimension of an $n \times p$ data matrix **X**
    * First principal component direction of the data is that along which the observations vary the most
    
$$ Z_{1} = \phi_{11} \times (X_{1} -\bar{X_{1}}) + \phi_{21} \times (X_{2} -\bar{X_{2}}) $$

where $\phi_{11}$ and $\phi_{21}$ are the principal component loadings and $Z_{1}$ holds the principal component scores

- This first principal component line minimizes the sum of squared perpendicular distances between each point and the line
    * Projected observations are as close as possible to the original observations
    * The values of the first principal component $Z_{1}$ summarize the joint values of the predictors for each observation
- The second principal component $Z_{2}$ is a linear combination of the variables that is uncorrelated with $Z_{1}$ and has the largest variance subject to this constraint
    * The zero correlation between $Z_{1}$ and $Z_{2}$ is equivalent to the condition that the direction must be perpendicular, or orthogonal to the first principal component direction

$$ Z_{2} = \phi_{21} \times (X_{1} -\bar{X_{1}}) - \phi_{11} \times (X_{2} -\bar{X_{2}}) $$
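
The two-predictor case above can be checked numerically. This sketch computes the loadings from the singular value decomposition of the centered data, which is one standard way to obtain principal components; the simulated data and variable names are assumptions of the example.

```python
import numpy as np

# Simulated data (hypothetical): two correlated predictors.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.6 * x1 + rng.normal(scale=0.5, size=200)
X = np.column_stack([x1, x2])

Xc = X - X.mean(axis=0)                        # subtract the column means
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

phi1, phi2 = Vt[0], Vt[1]                      # unit-length loading vectors
Z = Xc @ Vt.T                                  # columns are the scores Z_1 and Z_2

print(phi1, phi2)
print(np.corrcoef(Z[:, 0], Z[:, 1])[0, 1])
```

Up to an overall sign, the second loading vector is $(\phi_{21}, -\phi_{11})$, the only direction in two dimensions that is perpendicular to the first, and the two score columns are uncorrelated.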

**The Principal Components Regression Approach**