# Introduction to Statistical Learning - Chapter 5


# 5. Linear Model Selection and Regularization

$$ Y = \beta_{0} + \beta_{1}X_{1} + ... + \beta_{p}X_{p} + \epsilon $$

- While the linear model has distinct advantages in terms of inference, most problems are non-linear
    * Replacing plain least squares with alternative fitting procedures can yield better `prediction accuracy` and `interpretability`

**Prediction Accuracy**
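- Provided the true relationship between the response and the predictors is approximately linear, the least squares estimates will have low bias
    * If $n \gg p$, the least squares estimates will also have low variance
    * If $n$ is not much larger than $p$, the fit can vary a lot and overfit; if $p > n$, there is no unique least squares solution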

**Model Interpretability**
- Some or many of the variables used in a multiple regression model are often not associated with the response
    * Including irrelevant variables leads to unnecessary complexity in the resulting model
        + Feature selection (also called variable selection) is needed to exclude the irrelevant variables
        
**Classes of Feature Selection**
- Subset Selection
    * Identifying a subset of the p predictors that we believe to be related to the response
        + Fit the model by least squares on this reduced set of variables
- Shrinkage
    * Fitting a model with all p predictors
        + However, the estimated coefficients are shrunken towards zero relative to the least squares estimates, which reduces variance
        + Also known as regularization
- Dimension Reduction
    * Projecting the p predictors into an M-dimensional subspace where M < p
        + Computing M different linear combinations or projections of the variables
        + These M projections are then used as the predictors in a least squares fit

## 5.1. Subset Selection

## 5.1.1. Best Subset Selection

- Fit a separate least squares regression for each combination of the $p$ predictors
    * Obtain the best model for each model size according to $RSS$ or $R^{2}$
    * For other types of models, like logistic regression, `deviance` can be used as a measure of comparison
        + Deviance is $-2 \times$ the maximized log-likelihood; the smaller the deviance, the better the fit
- Methodology:
    * Let $M_{0}$ denote the null model with no predictors
    * For $k = 1, 2, \ldots, p$:
        + Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors
        + Pick the best among these $\binom{p}{k}$ models, defined as the one with the smallest RSS or largest $R^{2}$, and call it $M_{k}$
    * Select a single best model from among $M_{0}, \ldots, M_{p}$ using one of:
        + Cross-validated prediction error, $C_{p}$, AIC, BIC or adjusted $R^{2}$
- The RSS decreases monotonically and the $R^{2}$ increases monotonically as the number of features included in the model increases.
    * However, we are interested in the lowest test error rate and not the lowest training error rate, so RSS and $R^{2}$ cannot be used to choose between models of different sizes
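
The methodology above can be sketched in a few lines of Python. This is only an illustrative sketch, assuming NumPy is available; the helper names `rss` and `best_subset` and the simulated data are made up for this example and do not come from the book.

```python
from itertools import combinations

import numpy as np


def rss(X, y, cols):
    """RSS of a least squares fit on the chosen columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def best_subset(X, y):
    """Return {k: (columns, RSS)} holding the lowest-RSS model of each size k."""
    p = X.shape[1]
    best = {0: ((), rss(X, y, ()))}  # M_0: the null (intercept-only) model
    for k in range(1, p + 1):
        # Fit all C(p, k) models with exactly k predictors, keep the best one.
        candidates = ((cols, rss(X, y, cols)) for cols in combinations(range(p), k))
        best[k] = min(candidates, key=lambda item: item[1])
    return best


# Simulated data (an assumption for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

models = best_subset(X, y)
for k, (cols, value) in models.items():
    print(k, cols, round(value, 2))
```

The printed RSS never increases with k, which is why the final choice among $M_{0}, \ldots, M_{p}$ has to rely on a test error estimate rather than on training RSS.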
    
**Limitations of Best Subset Selection**
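- Becomes computationally infeasible for values of p greater than about 40, since $2^{p}$ models must be considered
- Computational shortcuts that avoid fitting every model (branch-and-bound) exist, but they only work for least squares linear regression
- The larger the search space, the higher the chance of finding models that look good on the training data but have no predictive power on future data (overfitting)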
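
The computational limit comes from counting: summing $\binom{p}{k}$ over all model sizes gives $2^{p}$ candidate models. A small sketch of that arithmetic (the function name `n_best_subset_models` is made up for this example):

```python
from math import comb


def n_best_subset_models(p):
    """Number of models best subset selection must consider for p predictors."""
    # Sum of C(p, k) over k = 0..p, which equals 2**p.
    return sum(comb(p, k) for k in range(p + 1))


for p in (10, 20, 40):
    print(p, n_best_subset_models(p))
# 10 1024
# 20 1048576
# 40 1099511627776
```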

## 5.1.2. Stepwise Selection

**Forward Stepwise Selection**
- Computationally efficient alternative to best subset selection as it only fits $1 + p(p+1)/2$ models
- Methodology:
    * Begin with a null model with no predictors
    * Add predictors one-at-a-time, until all of the predictors are in the model
    * At each step, the variable that gives the greatest additional improvement to the fit is added to the model
    * Select the single best model from the resulting sequence using cross-validated prediction error, $C_{p}$, AIC, BIC or adjusted $R^{2}$
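
A minimal sketch of the forward stepwise procedure, assuming NumPy is available. The names `rss` and `forward_stepwise` and the simulated data are illustrative assumptions, not code from the book. The sketch also counts how many models are fit, so the $1 + p(p+1)/2$ figure can be checked directly.

```python
import numpy as np


def rss(X, y, cols):
    """RSS of a least squares fit on the chosen columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def forward_stepwise(X, y):
    """Return the sequence of models M_0..M_p and the number of models fit."""
    p = X.shape[1]
    selected, remaining = [], set(range(p))
    path = [((), rss(X, y, ()))]  # start from the null model
    n_fits = 1
    while remaining:
        # Try adding each remaining predictor to the current model.
        scores = {j: rss(X, y, selected + [j]) for j in sorted(remaining)}
        n_fits += len(scores)
        j_best = min(scores, key=scores.get)  # greatest reduction in RSS
        selected.append(j_best)
        remaining.remove(j_best)
        path.append((tuple(selected), scores[j_best]))
    return path, n_fits


# Simulated data (an assumption for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

path, n_fits = forward_stepwise(X, y)
print(n_fits)  # 1 + 4 * 5 / 2 = 11 models for p = 4
for cols, value in path:
    print(cols, round(value, 2))
```

Each tuple in `path` lists predictors in the order they were added, so every model contains the previous one. That nesting is what keeps the search cheap, and also what prevents the procedure from revisiting an earlier choice.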

**Limitations of Forward Selection**
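- Not guaranteed to find the best possible model out of all $2^{p}$ models, because each model must contain all the predictors chosen in earlier steps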

**Backward Stepwise Selection**
