In [1]:
import pandas as pd
import numpy as np

import itertools
import time
import statsmodels.api as sm

from sklearn.preprocessing import scale 
from sklearn import model_selection
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error

# plots
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams['text.color'] = 'k'
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

# Linear Model Selection and Regularization

* [Variable Selection](#Variable-Selection)
* [The Test Error](#The-Test-Error)
    * [Indirect Estimations](#Indirect-Estimations)
* [Shrinkage Methods](#Shrinkage-Methods)
    * [Ridge Regression](#Ridge-Regression)
    * [Lasso Regression](#Lasso-Regression)

Recall the linear regression model: $$Y= \beta_0 + \sum_{j=1}^{p}\beta_{j} X_{j} + \epsilon,$$ where $X_j$ represents the $j$-th predictor, $\beta_j$ quantifies the association between that variable and the response, and $\epsilon$ is a random error term.

The linear regression model is commonly used to describe the relationship between a response Y and a set of variables $X_j,~1\leq j\leq p$. 

**Prediction Accuracy**: Provided that the true relationship between the response and the predictors is approximately linear, the least-squares estimates will have low bias.
Let $n$ denote the number of observations and $p$ the number of variables.

* If $n \gg  p$  ($n$ is much larger than $p$), least squares estimates tend to also have low variance and, thus, can perform well on test observations.

* If $n$ is **not** much larger than $p$, then there can be a lot of variability in the least-squares fit, resulting in **overfitting** and consequently poor predictions on future observations not used in model training.

* If $n<p$, then there is no longer a unique least squares coefficient estimate: the variance is infinite so the **method cannot be used at all**. 
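
The $n<p$ case can be checked numerically. The sketch below uses small synthetic data (the sizes, seed, and variable names are illustrative assumptions, not part of the original notebook) to show that $X^TX$ is rank deficient and that two different coefficient vectors fit the training data equally well.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 8                          # fewer observations than predictors
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# X^T X is p x p but has rank at most n, so it is singular when n < p.
gram_rank = np.linalg.matrix_rank(X.T @ X)

# lstsq returns the minimum-norm least-squares solution.
beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

# Any vector in the null space of X can be added without changing the fit.
_, _, Vt = np.linalg.svd(X)          # full SVD: the last p - n rows of Vt span the null space
null_vec = Vt[-1]
beta_other = beta_min_norm + 3.0 * null_vec

resid_a = np.linalg.norm(y - X @ beta_min_norm)
resid_b = np.linalg.norm(y - X @ beta_other)
print(gram_rank, resid_a, resid_b)
```

Both residual norms should be numerically zero even though the two coefficient vectors differ, which is what "no unique least squares estimate" means in practice.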

By **constraining** or **shrinking** the estimated coefficients, the variance can often be reduced substantially at the cost of a negligible increase in bias.

**Model Interpretability**: In many multiple regression models, several variables are not associated with the response. Including such irrelevant variables makes the resulting model unnecessarily complex.
Removing these variables yields a model that is more easily interpreted. This can be done by setting the corresponding coefficient estimates
to zero. However, least squares is extremely unlikely to yield any coefficient estimates that are exactly zero.

## Variable Selection
There are three important classes of methods that can be used for **variable selection**:

* **Subset Selection**: a subset of the $p$ predictors is used for fitting a model using least squares.

* **Shrinkage (regularization)**: A model is fitted using all $p$ predictors. However, the estimated coefficients are shrunken towards zero relative to the least-squares estimates.

* **Dimension Reduction**: The $p$ predictors are projected into an $M$-dimensional subspace, where $M < p$. These $M$ projections are then used as predictors to fit a linear regression model by least squares.

* <font color='Blue'>**Subset Selection**</font>:
    * **Best Subset Selection**: For each $k = 1, \dots, p$, all $\binom{p}{k}$ models that contain exactly $k$ predictors are fitted, and then a single best model is chosen.
    * **Stepwise Selection**:
        * <font color='Green'>**Forward Stepwise Selection**</font>: It begins with a model containing no predictors, and then predictors are added to the model one at a time until all of the predictors are in the model. At each step, the variable that gives the greatest additional improvement to the fit is added.
        * <font color='Green'>**Backward Stepwise Selection**</font>: Unlike forward stepwise selection, the process begins with the full least squares model containing all $p$ predictors, and then the least useful predictor is removed iteratively, one at a time.
        * <font color='Green'>**Hybrid Approaches**</font>: Variables are added to the model sequentially; however, after adding each new variable, the method may also remove any variables that no longer provide an improvement in the model fit.
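
The subset selection procedures above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the notebook's own implementation: the helper names `rss`, `best_subset`, and `forward_stepwise` are hypothetical, and models are compared by training RSS only (choosing among different model sizes would need one of the test-error estimates).

```python
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

def rss(X, y, cols):
    """Residual sum of squares of a least-squares fit on the given columns."""
    cols = list(cols)
    if not cols:                                   # intercept-only model
        return float(np.sum((y - y.mean()) ** 2))
    model = LinearRegression().fit(X[:, cols], y)
    return float(np.sum((y - model.predict(X[:, cols])) ** 2))

def best_subset(X, y, k):
    """Fit all C(p, k) models with exactly k predictors; return the lowest-RSS one."""
    p = X.shape[1]
    return min(itertools.combinations(range(p), k), key=lambda c: rss(X, y, c))

def forward_stepwise(X, y):
    """Return the nested sequence of models visited by forward stepwise selection."""
    remaining = set(range(X.shape[1]))
    selected, path = [], []
    while remaining:
        # Add the predictor that lowers the RSS the most.
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
        path.append(tuple(selected))
    return path

# Synthetic data (an assumption for illustration): only columns 0 and 3 matter.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.5, size=200)

best_two = best_subset(X_demo, y_demo, k=2)
path = forward_stepwise(X_demo, y_demo)
print(best_two, path[:2])
```

Best subset fits $\binom{p}{k}$ models for each size $k$, while forward stepwise fits only $p - k$ candidates at step $k+1$, which is why the stepwise variant scales to larger $p$.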

## The Test Error
To select the best model with respect to the test error, the test error must first be estimated. There are two common approaches:

* The test error can be **indirectly** estimated by adjusting the training error to account for the bias due to overfitting.
* The test error can be **directly** estimated using either a validation set approach or a *cross-validation* approach.

### Indirect Estimations
Let $\hat{\sigma}^2$ be an estimate of the variance of the error terms, and let $d$ be the number of predictors in the model. Then
\begin{align}
C_p &= \frac{1}{n} (RSS + 2d\hat{\sigma}^2 ),\\
AIC &= \frac{1}{n\hat{\sigma}^2}(RSS + 2d\hat{\sigma}^2),\\
BIC &= \frac{1}{n\hat{\sigma}^2}(RSS + \log(n)d\hat{\sigma}^2),\\
\text{Adjusted }R^2 &= 1 -\frac{RSS/(n - d - 1)}{TSS/(n - 1)}.
\end{align}
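
As a worked example, the four criteria can be computed directly from the formulas above (the variant with the $\log(n)$ penalty is the BIC). The function name `selection_criteria` and the input numbers are illustrative assumptions.

```python
import numpy as np

def selection_criteria(rss, tss, n, d, sigma2_hat):
    """Evaluate C_p, AIC, BIC and adjusted R^2 for a model with d predictors."""
    cp = (rss + 2 * d * sigma2_hat) / n
    aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (rss + np.log(n) * d * sigma2_hat) / (n * sigma2_hat)
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return {"Cp": cp, "AIC": aic, "BIC": bic, "adj_R2": adj_r2}

# Hypothetical values: RSS = 40, TSS = 100, n = 50 observations, d = 3 predictors.
crit = selection_criteria(rss=40.0, tss=100.0, n=50, d=3, sigma2_hat=1.0)
print(crit)
```

Lower values of $C_p$, AIC, and BIC indicate a better model, whereas a higher adjusted $R^2$ is better. Because $\log(n) > 2$ for $n > 7$, the BIC penalizes extra predictors more heavily than $C_p$ or AIC.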

### Direct Estimations
Cross-validation used to be computationally prohibitive for many large problems; however, with recent advances in computing, performing cross-validation is rarely an issue. Thus, cross-validation is a very attractive approach for selecting from among a number of models under consideration.
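
A minimal sketch of direct estimation with 10-fold cross-validation follows, using scikit-learn's `KFold` and `cross_val_score`. The synthetic data, the helper `cv_mse`, and the two candidate models are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (illustrative): only column 0 influences the response.
rng = np.random.default_rng(2)
X_cv = rng.normal(size=(150, 5))
y_cv = 2.0 * X_cv[:, 0] + rng.normal(size=150)

kf = KFold(n_splits=10, shuffle=True, random_state=0)

def cv_mse(cols):
    """Cross-validated mean squared error of a linear model on the given columns."""
    scores = cross_val_score(
        LinearRegression(), X_cv[:, cols], y_cv,
        cv=kf, scoring="neg_mean_squared_error",
    )
    return -scores.mean()            # scikit-learn reports the negated MSE

mse_relevant = cv_mse([0])           # model using the informative predictor
mse_irrelevant = cv_mse([1, 2])      # model using only noise predictors
print(mse_relevant, mse_irrelevant)
```

The candidate with the lower cross-validated MSE would be preferred; here that should be the model containing column 0.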

## Shrinkage Methods

### Ridge Regression
The main difference between least squares and this method is that the coefficients are estimated by minimizing a slightly different quantity.

The ridge regression coefficient estimates $\hat{\beta}^R$ are the values that minimize

$$\underbrace{\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2}_{RSS}
+ \lambda\sum_{j=1}^{p} \beta_j^2=RSS+\lambda\sum_{j=1}^{p} \beta_j^2,$$
where $\lambda \geq 0$ is a *tuning parameter*.
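
To make the role of the tuning parameter $\lambda$ concrete, the sketch below solves the ridge problem in closed form on centered data, $\hat{\beta}^R = (X^TX + \lambda I)^{-1}X^Ty$, and compares it with scikit-learn's `Ridge`, whose `alpha` argument plays the role of $\lambda$. The data and the helper `ridge_closed_form` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X_r = rng.normal(size=(100, 4))
y_r = X_r @ np.array([1.5, 0.0, -2.0, 0.5]) + rng.normal(size=100)

def ridge_closed_form(X, y, lam):
    """Ridge slopes from the normal equations; centering leaves the intercept unpenalized."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

lam = 10.0
beta_manual = ridge_closed_form(X_r, y_r, lam)
beta_sklearn = Ridge(alpha=lam).fit(X_r, y_r).coef_

# The coefficient norm shrinks toward zero as lambda grows.
norms = [np.linalg.norm(ridge_closed_form(X_r, y_r, l)) for l in (0.0, 1.0, 10.0, 100.0)]
print(beta_manual, beta_sklearn, norms)
```

With $\lambda = 0$ the estimate coincides with least squares. Since the penalty depends on the scale of each predictor, predictors are usually standardized before fitting ridge regression.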

### Lasso Regression
The lasso coefficients, $\hat{\beta}^L_\lambda$, minimize the quantity

$$\underbrace{\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2}_{RSS}
+ \lambda\sum_{j=1}^{p} |\beta_j|=RSS+\lambda\sum_{j=1}^{p} |\beta_j|$$
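
The practical consequence of the absolute-value penalty is that some coefficients are set exactly to zero. The sketch below illustrates this with scikit-learn's `Lasso` on synthetic data; the data, the seed, and the choice `alpha=0.5` are assumptions. Note that scikit-learn scales the objective as $\frac{1}{2n}RSS + \alpha\sum_j|\beta_j|$, so its `alpha` corresponds to $\lambda/(2n)$ in the notation above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (illustrative): only columns 0 and 1 influence the response.
rng = np.random.default_rng(4)
X_l = rng.normal(size=(200, 8))
y_l = 3.0 * X_l[:, 0] - 2.0 * X_l[:, 1] + rng.normal(size=200)

ols = LinearRegression().fit(X_l, y_l)
lasso = Lasso(alpha=0.5).fit(X_l, y_l)

n_zero_ols = int(np.sum(ols.coef_ == 0))       # least squares: no exact zeros expected
n_zero_lasso = int(np.sum(lasso.coef_ == 0))   # lasso: irrelevant predictors dropped
print(n_zero_ols, n_zero_lasso)
print(lasso.coef_)
```

The surviving lasso coefficients are also shrunk toward zero relative to the least squares estimates, so the lasso performs shrinkage and variable selection at the same time.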