# Tuesday: Model selection and regularization 

**Motivation**

* Prediction accuracy, especially when p > n, where the variance has to be controlled.
* Model interpretability.
* Lower practical requirements, i.e. less computation.
* Visualization.


**Methods**

&rarr; Feature selection/subset selection: selecting k out of p features. An exhaustive search is infeasible for large p, because there are $2^p$ candidate models.

&rarr; Feature extraction/dimension reduction: map the p features to k new features, e.g. using PCA.

&rarr; Shrinkage: adjust the coefficients so that some features are used to a lesser extent.




**Feature / subset selection**

1. Let $M_0$ be the null model, which contains no predictors and always predicts the sample mean.
2. For k = 1, 2, ..., p: fit all $\large {p \choose k}$ models that contain exactly k predictors and call the best of them (smallest RSS) $M_k$. The number of models of size k is $$\large {p\choose k} = \frac{p!}{k!(p-k)!}$$
3. Select the single best model among $M_0, \dots, M_p$ using cross-validation, $C_p$, AIC, BIC or adjusted $R^2$. Be careful with the training RSS and $R^2$: adding predictors will never **increase** the residual sum of squares, so these always favour the largest model.
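
The three steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names `rss_of_fit` and `best_subset` and the toy data are made up for this example, and step 3 (picking one model across sizes) is left out because it needs one of the criteria discussed further down.

```python
from itertools import combinations
from math import comb

import numpy as np


def rss_of_fit(X, y, cols):
    """Training RSS of a least-squares fit on the given columns plus an intercept."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    return float(residuals @ residuals)


def best_subset(X, y):
    """Return {k: (best column tuple, its RSS)} for k = 0, ..., p."""
    p = X.shape[1]
    best = {}
    for k in range(p + 1):
        # All C(p, k) candidate models with exactly k predictors.
        candidates = combinations(range(p), k)
        best[k] = min(
            ((cols, rss_of_fit(X, y, cols)) for cols in candidates),
            key=lambda pair: pair[1],
        )
    return best


# Toy data (hypothetical): y depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=50)

models = best_subset(X, y)
n_fits = sum(comb(4, k) for k in range(5))  # 2**4 models fitted in total
```

The number of fits doubles with every extra predictor, which is why this exhaustive search is only practical for small p.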

![image.png](attachment:4cbdb9d4-fa59-4a65-a47c-1f4e0a3d4ee6.png)

**Forward stepwise selection**

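* Forward stepwise selection builds the model iteratively. Start from $M_0$ and, for k = 0, 1, ..., p - 1, consider every model that adds exactly one predictor to $M_k$; the best of these becomes $M_{k+1}$. You're not looking at every possible combination, only at the previous best model plus one extra predictor each time, and you can stop once an extra predictor no longer yields an improvement.
* Not guaranteed to find the best model, because a predictor stays in the model once it has been added.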
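
A minimal sketch of the greedy search described above, assuming training RSS is used to compare models of the same size. The function names and the toy data are hypothetical, not taken from the course material.

```python
import numpy as np


def rss_of_fit(X, y, cols):
    """Training RSS of a least-squares fit on the given columns plus an intercept."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    return float(residuals @ residuals)


def forward_stepwise(X, y):
    """Return the path [(columns of M_0, RSS), ..., (columns of M_p, RSS)]."""
    p = X.shape[1]
    selected = []
    path = [((), rss_of_fit(X, y, []))]
    for _ in range(p):
        remaining = [j for j in range(p) if j not in selected]
        # Only models that extend the previous best model by one predictor.
        best_j = min(remaining, key=lambda j: rss_of_fit(X, y, selected + [j]))
        selected.append(best_j)
        path.append((tuple(selected), rss_of_fit(X, y, selected)))
    return path


# Toy data (hypothetical): y depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=50)

path = forward_stepwise(X, y)
```

This fits 1 + p(p + 1)/2 models instead of $2^p$, at the price of possibly missing the best model of a given size.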


**Backward stepwise selection**

* Start with $M_p$, the full model with all predictors. In each iteration remove the one predictor whose removal gives the best remaining model, producing models with k = p - 1, p - 2, ..., 0 predictors.
* Can only be done when n > p, because the full least-squares model can only be fit when n > p.
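
The backward variant can be sketched the same way. As before, the helper names and the toy data are assumptions made for illustration, and training RSS is used to compare models of equal size.

```python
import numpy as np


def rss_of_fit(X, y, cols):
    """Training RSS of a least-squares fit on the given columns plus an intercept."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    return float(residuals @ residuals)


def backward_stepwise(X, y):
    """Return the path [(columns of M_p, RSS), ..., (columns of M_0, RSS)]."""
    n, p = X.shape
    if n <= p:
        raise ValueError("backward stepwise needs n > p to fit the full model")
    selected = list(range(p))
    path = [(tuple(selected), rss_of_fit(X, y, selected))]
    while selected:
        # Drop the predictor whose removal increases the RSS the least.
        drop = min(
            selected,
            key=lambda j: rss_of_fit(X, y, [c for c in selected if c != j]),
        )
        selected.remove(drop)
        path.append((tuple(selected), rss_of_fit(X, y, selected)))
    return path


# Toy data (hypothetical): y depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=50)

path = backward_stepwise(X, y)
```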


**Estimating test error**

* indirectly: making an adjustment to the training error
* directly: using a held-out validation set or cross-validation


![image.png](attachment:743ecd59-9630-46b5-830c-acc86e0e1e0d.png)

**Mallows' $C_p$**

$$\Large C_{p}=\frac{1}{n}\left(\mathrm{RSS}+2d\hat{\sigma}^{2}\right)$$

Where d is the number of predictors and $\hat{\sigma}^{2}$ is an estimate of the variance of the error terms. We basically adjust the RSS so that it will be larger when there are more predictors.

**AIC**

A criterion defined for models fit by maximum likelihood, e.g. logistic regression. For linear regression with Gaussian errors, maximizing the likelihood is equivalent to minimizing the RSS, so the least-squares fit is also the maximum-likelihood fit. Hence AIC is practically the same as Mallows' $C_p$. The general definition is

$$\Large \mathrm{AIC}=-2\log L+2\cdot d$$

and for least squares, after dropping constants that do not depend on the model, this becomes

$$\Large \mathrm{AIC}=\frac{1}{n}\left(\mathrm{RSS}+2d{\hat{\sigma}}^{2}\right)$$

**BIC**

Very similar to Mallows' $C_p$ and AIC, but the factor 2 in the penalty is replaced by $\log{n}$ (natural logarithm). When n is larger than 7, $\log{n}$ is larger than 2, so the penalty per predictor is greater. BIC is therefore more conservative than $C_p$ and AIC and tends to select smaller models.

$$\Large {{{\mathrm{BIC}}}}=\frac{1}{n}\left(\mathrm{RSS}+\log(n)d\hat{\sigma}^{2}\right)$$

**Adjusted R squared**

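An adjustment that makes $R^2$ comparable across models of different size. More predictors will never increase the residual sum of squares, so with plain $R^2$ you would always choose the larger model. Because of d in the denominator n - d - 1, an extra predictor only raises the adjusted $R^2$ if it reduces the RSS by enough to offset the lost degree of freedom; otherwise the adjusted $R^2$ decreases. Unlike $C_p$, AIC and BIC, a larger value is better.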

$$\Large \mathrm{Adjusted~}R^{2}=1-\,\frac{\mathrm{RSS/}(n-d-1)}{\mathrm{TSS}/(n-1)}$$
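
To make the formulas concrete, here is a small sketch that evaluates $C_p$, BIC and adjusted $R^2$ exactly as written above. The function name `selection_criteria` and all numbers in the example are invented for illustration; they do not come from a real data set.

```python
import math


def selection_criteria(rss, tss, n, d, sigma2_hat):
    """Evaluate C_p, BIC and adjusted R^2 for a model with d predictors.

    sigma2_hat is an estimate of the error variance.
    """
    cp = (rss + 2 * d * sigma2_hat) / n
    bic = (rss + math.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return {"cp": cp, "bic": bic, "adj_r2": adj_r2}


# Hypothetical numbers: the larger model lowers the RSS only slightly.
small = selection_criteria(rss=120.0, tss=500.0, n=100, d=3, sigma2_hat=1.0)
large = selection_criteria(rss=115.0, tss=500.0, n=100, d=10, sigma2_hat=1.0)
```

In this made-up case all three criteria prefer the smaller model: the drop in RSS from 120 to 115 does not pay for seven extra predictors.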

***One-standard-error rule***: Select the smallest (simplest) model for which the estimated test error is within one standard error of the lowest point on the curve.

&rarr; allows for automation of picking the best model  

**Shrinkage methods**

Shrinking the coefficients of the predictors towards zero reduces the variance of the estimates, at the cost of a small increase in bias.

**Ridge regression**

In ordinary least-squares regression the parameters are estimated by minimizing:

$$\Large \mathrm{RSS}=\sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p}\beta_{j}x_{i j}\right)^{2}$$

In ridge regression we add a penalty term, weighted by a **tuning parameter** $\lambda \ge 0$, which penalizes the growth of the coefficients. The larger $\lambda$, the higher the penalty. The intercept $\beta_0$ is not penalized.

$$\Large \sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p}\beta_{j}x_{i j}\right)^{2}+\lambda\sum_{j=1}^{p}\beta_{j}^{2}=\mathrm{RSS}+\lambda\sum_{j=1}^{p}\beta_{j}^{2}$$
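
For this objective the minimizer has a closed form, which the following sketch uses. It is one possible implementation, not the course's reference code: the name `ridge_fit` and the toy data are assumptions, and the intercept is kept out of the penalty by centring X and y first.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Minimise RSS + lam * sum(beta_j ** 2); the intercept is not penalised."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean  # centring removes the intercept from the problem
    yc = y - y_mean
    p = X.shape[1]
    # Solve the penalised normal equations (Xc'Xc + lam * I) beta = Xc'yc.
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta
    return beta0, beta


# Toy data (hypothetical).
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=40)

# The length of the coefficient vector shrinks as lam grows.
norms = [float(np.linalg.norm(ridge_fit(X, y, lam)[1]))
         for lam in (0.0, 1.0, 10.0, 100.0, 1000.0)]
```

With `lam = 0` this reduces to ordinary least squares; as `lam` grows the coefficients shrink towards zero without becoming exactly zero.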

By looking at how $\lambda$ affects the coefficients of the predictors we can see which predictors matter most. As seen below in the left plot, some coefficients remain clearly non-zero for high $\lambda$ while others get pushed toward zero (ridge shrinks coefficients but does not set them exactly to zero). On the right, the picture is flipped and standardized, hence the x-axis runs between 0 and 1.
![image.png](attachment:26846637-d8ee-4f15-ac4f-b1b2c0afb1ad.png)

***In ridge regression the features need to be standardized/scaled to make them comparable as they are competing to be non-zero***

$$\Large \tilde{x}_{i j}=\frac{x_{i j}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{i j}-\overline{{{x}}}_{j})^{2}}}$$
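
The scaling formula translates directly into NumPy. The function name `standardize` and the example columns are hypothetical; note that the formula uses the 1/n version of the standard deviation, which is also NumPy's default.

```python
import numpy as np


def standardize(X):
    """Divide every column by its standard deviation (1/n version)."""
    sd = np.sqrt(np.mean((X - X.mean(axis=0)) ** 2, axis=0))
    return X / sd


# Hypothetical predictors on very different scales.
rng = np.random.default_rng(2)
X = np.column_stack([
    rng.normal(loc=50_000, scale=12_000, size=30),  # e.g. something in euros
    rng.normal(loc=0.3, scale=0.05, size=30),       # e.g. a proportion
])

X_tilde = standardize(X)
```

After scaling, every column has standard deviation 1, so the penalty treats all coefficients on an equal footing.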

The left plot below shows the bias, variance and MSE as a function of $\lambda$. At x = 0 the model is built using least squares; moving right, the model is built with increasing $\lambda$. It is clearly visible that shrinking the coefficients toward zero decreases the variance, and at first also the MSE, while the bias stays low.

![image.png](attachment:6d2cbbd3-bae5-408d-b33b-444891e15927.png)

**Lasso**

Lasso is almost the same as ridge regression but the penalty is the sum of the absolute values of the coefficients instead of the sum of the squares of the coefficients.

$$\Large \sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p}\beta_{j}x_{i j}\right)^{2}+\lambda\sum_{j=1}^{p}|\beta_{j}|=\mathrm{RSS}+\lambda\sum_{j=1}^{p}|\beta_{j}|$$

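Lasso will generally select a subset of the predictors instead of all of them, because it can set coefficients exactly to zero. This gives a sparser model that is easier to interpret and cheaper to compute with. In the plot below on the left we see the combination of shrinkage and selection of predictors.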

![image.png](attachment:46bb7144-fa35-4176-b897-4401c6602621.png)
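
The selection effect can also be reproduced numerically. The sketch below minimizes the lasso objective with coordinate descent and soft-thresholding, which is one standard way to solve it but not necessarily the method used to produce the plot. All names, the toy data and the value `lam=40.0` are assumptions made for this example.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink z towards zero by t and return exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise RSS + lam * sum(|beta_j|); the intercept is not penalised."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: the fit of all predictors except j is removed.
            r_j = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ r_j
            # Threshold is lam / 2 because the RSS carries no factor 1/2.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    beta0 = y.mean() - X.mean(axis=0) @ beta
    return beta0, beta


# Toy data (hypothetical): only columns 0 and 2 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=100)

_, beta_ls = lasso_coordinate_descent(X, y, lam=0.0)      # plain least squares
_, beta_lasso = lasso_coordinate_descent(X, y, lam=40.0)  # penalised fit
```

With `lam=0.0` all five coefficients are non-zero; with the penalty switched on, the coefficients of the pure-noise columns become exactly zero and the remaining two are shrunk.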

We can say that lasso and ridge regression make a trade-off between the size of the coefficients and the fit of the model. The difference between the two methods is shown below. The red ellipses are contours of the error (RSS): the error has a single optimal value at one unique combination of coefficients, the least-squares estimate, but if we accept a suboptimal value there are many combinations of coefficients that reach it. The blue zones are the constraint regions; on the left we have lasso and on the right ridge regression. The point where a contour first touches the blue zone is the best combination given that we want to keep the coefficients small. If the contour touches a corner of the constraint region (which only the lasso region has), one of the coefficients is exactly zero, hence lasso is more prone to select a subset of the predictors.
![image.png](attachment:afdcf606-5269-4d04-a833-471fba1b7af2.png)
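
The two constraint regions in this figure come from an equivalent way of writing both problems, which may make the geometry easier to follow. For every $\lambda$ there is a budget $s$ (a symbol introduced here for illustration, not used elsewhere in these notes) such that the lasso solves

$$\Large \min_{\beta}\ \mathrm{RSS}\quad\text{subject to}\quad\sum_{j=1}^{p}|\beta_{j}|\le s$$

and ridge regression solves

$$\Large \min_{\beta}\ \mathrm{RSS}\quad\text{subject to}\quad\sum_{j=1}^{p}\beta_{j}^{2}\le s$$

For p = 2 the first region is a diamond with corners on the axes and the second is a disc, which is why only the lasso region has corners where a coefficient is exactly zero.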

**Selection of the tuning parameter**

The tuning parameter is chosen by computing the cross-validation error for a grid of $\lambda$ values. The $\lambda$ with the lowest error is chosen, and the model is then refit on all observations with that value.
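
A minimal sketch of this procedure for ridge regression with 5-fold cross-validation. The grid of values, the helper names `ridge_fit` and `cv_error`, and the toy data are all assumptions for illustration; in practice the predictors would be standardized first.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Minimise RSS + lam * sum(beta_j ** 2); the intercept is not penalised."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ beta, beta


def cv_error(X, y, lam, n_folds=5, seed=0):
    """Mean squared prediction error of ridge over n_folds held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        beta0, beta = ridge_fit(X[train], y[train], lam)
        pred = beta0 + X[held_out] @ beta
        errors.append(np.mean((y[held_out] - pred) ** 2))
    return float(np.mean(errors))


# Toy data (hypothetical): 10 predictors, only the first 3 matter.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 10))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=60)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
cv_errors = [cv_error(X, y, lam) for lam in lambdas]
best_lambda = lambdas[int(np.argmin(cv_errors))]

# Refit on all observations with the selected value.
beta0, beta = ridge_fit(X, y, best_lambda)
```

The same seed is used for every value of `lam`, so all candidates are compared on identical folds.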


**Dimension Reduction methods**

We transform the p predictors into m < p new predictors by taking linear combinations of the original p predictors, and then fit the model on these m new predictors. This is necessary when p > n, because least squares then has no unique solution and parameter estimation becomes difficult.
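
As an illustration, the sketch below builds m new predictors with PCA, computed through the singular value decomposition. The function name `principal_components` and the toy data (six observed predictors driven by two hidden factors) are invented for this example.

```python
import numpy as np


def principal_components(X, m):
    """Return the first m principal-component scores and their variance shares."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, singular_values, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:m].T  # each new predictor is a linear combination of the old ones
    explained = singular_values[:m] ** 2 / np.sum(singular_values ** 2)
    return Z, explained


# Toy data (hypothetical): 6 predictors generated from 2 latent factors.
rng = np.random.default_rng(4)
latent = rng.normal(size=(80, 2))
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.05 * rng.normal(size=(80, 6))

Z, explained = principal_components(X, m=2)
```

A regression of y on the columns of `Z` then has only m coefficients to estimate instead of p.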


# Questions 

**1. We perform best subset, forward stepwise, and backward stepwise
selection on a single data set. For each approach, we obtain p + 1
models, containing 0, 1, 2, . . . , p predictors. Explain your answers:**

**(a) Which of the three models with k predictors has the smallest
training RSS?**

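&rarr; Best subset selection. For a given k it fits all ${p \choose k}$ models with k predictors and keeps the one with the smallest training RSS, so forward and backward stepwise, which search only part of that space, can at best tie with it.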

**(b) Which of the three models with k predictors has the smallest
test RSS?**

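&rarr; This cannot be determined in advance. Best subset has the smallest training RSS, but that does not carry over to test data: any of the three may have the smallest test RSS, and best subset, which searches the most models, is the most prone to overfitting the training set.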

**(c) True or False:**

i. The predictors in the k-variable model identified by forward
stepwise are a subset of the predictors in the (k +1)-variable
model identified by forward stepwise selection.

&rarr; True

ii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)-variable model identified by backward stepwise selection.

&rarr; True. Backward stepwise obtains the k-variable model by removing one predictor from the (k + 1)-variable model.

iii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)-variable model identified by forward stepwise selection.

&rarr; False. Forward and backward stepwise follow different search paths, so their models need not be nested.

iv. The predictors in the k-variable model identified by forward
stepwise are a subset of the predictors in the (k +1)-variable
model identified by backward stepwise selection.

&rarr; False

v. The predictors in the k-variable model identified by best
subset are a subset of the predictors in the (k + 1)-variable
model identified by best subset selection.

&rarr; False. Best subset searches all models of each size separately, so the best (k + 1)-variable model need not contain the predictors of the best k-variable model.

**2. For parts (a) through (c), indicate which of i. through iv. is correct.
Justify your answer.**

**(a) The lasso, relative to least squares, is:**

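i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

&rarr; (a) False, (b) False, (c) False

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

&rarr; (a) False, (b) False, (c) True

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

&rarr; (a) True, (b) True, (c) False

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

&rarr; (a) False, (b) False, (c) False

Justification: lasso and ridge regression constrain the coefficients, so they are less flexible than least squares and trade a small increase in bias for a decrease in variance. Non-linear methods are more flexible than least squares and trade an increase in variance for a decrease in bias.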

**(b) Repeat (a) for ridge regression relative to least squares.**



**(c) Repeat (a) for non-linear methods relative to least squares.**

**4. Suppose we estimate the regression coefficients in a linear regression
model by minimizing
$$\Large \sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p}\beta_{j}x_{i j}\right)^{2}+\lambda\sum_{j=1}^{p}\beta_{j}^{2}$$
for a particular value of λ. For parts (a) through (e), indicate which
of i. through v. is correct. Justify your answer.**

(a) As we increase λ from 0, the training RSS will:

i. Increase initially, and then eventually start decreasing in an
inverted U shape.

&rarr; False, False, False, False, False

ii. Decrease initially, and then eventually start increasing in a
U shape.

&rarr; False, True, False, False, False

iii. Steadily increase.

&rarr; True, False, False, True, False

iv. Steadily decrease.

&rarr; False, False, True, False, False

v. Remain constant.

&rarr; False, False, False, False, True

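(b) Repeat (a) for test RSS.

(c) Repeat (a) for variance.

(d) Repeat (a) for (squared) bias.

(e) Repeat (a) for the irreducible error.

&rarr; The five verdicts listed under each option above correspond, in order, to parts (a) through (e). Increasing λ makes the model less flexible, so the training RSS and the squared bias steadily increase, the variance steadily decreases, the test RSS follows a U shape, and the irreducible error remains constant because it does not depend on the model.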