# Chapter 6 Exercises - Linear Model Selection and Regularization

In [1]:
import numpy as np
import pandas as pd
from math import exp, log, sqrt, pi
import time
import itertools
from tqdm import trange

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.preprocessing import scale, StandardScaler 
from sklearn import model_selection
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error

import os.path

## Conceptual

**1.(a)** Best subset will have the smallest training RSS. For each model size *k*, the approach considers every possible combination of *k* of the *p* predictors and keeps the one with the lowest training RSS. Consequently, it identifies the best set of predictors for the training set, and no stepwise method can do better there. The drawback is that this model is likely to be overfitted to the training data and the search is computationally expensive.
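To put the computational cost in perspective, the sketch below counts how many models each approach fits for *p* predictors. The function names are illustrative only, and the forward stepwise count (the null model plus *p - k* candidates at step *k*) is included purely for contrast.

```python
from math import comb

def best_subset_model_count(p):
    """Number of models best subset fits: every subset of p predictors."""
    return sum(comb(p, k) for k in range(p + 1))  # equals 2**p

def forward_stepwise_model_count(p):
    """Null model plus (p - k) candidate models at each step k = 0..p-1."""
    return 1 + sum(p - k for k in range(p))  # equals 1 + p*(p+1)/2

for p in (5, 10, 20):
    print(p, best_subset_model_count(p), forward_stepwise_model_count(p))
```

For *p* = 20 this is 1,048,576 models for best subset against 211 for forward stepwise.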

**1.(b)** No one approach is guaranteed to provide the best-performing model, i.e., the smallest test error.

- Best subset is problematic because the model that is best on the training set is not always best on the test set. The features that produce the smallest training MSE do not necessarily generate the smallest test MSE. If training performance carries over to the test set, then best subset is the most likely to produce the smallest test error, since it searches the most models.
- Forward stepwise can miss the model with the lowest MSE because it does not consider every combination of features. It greedily adds, one at a time, the feature that gives the largest reduction in RSS, and it never removes a feature once added. However, this order may not lead to the set of features that gives the best answer. For example, if feature 1 is the best single predictor, forward stepwise must keep it and can never select the pair of feature 2 and feature 3, even if that pair fits best.
- Backward stepwise starts from the full least squares model, so it requires that the number of observations is greater than the number of features. Like forward stepwise, it is greedy and can miss the best subset.
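The difference between the greedy search and the exhaustive one can be sketched on a toy example. The training RSS values below are made up purely for illustration (they are not computed from any dataset), and `TRAIN_RSS`, `best_subset`, and `forward_stepwise` are hypothetical names.

```python
import itertools

FEATURES = ["x1", "x2", "x3"]

# Hypothetical training RSS for every subset of three predictors.
# Made-up numbers: x1 is the best single predictor, but x2 and x3
# together fit far better than any pair containing x1.
TRAIN_RSS = {
    frozenset(): 100.0,
    frozenset({"x1"}): 40.0,
    frozenset({"x2"}): 50.0,
    frozenset({"x3"}): 55.0,
    frozenset({"x1", "x2"}): 35.0,
    frozenset({"x1", "x3"}): 30.0,
    frozenset({"x2", "x3"}): 10.0,
    frozenset({"x1", "x2", "x3"}): 8.0,
}

def best_subset(k):
    """Exhaustive search: the size-k subset with the lowest training RSS."""
    candidates = (frozenset(c) for c in itertools.combinations(FEATURES, k))
    return min(candidates, key=TRAIN_RSS.__getitem__)

def forward_stepwise():
    """Greedy search: add the feature that lowers training RSS the most."""
    selected = frozenset()
    path = [selected]
    while len(selected) < len(FEATURES):
        remaining = [f for f in FEATURES if f not in selected]
        selected = min((selected | {f} for f in remaining),
                       key=TRAIN_RSS.__getitem__)
        path.append(selected)
    return path

path = forward_stepwise()
for k in range(len(FEATURES) + 1):
    print(k, sorted(best_subset(k)), sorted(path[k]))
```

With these numbers, both methods pick `x1` at size 1, but at size 2 forward stepwise is locked into `{x1, x3}` while best subset finds `{x2, x3}`. The forward stepwise models are nested by construction, whereas the best subset models need not be.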

**1.(c)**<br>
i) True<br>
ii) True<br>
iii) False<br>
iv) False<br>
v) False

### Notes on Model Flexibility

Model flexibility is how much a model's behavior is influenced by the data's characteristics. The Ridge and Lasso penalties (shrinkage factors) decrease the effect of the dataset's characteristics on the model's behavior. Consequently, the Ridge and Lasso are less flexible than least squares.

In general, the **more flexible** the model, the **less bias** (in absolute value) and the **more variance** you'll get when predicting on the test dataset.

#### Rules of Thumb
1. Sample size is large and the number of predictors is small - A flexible model performs better. The larger the sample size, the less likely a model is to overfit, even when it is more flexible. Meanwhile, a more flexible model tends to reduce bias.

2. Number of predictors is large and the sample size is small - An inflexible model performs better. A flexible model will cause overfitting because of the small sample size. This usually means a bigger inflation in variance and a small reduction in bias.

3. Relationship between the predictors and response is highly non-linear - A flexible model performs better. A flexible model is required to find the non-linear effect.

4. Variance of the errors is large - An inflexible model performs better. A flexible model will capture too much of the noise in the data due to the large variance of the errors.
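The trade-off behind these rules can be illustrated with a deliberately simple analogue rather than a regression model: estimating a mean `mu` from `n` noisy observations with the shrunken estimator `c * sample_mean`. Here `c = 1` plays the role of the flexible, unbiased fit and `c < 1` the less flexible, shrunken one. The values of `mu`, `sigma2`, `n`, and `c` are arbitrary assumptions chosen for illustration.

```python
def shrunken_mean_mse(c, mu, sigma2, n):
    """Squared bias, variance, and MSE of the estimator c * sample_mean."""
    bias_sq = ((c - 1.0) * mu) ** 2   # shrinking pulls the estimate away from mu
    variance = c ** 2 * sigma2 / n    # but scales the sampling variance down
    return bias_sq, variance, bias_sq + variance

mu, sigma2 = 2.0, 25.0  # assumed true mean and error variance
for n in (5, 500):
    for c in (1.0, 0.5):
        bias_sq, variance, mse = shrunken_mean_mse(c, mu, sigma2, n)
        print(f"n={n} c={c}: bias^2={bias_sq:.4f} "
              f"variance={variance:.4f} mse={mse:.4f}")
```

With these assumed values, shrinking (`c = 0.5`) lowers the MSE when `n = 5`, because the drop in variance outweighs the added bias, but raises it when `n = 500`, where there is little variance left to remove.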

#### Lasso vs Least Squares

**2.(a).i** Incorrect - The Lasso is less flexible than least squares.

**2.(a).ii** Incorrect - The Lasso is less flexible than least squares.

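**2.(a).iii** Correct - The Lasso's L1 penalty decreases the variance by constraining the influence of the dataset's characteristics, at the cost of some added bias. Some coefficients may be set exactly to zero because of the shrinkage. Prediction accuracy improves when the increase in bias is less than the decrease in variance, which tends to happen when the variance of the errors is large or there are more features than observations.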

**2.(a).iv** Incorrect - The Lasso decreases variance and increases bias.

#### Ridge vs Least Squares

**2.(b).i** Incorrect - The Ridge is less flexible than least squares.

**2.(b).ii** Incorrect - The Ridge is less flexible than least squares.

**2.(b).iii** Correct - The Ridge's L2 penalty decreases the variance by constraining the influence of the dataset's characteristics, at the cost of some added bias. The penalty shrinks the coefficients toward zero, but not exactly to zero. Prediction accuracy improves when the increase in bias is less than the decrease in variance, which tends to happen when the variance of the errors is large or there are more features than observations.

**2.(b).iv** Incorrect - The Ridge decreases variance and increases bias.
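The different shrinkage behavior of the Ridge and the Lasso is easiest to see in the textbook special case where the design matrix is the identity (one observation per coefficient, no intercept), so each coefficient can be solved for separately. The sketch below assumes that setting. The function names and the example values are illustrative, and `lam` is the penalty weight.

```python
def ridge_coef(beta_ols, lam):
    """Ridge in the identity-design case: proportional shrinkage."""
    return beta_ols / (1.0 + lam)

def lasso_coef(beta_ols, lam):
    """Lasso in the identity-design case: soft-thresholding at lam / 2."""
    threshold = lam / 2.0
    if beta_ols > threshold:
        return beta_ols - threshold
    if beta_ols < -threshold:
        return beta_ols + threshold
    return 0.0  # small coefficients are set exactly to zero

lam = 1.0
for beta_ols in (3.0, -1.2, 0.3):
    print(beta_ols, ridge_coef(beta_ols, lam), lasso_coef(beta_ols, lam))
```

In this setting the Ridge scales every least squares coefficient by the same factor and never reaches zero, while the Lasso subtracts a fixed amount and zeroes out any coefficient smaller than `lam / 2` in absolute value.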

#### Non-linear Methods vs Least Squares

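**2.(c).i** Incorrect - Non-linear methods are more flexible than least squares, but more flexibility decreases the bias and increases the variance, not the other way around. Note that PCR and PLS are not non-linear methods: they combine the independent variables into fewer linear components, which decreases the variance by reducing the impact of any single feature and consequently increases the bias.

**2.(c).ii** Correct - Non-linear methods are more flexible than least squares. They improve prediction accuracy when the increase in variance is less than the decrease in bias, for example when the relationship between the predictors and the response is highly non-linear.

**2.(c).iii** Incorrect - Non-linear methods are more flexible than least squares.

**2.(c).iv** Incorrect - Non-linear methods are more flexible than least squares.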
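The claim that a more flexible, non-linear method can cut bias can be sketched on noiseless data from an assumed quadratic relationship. A k-nearest-neighbours average stands in for "a non-linear method" here. That choice, the helper names, and the grid of points are all assumptions made for illustration. Because the data contain no noise, the sketch shows only the bias side of the trade-off.

```python
def true_f(x):
    return x ** 2  # assumed non-linear relationship

def fit_line(xs, ys):
    """Simple least squares line with intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_knn(xs, ys, k=3):
    """Flexible fit: average the responses of the k nearest training points."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda pair: abs(pair[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x = [i / 10 for i in range(-10, 11)]        # grid on [-1, 1]
train_y = [true_f(x) for x in train_x]
test_x = [i / 10 + 0.05 for i in range(-10, 10)]  # points between grid nodes
test_y = [true_f(x) for x in test_x]

linear_mse = mse(fit_line(train_x, train_y), test_x, test_y)
knn_mse = mse(fit_knn(train_x, train_y), test_x, test_y)
print(f"linear test MSE: {linear_mse:.4f}")
print(f"3-NN test MSE:   {knn_mse:.4f}")
```

The straight line cannot follow the curve, so its test error is dominated by bias, while the nearest-neighbour fit tracks it closely. With noisy data and few observations the comparison could reverse, since the flexible fit would also pay for its extra variance.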