# 5 Resampling Methods

## 5.1 Cross-Validation

#### Validation Set Approach
Randomly divide the available set of observations into two parts:
- training set
- validation set

The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The validation set error rate (typically the MSE for a quantitative response) provides an estimate of the test error rate. For example, to decide whether a regression should include a quadratic term, rather than looking at the p-values we can split the data into a training set and a validation set and compare the validation MSE with and without the term. A smaller validation MSE for the model with the quadratic term suggests that the term improves prediction.
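The comparison described above can be sketched as follows. This is a minimal illustration on synthetic data, not a prescribed workflow: the helper name `validation_mse`, the 50/50 split, the seeds, and the data-generating coefficients are all assumptions made for the example.

```python
import numpy as np

def validation_mse(x, y, degree, train_frac=0.5, seed=0):
    """Fit a polynomial of the given degree on a random training split
    and return the MSE on the held-out validation split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))          # random split of the observations
    n_train = int(train_frac * len(x))
    train, val = idx[:n_train], idx[n_train:]
    coefs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coefs, x[val])
    return float(np.mean((y[val] - pred) ** 2))

# Synthetic data with a genuinely quadratic relationship (illustrative only).
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(scale=1.0, size=200)

mse_linear = validation_mse(x, y, degree=1)
mse_quadratic = validation_mse(x, y, degree=2)
print(f"linear: {mse_linear:.2f}, quadratic: {mse_quadratic:.2f}")

# Changing the seed changes the split, and with it the estimate.
mse_by_split = [validation_mse(x, y, degree=2, seed=s) for s in range(5)]
print(mse_by_split)
```

Re-running with different seeds gives different estimates for the same model, because each seed produces a different split.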

###### Drawbacks
- The validation estimate of the test error can be highly variable based on the observations included in each split.
- Only a subset of observations are used to fit the model. This can lead to overestimation of the test error.

#### Leave-One-Out Cross-Validation (LOOCV)
Unlike the validation set approach, LOOCV holds out a single observation $(x_i, y_i)$ for validation and fits the model on the remaining $n - 1$ observations. Since observation $i$ is not used in the fit, its squared prediction error gives an estimate of the test error:
$$MSE_i = (y_i - \hat y_i)^2$$

Repeat this for each observation in turn, leaving it out as the validation point and fitting the model on the remaining training points. The LOOCV estimate is the mean of the resulting $n$ errors:
$$CV_{(n)} = \frac{1}{n} \sum_{i = 1}^n MSE_i$$

###### Advantages
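- Less bias than the validation set approach, since each model is fit on $n - 1$ observations
- Tends not to overestimate the test error rate as much
- Always yields the same result, since there is no randomness in the splits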

###### Disadvantages
- Can be expensive to compute, since the model has to be fit $n$ times

With least squares linear or polynomial regression, the following shortcut makes the cost of LOOCV the same as that of a single model fit:
$$CV_{(n)} = \frac{1}{n} \sum_{i = 1}^n \left( \frac{y_i - \hat y_i}{1 - h_i} \right)^2$$

where $h_i$ is the leverage. Essentially the $i$th residual is divided by $(1 - h_i)$. The leverage lies between $\frac{1}{n}$ and 1 and reflects the amount an observation influences its own fit.
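As a sanity check on the shortcut, the sketch below compares it with brute-force LOOCV for a least squares fit. The function names and the synthetic quadratic data are assumptions for illustration. The leverages are taken as the diagonal of the hat matrix $H = X(X^TX)^{-1}X^T$, computed here through a QR decomposition.

```python
import numpy as np

def leverage(X):
    """Diagonal of the hat matrix H = X (X^T X)^{-1} X^T, via reduced QR."""
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

def loocv_brute_force(X, y):
    """Refit the model n times, each time leaving one observation out."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errors[i] = (y[i] - X[i] @ beta) ** 2
    return float(errors.mean())

def loocv_shortcut(X, y):
    """Single fit on all data; each residual is divided by (1 - h_i)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    h = leverage(X)
    return float(np.mean((residuals / (1.0 - h)) ** 2))

# Synthetic data for a quadratic regression (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)
y = 1.0 - x + 0.5 * x**2 + rng.normal(scale=0.5, size=50)
X = np.column_stack([np.ones_like(x), x, x**2])  # intercept, x, x^2

cv_slow = loocv_brute_force(X, y)
cv_fast = loocv_shortcut(X, y)
print(cv_slow, cv_fast)
```

The two values agree up to floating-point error, while the shortcut needs only one fit instead of $n$.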

#### K-Fold Cross-Validation
Split the data into $K$ folds of roughly equal size. In each iteration, one fold serves as the validation set and the remaining $K - 1$ folds are used to train the model; the $K$ validation errors are then averaged. This is similar to LOOCV (which is the special case $K = n$), except that the number of folds is chosen by the user. Typical choices are $K = 5$ or $K = 10$.

K-Fold has the benefit of being computationally cheaper than LOOCV, since only $K$ models are fit instead of $n$. Its other advantages relate to the bias-variance trade-off.
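A minimal sketch of the procedure, assuming polynomial regression as the model and MSE as the error measure. The function name `k_fold_cv_mse`, the random fold assignment, and the synthetic data are choices made for this example.

```python
import numpy as np

def k_fold_cv_mse(x, y, degree, k=5, seed=0):
    """Average validation MSE of a polynomial fit over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)         # k folds of roughly equal size
    fold_mse = []
    for j in range(k):
        val = folds[j]                     # current fold is the validation set
        train = np.concatenate([folds[m] for m in range(k) if m != j])
        coefs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coefs, x[val])
        fold_mse.append(np.mean((y[val] - pred) ** 2))
    return float(np.mean(fold_mse))

# Synthetic quadratic data (illustrative only).
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=100)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(scale=1.0, size=100)

cv_by_degree = {d: k_fold_cv_mse(x, y, degree=d, k=10) for d in range(1, 5)}
print(cv_by_degree)

# Setting k equal to the number of observations recovers LOOCV.
loocv = k_fold_cv_mse(x, y, degree=2, k=len(x))
print(loocv)
```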

#### Bias-Variance Trade-off for K-Fold CV
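K-Fold CV has an intermediate level of bias, since each model is trained on only $\frac{(K - 1)n}{K}$ observations: fewer than in LOOCV, but more than in the validation set approach. Judged on bias alone, LOOCV is preferable because it has the lowest bias. However, LOOCV has higher variance than K-Fold CV, because it averages the outputs of $n$ models trained on almost identical sets of observations, so the outputs are highly correlated. The mean of highly correlated quantities has higher variance than the mean of less correlated ones, which is why $K = 5$ or $K = 10$ is often a good compromise.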

#### Cross-Validation on Classification Problems
Instead of using MSE, we can use the fraction of misclassified observations:
$$CV_{(n)} = \frac{1}{n} \sum_{i = 1}^n I(y_i \neq \hat y_i)$$

To decide how many polynomial terms to use and of what degree, we can use CV and plot the error rate on the y-axis against the order of the polynomial on the x-axis.
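The misclassification-based CV error can be sketched with any classifier. The example below uses a simple nearest-centroid rule as a stand-in (an assumption made for brevity, not the classifier these notes discuss) and computes the LOOCV error rate on two synthetic, well-separated clusters.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_new):
    """Assign each new point to the class whose training centroid is closest."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    # dist has shape (n_new, n_classes)
    dist = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dist, axis=1)]

def loocv_error_rate(X, y):
    """Fraction of held-out observations that are misclassified."""
    n = len(y)
    mistakes = 0
    for i in range(n):
        keep = np.arange(n) != i
        pred = nearest_centroid_predict(X[keep], y[keep], X[i:i + 1])[0]
        mistakes += int(pred != y[i])      # indicator I(y_i != y_hat_i)
    return mistakes / n

# Two Gaussian clusters in 2D (illustrative only).
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(50, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2))
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 50)

error_rate = loocv_error_rate(X, y)
print(error_rate)
```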

## 5.2 Bootstrap

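The bootstrap is used to quantify the uncertainty of an estimator. As a running example, suppose we invest a fraction $\alpha$ of our money in an asset with return $X$ and the remaining $1 - \alpha$ in an asset with return $Y$. The goal is to choose $\alpha$ to minimize:
$$Var( \alpha X + (1 - \alpha) Y )$$

The minimizing value is:
$$\alpha = \frac{\sigma^2_Y - \sigma_{XY}}{\sigma^2_X + \sigma^2_Y - 2 \sigma_{XY}}$$

where $\sigma^2_X = Var(X)$, $\sigma^2_Y = Var(Y)$, and $\sigma_{XY}$ is the covariance of $X$ and $Y$. In practice these quantities are unknown, so the estimate $\hat \alpha$ is computed by plugging in the sample estimates $\hat \sigma^2_X$, $\hat \sigma^2_Y$, and $\hat \sigma_{XY}$.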

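The sampling in bootstrapping is done with replacement: each of the $B$ bootstrap data sets is formed by drawing $n$ observations with replacement from the original data, and each yields an estimate $\hat \alpha^{*r}$. Compute the standard error of these bootstrap estimates using:
$$SE_B(\hat \alpha) = \sqrt{ \frac{1}{B - 1} \sum_{r = 1}^B \left( \hat \alpha^{*r} - \frac{1}{B} \sum_{r' = 1}^B \hat \alpha^{*r'} \right)^2 }$$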
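The bootstrap standard error can be sketched as below. The function names, the number of bootstrap samples, and the simulated data (with variances and covariance chosen so that the true $\alpha$ is 0.6) are assumptions for illustration only.

```python
import numpy as np

def alpha_hat(x, y):
    """Plug-in estimate of alpha from sample variances and covariance."""
    cov = np.cov(x, y)                     # 2x2 sample covariance matrix
    var_x, var_y, cov_xy = cov[0, 0], cov[1, 1], cov[0, 1]
    return (var_y - cov_xy) / (var_x + var_y - 2.0 * cov_xy)

def bootstrap_se(x, y, n_boot=1000, seed=0):
    """Standard error of alpha_hat over n_boot bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(x)
    estimates = np.empty(n_boot)
    for r in range(n_boot):
        idx = rng.integers(0, n, size=n)   # draw n indices with replacement
        estimates[r] = alpha_hat(x[idx], y[idx])
    # ddof=1 gives the 1 / (B - 1) factor in the SE formula
    return float(np.std(estimates, ddof=1))

# Simulated returns: Var(X) = 1, Var(Y) = 1.25, Cov(X, Y) = 0.5,
# so the true alpha is (1.25 - 0.5) / (1 + 1.25 - 1) = 0.6.
rng = np.random.default_rng(3)
data = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.25]], size=500)
x, y = data[:, 0], data[:, 1]

estimate = alpha_hat(x, y)
se = bootstrap_se(x, y)
print(estimate, se)
```

Resampling row indices (rather than resampling `x` and `y` separately) keeps each $(x_i, y_i)$ pair together, which preserves the covariance between the two returns.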