# Introduction to Statistical Learning - Chapter 4

- [4. Resampling Methods](#4.-Resampling-Methods)
    * [4.1 Cross-Validation](#4.1-Cross-Validation)
        + [4.1.1 The Validation Set Approach](#4.1.1-The-Validation-Set-Approach)
        + [4.1.2 Leave-One-Out Cross Validation](#4.1.2-Leave-One-Out-Cross-Validation)
        + [4.1.3 k-Fold Cross-Validation](#4.1.3-k-Fold-Cross-Validation)
        + [4.1.4 Bias-Variance Trade-off for k-Fold Cross-Validation](#4.1.4-Bias-Variance-Trade-off-for-k-Fold-Cross-Validation)
        + [4.1.5 Cross-Validation on Classification Problems](#4.1.5-Cross-Validation-on-Classification-Problems)
    * [4.2 The Bootstrap](#4.2-The-Bootstrap)

# 4. Resampling Methods

- Involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model
    * Computationally expensive
        + Fitting the same statistical model multiple times using different subsets of the training data
- Resampling Methods
    * Cross-validation
        + Estimate the test error associated with a given statistical learning method in order to evaluate the model's performance **(model assessment)** 
        + Select the appropriate level of flexibility **(model selection)**
    * Bootstrap
        + Provide a measure of accuracy of a parameter estimate or of a given statistical learning method

## 4.1 Cross-Validation

### 4.1.1 The Validation Set Approach

- A simple strategy to estimate the test error associated with fitting a particular statistical learning method on a set of observations
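- Involves randomly dividing the available set of observations into two parts
    * Training set: used to fit the model
    * Validation set (hold-out set): used to evaluate the fitted model's predictions
        + The resulting validation set error rate, typically the MSE for a quantitative response, estimates the test error rate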

**Drawbacks of Validation Set Approach**
- Validation estimate of the test error can be highly variable
    * Depends on which observations end up in the training set and which in the validation set
- Only a subset of the observations is used to fit the model
    * Statistical methods tend to perform worse when trained on fewer observations, so the validation set error rate may overestimate the test error rate for a model fit on the entire data set
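
The variability described above can be seen in a minimal sketch. It assumes synthetic data and a simple least-squares line as the model; the helper names (`fit_line`, `mse`, `validation_set_mse`) are hypothetical and chosen only for illustration.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def mse(model, xs, ys):
    """Mean squared error of the fitted line on (xs, ys)."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def validation_set_mse(xs, ys, seed, train_fraction=0.5):
    """Randomly split the data once, fit on one part, score on the other."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    cut = int(len(indices) * train_fraction)
    train, valid = indices[:cut], indices[cut:]
    model = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return mse(model, [xs[i] for i in valid], [ys[i] for i in valid])

# Synthetic data (assumed for illustration): y = 2 + 3x + noise
data_rng = random.Random(0)
xs = [data_rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 + 3.0 * x + data_rng.gauss(0, 1) for x in xs]

# Five different random splits give five different test error estimates
estimates = [validation_set_mse(xs, ys, seed) for seed in range(5)]
print(estimates)
```

Each seed produces a different split, so the printed estimates differ even though the data and the model are unchanged.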

### 4.1.2 Leave-One-Out Cross Validation

- Leave-one-out cross-validation (LOOCV) involves splitting the set of observations into two parts
    * A single observation is used for the validation set
    * The remaining n-1 observations make up the training set
- Repeating this approach n times, holding out a different observation each time, produces n squared errors
    * The LOOCV estimate for the test MSE is the average of these n test error estimates

$$ CV_{(n)} = \frac{1}{n}\sum^{n}_{i=1}MSE_{i} $$
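
As a rough sketch of this average, the code below holds out each observation in turn, refits, and averages the n squared errors. The data are synthetic and the model is a simple least-squares line; both are assumptions made for illustration, as are the helper names.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def loocv_mse(xs, ys):
    """CV_(n): average of the n held-out squared errors."""
    n = len(xs)
    squared_errors = []
    for i in range(n):
        # Training set: every observation except the i-th
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        # MSE_i: squared error on the single held-out observation
        squared_errors.append((ys[i] - (a + b * xs[i])) ** 2)
    return sum(squared_errors) / n

# Synthetic data (assumed for illustration): y = 2 + 3x + noise
data_rng = random.Random(0)
xs = [data_rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 + 3.0 * x + data_rng.gauss(0, 1) for x in xs]

cv_n = loocv_mse(xs, ys)
print(cv_n)
```

There is no random split involved, so calling `loocv_mse` again on the same data returns the same value.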

**Advantages and Disadvantages of LOOCV**
- Advantages:
    * Far less bias than the validation set approach
        + Repeatedly fits the statistical method on training sets that contain n-1 observations, so LOOCV tends not to overestimate the test error rate as much as the validation set approach does
    * LOOCV always yields the same result when repeated, whereas the validation set approach yields different results each time because of randomness in the training/validation split
- Disadvantage:
    * Can be computationally expensive to implement
        + Model has to be fitted n times

### 4.1.3 k-Fold Cross-Validation

- An alternative to LOOCV
- LOOCV is a special case of k-fold CV in which k is set to equal n
- k-fold CV involves randomly dividing the set of observations into k groups, or folds, of approximately equal size
    * The first fold is treated as a validation set and the method is fit on the remaining k-1 folds
    * The mean squared error is computed on the observations in the held-out fold
- Repeating this k times, each time with a different fold as the validation set, gives k estimates of the test error; the k-fold CV estimate is their average

$$ CV_{(k)} = \frac{1}{k}\sum^{k}_{i=1}MSE_{i} $$
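
The procedure can be sketched as follows, again assuming synthetic data and a least-squares line as the model. The helpers `make_folds` and `k_fold_cv_mse` are hypothetical names, and assigning shuffled indices to folds by striding is just one simple way to get folds of approximately equal size.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def make_folds(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[j::k] for j in range(k)]

def k_fold_cv_mse(xs, ys, k, seed=0):
    """CV_(k): average of the k held-out fold MSEs."""
    fold_mses = []
    for fold in make_folds(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # MSE_i: mean squared error on the held-out fold
        fold_mses.append(
            sum((ys[i] - (a + b * xs[i])) ** 2 for i in fold) / len(fold)
        )
    return sum(fold_mses) / k

# Synthetic data (assumed for illustration): y = 2 + 3x + noise
data_rng = random.Random(0)
xs = [data_rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 + 3.0 * x + data_rng.gauss(0, 1) for x in xs]

cv_5 = k_fold_cv_mse(xs, ys, k=5)
print(cv_5)
```

Setting `k=len(xs)` puts one observation in each fold, which reduces this to LOOCV.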

**Advantages and Disadvantages**
- Advantages:
    * Less computationally intensive than LOOCV when k < n, as the model is only fitted k times
    * Variability in the test error estimate is much lower than with the validation set approach
- Disadvantage:
    * CV curves can overestimate the test set MSE at higher levels of flexibility

### 4.1.4 Bias-Variance Trade-off for k-Fold Cross-Validation

- k-fold CV often gives more accurate estimates of the test error rate than LOOCV
    * Bias comparison
        + LOOCV gives an approximately unbiased estimate of the test error, since each training set contains $n-1$ observations
        + k-fold CV leads to an intermediate level of bias, since each training set contains $(k-1)n/k$ observations
    * Variance comparison
        + LOOCV has a higher variance than k-fold CV with $k<n$
        + In LOOCV, we are averaging the outputs of n fitted models, each trained on an almost identical set of observations, so these outputs are highly correlated with each other
        + The mean of many highly correlated quantities has higher variance than the mean of less correlated ones
    * k-fold CV with $k=5$ or $k=10$ suffers neither from excessively high bias nor from very high variance
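
To make the bias argument concrete, this small sketch evaluates the training set size $(k-1)n/k$ for a few values of k. The choice of n = 100 is an arbitrary assumption for illustration.

```python
def training_set_size(n, k):
    """Observations used to fit each model in k-fold CV: (k - 1) * n / k."""
    return (k - 1) * n / k

n = 100
for k in (2, 5, 10, n):  # k = n corresponds to LOOCV
    print(k, training_set_size(n, k))
```

With n = 100, each model is fit on 50, 80, 90, and 99 observations respectively, so larger k means training sets closer to the full data set and therefore less bias.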

### 4.1.5 Cross-Validation on Classification Problems

- When Y is qualitative, cross-validation uses the number of misclassified observations instead of the MSE to quantify the test error
- The LOOCV error rate takes the form

$$ CV_{(n)} = \frac{1}{n}\sum^{n}_{i=1}Err_{i} $$

where $Err_{i} = I(y_{i} \neq \hat{y}_{i})$ is an indicator that equals 1 if the $i$th observation is misclassified and 0 otherwise
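
A minimal sketch of the LOOCV error rate follows. It assumes a nearest-class-mean classifier on one predictor as a stand-in for whatever classifier is being assessed; the classifier, the helper names, and the synthetic data are all assumptions for illustration.

```python
import random

def fit_class_means(xs, labels):
    """Return {label: mean of x within that class}."""
    sums, counts = {}, {}
    for x, label in zip(xs, labels):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(class_means, x):
    """Predict the class whose mean is closest to x."""
    return min(class_means, key=lambda label: abs(x - class_means[label]))

def loocv_error_rate(xs, labels):
    """CV_(n): fraction of held-out observations that are misclassified."""
    n = len(xs)
    misclassified = 0
    for i in range(n):
        class_means = fit_class_means(xs[:i] + xs[i + 1:],
                                      labels[:i] + labels[i + 1:])
        # Err_i = I(y_i != y_hat_i)
        misclassified += int(predict(class_means, xs[i]) != labels[i])
    return misclassified / n

# Synthetic two-class data (assumed for illustration)
data_rng = random.Random(0)
xs = ([data_rng.gauss(0, 1) for _ in range(50)]
      + [data_rng.gauss(2, 1) for _ in range(50)])
labels = [0] * 50 + [1] * 50

error_rate = loocv_error_rate(xs, labels)
print(error_rate)
```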

- When the true decision boundary is non-linear:
    * Standard logistic regression, which has a linear decision boundary, does not have enough flexibility to model the Bayes decision boundary
        + Adding polynomial functions of the predictors to the logistic regression model gives a more flexible, non-linear decision boundary
    * Training error tends to decrease as the flexibility of the fit increases, so it cannot be used to choose the level of flexibility
        + Cross-validation can help select the model whose error is close to the true test error
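
The sketch below compares linear and quadratic logistic regression by 5-fold CV error rate on one predictor. Several things here are assumptions for illustration: the synthetic data (class 1 when |x| > 1, so the true boundary is non-linear), the plain batch gradient descent fit, and the step count and learning rate, which are not tuned.

```python
import math
import random

def poly_features(x, degree):
    """[1, x, x^2, ..., x^degree] for a single predictor x."""
    return [x ** d for d in range(degree + 1)]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(rows, labels, steps=300, lr=0.3):
    """Fit logistic regression weights by batch gradient descent."""
    w = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, y in zip(rows, labels):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, row)))
            for j, xj in enumerate(row):
                grad[j] += (p - y) * xj / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def predict(w, row):
    """Class 1 when the linear predictor is positive (probability > 0.5)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, row)) > 0 else 0

def cv_error_rate(xs, labels, degree, k=5, seed=0):
    """k-fold CV misclassification rate for a given polynomial degree."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    folds = [indices[j::k] for j in range(k)]
    rows = [poly_features(x, degree) for x in xs]
    fold_rates = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        w = fit_logistic([rows[i] for i in train], [labels[i] for i in train])
        wrong = sum(predict(w, rows[i]) != labels[i] for i in fold)
        fold_rates.append(wrong / len(fold))
    return sum(fold_rates) / k

# Synthetic data with a non-linear boundary (assumed for illustration)
data_rng = random.Random(0)
xs = [data_rng.uniform(-2, 2) for _ in range(100)]
labels = [1 if abs(x) > 1 else 0 for x in xs]

cv_errors = {degree: cv_error_rate(xs, labels, degree) for degree in (1, 2)}
print(cv_errors)
```

Choosing the degree with the smallest CV error is the model selection step; on data like this the quadratic model should come out ahead, since a single linear threshold cannot separate the two outer intervals from the middle one.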

## 4.2 The Bootstrap

- Used to quantify the uncertainty associated with a given estimator or statistical learning method
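    * For example, estimate the standard errors of the coefficients from a linear regression fit
        + Also applicable to statistical learning methods for which a measure of variability is otherwise difficult to obtain
- Obtain distinct data sets by repeatedly sampling observations from the original data set with replacement
    * Each of the $B$ bootstrap data sets contains n observations and yields a bootstrap estimate $\hat{\alpha}^{*r}$ of the quantity of interest $\alpha$
    * The standard error of these bootstrap estimates can be computed via: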

$$ SE_{B}(\hat{\alpha}) = \sqrt{\frac{1}{B-1}\sum^{B}_{r=1}\left(\hat{\alpha}^{*r} - \frac{1}{B}\sum^{B}_{r'=1}\hat{\alpha}^{*r'}\right)^{2}} $$
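
A minimal sketch of this computation is shown below. The estimator is passed in as a function; the sample median and mean are used purely as illustrative stand-ins for the quantity of interest, and the data are synthetic. The name `bootstrap_se` is hypothetical.

```python
import random
import statistics

def bootstrap_se(data, estimator, B=1000, seed=0):
    """Bootstrap standard error of estimator(data) from B resamples."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(B):
        # One bootstrap data set: n draws from the original data, with replacement
        resample = rng.choices(data, k=n)
        estimates.append(estimator(resample))
    # statistics.stdev divides by B - 1, matching the formula above
    return statistics.stdev(estimates)

# Synthetic data (assumed for illustration)
data_rng = random.Random(1)
data = [data_rng.gauss(10, 2) for _ in range(100)]

se_median = bootstrap_se(data, statistics.median)
se_mean = bootstrap_se(data, statistics.mean)
print(se_median, se_mean)
```

For the sample mean, the bootstrap value can be checked against the usual formula $s/\sqrt{n}$; for the median no such simple formula exists, which is where the bootstrap is most useful.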