# Cross Validation

Sections:

1. Validation Sets
2. Leave One Out Cross Validation (LOOCV)
3. K-fold cross-validation
4. Bias-variance tradeoff with cross validation

The lecture draws from Chapter 5 of James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). "An Introduction to Statistical Learning: with Applications in R."

---
# 1. Validation Sets

We normally think of determining the value (or significance) of a statistic by way of the [p-value](https://en.wikipedia.org/wiki/P-value). Technically, the p-value is the probability that you'd get data at least as extreme as the data that you observed ($X$) _if the null hypothesis ($H_0$) were true_.

$$ p = P(X \mid H_0 \text{ is true}) $$

But what exactly does the p-value tell you? At a fundamental level, the p-value is not very interpretable without a designated criterion for rejecting $H_0$. Traditionally we call this criterion $\alpha$, and by tradition we use $\alpha = 0.05$ (i.e., if $p < 0.05$ we reject $H_0$; otherwise we fail to reject it).

<br>

This tradition dates back to [Fisher's original proposition of the p-value](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/15191/1/48.pdf). The exact level of 0.05 (i.e., a 5% probability of seeing data this extreme when $H_0$ is true) stems from a heuristic that seemed like a reasonable criterion for making the binary (reject / fail to reject) decision. But as we will see later, this can become problematic when trying to make inferences from multiple tests (i.e., the multiple comparison problem). In this way the p-value is only useful for making a simple binary decision. Beyond that, any interpretation of the exact p-value is relatively meaningless.

<br>

But what we want to do with our models is define a $Y=f(X)$ such that it allows us to not only describe the data we have, but also _extend to other observations of the same variables._ In other words, we want our model to make **predictions**. Prediction is just as valuable as testing $H_0$, for several reasons.

* It allows you to evaluate not just whether $X$ is meaningfully related to $Y$, _but to what degree it explains variance in $Y$_.

* It explicitly evaluates the **external validity** of your model. External validity is the extent to which the results of a study can be generalized to other situations.

<br>

## Training vs. Testing Error

Let's revisit a topic we have mentioned several times before. 


* **Training Error:** The error of a model in explaining the data that it was fit to.
 - What you are probably most accustomed to from undergraduate statistics.
 - Useful for inferring the structure of $f(X)$.
 - _Biased toward the particular structure of the noise in the data_ (i.e., as your model increases in complexity during the fit, you risk over-fitting).

* **Test Error:** The error of a model against an independent dataset.
 - Measure of the _extensibility_ of a model.
 - More conservative than Training Error.
 
Remember that we can use the relationship between Training (grey line) and Test Error (red line) to find the optimal balance of bias and variance (i.e., complexity, flexibility) of a model.

![Variance-bias tradeoff](imgs/L6TrainVsTest.png)
 
In an ideal world you would run two identical versions of every experiment: one experimental data set would be used to train $Y=f(X)$ and the other to test it (i.e., evaluate Test Error). However, in many cases this isn't feasible. Luckily, there are ways of generating _validation sets_ within a single data set that allow you to evaluate both Training and Test Error.

* **Validation Sets:** Random partitions of a single data set into training and test sets, used to estimate held-out Test Error.
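
As a rough illustration of the idea, here is a minimal sketch of a single random validation split using only the Python standard library. The simulated data, the half-and-half split, and the helper names `fit_line` and `mse` are assumptions made for this example, not something prescribed by the book.

```python
import random

def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1 * x for a single predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def mse(xs, ys, b0, b1):
    """Mean squared error of the line (b0, b1) on the given observations."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)  # fixed seed so the split is reproducible

# Simulated data (an assumption for illustration): y = 2 + 0.5x + noise
x = [rng.uniform(0, 10) for _ in range(40)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]

# Randomly assign half of the observations to training and half to test
idx = list(range(len(x)))
rng.shuffle(idx)
train_idx, test_idx = idx[:20], idx[20:]
x_train, y_train = [x[i] for i in train_idx], [y[i] for i in train_idx]
x_test, y_test = [x[i] for i in test_idx], [y[i] for i in test_idx]

b0, b1 = fit_line(x_train, y_train)          # learn f(X) on training data only
train_error = mse(x_train, y_train, b0, b1)  # error on the data used for fitting
test_error = mse(x_test, y_test, b0, b1)     # error on the held-out data
print(f"Training MSE: {train_error:.3f}, Test MSE: {test_error:.3f}")
```

The key point is that the test observations never touch the fitting step, so `test_error` is an estimate of how the model extends to new data.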

In the next sections we consider two types of validation sets: leave one out cross validation (LOOCV) and k-fold cross validation.



---
# 2. Leave One Out Cross Validation (LOOCV)

Leave one out cross validation (LOOCV) is a method of estimating test error one observation at a time. You can think of it this way: if you have a data set with $n$ observations, then what you end up running is $n$ separate experiments where the $Y=f(X)$ relationship is learned on $n-1$ observations (Training set) and evaluated on predicting the single remaining observation (Test set).

We'll call each experiment a _fold_. So we start with the first observation (1). Taking this observation out, we then make a new data set using observations 2 through $n$ and learn the $Y=f(X)$ relationship. Then taking that learned model we try to predict observation 1.

![LOOCV](imgs/L14_LOOVC.png)

<br>

**Example:** Linear regression

Consider the case of linear regression where $p=1$ (not the p-value, but the number of variables in $X$). The algorithm for executing LOOCV in this case would be:

* Step 1: Take a single observation $x_i$.
* Step 2: Make a new data set $X^*$ from all observations not $x_i$.
* Step 3: Fit $Y = \hat{\beta}_0 + \hat{\beta}_1 X^*$.
* Step 4: Calculate the squared residual error on the held-out observation $x_i$ (i.e., $RSS_i = (y_i - \hat{y}_i)^2$).
* Step 5: Repeat Steps 1-4 for all _n_ observations and take the mean of the residual errors from each cross validated fold ($CV_n = \frac{1}{n} \sum_{i=1}^{n} RSS_i$).
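
The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a single predictor and made-up toy data; `fit_line` and `loocv_mse` are names chosen for this example.

```python
def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1 * x for a single predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def loocv_mse(xs, ys):
    """LOOCV estimate of test error (CV_n) for simple linear regression."""
    n = len(xs)
    squared_errors = []
    for i in range(n):
        # Steps 1-2: hold out observation i and keep the other n - 1
        x_star = xs[:i] + xs[i + 1:]
        y_star = ys[:i] + ys[i + 1:]
        # Step 3: fit the model on the training fold
        b0, b1 = fit_line(x_star, y_star)
        # Step 4: squared residual on the held-out observation
        squared_errors.append((ys[i] - (b0 + b1 * xs[i])) ** 2)
    # Step 5: average over all n folds
    return sum(squared_errors) / n

# Toy data, made up for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]
print(f"CV_n = {loocv_mse(x, y):.4f}")
```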

<br>

Another way to evaluate LOOCV accuracy is to look at the correlation between the predicted and observed values across all experiments. In this case the algorithm would look like this:

* Step 1: Take a single observation $x_i$.
* Step 2: Make a new data set $X^*$ from all observations not $x_i$.
* Step 3: Fit $Y = \hat{\beta}_0 + \hat{\beta}_1 X^*$.
* Step 4: Predict the value of observation _i_ ($\hat{y}_i$). 
* Step 5: Repeat Steps 1-4 for all _n_ observations.
* Step 6: Calculate the correlation ($r$) between the vector of observed data ($Y$) and the vector of predictions from LOOCV ($\hat{Y}$).
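
The correlation variant differs from the error variant only in what gets collected on each fold. A minimal sketch, again assuming one predictor and made-up toy data (the helper names are illustrative):

```python
import math

def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1 * x for a single predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

def loocv_predictions(xs, ys):
    """Steps 1-5: predict each observation from a model fit to the other n - 1."""
    predictions = []
    for i in range(len(xs)):
        b0, b1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(b0 + b1 * xs[i])
    return predictions

# Toy data, made up for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]
y_hat = loocv_predictions(x, y)
r = pearson_r(y, y_hat)  # Step 6: predicted vs. observed correlation
print(f"r(pred, obs) = {r:.3f}")
```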


<br>

The process is similar for classifiers, except that instead of computing the residual error, you evaluate the classification accuracy on the held-out observation.
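
To make the classifier case concrete, here is a small sketch that uses a 1-nearest-neighbor classifier on a single feature. The choice of classifier, the toy data, and the function names are assumptions for illustration only.

```python
def nearest_neighbor_label(x_train, labels_train, x0):
    """Label of the training point closest to x0 (1-nearest-neighbor)."""
    closest = min(range(len(x_train)), key=lambda j: abs(x_train[j] - x0))
    return labels_train[closest]

def loocv_accuracy(xs, labels):
    """Fraction of held-out observations classified correctly under LOOCV."""
    n = len(xs)
    correct = 0
    for i in range(n):
        # Hold out observation i and classify it from the remaining n - 1
        x_star = xs[:i] + xs[i + 1:]
        labels_star = labels[:i] + labels[i + 1:]
        predicted = nearest_neighbor_label(x_star, labels_star, xs[i])
        correct += int(predicted == labels[i])
    return correct / n

# Toy data, made up for illustration: two well-separated groups
x = [0.2, 0.5, 0.9, 1.1, 3.0, 3.4, 3.9, 4.2]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(f"LOOCV accuracy = {loocv_accuracy(x, labels):.2f}")
```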

<br>

LOOCV makes efficient use of a single data set when evaluating Test Error, particularly if you have a small sample size ($n$) that precludes doing k-fold cross validation (see next section). Let's consider the pros & cons of LOOCV.

<br>

**Pros:** 

* LOOCV gives you a large training set ($n-1$ observations) on each iteration, which produces more stable model fits and more consistent results.
* LOOCV gives you a direct measure of generalizability of a particular $f(X)$. 

<br>

**Cons:**

* LOOCV is computationally costly, because you have to run as many experiments (iterations) as you have observations.
* LOOCV can be more sensitive to high leverage points than k-fold cross validation methods. See the book for explanations on how you can account for this in your Test Error calculation.
* In the context of regression, LOOCV can result in a negative bias when the model fit is evaluated using permutation tests (next lecture). See [Russ Poldrack's post on this](http://www.russpoldrack.org/2012/12/the-perils-of-leave-one-out.html) for more information. 
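
Returning to the leverage point above: for least squares regression, Chapter 5 of the book gives a formula that expresses $CV_n$ in terms of each observation's leverage $h_i$, namely $CV_n = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1-h_i}\right)^2$, where $\hat{y}_i$ comes from a single fit to all $n$ observations. The sketch below, which assumes one predictor and made-up data with one high leverage point, checks that this matches the explicit leave-one-out loop. It also shows that, for this model class, the computational cost of LOOCV can be avoided.

```python
def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1 * x for a single predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def loocv_explicit(xs, ys):
    """CV_n computed the slow way: refit the model n times."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        b0, b1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (b0 + b1 * xs[i])) ** 2
    return total / n

def loocv_leverage(xs, ys):
    """CV_n from ONE fit, inflating each residual by 1 / (1 - h_i)."""
    n = len(xs)
    b0, b1 = fit_line(xs, ys)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    total = 0.0
    for x_i, y_i in zip(xs, ys):
        h_i = 1 / n + (x_i - x_bar) ** 2 / sxx  # leverage of observation i
        residual = y_i - (b0 + b1 * x_i)
        total += (residual / (1 - h_i)) ** 2
    return total / n

# Toy data, made up for illustration; x = 15 is a high leverage point
x = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]
y = [1.1, 2.3, 2.8, 4.2, 4.9, 9.0]
print(f"explicit: {loocv_explicit(x, y):.6f}")
print(f"leverage: {loocv_leverage(x, y):.6f}")
```

Note that the residual of a high leverage point is divided by a small $1-h_i$, so that single observation can dominate the LOOCV error.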


---
# 3. K-Fold Cross Validation

Rather than treat each single observation as an isolated test set, you can run fewer experiments with larger samples in each test set. Here you randomly divide the full sample into k groups (aka k folds) of approximately equal size and evaluate the test error on each experiment (fold) independently.

![k-fold cross validation](imgs/L14_KFold.png)

Just like LOOCV, you estimate either the MSE or the predicted vs. observed correlation from the held-out folds.

<br>

* Step 1: Take a set of _m_ observations ($m=\frac{n}{k}$).
* Step 2: Make a new data set $X^*$ from the remaining data outside the selected _m_.
* Step 3: Fit $Y = \hat{\beta}_0 + \hat{\beta}_1 X^*$.
* Step 4: Calculate either the
    - (i) Residual error for all _m_ observations in the $i^{th}$ fold (i.e., $RSS_i = \sum_{j=1}^{m}(y_j - \hat{y}_j)^2$, where $j$ indexes the observations within the fold).
    - (ii) Vector of all _m_ predicted values in the $i^{th}$ fold (i.e., $\hat{Y}_i = [\hat{y}_1 ... \hat{y}_m]$).


* Step 5: Repeat Steps 1-4 for all k folds.
* Step 6: Calculate either the:
    - (i) Cross validated error: $CV_k = \frac{1}{k \cdot m}\sum_{i=1}^{k}RSS_i$
    - (ii) Predicted vs. observed correlation: $r_{pred,obs}$. 
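
The steps above can be sketched as follows. This is a minimal illustration for one predictor with made-up data; the function name `kfold_cv`, the `seed` argument, and the way folds are dealt out from a shuffled index are choices made for this example. The pooled error is divided by $n$, which equals $k \cdot m$ when the folds are of equal size.

```python
import math
import random

def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1 * x for a single predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

def kfold_cv(xs, ys, k, seed=0):
    """Return (CV_k, r_pred_obs) for simple linear regression with k folds."""
    n = len(xs)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # random assignment to folds
    folds = [idx[f::k] for f in range(k)]  # k folds of roughly n / k each
    total_rss = 0.0
    y_hat = [None] * n
    for test_idx in folds:                 # Step 5: loop over all k folds
        held_out = set(test_idx)           # Step 1: the m held-out observations
        train_idx = [i for i in idx if i not in held_out]  # Step 2
        b0, b1 = fit_line([xs[i] for i in train_idx],
                          [ys[i] for i in train_idx])      # Step 3
        for i in test_idx:                 # Step 4: residuals and predictions
            y_hat[i] = b0 + b1 * xs[i]
            total_rss += (ys[i] - y_hat[i]) ** 2
    return total_rss / n, pearson_r(ys, y_hat)  # Step 6

# Toy data, made up for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [1.3, 1.8, 3.4, 3.9, 5.2, 5.8, 7.1, 8.3, 8.8, 10.2]
cv_5, r_5 = kfold_cv(x, y, k=5)
print(f"CV_5 = {cv_5:.4f}, r(pred, obs) = {r_5:.3f}")
```

Calling `kfold_cv(x, y, k=len(x))` puts one observation in each fold, which reproduces LOOCV.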
    
 <br>
 
You can now see how LOOCV is a special case of k-fold cross validation (i.e., LOOCV is k-fold where $k=n$). However, k-fold cross validation is preferred when $n$ is very large because you don't have to run all $n$ experiments. Nor does it give you a negative bias like LOOCV does. **Thus if you can do a k-fold cross validation, it is preferred over LOOCV**.



---
# 4. Variance-bias tradeoff for k-fold cross validation

As mentioned in the preceding section, k-fold cross validation often provides better estimates of the true test error than LOOCV due to the variance-bias tradeoff. Specifically, LOOCV trains on $n-1$ observations in every fold, so its estimate of Test Error has very little bias, but the $n$ fitted models are nearly identical, which makes the estimate highly variable. This is in contrast to, say, a split-half validation (i.e., a 2-fold cross validation), which trains on only half of the data and can dramatically _overestimate_ Test Error. A nicely balanced k-fold cross validation (e.g., $k=5$ or $k=10$) sits squarely in the middle.

As a general rule you can think of it this way:
* LOOCV has the lowest bias, but highest variance.
    - This is because the $n$ models are trained on nearly identical data, so their errors are highly correlated and the averaged estimate can change a lot from one data set to the next.

* 2-fold cross validation has the highest bias, but lowest variance.
* k-fold cross validation sits between LOOCV and 2-fold cross validation.


Let's explore this in a bit more detail. The plots below show the Test Error for a regression model (MSE) and a kNN classifier across a range of models with increasing flexibility/complexity. The blue line shows the Training Error for both models. The orange (brown) line shows the Test Error for a single validation set (i.e., 2-fold cross validation). The black line shows the Test Error for a 10-fold cross validation on the same data set.

![Variance Bias Tradeoff](imgs/L14_VarianceBiasTradeoff.png)

Notice how the Test Error for 10-fold cross validation and for a single validation set converge to close, but not identical, solutions to the variance-bias tradeoff. In both cases, the 10-fold cross validation approach led to a slightly more conservative model (i.e., lower complexity) than the single validation test. Also, the Test Error was smaller overall in both cases using 10-fold cross validation.
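
For readers who want to reproduce this kind of comparison, here is a hedged sketch that computes the three curves (Training Error, single split-half validation, and 10-fold cross validation) for a kNN regression on simulated data. The data-generating function, the list of neighbor counts, and all helper names are assumptions for illustration; they are not the data behind the figure above, and the numbers will differ from it.

```python
import math
import random

def knn_predict(x_train, y_train, x0, k):
    """Predict y at x0 as the mean response of the k nearest training points."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))[:k]
    return sum(y_train[i] for i in nearest) / k

def mse_on(x_train, y_train, x_eval, y_eval, k):
    """MSE on (x_eval, y_eval) of a kNN model built from the training data."""
    errors = [(y - knn_predict(x_train, y_train, x, k)) ** 2
              for x, y in zip(x_eval, y_eval)]
    return sum(errors) / len(errors)

def take(values, idx):
    """Select the entries of values at the given indices."""
    return [values[i] for i in idx]

def cv_mse(xs, ys, folds, k):
    """Pooled held-out MSE: each fold is predicted from the other folds."""
    total = 0.0
    for f, test_idx in enumerate(folds):
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        total += len(test_idx) * mse_on(take(xs, train_idx), take(ys, train_idx),
                                        take(xs, test_idx), take(ys, test_idx), k)
    return total / len(xs)

rng = random.Random(1)
# Simulated data (an assumption for illustration): y = sin(x) + noise
x = [rng.uniform(0, 6) for _ in range(100)]
y = [math.sin(xi) + rng.gauss(0, 0.3) for xi in x]

idx = list(range(100))
rng.shuffle(idx)
half_a, half_b = idx[:50], idx[50:]        # single split-half validation
folds10 = [idx[f::10] for f in range(10)]  # 10 folds of 10 observations

neighbors = [40, 20, 10, 5, 3, 1]  # fewer neighbors = more flexible model
train_err, single_err, cv10_err = {}, {}, {}
for k in neighbors:
    train_err[k] = mse_on(x, y, x, y, k)
    single_err[k] = mse_on(take(x, half_a), take(y, half_a),
                           take(x, half_b), take(y, half_b), k)
    cv10_err[k] = cv_mse(x, y, folds10, k)
    print(f"k={k:2d}  train={train_err[k]:.3f}  "
          f"single={single_err[k]:.3f}  10-fold={cv10_err[k]:.3f}")

best_single = min(neighbors, key=single_err.get)
best_cv10 = min(neighbors, key=cv10_err.get)
print(f"Chosen k: single split = {best_single}, 10-fold = {best_cv10}")
```

With one neighbor the Training Error drops to zero because every point is its own nearest neighbor, which is exactly why Training Error alone cannot be used to pick model complexity.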