# Table of Contents:

1. [Introduction to Model Selection](#Introduction-to-Model-Selection)
2. [Bias-Variance Trade-off](#Bias-Variance-Trade-off)
3. [Validation/Cross-validation of Model](#Validation/Cross-validation-of-Model)

## Introduction to Model Selection

**Model selection** is the act of choosing the regression form (simple vs. multiple vs. logistic, etc.) and selecting appropriate feature variables to use. It is a diverse topic, and many of its techniques will not be covered here.

Remember the aphorism attributed to George Box, "All models are wrong, but some are useful." All of the models that we choose to fit a dataset will produce some error. The issue for the analyst is to find a model that:

1. **estimates the relationship between feature variables and a response variable**
2. **predicts accurate future response values**

given an acceptable amount of error.

[(back to top)](#Table-of-Contents:)

## Bias-Variance Trade-off

Why not find a model that gives no errors at all? Surely that will both explain the relationship between feature/response variables and give completely accurate predictions.

Unfortunately, when the model reproduces every response value exactly (no error), it is fitting the noise and random fluctuations in the data rather than the underlying deterministic relationship. For the initial dataset, the result is a curve that passes exactly through every one of the response values.

For example, let's denote a true model to be:

\begin{equation}
Y = 1.5X + \varepsilon
\end{equation}

with the error term having a mean of $0$ and a standard deviation of $1$. This produces the black dots in the following image:

<img src="images/overfitting example plot.png" alt="" align=center/>

The blue line is a regression fit to the data with a model using the following form:

\begin{equation}
\widehat{y} = b_0 + b_1 x
\end{equation}

The estimated parameters are $\widehat{b}_0 = 0.066$ and $\widehat{b}_1 = 1.502$, which are close to the true values of $b_0 = 0$ and $b_1 = 1.5$. Now, let's form a regression model with many more terms:

\begin{equation}
\widehat{y} = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \dots + b_9 x^9
\end{equation}

That produces the red line above. The red line is an example of an **overfitted line** and is characterized by hitting all of the original response values exactly. The overfitted line captures all of the minute fluctuations in our original response values but does not generalize well to new values:

<img src="images/overfitting with additional values.png" alt="" align=center/>

That is because the regression model is trained to our original data too closely. Having estimated its parameters precisely from the fluctuations in the original data, it cannot accommodate the variations that come with new data. This is known as the **error due to variance**. Generally, the error due to variance is the error that arises because the fitted model changes substantially when it is estimated from a different sample of data.
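The overfitting behaviour described above can be reproduced with a small simulation. This is a minimal sketch under several assumptions that do not come from the original figures: the slope of 1.5 quoted with the parameter estimates, ten evenly spaced feature values, a fixed random seed, and `numpy.polynomial.Polynomial.fit` standing in for whatever software produced the plots.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only

# Assumed setup: 10 training points from Y = 1.5 X + eps, eps ~ N(0, 1)
x_train = np.linspace(0, 9, 10)
y_train = 1.5 * x_train + rng.normal(0, 1, size=x_train.size)

# New observations from the same true model, inside the same x range
x_new = rng.uniform(0, 9, size=50)
y_new = 1.5 * x_new + rng.normal(0, 1, size=x_new.size)


def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((y - model(x)) ** 2))


line = Polynomial.fit(x_train, y_train, deg=1)    # b0 + b1 x
wiggly = Polynomial.fit(x_train, y_train, deg=9)  # b0 + b1 x + ... + b9 x^9

results = {
    "degree 1": (mse(line, x_train, y_train), mse(line, x_new, y_new)),
    "degree 9": (mse(wiggly, x_train, y_train), mse(wiggly, x_new, y_new)),
}
for name, (train_mse, new_mse) in results.items():
    print(f"{name}: training MSE = {train_mse:.4f}, new-data MSE = {new_mse:.4f}")
```

With ten points, the ninth-degree polynomial has as many parameters as there are observations, so its training error is essentially zero. Its error on the new points is typically much larger than that of the straight line.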

At the other extreme, if the $SSE$ of a model is large, then the fitted line will, overall, not be close to the response values. For example, consider the true model below:

\begin{equation}
Y = (X - 3)^2 + \varepsilon
\end{equation}

with the error term having a mean of $0$ and a standard deviation of $1$. This produces the black dots in the following image:

<img src="images/high bias example plot.png" alt="" align=center/>

The blue line is a regression fit to the data with a model using the following form:

\begin{equation}
\widehat{y} = b_0 + b_1 x
\end{equation}

whereas the red line is the regression fit to the data with a model using the following form:

\begin{equation}
\widehat{y} = b_0 + b_1 x^2
\end{equation}

Given the true model, the red line is the more appropriate form and produces a smaller $SSE$ as a result. The blue line is said to **underfit** the data, which is characterized by the large error it produces. A large error between the original response values and the predicted ones is known as the **error due to bias**. Generally, the error due to bias indicates that some model assumptions are invalid.
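The gap in $SSE$ between an underfitted line and a curved fit can be checked numerically. The sketch below is illustrative only: the sample size, the range of $x$, and the seed are assumptions, and the curved model is taken to be a full quadratic $b_0 + b_1 x + b_2 x^2$ (also an assumption, chosen because expanding $(X - 3)^2$ produces a linear term).

```python
import numpy as np

rng = np.random.default_rng(1)  # assumed seed

# Assumed setup: 30 points from Y = (X - 3)^2 + eps, eps ~ N(0, 1)
x = np.linspace(0, 6, 30)
y = (x - 3) ** 2 + rng.normal(0, 1, size=x.size)


def sse(design, y):
    """Fit ordinary least squares and return the sum of squared errors."""
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ b) ** 2))


ones = np.ones_like(x)
sse_linear = sse(np.column_stack([ones, x]), y)             # b0 + b1 x
sse_quadratic = sse(np.column_stack([ones, x, x ** 2]), y)  # assumed quadratic form

print(f"SSE, straight line: {sse_linear:.2f}")
print(f"SSE, quadratic:     {sse_quadratic:.2f}")
```

Because the straight line cannot bend, its $SSE$ stays large no matter how its two parameters are chosen, which is the signature of error due to bias.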

By introducing a model with lower bias, you typically wind up increasing the variance. Conversely, by minimizing variance you typically raise the bias. This tension is known as the **bias-variance trade-off** and is an inherent problem when selecting models. How far to reduce the bias or the variance error depends on the situation. Indications for recognizing whether a model has too much bias or variance are covered in the next section.

[(back to top)](#Table-of-Contents:)

## Validation/Cross-validation of Model

One of the first steps to take when diagnosing a model for overfitting is to look at the degrees of freedom. For a model fit to $n$ observations, $df = n - k - 1$, where $k$ is the number of $b_i$'s (not including $b_0$). Think of the degrees of freedom as how much "wiggle room" a model has to adapt to new response values. As $k$ increases, the $df$ will decrease, signifying a reduction of this "wiggle room." Therefore, as $k$ increases, your model will become less and less flexible to new results. As $k$ approaches $n$, the danger of overfitting increases.
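As a quick illustration of the formula, the snippet below evaluates $df = n - k - 1$ for a few model sizes. The helper name `degrees_of_freedom` and the choice of $n = 10$ are assumptions made for this example.

```python
def degrees_of_freedom(n, k):
    """Residual degrees of freedom for n observations and k slope terms."""
    return n - k - 1


n = 10  # assumed number of observations
for k in (1, 3, 9):
    print(f"k = {k}: df = {degrees_of_freedom(n, k)}")
```

With ten observations, a straight line leaves 8 degrees of freedom, while a ninth-degree polynomial leaves none.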

A common technique to check for underfitting/overfitting is to split your data into two different datasets: a **training set** and a **cross-validation set**. The training set contains the data you use to build the model. The cross-validation set is the data you use, once you've estimated $\widehat{b}_i$ from the training set, to see whether your model returns comparable results. The cross-validation set can be thought of as "new" data that the model has not seen. A typical split puts $2/3$ of the data in the training set and the remaining $1/3$ in the cross-validation set. Let's modify our definition of $SSE$ a bit to get the **cost function**:

\begin{equation}
J\left(\mathbf{\vec{b}} \right) = MSE = \frac{1}{n} SSE_n = \frac{1}{n} \sum_{i = 1}^{n} \left(y_i - \mathbf{\vec{x}_i} \mathbf{\vec{b}} \right)^2
\end{equation}

where $n$ is the number of observations in whichever dataset (training or cross-validation) you are looking at. The cost function defined here is just the mean squared error. Since datasets from similar sources will produce a smaller $SSE$ if they contain fewer observations, the $MSE$ normalizes this difference by looking at the average squared deviation.
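The split and the cost function can be sketched in a few lines of NumPy. Everything specific here is an assumption for illustration: the synthetic data, the seed, the straight-line model, and the helper names `design` and `cost`.

```python
import numpy as np

rng = np.random.default_rng(2)  # assumed seed

# Assumed synthetic data: 90 observations from Y = 1.5 X + eps
x = rng.uniform(0, 10, size=90)
y = 1.5 * x + rng.normal(0, 1, size=x.size)

# Shuffle the row indices, then keep 2/3 for training and 1/3 for cross-validation
order = rng.permutation(x.size)
n_train = (2 * x.size) // 3
train_idx, cv_idx = order[:n_train], order[n_train:]


def design(x):
    """Design matrix with an intercept column: rows are (1, x_i)."""
    return np.column_stack([np.ones_like(x), x])


def cost(b, x, y):
    """MSE = SSE / n for the observations passed in."""
    residuals = y - design(x) @ b
    return float(np.mean(residuals ** 2))


# Estimate b from the training set only
b_hat, *_ = np.linalg.lstsq(design(x[train_idx]), y[train_idx], rcond=None)

train_cost = cost(b_hat, x[train_idx], y[train_idx])
cv_cost = cost(b_hat, x[cv_idx], y[cv_idx])
print(f"training MSE = {train_cost:.3f}, cross-validation MSE = {cv_cost:.3f}")
```

Note that the cross-validation rows are never used when estimating `b_hat`; they only enter through the second call to `cost`.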

With the training set and cross-validation set in place, we can compare the $MSE$s for a variety of different models to see which one has a happy balance between bias and variance.

For polynomial regression, we can plot the $MSE$ for both datasets on the y-axis and the number of polynomial terms ($x$, $x^2$, $x^3$, etc.) on the x-axis. We start at $b_1 x$, find the $MSE$ of both the training and cross-validation sets, and add those values to our plot. Then we add a $b_2 x^2$ term to the model and find the $MSE$s of the two datasets again. We repeat the process for $b_3 x^3$ and so on. As a result, we will have two curves, one representing the cost of increasing polynomial order for the training data, and another for the cross-validation data. To diagnose the model:

* **underfitting**: both cost function plots will have high cost and converge to similar values
* **overfitting**: the training cost function plot will decrease monotonically with each polynomial degree, while the cross-validation plot will decrease for a while and then increase again, creating a substantial gap between it and the training plot

The appropriate number of polynomial terms to use is indicated by the point where both the training and the cross-validation cost function plots are relatively low and the gap between them is minimal.
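The procedure above can be sketched as a loop over polynomial degrees. This is a hedged example rather than a definitive implementation: the quadratic true model, sample sizes, seed, and maximum degree of 9 are assumptions, and the costs are printed rather than plotted to keep the block self-contained.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)  # assumed seed

# Assumed synthetic data from Y = (X - 3)^2 + eps, eps ~ N(0, 1)
x = rng.uniform(0, 6, size=60)
y = (x - 3) ** 2 + rng.normal(0, 1, size=x.size)

# First 2/3 for training, last 1/3 for cross-validation (rows are already random)
x_train, y_train = x[:40], y[:40]
x_cv, y_cv = x[40:], y[40:]


def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((y - model(x)) ** 2))


degrees = range(1, 10)
train_costs, cv_costs = [], []
for d in degrees:
    model = Polynomial.fit(x_train, y_train, deg=d)  # fit on training data only
    train_costs.append(mse(model, x_train, y_train))
    cv_costs.append(mse(model, x_cv, y_cv))

for d, jt, jc in zip(degrees, train_costs, cv_costs):
    print(f"degree {d}: training MSE = {jt:.3f}, CV MSE = {jc:.3f}")

best_degree = degrees[int(np.argmin(cv_costs))]
print("degree with lowest CV MSE:", best_degree)
```

Passing `degrees`, `train_costs`, and `cv_costs` to a plotting library gives the two curves described in the text. The training cost never increases as terms are added, so only the cross-validation curve can signal where overfitting begins.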

Another way to examine cross-validation is to look at how the cost function plots change as the number of observations used to calculate them increases. Start with just $m$ values in both the training and cross-validation sets, where $m < n_{CV}$. Calculate the cost of a particular model using both the training and cross-validation sets, and plot the costs with the number of observations in each set on the x-axis. Increase this number by some fixed step (usually $+1$), and recalculate the costs. Plot the costs again with this increased number of observations. Repeat this process until $m$ reaches $n_{CV}$. To diagnose the model:

* **underfitting**: both cost function plots will have high cost and converge to similar values 
* **overfitting**: the training cost function plot will be relatively low-cost and there will be a substantial gap between that and the cross-validation plot
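The growing-sample procedure can be sketched as follows. The setup is assumed for illustration: synthetic quadratic data, a deliberately underfitting straight-line model, a fixed seed, and a starting size of $m = 3$ so that the two-parameter fit is not exact.

```python
import numpy as np

rng = np.random.default_rng(4)  # assumed seed

# Assumed synthetic data from Y = (X - 3)^2 + eps, eps ~ N(0, 1)
x = rng.uniform(0, 6, size=90)
y = (x - 3) ** 2 + rng.normal(0, 1, size=x.size)
x_train, y_train = x[:60], y[:60]
x_cv, y_cv = x[60:], y[60:]
n_cv = x_cv.size


def design(x):
    """Straight-line design matrix; this model underfits the quadratic data."""
    return np.column_stack([np.ones_like(x), x])


def cost(b, x, y):
    """MSE of the line with coefficients b on the observations passed in."""
    return float(np.mean((y - design(x) @ b) ** 2))


sizes, train_costs, cv_costs = [], [], []
for m in range(3, n_cv + 1):
    # Fit on the first m training rows, then score m rows of each set
    b, *_ = np.linalg.lstsq(design(x_train[:m]), y_train[:m], rcond=None)
    sizes.append(m)
    train_costs.append(cost(b, x_train[:m], y_train[:m]))
    cv_costs.append(cost(b, x_cv[:m], y_cv[:m]))

print(f"m = {sizes[0]}: training MSE = {train_costs[0]:.2f}, CV MSE = {cv_costs[0]:.2f}")
print(f"m = {sizes[-1]}: training MSE = {train_costs[-1]:.2f}, CV MSE = {cv_costs[-1]:.2f}")
```

Plotting `train_costs` and `cv_costs` against `sizes` gives the two curves to inspect. For this underfitting model, both costs should remain high even at the largest $m$.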

If the regressions are being done in SAS, a related measure is available: the PRESS statistic. It is a form of cross-validation in which the $i$th observation is left out when estimating $\mathbf{\vec{b}}$. $\widehat{y}_i$ is then calculated with that estimate of $\mathbf{\vec{b}}$, and the residual (known as the PRESS residual) is:

\begin{equation}
r_{i, PRESS} = y_{i} - \widehat{y}_{i, -i}
\end{equation}

with the $-i$ indicating that the $i$th observation was left out when determining $\mathbf{\vec{b}}$ for $\widehat{y}_i$.

If the sum of squares of the PRESS residuals:

\begin{equation}
PRESS = \sum_{i = 1}^{n} r_{i, PRESS}^2
\end{equation}

is small, then the model is considered a good candidate for the dataset. Because $R^2$ only compares the fitted model against the mean response on the same data used for fitting, the PRESS statistic is generally a better indicator of the predictive power of the model.
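To make the leave-one-out idea concrete, the sketch below computes the sum of squared PRESS residuals directly from the definition using NumPy. It is not how SAS computes the statistic internally; the synthetic data, seed, and straight-line model are assumptions. For ordinary least squares, the same value can also be obtained without refitting, as $\sum_i \left( r_i / (1 - h_{ii}) \right)^2$, where $r_i$ is the ordinary residual and $h_{ii}$ is the $i$th diagonal entry of the hat matrix.

```python
import numpy as np

rng = np.random.default_rng(5)  # assumed seed

# Assumed synthetic data: 20 observations from Y = 1.5 X + eps
x = rng.uniform(0, 10, size=20)
y = 1.5 * x + rng.normal(0, 1, size=x.size)
X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept

# Direct definition: refit n times, each time leaving out observation i
press_residuals = np.empty(x.size)
for i in range(x.size):
    keep = np.arange(x.size) != i
    b_minus_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    press_residuals[i] = y[i] - X[i] @ b_minus_i
press = float(np.sum(press_residuals ** 2))

# Shortcut for ordinary least squares: r_i / (1 - h_ii), h_ii = leverages
b_full, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ b_full
leverages = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
press_shortcut = float(np.sum((residuals / (1 - leverages)) ** 2))

sse = float(np.sum(residuals ** 2))
print(f"SSE = {sse:.3f}, PRESS = {press:.3f}, shortcut PRESS = {press_shortcut:.3f}")
```

The two PRESS values should agree up to rounding, and PRESS is never smaller than the ordinary $SSE$, since each prediction is made without the observation it is scored against.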

[(back to top)](#Table-of-Contents:)