# Table of Contents:

1. [Introduction to Model Selection](#Introduction-to-Model-Selection)
2. [Bias-Variance Trade-off](#Bias-Variance-Trade-off)
3. [Coefficient of Determination](#Coefficient-of-Determination)
4. [Validation/Cross-validation of Model](#Validation/Cross-validation-of-Model)
5. [Stepwise Parameter Selection](#Stepwise-Parameter-Selection)
6. [Regularization Methods](#Regularization-Methods)

## Introduction to Model Selection

**Model selection** is the act of choosing the regression form (simple vs. multiple vs. logistic, etc.) and selecting appropriate feature variables to use. It is a diverse topic, with many techniques that will not be covered here.

Remember the aphorism attributed to George Box, "All models are wrong, but some are useful." All of the models that we choose to fit a dataset will produce some error. The issue for the analyst is to find a model that:

1. **estimates the relationship between feature variables and a response variable**
2. **predicts accurate future response values**

given an acceptable amount of error.

[(back to top)](#Table-of-Contents:)

## Bias-Variance Trade-off

Why not find a model that gives no errors at all? Surely that will both explain the relationship between feature/response variables and give completely accurate predictions.

Unfortunately, when the model completely determines all of the response variables (no error), we are fitting the noise and sensitive fluctuations instead of the underlying deterministic equation. Thus, for the initial dataset, we create a line that exactly hits every one of the response values.

For example, let's denote a true model to be:

\begin{equation}
Y = 1.5X + \varepsilon
\end{equation}

with the error term having a mean of $0$ and a standard deviation of $1$. This produces the black dots in the following image:

<img src="files/images/overfitting%20example%20plot.png">

The blue line is a regression fit to the data with a model using the following form:

\begin{equation}
\widehat{y} = b_0 + b_1 x
\end{equation}

The estimated parameters are $\widehat{b}_0 = 0.066$ and $\widehat{b}_1 = 1.502$, which are pretty close to the true values of $b_0 = 0$ and $b_1 = 1.5$. Now, let's form a regression model with many more terms:

\begin{equation}
\widehat{y} = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + ... + b_9 x^9
\end{equation}

That produces the red line above. The red line is an example of an **overfitted line** and is characterized by hitting all of the original response values exactly. The overfitted line captures all of the minute fluctuations in our original data but does not generalize well to new values:

<img src="files/images/overfitting%20with%20additional%20values.png">

That is because the regression model is trained to our original data too closely. Having estimated its parameters precisely from the variations in the original data, it is inflexible to the variations that come with new data. This is known as the **error due to variance**. Generally, the error due to variance is the error produced by the model as a result of different values in the feature variables. It is easy to overfit the data by using the $R^2$ statistic alone as a judge of model adequacy, because an overfitted model has an $R^2$ value that is very close to $1$.
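The overfitting behavior described above can be reproduced with a small simulation. This is only a sketch: the slope of $1.5$, the ten evenly spaced $x$ values, the seed, and the names `TRUE_SLOPE`, `line`, and `wiggly` are assumptions made for illustration and are not taken from the plots.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # arbitrary fixed seed so the run is repeatable

TRUE_SLOPE = 1.5  # assumed slope of the true model; noise has mean 0 and sd 1

# Ten training observations and nine new observations taken between them.
x_train = np.linspace(0.0, 9.0, 10)
y_train = TRUE_SLOPE * x_train + rng.normal(0.0, 1.0, size=x_train.size)
x_new = x_train[:-1] + 0.5
y_new = TRUE_SLOPE * x_new + rng.normal(0.0, 1.0, size=x_new.size)

def sse(model, x, y):
    """Sum of squared errors of a fitted polynomial on the points (x, y)."""
    return float(np.sum((y - model(x)) ** 2))

line = Polynomial.fit(x_train, y_train, deg=1)     # b0 + b1 x
wiggly = Polynomial.fit(x_train, y_train, deg=9)   # b0 + b1 x + ... + b9 x^9

sse_train_line = sse(line, x_train, y_train)
sse_train_deg9 = sse(wiggly, x_train, y_train)
sse_new_line = sse(line, x_new, y_new)
sse_new_deg9 = sse(wiggly, x_new, y_new)

print(f"training SSE: line = {sse_train_line:.3f}, degree 9 = {sse_train_deg9:.3e}")
print(f"new-data SSE: line = {sse_new_line:.3f}, degree 9 = {sse_new_deg9:.3f}")
```

With ten points and ten coefficients, the degree 9 fit passes through every training point, so its training $SSE$ is essentially $0$. Its $SSE$ on the new points is usually much larger than that of the straight line, although the exact numbers depend on the random draw.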

Alternatively, if the $SSE$ of a model is large, then the fit line will overall not be close to the response values. For example, consider the true model below:

\begin{equation}
Y = (X - 3)^2 + \varepsilon
\end{equation}

with the error term having a mean of $0$ and a standard deviation of $1$. This produces the black dots in the following image:

<img src="files/images/high%20bias%20example%20plot.png">

The blue line is a regression fit to the data with a model using the following form:

\begin{equation}
\widehat{y} = b_0 + b_1 x
\end{equation}

whereas the red line is the regression fit to the data with a model using the following form:

\begin{equation}
\widehat{y} = b_0 + b_1 x^2
\end{equation}

Given the true model, the red line is the more appropriate form and produces a smaller $SSE$ as a result. The blue line is said to **underfit** the data, characterized by the large error calculated. A large error between the original response values and the predicted ones is known as the **error due to bias**. Generally, the error due to bias indicates that some model assumptions are invalid.
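As a rough numerical check of the underfitting example, the sketch below fits both forms by least squares and compares their $SSE$. The $x$ range of $[0, 10]$, the sample size, and the helper name `fit_sse` are assumptions, since the text does not say how the plotted data were generated.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary fixed seed

# Assumed sample: 50 x values on [0, 10] from Y = (X - 3)^2 + noise (sd 1).
x = np.linspace(0.0, 10.0, 50)
y = (x - 3.0) ** 2 + rng.normal(0.0, 1.0, size=x.size)

def fit_sse(design, y):
    """Least-squares fit of y on the design matrix; returns (coefficients, SSE)."""
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    return b, float(np.sum((y - design @ b) ** 2))

ones = np.ones_like(x)
b_line, sse_line = fit_sse(np.column_stack([ones, x]), y)        # b0 + b1 x
b_quad, sse_quad = fit_sse(np.column_stack([ones, x ** 2]), y)   # b0 + b1 x^2

print(f"SSE of b0 + b1 x   : {sse_line:.1f}")
print(f"SSE of b0 + b1 x^2 : {sse_quad:.1f}")
```

Under these assumptions the straight line leaves a noticeably larger $SSE$ than the form with the squared term, which is the signature of error due to bias.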

By introducing a model with lower bias, you typically wind up increasing the variance. Conversely, by minimizing variance you raise the bias. This tension is known as the **bias-variance trade-off** and is an inherent problem when selecting models. How far to reduce bias or variance error depends on the situation. Indications for recognizing whether a model has too much bias or variance are covered in the following sections.

[(back to top)](#Table-of-Contents:)

## Coefficient of Determination

One way to judge the appropriateness of a model is through a statistic known as the **coefficient of determination**, more commonly referred to as $R^2$. $R^2$ can be defined in the following way:

\begin{equation}
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
\end{equation}

$R^2$ is often described as the ratio between the variability in the data as explained by the model to the total variability in the data. Looking more closely at the definitions of $SSR$ and $SST$ however, it is clear that it describes the variability of the predicted and original values, respectively, around the mean of the original values:

\begin{equation}
R^2 = \frac{SSR}{SST} = \frac{\sum_{i = 1}^{n} (\widehat{y}_i - \overline{y})^2}{\sum_{i = 1}^{n} (y_i - \overline{y})^2}
\end{equation}

In other words, $R^2$ describes how your model's variance around the mean of the response values compares to the variance of the original response data. For a least-squares fit that includes an intercept, $R^2$ is bounded between $0$ and $1$. Models are often built to drive $R^2$ as close to $1$ as possible, the reasoning being that the closer $R^2$ is to $1$, the better a model is at describing the underlying relationship between the selected feature variable(s) and the response variable. This is not necessarily the case. For example, let's assume that a true model is:

\begin{equation}
y = x
\end{equation}

We add noise to the model coming from a normal distribution with a mean of $0$ and a variance of $9$ (standard deviation of $3$). Then, we fit the model:

\begin{equation}
\widehat{y} = \widehat{b}_0 + \widehat{b}_1 x
\end{equation}

Plotting the results:

<img src="files/images/Rsq%20example.png">

The blue line is the true model while the red dotted line is our estimated model. $R^2$ has been calculated using values from the true model and the estimated model, as indicated in the title of the image. Notice how the $R^2$ for the estimated model is a lot higher than the one for the true model, despite the true model describing the underlying deterministic function.

Now, let's go even further and add a polynomial term to our estimated model:

\begin{equation}
\widehat{y} = \widehat{b}_0 + \widehat{b}_1 x + \widehat{b}_2 x^2
\end{equation}

Fitting the model now:

<img src="files/images/Rsq%20example2.png">

Adding a second polynomial term increases the estimated $R^2$ value, even though this new model is a worse representation of the true one. By increasing the polynomial terms, we have begun to map our parameters to the noise in the data instead of the underlying deterministic function. This will give us a false sense of confidence that our model is performing well.
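The same effect can be seen numerically. The sketch below assumes 30 evenly spaced $x$ values on $[0, 10]$ and an arbitrary seed (neither is stated in the text) and computes $R^2$ as $1 - SSE/SST$ for the true model, the linear fit, and the quadratic fit.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)  # arbitrary fixed seed

# Assumed sample: 30 x values on [0, 10]; true model y = x, noise sd = 3.
x = np.linspace(0.0, 10.0, 30)
y = x + rng.normal(0.0, 3.0, size=x.size)

def r_squared(y, y_hat):
    """R^2 computed as 1 - SSE/SST."""
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - sse / sst)

r2_true = r_squared(y, x)                                     # predictions from y = x
r2_linear = r_squared(y, Polynomial.fit(x, y, deg=1)(x))      # b0 + b1 x
r2_quadratic = r_squared(y, Polynomial.fit(x, y, deg=2)(x))   # b0 + b1 x + b2 x^2

print(f"R^2 true model   : {r2_true:.4f}")
print(f"R^2 linear fit   : {r2_linear:.4f}")
print(f"R^2 quadratic fit: {r2_quadratic:.4f}")
```

Because least squares minimizes $SSE$ over a family that contains the previous model, $R^2$ on the fitted data can only stay the same or rise as terms are added, regardless of whether the extra terms reflect the true relationship.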

$R^2$ is a useful statistic, but it needs to be taken in context with the other significance tests described previously.

Since $R^2$ is heavily affected by overfitting, the **adjusted coefficient of determination** can be used to help balance the result:

\begin{equation}
R^2_{adj} = 1 - \frac{\frac{SSE}{n - k - 1}}{\frac{SST}{n - 1}} = 1 - \frac{n - 1}{n - k - 1} \frac{SSE}{SST}
\end{equation}

The $R^2_{adj}$ penalizes the addition of more parameters with the $n - k - 1$ term in the denominator. Depending on the data though, $R^2_{adj}$ won't necessarily be a quick fix for all overfitting problems. Again, context is very important.
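A direct translation of the formula shows how the penalty works. The function name `adjusted_r2` and the numbers in the usage lines are made up for illustration.

```python
def adjusted_r2(sse, sst, n, k):
    """Adjusted R^2 for n observations and k feature coefficients (b_0 excluded)."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 for the adjusted R^2 to be defined")
    return 1.0 - (n - 1) / (n - k - 1) * (sse / sst)

# Same SSE/SST ratio (plain R^2 = 0.8), but a different number of features.
r2_adj_one_feature = adjusted_r2(sse=20.0, sst=100.0, n=30, k=1)
r2_adj_five_features = adjusted_r2(sse=20.0, sst=100.0, n=30, k=5)
print(r2_adj_one_feature, r2_adj_five_features)
```

With these made-up values the adjusted statistic is about $0.793$ for one feature and about $0.758$ for five, so extra parameters that do not reduce $SSE$ lower $R^2_{adj}$ even though $R^2$ itself is unchanged.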

Some other notes about $R^2$:
* $R^2$ cannot be compared between transformed and untransformed data
* $R^2$ cannot be compared between different datasets for the same model
* $R^2$ is not a measure of error and does not provide one
* a high $R^2$ does not by itself indicate an explanatory relationship; it only indicates that the feature and response variables are correlated

In the end, $R^2$ is simply a summary statistic that describes how well your model is doing explaining the response values compared to the mean response value. It should never be the only determining factor in deciding between different models.

[(back to top)](#Table-of-Contents:)

## Validation/Cross-validation of Model

One of the first steps to take when diagnosing a model for overfitting is to look at the degrees of freedom. For a feature variable set, $df = n - k - 1$, where $k$ is the number of coefficients $b_i$ (not including $b_0$). Imagine each degree of freedom as how much "wiggle room" a model has to adapt to new response values. As $k$ increases, the $df$ will decrease, signifying a reduction of this "wiggle room." Therefore, as $k$ increases, your model will become less and less flexible to new results. If $k$ approaches $n$, then the danger of overfitting increases.

A common technique to check for underfitting/overfitting is to split your data into two different datasets: a **training set** and a **cross-validation set**. The training set contains the data you use to build the model. The cross-validation set is the data you use, once you've estimated $\widehat{b}_i$ from the training set, to see whether your model returns comparable results. The cross-validation set can be thought of as "new" data that the model has not seen. A typical training set consists of $2/3$ of the data, with the cross-validation set representing the other $1/3$. Let's modify our definition of $SSE$, $SSE = \sum_{i = 1}^{n} \left(y_i - \mathbf{\vec{x}_i} \mathbf{\vec{b}} \right)^2$, slightly to get the **cost function**:

\begin{equation}
J\left(\vec{\theta} \right) = MSE = \frac{1}{n} SSE_n = \frac{1}{n} \sum_{i = 1}^{n} \left(y_i - \mathbf{\vec{x}_i} \mathbf{\vec{b}} \right)^2
\end{equation}

where $n$ is the number of observations in whichever dataset (training or cross-validation) you are looking at. The cost function defined is essentially just the mean squared error. Since datasets from similar sources will produce a smaller $SSE$ when they contain fewer observations, the $MSE$ normalizes this difference by looking at the average squared deviation.
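A minimal sketch of the split and the cost function is shown below. The simulated data, the seed, and the helper name `mse` are assumptions; only the $2/3$ to $1/3$ split and the $MSE$ formula come from the text.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: SSE divided by the number of observations."""
    return float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

rng = np.random.default_rng(3)  # arbitrary fixed seed
n = 90
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)   # hypothetical data

# Shuffle the row indices, then use 2/3 for training and 1/3 for cross-validation.
idx = rng.permutation(n)
n_train = (2 * n) // 3
train, cv = idx[:n_train], idx[n_train:]

# Estimate b on the training set only.
X_train = np.column_stack([np.ones(train.size), x[train]])
b, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)

# Evaluate the same b on both sets.
X_cv = np.column_stack([np.ones(cv.size), x[cv]])
mse_train = mse(y[train], X_train @ b)
mse_cv = mse(y[cv], X_cv @ b)
print(f"training MSE = {mse_train:.3f}, cross-validation MSE = {mse_cv:.3f}")
```

Dividing by the size of each set is what makes the two numbers comparable even though the training set has twice as many observations.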

With the training set and cross-validation set in place, we can compare the $MSE$s for a variety of different models to see which one has a happy balance between bias and variance.

For polynomial regression, we can plot the $MSE$ for both datasets on the y-axis and the number of polynomial terms ($x$, $x^2$, $x^3$, etc.) on the x-axis. We start at $b_1 x$, find the $MSE$ of both the training and cross-validation sets, and add those values to our plot. Then we add a $b_2 x^2$ term to the model and find the $MSE$s of the two datasets again. We repeat the process for $b_3 x^3$ and so on. As a result, we will have two curves, one representing the cost of increasing polynomial order for the training data, and another for the cross-validation data. To diagnose the model:
* **underfitting**: both cost function plots will have high cost and converge to similar values
* **overfitting**: the training cost function plot will decrease monotonically with each polynomial degree, while the cross-validation plot will decrease for a while, then increase again creating a substantial gap between it and the training plot

The appropriate number of polynomial terms is indicated by both the training and the cross-validation cost being relatively low, with a minimal gap between them. For example, consider the true model:

\begin{equation}
y = 0.08 x + 0.02 x^3 + \varepsilon
\end{equation}

where $\varepsilon$ is normally distributed with a mean of $0$ and a standard deviation of $100$ (to make the data fluctuations more pronounced). A comparison between the training and cross-validation cost curves can be found here:

<img src="files/images/cross-validation%20example.png">

The cost is displayed on a log scale to make the minima easier to see. The cross-validation cost reaches its minimum at 3 polynomial terms. Using just 3 polynomial terms, we can fit the data very well:

<img src="files/images/cross-validation%20model%20selection.png">

Therefore, a polynomial of degree 3 represents a good candidate model for the data in this case.
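The degree sweep can be sketched as follows. The $x$ range of $[-30, 30]$, the sample size, and the seed are assumptions (the text does not give them), so the numbers will not match the plots; `Polynomial.fit` from NumPy is used simply as a convenient least-squares polynomial fitter.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(4)  # arbitrary fixed seed

# Assumed sample: 150 x values drawn uniformly on [-30, 30].
n = 150
x = rng.uniform(-30.0, 30.0, size=n)
y = 0.08 * x + 0.02 * x ** 3 + rng.normal(0.0, 100.0, size=n)

# 2/3 training, 1/3 cross-validation (x was drawn in random order already).
n_train = (2 * n) // 3
x_tr, y_tr = x[:n_train], y[:n_train]
x_cv, y_cv = x[n_train:], y[n_train:]

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

degrees = list(range(1, 9))
train_mse, cv_mse = [], []
for d in degrees:
    model = Polynomial.fit(x_tr, y_tr, deg=d)   # fit on the training set only
    train_mse.append(mse(y_tr, model(x_tr)))
    cv_mse.append(mse(y_cv, model(x_cv)))

best_degree = degrees[int(np.argmin(cv_mse))]
for d, t, c in zip(degrees, train_mse, cv_mse):
    print(f"degree {d}: training MSE = {t:10.1f}, cross-validation MSE = {c:10.1f}")
print("degree with the lowest cross-validation MSE:", best_degree)
```

The training cost never rises as the degree grows, while the cross-validation cost is the one to watch for a minimum. With noise this large, the degree that minimizes the cross-validation cost can vary from one random split to another.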

Another way to examine cross-validation is to look at the cost function plots as the number of observations used to calculate them increases. Start with just $m$ values in both the training and cross-validation sets, where $m < n_{CV}$. Calculate the cost of a particular model using both the training and cross-validation sets, and plot them with the number of observations in each set on the x-axis. Increase this number by some fixed amount (usually $+1$), and recalculate the costs. Plot the costs again with this increased number of observations. Repeat this process until $m$ reaches $n_{CV}$. To diagnose the model:

* **underfitting**: both cost function plots will have high cost and converge to similar values 
* **overfitting**: the training cost function plot will be relatively low-cost and there will be a substantial gap between that and the cross-validation plot
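The procedure above can be sketched for a straight-line model as shown below. The simulated data, the starting size of $m = 3$, and the helper names `fit_line` and `mse_line` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)  # arbitrary fixed seed
n_train, n_cv = 60, 30

# Hypothetical data from y = 1 + 2x + noise (sd 2).
x_tr = rng.uniform(0.0, 10.0, size=n_train)
y_tr = 1.0 + 2.0 * x_tr + rng.normal(0.0, 2.0, size=n_train)
x_cv = rng.uniform(0.0, 10.0, size=n_cv)
y_cv = 1.0 + 2.0 * x_cv + rng.normal(0.0, 2.0, size=n_cv)

def fit_line(x, y):
    """Least-squares estimates (b0, b1) for y = b0 + b1 x."""
    X = np.column_stack([np.ones(x.size), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

def mse_line(b, x, y):
    return float(np.mean((y - (b[0] + b[1] * x)) ** 2))

sizes = list(range(3, n_cv + 1))      # m = 3, 4, ..., n_cv
train_cost, cv_cost = [], []
for m in sizes:
    b = fit_line(x_tr[:m], y_tr[:m])                  # fit on the first m training rows
    train_cost.append(mse_line(b, x_tr[:m], y_tr[:m]))
    cv_cost.append(mse_line(b, x_cv[:m], y_cv[:m]))   # score on the first m CV rows

print(f"m = {sizes[0]}: training = {train_cost[0]:.2f}, cross-validation = {cv_cost[0]:.2f}")
print(f"m = {sizes[-1]}: training = {train_cost[-1]:.2f}, cross-validation = {cv_cost[-1]:.2f}")
```

Plotting `train_cost` and `cv_cost` against `sizes` gives the two curves to compare against the diagnostic patterns listed above.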

If the regressions are being done in SAS, SAS reports a statistic known as the PRESS (predicted residual sum of squares) statistic. It is a form of cross-validation in which the $i$th observation is left out when estimating $\mathbf{\vec{b}}$. $\widehat{y}_i$ is then calculated with the estimated $\mathbf{\vec{b}}$, with the residual (known as the PRESS residual) being:

\begin{equation}
r_{i, PRESS} = y_{i} - \widehat{y}_{i, -i}
\end{equation}

with the $-i$ indicating that the $i$th observation was left out when determining $\mathbf{\vec{b}}$ for $\widehat{y}_i$.

If the sum of squares of the PRESS residuals:

\begin{equation}
PRESS = \sum_{i = 1}^{n} r_{i, PRESS}^2
\end{equation}

is small, then the model is considered a good candidate for the dataset. Given that $R^2$ compares the model against the mean response, the PRESS statistic is a much better indicator of the predictive power of the model.
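The leave-one-out definition can be written out directly. The sketch below is not SAS's implementation; it refits the model $n$ times on simulated data and, as a cross-check, compares the result with the standard linear-regression identity $r_{i, PRESS} = r_i / (1 - h_{ii})$, where $h_{ii}$ is the $i$th diagonal entry of the hat matrix. The function names and data are assumptions.

```python
import numpy as np

def press_leave_one_out(X, y):
    """Sum of squared PRESS residuals, refitting with each observation left out."""
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        total += (y[i] - X[i] @ b) ** 2          # squared PRESS residual for row i
    return float(total)

def press_hat_matrix(X, y):
    """Same quantity from one fit, using r_i / (1 - h_ii)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    h = np.diag(X @ np.linalg.pinv(X))           # diagonal of the hat matrix
    return float(np.sum((resid / (1.0 - h)) ** 2))

rng = np.random.default_rng(6)  # arbitrary fixed seed
x = rng.uniform(0.0, 10.0, size=20)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=20)   # hypothetical data
X = np.column_stack([np.ones(x.size), x])

b_full, *_ = np.linalg.lstsq(X, y, rcond=None)
sse_full = float(np.sum((y - X @ b_full) ** 2))
press_loo = press_leave_one_out(X, y)
press_fast = press_hat_matrix(X, y)
print(f"SSE = {sse_full:.3f}, PRESS = {press_loo:.3f}")
```

PRESS is never smaller than the ordinary $SSE$ of the full fit, because each observation is predicted by a model that did not see it.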

[(back to top)](#Table-of-Contents:)

## Stepwise Parameter Selection

Given a multitude of feature variables to choose from, it can be hard to determine exactly where to start. A common procedure for choosing subsets of feature variables is **stepwise regression**. Stepwise regression involves sequentially selecting or removing feature variables one at a time until an appropriate model has been found. These selection techniques are built into many software packages and are automated for the analyst.

Adding feature variables to a model is known as **forward selection**. A model first starts with the single feature variable that gives the best value of some criterion. Another variable that improves the value of the criterion is then added. This continues until either there are no more feature variables to use or the criterion value can only worsen with the addition of new features. No variables are removed from the growing model once added.

Removing feature variables from a model is known as **backward selection**. A model first starts off with all feature variables. The variable whose removal gives the best value of some criterion is then removed. Additional variables are removed in the same way, as long as doing so improves the value of the criterion, until either just one variable remains or the criterion can't be improved further. No variables are added back to the shrinking model once removed.

Combining both methods is known as **stepwise selection**. Variables can be added or removed at each step until a criterion is met.

The most common criterion used is the $F$-statistic, though some algorithms make use of Mallows's $C_p$ statistic (not covered) or $R_{adj}^2$.
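Forward selection can be sketched in a few lines. This version uses $R_{adj}^2$ as the criterion because it is simple to compute; it is an illustration, not the algorithm of any particular software package, and the simulated data and function names are assumptions.

```python
import numpy as np

def adjusted_r2(features, y):
    """Adjusted R^2 of a least-squares fit of y on the given columns plus an intercept."""
    n, k = features.shape
    X = np.column_stack([np.ones(n), features])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return float(1.0 - (n - 1) / (n - k - 1) * sse / sst)

def forward_selection(features, y):
    """Greedy forward selection; returns the chosen column indices in the order added."""
    remaining = list(range(features.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        # Score every candidate model that adds one more feature.
        scores = [(adjusted_r2(features[:, selected + [j]], y), j) for j in remaining]
        score, j = max(scores)
        if score <= best_score:        # no candidate improves the criterion
            break
        selected.append(j)             # once added, a feature is never removed
        remaining.remove(j)
        best_score = score
    return selected, best_score

rng = np.random.default_rng(7)  # arbitrary fixed seed
n = 100
features = rng.normal(size=(n, 5))
# Hypothetical response: only columns 0 and 3 carry signal.
y = 3.0 * features[:, 0] - 2.0 * features[:, 3] + rng.normal(0.0, 0.5, size=n)

selected, score = forward_selection(features, y)
print("columns in the order they were added:", selected)
print(f"adjusted R^2 of the final model: {score:.4f}")
```

The two informative columns are picked up first. Depending on the random draw, a noise column may also be added afterwards, which reflects how lenient $R_{adj}^2$ is as a stopping rule.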

In SAS, the following procedures offer parameter selection:
* `PROC REG`
* `PROC GLMSELECT`
* `PROC LOGISTIC`
* `PROC PHREG`

The selection method is specified with the `selection=` option in the `model` statement.

[(back to top)](#Table-of-Contents:)

## Regularization Methods

**Regularization** is a class of techniques that reduce overfitting by shrinking the size of the coefficient estimates. This can produce a model in which all terms contribute to the estimation of the response variable, or it can remove specific feature variables from the model altogether. It imposes a penalty on the model based on the size of the coefficients it contains. The most common type of regularization is ridge regression, also known as Tikhonov regularization.

Ridge regression adds a new term to the cost function:

\begin{equation}
J\left(\vec{\theta} \right) = \frac{1}{n} \sum_{i = 1}^{n} \left(y_i - \mathbf{\vec{x}_i} \mathbf{\vec{b}} \right)^2 + \lambda \sum_{j = 1}^{k} b_j^2
\end{equation}

Now, the cost has to be the minimum of the mean squared error plus some value proportional to the sum of squares of the coefficients. This reduces overfitting by preventing certain coefficients from getting too big. The $\lambda$ constant controls how important it is that the estimated values of the coefficients are as small as possible.
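For this cost function the minimizer has a closed form: setting the gradient to zero gives $(X^T X + n \lambda D)\,\mathbf{\vec{b}} = X^T \mathbf{\vec{y}}$, where $D$ is the identity matrix with a $0$ in the intercept position (the penalty sum starts at $j = 1$, so $b_0$ is not penalized). The sketch below solves that system on simulated data; the function name `ridge_fit` and the data are assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/n) * sum((y - X b)^2) + lam * sum(b_j^2 for j >= 1).

    X must have a leading column of ones; the intercept b_0 is not penalized.
    """
    n, p = X.shape
    penalty = np.eye(p)
    penalty[0, 0] = 0.0                       # leave the intercept unpenalized
    # Zero gradient: (X'X + n * lam * D) b = X'y
    return np.linalg.solve(X.T @ X + n * lam * penalty, X.T @ y)

rng = np.random.default_rng(8)  # arbitrary fixed seed
n = 50
features = rng.normal(size=(n, 3))
y = 1.0 + features @ np.array([2.0, -1.0, 0.5]) + rng.normal(0.0, 1.0, size=n)
X = np.column_stack([np.ones(n), features])

b_ols = ridge_fit(X, y, lam=0.0)      # lambda = 0 reduces to ordinary least squares
b_ridge = ridge_fit(X, y, lam=1.0)
print("lambda = 0:", np.round(b_ols, 3))
print("lambda = 1:", np.round(b_ridge, 3))
```

Raising $\lambda$ pulls the slope coefficients toward $0$ without setting them exactly to $0$, which is the shrinkage described above.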

Common regularization methods found in `PROC GLMSELECT` are:
* `LASSO`
* least angle regression (`LAR`)
* Elastic Net selection (`ELASTICNET`)

SAS offers a modification of ridge regression known as **LASSO** (least absolute shrinkage and selection operator), where the mean squared error is minimized, but the coefficient estimates are subject to $\sum_{j = 1}^{k} \left|b_j \right| \le t$, where $t$ is some chosen parameter. Depending on the value for $t$, some of the coefficients will shrink to $0$, effectively eliminating their corresponding feature variables from the model. Thus, LASSO combines both regularization and feature selection methods.
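To see coefficients land exactly on $0$, the sketch below solves the LASSO problem by cyclic coordinate descent with soft-thresholding. Two assumptions should be noted: it uses the penalized form $\frac{1}{2n} SSE + \lambda \sum_j |b_j|$ rather than the constraint with $t$ (each $t$ corresponds to some $\lambda$), and it is a teaching sketch on simulated data, not the algorithm used by `PROC GLMSELECT`.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward 0 by lam, returning exactly 0 when |z| <= lam."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize (1/(2n)) * sum((y - X b)^2) + lam * sum(|b_j|) on centered data."""
    n, k = X.shape
    Xc = X - X.mean(axis=0)              # centering removes the need for an intercept
    yc = y - y.mean()
    col_scale = (Xc ** 2).sum(axis=0) / n
    b = np.zeros(k)
    for _ in range(n_sweeps):
        for j in range(k):
            # Residual with feature j's current contribution added back in.
            partial_resid = yc - Xc @ b + Xc[:, j] * b[j]
            rho = Xc[:, j] @ partial_resid / n
            b[j] = soft_threshold(rho, lam) / col_scale[j]
    return b

rng = np.random.default_rng(9)  # arbitrary fixed seed
n = 100
X = rng.normal(size=(n, 5))
# Hypothetical response: only columns 0 and 3 carry signal.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(0.0, 0.5, size=n)

b_lasso = lasso_coordinate_descent(X, y, lam=0.5)
print("LASSO coefficients:", np.round(b_lasso, 3))
```

With this penalty the coefficients of the three noise columns are exactly $0$, while the two informative coefficients survive in shrunken form, which illustrates how LASSO performs feature selection and regularization at once.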

**Least angle regression** (LAR) is a type of forward selection algorithm that takes into account the correlation between the response variable and each feature variable. All coefficients are first set to $0$. The coefficient of the feature variable most correlated with the response is increased in the direction of the correlation until another feature variable becomes equally correlated with the resulting residuals. The original coefficient and the coefficient of this second feature variable are then increased together along their joint least squares direction until a third feature variable becomes equally correlated with the resulting residuals. This repeats until all of the feature variables are included or the stopping criterion is met. This method produces similar solutions to LASSO.

**Elastic Net** combines both LASSO and ridge regression, where coefficients have to satisfy both constraints. This is useful for the following cases:
* there are more feature variables than observations ($k > n$)
* there are high correlations between different feature variables

Choosing between the different regularization methods requires an examination of the data. If feature variables are correlated with one another, then use the Elastic Net approach. Otherwise, it is best to try out both LAR and LASSO to see which generates more appropriate results. If there are certain variables that should be removed, try the LASSO option.

[(back to top)](#Table-of-Contents:)