# Variable selection and model selection

The main feature of multiple linear regression is that it uses several explanatory variables together. Therefore, the natural question is: which variables should I use to build the best possible model for my objective? This question leads us to introduce model evaluation criteria and variable selection methods.

## What will you learn in this course? 🧐🧐

This course will focus on teaching you evaluation methods for multiple linear regression models.

* Evaluation of multiple linear regression models
    * Analysis of Variance (ANOVA)
    * F-Statistics by Fisher
    * $R^{2}$ (R square)
    * $R^2_{adjusted}$
    * P-values
* Model selection
    * Step by step methods
* Final remarks



## Evaluation of multiple linear regression models 💯

Some of the evaluation criteria presented below may be used for models other than multiple linear regression. It is therefore all the more important to introduce them now and to remember their respective interpretations.


### Analysis of Variance (ANOVA)

The analysis of variance quantifies the performance of a statistical model in terms of estimation error. The quantities discussed below are used to build other performance metrics:

* SST: Sum of Square Total is an indicator of the dispersion of the values of the target variable $Y$ (whose values are noted $y_{1}, ..., y_{n}$) over the population considered. It is written mathematically as:

$$
SST = \sum_{i=1}^{n}(y_{i}-\bar{y})^2
$$

It is the sum of the squared deviations from the mean of the target variable $Y$ for the $n$ observations considered.

* SSE: Sum of Square Explained is an indicator that represents the amount of dispersion of the target variable that is explained by the model, which is defined as:

$$
SSE = \sum_{i=1}^{n}(\hat{y}_{i}-\bar{y})^2
$$

It is the sum of the squared differences between the model's estimate for each observation and the mean of the target variable over the population of interest. In statistics, variation is information: you cannot differentiate samples if they are all described by the exact same set of values.

* SSR: Sum of Squared Residual is an indicator that quantifies the error committed by the model, or in other words the portion of the dispersion of the target variable that is not explained by the model, hence the idea of residual. Its formula is as follows:

$$
SSR =\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^2
$$

It is essential to understand these values because they will allow us to build all the evaluation metrics for the multiple linear regression model we will see now.

To summarize, SST is proportional to the total variance of the target variable, and for a least-squares model with an intercept it decomposes into two components, $SST = SSE + SSR$. SSE is the dispersion explained by the model: it measures how much our estimates vary around the actual mean of the observed population. SSR is the sum of the squared differences between our estimates and the actual values of the target variable. In other words, SST is the total amount of information, SSE is the information explained by the model and SSR is the information that remains to be explained, or the error committed.
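The decomposition can be checked numerically. The following is a minimal sketch, assuming synthetic data and a least-squares fit done with `numpy.linalg.lstsq`; the variable names and the data-generating coefficients are illustrative choices, not part of the course material.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 100 observations, 2 explanatory variables
n = 100
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an intercept column
X_design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ beta

sst = np.sum((y - y.mean()) ** 2)      # total dispersion of the target
sse = np.sum((y_hat - y.mean()) ** 2)  # dispersion explained by the model
ssr = np.sum((y - y_hat) ** 2)         # residual (unexplained) dispersion

print(f"SST = {sst:.3f}, SSE = {sse:.3f}, SSR = {ssr:.3f}")
print("SST == SSE + SSR:", np.isclose(sst, sse + ssr))
```

The equality relies on the model including an intercept and being fitted by least squares; without those two conditions the three sums do not necessarily add up.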


### F-Statistics by Fisher

A statistical test is a process by which we try to show whether a hypothesis is compatible with the data at our disposal. This test hypothesis, also called the null hypothesis and noted $H_{0}$, would have consequences on the properties of the observed data if it were true. These properties are summarized by a test statistic: the further its value lies from what $H_{0}$ predicts, the stronger the evidence against $H_{0}$.

Fisher's F-statistic is used to test the following hypotheses:


* When the Fisher test is applied to the model as a whole, the null hypothesis, noted $H_{0}$, is "the variables chosen to construct the model are not jointly significant in describing the target variable". If the hypothesis is true, the F-statistic should follow a Fisher probability distribution, noted F-distribution, with parameters $(p, n - p - 1)$, where $p$ is the number of explanatory variables and $n$ is the number of observations used to train the model. However, if the value of the F-statistic, noted "F", is outside the most probable regions of the distribution, then we can reject the null hypothesis and conclude that the chosen model has real explanatory power on the target variable.

It may seem a little far-fetched, but all statistical tests work this way. We make an assumption; if that assumption held, the test statistic would follow a given distribution. If the observed value of the statistic lands too far from the probable range of that distribution, we are allowed to reject the null hypothesis.

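Mathematically, with $p$ explanatory variables and $n$ observations, the F-statistic is the ratio of the explained and residual sums of squares, each divided by its degrees of freedom:

$$
F = \frac{SSE / p}{SSR / (n - p - 1)}
$$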
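As an illustration, here is a minimal sketch of the global F-test, taking F as the ratio of mean squares $(SSE/p) / (SSR/(n-p-1))$ and reading the p-value from the upper tail of the F-distribution with `scipy.stats.f.sf`. The synthetic data and variable names are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic data (assumption): only the first of the 3 variables influences y
n, p = 80, 3
X = rng.normal(size=(n, p))
y = 0.5 + 1.5 * X[:, 0] + rng.normal(size=n)

# Least-squares fit with an intercept
X_design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ beta

sse = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
ssr = np.sum((y - y_hat) ** 2)         # residual sum of squares

# Each sum of squares is divided by its degrees of freedom
f_stat = (sse / p) / (ssr / (n - p - 1))

# Probability of observing an F at least this large if H0 were true
p_value = stats.f.sf(f_stat, p, n - p - 1)

print(f"F = {f_stat:.2f}, p-value = {p_value:.3g}")
```

A small p-value here means the observed F lies far in the upper tail, so the hypothesis that the variables are jointly useless can be rejected.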

The F-test can also compare two nested models (model 1, which includes "model_1_variables", and model 2, which includes "model_1_variables + $X_d$"). In this case, if the simpler model (model 1) is sufficient, that is, if the additional variables of model 2 bring no information, the F-statistic follows an F-distribution with parameters $(p_{2} - p_{1}, n - p_{2})$, where $p_{1} < p_{2}$ are the numbers of estimated parameters (intercept included) of the two models and $SSR_{1}$, $SSR_{2}$ are their residual sums of squares. The mathematical formula of F is then:

$$
F = \frac{(SSR_{1}-SSR_{2})/(p_{2}-p_{1})}{SSR_{2}/(n-p_{2})}
$$

If the value of the F-statistic lies in an unlikely region of the F-distribution, then the hypothesis is rejected and the test suggests that the more complex Model 2 provides significant additional information compared to the simpler Model 1.
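The nested comparison can be sketched as follows. This example assumes the usual form of the partial F-test, where the drop in residual sum of squares between the two models is compared with the residual variance of the larger model; the helper names `fit_ssr` and `nested_f_test` and the synthetic data are hypothetical.

```python
import numpy as np
from scipy import stats

def fit_ssr(X, y):
    """Return (SSR, number of estimated parameters) for OLS with an intercept."""
    X_design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    residuals = y - X_design @ beta
    return float(residuals @ residuals), X_design.shape[1]

def nested_f_test(X_small, X_big, y):
    """F-test of model 1 (X_small) against model 2 (X_big), model 1 nested in model 2."""
    n = len(y)
    ssr1, p1 = fit_ssr(X_small, y)
    ssr2, p2 = fit_ssr(X_big, y)
    f_stat = ((ssr1 - ssr2) / (p2 - p1)) / (ssr2 / (n - p2))
    p_value = stats.f.sf(f_stat, p2 - p1, n - p2)
    return f_stat, p_value

rng = np.random.default_rng(2)
n = 120
X = rng.normal(size=(n, 3))
# Assumption: columns 0 and 1 matter, column 2 is pure noise
y = 1.0 + 2.0 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=n)

# Does adding column 1 to a model that already has column 0 help?
f_useful, p_useful = nested_f_test(X[:, [0]], X[:, [0, 1]], y)
# Does adding the noise column 2 help?
f_noise, p_noise = nested_f_test(X[:, [0, 1]], X[:, [0, 1, 2]], y)

print(f"adding a useful variable: F = {f_useful:.2f}, p-value = {p_useful:.3g}")
print(f"adding a noise variable:  F = {f_noise:.2f}, p-value = {p_noise:.3g}")
```

The first test should reject the simpler model, while the second will usually not, although a noise variable can still appear significant by chance about $\alpha$ of the time.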

Graphically the F-test can be illustrated as follows:

![F-statistic](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/curve.png)

We plot the density of the F-distribution. As in any test, we define a level $\alpha$ between 0 and 1, which determines the size of the rejection zone. Very often, we choose $\alpha = 5\%$ when no specific knowledge helps us adjust our standards. The F-test is one-sided: only large values of F allow us to reject the hypothesis. More precisely, if the value of F falls in the upper tail of the expected distribution that carries 5% of the probability, then the hypothesis is rejected at the $\alpha = 5\%$ significance level (that is, with a confidence level of $1 - \alpha = 95\%$).
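The boundary of the rejection zone can be computed directly. This short sketch assumes hypothetical degrees of freedom (3 and 76) and uses `scipy.stats.f.ppf`, the quantile function of the F-distribution, to find the critical value for a given $\alpha$.

```python
import numpy as np
from scipy import stats

alpha = 0.05
dfn, dfd = 3, 76  # hypothetical degrees of freedom (numerator, denominator)

# Value above which only a fraction alpha of the F-distribution lies
critical_value = stats.f.ppf(1 - alpha, dfn, dfd)

def reject_h0(f_stat):
    """One-sided decision rule: reject H0 only for large values of F."""
    return f_stat > critical_value

print(f"critical value at alpha = {alpha}: {critical_value:.3f}")
print("reject H0 for F = 1.2 ?", reject_h0(1.2))
print("reject H0 for F = 8.0 ?", reject_h0(8.0))
```

Comparing F with the critical value is equivalent to comparing the p-value with $\alpha$; both express the same decision rule.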

This first metric allows us to test the hypothesis that the explanatory variables have no influence on the target variable. We will now look at metrics that indicate the performance level of the model.

### $R^{2}$ (R square)

$R^{2}$, or R-squared, is a statistic that quantifies the explanatory power of the model with respect to the target variable.

$$
R^{2} = 1-\frac{SSR}{SST}
$$

$R^2$ varies between 0 and 1 for a least-squares model with an intercept evaluated on its training data. If the model is not very relevant, the residual sum of squares $SSR$ will be close to the total sum of squares $SST$ and $R^{2}$ will be close to 0. On the contrary, if the model explains the target variable faithfully, then $SSR$ will be close to 0 and $R^2$ will be close to 1. $R^2$ also never decreases when an explanatory variable is added: mechanically, each added variable lets the model fit the training data at least as well, so $SSR$ cannot increase and $R^2$ cannot go down, even if the new variable is pure noise. As a result, $R^2$ is only suitable for comparing two models that have the same number of explanatory variables.
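This mechanical growth is easy to observe. The sketch below, built on synthetic data chosen for illustration, fits a model with one useful variable and then keeps adding columns of pure noise; the helper `r_squared` is a hypothetical name.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit with an intercept, computed on the training data."""
    X_design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    ssr = np.sum((y - X_design @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - ssr / sst

rng = np.random.default_rng(3)
n = 50
x_useful = rng.normal(size=(n, 1))
y = 2.0 * x_useful[:, 0] + rng.normal(size=n)
noise = rng.normal(size=(n, 10))  # 10 variables unrelated to y

# Add the noise variables one at a time and record R^2 at each step
scores = []
for k in range(noise.shape[1] + 1):
    X = np.column_stack([x_useful, noise[:, :k]])
    scores.append(r_squared(X, y))

for k, score in enumerate(scores):
    print(f"{k:2d} noise variables added: R^2 = {score:.4f}")
```

The printed values never go down, even though none of the added variables has any relation to the target.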

### $R^2_{adjusted}$

$R^2_{adjusted}$ is a modified version of $R^2$ that penalizes the number of explanatory variables selected to build the model. Its mathematical formula is:

$$
R^2_{adjusted} = 1-\frac{n-1}{n-p-1}(1-R^2)
$$

Where $p$ is the number of explanatory variables used and $n$ is the number of observations used. As $p$ grows, $R^2$ grows and $(1-R^2)$ shrinks, but the factor $\frac{n-1}{n-p-1}$ grows, which penalizes the addition of variables. Consequently, if the information contribution of an explanatory variable is not significant enough, then $R^2_{adjusted}$ will decrease. This indicator can therefore be used to compare the performance of models that do not necessarily have the same number of explanatory variables.

### P-values

P-values make it possible to evaluate the contribution of each explanatory variable individually, as opposed to evaluating the model as a whole. Unfortunately they cannot be easily computed using sklearn, so we will introduce the statsmodels library, which calculates all the important metrics automatically. The p-value of a coefficient is the probability of observing an estimate at least as far from 0 as the one obtained, if the true value of the parameter were 0. A small p-value therefore suggests that the variable brings real information to the linear model; it is not, strictly speaking, the probability that the coefficient is 0. Usually we consider that a p-value below 5% means that the variable is significant; otherwise it is considered not significant. However, this depends on the context and the standards of the industry you are working in: for example, web marketing agencies typically have lower standards than pharmaceutical companies because their goals and constraints are fundamentally different.

## Model selection 🤔🤔

When we have $p$ explanatory variables at our disposal, the number of models that can be constructed is counted as follows: each explanatory variable is either included in the model or not, so applying this reasoning to all explanatory variables gives $2^p$ possible models. In practice, when the number $p$ of explanatory variables is large, we cannot reasonably explore all $2^p$ models in order to select the best one. Different methods exist that avoid this brute-force search.
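The count can be verified by enumeration. This sketch uses `itertools.combinations` on a hypothetical list of three variable names, then shows how quickly $2^p$ grows.

```python
from itertools import combinations

def all_subsets(variables):
    """Enumerate every subset of the variables, from the empty set to the full set."""
    subsets = []
    for k in range(len(variables) + 1):
        subsets.extend(combinations(variables, k))
    return subsets

variables = ["age", "income", "height"]  # hypothetical variable names
subsets = all_subsets(variables)

for subset in subsets:
    print(subset)
print("number of candidate models:", len(subsets))

# The number of candidate models doubles with each additional variable
for p in (10, 20, 30):
    print(f"p = {p}: {2 ** p} candidate models")
```

The empty subset corresponds to the model that contains only the intercept.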

### Step by step

The step-by-step selection is divided into three variants:

* Forward selection: the variables are added one by one to the model by selecting at each step the one that maximises the F statistic. It stops when all the variables are used or when the null hypothesis of no increase in explained information cannot be rejected, i.e. when $F = \frac{(SSR_{1}-SSR_{2})/(p_{2}-p_{1})}{SSR_{2}/(n-p_{2})}$ falls into the probable area of the F-distribution, where model 1 is the current model and model 2 is the current model plus the candidate variable.
* Elimination (backward): This time we start with a model using all the explanatory variables. At each step, the variable with the highest p-value associated with the Fisher test is eliminated from the model, provided that this p-value exceeds the chosen threshold. The procedure stops when all the remaining variables have p-values lower than the threshold, set by default at 0.05 (but which can be adapted according to the precision needs of the considered problem).
* Stepwise: This algorithm alternates between a selection step and an elimination step after each addition of a variable, in order to remove any variables that would have become less relevant in the presence of those that have been added. For example, if after adding the most useful variable according to the F-stat criterion one of the variables becomes non-significant, we remove it and proceed with a new forward selection step.
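Forward selection can be sketched in a few lines. The following is one possible implementation under stated assumptions, not a reference one: candidates are compared with the partial F-test for adding a single variable, the stopping rule is a p-value threshold `alpha`, and the helper names `ssr_of` and `forward_selection` as well as the synthetic data are hypothetical.

```python
import numpy as np
from scipy import stats

def ssr_of(X, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    X_design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    residuals = y - X_design @ beta
    return float(residuals @ residuals)

def forward_selection(X, y, alpha=0.05):
    """Add variables one by one while the best candidate is significant."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    while remaining:
        ssr_current = ssr_of(X[:, selected], y)
        n_params_new = len(selected) + 2  # intercept + selected + candidate
        best_j, best_f = None, -np.inf
        for j in remaining:
            ssr_new = ssr_of(X[:, selected + [j]], y)
            # Partial F-statistic for adding the single variable j
            f_stat = (ssr_current - ssr_new) / (ssr_new / (n - n_params_new))
            if f_stat > best_f:
                best_j, best_f = j, f_stat
        p_value = stats.f.sf(best_f, 1, n - n_params_new)
        if p_value >= alpha:
            break  # the best candidate brings no significant information
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(5)
n = 200
X = rng.normal(size=(n, 5))
# Assumption: only columns 0 and 3 influence the target
y = 3.0 * X[:, 0] + 2.0 * X[:, 3] + rng.normal(size=n)

selected = forward_selection(X, y)
print("selected columns, in order of entry:", selected)
```

Backward elimination follows the same pattern in reverse: start from all the columns and repeatedly drop the least significant one while its p-value exceeds the threshold. Because each step performs several tests, a noise variable can occasionally be selected by chance.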

### Final Notes

The model evaluation and selection methods introduced above apply to linear models in general. The same ideas, with adapted test statistics, also apply to logistic regression, which we will discuss later.

## Resources 📚📚

* How do I interpret R Squared - http://bit.ly/2pP83Eb
* Adjusted R Squared - http://bit.ly/2qqz55b