# Supervised machine learning - Regularized linear regression

<!-- TOC START min:2 max:4 link:true asterisk:true update:true -->
* [What you will learn in this course](#what-you-will-learn-in-this-course-)
* [Ridge model](#ridge-model)
  * [Why does the linear model fail?](#why-does-the-linear-model-fail-)
  * [Which solution to overcome this difficulty?](#which-solution-to-overcome-this-difficulty-)
    * [What is a cost function?](#what-is-a-cost-function-)
    * [Bias vs Variance](#bias-vs-variance)
    * [Ridge model penalty](#ridge-model-penalty)
    * [Special case of interception](#special-case-of-interception)
* [Lasso model](#lasso-model)
  * [Large variable selection problem](#large-variable-selection-problem)
  * [The lasso model](#the-lasso-model)
    * [Intuition](#intuition)
    * [Lasso estimator](#lasso-estimator)
    * [Underfitting and Overfitting](#underfitting-and-overfitting)
<!-- TOC END -->





## What you will learn in this course ##

The purpose of this course is to teach you two variations on the linear regression model, known as regularized regressions. Regularization is a principle widely used in statistics that constrains models to adopt certain behaviours. Here, it lets us reduce the importance of explanatory variables that do not provide relevant information for the model, or remove them altogether, and deal with problems where the explanatory variables far outnumber the available observations.

Linear models are popular for estimating a continuous target variable $Y$ that depends on explanatory variables $X_{1}, X_{2}, ... , X_{p}$. In general, the number of observations / samples (often noted $n$) is much greater than the number of explanatory variables $p$. However, in some cases we find ourselves in the opposite situation: the number of explanatory variables is much higher than the number of observations. For example, in statistical genetics, time and money constraints may not allow sequencing the DNA of more than a thousand individuals, while the number of genes in our DNA is about 20,000! Conventional linear models are not well suited to this type of very high-dimensional problem. Therefore, in this course we will discuss new linear models that can tackle this difficulty.


## Ridge model

#### What is a cost function ?

The concept of **cost function** will be of great use in the future: it is omnipresent in machine learning. The cost function is the mathematical quantity that we wish to minimize when optimizing a statistical model. In classical multiple linear regression, the cost function is:

### $||y-X^{t}\beta||^{2}_{2}$

This is the squared Euclidean norm of the residual vector, i.e. the sum of its squared components. Up to a factor $1/n$, it is the Mean Squared Error (MSE). The estimator $\widehat{\beta}$, which gives the coefficients of the model, is the vector that minimizes this cost function, which is written mathematically:

### $\widehat{\beta}=\arg\min_{\beta}(||y-X^{t}\beta||^{2}_{2})$
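As a minimal sketch of this minimization, the snippet below computes a least-squares estimate with NumPy on synthetic data. The data, the seed and the names `cost`, `beta_hat` and `beta_other` are assumptions made for illustration. Here `X` is stored with one row per observation, so `X @ beta` plays the role of $X^{t}\beta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                        # illustrative sizes (assumption)
X = rng.normal(size=(n, p))          # one row per observation
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

def cost(beta):
    """Squared Euclidean norm of the residual vector."""
    residuals = y - X @ beta
    return residuals @ residuals

# Least-squares estimate: the vector that minimizes the cost function
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Any other vector gives a cost at least as large
beta_other = beta_hat + np.array([0.1, 0.0, -0.1])
print("cost at beta_hat:  ", cost(beta_hat))
print("cost at beta_other:", cost(beta_other))
```

At the minimum, the residuals are orthogonal to every column of `X`, which is one way to check the result numerically.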


This cost function provides the estimator with the lowest possible bias. In return, for a given set of explanatory variables, the variance of its predictions is the highest among the linear models considered here. Sometimes a better compromise can be found.

This is why we introduce the notion of **penalty**: a modification made to the cost function in order to control the **bias vs. variance** trade-off.

#### Bias vs Variance

Bias and variance are two ubiquitous notions in statistics and particularly in machine learning when trying to make predictions.

Let's consider a problem where $Y$ is the target variable, $X$ is the matrix of explanatory variables and $\epsilon$ is the error term of mean 0 and variance $\sigma^2$.

We wish to model $Y$ using the explanatory variables $X$ and assume that there is a function $f$ which represents the true relationship between $Y$ and $X$ such that:

### $Y=f(X)+\epsilon$


At the end of our work as data scientists, we will have obtained an estimate $\widehat{f}$ of the function $f$. In this case the mean square error (often noted MSE) can be decomposed in terms of bias (or mean deviation from the true function) and in terms of variance (dispersion of the estimator with respect to its mean). This decomposition is written as follows:

### $E[(Y-\widehat{f})^2]=(E[f-\widehat{f}])^2+E[\widehat{f}^2]-(E[\widehat{f}])^2+\sigma^2$

### $E[(Y-\widehat{f})^2]=bias^2+variance+\sigma^2$

The bias is the error committed by the model on average, the variance measures how much the predictions vary from one training sample to another, and $\sigma^2$ is the part of the variation in $Y$ that cannot be explained by the model, also called noise.
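The decomposition can be checked numerically with a small simulation. The sketch below is only an illustration under stated assumptions: the true function `f` (a sine), the noise level `sigma`, the test point `x0` and the choice of a straight-line model are all hypothetical. A line is refitted on many independent training samples, and the squared bias, the variance and the mean squared error of its prediction at `x0` are estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                 # hypothetical true function
sigma = 0.3                # standard deviation of the noise
x0 = 2.5                   # point where predictions are evaluated
n, n_rep = 30, 2000        # training size and number of repetitions

preds = np.empty(n_rep)
for r in range(n_rep):
    # Draw a fresh training sample and fit a straight line to it
    x = rng.uniform(0.0, 3.0, n)
    y = f(x) + rng.normal(0.0, sigma, n)
    slope, intercept = np.polyfit(x, y, 1)
    preds[r] = slope * x0 + intercept

bias_sq = (f(x0) - preds.mean()) ** 2      # squared bias at x0
variance = preds.var()                     # variance of the predictions

# New noisy observations of Y at x0, independent of the training samples
y0 = f(x0) + rng.normal(0.0, sigma, n_rep)
mse = np.mean((y0 - preds) ** 2)

print("bias^2 + variance + sigma^2:", bias_sq + variance + sigma ** 2)
print("simulated MSE:              ", mse)
```

The two printed values should be close but not identical, since the MSE is itself estimated from a finite number of repetitions.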

#### Ridge model penalty

The Ridge model is a penalized version of the multiple linear model whose cost function is:

### $||y-X^{t}\beta||^{2}_{2}+\lambda||\beta||^{2}_{2},\lambda\in\mathbb{R}^{+*}$

With this penalty, moving a parameter in $\beta$ away from zero may improve the quality of the fit (the first term) but also increases the second term, the penalty. This forces the model to favour the parameters associated with the explanatory variables that carry the most relevant information about the target variable, and to keep the parameters associated with less relevant explanatory variables closer to zero.

In terms of bias vs. variance trade-off, the Ridge model behaves as follows:

* $\lambda = 0$ corresponds to the linear model, which is unbiased
* Bias increases as $\lambda$ increases
* Variance decreases as $\lambda$ increases
* $\lambda \to \infty$ corresponds to the model where all the parameters except the intercept are zero, i.e. $\beta_i = 0 \ \forall i \neq 0$, and the prediction is $\bar{Y}$ (the average of the $Y$ values). This is the lowest-variance model possible.
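The shrinking effect of $\lambda$ can be observed directly. The sketch below uses the standard closed-form solution of the Ridge cost function without intercept, $(X^{T}X+\lambda I)^{-1}X^{T}y$ when `X` has one row per observation; the helper name `ridge_fit`, the synthetic data and the grid of $\lambda$ values are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form Ridge estimate without intercept (X has one row per observation)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=n)

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = [np.linalg.norm(ridge_fit(X, y, lam)) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda = {lam:7.1f}   ||beta||_2 = {norm:.3f}")
```

With $\lambda = 0$ the function returns the ordinary least-squares estimate, and the norm of the coefficient vector decreases as $\lambda$ grows.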

#### Special case of interception

The intercept is the parameter $\beta_0$ of the model. It is not associated with any explanatory variable: it represents the estimated mean level of $Y$ when all the explanatory variables are 0. That is why it is not penalized in practice. The cost function of the Ridge model with intercept is therefore:

### $||y-\beta_0-X^{t}\beta||^{2}_{2}+\lambda||\beta||^{2}_{2},\lambda\in\mathbb{R}^{+*}$
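One common way to leave the intercept unpenalized is to center the data, fit the penalized coefficients on the centered variables, and recover $\beta_0$ afterwards. The sketch below follows that approach; the helper name `ridge_fit_intercept` and the synthetic data are assumptions made for illustration, and `X` is stored with one row per observation.

```python
import numpy as np

def ridge_fit_intercept(X, y, lam):
    """Ridge with an unpenalized intercept, obtained by centering X and y."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta      # intercept recovered from the means
    return beta0, beta

rng = np.random.default_rng(1)
n, p = 80, 4
X = rng.normal(loc=2.0, size=(n, p))
y = 5.0 + X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=n)

beta0_small, beta_small = ridge_fit_intercept(X, y, 0.0)    # no penalty
beta0_big, beta_big = ridge_fit_intercept(X, y, 1e12)       # very strong penalty

print("lambda = 0:    intercept", beta0_small, "coefficients", beta_small)
print("lambda = 1e12: intercept", beta0_big, "mean of y", y.mean())
```

With a very large $\lambda$ the coefficients are driven towards zero while the intercept stays free, so the prediction tends to the average of the $Y$ values.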



## Lasso model

### Why does the linear model fail?

As we saw earlier, the linear model is written as follows:

### $Y_i = \beta_0 + X_{i,1}\beta_1 + ... + X_{i,p}\beta_p + \epsilon_i \quad \forall i \in[\![1,n]\!]$


Or in matrix form:

### $Y=X^{t}\beta+\epsilon$



If we have $p > n$, in the first representation, we have to solve $n$ equations for $p + 1$ parameters, namely $\beta_0,\beta_1,...,\beta_p$. However, a system of equations with more parameters than independent equations is underdetermined: when it is consistent, it has infinitely many solutions.

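In matrix form, the least-squares solution requires inverting a matrix of dimensions $p \times p$ (this is $XX^{t}$ with the convention $Y=X^{t}\beta+\epsilon$ used here, where $X$ has $p$ rows and $n$ columns), whose rows and columns are linear combinations of only $n$ vectors. The rank of this matrix is thus at most $n$, which is smaller than $p$: it is therefore non-invertible and the parameters cannot be estimated uniquely.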
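The rank argument can be illustrated with a tiny example. In the sketch below, the sizes, the random data and the variable names are assumptions made for illustration; `X` is stored with one row per observation, so the $p \times p$ matrix of the normal equations is `X.T @ X`. The example shows that this matrix has rank $n < p$ and that two different coefficient vectors reproduce `y` exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 8                          # more variables than observations
X = rng.normal(size=(n, p))          # one row per observation
y = rng.normal(size=n)

gram = X.T @ X                       # p x p matrix of the normal equations
rank = np.linalg.matrix_rank(gram)
print("shape:", gram.shape, "rank:", rank)

# First exact solution: the minimum-norm least-squares solution
beta_a, *_ = np.linalg.lstsq(X, y, rcond=None)

# Second exact solution: add a direction that X maps to zero
_, _, Vt = np.linalg.svd(X)          # the last rows of Vt span the null space of X
null_direction = Vt[-1]
beta_b = beta_a + 3.0 * null_direction

print("beta_a fits y exactly:", np.allclose(X @ beta_a, y))
print("beta_b fits y exactly:", np.allclose(X @ beta_b, y))
```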

### Which solution to overcome this difficulty ?

#### Large variable selection problem

The Ridge model, as we have seen, is well suited when some of the explanatory variables are not very informative in the model, because it shrinks the coefficients associated with these variables. However, in the case where the true value of many coefficients is supposed to be 0, i.e. the explanatory variables with which they are associated have no influence on the target variable, the Ridge model does not penalize uninformative variables enough: it shrinks their coefficients towards zero without setting them exactly to zero. Ridge therefore cannot be used for variable selection.

That is why a model called Lasso exists: it is able to select the relevant variables and set the coefficients of the uninformative variables to zero.

### The Lasso model

#### Intuition

The fundamental intuition for using a Lasso model is that a number of explanatory variables at our disposal have no influence on the target variable, and therefore the associated coefficients would have a true value of 0 in the linear model.

This intuition is characterized mathematically as follows: let $n$ be the number of observations, $p$ the number of explanatory variables and $s$ the number of relevant explanatory variables. The intuition (called the **sparsity** hypothesis) is then written:

### $s \ll n \ll p$

By selecting the $s$ relevant variables, we hope to reach a situation where the linear model can be applied without difficulty.

Mathematically, we wish to find $\beta$ that minimizes the following cost function:

### $\widehat{\beta_n}=\arg\min_{\beta:||\beta||_{0}\leq{s}}||Y-X^{t}\beta||^{2}_{2}$

This equation means that $\widehat{\beta_n}$ is the vector that minimizes the value $||Y-X^{t}\beta||^{2}_{2}$ under the constraint that the number of non-zero elements in $\beta$ is at most $s$.

Here $||\cdot||_{0}$ is the "zero norm", which counts the number of non-zero elements in $\beta$.

However, this constraint makes the optimization intractable in practice because the set of vectors it defines is not convex. Instead, we choose a weaker constraint that yields a convex cost function, based on the "1-norm" $||\cdot||_{1}$, which is defined as the sum of the absolute values of the components of a vector.
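The two quantities, and the lack of convexity of the "zero norm" constraint, can be made concrete with a few lines of NumPy. The vectors and the helper name `n_supports` below are hypothetical and chosen only for illustration.

```python
import math
import numpy as np

beta = np.array([0.0, 2.5, 0.0, -1.0, 0.0])
l0 = np.count_nonzero(beta)          # "zero norm": number of non-zero components
l1 = np.abs(beta).sum()              # 1-norm: sum of absolute values
print("zero norm:", l0, " 1-norm:", l1)

# Two vectors with at most one non-zero component each...
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
# ...whose midpoint has two: the constraint set is not convex
midpoint = 0.5 * (a + b)
print(np.count_nonzero(a), np.count_nonzero(b), np.count_nonzero(midpoint))

def n_supports(p, s):
    """Number of ways to choose at most s non-zero positions among p variables."""
    return sum(math.comb(p, k) for k in range(s + 1))

print("candidate supports for p = 20, s = 3:", n_supports(20, 3))
```

The last function counts how many subsets of variables an exhaustive search would have to examine, which grows very quickly with $p$.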

#### Lasso estimator

Here we introduce the _Least Absolute Shrinkage and Selection Operator_ (LASSO), defined by the following cost function:

### $||Y-X^{t}\beta||^{2}_{2}+\lambda||\beta||_{1}$

The important things to know about the LASSO model are the following:

* $\lambda$, the penalty constant, must be chosen carefully. In general, LASSO solvers rely on cross-validation (the theory of which is developed later in the course) and compute the estimator for many values of $\lambda$ to identify the most relevant ones.
* The higher $\lambda$ is, the sparser the solution $\hat{\beta}_{LASSO}$ (i.e. the fewer non-zero elements it contains), the higher the bias and the lower the variance.
* The smaller $\lambda$ is, the more non-zero coefficients there are. This decreases the bias of the model but can dramatically increase the variance (this is called overfitting).
* In practice you will soon discover that the bias introduced by LASSO estimation can be very large. Depending on your constraints on the precision of the results, it is possible to use LASSO to select the best variables and then estimate a linear model keeping only these variables to remove this bias.
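The behaviour described in this list can be explored with a small solver. The sketch below uses coordinate descent with soft-thresholding, which is one common way to minimize the LASSO cost function; it is an assumption made for illustration, not the method of any particular library. The names `soft_threshold` and `lasso_cd`, the synthetic data and the $\lambda$ values are hypothetical, and `X` is stored with one row per observation.

```python
import numpy as np

def soft_threshold(a, t):
    """Shrink a towards zero by t; the result is exactly 0 when |a| <= t."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for ||y - X beta||_2^2 + lam * ||beta||_1."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    residual = y.copy()                         # equals y - X @ beta for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            residual += X[:, j] * beta[j]       # remove variable j's contribution
            rho = X[:, j] @ residual
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            residual -= X[:, j] * beta[j]       # put the updated contribution back
    return beta

rng = np.random.default_rng(0)
n, p = 40, 60                                   # more variables than observations
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]                # only 3 relevant variables
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# From lam_max upwards, every coefficient is exactly zero
lam_max = 2.0 * np.max(np.abs(X.T @ y))
n_nonzero = []
for lam in (lam_max, lam_max / 10, lam_max / 100):
    n_nonzero.append(np.count_nonzero(lasso_cd(X, y, lam)))
    print(f"lambda = {lam:9.2f}   non-zero coefficients: {n_nonzero[-1]}")

# Selection followed by an unpenalized fit on the selected variables
beta_lasso = lasso_cd(X, y, lam_max / 10)
support = np.flatnonzero(beta_lasso)
beta_refit = np.zeros(p)
beta_refit[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]

rss_lasso = np.sum((y - X @ beta_lasso) ** 2)
rss_refit = np.sum((y - X @ beta_refit) ** 2)
print("selected variables:", support)
print("residual sum of squares, LASSO vs refit:", rss_lasso, rss_refit)
```

The refit step corresponds to the last point of the list: the penalty is used only to choose the variables, and the final coefficients come from an ordinary linear model on that subset.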

#### Underfitting and Overfitting

Two notions were mentioned above which can be explained as follows:

* Underfitting is when a model is too simple to return a good estimate of the target variable.
* Overfitting is the opposite phenomenon: a very complex model is used that fits the training data too well. Such a model is of little use in practice, as it is very unlikely to generalize well to new, unseen data.

![under_over_fitting](https://drive.google.com/uc?export=view&id=1W3Y-X__zrkB-fOBGnVmLb0_IheOEAt05)

The figure above represents, from left to right, an underfitting situation, a good fit, and an overfitting situation.
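A similar picture can be reproduced with polynomial fits of increasing degree. The sketch below is an illustration under stated assumptions: the true function `true_f`, the noise level, the sample sizes and the three degrees are hypothetical choices. It compares the error on the training data with the error on new data drawn from the same distribution.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_f(x):
    """Hypothetical true relationship between x and y."""
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0.0, 1.0, 30)
y_train = true_f(x_train) + rng.normal(scale=0.2, size=30)
x_test = rng.uniform(0.0, 1.0, 200)
y_test = true_f(x_test) + rng.normal(scale=0.2, size=200)

train_mse, test_mse = {}, {}
for degree in (1, 4, 12):
    # Least-squares polynomial fit of the given degree
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse[degree] = np.mean((y_train - model(x_train)) ** 2)
    test_mse[degree] = np.mean((y_test - model(x_test)) ** 2)
    print(f"degree {degree:2d}   train MSE = {train_mse[degree]:.4f}"
          f"   test MSE = {test_mse[degree]:.4f}")
```

The training error can only go down as the degree increases, since each model contains the previous one. The error on new data typically stops improving, and may get worse, once the model starts fitting the noise.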