# **BIAS-VARIANCE AND MODEL SELECTION**

## **No free lunch theorems**
Let's start by considering one of the most famous and important theorems in machine learning. Focus on a binary classification problem and define:

$$ Acc_G(L) = \textrm{Generalization accuracy of learner L} = \textrm{Accuracy of learner L on non-training samples} $$

$$ \mathcal F = \textrm{set of all possible concepts,}(y = f(\textbf x)) = \textrm{set of all possible functions with binary output} $$

We can view $\mathcal F$ as the space of all possible hypotheses.

**Theorem**

*For any learner L, $\frac{1}{|\mathcal F|}\sum_{\mathcal F}Acc_G(L) = \frac{1}{2}$, given any distribution $\mathcal P$ over $\textbf x$ and training set size $N$.*

**Proof**

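For every concept $f$ where $Acc_G(L) = 0.5 + \delta$, there exists a concept $f'$ where $Acc_G(L)=0.5 - \delta$:

$$ \forall \textbf x \in \mathcal D,\ f'(\textbf x) = f(\textbf x); \quad \forall \textbf x \notin \mathcal D,\ f'(\textbf x) \neq f(\textbf x) $$

The learner sees the same training set $\mathcal D$ in both cases, so it makes the same predictions: every prediction outside $\mathcal D$ that is right for $f$ is wrong for $f'$, and vice versa.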

**Corollary**

*For any two learners $L_1$ and $L_2$, if $\exists$ learning problem s.t. $Acc_G(L_1)> Acc_G(L_2)$, then $\exists$ learning problem s.t. $Acc_G(L_2)> Acc_G(L_1)$*

Accuracy $0.5$ means that the learner is guessing randomly. It is important to notice that this value is the same no matter the size of the training set. If every function has the same probability of being the true one, the dataset is useless. This may seem weird, and it calls for some caution. ML is based on the hypothesis that not all functions are equally probable, and we also assume some regularity of the functions, which allows us to generalize. No algorithm is a priori better than the others, so no single approach can dominate. The best algorithm is problem dependent and also depends on the assumptions made.
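The theorem can be checked by brute force on a tiny input space. The sketch below is only an illustration under assumptions that are not in the notes: a space of 5 points, the first 3 used for training, a uniform distribution over the unseen points, and a hypothetical `majority_learner`. Any other deterministic learner could be plugged in.

```python
from itertools import product

# Toy input space: 5 distinct points; the first 3 are the training set,
# the last 2 are the unseen (non-training) points.
N_POINTS, N_TRAIN = 5, 3

def majority_learner(train_labels):
    """Hypothetical learner: predict the majority training label everywhere."""
    guess = int(sum(train_labels) * 2 >= len(train_labels))
    return lambda x: guess

def generalization_accuracy(learner, concept):
    """Accuracy on the non-training points for one concept (a tuple of labels)."""
    model = learner(concept[:N_TRAIN])
    test_points = range(N_TRAIN, N_POINTS)
    hits = sum(model(x) == concept[x] for x in test_points)
    return hits / len(test_points)

# Enumerate every possible concept, i.e. every binary labelling of the 5 points.
all_concepts = list(product([0, 1], repeat=N_POINTS))
average_accuracy = sum(
    generalization_accuracy(majority_learner, f) for f in all_concepts
) / len(all_concepts)
print(average_accuracy)  # 0.5
```

The average is exactly $0.5$ because, for each fixed labelling of the training points, the unseen points take every possible labelling once.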

> So what? You can't fall in love with a single technique

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_1.png?raw=1" width="300">
</p>

Since there is no best learner for all situations, you have to try different approaches.


## **Bias-Variance tradeoff**

Our goal is to find a model $y(\textbf x)$ that approximates $t$ as well as possible. In order to evaluate the quality of the prediction we use a loss function, whose expected value is given by

$$ \mathbb E[\mathcal L] = \int \int \mathcal L(t, y(\textbf x)) p(\textbf x, t) d\textbf x dt $$

Which in the specific case of squared loss becomes

$$ \mathbb E[\mathcal L] = \int \int (y(\textbf x) - t)^2 p(\textbf x, t) d\textbf x dt $$

There is immediately a problem, however, since we don't know the probability $p(\textbf x, t)$. The bias-variance decomposition is a framework to evaluate the performance of models. Let's assume we have a data set $\mathcal D$ with $N$ samples generated from the model

$$ t_i = f(\textbf x_i) + \epsilon $$

where $\epsilon$ has expected value $\mathbb E[\epsilon] = 0 $ and variance $Var[\epsilon] = \sigma^2 $ (notice that this is not the same as a Gaussian assumption). Let's consider the expected squared error on an unseen sample

$$
\begin{align}
\mathbb E[(t - y(\textbf x))^2] &= \mathbb E[t^2 + y(\textbf x)^2 - 2ty(\textbf x)] \quad \textrm{compute the square} \\
&= \underbrace{\mathbb E[t^2]}_\text{$2^{nd}$ moment of $t$} + \underbrace{\mathbb E[y(\textbf x)^2]}_\text{$2^{nd}$ moment of $y$}  - \mathbb E[2ty(\textbf x)] \quad \textrm{thanks to the linearity of $\mathbb E[\cdot]$} \\
&= \mathbb E[t^2] \pm \mathbb E[t]^2 + \mathbb E[y(\textbf x)^2] \pm \mathbb E[y(\textbf x)]^2 - 2f(\textbf x)\mathbb E[y(\textbf x)] \quad \textrm{since $\mathbb E[t] = f(\textbf x)$ and the noise on $t$ is independent of $y(\textbf x)$} \\
&= Var[t] + \mathbb E[t]^2 + Var[y(\textbf x)] + \mathbb E[y(\textbf x)]^2 - 2f(\textbf x)\mathbb E[y(\textbf x)] \quad \textrm{since $\mathbb E[x^2] - \mathbb E[x]^2 = Var[x]$} \\
&= Var[t] + Var[y(\textbf x)] + (f(\textbf x) - \mathbb E[y(\textbf x)])^2 \\
&= \underbrace{Var[t]}_{\sigma^2} + \underbrace{Var[y(\textbf x)]}_{variance} + \underbrace{\mathbb E[f(\textbf x) - y(\textbf x)]^2}_{(bias)^2} \quad \textrm{since $\mathbb E[f(\textbf x)] = f(\textbf x)$}
\end{align}$$

$$ \textrm{expected loss = (bias)$^2$ + variance + noise} $$

The first thing we can observe is that all three terms are non-negative (each is a variance or a squared quantity), so they contribute additively to the loss and none of them can offset another. The noise term derives from the intrinsic variability of the target data and, since it is independent of $y(\textbf x)$, it represents the irreducible minimum value of the expected loss. To better understand this term, just think of the procedure used to collect the points in the data set. We can, for example, perform an experiment which, even if perfectly controlled, is inevitably subject to randomness: given the same combination of test parameters we will obtain results that are not exactly equal. The other two terms are the ones we can work on.

* **bias**: represents the extent to which the average of our estimate differs from the desired regression function. It is high when the hypothesis space is too small (the model is too simple) to represent $f$.

* **variance**: measures the extent to which the solutions vary around their average. It is high when the hypothesis space is large compared to the number of samples (few samples and many parameters).

It is important to notice that we have to minimize the sum of these two terms, not just one of them, since their effect is additive and neither can be negative. A good supervised learning algorithm is one that correctly balances the bias-variance trade-off.

We can generalize the previous consideration by assuming that we have a large number of data sets, each of size $N$ and each drawn from the same distribution. For any given data set $\mathcal D$ we can run our learning algorithm and obtain a prediction $y(\textbf x, \mathcal D)$. Since each data set is different, we will obtain a different value of the loss function when running the same algorithm on each of them. The performance of a particular learning algorithm can be assessed by averaging over the ensemble of data sets. We then integrate over the whole input space to compute the desired quantities, where the expectation $\mathbb E[\cdot]$ is taken over the data sets

$$ \textrm{bias}^2 = \int(f(\textbf x) - \mathbb E[y(\textbf x)])^2 p(\textbf x) d\textbf x $$

$$ \textrm{variance} = \int \mathbb E[(y(\textbf x) - \mathbb E[y(\textbf x)])^2]p(\textbf x) d\textbf x $$
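The averaging over an ensemble of data sets can be simulated directly. The sketch below is a minimal illustration, not part of the original notes: the true function `f`, the noise level `sigma`, the polynomial `degree` and the evaluation point `x0` are all assumed values. It estimates the squared bias and the variance at a single input and compares their sum (plus the noise) with a direct estimate of the expected squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """True regression function (an assumption made for this illustration)."""
    return np.sin(2 * np.pi * x)

sigma = 0.3          # noise standard deviation
n_samples = 25       # size N of each data set
n_datasets = 2000    # number of data sets in the ensemble
degree = 3           # complexity of the polynomial model
x0 = 0.35            # input where the decomposition is evaluated

predictions = np.empty(n_datasets)
for d in range(n_datasets):
    x = rng.uniform(0.0, 1.0, n_samples)
    t = f(x) + rng.normal(0.0, sigma, n_samples)
    coeffs = np.polyfit(x, t, degree)          # least-squares fit on this data set
    predictions[d] = np.polyval(coeffs, x0)    # y(x0, D)

bias_sq = (f(x0) - predictions.mean()) ** 2    # (f - E[y])^2
variance = predictions.var()                   # E[(y - E[y])^2]
noise = sigma ** 2

# Independent noisy targets at x0, used to estimate the expected loss directly.
t0 = f(x0) + rng.normal(0.0, sigma, n_datasets)
expected_loss = np.mean((t0 - predictions) ** 2)

print(f"bias^2={bias_sq:.4f}  variance={variance:.4f}  noise={noise:.4f}")
print(f"sum={bias_sq + variance + noise:.4f}  direct estimate={expected_loss:.4f}")
```

Changing `degree` or `n_samples` shows how the two reducible terms move in opposite directions, while the noise term stays fixed.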

The picture below gives an intuitive overview of the possible cases a machine learning algorithm can fall into.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_2.png?raw=1" width="400">
</p>

On the top right we have the **overfitting** problem, since the solution is very sensitive to the dataset, as witnessed by the high variance of the points. On the bottom left we have the opposite case, called **underfitting**, in which the variance is low since the model is not very complex, but it is completely unable to capture the true behavior of the system, so the bias is high. These are the two most common situations one encounters when designing a machine learning solution. At this point one may wonder why we can't simply move the biased, underfitted model toward the center of the target. The answer is in the bias-variance equations shown before: if we try to move the points in the bottom left target towards the center, they will start to spread as in the top right picture. Still haven't understood what overfitting is? It's easy...

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_22.png?raw=1" width="400">
</p>

The only way to decrease the bias is to use a more complex model, i.e., to increase the size of the hypothesis space. To decrease the variance we have to move in the opposite direction and shrink the hypothesis space, since simple models have low variance. Unlike bias, though, variance can also be reduced by increasing the number of samples. The ideal solution would be to use a very large number of samples and a very complex model, but this causes computational issues.

The picture below shows the shape of the bias and variance curves as the complexity of the model is varied. Any supervised learning algorithm has some parameters that allow us to obtain the required balance between the two terms. Regularization techniques tend to move the model toward the left, increasing the bias.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_3.png?raw=1" width="400">
</p>

Let's now consider a data set $\mathcal D$ with $N$ samples and choose, for example, the RSS loss function. The loss on the training set is then computed as

$$
\begin{align}
L_{train} &= \frac{1}{N} \sum_{n=1}^N(t_n - y(\textbf x_n))^2 \quad \textrm{for regression} \\
&= \frac{1}{N} \sum_{n=1}^N I(t_n \neq y(\textbf x_n)) \quad \textrm{for classification}
\end{align}
$$

As the model complexity increases, the loss on the training set tends to decrease, and so does the bias with respect to the true model. The problem is that, as we have already said, in machine learning we minimize a surrogate of the real objective function. Our goal is not to obtain the minimum of the loss on the training points but to generalize our model to unseen samples. So, we minimize the loss on the training set in the hope of also minimizing the loss on unseen points. As we can see from the picture below, the training loss is a good approximation of the real loss only for simple models. When the complexity of the model increases too much, the two curves tend to diverge.

> The training loss is an **optimistically biased** estimate of the prediction error

This consideration should give an idea of why we need another set, the **test** set. What is done in practice is to "randomly" split the data set into train and test, and then use the training set to optimize the parameters and the test set to evaluate the prediction error. This strategy gives us an unbiased estimate of the model quality. The loss on the test set is computed as

$$ L_{test} = \frac{1}{N_{test}} \sum_{n=1}^{N_{test}}(t_n - y(\textbf x_n))^2 $$

It is important to remark that the test-set estimate is unbiased only if the test set is not also used for training. In the picture below the red curve is the prediction error, while the black one is its estimate obtained with the test set, which is noisy since we are working with finite data.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_5.png?raw=1" width="300">
</p>
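The gap between training loss and test loss can be reproduced with a small experiment. The following is only a sketch under assumed settings (a sinusoidal true function, Gaussian noise, polynomial models and a 40/20 random split, none of which come from the notes): it fits models of increasing complexity on the training part and evaluates $L_{train}$ and $L_{test}$ for each.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    """True function (assumed for the illustration)."""
    return np.sin(2 * np.pi * x)

# Generate a small noisy data set and split it randomly into train and test.
x = rng.uniform(0.0, 1.0, 60)
t = f(x) + rng.normal(0.0, 0.3, x.size)
perm = rng.permutation(x.size)
train_idx, test_idx = perm[:40], perm[40:]

def mse(coeffs, xs, ts):
    """Mean squared error of a polynomial model on the given points."""
    return float(np.mean((ts - np.polyval(coeffs, xs)) ** 2))

degrees = range(0, 10)
train_loss, test_loss = [], []
for degree in degrees:
    # The parameters are optimized on the training points only.
    coeffs = np.polyfit(x[train_idx], t[train_idx], degree)
    train_loss.append(mse(coeffs, x[train_idx], t[train_idx]))
    test_loss.append(mse(coeffs, x[test_idx], t[test_idx]))

for degree, l_train, l_test in zip(degrees, train_loss, test_loss):
    print(f"degree={degree}  L_train={l_train:.4f}  L_test={l_test:.4f}")
```

The training loss can only go down as the degree grows, because each polynomial family contains the previous one; the test loss has no such guarantee.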

The bad news is that there is still no general rule for how many points should make up the test set and how many the training set. In the picture below we fix the complexity of the model and look at the curves as the number of points in the training set varies. As the number of points increases, the training error increases while the prediction error decreases. When we have a lot of samples, the training error becomes a good estimate of the prediction error.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_7.png?raw=1" width="300">
</p>

The relative trend of the train and test losses can give important information about the quality of the model. If the test error and the training error are far apart, the variance is high and so the hypothesis space is too large. If the two curves converge quickly, the bias is too high.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_8.png?raw=1" width="400" hspace='20'> <img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_9.png?raw=1" width="400">
</p>

The bias-variance trade-off can be managed using different techniques:

* **Model selection**: feature selection, regularization, dimension reduction

* **Model ensemble**: bagging, boosting

Bagging and boosting are meta-techniques that allow us to reduce one error without modifying the other too much.

## **Curse of dimensionality**
In real-world applications we usually have to deal with input vectors with a lot of features, not just two or three as we are used to visualizing. The high dimensionality of the input space poses serious challenges and has a strong impact on the design of pattern recognition techniques. The difficulties of designing techniques able to work with high-dimensional inputs are described by a phenomenon called the **curse of dimensionality**. To better understand this phenomenon, imagine a problem in which we want to classify a point based on its neighbors. An easy way to do so is to divide the space into cells and then, inside each cell, pick the class with maximum probability. This example is really useful for understanding the problem with high-dimensional inputs, because it is easy to see that the number of cells grows exponentially with the number of features.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_9b.png?raw=1" width="400">
</p>

If the number of cells grows exponentially, the number of samples needs to follow the same growth to ensure that every cell contains at least one point. Another geometric example is the calculation of the fraction of the volume of a $D$-dimensional sphere of radius $1$ that lies between $r = 1 - \epsilon$ and $r = 1$. Since the volume of a sphere of radius $r$ scales as $r^D$, this fraction is $1 - (1 - \epsilon)^D$, which tends to $1$ for large $D$. In spaces of high dimensionality most of the volume of the sphere is concentrated in a thin shell near the surface.
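Both geometric arguments are easy to evaluate numerically. The helper names below (`shell_volume_fraction`, `cells_needed`) are hypothetical and introduced only for this sketch; the first relies on the volume of a $D$-dimensional sphere scaling as $r^D$.

```python
def shell_volume_fraction(dim, eps):
    """Fraction of a unit D-sphere's volume lying between r = 1 - eps and r = 1.

    The volume of a D-dimensional sphere scales as r**D, so the fraction is
    1 - (1 - eps)**D.
    """
    return 1.0 - (1.0 - eps) ** dim

def cells_needed(dim, bins_per_axis):
    """Number of cells when each of the D axes is split into the same number of bins."""
    return bins_per_axis ** dim

for dim in (1, 2, 5, 20, 100):
    fraction = shell_volume_fraction(dim, eps=0.05)
    print(f"D={dim:3d}  shell fraction={fraction:.4f}  cells={cells_needed(dim, 3)}")
```

With only three bins per axis, the number of cells already exceeds any realistic data set size well before $D = 100$.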

> Not all the intuitions developed in spaces of low dimensionality will generalize to spaces of many dimensions


We can summarize the problems related to high dimensional input space as

* Large variance

* Many samples required

* High computational costs

These problems can be tackled in different ways

* **Feature selection:** real data is often confined to a region of the space having lower effective dimensionality. This can be exploited by identifying a subset of input features that are most related to the output.

* **Dimension reduction:** real data usually have particular directions along which important variations in the target variables occur. This suggests that the input variables could be projected onto a lower-dimensional subspace that describes just the directions of highest variability.

* **Shrinkage:** real data will typically exhibit some smoothness properties, so we can use all the input features and then shrink some coefficients toward zero, as in regularization, reducing the variance.

## **Feature selection**
The simplest and most intuitive way of finding the best set of features is to evaluate all the possible subsets of the original features. This approach is called **Best Subset Selection** and is based on the following steps:

1. Let $\mathcal M_0$ denote the null model, which contains no input feature, so it simply predicts the sample mean for every observation.

2. For $k=1, ...,M$:

   * Fit all $\binom{M}{k}$ models that contain exactly $k$ features
   
   * Pick the best among these $\binom{M}{k}$ models and call it $\mathcal M_k$. By best model we mean the one that has the smallest $RSS$ or largest $R^2$.
   
3. Select the single best model among $\mathcal M_0, ..., \mathcal M_M$ (how to choose the best model will become clear later)

Below, the envelope of the RSS and of the $R^2$ of the best subsets is reported as a function of the number of features. As we can see, we can't just pick the model with the best score, since both performance metrics are monotonic in the number of features.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_21.png?raw=1" width="400">
</p>

With this simple procedure we are guaranteed to find the best subset of features, but it has serious computational problems when $M$ is large, as the number of possible subsets grows as $2^M$. Moreover, if we don't choose the final best subset carefully we will end up overfitting. There exist three metaheuristics to tackle the first problem.

* **Filter:** we rank the features and select the best ones

* **Embedded:** the learning algorithm exploits its own variable selection technique, as in *Lasso*, *Decision trees*, *Auto-encoding*, etc. In this case Lasso is used just to see which features are associated with zero parameters, and then we can use another model for the regression, such as an Artificial Neural Network or a Gaussian Process. With Decision trees we can remove the features that are not used by the tree

* **Wrapper:** instead of considering all the possible subsets we can follow a sequential procedure. Two methods are possible:

    * **Forward Step-wise Selection:** starts from an empty model and then adds features one at a time until all features are in the model. At each step, the feature that gives the greatest additional improvement is added
    
    * **Backward Step-wise Selection:** starts with all the features and removes the least useful feature, one at a time
    
Neither of the wrapper methods presented here is guaranteed to find the best model.
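Forward step-wise selection can be sketched in a few lines. The code below is a minimal illustration, assuming a linear model fitted by least squares and the training RSS as the improvement criterion; the function names and the synthetic data (where only features 0 and 3 matter) are assumptions made for the example.

```python
import numpy as np

def rss(X, t, features):
    """Residual sum of squares of a least-squares fit on the given feature subset."""
    A = np.column_stack([np.ones(len(t))] + [X[:, j] for j in features])
    w, *_ = np.linalg.lstsq(A, t, rcond=None)
    return float(np.sum((t - A @ w) ** 2))

def forward_stepwise(X, t):
    """Return the nested subsets M_0, ..., M_M together with their training RSS."""
    remaining = list(range(X.shape[1]))
    selected = []
    path = [(tuple(selected), rss(X, t, selected))]   # M_0: intercept only
    while remaining:
        # Add the single feature that gives the largest reduction in RSS.
        best = min(remaining, key=lambda j: rss(X, t, selected + [j]))
        selected.append(best)
        remaining.remove(best)
        path.append((tuple(selected), rss(X, t, selected)))
    return path

# Synthetic data (an assumption): only features 0 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
t = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

path = forward_stepwise(X, t)
for subset, value in path:
    print(len(subset), subset, round(value, 2))
```

This fits on the order of $M^2$ models instead of $2^M$. The training RSS along the path never increases, which is why the final choice among $\mathcal M_0, ..., \mathcal M_M$ needs a different criterion.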

As visible from the picture above, the model containing all the features is certainly the one with the smallest training error. However, as we have already said, the training error is not always a good approximation of the prediction error. Remember that our goal is to perform well on unseen data, not on the training set. Instead of looking at the training error we have to look at the prediction error, and there are two approaches to estimate it:

* *Direct* estimation of the prediction error using a validation approach

* *Indirect* estimation by making an *adjustment* of the training error to account for model complexity

### **Cross-Validation**
Let's now consider the direct approach. We already know that we can't use the training error to evaluate the quality of a model and, of course, we can't use the test error either. The test error cannot be used to influence our decisions while building the model; it must be independent of the whole learning process. A simple possible solution is to split the data set not into two parts but into three:

* **Training data:** used to learn the parameters of the model

* **Validation data:** used for model selection or model assessment

* **Test data:** used for the very final evaluation of the model performance

So we have introduced a new set that will be used for fine-tuning the parameters of the learning algorithm. Even this solution is not completely safe: by using the validation set for model selection we can overfit this set too, obtaining bad results on the test points. Another problem is that in real applications the data is limited, and removing data from the training set may compromise the performance of the model. We always wish to use the maximum number of points to train the model. A solution could be to use just a small part of the training set to build the validation set, but in this case we will obtain a noisy and unrepresentative estimate of the predictive performance.

One solution to this dilemma is to use **cross-validation**. Let's consider the whole training set $\mathcal D$, composed of $N$ samples, and pick only one sample from it to build the validation set. In this case we will have $N-1$ samples in the training set and $1$ sample in the validation set. Now we learn the model using the dataset $\mathcal D \backslash \{n\}$ and then we estimate the prediction error using just the single point $\{n\}$. We repeat this process, each time using a different sample for validation, until all the samples have been used to approximate the error.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_9c.png?raw=1" width="400">
</p>

This technique is called **leave-one-out cross validation** (LOOCV) and the final performance is obtained by averaging over all the models obtained.

$$ L_{LOO} = \frac{1}{N} \sum_{n=1}^N (t_n - y_{\mathcal D \backslash \{ n \} }(\textbf x_n))^2 $$

After this process the model can be learned using all the data points in the training set. LOO is almost unbiased and slightly pessimistic, since we use $N-1$ points in the training set instead of $N$. Obviously, there is a problem in this case too :). Suppose we have $100,000$ data points and our learning algorithm requires just $1s$ to complete. The entire LOO procedure will require more than one entire day ($100,000s \approx 28$ hours) and, if we also have to repeat the process in order to tune hyperparameters such as $\lambda$, it will take forever!

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_9d.jpg?raw=1" width="400">
</p>

A better solution is to use **k-fold cross validation**. In this case, instead of using just one sample in the validation set, we divide the training data into $k$ equal parts $\mathcal D_1, ..., \mathcal D_k$. We repeat the same procedure as in leave-one-out, but every time one of the $k$ parts is used for validation and all the others for training, as depicted below.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_9cc.png?raw=1" width="400">
</p>

For every run the estimate of the error will be

$$ L_{\mathcal D_i} = \frac{k}{N} \sum_{(\textbf x_n, t_n) \in \mathcal D_i} (t_n - y_{\mathcal D \backslash \mathcal D_i}(\textbf x_n))^2 $$

And the $k$-fold cross validation error is the average over all the data splits

$$ L_{k-fold} = \frac{1}{k} \sum_{i=1}^k L_{\mathcal D_i} $$

It is easy to see that LOOCV is just a special case of the more general $k$-fold CV, obtained with $k = N$. $k$-fold cross validation is much faster than LOO and more pessimistically biased, since there are fewer points in the training set at each run. Usually $k$ is set to $10$, but the best choice depends on the specific problem.
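The two formulas for $L_{\mathcal D_i}$ and $L_{k-fold}$ translate almost directly into code. The sketch below assumes a polynomial regression model and synthetic data, both chosen only for illustration; `k_fold_cv` is a hypothetical helper name. Setting `k` equal to the number of samples gives LOOCV.

```python
import numpy as np

def k_fold_cv(x, t, degree, k, rng):
    """k-fold cross-validation estimate of the squared loss of a polynomial fit."""
    folds = np.array_split(rng.permutation(len(x)), k)
    fold_losses = []
    for val_idx in folds:
        train_mask = np.ones(len(x), dtype=bool)
        train_mask[val_idx] = False                    # training set D \ D_i
        coeffs = np.polyfit(x[train_mask], t[train_mask], degree)
        residuals = t[val_idx] - np.polyval(coeffs, x[val_idx])
        fold_losses.append(np.mean(residuals ** 2))    # L_{D_i}
    return float(np.mean(fold_losses))                 # L_{k-fold}

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 50)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, x.size)

for degree in (1, 3, 9):
    cv10 = k_fold_cv(x, t, degree, k=10, rng=rng)
    loo = k_fold_cv(x, t, degree, k=len(x), rng=rng)   # LOOCV is the case k = N
    print(f"degree={degree}  10-fold={cv10:.4f}  LOO={loo:.4f}")
```

The 10-fold estimate requires 10 fits per model, while LOO requires 50 here, which is where the difference in cost comes from.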

Unlike direct methods, adjustment techniques involve just the training data and correct the training error estimate by adding terms related to the model complexity. Since these methods are not iterative, they are faster than direct methods. Just to mention some techniques, we can use $C_P$, the *Akaike information criterion* (AIC), the *Bayesian information criterion* (BIC) or the *Adjusted* $R^2$.

## **Shrinkage**
Let's now consider the effect of shrinkage methods. We have already seen that regularization approaches such as Ridge regression or Lasso shrink the parameters towards zero, and this shrinkage has the effect of reducing the variance. We will consider an example involving simulated data containing $45$ features and $50$ observations.

In the picture below we see a comparison of Ridge regression and OLS. The black curve is the squared bias, the green (or light blue) one is the variance, the purple one is the MSE and the dashed line is the minimum achievable MSE. We see in the left plot that as $\lambda$ increases the flexibility of Ridge decreases, as witnessed by the decrease of the variance. When $\lambda=0$, which corresponds to the OLS estimate, the variance is high but the bias is zero. Notice how, up to the purple cross, the reduction in variance is much larger than the increase in bias.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_10.png?raw=1" width="600">
</p>

The well-known disadvantage of Ridge regression is that it shrinks all the coefficients toward zero but no coefficient will be exactly zero. This may cause a problem not for prediction but for model interpretation when the number of features is high. Below we compare Ridge regression and Lasso. In the figure on the left the curves refer to Lasso, while on the right we have solid curves for Lasso and dashed curves for Ridge. We can see that in this case the performance of the two methods is quite similar, with Ridge having slightly lower variance. The data used to fit the models in the figures below were generated in such a way that all the features are related to the response.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_11.png?raw=1" width="600">
</p>

But what happens if just $2$ of the $45$ features are important?

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_12.png?raw=1" width="600">
</p>

We see that in this case Lasso outperforms Ridge. These two examples illustrate that neither Ridge regression nor Lasso will universally dominate the other.

In both cases, the major problem is the choice of the best value of the regularization coefficient. One way to find the best value of $\lambda$, the one corresponding to the purple cross in the pictures above, is to use cross-validation over a grid of $\lambda$ values. With cross-validation we find the best regularization value, and then we retrain the model with the optimal $\lambda$ on all the points in the training set.
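The grid search over $\lambda$ can be sketched as follows. This is a minimal example under stated assumptions: Ridge regression in closed form without an intercept, 5-fold cross-validation, a logarithmic grid, and simulated data with $50$ observations and $45$ features generated here at random (not the data behind the figures).

```python
import numpy as np

def ridge_fit(X, t, lam):
    """Closed-form Ridge solution w = (X^T X + lam I)^{-1} X^T t (no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ t)

def cv_error(X, t, lam, k, seed=0):
    """k-fold cross-validation estimate of the squared loss for a given lambda."""
    # The same seed gives the same folds for every lambda, so errors are comparable.
    folds = np.array_split(np.random.default_rng(seed).permutation(len(t)), k)
    losses = []
    for val_idx in folds:
        mask = np.ones(len(t), dtype=bool)
        mask[val_idx] = False
        w = ridge_fit(X[mask], t[mask], lam)
        losses.append(np.mean((t[val_idx] - X[val_idx] @ w) ** 2))
    return float(np.mean(losses))

# Simulated data (assumed setup): 50 observations, 45 features, all relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 45))
t = X @ rng.normal(size=45) + rng.normal(scale=1.0, size=50)

lambdas = np.logspace(-3, 3, 13)
errors = [cv_error(X, t, lam, k=5) for lam in lambdas]
best_lambda = lambdas[int(np.argmin(errors))]

# Retrain on all the training points with the selected lambda.
w_final = ridge_fit(X, t, best_lambda)
print(f"best lambda={best_lambda:.4g}  CV error={min(errors):.4f}")
```

Reusing the same folds for every candidate value removes one source of noise from the comparison; the grid bounds are arbitrary and should be widened if the minimum falls on an edge.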

## **Dimensionality reduction**

In dimensionality reduction methods we apply a transformation to the original features and then the model is learned on the transformed variables. This concept differs from feature selection in two main aspects:

1. it uses all the features

2. it is an unsupervised approach

These techniques fall into **unsupervised learning** and some examples of them are:

* Principal Component Analysis (PCA)

* Independent Component Analysis (ICA)

* Self-organizing Maps

* Autoencoders

* t-SNE

### **Principal component analysis**
PCA is also known as the *Karhunen-Loève* transform and is widely used for dimensionality reduction, lossy data compression, feature extraction and data visualization. The idea behind PCA is to perform an orthogonal projection of the data onto a lower-dimensional linear space, known as the *principal subspace*, which accounts for most of the variance in the data. It can also be viewed as the **linear projection** that minimizes the average projection loss, defined as the squared distance between the original points and their projections. An important hypothesis that we make before applying PCA is that the target function is regular. If this assumption is not met, it can happen that the function varies a lot along the directions of minimum variance of the inputs, and in this case PCA becomes useless. In the picture below we see that the points are distributed in the plane and vary along both the horizontal and the vertical axes. After applying PCA we discover the two directions that maximize the variance of the data. The first principal direction is the one that explains most of the variation of the input.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_14.png?raw=1" width="400">
</p>

Conceptually, the PCA algorithm starts by finding the direction such that the data projected onto it has maximum variance. After that we search for a second direction, orthogonal to the first one, that has maximum projected variance. This process is repeated until the desired number of dimensions is reached. Our goal is to project the data onto a space having dimensionality $M<D$ while maximizing the variance of the projected data. The first step is to compute the mean of the data in order to center them

$$ \bar{\textbf x} = \frac{1}{N} \sum_{n=1}^N \textbf x_n $$

Now we define the covariance matrix

$$ \textbf S = \frac{1}{N-1} \sum_{n=1}^N (\textbf x_n - \bar{\textbf x})(\textbf x_n - \bar{\textbf x})^T $$

where $(N-1)$ in the denominator gives an unbiased estimate of the covariance. The direction of the first principal component is the eigenvector $\textbf e_1$ of $\textbf S$ associated with the largest eigenvalue $\lambda_1$, which represents the variance along the direction $\textbf e_1$. We can decide the dimension of the final projection space by taking the first $k$ eigenvectors (sorted by decreasing eigenvalue), which represent a new orthogonal basis whose axes are aligned with the directions of maximum variance of the original data. Collecting them as the columns of $\textbf E_k$, we can write the new input as

$$ \textbf X' = \textbf X \textbf E_k $$

The new input data $\textbf X'$ will not contain all the original information unless all the eigenvectors are used for the transformation. Since our goal is to reduce the input dimension, we have to accept the loss of some information in exchange for a smaller problem. We can calculate the amount of variance captured by looking at the eigenvalues

$$ \textrm{Variance captured} = \frac{\sum_{i=1}^k \lambda_i}{\sum_{j=1}^D \lambda_j} $$
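The steps above (centering, covariance matrix, eigen-decomposition, projection, variance captured) can be sketched with NumPy. The function name `pca` and the synthetic three-dimensional data are assumptions made for the example; `np.linalg.eigh` is used because $\textbf S$ is symmetric.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the first k principal directions."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                                  # centered data
    S = Xc.T @ Xc / (X.shape[0] - 1)                # covariance matrix (D x D)
    eigvals, eigvecs = np.linalg.eigh(S)            # eigen-decomposition of symmetric S
    order = np.argsort(eigvals)[::-1]               # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    E_k = eigvecs[:, :k]                            # first k eigenvectors as columns
    X_proj = Xc @ E_k                               # X' = X E_k on the centered data
    captured = eigvals[:k].sum() / eigvals.sum()    # fraction of variance captured
    return X_proj, E_k, eigvals, captured

# Synthetic correlated data (an assumption): 3 features driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = np.array([[3.0, 1.0, 0.5], [0.0, 1.0, 0.2]])
X = latent @ mixing + 0.1 * rng.normal(size=(500, 3))

X_proj, E_k, eigvals, captured = pca(X, k=2)
print("eigenvalues:", np.round(eigvals, 4))
print("variance captured by 2 components:", round(float(captured), 4))
```

The sample variance of the first projected coordinate equals $\lambda_1$, which is a quick sanity check on any implementation.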


<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_18.png?raw=1" width="450">
</p>

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_19.png?raw=1" width="450">
</p>

A historical application of PCA is face recognition. A typical image size is $256 \times 128$, so each image is a point in a $32768$-dimensional space. Each face image lies somewhere in this high-dimensional space. An important consideration behind this application is that all faces share some similar features, so they cannot be randomly distributed in this space. According to this idea, PCA can help find the appropriate subspace able to capture and correctly describe a face. The picture below shows the mean and the first $15$ eigenvectors (shown as images). These eigenvectors are called **eigenfaces**.

<p align="center">
<img src="https://github.com/stepyt/machine_learning_notes/blob/master/storage/pict4/figure4_20.png?raw=1" width="400">
</p>

Now we can construct a face by randomly taking a point inside this new space.

We previously mentioned autoencoders as well; they are similar to PCA. The main difference is that PCA can only perform a linear projection, but it is fast and can be computed in closed form. Autoencoders are more powerful since they also allow nonlinear projections, but they require a long learning time.

## **Bagging and boosting**
The methods seen so far reduce the variance by increasing the bias, and vice versa. One may wonder whether it is possible to reduce the variance without increasing the bias, or to decrease the bias without increasing the variance. The answer is yes: the first goal can be achieved with **bagging** and the second with **boosting**. These two algorithms are called meta-algorithms since they can be applied on top of any existing algorithm. The basic idea is that the performance on a specific task can be increased by combining multiple models instead of using just a single model. Let's assume that we have $N$ independent datasets and that we learn $N$ different models $y_1, ..., y_N$ from them; *bagging* computes the prediction of the output as

$$ y_{COM} = \frac{1}{N} \sum_{i=1}^N y_i $$

The subscript COM stands for **committees**, since this is the name given to these combinations of models.

If the datasets are independent, the variance of the committee will be $1/N$ of the variance of the single model $y_i$. This can be shown by considering the average $\bar{x} = \frac{1}{N}\sum_N x$ of $N$ independent random variables with the same variance and calculating its variance

$$ Var(\bar{x}) = \frac{1}{N^2}\sum_NVar(x)=  \frac{1}{N}Var(x) $$

Unfortunately this result is based on the assumption that the $N$ datasets are all independent. Obviously, this is not achievable in practice, since the dataset is just one and it is impossible to fully satisfy the assumption of $N$ independent model errors. What usually happens is that we obtain $N$ correlated data sets, and the reduction in variance will be lower than the one in the equation above. One way to obtain these $N$ data sets is to use the **bootstrap**. Suppose our training set $\textbf X$ consists of $n$ data points; we can create a new data set by drawing $n$ points with replacement from $\textbf X$. Repeating this procedure $N$ times gives $N$ different data sets, since each one can contain repetitions of some points of $\textbf X$ and miss others. So, in bagging we create $N$ data sets, we train a model on each of them, and then we compute the prediction for new samples by evaluating all the trained models and combining their outputs, with majority voting for classification or averaging for regression.

Bagging is useful when it is applied to *unstable learners*, i.e., learners that change significantly with even small changes in the data. These learners are characterized by low bias and high variance, which is what usually happens when the hypothesis space is really large. When we apply this method we want the single learners to overfit the data, so it is used with very deep neural networks, linear models with many features or decision trees. Usually all the learners are built with the same model, but this is not strictly required: we can also obtain an ensemble by combining different algorithms. It is important to notice that bagging does not help when the learners are robust models (high bias, low variance). Another important aspect of bagging is that all the models can be trained in parallel.
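Bootstrap sampling and the committee average fit in a short sketch. The example assumes a regression task with high-degree polynomials as the unstable base learner; the function names, the synthetic data and the number of models are illustrative choices, not taken from the notes.

```python
import numpy as np

def bootstrap_indices(n, rng):
    """Draw n indices with replacement from range(n)."""
    return rng.integers(0, n, size=n)

def bagged_predict(x_train, t_train, x_new, n_models, degree, rng):
    """Train one polynomial per bootstrap replicate and average the predictions."""
    preds = np.empty((n_models, len(x_new)))
    for m in range(n_models):
        idx = bootstrap_indices(len(x_train), rng)
        coeffs = np.polyfit(x_train[idx], t_train[idx], degree)
        preds[m] = np.polyval(coeffs, x_new)
    return preds.mean(axis=0), preds     # committee output y_COM and its members

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, 40)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, x_train.size)
x_new = np.linspace(0.1, 0.9, 5)

y_com, members = bagged_predict(x_train, t_train, x_new, n_models=50, degree=7, rng=rng)
print("committee prediction:", np.round(y_com, 3))
print("spread of single models:", np.round(members.std(axis=0), 3))
```

Each iteration of the loop is independent of the others, so the models could be trained in parallel. For classification, the `mean` would be replaced by a majority vote.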

Boosting is another technique to generate an ensemble model, this time with small bias, by combining weak learners. By weak learners we mean models that are just slightly better than random guessing, if we consider for example a classification application. In other words, in boosting the single models underfit the data: in this way we have low variance on the single models and, by combining them, we also reduce the bias. These models are trained sequentially as follows:

* Weight all training samples equally

* Train a model on the training set

* Compute the error

* Increase the weights of the points that the model gets wrong

* Train a new model on the re-weighted training set

* Repeat until tired

* The final prediction is given by a weighted combination of the predictions of all the models

So, starting from the first model, trained with all the points in the training set equally important, we increase the focus of subsequent learners on the points where the previous model is wrong by increasing their weights. Notice that all the learners are global learners, not local ones; what is local is each loss function, since it is weighted unevenly over the input space.

The most widely used form of boosting is called *AdaBoost*, and its pseudo-code is reported below

*(Figure: AdaBoost pseudo-code.)*

The log term $\log 1/\beta_r$ is always non-negative since $\beta_r \in [0, 1]$. It is important that boosting is applied to weak learners, since it is not able to keep the variance under control.
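As a rough illustration of the re-weighting loop, here is a sketch in the style of AdaBoost.M1 for labels in $\{0, 1\}$. It is not a transcription of the pseudo-code referenced above: it assumes $\beta_r = \epsilon_r / (1 - \epsilon_r)$, where $\epsilon_r$ is the weighted error of round $r$ (a choice consistent with the vote weight $\log 1/\beta_r$), and it uses decision stumps as a hypothetical weak learner on a toy one-dimensional problem.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: best (feature, threshold, polarity) under weights w."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        # np.inf is added so that a constant prediction is also available.
        for thr in np.append(np.unique(X[:, j]), np.inf):
            for polarity in (1, -1):
                pred = (polarity * (X[:, j] - thr) >= 0).astype(int)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, j, thr, polarity)
    return best[1:]

def stump_predict(stump, X):
    j, thr, polarity = stump
    return (polarity * (X[:, j] - thr) >= 0).astype(int)

def adaboost_fit(X, y, n_rounds):
    """AdaBoost.M1-style training loop (assumed form, see the lead-in)."""
    w = np.full(len(y), 1.0 / len(y))               # weight all samples equally
    stumps, votes = [], []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)
        wrong = stump_predict(stump, X) != y
        eps = max(float(np.sum(w[wrong])), 1e-10)   # weighted error of this round
        if eps >= 0.5:                              # not better than random: stop
            break
        beta = eps / (1.0 - eps)                    # beta_r in (0, 1)
        w = np.where(wrong, w, w * beta)            # shrink weights of correct points
        w /= w.sum()                                # misclassified points gain weight
        stumps.append(stump)
        votes.append(np.log(1.0 / beta))            # vote weight log(1 / beta_r)
    return stumps, np.array(votes)

def adaboost_predict(stumps, votes, X):
    """Weighted vote: predict 1 when the stumps voting 1 hold at least half the weight."""
    score_1 = np.zeros(len(X))
    for stump, vote in zip(stumps, votes):
        score_1 += vote * stump_predict(stump, X)
    return (score_1 >= 0.5 * votes.sum()).astype(int)

# Toy problem (an assumption): label 1 inside an interval, which no single stump can fit.
X = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
y = ((X[:, 0] > 0.3) & (X[:, 0] < 0.7)).astype(int)

stumps, votes = adaboost_fit(X, y, n_rounds=3)
single_acc = np.mean(adaboost_predict(stumps[:1], votes[:1], X) == y)
boosted_acc = np.mean(adaboost_predict(stumps, votes, X) == y)
print(f"single stump accuracy={single_acc:.2f}  boosted accuracy={boosted_acc:.2f}")
```

Each stump alone underfits the interval, but their weighted vote can represent it, which is the bias reduction described above. The rounds must run one after the other because each set of weights depends on the previous model.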

In conclusion:

* Bagging:

    * reduces variance
    
    * doesn't work well with stable models
    
    * can be applied to noisy data
    
    * almost always helps, even if sometimes only a little
    
    * parallel
    
* Boosting:

    * reduces bias
    
    * works with stable learners
    
    * might have problems with noisy data
    
    * on average helps more than bagging but can also hurt performance sometimes
    
    * serial