# Bias-Variance Tradeoff

The **bias-variance tradeoff** is a fundamental property of machine learning algorithms.

## Bias

**Bias** is the tendency of an estimator to pick a model for the data that is not structurally correct. A biased estimator is one that makes incorrect model-level assumptions about the dataset. For example, suppose that we use a linear regression model on a cubic function. This model will be biased: it will systematically misestimate the true values in the dataset, no matter how many points we use.
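To make the linear-on-cubic example concrete, here is a minimal sketch, assuming NumPy and a hypothetical noise-free target $x^3$ on $[-1, 1]$. The helper name `linear_fit_rmse` is ours, chosen only for illustration.

```python
import numpy as np

def linear_fit_rmse(n_points):
    """Fit a straight line to noise-free samples of x**3 and return the RMSE."""
    x = np.linspace(-1.0, 1.0, n_points)
    y = x ** 3                                   # noise-free cubic target
    slope, intercept = np.polyfit(x, y, deg=1)   # best least-squares line
    residuals = y - (slope * x + intercept)
    return float(np.sqrt(np.mean(residuals ** 2)))

# More data does not help: the line is the wrong shape for the target.
errors = {n: linear_fit_rmse(n) for n in (10, 100, 10_000)}
for n, err in errors.items():
    print(f"n={n:>6}: RMSE of linear fit = {err:.4f}")
```

Under these assumptions the error levels off at a nonzero value (roughly 0.15) instead of shrinking toward zero as points are added, which is what structural bias looks like in practice.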

Given points $x$, a true value $f(x)$, and a model $\hat{f}(x)$, bias can be expressed mathematically as:

$$\text{Bias}[\hat{f}(x)] = E[\hat{f}(x) - f(x)]$$

Where $E[\cdot]$ is the [expected value](https://en.wikipedia.org/wiki/Expected_value) function (i.e. the mean value).
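The expectation in this formula can be approximated by simulation. The sketch below is one way to do it, assuming NumPy, a hypothetical true function $f(x) = x^3$, Gaussian noise with standard deviation 0.1, and a query point `x0 = 0.9`. All of these are illustrative choices, not part of the definition.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x ** 3          # hypothetical true function

x0 = 0.9                   # point at which we estimate the bias
predictions = []
for _ in range(2000):
    # Draw a fresh noisy training set and fit a straight line to it.
    x_train = rng.uniform(-1, 1, size=30)
    y_train = f(x_train) + rng.normal(0, 0.1, size=30)
    coeffs = np.polyfit(x_train, y_train, deg=1)
    predictions.append(np.polyval(coeffs, x0))

# Bias = E[f_hat(x0) - f(x0)], with E approximated by the average over fits.
bias_at_x0 = float(np.mean(predictions) - f(x0))
print(f"Estimated bias at x0={x0}: {bias_at_x0:.3f}")
```

The estimate comes out clearly negative: averaged over many training sets, the line sits below the cubic at this point.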

The key to understanding bias is that, given enough "perfect" data points sampled from a compatible distribution with no errors or variance in it (such as a line, in the case of linear regression), an unbiased model will fit every point exactly.
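As a quick check of this claim, the following sketch (assuming NumPy, with a hypothetical line $y = 2x + 1$) fits a linear model to noise-free points that really do lie on a line.

```python
import numpy as np

x = np.linspace(0, 10, 25)
y = 2.0 * x + 1.0                      # "perfect" data: exactly linear, no noise

slope, intercept = np.polyfit(x, y, deg=1)
max_abs_error = float(np.max(np.abs(y - (slope * x + intercept))))
print(f"slope={slope:.3f}, intercept={intercept:.3f}, max error={max_abs_error:.2e}")
```

The fit recovers the line up to floating-point rounding, so every point is matched exactly for practical purposes.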

High bias is also known as **underfitting**. Once we've selected a model, bias becomes something that we want to, *within reason*, reduce.

## Variance

**Variance** is error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (**overfitting**). It can be expressed mathematically as:

$$\text{Var}(X) = E[X^2] - E[X]^2$$
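To see sensitivity to the training set in action, the sketch below compares how much a prediction at one point moves around when the training set is redrawn, for a rigid model (degree 1) and a flexible one (degree 9). It assumes NumPy, a hypothetical cubic target, and illustrative values for the noise level and sample size. The function name `prediction_variance` is ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def prediction_variance(degree, x0=0.5, n_trials=500, n_points=15, noise_sd=0.3):
    """Variance of a polynomial model's prediction at x0 across training sets."""
    preds = np.empty(n_trials)
    for i in range(n_trials):
        x = rng.uniform(-1, 1, size=n_points)
        y = x ** 3 + rng.normal(0, noise_sd, size=n_points)
        preds[i] = np.polyval(np.polyfit(x, y, deg=degree), x0)
    # Var(X) = E[X^2] - E[X]^2, applied to the collected predictions
    return float(np.mean(preds ** 2) - np.mean(preds) ** 2)

var_linear = prediction_variance(degree=1)
var_flexible = prediction_variance(degree=9)
print(f"degree 1: {var_linear:.4f}   degree 9: {var_flexible:.4f}")
```

The degree-9 model chases the noise in each individual training set, so its predictions vary far more from one training set to the next.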

Variance is slightly more fundamental than bias to understanding ML, and I covered it [here](https://www.kaggle.com/residentmario/gaming-cross-validation-and-hyperparameter-search/notebook).

## Decomposition

Bias and variance are linked with one another. Recall that mean squared error or MSE (covered in [Model Fit Metrics](https://www.kaggle.com/residentmario/model-fit-metrics/)) measures the average of the squared errors made by our model. Let $f(x)=y$ be the true target value of a point, and let $\hat{f}(x)=\hat{y}$ be its model-predicted value. Then we may write:

$$\text{MSE} = E[(f(x)-\hat{f}(x))^2]$$

We will use the following lemma:

**Lemma**: $E[(X-E[X])^2] = E[X^2] - E[X]^2$

Proof:

$$\text{Var}[X] = E[(X- E[X])^2]$$
$$= E[X^2 - 2 X E[X] + E[X]^2]$$
$$= E[X^2] - 2E[X]^2 + E[X]^2$$
$$= E[X^2] - E[X]^2$$
$$\therefore E[(X-E[X])^2] = E[X^2] - E[X]^2$$
$$\text{ QED.}$$
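The lemma is easy to sanity-check numerically. This sketch assumes NumPy and uses an arbitrary exponential sample. Any distribution would do, since the identity also holds for sample averages.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=100_000)

lhs = float(np.mean((x - np.mean(x)) ** 2))      # E[(X - E[X])^2]
rhs = float(np.mean(x ** 2) - np.mean(x) ** 2)   # E[X^2] - E[X]^2
print(f"lhs={lhs:.6f}  rhs={rhs:.6f}")
```

Both sides agree up to floating-point rounding, and both match `np.var(x)`.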

We use this fact in the following proof. Here we treat $y$ as a noisy observation with mean $E[y] = f(x)$ and noise variance $\text{Var}[y] = \varepsilon^2$, and we assume that this noise is independent of $\hat{y}$, so that $E[y\hat{y}] = E[y]E[\hat{y}]$.

$$\text{MSE} = E\big[(y - \hat{y})^2\big]$$
$$= E[y^2 + \hat{y}^2 - 2 y\hat{y}]$$
$$= E[y^2] + E[\hat{y}^2] - 2E[y\hat{y}]$$
$$= \text{Var}[y] + E[y]^2 + \text{Var}[\hat{y}] + E[\hat{y}]^2 - 2 E[y] E[\hat{y}]$$
$$= \text{Var}[y] + \text{Var}[\hat{y}] + (E[y]^2 - 2 E[y] E[\hat{y}] + E[\hat{y}]^2)$$
$$= \text{Var}[y] + \text{Var}[\hat{y}] + (E[y] - E[\hat{y}])^2$$
$$= \varepsilon^2 + \text{Var}[\hat{y}] + \text{Bias}[\hat{y}]^2$$

This result shows that mean squared error is the sum of the variance of the estimator (how poorly it generalizes; its level of overfitting), the squared bias (how structurally underfitted it is), and an irreducible error in the underlying dataset, $\varepsilon^2$.
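The decomposition can also be checked by simulation. The sketch below assumes NumPy, a hypothetical cubic target, Gaussian noise, a linear model, and a single query point `x0`. It estimates each of the three terms separately and compares their sum with the directly measured MSE.

```python
import numpy as np

rng = np.random.default_rng(3)
noise_sd, x0, n_trials = 0.3, 0.9, 5000

def f(x):
    return x ** 3                      # hypothetical true function

y_hat = np.empty(n_trials)             # model predictions at x0
y_new = np.empty(n_trials)             # fresh noisy observations at x0
for i in range(n_trials):
    x_train = rng.uniform(-1, 1, size=25)
    y_train = f(x_train) + rng.normal(0, noise_sd, size=25)
    y_hat[i] = np.polyval(np.polyfit(x_train, y_train, deg=1), x0)
    y_new[i] = f(x0) + rng.normal(0, noise_sd)

mse = float(np.mean((y_new - y_hat) ** 2))
noise = noise_sd ** 2                               # irreducible error
variance = float(np.var(y_hat))                     # Var[y_hat]
bias_sq = float((np.mean(y_hat) - f(x0)) ** 2)      # Bias[y_hat]^2

print(f"MSE                     = {mse:.4f}")
print(f"noise + variance + bias = {noise + variance + bias_sq:.4f}")
```

The two printed numbers should agree up to Monte Carlo error, which shrinks as `n_trials` grows.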

The takeaway from all this is that overall error in a model is the combination of these three components. *Purposefully* biased models like [elastic net regression](https://www.kaggle.com/residentmario/nyc-buildings-part-1-elastic-net/) (itself a combination of [L1 and L2 norm](https://www.kaggle.com/residentmario/l1-norms-versus-l2-norms) penalties) outperform unbiased models like linear regression in mean squared error at least some of the time. This is because it's often possible to increase the bias of a model in a way that strongly decreases the variance, resulting in a better model overall.
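Here is a small illustration of a deliberately biased model winning on MSE. To keep it self-contained we swap in plain ridge regression (the L2-only special case, which has a closed form) rather than a full elastic net. The sketch assumes NumPy, and the data-generating setup (few samples, noisy targets, small true coefficients) and the penalty `ridge_lambda` are hypothetical choices made to show the effect.

```python
import numpy as np

rng = np.random.default_rng(4)
n_train, n_test, n_features = 20, 200, 10
noise_sd, ridge_lambda = 2.0, 5.0
true_beta = np.full(n_features, 0.5)   # hypothetical true coefficients

def fit(X, y, lam):
    """Closed-form L2-penalized least squares; lam=0 is ordinary least squares."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

ols_errors, ridge_errors = [], []
for _ in range(500):
    X = rng.normal(size=(n_train, n_features))
    y = X @ true_beta + rng.normal(0, noise_sd, size=n_train)
    X_test = rng.normal(size=(n_test, n_features))
    y_test = X_test @ true_beta + rng.normal(0, noise_sd, size=n_test)
    for lam, errors in ((0.0, ols_errors), (ridge_lambda, ridge_errors)):
        beta_hat = fit(X, y, lam)
        errors.append(np.mean((y_test - X_test @ beta_hat) ** 2))

ols_mse = float(np.mean(ols_errors))
ridge_mse = float(np.mean(ridge_errors))
print(f"test MSE, least squares: {ols_mse:.2f}")
print(f"test MSE, ridge:         {ridge_mse:.2f}")
```

In this setting the penalty shrinks the coefficients toward zero, which adds bias but removes enough variance that the test error drops. With plenty of data or little noise, the advantage would shrink or disappear.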

For example, a good [blog post on the topic](https://theclevermachine.wordpress.com/tag/bias-variance-decomposition/) includes the following graphic demonstrating training a polynomial regressor:

![](https://i.imgur.com/ZEnV1xH.png)

In this case bias fell off a cliff on the third iteration, all the way to almost zero, because the underlying data was a third-degree polynomial. Variance, on the other hand, climbed every iteration, and as a result, the best model (according to MSE) occurred at degree three.
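An experiment in the same spirit can be sketched as follows. This is not the code behind the graphic: it assumes NumPy, a hypothetical third-degree target, a fixed grid of inputs, and Gaussian noise, and it measures squared bias and variance on that same grid for polynomial degrees 1 through 6.

```python
import numpy as np

rng = np.random.default_rng(5)
noise_sd, n_trials = 0.2, 500
degrees = range(1, 7)
x = np.linspace(-1, 1, 30)             # fixed inputs, reused in every trial

def f(x):
    return x ** 3 - 0.5 * x ** 2       # hypothetical third-degree target

# preds[d] holds one row of fitted values per simulated training set
preds = {d: np.empty((n_trials, x.size)) for d in degrees}
for t in range(n_trials):
    y = f(x) + rng.normal(0, noise_sd, size=x.size)
    for d in degrees:
        preds[d][t] = np.polyval(np.polyfit(x, y, deg=d), x)

results = {}
for d in degrees:
    bias_sq = float(np.mean((preds[d].mean(axis=0) - f(x)) ** 2))
    variance = float(np.mean(preds[d].var(axis=0)))
    results[d] = (bias_sq, variance)
    print(f"degree {d}: bias^2={bias_sq:.5f}  variance={variance:.5f}")

# The noise term is the same for every degree, so it does not affect the ranking.
best_degree = min(results, key=lambda d: sum(results[d]))
print("best degree:", best_degree)
```

Under these assumptions the squared bias collapses once the degree reaches three, while the variance keeps growing with each added degree, so the sum is smallest at degree three.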