# Bias Variance Tradeoff


Suppose that we have a training set consisting of points $x_{1},\dots ,x_{n}$ and a real value $y_{i}$ associated with each point $x_{i}$. We assume that the values are generated by a function with additive noise, ${\displaystyle y=f(x)+\varepsilon }$, where the noise $\varepsilon$ has zero mean and variance $\sigma ^{2}$.
<br>
We want to find a function ${\displaystyle {\hat {f}}(x;D)}$ that approximates the true function $f(x)$ as well as possible, by means of some learning algorithm based on a training dataset (sample) ${\displaystyle D=\{(x_{1},y_{1}),\dots ,(x_{n},y_{n})\}}$. We make "as well as possible" precise by measuring the mean squared error between $y$ and ${\displaystyle {\hat {f}}(x;D)}$: we want ${\displaystyle (y-{\hat {f}}(x;D))^{2}}$ to be minimal, both for $x_{1},\dots ,x_{n}$ and for points outside of our sample. Of course, we cannot hope to do so perfectly, since the $y_{i}$ contain noise $\varepsilon$; this means we must be prepared to accept an irreducible error in any function we come up with.
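
As a concrete illustration of this setup, the sketch below draws one training set and fits one estimator. Everything specific in it is an assumption made for the example and does not come from the text above: the true function `true_f` is taken to be $\sin(2\pi x)$, the noise is Gaussian with $\sigma = 0.3$, and the "learning algorithm" is a least-squares cubic polynomial fit.

```python
import numpy as np

def true_f(x):
    # Hypothetical ground-truth function f, chosen only for illustration.
    return np.sin(2 * np.pi * x)

def sample_dataset(n, sigma, rng):
    # Draw one dataset {(x_i, y_i)} with y_i = f(x_i) + eps_i.
    x = rng.uniform(0.0, 1.0, size=n)
    eps = rng.normal(0.0, sigma, size=n)
    return x, true_f(x) + eps

rng = np.random.default_rng(0)
sigma = 0.3  # assumed noise standard deviation
x_train, y_train = sample_dataset(30, sigma, rng)

# Learn f_hat(x; D): here a degree-3 polynomial fitted by least squares.
f_hat = np.poly1d(np.polyfit(x_train, y_train, deg=3))

# Mean squared error on the sample and on fresh points outside it.
x_new, y_new = sample_dataset(1000, sigma, rng)
train_mse = np.mean((y_train - f_hat(x_train)) ** 2)
test_mse = np.mean((y_new - f_hat(x_new)) ** 2)
print(f"train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

Because the fresh targets `y_new` contain noise of variance $\sigma^{2}$, the error on new points cannot be expected to fall below roughly $\sigma^{2}$, however the model is chosen.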


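To derive the decomposition of this error, abbreviate $f=f(x)$ and ${\hat {f}}={\hat {f}}(x;D)$. Recall that, for any random variable $X$,

${\displaystyle \operatorname {Var} [X]=\operatorname {E} [X^{2}]-{\Big (}\operatorname {E} [X]{\Big )}^{2}.}$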
<br>
Rearranging, we get:

${\displaystyle \operatorname {E} [X^{2}]=\operatorname {Var} [X]+{\Big (}\operatorname {E} [X]{\Big )}^{2}.}$ 
<br>
Since $f$ is deterministic, i.e. independent of $D$,
${\displaystyle \operatorname {E} [f]=f.}$ 

Thus, given ${\displaystyle y=f+\varepsilon }$ and ${\displaystyle \operatorname {E} [\varepsilon ]=0}$, it follows that ${\displaystyle \operatorname {E} [y]=\operatorname {E} [f+\varepsilon ]=\operatorname {E} [f]+\operatorname {E} [\varepsilon ]=f.}$

Also, since ${\displaystyle \operatorname {Var} [\varepsilon ]=\sigma ^{2},}$

${\displaystyle \operatorname {Var} [y]=\operatorname {E} [(y-\operatorname {E} [y])^{2}]=\operatorname {E} [(y-f)^{2}]=\operatorname {E} [(f+\varepsilon -f)^{2}]=\operatorname {E} [\varepsilon ^{2}]=\operatorname {Var} [\varepsilon ]+{\Big (}\operatorname {E} [\varepsilon ]{\Big )}^{2}=\sigma ^{2}}$
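
These identities can be checked numerically at a single point $x$. The sketch below is only a sanity check under assumed values: `f_x = 2.0` stands in for $f(x)$, the noise is Gaussian, and $\sigma = 0.5$; none of these numbers come from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5  # assumed noise standard deviation
f_x = 2.0    # assumed value of f at one fixed point x

# Many independent draws of y = f(x) + eps at the same x.
eps = rng.normal(0.0, sigma, size=200_000)
y = f_x + eps

mean_y = y.mean()  # estimate of E[y], should be close to f(x)
var_y = y.var()    # estimate of Var[y], should be close to sigma**2

# E[eps^2] = Var[eps] + (E[eps])^2, checked on the sample moments.
second_moment = np.mean(eps ** 2)
identity_rhs = eps.var() + eps.mean() ** 2

print(mean_y, var_y, second_moment, identity_rhs)
```

With the default population variance (`ddof=0`), the last identity holds for the sample moments up to floating-point rounding, while the first two hold only approximately and tighten as the number of draws grows.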

Thus, since $\varepsilon$  and ${\hat {f}}$ are independent, we can write

![title](../Images/bias-var.svg?sanitize=true)


![title](../Images/bias-var-1.svg?sanitize=true)


where

${\displaystyle \operatorname {Bias} _{D}{\big [}{\hat {f}}(x;D){\big ]}=\operatorname {E} _{D}{\big [}{\hat {f}}(x;D){\big ]}-f(x)}$
and

${\displaystyle \operatorname {Var} _{D}{\big [}{\hat {f}}(x;D){\big ]}=\operatorname {E} _{D}[{\hat {f}}(x;D)^{2}]-\operatorname {E} _{D}[{\hat {f}}(x;D)]^{2}.}$
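
For reference, the steps behind the decomposition can be sketched as follows, writing $f=f(x)$ and ${\hat {f}}={\hat {f}}(x;D)$ and taking expectations over both $D$ and $\varepsilon$. This is the standard textbook argument, reconstructed from the identities used in this section. The cross term vanishes because $\varepsilon$ is independent of ${\hat {f}}$ and has zero mean.

$$
\begin{aligned}
\operatorname{E}\big[(y-\hat f)^2\big]
&= \operatorname{E}\big[(f+\varepsilon-\hat f)^2\big] \\
&= \operatorname{E}\big[(f-\hat f)^2\big] + 2\operatorname{E}[\varepsilon]\operatorname{E}\big[f-\hat f\big] + \operatorname{E}[\varepsilon^2] \\
&= \operatorname{E}\big[(f-\hat f)^2\big] + \sigma^2 \\
&= \operatorname{Var}\big[f-\hat f\big] + \Big(\operatorname{E}\big[f-\hat f\big]\Big)^2 + \sigma^2 \\
&= \operatorname{Var}\big[\hat f\big] + \Big(\operatorname{E}\big[\hat f\big]-f\Big)^2 + \sigma^2 \\
&= \operatorname{Var}_D\big[\hat f(x;D)\big] + \operatorname{Bias}_D\big[\hat f(x;D)\big]^2 + \sigma^2 .
\end{aligned}
$$

The fourth line applies $\operatorname{E}[X^{2}]=\operatorname{Var}[X]+(\operatorname{E}[X])^{2}$ with $X=f-{\hat {f}}$, and the fifth uses the fact that $f$ is deterministic, so subtracting it from ${\hat {f}}$ does not change the variance.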

#### Irreducible error

Irreducible error is the part of the error that cannot be reduced by building a better model. It measures the amount of noise in the data: it is the $\sigma ^{2}$ term, the variance of $\varepsilon$. However good we make the model, the data will still contain this noise, and it cannot be removed.
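
The three error components can be estimated by simulation: draw many training sets, refit the model on each, and compare the spread and the average of the predictions with the truth. The sketch below does this under assumptions that are not part of the text: the true function is $\sin(2\pi x)$, the inputs are a fixed grid (only the noise is resampled between datasets), the noise is Gaussian with $\sigma = 0.3$, and the model is a cubic polynomial.

```python
import numpy as np

def true_f(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return np.sin(2 * np.pi * x)

rng = np.random.default_rng(42)
sigma, n, degree, n_datasets = 0.3, 30, 3, 2000
x_train = np.linspace(0.0, 1.0, n)  # assumed fixed design
x_test = np.linspace(0.1, 0.9, 9)   # fixed evaluation points

preds = np.empty((n_datasets, x_test.size))
sq_err = np.empty((n_datasets, x_test.size))
for d in range(n_datasets):
    # New dataset D: same inputs, fresh noise.
    y_train = true_f(x_train) + rng.normal(0.0, sigma, size=n)
    preds[d] = np.polyval(np.polyfit(x_train, y_train, degree), x_test)
    # Fresh noisy targets at the evaluation points.
    y_test = true_f(x_test) + rng.normal(0.0, sigma, size=x_test.size)
    sq_err[d] = (y_test - preds[d]) ** 2

bias_sq = (preds.mean(axis=0) - true_f(x_test)) ** 2  # Bias_D[f_hat]^2 per point
variance = preds.var(axis=0)                          # Var_D[f_hat] per point
expected_err = sq_err.mean(axis=0)                    # estimate of E[(y - f_hat)^2]
decomposition = bias_sq + variance + sigma ** 2

print("mean expected error  :", expected_err.mean())
print("mean bias^2+var+noise:", decomposition.mean())
```

The two printed numbers should agree up to Monte Carlo error, and neither can drop below $\sigma^{2}$, which is the irreducible part.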

#### Bias and variance using bulls-eye diagram


![title](../Images/bias_var_diag.png)


![title](../Images/high_var_high_bias.png)



To build a good model, we need to balance bias and variance so that the total error is minimized.
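
One way to see the balance is to vary model complexity and estimate both terms at each setting. The sketch below uses polynomial degree as the complexity knob, which is a choice made for this example. The function `bias_variance`, the true function $\sin(2\pi x)$, the fixed input grid, and $\sigma = 0.3$ are likewise illustrative assumptions.

```python
import numpy as np

def true_f(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return np.sin(2 * np.pi * x)

def bias_variance(degree, rng, sigma=0.3, n=30, n_datasets=1000):
    # Estimate average squared bias and variance of a polynomial fit
    # over many noise draws on a fixed input grid.
    x_train = np.linspace(0.0, 1.0, n)
    x_test = np.linspace(0.1, 0.9, 9)
    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):
        y_train = true_f(x_train) + rng.normal(0.0, sigma, size=n)
        preds[d] = np.polyval(np.polyfit(x_train, y_train, degree), x_test)
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

rng = np.random.default_rng(7)
sigma = 0.3
results = {}
for degree in (1, 3, 9):
    b, v = bias_variance(degree, rng, sigma=sigma)
    results[degree] = (b, v, b + v + sigma ** 2)  # (bias^2, variance, total)
    print(f"degree {degree}: bias^2={b:.4f} variance={v:.4f} total={b + v + sigma ** 2:.4f}")
```

In this setup the straight line underfits (large bias, small variance), while the degree-9 polynomial has almost no bias but a larger variance. The total error is the sum of both plus the fixed noise term, so lowering one term alone is not enough.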


![title](../Images/tradeoff.png)


#### References
* https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
* https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229
* https://towardsdatascience.com/bias-and-variance-in-linear-models-e772546e0c30