# Bias-variance tradeoff

Recall that our predictor is determined by solving the empirical risk minimization problem:

$$ \hat{f} = \underset{f}{\text{argmin}}\; \frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i)$$

However, what we ultimately aim for is to minimize the expected risk:

$$R(f) = \mathbb{E}[\ell(f(X), Y)]$$

It's possible to successfully minimize the empirical risk and still have a large expected risk. In such cases, we aren't truly "learning." Therefore, it's crucial to evaluate the generalization error of our predictor, denoted $R(\hat{f})$. For regression problems with the square loss, there is a convenient decomposition of the generalization error that helps us understand how it behaves as the model complexity changes.
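To make the gap between empirical and expected risk concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the data model $Y=\sin(2\pi X)+\epsilon$, the polynomial model class, and the helper names `sample` and `risks_for_degree`. The expected risk is only approximated, by averaging the loss over a large fresh sample.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def sample(n, rng, sigma=0.3):
    """Draw n pairs (X, Y) from a made-up model: Y = sin(2*pi*X) + noise."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=n)
    return x, y


def risks_for_degree(degree, n_train=15, n_test=100_000, seed=0):
    """Return (empirical risk, Monte Carlo estimate of the expected risk)."""
    rng = np.random.default_rng(seed)
    x_train, y_train = sample(n_train, rng)
    x_test, y_test = sample(n_test, rng)

    # Least squares over polynomials of a given degree is empirical risk
    # minimization under the square loss.
    coeffs = P.polyfit(x_train, y_train, degree)

    empirical = np.mean((P.polyval(x_train, coeffs) - y_train) ** 2)
    expected = np.mean((P.polyval(x_test, coeffs) - y_test) ** 2)
    return empirical, expected


for degree in (1, 3, 9):
    empirical, expected = risks_for_degree(degree)
    print(f"degree={degree}: empirical={empirical:.3f}, expected~{expected:.3f}")
```

Because the same seed produces the same training sample for every degree, the empirical risk can only go down as the degree grows, while the estimated expected risk need not follow it.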

## Bias-variance decomposition

For the rest of this section, we assume that we are working with regression problems. In other words, we assume that the loss function is the square loss. Recall that in the previous section, we showed that the Bayes optimal predictor for regression problems is given by the conditional expectation

$$f^*(x)=\mathbb{E}[Y|X=x]$$

In this section, we further assume that

$$Y=f^*(X)+\epsilon$$

where $\epsilon$ is a random noise term, independent of $X$ and of the training sample, with $\mathbb{E}[\epsilon]=0, \text{Var}(\epsilon)=\sigma^2$. Let $\hat{f}$ denote the predictor obtained from the training sample. The generalization error is given by

$$
\begin{align*}
R(\hat{f}) &= \mathbb{E}[(\hat{f}(X)-Y)^2]\\
&= \mathbb{E}[(\hat{f}(X)-f^*(X)-\epsilon)^2]\\
&= \mathbb{E}[(\hat{f}(X)-f^*(X))^2]-2\mathbb{E}[\epsilon(\hat{f}(X)-f^*(X))] + \mathbb{E}[\epsilon^2]\\
&= \mathbb{E}[(\hat{f}(X)-f^*(X))^2] - 2\mathbb{E}[\epsilon]\mathbb{E}[\hat{f}(X)-f^*(X)] + \mathbb{E}[\epsilon^2]\\
&= \mathbb{E}[(\hat{f}(X)-f^*(X))^2] + \sigma^2
\end{align*}
$$

Next, since $\hat{f}$ depends on the observations, it is a random quantity. Therefore, by adding and subtracting its expectation $\mathbb{E}[\hat{f}(X)]$, taken over training samples with the test point $X$ held fixed, we have

$$
\begin{align*}
R(\hat{f}) &= \mathbb{E}[(\hat{f}(X)-f^*(X))^2] + \sigma^2\\
&= \mathbb{E}[(\hat{f}(X)-\mathbb{E}[\hat{f}(X)]+\mathbb{E}[\hat{f}(X)]-f^*(X))^2] + \sigma^2\\
&= \mathbb{E}[(\hat{f}(X)-\mathbb{E}[\hat{f}(X)])^2] + \mathbb{E}[(\mathbb{E}[\hat{f}(X)]-f^*(X))^2] + \sigma^2\\
&= \text{Var}(\hat{f}(X)) + (\text{Bias}(\hat{f}(X)))^2 + \sigma^2
\end{align*}
$$

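In passing from the second line to the third, we dropped the cross term $2\,\mathbb{E}[(\hat{f}(X)-\mathbb{E}[\hat{f}(X)])(\mathbb{E}[\hat{f}(X)]-f^*(X))]$, which evaluates to $0$: for a fixed $X$, the second factor is deterministic and the first factor has mean zero over training samples. This suggests that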

$$\text{Generalization error} = \text{Variance} + \text{Bias}^2 + \text{Irreducible noise}$$

Since the last term is irreducible, to minimize the generalization error we want our predictor to have
1. Low bias: on average, the prediction $\hat{f}(X)$ is close to the true value $f^*(X)$.
2. Low variance: if we train the model on different samples, we get similar predictions.
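The decomposition can be checked numerically. The following is a minimal sketch under assumptions that are not part of the text above: a made-up regression function $f^*(x)=\sin(2\pi x)$, Gaussian noise, polynomial least squares as the learner, and a fixed grid of test inputs standing in for draws of $X$. The function names `f_star` and `decompose` are hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def f_star(x):
    """Assumed regression function, chosen only for illustration."""
    return np.sin(2 * np.pi * x)


def decompose(degree, n_train=30, n_repeats=500, sigma=0.3, seed=0):
    """Estimate variance, squared bias, noise, and risk by simulation."""
    rng = np.random.default_rng(seed)
    x_grid = np.linspace(0.0, 1.0, 200)  # test inputs
    preds = np.empty((n_repeats, x_grid.size))

    # Train on many independent samples and record each fitted predictor.
    for r in range(n_repeats):
        x = rng.uniform(0.0, 1.0, size=n_train)
        y = f_star(x) + rng.normal(0.0, sigma, size=n_train)
        preds[r] = P.polyval(x_grid, P.polyfit(x, y, degree))

    mean_pred = preds.mean(axis=0)            # estimate of E[f_hat(x)]
    variance = preds.var(axis=0).mean()       # variance, averaged over x
    bias_sq = ((mean_pred - f_star(x_grid)) ** 2).mean()
    noise = sigma ** 2

    # Direct estimate of the risk, using fresh noisy labels at the test inputs.
    y_test = f_star(x_grid) + rng.normal(0.0, sigma, size=preds.shape)
    risk = ((preds - y_test) ** 2).mean()
    return variance, bias_sq, noise, risk


variance, bias_sq, noise, risk = decompose(degree=3)
print(f"variance + bias^2 + noise = {variance + bias_sq + noise:.4f}")
print(f"directly estimated risk   = {risk:.4f}")
```

The two printed numbers should agree up to Monte Carlo error, since the only difference between them is the sampling noise in the fresh test labels.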

However, in practice it is hard to achieve low bias and low variance at the same time. More often, the generalization error curve takes a U-shape as a function of model complexity (Figure 1).

<img src="images/tradeoff.png" style="width: 30%; height: auto;">

We observe that

* Low model complexity $\implies$ high bias and low variance
* High model complexity $\implies$ low bias and high variance
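These two observations can be reproduced in a small simulation. As a hedged sketch, polynomial degree is used here as a stand-in for model complexity, and the data model $Y=\sin(2\pi X)+\epsilon$ and the helper name `bias_variance` are assumptions made only for this example.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def bias_variance(degree, n_train=25, n_repeats=300, sigma=0.3, seed=0):
    """Estimate (squared bias, variance) of a polynomial fit of given degree."""
    rng = np.random.default_rng(seed)
    x_grid = np.linspace(0.0, 1.0, 200)
    truth = np.sin(2 * np.pi * x_grid)
    preds = np.empty((n_repeats, x_grid.size))

    # Refit the model on many independent training samples.
    for r in range(n_repeats):
        x = rng.uniform(0.0, 1.0, size=n_train)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=n_train)
        preds[r] = P.polyval(x_grid, P.polyfit(x, y, degree))

    bias_sq = np.mean((preds.mean(axis=0) - truth) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance


for degree in (1, 3, 5, 9):
    bias_sq, variance = bias_variance(degree)
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

In this setup the low-degree fits are dominated by squared bias and the high-degree fits by variance. The exact numbers depend on the assumed sample size and noise level.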

Therefore, we want a model complexity that balances the bias and the variance of our predictor. In practice, this is done by evaluating the model error on a held-out validation dataset and choosing the model with the smallest validation error.
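A minimal sketch of validation-based selection follows. It assumes a single train/validation split, polynomial degree as the complexity knob, and the same made-up data model $Y=\sin(2\pi X)+\epsilon$. The helper name `select_degree` is hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def select_degree(degrees, n_train=100, n_val=100, sigma=0.3, seed=0):
    """Pick the polynomial degree with the smallest validation error."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n_train + n_val)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=x.size)

    # Hold out the last n_val points; they are never used for fitting.
    x_train, y_train = x[:n_train], y[:n_train]
    x_val, y_val = x[n_train:], y[n_train:]

    val_errors = {}
    for degree in degrees:
        coeffs = P.polyfit(x_train, y_train, degree)
        residuals = P.polyval(x_val, coeffs) - y_val
        val_errors[degree] = float(np.mean(residuals ** 2))

    best = min(val_errors, key=val_errors.get)
    return best, val_errors


best, val_errors = select_degree(range(0, 11))
print(f"selected degree: {best}")
```

The validation error serves as an estimate of the generalization error for each candidate, so the selected degree is the one that best trades off bias against variance on this particular split.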