# 7-2 Bias-Variance Tradeoff

The bias–variance tradeoff is frequently mentioned in statistics, econometrics, and machine learning. Although the name is the same, the tradeoff between the bias and the variance of an estimator has different mathematical implications in different contexts.

## 7.1 Bias-Variance Tradeoff in Econometrics

### Omitted Variable Bias


Suppose the true data generating process is:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 +u$$

with $u \sim N(0,\sigma^2)$.

Our question is: what if we do not have the variable $x_2$ in the data? In other words, what if we specify the wrong model, $y = \beta_0 + \beta_1 x_1 + u$?


In that case, we will run OLS with an erroneous specification:

$$\min_{\beta_0,\beta_1} \sum_i (y_i - \beta_0 - \beta_1 x_{1i})^2$$

The solution is:

$$\hat{\beta_0} = \bar{y}-\hat{\beta_1}\bar{x_1}$$
$$\hat{\beta_1} = \frac{\sum(y_i-\bar{y})(x_{1i}-\bar{x_1})}{\sum(x_{1i}-\bar{x_1})^2}$$

### Unbiasedness

One important goal of statistical inference is to obtain an unbiased point estimator, i.e. one whose expected value across repeated samples (as the unobservables change) equals the population parameter. In mathematical terms, this means

$$E_u[\hat{\beta}] = \beta$$

Here the expectation $E_u$ is taken over $u$, holding the regressors fixed. You can also treat $x_1$ and $x_2$ as random variables, but that will make the calculation more difficult.

Assume we are interested in estimating the slope parameter $\beta_1$. Let's see if $\hat{\beta_1}$ is an unbiased estimator.

$$
\begin{aligned}
    E_u[\hat{\beta_1}] &= \frac{1}{\sum (x_{1i}-\bar{x_1})^2}E_u[\sum y_i(x_{1i}-\bar{x_1})] - 0 \\
    &=\frac{1}{\sum (x_{1i}-\bar{x_1})^2}E_u[\sum (\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i)(x_{1i}-\bar{x_1})] \\
    &=\frac{1}{\sum (x_{1i}-\bar{x_1})^2}E_u[\sum (\beta_1 x_{1i} + \beta_2 x_{2i})(x_{1i}-\bar{x_1})] \\
    &=\frac{\beta_1 E_u[\sum x_{1i}(x_{1i}-\bar{x_1})]}{\sum (x_{1i}-\bar{x_1})^2} + \frac{E_u[\sum \beta_2 x_{2i}(x_{1i}-\bar{x_1})]}{\sum (x_{1i}-\bar{x_1})^2} \\
    &=\beta_1+\frac{\beta_2  \sum x_{2i}(x_{1i}-\bar{x_1})}{\sum (x_{1i}-\bar{x_1})^2}
\end{aligned}
$$

$\hat{\beta_1}$ is biased whenever $\beta_2 \neq 0$ and the sample covariance of $x_1$ and $x_2$ is nonzero. (If you treat $x_1$ and $x_2$ as random variables, the corresponding condition is $Cov(x_1,x_2)\neq 0$.)

Hence, we can conclude that if a relevant variable is omitted from the model specification, and the omitted variable is correlated with the variable of interest, the OLS estimator of that variable's coefficient will be biased.
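The bias formula derived above can be checked with a small Monte Carlo experiment. The sketch below is only an illustration: the sample size, the coefficient values, and the way $x_2$ is made to depend on $x_1$ are all assumed for the example and do not come from the text. The regressors are drawn once and held fixed, and only $u$ is redrawn, which mirrors the expectation $E_u[\cdot]$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values (assumptions for this sketch, not from the text)
n, reps = 200, 5000
beta0, beta1, beta2, sigma = 1.0, 2.0, 3.0, 1.0

# Regressors are drawn once and then held fixed across replications
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)      # x2 is correlated with x1

d1 = x1 - x1.mean()                     # demeaned x1
# Bias predicted by the derivation: beta2 * sum(x2 * d1) / sum(d1^2)
predicted_bias = beta2 * np.sum(x2 * d1) / np.sum(d1 ** 2)

# Redraw only the error term u in each replication; y has shape (reps, n)
u = rng.normal(scale=sigma, size=(reps, n))
y = beta0 + beta1 * x1 + beta2 * x2 + u

# Slope of the misspecified regression of y on x1, one value per replication
b1_short = (y - y.mean(axis=1, keepdims=True)) @ d1 / np.sum(d1 ** 2)

simulated_bias = b1_short.mean() - beta1
print(f"predicted bias: {predicted_bias:.3f}")
print(f"simulated bias: {simulated_bias:.3f}")
```

If the derivation holds, the two printed numbers should be close, and both should shrink toward zero if $x_2$ is generated independently of $x_1$.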

### Inflated Variance

One apparent solution to the omitted variable bias problem is to collect data on $x_2$ and include it in the model specification. This will make $\hat{\beta_1}$ unbiased.

But this remedy assumes that you already know the true data generating process, which is unlikely in real life. In real-life applications, our starting point is a dataset, and we can only "guess" how GOD designed the mechanism.

Alternatively, we can add as many variables as possible to the specification. This method acts as a safety net and can be shown to yield unbiased estimators as long as the error term satisfies the Gauss-Markov zero conditional mean assumption, $E[u|x]=0$.

However, there is a **disadvantage** with this alternative method: adding more variables will inflate the variance of the estimator. To see this, let's consider the omitted variable example we used in the previous section.

Using the misspecified model $y=\beta_0 + \beta_1 x_1 + u$, the variance of the estimator $\hat{\beta_1}$ is given by

$$Var_u(\hat{\beta_1}) = \frac{\sigma^2}{n Var(x_1)}$$

After adding $x_2$ to the model and rerunning OLS, the variance of $\hat{\beta_1}^{more}$ becomes
$$Var_u(\hat{\beta_1}^{more}) = \frac{\sigma^2}{n Var(x_1)}\frac{1}{1-R_1^2}$$

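where $R_1^2$ is the R-squared from regressing $x_1$ on $x_2$, and $Var(x_1)$ denotes the sample variance $\frac{1}{n}\sum_i (x_{1i}-\bar{x_1})^2$.

If $x_1$ and $x_2$ are correlated in the sample, then $R_1^2>0$ and $\frac{1}{1-R_1^2}$ is greater than 1, which "inflates" the first term in the variance. This leads to $Var(\hat{\beta_1}^{more})>Var(\hat{\beta_1}^{less})$, where $\hat{\beta_1}^{less}$ is the estimator from the model without $x_2$.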

> **Conclusion** \
Adding nonorthogonal (i.e. $Cov(x_{add},x_{rest})\neq 0$) variables to the model will increase the variance of the old estimators.
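
The two variance formulas can also be compared against simulated variances. This is a minimal sketch under assumed values (sample size, coefficients, and the strength of the correlation between $x_1$ and $x_2$ are all made up for illustration). As in the derivation, the regressors are held fixed and only $u$ is redrawn.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative values (assumptions for this sketch, not from the text)
n, reps = 200, 5000
beta0, beta1, beta2, sigma = 1.0, 2.0, 3.0, 1.0

# Fixed regressors; x2 is strongly correlated with x1
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)

sst1 = np.sum((x1 - x1.mean()) ** 2)        # equals n * Var(x1)
r2_1 = np.corrcoef(x1, x2)[0, 1] ** 2       # R^2 from regressing x1 on x2

# Theoretical variances from the two formulas
var_short_theory = sigma ** 2 / sst1
var_long_theory = var_short_theory / (1.0 - r2_1)

X_short = np.column_stack([np.ones(n), x1])
X_long = np.column_stack([np.ones(n), x1, x2])

# Each column of Y is one simulated sample of y
u = rng.normal(scale=sigma, size=(n, reps))
Y = (beta0 + beta1 * x1 + beta2 * x2)[:, None] + u

# Row 1 of the coefficient matrix holds the slope on x1 for every sample
b1_short = np.linalg.lstsq(X_short, Y, rcond=None)[0][1]
b1_long = np.linalg.lstsq(X_long, Y, rcond=None)[0][1]

print(f"short model: theory {var_short_theory:.5f}, simulated {b1_short.var():.5f}")
print(f"long model:  theory {var_long_theory:.5f}, simulated {b1_long.var():.5f}")
print(f"inflation factor 1/(1-R1^2): {1.0 / (1.0 - r2_1):.2f}")
```

Note that the short-model estimator has the smaller variance even though it is biased, which is exactly the tension this section describes.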

## 7.2 Bias-Variance Tradeoff in Machine learning

Machine learning focuses on the predicted value instead of individual coefficients. In this regard, let's focus on the bias and variance of the predicted value $\hat{y}$.

Since the goal of machine learning is to make the "best" predictions, we first need to set up a numerical "measure" to evaluate how "good" a prediction is. Such a measure is known as the **loss (or objective) function**.

A loss function closely related to OLS is the mean squared error loss.

$$MSE = E[(y - \hat{y})^2]$$

To obtain $\hat{y}$, the statistical convention is to minimize a sample version of the loss function, i.e.

$$\min_{\beta_0,\beta_1,\ldots,\beta_k} \frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$$

> Intuition: to minimize the loss function in the population, we should first minimize a similar loss function using the sample (the dataset we have).

Note that when $\hat{y}$ is linear in the parameters, this minimization problem is exactly the same as an OLS regression. Hence the estimators and predictions will also be the same as those of OLS.

### Bias-Variance Decomposition of MSE

People soon became unsatisfied with the OLS type of estimation, and started seeking other ways to further minimize the population loss function - MSE.

> NOTE: Econometricians mainly care about the coefficients, so there is little need to pursue a better prediction (smaller loss) once OLS already delivers an unbiased and efficient estimator.

To achieve this goal, let's first gain a better understanding of the components of MSE. A decomposition can help us target the minimization effort at a specific part of the loss function. Write the outcome as $y = f + u$, where $f$ is the true (noise-free) regression function evaluated at the point of interest, e.g. $f = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ in section 7.1.

$$
\begin{aligned}
    MSE &\equiv E_u[(y - \hat{y})^2]\\
        &= E[(f+u-\hat{y})^2] \\
        &= E[(f+u-\hat{y}+E[\hat{y}]-E[\hat{y}])^2] \\
        &= E[[(f-E[\hat{y}]) - (\hat{y}-E[\hat{y}])+u]^2] \\
        &= E[(f-E[\hat{y}])^2 + (\hat{y}-E[\hat{y}])^2 + u^2 \\
        &\qquad - 2(f-E[\hat{y}])(\hat{y}-E[\hat{y}]) + 2(f-E[\hat{y}])u - 2 (\hat{y}-E[\hat{y}])u]
\end{aligned}
$$

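By assumption $E[u] = 0$ and $E[u^2]=\sigma^2$. The three cross terms vanish in expectation: the first because $f-E[\hat{y}]$ is a constant and $E[\hat{y}-E[\hat{y}]]=0$, the second because $E[u]=0$, and the third provided the noise $u$ of the observation being predicted is independent of $\hat{y}$. Hence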

$$
\begin{aligned}
MSE &= E[(f-E[\hat{y}])^2] + E[(\hat{y}-E[\hat{y}])^2] + \sigma^2 \\
& = (f-E[\hat{y}])^2 + Var(\hat{y}) + \sigma^2 \\
&\equiv Bias(\hat{y})^2 + Var(\hat{y}) + \sigma^2
\end{aligned}
$$
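
The decomposition can be verified numerically at a single prediction point. The sketch below is an illustration under assumptions that are not in the text: the true function `f` is taken to be a sine curve, the fitted model is a (deliberately too simple) straight line, and the evaluation point `x0`, the noise level, and the sample sizes are arbitrary. Each replication draws a new training sample and one fresh observation at `x0` whose noise is independent of the training data.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # Assumed true regression function, chosen only for illustration
    return np.sin(2 * np.pi * x)

n, reps, sigma, x0 = 50, 20000, 0.5, 0.9
x = np.linspace(0, 1, n)                       # fixed training inputs
X = np.column_stack([np.ones(n), x])           # linear (misspecified) model

# One training sample per column; fit OLS to every sample at once
Y = f(x)[:, None] + rng.normal(scale=sigma, size=(n, reps))
coef = np.linalg.lstsq(X, Y, rcond=None)[0]    # shape (2, reps)
y_hat = coef[0] + coef[1] * x0                 # prediction at x0, per sample

# A fresh observation at x0, independent of the training noise
y_new = f(x0) + rng.normal(scale=sigma, size=reps)

mse = np.mean((y_new - y_hat) ** 2)
bias_sq = (f(x0) - y_hat.mean()) ** 2
variance = y_hat.var()

print(f"simulated MSE:           {mse:.4f}")
print(f"bias^2 + var + sigma^2:  {bias_sq + variance + sigma ** 2:.4f}")
print(f"  bias^2 = {bias_sq:.4f}, var = {variance:.4f}, sigma^2 = {sigma ** 2:.4f}")
```

The term $\sigma^2$ is the part of the loss that no choice of model can remove; only the first two terms respond to modelling decisions.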

> In statistics, the difference between the true parameter $\beta$ and the expected value of an estimator $E[\hat{\beta}]$ is called the bias of the estimator $\hat{\beta}$.

### Bias-Variance Tradeoff
The Bias-Variance decomposition of the MSE loss shows us that we can lower MSE either by reducing the bias of the prediction or by reducing the variance of the prediction. This is a trivial conclusion, one that can be drawn without math.

<img src="./images/bias-variance3.png" style="width: 50%; margin: auto" alt=""/>

The question is rather: how can we reduce $Bias(\hat{y})$ or $Var(\hat{y})$?

### Solution 1: add new variables or higher-order terms
This idea comes from section 7.1, the omitted variable bias. By introducing possibly omitted variables, we can reduce the bias of the estimated coefficient, which leads to a reduced bias of prediction. To see this, we have

$$Bias(\hat{y}^{less}) = Bias(\hat{\beta_0}+\hat{\beta_1}x_1) \neq 0,$$

in the presence of an omitted variable $x_2$. After adding $x_2$ to the model, 
$$Bias(\hat{y}^{more}) = Bias(\hat{\beta_0}^{more}+\hat{\beta_1}^{more}x_1 + \hat{\beta_2}^{more}x_2) = 0 $$ 
because when using the correct specification, OLS yields unbiased estimators.

But at the same time, $Var(\hat{y})$ will be higher for two reasons: (1) $Var(\hat{y}^{more})$ has one more term inside the parentheses than $Var(\hat{y}^{less})$; (2) the variances of the existing coefficient estimators are higher if $x_2$ is non-orthogonal to the other variables.

Hence we have a tradeoff.

**Graphical display of the Tradeoff** - assume dgp is known

![bias-variance.png](./images/bias-variance1.png)

The image above is a classical representation of the bias-variance tradeoff as higher-order polynomial terms are added to the model. The left panel fits models of different polynomial flexibility to the same dataset:
- Yellow line: $y = \beta_0 + \beta_1 x + u$
- Black curve: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + u$
- Blue curve: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + ...+ \beta_{10} x^{10} + u$

> Note that high-order polynomials are rarely used in econometrics, because the starting point of econometrics should be a known economic theory. Following the theory prevents inflated variance while also guaranteeing an unbiased estimator, as mentioned in section 7.1, and high-order polynomials are rare in economic theories.

The right panel displays the value of MSE (assuming the data generating process is known) as more polynomial terms are added to the model.

Switching from a linear function to a cubic function significantly reduces MSE. This is because the true model contains terms of higher order than $x^3$, so the reduction in bias exceeds the increase in variance.

Switching from a cubic function to a 10th-order function still reduces MSE. But if more terms are added to the model (green dot in the right panel), the increase in variance will dominate the aggregate effect and drive MSE up.

**Graphical display of Tradeoff** - using train and test set.

Although the previous figure is very illustrative, it is not possible to plot when we do not know the d.g.p.

Alternatively, we can plot the sample MSE on a test dataset to mimic the situation in the population. This involves a train/test split at the beginning of the analysis. Models are fitted only to the training data, and the test data are held out for model evaluation.

![bias-var](./images/bias-variance2.png)

In the right panel, the y-axis shows the *sample* mean squared error, $\frac{1}{n}\sum_i (y_i-\hat{y}_i)^2$. The red line displays the sample MSE on the test data as more polynomial terms are added to the model. The curve is U-shaped, and the lowest sample MSE is obtained when the "blue" model (cubic function) is used.

The gray line has nothing to do with the bias-variance tradeoff. It shows the sample MSE on the training data. It decreases with the number of variables simply because you are enlarging the choice set.

> Originally you are choosing $\beta_0$ and $\beta_1$ to minimize the sum of squared errors (SSE). After adding one more variable, you can choose over all possible values of $\beta_0$, $\beta_1$, and $\beta_2$. By choosing $\hat{\beta_2}=0$, you can always reproduce the fit of the smaller model, so the SSE of the larger model can be no higher.
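
The train/test comparison described above can be sketched as follows. Everything specific here is an assumption made for illustration: the sine-shaped true function, the noise level, the sample sizes, and the range of polynomial degrees. `Polynomial.fit` from NumPy is used because it rescales the inputs internally, which keeps high-degree fits numerically stable.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)

def f(x):
    # Assumed true regression function, chosen only for illustration
    return np.sin(2 * np.pi * x)

sigma = 0.3
x_train = rng.uniform(0, 1, size=40)
y_train = f(x_train) + rng.normal(scale=sigma, size=40)
x_test = rng.uniform(0, 1, size=1000)
y_test = f(x_test) + rng.normal(scale=sigma, size=1000)

degrees = list(range(1, 11))
train_mse, test_mse = [], []
for deg in degrees:
    # Least-squares polynomial fit using the training data only
    p = Polynomial.fit(x_train, y_train, deg)
    train_mse.append(np.mean((y_train - p(x_train)) ** 2))
    test_mse.append(np.mean((y_test - p(x_test)) ** 2))

for deg, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree {deg:2d}: train MSE {tr:.4f}, test MSE {te:.4f}")
```

Because the models are nested, the training MSE can only fall as the degree rises, while the test MSE is free to rise again once the extra variance outweighs the reduction in bias. Where the test MSE bottoms out depends on the simulated data, so the exact degree should not be read as a general result.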

### Solution 2: use other regression methods

Another way to reduce MSE is to deliberately introduce bias in the hope that the variance falls by more than the squared bias rises. Ridge regression and the LASSO are examples. Such regularization methods introduce bias into the regression solution but can reduce variance considerably relative to the ordinary least squares (OLS) solution. Although OLS provides unbiased regression estimates, the lower-variance solutions produced by regularization can deliver a lower MSE.
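
As a rough illustration of this idea, the sketch below compares OLS with ridge regression using the closed-form ridge solution $(X'X+\lambda I)^{-1}X'y$. The design is an assumption chosen to make the point visible: two nearly collinear regressors, no intercept, and an arbitrary penalty `lam = 5.0` that is not tuned in any way. The comparison is in terms of the coefficient estimates, summing squared bias and variance over the two coefficients.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative values (assumptions for this sketch, not from the text)
n, reps, sigma, lam = 50, 5000, 1.0, 5.0
beta = np.array([1.0, 1.0])

# Two nearly collinear regressors, held fixed across replications
z = rng.normal(size=n)
X = np.column_stack([z + 0.1 * rng.normal(size=n),
                     z + 0.1 * rng.normal(size=n)])

# Each column of Y is one simulated sample (no intercept, for simplicity)
Y = (X @ beta)[:, None] + rng.normal(scale=sigma, size=(n, reps))

XtX = X.T @ X
b_ols = np.linalg.solve(XtX, X.T @ Y)                      # shape (2, reps)
b_ridge = np.linalg.solve(XtX + lam * np.eye(2), X.T @ Y)  # closed-form ridge

def summarize(b):
    """Return (squared bias, variance, their sum), summed over coefficients."""
    bias_sq = np.sum((b.mean(axis=1) - beta) ** 2)
    var = np.sum(b.var(axis=1))
    return bias_sq, var, bias_sq + var

ols_bias_sq, ols_var, ols_mse = summarize(b_ols)
ridge_bias_sq, ridge_var, ridge_mse = summarize(b_ridge)

print(f"OLS:   bias^2 {ols_bias_sq:.4f}, var {ols_var:.4f}, total {ols_mse:.4f}")
print(f"Ridge: bias^2 {ridge_bias_sq:.4f}, var {ridge_var:.4f}, total {ridge_mse:.4f}")
```

In this particular design the variance reduction from shrinkage is large because the regressors are nearly collinear. With weakly correlated regressors or a poorly chosen penalty, the added bias can outweigh the gain, which is why the penalty is usually selected with held-out data.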