# Error Decomposition



## Learning Objectives
After going through this notebook, students should be able to:
- Decompose error into bias, variance, and noise.
- Define bias, variance, and noise.
- Identify the cause of a model's poor performance, i.e., bias, variance, or noise.
- Propose a solution to improve the performance of the machine learning model.

## Introduction

A machine learning model trained on a dataset is rarely 100% accurate; it makes some errors. These errors may arise from:  

* The model being too simple or very complex.
* Ambiguity in the data.  

In this chapter, we will break the error of a machine learning model into its components, examine the sources of these components, and look at methods that can help us reduce them.


## Error Decomposition For Regression

Let us discuss some assumptions and notations that will be used in this notebook.

* $\textit{A}$ represents a machine learning algorithm.
* $D = \{(\mathbf{x_1}, y_1),\dots, (\mathbf{x_n}, y_n)\}$ represents a dataset of $n$ instances drawn i.i.d. from some distribution $P(X, Y)$.
* $y$ is the target of a regression problem.
* $h_D$ represents the machine learning model obtained by training the algorithm $A$ on the dataset $D$.
* The data is noisy: for the same feature values $\textbf{x}$, the output/label $y$ may differ.


<div align="center">
 <figure>
 <img src="https://doc.google.com/a/fusemachines.com/uc?export=download&id=1lIzk034tgOWRx0lFj1_ImtKBno-79rmt" width="600" >
 <figcaption>Figure 1: Machine learning algorithm when applied to a dataset results in a machine learning model</figcaption>
 </figure>
</div>

<!-- https://drive.google.com/file/d/1lIzk034tgOWRx0lFj1_ImtKBno-79rmt/view?usp=sharing -->



Let us define a few more terms.


An **expected label** is the label we expect to observe for an input $\bf x$ under the data distribution. It is a property of the data, not of any trained model. It is denoted by $\bar y(\bf x)$ and computed as:
$\bar y(\textbf{x}) = E_{y|\textbf{x}}[Y] = \int_y y \, Pr(y|\textbf{x})\, dy$

**Note**   
Since the random variable involved, in this case the output/label, is continuous, we use integration to compute the expectation. If the output/label were discrete, we would use summation instead:
$\bar y(\textbf{x}) = E_{y|\textbf{x}}[Y] = \sum_y y Pr(y|\textbf{x})$
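The discrete case can be illustrated with a minimal sketch. The function name `expected_label` and the probabilities in `pr_y_given_x` are hypothetical and chosen only for illustration.

```python
def expected_label(label_probs):
    """Expected label for one input x, given Pr(y | x) as a {label: probability} dict."""
    total = sum(label_probs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Discrete expectation: sum over y of y * Pr(y | x)
    return sum(y * p for y, p in label_probs.items())


# Hypothetical conditional distribution Pr(y | x) for a single input x
pr_y_given_x = {1.0: 0.2, 2.0: 0.5, 3.0: 0.3}
y_bar = expected_label(pr_y_given_x)
print(y_bar)
```

The result is a weighted average of the possible labels, so it need not coincide with any single label that actually occurs.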

Our model, $h_D$, is not 100% accurate. To compute its error, we sample data points $(\textbf{x}, y)$ from the original distribution $P$ and compute the squared loss. Mathematically, the **expected test error** of our model $h_D$ is given as:
$E_{(\textbf{x}, y)\sim P}[(h_D(\textbf{x})- y)^2]=\int_x \int_y (h_D(\textbf{x})- y)^2 \, Pr(\textbf{x}, y)\, dy \, d\textbf{x}$
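In practice the integral is rarely available in closed form, so the expected test error is usually approximated by averaging the squared loss over sampled points. The sketch below assumes a made-up distribution $P$ (with $x$ uniform on $[0, 1]$ and $y = 3x$ plus Gaussian noise) and a fixed, hypothetical model `h`. None of these choices come from the text above.

```python
import random


def sample_point(rng):
    """Draw one (x, y) from an assumed distribution P: y = 3x + Gaussian noise."""
    x = rng.uniform(0.0, 1.0)
    y = 3.0 * x + rng.gauss(0.0, 0.5)
    return x, y


def h(x):
    """A fixed, already-trained model (hypothetical)."""
    return 3.0 * x


def estimate_test_error(model, n_samples, rng):
    """Monte Carlo estimate of E[(model(x) - y)^2] over (x, y) ~ P."""
    total = 0.0
    for _ in range(n_samples):
        x, y = sample_point(rng)
        total += (model(x) - y) ** 2
    return total / n_samples


rng = random.Random(0)
test_error = estimate_test_error(h, 100_000, rng)
print(test_error)
```

Because `h` matches the noise-free part of the assumed distribution exactly, the estimate should land close to the noise variance, $0.5^2 = 0.25$.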


Our model, $h_D$, depends on the dataset $D$ sampled from the distribution $P$. Many different datasets can be sampled from $P$, and different datasets result in different models even when the same algorithm, $A$, is used for training. This makes $h_D$ a random variable.
<!-- https://drive.google.com/file/d/12ZjCrQaj96xS8Dpc5bLgicMlUZKOP60C/view?usp=sharing -->

<div align="center">
 <figure>
 <!-- <img src="https://doc.google.com/a/fusemachines.com/uc?export=download&id=12ZjCrQaj96xS8Dpc5bLgicMlUZKOP60C" width="300" > -->
 <img src="https://i.postimg.cc/8CvG0cGf/image.png" width="300">
 <figcaption>Figure 2: Multiple datasets result in multiple models. Small green circles represent the universe/distribution of data points from which the $D_i$'s are sampled.</figcaption>
 </figure>
</div>

An **expected model** can be obtained using the formula,
$\bar h = E_{D \sim P^n}[h_D] = \int_D h_D \, Pr(D)\, dD$

Here,
* $\bar h$ represents the expected model.
* $D \sim P^n$ represents a dataset with $n$ instances drawn from a distribution $P$.
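The expected model can be approximated by training on many sampled datasets and averaging the resulting predictions. The following is a minimal sketch under assumptions of our own: the distribution ($y = 3x$ plus Gaussian noise), the algorithm (a least-squares line fit), and all function names are hypothetical.

```python
import random


def sample_dataset(n, rng):
    """Draw a dataset D of n points from an assumed distribution P."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Algorithm A (assumed): ordinary least-squares fit of a line; returns h_D."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def expected_model_at(x, n_datasets, n_points, rng):
    """Approximate h_bar(x) by averaging h_D(x) over many sampled datasets D."""
    preds = [fit_line(*sample_dataset(n_points, rng))(x) for _ in range(n_datasets)]
    return sum(preds) / len(preds)


rng = random.Random(0)
h_bar_at_half = expected_model_at(0.5, n_datasets=2000, n_points=20, rng=rng)
print(h_bar_at_half)
```

Each individual fit differs from dataset to dataset, but the average prediction at $x = 0.5$ should settle near $3 \times 0.5 = 1.5$ in this setup.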

The expression for **expected test error** provided earlier computes the expected error of a single model $h_D$ trained on one particular dataset $D$ sampled from the distribution $P$ using the algorithm $A$. However, we are interested in the expected test error made by a model trained using algorithm $A$ no matter what dataset is used, as long as it is sampled from the distribution $P$. The expected test error, given an algorithm $A$ and distribution $P$, can be computed as:

$$
E_{D \sim P^n, (\mathbf{x}, y)\sim P} \left[(h_D(\mathbf{x})- y)^2\right]= \int_D\int_\mathbf{x}\int_y (h_D(\mathbf{x})- y)^2\, Pr(\mathbf{x},y)\,Pr(D)\, dy\, d\mathbf{x}\, dD
$$

Now, let us decompose this expected test error into its components.

$$
\begin{align}
E_{\mathbf{x}, y, D}[(h_D(\mathbf{x})- y)^2] &= E_{\mathbf{x}, y, D}\left[\left(h_D(\mathbf{x})- \bar h(\mathbf{x}) + \bar h(\mathbf{x}) - y\right)^2\right] \\
&= E_{\mathbf{x}, D}[(h_D(\mathbf{x})- \bar h(\mathbf{x}))^2] \\
&\quad + 2 E_{\mathbf{x}, y, D} \left[(h_D(\mathbf{x})- \bar h(\mathbf{x})) (\bar h(\mathbf{x}) - y)\right] \\
&\quad + E_{\mathbf{x}, y}[(\bar h(\mathbf{x}) - y)^2] \tag{1}
\end{align}
$$

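The middle term in equation (1) becomes 0. Since $\bar h(\mathbf{x}) - y$ does not depend on $D$, the expectation over $D$ can be applied to the first factor alone, as shown below: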

$$
\begin{align}
2 E_{\mathbf{x}, y, D} \left[(h_D(\mathbf{x})- \bar h(\mathbf{x})) (\bar h(\mathbf{x}) - y)\right] 
&= 2 E_{\mathbf{x}, y}\left[E_D[h_D(\mathbf{x}) - \bar h(\mathbf{x})] (\bar h(\mathbf{x})-y)\right] \\
&= 2 E_{\mathbf{x}, y}\left[(E_D[h_D(\mathbf{x})] - \bar h(\mathbf{x})) (\bar h(\mathbf{x})-y)\right] \\
&= 2 E_{\mathbf{x}, y}\left[(\bar h(\mathbf{x}) - \bar h(\mathbf{x})) (\bar h(\mathbf{x})-y)\right] \\
&= 0
\end{align}
$$

Equation (1) can now be written as:

$$
E_{\mathbf{x}, y, D}[(h_D(\mathbf{x})- y)^2] = E_{\mathbf{x}, D}[(h_D(\mathbf{x})- \bar h(\mathbf{x}))^2] + E_{\mathbf{x}, y}[(\bar h(\mathbf{x}) - y)^2] \tag{2}
$$

We can further break down the last term of equation (2) as:

$$
\begin{align}
E_{\mathbf{x}, y}[(\bar h(\mathbf{x}) - y)^2] &= E_{\mathbf{x}, y}\left[\left((\bar h(\mathbf{x}) - \bar y(\mathbf{x})) + (\bar y(\mathbf{x}) - y)\right)^2\right] \\
&= E_{\mathbf{x}, y}[(\bar h(\mathbf{x}) - \bar y(\mathbf{x}))^2] \\
&\quad + E_{\mathbf{x}, y}[(\bar y(\mathbf{x}) - y)^2] \\
&\quad + 2 E_{\mathbf{x}, y} [(\bar h(\mathbf{x}) - \bar y(\mathbf{x}))(\bar y(\mathbf{x}) - y)] \tag{3}
\end{align}
$$

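Just like the middle term of equation (1), the third term in equation (3) also reduces to 0. Here $\bar h(\mathbf{x}) - \bar y(\mathbf{x})$ depends only on $\mathbf{x}$, so the expectation over $y$ given $\mathbf{x}$ applies to the second factor alone: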

$$
\begin{align}
E_{\mathbf{x}, y} [(\bar h(\mathbf{x}) - \bar y(\mathbf{x}))(\bar y(\mathbf{x}) - y)] 
&= E_{\mathbf{x}} \left[(\bar h(\mathbf{x}) - \bar y(\mathbf{x})) E_{y|\mathbf{x}}[\bar y(\mathbf{x}) - y]\right] \\
&= E_{\mathbf{x}} \left[(\bar h(\mathbf{x}) - \bar y(\mathbf{x})) (\bar y(\mathbf{x}) - E_{y|\mathbf{x}}[y])\right] \\
&= E_{\mathbf{x}} \left[(\bar h(\mathbf{x}) - \bar y(\mathbf{x})) (\bar y(\mathbf{x}) - \bar y(\mathbf{x}))\right] \\
&= 0 \tag{4}
\end{align}
$$

Finally, using (3) and (4), we can write equation (2) as:

$$
E_{\mathbf{x}, y, D}[(h_D(\mathbf{x})- y)^2] = E_{\mathbf{x}, D}[(h_D(\mathbf{x})- \bar h(\mathbf{x}))^2] + E_{\mathbf{x}, y}[(\bar h(\mathbf{x}) - \bar y(\mathbf{x}))^2] + E_{\mathbf{x}, y}[(\bar y(\mathbf{x}) - y)^2]
$$

Or, in terms of error components:

$$
\underbrace{E_{\mathbf{x}, y, D}[(h_D(\mathbf{x})- y)^2]}_{\text{Expected test error}} =
\underbrace{E_{\mathbf{x}, D}[(h_D(\mathbf{x})- \bar h(\mathbf{x}))^2]}_{\text{Variance}} +
\underbrace{E_{\mathbf{x}, y}[(\bar h(\mathbf{x}) - \bar y(\mathbf{x}))^2]}_{\text{Bias}^2} +
\underbrace{E_{\mathbf{x}, y}[(\bar y(\mathbf{x}) - y)^2]}_{\text{Noise}}
$$

The above expression decomposes the expected test error of a machine learning model into three components: **Variance**, **Bias$^2$**, and **Noise**. For simplicity, we will refer to **Bias$^2$** as **Bias**.
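The decomposition can be checked numerically. The sketch below is an illustration under assumptions of our own, not part of the derivation: the distribution is $y = 3x$ plus Gaussian noise with $x$ uniform on $[0, 1]$, and the algorithm is a deliberately simple one that always predicts the mean training label. All names are hypothetical.

```python
import random


def true_mean(x):
    """y_bar(x); known here only because we define the distribution ourselves."""
    return 3.0 * x


def sample_point(rng):
    x = rng.uniform(0.0, 1.0)
    return x, true_mean(x) + rng.gauss(0.0, 0.5)


def train_constant_model(n_points, rng):
    """Algorithm A (assumed): predict the mean training label for every x."""
    ys = [sample_point(rng)[1] for _ in range(n_points)]
    return sum(ys) / n_points


rng = random.Random(0)
models = [train_constant_model(10, rng) for _ in range(200)]  # one constant per dataset D
test_points = [sample_point(rng) for _ in range(2000)]

# Expected model: average prediction over all trained models
h_bar = sum(models) / len(models)

variance = sum((c - h_bar) ** 2 for c in models) / len(models)
bias_sq = sum((h_bar - true_mean(x)) ** 2 for x, _ in test_points) / len(test_points)
noise = sum((true_mean(x) - y) ** 2 for x, y in test_points) / len(test_points)

# Expected test error: average squared loss over every (model, test point) pair
total = sum((c - y) ** 2 for c in models for _, y in test_points) / (
    len(models) * len(test_points)
)

print(total, variance + bias_sq + noise)
```

The two printed numbers should agree up to sampling error. In this particular setup the theoretical values are a variance of 0.1, a squared bias of 0.75, and a noise of 0.25, so the constant model's error is dominated by bias.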

## Error Decomposition for Classification

Now, let us see how we can decompose the error for a classification problem. The assumptions and terminology introduced earlier still hold, but the loss function changes from the squared loss to the 0-1 loss $L(h_D(\textbf{x}), y)$. Here, the output $y$ is a discrete label.  


$
L(h_D(\textbf{x}), y) = \begin{cases}
1 \text{ if }  h_D(\textbf{x}) \neq y, \\
0 \text{ otherwise}.
\end{cases}
$


For classification, our goal is to decompose the expected loss $E_P[L(h_D(\textbf{x}), y)]$ into bias, variance, and noise. To do so, we define the main prediction for an input $\textbf{x}$ as,
$y^m(\textbf{x}) = \arg\min_{y'} E_D[L(h_D(\textbf{x}), y')]$.   
Hence, the main prediction is the label that differs least, in expectation, from the predictions of the models trained on all possible datasets. For the 0-1 loss, it is the mode of those predictions, i.e., the label predicted by the majority of the models.    


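Now, the expressions for bias, variance, and noise at an input $\textbf{x}$ can be written as follows. In the classification setting, $\bar y$ is best read as the most probable label for $\textbf{x}$ rather than an average of labels.
* **Bias** $= L(y^m, \bar y)$

  * $
\text{Bias} = \begin{cases}
1 \text{ if }  y^m \neq \bar{y}, \\
0 \text{ otherwise}.
\end{cases}
$
* **Variance** $ = E_D[L(h_D(\textbf{x}), y^m)]$, where the loss inside the expectation is
  * $
L(h_D(\textbf{x}), y^m) = \begin{cases}
1 \text{ if }  h_D(\textbf{x}) \neq y^m, \\
0 \text{ otherwise}.
\end{cases}
$
* **Noise** $ = E_{y|\textbf{x}}[L(y, \bar y)]$, where the loss inside the expectation is
  * $
L(y, \bar y) = \begin{cases}
1 \text{ if }  y \neq \bar y, \\
0 \text{ otherwise}.
\end{cases}
$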
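For a single input $\textbf{x}$, these quantities can be computed directly from the predictions of several models trained on different datasets. The following is a minimal sketch; the helper names, the example predictions, and the assumption that the best label $\bar y$ is known (passed as `optimal_label`) are all hypothetical.

```python
from collections import Counter


def zero_one_loss(a, b):
    """0-1 loss: 1 if the two labels differ, 0 otherwise."""
    return 0 if a == b else 1


def main_prediction(predictions):
    """Main prediction y^m under 0-1 loss: the most frequent predicted label."""
    return Counter(predictions).most_common(1)[0][0]


def bias_and_variance(predictions, optimal_label):
    """Bias and variance at one input x, given h_D(x) for several datasets D."""
    y_m = main_prediction(predictions)
    bias = zero_one_loss(y_m, optimal_label)
    # Variance: fraction of models whose prediction differs from the main prediction
    variance = sum(zero_one_loss(p, y_m) for p in predictions) / len(predictions)
    return y_m, bias, variance


# Hypothetical predictions of five models, each trained on a different dataset
predictions = ["cat", "cat", "dog", "cat", "dog"]
y_m, bias, variance = bias_and_variance(predictions, optimal_label="cat")
print(y_m, bias, variance)
```

Here the majority vote agrees with the assumed best label, so the bias is 0, while the two dissenting models give a variance of 2/5.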

## Bias, Variance, and Noise

From the above relation, we can define bias, variance, and noise as:
* **Variance**:
 Variance tells us how much our model $h_D$, trained on a particular dataset $D$, differs from the expected model $\bar h$ obtained with the algorithm $A$. It captures how much our model changes if we choose a different training set, $D$.

* **Bias**:
 Bias tells us how poorly the expected model, $\bar h(\textbf{x})$, obtained with algorithm $A$ captures the information/pattern in our data, once the effect of any particular training set has been averaged out.

* **Noise**:
 Noise measures the ambiguity in our data. It is a property of the data, not of the model.

Now we know that our test error is made up of three quantities. Let us see how we can detect and deal with each of them. Below, $\varepsilon$ denotes the error threshold we are willing to accept.
* **Variance**  
  **Detection:** A model is said to have high variance if it performs well on the train set but poorly on the test set. For a high variance model, the train error is well below the threshold, $\varepsilon$, but the test error is above it.  
  **Solution:** High variance can be reduced by:
  * Adding more training data
  * Using a simpler model
  * Bagging

* **Bias**  
  **Detection:** If the model's train error is higher than the threshold $\varepsilon$, the model is said to be highly biased. Such a model fits even its own training set poorly.  
  **Solution:** High bias can be reduced by:
  * Using a more complex model
  * Adding more features
  * Boosting

* **Noise**  
  **Detection:** If two (or more) instances have the same feature values $\textbf{x}$ but different labels $y$, then the dataset is noisy.  
  **Solution:** Noisy instances should be identified and removed or, if possible, a new and cleaner dataset should be used.
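The detection rules for bias and variance can be written as a small helper. This is only a sketch of the threshold rule described above; the function name `diagnose` and the example error values are hypothetical, and choosing a sensible $\varepsilon$ is left to the practitioner.

```python
def diagnose(train_error, test_error, epsilon):
    """Classify a model's main problem from its train and test errors.

    epsilon is the acceptable error threshold (an assumption the user must supply).
    """
    if train_error > epsilon:
        # The model cannot even fit its own training data well
        return "high bias"
    if test_error > epsilon:
        # Fits the training data but does not generalize
        return "high variance"
    return "acceptable"


print(diagnose(train_error=0.30, test_error=0.32, epsilon=0.10))
print(diagnose(train_error=0.02, test_error=0.25, epsilon=0.10))
print(diagnose(train_error=0.04, test_error=0.06, epsilon=0.10))
```

Noise is not covered by this rule, since it is detected by inspecting the data (identical features with different labels) rather than by comparing errors.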


Thus, to improve our model's performance, we need to find the source of the error: bias, variance, or noise. Depending on this source, we apply suitable techniques to improve the performance of the model. In the next few chapters, we will study bagging, random forest, and boosting, which can help us reduce bias and variance systematically.








## Key Takeaways


* From a given distribution ($P$), we can sample multiple datasets $D_1, D_2, … $ and train multiple models.

* The expected test error can be decomposed into three components: bias, variance, and noise.

* Bias tells us how bad an algorithm is at capturing the underlying pattern/information. It can be reduced by adding more features, using a more complex model, or boosting.

* Variance captures how much our model changes when we change the dataset. It can be reduced by adding more training data, using a simpler model, or bagging.

* Noise measures the ambiguity in our data. To reduce noise, we identify noisy instances and remove them from the dataset.
