# Bayes Classifier and Bayes Risk

_Kevin Siswandi_  
**Fundamentals of Machine Learning**  
June 2020  

In this chapter, we will discuss the best possible classifier in machine learning.

## Bayes Theorem

First, let's review some basic concepts of probability. The product rule is

$$ P(\alpha \cap \beta) = P(\alpha | \beta) P(\beta) $$

which is the foundation of Bayes' theorem:

$$ P(\alpha | \beta) = \frac{P(\beta|\alpha) P(\alpha)}{P(\beta)}$$

This can be generalized to multiple events using the *chain rule of probability*:

$$ P(\alpha, \beta, \gamma) = P(\alpha | \beta, \gamma) P(\beta|\gamma) P(\gamma) = P(\beta|\alpha,\gamma) P(\alpha | \gamma) P(\gamma) = \dots $$
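The identities above are easy to check numerically. The following is a minimal sketch, assuming a randomly generated joint table over three discrete variables (the table, its shape and all variable names are made up for illustration). It rebuilds the joint distribution from the chain-rule factors and verifies Bayes' theorem on the pair $(\alpha, \beta)$.

```python
import numpy as np

# Hypothetical joint table P(a, b, c) with 2 x 3 x 4 outcomes (illustration only).
rng = np.random.default_rng(0)
joint = rng.random((2, 3, 4))
joint /= joint.sum()  # normalize so that all entries sum to one

# Chain-rule factors: P(a, b, c) = P(a | b, c) P(b | c) P(c)
p_c = joint.sum(axis=(0, 1))   # P(c), shape (4,)
p_bc = joint.sum(axis=0)       # P(b, c), shape (3, 4)
p_b_given_c = p_bc / p_c       # P(b | c), shape (3, 4)
p_a_given_bc = joint / p_bc    # P(a | b, c), shape (2, 3, 4)

reconstructed = p_a_given_bc * p_b_given_c * p_c
chain_rule_holds = np.allclose(reconstructed, joint)

# Bayes' theorem on (a, b): P(a | b) = P(b | a) P(a) / P(b)
p_ab = joint.sum(axis=2)            # P(a, b), shape (2, 3)
p_a = p_ab.sum(axis=1)              # P(a), shape (2,)
p_b = p_ab.sum(axis=0)              # P(b), shape (3,)
p_a_given_b = p_ab / p_b            # each column sums to one
p_b_given_a = p_ab / p_a[:, None]   # each row sums to one
bayes_holds = np.allclose(p_a_given_b, p_b_given_a * p_a[:, None] / p_b)

print("chain rule holds:", chain_rule_holds)
print("Bayes' theorem holds:", bayes_holds)
```

Both checks succeed for any valid joint table, because the factors are defined as ratios of marginals of the same joint distribution.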

One consequence is that expectation values follow the same rule

$$ \mathbb{E}_{\alpha,\beta}[\cdot] = \mathbb{E}_{\alpha}\mathbb{E}_{\beta|\alpha}[\cdot] $$

or, when $\alpha$ and $\beta$ are sampled independently:

$$ \mathbb{E}_{\alpha,\beta}[\cdot] = \mathbb{E}_{\alpha}\mathbb{E}_{\beta}[\cdot] $$
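To make the nested expectation concrete, here is a small sketch on a discrete example. The joint table, the outcome values and the function `g` are all hypothetical choices. It computes the expectation directly over the joint distribution and then again as an outer expectation over $\alpha$ of an inner conditional expectation over $\beta$.

```python
import numpy as np

# Hypothetical joint table P(a, b): rows index a, columns index b.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.30, 0.05, 0.25]])
a_vals = np.array([0.0, 1.0])        # values taken by a (assumed)
b_vals = np.array([-1.0, 0.0, 2.0])  # values taken by b (assumed)


def g(a, b):
    """Arbitrary function whose expectation we want (illustration only)."""
    return a * b + b ** 2


G = g(a_vals[:, None], b_vals[None, :])  # g evaluated on the whole grid

# Direct expectation over the joint distribution.
direct = np.sum(joint * G)

# Nested expectation: E_a [ E_{b|a} [ g ] ].
p_a = joint.sum(axis=1)
p_b_given_a = joint / p_a[:, None]
inner = np.sum(p_b_given_a * G, axis=1)  # one conditional expectation per value of a
nested = np.sum(p_a * inner)

# Independent case: the joint is the product of the marginals,
# so the inner expectation uses P(b) instead of P(b | a).
p_b = joint.sum(axis=0)
indep_direct = np.sum(np.outer(p_a, p_b) * G)
indep_nested = np.sum(p_a * (G @ p_b))

print(direct, nested)
print(indep_direct, indep_nested)
```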


## Multivariate Distribution

- "Multivariate" is just statistician-speak for "multidimensional".
- A multivariate probability distribution, like any other probability distribution, is normalized: its probabilities sum (or its density integrates) to one over all variables jointly.

Say we have a discrete joint distribution of two random variables $a$ and $b$, as shown below (the two variables are clearly not independent):

<img src="img/multivariate.png" alt="Drawing" style="width: 800px;"/>

The conditional probability $P(b|a = a_0)$ is conditioned on a specific value $a = a_0$, so it is a function of $b$ only. For the special case of normal distributions, the key facts to remember are that both the marginal and the conditional distributions of a jointly normal distribution are themselves normal. The converse does not hold: variables with normal marginal distributions are not necessarily jointly normal.
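The sketch below shows how a marginal and a conditional distribution are read off a discrete joint table. The numbers are made up and do not correspond to the figure above. Conditioning on $a = a_0$ amounts to picking one row of the table and renormalizing it.

```python
import numpy as np

# Hypothetical joint table P(a, b): rows index a, columns index b.
joint = np.array([[0.05, 0.10, 0.15],
                  [0.20, 0.05, 0.05],
                  [0.10, 0.25, 0.05]])

# A joint distribution is normalized over all variables together.
is_normalized = np.isclose(joint.sum(), 1.0)

# Marginals: sum out the other variable.
p_a = joint.sum(axis=1)
p_b = joint.sum(axis=0)

# Conditional P(b | a = a0): take the row for a0 and renormalize it.
a0 = 1
p_b_given_a0 = joint[a0] / p_a[a0]

# Independence would require P(a, b) = P(a) P(b) for every cell.
is_independent = np.allclose(joint, np.outer(p_a, p_b))

print("P(a) =", p_a)
print("P(b) =", p_b)
print("P(b | a = a0) =", p_b_given_a0)
print("independent:", is_independent)
```

Here the conditional row differs from the marginal $P(b)$, which is another way of seeing that the two variables in this table are not independent.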

## Statistical decision theory

Assume that we have a binary classification problem with class label $Y \in \{1, 2\}$. The questions that we would like to answer are:
1. What does the best conceivable classifier look like?
2. How good is it?

In a nutshell: the **Bayes classifier** is the best possible classifier (though not necessarily a perfect one), and the **Bayes risk** is the risk it attains, i.e. the lowest risk that any classifier can achieve. The Bayes classifier is obtained by minimizing the risk.

The statistical model of classification gives a recipe for generating an observation $(x,y)$:

<img src="img/stats-classification.png" alt="Drawing" style="width: 800px;"/>

In order for this task to make sense, the class-conditional distribution of the features, $p(x|y)$, must differ between classes.

## Defining and minimizing risk

First we define a loss function $L(y, \hat{y})$, where $y$ is the true class and $\hat{y}$ is the predicted class. Although some mistakes are costlier than others in real life, a simple choice is the 0-1 loss below:

<img src="img/01loss.png" alt="Drawing" style="width: 600px;"/>

Aim of classification: minimize the expected loss (i.e. the **risk**) of the classifier $f$:

$$ R(f) = \mathbb{E}_X \mathbb{E}_{Y|X} \left[ L(Y, f(X)) \right] $$

where $Y$ and $X$ are random variables and $f$ is a deterministic classifier, which could be kNN, a random forest (RF), etc. By the definition of the expectation value, we can write (assuming $X$ is a continuous random variable)

$$ R(f) = \int \mathbb{E}_{Y|X=x} \left[ L(Y, f(x)) \right] p(x) \, dx $$

Since $p(x) \geq 0$, the risk is minimized by minimizing $\mathbb{E}_{Y|X=x} [L(Y, f(x))]$ at every point $x$ in feature space. We can write this as a discrete summation over all possible class labels:

<img src="img/bayes-risk.png" alt="Drawing" style="width: 600px;"/>

Note that the second summation involves an indicator function that charges the loss only for the class that the classifier actually predicts. Also, only wrong predictions incur a loss. By definition, the Bayes classifier is the minimizer of the risk,

$$ \arg \min_f R(f) $$

which, for the 0-1 loss, is given by

$$ f(x) = \arg \max_{z \in \{1, 2\}} p(z |x) $$
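For readers who want the intermediate step in symbols, the following is a short sketch of why minimizing the 0-1 risk leads to the argmax rule. It assumes the 0-1 loss is written as an indicator, $L(y, \hat{y}) = \mathbb{1}[y \neq \hat{y}]$, and uses $\hat{y} = f(x)$ for the prediction at a fixed point $x$; the notation is ours and may differ from the figure.

$$ \mathbb{E}_{Y|X=x} \left[ L(Y, \hat{y}) \right] = \sum_{k \in \{1, 2\}} \mathbb{1}[k \neq \hat{y}] \, p(k|x) = 1 - p(\hat{y}|x) $$

The expected loss at $x$ is therefore smallest when $p(\hat{y}|x)$ is largest, which is exactly the prediction $\hat{y} = \arg \max_z p(z|x)$.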

Comments:
* The Bayes classifier relies on knowledge of the true distribution: the class densities $p(x|y)$ and the prior probability $p(y)$ of each class.
* The Bayes classifier is the best possible classifier, but best is not necessarily perfect.
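Both comments can be checked by brute force on a tiny discrete problem. The sketch below assumes a made-up joint table over a feature $x \in \{0, 1, 2\}$ and the labels $\{1, 2\}$. It enumerates every deterministic classifier, computes its 0-1 risk, and compares the best one with the argmax-posterior rule.

```python
import itertools

import numpy as np

# Hypothetical joint table P(x, y): rows are x in {0, 1, 2}, columns are y in {1, 2}.
joint = np.array([[0.30, 0.05],
                  [0.10, 0.15],
                  [0.05, 0.35]])
labels = np.array([1, 2])


def risk(predictions):
    """0-1 risk: total probability mass where the prediction differs from the true label."""
    wrong = predictions[:, None] != labels[None, :]
    return float(np.sum(joint * wrong))


# A deterministic classifier assigns one label to each x, so there are 2**3 of them.
all_classifiers = [np.array(c) for c in itertools.product(labels, repeat=3)]
risks = [risk(c) for c in all_classifiers]

# Bayes classifier for the 0-1 loss: pick the label with the largest posterior p(y | x).
p_x = joint.sum(axis=1)
posterior = joint / p_x[:, None]
bayes_classifier = labels[np.argmax(posterior, axis=1)]
bayes_risk = risk(bayes_classifier)

print("Bayes classifier:", bayes_classifier)
print("Bayes risk:", bayes_risk)
print("lowest risk over all classifiers:", min(risks))
```

No classifier in the enumeration does better than the argmax-posterior rule, and its risk is still larger than zero because every value of $x$ occurs with both labels.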

## Bayes decision boundary

Consider the following toy example. There are two classes, red (class 1) and blue (class 2), in a two-dimensional feature space. The observations are sampled from some underlying distribution.

<img src="img/bayes-theorem.png" alt="Drawing" style="width: 600px;"/>

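Suppose we know the true class densities $p(x|1)$ and $p(x|2)$ as well as the priors $p(1)$ and $p(2)$. These give the joint densities, e.g. $p(x, 1) = p(x|1) p(1)$, and hence the overall density $p(x) = p(x, 1) + p(x, 2)$. The posterior probability $p(y|x)$ then follows from Bayes' theorem as the product of the class density and the prior, normalized by the density $p(x)$:

$$ p(y|x) = \frac{p(x|y) p(y)}{p(x)} $$

The terms have standard names: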
- Class density, e.g. $p(x|1)$, is also called likelihood.
- Class probability, e.g. $p(1)$, is also called prior probability.
- Probability density, $p(x)$, is also called evidence.

In practice, all terms on the right-hand side of Bayes' theorem can be estimated from the data, e.g. with a kernel density estimate on the red points to get $p(x|1)$. For the 0-1 loss discussed above, the Bayes classifier predicts the class with the largest posterior probability $p(\cdot|x)$. The **Bayes decision boundary (in green)** is the set of points at which the posterior probabilities of the two classes are equal, namely $p(1|x) = p(2|x) = 0.5$. Note, however, that even the Bayes classifier is not perfect: where the class densities overlap, some observations are inevitably misclassified.
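As a simplified illustration, the sketch below works through a one-dimensional analogue of the toy example. All of the following are assumptions made for this sketch and are not taken from the figure: the class densities are Gaussians with equal variance, and the means and priors are chosen arbitrarily. It evaluates the posterior on a grid, locates the decision boundary where $p(1|x)$ crosses 0.5, and computes the Bayes risk as the integral of $\min(p(x, 1), p(x, 2))$, which is the probability mass that even the Bayes classifier gets wrong.

```python
import math

import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def std_normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Assumed toy problem: two Gaussian classes in one dimension.
priors = {1: 0.7, 2: 0.3}
means = {1: -1.0, 2: 1.0}
std = 1.0

x = np.linspace(-8.0, 8.0, 16001)
dx = x[1] - x[0]

# Joint densities p(x, k) = p(x | k) p(k), evidence p(x), posterior p(1 | x).
joint = {k: gaussian_pdf(x, means[k], std) * priors[k] for k in (1, 2)}
evidence = joint[1] + joint[2]
posterior_1 = joint[1] / evidence

# Bayes classifier: predict the class with the larger posterior.
prediction = np.where(posterior_1 >= 0.5, 1, 2)
boundary = x[np.argmax(prediction == 2)]  # first grid point assigned to class 2

# Bayes risk for the 0-1 loss: mass of the smaller joint density at every x.
bayes_risk = np.sum(np.minimum(joint[1], joint[2])) * dx

# Closed-form values for the equal-variance case, for comparison.
x_star = 0.5 * (means[1] + means[2]) + std ** 2 / (means[2] - means[1]) * math.log(priors[1] / priors[2])
analytic_risk = (priors[1] * (1.0 - std_normal_cdf((x_star - means[1]) / std))
                 + priors[2] * std_normal_cdf((x_star - means[2]) / std))

print("boundary (grid, closed form):", boundary, x_star)
print("Bayes risk (grid, closed form):", bayes_risk, analytic_risk)
```

With unequal priors the boundary moves away from the midpoint of the two means, towards the less probable class, and the Bayes risk stays strictly positive because the two densities overlap.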

**The Bayes Classifier is the best possible classifier for a given distribution and loss function**.