*Updated 01-11-2023 (First committed 01-08-2023)*

(bayesian-decision-theory)=
# Bayesian Decision Theory (BDT)

**Bayesian decision theory** is a **statistical** approach to machine learning problems. 
It assumes that the decision problem is posed in probabilistic terms and that all of the relevant probability values are known.
Under these assumptions, the classifier obtained within the BDT framework gives, for each test instance, the decision that minimizes the expected cost defined by the loss function.  

## Preliminary

### Statistics 

- [Joint Probability]()

- [Chain rule (probability)]()

- [Bayes Theorem]()

- [The log trick]()

## Basic concepts

### Probability view of machine learning

When a machine learning problem is modeled in the probabilistic setting, both instances $\mathbf{x}$ and labels $y$ are treated as samples of random variables. 

- All instances with $d$ features are sampled from a random process of $d$ random variables $\mathbf{X} = \{ X_{1}, \dots, X_{d} \}$.

- All possible labels are also sampled from a random variable $Y$.

Thus, there is always a probability associated with each term:

- $\mathbb{P}_{\mathbf{X}}(\mathbf{x})$: the probability that the instance $\mathbf{x}$ happens in the real world (the joint probability of different features that happen in the real world).

- $\mathbb{P}_{Y}(y)$: the probability that label $y$ happens in the real world.

- $\mathbb{P}_{\mathbf{X}, Y}(\mathbf{x}, y)$: the joint probability that both $\mathbf{x}$ and $y$ happen in the real world.

We are particularly interested in $\mathbb{P}_{\mathbf{X}, Y}(\mathbf{x}, y)$, because we can predict the label $y_{t}$ of a test instance $\mathbf{x}_{t}$ by selecting the $y$ that has the highest $\mathbb{P}_{\mathbf{X}, Y}(\mathbf{x}_{t}, y)$.

We can decompose the joint probability according to the chain rule:

$$ 
\mathbb{P}_{\mathbf{X}, Y}(\mathbf{x}, y) = \mathbb{P}_{\mathbf{X} \mid Y}(\mathbf{x} \mid y) \mathbb{P}_{Y}(y), 
$$

where 

- $\mathbb{P}_{\mathbf{X} \mid Y}(\mathbf{x} \mid y)$ is called the class conditional probability, which gives the probability of the instance $\mathbf{x}$ given that the label is $y$.

- $\mathbb{P}_{Y}(y)$ is the class probability.
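The decomposition can be checked on a small discrete example. The sketch below is only an illustration: the single binary feature, the labels `"spam"` and `"ham"`, and every probability value are made-up assumptions, not taken from any dataset.

```python
# Hypothetical toy problem: one binary feature x in {0, 1}, two labels.
prior = {"spam": 0.3, "ham": 0.7}          # P_Y(y), assumed values
likelihood = {                              # P_{X|Y}(x | y), assumed values
    "spam": {0: 0.2, 1: 0.8},
    "ham": {0: 0.9, 1: 0.1},
}


def joint(x, y):
    """Chain rule: P(x, y) = P(x | y) * P(y)."""
    return likelihood[y][x] * prior[y]


# The joint probabilities over all (x, y) pairs sum to 1.
total = sum(joint(x, y) for y in prior for x in (0, 1))

# The marginal P_X(x = 1) is the joint summed over the labels.
marginal_x1 = sum(joint(1, y) for y in prior)

# Label with the highest joint probability for the test instance x = 1.
best_label = max(prior, key=lambda y: joint(1, y))
```

Here `best_label` is `"spam"`, because the joint probability of `(1, "spam")` is larger than that of `(1, "ham")`.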

### Decision function

Given an instance $\mathbf{x}$, the decision function $g(\cdot)$ determines its label $\hat{y}$ according to some rules. 

$$ 
\hat{y} = g(\mathbf{x}). 
$$

### Loss function

Given two labels, which usually are the predicted label from the decision function $\hat{y}$ and an arbitrary label $y$, the loss function $L(\hat{y}, y)$ defines the cost of predicting label $\hat{y}$ with respect to the label $y$.  

- The cost returned by loss functions should be a non-negative value.

- For classification problems where the label is a discrete random variable with $m$ unique values, the loss function can be specified by a matrix $\mathbf{L} \in \mathbb{R}^{m \times m}$, where the cost of predicting label 1 with respect to the label 2 is $\mathbf{L}_{1, 2}$.
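A loss matrix can be stored as a nested list indexed first by the predicted label and then by the reference label. The sketch below assumes a hypothetical 3-class problem with labels `0`, `1`, `2`; the cost values are invented for illustration.

```python
# loss[i][j] is the cost of predicting label i when the reference label is j.
# All values below are hypothetical.
loss = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]


def L(y_hat, y):
    """Look up the cost of predicting y_hat with respect to y."""
    return loss[y_hat][y]


# Every cost is non-negative, and a correct prediction costs nothing here.
all_non_negative = all(c >= 0 for row in loss for c in row)
zero_diagonal = all(loss[k][k] == 0 for k in range(len(loss)))
```

The matrix does not need to be symmetric: in this example, predicting `0` when the label is `2` is penalized more heavily than the reverse mistake.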

## Bayes decision rule

### Risk

Assuming we have a probability model $\mathbb{P}_{\mathbf{X}, Y}(\mathbf{x}, y)$ of the joint probability of $\mathbf{X}$ and $Y$, the **risk** function of the decision function $g$ is defined as the expectation of the loss function over the joint probability

$$
\begin{aligned}
R(g) 
& = \mathbb{E}_{\mathbf{X}, Y} \left[
    L (g(\mathbf{x}), y)
\right]
\\
& = \int \int \mathbb{P}_{\mathbf{X}, Y} (\mathbf{x}, y) L (g(\mathbf{x}), y) \mathop{d \mathbf{x}} \mathop{dy}
& [\text{definition of expectation}]
\\
& = \int \int \mathbb{P}_{Y \mid \mathbf{X}} (y \mid \mathbf{x}) \mathbb{P}_{\mathbf{X}} (\mathbf{x}) L (g(\mathbf{x}), y) \mathop{d \mathbf{x}} \mathop{dy}
& [\text{probability chain rule}]
\\
& = \int \mathbb{P}_{\mathbf{X}} (\mathbf{x}) \int \mathbb{P}_{Y \mid \mathbf{X}} (y \mid \mathbf{x}) L (g(\mathbf{x}), y) \mathop{dy} \mathop{d \mathbf{x}}
\\
& = \mathbb{E}_{\mathbf{X}} \left[
    \mathbb{E}_{Y \mid \mathbf{X}} \left[
        L (g(\mathbf{x}), y)
    \right]
\right]
\\
& = \mathbb{E}_{\mathbf{X}} \left[
    R \left(
        \mathbf{x}, g
    \right)
\right],
\\
\end{aligned}
$$

where $R(\mathbf{x}, g) = \mathbb{E}_{Y \mid \mathbf{X}} \left[ L (g(\mathbf{x}), y) \right]$ is the **conditional risk**, which is the risk given that $\mathbf{x}$ is known.
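For discrete variables the integrals become sums, and the equality between the risk computed from the joint probability and the expected conditional risk can be verified numerically. The sketch below assumes a hypothetical model with binary $\mathbf{x}$ and $y$; the probabilities, the loss, and the decision function `g` are arbitrary choices made for illustration.

```python
# Hypothetical discrete model: x in {0, 1}, y in {0, 1}.
p_x = {0: 0.6, 1: 0.4}                    # P_X(x)
p_y_given_x = {0: {0: 0.9, 1: 0.1},       # P_{Y|X}(y | x), indexed [x][y]
               1: {0: 0.3, 1: 0.7}}


def loss(y_hat, y):
    """Unit cost for a wrong prediction, chosen only for illustration."""
    return 0.0 if y_hat == y else 1.0


def g(x):
    """An arbitrary decision function, not necessarily optimal."""
    return x


def conditional_risk(x, g):
    """R(x, g) = sum_y P(y | x) L(g(x), y)."""
    return sum(p * loss(g(x), y) for y, p in p_y_given_x[x].items())


def risk(g):
    """R(g) = sum_x P(x) R(x, g)."""
    return sum(p_x[x] * conditional_risk(x, g) for x in p_x)


# Same quantity computed directly from the joint P(x, y) = P(y | x) P(x).
risk_from_joint = sum(
    p_y_given_x[x][y] * p_x[x] * loss(g(x), y)
    for x in p_x
    for y in (0, 1)
)
```

Both `risk(g)` and `risk_from_joint` evaluate to the same number (0.18 up to floating-point rounding), mirroring the first and last lines of the derivation.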

### Bayes decision rule (BDR)

The **Bayes decision rule** is the particular decision function $g^{*}(\mathbf{x})$ that minimizes the risk

$$ 
\begin{aligned}
g^{*} (\mathbf{x}) 
& = \arg\min_{g (\mathbf{x})} R (g)
\\
& = \arg\min_{g (\mathbf{x})} R (\mathbf{x}, g) & [\text{only } R (\mathbf{x}, g) \text{ contains } g (\mathbf{x})].
\\
\end{aligned}
$$

The risk that the Bayes decision rule achieves is called the **Bayes risk**. It is the minimum risk that any decision function can achieve, provided that we know the true conditional distribution $\mathbb{P}_{Y \mid \mathbf{X}}(y \mid \mathbf{x})$.
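When the labels are discrete, minimizing the conditional risk amounts to evaluating it for every candidate label and picking the smallest. The following sketch assumes the posterior for one test instance is already known; the helper name `bayes_decision`, the posterior values, and both loss matrices are hypothetical.

```python
def bayes_decision(posterior, loss):
    """Return (label, conditional risk) minimizing sum_y P(y | x) * loss[y_hat][y]."""
    risks = [
        sum(p * loss[y_hat][y] for y, p in enumerate(posterior))
        for y_hat in range(len(loss))
    ]
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks[best]


# Hypothetical posterior P(y | x) for one test instance, labels 0 and 1.
posterior = [0.8, 0.2]

# Asymmetric loss: predicting 0 when the label is 1 costs 10, the reverse costs 1.
asymmetric = [[0.0, 10.0],
              [1.0, 0.0]]

# Symmetric loss: every mistake costs 1.
symmetric = [[0.0, 1.0],
             [1.0, 0.0]]

label_asym, risk_asym = bayes_decision(posterior, asymmetric)
label_sym, risk_sym = bayes_decision(posterior, symmetric)
```

With the symmetric loss the rule picks the more probable label `0`, while the asymmetric loss makes it pick label `1`, because missing label `1` is ten times more expensive. The optimal decision therefore depends on the loss as well as on the posterior.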

## Example: BDR with 0-1 loss

Often, we deal with classification problems where $\mathbf{X}$ is a group of continuous random variables and $Y$ is a discrete random variable with $m$ unique values. 
The 0-1 loss is frequently used for such problems.

### 0-1 loss

The 0-1 loss is a simple loss function for classification problems that assigns the same cost to every mistake. It can be written as:

$$ 
L(g(\mathbf{x}), y) = 
\begin{cases}
1 & g(\mathbf{x}) \neq y \\
0 & g(\mathbf{x}) = y, \\
\end{cases}
$$

which can also be written as a matrix in $\mathbb{R}^{m \times m}$, where the entries on the diagonal are all $0$ ($g(\mathbf{x}) = y$) and the rest are all $1$ ($g(\mathbf{x}) \neq y$).

### MAP rule

If we choose the 0-1 loss as the loss function for the BDR, 

$$
\begin{aligned}
g^{*} (\mathbf{x}) 
& = \arg\min_{g(\mathbf{x})} \mathbb{E}_{Y \mid \mathbf{X}} \left[
    L (g(\mathbf{x}), y)
\right]
\\
& = \arg\min_{g (\mathbf{x})} \sum_{y=1}^{m} \mathbb{P}_{Y \mid \mathbf{X}} (y \mid \mathbf{x}) L (g(\mathbf{x}), y) 
\\
& = \arg\min_{g (\mathbf{x})} \left[ \sum_{y = g (\mathbf{x})} \mathbb{P}_{Y \mid \mathbf{X}} (y \mid \mathbf{x}) \times 0 + \sum_{y \neq g(\mathbf{x})} \mathbb{P}_{Y \mid \mathbf{X}}(y \mid \mathbf{x}) \times 1 \right]
\\
& = \arg\min_{g (\mathbf{x})} \sum_{y \neq g (\mathbf{x})} \mathbb{P}_{Y \mid \mathbf{X}} (y \mid \mathbf{x}) 
\\
& = \arg\min_{g (\mathbf{x})} \left[ 1 - \mathbb{P}_{Y \mid \mathbf{X}} (g (\mathbf{x}) \mid \mathbf{x}) \right] & [\text{the posterior sums to } 1]
\\
& = \arg\max_{g (\mathbf{x})} \mathbb{P}_{Y \mid \mathbf{X}} (g (\mathbf{x}) \mid \mathbf{x}) & [\arg\min_{x} (1 - f(x)) = \arg\max_{x} (f(x))] 
\\
& = \arg\max_{y} \mathbb{P}_{Y \mid \mathbf{X}} (y \mid \mathbf{x}).
\end{aligned}
$$

Since the last line maximizes the posterior probability $\mathbb{P}_{Y \mid \mathbf{X}} (y \mid \mathbf{x})$, the optimal decision rule for the 0-1 loss is also called the **maximum a posteriori (MAP) rule**.

According to Bayes Theorem, 

$$ 
\begin{aligned}
\arg\max_{y} \mathbb{P}_{Y \mid \mathbf{X}}(y \mid \mathbf{x}) 
& = \arg\max_{y} \frac{\mathbb{P}_{\mathbf{X} \mid Y}(\mathbf{x} \mid y) \mathbb{P}_{Y}(y)}{\mathbb{P}_{\mathbf{X}}(\mathbf{x})} 
\\
& = \arg\max_{y} \mathbb{P}_{\mathbf{X} \mid Y}(\mathbf{x} \mid y) \mathbb{P}_{Y}(y) & [\mathbb{P}_{\mathbf{X}}(\mathbf{x}) \text{ doesn't depend on } y], 
\\
\end{aligned}
$$

so the MAP rule can be computed using only the class conditional probability (likelihood) and the class probability (prior). 

In practice, the class conditional probability and the class probability are often easier to estimate from data than the posterior probability. 
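As an illustration of computing the MAP decision from the likelihood and the prior, the sketch below assumes a one-dimensional feature with Gaussian class conditional densities. The class names, priors, means, and standard deviations are hypothetical, and the full posterior is computed only to confirm that dropping $\mathbb{P}_{\mathbf{X}}(\mathbf{x})$ does not change the decision.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical model: two classes with Gaussian class conditionals.
priors = {"a": 0.7, "b": 0.3}                  # P_Y(y)
params = {"a": (0.0, 1.0), "b": (2.0, 1.0)}    # (mean, std) of P_{X|Y}(x | y)


def map_rule(x):
    """argmax_y P(x | y) P(y); the evidence P(x) is never computed."""
    return max(priors, key=lambda y: gaussian_pdf(x, *params[y]) * priors[y])


def posterior(x):
    """Full posterior via Bayes' theorem, used only to check the shortcut."""
    scores = {y: gaussian_pdf(x, *params[y]) * priors[y] for y in priors}
    evidence = sum(scores.values())
    return {y: s / evidence for y, s in scores.items()}
```

For example, `map_rule(0.0)` returns `"a"` and `map_rule(2.0)` returns `"b"`, and in both cases the same label has the largest value in `posterior(x)`.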

### The log trick

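Because the logarithm is strictly increasing, taking the log of the objective does not change the maximizer, and it turns the product into a sum. Using this log trick, the BDR for the 0-1 loss is often calculated using: 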

$$ 
\begin{aligned}
\arg\max_{y} \ln \mathbb{P}_{Y \mid \mathbf{X}}(y \mid \mathbf{x}) 
& = \arg\max_{y} \ln \left[ \mathbb{P}_{\mathbf{X} \mid Y}(\mathbf{x} \mid y) \mathbb{P}_{Y}(y) \right]
\\
& = \arg\max_{y} \ln \mathbb{P}_{\mathbf{X} \mid Y}(\mathbf{x} \mid y) + \ln \mathbb{P}_{Y}(y).
\\
\end{aligned}
$$
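One practical reason for working in log space is numerical underflow. The sketch below assumes, purely for illustration, that the class conditional probability factorizes into 500 identical per-feature terms (a conditional independence assumption that is not part of the derivation above); all numbers are made up.

```python
import math

# Hypothetical setup: 500 features, assumed conditionally independent given the
# label, so that P(x | y) is a product of per-feature factors.
n_features = 500
per_feature = {"a": 0.01, "b": 0.02}   # made-up value of each factor P(x_i | y)
priors = {"a": 0.6, "b": 0.4}          # P_Y(y)

# The direct product underflows to 0.0 for both classes in double precision,
# so the two classes can no longer be compared.
direct = {y: (per_feature[y] ** n_features) * priors[y] for y in priors}

# Log-space score: ln P(x | y) + ln P(y), with the product turned into a sum.
log_score = {
    y: n_features * math.log(per_feature[y]) + math.log(priors[y])
    for y in priors
}
prediction = max(log_score, key=log_score.get)
```

Both entries of `direct` are `0.0`, while the log-space scores remain finite and distinct, so `prediction` is `"b"`.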

## Example: BDR with squared error loss

For regression problems where the labels are continuous values, a common loss function is the squared error loss

$$
L (g(\mathbf{x}), y) = (g(\mathbf{x}) - y)^{2}.
$$

Plugging this loss function into the BDR gives

$$
\begin{aligned}
g^{*} (\mathbf{x}) 
& = \arg\min_{g(\mathbf{x})} \mathbb{E}_{Y \mid \mathbf{X}} \left[
    L (g(\mathbf{x}), y)
\right]
\\
& = \arg\min_{g(\mathbf{x})} \mathbb{E}_{Y \mid \mathbf{X}} \left[
    (g (\mathbf{x}) - y)^{2}
\right]
\\
& = \arg\min_{g(\mathbf{x})} \mathbb{E}_{Y \mid \mathbf{X}} \left[
    g (\mathbf{x})^{2} - 2 g (\mathbf{x}) y + y^{2}
\right]
\\
& = \arg\min_{g(\mathbf{x})} g (\mathbf{x})^{2} - 2 g (\mathbf{x}) \mathbb{E}_{Y \mid \mathbf{X}} \left[
    y
\right] + \mathbb{E}_{Y \mid \mathbf{X}} \left[
    y^{2}
\right].
\end{aligned}
$$

The objective is a convex quadratic in $g (\mathbf{x})$, so the minimization problem can be solved by setting its derivative with respect to $g (\mathbf{x})$ to 0:

$$
\begin{aligned}
\frac{
    \mathop{d}
}{
    \mathop{d g (\mathbf{x})} 
} \left[
    g (\mathbf{x})^{2} - 2 g (\mathbf{x}) \mathbb{E}_{Y \mid \mathbf{X}} \left[
        y
    \right] + \mathbb{E}_{Y \mid \mathbf{X}} \left[
        y^{2}
    \right]
\right] 
& = 0
\\
2 g (\mathbf{x}) - 2 \mathbb{E}_{Y \mid \mathbf{X}} \left[
    y
\right]
& = 0
\\
g^{*} (\mathbf{x}) 
& = \mathbb{E}_{Y \mid \mathbf{X}} \left[
    y
\right]
\\
\end{aligned}
$$
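The result says that, under the squared error loss, the optimal prediction is the conditional mean of the label. This can be checked numerically with a brute-force search. The sketch below assumes a hypothetical discrete conditional distribution of $y$ for one fixed $\mathbf{x}$, used instead of a continuous one only to keep the expectation a finite sum.

```python
# Hypothetical conditional distribution P(y | x) for one fixed x.
values = [1.0, 2.0, 4.0]
probs = [0.5, 0.25, 0.25]


def expected_squared_loss(prediction):
    """E_{Y|X}[(g(x) - y)^2] for a candidate prediction g(x)."""
    return sum(p * (prediction - y) ** 2 for y, p in zip(values, probs))


# Closed-form answer from the derivation: the conditional mean E[y | x].
conditional_mean = sum(p * y for y, p in zip(values, probs))

# Brute-force search over a grid of candidate predictions in [0, 5].
grid = [i / 100 for i in range(501)]
best_on_grid = min(grid, key=expected_squared_loss)
```

The grid search and the closed form agree: both `conditional_mean` and `best_on_grid` equal `2.0` for this distribution.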

## References

- http://pillowlab.princeton.edu/teaching/mathtools16/slides/lec18_BayesianEstim.pdf