# Decision Theory

## Classification
Consider a medical diagnosis problem in which we have taken an X-ray image of a patient, and we wish to determine whether the patient has cancer or not.  
- $\mathbf{x}$: the input vector, the set of pixel intensities in the image.
- $t$: the discrete output variable, taking the value 0 or 1 to indicate the presence or absence of cancer.
- $C_1$: the class corresponding to $t=0$, the presence of cancer.
- $C_2$: the class corresponding to $t=1$, the absence of cancer.

We are interested in the probabilities of the two classes given the new patient's X-ray image, which are given by
$$p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)p(C_k)}{p(\mathbf{x})}$$
In this statement of Bayes' theorem:
- $p(C_k)$ is the prior probability of class $C_k$: the overall probability of the presence/absence of cancer before seeing the image.
- $p(C_k|\mathbf{x})$ is the posterior probability: the probability of the presence/absence of cancer given the image intensities $\mathbf{x}$.
- $p(\mathbf{x}|C_k)$ is the class-conditional density (the likelihood): the probability of observing the image $\mathbf{x}$ given class $C_k$, as described by the model fitted to the training data.
- $p(\mathbf{x})$ is the normalizing constant, obtained from $\displaystyle{p(\mathbf{x})=\sum_k p(\mathbf{x}|C_k)p(C_k)}$.
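
As a minimal sketch of the formula above, the snippet below computes the posterior from class-conditional likelihoods and priors. The function name `posterior` and all numbers are hypothetical, chosen only for illustration.

```python
def posterior(likelihoods, priors):
    """Return p(C_k | x) for each class from p(x | C_k) and p(C_k)."""
    # Numerator of Bayes' theorem: p(x | C_k) p(C_k) = p(x, C_k)
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    # Denominator: p(x) = sum_k p(x | C_k) p(C_k)
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Hypothetical numbers: index 0 is C_1 (cancer), index 1 is C_2 (no cancer).
priors = [0.01, 0.99]        # p(C_1), p(C_2)
likelihoods = [0.30, 0.02]   # p(x | C_1), p(x | C_2) for one particular image x
post = posterior(likelihoods, priors)
print(post)
```

With these made-up values the image is 15 times more likely under the cancer class, yet the posterior probability of cancer is only about 0.13, because the prior is so small.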

Evaluating the posterior probability $p(C_k|\mathbf{x})$ is called *inference*.  
Once we have this probability, we still need to take an action, such as deciding whether or not to treat the patient. This is the *decision* step.  

## Decision in classification
<font color='red'>Suppose we have obtained the posterior distribution $p(C_k|\mathbf{x})$ or the joint distribution $p(\mathbf{x}, C_k)$. How we should make decisions based on these probabilities is what we discuss in this article</font>.
### Minimizing the misclassification rate
*We first discuss how to choose the class for a new input $\mathbf{x}$, although the answer is intuitively obvious*.  
Suppose that our goal is simply to make as few misclassifications as possible. We need a rule that assigns **each value** of $\mathbf{x}$ to one of the available classes. Such a rule will divide the input space into regions $\mathcal{R}_k$ (one for each class $C_k$) called decision regions. The boundaries between decision regions are called *decision boundaries or decision surfaces*. A mistake occurs when an input belonging to one class falls in the decision region of the other, so the probability of a mistake is given by
$$\begin{align*}p(mistake)&=p(\mathbf{x}\in\mathcal{R}_1,C_2)+p(\mathbf{x}\in\mathcal{R}_2,C_1)\\ &=\int_{\mathcal{R}_1}p(\mathbf{x},C_2)d\mathbf{x}+\int_{\mathcal{R}_2}p(\mathbf{x},C_1)d\mathbf{x} \end{align*}$$
Clearly, to minimize $p(mistake)$ we should assign each $\mathbf{x}$ to whichever class gives the smaller value of the integrand: if $p(\mathbf{x},C_1)>p(\mathbf{x},C_2)$, then $\mathbf{x}$ should go to $\mathcal{R}_1$, so that only the smaller term $p(\mathbf{x},C_2)$ contributes to the error. In other words, <font color='red'>assign $\mathbf{x}$ to the class whose probability is larger</font>.  
Minimizing the probability of a mistake is equivalent to maximizing the probability of being correct. For the more general case of $K$ classes, it is slightly easier to maximize the probability of being correct, which is given by
$$\begin{align*}p(correct) &=\sum_{k=1}^Kp(\mathbf{x}\in\mathcal{R}_k,C_k)\\&=\sum_{k=1}^K\int_{\mathcal{R}_k}p(\mathbf{x},C_k)d\mathbf{x}\end{align*}$$
Again, by Bayes' theorem, $\displaystyle{p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)p(C_k)}{p(\mathbf{x})}=\frac{p(\mathbf{x}, C_k)}{p(\mathbf{x})}}$, where $p(C_k)$ is the prior probability. Since the denominator $p(\mathbf{x})$ is common to all classes, we see that each $\mathbf{x}$ should be assigned to the class having the largest posterior probability $p(C_k|\mathbf{x})$.
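
The rule can be sketched on a toy problem. The example below assumes a scalar input and two Gaussian class-conditional densities with equal priors; these densities, their parameters, and the helper names are illustrative assumptions, not part of the original notes.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian, used here as a stand-in for p(x | C_k)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical two-class problem with scalar input x.
priors = [0.5, 0.5]                  # p(C_1), p(C_2)
params = [(0.0, 1.0), (2.0, 1.0)]    # (mean, std) of p(x | C_k)

def joint(x):
    """Return [p(x, C_1), p(x, C_2)]."""
    return [gaussian_pdf(x, m, s) * p for (m, s), p in zip(params, priors)]

def classify(x):
    """Assign x to the class with the largest joint (equivalently, posterior)."""
    j = joint(x)
    return max(range(len(j)), key=lambda k: j[k])  # 0 means C_1, 1 means C_2

labels = [classify(x) for x in (-1.0, 0.5, 1.5, 3.0)]
print(labels)  # [0, 0, 1, 1]
```

Under these assumptions the decision boundary sits at $x=1$, halfway between the two means, so inputs below 1 fall in $\mathcal{R}_1$ and inputs above 1 fall in $\mathcal{R}_2$.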

### Minimizing the expected loss
In the medical diagnosis problem, if a healthy patient is diagnosed as having cancer, the consequences may be some patient distress plus the need for further investigations. Conversely, if a patient with cancer is diagnosed as healthy, the result may be premature death due to lack of treatment.  
We can formalize such issues through the introduction of a *loss function*, also called a *cost function*, which is a single, overall measure of the loss incurred in taking any of the available decisions or actions.  
Suppose that, for a new value of $\mathbf{x}$, the **true class is $C_k$** and that we **assign $\mathbf{x}$ to class $C_j$** (where $j$ may or may not be equal to $k$). In so doing, we incur some level of loss that we denote by $L_{kj}$, which we can view as the $k,j$ element of a *loss matrix*. 
$$\begin{array}{c|cc}
\text{true } k\ \backslash\ \text{assigned } j & \text{cancer} & \text{normal}\\ \hline
\text{cancer} & 0 & 1000\\
\text{normal} & 1 & 0
\end{array}$$
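A correct classification incurs no loss, whereas a mistake costs 1 (a healthy patient diagnosed as having cancer) or 1000 (a patient with cancer diagnosed as healthy). Because the true class is unknown, we minimize the average loss, computed with respect to the joint distribution $p(\mathbf{x},C_k)$:  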
$$\mathbb{E}[L]=\sum_k\sum_j\int_{\mathcal{R}_j}L_{kj}p(\mathbf{x},C_k)d\mathbf{x}$$
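Our goal is to choose the regions $\mathcal{R}_j$ that minimize the expected loss. Moving the sum over $k$ inside the integral shows that, for each $\mathbf{x}$, we should pick the region $j$ for which $\sum_k L_{kj}p(\mathbf{x},C_k)$ is smallest: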
$$\mathbb{E}[L]=\sum_j\int_{\mathcal{R}_j}\sum_kL_{kj}p(\mathbf{x},C_k)d\mathbf{x}$$
As before, we can use the product rule $p(\mathbf{x},C_k)=p(C_k|\mathbf{x})p(\mathbf{x})$ to eliminate the common factor of $p(\mathbf{x})$. Thus the decision rule that minimizes the expected loss is the one that <font color='red'>assigns each new $\mathbf{x}$ to the class $j$ for which the quantity below is minimum</font>.
$$\sum_{k}L_{kj}p(C_k|\mathbf{x})$$
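
To make the rule concrete, here is a small sketch that applies the loss matrix from this section to a hypothetical posterior. The function names and the posterior values are assumptions made for illustration.

```python
# Loss matrix L[k][j]: true class k (rows), assigned class j (columns).
# Index 0 = cancer, 1 = normal, matching the matrix in the text.
LOSS = [[0, 1000],
        [1, 0]]

def expected_losses(posterior, loss=LOSS):
    """Return sum_k L[k][j] * p(C_k | x) for every candidate decision j."""
    n_classes = len(posterior)
    return [sum(loss[k][j] * posterior[k] for k in range(n_classes))
            for j in range(n_classes)]

def decide(posterior, loss=LOSS):
    """Pick the decision j with the smallest expected loss."""
    risks = expected_losses(posterior, loss)
    return min(range(len(risks)), key=lambda j: risks[j])

posterior = [0.01, 0.99]   # hypothetical p(cancer | x), p(normal | x)
risks = expected_losses(posterior)
decision = decide(posterior)
print(risks, decision)
```

Here the expected loss of deciding "cancer" is about 0.99 and that of deciding "normal" is about 10, so the rule picks "cancer" even though the posterior probability of cancer is only 1%. Maximizing the posterior alone would have chosen "normal".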

## Inference and Decision
### <font color='red'>Three approaches for classification</font>  
- (a) **Generative models**. First solve the inference problem of determining the class-conditional densities $p(\mathbf{x}|C_k)$ for each class $C_k$ individually, and separately infer the prior class probabilities $p(C_k)$. Then use Bayes' theorem to obtain the posterior probabilities $p(C_k|\mathbf{x})$, on which the decision is based. Equivalently, we can model the joint distribution $p(\mathbf{x}, C_k)$ directly and normalize it to obtain the posterior.
- (b) **Discriminative models**. First solve the inference problem of determining the posterior class probabilities $p(C_k|\mathbf{x})$ directly, and then use them to classify each $\mathbf{x}$.
- (c) **Discriminant function**. Find a function $f(\mathbf{x})$ that maps each input directly to a class label, such as $0$ or $1$.

The three approaches are listed in order of decreasing complexity and decreasing flexibility.
- Approach (a) involves finding the joint distribution over both $\mathbf{x}$ and $C_k$. For many applications, $\mathbf{x}$ will have high dimensionality, and consequently we may need a large training set in order to determine the class-conditional densities to reasonable accuracy, which is also computationally expensive. This is wasteful if all we actually need is to classify the data.
- Approach (b) provides the posterior probabilities $p(C_k|\mathbf{x})$ directly, which are all we need to make decisions.
- Approach (c) outputs the class directly, so we lose access to the posterior probabilities, which are very important for making decisions.
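
As a rough sketch of approach (a), the snippet below fits a prior and a one-dimensional Gaussian class-conditional density per class, then applies Bayes' theorem. The Gaussian assumption, the toy data, and the names `fit_generative` and `posterior` are all illustrative choices rather than anything prescribed by these notes.

```python
import math
from statistics import mean, pstdev

def fit_generative(xs, labels):
    """Estimate p(C_k) and a 1-D Gaussian p(x | C_k) for each class label."""
    model = {}
    for k in sorted(set(labels)):
        xk = [x for x, c in zip(xs, labels) if c == k]
        model[k] = {"prior": len(xk) / len(xs), "mean": mean(xk), "std": pstdev(xk)}
    return model

def posterior(model, x):
    """Bayes' theorem: p(C_k | x) = p(x | C_k) p(C_k) / p(x)."""
    joint = {}
    for k, m in model.items():
        z = (x - m["mean"]) / m["std"]
        density = math.exp(-0.5 * z * z) / (m["std"] * math.sqrt(2.0 * math.pi))
        joint[k] = density * m["prior"]
    evidence = sum(joint.values())
    return {k: v / evidence for k, v in joint.items()}

# Hypothetical labelled training data (scalar inputs).
xs = [0.1, -0.4, 0.3, 0.0, 2.2, 1.8, 2.0]
labels = [0, 0, 0, 0, 1, 1, 1]
model = fit_generative(xs, labels)
post = posterior(model, 1.9)
label = max(post, key=post.get)   # the decision step, applied to the posterior
print(post, label)
```

A discriminative model (b) would produce `post` without ever modelling the densities, and a discriminant function (c) would return only `label`.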

### <font color='red'>Posterior probability is very important to making decisions</font>
- **Minimizing risk**. As before, we minimize the expected loss of the diagnosis. If the loss matrix changes, we only need to update the decision rule, not redo the inference.
- **Reject option**. If the largest posterior probability falls in an ambiguous range where we do not trust an automatic classification, the system should refer the input to a human expert instead.
- **Compensating for class priors**. Consider our medical X-ray problem again. If we sample examples according to the real-world prior probabilities, then even a large data set will contain few cancer examples, so the learning algorithm will not be exposed to a broad range of such images and is unlikely to generalize well. <font color='red'>A balanced data set in which we have selected equal numbers of examples from each of the classes would allow us to find a more accurate model</font>. After training on such a balanced data set and obtaining the posterior probabilities, we first divide by the class fractions in that data set and then multiply by the class fractions in the population to which we wish to apply the model. Finally, we normalize to ensure that the new posterior probabilities sum to one.
- **Combining models**. For complex applications, we may wish to break the problem into a number of smaller subproblems, each of which can be tackled by a separate module. Consider two separate modules with different inputs $\mathbf{x}_I,\mathbf{x}_B$, which we assume to be conditionally independent given the class, so that
$$p(\mathbf{x}_I,\mathbf{x}_B|C_k)=p(\mathbf{x}_I|C_k)p(\mathbf{x}_B|C_k)$$
The posterior probability is given by
$$\begin{align*}p(C_k|\mathbf{x}_I,\mathbf{x}_B)&\propto p(\mathbf{x}_I,\mathbf{x}_B|C_k)p(C_k)\\
&\propto p(\mathbf{x}_I|C_k)p(\mathbf{x}_B|C_k)p(C_k)\\
&\propto \frac{p(C_k|\mathbf{x}_I)p(C_k|\mathbf{x}_B)}{p(C_k)}
 \end{align*}$$
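Each factor in the last line is easy to obtain: the two posteriors come from the separate modules, and the priors $p(C_k)$ can be estimated from the fraction of data points in each class. We then normalize the resulting posterior probabilities so they sum to one.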
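
The last three uses of the posterior can each be sketched in a few lines. The helper names, the rejection threshold `theta`, and all numbers below are hypothetical and serve only to illustrate the recipes described above.

```python
def normalize(values):
    """Rescale non-negative values so they sum to one."""
    total = sum(values)
    return [v / total for v in values]

def classify_or_reject(posterior, theta):
    """Return the most probable class index, or None to defer to an expert."""
    best = max(range(len(posterior)), key=lambda k: posterior[k])
    return best if posterior[best] >= theta else None

def correct_priors(posterior, train_priors, true_priors):
    """Divide by training-set class fractions, multiply by population fractions, renormalize."""
    return normalize([p / a * b for p, a, b in zip(posterior, train_priors, true_priors)])

def combine(post_i, post_b, priors):
    """p(C_k | x_I, x_B) is proportional to p(C_k | x_I) p(C_k | x_B) / p(C_k)."""
    return normalize([a * b / p for a, b, p in zip(post_i, post_b, priors)])

# Reject option with an assumed threshold of 0.9.
unsure = classify_or_reject([0.55, 0.45], theta=0.9)
sure = classify_or_reject([0.95, 0.05], theta=0.9)

# Posterior from a model trained on balanced data, corrected to a 1% cancer prior.
corrected = correct_priors([0.8, 0.2], train_priors=[0.5, 0.5], true_priors=[0.01, 0.99])

# Two modules whose inputs are assumed conditionally independent given the class.
combined = combine([0.7, 0.3], [0.6, 0.4], priors=[0.5, 0.5])
print(unsure, sure, corrected, combined)
```

In this example the balanced-data posterior of 0.8 for cancer drops to roughly 0.04 once the realistic prior is restored, which shows how much the correction can matter.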

## Loss function for regression
For regression, a common choice is the squared loss
$$L(t,y(\mathbf{x}))=\big(y(\mathbf{x})-t\big)^2$$
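More generally, we can use the Minkowski loss, where the exponent $q>0$ can take various values. The absolute value keeps the loss non-negative for any $q$, and $q=2$ recovers the squared loss.
$$L(t,y(\mathbf{x}))=\big|y(\mathbf{x})-t\big|^q$$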
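
The effect of $q$ can be explored numerically. The sketch below searches a grid for the single prediction $y$ that minimizes the total Minkowski loss over a few hypothetical target values. The grid search and the sample data are illustrative assumptions.

```python
def minkowski_loss(t, y, q):
    """Minkowski loss |y - t|^q for target t and prediction y."""
    return abs(y - t) ** q

def best_constant(targets, q, grid):
    """Return the grid value y that minimizes the total loss over the targets."""
    return min(grid, key=lambda y: sum(minkowski_loss(t, y, q) for t in targets))

targets = [0.0, 1.0, 2.0, 3.0, 10.0]         # hypothetical samples of t for one x
grid = [i / 100 for i in range(0, 1001)]     # candidate predictions 0.00 .. 10.00

y_q2 = best_constant(targets, 2, grid)       # squared loss
y_q1 = best_constant(targets, 1, grid)       # absolute loss
print(y_q2, y_q1)
```

For these samples the squared loss ($q=2$) is minimized at the mean, 3.2, while $q=1$ is minimized at the median, 2.0, so smaller $q$ makes the prediction less sensitive to the outlying value 10.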