In probabilistic machine learning we frame problems in terms of three quantities, $X$, $D$ and $\theta$, where $X$ is the unseen data or event we want to model, $D$ is the data we have observed, and $\theta$ represents the parameters of our model. We usually want to know $p(X|D)$, or $p(X|\theta)$ where $\theta$ is learned from $D$. <br>
Elements in $D$ and $X$ are often assumed iid (independent and identically distributed) given $\theta$, which means $p(X|\theta,D)=p(X|\theta)$. <br>
In general there are three options for doing predictions in this setting: <br>
1. Maximum Likelihood Estimation (MLE)
2. Maximum a posteriori (MAP)
3. Sum/Integrate out $\theta$

With option 1, we approximate $p(X|D)$ by $p(X|\theta)$, where $\theta = \text{argmax}_\theta p(D|\theta)$. That is, we choose the parameters that maximize the probability of the known data, then use those parameters to predict the unknown data. <br> <br>
With option 2, $p(X|D)$ is similarly approximated by $p(X|\theta)$, but with $\theta = \text{argmax}_\theta p(\theta|D)$. This is a subtle change: we now find the most probable value of $\theta$ given the data, which involves a prior: $\text{argmax}_\theta p(\theta|D) = \text{argmax}_\theta p(D|\theta)p(\theta)$ (using Bayes' rule and dropping the denominator $p(D)$, which does not depend on $\theta$).<br> <br>
In option 3, we expand $p(X|D) = \sum_{\theta} p(X,\theta|D) = \sum_{\theta} p(X|\theta,D)p(\theta|D)$, <br>
which can be further simplified to $\sum_{\theta} p(X|\theta)p(\theta|D)$ if we assume that the data is iid. <br>
So in the third case we weight the probability of $X$ conditioned on each $\theta$ by the probability of that $\theta$ conditioned on $D$, and sum over all values of $\theta$. To show these different approaches, consider the following example:

**Three-option bent coin example:** <br>
Consider a coin whose bentness is described by $\theta$, the probability that it lands heads. There are only three options for $\theta$: a fair coin (0.5), heads on both sides (1), and tails on both sides (0). <br>
Say I flip the coin twice and show the results to give $D$, the known data. I then flip it twice more to get the unknown data $X$. Then $p(D|\theta)$ and $p(X|\theta)$ are known, with one row per value of $\theta$ and one column per two-flip outcome: <br>

| $\theta$ | HH   | HT   | TH   | TT   |
|---------|------|------|------|------|
| **0**   | 0    | 0    | 0    | 1    |
| **0.5** | 0.25 | 0.25 | 0.25 | 0.25 |
| **1**   | 1    | 0    | 0    | 0    |

Say I observe $D=HH$ and want to know $p(X=HH|D=HH)$.
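
The likelihood table can be reproduced with a few lines of Python. This is a minimal sketch, assuming each flip is independent given $\theta$; the names `THETAS`, `seq_likelihood` and `table` are illustrative and not part of the original notes.

```python
from itertools import product

# The three candidate coins, written as the probability of heads.
THETAS = [0.0, 0.5, 1.0]


def seq_likelihood(seq, theta):
    """Probability of a flip sequence such as "HT" when p(heads) = theta."""
    p = 1.0
    for flip in seq:
        p *= theta if flip == "H" else 1.0 - theta
    return p


# All two-flip outcomes in the same order as the table columns.
outcomes = ["".join(flips) for flips in product("HT", repeat=2)]

# table[theta][outcome] = p(outcome | theta)
table = {
    theta: {seq: seq_likelihood(seq, theta) for seq in outcomes}
    for theta in THETAS
}

for theta, row in table.items():
    print(theta, row)
```

Because $D$ and $X$ are both two flips of the same coin, the same table serves as $p(D|\theta)$ and $p(X|\theta)$.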

**1: Max likelihood:**

The value of $\theta$ with the highest $p(D=HH|\theta)$ is $\theta=1$ (likelihood $1$, versus $0.25$ for $\theta=0.5$ and $0$ for $\theta=0$). Therefore my prediction is $p(X=HH|D=HH) \approx p(X=HH|\theta=1)=1$.
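
The maximum likelihood step amounts to an argmax over the three candidate coins. A small sketch, assuming independent flips; the helper `seq_likelihood` and the variable names are hypothetical.

```python
def seq_likelihood(seq, theta):
    """Probability of a flip sequence such as "HH" when p(heads) = theta."""
    p = 1.0
    for flip in seq:
        p *= theta if flip == "H" else 1.0 - theta
    return p


thetas = [0.0, 0.5, 1.0]  # the three candidate coins
D = "HH"                  # observed flips
X = "HH"                  # outcome we want to predict

# MLE: pick the theta that maximizes p(D | theta).
theta_mle = max(thetas, key=lambda t: seq_likelihood(D, t))

# Plug-in prediction p(X | theta_mle).
pred_mle = seq_likelihood(X, theta_mle)

print(theta_mle, pred_mle)  # 1.0 1.0
```

Note that the prior never appears here: MLE is certain the coin is double-headed after only two flips.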

**2: Maximum a posteriori:**

Say my prior $p(\theta)$ is $p(\theta=0)=0.1$,  $p(\theta=0.5)=0.8$,  $p(\theta=1)=0.1$. <br>
$p(\theta=0|D=HH) \propto 0\times 0.1$ <br>
$p(\theta=0.5|D=HH) \propto 0.25 \times 0.8$ <br>
$p(\theta=1|D=HH) \propto 1\times 0.1$ <br>

These values are $0$, $0.2$ and $0.1$ respectively, so the MAP estimate of $\theta$ is $0.5$.

Therefore my prediction is $p(X=HH|D=HH) \approx p(X=HH|\theta=0.5)=0.25$.
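
The MAP calculation differs from maximum likelihood only in that each likelihood is weighted by the prior before taking the argmax. A minimal sketch under the same independence assumption; `seq_likelihood`, `prior` and `unnorm_posterior` are illustrative names.

```python
def seq_likelihood(seq, theta):
    """Probability of a flip sequence such as "HH" when p(heads) = theta."""
    p = 1.0
    for flip in seq:
        p *= theta if flip == "H" else 1.0 - theta
    return p


# Prior over the three candidate coins, as given in the text.
prior = {0.0: 0.1, 0.5: 0.8, 1.0: 0.1}
D = "HH"  # observed flips
X = "HH"  # outcome we want to predict

# Unnormalized posterior: p(D | theta) * p(theta).
unnorm_posterior = {t: seq_likelihood(D, t) * p for t, p in prior.items()}

# MAP: the theta with the largest unnormalized posterior.
theta_map = max(unnorm_posterior, key=unnorm_posterior.get)

# Plug-in prediction p(X | theta_map).
pred_map = seq_likelihood(X, theta_map)

print(theta_map, pred_map)  # 0.5 0.25
```

Normalizing by $p(D)$ would not change which $\theta$ wins, which is why it can be skipped here.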

**3: Sum out $\theta$:**

We want $\sum_{\theta} p(X=HH|\theta)p(\theta|D=HH)$. From the MAP calculation we already have the unnormalized values of $p(\theta|D=HH)$. By Bayes' rule, $p(\theta|D) = p(D|\theta)p(\theta)/p(D)$. The denominator $p(D)$ does not depend on $\theta$, so it could be left out when finding the MAP estimate, but here we need it. <br>
Since the posterior must sum to one, $p(D=HH)$ is the sum of the unnormalized values: $p(D=HH) = 0+0.2+0.1 = 0.3$. <br>
Therefore: <br>
$p(\theta=0|D=HH)=0/0.3 = 0$ <br>
$p(\theta=0.5|D=HH)=0.2/0.3 = \frac{2}{3}$ <br>
$p(\theta=1|D=HH)=0.1/0.3 = \frac{1}{3}$ <br>
So: <br>
$\sum_{\theta} p(X=HH|\theta)p(\theta|D=HH) = 0\times 0 + 0.25 \times \frac{2}{3} + 1 \times \frac{1}{3} = 0.5$ <br>
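
The full calculation (normalize the posterior, then average the predictions over it) can be checked numerically. A sketch assuming independent flips and the prior given earlier; all names are illustrative.

```python
import math


def seq_likelihood(seq, theta):
    """Probability of a flip sequence such as "HH" when p(heads) = theta."""
    p = 1.0
    for flip in seq:
        p *= theta if flip == "H" else 1.0 - theta
    return p


prior = {0.0: 0.1, 0.5: 0.8, 1.0: 0.1}
D = "HH"  # observed flips
X = "HH"  # outcome we want to predict

# Unnormalized posterior p(D | theta) * p(theta), and its sum p(D).
unnorm = {t: seq_likelihood(D, t) * p for t, p in prior.items()}
evidence = sum(unnorm.values())

# Normalized posterior p(theta | D).
posterior = {t: u / evidence for t, u in unnorm.items()}

# Posterior predictive: sum over theta of p(X | theta) * p(theta | D).
pred = sum(seq_likelihood(X, t) * w for t, w in posterior.items())

print(posterior)
print(pred)

# Floating-point check against the hand calculation.
assert math.isclose(pred, 0.5)
```

The result sits between the MLE prediction of $1$ and the MAP prediction of $0.25$, because both surviving hypotheses contribute in proportion to their posterior weight.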

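This third option is the most principled, since it accounts for the remaining uncertainty in $\theta$ rather than committing to a single value, but it relies on calculating the normalized probabilities of $\theta$ given the data, which is intractable in many cases. One case where it is tractable is the beta-binomial model. <br>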