In [2]:
import numpy as np

In probabilistic machine learning we frame problems in terms of $X$, $D$ and $\theta$, where $X$ is the unseen data or event we want to predict, $D$ is the data we have already observed, and $\theta$ represents the parameters of our model. We usually want $p(X|D)$, or $p(X|\theta)$ where $\theta$ is learned from $D$. <br> 
Some models don't include a $\theta$. These are called non-parametric models, and often don't scale well with data.
In a parametric model, knowledge from the known data ($D$) is encoded into the parameters $\theta$, so once $\theta$ is known, $D$ adds nothing further: $p(X|\theta,D)=p(X|\theta)$. <br>
In general there are three options for doing predictions in this setting: <br>
1. Maximum Likelihood Estimation (MLE)
2. Maximum a posteriori (MAP)
3. Sum/Integrate out $\theta$

*Option 1:* We say $p(X|D)$ is essentially the same as $p(X|\theta)$, where $\theta = \text{argmax}_\theta p(D|\theta)$. That is, we maximize the probability of the known data under the model parameters, then use those parameters to predict our unknown data. <br> <br>
*Option 2:* Similarly, $p(X|D)$ is modeled as $p(X|\theta)$, but $\theta = \text{argmax}_\theta p(\theta|D)$. This is a subtle change: we now find the most likely value of $\theta$ given the data, which involves a prior: $ \text{argmax}_\theta p(\theta|D) = \text{argmax}_\theta p(D|\theta)p(\theta)$ (using Bayes' rule and dropping the denominator $p(D)$, which does not depend on $\theta$ and so does not affect the maximum)<br> <br>
*Option 3:* We expand $p(X|D) = \sum_{\theta} p(X,\theta|D) = \sum_{\theta} p(X|\theta,D)p(\theta|D) $ <br>
Which can be further simplified: $= \sum_{\theta} p(X|\theta)p(\theta|D)$, as we are using a parametric model. We multiply the probability of $X$ conditioned on $\theta$ by the probability of $\theta$ conditioned on $D$, then sum (or integrate) over $\theta$. This is the most powerful way to learn $p(X|D)$, but it needs the normalized posterior $p(\theta|D)$, which relies on finding $p(D)$, and that is often intractable. <br> <br>
To show these different approaches, consider the following example:

### Bent coin example:
Consider that I have a coin which has a bentness, $\theta$. This bentness gives the probability of getting heads on any throw. Say there are only 3 options for $\theta$: A fair coin ($\theta=0.5$), heads on both sides ($\theta=1$) and tails on both sides ($\theta=0$). My coin is one of those three options. <br>
Say I flip the coin once and record the result as $D$, the known data. I will then flip it again later to get $X$, the unknown data. Given $\theta$, the flips are independent, so $p(D|\theta)$ and $p(X|\theta)$ are both given by the table below (rows are values of $\theta$, columns are outcomes): <br>

| $\theta$ | H    |  T   |
|---------|------|------|
| **0**   | 0    | 1    |
| **0.5** | 0.5  | 0.5  |
| **1**   | 1    | 0    |

Say I observe $D=H$, and want to know $p(X=H|D=H)$.

**1: Max likelihood**

The value of $\theta$ with the highest $p(D|\theta)$ is $\theta=1$. Therefore, to predict the probability $p(X=H|D=H)$, I get $p(X=H|\theta=1)=1$

**2: Maximum a posteriori**

Say my prior $p(\theta)$ is $p(\theta=0)=0.1$,  $p(\theta=0.5)=0.8$,  $p(\theta=1)=0.1$. Then: <br>
$p(\theta=0|D=H) \propto 0\times 0.1 $ <br>
$p(\theta=0.5|D=H) \propto 0.5 \times 0.8 $ <br>
$p(\theta=1|D=H) \propto 1\times 0.1 $ <br>

These values are $0$, $0.4$ and $0.1$ respectively. So the MAP value of $\theta$ is $0.5$.

Therefore, to predict the probability $p(X=H|D=H)$, I get $p(X=H|\theta=0.5)=0.5$
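
The two plug-in estimates above can be sketched in a few lines of NumPy. This is only an illustration of the arithmetic, assuming the three candidate values of $\theta$ and the prior from the example; the variable names (`thetas`, `prior`, `likelihood_heads`) are made up here and are not part of the original notebook.

```python
import numpy as np

# Candidate values of theta and the prior from the example (illustrative names).
thetas = np.array([0.0, 0.5, 1.0])
prior = np.array([0.1, 0.8, 0.1])

# p(D=H | theta) is just theta for each candidate coin.
likelihood_heads = thetas

# MLE: maximize the likelihood alone.
theta_mle = thetas[np.argmax(likelihood_heads)]

# MAP: maximize likelihood * prior (the unnormalized posterior).
theta_map = thetas[np.argmax(likelihood_heads * prior)]

# Plug-in predictions: p(X=H | theta) = theta.
print("MLE theta =", theta_mle, "so p(X=H|D=H) is taken as", theta_mle)
print("MAP theta =", theta_map, "so p(X=H|D=H) is taken as", theta_map)
```

The only difference between the two estimates is the multiplication by `prior` before taking the argmax.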

**3: Sum out $\theta$**

We want $\sum_{\theta} p(X=H|\theta)p(\theta|D=H)$. We already have the unnormalized values of $p(\theta|D=H)$ from the MAP calculation. By Bayes' rule, $p(\theta|D) = p(D|\theta)p(\theta)\div p(D)$. $p(D)$ does not depend on $\theta$, so it could be dropped when finding the MAP estimate, but it is needed here. <br>
The posterior must sum to 1, so $p(D=H)$ is the sum of the unnormalized values: $0+0.4+0.1 = 0.5$. <br> 
Therefore: <br>
$p(\theta=0|D=H)=0 \div 0.5 = 0$ <br>
$p(\theta=0.5|D=H)=0.4 \div 0.5 = 0.8 $ <br>
$p(\theta=1|D=H)=0.1 \div 0.5 = 0.2 $ <br>
So: <br>
$\sum_{\theta} p(X=H|\theta)p(\theta|D=H) = 0\times 0 + 0.5 \times 0.8 + 1 \times 0.2 = 0.6 $ <br>
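
The normalization and the final sum can be checked with a short sketch, again assuming the three-valued $\theta$ and the prior from the example. The names `unnormalized`, `evidence`, `posterior` and `predictive` are illustrative choices, not from the original notebook.

```python
import numpy as np

thetas = np.array([0.0, 0.5, 1.0])   # candidate values of theta
prior = np.array([0.1, 0.8, 0.1])    # p(theta)

# p(D=H | theta) * p(theta): the unnormalized posterior.
unnormalized = thetas * prior

# p(D=H) is the sum of the unnormalized values.
evidence = unnormalized.sum()

# Normalized posterior p(theta | D=H).
posterior = unnormalized / evidence

# Posterior predictive: sum over theta of p(X=H | theta) * p(theta | D=H).
predictive = np.sum(thetas * posterior)

print("p(D=H) =", evidence)
print("p(theta|D=H) =", posterior)
print("p(X=H|D=H) =", predictive)
```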

This third option makes the most sense, but relies on calculating the normalized probabilities of $\theta$ given the data, which is intractable in many cases. One case where it is not is the beta-binomial model. <br>

**Very simple way of testing these models:** <br>
To test, assume the prior is correct: repeatedly draw a coin from the prior, flip it to get $D$, and throw away every trial that doesn't give $D=H$. For the trials that remain, flip again to get $X$, and count how often $X=H$. This is a Monte Carlo estimate (a form of rejection sampling), which is covered later.

In [15]:
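X_H_counts = 0      # accepted trials where the second flip X is heads
X_total_counts = 0  # accepted trials (those where D == "H")
for trial in range(10000):
    coin = np.random.choice([0,0.5,1],p=[0.1,0.8,0.1])  # draw theta from the prior
    D = np.random.choice(["H","T"],p=[coin,1-coin])     # first flip: the observed data
    if(D=="H"):                                         # keep only trials matching D=H
        X = np.random.choice(["H","T"],p=[coin,1-coin]) # second flip: the unseen data
        if(X=="H"):
            X_H_counts+=1
        X_total_counts+=1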

print("estimated p(X=H|D=H) =", np.round(X_H_counts/X_total_counts,3))

estimated p(X=H|D=H) = 0.594


As expected, the estimate is close to 0.6.

### Extension of the bent coin example:
Now consider a coin which can have any bentness in the range 0 to 1. $\theta$ is still the probability of getting heads. <br>
$p(D=H|\theta)=\theta$ <br>
$p(D=T|\theta)=1-\theta$ <br>
Say we have a uniform prior: <br>
$p(\theta)=1$ in the range $[0, 1]$ (the integral of $p(\theta)$ over this range is 1, so this is a valid distribution) <br>
Again, we want to know $p(X=H|D=H)$: <br>

**1: MLE** <br>
$p(X=H|\theta)=\theta.$ <br>
$\text{argmax}_\theta p(D=H|\theta) = \text{argmax}_\theta \theta = 1$ <br>
Therefore: $p(X=H|\theta)=1$ <br> <br>

**2: MAP** <br>
$\text{argmax}_\theta p(\theta|D=H) = \text{argmax}_\theta \theta\times1 = 1$ <br>
Therefore: $p(X=H|\theta)=1$ <br> <br>

**3: Integrate out $\theta$** <br>
$\int_\theta p(X=H|\theta)p(\theta|D=H) = \int_\theta \theta \times (\theta \times 1 \div p(D=H))$ <br>
So, now to find $p(D=H)$. Luckily, we can do that in this case: <br>
$\int_\theta p(\theta|D=H)=1$ <br>
So, the integral of the posterior must be 1. The full posterior using Bayes' rule is: $p(D=H|\theta)p(\theta)\div p(D=H)$. <br>
As $p(D=H)$ does not depend on $\theta$, it can be moved out of the integral over $\theta$:<br>
$1 = (\int_\theta p(D=H|\theta)p(\theta)) \div p(D=H)$ <br>
$1 = (\int_\theta \theta \times 1) \div p(D=H)$ <br>
The integral of $\theta$ between 0 and 1 is $\left[\frac{1}{2}\theta^2\right]_0^1 = 0.5$ <br>
So: $1 = 0.5 \div p(D=H)$. Therefore, $p(D=H)=0.5$<br>
Then: <br>
$\int_\theta p(X=H|\theta)p(\theta|D=H) = \int_\theta \theta \times (\theta \times 1 \div 0.5)$ <br>
$= \int_\theta {2 \theta^2 }$ <br>
$= \left[\frac{2}{3}\theta^3\right]_0^1$ <br>
$= \frac{2}{3}$
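
The two integrals in this derivation can also be checked numerically. The sketch below approximates them with a midpoint rule on a fine grid over $[0, 1]$; the grid size and the variable names are arbitrary choices made for illustration, and the result is only an approximation of the exact value.

```python
import numpy as np

# Midpoints of num_points equal-width cells covering [0, 1].
num_points = 100_000
theta = (np.arange(num_points) + 0.5) / num_points

prior = np.ones_like(theta)   # uniform prior p(theta) = 1
likelihood = theta            # p(D=H | theta) = theta

# Midpoint rule: on [0, 1] the integral is the mean of the integrand values.
evidence = np.mean(likelihood * prior)        # approximates p(D=H)
posterior = likelihood * prior / evidence     # p(theta | D=H) on the grid
predictive = np.mean(theta * posterior)       # approximates p(X=H | D=H)

print("p(D=H) ~", evidence)
print("p(X=H|D=H) ~", predictive)
```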

Another simple test, same as before (but with the uniform prior instead):

In [25]:
X_H_counts = 0
X_total_counts = 0
for trial in range(10000):
    coin = np.random.rand() # the uniform prior
    D = np.random.choice(["H","T"],p=[coin,1-coin])
    if(D=="H"):
        X = np.random.choice(["H","T"],p=[coin,1-coin])
        if(X=="H"):
            X_H_counts+=1
        X_total_counts+=1
print("estimated p(X=H|D=H) =", np.round(X_H_counts/X_total_counts,3))

estimated p(X=H|D=H) = 0.659


As predicted, the estimated probability is around 2/3.

### The Beta-Binomial model
Now the model is extended to handle any number of observed flips in $D$ (not just a single head). As all coin flip data is iid (independent, identically distributed) we still only need to care about $p(X=H|D)$ and $p(X=T|D)$. So the only things that change are the likelihood and the prior. <br>
For $n$ tosses of the coin, with $h$ heads and $n-h$ tails, the likelihood is: <br>
$p(D|\theta)=\theta^h(1-\theta)^{n-h}$ <br>
However, the above assumes we know the order of the coin flips. HHT and HTH should give the same final value for $\theta$, as the data is iid. If we say our data is only the numbers $h$ and $n$ (without the exact order), then we have the binomial distribution: <br>
$p(D|\theta)=\binom{n}{h}\theta^h(1-\theta)^{n-h}$ <br>
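
To see that the ordering does not matter for inference about $\theta$, the sketch below compares the ordered-sequence likelihood with the binomial one on a grid of $\theta$ values. The binomial coefficient is a constant in $\theta$, so both should peak at the same place and have the same shape once normalized. The function names, the grid, and the values $n=16$, $h=5$ are illustrative assumptions.

```python
import math
import numpy as np

def sequence_likelihood(theta, n, h):
    """Likelihood of one particular ordered sequence with h heads in n tosses."""
    return theta**h * (1 - theta)**(n - h)

def binomial_likelihood(theta, n, h):
    """Likelihood of seeing h heads in n tosses, in any order."""
    return math.comb(n, h) * sequence_likelihood(theta, n, h)

n, h = 16, 5
theta_grid = np.linspace(0.0, 1.0, 1001)

seq = sequence_likelihood(theta_grid, n, h)
binom = binomial_likelihood(theta_grid, n, h)

# The maximizing theta is the same for both likelihoods.
theta_best_seq = theta_grid[np.argmax(seq)]
theta_best_binom = theta_grid[np.argmax(binom)]

# After normalizing over the grid, the two curves coincide.
same_shape = np.allclose(seq / seq.sum(), binom / binom.sum())

print(theta_best_seq, theta_best_binom, same_shape)
```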
We also need a more informative prior than just a uniform distribution, and one which is defined over the range 0 to 1. <br>
We want a prior that looks similar to the likelihood to make the maths easy. The beta distribution is that prior: <br>
$\text{Beta}(\theta|a,b)\propto \theta^{a-1}(1-\theta)^{b-1}$ <br>
$a$ and $b$ are hyper-parameters. With $a=1$ and $b=1$ the distribution is uniform. <br>
Therefore: <br>
$p(D|\theta)p(\theta) \propto \binom{n}{h}\theta^h(1-\theta)^{n-h} \theta^{a-1}(1-\theta)^{b-1}$ <br>
$p(D|\theta)p(\theta) \propto \binom{n}{h}\theta^{h+a-1}(1-\theta)^{n-h+b-1}$ <br>
We can recognise the above as proportional to a beta distribution again, since the binomial coefficient does not depend on $\theta$. This makes the beta distribution a conjugate prior for the binomial likelihood. <br>
The normalization of the beta distribution is known for any $a$ and $b$, so plugging in $h+a$ and $n-h+b$ gives the full posterior (normalized): <br>

$p(\theta|D) = \text{Beta}(\theta|h+a,n-h+b)$ <br>
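
The conjugate update amounts to adding the observed counts to the hyper-parameters. The sketch below does that and then checks numerically that the unnormalized posterior integrates to the known beta normalizer $B(a', b') = \Gamma(a')\Gamma(b') \div \Gamma(a'+b')$. The helper names and the example values ($a=b=1$, $n=16$, $h=5$) are assumptions made for illustration.

```python
import math
import numpy as np

def beta_posterior_params(a, b, n, h):
    """Hyper-parameters of the Beta posterior after h heads in n tosses."""
    return a + h, b + (n - h)

def beta_function(a, b):
    """B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b), computed via log-gamma."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

a_post, b_post = beta_posterior_params(a=1, b=1, n=16, h=5)

# Midpoint-rule estimate of the integral of theta^(a'-1) (1-theta)^(b'-1) on [0, 1].
num_points = 200_000
theta = (np.arange(num_points) + 0.5) / num_points
unnormalized = theta ** (a_post - 1) * (1 - theta) ** (b_post - 1)
numeric_norm = unnormalized.mean()

print("posterior hyper-parameters:", a_post, b_post)
print("numeric normalizer:", numeric_norm)
print("beta function:     ", beta_function(a_post, b_post))
```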
So, now it is possible to integrate out $\theta$: <br>
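$\int_\theta p(X=H|\theta)p(\theta|D) = \int_\theta \theta \, \text{Beta}(\theta|h+a,n-h+b)$ <br>
This is the mean of the posterior. The mean of $\text{Beta}(\theta|\alpha,\beta)$ is $\alpha \div (\alpha+\beta)$, so this ends up simply being: <br>
$ (h+a) \div (h+a + n-h+b)$ <br>
$ = (h+a) \div (a+n+b)$ <br>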
**Demonstrating:** <br>

In [32]:
def beta_bin_model_prob_heads(a,b,n,h):
    """Posterior predictive p(X=H|D) for a Beta(a,b) prior after h heads in n tosses."""
    return (h+a)/(a+n+b)
print("prob for a single toss after seeing heads",beta_bin_model_prob_heads(1,1,1,1))

prob for a single toss after seeing heads 0.6666666666666666


This matches the $\frac{2}{3}$ derived above for the uniform prior with $D=H$. <br>
Checking with Monte Carlo sampling (for any $a$, $b$, $n$, $h$):

In [70]:
a = 1
b = 1
n = 16
h = 5

X_H_counts = 0
X_total_counts = 0
for trial in range(10000):
    coin = np.random.beta(a,b)
    D = np.random.choice(["H","T"],n,p=[coin,1-coin])
    heads = np.count_nonzero(D=="H")
    if(heads==h):
        X = np.random.choice(["H","T"],p=[coin,1-coin])
        if(X=="H"):
            X_H_counts+=1
        X_total_counts+=1

print("beta-binomial p(X=H|D) = ", beta_bin_model_prob_heads(a,b,n,h))
print("estimated p(X=H|D) =", np.round(X_H_counts/X_total_counts,3))

beta-binomial p(X=H|D) =  0.3333333333333333
estimated p(X=H|D) = 0.333


The predictions of the beta-binomial model agree with the results from sampling. Note, though, that with a large $n$ most sampled coins are thrown out, because few trials produce exactly $h$ heads, so this check becomes slow.
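
One way to avoid throwing samples away is to draw $\theta$ directly from the posterior $\text{Beta}(\theta|h+a,n-h+b)$ and then flip once for $X$. This is a different check from the rejection loop above, not a replacement for it: it assumes the conjugate posterior is correct rather than testing it from the prior. The seed, sample size and variable names below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, so the run is repeatable

a, b, n, h = 1, 1, 16, 5
num_samples = 200_000

# Draw theta straight from the Beta posterior: no trials are rejected.
theta_samples = rng.beta(a + h, b + (n - h), size=num_samples)

# One extra flip per sampled coin: heads with probability theta.
x_is_heads = rng.random(num_samples) < theta_samples

estimate = x_is_heads.mean()
exact = (h + a) / (a + n + b)

print("sampled p(X=H|D) =", np.round(estimate, 3))
print("closed form      =", exact)
```

Every draw is used here, so the cost does not grow as the observed count $h$ out of $n$ becomes rarer under the prior.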