In [None]:
import matplotlib.pyplot as plt
import numpy as np

from scipy import stats
from scipy.special import binom

## Intro

Our goal in Bayesian inference is to compute the posterior $p(\theta \mid D)$, which represents our updated beliefs about the parameters after we have seen the data. By Bayes' rule:

$$p(\theta \mid D) = \frac{p(D \mid \theta) p(\theta)}{p(D)}$$

We will start by focusing on just one of those terms, the "likelihood", $p(D \mid \theta)$.

## Just what is a likelihood anyway?

Let's call $\theta$ the probability that a coin lands heads (the "bias" of the coin). We toss the coin $n$ times and observe $y$ heads.

Let's assume we toss the coin 10 times ($n=10$) and see 8 heads ($y=8$). Using just the data, what is our best guess for the bias of the coin, $\theta$?


In this example the (correct) likelihood (or model) is given by a Binomial distribution:

$$p(D \mid \theta ) = \text{Bin}(n, \theta) = {n \choose y} \theta^y (1-\theta)^{n -y}$$

We can write this in Python as:

```python
def likelihood(n, y, theta):
    return binom(n, y) * theta**y * (1 - theta)**(n - y)
```

and evaluate it for different choices of $\theta$. Let's do that.

In [None]:
def likelihood(n, y, theta):
    return binom(n, y) * theta**y * (1 - theta) ** (n - y)

In [None]:
x_grid = np.linspace(0, 1, 101)
plt.plot(x_grid, [likelihood(10, 8, x) for x in x_grid], linewidth=5);

**Question**

Why is the peak where it is?
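
One way to probe this numerically, as a rough sketch: evaluate the likelihood on the same grid as the plot above and look up the grid point where it is largest. The names `theta_grid` and `theta_hat` are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy.special import binom


def likelihood(n, y, theta):
    return binom(n, y) * theta**y * (1 - theta) ** (n - y)


n, y = 10, 8
theta_grid = np.linspace(0, 1, 101)  # same 101-point grid as the plot above
values = likelihood(n, y, theta_grid)  # the function is vectorised over theta

# Grid point at which the likelihood is largest
theta_hat = theta_grid[np.argmax(values)]
print(theta_hat, y / n)
```

If the two printed numbers agree, that suggests the peak sits at the observed fraction of heads, $y / n$.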

## Beta-Binomial model

We can turn this into a fully Bayesian model by using a prior distribution, $p(\theta)$, to represent our belief about the bias of the coin before we have seen any data (i.e. before we have tossed it). This section walks through how to calculate the posterior distribution in closed form.

As before, let's call $\theta$ the probability of tossing heads from a coin. We toss the coin $n$ times and observe $y$ heads. We are interested in calculating $p(\theta \mid D)$.

The likelihood (or model) is the following:

$$p(D \mid \theta ) = \text{Bin}(n, \theta) = {n \choose y} \theta^y (1-\theta)^{n -y}$$

and we know: 

$${n \choose y} = \frac{n!}{y! (n-y)!}$$

If we assume a uniform prior, $p(\theta) = 1$, then the posterior is proportional to:

\begin{align}
p(\theta \mid D) &\propto p(D \mid \theta) \, p(\theta) \\
&\propto \frac{n!}{y! (n-y)!} \theta^y (1-\theta)^{n -y} \cdot 1 \\
&\propto \theta^y (1-\theta)^{n -y}
\end{align}

where we dropped all terms that didn't depend on $\theta$, including $p(D)$.

Recall that a Beta distribution has PDF:

$$\text{Beta}(\alpha, \beta) = \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{\text{B}(\alpha, \beta)}$$

where $$\text{B}(\alpha, \beta) = \frac{\Gamma(\alpha) \, \Gamma(\beta)} {\Gamma(\alpha + \beta)}$$
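
As a quick sanity check, the PDF above can be coded by hand and compared with `scipy.stats.beta`. This is only a sketch: the helper `beta_pdf` and the parameter values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from scipy import stats
from scipy.special import gamma


def beta_pdf(theta, alpha, beta):
    # Normalising constant B(alpha, beta), written with Gamma functions
    norm = gamma(alpha) * gamma(beta) / gamma(alpha + beta)
    return theta ** (alpha - 1) * (1 - theta) ** (beta - 1) / norm


theta_grid = np.linspace(0.01, 0.99, 99)
alpha, beta = 9, 3  # arbitrary illustrative values

# Largest absolute gap between the hand-written PDF and SciPy's
max_diff = np.max(
    np.abs(beta_pdf(theta_grid, alpha, beta) - stats.beta(alpha, beta).pdf(theta_grid))
)
print(max_diff)
```

The printed difference should be at the level of floating-point rounding error.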

Note that we only care about terms involving $\theta$; everything else is treated as a constant.

We can thus compare the terms in the posterior expression with the PDF for a Beta distribution and deduce that the posterior is given by:

$$p(\theta \mid D) = \text{Beta}(y + 1, n - y + 1)$$

when we assume a uniform prior. A helpful fact is that $\text{Beta}(1, 1)$ is actually the uniform distribution. 
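
The uniform-prior result can be checked numerically. The sketch below normalises $\theta^y (1-\theta)^{n-y}$ on a grid and compares it with the $\text{Beta}(y + 1, n - y + 1)$ PDF. The grid size and variable names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

n, y = 10, 8
theta_grid = np.linspace(0, 1, 10_001)
dtheta = theta_grid[1] - theta_grid[0]

# Unnormalised posterior: likelihood kernel times the uniform prior (= 1)
unnormalised = theta_grid**y * (1 - theta_grid) ** (n - y)

# Normalise with a simple Riemann sum so the curve integrates to roughly 1
grid_posterior = unnormalised / (unnormalised.sum() * dtheta)

exact_posterior = stats.beta(y + 1, n - y + 1).pdf(theta_grid)
max_abs_error = np.max(np.abs(grid_posterior - exact_posterior))
print(max_abs_error)
```

The remaining error comes from the numerical integration and should shrink as the grid gets finer.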

By following the same reasoning as above, if we switched our prior from uniform to a $\text{Beta}(\alpha, \beta)$ distribution we would have a posterior of:

\begin{align}
p(\theta \mid D) &\propto p(D \mid \theta) \, p(\theta) \\
&\propto \underbrace{\theta^y (1-\theta)^{n -y}}_{\text{Likelihood}} \underbrace{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}_{\text{Beta prior}} \\
&\propto \theta^{y + \alpha - 1} (1-\theta)^{n - y + \beta - 1} \\
&\propto \text{Beta}(y + \alpha, n - y + \beta)
\end{align}
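
In code, the conjugate update is just addition on the two Beta parameters. A minimal sketch follows, assuming a hypothetical helper `update_beta` and an illustrative $\text{Beta}(2, 2)$ prior, neither of which appears elsewhere in this notebook.

```python
from scipy import stats


def update_beta(alpha, beta, y, n):
    """Return the posterior Beta parameters after y heads in n tosses."""
    return alpha + y, beta + (n - y)


# Illustrative prior Beta(2, 2) combined with 8 heads in 10 tosses
post_alpha, post_beta = update_beta(alpha=2, beta=2, y=8, n=10)
posterior = stats.beta(post_alpha, post_beta)

print(post_alpha, post_beta)  # posterior parameters
print(posterior.mean())  # posterior mean, alpha / (alpha + beta)
```

With these numbers the posterior is $\text{Beta}(10, 4)$, whose mean $10/14$ lies between the prior mean $0.5$ and the observed fraction $0.8$.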

## Examples!

Let's now look at some code examples to make this clearer.

In [None]:
def likelihood(n, y, theta):
    return binom(n, y) * theta**y * (1 - theta) ** (n - y)

In [None]:
x_grid = np.linspace(0, 1, 1_000)

In [None]:
n = 10
y = 8

a = 25
b = 25

prior = stats.beta(a, b)
posterior = stats.beta(y + a, n - y + b)
likelihoods = [likelihood(n, y, p) for p in x_grid]

f, ax = plt.subplots(figsize=(16, 8))
ax.plot(x_grid, prior.pdf(x_grid), label="prior, $p(\\theta)$", linewidth=5)
ax.plot(
    x_grid, posterior.pdf(x_grid), label="posterior, $p(\\theta \\mid D)$", linewidth=5
)
ax2 = ax.twinx()
ax2.plot(
    x_grid,
    likelihoods,
    label="likelihood, $p(D \\mid \\theta )$",
    linewidth=5,
    color="green",
)
ax.axvline(y / n, color="red", label="MLE")
ax.grid()
ax.legend()
ax2.legend(loc="upper left");

## Extra: simulate

Here we show the change in posterior as we see more and more data.
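
Because the posterior after one batch of tosses can serve as the prior for the next, updating one toss at a time should give the same result as a single update with the totals. The sketch below checks this. The seed, the sample size, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
tosses = rng.binomial(1, 0.75, size=50)  # 1 = heads, 0 = tails

a, b = 3, 3  # prior Beta(3, 3)

# Sequential: update after each toss, reusing the posterior as the next prior
a_seq, b_seq = a, b
for toss in tosses:
    a_seq += toss
    b_seq += 1 - toss

# Batch: apply the closed-form update once with the totals
heads = tosses.sum()
a_batch = a + heads
b_batch = b + len(tosses) - heads

print((a_seq, b_seq), (a_batch, b_batch))
```

Both routes should print the same pair of parameters.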

In [None]:
true_theta = 0.75
a = 3
b = 3
prior = stats.beta(a, b)

# Sample some data:
num_tosses = 250
samples = np.random.binomial(1, true_theta, num_tosses)
print(f"Sample has mean: {samples.mean()} and true coin bias is: {true_theta}")

In [None]:
fracs = [0.01, 0.05, 0.1, 0.25, 0.5, 1]
ns = np.round(np.array(fracs) * num_tosses).astype(int)

f, ax = plt.subplots(2, 3, figsize=(18, 10))
ax = ax.ravel()

for i in range(len(ns)):
    ax[i].plot(x_grid, prior.pdf(x_grid), label="prior", linewidth=5)
    heads = samples[: ns[i]].sum()
    post = stats.beta(heads + a, ns[i] - heads + b)
    ax[i].plot(x_grid, post.pdf(x_grid), label="posterior", linewidth=5)
    ax[i].set_title(f"Number of samples seen: {ns[i]}, heads so far: {heads}")
    ax[i].axvline(true_theta, color="grey", linestyle="--")
    ax[i].legend()

plt.tight_layout()
plt.suptitle(
    f"True $\\theta$: {true_theta}, actual heads in sample: {samples.sum()} => MLE {samples.mean():.2f}",
    fontsize=14,
    y=1.03,
);