# Bayesian Workflow

This notebook covers the three steps of the Bayesian workflow in the context of a coin-tossing example. More precisely, we have a coin and wish to determine the probability $\theta$ of getting heads by flipping it $n$ times and observing $y$ heads. Let us assume that we were lazy and tossed the coin only 3 times, and that it came up heads every time.

In [None]:
N_HEADS = 3
N_TRIALS = 3

## 1. Defining a joint probability model

#### Prior

Let’s suppose we think all values of $\theta$ are equally likely. Then our prior is simply a uniform distribution, i.e., $\theta \sim \text{Unif}(0,1)$.

<!-- The associated probability density function (pdf) is,  -->

<!-- $$ p(\theta) = \begin{cases} 
      1 & 0\leq \theta \leq 1 \\
      0 & \text{otherwise}. 
   \end{cases}$$ -->



In [None]:
from scipy.stats import uniform

In [None]:
def prior(theta):
    return uniform.pdf(theta)

#### Sampling distribution

If the probability of heads $\theta$ is known, then the probability of observing $y$ heads in $n$ flips is described by the binomial distribution, i.e., $y \sim \text{Binomial}(n, \theta)$.
<!-- The associated probability density function is, -->

<!-- $$ p(y \vert n, \theta) = {n \choose y} \theta^y (1 - \theta)^{(n - y)}.$$ -->

In [None]:
from scipy.stats import binom

In [None]:
def likelihood(theta, y=N_HEADS, n=N_TRIALS):
    return binom.pmf(y, n, theta)

## 2. Conditioning on the observed data

We want to evaluate the distribution of the parameters given the observed data,

$$p(\theta | y) $$

This is known as the posterior distribution and represents our beliefs about the parameters after having observed the data.
Bayes’ rule tells us how to calculate the posterior from the sampling distribution and the prior:

$$
p(\theta | y)  \propto p(\theta)p(y|\theta)  \propto \theta^y (1-\theta)^{(n-y)}\label{eq1}\tag{1}
$$

The right-hand side has the form of an unnormalised Beta distribution, so we can immediately jump to the answer,

$$
\theta|y \sim \text{Beta}(y+1, n-y+1).\label{eq2}\tag{2}
$$
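As a sanity check on equation (2), the sketch below normalises the product of prior and likelihood numerically and compares it with the Beta density on a grid. The names `y_obs`, `n_obs` and `unnormalised_posterior` are illustrative and not part of the notebook; the data values mirror `N_HEADS` and `N_TRIALS`.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta as beta_dist
from scipy.stats import binom, uniform

# Hypothetical names; values mirror N_HEADS and N_TRIALS from the notebook.
y_obs, n_obs = 3, 3


def unnormalised_posterior(theta):
    # prior density times likelihood, i.e. the right-hand side of Bayes' rule
    return uniform.pdf(theta) * binom.pmf(y_obs, n_obs, theta)


# The normalising constant is the integral of prior * likelihood over [0, 1].
evidence, _ = quad(unnormalised_posterior, 0, 1)

grid = np.linspace(0.01, 0.99, 99)
numerical = unnormalised_posterior(grid) / evidence
closed_form = beta_dist.pdf(grid, y_obs + 1, n_obs - y_obs + 1)

max_abs_diff = np.max(np.abs(numerical - closed_form))
print(max_abs_diff)
```

If the two curves agree, the printed difference should be at the level of floating-point noise.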

### Intuition for Beta distribution

This graph plots the density of Beta($\alpha$, $\beta$) where $\alpha$ and $\beta$ are controlled by input fields.

In [None]:
import ipywidgets as widgets
import matplotlib.pyplot as plt
from IPython.display import clear_output, display
import numpy as np
from scipy.stats import beta as beta_dist

In [None]:
# turn off interactive mode so plots aren't auto-displayed
plt.ioff()
out = widgets.Output()

f, ax = plt.subplots(figsize=(8, 6))

x = np.linspace(0, 1, 1001)
y = beta_dist.pdf(x, 5, 5)
(line,) = ax.plot(x, y)
# invisible point to make sure y axis always starts at zero
ax.plot(0, 0, alpha=0)
ax.set_xlabel("x", fontsize=16)
ax.set_ylabel("Density", fontsize=16)


def update_plot(alpha, beta):
    line.set_ydata(beta_dist.pdf(x, alpha, beta))
    # recompute the ax.dataLim
    ax.relim()
    # update ax.viewLim using the new dataLim
    ax.autoscale_view()
    with out:
        clear_output(wait=True)
        display(f)


def make_input(desc):
    return widgets.BoundedIntText(
        value=5,
        min=1,  # Beta parameters must be strictly positive
        max=1000,
        step=1,
        description=desc,
    )


widgets.interact(update_plot, alpha=make_input("Alpha"), beta=make_input("Beta"))
out

In [None]:
# Close figure window so that new plots won't show previous results
plt.close()

### Compute the posterior

**Exercise:** Define a function that computes the posterior pdf.

In [None]:
def posterior(theta, y=N_HEADS, n=N_TRIALS):
    return beta_dist.pdf(theta, y + 1, n - y + 1)

### Plot prior, likelihood and posterior

In [None]:
theta_vals = np.linspace(0, 1, 100)
prior_vals = prior(theta_vals)
likelihood_vals = likelihood(theta_vals)
posterior_vals = posterior(theta_vals)

In [None]:
plt.plot(theta_vals, prior_vals, label="Prior")
plt.plot(theta_vals, likelihood_vals, label="Likelihood")
plt.plot(theta_vals, posterior_vals, label="Posterior")
plt.xlabel("Theta")
plt.legend()
plt.show()

We see that the posterior is heavily skewed towards 1, following the tendency of the likelihood. Since we assumed a uniform prior, the value of $\theta$ that maximises the posterior is the same one that maximises the likelihood.

### Comparison to frequentist approach

Despite having observed only heads, our posterior still allows for the probability of tails being greater than 0, which seems sensible. How does this compare to the frequentist approach?

The maximum likelihood estimate (MLE) is a classic frequentist estimate: it is the parameter value that maximises the likelihood $p(y \vert \theta)$ (the number of trials $n$ is omitted from the notation), that is,

$$\hat\theta_{\text{MLE}} =  \arg \max_{\theta} p(y \vert \theta) $$

From the likelihood plot we see that $\hat\theta_{\text{MLE}} = 1$, which suggests predicting that every coin toss will come up heads! This is an extreme example, but in general small sample sizes can lead to bad inferences. Moderating our inferences with a well-chosen prior can protect against this.
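To make the comparison concrete, here is a minimal sketch that locates the MLE on a grid and sets it next to the mean of the Beta$(y+1, n-y+1)$ posterior from the uniform prior. The names `y_obs`, `n_obs` and `theta_grid` are illustrative, and the grid search is only one simple way to find the maximum.

```python
import numpy as np
from scipy.stats import binom

# Hypothetical names; values mirror N_HEADS and N_TRIALS from the notebook.
y_obs, n_obs = 3, 3

# Evaluate the likelihood on a grid and pick the maximiser.
theta_grid = np.linspace(0, 1, 1001)
likelihood_grid = binom.pmf(y_obs, n_obs, theta_grid)
theta_mle = theta_grid[np.argmax(likelihood_grid)]

# Mean of Beta(a, b) is a / (a + b), here with a = y + 1 and b = n - y + 1.
posterior_mean = (y_obs + 1) / (n_obs + 2)

print(theta_mle, posterior_mean)
```

The MLE sits on the boundary at 1, while the posterior mean is 0.8, so the Bayesian point estimate still leaves room for tails.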

## 3. Evaluate model fit and then assess the implications

Now that we have the posterior, we can compute means, quantiles and more. But first let us check whether we are happy with our model.

### Check model fit

In [None]:
from scipy.integrate import quad

**Exercise:** Compute the posterior probability that $\theta > 0.5$, i.e. $P(\theta > 0.5 \vert y)$.

(Tip: you can use ```quad``` to integrate.)

In [None]:
result, numerical_error = quad(posterior, 0, 0.5)
print(1.0 - result)

This is a mighty large probability given we had only 3 observations! Are we really confident we should trust this model? In reality we might be pretty sure that the probability of heads is close to $1/2$ even without flipping the coin. We can build this belief into the model by changing the prior.

### Redefine prior

But by what choice should we replace our current prior? We know that $\theta$ is bounded to $[0,1]$. The only distribution we know that satisfies this is the Beta distribution (with the uniform distribution being the special case of $\alpha=\beta=1$). Since we know that most coins are fairly symmetric (hence fair), we are going to choose a Beta distribution with mean at $1/2$.

But how concentrated around 0.5 do we want it to be? Surely more than the uniform distribution, since we already tried that and weren't satisfied with our model.

**Exercise:** Play around with the parameters $\alpha$ and $\beta$ in the section "Intuition for Beta distribution" until you find a distribution that captures what you think is true about $\theta$. Hint: For it to have mean 0.5, you will need both parameters to be equal.
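As a rough guide for this choice, the sketch below prints the standard deviation and the central 90% interval of a symmetric Beta$(a, a)$ prior for a few values of $a$. The candidate values and the names `candidates`, `spreads` and `intervals` are illustrative, not a recommendation.

```python
from scipy.stats import beta as beta_dist

# Hypothetical candidate values for alpha = beta.
candidates = (1, 5, 20, 100)

spreads = {}
intervals = {}
for a in candidates:
    prior_candidate = beta_dist(a, a)
    spreads[a] = prior_candidate.std()
    # 5th and 95th percentiles bound the central 90% of the prior mass
    intervals[a] = tuple(prior_candidate.ppf([0.05, 0.95]))
    low, high = intervals[a]
    print(f"alpha = beta = {a:>3}: sd = {spreads[a]:.3f}, "
          f"central 90% interval = [{low:.3f}, {high:.3f}]")
```

Larger values of $a$ concentrate the prior more tightly around 0.5, which corresponds to a stronger belief that the coin is fair.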

In [None]:
ALPHA = 20
BETA = 20

In [None]:
def new_prior(theta, alpha=ALPHA, beta=BETA):
    return beta_dist(alpha, beta).pdf(theta)

### Recompute posterior

It turns out that if we replace the uniform prior in equation (1) by Beta$(\alpha, \beta)$, the posterior is still Beta-distributed. More precisely,

$$
\theta|y \sim \text{Beta}(y+\alpha, n-y+\beta).
$$

Note that formula (2) can be seen as a special case of this (for $\alpha=\beta=1$).
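A short sketch of why this holds, using the same symbols as above: multiplying the Beta$(\alpha, \beta)$ prior density by the binomial likelihood and dropping every factor that does not depend on $\theta$ gives

$$
p(\theta | y) \propto \underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{prior}} \; \underbrace{\theta^{y}(1-\theta)^{n-y}}_{\text{likelihood}} = \theta^{(y+\alpha)-1}(1-\theta)^{(n-y+\beta)-1},
$$

which is the unnormalised density of Beta$(y+\alpha, n-y+\beta)$.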

**Exercise:** Write a function which computes the posterior pdf.

In [None]:
def new_posterior(theta, y=N_HEADS, n=N_TRIALS, alpha=ALPHA, beta=BETA):
    return beta_dist(y + alpha, n - y + beta).pdf(theta)

In [None]:
new_prior_vals = new_prior(theta_vals)
new_posterior_vals = new_posterior(theta_vals)

In [None]:
plt.plot(theta_vals, new_prior_vals, label="Prior")
plt.plot(theta_vals, likelihood_vals, label="Likelihood")
plt.plot(theta_vals, new_posterior_vals, label="Posterior")
plt.legend()
plt.xlabel("Theta")
plt.show()

The posterior is still pulled towards 1 by the likelihood, but now the prior has a considerably stronger influence. Let us check whether the model fits our assumptions better than the previous one by computing the posterior probability that $\theta > 0.5$.

In [None]:
result, numerical_error = quad(new_posterior, 0, 0.5)
print(1.0 - result)

This is already considerably more plausible than the previous result.
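Both tail probabilities can also be cross-checked without numerical integration, since the posterior is a Beta distribution and SciPy exposes its survival function `sf` (one minus the cdf). The helper `prob_theta_above_half` and the names `y_obs`, `n_obs` below are illustrative only.

```python
from scipy.stats import beta as beta_dist

# Hypothetical names; values mirror N_HEADS and N_TRIALS from the notebook.
y_obs, n_obs = 3, 3


def prob_theta_above_half(alpha_prior, beta_prior):
    # Posterior under a Beta(alpha_prior, beta_prior) prior
    post = beta_dist(y_obs + alpha_prior, n_obs - y_obs + beta_prior)
    # sf(x) = 1 - cdf(x) = P(theta > x | y)
    return post.sf(0.5)


p_uniform = prob_theta_above_half(1, 1)  # uniform prior
p_informative = prob_theta_above_half(20, 20)  # Beta(20, 20) prior
print(p_uniform, p_informative)
```

With the uniform prior the posterior is Beta(4, 1), whose cdf is $\theta^4$, so the first value is $1 - 0.5^4 = 0.9375$.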

### Posterior Statistics

Now that we are happy with our model we want to compute some statistics which describe $\theta$'s distribution.

#### A. Expectation value

**Exercise:** Compute the expected value of $\theta$ under the posterior. Note that this is defined by

$$\mathbb{E}\left[\theta \vert y\right] = \int_0^{1} \theta\ p(\theta \vert y) d\theta $$

(Tip: you can do the integration by using ```quad```.)

In [None]:
# theta only takes values in [0, 1], so integrate over that range
mean = quad(
    lambda theta: theta * new_posterior(theta),
    0,
    1,
)[0]

print(mean)
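The integral can be checked against the known mean of a Beta$(a, b)$ distribution, $a / (a + b)$. The sketch below compares the closed form, SciPy's `.mean()` and a `quad` integration; the names `y_obs`, `n_obs`, `alpha_prior` and `beta_prior` are illustrative and mirror the notebook's constants.

```python
from scipy.integrate import quad
from scipy.stats import beta as beta_dist

# Hypothetical names; values mirror N_HEADS, N_TRIALS, ALPHA and BETA.
y_obs, n_obs = 3, 3
alpha_prior, beta_prior = 20, 20

post = beta_dist(y_obs + alpha_prior, n_obs - y_obs + beta_prior)

# Mean of Beta(a, b) is a / (a + b); here a + b = n + alpha + beta.
mean_closed_form = (y_obs + alpha_prior) / (n_obs + alpha_prior + beta_prior)
mean_numerical, _ = quad(lambda theta: theta * post.pdf(theta), 0, 1)

print(mean_closed_form, post.mean(), mean_numerical)
```

All three numbers should agree up to the integration tolerance.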

#### B. Event Probabilities

The probability of $\theta \in [\theta_1, \theta_2]$ under the posterior can be computed by

$$ \int_{\theta_1}^{\theta_2} p(\theta \vert y) \mathrm{d} \theta.$$

We can use this to find out what the chances are that the coin is fair, given our observations. But what do we mean by fair? In practice we know that there's no such thing as a fair coin. Every coin will have some minor imperfections that make it not symmetric. We usually do not care about those imperfections since we have some "error tolerance". More technically: Being a continuous variable, there is 0% probability that $\theta$ is exactly $0.5$. What we should care about is the probability mass around a neighbourhood of 0.5. 

**Exercise:** Set a tolerance and evaluate the probability that the coin is fair within your tolerance.

(Tip: this again requires integration.)

In [None]:
tolerance = 0.01
result, numerical_error = quad(new_posterior, 0.5 - tolerance, 0.5 + tolerance)
print(result)
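Because the posterior is a Beta distribution, the same probability can be written as a difference of two cdf values. The sketch below does this for a few tolerances; the helper `prob_fair` and the chosen tolerances are illustrative, and the posterior parameters assume the Beta(20, 20) prior with 3 heads in 3 tosses.

```python
from scipy.integrate import quad
from scipy.stats import beta as beta_dist

# Beta(y + alpha, n - y + beta) with y = n = 3 and alpha = beta = 20
post = beta_dist(3 + 20, 3 - 3 + 20)


def prob_fair(tolerance):
    # P(0.5 - tolerance <= theta <= 0.5 + tolerance | y)
    return post.cdf(0.5 + tolerance) - post.cdf(0.5 - tolerance)


for tol in (0.01, 0.05, 0.1):
    print(tol, prob_fair(tol))

# Cross-check one value against direct numerical integration of the pdf
check, _ = quad(post.pdf, 0.49, 0.51)
```

The answer depends strongly on the tolerance, which is why "fair" has to be defined before the question can be answered.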

#### C. Quantiles

For a given probability $P \in [0,1]$, the associated posterior quantile is the largest value $x$ for which the posterior mass below $x$ does not exceed $P$:

$$ \max \left\{ x \in [0,1] : \int_0^{x} \ p(\theta \vert y) d\theta \le P \right\}$$

**Exercise:** Define a function which computes the posterior quantiles. Use it to compute the 5th and 95th percentiles.

(Tip: You can use the inverse cdf method ```.ppf``` from scipy's Beta distribution.)

In [None]:
def quantile_function(p, y=N_HEADS, n=N_TRIALS, alpha=ALPHA, beta=BETA):
    return beta_dist(y + alpha, n - y + beta).ppf(p)

In [None]:
print(quantile_function(0.05))
print(quantile_function(0.95))
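Together, the 5th and 95th percentiles bound a central 90% credible interval. The sketch below checks that `.ppf` really inverts `.cdf` and that SciPy's `.interval` method returns the same pair of endpoints; the posterior parameters assume the Beta(20, 20) prior with 3 heads in 3 tosses, and the variable names are illustrative.

```python
from scipy.stats import beta as beta_dist

# Beta(y + alpha, n - y + beta) with y = n = 3 and alpha = beta = 20
post = beta_dist(3 + 20, 3 - 3 + 20)

q05, q95 = post.ppf([0.05, 0.95])

# The cdf evaluated at a quantile should return the original probability.
roundtrip_05 = post.cdf(q05)
roundtrip_95 = post.cdf(q95)

# Equal-tailed interval containing 90% of the posterior mass
interval_low, interval_high = post.interval(0.90)

print(q05, q95)
print(interval_low, interval_high)
```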