# "MCMC From Scratch I: Bayesian Statistics"
> "An introduction to Markov Chain Monte Carlo for Bayesian Statistics"

- toc: false
- branch: master
- badges: true
- comments: true
- author: Rik de Kort
- categories: [mcmc]

This is the first in a series of notebooks on solving the Eight Schools problem from Bayesian Data Analysis from scratch in Python. I developed these notebooks for the bi-weekly knowledge sharing sessions between Data Scientists we have at my company. Their aim is to give a standalone introduction to the basics of Markov Chain Monte Carlo based on a practical example, as opposed to first developing a lot of theory.  

This notebook is a (very) gentle introduction to the Eight Schools problem and the Bayesian approach used to solve it.

Enjoy!

In [1]:
#hide
import altair as alt
import numpy as np
import pandas as pd
from scipy.stats import norm, uniform

from mcmc import json_dir

alt.data_transformers.register('json_dir', json_dir)
alt.data_transformers.enable('json_dir', data_dir='altairdata')

DataTransformerRegistry.enable('json_dir')

# The problem we're trying to solve
We are investigating the effect of a new way of teaching something (say, Bayesian statistics) in 8 different schools. We find the following data, which of course we visualize.

In [2]:
effect_estimates = np.array([28, 8, -3, 7, -1, 1, 18, 12])
std_estimates = np.array([15, 10, 16, 11, 9, 11, 10, 18])
school_names = ["A", "B", "C", "D", "E", "F", "G", "H"]

df = pd.DataFrame({"effect_estimate": effect_estimates,
                   "std_estimate": std_estimates,
                   "school": school_names})

bar_chart = alt.Chart(df).mark_bar().encode(
    x="school",
    y=alt.Y("effect_estimate", title="effect"),
    tooltip="mean(effect_estimate)"
)
mean = alt.Chart(df).mark_rule().encode(
    y="mean(effect_estimate)",
    tooltip="mean(effect_estimate)"
)

(bar_chart + mean).properties(
    width=768
).configure_axisX(
    labelAngle=0
)

There seems to be a big disparity between the schools. Is the true effect for school A really as high as 28? That seems unlikely; it could just as well be the result of measurement error. On the other hand, if we pool all the effect estimates, we get a mean effect size of 8.75 (the horizontal line in the graphic), which seems low for school A. So what really is the effect for school A?
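
To put rough numbers on the two extremes, here is a minimal sketch that recomputes the pooled mean and asks how many standard errors school A sits above it. The name `z_A` is introduced here purely for illustration.

```python
import numpy as np

# Observed data, copied from the table above
effect_estimates = np.array([28, 8, -3, 7, -1, 1, 18, 12])
std_estimates = np.array([15, 10, 16, 11, 9, 11, 10, 18])

# Complete pooling: one shared effect for all schools
pooled_mean = effect_estimates.mean()

# Distance of school A from the pooled mean, in units of its own standard error
z_A = (effect_estimates[0] - pooled_mean) / std_estimates[0]

print(pooled_mean)     # 8.75
print(round(z_A, 2))   # 1.28
```

School A sits only a little more than one standard error above the pooled mean, so the data alone cannot settle which of the two extremes is closer to the truth.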

# A third option
What we are looking for is the "true" effect size for school A. This is potentially different from the *observed* effect estimate of 28. But this is something we can model! If we call the true effect size `t_A`, and we use the standard deviation estimate from the table, we can model the estimated effect size by:

In [3]:
t_A = 42  # Suppose it's known
effect_estimate_A = norm.rvs(loc=t_A, scale=15)

Clearly, we can extend this for schools B, C,..., H. We then have a list of true effect sizes `t`, and we model the estimated effect sizes using the standard deviation estimates from our data:

In [4]:
t = [42, 37, 23, 17, 19, 101, 0, -5]  # Suppose it's known
effect_estimates = norm.rvs(loc=t, scale=std_estimates)

In this model, each school still has its own effect size, with no relation between them. In order to draw this connection, we suppose that all the true parameters come from some shared distribution with mean `mu` and standard deviation `sigma`. That is:

In [5]:
mu, sigma = 42, 37  # Suppose they're known
t = norm.rvs(loc=mu, scale=sigma, size=8)
effect_estimates = norm.rvs(loc=t, scale=std_estimates)
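
One way to get a feel for this two-level model is to simulate it many times. The sketch below uses the same made-up `mu` and `sigma`, draws with NumPy's `Generator.normal` (equivalent in distribution to `norm.rvs`), and checks that the spread of the simulated estimates per school is close to $\sqrt{\sigma^2 + \mathtt{std\_estimate}^2}$, which is what adding two independent normal layers should give. The seed and the names `t_sim`, `ee_sim`, and `n_sims` are arbitrary choices for this illustration.

```python
import numpy as np

std_estimates = np.array([15, 10, 16, 11, 9, 11, 10, 18])
mu, sigma = 42, 37  # same made-up values as above

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n_sims = 20_000

# One row per simulated experiment, one column per school
t_sim = rng.normal(loc=mu, scale=sigma, size=(n_sims, 8))
ee_sim = rng.normal(loc=t_sim, scale=std_estimates)

# Spread of the simulated estimates versus the value implied by the model
empirical_sd = ee_sim.std(axis=0)
theoretical_sd = np.sqrt(sigma**2 + std_estimates**2)

print(np.round(empirical_sd, 1))
print(np.round(theoretical_sd, 1))
```

The two printed rows should agree to within simulation noise.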

**Inference**  
That's a pretty neat model! But how do we obtain the values for `t`? We only know the effect estimates, and have no idea what `mu` and `sigma` should be.  
From our model, we know the *conditional probability* of the effect estimates given `t`, `mu`, and `sigma`. In mathematical symbols we write that we know $p(\mathtt{effect\_estimates} \mid t, \mu, \sigma)$.

In [6]:
def p_conditional(effect_estimates, t, mu, sigma):
    # Given t, the estimates do not depend on mu and sigma, so they go unused here
    return norm.pdf(effect_estimates, loc=t, scale=std_estimates)

p_conditional([0]*8, [0]*8, 0, 1)

array([0.02659615, 0.03989423, 0.02493389, 0.03626748, 0.04432692,
       0.03626748, 0.03989423, 0.02216346])

This doesn't seem to help us much: we have simply phrased the model above in terms of probability distributions. However, as we will see, a little mathematics applied to the above expression allows us to find $p(t, \mu, \sigma \mid \mathtt{effect\_estimates})$, the distribution of the parameters given the data!

# Conditional probability and Bayes' theorem
Below we have an image of several animals and where they live: on land or in water. The animals are bear, horse, bee, cat, duck, hippo, alligator, fish, and shark. A simple conditional probability question might be: what is the probability that an animal lives in water, *given* that it lives on land? In the symbols of mathematics: $p(\text{water} \mid \text{land})$. Looking at the diagram, it's clear that this probability is $\frac{3}{7}$: seven animals live on land, and three of those live both in water *and* on land. So the calculation we did is $$p(\text{water} \mid \text{land}) = \frac{p(\text{water}, \text{land})}{p(\text{land})}$$

![](venn_diagram.jpg)

Now we invert the question: what is the probability that an animal lives on land, given that it lives in water? A simple look tells us $$p(\text{land} \mid \text{water}) = \frac{p(\text{water}, \text{land})}{p(\text{water})}$$

To obtain a relationship between these, notice that $p(\text{water}, \text{land}) = p(\text{water} \mid \text{land}) \cdot p(\text{land})$. Putting this into our second equation, we get $$p(\text{land} \mid \text{water}) = \frac{p(\text{water} \mid \text{land}) \cdot p(\text{land})}{p(\text{water})}$$

This is the formula known as **Bayes' theorem**. It tells us how to get from a conditional probability to its inverse.
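
The animal example is small enough to check by counting. The sketch below assumes that the three animals in the overlap of the diagram are duck, hippo, and alligator (the counts in the text fix the sizes of the sets, but the exact membership is read off the picture). It uses exact fractions so the results can be compared directly.

```python
from fractions import Fraction

# Assumption: duck, hippo and alligator are the three animals living in both
land = {"bear", "horse", "bee", "cat", "duck", "hippo", "alligator"}
water = {"duck", "hippo", "alligator", "fish", "shark"}
animals = land | water

def p(event):
    """Probability of picking an animal from `event` uniformly at random."""
    return Fraction(len(event), len(animals))

# Direct definitions of the two conditional probabilities
p_water_given_land = p(water & land) / p(land)
p_land_given_water = p(water & land) / p(water)

# Bayes' theorem: invert p(water | land) into p(land | water)
bayes = p_water_given_land * p(land) / p(water)

print(p_water_given_land, p_land_given_water, bayes)  # 3/7 3/5 3/5
```

The last two numbers agree, as Bayes' theorem says they must.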

# Applying Bayes' theorem to our 8 Schools problem
Using Bayes' theorem, we can write

$$p(t, \mu, \sigma \mid \mathtt{effect\_estimates}) = \frac{p(\mathtt{effect\_estimates} \mid t, \mu, \sigma) \cdot p(t, \mu, \sigma)}{p(\mathtt{effect\_estimates})}$$

This reduces our problem to finding two quantities: the probability of a given value of $t$, $\mu$, $\sigma$, and the probability *in general* of the observed `effect_estimates`.

To answer the first question, remember that we modelled `t = norm.rvs(loc=mu, scale=sigma, size=8)`. That is, we already know $p(t \mid \mu, \sigma)$! From the derivation of Bayes' theorem we already knew that $p(t, \mu, \sigma) = p(t \mid \mu, \sigma) \cdot p(\mu, \sigma)$. So it appears we are still missing a piece of the equation: probabilities for $\mu$ and $\sigma$.  
These probabilities we call *priors*. They encode what information we already have about the parameters. In this case, that is not much. It seems reasonable to say $\mu$ should be somewhere in the vicinity of 8.75, so let's take the prior for $\mu$ to be a normal distribution with mean 8.75 and standard deviation 20. For $\sigma$, we pretty much only know it's not negative, so we take its prior to be uniform between 0 and 100. We treat the two as independent, so $p(\mu, \sigma) = p(\mu) \cdot p(\sigma)$.  
Conversely, we call the distribution $p(t, \mu, \sigma \mid \mathtt{effect\_estimates})$ the *posterior*.

This gives us all the bits and pieces to answer our first question. In code:

In [7]:
def p_mu(mu):
    return norm.pdf(mu, loc=8.75, scale=20)

def p_sigma(sigma):
    return uniform.pdf(sigma, loc=0, scale=100)

def p_t_mu_sigma(t, mu, sigma):
    # p(t | mu, sigma) multiplied over the 8 schools, times the two priors
    return norm.pdf(t, loc=mu, scale=sigma).prod() * p_mu(mu) * p_sigma(sigma)

p_mu(10), p_sigma(2.3), p_t_mu_sigma([8.75]*8, 8.75, 1)

(0.01990819283434433, 0.01, 1.279854491013879e-07)

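The second question seems more difficult at first sight: it is hard to obtain $p(\mathtt{effect\_estimates})$. In this case it could be done, because we chose simple distributions, but it turns out we don't need to.

We note that `effect_estimates` is given, so $p(\mathtt{effect\_estimates})$ is a constant. Also, once $t$ is known, the effect estimates no longer depend on $\mu$ and $\sigma$, so $p(\mathtt{effect\_estimates} \mid t, \mu, \sigma) = p(\mathtt{effect\_estimates} \mid t)$. If we remove the constant from the equation, we find the left-hand side to be *proportional to* ($\propto$) the right-hand side in the following equation: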

$$p(t, \mu, \sigma \mid \mathtt{effect\_estimates}) \propto p(\mathtt{effect\_estimates} \mid t) \cdot p(t \mid \mu, \sigma) \cdot p(\mu) \cdot p(\sigma)$$

In code:

In [8]:
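def p_prop(t, mu, sigma):
    # effect_estimates must be the observed data from In [2]; In [4] and In [5]
    # reassigned it to simulated values, so restore it before calling this.
    p_mu = norm.pdf(mu, loc=8.75, scale=20)  # prior on mu
    p_sigma = uniform.pdf(sigma, loc=0, scale=100)  # prior on sigma
    # Product over the 8 schools, so the result is a single number
    p_t_mu_sigma = norm.pdf(t, loc=mu, scale=sigma).prod()
    p_ee_t = norm.pdf(effect_estimates, loc=t, scale=std_estimates).prod()
    return p_ee_t * p_t_mu_sigma * p_mu * p_sigma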
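
A product of this many small densities can get very close to zero in floating point. A common workaround, shown here only as a sketch, is to add log-densities instead of multiplying densities. The function name `log_p_prop` and the example point `t_example` are made up for this illustration; the priors and the observed data are the ones used above.

```python
import numpy as np
from scipy.stats import norm, uniform

# Observed data, copied from the table at the top
effect_estimates = np.array([28, 8, -3, 7, -1, 1, 18, 12])
std_estimates = np.array([15, 10, 16, 11, 9, 11, 10, 18])

def log_p_prop(t, mu, sigma):
    """Logarithm of the unnormalised posterior (hypothetical helper)."""
    log_p_mu = norm.logpdf(mu, loc=8.75, scale=20)
    log_p_sigma = uniform.logpdf(sigma, loc=0, scale=100)
    # Sums of logs replace the products over the 8 schools
    log_p_t = norm.logpdf(t, loc=mu, scale=sigma).sum()
    log_p_ee = norm.logpdf(effect_estimates, loc=t, scale=std_estimates).sum()
    return log_p_ee + log_p_t + log_p_mu + log_p_sigma

t_example = np.full(8, 8.75)  # arbitrary point, just to exercise the function
log_value = log_p_prop(t_example, mu=8.75, sigma=10.0)
print(log_value)
```

Exponentiating `log_value` gives back the same number as the product form, and a value of `sigma` outside the prior's range of 0 to 100 yields `-inf`, the logarithm of zero.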

Using this probability proportion, we can already sample from the distribution on a grid: we weight each grid point by its probability proportion, dividing by the total weight. This will allow us to draw histograms. A small example in two dimensions:

In [9]:
def p_prop(q):
    # Toy two-dimensional density; note this redefines p_prop from In [8]
    x, y = q
    if abs(x) > 2 or abs(y) > 3:
        return 0
    return 0.3*norm.pdf(x) + 0.7*norm.pdf(y) - 0.1*norm.pdf(x*y)

bins = np.linspace(-3, 3, 200)
points = [(x, y) for x in bins for y in bins]
samples = pd.DataFrame([(*q, p_prop(q)) for q in points], columns=["x", "y", "p_prop"])
samples["p"] = samples.p_prop / samples.p_prop.sum()
samples.head()

|   | x | y | p_prop | p |
|---|---|---|---|---|
| 0 | -3.0 | -3.0 | 0.0 | 0.0 |
| 1 | -3.0 | -2.969849 | 0.0 | 0.0 |
| 2 | -3.0 | -2.939698 | 0.0 | 0.0 |
| 3 | -3.0 | -2.909548 | 0.0 | 0.0 |
| 4 | -3.0 | -2.879397 | 0.0 | 0.0 |


An overview of the distribution in the two-dimensional space.

In [10]:
alt.Chart(samples).mark_point(filled=False).encode(
    x="x",
    y="y",
    color="p",
    tooltip=["x", "y", "p"]
)

And histograms for the two components.

In [11]:
alt.Chart(samples).mark_bar().encode(
    x=alt.X("x", bin=alt.Bin(maxbins=25)),
    y="sum(p)",
    tooltip="sum(p)"
) | alt.Chart(samples).mark_bar().encode(
    x="sum(p)",
    y=alt.Y("y", bin=alt.Bin(maxbins=25)), 
    tooltip="sum(p)"
)
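
The normalised grid weights can also be turned into actual draws by picking grid points at random with probability `p`. The sketch below rebuilds the same toy density in vectorised form (so it does not depend on the DataFrame above) and uses NumPy's `Generator.choice`. The seed, the number of draws, and the names `p_grid` and `draws` are arbitrary choices for this illustration.

```python
import numpy as np
from scipy.stats import norm

# Same 200 x 200 grid and toy density as above, evaluated in one go
bins = np.linspace(-3, 3, 200)
x, y = np.meshgrid(bins, bins, indexing="ij")
p_grid = 0.3*norm.pdf(x) + 0.7*norm.pdf(y) - 0.1*norm.pdf(x*y)
p_grid[(np.abs(x) > 2) | (np.abs(y) > 3)] = 0  # zero outside the support

# Normalise the weights so they sum to one
p = (p_grid / p_grid.sum()).ravel()

# Pick grid points with probability equal to their weight
rng = np.random.default_rng(0)  # arbitrary seed
idx = rng.choice(p.size, size=5_000, p=p)
draws = np.column_stack([x.ravel()[idx], y.ravel()[idx]])

print(draws[:5])
```

A histogram of the columns of `draws` should resemble the weighted histograms above, up to sampling noise.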

However, there is a second problem: how do you sample from a 10-dimensional space? Evaluating just 40000 points already takes some time. With a grid of 200 points per axis, each dimension we add multiplies the number of points to evaluate by 200. And most of these points will have a very low probability anyway.
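
To see how quickly a grid gets out of hand, here is the arithmetic for the 200-points-per-axis grid used in the two-dimensional example, extended to more dimensions. The names `points_per_axis` and `grid_sizes` are just for this illustration.

```python
points_per_axis = 200  # as in np.linspace(-3, 3, 200) above

# Number of grid points for 2, 3 and 10 dimensions
grid_sizes = {dims: points_per_axis ** dims for dims in (2, 3, 10)}

for dims, n_points in grid_sizes.items():
    print(f"{dims:>2} dimensions: {n_points:.3e} grid points")
# 2 -> 4.000e+04, 3 -> 8.000e+06, 10 -> 1.024e+23
```

Going from 2 to 10 dimensions takes the grid from forty thousand points to roughly $10^{23}$, which is far beyond what can be evaluated point by point.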

Using Markov Chain Monte Carlo, we can draw histograms based on this formula without having to know the exact value of $p(\mathtt{effect\_estimates})$, while concentrating our samples in the regions where it matters.

# Summary
We have seen an example of a Bayesian approach to solving the problem of finding true effect sizes given some measured effect sizes.

* We had 8 schools A,..., H with a measured effect size for A that seemed very high. Our practical goal is to find a more reasonable estimate for school A.
* By writing down a model for the estimated effect size, having as parameters the true effect size and hyperparameters $\mu$ and $\sigma$, we can find the conditional probability of measuring a certain effect given the other parameters.
* Using Bayes' theorem, we can invert the conditional probability to give the probability of the true effect size and hyperparameters given the data. This required us to define a *prior* on the hyperparameters $\mu, \sigma$.

The difficulty here is: the distribution we're trying to sample from is 10-dimensional (8 dimensions of $t$, 1 each for $\mu, \sigma$). Most of the points in this 10-dimensional space will have negligible density and thus aren't relevant. How do we sample from the relevant regions? This is the problem Markov Chain Monte Carlo addresses. It has two important features:

* It draws samples only needing the probability *proportion* as opposed to an actual probability.
* It samples more often from high-importance regions, which leads to significant gains in efficiency.
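
Purely as a preview, the sketch below shows how both features can appear in practice on the two-dimensional toy density from above. It uses random-walk Metropolis, one simple member of the MCMC family, which is an assumption made for this illustration and not necessarily the variant developed in the next notebook. The function name `metropolis`, the step size, the seed, and the number of steps are all arbitrary.

```python
import numpy as np
from scipy.stats import norm

def p_prop(q):
    # Toy two-dimensional unnormalised density from the grid example
    x, y = q
    if abs(x) > 2 or abs(y) > 3:
        return 0.0
    return 0.3*norm.pdf(x) + 0.7*norm.pdf(y) - 0.1*norm.pdf(x*y)

def metropolis(p_prop, start, n_steps, step_size, rng):
    """Random-walk Metropolis sketch; returns an array of visited points."""
    q = np.asarray(start, dtype=float)
    p_q = p_prop(q)
    chain = np.empty((n_steps, q.size))
    for i in range(n_steps):
        # Propose a nearby point by adding Gaussian noise
        proposal = q + rng.normal(scale=step_size, size=q.size)
        p_new = p_prop(proposal)
        # Accept with probability min(1, p_new / p_q). Only the ratio is
        # used, so any normalising constant would cancel.
        if rng.uniform() * p_q < p_new:
            q, p_q = proposal, p_new
        chain[i] = q  # a rejected proposal repeats the current point
    return chain

rng = np.random.default_rng(0)  # arbitrary seed
chain = metropolis(p_prop, start=(0.0, 0.0), n_steps=2_000, step_size=1.0, rng=rng)
print(chain.mean(axis=0))
```

Because proposals into zero-density regions are always rejected and moves towards higher density are always accepted, the chain spends its time where the density is appreciable instead of covering the whole grid.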

The concept of Markov Chain Monte Carlo is explored in the next notebook.