# "MCMC From Scratch II: Markov Chain Monte Carlo"
> "Introducing Markov Chain Monte Carlo with the Metropolis-Hastings algorithm"

In the previous notebook, we introduced the Eight schools problem and approached it using Bayesian statistics. The fundamental question was "how do we adjust the effect estimate for school A (which was very high) in light of what we know about the other schools?" This amounted to building a model with the parameter `t`, the "true" effect size per school. Our model also included hyperparameters `mu` and `sigma`.

Picking up where we left off, we had found a probability proportion for the parameters `t`, `mu`, and `sigma`, and were left with the issue of using this probability proportion to sample from the 10-dimensional distribution. The promise then was that this would all be made possible using Markov Chain Monte Carlo (MCMC) methods. This notebook is here to fulfill that promise.

In [1]:
import numpy as np
import altair as alt
import pandas as pd
from scipy.stats import norm, uniform

from mcmc import json_dir, visualize_simulation, transition_MH_trace, simulate_trace, generate_sample, make_2d_histogram

alt.data_transformers.register('json_dir', json_dir)
alt.data_transformers.enable('json_dir', data_dir='/altairdata')

DataTransformerRegistry.enable('json_dir')

In [2]:
effect_estimates = np.array([28, 8, -3, 7, -1, 1, 18, 12])
std_estimates = np.array([15, 10, 16, 11, 9, 11, 10, 18])
school_names = ["A", "B", "C", "D", "E", "F", "G", "H"]

df = pd.DataFrame({"effect_estimate": effect_estimates,
                   "std_estimate": std_estimates,
                  "school": school_names})

def p_prop(t, mu, sigma):
    p_mu = norm.pdf(mu, loc=8.75, scale=20)
    p_sigma = uniform.pdf(sigma, loc=0, scale=100)
    p_t_mu_sigma = norm.pdf(t, loc=mu, scale=sigma).prod()  # Product over all schools, so the result is a scalar
    p_ee_t = norm.pdf(effect_estimates, loc=t, scale=std_estimates).prod()
    return p_ee_t * p_t_mu_sigma * p_mu * p_sigma

# What is a Markov Chain, anyway?
Fundamentally, a Markov chain can be described as a "random walk through space". It begins at some starting point. At each timestep, we look where we are, and decide to move somewhere else (or stay on the same spot), dependent *only on where we are*. Put differently, the next step only depends on our current location, not on where we've been before that.

Seeing is believing, so here is an example. In this case, we are walking through a 2D space.

In [3]:
def transition(q):  # We always use q for the current point in these notebooks
    x, y = q  # Unpack coordinates
    step_x = np.random.choice([1, -1])  # Right or left with probability 0.5
    step_y = np.random.choice([1, -1])  # Up or down with probability 0.5
    return (x+step_x, y+step_y)

def simulate(initial, transition, n_iter=100):
    result = [initial]
    for _ in range(n_iter):
        result.append(transition(result[-1]))
        
    # Bookkeeping
    n_dim = len(initial)
    result = pd.DataFrame(result[1:],
                          columns=[f"x{i}" for i in range(n_dim)],
                          index=range(n_iter))
    result.index.name = "i"
    return result

def simulate_multiple(initial_points, transition, n_iter=100):
    """
    Runs multiple Markov chains using simulate(p, transition, n_iter),
    one for every point p in initial_points.
    """
    return pd.concat([simulate(initial, transition, n_iter) for initial in initial_points],
                     keys=range(len(initial_points)),
                     names=["simulation", "i"],
                     axis=0).reset_index()


df = simulate_multiple([(0, 0) for _ in range(4)], transition, n_iter=100)
visualize_simulation(df).facet(facet="simulation", columns=2)

# Metropolis-Hastings
By writing the transition function well, we can create a Markov chain that explores a probability distribution, without having to calculate any normalizing constants. One of the earliest algorithms for this is the Metropolis-Hastings algorithm.
The only requirement of the algorithm is that we know the probability *proportion*, which we denote by `p_prop`. The transition function works as follows:
1. Generate a proposal for the next point, by drawing from a normal distribution centered at the current point.
2. Calculate the probability ratio between the proposal and the current point. (The normalizing factor cancels out here, so it's sufficient to have `p_prop`.) The ratio exceeds 1 if the proposal has a higher probability than the current point.
3. Draw a random number between 0 and 1.
4. If the random number is less than or equal to the probability ratio, we accept the proposal. Otherwise we stay where we are.
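
To see why the unknown normalizing constant drops out in step 2, write the true density as $p(q) = \mathtt{p\_prop}(q) / Z$. The symbols $Z$ (the normalizing constant), $q$ (the current point), $q'$ (the proposal) and $r$ (the ratio) are notation introduced here for illustration only:

$$r = \frac{p(q')}{p(q)} = \frac{\mathtt{p\_prop}(q') / Z}{\mathtt{p\_prop}(q) / Z} = \frac{\mathtt{p\_prop}(q')}{\mathtt{p\_prop}(q)}$$

Since the random number in step 3 lies between 0 and 1, a proposal with $r \geq 1$ is always accepted, and one with $r < 1$ is accepted with probability $r$. Because the normal proposal is symmetric (moving from $q$ to $q'$ is as likely as the reverse), no extra correction term for the proposal distribution is needed.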

In code:

In [4]:
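def transition_MH(current, p_prop, scale=0.1):
    # Step 1: propose a point near the current one (symmetric normal proposal)
    proposal = norm.rvs(loc=current, scale=scale)
    # Step 3: draw a uniform random number between 0 and 1
    u = uniform.rvs()
    # Steps 2 and 4: accept the proposal if u does not exceed the probability ratio
    return proposal if u <= p_prop(proposal)/p_prop(current) else current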
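
One practical caveat: a product of many small densities (as in the ten-dimensional eight schools model) can underflow to zero, which makes the ratio undefined. A common workaround is to compare logarithms instead. The sketch below is one possible variant, not the implementation used in these notebooks; `transition_MH_log`, `log_p_prop`, `log_standard_normal` and the `rng` argument are names introduced here, and the target is a plain 2D standard normal chosen purely for illustration.

```python
import numpy as np

def transition_MH_log(current, log_p_prop, scale=0.1, rng=None):
    """One Metropolis step that compares log-densities instead of densities.

    `log_p_prop` is assumed to return the log of the probability proportion.
    """
    rng = np.random.default_rng() if rng is None else rng
    current = np.asarray(current, dtype=float)
    proposal = rng.normal(loc=current, scale=scale)
    log_ratio = log_p_prop(proposal) - log_p_prop(current)
    # `u <= ratio` is equivalent to `log(u) <= log(ratio)`.
    # rng.random() lies in [0, 1), so 1 - u lies in (0, 1] and its log is finite.
    log_u = np.log1p(-rng.random())
    return proposal if log_u <= log_ratio else current

def log_standard_normal(q):
    """Log-density of a 2D standard normal, up to an additive constant."""
    return -0.5 * np.sum(np.square(q))

rng = np.random.default_rng(0)
chain = [np.zeros(2)]
for _ in range(5_000):
    chain.append(transition_MH_log(chain[-1], log_standard_normal, scale=1.0, rng=rng))
chain = np.array(chain)

# For this target, the mean should be near 0 and the standard deviation near 1
print(chain.mean(axis=0), chain.std(axis=0))
```

Dropping additive constants from the log-density is harmless for the same reason the normalizing factor cancels: only differences of log-densities enter the acceptance test.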

To see this in action, we use a slightly simpler example: we assume all standard deviations to be 1 and only consider the effect size of one school, which is, say, 0.3. Then the model we have (note the "proportional to"-sign!) for this one school is that

$$p(t, \mu \mid \mathtt{effect\_estimates}) \propto p(\mathtt{effect\_estimates} \mid t) \cdot p(t \mid \mu) \cdot p(\mu)$$

In code:

In [5]:
def p_prop(q):
    t, mu = q  # Unpack arguments
    
    p_mu = norm.pdf(mu, loc=3, scale=1)
    p_t_mu = norm.pdf(t, loc=mu, scale=1)
    p_ee_t = norm.pdf(0.3, loc=t, scale=1)
    
    return p_ee_t * p_t_mu * p_mu

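We use a tracing simulation to provide us with some extra information. The tracing transition `transition_MH_trace` works just like the `transition_MH` function defined above, but also records the proposal, the values of `p_prop` at the current point and at the proposal, and the random number `u` for every step.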

In [6]:
transition = lambda current: transition_MH_trace(current, p_prop, scale=1)
df, info = simulate_trace((0, 0), transition, n_iter=100)
df = df.join(info).reset_index()
df.head()

| | i | x0 | x1 | current | f_current | proposal | f_proposal | u |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0.0 | (0, 0) | 0.0 | (0, 0) | 0.0 | 0.0 |
| 1 | 1 | 0.0 | 0.0 | (0, 0) | 0.000674 | [-1.8489284346591206, 0.6836315761184948] | 1.7e-05 | 0.228078 |
| 2 | 2 | 0.0 | 0.0 | (0, 0) | 0.000674 | [2.2169675743945847, 0.0644415335569206] | 1.3e-05 | 0.547593 |
| 3 | 3 | 0.0 | 0.0 | (0, 0) | 0.000674 | [0.9171650385565093, -0.46792119534142007] | 4.9e-05 | 0.762075 |
| 4 | 4 | 0.0 | 0.0 | (0, 0) | 0.000674 | [-1.203043882456486, 0.32471158912139797] | 0.000178 | 0.916377 |


Now we make an interactive chart. The code here is *just* for plotting, so you can safely skip over it to look at the picture.

In [7]:
# Create slider
n_iter = df.i.nunique()
slider = alt.binding_range(min=1, max=n_iter, step=1)
select_time = alt.selection_single(name="iteration", fields=['i'],
                                   bind=slider, init={'i': 1})

# Unpack proposal and find axis scales
df[["proposal_x0", "proposal_x1"]] = pd.DataFrame(df.proposal.to_list(), index=df.index)
xmin, xmax = min(df.x0.min(), df.proposal_x0.min()), max(df.x0.max(), df.proposal_x0.max())
ymin, ymax = min(df.x1.min(), df.proposal_x1.min()), max(df.x1.max(), df.proposal_x1.max())
x_scale = alt.Scale(domain=(xmin, xmax))
y_scale = alt.Scale(domain=(ymin, ymax))

# Create chart showing simulation
simulation = alt.Chart(df).mark_point(opacity=0.7).encode(
    x=alt.X("x0", scale=x_scale),
    y=alt.Y("x1", scale=y_scale),
    tooltip=["x0", "x1"],
    opacity=alt.condition(alt.datum.i < select_time.i, alt.value(1), alt.value(0)),
    order="i")

# Create chart using proposal
df_proposal = df
df_proposal[["proposal_x0", "proposal_x1"]] = df[["proposal_x0", "proposal_x1"]].shift(-1)
proposal = alt.Chart(df).mark_point(color="red", filled=True, size=50).encode(
    x=alt.X("proposal_x0", scale=x_scale),
    y=alt.Y("proposal_x1", scale=y_scale),
    tooltip=["proposal_x0", "proposal_x1"],
    color=alt.condition(
        alt.datum.ratio > alt.datum.u,
        alt.value("blue"),
        alt.value("red")
    )
)

# Create chart of probability ratio and u
df["ratio"] = df.f_proposal / df.f_current
df_ratio = df[["i", "ratio", "u"]].melt(id_vars="i")
ratio = alt.Chart(df_ratio).mark_bar().encode(
    x="variable",
    y=alt.Y("value", scale=alt.Scale(domain=(0, 1), clamp=True)),
    tooltip="value"
).add_selection(select_time).transform_filter(
    select_time
)

# Go!
(simulation + proposal.transform_filter(select_time)).add_selection(select_time) | ratio

# Comparing it to the true distribution
That looks pretty interesting, but how do we fare against the real distribution? To find out, we run a Markov chain for 10000 iterations, and compare its histogram against a plot of the actual density function, created by evaluating the probability proportion on a 150x150 grid and normalizing it.

In [8]:
x_bins = np.linspace(-5, 5, 150)
y_bins = np.linspace(-5, 5, 150)

transition = lambda current: transition_MH(current, p_prop, scale=1)
df = simulate((0, 0), transition, n_iter=10_000)
hist = make_2d_histogram(df, x_bins, y_bins)

sample = generate_sample(p_prop, x_bins=x_bins, y_bins=y_bins)
sample["p_matched"] = sample.p * hist.data.p.sum()  # Need to rescale due to binning
true_density = alt.Chart(sample).mark_point().encode(x="x0", y="x1", color="p_matched")

hist | true_density

100%|██████████| 22500/22500 [00:06<00:00, 3266.81it/s]
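
The `generate_sample` helper lives in the accompanying `mcmc` module and is not shown here. As a rough idea of what "evaluating on a grid and normalizing" can look like, here is a minimal NumPy-only sketch. The names `normal_pdf`, `p_prop_grid`, `x_points`, `y_points`, `T`, `MU` and `grid_mean` are introduced for this illustration, and the real helper may well work differently (for example by treating the grid values as bin edges).

```python
import numpy as np

def normal_pdf(x, loc, scale):
    """Density of a normal distribution, vectorised through NumPy broadcasting."""
    z = (x - loc) / scale
    return np.exp(-0.5 * z ** 2) / (scale * np.sqrt(2 * np.pi))

def p_prop_grid(t, mu):
    """Probability proportion of the toy model, evaluated on whole arrays at once."""
    return normal_pdf(0.3, t, 1) * normal_pdf(t, mu, 1) * normal_pdf(mu, 3, 1)

# Assumption: the 150 values per axis are used as evaluation points
x_points = np.linspace(-5, 5, 150)
y_points = np.linspace(-5, 5, 150)
T, MU = np.meshgrid(x_points, y_points, indexing="ij")

p = p_prop_grid(T, MU)
p = p / p.sum()  # Normalise so that the grid probabilities sum to 1

# Probability-weighted averages of the grid coordinates
grid_mean = ((T * p).sum(), (MU * p).sum())
print(grid_mean)
```

With only two parameters this brute-force approach is cheap. The number of grid points grows exponentially with the number of dimensions, though, which is why it stops being an option for the ten-dimensional eight schools model.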


That's pretty good! Let's look at the estimated means. We see that they're at least in the ballpark, even though the chain uses fewer than half as many points (we have 22500 points in our grid, and only 10000 points in our Markov chain).

In [9]:
print(f"MCMC mean: {(df.x0.mean(), df.x1.mean())}")
print(f"Grid sample mean: {(sample.x0 @ sample.p, sample.x1 @ sample.p)}")

MCMC mean: (1.1776513919915395, 2.106161026368139)
Grid sample mean: (1.19973972141778, 2.0994867810183075)
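
Because every factor in this toy model is Gaussian, the posterior over `(t, mu)` is itself a bivariate normal, so its mean can also be worked out exactly as an extra sanity check. Collecting the exponents gives $\log \mathtt{p\_prop}(q) = -\tfrac{1}{2} q^\top Q q + b^\top q + \text{const}$, whose mean is $Q^{-1} b$. The names `Q`, `b`, `posterior_mean` and `posterior_cov` are introduced here for illustration.

```python
import numpy as np

# log p_prop(t, mu) = -0.5 * [(0.3 - t)^2 + (t - mu)^2 + (mu - 3)^2] + const
#                   = -(t^2 - t*mu + mu^2) + 0.3*t + 3*mu + const
Q = np.array([[2.0, -1.0],
              [-1.0, 2.0]])  # Precision matrix of (t, mu)
b = np.array([0.3, 3.0])     # Linear coefficients of (t, mu)

posterior_mean = np.linalg.solve(Q, b)  # Solves Q @ mean = b
posterior_cov = np.linalg.inv(Q)

print(posterior_mean)
print(posterior_cov)
```

The exact mean works out to (1.2, 2.1), which both the MCMC estimate and the grid estimate above approximate. Such a closed form is a luxury of this deliberately simple example and is not available for most models of practical interest.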


# How good is good enough?
This is a natural point in the narrative to start wondering about convergence. It certainly seems clear from the graphs we've drawn that the Metropolis-Hastings algorithm ends up "exploring" the distribution we want to sample from pretty well. And, as it turns out, there is a mathematical theorem backing this up: under mild conditions, as the number of samples goes to infinity, the distribution of the Markov chain converges to the distribution described by the probability proportion. So if we extend our Markov chain long enough, our estimates will eventually become accurate.  
But therein lies the rub. What is "long enough"? When do we decide to stop a Markov chain? How do we know that our parameter estimates are accurate? We can't simply compare the MCMC-estimates to the true distribution, since *the whole point of MCMC is to be able to sample from a distribution we can't sample from otherwise*.  
Metrics exist for monitoring convergence, and techniques exist for speeding it up; using Markov Chain Monte Carlo methods in practice largely comes down to these two activities. Convergence is the subject of our next notebook.
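
To get a feel for the "long enough" question on our toy model, we can watch how the running mean of a single chain settles down as the chain grows. This is only an informal illustration, not a convergence diagnostic. The sketch is self-contained: `run_chain` and `running_mean` are names introduced here, the density drops constant factors (which is fine for a probability proportion), and the seed is arbitrary.

```python
import numpy as np

def p_prop(q):
    """Probability proportion of the toy model, without constant factors."""
    t, mu = q
    return np.exp(-0.5 * ((0.3 - t) ** 2 + (t - mu) ** 2 + (mu - 3.0) ** 2))

def run_chain(start, n_iter, scale=1.0, seed=0):
    """Run a Metropolis chain with a normal proposal and return all visited points."""
    rng = np.random.default_rng(seed)
    chain = np.empty((n_iter, 2))
    current = np.asarray(start, dtype=float)
    p_current = p_prop(current)
    for i in range(n_iter):
        proposal = rng.normal(loc=current, scale=scale)
        p_proposal = p_prop(proposal)
        if rng.uniform() <= p_proposal / p_current:
            current, p_current = proposal, p_proposal
        chain[i] = current
    return chain

chain = run_chain((0, 0), n_iter=20_000)

# Mean of the first n points, for every n
running_mean = np.cumsum(chain, axis=0) / np.arange(1, len(chain) + 1)[:, None]

for n in (100, 1_000, 10_000, 20_000):
    print(n, running_mean[n - 1])
```

Short chains typically give noticeably noisier estimates than long ones, and without the grid sample to compare against we would have no direct way of telling how far off a given estimate is.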

# Summary
We introduced Markov chains, which can be described as a random walk through space: the next location only depends upon the previous one through a transition function. By choosing this transition function wisely, as in the Metropolis-Hastings algorithm, we can sample from a probability distribution even when we only know the probability proportion, and apparently can do so pretty well.  
The question remains, *how* well? Mathematics guarantees convergence *eventually*, but that isn't a very practical guideline. Monitoring this convergence in practice will be the subject of our next notebook.