# Hierarchical models

Often we find ourselves working with hierarchical data: broadly speaking, a collection of observations, each of which belongs to one of a number of groups. The cancer deaths by county data discussed in the slides is an example.

The question is, how best to build a model in such a situation? There are two obvious approaches:
  1. Combine all of the data and build a single model used to make predictions / inferences for every group (pooling).
  2. Model each group independently (no pooling).

The first approach is unsatisfactory because it discards group-level information: we rarely want to model all groups as identical. The second approach is not great either, because each group on its own often has too little data to support reliable inference.

The ideal approach is a hybrid one, where we build separate models for each group, but ones which aren't totally independent. That is, they produce predictions and inferences specific to each group, but they also learn from each other.
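
To make the two extremes concrete, here is a minimal sketch that computes both estimates on a handful of made-up counts (the numbers are hypothetical and are not taken from any dataset used in this notebook).

```python
# Hypothetical counts for three groups (illustration only).
n = [20, 20, 5]  # group sizes
y = [2, 4, 3]    # events observed in each group

# Complete pooling: a single rate shared by every group.
pooled_rate = sum(y) / sum(n)

# No pooling: an independent rate for each group.
group_rates = [y_j / n_j for y_j, n_j in zip(y, n)]

print(f"Pooled rate: {pooled_rate:.3f}")
for j, rate in enumerate(group_rates):
    print(f"Group {j} (n={n[j]}): {rate:.3f}")
```

The pooled estimate hides the differences between groups, while the unpooled estimate for the smallest group rests on only five observations.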

Let's look at a particular example. We have historical data on the rate of tumour incidence in control groups of lab rats. We want to estimate the probability of a lab rat developing tumours, so that we can better measure whether a given drug has preventative power. We are particularly interested in comparing the most recent experiment, where 4 tumours were observed in a group of size 14, to the others.

In [None]:
from pathlib import Path
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pystan
import seaborn as sns

%load_ext jupyterstan
%matplotlib inline

warnings.filterwarnings("ignore")

In [None]:
if not Path('rat_tumors.csv').exists():
    !wget https://s3-eu-west-1.amazonaws.com/faculty-client-teaching-materials/bayesian-statistics/rat_tumors.csv

In [None]:
df = pd.read_csv("rat_tumors.csv")

If we just had one control group, we would model this in the same way that we modelled the Paris births data, i.e.

$$
    y \: | \: \theta \sim \text{Binomial}(n, \theta) \\
    \theta \sim \text{Beta}(\alpha, \beta)
$$

where $n$ is the number of rats, $y$ is the number of rats that develop a tumour, and $\theta$ is the probability of a given rat developing a tumour (which we model as independent, identically distributed events).

There's a big unanswered question here: how do we choose $\alpha$ and $\beta$? I certainly have no idea what a reasonable prior is on the probability of a rat developing a tumour. In the Paris births example we had enough data that our choice of prior didn't really matter. Here the data is small (about 20 rats in each control group), so our inferences are much more sensitive to the choice we make.
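
A rough way to see this sensitivity is to use the conjugacy of the model: a $\text{Beta}(\alpha, \beta)$ prior combined with $y$ successes in $n$ trials gives a $\text{Beta}(\alpha + y, \beta + n - y)$ posterior, with mean $(\alpha + y) / (\alpha + \beta + n)$. The sketch below compares the posterior mean under three priors for a small and a large sample. The priors and the large sample are hypothetical, chosen only for illustration.

```python
def posterior_mean(alpha, beta, y, n):
    """Mean of the Beta(alpha + y, beta + n - y) posterior."""
    return (alpha + y) / (alpha + beta + n)


# Hypothetical priors, chosen only to span a range of beliefs.
priors = {"flat": (1, 1), "low rate": (1, 9), "centred on 0.5": (5, 5)}


def spread(y, n):
    """Range of posterior means across the candidate priors."""
    means = [posterior_mean(a, b, y, n) for a, b in priors.values()]
    return max(means) - min(means)


small = spread(y=4, n=14)        # one control group
large = spread(y=4000, n=14000)  # made-up large sample, same sample rate

print(f"Spread of posterior means, n=14:    {small:.4f}")
print(f"Spread of posterior means, n=14000: {large:.4f}")
```

With 14 rats the choice of prior moves the posterior mean substantially. With the large sample it barely matters.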

To resolve this, we use the fact that with 71 groups in our data, we can learn this prior from the data! Nothing comes for free: we now need to specify a hyperprior on the parameters of our prior. However, this hyperprior applies to all of the data (~1700 observations), so our inferences are much less sensitive to the choice we make here. Hence we can safely choose a relatively flat, non-informative hyperprior. Here is our new model, with $j = 1, \dots, 71$ indexing the groups.

$$
    y_j \: | \: \theta_j\sim \text{Binomial}(n_j, \theta_j) \\
    \theta_j \: | \: \alpha, \: \beta \sim \text{Beta}(\alpha, \beta) \\
    \alpha, \beta \sim \text{Half-Cauchy}(0, 2.5)
$$
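
One way to read the model is as a recipe for generating data, from the top level down. The sketch below simulates the lower two levels using only the standard library. For simplicity it fixes $\alpha$ and $\beta$ by hand rather than drawing them from the hyperprior, and both those values and the group sizes are hypothetical.

```python
import random

random.seed(0)

# Hypothetical hyperparameters, fixed by hand rather than drawn
# from the Half-Cauchy hyperprior.
alpha, beta = 2.0, 10.0

# Made-up group sizes.
n = [20, 19, 14, 25, 18]

# theta_j | alpha, beta ~ Beta(alpha, beta)
thetas = [random.betavariate(alpha, beta) for _ in n]

# y_j | theta_j ~ Binomial(n_j, theta_j), drawn as a sum of Bernoulli trials
y = [
    sum(random.random() < theta_j for _ in range(n_j))
    for theta_j, n_j in zip(thetas, n)
]

for theta_j, n_j, y_j in zip(thetas, n, y):
    print(f"theta={theta_j:.3f}  n={n_j}  y={y_j}")
```

Each group gets its own rate, but all of the rates come from the same Beta distribution, which is what lets the groups share information.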

We specify this model in Stan as follows:

In [None]:
%%stan tumour_model
data {
  int<lower=0> J;  // number of groups
  int<lower=0> n[J];  // number of rats in each group
  int<lower=0> y[J];  // number of rats that developed tumours
}
parameters {
  real<lower=0> a;
  real<lower=0> b;
  vector<lower=0, upper=1>[J] theta;  // tumour incidence rates
}
model {
  // hyperprior
  a ~ cauchy(0, 2.5);
  b ~ cauchy(0, 2.5);
  
  // prior (vectorised sampling statement)
  theta ~ beta(a, b);
  
  // sampling distribution
  for (j in 1:J) {
    y[j] ~ binomial(n[j], theta[j]);
  }
}
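
The `model` block can be thought of as adding up the log density of each sampling statement. As a rough illustration, the plain-Python sketch below evaluates the same log joint density for given parameter values. The function names are hypothetical. Stan itself works up to additive constants and on a transformed, unconstrained scale, so the number printed here should not be expected to match Stan's `lp__`.

```python
import math


def log_half_cauchy(x, scale=2.5):
    """Log density of a Cauchy(0, scale) restricted to x >= 0."""
    return math.log(2) - math.log(math.pi * scale) - math.log1p((x / scale) ** 2)


def log_beta_pdf(theta, a, b):
    """Log density of Beta(a, b) at theta."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta) - log_norm


def log_binomial_pmf(y, n, theta):
    """Log probability of y successes in n trials with success rate theta."""
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return log_choose + y * math.log(theta) + (n - y) * math.log(1 - theta)


def log_joint(a, b, theta, n, y):
    """Hyperprior + prior + sampling distribution, summed over groups."""
    total = log_half_cauchy(a) + log_half_cauchy(b)
    for theta_j, n_j, y_j in zip(theta, n, y):
        total += log_beta_pdf(theta_j, a, b)
        total += log_binomial_pmf(y_j, n_j, theta_j)
    return total


# Made-up parameter values and counts for two groups.
print(log_joint(a=1.0, b=1.0, theta=[0.2, 0.3], n=[20, 14], y=[4, 4]))
```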

Once our model has compiled we can fit it to our data.

In [None]:
data = {
    "J": df.shape[0],
    "n": df.N,
    "y": df.y,
}

fit = tumour_model.sampling(data=data, iter=2000, n_jobs=1)

We're interested in the tumour incidence rates in each group, i.e. $\theta_j$, which we can extract with the `extract` method of `fit`.

In [None]:
theta = fit.extract()["theta"]

We can use seaborn to estimate the density of the posterior probability distribution for each $\theta_j$ from the samples. We plot the estimated density of the most recent experiment in red. 

In [None]:
f, ax = plt.subplots()

for i in np.random.randint(0, 70, 20):
    sns.distplot(theta[:, i], hist=False, ax=ax, color="#9C9C9C")

sns.distplot(theta[:, 70], hist=False, ax=ax, color="#FA7268");

We can see that the posterior distribution for the most recent group puts the tumour rate relatively high, but not unusually so. In fact the posterior mean is much lower than the sample rate (the maximum likelihood estimate).

In [None]:
mle = df.loc[70, "y"] / df.loc[70, "N"]
pm = theta[:, 70].mean()

print(f"MLE of tumour rate for group 70: {mle:.3f}")
print(f"Posterior mean of tumour rate for group 70: {pm:.3f}")
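
The gap between these two numbers is easiest to understand when $\alpha$ and $\beta$ are held fixed. In that case the posterior mean $(\alpha + y) / (\alpha + \beta + n)$ is a weighted average of the prior mean and the sample rate, with the weight on the data growing with $n$. The sketch below checks this for the most recent experiment, using hypothetical values of $\alpha$ and $\beta$ in place of the fitted ones. In the full model $\alpha$ and $\beta$ are themselves uncertain, so this is only an approximation to what the sampler reports.

```python
# Hypothetical hyperparameters standing in for the fitted prior.
alpha, beta = 1.4, 8.6

# Most recent experiment: 4 tumours in a group of 14.
y, n = 4, 14

prior_mean = alpha / (alpha + beta)
sample_rate = y / n
weight = n / (alpha + beta + n)  # weight placed on the data

# Weighted average of the sample rate and the prior mean.
posterior_mean = weight * sample_rate + (1 - weight) * prior_mean

print(f"Prior mean:     {prior_mean:.3f}")
print(f"Sample rate:    {sample_rate:.3f}")
print(f"Weight on data: {weight:.3f}")
print(f"Posterior mean: {posterior_mean:.3f}")
```

The smaller the group, the smaller the weight on its own data, and the more its estimate is pulled towards the prior mean.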

We can compare the 95% posterior intervals and posterior means for all groups to their sample rates. Notice that there is some regression to the mean as the extreme outcomes are pulled towards the centre. 

In [None]:
intervals = np.percentile(theta, [2.5, 97.5], axis=0)
means = np.mean(theta, axis=0)
sample_rates = df.y / df.N

f, ax = plt.subplots(figsize=(12, 8))

ax.plot([0, 0.4], [0, 0.4], "k--", linewidth=0.5)

ax.scatter(sample_rates[:70], means[:70], color="#0099ff", s=10)

for i in range(70):
    ax.plot(
        [sample_rates[i], sample_rates[i]],
        [intervals[0, i], intervals[1, i]],
        "--",
        color="#0099ff",
        linewidth=0.5,
    )


ax.scatter(sample_rates[70], means[70], color="#ff0098", s=12)
ax.plot(
    [sample_rates[70], sample_rates[70]],
    [intervals[0, 70], intervals[1, 70]],
    "--",
    color="#ff0098",
    linewidth=0.7,
)

ax.set_xlabel("Sample rate", fontdict={"fontsize": 14})
ax.set_ylabel("Posterior mean rate", fontdict={"fontsize": 14})
ax.set_title("Posterior means and 95% intervals for tumour incidence rate");

Notice that groups with small sample sizes have wider credible intervals, reflecting the greater uncertainty associated with those groups.
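
The same effect can be seen in closed form if we hold $\alpha$ and $\beta$ fixed: the posterior for a group is $\text{Beta}(\alpha + y, \beta + n - y)$, and its standard deviation shrinks as $n$ grows. The sketch below uses hypothetical hyperparameters and made-up groups that share the same sample rate of 0.2.

```python
import math


def posterior_sd(alpha, beta, y, n):
    """Standard deviation of the Beta(alpha + y, beta + n - y) posterior."""
    a, b = alpha + y, beta + n - y
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


# Hypothetical hyperparameters standing in for the fitted prior.
alpha, beta = 1.4, 8.6

# Made-up groups as (n, y) pairs, all with a sample rate of 0.2.
groups = [(10, 2), (20, 4), (50, 10)]
sds = [posterior_sd(alpha, beta, y, n) for n, y in groups]

for (n, y), sd in zip(groups, sds):
    print(f"n={n:2d}, y={y:2d}: posterior sd = {sd:.3f}")
```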

## Credit

This notebook makes extensive use of PyStan which is licensed under the GPL license, a requirement of which is that derivative works are also licensed under the GPL license. Hence this notebook is distributed under the GPL license. There is a copy of the full text of the license in this directory.

<hr>

<small style="font-size:12px">
This notebook is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This notebook is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this notebook.  If not, see <https://www.gnu.org/licenses/>.
</small>