# Tutorial: Comparison of Galaxy Cluster Centering Models

In the [previous tutorial](model_evaluation.ipynb), we evaluated the "goodness of fit" of two different models on a data set representing the distribution of centering offsets in a galaxy cluster sample. Here we will continue by doing two styles of quantitative test to compare the two models, with the goal of deciding whether an improved goodness of fit actually justifies the additional complexity of the second model. Specifically, you will calculate
* the Deviance Information Criterion, one of several possible information criteria, which has the advantages of being relatively simple and having a straightforward Bayesian interpretation;
* the Bayesian evidence, a more principled but more complex approach fully in the Bayesian framework.

In [None]:
TutorialName = 'model_comparison'
exec(open('tbc.py').read()) # define TBC and TBC_above
import numpy as np
import scipy.stats as st
from scipy.special import logsumexp
import matplotlib.pyplot as plt
%matplotlib inline
import dynesty
from dynesty import plotting as dyplot

## Getting set up

First, we'll want to take advantage of your work in the previous tutorial. As irritating as it may be, paste the (completed) definitions of `Model`, `ExponentialModel` and your alternative model below.

In [None]:
TBC() # cut/paste

Next, we'll read in the chains for each model that you produced. (You'll need to fill in the name of your model class.)

In [None]:
Model1 = ExponentialModel(samples=np.loadtxt('saved/centering_model1_chain.txt', ndmin=2))
TBC() # Model2 = YourModel(samples=np.loadtxt('saved/centering_model2_chain.txt', ndmin=2))

If you're as lazy as me, your implementation above relied on the data `y` being at global scope, so here they are again. If there were any other global variable dependences I haven't anticipated, you'll need to reproduce their definitions here also.

In [None]:
y = np.array([39.30917,35.13419,5.417072,59.75137,30.69077,14.45971,27.07368,27.48429,80.60219,483.1432,24.65057,
              22.36524,43.39081,39.89816,30.67409,6.905061,53.69709,9.504133,41.07874,10.9369,48.29861,61.34125,
              68.37279,30.51124,26.74462,13.7165,6.043301,976.1495,27.20097,7.818419,5.589193,3.310114,271.8901,
              126.0384,99.51247,249.1279,403.0484,3.071718,0.9434036,54.94336,1.529382,8.441071,19.59434,59.43049,
              77.21293,29.6533,286.7116,11.2386,9.511912,29.04711,33.77766,151.4803,223.3557,12.33816,25.22682,
              26.86597,339.7084,405.6737,3.809868,221.6523,307.2994,73.36697,42.15523,36.74785,5.415392,69.4721,
              136.8073,17.3534,4.135966,20.19435,79.06968,8.095599,4.474533,44.90669,85.891,1.636425,75.39335,
              15.94149,2.828709,20.5636,41.52905,42.51133,104.3908,67.41335,13.80204,394.9841,33.90415,84.78714,
              36.77924,14.48424,66.01276,2.910331,92.79938,29.74337,42.40971,1.692674,1.039994,120.5902,154.7106,
              14.38967,147.8399,166.5054,87.53685,22.63141,638.1976,273.6167,593.4997,45.57279,87.30421,75.03385,
              18.33932,36.05779,3.659462,263.9074,0.2432062,8.499095,1.160031,38.16615,41.65371,361.5,148.9294,
              10.25777,71.29159,10.02279,16.36062,601.1667,4.960311,12.22526,87.54137,48.48371,78.56777,212.8153,
              77.0353,62.7624,81.26739,34.36881,42.63432,264.4551,15.24863,25.94133,35.88882,34.94669,222.5425,
              304.9676,19.68377,7.216153,17.61534,32.25887,14.08842,773.5914])

## 1. Calculate the DIC for each model

Recall that the Deviance Information Criterion is given by:

$\mathrm{DIC} = \langle D(\theta) \rangle + p_D; \quad p_D = \langle D(\theta) \rangle - D(\langle\theta\rangle)$

where $\theta$ are the parameters of a model, the deviance $D(\theta)=-2\log P(\mathrm{data}|\theta)$, and averages $\langle\rangle$ are over the posterior distribution of $\theta$.
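
To make these definitions concrete, here is a minimal sketch on a toy problem that has nothing to do with the centering data: a unit-width Gaussian whose only free parameter is its mean. Everything named `toy_*` is made up for this illustration, and instead of running a chain we draw posterior samples directly, assuming a flat prior on the mean (for which the posterior is Gaussian). With a single well-constrained parameter, $p_D$ should come out close to 1.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(42)

# Toy data (hypothetical): 50 draws from a Gaussian with known sigma = 1
toy_data = rng.normal(loc=3.0, scale=1.0, size=50)

def toy_log_likelihood(mu):
    # log P(data|mu) for independent Gaussian data with known sigma = 1
    return np.sum(st.norm.logpdf(toy_data, loc=mu, scale=1.0))

# Assumption: flat prior on mu, so the posterior is N(mean(data), 1/sqrt(n)).
# Drawing from it directly stands in for an MCMC chain.
toy_samples = rng.normal(toy_data.mean(), 1.0/np.sqrt(len(toy_data)), size=5000)

# Deviance for each posterior sample
toy_D = -2.0*np.array([toy_log_likelihood(mu) for mu in toy_samples])

toy_D_mean = toy_D.mean()                                    # <D(theta)>
toy_D_at_mean = -2.0*toy_log_likelihood(toy_samples.mean())  # D(<theta>)
toy_pD = toy_D_mean - toy_D_at_mean
toy_DIC = toy_D_mean + toy_pD

print("Effective number of fitted parameters =", toy_pD)
print("DIC =", toy_DIC)
```

For a multi-parameter model, the only change in this sketch would be that each sample is a vector and $\langle\theta\rangle$ is the mean over the sample axis.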

Write a function to compute this.

In [None]:
def DIC(Model):
    """
    Compute the Deviance Information Criterion for the given model.
    (In a less pedagogical world, this would logically be a method of the base Model class.)
    """
    # Compute the deviance D for each sample
    D = -2.0*np.array([ Model.log_likelihood(*params) for params in Model.samples ])
    #pD = 
    #DIC = 
    #return DIC, pD

TBC_above()

Compute the DIC for each model.

In [None]:
DIC1, pD1 = DIC(Model1)
print(Model1.name+':')
print("Effective number of fitted parameters =", pD1)
print("DIC =", DIC1)

**Checkpoint:** For Model 1 (the exponential), I get $p_D \approx 1.0$ and DIC $\approx 1668$. As with anything else computed from chains, there will be some stochasticity to the values you compute.

In [None]:
DIC2, pD2 = DIC(Model2)
print(Model2.name+':')
print("Effective number of fitted parameters =", pD2)
print("DIC =", DIC2)

Do your values of $p_D$ make intuitive sense?

In [None]:
TBC() # answer in Markdown

Now, to interpret this, we can take the reduction (hopefully) in the DIC going from Model 1 to Model 2 and compare it to the Jeffreys scale (see the [notes](../notes/model_evaluation.ipynb)). By this metric, is your second model better at explaining the data than the exponential model?

In [None]:
DIC1 - DIC2

In [None]:
TBC() # answer in Markdown

## 2. Compute the evidence by Monte Carlo integration

To do this, note that

$p(\mathrm{data}|H)=\int d\theta \, p(\mathrm{data}|\theta,H) \, p(\theta|H)$

can be approximated by averaging the likelihood over samples from the prior:

$p(\mathrm{data}|H) \approx \frac{1}{m}\sum_{k=1}^m p(\mathrm{data}|\theta_k,H)$, with $\theta_k\sim p(\theta|H)$.

This estimate is much more straightforward than trying to use samples from the posterior to calculate the evidence (which would require us to be able to normalize the posterior, which would require an estimate of the evidence, ...). But in general, and especially for high-dimensional parameter spaces, it is very inefficient (because the likelihood is typically large in only a small fraction of the prior volume). Still, let's give it a try.

Write a function to draw a large number of samples from the prior and use them to calculate the evidence. To avoid numerical over/underflows, use the special `scipy` function `logsumexp` (which we imported directly, way at the top of the notebook) to do the sum. As the name implies, this function is equivalent to `log(sum(exp(...)))`, but is more numerically stable.
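
As a hedged illustration of both points (the underflow problem and the prior-sample average), here is a sketch on a toy problem where the evidence can also be written down analytically: unit-width Gaussian data with an unknown mean and a uniform prior. The data, the prior limits and all `toy_*` names are invented for this example; they are not the centering model or its priors.

```python
import numpy as np
import scipy.stats as st
from scipy.special import logsumexp

# Why logsumexp: exponentiating very negative log-likelihoods underflows to zero.
log_like_values = np.array([-2000.0, -2001.0, -2002.0])
with np.errstate(divide='ignore'):
    naive = np.log(np.mean(np.exp(log_like_values)))  # exp() underflows, so this is -inf
# log of the mean = log of the sum - log of the number of terms
stable = logsumexp(log_like_values) - np.log(len(log_like_values))
print("naive:", naive, " stable:", stable)

# Toy problem (hypothetical): 20 Gaussian data points with known sigma = 1,
# unknown mean mu, and a uniform prior on mu between prior_lo and prior_hi.
rng = np.random.default_rng(1)
toy_data = rng.normal(loc=1.0, scale=1.0, size=20)
prior_lo, prior_hi = -10.0, 10.0

# Simple Monte Carlo: average the likelihood over draws from the prior.
m = 50000
mu_draws = rng.uniform(prior_lo, prior_hi, size=m)
toy_log_like = st.norm.logpdf(toy_data[None, :], loc=mu_draws[:, None], scale=1.0).sum(axis=1)
toy_logZ_mc = logsumexp(toy_log_like) - np.log(m)

# Analytic evidence for comparison (Gaussian integral truncated to the prior range).
n = len(toy_data)
ybar = toy_data.mean()
S = np.sum((toy_data - ybar)**2)
mass = st.norm.cdf((prior_hi - ybar)*np.sqrt(n)) - st.norm.cdf((prior_lo - ybar)*np.sqrt(n))
toy_logZ_exact = (-0.5*n*np.log(2*np.pi) - 0.5*S + 0.5*np.log(2*np.pi/n)
                  + np.log(mass) - np.log(prior_hi - prior_lo))

print("Monte Carlo log-evidence:", toy_logZ_mc)
print("Analytic log-evidence:   ", toy_logZ_exact)
```

Even in this one-dimensional toy, only a few percent of the prior draws land where the likelihood is appreciable, which is the inefficiency mentioned above.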

In [None]:
def log_evidence(Model, N=1000):
    """
    Compute the log evidence for the model using N samples from the prior
    """
    TBC()
    
TBC_above()

Do a quick test to check for NaNs:

In [None]:
log_evidence(Model1, N=2), log_evidence(Model2, N=2)

Roughly how precisely do we need to know the log-evidence to be able to compare models? Run `log_evidence` with different values of `N` (the number of prior samples in the average) until you're satisfied that you're getting a usefully accurate result for each model.

In [None]:
print('Model 1:')
for Nevidence in [1, 10, 100, 1000, 10000]:
    %time logE1 = log_evidence(Model1, N=Nevidence)
    print("From", Nevidence, "samples, the log-evidence is", logE1)

In [None]:
print('Model 2:')
for Nevidence in [1, 10, 100, 1000, 10000]:
    %time logE2 = log_evidence(Model2, N=Nevidence)
    print("From", Nevidence, "samples, the log-evidence is", logE2)

In [None]:
TBC() # SPEEDBUMP: Do the two evidence calculations look converged as a function of N?
#                  If not, increase the number of samples in the calculation.

So, we have log evidences computed for each model. Now what? We just compare their difference to the Jeffreys scale again:

In [None]:
logE2 - logE1

Note that we might end up with a different conclusion as to the strength of any preference of Model 2, compared with the DIC! The reason is that the evidence explicitly accounts for the information in the prior (which, recall, counts as part of the model definition), while the DIC does this much less directly.
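
To see how directly the prior enters, consider a toy case that can be done analytically (a sketch, with made-up numbers and a hypothetical helper `toy_log_evidence`): a single Gaussian measurement of a parameter with a uniform prior centered on zero. As long as the prior comfortably contains the likelihood, making the prior 10 times wider changes nothing about the best fit, but lowers the evidence by a factor of 10.

```python
import numpy as np
import scipy.stats as st

def toy_log_evidence(ybar, sigma, half_width):
    """
    Analytic log-evidence for one measurement ybar ~ Normal(mu, sigma),
    with a prior mu ~ Uniform(-half_width, half_width).
    """
    # Integral of the likelihood over the prior range, viewed as a Gaussian in mu
    mass = (st.norm.cdf(half_width, loc=ybar, scale=sigma)
            - st.norm.cdf(-half_width, loc=ybar, scale=sigma))
    # Divide by the prior width (i.e. multiply by the uniform prior density)
    return np.log(mass) - np.log(2.0*half_width)

logE_narrow = toy_log_evidence(ybar=1.0, sigma=0.5, half_width=10.0)
logE_wide = toy_log_evidence(ybar=1.0, sigma=0.5, half_width=100.0)

print("log-evidence, narrow prior:", logE_narrow)
print("log-evidence, wide prior:  ", logE_wide)
print("difference:", logE_narrow - logE_wide, " vs log(10) =", np.log(10.0))
```

The maximum likelihood is identical in the two cases, so a criterion that depends mostly on the likelihood near the posterior would barely notice the change.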

We could also be good Bayesians and admit that there should be a prior distribution in model space. For example, maybe I have a very compelling theoretical reason why the offset distribution should be exponential (I don't, but just for example). Then, I might need some extra convincing that an alternative model is required.

We would then compute the ratio of the posterior probabilities of the models as follows:

In [None]:
prior_H1 = 0.9 # or your choice
prior_H2 = 1.0 - prior_H1 # assuming only these two options

log_post_H1 = np.log(prior_H1) + logE1
log_post_H2 = np.log(prior_H2) + logE2

print('Difference of log posteriors (H2-H1):', log_post_H2-log_post_H1)
print('Ratio of posteriors (H2/H1):', np.exp(log_post_H2-log_post_H1)) # NB this one might over/underflow
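
If the over/underflow warning above worries you, one option (a sketch, using invented log-evidence values rather than anything computed in this notebook) is to normalize in log space with `logsumexp`, which yields posterior probabilities for the two models instead of their ratio.

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical numbers, for illustration only
toy_logE = np.array([-840.0, -832.5])   # log-evidences for H1 and H2
toy_log_prior = np.log([0.9, 0.1])      # prior probabilities of H1 and H2

# Unnormalized log posteriors, then normalize so the two probabilities sum to 1
toy_log_post = toy_log_prior + toy_logE
toy_post = np.exp(toy_log_post - logsumexp(toy_log_post))

print("P(H1|data) =", toy_post[0])
print("P(H2|data) =", toy_post[1])
```

The probabilities always lie between 0 and 1, so nothing overflows, although a strongly disfavored model can still round to exactly zero.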

Depending on the fitness of the alternative model you chose, you may find that only an extremely lopsided prior in model space would influence your conclusion.

Comment on what you find from the evidence, compared with the DIC.

In [None]:
TBC() # answer in Markdown

## 3. Compute the evidence with `dynesty`

To get some experience with a package that uses nested sampling to compute the evidence, let's repeat Section 2 using `dynesty`.

Looking at [the docs](https://dynesty.readthedocs.io/en/latest/crashcourse.html), we first need a function that maps the unit cube onto our prior. That is, the code is doing a substitution like

$\int d\theta\,p(\theta|H)\,p(\mathrm{data}|\theta,H) = \int_0^1 dF \,p\left[\mathrm{data}|\theta(F),H\right]$,

where $F=\int_{-\infty}^\theta d\theta'\,p(\theta'|H)$ is the cumulative distribution function of the prior (hence the identity $dF/d\theta = p(\theta|H)$ makes the equation above work out).

When the priors are uniform, the translation from $F$ to $\theta$ is a simple translation and rescaling, but for other priors it involves going through the prior distribution's quantile function. Fortunately, we assumed uniform priors for Model 1! However, if you followed the advice in the previous notebook to keep a dictionary of univariate priors in the `Model` object, and those priors are `scipy.stats` distributions, it's relatively easy to handle this slightly more general case using functions provided by `scipy`. Regardless of how you choose to do it, implement a function that performs this transformation for the 1-parameter Exponential model (and priors) below.
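
For reference, the quantile function of a `scipy.stats` distribution is its `ppf` method. The sketch below applies it to a hypothetical 2-parameter model with one uniform and one Gaussian prior; the names `toy_priors`, `toy_param_names` and `toy_ptform`, and the priors themselves, are assumptions for illustration and not the ones from the previous notebook. Note that `st.uniform(loc, scale)` covers the interval from `loc` to `loc + scale`.

```python
import numpy as np
import scipy.stats as st

# Hypothetical independent priors, keyed by parameter name
toy_priors = {'a': st.uniform(loc=0.0, scale=10.0),  # uniform on [0, 10]
              'b': st.norm(loc=0.0, scale=2.0)}      # Gaussian, mean 0, width 2
toy_param_names = ['a', 'b']

def toy_ptform(u):
    '''
    Input: a vector in the unit cube, 0 <= u[i] <= 1.
    Output: the corresponding prior quantiles, one per parameter.
    '''
    return np.array([toy_priors[name].ppf(ui) for name, ui in zip(toy_param_names, u)])

# The 25th percentile of 'a' and the median of 'b'
print(toy_ptform([0.25, 0.5]))
```

This only works parameter by parameter because the priors are independent; correlated priors would need a different construction.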

In [None]:
def ptform(u):
    '''
    Input: a vector in the unit cube, 0 <= u[i] <= 1, that conforms to the prior volume.
    Output: a vector in our parameter space.
    '''
    TBC()
    
TBC_above()

**Checkpoint:** Do a sanity check - the cell below should return the quartiles of the uniform prior you chose.

In [None]:
print(ptform([0.25]), ptform([0.5]), ptform([0.75]))

We'll also need a log-likelihood function that takes a vector of parameters as input, in contrast to how it's written in the Model class. This could be solved by using the ridiculous construction in the `Model.log_posterior` method, or we could just define:

In [None]:
def dyn_log_like(params):
    return Model1.log_likelihood(*params)

Then we can just go ahead and run.

In [None]:
%%time
sampler = dynesty.NestedSampler(dyn_log_like, ptform, len(Model1.param_names))
sampler.run_nested()
results = sampler.results

And look at some diagnostics (see the `dynesty` documentation for an explanation of what all this is):

In [None]:
# Plot a summary of the run.
rfig, raxes = dyplot.runplot(results)

The Evidence panel appears to be utterly unhelpful in this case (at least when I ran it), so below is a plot of the log-evidence as a function of iteration, discarding the beginning when the accumulated evidence is truly tiny. It should show us converging to a value similar to what we found earlier.

In [None]:
plt.rcParams['figure.figsize'] = (12.0, 4.0)
Ndiscard = 100
plt.plot(results['logz'][Ndiscard:]);
plt.xlabel("Iteration - "+str(Ndiscard), fontsize=12);
plt.ylabel("Log Evidence", fontsize=12);

This extracts the log-evidence at the last iteration, which may or may not be the approved method.

In [None]:
results['logz'][-1]

Let's compare with what we got from simple Monte Carlo:

In [None]:
logE1 - results['logz'][-1]

That was so easy, let's go ahead and do the same for Model 2. Implement the transformation from the unit cube to your model's parameter space:

In [None]:
def ptform2(u):
    '''
    Input: a vector in the unit cube, 0 <= u[i] <= 1, that conforms to the prior volume.
    Output: a vector in our parameter space.
    '''
    TBC()
    
TBC_above()

And run this elegant definition:

In [None]:
def dyn_log_like2(params):
    return Model2.log_likelihood(*params)

Now we can go ahead and run the sampler.

In [None]:
%%time
sampler = dynesty.NestedSampler(dyn_log_like2, ptform2, len(Model2.param_names))
sampler.run_nested()
results = sampler.results

... and produce the same set of plots.

In [None]:
rfig, raxes = dyplot.runplot(results)

In [None]:
plt.rcParams['figure.figsize'] = (12.0, 4.0)
Ndiscard = 100
plt.plot(results['logz'][Ndiscard:]);
plt.xlabel("Iteration - "+str(Ndiscard), fontsize=12);
plt.ylabel("Log Evidence", fontsize=12);

Finally, extract the final evidence,

In [None]:
results['logz'][-1]

... and compare it to the simple Monte Carlo result.

In [None]:
logE2 - results['logz'][-1]

Given the work you'd already done, that was (I hope) staggeringly easy to set up, even if it did take longer to compute the same answer (I find) in this case. In higher-dimensional parameter spaces, it's easy to imagine this method being far preferable to blindly sampling from the prior. Note that we ran `dynesty` out of the box here; there are many options (and some potential failure modes) that you should read about before using it in the real world.