# Tutorial: The Cepheid Period-Luminosity Relation for a Single Galaxy

In the [Cepheids](cepheids.ipynb) notebook, we loaded in data that should constrain the period-luminosity relation, and sketched out a hierarchical model that will let us determine whether the relation is universal (the same in all galaxies).

Strictly speaking, even the model for a single galaxy is hierarchical, so we will start with that. Later on we will tackle the more complex challenge of fitting all the galaxies together.

Start by restoring the previous notebook. This includes all of the `import`s from that notebook, so we don't need to repeat them.

In [None]:
exec(open('tbc.py').read()) # define TBC and TBC_above
import dill

# may need to change the load path
TBC() # dill.load_session('../ignore/cepheids.db')

exec(open('tbc.py').read()) # (re-)define TBC and TBC_above

These are additional imports we imagine you might like.

In [None]:
import scipy.stats as st
import emcee
import incredible as cr
from pygtc import plotGTC

## 1. Data

Let's arbitrarily use the first galaxy for this exercise - it's somewhere in the middle of the pack in terms of how many measured cepheids it contains.

Even though we're only looking at one galaxy so far, let's try to write code that can later be re-used to handle any galaxy (so that we can fit all galaxies simultaneously). To that end, most functions will have an argument, `g`, which is a key into the `data` dictionary.

In [None]:
g = ngc_numbers[0]
g

Print the number of cepheids in this galaxy:

In [None]:
data[g]['Ngal']

## 2. Model specification

Note: it isn't especially onerous to keep all the galaxies around for the "Model" and "Strategy" sections, but feel free to specialize to the single galaxy case if it helps.

Before charging forward, let's finish specifying the model. We previously said we would allow an intrinsic scatter about the overall period-luminosity relation - let's take that to be Gaussian such that the linear relation sets the mean of the scatter distribution, and there is an additional parameter for the width (in magnitudes), $\sigma_i$, for the $i$th galaxy. (Note that normal scatter in magnitudes, which are log-luminosity, could also be called log-normal scatter in luminosity; these are completely equivalent.) We'll hold off on specifying hyperpriors for the distributions of intercepts, slopes and intrinsic scatters among galaxies.

Your previous PGM should need minimal if any modification, but make sure that all of the model parameters are represented:

* The observed apparent magnitude of the $j^{th}$ cepheid in the $i^{th}$ galaxy, $m^{\rm obs}_{ij}$
* The "true" apparent magnitude of the $j^{th}$ cepheid in the $i^{th}$ galaxy, $m_{ij}$
* The known observational uncertainty on the apparent magnitude of the $j^{th}$ cepheid in the $i^{th}$ galaxy, $\varepsilon_{ij}$
* The true absolute magnitude of the $j^{th}$ cepheid in the $i^{th}$ galaxy, $M_{ij}$
* The log period for the $j^{th}$ cepheid in the $i^{th}$ galaxy, $\log_{10}P_{ij}$

* The luminosity distance to the $i^{th}$ galaxy, $d_{L,i}$
* The intercept parameter of the period-luminosity relation in the $i^{th}$ galaxy, $a_{i}$
* The slope parameter of the period-luminosity relation in the $i^{th}$ galaxy, $b_{i}$
* The intrinsic scatter parameter about the period-luminosity relation in the $i^{th}$ galaxy, $\sigma_{i}$
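
One way to check your understanding of how the quantities listed above fit together is to forward-simulate mock data for a single galaxy. The sketch below is only an illustration: all numerical values are made up, the pivot convention is an assumption, and the conversion between absolute and apparent magnitude uses the standard distance modulus with $d_L$ in Mpc and no further corrections.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical values for one galaxy (not taken from the tutorial's data)
a, b, sigma = -5.0, -3.0, 0.2   # intercept, slope, intrinsic scatter (mag)
pivot = 1.5                     # assumed pivot in log10(period/days)
d_L = 20.0                      # luminosity distance in Mpc
n = 50                          # number of cepheids

logP = rng.uniform(0.8, 2.0, size=n)   # log10 periods, treated as known
eps = rng.uniform(0.1, 0.4, size=n)    # known measurement uncertainties (mag)

# True absolute magnitudes scatter about the linear relation
M_true = rng.normal(a + b * (logP - pivot), sigma)
# Distance modulus converts absolute to apparent magnitude (d_L in Mpc)
mu_dist = 5.0 * np.log10(d_L) + 25.0
m_true = M_true + mu_dist
# Observed apparent magnitudes include Gaussian measurement error
m_obs = rng.normal(m_true, eps)
```

Each line that draws a random number corresponds to a stochastic node in the PGM, while the distance-modulus step is deterministic.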

> _TBC: new PGM_

Also write down the probabilistic expressions represented in the PGM, with the exception of those for $a_i$, $b_i$ and $\sigma_i$, for which we still haven't chosen priors.

> _TBC: probabilistic relationships_

For the remainder of this notebook, we will assume wide, uniform priors for $a_i$, $b_i$ and $\sigma_i$, but it's useful (for later) to have the model sketched out above without those assumptions.

## 3. Strategy

The hierarchical nature of this problem has left us with a large number of nuisance parameters, namely a true absolute magnitude for every one of the cepheids. The question now is: how are we going to deal with them?

There are a few possibilities:

### Sampling:

We could take a brute force approach - just apply one of the general-purpose algorithms we've looked at and hope it works.

Alternatively, while it might not be obvious, this problem (a linear model with normal distributions everywhere) is fully conjugate, given the right choice of prior. We could therefore use a conjugate Gibbs sampling code specific to the linear/Gaussian case (it's common enough that they exist) or a more general code that works out and takes advantage of any conjugate relations, given a model. (You could also work out and code up the conjugacies yourself, if you're into that kind of thing.) These are all still "brute-force" in the sense that they are sampling all the nuisance parameters, but we might hope for faster convergence than a more generic algorithm.

### Direct integration:

If some parameters truly are nuisance parameters, in the sense that we don't care what their posteriors are, then we'll ultimately marginalize over them anyway. Rather than sampling the full-dimensional parameter space and then looking only at the marginal distributions we care about, we always have the option of sampling only parameters we care about, and, while evaluating _their_ posterior, doing integrals over the nuisance parameters in some other way. In other words, we should remember that obtaining samples of a parameter is only one method of integrating over it.

Whether it makes sense to go this route depends on the structure of the model (and how sophisticated you care to make your sampler). Sometimes, sampling the nuisance parameters just like the parameters of interest turns out to be the best option. Other times, direct integration is much more efficient. And, of course, "direct integration" could take many forms, depending on the integrand: an integral might be analytic, or it might be best accomplished by quadrature or by Monte Carlo integration. The dimensionality of the integration (in particular, whether it factors into one- or at least low-dimensional integrals) is something to consider.
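
To make the quadrature and Monte Carlo options concrete, here is a minimal sketch that marginalizes over a single nuisance parameter both ways. The numbers and variable names are invented for illustration, and the one-dimensional Gaussian setup is an assumption chosen to resemble one data point in this problem.

```python
import numpy as np
from scipy import integrate, stats

rng = np.random.default_rng(0)

# Hypothetical numbers for a single data point
y_obs, err = -5.3, 0.3      # measured value and its known uncertainty
mean, scatter = -5.0, 0.2   # mean and width of the distribution of the true value

def integrand(y_true):
    # p(y_obs | y_true) * p(y_true | mean, scatter)
    return stats.norm.pdf(y_obs, y_true, err) * stats.norm.pdf(y_true, mean, scatter)

# Quadrature over a range wide enough to contain essentially all of the integrand
lo, hi = mean - 10 * scatter, mean + 10 * scatter
quad_value, quad_err = integrate.quad(integrand, lo, hi)

# Simple Monte Carlo: average the sampling distribution over draws of the true value
draws = rng.normal(mean, scatter, size=200_000)
mc_value = stats.norm.pdf(y_obs, draws, err).mean()
```

The two estimates should agree to within the Monte Carlo noise. With one such integral per cepheid for every posterior evaluation, the cost adds up quickly, which is part of why an analytic result is attractive when one exists.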

So, for this model, try to write down the posterior for $a_i$, $b_i$ and $\sigma_i$, marginalized over the $M_{ij}$ parameters. If you're persistent, you should find that the integral is analytic, meaning that we can reduce the sampling problem to a computationally efficient posterior distribution over just $a_i$, $b_i$ and $\sigma_i$, at the expense of having to use our brains.

If you get super stuck, note that working this out is not a requirement for the notebook (see below), but I suspect it provides the most efficient solution overall.

Hint: the [`gaussians`](gaussians.ipynb) notebook is helpful here.

> _TBC math_

## 4. Obtain the posterior

Sample the posterior of $a_i$, $b_i$ and $\sigma_i$ for the one galaxy chosen above (i.e. a single $i$), and do the usual sanity checks and visualizations. Use "wide uniform" priors on $a$, $b$ and $\sigma$.

In the subsections below, you'll get to do this 3 different ways! First you'll apply a generic sampler to the brute-force and analytic integration methods. Then we'll walk through using a Gibbs sampling package.

Hint: a common trick to reduce the posterior correlation between the intercept and slope parameters of a line is to reparametrize the model as $a+bx \rightarrow a' + b(x-x_0)$, where the "pivot" $x_0$ is roughly the mean of $x$ in the data. You don't _have_ to do this, but smaller correlations usually mean faster convergence. If you do, don't forget about the redefinition when visualizing/interpreting the results!
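
The effect of the pivot can be seen in a simplified setting. The sketch below assumes ordinary least squares with equal errors and no intrinsic scatter (so it is not the model of this notebook), uses made-up $x$ values, and compares the intercept-slope correlation with and without a pivot at the mean of $x$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: x values far from zero, with equal errors on y
x = rng.uniform(0.8, 2.0, size=40)
yerr = 0.3

def intercept_slope_correlation(x, x0):
    # Design matrix for the model a + b*(x - x0)
    A = np.column_stack([np.ones_like(x), x - x0])
    # Parameter covariance of the least-squares fit with uniform errors
    cov = yerr**2 * np.linalg.inv(A.T @ A)
    return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

rho_no_pivot = intercept_slope_correlation(x, 0.0)
rho_pivot = intercept_slope_correlation(x, x.mean())
```

In this idealized case the correlation vanishes when the pivot is exactly the mean of $x$. With unequal errors and intrinsic scatter the cancellation is only approximate, which is why "roughly the mean" is good enough.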

In [None]:
# find pivots (nb different for every galaxy, which is not what we'd want in a simultaneous analysis)
for i in ngc_numbers:
    data[i]['pivot'] = data[i]['logP'].mean()
# to avoid confusion later, reset all pivots to the same value
global_pivot = np.mean([data[i]['logP'].mean() for i in ngc_numbers])
for i in ngc_numbers:
    data[i]['pivot'] = global_pivot

Here's a function to evaluate the mean relation, with an extra argument for the pivot point:

In [None]:
def meanfunc(x, xpivot, a, b):
    '''
    x is log10(period/days)
    returns an absolute magnitude
    '''
    return a + b*(x - xpivot)

### 4a. Brute force sampling of all parameters

Attempt to simply sample all the parameters of the model. Let's... not include all the individual magnitudes in these lists of named parameters, though.

In [None]:
param_names = ['a', 'b', 'sigma']
param_labels = [r'$a$', r'$b$', r'$\sigma$']

I suggest starting by finding decent guesses of $a$, $b$, $\sigma$ by trial and error/inspection. For extra fun, choose values such that the model goes through the points, but isn't a _great_ fit. This will let us see how well the sampler used below performs when it needs to find its own way to the best fit.

In [None]:
TBC(1) # guess = {'a': ...

guessvec = [guess[p] for p in param_names] # it will be useful to have `guess` as a vector also

plt.rcParams['figure.figsize'] = (7.0, 5.0)
plt.errorbar(data[g]['logP'], data[g]['M'], yerr=data[g]['merr'], fmt='none');
plt.xlabel('log10 period/days', fontsize=14);
plt.ylabel('absolute magnitude', fontsize=14);
xx = np.linspace(0.5, 2.25, 100)
plt.plot(xx, meanfunc(xx, data[g]['pivot'], guess['a'], guess['b']))
plt.gca().invert_yaxis();

We'll provide the familiar skeleton of function prototypes below, with a couple of small changes. One is that we added an optional argument `Mtrue` to the log-prior - this allows the same prior function to be used in all parts of this exercise, even when the true magnitudes are not being explicitly sampled (the function calls in later sections would simply not pass anything for `Mtrue`). The log-posterior function is also generic, in the sense that it takes as an argument the log-likelihood function it should use. Another difference is that we provide a function called `logpost_vecarg_A` ("A" referring to this part of the notebook) that takes a vector of parameters as input, ordered $a$, $b$, $\sigma$, $M_1$, $M_2$, ..., instead of a dictionary. This is for compatibility with the `emcee` sampler which is used below. (If you would like to use a different but still generic method instead, like HMC, go for it.)

In [None]:
# prior, likelihood, posterior functions for a SINGLE galaxy

# generic prior for use in all parts of the notebook
def log_prior(a, b, sigma, Mtrue=None):
    TBC()
    
# likelihood specifically for part A
def log_likelihood_A(gal, a, b, sigma, Mtrue):
    '''
    `gal` is an entry in the `data` dictionary; `a`, `b`, and `sigma` are scalars; `Mtrue` is an array
    '''
    TBC()
    
# generic posterior, again for all parts of the problem
def log_posterior(gal, loglike, **params):
    lnp = log_prior(**params)
    if lnp != -np.inf:
        lnp += loglike(gal, **params)
    return lnp

# posterior for part A, taking a parameter array argument for compatibility with emcee
def logpost_vecarg_A(pvec):
    params = {name:pvec[i] for i,name in enumerate(param_names)}
    params['Mtrue'] = pvec[len(param_names):]
    return log_posterior(data[g], log_likelihood_A, **params)

TBC_above()

Here's a quick sanity check, which you can refine if needed:

In [None]:
guess_A = np.concatenate((guessvec, data[g]['M']))
logpost_vecarg_A(guess_A)

The cell below will set up and run `emcee` using the functions defined above. We've made some generic choices, such as using twice as many "walkers" as free parameters, and starting them distributed according to a Gaussian around `guess_A` with a width of 1%.

#### IMPORTANT

You do **not** need to run this version long enough to get what we would normally consider acceptable results, in terms of convergence and number of independent samples. Just convince yourself that it's functioning, and get a sense of how it performs. **Please do not turn in a notebook where the sampling cell below takes longer than $\sim30$ seconds to evaluate.**

In [None]:
%%time

nsteps = 1000 # or whatever

npars = len(guess_A)
nwalkers = 2*npars
sampler = emcee.EnsembleSampler(nwalkers, npars, logpost_vecarg_A)
start = np.array([np.array(guess_A)*(1.0 + 0.01*np.random.randn(npars)) for j in range(nwalkers)])
sampler.run_mcmc(start, nsteps)
print('Yay!')
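
A note on the walker initialization used above: scaling the guess by a factor close to 1 produces an array with one row per walker, but it gives no spread at all to any parameter whose guess is exactly zero. The sketch below, with made-up guess values, shows the array shape and one possible alternative (additive scatter) for that situation; it is not something the cell above requires.

```python
import numpy as np

rng = np.random.default_rng(3)

guess_vec = np.array([-5.0, -3.0, 0.2, -4.1, -5.6])  # hypothetical starting point
ndim = len(guess_vec)
nwalk = 2 * ndim

# Multiplicative 1% scatter: one row per walker, one column per parameter
start_mult = guess_vec * (1.0 + 0.01 * rng.standard_normal((nwalk, ndim)))

# A parameter guessed to be exactly zero gets no spread this way,
# so an additive scatter is a possible alternative in that case
guess_zero = np.array([0.0, -3.0, 0.2])
start_bad = guess_zero * (1.0 + 0.01 * rng.standard_normal((6, 3)))
start_add = guess_zero + 0.01 * rng.standard_normal((6, 3))
```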

Let's look at the usual trace plots, including only one of the magnitudes since there are so many.

In [None]:
npars = len(guess)+1
plt.rcParams['figure.figsize'] = (16.0, 3.0*npars)
fig, ax = plt.subplots(npars, 1);
cr.plot_traces(sampler.chain[:min(8,nwalkers),:,:npars], ax, labels=param_labels+[r'$M_1$']);
npars = len(guess_A)

Chances are this is not very impressive. But we carry on, to have it as a point of comparison. The cell below will print out the usual quantitative diagnostics.

In [None]:
TBC()
# burn = ...
# maxlag = ...

tmp_samples = [sampler.chain[i,burn:,:4] for i in range(nwalkers)]
print('R =', cr.GelmanRubinR(tmp_samples))
print('neff =', cr.effective_samples(tmp_samples, maxlag=maxlag))
print('NB: Since walkers are not independent, these will be optimistic!')
print("Plus, there's a good chance that the results in this section are garbage...")
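
As a reminder of what the $R$ statistic printed above measures, here is a minimal sketch of the classic Gelman-Rubin calculation applied to mock chains. It is an illustration only: the function name is ours, and the `incredible` package may use a different variant of the formula.

```python
import numpy as np

def gelman_rubin(chains):
    # chains: array of shape (n_chains, n_steps, n_params)
    chains = np.asarray(chains)
    m, n, _ = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean(axis=0)   # mean within-chain variance
    B = n * chain_means.var(axis=0, ddof=1)       # between-chain variance
    var_hat = (n - 1) / n * W + B / n             # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(4)
# Four well-mixed mock chains drawn from the same distribution
good = rng.normal(0.0, 1.0, size=(4, 2000, 2))
# The same chains with one of them offset, mimicking a lack of convergence
bad = good.copy()
bad[0] += 3.0

R_good = gelman_rubin(good)
R_bad = gelman_rubin(bad)
```

Values close to 1 indicate that the chains agree with one another; the offset chain pushes $R$ well above 1.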

Finally, we'll look at a triangle plot.

In [None]:
samples_A = sampler.chain[:,burn:,:].reshape(nwalkers*(nsteps-burn), npars)

plotGTC([samples_A[:,:4]], paramNames=param_labels+[r'$M_1$'], chainLabels=['emcee/brute'],
        figureSize=8, customLabelFont={'size':12}, customTickFont={'size':12}, customLegendFont={'size':16});

We should also probably look at how well the fitted model matches the data, qualitatively.

In [None]:
plt.rcParams['figure.figsize'] = (7.0, 5.0)
plt.errorbar(data[g]['logP'], data[g]['M'], yerr=data[g]['merr'], fmt='none');
plt.xlabel('log10 period/days', fontsize=14);
plt.ylabel('absolute magnitude', fontsize=14);
xx = np.linspace(0.5, 2.25, 100)
plt.plot(xx, meanfunc(xx, data[g]['pivot'], samples_A[:,0].mean(), samples_A[:,1].mean()), label='emcee/brute')
plt.gca().invert_yaxis();
plt.legend();

### 4b. Sampling with analytic marginalization

Next, implement sampling of $a$, $b$, $\sigma$ using your analytic marginalization over the true magnitudes. Again, the machinery to do the sampling is below; you only need to provide the log-posterior function.

In [None]:
def log_likelihood_B(gal, a, b, sigma):
    '''
    `gal` is an entry in the `data` dictionary; `a`, `b`, and `sigma` are scalars
    '''
    TBC()

def logpost_vecarg_B(pvec):
    params = {name:pvec[i] for i,name in enumerate(param_names)}
    return log_posterior(data[g], log_likelihood_B, **params)

TBC_above()

Check for NaNs:

In [None]:
logpost_vecarg_B(guessvec)

Again, we run `emcee` below. Anticipating an improvement in efficiency, we've increased the default number of steps below. Unlike the last time, you should run long enough to have useful samples in the end.

In [None]:
%%time

nsteps = 10000

npars = len(param_names)
nwalkers = 2*npars
sampler = emcee.EnsembleSampler(nwalkers, npars, logpost_vecarg_B)
start = np.array([np.array(guessvec)*(1.0 + 0.01*np.random.randn(npars)) for j in range(nwalkers)])
sampler.run_mcmc(start, nsteps)
print('Yay!')

Again, trace plots. Note that we no longer get a trace of the magnitude parameters. If we really wanted a posterior for them, we would now need to do extra calculations.

In [None]:
plt.rcParams['figure.figsize'] = (16.0, 3.0*npars)
fig, ax = plt.subplots(npars, 1);
cr.plot_traces(sampler.chain[:min(8,nwalkers),:,:], ax, labels=param_labels);
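
As for the "extra calculations" that would recover the magnitudes: one possible approach is to draw each true magnitude from its conditional posterior, given each sample of $a$, $b$, $\sigma$. The sketch below assumes the Gaussian model of this notebook, in which that conditional is the product of two Gaussians, and uses invented data and posterior samples throughout.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical inputs: data for three cepheids and a few posterior samples
logP = np.array([1.0, 1.4, 1.9])
M_obs = np.array([-4.2, -5.1, -6.4])          # measured absolute magnitudes
err = np.array([0.3, 0.2, 0.4])               # their known uncertainties
pivot = 1.5
abs_samples = np.array([[-5.4, -2.9, 0.20],   # rows of (a, b, sigma)
                        [-5.5, -3.1, 0.25]])

M_draws = np.empty((len(abs_samples), len(logP)))
for k, (a, b, sigma) in enumerate(abs_samples):
    prior_mean = a + b * (logP - pivot)
    # Product of two Gaussians in M: precisions add, means are precision-weighted
    post_var = 1.0 / (1.0 / sigma**2 + 1.0 / err**2)
    post_mean = post_var * (prior_mean / sigma**2 + M_obs / err**2)
    M_draws[k] = rng.normal(post_mean, np.sqrt(post_var))
```

Each row of `M_draws` is then one joint sample of the magnitudes, consistent with the corresponding row of `abs_samples`.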

Again, $R$ and $n_\mathrm{eff}$.

In [None]:
TBC()
# burn = ...
# maxlag = ...

tmp_samples = [sampler.chain[i,burn:,:] for i in range(nwalkers)]
print('R =', cr.GelmanRubinR(tmp_samples))
print('neff =', cr.effective_samples(tmp_samples, maxlag=maxlag))
print('NB: Since walkers are not independent, these will be optimistic!')

Now, let's compare the posterior from this analysis to the one we got before:

In [None]:
samples_B = sampler.chain[:,burn:,:].reshape(nwalkers*(nsteps-burn), npars)

plotGTC([samples_A[:,:3], samples_B], paramNames=param_labels, chainLabels=['emcee/brute', 'emcee/analytic'],
        figureSize=8, customLabelFont={'size':12}, customTickFont={'size':12}, customLegendFont={'size':16});

**Checkpoint:** Your posterior is compared with our solution by the cell below. Note that we used the `global_pivot` defined above. If you did not, your constraints on $a$ will differ due to this difference in definition, even if everything is correct.

In [None]:
sol = np.loadtxt('solutions/ceph1.dat.gz')
plotGTC([sol, samples_B], paramNames=param_labels, chainLabels=['solution', 'my emcee/analytic'],
        figureSize=8, customLabelFont={'size':12}, customTickFont={'size':12}, customLegendFont={'size':16});

Moving on, look at how the two fits you've done compare visually:

In [None]:
plt.rcParams['figure.figsize'] = (7.0, 5.0)
plt.errorbar(data[g]['logP'], data[g]['M'], yerr=data[g]['merr'], fmt='none');
plt.xlabel('log10 period/days', fontsize=14);
plt.ylabel('absolute magnitude', fontsize=14);
xx = np.linspace(0.5, 2.25, 100)
plt.plot(xx, meanfunc(xx, data[g]['pivot'], samples_A[:,0].mean(), samples_A[:,1].mean()), label='emcee/brute')
plt.plot(xx, meanfunc(xx, data[g]['pivot'], samples_B[:,0].mean(), samples_B[:,1].mean()), label='emcee/analytic')
plt.gca().invert_yaxis();
plt.legend();

Comment on things like the efficiency, accuracy, and/or utility of the two approaches.

> _TBC commentary_

### 4c. Conjugate Gibbs sampling

Finally, we'll step through using a specialized Gibbs sampler to solve this problem. We'll use the `LRGS` package, not because it's the best option (it isn't), but because it's written in pure Python. The industry-standard (and far less specialized) alternative goes by the name JAGS, and requires a separate installation (though one can add a Python interface on top of that).

Let me stress that LRGS is in no fashion optimized for speed; JAGS is presumably faster, not to mention applicable to more than just fitting lines. Even so, LRGS seems to be comparable in speed with our analytically supercharged `emcee` in this case, when one considers that the samples it returns are less correlated.

In [None]:
import lrgs

LRGS is a "general" linear model fitter, meaning that $x$ and $y$ can be multidimensional. So the input data are formatted as matrices with one row for each data point. In this case, they're column vectors ($n\times1$ matrices).

Measurement uncertainties are given as a list of covariance matrices. The code handles errors on both $x$ and $y$, so these are $2\times2$ for us. Since our $x$'s are given precisely, we just put in a dummy value here and use a different option to fix the values of $x$ below.

In [None]:
x = np.asmatrix(data[g]['logP'] - data[g]['pivot']).T
y = np.asmatrix(data[g]['M']).T
M = [np.matrix([[1e-6, 0], [0, err**2]]) for err in data[g]['merr']]

Conjugate Gibbs sampling can be parallelized in the simplest possible way - you just run multiple chains from different starting points or even just with different random seeds in parallel. (`emcee` is parallelized internally, since walkers need to talk to each other.) Therefore...

In [None]:
import multiprocessing
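
The only real requirement for this kind of parallelization is that each chain uses a different random number stream. As a generic illustration (not specific to LRGS, which may rely on NumPy's global random state instead), the sketch below uses `numpy.random.SeedSequence` to spawn independent seeds. The `run_chain` function is a hypothetical stand-in for a sampler, and the chains are run serially here for simplicity.

```python
import numpy as np

def run_chain(seed_seq, nsteps=1000):
    # Stand-in for a real sampler: each chain owns its own random generator
    rng = np.random.default_rng(seed_seq)
    return rng.normal(size=nsteps)

# Spawn statistically independent child seeds from a single parent seed
children = np.random.SeedSequence(12345).spawn(4)
# Run serially here; the same function could be handed to a process pool
chains = [run_chain(s) for s in children]
```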

This function sets things up and does the actual sampling, returning a Numpy array in the usual format. The default priors are equivalent to the ones we chose above, helpfully.

In [None]:
nsteps = 2000 # some arbitrary number of steps to run

def do_gibbs(i):
    # every parallel process will have the same random seed if we don't reset them here
    if i > 0:
        np.random.seed(i)
    # lrgs.Parameters sets up a sampler that assumes the x's are known precisely.
    # Other classes would correspond to different possible priors on x.
    par = lrgs.Parameters(x, y, M)
    chain = lrgs.Chain(par, nsteps)
    chain.run(fix='x') # fix='x' isn't necessary here, but it shows how one would fix other parameters if we wanted to
    # Extracts the chain as a dictionary. Note that we have the option of hanging onto the samples of the magnitude
    #  parameters in addition to the intercept, slope and scatter, though this is not the default.
    dchain = chain.to_dict(["B", "Sigma"])
    # since $\sigma^2$ is sampled rather than $\sigma$, take the square root here
    return np.array([dchain['B_0_0'], dchain['B_1_0'], np.sqrt(dchain['Sigma_0_0'])]).T

Go!

In [None]:
%%time
with multiprocessing.Pool() as pool:
    gibbs_samples = pool.map(do_gibbs, range(2)) # 2 parallel processes - change if you want

Show!

In [None]:
plt.rcParams['figure.figsize'] = (16.0, 3.0*npars)
fig, ax = plt.subplots(npars, 1);
cr.plot_traces(gibbs_samples, ax, labels=param_labels);

In [None]:
burn = 50
maxlag = 1000

tmp_samples = [x[burn:,:] for x in gibbs_samples]
print('R =', cr.GelmanRubinR(tmp_samples))
print('neff =', cr.effective_samples(tmp_samples, maxlag=maxlag))

Here are the posteriors:

In [None]:
samples_C = np.concatenate(tmp_samples, axis=0)

plotGTC([samples_A[:,:3], samples_B, samples_C], paramNames=param_labels,
        chainLabels=['emcee/brute', 'emcee/analytic', 'LRGS/Gibbs'],
        figureSize=8, customLabelFont={'size':12}, customTickFont={'size':12}, customLegendFont={'size':16});

Again, look at the fit compared with the other methods:

In [None]:
plt.rcParams['figure.figsize'] = (7.0, 5.0)
plt.errorbar(data[g]['logP'], data[g]['M'], yerr=data[g]['merr'], fmt='none');
plt.xlabel('log10 period/days', fontsize=14);
plt.ylabel('absolute magnitude', fontsize=14);
xx = np.linspace(0.5, 2.25, 100)
plt.plot(xx, meanfunc(xx, data[g]['pivot'], samples_A[:,0].mean(), samples_A[:,1].mean()), label='emcee/brute')
plt.plot(xx, meanfunc(xx, data[g]['pivot'], samples_B[:,0].mean(), samples_B[:,1].mean()), label='emcee/analytic')
plt.plot(xx, meanfunc(xx, data[g]['pivot'], samples_C[:,0].mean(), samples_C[:,1].mean()), '--', label='LRGS/Gibbs')
plt.gca().invert_yaxis();
plt.legend();

## Finishing up

There's yet more fun to be had in the next tutorial, so let's again save the definitions and data from this session.

In [None]:
del pool # cannot be pickled

TBC() # change path below if desired
# dill.dump_session('../ignore/cepheids_one.db')