## Tutorial: AGN Photometry with Gibbs Sampling

Note: this tutorial follows [AGN Photometry on a Grid](agn_photometry_grid.ipynb). Along with [AGN Photometry with Metropolis Sampling](agn_photometry_metro.ipynb), it should be done before [MCMC Diagnostics](mcmc_diagnostics.ipynb).

Having laboriously done Bayesian inference on a grid to fit an AGN source to X-ray data, we will now turn to solving the same problem using the Gibbs sampling method of MCMC. A heads up: this is the longest tutorial to date.

In [None]:
exec(open('tbc.py').read()) # define TBC and TBC_above
import astropy.io.fits as pyfits
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from io import StringIO   # StringIO behaves like a file object
import scipy.stats as st
from pygtc import plotGTC
import incredible as cr

Once again, we will read in the X-ray image data, and extract a small image around an AGN that we wish to study.

In [None]:
from xray_image import Image

In [None]:
TBC() # datadir = '../ignore/' # or whatever - path to where you put the downloaded files

imagefile = datadir + 'P0098010101M2U009IMAGE_3000.FTZ'
expmapfile = datadir + 'P0098010101M2U009EXPMAP3000.FTZ'

imfits = pyfits.open(imagefile)
exfits = pyfits.open(expmapfile)

im = imfits[0].data
ex = exfits[0].data

orig = Image(im, ex)

In [None]:
x0 = 417
y0 = 209
stampwid = 5 # very small! see below
stamp = orig.cutout(x0-stampwid, x0+stampwid, y0-stampwid, y0+stampwid)

plt.rcParams['figure.figsize'] = (10.0, 10.0)
stamp.display(log_image=False)

## Fitting for 2 parameters

We're going to use this AGN photometry example to try out Gibbs sampling, specifically conjugate Gibbs. This doesn't work out beautifully for all the model parameters, but shouldn't be too bad if we
1. assume the background is zero. If we narrow the postage stamp down tightly around the AGN, then a very small fraction of counts should be background, and this will not be a terrible assumption. (Note the choice of `stampwid` above.)
2. assume that the exposure map is uniform over the stamp. Again, not bad if it's small enough.
3. fit only for the flux and PSF width. (We'll generalize to include the AGN position later.)

Please note that, even though the assumptions above should not be terrible, they aren't ones we would normally make in real life. Probably we wouldn't use Gibbs sampling in this particular case. Still, it's neat that we can even use this real example to illustrate the method.

### The likelihood

Time for some math. (Recall that conjugate Gibbs sampling sometimes lets us design a very efficient sampler at the cost of doing some math ourselves.)

First, it will be convenient to work with the expected total number of counts instead of the count rate. So, instead of `lnF0` as a parameter, we'll have $\mu$.

Recalling that the counts are Poisson distributed, our likelihood is

$p(\{N_i\}|\mu,\sigma) = \prod_i \frac{e^{-\mu_i}\mu_i^{N_i}}{N_i!}$,

where $N_i$ is the number of counts in pixel $i$ and $\mu_i$ is the expected number in that pixel ($\sum_i \mu_i=\mu$). Furthermore, our symmetric, Gaussian model of the PSF says that

$\mu_i = \mu \, F_i(x_0,y_0,\sigma)$,

where the AGN center is $(x_0,y_0)$, and $F_i$ is the integral of the Gaussian PSF over the area of the $i$th pixel.

This seems like a little bit of a mess, but we can simplify it to a more intuitive form. Namely, with $N=\sum_i N_i$,

$p(\{N_i\}|\mu,\sigma) \propto \mathrm{Poisson}(N|\mu) \prod_j F_{i(j)}(x_0,y_0,\sigma)$,

where the index $j$ only runs over detected counts (i.e. those pixels where $N_i>0$, repeated $N_i$ times in the product), and the notation $i(j)$ clumsily refers to the pixel where the $j$th count landed.

In other words, we can factor the likelihood into
1. a probability for the total number of counts, and
2. a probability for the spatial distribution of those counts.

This is one of those results that may be more intuitive to see than to derive, even if the derivation is not complex. So, please take a minute to check that you can get to this version of the likelihood from the previous one (up to factors of $N$ and $N_i$, which don't depend on model parameters; see Endnote 1), and write down how it works in your own words.

> _your own words..._
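
If it helps to convince yourself numerically, here is a small sketch that checks the factorization on a toy example. The pixel fractions `F`, the counts `N_i` and the value of `mu` are all made up for illustration; the leftover combinatoric factor is absorbed by using a multinomial distribution for the spatial part.

```python
import numpy as np
import scipy.stats as st

# Hypothetical toy setup: 4 pixels with made-up PSF fractions F_i that sum to 1.
F = np.array([0.1, 0.2, 0.3, 0.4])
mu = 7.5                       # expected total number of counts
N_i = np.array([0, 2, 1, 4])   # made-up observed counts per pixel
N = N_i.sum()

# Original form: product of independent Poisson terms, one per pixel.
lnp_pixels = st.poisson.logpmf(N_i, mu * F).sum()

# Factored form: a Poisson term for the total number of counts,
# plus a multinomial term for where those counts landed.
lnp_total = st.poisson.logpmf(N, mu)
lnp_spatial = st.multinomial.logpmf(N_i, n=N, p=F)

# The two log-likelihoods should agree to floating-point precision.
print(lnp_pixels, lnp_total + lnp_spatial)
```

The multinomial term is the product of the $F_{i(j)}$ over detected counts, multiplied by a coefficient that depends only on $N$ and the $N_i$.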

We will now be slightly dodgy, and replace $F_{i(j)}(x_0,y_0,\sigma)$ with the PSF (the Gaussian density for two uncorrelated variables) evaluated at the center of pixel $i$, $(x_i,y_i)$. This should not be the worst thing ever if the PSF is much larger than a pixel, which is the case here. The likelihood will no longer be correctly normalized, but as we'll see, this doesn't matter to us.

$p(\{N_i\}|\mu,\sigma) \propto \mathrm{Poisson}(N|\mu) \prod_j \mathrm{Normal}(x_j|x_0,\sigma) \, \mathrm{Normal}(y_j|y_0,\sigma)$.

Finally, we can shorten this even more by concatenating $\mathbf{x}-x_0$ and $\mathbf{y}-y_0$ into $\mathbf{z}$, indexed by $k$, writing

$p(\{N_i\}|\mu,\sigma) \propto \mathrm{Poisson}(N|\mu) \prod_k \mathrm{Normal}(z_k|0,\sigma)$.

This last step is just a notation trick to emphasize that all the displacements in both $x$ and $y$ from $(x_0,y_0)$ tell us about the PSF width, $\sigma$. The index $k$ would run from 1 to twice the number of detected counts.
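
As a rough illustration of how $\mathbf{z}$ could be built in practice, the sketch below expands a made-up counts image into one position per detected count and then concatenates the displacements. The array names (`counts`, `xpix`, `ypix`) are hypothetical stand-ins, not attributes of the `Image` class.

```python
import numpy as np

# Made-up 3x3 counts image and pixel-center coordinates (not the real stamp).
counts = np.array([[0, 1, 0],
                   [2, 3, 1],
                   [0, 1, 0]])
xpix, ypix = np.meshgrid(np.arange(416, 419), np.arange(208, 211))
x0, y0 = 417.0, 209.0

# One entry per detected count: each pixel center is repeated N_i times,
# and pixels with zero counts drop out automatically.
xj = np.repeat(xpix.ravel(), counts.ravel())
yj = np.repeat(ypix.ravel(), counts.ravel())

# Concatenate the x and y displacements into z, which has length 2N.
z = np.concatenate([xj - x0, yj - y0])
print(z.size, np.sum(z**2))
```

The quantity $\sum_k z_k^2$ printed at the end is the only summary of the positions that the likelihood for $\sigma$ depends on, given $x_0$ and $y_0$.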

### Conjugacy

The form of the likelihood is now about as simple as we could hope for. Next, we need to work out the conjugate relations for updating each parameter, and decide whether we can live with the form of the prior that this would require.

#### Relation for $\mu$

The dependence of the likelihood on $\mu$ is all in a Poisson term, and we see from [the repository of all knowledge](https://en.wikipedia.org/wiki/Conjugate_prior) that the conjugate prior is the Gamma distribution. Since this conjugacy was worked out [in the notes](../notes/bayes_law.ipynb), we won't make you derive it here. Lucky you!

Recall that, for conjugate Gibbs sampling, we want to work out the posterior for $\mu$ _conditioned on all other parameters_. This is much simpler than dealing with marginalization or some other operation. So, using the Poisson-Gamma conjugacy relation, we have that

$\mathrm{Gamma}(\mu|\alpha_0,\beta_0) \, \mathrm{Poisson}(N|\mu,\ldots) \propto \mathrm{Gamma}(\mu|\alpha_0+N,\beta_0+1) = p(\mu|\{N_i\};\ldots)$,

where "$\ldots$" stands for all the other parameters: $x_0,y_0,\sigma$.

(Remember that the posterior _must be_ a normalized PDF. Hence, the RHS _is_ the posterior for $\mu$. If we had carefully kept track of all the normalizing factors and computed the evidence, this is what it would work out to be. Aren't we lucky that we didn't need to do so explicitly?)
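
For the skeptical, a brute-force check of this relation is sketched below. The hyperparameter values and `N` are made up (and chosen to give a proper prior so that it can be evaluated on a grid); note that `scipy.stats.gamma` is parametrized by a scale, which is the inverse of the rate $\beta$ used here.

```python
import numpy as np
import scipy.stats as st

# Illustrative (made-up) hyperparameters and data.
alpha0, beta0 = 2.0, 0.5
N = 12

# Brute force: prior times likelihood on a grid, normalized numerically.
mu = np.linspace(1e-3, 40.0, 4000)
unnorm = st.gamma.pdf(mu, alpha0, scale=1.0/beta0) * st.poisson.pmf(N, mu)
dmu = mu[1] - mu[0]
brute = unnorm / (unnorm.sum() * dmu)

# Conjugate result: Gamma(alpha0 + N, beta0 + 1), with scale = 1/rate.
conj = st.gamma.pdf(mu, alpha0 + N, scale=1.0/(beta0 + 1.0))

# The largest difference between the two curves should be tiny.
print(np.max(np.abs(brute - conj)))

# Drawing posterior samples of mu is then a one-liner.
mu_samples = st.gamma.rvs(alpha0 + N, scale=1.0/(beta0 + 1.0), size=5)
```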

So we _could_ use conjugate Gibbs sampling for $\mu$, but should we? In other words, can we find a prior we can live with within the [Gamma](https://en.wikipedia.org/wiki/Gamma_distribution) family? Probably. Here are some special cases of the distribution, some of which you can verify by inspection of its PDF:
* $(\alpha_0=1,\beta_0\rightarrow0)$ is uniform on $\mu>0$;
* $(\alpha_0=1/2,\beta_0\rightarrow0)$ turns out to be the Jeffreys prior, $p(\mu) \propto \mu^{-1/2}$;
* $(\alpha_0\rightarrow0,\beta_0\rightarrow0)$ is $p(\mu)\propto 1/\mu$, or uniform in $\ln(\mu)$, thus equivalent to the uniform prior on `lnF0` we used previously. Let's use this one.

#### Relation for $\sigma$

Specifically, we are looking for a conjugate relation for the standard deviation of a Normal likelihood, with a known mean, since we are conditioning on $x_0$ and $y_0$. This conjugacy relation is much more annoying to work out, but the Wikipedia tells us what the update rule for the **variance** (not the standard deviation, but $\sigma^2$) is, and that the conjugate prior is the "scaled inverse chi-square" distribution:

$\mathrm{SclInv}\chi^2(\sigma^2|\nu_0,\sigma_0^2) \, \prod_k\mathrm{Normal}(z_k|0,\sigma,\ldots) \propto \mathrm{SclInv}\chi^2\left(\sigma^2\left|\nu_0+n_k, \frac{\nu_0\sigma_0^2 + \sum_k z_k^2}{\nu_0+n_k}\right.\right) = p(\sigma^2|\{N_i\};\ldots)$.

Here $n_k=2N$ is the number of items in the sum over $k$, i.e. _twice_ the number of counts (since both their $x$ and $y$ positions provide independent information about $\sigma$).

The [scaled inverse $\chi^2$ distribution](https://en.wikipedia.org/wiki/Scaled_inverse_chi-squared_distribution) is a little uglier than we're used to, but not all that hard to deal with due to its close relationship with the $\chi^2$ distribution. Again we can find values or limits of the parameters that are compatible with some of the "uninformative" choices we might want to make. See if these make sense by inspection of the PDF (all these have non-positive-integer degrees of freedom, which is admittedly bizarre, but the math works out):
* $(\nu_0=-2,\sigma_0^2\rightarrow0)$ is uniform in $\sigma^2$;
* $\nu_0\rightarrow0$ and any $\sigma_0^2$ is the Jeffreys prior, $p(\sigma^2) \propto \sigma^{-2}$;
* $(\nu_0=-1,\sigma_0^2\rightarrow0)$ is uniform in $\sigma$, $p(\sigma^2) \propto \sigma^{-1}$.

Let's use the third one. But remember that we've made a non-linear change of variables here ($\sigma$ to $\sigma^2$) and that in general the prior density is not invariant under such a transformation. Use what you learned way back in [Probability Transformations](../notes/probability_transformations.ipynb) to work out the equivalent prior on $\sigma$, $p(\sigma)$. This will be important to know so that we can use the same prior when comparing to a different method.

> $p(\sigma) = $ ...

### Implementation

So, now we have rules for drawing samples from the posterior distributions of both parameters. In fact, we even have the posterior distribution in closed form, since in this case it factored cleanly into $p(\mu,\sigma|\{N_i\})=p(\mu|\{N_i\})p(\sigma|\{N_i\})$. That is, we won't actually need to generate a Markov chain, we can sample directly from the posterior without correlation between samples! This is called "independence sampling".

Later, we'll see that this is no longer the case when we introduce the AGN $x$ and $y$ positions as free parameters; then we will need to sample from the _fully conditional_ posterior of each parameter in turn, producing a Markov chain of samples that approximate draws from the full posterior.

But, before tackling that, let's get some code together to produce samples from the simpler case we're starting with, with just the two free parameters. Note that sampling from the scaled inverse $\chi^2$ distribution is slightly involved, but not too complicated; it just follows from the definition of the distribution, so you ultimately end up drawing from a $\chi^2$ distribution and applying some functions to those random draws.
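
One way to do the scaled inverse $\chi^2$ draw is sketched below, using the standard relation that if $X\sim\chi^2_\nu$ then $\nu\tau^2/X$ follows a scaled inverse $\chi^2$ distribution with $\nu$ degrees of freedom and scale $\tau^2$. The helper name `draw_scaled_inv_chisq` is made up for this illustration, and the numbers in the sanity check are arbitrary.

```python
import numpy as np
import scipy.stats as st

def draw_scaled_inv_chisq(nu, tausq, size=None, random_state=None):
    """
    Draw from a scaled inverse chi-square distribution with nu degrees of
    freedom and scale tausq, using the relation
        X ~ chi2(nu)  =>  nu*tausq/X ~ SclInv-chi2(nu, tausq).
    (Hypothetical helper name; requires nu > 0.)
    """
    x = st.chi2.rvs(nu, size=size, random_state=random_state)
    return nu * tausq / x

# Sanity check against the analytic mean, nu*tausq/(nu-2), valid for nu > 2.
draws = draw_scaled_inv_chisq(10.0, 2.0, size=200000, random_state=1)
print(draws.mean(), 10.0 * 2.0 / (10.0 - 2.0))
```

Although the prior choices above have non-positive $\nu_0$, the posterior has $\nu_0+n_k$ degrees of freedom, which is positive as soon as there is at least one detected count, so a draw like this remains well defined.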

The cell below just defines dictionaries for the model parameters and prior hyperparameters as usual. Note that, because we can do independence sampling here, we don't actually need starting guesses for $\mu$ and $\sigma$!

In [None]:
params = {'mu':None, 'sigma':None, 'x0':x0, 'y0':y0}
hyperparams = {'alpha0':0.0, 'beta0':0.0, 'nu0':-1, 'sigmasq0':0.0}

Now for the fun part: write a function that takes the data, model parameters and prior hyperparameters as input, and returns a new sample of $\mu$ and $\sigma$ based on the derivations above. Note that **you should return $\sigma$, not $\sigma^2$**. (There is nothing to this other than taking a square root - post-facto transformations of parameters are simple; the only wrinkle is in choosing the appropriate prior density as we did above.)

Even though we chose specific hyperparameter values above, you should write your code such that any valid values can be passed.

In [None]:
def independence_sampler(img, par, hypar):
    """
    img is of type Image (our postage stamp)
    par and hypar are our params and hyperparams dictionaries
    """
    # You will need to do some calculations involving the image data...
    # (some of these could be done once instead of repeating every time this function is called, but whatever)
    TBC()
    # and then work out parameters of the conditional posterior for mu, and draw a sample from it...
    TBC()
    # and then do the same for sigma^2...
    TBC()
    # and then return the new samples.
    TBC()
    
TBC_above()

Let's see what a few samples generated by this function look like.

In [None]:
np.array([independence_sampler(stamp, params, hyperparams) for i in range(10)])

### Results

We could generate lots of samples with a python construction like the one above (because, again, this case is independence sampling). But, instead, let's belabor it as though we actually had to create a Markov chain, where each sample is dependent on the previous one. The cell below does this, updating the `params` dictionary within a loop (compare with the pseudocode in the [notes](../notes/montecarlo.ipynb)), and collecting all the samples in an array.

In [None]:
%%time
nsamples = 10000
samples2 = np.zeros((nsamples, 2))
for i in range(samples2.shape[0]):
    p = independence_sampler(stamp, params, hyperparams)
    params['mu'] = p[0]
    params['sigma'] = p[1]
    samples2[i,:] = p

We can use `plotGTC` to quickly visualize the posterior. This package shows us all the 1D marginalized posteriors and every pair of 2D marginalized posteriors (as a contour plot), after some smoothing, in a triangular grid.

In [None]:
plotGTC(samples2, paramNames=[r'$\mu$', r'$\sigma$'],
        figureSize=5, customLabelFont={'size':12}, customTickFont={'size':12});

**Checkpoint:** The cell below will compare your samples with some we have generated. They won't be identical, but should be extremely close if you've used the priors and data specified above.

In [None]:
ours = np.loadtxt('solutions/gibbs.dat')
plotGTC([samples2, ours], paramNames=[r'$\mu$', r'$\sigma$'], chainLabels=['yours', 'ours'],
        figureSize=5, customLabelFont={'size':12}, customTickFont={'size':12}, customLegendFont={'size':16});

In order to compare to what we get using other methods, we'll also want to transform from $\mu$ back to `lnF0`. We can do this (roughly) by dividing out the median value of the exposure map (remember that we are assuming a uniform exposure in this notebook).

In [None]:
samples2[:,0] = np.log(samples2[:,0] / np.median(stamp.ex))

In [None]:
plotGTC(samples2, paramNames=[r'$\ln F_0$', r'$\sigma$'],
        figureSize=5, customLabelFont={'size':12}, customTickFont={'size':12});

How does this compare with what you found from the grid exercise? Keep in mind that we made different assumptions here (including that the background is zero). Do any differences in the posterior make sense in light of that?

> _Your commentary ..._

## Fitting for 4 parameters

Let's now fit for the source position also ($x_0$ and $y_0$). We'll do this on simulated data (with _really_ no background), so that we can check whether our results are consistent with the input parameters.

### Mocking up an image

First, I declare that these shall be the "true" parameters according to the simulation:

In [None]:
sim_params = {'mu': 35.0, 'sigma': 2.5, 'x0': 417, 'y0': 209}

Now, over to you to write a function that produces a mock image for us to work with. Remember that the sampling distribution/likelihood, written down somewhere way above, is a guide to exactly how to do this. Store the result of calling the function in a variable called `mock`. (You might find it convenient to just overwrite the image data in `stamp`, in which case you can just run `mock = stamp` afterward.) 
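
If you get stuck, the sketch below shows one possible way to read the likelihood as a recipe: draw the total number of counts from a Poisson distribution, draw a position for each count from the Gaussian PSF, then bin the positions into pixels. It works on plain numpy arrays rather than the `Image` class, and the function name, its arguments and the 11-by-11 grid are assumptions made for illustration only.

```python
import numpy as np

def mock_counts(nx, ny, xmin, ymin, x0, y0, mu, sigma, rng):
    """
    Simulate a counts image on an ny-by-nx pixel grid whose pixel centers
    start at (xmin, ymin) and are spaced by 1.
    Hypothetical helper working on plain arrays; it does not use Image.
    """
    N = rng.poisson(mu)                    # total number of counts
    x = rng.normal(x0, sigma, size=N)      # position of each count
    y = rng.normal(y0, sigma, size=N)
    # Bin edges sit half a pixel either side of the pixel centers.
    xedges = xmin - 0.5 + np.arange(nx + 1)
    yedges = ymin - 0.5 + np.arange(ny + 1)
    # Counts falling outside the grid are silently dropped by histogram2d.
    img, _, _ = np.histogram2d(y, x, bins=[yedges, xedges])
    return img.astype(int)

rng = np.random.default_rng(42)
img = mock_counts(11, 11, 412, 204, 417.0, 209.0, 35.0, 2.5, rng)
print(img.shape, img.sum())
```

Because counts landing off the grid are dropped, the image total can be slightly smaller than the Poisson draw; you would still need to copy the result into whatever structure your own `mock_image` returns.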

In [None]:
def mock_image(data, x0, y0, mu, sigma):
    '''
    Generate a mock image from the model given by x0, y0, mu, sigma.
    Either return an Image object containing this image, along with the other metadata held by `data`, or just
    overwrite the counts image in `data`.
    '''
    TBC()

# mock = mock_image(stamp, ...) ?
# or, mock_image(stamp, ...); mock = stamp ?
TBC_above()

Let's have a look at it.

In [None]:
plt.rcParams['figure.figsize'] = (10.0, 10.0)
mock.display(log_image=False)

### Doing the math

Let's explicitly write down the fully conditional posteriors for the case where the source position is free.

Nothing has changed for the total mean number of counts, which remains independent of the other parameters:

$p(\mu|\{N_i\};\ldots) = \mathrm{Gamma}(\mu|\alpha_0+N,\beta_0+1)$.


The fully conditional posterior for $\sigma^2$ is just as it was before, but we should explicitly admit that we are conditioning on $x_0$ and $y_0$, and show where they enter the expression. Make sure you're happy with this, comparing with our previous equations, before going on.

$p(\sigma^2|\{N_i\};x_0, y_0,\ldots) = \mathrm{SclInv}\chi^2\left(\sigma^2\left|\nu_0+n_k, \frac{1}{\nu_0+n_k}\left[\nu_0\sigma_0^2 + \sum_j \left\{(x_0-x_j)^2+(y_0-y_j)^2\right\}\right]\right.\right)$.

Finally, we need the fully conditional posteriors for $x_0$ and $y_0$, which are each independently the mean of a normal distribution with standard deviation $\sigma$. Looking it up, we see that the conjugate prior is also normal. This is a little bit fiddly to show, but relatively straightforward using the Gaussian identities you've worked with [previously](gaussians.ipynb). Whether or not you take the time to work this out yourself, it's worth checking that the way the information from the prior and data are combined in the expressions below makes sense.

Denoting the hyperparameters of the conjugate priors $(m_x,s_x,m_y,s_y)$, we have

$p(x_0|\sigma, \{N_i\}) = \mathrm{Normal}\left(x_0\left|\left[\frac{1}{s_x^2}+\frac{N}{\sigma^2}\right]^{-1}\left[\frac{m_x}{s_x^2}+\frac{\sum_j x_j}{\sigma^2}\right], \left[\frac{1}{s_x^2}+\frac{N}{\sigma^2}\right]^{-1/2}\right.\right)$,

$p(y_0|\sigma, \{N_i\}) = \mathrm{Normal}\left(y_0\left|\left[\frac{1}{s_y^2}+\frac{N}{\sigma^2}\right]^{-1}\left[\frac{m_y}{s_y^2}+\frac{\sum_j y_j}{\sigma^2}\right], \left[\frac{1}{s_y^2}+\frac{N}{\sigma^2}\right]^{-1/2}\right.\right)$.
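
The two expressions above translate fairly directly into code. The sketch below is one way to do it for a single coordinate, with a made-up helper name (`draw_center`) and made-up count positions; it relies on floating-point arithmetic giving `1/np.inf**2 == 0`, so an infinitely broad prior needs no special casing.

```python
import numpy as np

def draw_center(xj, sigma, m, s, rng, size=None):
    """
    Draw a center coordinate from its fully conditional posterior.
    xj: 1D array with one coordinate per detected count
    m, s: mean and standard deviation of the Normal prior (s may be np.inf)
    Hypothetical helper, not part of the tutorial's code.
    """
    N = len(xj)
    precision = 1.0 / s**2 + N / sigma**2              # posterior precision
    mean = (m / s**2 + np.sum(xj) / sigma**2) / precision
    return rng.normal(mean, precision**-0.5, size=size)

rng = np.random.default_rng(3)
xj = np.array([415.0, 417.0, 418.0, 420.0])   # made-up count positions
draws = draw_center(xj, 2.5, 0.0, np.inf, rng, size=20000)
print(draws.mean(), draws.std())
```

With a flat prior, the draws should scatter around the mean of the count positions with a standard deviation of $\sigma/\sqrt{N}$, which is the familiar standard error of the mean.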

In practice, let's use uniform priors for $x_0$ and $y_0$, which correspond to the limit $s_x,s_y\rightarrow\infty$ (at which point the values of $m_x$ and $m_y$ cease to be important). However, you should write your code such that any valid hyperparameters can be used. Here we add them to the `hyperparams` dictionary:

In [None]:
hyperparams['mx'] = 0.0
hyperparams['sx'] = np.inf
hyperparams['my'] = 0.0
hyperparams['sy'] = np.inf

### Implementation

Remember that now each parameter must be updated in turn, meaning the new value of one parameter is used when updating the next, etc. We don't update everything all at once based on the current position, as we could before. So, let's explicitly write separate functions for updating different sets of parameters.

In [None]:
def update_mu(img, par, hypar):
    """
    img is of type Image (our mock postage stamp)
    par and hypar are our params and hyperparams dictionaries
    Instead of returning anything, we UPDATE par in place
    """
    TBC()
    # par['mu'] = ...
    
TBC_above()

In [None]:
def update_sigma(img, par, hypar):
    """
    img is of type Image (our mock postage stamp)
    par and hypar are our params and hyperparams dictionaries
    Instead of returning anything, we UPDATE par in place
    (Remember to store sigma instead of sigma^2!)
    """
    TBC()
    # par['sigma'] = ...
    
TBC_above()

We can update $x_0$ and $y_0$ in a single function, since their posteriors do not depend on one another.

In [None]:
def update_x0y0(img, par, hypar):
    """
    img is of type Image (our mock postage stamp)
    par and hypar are our params and hyperparams dictionaries
    Instead of returning anything, we UPDATE par in place
    """
    TBC()
    # par['x0'] = ...
    # par['y0'] = ...
    
TBC_above()

Let's test all of that by calling each function and verifying that all the parameters changed (to finite, allowed values).

In [None]:
print(params)
update_mu(mock, params, hyperparams)
update_sigma(mock, params, hyperparams)
update_x0y0(mock, params, hyperparams)
print(params)

### Results

As before, we can fill in an array with samples generated with the functions above. Note that, this time, the for loop is necessary, since we can't fill in row $i$ without knowing the contents of row $(i-1)$. The order of the individual parameter updates is arbitrary, and could even be randomized if you particularly wanted to.
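
To see the whole scheme in one place, here is a self-contained toy version of the sampler. It is only a sketch under simplifying assumptions: the "data" are made-up, unbinned count positions rather than an `Image`, the priors are hard-coded to the choices made in the text, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "data": positions of N detected counts, with no pixelization.
true = {'mu': 35.0, 'sigma': 2.5, 'x0': 417.0, 'y0': 209.0}
N = rng.poisson(true['mu'])
xj = rng.normal(true['x0'], true['sigma'], size=N)
yj = rng.normal(true['y0'], true['sigma'], size=N)

# Priors as in the text: 1/mu for mu, uniform in sigma, uniform in x0 and y0.
alpha0, beta0, nu0, sigmasq0 = 0.0, 0.0, -1.0, 0.0

par = {'mu': 30.0, 'sigma': 1.0, 'x0': 415.0, 'y0': 211.0}  # arbitrary start
chain = np.zeros((2000, 4))
for i in range(chain.shape[0]):
    # mu | rest ~ Gamma(alpha0 + N, beta0 + 1); numpy's gamma takes scale = 1/rate
    par['mu'] = rng.gamma(alpha0 + N, 1.0 / (beta0 + 1.0))
    # sigma^2 | x0, y0 ~ scaled inverse chi-square, drawn via a chi-square variate:
    # (nu * tau^2) / X with nu * tau^2 = nu0*sigmasq0 + sum of squared displacements
    nu = nu0 + 2 * N
    ss = nu0 * sigmasq0 + np.sum((xj - par['x0'])**2) + np.sum((yj - par['y0'])**2)
    par['sigma'] = np.sqrt(ss / rng.chisquare(nu))
    # x0, y0 | sigma ~ Normal(mean position, sigma/sqrt(N)) for uniform priors
    par['x0'] = rng.normal(xj.mean(), par['sigma'] / np.sqrt(N))
    par['y0'] = rng.normal(yj.mean(), par['sigma'] / np.sqrt(N))
    chain[i] = [par['mu'], par['sigma'], par['x0'], par['y0']]

print(chain.mean(axis=0))
```

Each update uses the most recent values of the other parameters, which is what makes row $i$ depend on row $(i-1)$.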

In [None]:
%%time
samples4 = np.zeros((nsamples,4))
for i in range(samples4.shape[0]):
    update_mu(mock, params, hyperparams)
    update_sigma(mock, params, hyperparams)
    update_x0y0(mock, params, hyperparams)
    samples4[i,:] = [params['mu'], params['sigma'], params['x0'], params['y0']]

Let's do the most basic (yet still extremely important) visual check to see how our sampler performed, looking at traces of the Markov chain for each parameter. (It's ok if you haven't read the notes on [MCMC Diagnostics](../notes/mcmc_diagnostics.ipynb) yet; we will go more in-depth later.) These trace plots show the value of each parameter as a function of iteration, and we'll add a line showing the value that was used to create the mock data.

In [None]:
param_labels = [r'$\mu$', r'$\sigma$', r'$x_0$', r'$y_0$']
plt.rcParams['figure.figsize'] = (16.0, 12.0)
fig, ax = plt.subplots(4,1);
cr.plot_traces(samples4, ax, labels=param_labels, 
            truths=[sim_params['mu'], sim_params['sigma'], sim_params['x0'], sim_params['y0']])

Note, if you started with pretty reasonable parameter values, it's entirely possible that there isn't a clear burn-in phase that needs to be thrown out.

We can similarly look at the triangle-plot summary of the posterior:

In [None]:
plotGTC(samples4, paramNames=param_labels,
        truths=[sim_params['mu'], sim_params['sigma'], sim_params['x0'], sim_params['y0']],
        figureSize=8, customLabelFont={'size':12}, customTickFont={'size':12});

**Checkpoint:** The details of these plots will depend on your random mock data set, but, statistically speaking, your posterior should look by eye to be pretty consistent with most of the input parameters. It's entirely possible for one or perhaps a pair of parameters to look a bit discrepant (again, by eye) by chance.

We weren't overly concerned with the starting point for the test chain above. But, for later notebooks, we'll want to see how multiple, independent chains with different starting points behave when using this method. The cell below will take care of running 4 chains, started at random positions broadly in the vicinity of the input values.

In [None]:
%%time
chains = [np.zeros((10000,4)) for j in range(4)]

for samples in chains:
    params = {'mu':st.uniform.rvs()*50.0,
              'sigma':st.uniform.rvs()*4.9 + 0.1,
              'x0':st.uniform.rvs()*10.0 + 412.0,
              'y0':st.uniform.rvs()*10.0 + 204.0}
    for i in range(samples.shape[0]):
        update_mu(mock, params, hyperparams)
        update_sigma(mock, params, hyperparams)
        update_x0y0(mock, params, hyperparams)
        samples[i,:] = [params['mu'], params['sigma'], params['x0'], params['y0']]

Now we can look at a more colorful version of the trace plots, showing all of the chains simultaneously:

In [None]:
plt.rcParams['figure.figsize'] = (16.0, 12.0)
fig, ax = plt.subplots(len(param_labels), 1);
cr.plot_traces(chains, ax, labels=param_labels, Line2D_kwargs={'markersize':1.0},
           truths=[sim_params['mu'], sim_params['sigma'], sim_params['x0'], sim_params['y0']])

Save them for later, and we're done!

In [None]:
TBC() # change path below, if desired
#for i,samples in enumerate(chains):
#    np.savetxt('../ignore/agn_gibbs_chain_'+str(i)+'.txt', samples, header='mu sigma x0 y0')

There you have it - at the cost of engaging our brains, you've now fit a model with enough free parameters to make a grid-based solution uncomfortably slow.

#### Endnotes

1. You might recognize the leftover coefficient in this expression as a multinomial coefficient, enumerating the various ways that $N$ counts can be distributed among pixels such that the final number in each pixel is given by $\{N_i\}$. This is the kind of detail that we normally don't have to worry about - since model parameters make no appearance, the term has no effect on parameter constraints - but we'll see that such combinatorics are important to keep track of in the context of [missing data](../notes/missingdata.ipynb) later on.