# Tutorial: AGN Photometry on a Grid

Note: this tutorial follows [X-ray Image Data](xmm_image.ipynb) and precedes [AGN Photometry with Gibbs Sampling](agn_photometry_gibbs.ipynb) and [AGN Photometry with Metropolis Sampling](agn_photometry_metro.ipynb).

In this notebook we will fit a model to X-ray imaging data, bringing together what you've learned about generative models, Bayes' law, posterior evaluations on a grid, and credible intervals. After stepping through the inference of model parameters in this way, we'll motivate the use of sampling methods that you'll explore in upcoming tutorials.

In [None]:
exec(open('tbc.py').read()) # define TBC and TBC_above
import astropy.io.fits as pyfits
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from astropy.visualization import LogStretch
logstretch = LogStretch()
import scipy.stats as st
from scipy.optimize import minimize
import incredible as cr

## Defining a model

Now that you know your way around the image data itself, we're going to fit a model to a cut-out of the data. To keep things simple, we will ignore the galaxy cluster and just fit the brightness of one of the AGN in the field, after asserting that we know both its position and the telescope PSF (this is called "forced photometry").

In generative terms, we will define a 2-dimensional model for source brightness on the sky, apply the PSF to smear it out as the telescope optics would, and apply the exposure map to compute an expected, average number of counts in each pixel given our observation's length. This can then be compared with the measured counts using the sampling distribution which, for counts of rare events, is Poisson. More concretely,

* The physical model we're interested in is the source brightness, $S(x,y)$.
* Assuming the PSF doesn't vary with position within our cut-out, its action is a simple convolution.
* Multiplication by the exposure map, $E(x,y)$, then takes care of the overall observation length and the position-dependent vignetting.

Even more compactly, we could write the model expectation number of counts per pixel as

$\mu(x,y) = E(x,y) \times \left[\mathrm{PSF}\otimes S(x,y)\right]$,

and the data as

$N(x,y) \sim \mathrm{Poisson}\left[\mu(x,y)\right]$.

Remembering that the exposure map had units of seconds, $S$ as used above must have units of counts per second per pixel (the units of "pixel on the sky" are solid angle). In real life we might want to eventually measure the AGN flux in real units (e.g. ergs/s/cm$^2$), which would require an assumption about the spectrum of the source (or an analysis that didn't throw out the spectral information!). We'll stick to inferring things in terms of counts.

A few more things are necessary to fully specify our model. We'll take the exposure map to be known and fixed, which leaves $S(x,y)$ and the PSF.

An AGN is point-like as far as this telescope is concerned, so we can write

$S_\mathrm{agn}(x,y) = F_0\,\delta(x-x_0)\,\delta(y-y_0)$.

For simplicity, let's assume there is a constant background in whatever region in the image we decide to analyze:

$S_\mathrm{bg}(x,y) = b$.

In the context of measuring the AGN flux, "background" includes the cluster emission, so we'll want to select an AGN in a part of the field of view where the cluster emission is subdominant to the more uniform background.

Finally, we'll assume a symmetric, Gaussian PSF, with a standard deviation of $\sigma=5''=1.25$ pixels. This is not entirely accurate, but it would take an extremely bright source, or a statistical analysis of many fainter sources, for us to see the actual PSF pattern, which is quite complicated in detail.

We thus have only 2 model parameters to fit, $F_0$ and $b$. Since AGN fluxes span many orders of magnitude, but the quiescent background count rate does not, let's use wide uniform priors on $\ln(F_0)$ and $b$ and see what we can do.
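
To make the prior choice concrete, here is a minimal sketch of an unnormalized uniform log-prior on $\ln(F_0)$ and $b$. The function name `log_uniform_prior` and the bounds are hypothetical placeholders, not values recommended by this tutorial; the only physical constraint built in is that $b$ cannot be negative.

```python
import numpy as np

def log_uniform_prior(lnF0, b, lnF0_range=(-20.0, 0.0), b_range=(0.0, 1e-4)):
    """Unnormalized log-density of independent uniform priors on lnF0 and b.

    The default ranges are made-up illustrations of "wide" bounds.
    """
    inside = (lnF0_range[0] <= lnF0 <= lnF0_range[1]) and (b_range[0] <= b <= b_range[1])
    # Constant (here 0) inside the bounds, zero probability outside
    return 0.0 if inside else -np.inf

print(log_uniform_prior(-5.0, 1e-6))   # inside the bounds
print(log_uniform_prior(-5.0, -1e-6))  # negative background: excluded
```

Because the prior is flat, its constant value inside the bounds has no effect on where the posterior peaks or on its shape.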

Before continuing with an implementation, go through our usual generative model sanity check: enumerate the model parameters and data, write down the probabilistic relationships among them, and visualize them with a PGM. Keep in mind which parameters will be free vs fixed in our initial analysis, as described above.

> _solution_

## Setup

For convenience, this defines the `Image` class as in the [`xmm_image`](xmm_image.ipynb) notebook.

In [None]:
from xray_image import Image

This reads in the data and displays it, just as before:

In [None]:
TBC() # datadir = '../ignore/' # or whatever - path to where you put the downloaded files

imagefile = datadir + 'P0098010101M2U009IMAGE_3000.FTZ'
expmapfile = datadir + 'P0098010101M2U009EXPMAP3000.FTZ'

imfits = pyfits.open(imagefile)
exfits = pyfits.open(expmapfile)

im = imfits[0].data
ex = exfits[0].data

orig = Image(im, ex)
plt.rcParams['figure.figsize'] = (20.0, 20.0)
orig.display()

Next, we need to decide on a specific AGN to measure, and decide what size cut-out to make the measurement in, keeping in mind the model assumptions above (especially the uniform background and lack of other AGN within the cut-out). We'll make a standard choice below so that you'll have a known solution to compare to. But, for completeness, here is a list of rough AGN positions in IMAGE coordinates determined some time ago (by eye).

```
232 399
188 418
362 474
336 417
381 359
391 418
398 294
417 209
271 216
300 212
286 162
345 153
168 361
197 248
277 234
241 212
251 379
310 413
460 287
442 353
288 268
148 317
151 286
223 239
490 406
481 318
```

The 8th entry looks pretty good, but feel free to mess around with these choices (just keep in mind that you won't be able to compare to the solutions below).

In [None]:
x0 = 417
y0 = 209
stampwid = 25
stamp = orig.cutout(x0-stampwid, x0+stampwid, y0-stampwid, y0+stampwid)

plt.rcParams['figure.figsize'] = (10.0, 10.0)
stamp.display(log_image=False) # not a huge dynamic range in this cutout

This is the subset of the data we'll use to fit the AGN flux.

## Implementation

It's time to write functions to evaluate the prior, sampling and posterior distributions. Even though we're only fitting for $\ln(F_0)$ and $b$ for the moment, let's write those functions more generally, so that they also depend explicitly on the AGN position, $(x_0,y_0)$, and the PSF width, $\sigma$.

Here's a dictionary of parameter values, with arbitrary values for `lnF0` and `b`, for now.

In [None]:
params = {'x0':x0, 'y0':y0, 'lnF0':-5.0, 'b':1e-6, 'sigma':1.25}

Normally, we try to evaluate the _log_ of these distributions. This is because floating-point underflows can be an issue, especially when the sampling distribution is a product with many terms. Another benefit is that we don't have to worry about normalizing coefficients that don't depend on model parameters - if we need some distribution to be normalized later, we can always normalize it explicitly by dividing it by its numerical integral. If any of these functions should return zero probability, you can and should have them return a log-probability of -infinity.
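
To see why underflow is a real concern, the sketch below compares the direct product of Poisson probabilities with the sum of their logs, for a synthetic image of 2500 pixels (roughly the size of the postage stamp). The mean of 5 counts per pixel and the random seed are arbitrary choices for illustration.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(42)
mu = np.full(2500, 5.0)    # made-up expected counts in each of 2500 pixels
counts = rng.poisson(mu)   # synthetic "data"

# Product of 2500 probabilities, each well below 1: underflows in double precision
direct = np.prod(st.poisson.pmf(counts, mu))

# Sum of log-probabilities: a perfectly ordinary (large, negative) number
log_sum = np.sum(st.poisson.logpmf(counts, mu))

print(direct, log_sum)
```

The product comes out as exactly zero, which would make every parameter value look equally (im)probable, while the log-space sum remains usable.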

Complete the function evaluating the log-prior:

In [None]:
def log_prior(x0, y0, lnF0, b, sigma):
    TBC()
    
TBC_above()

As always, we should make sure it returns some kind of value instead of crashing when fed an example parameter dictionary.

In [None]:
# sanity check
log_prior(**params)

Next, implement the log-likelihood/sampling distribution. This is, of course, where all the fun of evaluating the model happens, although I'd suggest outsourcing the evaluation of $\mu(x,y)$ to a separate function. Keep in mind that our model consists of a delta function and a uniform background, so it isn't necessary to carry out the convolution in the model numerically.
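
As a rough sketch of what such a separate function for $\mu(x,y)$ could look like (one possible approach, not necessarily the intended solution): convolving a delta function with a normalized Gaussian PSF leaves a Gaussian of total flux $F_0$ centered on $(x_0,y_0)$, and convolving a constant returns the same constant. The function name `expected_counts`, the synthetic coordinate arrays, and the flat exposure of $10^4$ seconds are assumptions for illustration. Evaluating the PSF at pixel centers, rather than integrating it over each pixel, is an approximation.

```python
import numpy as np

def expected_counts(imx, imy, ex, x0, y0, lnF0, b, sigma):
    """Sketch of mu = E * [PSF (x) S] for a point source plus flat background.

    imx, imy: arrays of pixel x and y coordinates; ex: exposure map (seconds).
    """
    r2 = (imx - x0)**2 + (imy - y0)**2
    # Normalized 2D Gaussian, evaluated at pixel centers (units: per pixel)
    psf = np.exp(-0.5 * r2 / sigma**2) / (2.0 * np.pi * sigma**2)
    return ex * (np.exp(lnF0) * psf + b)

# Synthetic 51x51 stamp with a uniform, hypothetical exposure
imy, imx = np.mgrid[0:51, 0:51]
ex = np.full(imx.shape, 1.0e4)
mu = expected_counts(imx, imy, ex, x0=25, y0=25, lnF0=-5.0, b=1e-6, sigma=1.25)
print(mu.shape, mu.max())
```

With an array like `mu` in hand, the log-likelihood is a sum of Poisson log-probabilities over pixels.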

In [None]:
def log_likelihood(data, x0, y0, lnF0, b, sigma):
    """
    `data` will be an Image object
    """
    TBC()
    
TBC_above()

In [None]:
# sanity check
log_likelihood(stamp, **params)

Finally, the log-posterior. As will normally be the case, we will just return the sum of the log-prior and log-likelihood, i.e. we will neglect the normalizing constant (evidence), which is constant with respect to the model parameters.

The construction below is a good one to be in the habit of using. Usually, the likelihood is much more expensive to compute than the prior, and might even crash if passed prior-incompatible parameter values. So it's worth the extra check of a non-zero prior probability before attempting to evaluate the likelihood.

In [None]:
def log_posterior(data, **params):
    """
    `data` will be an Image object
    """
    lnp = log_prior(**params)
    if np.isfinite(lnp):
        lnp += log_likelihood(data, **params)
    return lnp

In [None]:
## sanity check
log_posterior(stamp, **params)

Our posterior function is written to work with scalar values of its arguments, but sometimes (especially when working over a grid) it's convenient to have a version that takes vector arguments. According to the `numpy` documentation, `vectorize` is not actually more efficient than evaluating your function within nested `for` loops, but it's an option anyway.
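
To illustrate how `np.vectorize` behaves, here is a toy example with a scalar-only function standing in for a log-posterior. The function `toy_lnpost` and its inputs are made up. Note that `data` is passed by keyword here, so that the name listed in `excluded` applies to it and it is handed to the function unchanged rather than broadcast.

```python
import math
import numpy as np

def toy_lnpost(data, lnF0, b):
    """Scalar-only toy function: math.exp does not accept arrays."""
    return -0.5 * sum((d - math.exp(lnF0) - b)**2 for d in data)

# Exclude `data` from broadcasting; lnF0 and b are looped over element by element
vec_toy_lnpost = np.vectorize(toy_lnpost, excluded=['data'])

lnF0_grid, b_grid = np.meshgrid(np.linspace(-1.0, 1.0, 5), np.linspace(0.0, 1.0, 3))
out = vec_toy_lnpost(data=[1.0, 1.2, 0.9], lnF0=lnF0_grid, b=b_grid)
print(out.shape)  # same shape as the broadcast grids
```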

In [None]:
vectorized_lnpost = np.vectorize(log_posterior, excluded=['data'])

## Evaluating the posterior

We should now be able to define a grid and evaluate the posterior over it, as we've seen before. In practical terms, the extent of the grid defines the bounds of the "wide, uniform" priors we decided to use. It would, therefore, be helpful to know roughly what part of `lnF0` - `b` space we need to cover. A reasonable approach is to use a numerical optimizer to find the maximum of the log-posterior (or the minimum of its additive inverse). That won't instantly tell us how wide to make the grid in order to contain nearly all the posterior probability, but it's a good start and we can then iterate if need be.

We'll use `scipy.optimize.minimize` (imported above as `minimize`). Refer to its documentation if needed. Define the function that `minimize` will work on here. (Note that it will want a vector argument rather than a dictionary of parameter values.)
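
The wrapper pattern might look like the following on a toy problem. Here `toy_log_posterior` is a made-up Gaussian-shaped function with a known maximum, standing in for the real log-posterior; only the unpacking of the vector argument and the sign flip are the point.

```python
import numpy as np
from scipy.optimize import minimize

def toy_log_posterior(lnF0, b):
    """Made-up log-posterior peaking at lnF0 = -6, b = 5."""
    return -0.5 * (((lnF0 + 6.0) / 0.2)**2 + ((b - 5.0) / 0.5)**2)

def toy_mlnpost(p):
    """Unpack the vector the optimizer provides and return minus the log-posterior."""
    lnF0, b = p
    return -toy_log_posterior(lnF0=lnF0, b=b)

result = minimize(toy_mlnpost, [-5.0, 4.0])  # default method for an unconstrained problem
print(result.x)         # location of the minimum found
print(result.hess_inv)  # approximate inverse Hessian at that point
```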

In [None]:
def mlnpost(p):
    """
    p: a numpy array of parameter values in the order lnF0, b
    Return value: minus the log-posterior
    """
    TBC()
    
TBC_above()

Now we'll do the optimization and print out the best fit:

In [None]:
bestfit = minimize(mlnpost, [params['lnF0'], params['b']])
bestfit

**Checkpoint:** If you're fitting to the postage stamp defined above with the suggested priors, the best fit should be approximately `lnF0 = -6.6` and `b = 5.3e-6`.

Now, define the bounds and spacing of the grid to evaluate on. You may need to refine these values after seeing the results. (You can get an initial guess at the appropriate size in each direction by treating `bestfit['hess_inv']` as an approximate covariance matrix of the posterior and looking at the square root of its diagonal. This won't be wonderfully accurate, but will get you an order of magnitude.)
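
One way to turn that idea into numbers is sketched below. The `hess_inv` matrix is a hypothetical diagonal stand-in (a real optimizer result will generally differ and have off-diagonal terms), the best-fit values are the checkpoint values quoted above, and the choices of 5 standard deviations and 100 grid points per side are arbitrary assumptions to iterate on.

```python
import numpy as np

best = np.array([-6.6, 5.3e-6])                # lnF0, b from the checkpoint
hess_inv = np.array([[0.04, 0.0],
                     [0.0, 1.6e-13]])          # hypothetical stand-in matrix

widths = np.sqrt(np.diag(hess_inv))            # rough 1-sigma scale per parameter
nsig, npts = 5, 100                            # arbitrary extent and resolution

lnF0_min, lnF0_max = best[0] - nsig * widths[0], best[0] + nsig * widths[0]
b_min = max(best[1] - nsig * widths[1], 0.0)   # the background cannot be negative
b_max = best[1] + nsig * widths[1]
dlnF0 = (lnF0_max - lnF0_min) / npts
db = (b_max - b_min) / npts
print(lnF0_min, lnF0_max, dlnF0)
print(b_min, b_max, db)
```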

In [None]:
TBC()
# lnF0_min = 
# lnF0_max = 
# dlnF0 = 
# b_min = 
# b_max = 
# db = 

This will define 2D grids holding the values of `lnF0` and `b` for each entry, as illustrated:

In [None]:
lnF0_values = np.arange(lnF0_min, lnF0_max+dlnF0, dlnF0)
b_values = np.arange(b_min, b_max+db, db)
grid_lnF0, grid_b = np.meshgrid(lnF0_values, b_values)

In [None]:
plt.rcParams['figure.figsize'] = (14.0, 5.0)
fig, ax = plt.subplots(1,2);
ax[0].imshow(grid_lnF0, cmap='gray', origin='lower', extent=[lnF0_min, lnF0_max, b_min, b_max], aspect='auto');
ax[0].set_xlabel('ln F0');
ax[0].set_ylabel('b');
ax[1].imshow(grid_b, cmap='gray', origin='lower', extent=[lnF0_min, lnF0_max, b_min, b_max], aspect='auto');
ax[1].set_xlabel('ln F0');
ax[1].set_ylabel('b');

This cell will evaluate the log-posterior over the grid and display the result:

In [None]:
%%time
grid_params = params.copy()
grid_params['lnF0'] = grid_lnF0
grid_params['b'] = grid_b
grid_lnpost = vectorized_lnpost(stamp, **grid_params)

In [None]:
plt.rcParams['figure.figsize'] = (6.0, 6.0)
plt.imshow(grid_lnpost, cmap='gray', origin='lower', aspect='auto', extent=[lnF0_min, lnF0_max, b_min, b_max]);
plt.xlabel('ln F0');
plt.ylabel('b');

## Finding credible regions

Next, we'll go through the steps to find standard credible regions and intervals for the two parameters. We can use the machinery you saw in the [Credible Regions](credible_intervals.ipynb) tutorial by reformatting our grid evaluation in a form those functions will like.

Below, create a dictionary that looks like the output of `incredible.whist2d` by filling in the entry for the posterior density estimated over an array. Recall that we haven't properly normalized the log-posterior, so it's a good idea to subtract off the maximum log-posterior before exponentiating to avoid numerical under/overflows (that way the maximum posterior value is automatically 1.0 and not something like 1e-300 or 0.0).
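
The effect of subtracting the maximum is easy to see on a tiny made-up array of log-posterior values, used here purely for illustration:

```python
import numpy as np

# Made-up, unnormalized log-posterior values of the size a many-pixel likelihood can produce
toy_lnpost = np.array([[-10050.0, -10010.0],
                       [-10002.0, -10000.0]])

naive = np.exp(toy_lnpost)                       # every entry underflows to 0.0
shifted = np.exp(toy_lnpost - toy_lnpost.max())  # largest entry becomes exactly 1.0

print(naive)
print(shifted)
```

Relative posterior values are preserved by the shift, since subtracting a constant in log space is just an overall rescaling.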

In [None]:
TBC() # grid_post = ...

h2d = {'x':lnF0_values, 'y':b_values, 'z':grid_post}

See what that gives us:

In [None]:
plt.rcParams['figure.figsize'] = (5.0, 5.0)
contours = cr.whist2d_ci(h2d)
plt.xlabel('ln F0');
plt.ylabel('b');

We can do something similar to get the 1D credible intervals. Use your amazing skills to compute the marginalized posteriors for `lnF0` and `b` at the grid points defined by `lnF0_values` and `b_values`:
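
The main thing to get right is which array axis corresponds to which parameter. The toy example below uses the same `np.meshgrid` convention as above, with made-up variables `x_values` and `y_values` in the roles of `lnF0_values` and `b_values`, and a made-up Gaussian in place of the real posterior. On a uniformly spaced grid, summing over an axis approximates the marginalization integral up to a constant factor.

```python
import numpy as np

x_values = np.linspace(-3.0, 3.0, 61)   # plays the role of lnF0_values
y_values = np.linspace(-2.0, 2.0, 41)   # plays the role of b_values
grid_x, grid_y = np.meshgrid(x_values, y_values)
print(grid_x.shape)                     # (len(y_values), len(x_values)): rows follow y

# Made-up unnormalized 2D posterior
post = np.exp(-0.5 * ((grid_x / 0.5)**2 + (grid_y / 0.3)**2))

x_post = post.sum(axis=0)   # sum over rows (y) leaves a function of x
y_post = post.sum(axis=1)   # sum over columns (x) leaves a function of y
print(x_post.shape, y_post.shape)
```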

In [None]:
TBC()
# lnF0_post = 
# b_post =

Here we'll pass those to our CI finder. This should reveal if your grid is too coarse to resolve the CI's well.

In [None]:
h1d_lnF0 = {'x':lnF0_values, 'density':lnF0_post}
h1d_b = {'x':b_values, 'density':b_post}

plt.rcParams['figure.figsize'] = (14.0, 5.0)
fig, ax = plt.subplots(1,2);
ci_lnF0 = cr.whist_ci(h1d_lnF0, plot=ax[0])
ax[0].set_xlabel('ln F0');
ax[0].set_ylabel('marginalized post');
ci_b = cr.whist_ci(h1d_b, plot=ax[1])
ax[1].set_xlabel('b');
ax[1].set_ylabel('marginalized post');

Print out the CI's:

In [None]:
ci_lnF0

In [None]:
ci_b

**Checkpoint:** with the given setup, you should end up with something like $\ln(F_0)=-6.6\pm0.2$ and $b=(5.3\pm0.4)\times10^{-6}$.

## Looking at the best fit

It's always a good idea to visualize how well your fit predicts the data. This is a little bit of an aside for this notebook (we will do more later on), so the code below is mostly given. You will need to fill in code to evaluate the model's expected number of counts over the postage stamp, just as in the likelihood. The plot shows the data as a surface brightness profile (average counts/pixel as a function of radius) compared with the mean prediction +/- 1 standard deviation of the best fitting model you found above.

In [None]:
def compare_profile(r_bins, data, x0, y0, lnF0, b, sigma):
    r2 = (data.imx - x0)**2 + (data.imy - y0)**2
    TBC() # mu = mean counts in each pixel in `data`, as an array
    rmin2 = r_bins[range(len(r_bins)-1)]**2
    rmax2 = r_bins[range(1, len(r_bins))]**2
    data_counts = np.zeros(len(rmin2))
    model_counts = np.zeros(len(rmin2))
    model_sd = np.zeros(len(rmin2))
    for i in range(len(rmin2)):
        j = np.where( np.all([r2>=rmin2[i], r2<rmax2[i]], axis=0) )
        npix = len(j[0])
        data_counts[i] = np.sum( data.im[j] ) / npix
        model_counts[i] = np.sum( mu[j] ) / npix
        model_sd[i] = np.sqrt(model_counts[i] / npix)
    r = 0.5*(np.sqrt(rmin2) + np.sqrt(rmax2))
    plt.rcParams['figure.figsize'] = (7.0, 5.0)
    plt.loglog(r, data_counts, 'o');
    plt.errorbar(r, model_counts, yerr=model_sd);
    plt.xlabel('r');
    plt.ylabel('counts/pix');
    
TBC_above()

In [None]:
best_params = params.copy()
best_params['lnF0'] = bestfit['x'][0]
best_params['b'] = bestfit['x'][1]
best_params

In [None]:
compare_profile(np.arange(0.0, 30.0, 1.0), stamp, **best_params)

## The curse of dimensionality

If we were very mean, we would now ask you to repeat everything above with one more free parameter, say $\sigma$. And feel free, if you'd like. We'll wait. And we will be waiting... a pretty long time.

Alternatively, you could observe how long it took to evaluate the log-posterior over a 2D grid with sufficient extent and resolution to contain nearly all the posterior probability and resolve the credible intervals well. (This should have been printed out above.) Now estimate the time needed to do the same over a 3D grid, assuming the 3rd parameter needs about as many grid points as the first 2. How about with 10 parameters?

> TBC

Now keep in mind that this is a pretty simple, toy problem. It's not unusual for the likelihood to take ~1 second for a single evaluation in real life - or even longer. Hence, this exponential scaling, $(\mathrm{grid~points})^{(\mathrm{free~parameters})}$, is in general prohibitive for us. In the following tutorials you'll solve the same (or at least similar) problems using Monte Carlo methods instead. We will see that Monte Carlo methods also take longer when there are more parameters to juggle, but the scaling is typically much nicer.
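
To put rough numbers on that scaling, here is a back-of-the-envelope calculation. The 100 grid points per parameter and 1 millisecond per posterior evaluation are hypothetical; substitute the grid size and timing you actually measured.

```python
def grid_cost_seconds(points_per_dim, n_params, seconds_per_eval):
    """Total time to evaluate a full grid: one evaluation per grid point."""
    return seconds_per_eval * points_per_dim**n_params

for n_params in (2, 3, 10):
    t = grid_cost_seconds(points_per_dim=100, n_params=n_params, seconds_per_eval=1e-3)
    print(f"{n_params} parameters: {t:.3g} seconds")
```

Under these assumptions, 2 parameters cost about 10 seconds and 3 parameters about 1000 seconds, while 10 parameters would take of order $10^{17}$ seconds, i.e. billions of years.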