This code investigates the shape measurement performance of the DM stack on the DR2 imaging data.  It has a few steps:
 - Load a star catalog and a galaxy catalog from the imaging area
 - Load the extragalactic catalog and truth catalog over the same area
 - Match the imaging and EGC catalogs 
 - Measure the galaxy-galaxy shear correlation functions of the two galaxy catalogs
 - Measure the "rho statistics" -- correlation functions of the PSF and star shapes that diagnose PSF leakage into the cosmic shear signal
 - Measure the calibration factors that remove measurement bias from the shape measurement on images
 - Correct the galaxy-galaxy shear CF from the imaging data and compare the corrected version to the EGC CF
 - Measure the correlation function of the EGC over the same area, using all EGC galaxies reweighted to the same size and flux distribution as the measured galaxies, to further look for selection bias
 
Many of these steps borrow heavily from other DC2 tutorials, as listed in the code comments.

We begin by loading up the code we'll need. Stile will eventually be in the DESC environment but you may need to load it by hand at first by adding the path to sys.path.  The desc-stack kernel is most likely to have the dependencies you need for this tutorial.

In [None]:
import warnings
import sys
sys.path.insert(0, '')
# Could not get this to work with anything less specific...
sys.path.insert(0, '/global/homes/m/msimet/.local/lib/python3.6/site-packages')

import numpy as np
import pandas as pd
import FoFCatalogMatching
import matplotlib.pyplot as plt
import scipy.optimize as op
with warnings.catch_warnings():
    # Stile throws up a bunch of matplotlib warnings that we can just ignore
    warnings.filterwarnings('ignore')
    import stile
from utils.paired_catalogs import get_catalogs

figsize_x, figsize_y = plt.gcf().get_size_inches()

%matplotlib inline

In [None]:
# If you look at LSST_stack_matching.py, you'll see that this is a large block of code that retrieves
# data from the butler and from various catalogs, computes shapes where necessary, applies cuts, and 
# finally returns these catalogs. The default settings and default cuts for get_catalogs() should be 
# sensible, but you can look at the docstring for that function for more info.

# This takes a few minutes to run
galaxy_catalog, star_catalog, egc_catalog, truth_catalog = get_catalogs()

In [None]:
# Match the truth catalog to the DM object catalog (the EGC is joined in by ID when we merge the tables below)

matches = FoFCatalogMatching.match(
    catalog_dict={'truth': truth_catalog, 'object': galaxy_catalog},
    linking_lengths=1.0,
    catalog_len_getter=lambda x: len(x['ra']),
)

In [None]:
# Now, find the one-to-one matches.
# We could probably be more clever and try to figure out what's going on with the non-matches,
# because this probably has a weird selection, but it should be good enough for now.
# This (like the above box) is from the FoF matching tutorial,
# https://github.com/LSSTDESC/DC2-analysis/blob/master/tutorials/matching_fof.ipynb
truth_mask = matches['catalog_key'] == 'truth'
object_mask = ~truth_mask

n_groups = matches['group_id'].max() + 1
n_truth = np.bincount(matches['group_id'][truth_mask], minlength=n_groups)
n_object = np.bincount(matches['group_id'][object_mask], minlength=n_groups)

one_to_one_group_mask = np.in1d(matches['group_id'], np.flatnonzero((n_truth == 1) & (n_object == 1)))
truth_idx = matches['row_index'][one_to_one_group_mask & truth_mask]
object_idx = matches['row_index'][one_to_one_group_mask & object_mask]

In [None]:
# Make a pandas DataFrame that merges all the catalogs
truth_table = pd.DataFrame(truth_catalog).iloc[truth_idx].reset_index(drop=True)
object_table = pd.DataFrame(galaxy_catalog).iloc[object_idx].reset_index(drop=True)
merged_table = pd.merge(truth_table, object_table, left_index=True, right_index=True, suffixes=('_truth', '_object'))
merged_table = pd.merge(merged_table, pd.DataFrame(egc_catalog), 'inner', left_on='object_id', right_on='galaxy_id', suffixes=('', '_egc'))
# some EGC things are named "_true" and having both "_true" and "_truth" is confusing!
merged_table = merged_table.rename(columns=lambda x: x[:-5]+'_egc' if x[-5:] == '_true' else x) 
# Make some column names easier to type
merged_table = merged_table.rename(columns={"ext_shapeHSM_HsmShapeRegauss_e1" : "e1_object", 
                                            "ext_shapeHSM_HsmShapeRegauss_e2" : "e2_object",
                                            "ellipticity_1_egc": "e1_egc",
                                            "ellipticity_2_egc": "e2_egc"})
    
print("Number of matches: {} from {} DM galaxies and {} truth galaxies".format(
            len(merged_table['ra_truth']), len(galaxy_catalog['ra']), len(truth_catalog['ra'])))

Now we will measure the calibration factors we need to go from the measured HSM shapes to the catalog shapes.  We usually use a linear decomposition: $m$ and $c$.  Most shape measurement codes have been found to be linear in the weak lensing regime, so this should be sufficient for our purposes.  That means that for a given intrinsic shape $g^{\rm true}$, what we actually measure is:

$$ g^{\rm meas} = (1+m)g^{\rm true} + c $$

The $c$ portion also has a dependence on the PSF ellipticity that we will parameterize through $\alpha$, so:

$$ g^{\rm meas} = (1+m)g^{\rm true} + \alpha e^{\rm PSF} + c $$

We have an advantage here in that we have real pairs of input and output shapes to measure these coefficients on.  We will assume there is no difference between the two components of the ellipticity (which should be close enough for this exercise).  However, there is one more component to care about.  The shape in the catalogs is ellipticity, not distortion (see [Bernstein & Jarvis 2002](https://arxiv.org/abs/astro-ph/0107431) for more), so we need to correct the galaxy shapes for the different definition.  This correction includes a factor of two and a component called the _responsivity_ that depends on the per-component RMS distortion:

$$ \mathcal{R} \approx 1-e_{\rm rms}^2 $$

We would want this to be weighted if we were weighting galaxies, but in this tutorial, we are not.  So in the end our full equation is:

$$ \frac{g^{\rm meas}}{2\mathcal{R}} = (1+m)g^{\rm true} + \alpha e^{\rm PSF} + c $$
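
As a sanity check on the full equation, here is a minimal sketch that builds a toy data set with known coefficients and confirms that a least-squares fit recovers them. Everything in it is an assumption made for illustration: the input values `m_in`, `alpha_in`, `c_in`, the fixed `toy_responsivity`, the Gaussian toy shapes and the noise level are not taken from the catalogs.

```python
import numpy as np
import scipy.optimize as op

rng = np.random.default_rng(42)
n_gal = 20000

# Hypothetical input values, used only to build the toy data set
m_in, alpha_in, c_in = -0.05, 0.02, 0.001
toy_responsivity = 0.85

g_true = rng.normal(0.0, 0.2, n_gal)   # toy catalog shapes
e_psf = rng.normal(0.0, 0.05, n_gal)   # toy PSF ellipticities
noise = rng.normal(0.0, 0.05, n_gal)   # toy measurement noise

# Measured shape: the right-hand side of the full equation times 2R, plus noise
e_meas = 2*toy_responsivity*((1+m_in)*g_true + alpha_in*e_psf + c_in) + noise

def toy_bias_model(x, m, alpha, c):
    """Linear bias model in the same form as the full equation above."""
    g, psf = x
    return 2*toy_responsivity*((1+m)*g + alpha*psf + c)

(m_fit, alpha_fit, c_fit), cov = op.curve_fit(toy_bias_model, [g_true, e_psf], e_meas)
print("m={:.4f}, alpha={:.4f}, c={:.5f}".format(m_fit, alpha_fit, c_fit))
```

The fitted values should land close to the inputs, with a scatter set by the toy noise level and the number of objects.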

In [None]:
responsivity = 1-np.mean(np.concatenate([merged_table['e1_object']**2, merged_table['e2_object']**2]))
print("Responsivity = {}".format(responsivity))

In [None]:
def bias_model(x, m, alpha, c):
    xtrue = x[0]
    xpsf = x[1]
    return 2*responsivity*((1+m)*xtrue + alpha*xpsf + c)

# The EGC and object catalogs seem to have different sign conventions
e_egc_all = np.concatenate([merged_table['e1_egc'], merged_table['e2_egc']])
psf_e_all = np.concatenate([-1*merged_table['psf_e1'], merged_table['psf_e2']])
e_obj_all =  np.concatenate([-1*merged_table['e1_object'], merged_table['e2_object']])
param, _ =  op.curve_fit(bias_model, [e_egc_all, psf_e_all], e_obj_all)

m = param[0]
alpha = param[1]
c = param[2]

print("m={}, alpha={}, c={}".format(*param))

In [None]:
avg_psf_e1 = merged_table['psf_e1'].mean()
avg_psf_e2 = merged_table['psf_e2'].mean()

# Plot what these biases actually look like
fig = plt.figure(figsize=[2*figsize_x, figsize_y])
ax = fig.add_subplot(121)
ax.hist2d(merged_table['e1_egc'], -merged_table['e1_object']/2/responsivity, bins=20)
# The fit used -psf_e1 for this component, so the average PSF term changes sign here
ax.plot(2*merged_table['e1_egc'], 2*(1+m)*merged_table['e1_egc']-alpha*avg_psf_e1+c, color='black')
ax.set_xlabel("EGC g1")
ax.set_ylabel("Object g1")
ax.set_ylim((-1, 1))
ax = fig.add_subplot(122)
ax.hist2d(merged_table['e2_egc'], merged_table['e2_object']/2/responsivity, bins=20)
ax.plot(2*merged_table['e2_egc'], 2*(1+m)*merged_table['e2_egc']+alpha*avg_psf_e2+c, color='black')
ax.set_xlabel("EGC g2")
ax.set_ylabel("Object g2")
ax.set_ylim((-1, 1))

Next, correlation functions. We'll use Stile to do this.  Stile's correlation function code wraps TreeCorr, but it has some built-in plotting and data-formatting functions that save us some lines of code here.

We need to define the binning for the correlation function.  I've gone for 1/5 of the extent of the catalog in declination for the max distance (possibly large enough to see edge effects anyway), using 20 bins to cover an order of magnitude in angular distance.  We'll also need to rename some columns, since Stile expects specific names.
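
To make the binning concrete, here is a small sketch of the logarithmic bins implied by these choices. It assumes TreeCorr's usual logarithmic binning between `min_sep` and `max_sep`; the declination range is a made-up placeholder, not the real catalog extent.

```python
import numpy as np

# Hypothetical declination range in degrees, standing in for the star catalog
toy_min_dec, toy_max_dec = -30.0, -28.0

max_sep = 0.2*(toy_max_dec - toy_min_dec)   # 1/5 of the declination extent
min_sep = 0.1*max_sep                       # one order of magnitude below max_sep
nbins = 20

# Logarithmically spaced bin edges and their geometric centers
bin_edges = np.geomspace(min_sep, max_sep, nbins + 1)
bin_centers = np.sqrt(bin_edges[:-1]*bin_edges[1:])
log_bin_width = np.log(max_sep/min_sep)/nbins

print("min_sep = {:.3f} deg, max_sep = {:.3f} deg".format(min_sep, max_sep))
print("log bin width = {:.4f}".format(log_bin_width))
```

With 20 bins per decade, each bin spans about 12% in angular separation.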

In [None]:
# Pick some good bin edges
star_catalog = pd.DataFrame(star_catalog)
star_catalog['w'] = np.ones_like(star_catalog['ra'])
star_catalog_stile = star_catalog.rename(columns={'e1': 'g1', 'e2': 'g2', 'psf_e1': 'psf_g1', 'psf_e2': 'psf_g2'}).to_records()
# Responsivity for round objects is 1, but we still need that factor of 2!
star_catalog_stile['g1'] /= 2
star_catalog_stile['g2'] /= 2
min_ra = star_catalog['ra'].min()
max_ra = star_catalog['ra'].max()
min_dec = star_catalog['dec'].min()
max_dec = star_catalog['dec'].max()

max_sep = 0.2*(max_dec-min_dec)
min_sep = 0.1*max_sep
nbins = 20

# Make a dict of TreeCorr parameters that we can pass to Stile
corrfunc_kwargs = {'ra_units': 'degrees', 'dec_units': 'degrees',
                   'min_sep': min_sep, 'max_sep': max_sep, 'sep_units': 'degrees', 'nbins': nbins }

rho1 = stile.CorrelationFunctionSysTest("Rho1")
rho2 = stile.CorrelationFunctionSysTest("Rho2")
rho3 = stile.CorrelationFunctionSysTest("Rho3")
rho4 = stile.CorrelationFunctionSysTest("Rho4")
rho5 = stile.CorrelationFunctionSysTest("Rho5")

rho1_res = rho1(star_catalog_stile, **corrfunc_kwargs)
rho2_res = rho2(star_catalog_stile, **corrfunc_kwargs)
rho3_res = rho3(star_catalog_stile, **corrfunc_kwargs)
rho4_res = rho4(star_catalog_stile, **corrfunc_kwargs)
rho5_res = rho5(star_catalog_stile, **corrfunc_kwargs)

# When you correct correlation functions for additive and multiplicative bias, the additive is generally stable
# and can be done per-object, while the multiplicative is not stable and should be done in ensemble.  Since we measured
# m and c from the whole ensemble, though, we can just do the subtraction and division right now for m and c, and 
# because the alpha term is additive, we can do that per-object right now, too.  And again, we need to flip
# the e1 direction to make it comparable to the EGC.

# For e1 the fit used -e1_object and -psf_e1, so those two terms change sign; c does not
merged_table['e1_object_prime'] = (-merged_table['e1_object']/2/responsivity - c + alpha*merged_table['psf_e1'])/(1+m)
merged_table['e2_object_prime'] = (merged_table['e2_object']/2/responsivity - c - alpha*merged_table['psf_e2'])/(1+m)
merged_table['w'] = np.ones_like(merged_table['ra_egc'])

merged_table_stile_obj = merged_table.rename(columns={'e1_object_prime': 'g1', 'e2_object_prime': 'g2', 'ra_object': 'ra', 'dec_object': 'dec'}).to_records()
object_corrfunc = stile.CorrelationFunctionSysTest()
object_corrfunc_res = object_corrfunc('gg', merged_table_stile_obj, **corrfunc_kwargs)

merged_table_stile_egc = merged_table.rename(columns={'e1_egc': 'g1', 'e2_egc': 'g2', 'ra_egc': 'ra', 'dec_egc': 'dec'}).to_records()
egc_corrfunc = stile.CorrelationFunctionSysTest()
egc_corrfunc_res = egc_corrfunc('gg', merged_table_stile_egc, **corrfunc_kwargs)

In [None]:
# Free up some memory
del star_catalog_stile
del merged_table_stile_egc
del merged_table_stile_obj

What do these functions look like?

In [None]:
# Note: the interface presented here will change in the near future to rho1_res.plot()
fig = rho1.plot(rho1_res)
fig.suptitle('Rho 1')
plt.clf()

fig = rho2.plot(rho2_res)
fig.suptitle('Rho 2')
plt.clf()

fig = rho3.plot(rho3_res)
fig.suptitle('Rho 3')
plt.clf()

fig = rho4.plot(rho4_res)
fig.suptitle('Rho 4')
plt.clf()

fig = rho5.plot(rho5_res)
fig.suptitle('Rho 5')
plt.clf()

In [None]:
fig = object_corrfunc.plot(object_corrfunc_res)
fig.suptitle('Image-derived correlation function')
plt.clf()

fig = egc_corrfunc.plot(egc_corrfunc_res)
fig.suptitle('EGC-derived correlation function')
plt.clf()

Following __[Jarvis et al. (2015)](https://ui.adsabs.harvard.edu/#abs/arXiv:1507.05603)__, we define the correction to the correlation function as

$$ \delta \xi_+(\theta) = 2 \left\langle \frac{T_{\rm PSF}}{T_{\rm gal}} \frac{\delta T_{\rm PSF}}{T_{\rm PSF}}\right\rangle \xi_+(\theta)  +  \left\langle \frac{T_{\rm PSF}}{T_{\rm gal}} \right\rangle^2 \rho_1(\theta) - \alpha  \left\langle \frac{T_{\rm PSF}}{T_{\rm gal}}\right\rangle \rho_2(\theta) +  \left\langle \frac{T_{\rm PSF}}{T_{\rm gal}} \right\rangle^2 \rho_3 (\theta) +  \left\langle \frac{T_{\rm PSF}}{T_{\rm gal}}\right\rangle^2 \rho_4(\theta) - \alpha  \left\langle \frac{T_{\rm PSF}}{T_{\rm gal}}\right\rangle \rho_5(\theta) $$

$T$ represents the intensity-weighted second moment of the radius, called $R^2$ in an earlier paper by Paulin-Henriksson et al. (2008).  Handily for us, this is the quantity called "sigma" in the reGaussianization pipeline.  Following Jarvis et al., we'll approximate that first expectation value as the product of two expectation values:

$$\left\langle \frac{T_{\rm PSF}}{T_{\rm gal}} \frac{\delta T_{\rm PSF}}{T_{\rm PSF}}\right\rangle = \left\langle \frac{T_{\rm PSF}}{T_{\rm gal}}\right\rangle \left\langle \frac{\delta T_{\rm PSF}}{T_{\rm PSF}}\right\rangle $$

since we can't measure  the PSF modeling error at the locations of galaxies, and we can't measure galaxy size at the locations of stars.
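
The combination above can be sketched as a small function. This is an illustration under stated assumptions rather than part of the pipeline: `delta_xi_plus` is a hypothetical helper name, the inputs are toy numbers, and the factorized expectation value is passed in as the two separate averages `tpsf_tgal` and `dtpsf_tpsf`.

```python
import numpy as np

def delta_xi_plus(xip, rho, tpsf_tgal, dtpsf_tpsf, alpha):
    """Combine xi_+ and the five rho statistics into delta xi_+.

    rho is a sequence (rho1, ..., rho5) of arrays binned like xip;
    tpsf_tgal is <T_PSF/T_gal> and dtpsf_tpsf is <delta T_PSF/T_PSF>.
    """
    rho1, rho2, rho3, rho4, rho5 = rho
    return (2*tpsf_tgal*dtpsf_tpsf*xip
            + tpsf_tgal**2*(rho1 + rho3 + rho4)
            - alpha*tpsf_tgal*(rho2 + rho5))

# Toy inputs (hypothetical values, three angular bins)
toy_resolution = np.array([0.5, 0.7, 0.6])
toy_tpsf_tgal = np.mean(1 - toy_resolution)   # resolution = 1 - T_PSF/T_gal
toy_xip = np.array([1e-4, 5e-5, 2e-5])
toy_rho = [np.full(3, v) for v in (1e-6, 2e-6, 3e-7, 4e-7, 5e-6)]

toy_delta = delta_xi_plus(toy_xip, toy_rho, toy_tpsf_tgal, dtpsf_tpsf=0.01, alpha=0.05)
print(toy_delta)
```

Keeping the two averages as separate arguments makes the factorization explicit: the size ratio comes from the galaxy sample and the PSF size residual comes from the stars.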

In [None]:
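# m was already divided out per object when building e1_object_prime and e2_object_prime,
# so no further (1+2*m) factor is applied to the correlation function here
xip_object = object_corrfunc_res['xip']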
# We don't have the galaxy sizes directly, but resolution, which we *do* have, is 1-TPSF/Tgal
tpsf_tgal = np.mean(1-merged_table['ext_shapeHSM_HsmShapeRegauss_resolution'])
tpsf_tgal_deltatpsf = tpsf_tgal*np.mean((star_catalog['psf_sigma']-star_catalog['sigma'])/star_catalog['sigma'])

delta_xip = ( 2*tpsf_tgal_deltatpsf * xip_object + tpsf_tgal**2*(rho1_res['xip'] + rho3_res['xip'] + rho4_res['xip'])
             - alpha*tpsf_tgal*(rho2_res['xip']+rho5_res['xip']))

And now, the comparison.  For ease of reading the plot, we're only going to plot the errorbars for the corrected object-based shape-shape correlation function.

In [None]:
xip_gc = egc_corrfunc_res['xip']
# Correct the measurement from the DRP for m biases as well as responsivity
xip_object_corrected = (xip_object + delta_xip)

# Variances add regardless of the sign of each term in delta_xip
xi_err_sq = (object_corrfunc_res['sigma_xi']**2*(1+4*tpsf_tgal_deltatpsf**2) + 
             tpsf_tgal**4*(rho1_res['sigma_xi']**2 + rho3_res['sigma_xi']**2 + rho4_res['sigma_xi']**2) +
             alpha**2*tpsf_tgal**2*(rho2_res['sigma_xi']**2+rho5_res['sigma_xi']**2))

x = egc_corrfunc_res['meanR [deg]']
x_edges = np.concatenate(([x[0]**2/x[1]], np.sqrt(x[:-1]*x[1:]), [x[-1]**2/x[-2]]))
x_err = [x-x_edges[:-1], x_edges[1:]-x]
plt.figure(figsize=(2*figsize_x, 2*figsize_y))
plt.plot(x, np.abs(egc_corrfunc_res['xip']), label="Truth", color="C0") 
plt.errorbar(x, np.abs(xip_object_corrected), yerr=np.sqrt(xi_err_sq),
             label="object, corrected", color='C1')
plt.plot(x, np.abs(xip_object), 
             label="object", color='C2')
plt.plot(x, np.abs(2*tpsf_tgal_deltatpsf*xip_object), 
             label=r"$\xi_+$ error term", color='C3')
plt.plot(x, np.abs(tpsf_tgal**2*rho1_res['xip']), 
             label=r"$\rho_1$ error term", color='C4')
plt.plot(x, np.abs(alpha*tpsf_tgal*rho2_res['xip']), 
             label=r"$\rho_2$ error term", color='C5')
plt.plot(x, np.abs(tpsf_tgal**2*rho3_res['xip']), 
             label=r"$\rho_3$ error term", color='C6')
plt.plot(x, np.abs(tpsf_tgal**2*rho4_res['xip']), 
             label=r"$\rho_4$ error term", color='C7')
plt.plot(x, np.abs(alpha*tpsf_tgal*rho5_res['xip']), 
             label=r"$\rho_5$ error term", color='C8')

plt.legend()
plt.xlabel("R [deg]")
plt.ylabel(r"$\xi_+$")
plt.yscale('log')
plt.xscale('log')

Finally, we can ask, what effect does selection have?  These previous plots were measured using only matched objects--but not all objects were matched.  Let's compute some correlation functions using *all* the objects in the given sky area.

In [None]:
egc_catalog['w'] = np.ones_like(egc_catalog['ra_true'])
for old, new in [('ra_true', 'ra'), ('dec_true', 'dec'), ('ellipticity_1_true', 'g1'), ('ellipticity_2_true', 'g2')]:
    egc_catalog[new] = egc_catalog[old]
egc_catalog_stile = np.rec.fromarrays(egc_catalog.values(), names=list(egc_catalog.keys()))
egc_all_corrfunc = stile.CorrelationFunctionSysTest()
egc_all_corrfunc_res = egc_all_corrfunc('gg', egc_catalog_stile, **corrfunc_kwargs)

galaxy_catalog['w'] = np.ones_like(galaxy_catalog['ra'])
for old, new in [('ext_shapeHSM_HsmShapeRegauss_e1', 'g1'), ('ext_shapeHSM_HsmShapeRegauss_e2', 'g2')]:
    galaxy_catalog[new] = galaxy_catalog[old]
galaxy_catalog_stile = np.rec.fromarrays(galaxy_catalog.values(), names=list(galaxy_catalog.keys()))
# For g1 the fit used -e1 and -psf_e1, so those two terms change sign; c does not
galaxy_catalog_stile['g1'] = (-galaxy_catalog_stile['g1']/2/responsivity - c + alpha*galaxy_catalog_stile['psf_e1'])/(1+m)
galaxy_catalog_stile['g2'] = (galaxy_catalog_stile['g2']/2/responsivity - c - alpha*galaxy_catalog_stile['psf_e2'])/(1+m)
object_all_corrfunc = stile.CorrelationFunctionSysTest()
object_all_corrfunc_res = object_all_corrfunc('gg', galaxy_catalog_stile, **corrfunc_kwargs)

plt.figure(figsize=(2*figsize_x, 2*figsize_y))
plt.errorbar(x, object_corrfunc_res['xip'], yerr=object_corrfunc_res['sigma_xi'], label="Matched DM objects")
plt.errorbar(x, object_all_corrfunc_res['xip'], yerr=object_all_corrfunc_res['sigma_xi'], label="All DM objects")
plt.legend()
plt.xlabel("R [deg]")
plt.ylabel(r"$\xi_+$")
#plt.yscale('symlog', linthreshy=1.E-4)
plt.xscale('log')

plt.figure(figsize=(2*figsize_x, 2*figsize_y))
plt.errorbar(x, egc_corrfunc_res['xip'], yerr=egc_corrfunc_res['sigma_xi'], label="Matched EGC objects")
plt.errorbar(x, egc_all_corrfunc_res['xip'], yerr=egc_all_corrfunc_res['sigma_xi'], label="All EGC objects")
plt.legend()
plt.xlabel("R [deg]")
plt.ylabel(r"$\xi_+$")
#plt.yscale('symlog', linthreshy=1.E-4)
plt.xscale('log')
