# Using ELAsTiCC $p(z | photometry)$

_Alex Malz (GCCL@RUB --> CMU)_

In [None]:
import bisect
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
random.seed(42)
import scipy.integrate as spi
import scipy.stats as sps
import sys
eps = sys.float_info.epsilon

## Reading ELAsTiCC photo-$z$ data

Note: this whole section will have to be changed to the actual training set file!

In [None]:
hl_heads = {'SNIa': 10,
            'SNII': 10, 
            'SNIbc': 10, 
            'UNMATCHED_KN_SHIFT': 10,
            'UNMATCHED_COSMODC2': 9}
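# next time, do something clever to infer the header lengths, e.g. (needs `import subprocess`;
# os.system would return the exit status rather than the command output)
# hl_head = int(subprocess.check_output(f"zcat {hl_path} | cat -n | sed -n '/VARNAMES/ {{ p; q }}'  | awk '{{print $1-1}}'", shell=True).decode())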
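
The header length could also be inferred in pure Python by scanning for the first line that contains the `VARNAMES` marker. The sketch below is only illustrative: `count_header_lines` is a hypothetical helper, the file contents are made up, and it assumes the marker appears literally as `VARNAMES`.

```python
import io

def count_header_lines(lines, marker='VARNAMES'):
    """Return the number of lines that precede the first line containing `marker`.

    `lines` is any iterable of strings, e.g. an open text file handle.
    Raises ValueError if the marker never appears.
    """
    for n_before, line in enumerate(lines):
        if marker in line:
            return n_before
    raise ValueError(f'no line containing {marker!r} found')

# Toy stand-in for a hostlib file (contents are invented for illustration)
fake_file = io.StringIO('# comment 1\n# comment 2\nVARNAMES: GALID ZTRUE\n1 0.5\n')
n_header = count_header_lines(fake_file)
print(n_header)
```

For a gzipped file, a handle from `gzip.open(hl_path, 'rt')` should work in place of the toy `StringIO` object.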

Let's pick one hostlib for now.

In [None]:
pick_one = 0
which_hl = list(hl_heads.keys())[pick_one]
hl_path = '/global/cfs/cdirs/lsst/groups/TD/SN/SNANA/SURVEYS/LSST/ROOT/PLASTICC_DEV/HOSTLIB/zquants/'+which_hl+'_dummy_pz.csv'

**WARNING: slow!**

In [None]:
df = pd.read_csv(hl_path, delimiter=' ', header=0)
nhost = len(df)

In [None]:
df.columns

Do a sanity check on the point estimates, if truth is available.

In [None]:
plt.scatter(df['ZTRUE'], df['ZPHOT_Q050'], s=0.1, alpha=0.1, c=df['P_ZPHOT'])
plt.xlabel('$z_{true}$')
plt.ylabel('$z_{median}$')
plt.plot([0., 3.], [0., 3.], c='k')

## Reviewing quantiles

Recall that the quantiles $z_{q}$ are the redshifts at which $q = CDF(z_{q}) = \int_{0}^{z_{q}} p(z) dz$.
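
As a quick illustration of that definition, here is a minimal sketch using a toy Gaussian $p(z)$ from `scipy.stats`. The distribution and the names `toy_pz`, `toy_q`, and `toy_zq` are assumptions made for the example, not anything taken from the ELAsTiCC files; the Gaussian's mass below $z=0$ is negligible for these parameters.

```python
import numpy as np
import scipy.stats as sps

# Hypothetical toy p(z): a Gaussian centered at z=0.5 with width 0.05
toy_pz = sps.norm(loc=0.5, scale=0.05)

# Quantile levels q and the redshifts z_q at which CDF(z_q) = q
toy_q = np.linspace(0.1, 0.9, 9)
toy_zq = toy_pz.ppf(toy_q)

# Evaluating the CDF at the quantiles recovers the levels
print(np.allclose(toy_pz.cdf(toy_zq), toy_q))
```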

The quantiles used should probably be stored somewhere other than in the column names, but here they are anyway.

Note that we could have saved ourselves one float by replacing the redshifts where $CDF=0$ and $CDF=1$ with $p(z_{q})$ for any of the saved quantiles $q$.

In [None]:
quants = np.linspace(0., 1., 11)
quants[0] += eps
quants[-1] -= eps

Let's isolate that information from the table.

In [None]:
quantlabs = ['ZPHOT_Q000', 'ZPHOT_Q010', 'ZPHOT_Q020', 'ZPHOT_Q030', 'ZPHOT_Q040', 'ZPHOT_Q050', 'ZPHOT_Q060', 'ZPHOT_Q070', 'ZPHOT_Q080', 'ZPHOT_Q090', 'ZPHOT_Q100']
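
Rather than typing the labels by hand, they could be generated from the quantile levels. This is a sketch that assumes the naming convention is `ZPHOT_Q` followed by the percentile zero-padded to three digits, as the column names above suggest; `levels` and `labels_from_levels` are names introduced here for illustration.

```python
import numpy as np

# Assumed convention: ZPHOT_Q + percentile, zero-padded to 3 digits
levels = np.linspace(0., 1., 11)
labels_from_levels = [f'ZPHOT_Q{int(round(100 * q)):03d}' for q in levels]
print(labels_from_levels)
```

Selecting the columns that start with `ZPHOT_Q` from `df.columns` would be another option if the table is already loaded.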

And let's pick just one galaxy for demonstration.

In [None]:
show_one = random.sample(range(nhost), 1)

In [None]:
zq_vals = df[quantlabs].loc[show_one].values[0]
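
The reconstruction in the next section divides by differences of neighboring quantile redshifts and looks values up by bisection, so it may be worth confirming that they are strictly increasing first. A minimal sketch, with a hypothetical helper `is_strictly_increasing` and made-up redshifts:

```python
import numpy as np

def is_strictly_increasing(values):
    """True if every element is larger than the one before it."""
    return bool(np.all(np.diff(np.asarray(values, dtype=float)) > 0))

# Invented quantile redshifts for illustration
good = [0.40, 0.45, 0.48, 0.50, 0.52, 0.55, 0.60]
tied = [0.40, 0.45, 0.45, 0.50]
print(is_strictly_increasing(good), is_strictly_increasing(tied))
```

In the notebook this would be called as `is_strictly_increasing(zq_vals)`.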

## Reconstructing PDFs from quantiles

The goal is now to recover $p(z)$ from the $(z, CDF(z))$ pairs.
This demonstrates the reconstruction algorithm from [ye olde qp](https://github.com/aimalz/qp), which was originally written in Python 2.
Though it runs without error in Python 3, changes to numpy array broadcasting may produce results inconsistent with [Malz & Marshall+ 2017](http://stacks.iop.org/1538-3881/156/i=1/a=35).
The basic idea, however, is robust.
We can safely assume $p(z_{000}) = 0$ and $p(z_{100}) = 0$ for $CDF(z_{000}) = 0$ and $CDF(z_{100}) = 1$ to anchor the endpoints, and by definition of the quantiles, the area under the curve between $z_{q_{i}}$ and $z_{q_{i+1}}$ is equal to $q_{i+1} - q_{i}$.
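By linear interpolation of the CDF between adjacent quantiles, the PDF is piecewise constant, $p(z) = (q_{i+1} - q_{i}) / (z_{q_{i+1}} - z_{q_{i}})$ for $z_{q_{i}} < z \leq z_{q_{i+1}}$, with a negligible floor of `eps` outside the outermost quantiles.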

In [None]:
q = quants
z = zq_vals

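# Slope of the linearly interpolated CDF on each interval between adjacent quantiles
derivative = (q[1:] - q[:-1]) / (z[1:] - z[:-1])
# Pad with a tiny floor for redshifts below the first and above the last quantile
derivative = np.insert(derivative, 0, eps)
derivative = np.append(derivative, eps)

def inside(xf):
    """Evaluate the piecewise-constant PDF at each redshift in xf."""
    nx = len(xf)
    yf = np.ones(nx) * eps
    for n in range(nx):
        # index of the quantile interval that contains xf[n]
        i = bisect.bisect_left(z, xf[n])
        yf[n] = derivative[i]
    return yf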
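
The per-point loop can be slow on dense grids. One possible alternative is `np.searchsorted`, which with `side='left'` follows the same convention as `bisect.bisect_left`. The sketch below uses made-up quantile redshifts (`z_toy`) in place of `zq_vals`, and the names `slopes` and `inside_vectorized` are introduced here only for illustration.

```python
import bisect
import sys
import numpy as np

eps = sys.float_info.epsilon

# Invented quantile levels and redshifts standing in for `quants` and `zq_vals`
q_toy = np.linspace(0., 1., 11)
z_toy = np.array([0.40, 0.44, 0.46, 0.48, 0.49, 0.50, 0.51, 0.52, 0.54, 0.56, 0.60])

# Piecewise-constant density between quantiles, padded with eps outside them
slopes = np.diff(q_toy) / np.diff(z_toy)
slopes = np.concatenate(([eps], slopes, [eps]))

def inside_vectorized(xf):
    """Look up the density for every grid point at once."""
    return slopes[np.searchsorted(z_toy, xf, side='left')]

grid = np.linspace(0.3, 0.7, 81)
pdf_vec = inside_vectorized(grid)

# Same lookup with a Python loop and bisect, for comparison
pdf_loop = np.array([slopes[bisect.bisect_left(z_toy, x)] for x in grid])
print(np.array_equal(pdf_vec, pdf_loop))
```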

Let's try a few grids upon which to evaluate reconstructed PDFs, testing the following:
1. `log`: the grid from which they were originally reduced
2. `lin`: a linearly spaced grid with the same granularity
3. `spa`: a very coarse grid
4. `den`: an excessively dense grid

In [None]:
zgrid = {}
zgrid['log'] = np.logspace(-3., np.log10(3.), 300)
zgrid['lin'] = np.arange(0., 3.01, 0.01)
zgrid['spa'] = np.linspace(0., 3., 100)#zq_vals
zgrid['den'] = np.linspace(0., 3., 1000)

In [None]:
eval_pdf = {}
for key, val in zgrid.items():
    eval_pdf[key] = inside(val)

Let's visualize this one to see how much it looks like a Gaussian PDF.

In [None]:
plt.vlines(df[quantlabs].loc[show_one].values[0], -1, 1, linestyle='--', color='k')
# plt.xlim(obs_locs[plot_one][0]-5*sigma*(1+obs_locs[plot_one][0]), 
#          obs_locs[plot_one][0]+5*sigma*(1+obs_locs[plot_one][0]))
plt.xlim(df['ZPHOT_Q000'].loc[show_one].values[0]-0.01, df['ZPHOT_Q100'].loc[show_one].values[0]+0.01)
for key in zgrid.keys():
    plt.plot(zgrid[key], eval_pdf[key], '-o', markersize=3, label=key, alpha=0.75)
plt.text(df['ZPHOT_Q000'].loc[show_one].values[0], 5, str(df['GALID'].loc[show_one].values[0]))
plt.xlabel('$z$')
plt.ylabel('$p(z)$')
plt.legend(loc='upper right')

## Performing sanity checks

On a sufficiently fine grid, the recovered PDF should integrate to 1.
We can and should manually renormalize if the integral deviates too far from 1.
Here are a few ways to check that integral.

In [None]:
for key, grid in zgrid.items():
    pdf = eval_pdf[key]
    dz = np.diff(grid)
    print(key)
    print(f'trapezoid-rule integral: {spi.trapezoid(pdf, grid)}')
    print(f'average integral at histogram midpoints: {np.sum((pdf[1:] + pdf[:-1]) / 2. * dz)}')
    print(f'histogram approximation: {np.sum(pdf[:-1] * dz)}')
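
If the integral is off, the manual renormalization mentioned above could look something like this sketch. The helper `renormalize` and the toy triangle are assumptions for illustration; the trapezoid rule is written out by hand so the block does not depend on a particular numpy or scipy version.

```python
import numpy as np

def renormalize(pdf_vals, grid):
    """Rescale gridded PDF values so their trapezoid-rule integral is 1."""
    area = np.sum(0.5 * (pdf_vals[1:] + pdf_vals[:-1]) * np.diff(grid))
    return pdf_vals / area

# Toy example: an unnormalized triangle on [0, 1]
toy_grid = np.linspace(0., 1., 101)
toy_pdf = 3. * (1. - np.abs(2. * toy_grid - 1.))
toy_norm = renormalize(toy_pdf, toy_grid)
print(np.sum(0.5 * (toy_norm[1:] + toy_norm[:-1]) * np.diff(toy_grid)))
```

In the notebook, this would be applied per grid, e.g. `renormalize(eval_pdf[key], zgrid[key])`.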

Another way to reconstruct PDFs from quantiles would use any one $p(z_{q})$ corresponding to saved quantile $q$ as an anchor, eliminating the need for the anchors at $CDF = 0$ and $CDF = 1$.
However, in the name of expediency, we leave it as an exercise for the reader.
Just kidding!
I'll take care of it soon.