# Demo: RAIL Evaluation 

This notebook demonstrates how to apply the metrics scripts to the photo-z PDF catalogs produced by the PZ working group. The first implementation of the _evaluation_ module is based on a refactoring of the code used in [Schmidt et al. 2020](https://arxiv.org/pdf/2001.03621.pdf), available in the GitHub repository [PZDC1paper](https://github.com/LSSTDESC/PZDC1paper).

To run this notebook, you must install qp and keep the notebook in the same directory as `utils.py` (available in RAIL's examples directory). You must also install some standard Python packages: numpy, scipy, matplotlib, and seaborn.

### Contents

* [Data](#data)
  - [Photo-z Results](#fzboost)
* [CDF-based metrics](#metrics)
  - [PIT](#pit)
  - [QQ plot](#qq)
* [Summary statistics of CDF-based metrics](#summary_stats)
  - [KS](#ks) 
  - [CvM](#cvm) 
  - [AD](#ad) 
  - [KLD](#kld) 
* [CDE loss](#cde_loss)  
* [Summary](#summary)

In [None]:
from rail.evaluation.metrics.pit import *
from rail.evaluation.metrics.cdeloss import *
from utils import read_pz_output, plot_pit_qq, ks_plot
from main import Summary
import numpy as np  # np is used explicitly below (e.g. np.array)
import qp
import os
%matplotlib inline
%reload_ext autoreload
%autoreload 2

<a class="anchor" id="data"></a>
# Data  


To compute the photo-z metrics of a given test sample, we first need to read the output of a photo-z code containing the galaxies' photo-z PDFs. Let's use the toy data available in `tests/data/` (**test_dc2_training_9816.hdf5** and **test_dc2_validation_9816.hdf5**) and the configuration file available in `examples/configs/FZBoost.yaml` to generate a small sample of photo-z PDFs using the **FZBoost** algorithm available in RAIL's _estimation_ module.

<a class="anchor" id="fzboost"></a>
### Photo-z Results
#### Run FZBoost

Go to the directory `<your_path>/RAIL/examples/estimation/` and run the command:

`python main.py configs/FZBoost.yaml`

The photo-z output files (inputs for this notebook) will be written to:

`<your_path>/RAIL/examples/estimation/results/FZBoost/test_FZBoost.hdf5`. 

Let's use the ancillary function **read_pz_output** to read all the necessary data.

In [None]:
my_path = '/Users/sam/WORK/software/TMPRAIL/RAIL' # replace this with your local path to the RAIL root directory
pdfs_file =  os.path.join(my_path, "examples/estimation/results/FZBoost/test_FZBoost.hdf5")
ztrue_file =  os.path.join(my_path, "tests/data/test_dc2_validation_9816.hdf5")
pdfs, zgrid, ztrue, photoz_mode = read_pz_output(pdfs_file, ztrue_file) # all numpy arrays

The inputs for the metrics shown below are the array of true (or spectroscopic) redshifts and an ensemble of photo-z PDFs (a `qp.Ensemble` object).

In [None]:
fzdata = qp.Ensemble(qp.interp, data=dict(xvals=zgrid, yvals=pdfs))

*** 
<a class="anchor" id="metrics"></a>
# Metrics



<a class="anchor" id="pit"></a>
## PIT

The Probability Integral Transform (PIT) is the Cumulative Distribution Function (CDF) of the photo-z PDF

$$ \mathrm{CDF}(f, q)\ =\ \int_{-\infty}^{q}\ f(z)\ dz $$

evaluated at the galaxy's true redshift, for every galaxy $i$ in the catalog:

$$ \mathrm{PIT}(p_{i}(z);\ z_{i})\ =\ \int_{-\infty}^{z^{true}_{i}}\ p_{i}(z)\ dz $$ 
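
As a rough illustration of the definition above, the PIT of gridded PDFs can be computed with plain NumPy by integrating each PDF up to the true redshift. This is only a sketch under simple assumptions (PDFs tabulated on a shared grid, trapezoidal integration, linear interpolation of the CDF). The function name `pit_from_grid` and the Gaussian toy inputs are hypothetical and are not part of RAIL or qp.

```python
import numpy as np

def pit_from_grid(z_grid, pdf_grid, z_true):
    """Evaluate the CDF of each gridded PDF at the corresponding true redshift.

    z_grid:   (n_grid,) shared redshift grid
    pdf_grid: (n_gal, n_grid) PDF values on that grid
    z_true:   (n_gal,) true redshifts
    """
    z_grid = np.asarray(z_grid, dtype=float)
    pdf_grid = np.asarray(pdf_grid, dtype=float)
    # Trapezoidal area of each grid interval, then cumulative sum -> CDF on the grid.
    areas = 0.5 * (pdf_grid[:, 1:] + pdf_grid[:, :-1]) * np.diff(z_grid)
    zeros = np.zeros((pdf_grid.shape[0], 1))
    cdf = np.concatenate([zeros, np.cumsum(areas, axis=1)], axis=1)
    cdf /= cdf[:, -1:]  # enforce CDF = 1 at the upper grid edge
    # Linear interpolation of each galaxy's CDF at its own true redshift.
    return np.array([np.interp(zt, z_grid, c) for zt, c in zip(z_true, cdf)])

# Toy input (hypothetical): two identical Gaussian PDFs centred at z = 1.0, sigma = 0.1.
toy_grid = np.linspace(0.0, 2.0, 2001)
toy_pdf = np.exp(-0.5 * ((toy_grid - 1.0) / 0.1) ** 2)
toy_pdfs = np.vstack([toy_pdf, toy_pdf])
toy_pit = pit_from_grid(toy_grid, toy_pdfs, z_true=[1.0, 1.1])
print(toy_pit)  # close to [0.5, 0.84]: the median and the +1 sigma point
```

A true redshift at the centre of the PDF gives PIT = 0.5, while a true redshift one standard deviation above the centre gives roughly 0.84.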


In [None]:
pitobj = PIT(fzdata, ztrue)
quant_ens, metamets = pitobj.evaluate()

The _evaluate_ method of the PIT class returns two objects: a quantile distribution based on the full set of PIT values (a frozen distribution object) and a dictionary of meta-metrics associated with the PIT (detailed below).

In [None]:
quant_ens

In [None]:
metamets

The individual PIT values can be retrieved from the PIT object:

In [None]:
pit_vals = np.array(pitobj._pit_samps)
pit_vals

### PIT outlier rate

The PIT outlier rate is a global metric defined as the fraction of galaxies in the sample with extreme PIT values. The lower and upper limits for flagging a PIT value as an outlier are optional parameters set when the metric is instantiated (by default, PIT $<10^{-4}$ or PIT $>0.9999$).
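
The definition above amounts to a simple counting operation. The sketch below is a minimal NumPy version, assuming the PIT values are available as a plain array. The function `pit_outlier_fraction` and its keyword names are hypothetical and do not reproduce the `PITOutRate` implementation.

```python
import numpy as np

def pit_outlier_fraction(pit_values, pit_min=1e-4, pit_max=0.9999):
    """Fraction of PIT values below pit_min or above pit_max."""
    pit_values = np.asarray(pit_values, dtype=float)
    is_outlier = (pit_values < pit_min) | (pit_values > pit_max)
    return float(is_outlier.mean())

# Toy input (hypothetical): two of the four values lie beyond the default limits.
toy_pits = np.array([0.0, 0.2, 0.5, 1.0])
toy_rate = pit_outlier_fraction(toy_pits)
print(toy_rate)  # 0.5
```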

In [None]:
pit_out_rate = PITOutRate(pit_vals, quant_ens).evaluate()
print(f"PIT outlier rate of this sample: {pit_out_rate:.6f}") 

<a class="anchor" id="qq"></a>
## PIT-QQ plot

The histogram of PIT values is a useful tool for a qualitative assessment of the quality of the PDFs. It shows whether the PDFs are:
* biased (tilted PIT histogram)
* under-dispersed (excess counts close to the boundaries 0 and 1)
* over-dispersed (lack of counts close to the boundaries 0 and 1)
* well-calibrated (flat histogram)

Following the standards of the DC1 paper, the PIT histogram is accompanied by the quantile-quantile (QQ) plot, which can be used to compare qualitatively the PIT distribution obtained with the PDFs against the ideal case (uniform distribution). The closer the QQ plot is to the diagonal, the better the calibration of the PDFs.
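
To make the QQ comparison concrete, the sketch below computes the two axes of a QQ curve from an array of PIT values: the quantiles of U(0,1), which are the probability levels themselves, and the empirical quantiles of the PIT sample. It is an illustration only. `qq_curve` is a hypothetical helper, not the code behind `plot_pit_qq`, and the synthetic PIT samples are assumptions chosen to mimic a calibrated and an under-dispersed case.

```python
import numpy as np

def qq_curve(pit_values, n_quantiles=101):
    """Return (theoretical, empirical) quantiles for a PIT sample against U(0,1)."""
    q_theory = np.linspace(0.0, 1.0, n_quantiles)  # quantiles of U(0,1)
    q_data = np.quantile(np.asarray(pit_values, dtype=float), q_theory)
    return q_theory, q_data

rng = np.random.default_rng(42)
calibrated = rng.uniform(size=10_000)          # flat PIT histogram
too_narrow = rng.beta(0.5, 0.5, size=10_000)   # U-shaped PIT histogram (excess near 0 and 1)

q_theory, q_cal = qq_curve(calibrated)
_, q_nar = qq_curve(too_narrow)

# Largest departure from the diagonal for each sample.
dev_cal = np.abs(q_cal - q_theory).max()
dev_nar = np.abs(q_nar - q_theory).max()
print(dev_cal, dev_nar)
```

The calibrated sample stays close to the diagonal, whereas the U-shaped sample departs from it noticeably on both sides of the median.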

In [None]:
plot_pit_qq(pdfs, zgrid, ztrue, title="PIT-QQ - toy data", code="FZBoost",
                pit_out_rate=pit_out_rate, savefig=False)

The black horizontal line represents the ideal case where the PIT histogram would behave as a uniform distribution U(0,1). 
***

<a class="anchor" id="summary_stats"></a>
# Summary statistics of CDF-based metrics

To evaluate the overall quality of the PDF estimates, `rail.evaluation` provides a set of metrics that compare the empirical distribution of PIT values with the reference uniform distribution, U(0,1).

<a class="anchor" id="ks"></a>
### Kolmogorov-Smirnov  

Let's start with the traditional Kolmogorov-Smirnov (KS) test statistic, which is the maximum absolute difference between the empirical and the expected cumulative distributions of PIT values:

$$
\mathrm{KS} \equiv \max_{PIT} \Big( \left| \ \mathrm{CDF} \small[ \hat{f}, z \small] - \mathrm{CDF} \small[ \tilde{f}, z \small] \  \right| \Big)
$$

where $\hat{f}$ is the PIT distribution and $\tilde{f}$ is U(0,1). Therefore, the smaller the value of KS, the closer the PIT distribution is to uniform. The `evaluate` method of the PITKS class returns a named tuple with the statistic and p-value.
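
For reference, the KS statistic against U(0,1) can be evaluated by hand from the sorted PIT values and cross-checked with `scipy.stats.kstest`. This is a sketch for illustration, not the `PITKS` implementation. `ks_uniform` is a hypothetical helper and the PIT sample is synthetic.

```python
import numpy as np
from scipy import stats

def ks_uniform(pit_values):
    """Two-sided KS distance between the empirical CDF of pit_values and U(0,1)."""
    x = np.sort(np.asarray(pit_values, dtype=float))
    n = x.size
    d_plus = np.max(np.arange(1, n + 1) / n - x)  # empirical CDF above the diagonal
    d_minus = np.max(x - np.arange(0, n) / n)     # empirical CDF below the diagonal
    return max(d_plus, d_minus)

rng = np.random.default_rng(0)
toy_pit = rng.uniform(size=5_000)  # synthetic, well-calibrated PIT values

manual_ks = ks_uniform(toy_pit)
scipy_ks = stats.kstest(toy_pit, "uniform")  # reference distribution U(0,1)
print(manual_ks, scipy_ks.statistic, scipy_ks.pvalue)
```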

In [None]:
ksobj = PITKS(pit_vals, quant_ens)
ks_stat_and_pval = ksobj.evaluate()

In [None]:
ks_stat_and_pval

Visual interpretation of the KS statistic:

In [None]:
ks_plot(pitobj)

In [None]:
print(f"KS metric of this sample: {ks_stat_and_pval.statistic:.4f}") 

<a class="anchor" id="cvm"></a>
### Cramer-von Mises

Similarly, let's calculate the Cramer-von Mises (CvM) statistic, a variant of the KS statistic defined as the mean-square difference between the CDF of an empirical distribution and the CDF of the reference distribution:

$$ \mathrm{CvM}^2 \equiv \int_{-\infty}^{\infty} \Big( \mathrm{CDF} \small[ \hat{f}, z \small] \ - \ \mathrm{CDF} \small[ \tilde{f}, z \small] \Big)^{2} \mathrm{dCDF}(\tilde{f}, z) $$ 


on the distribution of PIT values, which should be uniform if the PDFs are perfect.
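
As an illustration, the CvM statistic against U(0,1) can be computed from the sorted PIT values with the usual computational formula and compared with `scipy.stats.cramervonmises` (available in SciPy 1.6 and later). Note that the value SciPy reports is the integral above multiplied by the sample size. The helper `cvm_uniform` is hypothetical and the PIT sample is synthetic; this is not the `PITCvM` implementation.

```python
import numpy as np
from scipy import stats

def cvm_uniform(pit_values):
    """Cramer-von Mises statistic (n times the integral) against U(0,1)."""
    x = np.sort(np.asarray(pit_values, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    # Computational formula: 1/(12n) + sum_i ((2i - 1)/(2n) - x_(i))^2
    return 1.0 / (12.0 * n) + np.sum(((2 * i - 1) / (2.0 * n) - x) ** 2)

rng = np.random.default_rng(1)
toy_pit = rng.uniform(size=5_000)  # synthetic, well-calibrated PIT values

manual_cvm = cvm_uniform(toy_pit)
scipy_cvm = stats.cramervonmises(toy_pit, "uniform")
print(manual_cvm, scipy_cvm.statistic, scipy_cvm.pvalue)
```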

In [None]:
cvmobj = PITCvM(pit_vals, quant_ens)
cvm_stat_and_pval = cvmobj.evaluate()

In [None]:
print(f"CvM metric of this sample: {cvm_stat_and_pval.statistic:.4f}") 

<a class="anchor" id="ad"></a>
### Anderson-Darling 

Another variation of the KS statistic is the Anderson-Darling (AD) test, a weighted mean-squared difference featuring enhanced sensitivity to discrepancies in the tails of the distribution. 

$$ \mathrm{AD}^2 \equiv N_{tot} \int_{-\infty}^{\infty} \frac{\big( \mathrm{CDF} \small[ \hat{f}, z \small] \ - \ \mathrm{CDF} \small[ \tilde{f}, z \small] \big)^{2}}{\mathrm{CDF} \small[ \tilde{f}, z \small] \big( 1 \ - \ \mathrm{CDF} \small[ \tilde{f}, z \small] \big)}\mathrm{dCDF}(\tilde{f}, z) $$ 
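
The integral above has a standard closed form in terms of the sorted sample when the reference distribution is fully specified. The sketch below evaluates it for a U(0,1) reference, purely as an illustration of the formula. `anderson_darling_uniform` and the clipping parameter `eps` are hypothetical and are not taken from the `PITAD` implementation, which may compute the statistic differently.

```python
import numpy as np

def anderson_darling_uniform(pit_values, eps=1e-12):
    """Anderson-Darling A^2 of pit_values against U(0,1)."""
    # Clip to avoid log(0) for PIT values exactly equal to 0 or 1.
    x = np.sort(np.clip(np.asarray(pit_values, dtype=float), eps, 1.0 - eps))
    n = x.size
    i = np.arange(1, n + 1)
    # A^2 = -n - (1/n) * sum_i (2i - 1) * [ln x_(i) + ln(1 - x_(n+1-i))]
    s = np.sum((2 * i - 1) * (np.log(x) + np.log1p(-x[::-1])))
    return -n - s / n

rng = np.random.default_rng(2)
ad_flat = anderson_darling_uniform(rng.uniform(size=5_000))        # calibrated case
ad_tails = anderson_darling_uniform(rng.beta(0.5, 0.5, size=5_000))  # excess in the tails
ad_small = anderson_darling_uniform([0.25, 0.5, 0.75])
print(ad_flat, ad_tails, ad_small)
```

The sample with excess counts near 0 and 1 yields a much larger statistic than the flat one, reflecting the extra weight the AD test gives to the tails.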



In [None]:
adobj = PITAD(pit_vals, quant_ens)
ad_stat_crit_sig = adobj.evaluate()
ad_stat_crit_sig

In [None]:
print(f"AD metric of this sample: {ad_stat_crit_sig.statistic:.4f}") 

It is possible to remove catastrophic outliers before calculating the integral in order to avoid numerical instability. For instance, Schmidt et al. computed the Anderson-Darling statistic within the interval (0.01, 0.99).

In [None]:
ad_stat_crit_sig_cut = adobj.evaluate(pit_min=0.01, pit_max=0.99)
print(f"AD metric of this sample: {ad_stat_crit_sig.statistic:.4f}") 
print(f"AD metric for 0.01 < PIT < 0.99: {ad_stat_crit_sig_cut.statistic:.4f}") 

<a class="anchor" id="cde_loss"></a>
# CDE Loss



In the absence of true photo-z posteriors, the metric used to evaluate individual PDFs is the **Conditional Density Estimate (CDE) Loss**, a metric analogous to the root-mean-squared-error:

$$ L(f, \hat{f}) \equiv  \int \int {\big(f(z | x) - \hat{f}(z | x) \big)}^{2} dzdP(x), $$ 

where $f(z | x)$ is the true photo-z PDF and $\hat{f}(z | x)$ is the estimated PDF in terms of the photometry $x$. Since $f(z | x)$ is unknown, we estimate the **CDE Loss** as described in [Izbicki & Lee, 2017 (arXiv:1704.08095)](https://arxiv.org/abs/1704.08095):

$$ \mathrm{CDE} = \mathbb{E}_{X}\big(  \int{{\hat{f}(z | X)}^2 dz} \big) - 2{\mathbb{E}}_{X, Z}\big(\hat{f}(Z | X) \big) + K_{f},  $$


where the first term is the expectation value, with respect to the marginal distribution of the covariates $X$, of the integral of the squared estimated PDF; the second term is the expectation value of the estimated PDF with respect to the joint distribution of the observables $X$ and the space $Z$ of all possible redshifts (in practice, the centroids of the PDF bins); and the third term, $K_{f}$, is a constant that depends only on the true conditional densities $f(z | x)$.
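
A minimal sketch of how the two data-dependent terms can be estimated from gridded PDFs is given below, dropping the constant $K_{f}$ since it does not depend on the estimate. The first term is approximated by the sample mean of $\int \hat{f}^2 dz$ and the second by the sample mean of each PDF evaluated at its galaxy's true redshift. The function `cde_loss_estimate` and the Gaussian toy PDFs are hypothetical; this is not the `CDELoss` class.

```python
import numpy as np

def cde_loss_estimate(z_grid, pdf_grid, z_true):
    """Empirical CDE loss (up to the constant K_f) for PDFs on a shared grid."""
    z_grid = np.asarray(z_grid, dtype=float)
    pdf_grid = np.asarray(pdf_grid, dtype=float)
    # First term: mean over galaxies of the integral of the squared PDF (trapezoidal rule).
    sq = pdf_grid ** 2
    integrals = np.sum(0.5 * (sq[:, 1:] + sq[:, :-1]) * np.diff(z_grid), axis=1)
    term1 = integrals.mean()
    # Second term: mean of each estimated PDF evaluated at its own true redshift.
    at_truth = np.array([np.interp(zt, z_grid, p) for zt, p in zip(z_true, pdf_grid)])
    term2 = at_truth.mean()
    return term1 - 2.0 * term2

# Toy input (hypothetical): normalised Gaussian PDFs whose centres equal the true redshifts.
sigma = 0.1
toy_grid = np.linspace(0.0, 2.0, 2001)
centres = np.array([0.8, 1.0, 1.2])
toy_pdfs = np.exp(-0.5 * ((toy_grid - centres[:, None]) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
toy_loss = cde_loss_estimate(toy_grid, toy_pdfs, z_true=centres)

# Analytic value for this toy case: 1/(2 sigma sqrt(pi)) - 2/(sigma sqrt(2 pi)).
analytic = 1.0 / (2.0 * sigma * np.sqrt(np.pi)) - 2.0 / (sigma * np.sqrt(2.0 * np.pi))
print(toy_loss, analytic)
```

Because the constant is dropped, the estimate can be negative. Lower values indicate a better match, so the number is mainly useful for comparing estimators on the same sample.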

In [None]:
cdelossobj = CDELoss(fzdata, zgrid, ztrue)

In [None]:
cde_stat_and_pval = cdelossobj.evaluate()
cde_stat_and_pval

In [None]:
print(f"CDE loss of this sample: {cde_stat_and_pval.statistic:.2f}") 

<a class="anchor" id="summary"></a>
# Summary

In [None]:
summary = Summary(pdfs, zgrid, ztrue)
summary.markdown_metrics_table(pitobj=pitobj) # pitobj as optional input to speed-up metrics evaluation

In [None]:
summary.markdown_metrics_table(pitobj=pitobj, show_dc1="FlexZBoost")