# RAIL Evaluation - Checking results against DC1 paper

The purpose of this notebook is to validate the reimplementation of the DC1 metrics, previously available in the GitHub repository [PZDC1paper](https://github.com/LSSTDESC/PZDC1paper) and now refactored to be part of the RAIL Evaluation module. The metrics are implemented in object-oriented Python 3, following a superclass/subclass structure and inheriting features from _qp_.

In [None]:
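import os
import numpy as np
import matplotlib.pyplot as plt
import utils  # module handle for the explicit utils.* calls further down
from metrics import *
from utils import *
%matplotlib inline
%reload_ext autoreload
%autoreload 2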

### DC1 results
The DC1 results are stored in the class `DC1` (defined in the ancillary file `utils.py`), which exists only to provide the reference values.

In [None]:
dc1 = DC1()

In [None]:
dc1.table

To access an individual value, index the dictionary `dc1.results` first by metric name and then by code name.

In [None]:
dc1.results['PIT out rate']['FlexZBoost']

The available codes and metrics are listed by the properties `dc1.codes` and `dc1.metrics`.

In [None]:
print(dc1.codes)

In [None]:
print(dc1.metrics)

*** 

<a class="anchor" id="data"></a>
## The data   

In this notebook we use the same input dataset as the DC1 PZ paper ([Schmidt et al. 2020](https://arxiv.org/pdf/2001.03621.pdf)), copied from cori (`/global/cfs/cdirs/lsst/groups/PZ/PhotoZDC1/photoz_results/TESTDC1FLEXZ`).


In [None]:
my_path = "/Users/julia/TESTDC1FLEXZ"

pdfs_file =  os.path.join(my_path, "Mar5Flexzgold_pz.out")
ztrue_file =  os.path.join(my_path, "Mar5Flexzgold_idszmag.out")
oldpitfile = os.path.join(my_path,"TESTPITVALS.out")
pdfs, zgrid, ztrue, photoz_mode = read_pz_output(pdfs_file, ztrue_file)

## Metrics

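The metrics are calculated from the probability integral transform (PIT) values, i.e., the CDF of each photo-z PDF evaluated at the object's true redshift, computed with the CDF method of `qp.Ensemble`. The PIT values can be passed as an optional input to speed up the metrics calculation; if no PIT array is provided, it is calculated on the fly.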
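
As a rough illustration of what a PIT is, the sketch below integrates gridded PDFs with the trapezoidal rule and interpolates each CDF at the true redshift. It is not the _qp_ implementation: the helper `pit_from_grid` and the toy inputs are hypothetical, and the interpolation scheme is an assumption.

```python
import numpy as np

def pit_from_grid(pdfs, zgrid, ztrue):
    """Evaluate each object's CDF at its true redshift (hypothetical helper)."""
    pdfs = np.asarray(pdfs, dtype=float)
    zgrid = np.asarray(zgrid, dtype=float)
    ztrue = np.asarray(ztrue, dtype=float)
    # Cumulative trapezoidal integral of every PDF along the redshift grid.
    steps = 0.5 * (pdfs[:, 1:] + pdfs[:, :-1]) * np.diff(zgrid)
    cdfs = np.concatenate([np.zeros((len(pdfs), 1)), np.cumsum(steps, axis=1)], axis=1)
    # Normalize so that each CDF ends at exactly 1.
    cdfs /= cdfs[:, -1:]
    # Linearly interpolate each CDF at the corresponding true redshift.
    return np.array([np.interp(z, zgrid, cdf) for z, cdf in zip(ztrue, cdfs)])

# Toy data (not the DC1 sample): flat PDFs on the interval [0, 2].
zgrid = np.linspace(0.0, 2.0, 201)
toy_pdfs = np.ones((3, zgrid.size))
toy_ztrue = np.array([0.5, 1.0, 1.5])
toy_pits = pit_from_grid(toy_pdfs, zgrid, toy_ztrue)
print(toy_pits)
```

For flat PDFs the CDF is linear in redshift, so the toy PITs come out close to 0.25, 0.5 and 0.75.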

In [None]:
%%time
pit = PIT(pdfs, zgrid, ztrue)
pit.evaluate()
pits = pit.metric

In [None]:
summary = Summary(pdfs, zgrid, ztrue)
summary.markdown_metrics_table(pits=pits, show_dc1="FlexZBoost")

#### PIT-QQ plot

In [None]:
pit_out_rate = PitOutRate(pdfs, zgrid, ztrue).evaluate(pits=pits)
pit.plot_pit_qq(title="DC1 paper data", code="FlexZBoost",
                pit_out_rate=pit_out_rate, savefig=False)

***
### "Debugging"

Following Sam's suggestion, I also computed the metrics using the PIT values stored in the partial results of the DC1 paper, instead of calculating them from scratch.

Reading DC1 PIT values (PITs computed in the past for the paper):

In [None]:
pits_dc1 = np.loadtxt(oldpitfile, skiprows=1, usecols=1)

In [None]:
plt.figure(figsize=[10,3])
plt.plot(pits_dc1, pits-pits_dc1, 'k,')
plt.plot([0,1], [0,0], 'r--', lw=3)
plt.xlim(0, 1)
plt.ylim(-0.02, 0.02)
plt.xlabel("PITs from DC1 paper")
plt.ylabel(r"$\Delta$ PIT")  # raw string avoids the invalid "\D" escape
plt.tight_layout()

The recomputed PIT values differ slightly from the DC1 ones.
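
A few summary numbers can complement the residual plot. The sketch below is only an illustration: the helper `pit_difference_summary`, the tolerance `tol` and the toy arrays are hypothetical, and in the notebook the inputs would be `pits` and `pits_dc1`.

```python
import numpy as np

def pit_difference_summary(pits_new, pits_ref, tol=1e-3):
    """Summarize element-wise differences between two PIT arrays (hypothetical helper)."""
    diff = np.asarray(pits_new, dtype=float) - np.asarray(pits_ref, dtype=float)
    return {
        "max_abs": float(np.max(np.abs(diff))),               # largest absolute residual
        "mean": float(np.mean(diff)),                         # systematic offset, if any
        "frac_above_tol": float(np.mean(np.abs(diff) > tol)), # share of objects beyond tol
    }

# Toy arrays (not the DC1 values), only to show the call.
toy_ref = np.array([0.10, 0.50, 0.90, 0.30])
toy_new = toy_ref + np.array([0.0, 0.002, -0.0005, 0.0])
toy_summary = pit_difference_summary(toy_new, toy_ref)
print(toy_summary)
```

The choice of `tol` is arbitrary here; a value tied to the grid resolution of the PDFs may be more meaningful.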

Recalculating the metrics:

In [None]:
summary = Summary(pdfs, zgrid, ztrue)
summary.markdown_metrics_table(pits=pits_dc1, show_dc1="FlexZBoost") 

In [None]:
pit_out_rate = PitOutRate(pdfs, zgrid, ztrue).evaluate(pits=pits_dc1)
pit.plot_pit_qq(title="DC1 data (original PITs)", code="FlexZBoost",
                pit_out_rate=pit_out_rate, savefig=False)


Using the original PIT values from the paper, all metrics match reasonably well, except for the Anderson-Darling statistic.

#### Anderson-Darling 

$$ \mathrm{AD}^2 \equiv N_{tot} \int_{-\infty}^{\infty} \frac{\big( \mathrm{CDF}[\hat{f}, z] - \mathrm{CDF}[\tilde{f}, z] \big)^{2}}{\mathrm{CDF}[\tilde{f}, z] \, \big( 1 - \mathrm{CDF}[\tilde{f}, z] \big)} \, \mathrm{dCDF}[\tilde{f}, z] $$


The class `AD` uses `scipy.stats.anderson_ksamp` to compute the Anderson-Darling statistic for the PIT values, comparing them with a uniform distribution between 0 and 1. The k-sample test is used because, up to the current version of SciPy (1.6.2), the 1-sample test `scipy.stats.anderson` does not support the uniform distribution as a reference.

See the [`scipy.stats.anderson_ksamp` documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson_ksamp.html).
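
The sketch below shows one way such a comparison could be set up. It is not the code of the `AD` class: the helper `ad_vs_uniform`, the evenly spaced reference sample and the toy PITs are assumptions made for illustration. Note that `scipy.stats.anderson_ksamp` returns a normalized k-sample statistic.

```python
import warnings
import numpy as np
from scipy import stats

def ad_vs_uniform(pits, pit_min=0.0, pit_max=1.0):
    """k-sample AD statistic of PITs against a uniform reference (hypothetical helper)."""
    pits = np.asarray(pits, dtype=float)
    # Keep only the PITs inside the requested interval.
    kept = pits[(pits >= pit_min) & (pits <= pit_max)]
    # Assumption: the uniform reference is an evenly spaced sample of the same size.
    reference = np.linspace(pit_min, pit_max, kept.size)
    with warnings.catch_warnings():
        # SciPy warns when the p-value falls outside its tabulated range.
        warnings.simplefilter("ignore")
        result = stats.anderson_ksamp([kept, reference])
    return result.statistic, kept.size

# Toy PITs (not the DC1 sample): a U-shaped distribution with excess mass at the edges.
rng = np.random.default_rng(0)
toy_pits = rng.beta(0.8, 0.8, size=2000)
for lo, hi in [(0.0, 1.0), (0.0001, 0.9999), (0.01, 0.99)]:
    stat, n_kept = ad_vs_uniform(toy_pits, lo, hi)
    print(f"[{lo}, {hi}]: AD = {stat:.2f} using {n_kept} objects")
```

Running the loop over several intervals gives a feel for how much the statistic depends on the PIT range that is kept.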

In [None]:
ad_dc1 = dc1.results['AD']['FlexZBoost']
ad_dc1

In [None]:
ad = AD(pdfs, zgrid, ztrue)
ad.evaluate(pits=pits_dc1)
ad.metric

By default, the AD statistic is computed within the interval $0.0 \leq \mathrm{PIT} \leq 1.0$.

Five objects have PIT values outside this interval, which is unexpected.

In [None]:
print(pits_dc1[(pits_dc1<0)|(pits_dc1>1)])

Extreme PIT values can be removed, as done in the paper.

In [None]:
ad.evaluate(pits=pits, ad_pit_min=0.0001, ad_pit_max=0.9999)
ad.metric

In [None]:
ad.evaluate(pits=pits, ad_pit_min=0.001, ad_pit_max=0.999)
ad.metric

In [None]:
ad.evaluate(pits=pits, ad_pit_min=0.01, ad_pit_max=0.99)
ad.metric

In [None]:
p_err = (abs(ad_dc1-ad.metric)/ad_dc1)*100. # percent error
print(f"Percent error: {p_err:.1f} %")

*** 
#### Point estimate metrics

These metrics are deprecated and might not be used in future analyses. They are included in this notebook only to reproduce the results from the paper in full.

In [None]:
utils.old_metrics_table(photoz_mode, ztrue, name="this test", show_dc1=True)

In [None]:
utils.plot_old_valid(photoz_mode, ztrue, code="FlexZBoost")

*** 
## Conclusion

The metrics calculated using the new implementation are reasonably close to the expected values. Minor differences were observed due to the difference in the calculation of PIT values. In both cases, here and in the paper, the PITs were calculated using _qp_ functions. The small differences are attributed to minor changes in _qp_ since the paper was produced.

When using the original values of PIT, i.e., those calculated for the paper using the _qp_ version available at the time, all metrics were reproduced, except for the AD test. This particular metric is quite sensitive to the range of PITs considered in the calculation. Using the same PIT interval as in the paper (0.01, 0.99), the result obtained with the new implementation diverges from the paper result by 19.3%.