# RAIL Evaluation - Check results against DC1 paper

Contact: _Julia Gschwend_ ([julia@linea.gov.br](mailto:julia@linea.gov.br)), _Sam Schmidt, Alex Malz, Eric Charles_

The purpose of this notebook is to validate the new implementation of the DC1 metrics, previously available in the GitHub repository [PZDC1paper](https://github.com/LSSTDESC/PZDC1paper) and now refactored as part of the RAIL Evaluation module. The metrics are implemented in object-oriented Python 3 and inherit features from _qp_. In this notebook we use the same input dataset as the DC1 PZ paper ([Schmidt et al. 2020](https://arxiv.org/pdf/2001.03621.pdf)), copied from cori (`/global/cfs/cdirs/lsst/groups/PZ/PhotoZDC1/photoz_results/TESTDC1FLEXZ`).

In [None]:
from IPython.display import Markdown
from sample import Sample
from metrics import *
import utils
import os
import matplotlib.pyplot as plt
%matplotlib inline
%reload_ext autoreload
%autoreload 2

<a class="anchor" id="sample"></a>

## Sample  



In [None]:
my_path = "/Users/julia/TESTDC1FLEXZ"

pdfs_file =  os.path.join(my_path, "Mar5Flexzgold_pz.out")
ztrue_file =  os.path.join(my_path, "Mar5Flexzgold_idszmag.out")

#pdfs_file =  os.path.join(my_path, "1pct_Mar5Flexzgold_pz.out")
#ztrue_file =  os.path.join(my_path, "1pct_Mar5Flexzgold_idszmag.out")

In [None]:
%%time
sample = Sample(pdfs_file, ztrue_file, code="FlexZBoost", name="DC1 paper data")
sample

In [None]:
print(sample)

## Metrics

In [None]:
%%time
metrics = Metrics(sample)

The metrics below are based on the PIT and on the CDF of the PIT distribution, both computed with `qp.Ensemble` methods. The PIT array is obtained by evaluating the CDF of each photo-z PDF (stored in a `qp.Ensemble`) at the true $z$ of the corresponding galaxy. The PIT distribution is implemented as the normalized histogram of the PIT values. The uniform distribution U(0,1) is implemented as a mock normalized distribution with the same number of bins as the PIT distribution, where all values are equal to $1/N_{quant}$.     
A new `qp.Ensemble` object (an ensemble with only one PDF) is then instantiated for each of the two distributions, PIT and U(0,1), in order to use its CDF functionality.
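As a rough illustration of the PIT definition above, the sketch below uses plain NumPy instead of qp. Each photo-z PDF is a Gaussian tabulated on a common grid (an assumption made only for this example), its CDF is built by cumulative trapezoidal integration, and the PIT is that CDF interpolated at the true redshift. The function name `pit_from_gridded_pdfs` and the toy inputs are hypothetical and are not part of RAIL or qp.

```python
import numpy as np

def pit_from_gridded_pdfs(zgrid, pdfs, ztrue):
    """PIT_i = CDF_i(ztrue_i) for PDFs tabulated on a common redshift grid."""
    dz = np.diff(zgrid)
    # Cumulative trapezoidal integral of each PDF, starting at 0 on the first grid point.
    steps = 0.5 * (pdfs[:, 1:] + pdfs[:, :-1]) * dz
    cdfs = np.concatenate([np.zeros((len(pdfs), 1)), np.cumsum(steps, axis=1)], axis=1)
    cdfs /= cdfs[:, -1:]  # renormalise so that every CDF ends at exactly 1
    return np.array([np.interp(zt, zgrid, cdf) for zt, cdf in zip(ztrue, cdfs)])

# Toy data (not the DC1 sample): Gaussian PDFs centred on noisy redshift estimates.
rng = np.random.default_rng(42)
n_gal = 500
sigma = 0.05
ztrue = rng.uniform(0.5, 1.5, n_gal)
zphot = ztrue + rng.normal(0.0, sigma, n_gal)  # scatter matches the PDF width
zgrid = np.linspace(0.0, 2.0, 2001)
pdfs = np.exp(-0.5 * ((zgrid[None, :] - zphot[:, None]) / sigma) ** 2)

pit = pit_from_gridded_pdfs(zgrid, pdfs, ztrue)
print(pit.min(), pit.max(), pit.mean())
```

Because the scatter of the toy estimates equals the PDF width, the resulting PIT values should be close to uniform on [0, 1].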

```python 
class Metrics:
    """
       ***   Metrics class   ***
    Receives a Sample object as input.
    Computes PIT and QQ vectors on the initialization.
    It's the basis for the other metrics, such as KS, AD, and CvM.
    """
    def __init__(self, sample, n_quant=100, pit_min=0.0001, pit_max=0.9999, debug=False):
        """Class constructor
        Parameters
        ----------
        sample: `Sample`
            sample object defined in ./sample.py
        n_quant: `int`, (optional)
            number of quantiles for the QQ plot
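        pit_min: `float`, (optional)
            lower limit to define PIT outliers
            default is 0.0001
        pit_max: `float`, (optional)
            upper limit to define PIT outliers
            default is 0.9999
        debug: `bool`, (optional)
            if True, read the PIT values from the DC1 file TESTPITVALS.out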
        """
        self._sample = sample
        self._n_quant = n_quant
        self._pit_min = pit_min
        self._pit_max = pit_max
        self._debug = debug
        n = len(self._sample)
        if debug:
            #n = 1000 # subset for quick tests
            print("DEBUG MODE")
            #ids = np.random.choice(n, 10000)
            self._pit = np.loadtxt(os.path.join(sample.path,"TESTPITVALS.out"), unpack=True, usecols=[1])#[ids]
            self.new_pit = np.nan_to_num([self._sample._pdfs[i].cdf(self._sample._ztrue[i])[0][0] for i in range(n)])# ids])
        else:
            n = len(self._sample)
            self._pit = np.nan_to_num([self._sample._pdfs[i].cdf(self._sample._ztrue[i])[0][0] for i in range(n)])
        # Quantiles
        Qtheory = np.linspace(0., 1., self.n_quant)
        Qdata = np.quantile(self._pit, Qtheory)
        self._qq_vectors = (Qtheory, Qdata)
        # Normalized distribution of PIT values (PIT PDF)
        self._xvals = Qtheory
        self._pit_pdf, self._pit_bins_edges = np.histogram(self._pit, bins=n_quant, density=True)
        #self._uniform_pdf = stats.uniform(self._xvals, scale=n_quant)
        self._uniform_pdf = np.full(n_quant, 1.0 / float(n_quant))
        # Define qp Ensemble to use CDF functionality (an ensemble with only 1 PDF)
        self._pit_ensemble = qp.Ensemble(qp.hist, data=dict(bins=self._pit_bins_edges,
                                                            pdfs=np.array([self._pit_pdf])))
        self._uniform_ensemble = qp.Ensemble(qp.interp, data=dict(xvals=self._xvals,
                                                                  yvals=np.array([self._uniform_pdf])))
        self._pit_cdf = self._pit_ensemble.cdf(self._xvals)[0]
        self._uniform_cdf = self._uniform_ensemble.cdf(self._xvals)[0]
        
```
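One detail of the constructor above that may be worth checking is the normalisation of the two distributions being compared. `np.histogram(..., density=True)` returns a density that integrates to 1 over the bin range, so on [0, 1] a uniform sample gives bin heights close to 1, whereas `np.full(n_quant, 1.0 / n_quant)` holds per-bin probabilities. The sketch below uses synthetic uniform PITs (not the DC1 data) to show the two conventions side by side. Whether qp renormalises the `yvals` it receives is not checked here.

```python
import numpy as np

n_quant = 100
rng = np.random.default_rng(0)
pit = rng.uniform(0.0, 1.0, 100_000)  # synthetic, perfectly calibrated PITs

pit_pdf, edges = np.histogram(pit, bins=n_quant, range=(0.0, 1.0), density=True)
bin_width = np.diff(edges)

# density=True: heights integrate to 1, so a uniform sample has heights near 1.
print("mean histogram height:", pit_pdf.mean())
print("integral of histogram:", np.sum(pit_pdf * bin_width))

# Two ways of writing U(0,1) on the same bins.
uniform_density = np.full(n_quant, 1.0)             # same units as pit_pdf
uniform_bin_prob = np.full(n_quant, 1.0 / n_quant)  # probability per bin

# CDFs at the bin edges, built from quantities in matching units.
cdf_pit = np.concatenate([[0.0], np.cumsum(pit_pdf * bin_width)])
cdf_uniform = np.concatenate([[0.0], np.cumsum(uniform_density * bin_width)])
print("max |CDF difference|:", np.max(np.abs(cdf_pit - cdf_uniform)))
```

If the two arrays passed to the ensembles are in different units and no renormalisation happens downstream, the resulting CDFs would not be directly comparable.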



#### PIT-QQ plot

In [None]:
metrics.plot_pit_qq() #savefig=True)

### DC1 results
The DC1 results are stored in the `Metrics` object as a table and as a dictionary, accessible through `metrics.dc1`. They come from an independent class `DC1` (in the ancillary file `utils.py`), which exists only to provide the reference values.

In [None]:
metrics.dc1.table

In [None]:
print(metrics.dc1.codes)

In [None]:
print(metrics.dc1.metrics)

In [None]:
metrics.dc1.results['PIT out rate']['FlexZBoost']

## Results

Summary table with all metrics, including the DC1 paper results for comparison:

In [None]:
metrics.markdown_metrics_table(show_dc1=True)

On the first attempt, the results do not match the DC1 values, except for the PIT outlier rate. The CDE loss is close to the reference value.

In [None]:
delta = abs(metrics.cde_loss - metrics.dc1.results['CDE loss']['FlexZBoost'])
perc = abs(delta/metrics.dc1.results['CDE loss']['FlexZBoost'])*100.
print(f"CDE loss differs from DC1 value by {delta:.3f} ({perc:.1f}%).")

Such a small difference could be explained by differences in the binning used for the numerical integration.
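To get a feel for how much the integration grid can matter, the sketch below evaluates the usual CDE loss estimator, $\frac{1}{N}\sum_i \int \hat{f}_i(z)^2 dz - \frac{2}{N}\sum_i \hat{f}_i(z_{true,i})$, for toy Gaussian PDFs tabulated on grids of different resolution. The data, the grid sizes and the helper names `cde_loss_on_grid` and `gaussian_pdfs` are assumptions for illustration only; this is not the RAIL implementation.

```python
import numpy as np

def cde_loss_on_grid(zgrid, pdfs, ztrue):
    """CDE loss estimate for PDFs tabulated on `zgrid` (one row per galaxy)."""
    dz = np.diff(zgrid)
    sq = pdfs ** 2
    # Trapezoidal integral of f^2 for each galaxy, averaged over galaxies.
    term1 = np.mean(np.sum(0.5 * (sq[:, 1:] + sq[:, :-1]) * dz, axis=1))
    # PDF value at the true redshift, by linear interpolation on the grid.
    at_truth = np.array([np.interp(zt, zgrid, p) for zt, p in zip(ztrue, pdfs)])
    return term1 - 2.0 * np.mean(at_truth)

def gaussian_pdfs(zgrid, centres, sigma):
    """Normalised Gaussian PDFs evaluated on `zgrid`, one row per centre."""
    norm = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return norm * np.exp(-0.5 * ((zgrid[None, :] - centres[:, None]) / sigma) ** 2)

rng = np.random.default_rng(1)
sigma = 0.03
ztrue = rng.uniform(0.5, 1.5, 2000)
centres = ztrue + rng.normal(0.0, sigma, ztrue.size)

losses = {}
for n_grid in (51, 201, 2001):
    zgrid = np.linspace(0.0, 2.0, n_grid)
    losses[n_grid] = cde_loss_on_grid(zgrid, gaussian_pdfs(zgrid, centres, sigma), ztrue)
    print(n_grid, losses[n_grid])

# Expected value for this toy model: -1 / (2 * sigma * sqrt(pi))
analytic = -1.0 / (2.0 * sigma * np.sqrt(np.pi))
print("analytic:", analytic)
```

Printing the three values shows how sensitive the estimate is to the grid spacing in this toy setup. The same kind of test could be repeated on the real PDFs with the grids used by each code.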


However, the KS, CvM, and AD tests still need to be fixed.
Let's investigate these numbers by comparing them with what we would get from the scipy built-in statistical tests (implemented as alternative methods for each metric). 

### Kolmogorov-Smirnov  

$$
\mathrm{KS} \equiv \max_{PIT} \Big( \left| \ \mathrm{CDF} \small[ \hat{f}, z \small] - \mathrm{CDF} \small[ \tilde{f}, z \small] \  \right| \Big)
$$

```python
    def __init__(self, metrics, scipy=False):
        self._metrics = metrics
        if scipy:
            self._stat, self._pvalue = stats.kstest(metrics._pit, "uniform")
        else:
            self._stat, self._pvalue = np.max(np.abs(metrics._pit_cdf - metrics._uniform_cdf)), None # p-value TBD
        # update Metrics object
        metrics._ks_stat = self._stat
```
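For reference, the KS statistic against U(0,1) can also be computed exactly from the sorted PIT values, without any binning: $D = \max(D^+, D^-)$, with $D^+ = \max_i (i/n - x_{(i)})$ and $D^- = \max_i (x_{(i)} - (i-1)/n)$. The sketch below is a minimal NumPy version run on synthetic data; the function name `ks_uniform` is hypothetical. On the same input it should agree with `scipy.stats.kstest(pit, "uniform").statistic`, while a statistic computed from CDFs sampled on `n_quant` grid points can only approximate it.

```python
import numpy as np

def ks_uniform(pit):
    """Exact two-sided KS distance between the empirical CDF of `pit` and U(0,1)."""
    x = np.sort(np.asarray(pit, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    d_plus = np.max(i / n - x)          # empirical CDF above the uniform CDF
    d_minus = np.max(x - (i - 1) / n)   # empirical CDF below the uniform CDF
    return max(d_plus, d_minus)

print(ks_uniform([0.1, 0.4, 0.7]))

rng = np.random.default_rng(3)
ks_calibrated = ks_uniform(rng.uniform(size=10_000))     # synthetic calibrated PITs
ks_too_broad = ks_uniform(rng.beta(2.0, 2.0, 10_000))    # PITs piled up around 0.5
print(ks_calibrated, ks_too_broad)
```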


In [None]:
ks_dc1 = metrics.dc1.results['KS']['FlexZBoost']
ks_dc1

In [None]:
ks = KS(metrics)
ks.stat

In [None]:
ks_sci = KS(metrics, scipy=True)
ks_sci.stat

For the Kolmogorov-Smirnov test, the values obtained with and without the `scipy.stats.kstest` function are compatible with each other, and both disagree significantly with the DC1 result.

In [None]:
delta = abs(ks_sci.stat - metrics.dc1.results['KS']['FlexZBoost'])
perc = abs(delta/metrics.dc1.results['KS']['FlexZBoost'])*100.
print(f"KS differs from DC1 value by {delta:.3f} ({perc:.1f}%).")

Visual interpretation of the KS test:

In [None]:
ks.plot()

In [None]:
ks_sci.plot()

<font color=red> SOLUTION STILL PENDING!!!</font>

### Cramer-von Mises

Let's repeat the same exercise with the CvM test.

$$ \mathrm{CvM}^2 \equiv \int_{-\infty}^{\infty} \Big( \mathrm{CDF} \small[ \hat{f}, z \small] \ - \ \mathrm{CDF} \small[ \tilde{f}, z \small] \Big)^{2} \mathrm{dCDF}(\tilde{f}, z) $$ 

```python

    def __init__(self, metrics, scipy=False):
        if scipy:
            cvm_result = stats.cramervonmises(metrics._pit_dist, "uniform")
            self._stat, self._pvalue = cvm_result.statistic, cvm_result.pvalue
        else:
            self._stat, self._pvalue = np.sqrt(np.trapz((metrics._pit_cdf - metrics._uniform_cdf)**2, metrics._uniform_cdf)), None # p-value TBD
        # update Metrics object
        metrics._cvm_stat = self._stat

```

In [None]:
cvm_dc1 = metrics.dc1.results['CvM']['FlexZBoost']
cvm_dc1

In [None]:
cvm = CvM(metrics)
cvm.stat

In [None]:
cvm_sci = CvM(metrics, scipy=True)
cvm_sci.stat

This time, all numbers disagree. I have checked the code for the CvM test in the `skgof` library, and it does not resemble the equation that defines CvM in the paper.

From https://github.com/wrwrwr/scikit-gof/blob/master/skgof/ecdfgof.py: 

```python

def cvm_stat(data):
    """
    Calculates the Cramer-von Mises statistic for sorted values from U(0, 1).
    """
    samples2 = 2 * len(data)
    minuends = arange(1, samples2, 2) / samples2
    return 1 / (6 * samples2) + ((minuends - data) ** 2).sum()

(...)

cvm_test = partial(simple_test, stat=cvm_stat, pdist=cvm_unif)
    
```
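The `cvm_stat` function above is the standard computational form of the Cramer-von Mises statistic for a U(0,1) reference, $T = \frac{1}{12n} + \sum_i \big(\frac{2i-1}{2n} - x_{(i)}\big)^2$, which equals $n \int_0^1 (F_n(u) - u)^2 du$, where $F_n$ is the empirical CDF of the sorted PIT values $x_{(i)}$. It carries a factor $n$ and no square root, unlike the `np.sqrt(np.trapz(...))` expression in the `CvM` class, so the two numbers are not expected to coincide even on identical input. The sketch below checks the identity numerically on synthetic PITs; the function names are hypothetical.

```python
import numpy as np

def cvm_sum_form(pit):
    """Computational form: 1/(12 n) + sum(((2i - 1)/(2n) - x_(i))^2)."""
    x = np.sort(np.asarray(pit, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    return 1.0 / (12.0 * n) + np.sum(((2 * i - 1) / (2.0 * n) - x) ** 2)

def cvm_integral_form(pit):
    """n * integral over [0, 1] of (F_n(u) - u)^2 du, integrated piecewise exactly."""
    x = np.sort(np.asarray(pit, dtype=float))
    n = x.size
    knots = np.concatenate([[0.0], x, [1.0]])  # F_n is constant between knots
    levels = np.arange(0, n + 1) / n           # F_n = k/n on [knots[k], knots[k+1]]
    lo, hi = knots[:-1], knots[1:]
    # The antiderivative of (c - u)^2 is -(c - u)^3 / 3.
    pieces = ((levels - lo) ** 3 - (levels - hi) ** 3) / 3.0
    return n * np.sum(pieces)

rng = np.random.default_rng(7)
pit = rng.uniform(size=1000)  # synthetic PITs, not the DC1 sample
t_sum = cvm_sum_form(pit)
t_int = cvm_integral_form(pit)
print(t_sum, t_int)
print("sqrt of the integral without the factor n:", np.sqrt(t_int / pit.size))
```

The last line prints the quantity that is closest in spirit to the definition quoted from the paper, which helps to see how far apart the two conventions are for the same sample.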


<font color=red> SOLUTION STILL PENDING!!!</font>

### Anderson-Darling 

The last metric is the AD test, which is the only metric that allows the removal of extreme outliers before the calculation:

$$ \mathrm{AD}^2 \equiv N_{tot} \int_{-\infty}^{\infty} \frac{\big( \mathrm{CDF} \small[ \hat{f}, z \small] \ - \ \mathrm{CDF} \small[ \tilde{f}, z \small] \big)^{2}}{\mathrm{CDF} \small[ \tilde{f}, z \small] \big( 1 \ - \ \mathrm{CDF} \small[ \tilde{f}, z \small] \big)}\mathrm{dCDF}(\tilde{f}, z) $$ 

```python
    def __init__(self, metrics, ad_pit_min=0.0, ad_pit_max=1.0):

        mask_pit = (metrics._pit >= ad_pit_min) & (metrics._pit  <= ad_pit_max)
        if (ad_pit_min != 0.0) or (ad_pit_max != 1.0):
            n_out = len(metrics._pit) - len(metrics._pit[mask_pit])
            perc_out = (float(n_out)/float(len(metrics._pit)))*100.
            print(f"{n_out} outliers (PIT<{ad_pit_min} or PIT>{ad_pit_max}) removed from the calculation ({perc_out:.1f}%)")

        ad_xvals = np.linspace(ad_pit_min, ad_pit_max, metrics.n_quant)
        ad_yscale_uniform = (ad_pit_max-ad_pit_min)/float(metrics._n_quant)
        ad_pit_dist, ad_pit_bins_edges = np.histogram(metrics.pit[mask_pit], bins=metrics.n_quant, density=True)
        ad_uniform_dist = np.full(metrics.n_quant, ad_yscale_uniform)
        # Redo CDFs to account for outliers mask
        ad_pit_ensemble = qp.Ensemble(qp.hist, data=dict(bins=ad_pit_bins_edges, pdfs=np.array([ad_pit_dist])))
        ad_pit_cdf = ad_pit_ensemble.cdf(ad_xvals)[0]
        ad_uniform_ensemble = qp.Ensemble(qp.hist,
                                          data=dict(bins=ad_pit_bins_edges, pdfs=np.array([ad_uniform_dist])))
        ad_uniform_cdf = ad_uniform_ensemble.cdf(ad_xvals)[0]
        numerator = ((ad_pit_cdf - ad_uniform_cdf)**2)
        denominator = (ad_uniform_cdf*(1.-ad_uniform_cdf))
        with np.errstate(divide='ignore', invalid='ignore'):
            self._stat = np.sqrt(float(len(metrics._sample)) * np.trapz(np.nan_to_num(numerator/denominator), ad_uniform_cdf))
        # update Metrics object
        metrics._ad_stat = self._stat
```

For the Anderson-Darling test, the `scipy.stats.anderson` function does not support a uniform reference distribution, so it cannot be used as a cross-check here.
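A direct alternative is the textbook computational form of the Anderson-Darling statistic for a fully specified uniform reference, $A^2 = -n - \frac{1}{n} \sum_i (2i-1) \big[ \ln x_{(i)} + \ln (1 - x_{(n+1-i)}) \big]$, applied to the sorted PIT values. The sketch below is an illustration under stated assumptions, not the DC1 or RAIL code: the function name `ad_uniform`, the clipping threshold `eps` and the synthetic data are hypothetical, and the `pit_min`/`pit_max` cut rescales the remaining PITs to [0, 1] after the outliers are dropped. This form returns $A^2$, whereas the `AD` class above takes the square root.

```python
import numpy as np

def ad_uniform(pit, pit_min=0.0, pit_max=1.0, eps=1e-12):
    """Anderson-Darling A^2 of `pit` against a uniform distribution on [pit_min, pit_max]."""
    x = np.asarray(pit, dtype=float)
    x = x[(x >= pit_min) & (x <= pit_max)]             # drop PIT outliers
    u = np.sort((x - pit_min) / (pit_max - pit_min))   # rescale to [0, 1]
    u = np.clip(u, eps, 1.0 - eps)                     # avoid log(0) at exact 0 or 1
    n = u.size
    i = np.arange(1, n + 1)
    # u[::-1][i - 1] is the (n + 1 - i)-th smallest value.
    s = np.sum((2 * i - 1) * (np.log(u) + np.log1p(-u[::-1])))
    return -n - s / n

rng = np.random.default_rng(11)
ad_calibrated = ad_uniform(rng.uniform(size=10_000))
ad_too_broad = ad_uniform(rng.beta(2.0, 2.0, 10_000))
ad_trimmed = ad_uniform(rng.uniform(size=10_000), pit_min=0.01, pit_max=0.99)
print(ad_calibrated, ad_too_broad, ad_trimmed)
```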


In [None]:
ad_dc1 = metrics.dc1.results['AD']['FlexZBoost']
ad_dc1

In [None]:
ad = AD(metrics).stat
ad

Let's remove the catastrophic outliers (as done in the paper) to see the impact.

In [None]:
ad_clean = AD(metrics, ad_pit_min=0.01, ad_pit_max=0.99).stat
ad_clean

Once more, the results disagree.

<font color=red> SOLUTION STILL PENDING!!!</font>

## Debugging

Following Sam's suggestion, I also computed the metrics using the PIT values stored in the partial results of the DC1 paper, instead of recomputing them from the PDFs. The "debug" mode of the `Metrics` class uses DC1's PIT values. This mode will probably be removed from the code once all bugs are solved.

In [None]:
%%time
metrics_debug = Metrics(sample, debug=True)

In the comment section of RAIL's [pull request #54](https://github.com/LSSTDESC/RAIL/pull/54), Sam pointed out a small disagreement between the PIT values of the DC1 sample computed now (using the current `qp` version) and those computed when the paper was written. There is a trend for the new PIT values to be slightly larger than the old ones for PIT < 0.5 and slightly smaller for PIT > 0.5.

In [None]:
plt.figure(figsize=[10,3])
plt.plot(metrics_debug.pit, metrics.pit - metrics_debug.pit, 'k,')
plt.plot([0,1], [0,0], 'r--', lw=3)
plt.xlim(0, 1)
plt.ylim(-0.015, 0.015)
plt.xlabel("PITs from DC1 paper")
plt.ylabel(r"$\Delta$ PIT")
plt.tight_layout()

#### Results using DC1's PIT values

In [None]:
metrics_debug.markdown_metrics_table(show_dc1=True)

Let's see the `scipy=True` version of the metrics:

In [None]:
ks_debug_sci = KS(metrics_debug, scipy=True)
ks_debug_sci.stat

In [None]:
ks_dc1

In [None]:
cvm_debug_sci = CvM(metrics_debug, scipy=True)
cvm_debug_sci.stat

In [None]:
cvm_dc1

<font color=red> SOLUTION STILL PENDING!!!</font>

### Point estimates metrics

In [None]:
old_metrics_table = sample.plot_old_valid()

In [None]:
utils.old_metrics_table(sample, show_dc1=True)

At least the point metrics agree, so the PDFs are being read correctly.

## Conclusion

<font color=red> I still need help to understand the disagreement in the results. </font>