# RAIL/estimation Tutorial Notebook

author: Sam Schmidt<br>
last run successfully: April 29, 2021<br>

This is a notebook demonstrating some of the features of the LSSTDESC `RAIL` package, namely the features of `estimation`.  `RAIL/estimation` is the interface for running production level photo-z codes within DESC.  There is a minimal superclass that sets up some file paths and variable names; each specific photo-z code resides in a subclass with code-specific setup variables.<br>

RAIL is available at:<br>
https://github.com/LSSTDESC/RAIL<br>
and must be installed and included in your python path to work.  The LSSTDESC `qp` package that handles PDF files is also required; it is available at:<br>
https://github.com/LSSTDESC/qp<br>

For convenience when running on cori at NERSC, we have installed RAIL, qp, and all dependencies; the paths are included in cell #2 below.  So, if you are running at NERSC and using the `desc-python` kernel, you should be able to run the notebook without any custom installations.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import h5py
%matplotlib inline 

In [None]:
import sys
sys.path.insert(0,"/global/cfs/cdirs/lsst/groups/PZ/users/sschmidt/Packages/RAIL")
sys.path.insert(0,"/global/cfs/cdirs/lsst/groups/PZ/users/sschmidt/Packages/qp")
sys.path.insert(0,"/global/cfs/cdirs/lsst/groups/PZ/users/sschmidt/Packages/tables_io")
sys.path.insert(0,"/global/cfs/cdirs/lsst/groups/PZ/users/sschmidt/Packages/lib/python3.8/site-packages")

In [None]:
import rail
from tables_io import io
import qp

On importing RAIL you should see a list of the available photo-z algorithms, as printed out above.  These are the names of the specific subclasses that invoke a particular method, and they are stored in the `rail.estimation.algos` subdirectory of RAIL.<br>

`randomPZ` is a very simple class that does not actually predict a meaningful photo-z; instead it produces a randomly drawn Gaussian for each galaxy.<br>
`trainZ` is our "pathological" estimator: it makes a PDF from a histogram of the training data and assigns that PDF to every galaxy.<br>
`simpleNN` uses `sklearn`'s neural network to predict a point redshift from the training data, then assigns a sigma width based on the redshift, as another toy model example.<br>
`FZBoost` is a fully functional photo-z algorithm, implementing the FlexZBoost conditional density estimate method that was used in the PhotoZDC1 paper.<br>

## The `base_config` parameter file:

RAIL/estimation is set up so that parameters for general setup (location of data and some settings) are stored in a local yaml file, while settings for a specific code can either be stored in a yaml file or a dictionary.<br>
We will use the yaml file stored in the current directory called `example_estimation_base.yaml` which contains the following entries:<br>
```
base_config:
  trainfile: ./Packages/RAIL/tests/data/test_dc2_training_9816.hdf5
  testfile: ./Packages/RAIL/tests/data/test_dc2_validation_9816.hdf5
  hdf5_groupname: photometry
  chunk_size: 2500
  configpath: ./configs
  outpath: ./results
  output_format: old
```

This yaml file will be opened when a RAIL.estimation.Estimator class is invoked.  The quantities in the yaml file are:<br><br>
-trainfile (str): path to the training data for this run<br>
-testfile (str): path to the testing data for this run<br>
-hdf5_groupname (str): the top-level group name in the hdf5 file under which the data is stored, e.g. `photometry`<br>
-chunk_size (int): `tables_io` has an iterator method that can break the data into chunks of size `chunk_size` rather than run all at once<br>
-configpath (str): location of config files<br>
-outpath (str): directory in which to write output files<br>
-output_format (str): if `output_format` is set to "qp" the method will return PDF data in qp format; if set to any other value it will return a dictionary.  We will demonstrate both data types in this notebook.
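
As a quick sanity check on the entries listed above, here is a minimal sketch that tests a `base_config` dictionary for the expected keys and types.  The helper `check_base_config` is hypothetical and written only for this illustration; it is not part of RAIL, and RAIL may validate its configuration differently.

```python
# Hypothetical helper for illustration only; not part of RAIL.
EXPECTED_TYPES = {
    "trainfile": str,
    "testfile": str,
    "hdf5_groupname": str,
    "chunk_size": int,
    "configpath": str,
    "outpath": str,
    "output_format": str,
}


def check_base_config(base_config):
    """Return a list of problems found in a base_config dictionary."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in base_config:
            problems.append(f"missing key: {key}")
        elif not isinstance(base_config[key], expected):
            problems.append(f"{key} should be of type {expected.__name__}")
    return problems


# The same entries as example_estimation_base.yaml, written as a Python dict
base_config = {
    "trainfile": "./Packages/RAIL/tests/data/test_dc2_training_9816.hdf5",
    "testfile": "./Packages/RAIL/tests/data/test_dc2_validation_9816.hdf5",
    "hdf5_groupname": "photometry",
    "chunk_size": 2500,
    "configpath": "./configs",
    "outpath": "./results",
    "output_format": "old",
}

problems = check_base_config(base_config)
# Only the exact string "qp" selects qp output; anything else gives a dictionary
returns_qp = base_config["output_format"] == "qp"
print(problems, returns_qp)
```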

## The code-specific parameters
Code-specific parameters are stored in a dictionary.  This dictionary should itself contain a dictionary named `run_params`.  If no second argument is supplied, the code will print a warning message and use a default set of values.<br>
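
The fallback to default values can be pictured with a minimal sketch.  The function `resolve_run_params` and the values in `DEFAULT_RUN_PARAMS` are hypothetical, chosen only to illustrate the idea of warning and filling in defaults; the actual defaults and logic live in each RAIL subclass.

```python
import warnings

# Hypothetical default values, chosen only for this illustration
DEFAULT_RUN_PARAMS = {"zmin": 0.0, "zmax": 3.0, "nzbins": 301, "width": 0.05}


def resolve_run_params(config_dict=None):
    """Merge user-supplied run_params over a set of defaults."""
    if config_dict is None:
        # No dictionary supplied: warn and fall back to the defaults
        warnings.warn("no config dictionary supplied, using default run_params")
        return dict(DEFAULT_RUN_PARAMS)
    params = dict(DEFAULT_RUN_PARAMS)
    # Values given by the user override the defaults; missing ones are kept
    params.update(config_dict.get("run_params", {}))
    return params


partial = {"run_params": {"zmax": 2.0, "run_name": "demo"}}
resolved = resolve_run_params(partial)
print(resolved)
```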

Let's start with a very simple demonstration using `simpleNN`.  `simpleNN` is just `sklearn`'s neural net method set up within the RAIL interface.  It estimates a point estimate redshift for each galaxy, then sets the width of the PDF based purely on the redshift.  This is a toy model estimator, but nicely illustrates some properties of RAIL. The parameters we'll use for simpleNN are:

In [None]:
nn_dict = {'run_params': {
  'class_name': 'simpleNN',
  'run_name': 'test_simpleNN',
  'zmin': 0.0,
  'zmax': 3.0,
  'nzbins': 301,
  'width': 0.05,
  'inform_options': {'save_train': True, 'load_model': False, 'modelfile': 'demo_NN_model.pkl'}
  }
}

In these parameters, `zmin`, `zmax`, and `nzbins` control the gridded parameterization on which the PDF will be output and stored.  This grid applies when `qp` format is not used; with `qp` output these parameters are ignored.<br>
`width` sets the scale of the Gaussian width assumed for the PDFs.<br>
`inform_options` is a dictionary that stores some information pertaining to the training process: <br>
`modelfile` is a string argument that stores a filename.<br>    
`save_train` is a boolean flag that if set to True will save the trained model to the filename stored in `modelfile` for later import.<br>
`load_model` is a boolean flag that if set to True will attempt to load a pretrained model from the filename in `modelfile`.<br>

We will see example uses of `save_train` and `load_model` later in the notebook.
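
To make the role of `zmin`, `zmax`, and `nzbins` concrete, here is a minimal sketch of the redshift grid they describe.  It assumes the grid is evenly spaced and includes both end points (built here with `np.linspace`); that construction is an assumption of this sketch.

```python
import numpy as np

run_params = {"zmin": 0.0, "zmax": 3.0, "nzbins": 301}

# Assumption: evenly spaced grid that includes both zmin and zmax
zgrid = np.linspace(run_params["zmin"], run_params["zmax"], run_params["nzbins"])
dz = zgrid[1] - zgrid[0]
print(len(zgrid), dz)
```

Under that assumption, 301 points between 0 and 3 give a grid spacing of 0.01 in redshift.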

In [None]:
print(nn_dict)

We instantiate a rail object with a call to the base class.<br>
You can also find the specific class name from the name of the algorithm with:<br>
```
classname = 'simpleNN'
code = Estimator._find_subclass(classname)
```
and then instantiate with 
```
pz = code('example_estimation_base.yaml', nn_dict)  
```
We will hardcode here for the concrete example. The `simpleNN` class is in the file `rail/estimation/algos/sklearn_nn.py`, so to create an instance of this class we need a call to `rail.estimation.algos.sklearn_nn.simpleNN`.  Upon creation, the code will print a brief description of the parameters and their current values.  If any essential parameters are missing from the parameter dictionary, they will be set to default values:

In [None]:
pz = rail.estimation.algos.sklearn_nn.simpleNN('example_estimation_base.yaml',nn_dict)

Now, let's load our training data, which is stored in hdf5 format.  We'll load it with the `tables_io` function `read`, which returns a dictionary containing numpy arrays of all the columns in the hdf5 file, matching the input format expected by the rail estimators.  Note that we index the result with `pz.groupname` to load the particular hdf5 group that contains our data:

In [None]:
train_fmt = 'hdf5'
training_data = io.read(pz.trainfile, None, train_fmt)[pz.groupname]

In [None]:
print(training_data.keys())

We need to train the neural net, which is done with the `inform` function present in every RAIL/estimation code. In `nn_dict` we have the `inform_options[save_train]` option set to `True`, so the trained model object will be saved in pickle format to the filename specified in `inform_options[modelfile]`, in this case `demo_NN_model.pkl`.  In the future, rather than repeat a potentially time-consuming training run, we can simply load this pickle file.<br>

NOTE: in our simple demo dataset, the multilayer perceptron sometimes fails to converge, so you may get a warning message in the next cell.  The estimator will still work and predict photo-z's even if the neural net has not fully converged in the set number of iterations.

In [None]:
%%time
pz.inform(training_data)

We can now run our algorithm on the data to produce simple photo-z estimates:

In [None]:
test_data = io.read(pz.testfile, None, 'hdf5' )['photometry']

In [None]:
print(test_data.keys())

In [None]:
results_dict = pz.estimate(test_data)

The output is a dictionary containing the redshift PDFs and the mode of the PDFs:

In [None]:
print(results_dict.keys())

Let's plot the redshift mode against the true redshifts to see how they look:

In [None]:
plt.figure(figsize=(8,8))
plt.scatter(test_data['redshift'],results_dict['zmode'],s=1,c='k',label='simple NN mode')
plt.plot([0,3],[0,3],'r--');
plt.xlabel("true redshift")
plt.ylabel("simple NN photo-z")

Not bad, given our very simple estimator.  For the PDFs, the simpleNN is storing a gridded parameterization where the PDF is evaluated at a fixed set of redshifts for each galaxy.  That grid is stored in `pz.zgrid`, and we'll need that to plot.  Remember that our simple Neural Net just estimated a point photo-z then assumed a Gaussian, so all PDFs will be of that simple form.  Let's plot an example pdf:

In [None]:
galid = 9529
zgrid = pz.zgrid
single_gal = results_dict['pz_pdf'][galid]
single_zmode = results_dict['zmode'][galid]
truez = test_data['redshift'][galid]
plt.plot(zgrid,single_gal,color='k',label='single pdf')
plt.axvline(single_zmode,color='k',label='mode')
plt.axvline(truez,color='r',label='true redshift')
plt.legend(loc='upper right')
plt.xlabel("redshift")
plt.ylabel("p(z)")
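
For intuition about what such a gridded Gaussian PDF looks like numerically, here is a minimal standalone sketch.  It assumes the Gaussian width scales as `width * (1 + z)`; that scaling, the helper name `gaussian_pdf_on_grid`, and the example point estimate of 0.8 are assumptions of this sketch rather than statements about the `simpleNN` implementation.

```python
import numpy as np


def gaussian_pdf_on_grid(zpoint, width, zgrid):
    """Evaluate a normalized Gaussian centred on zpoint at every grid point."""
    # Assumed scaling for this sketch: sigma grows as width * (1 + z)
    sigma = width * (1.0 + zpoint)
    norm = sigma * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * ((zgrid - zpoint) / sigma) ** 2) / norm


zgrid = np.linspace(0.0, 3.0, 301)
pdf = gaussian_pdf_on_grid(zpoint=0.8, width=0.05, zgrid=zgrid)

# The mode is the grid value where the PDF peaks
zmode = zgrid[np.argmax(pdf)]
# Simple Riemann-sum check that the PDF integrates to roughly one
area = np.sum(pdf) * (zgrid[1] - zgrid[0])
print(zmode, area)
```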

That illustrates the basics.  Now let's use a more well-developed method, FZBoost, use the iterator utility from `tables_io` to evaluate data in chunks, and output in `qp` format.  The `chunk_size` value in the base yaml file controls the number of galaxies used in each iterator chunk.<br>

`FZBoost` finds a conditional density estimate for each PDF via a set of weights for basis functions.  This can save space relative to a gridded parameterization, but it also sometimes leads to residual "bumps" in the PDF from the underlying parameterization.  For this reason, `FZBoost` has a post-processing stage where it "trims" (i.e. sets to zero) any "bumps" below a certain threshold.<br>

One of the dominant features seen in our PhotoZDC1 analysis of multiple photo-z codes (Schmidt et al. 2020) was that photo-z PDFs were often systematically overconfident (too narrow) or underconfident (too broad) in their overall uncertainty.  To remedy this, `FZBoost` has an additional post-processing step where it estimates a "sharpening" parameter that modulates the width of the PDFs.<br>

A portion of the training data is held in reserve to find best-fit values for both `bump_thresh` and `sharpening`, which we find by simply calculating the CDE loss for a grid of `bump_thresh` and `sharpening` values.<br>
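
The two post-processing steps can be illustrated with a simplified sketch on a toy gridded PDF.  The functions `trim_bumps` and `sharpen` below are hypothetical stand-ins written for this illustration: they zero out isolated segments whose area falls below a threshold and raise the PDF to a power before renormalizing.  They are not FlexZBoost's own routines, and the toy PDF and parameter values are arbitrary.

```python
import numpy as np


def normalize(pdf, zgrid):
    """Rescale a gridded PDF so that it integrates to one (trapezoid rule)."""
    area = np.sum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(zgrid))
    return pdf / area


def trim_bumps(pdf, zgrid, bump_thresh):
    """Zero out contiguous positive segments whose area is below bump_thresh."""
    dz = zgrid[1] - zgrid[0]  # assumes a uniform grid
    out = pdf.copy()
    start = None
    # Append a trailing False so a segment touching the last grid point is closed
    for i, flag in enumerate(np.append(out > 0, False)):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if out[start:i].sum() * dz < bump_thresh:
                out[start:i] = 0.0
            start = None
    return normalize(out, zgrid)


def sharpen(pdf, zgrid, sharpening):
    """Raise the PDF to a power and renormalize; powers above one narrow it."""
    return normalize(pdf ** sharpening, zgrid)


zgrid = np.linspace(0.0, 3.0, 301)
# Toy PDF: a main triangular peak at z=1 plus a small spurious bump at z=2.5
main = np.clip(1.0 - np.abs(zgrid - 1.0) / 0.3, 0.0, None)
bump = 0.05 * np.clip(1.0 - np.abs(zgrid - 2.5) / 0.1, 0.0, None)
pdf = normalize(main + bump, zgrid)

trimmed = trim_bumps(pdf, zgrid, bump_thresh=0.05)
sharpened = sharpen(trimmed, zgrid, sharpening=1.5)
print(pdf.max(), trimmed.max(), sharpened.max())
```

In this toy case the small bump near z=2.5 is removed and the remaining peak becomes taller and narrower after sharpening with a power greater than one.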

We'll start with a dictionary of setup parameters for FZBoost, just as we had for simpleNN.  Some of the parameters are the same as in `simpleNN` above: `zmin`, `zmax`, and `nzbins`.  However, FZBoost performs a more in-depth training than simpleNN, and as such has more input parameters to control behavior.  These parameters are:<br>
`basis_system`: which basis system to use in the density estimate. The default is `cosine` but `fourier` is also an option<br>
`max_basis`: the maximum number of basis functions to use for PDFs<br>
`regression_params`: a dictionary of options fed to `xgboost` that control the maximum depth and the `objective` function.  An update in `xgboost` means that `objective` should now be set to `reg:squarederror` for proper functioning.<br>
`trainfrac`: The fraction of the training data to use for training the density estimate.  The remaining galaxies will be used for validation of `bump_thresh` and `sharpening`.<br>
`bumpmin`: the minimum value to test in the `bump_thresh` grid<br>
`bumpmax`: the maximum value to test in the `bump_thresh` grid<br>
`nbump`: how many points to test in the `bump_thresh` grid<br>
`sharpmin`, `sharpmax`, `nsharp`: same as equivalent `bump_thresh` params, but for `sharpening` parameter<br>

In [None]:
fz_dict = {'run_params': {
  'class_name': 'FZBoost',
  'run_name': 'test_FZBoost',
  'zmin': 0.0,
  'zmax': 3.0,
  'nzbins': 301,
  'trainfrac': 0.75,
  'bumpmin': 0.02,
  'bumpmax': 0.35,
  'nbump': 20,
  'sharpmin': 0.7,
  'sharpmax': 2.1,
  'nsharp': 15,
  'max_basis': 35,
  'basis_system': 'cosine',
  'regression_params': {'max_depth': 8,'objective':'reg:squarederror'},
  'inform_options': {'save_train': True, 'load_model': False, 'modelfile': 'demo_FZB_model.pkl'}
  }
}
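
To see what the `bumpmin`/`bumpmax`/`nbump` and `sharpmin`/`sharpmax`/`nsharp` entries describe, here is a minimal sketch of the two candidate grids and of picking the value with the lowest loss.  It assumes evenly spaced grids built with `np.linspace`; the helper `best_on_grid` and the quadratic `toy_loss` are hypothetical stand-ins for the CDE loss evaluated on the held-out galaxies.

```python
import numpy as np

params = {"bumpmin": 0.02, "bumpmax": 0.35, "nbump": 20,
          "sharpmin": 0.7, "sharpmax": 2.1, "nsharp": 15}

# Assumption: both grids are evenly spaced and include their end points
bump_grid = np.linspace(params["bumpmin"], params["bumpmax"], params["nbump"])
sharp_grid = np.linspace(params["sharpmin"], params["sharpmax"], params["nsharp"])


def best_on_grid(grid, loss_fn):
    """Return the grid value with the smallest loss."""
    losses = np.array([loss_fn(value) for value in grid])
    return grid[np.argmin(losses)]


def toy_loss(sharpening):
    """Toy stand-in for the CDE loss, with its minimum at sharpening = 1.2."""
    return (sharpening - 1.2) ** 2


best_sharp = best_on_grid(sharp_grid, toy_loss)
print(len(bump_grid), len(sharp_grid), best_sharp)
```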

We will also demonstrate using qp as our storage format.  We will do this by using `example_estimation_base_qp.yaml`, which has the same parameters as `example_estimation_base.yaml`, but with `output_format` set to `qp` instead of `old`.

In [None]:
pzflex = rail.estimation.algos.flexzboost.FZBoost('example_estimation_base_qp.yaml',fz_dict)

We can now use the training data to train our model using `FZBoost`'s inform() method.  `FZBoost` uses xgboost to determine a conditional density estimate model, and also fits a `bump_thresh` parameter that erases small peaks that are an artifact of the `cosine` parameterization.  It then finds a best fit `sharpening` parameter that modulates the peaks with a power law.<br>
We have `save_train` set to `True` in our `inform_options`, so this will save a pickled version of the best fit model to the file specified in `inform_options['modelfile']`, which is set above to `demo_FZB_model.pkl`.  We can use the same training data that we used for `simpleNN`.  `FZBoost` is a bit more sophisticated than `simpleNN`, so it will take a bit longer to train (note: it should take about 10-15 minutes on cori for the 10,000 galaxies in our demo sample):

In [None]:
%%time
pzflex.inform(training_data)

## Loading a pre-trained model

That took quite a while to train! But, because we had `inform_options[save_train]` set to `True` we have saved the pretrained model in the file `demo_FZB_model.pkl`.  To save time we can load this pickled model without having to repeat the training stage in future runs for this specific training data, and that should be much faster:

In [None]:
%%time
pzflex.load_pretrained_model()

Yes, under a second.  So, if you are running an algorithm with a burdensome training requirement, saving a trained copy of the model for later repeated use can be a real time saver.
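
The save-then-load pattern itself is plain Python pickling.  Here is a minimal standalone sketch of the round trip; the dictionary standing in for a trained model and the file name `demo_model.pkl` are hypothetical, and the sketch does not reproduce what RAIL stores in its model files.

```python
import os
import pickle
import tempfile

# Any picklable object can stand in for a trained model in this sketch
model = {"weights": [0.1, 0.2, 0.3], "bump_thresh": 0.05, "sharpening": 1.2}

with tempfile.TemporaryDirectory() as tmpdir:
    modelfile = os.path.join(tmpdir, "demo_model.pkl")  # hypothetical file name
    # Equivalent of save_train=True: write the trained object to disk
    with open(modelfile, "wb") as handle:
        pickle.dump(model, handle)
    # Equivalent of load_model=True: read it back instead of retraining
    with open(modelfile, "rb") as handle:
        restored = pickle.load(handle)

print(restored == model)
```

Pickle files can execute arbitrary code when loaded, so only load model files that come from a source you trust.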


Now, let's compute photo-z's with the `estimate` method.  `tables_io` has a function named `iterHdf5ToDict` that iterates through the data, returning a dictionary of arrays for each chunk as it goes.  This can be useful if you have a very large datafile with millions of galaxies and you do not want to hold all of the data in memory at once, and e.g. want to write out the PDFs to file as you progress.  Our demo file has only ~20,000 galaxies, but we will use the iterator here as an example.  The `tables_io` function `iterHdf5ToDict` takes arguments `infile` (path to the data file), `chunk_size` (how many galaxies you want to include in each iterator chunk), and `groupname` (the hdf5 groupname for the input data file).  We will use the same datafile used for our `simpleNN` demo.  Because we aren't writing to file, we will just stack the `qp` results returned for each chunk into one big `qp` ensemble using its `append` method.  Again, if memory were a problem, you could instead write each chunk out to disk.

In [None]:
%%time
# Iterate over the test file (the same one used for simpleNN) in chunks,
# estimating PDFs for each chunk and stacking them into one qp ensemble
for chunk, (start, end, data) in enumerate(io.iterHdf5ToDict(pz.testfile, pz._chunk_size, 'photometry')):
    print(f"calculating pdfs[{start}:{end}]")
    pz_data_chunk = pzflex.estimate(data)
    if chunk == 0:
        FZ_pdfs = pz_data_chunk
    else:
        FZ_pdfs.append(pz_data_chunk)
        del pz_data_chunk
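
The chunking pattern can be tried without any hdf5 file.  The sketch below uses a hypothetical in-memory generator, `iter_chunks`, that yields `(start, end, data)` tuples in the same spirit as the loop above; it is a stand-in written for this illustration and not the `tables_io` implementation, and the column names and values are made up.

```python
import numpy as np


def iter_chunks(data, chunk_size):
    """Yield (start, end, chunk) from a dict of equal-length arrays.

    Hypothetical in-memory stand-in for an hdf5 chunk iterator.
    """
    nrows = len(next(iter(data.values())))
    for start in range(0, nrows, chunk_size):
        end = min(start + chunk_size, nrows)
        yield start, end, {key: col[start:end] for key, col in data.items()}


# Made-up table with 6000 rows
toy_data = {"redshift": np.linspace(0.0, 3.0, 6000), "mag": np.full(6000, 24.0)}

pieces = []
bounds = []
for start, end, chunk in iter_chunks(toy_data, chunk_size=2500):
    print(f"processing rows [{start}:{end}]")
    bounds.append((start, end))
    # Placeholder for an estimator call on this chunk
    pieces.append(chunk["redshift"] * 1.0)

# Stack the per-chunk results back into one array
stacked = np.concatenate(pieces)
```

With 6000 rows and a chunk size of 2500, the last chunk is shorter than the others, which is why the end index is clipped to the number of rows.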

We can plot an example PDF using `qp`'s native plotting:

In [None]:
fig, axes = qp.plotting.plot_native(FZ_pdfs[1225], xlim=(0.,3.))

We can also plot a few point estimates to make sure our algorithm worked properly.  The median of each PDF can be computed trivially and plotted against the true redshift:

In [None]:
Fz_medians = FZ_pdfs.median()

In [None]:
plt.figure(figsize=(10,10))
plt.scatter(test_data['redshift'],Fz_medians,s=1,c='k')
plt.plot([0,3],[0,3],'r--')
plt.xlabel("true redshift")
plt.ylabel("photoz median")
plt.title("median point estimate for FZBoost");

Unfortunately, the `qp.interp` parameterization used for `FZBoost` does not have a native `mode()` method, but we can easily compute the mode using `np.argmax` on each PDF and assigning the corresponding `zgrid` value.  Note that this works only for this particular `qp` parameterization, and would be different for alternate storage forms.  For `qp.interp` the raw gridded PDFs are stored in `objdata()['yvals']`:

In [None]:
rawfzdata = FZ_pdfs.objdata()['yvals']

In [None]:
FZ_mode = pzflex.zgrid[np.argmax(rawfzdata,axis=1)]
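
The same point estimates can be checked on toy data where the answer is known.  The sketch below builds two gridded Gaussians, takes the mode with `np.argmax`, and approximates the median from a normalized cumulative sum.  It is a simplified illustration on made-up PDFs and is not how `qp` computes its median.

```python
import numpy as np

zgrid = np.linspace(0.0, 3.0, 301)
centers = np.array([0.5, 1.5])
sigmas = np.array([0.05, 0.2])
# Two toy PDFs, one per row, in the same (ngal, ngrid) layout as 'yvals'
yvals = np.exp(-0.5 * ((zgrid[None, :] - centers[:, None]) / sigmas[:, None]) ** 2)

# Mode: grid value at the maximum of each row
modes = zgrid[np.argmax(yvals, axis=1)]

# Median: redshift where the normalized cumulative sum crosses 0.5
# (accurate to roughly half a grid spacing with this simple cumulative sum)
cdf = np.cumsum(yvals, axis=1)
cdf /= cdf[:, -1:]
medians = np.array([np.interp(0.5, row, zgrid) for row in cdf])
print(modes, medians)
```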

In [None]:
plt.figure(figsize=(10,10))
plt.scatter(test_data['redshift'],FZ_mode,c='k',s=1)
plt.plot([0,3],[0,3],'r--')
plt.xlabel("true redshift")
plt.ylabel("photoz mode")
plt.title("PDF mode for FZBoost");

The results look very good! FZBoost is a mature algorithm, and with representative training data we see a very tight correlation with true redshift and few outliers.<br>

We can use this same raw data to see how the summed PDFs compare to the true redshift distribution (note: this is as a sanity check rather than a "proper" analysis for science, for which we might want to use something like chippr to properly estimate the overall probability distribution):

In [None]:
FZ_nzsum = np.sum(rawfzdata,axis=0)

In [None]:
plt.figure(figsize=(10,7))
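# scale by the histogram bin width (3.0/100 = 0.03) so the summed PDFs are comparable to counts per bin
plt.plot(pz.zgrid,FZ_nzsum*.03, label='summed PDFs')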
plt.hist(test_data['redshift'],bins=np.linspace(0.,3.,101), label="true z histogram");
plt.xlabel("redshift")
plt.ylabel("N(z)")
plt.legend(loc='upper right');
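
The scaling between a summed PDF and a histogram can be checked on toy data.  In the sketch below, each toy PDF integrates to one, so their sum integrates to the number of galaxies, and multiplying by the histogram bin width gives the expected count per bin.  The Gaussian PDFs, random centers, and sample size are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
zgrid = np.linspace(0.0, 3.0, 301)
dz = zgrid[1] - zgrid[0]

ngal = 1000
centers = rng.uniform(0.5, 2.5, ngal)
pdfs = np.exp(-0.5 * ((zgrid[None, :] - centers[:, None]) / 0.05) ** 2)
# Normalize each row so that it integrates to one on the grid
pdfs /= pdfs.sum(axis=1, keepdims=True) * dz

# Summed PDF: a density that integrates to the number of galaxies
nz_sum = pdfs.sum(axis=0)
total = nz_sum.sum() * dz

# Histogram with 100 bins over [0, 3]: density times bin width gives counts
bin_width = 3.0 / 100
expected_counts = nz_sum * bin_width
print(total, bin_width)
```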