# Goldenspike, an example of an end-to-end analysis using RAIL

This notebook demonstrates how to use the various RAIL Modules to draw synthetic samples of fluxes by color, apply physical effects to them, train photo-z estimators on the samples, test and validate the performance of those estimators, and use the RAIL summarization modules to obtain n(z) estimates based on the p(z) estimates.

### Creation 

Note that in the parlance of the Creation Module, "degradation" is any post-processing that occurs to the "true" sample generated by the create Engine.  This can include adding photometric errors, applying quality cuts, introducing systematic biases, etc.

In this notebook, we will draw both test and training samples from a RAIL Engine object. Then we will demonstrate how to use RAIL degraders to apply effects to those samples.
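
As a rough illustration of what "degradation" means, the sketch below adds photometric scatter to a few "true" magnitudes. It is not the `LSSTErrorModel` used later in this notebook: the function `add_magnitude_noise`, the single fixed signal-to-noise value, and the input magnitudes are all hypothetical. The only physics assumed is the standard small-error relation between magnitude uncertainty and signal-to-noise.

```python
import math
import random


def add_magnitude_noise(mags, snr, rng):
    """Scatter each magnitude by a Gaussian whose width follows from the S/N.

    For small errors, sigma_mag ~= (2.5 / ln 10) / SNR ~= 1.0857 / SNR.
    """
    sigma = 2.5 / math.log(10) / snr
    noisy = [m + rng.gauss(0.0, sigma) for m in mags]
    return noisy, sigma


rng = random.Random(12345)        # fixed seed so the example is repeatable
true_mags = [22.0, 23.5, 24.8]    # hypothetical "true" i-band magnitudes
observed_mags, sigma_mag = add_magnitude_noise(true_mags, snr=10.0, rng=rng)
print(f"sigma_mag = {sigma_mag:.4f}")
```

A realistic error model would make the signal-to-noise depend on magnitude and band; a constant value is used here only to keep the sketch short.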

### Training and Estimation

The RAIL Trainer modules "train" or "inform" models used to estimate p(z) given band fluxes (and potentially other information).

The RAIL Estimation modules then apply those same models to extract the p(z) estimates.

### p(z) Validation 

The RAIL Validator module applies various metrics to the p(z) estimates.

### p(z) to n(z) Summarization

The RAIL Summarization modules convert per-galaxy p(z) posteriors to ensemble n(z) estimates. 

###  Imports

In [None]:
# Prerequisites: os and numpy
import os
import numpy as np

In [None]:
# Various rail modules
import rail
from rail.creation.degradation import LSSTErrorModel, InvRedshiftIncompleteness, LineConfusion, QuantityCut
from rail.creation.engines.flowEngine import FlowEngine, FlowPosterior
from rail.core.data import TableHandle
from rail.core.stage import RailStage
from rail.core.utilStages import ColumnMapper, TableConverter

from rail.estimation.algos.bpz_lite import BPZ_lite
from rail.estimation.algos.trainZ import Train_trainZ, TrainZ
from rail.estimation.algos.sklearn_nn import Train_SimpleNN, SimpleNN
from rail.estimation.algos.randomPZ import RandomPZ
from rail.estimation.algos.flexzboost import Train_FZBoost, FZBoost

from rail.evaluation.evaluator import Evaluator

from rail.summarization.algos.naiveStack import NaiveStack
from rail.summarization.algos.pointEstimateHist import PointEstimateHist

RAIL now uses ceci as a back-end, which takes care of a lot of file I/O decisions to be consistent with other choices in DESC.

The cell below overrides a ceci default that prevents overwriting previous results. That protection is generally a good idea, but it is not needed for this demo.

The `DataStore` uses `DataHandle` objects to keep track of the connections between the various stages.  When one stage returns a `DataHandle` and you then pass that `DataHandle` to another stage, the underlying code can establish the connections needed to build a reproducible pipeline.

In [None]:
DS = RailStage.data_store
DS.__class__.allow_overwrite = True

The path setup below establishes where to find the file for a pre-trained creator.
TODO: make an issue for training one through RAIL rather than externally with pzflow.

In [None]:
RAIL_DIR = os.path.join(os.path.dirname(rail.__file__), '..')
flow_file = os.path.join(RAIL_DIR, 'examples/goldenspike/data/pretrained_flow.pkl')

Here we need a few configuration parameters to deal with differences in data schema between existing PZ codes.

In [None]:
bands = ['u','g','r','i','z','y']
band_dict = {band:f'mag_{band}_lsst' for band in bands}
rename_dict = {f'mag_{band}_lsst_err':f'mag_err_{band}_lsst' for band in bands}
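
To make the two dictionaries above concrete, the sketch below rebuilds them and applies `rename_dict` to a tiny table. The one-row `table` is hypothetical, and the dictionary comprehension that does the renaming is only a plain-Python stand-in for what a column-renaming stage does, not RAIL code.

```python
bands = ['u', 'g', 'r', 'i', 'z', 'y']
band_dict = {band: f'mag_{band}_lsst' for band in bands}
rename_dict = {f'mag_{band}_lsst_err': f'mag_err_{band}_lsst' for band in bands}

# A hypothetical one-row table that uses the "<column>_err" naming scheme.
table = {'redshift': [0.5], 'mag_u_lsst': [24.1], 'mag_u_lsst_err': [0.08]}

# Rename only the columns listed in rename_dict; leave the others untouched.
renamed = {rename_dict.get(name, name): values for name, values in table.items()}

print(band_dict['u'])
print(sorted(renamed))
```

`band_dict` maps each band letter to its magnitude column, while `rename_dict` moves the `err` token from the end of the column name to the middle.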

## Make mock data

First, we make the stages.
Note that training and test data will refer to the same flow file for the Engine, but they otherwise have different stages.

### Training sample

For the training sample we will:

1. Use the Flow to produce some synthetic data
2. Use the LSSTErrorModel to smear the data
3. Use the ColumnMapper to rename some of the columns, as needed by the FlowPosterior module
4. Use the FlowPosterior to estimate the redshift posteriors for the degraded sample
5. Use the TableConverter to convert the data to a numpy dictionary, which will be stored in an HDF5 file with the same schema as the DC2 data


In [None]:
flow_engine_train = FlowEngine.make_stage(name='flow_engine_train', 
                                          flow=flow_file, n_samples=50)
      
lsst_error_model_train = LSSTErrorModel.make_stage(name='lsst_error_model_train',
                                                   bandNames=band_dict)
                
col_remapper_train = ColumnMapper.make_stage(name='col_remapper_train', hdf5_groupname='',
                                             columns=rename_dict)

flow_post_train = FlowPosterior.make_stage(name='flow_post_train',
                                           column='redshift', flow=flow_file,
                                           grid=np.linspace(0., 5., 21))

table_conv_train = TableConverter.make_stage(name='table_conv_train', output_format='numpyDict', 
                                             seed=12345)


In [None]:
train_data_orig = flow_engine_train.sample(50, 12345)
train_data_errs = lsst_error_model_train(train_data_orig)
train_data_pq = col_remapper_train(train_data_errs)
train_data_post = flow_post_train.get_posterior(train_data_pq, 'redshift', err_samples=None)
train_data = table_conv_train(train_data_pq)

TODO: print some data, like column names

### Testing sample

For the testing data we are going to apply a couple of extra degradation effects to the data.  This will allow us to see how the trained models perform when the training sample is not representative of the test sample.

More details about the degraders are available in the `rail/examples/creation/degradation_demo.ipynb` notebook.
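
The sketch below gives a plain-Python picture of the three extra effects applied to the testing sample. It is not the RAIL implementation. The line-confusion formula follows directly from the definition of redshift, but the keep probability `min(1, pivot_redshift / z)` and the reading of a single cut value as an upper limit are assumptions about how the corresponding degraders behave. The function names and the three galaxies are hypothetical.

```python
import random


def inv_redshift_keep(z, pivot_redshift, rng):
    """Keep a galaxy with probability min(1, pivot_redshift / z) (assumed form)."""
    if z <= pivot_redshift:
        return True
    return rng.random() < pivot_redshift / z


def confuse_line(z, true_wavelen, wrong_wavelen):
    """Redshift inferred when a line emitted at true_wavelen is taken for wrong_wavelen."""
    return (1.0 + z) * true_wavelen / wrong_wavelen - 1.0


def passes_cut(row, cuts):
    """Keep a row only if every listed column is below its maximum (assumed meaning)."""
    return all(row[col] < max_val for col, max_val in cuts.items())


rng = random.Random(12345)
# Hypothetical galaxies, not drawn from the flow.
sample = [
    {'redshift': 0.4, 'mag_i_lsst': 23.9},
    {'redshift': 1.8, 'mag_i_lsst': 24.7},
    {'redshift': 0.9, 'mag_i_lsst': 25.9},
]

# 1. Redshift-dependent incompleteness.
kept = [g for g in sample if inv_redshift_keep(g['redshift'], 1.0, rng)]
# 2. Misidentify a small fraction of emission lines (frac_wrong = 0.05).
for g in kept:
    if rng.random() < 0.05:
        g['redshift'] = confuse_line(g['redshift'], 5007.0, 3727.0)
# 3. Magnitude cut.
kept = [g for g in kept if passes_cut(g, {'mag_i_lsst': 25.3})]
print(kept)
```

With the wavelengths used in this notebook, a confused line always yields a redshift higher than the true one, because `true_wavelen` is larger than `wrong_wavelen`.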

In [None]:
flow_engine_test = FlowEngine.make_stage(name='flow_engine_test', 
                                         flow=flow_file, n_samples=50,
                                         seed=12345)

lsst_error_model_test = LSSTErrorModel.make_stage(name='lsst_error_model_test',
                                                  bandNames=band_dict)

inv_redshift = InvRedshiftIncompleteness.make_stage(name='inv_redshift',
                                                    pivot_redshift=1.0)

line_confusion = LineConfusion.make_stage(name='line_confusion', 
                                          true_wavelen=5007., wrong_wavelen=3727., frac_wrong=0.05)

quantity_cut = QuantityCut.make_stage(name='quantity_cut',    
                                      cuts={'mag_i_lsst': 25.3})

col_remapper_test = ColumnMapper.make_stage(name='col_remapper_test', columns=rename_dict)
   
table_conv_test = TableConverter.make_stage(name='table_conv_test', output_format='numpyDict')
 


In [None]:
test_data_orig = flow_engine_test.sample(50, 12345)
test_data_errs = lsst_error_model_test(test_data_orig)
test_data_inc = inv_redshift(test_data_errs)
test_data_conf = line_confusion(test_data_inc)
test_data_cut = quantity_cut(test_data_conf)
test_data_pq = col_remapper_test(test_data_cut)
test_data = table_conv_test(test_data_pq)

TODO: print some data, e.g. column names showing difference from training data

## "Inform" some estimators

More details about the process of "informing" or "training" the models used by the estimators are available in the `rail/examples/estimation/RAIL_estimation_demo.ipynb` notebook.

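Here, "inform" refers to supplying any prior information an estimator needs. This is usually a training set, but it can also be a template library, which is relevant to hybrid estimators. Using one term keeps the way the estimators are called consistent.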

TODO: change `train` to `inform` in corresponding RAIL modules. . .

In [None]:
inform_trainZ = Train_trainZ.make_stage(name='inform_trainZ', input='inprogress_output_table_conv_train.hdf5', 
                                        model='trainZ.pkl', hdf5_groupname='')

# train_simpleNN = Train_SimpleNN.make_stage(name='train_simpleNN', input='inprogress_output_table_conv_train.hdf5', 
#                                            model_file='simpleNN.pkl', hdf5_groupname='')

# train_fzboost = Train_FZBoost.make_stage(name='train_FZBoost', input='inprogress_output_table_conv_train.pq', 
#                                          model_file='fzboost.pkl', hdf5_groupname='')

In [None]:
inform_trainZ.inform(train_data)
#train_simpleNN.inform(train_data)
#train_fzboost.inform(train_data)

## Estimate photo-z posteriors

More details about the estimators are available in the `rail/examples/estimation/RAIL_estimation_demo.ipynb` notebook.

`randomPZ` is a very simple class that does not actually predict a meaningful photo-z; instead, it produces a randomly drawn Gaussian for each galaxy.<br>
`trainZ` is our "pathological" estimator: it makes a PDF from a histogram of the training data and assigns that PDF to every galaxy.<br>
`BPZ_lite` is a template-based code that outputs the posterior estimated given a specific template set and Bayesian prior. See Benítez (2000) for more details.<br>
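
The idea behind `trainZ` can be sketched in a few lines: build a normalized histogram of the training redshifts and hand the same PDF to every test galaxy. This is only an illustration of the description above, not the RAIL implementation. The helper `histogram_pdf`, the binning, and the training redshifts are hypothetical.

```python
def histogram_pdf(train_redshifts, z_min, z_max, n_bins):
    """Normalized histogram of training redshifts: bin edges and densities."""
    width = (z_max - z_min) / n_bins
    counts = [0] * n_bins
    for z in train_redshifts:
        if z_min <= z <= z_max:
            idx = min(int((z - z_min) / width), n_bins - 1)
            counts[idx] += 1
    total = sum(counts)
    edges = [z_min + i * width for i in range(n_bins + 1)]
    density = [c / (total * width) for c in counts]
    return edges, density


train_z = [0.21, 0.35, 0.38, 0.52, 0.77, 0.81, 1.10, 1.42]  # hypothetical
edges, density = histogram_pdf(train_z, z_min=0.0, z_max=2.0, n_bins=4)

# Every test galaxy receives the same PDF, whatever its photometry.
n_test = 3
pdfs = [density for _ in range(n_test)]
print(density)
```

Because the photometry of the test galaxies is never consulted, such an estimator can describe the ensemble well while carrying no per-galaxy information, which is why it serves as a useful control case.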


In [None]:
estimate_bpz = BPZ_lite.make_stage(name='estimate_bpz', hdf5_groupname='', columns_file='../estimation/configs/test_bpz.columns')

estimate_trainZ = TrainZ.make_stage(name='estimate_trainZ', hdf5_groupname='', model=inform_trainZ.get_handle('model'))

estimate_randomPZ = RandomPZ.make_stage(name='estimate_randomZ', hdf5_groupname='')

#test_simpleNN = SimpleNN.make_stage(name='test_simpleNN', 
#                                    model_file='simpleNN.pkl')

#test_fzboost = FZBoost.create(name='test_FZBoost', 
#                              model_file='fzboost.pkl', 
#                              aliases=dict(input='test_data', output='fzboost_estim'))

In [None]:
bpz_estimated = estimate_bpz.estimate(test_data)
trainZ_estimated = estimate_trainZ.estimate(test_data)
randomPZ_estimated = estimate_randomPZ.estimate(test_data)

## Evaluate the estimates

Now we evaluate metrics on the estimates, separately for each estimator.  

Each call to `Evaluator.evaluate` creates a table with the various performance metrics.
We will store all of these tables in a dictionary, keyed by the name of the estimator.

In [None]:
eval_dict = dict(bpz=bpz_estimated, trainZ=trainZ_estimated)
truth = test_data_orig

result_dict = {}
for key, val in eval_dict.items():
    the_eval = Evaluator.make_stage(name=f'{key}_eval', truth=truth)
    result_dict[key] = the_eval.evaluate(val, truth)
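
Which metrics `Evaluator` reports is not spelled out here. As one illustration of a p(z) metric, the sketch below computes the probability integral transform (PIT), i.e. the cumulative distribution of a galaxy's p(z) evaluated at its true redshift, which is commonly used to validate photo-z PDFs. The function `pit_value`, the grid, and the PDF are hypothetical, and nothing here is taken from the RAIL code.

```python
def pit_value(grid, pdf, z_true):
    """CDF of a gridded p(z) at z_true, via trapezoidal integration."""
    total = 0.0
    below = 0.0
    for i in range(len(grid) - 1):
        z0, z1 = grid[i], grid[i + 1]
        area = 0.5 * (pdf[i] + pdf[i + 1]) * (z1 - z0)
        total += area
        if z1 <= z_true:
            below += area
        elif z0 < z_true:
            # Partial interval: interpolate the PDF linearly up to z_true.
            p_true = pdf[i] + (pdf[i + 1] - pdf[i]) * (z_true - z0) / (z1 - z0)
            below += 0.5 * (pdf[i] + p_true) * (z_true - z0)
    return below / total  # dividing by total normalizes the PDF


grid = [0.0, 0.5, 1.0, 1.5, 2.0]
pdf = [0.0, 1.0, 1.0, 0.0, 0.0]   # hypothetical, unnormalized p(z)
print(pit_value(grid, pdf, z_true=0.75))
```

For well-calibrated PDFs the PIT values of a large sample should be uniformly distributed between 0 and 1, so summary statistics of that distribution can serve as ensemble metrics.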

The Pandas DataFrame output format conveniently makes human-readable printouts of the metrics.  
This next cell will convert everything to Pandas.

In [None]:
import tables_io
results_tables = {key:tables_io.convertObj(val.data, tables_io.types.PD_DATAFRAME) for key,val in result_dict.items()}

In [None]:
results_tables['bpz']

In [None]:
results_tables['trainZ']

## Summarize the per-galaxy redshift constraints to make population-level distributions

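The summarizers convert the per-galaxy p(z) posteriors into ensemble n(z) estimates. Here we use two of them, `PointEstimateHist` and `NaiveStack`.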
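
The two approaches named by the classes `NaiveStack` and `PointEstimateHist` can be sketched in plain Python as follows. This is a minimal illustration, not the RAIL code: it assumes that stacking means averaging the per-galaxy p(z) on a shared grid and that the point estimate is the mode of each p(z). The function names, the grid, and the PDFs are hypothetical.

```python
def naive_stack(pdfs):
    """Average per-galaxy p(z) values, grid point by grid point."""
    n_gal = len(pdfs)
    return [sum(column) / n_gal for column in zip(*pdfs)]


def point_estimate_hist(grid, pdfs, edges):
    """Histogram the mode of each p(z) into the bins defined by edges."""
    counts = [0] * (len(edges) - 1)
    for pdf in pdfs:
        z_mode = grid[pdf.index(max(pdf))]
        for i in range(len(counts)):
            if edges[i] <= z_mode < edges[i + 1]:
                counts[i] += 1
                break
    return counts


grid = [0.25, 0.75, 1.25, 1.75]
pdfs = [                          # hypothetical p(z) values on `grid`
    [1.6, 0.4, 0.0, 0.0],
    [0.2, 1.4, 0.4, 0.0],
    [0.0, 0.6, 1.2, 0.2],
]
n_of_z_stack = naive_stack(pdfs)
n_of_z_hist = point_estimate_hist(grid, pdfs, edges=[0.0, 0.5, 1.0, 1.5, 2.0])
print(n_of_z_stack)
print(n_of_z_hist)
```

The stacked estimate keeps the width of each individual p(z), while the histogram of point estimates discards it, so the two n(z) estimates generally differ in shape.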

First we make the stages, then execute them, then plot the output.

In [None]:
point_estimate_test = PointEstimateHist.make_stage(name='point_estimate_test')
naive_stack_test = NaiveStack.make_stage(name='naive_stack_test')

In [None]:
point_estimate_ens = point_estimate_test.summarize(eval_dict['bpz'])
naive_stack_ens = naive_stack_test.summarize(eval_dict['bpz'])

In [None]:
_ = naive_stack_ens.data.plot_native(xlim=(0,3))

In [None]:
_ = point_estimate_ens.data.plot_native(xlim=(0,3))

### Convert this to a `ceci` Pipeline

Now that we have all these stages defined and configured, and that we have established the connections between them by passing `DataHandle` objects between them, we can build a `ceci` Pipeline.


In [None]:
import ceci
pipe = ceci.Pipeline.interactive()
stages = [flow_engine_test, lsst_error_model_test, col_remapper_test, table_conv_test,
          flow_engine_train, lsst_error_model_train, col_remapper_train, table_conv_train, 
          inv_redshift, line_confusion, quantity_cut,
          inform_trainZ, estimate_bpz, estimate_trainZ, estimate_randomPZ,
          point_estimate_test, naive_stack_test]
for stage in stages:
    pipe.add_stage(stage)
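
Once the connections between stages are known, a valid execution order can be derived even though the stages were added in a different order. The sketch below illustrates that idea with the standard-library `graphlib` module. It is not how `ceci` is implemented, and the `depends_on` mapping is a hand-written, hypothetical transcription of the test branch of this notebook.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical wiring for the test branch: stage -> stages whose output it consumes.
depends_on = {
    'flow_engine_test': set(),
    'lsst_error_model_test': {'flow_engine_test'},
    'inv_redshift': {'lsst_error_model_test'},
    'line_confusion': {'inv_redshift'},
    'quantity_cut': {'line_confusion'},
    'col_remapper_test': {'quantity_cut'},
    'table_conv_test': {'col_remapper_test'},
    'estimate_bpz': {'table_conv_test'},
    'naive_stack_test': {'estimate_bpz'},
}

# static_order() yields every stage after all of the stages it depends on.
run_order = list(TopologicalSorter(depends_on).static_order())
print(run_order)
```

This is why the order of the `stages` list above does not need to match the order in which the stages must run.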

In [None]:
pipe.initialize(dict(flow=flow_file), dict(output_dir='.', log_dir='.', resume=False), None)

In [None]:
pipe.save('tmp_goldenspike.yml')

### Read back the pipeline and run it

In [None]:
pr = ceci.Pipeline.read('tmp_goldenspike.yml')

In [None]:
pr.run()