# GPz Estimation Example

**Author:** Sam Schmidt

**Last Run Successfully:** September 26, 2023

A quick demo of running GPz on the typical test data.  You should have installed rail_gpz_v1 first (we highly recommend doing this within a dedicated conda environment so that all package version dependencies are met), either by cloning and installing from GitHub, or with:
```
pip install pz-rail-gpz-v1
```

As RAIL is a namespace package, installing rail_gpz_v1 makes `GPzInformer` and `GPzEstimator` available, and they can be imported via:
```
from rail.estimation.algos.gpz import GPzInformer, GPzEstimator
```

Let's start with all of our necessary imports:

In [None]:
import os
import matplotlib.pyplot as plt
import numpy as np
import rail
import qp
from rail.core.data import TableHandle
from rail.core.stage import RailStage
from rail.estimation.algos.gpz import GPzInformer, GPzEstimator

In [None]:
# set up the DataStore to keep track of data
DS = RailStage.data_store
DS.__class__.allow_overwrite = True

In [None]:
# find_rail_file is a convenience function that finds a file in the RAIL ecosystem.
# Several example data files are copied with RAIL that we can use for our example run;
# let's grab two of them, one for training/validation and the other for testing:
from rail.core.utils import find_rail_file
trainFile = find_rail_file('examples_data/testdata/test_dc2_training_9816.hdf5')
testFile = find_rail_file('examples_data/testdata/test_dc2_validation_9816.hdf5')
training_data = DS.read_file("training_data", TableHandle, trainFile)
test_data = DS.read_file("test_data", TableHandle, testFile)

Now we need to set up the stage that will run GPz.  We begin by defining a dictionary with the config options for the algorithm.  Sensible defaults are already set, but we will override several of them as an example of how to do this.  Config parameters not set in the dictionary are automatically set to their default values.

In [None]:
gpz_train_dict = dict(n_basis=60, trainfrac=0.8, csl_method="normal", max_iter=150, hdf5_groupname="photometry") 
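
The override behaviour described above can be pictured as a plain dictionary merge. This is only a minimal sketch: the default values below are hypothetical placeholders (they are not the real `GPzInformer` defaults), and RAIL's actual config handling is more involved than a single merge.

```python
# Hypothetical defaults, for illustration only: these are NOT the real
# GPzInformer defaults, which are defined by the stage's own config options.
hypothetical_defaults = dict(n_basis=50, trainfrac=0.75, max_iter=200, seed=87)

# A subset of the overrides used in this notebook's gpz_train_dict.
overrides = dict(n_basis=60, trainfrac=0.8, max_iter=150)

# Later keys win in a dict merge, so each override replaces its default,
# while anything not overridden (here "seed") keeps the default value.
effective = {**hypothetical_defaults, **overrides}
print(effective)
```

The point is simply that only the keys you pass are changed; every other option keeps whatever default the stage defines.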

Let's set up the training stage.  We need to provide a name for the stage for ceci, as well as a name for the model file that will be written by the stage.  We also include the arguments in the dictionary we wrote above as additional arguments:

In [None]:
# set up the stage to run our GPZ_training
pz_train = GPzInformer.make_stage(name="GPz_Train", model="GPz_model.pkl", **gpz_train_dict)

We are now ready to run the stage to create the model.  We will use the training data from `test_dc2_training_9816.hdf5`, which contains 10,225 galaxies drawn from healpix 9816 from the cosmoDC2_v1.1.4 dataset, to train the model.  Recall that we read this data into the DataStore under the name `training_data`.  Because we set `trainfrac` to 0.8, 80% of the data will be used in the "main" training, while the remaining 20% will be reserved by `GPzInformer` to determine a SIGMA parameter.  We set `max_iter` to 150, so we will see up to 150 steps in which the stage tries to maximize the likelihood.  We run the stage as follows:

In [None]:
%%time
pz_train.inform(training_data)
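
As an aside on the `trainfrac` setting, here is a minimal sketch of what an 80/20 split of the 10,225 training galaxies looks like. The helper `split_indices` is hypothetical, and the shuffling and rounding choices are assumptions made for illustration; `GPzInformer` may perform its split differently internally.

```python
import random


def split_indices(n_total, trainfrac, seed=0):
    """Shuffle the indices 0..n_total-1 and split them into two disjoint lists.

    Hypothetical helper: the shuffle and the rounding rule are assumptions,
    not a description of GPzInformer's internals.
    """
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    n_train = round(n_total * trainfrac)
    return indices[:n_train], indices[n_train:]


# 10,225 galaxies with trainfrac=0.8, as in this notebook
train_idx, held_out_idx = split_indices(10225, 0.8)
print(len(train_idx), len(held_out_idx))  # 8180 2045
```

Under these assumptions, roughly 8,180 galaxies drive the main training and about 2,045 are held out.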

The training step should have taken about 30 seconds on a typical desktop computer, and you should now see a file called `GPz_model.pkl` in the current directory.  This model file is used by the `GPzEstimator` stage to determine the redshift PDFs for the test set of galaxies.  Let's set up that stage, again defining a dictionary of variables for the config params:

In [None]:
gpz_test_dict = dict(hdf5_groupname="photometry", model="GPz_model.pkl")

gpz_run = GPzEstimator.make_stage(name="gpz_run", **gpz_test_dict)

Let's run the stage and compute photo-z's for our test set:

In [None]:
%%time
results = gpz_run.estimate(test_data)

This should be very fast, under a second for the 20,449 galaxies in the test set.  Now let's plot a few example PDFs, followed by a scatter plot of the point estimates.  The `qp` ensemble written by the stage is accessible through the DataStore by calling `results()`:

In [None]:
ens = results()

In [None]:
expdfids = [2, 180, 13517, 18032]
fig, axs = plt.subplots(4, 1, figsize=(12,10))
for i, xx in enumerate(expdfids):
    axs[i].set_xlim(0,3)
    ens[xx].plot_native(axes=axs[i])
axs[3].set_xlabel("redshift", fontsize=15)

GPzEstimator parameterizes each PDF as a single Gaussian; here we see a few examples of Gaussians of different widths.  Now let's grab the mode of each PDF (stored as ancillary data in the ensemble) and compare it to the true redshifts from the test_data file:

In [None]:
truez = test_data.data['photometry']['redshift']
zmode = ens.ancil['zmode'].flatten()
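
Because each PDF is a single Gaussian, its mode coincides with its mean. The following minimal sketch illustrates this with the standard library only; the mean and width are hypothetical values for one imaginary galaxy, not numbers taken from the ensemble.

```python
import math


def gaussian_pdf(z, mu, sigma):
    """Normal probability density with mean mu and width sigma, evaluated at z."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


# Hypothetical mean and width for a single galaxy (illustrative values only)
mu, sigma = 0.75, 0.1

# Evaluate the PDF on a redshift grid from 0 to 3 in steps of 0.01
dz = 0.01
zgrid = [i * dz for i in range(301)]
pdf = [gaussian_pdf(z, mu, sigma) for z in zgrid]

# The grid point with the highest density is the mode, which sits at the mean
grid_mode = zgrid[pdf.index(max(pdf))]
# A simple Riemann sum checks that the PDF integrates to roughly 1 on this grid
area = sum(pdf) * dz
print(grid_mode, area)
```

A wider `sigma` lowers and broadens the curve without moving the peak, which is why the example PDFs above differ in width while each still has one well-defined mode.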

In [None]:
plt.figure(figsize=(12,12))
plt.scatter(truez, zmode, s=3)
plt.plot([0,3],[0,3], 'k--')
plt.xlabel("true redshift", fontsize=12)
plt.ylabel("z mode", fontsize=12)
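
To put a number on the scatter plot, one option is to summarize the normalized residuals (z_est - z_true) / (1 + z_true), a convention commonly used for photo-z point estimates. The sketch below is not part of the original demo: `photoz_summary` is a hypothetical helper, and the sample values are made up purely to show the call.

```python
import statistics


def photoz_summary(true_z, est_z):
    """Return (median bias, scaled MAD) of the normalized residuals.

    Hypothetical helper, not part of RAIL: residuals are defined as
    (z_est - z_true) / (1 + z_true).
    """
    ez = [(e - t) / (1.0 + t) for t, e in zip(true_z, est_z)]
    bias = statistics.median(ez)
    # Median absolute deviation, scaled by 1.4826 so that it is comparable
    # to a standard deviation for Gaussian-distributed residuals
    mad = statistics.median([abs(x - bias) for x in ez])
    return bias, 1.4826 * mad


# Made-up values, only to demonstrate the call
demo_true = [0.2, 0.5, 1.0, 1.5]
demo_est = [0.22, 0.48, 1.05, 1.45]
bias, sigma_mad = photoz_summary(demo_true, demo_est)
print(bias, sigma_mad)
```

In the notebook, the same function could be called as `photoz_summary(truez, zmode)`, since it only needs two equal-length sequences of redshifts.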