# RAIL Estimation Tutorial Notebook

**Authors:** Sam Schmidt, Eric Charles, Alex Malz, others...

**Last run successfully:** Feb 9, 2026

This is a notebook demonstrating some of the `estimation` features of the LSST-DESC `RAIL`-iverse packages.  

The `rail.estimation` subpackage contains infrastructure to run multiple production-level photo-z codes.  There is a minimal superclass that sets up some file paths and variable names. Each specific photo-z code resides in a subclass in `rail.estimation.algos` with algorithm-specific setup variables.  More extensive documentation is available on Read the Docs here:
https://rail-hub.readthedocs.io/en/latest/

**Note:** If you're interested in running this in pipeline mode, see [`00_Quick_Start_in_Estimation.ipynb`](https://github.com/LSSTDESC/rail/blob/main/pipeline_examples/estimation_examples/00_Quick_Start_in_Estimation.ipynb) in the `pipeline_examples/estimation_examples/` folder.


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import rail.interactive as ri
import tables_io
from rail.utils.path_utils import find_rail_file

## The code-specific parameters
Each photo-z algorithm has code-specific parameters necessary to initialize the code.  These values can be input on the command line, or passed in via a dictionary.<br>

Let's start with a very simple demonstration using `k_nearneigh`, a RAIL wrapper around `sklearn`'s nearest neighbor (NN) method.  It calculates a normalized weight for the K nearest neighbors based on their distance and makes a PDF as a sum of K Gaussians, each at the redshift of the training galaxy with amplitude based on the distance weight, and a Gaussian width set by the user.  This is a toy model estimator, but it actually performs very well for representative data sets. There are configuration parameters for the names of columns, random seeds, etc... in `KNearNeighEstimator` with best-guess sensible defaults based on preliminary experimentation in DESC. See the [KNearNeigh code](https://github.com/LSSTDESC/RAIL/blob/eac-dev/rail/estimation/algos/k_nearneigh.py) for more details, but here is a minimal set to run:

In [None]:
knn_dict = dict(
    zmin=0.0,
    zmax=3.0,
    nzbins=301,
    trainfrac=0.75,
    sigma_grid_min=0.01,
    sigma_grid_max=0.07,
    ngrid_sigma=10,
    nneigh_min=3,
    nneigh_max=7,
    hdf5_groupname="photometry",
)

Here, `trainfrac` sets the proportion of the training data used to train the algorithm; the remaining fraction is used to validate both the width of the Gaussians used in constructing the PDF and the number of neighbors used in each PDF.  The CDE loss is computed for every combination of Gaussian width and number of neighbors on a grid, and the combination with the lowest CDE loss is used.  `sigma_grid_min`, `sigma_grid_max`, and `ngrid_sigma` specify the grid of sigma values to test, while `nneigh_min` and `nneigh_max` are the integer values between which we will check the loss.

`zmin`, `zmax`, and `nzbins` are used to create a grid on which the CDE Loss is computed when minimizing the loss to find the best values for sigma and number of neighbors to use.
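
To make the loss minimization above concrete, here is a minimal numpy sketch of a CDE loss evaluated on a redshift grid and scanned over a grid of sigma values. It is an illustration under stated assumptions, not RAIL's internal implementation: the helper names (`gaussian_pdfs`, `cde_loss`), the toy single-Gaussian PDFs, and the synthetic data are all made up for this example, and the loss uses the common empirical form (mean of the integral of p(z)^2, minus twice the mean of p evaluated at the true redshift). The sketch scans sigma only; the number of neighbors would be a second loop of the same kind.

```python
import numpy as np


def gaussian_pdfs(centers, sigma, grid):
    """Return one Gaussian PDF per row, evaluated on `grid` (toy stand-in for an estimator)."""
    arg = (grid[None, :] - centers[:, None]) / sigma
    return np.exp(-0.5 * arg**2) / (sigma * np.sqrt(2.0 * np.pi))


def cde_loss(pdfs, grid, z_true):
    """Empirical CDE loss for gridded PDFs (rows are galaxies, columns are grid points)."""
    # First term: mean over galaxies of the integral of p(z)^2, via the trapezoid rule.
    sq = pdfs**2
    integral = np.sum(0.5 * (sq[:, 1:] + sq[:, :-1]) * np.diff(grid), axis=1)
    # Second term: each PDF evaluated at the grid point nearest the true redshift.
    nearest = np.abs(grid[None, :] - z_true[:, None]).argmin(axis=1)
    at_truth = pdfs[np.arange(z_true.size), nearest]
    return integral.mean() - 2.0 * at_truth.mean()


# Synthetic validation set (hypothetical): PDF centers scatter around the truth by 0.03.
rng = np.random.default_rng(0)
demo_grid = np.linspace(0.0, 3.0, 1501)
demo_ztrue = rng.uniform(0.3, 2.7, size=2000)
demo_centers = demo_ztrue + rng.normal(0.0, 0.03, size=demo_ztrue.size)

# Same kind of sigma grid as sigma_grid_min / sigma_grid_max / ngrid_sigma above.
sigma_grid = np.linspace(0.01, 0.07, 10)
losses = np.array(
    [cde_loss(gaussian_pdfs(demo_centers, s, demo_grid), demo_grid, demo_ztrue)
     for s in sigma_grid]
)
best_sigma = sigma_grid[np.argmin(losses)]
print(f"lowest CDE loss at sigma = {best_sigma:.3f}")
```

In this toy setup the loss is lowest when the PDF width is close to the actual scatter of the centers, which is the behavior the grid search relies on.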

Now, let's load our training and test data, which are stored in hdf5 format.  We'll read both files into memory with `tables_io` so that they can be passed directly to the functions below.

In [None]:
trainFile = find_rail_file("examples_data/testdata/test_dc2_training_9816.hdf5")
testFile = find_rail_file("examples_data/testdata/test_dc2_validation_9816.hdf5")
training_data = tables_io.read(trainFile)
test_data = tables_io.read(testFile)

We will begin by training the algorithm with its `Informer` stage. Any essential parameters missing from the parameter dictionary are set to their default values.

We need to train the KDTree, which is done with the `inform()` method present in every `Informer` stage. The parameter `model` is the name under which the trained model object will be saved, in a format specific to the estimation algorithm in question.  In this case the format is a pickle file called `demo_knn.pkl`.

`KNearNeighInformer.inform` finds the best sigma and NNeigh and stores those along with the KDTree in the model.

In [None]:
# %%time
pz_model = ri.estimation.algos.k_nearneigh.k_near_neigh_informer(
    training_data=training_data,
    **knn_dict,
)

We can now set up the main photo-z `Estimator` stage and run our algorithm on the data to produce simple photo-z estimates.  Note that we pass in the trained model that we computed with the `Informer` stage:

In [None]:
results = ri.estimation.algos.k_nearneigh.k_near_neigh_estimator(
    input_data=test_data, model=pz_model["model"]
)

The output is a `qp.Ensemble` containing the redshift PDFs.  This `Ensemble` also includes a photo-z point estimate derived from the PDFs, the mode by default (though there will soon be a keyword option to choose a different point estimation method or to skip the calculation of a point estimate).  The modes are stored in the "ancillary" data within the `Ensemble`.  By default they are stored in a 1xM array, so you may need to include a `.flatten()` to flatten the array.  The zmode values in the ancillary data can be accessed via:

In [None]:
zmode = results["output"].ancil["zmode"].flatten()

Let's plot the redshift mode against the true redshifts to see how they look:

In [None]:
plt.figure(figsize=(8, 8))
plt.scatter(
    test_data["photometry"]["redshift"], zmode, s=1, c="k", label="simple NN mode"
)
plt.plot([0, 3], [0, 3], "r--")
plt.xlabel("true redshift")
plt.ylabel("simple NN photo-z")

Not bad, given our very simple estimator!  For the PDFs, `KNearNeigh` is storing each PDF as a Gaussian mixture model parameterization where each PDF is represented by a set of N Gaussians for each galaxy.  `qp.Ensemble` objects have all the methods of `scipy.stats.rv_continuous` objects so we can evaluate the PDF on a set of grid points with the built-in `.pdf` method.  Let's pick a single galaxy from our sample and evaluate and plot the PDF, the mode, and true redshift:

In [None]:
zgrid = np.linspace(0, 3.0, 301)

In [None]:
galid = 9529
single_gal = np.squeeze(results["output"][galid].pdf(zgrid))
single_zmode = zmode[galid]
truez = test_data["photometry"]["redshift"][galid]
plt.plot(zgrid, single_gal, color="k", label="single pdf")
plt.axvline(single_zmode, color="k", ls="--", label="mode")
plt.axvline(truez, color="r", label="true redshift")
plt.legend(loc="upper right")
plt.xlabel("redshift")
plt.ylabel("p(z)")

We see that KNearNeigh PDFs do consist of a number of discrete Gaussians, and many have quite a bit of substructure.  This is a naive estimator, and some of these features are likely spurious.
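
The sum-of-Gaussians construction behind these PDFs can be sketched in a few lines of numpy. This is a toy version under stated assumptions, not the code in `KNearNeighEstimator`: the function name `knn_mixture_pdf` is hypothetical, neighbors are found by brute force rather than with a KDTree, and inverse-distance weighting is assumed as one possible "weight based on distance".

```python
import numpy as np


def knn_mixture_pdf(train_colors, train_z, query_color, grid, k=5, sigma=0.05):
    """Toy KNN photo-z PDF: a weighted sum of k Gaussians at the neighbors' redshifts."""
    # Brute-force Euclidean distances from the query to every training galaxy.
    dist = np.linalg.norm(train_colors - query_color, axis=1)
    idx = np.argsort(dist)[:k]
    # Assumed weighting scheme: inverse distance, normalized to sum to one.
    weights = 1.0 / (dist[idx] + 1e-12)
    weights /= weights.sum()
    # One Gaussian of fixed width sigma per neighbor, centered on its redshift.
    arg = (grid[None, :] - train_z[idx][:, None]) / sigma
    components = np.exp(-0.5 * arg**2) / (sigma * np.sqrt(2.0 * np.pi))
    return weights @ components


# Synthetic training set (hypothetical): three "colors" that track redshift with noise.
rng = np.random.default_rng(1)
demo_train_z = rng.uniform(0.5, 2.5, size=500)
demo_train_colors = demo_train_z[:, None] + rng.normal(0.0, 0.05, size=(500, 3))

demo_grid = np.linspace(0.0, 3.0, 301)
demo_pdf = knn_mixture_pdf(
    demo_train_colors, demo_train_z, np.full(3, 1.0), demo_grid, k=5, sigma=0.05
)
demo_area = np.sum(0.5 * (demo_pdf[1:] + demo_pdf[:-1]) * np.diff(demo_grid))
print(f"integral of the toy PDF: {demo_area:.4f}")
```

Because each component is a narrow Gaussian at one training redshift, neighbors at well-separated redshifts show up as separate peaks, which is the substructure visible in the plot above.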

## FlexZBoost

That illustrates the basics. Now let's try `FlexZBoostEstimator`.  FlexZBoost is available in the [rail_flexzboost](https://github.com/LSSTDESC/rail_flexzboost/) repo and can be installed with

`pip install pz-rail-flexzboost`

on the command line or from source.  Once installed, it will function the same as any of the other estimators included in the primary `rail` repo.

`FlexZBoostEstimator` approximates the conditional density estimate for each PDF with a set of weights on a set of basis functions.  This can save space relative to a gridded parameterization, but it also leads to residual "bumps" in the PDF intrinsic to the underlying cosine or Fourier parameterization.  For this reason, `FlexZBoostEstimator` has a post-processing stage where it "trims" (i.e. sets to zero) any small peaks, or "bumps", below a certain `bump_thresh` threshold.

One of the dominant features seen in our PhotoZDC1 analysis of multiple photo-z codes (Schmidt, Malz et al. 2020) was that photo-z estimates were often, in general, overconfident or underconfident in their overall uncertainty in PDFs.  To remedy this, `FlexZBoostEstimator` has an additional post-processing step where it applies a "sharpening" parameter `sharpen` that modulates the width of the PDFs according to a power law.

A portion of the training data is held in reserve to determine best-fit values for both `bump_thresh` and `sharpening`, which we currently find by simply calculating the CDE loss for a grid of `bump_thresh` and `sharpening` values; once those values are set FlexZBoost will re-train its density estimate model with the full dataset. A more sophisticated hyperparameter fitting procedure may be implemented in the future.

We'll start with a dictionary of setup parameters for FlexZBoostEstimator, just as we had for the k-nearest neighbor estimator.  Some of the parameters are the same as in k-nearest neighbor above, `zmin`, `zmax`, `nzbins`.  However, FlexZBoostEstimator performs a more in depth training and as such has more input parameters to control its behavior.  These parameters are:

- `basis_system`: which basis system to use in the density estimate. The default is `cosine` but `fourier` is also an option
- `max_basis`: the maximum number of basis functions to use for PDFs
- `regression_params`: a dictionary of options fed to `xgboost` that control the maximum depth and the `objective` function.  An update in `xgboost` means that `objective` should now be set to `reg:squarederror` for proper functioning.
- `trainfrac`: The fraction of the training data to use for training the density estimate.  The remaining galaxies will be used for validation of `bump_thresh` and `sharpening`.
- `bumpmin`: the minimum value to test in the `bump_thresh` grid
- `bumpmax`: the maximum value to test in the `bump_thresh` grid
- `nbump`: how many points to test in the `bump_thresh` grid
- `sharpmin`, `sharpmax`, `nsharp`: same as equivalent `bump_thresh` params, but for `sharpening` parameter

In [None]:
fz_dict = dict(
    zmin=0.0,
    zmax=3.0,
    nzbins=301,
    trainfrac=0.75,
    bumpmin=0.02,
    bumpmax=0.35,
    nbump=20,
    sharpmin=0.7,
    sharpmax=2.1,
    nsharp=15,
    max_basis=35,
    basis_system="cosine",
    hdf5_groupname="photometry",
    regression_params={"max_depth": 8, "objective": "reg:squarederror"},
)

`FlexZBoostInformer` operates on the training set and writes a file containing the estimation model.  `FlexZBoost` uses xgboost to determine a conditional density estimate model, and also fits the `bump_thresh` and `sharpen` parameters described above.

`FlexZBoost` is a bit more sophisticated than the earlier k-nearest neighbor estimator, so it will take a bit longer to train, but not drastically so: training still takes under a minute on a semi-new laptop.  We specified the name of the model file, `demo_FZB_model.pkl`, which will store our trained model for use with the estimation stage.

In [None]:
%%time
flexzboost_model = ri.estimation.algos.flexzboost.flex_z_boost_informer(
    training_data=training_data, **fz_dict
)

## Loading a pre-trained model

If we have an existing pretrained model, for example the one in the file `demo_FZB_model.pkl`, we can skip the training step in subsequent runs of an estimator. Loading the pickled model avoids repeating the training stage for this specific training data, which can save time for larger training sets where the model takes longer to create.
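
As a generic illustration of saving and reloading a pickled object, here is a self-contained round trip using only the standard library. The dictionary is a stand-in and not a real FlexZBoost model, the file name is hypothetical, and how a reloaded model is handed to the estimator is not covered by this sketch.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained model (hypothetical contents, for illustration only).
stand_in_model = {"bump_thresh": 0.1, "sharpen": 1.2}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.pkl"  # hypothetical file name

    # Save once, after training.
    with open(model_path, "wb") as handle:
        pickle.dump(stand_in_model, handle)

    # Reload in a later session instead of retraining.
    with open(model_path, "rb") as handle:
        loaded_model = pickle.load(handle)

print(loaded_model)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.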

Now, let's compute photo-z's using the `estimate` method.  

In [None]:
fzresults = ri.estimation.algos.flexzboost.flex_z_boost_estimator(
    input_data=test_data, model=flexzboost_model["model"]
)

We can calculate the median and mode values of the PDFs and plot their distribution (in this case the modes are already stored in the qp.Ensemble's ancillary data, but here is an example of computing the point estimates via qp directly):

In [None]:
fz_medians = fzresults["output"].median()
fz_modes = fzresults["output"].mode(grid=zgrid)
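
For intuition about what these two point estimates mean, here is a numpy sketch that computes a mode and a median from PDFs tabulated on a grid. It is not how `qp` computes them internally; the helper names (`grid_mode`, `grid_median`) and the toy Gaussian PDFs are assumptions made for this example.

```python
import numpy as np


def grid_mode(pdfs, grid):
    """Mode of each gridded PDF: the grid point where the PDF is largest."""
    return grid[np.argmax(pdfs, axis=1)]


def grid_median(pdfs, grid):
    """Median of each gridded PDF: where the cumulative distribution reaches 0.5."""
    # Build the CDF with a cumulative trapezoid rule, starting from zero.
    steps = 0.5 * (pdfs[:, 1:] + pdfs[:, :-1]) * np.diff(grid)
    cdf = np.concatenate([np.zeros((pdfs.shape[0], 1)), np.cumsum(steps, axis=1)], axis=1)
    cdf /= cdf[:, -1:]
    # Invert the CDF at 0.5 by linear interpolation, one galaxy at a time.
    return np.array([np.interp(0.5, row, grid) for row in cdf])


# Two toy Gaussian PDFs (hypothetical) centered at z=0.5 and z=1.2.
demo_grid = np.linspace(0.0, 3.0, 301)
demo_centers = np.array([0.5, 1.2])
demo_pdfs = np.exp(-0.5 * ((demo_grid[None, :] - demo_centers[:, None]) / 0.1) ** 2)

demo_modes = grid_mode(demo_pdfs, demo_grid)
demo_medians = grid_median(demo_pdfs, demo_grid)
print(demo_modes, demo_medians)
```

For symmetric single-peaked PDFs like these the two estimates coincide; for multimodal or skewed PDFs they can differ substantially, which is why the choice of point estimate matters.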

In [None]:
plt.hist(fz_medians, bins=np.linspace(-0.005, 3.005, 101))
plt.xlabel("redshift")
plt.ylabel("Number")

We can plot an example PDF, its median redshift, and its true redshift from the results file:

In [None]:
galid = 9529
single_gal = np.squeeze(fzresults["output"][galid].pdf(zgrid))
single_zmedian = fz_medians[galid]
truez = test_data["photometry"]["redshift"][galid]
plt.plot(zgrid, single_gal, color="k", label="single pdf")
plt.axvline(single_zmedian, color="k", ls="--", label="median")
plt.axvline(truez, color="r", label="true redshift")
plt.legend(loc="upper right")
plt.xlabel("redshift")
plt.ylabel("p(z)")

We can also plot a point estimate against the truth as a visual diagnostic:

In [None]:
plt.figure(figsize=(10, 10))
plt.scatter(test_data["photometry"]["redshift"], fz_modes, s=1, c="k")
plt.plot([0, 3], [0, 3], "r--")
plt.xlabel("true redshift")
plt.ylabel("photoz mode")
plt.title("mode point estimate derived from FlexZBoost PDFs")

The results look very good! FlexZBoost is a mature algorithm, and with representative training data we see a very tight correlation with true redshift and few outliers due to physical degeneracies.<br>
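
To put numbers on "tight correlation" and "few outliers", one option is a small summary of the scaled residuals. The sketch below is illustrative and assumes common conventions that are not taken from this notebook: residuals scaled by (1 + z), scatter measured with the normalized median absolute deviation, and outliers defined by a fixed cut of 0.15. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def point_estimate_summary(z_phot, z_true, outlier_cut=0.15):
    """Bias, robust scatter, and outlier fraction of scaled photo-z residuals."""
    dz = (z_phot - z_true) / (1.0 + z_true)
    bias = np.median(dz)
    # Normalized median absolute deviation: a robust stand-in for a Gaussian sigma.
    scatter = 1.4826 * np.median(np.abs(dz - bias))
    outlier_frac = np.mean(np.abs(dz) > outlier_cut)
    return {"bias": bias, "scatter": scatter, "outlier_frac": outlier_frac}


# Synthetic example (hypothetical data): 2% scaled scatter plus 5% random failures.
rng = np.random.default_rng(2)
demo_ztrue = rng.uniform(0.0, 3.0, size=5000)
demo_zphot = demo_ztrue + 0.02 * (1.0 + demo_ztrue) * rng.normal(size=demo_ztrue.size)
demo_zphot[:250] = rng.uniform(0.0, 3.0, size=250)

demo_summary = point_estimate_summary(demo_zphot, demo_ztrue)
print(demo_summary)
```

With the notebook's variables, the equivalent call would take `fz_modes` (flattened if needed) and `test_data["photometry"]["redshift"]` as inputs.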