# Test Sampled Summarizers

**Author:** Sam Schmidt

**Last successfully run:** Feb 9, 2026

June 28 update:
I modified the summarizers to output not just N sample N(z) distributions (saved to the file specified via the `output` keyword), but also the single fiducial N(z) estimate (saved to the file specified via the `single_NZ` keyword).  I also updated NZDir and included it in this example notebook.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import rail.interactive as ri
import tables_io
from rail.utils.path_utils import find_rail_file

To create some N(z) distributions, we first need some PDFs to work with. For a quick demo, let's just run some photo-z's with the KNearNeighEstimator, training it on the 10,000 training galaxies and generating ~20,000 PDFs. The data come from healpix 9816 of cosmoDC2_v1.1.4 and are included in the RAIL repo:

In [None]:
knn_dict = dict(
    zmin=0.0,
    zmax=3.0,
    nzbins=301,
    trainfrac=0.75,
    sigma_grid_min=0.01,
    sigma_grid_max=0.07,
    ngrid_sigma=10,
    nneigh_min=3,
    nneigh_max=7,
    hdf5_groupname="photometry",
)

In [None]:
trainFile = find_rail_file("examples_data/testdata/test_dc2_training_9816.hdf5")
testFile = find_rail_file("examples_data/testdata/test_dc2_validation_9816.hdf5")
training_data = tables_io.read(trainFile)
test_data = tables_io.read(testFile)

In [None]:
# train knnpz
model = ri.estimation.algos.k_nearneigh.k_near_neigh_informer(
    training_data=training_data, **knn_dict
)["model"]

In [None]:
qp_data = ri.estimation.algos.k_nearneigh.k_near_neigh_estimator(
    input_data=test_data, model=model
)["output"]

So, `qp_data` now contains the 20,000 PDFs from KNearNeighEstimator, and we can feed this into three summarizers to generate an overall N(z) distribution.  We won't bother with any tomographic selections for this demo, just the overall N(z).  It is stored as `qp_data`, but has also been saved to disk as `output_KNN.fits` as an astropy table.  If you want to read in this data to grab the qp Ensemble at a later stage, you can do this via qp with `ens = qp.read("output_KNN.fits")`.

I coded up **quick and dirty** bootstrap versions of the `NaiveStackSummarizer`, `PointEstHistSummarizer`, and `VarInfStackSummarizer` summarizers.  These are not optimized and not parallel (an issue has been created for a future update), but they do produce N different bootstrap realizations of the overall N(z), which are returned as a qp Ensemble. (Note: the previous versions of these summarizers returned only the single overall N(z) rather than samples.)

## Naive Stack

Naive stack just "stacks", i.e. sums up, the PDFs and returns a qp.interp distribution with bins defined by np.linspace(zmin, zmax, nzbins). We will create a stack with 41 bins and generate 20 bootstrap realizations:

In [None]:
naive_results = ri.estimation.algos.naive_stack.naive_stack_summarizer(
    input_data=qp_data,
    zmin=0.0,
    zmax=3.0,
    nzbins=41,
    n_samples=20,
)
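To make the bootstrap stacking idea concrete, here is a minimal numpy-only sketch. It is not the RAIL implementation: the toy Gaussian PDFs and the helper names `normalize` and `bootstrap_stack` are my own assumptions for illustration. Each realization resamples the galaxies with replacement, sums their gridded PDFs, and renormalizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the photo-z PDFs: Gaussians evaluated on a shared grid.
# (Hypothetical data; the notebook itself uses the KNN ensemble in `qp_data`.)
zgrid = np.linspace(0.0, 3.0, 41)
centers = rng.uniform(0.2, 2.0, size=500)
pdfs = np.exp(-0.5 * ((zgrid[None, :] - centers[:, None]) / 0.1) ** 2)


def normalize(yvals, grid):
    """Scale gridded values so they integrate to 1 (trapezoid rule)."""
    area = np.sum(0.5 * (yvals[1:] + yvals[:-1]) * np.diff(grid))
    return yvals / area


def bootstrap_stack(pdfs, grid, n_samples, rng):
    """Return an (n_samples, len(grid)) array of stacked N(z) realizations."""
    n_gal = pdfs.shape[0]
    samples = np.empty((n_samples, grid.size))
    for i in range(n_samples):
        idx = rng.integers(0, n_gal, size=n_gal)  # resample with replacement
        samples[i] = normalize(pdfs[idx].sum(axis=0), grid)
    return samples


fiducial_nz = normalize(pdfs.sum(axis=0), zgrid)  # stack of all PDFs
sample_nz = bootstrap_stack(pdfs, zgrid, n_samples=20, rng=rng)
```

Here `fiducial_nz` plays the role of the single N(z) estimate and `sample_nz` the role of the bootstrap ensemble.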

The results are now in `naive_results`, which is indexed by key: the actual *ensemble* is stored under `"output"`. Let's grab the ensemble and plot a few of the bootstrap sample N(z) estimates:

In [None]:
newens = naive_results["output"]

In [None]:
fig, axs = plt.subplots(figsize=(8, 6))
for i in range(0, 20, 2):
    newens[i].plot_native(axes=axs, label=f"sample {i}")
axs.plot([0, 3], [0, 0], "k--")
axs.set_xlim(0, 3)
axs.legend(loc="upper right")

The summarizer also outputs a **second** file containing the fiducial N(z).  We saved the fiducial N(z) in the file "NaiveStack_NZ.hdf5"; the same estimate is available under the `"single_NZ"` key of `naive_results`, so let's grab it and plot it with the native plotter:

In [None]:
naive_nz = naive_results["single_NZ"]
naive_nz.plot_native(xlim=(0, 3))

## Point Estimate Hist
PointEstHistSummarizer takes the point estimate mode of each PDF and then histograms these. We'll again generate 20 bootstrap samples, this time of a 41-bin histogram, and plot a few of the resultant histograms.
Note: plotting of the histogram distribution in qp is a little wonky (alpha appears to be broken), so this plot is not the best:

In [None]:
pens = ri.estimation.algos.point_est_hist.point_est_hist_summarizer(
    input_data=qp_data,
    zmin=0.0,
    zmax=3.0,
    nzbins=41,
    n_samples=20,
)["output"]
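As a rough illustration of what a bootstrapped point-estimate histogram does, here is a numpy-only sketch. The toy PDFs, the choice of 42 edges for 41 bins, and the helper name `bootstrap_hist` are assumptions of mine, not taken from the RAIL code.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy gridded PDFs (hypothetical data standing in for `qp_data`).
zgrid = np.linspace(0.0, 3.0, 301)
centers = rng.uniform(0.2, 2.0, size=500)
pdfs = np.exp(-0.5 * ((zgrid[None, :] - centers[:, None]) / 0.1) ** 2)

# Point estimate: the grid value where each PDF peaks (its mode).
zmode = zgrid[np.argmax(pdfs, axis=1)]

n_bins = 41
bin_edges = np.linspace(0.0, 3.0, n_bins + 1)  # 41 bins need 42 edges


def bootstrap_hist(points, edges, n_samples, rng):
    """Histogram n_samples resamplings (with replacement) of the point estimates."""
    out = np.empty((n_samples, edges.size - 1))
    for i in range(n_samples):
        resampled = rng.choice(points, size=points.size, replace=True)
        out[i], _ = np.histogram(resampled, bins=edges, density=True)
    return out


fiducial_hist, _ = np.histogram(zmode, bins=bin_edges, density=True)
hist_samples = bootstrap_hist(zmode, bin_edges, n_samples=20, rng=rng)
```

With `density=True` each row integrates to 1 over the bins, so the rows can be compared directly as N(z) estimates.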

In [None]:
fig, axs = plt.subplots(figsize=(8, 6))
pens[0].plot_native(axes=axs, fc=[0, 0, 1, 0.01])
pens[1].plot_native(axes=axs, fc=[0, 1, 0, 0.01])
pens[4].plot_native(axes=axs, fc=[1, 0, 0, 0.01])
axs.set_xlim(0, 3)
axs.legend()

Again, we have saved the fiducial N(z) in a separate file, "point_NZ.hdf5"; we could read that data in if we desired.

## VarInfStackSummarizer

VarInfStackSummarizer implements Markus' variational inference scheme and returns a qp.interp gridded distribution. VarInfStackSummarizer tends to get a little wonky if you use too many bins, so we'll only use 25 bins. This time let's generate 10 samples:

In [None]:
vens = ri.estimation.algos.var_inf.var_inf_stack_summarizer(
    input_data=qp_data, zmin=0.0, zmax=3.0, nzbins=25, niter=10, n_samples=10
)
vens

Let's plot the fiducial N(z) for this distribution:

In [None]:
varinf_nz = vens["single_NZ"]
varinf_nz.plot_native(xlim=(0, 3))

## NZDir

NZDirSummarizer is a different type of summarizer: it takes a weighted set of neighbors to a set of training spectroscopic objects to reconstruct the redshift distribution of the photometric sample.  I implemented a bootstrap of the **spectroscopic data** rather than the photometric data, both because it was much easier computationally, and because I think that the spectroscopic variance is more important to take account of than a simple bootstrap of the large photometric sample.
We must first run the informer (`nz_dir_informer` below) to train up the K nearest neighbor tree used by NZDirSummarizer, then we will run the summarizer (`nz_dir_summarizer`) to actually construct the N(z) estimate.

Like PointEstHistSummarizer, NZDirSummarizer returns a qp.hist ensemble of samples.

In [None]:
nzdir_model = ri.estimation.algos.nz_dir.nz_dir_informer(
    training_data=training_data, n_neigh=8
)["model"]

In [None]:
nzd_summary = ri.estimation.algos.nz_dir.nz_dir_summarizer(
    input_data=test_data,
    leafsize=20,
    zmin=0.0,
    zmax=3.0,
    nzbins=31,
    model=nzdir_model,
    hdf5_groupname="photometry",
)
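The neighbor-weighting idea, and the choice to bootstrap the spectroscopic sample, can be sketched with plain numpy. This is a deliberately simplified variant and not RAIL's NZDir algorithm: here each spectroscopic redshift is weighted by how many photometric objects count it among their k nearest neighbors in a single toy "color" feature. The data and the helper name `neighbor_weighted_nz` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical toy data: one "color" feature that tracks redshift with noise.
n_spec, n_phot = 200, 2000
spec_z = rng.uniform(0.1, 2.0, size=n_spec)
spec_color = spec_z + rng.normal(0.0, 0.05, size=n_spec)
phot_color = rng.uniform(0.1, 2.0, size=n_phot) + rng.normal(0.0, 0.05, size=n_phot)

bin_edges = np.linspace(0.0, 3.0, 32)  # 31 bins
n_neigh = 8


def neighbor_weighted_nz(s_color, s_z, p_color, k, edges):
    """Weight each spec-z by how often it is a k-nearest neighbor of a photometric object."""
    dist = np.abs(p_color[:, None] - s_color[None, :])  # shape (n_phot, n_spec)
    nearest = np.argsort(dist, axis=1)[:, :k]           # k nearest spec objects
    weights = np.bincount(nearest.ravel(), minlength=s_z.size).astype(float)
    hist, _ = np.histogram(s_z, bins=edges, weights=weights, density=True)
    return hist


fiducial = neighbor_weighted_nz(spec_color, spec_z, phot_color, n_neigh, bin_edges)

# Bootstrap the spectroscopic sample, leaving the photometric sample fixed.
n_samples = 20
samples = np.empty((n_samples, bin_edges.size - 1))
for i in range(n_samples):
    idx = rng.integers(0, n_spec, size=n_spec)  # resample spec objects
    samples[i] = neighbor_weighted_nz(
        spec_color[idx], spec_z[idx], phot_color, n_neigh, bin_edges
    )
```

Resampling only the small spectroscopic set keeps each realization cheap, since the large photometric sample is never resampled.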

In [None]:
nzd_ens = nzd_summary["output"]
nzdir_nz = nzd_summary["single_NZ"]

In [None]:
fig, axs = plt.subplots(figsize=(10, 8))
nzd_ens[0].plot_native(axes=axs, fc=[0, 0, 1, 0.01])
nzd_ens[1].plot_native(axes=axs, fc=[0, 1, 0, 0.01])
nzd_ens[4].plot_native(axes=axs, fc=[1, 0, 0, 0.01])
axs.set_xlim(0, 3)
axs.legend()

As we also wrote out the single estimate of N(z), we can plot that as well. It is the `single_NZ` entry of `nzd_summary` that we grabbed above (corresponding to the second file written, in this case "NZDir_NZ.hdf5"):

In [None]:
nzdir_nz.plot_native(xlim=(0, 3))

## Results

All four results are qp distributions: NaiveStackSummarizer and VarInfStackSummarizer return qp.interp distributions, while PointEstHistSummarizer and NZDirSummarizer return qp.hist distributions.  Even with the different distributions you can use qp functionality to do things like determine the means, modes, etc. of the distributions.  You could then use the std dev of any of these to estimate a 1 sigma "shift", etc.

In [None]:
zgrid = np.linspace(0, 3, 41)
names = ["naive", "point", "varinf", "nzdir"]
enslist = [newens, pens, vens["output"], nzd_ens]
results_dict = {}
for nm, en in zip(names, enslist):
    results_dict[f"{nm}_modes"] = en.mode(grid=zgrid).flatten()
    results_dict[f"{nm}_means"] = en.mean().flatten()
    results_dict[f"{nm}_std"] = en.std().flatten()

In [None]:
results_dict
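For intuition about what these summary statistics are, here is a numpy-only sketch that computes them by hand from gridded N(z) samples. The toy Gaussian samples are hypothetical, and the simple rectangle-rule integrals are an assumption of the sketch, not how qp computes them.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical bootstrap N(z) samples: Gaussians whose centers jitter slightly.
zgrid = np.linspace(0.0, 3.0, 301)
dz = zgrid[1] - zgrid[0]
centers = 1.0 + rng.normal(0.0, 0.01, size=20)
nz_samples = np.exp(-0.5 * ((zgrid[None, :] - centers[:, None]) / 0.3) ** 2)
nz_samples /= nz_samples.sum(axis=1, keepdims=True) * dz  # normalize each row

# Per-sample mean, standard deviation, and mode of N(z).
means = np.sum(zgrid * nz_samples, axis=1) * dz
variances = np.sum((zgrid - means[:, None]) ** 2 * nz_samples, axis=1) * dz
stds = np.sqrt(variances)
modes = zgrid[np.argmax(nz_samples, axis=1)]

# Scatter of the mean redshift across bootstrap samples: a 1 sigma "shift".
one_sigma_shift = np.std(means)
```

Note the difference between `stds` (the width of each N(z)) and `one_sigma_shift` (how much the mean moves from one bootstrap sample to the next).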

You can also use qp to compute quantities such as the pdf, cdf, ppf, etc. on any grid that you want; much of the functionality of scipy.stats distributions has been inherited by qp ensembles.

In [None]:
newgrid = np.linspace(0.005, 2.995, 35)
naive_pdf = newens.pdf(newgrid)
point_cdf = pens.cdf(newgrid)
# ppf is the inverse cdf, so evaluate it on quantiles in (0, 1), not on redshifts
quantgrid = np.linspace(0.01, 0.99, 35)
var_ppf = vens["output"].ppf(quantgrid)

In [None]:
plt.plot(newgrid, naive_pdf[0])

In [None]:
plt.plot(newgrid, point_cdf[0])

In [None]:
plt.plot(quantgrid, var_ppf[0])
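One thing worth keeping straight is that the pdf and cdf take redshifts as input, while the ppf (the inverse cdf) takes probabilities between 0 and 1. A minimal numpy sketch on a toy Gaussian N(z), using a cumulative trapezoid rule and linear interpolation (my choices for the sketch, not qp internals):

```python
import numpy as np

zgrid = np.linspace(0.0, 3.0, 301)
pdf = np.exp(-0.5 * ((zgrid - 1.0) / 0.2) ** 2)  # toy N(z), unnormalized

# Cumulative trapezoid integral of the pdf gives the (unnormalized) cdf.
steps = 0.5 * (pdf[1:] + pdf[:-1]) * np.diff(zgrid)
cdf = np.concatenate([[0.0], np.cumsum(steps)])
pdf = pdf / cdf[-1]  # normalize so the pdf integrates to 1
cdf = cdf / cdf[-1]  # and the cdf ends at 1

# ppf: invert the cdf by interpolating redshift as a function of probability.
quantiles = np.array([0.16, 0.5, 0.84])
z_at_q = np.interp(quantiles, cdf, zgrid)
```

For this toy distribution `z_at_q[1]` is the median redshift, and the outer two values bracket roughly the central 68% of the distribution.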

## Shifts

If you want to "shift" a PDF, you can just evaluate the PDF on a shifted grid. For example, to shift the PDF by +0.0675 in redshift, you could evaluate it on a grid offset by -0.0675 and plot the result against the original grid.  For now we can just do this "by hand"; we could easily implement `shift` functionality in qp, I think.

In [None]:
def_grid = np.linspace(0.0, 3.0, 41)
shift_grid = def_grid - 0.0675
native_nz = newens.pdf(def_grid)
shift_nz = newens.pdf(shift_grid)

In [None]:
fig = plt.figure(figsize=(12, 10))
plt.plot(def_grid, native_nz[0], label="original")
plt.plot(def_grid, shift_nz[0], label="shifted +0.0675")
plt.legend(loc="upper right")
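As a sanity check on the sign convention, here is a numpy-only sketch with a toy Gaussian N(z) (hypothetical data, with `np.interp` standing in for the qp pdf evaluation). Evaluating at `z - delta` and plotting against `z` moves the distribution to higher redshift by `delta`, which shows up as the same change in the mean.

```python
import numpy as np

zgrid = np.linspace(0.0, 3.0, 301)
dz = zgrid[1] - zgrid[0]
nz = np.exp(-0.5 * ((zgrid - 1.0) / 0.2) ** 2)  # toy N(z)
nz /= nz.sum() * dz

delta = 0.0675
# shifted(z) = original(z - delta); values outside the grid are set to 0.
shifted = np.interp(zgrid - delta, zgrid, nz, left=0.0, right=0.0)
shifted /= shifted.sum() * dz  # renormalize after interpolation

mean_orig = np.sum(zgrid * nz) * dz
mean_shift = np.sum(zgrid * shifted) * dz
mean_change = mean_shift - mean_orig  # should be close to +delta
```

If part of the distribution is pushed past the edge of the grid, the renormalization step changes its shape, so this trick is only safe for shifts that are small compared to the distance from the grid edges.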

You can estimate how much shift you might expect based on the statistics of our bootstrap samples, say the std dev of the means for the NZDir-derived distribution:

In [None]:
results_dict["nzdir_means"]

In [None]:
spread = np.std(results_dict["nzdir_means"])

In [None]:
spread

Again, not a huge spread in predicted mean redshifts based solely on bootstraps, even with only ~20,000 galaxies.