# Using Photometry to Estimate Photometric Redshifts

**Authors:** Jennifer Scora, Tai Withers, Mubdi Rahman

**Last run successfully:** Feb 9, 2026


This notebook covers the basics of using real galaxy photometry to estimate photometric redshifts with RAIL. We will use a couple of the RAIL algorithms to do this, to get a sense of the differences between algorithms and how they work. We'll go through the following steps:

1. Setting up your data
2. Estimating redshifts with `k-Nearest Neighbours` (KNN)
    * this includes how to save the redshifts to file or convert them to other data formats 
3. Estimating redshifts with `FlexZBoost`

Before we get started, here's a quick introduction to some of the features of RAIL interactive mode. The only RAIL package you need to import is the `rail.interactive` package. This contains all of the interactive functions for all of the RAIL algorithms. You may need to import supporting functions that are not part of a stage separately. To get a sense of what functions/stages are available and for some more detailed instructions, see [the RAIL documentation](https://rail-hub.readthedocs.io/en/latest/source/user_guide/interactive_usage.html).  

In [None]:
# import the packages we'll need
import pickle

import matplotlib.pyplot as plt
import numpy as np
import qp
import rail.interactive as ri
import tables_io
from rail.utils.path_utils import find_rail_file

In this notebook, we'll be using estimation algorithms, which can all be found under the `ri.estimation.algos` namespace. You can see a list of [existing algorithms](https://rail-hub.readthedocs.io/en/latest/source/rail_stages/estimation.html), or you can also explore the available options using tab-complete. Each algorithm will have its own namespace, for example, the namespace for KNN is `k_nearneigh`. Each of these algorithms will then have an `informer` and `estimator` method.

To get the docstring for a function, including what parameters it needs and what it returns, you can put a question mark after the function name (in IPython or Jupyter) or use the `help()` function, as you would with any Python function. For example, we'll be using the KNN estimator function later, so we can take a look at what it needs:

In [None]:
ri.estimation.algos.k_nearneigh.k_near_neigh_estimator?
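
Outside IPython or Jupyter the `?` shortcut is not available, but the standard-library `inspect` module gives similar information. The sketch below uses a hypothetical stand-in function, `example_estimator`, rather than a real RAIL function, so that it runs without RAIL installed; the same calls should work on any Python function.

```python
import inspect


def example_estimator(input_data, model=None, nzbins=301, **kwargs):
    """Hypothetical stand-in for an interactive RAIL function.

    Parameters
    ----------
    input_data : table-like
        Photometry to estimate redshifts for.
    model : object or str, optional
        A calibrated model, or the path to one.
    nzbins : int, optional
        Number of redshift grid points.
    """
    return {"output": None}


# inspect.signature lists the parameter names and their default values
signature = inspect.signature(example_estimator)
defaults = {
    name: param.default
    for name, param in signature.parameters.items()
    if param.default is not inspect.Parameter.empty
}
print(signature)
print(defaults)

# inspect.getdoc returns the cleaned-up docstring text
first_line = inspect.getdoc(example_estimator).splitlines()[0]
print(first_line)
```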

## 1. Setting up your data

In this notebook we'll be using the ['Estimation' stage](https://rail-hub.readthedocs.io/en/latest/source/rail_stages/estimation.html) of RAIL. The estimation algorithms, or `Estimators`, have both an *inform* method and an *estimation* method:

- **Inform**: calibrates the photometric redshift estimation algorithms, so it requires both photometry data and the true redshifts of the galaxies. 
- **Estimation:** uses the calibrated algorithm to estimate the redshifts of the galaxies you're interested in given their photometry. 

This means we'll need two separate data sets, one for calibration and one for estimating on. We'll start by getting those data sets set up. `test_dc2_training_9816.hdf5` is what we'll use for calibrating the *inform* method, and `test_dc2_validation_9816.hdf5` will act as our 'real' galaxy photometry, or target data set, which we will provide to the *estimation* method to get our photometric redshifts (photo-z). 

Both files contain data drawn from the cosmoDC2_v1.1.4 truth extragalactic catalog generated by DESC with model 10-year-depth magnitude uncertainties. The calibration data contains roughly 10,000 galaxies, while the target data contains roughly 20,000. In a real-world scenario, you would bring your own target data, and most likely your own calibration data set as well.

First, we'll use the `find_rail_file` function to get the full path to the data files mentioned above.


In [None]:
calibration_data_file = find_rail_file(
    "examples_data/testdata/test_dc2_training_9816.hdf5"
)
test_data_file = find_rail_file("examples_data/testdata/test_dc2_validation_9816.hdf5")
print(calibration_data_file)

Now let's read these files in. We'll start by reading in the calibration data, and converting it to a Pandas DataFrame to make it easier to read.

We use the [tables_io](https://tables-io.readthedocs.io) package in order to read in these HDF5 files.

**NOTE:** Your data doesn't have to be in HDF5 files. The only main requirement is that the data in-memory is in a table format, such as a Pandas DataFrame.

In [None]:
# read in file
calibration_data = tables_io.read(calibration_data_file)
print(type(calibration_data), calibration_data.keys())

# get the data table out of the photometry dictionary and convert to pandas DataFrame
calibration_data = calibration_data["photometry"]
calibration_data = tables_io.convert(calibration_data, "pandasDataFrame")
print(calibration_data.info())

The `calibration_data` is now a Pandas DataFrame, containing information on 10,225 galaxies. It has magnitudes for the *ugrizy* bands, including errors, and the true redshift of these galaxies. 

Take a note of the column names -- these are the default expected column names for most of the RAIL estimation stages. Of course, your actual calibration or target data sets are not likely to have these exact column names. In that case, there are a few optional parameters that need to be changed when running both the *inform* and *estimation* algorithms:
- `hdf5_groupname`: This is the key used to access your data table from a dictionary, for example if it was stored in an HDF5 file. If you are passing a data table directly, just give it `""`
- `bands`: This is a list of the column names with your photometry bands, e.g. `['u', 'g', 'r', 'i', 'z', 'y']`
- `err_bands`: This is a list of the error column names that are associated with the photometry bands, e.g. `['u_err', 'g_err', ...]`
- `mag_limits`: This is a dictionary of magnitude limits for each band, e.g. `{'u': 28.0, 'g': 29.1, ...}`
- `ref_band`: This is the column name of a specific band to use in addition to colors, e.g. `'i'`

For an example of this you can check out [16_Running_with_different_data.ipynb](https://rail-hub.readthedocs.io/projects/rail-notebooks/en/latest/interactive_examples/rendered/estimation_examples/16_Running_with_different_data.html). 
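
As a rough sketch of how the column-name parameters listed above fit together, the snippet below builds them for a catalogue with hypothetical column names (`u_mag`, `u_mag_err`, and so on). The column names and magnitude limits are made up for illustration, and the sketch assumes that `mag_limits` is keyed by the band column names.

```python
# Hypothetical column names for a catalogue that does not follow the defaults
band_names = ["u", "g", "r", "i", "z", "y"]

bands = [f"{b}_mag" for b in band_names]
err_bands = [f"{b}_mag_err" for b in band_names]

# Hypothetical magnitude limits, one per band column
limit_values = [27.8, 29.0, 29.1, 28.6, 28.0, 27.0]
mag_limits = dict(zip(bands, limit_values))

column_params = dict(
    hdf5_groupname="",  # the data is passed directly as a table
    bands=bands,
    err_bands=err_bands,
    mag_limits=mag_limits,
    ref_band="i_mag",
)

# Basic consistency checks before handing the parameters to an algorithm
consistent = (
    len(bands) == len(err_bands)
    and set(mag_limits) == set(bands)
    and column_params["ref_band"] in bands
)
print(consistent)
```

A dictionary like `column_params` could then be unpacked with `**column_params` into both the *inform* and *estimation* calls, so that the two always use the same column names.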

We'll now also load in the target data, which contains the magnitudes for the galaxies we actually want to calculate redshifts for. Just as an example, we'll leave the target data in the default format given by `tables_io` for an `hdf5` file, which is a dictionary of arrays. Either method can be used with RAIL functions, but they can require slightly different methods of passing the data.

In [None]:
test_data = tables_io.read(test_data_file)
print(test_data["photometry"].keys())

## 2. Estimate redshifts with the [KNN algorithm](https://rail-hub.readthedocs.io/en/latest/source/rail_stages/estimation.html#k-nearest-neighbor) 

**The algorithm**: The `k-Nearest Neighbours` algorithm we're using (see [here](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) for more of an explanation of how KNN works) is a wrapper around `sklearn`'s nearest neighbour (NN) machine learning model. Essentially, it takes a given galaxy, identifies its nearest neighbours in colour space (that is, galaxies that have similar colours), and then constructs the photometric redshift PDF as a sum of Gaussians from each neighbour.
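
To make the idea concrete, here is a minimal NumPy sketch of the approach described above. It is not RAIL's implementation: the calibration values are made up, the neighbour search is brute force rather than `sklearn`, and giving every neighbour an equal weight and the same width `sigma` is an assumption.

```python
import numpy as np

# Toy calibration set: two colours per galaxy and a known redshift (made-up values)
train_colours = np.array(
    [[0.2, 0.1], [0.3, 0.2], [0.9, 0.8], [1.0, 0.7], [0.25, 0.15], [1.5, 1.2]]
)
train_z = np.array([0.30, 0.35, 1.10, 1.20, 0.32, 2.00])


def knn_pdf(colours, zgrid, k=3, sigma=0.05):
    """Equal-weight sum of Gaussians centred on the k nearest neighbours' redshifts."""
    distances = np.linalg.norm(train_colours - colours, axis=1)
    nearest = np.argsort(distances)[:k]
    centres = train_z[nearest]
    # One Gaussian per neighbour, evaluated on the grid: shape (n_grid, k)
    gaussians = np.exp(-0.5 * ((zgrid[:, None] - centres) / sigma) ** 2)
    gaussians /= sigma * np.sqrt(2.0 * np.pi)
    return nearest, gaussians.mean(axis=1)


zgrid = np.linspace(0.0, 3.0, 301)
neighbours, pdf = knn_pdf(np.array([0.24, 0.14]), zgrid)

# The PDF should integrate to roughly one over the grid
area = np.sum(pdf) * (zgrid[1] - zgrid[0])
print(sorted(neighbours.tolist()), round(float(area), 3))
```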

**Inform**: The inform method calibrates the model that we will use to estimate the redshifts. It sets aside some of the calibration data set as a validation data set. We pass in our calibration data set and any parameters the model needs, which we can check by putting a question mark after the function name:


In [None]:
ri.estimation.algos.k_nearneigh.k_near_neigh_informer?

There are a lot of optional parameters. Some of the main ones to be aware of for the KNN algorithm are:
- `trainfrac` sets the proportion of calibration data to use in training the algorithm, where the remaining fraction is used to validate both the width of the Gaussians used in constructing the PDF and the number of neighbors used in each PDF.  
- `sigma_grid_min`, `sigma_grid_max`, and `ngrid_sigma` are used to specify the grid of sigma values to test for the Gaussians 
- `nneigh_min` and `nneigh_max` set the range of nearest neighbours that will be tested 
- `zmin`, `zmax`, and `nzbins` are used to create a grid of redshift points on which to validate the model 
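
One way to picture these parameters is as the grids and the data split they describe. The sketch below uses the same values that this notebook passes to the informer; it assumes linearly spaced grids, an inclusive upper bound on the number of neighbours, and a simple truncating split, any of which may differ from what RAIL does internally.

```python
import numpy as np

# Values matching the KNN parameters used in this notebook (illustrative only)
zmin, zmax, nzbins = 0.0, 3.0, 301
sigma_grid_min, sigma_grid_max, ngrid_sigma = 0.01, 0.07, 10
nneigh_min, nneigh_max = 3, 7
trainfrac = 0.75
n_calibration = 10225

# Redshift grid on which the PDFs are validated
zgrid = np.linspace(zmin, zmax, nzbins)

# Candidate Gaussian widths (assumed to be linearly spaced)
sigma_grid = np.linspace(sigma_grid_min, sigma_grid_max, ngrid_sigma)

# Candidate numbers of neighbours (assumed to include the upper bound)
nneigh_grid = np.arange(nneigh_min, nneigh_max + 1)

# Split of the calibration sample into training and validation galaxies
n_train = int(trainfrac * n_calibration)
n_validate = n_calibration - n_train

# Number of (sigma, number of neighbours) combinations to compare
n_combinations = len(sigma_grid) * len(nneigh_grid)
print(n_train, n_validate, n_combinations)
```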

The only required parameter is the calibration data (called `training_data`). We'll also need to include `hdf5_groupname = ""`, which just tells the code that there is no dictionary key it needs to use to get to the data, since we're just passing it a DataFrame directly.

In [None]:
# parameters to pass to the informer
knn_dict = dict(
    zmin=0.0,
    zmax=3.0,
    nzbins=301,
    trainfrac=0.75,
    sigma_grid_min=0.01,
    sigma_grid_max=0.07,
    ngrid_sigma=10,
    nneigh_min=3,
    nneigh_max=7,
    hdf5_groupname="",
)

# run the inform method
knn_inform = ri.estimation.algos.k_nearneigh.k_near_neigh_informer(
    training_data=calibration_data, **knn_dict
)

Now, if you take a look at the output of this function, you can see that it's a dictionary with the key 'model', since that's what we're generating, and the actual model object as the value. If there were multiple outputs for this function, they would all be collected in this dictionary: 

In [None]:
print(knn_inform)

To learn more about this model, take a look at the [algorithm documentation](https://rail-hub.readthedocs.io/en/latest/source/rail_stages/estimation.html#k-nearest-neighbor). To assess how well it has been calibrated, you can explore the [introduction to RAIL interactive](https://rail-hub.readthedocs.io/projects/rail-notebooks/en/latest/interactive_examples/rendered/estimation_examples/Estimating_Redshifts_and_Comparing_Results_for_Different_Parameters.html) or the [01_Evaluation_by_Type](https://rail-hub.readthedocs.io/projects/rail-notebooks/en/latest/interactive_examples/rendered/evaluation_examples/01_Evaluation_by_Type.html) notebooks, which go into more detail about the process of evaluating the performance of estimation algorithm models. 

### Saving a model and using it with an estimator stage

The output of the inform stage is just a dictionary with the model under the "model" key. To make our lives easier, we can save this model to a file. That way, in the future we only have to run the *estimator* method and supply the file name of the saved model, without running the *informer* stage again. 

Let's start by saving the file (we recommend using the `pickle` module since that's what format most estimation algorithms expect the model to be in):


In [None]:
# write model file out here
with open("./knn_model.pkl", "wb") as fout:
    pickle.dump(obj=knn_inform["model"], file=fout, protocol=pickle.HIGHEST_PROTOCOL)
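
If you want to check that a pickled object can be read back, a round trip looks like the sketch below. It uses a small stand-in dictionary and a temporary directory instead of the real model file, so the names `stand_in_model` and `example_model.pkl` are purely illustrative.

```python
import os
import pickle
import tempfile

# Stand-in for the "model" entry of an informer's output dictionary
stand_in_model = {"sigma": 0.03, "num_neighbours": 5}

with tempfile.TemporaryDirectory() as tmpdir:
    model_path = os.path.join(tmpdir, "example_model.pkl")

    # Write the object out in binary mode
    with open(model_path, "wb") as fout:
        pickle.dump(stand_in_model, fout, protocol=pickle.HIGHEST_PROTOCOL)

    # Read it back in and compare with the original
    with open(model_path, "rb") as fin:
        reloaded_model = pickle.load(fin)

print(reloaded_model == stand_in_model)
```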

### Estimate 

Now that our model is calibrated, we can use it to estimate the redshifts of the target data set. We provide the estimate algorithm with the target data set, and the filename of the model that we've trained, and any other necessary parameters:

In [None]:
knn_estimated = ri.estimation.algos.k_nearneigh.k_near_neigh_estimator(
    input_data=test_data, model="knn_model.pkl"
)

### Taking a look at your estimated redshifts 

Typically, you might expect a photometric redshift algorithm to give you just a number with error bars. RAIL instead provides the full probability density function (PDF) for each target galaxy. This is a little more complex than a table of numbers, so it is stored in an `Ensemble`, a data structure from the `qp` package. 

Now let's take a look at what the output of the estimation stage actually looks like. Most estimation stages output an `Ensemble`. For more information, see [the qp documentation](https://qp.readthedocs.io/en/main/user_guide/datastructure.html). 

We're using an `Ensemble` to hold a redshift distribution for each of the galaxies we're estimating. There are two required dictionaries that make up an Ensemble, and one that is optional:
- `.metadata`: Contains information about the whole data structure, like the Ensemble type, and any shared parameters such as the bins of histograms. This is not per-object metadata. 
- `.objdata`: The main data points of the distributions for each object, where each object is a row. 
- `.ancil`: the optional dictionary, containing extra information about each object. It can have arrays that have one or more data points per distribution.

In [None]:
print(knn_estimated)

We can see that this algorithm outputs Ensembles of class `mixmod`, which are just combinations of Gaussians (for more info see the [qp docs](https://qp.readthedocs.io/en/main/user_guide/parameterizations/mixmod.html)). The shape portion of the print statement tells us two things: the first number is the number of photo-z distributions, or galaxies, in this `Ensemble`, and the second number tells us how many Gaussians are combined to make up each photo-z distribution. 
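
To illustrate what that shape means, the NumPy sketch below evaluates a small set of Gaussian mixtures stored as arrays of shape (number of galaxies, number of Gaussians). It is a stand-alone illustration with made-up values and does not use `qp`; the array names `weights`, `means`, and `stds` are chosen here for readability.

```python
import numpy as np

# Toy mixture parameters with shape (n_galaxies, n_gaussians) = (3, 2); values are made up
weights = np.array([[0.5, 0.5], [0.8, 0.2], [1.0, 0.0]])
means = np.array([[0.4, 1.2], [0.7, 0.9], [1.5, 2.0]])
stds = np.array([[0.05, 0.10], [0.04, 0.04], [0.08, 0.08]])

zgrid = np.linspace(0.0, 3.0, 601)

# Broadcast to (n_galaxies, n_gaussians, n_grid), then sum over the Gaussians
z = zgrid[None, None, :]
components = np.exp(-0.5 * ((z - means[..., None]) / stds[..., None]) ** 2)
components /= stds[..., None] * np.sqrt(2.0 * np.pi)
pdfs = np.sum(weights[..., None] * components, axis=1)

# Each row is one galaxy's PDF on the grid and should integrate to roughly one
areas = pdfs.sum(axis=1) * (zgrid[1] - zgrid[0])
print(pdfs.shape, np.round(areas, 3))
```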

Let's take a look at what the different dictionaries look like for this `Ensemble`:  

In [None]:
print(knn_estimated["output"].metadata)

In [None]:
print(knn_estimated["output"].objdata)

In [None]:
print(knn_estimated["output"].ancil)

The KNN algorithm automatically adds a photo-z point estimate derived from the PDFs, called 'zmode' in the ancillary dictionary above, but this is not the case for all Estimation algorithms, and we recommend choosing your own point estimate based on what you want to use it for. 

The `Ensemble` acts a bit like a [`scipy` probability distribution](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.html#scipy.stats.rv_continuous), so you can easily calculate statistics from it, such as the mean, median, or mode of all of the photo-z PDFs. For a full list of the functions available see [the list of methods here](https://qp.readthedocs.io/en/main/user_guide/methods.html).

Another extremely useful method is the `.pdf()` method, which calculates the value(s) of the distribution(s) at any redshift(s) you provide. This is quite useful for plotting, since the different estimation algorithms can store data in different ways, which makes plotting the actual data points more confusing, but we can always use the `.pdf()` method. 

We'll make use of this method to plot a couple of the photo-z distributions that the algorithm has generated to get a sense of what they're like. In order to do this, we'll need to select a specific galaxy to get the PDF for, which can be done by slicing as you would with a numpy array. So to get the first galaxy's distribution, for example you would do this: `knn_estimated["output"][0]`. 

In [None]:
%matplotlib inline

xvals = np.linspace(0, 3, 200)  # we want to cover the whole available redshift space
plt.plot(xvals, knn_estimated["output"][0].pdf(xvals), label="0")
plt.plot(xvals, knn_estimated["output"][1000].pdf(xvals), label="1000")
plt.plot(xvals, knn_estimated["output"][10000].pdf(xvals), label="10,000")

plt.legend(loc="best", title="Galaxy ID")
plt.xlabel("redshift")
plt.ylabel("p(z)")

We can see that these distributions have varying shapes. Importantly, you can see most of them don't just look like one Gaussian, but have multiple Gaussian peaks. This explains why you shouldn't just use a point estimate with some error bars, since that doesn't accurately encompass the probability distribution. 
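
The toy example below shows how much the choice of point estimate can matter for a multi-peaked distribution. It builds a made-up bimodal PDF on a grid with NumPy (no `qp` involved) and computes the mode, mean, and median with simple sums; the mean lands between the two peaks, where the PDF is essentially zero.

```python
import numpy as np

zgrid = np.linspace(0.0, 3.0, 3001)
dz = zgrid[1] - zgrid[0]


def gaussian(z, mu, sigma):
    """Normalized Gaussian evaluated at z."""
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


# Made-up bimodal photo-z PDF: 60% of the probability near z=0.5, 40% near z=2.0
pdf = 0.6 * gaussian(zgrid, 0.5, 0.05) + 0.4 * gaussian(zgrid, 2.0, 0.05)

# Mode: grid point where the PDF peaks
z_mode = zgrid[np.argmax(pdf)]

# Mean: probability-weighted average redshift
z_mean = np.sum(zgrid * pdf) * dz

# Median: first grid point where the cumulative probability reaches 0.5
cdf = np.cumsum(pdf) * dz
z_median = zgrid[np.searchsorted(cdf, 0.5)]

# PDF value at the grid point closest to the mean
pdf_at_mean = pdf[np.argmin(np.abs(zgrid - z_mean))]
print(f"mode={z_mode:.2f}, mean={z_mean:.2f}, median={z_median:.2f}")
```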

However, if you do want a point estimate, you should think about what you want the point estimate to represent. It's easy enough to calculate either the mean, median, or mode with the built-in functions, which means you can pick which one works best for your goals. For this example, we'll go with the median value, which we calculate via: `knn_estimated["output"][galid].median()`, and we'll compare it to the actual redshift for that galaxy to get a sense of how they compare: 

In [None]:
zgrid = np.linspace(0, 3.0, 301)
galid = 1000
truez = test_data["photometry"]["redshift"][galid]
single_gal = np.squeeze(knn_estimated["output"][galid].pdf(zgrid))
single_zmed = knn_estimated["output"][galid].median()

plt.plot(zgrid, single_gal, color="k", label="single pdf")
plt.axvline(single_zmed, color="k", ls="--", label="median")
plt.axvline(truez, color="r", label="true redshift")
plt.legend(loc="upper right")
plt.xlabel("redshift")
plt.ylabel("p(z)")

In a real world scenario you won't be able to make this plot, but we do it here just to illustrate how the estimators work. We can see there is some difference between the true redshift and the estimated redshift point estimate, though for this galaxy the point estimate seems quite good. Let's see how the algorithm did overall by plotting the pre-calculated redshift point estimates (the "zmodes") versus the true redshifts:

In [None]:
plt.figure(figsize=(8, 8))
plt.scatter(
    test_data["photometry"]["redshift"],
    knn_estimated["output"].ancil["zmode"].flatten(),
    s=1,
    c="k",
    label="simple NN mode",
)
plt.plot([0, 3], [0, 3], "r--")
plt.xlabel("true redshift")
plt.ylabel("estimated photo-z")
plt.show()

We can see that the algorithm does quite well overall, though there are certainly some catastrophic outliers, and more of a spread at higher redshifts. 
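
To put a number on "catastrophic outliers", one common approach is to count galaxies whose scaled residual exceeds some threshold. The sketch below uses made-up redshifts and assumes a threshold of 0.15 on |z_phot - z_true| / (1 + z_true); the threshold is a convention chosen for illustration, not something defined by this notebook.

```python
import numpy as np

# Made-up true redshifts and point estimates for six galaxies
z_true = np.array([0.30, 0.55, 0.80, 1.20, 1.60, 2.10])
z_phot = np.array([0.32, 0.50, 0.85, 1.15, 0.40, 2.00])

# Residual scaled by (1 + z), often used for photo-z point estimates
scaled_residual = (z_phot - z_true) / (1.0 + z_true)

# Assumption: a galaxy is a catastrophic outlier when |residual| > 0.15
outlier_threshold = 0.15
is_outlier = np.abs(scaled_residual) > outlier_threshold
outlier_fraction = is_outlier.mean()

print(np.flatnonzero(is_outlier), round(float(outlier_fraction), 3))
```

With real data, `z_true` and `z_phot` would be the true redshift column and whichever point estimate you chose from the `Ensemble`.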

### Saving the estimated redshift distributions to a file

Now that we've investigated our redshifts, let's say that we want to save them to a file for later. An `Ensemble` comes with a built-in `write_to()` method that allows you to write it to an HDF5, FITS, or parquet file. Just provide the method with the path to the file you want to write:

In [None]:
knn_estimated["output"].write_to("knn_redshift_estimates.hdf5")

Now if you want to make use of the `Ensemble` of distributions in the future, you can just read the file in like so:

In [None]:
knn_ens = qp.read("knn_redshift_estimates.hdf5")
print(knn_ens)

But what if you want to convert your data to a table, so you can easily input it into other algorithms, for example? One way we might want to do that is by outputting an array where each row gives the values of a photo-z PDF at certain points on a grid of redshifts. We can do this easily by using the `.pdf()` function of an Ensemble:

In [None]:
# create the points to evaluate the PDFs at
zgrid_out = np.linspace(0, 3, 200)
# evaluate all the PDFs at the given redshifts
photoz_out = knn_ens.pdf(zgrid_out)
photoz_out

Now we have an array of datapoints that describes all of our photo-z distributions at the set `zgrid_out` redshift values. This array can be saved to a file however you like, or passed on to other functions. 

You can also use a similar method to create arrays of different data points, for example if you want the CDF values of all the distributions, use `.cdf()`, or if you just want the medians and standard deviations of all the distributions, use `.median()` and `.std()`. 
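
As one option for saving a gridded array of PDF values, NumPy's `.npz` format can keep the redshift grid and the values together in a single file. The sketch below uses stand-in arrays (flat PDFs) and a temporary directory; the file name and array names are illustrative only.

```python
import os
import tempfile

import numpy as np

# Stand-ins for the redshift grid and the (n_galaxies, n_grid) array of PDF values
zgrid_example = np.linspace(0, 3, 200)
pdf_values_example = np.ones((5, zgrid_example.size)) / 3.0  # flat PDFs on [0, 3]

with tempfile.TemporaryDirectory() as tmpdir:
    out_path = os.path.join(tmpdir, "photoz_grid.npz")

    # Store the grid alongside the values so the file is self-describing
    np.savez(out_path, zgrid=zgrid_example, pdf=pdf_values_example)

    # Read both arrays back in
    with np.load(out_path) as archive:
        zgrid_loaded = archive["zgrid"]
        pdf_loaded = archive["pdf"]

print(pdf_loaded.shape)
```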

## 3. Estimate redshifts with [FlexZBoost](https://rail-hub.readthedocs.io/en/latest/source/estimators.html#flexzboost)

Now let's use `FlexZBoost` to get our redshifts. `FlexZBoostEstimator` approximates the conditional density estimate for each PDF with a set of weights on a set of basis functions. This can save space relative to a gridded parameterization, but it also leads to residual "bumps" in the PDF that are intrinsic to the underlying cosine or Fourier parameterization. For this reason, `FlexZBoostEstimator` has a post-processing stage where it "trims" (i.e. sets to zero) any small peaks, or "bumps", below a certain `bump_thresh` threshold. For more details on running `FlexZBoost`, see the [00_Quick_Start_in_Estimation.ipynb](https://rail-hub.readthedocs.io/projects/rail-notebooks/en/latest/interactive_examples/rendered/estimation_examples/00_Quick_Start_in_Estimation.html) notebook.
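
The sketch below illustrates the basis-function idea with plain NumPy. It projects a made-up density onto a cosine basis, rebuilds it from the coefficients, and then applies a simplified trimming step that zeroes any grid value below a threshold. This is an illustration under those assumptions, not the `FlexZBoost` code: the real algorithm learns the weights by regression on the photometry and may use a different criterion for removing bumps.

```python
import numpy as np

zmin, zmax = 0.0, 3.0
zgrid = np.linspace(zmin, zmax, 301)
dz = zgrid[1] - zgrid[0]
t = (zgrid - zmin) / (zmax - zmin)  # rescale redshift to [0, 1]
dt = t[1] - t[0]

# Orthonormal cosine basis on [0, 1]: phi_0 = 1, phi_j = sqrt(2) cos(j pi t)
n_basis = 35
j = np.arange(n_basis)[:, None]
basis = np.where(j == 0, 1.0, np.sqrt(2.0) * np.cos(j * np.pi * t))

# Made-up target density, used here only to generate example coefficients
target = np.exp(-0.5 * ((zgrid - 0.8) / 0.08) ** 2)
target /= np.sum(target) * dz

# Project onto the basis, then rebuild the density from the coefficients
coefficients = basis @ target * dt
raw_pdf = coefficients @ basis

# Simplified trimming: zero values under a threshold, then renormalize
bump_thresh = 0.1
trimmed_pdf = np.where(raw_pdf < bump_thresh, 0.0, raw_pdf)
trimmed_pdf /= np.sum(trimmed_pdf) * dz

print(coefficients.shape, round(float(raw_pdf.min()), 4))
```

Storing 35 coefficients per galaxy instead of 301 grid values is where the space saving comes from, and the small ripples left in `raw_pdf` by truncating the expansion are the kind of "bumps" that trimming removes.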

These are some of the main parameters for the informer:
- `basis_system`: which basis system to use in the density estimate. The default is `cosine` but `fourier` is also an option
- `max_basis`: the maximum number of basis functions to use for PDFs
- `regression_params`: a dictionary of options fed to `xgboost` that control the maximum depth and the `objective` function. `objective` should be set to `reg:squarederror` for proper functioning.
- `trainfrac`: The fraction of the calibration data to use for training the density estimate.  The remaining galaxies will be used for validation of `bump_thresh` and `sharpening`.
- `bumpmin`: the minimum value to test in the `bump_thresh` grid
- `bumpmax`: the maximum value to test in the `bump_thresh` grid
- `nbump`: how many points to test in the `bump_thresh` grid
- `sharpmin`, `sharpmax`, `nsharp`: same as equivalent `bump_thresh` params, but for `sharpening` parameter
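
As a rough illustration of what a `sharpening` parameter can do, the sketch below raises a made-up PDF to a power `alpha` and renormalizes it, then measures the resulting width for a grid of candidate values. Treating sharpening as a power of the density and building the grid with `np.linspace` are both assumptions made for this example.

```python
import numpy as np

zgrid = np.linspace(0.0, 3.0, 301)
dz = zgrid[1] - zgrid[0]

# Made-up, deliberately broad photo-z PDF
pdf = np.exp(-0.5 * ((zgrid - 1.0) / 0.2) ** 2)
pdf /= np.sum(pdf) * dz


def sharpen(density, alpha):
    """Raise the density to the power alpha and renormalize (alpha > 1 narrows it)."""
    sharpened = density**alpha
    return sharpened / (np.sum(sharpened) * dz)


def width(density):
    """Standard deviation of a gridded density."""
    mean = np.sum(zgrid * density) * dz
    return np.sqrt(np.sum((zgrid - mean) ** 2 * density) * dz)


# Candidate values, assuming a linear grid built from sharpmin, sharpmax, nsharp
sharpen_grid = np.linspace(0.7, 2.1, 15)
widths = np.array([width(sharpen(pdf, alpha)) for alpha in sharpen_grid])

print(np.round(widths[[0, -1]], 3))
```

Values of `alpha` below one broaden the PDF and values above one narrow it, which is why the validation galaxies are needed to pick a sensible value.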

The dictionary below gives the defaults for all of these parameters:

In [None]:
fz_dict = dict(
    zmin=0.0,
    zmax=3.0,
    nzbins=301,
    trainfrac=0.75,
    bumpmin=0.02,
    bumpmax=0.35,
    nbump=20,
    sharpmin=0.7,
    sharpmax=2.1,
    nsharp=15,
    max_basis=35,
    basis_system="cosine",
    regression_params={"max_depth": 8, "objective": "reg:squarederror"},
)

Now we can run the informer with this dictionary, along with the calibration data and the `hdf5_groupname` parameter as before:

In [None]:
flex_inform = ri.estimation.algos.flexzboost.flex_z_boost_informer(
    training_data=calibration_data, hdf5_groupname="", **fz_dict
)

Next we run the estimator with the model that we've just created. Like the `knn_inform` variable, `flex_inform` is a dictionary with a "model" key.

**Note that when we pass the model to this function, we don't pass the dictionary, but the actual model object. This is true of all the interactive functions.** 

In [None]:
flex_estimated = ri.estimation.algos.flexzboost.flex_z_boost_estimator(
    input_data=test_data, model=flex_inform["model"]
)

In [None]:
print(flex_estimated)

We once again get a dictionary with the key "output". This time the `Ensemble` is of type [`interp`](https://qp.readthedocs.io/en/main/user_guide/parameterizations/interp.html), which means the distributions are given as a set of `xvals` and `yvals`, where all the distributions share the same set of `xvals`. 
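
A gridded representation like this can be pictured with plain NumPy arrays: one shared array of redshifts and one row of PDF values per galaxy. The sketch below uses made-up values and assumes linear interpolation between grid points, which may not match how `qp` evaluates an `interp` Ensemble.

```python
import numpy as np

# Shared grid (xvals) and one row of PDF values per galaxy (yvals); values are made up
xvals = np.linspace(0.0, 3.0, 7)
yvals = np.array(
    [
        [0.0, 0.4, 1.2, 0.4, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.2, 0.8, 0.8, 0.2, 0.0],
    ]
)

# Evaluate every galaxy's PDF at new redshifts by linear interpolation
new_z = np.array([0.75, 1.0, 1.25])
pdf_at_new_z = np.array([np.interp(new_z, xvals, row) for row in yvals])
print(pdf_at_new_z)
```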

Now we can plot out some of the data, same as above, to get a sense of how the estimator did. Let's start by plotting some of the individual galaxy photo-z distributions:

In [None]:
xvals = np.linspace(0, 3, 200)  # we want to cover the whole available redshift space
plt.plot(xvals, flex_estimated["output"][0].pdf(xvals), label="0")
plt.plot(xvals, flex_estimated["output"][1000].pdf(xvals), label="1000")
plt.plot(xvals, flex_estimated["output"][10000].pdf(xvals), label="10,000")

plt.legend(loc="best", title="Galaxy ID")
plt.xlabel("redshift")
plt.ylabel("p(z)")

We can see that these distributions look a little smoother than the ones we got from the KNN algorithm. Let's compare them by plotting the PDF, median, and true redshift of a galaxy against the KNN-estimated PDF of the same galaxy:

In [None]:
# get the plotting data for a single galaxy
zgrid = np.linspace(0, 3.0, 301)
galid = 1000
truez = test_data["photometry"]["redshift"][galid]
single_gal = np.squeeze(flex_estimated["output"][galid].pdf(zgrid))

# calculate median of photo-z PDF
z_med = flex_estimated["output"][galid].median()

plt.plot(zgrid, single_gal, color="k", label="FlexZBoost")
plt.plot(zgrid, knn_estimated["output"][galid].pdf(zgrid), label="KNN")
plt.axvline(z_med, color="k", ls="--", label="median")
plt.axvline(truez, color="r", label="true redshift")
plt.legend(loc="upper right")
plt.xlabel("redshift")
plt.ylabel("p(z)")

As mentioned, the FlexZBoost estimated distribution for this galaxy is a lot cleaner, and the point estimate is very close to the true redshift. 

Finally, we can do this for all the galaxies, and compare the estimated redshifts with the true redshifts as we did for the KNN algorithm earlier:

In [None]:
plt.figure(figsize=(8, 8))
plt.scatter(
    test_data["photometry"]["redshift"],
    flex_estimated["output"].median(),
    s=1,
    c="k",
    label="FlexZBoost median",
)
plt.plot([0, 3], [0, 3], "r--")
plt.xlabel("true redshift")
plt.ylabel("estimated photo-z")
plt.show()

This distribution actually looks quite similar to our distribution of KNN estimated values -- in both cases, they're quite good with some outliers at higher redshift. 

## 4. Next steps

Now that you've got a handle on estimating photometric redshifts with these algorithms, take a look at the other [available estimation algorithms](https://rail-hub.readthedocs.io/en/latest/source/rail_stages/estimation.html#) that you can use. If you want some more detail and some examples of how to use them, you can explore the [estimation notebooks](https://rail-hub.readthedocs.io/projects/rail-notebooks/en/latest/interactive_examples/estimation_notebooks.html). 

If you're interested in parallelizing the estimation method within a notebook, or in learning about any of the other parts of RAIL, try [this notebook](https://rail-hub.readthedocs.io/projects/rail-notebooks/en/latest/interactive_examples/rendered/estimation_examples/Estimating_Redshifts_and_Comparing_Results_for_Different_Parameters.html). 