# Exploring the Effects of Different Degraders on Estimated Redshifts

**Authors:** Jennifer Scora

**Last run successfully:** Feb 9, 2026

Thanks to Matteo Moretti and Biprateep Dey for inspiring the use case.

In this notebook, we'll explore how to create simulated datasets with the [RAIL creation stage](https://rail-hub.readthedocs.io/en/latest/source/rail_stages/creation.html), in particular focusing on how data sets created using different degradation algorithms can affect the calibration of models to estimate photometric redshifts (photo-zs). Here "degradation" algorithms refer to any algorithms applied to alter the "true" sample, for example to add biases or cuts. 

Here are the main steps we'll be following:

1. Simulating galaxies with photometric data and redshifts 
2. "Degrading" photometry and redshift information to create different calibration data
3. Calibrating the photometric redshift algorithms with the differently degraded data
4. Estimating the photometric redshifts of a set of target galaxies using the calibrated models 
5. Seeing how the algorithm calibration affected the output redshift distributions

## 1. Simulating galaxies with photometric data and redshifts 

In this step we want to create the data sets of galaxy magnitudes and corresponding redshifts that we will use to calibrate and estimate photometric redshifts. We use the [PZflow algorithm](https://rail-hub.readthedocs.io/en/latest/source/rail_stages/creation.html#pzflow-engine) to generate our model; PZflow is a machine learning package that we use here to model galaxies. Then we sample two data sets from the model: a calibration data set and a target data set. The calibration data set will be used to calibrate our models, and the target data set is the data we will get photo-z estimates for. These data sets will be considered our "true" data, which means they contain the "real" redshifts before we have made any alterations to make the data more realistic.

### Set up

Let's start by importing the packages we'll need to create and analyze the data sets.

In [None]:
import rail.interactive as ri
import numpy as np
from pzflow.examples import get_galaxy_data

# for plotting
import matplotlib.pyplot as plt

%matplotlib inline

We need to set up some column name dictionaries, because the expected column names vary between some of the codes. To handle this, we can pass in dictionaries that map the column names in the input data to the expected column names (`band_dict` and `rename_dict` below). In this notebook we are using the ugrizy bands; each magnitude column is named like 'mag_u_lsst', with the matching error column named 'mag_err_u_lsst'.

The initial data we pull from our model won't have any associated errors. Those will be created when we degrade the datasets, but the error columns will need to be renamed with the `rename_dict` later on.

In [None]:
bands = ["u", "g", "r", "i", "z", "y"]
band_dict = {band: f"mag_{band}_lsst" for band in bands}
rename_dict = {f"mag_{band}_lsst_err": f"mag_err_{band}_lsst" for band in bands}
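
To make the role of the two dictionaries concrete, here is a minimal sketch on a toy table. The two-row DataFrame and its values are hypothetical, and the error columns are added by hand to stand in for what a degrader would produce; only the dictionaries themselves come from the cell above.

```python
import pandas as pd

bands = ["u", "g", "r", "i", "z", "y"]
band_dict = {band: f"mag_{band}_lsst" for band in bands}
rename_dict = {f"mag_{band}_lsst_err": f"mag_err_{band}_lsst" for band in bands}

# hypothetical two-row table that uses short band names
toy = pd.DataFrame({"redshift": [0.3, 1.1], "u": [24.1, 26.0], "g": [23.5, 25.2]})
toy = toy.rename(band_dict, axis=1)  # dictionary keys missing from the table are ignored

# stand-in for error columns appended with a "<name>_err" suffix
toy["mag_u_lsst_err"] = [0.05, 0.40]
toy["mag_g_lsst_err"] = [0.02, 0.15]
toy = toy.rename(columns=rename_dict)

renamed_columns = list(toy.columns)
print(renamed_columns)
```

The same dictionaries can be reused for any subset of the six bands, since `rename` silently skips keys that do not match a column.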

In order to generate the model with PZflow, we need to grab some sample data to base the model on. This sample data is only used to create the model, and is separate from the calibration and target data we'll get from the model later. We'll rename the band columns in this data table to match our desired band names as discussed above, using `band_dict`. We can check that our columns have been renamed appropriately by printing out the first few lines of the table:

In [None]:
catalog = get_galaxy_data().rename(band_dict, axis=1)
# let's take a look at the columns
catalog.head()

Looks like the column names are the way we want them! 

### Calibrate and sample the model

Now we need to use the galaxy data we retrieved to calibrate the model that we'll use to create our input galaxy magnitude data catalogues later. We're going to use the `PZflow` engine to do this, specifically the `modeler` function. This will train the normalizing flow that serves as the engine for the input data creation. To get a sense of what it does and the parameters it needs, let's check out its docstrings:

In [None]:
ri.creation.engines.flowEngine.flow_modeler?

We'll pass the modeler a few parameters:
- **input_data:** The input catalog that the modeler needs to train the normalizing flow (the one we retrieved above).
- **seed (optional):** The random seed used for training.
- **phys_cols (optional):** A dictionary of the names of any non-photometry columns and their [min, max] values.
- **phot_cols (optional):** A dictionary of the names of the photometry columns and their corresponding [min, max] values.
- **calc_colors (optional):** Whether to internally calculate colors (if phot_cols are magnitudes). Assumes that you want to calculate colors from adjacent columns in phot_cols. If you do not want to calculate colors, set False. Otherwise, provide a dictionary `{"ref_column_name": band}`, where band is a string corresponding to the column in phot_cols you want to save as the overall galaxy magnitude. We're passing in the default value here just so you can see how it works.
- **num_training_epochs (optional):** The number of training epochs. By default 30; here we're doing fewer so that it doesn't take as long.
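
To illustrate what "colors from adjacent columns" means for `calc_colors`, here is a rough sketch. The helper `mags_to_colors` and the toy magnitudes are hypothetical and only mimic the idea; this is not PZflow's internal code.

```python
import pandas as pd


def mags_to_colors(mags, phot_cols, ref_column_name):
    """Keep one reference magnitude and replace the rest with adjacent-band colors."""
    out = pd.DataFrame({ref_column_name: mags[ref_column_name]})
    for blue, red in zip(phot_cols[:-1], phot_cols[1:]):
        out[f"{blue}-{red}"] = mags[blue] - mags[red]
    return out


phot_cols = [f"mag_{b}_lsst" for b in "ugrizy"]
# hypothetical magnitudes for a single galaxy
toy_mags = pd.DataFrame([[25.0, 24.2, 23.9, 23.7, 23.6, 23.5]], columns=phot_cols)
toy_colors = mags_to_colors(toy_mags, phot_cols, ref_column_name="mag_i_lsst")
print(toy_colors.round(2))
```

Six magnitudes become one reference magnitude plus five colors, so no information is lost, but the model sees quantities that are less correlated with overall brightness.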


**NOTE:** This calibration may take a while depending on your setup. 

In [None]:
flow_model = ri.creation.engines.flowEngine.flow_modeler(
    input_data=catalog,
    seed=0,
    phys_cols={"redshift": [0, 3]},
    phot_cols={
        "mag_u_lsst": [17, 35],
        "mag_g_lsst": [16, 32],
        "mag_r_lsst": [15, 30],
        "mag_i_lsst": [15, 30],
        "mag_z_lsst": [14, 29],
        "mag_y_lsst": [14, 28],
    },
    calc_colors={"ref_column_name": "mag_i_lsst"},
    num_training_epochs=10,
)

Now we'll use the flow to produce some synthetic data for our calibration data set and target data set. Since this is a test we'll keep the data sets small, with 600 galaxies each, so we'll pass in the argument `n_samples=600`. We'll also use a different seed for each one so that they're reproducible but different from each other.

**Note that when we pass the model to this function, we don't pass the dictionary, but the actual model object. This is true of all the interactive functions.** 

In [None]:
# get sample calibration and target data sets
calib_data_orig = ri.creation.engines.flowEngine.flow_creator(
    n_samples=600, model=flow_model["model"], seed=1235
)
targ_data_orig = ri.creation.engines.flowEngine.flow_creator(
    model=flow_model["model"], n_samples=600, seed=1234
)

Let's plot these data sets to check that they are in fact different:

In [None]:
hist_options = {"bins": np.linspace(0, 3, 30), "histtype": "stepfilled", "alpha": 0.5}

plt.hist(calib_data_orig["output"]["redshift"], label="calibration", **hist_options)
plt.hist(targ_data_orig["output"]["redshift"], label="target", **hist_options)
plt.legend(loc="best")
plt.xlabel("redshift")
plt.ylabel("number of galaxies")

## 2. "Degrading" photometry and redshift information to create different calibration data

The goal of this step is to create realistic galaxy observations that have been degraded in a variety of ways. We'll use these as calibration sets for our favourite photometric redshift algorithm, and we'll also degrade one target data set for which we want estimated redshifts.

We're going to create four different calibration data sets, where each data set has had one more degrader applied than the last. Thus, the first data set has only one degrader applied, while the fourth has all four. We'll also create a target data set with all four degradations applied, so that the target data most closely resembles the most degraded calibration data set.

The degraders we'll be using, in order, are:

1. `lsst_error_model` to add photometric errors modelled on the Vera C. Rubin Observatory
2. `inv_redshift_incompleteness` to mimic redshift-dependent incompleteness
3. `line_confusion` to simulate the effect of misidentified lines
4. `quantity_cut` to mimic a band-dependent brightness cut


### 1. LSST Error Model

This method adds photometric errors, non-detections and extended source errors that are modelled based on the Vera Rubin telescope. We're going to apply it to both calibration and target data sets. Once again, we're supplying different seeds to ensure the results are reproducible and different from each other. We need to supply the `band_dict` we created earlier, which tells the code what the band column names should be. We are also supplying `ndFlag=np.nan`, which just tells the code to make non-detections `np.nan` in the output. 

In [None]:
# calibration data
calib_data_photerrs = ri.creation.degraders.photometric_errors.lsst_error_model(
    sample=calib_data_orig["output"], seed=66, renameDict=band_dict, ndFlag=np.nan
)

# target data set (uses a different seed from the calibration data set)
targ_data_photerrs = ri.creation.degraders.photometric_errors.lsst_error_model(
    sample=targ_data_orig["output"], seed=67, renameDict=band_dict, ndFlag=np.nan
)

In [None]:
# let's see what the output looks like
calib_data_photerrs["output"].head()

You can see that error columns have been added in for each of the magnitude columns. 

Now let's take a look at what's happened to the magnitudes. Below we'll plot the u-band magnitudes before and after running the degrader. You can see that the fainter (higher-magnitude) objects now have a much wider scatter around their initial magnitudes, while the brighter (lower-magnitude) objects have stayed close to their original values:

In [None]:
# we have to set the range because there are nans in the new dataset with errors, which messes up plt.hist2d
# (named hist_range so that we don't shadow Python's built-in range)
hist_range = [
    [
        np.min(calib_data_orig["output"]["mag_u_lsst"]),
        np.max(calib_data_orig["output"]["mag_u_lsst"]),
    ],
    [
        np.min(calib_data_photerrs["output"]["mag_u_lsst"]),
        np.max(calib_data_photerrs["output"]["mag_u_lsst"]),
    ],
]
plt.hist2d(
    calib_data_orig["output"]["mag_u_lsst"],
    calib_data_photerrs["output"]["mag_u_lsst"],
    range=hist_range,
    bins=20,
    cmap="viridis",
)
plt.xlabel("original u-band magnitude")
plt.ylabel("new u-band magnitude")
plt.colorbar(label="number of galaxies")

You can make this plot for all the other magnitudes if you'd like. 
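
To see why the scatter grows so quickly for faint objects, here is a simplified sketch of the point-source magnitude error parameterisation commonly used for LSST (Ivezić et al. 2019): $\sigma_{\text{rand}}^2 = (0.04 - \gamma)\,x + \gamma\,x^2$ with $x = 10^{0.4(m - m_5)}$. The function name, the depth `m5_u`, and the default parameter values are assumptions chosen for illustration; the degrader itself does more than this (for example non-detections and extended-source errors).

```python
import numpy as np


def toy_mag_error(mag, m5, gamma=0.039, sigma_sys=0.005):
    """Simplified point-source magnitude error as a function of magnitude."""
    x = 10 ** (0.4 * (np.asarray(mag) - m5))
    sigma_rand_sq = (0.04 - gamma) * x + gamma * x**2
    return np.sqrt(sigma_rand_sq + sigma_sys**2)


m5_u = 26.0  # hypothetical 5-sigma depth, for illustration only
mags = np.array([20.0, 24.0, 26.0, 27.0])
errs = toy_mag_error(mags, m5_u)
print(np.round(errs, 3))
```

In this parameterisation the random error is 0.2 mag at the 5-sigma depth and rises steeply beyond it, which is consistent with the widening scatter at the faint end of the plot.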

### 2. Redshift Incompleteness 

This method applies a selection function, which keeps galaxies with probability 

$p_{\text{keep}}(z) = \min(1, \frac{z_p}{z})$, 

where $z_p$ is the "pivot" redshift. We'll use $z_p = 1.0$.
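
The selection function above can be sketched in a few lines of NumPy. The helper `keep_probability` and the example redshifts are hypothetical; the sketch only illustrates the formula and is not the degrader's implementation.

```python
import numpy as np


def keep_probability(z, pivot_redshift=1.0):
    """p_keep(z) = min(1, z_p / z), with galaxies at z = 0 always kept."""
    z = np.asarray(z, dtype=float)
    ratio = np.divide(pivot_redshift, z, out=np.full_like(z, np.inf), where=z > 0)
    return np.minimum(1.0, ratio)


rng = np.random.default_rng(0)
z_true = np.array([0.2, 0.8, 1.0, 2.0, 3.0])
p_keep = keep_probability(z_true)

# draw one uniform number per galaxy; True means the galaxy is kept
flag = rng.random(z_true.size) < p_keep
print(p_keep)
```

Every galaxy at or below the pivot redshift is kept, while a galaxy at $z = 2$ survives only half of the time on average.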

**NOTE**:

As you'll see later with the evaluators, they'll require the samples that we want to compare to be the same length. But if you've removed galaxies due to incompleteness, they won't inherently be the same length. So instead, what we're going to do is flag those galaxies that are removed. 

To do this, we can use the parameter `drop_rows=False`. This will return a data table of the same length as before, with a "flag" column that identifies which galaxies are to be kept, and which are to be dropped. 

In [None]:
# calibration data set
calib_data_inc = (
    ri.creation.degraders.spectroscopic_degraders.inv_redshift_incompleteness(
        sample=calib_data_photerrs["output"], pivot_redshift=1.0
    )
)

# target data set - use drop_rows to ensure it's the same length
targ_data_inc = (
    ri.creation.degraders.spectroscopic_degraders.inv_redshift_incompleteness(
        sample=targ_data_photerrs["output"], pivot_redshift=1.0, drop_rows=False
    )
)
targ_data_inc["output"]  # look at the output

We can see that, as expected, the target data set has the "flag" column, and that the length of the data set is still 600. Now let's take a look at the calibration data set, where we left `drop_rows` as `True`:

In [None]:
calib_data_inc["output"]  # look at the output

This data set is shorter than the target data set now, since those galaxies have just been removed from the data entirely. This isn't a problem for the calibration data set, since we don't need to compare it to anything later. Let's plot a histogram of the calibration data set redshifts with just the photometric errors, and compare it to our new data set with both that and the redshift incompleteness:

In [None]:
plt.hist(calib_data_photerrs["output"]["redshift"], label="input", **hist_options)
plt.hist(calib_data_inc["output"]["redshift"], label="output", **hist_options)
plt.legend(loc="best")
plt.xlabel("redshift")
plt.ylabel("number of galaxies")

The output data set clearly has fewer galaxies than the input data set above redshift of 1, and the distributions are the same for redshifts less than 1, as expected. 

For the target data set, we just have one more step that we need to do before we can feed it into any other degraders. We use the "flag" column to mask all of the "dropped" galaxy rows and set them all as `np.nan` - this keeps the indices the same, allowing us to compare to the truth data set as is our goal.  

In [None]:
# save the column as a separate variable
inc_flag = targ_data_inc["output"]["flag"]

# drop the flag column from the dataframe entirely
targ_data_inc["output"].drop(columns="flag", inplace=True)

# replace the lines that are cut out by the degrader with np.nan
new_targ_data_inc = targ_data_inc["output"].where(inc_flag, np.nan)

# take a look at the result
new_targ_data_inc

The new dataframe is the same length as the old one, but without the flag column, and now those rows will just be `np.nan`. 
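
As a small illustration of the difference between masking with `where` and dropping rows outright, consider this toy table (the values and the `keep` flags are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"redshift": [0.4, 1.8, 2.5], "mag_i_lsst": [22.0, 24.5, 25.3]})
keep = pd.Series([True, False, True])

masked = toy.where(keep, np.nan)  # same length; rows with keep == False become NaN
dropped = toy[keep]               # shorter table; original index labels are retained

print(masked)
print(dropped)
```

The masked table still has three rows, so row `i` continues to refer to the same galaxy as row `i` of the truth table.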



### 3. Line Confusion

This method simulates the effect of misidentified lines. The degrader will misidentify some percentage (`frac_wrong`) of the actual lines (here we're picking $5007.0~\mathring{\mathrm{A}}$, which are OIII lines) as the line we pick for `wrong_wavelen`. In this case, we'll pick $3727.0~\mathring{\mathrm{A}}$, which are OII lines. 

This degrader doesn't cut any galaxies, so we don't have to worry about the `drop_rows` parameter. 
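
It can help to see what a misidentified line does to a single redshift. A line emitted at $\lambda_{\text{true}}$ is observed at $\lambda_{\text{true}}(1 + z)$; if it is read as $\lambda_{\text{wrong}}$ instead, the inferred redshift satisfies $1 + z_{\text{wrong}} = (1 + z)\,\lambda_{\text{true}} / \lambda_{\text{wrong}}$. The sketch below applies that relation; the helper name is hypothetical and the degrader's handling of edge cases is not reproduced here.

```python
import numpy as np


def confused_redshift(z_true, true_wavelen, wrong_wavelen):
    """Redshift inferred when a line at true_wavelen is mistaken for wrong_wavelen."""
    observed = true_wavelen * (1.0 + np.asarray(z_true))
    return observed / wrong_wavelen - 1.0


z_true = np.array([0.0, 0.5, 1.0])
z_wrong = confused_redshift(z_true, true_wavelen=5007.0, wrong_wavelen=3727.0)
print(np.round(z_wrong, 3))
```

Because 5007 Å is longer than 3727 Å, every confused galaxy is pushed to a higher redshift with these settings.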

In [None]:
# dataset 3: add in line confusion
calib_data_conf = ri.creation.degraders.spectroscopic_degraders.line_confusion(
    sample=calib_data_inc["output"],
    true_wavelen=5007.0,
    wrong_wavelen=3727.0,
    frac_wrong=0.05,
    seed=1337,
)

# dataset 3: add in line confusion using the modified data set
targ_data_conf = ri.creation.degraders.spectroscopic_degraders.line_confusion(
    sample=new_targ_data_inc,
    true_wavelen=5007.0,
    wrong_wavelen=3727.0,
    frac_wrong=0.05,
    seed=1450,
)

Now let's take a look at what this has done to our redshift distribution by plotting the input calibration data set against the one output by the `line_confusion` method:

In [None]:
plt.hist(calib_data_inc["output"]["redshift"], label="input data", **hist_options)
plt.hist(calib_data_conf["output"]["redshift"], label="output data", **hist_options)
plt.legend(loc="best")
plt.xlabel("redshift")
plt.ylabel("number of galaxies")

We can see that the output data has a few small differences in the distribution, spread across the whole range of redshifts. 

### 4. Quantity Cut

This method cuts galaxies based on their band magnitudes. It takes a dictionary of cuts, where you can provide the band name and the values to cut that band on (for example, `{"mag_i_lsst": 25.0}`). If one value is given, it's considered a maximum, and if a tuple is given, it's considered a range within which the sample is selected. For this, we'll just set a maximum magnitude for the i band of 25.
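
The two kinds of cut described above can be sketched as follows. The helper `toy_quantity_cut` is hypothetical, and the use of strict inequalities at the boundaries is an assumption rather than a statement about the degrader.

```python
import pandas as pd


def toy_quantity_cut(data, cuts):
    """Return a boolean keep-flag per row for a dictionary of cuts."""
    keep = pd.Series(True, index=data.index)
    for column, cut in cuts.items():
        if isinstance(cut, (tuple, list)):
            low, high = cut  # a pair is treated as a (min, max) range
            keep &= (data[column] > low) & (data[column] < high)
        else:
            keep &= data[column] < cut  # a single value is treated as a maximum
    return keep


toy = pd.DataFrame({"mag_i_lsst": [22.0, 24.9, 25.4, float("nan")]})
max_only = toy_quantity_cut(toy, {"mag_i_lsst": 25.0})
in_range = toy_quantity_cut(toy, {"mag_i_lsst": (23.0, 25.0)})
print(max_only.tolist(), in_range.tolist())
```

Note that comparisons with NaN evaluate to False, so in this sketch a row with a missing magnitude is never kept.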

Since this method cuts galaxies, we're going to follow the steps we used for the `inv_redshift_incompleteness` method to keep our target dataset at the same length:

In [None]:
# keep only galaxies brighter than the cut (i-band magnitude below 25)
calib_data_cut = ri.creation.degraders.quantityCut.quantity_cut(
    sample=calib_data_conf["output"], cuts={"mag_i_lsst": 25.0}
)

# apply the same cut to the target data, with drop_rows=False to keep the data set the same length
targ_data_cut = ri.creation.degraders.quantityCut.quantity_cut(
    sample=targ_data_conf["output"], cuts={"mag_i_lsst": 25.0}, drop_rows=False
)
targ_data_cut["output"]

We can see that there's been a flag column added to the target data again, but this time the flags are 1 and 0 instead of True and False. Let's save the flag column and drop it from the main DataFrame. We're going to do something a little different with the data later, so we don't need to do the `np.nan` substitution from earlier.

In [None]:
# save flag column
cut_flag = targ_data_cut["output"]["flag"]

# drop flag column from dataframe
targ_data_cut["output"].drop(columns="flag", inplace=True)

Now let's plot a histogram of the calibration data set we input into the `quantity_cut` method compared to the output calibration data set to see how it's changed the number and distribution of galaxies:

In [None]:
plt.hist(calib_data_conf["output"]["redshift"], label="input data", **hist_options)
plt.hist(calib_data_cut["output"]["redshift"], label="output data", **hist_options)
plt.legend(loc="best")
plt.xlabel("redshift")
plt.ylabel("number of galaxies")

We can see our output distribution has roughly the same shape, but with significantly fewer galaxies overall. 

Now we have applied four different degraders, so we've set up our various calibration data sets, and our target data set. The final step is to use the dictionary we made earlier of error column names (`rename_dict`) and the RAIL function `column_mapper` to rename the error columns, so they match the expected names for the later steps:

In [None]:
# renames error columns to match DC2 for calibration data sets

# photometric errors
df_calib_data_photerrs = ri.tools.table_tools.column_mapper(
    data=calib_data_photerrs["output"], columns=rename_dict
)

# redshift incompleteness
df_calib_data_inc = ri.tools.table_tools.column_mapper(
    data=calib_data_inc["output"], columns=rename_dict
)

# line confusion
df_calib_data_conf = ri.tools.table_tools.column_mapper(
    data=calib_data_conf["output"], columns=rename_dict
)

# quantity cut
df_calib_data_cut = ri.tools.table_tools.column_mapper(
    data=calib_data_cut["output"], columns=rename_dict
)


# renames error columns for target data set
df_targ_data = ri.tools.table_tools.column_mapper(
    data=targ_data_cut["output"], columns=rename_dict
)

Now that we have all four of our calibration data sets, let's plot them all together to get a final look at their differences:

In [None]:
plt.hist(
    df_calib_data_photerrs["output"]["redshift"],
    label="photometric errors",
    **hist_options,
)
plt.hist(
    df_calib_data_inc["output"]["redshift"], label="z incompleteness", **hist_options
)
plt.hist(
    df_calib_data_conf["output"]["redshift"], label="line confusion", **hist_options
)
plt.hist(df_calib_data_cut["output"]["redshift"], label="quantity cut", **hist_options)

plt.legend(loc="best")
plt.xlabel("redshift")
plt.ylabel("number of galaxies")

We have one final step to do to our target data set before we can use it in the estimation and evaluation stages. For this data set to work with the RAIL evaluation stages, we want a few things:

1. Our degraded (and cut down) target DataFrame indices to match up with our original target DataFrame indices
2. Our target data sets to not have columns with all NaNs 
3. Our target data to have linearly increasing indices (i.e. not retain the masked indices) 

In order to accomplish this, we're going to do the following:
1. Mask the degraded target data set using our existing masks 
2. Mask the "truth" target data set using the existing masks 
3. Reindex both of these arrays 

In [None]:
# mask the degraded target data
masked_targ_data = df_targ_data["output"][cut_flag & inc_flag]

# reset the index
reindexed_targ_data = masked_targ_data.reset_index(drop=True)
reindexed_targ_data

In [None]:
# mask the original ("truth") target data with the same flags
masked_targ_data_orig = targ_data_orig["output"][cut_flag & inc_flag]

# reset the index
reindexed_targ_data_orig = masked_targ_data_orig.reset_index(drop=True)
reindexed_targ_data_orig

We can see that these DataFrames are now the same length, with indices that actually match the length of the arrays and that are linearly increasing, which is what we wanted. Now these can be appropriately compared to each other in the later steps. 
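
The mask-then-reindex pattern used above can be checked on a toy example. The tables and flags here are hypothetical, and the explicit `astype(bool)` is an extra safeguard for the 1/0 flag rather than something the cells above require.

```python
import numpy as np
import pandas as pd

truth = pd.DataFrame({"redshift": [0.3, 0.9, 1.4, 2.2]})
degraded = pd.DataFrame({"redshift": [0.31, np.nan, 1.38, 2.25]})

inc_flag = pd.Series([True, False, True, True])  # boolean flag
cut_flag = pd.Series([1, 1, 0, 1])               # 1/0 flag

keep = cut_flag.astype(bool) & inc_flag

# apply the same mask to both tables, then renumber the rows from zero
degraded_kept = degraded[keep].reset_index(drop=True)
truth_kept = truth[keep].reset_index(drop=True)
print(truth_kept)
```

Because both tables are filtered with the same mask before the index is reset, row `i` of one table still corresponds to row `i` of the other.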

## 3. Calibrating the photometric redshift algorithms with the differently degraded data

Now we can loop through each of the calibration datasets to calibrate our algorithms. We'll use all four of our calibration data sets to calibrate our models. 

For this notebook, we'll use the [K-Nearest Neighbours](https://rail-hub.readthedocs.io/en/latest/source/rail_stages/estimation.html#k-nearest-neighbor) (KNN) algorithm, which is a wrapper around `sklearn`'s nearest neighbour (NN) machine learning model. Essentially, it takes a given galaxy, identifies its nearest neighbours in colour space (galaxies that have similar colours), and then constructs the photometric redshift PDF as a sum of Gaussians, one from each neighbour. For more details on how this algorithm works, you can see the [Wikipedia page](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) or the [Quick Start in Estimation](https://rail-hub.readthedocs.io/projects/rail-notebooks/en/latest/interactive_examples/rendered/estimation_examples/00_Quick_Start_in_Estimation.html) notebook.
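
To give a feel for the idea, here is a toy, brute-force version of turning nearest neighbours into Gaussian mixture parameters. Everything in it is an assumption for illustration: the helper name, the inverse-distance weighting, the fixed width `sigma`, and the synthetic colours. The actual estimator relies on `sklearn` and its own choices for these details.

```python
import numpy as np


def knn_mixture_params(train_colors, train_z, target_color, k=3, sigma=0.05):
    """Means, widths and weights of a Gaussian mixture built from the k nearest neighbours."""
    dist = np.linalg.norm(train_colors - target_color, axis=1)
    nearest = np.argsort(dist)[:k]
    weights = 1.0 / (dist[nearest] + 1e-12)  # closer neighbours count more (assumed weighting)
    weights /= weights.sum()
    means = train_z[nearest]                 # each Gaussian sits on a neighbour's redshift
    stds = np.full(k, sigma)
    return means, stds, weights


rng = np.random.default_rng(42)
train_z = rng.uniform(0, 3, size=200)
# hypothetical colours that vary smoothly with redshift, plus a little noise
train_colors = np.column_stack([train_z, np.sin(train_z)]) + rng.normal(0, 0.02, (200, 2))

target_color = np.array([1.0, np.sin(1.0)])
means, stds, weights = knn_mixture_params(train_colors, train_z, target_color)
print(means, weights)
```

A calibration set that lacks galaxies in some region of colour space forces the neighbours to come from further away, which is one way the degraders can bias the resulting PDFs.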

The calibration methods of RAIL algorithms are called *informers*, so the function we want to use is called `k_near_neigh_informer()`.

Useful parameters:
- `nondetect_val`: This tells the code which values are considered non-detections. We pass in `np.nan` here, since that's what we used as the `ndFlag` in the degradation stage for non-detections. 
- `hdf5_groupname`: the dictionary key the code will find the data under. Set to `""` if the data is passed in directly. 

First, we'll set up a dictionary with all four of the calibration datasets, and an empty dictionary to store the calibrated models:

In [None]:
# make a dictionary of the calibration datasets to iterate through
calib_datasets = {
    "lsst_error_model": df_calib_data_photerrs,
    "inv_redshift_inc": df_calib_data_inc,
    "line_confusion": df_calib_data_conf,
    "quantity_cut": df_calib_data_cut,
}

# set up dictionary for output
knn_models = {}

Now we'll iterate through the datasets, calibrating a model for each calibration set:

In [None]:
for key, item in calib_datasets.items():

    # calibrate the model
    inform_knn = ri.estimation.algos.k_nearneigh.k_near_neigh_informer(
        training_data=item["output"], nondetect_val=np.nan, hdf5_groupname=""
    )
    
    knn_models[key] = inform_knn

In [None]:
# let's see what the output looks like 
knn_models["lsst_error_model"]

We can see that the models output by this algorithm include a dictionary of data and an `sklearn` object. 

## 4. Estimating the photometric redshifts of a set of target galaxies using the calibrated models

Now that we've got all four of our models, we can use the *Estimator* of the KNN Algorithm on our target data set to get our photometric redshift probability distribution functions. It takes the same parameters we gave to the *informer* above that relate to the data format, as well as the galaxy data to estimate redshifts for as `input_data`, and the model from the *informer* stage as `model`.

We'll iterate over each of the models, storing the estimated redshifts in a dictionary:

In [None]:
estimated_photoz = {} # set up a dictionary to store estimates in 

for key, item in knn_models.items():

    # estimate the photozs
    knn_estimated = ri.estimation.algos.k_nearneigh.k_near_neigh_estimator(
        input_data=reindexed_targ_data,
        model=item["model"],
        nondetect_val=np.nan,
        hdf5_groupname="",
    )

    # add estimates to dictionary under the appropriate key
    estimated_photoz[key] = knn_estimated

Now let's take a look at what the output of the estimation stage actually looks like. Most estimation stages output an `Ensemble`, which is a data structure from the package `qp`. For more information, see [the qp documentation](https://qp.readthedocs.io/en/main/user_guide/datastructure.html). 

We're using an `Ensemble` to hold a redshift distribution for each of the galaxies we're estimating. There are two required dictionaries that make up an Ensemble, and one that is optional:
- `.metadata`: Contains information about the whole data structure, like the Ensemble type, and any shared parameters such as the bins of histograms. This is not per-object metadata. 
- `.objdata`: The main data points of the distributions for each object, where each object is a row. 
- `.ancil`: the optional dictionary, containing extra information about each object. It can have arrays that have one or more data points per distribution. 

In [None]:
# estimated_photoz contains the output of the KNN estimate function for each of our
# parameter sets. Here we print out the result for just one of them. We can see that
# the Ensemble has the same number of rows as galaxies that we input, and some number
# of points per row
print(estimated_photoz["lsst_error_model"])

We can see that this algorithm outputs Ensembles of class `mixmod`, which are just combinations of Gaussians (for more info see the [qp docs](https://qp.readthedocs.io/en/main/user_guide/parameterizations/mixmod.html)). 

So each distribution in this Ensemble has a set of Gaussians that, added together, make up the distribution. Each distribution is therefore described by a set of means, weights, and standard deviations. The shape portion of the print statement tells us two things: the first number is the number of photo-z distributions, or galaxies, in this `Ensemble`, and the second number tells us how many Gaussians are combined to make up each photo-z distribution. 
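
As a sketch of how means, standard deviations, and weights define a PDF, the snippet below evaluates a three-component Gaussian mixture on a grid with plain NumPy. The parameter values are hypothetical, and in practice the `Ensemble`'s own `.pdf()` method does this for you.

```python
import numpy as np


def mixture_pdf(z_grid, means, stds, weights):
    """Evaluate a weighted sum of Gaussians on a grid of redshifts."""
    z = np.asarray(z_grid)[:, None]
    gauss = np.exp(-0.5 * ((z - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return gauss @ weights


# hypothetical parameters for one galaxy
means = np.array([0.95, 1.00, 1.40])
stds = np.array([0.05, 0.05, 0.05])
weights = np.array([0.5, 0.3, 0.2])

z_grid = np.linspace(0, 3, 3001)
pdf = mixture_pdf(z_grid, means, stds, weights)

area = np.sum(pdf) * (z_grid[1] - z_grid[0])  # simple Riemann sum, should be close to 1
zmode = z_grid[np.argmax(pdf)]                # grid-based mode of the distribution
print(area, zmode)
```

The grid-based mode computed here is the same kind of point estimate as a "zmode": the redshift at which the PDF peaks.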

Let's take a look at what the different dictionaries look like for this `Ensemble`:  

In [None]:
# this is the metadata dictionary of that output Ensemble
print(estimated_photoz["lsst_error_model"]["output"].metadata)

In [None]:
# this is the actual distribution data of that output Ensemble, which contains
# the data points that describe each photometric redshift probability distribution
print(estimated_photoz["lsst_error_model"]["output"].objdata)

Typically the ancillary data table includes a photo-z point estimate derived from the PDFs. By default this is the mode of the distribution, called 'zmode' in the ancillary dictionary below:

In [None]:
# this is the ancillary dictionary of the output Ensemble, which in this case
# contains the zmode, redshift, and distribution type
print(estimated_photoz["lsst_error_model"]["output"].ancil)

Now let's plot one redshift PDF from each of our four estimated redshift distribution datasets to compare them:

In [None]:
xvals = np.linspace(0, 3, 200)  # we want to cover the whole available redshift space
gal_id = 100 # the galaxy we'll look at 
for key, df in estimated_photoz.items():
    plt.plot(xvals, df["output"][gal_id].pdf(xvals), label=key)

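# plot the true redshift, taken from the masked and reindexed truth table so that
# gal_id refers to the same galaxy as in the estimates
plt.axvline(
    reindexed_targ_data_orig["redshift"].iloc[gal_id],
    color="k",
    ls="--",
    label="true redshift",
)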

plt.legend(loc="best", title="calibration dataset")
plt.xlabel("redshift")
plt.ylabel("p(z)")

This plot shows us the estimated photo-z PDF for a single galaxy (the one at index `gal_id`) with each of the different calibration sets, compared to the redshift from the "true" target dataset we sampled at the beginning.

Plotting one distribution at a time isn't the best way to get a sense of how the whole set of galaxy redshift distributions changes, so let's summarize these distributions. This will give us a sense of how all of the estimated redshift distributions change with each different calibration data set. There are a number of summarizing algorithms, but here we'll use two of the most basic: 

1. [**Point Estimate Histogram**](https://rail-hub.readthedocs.io/en/latest/source/rail_stages/estimation.html#point-estimate-histogram): This algorithm creates a histogram of all the point estimates of the photometric redshifts. By default, the point estimate used is `zmode`, which is usually found in the ancillary dictionary of the distributions. 
2. [**Naive Stacking**](https://rail-hub.readthedocs.io/en/latest/source/rail_stages/estimation.html#naive-stacking): This algorithm stacks the PDFs of the estimated photometric redshifts together and normalizes the stacked distribution.  
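
The two summarizers above can be sketched with NumPy on gridded PDFs. This is a simplified illustration on hypothetical single-Gaussian PDFs, not the RAIL implementation (for example, it produces no bootstrapped samples).

```python
import numpy as np

rng = np.random.default_rng(1)
z_grid = np.linspace(0, 3, 301)
dz = z_grid[1] - z_grid[0]

# hypothetical per-galaxy PDFs: one Gaussian each, centred on a random redshift
centres = rng.uniform(0.3, 2.5, size=50)
pdfs = np.exp(-0.5 * ((z_grid[None, :] - centres[:, None]) / 0.1) ** 2)
pdfs /= pdfs.sum(axis=1, keepdims=True) * dz  # normalise each row to unit area

# point estimate histogram: histogram the mode of each PDF
zmode = z_grid[np.argmax(pdfs, axis=1)]
bins = np.linspace(0, 3, 31)
point_hist, _ = np.histogram(zmode, bins=bins, density=True)

# naive stacking: sum the PDFs and renormalise
stacked = pdfs.sum(axis=0)
stacked /= stacked.sum() * dz
```

The histogram discards the width of each PDF, while the stack retains it, which is why the stacked distribution comes out smoother.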

In [None]:
# set up dictionaries for output
point_est_dict = {}
naive_stack_dict = {}

for key, item in estimated_photoz.items():

    # get the summary of the point estimates
    point_estimate_ens = ri.estimation.algos.point_est_hist.point_est_hist_summarizer(
        input_data=item["output"]
    )
    point_est_dict[key] = point_estimate_ens

    # get a summary of the PDFs
    naive_stack_ens = ri.estimation.algos.naive_stack.naive_stack_summarizer(
        input_data=item["output"]
    )
    naive_stack_dict[key] = naive_stack_ens

Now let's take a look at the output dictionaries for both these functions for one of the distributions:

In [None]:
print(point_est_dict["lsst_error_model"])
print(naive_stack_dict["lsst_error_model"])

These functions output `Ensembles`, just like the KNN estimation algorithm. However, they output two separate `Ensembles`: the "single_NZ" one contains just one distribution, the actual stacked distribution that has been created. The 'output' one contains a number of bootstrapped distributions, to make further analysis easier.

We're going to focus on the "single_NZ" distribution here. We'll start by plotting the point estimate summarized distributions for all of the runs, which are histograms:

In [None]:
# get bin centers and widths
bin_width = (
    point_est_dict["lsst_error_model"]["single_NZ"].metadata["bins"][1]
    - point_est_dict["lsst_error_model"]["single_NZ"].metadata["bins"][0]
)
bin_centers = (
    point_est_dict["lsst_error_model"]["single_NZ"].metadata["bins"][:-1]
    + point_est_dict["lsst_error_model"]["single_NZ"].metadata["bins"][1:]
) / 2

for key, df in point_est_dict.items():
    plt.bar(
        bin_centers,
        df["single_NZ"].objdata["pdfs"],
        width=bin_width,
        alpha=0.7,
        label=key,
    )

plt.legend(loc="best")
plt.xlabel("redshift")
plt.ylabel("N(z)")

It's a little difficult to see the differences between so many distributions in this format, but you can get a sense that there are some distinct differences in the distributions of redshifts. 

Let's plot the summarized distributions from the Naive Stacking algorithm, which are smoothed distributions since they are created by stacking the full photo-z PDFs instead of point estimates: 

In [None]:
for key, df in naive_stack_dict.items():
    plt.plot(
        df["single_NZ"].metadata["xvals"], df["single_NZ"].objdata["yvals"], label=key
    )

plt.legend(loc="best")
plt.xlabel("redshift")
plt.ylabel("N(z)")

It's a bit easier to see the differences between the distributions of redshifts in this plot. We can see that the overall shape of the distributions is the same, but there are some significant differences, in particular at higher redshifts. 

If you'd like to save these summarized distributions so you can use them elsewhere, or compare them using your own algorithms, you likely want them as just an array instead of an Ensemble. These Ensembles are of the type "interp", which means that they have their data already stored in a grid of x and y values. So it's easy to just pull those values out into an array, like so:

In [None]:
# returns the array of y values of the summarized photo-z distribution for all the galaxies 
y_arr = naive_stack_dict["lsst_error_model"]["single_NZ"].objdata['yvals']
type(y_arr) 

However, if you want to do this for any type of Ensemble, the method that works the most consistently is to use the `.pdf()` function, which will return the values of the PDF at a given set of redshifts. We can use this to get a dictionary of arrays, one for each of the distributions we've summarized, and turn it into a pandas DataFrame:

In [None]:
import pandas as pd

array_dict = {}
z_grid_out = np.linspace(0, 3, 301)  # create a set of z values to sample the PDFs on
# add the z grid into the dictionary
array_dict["z_grid_values"] = z_grid_out

# calculate the PDF values for each of the different distributions 
for key, item in naive_stack_dict.items():
    array_dict[key] = item["single_NZ"].pdf(z_grid_out) 

df_summarized_dist = pd.DataFrame(array_dict)
df_summarized_dist.head()

Now you can save the pandas DataFrame to whatever file type you would like, or pass it on to another algorithm. 
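
As one example of saving, here is a minimal sketch that writes a DataFrame to CSV and reads it back. It uses a small toy DataFrame in place of `df_summarized_dist`, and the file name and temporary directory are arbitrary choices made for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# toy stand-in for df_summarized_dist (NOT real notebook output)
z_grid = np.linspace(0, 3, 301)
toy_df = pd.DataFrame(
    {
        "z_grid_values": z_grid,
        "toy_distribution": np.exp(-0.5 * ((z_grid - 1.0) / 0.3) ** 2),
    }
)

# write to a CSV file in a temporary directory, then read it back in
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "summarized_distributions.csv")
    toy_df.to_csv(out_path, index=False)  # index=False skips the row-number column
    reloaded = pd.read_csv(out_path)

print(reloaded.head())
```

In practice you would replace the temporary directory with a path of your choosing. Other pandas writers work in the same way, though some of them need extra packages installed.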

## 5. Seeing how the algorithm calibration affected the output redshift distributions

You can compare your estimated redshift distributions however you want, but RAIL has a number of built-in metrics for comparing distributions, so we'll use those here. They are part of the [*evaluation* stage](https://rail-hub.readthedocs.io/en/latest/source/rail_stages/evaluation.html). For a more detailed look at all of the available metrics and how to use them, take a look at the `01_Evaluation_by_Type.ipynb` notebook.

Here we're just going to use two of the available metrics:
1. The [Brier score](https://en.wikipedia.org/wiki/Brier_score), which assesses the accuracy of probabilistic predictions. The lower the score, the better the predictions.  
2. The [Conditional Density Estimation loss](https://vitaliset.github.io/conditional-density-estimation/), which is the averaged squared loss between the true and predicted conditional probability density functions. The lower the score, the better the predicted probability density, in this case, the photometric redshift distributions.
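
To make these two definitions more concrete, here is a minimal NumPy sketch for PDFs evaluated on a shared redshift grid. It only illustrates the quantities involved. It is not RAIL's implementation, and RAIL's binning and normalization choices may differ. The function names and the toy Gaussian PDFs are made up for this example.

```python
import numpy as np


def cde_loss(pdfs, grid, z_true):
    """CDE loss estimate: mean integral of pdf^2 minus 2 * mean pdf at the true z.

    pdfs has shape (n_galaxies, n_grid); lower values are better.
    """
    dz = np.diff(grid)
    sq = pdfs**2
    # trapezoid-rule integral of pdf^2 for every galaxy, averaged over galaxies
    term1 = np.mean(np.sum(0.5 * (sq[:, 1:] + sq[:, :-1]) * dz, axis=1))
    # PDF of each galaxy evaluated at its own true redshift
    at_truth = np.array([np.interp(z, grid, p) for z, p in zip(z_true, pdfs)])
    return term1 - 2.0 * np.mean(at_truth)


def brier_score(pdfs, grid, z_true):
    """Multi-category Brier score, treating each grid point as one bin."""
    probs = pdfs / pdfs.sum(axis=1, keepdims=True)  # per-bin probabilities
    # one-hot encode the grid point nearest to each true redshift
    nearest = np.abs(grid[None, :] - np.asarray(z_true)[:, None]).argmin(axis=1)
    truth = np.zeros_like(probs)
    truth[np.arange(len(z_true)), nearest] = 1.0
    return np.mean(np.sum((probs - truth) ** 2, axis=1))


def gaussian_pdfs(centers, sigma, grid):
    """Toy Gaussian photo-z PDFs, one row per galaxy."""
    c = np.asarray(centers)[:, None]
    norm = sigma * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * ((grid[None, :] - c) / sigma) ** 2) / norm


grid = np.linspace(0, 3, 301)
z_true = np.array([0.5, 1.0, 1.5])

good = gaussian_pdfs(z_true, 0.1, grid)  # centered on the true redshifts
biased = gaussian_pdfs(z_true + 0.3, 0.1, grid)  # offset by 0.3 in redshift

for name, pdfs in [("good", good), ("biased", biased)]:
    print(name, cde_loss(pdfs, grid, z_true), brier_score(pdfs, grid, z_true))
```

In this toy setup, the PDFs centered on the true redshifts score lower on both metrics than the offset ones. Note that the CDE loss can be negative, since it is only defined up to an additive constant.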

For the evaluation metrics, in general we need the estimated redshift distributions, and the actual redshifts -- these are the pre-degradation redshifts from our initially sampled distribution. This is why we did all of that data wrangling earlier to get our estimated redshifts to line up with our pre-degradation photometry data.

In [None]:
# set up dictionaries for output
eval_dict = {}

for key, item in estimated_photoz.items():
    # evaluate the results
    evaluator_stage_dict = dict(
        metrics=["cdeloss", "brier"],
        _random_state=None,
        metric_config={
            "brier": {"limits": (0, 3.1)},
        },
    )

    the_eval = ri.evaluation.dist_to_point_evaluator.dist_to_point_evaluator(
        data=item["output"],
        truth=reindexed_targ_data_orig,
        **evaluator_stage_dict,
        hdf5_groupname="",
    )

    # put the evaluation results in a dictionary so we have them
    eval_dict[key] = the_eval

Now let's take a look at the metrics we calculated, and compare them. The code below just selects the one dictionary output per run that we want to look at, to make the dictionary a little easier to read. 

In [None]:
# pull data out of the sub-dictionary to make the dictionaries easier to read
results_dict = {key: val["summary"] for key, val in eval_dict.items()}

results_dict

We can also plot these metrics to better visualize which calibration data sets gave better scores, 'better' here meaning lower for both of the metrics: 

In [None]:
for key, value in eval_dict.items():
    plt.scatter(value["summary"]["brier"], value["summary"]["cdeloss"], label=key)

plt.legend(loc="best")
plt.xlabel("Brier score")
plt.ylabel("CDE loss")

This gives us a bit of a clearer picture of which calibration distributions did better than others. It's also clear from this why multiple metrics can be useful, since some of these distributions do better in one metric than the other. 
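
If you'd rather read off the ordering than judge it by eye, you could rank the runs by each metric. The sketch below assumes each metric is a single number per run and uses hypothetical values in the same layout as `results_dict`. The run names, the numbers, and the `rank_runs` helper are all made up for illustration.

```python
# hypothetical metric values, laid out like results_dict above (NOT real output)
toy_results = {
    "run_a": {"brier": 0.95, "cdeloss": -4.1},
    "run_b": {"brier": 0.97, "cdeloss": -4.6},
    "run_c": {"brier": 0.99, "cdeloss": -3.2},
}


def rank_runs(results, metric):
    """Return run names ordered from best (lowest) to worst for one metric."""
    return sorted(results, key=lambda name: float(results[name][metric]))


brier_ranking = rank_runs(toy_results, "brier")
cde_ranking = rank_runs(toy_results, "cdeloss")

print("by Brier score:", brier_ranking)
print("by CDE loss:   ", cde_ranking)
```

In this made-up example the two rankings disagree about the best run, which is the kind of situation where looking at more than one metric helps.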

## Next Steps

If you'd like to parallelize your iteration in order to speed things up, take a look at the [introduction to RAIL interactive](https://rail-hub.readthedocs.io/projects/rail-notebooks/en/latest/interactive_examples/rendered/estimation_examples/Estimating_Redshifts_and_Comparing_Results_for_Different_Parameters.html) notebook. 

To learn more about the creation stage of RAIL, and the available degraders, take a look at the [RAIL Creation docs](https://rail-hub.readthedocs.io/en/latest/source/rail_stages/creation.html). 

Similarly, if you'd like to learn more about the Evaluation stage, you can take a look at the [RAIL Evaluation docs](https://rail-hub.readthedocs.io/en/latest/source/rail_stages/evaluation.html), or try out the [Evaluation by type](https://rail-hub.readthedocs.io/projects/rail-notebooks/en/latest/interactive_examples/rendered/evaluation_examples/01_Evaluation_by_Type.html) notebook.