# Running RAIL with a different dataset

**Authors:** Sam Schmidt

**Last run successfully:** September 24, 2025

This notebook gives a quick example of running a `rail` algorithm with a different dataset and overriding configuration parameters.

Most of our other demo notebooks use small datasets included with the RAIL demo package, all with the same input names.  These datasets are named consistently with many of the default parameter values used in RAIL, e.g. `hdf5_groupname="photometry"` and ugrizy photometry named in a pattern `"mag_{band}_lsst"`, often specified in `SHARED_PARAMS`.  

This notebook will just show a quick run with an alternate dataset, showing the values that users will likely need to change in order to get things running.  



In [None]:
import os
import matplotlib.pyplot as plt
import numpy as np
import tables_io

First, we'll grab a small tar file from NERSC containing some data drawn from the Roman-Rubin simulation:

In [None]:
# Download and unpack the demo data only if the tar file is not already present
training_file = "./romanrubin_demo_data.tar"

if not os.path.exists(training_file):
    os.system('curl -O https://portal.nersc.gov/cfs/lsst/PZ/romanrubin_demo_data.tar')
    os.system('tar -xvf romanrubin_demo_data.tar')

Let's load one of the files and look at the contents:

In [None]:
infile = "romanrubin_train_data.hdf5"
data = tables_io.read(infile)
data.keys()

We can see that, unlike the demo data in other notebooks, there is no top-level HDF5 group named `"photometry"`; the data sit directly at the top level of the file.  As such, we will need to specify `hdf5_groupname=""` to override the RAIL default of `"photometry"`.
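
As a rough illustration of the check described above, the sketch below separates top-level groups from top-level columns. It assumes the file has already been read into a dict-like object, with groups appearing as nested mappings and columns as arrays. The helper name `top_level_layout` and the two toy tables are hypothetical and not part of RAIL or `tables_io`.

```python
from collections.abc import Mapping

import numpy as np


def top_level_layout(data):
    """Hypothetical helper: split top-level keys into groups and columns.

    Assumes `data` is dict-like, with nested mappings for groups and
    array-like values for columns.
    """
    groups = sorted(k for k, v in data.items() if isinstance(v, Mapping))
    columns = sorted(k for k, v in data.items() if not isinstance(v, Mapping))
    return groups, columns


# Toy stand-ins for the two layouts discussed in the text (not real data)
flat_table = {"u": np.array([24.1, 25.3]), "redshift": np.array([0.4, 1.2])}
grouped_table = {"photometry": {"mag_u_lsst": np.array([24.1, 25.3])}}

flat_groups, flat_columns = top_level_layout(flat_table)
nested_groups, nested_columns = top_level_layout(grouped_table)
```

If no groups are found and the band columns appear at the top level, as with `flat_table`, an empty `hdf5_groupname` is the setting to reach for.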

We also see that the magnitudes and errors are simply named with the band name, e.g. `"u"` rather than `"mag_u_lsst"`.  Again, we will need to specify the band and error names in order to override the defaults in RAIL.  Let's do that below, using the `KNearNeighInformer` and `KNearNeighEstimator` algorithms:

In [None]:
from rail.core.data import TableHandle
from rail.core.stage import RailStage
from rail.estimation.algos.k_nearneigh import KNearNeighInformer, KNearNeighEstimator

In [None]:
DS = RailStage.data_store
DS.__class__.allow_overwrite = True

In [None]:
trainFile = "./romanrubin_train_data.hdf5"
testFile = "./romanrubin_test_data.hdf5"
training_data = DS.read_file("training_data", TableHandle, trainFile)
test_data = DS.read_file("test_data", TableHandle, testFile)

## The dataset-specific parameters

We will need to specify several parameters to override the default values in RAIL.  We can collect these in a dictionary and pass it to `make_stage` for our informer.  Because we have Roman J and H, we will also demonstrate running with eight bands rather than the default six.

RAIL requires that we specify the names of the input magnitude columns as `bands` and the names of the corresponding error columns as `err_bands`.  Most algorithms also require a `ref_band`.  To handle non-detections, RAIL uses a dictionary of `mag_limits`, which must contain a key for every column in `bands`, each mapped to the float value that replaces a non-detection in that band.  You may also need to specify a different `nondetect_val` if the dataset has a different convention for non-detections (in this dataset, non-detections have a value of `np.inf`).
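
To make the role of `mag_limits` and `nondetect_val` concrete, here is a minimal sketch of the replacement step on a toy table. It is not RAIL's internal code, and RAIL's own handling may differ in detail (for example in how the error columns are treated). The function name `replace_nondetects` and the toy magnitudes are made up for illustration.

```python
import numpy as np


def replace_nondetects(table, bands, mag_limits, nondetect_val=np.inf):
    """Illustrative only: return new arrays with non-detections replaced
    by the per-band limiting magnitude. The input table is not modified."""
    cleaned = {}
    for band in bands:
        mags = np.array(table[band], dtype=float)  # np.array makes a copy
        if np.isnan(nondetect_val):
            # NaN never compares equal to itself, so it needs its own test
            mask = np.isnan(mags)
        else:
            mask = mags == nondetect_val
        mags[mask] = mag_limits[band]
        cleaned[band] = mags
    return cleaned


# Toy magnitudes with np.inf marking non-detections (not real data)
toy_table = {
    "u": np.array([24.1, np.inf, 26.3]),
    "g": np.array([np.inf, 23.5, 25.0]),
}
toy_limits = {"u": 27.8, "g": 29.0}

cleaned = replace_nondetects(toy_table, ["u", "g"], toy_limits)
```

Each `np.inf` entry ends up at the limiting magnitude of its own band, which is why `mag_limits` needs a key for every entry in `bands`.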

**NOTE:** RAIL uses `SHARED_PARAMS`, a central location for specifying a subset of parameters that are common to a dataset, so that they can be set in one place when running multiple algorithms.  However, there is nothing special about them: any configuration parameter specified in `SHARED_PARAMS` can be overridden in the same way as any other parameter, and we will do that here with `bands`, `err_bands`, and so on.

Let's set up our dictionary with these values:

In [None]:
bands = ['u', 'g', 'r', 'i', 'z', 'y', 'J', 'H']
errbands = []
maglims = {}
limvals = [27.8, 29.0, 29.1, 28.6, 28.0, 27.0, 26.4, 26.4]
for band, limval in zip(bands, limvals):
    errbands.append(f"{band}_err")
    maglims[band] = limval


print(bands)
print(errbands)
print(maglims)


In [None]:
knn_dict = dict(hdf5_groupname='', bands=bands, err_bands=errbands, mag_limits=maglims, ref_band='i')
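
Before handing overrides like these to a stage, it can help to confirm that they are consistent with the columns actually present in the file. The sketch below is a hypothetical pre-flight check, not a RAIL function. It assumes that `ref_band` should be one of `bands` (true in this notebook), and the column list is a made-up stand-in for the keys of the real table.

```python
def check_overrides(columns, bands, err_bands, mag_limits, ref_band):
    """Hypothetical check: return a list of problems (empty if none found)."""
    problems = []
    available = set(columns)
    for name in list(bands) + list(err_bands):
        if name not in available:
            problems.append(f"column '{name}' not found in the data")
    if len(bands) != len(err_bands):
        problems.append("bands and err_bands have different lengths")
    for band in bands:
        if band not in mag_limits:
            problems.append(f"no mag_limits entry for band '{band}'")
    if ref_band not in bands:
        problems.append(f"ref_band '{ref_band}' is not in bands")
    return problems


# Same band naming as in this notebook; the column list is illustrative
demo_bands = ['u', 'g', 'r', 'i', 'z', 'y', 'J', 'H']
demo_errbands = [f"{band}_err" for band in demo_bands]
demo_limits = dict(zip(demo_bands, [27.8, 29.0, 29.1, 28.6, 28.0, 27.0, 26.4, 26.4]))
demo_columns = demo_bands + demo_errbands + ["redshift"]

ok_report = check_overrides(demo_columns, demo_bands, demo_errbands, demo_limits, 'i')
bad_report = check_overrides(demo_columns, demo_bands, demo_errbands, {}, 'K')
```

In the notebook itself, the real column names would come from the keys of the table read earlier rather than from `demo_columns`.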

We can now feed this into our inform stage:

In [None]:
pz_train = KNearNeighInformer.make_stage(name='inform_KNN', model='rd_demo_knn.pkl', **knn_dict)

In [None]:
%%time
pz_train.inform(training_data)

We can use the same dictionary to specify overrides for the estimator stage:

In [None]:
pz = KNearNeighEstimator.make_stage(name='KNN', model=pz_train.get_handle('model'), **knn_dict)
results = pz.estimate(test_data)

To check that things ran properly, let's first extract the mode (`zmode`) of each estimate from the results:

In [None]:
zmode = results().ancil['zmode'].flatten()

Let's plot the redshift mode against the true redshifts to see how they look:

In [None]:
plt.figure(figsize=(8, 8))
plt.scatter(test_data()['redshift'], zmode, s=1, c='k', label='KNN mode')
plt.plot([0, 3], [0, 3], 'r--')  # one-to-one line for reference
plt.xlabel("true redshift")
plt.ylabel("KNN mode")

Yes, things look very nice, and the inclusion of NIR photometry gives us very little scatter and very few outliers!
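
To put rough numbers on "scatter" and "outliers", one option is a couple of commonly used point-estimate statistics. The sketch below assumes conventional definitions: the scaled residual `(z_est - z_true) / (1 + z_true)`, its normalized median absolute deviation, and an outlier cut of 0.15. These choices, the function name `point_estimate_metrics`, and the toy redshifts are assumptions for illustration, not values taken from this notebook or from RAIL.

```python
import numpy as np


def point_estimate_metrics(z_true, z_est, outlier_cut=0.15):
    """Illustrative metrics for point estimates (definitions are assumptions)."""
    z_true = np.asarray(z_true, dtype=float)
    z_est = np.asarray(z_est, dtype=float)
    # Scaled residual, a common convention for photo-z comparisons
    dz = (z_est - z_true) / (1.0 + z_true)
    bias = np.median(dz)
    # Normalized median absolute deviation as a robust scatter estimate
    sigma_nmad = 1.4826 * np.median(np.abs(dz - bias))
    outlier_frac = np.mean(np.abs(dz) > outlier_cut)
    return {
        "bias": float(bias),
        "sigma_nmad": float(sigma_nmad),
        "outlier_frac": float(outlier_frac),
    }


# Toy redshifts (not real results): three exact matches and one bad estimate
toy_true = [0.5, 1.0, 1.5, 2.0]
toy_est = [0.5, 1.0, 1.5, 3.0]
toy_metrics = point_estimate_metrics(toy_true, toy_est)
```

In this notebook the same function could be called with `test_data()['redshift']` and `zmode` in place of the toy lists.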