# Converting Galaxy Magnitudes to Photometric Redshifts

This notebook covers the basics of using real galaxy magnitudes to estimate photometric redshifts with RAIL. We will use a couple of the RAIL algorithms to do this, to get a sense of the differences between algorithms and how they work. We'll go through the following steps:

1. Setting up training and testing data sets 
2. Estimating redshifts with KNN (link)
3. Estimating redshifts with BPZ (link)
4. Estimating redshifts with Flexzboost (link)


Before we get started, here's a quick introduction to some of the features of RAIL interactive mode. The only RAIL package you need to import is the `rail.interactive` package. This contains all of the interactive functions for all of the RAIL algorithms. You may need to import supporting functions that are not part of a stage separately. To get a sense of what functions/stages are available and for some more detailed instructions, see [the RAIL documentation](https://descraildocs.z27.web.core.windows.net/source/user_guide/interactive_usage.html).  

In [1]:
# import the packages we'll need 
import rail.interactive as ri 

An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.


LEPHAREDIR is being set to the default cache directory:
/home/jscora/.cache/lephare/data
More than 1Gb may be written there.
LEPHAREWORK is being set to the default cache directory:
/home/jscora/.cache/lephare/work
Default work cache is already linked. 
This is linked to the run directory:
/home/jscora/.cache/lephare/runs/20250327T165906


To get the docstring for a function, including what parameters it needs and what it returns, you can put a question mark after the function name or use the `help()` function, as you would with any Python function. For example, we'll be using the KNN estimation function later, so we can take a look at what it needs:

In [3]:
ri.estimation.algos.k_nearneigh.k_near_neigh_estimator?

Signature:       ri.estimation.algos.k_nearneigh.k_near_neigh_estimator(**kwargs) -> Any
Call signature:  ri.estimation.algos.k_nearneigh.k_near_neigh_estimator(*args, **kwargs)
Type:            partial
String form:     functools.partial(<function _interactive_factory at 0x769245120c20>, <class 'rail.estimation.algos.k_nearneigh.KNearNeighEstimator'>, False, True)
File:            ~/software/anaconda3/envs/rail/lib/python3.12/functools.py
Docstring:
KNN-based estimator

---

The main interface method for the photo-z estimation

This will attach the input data (defined in ``inputs`` as "input") to this
``Estimator`` (for introspection and provenance tracking). Then call the
``run()``, ``validate()``, and ``finalize()`` methods.

The run method will call ``_process_chunk()``, which needs to be implemented
in the subclass, to process input data in batches. See ``RandomGaussEstimator``
for a simple example.

Finally, this wi
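
The same information can also be pulled out programmatically, which is handy outside of a notebook. The following is a minimal sketch using only the standard library; `toy_estimator` and its parameters are hypothetical stand-ins for illustration, not RAIL functions.

```python
import inspect


def toy_estimator(input_data, nneigh=7):
    """Hypothetical stand-in for an interactive function.

    Parameters
    ----------
    input_data : dict
        Table of magnitudes.
    nneigh : int
        Number of neighbours to use.
    """
    return {"output": None}


# inspect.getdoc returns the cleaned-up docstring text
doc = inspect.getdoc(toy_estimator)
print(doc)

# inspect.signature lists the parameters and their defaults
sig = inspect.signature(toy_estimator)
print(sig)
```

For objects that wrap another function (the `Type` field above reports a `partial`), the signature can collapse to `**kwargs`, so the docstring is usually the more informative of the two.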

## 1. Setting up training and testing data sets

To convert the magnitudes into photometric redshifts, we will be *estimating* the redshift of each galaxy. Most of the `Estimators` in RAIL have an *inform* stage and an *estimation* stage.
The inform stage trains a model to do the conversion, so it needs both the magnitude data and the true redshifts of the galaxies.
We can then pass a new set of magnitudes (the ones we're actually interested in) to the *estimator*, along with the model that the informer created. The estimator applies the model to the new magnitudes to calculate a redshift value.
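
To make the two-stage pattern concrete, here is a toy sketch in plain Python. It is not how any RAIL algorithm works: `toy_inform`, `toy_estimate`, the binning scheme, and all of the numbers are made up for illustration. The "model" is simply the mean true redshift in each magnitude bin.

```python
from statistics import mean


def toy_inform(mags, redshifts, bin_width=1.0):
    """Hypothetical inform step: learn the mean redshift in each magnitude bin."""
    bins = {}
    for mag, z in zip(mags, redshifts):
        bins.setdefault(int(mag // bin_width), []).append(z)
    return {
        "bin_width": bin_width,
        "mean_z": {b: mean(zs) for b, zs in bins.items()},
        "fallback": mean(redshifts),  # used for bins never seen in training
    }


def toy_estimate(mags, model):
    """Hypothetical estimate step: look up each new magnitude in the model."""
    return [
        model["mean_z"].get(int(mag // model["bin_width"]), model["fallback"])
        for mag in mags
    ]


# made-up training set: magnitudes with known ("true") redshifts
train_mags = [20.1, 20.7, 22.3, 22.9, 24.2, 24.8]
train_z = [0.2, 0.4, 0.7, 0.9, 1.3, 1.5]
model = toy_inform(train_mags, train_z)

# new magnitudes with no redshift attached
z_est = toy_estimate([20.5, 24.5, 27.0], model)
print(z_est)
```

The key point is the hand-off: the inform step sees the true redshifts, the estimate step only sees magnitudes plus the trained model.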



We'll start by getting our two data sets. The file `test_dc2_training_9816.hdf5` is what we'll use for training our `Inform` stages, and `test_dc2_validation_9816.hdf5` will act as our 'real' galaxy magnitude data, which we will provide to the `Estimate` stage to get our estimated photometric redshifts. If you have your own data, you can substitute it for the `testFile` variable below.

In [4]:
import matplotlib.pyplot as plt
import numpy as np
import tables_io
from rail.utils.path_utils import find_rail_file

trainFile = find_rail_file("examples_data/testdata/test_dc2_training_9816.hdf5")
testFile = find_rail_file("examples_data/testdata/test_dc2_validation_9816.hdf5")
print(trainFile)

/home/jscora/code/desc-rail/rail_base/src/rail/examples_data/testdata/test_dc2_training_9816.hdf5


Now let's read these files in. We'll start by reading in the training data, and converting it to a Pandas dataframe to make it easier to read:

In [8]:
training_data = tables_io.read(trainFile)
print(type(training_data), training_data.keys())
training_data = training_data["photometry"]
training_data = tables_io.convert(training_data, "pandasDataFrame")
print(training_data.info())

<class 'collections.OrderedDict'> odict_keys(['photometry'])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10225 entries, 0 to 10224
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              10225 non-null  int64  
 1   mag_err_g_lsst  10225 non-null  float32
 2   mag_err_i_lsst  10225 non-null  float32
 3   mag_err_r_lsst  10225 non-null  float32
 4   mag_err_u_lsst  10225 non-null  float32
 5   mag_err_y_lsst  10225 non-null  float32
 6   mag_err_z_lsst  10225 non-null  float32
 7   mag_g_lsst      10225 non-null  float32
 8   mag_i_lsst      10225 non-null  float32
 9   mag_r_lsst      10225 non-null  float32
 10  mag_u_lsst      10225 non-null  float32
 11  mag_y_lsst      10225 non-null  float32
 12  mag_z_lsst      10225 non-null  float32
 13  redshift        10225 non-null  float64
dtypes: float32(12), float64(1), int64(1)
memory usage: 639.2 KB
None


`training_data` is now a Pandas DataFrame, containing information on 10,225 galaxies. It has magnitude information for the *ugrizy* bands, including errors, and the true redshift of these galaxies.

We'll now also load in the test data, which contains the magnitudes for the galaxies we actually want to calculate redshifts for. Just as an example, we'll leave the test data in the format given by `tables_io`, which is a dictionary of arrays. Either format can be used with RAIL functions, but they can require slightly different ways of passing the data.

In [9]:
test_data = tables_io.read(testFile)
print(test_data["photometry"].keys())

odict_keys(['id', 'mag_err_g_lsst', 'mag_err_i_lsst', 'mag_err_r_lsst', 'mag_err_u_lsst', 'mag_err_y_lsst', 'mag_err_z_lsst', 'mag_g_lsst', 'mag_i_lsst', 'mag_r_lsst', 'mag_u_lsst', 'mag_y_lsst', 'mag_z_lsst', 'redshift'])


## 2. Estimate redshifts with the [KNN algorithm](https://rail-hub.readthedocs.io/en/latest/source/estimators.html#k-nearest-neighbor) 

**The algorithm**: The `K-Nearest Neighbours` algorithm we're using (see [here](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) for more of an explanation of how it works) is a machine learning model. Essentially, it takes a given galaxy, identifies its nearest neighbours in the training set (in this case, galaxies that have similar colours), and then assigns a redshift to that galaxy by averaging the redshifts of those neighbours.
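
As a rough illustration of the idea (not RAIL's implementation, whose details may differ), the sketch below averages the redshifts of the `k` training galaxies closest to a target galaxy in a two-colour space. The function name `knn_redshift`, the colours, and the redshifts are all made up.

```python
import math


def knn_redshift(colours, train_colours, train_z, k=3):
    """Average the redshifts of the k training galaxies closest in colour space."""
    # Euclidean distance from the target galaxy to every training galaxy
    dists = [math.dist(colours, tc) for tc in train_colours]
    # indices of the k smallest distances
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    return sum(train_z[i] for i in nearest) / k


# made-up (g-r, r-i) colours and redshifts for five training galaxies
train_colours = [(0.4, 0.2), (0.5, 0.25), (0.9, 0.5), (1.0, 0.55), (1.4, 0.8)]
train_z = [0.2, 0.3, 0.6, 0.7, 1.1]

# the two closest training galaxies have z = 0.6 and z = 0.7
z_knn = knn_redshift((0.95, 0.5), train_colours, train_z, k=2)
print(z_knn)
```

The choice of `k` trades noise against resolution: a small `k` follows individual training galaxies closely, while a large `k` smooths over a wider region of colour space.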

**Inform**: The inform method trains the model that we will use to estimate the redshifts. It sets aside part of the training data set as a validation data set. We pass in our training data set and any parameters the model needs, which we can check by putting a question mark after the function name:


In [11]:
ri.estimation.algos.k_nearneigh.k_near_neigh_informer?

Signature:       ri.estimation.algos.k_nearneigh.k_near_neigh_informer(**kwargs) -> Any
Call signature:  ri.estimation.algos.k_nearneigh.k_near_neigh_informer(*args, **kwargs)
Type:            partial
String form:     functools.partial(<function _interactive_factory at 0x769245120c20>, <class 'rail.estimation.algos.k_nearneigh.KNearNeighInformer'>, False, True)
File:            ~/software/anaconda3/envs/rail/lib/python3.12/functools.py
Docstring:
Train a KNN-based estimator

---

The main interface method for Informers

This will attach the input_data to this `Informer`
(for introspection and provenance tracking).

Then it will call the run(), validate() and finalize() methods, which need to
be implemented by the sub-classes.

The run() method will need to register the model that it creates to this Estimator
by using `self.add_data('model', model)`.

Finally, this will return a ModelHandle providing access to the trained

There are a lot of optional parameters, but the only required one is the input data, which here is the training data. Let's train our informer:

In [None]:
knn_model = ri.estimation.algos.k_nearneigh.k_near_neigh_informer(input=training_data)
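
The validation set that the informer holds back can be pictured as a random partition of the training rows. The sketch below shows one generic way to do such a split; `holdout_split`, `val_fraction`, and the 25% default are assumptions for illustration and are not taken from RAIL.

```python
import random


def holdout_split(n_rows, val_fraction=0.25, seed=0):
    """Return (train_idx, val_idx): a random partition of range(n_rows)."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)  # seeded so the split is reproducible
    n_val = int(round(n_rows * val_fraction))
    return idx[n_val:], idx[:n_val]


# 10225 is the number of rows in the training table loaded earlier
train_idx, val_idx = holdout_split(10225)
print(len(train_idx), len(val_idx))
```

Holding rows back this way lets a model's settings be tuned on galaxies it was not fitted to.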

**Estimate**: Once our model is trained, we can then use it to estimate the redshifts of the test data set. We provide the estimate algorithm with the test data set, and the model that we've trained, and any other necessary parameters.

## Random Gauss

This estimation algorithm doesn't use any of the magnitude information to estimate a redshift, but instead just pulls a random value out of a Gaussian distribution. As such, while it has an informer stage, that stage doesn't do anything, so we can skip it.

Naturally, since this estimator just picks random values, it's not very accurate, but we'll use it to get a feel for the shape of the data.
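
To see what "random" means here, the following sketch builds a Gaussian p(z) whose centre is chosen without looking at the galaxy at all. It assumes the centre is drawn uniformly between a hypothetical `zmin` and `zmax`; the function name `random_gauss_pdf` and its parameters are illustrative, and RAIL's own choices may differ.

```python
import math
import random


def random_gauss_pdf(zgrid, width=0.025, zmin=0.0, zmax=3.0, rng=None):
    """Gaussian p(z) on zgrid, centred on a uniformly drawn random redshift."""
    if rng is None:
        rng = random.Random()
    centre = rng.uniform(zmin, zmax)  # no magnitude information is used
    norm = 1.0 / (width * math.sqrt(2.0 * math.pi))
    pdf = [norm * math.exp(-0.5 * ((z - centre) / width) ** 2) for z in zgrid]
    return centre, pdf


zgrid = [i * 0.01 for i in range(301)]  # 0 to 3 in steps of 0.01
rng = random.Random(42)  # seeded only so the sketch is reproducible
centre_narrow, pdf_narrow = random_gauss_pdf(zgrid, width=0.025, rng=rng)
centre_wide, pdf_wide = random_gauss_pdf(zgrid, width=0.5, rng=rng)

# a wider Gaussian spreads the same probability over more redshifts,
# so its peak is lower
print(max(pdf_narrow), max(pdf_wide))
```

Because the centre ignores the photometry, the resulting redshift carries no information about the galaxy, which is why this estimator is only useful as a baseline.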

In [None]:
# Print the docstring for the estimator
ri.estimation.algos.random_gauss.random_gauss_estimator?

In [None]:
# run the random gauss estimator with default options
rg_result_default = ri.estimation.algos.random_gauss.random_gauss_estimator(
    input=test_data
)

# it returns a dictionary with the key "output" pointing to a qp.Ensemble
print(rg_result_default)

We will extract data from the output ensemble in a few ways:
- Calculate the probability density function (pdf) for a specific galaxy (row), at specific points (the grid) (`ensemble[row_number].pdf(grid)`)
- Access the mode of the pdf for each galaxy (`ensemble.ancil["zmode"]`)
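
Both operations can be illustrated without an ensemble. The sketch below evaluates a made-up two-peaked p(z) on a grid and takes the grid point with the highest density as the mode, which is one common way to define it; `toy_pdf` and its peak positions are hypothetical, and this is not a claim about how the ensemble computes `zmode`.

```python
import math


def gauss(z, mu, sigma):
    """Normalised Gaussian density."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def toy_pdf(z):
    """Hypothetical p(z): 70% of the weight near z=0.5, 30% near z=1.8."""
    return 0.7 * gauss(z, 0.5, 0.05) + 0.3 * gauss(z, 1.8, 0.1)


# same grid as np.linspace(0, 3.0, 301), written without numpy
dz = 0.01
zgrid = [i * dz for i in range(301)]
pdf_on_grid = [toy_pdf(z) for z in zgrid]

# the mode is the grid point where the pdf is largest
zmode = zgrid[pdf_on_grid.index(max(pdf_on_grid))]

# trapezoid-rule integral as a normalisation check (should be close to 1)
area = sum(0.5 * (a + b) * dz for a, b in zip(pdf_on_grid, pdf_on_grid[1:]))
print(zmode, area)
```

Note that the mode picks out the tallest peak and says nothing about the secondary one, so a single number can hide a lot of the information in a p(z).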

In [None]:
# reuse the default-options result computed above
result = rg_result_default

zgrid = np.linspace(0, 3.0, 301)
galid = 9529
truez = test_data["photometry"]["redshift"][galid]
single_gal = np.squeeze(result["output"][galid].pdf(zgrid))
single_zmode = result["output"].ancil["zmode"][galid]

plt.plot(zgrid, single_gal, color="k", label="single pdf")
plt.axvline(single_zmode, color="k", ls="--", label="mode")
plt.axvline(truez, color="r", label="true redshift")
plt.legend(loc="upper right")
plt.xlabel("redshift")
plt.ylabel("p(z)")
plt.show()

In [None]:
# rename this result to be more specific, discuss how the addition of one parameter
# changes the result
result = ri.estimation.algos.random_gauss.random_gauss_estimator(
    input=test_data, rand_width=0.5
)

zgrid = np.linspace(0, 3.0, 301)
galid = 9529
truez = test_data["photometry"]["redshift"][galid]
single_gal = np.squeeze(result["output"][galid].pdf(zgrid))
single_zmode = result["output"].ancil["zmode"][galid]

plt.plot(zgrid, single_gal, color="k", label="single pdf")
plt.axvline(single_zmode, color="k", ls="--", label="mode")
plt.axvline(truez, color="r", label="true redshift")
plt.legend(loc="upper right")
plt.xlabel("redshift")
plt.ylabel("p(z)")
plt.show()

In [None]:
# change from "result" to one of the ones above (or a comparison of both?)
# discuss what this graph is and how it shows both the gaussian shape and how bad a
# random gaussian selection is
plt.figure(figsize=(8, 8))
plt.scatter(
    test_data["photometry"]["redshift"],
    result["output"].ancil["zmode"].flatten(),
    s=1,
    c="k",
    label="random gauss mode",
)
plt.plot([0, 3], [0, 3], "r--")
plt.xlabel("true redshift")
plt.ylabel("random gauss photo-z")
plt.show()

## Something Else

In [None]:
ri.estimation.algos.k_nearneigh.k_near_neigh_informer?

In [None]:
model = ri.estimation.algos.k_nearneigh.k_near_neigh_informer(
    input=training_data, hdf5_groupname=""
)
print(model)

In [None]:
ri.estimation.algos.k_nearneigh.k_near_neigh_estimator?

In [None]:
# model is missing from the docstring?
result = ri.estimation.algos.k_nearneigh.k_near_neigh_estimator(
    input=test_data, model=model["model"]
)

In [None]:
zgrid = np.linspace(0, 3.0, 301)
galid = 9529
truez = test_data["photometry"]["redshift"][galid]
single_gal = np.squeeze(result["output"][galid].pdf(zgrid))
single_zmode = result["output"].ancil["zmode"][galid]

plt.plot(zgrid, single_gal, color="k", label="single pdf")
plt.axvline(single_zmode, color="k", ls="--", label="mode")
plt.axvline(truez, color="r", label="true redshift")
plt.legend(loc="upper right")
plt.xlabel("redshift")
plt.ylabel("p(z)")
plt.show()

In [None]:
plt.figure(figsize=(8, 8))
plt.scatter(
    test_data["photometry"]["redshift"],
    result["output"].ancil["zmode"].flatten(),
    s=1,
    c="k",
    label="KNN mode",
)
plt.plot([0, 3], [0, 3], "r--")
plt.xlabel("true redshift")
plt.ylabel("KNN photo-z")
plt.show()