# Using Galaxy Magnitudes to Estimate Photometric Redshifts

This notebook covers the basics of using real galaxy magnitudes to estimate photometric redshifts with RAIL. We will use a couple of the RAIL algorithms to do this, to get a sense of the differences between algorithms and how they work. We'll go through the following steps:

1. Setting up training and testing data sets 
2. Estimating redshifts with `k-Nearest Neighbours` (KNN)
3. Estimating redshifts with `FlexZBoost`


Before we get started, here's a quick introduction to some of the features of RAIL interactive mode. The only RAIL package you need to import is the `rail.interactive` package. This contains all of the interactive functions for all of the RAIL algorithms. You may need to import supporting functions that are not part of a stage separately. To get a sense of what functions/stages are available and for some more detailed instructions, see [the RAIL documentation](https://descraildocs.z27.web.core.windows.net/source/user_guide/interactive_usage.html).  

In [None]:
# import the packages we'll need
import rail.interactive as ri

In this notebook, we'll be using estimation algorithms, which can all be found under the `ri.estimation.algos` namespace. You can see a list of existing algorithms (here), or explore the available options using tab-completion. Each algorithm has its own namespace; for example, the namespace for KNN is `k_nearneigh`. Each namespace then provides an informer function and an estimator function (for KNN, `k_near_neigh_informer` and `k_near_neigh_estimator`).

To get the docstring for a function, including what parameters it needs and what it returns, you can put a question mark after the function name or use the `help()` function, as you would with any Python function. For example, we'll be using the KNN estimator function later, so we can take a look at what it needs:

In [None]:
ri.estimation.algos.k_nearneigh.k_near_neigh_estimator?
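The trailing `?` only works inside IPython or Jupyter. As a rough sketch of how to get the same information in a plain Python script, the standard-library `inspect` module can be used. The function `toy_estimator` below is a hypothetical stand-in, not a RAIL function:

```python
import inspect


def toy_estimator(input, model, nzbins=301):
    """Hypothetical stand-in for an interactive estimator function."""
    return {"output": None}


# help(toy_estimator) would print the signature and docstring to the screen;
# inspect returns the same information as values we can use in a script.
signature = str(inspect.signature(toy_estimator))
docstring = inspect.getdoc(toy_estimator)
print(signature)
print(docstring)
```

The same two calls should work on any Python callable that carries a signature and docstring.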

## 1. Setting up training and testing data sets

In this notebook we'll be using the ['Estimation' stage](https://descraildocs.z27.web.core.windows.net/source/rail_stages/estimation.html) of RAIL. The estimation algorithms, or `Estimators`, have both an *inform* method and an *estimation* method. The inform method trains the model that will be used to estimate the redshifts, so that will need to be given both magnitude data and the true redshifts of the galaxies. We can then pass a new set of magnitudes (the ones we're actually interested in) to the *estimator*, along with the model that the informer created. The estimator can then apply the model to the new magnitudes in order to calculate a redshift value. 
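To make the inform/estimate split concrete, here is a deliberately tiny sketch of the pattern in plain Python. The functions `toy_informer` and `toy_estimator`, the column name `mag_i`, and all the numbers are made up for illustration; they only mimic the shape of the workflow (train on magnitudes plus true redshifts, then apply the model to new magnitudes), not how any RAIL algorithm actually works:

```python
def toy_informer(input):
    """Hypothetical informer: 'train' by storing magnitudes with known redshifts."""
    model = sorted(zip(input["mag_i"], input["redshift"]))
    return {"model": model}


def toy_estimator(input, model):
    """Hypothetical estimator: copy the redshift of the closest training magnitude."""
    estimates = []
    for mag in input["mag_i"]:
        nearest = min(model, key=lambda pair: abs(pair[0] - mag))
        estimates.append(nearest[1])
    return {"output": estimates}


# training data needs both magnitudes and true redshifts
training = {"mag_i": [20.0, 22.0, 24.0], "redshift": [0.2, 0.6, 1.1]}
# the galaxies we actually care about only need magnitudes
new_galaxies = {"mag_i": [20.3, 23.6]}

informed = toy_informer(input=training)
estimated = toy_estimator(input=new_galaxies, model=informed["model"])
print(estimated["output"])  # [0.2, 1.1]
```

Note that the model is the only thing passed from one step to the other, which is why it can be trained once and reused.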

This means we'll need two separate data sets for each of the methods. We'll start by getting those data sets set up. `test_dc2_training_9816.hdf5` is what we'll use for training the *inform* method, and `test_dc2_validation_9816.hdf5` will act as our 'real' galaxy magnitude data, which we will provide to the *estimation* method to get our photometric redshifts (photo-z). Both files contain data drawn from the cosmoDC2_v1.1.4 truth extragalactic catalog generated by DESC with model 10-year-depth magnitude uncertainties.  The training data contains roughly 10,000 galaxies, while the test data contains roughly 20,000.  Both sets are representative down to a limiting apparent magnitude. 

First, we'll use the `find_rail_file` function to get the full path to the data files mentioned above. If you have your own data, you can substitute in the path to that file for the value of `testFile` below. 


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import tables_io
from rail.utils.path_utils import find_rail_file

trainFile = find_rail_file("examples_data/testdata/test_dc2_training_9816.hdf5")
testFile = find_rail_file("examples_data/testdata/test_dc2_validation_9816.hdf5")
print(trainFile)

Now let's read these files in. We'll start by reading in the training data and converting it to a Pandas DataFrame to make it easier to read:

In [None]:
# read in file
training_data = tables_io.read(trainFile)
print(type(training_data), training_data.keys())

# get the data table out of the photometry dictionary and convert to pandas DataFrame
training_data = training_data["photometry"]
training_data = tables_io.convert(training_data, "pandasDataFrame")
print(training_data.info())

`training_data` is now a Pandas DataFrame, containing information on 10,225 galaxies. It has magnitude information for the *ugrizy* bands, including errors, and the true redshift of these galaxies.

We'll now also load in the test data, which contains the magnitudes for the galaxies we actually want to calculate redshifts for. Just as an example, we'll leave the test data in the default format given by `tables_io` for an `hdf5` file, which is a dictionary of arrays. Either format can be used with RAIL functions, but they can require slightly different ways of passing the data.

In [None]:
test_data = tables_io.read(testFile)
print(test_data["photometry"].keys())

## 2. Estimate redshifts with the [KNN algorithm](https://rail-hub.readthedocs.io/en/latest/source/estimators.html#k-nearest-neighbor) 

**The algorithm**: The `k-Nearest Neighbours` algorithm we're using (see [here](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) for more of an explanation of how KNN works) is a wrapper around `sklearn`'s nearest neighbour (NN) machine learning model. Essentially, it takes a given galaxy, identifies its nearest neighbours in colour space (that is, training galaxies with similar colours), and then constructs the photometric redshift PDF as a sum of Gaussians, one from each neighbour.
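The idea can be sketched in a few lines of `numpy`. This is only an illustration under simplifying assumptions: the training set is random toy data, the neighbour search is brute force rather than `sklearn`, and every neighbour gets the same weight and the same Gaussian width, which may not match what the RAIL implementation does:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy training set: 500 galaxies with 5 colours each and a known redshift.
# The colour-redshift relation here is invented purely for illustration.
train_z = rng.uniform(0.0, 3.0, size=500)
slopes = np.linspace(0.2, 1.0, 5)
train_colours = train_z[:, None] * slopes + rng.normal(0.0, 0.05, size=(500, 5))


def knn_pdf(colours, zgrid, k=5, sigma=0.05):
    """Sum of k equal-weight Gaussians centred on the neighbours' redshifts."""
    dist = np.linalg.norm(train_colours - colours, axis=1)
    neighbours = np.argsort(dist)[:k]  # indices of the k closest galaxies
    centres = train_z[neighbours]
    gauss = np.exp(-0.5 * ((zgrid[:, None] - centres) / sigma) ** 2)
    gauss /= sigma * np.sqrt(2.0 * np.pi)
    return gauss.sum(axis=1) / k, centres


# grid is padded beyond [0, 3] so no Gaussian is cut off at the edges
zgrid = np.linspace(-0.5, 3.5, 4001)
target_colours = 1.2 * slopes  # colours of a noiseless galaxy at z = 1.2
pdf, centres = knn_pdf(target_colours, zgrid)

area = pdf.sum() * (zgrid[1] - zgrid[0])
print("neighbour redshifts:", np.round(centres, 2))
print("area under the PDF:", round(area, 3))
```

Because the PDF is built from the neighbours' true redshifts, its width and shape depend on both `k` and `sigma`, which is why the informer validates both.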

**Inform**: The inform method trains the model that we will use to estimate the redshifts. It sets aside some of the training data set as a validation data set. We pass in our training data set and any parameters the model needs, which we can check by putting a question mark after the function name:


In [None]:
ri.estimation.algos.k_nearneigh.k_near_neigh_informer?

There are a lot of optional parameters. Some of the main ones to be aware of for the KNN algorithm are:
- `trainfrac` sets the proportion of training data to use in training the algorithm, where the remaining fraction is used to validate both the width of the Gaussians used in constructing the PDF and the number of neighbors used in each PDF.  
- `sigma_grid_min`, `sigma_grid_max`, and `ngrid_sigma` are used to specify the grid of sigma values to test for the Gaussians 
- `nneigh_min` and `nneigh_max` set the range of nearest neighbours that will be tested 
- `zmin`, `zmax`, and `nzbins` are used to create a grid of redshift points on which to validate the model 
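As a rough picture of what these parameters describe, the sketch below builds the grids with `numpy` using the values from this notebook. It assumes evenly spaced grids, an inclusive `nneigh_max`, and a simple floor when splitting the training set; those are assumptions made for illustration, and the internals of the informer may differ:

```python
import numpy as np

params = dict(
    zmin=0.0, zmax=3.0, nzbins=301, trainfrac=0.75,
    sigma_grid_min=0.01, sigma_grid_max=0.07, ngrid_sigma=10,
    nneigh_min=3, nneigh_max=7,
)

# redshift grid on which candidate PDFs are evaluated during validation
zgrid = np.linspace(params["zmin"], params["zmax"], params["nzbins"])

# candidate Gaussian widths and neighbour counts to try
sigma_grid = np.linspace(
    params["sigma_grid_min"], params["sigma_grid_max"], params["ngrid_sigma"]
)
neighbour_counts = list(range(params["nneigh_min"], params["nneigh_max"] + 1))

# split of the 10,225 training galaxies into training and validation parts
n_galaxies = 10225
n_train = int(params["trainfrac"] * n_galaxies)
n_validate = n_galaxies - n_train

print("redshift grid spacing:", round(zgrid[1] - zgrid[0], 4))
print("combinations to validate:", len(sigma_grid) * len(neighbour_counts))
print("train / validate:", n_train, n_validate)
```

Under these assumptions there are 50 (sigma, neighbour count) combinations, so widening either grid makes the inform step proportionally slower.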

The only required parameter is the training data (called `input`). We'll also need to include `hdf5_groupname = ""`, since we've pulled the data table out of the `photometry` group when we converted to a DataFrame:

In [None]:
knn_dict = dict(
    zmin=0.0,
    zmax=3.0,
    nzbins=301,
    trainfrac=0.75,
    sigma_grid_min=0.01,
    sigma_grid_max=0.07,
    ngrid_sigma=10,
    nneigh_min=3,
    nneigh_max=7,
    hdf5_groupname="",
)
knn_inform = ri.estimation.algos.k_nearneigh.k_near_neigh_informer(
    input=training_data, **knn_dict
)

Now, if you take a look at the output of this function, you can see that it's a dictionary with the key 'model', since that's what we're generating, and the actual model object as the value. If there were multiple outputs for this function, they would all be collected in this dictionary: 

In [None]:
print(knn_inform)

### Saving a model and using it with an estimator stage
The model lives under the "model" key of the informer output. To make our lives easier, we can save it to a file. That way, in future we only have to run the estimator function and supply the file name of the saved model, without retraining. Let's start by saving the file:


In [None]:
import pickle

# write model file out here
with open("./knn_model.pkl", "wb") as fout:
    pickle.dump(obj=knn_inform["model"], file=fout, protocol=pickle.HIGHEST_PROTOCOL)
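If you want to check that a saved model can be read back, a pickle round trip looks like the sketch below. It uses a small dictionary as a hypothetical stand-in for the model object and a temporary directory so that nothing is left on disk:

```python
import os
import pickle
import tempfile

# hypothetical stand-in for a trained model object
toy_model = {"sigma": 0.03, "n_neighbours": 5}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_model.pkl")

    # write the object out in binary mode
    with open(path, "wb") as fout:
        pickle.dump(toy_model, fout, protocol=pickle.HIGHEST_PROTOCOL)

    # read it back in, also in binary mode
    with open(path, "rb") as fin:
        restored = pickle.load(fin)

print(restored == toy_model)  # True
```

Keep in mind that unpickling can execute arbitrary code, so only load model files that come from a source you trust.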

**Estimate**: Now that our model is trained, we can use it to estimate the redshifts of the test data set. We provide the estimator with the test data set, the filename of the model that we've trained, and any other necessary parameters:

In [None]:
knn_estimated = ri.estimation.algos.k_nearneigh.k_near_neigh_estimator(
    input=test_data, model="knn_model.pkl"
)

Now let's take a look at what the output of the estimation stage actually looks like. Most estimation stages output an `Ensemble`, which is a data structure from the package `qp`. For more information, see [the qp documentation](https://qp.readthedocs.io/en/main/user_guide/datastructure.html). 

We're using an `Ensemble` to hold a redshift distribution for each of the galaxies we're estimating. There are two required dictionaries that make up an Ensemble, and one that is optional:
- `.metadata`: Contains information about the whole data structure, like the Ensemble type, and any shared parameters such as the bins of histograms. This is not per-object metadata. 
- `.objdata`: The main data points of the distributions for each object, where each object is a row. 
- `.ancil`: The optional dictionary, containing extra information about each object. It can have arrays that have one or more data points per distribution. Typically the ancillary data table includes a photo-z point estimate derived from the PDFs, the mode by default, called 'zmode' in the ancillary dictionary below:

In [None]:
print(knn_estimated)

We can see that this algorithm outputs Ensembles of class `mixmod`, which are just combinations of Gaussians (for more info see the [qp docs](https://qp.readthedocs.io/en/main/user_guide/parameterizations/mixmod.html)). The shape portion of the print statement tells us two things: the first number is the number of photo-z distributions, or galaxies, in this `Ensemble`, and the second number tells us how many Gaussians are combined to make up each photo-z distribution. 
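To picture what a shape of (number of PDFs, number of Gaussians) means, the sketch below stores three toy mixtures as plain `numpy` arrays and evaluates them on a grid. The array names `weights`, `means`, and `stds` and all the values are chosen for this illustration, and the mode is found with a simple grid search, which is not necessarily how `qp` or RAIL computes `zmode`:

```python
import numpy as np

# 3 PDFs, each a mixture of 2 Gaussians -> every array has shape (3, 2)
weights = np.array([[0.5, 0.5], [0.8, 0.2], [0.3, 0.7]])
means = np.array([[0.40, 0.46], [1.00, 2.00], [0.20, 1.50]])
stds = np.full((3, 2), 0.05)

zgrid = np.linspace(0.0, 3.0, 301)

# broadcast to shape (n_pdf, n_grid, n_gaussians)
z = zgrid[None, :, None]
w, m, s = weights[:, None, :], means[:, None, :], stds[:, None, :]
components = w * np.exp(-0.5 * ((z - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# summing over the last axis combines the Gaussians of each PDF
pdfs = components.sum(axis=2)  # shape (3, 301)

# grid-based point estimate: the redshift where each PDF peaks
zmode = zgrid[np.argmax(pdfs, axis=1)]
print(pdfs.shape, zmode)
```

Each row of the three arrays describes one galaxy, so the storage cost scales with the number of Gaussians rather than with the number of grid points.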

Let's take a look at what the different dictionaries look like for this `Ensemble`:  

In [None]:
print(knn_estimated["output"].metadata)

In [None]:
print(knn_estimated["output"].objdata)

In [None]:
print(knn_estimated["output"].ancil)

We can also get a slice of distributions or a single specific distribution by slicing the Ensemble, for example: `knn_estimated["output"][0]` would give us the first distribution. 

Now we can take a look at what these photo-z distributions actually look like by plotting them. You can use the `.plot_native` method to do this quickly, but if you want to make your own plots you can use the `.pdf` function which takes an array of redshift values and returns the photo-z probability distribution function at those values, like this:

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline

xvals = np.linspace(0, 3, 200)  # we want to cover the whole available redshift space
plt.plot(xvals, knn_estimated["output"][0].pdf(xvals), label="0")
plt.plot(xvals, knn_estimated["output"][1000].pdf(xvals), label="1000")
plt.plot(xvals, knn_estimated["output"][10000].pdf(xvals), label="10000")

plt.legend(loc="best", title="Galaxy ID")
plt.xlabel("redshift")
plt.ylabel("p(z)")

We can see that these distributions have varying shapes. To get a point estimate of the redshift, we can use the `zmode` value from the ancillary dictionary. Also, for this notebook our test data has true redshifts to compare against. Let's take the distribution of the galaxy at index 1000 and compare it to these two values:

In [None]:
zgrid = np.linspace(0, 3.0, 301)
galid = 1000
truez = test_data["photometry"]["redshift"][galid]
single_gal = np.squeeze(knn_estimated["output"][galid].pdf(zgrid))
single_zmode = knn_estimated["output"].ancil["zmode"][galid]

plt.plot(zgrid, single_gal, color="k", label="single pdf")
plt.axvline(single_zmode, color="k", ls="--", label="mode")
plt.axvline(truez, color="r", label="true redshift")
plt.legend(loc="upper right")
plt.xlabel("redshift")
plt.ylabel("p(z)")

We can see there is some difference between the true redshift and the estimated redshift point estimate, though for this galaxy the point estimate seems quite good. Let's see how the algorithm did overall by plotting the redshift point estimates (the "zmodes") versus the true redshifts:

In [None]:
plt.figure(figsize=(8, 8))
plt.scatter(
    test_data["photometry"]["redshift"],
    knn_estimated["output"].ancil["zmode"].flatten(),
    s=1,
    c="k",
    label="simple NN mode",
)
plt.plot([0, 3], [0, 3], "r--")
plt.xlabel("true redshift")
plt.ylabel("estimated photo-z")
plt.show()

We can see that the algorithm does quite well overall, though there are certainly some outliers, and more of a spread at higher redshifts. 
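If you'd like to put numbers on "quite well" and "some outliers", one common approach is to look at the scaled residual (estimate minus truth, divided by one plus the truth). The sketch below uses synthetic arrays in place of the real results, and the 0.15 outlier cut is a widely used convention rather than anything defined by this notebook:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-ins for the true redshifts and the point estimates
true_z = rng.uniform(0.0, 3.0, size=2000)
est_z = true_z + 0.02 * (1.0 + true_z) * rng.normal(size=2000)
est_z[:40] = rng.uniform(0.0, 3.0, size=40)  # inject some bad estimates

# scaled residual, so errors at high redshift are not over-weighted
dz = (est_z - true_z) / (1.0 + true_z)

bias = np.median(dz)
# robust scatter: normalised median absolute deviation
sigma_nmad = 1.4826 * np.median(np.abs(dz - np.median(dz)))
# fraction of galaxies with a large scaled residual
outlier_frac = np.mean(np.abs(dz) > 0.15)

print(f"bias = {bias:.4f}, scatter = {sigma_nmad:.4f}, outliers = {outlier_frac:.3%}")
```

To apply this to the results above, you would replace `true_z` with `test_data["photometry"]["redshift"]` and `est_z` with the flattened `zmode` array.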

## 3. Estimate redshifts with [FlexZBoost](https://rail-hub.readthedocs.io/en/latest/source/estimators.html#flexzboost)

Now let's use `FlexZBoost` to get our redshifts. `FlexZBoostEstimator` approximates the conditional density estimate for each PDF with a set of weights on a set of basis functions. This can save space relative to a gridded parameterization, but it also leads to residual "bumps" in the PDF that are intrinsic to the underlying cosine or Fourier parameterization. For this reason, `FlexZBoostEstimator` has a post-processing stage where it "trims" (i.e. sets to zero) any small peaks, or "bumps", below a certain `bump_thresh` threshold. For more details on running `FlexZBoost`, see the 'Quick_Start_in_Estimation.ipynb' notebook **TODO: add link**
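To see where the bumps come from, the sketch below expands a narrow toy PDF in a truncated cosine basis and then trims the resulting ripples. It makes several assumptions for illustration: redshift is rescaled to the interval [0, 1], the coefficients come from a direct projection rather than from a boosted regression, and a "bump" is taken to be a contiguous positive region whose area is below `bump_thresh`. The real post-processing may differ in detail:

```python
import numpy as np


def cosine_basis(z_scaled, n_basis):
    """Orthonormal cosine basis on [0, 1]: 1, then sqrt(2)*cos(pi*j*z) for j >= 1."""
    j = np.arange(n_basis)
    basis = np.sqrt(2.0) * np.cos(np.pi * j * z_scaled[:, None])
    basis[:, 0] = 1.0
    return basis  # shape (n_grid, n_basis)


def trim_bumps(pdf, dz, bump_thresh):
    """Clip negatives, zero out small positive regions, then renormalise."""
    pdf = np.clip(pdf, 0.0, None)
    trimmed = pdf.copy()
    start = None
    # the appended False closes a region that runs to the end of the grid
    for i, is_pos in enumerate(np.append(pdf > 0, False)):
        if is_pos and start is None:
            start = i
        elif not is_pos and start is not None:
            if pdf[start:i].sum() * dz < bump_thresh:
                trimmed[start:i] = 0.0
            start = None
    return trimmed / (trimmed.sum() * dz)


z_scaled = np.linspace(0.0, 1.0, 1001)
dz = z_scaled[1] - z_scaled[0]

# toy "true" PDF: a narrow Gaussian, normalised on the grid
target = np.exp(-0.5 * ((z_scaled - 0.3) / 0.05) ** 2)
target /= target.sum() * dz

basis = cosine_basis(z_scaled, n_basis=15)
coeffs = basis.T @ target * dz  # 15 numbers now describe the whole PDF
raw = basis @ coeffs            # truncated expansion: ripples, some negative
trimmed = trim_bumps(raw, dz, bump_thresh=0.05)

print("most negative raw value:", round(raw.min(), 3))
print("non-zero grid points before/after:",
      np.count_nonzero(np.clip(raw, 0, None)), np.count_nonzero(trimmed))
```

Using more basis functions reduces the ripples but costs storage, which is the trade-off that `max_basis` and `bump_thresh` control between them.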

These are some of the main parameters for the informer:
- `basis_system`: which basis system to use in the density estimate. The default is `cosine` but `fourier` is also an option
- `max_basis`: the maximum number of basis functions to use for PDFs
- `regression_params`: a dictionary of options fed to `xgboost` that control the maximum depth and the `objective` function. `objective` should be set to `reg:squarederror` for proper functioning.
- `trainfrac`: The fraction of the training data to use for training the density estimate.  The remaining galaxies will be used for validation of `bump_thresh` and `sharpening`.
- `bumpmin`: the minimum value to test in the `bump_thresh` grid
- `bumpmax`: the maximum value to test in the `bump_thresh` grid
- `nbump`: how many points to test in the `bump_thresh` grid
- `sharpmin`, `sharpmax`, `nsharp`: same as equivalent `bump_thresh` params, but for `sharpening` parameter

The dictionary below gives the defaults for all of these parameters:

In [None]:
fz_dict = dict(
    zmin=0.0,
    zmax=3.0,
    nzbins=301,
    trainfrac=0.75,
    bumpmin=0.02,
    bumpmax=0.35,
    nbump=20,
    sharpmin=0.7,
    sharpmax=2.1,
    nsharp=15,
    max_basis=35,
    basis_system="cosine",
    regression_params={"max_depth": 8, "objective": "reg:squarederror"},
)

Since we're not changing any of the defaults in this run, passing `fz_dict` is optional: the informer would behave the same with just the input data and the `hdf5_groupname` parameter. We pass the dictionary here to make the settings explicit:

In [None]:
flex_inform = ri.estimation.algos.flexzboost.flex_z_boost_informer(
    input=training_data, hdf5_groupname="", **fz_dict
)

Next we run the estimator with the model that we've just created. Like `knn_inform`, `flex_inform` is a dictionary with a "model" key.

**Note that when we pass the model to this function, we don't pass the dictionary, but the actual model object. This is true of all the interactive functions.** 

In [None]:
flex_estimated = ri.estimation.algos.flexzboost.flex_z_boost_estimator(
    input=test_data, model=flex_inform["model"]
)

In [None]:
print(flex_estimated)

We once again get a dictionary with the key "output". This time the `Ensemble` is of type `interp`, which means the distributions are given as a set of `xvals` and `yvals`, where all the distributions share the same set of `xvals`.
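As a sketch of what a gridded representation like this looks like, the example below stores two toy PDFs as one shared `xvals` array and a 2D `yvals` array, evaluates one of them between grid points, and finds medians from the cumulative distribution. Everything here is plain `numpy` with made-up values; it is not how `qp` is implemented:

```python
import numpy as np


def cumulative_trapezoid(y, x):
    """Running trapezoidal integral of y(x), starting from zero."""
    steps = 0.5 * (y[1:] + y[:-1]) * np.diff(x)
    return np.concatenate([[0.0], np.cumsum(steps)])


# one grid shared by every distribution
xvals = np.linspace(0.0, 3.0, 301)

# two toy PDFs -> yvals has shape (n_pdf, n_grid)
centres = np.array([[0.5], [1.2]])
yvals = np.exp(-0.5 * ((xvals - centres) / 0.1) ** 2)

# normalise each row so that it integrates to one
cdfs = np.array([cumulative_trapezoid(row, xvals) for row in yvals])
yvals /= cdfs[:, -1:]
cdfs /= cdfs[:, -1:]

# evaluate the first PDF between grid points by linear interpolation
p = np.interp(0.505, xvals, yvals[0])

# the median is the redshift where the cumulative distribution reaches 0.5
medians = np.array([np.interp(0.5, cdf, xvals) for cdf in cdfs])
print(round(p, 3), np.round(medians, 3))
```

Because the grid is shared, only `yvals` grows with the number of galaxies, and any summary statistic can be computed row by row.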

Now we can plot out some of the data, same as above, to get a sense of how the estimator did. Let's start by plotting some of the individual galaxy photo-z distributions:

In [None]:
xvals = np.linspace(0, 3, 200)  # we want to cover the whole available redshift space
plt.plot(xvals, flex_estimated["output"][0].pdf(xvals), label="0")
plt.plot(xvals, flex_estimated["output"][1000].pdf(xvals), label="1000")
plt.plot(xvals, flex_estimated["output"][10000].pdf(xvals), label="10000")

plt.legend(loc="best", title="Galaxy ID")
plt.xlabel("redshift")
plt.ylabel("p(z)")

`FlexZBoost` doesn't automatically add "zmode" values to the ancillary dictionary of the `Ensemble`, but we can easily calculate redshift point estimates ourselves using `Ensemble` methods `.median()` or `.mode()`. Let's calculate the median for one of the galaxies, and plot it against the PDF and true redshift of galaxy 1000:

In [None]:
# pick the galaxy to plot
galid = 1000

# calculate median of photo-z PDF
z_med = flex_estimated["output"][galid].median()

# get rest of plotting data
zgrid = np.linspace(0, 3.0, 301)
truez = test_data["photometry"]["redshift"][galid]
single_gal = np.squeeze(flex_estimated["output"][galid].pdf(zgrid))

plt.plot(zgrid, single_gal, color="k", label="single pdf")
plt.axvline(z_med, color="k", ls="--", label="median")
plt.axvline(truez, color="r", label="true redshift")
plt.legend(loc="upper right")
plt.xlabel("redshift")
plt.ylabel("p(z)")

Finally, we can do this for all the galaxies, and compare the estimated redshifts with the true redshifts as for the KNN algorithm above:

In [None]:
plt.figure(figsize=(8, 8))
plt.scatter(
    test_data["photometry"]["redshift"],
    flex_estimated["output"].median(),
    s=1,
    c="k",
    label="FlexZBoost median",
)
plt.plot([0, 3], [0, 3], "r--")
plt.xlabel("true redshift")
plt.ylabel("estimated photo-z")
plt.show()

This scatter plot looks quite similar to the one for the KNN estimates: in both cases the estimates are good overall, with some outliers at higher redshift.

## Clean up

This step is just to remove the model file we created in Step 2:

In [None]:
import os

os.remove("knn_model.pkl")