# Iterating over parameters and comparing resulting distributions of redshifts

**Authors:** Jennifer Scora

**Last run successfully:** Jan 16, 2025

This notebook shows how to run through the stages of RAIL (creation, estimation, and evaluation) while looping over a specific parameter and comparing the resulting photometric redshift estimates. It also shows how to use multiprocessing to speed up this iteration. However, if you need full MPI support or are working with very large datasets, we recommend running in pipeline mode (link).

## Creating the data 
First, we create the data sets of galaxy magnitudes that we will use to estimate photometric redshifts. We use PZFlow to generate our model, and then draw two data sets from it: a training data set and a test data set. The training data set is used to train our models, and the test data set is the data we will get photo-z estimates for.

In [None]:
import rail.interactive as ri 
import numpy as np
import tables_io
from pzflow.examples import get_galaxy_data

Here we need a few configuration parameters to deal with differences in data schema between existing PZ codes. We also need to grab the data to use for training the flow engine. 

In [None]:
bands = ["u", "g", "r", "i", "z", "y"]
band_dict = {band: f"mag_{band}_lsst" for band in bands}
rename_dict = {f"mag_{band}_lsst_err": f"mag_err_{band}_lsst" for band in bands}
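As a quick check, the sketch below repeats the two definitions from the cell above and prints one entry from each, to make the direction of each mapping explicit. It uses plain Python only, with nothing RAIL-specific.

```python
bands = ["u", "g", "r", "i", "z", "y"]
band_dict = {band: f"mag_{band}_lsst" for band in bands}
rename_dict = {f"mag_{band}_lsst_err": f"mag_err_{band}_lsst" for band in bands}

# band_dict: short band name -> magnitude column name
print(band_dict["u"])  # mag_u_lsst

# rename_dict: error column name before renaming -> error column name after renaming
print(rename_dict["mag_u_lsst_err"])  # mag_err_u_lsst
```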

In [None]:
catalog = get_galaxy_data().rename(band_dict, axis=1)

### Train and sample the model

Here we need to train the normalizing flow that serves as the engine for the input data creation, and then use the flow to produce some synthetic data for our training data set, as well as for our test data set. We will create small datasets of 150 galaxies each for this example.

In [None]:
flow_model = ri.creation.engines.flowEngine.flow_modeler(
    input=catalog,
    seed=0,
    phys_cols={"redshift": [0, 3]},
    phot_cols={
        "mag_u_lsst": [17, 35],
        "mag_g_lsst": [16, 32],
        "mag_r_lsst": [15, 30],
        "mag_i_lsst": [15, 30],
        "mag_z_lsst": [14, 29],
        "mag_y_lsst": [14, 28],
    },
    calc_colors={"ref_column_name": "mag_i_lsst"},
)

# get sample test and training data sets
train_data_orig = ri.creation.engines.flowEngine.flow_creator(
    n_samples=150, model=flow_model["model"], seed=1235
)
test_data_orig = ri.creation.engines.flowEngine.flow_creator(
    model=flow_model["model"], n_samples=150, seed=1234
)

### Degrade the data sets

Next we will apply some degradation functions to the data, to make it look more like real observations. We apply the following functions to the training data set:
1. `LSSTErrorModel` to add photometric errors 
2. `InvRedshiftIncompleteness` to remove some galaxies above a redshift threshold
3. `LineConfusion` to simulate the effect of misidentified lines 
4. `QuantityCut` to cut galaxies based on their magnitudes in specific bands

We then use `ColumnMapper` to rename the error columns so that they match the names in DC2, and use `TableConverter` to convert the data to a numpy dictionary, so that it fits the expected input format for the following functions. 

For the test data set, we apply only the `LSSTErrorModel` degradation, and then make the same structural changes as above so that the data ends up in the same format as the training data set.
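To build some intuition for what these degraders do, here is a minimal conceptual sketch in plain Python on a made-up toy catalog. It is not RAIL's implementation. The selection rule in `survival_probability`, the helper names, and the assumption that the magnitude cut keeps galaxies brighter than the limit are all illustrative assumptions. The line-confusion formula simply follows from the observed wavelength being the same whichever emission line it is attributed to.

```python
import random

# Toy catalog of (true redshift, i-band magnitude); values are made up for illustration.
galaxies = [(0.3, 22.1), (0.8, 24.0), (1.5, 24.6), (2.4, 25.7)]


def survival_probability(z, pivot_redshift=1.0):
    """Assumed rule: keep everything at or below the pivot redshift, and keep
    galaxies above it with probability pivot_redshift / z."""
    return min(1.0, pivot_redshift / z) if z > 0 else 1.0


def confused_redshift(z, true_wavelen=5007.0, wrong_wavelen=3727.0):
    """Redshift inferred when a line emitted at true_wavelen is mistaken for a
    line emitted at wrong_wavelen (the observed wavelength is unchanged)."""
    return (1.0 + z) * true_wavelen / wrong_wavelen - 1.0


rng = random.Random(0)

# Incompleteness: drop galaxies at random according to the survival probability.
kept = [(z, m) for z, m in galaxies if rng.random() <= survival_probability(z)]

# Magnitude cut (assumed to keep galaxies brighter than the limit).
bright = [(z, m) for z, m in kept if m < 25.0]

print(len(galaxies), len(kept), len(bright))
print(confused_redshift(0.5))
```

Galaxies at or below the pivot redshift always survive the first step, so only the two high-redshift toy galaxies can be removed by it.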

In [None]:
### degrade training data
# add photometric errors modelled on LSST to the data
train_data_errs = ri.creation.degraders.photometric_errors.lsst_error_model(
    input=train_data_orig["output"], seed=66, renameDict=band_dict, ndFlag=np.nan
)
# randomly remove some galaxies above the pivot redshift
train_data_inc = (
    ri.creation.degraders.spectroscopic_degraders.inv_redshift_incompleteness(
        input=train_data_errs["output"], pivot_redshift=1.0
    )
)
# simulates the effect of misidentified lines 
train_data_conf = ri.creation.degraders.spectroscopic_degraders.line_confusion(
    input=train_data_inc["output"],
    true_wavelen=5007.0,
    wrong_wavelen=3727.0,
    frac_wrong=0.05,
    seed=1337,
)
# apply an i-band magnitude cut at 25 (a single number is treated as the maximum allowed value)
train_data_cut = ri.creation.degraders.quantityCut.quantity_cut(
    input=train_data_conf["output"], cuts={"mag_i_lsst": 25.0}
)
# renames error columns to match DC2
train_data_pq = ri.tools.table_tools.column_mapper(
    input=train_data_cut["output"], columns=rename_dict
)
# converts output to a numpy dictionary
train_data = ri.tools.table_tools.table_converter(
    input=train_data_pq["output"], output_format="numpyDict"
)

### degrade testing data
# add photometric errors modelled on LSST to the data
test_data_errs = ri.creation.degraders.photometric_errors.lsst_error_model(
    input=test_data_orig["output"], seed=58, renameDict=band_dict, ndFlag=np.nan
)
# renames error columns to match DC2
test_data_pq = ri.tools.table_tools.column_mapper(
    input=test_data_errs["output"], columns=rename_dict, hdf5_groupname=""
)
# converts output to a numpy dictionary
test_data = ri.tools.table_tools.table_converter(
    input=test_data_pq["output"], output_format="numpyDict"
)


## Estimate the redshifts and evaluate performance

Now, we estimate our photometric redshifts. We use the [K-Nearest Neighbour algorithm](https://rail-hub.readthedocs.io/en/latest/source/estimators.html#k-nearest-neighbor) to estimate our redshifts, varying the minimum and maximum allowed number of neighbours to see its effect on the final result. To do this, we iterate over a list of the different parameter inputs we want to use for the estimator. In each loop, we:
- estimate the redshifts with the chosen parameters
- summarize the distribution of redshifts using the Naive Stacking and Point Estimate Histogram methods
- evaluate how the estimated redshifts compare to the true redshifts (the original test data set before degradation)
- save the evaluation results and summarized distributions to dictionaries so we can access them outside the loop
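The list of parameter pairs can be written by hand, as in the cell below, or generated programmatically when there are many combinations. The following sketch shows one way to do the latter with `itertools.product`. The candidate ranges are hypothetical, and the loop body is only a placeholder for the estimate, evaluate, and summarize steps.

```python
from itertools import product

# Hypothetical search ranges for the minimum and maximum number of neighbours.
min_candidates = range(2, 5)   # 2, 3, 4
max_candidates = range(6, 10)  # 6, 7, 8, 9

# Keep only pairs where the minimum is strictly below the maximum.
nb_params = [
    (lo, hi) for lo, hi in product(min_candidates, max_candidates) if lo < hi
]

results = {}
for nb_min, nb_max in nb_params:
    # Placeholder: the real loop would run the estimator and evaluator here.
    results[(nb_min, nb_max)] = {"nneigh_min": nb_min, "nneigh_max": nb_max}

print(len(nb_params))
```

Keying the results by the `(nb_min, nb_max)` tuple makes each run easy to look up after the loop has finished.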


In [None]:
### Iterate over estimating photo-zs using KNN 

# set up parameters to iterate over and dictionaries to store data
nb_params = [(3,7), (2,6), (2,8), (4,9)]
eval_dict = {}
naive_dict = {}
point_est_dict = {}

for nb_min, nb_max in nb_params:
    # train the informer
    inform_knn = ri.estimation.algos.k_nearneigh.k_near_neig_informer(
        input=train_data["output"],
        nondetect_val=np.nan,
        model="bpz.pkl",
        hdf5_groupname="",
        nneigh_min=nb_min,
        nneigh_max=nb_max,
    )
    # get photo-zs
    knn_estimated = ri.estimation.algos.k_nearneigh.k_near_neig_estimator(
        input=test_data["output"],
        model=inform_knn["model"],
        nondetect_val=np.nan,
        hdf5_groupname="",
    )

    ### Evaluate the results 
    evaluator_stage_dict = dict(
        metrics=["cdeloss", "pit", "brier"],
        _random_state=None,
        metric_config={
            "brier": {"limits": (0, 3.1)},
            "pit": {"tdigest_compression": 1000},
        },
    )
    truth = test_data_orig

    the_eval = ri.evaluation.dist_to_point_evaluator.dist_to_point_evaluator(
        input={"data": knn_estimated["output"], "truth": truth["output"]},
        **evaluator_stage_dict,
        hdf5_groupname="",
    )
    
    # put the evaluation results in a dictionary so we have them 
    eval_dict[(nb_min,nb_max)] = the_eval

    # summarize the distributions using point estimate and naive stack summarizers 
    point_estimate_ens = ri.estimation.algos.point_est_hist.point_est_hist_summarizer(
        input=knn_estimated["output"]
    )
    point_est_dict[(nb_min,nb_max)] = point_estimate_ens
    naive_stack_ens = ri.estimation.algos.naive_stack.naive_stack_summarizer(
        input=knn_estimated["output"]
    )
    naive_dict[(nb_min,nb_max)] = naive_stack_ens





## Compare the results

We can take a look at the evaluation metrics that we've generated for each of the runs to see how they compare:

In [None]:
# convert each run's summary metrics to a pandas DataFrame for easier comparison
results_tables = {
    key: tables_io.convertObj(val["summary"], tables_io.types.PD_DATAFRAME)
    for key, val in eval_dict.items()
}
results_tables

In [None]:
eval_dict

We can also plot the summarized photometric redshift distributions from the different loop iterations, to compare the effect of the different parameters. Below, we plot the runs with the following parameters, since their evaluation metrics differ the most:
- minimum neighbours: 3, maximum neighbours: 7
- minimum neighbours: 2, maximum neighbours: 6
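One way to pick the most different pair of runs without reading the tables by eye is to compare a single scalar metric across all pairs of runs. The sketch below assumes such a scalar has already been extracted for each run. The dictionary values are invented for illustration and are not results from this notebook, and `most_different_pair` is a hypothetical helper.

```python
from itertools import combinations

# Hypothetical scalar metric per run, keyed by (nneigh_min, nneigh_max).
# These numbers are invented for illustration only.
metric_by_run = {(3, 7): -0.42, (2, 6): -0.18, (2, 8): -0.35, (4, 9): -0.40}


def most_different_pair(metrics):
    """Return the two parameter keys whose metric values differ the most."""
    return max(
        combinations(sorted(metrics), 2),
        key=lambda pair: abs(metrics[pair[0]] - metrics[pair[1]]),
    )


pair = most_different_pair(metric_by_run)
print(pair)
```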

In [None]:
# plot of point estimate summarized distribution 
point_est_dict[(3,7)]["output"].plot_native(xlim=(0, 3))
point_est_dict[(2,6)]["output"].plot_native(xlim=(0, 3))

In [None]:
# Plot of naive stack summarized distribution 
naive_dict[(3,7)]["output"].plot_native(xlim=(0,3))
naive_dict[(2,6)]["output"].plot_native(xlim=(0,3))

## Using multiprocessing

Suppose we want to do the same as above, but with many more parameters (and perhaps with a slower algorithm). We can use Python's `multiprocessing` module to run the loop iterations concurrently and speed up the process. To do this, we need to turn the body of the loop above into its own function.

In [None]:
def estimate_photoz(nb_lims):
    """Estimate photo-zs using the KNN algorithm, given a (minimum, maximum) number of
    nearest neighbours, then evaluate the performance and summarize the distributions."""

    # train_data, test_data, and test_data_orig are read from the notebook's global scope;
    # results are stored in the global eval_dict_lg, point_est_dict_lg, and naive_dict_lg

    # train the informer
    inform_knn = ri.estimation.algos.k_nearneigh.k_near_neig_informer(
        input=train_data["output"],
        nondetect_val=np.nan,
        model="bpz.pkl",
        hdf5_groupname="",
        nneigh_min=nb_lims[0],
        nneigh_max=nb_lims[1],
    )
    # get photo-zs
    knn_estimated = ri.estimation.algos.k_nearneigh.k_near_neig_estimator(
        input=test_data["output"],
        model=inform_knn["model"],
        nondetect_val=np.nan,
        hdf5_groupname="",
    )
   
    ### Evaluate the results 
    evaluator_stage_dict = dict(
        metrics=["cdeloss", "pit", "brier"],
        _random_state=None,
        metric_config={
            "brier": {"limits": (0, 3.1)},
            "pit": {"tdigest_compression": 1000},
        },
    )
    truth = test_data_orig

    the_eval = ri.evaluation.dist_to_point_evaluator.dist_to_point_evaluator(
        input={"data": knn_estimated["output"], "truth": truth["output"]},
        **evaluator_stage_dict,
        hdf5_groupname="",
    )
    
    # put the evaluation results in a dictionary so we have them 
    eval_dict_lg[(nb_lims[0],nb_lims[1])] = the_eval

    # summarize the distributions using point estimate and naive stack summarizers 
    point_estimate_ens = ri.estimation.algos.point_est_hist.point_est_hist_summarizer(
        input=knn_estimated["output"]
    )
    point_est_dict_lg[(nb_lims[0],nb_lims[1])] = point_estimate_ens
    naive_stack_ens = ri.estimation.algos.naive_stack.naive_stack_summarizer(
        input=knn_estimated["output"]
    )
    naive_dict_lg[(nb_lims[0],nb_lims[1])] = naive_stack_ens
    return the_eval

In [None]:
from multiprocessing.pool import ThreadPool as Pool

# set up parameters to iterate over and dictionaries to store data
nb_params = [(3, 7), (2, 6), (2, 8), (4, 9), (5, 10), (1, 9), (2, 9), (3, 10)]
eval_dict_lg = {}
naive_dict_lg = {}
point_est_dict_lg = {}

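# the context manager cleans up the worker threads once the loop has finished
with Pool(4) as pool:
    # results arrive in completion order, not in the order of nb_params
    for result in pool.imap_unordered(estimate_photoz, nb_params):
        print(result)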
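The cell above collects results by writing into shared dictionaries from inside the worker function. This works with `ThreadPool` because threads share memory, but it would not work with a process-based `multiprocessing.Pool`, where each worker has its own copy of the globals. A pattern that works in both cases is to return the parameters together with the result and build the dictionary in the parent. The sketch below illustrates this with a hypothetical stand-in function, `run_one`, in place of `estimate_photoz`.

```python
from multiprocessing.pool import ThreadPool


def run_one(nb_lims):
    """Stand-in for estimate_photoz: return the parameters along with the result."""
    nb_min, nb_max = nb_lims
    result = {"n_range": nb_max - nb_min}  # placeholder for the real evaluation output
    return nb_lims, result


nb_params = [(3, 7), (2, 6), (2, 8), (4, 9)]

# All results are consumed inside the with block, before the pool is shut down on exit.
with ThreadPool(4) as pool:
    # imap_unordered yields (key, result) pairs in completion order;
    # returning the key with each result keeps them matched up.
    results = dict(pool.imap_unordered(run_one, nb_params))

print(sorted(results))
```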

### Comparing results

Now we can take a look at the dictionary of evaluation metrics, and compare the results for the different parameters:

In [None]:
eval_dict_lg

Let's plot the distributions of two runs with different evaluation metrics to see what was different:

In [None]:
# Plot of point estimate summarized distribution
point_est_dict_lg[(3,7)]["output"].plot_native(xlim=(0,3))
point_est_dict_lg[(5,10)]["output"].plot_native(xlim=(0,3))

### Large datasets

If the code is slow because you're using extremely large datasets, or you're running into memory issues for the same reason, then we suggest using a pipeline. Pipelines are ideal for large datasets, as the code will chunk up large files and iterate through them as needed. 
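The idea of chunked processing can be illustrated in a few lines of plain Python. This is only a conceptual sketch with hypothetical helper names and made-up values; it is not how RAIL pipelines are implemented or configured.

```python
def iter_chunks(n_rows, chunk_size):
    """Yield (start, end) index pairs that cover n_rows in order."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


def process_in_chunks(values, chunk_size, func):
    """Apply func to one slice at a time and collect the per-chunk results."""
    return [
        func(values[start:end]) for start, end in iter_chunks(len(values), chunk_size)
    ]


magnitudes = [22.1, 24.0, 24.6, 25.7, 23.3]  # made-up values
chunk_means = process_in_chunks(
    magnitudes, 2, lambda chunk: sum(chunk) / len(chunk)
)
print(chunk_means)
```

Only one slice is passed to `func` at a time, which is what keeps memory use bounded when the slices are read from disk rather than from a list already in memory.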