# Testing the effects of different datasets on redshifts

In this notebook, we show how to create datasets with different degradations applied, and test how those degradations affect the estimated photometric redshifts.

In [1]:
import rail.interactive as ri 
import numpy as np
import tables_io
from pzflow.examples import get_galaxy_data

An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.


Install FSPS with the following commands:
pip uninstall fsps
git clone --recursive https://github.com/dfm/python-fsps.git
cd python-fsps
python -m pip install .
export SPS_HOME=$(pwd)/src/fsps/libfsps

LEPHAREDIR is being set to the default cache directory:
/home/jscora/.cache/lephare/data
More than 1Gb may be written there.
LEPHAREWORK is being set to the default cache directory:
/home/jscora/.cache/lephare/work
Default work cache is already linked. 
This is linked to the run directory:
/home/jscora/.cache/lephare/runs/20250327T165906


In [2]:
bands = ["u", "g", "r", "i", "z", "y"]
band_dict = {band: f"mag_{band}_lsst" for band in bands}
rename_dict = {f"mag_{band}_lsst_err": f"mag_err_{band}_lsst" for band in bands}
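To make the two mappings concrete, the short sketch below rebuilds them and prints one entry from each. It only restates the comprehensions above and does not call RAIL; which entries are printed is an arbitrary choice.

```python
# Rebuild the two mappings from the cell above and inspect them.
bands = ["u", "g", "r", "i", "z", "y"]

# band name -> magnitude column name
band_dict = {band: f"mag_{band}_lsst" for band in bands}

# error column name before renaming -> error column name after renaming
rename_dict = {f"mag_{band}_lsst_err": f"mag_err_{band}_lsst" for band in bands}

print(band_dict["u"])
print(rename_dict["mag_u_lsst_err"])
```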

In [3]:
catalog = get_galaxy_data().rename(band_dict, axis=1)

In [4]:
flow_model = ri.creation.engines.flowEngine.flow_modeler(
    input=catalog,
    seed=0,
    phys_cols={"redshift": [0, 3]},
    phot_cols={
        "mag_u_lsst": [17, 35],
        "mag_g_lsst": [16, 32],
        "mag_r_lsst": [15, 30],
        "mag_i_lsst": [15, 30],
        "mag_z_lsst": [14, 29],
        "mag_y_lsst": [14, 28],
    },
    calc_colors={"ref_column_name": "mag_i_lsst"},
)

# get sample test and training data sets
train_data_orig = ri.creation.engines.flowEngine.flow_creator(
    n_samples=600, model=flow_model["model"], seed=1235
)
test_data_orig = ri.creation.engines.flowEngine.flow_creator(
    model=flow_model["model"], n_samples=600, seed=1234
)

Inserting handle into data store.  input: None, FlowModeler
Training 30 epochs 
Loss:
(0) 21.3266
(1) 3.9686
(2) 1.9351
(3) 5.2006
(4) -0.3579
(5) 2.2561
(6) 1.5917
(7) 0.3691
(8) -1.0218
(9) inf
Training stopping after epoch 9 because training loss diverged.
Inserting handle into data store.  model: inprogress_model.pkl, FlowModeler
Inserting handle into data store.  model: <pzflow.flow.Flow object at 0x7f553014af00>, FlowCreator
Inserting handle into data store.  output: inprogress_output.pq, FlowCreator
Inserting handle into data store.  model: <pzflow.flow.Flow object at 0x7f553014af00>, FlowCreator
Inserting handle into data store.  output: inprogress_output.pq, FlowCreator


## Degradation

Let's make three training data sets, each applying one more degrader than the last. We'll then train an estimator on each data set, estimate redshifts for the same test data (with all the degradations applied), and compare the results.

Degraders that remove galaxies return a smaller sample. To keep the sample the same size, which will be important later for evaluating the results because each estimated redshift has to be compared with its true redshift, we can pass `drop_rows=False` to the degrader. The output then keeps every row and gains a boolean `flag` column that marks the rows the degrader would have kept. We'll do this for our test data sample, so that we can use this degrader and still keep the same number of samples.
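The difference between dropping and flagging rows can be illustrated with plain pandas. This is a toy sketch, not RAIL's implementation: the data and the selection rule (`redshift < 1.0`) are made up, and only the column name `flag` is taken from the degrader output in this notebook.

```python
import pandas as pd

# Toy catalogue; the values are invented for illustration.
catalog = pd.DataFrame({"redshift": [0.2, 0.8, 1.4, 2.1]})

# Hypothetical selection: True for galaxies that survive the degrader.
keep = catalog["redshift"] < 1.0

# Dropping rows shrinks the sample, so it no longer lines up with the truth.
dropped = catalog[keep]

# Flagging rows keeps the sample size and records the selection instead.
flagged = catalog.assign(flag=keep)

print(len(catalog), len(dropped), len(flagged))
```

With the flagged version, the truth table and the degraded table still have one row per galaxy, and the selection can be applied later as a mask.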

In [6]:
# dataset 1: just photometric errors 
train_data_photerrs = ri.creation.degraders.photometric_errors.lsst_error_model(
    input=train_data_orig["output"], seed=66, renameDict=band_dict, ndFlag=np.nan
)
# renames error columns to match DC2
train_data_photerrs_col = ri.tools.table_tools.column_mapper(
    input=train_data_photerrs["output"], columns=rename_dict
)

Inserting handle into data store.  input: None, LSSTErrorModel
Inserting handle into data store.  output: inprogress_output.pq, LSSTErrorModel
Inserting handle into data store.  input: None, ColumnMapper
Inserting handle into data store.  output: inprogress_output.pq, ColumnMapper


In [None]:
train_data_inc = ri.creation.degraders.spectroscopic_degraders.inv_redshift_incompleteness(
    input=train_data_photerrs["output"], pivot_redshift=1.0, drop_rows=False
)
train_data_inc["output"].info()  # look at the output

Inserting handle into data store.  input: None, InvRedshiftIncompleteness
Inserting handle into data store.  output: inprogress_output.pq, InvRedshiftIncompleteness
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   flag            600 non-null    bool   
 1   redshift        600 non-null    float32
 2   mag_u_lsst      527 non-null    float64
 3   mag_u_lsst_err  527 non-null    float64
 4   mag_g_lsst      586 non-null    float64
 5   mag_g_lsst_err  586 non-null    float64
 6   mag_r_lsst      599 non-null    float64
 7   mag_r_lsst_err  599 non-null    float64
 8   mag_i_lsst      599 non-null    float64
 9   mag_i_lsst_err  599 non-null    float64
 10  mag_z_lsst      598 non-null    float64
 11  mag_z_lsst_err  598 non-null    float64
 12  mag_y_lsst      586 non-null    float64
 13  mag_y_lsst_err  586 non-null    float64
dtypes: bo



In [12]:
train_data_inc["output"][train_data_inc["output"]["flag"]]  # keep only the rows the degrader retained (flag is True)

Unnamed: 0,flag,redshift,mag_u_lsst,mag_u_lsst_err,mag_g_lsst,mag_g_lsst_err,mag_r_lsst,mag_r_lsst_err,mag_i_lsst,mag_i_lsst_err,mag_z_lsst,mag_z_lsst_err,mag_y_lsst,mag_y_lsst_err
0,True,0.081514,21.600708,0.007486,20.599435,0.005083,19.996831,0.005023,19.694191,0.005033,19.587435,0.005082,19.553563,0.005326
1,True,1.982106,28.045543,1.087031,27.984741,0.463765,27.825487,0.367745,27.414988,0.410799,26.406834,0.331409,25.950067,0.473291
2,True,0.741412,24.239717,0.055593,24.456792,0.023063,24.182278,0.016138,23.614150,0.015944,23.513933,0.027296,23.499913,0.060768
3,True,0.873177,24.227208,0.054985,24.026876,0.016102,23.444153,0.009345,22.679293,0.008227,21.333696,0.006288,21.014948,0.008182
4,True,0.534689,24.932885,0.102129,24.096806,0.017046,23.026369,0.007389,22.655102,0.008117,22.539690,0.012207,22.390550,0.022870
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
594,True,1.050994,27.863603,0.975580,27.299412,0.271100,26.148868,0.089899,25.808756,0.108396,25.087469,0.109507,24.939759,0.211998
596,True,0.789594,26.969790,0.536834,27.603759,0.345976,27.147304,0.212318,26.528591,0.201047,26.007307,0.239788,26.194979,0.566227
597,True,1.055365,26.709837,0.442794,26.679047,0.161480,25.869220,0.070238,25.364239,0.073322,24.475552,0.063882,24.367767,0.130250
598,True,0.579771,26.136466,0.282555,26.214452,0.108109,25.535501,0.052243,25.233874,0.065329,25.139298,0.114569,24.849029,0.196470


In [16]:
train_test_output = train_data_inc["output"][train_data_inc["output"]["flag"]].drop("flag", axis=1)

In [19]:
train_test_output.info()

<class 'pandas.core.frame.DataFrame'>
Index: 516 entries, 0 to 599
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   redshift        516 non-null    float32
 1   mag_u_lsst      466 non-null    float64
 2   mag_u_lsst_err  466 non-null    float64
 3   mag_g_lsst      508 non-null    float64
 4   mag_g_lsst_err  508 non-null    float64
 5   mag_r_lsst      515 non-null    float64
 6   mag_r_lsst_err  515 non-null    float64
 7   mag_i_lsst      515 non-null    float64
 8   mag_i_lsst_err  515 non-null    float64
 9   mag_z_lsst      515 non-null    float64
 10  mag_z_lsst_err  515 non-null    float64
 11  mag_y_lsst      506 non-null    float64
 12  mag_y_lsst_err  506 non-null    float64
dtypes: float32(1), float64(12)
memory usage: 54.4 KB
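The non-null counts differ between bands, presumably because non-detections are stored as NaN (`ndFlag=np.nan` was passed to the error model). A quick way to count them per column is `isna().sum()`. The sketch below uses a small made-up table rather than the catalogue itself.

```python
import numpy as np
import pandas as pd

# Made-up magnitudes; NaN marks a non-detection.
toy = pd.DataFrame(
    {
        "mag_u_lsst": [24.2, np.nan, 26.9, np.nan],
        "mag_i_lsst": [23.6, 22.7, np.nan, 25.2],
    }
)

# Number of non-detections in each band.
n_missing = toy.isna().sum()

# Rows detected in every band.
complete = toy.dropna()

print(n_missing.to_dict(), len(complete))
```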


In [8]:
# renames error columns to match DC2
train_data_inc_col = ri.tools.table_tools.column_mapper(
    input=train_data_inc["output"], columns=rename_dict
)

Inserting handle into data store.  input: None, ColumnMapper
Inserting handle into data store.  output: inprogress_output.pq, ColumnMapper


In [None]:
# dataset 3: add in line confusion
train_data_conf = ri.creation.degraders.spectroscopic_degraders.line_confusion(
    input=train_data_inc["output"],
    true_wavelen=5007.0,
    wrong_wavelen=3727.0,
    frac_wrong=0.05,
    seed=1337,
)
# renames error columns to match DC2
train_data_conf_col = ri.tools.table_tools.column_mapper(
    input=train_data_conf["output"], columns=rename_dict
)

Inserting handle into data store.  input: None, LineConfusion
Inserting handle into data store.  output: inprogress_output.pq, LineConfusion


Unnamed: 0,redshift,mag_u_lsst,mag_u_lsst_err,mag_g_lsst,mag_g_lsst_err,mag_r_lsst,mag_r_lsst_err,mag_i_lsst,mag_i_lsst_err,mag_z_lsst,mag_z_lsst_err,mag_y_lsst,mag_y_lsst_err
0,0.081514,21.600708,0.007486,20.599435,0.005083,19.996831,0.005023,19.694191,0.005033,19.587435,0.005082,19.553563,0.005326
1,1.982106,28.045543,1.087031,27.984741,0.463765,27.825487,0.367745,27.414988,0.410799,26.406834,0.331409,25.950067,0.473291
2,0.741412,24.239717,0.055593,24.456792,0.023063,24.182278,0.016138,23.61415,0.015944,23.513933,0.027296,23.499913,0.060768
3,1.5165,24.227208,0.054985,24.026876,0.016102,23.444153,0.009345,22.679293,0.008227,21.333696,0.006288,21.014948,0.008182
4,0.534689,24.932885,0.102129,24.096806,0.017046,23.026369,0.007389,22.655102,0.008117,22.53969,0.012207,22.39055,0.02287
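Row 3 in the output above shows the effect of line confusion: its redshift changes from 0.873177 to 1.5165. That is the value obtained if a line emitted at `true_wavelen` is observed at `true_wavelen * (1 + z)` and then attributed to `wrong_wavelen`. The sketch below works through that arithmetic. The function name is hypothetical and this is not RAIL's code.

```python
def confused_redshift(z_true, true_wavelen, wrong_wavelen):
    """Redshift inferred when a line is attributed to the wrong rest wavelength."""
    # Wavelength at which the true line is actually observed.
    observed_wavelen = true_wavelen * (1.0 + z_true)
    # Redshift implied if that observed line is assumed to be the wrong line.
    return observed_wavelen / wrong_wavelen - 1.0


z_wrong = confused_redshift(0.873177, true_wavelen=5007.0, wrong_wavelen=3727.0)
print(round(z_wrong, 4))
```

Because `true_wavelen` is larger than `wrong_wavelen` here, a confused galaxy is always assigned a higher redshift than its true one.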


In [18]:
# magnitude cut: remove galaxies fainter than mag_i_lsst = 25.0 (a single value is taken as a maximum)
train_data_cut = ri.creation.degraders.quantityCut.quantity_cut(
    input=train_test_output, cuts={"mag_i_lsst": 25.0}, drop_rows=False
)
train_data_cut["output"].head()

Inserting handle into data store.  input: None, QuantityCut


IndexError: index 518 is out of bounds for axis 0 with size 516
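The traceback is not shown, so the cause of this error is not certain. One plausible explanation (an assumption) is that `train_test_output` keeps its original row labels after masking (516 rows with labels between 0 and 599), and that something downstream treats those labels as positions. The pandas sketch below reproduces that kind of mismatch on made-up data and shows that `reset_index(drop=True)` restores contiguous labels. Whether this resolves the `quantity_cut` failure has not been checked here.

```python
import pandas as pd

# Made-up table with a boolean selection column.
df = pd.DataFrame(
    {
        "flag": [True, False, True, True, False, True],
        "mag_i_lsst": [24.1, 26.3, 23.0, 25.5, 22.2, 24.9],
    }
)

# Masking keeps the original labels, so the index has gaps.
kept = df[df["flag"]].drop("flag", axis=1)
largest_label = kept.index.max()

# Using a label as a position fails once it exceeds the number of rows.
try:
    kept["mag_i_lsst"].to_numpy()[largest_label]
    positional_ok = True
except IndexError:
    positional_ok = False

# Resetting the index makes labels and positions agree again.
kept_reset = kept.reset_index(drop=True)

print(list(kept.index), positional_ok, list(kept_reset.index))
```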

In [None]:
### create three different training data sets
n_final = 400

# dataset 1: just photometric errors 
train_data_photerrs = ri.creation.degraders.photometric_errors.lsst_error_model(
    input=train_data_orig["output"], seed=66, renameDict=band_dict, ndFlag=np.nan
)
# renames error columns to match DC2
train_data_photerrs_col = ri.tools.table_tools.column_mapper(
    input=train_data_photerrs["output"], columns=rename_dict
)
# converts output to a numpy dictionary
train_data = ri.tools.table_tools.table_converter(
    input=train_data_photerrs_col["output"], output_format="numpyDict"
)
df_train_photerrs = tables_io.convertObj(train_data_photerrs_col["output"], tables_io.types.PD_DATAFRAME)


# dataset 2: add in redshift incompleteness 
train_data_inc = (
    ri.creation.degraders.spectroscopic_degraders.inv_redshift_incompleteness(
        input=train_data_photerrs["output"], pivot_redshift=1.0
    )
)
# renames error columns to match DC2
train_data_inc_col = ri.tools.table_tools.column_mapper(
    input=train_data_inc["output"], columns=rename_dict
)
df_train_inc = tables_io.convertObj(train_data_inc_col["output"], tables_io.types.PD_DATAFRAME)


# dataset 3: add in line confusion
train_data_conf = ri.creation.degraders.spectroscopic_degraders.line_confusion(
    input=train_data_inc["output"],
    true_wavelen=5007.0,
    wrong_wavelen=3727.0,
    frac_wrong=0.05,
    seed=1337,
)
# renames error columns to match DC2
train_data_conf_col = ri.tools.table_tools.column_mapper(
    input=train_data_conf["output"], columns=rename_dict
)
df_train_conf = tables_io.convertObj(train_data_conf_col["output"], tables_io.types.PD_DATAFRAME)


# cut all the dataframes down to be an equal size
df_train_photerrs = df_train_photerrs.iloc[:n_final]
df_train_inc = df_train_inc.iloc[:n_final]
df_train_conf = df_train_conf.iloc[:n_final]
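Taking the first `n_final` rows only equalises the sizes if every table has at least `n_final` rows. A minimal sketch of a safer truncation follows, using a hypothetical helper `truncate_to_common_size` and made-up tables in place of the three data frames.

```python
import pandas as pd


def truncate_to_common_size(frames, n_final):
    """Return the first rows of each frame, using n_final or the shortest length."""
    n_common = min(n_final, *(len(frame) for frame in frames))
    return [frame.iloc[:n_common] for frame in frames]


# Made-up tables with different lengths.
frames = [pd.DataFrame({"redshift": range(n)}) for n in (600, 516, 516)]
trimmed = truncate_to_common_size(frames, n_final=400)
print([len(frame) for frame in trimmed])
```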


In [None]:
### degrade test data
# add photometric errors modelled on LSST to the data
test_data_errs = ri.creation.degraders.photometric_errors.lsst_error_model(
    input=test_data_orig["output"], seed=66, renameDict=band_dict, ndFlag=np.nan
)
# randomly removes some galaxies above certain redshift threshold 
test_data_inc = (
    ri.creation.degraders.spectroscopic_degraders.inv_redshift_incompleteness(
        input=test_data_errs["output"], pivot_redshift=1.0
    )
)
# simulates the effect of misidentified lines 
test_data_conf = ri.creation.degraders.spectroscopic_degraders.line_confusion(
    input=test_data_inc["output"],
    true_wavelen=5007.0,
    wrong_wavelen=3727.0,
    frac_wrong=0.05,
    seed=1337,
)
# # cut some of the data below a certain magnitude 
# test_data_cut = ri.creation.degraders.quantityCut.quantity_cut(
#     input=test_data_conf["output"], cuts={"mag_i_lsst": 25.0}
# )
# renames error columns to match DC2
test_data_pq = ri.tools.table_tools.column_mapper(
    input=test_data_conf["output"], columns=rename_dict
)

df_test_data = tables_io.convertObj(test_data_pq["output"], tables_io.types.PD_DATAFRAME)


# # converts output to a numpy dictionary
# train_data = ri.tools.table_tools.table_converter(
#     input=train_data_pq["output"], output_format="numpyDict"
# )

### Plot the different datasets 

In [None]:
import matplotlib.pyplot as plt 
%matplotlib inline

hist_options = {
    "histtype": "step",
    "bins": 30,
    "range": (0, 3),
}

plt.hist(df_train_photerrs["redshift"], label="Truth", **hist_options)
plt.hist(df_train_inc["redshift"], label="Redshift incompleteness", **hist_options)
plt.hist(df_train_conf["redshift"], label="Line confusion", **hist_options)
plt.xlabel("Redshift")
plt.ylabel("Number of galaxies")
plt.legend(loc="best")
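To compare the distributions numerically rather than by eye, the same binning can be passed to `np.histogram`. The sketch below uses random stand-in redshifts, not the notebook's data frames, and the high-redshift fraction is only one example of a summary number.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in redshifts; swap in e.g. df_train_inc["redshift"] in the notebook.
redshifts = rng.uniform(0.0, 3.0, size=400)

# Same binning as hist_options: 30 bins between 0 and 3.
counts, edges = np.histogram(redshifts, bins=30, range=(0, 3))

# Fraction of the sample in bins whose left edge is at or above z = 1.
frac_high_z = counts[edges[:-1] >= 1.0].sum() / counts.sum()
print(counts.sum(), round(float(frac_high_z), 3))
```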

In [None]:
# train the model
inform_bpz = ri.estimation.algos.bpz_lite.bpz_lite_informer(
    input=train_data, nondetect_val=np.nan, model="bpz.pkl", hdf5_groupname=""
)
# estimate the photo-zs
bpz_estimated = ri.estimation.algos.bpz_lite.bpz_lite_estimator(
    input=df_test_data,
    model=inform_bpz["model"],
    nondetect_val=np.nan,
    hdf5_groupname="",
)

bpz_estimated

In [None]:
df_train_photerrs

## Estimate

In [None]:
train_datasets = [df_train_photerrs, df_train_inc, df_train_conf]
point_est_list = []
eval_list = []

for df in train_datasets:

    # train the model
    inform_bpz = ri.estimation.algos.bpz_lite.bpz_lite_informer(
        input=df, nondetect_val=np.nan, model="bpz.pkl", hdf5_groupname=""
    )
    # estimate the photo-zs
    bpz_estimated = ri.estimation.algos.bpz_lite.bpz_lite_estimator(
        input=df_test_data,
        model=inform_bpz["model"],
        nondetect_val=np.nan,
        hdf5_groupname="",
    )

    # summarize the distributions
    point_estimate_ens = ri.estimation.algos.point_est_hist.point_est_hist_summarizer(
        input=bpz_estimated["output"]
    )
    point_est_list.append(point_estimate_ens)

    # evaluate the results
    evaluator_stage_dict = dict(
        metrics=["cdeloss", "brier"],
        _random_state=None,
        metric_config={
            "brier": {"limits": (0, 3.1)},
        },
    )
    the_eval = ri.evaluation.dist_to_point_evaluator.dist_to_point_evaluator(
        input={"data": bpz_estimated["output"], "truth": test_data_orig["output"]},
        **evaluator_stage_dict,
        hdf5_groupname="",
    )

    # store the evaluation results in a list so we can compare them later
    eval_list.append(the_eval)
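Besides the distribution-level metrics above, runs are often compared through point-estimate statistics of dz = (z_phot - z_true) / (1 + z_true). The sketch below computes the median bias, the normalised median absolute deviation, and the outlier fraction with NumPy. The arrays are invented, the function name is hypothetical, and the 0.15 outlier threshold is a conventional choice rather than something set in this notebook. Both arrays must have the same length, which is why keeping the test sample size fixed matters.

```python
import numpy as np


def point_metrics(z_phot, z_true, outlier_cut=0.15):
    """Median bias, sigma_NMAD and outlier fraction of the scaled residuals."""
    z_phot = np.asarray(z_phot, dtype=float)
    z_true = np.asarray(z_true, dtype=float)
    if z_phot.shape != z_true.shape:
        raise ValueError("z_phot and z_true must have the same shape")

    # Residuals scaled by (1 + z_true).
    dz = (z_phot - z_true) / (1.0 + z_true)
    bias = np.median(dz)
    # 1.4826 scales the median absolute deviation to a Gaussian sigma.
    sigma_nmad = 1.4826 * np.median(np.abs(dz - bias))
    outlier_frac = np.mean(np.abs(dz) > outlier_cut)
    return bias, sigma_nmad, outlier_frac


# Invented values for illustration only.
z_true = np.array([0.1, 0.5, 1.0, 1.5, 2.0])
z_phot = np.array([0.12, 0.48, 1.05, 1.45, 2.9])
bias, sigma_nmad, outlier_frac = point_metrics(z_phot, z_true)
print(bias, sigma_nmad, outlier_frac)
```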
