# Iterating over parameters and comparing resulting distributions of redshifts

**Authors:** Jennifer Scora

**Last run successfully:** Jan 16, 2025

This notebook shows how to run through the stages of RAIL (creation, estimation, evaluation, and summarization) while looping over a specific parameter and comparing the resulting photometric redshift estimates. It also shows how to use multiprocessing in interactive mode. If you want full MPI, or are running on very large datasets, we recommend running in pipeline mode (link).

## Creating the data 
First we create the data sets of galaxy magnitudes that we will use to estimate photometric redshifts. We use PZFlow to build a model, and then draw two data sets from it: a training data set and a test data set. The training data set is used to train our models, and the test data set is the data we will get photo-z estimates for.

In [1]:
import rail.interactive as ri 
import numpy as np
import tables_io
from pzflow.examples import get_galaxy_data

An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.


LEPHAREDIR is being set to the default cache directory:
/home/jscora/.cache/lephare/data
More than 1Gb may be written there.
LEPHAREWORK is being set to the default cache directory:
/home/jscora/.cache/lephare/work
Default work cache is already linked. 
This is linked to the run directory:
/home/jscora/.cache/lephare/runs/20250327T165906


Here we need a few configuration parameters to deal with differences in data schema between existing PZ codes. We also need to grab the data to use for training the flow engine. 

In [2]:
bands = ["u", "g", "r", "i", "z", "y"]
band_dict = {band: f"mag_{band}_lsst" for band in bands}
rename_dict = {f"mag_{band}_lsst_err": f"mag_err_{band}_lsst" for band in bands}
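
To make the two mappings concrete, the short sketch below rebuilds them and prints their contents. It repeats the definitions from the cell above so that it runs on its own; nothing here calls RAIL.

```python
# Same construction as in the cell above, repeated so this snippet is self-contained.
bands = ["u", "g", "r", "i", "z", "y"]
band_dict = {band: f"mag_{band}_lsst" for band in bands}
rename_dict = {f"mag_{band}_lsst_err": f"mag_err_{band}_lsst" for band in bands}

# band_dict maps each short band name to its magnitude column name.
for band, column in band_dict.items():
    print(f"{band} -> {column}")

# rename_dict maps one error-column naming scheme to another
# (used later in this notebook to match the DC2 names).
for old_name, new_name in rename_dict.items():
    print(f"{old_name} -> {new_name}")
```

Each dictionary has one entry per band, so adding a band to `bands` updates both mappings.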

In [3]:
catalog = get_galaxy_data().rename(band_dict, axis=1)

### Train and sample the model

Here we need to train the normalizing flow that serves as the engine for the input data creation, and then use the flow to produce some synthetic data for our training data set, as well as for our test data set. 

In [4]:
flow_model = ri.creation.engines.flowEngine.flow_modeler(
    input=catalog,
    seed=0,
    phys_cols={"redshift": [0, 3]},
    phot_cols={
        "mag_u_lsst": [17, 35],
        "mag_g_lsst": [16, 32],
        "mag_r_lsst": [15, 30],
        "mag_i_lsst": [15, 30],
        "mag_z_lsst": [14, 29],
        "mag_y_lsst": [14, 28],
    },
    calc_colors={"ref_column_name": "mag_i_lsst"},
)

# get sample test and training data sets
train_data_orig = ri.creation.engines.flowEngine.flow_creator(
    n_samples=150, model=flow_model["model"], seed=1235
)
test_data_orig = ri.creation.engines.flowEngine.flow_creator(
    model=flow_model["model"], n_samples=150, seed=1234
)

Inserting handle into data store.  input: None, FlowModeler
Training 30 epochs 
Loss:
(0) 21.3266
(1) 3.9686
(2) 1.9351
(3) 5.2006
(4) -0.3579
(5) 2.2561
(6) 1.5917
(7) 0.3691
(8) -1.0218
(9) inf
Training stopping after epoch 9 because training loss diverged.
Inserting handle into data store.  model: inprogress_model.pkl, FlowModeler
Inserting handle into data store.  model: <pzflow.flow.Flow object at 0x7316199196d0>, FlowCreator
Inserting handle into data store.  output: inprogress_output.pq, FlowCreator
Inserting handle into data store.  model: <pzflow.flow.Flow object at 0x7316199196d0>, FlowCreator
Inserting handle into data store.  output: inprogress_output.pq, FlowCreator


### Degrade the data sets

Next we apply some degradation functions to the data, to introduce errors and apply cuts so that it looks more like data from a telescope. We apply more degradation to the training set, including cuts that make the sample incomplete. For both data sets, we use the `ColumnMapper` to rename the error columns so that they match the names in DC2, and the `TableConverter` to convert the data to a numpy dictionary, which is the input format expected by the functions that follow.

In [None]:
### degrade training data
# add photometric errors modelled on LSST to the data
train_data_errs = ri.creation.degraders.photometric_errors.lsst_error_model(
    input=train_data_orig["output"], seed=66, renameDict=band_dict, ndFlag=np.nan
)
# randomly removes some galaxies above certain redshift threshold 
train_data_inc = (
    ri.creation.degraders.spectroscopic_degraders.inv_redshift_incompleteness(
        input=train_data_errs["output"], pivot_redshift=1.0
    )
)
# simulates the effect of misidentified lines 
train_data_conf = ri.creation.degraders.spectroscopic_degraders.line_confusion(
    input=train_data_inc["output"],
    true_wavelen=5007.0,
    wrong_wavelen=3727.0,
    frac_wrong=0.05,
    seed=1337,
)
# keep only galaxies with mag_i_lsst below 25.0 (a single number is treated as a maximum)
train_data_cut = ri.creation.degraders.quantityCut.quantity_cut(
    input=train_data_conf["output"], cuts={"mag_i_lsst": 25.0}
)
# renames error columns to match DC2
train_data_pq = ri.tools.table_tools.column_mapper(
    input=train_data_cut["output"], columns=rename_dict
)
# converts output to a numpy dictionary
train_data = ri.tools.table_tools.table_converter(
    input=train_data_pq["output"], output_format="numpyDict"
)

### degrade testing data
# add photometric errors modelled on LSST to the data
test_data_errs = ri.creation.degraders.photometric_errors.lsst_error_model(
    input=test_data_orig["output"], seed=58, renameDict=band_dict, ndFlag=np.nan
)
# renames error columns to match DC2
test_data_pq = ri.tools.table_tools.column_mapper(
    input=test_data_errs["output"], columns=rename_dict, hdf5_groupname=""
)
# converts output to a numpy dictionary
test_data = ri.tools.table_tools.table_converter(
    input=test_data_pq["output"], output_format="numpyDict"
)


Inserting handle into data store.  input: None, LSSTErrorModel
Inserting handle into data store.  output: inprogress_output.pq, LSSTErrorModel
Inserting handle into data store.  input: None, InvRedshiftIncompleteness
Inserting handle into data store.  output: inprogress_output.pq, InvRedshiftIncompleteness
Inserting handle into data store.  input: None, LineConfusion
Inserting handle into data store.  output: inprogress_output.pq, LineConfusion
Inserting handle into data store.  input: None, QuantityCut
Inserting handle into data store.  output: inprogress_output.pq, QuantityCut
Inserting handle into data store.  input: None, ColumnMapper
Inserting handle into data store.  output: inprogress_output.pq, ColumnMapper
Inserting handle into data store.  input: None, TableConverter
Inserting handle into data store.  output: inprogress_output.hdf5, TableConverter
Inserting handle into data store.  input: None, LSSTErrorModel
Inserting handle into data store.  output: inprogress_output.pq, LS
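
As a rough illustration of what two of the spectroscopic degraders above do to a redshift, here is a minimal pure-Python sketch. The line-confusion formula follows from the definition of redshift (a line emitted at `true_wavelen` is observed at `true_wavelen * (1 + z)` and then attributed to `wrong_wavelen`). The incompleteness form `min(1, pivot_redshift / z)` is an assumption made for this sketch, not something confirmed by this notebook, and both helper functions are hypothetical names rather than RAIL functions.

```python
import random


def wrong_line_redshift(z_true, true_wavelen, wrong_wavelen):
    """Redshift inferred when a line emitted at true_wavelen is mistaken for wrong_wavelen."""
    observed_wavelen = true_wavelen * (1.0 + z_true)
    return observed_wavelen / wrong_wavelen - 1.0


def keep_probability(z, pivot_redshift):
    """ASSUMED incompleteness form: keep everything up to the pivot, then fall off as pivot / z."""
    if z <= pivot_redshift:
        return 1.0
    return pivot_redshift / z


rng = random.Random(1337)
redshifts = [0.3, 0.8, 1.0, 1.5, 2.5]  # made-up example values

for z in redshifts:
    p_keep = keep_probability(z, pivot_redshift=1.0)
    kept = rng.random() < p_keep
    z_confused = wrong_line_redshift(z, true_wavelen=5007.0, wrong_wavelen=3727.0)
    print(f"z={z:.2f}  p_keep={p_keep:.2f}  kept={kept}  z_if_misidentified={z_confused:.3f}")
```

With the wavelengths used in the cell above, a misidentified line always pushes the inferred redshift higher, because the true rest wavelength is longer than the wrong one.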

## Estimate the redshifts and evaluate performance

Now we estimate our photometric redshifts. This is where we iterate over a set of parameters, to see how they affect the performance of the redshift estimation algorithm. We use the K-Nearest Neighbours algorithm (link) to estimate redshifts, and vary the minimum and maximum number of neighbours allowed.

Then we evaluate how the estimated redshifts compare to the true redshifts, which come from the original test data set before it was degraded. The evaluation results are saved to a dictionary so we can compare them later. We also save the estimated photometric redshifts to a dictionary so we can plot them later.
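
To give some intuition for what the neighbour limits control, here is a toy nearest-neighbour estimator in plain Python. It is a sketch of the general technique only, not RAIL's implementation: the function names, the colour values, the use of a simple mean, and the choice of the neighbour count by validation error over an inclusive range are all assumptions made for illustration.

```python
import math


def knn_point_estimate(train_colors, train_redshifts, query_colors, n_neighbors):
    """Average the redshifts of the n_neighbors training galaxies closest in colour space."""
    distances = [
        (math.dist(colors, query_colors), z)
        for colors, z in zip(train_colors, train_redshifts)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:n_neighbors]
    return sum(z for _, z in nearest) / len(nearest)


def choose_n_neighbors(train_colors, train_z, valid_colors, valid_z, k_min, k_max):
    """Pick the neighbour count in [k_min, k_max] with the lowest mean absolute error."""
    best_k, best_err = None, math.inf
    for k in range(k_min, k_max + 1):
        errors = [
            abs(knn_point_estimate(train_colors, train_z, colors, k) - z)
            for colors, z in zip(valid_colors, valid_z)
        ]
        mean_err = sum(errors) / len(errors)
        if mean_err < best_err:
            best_k, best_err = k, mean_err
    return best_k


# Toy training set: two colours per galaxy with known redshifts (made-up values).
train_colors = [(0.2, 0.1), (0.4, 0.2), (0.6, 0.3), (0.9, 0.5), (1.2, 0.7)]
train_redshifts = [0.1, 0.3, 0.5, 0.9, 1.3]

# One validation galaxy with a known redshift (also made up).
valid_colors = [(0.45, 0.22)]
valid_redshifts = [0.4]

best_k = choose_n_neighbors(
    train_colors, train_redshifts, valid_colors, valid_redshifts, k_min=1, k_max=3
)
print("best number of neighbours:", best_k)
```

Widening or narrowing the allowed range changes which neighbour counts are even considered, which is why different limits can lead to the same or to different fitted models.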

In [None]:
### Iterate over estimating photo-zs using KNN 

# set up parameters to iterate over and dictionaries to store data
nb_params = [(3,7), (2,6), (2,8), (4,9)]
eval_dict = {}
photoz_dict = {}
naive_dict = {}
point_est_dict = {}

for nb_min, nb_max in nb_params:
    # train the informer
    inform_knn = ri.estimation.algos.k_nearneigh.k_near_neig_informer(
        input=train_data["output"],
        nondetect_val=np.nan,
        model="bpz.pkl",
        hdf5_groupname="",
        nneigh_min=nb_min,
        nneigh_max=nb_max,
    )
    # get photo-zs
    knn_estimated = ri.estimation.algos.k_nearneigh.k_near_neig_estimator(
        input=test_data["output"],
        model=inform_knn["model"],
        nondetect_val=np.nan,
        hdf5_groupname="",
    )
    # put in a dict for later
    photoz_dict[(nb_min, nb_max)] = knn_estimated

    ### Evaluate the results 
    evaluator_stage_dict = dict(
        metrics=["cdeloss", "pit", "brier"],
        _random_state=None,
        metric_config={
            "brier": {"limits": (0, 3.1)},
            "pit": {"tdigest_compression": 1000},
        },
    )
    truth = test_data_orig

    the_eval = ri.evaluation.dist_to_point_evaluator.dist_to_point_evaluator(
        input={"data": knn_estimated["output"], "truth": truth["output"]},
        **evaluator_stage_dict,
        hdf5_groupname="",
    )
    
    # put the evaluation results in a dictionary so we have them 
    eval_dict[(nb_min,nb_max)] = the_eval

    # summarize the distributions using point estimate and naive stack summarizers 
    point_estimate_ens = ri.estimation.algos.point_est_hist.point_est_hist_summarizer(
        input=knn_estimated["output"]
    )
    point_est_dict[(nb_min,nb_max)] = point_estimate_ens
    naive_stack_ens = ri.estimation.algos.naive_stack.naive_stack_summarizer(
        input=knn_estimated["output"]
    )
    naive_dict[(nb_min,nb_max)] = naive_stack_ens





Inserting handle into data store.  input: None, KNearNeighInformer
split into 49 training and 16 validation samples
finding best fit sigma and NNeigh...



best fit values are sigma=0.075 and numneigh=3



Inserting handle into data store.  model: inprogress_bpz.pkl, KNearNeighInformer
Inserting handle into data store.  input: None, KNearNeighEstimator
Inserting handle into data store.  model: {'kdtree': <sklearn.neighbors._kd_tree.KDTree object at 0x603b18018170>, 'bestsig': np.float64(0.075), 'nneigh': 3, 'truezs': array([0.8559624 , 1.09725535, 0.67563593, 0.91550589, 0.90324795,
       0.65226948, 0.77357709, 0.37648416, 1.22105432, 0.89333856,
       0.65628129, 0.83754033, 1.04881907, 1.19360622, 0.41372049,
       1.24091506, 0.07890534, 0.33354044, 0.84880698, 0.50793141,
       0.07312357, 0.92710775, 0.53522396, 0.78664237, 0.55667555,
       0.45270801, 0.50885946, 1.21598339, 0.55211103, 0.22890699,
       0.49027538, 0.60366631, 0.66702271, 0.59513855, 0.67450702,
       0



Inserting handle into data store.  input: None, DistToPointEvaluator
Inserting handle into data store.  truth: None, DistToPointEvaluator
Requested metrics: ['cdeloss', 'pit', 'brier']
Inserting handle into data store.  output: inprogress_output.hdf5, DistToPointEvaluator
Inserting handle into data store.  summary: inprogress_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  single_distribution_summary: inprogress_single_distribution_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  input: None, PointEstHistSummarizer
Process 0 running estimator on chunk 0 - 150
Inserting handle into data store.  output: inprogress_output.hdf5, PointEstHistSummarizer
Inserting handle into data store.  single_NZ: inprogress_single_NZ.hdf5, PointEstHistSummarizer
Inserting handle into data store.  input: None, NaiveStackSummarizer
Process 0 running estimator on chunk 0 - 150
Inserting handle into data store.  output: inprogress_output.hdf5, NaiveStackSummarizer
Ins



Inserting handle into data store.  input: None, DistToPointEvaluator
Inserting handle into data store.  truth: None, DistToPointEvaluator
Requested metrics: ['cdeloss', 'pit', 'brier']
Inserting handle into data store.  output: inprogress_output.hdf5, DistToPointEvaluator
Inserting handle into data store.  summary: inprogress_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  single_distribution_summary: inprogress_single_distribution_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  input: None, PointEstHistSummarizer
Process 0 running estimator on chunk 0 - 150
Inserting handle into data store.  output: inprogress_output.hdf5, PointEstHistSummarizer
Inserting handle into data store.  single_NZ: inprogress_single_NZ.hdf5, PointEstHistSummarizer
Inserting handle into data store.  input: None, NaiveStackSummarizer
Process 0 running estimator on chunk 0 - 150
Inserting handle into data store.  output: inprogress_output.hdf5, NaiveStackSummarizer
Ins



Inserting handle into data store.  input: None, DistToPointEvaluator
Inserting handle into data store.  truth: None, DistToPointEvaluator
Requested metrics: ['cdeloss', 'pit', 'brier']
Inserting handle into data store.  output: inprogress_output.hdf5, DistToPointEvaluator
Inserting handle into data store.  summary: inprogress_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  single_distribution_summary: inprogress_single_distribution_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  input: None, PointEstHistSummarizer
Process 0 running estimator on chunk 0 - 150
Inserting handle into data store.  output: inprogress_output.hdf5, PointEstHistSummarizer
Inserting handle into data store.  single_NZ: inprogress_single_NZ.hdf5, PointEstHistSummarizer
Inserting handle into data store.  input: None, NaiveStackSummarizer
Process 0 running estimator on chunk 0 - 150
Inserting handle into data store.  output: inprogress_output.hdf5, NaiveStackSummarizer
Ins



Inserting handle into data store.  input: None, DistToPointEvaluator
Inserting handle into data store.  truth: None, DistToPointEvaluator
Requested metrics: ['cdeloss', 'pit', 'brier']
Inserting handle into data store.  output: inprogress_output.hdf5, DistToPointEvaluator
Inserting handle into data store.  summary: inprogress_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  single_distribution_summary: inprogress_single_distribution_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  input: None, PointEstHistSummarizer
Process 0 running estimator on chunk 0 - 150
Inserting handle into data store.  output: inprogress_output.hdf5, PointEstHistSummarizer
Inserting handle into data store.  single_NZ: inprogress_single_NZ.hdf5, PointEstHistSummarizer
Inserting handle into data store.  input: None, NaiveStackSummarizer
Process 0 running estimator on chunk 0 - 150
Inserting handle into data store.  output: inprogress_output.hdf5, NaiveStackSummarizer
Ins

## Compare the results

We can take a look at the evaluation metrics generated for each run to see how they compare.

In [None]:
# TODO: either make this into a nicer table or get rid of it 
results_tables = {
    key: tables_io.convertObj(val["summary"], tables_io.types.PD_DATAFRAME)
    for key, val in eval_dict.items()
}
results_tables

In [None]:
eval_dict
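
One possible way to make the metrics easier to compare is to flatten them into a single table with one row per parameter pair. The sketch below uses only the standard library. The numbers are copied from the `eval_dict` output shown at the end of this notebook, with plain floats standing in for the one-element arrays, and `format_metric_table` is a hypothetical helper, not part of RAIL or `tables_io`.

```python
# Metric values copied from the eval_dict output printed later in this notebook.
summaries = {
    (2, 6): {"cdeloss": -0.25960067, "brier": 198.97975105},
    (4, 9): {"cdeloss": -0.44553651, "brier": 176.7702214},
    (5, 10): {"cdeloss": -0.55763654, "brier": 156.11697065},
}


def format_metric_table(summaries):
    """Return a fixed-width text table with one row per (nneigh_min, nneigh_max) pair."""
    metric_names = sorted({name for metrics in summaries.values() for name in metrics})
    header = f"{'nneigh_min':>10} {'nneigh_max':>10} " + " ".join(
        f"{name:>12}" for name in metric_names
    )
    lines = [header]
    for (nb_min, nb_max), metrics in sorted(summaries.items()):
        cells = " ".join(f"{metrics[name]:>12.4f}" for name in metric_names)
        lines.append(f"{nb_min:>10} {nb_max:>10} {cells}")
    return "\n".join(lines)


table = format_metric_table(summaries)
print(table)
```

To feed it real results, each one-element array in a `summary` dictionary would first need to be converted to a float, for example with `float(value[0])`.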

We can also plot the summarized distributions of all the photometric redshifts generated in a loop against each other, to compare the effect of the different parameters. Below, we plot the runs with the following parameters: 
- minimum neighbours: 3, maximum neighbours: 7
- minimum neighbours: 2, maximum neighbours: 6

In [None]:
# plot of point estimate summarized distribution 
# TODO: try putting these both on one plot 
point_est_dict[(3,7)]["output"].plot_native(xlim=(0, 3))
point_est_dict[(2,6)]["output"].plot_native(xlim=(0, 3))

In [None]:
# Plot of naive stack summarized distribution 
naive_dict[(3,7)]["output"].plot_native(xlim=(0,3))
naive_dict[(2,6)]["output"].plot_native(xlim=(0,3))

## Using multiprocessing

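Let's say we wanted to do the same as above, but with many more parameters (and perhaps a slower algorithm). We can use Python's `multiprocessing` module to run the loop iterations concurrently and speed up the process. Note that the code below uses `multiprocessing.pool.ThreadPool`, which runs the iterations in threads within a single process rather than in separate processes.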
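
Before the full version, here is the bare pattern on a stand-in workload, so the pool mechanics are visible without any RAIL calls. `run_one` and its fake score are placeholders invented for this sketch. Unlike the notebook's function, it returns its key together with its result, which is one way to collect results without writing to shared dictionaries from several threads.

```python
import time
from multiprocessing.pool import ThreadPool


def run_one(nb_lims):
    """Stand-in for one estimate-and-evaluate run; returns the key with its result."""
    nb_min, nb_max = nb_lims
    time.sleep(0.01)  # pretend to do some work
    fake_score = nb_max - nb_min  # placeholder value, not a real metric
    return nb_lims, fake_score


nb_params = [(3, 7), (2, 6), (2, 8), (4, 9)]

# imap_unordered yields results as they finish, so the order can differ from
# nb_params; keying each result by its parameters makes the order irrelevant.
with ThreadPool(4) as pool:
    results = dict(pool.imap_unordered(run_one, nb_params))

for key in nb_params:
    print(key, results[key])
```

The `with` block shuts the pool down once all results have been consumed.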

In [None]:
def estimate_photoz(nb_lims):
    """Estimate photo-zs using the KNN algorithm, given a minimum and maximum number
    of nearest neighbours, then evaluate the performance."""

    # train the informer
    inform_knn = ri.estimation.algos.k_nearneigh.k_near_neig_informer(
        input=train_data["output"],
        nondetect_val=np.nan,
        model="bpz.pkl",
        hdf5_groupname="",
        nneigh_min=nb_lims[0],
        nneigh_max=nb_lims[1],
    )
    # get photo-zs
    knn_estimated = ri.estimation.algos.k_nearneigh.k_near_neig_estimator(
        input=test_data["output"],
        model=inform_knn["model"],
        nondetect_val=np.nan,
        hdf5_groupname="",
    )
    # put in a dict for later
    photoz_dict[(nb_lims[0], nb_lims[1])] = knn_estimated

    ### Evaluate the results 
    evaluator_stage_dict = dict(
        metrics=["cdeloss", "pit", "brier"],
        _random_state=None,
        metric_config={
            "brier": {"limits": (0, 3.1)},
            "pit": {"tdigest_compression": 1000},
        },
    )
    truth = test_data_orig

    the_eval = ri.evaluation.dist_to_point_evaluator.dist_to_point_evaluator(
        input={"data": knn_estimated["output"], "truth": truth["output"]},
        **evaluator_stage_dict,
        hdf5_groupname="",
    )
    
    # put the evaluation results in a dictionary so we have them 
    eval_dict[(nb_lims[0],nb_lims[1])] = the_eval

    # summarize the distributions using point estimate and naive stack summarizers 
    point_estimate_ens = ri.estimation.algos.point_est_hist.point_est_hist_summarizer(
        input=knn_estimated["output"]
    )
    point_est_dict[(nb_lims[0],nb_lims[1])] = point_estimate_ens
    naive_stack_ens = ri.estimation.algos.naive_stack.naive_stack_summarizer(
        input=knn_estimated["output"]
    )
    naive_dict[(nb_lims[0],nb_lims[1])] = naive_stack_ens
    return the_eval

In [None]:
from multiprocessing.pool import ThreadPool as Pool

# set up parameters to iterate over and dictionaries to store data
nb_params = [(3, 7), (2, 6), (2, 8), (4, 9), (5, 10), (1, 9), (2, 9), (3, 10)]
eval_dict = {}
photoz_dict = {}
naive_dict = {}
point_est_dict = {}

# run up to 4 parameter sets at a time; results arrive in completion order
with Pool(4) as pool:
    for result in pool.imap_unordered(estimate_photoz, nb_params):
        print(result)

Inserting handle into data store.  input: None, KNearNeighInformer
Inserting handle into data store.  input: None, KNearNeighInformer
Inserting handle into data store.  input: None, KNearNeighInformer
split into 49 training and 16 validation samples
finding best fit sigma and NNeigh...
Inserting handle into data store.  input: None, KNearNeighInformer
split into 49 training and 16 validation samples
finding best fit sigma and NNeigh...
split into 49 training and 16 validation samples
finding best fit sigma and NNeigh...
split into 49 training and 16 validation samples
finding best fit sigma and NNeigh...



best fit values are sigma=0.075 and numneigh=3



Inserting handle into data store.  model: inprogress_bpz.pkl, KNearNeighInformer
Inserting handle into data store.  input: None, KNearNeighEstimator
Inserting handle into data store.  model: {'kdtree': <sklearn.neighbors._kd_tree.KDTree object at 0x7315040a5d00>, 'bestsig': np.float64(0.075), 'nneigh': 3, 'truezs': array([0.8559624 ,



Inserting handle into data store.  output: inprogress_output.hdf5, DistToPointEvaluator
Inserting handle into data store.  summary: inprogress_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  single_distribution_summary: inprogress_single_distribution_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  input: None, PointEstHistSummarizer
Process 0 running estimator on chunk 0 - 150
Inserting handle into data store.  output: inprogress_output.hdf5, DistToPointEvaluator
Inserting handle into data store.  summary: inprogress_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  single_distribution_summary: inprogress_single_distribution_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  input: None, PointEstHistSummarizer
Process 0 running estimator on chunk 0 - 150



best fit values are sigma=0.075 and numneigh=4



Inserting handle into data store.  model: inprogress_bpz.pkl, KNearNeighInformer
Inserting handle i



Inserting handle into data store.  output: inprogress_output.hdf5, DistToPointEvaluator
Inserting handle into data store.  summary: inprogress_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  single_distribution_summary: inprogress_single_distribution_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  input: None, PointEstHistSummarizer
Process 0 running estimator on chunk 0 - 150
Inserting handle into data store.  output: inprogress_output.hdf5, DistToPointEvaluator
Inserting handle into data store.  summary: inprogress_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  single_distribution_summary: inprogress_single_distribution_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  input: None, PointEstHistSummarizer
Process 0 running estimator on chunk 0 - 150
Inserting handle into data store.  output: inprogress_output.hdf5, PointEstHistSummarizer
Inserting handle into data store.  single_NZ: inprogress_singl



Inserting handle into data store.  output: inprogress_output.hdf5, DistToPointEvaluator
Inserting handle into data store.  summary: inprogress_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  single_distribution_summary: inprogress_single_distribution_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  input: None, PointEstHistSummarizer
Process 0 running estimator on chunk 0 - 150



best fit values are sigma=0.075 and numneigh=3



Inserting handle into data store.  model: inprogress_bpz.pkl, KNearNeighInformer
Inserting handle into data store.  input: None, KNearNeighEstimator
Inserting handle into data store.  model: {'kdtree': <sklearn.neighbors._kd_tree.KDTree object at 0x731504067920>, 'bestsig': np.float64(0.075), 'nneigh': 3, 'truezs': array([0.8559624 , 1.09725535, 0.67563593, 0.91550589, 0.90324795,
       0.65226948, 0.77357709, 0.37648416, 1.22105432, 0.89333856,
       0.65628129, 0.83754033, 1.04881907, 1.19360622, 0.41372049,
     






best fit values are sigma=0.075 and numneigh=3



Inserting handle into data store.  model: inprogress_bpz.pkl, KNearNeighInformer
Inserting handle into data store.  input: None, KNearNeighEstimator
Inserting handle into data store.  model: {'kdtree': <sklearn.neighbors._kd_tree.KDTree object at 0x7315ac092420>, 'bestsig': np.float64(0.075), 'nneigh': 3, 'truezs': array([0.8559624 , 1.09725535, 0.67563593, 0.91550589, 0.90324795,
       0.65226948, 0.77357709, 0.37648416, 1.22105432, 0.89333856,
       0.65628129, 0.83754033, 1.04881907, 1.19360622, 0.41372049,
       1.24091506, 0.07890534, 0.33354044, 0.84880698, 0.50793141,
       0.07312357, 0.92710775, 0.53522396, 0.78664237, 0.55667555,
       0.45270801, 0.50885946, 1.21598339, 0.55211103, 0.22890699,
       0.49027538, 0.60366631, 0.66702271, 0.59513855, 0.67450702,
       0.25323451, 1.34854114, 0.90114802, 0.46726322, 1.00304711,
       0.76495528, 0.16585791, 0.70379937, 0.56558955, 0.61184895,
       1.07242024, 0.591457






best fit values are sigma=0.075 and numneigh=3



Inserting handle into data store.  model: inprogress_bpz.pkl, KNearNeighInformer
Inserting handle into data store.  input: None, KNearNeighEstimator
Inserting handle into data store.  model: {'kdtree': <sklearn.neighbors._kd_tree.KDTree object at 0x7315980a8ac0>, 'bestsig': np.float64(0.075), 'nneigh': 3, 'truezs': array([0.8559624 , 1.09725535, 0.67563593, 0.91550589, 0.90324795,
       0.65226948, 0.77357709, 0.37648416, 1.22105432, 0.89333856,
       0.65628129, 0.83754033, 1.04881907, 1.19360622, 0.41372049,
       1.24091506, 0.07890534, 0.33354044, 0.84880698, 0.50793141,
       0.07312357, 0.92710775, 0.53522396, 0.78664237, 0.55667555,
       0.45270801, 0.50885946, 1.21598339, 0.55211103, 0.22890699,
       0.49027538, 0.60366631, 0.66702271, 0.59513855, 0.67450702,
       0.25323451, 1.34854114, 0.90114802, 0.46726322, 1.00304711,
       0.76495528, 0.16585791, 0.70379937, 0.56558955, 0.61184895,
       1.07242024, 0.591457



Inserting handle into data store.  output: inprogress_output.hdf5, DistToPointEvaluator
Inserting handle into data store.  summary: inprogress_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  single_distribution_summary: inprogress_single_distribution_summary.hdf5, DistToPointEvaluator
Inserting handle into data store.  input: None, PointEstHistSummarizer
Process 0 running estimator on chunk 0 - 150
Inserting handle into data store.  output: inprogress_output.hdf5, PointEstHistSummarizer
Inserting handle into data store.  single_NZ: inprogress_single_NZ.hdf5, PointEstHistSummarizer
Inserting handle into data store.  input: None, NaiveStackSummarizer
Process 0 running estimator on chunk 0 - 150
Inserting handle into data store.  output: inprogress_output.hdf5, PointEstHistSummarizer
Inserting handle into data store.  single_NZ: inprogress_single_NZ.hdf5, PointEstHistSummarizer
Inserting handle into data store.  input: None, NaiveStackSummarizer
Process 0 running est

In [19]:
eval_dict

{(2, 6): {'output': {},
  'summary': {'cdeloss': array([-0.25960067]), 'brier': array([198.97975105])},
  'single_distribution_summary': {'pit': Ensemble(the_class=quant,shape=(1, 90))}},
 (3, 7): {'output': {},
  'summary': {'cdeloss': array([-0.25960067]), 'brier': array([198.97975105])},
  'single_distribution_summary': {'pit': Ensemble(the_class=quant,shape=(1, 90))}},
 (4, 9): {'output': {},
  'summary': {'cdeloss': array([-0.44553651]), 'brier': array([176.7702214])},
  'single_distribution_summary': {'pit': Ensemble(the_class=quant,shape=(1, 91))}},
 (2, 8): {'output': {},
  'summary': {'cdeloss': array([-0.25960067]), 'brier': array([198.97975105])},
  'single_distribution_summary': {'pit': Ensemble(the_class=quant,shape=(1, 90))}},
 (5, 10): {'output': {},
  'summary': {'cdeloss': array([-0.55763654]), 'brier': array([156.11697065])},
  'single_distribution_summary': {'pit': Ensemble(the_class=quant,shape=(1, 92))}},
 (1, 9): {'output': {},
  'summary': {'cdeloss': array([-0.2

dict_keys([(2, 6), (3, 7), (2, 8), (4, 9), (5, 10), (1, 9), (2, 9), (3, 10)])

### Large datasets

If the code is slow because you're using extremely large datasets, or you're running into memory issues for the same reason, then we suggest using a pipeline. Pipelines are ideal for large datasets, as the code will chunk up large files and iterate through them as needed. 
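
The idea of chunking can be sketched in a few lines of plain Python. This is a generic illustration of iterating over a large input in fixed-size pieces, not RAIL's pipeline code; `iter_chunks` and `chunked_mean` are hypothetical helpers written for this example.

```python
def iter_chunks(n_rows, chunk_size):
    """Yield (start, end) index pairs that cover n_rows in pieces of at most chunk_size."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


def chunked_mean(values, chunk_size):
    """Compute a mean while only ever looking at one chunk of the data at a time."""
    total, count = 0.0, 0
    for start, end in iter_chunks(len(values), chunk_size):
        chunk = values[start:end]
        total += sum(chunk)
        count += len(chunk)
    return total / count


for start, end in iter_chunks(10, 4):
    print(f"processing rows {start} - {end}")

print(chunked_mean([1, 2, 3, 4, 5], chunk_size=2))
```

Because only running totals are kept between chunks, memory use depends on the chunk size rather than on the size of the full data set.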