# FlexZBoost PDF Representation Comparison

**Author:** Drew Oldag

**Last Run Successfully:** September 26, 2023

This notebook gives a quick comparison of the storage requirements for FlexCode output under two different storage techniques. We'll compare `qp.interp` (x,y interpolated) output against the native parameterization of `qp_flexzboost`.

In [None]:
import os
import matplotlib.pyplot as plt
import numpy as np

import qp

from rail.core.data import TableHandle
from rail.core.stage import RailStage

%matplotlib inline 

In [None]:
DS = RailStage.data_store
DS.__class__.allow_overwrite = True

Create references to the training and test data.

In [None]:
from rail.core.utils import find_rail_file
trainFile = find_rail_file('examples_data/testdata/test_dc2_training_9816.hdf5')
testFile = find_rail_file('examples_data/testdata/test_dc2_validation_9816.hdf5')
training_data = DS.read_file("training_data", TableHandle, trainFile)
test_data = DS.read_file("test_data", TableHandle, testFile)

Define the configuration for the ML model to be trained by FlexCode. Specifically, we'll use XGBoost with a set of 35 cosine basis functions.

In [None]:
fz_dict = dict(zmin=0.0, zmax=3.0, nzbins=301,
               trainfrac=0.75, bumpmin=0.02, bumpmax=0.35,
               nbump=20, sharpmin=0.7, sharpmax=2.1, nsharp=15,
               max_basis=35, basis_system='cosine',
               hdf5_groupname='photometry',
               regression_params={'max_depth': 8,'objective':'reg:squarederror'})

fz_modelfile = 'demo_FZB_model.pkl'
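
To make the role of the basis functions concrete, here is a minimal sketch of how a density can be written as a weighted sum of cosine basis functions. It assumes the standard orthonormal cosine basis on the unit interval (a constant term plus `sqrt(2) * cos(pi * j * z)`), and the weights are made up for illustration. It is not the FlexCode implementation, and it omits the bump-removal and sharpening steps that the `bump*` and `sharp*` parameters above refer to.

```python
import math

def cosine_basis(z, n_basis):
    """Evaluate the first n_basis orthonormal cosine functions at z in [0, 1]."""
    return [1.0 if j == 0 else math.sqrt(2.0) * math.cos(math.pi * j * z)
            for j in range(n_basis)]

def density_from_weights(weights, z_scaled):
    """Weighted sum of basis functions evaluated at a rescaled redshift."""
    basis = cosine_basis(z_scaled, len(weights))
    return sum(w * b for w, b in zip(weights, basis))

# Hypothetical weights for a single galaxy (illustration only).
# With max_basis=35 a real model would supply up to 35 of them.
example_weights = [1.0, -0.6, 0.2]

zmin, zmax = 0.0, 3.0
z_grid = [zmin + (zmax - zmin) * i / 300 for i in range(301)]

# Rescale z to [0, 1] for the basis, then divide by the interval length
# so that the result is a density in z rather than in the rescaled variable.
pdf_on_grid = [
    density_from_weights(example_weights, (z - zmin) / (zmax - zmin)) / (zmax - zmin)
    for z in z_grid
]

print("values stored per galaxy as weights:", len(example_weights))
print("values stored per galaxy on the grid:", len(pdf_on_grid))
```

The point of the sketch is the last two lines: the same curve can be kept either as a handful of weights or as one value per grid point.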

Define the RAIL stage to train the model.

In [None]:
from rail.estimation.algos.flexzboost import FlexZBoostInformer, FlexZBoostEstimator
inform_pzflex = FlexZBoostInformer.make_stage(name='inform_fzboost', model=fz_modelfile, **fz_dict)

Then we'll run that stage to train the model and store the result in a file named `demo_FZB_model.pkl`.

In [None]:
%%time
inform_pzflex.inform(training_data)

Now we configure the RAIL stage that will evaluate test data using the saved model.
Note that we specify `qp_representation='flexzboost'` here to instruct `rail_flexzboost` to store the model weights using `qp_flexzboost`.

In [None]:
pzflex_qp_flexzboost = FlexZBoostEstimator.make_stage(name='fzboost_flexzboost', hdf5_groupname='photometry',
                            model=inform_pzflex.get_handle('model'),
                            output='flexzboost.hdf5',
                            qp_representation='flexzboost')

Now we actually evaluate the test data, 20,449 example galaxies, using the trained model, and then print out the size of the file that was saved. 

Note that the final output size will depend on the number of basis functions used by the model. Again, for this experiment, we used 35 basis functions.

In [None]:
%%time
output_file_name = './flexzboost.hdf5'
try:
    os.unlink(output_file_name)
except FileNotFoundError:
    pass

fzresults_qp_flexzboost = pzflex_qp_flexzboost.estimate(test_data)
file_size = os.path.getsize(output_file_name)
print("File Size is :", file_size, "bytes")

Example calculating median and mode. Both cells below use the `%%time` magic command, which times a single run. This matters for `mode` because `qp` will cache the output of the `pdf` function for a given grid: if we used `%%timeit`, the resulting estimate would average the run time of one non-cached calculation and N-1 cached calculations.

In [None]:
zgrid = np.linspace(0, 3., 301)

In [None]:
%%time
fz_medians_qp_flexzboost = fzresults_qp_flexzboost().median()

In [None]:
%%time
fz_modes_qp_flexzboost = fzresults_qp_flexzboost().mode(grid=zgrid)
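
The effect of caching on repeated timings can be illustrated with a small stand-alone sketch. It uses `functools.lru_cache` and an artificial `time.sleep` as stand-ins; `slow_pdf` is a hypothetical function, and `qp`'s own caching mechanism is not assumed to work this way internally.

```python
import functools
import time

call_count = {"uncached": 0}

@functools.lru_cache(maxsize=None)
def slow_pdf(grid):
    """Stand-in for an expensive PDF evaluation; grid must be hashable (a tuple)."""
    call_count["uncached"] += 1
    time.sleep(0.05)  # pretend the evaluation is expensive
    return tuple(x * x for x in grid)

grid = tuple(i / 100 for i in range(301))

# Time five consecutive calls on the same grid, as a repeated-run timer would.
timings = []
for _ in range(5):
    start = time.perf_counter()
    slow_pdf(grid)
    timings.append(time.perf_counter() - start)

print(f"first call : {timings[0]:.4f} s")
print(f"mean of 5  : {sum(timings) / len(timings):.4f} s")
print("expensive evaluations actually performed:", call_count["uncached"])
```

Only the first call pays the full cost, so the mean over repeated runs understates the time of a fresh calculation.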

Plotting median values.

In [None]:
fz_medians_qp_flexzboost = fzresults_qp_flexzboost().median()

plt.hist(fz_medians_qp_flexzboost, bins=np.linspace(-.005,3.005,101));
plt.xlabel("redshift")
plt.ylabel("Number")

Example conversion to a `qp.hist` histogram representation.

In [None]:
%%timeit
bins = np.linspace(-5, 5, 11)
fzresults_qp_flexzboost().convert_to(qp.hist_gen, bins=bins)
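
As a rough picture of what a conversion to a histogram involves, the sketch below integrates a toy PDF over each bin to get per-bin probability masses. The helper `hist_from_pdf` and the uniform toy PDF are hypothetical, and the midpoint rule is an assumption made for simplicity. The method `qp` actually uses in `convert_to` may differ.

```python
def hist_from_pdf(pdf, edges, samples_per_bin=200):
    """Approximate the probability mass in each bin with the midpoint rule."""
    masses = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        width = (hi - lo) / samples_per_bin
        total = sum(pdf(lo + (k + 0.5) * width) for k in range(samples_per_bin))
        masses.append(total * width)
    return masses

def uniform_pdf(z):
    """Toy PDF: uniform on [0, 3), zero elsewhere."""
    return 1.0 / 3.0 if 0.0 <= z < 3.0 else 0.0

# Same bin edges as np.linspace(-5, 5, 11): ten unit-width bins.
edges = [-5.0 + i for i in range(11)]
masses = hist_from_pdf(uniform_pdf, edges)

for lo, hi, mass in zip(edges[:-1], edges[1:], masses):
    print(f"[{lo:+.0f}, {hi:+.0f}) : {mass:.3f}")
```

With these edges, any PDF supported on the 0 to 3 redshift range puts all of its mass into three of the ten bins.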

Now we'll repeat the experiment using `qp.interp` storage. Again, we'll define the RAIL stage to evaluate the test data using the saved model, but instruct `rail_flexzboost` to store the output as x,y interpolated values using `qp.interp`.

In [None]:
pzflex_qp_interp = FlexZBoostEstimator.make_stage(name='fzboost_interp', hdf5_groupname='photometry',
                            model=inform_pzflex.get_handle('model'),
                            output='interp.hdf5',
                            qp_representation='interp',
                            calculated_point_estimates=[])

Finally we evaluate the test data again using the trained model, and then print out the size of the file that was saved using the x,y interpolated technique.

The final file size will depend on the number of points in the x grid that defines the interpolation. To match the storage requirements of `qp_flexzboost`, the x grid would need to contain fewer points than the number of basis functions used by the model. For this experiment, we used 35 basis functions.

In [None]:
%%time
output_file_name = './interp.hdf5'
try:
    os.unlink(output_file_name)
except FileNotFoundError:
    pass

fzresults_qp_interp = pzflex_qp_interp.estimate(test_data)
file_size = os.path.getsize(output_file_name)
print("File Size is :", file_size, "bytes")
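
A back-of-the-envelope estimate shows why the two files differ in size. The sketch assumes 8-byte floating point values, assumes the interpolation grid has the 301 points given by `nzbins`, assumes the x grid is stored once and shared by all galaxies, and ignores HDF5 overhead and any metadata. The file sizes printed above are the authoritative numbers.

```python
n_galaxies = 20_449   # test-set size quoted earlier in this notebook
n_basis = 35          # max_basis in fz_dict
n_grid = 301          # nzbins in fz_dict; assumed to be the interp x-grid size
bytes_per_value = 8   # assumption: values are stored as float64

# One weight per basis function per galaxy.
flexzboost_bytes = n_galaxies * n_basis * bytes_per_value

# One y value per grid point per galaxy, plus a single shared x grid.
interp_bytes = (n_galaxies * n_grid + n_grid) * bytes_per_value

print(f"basis weights  : {flexzboost_bytes:,} bytes")
print(f"interp y values: {interp_bytes:,} bytes")
print(f"ratio          : {interp_bytes / flexzboost_bytes:.1f}x")
```

Under these assumptions the ratio is set almost entirely by `n_grid / n_basis`, which is why the grid would have to shrink to roughly the number of basis functions for the two representations to take similar space.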

Example calculating median and mode. Note that we're using the `%%timeit` magic command to get an estimate of the time required for calculating `median`, but we're using `%%time` to estimate the `mode`. This is because `qp` will cache the output of the `pdf` function for a given grid. If we used `%%timeit`, then the resulting estimate would average the run time of one non-cached calculation and N-1 cached calculations.

In [None]:
zgrid = np.linspace(0, 3., 301)

In [None]:
%%timeit
fz_medians_qp_interp = fzresults_qp_interp().median()

In [None]:
%%time
fz_modes_qp_interp = fzresults_qp_interp().mode(grid=zgrid)
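
For an x,y interpolated PDF, the two point estimates have a simple interpretation: the median is where the cumulative distribution reaches one half, and the mode is the grid point with the largest PDF value. The sketch below shows this on a toy triangular PDF. The helpers `median_from_grid` and `mode_from_grid` are hypothetical and only illustrate the idea. They are not how `qp` implements `median()` or `mode()`.

```python
def median_from_grid(x, y):
    """Return the point where the trapezoid-rule CDF of (x, y) reaches half its total."""
    cdf = [0.0]
    for i in range(1, len(x)):
        cdf.append(cdf[-1] + 0.5 * (y[i] + y[i - 1]) * (x[i] - x[i - 1]))
    half = 0.5 * cdf[-1]
    for i in range(1, len(x)):
        if cdf[i] >= half:
            # Linearly interpolate the CDF inside the interval where it crosses.
            span = cdf[i] - cdf[i - 1]
            frac = (half - cdf[i - 1]) / span if span > 0 else 0.0
            return x[i - 1] + frac * (x[i] - x[i - 1])
    return x[-1]

def mode_from_grid(x, y):
    """Return the grid point with the largest PDF value."""
    return x[max(range(len(x)), key=lambda i: y[i])]

# Toy example: a triangular PDF peaked at z = 1 on a 301-point grid over [0, 3].
x = [3.0 * i / 300 for i in range(301)]
y = [max(0.0, 1.0 - abs(xi - 1.0)) for xi in x]

toy_median = median_from_grid(x, y)
toy_mode = mode_from_grid(x, y)
print("median:", round(toy_median, 6))
print("mode  :", toy_mode)
```

The mode depends on the grid it is evaluated on, which is why a `grid` argument is passed to `mode` in the cell above.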

Plotting median values.

In [None]:
fz_medians_qp_interp = fzresults_qp_interp().median()
plt.hist(fz_medians_qp_interp, bins=np.linspace(-.005,3.005,101));
plt.xlabel("redshift")
plt.ylabel("Number")

Example conversion to a `qp.hist` histogram representation.

In [None]:
%%timeit
bins = np.linspace(-5, 5, 11)
fzresults_qp_interp().convert_to(qp.hist_gen, bins=bins)

We'll clean up the files that were produced: the model pickle file and the two output data files.

In [None]:
# Files produced by this notebook: the trained model and the two outputs.
files_to_remove = ['demo_FZB_model.pkl', './flexzboost.hdf5', './interp.hdf5']

for file_name in files_to_remove:
    try:
        os.unlink(file_name)
    except FileNotFoundError:
        # Nothing to clean up if the file was never written.
        pass