# Data, files, IO, and RAIL

The switch to a `ceci`-based backend has made data access and IO in RAIL more complex. This notebook demonstrates several ways that users can interact with data in RAIL.<br>

In addition to the main RAIL code, we have developed a companion package, `tables_io`, [available on GitHub](https://github.com/LSSTDESC/tables_io/). <br>

`tables_io` aims to simplify IO for reading/writing to some of the most common file formats used within DESC: HDF5 files, parquet files, Astropy tables, and `qp` ensembles.  There are several examples of `tables_io` usage in the [nb directory](https://github.com/LSSTDESC/tables_io/tree/main/nb) of the `tables_io` repository, and we will also demonstrate its usage in several places in this notebook.

Another concept that comes up when using the `ceci`-based RAIL in a Jupyter notebook is the pair of DataStore and DataHandle file specifications (see [RAIL/rail/core/data.py](https://github.com/LSSTDESC/RAIL/blob/main/rail/core/data.py) for the actual code implementing these).  `ceci` requires that each pipeline stage have defined `input` and `output` files, and is primarily geared toward pipelines rather than interactive runs in a Jupyter notebook.  The DataStore enables interactive use of files in Jupyter.  We will demonstrate some useful features of the DataStore below.

Let's start out with some imports:

In [None]:
import os
import tables_io
import rail
import qp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

First, let's use tables_io to read in some example data.  Two example files ship with RAIL, each containing a small amount of cosmoDC2 data from healpix pixel `9816`: one for "training" and one for "validation".  They are located in the `RAIL/tests/data/` directory of the RAIL repository.  Let's read in one of those data files with tables_io:

(NOTE: for historical reasons, our example files store their data in HDF5 format with all of the data arrays in a single HDF5 group named "photometry".  We grab the data from that group by reading in the file and indexing the result with `["photometry"]` in the cell below.  We'll call our dataset `traindata_io` to indicate that we've read it in via tables_io, and to distinguish it from the data that we'll place in the DataStore in later steps.)

In [None]:
RAIL_DIR = os.path.dirname(rail.__file__)
trainFile = os.path.join(RAIL_DIR, '../tests/data/test_dc2_training_9816.hdf5')
testFile = os.path.join(RAIL_DIR, '../tests/data/test_dc2_validation_9816.hdf5')

traindata_io = tables_io.read(trainFile)["photometry"]

tables_io reads this data in as an ordered dictionary of numpy arrays by default, though it can be converted to other data formats, such as a pandas dataframe.  Let's print out the keys in the ordered dict showing the available columns, then convert the data to a pandas dataframe and look at a few of the columns as a demo:

In [None]:
traindata_io.keys()

In [None]:
traindata_pq = tables_io.convert(traindata_io, tables_io.types.PD_DATAFRAME)

In [None]:
traindata_pq.head()
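
To make the data layout concrete, here is a minimal sketch of a "dictionary of numpy arrays" nested inside a group, built with plain numpy rather than tables_io. The column names and values are made up for illustration and are not read from the RAIL example file.

```python
from collections import OrderedDict

import numpy as np

# Hypothetical stand-in for what reading the example file returns:
# one top-level group, "photometry", holding equal-length column arrays.
# Column names and values here are made up for illustration.
rng = np.random.default_rng(0)
n_rows = 5
file_contents = OrderedDict(
    photometry=OrderedDict(
        id=np.arange(1000, 1000 + n_rows),
        redshift=rng.uniform(0.0, 3.0, n_rows),
        mag_i_lsst=rng.uniform(18.0, 25.0, n_rows),
    )
)

# Select the group, as done earlier with ["photometry"].
table = file_contents["photometry"]
column_names = list(table.keys())

# A column-oriented table is only consistent if every column has the same length.
lengths = {name: len(col) for name, col in table.items()}
assert len(set(lengths.values())) == 1

# Row-oriented view of the first two rows, similar in spirit to DataFrame.head().
first_rows = [{name: table[name][i] for name in column_names} for i in range(2)]
print(column_names)
print(first_rows[0]["id"])
```

Converting to a dataframe amounts to presenting these same columns row by row, which is why the column names from `keys()` reappear as the dataframe's column headers.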

Next, let's set up the DataStore so that our RAIL module knows where to fetch data.  We will set `allow_overwrite` so that we can overwrite data files without raising errors while working in our Jupyter notebook:

In [None]:
#import RailStage stuff
from rail.core.data import TableHandle
from rail.core.stage import RailStage

In [None]:
DS = RailStage.data_store
DS.__class__.allow_overwrite = True

We need to add our data to the DataStore.  We can add previously read data, like our `traindata_io`, or add data to the DataStore directly via the `DS.read_file` method, which we will do with our "test data".

In [None]:
#add data that is already read in
train_data = DS.add_data("train_data", traindata_io, TableHandle )

In [None]:
#add test data directly to datastore from file:
test_data = DS.read_file("test_data", TableHandle, testFile)

Let's list the data available to us in the DataStore:

In [None]:
DS

Note that the DataStore is just a dictionary of the files.  Each Handle object contains the actual data, which is accessible via the `.data` property for that file.  While the DataStore is not particularly designed for it, you can manipulate the data via these dictionaries, which is handy for on-the-fly exploration in notebooks.<br>
For example, say we want to add an additional column to the train_data, "FakeID", with a simpler identifier than the long ObjID contained in the `id` column:

In [None]:
train_data.data.keys()
numgals = len(train_data.data['id'])
train_data.data['FakeID'] = np.arange(numgals)

Let's convert our train_data to a pandas dataframe with tables_io, and our new "FakeID" column should now be present:

In [None]:
train_table = tables_io.convertObj(train_data.data, tables_io.types.PD_DATAFRAME)
train_table.head()

And there it is, success!
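
The handle-and-store pattern used in this section can be pictured with a small sketch in plain Python. `MiniHandle` and `MiniStore` are hypothetical names invented here; this is not RAIL's implementation (see `rail/core/data.py` for that), only an illustration of a dictionary of handles, each exposing its contents through `.data`, with a class-level overwrite switch.

```python
import numpy as np


class MiniHandle:
    """Hypothetical stand-in for a data handle: a tag, the data, and an optional path."""

    def __init__(self, tag, data=None, path=None):
        self.tag = tag
        self.data = data
        self.path = path


class MiniStore(dict):
    """Hypothetical stand-in for a data store: a dict mapping tags to handles."""

    allow_overwrite = False

    def add_data(self, tag, data, path=None):
        # Refuse to silently replace an existing entry unless overwriting is allowed.
        if tag in self and not self.allow_overwrite:
            raise ValueError(f"'{tag}' is already in the store")
        handle = MiniHandle(tag, data=data, path=path)
        self[tag] = handle
        return handle


store = MiniStore()
handle = store.add_data("train_data", {"id": np.array([101, 102, 103])})

# The handle returned by add_data and the one held by the store are the same object,
# so a column added through one is visible through the other.
handle.data["FakeID"] = np.arange(len(handle.data["id"]))
print(sorted(store["train_data"].data.keys()))

# Adding the same tag again fails unless overwriting is switched on at class level.
try:
    store.add_data("train_data", {})
    overwrite_blocked = False
except ValueError:
    overwrite_blocked = True

MiniStore.allow_overwrite = True
store.add_data("train_data", {"id": np.array([1])})
```

Because the store holds a reference to the handle rather than a copy, in-place edits such as the "FakeID" column need no extra step to be registered.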

# Using the data in a pipeline stage: photo-z estimation example

Now that we have our data in place, we can use it in a RAIL stage.  As an example, we'll estimate photo-z's for our data.  Let's train the `KNearNeighPDF` algorithm with our train_data, and then estimate photo-z's for the test_data.  We need to make a RAIL stage for each of these steps.  First, we train/inform our nearest neighbor algorithm with the train_data:

In [None]:
from rail.estimation.algos.knnpz import Train_KNearNeighPDF, KNearNeighPDF

In [None]:
inform_knn = Train_KNearNeighPDF.make_stage(name='inform_knn', input='train_data', 
                                            nondetect_val=99.0, model='knnpz.pkl',
                                            hdf5_groupname='')


In [None]:
inform_knn.inform(train_data)

Running the `inform` method on the training data has created the "knnpz.pkl" file, which contains our trained tree, along with the `sigma` bandwidth parameter and `numneigh` (the number of neighbors to use in the PDF estimation).  In the future, you could skip the `inform` stage and simply load this pkl file directly into the estimation stage to save time.
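
As a rough illustration of why a saved model file saves time, the sketch below writes a small dictionary of fitted parameters with Python's standard `pickle` module and reads it back. The dictionary contents and the file name `example_model.pkl` are hypothetical; the actual layout of "knnpz.pkl" is defined by RAIL and is not reproduced here.

```python
import os
import pickle
import tempfile

# Hypothetical model contents; the real knnpz.pkl layout is defined by RAIL
# and is not reproduced here.
model = {"sigma": 0.05, "numneigh": 7}

with tempfile.TemporaryDirectory() as tmpdir:
    model_path = os.path.join(tmpdir, "example_model.pkl")

    # Write the fitted parameters once...
    with open(model_path, "wb") as fout:
        pickle.dump(model, fout)

    # ...and reload them later without refitting.
    with open(model_path, "rb") as fin:
        reloaded = pickle.load(fin)

print(reloaded == model)
```

As a general precaution, only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.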

Now, let's stage and run the actual PDF estimation on the test data.  NOTE: we have set `hdf5_groupname` to "photometry", as the original data does have all of our needed photometry in a single HDF5 group named "photometry"!

In [None]:
estimate_knn = KNearNeighPDF.make_stage(name='estimate_knn', hdf5_groupname='photometry', nondetect_val=99.0,
                                        model='knnpz.pkl', output="KNNPZ_estimates.hdf5")

In [None]:
knn_estimated = estimate_knn.estimate(test_data)

In [None]:
knn_estimated.data.ancil

We have successfully estimated PDFs for the ~20,000 galaxies in the test file!  Note that the PDFs are in `qp` format!  Also note that they have been written to disk as "KNNPZ_estimates.hdf5"; however, they are also still available to us via the `knn_estimated` dataset in the DataStore.

We can do a quick plot to check our photo-z's. Our qp Ensemble is stored in `knn_estimated.data`, and the Ensemble can calculate the mode of each PDF if we give it a grid of redshift values to check, which we can plot against our true redshifts from the test data:

In [None]:
pzmodes = knn_estimated.data.mode(grid=np.linspace(0,3,301)).flatten()
true_zs = test_data.data['photometry']['redshift']
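
To make the grid-based mode concrete, here is a minimal sketch that uses plain numpy rather than `qp`: it evaluates a few made-up Gaussian PDFs on the same 301-point grid and picks the grid point where each one is largest. The peak positions and widths are illustrative values, not RAIL output.

```python
import numpy as np

# Three made-up Gaussian PDFs with known peak locations (illustrative values only).
true_peaks = np.array([0.303, 1.257, 2.001])
widths = np.array([0.05, 0.10, 0.20])

# Same grid as in the cell above: 301 points from 0 to 3, i.e. a spacing of 0.01.
grid = np.linspace(0, 3, 301)

# Evaluate every PDF on the grid; the result has shape (n_pdfs, n_grid).
z = (grid[np.newaxis, :] - true_peaks[:, np.newaxis]) / widths[:, np.newaxis]
pdf_vals = np.exp(-0.5 * z**2) / (widths[:, np.newaxis] * np.sqrt(2 * np.pi))

# The grid mode is the grid point where each PDF is largest.
grid_modes = grid[np.argmax(pdf_vals, axis=1)]

# For these symmetric single-peaked PDFs the grid mode is the grid point nearest
# the true peak, so the error is at most half a grid step (0.005 here).
max_error = np.max(np.abs(grid_modes - true_peaks))
print(grid_modes, max_error)
```

The grid spacing therefore sets the resolution of the point estimate: a finer grid gives a more precise mode at the cost of more PDF evaluations.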

In [None]:
plt.figure(figsize=(8,8))
plt.scatter(true_zs, pzmodes, label='photoz mode for KNearNeigh')
plt.xlabel("redshift", fontsize=15)
plt.ylabel("photoz mode", fontsize=15)
plt.legend(loc='upper center', fontsize=12)

As an alternative, we can read the data from file and do the same operations:

In [None]:
newens = qp.read("KNNPZ_estimates.hdf5")
# flatten to a 1D array, matching how pzmodes was computed above
newpzmodes = newens.mode(grid=np.linspace(0,3,301)).flatten()

In [None]:
plt.figure(figsize=(8,8))
plt.scatter(true_zs, newpzmodes, label='photoz mode for KNearNeigh')
plt.xlabel("redshift", fontsize=15)
plt.ylabel("photoz mode", fontsize=15)
plt.legend(loc='upper center', fontsize=12)

In [None]:
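xgrid = np.linspace(0,3,101)
# index of the grid point where each PDF peaks, computed one object at a time;
# newens.npdf is the number of PDFs in the ensemble (replaces a hard-coded 20449)
newensmodeidx = [np.argmax(newens[ii].pdf(xgrid)) for ii in range(newens.npdf)]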

In [None]:
plt.figure(figsize=(8,8))
plt.scatter(true_zs,xgrid[newensmodeidx])

In [None]:
newens.objdata()

In [None]:
newens.objdata()['weights']

In [None]:
weights = newens.objdata()['weights']
stds = newens.objdata()['stds']
means = newens.objdata()['means']

In [None]:
xens = qp.Ensemble(qp.mixmod, data=dict(means=means, stds=stds, weights=weights))

In [None]:
xens[215].plot_native(xlim=(0,3))
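
The ensemble above is rebuilt from three arrays: `means`, `stds`, and `weights`. As a sketch of what such a parameterization encodes, the code below evaluates weighted sums of Gaussians directly with numpy. It assumes arrays shaped `(n_objects, n_components)` and uses made-up values; it is an illustration of a Gaussian mixture, not `qp`'s implementation.

```python
import numpy as np

# Made-up mixture parameters for two objects with three components each.
# Shapes are (n_objects, n_components), chosen for illustration.
means = np.array([[0.4, 0.5, 0.9], [1.0, 1.1, 1.8]])
stds = np.full_like(means, 0.05)
weights = np.array([[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]])

grid = np.linspace(0, 3, 3001)


def mixture_pdf(grid, means, stds, weights):
    """Evaluate weighted sums of Gaussians on a grid; returns shape (n_objects, n_grid)."""
    z = (grid[None, None, :] - means[:, :, None]) / stds[:, :, None]
    components = np.exp(-0.5 * z**2) / (stds[:, :, None] * np.sqrt(2 * np.pi))
    return np.sum(weights[:, :, None] * components, axis=1)


pdfs = mixture_pdf(grid, means, stds, weights)

# With weights summing to one per object, each PDF integrates to about one.
step = grid[1] - grid[0]
areas = pdfs.sum(axis=1) * step
print(areas)
```

Storing only the component parameters is compact: each PDF is described by a handful of numbers per component rather than by its values on a full grid.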

In [None]:
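xgrid = np.linspace(0,3,101)
# same per-object grid mode as before, now for the rebuilt ensemble;
# xens.npdf is the number of PDFs in the ensemble (replaces a hard-coded 20449)
modes = [np.argmax(xens[ii].pdf(xgrid)) for ii in range(xens.npdf)]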

In [None]:
len(modes)

In [None]:
plt.figure(figsize=(8,8))
plt.scatter(true_zs, xgrid[modes],s=1)
plt.plot([0,3],[0,3],'k',lw=2)