# Reconstruction Storage (HDF5 File Making) 

In this notebook, we will:
 * Describe how to generate **`.h5`** format files to store reconstruction outputs. 

In [1]:
# Basic boilerplate imports

import numpy as np
import pandas as pd
import yaml, os, sys, re

# Visualization Tools
import plotly
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import matplotlib.pyplot as plt
import seaborn as sns

# Set lartpc_mlreco3d path
LARTPC_MLRECO_PATH = "/sdf/group/neutrino/koh0207/lartpc_mlreco3d/"
sys.path.append(LARTPC_MLRECO_PATH)
from mlreco.main_funcs import process_config, cycle
from mlreco.iotools.factories import loader_factory
from mlreco.trainval import trainval

Welcome to JupyROOT 6.22/08


In [2]:
# Load configuration file of the ML chain
config_path = './hdf5_example.cfg'
with open(config_path, 'r') as f:
    config = yaml.load(f, Loader=yaml.Loader)
process_config(config, verbose=False)


Config processed at: Linux sdfampere001 4.18.0-372.32.1.el8_6.x86_64 #1 SMP Fri Oct 7 12:35:10 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

$CUDA_VISIBLE_DEVICES="0"



In [3]:
pwd

'/sdf/home/k/koh0207/notebooks/workspace/reco_and_visualization'

In [4]:
# Set event_list if necessary
event_list = None

## 1. Storing reconstruction outputs to HDF5.

Once our ML model is fully trained and deployed, we can set it to test mode and use its predictions: track vs. shower separation, particle clustering, and PID, to name a few. However, as long as we keep using the same saved model parameters, there is no need to repeatedly run the network's forward pass to obtain the same predicted labels. This is especially true because most analysis-level tasks use the same ML predictions with different post-processing algorithms for calorimetry, selection, and tagging. We therefore save all reconstruction-related ML model outputs to the [Hierarchical Data Format (HDF5)](https://www.hdfgroup.org/solutions/hdf5/). This serves to:
 * Avoid unnecessary usage of GPU resources
 * Shorten the test-debug-rerun process of doing physics analysis with `lartpc_mlreco3d`
 
We first illustrate how to run the ML chain and save its outputs to HDF5 (`.h5`) file format. 

In [5]:
loader = loader_factory(config, event_list=event_list)
dataset = iter(cycle(loader))
Trainer = trainval(config)

# Please check the "Done." message that ensures the weights are loaded properly!
loaded_iteration = Trainer.initialize()

Loading file: /sdf/data/neutrino/icarus/mpvmpr/test/all_0.root
Loading tree sparse3d_reco_cryoE
Loading tree sparse3d_reco_cryoE_chi2
Loading tree sparse3d_reco_cryoE_hit_charge0
Loading tree sparse3d_reco_cryoE_hit_charge1
Loading tree sparse3d_reco_cryoE_hit_charge2
Loading tree sparse3d_reco_cryoE_hit_key0
Loading tree sparse3d_reco_cryoE_hit_key1
Loading tree sparse3d_reco_cryoE_hit_key2
Loading tree sparse3d_pcluster_semantics_ghost
Loading tree sparse3d_pcluster
Loading tree particle_corrected
Loading tree cluster3d_pcluster
Loading tree particle_pcluster
Loading tree particle_mpv
Loading tree sparse3d_pcluster_semantics
Loading tree cluster3d_sed
Loading tree neutrino_mpv
Found 10475 events in file(s)
Shower GNN: True
Track GNN: True
Particle GNN: False
Interaction GNN: True
Kinematics GNN: False
Cosmic GNN: False

            Since one of the GNNs are turned on, process_fragments is turned ON.
            

        Fragment processing is turned ON. When training CNN models from

### 1-1. Data and Result dictionary format

The following line runs one forward pass of the ML chain. It returns two outputs:
 * `data`: Python dictionary containing the inputs to the network (3D space points and deposition values) and any truth information used for labels. It also includes metadata such as the image index number and the pixel-to-cm conversion factor.
 * `result`: Python dictionary storing the outputs of the ML chain. Any quantity that involves some amount of reconstruction is stored in the `result` dictionary.
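
To get a feel for what these dictionaries hold, it can help to list each key together with the type and size of its value. The helper below is a minimal sketch: `summarize_dict` is a hypothetical name (not part of `lartpc_mlreco3d`), and it assumes only that values are array-like objects exposing `.shape`, plain sequences, or scalars. The toy dictionary is a stand-in for illustration, not real chain output.

```python
def summarize_dict(d, max_keys=None):
    """Return one line per key describing the type and size of its value.

    Hypothetical helper, not part of lartpc_mlreco3d.
    """
    lines = []
    for i, (key, value) in enumerate(d.items()):
        if max_keys is not None and i >= max_keys:
            lines.append(f"... ({len(d) - max_keys} more keys)")
            break
        shape = getattr(value, "shape", None)
        if shape is not None:
            # Array-like objects (e.g. NumPy arrays) report their shape
            size = f"shape={tuple(shape)}"
        elif hasattr(value, "__len__") and not isinstance(value, (str, bytes)):
            # Lists, tuples, dicts and similar containers report their length
            size = f"len={len(value)}"
        else:
            size = "scalar"
        lines.append(f"{key}: {type(value).__name__}, {size}")
    return lines


# Toy stand-in for the real dictionaries, for illustration only
toy = {"index": [0], "segmentation": [[0.1, 0.9]], "note": "toy"}
print("\n".join(summarize_dict(toy)))
```

In the notebook, `print("\n".join(summarize_dict(result, max_keys=10)))` would list the first ten entries of `result`.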

In [6]:
data, result = Trainer.forward(dataset)

Deghosting Accuracy: 0.9879
Segmentation Accuracy: 0.9949
PPN Accuracy: 0.9759
Clustering Accuracy: 0.9761
Clustering Edge Accuracy: 0.6996
Shower fragment clustering accuracy: 0.9696
Shower primary prediction accuracy: 1.0000
Track fragment clustering accuracy: 1.0000
Interaction grouping accuracy: 0.8547
Particle ID accuracy: 0.9706
Primary particle score accuracy: 0.9706


In [7]:
print('Length of Data Dictionary =', len(data))
print('Length of Result Dictionary =', len(result))

Length of Data Dictionary = 11
Length of Result Dictionary = 135


In [8]:
pwd

'/sdf/home/k/koh0207/notebooks/workspace/reco_and_visualization'

### 1-2. Saving reconstruction outputs to HDF5

As shown above, the `result` dictionary contains a lot of information (135 entries here), only a subset of which is necessary for analysis. Most key-value pairs of the `result` dictionary are intermediate outputs from each stage of the reco chain. These are necessary for the chain to operate and for debugging intermediate stages separately.

Hence, we save only a subset of the `data` and `result` dictionaries to HDF5 by selecting the relevant key names. We can do this under the `iotool.writer` section of the chain configuration file:
```yaml
# IO configuration
iotool:
  batch_size: 1
  shuffle: False
  num_workers: 4
  #collate_fn: CollateSparse
  collate:
    collate_fn: CollateSparse
  ...
  ...
  ...
  writer:
    name: HDF5Writer
    file_name: $PATH/output.h5
    input_keys:
      - index
      - meta
      - cluster_label
      - particles_asis
      - sed
    result_keys:
      - input_rescaled
      - cluster_label_adapted
      - segmentation
      - ppn_points
      - ppn_coords
      - ppn_masks
      - ppn_classify_endpoints
      - fragment_clusts
      - fragment_seg
      - shower_fragment_node_pred
      - shower_fragment_start_points
      - track_fragment_start_points
      - track_fragment_end_points
      - particle_clusts
      - particle_seg
      - particle_start_points
      - particle_end_points
      - particle_group_pred
      - particle_node_pred_type
      - particle_node_pred_vtx
```

 * `name`: type of writer used to store model outputs. Currently the only option is `HDF5Writer`.
 * `file_name`: full path to output `.h5` file.
 * `input_keys`: keys in the `data` dictionary that will be stored to `output.h5`.
 * `result_keys`: keys in the `result` dictionary that will be stored to `output.h5`. 
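
Because the key lists are typed by hand, a quick check that every requested key exists in the dictionaries from one forward pass can catch typos before launching a long job. The following is a minimal sketch: `find_missing_keys` is a hypothetical helper (not part of `lartpc_mlreco3d`), and how `HDF5Writer` itself reacts to a missing key is not assumed here. The toy dictionaries are placeholders for illustration.

```python
def find_missing_keys(writer_cfg, data, result):
    """Return requested keys that are absent from `data` or `result`.

    `writer_cfg` is the `writer` block of the configuration, as a dict.
    Hypothetical helper, not part of lartpc_mlreco3d.
    """
    missing_inputs = [k for k in writer_cfg.get("input_keys", []) if k not in data]
    missing_results = [k for k in writer_cfg.get("result_keys", []) if k not in result]
    return missing_inputs, missing_results


# Toy example: one requested result key is not present in the result dictionary
writer_cfg = {
    "name": "HDF5Writer",
    "input_keys": ["index", "meta"],
    "result_keys": ["segmentation", "particle_clusts"],
}
toy_data = {"index": [0], "meta": [None]}
toy_result = {"segmentation": [None]}
print(find_missing_keys(writer_cfg, toy_data, toy_result))
```

With the notebook's objects, the call would look like `find_missing_keys(config['iotool']['writer'], data, result)`, assuming the loaded configuration still keeps the `writer` block under `iotool` after processing.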
 
> **Warning**
> The keys needed for analysis depend on the data you are working with. The above example was taken from an MPVMPR validation configuration file.

After adding the `writer` field to your chain config, you can launch an iteration loop to generate your HDF5 file:
```bash
python3 $PATH_TO_LARTPC_MLRECO3D/bin/run.py $PATH_TO_CHAIN_CONFIG
```


### 1-3. Reading HDF5 files from the notebook.

To read HDF5 files directly from the notebook, we will use the `HDF5Reader` class, for example:
```python
from mlreco.iotools.readers import HDF5Reader
reader = HDF5Reader('output.h5')
```
Although HDF5 file reading is handled separately under analysis tools (which we will explain shortly afterwards), it is still good practice to check if the appropriate quantities are being stored properly by opening the `.h5` files directly. 
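
Independently of `HDF5Reader`, the generic `h5py` package can list what a file contains. The sketch below assumes `h5py` is installed; `list_h5_contents` is a hypothetical helper, and nothing is assumed about the internal layout that `HDF5Writer` produces, since the helper simply walks whatever groups and datasets exist.

```python
def list_h5_contents(path):
    """Return (name, kind, shape) for every group and dataset in an HDF5 file.

    Hypothetical helper, not part of lartpc_mlreco3d.
    """
    import h5py  # assumed to be installed; imported here so the helper stays optional

    contents = []

    def visitor(name, obj):
        # Called once per object in the file; returning None continues the walk
        if isinstance(obj, h5py.Dataset):
            contents.append((name, "dataset", obj.shape))
        else:
            contents.append((name, "group", None))

    with h5py.File(path, "r") as f:
        f.visititems(visitor)
    return contents
```

For the file written above, `for name, kind, shape in list_h5_contents('output.h5'): print(name, kind, shape)` would print one line per stored object, which makes it easy to confirm that the requested keys were written.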

In [9]:
from mlreco.iotools.readers import HDF5Reader

In [10]:
reader = HDF5Reader('./output.h5')

Registered ./output.h5


The first index of the reader is the batch index. Since we generated 10 events, it should range from 0 to 9. The second index selects between the `data` (index 0) and `result` (index 1) dictionaries.

In [11]:
data, result = reader[0][0], reader[0][1]
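
The same indexing convention extends to a loop over all stored entries. The generator below is a minimal sketch: `iter_entries` is a hypothetical helper, and it assumes only that `reader[i]` returns a pair whose first element is the `data` dictionary and whose second is the `result` dictionary. A plain list stands in for the reader so the example runs on its own.

```python
def iter_entries(reader, num_entries):
    """Yield (entry_index, data, result) for the first `num_entries` entries.

    Hypothetical helper, not part of lartpc_mlreco3d.
    """
    for i in range(num_entries):
        entry = reader[i]  # index the reader once per entry
        yield i, entry[0], entry[1]


# Toy stand-in for an HDF5Reader holding two entries
toy_reader = [
    ({"index": 0}, {"segmentation": "a"}),
    ({"index": 1}, {"segmentation": "b"}),
]
for i, entry_data, entry_result in iter_entries(toy_reader, len(toy_reader)):
    print(i, sorted(entry_data), sorted(entry_result))
```

For the file read above, `iter_entries(reader, 10)` would walk the ten stored events. Whether `HDF5Reader` supports `len()` is not assumed here, hence the explicit count.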

While most data structures are stored as is, some of them are saved as "blueprints" to reduce the overall size of the h5 file. 