# Full chain inference

In this notebook, we will:
 * Take a look at what a complete ML reconstruction configuration looks like
 * Understand the high-level representations built out of the raw ML output
 * Understand how to visualize the data representations
 * Learn how to store the output of the ML reconstruction to file
 * Learn how to load an HDF5 file back into a dictionary

***
***
## 1. Full Chain Inference Configuration

We start by pointing the python path to the reco chain package:

In [None]:
import sys

SOFTWARE_DIR = '/global/cfs/cdirs/m5252/software/spine' # Change this path to your software install

# Set software directory
sys.path.insert(0, SOFTWARE_DIR+'/src')

In [None]:
!nvidia-smi

Now let's take a look at a full chain configuration:

In [None]:
import yaml

# Load configuration file of the ML chain
cfg_path = 'generic_full_chain.yaml'
with open(cfg_path, 'r') as cfg_file:
    cfg = yaml.load(cfg_file, Loader=yaml.Loader)

print(yaml.dump(cfg, sort_keys=False))

There's a lot to unpack. The details of each module in the full chain are beyond the scope of the workshop, but the structure is as presented in previous notebooks.

You can focus on the `chain` configuration within the `model` block:

```yaml
      deghosting: null
      charge_rescaling: null
      segmentation: uresnet
      point_proposal: ppn
      fragmentation: graph_spice
      shower_aggregation: grappa
      shower_primary: grappa
      track_aggregation: grappa
      particle_aggregation: null
      inter_aggregation: grappa
      particle_identification: grappa
      primary_identification: grappa
      orientation_identification: grappa
      calibration: null
```

You can see which network performs which reconstruction task. Note that a few modules are disabled (set to `null`) here:
- This is a generic dataset, hence it has no ghost points (only relevant for wire TPCs)
- The charge rescaling process is only applied when deghosting
- The calibration is only relevant to correct for detector effects (none in this dataset)
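
To make the split between active and disabled stages explicit, here is a minimal sketch that sorts the `chain` entries shown above. The dictionary is copied by hand from the YAML block (with `null` becoming `None`) rather than read from `cfg`, since the exact nesting of `chain` inside the `model` block is not shown here.

```python
# Stand-in for the `chain` block above, copied by hand (YAML null -> None)
chain = {
    'deghosting': None,
    'charge_rescaling': None,
    'segmentation': 'uresnet',
    'point_proposal': 'ppn',
    'fragmentation': 'graph_spice',
    'shower_aggregation': 'grappa',
    'shower_primary': 'grappa',
    'track_aggregation': 'grappa',
    'particle_aggregation': None,
    'inter_aggregation': 'grappa',
    'particle_identification': 'grappa',
    'primary_identification': 'grappa',
    'orientation_identification': 'grappa',
    'calibration': None,
}

# Separate the stages handled by a network from those that are switched off
active = {task: net for task, net in chain.items() if net is not None}
disabled = [task for task, net in chain.items() if net is None]

print('Active stages:')
for task, net in active.items():
    print(f'  {task:<28} -> {net}')
print('Disabled stages:', ', '.join(disabled))
```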

The rest of the model configuration specifies architectural details, i.e. the parameters that define the UResNets and the GNNs (GrapPAs). Here are the modules:
- `uresnet_ppn`: Semantic segmentation + Point proposal
- `graph_spice`: Fragmentation
- `grappa_shower/track`: Aggregation of shower/track fragments
- `grappa_inter`: Aggregation of particles, PID, primary and orientation predictions

You can also see that the path to the weights for the full chain is provided under `model.weight_path`.

```yaml
  weight_path: /sdf/data/neutrino/generic/train/mpvmpr_2020_01_v04/weights/full_chain/default/snapshot-4999.ckpt
```

If you are executing this notebook anywhere but at S3DF, you must pull the weights from the path referenced in the notebook README.md and update the configuration accordingly!
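
Updating the configuration can be done directly on the loaded dictionary. The sketch below assumes the weights are stored under `cfg['model']['weight_path']`, as stated above. The helper `set_weight_path`, the stand-in dictionary `cfg_demo` and the local path are hypothetical and only serve as an illustration.

```python
import os

def set_weight_path(cfg, weight_path):
    """Point the model block of a configuration to a local copy of the weights."""
    # Warn early rather than failing later, when the model is built
    if not os.path.isfile(weight_path):
        print(f'Warning: {weight_path} does not exist (yet)')
    cfg['model']['weight_path'] = weight_path
    return cfg

# Stand-in for the loaded configuration (only the relevant key is shown)
cfg_demo = {'model': {'weight_path':
    '/sdf/data/neutrino/generic/train/mpvmpr_2020_01_v04/weights/full_chain/default/snapshot-4999.ckpt'}}

# Hypothetical location of the weights downloaded to your own machine
LOCAL_WEIGHTS = '/path/to/weights/snapshot-4999.ckpt'

set_weight_path(cfg_demo, LOCAL_WEIGHTS)
print(cfg_demo['model']['weight_path'])
```

In the notebook, you would call the same function on the `cfg` dictionary loaded from `generic_full_chain.yaml` before building the driver.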

***
***
## 2. Running inference on one batch of data

Once our ML model is fully trained and deployed, we may set our model to test mode and make use of its predictions: track vs shower separation, particle clustering, and PID to name a few. 
 
We first illustrate how to run the ML chain on one batch of data. Again, we use the main Driver:

In [None]:
from spine.driver import Driver

driver = Driver(cfg)

The following line runs one forward operation of the ML chain. It has one output:
* `data`: Python dictionary containing inputs and outputs of the network
  * 3D space points and deposition values
  * Truth information used for labels
  * Metadata such as the image index number and px to cm conversion factor
  * Various outputs of the reco chain (clusters, semantics, etc.)
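
Since `data` is a plain Python dictionary, a quick way to get oriented is to list each key with the type and length of its value. This is a minimal sketch run on a stand-in dictionary. `summarize`, `data_demo` and the key names in it are placeholders, not the actual content returned by the driver.

```python
def summarize(data):
    """Return {key: (type name, length or None)} for a dictionary of data products."""
    summary = {}
    for key, value in data.items():
        size = len(value) if hasattr(value, '__len__') else None
        summary[key] = (type(value).__name__, size)
    return summary

# Stand-in for the driver output: one list entry per image in the batch
data_demo = {
    'index': [0, 1],
    'points': [[(0., 0., 0.)], [(1., 1., 1.), (2., 2., 2.)]],
    'n_entries': 2,
}

for key, (kind, size) in summarize(data_demo).items():
    print(f'{key:<10} {kind:<6} len={size}')
```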

In [None]:
data = driver.process()

In [None]:
print('Length of Data Dictionary =', len(data))

In [None]:
data.keys()

As you can see, the output of the full reconstruction chain is exhaustive, but arcane. This output is useful
to debug each step of the reconstruction chain, but, as an analyzer, how do I interpret this information to
build an analysis?

Where are the particles? Where are the interactions?

<figure>
<img src="https://media.tenor.com/onfpmM94llEAAAAe/the-dark-knight-christopher-nolan.png" style="width:500px">
</figure>

***
***
## 3. Representation building

Thankfully, this is where the fragment, particle and interaction builders under
[`spine.build`](https://github.com/DeepLearnPhysics/spine/tree/develop/spine/build)
come in handy! Their purpose is to take the raw output of the full chain and convert
it into human-readable representations. To construct these objects, all you need to do is
add the following block to the configuration:

```yaml
build:
  mode: both
  units: cm
  fragments: false
  particles: true
  interactions: true
```

This is very simple:
- `mode`: specifies whether the builders are to build reconstructed objects, truth reference objects, or both.
  If you are running the reconstruction chain on data, you must set `mode` to `reco`.
- `units`: units in which every coordinate must be expressed. Either `px` (native coordinate system of the input)
  or `cm` (detector coordinates, obtained from the meta information)
- `fragments`: whether to build fragments or not (not useful for analysis, useful for debugging)
- `particles`: whether to build particles or not
- `interactions`: whether to build interactions or not
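
The options above can be assembled programmatically with a small sanity check. The following is a sketch only: `make_build_cfg` is a hypothetical helper, not part of SPINE, and the value `'truth'` for `mode` is an assumption (only `reco` and `both` appear in this notebook).

```python
VALID_MODES = ('reco', 'truth', 'both')  # 'truth' is assumed, not shown in this notebook
VALID_UNITS = ('px', 'cm')

def make_build_cfg(mode='both', units='cm', fragments=False,
                   particles=True, interactions=True):
    """Assemble a `build` configuration block after checking the option values."""
    if mode not in VALID_MODES:
        raise ValueError(f'mode must be one of {VALID_MODES}, got {mode!r}')
    if units not in VALID_UNITS:
        raise ValueError(f'units must be one of {VALID_UNITS}, got {units!r}')
    return {
        'mode': mode,
        'units': units,
        'fragments': bool(fragments),
        'particles': bool(particles),
        'interactions': bool(interactions),
    }

# When running on real data there is no truth information: use 'reco'
build_cfg = make_build_cfg(mode='reco')
print(build_cfg)
```

The resulting dictionary can then be assigned to `cfg['build']` before the driver is created.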

Now we can simply add this block to our initial configuration, run it and see what comes out!

In [None]:
# Load configuration file of the ML chain
cfg_path = 'generic_full_chain.yaml'
with open(cfg_path, 'r') as cfg_file:
    cfg = yaml.load(cfg_file, Loader=yaml.Loader)
cfg['build'] = {
    'mode': 'both',
    'units': 'cm',
    'fragments': False,
    'particles': True,
    'interactions': True
}

driver = Driver(cfg)

In [None]:
data = driver.process()

In [None]:
print('Length of Data Dictionary =', len(data))

You can see that the number of data products has increased. Let's take a look:

In [None]:
data.keys()

At the end of the list, you can see that particle and interaction objects have now been added, phew!
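
Rather than scanning the list by eye, a set difference shows exactly which data products the builders added. This is a minimal sketch with stand-in key sets. The keys in `keys_before` are placeholders, and in practice you would store `set(data.keys())` from the first run before adding the `build` block.

```python
# Stand-in for the keys obtained without the `build` block (placeholder names)
keys_before = {'index', 'meta', 'points', 'depositions'}

# Stand-in for the keys obtained with the `build` block
keys_after = keys_before | {
    'reco_particles', 'truth_particles',
    'reco_interactions', 'truth_interactions',
}

# Data products that only exist once the builders have run
new_keys = sorted(keys_after - keys_before)
print(new_keys)
```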

***

Now let's visualize what these particle and interaction objects look like!


Let's focus on the `*_particles` and `*_interactions` data products, which are locally-defined particle and interaction representations, respectively.

In [None]:
# Retrieving data structures.
# Here we need to index the data structures because we have processed a batch!
entry = 0
reco_particles     = data['reco_particles'][entry]
truth_particles    = data['truth_particles'][entry]
reco_interactions  = data['reco_interactions'][entry]
truth_interactions = data['truth_interactions'][entry]

Let us start by taking a look at the reco particle objects. What is in there?

In [None]:
reco_particles[0]

Here are a few particle attributes that might be useful:
- `points`: Positions of the space points associated with the particle
- `depositions`: Amount of charge/energy associated with each space point
- `interaction_id`: ID of the interaction this particle belongs to
- `pid`: Particle ID (see below for meaning of those numbers)
- `is_primary`: Whether the particle originates from the primary vertex or not
- `start_point`: Position of the start point
- `end_point`: Position of the end point (for EM showers, same as start point)
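
As an illustration of how these attributes can be combined, the sketch below groups primary particles by interaction. It uses `SimpleNamespace` stand-ins rather than real SPINE particle objects, so only the attribute names `interaction_id`, `pid` and `is_primary` come from the list above. The `id` attribute and all values are made up for the example.

```python
from collections import defaultdict
from types import SimpleNamespace

# Stand-ins for reconstructed particles (values are made up)
particles = [
    SimpleNamespace(id=0, interaction_id=0, pid=2, is_primary=True),
    SimpleNamespace(id=1, interaction_id=0, pid=4, is_primary=True),
    SimpleNamespace(id=2, interaction_id=0, pid=1, is_primary=False),
    SimpleNamespace(id=3, interaction_id=1, pid=0, is_primary=True),
]

# Collect the IDs of primary particles, keyed by the interaction they belong to
primaries = defaultdict(list)
for particle in particles:
    if particle.is_primary:
        primaries[particle.interaction_id].append(particle.id)

for interaction_id, particle_ids in primaries.items():
    print(f'Interaction {interaction_id}: primary particles {particle_ids}')
```

The same loop should work on `reco_particles`, provided the real objects expose these attributes under the same names.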

You may also notice that some of the reconstructed quantities are not filled (e.g. `start_dir`, `is_contained`, etc.). This is because these attributes are filled by the post-processors, which have not yet run...

For a comprehensive list of available attributes, simply use `help`:

In [None]:
#help(reco_particles[0])

Exercise: do the same for the `truth_particles` and investigate what is in there!

We can do the same for interactions.

This time we use the `as_dict()` method, which restricts the list of attributes to short-form attributes.

In [None]:
reco_interactions[0].as_dict()

How do we fill the missing attributes? Let's take a look...

***
***
## 4. Post-processors

Here is a schematic representation of the data flow after the execution of the full chain:

<figure>
<img src="https://github.com/francois-drielsma/lartpc_mlreco3d/raw/me/images/anatools.png" style="width:800px">
</figure>

The post-processing module under `spine.post` takes care of all non-ML reconstruction steps and is configured under the `post` configuration block. Here is what the full post-processing suite would look like for our generic dataset:

```yaml
post:
  shape_logic:
    enforce_pid: true
    enforce_primary: true
    priority: 3
  direction:
    obj_type: particle
    optimize: true
    run_mode: both
    priority: 1
  calo_ke:
    scaling: 1.
    shower_fudge: 1/0.83
    priority: 1
  csda_ke:
    tracking_mode: step_next
    segment_length: 5.0
    priority: 1
  mcs_ke:
    tracking_mode: bin_pca
    segment_length: 5.0
    priority: 1
  topology_threshold:
    ke_thresholds:
      4: 50
      default: 25
  vertex:
    use_primaries: true
    update_primaries: false
    priority: 1
  containment:
    margin: 5.0
    mode: meta
  fiducial:
    margin: 25.0
    mode: meta
  children_count:
    mode: shape
  match:
    match_mode: both
    ghost: false
    fragment: false
    particle: true
    interaction: true
```

No need to understand in detail what each of these modules does. We will go over them again in notebooks dedicated to performing these tasks or using the output of these post-processors. It is, however, interesting to check on our particle objects again to see what is new...

In [None]:
import yaml

# Load configuration file of the ML chain
cfg_path = 'generic_full_chain_with_post.yaml'
with open(cfg_path, 'r') as cfg_file:
    cfg = yaml.load(cfg_file, Loader=yaml.Loader)

print(yaml.dump(cfg, sort_keys=False))

In [None]:
driver = Driver(cfg)

In [None]:
data = driver.process()

In [None]:
# Start by getting the particles in the first entry of the batch

entry = 0
reco_particles = data['reco_particles'][entry]

In [None]:
reco_particles[3].as_dict()

Now you can see that a lot more is filled! E.g.:
- `is_contained`: Whether the particle is contained within the volume of interest
- `start_dir`/`end_dir`: Direction estimates at the start/end points
- `*_ke`: Kinetic energy estimates using calorimetry, CSDA or MCS
- `momentum`: Momentum estimate using `start_dir`, `ke` and `pid`
- ...
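
To see how a momentum vector can follow from a direction, a kinetic energy and a particle species, here is a sketch based on the standard relativistic relation $|p| = \sqrt{KE^2 + 2\,m\,KE}$ (in units where $c = 1$). It is not necessarily the exact implementation used in SPINE. The function name and the assumption that energies are in MeV are ours, and the mass would in practice be chosen according to `pid`.

```python
import math

def momentum_from_ke(ke, mass, direction):
    """Momentum vector from kinetic energy, mass (same units) and a direction."""
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.:
        raise ValueError('direction must be a non-zero vector')
    # |p|^2 = E^2 - m^2 with E = KE + m, i.e. |p| = sqrt(KE^2 + 2*m*KE)
    p_mag = math.sqrt(ke**2 + 2. * ke * mass)
    return tuple(p_mag * c / norm for c in direction)

MUON_MASS = 105.658  # MeV/c^2

# A muon with 100 MeV of kinetic energy travelling along +z
p = momentum_from_ke(ke=100., mass=MUON_MASS, direction=(0., 0., 1.))
print(p)
```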

***
***
## 5. Storage

Ok, almost done eating your vegetables, I swear. One more step to have a fully fledged inference configuration: the `writer`.

This block, which lives under `io`, defines what to store from the output of the reco chain and the post-processors, and how to write it to an HDF5 file. This is useful because it is significantly more efficient to run the full chain on a dataset once and then use the output for analysis later. This centralizes the production process and saves time for the analyzers.

The writer is defined under `spine.io.write.hdf5` and is configured as follows:

```yaml
  writer:
    name: hdf5
    file_name: dummy.h5
    overwrite: true
    keys:
      - run_info
      - meta
      - points
      - points_label
      - depositions
      - depositions_label
      - reco_particles
      - truth_particles
      - reco_interactions
      - truth_interactions
```

These are the basics of what goes in there:
- `name`: type of writer to use (currently HDF5 only)
- `file_name`: name of the output file
- `overwrite`: if `True` and the file already exists, it will be deleted and a new file will be created in its place
- `keys`: list of keys in the `data` dictionary output that need to be stored to the HDF5 file. The list above is the exhaustive list of things that need to be stored to be able to restore the full SPINE `*particles` and `*interactions` objects from file. If you want to make a compact file with only top level information (no points, no depositions), simply comment out the following keys:
  - `points*`
  - `depositions*`
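
Instead of commenting out keys by hand, the compact list can also be derived from the full one with wildcard patterns. This is a minimal sketch using the standard-library `fnmatch` module. The variable names are ours and the key list is copied from the writer block above.

```python
from fnmatch import fnmatchcase

# Full list of keys, as in the writer block above
keys = [
    'run_info', 'meta',
    'points', 'points_label',
    'depositions', 'depositions_label',
    'reco_particles', 'truth_particles',
    'reco_interactions', 'truth_interactions',
]

# Patterns of the keys to leave out of a compact file
drop_patterns = ['points*', 'depositions*']

compact_keys = [key for key in keys
                if not any(fnmatchcase(key, pattern) for pattern in drop_patterns)]
print(compact_keys)
```

Given that the writer lives under `io`, the result would then be assigned to `cfg['io']['writer']['keys']` before the driver is created.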
  
This block completes the assembly of a full chain inference configuration! These configurations are maintained in this repository:

```html
https://github.com/DeepLearnPhysics/spine_prod/
```

If you look under the `config` directory, you'll see that one full chain configuration exists for each data modality covered this week:
- `generic` (generic detector-less images, i.e. no detector sim.)
- `icarus`
- `sbnd`
- `2x2`

Let's load one of them and make a file:

In [None]:
from spine.config import load_config_file

# Load configuration file of the ML chain
cfg_path = '/global/cfs/cdirs/m5252/software/spine-prod/config/infer/generic/full_chain_240805.yaml'
cfg = load_config_file(cfg_path)

cfg['io']['loader']['dataset']['file_keys'] = '/global/cfs/cdirs/m5252/dune/spine/workshop/larcv/generic_small.root'

print(yaml.dump(cfg, sort_keys=False))

Simply initialize the driver and call the `run` function (will run on the whole mini dataset):

In [None]:
driver = Driver(cfg)

In [None]:
driver.run()

You can now see that a new file has spawned (`dummy.h5`)!

In [None]:
!ls .

In summary, provided with a full chain configuration and a set of weights, all you need to do to run inference is:

```python
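from spine.config import load_config_file
from spine.driver import Driver

# Parse the configuration file into a dictionary, then run the full chain
cfg = load_config_file('/sdf/data/neutrino/software/spine_prod/config/generic/generic_full_chain_240718.cfg')
driver = Driver(cfg)
driver.run()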
```

which is exactly equivalent to calling the following command:

```bash
python3 /path/to/spine/bin/run.py -c /sdf/data/neutrino/software/spine_prod/config/generic/generic_full_chain_240718.cfg
```

Typically this command would be run as part of a batch job, as described in the training notebook!

In the next notebook, we'll discuss how to build an analysis with these files!