&#8593;&#8593;&#8593;&#8593;&#8593;&#8593;  <p style="text-align: right;">&#8593;&#8593;&#8593;&#8593;&#8593;&#8593;</p>

**To view this notebook as a slideshow click on the deck icon ![deck](https://raw.githubusercontent.com/deathbeds/jupyterlab-deck/main/docs/_static/deck.svg) above.**

For a better slideshow experience, set font sizes to 24px by going to Settings -> Fonts -> Code/Content -> Size.

<center>
<h1>Reading RNTuple data with Uproot</h1>
<h2>Andres Rios-Tascon</h2>
<h4>PyHEP 2024</h4>
<img src="https://raw.githubusercontent.com/ariostas-talks/2024-07-02-pyhep-uproot-rntuple/main/images/PU_lockup.png" style="height:50px;"/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<img src="https://raw.githubusercontent.com/ariostas-talks/2024-07-02-pyhep-uproot-rntuple/main/images/Iris-hep-4-no-long-name.png" style="height:50px;"/>
</center>

## Outline

- Introduction and motivation for RNTuple.
- Status of RNTuple support in Uproot.
- Hands-on demo.
- Future work and outlook.

## What is RNTuple and why should we care?

- `RNTuple` is a modern serialization format that will replace `TTree`.

- `TTree` has become outdated and bloated.
  - Inefficient storing and reading of nested and/or jagged collections.
  - Lots of special cases and hacky implementations.
  - Virtually impossible to fully support in `uproot`.
 
<center><img src="https://raw.githubusercontent.com/ariostas-talks/2024-07-02-pyhep-uproot-rntuple/main/images/ttree_current_status.svg" style="height:400px;"/></center>

- `RNTuple` will bring many improvements.
  - Simple and modern design (and has a formal spec).
  - Focuses on native data types.
  - Columnar layout very similar to `awkward`.
  - Much faster performance and designed for parallelization.
  - Simpler design should allow for almost 100% support in `uproot`.

## RNTuple performance comparison

<br/>
<br/>
<br/>

<center>
<img src="https://raw.githubusercontent.com/ariostas-talks/2024-07-02-pyhep-uproot-rntuple/main/images/rntuple_comparison.png" style="height:200px;"/>
</center>

<br/>
<br/>
<br/>

Image taken from [arXiv:2204.09043](https://arxiv.org/abs/2204.09043).

## RNTuple timeline

<br/>
<br/>
<br/>

<center>
<img src="https://raw.githubusercontent.com/ariostas-talks/2024-07-02-pyhep-uproot-rntuple/main/images/rntuple_timeline.png" style="height:100px;"/>
</center>

<br/>
<br/>
<br/>

Image taken from <https://doi.org/10.1051/epjconf/202429506020>.

**Version 1.0.0 of the specification is expected to be done by the end of this year.**

We expect to have most functionality working in `uproot` by the time this happens!

## RNTuple in Uproot

- Initial implementation was written by Jerry Ling.
  - Fairly complete reading support.
  - Basic writing support.

- A significant update to the `RNTuple` spec was released earlier this year, which completely broke the existing implementation.

- We have fixed and reworked the reading functionality, adding new features from the spec.

<center><img src="https://raw.githubusercontent.com/ariostas-talks/2024-07-02-pyhep-uproot-rntuple/main/images/rntuple_current_status.svg" style="height:400px;"/></center>

- We are aiming to have the interface be the same as (or very similar to) the one for `TTree`.

<center>
<h1>Let's look at a concrete example</h1>
</center>

## Example RNTuple

Let's consider an example<a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1) where we have the following data:

| Trigger (bool) | Missing ET {float, float} | Lepton ids (vector) |
| -------------- | ------------------------- | ------------------- |
| False          | {et: 79.7, phi: 2.83}     | []                  |
| True           | {et: 78, phi: 0.62}       | [11, -11]           |
| False          | {et: 10, phi: -2.78}      | [-13, -11]          |
| True           | {et: 14.3, phi: 1.31}     | [11, 11, -13]       |
| True           | {et: 83.2, phi: 2.76}     | [11]                |

<br/>
<br/>

<a name="cite_note-1"></a>1. [^](#cite_ref-1) This example is based on [this talk](https://indico.cern.ch/event/1222943/) by Jerry Ling.

## Data layout

```mermaid
flowchart BT
    A[\"(top level)"/]
    B("trig (bool)") --> A
    C("met (struct)") --> A
    D("lep_pid (std::vector&lt;int&gt;)") --> A
    E[("column (data)")] --> B
    F("et (float)") --> C
    G("phi (float)") --> C
    H[("column (data)")] --> F
    I[("column (data)")] --> G
    J[("column (offset)")] --> D
    K("_0 (int)") --> D
    L[("column (data)")] --> K
```

We will see that this very closely matches the data layout in `awkward`!
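
To make the correspondence concrete, here is a minimal pure-Python sketch (no `uproot` or `awkward` required) that splits the five example events into the columns shown in the diagram. The helper names `to_columns` and `lep_pid_of` are hypothetical, and the offsets use the Awkward-style convention with a leading zero, which is an assumption made for illustration and may differ from how offsets are stored on disk.

```python
# The five example events from the table above, written row by row.
events = [
    {"trig": False, "met": {"et": 79.7, "phi": 2.83}, "lep_pid": []},
    {"trig": True, "met": {"et": 78.0, "phi": 0.62}, "lep_pid": [11, -11]},
    {"trig": False, "met": {"et": 10.0, "phi": -2.78}, "lep_pid": [-13, -11]},
    {"trig": True, "met": {"et": 14.3, "phi": 1.31}, "lep_pid": [11, 11, -13]},
    {"trig": True, "met": {"et": 83.2, "phi": 2.76}, "lep_pid": [11]},
]


def to_columns(rows):
    """Split row-wise events into one flat column per leaf field (hypothetical helper)."""
    columns = {
        "trig": [row["trig"] for row in rows],
        "met.et": [row["met"]["et"] for row in rows],
        "met.phi": [row["met"]["phi"] for row in rows],
        "lep_pid.offsets": [0],  # leading zero: Awkward-style convention (assumption)
        "lep_pid.content": [],
    }
    for row in rows:
        # Append this event's lepton ids, then record where the event ends.
        columns["lep_pid.content"].extend(row["lep_pid"])
        columns["lep_pid.offsets"].append(len(columns["lep_pid.content"]))
    return columns


def lep_pid_of(columns, i):
    """Recover the lepton ids of event i from the offset and content columns."""
    start, stop = columns["lep_pid.offsets"][i : i + 2]
    return columns["lep_pid.content"][start:stop]


columns = to_columns(events)
print(columns["lep_pid.offsets"])
print(columns["lep_pid.content"])
print(lep_pid_of(columns, 3))
```

The jagged `lep_pid` field needs two columns (offsets and content), while each flat field needs only one, which is the same split shown in the diagram.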

## ROOT code

We can create an RNTuple with this data by using the following ROOT code.

In [None]:
from IPython.display import Code
with open("example_rntuple.C") as f:
    code = f.read()
Code(code, language='cpp')

## Using `uproot` to read this RNTuple

As a quick reminder, `uproot` can be installed with `pip install uproot` or `conda install -c conda-forge uproot`. If you're using this notebook on JupyterLite then it is already installed.

Let's start by importing `uproot`.

In [None]:
import uproot

Let's now open this example file and see what's inside.

In [None]:
f = uproot.open("example_rntuple.root")
f.classnames()

Let's now open this RNTuple and briefly inspect the data layout that we discussed before.

In [None]:
ntpl = f["ntpl"]

In [None]:
for i, fr in enumerate(ntpl.field_records):
    print(f"field_name={fr.field_name:<7} type_name={fr.type_name:<25} idx={i} parent_idx={fr.parent_field_id}")

In [None]:
for cr in ntpl.column_records:
    print(f"idx={cr.idx}, field_id={cr.field_id}, type={cr.type:0>2}, nbits={cr.nbits:0>2}")
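
The parent indices printed above are enough to rebuild the field tree from the data layout diagram. The sketch below does this in plain Python on hand-written `(name, parent index)` pairs. These records and their ordering are illustrative assumptions rather than the actual output of the cells above, and `build_tree` and `render` are hypothetical helpers. It also assumes that a top-level field lists its own index as its parent.

```python
# Hand-written (field name, parent index) pairs mirroring the diagram.
# Illustrative assumption: a top-level field lists its own index as parent.
fields = [
    ("trig", 0),
    ("met", 1),
    ("lep_pid", 2),
    ("et", 1),
    ("phi", 1),
    ("_0", 2),
]


def build_tree(records):
    """Map each field index to the indices of its children; also return the roots."""
    children = {idx: [] for idx in range(len(records))}
    roots = []
    for idx, (_, parent) in enumerate(records):
        if parent == idx:
            roots.append(idx)
        else:
            children[parent].append(idx)
    return roots, children


def render(records, roots, children, depth=0):
    """Return the tree as indented lines, one field per line."""
    lines = []
    for idx in roots:
        lines.append("  " * depth + records[idx][0])
        lines.extend(render(records, children[idx], children, depth + 1))
    return lines


roots, children = build_tree(fields)
print("\n".join(render(fields, roots, children)))
```

Column records attach to this tree in the same way: each column refers to the field it belongs to through its field id.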

### Let's now actually read the data and put it into arrays!

In [None]:
arrays = ntpl.arrays()
arrays

We can check that the memory layout very closely resembles the one in `RNTuple`.

In [None]:
arrays.layout

Now everything works in the usual `awkward` fashion.

In [None]:
arrays.lep_pid

### We can already read complex files

Here is an example of a file produced with an ATLAS workflow.

In [None]:
url = "https://github.com/scikit-hep/scikit-hep-testdata/raw/main/src/skhep_testdata/data/DAOD_TRUTH3_RC2.root"

# When not using WASM we can read the file directly
# f = uproot.open(f"simplecache::{url}")

# For WASM we need some workarounds
import requests
r = requests.get(f"https://corsproxy.io/?{url}") # Please be careful when using a CORS proxy
with open("DAOD_TRUTH3_RC2.root", "wb") as out:  # Close the file before reopening it with uproot
    out.write(r.content)
f = uproot.open("DAOD_TRUTH3_RC2.root")
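
When re-running the notebook, it can help to skip the download if the file is already on disk. The following is a minimal sketch of that idea. `fetch_once` is a hypothetical helper, and its `download` argument stands in for whatever function returns the file contents as bytes (for example, a small wrapper around the `requests.get` call above). The URL and file name in the demonstration are placeholders.

```python
import tempfile
from pathlib import Path


def fetch_once(url, path, download):
    """Write download(url) to path unless the file already exists; return the path.

    `download` is any callable that takes a URL and returns bytes.
    """
    path = Path(path)
    if not path.exists():
        path.write_bytes(download(url))
    return path


# Demonstration with a stand-in downloader, so no network access is needed.
calls = []


def fake_download(url):
    calls.append(url)
    return b"not a real ROOT file"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.root"
    fetch_once("https://example.invalid/example.root", target, fake_download)
    fetch_once("https://example.invalid/example.root", target, fake_download)
    cached = target.read_bytes()

print(len(calls))  # the second call finds the file and does not download again
```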

In [None]:
f.classnames()

In [None]:
ntpl = f["RNT:CollectionTree"]

In [None]:
arrays = ntpl.arrays()
arrays

### We also have good performance when reading large(-ish) files
### (more on this later)

Here is an example with a ~100 MB file.

In [None]:
# 4M events
url = "http://root.cern/files/tutorials/ntpl004_dimuon_v1rc2.root"

# When not using WASM we can read the file directly
# f = uproot.open(f"simplecache::{url}")
# ntpl = f["Events"]
# arrays = ntpl.arrays() # Takes ~1 second

# Performance is still not very good on WASM so we'll skip it here,
# but you can try it with your native install.

### Interactions with Scikit-HEP ecosystem remain unchanged

In [None]:
import awkward as ak
from hist import Hist

h = Hist.new.Regular(50, 0, 70000, name="pt", label="Jet $p_T$").Double()
h.fill(pt=ak.flatten(arrays["AntiKt4TruthDressedWZJetsAux:"].pt))
h.plot();

## Future work and outlook

- Although `RNTuple` reading already works, there is still a significant amount of work that needs to be done.

- Lazy reading will be implemented with `dask`, just as it was done for `TTree`.

- `RNTuple` writing needs to be fixed and likely heavily rewritten.

- We are aiming to keep up to date with changes in the `RNTuple` spec and be ready for v1.0.0.

- `uproot` will become even more useful since it should support almost 100% of the `RNTuple` spec, and so should be almost equivalent to (albeit slower than) `ROOT` for reading and writing.

<center><img src="https://raw.githubusercontent.com/ariostas-talks/2024-07-02-pyhep-uproot-rntuple/main/images/rntuple_goal.svg" style="height:400px;"/></center>