# uproot

ROOT data to Numpy arrays

There are several ways to get data from ROOT files into Numpy arrays.

   * iteration in PyROOT (super slow!)
   * ROOT's new `TTree::AsMatrix` (flat data, simple types)
   * custom C++ function (defined through `ROOT.gInterpreter.Declare`)
   * root_numpy (compiles against a ROOT version; can segfault with version mismatch)
   * uproot

Unlike the other options, **uproot** is a *reimplementation* of ROOT I/O that skips unnecessary steps between deserialization and array filling.

uproot uses Numpy vectorization for anything that scales with the number of events, and Python only for the complex business of navigating the file.

For larger (fewer) baskets, there's less navigation and more vectorization.

<table>
  <tr style="background-color: white;">
    <td style="text-align: center; border-bottom: none; font-size: 18pt;">Speedup relative to bare ROOT vs basket size</td>
    <td style="text-align: center; border-bottom: none; font-size: 18pt;">Speedup relative to root_numpy vs basket size</td>
  </tr>
  <tr style="background-color: white;">
    <td><img src="img/uproot_root-none-muon.png"></td>
    <td><img src="img/uproot_rootnumpy-none-muon.png"></td>
  </tr>
</table>

ROOT builds objects for the convenience of physics C++ code, but when dumping into arrays, we don't want that. Skipping that step is why uproot can be a little faster than bare ROOT for large baskets.

uproot makes ROOT files, directories, and TTrees act like `dicts`.

In [None]:
import uproot

nanoaod = uproot.open("~/NanoAOD-DYJetsToLL.root")
nanoaod.keys()

In [None]:
tree = nanoaod["Events"]
tree.keys()

Can uproot interpret each branch's type? (If a branch is not interpretable, the third column is `None`. NanoAOD has no streamers/classes.)

In [None]:
tree.show()

Read one branch into an array at a time, or get a `dict`/`tuple`/`DataFrame` of arrays.

In [None]:
tree.array("MET_pt")

Numpy doesn't have a type for data whose size can vary per event (e.g. variable number of muons per event), so we use uproot's `JaggedArrays` for such branches.

In [None]:
mupt = tree["Muon_pt"].array()
mueta = tree["Muon_eta"].array()
mupt

In [None]:
import numpy, math, time

starttime = time.time()

pz = numpy.empty(len(mupt))                       # room for one value per event
i = 0
for pts, etas in zip(mupt, mueta):                # you can do nested loops on these
    for pt, eta in zip(pts, etas):                # as though they were lists within lists
        pz[i] = pt * math.sinh(eta)
        i += 1
        break                                     # only the first muon in each event
pz = pz[:i]                                       # drop unused slots (events with no muons)

time.time() - starttime

But actually each `JaggedArray` is an object with `content`, `starts`, and `stops` arrays.

In [None]:
print(mupt.content)                              # the actual data (without event boundaries)
print(mupt.starts)                               # where each event starts
print(mupt.stops)                                # where each event stops
mupt.starts.base is mupt.stops.base              # (starts and stops are just views of the same offsets)
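To make the layout concrete, here is a minimal Numpy-only sketch of the same idea. The toy values and the `offsets` array are made up for illustration; this is not uproot's `JaggedArray` class, just the three arrays it is described as holding.

```python
import numpy

# Hypothetical toy data: 4 events containing 2, 0, 1, and 3 muons
content = numpy.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
offsets = numpy.array([0, 2, 2, 3, 6])

# starts and stops are two overlapping views of the same offsets array
starts, stops = offsets[:-1], offsets[1:]

# slicing content between each start and stop recovers the per-event lists
events = [content[a:b] for a, b in zip(starts, stops)]
counts = stops - starts
```

Because `starts` and `stops` are views, no per-event data is copied until an event is actually sliced out.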

In [None]:
starttime = time.time()

hasamuon = mupt.stops - mupt.starts > 0           # remember this trick from numpy.ipynb?
firsts = mupt.starts[hasamuon]
pz = mupt.content[firsts] * numpy.sinh(mueta.content[firsts])

time.time() - starttime
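The two approaches can be checked against each other on toy data. This sketch assumes made-up `content` and `offsets` arrays standing in for the muon branches; it only illustrates that the loop and the vectorized selection of the first muon per event agree.

```python
import math
import numpy

# Hypothetical toy data: 4 events containing 2, 0, 1, and 3 muons
pt_content = numpy.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
eta_content = numpy.array([0.5, -1.0, 0.0, 1.5, -0.5, 2.0])
offsets = numpy.array([0, 2, 2, 3, 6])
starts, stops = offsets[:-1], offsets[1:]

# loop version: first muon of every event that has at least one
pz_loop = []
for a, b in zip(starts, stops):
    if b > a:
        pz_loop.append(pt_content[a] * math.sinh(eta_content[a]))
pz_loop = numpy.array(pz_loop)

# vectorized version: same selection, no Python loop over events
hasamuon = stops - starts > 0
firsts = starts[hasamuon]
pz_vec = pt_content[firsts] * numpy.sinh(eta_content[firsts])
```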

uproot has no implicit caching or parallel processing. Both must be explicitly requested.

In [None]:
cache = {}

In [None]:
starttime = time.time()
tree.arrays("Jet_*", cache=cache)    # first time: reads file; second time: gets from dict
time.time() - starttime

In [None]:
cache

A `dict` is not really a cache because it never evicts old data to make space. `MemoryCache` is a drop-in replacement that does.

In [None]:
help(uproot.cache.MemoryCache)

You control your own memory use, either by making a `MemoryCache` the right size or by explicitly clearing `dicts` when you need to.
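As a rough illustration of what "evicting old data to make space" means, here is a small byte-limited cache with least-recently-used eviction. The class name `TinyByteLimitedCache` and its `limitbytes` argument are hypothetical; this is a sketch of the general technique, not uproot's `MemoryCache` implementation.

```python
from collections import OrderedDict
import numpy

class TinyByteLimitedCache(object):
    """Hypothetical illustration only; not uproot's MemoryCache."""

    def __init__(self, limitbytes):
        self.limitbytes = limitbytes
        self.usedbytes = 0
        self._data = OrderedDict()        # oldest entries first

    def __contains__(self, key):
        return key in self._data

    def __getitem__(self, key):
        value = self._data.pop(key)       # raises KeyError if missing
        self._data[key] = value           # re-insert as most recently used
        return value

    def __setitem__(self, key, value):
        if key in self._data:
            self.usedbytes -= self._data.pop(key).nbytes
        self._data[key] = value
        self.usedbytes += value.nbytes
        # evict least recently used entries until we fit (always keep the newest)
        while self.usedbytes > self.limitbytes and len(self._data) > 1:
            _, evicted = self._data.popitem(last=False)
            self.usedbytes -= evicted.nbytes

toycache = TinyByteLimitedCache(limitbytes=2000)
for name in ["a", "b", "c"]:
    toycache[name] = numpy.zeros(100)     # 100 float64 values = 800 bytes each
```

A plain `dict` would hold all three arrays; here the oldest one is dropped as soon as the third pushes the total past the limit.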

Parallel processing uses Python's Executor model (similar to TBB, but much less developed).

In [None]:
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(4)                   # number of cores

In [None]:
starttime = time.time()
again = tree.arrays("Jet_*", executor=executor)    # that's happening in parallel
time.time() - starttime

With a lot of computational work (LZMA decompression in the case below), this can make a difference.

<img src="img/uproot_scaling.png" style="display: block; margin-left: auto; margin-right: auto">

(That is, parallel processing isn't ruined by Python's global interpreter lock: compiled libraries such as Numpy and LZMA release the lock while they work, so their threads actually run in parallel.)
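The pattern can be sketched with the standard library alone. This example uses `zlib` as a stand-in for LZMA and made-up "baskets" of numbers; the function name `decompress_to_array` is hypothetical, and nothing here is uproot's internal code.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
import numpy

# Hypothetical "baskets": 8 compressed blocks of 1000 float64 values each
baskets = [numpy.arange(i, i + 1000, dtype=numpy.float64) for i in range(8)]
compressed = [zlib.compress(b.tobytes()) for b in baskets]

def decompress_to_array(blob):
    # decompression and array interpretation both happen in compiled code
    return numpy.frombuffer(zlib.decompress(blob), dtype=numpy.float64)

with ThreadPoolExecutor(4) as executor:
    # executor.map preserves input order, so the baskets stay in sequence
    arrays = list(executor.map(decompress_to_array, compressed))

result = numpy.concatenate(arrays)
```

With blocks this small the threads buy nothing; any speedup would only show up when each task does enough compiled work to outweigh the scheduling overhead.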

For large files or many-file datasets, you'll want to iterate: not over events, but arrays (chunks of events).

In [None]:
for arrays in tree.iterate("Jet_*"):
    print("batch of {} arrays, {} MB".format(len(arrays), sum(x.nbytes / 1024.0**2 for x in arrays.values())))
print("done")

The default chunk size is the ROOT "cluster size," but this is highly configurable.
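Iterating over chunks rather than events looks roughly like the generator below. The helper `iterate_chunks` and its `chunksize` parameter are hypothetical names for illustration on in-memory arrays; uproot's own iteration reads from the file and has its own options.

```python
import numpy

def iterate_chunks(arrays, chunksize):
    """Yield dicts of array slices, chunksize entries at a time (hypothetical helper)."""
    n = len(next(iter(arrays.values())))
    for start in range(0, n, chunksize):
        yield {name: a[start:start + chunksize] for name, a in arrays.items()}

# Hypothetical toy data: 10 entries in two branches
toy = {"Jet_pt": numpy.arange(10.0), "Jet_eta": numpy.linspace(-2, 2, 10)}

sizes = [len(chunk["Jet_pt"]) for chunk in iterate_chunks(toy, 4)]
```

Each step of the loop still works on whole arrays, so the per-chunk code can stay vectorized.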

Nearly the same syntax for multiple files (like TChain).

In [None]:
for arrays in uproot.iterate("~/NanoAOD-*.root", "Events", "Jet_*"):
    print("batch of {} arrays, {} MB".format(len(arrays), sum(x.nbytes / 1024.0**2 for x in arrays.values())))
print("done")

# Application: dropping data into machine learning libraries

Define a neural network with two hidden layers in PyTorch.

In [None]:
import torch

class SimpleNN(torch.nn.Module):
    def __init__(self, input_dim, hidden1_dim, hidden2_dim, output_dim):
        super(SimpleNN, self).__init__()
        self.layer1 = torch.nn.Linear(input_dim, hidden1_dim)
        self.relu1 = torch.nn.ReLU()
        self.layer2 = torch.nn.Linear(hidden1_dim, hidden2_dim)
        self.relu2 = torch.nn.ReLU()
        self.layer3 = torch.nn.Linear(hidden2_dim, output_dim)

    def forward(self, x):
        return self.layer3(self.relu2(self.layer2(self.relu1(self.layer1(x)))))

# 25 input parameters, 20 node hidden layer, 10 node hidden layer, 1 output
simplenn = SimpleNN(25, 20, 10, 1)

criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(simplenn.parameters(), lr=0.01)

The 25 input parameters are jet attributes other than the btag.

The 1 output is the supervised learning target: Jet_btagCMVA.

In [None]:
jetarrays = tree.arrays("Jet_*")

# numpy.vstack needs a sequence (list), not a generator
inputs = numpy.vstack([jetarrays[n] for n in sorted(jetarrays) if not n.startswith("Jet_btag")]).T.astype("float32")
expected_output = numpy.array(jetarrays["Jet_btagCMVA"]).reshape(-1, 1)

inputs.shape, expected_output.shape
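The reshaping step can be tried on toy data. The branch names and values below are made up, and the arrays are flat (one value per entry) by assumption; the point is only how a dict of equal-length arrays becomes an `(entries, features)` matrix and an `(entries, 1)` target.

```python
import numpy

# Hypothetical toy data: 5 entries, two input features and one target
toy = {
    "Jet_pt": numpy.array([30.0, 45.0, 22.0, 80.0, 51.0]),
    "Jet_eta": numpy.array([0.1, -1.2, 2.0, 0.7, -0.3]),
    "Jet_btagCMVA": numpy.array([0.9, -0.5, 0.2, 0.8, -0.9]),
}

# every non-btag branch becomes one column, in sorted name order
names = sorted(n for n in toy if not n.startswith("Jet_btag"))
inputs = numpy.vstack([toy[n] for n in names]).T.astype("float32")

# the target is a single column with the same number of rows
expected_output = toy["Jet_btagCMVA"].reshape(-1, 1).astype("float32")
```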

PyTorch, like most Python ML libraries, has methods to get batches of data from Numpy.

In [None]:
inputs = torch.autograd.Variable(torch.from_numpy(inputs))
expected_output = torch.autograd.Variable(torch.from_numpy(expected_output))

And now we use PyTorch; it doesn't matter where the data came from.

In [None]:
optimizer.zero_grad()
computed_output = simplenn(inputs)   # calling the module runs forward()
loss = criterion(computed_output, expected_output)
loss.backward()
optimizer.step()
print(loss)                # I have _in no way_ demonstrated that we have a good b-tag training. Just sayin'.