# Exploring a file

[uproot.open](https://uproot.readthedocs.io/en/latest/opening-files.html#uproot-open) is the entry point for reading a single file.

It takes a local filename path or a remote `http://` or `root://` URL. (HTTP requires the Python [requests](https://pypi.org/project/requests/) library and XRootD requires [pyxrootd](http://xrootd.org/), both of which have to be explicitly pip-installed if you installed uproot with pip, but are automatically installed if you installed uproot with conda.)

In [14]:
import uproot

file = uproot.open("http://scikit-hep.org/uproot/examples/nesteddirs.root")
file

<ROOTDirectory b'tests/nesteddirs.root' at 0x7fa3150962e8>

[uproot.open](https://uproot.readthedocs.io/en/latest/opening-files.html#uproot-open) returns a [ROOTDirectory](https://uproot.readthedocs.io/en/latest/root-io.html#uproot-rootio-rootdirectory), which behaves like a Python dict; it has `keys()`, `values()`, and key-value access with square brackets.

In [15]:
file.keys()

[b'one;1', b'three;1']

In [16]:
file["one"]

<ROOTDirectory b'one' at 0x7fa3150967f0>

Subdirectories also have type [ROOTDirectory](https://uproot.readthedocs.io/en/latest/root-io.html#uproot-rootio-rootdirectory), so they behave like Python dicts, too.

In [17]:
file["one"].keys()

[b'two;1', b'tree;1']

In [18]:
file["one"].values()

[<ROOTDirectory b'two' at 0x7fa3150d0128>, <TTree b'tree' at 0x7fa315096dd8>]

**What's the `b` before each object name?** Python 3 distinguishes between bytestrings and text (Unicode) strings. ROOT object names carry no declared encoding, such as Latin-1 or UTF-8, so uproot presents them as raw bytestrings. However, if you enter a Python string (no `b`) and it matches an object name (interpreted as plain ASCII), it will count as a match, as `"one"` does above.

**What's the `;1` after each object name?** ROOT objects are versioned with a "cycle number." If multiple objects are written to the ROOT file with the same name, they will have different cycle numbers, with the largest value belonging to the most recently written object. If you don't specify a cycle number, you'll get the latest one.
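The two lookup rules above can be pictured with a small sketch. The `resolve` helper below is hypothetical (it is not part of uproot, and the key list is made up): it accepts a `str` or `bytes` name, with or without a `;cycle` suffix, and picks the highest cycle when none is given.

```python
def resolve(keys, name):
    """Return the stored key that matches `name`, or raise KeyError.

    Hypothetical helper for illustration; not uproot's implementation.
    """
    if isinstance(name, str):
        name = name.encode("ascii")          # treat plain strings as ASCII
    base, sep, cycle = name.partition(b";")
    candidates = []
    for key in keys:
        key_base, _, key_cycle = key.partition(b";")
        # Without an explicit cycle (no ";"), any cycle of that name matches.
        if key_base == base and (not sep or key_cycle == cycle):
            candidates.append((int(key_cycle), key))
    if not candidates:
        raise KeyError(name)
    return max(candidates)[1]                # highest cycle number wins


# Made-up keys: the name "one" was written twice, so it has two cycles.
keys = [b"one;1", b"one;2", b"three;1"]
latest = resolve(keys, "one")                # no cycle given
explicit = resolve(keys, b"one;1")           # cycle given explicitly
```

Here `latest` is the key with cycle 2 and `explicit` is the key with cycle 1.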

This file is deeply nested, so while you could find the TTree with

In [19]:
file["one"]["two"]["tree"]

<TTree b'tree' at 0x7fa3150d0898>

you can also find it using a directory path, with slashes.

In [20]:
file["one/two/tree"]

<TTree b'tree' at 0x7fa3150d09b0>

Here are a few more tricks for finding your way around a file:

   * the `keys()`, `values()`, and `items()` methods have `allkeys()`, `allvalues()`, `allitems()` variants that recursively search through all subdirectories;
   * all of these functions can be filtered by name or class: see [ROOTDirectory.keys](https://uproot.readthedocs.io/en/latest/root-io.html#uproot.rootio.ROOTDirectory.keys).

Here's how you would search the subdirectories to find all TTrees:

In [21]:
file.allkeys(filterclass=lambda cls: issubclass(cls, uproot.tree.TTreeMethods))

[b'one/two/tree;1', b'one/tree;1', b'three/tree;1']
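Conceptually, the recursive variants perform a depth-first walk over nested dict-like directories. The sketch below imitates that walk with plain Python dicts; `walk`, `FakeTree`, and the `keep` predicate are stand-ins invented for illustration. Unlike `filterclass`, which receives a class, `keep` receives the object itself.

```python
def walk(directory, prefix="", keep=lambda obj: True):
    """Yield slash-separated paths of objects under `directory`."""
    for name, obj in directory.items():
        path = prefix + name
        if keep(obj):
            yield path
        if isinstance(obj, dict):            # a subdirectory: recurse into it
            yield from walk(obj, path + "/", keep)


class FakeTree:
    """Stand-in for a TTree object."""


# Same layout as the example file, without cycle numbers.
layout = {
    "one": {"two": {"tree": FakeTree()}, "tree": FakeTree()},
    "three": {"tree": FakeTree()},
}

tree_paths = list(walk(layout, keep=lambda obj: isinstance(obj, FakeTree)))
every_path = list(walk(layout))
```

With the filter, `tree_paths` contains only the three tree paths; without it, `every_path` also includes the intermediate directories.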

Or get a Python dict of them:

In [23]:
all_ttrees = dict(file.allitems(filterclass=lambda cls: issubclass(cls, uproot.tree.TTreeMethods)))
all_ttrees

{b'one/two/tree;1': <TTree b'tree' at 0x7fa31508ceb8>,
 b'one/tree;1': <TTree b'tree' at 0x7fa31508cfd0>,
 b'three/tree;1': <TTree b'tree' at 0x7fa31508ca20>}

Be careful: a plain Python dict is not as forgiving about matching key names as a ROOTDirectory. `all_ttrees` is a plain Python dict, so in Python 3 the key must be a bytestring and must include the cycle number.

In [24]:
all_ttrees[b"one/two/tree;1"]

<TTree b'tree' at 0x7fa31508ceb8>
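If the strict keys of a plain dict are inconvenient, one option is to re-key the dict once. This is a minimal sketch that uses placeholder strings in place of real TTree objects; the `simplify_keys` name is made up for illustration, and dropping the cycle number is only safe when each name occurs with a single cycle.

```python
def simplify_keys(mapping, encoding="utf-8"):
    """Return a new dict with decoded keys and no ';cycle' suffix."""
    out = {}
    for key, value in mapping.items():
        text = key.decode(encoding)
        name, sep, _cycle = text.rpartition(";")
        if not sep:                          # key had no cycle suffix
            name = text
        if name in out:
            raise ValueError("duplicate name after dropping cycle: " + name)
        out[name] = value
    return out


# Placeholder values stand in for the TTree objects.
example = {
    b"one/two/tree;1": "tree A",
    b"one/tree;1": "tree B",
    b"three/tree;1": "tree C",
}
by_name = simplify_keys(example)
```

After this, `by_name["one/two/tree"]` works with an ordinary string.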

## Exploring a TTree

TTrees are special objects in ROOT files: they contain most of the physics data. Uproot presents TTrees as subclasses of [uproot.tree.TTreeMethods](https://uproot.readthedocs.io/en/latest/ttree-handling.html#uproot-tree-ttreemethods).

(**Why subclass?** Different ROOT files can have different versions of a class, so uproot generates Python classes to fit the data, as needed. All TTrees inherit from [uproot.tree.TTreeMethods](https://uproot.readthedocs.io/en/latest/ttree-handling.html#uproot-tree-ttreemethods) so that they get the same data-reading methods.)

In [26]:
events = uproot.open("http://scikit-hep.org/uproot/examples/Zmumu.root")["events"]
events

<TTree b'events' at 0x7fa314ff9f60>

Although [uproot.tree.TTreeMethods](https://uproot.readthedocs.io/en/latest/ttree-handling.html#uproot-tree-ttreemethods) objects behave like Python dicts of [uproot.tree.TBranchMethods](https://uproot.readthedocs.io/en/latest/ttree-handling.html#uproot-tree-tbranchmethods) objects, the easiest way to browse a TTree is by calling its `show()` method, which prints the branches and their interpretations as arrays.

In [37]:
events.keys()

[b'Type',
 b'Run',
 b'Event',
 b'E1',
 b'px1',
 b'py1',
 b'pz1',
 b'pt1',
 b'eta1',
 b'phi1',
 b'Q1',
 b'E2',
 b'px2',
 b'py2',
 b'pz2',
 b'pt2',
 b'eta2',
 b'phi2',
 b'Q2',
 b'M']

In [27]:
events.show()

Type                       (no streamer)              asstring()
Run                        (no streamer)              asdtype('>i4')
Event                      (no streamer)              asdtype('>i4')
E1                         (no streamer)              asdtype('>f8')
px1                        (no streamer)              asdtype('>f8')
py1                        (no streamer)              asdtype('>f8')
pz1                        (no streamer)              asdtype('>f8')
pt1                        (no streamer)              asdtype('>f8')
eta1                       (no streamer)              asdtype('>f8')
phi1                       (no streamer)              asdtype('>f8')
Q1                         (no streamer)              asdtype('>i4')
E2                         (no streamer)              asdtype('>f8')
px2                        (no streamer)              asdtype('>f8')
py2                        (no streamer)              asdtype('>f8')
pz2                        (no streame

Basic information about the TTree, such as its number of entries, is available as properties.

In [42]:
events.name, events.title, events.numentries

(b'events', b'Z -> mumu events', 2304)

# Reading arrays from a TTree

The bulk data in a TTree are not read until requested. There are many ways to do that:

   * select a TBranch and call [TBranchMethods.array](https://uproot.readthedocs.io/en/latest/ttree-handling.html#id11);
   * call [TTreeMethods.array](https://uproot.readthedocs.io/en/latest/ttree-handling.html#array) directly from the TTree object;
   * call [TTreeMethods.arrays](https://uproot.readthedocs.io/en/latest/ttree-handling.html#arrays) to get several arrays at a time;
   * call [TBranchMethods.lazyarray](https://uproot.readthedocs.io/en/latest/ttree-handling.html#id13), [TTreeMethods.lazyarray](https://uproot.readthedocs.io/en/latest/ttree-handling.html#lazyarray), [TTreeMethods.lazyarrays](https://uproot.readthedocs.io/en/latest/ttree-handling.html#lazyarrays), or [uproot.lazyarrays](https://uproot.readthedocs.io/en/latest/opening-files.html#uproot-lazyarray-and-lazyarrays) to get array-like objects that read on demand;
   * call [TTreeMethods.iterate](https://uproot.readthedocs.io/en/latest/ttree-handling.html#iterate) or [uproot.iterate](https://uproot.readthedocs.io/en/latest/opening-files.html#uproot-iterate) to explicitly iterate over chunks of data (to avoid reading more than would fit into memory);
   * call [TTreeMethods.pandas](https://uproot.readthedocs.io/en/latest/ttree-handling.html#id7) or [uproot.pandas.iterate](https://uproot.readthedocs.io/en/latest/opening-files.html#uproot-pandas-iterate) to get Pandas DataFrames ([Pandas](https://pandas.pydata.org/) must be installed).

Let's start with the simplest.

In [57]:
a = events.array("E1")
a

array([82.20186639, 62.34492895, 62.34492895, ..., 81.27013558,
       81.27013558, 81.56621735])

Since `array` is singular, you specify one branch name and get one array back. This is a [Numpy array](https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html) of 8-byte floating point numbers, the [Numpy dtype](https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html) specified by the `"E1"` branch's interpretation.

In [30]:
events["E1"].interpretation

asdtype('>f8')

We can use this array in Numpy calculations; see the [Numpy documentation](https://docs.scipy.org/doc/numpy/) for details.

In [58]:
import numpy

numpy.log(a)

array([4.40917801, 4.13268234, 4.13268234, ..., 4.39777861, 4.39777861,
       4.40141517])

Numpy arrays are also the standard container for entering data into machine learning frameworks; see this [Keras introduction](https://keras.io/), [PyTorch introduction](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html), [TensorFlow introduction](https://www.tensorflow.org/guide/low_level_intro), or [Scikit-Learn introduction](https://scikit-learn.org/stable/tutorial/basic/tutorial.html) to see how to put Numpy arrays to work in machine learning.

The [TBranchMethods.array](https://uproot.readthedocs.io/en/latest/ttree-handling.html#id11) method is the same as [TTreeMethods.array](https://uproot.readthedocs.io/en/latest/ttree-handling.html#array) except that you don't have to specify the TBranch name (naturally). Sometimes one is more convenient, sometimes the other.

In [32]:
events.array("E1"), events["E1"].array()

(array([82.20186639, 62.34492895, 62.34492895, ..., 81.27013558,
        81.27013558, 81.56621735]),
 array([82.20186639, 62.34492895, 62.34492895, ..., 81.27013558,
        81.27013558, 81.56621735]))

The plural `arrays` method is different. Whereas singular `array` returns exactly one array, plural `arrays` takes a list of names (possibly including wildcards) and returns all matching arrays in a Python dict keyed by branch name.

In [33]:
events.arrays(["px1", "py1", "pz1"])

{b'px1': array([-41.19528764,  35.11804977,  35.11804977, ...,  32.37749196,
         32.37749196,  32.48539387]),
 b'py1': array([ 17.4332439 , -16.57036233, -16.57036233, ...,   1.19940578,
          1.19940578,   1.2013503 ]),
 b'pz1': array([-68.96496181, -48.77524654, -48.77524654, ..., -74.53243061,
        -74.53243061, -74.80837247])}

In [35]:
events.arrays(["p[xyz]*"])

{b'px1': array([-41.19528764,  35.11804977,  35.11804977, ...,  32.37749196,
         32.37749196,  32.48539387]),
 b'py1': array([ 17.4332439 , -16.57036233, -16.57036233, ...,   1.19940578,
          1.19940578,   1.2013503 ]),
 b'pz1': array([-68.96496181, -48.77524654, -48.77524654, ..., -74.53243061,
        -74.53243061, -74.80837247]),
 b'px2': array([ 34.14443725, -41.19528764, -40.88332344, ..., -68.04191497,
        -68.79413604, -68.79413604]),
 b'py2': array([-16.11952457,  17.4332439 ,  17.29929704, ..., -26.10584737,
        -26.39840043, -26.39840043]),
 b'pz2': array([ -47.42698439,  -68.96496181,  -68.44725519, ..., -152.2350181 ,
        -153.84760383, -153.84760383])}
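To see why `"p[xyz]*"` selects exactly these six branches, the sketch below applies the same pattern to the branch names listed earlier. It assumes the wildcard follows shell-style glob rules as implemented by Python's standard `fnmatch` module; whether uproot uses this module internally is not something this example establishes.

```python
from fnmatch import fnmatchcase

# Branch names from events.keys(), written as plain strings.
branch_names = [
    "Type", "Run", "Event",
    "E1", "px1", "py1", "pz1", "pt1", "eta1", "phi1", "Q1",
    "E2", "px2", "py2", "pz2", "pt2", "eta2", "phi2", "Q2",
    "M",
]

# "p", then one of x/y/z, then anything: matches px1 but not pt1 or phi1.
selected = [name for name in branch_names if fnmatchcase(name, "p[xyz]*")]
```

The result keeps the original branch order and skips `pt1`, `pt2`, `phi1`, and `phi2`, whose second character is not `x`, `y`, or `z`.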

As with all ROOT object names, the TBranch names are bytestrings (prepended by `b`). If you know the encoding or it doesn't matter (`"ascii"` and `"utf-8"` are generic), pass a `namedecode` to get keys that are strings.

In [36]:
events.arrays(["p[xyz]*"], namedecode="utf-8")

{'px1': array([-41.19528764,  35.11804977,  35.11804977, ...,  32.37749196,
         32.37749196,  32.48539387]),
 'py1': array([ 17.4332439 , -16.57036233, -16.57036233, ...,   1.19940578,
          1.19940578,   1.2013503 ]),
 'pz1': array([-68.96496181, -48.77524654, -48.77524654, ..., -74.53243061,
        -74.53243061, -74.80837247]),
 'px2': array([ 34.14443725, -41.19528764, -40.88332344, ..., -68.04191497,
        -68.79413604, -68.79413604]),
 'py2': array([-16.11952457,  17.4332439 ,  17.29929704, ..., -26.10584737,
        -26.39840043, -26.39840043]),
 'pz2': array([ -47.42698439,  -68.96496181,  -68.44725519, ..., -152.2350181 ,
        -153.84760383, -153.84760383])}

These array-reading functions have many parameters, but most of them have the same names and meanings across all the functions. Rather than discussing all of them here, we'll present them in context in the sections on special features below.

# Caching data

Every time you ask for arrays, uproot goes to the file and re-reads them. For especially large arrays, this can take a long time.

For quicker access, uproot's array-reading functions have a **cache** parameter, which is an entry point for you to manage your own cache. The **cache** only needs to behave like a dict (many third-party Python caches do).

In [43]:
mycache = {}

# first time: reads from file
events.arrays(["p[xyz]*"], cache=mycache);

# any other time: reads from cache
events.arrays(["p[xyz]*"], cache=mycache);

In this example, the cache is a simple Python dict. Uproot has filled it with unique ID → array pairs, and it uses the unique ID to identify an array that it has previously read. You can see that it's full by looking at those keys:

In [45]:
mycache

{'AAGUS3fQmKsR56dpAQAAf77v;events;px1;asdtype(Bf8(),Lf8());0-2304': array([-41.19528764,  35.11804977,  35.11804977, ...,  32.37749196,
         32.37749196,  32.48539387]),
 'AAGUS3fQmKsR56dpAQAAf77v;events;py1;asdtype(Bf8(),Lf8());0-2304': array([ 17.4332439 , -16.57036233, -16.57036233, ...,   1.19940578,
          1.19940578,   1.2013503 ]),
 'AAGUS3fQmKsR56dpAQAAf77v;events;pz1;asdtype(Bf8(),Lf8());0-2304': array([-68.96496181, -48.77524654, -48.77524654, ..., -74.53243061,
        -74.53243061, -74.80837247]),
 'AAGUS3fQmKsR56dpAQAAf77v;events;px2;asdtype(Bf8(),Lf8());0-2304': array([ 34.14443725, -41.19528764, -40.88332344, ..., -68.04191497,
        -68.79413604, -68.79413604]),
 'AAGUS3fQmKsR56dpAQAAf77v;events;py2;asdtype(Bf8(),Lf8());0-2304': array([-16.11952457,  17.4332439 ,  17.29929704, ..., -26.10584737,
        -26.39840043, -26.39840043]),
 'AAGUS3fQmKsR56dpAQAAf77v;events;pz2;asdtype(Bf8(),Lf8());0-2304': array([ -47.42698439,  -68.96496181,  -68.44725519, ..., -152.

though they're not very human-readable.

If you're running out of memory, you could manually clear your cache by simply clearing the dict.

In [46]:
mycache.clear()
mycache

{}

Now the same line of code reads from the file again.

In [47]:
# not in cache: reads from file
events.arrays(["p[xyz]*"], cache=mycache);

## Automatically managed caches

This manual process of clearing the cache when you run out of memory is not very robust. What you want instead is a dict-like object that drops elements on its own when memory is scarce.

Uproot has an [uproot.cache.ArrayCache](https://uproot.readthedocs.io/en/latest/caches.html#uproot-cache-arraycache) class for this purpose, though it's a thin wrapper around the third-party [cachetools](https://pypi.org/project/cachetools/) library. Whereas [cachetools](https://pypi.org/project/cachetools/) drops old data from cache when a maximum number of items is reached, [uproot.cache.ArrayCache](https://uproot.readthedocs.io/en/latest/caches.html#uproot-cache-arraycache) drops old data when the data usage reaches a limit, specified in bytes.

In [54]:
mycache = uproot.cache.ArrayCache(100*1024)   # 100 kB
events.arrays("*", cache=mycache);

len(mycache), len(events.keys())

(6, 20)

With a limit of 100 kB, only 6 of the 20 arrays fit into the cache; the rest have been evicted.

All data sizes in uproot are specified in bytes (integers). Kilobytes, megabytes, and gigabytes are powers of 1024. I use `1024**3` as a convenient way to type 1 GB.
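The idea of a dict-like object that evicts by size rather than by item count can be sketched with the standard library alone. The `ByteLimitedCache` class below is illustrative only and is not the implementation of `uproot.cache.ArrayCache`; it evicts the least recently used entries, and it measures payloads with a `getsizeof` callable (for Numpy arrays one would pass something like `lambda a: a.nbytes`).

```python
from collections import OrderedDict
from collections.abc import MutableMapping


class ByteLimitedCache(MutableMapping):
    """Dict-like LRU cache bounded by total payload size in bytes."""

    def __init__(self, limitbytes, getsizeof=len):
        self.limitbytes = limitbytes
        self.getsizeof = getsizeof
        self._data = OrderedDict()
        self._used = 0

    def __getitem__(self, key):
        value = self._data[key]              # raises KeyError on a miss
        self._data.move_to_end(key)          # mark as most recently used
        return value

    def __setitem__(self, key, value):
        size = self.getsizeof(value)
        if key in self._data:
            del self[key]                    # replace an existing entry
        if size > self.limitbytes:
            return                           # too large to cache at all
        self._data[key] = value
        self._used += size
        while self._used > self.limitbytes:  # evict least recently used
            _, old = self._data.popitem(last=False)
            self._used -= self.getsizeof(old)

    def __delitem__(self, key):
        self._used -= self.getsizeof(self._data.pop(key))

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


cache = ByteLimitedCache(100 * 1024)         # 100 kB
for i in range(20):
    # Each payload mimics 2304 entries of 8 bytes (18432 bytes).
    cache["array-%d" % i] = bytes(18432)

kept = list(cache)
```

Because every payload here has the same size, only the five most recently stored entries remain; a real TTree mixes branch types of different sizes, so the count differs.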

The fact that any dict-like object may be a cache opens many possibilities. If you're struggling with a script that takes a long time to load data, then crashes, you may want to try a process-independent cache like [memcached](https://realpython.com/python-memcache-efficient-caching/). If you have a small, fast disk, you may want to consider [diskcache](http://www.grantjenks.com/docs/diskcache/tutorial.html) to temporarily hold arrays from ROOT files on the big, slow disk.

# Lazy arrays

If you call [TBranchMethods.array](https://uproot.readthedocs.io/en/latest/ttree-handling.html#id11), [TTreeMethods.array](https://uproot.readthedocs.io/en/latest/ttree-handling.html#array), or [TTreeMethods.arrays](https://uproot.readthedocs.io/en/latest/ttree-handling.html#arrays), uproot reads the file or cache immediately and returns an in-memory array. For exploratory work or to control memory usage, you might want to let the data be read on demand.

The [TBranchMethods.lazyarray](https://uproot.readthedocs.io/en/latest/ttree-handling.html#id13), [TTreeMethods.lazyarray](https://uproot.readthedocs.io/en/latest/ttree-handling.html#lazyarray), [TTreeMethods.lazyarrays](https://uproot.readthedocs.io/en/latest/ttree-handling.html#lazyarrays), and [uproot.lazyarrays](https://uproot.readthedocs.io/en/latest/opening-files.html#uproot-lazyarray-and-lazyarrays) functions take most of the same parameters but return lazy array objects, rather than Numpy arrays.

In [62]:
data = events.lazyarrays("*")
data

<ChunkedArray [<Row 0> <Row 1> <Row 2> ... <Row 2301> <Row 2302> <Row 2303>] at 0x7fa31473c9b0>

This `ChunkedArray` represents all the data in the file in chunks specified by ROOT's internal baskets (specifically, the places where the baskets align, called "clusters"). Each chunk contains a `VirtualArray`, which is read when any element from it is accessed.

In [67]:
data = events.lazyarrays(entrysteps=500)   # chunks of 500 events each
data["E1"]

<ChunkedArray [82.2018663875 62.3449289481 62.3449289481 ... 81.2701355756 81.2701355756 81.5662173543] at 0x7fa31473fe48>

Requesting `"E1"` through all the chunks and printing it (above) has caused the first and last chunks of the array to be read, because that's all that got written to the screen.

In [68]:
[chunk["E1"].ismaterialized for chunk in data.chunks]

[True, False, False, False, True]
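The pattern of materialization seen here can be reproduced with a toy model. The classes below are a simplified sketch, not the real `ChunkedArray` or `VirtualArray`: each chunk covers a fixed entry range and calls a loader the first time one of its elements is accessed. The `fake_read` loader stands in for reading from a file.

```python
class LazyChunk:
    """Holds one entry range and loads it on first access."""

    def __init__(self, start, stop, load):
        self.start, self.stop = start, stop
        self._load = load
        self._values = None

    @property
    def ismaterialized(self):
        return self._values is not None

    def __getitem__(self, i):
        if self._values is None:             # first access: read the range
            self._values = self._load(self.start, self.stop)
        return self._values[i]


class LazyChunks:
    """Sequence of entries split into fixed-size lazy chunks."""

    def __init__(self, numentries, entrysteps, load):
        self._numentries = numentries
        self._entrysteps = entrysteps
        self.chunks = [
            LazyChunk(start, min(start + entrysteps, numentries), load)
            for start in range(0, numentries, entrysteps)
        ]

    def __len__(self):
        return self._numentries

    def __getitem__(self, i):
        if i < 0:
            i += self._numentries
        if not 0 <= i < self._numentries:
            raise IndexError(i)
        chunk = self.chunks[i // self._entrysteps]
        return chunk[i - chunk.start]


def fake_read(start, stop):
    # Stand-in for reading entries [start, stop) from a file.
    return [float(i) for i in range(start, stop)]


data = LazyChunks(2304, 500, fake_read)
first, last = data[0], data[-1]              # touches only two chunks
state = [chunk.ismaterialized for chunk in data.chunks]
```

Accessing only the first and last elements leaves the three middle chunks unread, which is the same pattern that printing produces.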

These arrays can be used with [Numpy's universal functions](https://docs.scipy.org/doc/numpy/reference/ufuncs.html), which are the functions that apply a mathematical operation elementwise.

In [65]:
numpy.log(data["E1"])

<ChunkedArray [4.409178007248409 4.132682336791151 4.132682336791151 4.104655794838432 3.733527454020269 3.891440776178839 3.891440776178839 ...] at 0x7fa314754cc0>

Now all of the chunks have been read, because the values were needed to compute `log(E1)` for all `E1`.

In [66]:
[chunk["E1"].ismaterialized for chunk in data.chunks]

[True, True, True, True, True]

## Lazy array of many files

There's a lazy version of each of the array-reading functions in [TTreeMethods](https://uproot.readthedocs.io/en/latest/ttree-handling.html#uproot-tree-ttreemethods) and [TBranchMethods](https://uproot.readthedocs.io/en/latest/ttree-handling.html#uproot-tree-tbranchmethods), but there's also module-level [uproot.lazyarray](https://uproot.readthedocs.io/en/latest/opening-files.html#uproot.tree.lazyarray) and [uproot.lazyarrays](https://uproot.readthedocs.io/en/latest/opening-files.html#uproot.tree.lazyarrays). These functions let you make a lazy array that spans many files.

In [72]:
data = uproot.lazyarray(
    # list of files; local files can have wildcards (*)
    ["http://scikit-hep.org/uproot/examples/sample-%s-zlib.root" % x
        for x in ["5.23.02", "5.24.00", "5.25.02", "5.26.00", "5.27.02", "5.28.00",
                  "5.29.02", "5.30.00", "6.08.04", "6.10.05", "6.14.00"]],
    # TTree name in each file
    "sample",
    # branch(es) in each file for lazyarray(s)
    "f8")
data

<ChunkedArray [-14.9 -13.9 -12.9 ... 12.1 13.1 14.1] at 0x7fa314460550>

This `data` represents the entire set of files, and the only up-front processing that had to be done was to find out how many entries each TTree contains.

It uses the [uproot.numentries](https://uproot.readthedocs.io/en/latest/opening-files.html#uproot-numentries) shortcut method (which reads less data than normal file-opening):

In [76]:
dict(uproot.numentries(
    # list of files; local files can have wildcards (*)
    ["http://scikit-hep.org/uproot/examples/sample-%s-zlib.root" % x
        for x in ["5.23.02", "5.24.00", "5.25.02", "5.26.00", "5.27.02", "5.28.00",
                  "5.29.02", "5.30.00", "6.08.04", "6.10.05", "6.14.00"]],
    # TTree name in each file
    "sample",
    # total=True adds all values; total=False leaves them as a dict
    total=False))

{'http://scikit-hep.org/uproot/examples/sample-5.23.02-zlib.root': 30,
 'http://scikit-hep.org/uproot/examples/sample-5.24.00-zlib.root': 30,
 'http://scikit-hep.org/uproot/examples/sample-5.25.02-zlib.root': 30,
 'http://scikit-hep.org/uproot/examples/sample-5.26.00-zlib.root': 30,
 'http://scikit-hep.org/uproot/examples/sample-5.27.02-zlib.root': 30,
 'http://scikit-hep.org/uproot/examples/sample-5.28.00-zlib.root': 30,
 'http://scikit-hep.org/uproot/examples/sample-5.29.02-zlib.root': 30,
 'http://scikit-hep.org/uproot/examples/sample-5.30.00-zlib.root': 30,
 'http://scikit-hep.org/uproot/examples/sample-6.08.04-zlib.root': 30,
 'http://scikit-hep.org/uproot/examples/sample-6.10.05-zlib.root': 30,
 'http://scikit-hep.org/uproot/examples/sample-6.14.00-zlib.root': 30}
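Per-file entry counts like these are enough to route any global entry index to the right file without opening the others. The sketch below shows one way to do that with cumulative offsets and a binary search; the `locate` function is a hypothetical illustration, not how uproot is known to implement it.

```python
from bisect import bisect_right
from itertools import accumulate

counts = [30] * 11                           # entries per file, as above
offsets = [0] + list(accumulate(counts))     # start index of each file


def locate(global_index):
    """Map a global entry index to (file index, index within that file)."""
    if not 0 <= global_index < offsets[-1]:
        raise IndexError(global_index)
    file_index = bisect_right(offsets, global_index) - 1
    return file_index, global_index - offsets[file_index]


where = locate(45)                           # second file, its 16th entry
```

With 30 entries per file, global entry 45 falls in the file at index 1, at local position 15.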

## Lazy arrays with caching

By default, lazy arrays cache all data that have been read as long as the lazy array exists. To use a lazy array as a window into a very large dataset, you'll have to specify how much it's allowed to keep in memory at a time.

The caching mechanism is the same as before:

In [77]:
mycache = uproot.cache.ArrayCache(100*1024)   # 100 kB

data = events.lazyarrays(entrysteps=500, cache=mycache)
data

<ChunkedArray [<Row 0> <Row 1> <Row 2> ... <Row 2301> <Row 2302> <Row 2303>] at 0x7fa314346978>

Before performing a calculation, the cache is empty.

In [78]:
len(mycache)

0

In [80]:
numpy.sqrt((data["E1"] + data["E2"])**2 - (data["px1"] + data["px2"])**2 -
           (data["py1"] + data["py2"])**2 - (data["pz1"] + data["pz2"])**2)

<ChunkedArray [82.46269155513643 83.62620400526137 83.30846466680981 82.14937288090277 90.46912303551746 89.75766317061574 89.77394317215372 ...] at 0x7fa314313eb8>
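For reference, the expression evaluated above has the form of the invariant mass of a two-particle system, written here in natural units ($c = 1$). Reading the branches as energies and momentum components is an assumption based on their names.

$$
M = \sqrt{(E_1 + E_2)^2 - (p_{x1} + p_{x2})^2 - (p_{y1} + p_{y2})^2 - (p_{z1} + p_{z2})^2}
$$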

After performing the calculation, the cache contains only as many arrays as it could hold.

In [83]:
# arrays in cache  arrays in dataset
len(mycache),      len(data.chunks) * 8

(28, 40)