# Analyzing Data in a ROOT Tree
## Why Trees?

ROOT trees are optimized for the <span style="color:red">storage</span> of the kind of data produced in high-energy and nuclear physics experiments:

* Very large numbers of _events_ with essentially the same data structure
* Variable length containers holding identical data/objects
* Tree-like structures of collections of objects, similar to databases

ROOT trees are also optimized for the <span style="color:red">access patterns</span> we typically need in our analyses:

* "Column-wise" reading of individual data elements—only the element(s) of interest are read, not entire events
* Only one event (or part of it) in memory at a time (modulo buffer size)
* Buffered to disk, some degree of integrity protection during writing ("cycles")

Without structures like ROOT trees, the efficient analysis of petabyte-size data sets from CERN and elsewhere would be nearly impossible. ROOT trees

* minimize memory requirements and I/O
* greatly enhance access speed

In addition to simple n-tuple-like data, ROOT <span style="color:red">lets you write C++ objects to file</span>. Standard C++ offers no built-in way to do this; ROOT achieves it through class dictionaries generated with the _Cling_ C++ interpreter/compiler. The ability to write <span style="color:red">entire object trees</span> is critical for storing the very complex event data from HEP experiments. This capability is also known as "C++ object persistency". Some programming languages offer object persistency natively, but C++ does not.

## Hall A Analyzer trees

ROOT trees produced by the Hall A and Hall C analyzers have a rather simple structure. Each "global variable" (analysis result) selected for output to the tree is written to its own branch with only a single leaf. The name of the branch is identical to the name of the global variable, for example ```L.tr.p```. At this time, data branches always have the type ```Double_t```. We are planning to support other data types, in particular integers, starting with analyzer version 1.7.

In the case of arrays, a second branch is written to the tree whose name is ```Ndata.``` followed by the name of the corresponding array, _e.g._ ```Ndata.L.tr.p```. These branches always have type ```Int_t``` and hold the number of elements in the array for the current event. There is a lot of redundancy in ```Ndata``` variables for various elements of a single object, such as the collection of tracks. We may reduce or even eliminate this redundancy in a future version of the analyzer.
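The relationship between an array branch and its ```Ndata.``` counter can be sketched in plain C++. This is only a model of what a reader of the tree sees, not analyzer or ROOT code: the struct name, the function name, and the buffer limit `kMaxTracks` are assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical upper limit on the array length; not taken from the analyzer.
constexpr std::size_t kMaxTracks = 16;

// Model of one event: a counter plus a fixed-size buffer.
struct TrackMomentumEvent {
  int ndata = 0;                       // plays the role of Ndata.L.tr.p
  std::array<double, kMaxTracks> p{};  // plays the role of L.tr.p
};

// Return only the elements that are valid for this event.
// Anything stored beyond index ndata-1 is stale and must be ignored.
std::vector<double> validMomenta(const TrackMomentumEvent& ev) {
  std::vector<double> out;
  for (std::size_t i = 0; i < kMaxTracks && static_cast<int>(i) < ev.ndata; ++i) {
    out.push_back(ev.p[i]);
  }
  return out;
}
```

The point of the sketch is that the counter, not the buffer size, decides how many array elements belong to the current event.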

Here is an example of a typical Hall A analyzer output tree where all the Left-HRS track data have been written out with a ```block L.tr.*``` statement in the output definition file:

![TreeView](img/TreeView3.png)

To explore this exact tree yourself, start an interactive ```analyzer``` or ROOT session and type:
```
analyzer [1] f = TFile::Open("/data/ROOTfiles/g2p_3132.root","READ");
analyzer [2] b = new TBrowser;
```
or, if you have not downloaded the file yet,
```
analyzer [1] f = TFile::Open("http://hallaweb.jlab.org/data_reduc/AnaWork2018/ROOTfiles/g2p_3132.root");
analyzer [2] b = new TBrowser;
```

## How to work with tree data
### Text output

ROOT offers a number of ways to work with data in trees. First, there are two commonly used text-based commands, which work well for the n-tuples in our trees:

<div style="background: #d9edf7; border-color: #bce8f1; border-bottom: 5px solid #bce8f1; color: #31708f;  padding: 15px; margin-top: 20px; margin-bottom: 20px">
<ul>
<li> __Scan__: Prints a table in which each row corresponds to an event and each column to the data of one branch. If a variable-sized array has multiple entries, multiple rows are printed for a single event, one per array index (called "Instance"). Allows quick comparison of a number of columns (often faster and clearer than plotting).
<li> __Show__: Prints all data for a single event. The output can be large. Helps with understanding the data structure. Often used for inspecting unusual events such as misreconstructed tracks.
</ul></div>
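To make the "Row" and "Instance" columns of a scan concrete, here is a minimal plain C++ sketch of how variable-length arrays can be flattened into table rows. It is a model under stated assumptions, not ROOT's implementation: `Event`, `ScanRow`, and `scanModel` are hypothetical names, and the two arrays are assumed to be parallel (the shorter length is used as a guard).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Plain C++ stand-in for one tree entry with two variable-length arrays.
struct Event {
  std::vector<double> p;   // stands in for L.tr.p
  std::vector<double> vz;  // stands in for L.tr.vz
};

// One printed line of a scan-like table.
struct ScanRow {
  std::size_t row;       // entry (event) number
  std::size_t instance;  // array index within that entry
  double p;
  double vz;
};

// Flatten the first maxEntries events into (row, instance) lines.
std::vector<ScanRow> scanModel(const std::vector<Event>& events,
                               std::size_t maxEntries) {
  std::vector<ScanRow> rows;
  const std::size_t nEntries = std::min(maxEntries, events.size());
  for (std::size_t row = 0; row < nEntries; ++row) {
    const Event& ev = events[row];
    // Guard against unequal lengths; parallel arrays have equal sizes.
    const std::size_t n = std::min(ev.p.size(), ev.vz.size());
    for (std::size_t i = 0; i < n; ++i) {
      rows.push_back({row, i, ev.p[i], ev.vz[i]});
    }
  }
  return rows;
}
```

In this model an event with two tracks yields two lines with the same row number and instances 0 and 1, while an event with an empty array yields no line at all.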

Let's try these:

In [24]:
// Open ROOT file
f = TFile::Open("/data/ROOTfiles/g2p_3132.root","READ");

Ignore the warnings about ```"no dictionary for class ..."```. These occur because we are running plain ROOT and not the ```analyzer```.

In [25]:
// Look at the contents of the file
f->ls();

TFile**		/data/ROOTfiles/g2p_3132.root	
 TFile*		/data/ROOTfiles/g2p_3132.root	
  KEY: THaRun	Run_Data;2	g2p run 3132
  KEY: TTree	T;1	Hall A Analyzer Output DST


Note the tree named __T__.

In [29]:
// Print the value of the "L.tr.p" variable (momentum of reconstructed track in GeV)
// and "L.tr.vz" (vertex z-coordinate in m) for the first 10 events. 
T->Scan("L.tr.p:L.tr.vz","","",10);

***********************************************
*    Row   * Instance *    L.tr.p *   L.tr.vz *
***********************************************
*        0 *        0 * 2.2508660 * 0.0016883 *
*        1 *        0 * 2.2000959 * 0.0075512 *
*        2 *        0 * 2.2489658 * 0.0085386 *
*        3 *        0 * 2.2487922 * 0.0482533 *
*        4 *        0 * 2.2428897 * 0.0239013 *
*        5 *        0 * 2.2513184 * 0.0304236 *
*        6 *        0 *  2.206881 * 0.0085658 *
*        7 *        0 * 2.2273119 * 0.0132412 *
*        8 *        0 * 2.1854466 * 0.0153844 *
*        9 *        0 * 2.2493795 * 0.0435411 *
***********************************************


Let's scan again, this time selecting events with multiple tracks. Because these are rare (at the level of a few percent), let's scan the first 500 events.

In [30]:
T->Scan("L.tr.p:L.tr.vz","L.tr.n>1","",500);

***********************************************
*    Row   * Instance *    L.tr.p *   L.tr.vz *
***********************************************
*       38 *        0 * 3.0418453 * 2.3118241 *
*       38 *        1 * 2.8860089 * 2.6425863 *
*       44 *        0 * 52.622658 * 0.2657139 *
*       44 *        1 * 2.5149214 * -1.542678 *
*       44 *        2 * 2.9936564 * -1.244326 *
*       44 *        3 * 2.5968255 * -0.199140 *
*      156 *        0 * 2.5946465 * 3.0075283 *
*      156 *        1 * 2.3305075 * -0.080241 *
*      174 *        0 * 2.8993665 * 1.4813918 *
*      174 *        1 * 2.0904357 * 1.4947630 *
*      218 *        0 * 4.3860716 * 3.1441395 *
*      218 *        1 * 7.6736062 * 1.8077999 *
*      470 *        0 * 3.7749093 * 1.9965279 *
*      470 *        1 * 22.422996 * 3.7932473 *
*      470 *        2 * 3.7986176 * 2.7120001 *
*      480 *        0 * 1.9938786 * 1.9214297 *
*      480 *        1 * 113.13986 * 10.898452 *
*      484 *        0 * 2.3361050 * 4.64

Modify the command above to scan the first 1000 events or more. Once the output fills a good screenful, ROOT will ask whether you wish to continue or quit. (The prompt is ignored in the notebook environment.)

As you can see, there are now multiple instances per event (the event is identified by the row number). The ```L.tr.p``` and ```L.tr.vz``` arrays are parallel, _i.e._ the index has the same meaning for both arrays. If you are unsure whether two arrays are parallel, you can plot the corresponding ```Ndata``` elements against each other. To do so, we can use ```T->Draw()```, which we'll discuss in more detail later.
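The idea behind comparing the ```Ndata``` values can also be written down as a small check. The following plain C++ sketch assumes the per-event counts of the two arrays have already been collected into two vectors; the function name and its inputs are hypothetical. Equal counts in every event are a necessary condition for parallel arrays, and they correspond to all points lying on the diagonal in the plot.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Count the events in which the element counts of two arrays disagree.
// ndataA and ndataB hold one count per event (hypothetical inputs).
std::size_t countLengthMismatches(const std::vector<int>& ndataA,
                                  const std::vector<int>& ndataB) {
  const std::size_t n = std::min(ndataA.size(), ndataB.size());
  std::size_t mismatches = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (ndataA[i] != ndataB[i]) ++mismatches;
  }
  return mismatches;
}
```

A result of zero means the counts agree in every event that was compared. It does not by itself prove that equal indices refer to the same track.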