# ROOT RDataFrame

[RDataFrame documentation](https://root.cern/doc/master/classROOT_1_1RDataFrame.html)

- RDF is ROOT's high-level analysis interface. 

- Users define their analysis as a sequence of operations to be performed on the data-frame object; 

    - the framework takes care of the management of the loop over entries as well as low-level details such as I/O and parallelisation.

- RDataFrame provides methods for the most common operations required by ROOT analyses;

    - at the same time, users can just as easily specify custom code that will be executed in the event loop.
<img src="images/rdf_1.png">

# HEP data analysis with RDataFrame
RDataFrame reads and writes trees, with the aim of making HEP analyses easy to write and fast to run.

In [1]:
import ROOT

treename = "dataset"
filename = "data/example_file.root"
df = ROOT.RDataFrame(treename, filename)

print(f"Columns in the dataset: {df.GetColumnNames()}")

Welcome to JupyROOT 6.26/11
Columns in the dataset: { "a", "b", "vec1", "vec2" }


In [14]:
df.Display().Print()

+-----+------------+------------+------------+--------------+
| Row | a          | b          | vec1       | vec2         | 
+-----+------------+------------+------------+--------------+
| 0   | 0.97771140 | 0.99974175 | -3.22012f  | 0.894402f    | 
+-----+------------+------------+------------+--------------+
| 1   | 2.2802012  | 0.48497361 | -1.80835f  | 0.0800873f   | 
|     |            |            | 0.236065f  | 0.479906f    | 
|     |            |            | -3.97713f  | 0.519888f    | 
|     |            |            | -0.293643f | 0.317273f    | 
+-----+------------+------------+------------+--------------+
| 2   | 0.56348245 | 0.39231399 |            |              | 
+-----+------------+------------+------------+--------------+
| 3   | 3.0421559  | 0.33353925 | 0.727539f  | 0.796610f    | 
|     |            |            | -3.81258f  | 0.331128f    | 
|     |            |            | -2.87416f  | -0.00277938f | 
+-----+------------+------------+------------+--------------+

In [2]:
# Try it!
# To-do: Check the content of the ROOT file
# "!" tells the notebook that we are executing a bash command
# "rootls" lists the content of the ROOT file
!rootls data/example_file.root

# The printout is the name of the tree that the file contains

dataset


In [20]:
# Instructor: scroll up and hide the first cell. Do not look at cell 1; use cell 2 and the documentation.
# Try it!
# Documentation: https://root.cern/doc/master/classROOT_1_1RDataFrame.html
# To-do: Create an RDataFrame with the tree in the file "data/example_file.root"
df_try = ROOT.RDataFrame("dataset", "data/example_file.root")

Now we can `Define` new quantities, `Filter` rows based on custom expressions and retrieve some data aggregations such as a `Count` and a `Mean`:

In [15]:
# Instructor: go to documentation. Point to transformation section.
# Have student guess what functions to use

def1 = df.Define("c", "a+b")

fil1 = def1.Filter("c < 0.5")

count = fil1.Count()
mean = fil1.Mean("c")
display = fil1.Display(["a","b","c"])

print(f"Number of rows after filter: {count.GetValue()}")
print(f"Mean of column c after filter: {mean.GetValue()}")
print("Dataset contents:")
display.Print()

Number of rows after filter: 111
Mean of column c after filter: 0.3858351299780433
Dataset contents:
+-----+------------+-------------+------------+
| Row | a          | b           | c          | 
+-----+------------+-------------+------------+
| 11  | 0.28843593 | 0.042303163 | 0.33073909 | 
+-----+------------+-------------+------------+
| 13  | 0.25636993 | 0.12553929  | 0.38190921 | 
+-----+------------+-------------+------------+
| 30  | 0.11045690 | 0.31782435  | 0.42828125 | 
+-----+------------+-------------+------------+
| 92  | 0.26561222 | 0.17973985  | 0.44535206 | 
+-----+------------+-------------+------------+
| 107 | 0.20609357 | 0.17557919  | 0.38167276 | 
+-----+------------+-------------+------------+


In [21]:
# Try it!
# To-do: Define a column named "c_try", with its content being a*2. Then define a column named "d_try", with its content being c_try+b. Then keep only the rows with "d_try>1"
df_try = df_try.Define("c_try","a*2").Define("d_try","c_try+b").Filter("d_try>1")

# To-do: Get the number of rows passing the requirement with Count() and get the number with GetValue()
df_try.Count().GetValue()

1820

In [25]:
# To-do: Check the column names of df_try with GetColumnNames()
print(df_try.GetColumnNames())

{ "a", "b", "c_try", "d_try", "vec1", "vec2" }


# Think about data-flow
RDataFrame is built with a modular and flexible workflow in mind, summarised as follows:

* build a data-frame object by specifying your data-set
* apply a series of transformations to your data
  * filter (e.g. apply some cuts) or
  * define a new column (e.g. the result of an expensive computation on columns)
* apply actions to the transformed data to produce results (e.g. fill a histogram)

### Important Note!
Make sure to **book all transformations and actions before** you access the contents of any of the results: this lets RDataFrame accumulate work and then produce all results at the same time, upon first access to any of them.

In [28]:
%%time
df1 = df.Filter("a >= 0.2")

# Book in advance all Mean operations on the dataset columns
cols = df.GetColumnNames()
mean_ops = [df1.Mean(col) for col in cols]

# Ask the result of one mean operation.
# RDataFrame will process the whole computation graph
print(f"Number of RDataFrame runs so far: {df.GetNRuns()}")
print(f"First mean result is: {mean_ops[0].GetValue()}")
print(f"Number of RDataFrame runs so far: {df.GetNRuns()}")

Number of RDataFrame runs so far: 7
First mean result is: 14.28575315324604
Number of RDataFrame runs so far: 8
CPU times: user 406 ms, sys: 40.8 ms, total: 447 ms
Wall time: 1.32 s


In [29]:
%%time
# Print all results, the event loop won't be run another time
for col, mean_op in zip(cols, mean_ops):
    print(f"Mean value of {col}: {mean_op.GetValue()}")
print(f"Number of RDataFrame runs so far: {df.GetNRuns()}")

Mean value of a: 14.28575315324604
Mean value of b: 0.5107120234035626
Mean value of vec1: 0.35305373042684285
Mean value of vec2: -0.0048903633967354275
Number of RDataFrame runs so far: 8
CPU times: user 183 µs, sys: 30 µs, total: 213 µs
Wall time: 212 µs


# Histograms with RDataFrame
RDataFrame helps you streamline the creation and filling of histogram objects from your data. 

For example:

In [30]:
%jsroot on
c = ROOT.TCanvas()
h = df.Histo1D("vec1")
h.Draw()
c.Draw()

In [31]:
# Try it!
# To-do: Get a histogram of the "d_try" column from the df_try RDataFrame
h_try = df_try.Histo1D("d_try")

In [32]:
# To-do: Draw h_try and confirm that the number of entries agrees with the output of Count() from the earlier step
h_try.Draw()
c.Draw()

- `Histo1D` will create a one-dimensional histogram holding `double` values. 

- `Histo{2,3}D` do the same in higher dimensions. 

- These operations also accept a tuple with the same arguments that would be passed to the equivalent histogram object constructors. 

- For example:

In [33]:
histo_name = "histo_name"
histo_title = "histo_title"
nbinsx = 100
xlow = -10
xup = 10

# The traditional TH1D constructor
# ROOT.TH1D(histo_name, histo_title, nbinsx, xlow, xup)

# With RDataFrame
c = ROOT.TCanvas()
h = df.Histo1D((histo_name, histo_title, nbinsx, xlow, xup), "vec1")
h.Draw()
c.Draw()

In [39]:
# Try it!
# To-do: Zoom in to the x range of -4 to 4 by directly changing the code below
h = df.Histo1D((histo_name, histo_title, 80, -4, 4), "vec1")
h.Draw()
c.Draw()

# Operation categories in RDataFrame
There are 3 main types of operations you can perform on RDataFrames:

In [40]:
%%html
<style>
  th { font-size: 30px } /* title */
  td { font-size: 30px } /* datatable */
</style>

**Transformations**: manipulate the dataset, return a modified RDataFrame for further processing.

| Transformation    | Description                                                |
|-------------------|------------------------------------------------------------|
| Alias()           | Introduce an alias for a particular column name.           |
| Define()          | Create a new column in the dataset.                        |
| Filter()          | Filter rows based on user-defined conditions.              |

**Actions**: aggregate (parts of) the dataset into a result.

| Action                          | Description                                                                      |
|---------------------------------|----------------------------------------------------------------------------------|
| Count()                         | Return the number of events processed.                                           |
| Display()                       | Provide a printable object representing the dataset contents.                    |
| Graph()                         | Fill a TGraph with the two columns provided.                                     |
| Histo1D(), Histo2D(), Histo3D() | Fill a one-, two-, three-dimensional histogram with the processed column values. |
| Max(), Min()                    | Return the maximum (minimum) of the processed column values.                     |
| Snapshot()                      | Write the processed dataset to a new TTree.                                      |
| ...                             | ...                                                                              |

**Queries**: these methods query information about your dataset and the RDataFrame status.

| Operation           | Description                                                                              |
|---------------------|------------------------------------------------------------------------------------------|
| GetColumnNames()    | Get the names of all the available columns of the dataset.                               |
| GetColumnType()     | Return the type of a given column as a string.                                           |
| SaveGraph()         | Export the computation graph of an RDataFrame in graphviz format for easy inspection.     |
| ...                 | ...                                                                                      |