![ROOT Logo](http://root.cern.ch/img/logos/ROOT_Logo/website-banner/website-banner-%28not%20root%20picture%29.jpg)
<br />
# **NanoAOD files processed with Distributed RDataFrame in Python**
<hr style="border-top-width: 4px; border-top-color: #34609b;">


This notebook runs the [`df102_NanoAODDimuonAnalysis`](https://root.cern.ch/doc/master/df102__NanoAODDimuonAnalysis_8py.html) ROOT tutorial with PyRDF.

The NanoAOD-like input files contain 66 million events with muon candidates, taken from the 2012 CMS OpenData datasets ([DOI: 10.7483/OPENDATA.CMS.YLIC.86ZZ](http://opendata.cern.ch/record/6004) and [DOI: 10.7483/OPENDATA.CMS.M5AD.Y3V3](http://opendata.cern.ch/record/6030)).

The macro matches muon pairs and produces a histogram of the dimuon mass spectrum, showing resonances up to the Z mass. Note that the bump at 30 GeV is not a resonance but a trigger effect.
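
To make the muon-pair matching concrete, here is a minimal pure-Python sketch of a dimuon invariant mass computation. It assumes the selection keeps events with exactly two muons of opposite charge and that each muon is described by (pt, eta, phi, mass). The helper names `four_vector`, `invariant_mass` and `select_dimuon_mass` are hypothetical and are not part of ROOT or PyRDF.

```python
import math


def four_vector(pt, eta, phi, mass):
    """Convert (pt, eta, phi, mass) into Cartesian (E, px, py, pz)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return energy, px, py, pz


def invariant_mass(mu1, mu2):
    """Invariant mass of two particles given as (pt, eta, phi, mass) tuples."""
    e1, px1, py1, pz1 = four_vector(*mu1)
    e2, px2, py2, pz2 = four_vector(*mu2)
    m2 = (e1 + e2)**2 - (px1 + px2)**2 - (py1 + py2)**2 - (pz1 + pz2)**2
    # Guard against tiny negative values caused by floating-point rounding
    return math.sqrt(max(m2, 0.0))


def select_dimuon_mass(muons, charges):
    """Return the dimuon mass for a two-muon, opposite-charge event, else None.

    Assumed selection (hypothetical helper): exactly two muons with
    opposite charge.
    """
    if len(muons) != 2 or charges[0] == charges[1]:
        return None
    return invariant_mass(muons[0], muons[1])


# Two back-to-back muons with pt = 45.6 GeV give a mass close to the Z mass
event = [(45.6, 0.0, 0.0, 0.1057), (45.6, 0.0, math.pi, 0.1057)]
print(select_dimuon_mass(event, [+1, -1]))
```

In an RDataFrame, the same logic would typically be expressed as filters and a `Define` on the muon columns; the sketch only illustrates the arithmetic.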

Some more details about the dataset:
- It contains about 66 million events (muon and electron collections, plus some other information, e.g. about primary vertices).
- It spans two compressed ROOT files located on EOS, with a total size of about 7.5 GB.

Date: April 2019<br>
Author: Stefan Wunsch (KIT, CERN)<br>
Adapted to PyRDF: Javier Cervantes Villanueva (CERN)

**Requirements: ROOT-HEAD (Use the Bleeding Edge in the SWAN configuration)**

In [7]:
import ROOT

RDataFrame = ROOT.RDF.Experimental.Distributed.AWS.RDataFrame


def run_cpubound(npartitions=10):
    # Create the RDF
    # Increasing nentries would increase the overall runtime
    nentries = int(1e9)
    df = RDataFrame(nentries, npartitions=npartitions)

    # Decide parameters of the random distributions of the RDF columns
    gaus_mean = 10
    gaus_sigma = 1
    exp_tau = 20
    poisson_mean = 30

    df_withcols = df.Define("x", f"gRandom->Gaus({gaus_mean},{gaus_sigma})")\
                    .Define("y", f"gRandom->Exp({exp_tau})")\
                    .Define("z", f"gRandom->PoissonD({poisson_mean})")

    # Decide how many operations per column you want to run
    # Increasing this would increase the overall runtime
    nops_percol = 10
    oplist = [df_withcols.Mean(colname) for colname in ["x", "y", "z"] for _ in range(nops_percol)]

    # Start a stopwatch and trigger the execution of the computation graph.
    # Asking for the first value in the list is enough to trigger everything
    print("Starting the CPU bound benchmark.")
    t = ROOT.TStopwatch()
    first_value = oplist[0].GetValue()
    realtime = round(t.RealTime(), 2)
    print(f"CPU bound benchmark finished in {realtime} seconds.")

    # Decide the name of the output csv to store runtime information.
    outcsv = "distrdf_cpubound.csv"

    with open(outcsv, "a+") as f:
        f.write(str(realtime))
        f.write("\n")

# run_cpubound()
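
The `npartitions` argument controls how many independent tasks the entry range is divided into. As a rough illustration, the sketch below splits `nentries` into contiguous, nearly equal ranges. This is an assumption about how such a split could work, not the actual logic of the distributed backend, and `split_entries` is a hypothetical helper.

```python
def split_entries(nentries, npartitions):
    """Split [0, nentries) into npartitions contiguous (start, end) ranges.

    Hypothetical helper: the first `remainder` ranges get one extra entry
    so that all entries are covered exactly once.
    """
    base, remainder = divmod(nentries, npartitions)
    ranges = []
    start = 0
    for i in range(npartitions):
        end = start + base + (1 if i < remainder else 0)
        ranges.append((start, end))
        start = end
    return ranges


# With the benchmark defaults, each of the 10 tasks would process 1e8 entries
ranges = split_entries(int(1e9), 10)
print(ranges[0], ranges[-1])
```

Under this assumption, more partitions mean smaller tasks, which allows more parallelism but adds per-task scheduling and merging overhead.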

In [8]:
for partition in range(2,7):
    run_cpubound(10*2**(6-partition))

Starting the CPU bound benchmark.
Benchmark report: AWSBENCH(npartitions=160, mapwalltime=198.0167, reducewalltime=2.2468)
CPU bound benchmark finished in 204.05 seconds.
Starting the CPU bound benchmark.
Benchmark report: AWSBENCH(npartitions=80, mapwalltime=203.5296, reducewalltime=0.9778)
CPU bound benchmark finished in 207.16 seconds.
Starting the CPU bound benchmark.
Benchmark report: AWSBENCH(npartitions=40, mapwalltime=219.1495, reducewalltime=0.4676)
CPU bound benchmark finished in 222.42 seconds.
Starting the CPU bound benchmark.
Benchmark report: AWSBENCH(npartitions=20, mapwalltime=253.9883, reducewalltime=0.666)
CPU bound benchmark finished in 257.27 seconds.
Starting the CPU bound benchmark.
Benchmark report: AWSBENCH(npartitions=10, mapwalltime=317.3827, reducewalltime=0.1062)
CPU bound benchmark finished in 320.6 seconds.
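
To compare the runs, the sketch below pairs each partition count from the loop with the wall-clock time printed above and computes the speedup relative to the 10-partition run. The timings are copied by hand from the output; the variable names are illustrative only.

```python
# Partition counts generated by the loop: 10 * 2**(6 - p) for p = 2..6
partitions = [10 * 2 ** (6 - p) for p in range(2, 7)]

# Wall-clock times in seconds, copied from the output above in the same order
realtimes = [204.05, 207.16, 222.42, 257.27, 320.6]

# Use the run with the fewest partitions (10) as the baseline
baseline = realtimes[-1]
speedups = {n: round(baseline / t, 2) for n, t in zip(partitions, realtimes)}

for n in partitions:
    print(f"npartitions={n:3d}  speedup vs 10 partitions: {speedups[n]}")
```

Going from 10 to 160 partitions shortens the run by a factor of roughly 1.57, so the scaling in this measurement is far from linear in the number of partitions.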
