# Datasets
Author: Javier Duarte


## Load datasets from `ROOT` files using `uproot`
Here we load the `ROOT` datasets in Python using `uproot` (see: https://github.com/scikit-hep/uproot). For more information about how to use `uproot`, see the [`Uproot and Awkward Array for columnar analysis HATS@LPC`](https://indico.cern.ch/e/uproothats2021) tutorial.

In [1]:
import uproot

## Load `ROOT` files
Here we download two `ROOT` files from Zenodo, open them with `uproot`, and display the branch content of one of the trees.

In [2]:
import numpy as np
import h5py

treename = 'HZZ4LeptonsAnalysisReduced'
filename = {}
upfile = {}

!mkdir -p data
!wget -O data/ntuple_4mu_bkg.root https://zenodo.org/record/3901869/files/ntuple_4mu_bkg.root?download=1
!wget -O data/ntuple_4mu_VV.root https://zenodo.org/record/3901869/files/ntuple_4mu_VV.root?download=1

filename['bkg'] = 'data/ntuple_4mu_bkg.root'
filename['VV'] = 'data/ntuple_4mu_VV.root'

upfile['bkg'] = uproot.open(filename['bkg'])
upfile['VV'] = uproot.open(filename['VV'])

# show() prints the branch table itself and returns None, so no print() is needed
upfile['bkg'][treename].show()

--2022-07-11 20:22:58--  https://zenodo.org/record/3901869/files/ntuple_4mu_bkg.root?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8867265 (8.5M) [application/octet-stream]
Saving to: ‘data/ntuple_4mu_bkg.root’


2022-07-11 20:23:01 (8.93 MB/s) - ‘data/ntuple_4mu_bkg.root’ saved [8867265/8867265]

--2022-07-11 20:23:01--  https://zenodo.org/record/3901869/files/ntuple_4mu_VV.root?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4505518 (4.3M) [application/octet-stream]
Saving to: ‘data/ntuple_4mu_VV.root’


2022-07-11 20:23:03 (5.81 MB/s) - ‘data/ntuple_4mu_VV.root’ saved [4505518/4505518]

name                 | typename                 | interpretation                
---------------------+------------------

## Convert tree to `pandas` DataFrames
In my opinion, `pandas` DataFrames are a more convenient and flexible data container in Python (see: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

In [3]:
import pandas as pd

branches = ['f_mass4l','f_massjj']

df = {}
df['bkg'] = upfile['bkg'][treename].arrays(branches, library='pd')
df['VV'] = upfile['VV'][treename].arrays(branches, library='pd')

# print first entry
print(df['bkg'].iloc[:1])

# print shape of DataFrame
print(df['bkg'].shape)

# print first entry for f_mass4l and f_massjj
print(df['bkg'][branches].iloc[:1])

# convert back into unstructured NumPy array
print(df['bkg'].values)
print(df['bkg'].values.shape)

# get boolean mask array
mask = (df['bkg']['f_mass4l'] > 125)
print(mask)

# cut using this boolean mask array
print(df['bkg']['f_mass4l'][mask])

    f_mass4l  f_massjj
0  91.098129    -999.0
(58107, 2)
    f_mass4l  f_massjj
0  91.098129    -999.0
[[  91.09813  -999.      ]
 [ 201.84761  -999.      ]
 [  89.279076 -999.      ]
 ...
 [  90.129845 -999.      ]
 [ 250.97742  -999.      ]
 [ 229.47015  -999.      ]]
(58107, 2)
0        False
1         True
2        False
3         True
4         True
         ...  
58102    False
58103     True
58104    False
58105     True
58106     True
Name: f_mass4l, Length: 58107, dtype: bool
1        201.847610
3        586.597412
4        135.589798
5        734.903442
6        341.958466
            ...    
58097    225.355103
58098    214.074249
58103    252.845184
58105    250.977417
58106    229.470154
Name: f_mass4l, Length: 42219, dtype: float32
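Boolean masks can also be combined to apply several cuts at once. The sketch below uses a small toy DataFrame with hypothetical values rather than the real ntuples, and it assumes that `f_massjj == -999` is a placeholder for events without a dijet mass; that reading is suggested by the output above but is not confirmed by the source.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from the ntuples.
toy = pd.DataFrame({
    "f_mass4l": np.array([91.1, 201.8, 135.6, 124.0], dtype=np.float32),
    "f_massjj": np.array([-999.0, 350.2, -999.0, 80.5], dtype=np.float32),
})

# One boolean mask per condition.
mass_cut = toy["f_mass4l"] > 125
has_dijet = toy["f_massjj"] != -999.0  # assumption: -999 marks a missing value

# Summing a boolean mask counts the entries that pass.
n_pass_mass = int(mass_cut.sum())

# Combine masks with & (and), | (or), ~ (not); rows passing both cuts are kept.
selected = toy[mass_cut & has_dijet]

# The same selection written as a query string.
same_selection = toy.query("f_mass4l > 125 and f_massjj != -999")

print(n_pass_mass)
print(selected)
```

Indexing the full DataFrame with a mask keeps all columns of the selected rows, whereas `df['bkg']['f_mass4l'][mask]` above returns only the `f_mass4l` column.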
