# Datasets
Author: Javier Duarte


## Load datasets from `ROOT` files using `uproot`
Here we load the `ROOT` datasets in Python using `uproot` (see: https://github.com/scikit-hep/uproot). For more information about how to use `uproot`, see the [`Uproot and Awkward Array for columnar analysis HATS@LPC`](https://indico.cern.ch/e/uproothats2020) tutorial.

In [1]:
import uproot

## Load `ROOT` files
Here we download two `ROOT` files, open them with `uproot`, and display the branches of one of the trees.

In [2]:
import numpy as np
import h5py

treename = 'HZZ4LeptonsAnalysisReduced'
filename = {}
upfile = {}

!mkdir -p data
!wget -O data/ntuple_4mu_bkg.root "https://zenodo.org/record/3901869/files/ntuple_4mu_bkg.root?download=1"
!wget -O data/ntuple_4mu_VV.root "https://zenodo.org/record/3901869/files/ntuple_4mu_VV.root?download=1"

filename['bkg'] = 'data/ntuple_4mu_bkg.root'
filename['VV'] = 'data/ntuple_4mu_VV.root'

upfile['bkg'] = uproot.open(filename['bkg'])
upfile['VV'] = uproot.open(filename['VV'])

# show() prints the branch listing itself and returns None, so no print() is needed
upfile['bkg'][treename].show()

--2020-06-20 17:20:16--  https://zenodo.org/record/3901869/files/ntuple_4mu_bkg.root?download=1
Resolving zenodo.org (zenodo.org)... 188.184.117.155
Connecting to zenodo.org (zenodo.org)|188.184.117.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8867265 (8.5M) [application/octet-stream]
Saving to: ‘data/ntuple_4mu_bkg.root’


2020-06-20 17:20:19 (4.99 MB/s) - ‘data/ntuple_4mu_bkg.root’ saved [8867265/8867265]

--2020-06-20 17:20:19--  https://zenodo.org/record/3901869/files/ntuple_4mu_VV.root?download=1
Resolving zenodo.org (zenodo.org)... 188.184.117.155
Connecting to zenodo.org (zenodo.org)|188.184.117.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4505518 (4.3M) [application/octet-stream]
Saving to: ‘data/ntuple_4mu_VV.root’


2020-06-20 17:20:20 (5.05 MB/s) - ‘data/ntuple_4mu_VV.root’ saved [4505518/4505518]

f_run                      (no streamer)              asdtype('>i4')
f_lumi                     (no streamer)    
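
The `!wget` lines above fetch both files again every time the cell is run. As a minimal sketch of an alternative using only the Python standard library, the helper below skips the download when the file is already on disk. The name `fetch_if_missing` and the `.part` temporary suffix are hypothetical choices for illustration, not part of the original notebook.

```python
import os
import urllib.request


def fetch_if_missing(url, path):
    """Download `url` to `path` unless `path` already exists; return `path`.

    Hypothetical helper for illustration only.
    """
    if os.path.exists(path):
        # File is already cached locally, so skip the network request
        return path
    # Create the target directory (e.g. "data") if it does not exist yet
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    # Download to a temporary name first so an interrupted transfer
    # does not leave a truncated file under the final name
    tmp_path = path + ".part"
    urllib.request.urlretrieve(url, tmp_path)
    os.replace(tmp_path, path)
    return path


# Example call, using the same URL and path as the cell above:
# fetch_if_missing(
#     "https://zenodo.org/record/3901869/files/ntuple_4mu_bkg.root?download=1",
#     "data/ntuple_4mu_bkg.root",
# )
```

The returned path can be passed straight to `uproot.open`, so the rest of the cell would not need to change.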

## Convert tree to `pandas` DataFrames
In my opinion, `pandas` DataFrames are a more convenient and flexible data container in Python; see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html.
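
As a self-contained sketch of the DataFrame operations used in this section (column selection, conversion to a NumPy array, and boolean masking), the example below builds a small toy table. The column names `f_mass4l` and `f_massjj` are taken from the ntuple, but the numerical values are made up for illustration and the name `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ntuple: these values are invented for illustration
toy = pd.DataFrame(
    {
        "f_mass4l": [91.1, 126.3, 124.9, 200.5],
        "f_massjj": [-999.0, 310.2, -999.0, 85.7],
    }
)
toy.index.name = "entry"

# Selecting with a list of column names returns a DataFrame;
# .iloc[:1] keeps only the first row
first = toy[["f_mass4l", "f_massjj"]].iloc[:1]

# Convert to a plain 2D NumPy array of shape (rows, columns)
values = toy.to_numpy()

# A comparison on a column yields a boolean Series aligned on the index
mask = toy["f_mass4l"] > 125

# Indexing with the mask keeps only the rows where it is True
selected = toy["f_mass4l"][mask]

print(first)
print(values.shape)
print(selected)
```

The masked selection can also be written as `toy.loc[mask, "f_mass4l"]`, which does the row and column selection in a single step.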

In [3]:
import pandas as pd

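df = {}
# .pandas.df() is the uproot 3 interface; in uproot 4 and later
# the equivalent call is .arrays(library='pd')
df['bkg'] = upfile['bkg'][treename].pandas.df()
df['VV'] = upfile['VV'][treename].pandas.df()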

# print first entry
print(df['bkg'].iloc[:1])

# print shape of DataFrame
print(df['bkg'].shape)

# print first entry for f_mass4l and f_massjj
print(df['bkg'][['f_mass4l','f_massjj']].iloc[:1])

# convert to an unstructured NumPy array
print(df['bkg'].values)
print(df['bkg'].values.shape)

# get boolean mask array
mask = (df['bkg']['f_mass4l'] > 125)
print(mask)

# cut using this boolean mask array
print(df['bkg']['f_mass4l'][mask])

       f_run  f_lumi  f_event  f_weight  f_int_weight  f_pu_weight  \
entry                                                                
0          1       4      630  0.000648           0.0     1.229054   

       f_eff_weight  f_lept1_pt  f_lept1_eta  f_lept1_phi  ...  f_jet2_pt  \
entry                                                      ...              
0               1.0    32.80312     0.354339     -1.41646  ...        0.0   

       f_jet2_eta  f_jet2_phi  f_jet2_e  f_D_bkg_kin   f_D_bkg    f_D_gg  \
entry                                                                      
0             0.0         0.0       0.0     0.363088  0.363088 -0.000022   

         f_D_g4  f_Djet_VAJHU    f_pfmet  
entry                                     
0      0.827116          -1.0  18.884806  

[1 rows x 62 columns]
(58107, 62)
        f_mass4l  f_massjj
entry                     
0      91.098129    -999.0
[[ 1.00000000e+00  4.00000000e+00  6.30000000e+02 ...  8.27115893e-01
  -1.00000000