# Datasets
Author: Javier Duarte


## Convert datasets from `ROOT` to `HDF5`
Here we convert the datasets using the `root2hdf5` utility, which comes with `rootpy`: http://www.rootpy.org/commands/root2hdf5.html

In [1]:
%%bash
root2hdf5 -f data/ntuple_4mu_VV.root data/ntuple_4mu_bkg.root

INFO:rootpy.root2hdf5] Converting data/ntuple_4mu_VV.root ...
INFO:rootpy.root2hdf5] Will convert 1 tree in /
INFO:rootpy.root2hdf5] Converting tree 'HZZ4LeptonsAnalysisReduced' with 25817 entries ...
INFO:rootpy.root2hdf5] Created data/ntuple_4mu_VV.h5
INFO:rootpy.root2hdf5] Converting data/ntuple_4mu_bkg.root ...
INFO:rootpy.root2hdf5] Will convert 1 tree in /
INFO:rootpy.root2hdf5] Converting tree 'HZZ4LeptonsAnalysisReduced' with 58107 entries ...
INFO:rootpy.root2hdf5] Created data/ntuple_4mu_bkg.h5
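
To check what a conversion wrote, one option is to walk the output file with `h5py`. The following is a minimal sketch, not part of the original notebook: the helper name `list_datasets` is hypothetical, and the demo file is a stand-in with toy values, created in a temporary directory so the snippet runs without the real ntuples.

```python
import os
import tempfile

import h5py
import numpy as np


def list_datasets(path):
    """Return {dataset name: (shape, field names)} for every dataset in an HDF5 file."""
    found = {}

    def visit(name, obj):
        # visititems calls this for every group and dataset; keep datasets only
        if isinstance(obj, h5py.Dataset):
            found[name] = (obj.shape, obj.dtype.names)

    with h5py.File(path, 'r') as f:
        f.visititems(visit)
    return found


# Stand-in file with one structured dataset (toy values, not real data)
toy = np.array([(91.1, -999.0), (125.3, 410.2)],
               dtype=[('f_mass4l', 'f4'), ('f_massjj', 'f4')])
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.h5')
with h5py.File(demo_path, 'w') as f:
    f.create_dataset('HZZ4LeptonsAnalysisReduced', data=toy)

summary = list_datasets(demo_path)
print(summary)
```

Called on one of the converted files, for example `list_datasets('data/ntuple_4mu_bkg.h5')`, the same helper should list the tree named in the log output, assuming the file layout matches what the log reports.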


## Load `HDF5` datasets
Here we load the converted `HDF5` datasets into structured `NumPy` arrays. Note that these structured arrays let you manipulate the data by named fields: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.rec.html
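
As a quick illustration of access by named field, here is a minimal sketch on a toy structured array. The field names are borrowed from the ntuple, but the values are made up for illustration and the variable names are hypothetical.

```python
import numpy as np

# Toy structured array with two named fields (values are illustrative only)
events = np.array([(91.1, -999.0), (125.3, 410.2), (240.7, 88.5)],
                  dtype=[('f_mass4l', 'f4'), ('f_massjj', 'f4')])

# Access a whole column by field name
mass4l = events['f_mass4l']

# Access one field of one entry
first_massjj = events[0]['f_massjj']

# Select entries with a boolean mask built from a field
high_mass = events[events['f_mass4l'] > 125]

print(events.dtype.names)
print(mass4l.shape, first_massjj, len(high_mass))
```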

In [2]:
import numpy as np
import h5py

treename = 'HZZ4LeptonsAnalysisReduced'
filename = {}
h5file = {}
params = {}

filename['bkg'] = 'data/ntuple_4mu_bkg.h5'
filename['VV'] = 'data/ntuple_4mu_VV.h5'

h5file['bkg'] = h5py.File(filename['bkg'], 'r') # open HDF5 file read-only
h5file['VV'] = h5py.File(filename['VV'], 'r') 


params['bkg'] = h5file['bkg'][treename][()] # structured NumPy array
params['VV'] = h5file['VV'][treename][()] # structured NumPy array

# print all variables
print(params['bkg'].dtype.names)

# print the shape of the NumPy array
print(params['bkg'].shape)

# print the first entry of the NumPy array
print(params['bkg'][0])

# print mass4l value of first entry
print(params['bkg'][0]['f_mass4l'])

# print massjj value of first entry
print(params['bkg'][0]['f_massjj'])

('f_run', 'f_lumi', 'f_event', 'f_weight', 'f_int_weight', 'f_pu_weight', 'f_eff_weight', 'f_lept1_pt', 'f_lept1_eta', 'f_lept1_phi', 'f_lept1_charge', 'f_lept1_pfx', 'f_lept1_sip', 'f_lept2_pt', 'f_lept2_eta', 'f_lept2_phi', 'f_lept2_charge', 'f_lept2_pfx', 'f_lept2_sip', 'f_lept3_pt', 'f_lept3_eta', 'f_lept3_phi', 'f_lept3_charge', 'f_lept3_pfx', 'f_lept3_sip', 'f_lept4_pt', 'f_lept4_eta', 'f_lept4_phi', 'f_lept4_charge', 'f_lept4_pfx', 'f_lept4_sip', 'f_iso_max', 'f_sip_max', 'f_Z1mass', 'f_Z2mass', 'f_angle_costhetastar', 'f_angle_costheta1', 'f_angle_costheta2', 'f_angle_phi', 'f_angle_phistar1', 'f_pt4l', 'f_eta4l', 'f_mass4l', 'f_mass4lErr', 'f_njets_pass', 'f_deltajj', 'f_massjj', 'f_D_jet', 'f_jet1_pt', 'f_jet1_eta', 'f_jet1_phi', 'f_jet1_e', 'f_jet2_pt', 'f_jet2_eta', 'f_jet2_phi', 'f_jet2_e', 'f_D_bkg_kin', 'f_D_bkg', 'f_D_gg', 'f_D_g4', 'f_Djet_VAJHU', 'f_pfmet')
(58107,)
(1, 4, 630, 0.00064811, 0., 1.2290536, 1., 32.80312, 0.35433915, -1.4164597, -1., 0., -1.2187678, 23.37

## Convert `NumPy` arrays to `pandas` DataFrames
In my opinion, `pandas` DataFrames are a more convenient and flexible data container in Python (see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).
So we'll use them instead of structured `NumPy` arrays from here on.
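
To give a flavour of that convenience, here is a minimal sketch on a toy structured array (made-up values; `toy_df` and `selected` are hypothetical names). It relies only on standard `pandas` calls: the `DataFrame` constructor, `query`, and `sort_values`.

```python
import numpy as np
import pandas as pd

# Toy structured array (values are illustrative only)
events = np.array([(91.1, -999.0), (125.3, 410.2), (240.7, 88.5)],
                  dtype=[('f_mass4l', 'f4'), ('f_massjj', 'f4')])

# Field names of the structured array become column labels
toy_df = pd.DataFrame(events)

# Select rows with a string expression instead of a hand-built mask
selected = toy_df.query('f_mass4l > 125')

# Sort by a column, largest first
by_mass = toy_df.sort_values('f_mass4l', ascending=False)

print(list(toy_df.columns))
print(selected)
print(by_mass.index.tolist())
```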

In [3]:
import pandas as pd

df = {}
df['bkg'] = pd.DataFrame(params['bkg'])
df['VV'] = pd.DataFrame(params['VV'])

# print first entry
print(df['bkg'].iloc[:1])

# print shape of DataFrame
print(df['bkg'].shape)

# print first entry for f_mass4l and f_massjj
print(df['bkg'][['f_mass4l','f_massjj']].iloc[:1])

# convert back into unstructured NumPy array
print(df['bkg'].values)
print(df['bkg'].values.shape)

# get boolean array
print(df['bkg']['f_mass4l'] > 125)

# cut using this boolean array
print(df['bkg']['f_mass4l'][df['bkg']['f_mass4l'] > 125])

   f_run  f_lumi  f_event  f_weight  f_int_weight  f_pu_weight  f_eff_weight  \
0      1       4      630  0.000648           0.0     1.229054           1.0   

   f_lept1_pt  f_lept1_eta  f_lept1_phi    ...      f_jet2_pt  f_jet2_eta  \
0    32.80312     0.354339     -1.41646    ...            0.0         0.0   

   f_jet2_phi  f_jet2_e  f_D_bkg_kin   f_D_bkg    f_D_gg    f_D_g4  \
0         0.0       0.0     0.363088  0.363088 -0.000022  0.827116   

   f_Djet_VAJHU    f_pfmet  
0          -1.0  18.884806  

[1 rows x 62 columns]
(58107, 62)
    f_mass4l  f_massjj
0  91.098129    -999.0
[[ 1.00000000e+00  4.00000000e+00  6.30000000e+02 ...  8.27115893e-01
  -1.00000000e+00  1.88848057e+01]
 [ 1.00000000e+00  1.97100000e+03  3.81019000e+05 ...  4.15622257e-02
  -1.00000000e+00  2.95897903e+01]
 [ 1.00000000e+00  7.37000000e+03  1.42459000e+06 ...  8.38690639e-01
  -1.00000000e+00  2.04643517e+01]
 ...
 [ 1.00000000e+00  3.44060000e+04  6.65084800e+06 ...  8.02676916e-01
  -1.00000000e