# Datasets
Author: Javier Duarte


## Convert datasets from `ROOT` to `HDF5`
Here we convert the datasets using the `root2hdf5` utility, which comes with `rootpy`: http://www.rootpy.org/commands/root2hdf5.html

In [None]:
%%bash
root2hdf5 -f data/ntuple_4mu_VV.root data/ntuple_4mu_bkg.root

## Load `HDF5` datasets
Here we load the converted `HDF5` datasets into structured `NumPy` arrays. Note that these structured arrays let one manipulate the data by named fields: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.rec.html

In [None]:
import numpy as np
import h5py

treename = 'HZZ4LeptonsAnalysisReduced'
filename = {}
h5file = {}
params = {}

filename['bkg'] = 'data/ntuple_4mu_bkg.h5'
filename['VV'] = 'data/ntuple_4mu_VV.h5'

h5file['bkg'] = h5py.File(filename['bkg'], 'r') # open HDF5 file read-only
h5file['VV'] = h5py.File(filename['VV'], 'r') 

params['bkg'] = h5file['bkg'][treename][()] # returns a structured NumPy array
params['VV'] = h5file['VV'][treename][()]

# print all variables
print(params['bkg'].dtype.names)

# print the shape of the NumPy array
print(params['bkg'].shape)

# print the first entry of the NumPy array
print(params['bkg'][0])

# print mass4l value of first entry
print(params['bkg'][0]['f_mass4l'])

# print massjj value of first entry
print(params['bkg'][0]['f_massjj'])
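
As a minimal sketch of what access by named fields looks like beyond a single entry, the example below builds a small toy structured array. The field names mirror the ntuple, but the array `toy` and all of its values are made up for illustration and do not come from the datasets.

```python
import numpy as np

# Hypothetical toy data: field names mirror the ntuple, values are invented.
toy = np.array(
    [(124.8, 310.5), (91.2, 50.0), (250.1, 742.3)],
    dtype=[('f_mass4l', 'f8'), ('f_massjj', 'f8')],
)

# select one field for all entries (returns a plain 1D array)
mass4l = toy['f_mass4l']
print(mass4l.shape)

# select several fields at once (still a structured array)
pair = toy[['f_mass4l', 'f_massjj']]
print(pair.dtype.names)

# keep only the entries that pass a cut on one field
passed = toy[toy['f_mass4l'] > 100]
print(passed.shape)
```

The same indexing should work on `params['bkg']` and `params['VV']`, since they are structured arrays as well.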

## Convert `NumPy` arrays to `pandas` DataFrames
In my opinion, `pandas` DataFrames are a more convenient and flexible data container in Python: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html.
So we'll use them instead of structured `NumPy` arrays.

In [None]:
import pandas as pd

df = {}
df['bkg'] = pd.DataFrame(params['bkg'])
df['VV'] = pd.DataFrame(params['VV'])

# print first entry
print(df['bkg'].iloc[:1])

# print shape of DataFrame
print(df['bkg'].shape)

# print first entry for f_mass4l and f_massjj
print(df['bkg'][['f_mass4l','f_massjj']].iloc[:1])

# convert back into an unstructured NumPy array
print(df['bkg'].values)
print(df['bkg'].values.shape)

# get boolean array
print(df['bkg']['f_mass4l'] > 125)

# cut using this boolean array
print(df['bkg']['f_mass4l'][df['bkg']['f_mass4l'] > 125])
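
Cuts on several variables can be combined in the same way. The sketch below uses a toy DataFrame `toy_df` with invented values (not the real datasets), and the thresholds are arbitrary choices for illustration. It shows boolean masks combined with `&` and the roughly equivalent `DataFrame.query` form.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the ntuple, values are invented.
toy_df = pd.DataFrame({
    'f_mass4l': [124.8, 91.2, 250.1, 126.3],
    'f_massjj': [310.5, 50.0, 742.3, 95.0],
})

# combine two boolean arrays; the parentheses are required around each comparison
mask = (toy_df['f_mass4l'] > 125) & (toy_df['f_massjj'] > 100)

# apply the cut and pick columns in one step with .loc
selected = toy_df.loc[mask, ['f_mass4l', 'f_massjj']]
print(selected)

# the same cut written as a query string
queried = toy_df.query('f_mass4l > 125 and f_massjj > 100')
print(queried)
```

Using `.loc` with a mask applies the row selection and the column selection together, which avoids the chained indexing used in the cell above.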