# Datasets
Original Author: Javier Duarte | Edited by Sitong An for CMSDAS2019@Pisa

Now we start on the problem of gluon fusion (gg) / vector boson fusion (VV) classification proper.

First we need to import the dataset from ROOT files. This notebook uses the `root2hdf5` utility. Alternatively, you could use `uproot`, described in notebook 2.1.

## Convert datasets from `ROOT` to `HDF5`
Here we convert the datasets using the `root2hdf5` utility, which comes with `rootpy`: http://www.rootpy.org/commands/root2hdf5.html

In [3]:
!pip install tables --user

Collecting tables
  Downloading https://files.pythonhosted.org/packages/12/63/007b9cf964a8c6b1a92ddc00b2a6a369e132e69ee71d800cb212847e061e/tables-3.4.4-cp27-cp27mu-manylinux1_x86_64.whl (3.8MB)
    100% |████████████████████████████████| 3.8MB 5.6MB/s eta 0:00:01
tensorflow 1.8.0 requires backports.weakref>=1.0rc1, which is not installed.
py2neo 4.0.0 requires colorama, which is not installed.
py2neo 4.0.0 requires neo4j-driver>=1.6.0, which is not installed.
pylint 2.0.0 has requirement astroid>=2.0.0, but you'll have astroid 1.6.5 which is incompatible.
tensorboard 1.8.0 has requirement bleach==1.5.0, but you'll have bleach 2.1.3 which is incompatible.
tensorboard 1.8.0 has requirement html5lib==0.9999999, but you'll have html5lib 1.0.1 which is incompatible.
Installing collected packages: tables
  The scripts pt2to3, ptdump, ptrepack and pttree are installed in '/eos/user/q/qnguyen/.local/bin' which is not on PATH.


In [4]:
%%bash
root2hdf5 -f data/ntuple_4mu_VV.root data/ntuple_4mu_bkg.root

INFO:rootpy.root2hdf5] Converting data/ntuple_4mu_VV.root ...
INFO:rootpy.root2hdf5] Will convert 1 tree in /
INFO:rootpy.root2hdf5] Converting tree 'HZZ4LeptonsAnalysisReduced' with 25817 entries ...
INFO:rootpy.root2hdf5] Created data/ntuple_4mu_VV.h5
INFO:rootpy.root2hdf5] Converting data/ntuple_4mu_bkg.root ...
INFO:rootpy.root2hdf5] Will convert 1 tree in /
INFO:rootpy.root2hdf5] Converting tree 'HZZ4LeptonsAnalysisReduced' with 58107 entries ...
INFO:rootpy.root2hdf5] Created data/ntuple_4mu_bkg.h5


## Load `HDF5` datasets
Here we load the converted `HDF5` datasets into structured `NumPy` arrays. Note that structured arrays let you manipulate the data by named fields: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.rec.html

In [5]:
import numpy as np
import h5py

treename = 'HZZ4LeptonsAnalysisReduced'
filename = {}
h5file = {}
params = {}

filename['bkg'] = 'data/ntuple_4mu_bkg.h5'
filename['VV'] = 'data/ntuple_4mu_VV.h5'

h5file['bkg'] = h5py.File(filename['bkg'], 'r') # open HDF5 file read-only
h5file['VV'] = h5py.File(filename['VV'], 'r') 

params['bkg'] = h5file['bkg'][treename][()] # returns a structured NumPy array
params['VV'] = h5file['VV'][treename][()]
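
To make the idea of named fields concrete, here is a minimal sketch on a toy structured array. The two-field layout and the values are made up for illustration; the real tree has many more fields, and this does not read the course files.

```python
import numpy as np

# Hypothetical two-field layout, chosen only for illustration.
toy_dtype = np.dtype([('f_mass4l', 'f4'), ('f_massjj', 'f4')])
toy = np.array([(91.1, -999.0), (201.8, 350.2), (135.6, -999.0)],
               dtype=toy_dtype)

mass4l = toy['f_mass4l']                 # one field for all events: 1D array
first_event = toy[0]                     # one event: a record with named fields
subset = toy[['f_mass4l', 'f_massjj']]   # several fields selected at once

print(toy.dtype.names)
print(mass4l.shape)
print(first_event['f_massjj'])
```

Indexing by a field name returns a column for all events, while indexing by an integer returns one event whose values can again be looked up by field name.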



print all variables

In [6]:
print(params['bkg'].dtype.names)

('f_run', 'f_lumi', 'f_event', 'f_weight', 'f_int_weight', 'f_pu_weight', 'f_eff_weight', 'f_lept1_pt', 'f_lept1_eta', 'f_lept1_phi', 'f_lept1_charge', 'f_lept1_pfx', 'f_lept1_sip', 'f_lept2_pt', 'f_lept2_eta', 'f_lept2_phi', 'f_lept2_charge', 'f_lept2_pfx', 'f_lept2_sip', 'f_lept3_pt', 'f_lept3_eta', 'f_lept3_phi', 'f_lept3_charge', 'f_lept3_pfx', 'f_lept3_sip', 'f_lept4_pt', 'f_lept4_eta', 'f_lept4_phi', 'f_lept4_charge', 'f_lept4_pfx', 'f_lept4_sip', 'f_iso_max', 'f_sip_max', 'f_Z1mass', 'f_Z2mass', 'f_angle_costhetastar', 'f_angle_costheta1', 'f_angle_costheta2', 'f_angle_phi', 'f_angle_phistar1', 'f_pt4l', 'f_eta4l', 'f_mass4l', 'f_mass4lErr', 'f_njets_pass', 'f_deltajj', 'f_massjj', 'f_D_jet', 'f_jet1_pt', 'f_jet1_eta', 'f_jet1_phi', 'f_jet1_e', 'f_jet2_pt', 'f_jet2_eta', 'f_jet2_phi', 'f_jet2_e', 'f_D_bkg_kin', 'f_D_bkg', 'f_D_gg', 'f_D_g4', 'f_Djet_VAJHU', 'f_pfmet')


print the shape of the NumPy array

In [7]:
print(params['bkg'].shape)

(58107,)


print the first entry of the NumPy array

In [8]:
print(params['bkg'][0])

(1, 4, 630, 0.00064811, 0., 1.2290536, 1., 32.80312, 0.35433915, -1.4164597, -1., 0., -1.2187678, 23.372433, 0.25018594, 0.9821058, 1., 0.02498744, -1.2894976, 19.231188, -1.2650819, -0.02464505, -1., 0.03294736, -1.8134387, 6.870271, -0.45518333, -0.80094427, 1., 0., 0.54434615, 0., 0., 51.681366, 12.933985, 0.963756, 0.22460742, 0.7629764, 2.597239, 1.9826626, 45.872066, -0.35886624, 91.09813, 0., 1., -999., -999., -999., 35.839058, 0.5195489, 2.5928829, 36.253284, 0., 0., 0., 0., 0.3630876, 0.3630876, -2.2493208e-05, 0.8271159, -1., 18.884806)


print mass4l value of first entry

In [9]:
print(params['bkg'][0]['f_mass4l'])

91.09813


print massjj value of first entry

In [10]:
print(params['bkg'][0]['f_massjj'])

-999.0
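
The `-999.0` printed for `f_massjj` looks like a placeholder for events where the quantity is undefined (for example, fewer than two jets). That reading is an assumption based on the output above, not something the ntuple documents here. Under that assumption, a sketch of how such entries could be masked before computing statistics, using made-up values:

```python
import numpy as np

# Toy values for illustration only; not taken from the real dataset.
toy_massjj = np.array([-999.0, 350.2, -999.0, 512.7], dtype=np.float32)

SENTINEL = -999.0  # assumed placeholder value for "no dijet system"

has_dijet = toy_massjj != SENTINEL                 # boolean mask of usable events
valid = toy_massjj[has_dijet]                      # keep only the usable values
masked = np.where(has_dijet, toy_massjj, np.nan)   # or keep the length, mark as NaN

print(has_dijet)
print(valid.mean())
print(np.nanmean(masked))
```

Either approach keeps the placeholder from distorting means or histograms; which one fits depends on whether later steps need the arrays to stay aligned event by event.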


## Convert `NumPy` arrays to `pandas` DataFrames
In my opinion, `pandas` DataFrames are a more convenient and flexible data container in Python: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html.
So we'll use them instead of structured `NumPy` arrays.

In [11]:
import pandas as pd

df = {}
df['bkg'] = pd.DataFrame(params['bkg'])
df['VV'] = pd.DataFrame(params['VV'])
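
As a small illustration of what this conversion does, the sketch below builds a DataFrame from a toy structured array (hypothetical fields and values). The field names become column labels, and `to_records` goes back the other way.

```python
import numpy as np
import pandas as pd

# Toy structured array; the layout is an assumption for illustration.
toy = np.array([(91.1, -999.0), (201.8, 350.2)],
               dtype=[('f_mass4l', 'f4'), ('f_massjj', 'f4')])

toy_df = pd.DataFrame(toy)                 # field names become column labels
records = toy_df.to_records(index=False)   # back to a record array, without the index

print(list(toy_df.columns))
print(toy_df.shape)
print(records.dtype.names)
```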

print first entry

In [12]:
print(df['bkg'].iloc[:1])

   f_run  f_lumi  f_event  f_weight  f_int_weight  f_pu_weight  f_eff_weight  \
0      1       4      630  0.000648           0.0     1.229054           1.0   

   f_lept1_pt  f_lept1_eta  f_lept1_phi    ...      f_jet2_pt  f_jet2_eta  \
0    32.80312     0.354339     -1.41646    ...            0.0         0.0   

   f_jet2_phi  f_jet2_e  f_D_bkg_kin   f_D_bkg    f_D_gg    f_D_g4  \
0         0.0       0.0     0.363088  0.363088 -0.000022  0.827116   

   f_Djet_VAJHU    f_pfmet  
0          -1.0  18.884806  

[1 rows x 62 columns]


print shape of DataFrame

In [13]:
print(df['bkg'].shape)

(58107, 62)


print first entry for f_mass4l and f_massjj

In [14]:
print(df['bkg'][['f_mass4l','f_massjj']].iloc[:1])

    f_mass4l  f_massjj
0  91.098129    -999.0


convert back into unstructured NumPy array

In [15]:
print(df['bkg'].values)
print(df['bkg'].values.shape)

[[ 1.00000000e+00  4.00000000e+00  6.30000000e+02 ...  8.27115893e-01
  -1.00000000e+00  1.88848057e+01]
 [ 1.00000000e+00  1.97100000e+03  3.81019000e+05 ...  4.15622257e-02
  -1.00000000e+00  2.95897903e+01]
 [ 1.00000000e+00  7.37000000e+03  1.42459000e+06 ...  8.38690639e-01
  -1.00000000e+00  2.04643517e+01]
 ...
 [ 1.00000000e+00  3.44060000e+04  6.65084800e+06 ...  8.02676916e-01
  -1.00000000e+00  1.25938129e+01]
 [ 1.00000000e+00  3.44060000e+04  6.65092600e+06 ...  7.27994442e-02
  -1.00000000e+00  3.35813141e+01]
 [ 1.00000000e+00  3.44060000e+04  6.65094300e+06 ...  1.21763824e-02
  -1.00000000e+00  1.98157139e+01]]
(58107, 62)
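
Note that the integer columns (such as `f_run`) are printed as floats above: an unstructured array has a single dtype, so mixed columns are converted to a common one. A minimal sketch with toy data, assuming pandas 0.24 or newer for `to_numpy()`:

```python
import numpy as np
import pandas as pd

# Toy frame mixing an integer and a float column (values are made up).
toy_df = pd.DataFrame({
    'f_run': np.array([1, 1], dtype=np.int32),
    'f_mass4l': np.array([91.1, 201.8], dtype=np.float32),
})

arr = toy_df.values                     # all columns, cast to one common dtype
sel = toy_df[['f_mass4l']].to_numpy()   # selecting columns first keeps their dtype

print(arr.dtype, arr.shape)
print(sel.dtype, sel.shape)
```

Selecting only the columns needed as training inputs before converting avoids carrying bookkeeping fields such as run or event numbers into the array.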


get boolean array

In [16]:
print(df['bkg']['f_mass4l'] > 125)

0        False
1         True
2        False
3         True
4         True
5         True
6         True
7         True
8         True
9        False
10        True
11        True
12        True
13       False
14        True
15        True
16        True
17        True
18        True
19        True
20        True
21       False
22        True
23        True
24       False
25       False
26       False
27        True
28        True
29        True
         ...  
58077     True
58078     True
58079     True
58080    False
58081     True
58082     True
58083     True
58084     True
58085    False
58086     True
58087    False
58088     True
58089     True
58090     True
58091     True
58092     True
58093     True
58094    False
58095     True
58096     True
58097     True
58098     True
58099    False
58100    False
58101    False
58102    False
58103     True
58104    False
58105     True
58106     True
Name: f_mass4l, Length: 58107, dtype: bool


cut using this boolean array

In [17]:
print(df['bkg']['f_mass4l'][df['bkg']['f_mass4l'] > 125])

1        201.847610
3        586.597412
4        135.589798
5        734.903442
6        341.958466
7        254.073425
8        209.200073
10       249.034500
11       152.691574
12       217.140778
14       381.590332
15       264.585419
16       193.972916
17       310.875885
18       178.564941
19       222.176819
20       301.437988
22       473.301605
23       199.354355
27       126.852921
28       280.004089
29       231.275574
30       256.445831
31       248.573166
32       572.620361
33       593.505493
36       220.300735
37       178.801941
38       229.302322
39       277.066864
            ...    
58066    229.215164
58067    154.112381
58069    218.869415
58070    405.237152
58071    216.917175
58072    235.896942
58073    249.296753
58075    151.795853
58076    177.322540
58077    238.749847
58078    248.517776
58079    195.696304
58081    256.388489
58082    221.203857
58083    204.776276
58084    320.152130
58086    186.158600
58088    199.261490
58089    186.380890
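
The same kind of cut can also be written with `.loc`, and several conditions can be combined with `&`. The sketch below uses a toy DataFrame with made-up values; the second condition (`f_massjj > 0`) is only an illustrative choice, not a selection used in this notebook.

```python
import pandas as pd

# Toy data for illustration only.
toy_df = pd.DataFrame({
    'f_mass4l': [91.1, 201.8, 135.6, 586.6],
    'f_massjj': [-999.0, 350.2, -999.0, 512.7],
})

high_mass = toy_df['f_mass4l'] > 125              # boolean Series, one entry per event
selected = toy_df.loc[high_mass, 'f_mass4l']      # rows passing the cut, one column

# Combine cuts with & (and), | (or), ~ (not); each condition needs parentheses.
both = toy_df[high_mass & (toy_df['f_massjj'] > 0)]

print(selected)
print(both)
```

The original row labels are preserved in the result, which is why the index in the output above skips the events that fail the cut.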


## (Optional) Direct conversion to pandas dataframe from ROOT

Supported in ROOT 6.14 release:

Supported in ROOT 6.16 release:

`ROOT.RDataFrame` is a utility, supported since ROOT 6.14, that aims to provide functionality similar to a pandas DataFrame within ROOT. It was developed to offer a more modern, convenient data analysis interface from within ROOT. If you have extra time, this link gives an introductory demonstration of how it works: https://nbviewer.jupyter.org/url/root.cern.ch/doc/master/notebooks/df001_introduction.py.nbconvert.ipynb
For more information, go to https://root.cern/doc/master/group__tutorial__dataframe.html
