### When running this notebook via the Galaxy portal
You can access your data via the dataset number. Using a Python kernel, you can access dataset number 42 with ``handle = open(get(42), 'r')``.
To save data, write it to a file and then call ``put('filename.txt')``. The dataset will then be available in your Galaxy history.
<br><br>Note that if you are putting/getting to/from a history other than your default history, you must also provide the history ID.
<br><br>More information, including the available Galaxy-related environment variables, can be found at https://github.com/bgruening/docker-jupyter-notebook. This notebook is running in a Docker container based on the Docker Jupyter container described in that link.

# Ntuple to data frame conversion

This notebook converts ntuples to a pandas data frame and writes the output to HDF5 files. Events are selected while running over the ntuples, and new variables are created and stored in the data frame. The code also adds the background category, a flag telling whether the event comes from a signal simulation (useful when training a BDT or NN), and the weight used to scale the MC to data.

With the current code, processing the simulated 2Lep background and signal samples (about 118 million events) takes about 4 to 5 hours.
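
As a rough sketch of the weight mentioned above: in the event loop further down, the MC weight is the product of the generator weight, a set of scale factors, and the factor cross section × luminosity / sum of weights. The function name, the argument names and the toy numbers below are illustrative assumptions, and the cross section is assumed to be given in pb so that it matches the luminosity in pb^-1.

```python
import math

LUMI_PB = 10064.0  # pb^-1, the luminosity printed when the framework is imported


def mc_event_weight(mc_weight, scale_factors, xsection_pb, sum_weights, lumi_pb=LUMI_PB):
    """Return the weight that scales one simulated event to the data luminosity."""
    # Product of the per-event correction factors (pile-up, leptons, b-tag, trigger).
    correction = math.prod(scale_factors)
    # Expected number of events for this sample per unit of generated weight.
    normalisation = xsection_pb * lumi_pb / sum_weights
    return mc_weight * correction * normalisation


# Hypothetical numbers, chosen only so that the normalisation equals 1.
toy_weight = mc_event_weight(
    mc_weight=1.0,
    scale_factors=[1.0, 0.98, 1.0, 1.02, 1.0],
    xsection_pb=2.0,
    sum_weights=20128.0,
)
print(toy_weight)
```

Data events are not reweighted and simply get a weight of 1.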

First import some of the needed modules.

In [2]:
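import os              # os.path.isfile is used when adding files to the chains
import ROOT as R
import pandas as pd    # used for the final conversion to a data frame
import import_ipynb
import setPath
from os import listdir
from os.path import isfile, join
from Input.OpenDataPandaFramework13TeV import *
%jsroot on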

Welcome to JupyROOT 6.24/02
importing Jupyter notebook from setPath.ipynb
importing Jupyter notebook from /storage/galaxy/jobs_directory/003/3139/working/jupyter/Input/OpenDataPandaFramework13TeV.ipynb
This library contains handy functions to ease the access and use of the 13TeV ATLAS OpenData release

getBkgCategories()
	 Dumps the name of the various background cataegories available 
	 as well as the number of samples contained in each category.
	 Returns a vector with the name of the categories

getSamplesInCategory(cat)
	 Dumps the name of the samples contained in a given category (cat)
	 Returns dictionary with keys being DSIDs and values physics process name from filename.

getMCCategory()
	 Returns dictionary with keys DSID and values MC category

initialize(indir)
	 Collects all the root files available in a certain directory (indir)



Setting luminosity to 10064 pb^-1

###############################
#### Background categories ####
###############################
Category    

Set the path to the open data ntuples and which skim you are interested in:

In [3]:
opendatadir = "/storage/shared/data/fys5555/ATLAS_opendata/"
analysis = "2lep"

Make one ROOT::TChain for MC and one for data. The ROOT files are added to these chains, which are eventually used to loop over all the events.

In [4]:
background = R.TChain("mini")
data = R.TChain("mini")

Get all the MC and data files available for the selected data set and make lists of the background and signal categories (useful information to add to the data frame later).

In [5]:
mcfiles = initialize(opendatadir+"/"+analysis+"/MC")
datafiles = initialize(opendatadir+"/"+analysis+"/Data")
allfiles = {**mcfiles, **datafiles}
Backgrounds = getBkgCategories()
Signals = getSignalCategories()

####################################################################################################
BACKGROUND SAMPLES
####################################################################################################
####################################################################################################
SIGNAL SAMPLES
####################################################################################################
###############################
#### Background categories ####
###############################
Category             N(samples)
-------------------------------
Diboson                      10
Higgs                        20
Wjets                        42
Wjetsincl                     6
Zjets                        42
Zjetsincl                     3
singleTop                     6
topX                          3
ttbar                         1
###############################
#### Signal categories ####
###############################
Category             N

Some more preparatory steps to classify the individual backgrounds into categories.

In [6]:
getSignalCategories()

###############################
#### Signal categories ####
###############################
Category             N(samples)
-------------------------------
GG_ttn1                       4
Gee                           5
Gmumu                         5
RS_G_ZZ                       5
SUSYC1C1                     10
SUSYC1N2                     18
SUSYSlepSlep                 14
TT_directTT                   4
ZPrimeee                      4
ZPrimemumu                    4
ZPrimett                     12
dmV_Zll                      10


['GG_ttn1',
 'Gee',
 'Gmumu',
 'RS_G_ZZ',
 'SUSYC1C1',
 'SUSYC1N2',
 'SUSYSlepSlep',
 'TT_directTT',
 'ZPrimeee',
 'ZPrimemumu',
 'ZPrimett',
 'dmV_Zll']

In [7]:
MCcat = {}
for cat in allfiles:
    for dsid in allfiles[cat]["dsid"]:
        try:
            MCcat[int(dsid)] = cat
        except (TypeError, ValueError):
            # Skip entries without a numeric DSID
            continue
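
To illustrate what the cell above builds, here is a minimal sketch on a toy dictionary with the same layout as `allfiles` (category name mapped to a dictionary with a `"dsid"` list). The category contents, DSID values and the function name `build_dsid_map` are made up for illustration; the real dictionary comes from `initialize`.

```python
def build_dsid_map(files_by_category):
    """Map each numeric DSID to the category it belongs to."""
    mapping = {}
    for cat, info in files_by_category.items():
        for dsid in info["dsid"]:
            try:
                mapping[int(dsid)] = cat
            except (TypeError, ValueError):
                # Entries without a numeric DSID are ignored.
                continue
    return mapping


# Hypothetical input with the assumed structure.
toy_files = {
    "Zjets": {"dsid": ["361106", "361107"]},
    "ttbar": {"dsid": ["410000"]},
    "data": {"dsid": ["data_A"]},
}

toy_map = build_dsid_map(toy_files)
print(toy_map)
```

In the event loop, the category of an MC event is then a single dictionary lookup on its channel number.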

Add the background and signal samples to the TChain and check the number of events.

In [8]:
dataset_IDs = []
background.Reset()
for b in Backgrounds+Signals:
    if b not in mcfiles: continue
    # Pair each file with its DSID so that a skipped file cannot shift the index
    for mc, dsid in zip(mcfiles[b]["files"], mcfiles[b]["dsid"]):
        if not os.path.isfile(mc): continue
        try:
            dataset_IDs.append(int(dsid))
        except (TypeError, ValueError):
            print("Could not get DSID for %s. Skipping"%mc)
            continue
        background.Add(mc)
nen = background.GetEntries()
print("Added %i entries for backgrounds"%(nen))

Added 121180468 entries for backgrounds


Adding all the available data into the TChain.

In [9]:
data.Reset()
for d in datafiles["data"]["files"]:
    if not os.path.isfile(d): continue
    data.Add(d)
nen = data.GetEntries()
print("Added %i entries for data"%(nen))

Added 24411580 entries for data


These are the variables/features we want to add to our data frame; they are filled during the loop over events. You can add or remove variables here depending on what you will use the resulting data frame for.

In [23]:
columns = {"lep_pt1":[], "lep_eta1":[], "lep_phi1":[], "lep_E1":[],
           "lep_pt2":[], "lep_eta2":[], "lep_phi2":[], "lep_E2":[],
           "met":[], "mll":[], "njet20":[], "njet60":[], "nbjet80":[],
           "isSF":[], "isOS":[], "weight":[], "category":[], "isSignal":[],
           "lep_z0_1":[], "lep_z0_2":[],
           "lep_trackd0pvunbiased_1":[], "lep_trackd0pvunbiased_2":[],
           "lep_tracksigd0pvunbiased_1":[], "lep_tracksigd0pvunbiased_2":[],
           "met_phi":[], "lep_pt_syst_1":[], "lep_pt_syst_2":[], "met_et_syst":[],
           "lep_etcone20_1":[], "lep_etcone20_2":[],
           "lep_ptcone30_1":[], "lep_ptcone30_2":[]
          }
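
The dictionary above works as a set of columns: every selected event appends exactly one value to every list, so all lists must end up with the same length before the conversion to a data frame. The following sketch shows that pattern on toy values. The helper names `column_lengths` and `is_rectangular` and the toy events are assumptions for illustration, not part of the framework.

```python
def column_lengths(columns):
    """Return the number of stored values for each column."""
    return {name: len(values) for name, values in columns.items()}


def is_rectangular(columns):
    """True when every column holds the same number of entries."""
    return len(set(column_lengths(columns).values())) <= 1


# Toy version of the column dictionary with three of the features.
toy_columns = {"lep_pt1": [], "met": [], "isSF": []}

# Hypothetical events that passed the selection.
toy_events = [
    {"lep_pt1": 124.2, "met": 30.1, "isSF": 1},
    {"lep_pt1": 81.0, "met": 12.7, "isSF": 0},
]

for event in toy_events:
    # One value per column and per event keeps the lists aligned.
    for name in toy_columns:
        toy_columns[name].append(event[name])

print(column_lengths(toy_columns), is_rectangular(toy_columns))
```

A check like this can be handy after the event loop, since a column that was skipped for some events would otherwise only show up as an error in the conversion step.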

This is the event loop. It needs to be run twice, once for MC and once for data, if you are interested in both. It applies some selections, creates new variables and fills the lists in the dictionary defined above.
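
The lepton selection in the loop keeps leptons whose calorimeter and track isolation, both taken relative to the lepton pT, are at most 0.15, and then requires exactly two such leptons. A minimal sketch of that logic on plain Python lists, with a hypothetical function name and made-up values in MeV:

```python
def isolated_lepton_indices(lep_pt, lep_etcone20, lep_ptcone30, max_rel_iso=0.15):
    """Return the indices of leptons passing both relative isolation cuts."""
    selected = []
    for j, pt in enumerate(lep_pt):
        if lep_etcone20[j] / pt > max_rel_iso:
            continue  # fails the calorimeter isolation
        if lep_ptcone30[j] / pt > max_rel_iso:
            continue  # fails the track isolation
        selected.append(j)
    return selected


# Hypothetical event with three leptons; the second one is not isolated.
toy_pt = [50000.0, 30000.0, 25000.0]
toy_etcone20 = [1000.0, 6000.0, 500.0]
toy_ptcone30 = [500.0, 100.0, 1000.0]

toy_selected = isolated_lepton_indices(toy_pt, toy_etcone20, toy_ptcone30)
keep_event = len(toy_selected) == 2  # the event loop requires exactly two
print(toy_selected, keep_event)
```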

In [25]:
%%time
isData = 0  # 0: run over MC (background and signal), 1: run over data

ds = data if isData == 1 else background

l1 = R.TLorentzVector() 
l2 = R.TLorentzVector() 
dileptons = R.TLorentzVector() 
    
i = 0   
for event in ds: 
    
    if i%100000 == 0 and i>0: 
        print("Total events %i/%i"%(i,ds.GetEntries()))
        #break
    i += 1 
    
    # Keep leptons that pass both relative isolation requirements
    sig_lep_idx = []
    for j in range(ds.lep_n):
        if ds.lep_etcone20[j]/ds.lep_pt[j] > 0.15: continue
        if ds.lep_ptcone30[j]/ds.lep_pt[j] > 0.15: continue
        sig_lep_idx.append(j)

    # Require exactly two isolated leptons
    if len(sig_lep_idx) != 2: continue

    # Count jets above the pT thresholds (in MeV) and b-tagged jets
    njet20 = 0
    njet60 = 0
    nbjet80 = 0
    for j in range(ds.jet_n):
        if ds.jet_pt[j] > 20000:
            njet20 += 1
            # A jet counts as b-tagged when its MV2c10 score is above the cut value
            if ds.jet_MV2c10[j] > 0.1758:
                nbjet80 += 1
        if ds.jet_pt[j] > 60000:
            njet60 += 1
        
    ## Require "good leptons": 
    idx1 = sig_lep_idx[0]
    idx2 = sig_lep_idx[1]
    
    ## Set Lorentz vectors.
    ## Variables are stored in the TTree in MeV, so we divide by 1000
    ## to get GeV, which is a more practical and commonly used unit.
    l1.SetPtEtaPhiE(ds.lep_pt[idx1]/1000., ds.lep_eta[idx1], ds.lep_phi[idx1], ds.lep_E[idx1]/1000.)
    l2.SetPtEtaPhiE(ds.lep_pt[idx2]/1000., ds.lep_eta[idx2], ds.lep_phi[idx2], ds.lep_E[idx2]/1000.)

    dileptons = l1 + l2
    
    columns["lep_pt1"].append(ds.lep_pt[idx1]/1000.0)
    columns["lep_eta1"].append(ds.lep_eta[idx1])
    columns["lep_phi1"].append(ds.lep_phi[idx1])
    columns["lep_E1"].append(ds.lep_E[idx1]/1000.0)
    
    columns["lep_pt2"].append(ds.lep_pt[idx2]/1000.0)
    columns["lep_eta2"].append(ds.lep_eta[idx2])
    columns["lep_phi2"].append(ds.lep_phi[idx2])
    columns["lep_E2"].append(ds.lep_E[idx2]/1000.0)
    
    columns["met"].append(ds.met_et/1000.0)
    columns["mll"].append(dileptons.M())
    
    columns["njet20"].append(njet20)
    columns["njet60"].append(njet60)
    
 
    columns["nbjet80"].append(nbjet80)
    
    # Category of the event: looked up from the DSID for MC, fixed label for data
    if not isData:
        Type = MCcat[ds.channelNumber]
    else:
        Type = "data"
    columns["category"].append(Type)

    # Flag events from signal simulation; backgrounds and data get 0
    columns["isSignal"].append(1 if Type in Signals else 0)
    
    if ds.lep_charge[idx1] == ds.lep_charge[idx2]: columns["isOS"].append(0)
    else: columns["isOS"].append(1)
        
    if ds.lep_type[idx1] == ds.lep_type[idx2]: columns["isSF"].append(1)
    else: columns["isSF"].append(0)
        
    if isData:
        columns["weight"].append(1.0)
    else:
        W = ((ds.mcWeight)*(ds.scaleFactor_PILEUP)*
             (ds.scaleFactor_ELE)*(ds.scaleFactor_MUON)*
             (ds.scaleFactor_BTAG)*(ds.scaleFactor_LepTRIGGER))*((ds.XSection*lumi)/ds.SumWeights)
        columns["weight"].append(W)
   
        
  
    columns["lep_z0_1"].append(ds.lep_z0[idx1])
    columns["lep_z0_2"].append(ds.lep_z0[idx2])
    
    columns["lep_trackd0pvunbiased_1"].append(ds.lep_trackd0pvunbiased[idx1])
    columns["lep_trackd0pvunbiased_2"].append(ds.lep_trackd0pvunbiased[idx2])
    
    columns["lep_tracksigd0pvunbiased_1"].append(ds.lep_tracksigd0pvunbiased[idx1])
    columns["lep_tracksigd0pvunbiased_2"].append(ds.lep_tracksigd0pvunbiased[idx2])
    
    columns["met_phi"].append(ds.met_phi)
    
    columns["lep_pt_syst_1"].append(ds.lep_pt_syst[idx1])
    columns["lep_pt_syst_2"].append(ds.lep_pt_syst[idx2])
    
    columns["met_et_syst"].append(ds.met_et_syst)
    
    # Isolation variables relative to the lepton pT
    columns["lep_etcone20_1"].append(ds.lep_etcone20[idx1]/ds.lep_pt[idx1])
    columns["lep_etcone20_2"].append(ds.lep_etcone20[idx2]/ds.lep_pt[idx2])

    columns["lep_ptcone30_1"].append(ds.lep_ptcone30[idx1]/ds.lep_pt[idx1])
    columns["lep_ptcone30_2"].append(ds.lep_ptcone30[idx2]/ds.lep_pt[idx2])

        
print("Done!")
if isData == 0:
    print("Remembered to run over data? No? Set isData = 1 at the top and run again")
else:
    print("Remembered to run over MC? No? Set isData = 0 at the top and run again")

Done!
Remembered to run over data? No? Set isData = 1 at the top and run again
CPU times: user 2.86 ms, sys: 0 ns, total: 2.86 ms
Wall time: 2.85 ms
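
The `mll` column filled in the loop is the invariant mass of the two selected leptons, which ROOT computes from the sum of the two `TLorentzVector` objects. The same quantity can be sketched without ROOT from pT, eta, phi and E. The function names and toy values below are illustrative assumptions; unlike `TLorentzVector.M()`, this sketch clamps a slightly negative mass squared to zero.

```python
import math


def four_vector(pt, eta, phi, energy):
    """Convert (pT, eta, phi, E) to Cartesian components (px, py, pz, E)."""
    return (pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(eta), energy)


def invariant_mass(p1, p2):
    """Invariant mass of the sum of two four-vectors given as (px, py, pz, E)."""
    px, py, pz, energy = (a + b for a, b in zip(p1, p2))
    m2 = energy * energy - (px * px + py * py + pz * pz)
    # Rounding can make m2 slightly negative for light particles.
    return math.sqrt(max(m2, 0.0))


# Two back-to-back massless leptons with pT = E = 45 GeV (toy values).
lep1 = four_vector(45.0, 0.0, 0.0, 45.0)
lep2 = four_vector(45.0, 0.0, math.pi, 45.0)
toy_mll = invariant_mass(lep1, lep2)
print(toy_mll)
```

With inputs in GeV the result is also in GeV, which is why the loop divides pT and E by 1000 before setting the Lorentz vectors.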


Finally convert the dictionary to a data frame

In [26]:
df = pd.DataFrame(data=columns)

In [27]:
print(df)

       lep_pt1  lep_eta1  lep_phi1      lep_E1    lep_pt2  lep_eta2  lep_phi2  \
0   124.174242  0.874867  1.808438  174.804609  36.407164 -0.088034  2.809092   
1    81.009211 -0.172663  2.475907   82.219758  49.662348  0.419312  1.085146   
2    71.252680 -0.190694  2.687513   72.552219  34.968070 -1.183697 -2.000721   
3    52.469828  0.626280 -3.089676   63.100617  42.395828  0.286550 -0.852670   
4    61.642246 -0.359743 -0.503622   65.674258  31.970502 -1.249790  1.979540   
5    44.796750  2.258051 -2.900190  216.569531  23.854400  0.491639  0.989600   
6    50.000758  2.364431 -1.107304  268.303844  37.587879  1.913280  1.915223   
7    83.771773  1.829904 -1.813111  267.806469  30.891246  2.213547  2.053218   
8    61.179313  0.665834 -2.552287   75.249344  37.775004  0.687066 -0.124923   
9    45.091684 -1.007820 -1.399480   69.996703  37.807648  0.652300 -0.048723   
10   52.586027 -0.258767  0.366813   54.356461  45.281090  0.417318  2.399606   

        lep_E2        met  

... and write it to a file for later use. There are many more possibilities for the file format. Have a look at the pandas documentation (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_hdf.html) for the options.

In [None]:
df.to_hdf("/storage/shared/data/2lep_df_forML_signal.hdf5", key="mini")

In [None]:
df.to_hdf("/storage/shared/data/2lep_df_forML.hdf5", key="mini")

In [None]:
df["nbjet80"]

In [None]:
reread = pd.read_hdf("/storage/shared/data/2lep_df_forML.hdf5")

In [None]:
reread