### When running this notebook via the Galaxy portal
You can access your data via the dataset number. Using a Python kernel, you can access dataset number 42 with ``handle = open(get(42), 'r')``.
To save data, write your data to a file, and then call ``put('filename.txt')``. The dataset will then be available in your galaxy history.
<br><br>Note that if you are putting/getting to/from a different history than your default history, you must also provide the history-id.
<br><br>More information including available galaxy-related environment variables can be found at https://github.com/bgruening/docker-jupyter-notebook. This notebook is running in a docker container based on the Docker Jupyter container described in that link.


# Data Handling (Python) 


In [2]:
import ROOT as R
import import_ipynb
import setPath
from os import listdir
from os.path import isfile, join
from Input.OpenDataPandaFramework13TeV import *
%jsroot on

Welcome to JupyROOT 6.24/02
importing Jupyter notebook from setPath.ipynb
importing Jupyter notebook from /storage/galaxy/jobs_directory/003/3068/working/jupyter/Input/OpenDataPandaFramework13TeV.ipynb
This library contains handy functions to ease the access and use of the 13TeV ATLAS OpenData release

getBkgCategories()
	 Dumps the name of the various background cataegories available 
	 as well as the number of samples contained in each category.
	 Returns a vector with the name of the categories

getSamplesInCategory(cat)
	 Dumps the name of the samples contained in a given category (cat)
	 Returns dictionary with keys being DSIDs and values physics process name from filename.

getMCCategory()
	 Returns dictionary with keys DSID and values MC category

initialize(indir)
	 Collects all the root files available in a certain directory (indir)



Setting luminosity to 10064 pb^-1

###############################
#### Background categories ####
###############################
Category    

## 1. Reading the dataset

Set the analaysis to run (*1largeRjet1lep*, *1lep1tau*, *3lep*, *exactly2lep*, *GamGam*, *2lep*, *4lep*)

Set the directory where you have downloaded the ATLAS OpenData samples you want to run over

In [9]:
opendatadir = "/storage/shared/data/fys5555/ATLAS_opendata/"
analysis = "2lep"

In [10]:
background = R.TChain("mini")
data = R.TChain("mini")

A list of all the background samples, category and their IDs can be found in **Infofile.txt**. The cross-section, efficiencies etc. needed for scaling are stored in the **Files_<---->**. We read these files and add all the samples to the TChain. We also (for later convenience) make a vector containing the dataset IDs. 

In [11]:
mcfiles = initialize(opendatadir+"/"+analysis+"/MC")
datafiles = initialize(opendatadir+"/"+analysis+"/Data")
allfiles = z = {**mcfiles, **datafiles}
Backgrounds = getBkgCategories() 

####################################################################################################
BACKGROUND SAMPLES
####################################################################################################
####################################################################################################
SIGNAL SAMPLES
####################################################################################################
###############################
#### Background categories ####
###############################
Category             N(samples)
-------------------------------
Diboson                      10
Higgs                        20
Wjets                        42
Wjetsincl                     6
Zjets                        42
Zjetsincl                     3
singleTop                     6
topX                          3
ttbar                         1


In [12]:
MCcat = {}
for cat in allfiles:
    for dsid in allfiles[cat]["dsid"]:
        try:
            MCcat[int(dsid)] = cat
        except:
            continue

In [13]:
dataset_IDs = []
background.Reset()
for b in Backgrounds:
    i = 0
    if not b in mcfiles.keys(): continue
    for mc in mcfiles[b]["files"]:
        if not os.path.isfile(mc): continue
        try:
            dataset_IDs.append(int(mcfiles[b]["dsid"][i]))
            background.Add(mc)
        except:
            print("Could not get DSID for %s. Skipping"%mc)
        i += 1
nen = background.GetEntries()
print("Added %i entries for backgrounds"%(nen))

Could not get DSID for /storage/shared/data/fys5555/ATLAS_opendata//2lep/MC/Diboson.root. Skipping
Could not get DSID for /storage/shared/data/fys5555/ATLAS_opendata//2lep/MC/Higgs.root. Skipping
Could not get DSID for /storage/shared/data/fys5555/ATLAS_opendata//2lep/MC/Wjets.root. Skipping
Could not get DSID for /storage/shared/data/fys5555/ATLAS_opendata//2lep/MC/Wjetsincl.root. Skipping
Could not get DSID for /storage/shared/data/fys5555/ATLAS_opendata//2lep/MC/Zjets.root. Skipping
Could not get DSID for /storage/shared/data/fys5555/ATLAS_opendata//2lep/MC/Zjetsincl.root. Skipping
Could not get DSID for /storage/shared/data/fys5555/ATLAS_opendata//2lep/MC/singleTop.root. Skipping
Could not get DSID for /storage/shared/data/fys5555/ATLAS_opendata//2lep/MC/topX.root. Skipping
Could not get DSID for /storage/shared/data/fys5555/ATLAS_opendata//2lep/MC/ttbar.root. Skipping
Added 118288518 entries for backgrounds


In [15]:
data.Reset(); 
for d in datafiles["data"]["files"]:  
    if not os.path.isfile(d): continue
    data.Add(d)
nen = data.GetEntries()
print("Added %i entries for data"%(nen))

Added 24411580 entries for data


## 2. Event selection

For machine learning using (semi)-unsupervised learning, we need as unbiased data as possible. Thus, we need to construct a dataframe with all the neccesary features, which is a lot of data.

In [16]:
# Retrieve lumi from library
%store -r lumi

l1 = R.TLorentzVector()
l2 = R.TLorentzVector()

dilepton = R.TLorentzVector()


no stored variable or alias lumi


This is the cell where the analysis is performed. Note that the cell needs to be run twice:

1. with data = 0 to run over MC
2. with data = 1 to run over data

Note that the MC running takes ~5 minutes for 3lep analysis. Much(!!!) more time for e.g. 2lep analysis! Data running is relatively fast for 3lep. 

In [9]:
import pandas as pd

In [10]:
%%time
import time
import numpy as np
isData = 0; 

if isData == 1: ds = data 
else: ds = background

legal_flavor_tot = [13*4, 11*4]
    
columns = {"met":[], "XSection":[], 
           "lep_n ":[],"tot_lep_invariant_mass":[], "mean_lep_pt":[], "mean_lep_E":[], "mean_lep_ptcone30":[], "mean_lep_etcone20":[], "mean_lep_eta":[], "mean_lep_phi":[],
           "jet_n":[], "mean_jet_pt":[], "mean_jet_E":[], "mean_jet_eta":[], "mean_jet_phi":[],
           "photon_n":[], "mean_photon_pt":[], "mean_photon_E":[], "mean_photon_ptcone30":[], "mean_photon_etcone20":[],"mean_photon_eta":[], "mean_photon_phi":[],
           "largeRjet_n":[], "tot_largeRjet_m":[],"mean_largeRjet_pt":[], "mean_largeRjet_E":[], "mean_largeRjet_eta":[], "mean_largeRjet_phi":[],
           "tau_n":[], "mean_tau_pt":[], "mean_tau_E":[], "mean_tau_eta":[], "mean_tau_phi":[],
           "mean_lep_pt_syst":[], "met_et_syst ":[], 
           "mean_jet_pt_syst":[], "mean_photon_pt_syst":[], 
           "mean_largeRjet_pt_syst":[], "mean_tau_pt_syst":[]
          }

i = 0   
for event in ds: 
    
    if i%100000 == 0 and i>0: 
        print("Total events %i/%i"%(i,ds.GetEntries()))
    i += 1 
   
    # First event selection, require charge conservation, and lep flavor, and 
    # to only have the two largest leptons    
    
    ## Cut #1: Require 2 or more leptons, but must find the indices of the two we pick
    if not ds.lep_n >= 2: continue
    lep_pt = np.zeros(ds.lep_n)
    lep_type = np.zeros(ds.lep_n)
    for i in range(ds.lep_n):
        lep_pt[i] = ds.lep_pt[i]
        lep_type = ds.lep_type[i]
        
    elec_index = np.where(lep_type == 11)[0]
    muon_index = np.where(lep_type == 13)[0]
    
    if (len(muon_index) < len(elec_index)) and (len(elec_index) == 2):
        lep_index = elec_index
    elif (len(elec_index) < len(muon_index)) and (len(muon_index) == 2): 
        lep_index = muon_index
    else:  
        max_index = np.where(lep_pt == np.max(lep_pt))[0]
        lep_typ = lep_type[max_index]
        lep_ind = np.where(lep_type == lep_typ)[0]
        
    lep0 = lep_index[0]
    lep1 = lep_index[1]
    
    
    ## Cut #2: Require opposite charge
    if not ds.lep_charge[lep0] + ds.lep_charge[lep1] == 0 : continue
    

    
    ## Require "good leptons": 
    
    if ds.lep_pt[lep0]/1000.0 < 25: continue
    if ds.lep_etcone20[lep0]/ds.lep_pt[lep0] > 0.15:
        continue
    if ds.lep_ptcone30[lep0]/ds.lep_pt[lep0] > 0.15:
        continue
    #if not (ds.lep_flag[0] & 512): continue
        
    if ds.lep_pt[lep1]/1000.0 < 25:
        continue
    if ds.lep_etcone20[lep1]/ds.lep_pt[lep1] > 0.15:
        continue
    if ds.lep_ptcone30[lep1]/ds.lep_pt[lep1] > 0.15:
        continue
    #if not (ds.lep_flag[1] & 512): continue
    
    l1.SetPtEtaPhiE(ds.lep_pt[lep0]/1000., ds.lep_eta[lep0],
                    ds.lep_phi[lep0], ds.lep_E[lep0]/1000.)
    l2.SetPtEtaPhiE(ds.lep_pt[lep1]/1000., ds.lep_eta[lep1],
                    ds.lep_phi[lep1], ds.lep_E[lep1]/1000.)
    
    dilepton = l1 + l2 
    
    dilep_inv_mass = dilepton.M()
    ## Event selection:
    
    ### General information 
    columns["met"].append(ds.met_et/1000.0)
    columns["XSection"].append(ds.XSection)  
    
    ### Lep information
    columns["lep_n"].append(ds.lep_n)
    columns["tot_lep_invariant_mass"].append(dilep_inv_mass)
    columns["mean_lep_pt"].append(np.mean(ds.lep_pt))
    columns["mean_lep_E"].append(np.mean(ds.lep_E))
    columns["mean_lep_ptcone30"].append(np.mean(ds.lep_ptcone30))
    columns["mean_lep_etcone20"].append(np.mean(ds.lep_etcone20))
    columns["mean_lep_eta"].append(np.mean(ds.lep_eta))
    columns["mean_lep_phi"].append(np.mean(ds.lep_phi))
    
    ### Jet information
    columns["jet_n"].append(ds.jet_n)
    columns["mean_jet_pt"].append(np.mean(ds.jet_pt))
    columns["mean_jet_E"].append(np.mean(ds.jet_E))
    columns["mean_jet_eta"].append(np.mean(ds.jet_eta))
    columns["mean_jet_phi"].append(np.mean(ds.jet_phi))
    
    ### Photon information
    columns["photon_n"].append(ds.photon_n)
    columns["mean_photon_pt"].append(np.mean(ds.photon_pt))
    columns["mean_photon_E"].append(np.mean(ds.photon_E))
    columns["mean_photon_ptcone30"].append(np.mean(ds.photon_ptcone30))
    columns["mean_photon_etcone20"].append(np.mean(ds.photon_etcone20))
    columns["mean_photon_eta"].append(np.mean(ds.photon_eta))
    columns["mean_photon_phi"].append(np.mean(ds.photon_phi))
    
    ### LargeRjet information
    columns["largeRjet_n"].append(ds.largeRjet_n)
    columns["tot_largeRjet_m"].append(ds.largeRjet_m)
    columns["mean_largeRjet_pt"].append(np.mean(ds.largeRjet_pt))
    columns["mean_largeRjet_E"].append(np.mean(ds.largeRjet_E))
    columns["mean_largeRjet_eta"].append(np.mean(ds.largeRjet_eta))
    columns["mean_largeRjet_phi"].append(np.mean(ds.largeRjet_phi))
    
    
    ### Tau information 
    columns["tau_n"].append(ds.tau_n)
    columns["mean_tau_pt"].append(np.mean(ds.tau_pt))
    columns["mean_tau_E"].append(np.mean(ds.tau_E))
    columns["mean_tau_eta"].append(np.mean(ds.tau_eta))
    columns["mean_tau_phi"].append(np.mean(ds.tau_phi))
    
    
    ### Systematic uncertainty
    columns["mean_lep_pt_syst"].append(ds.lep_pt_syst)
    columns["met_et_syst "].append(ds.met_et_syst)
    columns["mean_jet_pt_syst"].append(ds.jet_pt_syst)
    columns["mean_photon_pt_syst"].append(ds.photon_pt_syst)
    columns["mean_largeRjet_pt_syst"].append(ds.largeRjet_pt_syst)
    columns["mean_tau_pt_syst"].append(ds.tau_pt_syst)
    
    
    
    

df = pd.DataFrame(data=columns)
        
print("Done!")
if isData == 0:
    print("Remebered to run over data? No? Set data = 1 at the top and run again")
else:
    print("Remebered to run over MC? No? Set data = 0 at the top and run again")


Total events 100000/1601489
Done!
Remebered to run over data? No? Set data = 1 at the top and run again
CPU times: user 20.1 s, sys: 198 ms, total: 20.3 s
Wall time: 20.4 s


In [None]:
df.to_hdf("datatest.csv","mini")