# How to use:

This notebook can be used to prepare two datasets from the CERN Open Data Portal for the analysis.
The two datasets can be found at the following links:

- Run 2012B: https://opendata.cern.ch/record/12365
- Run 2012C: https://opendata.cern.ch/record/12366

This notebook extracts all the necessary information from the original files in ROOT format and saves it to a CSV file for easy use.
Some basic filtering is performed to reduce the file size.

After downloading the original files from the CERN Open Data Portal, place them in the same directory as this notebook and execute all cells.
The resulting dataset is then saved in the same directory as the zip-compressed CSV file "Run2012BC_DoubleMuons_prefiltered.zip".

The following four imports are required.


In [None]:
import pandas as pd
import uproot
import numpy as np
import awkward as ak

In [None]:
#open files
events1 = uproot.open("Run2012B_DoubleMuParked.root:Events") #read file 1
events2 = uproot.open("Run2012C_DoubleMuParked.root:Events") #read file 2

In [None]:
#create boolean masks that select events with exactly 2 muons
mask1 = events1['nMuon'].array(library="np") == 2
mask2 = events2['nMuon'].array(library="np") == 2

In [None]:
#create new df to copy values into
colNames = ['pt','eta','phi','Q','dxy','dz','Iso3'] #names of columns in df
#the columns dxy, dz and Iso3 are currently only in use in the analysis behind the thesis
nCols = len(colNames) #how many columns per particle
#two-level column index: first level is the muon (mu1/mu2), second level is the quantity
df = pd.DataFrame(columns = pd.MultiIndex.from_arrays([nCols*['mu1']+nCols*['mu2'],colNames+colNames]))
df
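
The two-level column layout created above can be tried out on a small toy table. The following is only a sketch: the values are invented, and the column list is shortened to two quantities for readability.

```python
import pandas as pd

# Shortened, hypothetical subset of the column names used in this notebook
col_names = ["pt", "Q"]
n_cols = len(col_names)
columns = pd.MultiIndex.from_arrays(
    [n_cols * ["mu1"] + n_cols * ["mu2"], col_names + col_names]
)

# Two invented events: (pt, Q) of the first muon, then (pt, Q) of the second
toy = pd.DataFrame([[25.0, 1, 30.0, -1], [10.0, -1, 12.0, -1]], columns=columns)

leading_pt = toy["mu1", "pt"]            # one column, addressed as (muon, quantity)
mu2_block = toy["mu2"]                   # all columns belonging to the second muon
opposite = toy[toy.mu1.Q != toy.mu2.Q]   # row filter using attribute access
```

Addressing a column as `(muon, quantity)` keeps both muons of an event in one row, which is the layout the rest of the notebook fills in.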

In [None]:
#names from ROOT file
rootCols = ['Muon_pt','Muon_eta','Muon_phi','Muon_charge','Muon_dxy','Muon_dz','Muon_pfRelIso03_all']
#dictionary to translate from ROOT column names to df column names
names = dict(zip(rootCols,colNames))

In [None]:
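for c in rootCols:
    #keep only events with exactly 2 muons and convert them to NumPy arrays of shape (nEvents, 2)
    data1 = ak.to_numpy(events1[c].array()[mask1])
    data2 = ak.to_numpy(events2[c].array()[mask2])
    #concatenate both runs and transpose so that data[0] holds the first and data[1] the second muon
    data = np.concatenate((data1,data2)).T
    df["mu1",names[c]] = data[0]
    df["mu2",names[c]] = data[1]
#keep only events in which both muons have a non-negative isolation value
df = df[(df.mu1.Iso3 >= 0) & (df.mu2.Iso3 >= 0)]
#sort the column index
df = df.sort_index(axis=1)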
df

In [None]:
#the .zip extension makes pandas write a zip-compressed CSV file
df.to_csv("Run2012BC_DoubleMuons_prefiltered.zip",index = False)
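
Because the table has two header levels, reading the saved file back needs both header rows to be declared. The sketch below shows this on a toy table written to a temporary directory; the file name `toy_prefiltered.zip` and the values are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy table with the same two-level column layout (invented values)
columns = pd.MultiIndex.from_product([["mu1", "mu2"], ["eta", "pt"]])
toy = pd.DataFrame([[0.1, 25.0, -0.3, 30.0]], columns=columns)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_prefiltered.zip")  # hypothetical file name
    toy.to_csv(path, index=False)                    # compression inferred from ".zip"
    # header=[0, 1] rebuilds the (muon, quantity) column index from the first two rows
    restored = pd.read_csv(path, header=[0, 1])
```

The same `header=[0, 1]` argument should apply when loading "Run2012BC_DoubleMuons_prefiltered.zip" in a later analysis notebook.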