# Optimization study notebook

### Data exploration

The first step of any analysis is to understand what we are searching for. In this analysis we aim to measure central exclusive di-lepton production, the $pp\to p\oplus \ell\ell \oplus p$ process with $\ell\in\{ e,\mu \}$. Feynman diagrams of this process are shown below:

<img src="img/diagrams.png" alt="Feynman diagrams" style="width: 700px;"/>

In our case, we consider only electrons and muons.

---
The measurement using 2016 data was published in [JHEP07(2018)153](https://arxiv.org/abs/1803.04496). This is the first time the 2017 data will be used to measure the central exclusive di-lepton production process at higher precision.



---
<b>Remark</b>: Exclusive production of $\tau$ leptons has not yet been measured at the LHC. Unlike electrons and muons, $\tau$ leptons have a relatively short lifetime ($c\tau_0=87\,\mu\text{m}$) and are observed only via their decay products. The main challenge is that the neutrinos from the decay escape detection, so measuring the momentum of the $\tau$ lepton is tricky. It is still possible, because the opening angle between two daughter particles boosted with [Lorentz factor](https://en.wikipedia.org/wiki/Lorentz_factor) $\gamma$ is given by $\theta \sim 2/\gamma$. With a large enough boost, the decay products become nearly collinear, and the neutrino four-momentum can be reconstructed.
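
To get a feel for the numbers in the remark, the sketch below evaluates $\gamma = E/m$, the rough opening angle $\theta\sim 2/\gamma$, and the mean flight distance $\beta\gamma c\tau_0$ for a few $\tau$ energies. The $\tau$ mass (about 1.777 GeV), the example energies, and the helper names are chosen here for illustration and are not part of the analysis code.

```python
import math

TAU_MASS_GEV = 1.777   # approximate tau mass (illustrative constant)
TAU_CTAU_UM = 87.0     # proper decay length c*tau_0 in micrometres, as quoted in the remark

def lorentz_factor(energy_gev, mass_gev=TAU_MASS_GEV):
    """Lorentz factor gamma = E / m in natural units."""
    return energy_gev / mass_gev

def opening_angle(energy_gev):
    """Rough opening angle theta ~ 2 / gamma, in radians."""
    return 2.0 / lorentz_factor(energy_gev)

def mean_decay_length_um(energy_gev):
    """Mean lab-frame flight distance beta*gamma*c*tau_0, in micrometres."""
    gamma = lorentz_factor(energy_gev)
    beta_gamma = math.sqrt(gamma**2 - 1.0)
    return beta_gamma * TAU_CTAU_UM

for energy in (10.0, 50.0, 200.0):  # hypothetical tau energies in GeV
    print(f"E = {energy:6.1f} GeV: gamma = {lorentz_factor(energy):7.1f}, "
          f"theta ~ {opening_angle(energy):.4f} rad, "
          f"flight distance ~ {mean_decay_length_um(energy) / 1000:.2f} mm")
```

The opening angle shrinks as the energy grows, which is why the collinear treatment only becomes a reasonable approximation for strongly boosted $\tau$ leptons.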

To understand the process better, we will explore the final state signature. As we mentioned [earlier](https://twiki.cern.ch/twiki/bin/view/CMS/SWGuideCMSDataAnalysisSchoolCERN2020TaggedProtonsLongExercise#TASK_3_Optimization_study), the samples are stored in HDF5 files (read here with the `h5py` package), which can be easily accessed from a Jupyter notebook.


In [None]:
#start with standard python imports
#!pip install --user mplhep #enable this line if the import of mplhep fails
import numpy as np
import pandas as pd
import h5py
import matplotlib.pyplot as plt
import mplhep as hep
from matplotlib.colors import LogNorm

In [None]:
#to make the plots in CMS style execute this line
plt.style.use([hep.style.ROOT, hep.style.firamath])
plt.style.use(hep.style.CMS)

In [None]:
#Execute this line if running on SWAN, otherwise update the path to the data files:
PATH='/eos/user/c/cmsdas/long-exercises/pps-exclusive-dilepton/h5py'
#PATH='output'

## Loading the data (signal)

We will load the `h5` files of the simulated signal events. Note that three different di-lepton production processes are considered: exclusive, semi-exclusive, and inclusive (see Figure 1). We will load the files and convert them to pandas DataFrames. Let's explore the differences between the processes.

### Data format

We will use the function `GetData(filename)` below to read an `h5` file and convert the data to a pandas DataFrame.

In [None]:
def GetData(filename):
    """ opens a summary file and converts it to a pandas dataframe """

    with h5py.File(filename, 'r') as f:
        # read the full dataset into memory while the file is still open
        values = f['protons'][()]
        # column names are stored as bytes and must be decoded to strings
        columns_str = [ item.decode("utf-8") for item in f['columns'] ]

    return pd.DataFrame( values, columns=columns_str )
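
As a minimal illustration of what `GetData` does, the sketch below mimics the two datasets it reads, a 2D array of values (`protons`) and an array of byte-string column names (`columns`), with small in-memory arrays instead of an actual file. The column selection and all numbers are invented for the example.

```python
import numpy as np
import pandas as pd

# Stand-ins for f['protons'] and f['columns']; values are invented.
toy_values = np.array([[1.0, 0.02, 350.0],
                       [2.0, 0.45, 91.0]])
toy_columns = np.array([b'EventNum', b'Acopl', b'InvMass'])

# The names arrive as bytes, so they are decoded before being used as labels.
column_names = [name.decode('utf-8') for name in toy_columns]
toy_df = pd.DataFrame(toy_values, columns=column_names)
print(toy_df)
```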

In [None]:
#load the signal/background samples into the dataframes (takes some time)
df_signal_excl = GetData(PATH+'/output-GGToEE_Elastic_v0_signal_xa120_era2017_preTS2.h5')
df_background = GetData(PATH+'/output-MC13TeV_DYToLL50toInf_fxfx_v0.h5')

In [None]:
#Load the data samples into the dataframes (takes some time)
df_data={}
eras=['B']
#eras=['B','C','D','E','F'] #uncomment to process all data
for x in eras:
    df_data[x] = GetData(PATH+'/output-UL2017{}-El.h5'.format(x))
    df_data[x]['era']=x
    print('output-UL2017{}-El shape = {}'.format(x,df_data[x].shape))

#combine all into a single one
df_data=pd.concat([df_data[x] for x in eras])

### Exploring the data files

Similarly to what we did with ROOT files in the short exercise, let's look at the info we have in the dataframes:

In [None]:
def PrintInfoFromDF(df):
    print('Print all branches:')
    print(df.keys())
    print('Size of the data is ',df.shape)

In [None]:
PrintInfoFromDF(df_signal_excl)
PrintInfoFromDF(df_data)

As you can see, we have 38 different columns and 212744 rows in the file (each row corresponds to a different event). In data we have added an extra column to flag which data-taking era the event comes from.

**TASK A**

Look at the distributions of different kinematic variables for the different processes and check whether you observe any differences. The function `PlotFromDF` below plots the normalized shapes of a selected variable.


In [None]:
def PlotFromDF(variable, xmin, xmax, nbins, dataframes, _labels, ax, log=False):
    # nbins bins require nbins+1 bin edges
    bins = np.linspace(xmin,xmax,nbins+1)
    data=[]; labels=[]
    for df, label in zip(dataframes, _labels):
        # density=True normalizes each histogram to unit area
        h, _ = np.histogram(df[variable], bins,density=True)
        data.append(h)
        labels.append(label)
    hep.histplot(data, bins, ax=ax, label=labels)
    hep.cms.label(llabel="Preliminary", rlabel="CMSvDAS 2020", ax=ax)
    ax.legend()
    ax.set(xlabel=variable, ylabel='p.d.f.')
    # act on the axes that were passed in, not on the current pyplot figure
    if log: ax.set_yscale("log")
    ax.figure.savefig(variable+'.png')
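
`PlotFromDF` relies on `density=True` in `np.histogram`, which scales each histogram so that its integral over the plotted range is one. A quick check on toy data (random numbers invented for this example) could look like this:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
toy_sample = rng.uniform(0.0, 1.0, size=1000)   # invented toy values

bin_edges = np.linspace(0.0, 1.0, 11)            # 10 bins need 11 edges
heights, _ = np.histogram(toy_sample, bin_edges, density=True)

# With density=True the sum of height * bin width equals one
integral = np.sum(heights * np.diff(bin_edges))
print(f"integral = {integral:.3f}")
```

Entries outside the bin range are dropped before the normalization, so the plotted shapes compare the samples only within `[xmin, xmax]`.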

In the following example, we will plot the di-lepton [acoplanarity](https://en.wikipedia.org/wiki/Acoplanarity), defined as
$$A = 1 - |\Delta\phi(\ell,\ell)|/\pi$$

In exclusive events, due to the absence of additional radiation, the two leptons are expected to be produced back-to-back, i.e. with $|\Delta\phi(\ell,\ell)|\sim\pi$ and hence $A\sim 0$.
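
The definition above can be evaluated directly from the two lepton azimuthal angles. The sketch below is a minimal illustration with hypothetical helper names (`delta_phi`, `acoplanarity`); it folds the azimuthal difference into $[0,\pi]$, which is assumed here to be the convention behind the stored `Acopl` column.

```python
import math

def delta_phi(phi1, phi2):
    """Absolute azimuthal separation folded into [0, pi]."""
    dphi = abs(phi1 - phi2) % (2.0 * math.pi)
    return 2.0 * math.pi - dphi if dphi > math.pi else dphi

def acoplanarity(phi1, phi2):
    """A = 1 - |delta phi| / pi; close to 0 for back-to-back leptons."""
    return 1.0 - delta_phi(phi1, phi2) / math.pi

# Nearly back-to-back pair (invented angles): A is close to zero
print(acoplanarity(1.0, 1.0 - math.pi + 0.05))
# Pair pointing in almost the same direction across the +-pi boundary: A is close to one
print(acoplanarity(3.0, -3.0))
```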

In [None]:
# we will be plotting MC prediction with the data, where the data is mostly populated with background events
procc = [df_signal_excl,df_background,df_data]
labels = ['Exclusive dilep','inclusive Z','data']

In [None]:
f, ax = plt.subplots()
PlotFromDF('Acopl',0,1,100,procc,labels, ax, log=True)

The next important variables are the invariant mass and the number of tracks associated with the primary vertex. The ntuples store two versions of the track multiplicity:
- `ExtraPfCands`: number of tracks measured from the PFlow candidates (for more info follow the [Tracks and Vertices](https://twiki.cern.ch/twiki/bin/view/CMS/SWGuideCMSDataAnalysisSchoolCERN2020TrackingAndVertexingShortExercise) short exercise)
- `ExtraPfCands_v1`: same as above, but including only tracks with $|\eta|<2.1$

In [None]:
f, ax = plt.subplots()
PlotFromDF('ExtraPfCands',0,20,20,procc,labels, ax, log=False)

In [None]:
f, ax = plt.subplots()
PlotFromDF('ExtraPfCands_v1',0,20,20,procc,labels, ax, log=False)

In [None]:
f, ax = plt.subplots()
PlotFromDF('InvMass',0,800,100,procc,labels, ax, log=True)

## Selection of the signal region

We see discrimination between our signal and the main background when plotting the acoplanarity and track multiplicity variables. It is common to define a figure of merit to choose the optimal selection cut. The simple approach is to ask: _Which cut will give us the highest_ $Z_0=\frac{s}{\sqrt{b}}$ _value?_ Note that this figure of merit is valid only when the background is large compared to the signal ($b\gg s$). A better approximation for a Poisson counting experiment is $Z_0 = \sqrt{2\left( \left(s+b\right)\ln\left(1+\frac{s}{b}\right)-s \right)}$ (Eq. 97 in [Eur. Phys. J. C71, 1554 (2011)](https://arxiv.org/pdf/1007.1727.pdf)), but here we can stick to $\frac{s}{\sqrt{b}}$.
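
To see where the two figures of merit agree and where they diverge, the sketch below evaluates both for a few invented $(s, b)$ pairs. The function names `z_simple` and `z_asimov` are hypothetical and only used for this comparison.

```python
import math

def z_simple(s, b):
    """Simple figure of merit s / sqrt(b)."""
    return s / math.sqrt(b)

def z_asimov(s, b):
    """Poisson-counting approximation sqrt(2((s+b) ln(1+s/b) - s))."""
    # log1p(x) evaluates ln(1+x) accurately when s/b is small
    return math.sqrt(2.0 * ((s + b) * math.log1p(s / b) - s))

for s, b in [(5.0, 1000.0), (5.0, 10.0), (5.0, 1.0)]:  # invented yields
    print(f"s = {s}, b = {b}: s/sqrt(b) = {z_simple(s, b):.3f}, "
          f"Poisson approximation = {z_asimov(s, b):.3f}")
```

For $b\gg s$ the two numbers nearly coincide, while for small backgrounds $s/\sqrt{b}$ overestimates the significance.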

<b>TASK B</b>

Write code that computes the significance for different selection cuts. Since we are interested in the relative estimate of the significance, compute the ratio of the signal efficiency to the square root of the background efficiency:
$$\varepsilon_s/\sqrt{\varepsilon_b} = \left(N_s^\text{cut}/\sqrt{N_b^\text{cut}}\right) / \left(N_s^\text{no-cut}/\sqrt{N_b^\text{no-cut}}\right) = \left(N_s^\text{cut}/N_s^\text{no-cut}\right) / \left(\sqrt{N_b^\text{cut} / N_b^\text{no-cut}}\right)$$

as a function of the cut.

Optimize the selection for electrons and muons separately.


In [None]:
# function that computes signal significance
def computeZ(Ns, Nb):
    return Ns/np.sqrt(Nb)

In [None]:
# function that computes event yields
def ComputeYields(signal, background, variable, cut, cut_type):
    # cut_type: -1 selects variable<cut, 0 selects variable==cut, 1 selects variable>cut
    # yields are counted as the number of distinct EventNum values
    if cut_type==-1:
        Ns = signal[signal[variable]<cut].groupby('EventNum').ngroups
        Nb = background[background[variable]<cut].groupby('EventNum').ngroups
    elif cut_type==0:
        Ns = signal[signal[variable]==cut].groupby('EventNum').ngroups
        Nb = background[background[variable]==cut].groupby('EventNum').ngroups
    elif cut_type==1:
        Ns = signal[signal[variable]>cut].groupby('EventNum').ngroups
        Nb = background[background[variable]>cut].groupby('EventNum').ngroups
    else:
        # unknown cut_type: return dummy yields that give zero significance
        Ns=0; Nb=1
        print('Error: wrong cut_type')
    return Ns, Nb
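
`ComputeYields` counts distinct `EventNum` values rather than rows. The two numbers agree when every row is a different event, and they would differ if an event appeared in more than one row. The toy table below (invented values) shows the difference:

```python
import pandas as pd

# Invented toy table in which events 101 and 103 each occupy two rows
toy = pd.DataFrame({
    'EventNum': [101, 101, 102, 103, 103],
    'Acopl':    [0.01, 0.01, 0.50, 0.03, 0.03],
})

selected = toy[toy['Acopl'] < 0.09]
n_rows = len(selected)                              # counts rows
n_events = selected.groupby('EventNum').ngroups     # counts distinct events
n_events_alt = selected['EventNum'].nunique()       # equivalent shortcut

print(n_rows, n_events, n_events_alt)
```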


In [None]:
#Relative significance for a single cut value (example for Acopl<0.09):
Nscut, Nbcut = ComputeYields(df_signal_excl,df_data,'Acopl',0.09,-1)
Nsnocut, Nbnocut = ComputeYields(df_signal_excl,df_data,'Acopl',1,-1)
sig_ratio = computeZ(Nscut, Nbcut) / computeZ(Nsnocut, Nbnocut)
print('Relative significance = ',sig_ratio)
print('Signal acceptance = ',Nscut/Nsnocut)
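
The single-cut example can be extended to a scan over many cut values. The sketch below is one possible starting point for TASK B, run on invented toy distributions instead of the real dataframes: a signal peaked at low acoplanarity and a flat background. The helper `relative_significance` is a hypothetical name, and the toy shapes are assumptions made only for this example.

```python
import numpy as np

def relative_significance(signal, background, cut):
    """eps_s / sqrt(eps_b) for the selection 'value < cut'."""
    eff_s = np.mean(signal < cut)        # fraction of signal passing the cut
    eff_b = np.mean(background < cut)    # fraction of background passing the cut
    return eff_s / np.sqrt(eff_b) if eff_b > 0 else 0.0

rng = np.random.default_rng(seed=42)
toy_signal = rng.exponential(scale=0.01, size=5000)    # peaked near zero (invented)
toy_background = rng.uniform(0.0, 1.0, size=50000)     # flat in [0, 1) (invented)

# Evaluate the figure of merit on a grid of cut values and pick the maximum
cuts = np.linspace(0.005, 1.0, 200)
scan = np.array([relative_significance(toy_signal, toy_background, c) for c in cuts])
best_cut = cuts[np.argmax(scan)]
print(f"best cut: value < {best_cut:.3f}, relative significance = {scan.max():.2f}")
```

With the real samples, the arrays would be replaced by the relevant dataframe columns, and the counting by `ComputeYields` so that events rather than rows are counted.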