# Discover the Higgs with Deep Neural Networks
# Chapter 1: Introduction and Data Preparation

The input data was created from the 13 TeV ATLAS Open Data, available at http://opendata.atlas.cern/release/2020/documentation/index.html

For more information, see:<br>
Review of the 13 TeV ATLAS Open Data release, tech. rep. ATL-OREACH-PUB-2020-001, CERN, 2020, url: http://cds.cern.ch/record/2707171. All figures including auxiliary figures are available at https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PUBNOTES/ATL-OREACH-PUB-2020-001

## The Higgs Boson at ATLAS Experiment

The data was recorded by the ATLAS detector, one of the four large detectors at the Large Hadron Collider (LHC) at the CERN research center:

<div>
<img src='figures/ATLAS_detector.png' width='700'/>
</div>
ATLAS Experiment © 2008 CERN

The data analyzed in this Jupyter notebook was measured in 2016 at a centre-of-mass energy of $\sqrt{s}=13 \text{ TeV}$ and corresponds to an integrated luminosity of $10 \text{ fb}^{-1}$. To search for H$\rightarrow$ZZ$\rightarrow llll$ events, only events with four reconstructed leptons in the final state are included in the given datasets. This process is also called the "golden channel" since it has the cleanest signature for the Higgs measurement. The final hits in the ATLAS detector look similar to:

<div>
<img src='figures/ATLAS_four_lepton_event.png' width='700'/>
</div>
https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/CONFNOTES/ATLAS-CONF-2011-162/

In the following Feynman diagram the Higgs boson originates from the interaction of gluons from the two colliding protons. Since the Higgs boson only couples to massive particles, it cannot be produced directly by the massless gluons. Instead, the Higgs boson is generated via the intermediate step of a top quark loop. The top quark is the heaviest known elementary particle and thus couples very strongly to the Higgs boson. Due to its high mass, the Higgs boson decays almost immediately, here into two Z bosons. Since these Z bosons also have a high mass, each of them in turn decays into a lepton-antilepton pair before reaching the detector.
<div>
<img src='figures/H_ZZ_feynman_diagram.png' width='500'/>
</div>

## Simulation and Event Weights

Quantum physics does not predict the outcome of an individual process, only its probability. In order to make a prediction for the measurement at the ATLAS detector, the expected frequencies must therefore be simulated. For this purpose, random events are generated according to the probability densities and their measurement in the detector is then simulated. Generating exactly the expected number of events results in the following distribution for the lepton with the largest transverse momentum. The distribution is split into the different Higgs processes (ggH125_ZZ4lep, VBFH125_ZZ4lep, WH125_ZZ4lep and ZH125_ZZ4lep) and the background processes (llll, Zee, Zmumu and ttbar_lep).
<div>
<img src='figures/event_weights_few_not_applied.png' width='500'/>
</div>

This plot already offers a good insight into how the measured data could be distributed. Unfortunately, the distribution is not very smooth due to the low number of events. To improve the prediction, more events are simulated than would actually be expected in the data. A particularly large number of events is generated for processes of high interest, such as the Higgs processes here. The higher statistics result in much smoother predictions.
<div>
<img src='figures/event_weights_all_not_applied.png' width='500'/>
</div>

However, the ratios between the different processes have now shifted and the predicted total number of events has increased drastically. This distribution of the "raw" simulation events thus no longer corresponds to what can be expected for the actual measurement. To correct this, event weights are applied: each simulated event enters the distribution only as a fraction of an event, given by its event weight. This weight depends on the simulated process as well as on the kinematic region of the event. In addition, there are negative weights to compensate for excess simulated events. The result is comparable to the initial distribution but offers a much smoother prediction.
<div>
<img src='figures/event_weights_all_applied.png' width='500'/>
</div>
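The effect of event weights can be reproduced with a small toy example. The following is only a sketch under stated assumptions: the exponential $p_T$ spectrum, the event counts and the single constant weight are hypothetical and are not taken from the ATLAS simulation, where the weight varies from event to event.

```python
import numpy as np

rng = np.random.default_rng(10)

# Hypothetical toy process: 50 expected events, simulated with 50000 raw events
n_expected = 50
n_generated = 50_000

# Toy transverse momentum spectrum (assumption: exponential with a scale of 40 GeV)
pt_raw = rng.exponential(scale=40.0, size=n_generated)

# Each raw event only counts as a fraction of an expected event
weights = np.full(n_generated, n_expected / n_generated)

bins = np.linspace(0, 300, 31)
hist_raw, _ = np.histogram(pt_raw, bins=bins)
hist_weighted, _ = np.histogram(pt_raw, bins=bins, weights=weights)

print(f'Raw events in histogram:      {hist_raw.sum()}')
print(f'Weighted events in histogram: {hist_weighted.sum():.2f}')
```

The weighted histogram keeps the smooth shape of the large raw sample, while its integral corresponds to the expected number of events (up to the few events beyond the last bin edge).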

## Data Preparation

### Load the Data

In [None]:
# Necessary imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os

# Import some common functions created for this notebook
import common

# Random state: seed NumPy's global random number generator for reproducibility
random_state = 10
np.random.seed(random_state)

The goal of this lab course is to train a deep neural network (DNN) to separate Higgs boson signal events from background events. The most important signal sample, ggH125_ZZ4lep, corresponds to the process $gg\rightarrow$H$\rightarrow$ZZ. The dominant background sample is $llll$, resulting from Z and ZZ decays.
After training, the DNN model will be used to classify the events of the data samples.

Higgs signal samples:
- ggH125_ZZ4lep
- VBFH125_ZZ4lep
- WH125_ZZ4lep
- ZH125_ZZ4lep

Background samples:
- llll
- Zee
- Zmumu
- ttbar_lep

Data samples:
- data_A
- data_B
- data_C
- data_D

In [None]:
# Define the input samples
sample_list_signal = ['ggH125_ZZ4lep', 'VBFH125_ZZ4lep', 'WH125_ZZ4lep', 'ZH125_ZZ4lep']
sample_list_background = ['llll', 'Zee', 'Zmumu', 'ttbar_lep']

In [None]:
sample_path = 'input'
# Read all the samples
no_selection_data_frames = {}
for sample in sample_list_signal + sample_list_background:
    no_selection_data_frames[sample] = pd.read_csv(os.path.join(sample_path, sample + '.csv'))

### Input Variables

The input provides several variables to classify the events. Since each event has multiple leptons, they are sorted in descending order of their transverse momentum. Thus, lepton 1 has the highest transverse momentum, lepton 2 the second highest, and so on. <br>
Most of the given variables can be called low-level, because they represent event or object properties that are derived directly from the reconstruction in the detector. High-level variables, in contrast, result from the combination of several low-level variables. In the given dataset the only high-level variables are invariant masses of multiple particles:<br>
$m_{inv} = \sqrt{\left(\sum\limits_{i=1}^{n} E_i\right)^2 - \left(\sum\limits_{i=1}^{n} \vec{p}_i\right)^2}$


List of all available variables:<br>
- Scale and event weight
     - The scaling for a dataset is given by the sum of event weights, the cross section, the luminosity and an efficiency scale factor
     - Each event has an additional specific event weight
     - To combine simulated events and finally compare them to data, each event has to be scaled by its event weight
     - The weights are not used for training
     - Variable name: `totalWeight`
- Number of jets
     - Jets are particle showers which result primarily from quarks and gluons
     - Variable name: `jet_n`
- Invariant four lepton mass
     - The invariant mass $m_{inv}(l_1, l_2, l_3, l_4)$ is the reconstructed invariant mass of the full four lepton event.<br>
     This variable is to be displayed later but not used for training.
     - Variable name: `lep_m_llll`
- Invariant two lepton mass
     - Invariant masses $m_{inv}(l_i, l_j)$ of all combinations of two leptons
     - Variable names: `lep_m_ll_12`, `lep_m_ll_13`, `lep_m_ll_14`, `lep_m_ll_23`, `lep_m_ll_24`, `lep_m_ll_34`
- Transverse momentum $p_T$ of the leptons
     - The momentum in the plane transverse to the beam axis
     - Variable names: `lep1_pt`, `lep2_pt`, `lep3_pt`, `lep4_pt`
- Lepton azimuthal angle
     - The azimuthal angle $\phi$ is measured in the plane transverse to the beam axis
     - Variable names: `lep1_phi`, `lep2_phi`, `lep3_phi`, `lep4_phi`
- Lepton pseudorapidity
     - The angle $\theta$ is measured between the lepton track and the beam axis.<br>
     Since this angle does not transform simply under boosts along the beam axis, the pseudorapidity $\eta = - \ln{\tan{\frac{\theta}{2}}}$ is primarily used in ATLAS analyses. For massless particles, differences in $\eta$ are invariant under such boosts
     - Variable names: `lep1_eta`, `lep2_eta`, `lep3_eta`, `lep4_eta`
- Lepton energy
     - The energy of the leptons reconstructed from the calorimeter entries
     - Variable names: `lep1_e`, `lep2_e`, `lep3_e`, `lep4_e`
- Lepton PDG-ID
     - The lepton type is classified by a number assigned by the Particle Data Group.<br>
     The lepton types are PDG-ID$(e)=11$, PDG-ID$(\mu)=13$ and PDG-ID$(\tau)=15$
     - Variable names: `lep1_pdgId`, `lep2_pdgId`, `lep3_pdgId`, `lep4_pdgId`
- Lepton charge
     - The charge of the given lepton reconstructed from the lepton track
     - Variable names: `lep1_charge`, `lep2_charge`, `lep3_charge`, `lep4_charge`
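
To illustrate how the invariant mass follows from the low-level variables, here is a minimal sketch. It assumes natural units ($c = 1$), that $p_T$ and $E$ are given in the same unit, and the standard relations $p_x = p_T \cos\phi$, $p_y = p_T \sin\phi$, $p_z = p_T \sinh\eta$. The helper `invariant_mass` and the example values are hypothetical and not part of the `common` module.

```python
import numpy as np


def invariant_mass(pt, eta, phi, e):
    """Invariant mass of a system of particles given as (pT, eta, phi, E) arrays."""
    pt, eta, phi, e = (np.asarray(x, dtype=float) for x in (pt, eta, phi, e))
    # Cartesian momentum components of each particle
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    # Squared sum of the energies minus squared sum of the momentum vectors
    m_squared = e.sum()**2 - (px.sum()**2 + py.sum()**2 + pz.sum()**2)
    # Guard against tiny negative values caused by rounding
    return np.sqrt(max(m_squared, 0.0))


# Two hypothetical back-to-back leptons, treated as massless (E = |p|)
m_ll = invariant_mass(pt=[45.6, 45.6], eta=[0.0, 0.0], phi=[0.0, np.pi], e=[45.6, 45.6])
print(f'm_ll = {m_ll:.1f}')  # → m_ll = 91.2
```

Since the two momenta cancel, the invariant mass equals the sum of the two energies. Passing four leptons instead of two gives the analogue of `lep_m_llll`.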

### Event Pre-Selection

Before we start with the pre-selection of the input data, check the number of events per process.

In [None]:
# Loop over all processes
for sample in sample_list_signal + sample_list_background:
    # Sum over the weights is equal to the number of expected events
    n_events = sum(no_selection_data_frames[sample]['totalWeight'])
    # Number of raw simulation events
    n_events_raw = len(no_selection_data_frames[sample]['totalWeight'])
    print(f'{sample}: {round(n_events, 2)}; {n_events_raw} (raw)')

Although the final selection of the data is to be performed by a DNN, a rough pre-selection of the data is still useful.
For this purpose, selection criteria are defined that return either true or false based on the event kinematics and thus decide whether the respective event is kept or discarded.
Suitable criteria for this analysis are very basic selections that must clearly be fulfilled by H$\rightarrow$ZZ$\rightarrow llll$ processes. So let's have another look at the corresponding Feynman diagram.

<div>
<img src='figures/H_ZZ_feynman_diagram.png' width='500'/>
</div>

From this Feynman diagram of the Higgs decay, two very basic criteria can be derived.

<font color='blue'>
Task:

Implement the baseline selection criteria that reduce the background while keeping almost all Higgs events:
1. Lepton charge:<br>
    The Higgs boson is electrically neutral. Thus, the total charge of all its decay products has to be neutral.
2. Lepton type:<br>
    If a Z boson decays into two leptons, only lepton pairs of the same type can be produced. So the processes Z$\rightarrow ee$ and Z$\rightarrow \mu\mu$ are possible but not Z$\rightarrow e\mu$. Since $\tau$ leptons have a very high mass, they decay before reaching the detector and are therefore not considered in this notebook.

Keep in mind that the leptons are ordered by their transverse momentum. Thus, it is not obvious which leptons originate from the same Z boson.
</font>

In [None]:
def selection_lepton_type(lep_type_0, lep_type_1, lep_type_2, lep_type_3):
    """Only keep lepton type combinations resulting from H->ZZ->llll"""
    # Select events like eeee, mumumumu or eemumu
    sum_lep_type = lep_type_0 + lep_type_1 + lep_type_2 + lep_type_3
    return sum_lep_type == 44 or sum_lep_type == 48 or sum_lep_type == 52


def selection_lepton_charge(lep_charge_0, lep_charge_1, lep_charge_2, lep_charge_3):
    """Only keep lepton charge combinations resulting from H->ZZ->llll"""
    # Select events where the sum of all lepton charges is zero
    sum_lep_charge = lep_charge_0 + lep_charge_1 + lep_charge_2 + lep_charge_3
    return sum_lep_charge == 0
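
As a quick cross-check of the sum-based lepton type criterion, the following sketch enumerates all possible flavour contents of four leptons and prints the sum of their PDG-IDs. It assumes that the PDG-IDs are stored unsigned (11 and 13), as the sums 44, 48 and 52 used above imply, and that no $\tau$ leptons occur in the samples.

```python
from itertools import combinations_with_replacement

# PDG-IDs as listed above; taus are not considered
PDG_ID = {'e': 11, 'mu': 13}

sum_to_pairable = {}
# Enumerate every flavour content of four leptons (the order does not matter)
for flavours in combinations_with_replacement(PDG_ID, 4):
    id_sum = sum(PDG_ID[f] for f in flavours)
    # Two same-type pairs need an even number of electrons and of muons
    pairable = flavours.count('e') % 2 == 0 and flavours.count('mu') % 2 == 0
    sum_to_pairable[id_sum] = pairable
    print(f'{"".join(flavours):<8} sum = {id_sum}  two same-type pairs: {pairable}')
```

Every flavour content has a distinct sum, and only the sums 44, 48 and 52 correspond to two same-type pairs. This is why the sum does not need to know which leptons come from the same Z boson. With $\tau$ leptons included the sum would become ambiguous, since for example $11+11+11+15$ also gives 48.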

In [None]:
# Create a copy of the original data frame to investigate later
data_frames = no_selection_data_frames.copy()

# Apply the chosen selection criteria
for sample in sample_list_signal + sample_list_background:
    # Selection on lepton type
    type_selection = np.vectorize(selection_lepton_type)(
        data_frames[sample].lep1_pdgId,
        data_frames[sample].lep2_pdgId,
        data_frames[sample].lep3_pdgId,
        data_frames[sample].lep4_pdgId)
    data_frames[sample] = data_frames[sample][type_selection]

    # Selection on lepton charge
    charge_selection = np.vectorize(selection_lepton_charge)(
        data_frames[sample].lep1_charge,
        data_frames[sample].lep2_charge,
        data_frames[sample].lep3_charge,
        data_frames[sample].lep4_charge)
    data_frames[sample] = data_frames[sample][charge_selection]
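
The same selection can also be expressed with column-wise boolean masks instead of `np.vectorize`, which is usually faster for large samples. This is only a sketch of an alternative, not the reference solution. The toy data frame is hypothetical and only reuses the column names of the input samples.

```python
import pandas as pd

# Hypothetical toy events using the column names of the input samples
toy = pd.DataFrame({
    'lep1_pdgId': [11, 11, 13, 11],
    'lep2_pdgId': [11, 13, 13, 11],
    'lep3_pdgId': [13, 11, 13, 11],
    'lep4_pdgId': [13, 11, 13, 11],
    'lep1_charge': [1, 1, 1, 1],
    'lep2_charge': [-1, -1, 1, -1],
    'lep3_charge': [1, 1, -1, 1],
    'lep4_charge': [-1, -1, -1, 1],
})

pdg_columns = [f'lep{i}_pdgId' for i in range(1, 5)]
charge_columns = [f'lep{i}_charge' for i in range(1, 5)]

# Lepton type: sum of the PDG-IDs must match eeee, eemumu or mumumumu
type_mask = toy[pdg_columns].sum(axis=1).isin([44, 48, 52])
# Lepton charge: total charge of the four leptons must vanish
charge_mask = toy[charge_columns].sum(axis=1) == 0

selected = toy[type_mask & charge_mask]
print(selected.index.tolist())  # → [0, 2]
```

Event 1 fails the lepton type criterion (three electrons and one muon) and event 3 fails the charge criterion (total charge of +2).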

<font color='blue'>
Task:

Check whether your selection criteria have the required effects
</font>

In [None]:
# Loop over all processes
for sample in sample_list_signal + sample_list_background:
    # Sum over the weights is equal to the number of expected events
    n_events = sum(data_frames[sample]['totalWeight'])
    # Number of raw simulation events
    n_events_raw = len(data_frames[sample]['totalWeight'])
    print(f'{sample}: {round(n_events, 2)}; {n_events_raw} (raw)')

If you are happy with your baseline selection, continue with the investigation of the data.
In order to use this pre-selection in the following chapters as well, let's save these functions so they can be imported later on.

In [None]:
from inspect import getsource
# Write the source code of each selection function to its own file
%save functions/selection_lepton_type.py getsource(selection_lepton_type)
%save -a functions/selection_lepton_charge.py getsource(selection_lepton_charge)

### Data Investigation

Before one can decide which variables are suitable for training, one must first get a feel for the input variables.
For this purpose, the input samples are merged into a set of signal events and a set of background events. Afterwards, the behavior of signal and background can be studied in multiple variables.

In [None]:
# Merge the signal and background data frames
data_frame_signal = common.merge_data_frames(sample_list_signal, data_frames)
data_frame_background = common.merge_data_frames(sample_list_background, data_frames)

The function `common.plot_hist(variable, data_frame_1, data_frame_2)` plots the given variable of the two datasets.
The variable must be a dictionary containing at least the name of the variable to plot. Additionally, one can specify the binning (list or numpy array) and the xlabel. The created histogram is automatically saved in the plots directory.<br>

<font color='blue'>
Task:

Which variable is the most discriminant? Which variables do not seem discriminant at all?
</font>

<font color='green'>
Answer:

- The most discriminant variable is the invariant mass of the four lepton system $m_{inv}(l_1, l_2, l_3, l_4)$ with a peak at the Higgs mass of 125 GeV.
- The angular variables $\phi$ are not sensitive to any process due to the symmetry of the detector.
</font>

An example for the transverse momentum of the leading lepton is given below:

In [None]:
# leading lepton pt
var_lep1_pt = {'variable': 'lep1_pt',
               'binning': np.linspace(0, 300, 50),
               'xlabel': '$p_T$ (lep 1) [GeV]'}

common.plot_hist(var_lep1_pt, data_frames)
common.plot_normed_signal_vs_background(var_lep1_pt, data_frame_signal, data_frame_background)
plt.show()

<font color='blue'>
Task:

What is the purity in signal events for the given data?
</font>

In [None]:
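signal_event_number = sum(data_frame_signal.totalWeight)
background_event_number = sum(data_frame_background.totalWeight)
signal_background_ratio = signal_event_number / background_event_number
# Purity is the signal fraction of all selected events: S / (S + B)
signal_purity = signal_event_number / (signal_event_number + background_event_number)
print(f'There are {round(signal_event_number, 2)} signal ({len(data_frame_signal)} raw MC events) '
      f'and {round(background_event_number, 2)} background events ({len(data_frame_background)} raw MC events)')
print(f'This gives a signal-to-background ratio of {round(signal_background_ratio * 100, 2)}% '
      f'and a purity of {round(signal_purity * 100, 2)}%')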

As already seen, the number of simulated raw events is significantly higher than the weighted number of expected events. The contribution of a simulated event to the final prediction is thus given by the respective event weight.

<font color='blue'>
Task:

How many raw events are included in each process simulation and what is the corresponding total prediction? What are the minimal, median and maximal event weights of each process?
</font><br>

In [None]:
for sample in sample_list_signal + sample_list_background:
    print(f'{sample}:')
    n_events = sum(data_frames[sample]['totalWeight'])
    n_events_raw = len(data_frames[sample]['totalWeight'])
    n_events_neg = sum(data_frames[sample]['totalWeight'] * (data_frames[sample]['totalWeight'] < 0))
    n_events_neg_raw = sum(list(data_frames[sample]['totalWeight'] < 0))
    min_weight = data_frames[sample]['totalWeight'].min()
    med_weight = data_frames[sample]['totalWeight'].median()
    max_weight = data_frames[sample]['totalWeight'].max()
    print(f'  Raw events:      {n_events_raw}')
    print(f'  Prediction:      {round(n_events, 2)}')
    print(f'  Neg. raw events: {round(100 * n_events_neg_raw / n_events_raw, 2)}%')
    print(f'  Neg. events:     {abs(round(100 * n_events_neg / n_events, 2))}%')
    print(f'  Minimal weight:  {min_weight}')
    print(f'  Median weight:   {med_weight}')
    print(f'  Maximal weight:  {max_weight}')
    print()

As expected, the different processes were simulated with different statistical precision. To model 267 $llll$ events, more than half a million raw events are used, whereas 96 Z$\rightarrow ll$ events are modelled by only 500 generated events.

Furthermore, we can see that the event weights even reach into the negative range. Negatively weighted events are produced to compensate for overshooting predictions in certain kinematic regions.
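
How precisely a weighted sample models a process can be quantified with two standard estimates: the statistical uncertainty of the weighted yield, $\sqrt{\sum_i w_i^2}$, and the effective number of events, $\left(\sum_i w_i\right)^2 / \sum_i w_i^2$. The following is a minimal sketch. The helper `weighted_yield` and the toy weights are hypothetical and not part of the notebook.

```python
import numpy as np


def weighted_yield(weights):
    """Return the predicted yield, its statistical uncertainty and the effective number of events."""
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    sum_w2 = np.sum(weights**2)
    uncertainty = np.sqrt(sum_w2)
    n_effective = total**2 / sum_w2
    return total, uncertainty, n_effective


# Hypothetical sample: 1000 raw events, 10% of them with a negative weight
toy_weights = np.concatenate([np.full(900, 0.1), np.full(100, -0.1)])
total, uncertainty, n_effective = weighted_yield(toy_weights)
print(f'Prediction: {total:.1f} +- {uncertainty:.2f}, effective events: {n_effective:.0f}')
```

In this toy sample the negative weights reduce the prediction to 80 events and the effective number of events to 640, although 1000 raw events were generated. Applied to `data_frames[sample]['totalWeight']`, the same estimate would show which processes are modelled with few effective events.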