# A particle physics application for Neural Networks: Top quark tagging

This tutorial draws heavily on material by Lisa Benato and Dirk Krücker (https://github.com/dkgithub/wuhan_DL_labs).

Notebook based on [CNNTopTagging.ipynb from LMU course](https://github.com/fuenfundachtzig/LMU_DA_ML/blob/master/CNNTopTagging.ipynb)

### The Standard Model and the top quark

<br>
<img src="figures/top_tagging/SM.png" width="400" >

The **Standard Model** of elementary particles represents our knowledge of the microscopic world. It describes the constituents of matter (quarks and leptons) and their interactions (mediated by bosons): the electromagnetic, the weak, and the strong interaction.

Among all these particles, the **top quark** is a very peculiar case. It is the heaviest known elementary particle (mass of about 172.5 GeV) and it has a very short lifetime (of order $10^{-25}$ seconds), which means we can only observe its decay products. It was discovered in 1995 by the CDF and D0 experiments at the Tevatron (Fermilab, near Chicago). The top quark is considered a key particle both for searches for new physics beyond the Standard Model and for precision measurements.

The ideal tool for measuring the properties of the top quark is a particle collider. The **Large Hadron Collider** (LHC), located near Geneva on the border between France and Switzerland, is the largest proton-proton collider ever built. It consists of a ring 27 km in circumference, in which proton beams travelling at about 99.999999% of the speed of light collide at a centre-of-mass energy of 13 TeV. At the LHC, the proton bunches cross about 40 million times per second, yielding an enormous amount of data. Using LHC collision data, the **ATLAS** and **CMS** experiments discovered the missing piece of the Standard Model, the Higgs boson, in 2012.

During a collision, the energy is so high that the protons are "broken" into their fundamental components, i.e. **quarks** and **gluons**, which can interact and produce particles that we do not observe in everyday life, such as the top quark. Top quark production is, however, a relatively "rare" phenomenon: other physical processes occur far more often, such as those initiated by the strong interaction that produce lighter quarks (up, down, or strange quarks). In high energy physics, the probability of a process is expressed by its **cross-section**: we say that top quark production has a smaller cross-section than the production of light quarks.

The experimental consequence is that distinguishing the decay products of a top quark from the products of a light quark can be extremely difficult, given that the latter process is far more likely to occur.
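As a rough guide to how a cross-section translates into event counts, the expected number of events $N$ for a process is the product of its cross-section $\sigma$ and the integrated luminosity (a measure of how many collisions were delivered). The symbols below are standard notation rather than something defined in this notebook, and detector acceptance and efficiency are ignored:

$$
N = \sigma \cdot \int \mathcal{L}\,\mathrm{d}t
$$

For the same amount of data, a process with a smaller $\sigma$ therefore yields proportionally fewer events.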

### Experimental signature of top quark in a particle detector

Let's first understand what the experimental signatures are and how our detectors work. This is a sketch of the CMS experiment.

<br>
<img src="figures/top_tagging/EPS_CMS_Slice.png" width="1000" >

A collider detector is organized in layers: each layer is able to distinguish and measure different particles and their properties. For example, the silicon tracker detects every charged particle. The electromagnetic calorimeter detects photons and electrons. The hadronic calorimeter detects hadrons (such as protons and neutrons). The muon chambers detect muons, which have a long lifetime and pass through the inner layers.

Our physics problem consists in detecting the so-called "hadronic decay" of a top quark. The decay chain is sketched here: the top quark decays into a bottom quark and a $W$ boson, which in turn decays into light quarks (in the picture, up and down quarks).

<br>
<img src="figures/top_tagging/top.png" width="500" >

Our background, instead, consists of one or more light quarks produced by the strong interaction (in jargon, QCD). Here is a sketch of one possible background event.

<br>
<img src="figures/top_tagging/QCD.png" width="200" >

#### Jets

Without going into the theoretical details, particles that experience the strong interaction (like quarks) cannot travel freely: they are forced to be "confined" inside hadrons. A hadron can be seen as a "combination" of quarks. Think of the electromagnetic interaction: a positive charge and a negative charge attract each other and tend to form a state that is neutral under the electromagnetic interaction. Analogously, quarks combine to form a bound state that is neutral under the strong interaction. This process is called **hadronization**, and it has a very important consequence: quarks do not appear as single isolated particles in a detector, but rather as **jets** of particles.

Many different algorithms are able to reconstruct quarks (and gluons) as jets, e.g. the anti-$k_T$ algorithm ([arXiv:0802.1189](https://arxiv.org/abs/0802.1189)). They essentially loop over the shower of particles produced by the hadronization and try to cluster them together into one single object. The algorithms are designed in such a way that the momentum of the clustered jet is proportional to the initial energy of the quark. A sketch giving an intuitive idea of a jet is displayed here (Klaus Rabbertz, KIT):

<br>
<img src="figures/top_tagging/Rabbertz_from_quark_to_rec_jet.png" width="500" >
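To make the clustering idea slightly more concrete, the anti-$k_T$ algorithm cited above works with two kinds of distances, one between pairs of particles $i$, $j$ and one between a particle and the beam. The formulas below follow the cited paper; $R$ is the jet radius parameter, $p_T$ the transverse momentum, $y$ the rapidity and $\phi$ the azimuthal angle:

$$
d_{ij} = \min\left(p_{T,i}^{-2},\, p_{T,j}^{-2}\right) \frac{\Delta R_{ij}^2}{R^2}, \qquad
d_{iB} = p_{T,i}^{-2}, \qquad
\Delta R_{ij}^2 = (y_i - y_j)^2 + (\phi_i - \phi_j)^2
$$

At each step the smallest of all $d_{ij}$ and $d_{iB}$ is found. If it is a $d_{ij}$, particles $i$ and $j$ are merged; if it is a $d_{iB}$, object $i$ is declared a jet and removed from the list. This is repeated until no particles are left.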

#### Jet substructure

Many physically motivated approaches have been used in the past to distinguish a jet initiated by a top quark from jets due to QCD. One remarkable property is the so-called **jet substructure**. The idea is to determine how many "sub-jets" are contained in a jet. From the sketches shown above, since the top quark decays into three separate quarks, we expect its jet to show a three-pronged substructure. QCD jets, on the other hand, mainly originate from a single quark or gluon and hence show a one-pronged substructure. One widely used approach to study jet substructure is the so-called *N-subjettiness* ([arXiv:1011.2268](https://arxiv.org/abs/1011.2268)).
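For reference, the N-subjettiness variable from the cited paper can be written as follows, where $k$ runs over the jet constituents, $\Delta R_{J,k}$ is the angular distance between constituent $k$ and candidate subjet axis $J$, and $R_0$ is the jet radius used in the clustering:

$$
\tau_N = \frac{1}{d_0} \sum_k p_{T,k}\, \min\left(\Delta R_{1,k}, \Delta R_{2,k}, \ldots, \Delta R_{N,k}\right), \qquad
d_0 = \sum_k p_{T,k}\, R_0
$$

A small $\tau_N$ means the constituents lie close to $N$ (or fewer) axes. For a three-pronged top jet one therefore expects $\tau_3$ to be small compared to $\tau_2$, so the ratio $\tau_3 / \tau_2$ is a common discriminating variable.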

## Jet images - a nail for the hammer?

One approach is to use image-recognition techniques based on neural networks, namely convolutional neural networks (CNNs). This requires transforming our jet constituent data into an image.

We unroll the cylindrical surface of the detector along the azimuthal and longitudinal coordinates and subdivide the area into pixels. Each pixel value then corresponds to the energy deposited by the jet constituents in that pixel (the component transverse to the beam direction). Here we will use this as a grayscale image, but in principle one could use multiple features, similar to the colour channels of an image, carrying more information than just the energy (e.g. the number of particles, or separate energies for neutral and charged particles, as done in https://arxiv.org/abs/1612.01551).

<br>
<img src="figures/top_tagging/images_jets.png" width="800" >

(Figure from https://arxiv.org/abs/1612.01551)
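The pixelation step can be sketched in a few lines of NumPy. This is a minimal illustration, not the preprocessing used for the figure above: the function name `jet_image`, the image size, the window `half_width`, and the choice to centre on the hardest constituent are all assumptions made for this example. The input is assumed to have columns (E, px, py, pz), with zero-padded rows for missing constituents.

```python
import numpy as np

def jet_image(constituents, n_pixels=40, half_width=1.0):
    """Pixelate one jet; constituents has shape (n, 4) with columns (E, px, py, pz)."""
    px, py, pz = constituents[:, 1], constituents[:, 2], constituents[:, 3]
    pt = np.hypot(px, py)                  # transverse momentum
    keep = pt > 0                          # drop zero-padded rows
    px, py, pz, pt = px[keep], py[keep], pz[keep], pt[keep]

    eta = np.arcsinh(pz / pt)              # pseudorapidity (longitudinal coordinate)
    phi = np.arctan2(py, px)               # azimuthal angle

    # Centre on the hardest constituent; wrap phi differences into [-pi, pi)
    lead = np.argmax(pt)
    d_eta = eta - eta[lead]
    d_phi = (phi - phi[lead] + np.pi) % (2 * np.pi) - np.pi

    # Each pixel holds the summed pT of the constituents falling into it
    edges = np.linspace(-half_width, half_width, n_pixels + 1)
    image, _, _ = np.histogram2d(d_eta, d_phi, bins=(edges, edges), weights=pt)
    return image

# Toy jet with two constituents and one padded row
toy_jet = np.array([
    [100.0, 60.0, 0.0, 80.0],
    [50.0, 28.8, 8.4, 40.0],
    [0.0, 0.0, 0.0, 0.0],
])
image = jet_image(toy_jet)
```

Constituents outside the chosen window are simply dropped by the histogram, so the window size has to be matched to the jet radius.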

We do not discuss CNNs further in this notebook; if you are interested, please check the original notebook [CNNTopTagging.ipynb](https://github.com/fuenfundachtzig/LMU_DA_ML/blob/master/CNNTopTagging.ipynb).

## Applying the Deep Sets method to the top tagging dataset

Sets are a natural representation for objects in particle physics. Let's apply this to the jet constituents of the TopTagging dataset.
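The key property of a set-based model is that its output does not depend on the order of the particles. A minimal NumPy sketch of the idea (apply a per-particle function, pool over particles, then apply a second function) is shown below. The weights `W_phi` and `W_rho` are random placeholders invented for this illustration, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.normal(size=(4, 8))   # hypothetical per-particle weights
W_rho = rng.normal(size=(8, 1))   # hypothetical weights applied after pooling

def deep_set(particles):
    """Compute rho(mean_i phi(x_i)) for particles of shape (n, 4)."""
    per_particle = np.maximum(particles @ W_phi, 0.0)  # phi, applied row by row
    pooled = per_particle.mean(axis=0)                 # order-independent pooling
    return pooled @ W_rho                              # rho, acts on the pooled vector

jet = rng.normal(size=(5, 4))
shuffled = jet[rng.permutation(len(jet))]

# The output is the same (up to floating point rounding) for any ordering
same_output = np.allclose(deep_set(jet), deep_set(shuffled))
```

Because the pooling step is a mean over particles, shuffling the rows cannot change the result.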

We have prepared a subset of this dataset in its original form, containing the 4-momenta (energy, px, py, pz) of up to 200 constituents per jet:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, GlobalAveragePooling1D, Masking
from tensorflow.keras.callbacks import History

In [None]:
top_tagging_path = "top_tagging_with_adjacency.npz"

In [None]:
import os
if not os.path.exists(top_tagging_path):
    import requests
    url = "https://cloud.physik.lmu.de/index.php/s/AtESAET6JK6DiWZ/download"
    res = requests.get(url)
    res.raise_for_status()  # stop here instead of saving an error page as the dataset
    with open(top_tagging_path, "wb") as f:
        f.write(res.content)

In [None]:
npz_file = np.load(top_tagging_path)

In [None]:
X = npz_file["jet_4mom"]
y = npz_file["y"]

In [None]:
X.shape

Here we have 10k events, each with up to 200 particles described by 4 components. Missing entries are set to 0.

In [None]:
X[0]
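Since padded rows are all zeros, the number of real constituents per jet can be recovered directly from the array. The sketch below uses a small toy array `X_toy` (invented for this example) with the same layout as `X`, so that it runs on its own; the same two lines should work on the real array.

```python
import numpy as np

# Toy stand-in for X: 2 jets, up to 3 constituents, columns (E, px, py, pz)
X_toy = np.array([
    [[10.0, 6.0, 0.0, 8.0], [5.0, 3.0, 0.0, 4.0], [0.0, 0.0, 0.0, 0.0]],
    [[20.0, 12.0, 0.0, 16.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
])

is_real = (X_toy != 0).any(axis=-1)    # True where a row holds an actual particle
n_constituents = is_real.sum(axis=1)   # number of particles per jet
```

Here the first toy jet has two constituents and the second has one.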

We can reuse the `JetScaler` we defined for the Higgs Dataset: 

In [None]:
class JetScaler:
    """Scale per-particle features with a RobustScaler, ignoring padded rows."""

    def __init__(self, mask_value=-999):
        self.mask_value = mask_value
        self.scaler = RobustScaler()

    def fill_nan(self, X):
        """Replace padded rows (all features equal to mask_value) by NaN, in place."""
        X[(X == self.mask_value).all(axis=-1)] = np.nan

    def fit(self, X):
        X = np.array(X, dtype=float)  # copy, so the caller's array is not modified
        self.fill_nan(X)
        X = X.reshape(-1, X.shape[-1])  # make 2D: (jets * particles, features)
        self.scaler.fit(X)  # NaN entries are disregarded when fitting

    def transform(self, X):
        orig_shape = X.shape
        X = np.array(X, dtype=float).reshape(-1, orig_shape[-1])  # 2D copy
        self.fill_nan(X)
        X = self.scaler.transform(X)
        X = np.nan_to_num(X, nan=0.0)  # replace missing values by 0
        return X.reshape(orig_shape)  # turn back into 3D

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

scaler = JetScaler(mask_value=0)
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

For the NN we can use a simple `Sequential` stack of layers, since we only use the jet constituents as inputs:

In [None]:
model = tf.keras.Sequential([
    Masking(input_shape=X_train.shape[1:]),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    GlobalAveragePooling1D(),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(1, activation="sigmoid"),
])

Here we can use a [Masking](https://stackoverflow.com/questions/75410827/how-does-masking-work-in-tensorflow-keras) layer to ignore missing values.
*(Important: this is only possible because the sequence is never completely empty.)*
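The effect of the mask on the pooling step can be illustrated with plain NumPy. This is a sketch of the idea, assuming that a masked average pooling sums over the unmasked rows and divides by their number; the weights `W` and bias `b` are invented toy values. Note that the `Dense` layer still produces non-zero outputs for padded rows (because of the bias), which is why ignoring them matters.

```python
import numpy as np

# One toy jet: 2 real constituents followed by 2 zero-padded rows
inputs = np.array([[1.0, 2.0], [3.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
W = np.array([[1.0, 0.0], [0.0, 1.0]])   # hypothetical Dense weights
b = np.array([0.5, 0.5])                 # hypothetical Dense bias

# A row counts as real if any of its features differs from the mask value 0
mask = (inputs != 0).any(axis=-1)

# Dense + ReLU acts on every row, padding included
hidden = np.maximum(inputs @ W + b, 0.0)

plain_mean = hidden.mean(axis=0)                                  # padding contributes
masked_mean = (hidden * mask[:, None]).sum(axis=0) / mask.sum()   # real rows only
```

In this toy case `plain_mean` is `[1.5, 1.25]` while `masked_mean` is `[2.5, 2.0]`. If a jet had no real constituents at all, `mask.sum()` would be zero and the masked average would be undefined.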

Again, the first layers operate independently on each constituent:

In [None]:
model.summary()
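The parameter counts can be cross-checked by hand. A `Dense` layer with `n_in` inputs and `n_out` outputs has `n_in * n_out + n_out` parameters, and because it acts on the last axis only, the count does not depend on the number of constituents. The short calculation below assumes 4 input features per constituent, as in the dataset description.

```python
n_features = 4   # (E, px, py, pz); assumed from the dataset description
widths = [100, 100, 100, 100, 100, 100, 1]   # Dense layers in order; pooling has no weights

n_in = n_features
total = 0
for n_out in widths:
    n_params = n_in * n_out + n_out   # weights plus biases
    total += n_params
    n_in = n_out

print(total)
```

Under this assumption the total comes to 51101 trainable parameters, regardless of whether a jet has 20 or 200 constituents.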

In [None]:
model.compile(loss="binary_crossentropy", optimizer="Adam")

In [None]:
history = History()
history = model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    epochs=10,
    batch_size=32,
    shuffle=True,
    callbacks=[history],
)

In [None]:
pd.DataFrame(history.history).plot()

In [None]:
scores = model.predict(X_test)

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thr = roc_curve(y_test, scores)

In [None]:
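def plot_top_tagging_performance(fpr, tpr):
    # QCD rejection is 1 / fpr; skip points with fpr == 0 to avoid dividing by zero
    nonzero = fpr > 0
    plt.plot(tpr[nonzero], 1. / fpr[nonzero])
    plt.ylabel("QCD jet rejection")
    plt.xlabel("Top quark jet efficiency")
    plt.yscale("log")
    plt.grid()

    # Working points: best efficiency with fpr below 1e-3, and rejection once efficiency exceeds 30%
    print("Top quark jet selection efficiency at 10^3 QCD jet rejection: ", np.max(tpr[fpr < 0.001]))
    print("QCD jet rejection at 30% Top quark jet efficiency: ", 1. / np.min(fpr[tpr > 0.3]))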


In [None]:
plot_top_tagging_performance(fpr, tpr)
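The two numbers printed above are read off at the nearest available threshold. An alternative is to interpolate linearly between neighbouring ROC points. The sketch below uses invented toy values `fpr_toy` and `tpr_toy` so that it runs on its own; with real results one would pass the arrays returned by `roc_curve`, which are sorted in increasing order as `np.interp` expects.

```python
import numpy as np

# Toy ROC points (invented for illustration), sorted by increasing fpr and tpr
fpr_toy = np.array([0.0, 0.0005, 0.002, 0.01, 0.1, 1.0])
tpr_toy = np.array([0.0, 0.10, 0.25, 0.50, 0.85, 1.0])

# Efficiency at a QCD rejection of 10^3, i.e. at fpr = 1e-3
eff_at_rejection = np.interp(1e-3, fpr_toy, tpr_toy)

# QCD rejection at 30% top efficiency: interpolate fpr as a function of tpr
fpr_at_eff = np.interp(0.3, tpr_toy, fpr_toy)
rejection_at_eff = 1.0 / fpr_at_eff
```

With a limited number of test jets the low-fpr region is statistically noisy, so interpolated values there should be read as rough estimates.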

<div class="alert alert-block alert-success">
    <h2>Exercise 1</h2>
    As usual, play with the options: number of layers, number of neurons, switch off masking, ...
</div>