# Project2: Anomaly Detection for Exotic Event Identification at the Large Hadron Collider 




## Brief Introduction to the Standard Model and Large Hadron Collider


The Standard Model (SM) of particle physics is the most complete model physicists have for understanding the interactions of the fundamental particles in the universe. The elementary particles of the SM are shown in Fig. 1.

---
<figure>
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/00/Standard_Model_of_Elementary_Particles.svg/627px-Standard_Model_of_Elementary_Particles.svg.png" alt="SM" style="width: 600px;"/>
    <figcaption>Fig.1 - Elementary particles of the Standard Model.</figcaption>
</figure>

---

It comprises matter particles (**fermions**):
- **leptons**
    - electrons
    - muons
    - taus
    - and their respective neutrinos
- **quarks**, which are the building blocks of protons

as well as force-carrier particles (**bosons**):
- photons and W/Z bosons (electroweak force)
- gluons (strong force)

and the Higgs boson, which is associated with the mechanism that gives particles their mass.


Though the SM has experimentally stood the test of time, many outstanding questions about the universe and the model itself remain, and scientists continue to probe for inconsistencies in the SM in order to find new physics. More exotic models such as **Supersymmetry (SUSY)** predict mirror particles which may exist and have eluded detection thus far.

---

The **Large Hadron Collider** (LHC) is a particle smasher capable of colliding protons at a centre-of-mass energy of 14 TeV.
**ATLAS** is a general-purpose particle detector tasked with recording the remnants of proton collisions at the collision point. The main purpose of this experiment is to test the SM rigorously, and ATLAS was one of two experiments (ATLAS and CMS) responsible for the discovery of the **Higgs boson in 2012**.

Find an animation of how particles are reconstructed within a slice of the ATLAS detector here: https://videos.cern.ch/record/2770812. Electrons, muons, photons, quark jets, etc, will interact with different layers of the detector in different ways, making it possible to design algorithms which distinguish reconstructed particles, measure their trajectories, charge and energy, and identify them as particular types.

Figure 2 shows an event display from a data event in ATLAS in which 2 muons (red), 2 electrons (green), and 1 quark-jet (purple cone) are found. This event is a candidate for a Higgs boson decaying to four leptons with an associated jet: $$H (+j)\rightarrow 2\mu 2e (+j)$$ 



---

<figure>
    <img src="https://twiki.cern.ch/twiki/pub/AtlasPublic/EventDisplayRun2Physics/JiveXML_327636_1535020856-RZ-LegoPlot-EventInfo-2017-10-18-19-01-24.png" alt="Higgs to leptons" style="width: 600px;"/>
    <figcaption>Fig.2 - Event display of a Higgs candidate decaying to two muons and two electrons.</figcaption>
</figure>

---


Particles are shown traversing the detector material. The 3D histogram shows
* the azimuth $\phi$ (angle around the beam, 0 is up) and
* the pseudorapidity $\eta$ (trajectory along the beam), which together give the particle directions with respect to the interaction point.
* The total energy measured for the particle is denoted by $E$.
* The transverse momentum ($p_T$) of the particle, in giga-electronvolts (GeV), is shown by the height of the histogram bars.

A particle's kinematics can then be described by a four-vector  $$\bar{p} = (E,p_T,\eta,\phi)$$

An additional important quantity is the missing energy in the transverse plane (MET). This is calculated by taking the magnitude of the negative vector sum of the transverse momenta of all particles in the event.
$$\mathrm{MET} = \left|-\sum_i \vec{p}_{T,i}\right|$$

With perfect detector performance the MET is 0 if all outgoing particles are observed by the detector. Neutrinos cannot be measured by the detector and hence their presence produces non-zero MET.
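
As a minimal sketch of the MET definition above, reading the sum as a vector sum in the transverse plane, the snippet below adds up the transverse momentum vectors of a few made-up objects and returns the magnitude and direction of the negative sum. The function name `missing_et` and the toy values are assumptions made for illustration only.

```python
import numpy as np

def missing_et(pt, phi):
    """Return (MET, METphi) from arrays of object pT and azimuth phi."""
    pt = np.asarray(pt, dtype=float)
    phi = np.asarray(phi, dtype=float)
    # Transverse momentum components of every visible object
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    # The missing momentum balances the visible vector sum
    mex, mey = -px.sum(), -py.sum()
    return np.hypot(mex, mey), np.arctan2(mey, mex)

# Toy event: two back-to-back objects of equal pT balance each other
met_balanced, _ = missing_et([50.0, 50.0], [0.0, np.pi])
# Toy event: a single visible object recoils against something unseen
met_recoil, met_phi = missing_et([40.0], [0.0])
```

In the second toy event the MET equals the $p_T$ of the single visible object and points in the opposite direction, which is the signature an undetected neutrino would leave.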

## Anomaly detection dataset

For the anomaly detection project we will use the dataset discussed in this publication: <p><a href="https://arxiv.org/pdf/2105.14027.pdf" title="Anomalies">The Dark Machines Anomaly Score Challenge: Benchmark Data and Model Independent Event Classification for the Large Hadron Collider</a></p>

Familiarise yourself with the paper, in particular from sections 2.1 to 4.4.

---

The dataset contains a collection of simulated proton-proton collisions in a general particle physics detector (such as ATLAS). We will use a dataset containing `340 000` SM events (referred to as channel 2b in the paper) which have at least 2 electrons/muons in the event with $p_T>15$ GeV. 

**The events can be found in `background_chan2b_7.8.csv`**


You can see all the SM processes that are simulated in Table 2 of the paper. For example, an event with a process ID of `w_jets` is a simulated event of two protons producing a lepton, a neutrino and at least two jets:
    
$$pp\rightarrow \ell\nu(+2j)$$

---

The datasets are stored as CSV files where each line represents a single event, in the following format:

`event ID; process ID; event weight; MET; METphi; obj1, E1, pt1, eta1, phi1; obj2, E2, pt2, eta2, phi2; ...`

See Section 2.2 of the paper for a description of the dataset.
The event-level variables are separated by semicolons (`";"`):
- `event ID`: an identifier for the event number in the simulation
- `process ID`: an identifier for the event simulation type
- `event weight`: the weight associated to the simulated event (how important that event is)
- `MET`: the missing transverse energy
- `METphi`: the azimuth angle (direction) of the MET

followed by a list of objects (particles), whose variables are separated by commas (`","`) in the following order:
- `obj`: the object type,

    |Key|Particle|
    |---|---|
    |j|jet|
    |b|b-jet|
    |e-|electron|
    |e+|positron|
    |m-|muon|
    |m+|antimuon|
    |g|photon|
    
    *see Table 1 of the paper*
- `E`: the total measured particle energy in MeV, [0,inf]
- `pt`: the transverse momentum in MeV, [0,inf]
- `eta`: pseudo-rapidity, [-inf,inf]
- `phi`: azimuth angle, radians [-3.14,3.14]
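
The format described above can be unpacked with plain string splitting. The sketch below is one possible way to do it; the function name `parse_event` and the dictionary keys are hypothetical, and the example line is assembled from values of the second SM event shown further down in this notebook, assuming a trailing semicolon at the end of the line.

```python
def parse_event(line):
    """Split one event line into metadata, MET values and a list of objects."""
    # Semicolons separate the fields; drop empty fields from a trailing ';'
    fields = [f for f in line.strip().split(";") if f]
    event = {
        "event_id": fields[0],
        "process_id": fields[1],
        "weight": float(fields[2]),
        "met": float(fields[3]),
        "met_phi": float(fields[4]),
        "objects": [],
    }
    # Every remaining field is one object: type, E, pt, eta, phi
    for obj in fields[5:]:
        kind, energy, pt, eta, phi = obj.split(",")
        event["objects"].append(
            {"type": kind, "E": float(energy), "pt": float(pt),
             "eta": float(eta), "phi": float(phi)}
        )
    return event

# Example line built from values displayed later in this notebook
example = ("13085335;z_jets;1;103468;1.96193;"
           "j,224322,109177,-1.34681,-1.16114;"
           "m-,117239,105718,-0.462717,2.01556;"
           "m+,17640,16335.2,0.397024,-1.90025;")
event = parse_event(example)
```

Keeping the objects as a list of small dictionaries makes the later counting, sorting and truncation steps straightforward, at the cost of being slower than a vectorised approach on the full sample.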

---

In addition to the SM events we are also provided with simulated events from `Beyond Standard Model` (BSM) exotic physics models. They are summarised here:

|Model | File Name | 
|---|---|
|**SUSY chargino-chargino process**||
||`chacha_cha300_neut140_chan2b.csv`|
||`chacha_cha400_neut60_chan2b.csv`|
||`chacha_cha600_neut200_chan2b.csv`|
|**SUSY chargino-neutralino processes**||
||`chaneut_cha200_neut50_chan2b.csv`|
||`chaneut_cha250_neut150_chan2b.csv`|
|**$Z'$ decay to leptons**||
||`pp23mt_50_chan2b.csv`|
||`pp24mt_50_chan2b.csv`|
|**Gluino and RPV SUSY**||
||`gluino_1000.0_neutralino_1.0_chan2b.csv`|
||`stlp_st1000_chan2b.csv`|



## Project description

### Overview
The task is to design an anomaly detection algorithm which is trained on the SM dataset and which can be used to flag up interesting (exotic) events from the BSM physics models.

You will do this by designing a robust `AutoEncoder` which is trained on the event-level variables `MET; METphi` and the kinematics of the particle-level objects. The `AutoEncoder` needs to reproduce its input at the output after passing it through a latent space (bottleneck). 

You will then need to evaluate and discuss the performance of your `AutoEncoder` on the exotic models listed above, and come up with an appropriate metric to identify events from non-SM physics.

# **Breakdown**

In the project report you will be assessed in the following way.

1. **Data exploration and preprocessing (20%):** Inspect the datasets; visualise the data (e.g. tables, plots, etc) in an appropriate way; study the composition of the dataset; perform any necessary preprocessing.
2. **Model selection (30%):** Choose a promising approach; construct the machine learning model; optimise the relevant hyperparameters; train your chosen model.
3. **Performance evaluation (30%):** Evaluate the model in a way that gauges its ability to generalise to unseen data; compare to other approaches; identify the best approach. 
4. **Discussion, style throughout (20%):** Discuss the reasoning or intuition behind your choices; the results you obtain through your studies; the relative merits of the methods you have developed, _etc._ Similarly, make sure that you write efficient code, document your work, clearly convey your results, and convince us that you have mastered the material.


## Data Preprocessing
* The data is provided in a CSV (text) format as a semicolon- and comma-separated list with **one line per event**. We need to convert this into an appropriate format for our neural networks. 
* Since the number of particles per event is variable you will need to **truncate** and **mask** particles in the event. The following steps need to be performed on the SM (background) sample:
     1. Create variables where you count the number of electrons, photons, muons, jets and b-jets in the event (ignore charge) before any truncation.
     2. Choose an appropriate number of particles to study per event (recommended: **8** particles are used in the paper).
     3. Check the particles are sorted by energy (largest to smallest).
     4. If the event has more than 8 particles, choose the **8 particles** with the **highest energy and truncate** the rest.
     5. Transform the energy and momentum variables by taking their logarithm (e.g., `log`); this prioritises differences in energy **scale** over more minor differences. 
     6. If the event has fewer than 8 particles, create kinematic variables with 0 values for the missing particles.
* The final set of training variables should look something like this (the exact format is up to you)
    |N ele| N muon| N jets| N bjets| N photons| log(MET)| METphi| log(E1)| log(pt1)| eta1| phi1| ... | phi8|
    |-|-|-|-|-|-|-|-|-|-|-|-|-|
    
    7. After the dataset is ready, use `MinMaxScaler` or similar to standardise the training variables over the SM dataset.
* After the SM dataset has been processed, apply the same processing to the BSM (signal) samples. Use the same standardisation functions as on the SM dataset. *Do not recalculate the standardisation*.
* Keep the associated metadata (`event ID; process ID; event weight;`), though this does not need processing. 
* Randomise and split the SM (background) dataset into training and testing datasets (the BSM samples don't need to be split (*Why?*)).
* *Hint*: It is suggested that you write a class or function for the preprocessing which takes a CSV path as input and returns the processed dataset. After you have done the data processing, it's suggested you save the datasets so that you do not have to recalculate them if the kernel is restarted. 
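
The preprocessing steps above can be sketched for a single event as follows. This is only an illustration under stated assumptions: the helper names `event_features`, `fit_minmax` and `apply_minmax` are hypothetical, the toy values are made up, and the min-max scaling is written out in NumPy to show that the minimum and range are computed once on the SM sample and then reused for BSM (a library scaler such as `MinMaxScaler` with separate `fit` and `transform` calls follows the same pattern).

```python
import numpy as np

N_OBJ = 8  # number of objects kept per event, as recommended above

def event_features(met, met_phi, objects, n_obj=N_OBJ):
    """objects: list of (type, E, pt, eta, phi) tuples for one event."""
    types = [o[0] for o in objects]
    # Multiplicities are counted before truncation and ignore charge
    counts = [
        sum(t in ("e-", "e+") for t in types),
        sum(t in ("m-", "m+") for t in types),
        types.count("j"),
        types.count("b"),
        types.count("g"),
    ]
    # Sort by energy (largest first) and keep the n_obj most energetic objects
    kept = sorted(objects, key=lambda o: o[1], reverse=True)[:n_obj]
    kin = []
    for _, energy, pt, eta, phi in kept:
        kin += [np.log(energy), np.log(pt), eta, phi]
    # Zero-pad events with fewer than n_obj objects
    kin += [0.0] * (4 * n_obj - len(kin))
    return np.array(counts + [np.log(met), met_phi] + kin, dtype=float)

def fit_minmax(x):
    """Return per-column minimum and range, computed on the SM sample only."""
    lo = x.min(axis=0)
    span = x.max(axis=0) - lo
    span[span == 0] = 1.0  # constant columns map to 0 instead of dividing by zero
    return lo, span

def apply_minmax(x, lo, span):
    """Scale x with a previously fitted minimum and range."""
    return (x - lo) / span

# Two toy SM events and one toy BSM event (illustrative values, in MeV)
sm = np.stack([
    event_features(103468.0, 1.96, [("j", 224322.0, 109177.0, -1.35, -1.16),
                                    ("m-", 117239.0, 105718.0, -0.46, 2.02),
                                    ("m+", 17640.0, 16335.2, 0.40, -1.90)]),
    event_features(129408.0, -1.18, [("b", 169640.0, 104808.0, -1.06, 1.87),
                                     ("m-", 47792.1, 44843.2, -0.36, -0.36)]),
])
bsm = event_features(250000.0, 0.5, [("e-", 300000.0, 200000.0, 0.1, 0.3)])[None, :]

lo, span = fit_minmax(sm)                 # fitted on SM only
sm_scaled = apply_minmax(sm, lo, span)
bsm_scaled = apply_minmax(bsm, lo, span)  # reuse the SM scaling, do not refit
```

Each event becomes a vector of 5 counts, 2 event-level variables and 4 × 8 kinematic variables. Because the scaling is fitted on SM events only, scaled BSM values may fall outside [0, 1], which is expected.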

## Training
* Design an appropriate algorithm which reconstructs the input variables after going through a latent space. Choose an appropriate cost function.
    * The suggested method for ease of implementation is the `AutoEncoder`
    * However, if you consider learning about or trying something else, as described in the paper, you should feel welcome to try `VAEs`, `ConvAEs`, `ConvVAEs`, etc. Don't feel you **have** to create an `AE`.

* Explore different architectures for the model, and explain in detail your choice of model, and the final parameters chosen.
* It is suggested to create a class or function around your algorithm which allows you to easily tweak hyperparameters of the model (depth, number of nodes, number of latent variables, activation functions, regularisation, etc.)
* Train the model over several hyperparameter settings to find the best algorithm. Document the process throughout and discuss your choices. Keep track of validation performance. Save the models at the best points. 
* Explore the results and document your findings. Ask as many questions about your model as you can, and document your findings. Does the model generalise well to data it hasn't seen?
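
To make the idea of reconstructing the input through a bottleneck concrete, here is a framework-free sketch of a one-hidden-layer autoencoder trained with gradient descent on the mean squared error. It is not the recommended implementation for the project (a deep-learning library will be far more convenient); the function names, the toy data and all hyperparameter values are assumptions chosen purely for illustration.

```python
import numpy as np

def train_autoencoder(x, n_latent=4, lr=0.1, epochs=200, seed=0):
    """Train a tanh-bottleneck, linear-output autoencoder with
    full-batch gradient descent on the mean squared error."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w1 = rng.normal(0.0, 0.1, (d, n_latent))
    b1 = np.zeros(n_latent)
    w2 = rng.normal(0.0, 0.1, (n_latent, d))
    b2 = np.zeros(d)
    history = []
    for _ in range(epochs):
        z = np.tanh(x @ w1 + b1)   # encoder: input -> latent space
        x_hat = z @ w2 + b2        # decoder: latent space -> reconstruction
        err = x_hat - x
        history.append(np.mean(err ** 2))
        # Backpropagate the MSE loss through decoder and encoder
        g_out = 2.0 * err / (n * d)
        g_w2 = z.T @ g_out
        g_b2 = g_out.sum(axis=0)
        g_z = (g_out @ w2.T) * (1.0 - z ** 2)
        g_w1 = x.T @ g_z
        g_b1 = g_z.sum(axis=0)
        w1 -= lr * g_w1
        b1 -= lr * g_b1
        w2 -= lr * g_w2
        b2 -= lr * g_b2
    return (w1, b1, w2, b2), history

def reconstruct(params, x):
    """Pass x through the trained encoder and decoder."""
    w1, b1, w2, b2 = params
    return np.tanh(x @ w1 + b1) @ w2 + b2

# Toy data in [0, 1]: 10 features lying close to a 2-dimensional subspace
rng = np.random.default_rng(1)
latent = rng.uniform(size=(500, 2))
mixing = rng.uniform(size=(2, 10))
noise = 0.01 * rng.normal(size=(500, 10))
x_train = np.clip(latent @ mixing / 2 + noise, 0.0, 1.0)

params, history = train_autoencoder(x_train)
x_rec = reconstruct(params, x_train)
```

The `history` list records the training loss per epoch, which is the quantity to monitor (together with the same loss on a held-out validation set) when comparing architectures.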

## Evaluation
In the evaluation, explore different datasets and try to answer as many questions about the performance as possible. 
* Evaluate the performance of the `AE` on BSM dataset. Which models are more or less similar to the SM?
* Explore the anomaly score as a handle on finding new physics. Consider scanning over different anomaly scores and calculating the signal and background efficiencies at each point (plot this for different BSM models). How might you choose a value which flags up a non-SM event? 
* Explore SM events. Which look more anomalous than others? Are there any particular features which are responsible, e.g. particle counts, MET ranges, etc.? 
* Discuss any limitations your algorithm has. How might you update and improve your model in future? Discuss any issues you had, or things you would have liked to try given more time.
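
One common choice of anomaly score is the per-event reconstruction error, and the threshold scan described above then amounts to counting the fraction of events above each cut. The sketch below illustrates this with made-up score distributions; the function names, the exponential toy scores and the 1% working point are assumptions, not results.

```python
import numpy as np

def anomaly_score(x, x_hat):
    """Per-event mean squared reconstruction error."""
    return np.mean((x - x_hat) ** 2, axis=1)

def efficiencies(bkg_scores, sig_scores, thresholds):
    """Fraction of background and signal events with score above each threshold."""
    bkg_eff = np.array([(bkg_scores > t).mean() for t in thresholds])
    sig_eff = np.array([(sig_scores > t).mean() for t in thresholds])
    return bkg_eff, sig_eff

# Toy scores standing in for real reconstruction errors (assumption: signal
# events are reconstructed worse on average than background events)
rng = np.random.default_rng(0)
bkg_scores = rng.exponential(scale=1.0, size=10_000)
sig_scores = rng.exponential(scale=3.0, size=2_000)

# Scan thresholds placed at fixed quantiles of the background scores
thresholds = np.quantile(bkg_scores, [0.0, 0.5, 0.9, 0.99])
bkg_eff, sig_eff = efficiencies(bkg_scores, sig_scores, thresholds)

# One possible working point: the threshold that keeps about 1% of the background
cut = thresholds[-1]
```

Plotting `sig_eff` against `bkg_eff` for each BSM sample gives a ROC-style curve, and fixing the background efficiency is one way to choose a cut without looking at any signal model.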

---

To complete this project, you should **submit your Jupyter notebook** as a "report". See the comments below on documentation.



**You should submit by Friday 10th Feb 2023 at 10AM:**
* your report notebook via Turnitin.
    

For all tasks we're not looking for exceptional performance and high scores (although those are nice too), **we're mostly concerned with _best practices:_** If you are careful and deliberate in your work, and show us that you can use the tools introduced in the course so far, we're happy!

Training all of these models in sequence takes a very long time so **don't spend hours on training hundreds of epochs.** Be conservative on epoch numbers (30 is more than enough) and use appropriate techniques like EarlyStopping to speed things up. Once you land on a good model you can allow for longer training times if performance can still improve.
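
The logic behind early stopping is simple enough to sketch without any framework: stop once the validation loss has failed to improve for a fixed number of epochs. The class name `EarlyStopper`, its arguments and the toy loss curve below are hypothetical; deep-learning libraries usually provide an equivalent ready-made callback.

```python
class EarlyStopper:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss   # improvement: remember it and reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1   # no improvement this epoch
        return self.bad_epochs >= self.patience

# Toy validation-loss curve that stops improving after the third epoch
val_losses = [0.50, 0.40, 0.35, 0.36, 0.35, 0.37, 0.34, 0.33]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
```

With a patience of 3 the loop ends before the later, slightly better epochs are reached, which shows the trade-off: a small patience saves time but can stop on a plateau.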



### Documentation

**Change the filename to contain Name_Surname**

Your report notebook should run without errors and give (mostly) reproducible results. **Please don't clear the report before submitting**! It is important that **all** code is annotated and that you provide brief commentary **at each step** to explain your approach. Explain *why* you chose a given approach and *discuss* the results. You can also include any failed approaches if you provide a reasonable explanation; we care more about you making an effort and showing that you understand the core concepts.

This is not in the form of a written report so do not provide pages of background material, but do try to clearly present your work so that the markers can easily follow your reasoning and can reproduce each of the steps through your analysis. Aim to convince us that you have understood the material covered in the course.

To add commentary above (or below) a code snippet, create a new cell and add your text in "Markdown" format. Do not add any substantial commentary as a code comment in the same cell as the code. To change the new cell into Markdown, select it from the drop-down menu on the bar above the main window (the default is code).

# Happy Anomaly Hunting
---
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.</p>&mdash; Josh Wills (@josh_wills) <a href="https://twitter.com/josh_wills/status/198093512149958656?ref_src=twsrc%5Etfw">May 3, 2012</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 

---

Your code follows....

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt


In [2]:
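class dataRead():
    def __init__(self, csv_path):
        self.csv_path = csv_path
        # read_csv has no `ignore_errors` argument, and events have a variable
        # number of fields, so keep each raw line for later splitting instead.
        with open(self.csv_path) as f:
            self.data = pd.Series([line.strip() for line in f])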
        

In [112]:
labels = ["Event ID","Process ID","Event Weight",
          "e-","m-","j","b","log(MET)","METphi"]
to_be_replaced = [ "e-","m-","j","b"]
replace_labels = ["N ele","N muon","N jets","N bjets"]
for i in range(8):
    i = str(i)
    labels.extend(["log(E"+i+")","log(pt"+i+")","eta"+i,"phi"+i])
    
labels[9:]

['log(E0)',
 'log(pt0)',
 'eta0',
 'phi0',
 'log(E1)',
 'log(pt1)',
 'eta1',
 'phi1',
 'log(E2)',
 'log(pt2)',
 'eta2',
 'phi2',
 'log(E3)',
 'log(pt3)',
 'eta3',
 'phi3',
 'log(E4)',
 'log(pt4)',
 'eta4',
 'phi4',
 'log(E5)',
 'log(pt5)',
 'eta5',
 'phi5',
 'log(E6)',
 'log(pt6)',
 'eta6',
 'phi6',
 'log(E7)',
 'log(pt7)',
 'eta7',
 'phi7']

In [64]:
processing_frame = pd.DataFrame(columns=labels)
processing_frame

Unnamed: 0,Event ID,Process ID,Event Weight,e-,m-,j,b,log(MET),METphi,log(E0),...,eta5,phi5,log(E6),log(pt6),eta6,phi6,log(E7),log(pt7),eta7,phi7


In [5]:
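# Open the file and parse the first event as a test

with open('background_chan2b_7.8.csv', 'r') as f:
    for line in f:
        # Fields are separated by semicolons; each object field is comma-separated
        line = line.strip().split(';')

        metadata = [line[0], line[1], line[2]]
        print(line[5])
        # The object type is the first comma-separated entry of each non-empty object field
        obj_types = [obj.split(',')[0] for obj in line[5:] if obj]
        # Count leptons regardless of charge (counting on `line` itself always gives 0)
        analys_data = [sum(t in ('e-', 'e+') for t in obj_types),
                       sum(t in ('m-', 'm+') for t in obj_types),
                       obj_types.count('j'), obj_types.count('b'),
                       np.log(float(line[3])), float(line[4])]
        trun = line[5:13]  # objects start at index 5; keep at most 8 of them
        print(metadata)
        break
        # df = df.append(pd.Series([,]), ignore_index=True)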
    
        
  

           



j,335587,132261,-1.57823,1.02902
['5702564', 'z_jets', '1']


In [65]:
meta_data = pd.read_csv('background_chan2b_7.8.csv', sep=';', header=None, usecols=np.arange(0, 5))
meta_data

Unnamed: 0,0,1,2,3,4
0,5702564,z_jets,1,102549.0,-2.966200
1,13085335,z_jets,1,103468.0,1.961930
2,74025,wtopbar,1,129408.0,-1.178890
3,2419445,z_jets,1,77774.2,-1.091710
4,43639,wtop,1,107151.0,-1.026420
...,...,...,...,...,...
340263,30,ttbar,1,65677.3,-1.153600
340264,111,ttbar,1,58730.1,0.529769
340265,75,ttbar,1,342729.0,0.804597
340266,15181306,z_jets,1,246999.0,-0.849401


In [66]:
# List all file names in the working directory with the .csv extension
import glob
import os

path = os.getcwd()
# glob.glob already returns a list, so no explicit loop is needed
file_name = glob.glob("*.csv")

file_name

['chaneut_cha250_neut150_chan2b.csv',
 'chacha_cha300_neut140_chan2b.csv',
 'background_chan2b_7.8.csv',
 'chacha_cha600_neut200_chan2b.csv',
 'gluino_1000.0_neutralino_1.0_chan2b.csv',
 'pp24mt_50_chan2b.csv',
 'chaneut_cha200_neut50_chan2b.csv',
 'pp23mt_50_chan2b.csv',
 'stlp_st1000_chan2b.csv',
 'chacha_cha400_neut60_chan2b.csv']

In [40]:
for f in file_name:
    print(f)
  

chaneut_cha250_neut150_chan2b.csv
chacha_cha300_neut140_chan2b.csv
background_chan2b_7.8.csv
chacha_cha600_neut200_chan2b.csv
gluino_1000.0_neutralino_1.0_chan2b.csv
pp24mt_50_chan2b.csv
chaneut_cha200_neut50_chan2b.csv
pp23mt_50_chan2b.csv
stlp_st1000_chan2b.csv
chacha_cha400_neut60_chan2b.csv


In [129]:
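# Count the ';'-separated fields on the first line directly instead of parsing
# the column count out of an exception message (assumes the first line is
# the one read_csv uses to set the number of columns).
with open('background_chan2b_7.8.csv') as f:
    n_cols = len(f.readline().rstrip('\n').split(';'))
# Columns 5 onwards hold the objects (one "type,E,pt,eta,phi" string per column)
df = pd.read_csv('background_chan2b_7.8.csv', sep=';', header=None, usecols=np.arange(5, n_cols))

df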
  

Unnamed: 0,5,6,7,8,9,10,11,12,13,14
0,"j,335587,132261,-1.57823,1.02902","j,107341,106680,-0.0989776,-2.67901","j,85720.1,62009,0.840127,-1.73805","j,270540,58844.5,2.20566,1.6064","j,55173.9,52433.5,-0.183147,2.62501","j,48698.6,37306.4,-0.719927,-1.7898","j,148467,23648,-2.52332,-1.70799","e-,186937,131480,0.888915,-0.185666","e+,80014.3,79281.7,0.135844,0.275231",
1,"j,224322,109177,-1.34681,-1.16114","m-,117239,105718,-0.462717,2.01556","m+,17640,16335.2,0.397024,-1.90025",,,,,,,
2,"b,169640,104808,-1.05742,1.86718","j,61032.4,59133.2,-0.0435408,1.87833","m-,47792.1,44843.2,-0.360693,-0.359565","m+,61263,38912,-1.02619,2.71994",,,,,,
3,"j,220498,108012,1.33305,1.86414","j,190667,24036.4,2.75989,-2.24786","m-,573772,96793.9,2.4656,-1.26936","m+,27163.7,16336.2,1.09569,3.00354",,,,,,
4,"j,88031.8,53684.4,1.06622,2.33696","b,111635,25450.8,2.15602,2.49257","m-,495482,176135,1.69422,0.307253","m+,49479.9,16162.5,1.78421,-1.08938",,,,,,
...,...,...,...,...,...,...,...,...,...,...
340263,"j,106621,104145,-0.202905,2.76947","b,86433,69506.2,0.675256,0.209997","m-,84809.3,83942.7,0.143566,-1.10066","m+,22615.6,20313.1,0.471722,1.96371",,,,,,
340264,"j,108441,69580.2,-1.00902,2.97987","b,75268.6,46885,-1.04939,-1.94795","j,132186,31154.3,2.12222,-0.238847","m+,47820.2,46241.3,0.260579,1.43295","m-,180498,41739.4,-2.14378,0.707004",,,,,
340265,"j,1.2493e+06,372926,-1.87853,-2.24094","j,169624,104308,-1.06531,2.82518","b,99959.5,79440.1,0.687757,0.36651","j,50584.5,40771,0.674197,-1.57172","j,65878.6,28415.3,-1.47952,-0.0351285","m+,153626,150929,0.188772,0.967007","m-,129960,66476.9,-1.29055,2.21314",,,
340266,"j,158547,151791,-0.266908,2.83552","j,490997,85590.5,2.43207,1.64324","j,103956,49619,1.36048,1.8057","m-,290093,197360,-0.934967,-0.962359","m+,34498.7,27742.5,-0.684455,-0.0529732",,,,,


In [164]:
# Split each object string of the first event on commas, then sort by energy (column 1).
# The split yields strings, so sort on their numeric value: a plain string sort
# would place '85720.1' ahead of '335587'.
each_row = df.iloc[0].str.split(',',expand=True).sort_values(by=1,key=pd.to_numeric,ascending=False)
# Count how many objects of each type the event contains
b = each_row[0].value_counts()
processing_frame.loc[0,list(b.keys())] = b


In [165]:
processing_frame


Unnamed: 0,Event ID,Process ID,Event Weight,e-,m-,j,b,log(MET),METphi,log(E0),...,phi5,log(E6),log(pt6),eta6,phi6,log(E7),log(pt7),eta7,phi7,e+
0,,,,1,,7,,,,335587,...,-1.73805,80014.3,79281.7,0.135844,0.275231,55173.9,52433.5,-0.183147,2.62501,1.0


In [132]:
each_row = df.iloc[0].str.split(',',expand=True).sort_values(by=1,key=pd.to_numeric,ascending=False)

# Keep E, pt, eta, phi of the 8 highest-energy objects as one flat vector
each_row[[1,2,3,4]][:8].values.flatten()

array(['335587', '132261', '-1.57823', '1.02902', '270540', '58844.5',
       '2.20566', '1.6064', '186937', '131480', '0.888915', '-0.185666',
       '148467', '23648', '-2.52332', '-1.70799', '107341', '106680',
       '-0.0989776', '-2.67901', '85720.1', '62009', '0.840127',
       '-1.73805', '80014.3', '79281.7', '0.135844', '0.275231', '55173.9',
       '52433.5', '-0.183147', '2.62501'], dtype=object)

In [133]:
# Note: the values stored here are still raw strings; no log transform or float conversion is applied yet
processing_frame.loc[0,labels[9:]] = each_row[[1,2,3,4]][:8].values.flatten()

In [134]:
processing_frame

Unnamed: 0,Event ID,Process ID,Event Weight,e-,m-,j,b,log(MET),METphi,log(E0),...,eta5,phi5,log(E6),log(pt6),eta6,phi6,log(E7),log(pt7),eta7,phi7
0,,,,,,,,,,335587,...,0.840127,-1.73805,80014.3,79281.7,0.135844,0.275231,55173.9,52433.5,-0.183147,2.62501
