# Searching for Exotic Particles

<div>
<img src="https://raw.githubusercontent.com/illinois-ipaml/MachineLearningForPhysics/main/img/Project_ExoticParticles-Figure.jpg" width=500></img>
</div>


## <span style="color:Orange">Overview</span>

A number of theories that propose to explain what happened in the very early universe (the first small fraction of a second) and link elementary particle physics and cosmology predict the existence of exotic particles that have yet to be discovered. If these particles exist, they could contribute significantly to the dark matter in the universe and/or explain other puzzles in particle physics.

## <span style="color:Orange">Data Sources</span>

Original Source
* https://archive.ics.uci.edu/ml/datasets/HEPMASS (top-level description)

File URLs
* https://courses.physics.illinois.edu/phys498mlp/sp2025/data/projects/ExoticParticles/hepmass/1000_test.csv.gz
* https://courses.physics.illinois.edu/phys498mlp/sp2025/data/projects/ExoticParticles/hepmass/1000_train.csv.gz
* https://courses.physics.illinois.edu/phys498mlp/sp2025/data/projects/ExoticParticles/hepmass/all_test.csv.gz
* https://courses.physics.illinois.edu/phys498mlp/sp2025/data/projects/ExoticParticles/hepmass/all_train.csv.gz
* https://courses.physics.illinois.edu/phys498mlp/sp2025/data/projects/ExoticParticles/hepmass/not1000_test.csv.gz
* https://courses.physics.illinois.edu/phys498mlp/sp2025/data/projects/ExoticParticles/hepmass/not1000_train.csv.gz

## <span style="color:Orange">Questions</span>

### <span style="color:LightGreen">Question 01</span>

What is the Large Hadron Collider (LHC)? What is it about the LHC that makes it possible to produce heavy particles like the Higgs boson?

***Ans:*** 1. The LHC is a large particle accelerator and collider built by CERN.

2. The LHC can accelerate protons to extremely high energies (design value about 7 TeV per beam), so that the collision energy exceeds the threshold needed to produce heavy particles (for example, the Higgs boson has a mass of about 125 GeV).
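
As a rough illustration of the threshold argument, the sketch below compares the center-of-mass energy of two proton beams colliding head-on with that of a single beam hitting a proton at rest. The 7 TeV beam energy is the LHC design value; the function names and rounded constants are illustrative choices, not part of the project material.

```python
import math

M_PROTON_GEV = 0.938   # proton rest energy in GeV (rounded)
M_HIGGS_GEV = 125.0    # approximate Higgs boson mass in GeV

def collider_cm_energy(e_beam_gev):
    """Center-of-mass energy for two equal-energy beams colliding head-on."""
    return 2.0 * e_beam_gev

def fixed_target_cm_energy(e_beam_gev, m_gev=M_PROTON_GEV):
    """Center-of-mass energy for a beam of total energy E hitting an equal-mass particle at rest."""
    return math.sqrt(2.0 * m_gev**2 + 2.0 * e_beam_gev * m_gev)

e_beam = 7000.0  # GeV per beam
print(f"collider:     {collider_cm_energy(e_beam):.0f} GeV")
print(f"fixed target: {fixed_target_cm_energy(e_beam):.1f} GeV")
```

In practice the colliding quarks and gluons carry only a fraction of each proton's energy, so the energy available to create new particles is lower than this upper bound.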

### <span style="color:LightGreen">Question 02</span>

The Higgs boson is the last particle in the SM to be discovered and completes the constituent picture of that theory in the SM. In what way(s) does the Higgs boson play a particularly important role in the SM?

***Ans:*** In QFT, gauge fields are massless as long as the gauge symmetry is unbroken. However, this contradicts the experimental observation that the W and Z bosons are massive. Their masses can be generated by spontaneous symmetry breaking, in which the Higgs field settles into a vacuum where its vacuum expectation value (VEV) is nonzero. The nonzero VEV gives mass to the otherwise massless fields; this procedure is called the Higgs mechanism.
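
For reference, the standard textbook form of this mechanism can be summarized as follows. The notation (gauge couplings $g$ and $g'$, potential parameters $\mu^2$ and $\lambda$) is the conventional one and is not taken from the project text.

$$
V(\phi) = -\mu^2\,\phi^\dagger\phi + \lambda\,(\phi^\dagger\phi)^2,
\qquad
\langle\phi\rangle = \frac{1}{\sqrt{2}}\begin{pmatrix} 0 \\ v \end{pmatrix},
\qquad
v = \sqrt{\frac{\mu^2}{\lambda}} \approx 246\ \text{GeV}
$$

$$
m_W = \tfrac{1}{2}\, g\, v,
\qquad
m_Z = \tfrac{1}{2}\sqrt{g^2 + g'^2}\; v,
\qquad
m_\gamma = 0,
\qquad
m_h = \sqrt{2\lambda}\; v
$$

The photon stays massless because the vacuum leaves the electromagnetic $U(1)$ symmetry unbroken.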

### <span style="color:LightGreen">Question 03</span>

Briefly describe the ATLAS and CMS experiments that collect proton-proton collision data at the LHC to study the Higgs boson.

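***Ans:*** Both are general-purpose particle detectors (not accelerators) located at collision points of the LHC, designed to search for the Higgs boson and other new physics. A main difference is the magnet system used to bend the tracks of charged particles so that their momenta can be measured. ATLAS uses a large toroidal magnet system in addition to a central solenoid, while CMS is built around a single large solenoid magnet that produces a nearly uniform field.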

### <span style="color:LightGreen">Question 04</span>

Based on ref [[1]](https://arxiv.org/pdf/1402.4735.pdf), can you describe what the exotic particles in the benchmark models (a) HIGGS and (b) SUSY are and why they would be important for fundamental physics?

***Ans:*** (a) In HIGGS, the exotic particles are new heavy Higgs bosons, such as the heavy neutral Higgs $H^0$ produced in the signal process $gg \to H^0$. They can be used to explore physics beyond the Standard Model.

(b) In SUSY, the exotic particles are supersymmetric particles such as the charginos $\chi^\pm$. Observing them would test whether supersymmetry is realized in nature. (Minimal) supersymmetry is a popular candidate framework for grand unification, in which the electromagnetic, weak, and strong couplings are unified.

<hr style="border:1px solid rgba(255, 255, 255, 1); margin: 2em 0;">

The remaining questions refer to the following data source: https://archive.ics.uci.edu/ml/machine-learning-databases/00347 (also linked from above).

Machine learning is used in high-energy physics experiments to search for the signatures of exotic particles. These signatures are learned from Monte Carlo simulations of the collisions that produce these particles and the resulting decay products. In each of the three data sets from the data source, the goal is to separate particle-producing collisions from a background source.

The mass of the new particle is unknown, so three separate data sets are provided. In each data set, 50% of the data is from a signal process, while 50% is from the background process. The data is separated into a training set of 7 million examples and a test set of 3.5 million for each.

* In the `1000` dataset, the signal particle has mass=1000. (Note: this dataset does not include a mass feature since all signal examples have the same mass.)

* In the `not1000` dataset, the signal particle's mass is drawn uniformly from the set {500, 750, 1250, 1500}. The mass is included as an input feature; for the background examples, the mass is selected randomly from this same set.
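
The way a mass value is attached to background events can be illustrated with a toy example. The sketch below is not the generation code used for the data set; it only assumes, as described above, that every row receives a mass drawn uniformly at random from the same set of mass points. The column names `label` and `mass` are placeholders.

```python
import numpy as np
import pandas as pd

MASS_POINTS = np.array([500, 750, 1250, 1500])

def make_toy_events(n_events, seed=0):
    """Build a toy table with a random class label and a random mass point per row."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_events)       # 0 = background, 1 = signal
    masses = rng.choice(MASS_POINTS, size=n_events)  # uniform over the mass points
    return pd.DataFrame({"label": labels, "mass": masses})

toy = make_toy_events(10_000)
# Fraction of each mass point within each class
print(pd.crosstab(toy["label"], toy["mass"], normalize="index").round(2))
```

In this toy setup the mass column has the same distribution for both classes, so on its own it carries no information about the label; it only becomes useful in combination with the other features.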

Download the `not1000_train.csv.gz` and `1000_train.csv.gz` files from the data source.

In [10]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_theme()
import numpy as np
import pandas as pd
import os.path
import subprocess
import urllib.request
from sklearn import cluster, decomposition

In [14]:
def wget_data(url: str):
    local_path = r'C:\Users\84632\Desktop\ML\phys503\dataset'
    os.makedirs(local_path, exist_ok=True)
    # If wget is not available on the system, urllib can be used instead
    filename = os.path.join(local_path, os.path.basename(url))
    urllib.request.urlretrieve(url, filename)
    print(f"Downloaded to {filename}")

def locate_data(name, check_exists=True):
    local_path = r'C:\Users\84632\Desktop\ML\phys503\dataset'
    path = os.path.join(local_path, name)
    if check_exists and not os.path.exists(path):
        raise RuntimeError(f'No such data file: {path}')
    return path

In [12]:
wget_data("https://courses.physics.illinois.edu/phys498mlp/sp2025/data/projects/ExoticParticles/hepmass/1000_train.csv.gz")
wget_data("https://courses.physics.illinois.edu/phys498mlp/sp2025/data/projects/ExoticParticles/hepmass/not1000_train.csv.gz")

Downloaded to C:\Users\84632\Desktop\ML\phys503\dataset\1000_train.csv.gz
Downloaded to C:\Users\84632\Desktop\ML\phys503\dataset\not1000_train.csv.gz


In [15]:
thousand_train = pd.read_csv(locate_data("1000_train.csv.gz"))
not_thousand_train = pd.read_csv(locate_data("not1000_train.csv.gz"))
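
Each training file holds 7 million rows, so it can help to inspect a small sample before loading everything. The sketch below relies on the `nrows` and `dtype` arguments of `pd.read_csv`; the helper name `read_sample` and the temporary demo file are illustrative only.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def read_sample(path, nrows=100_000):
    """Read only the first `nrows` rows, storing values as 32-bit floats to save memory."""
    return pd.read_csv(path, nrows=nrows, dtype=np.float32)

# Demo on a small gzip-compressed CSV written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.csv.gz")
demo = pd.DataFrame(
    np.random.default_rng(0).normal(size=(1000, 4)),
    columns=["label", "f0", "f1", "f2"],
)
demo.to_csv(demo_path, index=False)  # compression is inferred from the .gz suffix

sample = read_sample(demo_path, nrows=100)
print(sample.shape)
```

With the helpers defined earlier, a call such as `read_sample(locate_data("1000_train.csv.gz"))` would load only the first 100,000 rows of the real file.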

### <span style="color:LightGreen">Question 05</span>

What is the size and shape of each data set?

In [17]:
print("the shape of 1000_train is",thousand_train.shape)
print("the size of the 1000_train is", thousand_train.size)
print("the shape of not1000_train is",not_thousand_train.shape)
print("the size of the not1000_train is", not_thousand_train.size)

the shape of 1000_train is (7000000, 27)
the size of the 1000_train is 189000000
the shape of not1000_train is (7000000, 29)
the size of the not1000_train is 203000000


***Ans:*** The shape of `1000_train` is (7000000, 27) and its size is 189000000. The shape of `not1000_train` is (7000000, 29) and its size is 203000000.
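
Note that `DataFrame.size` is simply the number of rows times the number of columns (7000000 × 27 = 189000000 and 7000000 × 29 = 203000000). A quick check of this relation, together with the signal/background balance, might look like the sketch below. It runs on a small synthetic frame, and the column name `label` is a placeholder for the first column of the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])
toy.insert(0, "label", np.repeat([0, 1], 500))  # 50% background, 50% signal

n_rows, n_cols = toy.shape
print("size equals rows * columns:", toy.size == n_rows * n_cols)

# Fraction of rows in each class
balance = toy["label"].value_counts(normalize=True).sort_index()
print(balance)
```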

### <span style="color:LightGreen">Question 06</span>

The data set’s first column is the class label (1 for signal, 0 for background), followed by the 27 normalized features (22 low-level features then 5 high-level features), and a 28th mass feature for dataset `not1000`. See the original paper (ref [[2]](https://arxiv.org/pdf/1601.07913.pdf)) for more detailed information. Can you explain what those normalized features are?

***Ans:*** The features include:

1. the leading lepton momenta,
2. the momenta of the four leading jets,
3. the b-tagging information of the jets,
4. the missing transverse momentum magnitude and angle.
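
If "normalized" is taken to mean standardized to zero mean and unit variance (an assumption made here; ref [2] describes the actual preprocessing), the transformation can be sketched as follows. The input values are made up for illustration.

```python
import numpy as np

def standardize(x):
    """Shift and scale each column to zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Made-up positive-valued data, loosely resembling momentum magnitudes
raw = np.random.default_rng(2).exponential(scale=50.0, size=(10_000, 3))
z = standardize(raw)
print(z.mean(axis=0).round(3), z.std(axis=0).round(3))
```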


### <span style="color:LightGreen">Question 07</span>

In the `1000` data set, can you draw the histogram of 27 normalized features for signal and background separately? Can you describe the significant differences between these histograms?

In [20]:
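# The first column holds the class label (1 = signal, 0 = background)
label_col = thousand_train.columns[0]
# All remaining columns are features
feature_cols = thousand_train.columns[1:]
signal = thousand_train[thousand_train[label_col] == 1]
background = thousand_train[thousand_train[label_col] == 0]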
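
One possible way to draw the requested histograms is sketched below. To stay self-contained it runs on a synthetic frame with three features; the function name `plot_feature_histograms`, the grid layout, and the bin count are choices made for this sketch. It assumes the first column is the class label, as in the cell above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_feature_histograms(df, n_cols=3, bins=50):
    """Overlay signal and background histograms for every feature column."""
    label_col = df.columns[0]
    feature_cols = df.columns[1:]
    signal = df[df[label_col] == 1]
    background = df[df[label_col] == 0]

    n_rows = int(np.ceil(len(feature_cols) / n_cols))
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(4 * n_cols, 3 * n_rows), squeeze=False)
    for ax, col in zip(axes.ravel(), feature_cols):
        # Shared bin edges so the two classes are directly comparable
        edges = np.histogram_bin_edges(df[col], bins=bins)
        ax.hist(signal[col], bins=edges, histtype="step", density=True, label="signal")
        ax.hist(background[col], bins=edges, histtype="step", density=True, label="background")
        ax.set_title(str(col))
    for ax in axes.ravel()[len(feature_cols):]:
        ax.set_visible(False)  # hide unused panels
    axes[0, 0].legend()
    fig.tight_layout()
    return fig

# Synthetic demo: feature f0 is shifted for signal, f1 and f2 are not
rng = np.random.default_rng(3)
labels = np.repeat([0, 1], 2000)
toy = pd.DataFrame({
    "label": labels,
    "f0": rng.normal(loc=labels * 1.5, size=labels.size),
    "f1": rng.normal(size=labels.size),
    "f2": rng.normal(size=labels.size),
})
fig = plot_feature_histograms(toy)
```

On the real data this could be called as `plot_feature_histograms(thousand_train)`. With 7 million rows, plotting a random subsample (for example `thousand_train.sample(100_000)`) is usually enough to see the shapes.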


### <span style="color:LightGreen">Question 08</span>

Repeat the analysis from Question 07 for the `not1000` data set.

### <span style="color:LightGreen">Question 09</span>

What differences do you find between the `not1000` data set and the `1000` data set?

### <span style="color:LightGreen">Question 10</span>

The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the LHC accelerator. The last seven features are functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes. When you read through the reference paper [[1]](https://arxiv.org/pdf/1402.4735.pdf), what particle properties do those 28 features represent?
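
To illustrate what "a function of the low-level features" can mean, the sketch below computes the invariant mass of two objects from their transverse momentum, pseudorapidity, and azimuthal angle. Treating both objects as massless is a simplifying assumption, and the function name is illustrative; the actual high-level features are defined in the reference paper.

```python
import math

def invariant_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of two massless particles given (pT, eta, phi) of each."""
    m_squared = 2.0 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))
    return math.sqrt(max(m_squared, 0.0))  # guard against tiny negative rounding errors

# Two 50 GeV objects, back to back in the transverse plane
print(invariant_mass(50.0, 0.0, 0.0, 50.0, 0.0, math.pi))
```

Two collinear massless objects give zero invariant mass, while back-to-back objects give the largest value for fixed transverse momenta and equal pseudorapidity.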

### <span style="color:LightGreen">Question 11</span>

Using the data sets in this project, can you draw the histograms of the 28 normalized features for signal and background separately? Can you describe the significant differences between these histograms?

## <span style="color:Orange">References</span>

__[<span style="color:Red">1</span>]__ P.J. Sadowski, D. Whiteson, P. Baldi, "Searching for Exotic Particles in High-Energy Physics with Deep Learning", _Nature Commun. 5 (2014) 4308_, e-Print: [1402.4735](https://arxiv.org/abs/1402.4735) [hep-ph]

__[<span style="color:Red">2</span>]__ P. Baldi, K. Cranmer, T. Faucett, P. Sadowski, D. Whiteson, "Parameterized Machine Learning for High-Energy Physics", _Eur.Phys.J.C 76 (2016) 5, 235_, e-Print: [1601.07913](https://arxiv.org/abs/1601.07913) [hep-ex]

## <span style="color:Orange">Acknowledgements</span>

* Initial version: Mark Neubauer

© Copyright 2025