An AICrowd Challenge: Logistic Regression classifier that predicts whether an event's decay signature was the one of a Higgs Boson

EliaFantini/Higgs-Boson-Classifier-using-LHC-CERN-data

⚛️ Higgs_Boson_Classifier

This project classifies the decay signatures of events measured at the Large Hadron Collider at CERN, using Logistic Regression to predict whether each signature is that of a Higgs boson.

The problem was part of a Machine Learning challenge on AICrowd. Our team, pasta-balalaika, ranked 50th out of 307 on the leaderboard, with an F1 score of 0.74 and an accuracy of 0.82. The project was also an assignment for the EPFL course CS-433 Machine Learning.
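For readers unfamiliar with the two leaderboard metrics, the following is a minimal NumPy sketch of how accuracy and F1 can be computed for binary labels. The function names and the toy labels are illustrative assumptions; they are not taken from this repository's metrics.py.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive (signal) class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))  # true positives
    fp = np.sum((y_pred == positive) & (y_true != positive))  # false positives
    fn = np.sum((y_pred != positive) & (y_true == positive))  # false negatives
    denom = 2 * tp + fp + fn
    # Convention assumed here: return 0 when there are no positives at all.
    return float(2 * tp / denom) if denom > 0 else 0.0


# Toy labels (hypothetical): 1 = signal, 0 = background.
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])
print("accuracy:", accuracy(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```

F1 ignores true negatives, so it is the more informative of the two when signal events are much rarer than background events.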

Higgs Boson

The Higgs boson is an elementary particle in the Standard Model of physics which explains why other particles have mass. Its discovery at the Large Hadron Collider at CERN was announced in July 2012 and confirmed in March 2013.

In this project, we applied machine learning techniques to actual CERN particle accelerator data to recreate the process of “discovering” the Higgs particle. Physicists at CERN smash protons into one another at high speeds to generate even smaller particles as by-products of the collisions. Rarely, these collisions can produce a Higgs boson. Since the Higgs boson decays rapidly into other particles, scientists don’t observe it directly, but rather measure its “decay signature”, or the products that result from its decay process.

Since many decay signatures look similar, we estimated the likelihood that a given event’s signature was the result of a Higgs boson (signal) or of some other process or particle (background). To do this, we implemented a pre-processing pipeline and several binary classification techniques, and compared their performance using hyperparameter tuning and cross-validation.
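The workflow described above can be sketched in a few lines of NumPy: standardize the features, fit a logistic regression by gradient descent, and estimate accuracy with k-fold cross-validation. This is only an illustration under stated assumptions. The function names, hyperparameter values, and synthetic data are hypothetical and do not reproduce the code in implementations.py or run.py.

```python
import numpy as np


def sigmoid(z):
    # Clip the argument to avoid overflow in exp for extreme values.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))


def standardize(x_train, x_test):
    """Scale both sets using statistics of the training set only."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled
    return (x_train - mean) / std, (x_test - mean) / std


def fit_logistic(x, y, lr=0.1, n_iters=500, lambda_=0.0):
    """Gradient descent on the mean logistic loss with optional L2 penalty."""
    n, d = x.shape
    tx = np.c_[np.ones(n), x]  # prepend a bias column
    w = np.zeros(d + 1)
    for _ in range(n_iters):
        grad = tx.T @ (sigmoid(tx @ w) - y) / n
        reg = 2 * lambda_ * w
        reg[0] = 0.0  # do not penalize the bias term
        w -= lr * (grad + reg)
    return w


def predict(x, w):
    tx = np.c_[np.ones(x.shape[0]), x]
    return (sigmoid(tx @ w) >= 0.5).astype(int)


def cross_validate(x, y, k=5, seed=0, **fit_kwargs):
    """Mean accuracy over k folds; each fold serves once as the test set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        x_tr, x_te = standardize(x[train_idx], x[test_idx])
        w = fit_logistic(x_tr, y[train_idx], **fit_kwargs)
        scores.append(np.mean(predict(x_te, w) == y[test_idx]))
    return float(np.mean(scores))


# Synthetic stand-in for the challenge data (not the real CERN features).
rng = np.random.default_rng(42)
x = rng.normal(size=(400, 3))
y = (x[:, 0] + 0.5 * x[:, 1] > 0).astype(int)

cv_accuracy = cross_validate(x, y, k=5, lr=0.5, n_iters=300)
print("5-fold CV accuracy:", cv_accuracy)
```

Standardizing inside each fold, with training statistics only, keeps information from the held-out fold out of the fit. Hyperparameters such as `lr` or `lambda_` could then be compared by calling `cross_validate` once per candidate value.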

Authors

How to install and reproduce results

Download this repository as a zip file and extract it into a folder.

The easiest way to run the code is to install the Anaconda 3 distribution (available for Windows, macOS and Linux). To do so, follow the guidelines on the official website (select Python version 3): https://www.anaconda.com/download/

Additional package versions are specified in the requirements.txt file. To install them, run the following commands in Anaconda Prompt (anaconda3):

cd *THE_FOLDER_PATH_WHERE_YOU_DOWNLOADED_AND_EXTRACTED_THIS_REPOSITORY*
conda install --file requirements.txt

Download the training and testing datasets from the AICrowd challenge page (logging in to AICrowd might be required to download them).

Then, just run run.py with the following command to train and test the model:

python run.py

Files description

  • experiments/experiments_models.ipynb : this Jupyter notebook contains our cross validation and hyperparameter experiments with different models

  • experiments/experiments_preprocesing.ipynb: this Jupyter notebook contains our experiments with different preprocessing techniques

  • experiments/generate_graphs.ipynb: notebook that generates the graphs for the paper

  • helper.py: contains helper functions which were used for setting up our experiments

  • implementations.py: contains the 6 required default functions, plus additional minimization algorithms and their corresponding loss functions

  • metrics.py: contains our implementations of different metrics

  • preprocessing.py: contains methods for the preprocessing of data

  • report.pdf: pdf with the report of the project

  • run.py: contains the code for reproducing our best submission file

  • utils.py: miscellaneous other functions, e.g. loading data, splitting it, etc.

  • requirements.txt: file which includes package requirements for running the code

Others

For further details on the implementation choice and the experiments, please read the report.pdf file.

🛠 Skills

Python, PyTorch, Matplotlib, Jupyter Notebooks. Machine learning, Logistic Regression, analysis of the impact of different preprocessing techniques on training, shallow modelling, plotting the experiments, ensuring reproducibility.
