First and foremost, a big warm welcome!
This GitHub repository contains the work related to the 2022 RICH AI Capstone Project completed by Nico Van den Hooff, Rakesh Pandey, Mukund Iyer, and Shiva Jena as part of the University of British Columbia Master of Data Science program. The project was completed in partnership with scientists at TRIUMF, Canada's particle accelerator centre and one of the world's leading subatomic physics research centres.
This document (the `README.md` file) is a hub to give you some general information about the project. You can either navigate straight to one of the sections by using the links below, or simply scroll down to find out more.
The NA62 experiment at CERN (the European Organization for Nuclear Research) studies the rate of an ultra-rare meson decay into a pion particle in order to test the Standard Model of physics. The aim of the RICH AI project is to develop a binary classification model that uses advanced machine learning ("ML") to distinguish pion decays from muon decays, based on the output of a Ring Imaging Cherenkov ("RICH") detector. The challenge of the project lies in increasing the pion efficiency (the rate at which pion decays are correctly classified) while simultaneously decreasing the muon efficiency (the rate at which muon decays are incorrectly classified as pion decays), in order to surpass the performance of CERN's current algorithm: a simple maximum likelihood algorithm that fits a circle to the light image emitted by a particle decay and classifies the particle based on the expected ring radius.
The data used to build the machine learning models contained 2 million examples. It was controlled for momentum and converted to point-cloud form by the addition of a time dimension. Two deep learning models based on recent academic research were applied: PointNet and Dynamic Graph CNN ("DGCNN"). The overall best performing model was PointNet, as it exceeded the pion efficiency previously achieved by CERN's NA62 algorithm while maintaining a similar muon efficiency.
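The point-cloud conversion described above can be sketched as follows. This is illustrative only: the function name, the centring step, and the array layout are assumptions, and the real preprocessing lives in `dataset/rich_dataset.py`.

```python
import numpy as np

def hits_to_point_cloud(x, y, t):
    """Stack per-hit detector coordinates and hit times into an (N, 3)
    point cloud, the input form consumed by PointNet/DGCNN-style models.

    x, y : photomultiplier hit positions; t : hit times.
    """
    points = np.column_stack([x, y, t]).astype(np.float32)
    # Centre the spatial coordinates (an illustrative normalization step)
    # so the absolute ring position does not dominate the learned features.
    points[:, :2] -= points[:, :2].mean(axis=0)
    return points
```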
The final data product of the project is a modularized machine learning pipeline that takes in the raw experiment data in HDF5 format, pre-processes it, prepares training data, trains a classifier model on the training data (or alternatively loads a trained model), and finally evaluates the model to make predictions. In further studies, there is room for more work on debiasing the ring centre locations of the particles, as well as further hyperparameter tuning of the machine learning models.
The final report of the project, which contains much greater detail about the data and modelling processes, can be accessed here.
The report is hosted as a Jupyter Book on GitHub Pages, and the underlying files used to build the Jupyter Book can be accessed here. The Jupyter Book itself is rebuilt automatically by a GitHub Actions workflow, which triggers on any push to the `main` branch of this repository that changes a file within the `richai/docs/final_report/` directory.
The corresponding slides for the final presentation that was given to the UBC Master of Data Science faculty and cohort can be accessed here.
Finally, the original project proposal and corresponding presentation slides can be accessed here.
At a high level, the overall project structure is as follows:
```
.
├── configs/
│   ├── README.md
│   └── config.yaml
├── dataset/
│   ├── balance_data.py
│   ├── data_loader.py
│   ├── rich_dataset.py
│   └── rich_pmt_positions.npy
├── docs/
│   ├── final_presentation/
│   ├── final_report/
│   ├── proposal/
│   └── README.md
├── models/
│   ├── dgcnn.py
│   └── pointnet.py
├── notebooks/
│   ├── DGCNN_operating_point.ipynb
│   ├── EDA.ipynb
│   ├── README.md
│   ├── balance_data.ipynb
│   ├── data_generation_process.ipynb
│   ├── gbdt_analysis_results.ipynb
│   ├── global_values.ipynb
│   ├── plotting_NA62_rings.ipynb
│   ├── pointnet_model_runs.ipynb
│   ├── pointnet_operating_point.ipynb
│   └── presentation_plots.ipynb
├── saved_models/
│   └── README.md
├── src/
│   ├── evaluate.py
│   └── train.py
├── utils/
│   ├── gbt_dataset.py
│   ├── helpers.py
│   └── plotting.py
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
└── README.md
```
The RICH AI project was developed using Singularity containers with the following package dependencies:
- pandas==1.3.5
- torch==1.11.0
- sklearn==0.24.0
- pyyaml==6.0
- jupyterlab
The configuration file `configs/config.yaml` contains all of the parameters for the dataset, filters, model training, and scoring. It can be used to control the data set paths, filters, model parameters and hyperparameters, train/test/saved-model paths, PyTorch `DataLoader` settings such as batch size and number of workers, the number of training epochs, the device ID, and more.
Before beginning the training process, it is recommended that you double-check that the configuration parameters, such as the dataset and model paths, are correct.
The data was generated as part of the 2018 NA62 experiments performed at CERN. There are a total of 11 million labeled decay events, each containing the features detailed above. However, there was a large class imbalance in the data set, as only 10% of the decay examples were pion decays (the class of interest). More details can be found here.
The subdirectory `dataset` contains scripts for creating a custom PyTorch `Dataset` and `DataLoader` for the deep learning models, along with a balancing script that creates a balanced dataset by undersampling the larger class:

- `rich_dataset.py` processes the raw project data from HDF5 format and extracts the events, hits, and position data into a custom PyTorch `Dataset`.
- `data_loader.py` creates the PyTorch `DataLoader`s used to load the data (train/test/validation) in batches to feed into the neural network models.
- `balance_data.py` reads HDF5 files from the provided source file paths, creates a balanced dataset by undersampling the larger class, and saves the result as an HDF5 file to the specified path. Usage details can be found in the notebooks.

The data set configuration can be controlled and customized using the `dataset` section of the configuration file.
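As a rough sketch of how these pieces fit together, the snippet below builds a toy `Dataset` and wraps it in a `DataLoader`. The class is a hypothetical stand-in, not the actual interface defined in `rich_dataset.py`:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyRICHDataset(Dataset):
    """Toy stand-in for the project's Dataset: yields (point_cloud, label) pairs."""

    def __init__(self, n_events=8, n_hits=16):
        # Random (x, y, time) point clouds and binary pion/muon labels.
        self.points = torch.randn(n_events, n_hits, 3)
        self.labels = torch.randint(0, 2, (n_events,))

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.points[idx], self.labels[idx]

# Batch size and shuffling would come from the dataset section of config.yaml.
loader = DataLoader(ToyRICHDataset(), batch_size=4, shuffle=True)
for batch_points, batch_labels in loader:
    # batch_points has shape (4, 16, 3): the batches fed to PointNet/DGCNN.
    pass
```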
To train PointNet, run the following command from the root directory:

```shell
python src/train.py --model pointnet
```

To train Dynamic Graph CNN, run the following command from the root directory:

```shell
python src/train.py --model dgcnn
```

The trained model object is saved to the path specified in `configs/config.yaml` under `model.<model_name>.saved_model`.
To evaluate PointNet on the test data, or to score new data, run the following command from the root directory:

```shell
python src/evaluate.py --model pointnet
```

To evaluate Dynamic Graph CNN on the test data, or to score new data, run the following command from the root directory:

```shell
python src/evaluate.py --model dgcnn
```

The model's scored `.csv` data can be found at the path specified in `configs/config.yaml` under `model.<model_name>.predictions`. It contains the actual labels, predicted labels, and predicted probabilities.
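The pion and muon efficiencies described earlier can be recomputed from this predictions file. Below is a minimal sketch using pandas; the column names `label` and `pred` (with 1 = pion) are hypothetical and may differ from those in the actual `.csv` output:

```python
import pandas as pd

def efficiencies(df, label_col="label", pred_col="pred", pion=1):
    """Pion efficiency: fraction of true pions classified as pions.
    Muon efficiency: fraction of true muons misclassified as pions."""
    pions = df[df[label_col] == pion]
    muons = df[df[label_col] != pion]
    pion_eff = (pions[pred_col] == pion).mean()
    muon_eff = (muons[pred_col] == pion).mean()
    return pion_eff, muon_eff

# Usage (path taken from model.<model_name>.predictions in config.yaml):
# df = pd.read_csv(predictions_path)
# pion_eff, muon_eff = efficiencies(df)
```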
Saved models trained with different configurations, along with the corresponding results as `.csv` files, can be found on the `triumf-ml1` server at the path `/fast_scratch_1/capstone_2022/models/`. Please refer to the appendix of the final report to learn more about the different model runs.
The `richai/utils` directory contains the following scripts:

- `helpers.py`, which contains useful helper functions used throughout the project.
- `plotting.py`, which contains useful plotting functions used throughout the project.
- `gbt_dataset.py`, which contains the code for the data set used for the XGBoost model.
A number of Jupyter Notebooks were written during the course of the project to support the project analysis and to perform procedures such as Exploratory Data Analysis. The actual Jupyter Notebooks are saved here. Alternatively, they are also included in the final project report within the supplementary Jupyter Notebooks section of the report appendix (it is easier to view them here, rather than on the GitHub website).
The project code of conduct can be found here.
The project contributing file can be found here.
All work for the RICH AI capstone project is performed under an MIT license, which can be found here.
For a full list of project references, please see the references section of the final report.