TRIUMF RICH AI

First and foremost, a big warm welcome! 🎈🎉 🎊 🎈🎈

This GitHub repository contains the work related to the 2022 RICH AI Capstone Project completed by Nico Van den Hooff, Rakesh Pandey, Mukund Iyer, and Shiva Jena as part of the University of British Columbia Master of Data Science program. The project was completed in partnership with scientists at TRIUMF, Canada's particle accelerator centre and one of the world's leading subatomic physics research centres.

This document (the README.md file) is a hub to give you some general information about the project. You can either navigate straight to one of the sections by using the links below, or simply scroll down to find out more.

Executive summary

The NA62 experiment at CERN (the European Organization for Nuclear Research) studies the rate of an ultra-rare meson decay into a pion particle in order to verify the Standard Model of physics. The aim of the RICH AI project is to develop a binary classification model that uses advanced machine learning ("ML") to distinguish pion decays from muon decays, based on the output of a Ring Imaging Cherenkov ("RICH") detector. The challenge of the project lies in increasing the "pion efficiency" (the rate at which pion decays are correctly classified) while simultaneously decreasing the "muon efficiency" (the rate at which muon decays are incorrectly classified as pion decays), in order to surpass the performance of CERN's current algorithm: a maximum likelihood algorithm that fits a circle to the light image emitted by a particle decay and classifies the particle based on the expected ring radius.

The data used to build the machine learning models had 2 million examples. It was controlled for momentum and converted to point cloud form by the addition of a time dimension. Two deep learning models based on recent academic research were applied: PointNet and Dynamic Graph CNN ("DGCNN"). The overall best performing model was PointNet as it exceeded the prior pion efficiency achieved by CERN's NA62 algorithm, while also maintaining a similar muon efficiency.
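
As a rough illustration of this point cloud representation (a sketch only; the array names, shapes, and values below are assumptions, not the project's code), each detector hit can be represented as an (x, y, t) point:

```python
import numpy as np

# Toy stand-in for the photomultiplier (x, y) positions stored in
# dataset/rich_pmt_positions.npy (the real file's layout may differ).
pmt_xy = np.random.default_rng(0).uniform(-300.0, 300.0, size=(2000, 2))

hit_pmts = np.array([12, 87, 403])      # hypothetical PMTs that fired for one event
hit_times = np.array([0.2, 0.5, 0.7])   # hypothetical hit times (arbitrary units)

# Stack position and time into an (n_hits, 3) point cloud: (x, y, t)
point_cloud = np.column_stack([pmt_xy[hit_pmts], hit_times])
print(point_cloud.shape)  # (3, 3)
```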

The final data product of the project is a modularized machine learning pipeline that takes in the raw experiment data in HDF5 format, pre-processes it, prepares training data, trains a classifier model (or alternatively loads a trained model), and finally evaluates the model to make predictions. Future work could include debiasing the ring centre locations of the particles, as well as further hyperparameter tuning for the machine learning models.

Contributors

  • Nico Van den Hooff
  • Rakesh Pandey
  • Mukund Iyer
  • Shiva Jena

Report

The final report of the project, which contains much greater detail about the data and modelling processes, can be accessed here.

Jupyter book

The report is hosted as a Jupyter Book on GitHub pages, and the underlying files that are used to build the Jupyter Book can be accessed here. The Jupyter Book itself is built automatically via a GitHub Actions workflow, which triggers if there is a push to the main branch of this repository that changes a file within the richai/docs/final_report/ directory.

Final presentation slides

The corresponding slides for the final presentation that was given to the UBC Master of Data Science faculty and cohort can be accessed here.

Project proposal

Finally, the original project proposal and corresponding presentation slides can be accessed here.

Project structure

At a high level, the overall project structure is as follows:

.
├── configs/
│   ├── README.md
│   └── config.yaml
├── dataset/
│   ├── balance_data.py
│   ├── data_loader.py
│   ├── rich_dataset.py
│   └── rich_pmt_positions.npy
├── docs/
│   ├── final_presentation/
│   ├── final_report/
│   ├── proposal/
│   └── README.md
├── models/
│   ├── dgcnn.py
│   └── pointnet.py
├── notebooks/
│   ├── DGCNN_operating_point.ipynb
│   ├── EDA.ipynb
│   ├── README.md
│   ├── balance_data.ipynb
│   ├── data_generation_process.ipynb
│   ├── gbdt_analysis_results.ipynb
│   ├── global_values.ipynb
│   ├── plotting_NA62_rings.ipynb
│   ├── pointnet_model_runs.ipynb
│   ├── pointnet_operating_point.ipynb
│   └── presentation_plots.ipynb
├── saved_models/
│   └── README.md
├── src/
│   ├── evaluate.py
│   └── train.py
├── utils/
│   ├── gbt_dataset.py
│   ├── helpers.py
│   └── plotting.py
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
└── README.md

Dependencies

The RICH AI project was developed using Singularity containers with the following package dependencies.

  • pandas==1.3.5
  • torch==1.11.0
  • sklearn==0.24.0
  • pyyaml==6.0
  • jupyterlab

Configuration file

The configuration file contains all of the parameters for the dataset, filters, model training, and scoring. The file configs/config.yaml controls data set paths, filters, model parameters and hyperparameters, train/test/saved model paths, PyTorch DataLoader settings (such as batch size and number of workers), the number of training epochs, the device ID, and more.

Before beginning the training process, it is recommended that you double-check that the configuration parameters, such as dataset and model paths, are correct.
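
As a minimal sketch of how the configuration might be inspected with pyyaml before a run (only the dataset section and the model.<model_name> key paths mentioned elsewhere in this README are assumed to exist; other keys will vary):

```python
import yaml

# Load the project configuration (path relative to the repository root).
with open("configs/config.yaml") as f:
    config = yaml.safe_load(f)

# Key paths referenced in this README; adjust for the model you are running.
print(config["dataset"])                            # data set paths and filters
print(config["model"]["pointnet"]["saved_model"])   # where the trained model is saved
print(config["model"]["pointnet"]["predictions"])   # where the scored .csv data is written
```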

Dataset

The data was generated as part of the 2018 NA62 experiments performed at CERN. There are a total of 11 million labelled decay events, each containing the RICH detector hit features described above. However, there was a large class imbalance in the data set, as only 10% of the decay examples were pion decays (the class of interest). More details can be found here.

The dataset subdirectory contains scripts for creating a custom PyTorch Dataset and DataLoader for the deep learning models, along with a balance_data.py script that creates a balanced dataset by undersampling the larger class.

  • rich_dataset.py processes the raw project data from HDF5 format and extracts events, hits, and position data into a custom PyTorch Dataset.
  • data_loader.py creates the PyTorch DataLoaders used to load data (train/test/validation) in batches and feed it into the neural network models.
  • balance_data.py reads HDF5 files from the provided source file paths, creates a balanced data set by undersampling the larger class, and saves the resulting HDF5 file to the specified path. Usage details can be found in the notebooks.
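
The sketch below illustrates the general pattern these scripts implement: undersampling the majority class and serving the balanced data in batches through a PyTorch DataLoader. The tensors, shapes, and label encoding are toy stand-ins for illustration, not the project's actual classes or data.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

rng = np.random.default_rng(42)

# Toy stand-in for the labelled events: features X and binary labels y
# (here 1 = pion, 0 = muon), with muons heavily over-represented as in the raw data.
X = rng.normal(size=(1000, 3)).astype(np.float32)
y = np.concatenate([np.ones(100), np.zeros(900)]).astype(np.int64)

# Undersample the larger class so both classes have the same number of events.
pion_idx = np.flatnonzero(y == 1)
muon_idx = rng.choice(np.flatnonzero(y == 0), size=len(pion_idx), replace=False)
keep = np.concatenate([pion_idx, muon_idx])

# Wrap the balanced data in a Dataset/DataLoader (batch size and number of
# workers would come from config.yaml in the real pipeline).
dataset = TensorDataset(torch.from_numpy(X[keep]), torch.from_numpy(y[keep]))
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)

for features, labels in loader:
    pass  # feed each batch into the neural network model
```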

The data set configuration can be controlled and customized using the dataset section of configuration file.

Model training

To train PointNet, run the following command from the root of the repository.

python src/train.py --model pointnet

To train Dynamic Graph CNN, run the following command from the root of the repository.

python src/train.py --model dgcnn

The trained model object can be found at the path specified in configs/config.yaml as model.<model_name>.saved_model.
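
Internally, train.py selects which model to train from the --model flag. The sketch below shows the general shape of such a dispatch; it is an illustration only, not the actual script, and the training call is left as a placeholder.

```python
import argparse

import yaml

def main():
    parser = argparse.ArgumentParser(description="Train a RICH AI classifier")
    parser.add_argument("--model", choices=["pointnet", "dgcnn"], required=True)
    args = parser.parse_args()

    # All remaining settings (data paths, hyperparameters, epochs, device) come
    # from the configuration file rather than the command line.
    with open("configs/config.yaml") as f:
        config = yaml.safe_load(f)

    model_config = config["model"][args.model]
    # Placeholder: build the chosen model, train it, and save it to the path
    # given by model.<model_name>.saved_model in the configuration.
    print(f"Training {args.model}; model will be saved to {model_config['saved_model']}")

if __name__ == "__main__":
    main()
```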

Model evaluation and scoring on new data

To evaluate PointNet on the test data, or to score new data, run the following command from the root of the repository.

python src/evaluate.py --model pointnet

To evaluate Dynamic Graph CNN on the test data, or to score new data, run the following command from the root of the repository.

python src/evaluate.py --model dgcnn

The scored .csv data can be found at the path specified in configs/config.yaml as model.<model_name>.predictions. It contains the actual labels, predicted labels, and predicted probabilities.
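
Because the predictions file contains the actual labels, predicted labels, and predicted probabilities, the pion and muon efficiencies (as defined in the executive summary) can be computed from it directly. A minimal sketch with pandas is shown below; the file path, column names, and label encoding are assumptions and may differ from the actual output.

```python
import pandas as pd

# Hypothetical path and column names; check model.<model_name>.predictions in
# configs/config.yaml and the actual .csv header before using.
preds = pd.read_csv("pointnet_predictions.csv")

is_pion = preds["actual_label"] == 1            # assuming 1 = pion, 0 = muon
predicted_pion = preds["predicted_label"] == 1

# Pion efficiency: fraction of true pion decays correctly classified as pions.
pion_efficiency = (is_pion & predicted_pion).sum() / is_pion.sum()

# Muon efficiency: fraction of true muon decays incorrectly classified as pions.
muon_efficiency = (~is_pion & predicted_pion).sum() / (~is_pion).sum()

print(f"Pion efficiency: {pion_efficiency:.3f}, muon efficiency: {muon_efficiency:.3f}")
```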

Saved models trained with different configurations, along with their corresponding results as .csv files, can be found on the triumf-ml1 server at the path /fast_scratch_1/capstone_2022/models/. Please refer to the appendix of the final report to learn more about the different model runs.

Utility scripts

The richai/utils directory contains the following scripts:

  • helpers.py contains useful helper functions that were used throughout the project.
  • plotting.py contains useful plotting functions that were used throughout the project.
  • gbt_dataset.py contains the code for the data set used with the XGBoost model.

Jupyter Notebooks

A number of Jupyter Notebooks were written over the course of the project to support the analysis and to perform procedures such as exploratory data analysis. The notebooks themselves are saved here. They are also included in the final project report, within the supplementary Jupyter Notebooks section of the appendix, where they are easier to view than on the GitHub website.

Code of conduct

The project code of conduct can be found here.

Contributing

The project contributing file can be found here.

License

All work for the RICH AI capstone project is performed under an MIT license, which can be found here.

References

For a full list of project references, please see the references section of the final report.