Allergen Chip data challenge result

This repository contains the results of the Allergen Chip data challenge, including the source code and documentation of the best notebooks.

The Allergen Chip Challenge, a collaborative project led by the French Society of Allergology (SFA) in partnership with the Health Data Hub, focuses on personalized medicine through exploring the antibody profile (immunoglobulin E) developed by each patient in response to their environment.

The goal of the competition was to develop machine learning models that predict the presence and severity of an allergic disease from this personalized profile.

The broader aims are to understand the relationship between the exposome, IgE, and the clinical expression of allergy, in particular:
Efficiently diagnose and stratify allergic patients
Gain a better understanding of IgE clusters in relation to age, symptoms, and climate area
Tackle the complexity of immune response using AI methods

The Challenge Results

292 participants from multiple countries around the world
3135 submissions during the 3 months of the challenge
Best F1 Macro score 0.786 on the private portion of the test set

The Development Environment

All models for the Allergen Chip challenge were built in a JupyterHub environment hosted on trustii.io. Each team had access to a notebook session with 4 vCPUs, 16 GB of RAM, and 20 GB of persistent storage.

Best Notebooks

Note: The final score is a weighted average: 80% comes from the F1 Macro score computed by the trustii.io platform, and 20% from subjective criteria, such as code interpretability and inference speed, assessed by the challenge organizer. A basic model, "DummyClassifier", which predicts the most frequent label for each target, served as the baseline. This reference was used to rescale the F1 score between 0 (performance equivalent to the DummyClassifier) and 1 (every observation predicted perfectly). The rescaling was done with a MinMaxScaler.
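The scoring rule in the note can be sketched as follows. This is a minimal illustration, not the organizers' code: the function names are hypothetical, the baseline F1 value is not published in this README, and it is an assumption that the 80% weight applies to the rescaled F1 and that the subjective criteria are scored on a 0 to 1 scale. Min-max scaling with anchors at the baseline F1 and 1 reduces to the linear formula below.

```python
def rescale_f1(f1_macro: float, baseline_f1: float) -> float:
    """Min-max rescale an F1 score so that the baseline maps to 0 and a perfect score to 1."""
    if not 0.0 <= baseline_f1 < 1.0:
        raise ValueError("baseline_f1 must be in [0, 1)")
    return (f1_macro - baseline_f1) / (1.0 - baseline_f1)


def final_score(f1_macro: float, baseline_f1: float, subjective: float,
                f1_weight: float = 0.8) -> float:
    """Weighted average of the rescaled F1 (80%) and the subjective score (20%).

    Assumption: `subjective` is already expressed on a 0-1 scale.
    """
    rescaled = rescale_f1(f1_macro, baseline_f1)
    return f1_weight * rescaled + (1.0 - f1_weight) * subjective


# Hypothetical inputs, chosen only to show the arithmetic.
example = final_score(f1_macro=0.75, baseline_f1=0.5, subjective=1.0)
```

With these hypothetical inputs, the rescaled F1 is 0.5 and the weighted result is 0.8 × 0.5 + 0.2 × 1.0 = 0.6. A model scoring below the baseline would get a negative rescaled value under this formula; whether the organizers clipped such values is not stated.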

Ranking	Team	Public leaderboard score	Private leaderboard score	Final score	Winning model summary
2	newbee (Ning Jia)	0.7729	0.7861	0.7518	Individual binary classifiers were developed for the 27 targets using LightGBM and CatBoost, with AUC as the training metric. An automated pipeline was created for feature selection and model training with cross-validation. For final predictions, a threshold was chosen to optimize the F1 score, and results from LightGBM and CatBoost were ensembled. Predictions were then adjusted based on learned target associations.
1	Rakesh Jarupula	0.7708	0.7748	0.7551	The model's effectiveness is largely attributed to feature engineering. Key steps include data pre-processing, where rows and columns were dropped or modified to align the train and test distributions, and the development of an allergen-to-treatment mapping. Features were engineered from row and column statistics, while irrelevant allergen proteins were dropped. Model training was target-specific, using features relevant to each allergen. To address class imbalance, the scale_pos_weight parameter was used. Bayesian search was employed for hyperparameter tuning, with Repeated Stratified K-Fold for training. Predictions from three models were averaged to reduce overfitting.
3	Mithil & Neeraj Salunkhe	0.7670	0.7719	0.7286	Multilabel Stratified K-Fold cross-validation was used along with feature engineering, including row-wise statistics and interaction features, which boosted the model's F1 score. The most effective model was a finely tuned XGBoost wrapped in a MultiOutputClassifier, selected after testing various architectures. During post-processing, rather than a standard 0.5 threshold, an individual optimal threshold was determined for each target, improving the model's F1 performance. A noticeable correlation was found between the number of positive samples in a target and its threshold. This tailored threshold approach played a pivotal role in improving the model's F1 score.
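Two of the summaries above mention replacing the default 0.5 decision threshold with a threshold chosen per target to maximize F1. The sketch below illustrates that idea in plain Python under stated assumptions. It is not taken from the winning notebooks: the function names, the 0.05 to 0.95 search grid, and the toy data are all hypothetical, and in practice the thresholds would normally be tuned on out-of-fold validation predictions rather than on the training data.

```python
from typing import Dict, List, Sequence, Tuple


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score for the positive class; returns 0.0 when it is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true: Sequence[int], proba: Sequence[float]) -> Tuple[float, float]:
    """Scan a grid of thresholds and return (threshold, F1) with the highest F1.

    Ties keep the lowest threshold, because only strict improvements replace the best.
    """
    grid: List[float] = [i / 100 for i in range(5, 96)]  # hypothetical grid: 0.05 .. 0.95
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        preds = [1 if p >= t else 0 for p in proba]
        score = binary_f1(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


def tune_thresholds(
    targets: Dict[str, Tuple[Sequence[int], Sequence[float]]]
) -> Dict[str, float]:
    """Pick one threshold per target from its (labels, predicted probabilities) pair."""
    return {name: best_threshold(y, p)[0] for name, (y, p) in targets.items()}


# Toy data for a single hypothetical target with mostly low predicted probabilities.
y_true = [0, 0, 1, 1, 1]
proba = [0.1, 0.2, 0.3, 0.35, 0.9]
f1_at_default = binary_f1(y_true, [1 if p >= 0.5 else 0 for p in proba])
tuned_t, tuned_f1 = best_threshold(y_true, proba)
thresholds = tune_thresholds({"target_a": (y_true, proba)})
```

On this toy target, the default 0.5 threshold misses two of the three positives, while the tuned threshold falls between 0.2 and 0.3 and recovers all of them. This is the kind of gain the summaries attribute to per-target thresholds.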

For more details, see each winning solution's report and source code in this repository.

The Dataset

The dataset was provided by the French Society of Allergology (SFA) and prepared with Trustii.io data scientists. It was built from the data of more than 4,000 patients and includes tabular data associated with image files.

If you are interested in accessing the dataset or collaborating with Trustii.io and SFA on this project, please reach out to us at contact@trustii.io.

More information

To access the challenge forum discussions and the dataset description, check out the challenge webpage at https://app.trustii.io.
