DataSolve-WK-2022

This is the first-place solution to the DataSolve 2022 competition organized by Wolters Kluwer.

Weights & Biases Dashboard Link

Tech Stack

HuggingFace Transformers and Datasets libraries.
HuggingFace Trainer and PyTorch.
XGBoost and CatBoost libraries.
Weights & Biases for experiment tracking.

Platform used

The transformer-based models were primarily trained on jarvislabs.ai and lambdalabs.com, on RTX5000, A5000 and A100 GPU instances.
XGBoost and CatBoost models were trained on Kaggle’s P100 GPU instances (CatBoost itself ran on CPU; see Code Details).

Code Details

Data preprocessing notebook → This notebook contains the EDA and the cross-validation setup used for the competition. The processed and split data is saved in the dataset here.
Transformer train-test-split pipeline → Transformer-based pipeline. Note that this pipeline evaluates on a single split held out from the train set, which is not the most general way to evaluate models and carries a high risk of overfitting to the validation set. It was used only for quick and dirty experiments; the final experiments were run with the 5-fold pipeline for more robust results. The same notebook can also be found in the GitHub repo.
Transformer 5-fold pipeline → All the transformer models used in the final ensemble were based on this notebook. This pipeline runs 5-fold training. The folds are stratified with MultilabelStratifiedKFold from the iterstrat package, which supports stratifying multi-label data. The complete data preprocessing and preparation stage can also be checked in this notebook.
XGBoost pipeline → Traditional approaches such as XGBoost with TFIDF/CountVectorizer features were also used to add diversity to the final ensemble. This pipeline used the same 5 folds as the transformer pipeline, so that results are comparable and the ensemble is leak-free. The same notebook can also be found in the GitHub repo.
Catboost pipeline → Replaces the XGBoost model with CatBoost (trained on CPU, as CatBoost doesn’t support multi-label training on GPU yet). The same notebook can also be found in the GitHub repo.
Hill climbing ensemble → The final leaderboard score (0.92276 private LB) was obtained from this notebook. It uses a hill climbing algorithm to select the final models and their weights so as to maximize the score on the cross-validation setup, and then takes the weighted average of their predictions.
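
As a rough illustration of the hill climbing step, here is a minimal sketch of one common variant (greedy forward selection with replacement over out-of-fold predictions). It is not the code from the notebook: the metric `neg_brier`, the model names and the synthetic data are all assumptions made for this example, and the competition metric would be plugged in as `score_fn`.

```python
import numpy as np


def neg_brier(y_true, y_prob):
    # Stand-in metric (assumption): negative mean squared error, higher is better.
    return -float(np.mean((y_true - y_prob) ** 2))


def hill_climb(oof_preds, y_true, score_fn, max_steps=50):
    """Greedy selection with replacement; returns (weights, CV score)."""
    names = list(oof_preds)
    single = {n: score_fn(y_true, oof_preds[n]) for n in names}
    start = max(single, key=single.get)          # begin with the best single model
    selected = [start]
    running_sum = oof_preds[start].astype(float).copy()
    best_score = single[start]
    for _ in range(max_steps):
        best_add = None
        for n in names:                          # try adding each model once more
            blend = (running_sum + oof_preds[n]) / (len(selected) + 1)
            s = score_fn(y_true, blend)
            if s > best_score:
                best_add, best_score = n, s
        if best_add is None:                     # no addition improves the CV score
            break
        selected.append(best_add)
        running_sum += oof_preds[best_add]
    # A model's weight is the share of times it was picked.
    weights = {n: selected.count(n) / len(selected) for n in names if n in selected}
    return weights, best_score


# Synthetic multi-label OOF predictions for three hypothetical models.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(200, 5))
oof_preds = {
    name: np.clip(y_true + rng.normal(0, noise, y_true.shape), 0, 1)
    for name, noise in [("model_a", 0.6), ("model_b", 0.7), ("model_c", 0.9)]
}
weights, cv_score = hill_climb(oof_preds, y_true, neg_brier)
print(weights, cv_score)
```

The resulting weights would then be applied to each model's test set predictions to form the weighted average. Because the search starts from the best single model and only accepts additions that improve the score, the ensemble's cross-validation score is never worse than the best single model's.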

Extras

Many experiments were run during the competition. To keep track of them, I used wandb.ai. The W&B dashboard can be accessed here. All the configurations, console logs and saved artifacts/models can be viewed there, along with the code that went into each experiment. For example: best single model’s code.
To download the out-of-fold (OOF) predictions, test set predictions and submission files saved for each experiment, this notebook was used; its output served as the dataset for the Hill Climbing Ensemble notebook.
Even though model efficiency was not the main aim of the competition, out of curiosity I tried knowledge distillation to make the base model (which is much smaller and easier to deploy) as performant as the large model (or an ensemble of models), for faster inference in deployment environments. However, due to limited time I couldn’t make it work. This is the corresponding notebook.
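
For readers unfamiliar with knowledge distillation, the idea can be sketched as a loss that mixes the usual label loss with a term pulling the student towards the teacher's softened predictions. The sketch below is a generic multi-label formulation in NumPy, not the approach from the notebook; the function names, `temperature`, `alpha` and the random logits are all assumptions for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def bce(targets, probs, eps=1e-7):
    # Mean binary cross-entropy; targets may be hard labels or soft probabilities.
    probs = np.clip(probs, eps, 1 - eps)
    return float(-np.mean(targets * np.log(probs) + (1 - targets) * np.log(1 - probs)))


def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Hard term: student vs. ground-truth labels.
    hard = bce(labels, sigmoid(student_logits))
    # Soft term: student vs. teacher, both softened by the same temperature.
    soft = bce(sigmoid(teacher_logits / temperature), sigmoid(student_logits / temperature))
    return alpha * hard + (1 - alpha) * soft


# Hypothetical logits for 8 samples and 4 labels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(8, 4)).astype(float)
teacher_logits = (labels * 2 - 1) * 3 + rng.normal(0, 1, labels.shape)
student_logits = rng.normal(0, 1, labels.shape)
loss = distillation_loss(student_logits, teacher_logits, labels)
print(loss)
```

With `alpha=1.0` this reduces to plain supervised training; lowering `alpha` puts more weight on matching the teacher (or an averaged ensemble used as the teacher).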

The best and final score was obtained with this version of the hill climbing ensemble notebook.

Gladiator07/DataSolve-WK-2022
