Gladiator07/DataSolve-WK-2022

DataSolve-WK-2022

This is the first-place solution to the DataSolve 2022 competition organized by Wolters Kluwer.

Weights & Biases Dashboard Link

Tech Stack

Platform used

  • The transformer-based models were primarily trained on jarvislabs.ai and lambdalabs.com, on RTX5000, A5000 and A100 GPU instances.
  • XGBoost and CatBoost models were trained on Kaggle’s P100 GPU.

Code Details

  • Data preprocessing notebook → This notebook contains the EDA and the cross-validation setup used for the competition. The processed and split data is saved in the dataset here.
  • Transformer train-test-split pipeline → Transformer-based pipeline. Note that this pipeline evaluates on a single hold-out split of the training set, which is not the most robust way to evaluate models and carries a high risk of overfitting to the validation set. It was used only for quick-and-dirty experiments; the final experiments were run with the 5-fold pipeline for more robust results. The same notebook can also be found in the GitHub repo.
  • Transformer 5-fold pipeline → All the transformer models used in the final ensemble are based on this notebook. This pipeline runs 5-fold training. The folds are stratified with the iterstrat package, which supports stratifying multi-label data; specifically, MultilabelStratifiedKFold was used to create the folds. You can also check the complete data preprocessing and preparation stage in this notebook.
  • XGBoost pipeline → Traditional approaches such as XGBoost with TF-IDF/CountVectorizer features were also used to add diversity to the final ensemble. This pipeline uses the same 5 folds as the transformer pipeline, so that results are comparable and the ensemble is leak-free. The same notebook can also be found in the GitHub repo.
  • CatBoost pipeline → Replaces the XGBoost model with CatBoost (this was trained on CPU, as CatBoost doesn’t support multi-label training on GPU yet). The same notebook can also be found in the GitHub repo.
  • Hill climbing ensemble → The final leaderboard score (0.92276 private LB) was obtained from this notebook. It uses a hill climbing algorithm to select the final models and their weights so as to maximize the score on the cross-validation setup, and then takes the weighted average of the selected models' predictions.

Extras

  • Many experiments were run during the course of the competition. To keep track of them, I used wandb.ai. The W&B dashboard can be accessed here. All the configurations, console logs, and saved artifacts/models can be viewed there. Furthermore, all the code that went into each experiment can also be viewed. For example: best single model’s code.
  • To download the out-of-fold (OOF) predictions, test set predictions and submission files saved for each experiment, this notebook was used, and its output was used as a dataset for the Hill Climbing Ensemble notebook.
  • Although model efficiency was not the main aim of the competition, out of curiosity I tried knowledge distillation to make the base model (which is much smaller and easier to deploy) as performant as the large model (or an ensemble of models), giving faster inference suited to deployment environments. However, due to limited time, I couldn’t make it work. This is the corresponding notebook.

The best and final score was obtained by this version of the hill climbing ensemble notebook.
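
For readers unfamiliar with hill climbing ensembles, the idea can be sketched as below. This is a generic greedy selection with replacement (in the style of Caruana's ensemble selection), not necessarily the exact variant used in the notebook. The function name `hill_climb`, the toy models, and the negative-MSE scoring function are assumptions for illustration; the competition metric would be passed in as `score_fn` instead.

```python
import numpy as np


def hill_climb(oof_preds, y_true, score_fn, max_steps=50):
    """Greedy ensemble selection on out-of-fold (OOF) predictions.

    oof_preds: dict mapping model name -> OOF prediction array (same shape as y_true).
    score_fn:  callable(y_true, y_pred) -> float, higher is better.
    Returns (weights, best_score); weights are selection counts normalised to sum to 1.
    """
    names = list(oof_preds)
    counts = {name: 0 for name in names}

    # Start from the best single model.
    first = max(names, key=lambda n: score_fn(y_true, oof_preds[n]))
    counts[first] = 1
    blend_sum = np.asarray(oof_preds[first], dtype=float).copy()
    n_picked = 1
    best_score = score_fn(y_true, blend_sum)

    for _ in range(max_steps):
        candidate, candidate_score = None, best_score
        # Try adding each model once more (with replacement) and keep the best addition.
        for name in names:
            trial = (blend_sum + oof_preds[name]) / (n_picked + 1)
            trial_score = score_fn(y_true, trial)
            if trial_score > candidate_score:
                candidate, candidate_score = name, trial_score
        if candidate is None:
            break  # no addition improves the CV score
        blend_sum += oof_preds[candidate]
        n_picked += 1
        counts[candidate] += 1
        best_score = candidate_score

    weights = {n: c / n_picked for n, c in counts.items() if c > 0}
    return weights, best_score


def neg_mse(y_true, y_pred):
    # Stand-in metric for the sketch (higher is better).
    return -np.mean((y_true - y_pred) ** 2)


# Toy OOF predictions for three hypothetical models on a 3-label problem.
rng = np.random.default_rng(0)
y = (rng.random((500, 3)) < 0.4).astype(float)
oof = {
    "model_a": np.clip(y + rng.normal(0, 0.35, y.shape), 0, 1),
    "model_b": np.clip(y + rng.normal(0, 0.40, y.shape), 0, 1),
    "model_c": rng.random(y.shape),
}
weights, cv_score = hill_climb(oof, y, neg_mse)
```

The resulting weights would then be applied to the corresponding test set predictions, for example `sum(w * test_preds[name] for name, w in weights.items())`, where `test_preds` is a hypothetical dict with the same keys. Because the weights are tuned on OOF predictions, all models need to share the same folds for the selection to be leak-free.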
