The CrossFit Challenge :weight_lifting: and The NLP Few-shot Gym :sweat_drops:

This repository contains code accompanying our preprint paper "CrossFit 🏋️: A Few-shot Learning Challenge for Cross-task Generalization in NLP" (Paper).

CrossFit 🏋️ is a task setup that aims at building few-shot learners that generalize across diverse NLP tasks. For example, we explore whether models trained on non-classification tasks become good few-shot learners for classification tasks, and whether models trained on non-MRC QA tasks become good few-shot learners for MRC QA tasks.

NLP Few-shot Gym 💦 is a repository of 160 different NLP tasks that we gather from existing open-access datasets. We manually create a two-level task ontology to analyze cross-task generalization in different settings.

[:memo: Update 2022-04-15] The code is based on old versions of transformers and torch. If you need to develop on top of newer versions, please refer to this issue.

[:memo: Update 2022-04-15] We found some formatting issues during dataset processing. Several tasks may be affected. If you have used the previous version, please update the data files using the latest code.

Quick Links

Configure Environment

# Create a new conda environment (optional)
conda create -n crossfit python=3.6.9
conda activate crossfit
# For building the NLP Few-shot Gym
pip install datasets==1.4.0 py7zr wget
# For reproducing the baseline methods
pip install torch==1.1.0 higher==0.2.1 scikit-learn==0.24.1 scipy==1.4.1 rouge==1.0.0
pip install git+

Building the NLP Few-shot Gym

The following code will automatically prepare the data using 🤗 huggingface datasets, reconstruct the few-shot train/dev sets we sampled, and verify the files with MD5Sum. The processing will take roughly 3 hours.

cd tasks
# Construct the gym
# --n_proc=10 means the tasks will be processed in parallel with 10 subprocesses.
python --build --n_proc=10
# Verify with MD5Sum
python --verify

If the processing is successful, the verification script will output [Success] All files are consistent.
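The verification step can be pictured as comparing each generated file's MD5 digest against a recorded value. The sketch below is a minimal illustration of that idea, not the repository's verification script: the helper names `md5_of_file` and `find_mismatches`, and the `{relative_path: md5}` mapping they consume, are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected, root="."):
    """Compare files under `root` against a {relative_path: md5} mapping.

    Returns the relative paths that are missing or whose digest differs.
    The mapping format is hypothetical, chosen only for this sketch.
    """
    mismatches = []
    for rel_path, expected_md5 in expected.items():
        path = Path(root) / rel_path
        if not path.is_file() or md5_of_file(path) != expected_md5:
            mismatches.append(rel_path)
    return mismatches
```

An empty list from `find_mismatches` corresponds to the "all files are consistent" case; any returned path points at a task whose data may need to be rebuilt.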

If the processing of any individual task goes wrong (e.g., some datasets are hosted on Google Drive and may hit a daily quota), you can retry later by running the individual scripts.

# For example, if you want to construct glue_sst2
cd tasks

Disclaimer: We use publicly-available datasets from 🤗 huggingface datasets to construct the few-shot gym. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license. If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included, please contact us!

Baseline Methods

😃 Please check ./example_scripts for more examples!

Fine-tune a single few-shot task

Here we take BoolQ as an example. There are five different samples of train/dev sets for BoolQ in the directory data/boolq/. For each sample, we do a grid search over learning rate (1e-5, 2e-5, 5e-5) and batch size (2, 4, 8). This script does not save the final model; however, the results are logged to a CSV file in --output_dir.

python \
--task_dir data/boolq/ \
--do_train \
--do_predict \
--learning_rate_list 1e-5 2e-5 5e-5 \
--bsz_list 2 4 8 \
--total_steps 1000 \
--eval_period 100 \
--warmup_steps 100 \
--model facebook/bart-base \
--output_dir models/singletask-boolq \
--predict_batch_size 32


  • The script will load the original bart-base weights by default. If you want to fine-tune pre-trained weights from a file, please specify --checkpoint $CHECKPOINT.
  • If you want to fine-tune with your own task, please process your data in the same format as in data/boolq/.
  • We provide a script that does not tune hyperparameters and saves the final model in ./example_scripts.
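The grid search described above covers 3 learning rates × 3 batch sizes, i.e. 9 configurations per train/dev sample. The following is a minimal sketch of that loop, not the repository's implementation: `grid_search` and the `evaluate` callable (which would train one configuration and return its dev score, higher being better) are hypothetical names introduced for illustration.

```python
import itertools


def grid_search(evaluate, learning_rates, batch_sizes):
    """Try every (learning rate, batch size) pair and keep the best dev score.

    `evaluate` is a hypothetical callable: (lr, bsz) -> dev score.
    Returns the best configuration and the full list of results.
    """
    results = []
    for lr, bsz in itertools.product(learning_rates, batch_sizes):
        score = evaluate(lr, bsz)
        results.append({"lr": lr, "bsz": bsz, "dev_score": score})
    best = max(results, key=lambda row: row["dev_score"])
    return best, results
```

Called with `[1e-5, 2e-5, 5e-5]` and `[2, 4, 8]`, this evaluates all nine combinations; repeating it for each of the five BoolQ samples gives one best configuration per sample.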

Upstream Learning

Upstream learning refers to the stage between general pre-training and downstream fine-tuning. In this stage, the model has access to a set of training tasks; afterwards, we test its few-shot learning ability on a set of test tasks. Please check Table 1 in our preprint paper for more details about task partitions.

We include two upstream learning methods: multi-task learning and MAML (model-agnostic meta-learning). We are currently working on first-order meta-learning algorithms (First-order MAML and Reptile)!

Multi-task Learning
python \
--do_train \
--train_dir data \
--custom_tasks_splits ${TASK_SPLIT} \
--total_steps 16980 \
--warmup_steps 1018 \
--model facebook/bart-base \
--output_dir models/upstream-multitask \
--train_batch_size 32 \
--num_train_epochs 10;

MAML

python \
--do_train \
--learning_rate 1e-5 \
--output_dir models/upstream-maml \
--custom_tasks_splits ${TASK_SPLIT} \
--total_steps 6000 \
--warmup_steps 360 \
--train_batch_size 1 \
--gradient_accumulation_steps 4 \
--num_train_epochs 40;

MAML is memory intensive. The experiment above was run on a Quadro RTX 8000 GPU (48GB). To reduce memory usage, lower --inner_bsz.


  • You can specify what tasks to use during upstream learning by creating a json file and passing it to --custom_tasks_splits.
  • The --total_steps in the scripts above are pre-computed so that the learning rate decreases to zero linearly during learning. We also pre-compute --warmup_steps to be 6% of the total steps.
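The relationship between --total_steps and --warmup_steps can be checked with a few lines of arithmetic. The sketch below assumes the 6% figure is rounded down (which matches both 16980 → 1018 and 6000 → 360 above) and uses the common "linear warmup, then linear decay to zero" shape; the function names are illustrative and are not taken from the repository.

```python
def warmup_steps_for(total_steps, warmup_percent=6):
    """Warmup length as a percentage of total steps, rounded down (assumption)."""
    return total_steps * warmup_percent // 100


def linear_schedule_factor(step, total_steps, warmup_steps):
    """Multiplier applied to the base learning rate at a given step.

    Rises linearly from 0 to 1 during warmup, then falls linearly to 0
    at `total_steps`.
    """
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))
```

If --total_steps is set larger than the number of steps actually taken, the learning rate never reaches zero; if it is set smaller, the tail of training runs with a zero learning rate. This is why the value is pre-computed per script.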

Download Our Checkpoints

Here we provide the checkpoints after upstream learning.

Task Partition Multi-task Meta-learn
1. Random

😃 Please stay tuned for more checkpoints!

Useful Tools

  • ./example_scripts/ will help you fine-tune a list of tasks sequentially, given a certain model initialization.
  • ./example_scripts/ will read each results.csv file in a given directory, then compute the mean and standard deviation of dev/test performance.
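For readers who want to see what such an aggregation involves, here is a minimal sketch, not the script shipped in ./example_scripts. The column names `dev_performance` and `test_performance` are hypothetical placeholders and should be replaced with the actual headers of your results.csv files; the sketch reports the sample standard deviation, which may differ from what the repository's script computes.

```python
import csv
import statistics
from pathlib import Path


def summarize_results(root, columns=("dev_performance", "test_performance")):
    """Collect `columns` from every results.csv under `root`.

    Returns {column: {"n": count, "mean": mean, "std": sample std}}.
    Column names are assumptions; adjust them to the real CSV headers.
    """
    values = {name: [] for name in columns}
    for csv_path in sorted(Path(root).rglob("results.csv")):
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                for name in columns:
                    cell = row.get(name)
                    if cell not in (None, ""):
                        values[name].append(float(cell))
    summary = {}
    for name, vals in values.items():
        if vals:
            # stdev needs at least two points; report 0.0 for a single run.
            std = statistics.stdev(vals) if len(vals) > 1 else 0.0
            summary[name] = {"n": len(vals), "mean": statistics.mean(vals), "std": std}
    return summary
```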


We thank the authors and crowd-workers of all resources used in our study! This work would not have been possible without your efforts. We thank the 🤗 huggingface datasets team for making datasets more accessible. Our code is modified from shmsw25/bart-closed-book-qa; thanks to the authors!

Contact Us

If you find bugs in our code, encounter problems when running the code, or have suggestions for the CrossFit project, please submit an issue or reach out to Qinyuan and Bill!

If you use our code in your study or find our paper useful, please cite us with the bibkey ye-etal-2021-crossfit in the official ACL Anthology, or use the following BibTeX:

@inproceedings{ye-etal-2021-crossfit,
    title = "{C}ross{F}it: A Few-shot Learning Challenge for Cross-task Generalization in {NLP}",
    author = "Ye, Qinyuan and Lin, Bill Yuchen and Ren, Xiang",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "",
    doi = "10.18653/v1/2021.emnlp-main.572",
    pages = "7163--7189",
}

