This framework, forked from Huggingface Transformers, was used to run experiments for Junghyun Min's1 master's thesis, The roots and effects of heuristics in NLI and QA models, produced in collaboration with Tom McCoy1 and Tal Linzen2. It was developed on top of Nafise Sadat Moosavi's3 original implementation of HANS evaluation on Huggingface Transformers.
1Department of Cognitive Science, Johns Hopkins University, Baltimore, MD
2Department of Linguistics; Department of Data Science, New York University, New York, NY
3Department of Computer Science, Technische Universität Darmstadt, Darmstadt, Germany
Please run setup.py and pip install -r requirements.txt to set up the necessary tools and packages.
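Concretely, that setup might look like the following when run from the repository root (a sketch: the editable install runs setup.py for you, and the requirements file may live at experiments/requirements.txt in this fork):
pip install -e .                  # editable install, equivalent to running setup.py
pip install -r requirements.txt   # or experiments/requirements.txt, depending on your checkout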
We recommend that you take advantage of the PyTorch Docker image provided by Huggingface, located at docker/transformers-pytorch-gpu. If you are new to Docker, you're in luck, because I was as well 24 hours ago.
We walk through the Docker setup and installation in this section.
First, download the repository. The Dockerfile needs to be in the root directory of the repository.
git clone https://www.github.com/aatlantise/transformers
cd transformers
cp docker/transformers-pytorch-gpu/Dockerfile ./
Then, we build an image, which we will name transformers-docker.
docker build . -t transformers-docker
Now, for our convenience, we create a shared drive before creating the container, which we name roots-effects.
mkdir ~/container-data
docker run -it --name roots-effects -v ~/container-data:/data transformers-docker
On a multi-GPU machine or server, you may need to specify a GPU:
docker run -e NVIDIA_VISIBLE_DEVICES=[GPU-number] -it -P --name [container-name] -v [shared-drive-local-path]:[shared-drive-container-path] [image name]
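For example, to pin the container from the walkthrough above to GPU 0 (the GPU number here is illustrative):
docker run -e NVIDIA_VISIBLE_DEVICES=0 -it -P --name roots-effects -v ~/container-data:/data transformers-docker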
You are in the container! Now let's finish the setup:
cd transformers
pip install -e . && pip install -r experiments/requirements.txt
You're ready to run your experiments! If you run into a versioning or GPU problem while running jobs, it is likely a PyTorch issue. Visit pytorch.org and install what is appropriate for your device and container.
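For instance, you can check the installed PyTorch build and whether it sees your GPU, and reinstall if needed (the upgrade command below is only an example; pytorch.org will generate the exact command for your CUDA version):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
pip install --upgrade torch   # replace with the command pytorch.org suggests for your setup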
We provide two tasks: natural language inference (NLI) and question answering (QA). We use the MultiNLI (MNLI) dataset for NLI and the BoolQ dataset for QA, both for training and in-distribution evaluation. The BoolQ dataset, cached and in tsv format, can be found in /examples/boolq/boolq_data, while the MNLI dataset must be downloaded using curl https://dl.fbaipublicfiles.com/glue/data/MNLI.zip -o MNLI.zip, then unzipped.
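Putting that together, the MNLI download might look like this (the target directory is illustrative; place the data wherever your mnli.sh expects it):
cd examples/mnli   # hypothetical data location -- adjust to your setup
curl https://dl.fbaipublicfiles.com/glue/data/MNLI.zip -o MNLI.zip
unzip MNLI.zip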
For each task, we provide an out-of-distribution evaluation set that targets three known heuristics in NLI models (McCoy et al., 2019b). We use the Heuristic Analysis for NLI Systems (HANS) dataset to evaluate NLI models, and its QA adaptation, QA-HANS, to evaluate QA models. The datasets are located in /examples/[task-name]/hans/.
We include sample results from one iteration of each task for your reference, located in /examples/[task-name]/[task-name]_save/. [task-name]_[step-count].txt contains the in-distribution accuracy at step [step-count], while hans_[step-count].txt and hansres_[step-count].txt contain the HANS evaluation labels and the HANS accuracy broken down into categories and subcases, respectively.
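For example, for the NLI task a save directory might contain files like these (the step count is illustrative):
ls examples/mnli/mnli_save/
# mnli_2000.txt      in-distribution (MNLI) accuracy at step 2000
# hans_2000.txt      HANS evaluation labels at step 2000
# hansres_2000.txt   HANS accuracy by category and subcase at step 2000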
To run your experiment, simply configure mnli.sh or boolq.sh with your parameters and run bash mnli.sh or bash boolq.sh. Your model will be trained, and its evaluation results will be saved at every checkpoint.
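Because training can take a while, you may want to keep a log and detach from the run, for example:
nohup bash mnli.sh > mnli_run.log 2>&1 &
tail -f mnli_run.log   # follow progress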
If your GPU environment uses Slurm, feel free to use the mnli_finetune.scr or boolq_finetune.scr scripts.
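Submission follows the usual Slurm workflow; you may need to adjust the #SBATCH headers (partition, GPU request, time limit) for your cluster:
sbatch mnli_finetune.scr
squeue -u $USER   # check job status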
On a K80 GPU, BoolQ training will take a couple of hours, while MNLI training may take between 12 and 18 hours. Note that the experiment only produces test set accuracies as [task-name]_[step-count].txt and HANS label outputs as hans_[step-count].txt. Run python evaluate_heur_output.py hans_[step-count].txt && mv formattedFile.txt hansres_[step-count].txt to produce HANS performance on categories and subcases.
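If you have several checkpoints, a short loop can post-process all of them (a sketch, assuming the hans_*.txt files are in the current directory):
for f in hans_*.txt; do
    step="${f#hans_}"; step="${step%.txt}"
    python evaluate_heur_output.py "$f"
    mv formattedFile.txt "hansres_${step}.txt"
done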
If you have any questions, please feel free to contact us at jmin10@jhu.edu, tom.mccoy@jhu.edu, and linzen@nyu.edu.