# EMQA

This repository contains code and data for running the experiments and reproducing the results of the paper: "Towards More Equitable Question Answering Systems: How Much More Data Do You Need?".

## Model

## Dataset

Download the datasets from the links below and place them under the `data` directory; a layout sketch follows the list.

- TyDi QA (original dataset from Google): train | dev
- TyDi QA (separated by language): train | dev
- SQuAD (original train set for the zero-shot setting): link
- tSQuAD: link
- mSQuAD: link
- Disproportional allocations: link
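
The commands in the rest of this README read these files from a local `data` directory. A minimal sketch of the expected layout (the TyDi QA and fine-tuning file names are taken from the commands below; everything else depends on what the downloads actually contain):

```bash
# Create the data directory the commands below expect.
mkdir -p data
# After downloading, the training/evaluation commands assume at least:
#   data/tydiqa-goldp-v1.1-train.json     # TyDi QA train split
#   data/tydiqa-goldp-v1.1-dev.json       # TyDi QA dev split
#   data/dataset_for_fineTuning.json      # used in the Fine-tuning step
```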

## Running Experiments

### Requirements

After creating a virtual environment and installing Python 3.6+, PyTorch 1.3.1+, and CUDA (tested with 10.1), install the Transformers library:

```bash
pip install transformers
```
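
If you are starting from scratch, one possible setup looks like the sketch below; the environment name `emqa-env` is arbitrary, the version pin simply mirrors the requirements above, and CUDA 10.1 is assumed to be installed system-wide:

```bash
python3 -m venv emqa-env       # create a virtual environment (name is arbitrary)
source emqa-env/bin/activate
pip install "torch>=1.3.1"     # PyTorch 1.3.1+ per the requirements above
pip install transformers
```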

### Training

If you want to use the multilingual BERT model, run the following command:

```bash
python run_squad.py \
      --model_type bert \
      --model_name_or_path=bert-base-multilingual-uncased \
      --do_train \
      --do_eval \
      --do_lower_case \
      --train_file './data/tydiqa-goldp-v1.1-train.json' \
      --predict_file './data/tydiqa-goldp-v1.1-dev.json' \
      --per_gpu_train_batch_size 24 \
      --per_gpu_eval_batch_size 24 \
      --learning_rate 3e-5 \
      --num_train_epochs 3 \
      --max_seq_length 384 \
      --doc_stride 128 \
      --output_dir './train_cache_output/' \
      --overwrite_cache
```

Otherwise, run the following command to use the XLM-RoBERTa-large model instead:

```bash
python run_squad.py \
      --model_type=xlm-roberta \
      --model_name_or_path=xlm-roberta-large \
      --do_train \
      --do_eval \
      --do_lower_case \
      --train_file './data/tydiqa-goldp-v1.1-train.json' \
      --predict_file './data/tydiqa-goldp-v1.1-dev.json' \
      --per_gpu_train_batch_size 24 \
      --per_gpu_eval_batch_size 24 \
      --learning_rate 3e-5 \
      --num_train_epochs 3 \
      --max_seq_length 384 \
      --doc_stride 128 \
      --output_dir './train_cache_output/' \
      --overwrite_cache
```

### Evaluating

For evaluation only, set --model_name_or_path to the cache directory of your trained model and run the following command:

```bash
python run_squad.py \
      --model_type bert \
      --model_name_or_path='./train_cache_output/' \
      --do_eval \
      --do_lower_case \
      --predict_file './data/tydiqa-goldp-v1.1-dev.json' \
      --per_gpu_eval_batch_size 24 \
      --learning_rate 3e-5 \
      --num_train_epochs 3 \
      --max_seq_length 384 \
      --doc_stride 128 \
      --output_dir './eval_cache_output/' \
      --overwrite_cache
```
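
When the run finishes, `run_squad.py` prints the evaluation metrics (exact match and F1) and writes its prediction files to `--output_dir`. A quick sanity check; the exact file names vary across Transformers versions:

```bash
# List the evaluation artifacts written by run_squad.py.
ls ./eval_cache_output/
# Typically includes prediction JSON files such as predictions_.json
```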

### Fine-tuning

For fine-tuning, run the following command:

```bash
python run_squad.py \
      --model_type bert \
      --model_name_or_path='./train_cache_output/' \
      --do_train \
      --do_eval \
      --do_lower_case \
      --train_file './data/dataset_for_fineTuning.json' \
      --predict_file './data/tydiqa-goldp-v1.1-dev.json' \
      --per_gpu_train_batch_size 24 \
      --per_gpu_eval_batch_size 24 \
      --learning_rate 3e-5 \
      --num_train_epochs 3 \
      --max_seq_length 384 \
      --doc_stride 128 \
      --output_dir './fineTune_cache_output/' \
      --overwrite_cache
```
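
To sweep the disproportional allocations from the Dataset section, the same command can be wrapped in a loop. A sketch, assuming the allocation files were unpacked as JSON under `./data/allocations/` (the directory and file names are assumptions, not part of this repository):

```bash
# Fine-tune the trained model once per allocation file (paths are assumptions).
for train_file in ./data/allocations/*.json; do
  # Keep each run's outputs in a separate directory named after the file.
  out_dir="./fineTune_cache_output/$(basename "$train_file" .json)"
  python run_squad.py \
        --model_type bert \
        --model_name_or_path='./train_cache_output/' \
        --do_train \
        --do_eval \
        --do_lower_case \
        --train_file "$train_file" \
        --predict_file './data/tydiqa-goldp-v1.1-dev.json' \
        --per_gpu_train_batch_size 24 \
        --per_gpu_eval_batch_size 24 \
        --learning_rate 3e-5 \
        --num_train_epochs 3 \
        --max_seq_length 384 \
        --doc_stride 128 \
        --output_dir "$out_dir" \
        --overwrite_cache
done
```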

## Citation

If you use this code, please cite our ACL-IJCNLP 2021 paper using the following BibTeX entry:

```bibtex
@inproceedings{debnath-etal-2021-towards,
    title = "Towards More Equitable Question Answering Systems: How Much More Data Do You Need?",
    author = "Debnath, Arnab  and Rajabi, Navid  and Alam, Fardina Fathmiul  and Anastasopoulos, Antonios",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
}
```

## License

Our code and data for EMQA are available under the MIT License.