EMQA

This repository contains code and data for running the experiments and reproducing the results of the paper: "Towards More Equitable Question Answering Systems: How Much More Data Do You Need?".

Dataset

Download the datasets from the following links and place them under the data directory; a quick sanity check for the downloaded files is sketched after the list.

  • TyDi QA (Original dataset from Google): train | dev
  • TyDi QA (Separated by language): train | dev
  • SQuAD (Original train set for zero-shot setting): link
  • tSQuAD: link
  • mSQuAD: link
  • Disproportional allocations: link
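All of these files follow the SQuAD-style JSON format that run_squad.py consumes, so one way to confirm a download landed intact is to load it and count its contents. A minimal Python sketch (the file name is just an example; point it at whichever file you fetched):

import json

# Load one of the downloaded files (path is an example).
with open('./data/tydiqa-goldp-v1.1-dev.json', encoding='utf-8') as f:
    dataset = json.load(f)['data']

num_paragraphs = sum(len(article['paragraphs']) for article in dataset)
num_questions = sum(len(p['qas']) for article in dataset for p in article['paragraphs'])
print(len(dataset), 'articles,', num_paragraphs, 'paragraphs,', num_questions, 'questions')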

Running Experiments

Requirements

After creating a virtual environment with Python 3.6+ and installing PyTorch 1.3.1+ and CUDA (tested with 10.1), install the Transformers library as follows:

pip install transformers
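Before kicking off a long training run, it can help to confirm from Python that the versions match and the GPU is visible. A quick check (the output depends on your install):

import torch
import transformers

# Verify the stack that run_squad.py relies on.
print('PyTorch:', torch.__version__)              # expected 1.3.1+
print('Transformers:', transformers.__version__)
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))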

Training

To train with the multilingual BERT model (bert-base-multilingual-uncased), run the following command:

python run_squad.py \
      --model_type bert \
      --model_name_or_path=bert-base-multilingual-uncased \
      --do_train \
      --do_eval \
      --do_lower_case \
      --train_file './data/tydiqa-goldp-v1.1-train.json' \
      --predict_file './data/tydiqa-goldp-v1.1-dev.json' \
      --per_gpu_train_batch_size 24 \
      --per_gpu_eval_batch_size 24 \
      --learning_rate 3e-5 \
      --num_train_epochs 3 \
      --max_seq_length 384 \
      --doc_stride 128 \
      --output_dir './train_cache_output/' \
      --overwrite_cache

Otherwise, run the following command to use the XLM-RoBERTa-large model instead:

python run_squad.py  \
      --model_type=xlm-roberta \
      --model_name_or_path=xlm-roberta-large \
      --do_train \
      --do_eval \
      --do_lower_case \
      --train_file './data/tydiqa-goldp-v1.1-train.json' \
      --predict_file './data/tydiqa-goldp-v1.1-dev.json'  \
      --per_gpu_train_batch_size 24 \
      --per_gpu_eval_batch_size 24 \
      --learning_rate 3e-5  \
      --num_train_epochs 3  \
      --max_seq_length 384  \
      --doc_stride 128  \
      --output_dir './train_cache_output/' \
      --overwrite_cache
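Either command saves the trained weights and tokenizer files into --output_dir. Before running the full evaluation, you can smoke-test the checkpoint on a single example with the Transformers question-answering pipeline (a sketch; the question and context below are invented):

from transformers import pipeline

# Load the checkpoint that run_squad.py wrote to --output_dir.
qa = pipeline('question-answering',
              model='./train_cache_output/',
              tokenizer='./train_cache_output/')

# An invented example, just to confirm the model loads and answers.
result = qa(question='Where is the Eiffel Tower located?',
            context='The Eiffel Tower is a wrought-iron lattice tower in Paris, France.')
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}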

Evaluating

For evaluation only, set --model_name_or_path to the cache directory of your trained model and run the following command:

python run_squad.py \
      --model_type bert \
      --model_name_or_path='./train_cache_output/' \
      --do_eval \
      --do_lower_case \
      --predict_file './data/tydiqa-goldp-v1.1-dev.json' \
      --per_gpu_eval_batch_size 24 \
      --learning_rate 3e-5 \
      --num_train_epochs 3 \
      --max_seq_length 384 \
      --doc_stride 128 \
      --output_dir './eval_cache_output/' \
      --overwrite_cache
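Besides the metrics printed at the end of the run, run_squad.py writes its raw predictions as JSON into --output_dir (the exact file name, e.g. predictions_.json, can vary across Transformers versions, so check the directory). A short sketch for spot-checking a few predictions:

import json
from itertools import islice

# File name may differ across Transformers versions; look for predictions_*.json.
with open('./eval_cache_output/predictions_.json', encoding='utf-8') as f:
    predictions = json.load(f)  # maps question id -> predicted answer text

for qid, answer in islice(predictions.items(), 5):
    print(qid, '->', answer)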

Fine-tuning

For fine-tuning, run the following command:

python run_squad.py \
      --model_type bert \
      --model_name_or_path='./train_cache_output/' \
      --do_train \
      --do_eval \
      --do_lower_case \
      --train_file './data/dataset_for_fineTuning.json' \
      --predict_file './data/tydiqa-goldp-v1.1-dev.json' \
      --per_gpu_train_batch_size 24 \
      --per_gpu_eval_batch_size 24 \
      --learning_rate 3e-5 \
      --num_train_epochs 3 \
      --max_seq_length 384 \
      --doc_stride 128 \
      --output_dir './fineTune_cache_output/' \
      --overwrite_cache
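Here dataset_for_fineTuning.json stands for whichever SQuAD-format file you fine-tune on, for example one of the disproportional-allocation files listed above. If you want to assemble your own mix instead, concatenating SQuAD-format files comes down to merging their data lists; a sketch (the file names are examples, and the second input is hypothetical):

import json

# Any SQuAD-format files you want to fine-tune on together (examples).
paths = [
    './data/tydiqa-goldp-v1.1-train.json',
    './data/extra_target_language_data.json',  # hypothetical extra file
]

merged = {'version': '1.1', 'data': []}
for path in paths:
    with open(path, encoding='utf-8') as f:
        merged['data'].extend(json.load(f)['data'])

with open('./data/dataset_for_fineTuning.json', 'w', encoding='utf-8') as f:
    json.dump(merged, f, ensure_ascii=False)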

Citation

If you use this code, please cite our ACL-IJCNLP 2021 paper using the following BibTeX entry:

@inproceedings{debnath-etal-2021-towards,
    title = "Towards More Equitable Question Answering Systems: How Much More Data Do You Need?",
    author = "Debnath, Arnab  and Rajabi, Navid  and Alam, Fardina Fathmiul and Anastasopoulos, Antonios",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "",
    doi = "",
    pages = "",
}

License

Our code and data for EMQA are available under the MIT License.
