To structure the Turku Paraphrase Corpus data files (train.json, dev.json, test.json and texts.json.gz) the same way as SQuAD, run make_paraphrase_data.py:
python3 make_paraphrase_data.py \
--file dev.json \
--context texts.json.gz \
--output qa_data/dev.json
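A quick way to sanity-check a converted file (this is a sketch that assumes the output follows the SQuAD-style layout with a top-level "data" list, which is what run_qa.py expects from JSON input; the field names are based on the SQuAD format, not on inspecting make_paraphrase_data.py):

import json

# Assumption: the converted file mirrors the SQuAD JSON layout, i.e. a top-level
# "data" list whose entries contain id, title, context, question and answers.
with open("qa_data/dev.json", encoding="utf-8") as f:
    converted = json.load(f)

example = converted["data"][0]
print(example["question"])
print(example["context"][:200])
print(example["answers"])   # e.g. {"text": ["..."], "answer_start": [42]}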
All of the original fine-tuning code can be found in the HuggingFace transformers repository (https://github.com/huggingface/transformers). The code for this project was fetched from that repository on 7 June 2021.
To run the code on CSC Mahti:
Clear the loaded modules and load PyTorch:
module purge
module load pytorch/1.8
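A quick check that the loaded module provides a GPU-enabled PyTorch (on a login node without GPUs, the CUDA check may print False):

import torch

# pytorch/1.8 should report a 1.8.x version; CUDA is only visible on GPU nodes.
print(torch.__version__)
print(torch.cuda.is_available())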
Important! To run the HuggingFace example code, install transformers from source:
git clone https://github.com/huggingface/transformers
cd transformers
python -m pip install --user .
To install the requirements:
python -m pip install --user -r requirements.txt
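To verify that the source install is the one picked up by Python (a source checkout typically reports a development version):

import transformers

# With a source install the version string usually ends in .dev0.
print(transformers.__version__)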
Notes:
- This script only works with models that have a fast tokenizer.
- If your dataset contains samples with no possible answers (like SQuAD version 2), you need to pass along the flag --version_2_with_negative.
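You can check in advance that the model used below provides a fast tokenizer:

from transformers import AutoTokenizer

# TurkuNLP/bert-base-finnish-cased-v1 is the model used in the fine-tuning example below.
tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1", use_fast=True)
print(tokenizer.is_fast)   # run_qa.py requires this to be True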
Example code for fine-tuning BERT on the preprocessed Turku Paraphrase Corpus dataset:
python3 run_qa.py \
--model_name_or_path TurkuNLP/bert-base-finnish-cased-v1 \
--train_file train.json \
--validation_file dev.json \
--test_file test.json \
--do_train \
--do_eval \
--do_predict \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--learning_rate 2e-5 \
--num_train_epochs 2 \
--max_seq_length 512 \
--doc_stride 128 \
--version_2_with_negative \
--output_dir /output/ \
--cache_dir /caches/
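After training, the checkpoint written to --output_dir can be tried out with the question-answering pipeline. A minimal sketch; the question and context strings are made-up placeholders:

from transformers import pipeline

# Load the fine-tuned model saved by run_qa.py.
qa = pipeline("question-answering", model="/output/", tokenizer="/output/")

result = qa(
    question="Missä Turun yliopisto sijaitsee?",    # placeholder question
    context="Turun yliopisto sijaitsee Turussa.",   # placeholder context
    handle_impossible_answer=True,                  # the model was trained with --version_2_with_negative
)
print(result)   # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}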