Training recipes and scripts for the system paper "Exploring Model Consensus to Generate Translation Paraphrases" (WNGT workshop at ACL 2020), submitted to the Duolingo STAPLE shared task.
You can download the out-of-domain data and required tools by running:

```bash
bash prepare_data.sh
```
If you want to use the in-domain STAPLE dataset, download it from Dataverse and extract the training data to `data/raw/`.
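The archive name depends on the Dataverse release; as a sketch (the file name below is a placeholder, not the actual download):

```bash
# Placeholder archive name -- substitute the file actually downloaded from Dataverse
mkdir -p data/raw
tar -xzf staple_2020_train.tar.gz -C data/raw/
```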
We trained the model using a modified version of fairseq. You can install it by running:

```bash
cd tools/fairseq
pip install --editable .
```
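A quick way to confirm the editable install worked (assuming `python` points at the same environment `pip` used):

```bash
# Should print the installed fairseq version without an ImportError
python -c "import fairseq; print(fairseq.__version__)"
```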
The preprocessing procedure includes:
- punctuation normalization and removal of non-printable characters
- tokenization
- BPE (shared vocabulary of size 40k for both English and Portuguese)
- parallel data filtering
All data preprocessing steps are in `data/preprocess_data.sh`.
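For orientation, here is a minimal sketch of such a pipeline, assuming the Moses scripts and subword-nmt are used (the paths and exact commands are illustrative; the real pipeline is in `data/preprocess_data.sh`):

```bash
# Illustrative sketch only -- see data/preprocess_data.sh for the real pipeline.
# Assumes Moses was downloaded to tools/ by prepare_data.sh.
MOSES=tools/mosesdecoder/scripts

# Normalize punctuation, strip non-printable characters, tokenize
for lang in en pt; do
  cat train.$lang \
    | perl $MOSES/tokenizer/normalize-punctuation.perl -l $lang \
    | perl $MOSES/tokenizer/remove-non-printing-char.perl \
    | perl $MOSES/tokenizer/tokenizer.perl -l $lang \
    > train.tok.$lang
done

# Learn a joint BPE model (40k operations) shared across both languages
subword-nmt learn-joint-bpe-and-vocab \
  --input train.tok.en train.tok.pt -s 40000 \
  -o bpe.codes --write-vocabulary vocab.en vocab.pt

# Apply the shared BPE codes to each side
for lang in en pt; do
  subword-nmt apply-bpe -c bpe.codes < train.tok.$lang > train.bpe.$lang
done
```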
Scripts for pre-training with out-of-domain data and fine-tuning with in-domain data are in the `recipes/` directory. You can also evaluate the weighted macro F1 score with these scripts.
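The script names below are placeholders; check `recipes/` for the actual file names:

```bash
# Hypothetical recipe invocations -- substitute the real script names from recipes/
bash recipes/pretrain.sh   # pre-train on out-of-domain parallel data
bash recipes/finetune.sh   # fine-tune on the in-domain STAPLE data
```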
If you use this work, please cite it as
```bibtex
@inproceedings{li-etal-2020-exploring,
  author    = {Li, Zhenhao and Fomicheva, Marina and Specia, Lucia},
  title     = {Exploring Model Consensus to Generate Translation Paraphrases},
  booktitle = {Proceedings of the ACL Workshop on Neural Generation and Translation (WNGT)},
  publisher = {ACL},
  year      = {2020}
}
```