Project for my master's thesis "Improving Neural Machine Translation Robustness via Data Augmentation"
We experimented with data augmentation methods (back-translation, forward-translation, fuzzy match) and external datasets from speech transcripts to improve the performance of neural machine translation models on noisy test sets. We followed the setup of the WMT19 Robustness Shared Task for the Fr-En directions.
The training and preprocessing scripts for all systems are provided in this repository.
- OpenNMT-py
- PyTorch
- fairseq
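A minimal setup sketch, assuming the toolkits are installed from PyPI; the exact versions used in the experiments may differ:

```bash
# Assumed PyPI package names; pin versions as needed to reproduce the experiments.
pip install torch          # PyTorch
pip install OpenNMT-py     # OpenNMT-py training/translation CLI
pip install fairseq        # fairseq toolkit
```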
Run the script below to download the required data automatically:
bash prepare_data.sh
Datasets used in the experiments can be categorized as in-domain and out-of-domain. The in-domain data is the MTNT dataset. For out-of-domain data, we use the WMT15 Fr-En News Translation data.
Preprocessing includes tokenization with the Moses tokenizer.perl script, followed by BPE segmentation.
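A minimal preprocessing sketch for one side of the corpus, assuming subword-nmt is used for BPE; the file names, paths, and merge count are placeholders, not the exact settings from the thesis:

```bash
# Tokenize with the Moses tokenizer (tokenizer.perl from the mosesdecoder repo).
perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < train.en > train.tok.en

# Learn and apply BPE with subword-nmt (merge count is a placeholder).
subword-nmt learn-bpe -s 32000 < train.tok.en > bpe.codes.en
subword-nmt apply-bpe -c bpe.codes.en < train.tok.en > train.bpe.en
```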
We conducted four experiments:
- Model comparison (RNN, CNN and Transformer) on noisy texts
- Data augmentation (back-translation, forward-translation, fuzzy match; see the back-translation sketch after this list)
- External data (human transcripts from IWSLT and MuST-C, ASR generated transcripts)
- Submissions to WMT19 Leaderboard
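As an illustration of the back-translation step, below is a sketch using the OpenNMT-py translation CLI; the model path, data files, and decoding options are assumptions rather than the exact configuration used in the experiments:

```bash
# Translate English monolingual data with a reverse (En->Fr) model to obtain
# synthetic French sources; the resulting pairs are added to the Fr-En training data.
onmt_translate \
    -model models/reverse.en-fr.pt \
    -src data/mono.tok.bpe.en \
    -output data/mono.synthetic.fr \
    -beam_size 5 -gpu 0
```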
Details about the experiments and results can be found here (TODO: add thesis link).