Steps to train a Transformer model on the WMT English-German dataset
- OpenNMT-tf (>= 1.10.0)
- SentencePiece
Please follow the SentencePiece instructions to install and build it. If it is installed in a custom location, change the `SP_PATH` variable in the scripts.
```bash
pip install "OpenNMT-tf[tensorflow_gpu]>=1.10.0"
```
Before running the script, follow the links in the file header to download the datasets. Depending on the task, you may need to change the file names and folder paths.
`/data/wmt/` contains the raw parallel datasets.
The script will train a SentencePiece model with a shared source and target vocabulary, tokenize the dataset, and prepare the train/valid/test files. The generated files are written to a new `data/` directory.
We recommend training on 4 GPUs to get the best performance:
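The exact launch command lives in the repository's script, but with the OpenNMT-tf 1.x CLI a multi-GPU run of this kind is typically started along these lines (the configuration file name `data.yml` is an assumption for illustration):

```shell
# Sketch of a 4-GPU Transformer training launch with OpenNMT-tf 1.x.
# data.yml is a hypothetical configuration listing the tokenized
# train/valid files and the SentencePiece vocabularies.
onmt-main train_and_eval \
    --model_type Transformer \
    --auto_config \
    --config data.yml \
    --num_gpus 4
```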
Or if you have only 1 GPU, run the dedicated script:
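If the dedicated script is unavailable, a single-GPU run can presumably be launched the same way with `--num_gpus 1` (again with an assumed `data.yml`); note that the effective batch size is then smaller than in the 4-GPU setup unless the script compensates for it:

```shell
# Hypothetical single-GPU variant of the launch above.
onmt-main train_and_eval \
    --model_type Transformer \
    --auto_config \
    --config data.yml \
    --num_gpus 1
```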
- SentencePiece model and vocabulary
- Pre-tokenized dataset
- Pre-trained averaged model:
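An averaged model like the one above is built by averaging the weights of the last few training checkpoints, which usually improves over any single checkpoint. OpenNMT-tf ships a checkpoint-averaging utility for this; the directory names below are illustrative:

```shell
# Average the most recent checkpoints of a finished run into a single
# model directory (paths are examples, not the script's actual output).
onmt-average-checkpoints \
    --model_dir wmt_ende_transformer \
    --output_dir wmt_ende_transformer/avg \
    --max_count 8
```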
This model achieved the following scores:
| Test set | NIST BLEU |
| --- | --- |