JParaCrawl Fine-tuning Example
This repository includes an example usage of JParaCrawl pre-trained Neural Machine Translation (NMT) models.
Our goal is to train (fine-tune) a domain-adapted NMT model in a few hours.
We wrote this document to be beginner-friendly so that many people can try NMT experiments. Thus, some parts might be too easy or redundant for experts.
In this example, we focus on fine-tuning the pre-trained model for the KFTT corpus, which contains Wikipedia articles related to Kyoto. We prepared two examples, English-to-Japanese and Japanese-to-English. Since we expect you to read the MT output, we recommend trying the English-to-Japanese example if you are a fluent Japanese speaker, and the Japanese-to-English example otherwise. In the following, we use the Japanese-to-English example.
JParaCrawl is the largest publicly available English-Japanese parallel corpus created by NTT. In this example, we will fine-tune the model pre-trained on JParaCrawl.
For more details about JParaCrawl, visit the official website.
This example uses the following.
For fairseq, we recommend using the same version that we used to pre-train the model.
$ cd fairseq
$ git checkout c81fed46ac7868c6d80206ff71c6f6cfe93aee22
We prepared a Docker image with the prerequisites already installed.
Use the following commands to run it. Note that you can change ~/jparacrawl-experiments to the path where you want to store the experimental results; it will be mounted in the container as /host_disk.
$ docker pull morinoseimorizo/jparacrawl-fairseq
$ docker run -it --gpus 1 -v ~/jparacrawl-experiments:/host_disk morinoseimorizo/jparacrawl-fairseq bash
Prepare the data
First, you need to prepare the corpus and pre-trained model.
$ cd /host_disk
$ git clone https://github.com/MorinoseiMorizo/jparacrawl-finetune.git  # Clone the repository.
$ cd jparacrawl-finetune
$ ./get-data.sh  # This script will download KFTT and the sentencepiece models for pre-processing the corpus.
$ ./preprocess.sh  # Split the corpus into subwords.
$ cp ./ja-en/*.sh ./  # If you try the English-to-Japanese example, use the en-ja directory instead.
$ ./get-model.sh  # Download the pre-trained model.
These commands will download the KFTT corpus and the pre-trained NMT model.
They will then tokenize the corpus into subwords with the provided sentencepiece models.
The subword-tokenized corpus is located at corpus/spm/.
$ head -n 2 corpus/spm/kyoto-train.en
▁Known ▁as ▁Se s shu ▁( 14 20 ▁- ▁150 6) , ▁he ▁was ▁an ▁ink ▁painter ▁and ▁Zen ▁monk ▁active ▁in ▁the ▁Muromachi ▁period ▁in ▁the ▁latter ▁half ▁of ▁the ▁15 th ▁century , ▁and ▁was ▁called ▁a ▁master ▁painter .
▁He ▁revolutionize d ▁the ▁Japanese ▁ink ▁painting .
You can see that a word is tokenized into several subwords.
We use subwords to reduce the vocabulary size and to express a low-frequency word as a combination of subwords.
For example, the word revolutionized is tokenized into ▁revolutionize and d.
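To see how the pieces map back to plain text: sentencepiece prefixes each word-initial piece with "▁" (U+2581), so the original sentence can be recovered with a simple join. This is a minimal Python illustration of the convention (not the actual detokenizer used by the scripts):

```python
# Pieces taken from the second sample line above; "▁" marks the start
# of an original word, while pieces like "d" continue the previous word.
pieces = ["▁He", "▁revolutionize", "d", "▁the", "▁Japanese", "▁ink", "▁painting", "."]

# Detokenize: concatenate, turn markers back into spaces, trim the leading one.
detok = "".join(pieces).replace("▁", " ").strip()
print(detok)  # He revolutionized the Japanese ink painting.
```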
Decoding with pre-trained NMT models
Before fine-tuning experiments, let's try to decode (translate) a file with the pre-trained model to see how the current model works.
We prepared a script, decode.sh, that decodes the KFTT test set with the pre-trained NMT model.
We can automatically evaluate the translation results by comparing them with reference translations.
Here, we use the BLEU score, which is the most widely used evaluation metric in the MT community.
The script automatically calculates the BLEU score and saves it to decode/test.log.
BLEU scores range from 0 to 100, so this result is somewhat low.
$ cat decode/test.log
BLEU+case.mixed+numrefs.1+smooth.exp+tok.intl+version.1.4.2 = 14.2 50.4/22.0/11.2/5.9 (BP = 0.868 ratio = 0.876 hyp_len = 24351 ref_len = 27790)
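For reference, the score in the log can be reproduced from the other reported quantities: BLEU is the brevity penalty times the geometric mean of the 1- to 4-gram precisions. The small difference in the last digit comes from the precisions being printed rounded to one decimal:

```python
import math

# Values copied from the log above.
precisions = [0.504, 0.220, 0.112, 0.059]  # 1- to 4-gram precisions
hyp_len, ref_len = 24351, 27790

# Brevity penalty: penalizes hypotheses shorter than the reference.
bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)

# BLEU = BP * geometric mean of the n-gram precisions, scaled to 0-100.
bleu = 100 * bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))
print(f"BP = {bp:.3f}, BLEU = {bleu:.1f}")  # BP = 0.868, BLEU = 14.3
```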
It is also important to check the actual outputs as well as the BLEU scores.
The input and output files are located at ./corpus/kftt-data-1.0/data/orig/kyoto-test.ja and ./decode/kyoto-test.ja.true.detok.
$ head -n4 ./corpus/kftt-data-1.0/data/orig/kyoto-test.ja
InfoboxBuddhist
道元（どうげん）は、鎌倉時代初期の禅僧。
曹洞宗の開祖。
晩年に希玄という異称も用いた。。
$ head -n4 ./decode/kyoto-test.ja.true.detok
InfoboxBuddhist
Dogen is a Zen monk from the early Kamakura period.
The founder of the Soto sect.
In his later years, he also used the heterogeneous name "Legend".
This is just an example, so your results may vary.
You can also find the reference translations at ./corpus/kftt-data-1.0/data/orig/kyoto-test.en.
$ head -n4 ./corpus/kftt-data-1.0/data/orig/kyoto-test.en
Infobox Buddhist
Dogen was a Zen monk in the early Kamakura period.
The founder of Soto Zen
Later in his life he also went by the name Kigen.
The current model mistranslated the name "Kigen" to "Legend" at line 4. Also, "heterogeneous" is not an appropriate translation. Let's see how this could be improved by fine-tuning.
Fine-tuning on KFTT corpus
Now, let's move on to fine-tuning. By fine-tuning, the model will adapt to the specific domain, KFTT. Thus, we can expect the translation accuracy to improve.
The following script will fine-tune the pre-trained model on the KFTT training set.
$ nohup ./fine-tune_kftt_fp32.sh &> fine-tune.log &
$ tail -f fine-tune.log
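If you are curious what the script does: fine-tuning in fairseq typically means resuming training from the pre-trained checkpoint on the in-domain data, keeping the weights but discarding the old optimizer state. A rough sketch of such an invocation follows; the file names and hyperparameters here are assumptions, so check fine-tune_kftt_fp32.sh for the actual flags:

```shell
# Sketch only: paths and hyperparameters below are assumptions.
# --restore-file loads the pre-trained JParaCrawl checkpoint; the --reset-*
# flags keep its weights but restart the optimizer, dataloader, and meters
# so training begins fresh on the KFTT data.
fairseq-train data-bin/kftt \
    --restore-file pretrained-model.pt \
    --reset-optimizer --reset-dataloader --reset-meters \
    --arch transformer --share-all-embeddings \
    --optimizer adam --lr 1e-4 --max-tokens 4000 \
    --save-dir models/fine-tune
```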
Modern GPUs support mixed-precision training, which makes use of Tensor Cores to compute half-precision floating-point operations faster.
If you want to use this feature, run fine-tune_kftt_mixed.sh instead of fine-tune_kftt_fp32.sh on a Volta or later generation GPU, such as the Tesla V100 or GeForce RTX 2080 Ti.
Training will take several hours to finish. We tested on a single RTX 2080 Ti GPU with mixed-precision training, and it finished in about two hours. Training time differs drastically depending on the environment, so it may take a few more hours.
Once it has finished, you can find the BLEU score in models/fine-tune/test.log.
You can see that the BLEU score is greatly improved by fine-tuning.
$ cat models/fine-tune/test.log
BLEU+case.mixed+numrefs.1+smooth.exp+tok.intl+version.1.4.2 = 26.4 57.8/31.7/20.1/13.5 (BP = 0.992 ratio = 0.992 hyp_len = 27572 ref_len = 27790)
The translated text is in models/fine-tune/kyoto-test.ja.true.detok.
$ head -n4 models/fine-tune/kyoto-test.ja.true.detok
Nickel buddhist
Dogen was a Zen priest in the early Kamakura period.
He was the founder of the Soto sect.
In his later years, he also used another name, Kigen.
The fine-tuned model could correctly translate line 4.
In this document, we described how to use the pre-trained model and fine-tune it on KFTT. By fine-tuning, we can obtain a domain-specific NMT model at a low computational cost.
We listed some examples for NMT beginners who want to go further.
- Look into the provided scripts and find out which commands are used.
- Try to translate your documents with the pre-trained and fine-tuned models.
- You need to edit
- See how well the model works.
- You need to edit
- Try fine-tuning with other English-Japanese parallel corpora.
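For the suggestion above about translating your own documents, the usual pipeline is: tokenize your text with the provided sentencepiece model, translate, then detokenize. A hedged sketch follows; the sentencepiece model names, data-bin directory, and checkpoint path are assumptions, so adapt them to the files this repository actually provides:

```shell
# Sketch only: file and model names below are assumptions.
spm_encode --model=spm.ja.model < mydoc.ja > mydoc.spm.ja    # split into subwords
fairseq-interactive data-bin/ \
    --path models/fine-tune/checkpoint_best.pt \
    --input mydoc.spm.ja --beam 6 \
    | grep '^H-' | cut -f3 > mydoc.spm.en                    # keep hypothesis text only
spm_decode --model=spm.en.model < mydoc.spm.en > mydoc.en    # join subwords back
```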
- NMT architectures
- JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus
- The Kyoto Free Translation Task (KFTT)
If you have any questions, please open an issue on GitHub or contact us by email.
NTT Communication Science Laboratories
jparacrawl-ml -a- hco.ntt.co.jp