# Machine Translation with Transformer

In this notebook, you will learn how to use the Transformer introduced in [Vaswani et al., 2017]. You will load a pretrained Transformer model and evaluate it on `newstest2016`. In addition, you will translate a few sentences yourself with the `BeamSearchTranslator`.

## Preparation

We start with some usual preparation such as importing libraries and setting the environment.


In [1]:
import warnings
warnings.filterwarnings('ignore')

import random
import numpy as np
import mxnet as mx
from mxnet import gluon
import gluonnlp as nlp
import sacremoses

np.random.seed(100)
random.seed(100)
mx.random.seed(10000)
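# Use the first GPU when one is available, otherwise fall back to the CPU
ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()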

## Use the Pretrained Transformer model

Next, we load the pretrained Transformer model from the GluonNLP model zoo. The call returns the model together with the source and target vocabularies.

In [2]:
import nmt

wmt_transformer_model, wmt_src_vocab, wmt_tgt_vocab = \
    nlp.model.get_model('transformer_en_de_512',
                        dataset_name='WMT2014',
                        pretrained=True,
                        ctx=ctx)
# we are using mixed vocab of EN-DE, so the source and target language vocab are the same
print('#Source Vocab:', len(wmt_src_vocab), ', #Target Vocab:', len(wmt_tgt_vocab))

#Source Vocab: 36794 , #Target Vocab: 36794


In [3]:
print(wmt_transformer_model) # Print the model

NMTModel(
  (encoder): TransformerEncoder(
    (dropout_layer): Dropout(p = 0.1, axes=())
    (layer_norm): LayerNorm(eps=1e-05, axis=-1, center=True, scale=True, in_channels=512)
    (transformer_cells): HybridSequential(
      (0): TransformerEncoderCell(
        (dropout_layer): Dropout(p = 0.1, axes=())
        (attention_cell): MultiHeadAttentionCell(
          (_base_cell): DotProductAttentionCell(
            (_dropout_layer): Dropout(p = 0.1, axes=())
          )
          (proj_query): Dense(512 -> 512, linear)
          (proj_key): Dense(512 -> 512, linear)
          (proj_value): Dense(512 -> 512, linear)
        )
        (proj): Dense(512 -> 512, linear)
        (ffn): PositionwiseFFN(
          (ffn_1): Dense(512 -> 2048, linear)
          (activation): Activation(relu)
          (ffn_2): Dense(2048 -> 512, linear)
          (dropout_layer): Dropout(p = 0.1, axes=())
          (layer_norm): LayerNorm(eps=1e-05, axis=-1, center=True, scale=True, in_channels=512)
        )


### Load and Preprocess WMT 2016 Dataset

We then load the `newstest2016` segment of the WMT 2016 English-German test dataset for evaluation.

First, let us look at the WMT 2016 corpus. `GluonNLP` provides the [WMT2016BPE](../../api/modules/data.rst#gluonnlp.data.WMT2016BPE)
and [WMT2016](../../api/modules/data.rst#gluonnlp.data.WMT2016) classes. The former contains a BPE-tokenized dataset, while the latter contains the raw text. Here, we use the former for scoring, and the latter for
demonstrating actual translation.

Byte-pair encoding (BPE) is one way to split words into sub-words. For example, the word **cheapest** is converted to **cheap@@** and **est**, and **sunnyvale** is converted to **sunny@@** and **vale**. Using sub-words lets a fixed-size vocabulary represent rare and unseen words, which is why this trick is common in NLP.
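
As a minimal sketch of the `@@` convention described above (this is not GluonNLP's own code, and `merge_bpe` is a hypothetical helper name), the following undoes the segmentation by gluing every piece that ends in `@@` onto the piece that follows it:

```python
def merge_bpe(tokens, marker="@@"):
    """Join BPE sub-word pieces back into whole words.

    A piece ending in `marker` is a word-internal fragment and is glued
    to the piece that follows it.
    """
    words, current = [], ""
    for token in tokens:
        if token.endswith(marker):
            current += token[:-len(marker)]  # the word continues
        else:
            words.append(current + token)    # the word ends here
            current = ""
    if current:                              # dangling fragment at the end
        words.append(current)
    return words


print(merge_bpe("the cheap@@ est flight to sunny@@ vale".split()))
# ['the', 'cheapest', 'flight', 'to', 'sunnyvale']
```

The same rule applies to pieces that span more than two fragments, such as `Poli@@ zeich@@ ef` in the BPE sample printed by the next cell.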

In [4]:
import hyperparameters as hparams

#wmt_data_text = nlp.data.WMT2014BPE('newstest2014', # BPE: cheapest --> cheap@@, est
wmt_data_text = nlp.data.WMT2016BPE('newstest2016', # BPE: cheapest --> cheap@@, est
                                    src_lang=hparams.src_lang,
                                    tgt_lang=hparams.tgt_lang)
print('Source language %s, Target language %s' % (hparams.src_lang, hparams.tgt_lang))
print('Sample BPE tokens: "{}"'.format(wmt_data_text[14]))

#wmt_test_text = nlp.data.WMT2014('newstest2014',
wmt_test_text = nlp.data.WMT2016('newstest2016',
                                 src_lang=hparams.src_lang,
                                 tgt_lang=hparams.tgt_lang)
# For this demo, we only evaluate the predictions for the first 18 sentences
wmt_data_text = gluon.data.SimpleDataset([wmt_data_text[i] for i in range(18)])
wmt_test_text = gluon.data.SimpleDataset([wmt_test_text[i] for i in range(18)])

print('Sample raw text: "{}"'.format(wmt_test_text[16]))

Source language en, Target language de
Sample BPE tokens: "('Delta State University police chief L@@ ynn Bu@@ ford said university officials heard about the shooting at 10 : 18 a.m.', 'Delta State University Poli@@ zeich@@ ef L@@ ynn Bu@@ ford sagte , dass Universitäts@@ -@@ Mitarbeiter das Sch@@ ießen um 10 : 18 Uhr gehört haben .')"
Sample raw text: "('By the end of the day, there would be one more death: Lamb took his own life as police closed in on him.', 'Bis zum Ende des Tages gab es einen weiteren Tod: Lamm nahm sich das Leben, als die Polizei ihn einkesselte.')"


In [5]:
# Slice the target part of the dataset using .transform
wmt_test_tgt_sentences = wmt_test_text.transform(lambda src, tgt: tgt)
print('Sample target sentence: "{}"'.format(wmt_test_tgt_sentences[16]))

Sample target sentence: "Bis zum Ende des Tages gab es einen weiteren Tod: Lamm nahm sich das Leben, als die Polizei ihn einkesselte."


We further process the dataset using the `.transform()` API. The preprocessing has the following four steps:

1) Clip the source and target sequences

2) Split the string input to a list of tokens

3) Map the string token into its index in the vocabulary

4) Append EOS token to source sentence and add BOS and EOS tokens to target sentence.
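
The four steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `TOY_VOCAB`, `preprocess_pair`, and `max_len` are hypothetical names, the vocabulary is a toy dictionary rather than the pretrained model's vocabulary, and clipping is applied to the token list right after splitting. The BOS and EOS indices 2 and 3 are chosen to mirror the sample output shown after the next cell.

```python
# Toy vocabulary for illustration only; the notebook itself uses the
# vocabularies returned together with the pretrained model.
TOY_VOCAB = {"<unk>": 0, "<pad>": 1, "<bos>": 2, "<eos>": 3,
             "we": 4, "love": 5, "language": 6,
             "wir": 7, "lieben": 8, "sprache": 9}


def preprocess_pair(src, tgt, vocab, max_len=50):
    """Hypothetical stand-in for the four preprocessing steps."""
    # Steps 1 and 2: split each string into tokens and clip to max_len tokens
    src_tokens = src.split()[:max_len]
    tgt_tokens = tgt.split()[:max_len]
    # Step 3: map each token to its vocabulary index (unknown words -> <unk>)
    unk = vocab["<unk>"]
    src_ids = [vocab.get(token, unk) for token in src_tokens]
    tgt_ids = [vocab.get(token, unk) for token in tgt_tokens]
    # Step 4: EOS after the source; BOS before and EOS after the target
    src_ids = src_ids + [vocab["<eos>"]]
    tgt_ids = [vocab["<bos>"]] + tgt_ids + [vocab["<eos>"]]
    return src_ids, tgt_ids


print(preprocess_pair("we love language", "wir lieben sprache", TOY_VOCAB))
# ([4, 5, 6, 3], [2, 7, 8, 9, 3])
```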

In [6]:
import dataprocessor

# wmt_transform_fn includes the four preprocessing steps mentioned above.
wmt_transform_fn = dataprocessor.TrainValDataTransform(wmt_src_vocab, wmt_tgt_vocab)
wmt_dataset_processed = wmt_data_text.transform(wmt_transform_fn, lazy=False)

def get_length_index_fn():
    # Keep the running index in the closure instead of a module-level global
    idx = 0
    def transform(src, tgt):
        nonlocal idx
        result = (src, tgt, len(src), len(tgt), idx)
        idx += 1
        return result
    return transform

wmt_data_text_with_len = wmt_dataset_processed.transform(get_length_index_fn(), lazy=False)
# Five elements: Source Token Ids, Target Token Ids, Source Seq Length, Target Seq length, Index
print(wmt_data_text_with_len[0][0], '\n', wmt_data_text_with_len[0][1])

[ 7224 25657  7048 12057 11496 29502     3] 
 [    2  7224 16715  7048 12057 11496 29502     3]


### Creating `Sampler` and `DataLoader` for the `WMT 2016` Dataset

Now that we have the transformed datasets, the next step is to construct a sampler and a `DataLoader`. First, we need a batchify function, which pads and stacks sequences to form mini-batches.

In [7]:
wmt_test_batchify_fn = nlp.data.batchify.Tuple(
    nlp.data.batchify.Pad(),                   # Source Token IDs
    nlp.data.batchify.Pad(),                   # Target Token IDs
    nlp.data.batchify.Stack(dtype='float32'),  # Source Sequence Length
    nlp.data.batchify.Stack(dtype='float32'),  # Target Sequence Length
    nlp.data.batchify.Stack())                 # Index

* [Tuple](https://gluon-nlp.mxnet.io/api/modules/data.batchify.html?highlight=batchify#gluonnlp.data.batchify.Tuple) is the GluonNLP way of applying different batchify functions to each element of a dataset item. In this case, we are applying `Pad` to `src` and `tgt`, `Stack` to `len(src)` and `len(tgt)` with conversion to float32, and simple `Stack` to `idx` without type conversion.
* [Pad](https://gluon-nlp.mxnet.io/api/modules/data.batchify.html?highlight=batchify#gluonnlp.data.batchify.Pad) takes the elements from all dataset items in a batch and pads them to the length of the longest item to form a padded matrix/tensor.
* [Stack](https://gluon-nlp.mxnet.io/api/modules/data.batchify.html?highlight=batchify#gluonnlp.data.batchify.Stack) simply stacks all elements in a batch, and requires all elements to be of the same length.
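
To make the three batchify functions concrete, here is a rough pure-Python analogue that works on lists instead of NDArrays. The helper names `pad`, `stack`, and `batchify` are hypothetical, and the pad value of 0 is an assumption made for illustration; the real classes also handle tensor axes and dtypes.

```python
def pad(sequences, pad_val=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]


def stack(items, dtype=None):
    """Collect per-sample items into one list, optionally casting each."""
    return [dtype(item) if dtype else item for item in items]


def batchify(samples):
    """Apply pad/pad/stack/stack/stack column by column, as Tuple(...) does."""
    src, tgt, src_len, tgt_len, idx = zip(*samples)
    return (pad(src), pad(tgt),
            stack(src_len, float), stack(tgt_len, float), stack(idx))


samples = [([4, 5, 3], [2, 7, 3], 3, 3, 0),
           ([4, 3], [2, 7, 8, 9, 3], 2, 5, 1)]
for column in batchify(samples):
    print(column)
# [[4, 5, 3], [4, 3, 0]]
# [[2, 7, 3, 0, 0], [2, 7, 8, 9, 3]]
# [3.0, 2.0]
# [3.0, 5.0]
# [0, 1]
```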

We can then construct bucketing samplers, which generate batches by grouping sequences with similar lengths. Here, we use [FixedBucketSampler](https://gluon-nlp.mxnet.io/api/modules/data.html?highlight=fixedbucketsampler#gluonnlp.data.FixedBucketSampler). `FixedBucketSampler` assigns each data sample to a bucket based on its length. The buckets are determined automatically.

Please refer to [BucketSampler](https://gluon-nlp.mxnet.io/api/notes/data_api.html) for more information.
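
The idea of assigning samples to buckets by length can be sketched as below. This is a simplification and not the library's implementation: `assign_to_buckets` is a hypothetical helper, it uses a single scalar length per sample, and the bucket keys are passed in by hand, whereas `FixedBucketSampler` works on (source length, target length) pairs and chooses the keys itself.

```python
def assign_to_buckets(lengths, bucket_keys):
    """Put each sample index into the smallest bucket that can hold it."""
    keys = sorted(bucket_keys)
    buckets = {key: [] for key in keys}
    for idx, length in enumerate(lengths):
        for key in keys:
            if length <= key:
                buckets[key].append(idx)
                break
        else:
            raise ValueError(f"length {length} exceeds the largest bucket")
    return buckets


lengths = [7, 30, 12, 45, 28, 70]
print(assign_to_buckets(lengths, bucket_keys=[28, 49, 70]))
# {28: [0, 2, 4], 49: [1, 3], 70: [5]}
```

Because every batch is drawn from a single bucket, sequences in a batch have similar lengths and little computation is spent on padding.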

In [8]:
wmt_test_batch_sampler = nlp.data.FixedBucketSampler(
    # Bucket on the (source length, target length) pair of each sample
    lengths=wmt_data_text_with_len.transform(
        lambda src, tgt, src_len, tgt_len, idx: (src_len, tgt_len)),
    num_buckets=3,
    batch_size=2)
print(wmt_test_batch_sampler.stats())

FixedBucketSampler:
  sample_num=18, batch_num=10
  key=[(28, 36), (49, 63), (70, 90)]
  cnt=[11, 5, 2]
  batch_size=[2, 2, 2]
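
The `batch_num` value in these statistics can be checked by hand. The sketch below assumes that each bucket is split into batches independently and that a final partial batch is kept; `count_batches` is a hypothetical helper name.

```python
import math


def count_batches(bucket_counts, batch_sizes):
    """Number of batches when every bucket is batched on its own."""
    return sum(math.ceil(count / size)
               for count, size in zip(bucket_counts, batch_sizes))


# cnt=[11, 5, 2] and batch_size=[2, 2, 2] from the statistics above
print(count_batches([11, 5, 2], [2, 2, 2]))  # 6 + 3 + 1 = 10
```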


Given the samplers, we can use [DataLoader](https://mxnet.apache.org/versions/master/api/python/gluon/data.html#mxnet.gluon.data.DataLoader) to sample the datasets.

In [9]:
wmt_test_data_loader = gluon.data.DataLoader(
    wmt_data_text_with_len,
    batch_sampler=wmt_test_batch_sampler,
    batchify_fn=wmt_test_batchify_fn,
    num_workers=8)  # Note that we can use multi-processing
print('Number of testing batches:', len(wmt_test_data_loader))

Number of testing batches: 10


### Evaluate Transformer

Next, we evaluate the performance of the model on the `newstest2016` dataset. We first define the `BeamSearchTranslator` to generate the translations.

In [10]:
print('Beam Size =', hparams.beam_size, ', Length penalty Alpha=', hparams.lp_alpha, ', Length penalty K=', hparams.lp_k)
wmt_translator = nmt.translation.BeamSearchTranslator(
    model=wmt_transformer_model,
    beam_size=hparams.beam_size,
    scorer=nlp.model.BeamSearchScorer(alpha=hparams.lp_alpha, K=hparams.lp_k),
    max_length=200)

Beam Size = 4 , Length penalty Alpha= 0.6 , Length penalty K= 5
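
To give a feel for what `alpha` and `K` control, the sketch below assumes the length penalty documented for `BeamSearchScorer`, namely `(K + length)^alpha / (K + 1)^alpha`, by which the summed log-probability of a hypothesis is divided. The function names are hypothetical and the log-probabilities are made-up numbers for illustration.

```python
def length_penalty(length, alpha=0.6, K=5):
    """Assumed penalty: ((K + length) / (K + 1)) ** alpha."""
    return ((K + length) / (K + 1)) ** alpha


def normalized_score(sum_log_prob, length, alpha=0.6, K=5):
    """Summed log-probability divided by the length penalty."""
    return sum_log_prob / length_penalty(length, alpha, K)


# Two made-up hypotheses: a short one and a longer, slightly less likely one
short = normalized_score(sum_log_prob=-4.0, length=5)
longer = normalized_score(sum_log_prob=-5.0, length=12)
print(short, longer)
print('longer hypothesis preferred:', longer > short)
```

Without normalization the short hypothesis would win because its raw log-probability is higher. With `alpha=0.6` and `K=5` the longer one is ranked first, which is how the penalty counteracts the bias of beam search toward short outputs.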


Then we calculate the `loss` as well as the `bleu` score on the `newstest2016` segment of the WMT 2016 English-German test dataset. This may take a while.

In [11]:
import time
import utils

eval_start_time = time.time()
wmt_test_loss_function = nlp.loss.MaskedSoftmaxCELoss()
wmt_test_loss_function.hybridize()
wmt_detokenizer = nlp.data.SacreMosesDetokenizer()
wmt_test_loss, wmt_test_translation_out = nmt.utils.evaluate(
    wmt_transformer_model,
    wmt_test_data_loader,
    wmt_test_loss_function,
    wmt_translator,
    wmt_tgt_vocab,
    wmt_detokenizer,
    ctx)

wmt_test_bleu_score, _, _, _, _ = nmt.bleu.compute_bleu(
    [wmt_test_tgt_sentences],
    wmt_test_translation_out,
    tokenized=False,
    tokenizer=hparams.bleu,
    split_compound_word=False,
    bpe=False)

print('WMT16 EN-DE SOTA model test loss: %.2f; test bleu score: %.2f; time cost %.2fs' %(wmt_test_loss, wmt_test_bleu_score * 100, (time.time() - eval_start_time)))

WMT16 EN-DE SOTA model test loss: 1.37; test bleu score: 24.29; time cost 14.72s


In [12]:
print('Sample translations:')
num_pairs = 1

for i in range(num_pairs):
    print('EN:')
    print(wmt_test_text[i][0])
    print('DE-Candidate:')
    print(wmt_test_translation_out[i])
    print('DE-Reference:')
    print(wmt_test_tgt_sentences[i])
    print('========')

Sample translations:
EN:
Obama receives Netanyahu
DE-Candidate:
Obama erhält Netanjahu
DE-Reference:
Obama empfängt Netanyahu
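
BLEU is built on clipped n-gram precision, and the sample above shows why a reasonable-looking candidate can still score low. The sketch below computes only that precision for one sentence pair; it is not the full metric computed by `nmt.bleu.compute_bleu` (there is no brevity penalty and no averaging over n-gram orders), and `clipped_ngram_precision` is a hypothetical helper name.

```python
from collections import Counter


def clipped_ngram_precision(candidate, reference, n=1):
    """Share of candidate n-grams that also occur in the reference,
    counting each reference n-gram at most as often as it appears there."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


candidate = "Obama erhält Netanjahu".split()
reference = "Obama empfängt Netanyahu".split()
print(clipped_ngram_precision(candidate, reference, n=1))  # 1 of 3 unigrams match
print(clipped_ngram_precision(candidate, reference, n=2))  # no bigram matches: 0.0
```

Only `Obama` matches exactly: the different verb and the different spelling of the name both count as misses, even though the candidate is close to the reference.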


### Translation Inference

Here we show an actual translation example (EN-DE): given a sentence in the source language, the pretrained Transformer model produces its translation.

In [13]:
import utils

print('Translate the following English sentence into German:')

sample_src_seq = 'We love language.'
print('[\'' + sample_src_seq + '\']')
sample_tgt_seq = nmt.utils.translate(wmt_translator, sample_src_seq,
                                     wmt_src_vocab, wmt_tgt_vocab,
                                     wmt_detokenizer, ctx)
print('The German translation is:')
print(sample_tgt_seq)

Translate the following English sentence into German:
['We love language.']
The German translation is:
['Wir sind erfreut darüber, dass']
