# Chapter 6: Machine Translation with the Transformer

Humans master sequence transduction, transferring a representation to another object. We can easily imagine a mental representation of a sequence. If somebody says *The flowers in my garden are beautiful*, we can easily visualize a garden with flowers in it. We see images of the garden, although we might never have seen that garden. We might even imagine chirping birds and the scent of flowers.

A machine must learn transduction from scratch with numerical representations. Recurrent and convolutional approaches have produced interesting results, but the Transformer surpassed their BLEU scores on standard translation benchmarks. Translating requires the representation of a sequence in language A to be transposed into language B.

The Transformer’s self-attention mechanism improves how a machine represents a sequence. A sequence in language A is adequately represented before the model attempts to translate it into language B, and this richer representation is what allows the model to obtain better BLEU scores.

The seminal Attention Is All You Need Transformer obtained the best results for English-German and English-French translations in 2017. Since then, the scores have been improved by other transformers.

## Defining machine translation

*Vaswani et al. (2017)* tackled one of the most difficult NLP problems when designing the Transformer. The human baseline for machine translation seems out of reach for those of us who design machine intelligence. This did not stop *Vaswani et al. (2017)* from publishing the Transformer’s architecture and achieving state-of-the-art BLEU results.

Machine translation is the process of reproducing human translation by machine transductions and outputs:

![Figure 6.1: Machine translation process](machine_translation_process.png)

The general idea in Figure 6.1 is for the machine to do the following in a few steps:

* Choose a sentence to translate
* Learn how words relate to each other with hundreds of millions of parameters
* Learn the many ways in which words refer to each other
* Use machine transduction to transfer the learned parameters to new sequences
* Choose a candidate translation for a word or sequence

The process always starts with a sentence to translate from a source language, ***A***. The process ends with an output containing a translated sentence in language ***B***. The intermediate calculations involve transductions.

## **Human transductions and translations**

A human interpreter at the European Parliament, for instance, will not translate a sentence word by word. **Word-by-word translations often make no sense because they lack the proper grammatical structure and cannot produce the right translation because the context of each word is ignored**.

**Human transduction takes a sentence in language A and builds a cognitive *representation* of the sentence’s meaning**. An interpreter (oral translations) or a translator (written translations) at the European Parliament will only then **transform that transduction into an interpretation of that sentence in language B**.

We will name the translation done by the interpreter or translator in language B a reference sentence.

A human translator will not translate sentence A into sentence B several times but only once in real life. However, more than one translator could translate sentence A in real life. For example, you can find several French to English translations of Les Essais by Montaigne. If you take one sentence, A, out of the original French version, you will thus find several versions of sentence B noted as references $1$ to $n$.

If you go to the European Parliament one day, you might notice that the interpreters only translate for a limited time of two hours, for example. Then another interpreter takes over. No two interpreters have the same style, just like writers have different styles. Sentence A in the source language might be repeated by the same person several times in a day but be translated into several reference sentence B versions:
$$
\text{reference} = \{\text{reference 1}, \text{reference 2}, \dots, \text{reference } n\}
$$

Machines have to find a way to think the same way as human translators.

## **Machine transductions and translations**

The transduction process of the original Transformer architecture uses the encoder stack, the decoder stack, and all the model’s parameters to represent a *reference sequence*. We will refer to that output sequence as the *reference*.

Why not just say "output prediction"? The problem is that **there is no single output prediction**. The Transformer, like humans, will produce a result we can refer to, but that can change if we train it differently or use different transformer models!

We immediately realize that matching the human baseline, building a representation of a language sequence the way a human does, is quite a challenge. However, much progress has been made.

Evaluating machine translation shows how far NLP has progressed. To determine that one solution is better than another, each NLP challenger, lab, or organization must refer to the same datasets for the comparison to be valid.

Let’s now explore a WMT dataset.

## **Preprocessing a WMT dataset**

*Vaswani et al. (2017)* present the Transformer’s achievements on the WMT 2014 English-to-German translation task and the WMT 2014 English-to-French translation task. The Transformer achieves a state-of-the-art BLEU score. BLEU will be described in the *Evaluating machine translation with BLEU* section of this chapter.

The 2014 WMT contained several European language datasets. One of the datasets contained data taken from version 7 of the Europarl corpus. We will be using the [French-English dataset from the European Parliament Proceedings Parallel Corpus, 1996-2011](https://www.statmt.org/europarl/v7/fr-en.tgz).

In [1]:
!curl -L https://www.statmt.org/europarl/v7/fr-en.tgz --output "data/fr-en.tgz"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  193M  100  193M    0     0  2164k      0  0:01:31  0:01:31 --:--:-- 2255k


In [5]:
!tar -xvzf "data/fr-en.tgz" --directory "data/"

europarl-v7.fr-en.en
europarl-v7.fr-en.fr
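
Because the two extracted files are parallel, line *i* of the English file is meant to be the translation of line *i* of the French file. A minimal sketch for reading two such files in lockstep follows. `iter_parallel` is a hypothetical helper, not part of this chapter’s programs, and the demo writes two tiny stand-in files instead of opening the full corpus:

```python
import itertools
import os
import tempfile

def iter_parallel(src_path, tgt_path):
    """Yield (source_line, target_line) pairs from two line-aligned files."""
    with open(src_path, encoding='utf-8') as src, open(tgt_path, encoding='utf-8') as tgt:
        for src_line, tgt_line in itertools.zip_longest(src, tgt):
            # zip_longest pads the shorter file with None, which exposes misalignment
            if src_line is None or tgt_line is None:
                raise ValueError('files do not have the same number of lines')
            yield src_line.rstrip('\n'), tgt_line.rstrip('\n')

# Tiny stand-in files (assumption: the real inputs would be the two europarl files)
tmp_dir = tempfile.mkdtemp()
fr_path = os.path.join(tmp_dir, 'sample.fr')
en_path = os.path.join(tmp_dir, 'sample.en')
with open(fr_path, 'w', encoding='utf-8') as f:
    f.write('Reprise de la session\nMerci beaucoup\n')
with open(en_path, 'w', encoding='utf-8') as f:
    f.write('Resumption of the session\nThank you very much\n')

pairs = list(iter_parallel(fr_path, en_path))
print(pairs)
```

Reading line by line this way avoids loading both files fully into memory, which can matter for a corpus of this size.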


## **Preprocessing the raw data**

In this section, we will preprocess `europarl-v7.fr-en.en` and `europarl-v7.fr-en.fr`.

Open `read.py`, which is in this chapter’s GitHub directory. Ensure that the two europarl files are in the same directory as `read.py`.

The program begins using standard Python functions and `pickle` to dump the serialized output files:

In [40]:
import pickle
from pickle import dump

In [41]:
# load doc into memory
def load_doc(filename):
    # open the file as read only; the with block closes it automatically
    with open(filename, mode='rt', encoding='utf-8') as file:
        # read all text
        return file.read()

# split a loaded document into sentences
def to_sentences(doc):
    return doc.strip().split('\n')

# shortest and longest sentence lengths
def sentence_lengths(sentences):
    lengths = [len(s.split()) for s in sentences]
    return min(lengths), max(lengths)
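
A toy document, made up for illustration, shows how the two helpers behave. A blank line inside the text becomes an empty sentence, which is one way a minimum length of 0 can arise:

```python
def to_sentences(doc):
    # one sentence per line, ignoring whitespace around the whole document
    return doc.strip().split('\n')

def sentence_lengths(sentences):
    # length = number of whitespace-separated tokens
    lengths = [len(s.split()) for s in sentences]
    return min(lengths), max(lengths)

toy_doc = 'Resumption of the session\n\nThank you\n'
toy_sentences = to_sentences(toy_doc)
print(toy_sentences)                    # ['Resumption of the session', '', 'Thank you']
print(sentence_lengths(toy_sentences))  # (0, 4)
```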

The imported sentence lines must be cleaned to avoid training on useless and noisy tokens. The lines are normalized, tokenized on white spaces, and converted to lowercase. The punctuation is removed from each token, non-printable characters are removed, and tokens containing numbers are excluded. The cleaned line is stored as a string.

The program runs the cleaning function and returns clean appended strings:

In [42]:
# clean lines
import re
import string
import unicodedata
def clean_lines(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        # normalize unicode characters
        line = unicodedata.normalize('NFD', line).encode('ascii', 'ignore')
        line = line.decode('UTF-8')
        # tokenize on white space
        line = line.split()
        # convert to lower case
        line = [word.lower() for word in line]
        # remove punctuation from each token
        line = [word.translate(table) for word in line]
        # remove non-printable chars from each token
        line = [re_print.sub('', w) for w in line]
        # remove tokens with numbers in them
        line = [word for word in line if word.isalpha()]
        # store as string
        cleaned.append(' '.join(line))
    return cleaned
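
To see what each cleaning step does, we can run the same steps on a single made-up line. The `clean_line` helper below is a hypothetical one-line variant written for this illustration; it is not part of `read.py`:

```python
import re
import string
import unicodedata

def clean_line(line):
    """Apply the cleaning steps described above to one line (illustrative helper)."""
    # strip accents: decompose characters, then drop anything that is not ASCII
    line = unicodedata.normalize('NFD', line).encode('ascii', 'ignore').decode('utf-8')
    table = str.maketrans('', '', string.punctuation)
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # tokenize on white space, lowercase, and remove punctuation
    tokens = [w.lower().translate(table) for w in line.split()]
    # remove non-printable characters
    tokens = [re_print.sub('', w) for w in tokens]
    # keep purely alphabetic tokens; numbers and emptied tokens are dropped
    return ' '.join(w for w in tokens if w.isalpha())

sample = 'Reprise de la session du 17 décembre 1999 !'
cleaned_sample = clean_line(sample)
print(cleaned_sample)  # reprise de la session du decembre
```

The date tokens and the stray punctuation mark disappear, and the accent is stripped. Note that apostrophes are removed as punctuation, so a token such as `l'année` becomes `lannee`.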

In [43]:
# load English data
filename = './data/europarl-v7.fr-en.en'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('English data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))
cleanf=clean_lines(sentences)
# The dataset is now clean, and pickle dumps it into a serialized file named English.pkl
filename = 'English.pkl'
outfile = open(filename, 'wb')
pickle.dump(cleanf, outfile)
outfile.close()
print(filename," saved")

English data: sentences=2007723, min=0, max=668
English.pkl  saved


We now repeat the same process with the French data and dump it into a serialized file named `French.pkl`:

In [49]:
# load French data
filename = './data/europarl-v7.fr-en.fr'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('French data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))
cleanf=clean_lines(sentences)
filename = 'French.pkl'
outfile = open(filename,'wb')
pickle.dump(cleanf,outfile)
outfile.close()
print(filename," saved")

French data: sentences=2007723, min=0, max=693
French.pkl  saved
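
Both files report a minimum length of 0, so some lines are empty, and cleaning can empty more of them (a line made only of numbers, for instance). One possible follow-up, which is an assumption of this sketch and not something `read.py` does, is to drop a pair whenever either side is empty so that the two lists stay aligned. `drop_empty_pairs` is a hypothetical helper:

```python
def drop_empty_pairs(src_lines, tgt_lines):
    """Remove index i from both lists when either cleaned line is empty."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError('the two lists must be line-aligned')
    kept = [(s, t) for s, t in zip(src_lines, tgt_lines) if s.strip() and t.strip()]
    src_kept = [s for s, _ in kept]
    tgt_kept = [t for _, t in kept]
    return src_kept, tgt_kept

# made-up cleaned lines; the second and third pairs each have an empty side
french = ['reprise de la session', '', 'merci']
english = ['resumption of the session', 'thank you', '']
french_kept, english_kept = drop_empty_pairs(french, english)
print(french_kept, english_kept)
```

Filtering the two lists independently would shift the line numbers and break the sentence alignment, which is why the pairs are filtered together here.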


## **Finalizing the preprocessing of the datasets**

Now, open `read_clean.py` in the same directory as `read.py`. Our process now defines the function that will load the datasets that were cleaned up in the previous section and then save them once the preprocessing is finalized:

In [50]:
from pickle import load
from pickle import dump
from collections import Counter

# load a clean dataset
def load_clean_sentences(filename):
    # the with block closes the file once the data is loaded
    with open(filename, 'rb') as file:
        return load(file)

# save a list of clean sentences to file
def save_clean_sentences(sentences, filename):
    with open(filename, 'wb') as file:
        dump(sentences, file)
    print('Saved: %s' % filename)

We now define a function that will create a vocabulary counter. It is important to know how many times a word is used in the sequences we will parse. For example, if a word is only used once in a dataset containing two million lines, we will waste our energy using precious GPU resources to learn it. Let’s define the counter:

In [51]:
# create a frequency table for all words
def to_vocab(lines):
    vocab = Counter()
    for line in lines:
        tokens  = line.split()
        vocab.update(tokens)
    return vocab

# remove all words with a frequency below a threshold
def trim_vocab(vocab, min_occurrence):
    tokens = [k for k,c in vocab.items() if c >= min_occurrence]
    return set(tokens)

In this case, `min_occurrence=5`, and the words that occur fewer than five times are removed to avoid wasting the training model’s time analyzing them.
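
A toy corpus makes the threshold behavior concrete. In this sketch (the sentences are made up), `cat` occurs exactly five times and is kept, while `dog` occurs four times and is dropped:

```python
from collections import Counter

def to_vocab(lines):
    # frequency table of whitespace-separated tokens
    vocab = Counter()
    for line in lines:
        vocab.update(line.split())
    return vocab

def trim_vocab(vocab, min_occurrence):
    # keep tokens seen at least min_occurrence times
    return {k for k, c in vocab.items() if c >= min_occurrence}

toy_lines = ['the cat'] * 5 + ['the dog'] * 4
toy_vocab = to_vocab(toy_lines)
kept = trim_vocab(toy_vocab, 5)
print(toy_vocab['the'], toy_vocab['cat'], toy_vocab['dog'])  # 9 5 4
print(sorted(kept))                                          # ['cat', 'the']
```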

We now have to deal with **Out-Of-Vocabulary (OOV)** words. **OOV** words **can be misspelled words, abbreviations, or any word that does not fit standard vocabulary representations**. We could use automatic spelling correction, but it would not solve all of the problems. For this example, we will simply replace **OOV** words with the `unk` (unknown) token:

In [52]:
# mark all OOV with "unk" for all lines
def update_dataset(lines, vocab):
    new_lines = list()
    for line in lines:
        new_tokens = list()
        for token in line.split():
            if token in vocab:
                # keep tokens that are in the trimmed vocabulary
                new_tokens.append(token)
            else:
                # replace out-of-vocabulary tokens
                new_tokens.append('unk')
        new_line = ' '.join(new_tokens)
        new_lines.append(new_line)
    return new_lines
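
The intended effect of this step, keeping in-vocabulary tokens and replacing everything else with `unk`, can be checked on a toy example. `mark_oov` is a hypothetical compact equivalent written for this illustration, and the vocabulary and sentences are made up:

```python
def mark_oov(lines, vocab, unk_token='unk'):
    """Keep in-vocabulary tokens and replace every other token with unk_token."""
    return [' '.join(tok if tok in vocab else unk_token for tok in line.split())
            for line in lines]

toy_vocab = {'the', 'cat', 'likes', 'milk'}
toy_lines = ['the cat likes milk', 'the cat likes kefir']
marked = mark_oov(toy_lines, toy_vocab)
print(marked)  # ['the cat likes milk', 'the cat likes unk']
```

If every token of every line came back as `unk`, that would signal that the in-vocabulary branch is not returning the original token.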

We will now run the functions for the English dataset, save the output, and then display 20 lines:

In [53]:
# load English dataset
filename = 'English.pkl'
lines = load_clean_sentences(filename)
# calculate vocabulary
vocab = to_vocab(lines)
print('English Vocabulary: %d' % len(vocab))
# reduce vocabulary
vocab = trim_vocab(vocab, 5)
print('New English Vocabulary: %d' % len(vocab))
# mark out of vocabulary words
lines = update_dataset(lines, vocab)
# save updated dataset
filename = 'english_vocab.pkl'
save_clean_sentences(lines, filename)
# spot check
for i in range(20):
    print("line",i,":",lines[i])

English Vocabulary: 105357
New English Vocabulary: 41746
Saved: english_vocab.pkl

Let’s now run the functions for the French dataset, save the output, and then display 20 lines:

In [54]:
# load French dataset
filename = 'French.pkl'
lines = load_clean_sentences(filename)
# calculate vocabulary
vocab = to_vocab(lines)
print('French Vocabulary: %d' % len(vocab))
# reduce vocabulary
vocab = trim_vocab(vocab, 5)
print('New French Vocabulary: %d' % len(vocab))
# mark out of vocabulary words
lines = update_dataset(lines, vocab)
# save updated dataset
filename = 'french_vocab.pkl'
save_clean_sentences(lines, filename)
# spot check
for i in range(20):
    print("line",i,":",lines[i])

French Vocabulary: 141642
New French Vocabulary: 58800
Saved: french_vocab.pkl

This section shows how raw data must be processed before training. The datasets are now ready to be plugged into a transformer to be trained.

Each line of the French dataset is the sentence to translate. Each line of the English dataset is the reference for a machine translation model. The machine translation model must produce an *English candidate translation* that matches the *reference*.

BLEU provides a method to evaluate `candidate` translations produced by machine translation models.

## **Evaluating machine translation with BLEU**

*Papineni et al. (2002)* came up with an efficient way to evaluate a machine translation. The human baseline was difficult to define. However, they realized that we could obtain efficient results if we compared a machine translation with one or more human reference translations, word for word and word sequence for word sequence.

*Papineni et al. (2002)* named their method the **Bilingual Evaluation Understudy Score (BLEU)**.

In this section, we will use the **Natural Language Toolkit (NLTK)** to implement [BLEU](http://www.nltk.org/api/nltk.translate.html#nltk.translate.bleu_score.sentence_bleu).

We will begin with geometric evaluations.

## **Geometric evaluations**

The BLEU method **compares the parts of a candidate sentence to a reference sentence or several reference sentences**.

In [55]:
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction

With these NLTK imports, the program simulates a comparison between a candidate translation produced by the machine translation model and the actual translation references in the dataset. Remember that a sentence could have been repeated several times and translated by different translators in different ways, making it challenging to find efficient evaluation strategies.

The program can evaluate one or more references:

In [56]:
#Example 1
reference = [['the', 'cat', 'likes', 'milk'], ['cat', 'likes', 'milk']]
candidate = ['the', 'cat', 'likes', 'milk']
score = sentence_bleu(reference, candidate)
print('Example 1', score)
#Example 2
reference = [['the', 'cat', 'likes', 'milk']]
candidate = ['the', 'cat', 'likes', 'milk']
score = sentence_bleu(reference, candidate)
print('Example 2', score)

Example 1 1.0
Example 2 1.0
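
When several references are available, BLEU as described by *Papineni et al. (2002)* credits a candidate word at most as many times as it appears in the single reference where it is most frequent. The sketch below is a simplified, unigram-only illustration of that clipping idea; `modified_unigram_precision` is a hypothetical helper and not NLTK’s implementation:

```python
from collections import Counter

def modified_unigram_precision(references, candidate):
    """Clipped unigram precision of a candidate against one or more references."""
    cand_counts = Counter(candidate)
    # for each word, the largest count observed in any single reference
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    # a candidate word is credited at most max_ref_counts[word] times
    clipped = sum(min(count, max_ref_counts[word]) for word, count in cand_counts.items())
    return clipped / len(candidate)

references = [['the', 'cat', 'likes', 'milk'], ['cat', 'likes', 'milk']]
print(modified_unigram_precision(references, ['the', 'cat', 'likes', 'milk']))  # 1.0
print(modified_unigram_precision(references, ['the', 'the', 'the', 'the']))     # 0.25
```

Without clipping, the second candidate would score a perfect unigram precision simply by repeating a word that appears in a reference.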


A straightforward evaluation $P$ of a candidate $C$ against a reference $R$ can be represented as a geometric function, the product of the n-gram precisions $p_n$ for n-gram orders $1$ to $N$:

$$
P(N, C, R) = \prod_{n=1}^{N} p_n
$$

This geometric approach is rigid if you are looking for a $3$-gram overlap, for example:

In [57]:
#Example 3
reference = [['the', 'cat', 'likes', 'milk']]
candidate = ['the', 'cat', 'enjoys','milk']
score = sentence_bleu(reference, candidate)
print('Example 3', score)

Example 3 1.0547686614863434e-154


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
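
The near-zero score of Example 3 can be checked by hand. The sketch below counts n-gram overlaps for orders 1 to 4 against the single reference and multiplies the resulting precisions. `ngrams` and `ngram_precision` are hypothetical helpers written for this illustration, and the code leaves out the brevity penalty and the weighting that NLTK applies:

```python
from collections import Counter

def ngrams(tokens, n):
    # all contiguous n-token windows of the sequence
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(reference, candidate, n):
    """Clipped n-gram precision against a single reference; returns (matches, total)."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matches = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matches, sum(cand.values())

reference = ['the', 'cat', 'likes', 'milk']
candidate = ['the', 'cat', 'enjoys', 'milk']

product = 1.0
counts = []
for n in range(1, 5):
    matches, total = ngram_precision(reference, candidate, n)
    counts.append((matches, total))
    product *= matches / total
    print('%d-gram precision: %d/%d' % (n, matches, total))
print('product of precisions:', product)
```

One substituted word leaves three of four unigrams and one of three bigrams intact, but no 3-gram or 4-gram survives, so the product is zero. This is the situation the warnings describe.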
