Machine Translation with the Transformer [Denis Rothman book]
--------------------------

### Preprocessing a WMT dataset

We downloaded the French-English dataset from the <i style = "color: violet">European Parliament Proceedings Parallel Corpus 1996-2011</i>. The link is [dataset_french_english](https://www.statmt.org/europarl/v7/fr-en.tgz).

We extracted two files from the dataset:

- europarl-v7.fr-en.en
- europarl-v7.fr-en.fr


#### Preprocessing the raw data

We will preprocess europarl-v7.fr-en.en and europarl-v7.fr-en.fr.



In [1]:
import pickle
from pickle import dump


We define the function to load the file into memory:

In [9]:
def load_doc(filename):
    # open the file as read-only text
    with open(filename, mode="rt", encoding="utf-8") as f:
    
        # read all text
        text = f.read()
    
    return text

The loaded document is then split into sentences:

In [3]:
def to_sentences(doc):
    
    return doc.strip().split('\n')

The shortest and the longest lengths are retrieved:

In [4]:
def sentence_lengths(sentences):
    
    lengths = [len(s.split()) for s in sentences]
    
    return min(lengths), max(lengths)

The imported lines now need to be cleaned so that the model does not train on useless, noisy tokens. Each line is Unicode-normalized, tokenized on white space, and converted to lower case. Punctuation and non-printable characters are removed from each token, and tokens that are not purely alphabetic (for example, those containing digits) are dropped. Each cleaned line is stored as a string, and the function returns the list of cleaned strings:

In [5]:
# clean lines
import re
import string
import unicodedata
def clean_lines(lines):
    
    cleaned = list()
    
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    
    # prepare translation table for removing punctuation
    table  = str.maketrans('', '', string.punctuation)
    
    for line in lines:
        
        # normalize unicode characters
        line = unicodedata.normalize('NFD', line).encode('ascii', 'ignore')
        line = line.decode('UTF-8')
        
        # tokenize on white space
        line = line.split()
        
        # convert to lower case
        line = [word.lower() for word in line]
        
        # remove punctuation from each token
        line = [word.translate(table) for word in line]
        
        # remove non-printable chars from each token
        line = [re_print.sub('', w) for w in line]
        
        # remove tokens with numbers in them
        line = [word for word in line if word.isalpha()]
        
        # store as string
        cleaned.append(' '.join(line))
    
    return cleaned
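
To see what these steps do to a single sentence, here is a minimal sketch that applies the same sequence of operations to one line. The helper name `clean_line` and the sample sentence are illustrative assumptions, not part of the original notebook:

```python
import re
import string
import unicodedata

def clean_line(line):
    """Clean one sentence with the same steps as clean_lines (illustrative helper)."""
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    table = str.maketrans('', '', string.punctuation)
    # decompose accented characters, then drop everything outside ASCII
    line = unicodedata.normalize('NFD', line).encode('ascii', 'ignore').decode('ascii')
    # lower-case each token and strip punctuation from it
    tokens = [word.lower().translate(table) for word in line.split()]
    # strip non-printable characters
    tokens = [re_print.sub('', word) for word in tokens]
    # keep purely alphabetic tokens only
    return ' '.join(word for word in tokens if word.isalpha())

# \u00e9 is an accented e, \u0153 is the "oe" ligature
sample = "Le grand bogue de l'an 2000 ne s'est pas produit\u00e9, mes v\u0153ux !"
print(clean_line(sample))
```

Accented letters lose their accents, but characters without an ASCII decomposition (such as the ligature in "vœux") are dropped entirely, and removing apostrophes fuses elided words ("l'an" becomes "lan"). Both effects are visible in the French spot check later in this section.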

The English data is loaded and cleaned first.

In [10]:
# load English data
filename = 'data/europarl-v7.fr-en.en'

doc = load_doc(filename)

sentences = to_sentences(doc)

minlen, maxlen = sentence_lengths(sentences)

print('English data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))

cleanf = clean_lines(sentences)

English data: sentences=2007723, min=0, max=668


The dataset is now clean, and pickle dumps it into a serialized file named English.pkl:

In [15]:
filename = 'data/English.pkl'
with open(filename, 'wb') as outfile:
    pickle.dump(cleanf, outfile)
print(filename, " saved")

data/English.pkl  saved


We now repeat the same process with the French data and dump it into a serialized file named French.pkl:

In [16]:
# load French data
filename = 'data/europarl-v7.fr-en.fr'

doc = load_doc(filename)

sentences = to_sentences(doc)

minlen, maxlen = sentence_lengths(sentences)

print('French data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))

cleanf = clean_lines(sentences)

filename = 'data/French.pkl'

with open(filename, 'wb') as outfile:
    pickle.dump(cleanf, outfile)

print(filename, " saved")


French data: sentences=2007723, min=0, max=693
data/French.pkl  saved


The main preprocessing is done, but we still need to make sure the datasets do not contain noisy or confusing tokens.

#### Finalizing the preprocessing of the datasets

Let us now define a function that will load the datasets that were cleaned up in the previous section and then save them once the preprocessing is finalized:

In [17]:
from pickle import load
from pickle import dump
from collections import Counter

# load a clean dataset
def load_clean_sentences(filename):
    
    with open(filename, 'rb') as f:
        return load(f)

# save a list of clean sentences to file
def save_clean_sentences(sentences, filename):
    
    with open(filename, 'wb') as f:
        dump(sentences, f)
    
    print('Saved: %s' % filename)

We now define a function that will create a vocabulary counter. It is important to know how many times a word is used in the sequences we will parse. For example, if a word is only used once in a dataset containing two million lines, we will waste our energy if we use precious GPU resources to learn it. Let's define the counter:

In [18]:
# create a frequency table for all words
def to_vocab(lines):
    vocab = Counter()
    for line in lines:
        tokens = line.split()
        vocab.update(tokens)
    return vocab

The next function uses the counter to keep only the words whose frequency is at least min_occurance:

In [19]:
# remove all words with a frequency below a threshold
def trim_vocab(vocab, min_occurance):
    tokens = [k for k, c in vocab.items() if c >= min_occurance]
    return set(tokens)

In this case, min_occurance = 5: words that occur fewer than 5 times are removed (a word that occurs exactly 5 times is kept), so the model does not waste training time on them.
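
The boundary behavior of the threshold is easy to check on a toy frequency table. This is a minimal sketch; the counts below are invented for illustration:

```python
from collections import Counter

def trim_vocab(vocab, min_occurance):
    # same rule as above: keep words seen at least min_occurance times
    return {word for word, count in vocab.items() if count >= min_occurance}

# hypothetical counts, not taken from the Europarl data
counts = Counter({'parliament': 7, 'session': 5, 'partsession': 4, 'presid': 1})
kept = trim_vocab(counts, 5)
print(sorted(kept))
```

Here 'session', with exactly 5 occurrences, survives, while 'partsession', with 4, does not.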

We now have to deal with **Out-Of-Vocabulary (OOV)** words. OOV words can be misspelled words, abbreviations, or any word that does not fit standard vocabulary representations. We could use automatic spelling correction, but it would not solve all of the problems. For this example, we will simply replace OOV words with the unk (unknown) token:

In [20]:
# mark all OOV with "unk" for all lines
def update_dataset(lines, vocab):
    
    new_lines = list()
    
    for line in lines:
        
        new_tokens = list()
        
        for token in line.split():
            
            if token in vocab:
                
                new_tokens.append(token)
            
            else:
                
                new_tokens.append('unk')
        
        new_line = ' '.join(new_tokens)
        
        new_lines.append(new_line)
    
    return new_lines
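
A quick way to judge a threshold is to look at how many tokens end up as unk. The sketch below runs the counting and replacement steps on a three-line toy corpus; the corpus, the threshold of 2, and the `unk_rate` helper are assumptions made for illustration:

```python
from collections import Counter

def to_vocab(lines):
    # frequency table over all white-space tokens
    vocab = Counter()
    for line in lines:
        vocab.update(line.split())
    return vocab

def update_dataset(lines, vocab):
    # replace every token that is not in the kept vocabulary with 'unk'
    return [' '.join(tok if tok in vocab else 'unk' for tok in line.split())
            for line in lines]

def unk_rate(lines):
    # hypothetical helper: share of tokens that were replaced by 'unk'
    tokens = [tok for line in lines for tok in line.split()]
    return sum(tok == 'unk' for tok in tokens) / len(tokens)

toy_lines = ['the house rose', 'the house fell', 'the session resumed']
counts = to_vocab(toy_lines)
kept = {word for word, count in counts.items() if count >= 2}
updated = update_dataset(toy_lines, kept)
print(updated)
print(unk_rate(updated))
```

On a corpus this small almost half of the tokens become unk; on the full dataset the rate depends on the chosen threshold and is worth checking before training.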

We will now run the functions for the English dataset, then save the output and display 20 lines:

In [21]:
# load English dataset
filename = 'data/English.pkl'

lines = load_clean_sentences(filename)

# calculate vocabulary
vocab = to_vocab(lines)

print('English Vocabulary: %d' % len(vocab))

# reduce vocabulary
vocab = trim_vocab(vocab, 5)

print('New English Vocabulary: %d' % len(vocab))

# mark out of vocabulary words
lines = update_dataset(lines, vocab)

# save updated dataset
filename = 'data/english_vocab.pkl'

save_clean_sentences(lines, filename)

# spot check
for i in range(20):
    
    print("line", i, ":", lines[i])

English Vocabulary: 105357
New English Vocabulary: 41746
Saved: data/english_vocab.pkl
line 0 : resumption of the session
line 1 : i declare resumed the session of the european parliament adjourned on friday december and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period
line 2 : although as you will have seen the dreaded millennium bug failed to materialise still the people in a number of countries suffered a series of natural disasters that truly were dreadful
line 3 : you have requested a debate on this subject in the course of the next few days during this partsession
line 4 : in the meantime i should like to observe a minute s silence as a number of members have requested on behalf of all the victims concerned particularly those of the terrible storms in the various countries of the european union
line 5 : please rise then for this minute s silence
line 6 : the house rose and observed a minute s silence
line 7 : madam presid

Let us now run the functions for the French dataset, then save the output and display 20 lines:

In [22]:
# load French dataset
filename = 'data/French.pkl'

lines = load_clean_sentences(filename)

# calculate vocabulary
vocab = to_vocab(lines)

print('French Vocabulary: %d' % len(vocab))

# reduce vocabulary
vocab = trim_vocab(vocab, 5)

print('New French Vocabulary: %d' % len(vocab))

# mark out of vocabulary words
lines = update_dataset(lines, vocab)

# save updated dataset
filename = 'data/french_vocab.pkl'

save_clean_sentences(lines, filename)

# spot check
for i in range(20):
    
    print("line", i, ":", lines[i])

French Vocabulary: 141642
New French Vocabulary: 58800
Saved: data/french_vocab.pkl
line 0 : reprise de la session
line 1 : je declare reprise la session du parlement europeen qui avait ete interrompue le vendredi decembre dernier et je vous renouvelle tous mes vux en esperant que vous avez passe de bonnes vacances
line 2 : comme vous avez pu le constater le grand bogue de lan ne sest pas produit en revanche les citoyens dun certain nombre de nos pays ont ete victimes de catastrophes naturelles qui ont vraiment ete terribles
line 3 : vous avez souhaite un debat a ce sujet dans les prochains jours au cours de cette periode de session
line 4 : en attendant je souhaiterais comme un certain nombre de collegues me lont demande que nous observions une minute de silence pour toutes les victimes des tempetes notamment dans les differents pays de lunion europeenne qui ont ete touches
line 5 : je vous invite a vous lever pour cette minute de silence
line 6 : le parlement debout observe une minut

### Evaluating machine translation with BLEU

Let us begin with the Geometric evaluations.

#### Geometric evaluations

The BLEU method compares the parts of a candidate sentence to a reference sentence or several reference sentences.

In [23]:
# import the nltk module
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction

The program then simulates a comparison between a candidate translation produced by the 
machine translation model and the actual reference translation(s) in the dataset. 
Bear in mind that a sentence could have been repeated several times and translated 
by different translators in different ways, making it challenging to find efficient 
evaluation strategies.

The program can evaluate one or more references:

In [24]:
#Example 1
reference = [['the', 'cat', 'likes', 'milk'], ['cat', 'likes', 'milk']]
candidate = ['the', 'cat', 'likes', 'milk']
score = sentence_bleu(reference, candidate) 
print('Example 1', score)
#Example 2
reference = [['the', 'cat', 'likes', 'milk']] 
candidate = ['the', 'cat', 'likes', 'milk'] 
score = sentence_bleu(reference, candidate) 
print('Example 2', score)

Example 1 1.0
Example 2 1.0


The output of both examples is 1, because in each case the candidate matches a reference exactly.

A straightforward evaluation P of the candidate (C) against the reference (R) 
can be represented as a geometric function, the product of the n-gram precisions $p_n$ of C for n-gram orders 1 to N:
$$
P(N, C, R) = \prod_{n = 1}^N p_n
$$

This geometric approach is rigid if you are looking for 3-gram overlap, for example:

In [25]:
#Example 3
reference = [['the', 'cat', 'likes', 'milk']] 
candidate = ['the', 'cat', 'enjoys','milk'] 
score = sentence_bleu(reference, candidate) 
print('Example 3', score)

Example 3 1.0547686614863434e-154


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


A human can see that the candidate is a near-perfect translation and deserves a score close to 1, yet the reported score is effectively 0. The hyperparameters can be changed, but the approach remains rigid.
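
The rigidity can be made concrete by computing the n-gram precisions of Example 3 by hand. The following is a minimal pure-Python sketch of clipped n-gram precision against a single reference; it is written for illustration and is not NLTK's implementation:

```python
from collections import Counter
from fractions import Fraction

def ngrams(tokens, n):
    # all contiguous n-grams of the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # each candidate n-gram counts at most as often as it occurs in the reference
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return Fraction(matched, total) if total else Fraction(0)

reference = ['the', 'cat', 'likes', 'milk']
candidate = ['the', 'cat', 'enjoys', 'milk']

precisions = [modified_precision(reference, candidate, n) for n in range(1, 5)]
product = Fraction(1)
for p in precisions:
    product *= p

print(precisions)
print(product)
```

Three of four unigrams and one of three bigrams match, but replacing a single word in the middle of a four-token sentence breaks every 3-gram and 4-gram, so the product collapses to zero regardless of the lower-order matches.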

### Applying a smoothing technique

Smoothing is a very efficient method. BLEU smoothing can be traced back to label smoothing, applied to softmax outputs in the Transformer.

#### Chencherry smoothing

Let's first evaluate a French-English example without smoothing:

In [26]:
#Example 4
reference = [['je','vous','invite', 'a', 'vous', 'lever','pour', 
'cette', 'minute', 'de', 'silence']]
candidate = ['levez','vous','svp','pour', 'cette', 'minute', 'de', 
'silence']
score = sentence_bleu(reference, candidate) 
print("without smoothing score", score)

without smoothing score 0.37188004246466494


Now, let us add some open-minded smoothing to the evaluation:

In [27]:
chencherry = SmoothingFunction()

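# Note: list() on a string yields single characters, so this comparison
# is made on character n-grams; use .split() to compare word tokens.
r1 = list('je vous invite a vous lever pour cette minute de silence')
candidate = list('levez vous svp pour cette minute de silence')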

print("with smoothing score", sentence_bleu([r1], candidate, smoothing_function=chencherry.method1))

with smoothing score 0.6194291765462159
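
The effect of smoothing is clearest on a sentence pair where some n-gram order has no match at all, such as Example 3. Below is a minimal pure-Python sketch of the idea: when an order has zero matches, the zero count is replaced by a small epsilon so that the geometric mean does not collapse. The `bleu` function, its single-reference restriction, and the value epsilon = 0.1 are assumptions made for illustration; this is not NLTK's code:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    # counts of all contiguous n-grams in the token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, candidate, max_n=4, epsilon=None):
    """Single-reference BLEU sketch; epsilon (if given) replaces zero match counts."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0:
            return 0.0  # candidate is shorter than n tokens
        if matched == 0:
            if epsilon is None:
                return 0.0  # unsmoothed: one empty order zeroes the score
            matched = epsilon
        log_sum += math.log(matched / total) / max_n
    # brevity penalty: penalize candidates shorter than the reference
    if len(candidate) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * math.exp(log_sum)

reference = ['the', 'cat', 'likes', 'milk']
candidate = ['the', 'cat', 'enjoys', 'milk']

print("unsmoothed:", bleu(reference, candidate))
print("smoothed:  ", bleu(reference, candidate, epsilon=0.1))
```

The unsmoothed score is exactly zero, while the smoothed score is small but positive, so the partial unigram and bigram overlap is no longer thrown away. Smoothing of this kind only changes the result when at least one n-gram order has no match.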


### Translations with Trax

#### Installing Trax

In [None]:
import os
import numpy as np

try:
    import trax
except ImportError:
    !pip install -U trax
    import trax

#### Creating a Transformer model

The following Trax call builds a Transformer whose configuration matches the pretrained model, in a few lines of code:

In [None]:
model = trax.models.Transformer(
    input_vocab_size = 33300,
    d_model = 512, d_ff = 2048,
    n_heads = 8, n_encoder_layers = 6, n_decoder_layers = 6,
    max_len = 2048, mode = 'predict'
)

#### Initializing the model using pretrained weights

Let us give life to the model by initializing the weights:

In [None]:
model.init_from_file('gs://trax-ml/models/translation/ende_wmt32k.pkl.gz', 
                     weights_only=True)

#### Tokenizing a sentence

In [None]:
sentence = 'I am only a machine but I have machine intelligence.'
tokenized = list(trax.data.tokenize(iter([sentence]), # Operates on streams. 
                                   vocab_dir='gs://trax-ml/vocabs/', 
                                   vocab_file='ende_32k.subword'))[0]

#### Decoding from the Transformer

The Transformer encodes the sentence in English and will decode it in German. The 
model and its weights constitute its set of abilities.
Trax has made the decoding function intuitive to use:

In [None]:
tokenized = tokenized[None, :] # Add batch dimension.
tokenized_translation = trax.supervised.decoding.autoregressive_sample( 
    model, tokenized, temperature=0.0)
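
With temperature=0.0, the sampler is understood here to pick the highest-scoring token at every step, that is, greedy decoding; this reading of the parameter is an assumption and should be checked against the Trax documentation. The loop below is a minimal sketch of greedy autoregressive decoding in plain Python. The scoring function, the five-token vocabulary, and the EOS id of 1 are all hypothetical stand-ins for a trained model:

```python
def toy_next_token_scores(prefix):
    """Hypothetical stand-in for a decoder: one score per token id.

    Toy vocabulary: 0 = padding, 1 = EOS, 2 to 4 = ordinary tokens.
    The scores depend only on how many tokens were generated so far.
    """
    scores_by_step = [
        [0.0, 0.1, 2.0, 0.3, 0.1],  # step 0: token 2 scores highest
        [0.0, 0.2, 0.1, 0.4, 1.5],  # step 1: token 4 scores highest
        [0.0, 3.0, 0.2, 0.1, 0.1],  # step 2 and later: EOS scores highest
    ]
    return scores_by_step[min(len(prefix), len(scores_by_step) - 1)]

def greedy_decode(next_token_scores, eos_id=1, max_len=10):
    """Generate tokens one at a time, always taking the arg-max score."""
    output = []
    while len(output) < max_len:
        scores = next_token_scores(output)
        token = max(range(len(scores)), key=scores.__getitem__)
        output.append(token)
        if token == eos_id:
            break  # stop once the end-of-sequence token is produced
    return output

decoded = greedy_decode(toy_next_token_scores)
print(decoded)       # generated ids, ending with EOS
print(decoded[:-1])  # the same ids with the trailing EOS removed
```

Each new token is chosen from scores conditioned on everything generated so far, and the trailing EOS token is stripped before the ids are turned back into text.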

#### De-tokenizing and displaying the translation

Google Brain has produced a mainstream, disruptive, and intuitive implementation 
of the Transformer with Trax.
The program now de-tokenizes and displays the translation in a few lines:

In [None]:
tokenized_translation = tokenized_translation[0][:-1]  # Remove batch and EOS.
translation = trax.data.detokenize(tokenized_translation, 
                                   vocab_dir='gs://trax-ml/vocabs/', 
                                   vocab_file='ende_32k.subword') 
print("The sentence:",sentence)
print("The translation:",translation)