# Neural Machine Translation

In this notebook I implement a neural machine translation (NMT) system with TensorFlow.
We will use the code from the files in the repository.

In [None]:
from collections import Counter

import NMT_Model
import nmt_data_utils
import nmt_model_utils

## The data

The dataset we will be working with consists of English and German sentences from the European Parliament. It contains about 1.9 million sentence pairs, but we will only use a small fraction of them.

In [2]:
# load the English texts
with open('/Users/thomas/Jupyter_Notebooks/Play/Neural_Machine_Translation/de-en/europarl-v7.de-en.en',
          'r',
          encoding = 'utf-8') as f:
    en = f.readlines()
    

In [3]:
# load the German texts
with open('/Users/thomas/Jupyter_Notebooks/Play/Neural_Machine_Translation/de-en/europarl-v7.de-en.de',
          'r',
          encoding = 'utf-8') as f:
    de = f.readlines()

In [4]:
len(en), len(de)

(1920209, 1920209)

In [5]:
# first 5 sentence pairs. 
for line in zip(en[:5], de[:5]):
    print(line, '\n')

('Resumption of the session\n', 'Wiederaufnahme der Sitzungsperiode\n') 

('I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.\n', 'Ich erkläre die am Freitag, dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen, wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe, daß Sie schöne Ferien hatten.\n') 

("Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.\n", 'Wie Sie feststellen konnten, ist der gefürchtete "Millenium-Bug " nicht eingetreten. Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden.\n') 

('You have requested a debate on this subject in the course of the next few days, during this p

In [6]:
# remove unnecessary new lines. 
de = [line.strip() for line in de]
en = [line.strip() for line in en]

In [7]:
# we will only use sentences of similar length (21 to 49 characters) in order to make training easier.
len_en = [len(sent) for sent in en if 20 < len(sent) < 50]
len_dist = Counter(len_en).most_common()
len_dist

[(47, 7266),
 (49, 7113),
 (45, 6928),
 (35, 6833),
 (48, 6813),
 (21, 6642),
 (44, 6519),
 (46, 6491),
 (43, 6443),
 (40, 6130),
 (42, 6108),
 (37, 5824),
 (41, 5793),
 (34, 5711),
 (39, 5682),
 (29, 5659),
 (38, 5599),
 (33, 5496),
 (36, 5452),
 (31, 4651),
 (32, 4554),
 (30, 4441),
 (27, 4117),
 (28, 4062),
 (26, 3989),
 (25, 3911),
 (24, 3762),
 (23, 3473),
 (22, 2776)]

In [8]:
# 158238 sentences are between 20 and 50 characters long (len() of a string counts characters, not words).
len(len_en)

158238

In [9]:
_de = []
_en = []
for sent_de, sent_en in zip(de, en):
    if 20 < len(sent_en) < 50:
        _de.append(sent_de)
        _en.append(sent_en)
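
The filter above compares character counts, because `len()` on a string counts characters. If you would rather bound sentences by their number of words, something like the following sketch could be used instead. The helper name `filter_pairs_by_word_count`, its default bounds, and the whitespace tokenization are assumptions made for illustration; they are not part of the repository code.

```python
def filter_pairs_by_word_count(src_sents, tgt_sents, min_words=3, max_words=12):
    """Keep aligned pairs whose source sentence has min_words..max_words tokens."""
    kept_src, kept_tgt = [], []
    for src, tgt in zip(src_sents, tgt_sents):
        n_words = len(src.split())  # naive whitespace tokenization (assumption)
        if min_words <= n_words <= max_words:
            kept_src.append(src)
            kept_tgt.append(tgt)
    return kept_src, kept_tgt


# toy data, only to show the call
toy_en = ["Resumption of the session", "Yes", "Madam President, on a point of order."]
toy_de = ["Wiederaufnahme der Sitzungsperiode", "Ja", "Frau Präsidentin, zur Geschäftsordnung."]
short_en, short_de = filter_pairs_by_word_count(toy_en, toy_de)
print(short_en)
```

Filtering both lists inside one loop keeps the sentence pairs aligned, which is the same reason the cell above appends to `_de` and `_en` together.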

In [10]:
%%time

# but we will not use all ~158,000 sentences, only the first 5000 to begin with.
en_preprocessed, en_most_common = nmt_data_utils.preprocess(_en[:5000])
de_preprocessed, de_most_common = nmt_data_utils.preprocess(_de[:5000], language = 'german')


CPU times: user 1.32 s, sys: 17.2 ms, total: 1.34 s
Wall time: 1.36 s


In [11]:
len(en_preprocessed), len(de_preprocessed)

(5000, 5000)

In [12]:
# for some of the sentences there is no German or English counterpart, i.e. only an empty list []
# therefore we will remove those sentence pairs.
en_preprocessed_clean, de_preprocessed_clean = [], []

for sent_en, sent_de in zip(en_preprocessed, de_preprocessed):
    # keep a pair only if both sides are non-empty
    if sent_en != [] and sent_de != []:
        en_preprocessed_clean.append(sent_en)
        de_preprocessed_clean.append(sent_de)

In [13]:
len(en_preprocessed_clean), len(de_preprocessed_clean)


(4988, 4988)

In [14]:
# print the first 1000 preprocessed sentence pairs
for e, d in zip(en_preprocessed_clean[:1000], de_preprocessed_clean[:1000]):
    print('English:\n', e)
    print('German:\n', d, '\n'*3)

English:
 ['resumption', 'of', 'the', 'session']
German:
 ['wiederaufnahme', 'der', 'sitzungsperiode'] 



English:
 ['please', 'rise', ',', 'then', ',', 'for', 'this', 'minute', "'", 's', 'silence', '.']
German:
 ['ich', 'bitte', 'sie', ',', 'sich', 'zu', 'einer', 'schweigeminute', 'zu', 'erheben', '.'] 



English:
 ['(', 'the', 'house', 'rose', 'and', 'observed', 'a', 'minute', "'", 's', 'silence', ')']
German:
 ['(', 'das', 'parlament', 'erhebt', 'sich', 'zu', 'einer', 'schweigeminute', '.', ')'] 



English:
 ['madam', 'president', ',', 'on', 'a', 'point', 'of', 'order', '.']
German:
 ['frau', 'präsidentin', ',', 'zur', 'geschäftsordnung', '.'] 



English:
 ['madam', 'president', ',', 'on', 'a', 'point', 'of', 'order', '.']
German:
 ['frau', 'präsidentin', ',', 'zur', 'geschäftsordnung', '.'] 



English:
 ['thank', 'you', ',', 'mr', 'segni', ',', 'i', 'shall', 'do', 'so', 'gladly', '.']
German:
 ['vielen', 'dank', ',', 'herr', 'segni', ',', 'das', 'will', 'ich', 'gerne', 'tun', 

 ['fragestunde', '(', 'kommission', ')'] 



English:
 ['the', 'next', 'item', 'is', 'question', 'time', '(', 'b5-0003/2000', ')', '.']
German:
 ['nach', 'der', 'tagesordnung', 'folgt', 'die', 'fragestunde', '.'] 



English:
 ['we', 'will', 'examine', 'questions', 'to', 'the', 'commission', '.']
German:
 ['wir', 'behandeln', 'die', 'anfragen', 'an', 'die', 'kommission', '.'] 



English:
 ['mr', 'purvis', 'has', 'the', 'floor', 'for', 'a', 'procedural', 'motion', '.']
German:
 ['das', 'wort', 'hat', 'herr', 'purvis', 'zur', 'geschäftsordnung', '.'] 



English:
 ['question', 'no', '28', 'by', '(', 'h-0781/99', ')', ':']
German:
 ['anfrage', 'nr.', '28', 'von', '(', 'h-0781/99', ')', ':'] 



English:
 ['thank', 'you', 'for', 'your', 'reply', '.']
German:
 ['vielen', 'dank', 'für', 'die', 'antwort', '.'] 



English:
 ['so', 'much', 'for', 'the', 'political', 'aspect', '.']
German:
 ['soviel', 'zur', 'politischen', 'seite', 'des', 'problems', '.'] 



English:
 ['now', 'to', 'the', 'te

English:
 ['it', 'is', 'all', 'about', 'creating', 'a', 'socially', 'just', 'europe', '.']
German:
 ['es', 'geht', 'um', 'ein', 'sozialgerechtes', 'europa', '.'] 



English:
 ['there', 'is', 'clearly', 'a', 'more', 'ambitious', 'goal', '.']
German:
 ['selbstverständlich', 'geht', 'es', 'um', 'ein', 'viel', 'ehrgeizigeres', 'ziel', '.'] 



English:
 ['it', 'is', ',', 'of', 'course', ',', 'thoroughly', 'orwellian', '.']
German:
 ['das', 'erinnert', 'stark', 'an', 'orwell', '.'] 



English:
 ['(', 'the', 'president', 'cut', 'the', 'speaker', 'off', ')']
German:
 ['(', 'der', 'präsident', 'unterbricht', 'den', 'redner', ')', '.'] 



English:
 ['there', 'is', 'more', 'to', 'be', 'said', 'on', 'that', 'score', '.']
German:
 ['darüber', 'wird', 'ja', 'auch', 'noch', 'weiter', 'zu', 'sprechen', 'sein', '.'] 



English:
 ['this', 'is', 'the', 'crux', 'of', 'the', 'matter', '.']
German:
 ['hier', 'muß', 'der', 'entscheidende', 'punkt', 'gesetzt', 'werden', '.'] 



English:
 ['of', 'all', '

 ['the', 'same', 'applies', 'to', 'european', 'air', 'traffic', 'control', '.']
German:
 ['das', 'betrifft', 'genauso', 'den', 'bereich', 'der', 'europäischen', 'flugsicherung', '.'] 



English:
 ['you', 'still', 'have', 'not', 'finished', 'that', 'agenda', '.']
German:
 ['hier', 'haben', 'sie', 'ihre', 'tagesordnung', 'noch', 'abzuarbeiten', '.'] 



English:
 ['i', 'do', 'not', 'necessarily', 'mean', 'financial', 'resources', '.']
German:
 ['damit', 'meine', 'ich', 'nicht', 'notwendigerweise', 'nur', 'die', 'finanziellen', 'mittel', '.'] 



English:
 ['there', 'must', 'also', 'be', 'savings', '.']
German:
 ['es', 'muß', 'auch', 'gespart', 'werden', '.'] 



English:
 ['it', 'is', 'not', 'easy', 'to', 'combine', 'those', 'aims', '.']
German:
 ['die', 'verknüpfung', 'beider', 'ziele', 'ist', 'nicht', 'einfach', '.'] 



English:
 ['there', 'is', 'no', 'obvious', 'way', '.']
German:
 ['sie', 'ergibt', 'sich', 'nicht', 'von', 'selbst', '.'] 



English:
 ['that', 'foundation', 'is', 'c

In [15]:
en_most_common[:100], len(en_most_common), len(de_most_common)

([('.', 3981),
  ('the', 1864),
  ('is', 1371),
  ('this', 860),
  (',', 842),
  ('to', 822),
  ('we', 736),
  ('i', 677),
  ('that', 619),
  ('a', 611),
  ('of', 592),
  ('it', 486),
  ('not', 474),
  (')', 451),
  ('(', 450),
  ('be', 384),
  ('are', 370),
  ('in', 367),
  ('you', 356),
  ('?', 339),
  ('will', 324),
  ('have', 281),
  ('for', 249),
  ('no', 245),
  ('on', 235),
  ('mr', 228),
  ('thank', 227),
  ('and', 222),
  ('at', 221),
  ('what', 200),
  ('question', 199),
  ('there', 191),
  ('vote', 185),
  ('debate', 185),
  ('can', 179),
  ('do', 172),
  ('very', 168),
  (':', 166),
  ('must', 163),
  ('has', 158),
  ('by', 158),
  ('closed', 157),
  ('take', 148),
  ('parliament', 147),
  ('report', 147),
  ('president', 136),
  ('place', 129),
  ('commissioner', 129),
  ('was', 126),
  ('so', 125),
  ('should', 120),
  ('would', 119),
  ('much', 116),
  ('with', 115),
  ('all', 108),
  ('but', 107),
  ('an', 99),
  ('about', 98),
  ('now', 96),
  ('they', 95),
  ('our', 9

### Create Vocab

In [16]:
# now we can create our lookup dicts for English and German, i.e. our vocab.
# we also include special tokens, which are used later on in the model.
specials = ["<unk>", "<s>", "</s>", '<pad>']

en_word2ind, en_ind2word, en_vocab_size = nmt_data_utils.create_vocab(en_most_common, specials)
de_word2ind, de_ind2word, de_vocab_size = nmt_data_utils.create_vocab(de_most_common, specials)

In [17]:
en_vocab_size, de_vocab_size

(4139, 5414)
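
`nmt_data_utils.create_vocab` lives in the repository and is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch: it gives the special tokens the first indices and then numbers the remaining words in order of frequency. The function name `build_vocab` and the exact index order are assumptions, not the repository's implementation.

```python
def build_vocab(most_common, specials):
    """Map tokens to integer ids: special tokens first, then words by frequency."""
    tokens = list(specials) + [word for word, _count in most_common
                               if word not in specials]
    word2ind = {word: ind for ind, word in enumerate(tokens)}
    ind2word = {ind: word for word, ind in word2ind.items()}
    return word2ind, ind2word, len(word2ind)


# toy counts in the same (word, count) format as en_most_common
toy_counts = [(".", 3981), ("the", 1864), ("is", 1371)]
toy_specials = ["<unk>", "<s>", "</s>", "<pad>"]
toy_word2ind, toy_ind2word, toy_vocab_size = build_vocab(toy_counts, toy_specials)
print(toy_word2ind["the"], toy_ind2word[0], toy_vocab_size)
```

Keeping both directions of the mapping is what lets us turn sentences into indices for the network and turn predicted indices back into words afterwards.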

### Convert to indices


In [18]:
# in order to feed the sentences to the network, we have to convert them to ints that correspond to their indices
# in the lookup dicts.
# we reverse the source-language sentences, i.e. the English sentences, as this makes learning easier for the seq2seq
# model. Apart from this we also add the EndOfSentence and StartOfSentence tags that the model needs.
en_inds, en_unknowns = nmt_data_utils.convert_to_inds(en_preprocessed_clean, en_word2ind, reverse = True, eos = True)
de_inds, de_unknowns = nmt_data_utils.convert_to_inds(de_preprocessed_clean, de_word2ind, sos = True, eos = True)

In [19]:
[nmt_data_utils.convert_to_words(sentence, en_ind2word) for sentence in  en_inds[:2]]

[['session', 'the', 'of', 'resumption', '</s>'],
 ['.',
  'silence',
  's',
  "'",
  'minute',
  'this',
  'for',
  ',',
  'then',
  ',',
  'rise',
  'please',
  '</s>']]

In [20]:
[nmt_data_utils.convert_to_words(sentence, de_ind2word) for sentence in  de_inds[:2]]

[['<s>', 'wiederaufnahme', 'der', 'sitzungsperiode', '</s>'],
 ['<s>',
  'ich',
  'bitte',
  'sie',
  ',',
  'sich',
  'zu',
  'einer',
  'schweigeminute',
  'zu',
  'erheben',
  '.',
  '</s>']]
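
The outputs above show the pattern: the source sentence is reversed and then closed with `</s>`, while the target sentence keeps its order and is wrapped in `<s>` and `</s>`. A minimal sketch of that conversion for a single sentence could look as follows. The function `sentence_to_inds` is a hypothetical stand-in; the real `nmt_data_utils.convert_to_inds` works on a list of sentences and also returns the unknown words, which this sketch leaves out.

```python
def sentence_to_inds(tokens, word2ind, reverse=False, sos=False, eos=False):
    """Convert one tokenized sentence to a list of vocabulary indices."""
    unk = word2ind["<unk>"]
    inds = [word2ind.get(tok, unk) for tok in tokens]  # unseen words map to <unk>
    if reverse:
        inds = inds[::-1]  # reverse the words only, before any tags are added
    if sos:
        inds = [word2ind["<s>"]] + inds
    if eos:
        inds = inds + [word2ind["</s>"]]
    return inds


# toy vocabulary, only for illustration
toy_vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "<pad>": 3,
             "resumption": 4, "of": 5, "the": 6, "session": 7}
sentence = ["resumption", "of", "the", "session"]
source_inds = sentence_to_inds(sentence, toy_vocab, reverse=True, eos=True)
target_inds = sentence_to_inds(sentence + ["today"], toy_vocab, sos=True, eos=True)
print(source_inds)  # [7, 6, 5, 4, 2]
print(target_inds)  # [1, 4, 5, 6, 7, 0, 2]
```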

## Train the model

Now we are ready to train the model. 

In [21]:
# hyperparams.
# these are probably not perfect, but work fine for now.
num_layers_encoder = 4
num_layers_decoder = 4
rnn_size_encoder = 128
rnn_size_decoder = 128
embedding_dim = 300

batch_size = 64
epochs = 500
clip = 5
keep_probability = 0.8
learning_rate = 0.01
learning_rate_decay_steps = 1000
learning_rate_decay = 0.9
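
To get a feeling for the two decay settings, here is a small sketch of exponential learning-rate decay. It uses the same formula as TensorFlow 1.x's `tf.train.exponential_decay`; whether `NMT_Model` applies the decay in exactly this way is an assumption, and the function below is only an illustration.

```python
def exponential_decay(initial_lr, step, decay_steps, decay_rate, staircase=False):
    """Return initial_lr * decay_rate ** (step / decay_steps)."""
    # with staircase=True the rate only drops once every decay_steps steps
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_rate ** exponent


# learning rate after a few training steps with the values chosen above
for step in (0, 1000, 5000):
    print(step, exponential_decay(0.01, step, 1000, 0.9))
```

Under this assumption, a decay of 0.9 every 1000 steps shrinks the learning rate by 10% per 1000 updates.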


In [None]:
# create the graph and train the model. 
nmt_model_utils.reset_graph()

nmt = NMT_Model.NMT(en_word2ind,
                    en_ind2word,
                    de_word2ind,
                    de_ind2word,
                    './models/local_one/my_model',
                    'TRAIN',
                    embedding_dim = embedding_dim,
                    num_layers_encoder = num_layers_encoder,
                    num_layers_decoder = num_layers_decoder,
                    batch_size = batch_size,
                    clip = clip,
                    keep_probability = keep_probability,
                    learning_rate = learning_rate,
                    epochs = epochs,
                    rnn_size_encoder = rnn_size_encoder,
                    rnn_size_decoder = rnn_size_decoder, 
                    learning_rate_decay_steps = learning_rate_decay_steps,
                    learning_rate_decay = learning_rate_decay)
  
nmt.build_graph()
nmt.train(en_inds, de_inds)
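
The `clip` hyperparameter is passed to the model without further explanation. A common reading is gradient clipping by global norm, which rescales all gradients together whenever their combined norm exceeds the threshold. The pure-Python sketch below illustrates that idea on plain lists; that `NMT_Model` clips by global norm is an assumption, and `clip_by_global_norm` here is a hypothetical helper, not the TensorFlow function of the same name.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # scale is 1.0 when the gradients are already small enough
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm


toy_grads = [[3.0, 4.0], [12.0]]  # joint norm is 13
clipped, norm_before = clip_by_global_norm(toy_grads, clip_norm=5)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, norm_after)
```

Because every gradient is scaled by the same factor, the direction of the update is preserved and only its size is limited, which helps against exploding gradients in deep RNNs.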


## Test the Model

We can now use the trained model to translate English sentences to German. For now we will only translate sentences the model was originally trained on.

**Note:** the network was only trained on 5000 sentences for 50 epochs. 

In [22]:
_de_inds, _de_unknowns = nmt_data_utils.convert_to_inds(de_preprocessed_clean, de_word2ind, sos = True,  eos = True)

In [24]:
# the inference model does not necessarily need input batches. we can just give it the whole input
# data, but the batch size has to be specified as the length of the input data.
nmt_model_utils.reset_graph()

nmt = NMT_Model.NMT(en_word2ind,
                    en_ind2word,
                    de_word2ind,
                    de_ind2word,
                    './models/local_one/my_model',
                    'INFER',
                    num_layers_encoder = num_layers_encoder,
                    num_layers_decoder = num_layers_decoder,
                    batch_size = len(en_inds[:50]),
                    keep_probability = 1.0,
                    learning_rate = 0.0,
                    beam_width = 0,
                    rnn_size_encoder = rnn_size_encoder,
                    rnn_size_decoder = rnn_size_decoder)

nmt.build_graph()
preds = nmt.infer(en_inds[:50], restore_path =  './models/local_one/my_model', targets = _de_inds[:50])


Instructions for updating:
Use the retry module or similar alternatives.
Graph built.
Restore graph from  ./models/local_one/my_model
INFO:tensorflow:Restoring parameters from ./models/local_one/my_model


In [26]:
# show some of the created translations
# Note: the way the BLEU score is computed here is probably not ideal
nmt_model_utils.sample_results(preds, en_ind2word, de_ind2word, en_word2ind, de_word2ind, _de_inds[:50], en_inds[:50])




 ----------------------------------------------------------------------------------------------------
Actual Text:
resumption of the session

Actual translation:
wiederaufnahme der sitzungsperiode

Created translation:
wiederaufnahme der sitzungsperiode

Bleu-score: 1.0



 ----------------------------------------------------------------------------------------------------
Actual Text:
please rise , then , for this minute ' s silence .

Actual translation:
ich bitte sie , sich zu einer schweigeminute zu erheben .

Created translation:
ich bitte ich forderung forderung vor vor schweigeminute erheben erheben

Bleu-score: 0.41545589177443254



 ----------------------------------------------------------------------------------------------------
Actual Text:
( the house rose and observed a minute ' s silence )

Actual translation:
( das parlament erhebt sich zu einer schweigeminute . )

Created translation:
( das parlament erhebt der einer einer schweigeminute . )

Bleu-score: 0.5253819

Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
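
The warnings above appear because short sentences often have no matching 3-grams or 4-grams, which makes unsmoothed sentence-level BLEU unreliable. NLTK suggests its `SmoothingFunction` for this. To show the idea without depending on NLTK, here is a minimal pure-Python sketch of sentence-level BLEU with add-one smoothing on the higher-order n-gram precisions. It is a simplified stand-in, not the scoring used by `nmt_model_utils.sample_results`, and its numbers will differ from the scores printed above.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_sentence_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU with add-one smoothing on the 2- to max_n-gram precisions."""
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp = ngram_counts(hypothesis, n)
        ref = ngram_counts(reference, n)
        # clipped overlap: an n-gram counts at most as often as it occurs in the reference
        overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = sum(hyp.values())
        if n > 1:  # add-one smoothing for higher orders
            overlap, total = overlap + 1, total + 1
        if overlap == 0:  # not a single matching word
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # brevity penalty: punish hypotheses that are shorter than the reference
    if len(hypothesis) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(sum(log_precisions) / max_n)


ref_tokens = "ich bitte sie , sich zu einer schweigeminute zu erheben .".split()
hyp_tokens = "ich bitte ich forderung forderung vor vor schweigeminute erheben erheben".split()
exact_score = smoothed_sentence_bleu(ref_tokens, ref_tokens)
partial_score = smoothed_sentence_bleu(ref_tokens, hyp_tokens)
print(exact_score, partial_score)
```

With smoothing, a translation that shares words but no longer n-grams with the reference still gets a small non-zero score instead of triggering the warning.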


## Conclusion

Training the model is hard on my laptop, so I only trained it on 5000 sentences. It would be interesting to scale up the model to see how it really performs. In its current state it obviously does not generalize well and does not produce good, meaningful translations.