## What is Machine Translation?
- Machine translation is challenging given the inherent ambiguity and flexibility of human language.
- Statistical machine translation replaces classical rule-based systems with models that learn to translate from examples.
- Neural machine translation models fit a single model rather than a pipeline of fine-tuned models and currently achieve state-of-the-art results.

## What is Machine Translation?

Machine translation is the task of automatically converting source text in one language to text
in another language.

Given a sequence of text in a source language, there is no one single best translation of that text to another language. This is because of the natural ambiguity and flexibility of human language. This makes the challenge of automatic machine translation difficult, perhaps one of the most difficult in artificial intelligence.

Classical machine translation methods often involve rules for converting text in the source language to the target language. The rules are often developed by linguists and may operate at the lexical, syntactic, or semantic level. This focus on rules gives the name to this area of study: Rule-based Machine Translation, or RBMT.

# What is Statistical Machine Translation?

Statistical machine translation, or SMT for short, is the use of statistical models that learn to
translate text from a source language to a target language given a large corpus of examples.


This task of using a statistical model can be stated formally as follows:
Given a sentence T in the target language, we seek the sentence S from which the
translator produced T. We know that our chance of error is minimized by choosing
that sentence S that is most probable given T. Thus, we wish to choose S so as to
maximize Pr(S|T).
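
Because Pr(T) is the same for every candidate S, Bayes' rule lets the same choice be made by maximizing Pr(S) × Pr(T|S), a language-model term times a translation-model term. The following is a minimal sketch of that decision rule. The candidate sentences and every probability value are invented for illustration and do not come from a trained model.

```python
# Toy illustration of choosing S to maximize Pr(S|T) via Bayes' rule:
#   Pr(S|T) is proportional to Pr(S) * Pr(T|S), because Pr(T) is constant.
# All candidates and probabilities below are made up for illustration only.

def best_source(candidates):
    """Return the candidate S with the highest Pr(S) * Pr(T|S)."""
    return max(
        candidates,
        key=lambda s: candidates[s]["prior"] * candidates[s]["likelihood"],
    )

# Hypothetical candidates S for one observed sentence T.
candidates = {
    "the house is small": {"prior": 0.020, "likelihood": 0.30},
    "small is the house": {"prior": 0.001, "likelihood": 0.35},
    "the home is little": {"prior": 0.004, "likelihood": 0.10},
}

print(best_source(candidates))
```

Note that the second candidate has the highest likelihood on its own, but its low prior (it is an unlikely sentence) means it loses once both terms are combined.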

# What is Neural Machine Translation?


Neural machine translation, or NMT for short, is the use of neural network models to learn
a statistical model for machine translation. The key benefit of the approach is that a single system can be trained directly on source and target text, no longer requiring the pipeline of specialized systems used in statistical machine translation.

# Encoder-Decoder Model

Multilayer Perceptron neural network models can be used for machine translation, although the
models are limited by a fixed-length input sequence where the output must be the same length.
These early models have been greatly improved upon recently through the use of recurrent
neural networks organized into an encoder-decoder architecture that allow for variable length
input and output sequences.


An encoder neural network reads and encodes a source sentence into a fixed-length
vector. A decoder then outputs a translation from the encoded vector. The whole
encoder-decoder system, which consists of the encoder and the decoder for a language
pair, is jointly trained to maximize the probability of a correct translation given a
source sentence.

# Encoder-Decoders with Attention

Although effective, the Encoder-Decoder architecture has problems with long sequences of text to be translated. The problem stems from the fixed-length internal representation that must be used to decode each word in the output sequence. The solution is the use of an attention mechanism that allows the model to learn where to place attention on the input sequence as each word of the output sequence is decoded.


Using a fixed-sized representation to capture all the semantic details of a very long
sentence [...] is very difficult. [...] A more efficient approach, however, is to read
the whole sentence or paragraph [...], then to produce the translated words one at
a time, each time focusing on a different part of the input sentence to gather the
semantic details required to produce the next output word.

# What are Encoder-Decoder Models for Neural Machine Translation

## Encoder-Decoder Architecture for NMT

The Encoder-Decoder architecture with recurrent neural networks has become an effective
and standard approach for both neural machine translation (NMT) and sequence-to-sequence
(seq2seq) prediction in general. The key benefits of the approach are the ability to train a single
end-to-end model directly on source and target sentences and the ability to handle variable
length input and output sequences of text. As evidence of the success of the method, the
architecture is the core of the Google translation service.

## Sutskever NMT Model

In this section, we will look at the neural machine translation model developed by Ilya Sutskever,
et al. as described in their 2014 paper Sequence to Sequence Learning with Neural Networks.
We will refer to it as the Sutskever NMT Model, for lack of a better name. This is an important
paper as it was one of the first to introduce the Encoder-Decoder model for machine translation
and more generally sequence-to-sequence learning. It is an important model in the field of
machine translation as it was one of the first neural machine translation systems to outperform
a baseline statistical machine translation model on a large translation task.

## Problem

The model was applied to English to French translation, specifically the WMT 2014 translation
task. The translation task was processed one sentence at a time, and an end-of-sequence (<EOS>)
token was added to the end of output sequences during training to signify the end of the
translated sequence. This allowed the model to be capable of predicting variable length output
sequences.


    Note that we require that each sentence ends with a special end-of-sentence symbol <EOS>, which enables the model to define a distribution over sequences of all possible lengths.


    | Sequence to Sequence Learning with Neural Networks, 2014.
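
The role of the end-of-sequence token can be sketched in a few lines: it is appended to each target sentence for training, and at prediction time everything from the first end-of-sequence token onward is discarded. The helper names `add_eos` and `truncate_at_eos` are hypothetical and are not taken from the paper.

```python
EOS = "<EOS>"

def add_eos(tokens):
    """Append the end-of-sequence marker to a target sentence for training."""
    return tokens + [EOS]

def truncate_at_eos(tokens):
    """Keep decoder output up to, but not including, the first EOS token."""
    return tokens[:tokens.index(EOS)] if EOS in tokens else tokens

# Training target: the model learns to emit EOS after the last real word.
target = add_eos(["je", "suis", "ici"])

# Decoder output: anything generated after EOS is ignored.
decoded = truncate_at_eos(["je", "suis", "ici", EOS, "ici", "ici"])

print(target)
print(decoded)
```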


The model was trained on a subset of the 12 Million sentences in the dataset, comprised of
348 Million French words and 304 Million English words. This set was chosen because it was
pre-tokenized. The source vocabulary was reduced to the 160,000 most frequent source English
words and 80,000 of the most frequent target French words. All out-of-vocabulary words were
replaced with the UNK token.
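
Limiting the vocabulary in this way can be sketched as follows: count word frequencies, keep the most frequent words, and map everything else to UNK. This is a minimal sketch on a three-sentence toy corpus. The function names and the `max_size` value are illustrative assumptions, not the paper's preprocessing code.

```python
from collections import Counter

UNK = "UNK"

def build_vocab(sentences, max_size):
    """Return the set of the max_size most frequent whitespace-separated words."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return {word for word, _ in counts.most_common(max_size)}

def replace_oov(sentence, vocab):
    """Replace every word that is not in the vocabulary with the UNK token."""
    return " ".join(word if word in vocab else UNK for word in sentence.split())

corpus = ["the cat sat", "the cat ran", "the dog sat"]
vocab = build_vocab(corpus, max_size=3)   # keeps "the", "cat" and "sat"

print(replace_oov("the dog ran", vocab))  # the UNK UNK
```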

## Model


An Encoder-Decoder architecture was developed where an input sequence was read in its entirety and encoded to a fixed-length internal representation. A decoder network then used this internal
representation to output words until the end-of-sequence token was reached. LSTM networks were used for both the encoder and decoder.


The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain large fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector

![title](picture6.png)

## Model Configuration

The following provides a summary of the model configuration taken from the paper:
    
- Input sequences were reversed.
- A 1000-dimensional word embedding layer was used to represent the input words.
- Softmax was used on the output layer.
- The input and output models had 4 layers with 1,000 units per layer.
- The model was fit for 7.5 epochs where some learning rate decay was performed.
- A batch-size of 128 sequences was used during training.
- Gradient clipping was used during training to mitigate the chance of gradient explosions.
- Batches were comprised of sentences with roughly the same length to speed-up computation.

The model was fit on an 8-GPU machine where each layer was run on a different GPU.
Training took 10 days.
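
Two of the training details above, gradient clipping and batching by sentence length, can be illustrated with a short sketch. It uses plain Python lists in place of real gradient tensors. The `max_norm` threshold and the batch size shown are hypothetical values chosen for the example, not the settings used in the paper.

```python
import math

def clip_by_global_norm(gradients, max_norm):
    """Rescale the gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]

def batches_by_length(sentences, batch_size):
    """Sort sentences by length so each batch needs little padding."""
    ordered = sorted(sentences, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

# A gradient with norm 5 is scaled down to norm 1; the direction is unchanged.
print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))

# Sentences of length 3, 1, 2 and 4 are grouped as (1, 2) and (3, 4).
print(batches_by_length([[7, 8, 9], [7], [7, 8], [7, 8, 9, 1]], batch_size=2))
```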

## Cho NMT Model

In this section, we look at the neural machine translation system described by Kyunghyun Cho, et al. in their 2014 paper titled Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.

Importantly, the Cho Model is used only to score candidate translations and is not used directly for translation like the Sutskever model above, although extensions to the work to better diagnose and improve the model do use it directly and alone for translation.

### Problem

As above, the problem is the English to French translation task from the WMT 2014 workshop.
The source and target vocabulary were limited to the most frequent 15,000 French and English
words, which cover 93% of the dataset, and out-of-vocabulary words were replaced with UNK.

... called RNN Encoder-Decoder that consists of two recurrent neural networks
(RNN). One RNN encodes a sequence of symbols into a fixed-length vector
representation, and the other decodes the representation into another sequence of
symbols.

![title](picture7.png)

### Extensions
In the paper On the Properties of Neural Machine Translation: Encoder-Decoder Approaches,
Cho, et al. investigate the limitations of their model. They discover that performance degrades
quickly with the increase in the length of input sentences and with the number of words outside
of the vocabulary.

Our analysis revealed that the performance of the neural machine translation suffers
significantly from the length of sentences.

![title](picture8.png)

To address the problem of unknown words, they suggest dramatically increasing the vocabulary
of known words during training. They address the problem of sentence length in a follow-up
paper titled Neural Machine Translation by Jointly Learning to Align and Translate in which
they propose the use of an attention mechanism. Instead of encoding the input sentence to a
fixed-length vector, a fuller representation of the encoded input is kept and the model learns to
pay attention to different parts of the input for each word output by the decoder.

Each time the proposed model generates a word in a translation, it (soft-)searches
for a set of positions in a source sentence where the most relevant information is
concentrated. The model then predicts a target word based on the context vectors
associated with these source positions and all the previous generated target words.

A wealth of technical details is provided in the paper; for example:

- A similarly configured model is used, although with bidirectional layers.
- The data is prepared such that 30,000 of the most common words are kept in the vocabulary.
- The model is first trained with sentences with a length up to 20 words, then with sentences with a length up to 50 words.
- A batch size of 80 sentences is used and the model was fit for 4-6 epochs.
- A beam search was used during inference to find the most likely sequence of words for each translation.

# Literature review of:
## https://arxiv.org/abs/1703.03906

"Massive Exploration of Neural Machine Translation Architectures"

# How to Configure Encoder-Decoder Models for Machine Translation

The encoder-decoder architecture for recurrent neural networks is achieving state-of-the-art
results on standard machine translation benchmarks and is being used in the heart of industrial
translation services. The model is simple, but given the large amount of data required to train it,
tuning the myriad of design decisions in the model in order to get top performance on your problem
can be practically intractable. Thankfully, research scientists have used Google-scale hardware
to do this work for us and provide a set of heuristics for how to configure the encoder-decoder
model for neural machine translation and for sequence prediction generally.

## Encoder-Decoder Model for Neural Machine Translation

The Encoder-Decoder architecture for recurrent neural networks is displacing classical phrase-based
statistical machine translation systems for state-of-the-art results. As evidence, according to their
2016 paper Google's Neural Machine Translation System: Bridging the Gap between Human
and Machine Translation, Google now uses the approach at the core of its Google Translate
service.

# Baseline Model

- Embedding: 512-dimensions.
- RNN Cell: Gated Recurrent Unit or GRU.
- Encoder: Bidirectional.
- Encoder Depth: 2-layers (1 layer in each direction).
- Decoder Depth: 2-layers.
- Attention: Bahdanau-style.
- Optimizer: Adam.
- Dropout: 20% on input.

![title](picture9.png)

# Word Embedding Size

A word-embedding is used to represent words input to the encoder. This is a distributed
representation where each word is mapped to a fixed-sized vector of continuous values. The
benefit of this approach is that different words with similar meaning will have a similar
representation. This distributed representation is often learned while fitting the model on the
training data. The embedding size defines the length of the vectors used to represent words. It
is generally believed that a larger dimensionality will result in a more expressive representation,
and in turn, better skill. Interestingly, the results show that the largest size tested did achieve
the best results, but the benefit of increasing the size was minor overall.

### RNN Cell Type

There are generally three types of recurrent neural network cells that are commonly used:

- Simple RNN.
- Long Short-Term Memory or LSTM.
- Gated Recurrent Unit or GRU.

**The LSTM was developed to address the vanishing gradient problem of the Simple RNN**
that limited the training of deep RNNs. 

The GRU was developed in an attempt to simplify the LSTM. Results showed that both the GRU and LSTM were significantly better than the Simple
RNN, but the LSTM was generally better overall.

## Encoder-Decoder Depth


Generally, deeper networks are believed to achieve better performance than shallow networks.
The key is to find a balance between network depth, model skill, and training time. 

This is because we generally do not have infinite resources to train very deep networks if the benefit
to skill is minor. The authors explore the depth of both the encoder and decoder models and
the impact on model skill. 

When it comes to encoders, it was found that depth did not have a dramatic impact on skill and more surprisingly, a 1-layer unidirectional model performs only
slightly worse than a 4-layer unidirectional configuration. 


A two-layer bidirectional encoder performed slightly better than other configurations tested.



**Recommendation**: Use a 1-layer bidirectional encoder and extend to 2 bidirectional layers
for a small lift in skill.

A similar story was seen when it came to decoders. The skill of decoders with 1, 2,
and 4 layers differed by a small amount, with the 4-layer decoder slightly better. An
8-layer decoder did not converge under the test conditions.

On the decoder side, deeper models outperformed shallower ones by a small margin.
| Massive Exploration of Neural Machine Translation Architectures, 2017.


**Recommendation**: Use a 1-layer decoder as a starting point and use a 4-layer decoder for
better results.

## Direction of Encoder Input

The order of the sequence of source text can be provided to the encoder in a number of ways:

- Forward or as-normal.
- Reversed.
- Both forward and reversed at the same time.



The authors explored the impact of the order of the input sequence on model skill comparing
various unidirectional and bidirectional configurations. Generally, they confirmed previous
findings that a reversed sequence is better than a forward sequence and that bidirectional is
slightly better than a reversed sequence.

... bidirectional encoders generally outperform unidirectional encoders, but not by
a large margin. The encoders with reversed source consistently outperform their
non-reversed counterparts.
 $- Massive Exploration of Neural Machine Translation Architectures, 2017.$


**Recommendation**: Use a reversed order input sequence or move to bidirectional for a
small lift in model skill.
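
The difference between the three input orders can be sketched with a toy recurrent loop. Here `step` is a stand-in for a learned RNN cell (a simple running sum is used so the states are easy to read), and the function names are hypothetical. A bidirectional encoder runs the sequence in both directions and pairs the two states at each position, so every position sees context from both sides.

```python
def run_rnn(inputs, step, initial=0):
    """Apply a recurrent step function over the inputs and collect every state."""
    states, state = [], initial
    for x in inputs:
        state = step(state, x)
        states.append(state)
    return states

def bidirectional(inputs, step):
    """Pair the forward state and the backward state for each input position."""
    forward = run_rnn(inputs, step)
    backward = run_rnn(inputs[::-1], step)[::-1]
    return list(zip(forward, backward))

def running_sum(state, x):
    """Toy recurrent step standing in for a learned cell."""
    return state + x

source = [1, 2, 3]
print(run_rnn(source, running_sum))        # forward order
print(run_rnn(source[::-1], running_sum))  # reversed order
print(bidirectional(source, running_sum))  # both directions, aligned by position
```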

## Attention Mechanism

A problem with the naive Encoder-Decoder model is that the encoder maps the input to a
fixed-length internal representation from which the decoder must produce the entire output
sequence. Attention is an improvement to the model that allows the decoder to pay attention to different words in the input sequence as it outputs each word in the output sequence. The authors look at a few variations on simple attention mechanisms. The results show that having attention results in dramatically better performance than not having attention.


While we did expect the attention-based models to significantly outperform those without an attention mechanism, we were surprised by just how poorly the [no attention] models fared.

 $- Massive Exploration of Neural Machine Translation Architectures, 2017.$


The simple weighted average style attention described by Bahdanau, et al. in their 2015
paper Neural machine translation by jointly learning to align and translate was found to perform the best.

**Recommendation**: Use attention and prefer the Bahdanau-style weighted average style
attention.
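
As an illustration of weighted-average attention with an additive score, the sketch below scores each encoder state against the current decoder state, turns the scores into weights with a softmax, and returns the weighted average of the encoder states as the context vector. It is written in plain Python for readability. The weight matrices `W_s` and `W_h` and the vector `v` would be learned in a real model; here they are fixed toy values, and all names are assumptions rather than code from either paper.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Convert scores to positive weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(decoder_state, encoder_states, W_s, W_h, v):
    """Return (weights, context) using score = v . tanh(W_s s + W_h h)."""
    projected_state = matvec(W_s, decoder_state)
    scores = []
    for h in encoder_states:
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, matvec(W_h, h))]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    size = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(size)]
    return weights, context

identity = [[1.0, 0.0], [0.0, 1.0]]            # toy stand-in for learned weights
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = additive_attention([0.5, -0.5], encoder_states, identity, identity, [1.0, 1.0])
print(weights)
print(context)
```

The context vector is recomputed for every output word, which is what frees the decoder from relying on a single fixed-length summary of the input.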

## Inference

It is common in neural machine translation systems to use a beam-search to sample the
probabilities for the words in the sequence output by the model. The wider the beam width, the
more exhaustive the search, and, it is believed, the better the results. The results showed that
a modest beam-width of 3-5 performed the best, which could be improved only very slightly
through the use of length penalties. The authors generally recommend tuning the beam width
on each specific problem.

We found that a well-tuned beam search is crucial to achieving good results, and
that it leads to consistent gains of more than one BLEU point
 
 $- Massive Exploration of Neural Machine Translation Architectures, 2017.$

    
**Recommendation**: Start with a greedy search (beam=1) and tune based on your problem.
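
To make the effect of beam width concrete, here is a minimal beam search sketch over a hand-written toy model. `toy_model` is hypothetical: it returns made-up next-token probabilities for a given prefix and stands in for a trained decoder. The example is constructed so that greedy search (beam width 1) commits to the locally best first token and misses the sequence with the highest overall probability, while a beam width of 2 finds it. No length penalty is applied.

```python
import math

def beam_search(next_probs, beam_width, max_len, eos="<EOS>"):
    """Keep the beam_width highest-scoring partial sequences at each step."""
    beams = [([], 0.0)]  # (tokens, summed log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # finished sequences carry over
                continue
            for token, prob in next_probs(tokens).items():
                # assumes prob > 0; math.log would fail on a zero probability
                candidates.append((tokens + [token], score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens[-1] == eos for tokens, _ in beams):
            break
    return beams

def toy_model(prefix):
    """Hypothetical next-token distribution; all probabilities are invented."""
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    if prefix == ["a"]:
        return {"x": 0.5, "y": 0.5}
    if prefix == ["b"]:
        return {"z": 0.9, "w": 0.1}
    return {"<EOS>": 1.0}

print(beam_search(toy_model, beam_width=1, max_len=3)[0][0])  # greedy result
print(beam_search(toy_model, beam_width=2, max_len=3)[0][0])  # wider beam
```

With a width of 1 the search returns `a x <EOS>` (probability 0.30), whereas a width of 2 returns `b z <EOS>` (probability 0.36).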

## Final Model
The authors pull together their findings into a single best model and compare the results of this model to other well-performing models and state-of-the-art results. The specific configurations of this model are summarized in the table below, taken from the paper. These parameters may be taken as a good or best starting point when developing your own encoder-decoder model for an NLP application.

![title](picture10.png)

## Project: Develop a Neural Machine Translation Model

Tutorial Overview
This tutorial is divided into the following parts:
1. German to English Translation Dataset
2. Preparing the Text Data
3. Train Neural Translation Model
4. Evaluate Neural Translation Model

## Download the English-German pairs dataset.
http://www.manythings.org/anki/deu-eng.zip

# Clean Text

In [26]:
import string
import re
from pickle import dump
from unicodedata import normalize
from numpy import array

# load doc into memory
def load_doc(filename):
    # open the file as read only; the with-block closes it even if reading fails
    with open(filename, mode='rt', encoding='utf-8') as file:
        # read all text
        return file.read()


# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in lines]
    return pairs


# clean a list of lines
def clean_pairs(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [re_punc.sub('', w) for w in line]
            # remove non-printable chars from each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return array(cleaned)


# save a list of clean sentences to file
def save_clean_data(sentences, filename):
    dump(sentences, open(filename, 'wb'), protocol=4)
    print('Saved: %s' % filename)
# load dataset
filename = './deu-eng/deu.txt'
doc = load_doc(filename)
# split into english-german pairs
pairs = to_pairs(doc)
# clean sentences (stored as cleaned_pairs so the clean_pairs function is not overwritten)
cleaned_pairs = clean_pairs(pairs)
# save clean pairs to file
save_clean_data(cleaned_pairs, 'english-german.pkl')
# spot check
for i in range(100):
    print('[%s] => [%s]' % (cleaned_pairs[i,0], cleaned_pairs[i,1]))
        
        


Saved: english-german.pkl
[go] => [geh]
[hi] => [hallo]
[hi] => [gru gott]
[run] => [lauf]
[run] => [lauf]
[wow] => [potzdonner]
[wow] => [donnerwetter]
[duck] => [kopf runter]
[fire] => [feuer]
[help] => [hilfe]
[help] => [zu hulf]
[stay] => [bleib]
[stop] => [stopp]
[stop] => [anhalten]
[wait] => [warte]
[wait] => [warte]
[begin] => [fang an]
[do it] => [mache es]
[do it] => [tue es]
[go on] => [mach weiter]
[hello] => [hallo]
[hello] => [sers]
[hurry] => [beeil dich]
[hurry] => [schnell]
[i hid] => [ich versteckte mich]
[i hid] => [ich habe mich versteckt]
[i ran] => [ich rannte]
[i see] => [ich verstehe]
[i see] => [aha]
[i try] => [ich versuche es]
[i try] => [ich probiere es]
[i won] => [ich hab gewonnen]
[i won] => [ich habe gewonnen]
[i won] => [ich habe gewonnen]
[oh no] => [oh nein]
[relax] => [entspann dich]
[shoot] => [feuer]
[shoot] => [schie]
[smile] => [lacheln]
[sorry] => [entschuldigung]
[ask me] => [frag mich]
[ask me] => [fragt mich]
[ask me] => [fragen sie mich]
[at
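
Some German entries in the spot check look clipped, for example `gru gott` and `schie`. This appears to be a side effect of the normalization step in the cleaning code: NFD decomposition turns `ü` into `u` plus a combining mark that is then dropped, but `ß` has no decomposition and is removed entirely by the ASCII encoding. The sketch below reproduces the effect on a sample phrase. `to_ascii_keep_eszett` is a hypothetical variant, not part of the tutorial, that transliterates `ß` to `ss` first.

```python
from unicodedata import normalize

def to_ascii(text):
    """Same normalization as in the cleaning code: decompose, then drop non-ASCII."""
    return normalize('NFD', text).encode('ascii', 'ignore').decode('UTF-8')

def to_ascii_keep_eszett(text):
    """Hypothetical variant: transliterate eszett before dropping non-ASCII."""
    return to_ascii(text.replace('\u00df', 'ss'))

sample = 'Gr\u00fc\u00df Gott'           # "Grüß Gott"
print(to_ascii(sample))                  # Gru Gott
print(to_ascii_keep_eszett(sample))      # Gruss Gott
```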

# SPLIT TEXT

In [27]:
from pickle import load
from pickle import dump
from numpy.random import shuffle
# load a clean dataset
def load_clean_sentences(filename):
    return load(open(filename, 'rb'))


# save a list of clean sentences to file
def save_clean_data(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)
     
    
# load dataset
raw_dataset = load_clean_sentences('english-german.pkl')
# reduce dataset size
n_sentences = 10000
dataset = raw_dataset[:n_sentences, :]
# random shuffle
shuffle(dataset)
# split into train/test
train, test = dataset[:9000], dataset[9000:]

# save
save_clean_data(dataset, 'english-german-both.pkl')
save_clean_data(train, 'english-german-train.pkl')
save_clean_data(test, 'english-german-test.pkl')

Saved: english-german-both.pkl
Saved: english-german-train.pkl
Saved: english-german-test.pkl


# Train Neural Translation Model

In [31]:
from pickle import load
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.callbacks import ModelCheckpoint

# load a clean dataset
def load_clean_sentences(filename):
    return load(open(filename, 'rb'))


# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# max sentence length
def max_length(lines):
    return max(len(line.split()) for line in lines)

# encode and pad sequences
def encode_sequences(tokenizer, length, lines):
    # integer encode sequences
    X = tokenizer.texts_to_sequences(lines)
    # pad sequences with 0 values
    X = pad_sequences(X, maxlen=length, padding='post')
    
    return X


# one hot encode target sequence
def encode_output(sequences, vocab_size):
    ylist = list()
    for sequence in sequences:
        encoded = to_categorical(sequence, num_classes=vocab_size)
        ylist.append(encoded)
    y = array(ylist)
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return y


# define NMT model
def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    model = Sequential()
    model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
    model.add(LSTM(n_units))
    model.add(RepeatVector(tar_timesteps))
    model.add(LSTM(n_units, return_sequences=True))
    model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
    # compile model
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    # summarize defined model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model


dataset = load_clean_sentences('english-german-both.pkl')
train = load_clean_sentences('english-german-train.pkl')
test = load_clean_sentences('english-german-test.pkl')

# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])
print('English Vocabulary Size: %d' % eng_vocab_size)
print('English Max Length: %d' % (eng_length))

# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])
print('German Vocabulary Size: %d' % ger_vocab_size)
print('German Max Length: %d' % (ger_length))

# prepare training data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
trainY = encode_sequences(eng_tokenizer, eng_length, train[:, 0])
trainY = encode_output(trainY, eng_vocab_size)

# prepare validation data
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])
testY = encode_sequences(eng_tokenizer, eng_length, test[:, 0])
testY = encode_output(testY, eng_vocab_size)

# define model
model = define_model(ger_vocab_size, eng_vocab_size, ger_length, eng_length, 256)

# fit model
checkpoint = ModelCheckpoint('model.h5', monitor='val_loss', verbose=1,
save_best_only=True, mode='min')

model.fit(trainX, trainY, epochs=30, batch_size=64, validation_data=(testX, testY),
callbacks=[checkpoint], verbose=2)

English Vocabulary Size: 2185
English Max Length: 5
German Vocabulary Size: 3529
German Max Length: 9
Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 9, 256)            903424    
_________________________________________________________________
lstm_1 (LSTM)                (None, 256)               525312    
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 5, 256)            0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 5, 256)            525312    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 5, 2185)           561545    
Total params: 2,515,593
Trainable params: 2,515,593
Non-trainable params: 0
________________________

<keras.callbacks.History at 0x7f9fa478af10>

![title](picture11.png)
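
The parameter counts in the model summary can be checked by hand from the vocabulary sizes and the 256 units printed above. The sketch below uses the usual counting formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights, and a bias), and a dense layer. The helper names are illustrative and are not Keras functions.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    """A weight matrix plus one bias per output."""
    return input_dim * output_dim + output_dim

counts = [
    embedding_params(3529, 256),  # German vocabulary x embedding size
    lstm_params(256, 256),        # encoder LSTM
    0,                            # RepeatVector has no weights
    lstm_params(256, 256),        # decoder LSTM
    dense_params(256, 2185),      # softmax over the English vocabulary
]
print(counts)       # [903424, 525312, 0, 525312, 561545]
print(sum(counts))  # 2515593
```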

In [48]:
# Evaluate a model

from pickle import load
from numpy import argmax
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from nltk.translate.bleu_score import corpus_bleu

def load_clean_sentences(filename):
    return load(open(filename, 'rb'))


# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer


# max sentence length
def max_length(lines):
    return max(len(line.split()) for line in lines)
# encode and pad sequences
def encode_sequences(tokenizer, length, lines):
    # integer encode sequences
    X = tokenizer.texts_to_sequences(lines)
    # pad sequences with 0 values
    X = pad_sequences(X, maxlen=length, padding='post')
    return X

# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None


# generate target given source sequence
def predict_sequence(model, tokenizer, source):
    prediction = model.predict(source, verbose=0)[0]
    integers = [argmax(vector) for vector in prediction]
    target = list()
    for i in integers:
        word = word_for_id(i, tokenizer)
        if word is None:
            break
            
        target.append(word)
    return ' '.join(target)


# evaluate the skill of the model
def evaluate_model(model, sources, raw_dataset, eng_tokenizer):
    actual, predicted = list(), list()
    for i, source in enumerate(sources):
        # translate encoded source text
        source = source.reshape((1, source.shape[0]))
        translation = predict_sequence(model, eng_tokenizer, source)
        # take the first two columns; any extra column (such as attribution) is ignored
        raw_target, raw_src = raw_dataset[i][:2]
        if i < 10:
            print('src=[%s], target=[%s], predicted=[%s]' % (raw_src, raw_target, translation))
        actual.append([raw_target.split()])
        predicted.append(translation.split())
        
    # calculate BLEU score

    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
    
# load datasets
dataset = load_clean_sentences('english-german-both.pkl')
train = load_clean_sentences('english-german-train.pkl')
test = load_clean_sentences('english-german-test.pkl')
# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])
# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])
# prepare data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])
# load model
model = load_model('model.h5')
# test on some training sequences
print('train')
evaluate_model(model, trainX, train, eng_tokenizer)
# test on some test sequences
print('test')
evaluate_model(model, testX, test, eng_tokenizer)

train
src=[das ist in ordnung], target=[thats okay], predicted=[this is]
src=[tom a], target=[tom ate], predicted=[tom ate]
src=[wir sind alt], target=[were old], predicted=[were old]
src=[wie kalt ist es], target=[how cold is it], predicted=[how bad is it]
src=[nimm meine], target=[take mine], predicted=[take mine]
src=[bring wein], target=[bring wine], predicted=[get wine]
src=[ich las lippen], target=[i read lips], predicted=[i read lips]
src=[ich kummere mich um tom], target=[ill take tom], predicted=[ill tell tom]
src=[ich habe dafur gesorgt], target=[i saw to it], predicted=[i saw to it]
src=[ich rannte], target=[i ran], predicted=[i ran]
BLEU-1: 0.882148
BLEU-2: 0.835760
BLEU-3: 0.729296
BLEU-4: 0.398758
test
src=[meine lungen schmerzen], target=[my lungs hurt], predicted=[my very hurts]
src=[tom hat gegessen], target=[tom ate], predicted=[tom ate one]
src=[gib mal gas], target=[get a move on], predicted=[go to me]
src=[ich habe eine arbeit], target=[ive got a job], predicted=[i
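
To build some intuition for what the BLEU-1 figure measures, the sketch below computes a corpus-level unigram score by hand for the single-reference case: clipped unigram matches divided by the total number of predicted words, multiplied by a brevity penalty when the predictions are shorter than the references. It is a simplified illustration, has not been compared against the NLTK output here, and `corpus_bleu1` is a hypothetical helper rather than an NLTK function.

```python
import math
from collections import Counter

def corpus_bleu1(references, hypotheses):
    """Unigram BLEU over a corpus with exactly one reference per hypothesis."""
    matches = total = ref_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_counts, hyp_counts = Counter(ref), Counter(hyp)
        # clip each predicted word's count by how often it appears in the reference
        matches += sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
        total += len(hyp)
        ref_len += len(ref)
    if total == 0 or matches == 0:
        return 0.0
    # penalize predictions that are shorter than the references overall
    brevity = 1.0 if total > ref_len else math.exp(1 - ref_len / total)
    return brevity * matches / total

references = ["how cold is it".split(), "tom ate".split()]
hypotheses = ["how bad is it".split(), "tom ate".split()]
print(corpus_bleu1(references, hypotheses))  # 5 of 6 predicted words match
```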

### Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.

- Data Cleaning. Different data cleaning operations could be performed on the data, such as not removing punctuation or normalizing case, or perhaps removing duplicate English phrases.
- Vocabulary. The vocabulary could be refined, perhaps removing words used less than 5 or 10 times in the dataset and replacing them with unk.
- More Data. The dataset used to fit the model could be expanded to 50,000, 100,000 phrases, or more.
- Input Order. The order of input phrases could be reversed, which has been reported to lift skill, or a Bidirectional input layer could be used.
- Layers. The encoder and/or the decoder models could be expanded with additional layers and trained for more epochs, providing more representational capacity for the model.
- Units. The number of memory units in the encoder and decoder could be increased, providing more representational capacity for the model.
- Regularization. The model could use regularization, such as weight or activation regularization, or the use of dropout on the LSTM layers.
- Pre-Trained Word Vectors. Pre-trained word vectors could be used in the model.
- Alternate Measure. Explore alternate performance measures besides BLEU, such as ROUGE. Compare scores for the same translations to develop an intuition for how the measures differ in practice.
- Recursive Model. A recursive formulation of the model could be used where the next word in the output sequence could be conditional on the input sequence and the output sequence generated so far.
