Neural Machine Translation, or NMT, can be trained directly on source and target text, so it no longer requires the pipeline of specialized systems used in statistical machine translation.

# Encoder-Decoder Model
The early models were multilayer perceptrons, which restricted both the input and the output to fixed-length sequences. Recurrent neural networks lift that restriction because they process sequences of variable length.

* An encoder neural network reads and encodes a source sentence into a fixed-length "context" vector
* A decoder, usually an RNN, then outputs a translation from the encoded vector
* The encoder and decoder pair for a language pair is jointly trained to maximize the probability of a correct translation given a source sentence

## Attention
* The basic encoder-decoder model has problems with long sequences of text
    * Because of the fixed-length internal representation
* An attention mechanism allows the model to learn where to place attention on the input sequence
    * This is more effective than using a single fixed-size representation to capture all the semantic details
    * The model reads the whole sentence or paragraph, then produces the translated words one at a time, each time focusing on a different part of the input sentence to gather the semantic details needed for the next word
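
The idea of focusing on a different part of the input for each output word can be sketched in a few lines of NumPy. This is only an illustration: it assumes dot-product scoring (one of several possible scoring functions), and the names `dot_product_attention`, `encoder_outputs`, and `decoder_state` as well as the sizes are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(query, encoder_outputs):
    """query: (units,), encoder_outputs: (src_len, units)."""
    # one score per source position
    scores = encoder_outputs @ query       # (src_len,)
    # normalize the scores into attention weights that sum to 1
    weights = softmax(scores)              # (src_len,)
    # context vector: weighted average of the encoder outputs
    context = weights @ encoder_outputs    # (units,)
    return context, weights

rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(6, 4))  # 6 source tokens, 4 units (hypothetical)
decoder_state = rng.normal(size=(4,))      # current decoder state (hypothetical)

context, weights = dot_product_attention(decoder_state, encoder_outputs)
print(weights.round(3), context.shape)
```

The decoder would recompute `weights` at every output step, so each generated word can draw on a different mix of source positions instead of one fixed vector.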
    
## Problems
* Scaling to larger vocabularies of words
* Slow speed of training the models
* Inference speed

## Long Short-Term Memory
### Sequence to Sequence
* Sequence prediction involves forecasting the next value in a real-valued sequence or outputting a class label for an input sequence
* The hardest type of sequence prediction problem takes a sequence as input and requires a sequence as output; this is called sequence-to-sequence, or seq2seq
* The length of the input and output may vary
* An effective approach is the Encoder-Decoder LSTM, designed specifically for seq2seq problems
### Encoder-Decoder LSTM
* The innovation here is the fixed-size internal representation at the heart of the model, which the input sequence is read into and the output sequence is read from
    * This representation is sometimes called a **sequence embedding**
* Reversing the input sequence in the training and test sets introduces many short-term dependencies that make the optimization problem simpler, leading to better performance
* The same architecture is also used on images, with a CNN extracting features from the input image, which are then read by a decoder LSTM
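
A small sketch of the input-reversal trick, in plain Python. The helper name `reverse_source` and the toy sentence pair are invented for illustration; only the source side is reversed, the target is left alone.

```python
def reverse_source(pairs):
    """Reverse the token order of each source sentence; keep targets unchanged."""
    return [(src[::-1], tgt) for src, tgt in pairs]

# hypothetical tokenized (source, target) pair
pairs = [
    (['ich', 'bin', 'hier'], ['i', 'am', 'here']),
]

reversed_pairs = reverse_source(pairs)
print(reversed_pairs[0])
```

After reversal, the first source word (`'ich'`) is the last thing the encoder reads, so it sits right next to the first target word (`'i'`) the decoder has to produce. That shorter distance is the short-term dependency mentioned above.
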
### Applications
* Machine Translation
* Learning to execute and calculate the outcome of small programs
* Image captioning
* Conversational modeling (generating answers to textual questions)
* Movement classification (generating a sequence of commands from a sequence of gestures)
### Implementation
* First, the input sequence is shown to the network one encoded character at a time
* One or more LSTM layers can be used to implement the encoder model
* The output of the encoder is a fixed-size vector, the internal representation of the input sequence

model = Sequential()
model.add(LSTM(..., input_shape = (...)))

* The decoder then transforms the learned internal representation of the input sequence into the correct output sequence
* One or more LSTM layers can be used to implement the decoder model
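
Putting the encoder and decoder bullets together, one possible Keras layout looks like the sketch below. It is an assumption-laden illustration, not the model built later in these notes: the sizes are hypothetical, and bridging encoder and decoder with `RepeatVector` plus a `TimeDistributed` output layer is just one common way to wire the two halves.

```python
import numpy as np
import tensorflow as tf

# hypothetical sizes, chosen only for illustration
n_in, n_out = 6, 4      # input and output sequence lengths
n_features = 12         # size of each one-hot encoded character
n_units = 32            # size of the fixed-size internal representation

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_in, n_features)),
    # encoder: reads the whole input and returns one fixed-size vector
    tf.keras.layers.LSTM(n_units),
    # present that vector to the decoder once per output time step
    tf.keras.layers.RepeatVector(n_out),
    # decoder: produces one hidden vector per output time step
    tf.keras.layers.LSTM(n_units, return_sequences=True),
    # predict a distribution over characters at every output step
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(n_features, activation='softmax')),
])

x = np.zeros((2, n_in, n_features), dtype='float32')  # dummy batch of 2 sequences
y = model(x)
print(y.shape)  # (batch, n_out, n_features)
```

Note that the input and output lengths differ (`n_in` vs `n_out`), which is exactly what the fixed-size vector in the middle makes possible.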

Loading the data

In [1]:
import numpy as np
import tensorflow as tf
import pathlib

path = pathlib.Path('deu.txt')
text = path.read_text(encoding = 'utf-8')
lines = text.splitlines()
pairs = [line.split('\t')[:2] for line in lines]

In [2]:
pairs[0]

['Go.', 'Geh.']
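
The `[:2]` slice above suggests that each line carries more than two tab-separated columns. A slightly more defensive version of the parsing step is sketched below on an in-memory sample. The sample text and its third column are assumptions made for the example, not the real contents of `deu.txt`.

```python
# hypothetical stand-in for the file contents: English, German, extra column
sample = (
    "Go.\tGeh.\tsome attribution text\n"
    "Hi.\tHallo!\tsome attribution text\n"
    "\n"  # a blank line that would break tuple unpacking later on
)

pairs = []
for line in sample.splitlines():
    fields = line.split('\t')
    if len(fields) < 2:
        continue  # skip blank or malformed lines
    pairs.append(fields[:2])  # keep only English and German

# English is the target, German is the input
targ = [english for english, german in pairs]
inp = [german for english, german in pairs]
print(pairs)
```

Skipping short lines matters because unpacking `for targ, inp in pairs` raises a `ValueError` as soon as one entry has fewer than two fields.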

In [3]:
inp = [inp for targ, inp in pairs]
targ = [targ for targ, inp in pairs]

In [4]:
inp[-1]

'Ohne Zweifel findet sich auf dieser Welt zu jedem Mann genau die richtige Ehefrau und umgekehrt; wenn man jedoch in Betracht zieht, dass ein Mensch nur Gelegenheit hat, mit ein paar hundert anderen bekannt zu sein, von denen ihm nur ein Dutzend oder weniger nahesteht, darunter höchstens ein oder zwei Freunde, dann erahnt man eingedenk der Millionen Einwohner dieser Welt\xa0leicht, dass seit Erschaffung ebenderselben wohl noch nie der richtige Mann der richtigen Frau begegnet ist.'

In [5]:
targ[-1]

'Doubtless there exists in this world precisely the right woman for any given man to marry and vice versa; but when you consider that a human being has the opportunity of being acquainted with only a few hundred people, and out of the few hundred that there are but a dozen or less whom he knows intimately, and out of the dozen, one or two friends at most, it will easily be seen, when we remember the number of millions who inhabit this world, that probably, since the earth was created, the right man has never yet met the right woman.'

Creating the tf.data dataset

In [6]:
BUFFER_SIZE = len(inp)
BATCH_SIZE = 64

dataset = tf.data.Dataset.from_tensor_slices((inp, targ)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE)

In [7]:
dataset

<BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.string)>

In [8]:
for example_input_batch, example_target_batch in dataset.take(1):
    print(example_input_batch[:5])
    print()
    print(example_target_batch[:5])
    break

tf.Tensor(
[b'Wohin haben Sie Tom geschickt?'
 b'Was ist deine Lieblingssportart im Sommer?'
 b'D\xc3\xbcrfte ich dich um einen gro\xc3\x9fen Gefallen bitten?'
 b'Gerechterweise muss man sagen, dass er mit seinem begrenzten Personal und Material sein Bestes geleistet hat.'
 b'Wer sind diese M\xc3\xa4nner?'], shape=(5,), dtype=string)

tf.Tensor(
[b'Where did you send Tom?' b"What's your favorite summer sport?"
 b'Could I ask you a big favor?'
 b'To do him justice, he did his best with his limited men and supplies.'
 b'Who are these men?'], shape=(5,), dtype=string)


### Text Preprocessing
One goal of a saveable model is to do all the text processing inside the model, so that it takes `tf.string` inputs and returns `tf.string` outputs
#### Standardization
It is important to standardize the input text because the model is dealing with multilingual text with a limited vocabulary

First, we split accented Unicode characters and replace them with their ASCII equivalents

In [9]:
import tensorflow_text as tf_text

example_text = tf.constant('¿Todavía está en casa?')

print(example_text.numpy())
print(tf_text.normalize_utf8(example_text, 'NFKD').numpy())


b'\xc2\xbfTodav\xc3\xada est\xc3\xa1 en casa?'
b'\xc2\xbfTodavi\xcc\x81a esta\xcc\x81 en casa?'


In [10]:
example_text = tf.constant('Grüß Gott mögen!')

print(example_text.numpy())
print(tf_text.normalize_utf8(example_text, 'NFKD').numpy())

b'Gr\xc3\xbc\xc3\x9f Gott m\xc3\xb6gen!'
b'Gru\xcc\x88\xc3\x9f Gott mo\xcc\x88gen!'
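
As the byte strings above show, NFKD only decomposes an accented character into a base letter plus a combining mark; it does not remove the mark. A plain-Python sketch using the standard `unicodedata` module makes the second step explicit. The helper name `strip_accents` is made up for this example.

```python
import unicodedata

def strip_accents(text):
    # split each accented character into base letter + combining mark
    decomposed = unicodedata.normalize('NFKD', text)
    # drop the combining marks, keeping only the base characters
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('¿Todavía está en casa?'))
print(strip_accents('Grüß Gott mögen!'))
```

Characters without a decomposition, such as `ß` and `¿`, pass through unchanged, so they still need to be handled separately.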


Unicode Normalization

In [11]:
def tf_lower_and_split_punct(text):
    # normalize Unicode (default form is NFKC, which keeps umlauts as single characters)
    text = tf_text.normalize_utf8(text)
    # lowercase as UTF-8 so that Ä, Ö, Ü are lowercased too (the default handles ASCII only)
    text = tf.strings.lower(text, encoding='utf-8')
    # keep space, a to z, umlauts, ß, and select punctuation
    text = tf.strings.regex_replace(text, '[^ a-z.?!,äöüß]', '')
    # replace umlauts and ß with ASCII equivalents
    text = tf.strings.regex_replace(text, 'ä', 'a')
    text = tf.strings.regex_replace(text, 'ö', 'o')
    text = tf.strings.regex_replace(text, 'ü', 'u')
    text = tf.strings.regex_replace(text, 'ß', 'ss')
    # add spaces around punctuation
    text = tf.strings.regex_replace(text, '[.?!,]', r' \0 ')
    # strip leading and trailing whitespace
    text = tf.strings.strip(text)
    # mark the start and end of every sentence
    text = tf.strings.join(['[START]', text, '[END]'], separator=' ')
    return text

In [12]:
print(example_text.numpy().decode())
print(tf_lower_and_split_punct(example_text).numpy().decode())


Grüß Gott mögen!
[START] gruss gott mogen ! [END]
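
For quick checks outside TensorFlow, the same sequence of steps can be mirrored in plain Python. This is a sketch for experimenting with the rules, not a replacement for the TensorFlow function; the name `lower_and_split_punct` is made up, and Python's `str.lower` is Unicode-aware, so uppercase umlauts are lowercased here.

```python
import re
import unicodedata

def lower_and_split_punct(text):
    # compose characters so umlauts are single code points
    text = unicodedata.normalize('NFKC', text)
    text = text.lower()
    # keep space, a to z, umlauts, ß, and select punctuation
    text = re.sub(r'[^ a-z.?!,äöüß]', '', text)
    # replace umlauts and ß with ASCII equivalents
    for src, dst in (('ä', 'a'), ('ö', 'o'), ('ü', 'u'), ('ß', 'ss')):
        text = text.replace(src, dst)
    # add spaces around punctuation
    text = re.sub(r'([.?!,])', r' \1 ', text)
    text = text.strip()
    return ' '.join(['[START]', text, '[END]'])

print(lower_and_split_punct('Grüß Gott mögen!'))
print(lower_and_split_punct('Über uns.'))
```

The second call is a useful edge case: a sentence that starts with an uppercase umlaut should keep its first letter after standardization.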


#### Text Vectorization
This layer handles the vocabulary extraction and the conversion of input text to sequences of tokens

In [13]:
max_vocab_size = 5000

input_text_processor = tf.keras.layers.experimental.preprocessing.TextVectorization(
    standardize=tf_lower_and_split_punct,
    max_tokens=max_vocab_size
)

The `experimental.preprocessing` layers generally have an `adapt` method, which reads one epoch of the training data and initializes the layer based on that data

Here, it determines the vocabulary

In [14]:
input_text_processor.adapt(inp)
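
Conceptually, determining the vocabulary amounts to counting tokens and keeping the most frequent ones. The plain-Python sketch below illustrates that idea under simplifying assumptions (whitespace tokenization on already standardized text); it is not the layer's actual implementation, and `build_vocabulary` and `tokenize` are hypothetical helper names.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    # count every whitespace-separated token in the corpus
    counts = Counter(token for text in texts for token in text.split())
    # index 0 is reserved for padding, index 1 for unknown tokens
    reserved = ['', '[UNK]']
    frequent = [tok for tok, _ in counts.most_common(max_tokens - len(reserved))]
    return reserved + frequent

def tokenize(text, vocabulary):
    index = {tok: i for i, tok in enumerate(vocabulary)}
    # out-of-vocabulary tokens map to the [UNK] index
    return [index.get(tok, 1) for tok in text.split()]

corpus = [
    '[START] ich bin tom . [END]',
    '[START] ich bin hier . [END]',
    '[START] ich gehe . [END]',
]
vocab = build_vocabulary(corpus, max_tokens=6)
ids = tokenize('[START] ich schlafe . [END]', vocab)
print(vocab)
print(ids)
```

With `max_tokens=6` only the four most frequent tokens fit next to the two reserved entries, so rarer words such as `bin` and unseen words such as `schlafe` both end up as `[UNK]`.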

In [15]:
# first 50 words from the vocabulary
input_text_processor.get_vocabulary()[:50]

['',
 '[UNK]',
 '[START]',
 '[END]',
 '.',
 ',',
 'ich',
 'tom',
 '?',
 'nicht',
 'ist',
 'das',
 'du',
 'sie',
 'zu',
 'es',
 'die',
 'er',
 'der',
 'hat',
 'in',
 'dass',
 'ein',
 'wir',
 'habe',
 'was',
 'mir',
 '!',
 'sich',
 'auf',
 'mit',
 'war',
 'wie',
 'mich',
 'eine',
 'den',
 'und',
 'maria',
 'ihr',
 'haben',
 'an',
 'sind',
 'kann',
 'einen',
 'bin',
 'so',
 'noch',
 'von',
 'fur',
 'sehr']

And for the English TextVectorization layer:

In [16]:
output_text_processor = tf.keras.layers.experimental.preprocessing.TextVectorization(
    standardize=tf_lower_and_split_punct,
    max_tokens=max_vocab_size
)

output_text_processor.adapt(targ)

In [17]:
output_text_processor.get_vocabulary()[:50]

['',
 '[UNK]',
 '[START]',
 '[END]',
 '.',
 'tom',
 'to',
 'you',
 'the',
 'i',
 '?',
 'a',
 'is',
 'that',
 'in',
 'do',
 'he',
 'of',
 'it',
 ',',
 'was',
 'have',
 'this',
 'me',
 'dont',
 'for',
 'my',
 'what',
 'are',
 'mary',
 'we',
 'your',
 'his',
 'be',
 'im',
 'and',
 'on',
 'with',
 'like',
 'know',
 'want',
 'not',
 'at',
 'she',
 'can',
 'did',
 'how',
 'has',
 'very',
 'think']

Now a batch of strings can be converted into a batch of token IDs:

In [18]:
example_tokens = input_text_processor(example_input_batch)
example_tokens[:3, :10]

<tf.Tensor: shape=(3, 10), dtype=int64, numpy=
array([[   2,  618,   39,   13,    7, 1202,    8,    3,    0,    0],
       [   2,   25,   10,  119,    1,   61,  644,    8,    3,    0],
       [   2, 1464,    6,   54,   60,   43,  520,  338,  676,    8]])>

`get_vocabulary` can be used to convert token IDs back to text:

In [19]:
input_vocab = np.array(input_text_processor.get_vocabulary())
tokens = input_vocab[example_tokens[0].numpy()]
' '.join(tokens)

'[START] wohin haben sie tom geschickt ? [END]            '
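
The trailing spaces in that string come from token ID 0, which maps to the empty string at index 0 of the vocabulary and is used to pad shorter sentences. A short NumPy sketch, reusing the three rows printed above, shows how a padding mask can be derived from the IDs. The variable names `mask` and `lengths` are made up for the example.

```python
import numpy as np

# the 3 x 10 slice of token IDs shown earlier
example_tokens = np.array([
    [2,  618, 39,  13,  7, 1202,   8,   3,   0, 0],
    [2,   25, 10, 119,  1,   61, 644,   8,   3, 0],
    [2, 1464,  6,  54, 60,   43, 520, 338, 676, 8],
])

# True where there is a real token, False where there is padding
mask = example_tokens != 0
# number of real tokens per row, within this 10-column slice
lengths = mask.sum(axis=1)
print(lengths)
```

A mask like this is what lets later stages ignore the padded positions.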

In [23]:
class ShapeChecker():
    def __init__(self):
        # Keep a cache of every axis-name seen
        self.shapes = {}

    def __call__(self, tensor, names, broadcast=False):
        if not tf.executing_eagerly():
            return

        if isinstance(names, str):
            names = (names,)

        shape = tf.shape(tensor)
        rank = tf.rank(tensor)

        if rank != len(names):
            raise ValueError(f'Rank mismatch:\n'
                           f'    found {rank}: {shape.numpy()}\n'
                           f'    expected {len(names)}: {names}\n')

        for i, name in enumerate(names):
            if isinstance(name, int):
                old_dim = name
            else:
                old_dim = self.shapes.get(name, None)
            new_dim = shape[i]

            if (broadcast and new_dim == 1):
                continue

            if old_dim is None:
                # If the axis name is new, add its length to the cache.
                self.shapes[name] = new_dim
                continue

            if new_dim != old_dim:
                raise ValueError(f"Shape mismatch for dimension: '{name}'\n"
                                 f"    found: {new_dim}\n"
                                 f"    expected: {old_dim}\n")


## The Encoder/Decoder Model


In [20]:
# constants for the model
embedding_dim = 256
units = 1024

### The Encoder
1. Takes a list of token IDs (from the `input_text_processor`)
2. Looks up an embedding vector for each token (using a `layers.Embedding`)
3. Processes the embeddings into a new sequence (using a `layers.GRU`)
4. Returns:
    * The processed sequence (to be passed into the attention head)
    * The internal state (to be passed into the decoder)

In [None]:
class Encoder(tf.keras.layers.Layer):
    def __init__(self, input_vocab_size, embedding_dim, enc_units):
        super(Encoder, self).__init__()
        self.enc_units = enc_units
        self.input_vocab_size = input_vocab_size
        
        # the embedding layer converts tokens into vectors
        self.embedding = tf.keras.layers.Embedding(self.input_vocab_size,
                                                   embedding_dim)
        
        # the GRU RNN layer processes those vectors sequentially
        self.gru = tf.keras.layers.GRU(self.enc_units,
                                       # return the sequence and state
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
    def call(self, tokens, state=None):
        shape_checker = ShapeChecker()
        shape_checker(tokens, ('batch', 's'))
        
        # 2. embedding layer looking up the embedding for each token
        vectors = self.embedding(tokens)
        shape_checker(vectors, ('batch', 's', 'embed_dim'))
        
        # 3. the GRU processing the embedding sequence
        #    output shape: (batch, s, enc_units)
        #    state shape:  (batch, enc_units)
        output, state = self.gru(vectors, initial_state=state)
        shape_checker(output, ('batch', 's', 'enc_units'))
        shape_checker(state,  ('batch', 'enc_units'))
        
        # 4. return the new sequence and its state
        return output, state
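
To see what the two return values contain, the encoder's data flow can be traced with a small NumPy sketch. It uses a textbook GRU formulation with randomly initialized, hypothetical weights and tiny sizes; the Keras `GRU` layer has its own defaults and options, which are not reproduced here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_encode(tokens, embedding, params):
    """tokens: (batch, s) IDs -> output (batch, s, units), state (batch, units)."""
    W, U, b = params  # W: (3, embed_dim, units), U: (3, units, units), b: (3, units)
    vectors = embedding[tokens]                   # embedding lookup: (batch, s, embed_dim)
    batch, s, _ = vectors.shape
    h = np.zeros((batch, U.shape[-1]))            # initial state
    outputs = []
    for t in range(s):
        x = vectors[:, t]                         # (batch, embed_dim)
        z = sigmoid(x @ W[0] + h @ U[0] + b[0])   # update gate
        r = sigmoid(x @ W[1] + h @ U[1] + b[1])   # reset gate
        h_new = np.tanh(x @ W[2] + (r * h) @ U[2] + b[2])  # candidate state
        h = z * h + (1.0 - z) * h_new             # blend old state and candidate
        outputs.append(h)
    return np.stack(outputs, axis=1), h

rng = np.random.default_rng(0)
vocab_size, embed_dim, units = 50, 8, 16          # hypothetical sizes
embedding = rng.normal(size=(vocab_size, embed_dim))
params = (rng.normal(size=(3, embed_dim, units)) * 0.1,
          rng.normal(size=(3, units, units)) * 0.1,
          np.zeros((3, units)))

tokens = rng.integers(0, vocab_size, size=(4, 7))  # batch of 4, sequence length 7
output, state = gru_encode(tokens, embedding, params)
print(output.shape, state.shape)
```

The processed sequence keeps one vector per input token, which is what attention needs, while the state is simply the last of those vectors and serves to initialize the decoder.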