# Chapter 14: Recurrent Neural Networks

## Exercises

### 1. Can you think of a few applications for a sequence-to-sequence RNN? What about a sequence-to-vector RNN? And a vector-to-sequence RNN?

A sequence-to-sequence RNN is generally used to predict the future behavior of an input sequence. For example, it can forecast a time series or predict the next word you are about to type.

A sequence-to-vector RNN is good for classifying sequences, as in sentiment analysis. A sequence-to-vector RNN can also be used to find embeddings of a vocabulary of words in a denser, smaller vector space.

In machine translation, a vector-to-sequence RNN is good for decoding an embedding of the input in one language into words in the target language. You can also use vector-to-sequence RNNs to generate captions for images.
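To make the three input/output arrangements concrete, here is a minimal NumPy sketch of a single basic RNN cell run in each mode. The weights are random and the sizes (`n_inputs`, `n_neurons`) are arbitrary choices of mine, so this only illustrates the shapes involved, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_neurons = 3, 5  # arbitrary sizes, for illustration only

# Randomly initialized weights of a single recurrent layer.
Wx = rng.normal(size=(n_inputs, n_neurons))
Wh = rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

def rnn_step(x_t, h_prev):
    """One step of a basic RNN cell: h_t = tanh(x_t Wx + h_prev Wh + b)."""
    return np.tanh(x_t @ Wx + h_prev @ Wh + b)

def run_rnn(inputs):
    """Feed a (n_steps, n_inputs) sequence and return every hidden state."""
    h = np.zeros(n_neurons)
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h)
        states.append(h)
    return np.stack(states)

sequence = rng.normal(size=(4, n_inputs))

# Sequence-to-sequence: keep one output per time step.
seq_to_seq = run_rnn(sequence)

# Sequence-to-vector: keep only the output of the last time step.
seq_to_vec = run_rnn(sequence)[-1]

# Vector-to-sequence: feed the vector at the first step and zeros afterwards.
vector = sequence[0]
vec_inputs = np.vstack([vector, np.zeros((3, n_inputs))])
vec_to_seq = run_rnn(vec_inputs)

print(seq_to_seq.shape, seq_to_vec.shape, vec_to_seq.shape)
```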

### 2. Why do people use encoder-decoder RNNs rather than plain sequence-to-sequence RNNs for automatic translation?

Most models for automatic translation encode the vocabulary as one-hot vectors, where each word is a unit vector orthogonal to every other word's vector. For a vocabulary of 50,000 words, this means each element of the input sequence is a vector in a 50,000-dimensional space. Training a sequence-to-sequence RNN for machine translation directly on such a large vocabulary would take a large amount of memory, making it inefficient.

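Using an encoder-decoder model allows you to train the encoder to find a denser representation of the words, making training more efficient. Training the model to find an embedding also helps it learn which words are closely related to one another. In addition, an encoder-decoder reads the whole source sentence before it starts producing the translation, whereas a plain sequence-to-sequence RNN would have to emit an output right after reading each input word.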
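A quick back-of-the-envelope calculation shows the size difference between one-hot inputs and a dense embedding. The embedding size of 256 and the sentence length of 20 are assumptions of mine, picked only to make the arithmetic concrete.

```python
vocab_size = 50_000    # vocabulary size used in the answer above
embedding_dim = 256    # assumed embedding size, for illustration
seq_len = 20           # assumed sentence length, for illustration
bytes_per_float = 4    # float32

# Memory for one sentence when every word is a one-hot vector.
one_hot_bytes = seq_len * vocab_size * bytes_per_float

# Memory for the same sentence when every word is a dense embedding.
embedded_bytes = seq_len * embedding_dim * bytes_per_float

print(one_hot_bytes)                   # 4000000
print(embedded_bytes)                  # 20480
print(one_hot_bytes / embedded_bytes)  # 195.3125
```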

### 3. How could you combine a convolutional neural network and an RNN to classify videos?

Since a video is a sequence of images, you could create a recurrent neural network in which each cell is a convolutional layer that learns feature maps for the images in each frame of the video. You could have it learn one set of feature maps for the input and another set of feature maps for the previous output.

### 4. What are the advantages of building an RNN using `dynamic_rnn()` rather than `static_rnn()`?

The `static_rnn()` function creates new graph nodes for each time step in the sequence. This means that if you are processing a sequence with a large number of steps, you risk getting an out-of-memory (OOM) error when building your TensorFlow graph. The `dynamic_rnn()` function instead uses a while loop to run the same nodes at every step. The `dynamic_rnn()` function also allows you to swap memory between the GPU and the CPU using the `swap_memory` parameter. Finally, it accepts a single tensor as input instead of a list of tensors, one per time step in the sequence.

### 5. How can you deal with variable-length input sequences? What about variable-length output sequences?

The `dynamic_rnn()` function takes a `sequence_length` parameter, a 1D tensor of integers giving the length of each input sequence. Input sequences shorter than the longest sequence are padded with zeros.

For variable-length output sequences, it is not possible to know ahead of time how long each output will be, so each output sequence ends with an end-of-sequence (EOS) token that marks where it stops.
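Both ideas can be sketched in plain Python. The helper names `pad_batch` and `truncate_at_eos` and the `'<eos>'` token string are hypothetical, chosen just for this illustration.

```python
EOS = '<eos>'  # hypothetical end-of-sequence token

def pad_batch(sequences, pad_value=0):
    """Pad variable-length sequences to the longest one and return the lengths."""
    lengths = [len(s) for s in sequences]
    max_len = max(lengths)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]
    return padded, lengths

def truncate_at_eos(tokens):
    """Drop the EOS token and everything generated after it."""
    return tokens[:tokens.index(EOS)] if EOS in tokens else tokens

padded, lengths = pad_batch([[1, 2, 3], [4], [5, 6]])
print(padded)   # [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
print(lengths)  # [3, 1, 2]
print(truncate_at_eos(['the', 'cat', EOS, 'noise']))  # ['the', 'cat']
```

The `lengths` list plays the role of the `sequence_length` argument: it tells the RNN which steps of each padded row are real.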

### 6. What is a common way to distribute training and execution of a deep RNN across multiple GPUs?

To distribute an RNN across devices, you cannot simply call the `tf.device()` function. This is because TensorFlow's built-in RNN cell classes like `BasicRNNCell` do not create the graph nodes themselves; rather, they are cell factories.

In order to distribute an RNN across devices, you must define a new cell factory which actually creates each cell on a separate device. For an example, see `DeviceCellWrapper` in `RecurrentNeuralNetworks.ipynb`.

### 7. _Embedded Reber grammars_ are artificial grammars used to produce strings. Train an RNN to identify whether or not a string represents the grammar discussed in [Jenny Orr's introduction](http://www.willamette.edu/~gorr/classes/cs449/reber.html) or not. You will first need to write a function capable of generating a training batch containing about 50% strings that respect the grammar and 50% that do not.

In [0]:
# Define a function to generate Reber grammar strings. Each index in the
# list below is a node in the graph. Each element of the inner list is a
# (character, next node) pair describing an edge to an adjacent node.

import numpy as np

# Adjacency list for the graph.
reber_grammar_graph = [
  [('B', 1)],
  [('T', 2), ('P', 3)],
  [('S', 2), ('X', 4)],
  [('T', 3), ('V', 5)],
  [('X', 3), ('S', 6)],
  [('P', 4), ('V', 6)],
  [('E', None)],
]

def generate_reber_grammar():
  idx = 0
  result = ''
  while idx is not None:
    chars = reber_grammar_graph[idx]
    c, idx = chars[np.random.randint(0, len(chars))]
    result += c
  return result

In [0]:
generate_reber_grammar()

'BTSXXTTTTTTTTTTTVVE'

In [0]:
# Defining a function for generating embedded Reber grammar.

REBER_GRAPH = 'reber_graph'

embedded_reber_grammar_graph = [
  [('B', 1)],
  [('T', 2), ('P', 3)],
  [(REBER_GRAPH, 4)],
  [(REBER_GRAPH, 5)],
  [('T', 6)],
  [('P', 6)],
  [('E', None)]
]

def generate_embedded_reber_grammar():
  idx = 0
  result = ''
  while idx is not None:
    chars = embedded_reber_grammar_graph[idx]
    c, idx = chars[np.random.randint(0, len(chars))]
    result += c if c != REBER_GRAPH else generate_reber_grammar()
  return result

In [0]:
generate_embedded_reber_grammar()

'BPBPVPXVVEPE'

In [0]:
# Generate a corrupted string by creating an embedded Reber grammar
# and then change a single character.

def generate_corrupted_string():
  erg_string = generate_embedded_reber_grammar()
  chars = set(erg_string)
  idx = np.random.randint(0, len(erg_string))
  bad_char = np.random.choice(list(chars - set(erg_string[idx])))
  return '{}{}{}'.format(erg_string[:idx], bad_char, erg_string[idx+1:])

In [0]:
generate_corrupted_string()

'BPBPPVVEPE'
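A small checker is handy for sanity-testing the generators above. The sketch below rewrites the adjacency list as a transition table and walks a string through it. The names `REBER_TRANSITIONS`, `is_reber` and `is_embedded_reber` are my own and are not used elsewhere in this notebook.

```python
# Transition table for the Reber grammar: state -> {symbol: next state}.
# It mirrors the adjacency list used by generate_reber_grammar().
REBER_TRANSITIONS = [
    {'B': 1},
    {'T': 2, 'P': 3},
    {'S': 2, 'X': 4},
    {'T': 3, 'V': 5},
    {'X': 3, 'S': 6},
    {'P': 4, 'V': 6},
    {'E': None},
]

def is_reber(s):
    """Return True if s is a complete string of the (non-embedded) Reber grammar."""
    state = 0
    for c in s:
        # Reject characters after the final 'E' or with no matching edge.
        if state is None or c not in REBER_TRANSITIONS[state]:
            return False
        state = REBER_TRANSITIONS[state][c]
    return state is None

def is_embedded_reber(s):
    """Check the pattern B, T or P, <Reber string>, the same T or P, E."""
    if len(s) < 4 or s[0] != 'B' or s[-1] != 'E':
        return False
    if s[1] not in ('T', 'P') or s[-2] != s[1]:
        return False
    return is_reber(s[2:-2])

print(is_reber('BTSXXTTTTTTTTTTTVVE'))    # True
print(is_embedded_reber('BPBPVPXVVEPE'))  # True
print(is_embedded_reber('BPBPPVVEPE'))    # False
```

The three strings tested are the sample outputs shown in the cells above.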

In [0]:
# One-hot encoding each string.

char_to_idx_map = {c: i for i, c in enumerate('BEPSTVX')}

def char_to_vector(c):
  data = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
  data[char_to_idx_map[c]] = 1.0
  return data

def one_hot_encode(erg_str):
  return np.array([char_to_vector(c) for c in erg_str], dtype=np.float32)

In [0]:
one_hot_encode(generate_embedded_reber_grammar())

array([[1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.]], dtype=float32)

In [0]:
# Pad a one-hot encoded embedded Reber grammar string.
def pad_zeros(ohe_erg_str, length):
  str_length = len(ohe_erg_str)
  if str_length > length:
    raise Exception(
        'the 2nd argument of pad_zeros must be gte the length of the first '
        'argument')
  for i in range(length - str_length):
    ohe_erg_str = \
        np.concatenate((ohe_erg_str, [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]))
  return ohe_erg_str

In [0]:
pad_zeros(one_hot_encode(generate_embedded_reber_grammar())[:10], 15)

array([[1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.]])
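Note that the output above no longer shows `dtype=float32`: concatenating with a plain Python list of floats promotes the result to `float64`. As an alternative sketch, `np.pad` adds all the zero rows in one call and keeps the input dtype. The name `pad_zeros_vectorized` is hypothetical.

```python
import numpy as np

def pad_zeros_vectorized(ohe_seq, length):
    """Pad a (n_steps, n_chars) array with all-zero rows up to `length` rows."""
    n_steps = ohe_seq.shape[0]
    if n_steps > length:
        raise ValueError('length must be >= the number of rows in ohe_seq')
    # Pad only at the end of axis 0; leave axis 1 untouched.
    return np.pad(ohe_seq, ((0, length - n_steps), (0, 0)), mode='constant')

# One-hot rows for 'B', 'T', 'E' under the 'BEPSTVX' ordering used above.
seq = np.eye(7, dtype=np.float32)[[0, 4, 1]]
padded = pad_zeros_vectorized(seq, 5)
print(padded.shape)  # (5, 7)
print(padded.dtype)  # float32
```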

In [0]:
# Generate a single training batch.

def generate_batch(batch_size, max_seq_length):
  good_batch = []
  bad_batch = []
  while len(good_batch) < batch_size / 2:
    seq = one_hot_encode(generate_embedded_reber_grammar())
    if len(seq) > max_seq_length:
      continue
    good_batch.append((seq, True))
  while len(bad_batch) < batch_size / 2:
    seq = one_hot_encode(generate_corrupted_string())
    if len(seq) > max_seq_length:
      continue
    bad_batch.append((seq, False))
  batch = []
  seq_lengths = []
  labels = []
  all_seqs = good_batch + bad_batch
  np.random.shuffle(all_seqs)
  for seq, is_valid in all_seqs:
    batch.append(pad_zeros(seq, max_seq_length))
    seq_lengths.append(len(seq))
    labels.append([int(is_valid)])
  return np.array(batch, dtype=np.float32), \
      np.array(labels, dtype=np.int32), \
      np.array(seq_lengths, dtype=np.int32)

In [0]:
# Write the TensorFlow graph

import tensorflow as tf

n_chars = 7
max_sequence_length = 20
n_outputs = 1
n_neurons = 100
learning_rate = 0.01

graph = tf.Graph()

with graph.as_default():
  X = tf.placeholder(tf.float32, (None, max_sequence_length, n_chars))
  y = tf.placeholder(tf.float32, (None, 1))
  seq_length = tf.placeholder(tf.int32, (None,))

  cell = tf.contrib.rnn.GRUCell(num_units=n_neurons)
  _, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32,
                                sequence_length=seq_length)
  logits = tf.layers.dense(states, n_outputs)

  xentropy = \
      tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.cast(y, tf.float32),
                                              logits=logits)
  loss = tf.reduce_mean(xentropy)
  opt = tf.train.AdamOptimizer(learning_rate)
  training_op = opt.minimize(loss)

  y_pred = tf.cast(tf.greater(logits, 0.0), tf.float32)
  accuracy = tf.reduce_mean(tf.cast(tf.equal(y, y_pred), tf.float32))

  init = tf.global_variables_initializer()
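The graph above never applies a sigmoid explicitly: the loss works on raw logits, and predictions come from testing `logits > 0`. The plain-Python sketch below checks both facts on a few values. The formula in `sigmoid_xent` is the numerically stable form given in the TensorFlow documentation for `sigmoid_cross_entropy_with_logits`; the function names are my own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_xent(label, logit):
    """Stable cross-entropy on a logit: max(z, 0) - z * y + log(1 + exp(-|z|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

for logit in (-2.0, 0.5, 3.0):
    p = sigmoid(logit)
    # Thresholding the logit at 0 matches thresholding the probability at 0.5.
    assert (logit > 0.0) == (p > 0.5)
    # For label 1 the loss is -log(p); for label 0 it is -log(1 - p).
    assert abs(sigmoid_xent(1.0, logit) - (-math.log(p))) < 1e-9
    assert abs(sigmoid_xent(0.0, logit) - (-math.log(1.0 - p))) < 1e-9
```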

In [0]:
# Training a model to recognize embedded Reber grammars.

n_epochs = 50
n_batches = 25
batch_size = 100
validation_set_size = 200

with graph.as_default():
  with tf.Session() as sess:
    sess.run(init)
    X_valid, y_valid, seq_len_valid = \
        generate_batch(validation_set_size, max_sequence_length)
    valid_feed_dict = {
        X: X_valid,
        y: y_valid,
        seq_length: seq_len_valid,
    }
    for epoch in range(n_epochs):
      for _ in range(n_batches):
        X_batch, y_batch, seq_len_batch = \
            generate_batch(batch_size, max_sequence_length)
        sess.run(training_op, feed_dict={
            X: X_batch,
            y: y_batch,
            seq_length: seq_len_batch,
        })
      if epoch % 5 == 0:
        loss_val = loss.eval(feed_dict=valid_feed_dict)
        acc_val = accuracy.eval(feed_dict=valid_feed_dict)
        print('Epoch: {}\tLoss: {}\tAccuracy: {}'.format(epoch, loss_val,
                                                         acc_val))

Epoch: 0	Loss: 0.6584188938140869	Accuracy: 0.6449999809265137
Epoch: 5	Loss: 0.6347793340682983	Accuracy: 0.4300000071525574
Epoch: 10	Loss: 0.13499197363853455	Accuracy: 0.9599999785423279
Epoch: 15	Loss: 0.08909790217876434	Accuracy: 0.9750000238418579
Epoch: 20	Loss: 0.04386292025446892	Accuracy: 0.9900000095367432
Epoch: 25	Loss: 0.0029373581055551767	Accuracy: 1.0
Epoch: 30	Loss: 0.001209043781273067	Accuracy: 1.0
Epoch: 35	Loss: 0.0013871783157810569	Accuracy: 1.0
Epoch: 40	Loss: 0.00021878271945752203	Accuracy: 1.0
Epoch: 45	Loss: 0.0001390562829328701	Accuracy: 1.0


### 8. Tackle the ["How much did it rain? II" Kaggle competition](https://www.kaggle.com/c/how-much-did-it-rain-ii). This is a time series prediction task.

[Luis Andre Dutra e Silva's interview](http://blog.kaggle.com/2015/12/17/how-much-did-it-rain-ii-2nd-place-luis-andre-dutra-e-silva/) shows some insights that he used to reach second place in the competition.

In [0]:
# First install kaggle API and get the data. The kaggle JSON is stored
# as a local file in the Colab kernel to avoid revealing PII.

!pip install kaggle
!mkdir -p /root/.kaggle
!mv kaggle.json /root/.kaggle
!kaggle config path -p .
!kaggle competitions download -c how-much-did-it-rain-ii

In [0]:
import zipfile
import os

for f in os.listdir():
  if f[-4:] == '.zip':
    zip_ref = zipfile.ZipFile(f, mode='r')
    zip_ref.extractall()
    zip_ref.close()

In [0]:
# Load the training data.

import pandas as pd

training_df = pd.read_csv('train.csv')
training_df = training_df.dropna().reset_index()

In [0]:
training_df.head()

Unnamed: 0,index,Id,minutes_past,radardist_km,Ref,Ref_5x5_10th,Ref_5x5_50th,Ref_5x5_90th,RefComposite,RefComposite_5x5_10th,RefComposite_5x5_50th,RefComposite_5x5_90th,RhoHV,RhoHV_5x5_10th,RhoHV_5x5_50th,RhoHV_5x5_90th,Zdr,Zdr_5x5_10th,Zdr_5x5_50th,Zdr_5x5_90th,Kdp,Kdp_5x5_10th,Kdp_5x5_50th,Kdp_5x5_90th,Expected
0,6,2,1,2.0,9.0,5.0,7.5,10.5,15.0,10.5,16.5,23.5,0.998333,0.998333,0.998333,0.998333,0.375,-0.125,0.3125,0.875,1.059998,-1.410004,-0.350006,1.059998,1.016
1,9,2,16,2.0,18.0,14.0,17.5,21.0,20.5,18.0,20.5,23.0,0.995,0.995,0.998333,1.001667,0.25,0.125,0.375,0.6875,0.349991,-1.059998,0.0,1.059998,1.016
2,10,2,21,2.0,24.5,16.5,21.0,24.5,24.5,21.0,24.0,28.0,0.998333,0.995,0.998333,0.998333,0.25,0.0625,0.1875,0.5625,-0.350006,-1.059998,-0.350006,1.759994,1.016
3,11,2,26,2.0,12.0,12.0,16.0,20.0,16.5,17.0,19.0,21.0,0.998333,0.995,0.998333,0.998333,0.5625,0.25,0.4375,0.6875,-1.76001,-1.76001,-0.350006,0.709991,1.016
4,12,2,31,2.0,22.5,19.0,22.0,25.0,26.0,23.5,25.5,27.5,0.998333,0.995,0.998333,1.001667,0.0,-0.1875,0.25,0.625,-1.059998,-2.12001,-0.710007,0.349991,1.016


In [0]:
# Defining a function to prepare the data.

import numpy as np

features = [
  'minutes_past',
  'radardist_km',
  'Ref',
  'Ref_5x5_10th',
  'Ref_5x5_50th',
  'Ref_5x5_90th',
  'RefComposite',
  'RefComposite_5x5_10th',
  'RefComposite_5x5_50th',
  'RefComposite_5x5_90th',
  'RhoHV',
  'RhoHV_5x5_10th',
  'RhoHV_5x5_50th',
  'RhoHV_5x5_90th',
  'Zdr',
  'Zdr_5x5_10th',
  'Zdr_5x5_50th',
  'Zdr_5x5_90th',
  'Kdp',
  'Kdp_5x5_10th',
  'Kdp_5x5_50th',
  'Kdp_5x5_90th',
]

def prepare_data_for_model(df):
  # Organize the rows into one sequence per Id; each sequence is sorted by minutes_past below.
  sequences = dict()
  max_len = 1
  for i, row in df.iterrows():
    entry = [row[k] for k in features]
    try:
      sequences[row['Id']].append((entry, row['Expected']))
      if len(sequences[row['Id']]) > max_len:
        max_len = len(sequences[row['Id']])
    except KeyError:
      sequences[row['Id']] = [(entry, row['Expected'])]
  data, outputs, seq_lengths = [], [], []
  for i in sequences:
    outputs.append([sequences[i][0][1]])
    seq_lengths.append(len(sequences[i]))
    S = sorted(sequences[i], key=lambda r: r[0][0])
    data_entry = []
    for entry, _ in S:
      data_entry.append(entry)
    for _ in range(max_len - len(S)):
      data_entry.append([0.0] * len(features))
    data.append(data_entry)
  return np.array(data, dtype=np.float32), \
      np.array(outputs, dtype=np.float32), \
      np.array(seq_lengths, dtype=np.int32)

In [0]:
X_train_valid, y_train_valid, seq_length_train_valid = \
    prepare_data_for_model(training_df)

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid, seq_length_train, seq_length_valid = \
    train_test_split(X_train_valid, y_train_valid, seq_length_train_valid,
                     test_size=2000)

In [0]:
# Defining the model graph using 2 layers of LSTM cells.

import tensorflow as tf

n_steps = X_train.shape[1]  # padded length shared by every sequence
n_features = len(features)
n_neurons = 200
n_layers = 2
n_outputs = 1
learning_rate = 0.1

graph = tf.Graph()

with graph.as_default():
  X = tf.placeholder(tf.float32, (None, n_steps, n_features))
  y = tf.placeholder(tf.float32, (None, 1))
  seq_length = tf.placeholder(tf.int32, (None,))

  lstm_cells = [tf.nn.rnn_cell.BasicLSTMCell(num_units=n_neurons)
                for _ in range(n_layers)]
  multi_cell = tf.nn.rnn_cell.MultiRNNCell(lstm_cells)
  outputs, states = tf.nn.dynamic_rnn(multi_cell, X, dtype=tf.float32,
                                      sequence_length=seq_length)
  top_layer_h_states = states[-1][1]
  logits = tf.layers.dense(top_layer_h_states, n_outputs)

  loss = tf.reduce_mean(tf.abs(logits - y))
  opt = tf.train.AdamOptimizer(learning_rate)
  training_op = opt.minimize(loss)

  saver = tf.train.Saver()
  init = tf.global_variables_initializer()

In [0]:
# Function for generating random batches.

def shuffle_batch(X, y, seq_length, batch_size):
  rnd_idx = np.random.permutation(len(X))
  n_batches = len(X) // batch_size
  for batch_idx in np.array_split(rnd_idx, n_batches):
    X_batch, y_batch, seq_length_batch = \
        X[batch_idx], y[batch_idx], seq_length[batch_idx]
    yield X_batch, y_batch, seq_length_batch

In [0]:
# Training the model and evaluating against a validation set.

n_epochs = 10
batch_size = 256
n_batches = len(X_train) // batch_size
max_rounds_since_best_loss = 100

with graph.as_default():
  with tf.Session() as sess:
    sess.run(init)
    best_loss = float('inf')
    rounds_since_best_loss = 0
    for epoch in range(n_epochs):
      print('Epoch:', epoch)
      mean_train_loss = 0.0
      for X_batch, y_batch, seq_length_batch in \
          shuffle_batch(X_train, y_train, seq_length_train, batch_size):
        _, loss_train_val = sess.run([training_op, loss], feed_dict={
            X: X_batch,
            y: y_batch,
            seq_length: seq_length_batch,
        })
        mean_train_loss += loss_train_val
        if loss_train_val < best_loss:
          best_loss = loss_train_val
          rounds_since_best_loss = 0
          saver.save(sess, 'rainfall_model.cpkt')
        else:
          rounds_since_best_loss += 1
      loss_valid_val = loss.eval(feed_dict={
          X: X_valid,
          y: y_valid,
          seq_length: seq_length_valid,
      })
      mean_train_loss /= n_batches
      print(
          'Validation loss: {:.4f}\nMean training loss: {:.4f}' \
              .format(loss_valid_val, mean_train_loss))
      print('======')
      if rounds_since_best_loss >= max_rounds_since_best_loss:
        print('Early stopping.')
        break
    else:
      saver.save(sess, 'rainfall_model.cpkt')

Epoch: 0
Validation loss: 12.8471
Mean training loss: 12.6746
Early stopping.
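The loop above counts training batches since the best training-batch loss, so with a limit of 100 batches it stops after the first epoch. A common alternative is to watch the validation loss once per epoch. The sketch below shows only that bookkeeping in plain Python. The function name and the list of losses are made up for illustration and stand in for real validation runs.

```python
def train_with_early_stopping(val_losses, patience):
    """Return (best_epoch, best_loss, stopped_epoch) for a stream of validation losses.

    `val_losses` stands in for one validation evaluation per epoch.
    `stopped_epoch` is None if training ran to the end.
    """
    best_loss = float('inf')
    best_epoch = None
    epochs_since_best = 0
    stopped_epoch = None
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_since_best = 0
            # A real training loop would save a checkpoint here.
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:
            stopped_epoch = epoch
            break
    return best_epoch, best_loss, stopped_epoch

# Made-up validation losses: the best value is at epoch 3.
print(train_with_early_stopping([3.0, 2.5, 2.6, 2.4, 2.7, 2.8, 2.9], patience=3))
# (3, 2.4, 6)
```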


In [0]:
# Getting the loss on the entire training set.

with graph.as_default():
  with tf.Session() as sess:
    saver.restore(sess, 'rainfall_model.cpkt')
    loss_val = loss.eval(feed_dict={
      X: X_train_valid,
      y: y_train_valid,
      seq_length: seq_length_train_valid,
    })
    print('Total loss:', loss_val)

Total loss: 12.640512


### 9. Go through [TensorFlow's Word2Vec tutorial](https://www.tensorflow.org/tutorials/representation/word2vec) to create word embeddings, and then go through the NMT tutorial to train a Vietnamese-to-English translation system.

See the **Embeddings** section of `RecurrentNeuralNetworks.ipynb` for an implementation of word embeddings based on TensorFlow's Word2Vec tutorial.

The book links to a tutorial that I could not find on TensorFlow's site, so instead I will go through [this tutorial](https://www.tensorflow.org/beta/tutorials/text/nmt_with_attention), which translates Spanish to English.

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

!pip install -q tensorflow-gpu==2.0.0-beta1
import tensorflow as tf

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split

import unicodedata
import re
import numpy as np
import os
import io
import time


In [0]:
# Download the file containing the training data.

origin = \
    'http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip'
path_to_zip = tf.keras.utils.get_file('spa-eng.zip', origin=origin,
                                      extract=True)

path_to_file = os.path.dirname(path_to_zip) + '/spa-eng/spa.txt'

Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip


In [0]:
# Preprocessing the data for training.

def unicode_to_ascii(s):
  return ''.join(c for c in unicodedata.normalize('NFD', s)
      if unicodedata.category(c) != 'Mn')
  
def preprocess_sentence(w):
  w = unicode_to_ascii(w.lower().strip())
  # Adds a space between a word and punctuation following it.
  w = re.sub(r"([?.!,¿])", r" \1 ", w)
  w = re.sub(r'[" "]+', " ", w)
  # Removing special characters and numbers.
  w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
  w = w.rstrip().strip()
  # Add start and end tokens.
  w = '<start> ' + w + ' <end>'
  return w

In [0]:
en_sentence = u"May I borrow this book?"
sp_sentence = u"¿Puedo tomar prestado este libro?"
print(preprocess_sentence(en_sentence))
print(preprocess_sentence(sp_sentence).encode('utf-8'))

<start> may i borrow this book ? <end>
b'<start> \xc2\xbf puedo tomar prestado este libro ? <end>'


In [0]:
# Create the dataset from the downloaded data.

def create_dataset(path, n_examples):
  lines = io.open(path, encoding='UTF-8').read().strip().split('\n')
  word_pairs = [[preprocess_sentence(w) for w in l.split('\t')]
                for l in lines[:n_examples]]
  return zip(*word_pairs)

In [0]:
# Testing the dataset generation.

en, sp = create_dataset(path_to_file, None)
print(en[-1])
print(sp[-1])

<start> if you want to sound like a native speaker , you must be willing to practice saying the same sentence over and over in the same way that banjo players practice the same phrase over and over until they can play it correctly and at the desired tempo . <end>
<start> si quieres sonar como un hablante nativo , debes estar dispuesto a practicar diciendo la misma frase una y otra vez de la misma manera en que un musico de banjo practica el mismo fraseo una y otra vez hasta que lo puedan tocar correctamente y en el tiempo esperado . <end>


In [0]:
# Defining some utility functions.

def max_length(tensor):
  return max(len(t) for t in tensor)

def tokenize(lang):
  lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
  lang_tokenizer.fit_on_texts(lang)

  tensor = lang_tokenizer.texts_to_sequences(lang)
  tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')

  return tensor, lang_tokenizer

def load_dataset(path, n_examples=None):
  targ_lang, inp_lang = create_dataset(path, n_examples)

  input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
  target_tensor, targ_lang_tokenizer = tokenize(targ_lang)

  return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer
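To see roughly what `tokenize()` produces, here is a plain-Python sketch of the two steps: building a word index and right-padding with zeros. It assumes, as the Keras `Tokenizer` does, that more frequent words get smaller ids and that id 0 is reserved for padding. The helper names `fit_vocab` and `encode_and_pad` are hypothetical.

```python
from collections import Counter

def fit_vocab(sentences):
    """Map each word to an integer id, most frequent first; 0 is kept for padding."""
    counts = Counter(word for s in sentences for word in s.split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def encode_and_pad(sentences, word_index):
    """Convert sentences to id lists and right-pad them with 0 ('post' padding)."""
    seqs = [[word_index[w] for w in s.split()] for s in sentences]
    max_len = max(len(seq) for seq in seqs)
    return [seq + [0] * (max_len - len(seq)) for seq in seqs]

sentences = ['<start> no te creo . <end>', '<start> hola . <end>']
word_index = fit_vocab(sentences)
encoded = encode_and_pad(sentences, word_index)
print(encoded)
```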

In [0]:
# Creating the data set.

n_examples = 30000
input_tensor, target_tensor, inp_lang, targ_lang = \
    load_dataset(path_to_file, n_examples)

max_length_targ, max_length_inp = \
    max_length(target_tensor), max_length(input_tensor)

In [0]:
# Splitting the data set into a training and a validation set.

input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = \
    train_test_split(input_tensor, target_tensor, test_size=0.2)

print(len(input_tensor_train), len(target_tensor_train), len(input_tensor_val),
      len(target_tensor_val))

24000 24000 6000 6000


In [0]:
def convert(lang, tensor):
  for t in tensor:
    if t != 0:
      print ("%d ----> %s" % (t, lang.index_word[t]))

print ("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print ()
print ("Target Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])

Input Language; index to word mapping
1 ----> <start>
8 ----> no
23 ----> te
126 ----> creo
3 ----> .
2 ----> <end>

Target Language; index to word mapping
1 ----> <start>
4 ----> i
30 ----> don
12 ----> t
291 ----> believe
6 ----> you
3 ----> .
2 ----> <end>


In [0]:
# Create a tf.data dataset.

BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train) // BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index) + 1
vocab_tar_size = len(targ_lang.word_index) + 1

dataset = tf.data.Dataset.from_tensor_slices(
    (input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

In [0]:
example_input_batch, example_target_batch = next(iter(dataset))
example_input_batch.shape, example_target_batch.shape

(TensorShape([64, 16]), TensorShape([64, 11]))

### Writing the encoder and decoder model

The following model implements an encoder-decoder model using the more recent TensorFlow 2 API. Below is a graph diagram of a model which implements the [attention equations discussed in Luong's paper](https://arxiv.org/abs/1508.04025v5):

<img width="500" src="https://www.tensorflow.org/images/seq2seq/attention_mechanism.jpg">

The model implemented below uses [Bahdanau attention](https://arxiv.org/pdf/1409.0473.pdf) in the decoder, which computes the attention scores differently from Luong's equations. Its inputs are the output of the encoder, which has the shape (_batch_size_, _max_length_, _hidden_size_), and a hidden state, which has the shape (_batch_size_, _hidden_size_): the encoder's final hidden state at the first decoding step and the decoder's hidden state afterwards. The attention mechanism is defined by the following equations:

$$ \alpha_{ts} = \frac{\exp{\left(\text{score}\left(\mathbf{h}_t, \bar{\mathbf{h}}_s \right)\right)}}{\sum\limits_{s'\,=\,1}^S \exp{\left(\text{score}\left(\mathbf{h}_t, \bar{\mathbf{h}}_{s'} \right)\right)}} \;\;\; (\text{Attention weights}) $$

$$ \mathbf{c}_t = \sum\limits_s \alpha_{ts} \bar{\mathbf{h}}_s \;\;\; (\text{Context vector}) $$

$$ \mathbf{a}_t = \tanh\left( \mathbf{W}_c \left[ \mathbf{c}_t ; \mathbf{h}_t \right] \right) \;\;\; (\text{Attention vector}) $$

$$ \text{score}\left( \mathbf{h}_t, \bar{\mathbf{h}}_s \right) = \left\{
  \begin{matrix}
    \mathbf{h}_t^{\,T} \mathbf{W} \, \bar{\mathbf{h}}_s && (\text{Luong's multiplicative equation}) \\
    \mathbf{v}_a^{\,T} \tanh\left( \mathbf{W}_1 \mathbf{h}_t + \mathbf{W}_2 \bar{\mathbf{h}}_s \right) && (\text{Bahdanau's additive equation})
  \end{matrix}
\right. $$
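The additive (Bahdanau) score, the attention weights and the context vector can be traced through in a few lines of NumPy. This is a sketch with random parameters, no biases and arbitrary sizes of my choosing, meant only to show the shapes at each step.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, max_len, hidden, units = 2, 4, 6, 3  # arbitrary sizes, for illustration

values = rng.normal(size=(batch, max_len, hidden))  # encoder outputs, h_bar_s
query = rng.normal(size=(batch, hidden))            # current hidden state, h_t

# Randomly initialized parameters; biases are omitted for brevity.
W1 = rng.normal(size=(hidden, units))
W2 = rng.normal(size=(hidden, units))
v_a = rng.normal(size=(units,))

# score(h_t, h_bar_s) = v_a^T tanh(W1 h_t + W2 h_bar_s) for every source step s.
score = np.tanh(query[:, None, :] @ W1 + values @ W2) @ v_a  # (batch, max_len)

# alpha_ts: softmax of the scores over the source positions.
exp = np.exp(score - score.max(axis=1, keepdims=True))
alpha = exp / exp.sum(axis=1, keepdims=True)                 # (batch, max_len)

# c_t = sum_s alpha_ts * h_bar_s
context = (alpha[:, :, None] * values).sum(axis=1)           # (batch, hidden)

print(score.shape, alpha.shape, context.shape)
```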

Below is a TensorFlow implementation of an Encoder-Decoder model which uses an attention vector. It uses the TensorFlow 2 API so it is a bit different from the other code in this repository.

In [0]:
# Defining the encoder, which embeds the input tokens and runs them through a GRU layer.

class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_size):
    super(Encoder, self).__init__()
    self.batch_size = batch_size
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.enc_units, return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
  
  def call(self, x, hidden):
    x = self.embedding(x)
    output, state = self.gru(x, initial_state=hidden)
    return output, state
  
  def initialize_hidden(self):
    return tf.zeros((self.batch_size, self.enc_units))

In [0]:
# Testing the encoder by inspecting the shape of the outputs.

encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

sample_hidden = encoder.initialize_hidden()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print ('Encoder output shape: (batch size, sequence length, units) {}'.format(
    sample_output.shape))
print ('Encoder Hidden state shape: (batch size, units) {}'.format(
    sample_hidden.shape))

Encoder output shape: (batch size, sequence length, units) (64, 16, 1024)
Encoder Hidden state shape: (batch size, units) (64, 1024)


In [0]:
# Implementing Bahdanau attention.

class BahdanauAttention(tf.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    # hidden shape == (batch_size, hidden size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden size)
    # we are doing this to perform addition to calculate the score
    hidden_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we are applying score to self.V
    # the shape of the tensor before applying self.V is (batch_size, max_length, units)
    score = self.V(tf.nn.tanh(
        self.W1(values) + self.W2(hidden_with_time_axis)))
    
    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights

In [0]:
# Instantiating the Bahdanau attention.

attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer(sample_hidden,
                                                      sample_output)

print("Attention result shape: (batch size, units) {}".format(
    attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(
    attention_weights.shape))

Attention result shape: (batch size, units) (64, 1024)
Attention weights shape: (batch_size, sequence_length, 1) (64, 16, 1)


In [0]:
# Decoder class which decodes the embedding into the target language.

class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_size):
    super(Decoder, self).__init__()
    self.batch_size = batch_size
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units, return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # output shape == (batch_size, 1, hidden_size)
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    x = self.fc(output)

    return x, state, attention_weights

In [0]:
# Testing the Decoder class.

decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

sample_decoder_output, _, _ = decoder(tf.random.uniform((BATCH_SIZE, 1)),
                                      sample_hidden, sample_output)

print ('Decoder output shape: (batch_size, vocab size) {}'.format(
    sample_decoder_output.shape))

Decoder output shape: (batch_size, vocab size) (64, 4935)


In [0]:
# Define the optimizer and loss function.

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,
                                                            reduction='none')

def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)
  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask
  return tf.reduce_mean(loss_)
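The effect of the mask is easier to see on a tiny example. The NumPy sketch below mirrors the logic of `loss_function`: it computes a per-token cross-entropy from logits, zeroes the entries whose target id is 0 (padding), and averages over all positions, padded ones included, as `tf.reduce_mean` does. The function name `masked_sparse_xent` is my own.

```python
import numpy as np

def masked_sparse_xent(real, logits):
    """Per-token cross-entropy with padding (id 0) zeroed out, then averaged."""
    # Log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-probability of the correct word at each position.
    per_token = -log_probs[np.arange(len(real)), real]
    mask = (real != 0).astype(per_token.dtype)
    return (per_token * mask).mean()

real = np.array([3, 0, 1])   # the middle token is padding
logits = np.zeros((3, 5))    # uniform predictions over a 5-word vocabulary
result = masked_sparse_xent(real, logits)
print(result)                # equals 2 * log(5) / 3
```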

In [0]:
# Checkpoints for saving during training.

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

In [0]:
# Defining the training operation for each iteration.

@tf.function
def train_step(inp, targ, enc_hidden):
  loss = 0

  with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)
    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE,
                               1)
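    # Teacher forcing: feed the true target token as the next decoder input.
    # Note: t starts at 0, so if targ begins with '<start>' the first step
    # is trained to predict '<start>' itself.
    for t in range(targ.shape[1]):
      predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
      loss += loss_function(targ[:,t], predictions)
      dec_input = tf.expand_dims(targ[:,t], 1)

  # Average the summed loss over the time steps for reporting.
  batch_loss = loss / int(targ.shape[1])
  # Compute and apply the gradients outside the tape context so these
  # operations are not recorded by the tape.
  variables = encoder.trainable_variables + decoder.trainable_variables
  gradients = tape.gradient(loss, variables)
  optimizer.apply_gradients(zip(gradients, variables))

  return batch_loss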
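
The teacher-forcing loop in `train_step` can be traced by hand with a small pure-Python sketch. It lists the (decoder input, expected output) pairs for one target sentence. The helper `teacher_forcing_pairs` is hypothetical, and the sketch assumes target sentences begin with `<start>` and end with `<end>`, as the preprocessed input sentences printed later in this notebook do.

```python
def teacher_forcing_pairs(target_tokens, first_step=0, start_token='<start>'):
    """Return (decoder_input, expected_output) pairs for one target sentence."""
    pairs = []
    dec_input = start_token
    for t in range(first_step, len(target_tokens)):
        pairs.append((dec_input, target_tokens[t]))
        # Teacher forcing: the next input is the ground-truth token,
        # not the model's own prediction.
        dec_input = target_tokens[t]
    return pairs

target = ['<start>', 'this', 'is', 'my', 'life', '.', '<end>']

pairs_from_0 = teacher_forcing_pairs(target, first_step=0)
pairs_from_1 = teacher_forcing_pairs(target, first_step=1)
```

Starting at step 0, as `train_step` does, the first pair is `('<start>', '<start>')`, so the decoder is trained to emit `<start>` first; this is consistent with the predicted translations in this notebook beginning with `<start>`. Starting at step 1 instead would make the first pair `('<start>', 'this')`.
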

In [0]:
# Training the model.

EPOCHS = 10

for epoch in range(EPOCHS):
  start = time.time()

  enc_hidden = encoder.initialize_hidden()
  total_loss = 0

  for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
    batch_loss = train_step(inp, targ, enc_hidden)
    total_loss += batch_loss

    if batch % 100 == 0:
      print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1, batch,
                                                   batch_loss.numpy()))
  
  if (epoch + 1) % 2 == 0:
    checkpoint.save(checkpoint_prefix)

  print('Epoch {} Loss {:.4f}'.format(epoch + 1, total_loss / steps_per_epoch))
  print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Epoch 1 Batch 0 Loss 5.3267
Epoch 1 Batch 100 Loss 2.1815
Epoch 1 Batch 200 Loss 1.9093
Epoch 1 Batch 300 Loss 1.7158
Epoch 1 Loss 2.0671
Time taken for 1 epoch 74.93305230140686 sec

Epoch 2 Batch 0 Loss 1.6474
Epoch 2 Batch 100 Loss 1.4590
Epoch 2 Batch 200 Loss 1.3656
Epoch 2 Batch 300 Loss 1.2044
Epoch 2 Loss 1.3617
Time taken for 1 epoch 45.92219686508179 sec

Epoch 3 Batch 0 Loss 1.2264
Epoch 3 Batch 100 Loss 1.0203
Epoch 3 Batch 200 Loss 0.9307
Epoch 3 Batch 300 Loss 0.7746
Epoch 3 Loss 0.9569
Time taken for 1 epoch 45.209630727767944 sec

Epoch 4 Batch 0 Loss 0.8824
Epoch 4 Batch 100 Loss 0.7105
Epoch 4 Batch 200 Loss 0.6011
Epoch 4 Batch 300 Loss 0.5303
Epoch 4 Loss 0.6672
Time taken for 1 epoch 45.702401876449585 sec

Epoch 5 Batch 0 Loss 0.6057
Epoch 5 Batch 100 Loss 0.4647
Epoch 5 Batch 200 Loss 0.3705
Epoch 5 Batch 300 Loss 0.3567
Epoch 5 Loss 0.4630
Time taken for 1 epoch 45.356289863586426 sec

Epoch 6 Batch 0 Loss 0.4209
Epoch 6 Batch 100 Loss 0.3320
Epoch 6 Batch 200 L

In [0]:
# Translating Spanish to English using the trained model.

def evaluate(sentence):
  attention_plot = np.zeros((max_length_targ, max_length_inp))
  sentence = preprocess_sentence(sentence)

  # Map each word to its id; a word missing from the vocabulary raises KeyError.
  inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
  inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                         maxlen=max_length_inp,
                                                         padding='post')
  inputs = tf.convert_to_tensor(inputs)

  result = ''
  hidden = [tf.zeros((1, units))]
  enc_out, enc_hidden = encoder(inputs, hidden)

  dec_hidden = enc_hidden
  dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)

  for t in range(max_length_targ):
    predictions, dec_hidden, attention_weights = decoder(dec_input, dec_hidden,
                                                         enc_out)
    attention_weights = tf.reshape(attention_weights, (-1,))
    attention_plot[t] = attention_weights.numpy()

    predicted_id = tf.argmax(predictions[0]).numpy()

    result += targ_lang.index_word[predicted_id] + ' '

    if targ_lang.index_word[predicted_id] == '<end>':
      return result, sentence, attention_plot

    dec_input = tf.expand_dims([predicted_id], 0)
    
  return result, sentence, attention_plot
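
The loop in `evaluate` is a greedy decoder: at each step it keeps only the highest-scoring token and feeds it back in as the next input. Stripped of the model, the control flow looks roughly like the sketch below. `greedy_decode`, `toy_step`, and the four-token vocabulary are hypothetical stand-ins; the real decoder also threads a hidden state and the encoder output through each step, which this sketch leaves out.

```python
def greedy_decode(step_fn, start_id, end_id, max_len):
    """Repeatedly pick the highest-scoring id until end_id or max_len steps."""
    result = []
    current = start_id
    for _ in range(max_len):
        scores = step_fn(current)
        # argmax over the vocabulary scores
        predicted = max(range(len(scores)), key=scores.__getitem__)
        result.append(predicted)
        if predicted == end_id:
            break
        # Unlike training, the model's own prediction is fed back in.
        current = predicted
    return result

# Hypothetical 4-token vocabulary: 0=<pad>, 1=<start>, 2=hello, 3=<end>.
TOY_SCORES = {
    1: [0.0, 0.1, 0.8, 0.1],   # after <start>, "hello" scores highest
    2: [0.0, 0.0, 0.2, 0.8],   # after "hello", <end> scores highest
}

def toy_step(token_id):
    return TOY_SCORES[token_id]

decoded = greedy_decode(toy_step, start_id=1, end_id=3, max_len=5)
```

Here `decoded` is `[2, 3]`, that is "hello" followed by `<end>`. Greedy decoding commits to one token per step, so it is fast but can miss a better overall sentence; beam search is the usual alternative when that matters.
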

In [0]:
# Translating a sentence.

def translate(sentence):
  result, sentence, attention_plot = evaluate(sentence)
  print('Input: ', sentence)
  print('Predicted translation: ', result)

In [0]:
# Restoring the model from the last checkpoint.

checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f76ea2faeb8>

In [0]:
# Some translation examples:

translate(u'esta es mi vida.')

Input:  <start> esta es mi vida . <end>
Predicted translation:  <start> this is my life . <end> 


In [0]:
translate(u'hace mucho frio aqui.')

Input:  <start> hace mucho frio aqui . <end>
Predicted translation:  <start> it s very cold here . <end> 


In [0]:
translate(u'¿todavia estan en casa?')

Input:  <start> ¿ todavia estan en casa ? <end>
Predicted translation:  <start> are you still home ? <end> 
