# Chapter 14: Recurrent Neural Networks

## Exercises

### 1. Can you think of a few applications for a sequence-to-sequence RNN? What about a sequence-to-vector RNN? And a vector-to-sequence RNN?

A sequence-to-sequence RNN is generally used to predict the future behavior of an input sequence. For example, it can drive a predictive model that suggests the next word you are about to type.

A sequence-to-vector RNN is good for classifying sequences, as in sentiment analysis. It can also be used to find embeddings of a vocabulary of words in a denser, lower-dimensional vector space.

In machine translation, a vector-to-sequence RNN is good for decoding an embedding of the input in one language into words in the target language. You can also use vector-to-sequence RNNs to generate captions for images.
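The three configurations differ only in which inputs are fed and which outputs are kept. The following is a minimal sketch in plain Python, assuming a toy scalar recurrent cell; the names `rnn_step`, `w_x`, and `w_h` and the weight values are illustrative and not part of the notebook.

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.8):
  """One step of a toy scalar RNN cell: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
  return math.tanh(w_x * x + w_h * h)

def seq_to_seq(xs):
  # Sequence-to-sequence: keep one output per input step.
  h, outputs = 0.0, []
  for x in xs:
    h = rnn_step(x, h)
    outputs.append(h)
  return outputs

def seq_to_vector(xs):
  # Sequence-to-vector: ignore every output except the last one.
  return seq_to_seq(xs)[-1]

def vector_to_seq(v, n_steps):
  # Vector-to-sequence: feed the input at the first step, then zeros.
  return seq_to_seq([v] + [0.0] * (n_steps - 1))

print(len(seq_to_seq([1.0, 2.0, 3.0])))  # one output per time step
print(seq_to_vector([1.0, 2.0, 3.0]))    # a single value for the whole sequence
print(len(vector_to_seq(1.0, 4)))        # a sequence grown from one input
```

In a real model the scalar state would be a vector and the weights would be learned, but the wiring of inputs and outputs stays the same.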

### 2. Why do people use encoder-decoder RNNs rather than plain sequence-to-sequence RNNs for automatic translation?

Most models for automatic translation start from a vocabulary encoded as one-hot vectors, so each word is a unit vector perpendicular to all the others. For a vocabulary of 50,000 words, this means each element of an input sequence is a vector in a 50,000-dimensional space. Training a sequence-to-sequence RNN for machine translation directly on such large vocabularies would take a large amount of memory, making it inefficient.

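Using an encoder-decoder model allows you to train the encoder to find a denser representation of the words, making training more efficient. Training the model to find an embedding also helps it learn which words are closely related to one another. In addition, an encoder-decoder reads the whole input sentence before it starts translating, whereas a plain sequence-to-sequence RNN would have to emit the first output word after reading only the first input word.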
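To see why a dense representation is cheaper than one-hot vectors, note that multiplying a one-hot vector by an embedding matrix simply selects one row of that matrix. The sketch below uses a made-up 4-word vocabulary and made-up 2-dimensional embeddings; the helper names `one_hot` and `matvec` are hypothetical.

```python
def one_hot(index, vocab_size):
  # A vector of zeros with a single 1.0 at the word's index.
  vec = [0.0] * vocab_size
  vec[index] = 1.0
  return vec

def matvec(vec, matrix):
  # Row vector times matrix, written out explicitly.
  n_cols = len(matrix[0])
  return [sum(vec[i] * matrix[i][j] for i in range(len(vec)))
          for j in range(n_cols)]

# One row per word in the vocabulary (illustrative values only).
embeddings = [
  [0.1, 0.3],
  [0.7, -0.2],
  [0.0, 0.5],
  [-0.4, 0.9],
]

word_id = 2
dense_from_one_hot = matvec(one_hot(word_id, len(embeddings)), embeddings)
dense_from_lookup = embeddings[word_id]
print(dense_from_one_hot == dense_from_lookup)
```

With 50,000 words the one-hot route touches 50,000 entries per word, while the lookup reads a single row, which is why embedding layers work from integer word IDs.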

### 3. How could you combine a convolutional neural network and an RNN to classify videos?

Since a video is a sequence of images, you could create a recurrent neural network in which each cell is built from convolutional layers that learn feature maps for each frame of the video. Each cell could learn one set of feature maps for the current input frame and another set for the previous output.

### 4. What are the advantages of building an RNN using `dynamic_rnn()` rather than `static_rnn()`?

The `static_rnn()` function creates new graph nodes for each time step in the sequence. If you are processing a sequence with a large number of steps, you therefore risk an out-of-memory (OOM) error when building your TensorFlow graph. The `dynamic_rnn()` function instead uses a while loop to run the same nodes once per time step. It also lets you swap memory between the GPU and the CPU through the `swap_memory` parameter, and it accepts a single tensor as input instead of a list of tensors, one per time step.

### 5. How can you deal with variable-length input sequences? What about variable-length output sequences?

The `dynamic_rnn()` function takes a `sequence_length` parameter, a 1D tensor of integers giving the length of each input sequence. Input sequences shorter than the longest sequence are padded with zeros.

For variable-length output sequences, it is generally not possible to know in advance how long each output will be, so each output sequence ends with an end-of-sequence (EOS) token that marks where it stops.
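Both ideas can be sketched without TensorFlow. In the sketch below, `pad_batch` returns the padded batch together with the true lengths (the values one would feed to `sequence_length`), and `truncate_at_eos` cuts an output at its first EOS marker. The function names and the `'<EOS>'` string are assumptions made for illustration.

```python
def pad_batch(sequences, pad_value=0.0):
  """Pad every sequence up to the longest one and return the true lengths."""
  lengths = [len(s) for s in sequences]
  max_len = max(lengths)
  padded = [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]
  return padded, lengths

def truncate_at_eos(tokens, eos='<EOS>'):
  """Keep only the tokens before the first EOS marker, if there is one."""
  return tokens[:tokens.index(eos)] if eos in tokens else tokens

padded, lengths = pad_batch([[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]])
print(padded)
print(lengths)
print(truncate_at_eos(['a', 'b', '<EOS>', 'c']))
```

Keeping the lengths alongside the padded batch matters because the zeros carry no information; the RNN needs the lengths to know where each real sequence ends.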

### 6. What is a common way to distribute training and execution of a deep RNN across multiple GPUs?

To distribute an RNN across devices, you cannot simply create each cell inside a `tf.device()` block. This is because TensorFlow's built-in RNN cell classes such as `BasicRNNCell` do not create the graph nodes themselves; they are cell factories.

In order to distribute an RNN across devices, you must define a new cell factory which actually creates each cell on a separate device. For an example, see `DeviceCellWrapper` in `RecurrentNeuralNetworks.ipynb`.

### 7. _Embedded Reber grammars_ are artificial grammars used to produce strings. Train an RNN to identify whether or not a string respects the grammar discussed in [Jenny Orr's introduction](http://www.willamette.edu/~gorr/classes/cs449/reber.html). You will first need to write a function capable of generating a training batch containing about 50% strings that respect the grammar and 50% that do not.

In [0]:
# Define a function to generate Reber grammar strings. Each entry in the list
# is a node in the graph. Each (character, index) pair in an entry is an edge
# to an adjacent node.

import numpy as np

# Adjacency list for the graph.
reber_grammar_graph = [
  [('B', 1)],
  [('T', 2), ('P', 3)],
  [('S', 2), ('X', 4)],
  [('T', 3), ('V', 5)],
  [('X', 3), ('S', 6)],
  [('P', 4), ('V', 6)],
  [('E', None)],
]

def generate_reber_grammar():
  idx = 0
  result = ''
  while idx is not None:
    chars = reber_grammar_graph[idx]
    c, idx = chars[np.random.randint(0, len(chars))]
    result += c
  return result

In [0]:
generate_reber_grammar()

'BTSXXTTTTTTTTTTTVVE'

In [0]:
# Defining a function for generating embedded Reber grammar.

REBER_GRAPH = 'reber_graph'

embedded_reber_grammar_graph = [
  [('B', 1)],
  [('T', 2), ('P', 3)],
  [(REBER_GRAPH, 4)],
  [(REBER_GRAPH, 5)],
  [('T', 6)],
  [('P', 6)],
  [('E', None)]
]

def generate_embedded_reber_grammar():
  idx = 0
  result = ''
  while idx is not None:
    chars = embedded_reber_grammar_graph[idx]
    c, idx = chars[np.random.randint(0, len(chars))]
    result += c if c != REBER_GRAPH else generate_reber_grammar()
  return result

In [0]:
generate_embedded_reber_grammar()

'BPBPVPXVVEPE'

In [0]:
# Generate a corrupted string by creating an embedded Reber grammar string
# and then changing a single character.

def generate_corrupted_string():
  erg_string = generate_embedded_reber_grammar()
  chars = set(erg_string)
  idx = np.random.randint(0, len(erg_string))
  bad_char = np.random.choice(list(chars - set(erg_string[idx])))
  return '{}{}{}'.format(erg_string[:idx], bad_char, erg_string[idx+1:])

In [0]:
generate_corrupted_string()

'BPBPPVVEPE'
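One way to sanity-check the generators is a small recognizer that walks the same graph deterministically. This is a sketch rather than part of the original notebook: the `REBER` transition table mirrors the adjacency list defined earlier, and the names `is_reber` and `is_embedded_reber` are hypothetical.

```python
# Transition table mirroring the adjacency list: state -> {char: next state}.
REBER = [
  {'B': 1},
  {'T': 2, 'P': 3},
  {'S': 2, 'X': 4},
  {'T': 3, 'V': 5},
  {'X': 3, 'S': 6},
  {'P': 4, 'V': 6},
  {'E': None},
]

def is_reber(s):
  # Follow one edge per character; accept only if 'E' was the last edge taken.
  state = 0
  for c in s:
    if state is None or c not in REBER[state]:
      return False
    state = REBER[state][c]
  return state is None

def is_embedded_reber(s):
  # Shape: 'B', then 'T' or 'P', an inner Reber string, the same 'T' or 'P', 'E'.
  if len(s) < 4 or s[0] != 'B' or s[-1] != 'E':
    return False
  if s[1] not in 'TP' or s[-2] != s[1]:
    return False
  return is_reber(s[2:-2])

print(is_embedded_reber('BPBPVPXVVEPE'))  # the valid sample shown earlier
print(is_embedded_reber('BPBPPVVEPE'))    # the corrupted sample shown earlier
```

Running such a check over a generated batch is a cheap way to confirm that the `True` and `False` labels match what the grammar actually accepts.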

In [0]:
# One-hot encoding each string.

char_to_idx_map = {c: i for i, c in enumerate('BEPSTVX')}

def char_to_vector(c):
  data = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
  data[char_to_idx_map[c]] = 1.0
  return data

def one_hot_encode(erg_str):
  return np.array([char_to_vector(c) for c in erg_str], dtype=np.float32)

In [0]:
one_hot_encode(generate_embedded_reber_grammar())

array([[1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.]], dtype=float32)

In [0]:
# Pad a one-hot encoded embedded Reber grammar string.
def pad_zeros(ohe_erg_str, length):
  str_length = len(ohe_erg_str)
  if str_length > length:
    raise Exception(
        'the 2nd argument of pad_zeros must be gte the length of the first '
        'argument')
  for i in range(length - str_length):
    ohe_erg_str = \
        np.concatenate((ohe_erg_str, [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]))
  return ohe_erg_str

In [0]:
pad_zeros(one_hot_encode(generate_embedded_reber_grammar())[:10], 15)

array([[1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.]])

In [0]:
# Generate a single training batch.

def generate_batch(batch_size, max_seq_length):
  good_batch = []
  bad_batch = []
  while len(good_batch) < batch_size / 2:
    seq = one_hot_encode(generate_embedded_reber_grammar())
    if len(seq) > max_seq_length:
      continue
    good_batch.append((seq, True))
  while len(bad_batch) < batch_size / 2:
    seq = one_hot_encode(generate_corrupted_string())
    if len(seq) > max_seq_length:
      continue
    bad_batch.append((seq, False))
  batch = []
  seq_lengths = []
  labels = []
  all_seqs = good_batch + bad_batch
  np.random.shuffle(all_seqs)
  for seq, is_valid in all_seqs:
    batch.append(pad_zeros(seq, max_seq_length))
    seq_lengths.append(len(seq))
    labels.append([int(is_valid)])
  return np.array(batch, dtype=np.float32), \
      np.array(labels, dtype=np.int32), \
      np.array(seq_lengths, dtype=np.int32)

In [0]:
# Write the TensorFlow graph

import tensorflow as tf

n_chars = 7
max_sequence_length = 20
n_outputs = 1
n_neurons = 100
learning_rate = 0.01

graph = tf.Graph()

with graph.as_default():
  X = tf.placeholder(tf.float32, (None, max_sequence_length, n_chars))
  y = tf.placeholder(tf.float32, (None, 1))
  seq_length = tf.placeholder(tf.int32, (None))

  cell = tf.contrib.rnn.GRUCell(num_units=n_neurons)
  _, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32,
                                sequence_length=seq_length)
  logits = tf.layers.dense(states, n_outputs)

  xentropy = \
      tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.cast(y, tf.float32),
                                              logits=logits)
  loss = tf.reduce_mean(xentropy)
  opt = tf.train.AdamOptimizer(learning_rate)
  training_op = opt.minimize(loss)

  y_pred = tf.cast(tf.greater(logits, 0.0), tf.float32)
  accuracy = tf.reduce_mean(tf.cast(tf.equal(y, y_pred), tf.float32))

  init = tf.global_variables_initializer()

In [0]:
# Training a model to recognize embedded Reber grammars.

n_epochs = 50
n_batches = 25
batch_size = 100
validation_set_size = 200

with graph.as_default():
  with tf.Session() as sess:
    sess.run(init)
    X_valid, y_valid, seq_len_valid = \
        generate_batch(validation_set_size, max_sequence_length)
    valid_feed_dict = {
        X: X_valid,
        y: y_valid,
        seq_length: seq_len_valid,
    }
    for epoch in range(n_epochs):
      for _ in range(n_batches):
        X_batch, y_batch, seq_len_batch = \
            generate_batch(batch_size, max_sequence_length)
        sess.run(training_op, feed_dict={
            X: X_batch,
            y: y_batch,
            seq_length: seq_len_batch,
        })
      if epoch % 5 == 0:
        loss_val = loss.eval(feed_dict=valid_feed_dict)
        acc_val = accuracy.eval(feed_dict=valid_feed_dict)
        print('Epoch: {}\tLoss: {}\tAccuracy: {}'.format(epoch, loss_val,
                                                         acc_val))

Epoch: 0	Loss: 0.6584188938140869	Accuracy: 0.6449999809265137
Epoch: 5	Loss: 0.6347793340682983	Accuracy: 0.4300000071525574
Epoch: 10	Loss: 0.13499197363853455	Accuracy: 0.9599999785423279
Epoch: 15	Loss: 0.08909790217876434	Accuracy: 0.9750000238418579
Epoch: 20	Loss: 0.04386292025446892	Accuracy: 0.9900000095367432
Epoch: 25	Loss: 0.0029373581055551767	Accuracy: 1.0
Epoch: 30	Loss: 0.001209043781273067	Accuracy: 1.0
Epoch: 35	Loss: 0.0013871783157810569	Accuracy: 1.0
Epoch: 40	Loss: 0.00021878271945752203	Accuracy: 1.0
Epoch: 45	Loss: 0.0001390562829328701	Accuracy: 1.0


### 8. Tackle the ["How much did it rain? II" Kaggle competition](https://www.kaggle.com/c/how-much-did-it-rain-ii). This is a time series prediction task.

[Luis Andre Dutra e Silva's interview](http://blog.kaggle.com/2015/12/17/how-much-did-it-rain-ii-2nd-place-luis-andre-dutra-e-silva/) shows some insights that he used to reach second place in the competition.

In [0]:
# First install the Kaggle API and get the data. The kaggle.json credentials
# file is stored as a local file in the Colab kernel to avoid exposing it.

!pip install kaggle
!mkdir -p /root/.kaggle
!mv kaggle.json /root/.kaggle
!kaggle config path -p .
!kaggle competitions download -c how-much-did-it-rain-ii

In [0]:
import zipfile
import os

for f in os.listdir():
  if f.endswith('.zip'):
    # The context manager closes the archive even if extraction fails.
    with zipfile.ZipFile(f, mode='r') as zip_ref:
      zip_ref.extractall()

In [0]:
# Load the training data.

import pandas as pd

training_df = pd.read_csv('train.csv')
training_df = training_df.dropna().reset_index()

In [0]:
training_df.head()

Unnamed: 0,index,Id,minutes_past,radardist_km,Ref,Ref_5x5_10th,Ref_5x5_50th,Ref_5x5_90th,RefComposite,RefComposite_5x5_10th,RefComposite_5x5_50th,RefComposite_5x5_90th,RhoHV,RhoHV_5x5_10th,RhoHV_5x5_50th,RhoHV_5x5_90th,Zdr,Zdr_5x5_10th,Zdr_5x5_50th,Zdr_5x5_90th,Kdp,Kdp_5x5_10th,Kdp_5x5_50th,Kdp_5x5_90th,Expected
0,6,2,1,2.0,9.0,5.0,7.5,10.5,15.0,10.5,16.5,23.5,0.998333,0.998333,0.998333,0.998333,0.375,-0.125,0.3125,0.875,1.059998,-1.410004,-0.350006,1.059998,1.016
1,9,2,16,2.0,18.0,14.0,17.5,21.0,20.5,18.0,20.5,23.0,0.995,0.995,0.998333,1.001667,0.25,0.125,0.375,0.6875,0.349991,-1.059998,0.0,1.059998,1.016
2,10,2,21,2.0,24.5,16.5,21.0,24.5,24.5,21.0,24.0,28.0,0.998333,0.995,0.998333,0.998333,0.25,0.0625,0.1875,0.5625,-0.350006,-1.059998,-0.350006,1.759994,1.016
3,11,2,26,2.0,12.0,12.0,16.0,20.0,16.5,17.0,19.0,21.0,0.998333,0.995,0.998333,0.998333,0.5625,0.25,0.4375,0.6875,-1.76001,-1.76001,-0.350006,0.709991,1.016
4,12,2,31,2.0,22.5,19.0,22.0,25.0,26.0,23.5,25.5,27.5,0.998333,0.995,0.998333,1.001667,0.0,-0.1875,0.25,0.625,-1.059998,-2.12001,-0.710007,0.349991,1.016


In [0]:
# Defining a function to prepare the data.

import numpy as np

features = [
  'minutes_past',
  'radardist_km',
  'Ref',
  'Ref_5x5_10th',
  'Ref_5x5_50th',
  'Ref_5x5_90th',
  'RefComposite',
  'RefComposite_5x5_10th',
  'RefComposite_5x5_50th',
  'RefComposite_5x5_90th',
  'RhoHV',
  'RhoHV_5x5_10th',
  'RhoHV_5x5_50th',
  'RhoHV_5x5_90th',
  'Zdr',
  'Zdr_5x5_10th',
  'Zdr_5x5_50th',
  'Zdr_5x5_90th',
  'Kdp',
  'Kdp_5x5_10th',
  'Kdp_5x5_50th',
  'Kdp_5x5_90th',
]

def prepare_data_for_model(df):
  # Group the rows into one sequence per Id, each sorted by minutes_past.
  sequences = dict()
  max_len = 1
  for i, row in df.iterrows():
    entry = [row[k] for k in features]
    try:
      sequences[row['Id']].append((entry, row['Expected']))
      if len(sequences[row['Id']]) > max_len:
        max_len = len(sequences[row['Id']])
    except KeyError:
      sequences[row['Id']] = [(entry, row['Expected'])]
  data, outputs, seq_lengths = [], [], []
  for i in sequences:
    outputs.append([sequences[i][0][1]])
    seq_lengths.append(len(sequences[i]))
    S = sorted(sequences[i], key=lambda r: r[0][0])
    data_entry = []
    for entry, _ in S:
      data_entry.append(entry)
    for _ in range(max_len - len(S)):
      data_entry.append([0.0] * len(features))
    data.append(data_entry)
  return np.array(data, dtype=np.float32), \
      np.array(outputs, dtype=np.float32), \
      np.array(seq_lengths, dtype=np.int32)

In [0]:
X_train_valid, y_train_valid, seq_length_train_valid = \
    prepare_data_for_model(training_df)

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid, seq_length_train, seq_length_valid = \
    train_test_split(X_train_valid, y_train_valid, seq_length_train_valid,
                     test_size=2000)

In [0]:
# Defining the model graph using 2 layers of LSTM cells.

import tensorflow as tf

n_steps = max(seq_length_train)
n_features = len(features)
n_neurons = 200
n_layers = 2
n_outputs = 1
learning_rate = 0.1

graph = tf.Graph()

with graph.as_default():
  X = tf.placeholder(tf.float32, (None, n_steps, n_features))
  y = tf.placeholder(tf.float32, (None, 1))
  seq_length = tf.placeholder(tf.int32, (None))

  lstm_cells = [tf.nn.rnn_cell.BasicLSTMCell(num_units=n_neurons)
                for _ in range(n_layers)]
  multi_cell = tf.nn.rnn_cell.MultiRNNCell(lstm_cells)
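  # Pass the true lengths so that zero-padded steps do not affect the state.
  outputs, states = tf.nn.dynamic_rnn(multi_cell, X, dtype=tf.float32,
                                      sequence_length=seq_length)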
  top_layer_h_states = states[-1][1]
  logits = tf.layers.dense(top_layer_h_states, n_outputs)

  loss = tf.reduce_mean(tf.abs(logits - y))
  opt = tf.train.AdamOptimizer(learning_rate)
  training_op = opt.minimize(loss)

  saver = tf.train.Saver()
  init = tf.global_variables_initializer()

In [0]:
# Function for generating random batches.

def shuffle_batch(X, y, seq_length, batch_size):
  rnd_idx = np.random.permutation(len(X))
  n_batches = len(X) // batch_size
  for batch_idx in np.array_split(rnd_idx, n_batches):
    X_batch, y_batch, seq_length_batch = \
        X[batch_idx], y[batch_idx], seq_length[batch_idx]
    yield X_batch, y_batch, seq_length_batch

In [0]:
# Training the model and evaluating against a validation set.

n_epochs = 10
batch_size = 256
n_batches = len(X_train) // batch_size
max_rounds_since_best_loss = 100

with graph.as_default():
  with tf.Session() as sess:
    sess.run(init)
    best_loss = float('inf')
    rounds_since_best_loss = 0
    for epoch in range(n_epochs):
      print('Epoch:', epoch)
      mean_train_loss = 0.0
      for X_batch, y_batch, seq_length_batch in \
          shuffle_batch(X_train, y_train, seq_length_train, batch_size):
        _, loss_train_val = sess.run([training_op, loss], feed_dict={
            X: X_batch,
            y: y_batch,
            seq_length: seq_length_batch,
        })
        mean_train_loss += loss_train_val
        if loss_train_val < best_loss:
          best_loss = loss_train_val
          rounds_since_best_loss = 0
          saver.save(sess, 'rainfall_model.cpkt')
        else:
          rounds_since_best_loss += 1
      loss_valid_val = loss.eval(feed_dict={
          X: X_valid,
          y: y_valid,
          seq_length: seq_length_valid,
      })
      mean_train_loss /= n_batches
      print(
          'Validation loss: {:.4f}\nMean training loss: {:.4f}' \
              .format(loss_valid_val, mean_train_loss))
      print('======')
      if rounds_since_best_loss >= max_rounds_since_best_loss:
        print('Early stopping.')
        break
    else:
      saver.save(sess, 'rainfall_model.cpkt')

Epoch: 0
Validation loss: 12.8471
Mean training loss: 12.6746
Early stopping.


In [0]:
# Getting the loss on the entire training set.

with graph.as_default():
  with tf.Session() as sess:
    saver.restore(sess, 'rainfall_model.cpkt')
    loss_val = loss.eval(feed_dict={
      X: X_train_valid,
      y: y_train_valid,
      seq_length: seq_length_train_valid,
    })
    print('Total loss:', loss_val)

Total loss: 12.640512
