# Hackathon 9

Written by Eleanor Quint
with images sourced from [Chris Olah's blog post on LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).

Topics:
- TensorFlow RNNs and Cells
- LSTM

In today's demo, we'll teach an RNN how to speak English.

This is all set up in an IPython notebook, so you can run any code you want to experiment with. Feel free to edit any cell, or add new ones to run your own code.

In [None]:
# We'll start with our library imports...
from __future__ import print_function

import random
import os  # to work with file paths

import tensorflow as tf         # to specify and run computation graphs
from tensorflow import keras
import numpy as np              # for numerical operations taking place outside of the TF graph
import matplotlib.pyplot as plt # to draw plots

#### RNN/LSTM theory recap

Recurrent neural networks (RNNs) are computation graphs with loops (i.e., not directed acyclic graphs). Because the backpropagation algorithm only works with DAGs, we have to unroll the RNN through time. TensorFlow provides code that handles this automatically.
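To make "unrolling through time" concrete, here is a minimal NumPy sketch. It assumes a plain tanh RNN cell rather than an LSTM, and the function name `unroll_rnn`, the weight names, and the toy sizes are all hypothetical; it only illustrates that the loop becomes a chain of identical steps sharing one set of weights.

```python
import numpy as np

def unroll_rnn(inputs, h0, W_x, W_h, b):
    """Apply the same tanh RNN cell at every timestep of `inputs`.

    inputs: array of shape [timesteps, input_size]
    h0:     initial hidden state of shape [hidden_size]
    """
    h = h0
    hidden_states = []
    for x_t in inputs:                          # one loop iteration per timestep
        h = np.tanh(x_t @ W_x + h @ W_h + b)    # the same weights are reused at each step
        hidden_states.append(h)
    return np.stack(hidden_states)              # shape [timesteps, hidden_size]

rng = np.random.default_rng(0)
TIMESTEPS, INPUT_SIZE, HIDDEN_SIZE = 5, 3, 4    # hypothetical toy sizes
inputs = rng.normal(size=(TIMESTEPS, INPUT_SIZE))
W_x = rng.normal(size=(INPUT_SIZE, HIDDEN_SIZE))
W_h = rng.normal(size=(HIDDEN_SIZE, HIDDEN_SIZE))
b = np.zeros(HIDDEN_SIZE)

states = unroll_rnn(inputs, np.zeros(HIDDEN_SIZE), W_x, W_h, b)
print(states.shape)
```

Once the loop is written out like this, the computation for a fixed number of timesteps is an ordinary acyclic graph, which is what backpropagation needs.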

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png" width="80%">


The most common RNN unit is the LSTM, depicted below:

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png" width="80%">

We can see that each unit takes three inputs and produces three outputs: two that are forwarded to the same unit at the next timestep, and one true output, $h_t$, depicted coming out of the top of the cell.

The upper right output going to the next timestep is the cell state. It carries long-term information between cells, and is calculated as: 

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-C.png" width="80%">

where the first term uses the forget gate $f_t$ to scale the previous state (potentially shrinking it to "forget" it), and the second term is the product of the update gate $i_t$ and the candidate state update $\tilde{C}_t$. Both the forget and update gates are activated with a sigmoid, so their values lie in (0, 1).
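Written out, the update shown in the figure is as follows. The notation follows the linked blog post, with $[h_{t-1}, x_t]$ denoting concatenation and $\odot$ (an assumption of notation here) denoting elementwise multiplication:

$$
\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
\end{aligned}
$$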

The true output and the second, lower output on the diagram are calculated by the output gate:

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-o.png" width="80%">

First, $o_t$ is calculated from the output of the previous timestep concatenated with the current input; it is then multiplied by the $\tanh$ of the cell state to get the true output. Passing this output on to the next timestep as the hidden state gives the unit a kind of short-term memory.
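Putting the gates together, a single LSTM timestep can be sketched in NumPy as below. This is an illustration under stated assumptions, not TensorFlow's implementation: the function `lstm_step`, the packing of all four weight blocks into one matrix `W`, and the order of those blocks are hypothetical choices made for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM timestep.

    W has shape [hidden + input, 4 * hidden] and b has shape [4 * hidden].
    The four column blocks hold (in this sketch's own order) the forget,
    update, candidate and output weights.
    """
    hidden = h_prev.shape[-1]
    z = np.concatenate([h_prev, x_t], axis=-1) @ W + b
    f_t = sigmoid(z[..., 0 * hidden:1 * hidden])      # forget gate
    i_t = sigmoid(z[..., 1 * hidden:2 * hidden])      # update gate
    c_tilde = np.tanh(z[..., 2 * hidden:3 * hidden])  # candidate state update
    o_t = sigmoid(z[..., 3 * hidden:4 * hidden])      # output gate
    c_t = f_t * c_prev + i_t * c_tilde                # new cell state (long-term memory)
    h_t = o_t * np.tanh(c_t)                          # true output / hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
BATCH, INPUT_SIZE, HIDDEN = 2, 3, 4                   # hypothetical toy sizes
W = rng.normal(scale=0.1, size=(HIDDEN + INPUT_SIZE, 4 * HIDDEN))
b = np.zeros(4 * HIDDEN)
x_t = rng.normal(size=(BATCH, INPUT_SIZE))
h_prev = np.zeros((BATCH, HIDDEN))
c_prev = np.zeros((BATCH, HIDDEN))

h_t, c_t = lstm_step(x_t, h_prev, c_prev, W, b)
print(h_t.shape, c_t.shape)
```

The pair `(c_t, h_t)` is what gets forwarded to the same unit at the next timestep, while `h_t` alone is the output seen by later layers.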

(Images sourced from [Colah's Blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/))

Today, we're going to teach a recurrent model how to speak English by starting from a sequence of words and asking the model to predict what the next word should be. And what better way to learn English than by learning to talk like an angry media reviewer on the internet? We'll be using the IMDB review corpus for this task.

In [None]:
imdb = keras.datasets.imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

The data is encoded as a list of integers. We'll get the mapping of words to integer values in order to be able to translate the words back and forth.

In [None]:
# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()

# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()} 
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text if i != 0])
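To see how the reserved indices interact with the shifted vocabulary, here is a small round-trip sketch. It uses a made-up four-word vocabulary in place of `imdb.get_word_index()`, and the helper names `encode_text` and `decode_ids` are hypothetical.

```python
# Hypothetical toy vocabulary standing in for imdb.get_word_index()
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Shift every index by 3 to make room for the reserved tokens
toy_word_index = {word: index + 3 for word, index in toy_word_index.items()}
toy_word_index["<PAD>"] = 0
toy_word_index["<START>"] = 1
toy_word_index["<UNK>"] = 2
toy_word_index["<UNUSED>"] = 3

toy_reverse_index = {index: word for word, index in toy_word_index.items()}

def encode_text(text):
    """Map a whitespace-separated string to integers, beginning with <START>."""
    unk = toy_word_index["<UNK>"]
    words = text.lower().split()
    return [toy_word_index["<START>"]] + [toy_word_index.get(w, unk) for w in words]

def decode_ids(ids):
    """Map integers back to words, skipping padding."""
    return " ".join(toy_reverse_index.get(i, "?") for i in ids if i != 0)

encoded = encode_text("The movie was surprisingly great")
print(encoded)
print(decode_ids(encoded + [0, 0]))
```

Words outside the vocabulary ("surprisingly" here) collapse to `<UNK>`, so decoding is lossy for rare words.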

Then we'll pad the text sequences so that they're all the same length and easier to work with.

In [None]:
SEQ_LEN = 256

train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=SEQ_LEN)
train_targets = keras.preprocessing.sequence.pad_sequences(train_data[:,1:],
                                                        value=word_index["<START>"],
                                                        padding='post',
                                                        maxlen=SEQ_LEN)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=SEQ_LEN)
test_targets = keras.preprocessing.sequence.pad_sequences(test_data[:,1:],
                                                        value=word_index["<START>"],
                                                        padding='post',
                                                        maxlen=SEQ_LEN)

print("Training data has shape {}".format(train_data.shape))
print("Training targets has shape {}".format(train_targets.shape))
print("Testing data has shape {}".format(test_data.shape))
print("Testing targets has shape {}".format(test_targets.shape))

In [None]:
# visualize some of the data
idx = random.randrange(train_data.shape[0])
print(train_data[idx])
print(decode_review(train_data[idx]))
print(train_targets[idx])
print(decode_review(train_targets[idx]))

Each datum is a sequence of up to 256 successive words from the corpus, and the target is a similar window, but shifted forward by one word. This is set up to train the model to predict, given the preceding words, what the next word in the sequence will be.
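The relationship between an input window and its target can be sketched on a tiny example. The helper `pad_post`, the toy length, and the word indices below are hypothetical; the helper only approximates what `pad_sequences` does with `padding='post'`.

```python
PAD, START = 0, 1
TOY_SEQ_LEN = 6   # hypothetical; the notebook uses 256

def pad_post(sequence, length, value):
    """Keep at most the last `length` entries, then right-pad with `value`."""
    clipped = list(sequence)[-length:]
    return clipped + [value] * (length - len(clipped))

review = [START, 14, 22, 16]                          # made-up word indices
inputs = pad_post(review, TOY_SEQ_LEN, PAD)           # the input window
targets = pad_post(inputs[1:], TOY_SEQ_LEN, START)    # the same window shifted by one

print(inputs)
print(targets)
```

At every position `t` except the last, `targets[t]` equals `inputs[t + 1]`, which is the "next word" the model is asked to predict.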

Initially, each word in the sequence is represented as an integer (notice the shape). This discrete representation fails to capture any semantic relationships between words; i.e., the model wouldn't know that "crimson" and "scarlet" are more similar than "red" and "blue". The solution is to learn a word embedding as the first part of the model, transforming each integer into a relatively small, dense vector (as compared to a one-hot encoding). Similar words will then train to have similar embeddings.

We'll use [tf.nn.embedding_lookup](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup) to do this, passing it a (usually trainable) VOCAB_SIZE x EMBEDDING_SIZE matrix.

In [None]:
VOCAB_SIZE = max(word_index.values())
EMBEDDING_SIZE = int(np.sqrt(VOCAB_SIZE))
print("Vocab size is {} and is embedded into {} dimensions".format(VOCAB_SIZE, EMBEDDING_SIZE))

# setup input and embedding
input_ph = tf.placeholder(tf.int32, [None, SEQ_LEN])
target_ph = tf.placeholder(tf.int32, [None, SEQ_LEN])

with tf.variable_scope("embedding", reuse=tf.AUTO_REUSE):
    embedding_matrix = tf.get_variable('embedding_matrix', dtype=tf.float32, shape=[VOCAB_SIZE, EMBEDDING_SIZE],
                                       trainable=True)
word_embeddings = tf.nn.embedding_lookup(embedding_matrix, input_ph)
print("The output of the word embedding: " + str(word_embeddings))
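Conceptually, an embedding lookup is just row selection from the embedding matrix. The NumPy sketch below (toy sizes and random values are assumptions for illustration) shows that indexing gives the same result as multiplying one-hot vectors by the matrix, without ever building the one-hot vectors.

```python
import numpy as np

TOY_VOCAB_SIZE, TOY_EMBEDDING_SIZE = 6, 3        # hypothetical toy sizes
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(TOY_VOCAB_SIZE, TOY_EMBEDDING_SIZE))

word_ids = np.array([[1, 4, 0], [2, 2, 5]])      # shape [batch, timesteps]

# Lookup by indexing: each integer selects one row of the matrix
looked_up = embedding_matrix[word_ids]           # shape [batch, timesteps, embedding]

# The same result via explicit one-hot vectors and a matrix product
one_hot = np.eye(TOY_VOCAB_SIZE)[word_ids]       # shape [batch, timesteps, vocab]
via_matmul = one_hot @ embedding_matrix

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```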

Now we want to declare an architecture that looks like this (replacing GRU with LSTM and char embedding with word embedding).

<img src="https://tensorflow.org/tutorials/sequences/images/text_generation_training.png" width="80%">

TensorFlow separates the declaration of [RNNCells](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/RNNCell) from the [RNNs](https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn) that run them. In the code below, we declare an [LSTM cell](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicLSTMCell) and create tensors for the initial state fed to the first unit. We use zeros for the initial hidden state and cell state, but it's also possible to declare trainable variables for these.

In [None]:
LSTM_SIZE = 200 # number of units in the LSTM layer, this number taken from a "small" language model
BATCH_SIZE = 64

lstm_cell = tf.contrib.rnn.LSTMCell(LSTM_SIZE)

# Initial state of the LSTM memory.
initial_state = lstm_cell.zero_state(BATCH_SIZE, tf.float32)
print("Initial state of the LSTM: " + str(initial_state))

Then, we'll pass the newly declared cell and the training sequence of word embeddings to [tf.nn.dynamic_rnn](https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn) as the inputs over time to the LSTM. `dynamic_rnn` runs an `RNNCell` using an internal `while` loop, and returns the sequence of outputs from the LSTM at each timestep and the final state of the LSTM.

In [None]:
# setup RNN
outputs, state = tf.nn.dynamic_rnn(lstm_cell, word_embeddings,
                                   initial_state=initial_state,
                                   dtype=tf.float32)
print("The outputs over all timesteps: "+ str(outputs))
print("The final state of the LSTM layer: " + str(state))
logits = tf.layers.dense(outputs, VOCAB_SIZE)

And to calculate the loss between two sequences, we'll import a function from [tf.contrib.seq2seq](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq) called [sequence_loss](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/sequence_loss). It calculates the weighted cross-entropy loss between the first two arguments, and the third argument provides weights for averaging. We weight uniformly here, but weights could also be calculated based on where in the sequence the target is (e.g., penalize less earlier in the sequence, but more later) or based on the content of the target (e.g., low weight on guessing articles correctly and larger weight on getting nouns and verbs correct).
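As a rough picture of what such a weighted sequence loss computes when averaging across both timesteps and batch, here is a NumPy sketch. The function `sequence_loss_sketch` is hypothetical and is not the TensorFlow implementation; it assumes the loss is the weight-averaged cross-entropy over all positions.

```python
import numpy as np

def sequence_loss_sketch(logits, targets, weights):
    """Weighted mean cross-entropy over batch and time.

    logits:  [batch, timesteps, vocab] unnormalized scores
    targets: [batch, timesteps] integer class ids
    weights: [batch, timesteps] non-negative weights
    """
    # Numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log-probability assigned to each target word
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    cross_entropy = -picked
    return (cross_entropy * weights).sum() / weights.sum()

BATCH, TIMESTEPS, VOCAB = 2, 3, 5                 # hypothetical toy sizes
logits = np.zeros((BATCH, TIMESTEPS, VOCAB))      # a model that is maximally unsure
targets = np.array([[1, 2, 3], [4, 0, 0]])
uniform = np.ones((BATCH, TIMESTEPS))
ignore_padding = (targets != 0).astype(float)     # one alternative weighting scheme

print(sequence_loss_sketch(logits, targets, uniform))
print(sequence_loss_sketch(logits, targets, ignore_padding))
```

With all-zero logits the model predicts a uniform distribution, so the loss is `log(VOCAB)` regardless of the weights. Zeroing the weights on padding positions is one common alternative to uniform weighting.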

We'll optimize with TensorFlow's [RMSProp](https://www.tensorflow.org/api_docs/python/tf/train/RMSPropOptimizer) optimizer, which requires an explicit learning rate but is otherwise used as usual. We switch away from the Adam optimizer because we don't want its adaptive learning rate behavior, which can interact badly with the recurrent gradients.

In [None]:
LEARNING_RATE = 1e-4

loss = tf.contrib.seq2seq.sequence_loss(
    logits,
    target_ph,
    tf.ones([BATCH_SIZE, SEQ_LEN], dtype=tf.float32), # we'll use uniform weight over timesteps
    average_across_timesteps=True,
    average_across_batch=True)

optimizer = tf.train.RMSPropOptimizer(LEARNING_RATE)
train_op = optimizer.minimize(loss)
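For intuition about what the optimizer does on each step, the basic RMSProp update (without momentum) can be sketched as below. The function `rmsprop_step`, the hyperparameter values, and the toy objective are assumptions chosen for illustration, not a reproduction of TensorFlow's internals.

```python
import numpy as np

def rmsprop_step(param, grad, mean_square, learning_rate, decay=0.9, epsilon=1e-10):
    """One RMSProp update without momentum; returns the new parameter and running average."""
    # Exponential moving average of the squared gradient
    mean_square = decay * mean_square + (1.0 - decay) * grad ** 2
    # Scale the gradient by the root of that average before stepping
    param = param - learning_rate * grad / np.sqrt(mean_square + epsilon)
    return param, mean_square

# Minimize the toy objective f(w) = w**2, whose gradient is 2 * w
w, mean_square = 1.0, 0.0
for _ in range(500):
    w, mean_square = rmsprop_step(w, 2.0 * w, mean_square, learning_rate=0.01)
print(w)
```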

Finally, we'll create a Session, initialize the variables, and run the train op once. This model is relatively heavyweight, so we don't want to optimize it on the login node.

In [None]:
session = tf.Session()
session.run(tf.global_variables_initializer())

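# we'll just run one step, omitting the usual epoch code
# pick a start index that leaves room for a full batch, since the graph assumes exactly BATCH_SIZE rows
start = random.randrange(train_data.shape[0] - BATCH_SIZE + 1)
_ = session.run(train_op, feed_dict={input_ph: train_data[start:start + BATCH_SIZE], target_ph: train_targets[start:start + BATCH_SIZE]})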

Now, we'll try generating from the model.

In [None]:
# this value is to limit load on the login node, must be <= SEQ_LEN
GEN_LEN = 10

# generated_sequence needs to be a valid input
generated_sequence = np.array([[word_index["<START>"]] + [word_index["<PAD>"]]*(SEQ_LEN - 1)] * BATCH_SIZE) 
for position in range(1, GEN_LEN):
    logits_val = session.run(logits, feed_dict={input_ph: generated_sequence})
    # targets are shifted by one, so the output at position - 1 predicts the word at position
    generated_words = np.argmax(logits_val[:, position - 1], axis=-1)
    generated_sequence[:, position] = generated_words
    print(decode_review(generated_sequence[0]))
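The generation loop is greedy decoding: at each step, take the most likely next word and write it back into the input. Because the targets are the inputs shifted forward by one word, the output at position `t` is the prediction for the word at `t + 1`. The sketch below replaces the trained network with a hypothetical lookup table (`NEXT_WORD_SCORES`) and a made-up five-token vocabulary so the indexing is easy to follow.

```python
import numpy as np

PAD, START = 0, 1
TOY_VOCAB = ["<PAD>", "<START>", "this", "film", "rocks"]

# Hypothetical stand-in for the trained network: row w holds the scores
# for the word that follows word w.
NEXT_WORD_SCORES = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0],   # after <PAD>   -> <PAD>
    [0.0, 0.0, 1.0, 0.0, 0.0],   # after <START> -> "this"
    [0.0, 0.0, 0.0, 1.0, 0.0],   # after "this"  -> "film"
    [0.0, 0.0, 0.0, 0.0, 1.0],   # after "film"  -> "rocks"
    [1.0, 0.0, 0.0, 0.0, 0.0],   # after "rocks" -> <PAD>
])

def toy_logits(sequence):
    """Return logits of shape [timesteps, vocab]; row t scores the word at t + 1."""
    return NEXT_WORD_SCORES[sequence]

def greedy_generate(length):
    sequence = np.full(length, PAD)
    sequence[0] = START
    for position in range(1, length):
        logits = toy_logits(sequence)
        # The output at position - 1 is the prediction for the word at position
        sequence[position] = np.argmax(logits[position - 1])
    return sequence

generated = greedy_generate(5)
print(" ".join(TOY_VOCAB[i] for i in generated if i != PAD))
```

Greedy decoding always produces the same sequence for a given model; sampling from the softmax of the logits instead of taking the argmax is a common way to get more varied text.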

## Hackathon 9 Exercise

Create a 2-layer [LSTMCell](https://www.tensorflow.org/api_docs/python/tf/nn/rnn_cell/LSTMCell) network with an [Attention Wrapper](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/AttentionCellWrapper). Your code should use the `input_ph` and `target_ph` placeholders as above and should finish with the `loss` tensor. This should be pretty straightforward with the TensorFlow documentation.

This model is very large and trains for a long time, so please don't try to optimize it in this notebook.

In [None]:
# Your code here