# Recurrent Neural Networks in TensorFlow

This week we will use Recurrent Neural Networks, or RNNs, to generate text. The model we have built is a simple, naive approach to generating text. There are more effective methods for text generation, but we will not explore them today.

Let's jump into the code!

## Loading the Data
Before we do anything, we must load the data. The data we are using is from the book "Alice in Wonderland". We have already prepared some numpy files that contain the data to make things easier.

In [1]:
import tensorflow as tf
import numpy as np

In [2]:
seed = 31337
np.random.seed(seed)

In [3]:
def embed_output(data, vocab):
    """Takes a list of words and outputs a list of the indices of the words in the vocab list"""
    result = np.empty(len(data))
    for idx, word in enumerate(data):
        result[idx] = np.where(vocab == word)[0][0]  # Index of the first match for `word` in the vocab array
    return result

We will use the `embed_output` function above to convert the list of words into a list of numbers, where each number is the index of the word in the vocabulary (an array of the distinct words in the text). This matters because neural networks operate on numbers, not words.
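
As a minimal sketch of the same word-to-index mapping, the snippet below uses a toy vocabulary and a dictionary lookup. The data and the name `word_to_index` are made up for illustration; this is an alternative way to do the lookup, not the notebook's `embed_output`.

```python
import numpy as np

# Toy vocabulary and text (made up for illustration; not the Alice data)
vocab = np.array(['.', 'alice', 'rabbit', 'the', 'was'])
words = np.array(['the', 'rabbit', 'was', 'the', 'rabbit', '.'])

# Build the lookup table once, then map every word through it
word_to_index = {word: idx for idx, word in enumerate(vocab)}
indices = np.array([word_to_index[word] for word in words])

print(indices)         # [3 2 4 3 2 0]
print(vocab[indices])  # indexing the vocabulary recovers the original words
```

Indexing the vocabulary with the resulting array goes the other way, from numbers back to words, which is how generated indices can be turned into text later on.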

In [4]:
save_path = "saved/rnn.ckpt"
alice_file = "alice_with_periods.npz"

# Load alice in wonderland into an array of numbers called `alice_embed`
alice_load = np.load(alice_file)
alice_embed = embed_output(alice_load['words'], alice_load['vocab'])  # Alice as sequence of ints, shape [num_words]
embedding_size = len(alice_load['vocab'])
alice_load['words'][:100]

array(['alice', 's', 'adventures', 'in', 'wonderland', 'by', 'lewis',
       'carroll', 'chapter', 'i', '.', 'down', 'the', 'rabbit', 'hole',
       'alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of',
       'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', 'and', 'of',
       'having', 'nothing', 'to', 'do', 'once', 'or', 'twice', 'she',
       'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was',
       'reading', 'but', 'it', 'had', 'no', 'pictures', 'or',
       'conversations', 'in', 'it', 'and', 'what', 'is', 'the', 'use',
       'of', 'a', 'book', 'thought', 'alice', 'without', 'pictures', 'or',
       'conversation', 'so', 'she', 'was', 'considering', 'in', 'her',
       'own', 'mind', 'as', 'well', 'as', 'she', 'could', 'for', 'the',
       'hot', 'day', 'made', 'her', 'feel', 'very', 'sleepy', 'and',
       'stupid', 'whether', 'the', 'pleasure', 'of'],
      dtype='<U14')

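Now we need to break the list of words up into fixed-length sequences, which we will loosely call sentences. Each sentence is `sequence_length` words long, and any leftover words at the end of the text are dropped. We then shuffle the sentences and split them into training and testing datasets.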

In [14]:
# Split
sequence_length = 25  # The length of the generated sequences
truncated_length = (len(alice_embed)//sequence_length) * sequence_length  # Need to make everything divisible
alice_embed = alice_embed[:truncated_length]  # truncate the list of words to be divisible by `sequence_length`
alice_split = np.reshape(alice_embed, (-1, sequence_length))  # break the text into fixed length sentences
num_sequences = alice_split.shape[0]


indices = np.random.permutation(num_sequences)  # Randomly re-order the list of sentences
pct_train = 0.75
# Split dataset into training and testing
training_idx, test_idx = indices[:int(pct_train*num_sequences)], indices[int(pct_train*num_sequences):]
alice_train, alice_test = alice_split[training_idx, :], alice_split[test_idx, :]
print("alice_train.shape:", alice_train.shape)
alice_train

alice_train.shape: (842, 25)


array([[ 1899.,  1379.,   823., ...,   581.,   402.,  2556.],
       [  268.,  1336.,  1584., ...,  2280.,   265.,   436.],
       [ 1664.,  1819.,   740., ...,  1373.,   699.,   740.],
       ..., 
       [ 1104.,   593.,  2280., ...,  2556.,  1122.,   788.],
       [  823.,  1916.,  1104., ...,  2264.,   270.,  2387.],
       [  175.,   823.,  1822., ...,   788.,   325.,  1599.]])
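
The truncate, reshape, and split steps above can be sketched on a toy array where the shapes are easy to follow. The sizes below (23 tokens, sequences of 5) are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the shuffle is repeatable
tokens = np.arange(23)           # stand-in for the encoded text: 23 "words"
sequence_length = 5

# Drop the tail so the length is a multiple of sequence_length (23 -> 20)
truncated_length = (len(tokens) // sequence_length) * sequence_length
sequences = tokens[:truncated_length].reshape(-1, sequence_length)  # shape (4, 5)

# Shuffle the row indices, then take the first 75% of rows for training
order = rng.permutation(len(sequences))
num_train = int(0.75 * len(sequences))  # 3 of the 4 rows
train, test = sequences[order[:num_train]], sequences[order[num_train:]]

print(sequences.shape, train.shape, test.shape)  # (4, 5) (3, 5) (1, 5)
```

Shuffling whole rows keeps the words inside each sequence in their original order, so only the order of the sequences is randomized.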

## Training the RNN
OK, we have the data, but now we need to design the network architecture in TensorFlow. Let's start with the code needed to build the training architecture.

In [15]:
tf.reset_default_graph()  # Just in case we want to re-run the jupyter cell, this avoids getting errors
tf.set_random_seed(seed)  # For debugging purposes, make everyone use the same seed

Let's define our input and call it `x`. Then we generate one-hot encodings of `x` and put them in `x_hot`. Just as in image recognition, where the labels are one-hot encoded so that each class has its own index, here we one-hot encode the inputs so that each word in the vocabulary has its own index.

In [16]:
# Input sequence placeholder [batch_size, sequence_length]
x = tf.placeholder(shape=[None, sequence_length], dtype=tf.int32)
# one-hot encode x shaped `[batch_size, sequence_length, embedding_size]`
x_hot = tf.one_hot(x, depth=embedding_size)
batch_size = tf.shape(x)[0]  # This is a scalar tensor
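
To make the one-hot step concrete, here is a small NumPy sketch of what `tf.one_hot` produces for a batch of word indices. The vocabulary size of 4 and the input values are assumptions for illustration.

```python
import numpy as np

embedding_size = 4                    # toy vocabulary size (assumed for illustration)
x = np.array([[2, 0, 3],
              [1, 1, 0]])             # shape [batch_size=2, sequence_length=3]

# Row i of the identity matrix is the one-hot vector for index i
x_hot = np.eye(embedding_size)[x]     # shape [2, 3, 4]

print(x_hot.shape)   # (2, 3, 4)
print(x_hot[0, 0])   # [0. 0. 1. 0.]
```

Each index gains a trailing dimension of length `embedding_size` with a single 1 at the position of the word.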

Now for the tricky part. Recall that recurrent neural networks have cyclic graphs, where the outputs of the RNN get fed back into the RNN in a feedback cycle. In order to perform backpropagation, we need to "unroll" the RNN so that there are no loops in the graph. We achieve this by operating on sequences of a fixed length and passing each element of the sequence through the RNN cell in turn. This turns it into a feed-forward network whose layers all share the same weights. This can be seen in the for loop in the code.

In [17]:
with tf.variable_scope('Unrolled') as scope:
    num_units = 256  # Number of units in each RNN cell. This is the size of the state vectors used in the cell.
    # Make three different LSTM cells
    cell1 = tf.contrib.rnn.BasicLSTMCell(num_units=num_units)
    cell2 = tf.contrib.rnn.BasicLSTMCell(num_units=num_units)
    cell3 = tf.contrib.rnn.BasicLSTMCell(num_units=num_units)

    # The state of the RNN is initialized to (usually) zero at the start of every sequence.
    state = cell1.zero_state(batch_size=batch_size, dtype=tf.float32)  # The initial state of the RNN

    # Unroll the graph by passing the data into the RNN starting from the 1st element of the sequence until the last
    outputs = []  # python list of tensors so we can keep track of each timestep
    # We subtract one because the last word is only ever used as a target, so we don't feed it in as an input.
    for i in range(sequence_length-1):
        # NOTE: each time we call the cell, it calls tf.get_variable internally. If scope.reuse is not True, it will
        # create new variables at each word in the sentence. We therefore want to stop creating those variables once
        # we have called each cell once.
        if i > 0: scope.reuse_variables()  # Reuse the parameters created in the 1st RNN cell

        # Just like in a feed forward network where more layers = more powerful network, we have several
        # LSTM cells. This enables the network to be able to learn highly non-linear functions.
        output, state = cell1(x_hot[:, i, :], state, scope='Cell1')  # Use the word as the input (1st layer)
        output, state = cell2(output, state, scope='Cell2')  # feed output and state from cell1 to cell2 (2nd layer)
        output, state = cell3(output, state, scope='Cell3')  # feed output and state from cell2 to cell3 (3rd layer)
        outputs.append(output)  # Append the outputs to the python list of tensors

    # Turn the python list back into a tensor. 
    # This "stacks" them into a tensor of shape `[batch_size, sequence_length-1, cell3.output_size]`
    outputs = tf.stack(outputs, axis=1, name='Outputs')  # Axis=1 makes the sequence dimension the 2nd dimension
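
The unrolling idea can be sketched without TensorFlow. The example below uses a plain tanh cell instead of an LSTM, with random weights and toy sizes, all assumed for illustration. The point is only that one set of weights is applied at every timestep while the state is carried forward.

```python
import numpy as np

rng = np.random.RandomState(0)
batch_size, sequence_length, input_size, num_units = 2, 5, 4, 3

# One set of weights, shared by every timestep (a plain tanh cell, not an LSTM)
w_in = rng.randn(input_size, num_units) * 0.1
w_state = rng.randn(num_units, num_units) * 0.1
bias = np.zeros(num_units)

def rnn_cell(inputs, state):
    """One step of a vanilla RNN: returns the new state, which is also the output."""
    return np.tanh(inputs @ w_in + state @ w_state + bias)

x_seq = rng.randn(batch_size, sequence_length, input_size)
state = np.zeros((batch_size, num_units))   # zero initial state

outputs = []
for t in range(sequence_length - 1):        # the last element is never fed in
    state = rnn_cell(x_seq[:, t, :], state)
    outputs.append(state)

outputs = np.stack(outputs, axis=1)         # [batch_size, sequence_length-1, num_units]
print(outputs.shape)  # (2, 4, 3)
```

Because the loop runs a fixed number of times, the result is an ordinary chain of operations with no cycles, which is what makes backpropagation through it possible.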

Now that we have finished the recurrent parts of the network, we need to generate probabilities. Right now, the network is generating score vectors of size `cell3.output_size`, but we want score vectors of size `embedding_size`, one score per word in the vocabulary. To achieve this, we will use a softmax layer and apply it to the last dimension of `outputs`.
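
As a rough NumPy sketch of this projection (separate from the TensorFlow cell), the snippet below flattens the batch and time dimensions, multiplies by a weight matrix, restores the shape, and applies a softmax over the vocabulary dimension. All sizes and the random weights are assumptions for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
batch_size, steps, output_size, embedding_size = 2, 4, 3, 6

outputs = rng.randn(batch_size, steps, output_size)
w = rng.randn(output_size, embedding_size) * 0.01
b = np.zeros(embedding_size)

# Flatten the batch and time dimensions, project, then restore them
flattened = outputs.reshape(-1, output_size)                      # [8, 3]
scores = (flattened @ w).reshape(batch_size, steps, embedding_size) + b

# Numerically stable softmax over the last (vocabulary) dimension
shifted = scores - scores.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(probs.shape)                           # (2, 4, 6)
print(np.allclose(probs.sum(axis=-1), 1.0))  # True
```

Every timestep of every sequence ends up with its own probability distribution over the vocabulary.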

In [18]:
# we want to get probabilities out of the lstm for each word, so we use a softmax layer.
with tf.variable_scope('Softmax'):
    # Note that we need to go from vectors of size `cell3.output_size` to `embedding_size`
    w = tf.get_variable(
        name='Weight',
        initializer=tf.truncated_normal([cell3.output_size, embedding_size], stddev=0.01))
    b = tf.get_variable(name='Bias', initializer=tf.zeros(embedding_size))
    
    # We flatten from shape `[batch_size, sequence_length-1, cell3.output_size]`
    # to `[batch_size*(sequence_length-1), cell3.output_size]` because broadcasting doesn't work properly for `tf.matmul`
    flattened = tf.reshape(outputs, (-1, cell3.output_size))
    # Do multiplication and then reshape into `[batch_size, sequence_length-1, embedding_size]`
    matmul = tf.reshape(tf.matmul(flattened, w), shape=(-1, sequence_length-1, embedding_size))
    # Add the bias term to get the tensor of class scores
    scores = tf.add(matmul, b, name='Scores')
    # Run the scores through the softmax function to generate probabilities
    softmax = tf.nn.softmax(scores, name='Softmax')

# Shift the inputs over by one to create the labels. At timestep 1, input is `x[:, 1]` and target output is `x[:, 2]`
# Note that the shape of labels is `[batch_size, sequence_length-1]`.
# There is no dimension for the embedding because the labels are not one-hot encoded.
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=scores, labels=x[:, 1:]))
train_step = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)
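
The label shift and the sparse cross-entropy can be sketched in NumPy as follows. The scores here are random stand-ins for the network output and the sizes are toy values, so this only illustrates the bookkeeping, not the notebook's actual loss values.

```python
import numpy as np

rng = np.random.RandomState(0)
embedding_size = 6
x = np.array([[4, 1, 5, 2],
              [0, 3, 3, 1]])            # [batch_size=2, sequence_length=4]

inputs = x[:, :-1]    # words fed to the network at each timestep
labels = x[:, 1:]     # the word that follows each input

# Stand-in for the network's scores: [batch_size, sequence_length-1, embedding_size]
scores = rng.randn(2, 3, embedding_size)

# Log-softmax, then pick out the log-probability of each correct next word
shifted = scores - scores.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
picked = np.take_along_axis(log_probs, labels[..., None], axis=-1).squeeze(-1)
loss = -picked.mean()   # average negative log-likelihood over all positions

print(inputs.shape, labels.shape)  # (2, 3) (2, 3)
```

The labels stay as integer indices, which is why the "sparse" variant of the loss is used: it looks up one log-probability per position instead of multiplying by a one-hot vector.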

Now we need to feed data into the model and get it to train. Training from scratch takes a long time, so we have jump-started the process by providing a pre-trained model. The following code loads that model and trains it further for a few epochs (passes through the full training set).

In [19]:
sess = tf.InteractiveSession()
saver = tf.train.Saver()  # we will use this to load the variables from the ckpt files

training_size = alice_train.shape[0]
batch_size = 32

try:
    # Try to load the saved model
    saver.restore(sess, save_path)
    print("Restored Model!")
except Exception:
    # The saved model did not exist, so randomly initialize the model parameters to start training from scratch
    print("Initializing Model!")
    sess.run(tf.global_variables_initializer())

num_epochs = 10  # number of training epochs. An epoch is a full pass through the whole training set
num_batches = training_size // batch_size  # Floor division: a final partial batch is not counted, so the printed average runs slightly high
for epoch in range(num_epochs):
    perm = np.random.permutation(training_size)  # Every epoch, get a new set of batches
    avg_loss = 0
    for i in range(0, training_size, batch_size):
        idx = perm[i:i + batch_size]  # Select indices for batch
        x_batch = alice_train[idx]
        _, batch_loss = sess.run([train_step, loss], feed_dict={x: x_batch})
        avg_loss += batch_loss
    print("epoch %6d, loss=%6f" % (epoch + 1, avg_loss/num_batches))

# Save model here if we wanted to
# print("Saving model to %s" % save_path)
# saver.save(sess, save_path)

INFO:tensorflow:Restoring parameters from saved/rnn.ckpt
Restored Model!
epoch      1, loss=1.928284
epoch      2, loss=1.883919
epoch      3, loss=1.707575
epoch      4, loss=1.638360
epoch      5, loss=1.570829
epoch      6, loss=1.477021
epoch      7, loss=1.400826
epoch      8, loss=1.404719
epoch      9, loss=1.298011
epoch     10, loss=1.283630
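
The batching scheme in the training loop can be checked in isolation. This sketch reuses the training set size reported earlier (842) and the batch size of 32, and only counts the batches that slicing the permutation produces.

```python
import numpy as np

rng = np.random.RandomState(0)
training_size, batch_size = 842, 32

perm = rng.permutation(training_size)   # a fresh random order, as at the start of each epoch
batch_sizes = [len(perm[i:i + batch_size]) for i in range(0, training_size, batch_size)]

print(len(batch_sizes))              # 27
print(batch_sizes[-1])               # 10
print(training_size // batch_size)   # 26
```

The loop visits 27 batches, the last of which holds only 10 sequences, while the floor division gives 26. Dividing the summed loss by `len(batch_sizes)` instead would give the exact per-batch average.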


## Generating Text
Now that we have trained the model, we need to generate text. Unfortunately, we can't directly use the same tensors as we did when we trained the model, because the training graph reads its input at every timestep from the placeholder `x` rather than from its own previous output. We must recreate the architecture, with slight modifications, to be able to generate new data.

We will tell the architecture how many sentences we want to generate with `num_to_generate`. This will become the size of our batch dimension. Then, we select a random word from the vocabulary and start each sentence with it. We also fetch the weights of the softmax layer here so that we can use them when we unroll the RNN. Because we are still building on the same graph that was used for training, and `reuse=True` is set in every `tf.variable_scope` block, `tf.get_variable` returns the variables created by the training code. This ensures that we never recreate any variables but always reuse the original ones from training.

In [11]:
tf.set_random_seed(seed)

# Number of sentences to generate
num_to_generate = tf.placeholder(tf.int32, shape=(), name='NumToGenerate')

# Select a random word to start each sentence
random_start = tf.random_uniform(shape=(num_to_generate,), maxval=embedding_size, dtype=tf.int32)

# Reuse ALL variables
with tf.variable_scope('Softmax', reuse=True):
    w = tf.get_variable(name='Weight')
    b = tf.get_variable(name='Bias')

def do_softmax(tensor):
    """Helper function to compute softmax."""
    scores = tf.matmul(tensor, w) + b
    softmax = tf.nn.softmax(scores, name='Softmax')
    return softmax

Let's perform the unrolling! Notice that this time the input at each timestep is the previous output of the RNN (the softmax probabilities) rather than data that we provide. This is the reason why we had to recreate the whole graph. We also have a new tensor called `generated` that selects the word with the highest probability at each position of the `outputs` tensor.

In [12]:
# Again, reuse ALL variables in this scope!
with tf.variable_scope('Unrolled', reuse=True) as scope:
    cell1 = tf.contrib.rnn.BasicLSTMCell(num_units=num_units)
    cell2 = tf.contrib.rnn.BasicLSTMCell(num_units=num_units)
    cell3 = tf.contrib.rnn.BasicLSTMCell(num_units=num_units)

    # Feed into `initial_state` with `feed_dict` if you want to use the state of a prior sequence instead of zero
    initial_state = state = cell1.zero_state(batch_size=num_to_generate, dtype=tf.float32)

    # One-hot encode first word, and treat it as 1st output
    prev_word = tf.one_hot(random_start, depth=embedding_size)

    # Generate the sentence
    outputs = [prev_word]  # python list of tensors so we can keep track of all the outputs
    for i in range(sequence_length-1):  # We already "made" the first word, so generate `sequence_length-1` more
        output, state = cell1(prev_word, state, scope='Cell1')  # Step the RNN through the sequence
        output, state = cell2(output, state, scope='Cell2')  # 2nd layer
        output, state = cell3(output, state, scope='Cell3')  # 3rd layer
        output_word = do_softmax(output)  # Probability distribution over the vocabulary for the next word
        outputs.append(output_word)
        prev_word = output_word

    # Useful if you want longer outputs, you can fetch this tensor and then feed it back into `initial_state`
    final_state = state
    outputs = tf.stack(outputs, axis=1, name='Outputs')  # shape `[num_to_generate, sequence_length, embedding_size]`

generated = tf.argmax(outputs, axis=-1, name='Generated')  # index of the most probable word at each position
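
The feedback loop used for generation can be sketched in NumPy with a plain tanh cell and random weights. Everything here (the sizes, the weights, the single-layer cell) is an assumption for illustration; a trained model would supply real parameters.

```python
import numpy as np

rng = np.random.RandomState(0)
num_to_generate, sequence_length, embedding_size, num_units = 3, 6, 5, 4

# Random stand-in weights; a trained model would supply these
w_in = rng.randn(embedding_size, num_units)
w_state = rng.randn(num_units, num_units)
w_out = rng.randn(num_units, embedding_size)

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    return np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

# Start each sentence from a random word, one-hot encoded
start = rng.randint(embedding_size, size=num_to_generate)
prev_word = np.eye(embedding_size)[start]
state = np.zeros((num_to_generate, num_units))

outputs = [prev_word]
for _ in range(sequence_length - 1):
    state = np.tanh(prev_word @ w_in + state @ w_state)  # vanilla RNN step
    prev_word = softmax(state @ w_out)                    # distribution over words
    outputs.append(prev_word)                             # fed back in on the next step

outputs = np.stack(outputs, axis=1)       # [3, 6, 5]
generated = outputs.argmax(axis=-1)       # [3, 6] word indices
print(generated.shape)  # (3, 6)
```

The only structural difference from the training sketch is where the input comes from: the previous step's output distribution replaces the next word of the data.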

Up to this point we still haven't told the architecture to compute anything. Let's do that now!

In [20]:
results = sess.run(generated, feed_dict={num_to_generate: 10})
sentences = [[alice_load['vocab'][embedding] for embedding in sentence] for sentence in results]
for sentence in sentences:
    print(sentence)


TypeError: Cannot interpret feed_dict key as Tensor: Tensor Tensor("NumToGenerate:0", shape=(), dtype=int32) is not an element of this graph.

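The `TypeError` above is most likely a side effect of cell execution order rather than of the model: the execution counts show that the generation cells (In [11] and In [12]) ran before `tf.reset_default_graph()` (In [15]), so `num_to_generate` belongs to a graph that the current session does not use. Re-running the two generation cells after the training cells should resolve it. The generated sentences are not that great, although some are copied directly from the original novel. There are more advanced architectures that give better results, such as seq2seq and more advanced types of RNN cells, although these are beyond the scope of this series.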