# Recurrent Neural Networks

In [2]:
import tensorflow as tf
import numpy as np

## Load Training Data

Whereas convolutional neural networks take advantage of spatial locality and are well-suited to images, recurrent neural networks (RNNs) take advantage of temporal locality and are well-suited to sequential data such as text. Many applications of deep learning to NLP use some form of recurrent neural network.  
  
Here, we will build an RNN that mimics the writing or speaking style of a passage of text. We will use a character-level RNN to generate fake text: the network learns which character is likely to follow the sequence of characters that came before it. How do we train a network like this? Below is a sample of the training data we will be using: a transcript of various speeches by Donald Trump.

In [3]:
with open('../data/trump.txt', 'r') as file:
    transcript = file.read()
transcript = transcript.replace('\n', ' ')
print(transcript[:700], '...')

People have asked me why I am running for President. I have built an amazing business that I love and I get to work side-by-side with my children every day. We come to work together and turn visions into reality. We think big, and then we make it happen. I love what I do, and I am grateful beyond words to the nation that has allowed me to do it. So when people ask me why I am running, I quickly answer: I am running to give back to this country which has been so good to me. When I see the crumbling roads and bridges, or the dilapidated airports, or the factories moving overseas to Mexico, or to other countries, I know these problems can all be fixed, but not by Hillary Clinton – only by me. T ...


We will train our RNN to predict, given the sequence of characters that came before, which character should come next. So given "People have asked me wh", our model should predict "y". Should we train by treating the entire piece of text as one long string, as above? This is probably a bad idea - our model will overfit to the exact word usage in the training data, perhaps memorizing whole sequences of sentences.  
  
Instead we will break up the text into small chunks, so that our network learns "general" word usage - to predict characters that make sense given only the *immediately* preceding text. For this reason, we may find that RNN-generated text lacks any coherent direction and appears to just ramble on about a certain topic (sound familiar?).   
  
So how should we break up our text? We want to break it up such that the chunks represent the temporal locality that the RNN should take into account, i.e. how far back does it need to look to predict the next character. It might be intuitive to use sentences as our segments, but this presents computational difficulties because not all sentences are the same length (more on this later!) - here, we choose to create a window of fixed length, and slide it through our text one character at a time (much like a convolutional filter).

In [4]:
num_steps = 50  # size of "unrolling" - more on this later!
for i in range(10):
    print('{}    {}'.format(transcript[i: i + num_steps], transcript[i + num_steps]))

People have asked me why I am running for Presiden    t
eople have asked me why I am running for President    .
ople have asked me why I am running for President.     
ple have asked me why I am running for President.     I
le have asked me why I am running for President. I     
e have asked me why I am running for President. I     h
 have asked me why I am running for President. I h    a
have asked me why I am running for President. I ha    v
ave asked me why I am running for President. I hav    e
ve asked me why I am running for President. I have     


Like all other neural networks, RNNs work strictly with numbers, so we have to encode our text in vector form. Here, we will use a simple one-hot encoding, but implement it so that the encoding can easily be swapped out later. First, let's look at the dimension of our one-hot encoded vectors:

In [5]:
unique_chars = sorted(list(set(transcript)))
num_classes = len(unique_chars)

print('Unique characters:', unique_chars)
print('Number of unique characters:', num_classes)

Unique characters: [' ', '!', '"', '$', '%', '&', "'", '(', ')', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'é', '–', '—', '‘', '’', '“', '”', '…']
Number of unique characters: 91
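
Before characters can be one-hot encoded, each one needs an integer class index. The notebook does not show this mapping, so the following is only a minimal sketch of one way to build it. The names `sample`, `char_to_idx`, `idx_to_char`, `encode`, and `decode` are illustrative and do not appear in the original code.

```python
# Hypothetical sample text; the notebook would use the full transcript.
sample = "People have asked me why"

# Same construction as unique_chars above, applied to the sample.
chars = sorted(set(sample))

# Illustrative lookup tables: character -> index and index -> character.
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}


def encode(text):
    """Map each character to its integer class index."""
    return [char_to_idx[ch] for ch in text]


def decode(indices):
    """Map integer class indices back to a string."""
    return "".join(idx_to_char[i] for i in indices)


encoded = encode(sample)
print(encoded[:6])
print(decode(encoded))
```

Sorting the characters keeps the mapping deterministic between runs, which matters if a trained model is saved and reloaded later.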


## Model Overview

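The model has three stages, built in the sections below: an embedding layer that turns each character index into a one-hot vector, two stacked RNN layers unrolled over `num_steps` time steps, and a softmax layer that turns each hidden state of the second RNN layer into a probability distribution over the next character.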

<img src="../images/RNN.jpg">

### Embedding Layer

In [6]:
batch_size = 25

# inputs (x) and targets (y) hold integer character indices, one per time step
x = tf.placeholder(tf.int32, [batch_size, num_steps])
y = tf.placeholder(tf.int32, [batch_size, num_steps])
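
The placeholders above expect integer arrays of shape `[batch_size, num_steps]`. The notebook does not show how these arrays are filled, so the following is only a sketch under a stated assumption: each target row is the input row shifted one character ahead, so the target at step t is the character at step t + 1. `make_batch`, `encoded`, and `starts` are illustrative names.

```python
import numpy as np


def make_batch(encoded, starts, num_steps):
    """Build integer input/target arrays from window start offsets.

    Assumption: targets are the inputs shifted one position ahead.
    """
    x = np.array([encoded[s: s + num_steps] for s in starts], dtype=np.int32)
    y = np.array([encoded[s + 1: s + num_steps + 1] for s in starts],
                 dtype=np.int32)
    return x, y


# Stand-in for an integer-encoded transcript (hypothetical data).
encoded = list(range(100))

x_batch, y_batch = make_batch(encoded, starts=[0, 10, 20], num_steps=5)
print(x_batch.shape, y_batch.shape)
print(x_batch[0], y_batch[0])
```

Arrays built this way could be passed to the placeholders through a `feed_dict`, provided the number of rows equals `batch_size`.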

In [7]:
# row i of the identity matrix is the one-hot vector for class i
one_hot_embeddings = tf.eye(num_classes)
x_embed = tf.nn.embedding_lookup(one_hot_embeddings, x)
y_embed = tf.nn.embedding_lookup(one_hot_embeddings, y)
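
Looking up rows of an identity matrix is what makes this "embedding" a one-hot encoding. The same idea can be checked in plain NumPy; this is a small sketch with a made-up vocabulary size and made-up indices, not the notebook's data.

```python
import numpy as np

num_classes = 5                      # small stand-in for the notebook's 91
ids = np.array([[0, 3], [4, 1]])     # hypothetical indices, shape [batch, steps]

# Indexing the identity matrix with integer ids selects one row per id.
one_hot = np.eye(num_classes, dtype=np.float32)[ids]

print(one_hot.shape)   # one extra trailing axis of size num_classes
print(one_hot[0, 1])   # the row selected for class 3
```

Because the lookup table is just a matrix, swapping the identity matrix for a trainable matrix of shape `[num_classes, embedding_size]` would give learned embeddings without changing the rest of the graph.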

### RNN Cells

<img src="../images/BasicRNNLabeled.png">

In [8]:
x_inputs = tf.unstack(x_embed, axis=1)
y_outputs = tf.unstack(y_embed, axis=1)

print('Length of inputs list:', len(x_inputs))
x_inputs[:3]

Length of inputs list: 50


[<tf.Tensor 'unstack:0' shape=(25, 91) dtype=float32>,
 <tf.Tensor 'unstack:1' shape=(25, 91) dtype=float32>,
 <tf.Tensor 'unstack:2' shape=(25, 91) dtype=float32>]

In [9]:
state_size = 128

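# Create the weights for both layers up front; rnn_cell() below fetches
# them by scope name with reuse=True.
with tf.variable_scope('rnn_cell_1'):
    # first layer: input is a one-hot vector of size num_classes
    W_xh = tf.get_variable('W_xh', [num_classes, state_size])
    W_hh = tf.get_variable('W_hh', [state_size, state_size])
    b_h = tf.get_variable('b_h', [state_size])

with tf.variable_scope('rnn_cell_2'):
    # second layer: input is the first layer's hidden state
    W_xh = tf.get_variable('W_xh', [state_size, state_size])
    W_hh = tf.get_variable('W_hh', [state_size, state_size])
    b_h = tf.get_variable('b_h', [state_size])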

In [10]:
def rnn_cell(rnn_input, prev_state, scope):
    """One RNN step: h_t = tanh(h_{t-1} W_hh + x_t W_xh + b_h)."""
    with tf.variable_scope(scope, reuse=True):  # reuse the weights created above
        input_size = rnn_input.get_shape().as_list()[1]
        W_xh = tf.get_variable('W_xh', [input_size, state_size])
        W_hh = tf.get_variable('W_hh', [state_size, state_size])
        b_h = tf.get_variable('b_h', [state_size])
        return tf.tanh(tf.matmul(prev_state, W_hh) + tf.nn.xw_plus_b(rnn_input, W_xh, b_h))
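
The update computed by `rnn_cell` can be reproduced in NumPy to see the shapes involved. This is a minimal sketch with small made-up sizes and random weights; `rnn_step` and the sizes are illustrative, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, input_size, state_size = 4, 6, 3   # hypothetical small sizes

W_xh = rng.normal(size=(input_size, state_size))   # input -> hidden
W_hh = rng.normal(size=(state_size, state_size))   # hidden -> hidden
b_h = np.zeros(state_size)


def rnn_step(x_t, h_prev):
    """h_t = tanh(h_prev @ W_hh + x_t @ W_xh + b_h)."""
    return np.tanh(h_prev @ W_hh + x_t @ W_xh + b_h)


# Unroll over a few time steps, reusing the same weights at every step.
h = np.zeros((batch_size, state_size))
for _ in range(5):
    x_t = rng.normal(size=(batch_size, input_size))
    h = rnn_step(x_t, h)

print(h.shape)
```

The hidden state keeps the shape `[batch_size, state_size]` at every step, which is why the same function can be applied repeatedly along the sequence.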

In [11]:
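# every sequence starts from an all-zero hidden state
init_state = tf.zeros([batch_size, state_size])
state1 = init_state
hidden_states_1 = []

# unroll the first layer: one rnn_cell call per time step, sharing weights
for rnn_input in x_inputs:
    state1 = rnn_cell(rnn_input, state1, 'rnn_cell_1')
    hidden_states_1.append(state1)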

In [12]:
state2 = init_state
hidden_states_2 = []

# unroll the second layer, which takes the first layer's hidden states as input
for rnn_input in hidden_states_1:
    state2 = rnn_cell(rnn_input, state2, 'rnn_cell_2')
    hidden_states_2.append(state2)

### Softmax Layer

In [13]:
with tf.variable_scope('softmax'):
    W_hy = tf.get_variable('W_hy', [state_size, num_classes])
    b_y = tf.get_variable('b_y', [num_classes])
    
logits = [tf.nn.xw_plus_b(state, W_hy, b_y) for state in hidden_states_2]
preds = [tf.nn.softmax(logit) for logit in logits]
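
The softmax turns each row of logits into a probability distribution over the character classes. A minimal NumPy sketch of the same computation, using made-up logits, might look like this; subtracting the row maximum is a common numerical-stability trick and does not change the result.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # avoid overflow
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical logits for a batch of 2 examples and 3 classes.
probs = softmax(np.array([[1.0, 2.0, 3.0],
                          [0.0, 0.0, 0.0]]))
print(probs)
print(probs.sum(axis=-1))
```

Each row sums to 1, and equal logits give a uniform distribution over the classes.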

### Optimization

In [14]:
losses = [tf.nn.softmax_cross_entropy_with_logits_v2(labels=label, logits=logit)
          for label, logit in zip(y_outputs, logits)]

total_loss = tf.reduce_mean(losses)
train_step = tf.train.AdamOptimizer(learning_rate=0.001).minimize(total_loss)

To calculate the total loss, we average the cross-entropy losses over all time steps and all examples in the batch. Then we define a `train_step` as usual, specifying a learning rate hyperparameter - do not underestimate the difficulty of hyperparameter optimization! It is an area of ongoing research with few theoretical results, and is often one of the most time-consuming parts of a machine learning project.
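
To make the loss concrete, the same quantity can be computed by hand in NumPy. This is only a sketch with made-up shapes and uniform logits; `cross_entropy` is an illustrative helper, not the TensorFlow implementation.

```python
import numpy as np


def cross_entropy(labels, logits):
    """Softmax cross-entropy over the last axis, one value per example."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(labels * log_probs).sum(axis=-1)


# Hypothetical sizes: 2 time steps, batch of 3, 4 classes.
logits = np.zeros((2, 3, 4))                      # uniform predictions
targets = np.array([[0, 1, 2], [3, 0, 1]])        # integer class targets
labels = np.eye(4)[targets]                       # one-hot, shape (2, 3, 4)

per_example = cross_entropy(labels, logits)       # shape (2, 3)
total = per_example.mean()                        # mean over steps and batch
print(per_example.shape, total)
```

With uniform predictions over 4 classes, every per-example loss equals log(4), so the mean does too. An untrained model's loss can be sanity-checked against log(num_classes) in the same way.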

`zip` is a built-in Python function that pairs up the elements of two (or more) iterables. In Python 3 it returns an iterator of tuples, so we wrap it in `list` to see the result:

In [15]:
a = [1, 2, 3]
b = ['a', 'b', 'c']
list(zip(a, b))

[(1, 'a'), (2, 'b'), (3, 'c')]

## Training

## Load Pre-trained Model