In [1]:
import numpy as np  # matrix math
import tensorflow as tf  # machine learning
import helpers  # formats data into batches and generates random sequence data

tf.reset_default_graph() #Clears the default graph stack and resets the global default graph.
sess = tf.InteractiveSession() #initializes a tensorflow session

In [2]:
tf.__version__

'1.0.1'

The first critical thing to decide is the vocabulary size.

Dynamic RNN models can be adapted to different batch sizes and sequence lengths without retraining
(e.g. by serializing model parameters and Graph definitions via tf.train.Saver),
but changing the vocabulary size requires retraining the model.

In [3]:
PAD = 0
EOS = 1

vocab_size = 10
input_embedding_size = 20  # length of each token's embedding vector

encoder_hidden_units = 20  # number of units in each encoder LSTM cell
# the original paper used the same number of units for the encoder and the decoder;
# here the decoder gets twice as many because the encoder is bidirectional and its
# forward and backward final states together initialize the decoder
# (in this example the target value is the original input)
decoder_hidden_units = encoder_hidden_units * 2

A nice way to understand a complicated function is to study its signature: its inputs and outputs. With pure functions, only the input-output relation matters.
The encoder_inputs int32 tensor is shaped [encoder_max_time, batch_size].
The decoder_targets int32 tensor is shaped [decoder_max_time, batch_size].

In [4]:
# input placeholders
encoder_inputs = tf.placeholder(shape=(None, None), dtype=tf.int32, name='encoder_inputs')
# holds the true length of each sequence in the batch; the sequences themselves are padded to the same length
# if you don't want to pad, check out dynamic memory networks for feeding variable-length sequences
encoder_inputs_length = tf.placeholder(shape=(None,), dtype=tf.int32, name='encoder_inputs_length')
decoder_targets = tf.placeholder(shape=(None, None), dtype=tf.int32, name='decoder_targets')
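
To make the time-major shapes concrete, here is a minimal NumPy sketch of how a batch of variable-length sequences could be padded into a [max_time, batch_size] array together with its lengths vector. The notebook's helpers module is not shown in this section, so pad_batch_time_major is a hypothetical stand-in, not its actual implementation.

```python
import numpy as np

PAD = 0  # same padding id as in the notebook


def pad_batch_time_major(sequences):
    """Pad a list of token-id lists and return (inputs, lengths).

    inputs has shape [max_time, batch_size]; lengths has shape [batch_size].
    Hypothetical helper written for illustration only.
    """
    lengths = np.array([len(seq) for seq in sequences], dtype=np.int32)
    max_time = int(lengths.max())
    # Start from an all-PAD [batch_size, max_time] array, then fill in each sequence.
    batch_major = np.full((len(sequences), max_time), PAD, dtype=np.int32)
    for row, seq in enumerate(sequences):
        batch_major[row, :len(seq)] = seq
    # Transpose so that time is the leading dimension.
    return batch_major.T, lengths


inputs, lengths = pad_batch_time_major([[5, 7, 8], [6, 3], [9]])
print(inputs.shape)  # (3, 3): max_time = 3, batch_size = 3
print(lengths)       # [3 2 1]
```

Arrays shaped like inputs and lengths are what would be fed to encoder_inputs and encoder_inputs_length respectively.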

Here we implement the decoder with tf.nn.raw_rnn and construct decoder_inputs step by step in the loop.

# Embeddings

encoder_inputs and decoder_inputs are int32 tensors of shape [max_time, batch_size], while the encoder and decoder RNNs expect a dense vector representation of words, shaped [max_time, batch_size, input_embedding_size].

We convert one to the other using word embeddings. The specifics of working with embeddings are nicely described in the official tutorial on embeddings.
First we initialize the embedding matrix. The initialization is random; we rely on end-to-end training to learn vector representations for words jointly with the encoder and decoder.

In [5]:
# randomly initialized embedding matrix with one row per vocabulary token
# used to convert token ids to vectors (embeddings) of the right size for both the encoder and the decoder
# in TF you have to make sure your tensors have the right shape (number of dimensions)
embeddings = tf.Variable(tf.random_uniform([vocab_size, input_embedding_size], -1.0, 1.0), dtype=tf.float32)

#this thing could get huge in a real world application
encoder_inputs_embedded = tf.nn.embedding_lookup(embeddings, encoder_inputs)

We use tf.nn.embedding_lookup to index the embedding matrix: given word 4, we represent it as row 4 of the embedding matrix (the matrix is shaped [vocab_size, input_embedding_size], so each row is one word). This operation is lightweight compared with the alternative approach of one-hot encoding word 4 as [0,0,0,0,1,0,0,0,0,0] (vocab size 10, ids counted from zero) and then multiplying it by the embedding matrix.

Additionally, we don't need to compute gradients for any rows except row 4.

In a real NLP application the embedding matrix can get very large, with 100k or even 1M rows.
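
The equivalence between a lookup and a one-hot multiplication can be checked in a few lines. This is a NumPy illustration with a made-up random matrix, not the TensorFlow op itself:

```python
import numpy as np

vocab_size, input_embedding_size = 10, 20
rng = np.random.default_rng(0)
# Illustrative embedding matrix: one row per word id.
embeddings = rng.uniform(-1.0, 1.0, size=(vocab_size, input_embedding_size))

word = 4
looked_up = embeddings[word]       # plain row indexing, no arithmetic

one_hot = np.zeros(vocab_size)
one_hot[word] = 1.0
multiplied = one_hot @ embeddings  # one-hot vector times the whole matrix

print(np.allclose(looked_up, multiplied))  # True
```

Both paths give the same vector, but the indexing path touches only one row, while the multiplication reads every row of the matrix.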

# Encoder

We are replacing unidirectional tf.nn.dynamic_rnn with tf.nn.bidirectional_dynamic_rnn as the encoder.

In [6]:
#from tensorflow.nn.rnn_cell import LSTMCell, LSTMStateTuple
from tensorflow.contrib.rnn import LSTMCell, LSTMStateTuple

In [35]:
encoder_cell = LSTMCell(encoder_hidden_units)

Next we get the encoder outputs and states.
The bidirectional RNN function takes a separate cell argument for
the forward and the backward RNN, and returns separate
outputs and states for each direction.

When using a standard RNN to make predictions we are only taking the “past” into account. 
For certain tasks this makes sense (e.g. predicting the next word), but for some tasks 
it would be useful to take both the past and the future into account. Think of a tagging task, 
like part-of-speech tagging, where we want to assign a tag to each word in a sentence. 
Here we already know the full sequence of words, and for each word we want to take not only the 
words to the left (past) but also the words to the right (future) into account when making a prediction. 
Bidirectional RNNs do exactly that. A bidirectional RNN is a combination of two RNNs – one runs forward from 
“left to right” and one runs backward from “right to left”. These are commonly used for tagging tasks, or 
when we want to embed a sequence into a fixed-length vector (beyond the scope of this post).
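
As a rough illustration of the idea, the sketch below runs a plain tanh RNN (not an LSTM) over a sequence in both directions and concatenates the per-step outputs. The weights, sizes, and the run_rnn helper are all made up for the example.

```python
import numpy as np


def run_rnn(inputs, W_x, W_h):
    """Run a plain tanh RNN over inputs of shape [max_time, input_size]."""
    state = np.zeros(W_h.shape[0])
    outputs = []
    for x in inputs:
        state = np.tanh(x @ W_x + state @ W_h)
        outputs.append(state)
    return np.stack(outputs)


rng = np.random.default_rng(0)
max_time, input_size, hidden = 5, 3, 4
inputs = rng.normal(size=(max_time, input_size))

# Separate (hypothetical) weights for the forward and backward directions.
fw = run_rnn(inputs,
             rng.normal(size=(input_size, hidden)),
             rng.normal(size=(hidden, hidden)))
# The backward RNN reads the reversed sequence; its outputs are flipped back
# so that row t of both arrays refers to the same time step.
bw = run_rnn(inputs[::-1],
             rng.normal(size=(input_size, hidden)),
             rng.normal(size=(hidden, hidden)))[::-1]

outputs = np.concatenate([fw, bw], axis=1)
print(outputs.shape)  # (5, 8)
```

Row t of outputs now combines a summary of steps 0..t (forward half) with a summary of steps t..end (backward half).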


In [36]:
((encoder_fw_outputs,
  encoder_bw_outputs),
 (encoder_fw_final_state,
  encoder_bw_final_state)) = (
    tf.nn.bidirectional_dynamic_rnn(cell_fw=encoder_cell,
                                    cell_bw=encoder_cell,
                                    inputs=encoder_inputs_embedded,
                                    sequence_length=encoder_inputs_length,
                                    # dtype must match the float32 embeddings; passing tf.float64
                                    # fails with a ConcatV2 type-mismatch TypeError
                                    dtype=tf.float32, time_major=True))
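
The decoder further down starts from a single encoder_final_state, which is not defined in the cells shown here. One plausible way to build it, and this is an assumption rather than something the section confirms, is to concatenate the forward and backward final states along the feature axis. The NumPy sketch below uses a namedtuple as a stand-in for LSTMStateTuple and dummy arrays in place of real states:

```python
from collections import namedtuple

import numpy as np

# Stand-in for tensorflow.contrib.rnn.LSTMStateTuple, which has fields c and h.
StateTuple = namedtuple("StateTuple", ["c", "h"])

batch_size, encoder_hidden_units = 3, 20
# Dummy forward and backward final states, each [batch_size, encoder_hidden_units].
fw_state = StateTuple(c=np.zeros((batch_size, encoder_hidden_units)),
                      h=np.zeros((batch_size, encoder_hidden_units)))
bw_state = StateTuple(c=np.ones((batch_size, encoder_hidden_units)),
                      h=np.ones((batch_size, encoder_hidden_units)))

# Join the two directions along the feature axis (axis 1).
final_state = StateTuple(
    c=np.concatenate([fw_state.c, bw_state.c], axis=1),
    h=np.concatenate([fw_state.h, bw_state.h], axis=1),
)

print(final_state.c.shape)  # (3, 40)
```

Under that assumption the combined state is 40 wide, which would line up with decoder_hidden_units = encoder_hidden_units * 2.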

# Decoder

In [7]:
decoder_cell = LSTMCell(decoder_hidden_units)


Time and batch dimensions are dynamic, i.e. they can change at runtime, from batch to batch.

When decoding, feeding previously generated tokens as inputs adds robustness to the model's errors. However, feeding the ground truth speeds up training. Apparently the best practice is to mix both randomly when training.
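
A minimal sketch of what mixing the two input sources could look like for a single sequence. The function name choose_decoder_input, the 0.5 mixing probability, and the token values are assumptions for illustration; none of them come from the notebook.

```python
import random


def choose_decoder_input(true_token, predicted_token, truth_prob, rng):
    """Return the ground-truth token with probability truth_prob,
    otherwise the model's own previous prediction."""
    return true_token if rng.random() < truth_prob else predicted_token


rng = random.Random(0)
true_tokens = [5, 7, 8, 3]
predicted_tokens = [5, 2, 8, 9]  # made-up model predictions
mixed = [choose_decoder_input(t, p, 0.5, rng)
         for t, p in zip(true_tokens, predicted_tokens)]
print(mixed)
```

Setting truth_prob to 1.0 always feeds the ground truth, while 0.0 always feeds the model's own predictions.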

In [8]:
# dynamic time and batch dimensions of the input; batch_size is used below to build the EOS and PAD slices
encoder_max_time, batch_size = tf.unstack(tf.shape(encoder_inputs))

Next we need to decide how far to run the decoder. There are several options for the stopping criterion:

Stop after a specified number of unrolling steps

Stop after the model produces an <EOS> token

The choice will likely be task-dependent. In the legacy translate tutorial we can see that the decoder unrolls for len(encoder_input)+10 steps to allow for a possibly longer translated sequence. Here we are doing a toy copy task, so we unroll the decoder for len(encoder_input)+2 steps, to allow the model some room to make mistakes over 2 additional steps:

In [9]:
decoder_lengths = encoder_inputs_length + 3
# +2 additional steps, +1 leading <EOS> token for decoder inputs
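
The two kinds of stopping criteria (a fixed step budget and stopping at the end-of-sequence token) can be sketched in plain Python for a single sequence. greedy_decode and toy_model are hypothetical names, and the "model" is just a lookup table:

```python
EOS = 1  # same end-of-sequence id as in the notebook


def greedy_decode(next_token_fn, max_steps, stop_on_eos=True):
    """Call next_token_fn(previous_token) for at most max_steps steps,
    optionally stopping early once EOS is produced."""
    tokens = []
    previous = EOS  # the decoder's first input is the <EOS> token
    for _ in range(max_steps):
        previous = next_token_fn(previous)
        tokens.append(previous)
        if stop_on_eos and previous == EOS:
            break
    return tokens


# Hypothetical "model": maps EOS -> 5 -> 6 -> 7 -> EOS.
def toy_model(previous):
    return {EOS: 5, 5: 6, 6: 7}.get(previous, EOS)


print(greedy_decode(toy_model, max_steps=10))                    # [5, 6, 7, 1]
print(greedy_decode(toy_model, max_steps=6, stop_on_eos=False))  # [5, 6, 7, 1, 5, 6]
```

The notebook takes the first approach: decoder_lengths fixes the number of steps per sequence in advance.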

# Output projection

The decoder will contain a transition step that we specify manually:

output(t) -> output projection(t) -> prediction(t) (argmax) -> input embedding(t+1) -> input(t+1)

In tutorial 1, we used the tf.contrib.layers.linear layer to initialize the weights and biases and apply the operation for us. This is convenient; however, now we need to specify the parameters W and b of the output layer in global scope and apply them at every step of the decoder.

In [11]:
#manually specifying since we are going to implement attention details for the decoder in a sec
#weights
W = tf.Variable(tf.random_uniform([decoder_hidden_units, vocab_size], -1, 1), dtype=tf.float32)
#bias
b = tf.Variable(tf.zeros([vocab_size]), dtype=tf.float32)
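
To show how W and b fit into the transition step, here is a NumPy sketch of one step for a whole batch. The decoder output is random stand-in data, and the arrays mirror the notebook's shapes without being its TensorFlow variables:

```python
import numpy as np

vocab_size, input_embedding_size, decoder_hidden_units = 10, 20, 40
batch_size = 3
rng = np.random.default_rng(0)

embeddings = rng.uniform(-1.0, 1.0, size=(vocab_size, input_embedding_size))
W = rng.uniform(-1.0, 1.0, size=(decoder_hidden_units, vocab_size))
b = np.zeros(vocab_size)

# Made-up decoder output at step t, shape [batch_size, decoder_hidden_units].
output_t = rng.normal(size=(batch_size, decoder_hidden_units))

logits_t = output_t @ W + b                 # output projection, [batch_size, vocab_size]
prediction_t = np.argmax(logits_t, axis=1)  # predicted token ids, [batch_size]
next_input = embeddings[prediction_t]       # input(t+1), [batch_size, input_embedding_size]

print(logits_t.shape, prediction_t.shape, next_input.shape)  # (3, 10) (3,) (3, 20)
```

Each line corresponds to one arrow in the chain: projection, argmax, then embedding lookup of the predicted token.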

# Decoder via tf.nn.raw_rnn
tf.nn.dynamic_rnn allows for easy RNN construction, but it is limited.
For example, a nice way to increase the robustness of the model is to feed it, as decoder inputs, the tokens it previously generated instead of the shifted true sequence.

In [12]:
#create padded inputs for the decoder from the word embeddings

# we're telling the program to test a condition and raise an error if the condition is false
assert EOS == 1 and PAD == 0

eos_time_slice = tf.ones([batch_size], dtype=tf.int32, name='EOS')
pad_time_slice = tf.zeros([batch_size], dtype=tf.int32, name='PAD')

#retrieves rows of the params tensor. The behavior is similar to using indexing with arrays in numpy
eos_step_embedded = tf.nn.embedding_lookup(embeddings, eos_time_slice)
pad_step_embedded = tf.nn.embedding_lookup(embeddings, pad_time_slice)

In [13]:
# manually specify the loop function through time, which supplies the initial cell state and input to the RNN
# normally we'd just use dynamic_rnn, but let's get detailed here with raw_rnn

# this function only defines and returns the initial values; no operations run here
def loop_fn_initial():
    initial_elements_finished = (0 >= decoder_lengths)  # all False at the initial step
    # the first decoder input is the embedded <EOS> token
    initial_input = eos_step_embedded
    # the decoder starts from the encoder's final state
    initial_cell_state = encoder_final_state
    # there is no cell output before the first step
    initial_cell_output = None
    initial_loop_state = None  # we don't need to pass any additional information
    return (initial_elements_finished,
            initial_input,
            initial_cell_state,
            initial_cell_output,
            initial_loop_state)
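
The elements_finished value is a per-sequence boolean mask. The NumPy sketch below, using made-up lengths, shows how the same comparison behaves at later time steps:

```python
import numpy as np

# Made-up encoder lengths for a batch of three sequences.
encoder_inputs_length = np.array([3, 1, 2])
decoder_lengths = encoder_inputs_length + 3  # [6, 4, 5], same rule as in the notebook

for time in (0, 4, 5, 6):
    # A sequence is finished once the step index reaches its decoder length.
    elements_finished = time >= decoder_lengths
    print(time, elements_finished)
```

At time 0 every entry is False, matching the comment in loop_fn_initial; at time 4 only the shortest sequence has finished, and by time 6 all of them have.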