# TensorFlow for text

To introduce working with text in TensorFlow, we focus on the core modelling components and create a minimal, contrived text data set that lets us get straight to the action.

Let's get started, presenting our example data and discussing some key properties of text data sets as we go.



## Text sequences

Our simulated data consists of two classes of very short "sentences": one composed of odd digits and the other of even digits, with each digit written as an English word.

Our goal is to learn to classify each sentence as either odd or even in a supervised text classification task. Of course, we do not really need any machine learning for this simple task — we use this contrived example for illustrative purposes.

In [1]:
import numpy as np
digit_to_word_map = {1:"One",2:"Two", 3:"Three", 4:"Four", 5:"Five",
                     6:"Six",7:"Seven",8:"Eight",9:"Nine"}
digit_to_word_map[0]="PAD_TOKEN"

times_steps = 6
even_sentences = []
odd_sentences = []
seqlens = []
for i in range(10000):
    rand_seq_len = np.random.choice(range(3,times_steps+1))
    seqlens.append(rand_seq_len)
    rand_odd_ints = np.random.choice(range(1,10,2),
                                     rand_seq_len)
    rand_even_ints = np.random.choice(range(2,10,2),
                                      rand_seq_len)

    #Padding
    if rand_seq_len<times_steps:
        rand_odd_ints = np.append(rand_odd_ints,
                                  [0]*(times_steps-rand_seq_len))
        rand_even_ints = np.append(rand_even_ints,
                                   [0]*(times_steps-rand_seq_len))

    even_sentences+=[" ".join([digit_to_word_map[r] for
                               r in rand_even_ints])]
    odd_sentences+=[" ".join([digit_to_word_map[r] for
                              r in rand_odd_ints])] 

data = even_sentences+odd_sentences
#Even and odd sentences share the same lengths, so repeat the list once
seqlens *=2

In [2]:
seqlens[:10]

[4, 3, 5, 6, 6, 4, 4, 3, 4, 5]


We pad sentences shorter than 6 words with zeros (rendered here as the PAD_TOKEN word) so that all sentences have the same length and can be put in one tensor per batch. This pre-processing step is known as "zero-padding".
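As a minimal sketch of zero-padding on its own, separate from the data generation above: the helper name `pad_sequence` and the constant `PAD_ID` are assumptions made for this illustration, not part of the notebook. The helper returns the original length alongside the padded sequence, because that length is needed later to tell real words from padding.

```python
PAD_ID = 0  # assumed padding ID, mirroring digit_to_word_map[0] above


def pad_sequence(seq, max_len, pad_id=PAD_ID):
    """Return (padded copy of seq, original length of seq)."""
    if len(seq) > max_len:
        raise ValueError("sequence is longer than max_len")
    return list(seq) + [pad_id] * (max_len - len(seq)), len(seq)


padded, length = pad_sequence([4, 2, 2, 8], max_len=6)
print(padded, length)  # [4, 2, 2, 8, 0, 0] 4
```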

In [3]:
even_sentences[0:6]

['Four Two Two Eight PAD_TOKEN PAD_TOKEN',
 'Eight Four Four PAD_TOKEN PAD_TOKEN PAD_TOKEN',
 'Two Four Six Two Eight PAD_TOKEN',
 'Two Two Six Six Six Two',
 'Four Two Six Four Eight Eight',
 'Two Six Two Six PAD_TOKEN PAD_TOKEN']

In [4]:
odd_sentences[0:6]

['Seven Nine Nine Five PAD_TOKEN PAD_TOKEN',
 'One Seven One PAD_TOKEN PAD_TOKEN PAD_TOKEN',
 'Seven Three Nine Three Five PAD_TOKEN',
 'Three One Five Five Nine One',
 'Three Seven Seven One Seven Five',
 'Three Nine Five Three PAD_TOKEN PAD_TOKEN']

In [5]:
#original sentence lengths
seqlens[0:6]

[4, 3, 5, 6, 6, 4]

In real applications, where data is not simulated, we would get a collection of documents (e.g., one-sentence tweets) and then map each word to an integer ID.

Let's create this map — a dictionary with words as keys, and indices as values. We also create the inverse map.

In [6]:
#Map from words to indices
word2index_map ={}
index=0
for sent in data:
    for word in sent.lower().split():
        if word not in word2index_map:
            word2index_map[word] = index
            index+=1
#Inverse map
index2word_map = {index: word for word, index in word2index_map.items()}

vocabulary_size = len(index2word_map)

In [7]:
vocabulary_size

10

In [8]:
word2index_map

{'eight': 2,
 'five': 7,
 'four': 0,
 'nine': 6,
 'one': 8,
 'pad_token': 3,
 'seven': 5,
 'six': 4,
 'three': 9,
 'two': 1}

In [9]:
index2word_map

{0: 'four',
 1: 'two',
 2: 'eight',
 3: 'pad_token',
 4: 'six',
 5: 'seven',
 6: 'nine',
 7: 'five',
 8: 'one',
 9: 'three'}
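To see the two maps working together, here is a small round-trip sketch on a made-up three-sentence corpus. The names `corpus`, `word2index`, `index2word`, `encode` and `decode` are hypothetical and chosen for this example only; the IDs differ from those printed above because they depend on the order in which words are first seen.

```python
# Hypothetical three-sentence corpus standing in for `data`.
corpus = ["Two Four PAD_TOKEN", "One Three Five", "Four Two Two"]

word2index = {}
for sentence in corpus:
    for word in sentence.lower().split():
        # setdefault assigns the next free ID only the first time a word is seen
        word2index.setdefault(word, len(word2index))
index2word = {i: w for w, i in word2index.items()}


def encode(sentence):
    """Map a sentence to a list of word IDs."""
    return [word2index[w] for w in sentence.lower().split()]


def decode(ids):
    """Map a list of word IDs back to a lowercase sentence."""
    return " ".join(index2word[i] for i in ids)


ids = encode("Four Two Two")
print(ids)          # [1, 0, 0]
print(decode(ids))  # four two two
```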

To wrap up our data generation for this contrived supervised learning task, we create an array of labels in one-hot format, train and test sets, and a function to generate batches of instances.

First, we create the labels and split our data into train and test sets.

In [10]:
labels = [1]*10000 + [0]*10000
for i in range(len(labels)):
    label = labels[i]
    one_hot_encoding = [0]*2
    one_hot_encoding[label] = 1
    labels[i] = one_hot_encoding


data_indices = list(range(len(data)))
np.random.shuffle(data_indices)
data = np.array(data)[data_indices]

labels = np.array(labels)[data_indices]
seqlens = np.array(seqlens)[data_indices]
train_x = data[:10000]
train_y = labels[:10000]
train_seqlens = seqlens[:10000]

test_x = data[10000:]
test_y = labels[10000:]
test_seqlens = seqlens[10000:]
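The key point in the cell above is that a single shuffled index list is applied to every parallel array, so each sentence stays aligned with its label and length. A pure-Python sketch of the same idea on toy data follows; the helper `one_hot` and the four sentences are assumptions made for illustration.

```python
import random


def one_hot(label, num_classes):
    """Return a list of zeros with a single 1 at position `label`."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec


# Hypothetical toy data: label 1 = even sentence, label 0 = odd sentence.
sentences = ["two four", "six eight", "one three", "five seven"]
labels = [one_hot(label, 2) for label in [1, 1, 0, 0]]

# Shuffle one list of indices and apply it to every parallel list, so that
# sentence i still lines up with label i afterwards.
order = list(range(len(sentences)))
random.Random(0).shuffle(order)
sentences = [sentences[i] for i in order]
labels = [labels[i] for i in order]

# First half for training, second half for testing.
split = len(sentences) // 2
train_x, test_x = sentences[:split], sentences[split:]
train_y, test_y = labels[:split], labels[split:]
```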

Finally, we create a function that generates batches of sentences, and their respective labels and lengths.  

In [11]:
def get_sentence_batch(batch_size,data_x,
                       data_y,data_seqlens):
    instance_indices = list(range(len(data_x)))
    np.random.shuffle(instance_indices)
    batch = instance_indices[:batch_size]
    x = [[word2index_map[word] for word in data_x[i].lower().split()]
         for i in batch]
    y = [data_y[i] for i in batch]
    seqlens = [data_seqlens[i] for i in batch]
    return x,y,seqlens


In [12]:
x,y, lens = get_sentence_batch(batch_size = 8,data_x = train_x, data_y = train_y,data_seqlens = train_seqlens)

In [13]:
print(x)

[[2, 0, 2, 2, 2, 3], [0, 4, 4, 2, 3, 3], [1, 4, 4, 3, 3, 3], [5, 7, 8, 5, 6, 3], [6, 9, 6, 9, 3, 3], [5, 8, 9, 3, 3, 3], [8, 8, 5, 9, 5, 3], [2, 2, 2, 3, 3, 3]]


In [14]:
[[index2word_map[w] for w in sentence] for sentence in x]

[['eight', 'four', 'eight', 'eight', 'eight', 'pad_token'],
 ['four', 'six', 'six', 'eight', 'pad_token', 'pad_token'],
 ['two', 'six', 'six', 'pad_token', 'pad_token', 'pad_token'],
 ['seven', 'five', 'one', 'seven', 'nine', 'pad_token'],
 ['nine', 'three', 'nine', 'three', 'pad_token', 'pad_token'],
 ['seven', 'one', 'three', 'pad_token', 'pad_token', 'pad_token'],
 ['one', 'one', 'seven', 'three', 'seven', 'pad_token'],
 ['eight', 'eight', 'eight', 'pad_token', 'pad_token', 'pad_token']]

In [15]:
lens

[5, 4, 3, 5, 4, 3, 5, 3]

In [16]:
y

[array([0, 1]),
 array([0, 1]),
 array([0, 1]),
 array([1, 0]),
 array([1, 0]),
 array([1, 0]),
 array([1, 0]),
 array([0, 1])]

To work with this data in TensorFlow, we create placeholders. The \_inputs placeholder holds batch_size sentences, each a sequence of times_steps word IDs; \_labels holds their one-hot labels and \_seqlens their original lengths.

In [17]:
import tensorflow as tf
num_classes = 2
batch_size = 128
tf.reset_default_graph()
_inputs = tf.placeholder(tf.int32, shape=[batch_size,times_steps])
_labels = tf.placeholder(tf.float32, shape=[batch_size, num_classes])
#seqlens for dynamic calculation
_seqlens = tf.placeholder(tf.int32, shape=[batch_size])

## Supervised word embeddings

We next convert our integer sequences into word vectors. We will train the word vectors in a supervised framework, tuning the embedded word vectors to solve the downstream classification task. Here we use vectors of size 64, initialized randomly.

In [18]:
embedding_dimension = 64
with tf.name_scope("embeddings"):
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size,
                           embedding_dimension],
                          -1.0, 1.0),name='embedding')
    embed = tf.nn.embedding_lookup(embeddings, _inputs)

It is helpful to think of word embeddings as basic hash tables or look-up tables, mapping words to their dense vector values. Hence the name of the tf.nn.embedding_lookup() function, which efficiently retrieves the vectors for each word in a given sequence of word indices.
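The look-up table view can be sketched in plain Python, without TensorFlow. This is only an illustration of the indexing behaviour, not of how the library implements it; the function `embedding_lookup` below is a hypothetical stand-in, and the small sizes are chosen for readability.

```python
import random

vocabulary_size, embedding_dimension = 10, 4  # small sizes for readability
rng = random.Random(0)

# One row per word ID, initialized with random values between -1 and 1.
embeddings = [[rng.uniform(-1.0, 1.0) for _ in range(embedding_dimension)]
              for _ in range(vocabulary_size)]


def embedding_lookup(table, batch_ids):
    """Replace every word ID in a batch by the matching row of the table."""
    return [[table[i] for i in sentence] for sentence in batch_ids]


batch = [[2, 0, 3], [5, 5, 3]]  # 2 sentences, 3 time steps each
embedded = embedding_lookup(embeddings, batch)

# The shape goes from [batch, time] to [batch, time, embedding_dimension].
print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # 2 3 4
```

Repeated IDs, such as the two 5s in the second sentence, map to the same row, which is why the vectors for a word are shared across all its occurrences during training.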

## LSTM and using sequence length

We create an LSTM cell with tf.contrib.rnn.LSTMCell(), and feed it to tf.nn.dynamic_rnn(). We also give dynamic_rnn() the length of each sequence in a batch of examples, using the \_seqlens placeholder we created above.

dynamic_rnn() uses the sequence lengths to stop updating the RNN state beyond the last real element of each sequence, so the final state it returns is the state at that element, which is what we need for this task.

In [19]:
hidden_layer_size = 32
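with tf.variable_scope("lstm"):
    lstm_cell = tf.contrib.rnn.LSTMCell(hidden_layer_size)
    #states is an LSTMStateTuple (c, h): cell state and hidden state
    _, states = tf.nn.dynamic_rnn(lstm_cell, embed,
                                  sequence_length = _seqlens,
                                  dtype=tf.float32)
#Hidden state h at the last real step of each sequence
last_state = states[1]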

In this task, we want only the last LSTM output that is not zero, that is, the output at the last real word. For example, if the length of our original sequence is 5 and we zero-pad it to a sequence of length 15, we extract only the fifth output vector from the LSTM.
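A plain-Python sketch of that selection, assuming we already hold the per-step outputs of a recurrent layer as nested lists of shape [batch, time, hidden]. The function name `last_relevant` and the numbers are made up for illustration; the rows of zeros stand for steps past the true length.

```python
def last_relevant(outputs, seqlens):
    """Pick, for each sentence, the output at its last real (unpadded) step."""
    return [sentence[length - 1] for sentence, length in zip(outputs, seqlens)]


# Hypothetical outputs for 2 sentences, 4 time steps, hidden size 2.
outputs = [
    [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0], [0.0, 0.0]],  # true length 2
    [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0], [0.0, 0.0]],  # true length 3
]

print(last_relevant(outputs, [2, 3]))  # [[0.3, 0.4], [0.9, 1.0]]
```

Simply taking the output at the final time step would return the zero rows instead, which is why the original lengths have to be carried along with the padded data.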
 
Finally, we use this output vector in a linear layer, as in softmax regression.

In [20]:
W = tf.Variable(tf.truncated_normal([hidden_layer_size,num_classes],mean=0,stddev=.01))
b = tf.Variable(tf.truncated_normal([num_classes],mean=0,stddev=.01))

#Unnormalized class scores (logits)
final_pred = tf.matmul(last_state,W) + b

#Per-example cross-entropy loss, then its mean over the batch
softmax = tf.nn.softmax_cross_entropy_with_logits(logits = final_pred,labels = _labels)
cross_entropy = tf.reduce_mean(softmax)
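For intuition about the loss computed above, here is a small pure-Python sketch of softmax cross-entropy on logits with one-hot labels. It is a reference calculation under the standard definition, not the TensorFlow implementation; the function name and the toy logits are assumptions.

```python
import math


def softmax_cross_entropy(logits, one_hot_label):
    """Cross-entropy between softmax(logits) and a one-hot label."""
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_sum for z in logits]
    return -sum(y * lp for y, lp in zip(one_hot_label, log_probs))


batch_logits = [[2.0, 0.0], [0.0, 0.0]]
batch_labels = [[1, 0], [0, 1]]

# One loss value per example, then the mean over the batch.
losses = [softmax_cross_entropy(z, y)
          for z, y in zip(batch_logits, batch_labels)]
mean_loss = sum(losses) / len(losses)
print(losses, mean_loss)
```

The second example has equal logits, so the model assigns probability 0.5 to each class and the loss is log 2; the first example is classified confidently and correctly, so its loss is much smaller.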


## Training

We create ops to apply the gradient updates and compute accuracy,

In [21]:
train_step = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(_labels,1),
                              tf.argmax(final_pred,1))
accuracy = (tf.reduce_mean(tf.cast(correct_prediction,
                                   tf.float32)))*100

and are now ready to train. Note that we give feed_dict batches of input sequences, labels, and sequence lengths.

In [22]:

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    for step in range(1000):
        x_batch, y_batch,seqlen_batch = get_sentence_batch(batch_size,
                                                           train_x,train_y,
                                                           train_seqlens)
        sess.run(train_step,feed_dict={_inputs:x_batch, _labels:y_batch,
                                       _seqlens:seqlen_batch})
    
        if step % 100 == 0:
            acc = sess.run(accuracy,feed_dict={_inputs:x_batch,
                                               _labels:y_batch,
                                               _seqlens:seqlen_batch})
            print("Accuracy at %d: %.5f" % (step, acc)) 
      
    for test_batch in range(5):
        x_test, y_test,seqlen_test = get_sentence_batch(batch_size,
                                                        test_x,test_y,
                                                        test_seqlens)
        batch_pred,batch_acc = sess.run([tf.argmax(final_pred,1),
                                         accuracy],
                                        feed_dict={_inputs:x_test,
                                                   _labels:y_test,
                                                   _seqlens:seqlen_test})
        print("Test batch accuracy %d: %.5f" % (test_batch, batch_acc)) 


Accuracy at 0: 18.75000
Accuracy at 100: 100.00000
Accuracy at 200: 100.00000
Accuracy at 300: 100.00000
Accuracy at 400: 100.00000
Accuracy at 500: 100.00000
Accuracy at 600: 100.00000
Accuracy at 700: 100.00000
Accuracy at 800: 100.00000
Accuracy at 900: 100.00000
Test batch accuracy 0: 100.00000
Test batch accuracy 1: 100.00000
Test batch accuracy 2: 100.00000
Test batch accuracy 3: 100.00000
Test batch accuracy 4: 100.00000
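The accuracy figures printed above come from comparing the argmax of the logits with the argmax of the one-hot labels and averaging the matches. A plain-Python sketch of that computation, with hypothetical helper names and made-up logits:

```python
def argmax(values):
    """Index of the largest element (the first one in case of ties)."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy_percent(logits, one_hot_labels):
    """Percentage of examples whose predicted class matches the label."""
    hits = [argmax(z) == argmax(y) for z, y in zip(logits, one_hot_labels)]
    return 100.0 * sum(hits) / len(hits)


logits = [[2.0, -1.0], [0.3, 0.9], [1.5, 0.2], [-0.4, 0.1]]
labels = [[1, 0], [0, 1], [0, 1], [0, 1]]
print(accuracy_percent(logits, labels))  # 75.0
```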


## Quick hands-on
* Try making the embedding dimension much lower (go all the way down to 1) and re-running the training. How does this affect the accuracy progression over the 1000 iterations?
* What about the hidden layer size?
* Try making the data sequences much longer. Has training speed changed? What about accuracy?
* Make the necessary changes so that we do not use an additional linear layer and instead feed the LSTM outputs directly to softmax (hint: dimensions...)

