# Introduction to Natural Language Processing (NLP) and Recurrent Neural Networks in TensorFlow

### Word Embeddings

Word embeddings, or word vectors, provide a way of mapping words from a vocabulary into a low-dimensional space, where words with similar meanings are close together. Let's play around with a set of pre-trained word vectors, to get used to their properties. There exist many sets of pretrained word embeddings; here, we use ConceptNet Numberbatch, which provides a relatively small download in an easy-to-work-with format (h5).

In [1]:
# Download word vectors
from urllib.request import urlretrieve
import os
if not os.path.isfile('mini.h5'):
    print("Downloading Conceptnet Numberbatch word embeddings...")
    conceptnet_url = 'http://conceptnet.s3.amazonaws.com/precomputed-data/2016/numberbatch/17.06/mini.h5'
    urlretrieve(conceptnet_url, 'mini.h5')

To read an `h5` file, we'll need to use the `h5py` package. Below, we use the package to open the `mini.h5` file we just downloaded. We extract from the file a list of utf-8-encoded words, as well as their $300$-dimensional vectors.

In [2]:
import numpy as np
import h5py
with h5py.File('mini.h5', 'r') as f:
    all_words = [word.decode('utf-8') for word in f['mat']['axis1'][:]]
    all_embeddings = f['mat']['block0_values'][:]


Now, `all_words` is a list of $V$ strings (what we call our *vocabulary*), and `all_embeddings` is a $V \times 300$ matrix. The strings are of the form `/c/language_code/word`: for example, `/c/en/cat` and `/c/es/gato`.

We are interested only in the English words. We use Python list comprehensions to pull out the indices of the English words, then extract just the English words (stripping the six-character `/c/en/` prefix) and their embeddings.

In [3]:
english_words = [word[6:] for word in all_words if word.startswith('/c/en/')]
english_word_indices = [i for i, word in enumerate(all_words) if word.startswith('/c/en/')]
english_embeddings = all_embeddings[english_word_indices]

The magnitude of a word vector is less important than its direction; the magnitude can be thought of as representing frequency of use, independent of the semantics of the word. 
Here, we will be interested in semantics, so we *normalize* our vectors, dividing each by its length. 
The result is that all of our word vectors have length 1, and as such, lie on a unit hypersphere (the higher-dimensional analogue of the unit circle). 
The dot product of two unit vectors equals the cosine of the angle between them, and provides a measure of similarity (the bigger the cosine, the smaller the angle).

<img src="Figures/cosine_similarity.png" alt="cosine" style="width: 500px;"/>
<center>Figure adapted from *[Mastering Machine Learning with Spark 2.x](https://www.safaribooksonline.com/library/view/mastering-machine-learning/9781785283451/ba8bef27-953e-42a4-8180-cea152af8118.xhtml)*</center>

In [4]:
norms = np.linalg.norm(english_embeddings, axis=1)
normalized_embeddings = english_embeddings.astype('float32') / norms.astype('float32').reshape([-1, 1])
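
To make the geometry concrete, here is a small standard-library sketch showing that once vectors are scaled to length 1, their dot product is the cosine of the angle between them. It uses toy 2-D vectors rather than the Numberbatch embeddings, and the helper names `normalize` and `dot` are ours.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy vectors, chosen only so the arithmetic is easy to follow.
a = normalize([3.0, 4.0])   # -> [0.6, 0.8]
b = normalize([4.0, 3.0])   # -> [0.8, 0.6]

cos_sim = dot(a, b)          # 0.6*0.8 + 0.8*0.6 = 0.96
angle = math.degrees(math.acos(cos_sim))
print(cos_sim, angle)
```

A vector dotted with itself gives 1 after normalization, which is why a word's similarity with itself is the maximum possible score.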

We want to look up words easily, so we create a dictionary that maps us from a word to its index in the word embeddings matrix.

In [5]:
index = {word: i for i, word in enumerate(english_words)}

Now we are ready to measure the similarity between pairs of words. We use numpy to take dot products.

In [6]:
def similarity_score(w1, w2):
    score = np.dot(normalized_embeddings[index[w1], :], normalized_embeddings[index[w2], :])
    return score

def print_similarity(w1, w2):
    try:
        print('{0}\t{1}\t'.format(w1, w2), similarity_score(w1, w2))
    except KeyError:
        # index[...] raises KeyError for out-of-vocabulary words
        print('One of the words is not in the dictionary.')


In [7]:

# A word is as similar with itself as possible:
print('cat\tcat\t', similarity_score('cat', 'cat'))
# Closely related words still get high scores:
print('cat\tfeline\t', similarity_score('cat', 'feline'))
print('cat\tdog\t', similarity_score('cat', 'dog'))
# Unrelated words, not so much
print('cat\tmoo\t', similarity_score('cat', 'moo'))
print('cat\tfreeze\t', similarity_score('cat', 'freeze'))

cat	cat	 1.0
cat	feline	 0.8199548
cat	dog	 0.590724
cat	moo	 0.0039538248
cat	freeze	 -0.030225184


In [8]:

# Antonyms are still considered related, sometimes more so than synonyms
print('antonyms\topposites\t', similarity_score('antonym', 'opposite'))
print('antonyms\tsynonyms\t', similarity_score('antonym', 'synonym'))


antonyms	opposites	 0.3941065
antonyms	synonyms	 0.46883982


In [9]:
# whatever we want:
print_similarity('iguana','chameleon')
print_similarity('iguana','dog')
print_similarity('iguana','reptile')
print_similarity('encyclopedia','wikipedia')
print_similarity('encyclopedia','book')
print_similarity('fake','news')
print_similarity('fake','real')

iguana	chameleon	 0.34494567
iguana	dog	 0.20372051
iguana	reptile	 0.46178538
encyclopedia	wikipedia	 0.506083
encyclopedia	book	 0.28528008
fake	news	 0.043742567
fake	real	 0.40731964


We can also find, for instance, the most similar words to a given word.

In [10]:
def closest_to_vector(v, n):
    all_scores = np.dot(normalized_embeddings, v)
    best_words = map(lambda i: english_words[i], reversed(np.argsort(all_scores)))
    return [next(best_words) for _ in range(n)]

def most_similar(w, n):
    """
    Find the `n` most similar words to `w`.
    """
    return closest_to_vector(normalized_embeddings[index[w], :], n)

In [11]:
print(most_similar('cat', 10))
print(most_similar('dog', 10))
print(most_similar('duke', 10))
print(most_similar('wikipedia', 10))
print(most_similar('deep', 10))

['cat', 'humane_society', 'kitten', 'feline', 'colocolo', 'cats', 'kitty', 'maine_coon', 'housecat', 'sharp_teeth']
['dog', 'dogs', 'wire_haired_dachshund', 'doggy_paddle', 'lhasa_apso', 'good_friend', 'puppy_dog', 'bichon_frise', 'woof_woof', 'golden_retrievers']
['duke', 'dukes', 'duchess', 'duchesses', 'ducal', 'dukedom', 'duchy', 'voivode', 'princes', 'prince']
['wikipedia', 'wikipedias', 'wikimedia', 'wikisource', 'wiki', 'wikiquote', 'mediawiki', 'wikipedians', 'wikipedian', 'wikimedia_commons']
['deep', 'deeps', 'deepest', 'deeper', 'profound', 'unfathomed', 'depths', 'profoundest', 'depth', 'deepness']
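
`closest_to_vector` sorts every score even though only the top `n` are needed. One possible alternative, sketched here with the standard library's `heapq.nlargest`, selects the top `n` directly. The helper name `top_n` is ours, and the scores are the similarities to `cat` printed earlier, rounded.

```python
import heapq

def top_n(words, scores, n):
    """Return the n words with the highest scores, best first."""
    best = heapq.nlargest(n, zip(scores, words))
    return [word for _, word in best]

# Similarities to 'cat', rounded from the scores printed earlier.
words = ['cat', 'feline', 'dog', 'moo', 'freeze']
scores = [1.0, 0.82, 0.59, 0.004, -0.03]

print(top_n(words, scores, 3))  # ['cat', 'feline', 'dog']
```

With NumPy arrays, `np.argpartition` plays a similar role; for a vocabulary of this size, the full sort used above is also perfectly workable.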


We can also use `closest_to_vector` to find words "nearby" vectors that we create ourselves. This allows us to solve analogies. For example, in order to solve the analogy "man : brother :: woman : ?", we can compute a new vector `brother - man + woman`: the meaning of brother, minus the meaning of man, plus the meaning of woman. We can then ask which words are closest, in the embedding space, to that new vector.

In [12]:
def solve_analogy(a1, b1, a2):
    b2 = normalized_embeddings[index[b1], :] - normalized_embeddings[index[a1], :] + normalized_embeddings[index[a2], :]
    return closest_to_vector(b2, 5)
def print_analogy(a1, b1,a2):
    closest_words=solve_analogy(a1,b1,a2)
    print("{0}:{1} as {2}:?".format(a1,b1,a2))
    print("Best guesses are: {}".format(closest_words))
    return None

In [13]:
print_analogy("man", "brother", "woman")
print_analogy("man", "husband", "woman")
print_analogy("spain", "madrid", "france")

man:brother as woman:?
Best guesses are: ['sister', 'brother', 'sisters', 'kid_sister', 'younger_brother']
man:husband as woman:?
Best guesses are: ['wife', 'husband', 'husbands', 'spouse', 'wifes']
spain:madrid as france:?
Best guesses are: ['paris', 'france', 'le_havre', 'in_france', 'montmartre']


In [14]:
print_analogy("dog", "golden_retriever", "cat")

dog:golden_retriever as cat:?
Best guesses are: ['cat', 'maine_coon', 'kitten', 'tabby', 'kitty']


In [15]:
print_analogy("dog", "bark", "cat")

dog:bark as cat:?
Best guesses are: ['bark', 'cat', 'sharp_teeth', 'barks', 'meow']


In [16]:
print_analogy("bark", "meow", "dog")

bark:meow as dog:?
Best guesses are: ['meow', 'meows', 'cat', 'meowing', 'kitty']


The first three results (from `In [13]`) are quite good, but in general, the results of these analogies can be disappointing: in the later examples, one of the input words often tops the list. Try experimenting with other analogies, and see if you can think of ways to get around the problems you notice (e.g., modifications to the `solve_analogy` algorithm).
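
As one possible starting point (a common trick, not necessarily the intended answer), we could exclude the three input words from the candidates. The sketch below uses hypothetical 3-D unit vectors invented for illustration, and `solve_analogy_excluding` is our own name.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical 3-D unit vectors, invented for illustration only.
toy = {
    'man':     [1.0, 0.0, 0.0],
    'woman':   [0.0, 1.0, 0.0],
    'brother': [0.6, 0.0, 0.8],
    'sister':  [0.0, 0.6, 0.8],
    'sibling': [0.0, 0.0, 1.0],
}

def solve_analogy_excluding(a1, b1, a2, n=2):
    """Like solve_analogy, but never returns one of the three input words."""
    # target = b1 - a1 + a2, computed component by component
    target = [b - a + c for a, b, c in zip(toy[a1], toy[b1], toy[a2])]
    candidates = [w for w in toy if w not in (a1, b1, a2)]
    candidates.sort(key=lambda w: dot(toy[w], target), reverse=True)
    return candidates[:n]

print(solve_analogy_excluding('man', 'brother', 'woman'))  # ['sister', 'sibling']
```

The same filter could be applied to the real `english_words` list by skipping `a1`, `b1`, and `a2` when collecting the best words.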

### Using word embeddings in deep models
Word embeddings are fun to play around with, but their primary use is that they allow us to think of words as existing in a continuous, Euclidean space; we can then use an existing arsenal of techniques for machine learning with continuous numerical data (like logistic regression or neural networks) to process text.

Let's take a look at an especially simple version of this. We'll perform *sentiment analysis* on a set of movie reviews: in particular, we will attempt to classify a movie review as positive or negative based on its text.

We will use a [Simple Word Embedding Model](http://people.ee.duke.edu/~lcarin/acl2018_swem.pdf) (SWEM, Shen et al. 2018) to do so. We will represent a review as the *mean* of the embeddings of the words in the review. Then we'll train a three-layer MLP (a neural network) to classify the review as positive or negative.
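
To illustrate the averaging step on its own, here is a minimal sketch with hypothetical 2-D embeddings (invented for illustration; the real vectors are 300-dimensional). The `None` return for reviews with no known words is a guard we add here, not something the model requires.

```python
import string

# Hypothetical 2-D embeddings, invented for illustration.
toy_embeddings = {
    'good':  [1.0, 0.0],
    'bad':   [-1.0, 0.0],
    'movie': [0.0, 1.0],
}
remove_punct = str.maketrans('', '', string.punctuation)

def mean_embedding(text):
    """Average the embeddings of the known words in `text`.

    Returns None when no word is in the vocabulary (an added guard).
    """
    words = text.translate(remove_punct).lower().split()
    vectors = [toy_embeddings[w] for w in words if w in toy_embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

print(mean_embedding("A good, good movie!"))  # approximately [0.667, 0.333]
```

Note that word order is discarded: "good movie" and "movie good" map to the same vector.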

###### A word of caution: these movie reviews are unfiltered real reviews online and contain inappropriate language.  For the purposes of this class, you don't have to read them, and can consider them just a matrix of numbers.

Download the `movie-simple.txt` file from the repository and place it where the code below expects it, at `Data/movie-simple.txt`. Each line of that file contains 

1. the numeral 0 (for negative) or the numeral 1 (for positive), followed by
2. a tab (the whitespace character), and then
3. the review itself.

In [17]:
import string
remove_punct=str.maketrans('','',string.punctuation)

# This function converts a line of our data file into a dict
# with keys 'x' (the 300-dimensional mean embedding of the
# words in a review), 'y' (its label), and 'w' (the list of
# per-word embeddings).
def convert_line_to_example(line):
    # Pull out the first character: that's our label (0 or 1)
    y = int(line[0])
    # Split the line into words using Python's split() function
    words = line[2:].translate(remove_punct).lower().split()
    # Look up the embeddings of each word, ignoring words not
    # in our pretrained vocabulary.
    embeddings = [normalized_embeddings[index[w]] for w in words
                  if w in index]
    # Take the mean of the embeddings
    x = np.mean(np.vstack(embeddings), axis=0)
    return {'x': x, 'y': y, 'w':embeddings}

# Apply the function to each line in the file.
enc = 'utf-8' # This is necessary from within the singularity shell
with open("Data/movie-simple.txt", "r", encoding=enc) as f:
    dataset = [convert_line_to_example(l) for l in f.readlines()]

In [18]:
len(dataset)

1411

Now that we have a dataset, let's shuffle it and do a train/test split. We use a quarter of the dataset for testing, 3/4 for training (but also ensure that we have a whole number of batches in our training set, to make the code nicer later).

In [19]:
import random
random.shuffle(dataset)

batch_size = 100
total_batches = len(dataset) // batch_size
train_batches = 3 * total_batches // 4
train, test = dataset[:train_batches*batch_size], dataset[train_batches*batch_size:]
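
Working through the arithmetic for this dataset (1411 examples, as reported above) shows what the split actually produces. The helper `split_sizes` is ours and just mirrors the integer divisions in the cell above.

```python
def split_sizes(n_examples, batch_size):
    """Return (n_train, n_test) for the batching scheme used above."""
    total_batches = n_examples // batch_size      # whole batches available
    train_batches = 3 * total_batches // 4        # three quarters of them
    n_train = train_batches * batch_size
    return n_train, n_examples - n_train

print(split_sizes(1411, 100))  # (1000, 411)
```

Because every example left over after rounding goes to the test set, the test share here is about 29% rather than exactly a quarter.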

Time to build our MLP in TensorFlow. We'll use placeholders for `X` and `y` as usual.

In [20]:
import tensorflow as tf
tf.reset_default_graph()

# Placeholders for input
X = tf.placeholder(tf.float32, [None, 300])
y = tf.placeholder(tf.float32, [None, 1])

# Three-layer MLP
h1 = tf.layers.dense(X, 100, tf.nn.relu)
h2 = tf.layers.dense(h1, 20, tf.nn.relu)
logits = tf.layers.dense(h2, 1)
probabilities = tf.sigmoid(logits)

# Loss and metrics
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=y))
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.round(tf.sigmoid(logits)), y), tf.float32))

# Training
train_step = tf.train.GradientDescentOptimizer(0.05).minimize(loss)

# Initialization of variables
initialize_all = tf.global_variables_initializer()
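
The loss above relies on `tf.nn.sigmoid_cross_entropy_with_logits`, which takes raw logits rather than probabilities. As a rough illustration, here is a pure-Python sketch of the numerically stable form given in the TensorFlow documentation, `max(x, 0) - x*z + log(1 + exp(-|x|))`, next to the textbook formula. The function names are ours.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def stable_sigmoid_cross_entropy(logit, label):
    """Stable form: max(x, 0) - x*z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def naive_sigmoid_cross_entropy(logit, label):
    """Textbook form; breaks down when sigmoid(logit) rounds to 0 or 1."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

# An untrained model outputs logits near 0, so the loss starts near log(2).
print(stable_sigmoid_cross_entropy(0.0, 1.0))
print(math.log(2.0))

# The two forms agree for moderate logits.
print(stable_sigmoid_cross_entropy(2.0, 1.0), naive_sigmoid_cross_entropy(2.0, 1.0))
```

A starting loss of about `log(2) ≈ 0.693` is what we'd expect from a binary classifier that has not yet learned anything.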

We can now begin a session and train our model. We'll train for 250 epochs. When we're finished, we'll evaluate our accuracy on all the test data.

In [21]:
sess = tf.InteractiveSession()
sess.run(initialize_all)
for epoch in range(250):
    for batch in range(train_batches):
        data = train[batch*batch_size:(batch+1)*batch_size]
        reviews = [sample['x'] for sample in data]
        labels  = [sample['y'] for sample in data]
        labels = np.array(labels).reshape([-1, 1])
        _, l, acc = sess.run([train_step, loss, accuracy], feed_dict={X: reviews, y: labels})
    if epoch % 10 == 0:
        print("Epoch", epoch, "Loss", l, "Acc", acc)
    random.shuffle(train)

# Evaluate on test set
test_reviews = [sample['x'] for sample in test]
test_labels  = [sample['y'] for sample in test]
test_labels = np.array(test_labels).reshape([-1, 1])
acc = sess.run(accuracy, feed_dict={X: test_reviews, y: test_labels})
print("Final accuracy:", acc)

Epoch 0 Loss 0.69135916 Acc 0.61
Epoch 10 Loss 0.66213167 Acc 0.65
Epoch 20 Loss 0.65845925 Acc 0.68
Epoch 30 Loss 0.66219825 Acc 0.55
Epoch 40 Loss 0.6425139 Acc 0.64
Epoch 50 Loss 0.59427387 Acc 0.8
Epoch 60 Loss 0.5231353 Acc 0.81
Epoch 70 Loss 0.4517753 Acc 0.89
Epoch 80 Loss 0.4067031 Acc 0.86
Epoch 90 Loss 0.3422463 Acc 0.89
Epoch 100 Loss 0.26190525 Acc 0.98
Epoch 110 Loss 0.27958417 Acc 0.9
Epoch 120 Loss 0.2106394 Acc 0.92
Epoch 130 Loss 0.2044314 Acc 0.92
Epoch 140 Loss 0.17735112 Acc 0.96
Epoch 150 Loss 0.26515996 Acc 0.85
Epoch 160 Loss 0.1422936 Acc 0.97
Epoch 170 Loss 0.10893148 Acc 0.97
Epoch 180 Loss 0.08726247 Acc 0.99
Epoch 190 Loss 0.12732412 Acc 0.96
Epoch 200 Loss 0.0938323 Acc 0.99
Epoch 210 Loss 0.11325945 Acc 0.97
Epoch 220 Loss 0.14819159 Acc 0.95
Epoch 230 Loss 0.08506051 Acc 0.95
Epoch 240 Loss 0.12863559 Acc 0.95
Final accuracy: 0.9440389


We can now examine what our model has learned, seeing how it responds to word vectors for different words:

In [22]:
# Check some words
words_to_test = ['exciting', 'hated', 'boring', 'loved']
for word in words_to_test:
    print(word, sess.run(probabilities, feed_dict={X: normalized_embeddings[index[word]].reshape(1, 300)}))

exciting [[0.9999826]]
hated [[3.0430762e-08]]
boring [[5.766049e-07]]
loved [[0.99999976]]


The model has also picked up associations with words from movie titles: "brokeback" and "potter" score as strongly positive, while "mountain" scores as strongly negative.

In [23]:
# Check some words
words_to_test = ['brokeback','mountain','potter']
for word in words_to_test:
    print(word, sess.run(probabilities, feed_dict={X: normalized_embeddings[index[word]].reshape(1, 300)}))

brokeback [[0.97749025]]
mountain [[0.00071368]]
potter [[0.9919137]]


Try some words of your own!

In [24]:
sess.close()

This model works well on such a simple dataset, but does noticeably worse on something more complex. `movie-pang02.txt`, for instance, has 2000 longer, more complex movie reviews. It's in the same format as our simple dataset. On those longer reviews, this model achieves only 60-80% accuracy. (Increasing the number of epochs to, say, 1000, does help.)

In [25]:
# Apply the function to each line in the file.
with open("Data/movie-pang02.txt", "r",encoding=enc) as f:
    dataset = [convert_line_to_example(l) for l in f.readlines()]
import random
random.shuffle(dataset)
batch_size = 100
total_batches = len(dataset) // batch_size
train_batches = 3 * total_batches // 4
train, test = dataset[:train_batches*batch_size], dataset[train_batches*batch_size:]
sess = tf.InteractiveSession()
sess.run(initialize_all)
for epoch in range(250):
    for batch in range(train_batches):
        data = train[batch*batch_size:(batch+1)*batch_size]
        reviews = [sample['x'] for sample in data]
        labels  = [sample['y'] for sample in data]
        labels = np.array(labels).reshape([-1, 1])
        _, l, acc = sess.run([train_step, loss, accuracy], feed_dict={X: reviews, y: labels})
    if epoch % 10 == 0:
        print("Epoch", epoch, "Loss", l, "Acc", acc)
    random.shuffle(train)

# Evaluate on test set
test_reviews = [sample['x'] for sample in test]
test_labels  = [sample['y'] for sample in test]
test_labels = np.array(test_labels).reshape([-1, 1])
acc = sess.run(accuracy, feed_dict={X: test_reviews, y: test_labels})
print("Final accuracy:", acc)
sess.close()

Epoch 0 Loss 0.6930969 Acc 0.52
Epoch 10 Loss 0.6924441 Acc 0.5
Epoch 20 Loss 0.6919834 Acc 0.59
Epoch 30 Loss 0.692334 Acc 0.52
Epoch 40 Loss 0.69111776 Acc 0.56
Epoch 50 Loss 0.6947514 Acc 0.4
Epoch 60 Loss 0.690073 Acc 0.72
Epoch 70 Loss 0.69025403 Acc 0.63
Epoch 80 Loss 0.6894762 Acc 0.62
Epoch 90 Loss 0.68978554 Acc 0.58
Epoch 100 Loss 0.69094795 Acc 0.5
Epoch 110 Loss 0.6878017 Acc 0.69
Epoch 120 Loss 0.6897257 Acc 0.61
Epoch 130 Loss 0.68736345 Acc 0.7
Epoch 140 Loss 0.6898315 Acc 0.45
Epoch 150 Loss 0.68583083 Acc 0.75
Epoch 160 Loss 0.6862535 Acc 0.65
Epoch 170 Loss 0.68725604 Acc 0.52
Epoch 180 Loss 0.6836673 Acc 0.56
Epoch 190 Loss 0.6855555 Acc 0.52
Epoch 200 Loss 0.6824671 Acc 0.56
Epoch 210 Loss 0.67649573 Acc 0.72
Epoch 220 Loss 0.6792133 Acc 0.65
Epoch 230 Loss 0.6842475 Acc 0.48
Epoch 240 Loss 0.6696932 Acc 0.66
Final accuracy: 0.662


### Recurrent Neural Networks (RNNs)

In the context of deep learning, natural language is commonly modeled with Recurrent Neural Networks (RNNs).
RNNs pass the output of a neuron back to the input of the next time step of the same neuron.
These directed cycles in the RNN architecture give them the ability to model temporal dynamics, making them particularly suited for modeling sequences (e.g. text).
We can visualize an RNN layer as follows:

<img src="Figures/basic_RNN.PNG" alt="basic_RNN" style="width: 80px;"/>
<center>Figure from *Understanding LSTMs*. https://colah.github.io/posts/2015-08-Understanding-LSTMs/</center>

We can unroll an RNN through time, making the sequence aspect of them more obvious:

<img src="Figures/unrolled_RNN.PNG" alt="basic_RNN" style="width: 400px;"/>
<center>Figure from *Understanding LSTMs*. https://colah.github.io/posts/2015-08-Understanding-LSTMs/</center>

#### RNNs in TensorFlow
How would we implement an RNN in TensorFlow? Given the different forms of RNNs, there are quite a few ways, but we'll stick to a simple one. 

In [26]:
# As always, import TensorFlow first
import tensorflow as tf

Let's assume we have our inputs in word embedding form already, say of dimensionality 100. We'll use a minibatch size of 16.

In [27]:
mb = 16
x_dim = 100

x1 = tf.placeholder(tf.float32, [mb, x_dim])

Define weight matrices for projecting the input, the previous state, and the output. Rather arbitrarily, let's pick a hidden layer size of 64.

In [28]:
h_dim = 64

# For projecting the input
U = tf.Variable(tf.truncated_normal([x_dim, h_dim], stddev=0.1))

# For projecting the previous state
W = tf.Variable(tf.truncated_normal([h_dim, h_dim], stddev=0.1))

# For projecting the output
V = tf.Variable(tf.truncated_normal([h_dim, x_dim], stddev=0.1))

Next, a function for one time step of the RNN.

In [29]:
def RNN_step(x, h):
    h_next = tf.nn.tanh(tf.matmul(x, U) + tf.matmul(h, W))
    output = tf.matmul(h_next, V)
    return output, h_next

In [30]:
# Initialize hidden state to 0
h0 = tf.zeros([mb, h_dim])

# Forward pass of one RNN step for time step t=1
y1, h1 = RNN_step(x1, h0)

print("Output y1 dimensions: {0}".format(y1.shape))
print("Hidden state h1 dimensions: {0}".format(h1.shape))

Output y1 dimensions: (16, 100)
Hidden state h1 dimensions: (16, 64)


We can keep calling the `RNN_step` function to continue unrolling the RNN as far as we need to. For each step, we feed in the next input (a new placeholder) and get a new output.

In [31]:
x2 = tf.placeholder(tf.float32, [mb, x_dim])

# Forward pass of one RNN step for time step t=2
y2, h2 = RNN_step(x2, h1)

print("Output y2 dimensions: {0}".format(y2.shape))
print("Hidden state h2 dimensions: {0}".format(h2.shape))

Output y2 dimensions: (16, 100)
Hidden state h2 dimensions: (16, 64)


Of course, in practice, you'd want to do this unrolling with a `for` loop, and the RNN functionality is more cleanly wrapped up in a class. 
We're not going to implement the class version here though, as TensorFlow already has these implemented: https://www.tensorflow.org/api_guides/python/contrib.rnn#Base_interface_for_all_RNN_Cells.
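
For intuition, the `for`-loop unrolling can also be sketched without TensorFlow. The version below implements the same update as `RNN_step` above for a single example (no minibatch dimension), using plain Python lists. The sizes and weights are toy values invented for illustration, and `vec_mat` and `rnn_step` are our own helper names.

```python
import math

def vec_mat(vec, mat):
    """Row vector (length n) times matrix (n rows, m columns)."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(len(mat[0]))]

def rnn_step(x, h, U, W, V):
    """h_next = tanh(x U + h W); output = h_next V."""
    pre = [a + b for a, b in zip(vec_mat(x, U), vec_mat(h, W))]
    h_next = [math.tanh(p) for p in pre]
    return vec_mat(h_next, V), h_next

# Toy sizes and weights, invented for illustration: x_dim = 2, h_dim = 3.
U = [[0.5, 0.0, -0.5],
     [0.0, 1.0, 0.0]]
W = [[0.1, 0.0, 0.0],
     [0.0, 0.1, 0.0],
     [0.0, 0.0, 0.1]]
V = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [0.0, 0.0, 0.0]          # initial hidden state
outputs, states = [], []
for x in sequence:           # one RNN step per time step, weights shared
    y, h = rnn_step(x, h, U, W, V)
    outputs.append(y)
    states.append(h)

print(len(outputs), len(outputs[0]), len(states[0]))  # 3 2 3
```

The key point is that `U`, `W`, and `V` are reused at every time step; only the input and the hidden state change.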

In [32]:
tf.reset_default_graph()
# Number of steps to unroll
num_steps = 10

# List of inputs and hidden states
xs = []
hs = []

# Build RNN
rnn = tf.contrib.rnn.BasicRNNCell(h_dim)

# Initialize hidden state to zero
h_t = tf.zeros([mb, h_dim])

for t in range(num_steps):
    x_t = tf.placeholder(tf.float32, [mb, x_dim])
    h_t, _ = rnn(x_t, h_t)
    
    xs.append(x_t)
    hs.append(h_t)
    
print("x dimensions:")
print([x_t.shape for x_t in xs])
print("\nh dimensions:")
print([h_t.shape for h_t in hs])

x dimensions:
[TensorShape([Dimension(16), Dimension(100)]), TensorShape([Dimension(16), Dimension(100)]), TensorShape([Dimension(16), Dimension(100)]), TensorShape([Dimension(16), Dimension(100)]), TensorShape([Dimension(16), Dimension(100)]), TensorShape([Dimension(16), Dimension(100)]), TensorShape([Dimension(16), Dimension(100)]), TensorShape([Dimension(16), Dimension(100)]), TensorShape([Dimension(16), Dimension(100)]), TensorShape([Dimension(16), Dimension(100)])]

h dimensions:
[TensorShape([Dimension(16), Dimension(64)]), TensorShape([Dimension(16), Dimension(64)]), TensorShape([Dimension(16), Dimension(64)]), TensorShape([Dimension(16), Dimension(64)]), TensorShape([Dimension(16), Dimension(64)]), TensorShape([Dimension(16), Dimension(64)]), TensorShape([Dimension(16), Dimension(64)]), TensorShape([Dimension(16), Dimension(64)]), TensorShape([Dimension(16), Dimension(64)]), TensorShape([Dimension(16), Dimension(64)])]


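Instead of unrolling for a fixed number of steps, we can handle sequences of variable length. This can be done with TensorFlow's built-in dynamic RNN, `tf.nn.dynamic_rnn`.

Let's start by applying an RNN to MNIST, treating each 28x28 image as a sequence of 28 rows with 28 pixels each.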
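
The reshape used in the cells that follow turns each flattened 784-pixel image into 28 time steps of 28 features. A minimal sketch of that idea with plain lists, where the stand-in image and the helper name `image_to_sequence` are ours:

```python
n_steps, n_inputs = 28, 28

def image_to_sequence(flat_pixels):
    """Split a flat list of 784 pixels into 28 rows of 28 pixels."""
    assert len(flat_pixels) == n_steps * n_inputs
    return [flat_pixels[r * n_inputs:(r + 1) * n_inputs] for r in range(n_steps)]

# Stand-in "image": pixel values are just their flat positions 0..783.
fake_image = list(range(n_steps * n_inputs))
rows = image_to_sequence(fake_image)

print(len(rows), len(rows[0]))  # 28 28
print(rows[1][:3])              # [28, 29, 30]
```

The RNN then reads the image top to bottom, one row per time step, and the final state is used for classification.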

In [33]:
tf.reset_default_graph()

n_steps = 28
n_inputs = 28
n_neurons = 150
n_outputs = 10

learning_rate = 0.001

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

logits = tf.layers.dense(states, n_outputs)
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                          logits=logits)
loss = tf.reduce_mean(cross_entropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

In [34]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('./tmp')
X_test = mnist.test.images.reshape((-1, n_steps, n_inputs))
y_test = mnist.test.labels

Extracting ./tmp/train-images-idx3-ubyte.gz
Extracting ./tmp/train-labels-idx1-ubyte.gz
Extracting ./tmp/t10k-images-idx3-ubyte.gz
Extracting ./tmp/t10k-labels-idx1-ubyte.gz


In [35]:
n_epochs = 10
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            X_batch = X_batch.reshape((-1, n_steps, n_inputs))
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)

0 Train accuracy: 0.9266667 Test accuracy: 0.9111
1 Train accuracy: 0.94 Test accuracy: 0.9338
2 Train accuracy: 0.9866667 Test accuracy: 0.9524
3 Train accuracy: 0.96 Test accuracy: 0.9598
4 Train accuracy: 0.96666664 Test accuracy: 0.9668
5 Train accuracy: 0.9266667 Test accuracy: 0.9443
6 Train accuracy: 0.97333336 Test accuracy: 0.9646
7 Train accuracy: 0.97333336 Test accuracy: 0.9641
8 Train accuracy: 0.99333334 Test accuracy: 0.9737
9 Train accuracy: 0.98 Test accuracy: 0.966


Next, we apply an RNN to the text reviews, starting with the easier dataset.

In [36]:
tf.reset_default_graph()
# sizes
n_steps = None
n_inputs = 300
n_neurons = 50
# Build RNN
X= tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y= tf.placeholder(tf.float32, [None, 1])
basic_cell = tf.contrib.rnn.BasicRNNCell(n_neurons,activation=tf.nn.tanh)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)
last_cell_output=outputs[:,-1,:]
y_=tf.layers.dense(last_cell_output,1)

In [37]:
# Loss and metrics
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=y_, labels=y))
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.round(tf.sigmoid(y_)), y), tf.float32))

# Training
train_step = tf.train.AdamOptimizer(0.001).minimize(loss)

##### Let's train this on the word embeddings for sentiment analysis

In [38]:
with open("Data/movie-simple.txt", "r",encoding=enc) as f:
    dataset = [convert_line_to_example(l) for l in f.readlines()]
import random
random.shuffle(dataset)
batch_size = 1
total_batches = len(dataset) // batch_size
train_batches = 3 * total_batches // 4
train, test = dataset[:train_batches*batch_size], dataset[train_batches*batch_size:]


In [39]:
initialize_all = tf.global_variables_initializer()
sess = tf.InteractiveSession()
sess.run(initialize_all)
l_ma=.74
acc_ma=.5
for epoch in range(2):
    for batch in range(train_batches):
        data = train[batch*batch_size:(batch+1)*batch_size]
        reviews = np.array([sample['w'] for sample in data]).reshape([1,-1,300])
        labels  = np.array([sample['y'] for sample in data]).reshape([1,1])
        labels = np.array(labels).reshape([-1, 1])
        _, l, acc = sess.run([train_step, loss, accuracy], feed_dict={X: reviews, y: labels})
        l_ma=.99*l_ma+(.01)*l
        acc_ma=.99*acc_ma+(.01)*acc
        if (batch+1) % 100 == 0:
            print("batch", batch, "Loss", l_ma, "Acc", acc_ma)
    if epoch % 1 == 0:
        print("Epoch", epoch, "Loss", l_ma, "Acc", acc_ma)
    random.shuffle(train)


batch 99 Loss 0.6807326930569666 Acc 0.5737172786808623
batch 199 Loss 0.622806626555265 Acc 0.6618326846100836
batch 299 Loss 0.5093100747250915 Acc 0.7340922617888246
batch 399 Loss 0.43880586051760734 Acc 0.8155645786314655
batch 499 Loss 0.37961545128341956 Acc 0.8569490547297536
batch 599 Loss 0.4343319061892812 Acc 0.7955144696782365
batch 699 Loss 0.36547634622865016 Acc 0.8326777645832846
batch 799 Loss 0.3135014364280898 Acc 0.8613334747846224
batch 899 Loss 0.33197937922695 Acc 0.8728015387520769
batch 999 Loss 0.2827433999012625 Acc 0.8898436005865152
Epoch 0 Loss 0.3120099963069911 Acc 0.8709885429134029
batch 99 Loss 0.2657768022122512 Acc 0.8946576402311365
batch 199 Loss 0.2366160425735716 Acc 0.9037072929649635
batch 299 Loss 0.21485114587533588 Acc 0.9077573885726519
batch 399 Loss 0.17474799786576492 Acc 0.9348693468622572
batch 499 Loss 0.1618795363058343 Acc 0.9535601876409244
batch 599 Loss 0.16044574291706593 Acc 0.9438806067709392
batch 699 Loss 0.179886020102208
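
The `l_ma` and `acc_ma` values printed above are exponential moving averages. With a batch size of 1, the raw per-batch accuracy is always 0 or 1, so some smoothing is needed to read a trend. A small sketch of the update, where the helper name `ema_update` is ours:

```python
def ema_update(avg, value, decay=0.99):
    """One step of the exponential moving average used in the loop above."""
    return decay * avg + (1.0 - decay) * value

# With batch_size = 1, per-batch accuracy is either 0.0 or 1.0.
acc_ma = 0.5
for _ in range(100):          # pretend 100 correct predictions in a row
    acc_ma = ema_update(acc_ma, 1.0)

print(round(acc_ma, 3))  # 0.817
```

Even after 100 perfect predictions the average has only reached about 0.82, so early values of `acc_ma` still reflect its starting value and lag behind the model's true accuracy.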

In [40]:
np.array(reviews).shape

(1, 7, 300)

In [41]:
# Evaluate on test set
test_acc=0
n=0
for sample in test:
    test_reviews = np.array([sample['w'] ]).reshape([1,-1,300])
    test_labels  = np.array([sample['y']]).reshape([1,1])
    test_labels = np.array(test_labels).reshape([-1, 1])
    test_acc += sess.run(accuracy, feed_dict={X: test_reviews, y: test_labels})
    n+=1
acc=test_acc/n 
print("Final accuracy:", acc)


Final accuracy: 0.9206798866855525


In [42]:
sess.close()

#### Long Short-Term Memory (LSTM)
One popular type of RNN is the Long Short-Term Memory (LSTM) network, which we covered in detail in class.

Trying out LSTMs is fairly easy in code.
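
As a reminder of what an LSTM cell computes at each step, here is a sketch of the textbook update with scalar input and state. The parameters are hypothetical values chosen by hand, the function names are ours, and TensorFlow's `LSTMCell` may differ in details such as how biases are handled.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, p):
    """One step of a textbook LSTM cell with scalar input and state.

    `p` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(gate):
        w, u, b = p[gate]
        return w * x + u * h + b

    i = sigmoid(pre('input'))         # how much new information to write
    f = sigmoid(pre('forget'))        # how much of the old cell state to keep
    o = sigmoid(pre('output'))        # how much of the cell state to expose
    g = math.tanh(pre('candidate'))   # proposed new content

    c_next = f * c + i * g
    h_next = o * math.tanh(c_next)
    return h_next, c_next

# Hypothetical parameters, chosen by hand for illustration.
params = {
    'input':     (1.0, 0.0, 0.0),
    'forget':    (0.0, 0.0, 2.0),   # positive bias: mostly remember
    'output':    (0.0, 0.0, 0.0),
    'candidate': (1.0, 0.0, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.0, 0.0]:
    h, c = lstm_step(x, h, c, params)
    print(round(h, 3), round(c, 3))
```

After the first input, the cell state decays only slowly on the zero inputs because the forget gate stays close to 1. This is the mechanism that lets LSTMs carry information across many time steps.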

In [43]:
tf.reset_default_graph()
# sizes
n_steps = None
n_inputs = 300
n_neurons = 300
# Build RNN
X= tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y= tf.placeholder(tf.float32, [None, 1])
basic_cell = tf.contrib.rnn.LSTMCell(n_neurons,activation=tf.nn.tanh)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)
last_cell_output=outputs[:,-1,:]
y_=tf.layers.dense(last_cell_output,1)

In [44]:
# Loss and metrics
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=y_, labels=y))
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.round(tf.sigmoid(y_)), y), tf.float32))

# Training
train_step = tf.train.AdamOptimizer(0.0005).minimize(loss)

In [45]:
initialize_all = tf.global_variables_initializer()
# Start a session and run the initializer.
sess = tf.InteractiveSession()
sess.run(initialize_all)
l_ma=.74
acc_ma=.5
for epoch in range(5):
    for batch in range(train_batches):
        data = train[batch*batch_size:(batch+1)*batch_size]
        reviews = np.array([sample['w'] for sample in data]).reshape([1,-1,300])
        labels  = np.array([sample['y'] for sample in data]).reshape([1,1])
        labels = np.array(labels).reshape([-1, 1])
        _, l, acc = sess.run([train_step, loss, accuracy], feed_dict={X: reviews, y: labels})
        l_ma=.99*l_ma+(.01)*l
        acc_ma=.99*acc_ma+(.01)*acc
        if (batch+1) % 100 == 0:
            print("batch", batch, "Loss", l_ma, "Acc", acc_ma)
    if epoch % 1 == 0:
        print("Epoch", epoch, "Loss", l_ma, "Acc", acc_ma)
    random.shuffle(train)

batch 99 Loss 0.6953412945933292 Acc 0.5593095263642633
batch 199 Loss 0.6473887512329114 Acc 0.6738199371685197
batch 299 Loss 0.564437529583084 Acc 0.7915440309288584
batch 399 Loss 0.5458760783299365 Acc 0.8075724021252919
batch 499 Loss 0.4906432281056557 Acc 0.8252360221980967
batch 599 Loss 0.4713405573776868 Acc 0.8270145546202745
batch 699 Loss 0.39924134959420376 Acc 0.8576339785394074
batch 799 Loss 0.40895472638877217 Acc 0.8463385178050217
batch 899 Loss 0.4557517276340091 Acc 0.877230618810834
batch 999 Loss 0.34532441152826093 Acc 0.9011750182308512
Epoch 0 Loss 0.3375718806407419 Acc 0.8914188728019924
batch 99 Loss 0.2754086046505621 Acc 0.9132404911303504
batch 199 Loss 0.27054470837469446 Acc 0.9097989992333892
batch 299 Loss 0.2936102623103599 Acc 0.911023526005473
batch 399 Loss 0.22185479086925006 Acc 0.9285512738763193
batch 499 Loss 0.17598988915449443 Acc 0.9408731495579647
batch 599 Loss 0.19581428715270224 Acc 0.9504568039849209
batch 699 Loss 0.18926725063262

In [46]:
# Evaluate on test set
test_acc=0
n=0
for sample in test:
    test_reviews = np.array([sample['w'] ]).reshape([1,-1,300])
    test_labels  = np.array([sample['y']]).reshape([1,1])
    test_labels = np.array(test_labels).reshape([-1, 1])
    test_acc += sess.run(accuracy, feed_dict={X: test_reviews, y: test_labels})
    n+=1
acc=test_acc/n 
print("Final accuracy:", acc)


Final accuracy: 0.9631728045325779


In [47]:
sess.close()

Now we swap in the more complex dataset.

In [48]:
with open("Data/movie-pang02.txt", "r",encoding=enc) as f:
    dataset = [convert_line_to_example(l) for l in f.readlines()]
import random
random.shuffle(dataset)
batch_size = 1
total_batches = len(dataset) // batch_size
train_batches = 3 * total_batches // 4
train, test = dataset[:train_batches*batch_size], dataset[train_batches*batch_size:]


In [None]:
initialize_all = tf.global_variables_initializer()
# Start a session and run the initializer.
sess = tf.InteractiveSession()
sess.run(initialize_all)
l_ma=.74
acc_ma=.5
for epoch in range(5):
    for batch in range(train_batches):
        data = train[batch*batch_size:(batch+1)*batch_size]
        reviews = np.array([sample['w'] for sample in data]).reshape([1,-1,300])
        labels  = np.array([sample['y'] for sample in data]).reshape([1,1])
        labels = np.array(labels).reshape([-1, 1])
        _, l, acc = sess.run([train_step, loss, accuracy], feed_dict={X: reviews, y: labels})
        l_ma=.99*l_ma+(.01)*l
        acc_ma=.99*acc_ma+(.01)*acc
        if (batch+1) % 100 == 0:
            print("batch", batch, "Loss", l_ma, "Acc", acc_ma)
    if epoch % 1 == 0:
        print("Epoch", epoch, "Loss", l_ma, "Acc", acc_ma)
    random.shuffle(train)

batch 99 Loss 0.7127559205358465 Acc 0.5117723937439334
batch 199 Loss 0.7041729767792424 Acc 0.476417691398714
batch 299 Loss 0.7136889316928847 Acc 0.5247999034197425
batch 399 Loss 0.700590607597621 Acc 0.5166465018942937
batch 499 Loss 0.6973219907535622 Acc 0.4891128552295483
batch 599 Loss 0.6879451475568742 Acc 0.5301195300376083
batch 699 Loss 0.6852633044628296 Acc 0.5582710913652682
batch 799 Loss 0.6824738077939042 Acc 0.5483043370396196
batch 899 Loss 0.6814669891477889 Acc 0.5565460319777348
batch 999 Loss 0.6855136773754444 Acc 0.5864027234411453
batch 1099 Loss 0.6800206481604915 Acc 0.6105919246898751
batch 1199 Loss 0.676026827809165 Acc 0.6062120632777231
batch 1299 Loss 0.6929756899802499 Acc 0.5315496961651291
batch 1399 Loss 0.6921897788119804 Acc 0.5085756584298227
batch 1499 Loss 0.6866363371058599 Acc 0.5418367911951362
Epoch 0 Loss 0.6866363371058599 Acc 0.5418367911951362
batch 99 Loss 0.6641352750016342 Acc 0.6181060256888042


### Other materials:
Like Reinforcement Learning, Natural Language Processing could easily fill several full courses on its own, with or without neural networks.
Over at UNC, Prof. Mohit Bansal has [taught](http://www.cs.unc.edu/~mbansal/teaching/nlp-course-fall17.html) [several](http://www.cs.unc.edu/~mbansal/teaching/nlp-seminar-spring18.html).

- [Introduction to LSTMs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Popular blog post on RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)