# Lecture 8: Recurrent Neural Networks
In this lecture, we will discuss tasks that require some memory. The architectures we've looked at in the course so far have no sense of what they've seen before. Each input is completely separate from all other inputs. However, there are many types of __sequence__ problems in which the order in which things are seen makes a big difference. We will look into this problem and develop possible solutions.

![Snorlax](http://i.imgur.com/cOGzJxk.png)

Given an image, the type of neural network we've seen so far will do its best to figure out what's being shown. In this case, we see water near Snorlax's face. However, we lack the context to know why the water is there. This image alone isn't enough to give us an accurate understanding of the situation. For that, we need to see what came before or after, and more importantly, we need to remember that information.

![Snorlax2](http://i.imgur.com/PnWiSCf.png)

In situations where context is important, we use __Recurrent Neural Networks (RNNs)__ instead of typical neural networks. A neuron in an RNN is very similar to a regular neuron, with the exception that it retains some memory of what it has seen previously.

In Snorlax's case, it's clear that the context of previous images makes a big difference in our interpretation of the events.

Typically, the tasks that benefit from RNNs are those where significant information is carried by the order of a sequence. A few examples of domains that typically benefit from memory include:
1. Video Processing
2. Speech Processing
3. Sentiment Analysis
4. Time Series Data

## Modifying the Neuron to Give it Memory
Let's work through how we can modify a traditional neuron to give it memory.

![neuron](http://i.imgur.com/tFMQMOn.png)

Above, we see a basic diagram of how a simple neuron works. In the image we have:
* input vector $x$
* weight matrix $W$
* activation function $\phi$
* output $y$

We would like to change this diagram. We'd like the neuron to produce its output feature as usual, but also update its knowledge. For example, the neuron might need to remember what type of environment it's in. If it's on the beach, it might expect to see people swimming instead of bathing.

Instead of simply taking an image and returning an activity, the neuron must maintain its own internal memory.

![rnn](http://i.imgur.com/ifQrKRR.png)

We can add memory simply by saving the output from the previous image! When the next image arrives, the saved output is weighted by a new weight matrix $U$ and added to the features computed for the new input. Thus, the only changes we make to the core structure of the neuron are the addition of $U$ and the storage of the previous features.
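
The recurrent neuron described above can be sketched in a few lines of NumPy. This is only an illustration: it assumes `tanh` for $\phi$, small random weights, and made-up sizes, and the helper name `rnn_step` is hypothetical rather than part of the lecture.

```python
import numpy as np

def rnn_step(x_t, y_prev, W, U, phi=np.tanh):
    """One step of the simple recurrent neuron: phi(W x_t + U y_{t-1})."""
    return phi(W @ x_t + U @ y_prev)

rng = np.random.default_rng(0)
n_in, n_out = 4, 3                               # illustrative sizes
W = rng.normal(scale=0.5, size=(n_out, n_in))    # weights for the current input
U = rng.normal(scale=0.5, size=(n_out, n_out))   # weights for the saved output

sequence = rng.normal(size=(5, n_in))            # five "frames" of input
y = np.zeros(n_out)                              # nothing has been seen yet
for x_t in sequence:
    y = rnn_step(x_t, y, W, U)                   # the output is saved and fed back in

print(y.shape)  # (3,)
```

The same `W` and `U` are reused at every step; only the saved output `y` changes as the sequence is read.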

![dory](https://animationfascination.files.wordpress.com/2015/11/dory.jpg)

Although this is a big improvement, there's still a problem! We've set up RNNs to only have memory of the previous image they saw. If the context changes, say the beach background drops out for a frame, then all memory is lost! Another way of looking at this network is that it has short term memory but no long term memory.

Of course, there are many examples where having long term memory is quite useful. Life with only short term memory would be quite unpleasant and chaotic. Let's next discuss how to create a neuron that has both short and long term memory.

We first need to ask, what are the components of long term memory that make it useful?
1. __The ability to forget__. If a scene ends, it's OK to forget details like the time of day. However, if a character dies, it's important to remember that they died. Thus, we want to be able to selectively remove some memories and keep others.
2. __The ability to add long term memories__. When a new image is seen, it's important to be able to add certain details to long term memory. But we need to be careful to only add important memories. We therefore need a method to determine what's worth remembering and what isn't.
3. __The ability to recall important memories__. Memory is only useful when it can be used effectively. Certain situations call for certain memories while other memories aren't relevant. We need a way to move long-term memories into working memory (short term memory).

The combination of these characteristics is called a long short-term memory (LSTM) network.

Let's try to write out each of the points above mathematically. We'll start by determining what long term memories should be kept.

Here, $\sigma$ is the sigmoid activation (which squashes its input to a value between 0 and 1), and $\phi$ is the activation function from the neuron diagram above

$W$ is a weight matrix for the current input

$U$ is a weight matrix for memories

$ltm$ is the long term memory

$wm$ is the working (short term) memory

$remember_t = \sigma(W_r x_t + U_r wm_{t-1})$

This creates a filter based on the current short term memory and new input that determines which long term memories we should keep. As usual, both weight matrices are learnable through gradient descent.

Next, we need to determine which possible new memories to form. This is similarly defined as

$learn_t = \phi(W_l x_t + U_l wm_{t-1})$

However, these candidate memories may not all be useful, so we also need a filter to figure out which are actually worth remembering.

$save_t = \sigma(W_s x_t + U_s wm_{t-1})$

And that's it! Now we have what we need to update our memory. First, let's update our long term memory.

$ltm_t = remember_t \cdot ltm_{t-1} + save_t \cdot learn_t$
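
To make the long term memory update concrete, here is a small worked example with hand-picked numbers. It assumes $\cdot$ denotes element-wise multiplication, with the previous long term memory on the right-hand side. All values are made up for illustration.

```python
import numpy as np

ltm_prev = np.array([2.0, -1.0, 4.0])   # previous long term memory
remember = np.array([1.0,  0.0, 0.5])   # 1 = keep, 0 = forget
learn    = np.array([0.3,  0.8, -1.0])  # candidate new memories
save     = np.array([0.0,  1.0, 0.5])   # 1 = store the candidate, 0 = ignore it

ltm = remember * ltm_prev + save * learn
print(ltm.tolist())  # [2.0, 0.8, 1.5]
```

The first slot is kept untouched, the second is forgotten and overwritten by its candidate, and the third is a blend of old and new.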

Next, let's update our short term memory. This is similar to moving memory from an external hard drive to a working laptop.

$focus_t = \sigma(W_f x_t + U_f wm_{t-1})$

Finally, we update our working (short term) memory.

$wm_t = focus_t \cdot \phi(ltm_t)$
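
Putting the equations above together, one step of the memory update might look like the following NumPy sketch. It assumes `tanh` for $\phi$, element-wise products for $\cdot$, and no bias terms (matching the equations as written). The function name `lstm_step`, the parameter dictionary `p`, and the sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, wm_prev, ltm_prev, p, phi=np.tanh):
    """One step of the memory update written out in the equations above."""
    remember = sigmoid(p["W_r"] @ x_t + p["U_r"] @ wm_prev)  # what to keep
    learn    = phi(p["W_l"] @ x_t + p["U_l"] @ wm_prev)      # candidate memories
    save     = sigmoid(p["W_s"] @ x_t + p["U_s"] @ wm_prev)  # which candidates to store
    ltm      = remember * ltm_prev + save * learn            # new long term memory
    focus    = sigmoid(p["W_f"] @ x_t + p["U_f"] @ wm_prev)  # what to bring into focus
    wm       = focus * phi(ltm)                              # new working memory
    return wm, ltm

rng = np.random.default_rng(0)
n_in, n_mem = 4, 3                       # illustrative sizes
p = {}
for gate in "rlsf":                      # remember, learn, save, focus
    p["W_" + gate] = rng.normal(scale=0.5, size=(n_mem, n_in))
    p["U_" + gate] = rng.normal(scale=0.5, size=(n_mem, n_mem))

wm, ltm = np.zeros(n_mem), np.zeros(n_mem)
for x_t in rng.normal(size=(6, n_in)):   # a sequence of six inputs
    wm, ltm = lstm_step(x_t, wm, ltm, p)

print(wm.shape, ltm.shape)  # (3,) (3,)
```

Both memories are carried from one step to the next, but only the working memory `wm` feeds back into the gates.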

Each of these steps has a technical term.
* Long-term memory $ltm_t$ is usually called the __cell state__
* Working or short term memory $wm_t$ is usually called the __hidden state__
* The remember vector $remember_t$ is usually called the __forget gate__
* The save vector $save_t$ is usually called the __input gate__
* The focus vector $focus_t$ is usually called the __output gate__
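
For reference, here is a sketch of the same update in the notation commonly used in LSTM write-ups. The symbols $f_t$, $i_t$, $o_t$, $\tilde{c}_t$, $c_t$, and $h_t$ are the conventional names rather than ones used in this lecture, and $\odot$ denotes element-wise multiplication. Standard formulations usually also add a bias term inside each activation; it is omitted here to match the equations above.

$f_t = \sigma(W_r x_t + U_r h_{t-1})$ (forget gate, $remember_t$)

$\tilde{c}_t = \phi(W_l x_t + U_l h_{t-1})$ (candidate memories, $learn_t$)

$i_t = \sigma(W_s x_t + U_s h_{t-1})$ (input gate, $save_t$)

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ (cell state, $ltm_t$)

$o_t = \sigma(W_f x_t + U_f h_{t-1})$ (output gate, $focus_t$)

$h_t = o_t \odot \phi(c_t)$ (hidden state, $wm_t$)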

![lstm_steps](http://i.imgur.com/vsqgLYn.png)

![lstm_snorlax](http://i.imgur.com/EGZIUuc.pngg)

# LSTMs in Action: Sentiment Analysis
Let's try out an LSTM network and see how it goes! For this example, we're going to be doing sentiment analysis on IMDB movie reviews. The goal is to take a written review and determine if it was positive or negative. Obviously, there is a lot of sequential information in written text as the context of words is based heavily on previous words.

In [1]:
import re
import itertools
from collections import Counter
import numpy as np
from sklearn.model_selection import train_test_split


import mxnet as mx
from mxnet import gluon, nd, autograd
from mxnet.gluon import nn, rnn

context = mx.gpu(0)

First, we are going to load the movie review dataset. We will be taking advantage of Stanford's Large Movie Review Dataset, which is available here: http://ai.stanford.edu/~amaas/data/sentiment/. The training split used below includes 25,000 movie reviews from IMDB, with 12,500 labeled as 'Positive' reviews and the other 12,500 labeled as 'Negative' reviews.

In [2]:
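def read_files(foldername):
    """Reads every file in foldername and returns the contents as a list of strings."""
    import os
    sentiments = []
    filenames = os.listdir(foldername)
    for file in filenames:
        #Each file holds one review; line breaks are stripped so it becomes one string
        with open(os.path.join(foldername, file), "r", encoding="utf8") as review_file:
            data = review_file.read().replace('\n', '')
            sentiments.append(data)
    return sentiments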
    
    
#Ensure that the path below leads to the location of the positive reviews 
foldername = "/data/aclImdb/train/pos/"
postive_sentiment = read_files(foldername)

#Ensure that the path below leads to the location of the negative reviews
foldername = "/data/aclImdb/train/neg/"
negative_sentiment = read_files(foldername)

#This labels the 'Positive' reviews as 1 and the 'Negative' reviews as 0
positive_labels = [1 for _ in postive_sentiment]
negative_labels = [0 for _ in negative_sentiment]

In [3]:
print(postive_sentiment[0])

Coming from the same director who'd done "Candyman" and "Immortal Beloved", I'm not surprised it's a good film. Ironically, "Papierhaus" is a movie I'd never heard of until now, yet it must be one of the best movies of the late 80s - partly because that is hands down the worst movie period in recent decades. (Not talking about Iranian or Swedish "cinema" here...) The acting is not brilliant, but merely solid - unlike what some people here claim (they must have dreamt this "wondrous acting", much like Anna). The story is an interesting fantasy that doesn't end in a clever way that ties all the loose ends together neatly. These unanswered questions are probably left there on purpose, leaving it up to the individual's interpretation, and there's nothing wrong with that with a theme such as this. "Pepperhaus" is a somewhat unusual mix of kids' film and horror, with effective use of sounds and music. I like the fact that the central character is not your typical movie-cliché ultra-shy-but-s

In [4]:
print(negative_sentiment[0])

The "Wrinkle in Time" book series is my favorite series from childhood. I have read and re-read them more times than I can count over the last 35+ years. The characters, with all their virtues and flaws, are near and dear to my heart. This adaptation contained very little of the wonderful, magical, spiritual story that I love so much. To say I was disappointed with this film would be a great understatement.<br /><br />If you have never read the book(s) I imagine you will enjoy the movie. The acting is passable, the special effects are well done for a made for TV movie, and the story is interesting. However, if you love the books, avoid this movie at all costs.<br /><br />I found this statement at the Wikipedia page of the novel: "In an interview with Newsweek, L'Engle said of the film, 'I expected it to be bad, and it is.'"<br /><br />I, like another reviewer here, feel the need to read the book again to dispel this movie from my mind.



Next we want to clean up the text of the movie reviews so that we are only processing words. The actual words in the reviews are going to be the most predictive - not sentence breaks or commas, for example.

In [5]:
#some string preprocessing
def clean_str(string):  
    
    #This pattern matches any run of characters other than letters, digits, and spaces
    remove_special_chars = re.compile("[^A-Za-z0-9 ]+")
    
    #This lowercases the review and replaces HTML line breaks with spaces
    string = string.lower().replace("<br />", " ")
    
    #This strips the special characters matched by the pattern above
    return re.sub(remove_special_chars, "", string)
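
As a quick check of what the cleaning step does, the sketch below applies the same logic to a made-up review snippet. The function is repeated here only so the example runs on its own, and the sample text is invented.

```python
import re

def clean_str(string):
    # Same logic as the cell above, repeated so this snippet is self-contained
    string = string.lower().replace("<br />", " ")
    return re.sub("[^A-Za-z0-9 ]+", "", string)

sample = "Great movie!<br /><br />I'd watch it again."
cleaned = clean_str(sample)

print(cleaned.split())  # ['great', 'movie', 'id', 'watch', 'it', 'again']
```

Note that apostrophes are dropped rather than replaced, so "I'd" becomes `id`. This is why entries such as `im`, `id`, and `whod` appear in the word counts further down.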

Next, we are going to process all of the words in the reviews, count the number of occurrences of each word, and then index the words in descending order of how many times they occur. This is a necessary step to help us encode the words in the reviews so that they can be understood by a machine.

In [6]:
#This creates a dictionary of the words and their counts in the entire 
#movie review dataset {word:count}

word_counter = Counter()
def create_count(sentiments):
    for line in sentiments:
        for word in (clean_str(line)).split():
            #A Counter returns 0 for unseen words, so no membership check is needed
            word_counter[word] += 1

#This assigns a unique number to each word (sorted in descending order 
#based on the frequency of occurrence) and returns a word_dict

def create_word_index():
    idx = 1
    word_dict = {}
    for word in word_counter.most_common():
        word_dict[word[0]] = idx
        idx+=1
    return word_dict
    
#Here we combine all of the reviews into one dataset and create a word
#dictionary using this entire dataset

all_sentiments = postive_sentiment + negative_sentiment
all_labels = positive_labels + negative_labels
create_count(all_sentiments)
word_dict = create_word_index()

#This creates a reverse index from a number to the word 
idx2word = {v: k for k, v in word_dict.items()}
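
The counting and ranking steps can be seen on a toy corpus. This is a minimal sketch with three made-up "reviews"; the names `toy_word_dict` and `toy_idx2word` are hypothetical stand-ins for `word_dict` and `idx2word`.

```python
from collections import Counter

reviews = ["the movie was good", "the movie was bad", "the plot was thin"]  # toy data

counts = Counter()
for line in reviews:
    counts.update(line.split())

# Rank 1 goes to the most frequent word, mirroring create_word_index above
toy_word_dict = {word: rank
                 for rank, (word, _) in enumerate(counts.most_common(), start=1)}
toy_idx2word = {rank: word for word, rank in toy_word_dict.items()}

print(counts["the"], counts["movie"], counts["good"])  # 3 2 1
print(toy_word_dict["movie"])                          # 3
```

Because ranks start at 1, index 0 never refers to a real word, which leaves it free to be used as a padding value later.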

In [7]:
word_counter

Counter({'coming': 1018,
         'from': 20437,
         'the': 334731,
         'same': 4028,
         'director': 3838,
         'whod': 44,
         'done': 2991,
         'candyman': 6,
         'and': 162241,
         'immortal': 58,
         'beloved': 174,
         'im': 4809,
         'not': 30287,
         'surprised': 799,
         'its': 25105,
         'a': 161971,
         'good': 14710,
         'film': 38254,
         'ironically': 121,
         'papierhaus': 1,
         'is': 107037,
         'movie': 42669,
         'id': 1394,
         'never': 6428,
         'heard': 1103,
         'of': 145375,
         'until': 1755,
         'now': 4473,
         'yet': 2700,
         'it': 78092,
         'must': 3054,
         'be': 26723,
         'one': 25725,
         'best': 6316,
         'movies': 7921,
         'late': 1138,
         '80s': 687,
         'partly': 125,
         'because': 8991,
         'that': 69565,
         'hands': 623,
         'down': 3533,
       

In [8]:
word_dict

{'the': 1,
 'and': 2,
 'a': 3,
 'of': 4,
 'to': 5,
 'is': 6,
 'in': 7,
 'it': 8,
 'i': 9,
 'this': 10,
 'that': 11,
 'was': 12,
 'as': 13,
 'for': 14,
 'with': 15,
 'movie': 16,
 'but': 17,
 'film': 18,
 'on': 19,
 'not': 20,
 'you': 21,
 'are': 22,
 'his': 23,
 'have': 24,
 'be': 25,
 'he': 26,
 'one': 27,
 'its': 28,
 'at': 29,
 'all': 30,
 'by': 31,
 'an': 32,
 'they': 33,
 'from': 34,
 'who': 35,
 'so': 36,
 'like': 37,
 'her': 38,
 'just': 39,
 'or': 40,
 'about': 41,
 'has': 42,
 'if': 43,
 'out': 44,
 'some': 45,
 'there': 46,
 'what': 47,
 'good': 48,
 'more': 49,
 'when': 50,
 'very': 51,
 'even': 52,
 'she': 53,
 'my': 54,
 'up': 55,
 'no': 56,
 'would': 57,
 'time': 58,
 'which': 59,
 'only': 60,
 'really': 61,
 'story': 62,
 'their': 63,
 'were': 64,
 'had': 65,
 'see': 66,
 'can': 67,
 'me': 68,
 'than': 69,
 'we': 70,
 'much': 71,
 'well': 72,
 'been': 73,
 'get': 74,
 'will': 75,
 'also': 76,
 'into': 77,
 'other': 78,
 'bad': 79,
 'people': 80,
 'do': 81,
 'because': 82

Next we create a set of helper functions that (1) encode words into a sequence of numbers, (2) decode a sequence of numbers back into words, and (3) truncate and pad the input data to ensure they are of equal length and thereby enable easier processing.

In [9]:
#This helper function creates encoded sentences by assigning the unique 
#id from word_dict to the words in the input text (i.e., movie reviews)
def encoded_sentences(input_file,word_dict):
    output_string = []
    for line in input_file:
        output_line = []
        for word in (clean_str(line)).split():
            if word in word_dict:
                output_line.append(word_dict[word])
        output_string.append(output_line)
    return output_string

#This helper function decodes encoded sentences back into words.
#Note: it looks ids up in the global idx2word; the word_dict argument is unused.
def decode_sentences(input_file,word_dict):
    output_string = []
    for line in input_file:
        output_line = ''
        for idx in line:
            output_line += idx2word[idx] + ' '
        output_string.append(output_line)
    return output_string

#This helper function pads the sequences to maxlen.
#If the sentence is longer than maxlen, it truncates the sentence.
#If the sentence is shorter than maxlen, it pads with value (0 by default).
def pad_sequences(sentences,maxlen=500,value=0):
    """
    Pads all sentences to the same length. The length is defined by maxlen.
    Returns padded sentences.
    """
    padded_sentences = []
    for sen in sentences:
        new_sentence = []
        if(len(sen) > maxlen):
            new_sentence = sen[:maxlen]
            padded_sentences.append(new_sentence)
        else:
            num_padding = maxlen - len(sen)
            new_sentence = np.append(sen,[value] * num_padding)
            padded_sentences.append(new_sentence)
    return padded_sentences
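
To illustrate the truncate-and-pad behaviour, here is a simplified single-sequence version. The helper `pad_or_truncate` is a hypothetical stand-in for `pad_sequences`, and the input values are made up.

```python
import numpy as np

def pad_or_truncate(seq, maxlen, value=0):
    """Illustrative stand-in for pad_sequences, applied to one sequence."""
    seq = list(seq)[:maxlen]                               # truncate if too long
    return np.array(seq + [value] * (maxlen - len(seq)))   # pad at the end if too short

short = [579, 34, 1]
long_ = [5, 6, 7, 8, 9, 10, 11]

print(pad_or_truncate(short, maxlen=5).tolist())  # [579, 34, 1, 0, 0]
print(pad_or_truncate(long_, maxlen=5).tolist())  # [5, 6, 7, 8, 9]
```

Every sequence comes out with exactly `maxlen` entries, so a batch of reviews can be stacked into a single rectangular array.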

Next we are going to encode all of the movie reviews using the word dictionary created above. In addition, we are going to cap the size of the tracked vocabulary, meaning any word that is outside of the tracked range will be encoded with the last position. This is a performance versus accuracy trade-off: a larger tracked vocabulary will lead to more accuracy, but it requires a longer training process.

In [10]:
#Encodes the positive and negative reviews into sequences of numbers
positive_encoded = encoded_sentences(postive_sentiment,word_dict)
negative_encoded = encoded_sentences(negative_sentiment,word_dict)

all_encoded = positive_encoded + negative_encoded

vocab_size = 5000 #Here we set the total num of words to be tracked

#Any word outside of the tracked range will be encoded with last position.
t_data = [np.array([i if i<(vocab_size-1) else (vocab_size-1) for i in s]) for s in all_encoded]
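
The capping rule above is equivalent to taking an element-wise minimum with `vocab_size - 1`. The sketch below shows this on a handful of made-up indices. Using `np.minimum` is an alternative way of writing the rule, not the code from the cell above.

```python
import numpy as np

vocab_size = 5000
encoded = np.array([579, 34, 7670, 25270, 4998, 4999, 5000])  # illustrative word ranks

# Any rank at or beyond the last tracked position collapses onto that position
capped = np.minimum(encoded, vocab_size - 1)

print(capped.tolist())  # [579, 34, 4999, 4999, 4998, 4999, 4999]
```

One consequence is that the word ranked 4999 shares its index with every rarer word, so index 4999 effectively acts as an "unknown word" bucket.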

In [11]:
all_encoded[0]

[579,
 34,
 1,
 165,
 171,
 7670,
 221,
 25270,
 2,
 6309,
 2701,
 141,
 20,
 744,
 28,
 3,
 48,
 18,
 3573,
 53650,
 6,
 3,
 16,
 438,
 108,
 540,
 4,
 352,
 147,
 239,
 8,
 218,
 25,
 27,
 4,
 1,
 112,
 91,
 4,
 1,
 526,
 861,
 3476,
 82,
 11,
 6,
 934,
 184,
 1,
 240,
 16,
 809,
 7,
 1093,
 2742,
 20,
 651,
 41,
 7056,
 40,
 3818,
 431,
 127,
 1,
 110,
 6,
 20,
 517,
 17,
 1463,
 1120,
 998,
 47,
 45,
 80,
 127,
 2220,
 33,
 218,
 24,
 21855,
 10,
 9547,
 110,
 71,
 37,
 2187,
 1,
 62,
 6,
 32,
 212,
 982,
 11,
 143,
 125,
 7,
 3,
 1066,
 93,
 11,
 4184,
 30,
 1,
 1886,
 608,
 288,
 5919,
 128,
 6954,
 1166,
 22,
 234,
 306,
 46,
 19,
 1251,
 1176,
 8,
 55,
 5,
 1,
 3239,
 2874,
 2,
 215,
 155,
 351,
 15,
 11,
 15,
 3,
 749,
 135,
 13,
 10,
 53651,
 6,
 3,
 622,
 1673,
 1479,
 4,
 323,
 18,
 2,
 195,
 15,
 1098,
 349,
 4,
 914,
 2,
 222,
 9,
 37,
 1,
 185,
 11,
 1,
 1305,
 106,
 6,
 20,
 122,
 765,
 53652,
 53653,
 53654,
 244,
 17,
 3,
 1903,
 1224,
 549,
 51,
 2361,
 9,
 235,
 115

In [12]:
t_data[0]

array([ 579,   34,    1,  165,  171, 4999,  221, 4999,    2, 4999, 2701,
        141,   20,  744,   28,    3,   48,   18, 3573, 4999,    6,    3,
         16,  438,  108,  540,    4,  352,  147,  239,    8,  218,   25,
         27,    4,    1,  112,   91,    4,    1,  526,  861, 3476,   82,
         11,    6,  934,  184,    1,  240,   16,  809,    7, 1093, 2742,
         20,  651,   41, 4999,   40, 3818,  431,  127,    1,  110,    6,
         20,  517,   17, 1463, 1120,  998,   47,   45,   80,  127, 2220,
         33,  218,   24, 4999,   10, 4999,  110,   71,   37, 2187,    1,
         62,    6,   32,  212,  982,   11,  143,  125,    7,    3, 1066,
         93,   11, 4184,   30,    1, 1886,  608,  288, 4999,  128, 4999,
       1166,   22,  234,  306,   46,   19, 1251, 1176,    8,   55,    5,
          1, 3239, 2874,    2,  215,  155,  351,   15,   11,   15,    3,
        749,  135,   13,   10, 4999,    6,    3,  622, 1673, 1479,    4,
        323,   18,    2,  195,   15, 1098,  349,   

In [13]:
print(postive_sentiment[0])

Coming from the same director who'd done "Candyman" and "Immortal Beloved", I'm not surprised it's a good film. Ironically, "Papierhaus" is a movie I'd never heard of until now, yet it must be one of the best movies of the late 80s - partly because that is hands down the worst movie period in recent decades. (Not talking about Iranian or Swedish "cinema" here...) The acting is not brilliant, but merely solid - unlike what some people here claim (they must have dreamt this "wondrous acting", much like Anna). The story is an interesting fantasy that doesn't end in a clever way that ties all the loose ends together neatly. These unanswered questions are probably left there on purpose, leaving it up to the individual's interpretation, and there's nothing wrong with that with a theme such as this. "Pepperhaus" is a somewhat unusual mix of kids' film and horror, with effective use of sounds and music. I like the fact that the central character is not your typical movie-cliché ultra-shy-but-s

We will be using a word embedding matrix to represent the words that we observe in the movie reviews. Learning vectors that represent the meaning of words is a large exercise unto itself. Instead, we will be leveraging Stanford's Global Vectors for Word Representation (GloVe) embeddings. We specifically used glove.42B.300d.zip, available at this link: https://nlp.stanford.edu/projects/glove/.

![embedding](http://wiki.fast.ai/images/6/6d/Embedding_projection.png)

In [14]:
# Loads Stanford's Global Vectors for Word Representation (GloVe) embedding

num_embed = 300 #This is the richness of the word attributes captured

def load_glove_index(loc):
    embeddings_index = {}
    
    #Each line holds a word followed by its embedding coefficients
    with open(loc, encoding="utf8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype = 'float32')
            embeddings_index[word] = coefs
    return embeddings_index

def create_emb():
    embedding_matrix = np.zeros((vocab_size, num_embed))
    for word, i in word_dict.items():
        if i >= vocab_size:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    embedding_matrix = nd.array(embedding_matrix)
    return embedding_matrix

embeddings_index = load_glove_index('/data/aclImdb/glove.42B.300d.txt')
embedding_matrix = create_emb()
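
To show how the embedding matrix lines up with the word indices, here is a toy version of the two functions above. It uses three invented 4-dimensional vectors held in memory instead of the real GloVe file. All `toy_` names and values are hypothetical.

```python
import io
import numpy as np

# Three made-up vectors in GloVe's "word v1 v2 ..." text layout
toy_glove = io.StringIO(
    "the 0.1 0.2 0.3 0.4\n"
    "movie 0.5 0.6 0.7 0.8\n"
    "good 0.9 1.0 1.1 1.2\n"
)
toy_index = {}
for line in toy_glove:
    values = line.split()
    toy_index[values[0]] = np.asarray(values[1:], dtype="float32")

toy_word_dict = {"the": 1, "movie": 2, "papierhaus": 3}  # hypothetical ranks
toy_vocab_size, toy_num_embed = 4, 4

toy_matrix = np.zeros((toy_vocab_size, toy_num_embed), dtype="float32")
for word, i in toy_word_dict.items():
    vector = toy_index.get(word)
    if i < toy_vocab_size and vector is not None:
        toy_matrix[i] = vector   # row i holds the vector for the word with index i

print(toy_matrix.shape)  # (4, 4)
```

Row 0 (the padding index) and row 3 (a word with no GloVe entry) stay at zero, while rows 1 and 2 receive the pre-trained vectors.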

Next we prepare the movie reviews to be fed into the deep learning model by (1) reserving 30% of the dataset as a test dataset, (2) padding and truncating the data to a length of 500 words, and (3) converting the movie reviews into MXNet's NDArray format.

In [15]:
#This separates 30% of the entire dataset into test dataset.
X_train, X_test, y_train, y_test_set = train_test_split(t_data, all_labels, test_size=0.3, random_state=42)

In [16]:
#Here are some of the statistics of sentences before padding
min_len = min(map(len, t_data))
max_len = max(map(len,t_data))
avg_len = sum(map(len,t_data)) / len(t_data)
print("the minimum length is:",min_len)
print("the maximum length is:",max_len)
print("the average length is:",avg_len)

the minimum length is: 10
the maximum length is: 2459
the average length is: 230.51952


In [17]:
seq_len = 500 #This sets the max word length of each movie review

#Below we pad the reviews and convert them to MXNet's NDArray format
trn = nd.array(pad_sequences(X_train, maxlen=seq_len, value=0)).as_in_context(context)
test = nd.array(pad_sequences(X_test, maxlen=seq_len, value=0)).as_in_context(context)
y_trn = nd.array(y_train).as_in_context(context)
y_test = nd.array(y_test_set).as_in_context(context)

Now we're ready to define the neural network for this model using Gluon. We will be using an LSTM model with 64 hidden units, and we will be taking advantage of the embedding layer created above.

In [18]:
num_classes = 2
num_hidden = 64
learning_rate = .001
epochs = 10
batch_size = 12

model = mx.gluon.nn.Sequential()

with model.name_scope():    
    model.embed = mx.gluon.nn.Embedding(vocab_size, num_embed)
    model.add(mx.gluon.rnn.LSTM(num_hidden, layout = 'NTC'))
    model.add(mx.gluon.nn.Dense(num_classes))

Before we execute the training loop, we need to define a function that will calculate the accuracy metric for the model.

In [19]:
def evaluate_accuracy(x,y,batch_size):
    
    acc = mx.metric.Accuracy()
    
    for i in range(x.shape[0] // batch_size):
        data = x[i*batch_size:(i*batch_size + batch_size),]
        target = y[i*batch_size:(i*batch_size + batch_size),]
    
        output = model(data)
        predictions = nd.argmax(output, axis=1)
        acc.update(preds=predictions, labels=target)
    
    return acc.get()[1]
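
The core of the accuracy calculation is an `argmax` over the class scores followed by a comparison with the labels. Here is a NumPy-only sketch with invented scores. The helper `batch_accuracy` is hypothetical and does not use MXNet's metric classes.

```python
import numpy as np

def batch_accuracy(scores, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = np.argmax(scores, axis=1)
    return float(np.mean(predictions == labels))

scores = np.array([[ 2.0, -1.0],   # predicts class 0
                   [ 0.1,  0.3],   # predicts class 1
                   [ 1.5,  1.4],   # predicts class 0
                   [-0.2,  0.9]])  # predicts class 1
labels = np.array([0, 1, 1, 1])

print(batch_accuracy(scores, labels))  # 0.75
```

Note that `evaluate_accuracy` above loops over `x.shape[0] // batch_size` batches, so any rows left over after the last full batch are not scored.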

Finally, we are ready to execute the training loop. Prior to kicking off the training loop, we need to initialize the model parameters and the optimizer, in addition to loading the pre-trained embedding layer.

In [20]:
model.collect_params().initialize(mx.init.Xavier(), ctx=context)

model.embed.weight.set_data(embedding_matrix.as_in_context(context))

trainer = gluon.Trainer(model.collect_params(), 'sgd',
                        {'learning_rate': learning_rate})

softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()    

for epoch in range(epochs):
            
    for b in range(trn.shape[0] // batch_size):
        data = trn[b*batch_size:(b*batch_size + batch_size),]
        target = y_trn[b*batch_size:(b*batch_size + batch_size),]
        
        data = data.as_in_context(context)
        target = target.as_in_context(context)
        
        with autograd.record():
            output = model(data)
            L = softmax_cross_entropy(output, target)
        L.backward()
        trainer.step(data.shape[0])
            
    train_accuracy = evaluate_accuracy(trn, y_trn, batch_size)
    test_accuracy = evaluate_accuracy(test, y_test, batch_size)
    print("Epoch %s. Train_acc %s, Test_acc %s" %
          (epoch, train_accuracy, test_accuracy))
    
# save our model so we don't lose progress!
model.save_params("/data/aclImdb/model.params")    

KeyboardInterrupt: 
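
The loss used in the training loop above is softmax cross-entropy. The sketch below writes out the textbook definition in NumPy for integer class labels. It is meant as an illustration of what is being minimized, not as a reproduction of Gluon's implementation, and the scores are made up.

```python
import numpy as np

def softmax_cross_entropy(scores, labels):
    """Per-example loss: -log(softmax(scores)[label])."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

scores = np.array([[2.0, 0.0],    # fairly confident in class 0
                   [0.0, 0.0]])   # undecided between the two classes
labels = np.array([0, 1])

losses = softmax_cross_entropy(scores, labels)
print(losses)
```

A confident, correct prediction gives a small loss, while an undecided prediction over two classes gives a loss of $\log 2$.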

In [22]:
# if restarting, we can just load
model.load_params("/data/aclImdb/model.params", ctx=context)

In [23]:
# note: this evaluates on the training split (trn, y_trn)
train_accuracy = evaluate_accuracy(trn, y_trn, batch_size)
print(train_accuracy)

0.8285322359396433


In [48]:
# pretty decent accuracy! Let's try it out
def infer_sentiment(sentence):
    # wrap the sentence in a list
    sentence = [sentence]
    # encode the words
    text_encoded = nd.array(encoded_sentences(sentence,word_dict))
    # reduce the vocabulary: cap indices at the last tracked position
    text_encoded = nd.clip(text_encoded, a_min=0, a_max=vocab_size-1).asnumpy()
    # pad extra words
    padded = nd.array(pad_sequences(text_encoded, maxlen=seq_len, value=0)).as_in_context(context)
    # make guess
    guess = nd.argmax(model(padded), axis=1)
    if guess.asscalar() == 0:
        print("Sorry you didn't like it")
    else:
        print("Glad you liked it!")

In [50]:
infer_sentiment("This movie is Great. I really loved the part where the plot happened. Beautiful fantastic!")

Glad you liked it!
