# Homework 3: Recurrent Neural Networks (100 points)

### Overview

We now move from image recognition to natural language processing. For this assignment, we will work on a common sentiment analysis task using the IMDB dataset. The dataset consists of review-label pairs, and the task is to predict whether a given movie review is positive or negative, which is a binary classification problem.

### RNN Architecture

You will be comparing four different recurrent neural network architectures: a standard RNN, a Gated Recurrent Unit (GRU), a standard Long Short-Term Memory (LSTM), and a bidirectional LSTM. 

Note that a GRU/LSTM cell _is_ an RNN cell, but in the code and questions below, "RNN" refers to the most basic (vanilla) RNN.

### Your Task

At the bottom of this notebook file, there are three short answer questions testing your understanding of this neural network architecture. 

Below each question is a cell with the text “Type Markdown and LaTeX.” Double-click the cell and type your response to the question. Save your responses by clicking on the floppy disk icon or choosing File - Save and Checkpoint.

After responding to the questions, download your notebook as a `.html` file by choosing File - Download as - html (.html). You will be submitting this `.html` file to your instructor for grading.

In [1]:
import torch
import torch.nn as nn
import pickle

In [2]:
torch.manual_seed(0)
torch.set_num_threads(4)
torch.set_num_interop_threads(4)

In [3]:
root_dir = 'assets_week3'
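# Load the pickled assets inside context managers so each file handle is closed
with open(root_dir + '/reviewVocabVectors', 'rb') as f:
    reviewVocabVectors = pickle.load(f)
with open(root_dir + '/trainIterator', 'rb') as f:
    trainIterator = pickle.load(f)
with open(root_dir + '/testIterator', 'rb') as f:
    testIterator = pickle.load(f)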

In [4]:
embeddingSize = 100
hiddenSize = 10
dropoutRate = 0.5
numEpochs = 5
vocabSize = 20002
pad = 1
unk = 0

class MyRNN(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.name = model
        self.LSTM = (model == 'LSTM' or model == 'BiLSTM')
        self.bidir = (model == 'BiLSTM')
        
        self.embed = nn.Embedding(vocabSize, embeddingSize, padding_idx = pad)
        
        if model == 'RNN': 
            self.rnn = nn.RNN(embeddingSize, hiddenSize)
        elif model == 'GRU': 
            self.rnn = nn.GRU(embeddingSize, hiddenSize)
        else: 
            self.rnn = nn.LSTM(embeddingSize, hiddenSize, bidirectional=self.bidir)

        self.dense = nn.Linear(hiddenSize * (2 if self.bidir else 1), 1)
        self.dropout = nn.Dropout(dropoutRate)
        
    def forward(self, text, textLengths):
        embedded = self.dropout(self.embed(text))
        
        packedEmbedded = nn.utils.rnn.pack_padded_sequence(embedded, textLengths)
        if self.LSTM: 
            packedOutput, (hidden, cell) = self.rnn(packedEmbedded)
        else: 
            packedOutput, hidden = self.rnn(packedEmbedded)

        output, outputLengths = nn.utils.rnn.pad_packed_sequence(packedOutput)
        if self.bidir: 
            hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
        else: 
            hidden = hidden[0]

        return self.dense(self.dropout(hidden))

In [9]:
basicRNN = MyRNN(model='RNN')
GRU = MyRNN(model='GRU') # Construct a GRU model, as above
LSTM = MyRNN(model='LSTM') # Construct a LSTM model, as above
biLSTM = MyRNN(model='BiLSTM') # Construct a BiLSTM model, as above
models = [basicRNN, GRU, LSTM, biLSTM]

In [10]:
for model in models:
    if model is None:
        continue
    model.embed.weight.data.copy_(reviewVocabVectors)
    model.embed.weight.data[unk] = torch.zeros(embeddingSize)
    model.embed.weight.data[pad] = torch.zeros(embeddingSize)

In [11]:
criterion = nn.BCEWithLogitsLoss()

def batchAccuracy(preds, targets):
    # preds are raw logits: a logit >= 0 corresponds to sigmoid(logit) >= 0.5
    roundedPreds = (preds >= 0).float()
    return (roundedPreds == targets).sum().item() / len(preds)

In [12]:
# Training

for model in models: 
    if model is not None:
        model.train()

for model in models:
    if model is None:
        continue
        
    torch.manual_seed(0)
    optimizer = torch.optim.Adam(model.parameters())
    for epoch in range(numEpochs):
        epochLoss = 0
        for batch in trainIterator:
            optimizer.zero_grad()
            text, textLen = batch[0]
            predictions = model(text, textLen).squeeze(1)
            loss = criterion(predictions, batch[1])
            loss.backward()
            optimizer.step()
            epochLoss += loss.item()
        print(f'Model: {model.name}, Epoch: {epoch + 1}, Train Loss: {epochLoss / len(trainIterator)}')
    print()

Model: RNN, Epoch: 1, Train Loss: 0.7023055340018114
Model: RNN, Epoch: 2, Train Loss: 0.6921511940334154
Model: RNN, Epoch: 3, Train Loss: 0.6874914897982117
Model: RNN, Epoch: 4, Train Loss: 0.665387328933267
Model: RNN, Epoch: 5, Train Loss: 0.6248484775233452

Model: GRU, Epoch: 1, Train Loss: 0.6980792777922452
Model: GRU, Epoch: 2, Train Loss: 0.6819975841075868
Model: GRU, Epoch: 3, Train Loss: 0.6078705844062063
Model: GRU, Epoch: 4, Train Loss: 0.4941392755112075
Model: GRU, Epoch: 5, Train Loss: 0.40071470955448685

Model: LSTM, Epoch: 1, Train Loss: 0.693562761749453
Model: LSTM, Epoch: 2, Train Loss: 0.6738174126276275
Model: LSTM, Epoch: 3, Train Loss: 0.5866248460529405
Model: LSTM, Epoch: 4, Train Loss: 0.4701130262878545
Model: LSTM, Epoch: 5, Train Loss: 0.3987897595633631

Model: BiLSTM, Epoch: 1, Train Loss: 0.6933902058455036
Model: BiLSTM, Epoch: 2, Train Loss: 0.6829730011618046
Model: BiLSTM, Epoch: 3, Train Loss: 0.5933466122278472
Model: BiLSTM, Epoch: 4, Train

In [13]:
# Evaluation

for model in models: 
    if model is not None:
        model.eval()

with torch.no_grad():
    
    for model in models:
        
        if model is None:
            continue

        accuracy = 0.0
        for batch in testIterator:
            text, textLen = batch[0]
            predictions = model(text, textLen).squeeze(1)
            loss = criterion(predictions, batch[1])
            acc = batchAccuracy(predictions, batch[1])
            accuracy += acc
        print('Model: {}, Validation Accuracy: {}%'.format(model.name, accuracy / len(testIterator) * 100))

Model: RNN, Validation Accuracy: 73.24488491048594%
Model: GRU, Validation Accuracy: 84.69469309462916%
Model: LSTM, Validation Accuracy: 84.59079283887469%
Model: BiLSTM, Validation Accuracy: 84.1687979539642%


## Homework Questions

**To make sure your code produces consistent results, it is advisable to click "Kernel -> Restart & Run All" every time you want to run your code.**

### Question 1: Coding (50 points)

First, run the code given above to assess the accuracy of the default RNN model.

Next, you will need to construct three other model types (GRU, LSTM, BiLSTM) for comparison purposes. Follow the comments in box 5 to initialize the three other model types then rerun the code with all models enabled.

Finally, compare the accuracies of all four models (the accuracy of the default RNN should not change from the initial run). Explain your results. In doing so, overview the advantages of the best performing architecture.

The three other model types (GRU, LSTM, and BiLSTM) all outperformed the vanilla RNN, which reached a validation accuracy of 73.24%. Of those three, the GRU performed best, with a validation accuracy of 84.69%, followed by the LSTM at 84.59% and the BiLSTM at 84.17%, although the three gated models are within about half a percentage point of one another. The improvement over the vanilla model is to be expected: the gated architectures mitigate the vanishing-gradient problem and can therefore model long-term dependencies that a vanilla RNN tends to lose. The best performing architecture was the GRU, which has fewer parameters than the LSTM and is faster to train. These efficiencies come from the GRU having only two gates (update and reset) and no separate cell state, versus the LSTM's three gates.
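As a rough check on the parameter-count claim above, the sketch below counts the trainable parameters of each recurrent layer at the sizes used in this notebook (embedding size 100, hidden size 10). The helper `count_params` and the dictionary names are illustrative and not part of the homework code; only the recurrent layer is counted, not the embedding or dense layers.

```python
import torch.nn as nn

embedding_size, hidden_size = 100, 10  # same sizes as in the notebook

def count_params(module: nn.Module) -> int:
    """Total number of trainable parameters in a module (illustrative helper)."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Recurrent layers only; the embedding and dense layers are left out on purpose
recurrent_layers = {
    'RNN': nn.RNN(embedding_size, hidden_size),
    'GRU': nn.GRU(embedding_size, hidden_size),
    'LSTM': nn.LSTM(embedding_size, hidden_size),
    'BiLSTM': nn.LSTM(embedding_size, hidden_size, bidirectional=True),
}

param_counts = {name: count_params(layer) for name, layer in recurrent_layers.items()}
for name, count in param_counts.items():
    print(f'{name}: {count} parameters')
```

With these sizes the vanilla RNN layer has 100·10 + 10·10 + 2·10 = 1120 parameters. The GRU stores three such weight blocks (reset gate, update gate, candidate state) for 3360, the LSTM stores four (input, forget, and output gates plus the candidate cell state) for 4480, and the BiLSTM doubles the LSTM to 8960. So the GRU layer is about three quarters the size of the LSTM layer.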

### Question 2: LSTM Gates (30 points)

LSTMs improve upon the naive RNN architecture by adding a series of gates instead of a simple matrix-vector computation. Name the gates and explain each of their functions.

There are three different gates in an LSTM cell: the forget gate, the input gate, and the output gate. The forget gate decides which information from the previous cell state should be thrown away or kept. The input gate weighs the importance of the new candidate information and determines which parts of it are written to the cell state. The output gate controls which parts of the updated cell state are passed on as the next hidden state.
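To make the role of each gate concrete, here is a minimal sketch of a single LSTM time step written out by hand. The function `lstm_step` and its argument names are illustrative and not part of the homework code. It assumes the weight layout used by PyTorch's `nn.LSTMCell`, where the four blocks are stacked in the order input gate, forget gate, candidate, output gate, and it compares the result against `nn.LSTMCell` as a sanity check.

```python
import torch
import torch.nn as nn

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step (illustrative helper).

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias,
    with the blocks stacked in the order input, forget, candidate, output.
    """
    pre = x @ W.T + h_prev @ U.T + b
    i, f, g, o = pre.chunk(4, dim=1)
    i = torch.sigmoid(i)        # input gate: how much new information to write
    f = torch.sigmoid(f)        # forget gate: how much of the old cell state to keep
    g = torch.tanh(g)           # candidate values for the cell state
    o = torch.sigmoid(o)        # output gate: how much of the cell state to expose
    c = f * c_prev + i * g      # updated cell state
    h = o * torch.tanh(c)       # updated hidden state
    return h, c

torch.manual_seed(0)
cell = nn.LSTMCell(100, 10)     # same sizes as the notebook: embedding 100, hidden 10
x = torch.randn(3, 100)         # a batch of 3 random input vectors
h0 = torch.randn(3, 10)
c0 = torch.randn(3, 10)

with torch.no_grad():
    h_ref, c_ref = cell(x, (h0, c0))
    # PyTorch keeps two bias vectors; their sum plays the role of b here
    h, c = lstm_step(x, h0, c0, cell.weight_ih, cell.weight_hh,
                     cell.bias_ih + cell.bias_hh)

print(torch.allclose(h, h_ref, atol=1e-5), torch.allclose(c, c_ref, atol=1e-5))
```

Note that the forget and input gates act on the cell state, while the output gate only affects what is read out of it into the hidden state.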

### Question 3: Applications (20 points)

LSTMs are used across many different fields, from music generation to sentiment classification to text generation. Where could they be useful in your life, whether at home, for your family, or in the workplace? Give a specific problem or application for an LSTM model that was not covered in the course slides (**though it can be related to the applications covered in the slides**). Then, with your application in mind, specifically identify the input to your model, the output from your model, and an applicable loss function. 

(As an optional extension, try implementing your LSTM on your own using the code framework given in this homework!)

I work at an elementary school, and each day someone on the leadership team sends an email to all employees titled "The Daily Rundown" that summarizes the day at a high level and mentions anything exceptional that happened. Most days nothing exceptional happens, and the email (both the writing and the reading of it) is superfluous. I'd like to highlight how much of a waste of time the email is by training an LSTM to write it for me. For this project I would train my model on the text of past emails (input) and have it output predicted text in the form of Daily Rundown emails. More specifically, the model would predict the next character in a sequence given all the characters seen up to that point. I would use a categorical cross-entropy loss function for this problem, which increases the probability of the correct class and decreases the probabilities of the rival classes.
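A minimal sketch of how the input, output, and loss described above could fit together is shown below. Everything here is an assumption for illustration: the `text` string is a made-up placeholder rather than a real email, and the class `CharLSTM` and its layer sizes are hypothetical. The sketch only computes the loss for one sequence and does not train or generate text.

```python
import torch
import torch.nn as nn

# Placeholder corpus; in practice this would be the text of past emails
text = 'the daily rundown: nothing exceptional happened today.'
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
encoded = torch.tensor([char_to_idx[ch] for ch in text])

# Input is every character but the last; the target is the same text shifted by one
inputs = encoded[:-1].unsqueeze(1)   # shape (seq_len, batch=1)
targets = encoded[1:]                # shape (seq_len,)

class CharLSTM(nn.Module):
    """Hypothetical character-level model: embedding -> LSTM -> per-step logits."""
    def __init__(self, vocab_size, embedding_size=16, hidden_size=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_size)
        self.dense = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        output, _ = self.lstm(self.embed(x))   # (seq_len, batch, hidden_size)
        return self.dense(output)              # (seq_len, batch, vocab_size)

torch.manual_seed(0)
model = CharLSTM(len(chars))
criterion = nn.CrossEntropyLoss()    # categorical cross-entropy over characters

logits = model(inputs).squeeze(1)    # (seq_len, vocab_size)
loss = criterion(logits, targets)
print(f'Untrained next-character loss: {loss.item():.3f}')
```

`nn.CrossEntropyLoss` takes raw logits and integer class indices, so no softmax layer or one-hot encoding is needed in the model itself.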