# COMP5900 Assignment 2 (Supplementary Materials)
Use this code to answer the questions in Assignment 2 Part 1.

# Sentiment Analysis of IMDB Dataset

In the following, we'll be building a machine learning model to detect sentiment (i.e. detect if a sentence is positive or negative). This will be done on movie reviews, using the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/).


We will use LSTM. 

An easy way to run this Jupyter notebook is to mount your Google Drive so you can use it for loading data and storing the results.



In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


Change the working directory to where the relevant files are.

In [None]:
cd 'drive/MyDrive/COMP5900-F22Assignments/Assignment2'

/content/drive/MyDrive/COMP5900-F22Assignments/Assignment2


In [None]:
!ls


 Assignment_2_Report.gdoc   best-model.pt  'IMDB Dataset.csv'   vectors_cache


In [None]:
import numpy as np 
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from nltk.corpus import stopwords 
from collections import Counter
import string
import re
import seaborn as sns
from tqdm import tqdm
import matplotlib.pyplot as plt
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
import torchtext
from torchtext.vocab import GloVe

In [None]:
is_cuda = torch.cuda.is_available()
if is_cuda:
    device = torch.device("cuda")
    print("GPU is available")
else:
    device = torch.device("cpu")
    print("GPU not available, CPU used")

GPU is available


Load the provided CSV file into a pandas DataFrame. The dataset has a total of 50,000 samples. It often helps to work on a smaller subset of the data first; once debugging is done, you can run the code on the full dataset. For example, `df.sample(10000)` creates a subset of 10,000 samples out of the 50,000.

In [None]:
base_csv = 'IMDB Dataset.csv'
df = pd.read_csv(base_csv)
# df=df.sample(10000)
df.shape

(50000, 2)

Here are a few samples from the dataset.

In [None]:
df.head()


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


Target labels are in the form of text (positive or negative). Let's convert that column into a binary variable with values 0 and 1.

In [None]:
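# Assign the mapped column back; in-place replace on a selected column is unreliable in recent pandas
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})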
# df.head()

Let's split the data into training, validation, and test sets. We first set aside 15% for testing. From the remaining 85%, we set aside 82.3% for training and the rest for validation. This splits the data roughly 70/15/15.

In [None]:
X,y = df['review'].values,df['sentiment'].values
x_train1, x_test1, y_train, y_test = train_test_split(X, y, train_size=0.85, random_state=1)

x_train1, x_val1, y_train, y_val = train_test_split(x_train1, y_train, train_size=0.823, random_state=1)


We can see how many examples are in each split.

In [None]:
print(f'Number of training examples: {len(x_train1)}')
print(f'Number of validation examples: {len(x_val1)}')
print(f'Number of test examples: {len(x_test1)}')

Number of training examples: 34977
Number of validation examples: 7523
Number of test examples: 7500
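As a quick sanity check on the split sizes printed above, the arithmetic can be reproduced by hand. This is only a sketch: it assumes the training portion of each split is rounded down and the remainder goes to the other part, which is consistent with the printed counts. The variable names are illustrative.

```python
import math

n_total = 50_000

# First split: 85% kept for training + validation, the remainder is the test set.
n_trainval = math.floor(0.85 * n_total)
n_test = n_total - n_trainval

# Second split: 82.3% of what is left goes to training, the rest to validation.
n_train = math.floor(0.823 * n_trainval)
n_val = n_trainval - n_train

print(n_train, n_val, n_test)
# Fractions of the full dataset, to compare against the 70/15/15 target.
print(round(n_train / n_total, 3), round(n_val / n_total, 3), round(n_test / n_total, 3))
```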


Next, we have to build a _vocabulary_. This is effectively a lookup table where every unique word in your dataset has a corresponding _index_ (an integer).

We do this as our machine learning model cannot operate on strings, only numbers. Each _index_ is used to construct a _one-hot_ vector for each word. A one-hot vector is a vector where all of the elements are 0, except one, which is 1, and dimensionality is the total number of unique words in your vocabulary, commonly denoted by $V$.

![](http://people.scs.carleton.ca/~majidkomeili/Teaching/COMP5900-F22/Images/One-hot.png)
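To make the one-hot idea concrete, here is a minimal sketch in plain Python. The four-word vocabulary and the helper name `one_hot` are made up for illustration; they are not part of the assignment code.

```python
def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

# Hypothetical 4-word vocabulary (word -> index), so V = 4.
toy_vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}

vec = one_hot(toy_vocab["movie"], len(toy_vocab))
print(vec)  # [0, 1, 0, 0]
```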

The number of unique words in our training set is over 100,000, which means that our one-hot vectors will have over 100,000 dimensions! This will make training slow and possibly won't fit onto the GPU. We only keep the top 25,000 most common words. The following builds the vocabulary, only keeping the most common 25,000 tokens.
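The idea of keeping only the most common tokens can be sketched with `collections.Counter`. This notebook takes its vocabulary from the GloVe vectors loaded later, so the function `build_vocab` and the toy reviews below are hypothetical and only illustrate the general technique.

```python
from collections import Counter

def build_vocab(sentences, max_size):
    """Map the `max_size` most frequent words to consecutive integer indices."""
    counts = Counter(word for sent in sentences for word in sent.lower().split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(max_size))}

# Made-up data for illustration only.
toy_reviews = ["the film was great", "the film was bad", "the plot"]

vocab = build_vocab(toy_reviews, max_size=3)
print(vocab)
```

Words that fall outside the top `max_size` (here "great", "bad" and "plot") simply have no index and would be dropped or mapped to an unknown token.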

Instead of initializing our word embeddings randomly, we initialize them with pre-trained word embeddings. We will use [GloVe](https://nlp.stanford.edu/projects/glove/).

The theory is that these pre-trained vectors already have words with similar semantic meaning close together in vector space, e.g. "terrible", "awful", "dreadful" are nearby. This gives our embedding layer a good initialization as it does not have to learn these relations from scratch.
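"Close together in vector space" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration (the GloVe vectors used in this notebook are 100-dimensional, and their actual values are not reproduced here).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, NOT real GloVe values.
toy_vectors = {
    "terrible": [0.9, 0.1, -0.8],
    "awful": [0.8, 0.2, -0.7],
    "great": [-0.7, 0.3, 0.9],
}

sim_close = cosine_similarity(toy_vectors["terrible"], toy_vectors["awful"])
sim_far = cosine_similarity(toy_vectors["terrible"], toy_vectors["great"])
print(round(sim_close, 3), round(sim_far, 3))
```

With real embeddings, the same function could be applied to rows of `glove.vectors` after converting them to lists.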

Here `6B` indicates these vectors were trained on 6 billion tokens and `100` indicates these vectors are 100-dimensional.

**Note**: these vectors are about 862MB, so watch out if you have a limited internet connection. You may use `vectors_cache` to cache it once downloaded. 

In [None]:
glove = GloVe(name='6B', dim=100, max_vectors=25000, cache = "./vectors_cache") 

We can look up the index of a word using the `stoi` (**s**tring **to** **i**nt) dictionary.

In [None]:
print(glove.stoi['movie'])

1005


And here is the actual embedding of the word:

In [None]:
glove.vectors[1005]

tensor([ 0.3825,  0.1482,  0.6060, -0.5153,  0.4399,  0.0611, -0.6272, -0.0254,
         0.1643, -0.2210,  0.1442, -0.3721, -0.2168, -0.0890,  0.0979,  0.6561,
         0.6446,  0.4770,  0.8385,  1.6486,  0.8892, -0.1181, -0.0125, -0.5208,
         0.7785,  0.4872, -0.0150, -0.1413, -0.3475, -0.2959,  0.1028,  0.5719,
        -0.0456,  0.0264,  0.5382,  0.3226,  0.4079, -0.0436, -0.1460, -0.4835,
         0.3204,  0.5509, -0.7626,  0.4327,  0.6175, -0.3650, -0.6060, -0.7962,
         0.3929, -0.2367, -0.3472, -0.6120,  0.5475,  0.9481,  0.2094, -2.7771,
        -0.6022,  0.8495,  1.2549,  0.0179, -0.0419,  2.1147, -0.0266, -0.2810,
         0.6812, -0.1417,  0.9925,  0.4988, -0.6754,  0.6417,  0.4230, -0.2791,
         0.0634,  0.6891, -0.3618,  0.0537, -0.1681,  0.1942, -0.4707, -0.1480,
        -0.5899, -0.2797,  0.1679,  0.1057, -1.7601,  0.0088, -0.8333, -0.5836,
        -0.3708, -0.5659,  0.2070,  0.0713,  0.0556, -0.2976, -0.0727, -0.2560,
         0.4269,  0.0589,  0.0911,  0.47

We can map an index back to its word using the `itos` (**i**nt **to** **s**tring) list.

In [None]:
print(glove.itos[1005])

movie


## Prepare Data


In [None]:
def preprocess_string(s):
    # Remove all non-word characters (everything except numbers and letters)
    s = re.sub(r"[^\w\s]", '', s)
    # Replace all runs of whitespaces with no space
    s = re.sub(r"\s+", '', s)
    # replace digits with no space
    s = re.sub(r"\d", '', s)

    return s
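To see what `preprocess_string` does to individual tokens, the function is repeated below so the snippet runs on its own, followed by a few made-up example tokens.

```python
import re

def preprocess_string(s):
    s = re.sub(r"[^\w\s]", '', s)  # drop punctuation and other non-word characters
    s = re.sub(r"\s+", '', s)      # drop whitespace
    s = re.sub(r"\d", '', s)       # drop digits
    return s

examples = ["great!", "it's", "10/10", "<br"]
cleaned = [preprocess_string(tok) for tok in examples]
print(cleaned)  # ['great', 'its', '', 'br']
```

Note that a token such as "10/10" is reduced to an empty string; it would then be dropped by the vocabulary filter, assuming the empty string is not a vocabulary entry.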

We keep only the words that are in the vocabulary. 

In [None]:
x_train, x_val, x_test = [],[],[]
for sent in x_train1:
        x_train.append([glove.stoi[preprocess_string(word)] for word in sent.lower().split()
                                  if preprocess_string(word) in glove.stoi])

for sent in x_val1:        
        x_val.append([glove.stoi[preprocess_string(word)] for word in sent.lower().split()
                                  if preprocess_string(word) in glove.stoi])  
        
for sent in x_test1:        
        x_test.append([glove.stoi[preprocess_string(word)] for word in sent.lower().split()
                                  if preprocess_string(word) in glove.stoi])      

Now we pad each sample to a predefined length: we clip long samples and zero-pad shorter ones (the zeros are added at the front), so that in the end all samples have a length of `seq_len`. In our dataset most reviews have fewer than 500 words, so we set `seq_len` to 500.

In [None]:
def padding_(sentences, seq_len):
    features = np.zeros((len(sentences), seq_len),dtype=int)
    for ii, review in enumerate(sentences):
        if len(review) != 0:
            features[ii, -len(review):] = np.array(review)[:seq_len]
    return features

In [None]:
x_train_pad = padding_(x_train,500)
x_val_pad = padding_(x_val,500)
x_test_pad = padding_(x_test,500)
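The behaviour of `padding_` on a single review can be sketched without NumPy. The helper `pad_left` below is a hypothetical pure-Python equivalent written for illustration: long reviews keep their first `seq_len` tokens, and short reviews get zeros at the front.

```python
def pad_left(review, seq_len):
    """Clip `review` to its first `seq_len` indices, then left-pad with zeros."""
    clipped = review[:seq_len]
    return [0] * (seq_len - len(clipped)) + clipped

short = pad_left([5, 6, 7], seq_len=5)
long_ = pad_left([1, 2, 3, 4, 5, 6], seq_len=4)
print(short)  # [0, 0, 5, 6, 7]
print(long_)  # [1, 2, 3, 4]
```

Left-padding places the real tokens at the end of the sequence, so the last time step seen by the forward direction of the recurrent layer is always an actual word rather than padding.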

In [None]:
# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(x_train_pad), torch.from_numpy(y_train))
valid_data = TensorDataset(torch.from_numpy(x_val_pad), torch.from_numpy(y_val))
test_data = TensorDataset(torch.from_numpy(x_test_pad), torch.from_numpy(y_test))

batch_size = 100

train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

## Build the Model

The next stage is building the model that we'll eventually train and evaluate. Our three layers are an _embedding_ layer, our RNN/LSTM, and a _linear_ layer. 

The embedding layer is used to transform our sparse one-hot vector (sparse as most of the elements are 0) into a dense embedding vector (dense as the dimensionality is a lot smaller and all the elements are real numbers). This embedding layer is simply a single fully connected layer. As well as reducing the dimensionality of the input to the LSTM, there is the theory that words which have similar impact on the sentiment of the review are mapped close together in this dense vector space. We will initialize the embedding layer with GloVe embeddings.

![](http://people.scs.carleton.ca/~majidkomeili/Teaching/COMP5900-F22/Images/Embedding.png)


The recurrent layer in `MyRNN` is our LSTM, which takes in our dense vector $x_{t}$, the previous hidden state $h_{t-1}$, and the previous cell state $c_{t-1}$, which it uses to calculate the next hidden state $h_t$ and the next cell state $c_t$.

$$(h_t, c_t) = \text{LSTM}(x_t, h_{t-1}, c_{t-1})$$

Thus, the model using an LSTM looks something like (with the embedding layers omitted):

![](http://people.scs.carleton.ca/~majidkomeili/Teaching/COMP5900-F22/Images/LSTM.png)

The initial cell state, $c_0$, like the initial hidden state is initialized to a tensor of all zeros. 
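The recurrence can be illustrated with a single-unit (scalar) LSTM written in plain Python. This is a sketch of the standard LSTM equations, not the `nn.LSTM` implementation; the parameter values and the names `lstm_step` and `params` are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a single-unit LSTM.

    `p` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        w, u, b = p[name]
        return w * x_t + u * h_prev + b

    i = sigmoid(pre("input"))        # how much new information to write
    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate cell content
    c_t = f * c_prev + i * g
    h_t = o * math.tanh(c_t)
    return h_t, c_t

# Hypothetical parameter values.
params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, 0.2, 1.0),
    "output": (0.4, 0.1, 0.0),
    "candidate": (0.8, 0.3, 0.0),
}

h, c = 0.0, 0.0  # h_0 and c_0 are initialized to zero
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```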

Finally, the linear layer takes the final hidden state and feeds it through a fully connected layer, $f(h_T)$, transforming it to the correct output dimension.


### Bidirectional RNN

As well as having an RNN processing the words in the sentence from the first to the last (a forward RNN), we have a second RNN processing the words in the sentence from the **last to the first** (a backward RNN). 

In PyTorch, the hidden state (and cell state) tensors returned by the forward and backward RNNs are stacked on top of each other in a single tensor.

We make our sentiment prediction using a concatenation of the last hidden state from the forward LSTM (obtained from final word of the sentence), $h_T^\rightarrow$, and the last hidden state from the backward LSTM (obtained from the first word of the sentence), $h_T^\leftarrow$, i.e. $\hat{y}=f(h_T^\rightarrow, h_T^\leftarrow)$   


### Multi-layer RNN

In multi-layer LSTM (also called *deep LSTM*) we add additional LSTMs on top of the initial standard LSTM, where each LSTM added is another *layer*. The hidden state output by the first (bottom) LSTM at time-step $t$ will be the input to the LSTM above it at time step $t$. The prediction is then made from the final hidden state of the final (highest) layer.


### Implementation Details

To use an RNN instead of the LSTM, you can use `nn.RNN` instead of `nn.LSTM`. Also, note that the LSTM returns the `output` and a tuple of the final `hidden` state and the final `cell` state, whereas the standard RNN only returns the `output` and the final `hidden` state.

As the final hidden state of our LSTM has both a forward and a backward component, which will be concatenated together, the size of the input to the `nn.Linear` layer is twice that of the hidden dimension size.

Implementing bidirectionality and adding additional layers are done by passing values for the `num_layers` and `bidirectional` arguments for the RNN/LSTM. 

The final hidden state, `hidden`, has a shape of _[num layers $\times$ num directions, batch size, hid dim]_. These are ordered: _[forward_layer_0, backward_layer_0, forward_layer_1, backward_layer_1]_. As we want the final (top) layer forward and backward hidden states, we get `hidden[2,:,:]` and `hidden[3,:,:]`, and concatenate them together before passing them to the linear layer.
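The ordering described above implies a simple index formula. The helper `hidden_index` below is hypothetical and only illustrates how the positions 2 and 3 arise for a two-layer bidirectional network.

```python
def hidden_index(layer, direction, num_directions=2):
    """Position along the first axis of `hidden`; direction 0 = forward, 1 = backward."""
    return layer * num_directions + direction

num_layers = 2
top = num_layers - 1  # index of the final (top) layer

fwd = hidden_index(top, 0)
bwd = hidden_index(top, 1)
print(fwd, bwd)  # 2 3
```

Because the top layer always comes last, `hidden[-2]` and `hidden[-1]` select the same two states regardless of the number of layers.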

The input dimension is the dimension of the one-hot vectors, which is equal to the vocabulary size. 

The embedding dimension is the size of the dense word vectors i.e. 100.

The hidden dimension is the size of the hidden states, i.e. 200.

The output dimension is usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 and thus can be 1-dimensional, i.e. a single scalar real number.

In [None]:
class MyRNN(nn.Module):
    def __init__(self):
        super(MyRNN,self).__init__()
 
        self.output_dim = 1        
        self.hidden_dim = 200
        embedding_dim = 100
        bidirectional = True
        num_layers = 2
      
        # We load the word embeddings into the `embedding` layer of our model.
        self.embedding = nn.Embedding.from_pretrained(glove.vectors,freeze=True) 
        
        self.rnn = nn.RNN(input_size=embedding_dim,
                            hidden_size=self.hidden_dim,
                            num_layers=num_layers,
                            bidirectional=bidirectional,
                            batch_first=True)
        
        
        self.fc = nn.Linear(self.hidden_dim * 2, self.output_dim)
        
    def forward(self,x):
        embeds = self.embedding(x)  
        
        output, hidden = self.rnn(embeds)
        # output = [batch size, sent len, hid dim * 2] (because batch_first=True)
        # hidden = [num layers * 2, batch size, hid dim]
        # (nn.LSTM would return (hidden, cell) here, with cell shaped like hidden)
        
        # concat the final forward (hidden[2,:,:]) and backward (hidden[3,:,:]) hidden layers
        hidden = torch.cat((hidden[2,:,:], hidden[3,:,:]), dim = 1)        
        logit = self.fc(hidden)
                
        return logit

We now create an instance of our RNN class.
We'll print out the total number of trainable parameters in our model (the frozen embedding weights are excluded from this count). We also print the number of parameters in each layer. With `nn.LSTM`, for example, `('lstm.weight_ih_l0', 80000)` would indicate that there are 100$\times$200 parameters that connect the embeddings (dim=100) to the hidden layer (dim=200), and 3$\times$ additional parameters corresponding to the three gates in the LSTM, hence a total of 100$\times$200$\times$4=80,000 parameters. Likewise, `('lstm.weight_hh_l0', 160000)` would indicate that there are 200$\times$200 parameters that connect the previous hidden state (dim=200) to the hidden layer (dim=200), and 3$\times$ additional parameters corresponding to the three gates in the LSTM, hence a total of 200$\times$200$\times$4=160,000 parameters. With the plain `nn.RNN` used in the cell below there are no gates, so the corresponding entries in the output are 20,000 and 40,000.

In [None]:
model = MyRNN()

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')
print(model)
[(n, p.numel()) for n, p in model.named_parameters()]

The model has 362,001 trainable parameters
MyRNN(
  (embedding): Embedding(25000, 100)
  (rnn): RNN(100, 200, num_layers=2, batch_first=True, bidirectional=True)
  (fc): Linear(in_features=400, out_features=1, bias=True)
)


[('embedding.weight', 2500000),
 ('rnn.weight_ih_l0', 20000),
 ('rnn.weight_hh_l0', 40000),
 ('rnn.bias_ih_l0', 200),
 ('rnn.bias_hh_l0', 200),
 ('rnn.weight_ih_l0_reverse', 20000),
 ('rnn.weight_hh_l0_reverse', 40000),
 ('rnn.bias_ih_l0_reverse', 200),
 ('rnn.bias_hh_l0_reverse', 200),
 ('rnn.weight_ih_l1', 80000),
 ('rnn.weight_hh_l1', 40000),
 ('rnn.bias_ih_l1', 200),
 ('rnn.bias_hh_l1', 200),
 ('rnn.weight_ih_l1_reverse', 80000),
 ('rnn.weight_hh_l1_reverse', 40000),
 ('rnn.bias_ih_l1_reverse', 200),
 ('rnn.bias_hh_l1_reverse', 200),
 ('fc.weight', 400),
 ('fc.bias', 1)]
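The counts printed above can be reproduced with a few lines of arithmetic. The function below is a sketch with a hypothetical name; it assumes the usual layout of one input-to-hidden matrix, one hidden-to-hidden matrix and two bias vectors per layer and direction, multiplied by the number of gate blocks (1 for a plain RNN, 4 for an LSTM).

```python
def recurrent_param_count(input_size, hidden_size, num_layers, gates=1, bidirectional=True):
    """Count weights and biases of a stacked RNN (gates=1) or LSTM (gates=4)."""
    num_directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers above the first receive the concatenated outputs of both directions.
        layer_input = input_size if layer == 0 else hidden_size * num_directions
        weight_ih = gates * hidden_size * layer_input
        weight_hh = gates * hidden_size * hidden_size
        biases = 2 * gates * hidden_size  # bias_ih and bias_hh
        total += num_directions * (weight_ih + weight_hh + biases)
    return total

rnn_total = recurrent_param_count(100, 200, num_layers=2)
fc_total = 200 * 2 + 1  # weights plus bias of the final linear layer
print(rnn_total + fc_total)  # 362001
```

Calling the same function with `gates=4` gives the corresponding count for an LSTM of the same size.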

## Train the Model

Now we'll set up the training and then train the model. First, we'll create an optimizer. This is the algorithm we use to update the parameters of the module. We will use `Adam` instead of `SGD` that we used in Assignment 1. SGD updates all parameters with the same learning rate and choosing this learning rate can be tricky. `Adam` adapts the learning rate for each parameter, giving parameters that are updated more frequently lower learning rates and parameters that are updated infrequently higher learning rates.


In [None]:
import torch.optim as optim
optimizer = optim.Adam(model.parameters())
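The per-parameter scaling described above can be seen in a single Adam update for one scalar parameter. This is a sketch of the standard Adam update rule in plain Python, not PyTorch's implementation; the default values mirror the defaults of `optim.Adam` (learning rate 1e-3, betas 0.9 and 0.999, epsilon 1e-8), and the function name is illustrative.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter; `t` is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Two parameters with very different gradient magnitudes.
theta_big, _, _ = adam_step(1.0, grad=10.0, m=0.0, v=0.0, t=1)
theta_small, _, _ = adam_step(1.0, grad=0.1, m=0.0, v=0.0, t=1)
print(1.0 - theta_big, 1.0 - theta_small)
```

Both parameters move by roughly the learning rate on the first step, even though their gradients differ by a factor of 100; plain SGD with the same learning rate would move them by 0.01 and 0.0001 respectively.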

Next, we'll define our loss function. In PyTorch this is commonly called a criterion. 

The loss function here is _binary cross entropy with logits_. Our model currently outputs an unbounded real number. As our labels are either 0 or 1, we want to restrict the predictions to a number between 0 and 1. We do this using the _sigmoid_. We then use this bounded scalar to calculate the loss using binary cross entropy.

The `BCEWithLogitsLoss` criterion carries out both the sigmoid and the binary cross entropy steps.
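The two steps can be written out for a single example in plain Python. The sketch below compares the literal "sigmoid, then cross entropy" computation with an algebraically equivalent form that works directly on the logit. The function names are illustrative, and this is not PyTorch's code, although avoiding the explicit sigmoid for numerical stability is a commonly cited reason for using the combined criterion.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_naive(logit, label):
    """Sigmoid followed by binary cross entropy, exactly as described in the text."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def bce_with_logits(logit, label):
    """Equivalent form that never evaluates log(0), even for very large |logit|."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

print(bce_naive(0.7, 1.0), bce_with_logits(0.7, 1.0))
print(bce_with_logits(0.0, 1.0))  # log(2): the model is maximally unsure
```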

Using `.to`, we can place the model and the criterion on the GPU (if we have one). 

In [None]:
criterion = nn.BCEWithLogitsLoss()

model.to(device)
criterion = criterion.to(device)
print(model)

MyRNN(
  (embedding): Embedding(25000, 100)
  (rnn): RNN(100, 200, num_layers=2, batch_first=True, bidirectional=True)
  (fc): Linear(in_features=400, out_features=1, bias=True)
)


Our criterion function calculates the loss; however, we have to write our own function to calculate the accuracy.

This function first feeds the predictions through a sigmoid layer, squashing the values between 0 and 1, we then round them to the nearest integer. This rounds any value greater than 0.5 to 1 (a positive sentiment) and the rest to 0 (a negative sentiment).

We then calculate how many rounded predictions equal the actual labels and average it across the batch.

In [None]:
def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()  
    acc = correct.sum() / len(correct)
    return acc
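A small worked example, using a pure-Python stand-in for `binary_accuracy` so it runs without PyTorch. The logits and labels are made up for illustration.

```python
def binary_accuracy_py(logits, labels):
    """Fraction of examples whose thresholded prediction matches the label."""
    # sigmoid(z) > 0.5 exactly when z > 0, so thresholding the logit at 0 is equivalent.
    preds = [1.0 if z > 0 else 0.0 for z in logits]
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return correct / len(labels)

logits = [2.0, -1.0, 0.3, -0.2]  # hypothetical model outputs
labels = [1.0, 0.0, 0.0, 0.0]
acc = binary_accuracy_py(logits, labels)
print(acc)  # 0.75
```

The third example has a positive logit but a negative label, so three of the four predictions are correct.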

The `train` function iterates over all examples, one batch at a time. 

`model.train()` puts the model in "training mode", which turns on _dropout_ and _batch normalization_, though we aren't using them in this model.

For each batch, we first zero the gradients. Each parameter in a model has a `grad` attribute which stores the gradient computed by `loss.backward()`. PyTorch does not automatically remove (or "zero") the gradients calculated from the last gradient calculation, so they must be manually zeroed.

We then feed the batch of sentences, `inputs`, into the model. Note, you do not need to do `model.forward(inputs)`, simply calling the model works. The `squeeze` is needed as the predictions are initially size _[batch size, 1]_, and we need to remove the dimension of size 1 as PyTorch expects the predictions input to our criterion function to be of size _[batch size]_.

The loss and accuracy are then calculated using our predictions and the labels, `labels`, with the loss being averaged over all examples in the batch.

In [None]:
def train(model, data_loader, optimizer, criterion):
  epoch_loss = 0
  epoch_acc = 0
  model.train()

  for inputs, labels in data_loader:
      optimizer.zero_grad()
      inputs, labels = inputs.to(device), labels.to(device, dtype=torch.float)   
      predictions = model(inputs).squeeze(1)      
      loss = criterion(predictions, labels)      
      loss.backward()
      optimizer.step()
      acc = binary_accuracy(predictions, labels)
      epoch_loss += loss.item()
      epoch_acc += acc.item()
      
  return epoch_loss / len(data_loader), epoch_acc / len(data_loader)

`evaluate` is similar to `train`, with a few modifications as you don't want to update the parameters when evaluating.

`model.eval()` puts the model in "evaluation mode", which turns off _dropout_ and _batch normalization_, though we are not using them in this model.

No gradients are calculated on PyTorch operations inside the `with torch.no_grad()` block. This causes less memory to be used and speeds up computation.

The rest of the function is the same as `train`, with the removal of `optimizer.zero_grad()`, `loss.backward()` and `optimizer.step()`, as we do not update the model's parameters when evaluating.


In [None]:
def evaluate(model, data_loader, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for inputs, labels in data_loader:
            inputs, labels = inputs.to(device), labels.to(device, dtype=torch.float)
            predictions = model(inputs).squeeze(1)
            loss = criterion(predictions, labels)
            acc = binary_accuracy(predictions, labels)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(data_loader), epoch_acc / len(data_loader)

We also create a small helper function to tell us how long our epochs are taking.

In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
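A quick usage example for `epoch_time`, with the function repeated so the snippet runs on its own. The timestamps are made up.

```python
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# 125.7 seconds elapsed: 2 full minutes, with the remaining 5.7 s truncated to 5.
mins, secs = epoch_time(0.0, 125.7)
print(mins, secs)  # 2 5
```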

Finally, we train our model...
At each epoch, if the validation loss is the best we have seen so far, we'll save the parameters of the model and then after training has finished we'll use that model on the test set.

In [None]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_loader, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_loader, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.5f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.5f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 20s
	Train Loss: 0.65896 | Train Acc: 60.80%
	 Val. Loss: 0.68068 |  Val. Acc: 56.42%
Epoch: 02 | Epoch Time: 0m 19s
	Train Loss: 0.67157 | Train Acc: 57.78%
	 Val. Loss: 0.65067 |  Val. Acc: 61.63%
Epoch: 03 | Epoch Time: 0m 19s
	Train Loss: 0.66211 | Train Acc: 59.38%
	 Val. Loss: 0.66532 |  Val. Acc: 57.89%
Epoch: 04 | Epoch Time: 0m 19s
	Train Loss: 0.65372 | Train Acc: 60.16%
	 Val. Loss: 0.65718 |  Val. Acc: 58.75%
Epoch: 05 | Epoch Time: 0m 20s
	Train Loss: 0.63929 | Train Acc: 62.99%
	 Val. Loss: 0.67392 |  Val. Acc: 58.99%


Finally, we compute the metrics we actually care about, the test loss and accuracy, using the parameters that gave us the best validation loss.

In [None]:
model.load_state_dict(torch.load('best-model.pt'))

test_loss, test_acc = evaluate(model, test_loader, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.651 | Test Acc: 60.99%


## User Input

We can now use our model to predict the sentiment of any sentence we give it.

Our `predict_sentiment` function does a few things:
- sets the model to evaluation mode
- tokenizes the sentence, i.e. splits it from a raw string into a list of tokens
- indexes the tokens by converting them into their integer representation from our vocabulary
- converts the indexes, which are a Python list, into a PyTorch tensor
- adds a batch dimension by `unsqueeze`ing
- squashes the output prediction to a real number between 0 and 1 with the `sigmoid` function
- converts the tensor holding a single value into a Python number with the `item()` method

We are expecting reviews with a negative sentiment to return a value close to 0 and positive reviews to return a value close to 1.

In [None]:
def predict_sentiment(model, sentence):
    model.eval()
    # Map each in-vocabulary word to its GloVe index, as was done for the training data
    indices = [glove.stoi[preprocess_string(word)] for word in sentence.lower().split()
                                  if preprocess_string(word) in glove.stoi]
    tensor = torch.LongTensor(indices).to(device)
    tensor = tensor.unsqueeze(0)  # add a batch dimension: [1, sent len]
    with torch.no_grad():  # no gradients are needed for inference
        logit = model(tensor)
    prediction = torch.sigmoid(logit)
    return prediction.item()

An example negative review...

In [None]:
predict_sentiment(model, "This film is terrible")

0.2471698373556137

An example positive review...

In [None]:
predict_sentiment(model, "This film is great")

0.6695504188537598