# IMDB movie review sentiment classification with RNNs

In this notebook, we'll train a recurrent neural network (RNN) for sentiment classification using **PyTorch**.

First, the needed imports. 

In [None]:
%matplotlib inline

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchtext import datasets
import torchtext.transforms as T
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

print('Using PyTorch version:', torch.__version__)
if torch.cuda.is_available():
    print('Using GPU, device name:', torch.cuda.get_device_name(0))
    device = torch.device('cuda')
else:
    print('No GPU found, using CPU instead.') 
    device = torch.device('cpu')

## IMDB data set

Next we'll load the IMDB data set. The first time, we may have to download the data, which can take a while.

The dataset contains 50000 movie reviews from the Internet Movie Database, split into 25000 reviews for training and 25000 reviews for testing. Half of the reviews are positive (1) and half are negative (0).

The torchtext version of the dataset provides each review as raw text together with its label.
Below, we tokenize the text, replace each word by an integer index into a vocabulary of the `nb_words` most frequent words, and truncate or pad each review to a fixed length.
(Index "0" is assigned to `<unk>`, which represents all out-of-vocabulary words. The padding transform below also uses "0" to pad shorter reviews to a fixed size.)
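
To make the integer encoding concrete, here is a minimal sketch in plain Python of how a review can be mapped to a fixed-length sequence of indices. The toy vocabulary, the separate `PAD` and `UNK` indices, and the `encode` helper are assumptions made for illustration; they are not the values or functions used by torchtext.

```python
# Toy illustration of turning text into a fixed-length integer sequence.
# The vocabulary and the PAD/UNK indices are illustrative assumptions.
PAD, UNK = 0, 1
toy_vocab = {"<pad>": PAD, "<unk>": UNK, "this": 2, "movie": 3, "was": 4, "great": 5}

def encode(text, vocab, maxlen):
    """Lowercase, split on whitespace, map to indices, then truncate or pad."""
    tokens = text.lower().split()
    ids = [vocab.get(tok, UNK) for tok in tokens]   # unknown words map to UNK
    ids = ids[:maxlen]                              # truncate long reviews
    return ids + [PAD] * (maxlen - len(ids))        # pad short reviews

short_review = encode("This movie was great", toy_vocab, maxlen=6)
long_review = encode("This movie was truly great was great", toy_vocab, maxlen=6)
print(short_review)  # [2, 3, 4, 5, 0, 0]
print(long_review)   # [2, 3, 4, 1, 5, 4]
```

Every review ends up with exactly `maxlen` entries, which is what allows reviews to be stacked into batches.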

In [None]:
train_dataset, test_dataset = datasets.IMDB('./data', split=('train', 'test'))
#train_dataset, test_dataset = datasets.SST2('./data', split=('train', 'dev'))
#train_dataset, test_dataset = datasets.AG_NEWS('./data', split=('train', 'test'))

In [None]:
# Count how many reviews there are for each label
counts = {}
for label, text in train_dataset:
    if label not in counts:
        counts[label] = 1
    else:
        counts[label] += 1

for key, value in counts.items():
    print(key, value)

In [None]:
# number of most-frequent words to use
nb_words = 10000

tokenizer = get_tokenizer("basic_english")

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_dataset), 
                                  specials=["<unk>"], max_tokens=nb_words)
vocab.set_default_index(vocab["<unk>"])

In [None]:
maxlen = 80

patterns_list = [
    (r'"', '')
]

text_transform = T.Sequential(
    T.RegexTokenizer(patterns_list),
    T.VocabTransform(vocab),
    T.Truncate(maxlen),
    T.ToTensor(),
    T.PadTransform(maxlen, 0),
)

def apply_transform(x):
    return text_transform(x[1]), torch.tensor(x[0]-1, dtype=torch.float)


train_dataset_tr = train_dataset.map(apply_transform)
test_dataset_tr = test_dataset.map(apply_transform)

In [None]:
batch_size = 32

train_loader = DataLoader(dataset=train_dataset_tr, batch_size=batch_size, shuffle=True,
                          drop_last=True)
test_loader = DataLoader(dataset=test_dataset_tr, batch_size=batch_size, shuffle=False,
                         drop_last=True)

In [None]:
# FIXME use this instead? https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html#torch.nn.utils.rnn.pack_padded_sequence
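
As background for the note above: with right-padded reviews, the last LSTM time step of a short review is computed after reading the padding tokens. A possible alternative is `pack_padded_sequence`, which lets the LSTM stop at each sequence's true length. The sketch below uses a made-up toy batch and hypothetical sizes; it is not part of this notebook's pipeline, which does not track the review lengths.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

# Hypothetical toy batch: 2 sequences padded to length 5, true lengths 5 and 3
emb_dim, hidden = 4, 8
x = torch.randn(2, 5, emb_dim)
x[1, 3:] = 0.0                     # padded positions of the shorter sequence
lengths = torch.tensor([5, 3])     # true lengths, kept on the CPU

lstm = nn.LSTM(emb_dim, hidden, batch_first=True)

# Pack the batch so that padded positions are skipped by the LSTM
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
_, (hn_packed, _) = lstm(packed)

# Reference: run the shorter sequence alone, without any padding
_, (hn_ref, _) = lstm(x[1:2, :3])

print(hn_packed.shape)   # torch.Size([1, 2, 8])
print(torch.allclose(hn_packed[-1, 1], hn_ref[-1, 0], atol=1e-5))
```

With packing, the final hidden state `hn_packed[-1]` would be used for classification instead of the last output time step.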

## RNN model

Let's create an RNN model that contains an LSTM layer. The first layer in the network is an *Embedding* layer that converts integer indices to dense vectors of length `embedding_dims`. The output layer contains a single neuron with a *sigmoid* non-linearity to match the binary ground-truth labels (0 or 1).

All the [neural network building blocks defined in PyTorch can be found in the torch.nn documentation](https://pytorch.org/docs/stable/nn.html).
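
To see how the tensor shapes change from layer to layer, the following sketch pushes a random batch through an embedding, an LSTM and a linear layer. The batch size, sequence length and vocabulary size are illustrative assumptions chosen to keep the example small.

```python
import torch
import torch.nn as nn

# Illustrative sizes only
vocab_size, emb_dim, hidden, batch, seq_len = 100, 50, 32, 4, 10

emb = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
linear = nn.Linear(hidden, 1)

tokens = torch.randint(0, vocab_size, (batch, seq_len))  # integer word indices
e = emb(tokens)                              # (batch, seq_len, emb_dim)
out, (hn, cn) = lstm(e)                      # out: (batch, seq_len, hidden)
last = out[:, -1, :]                         # output at the last time step: (batch, hidden)
prob = torch.sigmoid(linear(last)).view(-1)  # one probability per review: (batch,)

print(e.shape, out.shape, last.shape, prob.shape)
# For a single-layer, unidirectional LSTM the last output step matches the final hidden state
print(torch.allclose(last, hn[-1]))
```

Because `batch_first=True`, the batch dimension comes first in both the input and the output of the LSTM.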

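The sigmoid squashes the output of the last layer to a value between 0 and 1, which we interpret as the probability of a positive review. Because the model applies the sigmoid itself, we use `nn.BCELoss` as the loss function (see below), which expects probabilities rather than raw scores.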
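
As a small check of how the output non-linearity and the loss fit together: in PyTorch, `nn.BCELoss` expects probabilities (values after a sigmoid), whereas `nn.BCEWithLogitsLoss` takes raw scores and applies the sigmoid internally. The sketch below uses made-up scores and targets to show that the two routes give the same loss value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(8)                       # made-up raw scores before any sigmoid
target = torch.randint(0, 2, (8,)).float()    # made-up binary labels

# Route 1: explicit sigmoid in the model, BCELoss on probabilities
loss_probs = nn.BCELoss()(torch.sigmoid(logits), target)

# Route 2: no sigmoid in the model, BCEWithLogitsLoss on raw scores
loss_logits = nn.BCEWithLogitsLoss()(logits, target)

print(loss_probs.item(), loss_logits.item())
```

The second route is generally considered more numerically stable, but then the model output has to be passed through a sigmoid separately before it can be read as a probability.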

In [None]:
# model parameters:
embedding_dims = 50
lstm_units = 32

class SimpleRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(nb_words, embedding_dims)
        self.dropout = nn.Dropout(0.2)
        self.lstm = nn.LSTM(embedding_dims, lstm_units, batch_first=True)
        self.linear = nn.Linear(lstm_units, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.emb(x)
        x = self.dropout(x)
        x, (hn, cn) = self.lstm(x)
        x = self.linear(x[:, -1, :])
        return self.sigmoid(x.view(-1))

model = SimpleRNN().to(device)
print(model)

## Learning

Now let's train the RNN model. Note that LSTMs are rather slow to train.
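
During training we also track the accuracy by thresholding the sigmoid output at 0.5 and comparing it with the labels. A minimal sketch of that idea, using made-up outputs and targets:

```python
import torch

# Made-up sigmoid outputs and ground-truth labels for illustration
output = torch.tensor([0.9, 0.2, 0.6, 0.4])
target = torch.tensor([1.0, 0.0, 0.0, 0.0])

pred = output.round().int()                        # threshold at 0.5
num_correct = (pred == target.int()).sum().item()  # count matching predictions
print(num_correct, num_correct / len(target))      # 3 0.75
```

Note that `torch.round` rounds ties to the nearest even integer, so an output of exactly 0.5 is counted as a negative prediction.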

In [None]:
def correct(output, target):
    sentiment_pred = output.round().int()          # set to 0 for <0.5 and 1 for >0.5
    correct_ones = sentiment_pred == target.int()  # 1 for correct, 0 for incorrect
    return correct_ones.sum().item()               # count number of correct ones


In [None]:
def train(data_loader, model, criterion, optimizer):
    model.train()

    num_batches = 0
    num_items = 0

    total_loss = 0
    total_correct = 0
    for data, target in tqdm(data_loader):
        # Copy data and targets to the selected device (GPU if available)
        data = data.to(device)
        target = target.to(device)
        
        # Do a forward pass
        output = model(data)
      
        # Calculate the loss
        loss = criterion(output, target)
        # Accumulate a plain Python float so the computation graph is not kept alive
        total_loss += loss.item()
        num_batches += 1

        #print(output)
        #print(target)

        # Count number of correct predictions
        total_correct += correct(output, target)
        num_items += len(target)
        
        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    train_loss = total_loss/num_batches
    accuracy = total_correct/num_items
    print(f"Average loss: {train_loss:7f}, accuracy: {accuracy:.2%}")


In [None]:
criterion = nn.BCELoss()
optimizer = torch.optim.RMSprop(model.parameters())

In [None]:
%%time

epochs = 10
for epoch in range(epochs):
    print(f"Training epoch: {epoch+1}")
    train(train_loader, model, criterion, optimizer)
    #test(test_loader, model, criterion)

### Inference

Here we have the same `test` function as before.

In [None]:
def test(test_loader, model, criterion):
    model.eval()

    num_batches = 0
    num_items = 0

    test_loss = 0
    total_correct = 0

    with torch.no_grad():
        for data, target in test_loader:
            # Copy data and targets to the selected device (GPU if available)
            data = data.to(device)
            target = target.to(device)

            # Do a forward pass
            output = model(data)
        
            # Calculate the loss
            loss = criterion(output, target)
            test_loss += loss.item()
            num_batches += 1
        
            # Count number of correct predictions
            total_correct += correct(output, target)
            num_items += len(target)

    test_loss = test_loss/num_batches
    accuracy = total_correct/num_items

    print(f"Testset accuracy: {100*accuracy:>0.1f}%, average loss: {test_loss:>7f}")

In [None]:
test(test_loader, model, criterion)

In [None]:
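myreviewtext = 'this movie was the worst i have ever seen and the actors were horrible'
#myreviewtext = 'this movie was awesome and the best action I have ever seen'

# Encode the review, add a batch dimension and move it to the same device as the model
review = text_transform(myreviewtext).view(1, -1).to(device)
print(review)

# The model outputs the estimated probability of a positive review
model.eval()
with torch.no_grad():
    output = model(review)
print(output.item())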

## Task 1: Two LSTM layers

Create a model with two LSTM layers. Optionally, you can also use bidirectional layers (set `bidirectional=True` in the LSTM). See the [LSTM documentation in PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM).
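
If you try the bidirectional option, keep in mind that it doubles the feature dimension of the LSTM output, so the following linear layer needs `2 * lstm_units` inputs. The sketch below, with illustrative sizes, shows the resulting shapes and one possible way to combine the final forward and backward states; it is meant as a hint, not as the example answer.

```python
import torch
import torch.nn as nn

emb_dim, hidden, batch, seq_len = 50, 32, 4, 10   # illustrative sizes
lstm = nn.LSTM(emb_dim, hidden, num_layers=2, batch_first=True,
               bidirectional=True)

x = torch.randn(batch, seq_len, emb_dim)   # stands in for embedded reviews
out, (hn, cn) = lstm(x)
print(out.shape)   # torch.Size([4, 10, 64]): forward and backward features concatenated
print(hn.shape)    # torch.Size([4, 4, 32]): (2 directions * 2 layers, batch, hidden)

# Final states of the last layer: forward direction is hn[-2], backward is hn[-1]
final = torch.cat([hn[-2], hn[-1]], dim=1)   # (batch, 2 * hidden)
linear = nn.Linear(2 * hidden, 1)
print(linear(final).shape)   # torch.Size([4, 1])
```

Using `hn` rather than `out[:, -1, :]` matters here, because at the last time step the backward direction has only processed a single token.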

You can consult the [PyTorch documentation](https://pytorch.org/docs/stable/index.html), in particular all the [neural network building blocks can be found in the `torch.nn` documentation](https://pytorch.org/docs/stable/nn.html).

The code below is missing the model definition. You can copy any suitable layers from the example above.

In [None]:
class TwoLayeredRNN(nn.Module):
    def __init__(self):
        super().__init__()
        # TASK 1: ADD LAYERS HERE

    def forward(self, x):
        return x


Execute the cell below to see the example answer.

**Note:** in Google Colab you have to click and copy the answer manually.

In [None]:
# %load solutions/pytorch-mnist-rnn-example-answer.py
embedding_dims = 50
lstm_units = 32

class TwoLayeredRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(nb_words, embedding_dims)
        self.dropout = nn.Dropout(0.2)
        self.lstm = nn.LSTM(embedding_dims, lstm_units, num_layers=2,
                            batch_first=True)
        self.linear = nn.Linear(lstm_units, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.emb(x)
        x = self.dropout(x)
        x, (hn, cn) = self.lstm(x)
        x = self.linear(x[:, -1, :])
        return self.sigmoid(x.view(-1))


In [None]:
ex1_model = TwoLayeredRNN()
print(ex1_model)

assert len(list(ex1_model.parameters())) > 0, "ERROR: You need to write the missing model definition above!"


ex1_model = ex1_model.to(device)

In [None]:
ex1_criterion = nn.BCELoss()
ex1_optimizer = torch.optim.RMSprop(ex1_model.parameters())

In [None]:
%%time

epochs = 5
for epoch in range(epochs):
    print(f"Epoch: {epoch+1} ...")
    train(train_loader, ex1_model, ex1_criterion, ex1_optimizer)

In [None]:
test(test_loader, ex1_model, ex1_criterion)

## Task 2: Model tuning

Modify the model further. Try to improve the classification accuracy on the test set, or experiment with the effects of different parameters.

To combat overfitting, you can, for example, add dropout. For [LSTMs](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM), dropout between the LSTM layers can be set with the `dropout` parameter:

    self.lstm = nn.LSTM(embedding_dims, lstm_units, num_layers=2,
                        batch_first=True, dropout=0.2)
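
The `dropout` parameter of `nn.LSTM` is applied to the outputs of every LSTM layer except the last, so it only has an effect when `num_layers` is greater than 1, and like all dropout it is only active in training mode. The following sketch, with illustrative sizes and a deliberately high dropout rate, compares repeated forward passes in training and evaluation mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Illustrative sizes; dropout=0.5 is exaggerated to make the effect obvious
lstm = nn.LSTM(50, 32, num_layers=2, batch_first=True, dropout=0.5)
x = torch.randn(4, 10, 50)

# Training mode: a new random dropout mask is drawn on every forward pass
lstm.train()
out_a, _ = lstm(x)
out_b, _ = lstm(x)
train_same = torch.allclose(out_a, out_b)

# Evaluation mode: dropout is disabled, so the output is deterministic
lstm.eval()
with torch.no_grad():
    out_c, _ = lstm(x)
    out_d, _ = lstm(x)
eval_same = torch.allclose(out_c, out_d)

print(train_same, eval_same)
```

This is also why the `train` and `test` functions call `model.train()` and `model.eval()` before processing any data.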


If you wish to change the batch size, you need to re-define the data loaders.

---
*Run this notebook in Google Colaboratory using [this link](https://colab.research.google.com/github/csc-training/intro-to-dl/blob/master/day1/optional/pytorch-mnist-mlp.ipynb).*