# IMDB movie review sentiment classification with RNNs

In this notebook, we'll train a recurrent neural network (RNN) for sentiment classification using **PyTorch**.

First, the needed imports. 

In [1]:
%matplotlib inline

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers import models, trainers, pre_tokenizers, normalizers

import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
import os

print('Using PyTorch version:', torch.__version__)
if torch.cuda.is_available():
    print('Using GPU, device name:', torch.cuda.get_device_name(0))
    device = torch.device('cuda')
else:
    print('No GPU found, using CPU instead.') 
    device = torch.device('cpu')

Using PyTorch version: 2.4.1+cu118
Using GPU, device name: Tesla P100-PCIE-16GB


## IMDB data set

Next we'll load the IMDB data set using the [Datasets library from Hugging Face](https://huggingface.co/docs/datasets/index).

The dataset contains 50,000 movie reviews from the Internet Movie Database, split into 25,000 reviews for training and 25,000 reviews for testing. Half of the reviews are positive and half are negative.

In [2]:
slurm_project = os.getenv('SLURM_JOB_ACCOUNT')
data_dir = os.path.join('/scratch', slurm_project, 'data') if slurm_project else './data'
print('data_dir =', data_dir)

train_dataset = load_dataset("imdb", split="train", trust_remote_code=False, cache_dir=data_dir)
test_dataset = load_dataset("imdb", split="test", trust_remote_code=False, cache_dir=data_dir)

data_dir = ./data


The data items can be accessed by index, and each item is a dictionary with a 'text' and 'label' field.

In [3]:
train_dataset[2]

{'text': "If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film).<br /><br />One might better spend one's time staring out a window at a tree growing.<br /><br />",
 'label': 0}

Let's do a quick count of the labels in the dataset...

In [4]:
def count_labels(dataset):
    counts = {}
    for item in dataset:
        label = item['label']
        if label not in counts:
            counts[label] = 1
        else:
            counts[label] += 1
    for key, value in counts.items():
        print(f"label: {key}, count: {value}")

print('train')
count_labels(train_dataset)

print('test')
count_labels(test_dataset)

train
label: 0, count: 12500
label: 1, count: 12500
test
label: 0, count: 12500
label: 1, count: 12500


We see that we have two labels: `0` and `1`, each with 12500 items per dataset split. Label `0` indicates a negative review, and label `1` a positive one.

## Pre-trained BERT model

As a point of comparison, we first evaluate a pre-trained DistilBERT sentiment classifier on the IMDB test set, using the `pipeline` API from Hugging Face Transformers without any further training.

In [64]:
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

In [214]:
modelname = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
#modelname = 'AdamCodd/tinybert-sentiment-amazon'
bertmodel = pipeline(model=modelname, truncation=True, max_length=512, device=device)



<transformers.pipelines.text_classification.TextClassificationPipeline at 0x7f9f34e53f10>

In [215]:
bertmodel(["This was a great movie!", "This was the worst movie I have ever seen."])

[{'label': 'POSITIVE', 'score': 0.999869704246521},
 {'label': 'NEGATIVE', 'score': 0.9997679591178894}]

In [216]:
batch_size = 128

test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False,
                         drop_last=True)

In [217]:
def correct(output, target):
    # Map the pipeline's string labels to 0 (negative) and 1 (positive)
    sentiment_pred = torch.tensor([0 if x['label'] == 'NEGATIVE' else 1 for x in output])
    correct_ones = sentiment_pred == target.int()  # True for correct, False for incorrect
    return correct_ones.sum().item()               # count number of correct ones


In [218]:
def test(test_loader, model):
    
    num_items = 0
    total_correct = 0

    with torch.no_grad():
        for item in test_loader:
            # Each batch is a dict with a list of review texts and a tensor of labels
            data = item['text']
            target = item['label']

            # Run the pipeline on the batch of raw texts
            output = model(data)

            # Count number of correct predictions
            total_correct += correct(output, target)
            num_items += len(target)

    accuracy = total_correct/num_items

    print(f"Testset accuracy: {100*accuracy:>0.1f}%")

In [219]:
%%time 

test(test_loader, bertmodel)

Testset accuracy: 89.1%
CPU times: user 4min 26s, sys: 414 ms, total: 4min 26s
Wall time: 4min 26s


## Pre-processing

Before we start training, we need to convert the data into a more suitable format. The reviews are currently text strings of variable length, whereas a neural network typically expects fixed-length vectors of numbers.

To achieve this we will use the `WordLevel` tokenizer from Hugging Face's [Tokenizers library](https://huggingface.co/docs/tokenizers/index). We will tell it to create a vocabulary of 10,000 entries: the special token `[UNK]`, which stands in for any out-of-vocabulary word, plus the most frequent words in the training set. Each entry is mapped to a specific integer index.

In [27]:
# number of most-frequent words to use
nb_words = 10000

tokenizer = Tokenizer(models.WordLevel(unk_token='[UNK]'))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.normalizer = normalizers.Sequence([normalizers.NFD(),
                                             normalizers.Lowercase(),
                                             normalizers.StripAccents()])

trainer = trainers.WordLevelTrainer(vocab_size=nb_words, min_frequency=1, special_tokens=['[UNK]'])
tokenizer.train_from_iterator(train_dataset['text'], trainer)
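
To illustrate the idea behind a word-level vocabulary, here is a minimal pure-Python sketch. It is not how the Tokenizers library is implemented: it only lowercases and splits on whitespace, and the helper names `build_vocab` and `encode` are hypothetical. It also assumes that the special token counts toward the vocabulary size.

```python
from collections import Counter

def build_vocab(texts, vocab_size, unk_token="[UNK]"):
    """Map the most frequent lowercase words to integer ids; id 0 is reserved for unk_token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {unk_token: 0}
    # Assumption: the special token occupies one slot, so vocab_size - 1 real words fit.
    for word, _ in counts.most_common(vocab_size - 1):
        vocab[word] = len(vocab)
    return vocab

def encode(text, vocab, unk_token="[UNK]"):
    """Replace each word by its id, falling back to the id of unk_token."""
    return [vocab.get(word, vocab[unk_token]) for word in text.lower().split()]

# Toy corpus for illustration only
toy_texts = ["a great movie", "a terrible movie", "a great cast"]
toy_vocab = build_vocab(toy_texts, vocab_size=4)
print(toy_vocab)
print(encode("a great foobazz movie", toy_vocab))
```

Words that did not make it into the vocabulary, such as the nonsense word here, all end up with the same id 0.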

Let's try our tokenizer out with an example sentence. We deliberately also add a nonsense word to see if it correctly maps that to `[UNK]` (which has index 0).

In [28]:
tokenizer.encode("hello, this is a test sentence foobazz").ids

[4793, 2, 14, 9, 5, 2246, 4252, 0]

Finally we create a function that ensures all our vectors have the same length of 80 by truncating sentences that are too long and padding those that are too short with 0's (note that 0 is also the index of `[UNK]`).

In [29]:
vec_length = 80

def text_transform(text):
    x = tokenizer.encode(text)
    x.truncate(vec_length)
    x.pad(vec_length)
    return x.ids


We can try the text transformation on a test sentence:

In [30]:
print(text_transform("hello, this is a test sentence"))

[4793, 2, 14, 9, 5, 2246, 4252, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
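
The truncate-then-pad logic can be sketched in plain Python as follows. This assumes the default right-side behaviour seen in the output above, where the beginning of the text is kept and zeros are appended at the end. The helper name `pad_or_truncate` is hypothetical.

```python
def pad_or_truncate(ids, length, pad_id=0):
    """Return a list of exactly `length` ids: cut long inputs, right-pad short ones."""
    ids = ids[:length]
    return ids + [pad_id] * (length - len(ids))

short_example = pad_or_truncate([4793, 2, 14, 9, 5], 8)
long_example = pad_or_truncate(list(range(1, 12)), 8)
print(short_example)
print(long_example)
```

Every output has the same length regardless of the input length, which is what allows the reviews to be stacked into a batch tensor later.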


Next, let's apply this by transforming our datasets to represent our texts as 80-length vectors and labels as floating point values.

In [31]:
# Apply the list of transforms to the text
# We also switch around so we have the text first and labels second
def apply_transform(x):
    return {
        'input_ids': text_transform(x['text']),
        'label_id': float(x['label'])
        }

train_dataset_tr = train_dataset.map(apply_transform, remove_columns=['text', 'label']).with_format('torch')
test_dataset_tr = test_dataset.map(apply_transform, remove_columns=['text', 'label']).with_format('torch')

Let's take a look at one example:

In [33]:
test_dataset_tr[2]

{'input_ids': tensor([ 105,    5,  511,  877,   27,   24,    5,  191, 2458,   17, 2727,  224,
          872,   15,  108,    1,  128,  332,    5,  132,  145,    4, 3111,    1,
          527,    6,    1,  377, 1203,    0,  122,    3,  555,    6,    1,  128,
          102,    8,   29,  108,  307,    4,  332,    7,   38, 1321,   12,    7,
          384,   66,   67,    3,    1,  148,  128,    9,   15,    6,    5,   65,
         1153,  576,   15,  170,    8,   29,  585,    1,  527, 3197,    4,  114,
         1336,   35,  603,   47,    1,  479,    3,    1]),
 'label_id': tensor(0.)}

Next we'll create the data loaders with a given batch size.

In [None]:
batch_size = 128

train_loader = DataLoader(dataset=train_dataset_tr, batch_size=batch_size, shuffle=True,
                          drop_last=True)
test_loader = DataLoader(dataset=test_dataset_tr, batch_size=batch_size, shuffle=False,
                         drop_last=True)
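
Because both loaders use `drop_last=True`, the final incomplete batch is discarded. A quick back-of-the-envelope check, assuming the 25,000-item splits and the batch size of 128 used above:

```python
# Rough arithmetic for drop_last=True; the numbers mirror the values used in this notebook.
num_items = 25000
batch_size = 128

full_batches = num_items // batch_size   # batches that are kept
used_items = full_batches * batch_size   # items actually seen in one pass
dropped_items = num_items - used_items   # items in the discarded partial batch

print(f"batches: {full_batches}, used: {used_items}, dropped: {dropped_items}")
```

For the test loader this means that accuracy is computed over 24,960 of the 25,000 reviews. Setting `drop_last=False` there would evaluate all of them.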

## RNN model

Let's create an RNN model that contains an LSTM layer. The first layer in the network is an *embedding* layer that converts integer indices to dense vectors of length `embedding_dims`. The output layer contains a single neuron and a *sigmoid* non-linearity to match the binary ground truth (0=negative, 1=positive review).

All the [neural network building blocks defined in PyTorch can be found in the torch.nn documentation](https://pytorch.org/docs/stable/nn.html).

The output of [LSTM in PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM) is, with `batch_first=True`, a 3D tensor of shape batch_size x sequence_length x lstm_units; that is, we get the output after each item in the sequence. Here we only want the output after the last item (after the whole sentence has been processed). This means we have to do things a bit more manually and cannot use the simple `nn.Sequential` as in previous exercises.

In [None]:
# model parameters:
embedding_dims = 50
lstm_units = 32

class SimpleRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(nb_words, embedding_dims)
        self.dropout = nn.Dropout(0.2)
        self.lstm = nn.LSTM(embedding_dims, lstm_units, batch_first=True)
        self.linear = nn.Linear(lstm_units, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.emb(x)
        x = self.dropout(x)
        x, (hn, cn) = self.lstm(x)    # LSTM also returns the final hidden state h_n and cell state c_n
        x = self.linear(x[:, -1, :])  # we pick only the last output after having processed the whole sequence
        return self.sigmoid(x.view(-1))

model = SimpleRNN().to(device)
print(model)
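
To see the tensor shapes involved in the `forward` method above, the following sketch runs a standalone LSTM on random data. The sizes mirror the ones used in this notebook, but the input is random noise standing in for embedded reviews, so this is only an illustration of shapes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen to mirror the notebook; the data itself is random
batch, seq_len, emb_dim, hidden = 4, 80, 50, 32

lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
x = torch.randn(batch, seq_len, emb_dim)   # stands in for the embedded reviews

out, (hn, cn) = lstm(x)
print(out.shape)   # one output vector per time step
print(hn.shape)    # final hidden state: (num_layers, batch, hidden)

# Select the output after the last item of each sequence
last_step = out[:, -1, :]
print(last_step.shape)
print(torch.allclose(last_step, hn[-1]))
```

For a single-layer, unidirectional LSTM such as this one, the last time step of the output coincides with the final hidden state `hn[-1]`, so either could be passed on to the linear layer.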

## Learning

Now let's train the RNN model. Note that LSTMs are rather slow to train.

In [None]:
def correct(output, target):
    sentiment_pred = output.round().int()          # set to 0 for <0.5 and 1 for >0.5
    correct_ones = sentiment_pred == target.int()  # 1 for correct, 0 for incorrect
    return correct_ones.sum().item()               # count number of correct ones


In [None]:
def train(data_loader, model, criterion, optimizer):
    model.train()

    num_batches = 0
    num_items = 0

    total_loss = 0
    total_correct = 0
    for item in tqdm(data_loader):
        # Copy data and targets to GPU
        data = item['input_ids'].to(device)
        target = item['label_id'].to(device)
        
        # Do a forward pass
        output = model(data)
      
        # Calculate the loss
        loss = criterion(output, target)
        total_loss += loss.item()  # .item() avoids keeping the computation graph alive
        num_batches += 1
        
        # Count number of correct predictions
        total_correct += correct(output, target)
        num_items += len(target)
        
        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    train_loss = total_loss/num_batches
    accuracy = total_correct/num_items
    print(f"Average loss: {train_loss:7f}, accuracy: {accuracy:.2%}")


We use the [binary cross-entropy loss](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html) and [RMSprop optimizer](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html#torch.optim.RMSprop).

In [None]:
criterion = nn.BCELoss()
optimizer = torch.optim.RMSprop(model.parameters())
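
To make the loss concrete, here is a minimal pure-Python sketch of the binary cross-entropy formula, averaged over a batch. The helper name `bce_sketch` is hypothetical and the probabilities are made up. PyTorch's `nn.BCELoss` computes the same quantity on tensors and additionally clamps the log terms for numerical stability.

```python
import math

def bce_sketch(probs, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all predictions."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for p, y in zip(probs, targets)]
    return sum(losses) / len(losses)

# Two reviews: the first is positive (1.0), the second negative (0.0)
confident_right = bce_sketch([0.9, 0.1], [1.0, 0.0])
confident_wrong = bce_sketch([0.1, 0.9], [1.0, 0.0])
print(f"{confident_right:.4f} {confident_wrong:.4f}")
```

Confident correct predictions give a loss close to zero, while confident wrong predictions are penalised heavily.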

In [None]:
%%time

epochs = 5
for epoch in range(epochs):
    print(f"Training epoch: {epoch+1}")
    train(train_loader, model, criterion, optimizer)

### Inference

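Here we redefine the `test` function for our own model. Compared to the version used with the pre-trained pipeline above, it also takes the loss criterion and reports the average loss.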

In [14]:
def test(test_loader, model, criterion):
    model.eval()

    num_batches = 0
    num_items = 0

    test_loss = 0
    total_correct = 0

    with torch.no_grad():
        for item in test_loader:
            # Copy data and targets to GPU
            data = item['input_ids'].to(device)
            target = item['label_id'].to(device)

            # Do a forward pass
            output = model(data)
        
            # Calculate the loss
            loss = criterion(output, target)
            test_loss += loss.item()
            num_batches += 1
        
            # Count number of correct predictions
            total_correct += correct(output, target)
            num_items += len(target)

    test_loss = test_loss/num_batches
    accuracy = total_correct/num_items

    print(f"Testset accuracy: {100*accuracy:>0.1f}%, average loss: {test_loss:>7f}")

In [None]:
test(test_loader, model, criterion)

We can also use the learned model to predict sentiments for new reviews:

In [None]:
#myreviewtext = 'this movie was the worst i have ever seen and the actors were horrible'
myreviewtext = 'this movie was awesome and the best action I have ever seen'

inputs = torch.tensor(text_transform(myreviewtext)).view(1, -1).to(device)
print(inputs)
p = model(inputs).item()
sentiment = "POSITIVE" if p > 0.5 else "NEGATIVE"
print(f'Predicted sentiment: {sentiment} ({p:.4f})')

## Task 1: Two LSTM layers

Create a model with two LSTM layers (hint: there is a `num_layers` option!). Optionally, you can also use bidirectional layers (set `bidirectional=True` in LSTM). See the [LSTM documentation in PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM).
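
Before writing the model, it may help to check how these two options change the tensor shapes. The sketch below uses random input and the same toy sizes as the notebook. It is only a shape check, not the example answer.

```python
import torch
import torch.nn as nn

# Toy sizes mirroring the notebook; the input is random
batch, seq_len, emb_dim, hidden = 4, 80, 50, 32
x = torch.randn(batch, seq_len, emb_dim)

# Two stacked LSTM layers: output size is unchanged, h_n has one entry per layer
stacked = nn.LSTM(emb_dim, hidden, num_layers=2, batch_first=True)
out, (hn, cn) = stacked(x)
print(out.shape, hn.shape)

# Bidirectional: forward and backward outputs are concatenated on the last axis
bidir = nn.LSTM(emb_dim, hidden, num_layers=2, batch_first=True, bidirectional=True)
out_bi, (hn_bi, cn_bi) = bidir(x)
print(out_bi.shape, hn_bi.shape)
```

Note that with `bidirectional=True` the last dimension of the output doubles, so a following linear layer needs `2 * lstm_units` input features.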

You can consult the [PyTorch documentation](https://pytorch.org/docs/stable/index.html); in particular, all the neural network building blocks can be found in the [`torch.nn` documentation](https://pytorch.org/docs/stable/nn.html).

The code below is missing the model definition. You can copy any suitable layers from the example above.

In [None]:
class TwoLayeredRNN(nn.Module):
    def __init__(self):
        super().__init__()
        # TASK 1: ADD LAYERS HERE

    def forward(self, x):
        return x


Execute the cell below to see the example answer.

**Note:** in Google Colab you have to click and copy the answer manually.

In [None]:
%load solutions/pytorch-imdb-rnn-example-answer.py

In [None]:
ex1_model = TwoLayeredRNN()
print(ex1_model)

assert len(list(ex1_model.parameters())) > 0, "ERROR: You need to write the missing model definition above!"


ex1_model = ex1_model.to(device)

In [None]:
ex1_criterion = nn.BCELoss()
ex1_optimizer = torch.optim.RMSprop(ex1_model.parameters())

In [None]:
%%time

epochs = 5
for epoch in range(epochs):
    print(f"Epoch: {epoch+1} ...")
    train(train_loader, ex1_model, ex1_criterion, ex1_optimizer)

In [None]:
test(test_loader, ex1_model, ex1_criterion)

## Task 2: Model tuning

Modify the model further. Try to improve the classification accuracy on the test set, or experiment with the effects of different parameters.

To combat overfitting, you can for example add dropout. For [LSTMs](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM), dropout between the stacked LSTM layers can be set with the `dropout` parameter (it only has an effect when `num_layers` is greater than 1):

    self.lstm = nn.LSTM(embedding_dims, lstm_units, num_layers=2,
                        batch_first=True, dropout=0.2)


If you wish to change the batch size, you need to re-define the data loaders.
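
One way to make this convenient is a small helper that rebuilds both loaders for a given batch size. This is a sketch: the name `make_loaders` is hypothetical, and the toy dataset below merely stands in for the transformed IMDB datasets.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loaders(train_ds, test_ds, batch_size):
    """Build fresh train and test loaders with the same settings as earlier in the notebook."""
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True, drop_last=True)
    test_loader = DataLoader(test_ds, batch_size=batch_size, shuffle=False, drop_last=True)
    return train_loader, test_loader

# Toy stand-in for the transformed datasets, for illustration only
toy_ds = TensorDataset(torch.zeros(1000, 80, dtype=torch.long), torch.zeros(1000))
toy_train_loader, toy_test_loader = make_loaders(toy_ds, toy_ds, batch_size=64)
print(len(toy_train_loader), len(toy_test_loader))
```

In the notebook itself, the call would be along the lines of `train_loader, test_loader = make_loaders(train_dataset_tr, test_dataset_tr, batch_size=64)`.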