# Predicting Sentiment Using a Transformer

This notebook provides you with a complete code example that predicts the sentiment of movie reviews using a transformer encoder network.

## Using the IMDB Dataset

Start by downloading the Large Movie Review Dataset (often referred to as the IMDB dataset, available at https://huggingface.co/datasets/imdb). It contains 50,000 movie reviews, each labeled as positive or negative, divided into 25,000 reviews for training and 25,000 reviews for testing.

Download the IMDB dataset ...

In [1]:
from datasets import load_dataset

dataset = load_dataset("imdb")

... split the training data into training and validation datasets ...

In [2]:
split = dataset["train"].train_test_split(test_size=0.2, 
                                          stratify_by_column="label", seed=42)
train_dataset, val_dataset = split["train"], split["test"]

... and print some example reviews.

In [3]:
import numpy as np
import pandas as pd

samples = train_dataset.select(np.random.randint(0, len(train_dataset), 3))
texts, labels = samples["text"], samples["label"]

df = pd.DataFrame({"Text": texts, "Label": labels})
styled_df = df.style.set_properties(**{"text-align": "left"}).set_table_styles(
    [{"selector": "th", "props": [("text-align", "center")]}]
)
with pd.option_context("display.max_colwidth", None):
    display(styled_df)

Unnamed: 0,Text,Label
0,"The Sunshine Boys is one of my favorite feel good movies. I first saw it when it as the Christmas attraction at Radio City Music Hall when it first came out and loved it ever since. I ended up seeing it 6 times in the theaters, and if it was playing today I'd go out to see it again. Now a lot of the reviews here mentioned the wonderful performances of the leads. Matthau was brilliant, but had the misfortune of being nominated against Jack Nicholson's Oscar winning performance of Randall P. MacMurphy in ""One Flew Over the Cuckoo's nest. Burns did win, though Richard Benjiman deserved at least to be nominated as well. Even the smallest roles were played to perfection, like Fritz Feld auditioning for the potato chips commercial. Which brings me to my reason for reviewing this film, the direction of the greatly underrated Herbert Ross. Ross who previously brought a two person play, ""The Owl And The Pussycat"" to the screen and made a full movie out of it, does it again. He opens the plays out without making them look like a photographic stage play. He fleashens out the story and the characters. Here we're 20 minutes into the film before we get to the scene that opens the play, where Ben Clark comes to see his uncle and tell him about the comedy special. Though there are dialogue from the play during the first twenty minutes, the sequence itself is totally new. A few years ago I did see at the broadway revival of the play with Jack Klugman and Tony Randall, which was wonderful. But I think that Ross and screenwriter, playwright Simon improved on it. It's just a wonderful film.",1
1,"This is truly an awful movie and a waste of 2 hours of your life. It is simultaneously bland and offensive, with nudity and lots and lots of violence. However, the nudity is not that exciting, and the violence is repetitive and boring. Also, the plot is flimsy at best, the characters are unrealistic and undeveloped, and the acting is some of the worst I have ever seen. I have heard that this movie is supposed to be funny, but it's not. I did not laugh once while watching it, nor did I even crack a smile. The makers of this film tried to combine a comedy movie with an action movie, and they failed on both counts. Some poorly made movies are funny because they are so bad, but this is not one of them.",0
2,"I really enjoyed ""Doctor Mordrid"". This is a low-budget film, which may be off-putting to some, but I have no problem with it. I admire it even more for that, considering it's WAY more entertaining than the drivel that Hollywood churns out every year. Too bad this didn't get a theatrical release; I don't know about anyone else, but I would have went to see it in theatres. `Doctor Mordrid' is a very entertaining science fiction film that just about anyone can enjoy, especially if they're into sci-fi like I am. I don't see why this is a R-rated film; only one f-word is said, and there are no gruesome death scenes, nor is there any blood at all. The timeless rivalry between sorcerers Anton and Kabal (Anton wanted the use his powers to save the human race, while Kabal wanted to enslave them), gave the story a sense of enchantment, while the mythical plotline added charm to the story itself. Basically, this a film that's just plain fun to watch. There is one unintentionally funny thing in this movie, though: seeing Jeffrey Combs keeping a straight face while wearing that silly blue cape and suit. That makes me laugh every time I see it. But I digress... Anyway, the acting is great; the main protagonists (Anton, and his lady friend, Samantha), are very likable; Anton is sympathetic, and hospitable, and Samantha is friendly. Plus, the settings were wonderful. The floating island in the other dimension was very cool setting; we're only given a glimpse of it twice, though; it would have been great to see more scenes take place here. The main setting was also very neat; Anton's apartment is very roomy, and he has some cool devices, especially the monitoring system he uses to keep track of the world's occurrences. He even has a pet raven that he keeps in his apartment named Edgar. Overall, this a great film; it was fun to watch, and the main actors put a lot of feeling into their roles. 
If you can find anywhere that rents `Doctor Mordrid', you should rent it (or, in my case, buy it. It was definitely money well-spent)! My Rating: 8 stars out of ten.",1


### Preprocessing the Reviews

Implement a function to tokenize a sentence ...

In [4]:
import contractions, re, spacy, unicodedata

tokenizers = {"eng": spacy.blank("en"), "spa": spacy.blank("es")}

regular_expression = r"^[a-zA-Z0-9áéíóúüñÁÉÍÓÚÜÑ.,!?¡¿/:()]+$"
pattern = re.compile(unicodedata.normalize("NFC", regular_expression))

def tokenize(text, lang="eng"):
    """Tokenize text."""
    # Replace "´´" before "´"; otherwise the double accent never matches.
    swaps = {"’": "'", "‘": "'", "“": '"', "”": '"', "´´": '"', "´": "'"}
    for old, new in swaps.items():
        text = text.replace(old, new)
    text = contractions.fix(text) if lang == "eng" else text
    tokens = tokenizers[lang](text)
    return [token.text for token in tokens if pattern.match(token.text)]
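
To see what the final filtering step in `tokenize` keeps and drops, the following minimal sketch applies the same regular expression to a hand-written list of tokens. The token list is hypothetical (it only imitates what a spaCy tokenizer might emit), so no spaCy or `contractions` install is needed to run it.

```python
import re
import unicodedata

# Same character whitelist as in the tokenize() cell above.
regular_expression = r"^[a-zA-Z0-9áéíóúüñÁÉÍÓÚÜÑ.,!?¡¿/:()]+$"
pattern = re.compile(unicodedata.normalize("NFC", regular_expression))

# Hypothetical tokens, loosely imitating tokenizer output ("ni\u00f1o" = niño).
candidate_tokens = ["Great", "movie", "!", "8/10", "<br", "'s", "ni\u00f1o",
                    "-", "$5"]

# A token survives only if every character is in the whitelist.
kept = [token for token in candidate_tokens if pattern.match(token)]
dropped = [token for token in candidate_tokens if not pattern.match(token)]

print(kept)     # Tokens made only of whitelisted characters.
print(dropped)  # Tokens with at least one other character, e.g. "<br".
```

Note that tokens containing an apostrophe, a hyphen, or an angle bracket are discarded entirely rather than cleaned, which is why leftover HTML such as `<br` never reaches the vocabulary.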

### Building a Vocabulary

Implement a class to represent a vocabulary ...

In [5]:
class Vocab:
    """Vocabulary as callable dictionary."""
    
    def __init__(self, vocab_dict, unk_token="<unk>"):
        """Initialize vocabulary"""
        self.vocab_dict, self.unk_token = vocab_dict, unk_token
        self.default_index = vocab_dict.get(unk_token, -1)
        self.index_to_token = {idx: token for token, idx in vocab_dict.items()}
        
    def __call__(self, token_or_tokens):
        """Return the index(es) for given token or list of tokens."""
        if not isinstance(token_or_tokens, list):
            return self.vocab_dict.get(token_or_tokens, self.default_index)
        else:
            return [self.vocab_dict.get(token, self.default_index) 
                    for token in token_or_tokens]
    
    def set_default_index(self, index):
        """Set default index for unknown tokens."""
        self.default_index = index

    def lookup_token(self, index_or_indices):
        """Retrieve token corresponding to given index or list of indices."""
        if not isinstance(index_or_indices, list):
            return self.index_to_token.get(int(index_or_indices), self.unk_token)
        else:
            return [self.index_to_token.get(int(index), self.unk_token) 
                    for index in index_or_indices]

    def get_itos(self):
        """Return a list of tokens ordered by their index."""
        itos = [None] * len(self.index_to_token)
        for index, token in self.index_to_token.items():
            itos[index] = token
        return itos
        
    def __iter__(self):
        """Iterate over the tokens in the vocabulary."""
        return iter(self.vocab_dict)

    def __len__(self):
        """Return the number of tokens in the vocabulary."""
        return len(self.vocab_dict)
    
    def __contains__(self, token):
        """Check if a token is in the vocabulary."""
        return token in self.vocab_dict

... implement a function to build vocabulary from an iterator ...

In [6]:
from collections import Counter

def build_vocab_from_iterator(iterator, specials=None, min_freq=1):
    """Build vocabulary from an iterator over tokenized sentences."""
    token_freq = Counter(token for tokens in iterator for token in tokens)
    vocab, index = {}, 0
    if specials: 
        for token in specials: 
            vocab[token] = index
            index += 1
    for token, freq in token_freq.items():
        if freq >= min_freq:
            vocab[token] = index
            index += 1
    return vocab
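
As a quick illustration of how the frequency threshold shapes the vocabulary, here is a minimal sketch on a toy corpus. The corpus and the threshold of 2 are made up for this example; the logic mirrors `build_vocab_from_iterator` in condensed form.

```python
from collections import Counter

# Hypothetical toy corpus of already-tokenized sentences.
toy_corpus = [
    ["a", "great", "movie"],
    ["a", "boring", "movie"],
    ["a", "great", "cast"],
]

token_freq = Counter(token for tokens in toy_corpus for token in tokens)

# The special token takes index 0; frequent tokens follow in order of
# first appearance (Counter preserves insertion order).
toy_vocab = {"<unk>": 0}
for token, freq in token_freq.items():
    if freq >= 2:  # Plays the role of min_freq.
        toy_vocab[token] = len(toy_vocab)

print(toy_vocab)  # {'<unk>': 0, 'a': 1, 'great': 2, 'movie': 3}

# Tokens below the threshold fall back to the <unk> index at lookup time.
indices = [toy_vocab.get(token, toy_vocab["<unk>"])
           for token in ["a", "boring", "cast", "movie"]]
print(indices)  # [1, 0, 0, 3]
```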

... create a vocabulary ...

In [7]:
def imdb_iterator(dataset):
    """Iterate over the IMDB dataset."""
    for sample in dataset:
        yield tokenize(sample["text"])

vocab_dict = build_vocab_from_iterator(imdb_iterator(train_dataset), 
                                  specials=["<unk>"], min_freq=10)
vocab = Vocab(vocab_dict, unk_token="<unk>")
vocab.set_default_index(vocab(vocab.unk_token))

... and preprocess the training, validation, and testing datasets.

In [8]:
def preprocessing(sample):
    """Preprocess a movie review."""
    sentence = sample["text"]
    tokens = tokenize(unicodedata.normalize("NFC", sentence))
    sequence_of_indices = vocab(tokens)
    sample.update({"sequence": sequence_of_indices}) 
    return sample

train_dataset = train_dataset.map(preprocessing)
val_dataset = val_dataset.map(preprocessing)
test_dataset = dataset["test"].map(preprocessing)

## Defining the Data Loaders

In [9]:
import torch
from torch.utils.data import DataLoader
from torch_geometric.data import Data

def collate(batch):
    """Prepare a batch of data for the model to process."""
    sequences, labels, batch_indices = [], [], []
    for batch_index, sample in enumerate(batch):
        sequence = torch.tensor(sample["sequence"])
        sequences.append(sequence)

        batch_indices.append(torch.ones_like(sequence, dtype=torch.long) 
                             * batch_index)
        
        label = torch.tensor(sample["label"])
        labels.append(label)
    return Data(sequences=torch.cat(sequences), 
                batch_indices=torch.cat(batch_indices),
                y=torch.Tensor(labels).float())

train_dataloader = \
    DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate)
val_dataloader = \
    DataLoader(val_dataset, batch_size=8, shuffle=False, collate_fn=collate)
test_dataloader = \
    DataLoader(test_dataset, batch_size=8, shuffle=False, collate_fn=collate)
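
The `collate` function does not pad reviews to a common length. Instead, it concatenates all token indices into one flat tensor and records which review each token belongs to. The sketch below reproduces that layout on a hypothetical two-review batch using plain tensors (no `torch_geometric` needed), assuming only that PyTorch is installed.

```python
import torch

# Hypothetical batch of two tokenized reviews with different lengths.
toy_batch = [
    {"sequence": [4, 9, 2], "label": 1},
    {"sequence": [7, 5], "label": 0},
]

# All tokens of all reviews, one after another.
sequences = torch.cat([torch.tensor(s["sequence"]) for s in toy_batch])

# For each token, the position of its review within the batch.
batch_indices = torch.cat([
    torch.full((len(s["sequence"]),), i, dtype=torch.long)
    for i, s in enumerate(toy_batch)
])

labels = torch.tensor([s["label"] for s in toy_batch], dtype=torch.float32)

print(sequences)      # tensor([4, 9, 2, 7, 5])
print(batch_indices)  # tensor([0, 0, 0, 1, 1])
print(labels)         # tensor([1., 0.])
```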

## Building a Transformer Encoder Layer

Prepare a class to implement a multi-head attention layer ...

In [10]:
import deeplay as dl
import torch

class MultiHeadAttentionLayer(dl.DeeplayModule):
    """Multi-head attention layer with masking."""
    
    def __init__(self, num_features, num_heads):
        """Initialize multi-head attention."""
        super().__init__()
        
        self.num_features = num_features
        self.num_heads = num_heads
        self.head_dim = num_features // num_heads  # Must be integer.
        
        self.Wq = dl.Layer(torch.nn.Linear, num_features, num_features)
        self.Wk = dl.Layer(torch.nn.Linear, num_features, num_features)
        self.Wv = dl.Layer(torch.nn.Linear, num_features, num_features)
        self.Wout = dl.Layer(torch.nn.Linear, num_features, num_features)
    
    def forward(self, in_sequences, batch_indices):
        """Apply masked multi-head self-attention."""
        seq_len, embed_dim = in_sequences.shape
        Q = self.Wq(in_sequences)
        K = self.Wk(in_sequences)
        V = self.Wv(in_sequences)
        
        Q = Q.view(seq_len, self.num_heads, self.head_dim).permute(1, 0, 2)
        K = K.view(seq_len, self.num_heads, self.head_dim).permute(1, 0, 2)
        V = V.view(seq_len, self.num_heads, self.head_dim).permute(1, 0, 2)
        
        attn_scores = (torch.matmul(Q, K.transpose(-2, -1)) 
                       / (self.head_dim ** 0.5))

        # Tokens may only attend to tokens within the same sequence.
        attn_mask = torch.eq(batch_indices.unsqueeze(1),
                             batch_indices.unsqueeze(0))
        attn_mask = attn_mask.unsqueeze(0)
        attn_scores = attn_scores.masked_fill(~attn_mask, float("-inf"))
        
        attn_weights = torch.nn.functional.softmax(attn_scores, dim=-1)
        attn_output = torch.matmul(attn_weights, V)
        
        attn_output = attn_output.permute(1, 0, 2).contiguous()
        attn_output = attn_output.view(seq_len, self.num_features)
        output = self.Wout(attn_output)
        
        return output
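
Because all reviews in a batch are concatenated, the mask built from `batch_indices` is what keeps tokens from attending across review boundaries. The following minimal sketch isolates that step. It uses uniform (all-zero) attention scores, an assumption made purely so the effect of the mask is easy to read off.

```python
import torch

batch_indices = torch.tensor([0, 0, 0, 1, 1])

# True where query token i and key token j belong to the same review.
attn_mask = batch_indices.unsqueeze(1) == batch_indices.unsqueeze(0)

# Uniform raw scores; masked positions are set to -inf before the softmax.
attn_scores = torch.zeros(5, 5).masked_fill(~attn_mask, float("-inf"))
attn_weights = torch.softmax(attn_scores, dim=-1)

print(attn_mask.int())
print(attn_weights)
```

The resulting weight matrix is block diagonal: the three tokens of the first review share their attention equally among themselves (1/3 each), the two tokens of the second review get 1/2 each, and every cross-review weight is exactly zero.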

... and a class to implement a transformer encoder layer ...

In [11]:
import torch.nn as nn
from torch_geometric.nn.norm import LayerNorm

class TransformerEncoderLayer(dl.DeeplayModule):
    """Transformer encoder layer."""
    
    def __init__(self, num_features, num_heads, feedforward_dim, dropout=0.0):
        """Initialize transformer encoder layer."""
        super().__init__()

        self.self_attn = MultiHeadAttentionLayer(num_features, num_heads)
        self.attn_dropout = dl.Layer(nn.Dropout, dropout) 
        self.attn_skip = dl.Add() 
        self.attn_norm = dl.Layer(LayerNorm, num_features, eps=1e-6)

        self.feedforward = dl.Sequential(
            dl.Layer(nn.Linear, num_features, feedforward_dim),
            dl.Layer(nn.ReLU),
            dl.Layer(nn.Linear, feedforward_dim, num_features),
        )
        self.feedforward_dropout = dl.Layer(nn.Dropout, dropout) 
        self.feedforward_skip = dl.Add() 
        self.feedforward_norm = dl.Layer(LayerNorm, num_features, eps=1e-6)

    def forward(self, in_sequences, batch_indices):
        """Refine sequence via attention and feedforward layers."""
        attns = self.self_attn(in_sequences, batch_indices)
        attns = self.attn_dropout(attns)
        attns = self.attn_skip(in_sequences, attns)
        attns = self.attn_norm(attns, batch_indices)

        out_sequences = self.feedforward(attns)
        out_sequences = self.feedforward_dropout(out_sequences)
        out_sequences = self.feedforward_skip(attns, out_sequences)
        out_sequences = self.feedforward_norm(out_sequences, batch_indices)
        
        return out_sequences

## Building a Transformer Encoder Model

Build a class to implement a transformer encoder model ...

In [12]:
import torch.nn as nn

class TransformerEncoderModel(dl.DeeplayModule):
    """Transformer encoder model."""
    
    def __init__(self, vocab_size, num_features, num_heads, feedforward_dim,
                 num_layers, out_dim, dropout=0.0):
        """Initialize transformer encoder model."""
        super().__init__()

        self.num_features = num_features

        self.embedding = dl.Layer(nn.Embedding, vocab_size, num_features)
        
        self.pos_encoder = dl.IndexedPositionalEmbedding(num_features)
        self.pos_encoder.dropout.configure(p=dropout)

        self.transformer_block = dl.LayerList()
        for _ in range(num_layers):
            self.transformer_block.append(
                TransformerEncoderLayer(
                    num_features, num_heads, feedforward_dim, dropout=dropout
                )
            )
            
        self.out_block = dl.Sequential(
            dl.Layer(nn.Dropout, dropout),
            dl.Layer(nn.Linear, num_features, num_features // 2), 
            dl.Layer(nn.ReLU),
            dl.Layer(nn.Linear, num_features // 2, out_dim), 
            dl.Layer(nn.Sigmoid)
        )

    def forward(self, data):
        """Predict sentiment of movie reviews."""
        in_sequences, batch_indices = data["sequences"], data["batch_indices"]

        embeddings = self.embedding(in_sequences) * self.num_features ** 0.5
        pos_embeddings = self.pos_encoder(embeddings, batch_indices)
        
        out_sequences = pos_embeddings
        for transformer_layer in self.transformer_block:
            out_sequences = transformer_layer(out_sequences, batch_indices)

        batch_size = torch.max(batch_indices) + 1
        aggregates = torch.zeros(batch_size, self.num_features, 
                                device=out_sequences.device) 
        for batch_index in torch.unique(batch_indices):
            mask = batch_indices == batch_index
            aggregates[batch_index] = out_sequences[mask].mean(dim=0)

        pred_sentiment = self.out_block(aggregates).squeeze()

        return pred_sentiment
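
The `forward` method averages the token features of each review with a Python loop. As a possible alternative (not what the notebook does), the same mean pooling can be sketched without a loop using `index_add_` and `bincount`. The feature tensor below is random and hypothetical; the check at the end only confirms that both variants agree on this toy input.

```python
import torch

torch.manual_seed(0)
batch_indices = torch.tensor([0, 0, 0, 1, 1])
out_sequences = torch.randn(5, 4)  # Hypothetical token features.
batch_size = int(batch_indices.max()) + 1

# Loop-based pooling, as in the forward() method above.
loop_means = torch.zeros(batch_size, 4)
for batch_index in torch.unique(batch_indices):
    mask = batch_indices == batch_index
    loop_means[batch_index] = out_sequences[mask].mean(dim=0)

# Vectorized alternative: sum the features per review, then divide by the
# number of tokens in each review (clamped to avoid division by zero).
sums = torch.zeros(batch_size, 4).index_add_(0, batch_indices, out_sequences)
counts = torch.bincount(batch_indices, minlength=batch_size).clamp(min=1)
vectorized_means = sums / counts.unsqueeze(1)

print(torch.allclose(loop_means, vectorized_means))  # True
```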

... instantiate the transformer encoder model ...

In [13]:
model = TransformerEncoderModel(
    vocab_size=len(vocab), num_features=300, num_heads=12, feedforward_dim=512,
    num_layers=4, out_dim=1, dropout=0.1,
).create()

print(model)

TransformerEncoderModel(
  (embedding): Embedding(19566, 300)
  (pos_encoder): IndexedPositionalEmbedding(
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer_block): LayerList(
    (0-3): 4 x TransformerEncoderLayer(
      (self_attn): MultiHeadAttentionLayer(
        (Wq): Linear(in_features=300, out_features=300, bias=True)
        (Wk): Linear(in_features=300, out_features=300, bias=True)
        (Wv): Linear(in_features=300, out_features=300, bias=True)
        (Wout): Linear(in_features=300, out_features=300, bias=True)
      )
      (attn_dropout): Dropout(p=0.1, inplace=False)
      (attn_skip): Add()
      (attn_norm): LayerNorm(300, affine=True, mode=graph)
      (feedforward): Sequential(
        (0): Linear(in_features=300, out_features=512, bias=True)
        (1): ReLU()
        (2): Linear(in_features=512, out_features=300, bias=True)
      )
      (feedforward_dropout): Dropout(p=0.1, inplace=False)
      (feedforward_skip): Add()
      (feedforward_norm): La

## Loading Pretrained Embeddings

Download the GloVe embeddings ...

In [14]:
import os
from torchvision.datasets.utils import download_url, extract_archive

glove_folder = os.path.join(".", ".glove_cache")
zip_filepath = os.path.join(glove_folder, "glove.42B.300d.zip")
if not os.path.exists(glove_folder):
    os.makedirs(glove_folder, exist_ok=True)
    url = "https://nlp.stanford.edu/data/glove.42B.300d.zip"
    download_url(url, glove_folder)
    extract_archive(zip_filepath, glove_folder)
    os.remove(zip_filepath)

... implement a function to load the GloVe embeddings ...

In [15]:
def load_glove_embeddings(glove_file):
    """Load GloVe embeddings."""
    glove_embeddings = {}
    with open(glove_file, 'r', encoding='utf-8') as file:
        for line in file:
            values = line.split()
            word = values[0]
            glove_embeddings[word] = np.round(
                np.asarray(values[1:], dtype='float32'), decimals=6,
            )
    return glove_embeddings

... implement a function to get GloVe embeddings for a vocabulary ...

In [16]:
def get_glove_embeddings(vocab, glove_embeddings, embed_dim):
    """Get GloVe embeddings for a vocabulary."""
    embeddings = torch.zeros((len(vocab), embed_dim), dtype=torch.float32)
    for i, token in enumerate(vocab):
        embedding = glove_embeddings.get(token)
        if embedding is None:
            embedding = glove_embeddings.get(token.lower())
        if embedding is not None:
            embeddings[i] = torch.tensor(embedding, dtype=torch.float32)
    return embeddings
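
The lookup in `get_glove_embeddings` first tries the token as is, then its lowercase form, and otherwise leaves the row at zero. The sketch below walks through those three cases with a hypothetical three-dimensional embedding table (real GloVe vectors here have 300 entries) and uses NumPy instead of PyTorch for brevity.

```python
import numpy as np

# Hypothetical miniature "GloVe" table.
toy_glove = {
    "movie": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "great": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
# Tokens ordered by index, as returned by vocab.get_itos().
toy_itos = ["<unk>", "Movie", "great", "zzyzx"]

toy_embeddings = np.zeros((len(toy_itos), 3), dtype="float32")
for i, token in enumerate(toy_itos):
    vector = toy_glove.get(token)
    if vector is None:
        vector = toy_glove.get(token.lower())  # Case-insensitive fallback.
    if vector is not None:
        toy_embeddings[i] = vector

print(toy_embeddings)
```

Here `"Movie"` is found through the lowercase fallback, `"great"` directly, and `"<unk>"` and `"zzyzx"` keep all-zero rows because the table has no entry for them.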

... add pretrained embeddings ...

In [17]:
glove_file = os.path.join(glove_folder, "glove.42B.300d.txt")
glove_embed, embed_dim = load_glove_embeddings(glove_file), 300

model.embedding.weight.data = \
    get_glove_embeddings(vocab.get_itos(), glove_embed, embed_dim)
model.embedding.weight.requires_grad = False

## Training the Model

Compile the model ...

In [18]:
classifier = dl.BinaryClassifier(model=model, 
                                 optimizer=dl.AdamW(lr=1e-4)).create()

... and train it.

In [19]:
from lightning.pytorch.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    monitor="valBinaryAccuracy", dirpath="models",
    filename="ATT-model{epoch:02d}-val_accuracy{valBinaryAccuracy:.2f}",
    auto_insert_metric_name=False, mode="max",
)
trainer = dl.Trainer(max_epochs=5, callbacks=[checkpoint_callback])
trainer.fit(classifier, train_dataloader, val_dataloader)



## Evaluating the Trained Model

Load the best model ...

In [61]:
import glob, os

saved_models = glob.glob("./models/ATT-model*")
best_model = max(saved_models, key=os.path.getctime)
best_classifier = dl.BinaryClassifier.load_from_checkpoint(
    best_model, model=model,
).create()

... test the trained model ...

In [None]:
test_results = trainer.test(best_classifier, test_dataloader)

... and display the model’s predictions on some reviews.

In [None]:
import random

best_classifier.model.eval()

texts, labels, predictions = [], [], []
for idx in random.sample(range(len(test_dataset)), 3):
    sample = test_dataset[idx]
    
    input_sequence = torch.tensor(vocab(tokenize(sample["text"])),
                                  dtype=torch.long)
    test_input = {
        "sequences": input_sequence,
        "batch_indices": torch.zeros_like(input_sequence, dtype=torch.long)
    }

    with torch.no_grad():  # Inference only, no gradients needed.
        probability = best_classifier.model(test_input)
    prediction = probability > 0.5

    texts.append(sample["text"])
    labels.append(sample["label"])
    predictions.append(int(prediction.item()))

df = pd.DataFrame({"text": texts, "label": labels, "prediction": predictions})
styled_df = df.style.set_properties(**{"text-align": "left"}).set_table_styles(
    [{"selector": "th", "props": [("text-align", "center")]}]
)
with pd.option_context("display.max_colwidth", None):
    display(styled_df)