# Predicting Sentiment Using a Transformer

This notebook provides you with a complete code example that predicts the sentiment of movie reviews using a transformer encoder network.

## Using the IMDB Dataset

Start by downloading the Large Movie Review Dataset, often referred to as the IMDB dataset and available at https://huggingface.co/datasets/imdb. It contains 50,000 movie reviews, each labeled as positive or negative. The dataset is divided into 25,000 reviews for training and 25,000 reviews for testing.

Download the IMDB dataset ...

In [1]:
from datasets import load_dataset

dataset = load_dataset("imdb")

... split the training data into training and validation datasets ...

In [2]:
split = dataset["train"].train_test_split(test_size=0.2, 
                                          stratify_by_column="label", seed=42)
train_dataset, val_dataset = split["train"], split["test"]

... and print some example reviews.

In [3]:
import numpy as np
import pandas as pd

samples = train_dataset.select(np.random.randint(0, len(train_dataset), 3))
texts, labels = samples["text"], samples["label"]

df = pd.DataFrame({"Text": texts, "Label": labels})
styled_df = df.style.set_properties(**{"text-align": "left"}).set_table_styles(
    [{"selector": "th", "props": [("text-align", "center")]}]
)
with pd.option_context("display.max_colwidth", None):
    display(styled_df)

Unnamed: 0,Text,Label
0,"If you have seen Friends, the writing will feel very familiar. Especially the last 3 or 4 seasons of Friends often share the same comedy setups. The show is about a group of people whose connection is that they shared the same class when they were still rather young (about 10 years old I think). Now, they're in their mid-twenties, and they meet again on a class reunion. This is where the series starts. A typical episode deals with multiple story lines at once. They're usually not connected in any way. Each story line is cut up into multiple sections, which are then shown in a mixed order. The sketches is where my problems lie with this series. As in the later seasons of Friends, it's often a rather silly setting with hard to believe situations. One of the main characters does something really stupid that's hard to believe. The situation is then heavily exaggerated, as if it wasn't silly enough. If you're into this kind of in-your-face humor, then maybe you'll like this series. For me it is a great turn-off. The reason I started watching Friends is because of the first few seasons. There are interesting and especially credible story lines, with some romance in it that makes you root for the characters. The Class has none of this. The characters are simply too forced and stereotypes are pushed too far. It's therefore not possible to relate to them and like them. At least with friends, it took several seasons before it ran out of steam and the character traits were all milked out. But in The Class, it seems it has run out of steam before it even started.",0
1,"I'm still trying to decide if this is indeed, the worst film I have ever seen - A very disturbing problem with this film is that real scientists are interviewed, but their footage is edited to make it look as though they support the ideas of the many BSers who populate this film. The BS to signal ratio of the interviews is about ten thousand to one - at the end, the interviewees seem to be saying, ""We want you to _think_ !!"", but they themselves are too lazy to do simple research about things they assert as fact. If you feel that you are open-minded, and wish to expand your consciousness, please be open-minded enough to read some actual books about quantum theory: ""Einstein's Universe"", Nigel Calder (a slim volume, not a challenge), ""The Cosmic Code"", by Heinz Pagels. If you can't bring yourself to read a book, please don't complain to reviewers about being ""open-minded"". To recap, this film is just unbelievably bad. You know what's a really good film which questions the nature of reality? ""Thirteenth Floor"", directed by Roland Emmerich, with Craig Bierko, Gretchen Mol, Vincent D'onofrio. Smart, sexy, thought-provoking.",0
2,BASEketball is awesome! It's hilarious and so damned funny that you will wet your pants laughing. I have seen it so many times I have stopped counting. But everytime it gets funnier. Trust me on this one...BASEketball is a surefire hit and I loved it and will continue to love it. I hope one day there will be a special edition DVD brought out!!! Ten Thumbs Up!!!,1


### Preprocessing the Reviews

Implement a function to tokenize a sentence ...

In [4]:
import contractions, re, spacy, unicodedata

tokenizers = {"eng": spacy.blank("en"), "spa": spacy.blank("es")}

regular_expression = r"^[a-zA-Z0-9áéíóúüñÁÉÍÓÚÜÑ.,!?¡¿/:()]+$"
pattern = re.compile(unicodedata.normalize("NFC", regular_expression))

def tokenize(text, lang="eng"):
    """Tokenize text."""
    # Replace the double accent before the single one, or it never matches.
    swaps = {"’": "'", "‘": "'", "“": '"', "”": '"', "´´": '"', "´": "'"}
    for old, new in swaps.items():
        text = text.replace(old, new)
    text = contractions.fix(text) if lang == "eng" else text
    tokens = tokenizers[lang](text)
    return [token.text for token in tokens if pattern.match(token.text)]
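
To see what the character filter keeps, the following sketch applies the same regular expression to a hand-written token list. The list `example_tokens` is hypothetical and only stands in for the output of the spaCy tokenizer, so the snippet illustrates the filtering step alone, not the full `tokenize` function.

```python
import re
import unicodedata

# Same character whitelist as in the tokenize function above.
regular_expression = r"^[a-zA-Z0-9áéíóúüñÁÉÍÓÚÜÑ.,!?¡¿/:()]+$"
pattern = re.compile(unicodedata.normalize("NFC", regular_expression))

# Hypothetical tokens, standing in for the output of the spaCy tokenizer.
example_tokens = ["A", "great", "movie", "!", "<br", "/>", "10/10", "#1"]

kept = [token for token in example_tokens if pattern.match(token)]
dropped = [token for token in example_tokens if not pattern.match(token)]

print(kept)     # ['A', 'great', 'movie', '!', '10/10']
print(dropped)  # ['<br', '/>', '#1']
```

Tokens containing any character outside the whitelist, such as HTML fragments, are discarded entirely rather than cleaned.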

### Building a Vocabulary

Implement a class to represent a vocabulary ...

In [5]:
class Vocab:
    """Vocabulary as callable dictionary."""
    
    def __init__(self, vocab_dict, unk_token="<unk>"):
        """Initialize vocabulary"""
        self.vocab_dict, self.unk_token = vocab_dict, unk_token
        self.default_index = vocab_dict.get(unk_token, -1)
        self.index_to_token = {idx: token for token, idx in vocab_dict.items()}
        
    def __call__(self, token_or_tokens):
        """Return the index(es) for given token or list of tokens."""
        if not isinstance(token_or_tokens, list):
            return self.vocab_dict.get(token_or_tokens, self.default_index)
        else:
            return [self.vocab_dict.get(token, self.default_index) 
                    for token in token_or_tokens]
    
    def set_default_index(self, index):
        """Set default index for unknown tokens."""
        self.default_index = index

    def lookup_token(self, index_or_indices):
        """Retrieve token corresponding to given index or list of indices."""
        if not isinstance(index_or_indices, list):
            return self.index_to_token.get(int(index_or_indices), self.unk_token)
        else:
            return [self.index_to_token.get(int(index), self.unk_token) 
                    for index in index_or_indices]

    def get_itos(self):
        """Return a list of tokens ordered by their index."""
        itos = [None] * len(self.index_to_token)
        for index, token in self.index_to_token.items():
            itos[index] = token
        return itos
        
    def __iter__(self):
        """Iterate over the tokens in the vocabulary."""
        return iter(self.vocab_dict)

    def __len__(self):
        """Return the number of tokens in the vocabulary."""
        return len(self.vocab_dict)
    
    def __contains__(self, token):
        """Check if a token is in the vocabulary."""
        return token in self.vocab_dict

... implement a function to build vocabulary from an iterator ...

In [6]:
from collections import Counter

def build_vocab_from_iterator(iterator, specials=None, min_freq=1):
    """Build vocabulary from an iterator over tokenized sentences."""
    token_freq = Counter(token for tokens in iterator for token in tokens)
    vocab, index = {}, 0
    if specials: 
        for token in specials: 
            vocab[token] = index
            index += 1
    for token, freq in token_freq.items():
        if freq >= min_freq:
            vocab[token] = index
            index += 1
    return vocab

... create a vocabulary ...

In [7]:
def imdb_iterator(dataset):
    """Iterate over the IMDB dataset."""
    for sample in dataset:
        yield tokenize(sample["text"])

vocab_dict = build_vocab_from_iterator(imdb_iterator(train_dataset), 
                                  specials=["<unk>"], min_freq=10)
vocab = Vocab(vocab_dict, unk_token="<unk>")
vocab.set_default_index(vocab(vocab.unk_token))

... and preprocess the training, validation, and testing datasets.

In [8]:
def preprocessing(sample):
    """Preprocess a movie review."""
    sentence = sample["text"]
    tokens = tokenize(unicodedata.normalize("NFC", sentence))
    sequence_of_indices = vocab(tokens)
    sample.update({"sequence": sequence_of_indices}) 
    return sample

train_dataset = train_dataset.map(preprocessing)
val_dataset = val_dataset.map(preprocessing)
test_dataset = dataset["test"].map(preprocessing)
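
The effect of `min_freq` and of the unknown-token fallback is easiest to see on a toy corpus. The sketch below is a condensed restatement of the vocabulary builder so that it runs on its own; the names `build_toy_vocab` and `toy_corpus` are hypothetical and not part of the notebook.

```python
from collections import Counter

def build_toy_vocab(tokenized_sentences, specials=("<unk>",), min_freq=1):
    """Condensed, illustrative version of the vocabulary builder."""
    token_freq = Counter(t for tokens in tokenized_sentences for t in tokens)
    # Special tokens receive the first indices.
    vocab = {token: index for index, token in enumerate(specials)}
    # Remaining tokens are added only if they are frequent enough.
    for token, freq in token_freq.items():
        if freq >= min_freq:
            vocab[token] = len(vocab)
    return vocab

# Hypothetical tokenized sentences.
toy_corpus = [["good", "movie"], ["bad", "movie"], ["good", "plot"]]
toy_vocab = build_toy_vocab(toy_corpus, min_freq=2)
print(toy_vocab)  # {'<unk>': 0, 'good': 1, 'movie': 2}

# Tokens below the frequency threshold map to the index of "<unk>".
unk_index = toy_vocab["<unk>"]
indices = [toy_vocab.get(t, unk_index) for t in ["good", "plot", "movie"]]
print(indices)  # [1, 0, 2]
```

With `min_freq=10`, as used above, rare words in the reviews are likewise collapsed onto the single `<unk>` index.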


## Defining the Data Loaders

In [9]:
import torch
from torch.utils.data import DataLoader
from torch_geometric.data import Data

def collate(batch_of_sequences):
    """Prepare a batch of sequences for the model to process."""
    sequences, labels, batch_indices = [], [], []
    for batch_index, sample in enumerate(batch_of_sequences):
        sequence = torch.tensor(sample["sequence"])
        sequences.append(sequence)
        batch_indices.append(torch.ones_like(sequence, dtype=torch.long) 
                             * batch_index)
        label = torch.tensor(sample["label"])
        labels.append(label)
    return Data(sequence=torch.cat(sequences),  # Key read by the model.
                batch_indices=torch.cat(batch_indices),
                y=torch.Tensor(labels).float())

train_dataloader = \
    DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate)
val_dataloader = \
    DataLoader(val_dataset, batch_size=8, shuffle=False, collate_fn=collate)
test_dataloader = \
    DataLoader(test_dataset, batch_size=8, shuffle=False, collate_fn=collate)
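
The collate function avoids padding by concatenating all reviews of a batch into one long sequence and recording, for every token, the review it came from. A minimal sketch of that layout in plain Python (rather than PyTorch, so that it runs without dependencies) could look as follows; `flatten_batch` and `toy_batch` are hypothetical names.

```python
def flatten_batch(list_of_sequences):
    """Concatenate sequences and record which sequence each token came from."""
    flat_tokens, batch_indices = [], []
    for batch_index, sequence in enumerate(list_of_sequences):
        flat_tokens.extend(sequence)
        batch_indices.extend([batch_index] * len(sequence))
    return flat_tokens, batch_indices

# Hypothetical token indices for three reviews of different lengths.
toy_batch = [[5, 8, 2], [7], [4, 9]]
flat_tokens, batch_indices = flatten_batch(toy_batch)
print(flat_tokens)    # [5, 8, 2, 7, 4, 9]
print(batch_indices)  # [0, 0, 0, 1, 2, 2]
```

The two lists always have the same length, which is what lets later layers decide token by token which review they are working on.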

## Building a Transformer Encoder Layer

Prepare a class to implement a multi-head attention layer ...

In [10]:
import deeplay as dl
import torch

class MultiHeadAttentionLayer(dl.DeeplayModule):
    """Multi-head attention layer with masking."""
    
    def __init__(self, num_features, num_heads):
        """Initialize multi-head attention."""
        super().__init__()
        self.num_features, self.num_heads = num_features, num_heads
        self.head_dim = num_features // num_heads  # Must be integer.
        
        self.Wq = dl.Layer(torch.nn.Linear, num_features, num_features)
        self.Wk = dl.Layer(torch.nn.Linear, num_features, num_features)
        self.Wv = dl.Layer(torch.nn.Linear, num_features, num_features)
        self.Wout = dl.Layer(torch.nn.Linear, num_features, num_features)
    
    def forward(self, in_sequence, batch_indices):
        """Apply the multi-head attention mechanism to the input sequence."""
        seq_len, embed_dim = in_sequence.shape
        Q = self.Wq(in_sequence)
        Q = Q.view(seq_len, self.num_heads, self.head_dim).permute(1, 0, 2)
        K = self.Wk(in_sequence)
        K = K.view(seq_len, self.num_heads, self.head_dim).permute(1, 0, 2)
        V = self.Wv(in_sequence)
        V = V.view(seq_len, self.num_heads, self.head_dim).permute(1, 0, 2)
        
        attn_scores = (torch.matmul(Q, K.transpose(-2, -1)) 
                       / (self.head_dim ** 0.5))
        attn_mask = torch.eq(batch_indices.unsqueeze(1), 
                             batch_indices.unsqueeze(0))
        attn_mask = attn_mask.unsqueeze(0)
        attn_scores = attn_scores.masked_fill(~attn_mask, float("-inf"))
        attn_weights = torch.nn.functional.softmax(attn_scores, dim=-1)
        attn_output = torch.matmul(attn_weights, V)
        attn_output = attn_output.permute(1, 0, 2).contiguous()
        attn_output = attn_output.view(seq_len, self.num_features)
        return self.Wout(attn_output)

... and a class to implement a transformer encoder layer ...

In [11]:
from torch_geometric.nn.norm import LayerNorm

class TransformerEncoderLayer(dl.DeeplayModule):
    """Transformer encoder layer."""
    
    def __init__(self, num_features, num_heads, feedforward_dim, dropout=0.0):
        """Initialize transformer encoder layer."""
        super().__init__()

        self.self_attn = MultiHeadAttentionLayer(num_features, num_heads)
        self.attn_dropout = dl.Layer(torch.nn.Dropout, dropout) 
        self.attn_skip = dl.Add() 
        self.attn_norm = dl.Layer(LayerNorm, num_features, eps=1e-6)

        self.feedforward = dl.Sequential(
            dl.Layer(torch.nn.Linear, num_features, feedforward_dim),
            dl.Layer(torch.nn.ReLU),
            dl.Layer(torch.nn.Linear, feedforward_dim, num_features),
        )
        self.feedforward_dropout = dl.Layer(torch.nn.Dropout, dropout) 
        self.feedforward_skip = dl.Add() 
        self.feedforward_norm = dl.Layer(LayerNorm, num_features, eps=1e-6)

    def forward(self, in_sequence, batch_indices):
        """Refine sequence via attention and feedforward layers."""
        attns = self.self_attn(in_sequence, batch_indices)
        attns = self.attn_dropout(attns)
        attns = self.attn_skip(in_sequence, attns)
        attns = self.attn_norm(attns, batch_indices)

        out_sequence = self.feedforward(attns)
        out_sequence = self.feedforward_dropout(out_sequence)
        out_sequence = self.feedforward_skip(attns, out_sequence)
        out_sequence = self.feedforward_norm(out_sequence, batch_indices)
        
        return out_sequence
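
Because all reviews of a batch share one concatenated sequence, the attention layer relies on a mask so that a token only attends to tokens of its own review. The following dependency-free sketch illustrates the idea for a single row of attention scores; the helper names and the score values are hypothetical, and the real layer performs the same steps on tensors for all heads at once.

```python
import math

def attention_mask(batch_indices):
    """Return mask[i][j] = True when tokens i and j belong to the same review."""
    return [[bi == bj for bj in batch_indices] for bi in batch_indices]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, ignoring masked-out positions."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask_row)]
    max_score = max(masked)  # Finite, since a token always sees itself.
    exps = [math.exp(s - max_score) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Two tokens from review 0 followed by one token from review 1.
batch_indices = [0, 0, 1]
mask = attention_mask(batch_indices)
print(mask[0])  # [True, True, False]

# Hypothetical raw scores for the first token; the third is masked out.
weights = masked_softmax([1.0, 1.0, 5.0], mask[0])
print(weights)  # [0.5, 0.5, 0.0]
```

Setting masked scores to negative infinity before the softmax gives them exactly zero weight, however large the raw score was.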

## Building a Transformer Encoder Model

Build a class to implement a transformer encoder model ...

In [12]:
class TransformerEncoderModel(dl.DeeplayModule):
    """Transformer encoder model."""
    
    def __init__(self, vocab_size, num_features, num_heads, feedforward_dim,
                 num_layers, out_dim, dropout=0.0):
        """Initialize transformer encoder model."""
        super().__init__()
        self.num_features = num_features

        self.embedding = dl.Layer(torch.nn.Embedding, vocab_size, num_features)
        
        self.pos_encoder = dl.IndexedPositionalEmbedding(num_features)
        self.pos_encoder.dropout.configure(p=dropout)

        self.transformer_block = dl.LayerList()
        for _ in range(num_layers):
            self.transformer_block.append(TransformerEncoderLayer(
                num_features, num_heads, feedforward_dim, dropout=dropout,
            ))
            
        self.out_block = dl.Sequential(
            dl.Layer(torch.nn.Dropout, dropout),
            dl.Layer(torch.nn.Linear, num_features, num_features // 2), 
            dl.Layer(torch.nn.ReLU),
            dl.Layer(torch.nn.Linear, num_features // 2, out_dim), 
            dl.Layer(torch.nn.Sigmoid),
        )

    def forward(self, data):
        """Predict sentiment of movie reviews."""
        in_sequence, batch_indices = data["sequence"], data["batch_indices"]

        embeddings = self.embedding(in_sequence) * self.num_features ** 0.5
        pos_embeddings = self.pos_encoder(embeddings, batch_indices)
        
        out_sequence = pos_embeddings
        for transformer_layer in self.transformer_block:
            out_sequence = transformer_layer(out_sequence, batch_indices)

        batch_size = torch.max(batch_indices) + 1
        aggregates = torch.zeros(batch_size, self.num_features, 
                                 device=out_sequence.device) 
        for batch_index in torch.unique(batch_indices):
            mask =  batch_indices == batch_index
            aggregates[batch_index] = out_sequence[mask].mean(dim=0)

        pred_sentiment = self.out_block(aggregates).squeeze()
        return pred_sentiment

... instantiate the transformer encoder model ...

In [13]:
model = TransformerEncoderModel(
    vocab_size=len(vocab), num_features=300, num_heads=12, feedforward_dim=512,
    num_layers=4, out_dim=1, dropout=0.1,
).create()

print(model)

TransformerEncoderModel(
  (embedding): Embedding(19566, 300)
  (pos_encoder): IndexedPositionalEmbedding(
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer_block): LayerList(
    (0-3): 4 x TransformerEncoderLayer(
      (self_attn): MultiHeadAttentionLayer(
        (Wq): Linear(in_features=300, out_features=300, bias=True)
        (Wk): Linear(in_features=300, out_features=300, bias=True)
        (Wv): Linear(in_features=300, out_features=300, bias=True)
        (Wout): Linear(in_features=300, out_features=300, bias=True)
      )
      (attn_dropout): Dropout(p=0.1, inplace=False)
      (attn_skip): Add()
      (attn_norm): LayerNorm(300, affine=True, mode=graph)
      (feedforward): Sequential(
        (0): Linear(in_features=300, out_features=512, bias=True)
        (1): ReLU()
        (2): Linear(in_features=512, out_features=300, bias=True)
      )
      (feedforward_dropout): Dropout(p=0.1, inplace=False)
      (feedforward_skip): Add()
      (feedforward_norm): La
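
Before the output block, the model's forward method averages the token features of each review into a single vector. A plain-Python sketch of that pooling step, with the hypothetical helper `mean_pool` and made-up feature values, might read:

```python
def mean_pool(features, batch_indices):
    """Average the feature vectors of the tokens belonging to each review."""
    num_reviews = max(batch_indices) + 1
    num_features = len(features[0])
    sums = [[0.0] * num_features for _ in range(num_reviews)]
    counts = [0] * num_reviews
    for vector, batch_index in zip(features, batch_indices):
        counts[batch_index] += 1
        for k, value in enumerate(vector):
            sums[batch_index][k] += value
    return [[s / count for s in row] for row, count in zip(sums, counts)]

# Hypothetical 2-dimensional token features for two reviews.
features = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
batch_indices = [0, 0, 1]
pooled = mean_pool(features, batch_indices)
print(pooled)  # [[2.0, 3.0], [10.0, 20.0]]
```

Each review ends up with one fixed-size vector regardless of its length, which is what the final linear layers expect.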

## Loading Pretrained Embeddings

Download the GloVe embeddings ...

In [14]:
import os
from torchvision.datasets.utils import download_url, extract_archive

glove_folder = os.path.join(".", ".glove_cache")
zip_filepath = os.path.join(glove_folder, "glove.42B.300d.zip")
if not os.path.exists(glove_folder):
    os.makedirs(glove_folder, exist_ok=True)
    url = "https://nlp.stanford.edu/data/glove.42B.300d.zip"
    download_url(url, glove_folder)
    extract_archive(zip_filepath, glove_folder)
    os.remove(zip_filepath)

... implement a function to load the GloVe embeddings ...

In [15]:
def load_glove_embeddings(glove_file):
    """Load GloVe embeddings."""
    glove_embeddings = {}
    with open(glove_file, 'r', encoding='utf-8') as file:
        for line in file:
            values = line.split()
            word = values[0]
            glove_embeddings[word] = np.round(
                np.asarray(values[1:], dtype='float32'), decimals=6,
            )
    return glove_embeddings

... implement a function to get GloVe embeddings for a vocabulary ...

In [16]:
def get_glove_embeddings(vocab, glove_embeddings, embed_dim):
    """Get GloVe embeddings for a vocabulary."""
    embeddings = torch.zeros((len(vocab), embed_dim), dtype=torch.float32)
    for i, token in enumerate(vocab):
        embedding = glove_embeddings.get(token)
        if embedding is None:
            embedding = glove_embeddings.get(token.lower())
        if embedding is not None:
            embeddings[i] = torch.tensor(embedding, dtype=torch.float32)
    return embeddings

... add pretrained embeddings ...

In [17]:
glove_file = os.path.join(glove_folder, "glove.42B.300d.txt")
glove_embed, embed_dim = load_glove_embeddings(glove_file), 300

model.embedding.weight.data = \
    get_glove_embeddings(vocab.get_itos(), glove_embed, embed_dim)
model.embedding.weight.requires_grad = False
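
The GloVe text file stores one word per line, followed by the components of its vector separated by spaces. The sketch below parses a tiny in-memory stand-in for such a file and shows the lookup order used above: the exact token first, then its lowercase form, then a zero vector. The three-dimensional content of `toy_file` and the helper names are hypothetical; the real file has 300 dimensions.

```python
import io

def parse_glove_lines(lines):
    """Parse 'word v1 v2 ...' lines into a word -> vector dictionary."""
    embeddings = {}
    for line in lines:
        values = line.split()
        embeddings[values[0]] = [float(v) for v in values[1:]]
    return embeddings

def lookup(token, embeddings, embed_dim):
    """Try the token, then its lowercase form, then fall back to zeros."""
    vector = embeddings.get(token)
    if vector is None:
        vector = embeddings.get(token.lower())
    return vector if vector is not None else [0.0] * embed_dim

# Hypothetical 3-dimensional file content.
toy_file = io.StringIO("movie 0.1 0.2 0.3\ngreat 0.4 0.5 0.6\n")
toy_glove = parse_glove_lines(toy_file)

print(lookup("Movie", toy_glove, 3))  # [0.1, 0.2, 0.3]
print(lookup("zzzz", toy_glove, 3))   # [0.0, 0.0, 0.0]
```

Tokens without a pretrained vector therefore keep an all-zero embedding, and since the embedding weights are frozen above, they stay at zero during training.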

## Training the Model

Compile the model ...

In [18]:
classifier = dl.BinaryClassifier(model=model, 
                                 optimizer=dl.AdamW(lr=1e-4)).create()

... and train it.

In [19]:
trainer = dl.Trainer(epochs=5, accelerator="cpu")
trainer.fit(classifier, train_dataloader, val_dataloader)


## Evaluating the Trained Model

Test the trained model ...

In [20]:
test_results = trainer.test(classifier, test_dataloader)



... and display the model’s predictions on some reviews.

In [21]:
import random

classifier.model.eval()

texts, labels, predictions = [], [], []
for idx in random.sample(range(len(test_dataset)), 3):
    sample = test_dataset[idx]
    input_sequence = torch.Tensor(vocab(tokenize(sample["text"]))).long()
    test_input = {
        "sequence": input_sequence,
        "batch_indices": torch.zeros_like(input_sequence, dtype=torch.long)
    }
    probability = classifier.model(test_input)
    prediction = probability > 0.5
    
    texts.append(sample["text"])
    labels.append(sample["label"])
    predictions.append(int(prediction.item()))

df = pd.DataFrame({"text": texts, "label": labels, "prediction": predictions})
styled_df = df.style.set_properties(**{"text-align": "left"}).set_table_styles(
    [{"selector": "th", "props": [("text-align", "center")]}]
)
with pd.option_context("display.max_colwidth", None):
    display(styled_df)

Unnamed: 0,text,label,prediction
0,"This is the kind of movie that I grew up on. It is great family fun, that the kids love and the parents enjoy as well. I wish more films like this were made. It's a great story about a little boy who raises a Bull with his mother and sister, and shows it all the way up to the National Grand Championship, where he wins! Then he's scared that someone's going to barbecue his bull, so he kidnaps it and heads home with it. I was really excited when I first saw this film in the theater and was surprised to see George Strait, Julia Roberts and Bruce Willis in this little film. The music was great with all kinds of huge country names like Willie Nelson and the Dixie Chicks. Anyone who doesn't enjoy this movie, doesn't have any children or never was a kid them self.",1,1
1,A good entertainment but nothing more : in this western we are between the classics and the spaghetti ones. This provides us a good a conventional story but it's always a pleasure to see Robert Mitchum with his legendary flegma although he isn't as fit as in the forties or the fifties. And don't forget David Carradine is the son of John Carradine,1,1
2,"""Stealing Time"" actually dates back to 2001 when it was mysteriously titled ""Rennie's Landing"". Which explains how director Marc Fusco was able to afford this cast of now established television/movie actors in what is obviously an extremely low budget production. About ten minutes into the film you understand why this thing never got a theatrical release after it made the film festival rounds several years ago. Its recent distribution by Franchise Pictures probably reflects a perception that the rising popularity of certain cast members can be milked to recover some of the modest production costs. Although not a great addition to anyone's resume, young actors have done worse things when they were desperately seeking acting work of any kind. Peter Facinelli, Ethan Embry, Scott Foley and Charlotte Ayanna play college friends who do an early ""Big Chill"" reunion and compare war stories about the failure of reality to measure up to their dreams. Unfortunately nothing else happens, absolutely nothing. Yes Alec (Facinelli) dreams about a liquor store holdup and a bank robbery, which are then ""cheaply and lamely"" staged to completely inappropriate music. It is the least suspenseful bank job since W.C. Fields was the guard in ""The Bank Dick"". If anyone can point to any moment in ""Stealing Time"" where something ""actually"" happens I would like to know about it, because as far as I can tell, not a thing happens in the whole film. Perhaps Fusco, through incessant visual reflections, is trying to say something profound about taking control of one's life before it is too late. Like ""St. Elmo's Fire"" the movie is littered with every profound thought ever uttered by a young adult who has left the ivory tower to experience the real world for the first time. I felt Fusco was going for a kind of Howard Hawks Young Professionals in Action ""Only Angels Have Wings"" motif. Then again, I'm sure I was reading much too much into the film. 
After all, things actually happen Howard Hawks films. Then again, what do I know? I'm only a child.",0,0
