# Word embedding and RNN for sentiment analysis

The goal of this notebook is to predict whether a written movie
review is positive or negative. To do so we try two models: a simple
linear model on the word embeddings and a recurrent neural network.

In [1]:
from typing import Iterable, List
import appdirs                  # Used to cache pretrained embeddings
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchtext
from torch.nn.utils.rnn import pad_sequence
from torch.optim import Adam, Optimizer
from torch.utils.data import DataLoader
from torchtext import datasets
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
import random

In [2]:
!pip install 'portalocker>=2.0.0'

Collecting portalocker>=2.0.0
  Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)
Installing collected packages: portalocker
Successfully installed portalocker-2.8.2


## The IMDB dataset

In [2]:
torch_cache = appdirs.user_cache_dir('pytorch')
train_iter, test_iter = datasets.IMDB(root=torch_cache, split=('train', 'test'))

TRAIN_SET = list(train_iter)
TEST_SET = list(test_iter)
random.shuffle(TRAIN_SET)
random.shuffle(TEST_SET)

In [3]:
TRAIN_SET[5]

(2,
 "This is such a great movie to watch with young children. I'm always looking for an excuse to watch it over & over. Gena was good, Cheech was fun,the Russian was good, Maria was adorable & of course Paulie was the best!")

## Global variables

First let's define a few variables. `SEQ_LENGTH` is the maximum
length of a sequence, `BATCH_SIZE` is the size of the batches used in
stochastic optimization algorithms and `NUM_EPOCHS` is the number of
times we go through the entire training set during the training
phase. The embedding dimension, that is the dimension of the vector
space used to embed the words of the vocabulary, is not set here: it
is given by the pretrained vectors loaded later.

In [4]:
SEQ_LENGTH = 64
BATCH_SIZE = 512
NUM_EPOCHS = 10

We first need a tokenizer that takes a text and returns a list of
tokens. There are many tokenizers available from other libraries.
Here we use the one that comes with torchtext.

In [5]:
tokenizer = get_tokenizer("basic_english")
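
To make the tokenizer's role concrete, here is a rough stand-in written with a regular expression. It only approximates what `basic_english` does (lowercasing and splitting off punctuation). The name `simple_tokenize` and the regex are assumptions made for illustration, and the real tokenizer may behave differently on edge cases.

```python
import re
from typing import List

# Hypothetical stand-in for illustration; this is not torchtext's implementation.
_TOKEN_RE = re.compile(r"[a-z0-9]+|[.,!?()']")


def simple_tokenize(text: str) -> List[str]:
    """Lowercase the text, then return runs of letters/digits and single punctuation marks."""
    return _TOKEN_RE.findall(text.lower())


print(simple_tokenize("I am Groot!"))
```

The point of the sketch is only that a text goes in and a list of lowercase string tokens comes out, which is the input the vocabulary expects.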

## Building the vocabulary

Then we need to define the set of words that will be understood by
the model: this is the vocabulary. We build it from the training
set.

In [6]:
# Build the vocabulary
def token_generation(data_iter: Iterable) -> Iterable[List[str]]:
    # Each sample is a (label, text) pair; yield the tokens of the text
    for data_sample in data_iter:
        yield tokenizer(data_sample[1])


special_tokens = ["<unk>", "<pad>"]
vocab = build_vocab_from_iterator(
    token_generation(TRAIN_SET),
    min_freq=10,
    specials=special_tokens,
    special_first=True)
UNK_IDX, PAD_IDX = vocab.lookup_indices(special_tokens)
VOCAB_SIZE = len(vocab)
vocab.set_default_index(UNK_IDX)


To limit the number of tokens in the vocabulary, we specified
`min_freq=10`: a token must be seen at least 10 times to be part
of the vocabulary. Consequently some words in the training set (and
in the test set) are not present in the vocabulary. This is why we
set a default index: unknown tokens are mapped to `UNK_IDX`.
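
The effect of `min_freq` and of the default index can be sketched with a toy vocabulary built from `collections.Counter`. The helpers `build_toy_vocab` and `lookup` are hypothetical and only mimic the idea; in particular, tokens are ordered alphabetically here, which is a simplification and not necessarily the ordering torchtext uses.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_toy_vocab(token_lists: Iterable[List[str]], min_freq: int) -> Dict[str, int]:
    """Map each sufficiently frequent token to an index; special tokens come first."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    itos = ["<unk>", "<pad>"] + sorted(t for t, c in counts.items() if c >= min_freq)
    return {token: index for index, token in enumerate(itos)}


def lookup(stoi: Dict[str, int], token: str) -> int:
    """Return the token's index, falling back to the index of <unk>."""
    return stoi.get(token, stoi["<unk>"])


corpus = [["a", "good", "movie"], ["a", "bad", "movie"], ["a", "movie"]]
stoi = build_toy_vocab(corpus, min_freq=2)
print(stoi)                  # "good" and "bad" occur only once, so they are left out
print(lookup(stoi, "good"))  # falls back to the <unk> index
```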

In [None]:
# Without a default index, looking up an unknown word such as vocab['pouet'] raises an error.
# With the default index set, unknown words get index 0 in the vocabulary.
#vocab['vdfbdfbdfbdf']
#vocab['I']
#vocab['am']
#vocab['groot']

## Collate function

The collate function maps raw samples coming from the dataset to
padded tensors of numericalized tokens ready to be fed to the model.

In [7]:
def text_to_tensor_fn(batch: List):
    def text_to_tensor(text):
        tokens = tokenizer(text)[:SEQ_LENGTH]
        return torch.LongTensor(vocab(tokens))

    src_batch = [text_to_tensor(text) for _, text in batch]
    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = torch.Tensor([label - 1 for label, _ in batch])
    return src_batch, tgt_batch


In [8]:
text_to_tensor_fn([
    (1, "i am Groot")
])

# tensor([[ 13], [246], [ 0]]): this is the tensor produced by tokenizing and padding the sequences. Each integer in the tensor is an index in the vocabulary.
# tensor([0.]): this is the label associated with the batch element


(tensor([[ 13],
         [246],
         [  0]]),
 tensor([0.]))
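
The truncate-then-pad step can be sketched on plain lists of indices. This is a minimal illustration, assuming `<pad>` has index 1 (it is the second special token) and using a hypothetical helper `pad_time_major`. Like the output above, the result has one row per time step and one column per sample.

```python
from typing import List

PAD = 1  # assumed index of "<pad>", the second special token


def pad_time_major(sequences: List[List[int]], max_len: int, pad: int = PAD) -> List[List[int]]:
    """Truncate to max_len, pad to the longest sequence, return rows indexed by time step."""
    clipped = [seq[:max_len] for seq in sequences]
    longest = max(len(seq) for seq in clipped)
    padded = [seq + [pad] * (longest - len(seq)) for seq in clipped]
    # Transpose from (batch, time) to (time, batch), the layout shown in the output above.
    return [list(step) for step in zip(*padded)]


batch = pad_time_major([[13, 246, 0], [7, 8, 9, 10, 11], [5]], max_len=4)
print(batch)
```

Note that padding only goes up to the longest sequence in the batch, so a batch made entirely of short texts has fewer than `max_len` rows.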

## Training a linear classifier with an embedding

We first test a simple linear classifier on the word embeddings.

We need to implement an accuracy function to be used in the training
and evaluation functions (see below).

In [9]:
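def accuracy(predictions, labels):
    """Fraction of logits whose thresholded sigmoid matches the 0/1 labels."""
    predicted_positive = torch.sigmoid(predictions) > 0.5
    return torch.sum(predicted_positive == (labels > 0.5)).item() / len(predictions)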
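
As a small worked example of this computation, the sketch below recomputes the accuracy on four hand-picked logits using only the standard library. The names `sigmoid` and `toy_accuracy` and the sample values are made up for illustration.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def toy_accuracy(logits: List[float], labels: List[float]) -> float:
    """Fraction of samples whose thresholded probability matches the 0/1 label."""
    hits = sum((sigmoid(z) > 0.5) == (y > 0.5) for z, y in zip(logits, labels))
    return hits / len(logits)


logits = [2.0, -1.0, 0.3, -0.2]
labels = [1.0, 0.0, 0.0, 1.0]
print(toy_accuracy(logits, labels))  # the first two predictions match, the last two do not

# sigmoid(z) > 0.5 holds exactly when z > 0, so thresholding the logit at 0 is equivalent.
assert all((sigmoid(z) > 0.5) == (z > 0) for z in logits)
```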

Train and test functions

In [10]:
def train_epoch(model: nn.Module, optimizer: Optimizer):
    model.train()
    loss_fn = nn.BCEWithLogitsLoss()
    train_dataloader = DataLoader(TRAIN_SET, batch_size=BATCH_SIZE, collate_fn=text_to_tensor_fn)

    matches = 0
    for sequences, labels in train_dataloader:

        optimizer.zero_grad()
        predictions = model(sequences)
        loss = loss_fn(predictions, labels)
        loss.backward()
        optimizer.step()

        acc = accuracy(predictions, labels)
        matches += len(predictions) * acc

    return matches / len(TRAIN_SET)

In [11]:
def evaluate(model: nn.Module):
    model.eval()
    val_dataloader = DataLoader(TEST_SET, batch_size=BATCH_SIZE, collate_fn=text_to_tensor_fn)

    matches = 0
    # Gradients are not needed for evaluation
    with torch.no_grad():
        for sequences, labels in val_dataloader:
            predictions = model(sequences)
            acc = accuracy(predictions, labels)
            matches += len(predictions) * acc

    return matches / len(TEST_SET)

In [12]:
def train(model, optimizer):
    for epoch in range(1, NUM_EPOCHS + 1):
        train_acc = train_epoch(model, optimizer)
        val_acc = evaluate(model)
        print(
            f"Epoch: {epoch}, "
            f"Train acc: {train_acc:.3f}, "
            f"Val acc: {val_acc:.3f} "
        )

In [13]:
def predict_sentiment(model, sentence):
    "Predict sentiment of given sentence according to model"

    tensor, _ = text_to_tensor_fn([(0, sentence)])
    prediction = model(tensor)
    pred = torch.sigmoid(prediction)
    return pred.item()

In [14]:
# Load a pretrained GloVe embedding (GloVe: Global Vectors for Word Representation).
# The vectors are downloaded on first use and cached in torch_cache.
glove = torchtext.vocab.GloVe(name="6B", dim=100, cache=torch_cache)
# One pretrained vector per vocabulary token, in vocabulary index order
vocab_vectors = glove.get_vecs_by_tokens(vocab.get_itos())

/root/.cache/pytorch/glove.6B.zip: 862MB [02:41, 5.35MB/s]                           
100%|█████████▉| 399999/400000 [00:19<00:00, 20241.68it/s]


In [15]:
class GloVeEmbeddingNet(nn.Module):
    def __init__(self, seq_length, vocab_vectors, freeze=True):
        super().__init__()
        self.seq_length = seq_length
        self.embedding_dim = vocab_vectors.size(1)
        self.embedding = nn.Embedding.from_pretrained(vocab_vectors, freeze=freeze)
        self.l1 = nn.Linear(self.seq_length * self.embedding_dim, 1)

    def forward(self, x):
        embedded = self.embedding(x)
        flatten = embedded.view(-1, self.seq_length * self.embedding_dim)
        return self.l1(flatten).squeeze()

glove_embedding_net1 = GloVeEmbeddingNet(SEQ_LENGTH, vocab_vectors, freeze=True)
optimizer = Adam(glove_embedding_net1.parameters())
train(glove_embedding_net1, optimizer)

Epoch: 1, Train acc: 0.498, Val acc: 0.502 
Epoch: 2, Train acc: 0.570, Val acc: 0.506 
Epoch: 3, Train acc: 0.597, Val acc: 0.504 
Epoch: 4, Train acc: 0.612, Val acc: 0.504 
Epoch: 5, Train acc: 0.625, Val acc: 0.506 
Epoch: 6, Train acc: 0.634, Val acc: 0.505 
Epoch: 7, Train acc: 0.640, Val acc: 0.507 
Epoch: 8, Train acc: 0.645, Val acc: 0.505 
Epoch: 9, Train acc: 0.646, Val acc: 0.503 
Epoch: 10, Train acc: 0.650, Val acc: 0.504 
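
The validation accuracy above stays close to 0.5. One plausible contributor (an observation about tensor layout, not something the notebook states) is that the collate function produces time-major batches of shape `(seq_len, batch)`, while `view(-1, seq_length * embedding_dim)` assumes that each output row already belongs to a single sample. The sketch below uses nested lists as stand-ins for tensors, with an embedding dimension of 1, to show how a row-major reshape of time-major data mixes samples and how transposing first keeps them apart. The helper names are hypothetical.

```python
from typing import List


def flatten(rows: List[List[str]]) -> List[str]:
    """Row-major flattening, the order in which a contiguous reshape reads values."""
    return [value for row in rows for value in row]


def chunk(values: List[str], size: int) -> List[List[str]]:
    """Split a flat list into consecutive rows of the given size."""
    return [values[i:i + size] for i in range(0, len(values), size)]


SEQ_LEN = 3
# Time-major batch of two reviews "a" and "b": one row per time step.
time_major = [["a0", "b0"], ["a1", "b1"], ["a2", "b2"]]

# Reshaping directly into rows of SEQ_LEN mixes the two reviews.
mixed = chunk(flatten(time_major), SEQ_LEN)
print(mixed)

# Transposing to batch-major first keeps each review in its own row.
batch_major = [list(sample) for sample in zip(*time_major)]
separate = chunk(flatten(batch_major), SEQ_LEN)
print(separate)
```

In PyTorch, the analogous change would be to move the batch axis first, for example with `permute(1, 0, 2)` followed by `reshape`, before applying the linear layer. Whether this accounts for the numbers above has not been verified here.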


## Recurrent neural network with frozen pretrained embedding

In [16]:
class RNN(nn.Module):
    def __init__(self, hidden_size, vocab_vectors, freeze=True):
        super(RNN, self).__init__()
        self.embedding = nn.Embedding.from_pretrained(vocab_vectors, freeze=freeze)
        self.embedding_size = self.embedding.embedding_dim
        self.input_size = self.embedding_size
        self.hidden_size = hidden_size
        self.gru = nn.GRU(input_size=self.input_size, hidden_size=self.hidden_size)
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, x, h0=None):
        if h0 is None:
            batch_size = x.size(1)
            h0 = torch.zeros(self.gru.num_layers, batch_size, self.hidden_size)

        embedded = self.embedding(x)
        output, hidden = self.gru(embedded, h0)
        return self.linear(hidden).squeeze()


rnn = RNN(hidden_size=100, vocab_vectors=vocab_vectors)
optimizer = optim.Adam(filter(lambda p: p.requires_grad, rnn.parameters()), lr=0.001)
train(rnn, optimizer)

Epoch: 1, Train acc: 0.571, Val acc: 0.666 
Epoch: 2, Train acc: 0.706, Val acc: 0.706 
Epoch: 3, Train acc: 0.729, Val acc: 0.731 
Epoch: 4, Train acc: 0.745, Val acc: 0.744 
Epoch: 5, Train acc: 0.760, Val acc: 0.756 
Epoch: 6, Train acc: 0.772, Val acc: 0.763 
Epoch: 7, Train acc: 0.778, Val acc: 0.770 
Epoch: 8, Train acc: 0.785, Val acc: 0.772 
Epoch: 9, Train acc: 0.792, Val acc: 0.773 
Epoch: 10, Train acc: 0.796, Val acc: 0.774 


## Test function


In [20]:
sentence_bad = "Whispers of Destiny' disappoints on multiple levels. The plot feels convoluted and overstretched, leaving audiences confused rather than engaged. Despite a talented cast, the characters lack depth, making it hard to invest in their journey. The CGI, particularly in crucial scenes, falls short, distracting from any potential emotional impact. The film's potential is overshadowed by poor execution, resulting in a forgettable and underwhelming cinematic experience."
sentence_good = "Intriguing from start to finish, 'Whispers of Destiny' captivates with its brilliant storytelling and exceptional performances. The cinematography beautifully enhances the magical world, while the soundtrack complements the emotional depth. Each twist keeps you on the edge, and the climax is both satisfying and surprising. A true cinematic gem that lingers in your thoughts, leaving a lasting impression."
prediction_rnn1 = predict_sentiment(rnn, sentence_good)
prediction_rnn2 = predict_sentiment(rnn, sentence_bad)
prediction_glove1 = predict_sentiment(glove_embedding_net1, sentence_good)
prediction_glove2 = predict_sentiment(glove_embedding_net1, sentence_bad)
print(
            f"RNN1 Model Prediction: {prediction_rnn1} \n"
            f"RNN2 Model Prediction: {prediction_rnn2} \n"
            f"Glove1 Model Prediction: {prediction_glove1} \n"
            f"Glove2 Model Prediction: {prediction_glove2} "
      )


RNN1 Model Prediction: 0.9861965775489807 
RNN2 Model Prediction: 0.1494850218296051 
Glove1 Model Prediction: 0.6458622813224792 
Glove2 Model Prediction: 0.3467015326023102 
