Building a text sentiment analysis model using PyTorch and torchtext on the IMDb dataset of movie reviews. In this notebook we use a recurrent neural network (RNN) to predict the sentiment of a piece of text.

RNN Architecture:
1. An RNN processes a sequence one element at a time. At each step it combines the current input with a hidden state carried over from the previous step, and produces a new hidden state.
2. Here, the RNN takes N inputs (the words of a review, in order). The hidden state produced at one step is fed into the next step together with the next word, and the final hidden state summarizes the whole sequence.
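
The recurrence described above can be sketched in a few lines. This is an illustration with made-up sizes, not part of the notebook's pipeline: it unrolls by hand the update that PyTorch documents for nn.RNN, h_t = tanh(W_ih x_t + b_ih + W_hh h_(t-1) + b_hh), and compares the result with the built-in layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only (assumptions, not the notebook's settings)
seq_len, batch_size, input_dim, hidden_dim = 5, 3, 4, 6

rnn = nn.RNN(input_dim, hidden_dim)
x = torch.randn(seq_len, batch_size, input_dim)

# Built-in layer: output holds every hidden state, hidden holds the last one
output, hidden = rnn(x)

# Manual unrolling: start from a zero hidden state and apply the update per step
h = torch.zeros(batch_size, hidden_dim)
manual_steps = []
with torch.no_grad():
    for t in range(seq_len):
        h = torch.tanh(
            x[t] @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
            + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0
        )
        manual_steps.append(h)
manual_output = torch.stack(manual_steps)

print(torch.allclose(manual_output, output, atol=1e-5))
```

The last row of manual_output is the final hidden state, which is the tensor the classifier in this notebook feeds to its linear layer.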

![](Screenshot%202023-04-22%20at%201.13.40%20PM.png)

Preparing Data:
A central concept in torchtext is the Field. A Field specifies how a piece of data should be processed. For our sentiment analysis task, each example in the IMDb dataset consists of the raw string (the movie review) and the sentiment of that review (positive or negative).

Understanding the technicalities:
1. The parameters of a Field define how the data is processed.
2. For our dataset-
    a) TEXT defines how our raw string will be processed.
    b) LABEL defines how to process the sentiment.

Tokenization - the task of splitting text into discrete 'tokens' so that we can work with textual data.
The TEXT field is created with tokenize = 'spacy', which indicates that we are using the spaCy tokenizer.
The LABEL field handles the sentiment labels.
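
To make the idea of tokenization concrete, here is a rough sketch. It is a regular-expression stand-in, not the spaCy tokenizer used in this notebook, and simple_tokenize is a hypothetical helper introduced only for illustration. It splits words, punctuation marks and the possessive 's into separate tokens, similar in spirit to the example output shown later.

```python
import re

# Match the clitic 's first, then runs of word characters, then single punctuation marks
TOKEN_PATTERN = re.compile(r"'s|\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens (simplified, not spaCy)."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("Cheech and Chong's reality, with SPACE COKE!!")
print(tokens)
```

A real tokenizer handles many more cases (contractions such as "don't", URLs, abbreviations), which is why the notebook relies on spaCy instead.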

In [138]:
pip install -U spacy

Note: you may need to restart the kernel to use updated packages.


In [139]:
!python3 -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [140]:
import torch
from torchtext import data  # Field-based API: torchtext.legacy.data in torchtext 0.9-0.11, removed in 0.12

# Set a random seed for reproducibility
SEED = 42

torch.manual_seed(SEED)

# TEXT tokenizes the reviews with spaCy; LABEL handles the sentiment labels
TEXT = data.Field(tokenize = 'spacy', tokenizer_language = 'en_core_web_sm')
LABEL = data.LabelField(dtype = torch.float)

Importing the Dataset -
The IMDb dataset ships with torchtext (it is downloaded on first use), so we can import it directly.

In [141]:
from torchtext.datasets import IMDB

Splitting the data into train and test datasets

In [142]:
train_data, test_data = IMDB.splits(TEXT, LABEL)

Now we check the number of examples in the training and test sets.
The original dataset is split into 25,000 training and 25,000 test reviews, so we should see 25,000 examples in each split.

In [143]:
print("Number of examples in training dataset : ", len(train_data))
print("Number of examples in test dataset : ", len(test_data))

Number of examples in training dataset :  25000
Number of examples in test dataset :  25000


Let us view an example from our dataset

In [144]:
print(vars(train_data.examples[17]))

{'text': ['A', 'typical', 'romp', 'through', 'Cheech', 'and', 'Chong', "'s", 'reality', 'which', 'includes', 'drugs', ',', 'singing', ',', 'more', 'drugs', ',', 'cars', 'and', 'driving', ',', 'even', 'more', 'drugs', ',', 'Pee', 'Wee', ',', 'aliens', ',', 'gasoline', ',', 'laundry', ',', 'stand', 'up', 'comedy', ',', 'surprisingly', 'more', 'drugs', 'and', 'SPACE', 'COKE', '!', '!', '.', 'It', 'is', 'not', 'as', 'coherent', 'or', 'plausible', 'as', 'Up', 'in', 'Smoke', 'but', 'it', 'still', 'is', 'incredibly', 'funny', ',', 'without', 'becoming', 'as', 'strange', 'as', 'Nice', 'Dreams', '.', 'There', 'are', 'some', 'classic', 'scenes', ',', 'which', 'include', 'the', 'opening', 'scene', 'where', 'they', 'get', 'some', 'gas', 'for', 'their', 'car', 'and', 'the', 'drive', 'to', 'work', '.', 'Also', 'funny', 'is', 'Cheech', "'s", 'song', '(', 'Mexican', '-', 'Americans', ')', 'and', 'Chong', "'s", 'follow', 'up', 'song', '.', 'Another', 'notable', 'scene', 'is', 'the', 'welfare', 'office'

We want to create a validation set to estimate how our model will perform on unseen data. With this estimate we can tune our hyperparameters without touching the test set. Using a validation set therefore helps us choose a model that generalizes better to unseen data.

We split the training data randomly, using the seed defined above for reproducibility. By default, split() keeps 70% of the examples for training and 30% for validation.

In [145]:
import random

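# random.seed() returns None, so seed the generator first and pass its state explicitly
random.seed(SEED)
train_data, valid_data = train_data.split(random_state=random.getstate())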
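
The idea behind a seeded random split can be sketched without torchtext. split_indices below is a hypothetical helper, not torchtext's implementation; the 70/30 ratio is chosen to mirror the split sizes printed in this notebook.

```python
import random

def split_indices(n_examples, train_ratio=0.7, seed=42):
    """Shuffle example indices with a private seeded generator and cut them in two."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # same seed gives the same shuffle
    cut = int(round(n_examples * train_ratio))
    return indices[:cut], indices[cut:]

train_idx, valid_idx = split_indices(25000)
print(len(train_idx), len(valid_idx))
```

Because the generator is seeded, calling split_indices again with the same arguments returns exactly the same partition, which is what makes experiments repeatable.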

In [146]:
print("Number of examples in training dataset : ", len(train_data))
print("Number of examples in validation dataset : ", len(valid_data))
print("Number of examples in test dataset : ", len(test_data))

Number of examples in training dataset :  17500
Number of examples in validation dataset :  7500
Number of examples in test dataset :  25000


Vocabulary -

Our dataset is made up of strings, which we need to convert into numbers and vectors for the model to work with.
The vocabulary is the collection of unique tokens in our dataset, each assigned a specific index. A token can then be represented by a one-hot vector of size 1*N, where N is the number of unique tokens: the vector is 0 everywhere except at the index of the token, where it is 1.
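
As a small illustration of the vocabulary and one-hot idea, the sketch below builds a vocabulary from a two-sentence toy corpus (made up for this example) and encodes one token. The names itos and stoi follow torchtext's naming, but the code is plain Python and not torchtext's implementation.

```python
from collections import Counter

# Toy corpus of already-tokenized sentences (illustrative only)
corpus = [["a", "great", "movie"], ["a", "terrible", "movie"]]

# Count tokens, then order by frequency (ties broken alphabetically)
counts = Counter(token for sentence in corpus for token in sentence)
itos = sorted(counts, key=lambda token: (-counts[token], token))  # index -> token
stoi = {token: index for index, token in enumerate(itos)}         # token -> index

def one_hot(token):
    """Return a list of zeros with a single 1 at the token's index."""
    vector = [0] * len(itos)
    vector[stoi[token]] = 1
    return vector

print(itos)
print(one_hot("great"))
```

With N unique tokens every vector has length N, which is why a vocabulary of 100,000 tokens quickly becomes expensive.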

In [147]:
TEXT.build_vocab(train_data)
LABEL.build_vocab(train_data)

In [148]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

Unique tokens in TEXT vocabulary: 101094
Unique tokens in LABEL vocabulary: 2


We see that there are more than 100,000 unique tokens in our training set. One-hot vectors of that size would be computationally very expensive.
So we limit the vocabulary to the 25,000 most frequent tokens, which gives one-hot vectors of 25,000 dimensions (plus the special tokens described below).
The labels are categorical with two values (positive and negative), which is why the LABEL vocabulary has only two entries.

In [149]:
MAX_VOCAB_SIZE = 25000

TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

In [150]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2


The vocabulary contains two more tokens than max_size.
These tokens are
1.  '<unk>' - the unknown token, used for words not present in the vocab
2.  '<pad>' - the padding token

When we feed sentences to our model, all sentences in a batch must have the same length. Reviews differ in length, so the shorter ones are padded with the '<pad>' token.
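
The effect of a size-limited vocabulary with unknown and padding tokens can be sketched in plain Python. build_vocab and pad_batch are hypothetical helpers written for this illustration, and the toy sentences are made up; torchtext does the equivalent work internally.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"

def build_vocab(sentences, max_size):
    """Keep the max_size most frequent tokens and prepend the two special tokens."""
    counts = Counter(token for sentence in sentences for token in sentence)
    itos = [UNK, PAD] + [token for token, _ in counts.most_common(max_size)]
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi

def pad_batch(sentences, stoi):
    """Map tokens to indices (unknown words -> <unk>) and pad to the longest sentence."""
    max_len = max(len(sentence) for sentence in sentences)
    return [
        [stoi.get(token, stoi[UNK]) for token in sentence]
        + [stoi[PAD]] * (max_len - len(sentence))
        for sentence in sentences
    ]

sentences = [["good", "film"], ["good", "but", "long", "film"]]
itos, stoi = build_vocab(sentences, max_size=2)
batch = pad_batch(sentences, stoi)
print(itos)
print(batch)
```

The vocabulary ends up with max_size + 2 entries, the same pattern as the 25,002 tokens reported above.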

To iterate over the dataset in a training loop we use a BucketIterator. This is a special type of iterator that returns batches in which the examples have similar lengths, minimizing the amount of padding per example.

We want the tensors returned by the iterator to be placed on the GPU when one is available, which is handled using torch.device.
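
Why grouping by length reduces padding can be shown with a toy calculation. This is a simplified sketch of the bucketing idea (sort by length, then cut into batches), not BucketIterator's actual algorithm; the sentence lengths are invented for the example.

```python
def make_batches(sequences, batch_size):
    """Cut a list of sequences into consecutive batches."""
    return [sequences[i:i + batch_size] for i in range(0, len(sequences), batch_size)]

def padding_needed(batches):
    """Total number of pad tokens if each batch is padded to its longest sequence."""
    total = 0
    for batch in batches:
        longest = max(len(sequence) for sequence in batch)
        total += sum(longest - len(sequence) for sequence in batch)
    return total

# Six fake "sentences" with very different lengths
lengths = [3, 50, 4, 48, 5, 52]
sentences = [["tok"] * n for n in lengths]

naive = make_batches(sentences, batch_size=3)                      # original order
bucketed = make_batches(sorted(sentences, key=len), batch_size=3)  # similar lengths together

print(padding_needed(naive), padding_needed(bucketed))
```

Less padding means fewer wasted computations per batch, which matters for a model that processes every position of every sequence.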

In [151]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


In [152]:
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)

BUILDING THE MODEL

For this analysis task, we use an RNN model.
The embedding layer transforms our sparse one-hot vector (sparse as most of the elements are 0) into a dense embedding vector (dense as the dimensionality is a lot smaller and all the elements are real numbers).
This embedding layer behaves like a single fully connected layer without a bias applied to the one-hot vector; in practice it is implemented as a lookup table indexed by token id.
As well as reducing the dimensionality of the input to the RNN, there is the theory that words which have similar impact on the sentiment of the review are mapped close together in this dense vector space.

In this way the embeddings can capture which words occur in similar contexts, and therefore some of their semantic and syntactic meaning.
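
The relationship between one-hot vectors and the embedding layer can be checked directly. The sketch below uses a tiny vocabulary and embedding size chosen only for illustration, and shows that looking up rows in nn.Embedding gives the same result as multiplying one-hot vectors by the embedding weight matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Tiny sizes for illustration (the notebook uses 25,002 and 100)
vocab_size, embedding_dim = 10, 4
embedding = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([1, 7, 7, 3])

# Usual path: index lookup into the weight matrix
looked_up = embedding(token_ids)

# Equivalent path: one-hot vectors times the same weight matrix
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding.weight

print(torch.allclose(looked_up, via_matmul))
```

The lookup avoids ever materializing the large one-hot vectors, which is why the model can take token indices as input.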

![](Screenshot%202023-04-22%20at%204.36.26%20PM.png)

1. The forward method is called when we feed our examples into our model.
2. In each batch, text is a tensor of size [sentence length, batch size]. All sentences within a batch are padded to the same length, but the length can differ from batch to batch.
3. The input batch is passed through the embedding layer, which converts each token index into a dense vector.
4. The output of the embedding layer acts as input to the RNN model.
5. The RNN returns 2 tensors, output of size [sentence length, batch size, hidden dim] and hidden of size [1, batch size, hidden dim]. output is the concatenation of the hidden state from every time step, whereas hidden is simply the final hidden state. We verify this using the assert statement. Note the squeeze method, which is used to remove a dimension of size 1.
6. Finally, we feed the last hidden state, hidden, through the linear layer, fc, to produce a prediction.

In [153]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()

        self.embedding = nn.Embedding(input_dim, embedding_dim)

        self.rnn = nn.RNN(embedding_dim, hidden_dim)

        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):

        #text = [sent len, batch size]
        
        embedded = self.embedding(text)
        
        #embedded = [sent len, batch size, emb dim]
        
        output, hidden = self.rnn(embedded)
        
        #output = [sent len, batch size, hid dim]
        #hidden = [1, batch size, hid dim]
        
        assert torch.equal(output[-1,:,:], hidden.squeeze(0))
        return self.fc(hidden.squeeze(0))


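1. We now create an instance of the RNN class.
2. The input dimension is the size of the vocabulary (25,002), i.e. the dimension of the one-hot vectors.
3. The embedding dimension is the size of the dense word vectors produced by the embedding layer, and the hidden dimension is the size of the hidden state.
4. The output dimension is usually the number of classes. For binary classification a single output suffices, so it is 1 here.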
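
A quick way to get a feel for these dimensions is to count trainable parameters. The sketch below rebuilds the three layers with the sizes used in this notebook (vocabulary size 25,002 taken from the output above); count_parameters is a helper name introduced here for illustration.

```python
import torch.nn as nn

def count_parameters(module):
    """Number of trainable scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Same layer sizes as the notebook's model: 25002 -> 100 -> 256 -> 1
layers = nn.ModuleDict({
    "embedding": nn.Embedding(25002, 100),
    "rnn": nn.RNN(100, 256),
    "fc": nn.Linear(256, 1),
})

per_layer = {name: count_parameters(layer) for name, layer in layers.items()}
total = count_parameters(layers)

print(per_layer)
print(total)
```

Almost all of the parameters sit in the embedding layer, since it stores one 100-dimensional vector per vocabulary entry.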

In [154]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

TRAIN THE MODEL

We'll use stochastic gradient descent (SGD) to update the parameters of the model. The first argument is the parameters that will be updated by the optimizer; the second is the learning rate, i.e. how much we'll change the parameters by when we do a parameter update.
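
The update that plain SGD performs (no momentum, no weight decay) is simply parameter = parameter - learning rate * gradient. The sketch below checks this on a two-element toy parameter; the values and the learning rate of 0.1 are chosen for readability and are not the notebook's settings.

```python
import torch

# A toy parameter and a simple loss: sum of squares, so the gradient is 2 * w
w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()
loss.backward()

# What the update should produce: w - lr * grad
expected = w.detach() - 0.1 * w.grad

optimizer.step()

print(w.detach(), expected)
```

With a learning rate as small as the 1e-3 used in this notebook, each step moves the parameters only slightly.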

In [155]:
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=1e-3)

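Now we define the loss function. BCEWithLogitsLoss applies a sigmoid to the raw model outputs (logits) and then computes the binary cross-entropy against the labels.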

In [156]:
criterion = nn.BCEWithLogitsLoss()
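
To see what this criterion computes, the sketch below compares it with a hand-written sigmoid followed by binary cross-entropy on a few made-up logits. It also evaluates the loss of a model that always outputs a logit of 0 (probability 0.5), which comes out as ln 2, roughly 0.693.

```python
import math
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

# Made-up logits and labels for illustration
logits = torch.tensor([2.0, -1.0, 0.0])
targets = torch.tensor([1.0, 0.0, 1.0])

loss = criterion(logits, targets)

# The same quantity written out: sigmoid, then mean binary cross-entropy
probs = torch.sigmoid(logits)
manual = -(targets * torch.log(probs) + (1 - targets) * torch.log(1 - probs)).mean()

# A model that is maximally unsure (logit 0 everywhere) has loss ln 2
chance_loss = criterion(torch.zeros(4), torch.tensor([1.0, 0.0, 1.0, 0.0]))

print(loss.item(), manual.item(), chance_loss.item(), math.log(2))
```

A loss that stays near 0.693 during training is therefore a sign that the model is doing no better than guessing.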

In [157]:
model = model.to(device)
criterion = criterion.to(device)

Now we write a function that computes the accuracy of our model.

In [158]:
def binary_accuracy(preds, y):

    #apply a sigmoid to the logits, then round to the closest integer (0 or 1)
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc
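
A short usage example may help here. It repeats the function above so that it runs on its own, and applies it to four made-up logits of which three are classified correctly.

```python
import torch

def binary_accuracy(preds, y):
    # Sigmoid maps logits to probabilities; rounding gives the predicted class
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)

# Positive logits are rounded to class 1, negative logits to class 0
preds = torch.tensor([2.0, -1.0, 0.5, -3.0])
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])

acc = binary_accuracy(preds, labels)
print(acc.item())
```

The third example has a positive logit but a label of 0, so it counts as the single mistake.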

The train function iterates over the data one batch at a time.

1. For each batch we first zero the gradients. Each parameter in the model has a grad attribute which stores the gradient of the loss with respect to that parameter. PyTorch accumulates gradients across backward passes, so they must be reset.
2. We then feed the batch to the model to get predictions, and the criterion computes the loss, averaged over all examples in the batch.
3. We calculate the gradient of each parameter with loss.backward(), and then update the parameters using the gradients and the optimizer algorithm with optimizer.step().
4. The loss and accuracy are accumulated across the epoch. The .item() method is used to extract a scalar from a tensor which only contains a single value.
5. Finally, we return the loss and accuracy, averaged across the epoch. The len of an iterator is the number of batches in the iterator.
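
The reason for zeroing the gradients in step 1 can be demonstrated on a single scalar parameter. This is a toy sketch, separate from the training function: calling backward() twice without resetting adds the second gradient to the first.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# First backward pass: d(3w)/dw = 3
(3 * w).backward()
first = w.grad.item()

# Second backward pass without zeroing: the new gradient is added to the old one
(3 * w).backward()
accumulated = w.grad.item()

# Zero the gradient, as optimizer.zero_grad() does for every parameter, and try again
w.grad.zero_()
(3 * w).backward()
after_zero = w.grad.item()

print(first, accumulated, after_zero)
```

Without the reset, each update would be based on the sum of all gradients seen so far rather than on the current batch.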

In [159]:
def train(model, iterator, optimizer, criterion):

    epoch_loss = 0
    epoch_acc = 0

    model.train()

    for batch in iterator:
        #Zero the gradients before every iteration, otherwise they accumulate across iterations
        optimizer.zero_grad()

        #Forward pass: compute predictions (raw logits) for the batch
        predictions = model(batch.text).squeeze(1)

        #Compute the loss between the predictions and the true labels
        loss = criterion(predictions, batch.label)

        accuracy = binary_accuracy(predictions, batch.label)

        #Backpropagation: compute the gradient of the loss wrt each parameter
        loss.backward()

        #Update the parameters using the gradients
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += accuracy.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [160]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
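    #Evaluation mode: disables dropout and batch-norm updates (this model has neither, but it is good practice)
    model.eval()
    
    #No gradients are needed for evaluation, which saves memory and time
    with torch.no_grad():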
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

This helper tells us how long each epoch takes.

In [161]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [164]:
N_EPOCHS = 10

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')

    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 5m 29s
	Train Loss: 0.693 | Train Acc: 49.90%
	 Val. Loss: 0.695 |  Val. Acc: 49.43%
Epoch: 02 | Epoch Time: 5m 37s
	Train Loss: 0.693 | Train Acc: 49.55%
	 Val. Loss: 0.695 |  Val. Acc: 48.68%
Epoch: 03 | Epoch Time: 5m 25s
	Train Loss: 0.693 | Train Acc: 49.82%
	 Val. Loss: 0.695 |  Val. Acc: 49.16%
Epoch: 04 | Epoch Time: 5m 29s
	Train Loss: 0.693 | Train Acc: 49.70%
	 Val. Loss: 0.695 |  Val. Acc: 49.38%
Epoch: 05 | Epoch Time: 5m 29s
	Train Loss: 0.693 | Train Acc: 49.70%
	 Val. Loss: 0.695 |  Val. Acc: 49.50%
Epoch: 06 | Epoch Time: 66m 54s
	Train Loss: 0.693 | Train Acc: 50.13%
	 Val. Loss: 0.695 |  Val. Acc: 48.61%
Epoch: 07 | Epoch Time: 5m 26s
	Train Loss: 0.693 | Train Acc: 49.38%
	 Val. Loss: 0.695 |  Val. Acc: 48.61%
Epoch: 08 | Epoch Time: 5m 25s
	Train Loss: 0.693 | Train Acc: 49.60%
	 Val. Loss: 0.695 |  Val. Acc: 49.44%
Epoch: 09 | Epoch Time: 6m 0s
	Train Loss: 0.693 | Train Acc: 49.36%
	 Val. Loss: 0.695 |  Val. Acc: 48.37%
Epoch: 10 | Epoch T

In [165]:
model.load_state_dict(torch.load('tut1-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.703 | Test Acc: 43.85%


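We can see that the RNN does not do well on this task: the loss stays at about 0.693 (ln 2) and the training and validation accuracy stay around 50%, which is chance level for a balanced binary classification problem. A major reason is that each review contains many tokens.
1. Vanishing and Exploding Gradients - The derivative of the loss function with respect to a term at the beginning of the sequence involves many factors in the chain rule. If these factors are smaller than 1, their product shrinks towards zero and learning is very slow; if they are larger than 1, the product can explode.
2. Computational complexity - the hidden state must be computed step by step, and backpropagation through time has to keep the activations of every time step, so the cost grows with the sequence length.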
3. Lack Of Parallelism - RNNs are inherently sequential, which makes it difficult to parallelize the computation. This can limit the speed and scalability of the network.
4. Difficulty In Capturing Long-Term Dependencies - Although RNNs are designed to capture information about past inputs, they can struggle to capture long-term dependencies in the input sequence. This is because the gradients can become very small as they propagate through time, which can cause the network to forget important information.
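
The vanishing-gradient effect can be observed on a small, randomly initialized nn.RNN. The sketch below uses made-up sizes and random inputs rather than the notebook's data: it backpropagates from the final hidden state and compares how much gradient reaches the first time step with how much reaches the last one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes: a sequence of 100 steps, one example per batch
seq_len, batch_size, input_dim, hidden_dim = 100, 1, 8, 32

rnn = nn.RNN(input_dim, hidden_dim)
x = torch.randn(seq_len, batch_size, input_dim, requires_grad=True)

# Backpropagate from the final hidden state only, as the classifier does
_, hidden = rnn(x)
hidden.sum().backward()

# Gradient magnitude that reaches each time step of the input
grad_norms = x.grad.norm(dim=2).squeeze(1)  # shape [seq_len]
first_step = grad_norms[0].item()
last_step = grad_norms[-1].item()

print(f"gradient norm at first step: {first_step:.3e}")
print(f"gradient norm at last step:  {last_step:.3e}")
```

With the default initialization, the gradient reaching the first token is many orders of magnitude smaller than the gradient reaching the last one, so early words in a long review have almost no influence on the parameter updates.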