# IMDB sentiment analysis with RNNs

## Kaggle: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

## Alessandro Castelli Kaggle Code: https://www.kaggle.com/code/alessandromajumba/sentiment-analysis-on-imdb-dataset-castelli

## Alessandro Castelli WANDB: https://wandb.ai/ales-2000-09/IMDB%20Dataset%20of%2050K%20Movie%20Reviews?workspace=user-ales-2000-09

In [79]:
import pandas as pd
import numpy as np
import re

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from spellchecker import SpellChecker
from tqdm import tqdm
# enables progress bars for pandas operations (progress_apply), useful for long-running steps
tqdm.pandas()
from collections import Counter
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader
# !pip install --upgrade pandas numpy nltk scikit-learn pyspellchecker tqdm torch

Read the dataset and observe the first 5 rows.

In [80]:
data = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


Luckily, the dataset is perfectly balanced: 25,000 positive and 25,000 negative reviews.

In [81]:
data.sentiment.value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

Transform the labels to 0 and 1.

In [82]:
def transform_label(label):
    return 1 if label == 'positive' else 0


data['label'] = data['sentiment'].progress_apply(transform_label)

100%|██████████| 50000/50000 [00:00<00:00, 466871.03it/s]


## Preprocessing

- In classic NLP, the text is often preprocessed to remove tokens that might confuse the classifier
- Below you can find some examples of possible preprocessing techniques
- Feel free to modify them to improve the results of your classifier

In [83]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')
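# NOTE: this rebinds the name `stopwords` from the NLTK corpus module to a set,
# so re-running this cell without re-importing raises an AttributeError
stopwords = set(stopwords.words('english'))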

def rm_link(text):
    return re.sub(r'http\S+', '', text)


# handles cases like "shut up okay?Im only 10 years old",
# which becomes "shut up okay Im only 10 years old"
def rm_punct2(text):
    return re.sub(r'[\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~]', ' ', text)
    # return re.sub(r'[\"\#\$\%\&\'\(\)\*\+\/\:\;\<\=\>\@\[\\\]\^\_\`\{\|\}\~]', ' ', text)


def rm_html(text):
    # remove HTML tags; the non-greedy pattern also matches <br /> tags,
    # so no separate pass for them is needed
    return re.sub(r'<.*?>', '', text)


def space_bt_punct(text):
    pattern = r'([.,!?-])'
    s = re.sub(pattern, r' \1 ', text)  # add whitespaces between punctuation
    s = re.sub(r'\s{2,}', ' ', s)  # remove double whitespaces
    return s


def rm_number(text):
    return re.sub(r'\d+', '', text)


def rm_whitespaces(text):
    return re.sub(r'\s+', ' ', text)


def rm_nonascii(text):
    return re.sub(r'[^\x00-\x7f]', r'', text)


def rm_emoji(text):
    emojis = re.compile(
        '['
        u'\U0001F600-\U0001F64F'  # emoticons
        u'\U0001F300-\U0001F5FF'  # symbols & pictographs
        u'\U0001F680-\U0001F6FF'  # transport & map symbols
        u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
        u'\U00002702-\U000027B0'
        u'\U000024C2-\U0001F251'
        ']+',
        flags=re.UNICODE
    )
    return emojis.sub(r'', text)


def spell_correction(text):
    # spell correction is disabled by default because it can be very slow;
    # remove this early return to enable the code below
    return text
    # https://pypi.org/project/pyspellchecker/
    spell = SpellChecker()
    corrected_text = []
    misspelled_words = spell.unknown(text.split())
    for word in text.split():
        if word in misspelled_words:
            candidate = spell.correction(word)
            if candidate is not None:
                corrected_text.append(candidate)
            else:
                corrected_text.append(word)
        else:
            corrected_text.append(word)
    return ' '.join(corrected_text)


def clean_pipeline(text):
    text = text.lower()
    no_link = rm_link(text)
    no_html = rm_html(no_link)
    space_punct = space_bt_punct(no_html)
    no_punct = rm_punct2(space_punct)
    no_number = rm_number(no_punct)
    no_whitespaces = rm_whitespaces(no_number)
    no_nonasci = rm_nonascii(no_whitespaces)
    no_emoji = rm_emoji(no_nonasci)
    spell_corrected = spell_correction(no_emoji)
    return spell_corrected

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Let's clean the reviews first:

In [84]:
data['review'] = data['review'].progress_apply(clean_pipeline)

100%|██████████| 50000/50000 [00:21<00:00, 2342.64it/s]


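We now tokenize the reviews, remove stopwords (e.g. the, a, an), and lemmatize the remaining words (e.g. movies -> movie). Note that `WordNetLemmatizer.lemmatize` treats every token as a noun unless a part-of-speech tag is passed, so verb and adjective forms such as running or better are left unchanged here.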

In [85]:
# preprocessing
def tokenize(text):
    return word_tokenize(text)


def rm_stopwords(text):
    return [i for i in text if i not in stopwords]


def lemmatize(text):
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t) for t in text]
    # make sure the lemmas do not contain stopwords
    return rm_stopwords(lemmas)


def preprocess_pipeline(text):
    tokens = tokenize(text)
    no_stopwords = rm_stopwords(tokens)
    lemmas = lemmatize(no_stopwords)
    return ' '.join(lemmas)

In [86]:
data['review'] = data['review'].progress_apply(preprocess_pipeline)

100%|██████████| 50000/50000 [01:55<00:00, 432.25it/s]


Let's check the result.

In [87]:
data.head()

Unnamed: 0,review,sentiment,label
0,one reviewer mentioned watching oz episode hoo...,positive,1
1,wonderful little production filming technique ...,positive,1
2,thought wonderful way spend time hot summer we...,positive,1
3,basically family little boy jake think zombie ...,negative,0
4,petter mattei love time money visually stunnin...,positive,1


## Embedding

- Neural networks cannot process raw text directly
- Input tokens must first be mapped to integers using a vocabulary
- In this example we build the vocabulary manually; the model later turns these integer ids into dense vectors with an [embedding layer](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html)

In [88]:
# get all processed reviews
reviews = data.review.values
# merge into single variable, separated by whitespaces
words = ' '.join(reviews)
# obtain list of words
words = words.split()
# build vocabulary
counter = Counter(words)
# only keep top 2000 words
vocab = sorted(counter, key=counter.get, reverse=True)[:2000]
int2word = dict(enumerate(vocab, 2))
int2word[0] = '<PAD>'
int2word[1] = '<UNK>'
word2int = {word: idx for idx, word in int2word.items()}

In [89]:
reviews_enc = [[word2int[word] if word in word2int else word2int['<UNK>'] for word in review.split()] for review in tqdm(reviews, desc='encoding')]

encoding: 100%|██████████| 50000/50000 [00:01<00:00, 26242.55it/s]
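
The vocabulary and encoding steps are easier to follow on a tiny corpus. The sketch below repeats the same logic on three made-up reviews and keeps only the 2 most frequent words instead of 2000; `toy_reviews` and `unk` are illustrative names.

```python
from collections import Counter

toy_reviews = ['great movie great cast', 'boring movie', 'great plot']

# count every word across all reviews
counter = Counter(' '.join(toy_reviews).split())
# keep only the 2 most frequent words (the notebook keeps 2000)
vocab = sorted(counter, key=counter.get, reverse=True)[:2]

# ids 0 and 1 are reserved for the special tokens
int2word = dict(enumerate(vocab, 2))
int2word[0], int2word[1] = '<PAD>', '<UNK>'
word2int = {word: idx for idx, word in int2word.items()}

# words outside the vocabulary fall back to <UNK>
unk = word2int['<UNK>']
encoded = [[word2int.get(w, unk) for w in review.split()] for review in toy_reviews]

print(word2int)  # {'great': 2, 'movie': 3, '<PAD>': 0, '<UNK>': 1}
print(encoded)   # [[2, 3, 2, 1], [1, 3], [2, 1]]
```

Every word that falls outside the top-k list shares the single `<UNK>` id, so the vocabulary size trades coverage for a smaller embedding table.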


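To build batches, all reviews must have the same length, so we pad shorter reviews with the `<PAD>` token and truncate longer ones to `seq_length` tokens.
**Because the classifier reads the RNN's final hidden state, we pad on the left rather than on the right, so that the last time steps contain real tokens instead of padding.**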

In [90]:
# left padding sequences
def pad_features(reviews, pad_id, seq_length=128):
    # features = np.zeros((len(reviews), seq_length), dtype=int)
    features = np.full((len(reviews), seq_length), pad_id, dtype=int)

    for i, row in enumerate(reviews):
        start_index = max(0, seq_length - len(row))
        # if len(row) > seq_length, only the first seq_length tokens are kept
        features[i, start_index:] = np.array(row)[:min(seq_length, len(row))]

    return features


seq_length = 128
features = pad_features(reviews_enc, pad_id=word2int['<PAD>'], seq_length=seq_length)
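
As a quick illustration of left padding and truncation, the following sketch mirrors the behaviour of `pad_features` for a single review using plain Python lists. `left_pad` is a hypothetical helper written for this example, not a function from the notebook.

```python
def left_pad(row, pad_id, seq_length):
    """Left-pad (or truncate) one list of token ids to seq_length."""
    kept = row[:seq_length]  # keep at most the first seq_length tokens
    return [pad_id] * (seq_length - len(kept)) + kept

print(left_pad([5, 8, 3], pad_id=0, seq_length=5))           # [0, 0, 5, 8, 3]
print(left_pad([5, 8, 3, 9, 4, 7], pad_id=0, seq_length=5))  # [5, 8, 3, 9, 4]
```

Short reviews end with their real tokens, while long reviews lose everything after the first `seq_length` tokens.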

## Split the data

In [91]:
labels = data.label.to_numpy()

# train test split
train_size = .75  # we will use 75% of whole data as train set
val_size = .5  # and we will use 50% of test set as validation set

# stratify will make sure that train and test set have same distribution of labels
train_x, test_x, train_y, test_y = train_test_split(features, labels, test_size=1 - train_size, stratify=labels)

# split test set into validation and test set
val_x, test_x, val_y, test_y = train_test_split(test_x, test_y, test_size=val_size, stratify=test_y)
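
The two fractions translate into concrete set sizes. This is only a back-of-the-envelope sketch of the arithmetic; it does not call scikit-learn, and `n_holdout` is an illustrative name. The batch size of 128 is the value used for the data loaders.

```python
import math

n_total = 50_000
train_size, val_size, batch_size = 0.75, 0.5, 128

n_holdout = round(n_total * (1 - train_size))  # rows set aside by the first split
n_train = n_total - n_holdout
n_test = round(n_holdout * val_size)           # second split uses test_size=val_size
n_val = n_holdout - n_test

print(n_train, n_val, n_test)           # 37500 6250 6250
print(math.ceil(n_train / batch_size))  # 293
```

The result is a 75% / 12.5% / 12.5% split, with 293 training batches per epoch at a batch size of 128.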

In [92]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("Label")

import wandb
wandb.login(key=secret_value_0)
wandb.init(project='IMDB Dataset of 50K Movie Reviews', save_code=True)

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Define the datasets and dataloaders.

In [93]:
# define batch size
batch_size = 128

# create tensor datasets
train_dataset = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_dataset = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_dataset = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# create dataloaders
train_loader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_dataset, shuffle=False, batch_size=batch_size)  # no need to shuffle for evaluation
test_loader = DataLoader(test_dataset, shuffle=False, batch_size=batch_size)

Define the model.

In [94]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, output_size, num_layers=2):
        # Initialize the RNN model with specified parameters
        super(RNN, self).__init__()
        # Embedding layer to convert input indices to dense vectors
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # RNN layer with specified parameters
        self.rnn = nn.RNN(embed_size, hidden_size, num_layers=num_layers, batch_first=True)
        # Fully connected layer 
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Apply the embedding layer to convert input indices to dense vectors
        x = self.embedding(x)
        # Pass the embedded input through the RNN layer
        rnn_out, h_n = self.rnn(x)
        # h_n is a tensor of shape (num_layers, batch_size, hidden_size);
        # h_n[-1] is the final hidden state of the last layer, shape (batch_size, hidden_size)
        h_n = h_n[-1]
        # Pass the final hidden state through the fully connected layer
        output = self.fc(h_n)
        return output
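
The tensor shapes returned by `nn.RNN` are worth checking once by hand. The sketch below uses arbitrary illustration sizes (8 input features, 4 hidden units, 5 time steps, a batch of 1) and shows that `h_n[-1]` already has shape `(batch_size, hidden_size)`. Applying `.squeeze(0)` to it only has an effect when the batch size is 1, in which case it drops the batch dimension.

```python
import torch
from torch import nn

rnn = nn.RNN(input_size=8, hidden_size=4, num_layers=3, batch_first=True)
x = torch.randn(1, 5, 8)  # (batch_size, seq_length, input_size)

out, h_n = rnn(x)
print(out.shape)                  # torch.Size([1, 5, 4]): last-layer output at every step
print(h_n.shape)                  # torch.Size([3, 1, 4]): final state of each layer
print(h_n[-1].shape)              # torch.Size([1, 4])
print(h_n[-1].squeeze(0).shape)   # torch.Size([4]): the batch dimension is gone

# for a unidirectional RNN, the last time step of `out` equals h_n[-1]
print(torch.allclose(out[:, -1, :], h_n[-1]))  # True
```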


Instantiate the model.

In [95]:
# Define the layer dimensions
vocab_size = len(word2int)  # The number of unique words in your vocabulary
embed_size = 256  # Dimension of the word embeddings
hidden_size = 128  # Number of features in the hidden state
output_size = 1  # Dimension of the output, e.g., for a binary classification problem

# Configuration for WandB
config = wandb.config
config.vocab_size = vocab_size  # Number of words in the vocabulary
config.embed_size = embed_size  # Dimension of the word embeddings
config.hidden_size = hidden_size  # Dimension of the hidden state
config.output_size = output_size  # Dimension of the output


In [96]:
# Create an instance of the model
model = RNN(vocab_size, embed_size, hidden_size, output_size, num_layers=3)

device = 'cuda' if torch.cuda.is_available() else 'cpu'

if device == 'cuda':
    model = torch.nn.DataParallel(model)

model.to(device)
# Print the structure of the model
print(model)

DataParallel(
  (module): RNN(
    (embedding): Embedding(2002, 256)
    (rnn): RNN(256, 128, num_layers=3, batch_first=True)
    (fc): Linear(in_features=128, out_features=1, bias=True)
  )
)


Define the loss function and optimizer.

In [97]:
import torch.optim as optim

# Define the loss function
criterion = nn.BCEWithLogitsLoss()

# Define the optimizer
lr = 0.000005
wandb.log({"LR": lr})

optimizer = optim.Adam(model.parameters(), lr=lr)
#optimizer = optim.RMSprop(model.parameters(), lr=lr, alpha=0.9)
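
`BCEWithLogitsLoss` takes raw logits and combines the sigmoid with binary cross-entropy. The hand-written function below is a sketch of that computation for a single example in a commonly used numerically stable form; it is not PyTorch's implementation, and `bce_with_logits` is an illustrative name. It also shows why an untrained model starts near a loss of ln 2 ≈ 0.693, which is close to the values logged in the first epochs.

```python
import math

def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit, in a numerically stable form."""
    # equivalent to -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# a logit of 0 means "50% positive", whatever the true label is
print(round(bce_with_logits(0.0, 1.0), 4))  # 0.6931
print(round(bce_with_logits(0.0, 0.0), 4))  # 0.6931

# a confident, correct prediction gives a small loss
print(round(bce_with_logits(2.0, 1.0), 4))  # 0.1269
```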

Define the training loop.

In [98]:
# Number of epochs
num_epochs = 256

for epoch in range(num_epochs):
    # Set the model to training mode
    model.train()
    
    # Variable for total loss in each epoch
    total_loss = 0.0
    
    # Iterate through the training data
    for inputs, labels in train_loader:
        # Zero the gradients
        optimizer.zero_grad()
        
        inputs = inputs.to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(inputs)
        
        # Reshape the labels
        labels = labels.view(-1, 1)  # Change dimensions to [batch_size, 1]
            
        # Compute the loss
        loss = criterion(outputs, labels.float())  # Convert labels to float
        
        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        
        # Update the total loss
        total_loss += loss.item()
    
    # Calculate the average loss per epoch
    Training_Loss = total_loss / len(train_loader)
    
    # Print the average loss per epoch during training
    print(f'Epoch [{epoch+1}/{num_epochs}], Training Loss: {Training_Loss:.4f}')
    
    # Set the model to evaluation mode
    model.eval()
    
    # Variables for total loss and number of correct predictions
    total_loss = 0.0
    correct_predictions = 0

    # Iterate through the validation data
    with torch.no_grad():  # Disable gradient computation during evaluation
        for inputs, labels in valid_loader:
            
            inputs = inputs.to(device)
            labels = labels.to(device)
            
            # Forward pass
            outputs = model(inputs)
            
            # Reshape the labels
            labels = labels.view(-1, 1)  # Change dimensions to [batch_size, 1]
                
            # Compute the loss
            loss = criterion(outputs, labels.float())  # Convert labels to float
            
            # Update the total loss
            total_loss += loss.item()
            
            # Calculate the number of correct predictions
            threshold = 0.5
            predicted_labels = (torch.sigmoid(outputs) > threshold).float()
            correct_predictions += (predicted_labels == labels.float()).sum().item()

    
    # Calculate the average loss per epoch during evaluation
    average_loss = total_loss / len(valid_loader)
    
    # Calculate accuracy
    accuracy = correct_predictions / len(valid_loader.dataset)
    
    # Print evaluation metrics
    print(f'Epoch [{epoch+1}/{num_epochs}], Validation Loss: {average_loss:.4f}, Validation Accuracy: {accuracy:.4f}')
    
    # Log metrics using WandB
    wandb.log({"Epoch": epoch+1,"Training Loss": Training_Loss, "Validation Loss": average_loss, "Validation Accuracy": accuracy})

# Print a completion message
print('Training and Validation completed!')

Epoch [1/256], Training Loss: 0.6951
Epoch [1/256], Validation Loss: 0.6954, Validation Accuracy: 0.5022
Epoch [2/256], Training Loss: 0.6926
Epoch [2/256], Validation Loss: 0.6936, Validation Accuracy: 0.4965
Epoch [3/256], Training Loss: 0.6908
Epoch [3/256], Validation Loss: 0.6921, Validation Accuracy: 0.5154
Epoch [4/256], Training Loss: 0.6890
Epoch [4/256], Validation Loss: 0.6907, Validation Accuracy: 0.5288
Epoch [5/256], Training Loss: 0.6872
Epoch [5/256], Validation Loss: 0.6893, Validation Accuracy: 0.5344
Epoch [6/256], Training Loss: 0.6851
Epoch [6/256], Validation Loss: 0.6875, Validation Accuracy: 0.5432
Epoch [7/256], Training Loss: 0.6826
Epoch [7/256], Validation Loss: 0.6853, Validation Accuracy: 0.5491
Epoch [8/256], Training Loss: 0.6793
Epoch [8/256], Validation Loss: 0.6823, Validation Accuracy: 0.5586
Epoch [9/256], Training Loss: 0.6749
Epoch [9/256], Validation Loss: 0.6783, Validation Accuracy: 0.5669
Epoch [10/256], Training Loss: 0.6683
Epoch [10/256], V

Evaluate the model on the test set.

In [99]:
# Set the model to evaluation mode
model.eval()

# Variables for total loss and number of correct predictions
total_loss = 0.0
correct_predictions = 0

# Iterate through the test data
with torch.no_grad():  # Disable gradient computation during evaluation
    for inputs, labels in test_loader:
        inputs = inputs.to(device)
        labels = labels.to(device)
        # Forward pass
        outputs = model(inputs)
        
        # Reshape the model's output
        outputs = outputs.view(-1)  # Change dimensions from [batch_size, 1] to [batch_size]
        
        # Compute the loss
        loss = criterion(outputs, labels.float())  # Convert labels to float
        
        # Update the total loss
        total_loss += loss.item()
        
        # Calculate the number of correct predictions
        threshold = 0.5
        predicted_labels = (torch.sigmoid(outputs) > threshold).float()
        correct_predictions += (predicted_labels == labels.float()).sum().item()
        
# Calculate the average loss for the test set
average_loss = total_loss / len(test_loader)

# Calculate accuracy on the test set
accuracy = correct_predictions / len(test_loader.dataset)

print(f'Test Loss: {average_loss:.4f}, Test Accuracy: {accuracy:.4f}')

# Log metrics using WandB
wandb.log({"Test Loss": average_loss, "Test Accuracy": accuracy})


Test Loss: 0.3586, Test Accuracy: 0.8507
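
One detail of the reported loss: dividing the summed batch losses by the number of batches gives a mean of batch means, which differs slightly from the per-sample mean whenever the last batch is smaller (here 6250 test rows make 48 full batches of 128 plus one batch of 106). The sketch below contrasts the two on made-up `(mean loss, batch size)` pairs; both function names are hypothetical.

```python
def mean_of_batch_means(batches):
    """Average the per-batch mean losses, ignoring batch sizes."""
    return sum(loss for loss, _ in batches) / len(batches)

def per_sample_mean(batches):
    """Weight each batch mean by its size to get the mean over samples."""
    total = sum(loss * n for loss, n in batches)
    return total / sum(n for _, n in batches)

# hypothetical (mean loss, batch size) pairs; the last batch is smaller
batches = [(0.30, 128), (0.40, 128), (0.90, 64)]

print(round(mean_of_batch_means(batches), 4))  # 0.5333
print(round(per_sample_mean(batches), 4))      # 0.46
```

With 49 batches of nearly equal size the gap is small in practice, but weighting by batch size removes it entirely.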


In [100]:
wandb.finish()




Run history:
Epoch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
LR,▁
Test Accuracy,▁
Test Loss,▁
Training Loss,██▆▆▅▅▅▄▄▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁
Validation Accuracy,▁▂▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇███████████████████████
Validation Loss,██▆▅▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
Epoch,256.0
LR,1e-05
Test Accuracy,0.85072
Test Loss,0.35856
Training Loss,0.26377
Validation Accuracy,0.83632
Validation Loss,0.39418
