<a href="https://colab.research.google.com/github/ISE-CS4445-AI/CS4445-AI-Practice/blob/main/Week-5_RNNs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 5 Exercise Notebook:
## Practical RNN Implementation on IMDB Sentiment Analysis with PyTorch

In this notebook, we will:
- Load the IMDB dataset (movie reviews labeled as positive/negative).
- Preprocess the text data (tokenization, vocabulary building, numericalization, and padding).
- Build a simple RNN (using an LSTM cell) for sentiment classification.
- Train the model on a small subset (for demonstration) and evaluate predictions.
- Use the trained model to make a fun prediction on a custom review.

> **Note:** Training on the full IMDB dataset (25,000 training reviews) takes a while. For demonstration, we use a random sample of 1,000 training reviews.

---
## Initial Setup & Importing Libraries

In [None]:
!pip uninstall -y torchdata torchtext
!pip install torchdata==0.6.1 torchtext==0.15.2

In [None]:
!pip install portalocker==2.7.0

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab
import random
import portalocker

# Set random seeds for reproducibility
torch.manual_seed(42)
random.seed(42)

## Loading and exploring the IMDB dataset

But first, some background about text preprocessing.

## Introduction to Text Preprocessing

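In this section, we prepare the raw text data for our RNN model. The IMDB dataset contains movie reviews along with their sentiment labels (positive or negative). With the torchtext version pinned above, the labels arrive as the integers `2` (positive) and `1` (negative); older releases used the strings "pos" and "neg". Before feeding the data into our model, we need to perform several key steps: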

- **Tokenization:**  
  Splitting the text into individual words or tokens. We use a basic English tokenizer from torchtext.

- **Vocabulary Building:**  
  Creating a mapping from words (tokens) to unique integer indices.  
  This allows us to convert each review into a sequence of numbers.  
  We also add special tokens like `<unk>` (for unknown words) and `<pad>` (for padding shorter sequences).

- **Pipeline Creation:**  
  Writing functions to transform raw text into numerical token IDs and mapping labels to integers.  
  These pipelines standardize the data input before it reaches the model.


In [None]:
# Download the IMDB dataset using torchtext.
# Each example is a tuple: (label, text)
train_iter = list(IMDB(split='train'))
test_iter = list(IMDB(split='test'))

# For quick demonstration, we will use a small subset of the training data.
# You can adjust the subset size if needed.
subset_size = 1000
train_data = random.sample(train_iter, subset_size)

print(f"Total training examples (subset): {len(train_data)}")
print("Example:", train_data[0])

## Tokenization and Building the Vocabulary

We begin by defining a tokenizer using torchtext’s `get_tokenizer` with the `basic_english` setting. This simple tokenizer splits the text on whitespace and punctuation, converting everything to lowercase.

Next, we build the vocabulary. Rather than manually updating a Counter, we use an iterator approach:

- **Yield Tokens:**  
  A helper function (`yield_tokens`) goes through each review and yields the tokenized words.  
  This is efficient for large datasets.

- **Special Tokens:**  
  We include `<unk>` for words that are not in our vocabulary and `<pad>` to pad sequences to the same length in a batch.

- **Building the Vocab:**  
  We use `build_vocab_from_iterator` to create the vocabulary from our tokens.  
  We also set a default index using `vocab.set_default_index` so that any word not found in the vocab is mapped to `<unk>`.

After the vocabulary is built, we define two pipelines:

- **Text Pipeline:**  
  Converts each review (a string) into a list of integers representing token IDs.

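- **Label Pipeline:**  
  Maps the positive label to `1` and the negative label to `0`. torchtext 0.15 encodes positive as the integer `2` and negative as `1`, while older releases used the strings "pos" and "neg".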

These transformations allow our model to work with numerical data rather than raw text.
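To make the mechanics concrete, here is a minimal pure-Python sketch of the same steps on two toy sentences. It does not use torchtext: `simple_tokenize`, `build_vocab`, and `encode` are hypothetical stand-ins for `get_tokenizer`, `build_vocab_from_iterator`, and `text_pipeline`, and both the regex and the index ordering are illustrative assumptions rather than torchtext's exact behavior.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Rough stand-in for a basic English tokenizer: lowercase, then split
    into word tokens and single punctuation marks."""
    return re.findall(r"[a-z0-9']+|[.,!?;]", text.lower())

def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Map each token to an integer ID, reserving the first IDs for specials."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent tokens first; ties broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}

reviews = ["A great movie!", "Not a great ending."]
stoi = build_vocab(simple_tokenize(r) for r in reviews)

def encode(text):
    """Convert text to token IDs, falling back to <unk> for unseen words."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in simple_tokenize(text)]

print(stoi)
print(encode("A great film!"))  # "film" was never seen, so it maps to <unk>
```

The fallback in `encode` plays the role that `vocab.set_default_index` plays in the notebook: any word missing from the vocabulary gets the `<unk>` ID instead of raising an error.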


In [None]:
from collections import Counter
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# Define a tokenizer using torchtext's basic_english tokenizer.
tokenizer = get_tokenizer('basic_english')

# Build an iterator that yields tokens from the training data.
def yield_tokens(data_iter):
    for label, text in data_iter:
        yield tokenizer(text)

# Assume train_data is already defined (a list of (label, text) tuples).
# Build the vocabulary from the training data iterator.
specials = ['<unk>', '<pad>']
vocab = build_vocab_from_iterator(yield_tokens(train_data), specials=specials)

# Set the default index for unknown tokens.
vocab.set_default_index(vocab['<unk>'])

print(f"Vocabulary size: {len(vocab)}")

# Define pipelines to convert text to token IDs and labels to integers.
def text_pipeline(text):
    return [vocab[token] for token in tokenizer(text)]

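def label_pipeline(label):
    # Map positive to 1 and negative to 0. torchtext 0.15 yields integer labels
    # (1 = negative, 2 = positive); older releases yield "neg"/"pos" strings.
    return 1 if label in (2, "pos") else 0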


## Handling Variable-Length Sequences: Padding

Movie reviews naturally have variable lengths. To process these reviews in batches, we need to ensure that all sequences in a batch have the same length.

- **Padding:**  
  We use `pad_sequence` from PyTorch to pad shorter sequences with a special `<pad>` token.  
  This makes it possible to form a uniform tensor of shape `[batch_size, max_seq_length]`.

- **Collate Function:**  
  Our custom `collate_batch` function:
  - Takes a batch of (text, label) pairs.
  - Pads the text sequences to the length of the longest review in the batch.
  - Returns a padded tensor of token IDs and a tensor of labels.

This step is crucial for feeding batches into our RNN, which expects a consistent input size across the batch.
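As a quick illustration of what padding does to a batch, the sketch below pads three toy sequences of token IDs. The IDs and `PAD_IDX = 1` are made-up values for this example, not taken from the real vocabulary.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 1  # assumed ID of the <pad> token for this sketch

# Three "reviews" of different lengths, already converted to token IDs.
sequences = [
    torch.tensor([5, 8, 2, 9]),
    torch.tensor([7, 3]),
    torch.tensor([4, 6, 11]),
]

# Pad every sequence on the right up to the longest one in the batch.
padded = pad_sequence(sequences, batch_first=True, padding_value=PAD_IDX)
lengths = torch.tensor([len(seq) for seq in sequences])

print(padded.shape)  # [batch_size, max_seq_length]
print(padded)
print(lengths)       # true lengths before padding
```

Because padding is done per batch, `max_seq_length` can differ from one batch to the next; only sequences within the same batch must share a length.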

## Create a custom dataset and dataloader

In [None]:
class IMDBDataset(Dataset):
    def __init__(self, data):
        # Data is a list of tuples (label, text)
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        label, text = self.data[idx]
        return torch.tensor(text_pipeline(text)), torch.tensor(label_pipeline(label))

# Custom collate function to pad sequences within a batch.
def collate_batch(batch):
    # Each item in batch is (text_tensor, label_tensor)
    text_list, label_list = zip(*batch)
    # Pad sequences to the maximum length in the batch.
    text_list = pad_sequence(text_list, batch_first=True, padding_value=vocab['<pad>'])
    labels = torch.tensor(label_list)
    return text_list, labels

# Create DataLoader objects.
batch_size = 32
train_dataset = IMDBDataset(train_data)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_batch)


## Define the RNN Model (LSTM-based Classifier)
### RNNClassifier Architecture for Sentiment Analysis

In this section, we define the `RNNClassifier` class. This custom model is designed to classify movie reviews (from the IMDB dataset) as positive or negative. The architecture uses an LSTM (Long Short-Term Memory) layer, which is a popular type of Recurrent Neural Network (RNN) that helps capture sequential dependencies and address issues like the vanishing gradient problem.

### Key Components and Their Roles

1. **Embedding Layer (`nn.Embedding`):**
   - **Purpose:**  
     Converts integer-encoded tokens (words) into dense, continuous vector representations.  
   - **Details:**  
     - The `vocab_size` parameter determines how many unique words the model can embed.  
     - The `embed_dim` defines the size of each word vector.  
     - The `padding_idx` is set to the index of the `<pad>` token, so the `<pad>` embedding stays a zero vector and receives no gradient updates. The LSTM still steps over padded positions, however.
   - **Outcome:**  
     The input text tensor (shape: `[batch_size, seq_length]`) becomes an embedded tensor of shape `[batch_size, seq_length, embed_dim]`.

2. **LSTM Layer (`nn.LSTM`):**
   - **Purpose:**  
     Processes the embedded sequence data step-by-step, capturing the context and order within the review.
   - **Details:**  
     - Receives the embedded sequence as input.  
     - The `hidden_dim` parameter controls the size of the hidden state vector.  
     - The `num_layers` parameter specifies the depth (number of stacked LSTM layers).  
     - `batch_first=True` indicates that the first dimension of the input represents the batch size.
   - **Output:**  
     - Returns two values:
       1. The LSTM outputs at all time steps (not used in our model).
       2. A tuple `(hidden, cell)` where:
          - `hidden` contains the hidden state from each layer (shape: `[num_layers, batch_size, hidden_dim]`).
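          - We take `hidden[-1]` (i.e., the final hidden state of the last LSTM layer) as a summary of the entire sequence. Because batches are padded, for shorter reviews this state is computed after the trailing `<pad>` steps.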

3. **Fully Connected Layer (`nn.Linear`):**
   - **Purpose:**  
     Maps the final hidden state from the LSTM to a single output value (a logit) that will be used for binary classification.
   - **Details:**  
     - It transforms a vector of size `hidden_dim` to `output_dim` (which is 1 for our binary sentiment classification).
   - **Outcome:**  
     Produces a logit for each example in the batch.

4. **Sigmoid Activation (`nn.Sigmoid`):**
   - **Purpose:**  
     Converts the logit into a probability between 0 and 1.
   - **Outcome:**  
     After applying the sigmoid, values closer to 1 indicate a positive sentiment and values closer to 0 indicate a negative sentiment.
   - **Note:**  
     The output is squeezed to drop the trailing size-1 dimension, giving shape `[batch_size]` to match the label tensor passed to the loss.
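The shapes listed above can be checked by pushing a dummy batch through the same kinds of layers. This is a small sketch with arbitrary toy sizes (and an assumed `PAD_IDX = 1`), not the model used in the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Arbitrary toy sizes, chosen only to make the shapes easy to tell apart.
batch_size, seq_length = 3, 7
vocab_size, embed_dim, hidden_dim, num_layers = 50, 8, 16, 2
PAD_IDX = 1  # assumed ID of the <pad> token

embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=PAD_IDX)
lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)
fc = nn.Linear(hidden_dim, 1)

tokens = torch.randint(0, vocab_size, (batch_size, seq_length))
embedded = embedding(tokens)              # [batch_size, seq_length, embed_dim]
output, (hidden, cell) = lstm(embedded)   # hidden: [num_layers, batch_size, hidden_dim]
logits = fc(hidden[-1])                   # [batch_size, 1]
probs = torch.sigmoid(logits).squeeze(1)  # [batch_size]

for name, tensor in [("tokens", tokens), ("embedded", embedded), ("output", output),
                     ("hidden", hidden), ("logits", logits), ("probs", probs)]:
    print(f"{name:9s}{tuple(tensor.shape)}")
```

For a unidirectional LSTM like this one, `output[:, -1, :]` and `hidden[-1]` hold the same values, which is why the unused `output` tensor can be ignored.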

### Walkthrough of the `forward` Method

- **Input:**  
  The `forward` method accepts `text`, a tensor of token IDs with shape `[batch_size, seq_length]`.

- **Step 1: Embedding**  
  ```python
  embedded = self.embedding(text)
  ```
  Each token is converted into a dense vector. The resulting tensor has shape `[batch_size, seq_length, embed_dim]`.

- **Step 2: LSTM Processing**  
  ```python
  output, (hidden, cell) = self.lstm(embedded)
  ```
  The LSTM processes the embedded sequence. We are mainly interested in the final hidden state (`hidden`), which contains the learned representation for the entire sequence.

- **Step 3: Extracting the Final Hidden State**  
  ```python
  hidden_last = hidden[-1]
  ```
  We select the hidden state from the last LSTM layer. This tensor has shape `[batch_size, hidden_dim]` and serves as the summary of the input review.

- **Step 4: Fully Connected Mapping**  
  ```python
  logits = self.fc(hidden_last)
  ```
  The hidden summary is passed through the linear layer to generate a logit for each example.

- **Step 5: Sigmoid Activation**  
  ```python
  return self.sigmoid(logits).squeeze(1)
  ```
  The logit is converted to a probability score between 0 and 1. `squeeze(1)` drops the size-1 output dimension, leaving a tensor of shape `[batch_size]`.

In [None]:
class RNNClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim, num_layers=1):
        super(RNNClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=vocab['<pad>'])
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.sigmoid = nn.Sigmoid()  # For binary classification

    def forward(self, text):
        # text shape: [batch_size, seq_length]
        embedded = self.embedding(text)  # Shape: [batch_size, seq_length, embed_dim]
        # LSTM returns (output, (hidden, cell)). We use the hidden state from the last time step.
        output, (hidden, cell) = self.lstm(embedded)
        # hidden has shape: [num_layers, batch_size, hidden_dim]. Use the last layer.
        hidden_last = hidden[-1]  # Shape: [batch_size, hidden_dim]
        logits = self.fc(hidden_last)  # Shape: [batch_size, output_dim]
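        # Apply sigmoid for binary classification. squeeze(1) drops only the output
        # dimension, so a batch containing a single review keeps its batch dimension.
        return self.sigmoid(logits).squeeze(1)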

# Hyperparameters
vocab_size = len(vocab)
embed_dim = 100
hidden_dim = 128
output_dim = 1  # Binary classification

model = RNNClassifier(vocab_size, embed_dim, hidden_dim, output_dim)
print(model)
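The model above runs the LSTM over every position in the padded batch, so for a short review the final hidden state is taken after the trailing `<pad>` steps. A common PyTorch option, which this notebook does not use, is to pack the batch with `pack_padded_sequence` so that the LSTM stops at each review's true length. The sketch below compares the two on a toy batch. The sizes and `PAD_IDX = 1` are assumptions, and the length computation assumes padding appears only at the end of a sequence.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
PAD_IDX = 1  # assumed ID of the <pad> token

embedding = nn.Embedding(20, 4, padding_idx=PAD_IDX)
lstm = nn.LSTM(4, 6, batch_first=True)

# Row 0 is full length; row 1 has two real tokens followed by two pads.
batch = torch.tensor([[2, 3, 4, 5],
                      [6, 7, PAD_IDX, PAD_IDX]])
lengths = (batch != PAD_IDX).sum(dim=1)  # number of real tokens per row

embedded = embedding(batch)

# Padded run: the LSTM steps over the trailing pads as well.
_, (hidden_padded, _) = lstm(embedded)

# Packed run: the LSTM stops at each row's true length.
packed = pack_padded_sequence(embedded, lengths.cpu(), batch_first=True,
                              enforce_sorted=False)
_, (hidden_packed, _) = lstm(packed)

# Reference: run the short review on its own, without any padding.
_, (hidden_short, _) = lstm(embedded[1:2, :2])

print("packed matches unpadded run:",
      torch.allclose(hidden_packed[-1][1], hidden_short[-1][0], atol=1e-6))
print("padded matches unpadded run:",
      torch.allclose(hidden_padded[-1][1], hidden_short[-1][0], atol=1e-6))
```

Using this in `RNNClassifier` would require passing the true lengths into `forward`, which in turn means returning them from the collate function.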

## Training the RNN model

In [None]:
# Set training hyperparameters
num_epochs = 3  # For demonstration, use a small number of epochs.
learning_rate = 0.001

criterion = nn.BCELoss()  # Binary cross-entropy loss for binary classification.
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Move the model to the GPU if one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
criterion.to(device)

# Training loop
model.train()
for epoch in range(num_epochs):
    epoch_loss = 0
    for texts, labels in train_loader:
        texts, labels = texts.to(device), labels.float().to(device)
        optimizer.zero_grad()         # Reset gradients.
        predictions = model(texts)      # Forward pass.
        loss = criterion(predictions, labels)  # Compute prediction error.
        loss.backward()               # Backpropagate error.
        optimizer.step()              # Update weights.
        epoch_loss += loss.item()
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss/len(train_loader):.4f}")
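The loop above pairs a sigmoid output with `nn.BCELoss`. PyTorch also provides `nn.BCEWithLogitsLoss`, which takes raw logits and applies the sigmoid internally; its documentation describes it as the more numerically stable of the two. The sketch below, using made-up logits and labels, shows that both give the same value and that it matches the binary cross-entropy formula written out by hand.

```python
import torch
import torch.nn as nn

# Made-up logits and labels for three examples.
logits = torch.tensor([2.0, -1.0, 0.5])
labels = torch.tensor([1.0, 0.0, 1.0])

# Option used in this notebook: sigmoid in the model, then BCELoss.
probs = torch.sigmoid(logits)
loss_from_probs = nn.BCELoss()(probs, labels)

# Alternative: keep raw logits and let the loss apply the sigmoid.
loss_from_logits = nn.BCEWithLogitsLoss()(logits, labels)

# Binary cross-entropy written out: -mean(y*log(p) + (1-y)*log(1-p)).
loss_manual = -(labels * probs.log() + (1 - labels) * (1 - probs).log()).mean()

print(loss_from_probs.item(), loss_from_logits.item(), loss_manual.item())
```

Switching to the logits version would mean removing the sigmoid from `forward` and applying `torch.sigmoid` only at prediction time.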

## Evaluating the Model

In [None]:
# Function to predict sentiment on a single review text.
def predict_sentiment(model, text):
    model.eval()
    with torch.no_grad():
        # Convert text to tensor and add batch dimension.
        text_tensor = torch.tensor(text_pipeline(text)).unsqueeze(0).to(device)
        prediction = model(text_tensor)
        # Return sentiment label based on threshold 0.5.
        sentiment = "Positive" if prediction.item() >= 0.5 else "Negative"
        return sentiment, prediction.item()

# Try predicting sentiment on a few sample reviews.
sample_reviews = [
    "I absolutely loved this movie. It was amazing and full of surprises!",
    "The film was boring and too long. I wouldn't recommend it.",
    "Not the best movie I've seen, but it had its moments.",
    "This movie was so bad it made me laugh at how terrible it was!"
]

for review in sample_reviews:
    sentiment, score = predict_sentiment(model, review)
    print(f"Review: {review}\nPredicted Sentiment: {sentiment} (Score: {score:.4f})\n")
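The sample predictions give a qualitative check only. For a quantitative one, a common approach is to compute accuracy over held-out batches. The sketch below shows one way to write such a helper. `evaluate_accuracy`, `StubModel`, and the toy batches are hypothetical names and data for illustration; the stub stands in for the trained classifier so that the block runs on its own.

```python
import torch
import torch.nn as nn

def evaluate_accuracy(model, loader, device):
    """Fraction of examples whose thresholded probability matches the label."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for texts, labels in loader:
            texts, labels = texts.to(device), labels.to(device)
            probs = model(texts)              # probabilities in [0, 1]
            preds = (probs >= 0.5).long()     # same 0.5 threshold as above
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / max(total, 1)

class StubModel(nn.Module):
    """Hypothetical stand-in: predicts positive when the first token ID > 5."""
    def forward(self, texts):
        return (texts[:, 0] > 5).float()

# Two toy batches of (padded token IDs, labels), shaped like collate output.
toy_batches = [
    (torch.tensor([[9, 1], [2, 1]]), torch.tensor([1, 0])),
    (torch.tensor([[7, 3], [8, 1]]), torch.tensor([1, 0])),
]

accuracy = evaluate_accuracy(StubModel(), toy_batches, torch.device("cpu"))
print(f"Accuracy: {accuracy:.2f}")
```

In the notebook, the same helper could be called with the trained `model` and a `DataLoader` built from a subset of `test_iter` using `IMDBDataset` and `collate_batch`.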


## Wrap-Up

In this notebook, we:
- Loaded and preprocessed the IMDB dataset.
- Built a vocabulary and converted text to numerical data.
- Defined a custom RNN (LSTM-based) for sentiment classification.
- Trained the model using a simple training loop.
- Made sentiment predictions on a few sample reviews.