# Challenge 5 - RNNs

Welcome to challenge #5!

In this challenge, you will implement a simple RNN classifier using an LSTM cell for sentiment analysis on the YelpReviewPolarity dataset (a binary sentiment classification dataset).

Your model should include:
- An **Embedding** layer (with a specified `padding_idx`)
- An **LSTM** layer (with `batch_first=True`)
- A **Fully Connected (fc)** layer to map the final hidden state to the output
- A **Sigmoid** activation to produce a probability between 0 and 1

Your tasks are:

1. **Data Preprocessing & DataLoader Setup** (2 points):  
   Import the YelpReviewPolarity dataset, tokenize the text, build a vocabulary, and create a DataLoader to supply batches to your model.

2. **RNNClassifier** (2 points):  
   Implement a class that builds and returns the RNN classifier model using the parameters provided.

3. **Training and evaluating the model** (2 points):  
   Implement a function that builds your model, trains it for a number of epochs, and then tests it on sample Yelp reviews.

4. **Q&A Section** (3 points total, 1 point each):  
   Answer three questions (in markdown) about your implementation and key concepts.

When you are finished, the provided pytest tests at the end of this notebook will automatically evaluate your code.


## Important Note on Environment:
This challenge will <span style="color: red">not work properly on Codespaces</span> because of the lack of GPU and PyTorch support (at least I haven't figured out the setup yet). So for this challenge, open your repository's notebook in Google Colab and continue there. You can do so by clicking the "Open in Colab" badge below and opening this notebook from your repository.

<a href="https://colab.research.google.com/" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

After completing the code part, please make sure to run the TESTING section of this notebook for the test cases. Again, the pytest run on GitHub will fail, so I will enter your grades manually, without the autograder, by referring to the test section at the end.

## Imports and setup

In [1]:
# Downloading the yelp review dataset
!wget "https://drive.usercontent.google.com/download?id=0Bz8a_Dbh9QhbNUpYQ2N3SGlFaDg&export=download&authuser=0&confirm=t&uuid=08839d6e-0170-44f8-a1c1-2f829c484617&at=AIrpjvOJpeXNKY4yGqP9mw6bXpQS:1739966900676" -O yelp_review_polarity_csv.tar

# Extracting the dataset
!tar -xvf yelp_review_polarity_csv.tar

# Downloading python package dependencies
!pip install torchdata==0.6.1 torchtext portalocker==2.7.0

--2025-02-19 14:15:34--  https://drive.usercontent.google.com/download?id=0Bz8a_Dbh9QhbNUpYQ2N3SGlFaDg&export=download&authuser=0&confirm=t&uuid=08839d6e-0170-44f8-a1c1-2f829c484617&at=AIrpjvOJpeXNKY4yGqP9mw6bXpQS:1739966900676
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 74.125.200.132, 2404:6800:4003:c00::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|74.125.200.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 166373322 (159M) [application/octet-stream]
Saving to: ‘yelp_review_polarity_csv.tar’


2025-02-19 14:15:37 (148 MB/s) - ‘yelp_review_polarity_csv.tar’ saved [166373322/166373322]

yelp_review_polarity_csv/
yelp_review_polarity_csv/readme.txt
yelp_review_polarity_csv/test.csv
yelp_review_polarity_csv/train.csv
Collecting torchdata==0.6.1
  Downloading torchdata-0.6.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Collecting torchtext
  Downloading torchtext-0.18.0

In [2]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence
from torchtext.datasets import YelpReviewPolarity
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

import csv

# Set random seed for reproducibility.
torch.manual_seed(42)

<torch._C.Generator at 0x7f48babd1ef0>

## Task 1: Data Preprocessing (2 points)

Note: the YelpReviewPolarity dataset uses the following labels:
- 1 : Negative polarity.
- 2 : Positive polarity.

Refer to the [PyTorch docs](https://pytorch.org/text/0.8.1/datasets.html#yelpreviewpolarity) for more information about the dataset.

The code cell below loads the training and test splits from the extracted CSV files for you.

In [8]:
def load_local_yelp_list(train_csv_path, test_csv_path, has_header=True, sample_size=50000):
    """
    Reads the local train.csv and test.csv for Yelp Review Polarity
    and returns two lists: train_list, test_list,
    where each element is (label, text).
    label is int (1 or 2), text is the review string.
    """
    train_list = []
    with open(train_csv_path, 'r', encoding='utf-8') as f:
        reader = csv.reader(f)
        if has_header:
            next(reader, None)  # skip the header row
        for row in reader:
            if len(train_list) >= sample_size:
                break
            label_str, text = row
            label = int(label_str)
            train_list.append((label, text))

    test_list = []
    with open(test_csv_path, 'r', encoding='utf-8') as f:
        reader = csv.reader(f)
        if has_header:
            next(reader, None)
        for row in reader:
            label_str, text = row
            label = int(label_str)
            test_list.append((label, text))

    return train_list, test_list

train_list, test_list = load_local_yelp_list('./yelp_review_polarity_csv/train.csv', './yelp_review_polarity_csv/test.csv', has_header=False)
print(f"Number of training examples: {len(train_list)}")
print(f"Number of testing examples: {len(test_list)}")

Number of training examples: 50000
Number of testing examples: 38000


In [24]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')

def yield_tokens(data_iter):
    for label, text in data_iter:
        yield tokenizer(text)

print(train_list[0])

vocab = build_vocab_from_iterator(
    yield_tokens(train_list),
    specials=['<unk>', '<pad>'],
    special_first=True
)
vocab.set_default_index(vocab['<unk>'])

def text_pipeline(text):
    return vocab(tokenizer(text))

def label_pipeline(label):
    return 1 if label == 2 else 0


(1, "Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.  It seems that his staff simply never answers the phone.  It usually takes 2 hours of repeated calling to get an answer.  Who has time for that or wants to deal with it?  I have run into this problem with many other doctors and I just don't get it.  You have office workers, you have patients with medical needs, why isn't anyone answering the phone?  It's incomprehensible and not work the aggravation.  It's with regret that I feel that I have to give Dr. Goldberg 2 stars.")


In [25]:
# -------------------------
# Create a custom Dataset and DataLoader
# -------------------------
class YelpDataset(Dataset):
    def __init__(self, data_iter):
        # TODO: Store the data and process it if necessary.
        self.data = [(text_pipeline(text), label_pipeline(label)) for label, text in data_iter]

    def __len__(self):
        # TODO: Return the total number of examples.
        return len(self.data)

    def __getitem__(self, idx):
        # TODO: Return the processed (text_tensor, label) tuple for index idx.
        return self.data[idx]

# TODO: Create a collate function that pads sequences in a batch.
def collate_batch(batch):
    label_list, text_list = [], []
    # YelpDataset returns (token_ids, label), so unpack in that order.
    for (_text, _label) in batch:
        label_list.append(_label)
        text_list.append(torch.tensor(_text, dtype=torch.int64))

    text_list = pad_sequence(text_list, batch_first=True, padding_value=vocab['<pad>'])

    label_list = torch.tensor(label_list, dtype=torch.int64)

    return text_list, label_list

# TODO: Create a DataLoader for the training data.
batch_size = 32
train_dataset = YelpDataset(train_list)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_batch)
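
The padding step inside the collate function can be checked in isolation. The following is a minimal sketch with made-up token ids; it assumes `<pad>` has index 1, which matches a vocabulary built with `specials=['<unk>', '<pad>']` and `special_first=True`. The names `PAD_IDX` and `sequences` are illustrative only.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Assumption: '<pad>' is the second special token, so its index is 1.
PAD_IDX = 1

# Three tokenized "reviews" of different lengths (token ids are made up).
sequences = [
    torch.tensor([5, 8, 2], dtype=torch.int64),
    torch.tensor([7], dtype=torch.int64),
    torch.tensor([4, 9], dtype=torch.int64),
]

# Pad every sequence on the right to the length of the longest one.
padded = pad_sequence(sequences, batch_first=True, padding_value=PAD_IDX)

print(padded.shape)  # torch.Size([3, 3]) -> [batch_size, seq_length]
print(padded)
# tensor([[5, 8, 2],
#         [7, 1, 1],
#         [4, 9, 1]])
```

With `batch_first=True` the result has shape `[batch_size, seq_length]`, which is the layout an LSTM created with `batch_first=True` expects after the embedding lookup.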

## Task 2: Build the RNN Classifier (2 points)

In [26]:
# TODO: Implement the RNNClassifier class with no implementation provided.
class RNNClassifier(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int, output_dim: int, num_layers: int, padding_idx: int):
        super(RNNClassifier, self).__init__()
        # TODO: Create an Embedding layer.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=padding_idx)
        # TODO: Create an LSTM layer (batch_first=True).
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        # TODO: Create a fully connected layer mapping hidden_dim to output_dim.
        self.fc = nn.Linear(hidden_dim, output_dim)
        # TODO: Create a Sigmoid activation.
        self.sigmoid = nn.Sigmoid()

    def forward(self, text):
        # text: [batch_size, seq_length]
        # TODO: Convert token IDs into embeddings.
        embedded = self.embedding(text)
        # TODO: Process the embeddings through the LSTM.
        output, (hidden, cell) = self.lstm(embedded)  # Your code here.
        # TODO: Extract the final hidden state from the last LSTM layer.
        hidden_last = hidden[-1]  # Your code here.
        # TODO: Map the hidden state to output using the fully connected layer.
        logits = self.fc(hidden_last)  # Your code here.
        # TODO: Apply the Sigmoid activation and return the squeezed output.
        return self.sigmoid(logits).squeeze()  # Your code here.

## Task 3: Training and evaluating your model (2 points)

In [27]:
def train_and_evaluate():
    # TODO: Define the device (e.g., "cuda")
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Use the vocabulary built in the Data Preprocessing section.
    vocab_size = len(vocab)
    embed_dim = 100        # You can choose a value (e.g., 100)
    hidden_dim = 128       # You can choose a value (e.g., 128)
    output_dim = 1         # For binary classification, output dimension is 1
    num_layers = 1         # Number of LSTM layers, e.g., 1
    padding_idx = vocab['<pad>']

    # TODO: Build your RNN classifier using your RNNClassifier class.
    model = RNNClassifier(vocab_size, embed_dim, hidden_dim, output_dim, num_layers, padding_idx)
    model = model.to(device)

    # TODO: Define your loss function
    criterion = nn.BCELoss()

    # TODO: Define your optimizer with a chosen learning rate.
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Set the number of epochs for training.
    num_epochs = 10  # You can adjust the number of epochs.

    for epoch in range(num_epochs):
        # TODO: Set the model to training mode.
        model.train()

        epoch_loss = 0.0  # Initialize epoch loss.

        # Iterate over batches in your training DataLoader.
        for batch_idx, (text_batch, label_batch) in enumerate(train_loader):
            # TODO: Move text_batch and label_batch to the defined device.
            text_batch, label_batch = text_batch.to(device), label_batch.to(device)

            # TODO: Zero out the gradients.
            optimizer.zero_grad()

            # TODO: Perform a forward pass: obtain predictions from the model.
            predictions = model(text_batch)

            # TODO: Compute the loss using your criterion.
            loss = criterion(predictions, label_batch.float())

            # TODO: Perform the backward pass to compute gradients.
            loss.backward()

            # TODO: Update the model parameters.
            optimizer.step()

            # TODO: Accumulate the loss for reporting.
            epoch_loss += loss.item()

        # Compute the average loss for the epoch and print it.
        avg_loss = epoch_loss / len(train_loader)
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")

    # -------------------------
    # Evaluation on Sample Yelp Reviews
    # -------------------------

    # Define some sample Yelp reviews for evaluation.
    sample_reviews = [
        "The food was amazing and the service was excellent!",
        "I did not enjoy my visit at all. The experience was terrible.",
        "The ambiance was pleasant but the food was just okay.",
        "Absolutely loved the place! Will come back again."
    ]

    # TODO: Set the model to evaluation mode.
    model.eval()

    predictions = []  # Store the predictions for the sample reviews.
    print("Evaluation on Sample Yelp Reviews:")
    for review in sample_reviews:
        with torch.no_grad():
            # TODO: Convert the review text to a tensor using your text_pipeline function.
            review_tensor = torch.tensor(text_pipeline(review), dtype=torch.int64)
            # TODO: Add a batch dimension to review_tensor (e.g., using unsqueeze).
            review_tensor = review_tensor.unsqueeze(0).to(device)


            # TODO: Perform a forward pass to get the prediction.
            prediction = model(review_tensor)

            # TODO: Interpret the prediction
            sentiment = "Positive" if prediction >= 0.5 else "Negative"

            # Print the review and its predicted sentiment along with the prediction score.
            print(f"Review: {review}\nPredicted Sentiment: {sentiment} (Score: {prediction.item():.4f})\n")

            # TODO: Append the prediction score (as a float) to the predictions list.
            predictions.append(prediction.item())

    # Return a dictionary with keys "avg_loss" and "predictions".
    return {"avg_loss": avg_loss, "predictions": predictions}

## Task 4: Q&A Section (3 points)

Please answer the questions below briefly. Each carries 1 point. This section is also open-book, i.e., you can refer to the documentation to inform your response.


**Question 1:**  
What is the purpose of specifying a `padding_idx` in the Embedding layer, and how does it affect the model's training and output?

<font color='red'>*Your Answer:* </font>

> *Specifying `padding_idx` ensures that the embedding for the padding token doesn't change during training: its vector is initialized to zeros and receives no gradient updates. This prevents the padding from affecting what the model learns.*
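
The behavior described in this answer can be observed directly on a small embedding layer. This is a minimal sketch, not part of the graded solution; the vocabulary size, embedding size, and `PAD_IDX = 1` are assumed values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumption: index 1 plays the role of '<pad>' in a toy 6-token vocabulary.
PAD_IDX = 1
embedding = nn.Embedding(num_embeddings=6, embedding_dim=4, padding_idx=PAD_IDX)

# The row for the padding index is initialized to all zeros.
pad_row_is_zero = bool(torch.all(embedding.weight[PAD_IDX] == 0))

# One padded sequence: two real tokens followed by two padding tokens.
batch = torch.tensor([[2, 3, PAD_IDX, PAD_IDX]])
embedding(batch).sum().backward()

# The padding row receives a zero gradient, while real tokens do get gradients.
pad_grad_is_zero = bool(torch.all(embedding.weight.grad[PAD_IDX] == 0))
token_grad_is_nonzero = bool(torch.any(embedding.weight.grad[2] != 0))

print(pad_row_is_zero, pad_grad_is_zero, token_grad_is_nonzero)  # True True True
```

Because the gradient for the padding row is always zero, an optimizer step leaves that row unchanged, so padded positions keep contributing zero vectors to the LSTM input.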

**Question 2:**  
Why do we use the final hidden state from the LSTM for classification? How does this hidden state encapsulate the overall information of the input sequence in the context of sentiment analysis?

<font color='red'>*Your Answer:* </font>

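> *The final hidden state is computed after the LSTM has processed every token, so it summarizes the whole review. It represents the overall sentiment by aggregating relevant features from all previous time steps.*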
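
To make the idea of the "final hidden state" concrete, the sketch below checks that, for a unidirectional LSTM with `batch_first=True`, `hidden[-1]` is the same tensor as the top layer's output at the last time step. The layer sizes and the random input are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes (assumed): embed_dim=4, hidden_dim=3, two stacked LSTM layers.
lstm = nn.LSTM(input_size=4, hidden_size=3, num_layers=2, batch_first=True)
x = torch.randn(2, 5, 4)  # [batch_size, seq_length, embed_dim]

output, (hidden, cell) = lstm(x)

print(output.shape)  # torch.Size([2, 5, 3]) -> top-layer state at every time step
print(hidden.shape)  # torch.Size([2, 2, 3]) -> [num_layers, batch_size, hidden_dim]

# hidden[-1] is the last layer's state after the final time step,
# which equals the last time-step slice of the output tensor.
same = torch.allclose(hidden[-1], output[:, -1, :])
print(same)  # True
```

Note that in a padded batch the last time step of a shorter review is a padding position, so this sketch only illustrates the unpadded case.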

**Question 3:**  
What is the role of the Sigmoid activation in this binary classification model? How might this change if you were working on a multi-class classification task? (think of other activation functions)

<font color='red'>*Your Answer:* </font>

> *The Sigmoid activation maps the output to a probability between 0 and 1 for binary classification. For multi-class classification, Softmax would replace Sigmoid, as it converts the outputs into a probability distribution over multiple classes.*
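
The difference between the two activations can be seen on a few hand-picked logits. This is a small illustrative sketch; the logit values and the three-class setup are made up.

```python
import torch

# Binary case: one logit per example, each squashed independently into (0, 1).
logits = torch.tensor([0.0, 2.0, -2.0])
probs = torch.sigmoid(logits)
print(probs[0].item())  # 0.5, since sigmoid(0) = 0.5

# Multi-class case (assumed 3 classes): one logit per class for each example.
class_logits = torch.tensor([[1.0, 2.0, 3.0]])
class_probs = torch.softmax(class_logits, dim=1)

# Softmax probabilities sum to 1 across the classes of each example.
print(class_probs.sum(dim=1))

# The predicted class is the one with the highest probability.
predicted = class_probs.argmax(dim=1)
print(predicted.item())  # 2
```

In a multi-class PyTorch model, the output dimension would equal the number of classes, and `nn.CrossEntropyLoss` would typically be applied to the raw logits, since it applies log-softmax internally.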

---
# Autograder section.

After you finish your code implementation, please run this part of the notebook to see your score for the coding section (6 points).