# Assignment 3
Text Classification using RNN and Hugging Face Dataset

This notebook implements a Recurrent Neural Network (RNN) for text classification using PyTorch and a dataset from Hugging Face.

## 1. Installing and Importing Libraries
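We start by importing the necessary libraries. The notebook assumes that `torch`, `datasets`, `transformers`, and `tqdm` are already installed (for example with `pip install torch datasets transformers tqdm`).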

In [None]:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from datasets import load_dataset
from transformers import AutoTokenizer


## 2. Loading the Dataset
We use the `imdb` dataset from Hugging Face, which consists of movie reviews labeled as positive or negative.

In [None]:
dataset = load_dataset("imdb")
print("-"*100)
print("Dataset Structure")
print("-"*100)
print(dataset)
print("-"*100)

### Dataset Structure

The `imdb` dataset is divided into three subsets:

- **Train**: Contains 25,000 labeled movie reviews.
- **Test**: Contains 25,000 labeled movie reviews.
- **Unsupervised**: Contains 50,000 unlabeled movie reviews.

Each subset has the following features:
- `text`: The movie review text.
- `label`: The sentiment label (0 for negative, 1 for positive).
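
A quick way to sanity-check a split is to count its `label` values. The sketch below uses a handful of made-up records in the same `text`/`label` schema instead of the real dataset, so it runs without a download; `toy_split` and `label_counts` are hypothetical names introduced only for illustration.

```python
from collections import Counter

# Hypothetical records following the imdb schema (not real reviews)
toy_split = [
    {"text": "A wonderful film.", "label": 1},
    {"text": "Dull and far too long.", "label": 0},
    {"text": "I would watch it again.", "label": 1},
    {"text": "Not worth the ticket.", "label": 0},
]

def label_counts(records):
    """Count how many records carry each sentiment label."""
    return Counter(record["label"] for record in records)

counts = label_counts(toy_split)
print(counts[0], counts[1])
```

On the real data, the same idea should work by passing the `label` column of a split to `Counter` before the tokenization step.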

## 3. Preprocessing the Text Data
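We use the pre-trained `distilbert-base-uncased` tokenizer to convert each review into a fixed-length sequence of 256 token IDs, padded or truncated as needed, together with a matching attention mask.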

In [None]:
from transformers import AutoTokenizer

# Use a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print("Tokenizer loaded.")

# Tokenization function
def tokenize_data(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256)

# Apply tokenization
dataset = dataset.map(tokenize_data, batched=True, remove_columns=["text"])
print("Dataset tokenized.")

# Convert dataset into PyTorch format with only necessary columns
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
print("Dataset format set to torch.")

# Check an example
print("Example sample:", dataset["train"][0])
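
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch of the idea. It is a simplification, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the token IDs are illustrative, a padding ID of 0 and right-side padding are assumed, and the real tokenizer additionally keeps its special tokens in place when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # drop tokens past max_length
    n_pad = max_length - len(kept)                   # how many pad slots are needed
    input_ids = kept + [pad_id] * n_pad              # pad on the right
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence is padded, a long one is cut to max_length
short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets later code tell real tokens apart from padding.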

## 4. Creating a PyTorch Dataset and DataLoader
We define a custom PyTorch dataset class and create data loaders.

A custom `Dataset` class defines how a single sample is read and converted into tensors, while a `DataLoader` groups samples into batches, optionally shuffles them each epoch, and can load data in parallel worker processes. Here the training loader shuffles its data and the test loader does not, so evaluation runs over the test set in a fixed order.


In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

class IMDBDataset(Dataset):
    def __init__(self, data):
        self.data = data
        print(f"Initialized IMDBDataset with {len(data)} samples.")

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]

        # set_format(type="torch") already returns tensors, so cast them with
        # torch.as_tensor instead of re-wrapping them with torch.tensor(),
        # which triggers a UserWarning about copying an existing tensor.
        return {
            "input_ids": torch.as_tensor(item["input_ids"], dtype=torch.long),
            "attention_mask": torch.as_tensor(item["attention_mask"], dtype=torch.long),
            # BCELoss expects float targets
            "label": torch.as_tensor(item["label"], dtype=torch.float),
        }

# Creating data loaders
batch_size = 16

print("Creating train_loader...")
train_loader = DataLoader(IMDBDataset(dataset["train"]), batch_size=batch_size, shuffle=True)
print("train_loader created.")

print("Creating test_loader...")
test_loader = DataLoader(IMDBDataset(dataset["test"]), batch_size=batch_size, shuffle=False)
print("test_loader created.")
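
The sketch below shows what a `DataLoader` yields when each sample is a dictionary of tensors: the default collation stacks the samples field by field, and the last batch is smaller when the dataset size is not a multiple of the batch size. `ToyDataset` is a hypothetical stand-in filled with random values, not the IMDB data.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Hypothetical stand-in: 10 samples, each with 8 random token IDs."""
    def __init__(self, n_samples=10, seq_len=8):
        self.input_ids = torch.randint(0, 100, (n_samples, seq_len))
        self.labels = torch.randint(0, 2, (n_samples,)).float()

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {"input_ids": self.input_ids[idx], "label": self.labels[idx]}

toy_loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False)

# Collect the shape of input_ids in every batch
batch_shapes = [tuple(batch["input_ids"].shape) for batch in toy_loader]
first_batch = next(iter(toy_loader))

print(batch_shapes)
print(first_batch["label"].shape)
```

With 10 samples and a batch size of 4, the loader produces two full batches and one batch of 2.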

## 5. Defining the RNN Model
We define a simple RNN model with an embedding layer, an RNN layer, and a fully connected output layer.

In [None]:
class RNNModel(nn.Module):
    """
    A simple Recurrent Neural Network (RNN) model for binary text classification.

    Args:
        vocab_size (int): The size of the vocabulary.
        embed_dim (int): The dimensionality of the embedding layer.
        hidden_dim (int): The number of features in the hidden state of the RNN.
        output_dim (int): The number of output features (1 for binary classification).

    Attributes:
        embedding (nn.Embedding): The embedding layer that converts input tokens to dense vectors.
        rnn (nn.RNN): The RNN layer that processes the embedded input sequences.
        fc (nn.Linear): The fully connected layer that maps the RNN output to the desired output dimension.
        sigmoid (nn.Sigmoid): The sigmoid activation function applied to the output of the fully connected layer.
    """
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super(RNNModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, x):
        """
        Defines the forward pass of the model.

        Args:
            x (torch.Tensor): The input tensor containing token IDs.

        Returns:
            torch.Tensor: The output tensor containing the predicted probabilities.
        """
        x = self.embedding(x)  # Convert input tokens to dense vectors
        x, _ = self.rnn(x)     # Process the embedded input sequences with the RNN
        x = self.fc(x[:, -1, :])  # Use the output of the last time step
        return self.sigmoid(x)  # Apply sigmoid activation to get probabilities

# Model Initialization
vocab_size = tokenizer.vocab_size  # Size of the vocabulary from the tokenizer
embed_dim = 128  # Dimensionality of the embedding layer
hidden_dim = 64  # Number of features in the hidden state of the RNN
output_dim = 1  # Number of output features (1 for binary classification)

model = RNNModel(vocab_size, embed_dim, hidden_dim, output_dim)
print("-"*100)
print("Model Architecture")
print("-"*100)
print(model)
print("-"*100)
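
Because every review is padded to 256 tokens, `x[:, -1, :]` reads the RNN output at a padding position for any review shorter than that (assuming right-side padding, the tokenizer default). This may make learning harder. One possible alternative, which is not part of the model above, is to use the attention mask to pick the output at the last real token of each sequence. A minimal sketch, where `last_valid_output` is a hypothetical helper and the inputs are random:

```python
import torch
import torch.nn as nn

def last_valid_output(rnn_outputs, attention_mask):
    """Select, per sequence, the RNN output at the last non-padded position.

    rnn_outputs:    (batch, seq_len, hidden_dim)
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for right padding
    """
    lengths = attention_mask.sum(dim=1)        # number of real tokens per sequence
    last_idx = (lengths - 1).clamp(min=0)      # index of the last real token
    batch_idx = torch.arange(rnn_outputs.size(0), device=rnn_outputs.device)
    return rnn_outputs[batch_idx, last_idx]    # (batch, hidden_dim)

torch.manual_seed(0)
rnn = nn.RNN(input_size=5, hidden_size=3, batch_first=True)
embedded = torch.randn(2, 4, 5)                               # stand-in for embedding output
attention_mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 1]])   # first sequence has 2 real tokens

rnn_outputs, _ = rnn(embedded)
picked = last_valid_output(rnn_outputs, attention_mask)
print(picked.shape)
```

Using this in the model would require passing `attention_mask` to `forward` alongside the token IDs.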

## 6. Training the Model
We train the model for 5 epochs using binary cross-entropy loss (`nn.BCELoss`) and the Adam optimizer with a learning rate of 0.001.

In [9]:
from tqdm import tqdm

# Loss and Optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training Loop
epochs = 5
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(epochs):
    model.train()
    total_loss = 0
    progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs}", leave=False)
    
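    for step, batch in enumerate(progress_bar, start=1):
        input_ids = batch["input_ids"].to(device)
        labels = batch["label"].float().to(device)

        optimizer.zero_grad()
        # squeeze(-1) drops only the output dimension, so a batch of size 1 keeps its batch axis
        outputs = model(input_ids).squeeze(-1)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        # Show the running mean loss over the batches seen so far
        progress_bar.set_postfix(loss=total_loss / step)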
    
    print(f"Epoch {epoch+1}/{epochs} - Loss: {total_loss/len(train_loader):.4f}")

Epoch 1/5 - Loss: 0.6964
Epoch 2/5 - Loss: 0.6895
Epoch 3/5 - Loss: 0.6729
Epoch 4/5 - Loss: 0.6443
Epoch 5/5 - Loss: 0.6275
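
Two facts help when reading these numbers. First, a model that always outputs a probability of 0.5 has a binary cross-entropy of ln 2, roughly 0.693, so losses near that value indicate predictions close to chance. Second, PyTorch also provides `nn.BCEWithLogitsLoss`, which takes raw scores and applies the sigmoid internally in a numerically more stable way. The sketch below checks both points on made-up logits and targets; it is an illustration, not a change to the training code.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(8)                        # made-up raw model scores
targets = torch.randint(0, 2, (8,)).float()    # made-up binary labels

# Sigmoid followed by BCELoss, as in the training cell
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# The same loss computed directly from the raw scores
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)

# Loss of a model that always predicts 0.5
chance_loss = nn.BCELoss()(torch.full((8,), 0.5), targets)

print(loss_from_probs.item(), loss_from_logits.item())
print(chance_loss.item(), math.log(2))
```

Switching to `nn.BCEWithLogitsLoss` would mean removing the sigmoid from the model's `forward` and applying it only at prediction time.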




## 7. Evaluating the Model
We evaluate the model on the test dataset.

In [10]:
# Evaluation Function
def evaluate_model(model, data_loader):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch["input_ids"].to(device)
            labels = batch["label"].to(device)
            outputs = model(input_ids).squeeze(-1)  # (batch, 1) -> (batch,)
            predictions = (outputs > 0.5).float()
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    
    accuracy = correct / total
    print(f"Test Accuracy: {accuracy * 100:.2f}%")

# Evaluate the model
evaluate_model(model, test_loader)


Test Accuracy: 53.06%
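
The IMDB test split is balanced between the two labels, so always predicting a single class would score about 50%. An accuracy of 53.06% is therefore only slightly better than chance. The thresholding step used in `evaluate_model` can be sketched on a few made-up probabilities as follows; `accuracy_from_probs` is a hypothetical helper.

```python
import torch

def accuracy_from_probs(probs, labels, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    predictions = (probs > threshold).float()
    return (predictions == labels).float().mean().item()

# Made-up model outputs and ground-truth labels
probs = torch.tensor([0.9, 0.2, 0.6, 0.4])
labels = torch.tensor([1.0, 0.0, 0.0, 1.0])

acc = accuracy_from_probs(probs, labels)
print(acc)
```

Here two of the four thresholded predictions match their labels, giving an accuracy of 0.5.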


## 8. Making Predictions
We define a function to predict the sentiment of a given text input.

In [11]:
# Prediction Function
def predict_sentiment(text):
    model.eval()
    encoded_input = tokenizer(text, padding="max_length", truncation=True, max_length=256, return_tensors="pt")
    input_ids = encoded_input["input_ids"].to(device)
    
    with torch.no_grad():
        prediction = model(input_ids).item()
    
    return "Positive" if prediction > 0.5 else "Negative"

# Example Prediction
print(predict_sentiment("I love this movie!"))

Positive
