# LSTM-based Sentiment Analysis on IMDB Dataset

This notebook implements a sentiment analysis model using LSTM (Long Short-Term Memory) neural networks to classify IMDB movie reviews as either positive or negative.

Key components:
- PyTorch for deep learning implementation
- Hugging Face's datasets library for loading the IMDB dataset
- DistilBERT tokenizer for text preprocessing
- LSTM architecture for sequence processing

Labels are binary: 1 marks a positive review and 0 a negative one. The model learns this mapping from the review text alone.

## 1. Import Required Libraries

We need several Python libraries:
- `torch`: Main PyTorch library for deep learning
- `transformers`: For using pre-trained tokenizers
- `datasets`: For loading the IMDB dataset
- `matplotlib`: For visualizing training progress
- `numpy`: For numerical operations
- `scikit-learn`: For evaluation metrics (accuracy, precision, recall, F1, ROC AUC)

In [1]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install transformers
!pip install datasets
!pip install scikit-learn
!pip install ipywidgets
!pip install tqdm

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer
from datasets import load_dataset
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score


## 2. Custom Dataset Class

The `SentimentDataset` class handles data preprocessing:

1. Accepts raw text data and tokenizer
2. Converts text to token IDs using the tokenizer
3. Handles padding and truncation to ensure fixed length
4. Creates attention masks that mark real tokens (1) and padding (0)
5. Converts labels to tensor format

Key parameters:
- `max_length`: Maximum sequence length (default: 128)
- `padding`: Set to 'max_length' to ensure uniform sizes
- `truncation`: True to handle reviews longer than max_length

In [2]:
class SentimentDataset(Dataset):
    def __init__(self, dataset_split, tokenizer, max_length=128):
        self.data = dataset_split
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data[idx]['text']
        label = self.data[idx]['label']

        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'labels': torch.tensor(label, dtype=torch.float)
        }
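
The padding and truncation behaviour described above can be illustrated without a tokenizer. The following is a minimal sketch in plain Python, not the tokenizer's actual implementation: `pad_or_truncate` and `pad_id=0` are hypothetical names and values chosen for illustration, and special tokens are ignored.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Truncation: keep only the first max_length tokens
    ids = list(token_ids[:max_length])
    # Attention mask: 1 for real tokens, 0 for padding
    mask = [1] * len(ids)
    # Padding: append pad_id until the sequence reaches max_length
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# A short sequence is padded on the right; a long one is cut off
short_ids, short_mask = pad_or_truncate([7, 8, 9, 10], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every item then has the same length, which is what lets the `DataLoader` stack items into a batch tensor.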


## 3. LSTM Model Architecture

The `LSTMClassifier` implements a neural network with:

1. **Embedding Layer**: Converts token IDs to dense vectors
   - Input: Vocabulary size
   - Output: 300-dimensional embeddings

2. **LSTM Layer**: Processes the sequence
   - Input: 300-dimensional vectors
   - Hidden size: 256 dimensions
   - 2 stacked LSTM layers
   - Dropout (0.3 by default) between the stacked layers for regularization

3. **Output Layer**: Final classification
   - Linear layer converting to single score
   - No explicit sigmoid: the model returns a raw logit, and `BCEWithLogitsLoss` applies the sigmoid internally
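
To get a feel for the size of this architecture, the parameter count can be worked out by hand. The sketch below assumes the `distilbert-base-uncased` vocabulary of 30,522 tokens and the PyTorch LSTM layout of four gates, each with input weights, recurrent weights, and two bias vectors. The helper names are hypothetical.

```python
def lstm_layer_params(input_dim, hidden_dim):
    # 4 gates, each with input weights, recurrent weights, and two bias vectors
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)


def count_params(vocab_size, embedding_dim=300, hidden_dim=256, n_layers=2):
    embedding = vocab_size * embedding_dim
    # First LSTM layer reads embeddings; later layers read the previous layer's output
    lstm = lstm_layer_params(embedding_dim, hidden_dim)
    lstm += (n_layers - 1) * lstm_layer_params(hidden_dim, hidden_dim)
    # Linear layer: hidden_dim weights plus one bias
    fc = hidden_dim + 1
    return {"embedding": embedding, "lstm": lstm, "fc": fc,
            "total": embedding + lstm + fc}


# Assumed vocabulary size for distilbert-base-uncased
counts = count_params(vocab_size=30522)
print(counts)
```

Under these assumptions the embedding table accounts for roughly 9.2 million of about 10.3 million parameters, so most of the model's capacity sits in the embeddings rather than in the LSTM.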

The forward pass uses the attention mask to zero out the embeddings at padding positions, so padded tokens contribute no embedding signal to the LSTM.

In [3]:
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim=300, hidden_dim=256, n_layers=2, dropout=0.3):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,
                           batch_first=True, dropout=dropout if n_layers > 1 else 0)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, 1)

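    def forward(self, input_ids, attention_mask):
        embedded = self.embedding(input_ids)

        # Zero out the embeddings at padding positions
        embedded = embedded * attention_mask.unsqueeze(-1)

        # LSTM forward pass: lstm_out has shape (batch, seq_len, hidden_dim)
        lstm_out, _ = self.lstm(embedded)

        # Select the output at the last non-padding token of each sequence.
        # With right padding, position -1 is padding for any review shorter
        # than max_length, so indexing with -1 would read a padded step.
        last_idx = (attention_mask.sum(dim=1) - 1).clamp(min=0)
        batch_idx = torch.arange(lstm_out.size(0), device=lstm_out.device)
        final_hidden_state = lstm_out[batch_idx, last_idx]

        # Apply dropout and classification layer
        output = self.dropout(final_hidden_state)
        output = self.fc(output)

        return output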
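
Because reviews are padded to a fixed length, the output at the final time step may belong to a padding position rather than to the last real token. The following is a minimal sketch, in plain Python with hypothetical helper names and assuming right padding, of how the attention mask can locate the last real position of each sequence:

```python
def last_real_positions(attention_mask):
    """For each right-padded sequence, return the index of its last real token."""
    # The number of 1s in a row is the sequence length; subtract 1 for the index.
    # max(..., 0) guards against a row that is entirely padding.
    return [max(sum(row) - 1, 0) for row in attention_mask]


def select_last_outputs(outputs, attention_mask):
    """Pick outputs[b][t_last] for each batch element b instead of outputs[b][-1]."""
    positions = last_real_positions(attention_mask)
    return [seq[t] for seq, t in zip(outputs, positions)]


# Toy batch: 2 sequences, 4 time steps, one placeholder "hidden state" per step
mask = [[1, 1, 1, 0], [1, 1, 0, 0]]
outputs = [["h0", "h1", "h2", "pad"], ["g0", "g1", "pad", "pad"]]
selected = select_last_outputs(outputs, mask)
print(selected)
```

In PyTorch the same idea is usually expressed with tensor indexing or with `torch.nn.utils.rnn.pack_padded_sequence`, which avoids running the LSTM over padding at all.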

## 4. Training and Evaluation Functions

The training loop includes:

1. **Per Epoch**:
   - Training phase with gradient updates
   - Validation phase without gradients
   - Loss computation and optimization
   - Progress tracking and metrics calculation

2. **Key Features**:
   - Gradient clipping (max norm: 1.0)
   - Checkpointing of the best model (lowest validation loss) to `best_model.pt`
   - Training and validation loss tracking
   - Validation accuracy monitoring

3. **Hyperparameters**:
   - Optimizer: AdamW with learning rate 2e-5
   - Batch size: 32
   - Number of epochs: 5
   - Loss function: BCEWithLogitsLoss
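
Gradient clipping by norm rescales all gradients together whenever their combined L2 norm exceeds the limit. The sketch below shows the idea in plain Python. It mirrors the general behaviour of `torch.nn.utils.clip_grad_norm_` but is not the library's implementation, and the function name and the `eps` value are assumptions made for illustration.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients so their combined L2 norm is at most max_norm."""
    # Global norm over every gradient value in every parameter group
    total_norm = math.sqrt(sum(g * g for group in grads for g in group))
    # Never scale up: gradients already within the limit stay unchanged
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in group] for group in grads]
    return clipped, total_norm


# Two "parameter groups" whose global norm is 5.0
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for group in clipped for g in group))
print(norm_before, norm_after)
```

The direction of the update is preserved because every gradient is multiplied by the same factor; only the step size shrinks.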

In [4]:
def train_model(model, train_loader, valid_loader, criterion, optimizer, device, num_epochs=5):
    best_valid_loss = float('inf')
    train_losses, valid_losses = [], []

    for epoch in range(num_epochs):
        # Training
        model.train()
        train_loss = 0
        for batch in train_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            optimizer.zero_grad()
            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs.squeeze(-1), labels)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            train_loss += loss.item()

        train_loss = train_loss / len(train_loader)
        train_losses.append(train_loss)

        # Validation
        model.eval()
        valid_loss = 0
        correct = 0
        total = 0

        with torch.no_grad():
            for batch in valid_loader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)

                outputs = model(input_ids, attention_mask)
                loss = criterion(outputs.squeeze(-1), labels)
                valid_loss += loss.item()

                # Outputs are logits: sigmoid(logit) > 0.5 is equivalent to logit > 0
                predictions = (outputs.squeeze(-1) > 0).float()
                correct += (predictions == labels).sum().item()
                total += labels.size(0)

        valid_loss = valid_loss / len(valid_loader)
        valid_losses.append(valid_loss)
        accuracy = correct / total

        print(f'Epoch {epoch+1}/{num_epochs}:')
        print(f'Training Loss: {train_loss:.4f}')
        print(f'Validation Loss: {valid_loss:.4f}')
        print(f'Validation Accuracy: {accuracy:.4f}')

        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), 'best_model.pt')

    return train_losses, valid_losses
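
Since the model returns logits and `BCEWithLogitsLoss` applies the sigmoid internally, a probability threshold of 0.5 corresponds to a logit threshold of 0. The following plain-Python sketch (hypothetical helper names) shows a numerically stable form of the loss for a single example and the matching decision rule:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logit, target):
    # Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(logit)
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def predict_from_logit(logit):
    # sigmoid(logit) > 0.5 exactly when logit > 0
    return 1.0 if logit > 0.0 else 0.0


for logit in (-2.0, -0.1, 0.3, 2.0):
    prob = sigmoid(logit)
    print(f"logit={logit:+.1f} prob={prob:.3f} "
          f"pred={predict_from_logit(logit)} loss(y=1)={bce_with_logits(logit, 1.0):.4f}")
```

A logit of 0.3 corresponds to a probability above 0.5, so it counts as a positive prediction even though the raw value is below 0.5.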

## 5. Main Training Pipeline

The training process follows these steps:

1. **Setup**:
   - GPU/CPU device selection
   - Dataset loading and preprocessing
   - Model initialization

2. **Training**:
   - Batched processing of reviews
   - Forward and backward passes
   - Model parameter updates

3. **Monitoring**:
   - Loss tracking for both training and validation
   - Learning curves visualization

4. **Results**:
   - Best model saved during training
   - Final performance metrics
   - Training progress plots
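
For the final performance metrics, the imported `scikit-learn` functions (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`) can be applied to the collected labels and predictions. As a sketch of what those numbers mean, the following plain-Python version computes them from confusion counts. The function name and the toy labels are hypothetical and not results from this notebook.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```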

In [None]:
def main():
    # Detailed CUDA diagnostics
    print("PyTorch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    print("CUDA version:", torch.version.cuda)

    if not torch.cuda.is_available():
        print("\nWARNING: CUDA not available. Checking system:")
        print("1. Check if NVIDIA GPU is present:")
        import subprocess
        try:
            nvidia_smi = subprocess.check_output(["nvidia-smi"]).decode('utf-8')
            print(nvidia_smi)
        except (subprocess.CalledProcessError, OSError):
            print("nvidia-smi command failed - GPU may not be present or drivers not installed")

    # Try to force CUDA device
    try:
        device = torch.device("cuda:0")
        torch.cuda.set_device(device)
        print(f"\nSuccessfully set device to: {device}")
        print(f"Using GPU: {torch.cuda.get_device_name(0)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    except Exception as e:
        print(f"\nFailed to set CUDA device: {e}")
        print("Falling back to CPU")
        device = torch.device("cpu")

    # Load dataset
    dataset = load_dataset('imdb')
    tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

    # Create datasets (the IMDB test split doubles as the validation set here)
    train_dataset = SentimentDataset(dataset['train'], tokenizer)
    valid_dataset = SentimentDataset(dataset['test'], tokenizer)

    # Create dataloaders
    batch_size = 32
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    valid_loader = DataLoader(valid_dataset, batch_size=batch_size)

    # Initialize model
    model = LSTMClassifier(
        vocab_size=tokenizer.vocab_size,
        embedding_dim=300,
        hidden_dim=256,
        n_layers=2
    ).to(device)

    # Training parameters
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # Train model
    train_losses, valid_losses = train_model(
        model, train_loader, valid_loader,
        criterion, optimizer, device, num_epochs=5
    )

    # Plot training curves
    plt.figure(figsize=(10, 6))
    plt.plot(train_losses, label='Training Loss')
    plt.plot(valid_losses, label='Validation Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Training and Validation Loss')
    plt.legend()
    plt.show()

if __name__ == '__main__':
    main()


PyTorch version: 2.5.1+cu121
CUDA available: True
CUDA version: 12.1

Successfully set device to: cuda:0
Using GPU: Tesla T4
GPU Memory: 15.84 GB

