
# Natural Language Processing with PyTorch: Building and Optimizing Sentiment Analysis Models Using Amazon Reviews



## Abstract

This tutorial demonstrates how to build, train, and compare several sentiment analysis models using PyTorch and the Amazon Reviews dataset (`amazon_polarity` from Hugging Face). The guide covers data loading and preprocessing, model architecture design (BoW+MLP, LSTM, CNN, LSTM with Attention), hyperparameter experimentation, and evaluation using graphs and tables. Comparisons with similar tutorials highlight where this approach differs. After working through the tutorial, you should understand end-to-end NLP model development on real-world data.


## Learning Objectives

- Understand the structure and contents of the Amazon Reviews dataset.
- Preprocess text data and build a vocabulary.
- Implement multiple neural network models in PyTorch for sentiment analysis.
- Compare models and analyze hyperparameter effects on performance.
- Visualize results using graphs and tables and interpret model outputs (e.g., attention visualization).
- Critically evaluate the approach against existing tutorials.

## Table of Contents

1. [Introduction](#introduction)
2. [Dataset Overview](#dataset-overview)
3. [Data Preprocessing and Vocabulary Building](#data-preprocessing)
4. [Model Architectures](#model-architectures)
   - 4.1 [Bag-of-Words + MLP](#bow-model)
   - 4.2 [LSTM Model](#lstm-model)
   - 4.3 [CNN Model](#cnn-model)
   - 4.4 [LSTM with Attention](#attn-model)
5. [Training Process](#training-process)
6. [Hyperparameter Experimentation](#hyperparameter-experimentation)
7. [Results and Evaluation](#results)
   - 7.1 [Visualization of Learning Curves, ROC, etc.](#results-visualization)
   - 7.2 [Model Comparison](#model-comparison)
8. [Conclusion](#conclusion)
9. [References](#references)

## 1. Introduction

In this tutorial, we develop a sentiment analysis pipeline using Amazon product reviews. We explore several deep learning models, from a simple bag-of-words approach to more complex architectures such as LSTM, CNN, and LSTM with Attention. We preprocess the raw text, build numerical representations, and train the models using PyTorch. We also examine how hyperparameter configurations affect model performance, with visualizations and comparisons against existing tutorials.


## 2. Dataset Overview

We use the [amazon_polarity](https://huggingface.co/datasets/amazon_polarity) dataset from Hugging Face. It contains:
- **Train Split:** 3,600,000 examples
- **Test Split:** 400,000 examples
- **Features:**
  - `label`: Sentiment (0 for negative, 1 for positive)
  - `title`: Review title
  - `content`: Full review text

For faster processing, we use a smaller subset, taken as the first rows of each split (not a random sample):
- 100,000 examples for training
- 10,000 examples for testing

In [None]:
import re
from collections import Counter
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc, precision_recall_curve
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm
import pandas as pd
import time
import os
from lime.lime_text import LimeTextExplainer
# Load the dataset from Hugging Face
from datasets import load_dataset

In [None]:
dataset = load_dataset('amazon_polarity')
small_train = dataset['train'].select(range(100000))
small_test = dataset['test'].select(range(10000))
print("Sample training record:")
print(small_train[0])

## 3. Data Preprocessing and Vocabulary Building

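We clean the text (convert to lowercase, drop every character that is not a letter `a`-`z` or whitespace, which removes punctuation and digits, and collapse extra whitespace) and build a vocabulary from the `content` field of the training subset. The `title` field is not used.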
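To make the cleaning rules concrete, here is a minimal sketch that applies the same three steps to two made-up review strings. The function name `clean_text_demo` and the example sentences are illustrative assumptions, not part of the dataset. Note that digits disappear and contractions such as "Don't" collapse to "dont".

```python
import re

def clean_text_demo(text):
    """Lowercase, keep only a-z and whitespace, collapse whitespace."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

# Made-up example reviews (not taken from amazon_polarity).
examples = [
    "Great product!!! 5 stars.",
    "Don't buy this -- it broke in 2 days.",
]
cleaned_examples = [clean_text_demo(t) for t in examples]
for raw, cleaned in zip(examples, cleaned_examples):
    print(f"{raw!r} -> {cleaned!r}")
```

Because numbers are removed, a rating written as "5 stars" or "1 star" loses its digit; whether that matters depends on the task.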



In [None]:
def clean_text(text):
    """Clean and normalize text data."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def build_vocab(dataset_split, min_freq=5, max_words=50000):
    """
    Build vocabulary from dataset (iterating by index).
    
    Args:
        dataset_split: Dataset to build vocabulary from.
        min_freq: Minimum frequency for a word to be included.
        max_words: Maximum vocabulary size.
        
    Returns:
        vocab: Dictionary mapping words to indices.
    """
    print(f"Building vocabulary (min_freq={min_freq}, max_words={max_words})...")
    counter = Counter()
    for i in tqdm(range(len(dataset_split)), desc="Building vocab"):
        example = dataset_split[i]
        tokens = clean_text(example['content']).split()
        counter.update(tokens)
    words = [word for word, freq in counter.most_common(max_words) if freq >= min_freq]
    vocab = {word: idx+2 for idx, word in enumerate(words)}
    vocab["<PAD>"] = 0
    vocab["<UNK>"] = 1
    print(f"Vocabulary built: {len(vocab)} words")
    return vocab

# Build vocabulary from the training subset
vocab = build_vocab(small_train, min_freq=5)
vocab_size = len(vocab)
print("Vocabulary Size:", vocab_size)

def precompute_sequences(dataset_split, vocab, max_len=100):
    """
    Precompute sequences from dataset.
    
    Args:
        dataset_split: Dataset to process.
        vocab: Vocabulary dictionary.
        max_len: Maximum sequence length.
        
    Returns:
        sequences: List of token index sequences.
        labels: List of labels.
    """
    sequences = []
    labels = []
    for i in tqdm(range(len(dataset_split)), desc="Precomputing sequences"):
        example = dataset_split[i]
        tokens = clean_text(example['content']).split()
        seq = [vocab.get(token, vocab["<UNK>"]) for token in tokens]
        if len(seq) < max_len:
            seq = seq + [vocab["<PAD>"]] * (max_len - len(seq))
        else:
            seq = seq[:max_len]
        sequences.append(seq)
        labels.append(example['label'])
    return sequences, labels
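The encoding step can be checked on a toy vocabulary. The sketch below assumes a hypothetical four-entry vocabulary and a helper named `encode_and_pad` (neither comes from the tutorial); it shows how unknown words map to `<UNK>`, short inputs are right-padded with `<PAD>`, and long inputs are truncated.

```python
def encode_and_pad(tokens, vocab, max_len):
    """Map tokens to indices, truncate to max_len, then right-pad with <PAD>."""
    seq = [vocab.get(tok, vocab["<UNK>"]) for tok in tokens]
    seq = seq[:max_len]
    return seq + [vocab["<PAD>"]] * (max_len - len(seq))

# Hypothetical toy vocabulary using the same reserved indices (0 = pad, 1 = unknown).
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "great": 2, "product": 3}

short = encode_and_pad("great product stars".split(), toy_vocab, max_len=5)
long = encode_and_pad("great great product great product stars".split(), toy_vocab, max_len=5)
print(short)  # [2, 3, 1, 0, 0]
print(long)   # [2, 2, 3, 2, 3]
```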

## 4. Model Architectures

We implement four models:
- **Bag-of-Words + MLP**
- **LSTM Model**
- **CNN Model**
- **LSTM with Attention**

### 4.1 Bag-of-Words + MLP

In [None]:
class BoWModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, dropout=0.2):
        super(BoWModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.fc1 = nn.Linear(embed_dim, hidden_dim)
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(hidden_dim, num_classes)
        
    def forward(self, x):
        embedded = self.embedding(x)  # [batch, max_len, embed_dim]
        avg_embedded = embedded.mean(dim=1)  # Mean pooling
        hidden = F.relu(self.fc1(avg_embedded))
        hidden = self.dropout(hidden)
        logits = self.fc2(hidden)
        return logits
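One detail of mean pooling is worth seeing in isolation: `mean(dim=1)` divides by the padded length, so the zero vectors at pad positions shrink the average for short reviews. The following sketch, with made-up token ids and sizes, contrasts the plain mean with a masked mean that divides by the number of real tokens. It is an optional variant under these assumptions, not the pooling used by `BoWModel`.

```python
import torch

torch.manual_seed(0)
embedding = torch.nn.Embedding(10, 4, padding_idx=0)  # pad row is all zeros
x = torch.tensor([[5, 7, 0, 0],    # two real tokens, two pads
                  [3, 4, 6, 2]])   # four real tokens

embedded = embedding(x)                        # [batch, max_len, embed_dim]
plain_mean = embedded.mean(dim=1)              # divides by max_len (pads included)

mask = (x != 0).unsqueeze(-1).float()          # [batch, max_len, 1]
lengths = mask.sum(dim=1).clamp(min=1.0)       # guard against all-pad rows
masked_mean = (embedded * mask).sum(dim=1) / lengths

print(plain_mean[0])
print(masked_mean[0])
```

For the first row the masked mean is exactly twice the plain mean, because only two of the four positions hold real tokens.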

### 4.2 LSTM Model

In [None]:
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers, num_classes, bidirectional=True, dropout=0.2):
        super(LSTMModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=bidirectional,
                            dropout=dropout if num_layers > 1 else 0)
        lstm_out_dim = hidden_dim * (2 if bidirectional else 1)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(lstm_out_dim, num_classes)
        
    def forward(self, x):
        embedded = self.embedding(x)
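        # Clamp to 1: a review that is empty after cleaning has length 0,
        # which pack_padded_sequence rejects.
        lengths = (x != 0).sum(dim=1).clamp(min=1).cpu()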
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, lengths, batch_first=True, enforce_sorted=False)
        packed_output, (h_n, _) = self.lstm(packed_embedded)
        if self.lstm.bidirectional:
            h = torch.cat((h_n[-2], h_n[-1]), dim=1)
        else:
            h = h_n[-1]
        h = self.dropout(h)
        logits = self.fc(h)
        return logits
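The indexing `h_n[-2]` and `h_n[-1]` relies on how PyTorch lays out the final hidden states of a bidirectional LSTM. The sketch below, using random inputs and made-up sizes, checks that for the last layer `h_n[-2]` is the forward state at each sequence's last real token and `h_n[-1]` is the backward state at the first token.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, max_len, embed_dim, hidden_dim = 3, 6, 8, 5
lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
               batch_first=True, bidirectional=True)

embedded = torch.randn(batch, max_len, embed_dim)  # stand-in for embedding output
lengths = torch.tensor([6, 3, 1])                  # true lengths, all >= 1, on CPU

packed = nn.utils.rnn.pack_padded_sequence(
    embedded, lengths, batch_first=True, enforce_sorted=False)
packed_out, (h_n, _) = lstm(packed)
out, _ = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)

# h_n: [num_layers * num_directions, batch, hidden_dim]
final = torch.cat((h_n[-2], h_n[-1]), dim=1)       # [batch, 2 * hidden_dim]

# Cross-check against the per-step outputs of the last layer.
fwd_last = out[torch.arange(batch), lengths - 1, :hidden_dim]
bwd_first = out[:, 0, hidden_dim:]
print(h_n.shape, final.shape)
```

Packing is what makes the forward state stop at the last real token instead of running over the padding.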

### 4.3 CNN Model

In [None]:
class CNNModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes, kernel_sizes=(3, 4, 5), num_filters=100, dropout=0.2):
        super(CNNModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=embed_dim, out_channels=num_filters, kernel_size=k)
            for k in kernel_sizes
        ])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)
        
    def forward(self, x):
        embedded = self.embedding(x)
        embedded = embedded.permute(0, 2, 1)
        conv_outs = [F.relu(conv(embedded)) for conv in self.convs]
        pooled = [F.max_pool1d(out, kernel_size=out.size(2)).squeeze(2) for out in conv_outs]
        concat = torch.cat(pooled, dim=1)
        concat = self.dropout(concat)
        logits = self.fc(concat)
        return logits
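To follow the tensor shapes through the convolutional branch, the sketch below traces a random batch through three `Conv1d` layers and max-over-time pooling. The sizes are assumptions chosen to mirror the defaults above (kernel sizes 3, 4, 5 and 100 filters) with a sequence length of 50. A kernel of size `k` without padding produces `max_len - k + 1` positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, max_len, embed_dim, num_filters = 2, 50, 16, 100
# Conv1d expects [batch, channels, length], hence the permute.
embedded = torch.randn(batch, max_len, embed_dim).permute(0, 2, 1)

conv_lengths = {}
pooled = []
for k in (3, 4, 5):
    conv = nn.Conv1d(embed_dim, num_filters, kernel_size=k)
    out = F.relu(conv(embedded))           # [batch, num_filters, max_len - k + 1]
    conv_lengths[k] = out.size(2)
    pooled.append(out.max(dim=2).values)   # max-over-time: [batch, num_filters]

features = torch.cat(pooled, dim=1)        # [batch, num_filters * 3]
print(conv_lengths, features.shape)
```

The input must be at least as long as the largest kernel; with fixed-length padded sequences this always holds here.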

### 4.4 LSTM with Attention

In [None]:
class AttnLSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers, num_classes, bidirectional=True, dropout=0.2):
        super(AttnLSTMModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                           batch_first=True, bidirectional=bidirectional,
                           dropout=dropout if num_layers > 1 else 0)
        self.lstm_out_dim = hidden_dim * (2 if bidirectional else 1)
        self.attention = nn.Linear(self.lstm_out_dim, 1)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(self.lstm_out_dim, num_classes)
        
    def forward(self, x):
        embedded = self.embedding(x)
        mask = (x != 0).float()
        lstm_out, _ = self.lstm(embedded)
        attn_weights = self.attention(lstm_out).squeeze(-1)
        attn_weights = attn_weights.masked_fill(mask == 0, -1e10)
        attn_weights = F.softmax(attn_weights, dim=1).unsqueeze(2)
        context = torch.sum(attn_weights * lstm_out, dim=1)
        context = self.dropout(context)
        logits = self.fc(context)
        return logits, attn_weights
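The masking step in the attention layer can be verified on its own. In this sketch the scores and token ids are made up; filling pad positions with a large negative value before the softmax drives their weights to zero, so each row still sums to one over the real tokens only.

```python
import torch
import torch.nn.functional as F

# Made-up attention scores and token ids (0 marks padding).
scores = torch.tensor([[2.0, 1.0, 0.5, 0.0],
                       [0.3, 0.2, 0.1, 0.4]])
token_ids = torch.tensor([[5, 7, 0, 0],
                          [3, 4, 6, 2]])
mask = token_ids != 0

masked_scores = scores.masked_fill(~mask, -1e10)
weights = F.softmax(masked_scores, dim=1)   # [batch, max_len]
print(weights)
```

The returned weights are what an attention visualization would plot per token.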

## 5. Creating a Dataset and DataLoader

We precompute sequences for training and testing, then create Dataset objects and DataLoaders.
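As a quick illustration of what a `DataLoader` yields, the sketch below batches ten random fixed-length sequences. It uses `TensorDataset` and made-up sizes as assumptions for brevity, rather than the custom dataset class defined in this section.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
sequences = torch.randint(0, 100, (10, 50))   # 10 padded sequences of length 50
labels = torch.randint(0, 2, (10,))           # binary sentiment labels

loader = DataLoader(TensorDataset(sequences, labels), batch_size=4, shuffle=False)
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
print(batch_shapes)
```

The last batch is smaller when the dataset size is not a multiple of `batch_size`, which is why the training loop weights each batch loss by `features.size(0)`.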

In [None]:
# Precompute sequences for training and testing
max_seq_len = 50  # Shorter sequences for faster training
train_sequences, train_labels = precompute_sequences(small_train, vocab, max_len=max_seq_len)
test_sequences, test_labels = precompute_sequences(small_test, vocab, max_len=max_seq_len)

# Define a dataset class
class AmazonReviewsDataset(Dataset):
    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return torch.tensor(self.sequences[idx], dtype=torch.long), torch.tensor(self.labels[idx], dtype=torch.long)

train_dataset = AmazonReviewsDataset(train_sequences, train_labels)
test_dataset = AmazonReviewsDataset(test_sequences, test_labels)

batch_size = 128
num_workers = 0  # Set to 0 to avoid multiprocessing issues
pin_memory = torch.cuda.is_available()  # Pinned memory only helps when batches are copied to a GPU
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=num_workers, pin_memory=pin_memory)
test_loader = DataLoader(test_dataset, batch_size=batch_size, num_workers=num_workers, pin_memory=pin_memory)



## 6. Training and Evaluation Functions

In [None]:
def plot_learning_curves(train_losses, val_losses, model_name):
    plt.figure(figsize=(10, 6))
    epochs = range(1, len(train_losses) + 1)
    plt.plot(epochs, train_losses, 'b-', label='Training Loss')
    plt.plot(epochs, val_losses, 'r-', label='Validation Loss')
    plt.title(f'Learning Curves - {model_name}')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.grid(True)
    plt.savefig(f'{model_name}_learning_curves.png', dpi=300)
    plt.show()

def plot_confusion_matrix(true_labels, predictions, model_name, classes=['Negative', 'Positive']):
    # Fix the label set so the matrix is always 2x2 and ravel() yields four values.
    cm = confusion_matrix(true_labels, predictions, labels=[0, 1])
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=classes, yticklabels=classes)
    plt.title(f'Confusion Matrix - {model_name}')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.savefig(f'{model_name}_confusion_matrix.png', dpi=300)
    plt.show()
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    print(f"Metrics for {model_name}:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")

def plot_roc_curve(true_labels, prediction_probs, model_name):
    fpr, tpr, _ = roc_curve(true_labels, prediction_probs)
    roc_auc = auc(fpr, tpr)
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.3f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'ROC Curve - {model_name}')
    plt.legend(loc="lower right")
    plt.grid(True)
    plt.savefig(f'{model_name}_roc_curve.png', dpi=300)
    plt.show()

def plot_precision_recall_curve(true_labels, prediction_probs, model_name):
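    from sklearn.metrics import average_precision_score  # local import; not part of the import cell
    precision, recall, _ = precision_recall_curve(true_labels, prediction_probs)
    # Average precision weights each precision value by the recall gained at that
    # threshold; the plain mean of the precision array is a different quantity.
    avg_precision = average_precision_score(true_labels, prediction_probs)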
    plt.figure(figsize=(8, 6))
    plt.plot(recall, precision, color='blue', lw=2, label=f'Precision-Recall curve (AP = {avg_precision:.3f})')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title(f'Precision-Recall Curve - {model_name}')
    plt.legend(loc="lower left")
    plt.grid(True)
    plt.savefig(f'{model_name}_pr_curve.png', dpi=300)
    plt.show()

def train_model(model, train_loader, val_loader, epochs=5, lr=0.001, device='cpu', use_attention=False):
    device = torch.device(device)
    print(f"Using device: {device}")
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    best_val_acc = 0.0
    train_losses = []
    val_losses = []
    for epoch in range(epochs):
        model.train()
        train_loss = 0.0
        train_preds = []
        train_labels = []
        for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs} [Train]"):
            features, labels = batch[0].to(device), batch[1].to(device)
            optimizer.zero_grad()
            if use_attention:
                outputs, _ = model(features)
            else:
                outputs = model(features)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item() * features.size(0)
            _, preds = torch.max(outputs, 1)
            train_preds.extend(preds.cpu().numpy())
            train_labels.extend(labels.cpu().numpy())
        avg_train_loss = train_loss / len(train_loader.dataset)
        train_losses.append(avg_train_loss)
        train_acc = accuracy_score(train_labels, train_preds)
        model.eval()
        val_loss = 0.0
        val_preds = []
        val_labels = []
        val_probs = []
        with torch.no_grad():
            for batch in tqdm(val_loader, desc=f"Epoch {epoch+1}/{epochs} [Val]"):
                features, labels = batch[0].to(device), batch[1].to(device)
                if use_attention:
                    outputs, _ = model(features)
                else:
                    outputs = model(features)
                loss = criterion(outputs, labels)
                val_loss += loss.item() * features.size(0)
                probs = torch.softmax(outputs, dim=1)
                _, preds = torch.max(outputs, 1)
                val_preds.extend(preds.cpu().numpy())
                val_labels.extend(labels.cpu().numpy())
                val_probs.extend(probs[:, 1].cpu().numpy())
        avg_val_loss = val_loss / len(val_loader.dataset)
        val_losses.append(avg_val_loss)
        val_acc = accuracy_score(val_labels, val_preds)
        print(f"Epoch {epoch+1}/{epochs}:")
        print(f"  Train Loss: {avg_train_loss:.4f}, Train Acc: {train_acc:.4f}")
        print(f"  Val Loss: {avg_val_loss:.4f}, Val Acc: {val_acc:.4f}")
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            model_name_str = model.__class__.__name__
            torch.save(model.state_dict(), f"best_{model_name_str.lower()}_model.pt")
            print(f"  New best model saved with val acc: {val_acc:.4f}")
    model_name_str = model.__class__.__name__
    plot_learning_curves(train_losses, val_losses, model_name_str)
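    # map_location keeps loading robust when the checkpoint was saved on another device.
    model.load_state_dict(torch.load(f"best_{model_name_str.lower()}_model.pt", map_location=device))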
    return train_losses, val_losses

def evaluate_model(model, test_loader, device='cpu', use_attention=False):
    device = torch.device(device)
    model = model.to(device)
    model.eval()
    all_preds = []
    all_labels = []
    all_probs = []
    with torch.no_grad():
        for batch in tqdm(test_loader, desc="Evaluating"):
            features, labels = batch[0].to(device), batch[1].to(device)
            if use_attention:
                outputs, _ = model(features)
            else:
                outputs = model(features)
            probs = torch.softmax(outputs, dim=1)
            _, preds = torch.max(outputs, 1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
            all_probs.extend(probs[:, 1].cpu().numpy())
    accuracy_val = accuracy_score(all_labels, all_preds)
    report = classification_report(all_labels, all_preds, target_names=["Negative", "Positive"], output_dict=True)
    model_name_str = model.__class__.__name__
    plot_confusion_matrix(all_labels, all_preds, f"{model_name_str} (Test)")
    plot_roc_curve(all_labels, all_probs, f"{model_name_str} (Test)")
    plot_precision_recall_curve(all_labels, all_probs, f"{model_name_str} (Test)")
    print(f"\nTest Results for {model_name_str}:")
    print(f"Accuracy: {accuracy_val:.4f}")
    print(f"Precision (Positive): {report['Positive']['precision']:.4f}")
    print(f"Recall (Positive): {report['Positive']['recall']:.4f}")
    print(f"F1-Score (Positive): {report['Positive']['f1-score']:.4f}")
    results = {
        'accuracy': accuracy_val,
        'precision': report['Positive']['precision'],
        'recall': report['Positive']['recall'],
        'f1': report['Positive']['f1-score'],
        'predictions': all_preds,
        'true_labels': all_labels,
        'probabilities': all_probs
    }
    return results
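Average precision (AP) is easy to confuse with the mean of the precision array. The sketch below uses four made-up labels and scores to compare scikit-learn's `average_precision_score` with the recall-weighted sum it corresponds to, and prints the plain mean for contrast.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Made-up labels and scores for illustration only.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, _ = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)

# AP = sum over thresholds of (recall gained) * (precision at that threshold).
# recall is returned in decreasing order, hence the minus sign.
ap_manual = -np.sum(np.diff(recall) * precision[:-1])

print(f"AP: {ap:.4f}, manual: {ap_manual:.4f}, plain mean: {np.mean(precision):.4f}")
```

On this toy input the positive ranked first contributes precision 1 and the second contributes 2/3, each weighted by a recall gain of 0.5.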

## 7. Hyperparameter Experimentation

In [None]:
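def train_model_quick(model, train_loader, val_loader, num_epochs=2, learning_rate=1e-3, device='cpu', use_attention=False):
    """A faster version for hyperparameter experiments.

    Each epoch uses at most 21 training batches and 11 validation batches,
    so results are rough estimates rather than converged scores.
    """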
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    for epoch in range(num_epochs):
        model.train()
        train_loss = 0
        for batch_idx, (texts, labels) in enumerate(train_loader):
            texts, labels = texts.to(device), labels.to(device)
            optimizer.zero_grad()
            if use_attention:
                outputs, _ = model(texts)
            else:
                outputs = model(texts)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
            if batch_idx >= 20:
                break
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch_idx, (texts, labels) in enumerate(val_loader):
                texts, labels = texts.to(device), labels.to(device)
                if use_attention:
                    outputs, _ = model(texts)
                else:
                    outputs = model(texts)
                loss = criterion(outputs, labels)
                val_loss += loss.item()
                if batch_idx >= 10:
                    break
    return model

def evaluate_model_quick(model, test_loader, device='cpu', use_attention=False):
    """Evaluate model accuracy on the first 21 batches only, for fast experiments."""
    model.to(device)
    model.eval()
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for batch_idx, (texts, labels) in enumerate(test_loader):
            texts, labels = texts.to(device), labels.to(device)
            if use_attention:
                outputs, _ = model(texts)
            else:
                outputs = model(texts)
            _, preds = torch.max(outputs, 1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
            if batch_idx >= 20:
                break
    accuracy_val = accuracy_score(all_labels, all_preds)
    print(f"Test Accuracy (on subset): {accuracy_val:.4f}")
    return accuracy_val

def run_hyperparameter_experiments(train_dataset, test_dataset, vocab_size, device='cpu'):
    print("\n==== Running Hyperparameter Experiments ====\n")
    results_data = []
    max_epochs = 2
    test_loader_exp = DataLoader(test_dataset, batch_size=64, shuffle=False, num_workers=0)
    print("\n--- Experiment 1: Effect of Embedding Dimension ---")
    embedding_dims = [16, 32, 64, 128]
    for embed_dim in embedding_dims:
        print(f"\nTesting embedding dimension: {embed_dim}")
        model = BoWModel(vocab_size, embed_dim, hidden_dim=64, num_classes=2, dropout=0.2)
        batch_size = 128
        train_loader_exp = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
        start_time = time.time()
        train_model_quick(model, train_loader_exp, test_loader_exp, num_epochs=max_epochs, learning_rate=0.001, device=device)
        training_time = time.time() - start_time
        val_accuracy = evaluate_model_quick(model, test_loader_exp, device)
        results_data.append({
            'model_type': 'BoW',
            'embedding_dim': embed_dim,
            'hidden_dim': 64,
            'learning_rate': 0.001,
            'batch_size': batch_size,
            'dropout': 0.2,
            'val_accuracy': val_accuracy,
            'training_time': training_time
        })
    print("\n--- Experiment 2: Effect of Learning Rate ---")
    learning_rates = [0.0001, 0.001, 0.01, 0.1]
    for lr in learning_rates:
        print(f"\nTesting learning rate: {lr}")
        model = CNNModel(vocab_size, embed_dim=64, num_classes=2, dropout=0.2)
        batch_size = 128
        train_loader_exp = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
        start_time = time.time()
        train_model_quick(model, train_loader_exp, test_loader_exp, num_epochs=max_epochs, learning_rate=lr, device=device)
        training_time = time.time() - start_time
        val_accuracy = evaluate_model_quick(model, test_loader_exp, device)
        results_data.append({
            'model_type': 'CNN',
            'embedding_dim': 64,
            'hidden_dim': None,
            'learning_rate': lr,
            'batch_size': batch_size,
            'dropout': 0.2,
            'val_accuracy': val_accuracy,
            'training_time': training_time
        })
    print("\n--- Experiment 3: Effect of Dropout Rate ---")
    dropout_rates = [0.0, 0.1, 0.2, 0.3, 0.5]
    for dropout in dropout_rates:
        print(f"\nTesting dropout rate: {dropout}")
        model = LSTMModel(vocab_size, embed_dim=64, hidden_dim=64, num_layers=1, num_classes=2, dropout=dropout)
        batch_size = 128
        train_loader_exp = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
        start_time = time.time()
        train_model_quick(model, train_loader_exp, test_loader_exp, num_epochs=max_epochs, learning_rate=0.001, device=device)
        training_time = time.time() - start_time
        val_accuracy = evaluate_model_quick(model, test_loader_exp, device)
        results_data.append({
            'model_type': 'LSTM',
            'embedding_dim': 64,
            'hidden_dim': 64,
            'learning_rate': 0.001,
            'batch_size': batch_size,
            'dropout': dropout,
            'val_accuracy': val_accuracy,
            'training_time': training_time
        })
    print("\n--- Experiment 4: Effect of Hidden Layer Size ---")
    hidden_dims = [32, 64, 128, 256]
    for hidden_dim in hidden_dims:
        print(f"\nTesting hidden dimension: {hidden_dim}")
        model = AttnLSTMModel(vocab_size, embed_dim=64, hidden_dim=hidden_dim, num_layers=1, num_classes=2, dropout=0.2)
        batch_size = 128
        train_loader_exp = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
        start_time = time.time()
        train_model_quick(model, train_loader_exp, test_loader_exp, num_epochs=max_epochs, learning_rate=0.001, device=device, use_attention=True)
        training_time = time.time() - start_time
        val_accuracy = evaluate_model_quick(model, test_loader_exp, device, use_attention=True)
        results_data.append({
            'model_type': 'LSTM+Attention',
            'embedding_dim': 64,
            'hidden_dim': hidden_dim,
            'learning_rate': 0.001,
            'batch_size': batch_size,
            'dropout': 0.2,
            'val_accuracy': val_accuracy,
            'training_time': training_time
        })
    print("\n--- Experiment 5: Comparing Model Architectures ---")
    models = {
        'BoW': BoWModel(vocab_size, embed_dim=64, hidden_dim=64, num_classes=2, dropout=0.2),
        'LSTM': LSTMModel(vocab_size, embed_dim=64, hidden_dim=64, num_layers=1, num_classes=2, dropout=0.2),
        'CNN': CNNModel(vocab_size, embed_dim=64, num_classes=2, dropout=0.2),
        'LSTM+Attention': AttnLSTMModel(vocab_size, embed_dim=64, hidden_dim=64, num_layers=1, num_classes=2, dropout=0.2)
    }
    for model_name, model in models.items():
        print(f"\nTraining {model_name} model with the shared default hyperparameters")
        batch_size = 128
        train_loader_exp = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
        start_time = time.time()
        use_attn = model_name == 'LSTM+Attention'
        train_model_quick(model, train_loader_exp, test_loader_exp, num_epochs=max_epochs, learning_rate=0.001, device=device, use_attention=use_attn)
        training_time = time.time() - start_time
        val_accuracy = evaluate_model_quick(model, test_loader_exp, device, use_attention=use_attn)
        results_data.append({
            'model_type': model_name,
            'embedding_dim': 64,
            'hidden_dim': 64 if model_name != 'CNN' else None,
            'learning_rate': 0.001,
            'batch_size': batch_size,
            'dropout': 0.2,
            'val_accuracy': val_accuracy,
            'training_time': training_time
        })
    results_df = pd.DataFrame(results_data)
    return results_df

def visualize_hyperparameter_results(results_df):
    plt.figure(figsize=(20, 24))
    plt.subplot(4, 2, 1)
    # Drop the last BoW row: it is the Experiment 5 comparison run, not part of this sweep.
    bow_results = results_df[results_df['model_type'] == 'BoW'].iloc[:-1]
    plt.plot(bow_results['embedding_dim'], bow_results['val_accuracy'], 'bo-', linewidth=2, markersize=10)
    for x, y in zip(bow_results['embedding_dim'], bow_results['val_accuracy']):
        plt.annotate(f'{y:.3f}', (x, y), textcoords="offset points", xytext=(0,10), ha='center', fontsize=9, bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="gray", alpha=0.8))
    plt.xlabel('Embedding Dimension', fontsize=12)
    plt.ylabel('Validation Accuracy', fontsize=12)
    plt.title('Effect of Embedding Dimension on Accuracy (BoW Model)', fontsize=14)
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.xticks(bow_results['embedding_dim'])
    
    plt.subplot(4, 2, 2)
    bars = plt.bar(bow_results['embedding_dim'], bow_results['training_time'], width=0.5, color='skyblue', edgecolor='navy', alpha=0.7)
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.1, f'{height:.1f}s', ha='center', va='bottom', fontsize=9)
    plt.xlabel('Embedding Dimension', fontsize=12)
    plt.ylabel('Training Time (seconds)', fontsize=12)
    plt.title('Effect of Embedding Dimension on Training Time (BoW Model)', fontsize=14)
    plt.grid(True, axis='y', linestyle='--', alpha=0.7)
    plt.xticks(bow_results['embedding_dim'])
    
    plt.subplot(4, 2, 3)
    # Drop the last CNN row: it is the Experiment 5 comparison run, not part of this sweep.
    cnn_results = results_df[results_df['model_type'] == 'CNN'].iloc[:-1]
    plt.semilogx(cnn_results['learning_rate'], cnn_results['val_accuracy'], 'ro-', linewidth=2, markersize=10)
    for x, y in zip(cnn_results['learning_rate'], cnn_results['val_accuracy']):
        plt.annotate(f'{y:.3f}', (x, y), textcoords="offset points", xytext=(0,10), ha='center', fontsize=9, bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="gray", alpha=0.8))
    plt.xlabel('Learning Rate (log scale)', fontsize=12)
    plt.ylabel('Validation Accuracy', fontsize=12)
    plt.title('Effect of Learning Rate on Accuracy (CNN Model)', fontsize=14)
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.xticks(cnn_results['learning_rate'], [f'{x:.4f}' for x in cnn_results['learning_rate']])
    
    plt.subplot(4, 2, 4)
    # Learning rates span several orders of magnitude, so fixed-width bars placed at
    # their numeric values on a linear axis would overlap. Use evenly spaced positions.
    lr_positions = range(len(cnn_results))
    bars = plt.bar(lr_positions, cnn_results['training_time'], width=0.6, color='salmon', edgecolor='darkred', alpha=0.7)
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.1, f'{height:.1f}s', ha='center', va='bottom', fontsize=9)
    plt.xlabel('Learning Rate', fontsize=12)
    plt.ylabel('Training Time (seconds)', fontsize=12)
    plt.title('Effect of Learning Rate on Training Time (CNN Model)', fontsize=14)
    plt.grid(True, axis='y', linestyle='--', alpha=0.7)
    plt.xticks(lr_positions, [f'{x:g}' for x in cnn_results['learning_rate']])
    
    plt.subplot(4, 2, 5)
    # Drop the last LSTM row: it is the Experiment 5 comparison run, not part of this sweep.
    lstm_results = results_df[results_df['model_type'] == 'LSTM'].iloc[:-1]
    plt.plot(lstm_results['dropout'], lstm_results['val_accuracy'], 'go-', linewidth=2, markersize=10)
    for x, y in zip(lstm_results['dropout'], lstm_results['val_accuracy']):
        plt.annotate(f'{y:.3f}', (x, y), textcoords="offset points", xytext=(0,10), ha='center', fontsize=9, bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="gray", alpha=0.8))
    plt.xlabel('Dropout Rate', fontsize=12)
    plt.ylabel('Validation Accuracy', fontsize=12)
    plt.title('Effect of Dropout Rate on Accuracy (LSTM Model)', fontsize=14)
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.xticks(lstm_results['dropout'])
    
    plt.subplot(4, 2, 6)
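    # Drop the last LSTM+Attention row: it is the Experiment 5 comparison run, not part of this sweep.
    attn_results = results_df[results_df['model_type'] == 'LSTM+Attention'].iloc[:-1]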
    plt.plot(attn_results['hidden_dim'], attn_results['val_accuracy'], 'mo-', linewidth=2, markersize=10)
    for x, y in zip(attn_results['hidden_dim'], attn_results['val_accuracy']):
        plt.annotate(f'{y:.3f}', (x, y), textcoords="offset points", xytext=(0,10), ha='center', fontsize=9, bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="gray", alpha=0.8))
    plt.xlabel('Hidden Dimension', fontsize=12)
    plt.ylabel('Validation Accuracy', fontsize=12)
    plt.title('Effect of Hidden Dimension on Accuracy (LSTM+Attention Model)', fontsize=14)
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.xticks(attn_results['hidden_dim'])
    
    plt.subplot(4, 2, 7)
    model_comparison = results_df.drop_duplicates(subset=['model_type'], keep='last')
    colors = ['#3498db', '#2ecc71', '#e74c3c', '#f39c12']
    bars = plt.bar(model_comparison['model_type'], model_comparison['val_accuracy'], color=colors, width=0.6, edgecolor='black', linewidth=1.5)
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.01, f'{height:.3f}', ha='center', va='bottom', fontsize=10, bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="gray", alpha=0.8))
    plt.xlabel('Model Architecture', fontsize=12)
    plt.ylabel('Validation Accuracy', fontsize=12)
    plt.title('Comparison of Model Architectures (Best Configuration)', fontsize=14)
    plt.ylim(0.5, max(model_comparison['val_accuracy']) + 0.1)
    plt.grid(True, axis='y', linestyle='--', alpha=0.7)
    plt.xticks(rotation=45, ha='right')
    
    plt.subplot(4, 2, 8)
    bars = plt.bar(model_comparison['model_type'], model_comparison['training_time'], color=colors, width=0.6, edgecolor='black', linewidth=1.5, alpha=0.7)
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.1, f'{height:.1f}s', ha='center', va='bottom', fontsize=10)
    plt.xlabel('Model Architecture', fontsize=12)
    plt.ylabel('Training Time (seconds)', fontsize=12)
    plt.title('Training Time Comparison (Best Configuration)', fontsize=14)
    plt.grid(True, axis='y', linestyle='--', alpha=0.7)
    plt.xticks(rotation=45, ha='right')
    
    plt.tight_layout(pad=3.0)
    plt.savefig('hyperparameter_experiments.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    best_configs = []
    for model in results_df['model_type'].unique():
        model_results = results_df[results_df['model_type'] == model]
        best_row = model_results.loc[model_results['val_accuracy'].idxmax()]
        best_configs.append(best_row)
    best_df = pd.DataFrame(best_configs)
    print("\nBest Hyperparameter Configurations:")
    print(best_df[['model_type', 'embedding_dim', 'hidden_dim', 'learning_rate', 'dropout', 'val_accuracy', 'training_time']])
    
    if len(results_df['embedding_dim'].unique()) > 1 and len(results_df['dropout'].unique()) > 1:
        # Best validation accuracy for every (dropout, embedding_dim) pair that was actually run.
        # Several runs can share a pair, so aggregate explicitly instead of taking the first match.
        pivot_df = results_df.pivot_table(index='dropout', columns='embedding_dim', values='val_accuracy', aggfunc='max')
        if not pivot_df.empty:
            plt.figure(figsize=(8, 6))
            sns.heatmap(pivot_df, annot=True, fmt='.3f', cmap='viridis', cbar_kws={'label': 'Validation Accuracy'})
            plt.title('Effect of Embedding Dim + Dropout Combinations')
            plt.tight_layout()
            plt.savefig('hyperparameter_heatmap.png', dpi=300, bbox_inches='tight')
            plt.show()
    return best_df

def run_hyperparameter_study():
    # Never request more examples than the datasets contain (np.random.choice would raise).
    subset_size = min(5000, len(small_train))
    np.random.seed(42)
    train_indices = np.random.choice(len(small_train), subset_size, replace=False)
    print("\nPrecomputing sequences for hyperparameter experiments...")
    train_sequences, train_labels = precompute_sequences([small_train[int(i)] for i in train_indices], vocab, max_len=50)
    test_size = min(subset_size // 5, len(small_test))
    test_indices = np.random.choice(len(small_test), test_size, replace=False)
    test_sequences, test_labels = precompute_sequences([small_test[int(i)] for i in test_indices], vocab, max_len=50)
    train_dataset_hp = AmazonReviewsDataset(train_sequences, train_labels)
    test_dataset_hp = AmazonReviewsDataset(test_sequences, test_labels)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    results_df = run_hyperparameter_experiments(train_dataset_hp, test_dataset_hp, vocab_size, device)
    best_configs = visualize_hyperparameter_results(results_df)
    return best_configs
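
The "best configuration" selection in `visualize_hyperparameter_results` can be illustrated without pandas. The sketch below is a minimal, hypothetical example: the rows and their numbers are invented for illustration and are not results from this tutorial; only the column names (`model_type`, `val_accuracy`, and so on) mirror the code above.

```python
# Illustrative only: toy rows that mimic the columns of results_df.
# The numbers are invented for the example, not real experiment results.
results = [
    {"model_type": "CNN", "learning_rate": 1e-3, "val_accuracy": 0.81},
    {"model_type": "CNN", "learning_rate": 1e-2, "val_accuracy": 0.78},
    {"model_type": "LSTM", "dropout": 0.2, "val_accuracy": 0.83},
    {"model_type": "LSTM", "dropout": 0.5, "val_accuracy": 0.80},
]

def best_per_model(rows):
    """Return {model_type: row with the highest val_accuracy}."""
    best = {}
    for row in rows:
        current = best.get(row["model_type"])
        if current is None or row["val_accuracy"] > current["val_accuracy"]:
            best[row["model_type"]] = row
    return best

best = best_per_model(results)
for name, row in best.items():
    print(name, row["val_accuracy"])
```

Taking the last row for each model instead would select whichever configuration happened to run last, which is not necessarily the best one.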


## 8. Results and Evaluation

In [None]:
def plot_model_comparison(model_results):
    model_names = list(model_results.keys())
    metric_values = [model_results[model]['accuracy'] for model in model_names]
    plt.figure(figsize=(10, 6))
    bars = plt.bar(model_names, metric_values, color=['#3498db', '#2ecc71', '#e74c3c', '#f39c12'])
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.01, f'{height:.4f}',
                 ha='center', va='bottom')
    plt.title('Model Comparison - Accuracy')
    plt.ylabel('Accuracy')
    plt.ylim(0, 1.0)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.savefig('model_comparison_accuracy.png', dpi=300)
    plt.show()

def plot_multi_metric_comparison(model_results):
    metrics = ['accuracy', 'precision', 'recall', 'f1']
    model_names = list(model_results.keys())
    data = []
    for model in model_names:
        for metric in metrics:
            data.append({
                'Model': model,
                'Metric': metric,
                'Value': model_results[model][metric]
            })
    df = pd.DataFrame(data)
    plt.figure(figsize=(12, 8))
    chart = sns.barplot(x='Model', y='Value', hue='Metric', data=df)
    for i, p in enumerate(chart.patches):
        chart.annotate(f'{p.get_height():.3f}', (p.get_x() + p.get_width() / 2., p.get_height() + 0.01),
                      ha='center', va='bottom', fontsize=8)
    plt.title('Model Comparison - Multiple Metrics')
    plt.ylabel('Value')
    plt.ylim(0, 1.0)
    plt.legend(title='Metric')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.savefig('model_comparison_multi_metric.png', dpi=300)
    plt.show()

def visualize_attention(model, text, vocab, tokenizer=None, max_len=100, device='cpu'):
    model.eval()
    if tokenizer:
        tokens = tokenizer(text)
    else:
        tokens = clean_text(text).split()
    seq = [vocab.get(token, vocab["<UNK>"]) for token in tokens]
    if len(seq) < max_len:
        effective_len = len(seq)
        seq = seq + [vocab["<PAD>"]] * (max_len - len(seq))
    else:
        effective_len = max_len
        seq = seq[:max_len]
    seq_tensor = torch.tensor([seq], dtype=torch.long).to(device)
    with torch.no_grad():
        _, attn_weights = model(seq_tensor)
    attn_data = attn_weights[0, :effective_len, 0].cpu().numpy()
    plt.figure(figsize=(16, 4))
    plt.subplot(1, 2, 1)
    attn_df = pd.DataFrame({'token': tokens[:effective_len], 'attention': attn_data})
    sns.barplot(x='token', y='attention', data=attn_df)
    plt.xticks(rotation=45, ha='right')
    plt.title('Attention Weights per Token')
    plt.tight_layout()
    plt.subplot(1, 2, 2)
    norm_weights = (attn_data - attn_data.min()) / (attn_data.max() - attn_data.min() + 1e-10)
    y_positions = np.arange(effective_len)
    plt.barh(y_positions, norm_weights, color='skyblue')
    plt.yticks(y_positions, tokens[:effective_len])
    plt.gca().invert_yaxis()
    plt.title('Attention Heatmap')
    plt.xlabel('Normalized Attention')
    plt.tight_layout()
    plt.savefig('attention_visualization.png', dpi=300)
    plt.show()
    print("Tokens with attention weights:")
    for token, weight in zip(tokens[:effective_len], attn_data):
        print(f"{token}: {weight:.4f}")
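
When a review is shorter than `max_len`, `visualize_attention` keeps only the first `effective_len` attention weights. If the model assigns some attention mass to padding positions, the kept weights no longer sum to 1. The sketch below shows one possible way to renormalize them, alongside the same min-max scaling used for the second panel. The helper name `slice_and_rescale` and the toy weights are hypothetical and not part of the tutorial code.

```python
def slice_and_rescale(weights, effective_len, eps=1e-10):
    """Keep weights for real tokens; return (renormalized, min-max scaled)."""
    kept = list(weights[:effective_len])
    if not kept:
        return [], []
    total = sum(kept)
    # Renormalize so the kept weights form a distribution over real tokens.
    renormalized = [w / (total + eps) for w in kept]
    lo, hi = min(kept), max(kept)
    # Min-max scaling, as in the "Normalized Attention" panel.
    scaled = [(w - lo) / (hi - lo + eps) for w in kept]
    return renormalized, scaled

# Toy example (made-up values): 3 real tokens followed by 2 padding positions.
toy_weights = [0.5, 0.2, 0.1, 0.1, 0.1]
renorm, scaled = slice_and_rescale(toy_weights, effective_len=3)
print([round(w, 3) for w in renorm])
print([round(w, 3) for w in scaled])
```

Whether renormalizing is appropriate depends on whether the attention layer already masks padding; if it does, the kept weights already sum to 1 and this step changes nothing.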
   

In [None]:
!pip install lime

In [None]:
from lime.lime_text import LimeTextExplainer

def predict_proba_model(texts, model, vocab, max_seq_len, device):
    """
    Convert a list of raw text strings to prediction probabilities using the model.
    If a text becomes empty after cleaning, it assigns a default "<UNK>" token.
    """
    model.eval()
    processed = []
    for text in texts:
        tokens = clean_text(text).split()
        if len(tokens) == 0:
            tokens = ["<UNK>"]
        seq = [vocab.get(token, vocab["<UNK>"]) for token in tokens]
        if len(seq) < max_seq_len:
            seq = seq + [vocab["<PAD>"]] * (max_seq_len - len(seq))
        else:
            seq = seq[:max_seq_len]
        processed.append(seq)
    processed_tensor = torch.tensor(processed, dtype=torch.long).to(device)
    with torch.no_grad():
        outputs = model(processed_tensor)
        if isinstance(outputs, tuple):  # for models that return a tuple (e.g., with attention)
            outputs = outputs[0]
        probs = torch.softmax(outputs, dim=1)
    return probs.cpu().numpy()

def explain_instance(model, raw_text, vocab, max_seq_len, device, class_names=["Negative", "Positive"], num_features=10):
    """
    Generate and display a LIME explanation for a given review.
    """
    explainer = LimeTextExplainer(class_names=class_names)
    predict_fn = lambda texts: predict_proba_model(texts, model, vocab, max_seq_len, device)
    explanation = explainer.explain_instance(raw_text, predict_fn, num_features=num_features)
    explanation.show_in_notebook(text=raw_text)
    fig = explanation.as_pyplot_figure()
    plt.title("LIME Explanation")
    plt.show()
    return explanation

def explain_prediction(model, text, vocab, max_seq_len, device='cpu', num_features=10, class_names=["Negative", "Positive"]):
    explainer = LimeTextExplainer(class_names=class_names)
    predict_fn = lambda texts: predict_proba_model(texts, model, vocab, max_seq_len, device)
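    # Request both labels: by default LIME only explains label 1, so as_list(label=0) would raise KeyError.
    explanation = explainer.explain_instance(text, predict_fn, num_features=num_features, labels=(0, 1))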
    probs = predict_fn([text])[0]
    pred_class = class_names[1] if probs[1] >= 0.5 else class_names[0]
    probability = probs[1] if pred_class == class_names[1] else probs[0]
    return explanation, pred_class, probability

def compare_all_models_lime(text_sample, trained_models, vocab, max_seq_len, device='cpu'):
    """Compare LIME explanations across all provided model architectures."""
    fig, axes = plt.subplots(len(trained_models), 2, figsize=(15, 5 * len(trained_models)))
    model_names = list(trained_models.keys())
    
    for i, model_name in enumerate(model_names):
        model = trained_models[model_name]
        exp, class_name, probability = explain_prediction(model, text_sample, vocab, max_seq_len, device=device)
        axes[i, 0].bar(['Negative', 'Positive'],
                       [1 - probability, probability] if class_name == "Positive" else [probability, 1 - probability],
                       color=['#3498db', '#e74c3c'])
        axes[i, 0].set_title(f"{model_name}: {class_name} ({probability:.2f})")
        axes[i, 0].set_ylim(0, 1.0)
        
        desired_label = 0 if class_name == "Negative" else 1
        try:
            words_weights = exp.as_list(label=desired_label)
        except KeyError:
            available_labels = list(exp.local_exp.keys())
            if available_labels:
                print(f"Warning: Using label {available_labels[0]} instead of {desired_label} for {model_name}")
                words_weights = exp.as_list(label=available_labels[0])
            else:
                words_weights = []
        
        sorted_indices = np.argsort([abs(w[1]) for w in words_weights])[::-1][:10]
        words = [words_weights[j][0] for j in sorted_indices]
        weights = [words_weights[j][1] for j in sorted_indices]
        colors = ['green' if w > 0 else 'red' for w in weights]
        y_pos = np.arange(len(words))
        axes[i, 1].barh(y_pos, weights, color=colors)
        axes[i, 1].set_yticks(y_pos)
        axes[i, 1].set_yticklabels(words)
        axes[i, 1].set_title(f"Top Words influencing {model_name} prediction")
    
    plt.tight_layout()
    plt.savefig('model_explanations_comparison.png', dpi=300, bbox_inches='tight')
    plt.show()
    return fig

import html
import string
from IPython.display import HTML, display

def create_enhanced_lime_visualization(text, exp, class_idx, model_name):
    """Create an enhanced visualization for a LIME explanation."""
    # LIME keys its weights by the words as they appear in the text, so match on
    # the original casing with surrounding punctuation stripped.
    weight_by_word = dict(exp.as_list(label=class_idx))
    tokens = text.split()
    word_data = []
    for word in tokens:
        weight = weight_by_word.get(word.strip(string.punctuation), 0)
        word_data.append({
            'word': word,
            'weight': weight,
            'color': 'green' if weight > 0 else 'red' if weight < 0 else 'gray',
            'abs_weight': abs(weight)
        })
    plt.figure(figsize=(15, 6))
    words_df = pd.DataFrame(word_data)
    top_words = words_df.sort_values('abs_weight', ascending=False).head(15)
    plt.barh(range(len(top_words)), top_words['weight'], color=top_words['color'].tolist(), alpha=0.8)
    plt.yticks(range(len(top_words)), top_words['word'])
    plt.xlabel('Weight / Influence')
    plt.title(f'Top Influential Words - {model_name}')
    plt.grid(axis='x', linestyle='--', alpha=0.6)
    plt.tight_layout()
    plt.savefig(f'enhanced_lime_{model_name.lower().replace(" ", "_")}.png', dpi=300, bbox_inches='tight')
    plt.show()

    # Matplotlib does not render HTML markup, so show the highlighted review as notebook HTML.
    colormap = plt.cm.RdYlGn
    max_weight = max((w['abs_weight'] for w in word_data), default=0)
    spans = []
    for w in word_data:
        weight = w['weight'] / max_weight if max_weight > 0 else 0
        color = colormap((weight + 1) / 2)
        rgb = f'rgb({int(color[0]*255)},{int(color[1]*255)},{int(color[2]*255)})'
        size = 10 + abs(weight) * 6
        spans.append(f'<span style="background-color:{rgb}; font-size:{size}pt">{html.escape(w["word"])}</span>')
    display(HTML(' '.join(spans)))

def analyze_test_cases(test_reviews, trained_models, vocab, max_seq_len, device='cpu'):
    for i, review in enumerate(test_reviews):
        print(f"\n=== Test Case {i+1} ===")
        print(review)
        compare_all_models_lime(review, trained_models, vocab, max_seq_len, device=device)
        print("\nAnalysis: (Add your detailed discussion here about model agreements/disagreements.)\n")

def analyze_word_importance_patterns(test_corpus, models, vocab, max_seq_len, device='cpu'):
    """
    Analyze which words are consistently important across different models.
    
    For each text in the test corpus, get the LIME explanation from each model.
    If the desired label is not found in the explanation (KeyError), try the alternative label.
    Then aggregate the importance scores across models and plot the top words.
    """
    word_importance = {}
    for text in test_corpus:
        for model_name, model in models.items():
            exp, class_name, _ = explain_prediction(model, text, vocab, max_seq_len, device=device)
            class_idx = 0 if class_name == "Negative" else 1
            try:
                words_weights = exp.as_list(label=class_idx)
            except KeyError:
                alternative_label = 1 - class_idx
                try:
                    words_weights = exp.as_list(label=alternative_label)
                    print(f"Warning: Using alternative label {alternative_label} for model {model_name} on text: '{text[:50]}...'")
                except KeyError:
                    print(f"Warning: No label found for model {model_name} for text: '{text[:50]}...'")
                    words_weights = []
            for word, weight in words_weights:
                if word not in word_importance:
                    word_importance[word] = {model_name: []}
                elif model_name not in word_importance[word]:
                    word_importance[word][model_name] = []
                word_importance[word][model_name].append(weight)
    aggregated_importance = {}
    for word, model_scores in word_importance.items():
        aggregated_importance[word] = {}
        for model, weights in model_scores.items():
            if weights:
                aggregated_importance[word][model] = sum(weights) / len(weights)
    importance_scores = []
    for word, model_scores in aggregated_importance.items():
        # Only include words that appeared for every model.
        if len(model_scores) == len(models):
            avg_score = sum(model_scores.values()) / len(model_scores)
            std_score = np.std(list(model_scores.values()))
            importance_scores.append((word, avg_score, std_score))
    importance_scores.sort(key=lambda x: abs(x[1]), reverse=True)
    top_words = importance_scores[:20]
    plt.figure(figsize=(12, 8))
    words = [w[0] for w in top_words]
    scores = [w[1] for w in top_words]
    errors = [w[2] for w in top_words]
    colors = ['green' if s > 0 else 'red' for s in scores]
    y_pos = np.arange(len(words))
    plt.barh(y_pos, scores, xerr=errors, color=colors, alpha=0.7)
    plt.yticks(y_pos, words)
    plt.xlabel('Average Importance Score (with std dev)')
    plt.title('Top Important Words Across All Models')
    plt.grid(axis='x', linestyle='--', alpha=0.6)
    plt.tight_layout()
    plt.savefig('word_importance_patterns.png', dpi=300)
    plt.show()
    return aggregated_importance
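
The aggregation step in `analyze_word_importance_patterns` first averages each word's LIME weights per model, then keeps only words that every model scored, and finally reports the mean and standard deviation across models. The sketch below walks through that logic with made-up weights. The word lists, model names, and numbers are hypothetical, and `statistics.pstdev` stands in for `np.std`, whose default is also the population standard deviation.

```python
from statistics import mean, pstdev

# Hypothetical per-model LIME weights collected over several reviews.
word_importance = {
    "great": {"BoW": [0.30, 0.20], "LSTM": [0.40]},
    "broke": {"BoW": [-0.25], "LSTM": [-0.35, -0.15]},
    "price": {"BoW": [0.05]},  # only one model scored this word
}
model_names = ["BoW", "LSTM"]

def summarize(word_importance, model_names):
    """Return [(word, mean_across_models, std_across_models)], strongest first."""
    rows = []
    for word, per_model in word_importance.items():
        # Average repeated occurrences within each model.
        per_model_mean = {m: mean(ws) for m, ws in per_model.items() if ws}
        # Keep only words that every model assigned a weight to.
        if len(per_model_mean) == len(model_names):
            values = list(per_model_mean.values())
            rows.append((word, mean(values), pstdev(values)))
    rows.sort(key=lambda r: abs(r[1]), reverse=True)
    return rows

summary = summarize(word_importance, model_names)
for word, avg, std in summary:
    print(f"{word}: {avg:+.3f} +/- {std:.3f}")
```

Requiring a word to appear for every model keeps the comparison fair but can discard most words on a small test corpus, so the number of reviews analyzed matters.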



## 9. Conclusion

In this tutorial, we built and compared four sentiment analysis models on a subset of the Amazon Reviews dataset: a Bag-of-Words + MLP, an LSTM, a CNN, and an LSTM with an attention mechanism. Along the way we covered data preprocessing and vocabulary building. The hyperparameter experiments examined how embedding dimension, learning rate, dropout rate, and hidden dimension affect validation accuracy and training time. Evaluation with accuracy, precision, recall, and F1, together with learning curves, attention plots, and LIME explanations, provided insight into each model’s trade-offs.



## 10. References

- Hugging Face Datasets: [https://huggingface.co/docs/datasets](https://huggingface.co/docs/datasets)
- PyTorch Documentation: [https://pytorch.org/docs/stable/index.html](https://pytorch.org/docs/stable/index.html)
- Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. *arXiv preprint arXiv:1810.04805*.



## Final Execution


In [None]:
if __name__ == "__main__":
    # Define the main experiment: train and evaluate all four models.
    def run_experiments(fast_mode=True):
        embed_dim = 64 if fast_mode else 128
        hidden_dim = 64 if fast_mode else 128
        num_layers = 1
        num_classes = 2
        num_epochs = 3 if fast_mode else 5
        batch_size = 128
        max_seq_len_local = 50 if fast_mode else 100
        print("Precomputing training sequences...")
        train_sequences, train_labels = precompute_sequences(small_train, vocab, max_len=max_seq_len_local)
        print("Precomputing test sequences...")
        test_sequences, test_labels = precompute_sequences(small_test, vocab, max_len=max_seq_len_local)
        train_dataset_main = AmazonReviewsDataset(train_sequences, train_labels)
        test_dataset_main = AmazonReviewsDataset(test_sequences, test_labels)
        num_workers = 0
        train_loader_main = DataLoader(train_dataset_main, batch_size=batch_size, shuffle=True, num_workers=num_workers, pin_memory=True)
        test_loader_main = DataLoader(test_dataset_main, batch_size=batch_size, num_workers=num_workers, pin_memory=True)
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        print(f"Using device: {device}")
        train_bow = True
        train_lstm = True
        train_cnn = True
        train_attn = True
        if fast_mode:
            subset_size = 10000
            print(f"Using a small subset of {subset_size} examples for extra fast mode")
            subset_indices = torch.randperm(len(train_dataset_main))[:subset_size]
            subset_dataset = torch.utils.data.Subset(train_dataset_main, subset_indices)
            train_loader_main = DataLoader(subset_dataset, batch_size=batch_size, shuffle=True, num_workers=num_workers, pin_memory=True)
        model_results = {}
        train_loss_dict = {}
        val_loss_dict = {}
        if train_bow:
            print("\n=== Training Bag-of-Words + MLP model ===")
            bow_model = BoWModel(vocab_size, embed_dim, hidden_dim, num_classes)
            bow_train_losses, bow_val_losses = train_model(bow_model, train_loader_main, test_loader_main, num_epochs, lr=1e-3, device=device)
            train_loss_dict['BoW'] = bow_train_losses
            val_loss_dict['BoW'] = bow_val_losses
            bow_results = evaluate_model(bow_model, test_loader_main, device=device)
            model_results['BoW'] = bow_results
        if train_lstm:
            print("\n=== Training LSTM model ===")
            lstm_model = LSTMModel(vocab_size, embed_dim, hidden_dim, num_layers, num_classes, bidirectional=True)
            lstm_train_losses, lstm_val_losses = train_model(lstm_model, train_loader_main, test_loader_main, num_epochs, lr=1e-3, device=device)
            train_loss_dict['LSTM'] = lstm_train_losses
            val_loss_dict['LSTM'] = lstm_val_losses
            lstm_results = evaluate_model(lstm_model, test_loader_main, device=device)
            model_results['LSTM'] = lstm_results
        if train_cnn:
            print("\n=== Training CNN model ===")
            cnn_model = CNNModel(vocab_size, embed_dim, num_classes)
            cnn_train_losses, cnn_val_losses = train_model(cnn_model, train_loader_main, test_loader_main, num_epochs, lr=1e-3, device=device)
            train_loss_dict['CNN'] = cnn_train_losses
            val_loss_dict['CNN'] = cnn_val_losses
            cnn_results = evaluate_model(cnn_model, test_loader_main, device=device)
            model_results['CNN'] = cnn_results
        if train_attn:
            print("\n=== Training LSTM with Attention model ===")
            attn_model = AttnLSTMModel(vocab_size, embed_dim, hidden_dim, num_layers, num_classes, bidirectional=True)
            attn_train_losses, attn_val_losses = train_model(attn_model, train_loader_main, test_loader_main, num_epochs, lr=1e-3, use_attention=True, device=device)
            train_loss_dict['LSTM+Attention'] = attn_train_losses
            val_loss_dict['LSTM+Attention'] = attn_val_losses
            attn_results = evaluate_model(attn_model, test_loader_main, device=device, use_attention=True)
            model_results['LSTM+Attention'] = attn_results
        if len(model_results) > 1:
            print("\n=== Model Comparison ===")
            plot_model_comparison(model_results)
            plot_multi_metric_comparison(model_results)
            plt.figure(figsize=(12, 8))
            plt.subplot(1, 2, 1)
            for model_name, losses in train_loss_dict.items():
                # Use each history's own length in case a run recorded fewer epochs.
                plt.plot(range(1, len(losses) + 1), losses, 'o-', label=model_name)
            plt.title('Training Loss Comparison')
            plt.xlabel('Epochs')
            plt.ylabel('Loss')
            plt.legend()
            plt.grid(True)
            plt.subplot(1, 2, 2)
            for model_name, losses in val_loss_dict.items():
                plt.plot(range(1, len(losses) + 1), losses, 'o-', label=model_name)
            plt.title('Validation Loss Comparison')
            plt.xlabel('Epochs')
            plt.ylabel('Loss')
            plt.legend()
            plt.grid(True)
            plt.tight_layout()
            plt.savefig('model_comparison_learning_curves.png', dpi=300)
            plt.show()
        return model_results

    print("\n====== Starting Main Experiments ======")
    model_results = run_experiments(fast_mode=True)
    
    print("\n====== Starting Hyperparameter Experimentation ======")
    best_configs = run_hyperparameter_study()
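
In fast mode the training subset is drawn with `torch.randperm`, so the selected examples change between runs unless the random state is fixed. The sketch below illustrates the idea with the standard library only. The helper `sample_subset_indices` and the dataset size of 50,000 are hypothetical; in the PyTorch code a comparable effect can be obtained by calling `torch.manual_seed(...)` before `torch.randperm`.

```python
import random

def sample_subset_indices(dataset_size, subset_size, seed=42):
    """Return a reproducible list of distinct indices into a dataset."""
    # Never request more examples than the dataset contains.
    subset_size = min(subset_size, dataset_size)
    rng = random.Random(seed)  # local generator; global random state is untouched
    return rng.sample(range(dataset_size), subset_size)

# Hypothetical sizes, chosen only for the example.
first = sample_subset_indices(dataset_size=50_000, subset_size=10_000)
second = sample_subset_indices(dataset_size=50_000, subset_size=10_000)
print(first == second)    # same seed, same subset
print(len(set(first)))    # number of distinct indices drawn
```

Fixing the seed makes accuracy differences between models easier to attribute to the architecture, since every run then trains on the same examples.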
    

In [None]:
print("\n====== Model Explainability with LIME ======")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the BoW model from file (adjust filename as needed)
bow_model_explain = BoWModel(vocab_size, 64, 64, num_classes=2)
if os.path.exists("best_bowmodel_model.pt"):
    bow_model_explain.load_state_dict(torch.load("best_bowmodel_model.pt", map_location=device))
else:
    print("Warning: best_bowmodel_model.pt not found. Using untrained BoW model.")
bow_model_explain = bow_model_explain.to(device)

# Load the LSTM model from file (if available)
lstm_model_explain = LSTMModel(vocab_size, 64, 64, num_layers=1, num_classes=2, bidirectional=True)
if os.path.exists("best_lstmmodel_model.pt"):
    lstm_model_explain.load_state_dict(torch.load("best_lstmmodel_model.pt", map_location=device))
else:
    print("Warning: best_lstmmodel_model.pt not found. Using untrained LSTM model.")
lstm_model_explain = lstm_model_explain.to(device)

# Load the CNN model from file (if available)
cnn_model_explain = CNNModel(vocab_size, 64, num_classes=2, dropout=0.2)
if os.path.exists("best_cnnmodel_model.pt"):
    cnn_model_explain.load_state_dict(torch.load("best_cnnmodel_model.pt", map_location=device))
else:
    print("Warning: best_cnnmodel_model.pt not found. Using untrained CNN model.")
cnn_model_explain = cnn_model_explain.to(device)

# Load the LSTM+Attention model from file
attn_model_explain = AttnLSTMModel(vocab_size, 64, 64, num_layers=1, num_classes=2, bidirectional=True)
if os.path.exists("best_attnlstmmodel_model.pt"):
    attn_model_explain.load_state_dict(torch.load("best_attnlstmmodel_model.pt", map_location=device))
else:
    print("Warning: best_attnlstmmodel_model.pt not found. Using untrained LSTM+Attention model.")
attn_model_explain = attn_model_explain.to(device)

# Define dictionary of all trained (or loaded) models for comparison.
trained_models = {
    "BoW Model": bow_model_explain,
    "LSTM Model": lstm_model_explain,
    "CNN Model": cnn_model_explain,
    "LSTM+Attention Model": attn_model_explain
}

# Choose a sample review from the test set for explanation.
sample_review = small_test[0]['content']
print("Sample Review for Explanation:")
print(sample_review)

# Generate LIME explanation for the BoW model.
explanation_bow = explain_instance(bow_model_explain, sample_review, vocab, max_seq_len, device)

# Generate LIME explanation for the LSTM+Attention model.
explanation_attn = explain_instance(attn_model_explain, sample_review, vocab, max_seq_len, device)

# ------------------------------
# Comparative LIME Analysis Across Models
# ------------------------------
print("\n====== Comparative LIME Analysis Across Models ======")
compare_all_models_lime(sample_review, trained_models, vocab, max_seq_len, device)

# ------------------------------
# Analyze Challenging Test Cases
# ------------------------------
print("\n====== Analyzing Challenging Test Cases ======")
test_reviews = [
    "This product is amazing! I absolutely love it and would recommend to everyone.",
    "Terrible product. Broke after two days and customer service was awful.",
    "The product has some good features but overall I'm disappointed with its performance. The battery life is excellent, but the software is buggy.",
    "I was expecting more from this product given its price point. It functions adequately.",
    "Oh great, another product that breaks after a week. Just what I needed!"
]
analyze_test_cases(test_reviews, trained_models, vocab, max_seq_len, device)

# ------------------------------
# Enhanced LIME Visualization for BoW Model
# ------------------------------
print("\n====== Enhanced LIME Visualization for BoW Model ======")
exp_bow = explain_instance(trained_models["BoW Model"], sample_review, vocab, max_seq_len, device)
# Adjust class index based on prediction (here we assume positive, so class_idx=1)
create_enhanced_lime_visualization(sample_review, exp_bow, class_idx=1, model_name="BoW Model")

# ------------------------------
# Analyze Overall Word Importance Patterns Across Models
# ------------------------------
print("\n====== Analyzing Word Importance Patterns ======")
test_corpus = [small_test[i]['content'] for i in range(10)]
word_importance_patterns = analyze_word_importance_patterns(test_corpus, trained_models, vocab, max_seq_len, device)
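
The checkpoint filenames checked above (`best_bowmodel_model.pt`, `best_lstmmodel_model.pt`, `best_attnlstmmodel_model.pt`) look as if they are derived from the lowercase class name. Assuming that pattern, which is an inference from the filenames and is not confirmed by the training code shown here, a small helper can build the expected name and report whether the file exists. The helper names `checkpoint_name` and `find_checkpoint` are hypothetical.

```python
import os

def checkpoint_name(model):
    """Guess the checkpoint filename from the model's class name (assumed pattern)."""
    return f"best_{type(model).__name__.lower()}_model.pt"

def find_checkpoint(model, directory="."):
    """Return the checkpoint path if it exists, else None."""
    path = os.path.join(directory, checkpoint_name(model))
    return path if os.path.exists(path) else None

# Stand-in class so the example runs without PyTorch.
class BoWModel:
    pass

print(checkpoint_name(BoWModel()))
print(find_checkpoint(BoWModel(), directory="/nonexistent-dir"))
```

Deriving the name in one place avoids mismatches between the filename that is checked and the one reported in the warning message.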

Note: the checkpoints appear to be saved as `best_lstmmodel_model.pt` and `best_cnnmodel.pt`. If so, adjust the filenames in the loading code above to match.