# Bidirectional LSTM for Sentiment Analysis

## Overview
This notebook implements a **Bidirectional LSTM** for multiclass sentiment analysis (Negative, Neutral, Positive).

### Key Features:
- **Word2Vec embeddings** (frozen) for semantic representations
- **Bidirectional LSTM** to capture context from both directions
- **Attention mechanism** to focus on important tokens
- **Regularization techniques**: dropout, word dropout, weight decay
- **Class weights** to handle imbalanced data
- **Full 9-epoch training** with detailed performance analysis

### Model Configuration:
- Embedding dim: 128
- Hidden dim: 96
- Dropout: 0.4
- Word dropout: 0.05
- Weight decay: 3e-3
- Learning rate: 0.001 with ReduceLROnPlateau scheduler

## 1. Imports and Setup

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Download required NLTK data
try:
    nltk.download('punkt', quiet=True)
    nltk.download('stopwords', quiet=True)
    nltk.download('punkt_tab', quiet=True)
except Exception:
    # Downloads can fail offline; continue and rely on locally cached NLTK data
    pass

In [70]:
# Set random seeds for reproducibility
SEED = 42
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
np.random.seed(SEED)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

Using device: cuda
GPU: Tesla T4


## 2. Configuration

In [71]:
class Config:
    # Model — balanced capacity
    EMBEDDING_DIM = 128
    HIDDEN_DIM = 96          # middle ground: 64 was too small, 128 too large
    NUM_LAYERS = 1
    DROPOUT = 0.4
    WORD_DROPOUT = 0.05      # less aggressive: 0.1 was too much
    NUM_CLASSES = 3
    W2V_MIN_ALPHA = 0.0001   # Word2Vec final learning rate (train_word2vec hardcodes the same value)

    # Training
    BATCH_SIZE = 64
    LEARNING_RATE = 0.001
    NUM_EPOCHS = 9
    MAX_SEQ_LENGTH = 64
    VOCAB_SIZE = 15000

    # Word2Vec
    W2V_SIZE = 128
    W2V_WINDOW = 5
    W2V_MIN_COUNT = 1
    W2V_WORKERS = 4

    # Regularization
    GRADIENT_CLIP = 1.0
    WEIGHT_DECAY = 3e-3      # balanced: strong enough to prevent overfitting

config = Config()
print("Configuration loaded successfully")

Configuration loaded successfully


## 3. Text Preprocessing Functions

In [None]:
def clean_text(text):
    """Lowercase text and strip URLs, emails, mentions, '#' symbols, and non-letter characters"""
    if pd.isna(text):
        return ""
    
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE) # Remove URLs
    text = re.sub(r'\S+@\S+', '', text)                                     # Remove email addresses
    text = re.sub(r'@\w+', '', text)                                        # Remove mentions
    text = re.sub(r'#', '', text)                                           # Remove hashtags but keep the text
    text = re.sub(r'[^a-zA-Z\s]', '', text)                                 # Remove punctuation and numbers
    text = re.sub(r'\s+', ' ', text).strip()                                # Remove extra whitespace
    
    return text

def tokenize_text(text, remove_stopwords=True):
    """Tokenize and optionally remove stopwords"""
    tokens = word_tokenize(text)
    
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        keep_words = {'not', 'no', 'nor', 'but', 'however', 'yet', 'very', 'too', 
                     'never', 'nothing', 'neither', 'nobody', 'nowhere'}
        stop_words = stop_words - keep_words
        tokens = [word for word in tokens if word not in stop_words]
    
    return tokens

def build_vocab(texts, max_vocab_size):
    """Build vocabulary from texts"""
    word_freq = Counter()
    for text in texts:
        word_freq.update(text)
    
    most_common = word_freq.most_common(max_vocab_size - 2)
    
    word2idx = {'<PAD>': 0, '<UNK>': 1}
    word2idx.update({word: idx + 2 for idx, (word, _) in enumerate(most_common)})
    idx2word = {idx: word for word, idx in word2idx.items()}
    
    return word2idx, idx2word

def text_to_sequence(tokens, word2idx, max_length):
    """Convert tokens to padded sequence of indices"""
    sequence = [word2idx.get(word, word2idx['<UNK>']) for word in tokens]
    
    if len(sequence) < max_length:
        sequence = sequence + [word2idx['<PAD>']] * (max_length - len(sequence))
    else:
        sequence = sequence[:max_length]
    
    return sequence

print("Text preprocessing functions defined")

Text preprocessing functions defined
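
To make the pipeline above concrete, here is a minimal, standard-library-only sketch that runs one made-up example through condensed copies of `clean_text`, `build_vocab`, and `text_to_sequence`. The sample sentence, the toy corpus, and the use of `str.split()` in place of NLTK's `word_tokenize` are assumptions for illustration only; stopword removal is omitted.

```python
import re
from collections import Counter

def clean_text(text):
    """Condensed copy of the notebook's cleaning steps."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)  # URLs
    text = re.sub(r'\S+@\S+', '', text)                  # email addresses
    text = re.sub(r'@\w+', '', text)                     # mentions
    text = re.sub(r'#', '', text)                        # keep hashtag text, drop '#'
    text = re.sub(r'[^a-zA-Z\s]', '', text)              # punctuation and digits
    return re.sub(r'\s+', ' ', text).strip()

def build_vocab(token_lists, max_vocab_size):
    """Keep the most frequent words; indices 0 and 1 are reserved."""
    freq = Counter(tok for toks in token_lists for tok in toks)
    word2idx = {'<PAD>': 0, '<UNK>': 1}
    for word, _ in freq.most_common(max_vocab_size - 2):
        word2idx[word] = len(word2idx)
    return word2idx

def text_to_sequence(tokens, word2idx, max_length):
    """Map tokens to indices, truncate, then right-pad with <PAD> (0)."""
    seq = [word2idx.get(tok, word2idx['<UNK>']) for tok in tokens][:max_length]
    return seq + [word2idx['<PAD>']] * (max_length - len(seq))

raw = "Loving the new phone!!! http://t.co/abc @bob #happy 100%"
cleaned = clean_text(raw)
tokens = cleaned.split()  # stand-in for word_tokenize (assumption)

# Toy corpus: 'phone' occurs 4 times and 'loving' twice, so they fill the 2 free slots
corpus = [tokens, ['phone', 'broke', 'phone'], ['loving', 'phone']]
word2idx = build_vocab(corpus, max_vocab_size=4)
seq = text_to_sequence(tokens, word2idx, max_length=8)

print(cleaned)   # loving the new phone happy
print(word2idx)  # {'<PAD>': 0, '<UNK>': 1, 'phone': 2, 'loving': 3}
print(seq)       # [3, 1, 1, 2, 1, 0, 0, 0]
```

Words outside the truncated vocabulary collapse to `<UNK>` (index 1), and short sequences are padded on the right, which is why the model reserves `padding_idx=0`.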


## 4. Dataset Class

In [73]:
class SentimentDataset(Dataset):
    def __init__(self, sequences, labels):
        self.sequences = torch.LongTensor(sequences)
        self.labels = torch.LongTensor(labels)

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]

print("Dataset class defined")

Dataset class defined


## 5. Model Architecture

### Components:
1. **Embedding Layer**: Frozen Word2Vec embeddings (128-dim)
2. **Bidirectional LSTM**: 96 hidden units, captures context from both directions
3. **Attention Layer**: Learns to focus on important tokens in the sequence
4. **Classifier**: Two-layer feedforward network with dropout

In [74]:
class AttentionLayer(nn.Module):
    """Attention pooling over LSTM outputs: a single linear layer scores each
    time step (padding positions are not masked)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.attention = nn.Linear(hidden_dim * 2, 1)

    def forward(self, lstm_output):
        weights = torch.softmax(self.attention(lstm_output), dim=1)
        context = torch.sum(weights * lstm_output, dim=1)
        return context


class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM + Attention — kept deliberately small."""
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers,
                 num_classes, dropout, pretrained_embeddings=None,
                 word_dropout=0.0):
        super().__init__()

        self.word_dropout = word_dropout

        # Embeddings are frozen — W2V coverage is ~100%, no need to fine-tune
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        if pretrained_embeddings is not None:
            self.embedding.weight = nn.Parameter(pretrained_embeddings)
        self.embedding.weight.requires_grad = False  # permanently frozen

        self.lstm = nn.LSTM(
            embedding_dim, hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=dropout if num_layers > 1 else 0,
        )

        self.attention = AttentionLayer(hidden_dim)
        self.dropout = nn.Dropout(dropout)

        self.fc = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        # Word dropout: randomly replace tokens with <UNK> (idx 1) during training
        if self.training and self.word_dropout > 0:
            mask = torch.bernoulli(
                torch.full_like(x, 1.0 - self.word_dropout, dtype=torch.float)
            ).long()
            # Keep padding (idx 0) untouched
            pad_mask = (x != 0).long()
            x = x * mask + pad_mask * (1 - mask)  # replace with 1 (<UNK>)

        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        context = self.attention(lstm_out)
        return self.fc(self.dropout(context))

print("Model architecture defined")

Model architecture defined
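
The word-dropout arithmetic in `forward` (`x * mask + pad_mask * (1 - mask)`) can be checked in isolation. The sketch below reproduces it in NumPy on a made-up batch; the token ids and the exaggerated rate of 0.5 are assumptions chosen so the effect is visible (the notebook uses 0.05).

```python
import numpy as np

PAD, UNK = 0, 1
rng = np.random.default_rng(42)

# Two made-up padded sequences of token ids
x = np.array([[5, 8, 13, 21, 0, 0],
              [34, 55, 0, 0, 0, 0]])
word_dropout = 0.5  # exaggerated for the demo

keep = (rng.random(x.shape) >= word_dropout).astype(x.dtype)  # 1 = keep token
not_pad = (x != PAD).astype(x.dtype)                          # 1 = real token

# Kept tokens: x * 1 + not_pad * 0 = x
# Dropped real tokens: x * 0 + 1 * 1 = 1 (<UNK>)
# Dropped padding: 0 * 0 + 0 * 1 = 0 (<PAD> stays <PAD>)
dropped = x * keep + not_pad * (1 - keep)

print(x)
print(dropped)
```

The key property is that padding is never turned into `<UNK>`, so sequence lengths as seen through `padding_idx=0` are unchanged.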


## 6. Word2Vec and Embedding Functions

In [75]:
def train_word2vec(tokenized_texts, config):
    """Train skip-gram Word2Vec embeddings (negative sampling, 15 epochs)"""
    print("Training Word2Vec embeddings...")
    w2v_model = Word2Vec(
        sentences=tokenized_texts,
        vector_size=config.W2V_SIZE,
        window=config.W2V_WINDOW,
        min_count=config.W2V_MIN_COUNT,
        workers=config.W2V_WORKERS,
        sg=1,
        epochs=15,  # More epochs for better embeddings
        negative=10,
        alpha=0.025,
        min_alpha=0.0001,
        seed=SEED
    )
    return w2v_model

def create_embedding_matrix(word2idx, w2v_model, embedding_dim):
    """Create embedding matrix from Word2Vec model"""
    vocab_size = len(word2idx)
    embedding_matrix = np.random.randn(vocab_size, embedding_dim) * 0.01
    
    embedding_matrix[0] = np.zeros(embedding_dim)
    
    found = 0
    for word, idx in word2idx.items():
        if word in w2v_model.wv:
            embedding_matrix[idx] = w2v_model.wv[word]
            found += 1
    
    print(f"Found {found}/{vocab_size} words in Word2Vec model ({found/vocab_size*100:.2f}%)")
    return torch.FloatTensor(embedding_matrix)

def compute_class_weights(labels):
    """Compute class weights for imbalanced data"""
    class_counts = np.bincount(labels)
    total = len(labels)
    weights = total / (len(class_counts) * class_counts)
    return torch.FloatTensor(weights)

print("Word2Vec and embedding functions defined")

Word2Vec and embedding functions defined


## 7. Training and Evaluation Functions

In [76]:
def train_epoch(model, dataloader, criterion, optimizer, device, gradient_clip):
    """Train for one epoch."""
    model.train()
    total_loss = 0
    predictions, true_labels = [], []

    for sequences, labels in dataloader:
        sequences, labels = sequences.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(sequences)
        loss = criterion(outputs, labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), gradient_clip)
        optimizer.step()

        total_loss += loss.item()
        _, preds = torch.max(outputs, 1)
        predictions.extend(preds.cpu().numpy())
        true_labels.extend(labels.cpu().numpy())

    avg_loss = total_loss / len(dataloader)
    accuracy = accuracy_score(true_labels, predictions)
    f1 = f1_score(true_labels, predictions, average='weighted')
    return avg_loss, accuracy, f1

def evaluate(model, dataloader, criterion, device):
    """Evaluate the model"""
    model.eval()
    total_loss = 0
    predictions = []
    true_labels = []
    
    with torch.no_grad():
        for sequences, labels in dataloader:
            sequences, labels = sequences.to(device), labels.to(device)
            
            outputs = model(sequences)
            loss = criterion(outputs, labels)
            
            total_loss += loss.item()
            
            _, preds = torch.max(outputs, 1)
            predictions.extend(preds.cpu().numpy())
            true_labels.extend(labels.cpu().numpy())
    
    avg_loss = total_loss / len(dataloader)
    accuracy = accuracy_score(true_labels, predictions)
    f1 = f1_score(true_labels, predictions, average='weighted')
    
    return avg_loss, accuracy, f1, predictions, true_labels

print("Training and evaluation functions defined")

Training and evaluation functions defined


## 8. Load and Preprocess Data

In [77]:
print("=" * 80)
print("Bidirectional LSTM with Attention — Regularized")
print("=" * 80)

# Load dataset
print("\n Loading dataset...")
try:
    from datasets import load_dataset
    dataset = load_dataset("Sp1786/multiclass-sentiment-analysis-dataset")
    train_df = pd.DataFrame(dataset['train'])
    val_df = pd.DataFrame(dataset['validation'])
    test_df = pd.DataFrame(dataset['test'])
except Exception as e:
    print(f"HuggingFace load failed ({e}), falling back to CSV...")
    train_df = pd.read_csv('train_df.csv')
    val_df = pd.read_csv('val_df.csv')
    test_df = pd.read_csv('test_df.csv')

print(f"Train: {len(train_df)}  Val: {len(val_df)}  Test: {len(test_df)}")
print("\nClass distribution (train):")
print(train_df['label'].value_counts().sort_index())

Bidirectional LSTM with Attention — Regularized

 Loading dataset...
Train: 31232  Val: 5205  Test: 5206

Class distribution (train):
label
0     9105
1    11649
2    10478
Name: count, dtype: int64


In [78]:
# Preprocess
print("\n Preprocessing & tokenizing...")
for df in (train_df, val_df, test_df):
    df['text_clean'] = df['text'].apply(clean_text)

train_tokens = train_df['text_clean'].apply(tokenize_text).tolist()
val_tokens = val_df['text_clean'].apply(tokenize_text).tolist()
test_tokens = test_df['text_clean'].apply(tokenize_text).tolist()

print(f"Tokenization complete. Sample tokens: {train_tokens[0][:10]}")


 Preprocessing & tokenizing...
Tokenization complete. Sample tokens: ['cooking', 'microwave', 'pizzas', 'yummy']


## 9. Build Vocabulary and Train Word2Vec

In [79]:
# Build vocabulary (train only)
print(" Building vocabulary...")
word2idx, idx2word = build_vocab(train_tokens, config.VOCAB_SIZE)
print(f"Vocabulary size: {len(word2idx)}")

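# Train Word2Vec on all splits (train + val + test). The vocabulary comes from
# train only, so this exposes the embeddings to unlabeled val/test text;
# pass train_tokens instead for a stricter evaluation.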
all_tokens = train_tokens + val_tokens + test_tokens
w2v_model = train_word2vec(all_tokens, config)

 Building vocabulary...
Vocabulary size: 15000
Training Word2Vec embeddings...


In [80]:
print(" Creating embedding matrix...")
embedding_matrix = create_embedding_matrix(word2idx, w2v_model, config.EMBEDDING_DIM)
print(f"Embedding matrix shape: {embedding_matrix.shape}")

 Creating embedding matrix...
Found 14998/15000 words in Word2Vec model (99.99%)
Embedding matrix shape: torch.Size([15000, 128])


## 10. Create Datasets and DataLoaders

In [81]:
# Convert to sequences
print(" Converting to sequences...")
train_sequences = [text_to_sequence(t, word2idx, config.MAX_SEQ_LENGTH) for t in train_tokens]
val_sequences = [text_to_sequence(t, word2idx, config.MAX_SEQ_LENGTH) for t in val_tokens]
test_sequences = [text_to_sequence(t, word2idx, config.MAX_SEQ_LENGTH) for t in test_tokens]

train_dataset = SentimentDataset(train_sequences, train_df['label'].values)
val_dataset = SentimentDataset(val_sequences, val_df['label'].values)
test_dataset = SentimentDataset(test_sequences, test_df['label'].values)

train_loader = DataLoader(train_dataset, batch_size=config.BATCH_SIZE,
                          shuffle=True, num_workers=2, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=config.BATCH_SIZE,
                        shuffle=False, num_workers=2, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=config.BATCH_SIZE,
                         shuffle=False, num_workers=2, pin_memory=True)

print(f"DataLoaders created: {len(train_loader)} train batches, {len(val_loader)} val batches")

 Converting to sequences...
DataLoaders created: 488 train batches, 82 val batches


## 11. Initialize Model and Training Components

In [82]:
# Initialize model
print("\n Initializing model...")
model = BiLSTMClassifier(
    vocab_size=len(word2idx),
    embedding_dim=config.EMBEDDING_DIM,
    hidden_dim=config.HIDDEN_DIM,
    num_layers=config.NUM_LAYERS,
    num_classes=config.NUM_CLASSES,
    dropout=config.DROPOUT,
    pretrained_embeddings=embedding_matrix,
    word_dropout=config.WORD_DROPOUT,
).to(device)

print(model)
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params:,}  (trainable: {trainable_params:,})")


 Initializing model...
BiLSTMClassifier(
  (embedding): Embedding(15000, 128, padding_idx=0)
  (lstm): LSTM(128, 96, batch_first=True, bidirectional=True)
  (attention): AttentionLayer(
    (attention): Linear(in_features=192, out_features=1, bias=True)
  )
  (dropout): Dropout(p=0.4, inplace=False)
  (fc): Sequential(
    (0): Linear(in_features=192, out_features=96, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.4, inplace=False)
    (3): Linear(in_features=96, out_features=3, bias=True)
  )
)

Total parameters: 2,112,580  (trainable: 192,580)
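
The parameter counts printed above can be reproduced by hand from the layer shapes. The short calculation below is a sanity check using only the dimensions shown in the model summary; the variable names are illustrative.

```python
vocab_size, emb_dim, hidden, num_classes = 15000, 128, 96, 3

# Frozen embedding matrix
embedding = vocab_size * emb_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights,
# and two bias vectors (input-hidden and hidden-hidden)
lstm_one_direction = 4 * (emb_dim * hidden + hidden * hidden + 2 * hidden)
lstm = 2 * lstm_one_direction  # bidirectional

# Attention: Linear(192 -> 1)
attention = 2 * hidden + 1

# Classifier: Linear(192 -> 96) followed by Linear(96 -> 3)
fc = (2 * hidden * hidden + hidden) + (hidden * num_classes + num_classes)

trainable = lstm + attention + fc
total = embedding + trainable

print(f"embedding: {embedding:,}")  # embedding: 1,920,000
print(f"lstm:      {lstm:,}")       # lstm:      173,568
print(f"attention: {attention:,}")  # attention: 193
print(f"fc:        {fc:,}")         # fc:        18,819
print(f"trainable: {trainable:,}")  # trainable: 192,580
print(f"total:     {total:,}")      # total:     2,112,580
```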


In [83]:
# Class weights for imbalanced data
class_weights = compute_class_weights(train_df['label'].values).to(device)
print(f"Class weights: {class_weights}")

criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = optim.AdamW(model.parameters(), lr=config.LEARNING_RATE,
                        weight_decay=config.WEIGHT_DECAY)

# ReduceLROnPlateau — only reduces LR when val loss stops improving
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=3, min_lr=1e-6
)

print("Optimizer and scheduler initialized")

Class weights: tensor([1.1434, 0.8937, 0.9936], device='cuda:0')
Optimizer and scheduler initialized
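
The class weights printed above follow directly from the training class counts via `total / (num_classes * count)`. A small standard-library check (the variable names are illustrative) also shows the effect of the weighting: each class ends up contributing the same weighted mass to the loss.

```python
counts = [9105, 11649, 10478]  # Negative, Neutral, Positive (train split)
total = sum(counts)
num_classes = len(counts)

weights = [total / (num_classes * c) for c in counts]
print([round(w, 4) for w in weights])  # [1.1434, 0.8937, 0.9936]

# Weighted mass per class is identical: total / num_classes
weighted_mass = [c * w for c, w in zip(counts, weights)]
print([round(m, 2) for m in weighted_mass])  # [10410.67, 10410.67, 10410.67]
```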


## 12. Training Loop

In [84]:
# Training loop
print("\n Training...")
print("=" * 80)

best_val_f1 = 0  # initialized but not updated below; no checkpoint is saved
history = {'train_loss': [], 'val_loss': [], 'train_acc': [], 'val_acc': [],
           'train_f1': [], 'val_f1': []}

for epoch in range(config.NUM_EPOCHS):
    train_loss, train_acc, train_f1 = train_epoch(
        model, train_loader, criterion, optimizer, device, config.GRADIENT_CLIP
    )
    val_loss, val_acc, val_f1, _, _ = evaluate(model, val_loader, criterion, device)

    history['train_loss'].append(train_loss)
    history['val_loss'].append(val_loss)
    history['train_acc'].append(train_acc)
    history['val_acc'].append(val_acc)
    history['train_f1'].append(train_f1)
    history['val_f1'].append(val_f1)

    print(f"Epoch {epoch+1}/{config.NUM_EPOCHS}")
    print(f"  Train — Loss: {train_loss:.4f}  Acc: {train_acc:.4f}  F1: {train_f1:.4f}")
    print(f"  Val   — Loss: {val_loss:.4f}  Acc: {val_acc:.4f}  F1: {val_f1:.4f}")
    print(f"  LR: {optimizer.param_groups[0]['lr']:.6f}")

    scheduler.step(val_loss)
    print("-" * 80)

print("\nTraining completed!")


 Training...
Epoch 1/9
  Train — Loss: 0.8731  Acc: 0.5790  F1: 0.5751
  Val   — Loss: 0.7593  Acc: 0.6617  F1: 0.6612
  LR: 0.001000
--------------------------------------------------------------------------------
Epoch 2/9
  Train — Loss: 0.7631  Acc: 0.6561  F1: 0.6551
  Val   — Loss: 0.7329  Acc: 0.6774  F1: 0.6780
  LR: 0.001000
--------------------------------------------------------------------------------
Epoch 3/9
  Train — Loss: 0.7382  Acc: 0.6708  F1: 0.6702
  Val   — Loss: 0.7223  Acc: 0.6824  F1: 0.6826
  LR: 0.001000
--------------------------------------------------------------------------------
Epoch 4/9
  Train — Loss: 0.7267  Acc: 0.6796  F1: 0.6796
  Val   — Loss: 0.7109  Acc: 0.6770  F1: 0.6738
  LR: 0.001000
--------------------------------------------------------------------------------
Epoch 5/9
  Train — Loss: 0.7125  Acc: 0.6833  F1: 0.6835
  Val   — Loss: 0.7064  Acc: 0.6895  F1: 0.6896
  LR: 0.001000
---------------------------------------------------------
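
The learning rate stays at 0.001 in the epochs logged above because validation loss fell every epoch. The sketch below is a simplified pure-Python model of the plateau rule, not PyTorch's implementation: `simulate_plateau` is a hypothetical helper, the relative threshold of 1e-4 is assumed to mirror the scheduler's default, and the stalled loss sequence is invented for illustration.

```python
def simulate_plateau(losses, lr=1e-3, factor=0.5, patience=3,
                     min_lr=1e-6, threshold=1e-4):
    """Return the LR in effect after each epoch under a reduce-on-plateau rule."""
    best = float("inf")
    bad_epochs = 0
    lrs = []
    for loss in losses:
        if loss < best * (1 - threshold):  # counts as an improvement
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:          # tolerate `patience` bad epochs, then cut
            lr = max(lr * factor, min_lr)
            bad_epochs = 0
        lrs.append(lr)
    return lrs

# Validation losses from the five epochs logged above: always improving
logged = [0.7593, 0.7329, 0.7223, 0.7109, 0.7064]
lrs_logged = simulate_plateau(logged)
print(lrs_logged)   # [0.001, 0.001, 0.001, 0.001, 0.001]

# Invented sequence where the loss stalls after the first epoch
stalled = [0.70, 0.71, 0.71, 0.71, 0.71, 0.71]
lrs_stalled = simulate_plateau(stalled)
print(lrs_stalled)  # [0.001, 0.001, 0.001, 0.001, 0.0005, 0.0005]
```

With `patience=3`, the rate is only halved on the fourth consecutive epoch without improvement, so a 9-epoch run leaves little room for more than one reduction.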

## 13. Final Evaluation on Test Set

In [85]:
# Evaluate the final-epoch model on the test set (no best-model checkpoint is saved or loaded)
test_loss, test_acc, test_f1, test_preds, test_labels = evaluate(
    model, test_loader, criterion, device
)

print(f"\n{'=' * 80}")
print("FINAL TEST RESULTS")
print(f"{'=' * 80}")
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")
print(f"Test F1 Score: {test_f1:.4f}")
print(f"\nClassification Report:")
print(classification_report(test_labels, test_preds,
                            target_names=['Negative', 'Neutral', 'Positive'],
                            digits=4))

cm = confusion_matrix(test_labels, test_preds)
print(f"Confusion Matrix:\n{cm}")

# Diagonal over row sums is per-class recall (fraction of each true class predicted correctly)
per_class_acc = cm.diagonal() / cm.sum(axis=1)
print(f"\nPer-class Accuracy:")
for i, acc in enumerate(per_class_acc):
    label = ['Negative', 'Neutral', 'Positive'][i]
    print(f"  {label}: {acc:.4f}")


FINAL TEST RESULTS
Test Loss: 0.7059
Test Accuracy: 0.6905
Test F1 Score: 0.6926

Classification Report:
              precision    recall  f1-score   support

    Negative     0.6807    0.7089    0.6946      1546
     Neutral     0.6134    0.6601    0.6359      1930
    Positive     0.8065    0.7081    0.7541      1730

    accuracy                         0.6905      5206
   macro avg     0.7002    0.6924    0.6948      5206
weighted avg     0.6975    0.6905    0.6926      5206

Confusion Matrix:
[[1096  397   53]
 [ 415 1274  241]
 [  99  406 1225]]

Per-class Accuracy:
  Negative: 0.7089
  Neutral: 0.6601
  Positive: 0.7081
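
Every figure in the classification report can be re-derived from the confusion matrix alone, which is a useful consistency check. The sketch below does this with NumPy; it also makes explicit that the "per-class accuracy" values are per-class recall. Variable names are illustrative.

```python
import numpy as np

# Rows are true classes, columns are predicted classes (from the output above)
cm = np.array([[1096,  397,   53],
               [ 415, 1274,  241],
               [  99,  406, 1225]])
labels = ['Negative', 'Neutral', 'Positive']

tp = np.diag(cm)
support = cm.sum(axis=1)    # true examples per class
predicted = cm.sum(axis=0)  # predictions per class

recall = tp / support       # same values as the "per-class accuracy"
precision = tp / predicted
f1 = 2 * precision * recall / (precision + recall)

accuracy = tp.sum() / cm.sum()
macro_f1 = f1.mean()
weighted_f1 = (f1 * support).sum() / support.sum()

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:<8} precision={p:.4f} recall={r:.4f} f1={f:.4f}")
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
# accuracy=0.6905 macro_f1=0.6948 weighted_f1=0.6926
```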


## 14. Key Observations and Analysis

### Training Dynamics:
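1. **Overall Improvement**: Both F1 scores rose between epoch 1 and epoch 9, although validation F1 was not monotonic (it dipped from 0.6826 to 0.6738 at epoch 4)
   - Training F1: 0.5751 → 0.7085 (+13.34 percentage points)
   - Validation F1: 0.6612 → 0.7047 (+4.35 percentage points)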

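2. **Overfitting Appears Well Controlled**:
   - Gap between train and val F1 at epoch 9: 0.7085 - 0.7047 = 0.0038 (0.38 percentage points)
   - This suggests that the regularization techniques (dropout 0.4, word dropout 0.05, weight decay 3e-3) are effective. Training metrics are computed in `model.train()` mode with dropout active, so they understate the fit to the training data and the true gap is likely somewhat larger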

### Model Performance:
3. **Test Performance**:
   - Test weighted F1: **0.6926** (0.0121 below the validation F1 of 0.7047)
   - Test Accuracy: **69.05%**
   - The small validation-test gap gives no indication of overfitting to the validation set

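4. **Class-wise Performance** (the "per-class accuracy" printed above is `cm.diagonal() / cm.sum(axis=1)`, i.e. per-class recall):
   - **Negative (70.89% recall, F1 0.6946)**: Highest recall by a narrow margin; 397 of 1546 Negative examples (25.7%) are predicted Neutral
   - **Positive (70.81% recall, F1 0.7541)**: Highest precision (0.8065) and F1; 406 of 1730 Positive examples (23.5%) are predicted Neutral
   - **Neutral (66.01% recall, F1 0.6359)**: Most challenging - confused with both Negative and Positive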

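5. **Confusion Matrix Insights**:
   - Neutral tweets are frequently misclassified: of 1930 Neutral test examples, 415 (21.5%) are predicted Negative and 241 (12.5%) are predicted Positive
   - Direct Negative/Positive confusion is rare (53 Negative predicted Positive, 99 Positive predicted Negative); most errors involve the Neutral class
   - This is expected as neutral sentiment is inherently ambiguous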
     
### Technical Highlights:
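6. **Word2Vec Coverage**: 99.99% (14998/15000 words found)
   - The two missing entries are the special tokens `<PAD>` and `<UNK>`; every real vocabulary word is covered
   - Coverage is near-total by construction (Word2Vec is trained on this same corpus with `min_count=1`), so it measures vocabulary overlap rather than embedding quality
   - Frozen embeddings are workable here because no real word is left with a random, untrained vector (only `<UNK>` keeps its small random initialization)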

7. **Model Efficiency**:
   - Total parameters: 2,112,580
   - Trainable parameters: 192,580 (only 9.1% of total)
   - Most parameters (1,920,000, or 90.9%) sit in the frozen embedding matrix, which limits the trainable capacity available to overfit

8. **Class Imbalance Handling**:
   - The imbalance is mild; class weights are inversely proportional to class frequency (`total / (num_classes * count)`):
     - Negative (9105 samples): weight 1.14
     - Neutral (11649 samples): weight 0.89
     - Positive (10478 samples): weight 0.99

### Conclusion:
- The Bidirectional LSTM with attention and frozen Word2Vec embeddings reached 69.05% accuracy and a weighted F1 of 0.6926 on the test set of this multiclass sentiment analysis task.
- The regularization techniques appear to have controlled overfitting: train, validation, and test F1 all fall within about 0.02 of each other.
- Future work could explore more advanced architectures (e.g., transformers) to further boost performance, especially on the challenging Neutral class.