# 8) Bi-LSTM with N-gram Features v2 (Train on train.src.tok, Validate on dev_set.csv)

**Research Paper**: "Enhancing Bangla Language Next Word Prediction..." (arXiv 2405.01873, 2024)

**Expected Accuracy**: 60-75% (realistic: 60-70%, optimistic: 75%)

## Difference from v1

- **Training**: Still trains on train.src.tok
- **Validation**: Computes loss and accuracy on dev_set.csv after each epoch
- **Purpose**: Monitor performance on actual task format during training

## Overview

This notebook implements a Bidirectional LSTM that combines neural and statistical approaches:
- **Bidirectional LSTM**: Reads context both forward and backward
- **N-gram Features**: Incorporates statistical n-gram probabilities
- **Hybrid Architecture**: Concatenates the LSTM's last output with the n-gram features before the final linear layer

## 8.1 Setup and Imports

In [None]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from typing import Dict, List, Tuple
from collections import defaultdict
from tqdm import tqdm
import pickle
import wandb

print("Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Login to wandb
print("\n" + "="*80)
print("WANDB LOGIN")
print("="*80)
print("Please login to wandb to track your experiments.")
print("Get your API key from: https://wandb.ai/authorize")
print()

wandb.login()

print("\n✓ wandb login successful!")
print("You can view your runs at: https://wandb.ai")

## 8.2 Load Training Data and Build Vocabulary

In [None]:
# Load training data
print("Loading training data...")
with open('train.src.tok', 'r', encoding='utf-8') as f:
    train_sentences = [line.strip() for line in f]

print(f"Loaded {len(train_sentences)} training sentences")
print("First 3 sentences:")
for i, sentence in enumerate(train_sentences[:3], start=1):
    print(f"{i}: {sentence}")

# Build vocabulary
print("\nBuilding vocabulary...")
word_counts = defaultdict(int)
for sentence in tqdm(train_sentences, desc="Counting words"):
    for word in sentence.split():
        word_counts[word] += 1

# Create word2idx and idx2word
vocab = ['<PAD>', '<UNK>', '<s>', '</s>'] + sorted(word_counts.keys(), key=word_counts.get, reverse=True)
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}

print(f"Vocabulary size: {len(vocab)}")
print(f"Most common words: {vocab[4:14]}")
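
The lookup convention used in the rest of the notebook (`word2idx.get(w, word2idx['<UNK>'])`) can be illustrated on a toy corpus. This is only a sketch: the sentences and the `encode` helper are made up for illustration and are not part of the notebook.

```python
from collections import Counter

# Toy corpus (hypothetical); the real notebook reads train.src.tok.
toy_sentences = ["the cat sat", "the dog sat down"]
counts = Counter(w for s in toy_sentences for w in s.split())

# Special tokens first, then words by descending frequency, as in the cell above.
specials = ['<PAD>', '<UNK>', '<s>', '</s>']
toy_vocab = specials + sorted(counts, key=counts.get, reverse=True)
toy_word2idx = {w: i for i, w in enumerate(toy_vocab)}

def encode(sentence, word2idx):
    """Map each token to its index, falling back to <UNK> for unseen words."""
    unk = word2idx['<UNK>']
    return [word2idx.get(w, unk) for w in sentence.split()]

# "bird" is not in the toy vocabulary, so it maps to the <UNK> index (1).
print(encode("the bird sat", toy_word2idx))
```

Because `<PAD>` is always index 0, the padding value `0` used later in the dataset class matches `padding_idx=0` in the embedding layer.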

## 8.3 Build N-gram Models

Extract n-gram statistics for use as features.

In [None]:
print("Building n-gram models...")

# Initialize count dictionaries
unigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
trigram_counts = defaultdict(int)
fourgram_counts = defaultdict(int)

# Count n-grams from the first 1,000,000 sentences (remove the slice to use the full dataset)
for sentence in tqdm(train_sentences[:1000000], desc="Extracting n-grams"):
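    words = ['<s>', '<s>', '<s>'] + sentence.split() + ['</s>']
    # Count one sentence-start token per sentence. The loop below starts at index 3
    # and never counts '<s>', which would leave P(word | <s>) with a zero denominator.
    unigram_counts['<s>'] += 1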
    
    for i in range(3, len(words)):
        # Unigram
        unigram_counts[words[i]] += 1
        
        # Bigram
        bigram = (words[i-1], words[i])
        bigram_counts[bigram] += 1
        
        # Trigram
        trigram = (words[i-2], words[i-1], words[i])
        trigram_counts[trigram] += 1
        
        # 4-gram
        fourgram = (words[i-3], words[i-2], words[i-1], words[i])
        fourgram_counts[fourgram] += 1

print(f"\nN-gram statistics:")
print(f"Unique unigrams: {len(unigram_counts)}")
print(f"Unique bigrams: {len(bigram_counts)}")
print(f"Unique trigrams: {len(trigram_counts)}")
print(f"Unique 4-grams: {len(fourgram_counts)}")

# Save n-gram models
print("\nSaving n-gram models...")
with open('ngram_models_v2.pkl', 'wb') as f:
    pickle.dump({
        'unigram': unigram_counts,
        'bigram': bigram_counts,
        'trigram': trigram_counts,
        'fourgram': fourgram_counts
    }, f)
print("N-gram models saved!")
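
These counts are later turned into conditional probabilities of the form P(w | context) = count(context, w) / count(context). The sketch below shows that calculation on a toy corpus. It is a simplified illustration, not the notebook's code: `count_ngrams` and `conditional_prob` are hypothetical helpers, and unlike the cell above they key every order (including unigrams) by tuple.

```python
from collections import defaultdict

def count_ngrams(sentences, max_n=4):
    """Count all n-grams up to max_n, padding each sentence with <s> and </s>."""
    counts = {n: defaultdict(int) for n in range(1, max_n + 1)}
    for sentence in sentences:
        words = ['<s>'] * (max_n - 1) + sentence.split() + ['</s>']
        for i in range(max_n - 1, len(words)):
            for n in range(1, max_n + 1):
                counts[n][tuple(words[i - n + 1:i + 1])] += 1
    return counts

def conditional_prob(counts, context, word):
    """Maximum-likelihood estimate of P(word | context); context has 1 to 3 words."""
    n = len(context) + 1
    ngram_count = counts[n].get(tuple(context) + (word,), 0)
    context_count = counts[n - 1].get(tuple(context), 0)
    return ngram_count / context_count if context_count > 0 else 0.0

toy_counts = count_ngrams(["the cat sat", "the cat ran", "the dog sat"])
print(conditional_prob(toy_counts, ("the",), "cat"))        # bigram estimate
print(conditional_prob(toy_counts, ("the", "cat"), "sat"))  # trigram estimate
```

An unseen context yields 0.0 rather than a smoothed estimate, which matches the unsmoothed ratios used for the features in this notebook.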

## 8.4 Load Development Set

In [None]:
# Load development set
dev_df = pd.read_csv('dev_set.csv')
print(f"Development set loaded: {len(dev_df)} examples")
print(f"\nFirst 3 examples:")
print(dev_df.head(3))

# Build vocabulary by first letter
print("\nBuilding vocabulary by first letter...")
vocab_by_first_letter = defaultdict(set)
for word in vocab:
    if word not in ['<PAD>', '<UNK>', '<s>', '</s>'] and len(word) > 0:
        vocab_by_first_letter[word[0].lower()].add(word)

print(f"Vocabulary organized by {len(vocab_by_first_letter)} first letters")
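
A small sketch of how the first-letter index behaves at lookup time, assuming the same construction as above. The word list and the `build_first_letter_index` helper are illustrative only.

```python
from collections import defaultdict

def build_first_letter_index(words):
    """Group words by their lowercased first character; the words keep their case."""
    index = defaultdict(set)
    for word in words:
        if word:
            index[word[0].lower()].add(word)
    return index

toy_index = build_first_letter_index(["cat", "Car", "dog", "cow"])

# Keys are lowercase, so "Car" is found under "c".
print(sorted(toy_index.get("c", set())))

# Using .get with a default avoids creating empty entries for unseen letters.
print(toy_index.get("z", set()))
```

Rows whose first letter has no entry produce an empty candidate set, which the evaluation and prediction cells later handle explicitly.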

## 8.5 Bi-LSTM with N-gram Features Model

In [None]:
class BiLSTM_Ngram(nn.Module):
    """
    Bidirectional LSTM with N-gram features for next word prediction.
    """
    
    def __init__(self, vocab_size: int, embedding_dim: int = 256, 
                 hidden_dim: int = 512, num_layers: int = 2, 
                 dropout: float = 0.3, ngram_feature_dim: int = 4):
        """
        Args:
            vocab_size: Size of vocabulary
            embedding_dim: Dimension of word embeddings
            hidden_dim: Dimension of LSTM hidden state
            num_layers: Number of LSTM layers
            dropout: Dropout rate
            ngram_feature_dim: Number of n-gram features (unigram, bigram, trigram, 4-gram)
        """
        super(BiLSTM_Ngram, self).__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.dropout = nn.Dropout(dropout)
        
        # Bidirectional LSTM
        self.bilstm = nn.LSTM(
            embedding_dim, 
            hidden_dim, 
            num_layers=num_layers,
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=True,
            batch_first=True
        )
        
        # Combine BiLSTM output (2 * hidden_dim) with n-gram features
        self.fc = nn.Linear(2 * hidden_dim + ngram_feature_dim, vocab_size)
        
    def forward(self, x: torch.Tensor, ngram_features: torch.Tensor) -> torch.Tensor:
        """
        Forward pass.
        
        Args:
            x: Input token indices, shape (batch_size, seq_len)
            ngram_features: N-gram probability features, shape (batch_size, ngram_feature_dim)
            
        Returns:
            Output logits, shape (batch_size, vocab_size)
        """
        # Embedding
        embedded = self.embedding(x)  # (batch_size, seq_len, embedding_dim)
        embedded = self.dropout(embedded)
        
        # Bidirectional LSTM
        lstm_out, (h_n, c_n) = self.bilstm(embedded)
        # lstm_out: (batch_size, seq_len, 2 * hidden_dim)
        # h_n: (2 * num_layers, batch_size, hidden_dim)
        
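        # Take the output at the last time step. Contexts are left-padded, so this
        # is the final real token; note that the backward direction has only
        # processed that one token at this position.
        last_output = lstm_out[:, -1, :]  # (batch_size, 2 * hidden_dim)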
        
        # Concatenate with n-gram features
        combined = torch.cat([last_output, ngram_features], dim=1)
        # combined: (batch_size, 2 * hidden_dim + ngram_feature_dim)
        
        # Final prediction
        output = self.fc(combined)  # (batch_size, vocab_size)
        
        return output

print("BiLSTM_Ngram model defined!")
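
To get a feel for where the parameters of this architecture go, the count can be worked out by hand. The sketch below assumes PyTorch's standard LSTM parameterization (four gates, each with input weights, recurrent weights, and two bias vectors per direction); `bilstm_ngram_param_count` is a hypothetical helper and the vocabulary size of 10,000 is only an example.

```python
def bilstm_ngram_param_count(vocab_size, embedding_dim=256, hidden_dim=512,
                             num_layers=2, ngram_feature_dim=4):
    """Estimate parameter counts for the embedding, BiLSTM, and output layer."""
    embedding = vocab_size * embedding_dim

    lstm = 0
    for layer in range(num_layers):
        # Layers after the first receive the concatenated forward/backward output.
        input_dim = embedding_dim if layer == 0 else 2 * hidden_dim
        per_direction = 4 * hidden_dim * (input_dim + hidden_dim) + 2 * 4 * hidden_dim
        lstm += 2 * per_direction  # forward and backward directions

    # Linear layer: weights plus one bias per vocabulary entry.
    fc = (2 * hidden_dim + ngram_feature_dim) * vocab_size + vocab_size

    return {'embedding': embedding, 'lstm': lstm, 'fc': fc,
            'total': embedding + lstm + fc}

print(bilstm_ngram_param_count(vocab_size=10_000))
```

The embedding and output layers both scale with the vocabulary size, so with a large vocabulary they are likely to dominate the total printed when the model is initialized.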

## 8.6 Dataset Class with N-gram Features

In [None]:
class NgramDataset(Dataset):
    """
    Dataset that provides sequences with n-gram features.
    """
    
    def __init__(self, sentences: List[str], word2idx: Dict, 
                 ngram_models: Dict, max_len: int = 50):
        self.sentences = sentences
        self.word2idx = word2idx
        self.ngram_models = ngram_models
        self.max_len = max_len
        
    def __len__(self):
        return len(self.sentences)
    
    def compute_ngram_features(self, context: List[str], target_word: str) -> np.ndarray:
        """
        Compute n-gram probability features.
        
        Returns:
            Array of [unigram_prob, bigram_prob, trigram_prob, fourgram_prob]
        """
        features = np.zeros(4)
        
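        # Unigram probability. Cache the corpus total on first use; summing every
        # count on each call would dominate the cost of __getitem__.
        unigram_count = self.ngram_models['unigram'].get(target_word, 0)
        if not hasattr(self, '_total_unigrams'):
            self._total_unigrams = sum(self.ngram_models['unigram'].values())
        total_unigrams = self._total_unigrams
        features[0] = unigram_count / total_unigrams if total_unigrams > 0 else 0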
        
        if len(context) >= 1:
            # Bigram probability
            bigram = (context[-1], target_word)
            bigram_count = self.ngram_models['bigram'].get(bigram, 0)
            context_count = self.ngram_models['unigram'].get(context[-1], 0)
            features[1] = bigram_count / context_count if context_count > 0 else 0
        
        if len(context) >= 2:
            # Trigram probability
            trigram = (context[-2], context[-1], target_word)
            trigram_count = self.ngram_models['trigram'].get(trigram, 0)
            bigram_context = (context[-2], context[-1])
            bigram_context_count = self.ngram_models['bigram'].get(bigram_context, 0)
            features[2] = trigram_count / bigram_context_count if bigram_context_count > 0 else 0
        
        if len(context) >= 3:
            # 4-gram probability
            fourgram = (context[-3], context[-2], context[-1], target_word)
            fourgram_count = self.ngram_models['fourgram'].get(fourgram, 0)
            trigram_context = (context[-3], context[-2], context[-1])
            trigram_context_count = self.ngram_models['trigram'].get(trigram_context, 0)
            features[3] = fourgram_count / trigram_context_count if trigram_context_count > 0 else 0
        
        return features
    
    def __getitem__(self, idx):
        sentence = self.sentences[idx]
        words = ['<s>'] + sentence.split()
        
        # Randomly select a position to predict
        if len(words) < 2:
            return self.__getitem__((idx + 1) % len(self.sentences))
        
        target_pos = np.random.randint(1, len(words))
        context_words = words[:target_pos]
        target_word = words[target_pos]
        
        # Convert to indices
        context_indices = [self.word2idx.get(w, self.word2idx['<UNK>']) for w in context_words]
        target_idx = self.word2idx.get(target_word, self.word2idx['<UNK>'])
        
        # Pad/truncate context
        if len(context_indices) > self.max_len:
            context_indices = context_indices[-self.max_len:]
            context_words = context_words[-self.max_len:]
        else:
            padding = [0] * (self.max_len - len(context_indices))
            context_indices = padding + context_indices
        
        # Compute n-gram features
        ngram_features = self.compute_ngram_features(context_words, target_word)
        
        return {
            'context': torch.tensor(context_indices, dtype=torch.long),
            'ngram_features': torch.tensor(ngram_features, dtype=torch.float32),
            'target': torch.tensor(target_idx, dtype=torch.long)
        }

print("NgramDataset class defined!")
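
The pad/truncate step appears in the dataset, the dev evaluation, and the test prediction cells. As a sketch, it could be factored into one helper; `left_pad_or_truncate` is a hypothetical name, not something defined in the notebook.

```python
def left_pad_or_truncate(indices, max_len, pad_idx=0):
    """Keep the most recent max_len tokens, or left-pad shorter contexts."""
    if len(indices) > max_len:
        return indices[-max_len:]
    return [pad_idx] * (max_len - len(indices)) + indices

print(left_pad_or_truncate([5, 6, 7], max_len=5))           # short context is padded
print(left_pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=4))  # long context is truncated
```

Padding on the left keeps the most recent word at the final position, which is the position the model reads its output from.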

## 8.7 Dev Set Evaluation Function

In [None]:
def compute_ngram_features_for_candidate(context_words: List[str], candidate: str, ngram_models: Dict) -> np.ndarray:
    """
    Compute n-gram features for a candidate word given context.
    """
    features = np.zeros(4, dtype=np.float32)  # Specify dtype for efficiency
    
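    # Unigram. Cache the corpus total in the models dict on first use; summing
    # every count for each candidate would dominate evaluation time.
    unigram_count = ngram_models['unigram'].get(candidate, 0)
    total_unigrams = ngram_models.get('total_unigrams')
    if total_unigrams is None:
        total_unigrams = sum(ngram_models['unigram'].values())
        ngram_models['total_unigrams'] = total_unigrams
    features[0] = unigram_count / total_unigrams if total_unigrams > 0 else 0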
    
    if len(context_words) >= 1:
        # Bigram
        bigram = (context_words[-1], candidate)
        bigram_count = ngram_models['bigram'].get(bigram, 0)
        context_count = ngram_models['unigram'].get(context_words[-1], 0)
        features[1] = bigram_count / context_count if context_count > 0 else 0
    
    if len(context_words) >= 2:
        # Trigram
        trigram = (context_words[-2], context_words[-1], candidate)
        trigram_count = ngram_models['trigram'].get(trigram, 0)
        bigram_context = (context_words[-2], context_words[-1])
        bigram_context_count = ngram_models['bigram'].get(bigram_context, 0)
        features[2] = trigram_count / bigram_context_count if bigram_context_count > 0 else 0
    
    if len(context_words) >= 3:
        # 4-gram
        fourgram = (context_words[-3], context_words[-2], context_words[-1], candidate)
        fourgram_count = ngram_models['fourgram'].get(fourgram, 0)
        trigram_context = (context_words[-3], context_words[-2], context_words[-1])
        trigram_context_count = ngram_models['trigram'].get(trigram_context, 0)
        features[3] = fourgram_count / trigram_context_count if trigram_context_count > 0 else 0
    
    return features


def evaluate_on_dev_set(model, dev_df, word2idx, vocab_by_first_letter, ngram_models, max_len, device, criterion):
    """
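    Evaluate model on dev_set.csv and compute both loss and accuracy.

    Rows whose first letter has no candidates are skipped. Rows whose answer is
    not among the candidates add 0 to the loss sum but still count toward the
    denominator, so the reported loss can understate the true value.
    
    Returns:
        (average_loss, accuracy_percent, correct, total)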
    """
    model.eval()
    total_loss = 0.0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for _, row in dev_df.iterrows():
            context = row['context']
            first_letter = row['first letter']
            answer = row['answer']
            
            # Tokenize context
            words = ['<s>'] + context.lower().split()
            
            # Convert to indices
            context_indices = [word2idx.get(w, word2idx['<UNK>']) for w in words]
            
            # Pad/truncate
            if len(context_indices) > max_len:
                context_indices = context_indices[-max_len:]
                words = words[-max_len:]
            else:
                padding = [0] * (max_len - len(context_indices))
                context_indices = padding + context_indices
            
            # Get candidates
            candidates = vocab_by_first_letter.get(first_letter.lower(), set())
            if not candidates:
                continue
            
            context_tensor = torch.tensor([context_indices], dtype=torch.long).to(device)
            
            # Find best candidate and compute loss for answer
            best_score = float('-inf')
            best_word = None
            answer_loss = None
            
            for candidate in candidates:
                # Compute n-gram features (returns numpy array)
                ngram_features = compute_ngram_features_for_candidate(words, candidate, ngram_models)
                # Convert numpy array to tensor efficiently
                ngram_tensor = torch.from_numpy(ngram_features).unsqueeze(0).to(device)
                
                # Get model output
                output = model(context_tensor, ngram_tensor)
                candidate_idx = word2idx.get(candidate, word2idx['<UNK>'])
                score = output[0, candidate_idx].item()
                
                if score > best_score:
                    best_score = score
                    best_word = candidate
                
                # If this is the answer, compute loss
                if candidate == answer:
                    answer_idx = torch.tensor([candidate_idx], dtype=torch.long).to(device)
                    answer_loss = criterion(output, answer_idx).item()
            
            # Add to metrics
            if answer_loss is not None:
                total_loss += answer_loss
            
            if best_word == answer:
                correct += 1
            
            total += 1
    
    avg_loss = total_loss / total if total > 0 else 0.0
    accuracy = (correct / total * 100) if total > 0 else 0.0
    
    return avg_loss, accuracy, correct, total

print("Dev set evaluation function defined!")
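
The evaluation loop above runs one forward pass per candidate, even though the context is identical for all of them. One possible speed-up is to score all candidates for a row in a single batch. The sketch below shows the idea with a tiny stand-in model: `TinyScorer` and `score_candidates_batched` are hypothetical names, and the stand-in only mimics the `forward(x, ngram_features)` signature of `BiLSTM_Ngram`.

```python
import torch
import torch.nn as nn

class TinyScorer(nn.Module):
    """Hypothetical stand-in with the same forward signature as BiLSTM_Ngram."""
    def __init__(self, vocab_size=20, dim=8, ngram_feature_dim=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.fc = nn.Linear(dim + ngram_feature_dim, vocab_size)

    def forward(self, x, ngram_features):
        pooled = self.embedding(x).mean(dim=1)
        return self.fc(torch.cat([pooled, ngram_features], dim=1))

def score_candidates_batched(model, context_indices, candidate_indices, candidate_features):
    """Return one logit per candidate using a single forward pass."""
    n = len(candidate_indices)
    # Repeat the shared context once per candidate: (n, seq_len)
    context = torch.tensor([context_indices], dtype=torch.long).repeat(n, 1)
    features = torch.tensor(candidate_features, dtype=torch.float32)  # (n, 4)
    with torch.no_grad():
        logits = model(context, features)  # (n, vocab_size)
    idx = torch.tensor(candidate_indices, dtype=torch.long)
    # Row i holds the logits computed with candidate i's features; pick its own logit.
    return logits[torch.arange(n), idx]

torch.manual_seed(0)
model = TinyScorer().eval()
context = [0, 0, 2, 5, 7]   # left-padded toy context
cands = [3, 9, 11]          # toy candidate indices
feats = [[0.1, 0.0, 0.0, 0.0], [0.2, 0.5, 0.0, 0.0], [0.05, 0.1, 0.3, 0.0]]
scores = score_candidates_batched(model, context, cands, feats)
best = cands[int(scores.argmax())]
print(scores.shape, best)
```

In the real notebook the tensors would also need `.to(device)`, and very large candidate sets may have to be split into chunks to fit in memory.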

## 8.8 Training Setup

In [None]:
# Hyperparameters
EMBEDDING_DIM = 256
HIDDEN_DIM = 512
NUM_LAYERS = 2
DROPOUT = 0.3
BATCH_SIZE = 64
LEARNING_RATE = 0.001
NUM_EPOCHS = 5
MAX_LEN = 50

# Use subset for faster training (adjust as needed)
TRAIN_SIZE = 500000  # Use 500K sentences

print("Hyperparameters:")
print(f"Embedding dim: {EMBEDDING_DIM}")
print(f"Hidden dim: {HIDDEN_DIM}")
print(f"Num layers: {NUM_LAYERS}")
print(f"Dropout: {DROPOUT}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Learning rate: {LEARNING_RATE}")
print(f"Num epochs: {NUM_EPOCHS}")
print(f"Training size: {TRAIN_SIZE} sentences")

## 8.9 Create Dataset and DataLoader

In [None]:
# Load n-gram models
print("Loading n-gram models...")
with open('ngram_models_v2.pkl', 'rb') as f:
    ngram_models = pickle.load(f)
print("N-gram models loaded!")

# Create dataset
print(f"\nCreating dataset with {TRAIN_SIZE} sentences...")
train_dataset = NgramDataset(
    train_sentences[:TRAIN_SIZE],
    word2idx,
    ngram_models,
    max_len=MAX_LEN
)

# Create dataloader
train_loader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=0
)

print(f"Dataset created: {len(train_dataset)} examples")
print(f"Number of batches: {len(train_loader)}")

## 8.10 Initialize Model and Training

In [None]:
# Initialize wandb
wandb.init(
    project="predictive-keyboard-bilstm-ngram-v2",
    config={
        "learning_rate": LEARNING_RATE,
        "epochs": NUM_EPOCHS,
        "batch_size": BATCH_SIZE,
        "embedding_dim": EMBEDDING_DIM,
        "hidden_dim": HIDDEN_DIM,
        "num_layers": NUM_LAYERS,
        "dropout": DROPOUT,
        "max_len": MAX_LEN,
        "train_size": TRAIN_SIZE,
        "architecture": "BiLSTM with N-gram features v2",
        "ngram_features": "unigram, bigram, trigram, 4-gram",
        "validation": "dev_set.csv"
    }
)

# Initialize model
model = BiLSTM_Ngram(
    vocab_size=len(vocab),
    embedding_dim=EMBEDDING_DIM,
    hidden_dim=HIDDEN_DIM,
    num_layers=NUM_LAYERS,
    dropout=DROPOUT,
    ngram_feature_dim=4
).to(device)

print("Model initialized:")
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters())}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

# Watch model with wandb
wandb.watch(model, log='all', log_freq=100)

print("\nOptimizer: Adam")
print(f"Loss function: CrossEntropyLoss")

## 8.11 Training Loop with Dev Set Validation

In [None]:
print("Starting training...\n")

best_dev_accuracy = 0.0

for epoch in range(NUM_EPOCHS):
    # ====== TRAINING PHASE ======
    model.train()
    total_train_loss = 0
    correct_train = 0
    total_train = 0
    
    progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{NUM_EPOCHS} [TRAIN]")
    
    for batch_idx, batch in enumerate(progress_bar, start=1):
        context = batch['context'].to(device)
        ngram_features = batch['ngram_features'].to(device)
        target = batch['target'].to(device)
        
        # Forward pass
        optimizer.zero_grad()
        output = model(context, ngram_features)
        loss = criterion(output, target)
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
        # Statistics
        total_train_loss += loss.item()
        _, predicted = torch.max(output, 1)
        correct_train += (predicted == target).sum().item()
        total_train += target.size(0)
        
        # Update progress bar. Divide by an explicit batch counter: tqdm only
        # refreshes progress_bar.n periodically, so it can lag behind the loop.
        progress_bar.set_postfix({
            'loss': total_train_loss / batch_idx,
            'acc': 100 * correct_train / total_train
        })
    
    avg_train_loss = total_train_loss / len(train_loader)
    train_accuracy = 100 * correct_train / total_train
    
    # ====== VALIDATION PHASE ======
    print(f"\nEpoch {epoch+1}: Evaluating on dev set...")
    dev_loss, dev_accuracy, dev_correct, dev_total = evaluate_on_dev_set(
        model, dev_df, word2idx, vocab_by_first_letter, ngram_models, MAX_LEN, device, criterion
    )
    
    # Print epoch summary
    print(f"\n{'='*80}")
    print(f"Epoch {epoch+1}/{NUM_EPOCHS} Summary:")
    print(f"{'='*80}")
    print(f"TRAIN - Loss: {avg_train_loss:.4f} | Accuracy: {train_accuracy:.2f}%")
    print(f"DEV   - Loss: {dev_loss:.4f} | Accuracy: {dev_accuracy:.2f}% ({dev_correct}/{dev_total})")
    print(f"{'='*80}\n")
    
    # Log to wandb
    wandb.log({
        "epoch": epoch + 1,
        "train_loss": avg_train_loss,
        "train_accuracy": train_accuracy,
        "dev_loss": dev_loss,
        "dev_accuracy": dev_accuracy,
        "dev_correct": dev_correct,
        "dev_total": dev_total
    })
    
    # Save checkpoint
    checkpoint_path = f'bilstm_ngram_v2_epoch{epoch+1}.pt'
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'train_loss': avg_train_loss,
        'dev_loss': dev_loss,
        'dev_accuracy': dev_accuracy,
    }, checkpoint_path)
    print(f"Checkpoint saved: {checkpoint_path}")
    
    # Save best model
    if dev_accuracy > best_dev_accuracy:
        best_dev_accuracy = dev_accuracy
        torch.save(model.state_dict(), 'bilstm_ngram_v2_best.pt')
        print(f"🎉 New best dev accuracy: {dev_accuracy:.2f}% - Model saved!\n")

print("Training completed!")
print(f"Best dev accuracy achieved: {best_dev_accuracy:.2f}%")

## 8.12 Save Final Model

In [None]:
# Save final model
torch.save(model.state_dict(), 'bilstm_ngram_v2_final.pt')
print("Final model saved: bilstm_ngram_v2_final.pt")

# Save vocabulary
with open('bilstm_vocab_v2.pkl', 'wb') as f:
    pickle.dump({
        'word2idx': word2idx,
        'idx2word': idx2word
    }, f)
print("Vocabulary saved: bilstm_vocab_v2.pkl")

# Save model as wandb artifact
artifact = wandb.Artifact('bilstm-ngram-model-v2', type='model')
artifact.add_file('bilstm_ngram_v2_best.pt')
artifact.add_file('bilstm_ngram_v2_final.pt')
artifact.add_file('bilstm_vocab_v2.pkl')
wandb.log_artifact(artifact)
print("Model saved as wandb artifact!")

# Update summary
wandb.summary.update({
    "best_dev_accuracy": best_dev_accuracy,
    "total_parameters": sum(p.numel() for p in model.parameters())
})

# Finish wandb run
wandb.finish()
print("wandb run finished!")

## 8.13 Test Set Predictions

In [None]:
# Load best model
model.load_state_dict(torch.load('bilstm_ngram_v2_best.pt', map_location=device))
model.eval()
print("Loaded best model")

# Load test set
test_df = pd.read_csv('test_set_no_answer.csv')
print(f"Test set loaded: {len(test_df)} examples")

# Generate predictions
test_predictions = []

for _, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Predicting test set"):
    context = row['context']
    first_letter = row['first letter']
    
    # Tokenize context
    words = ['<s>'] + context.lower().split()
    
    # Convert to indices
    context_indices = [word2idx.get(w, word2idx['<UNK>']) for w in words]
    
    # Pad/truncate
    if len(context_indices) > MAX_LEN:
        context_indices = context_indices[-MAX_LEN:]
        words = words[-MAX_LEN:]
    else:
        padding = [0] * (MAX_LEN - len(context_indices))
        context_indices = padding + context_indices
    
    # Get candidates
    candidates = vocab_by_first_letter.get(first_letter.lower(), set())
    if not candidates:
        test_predictions.append(first_letter)
        continue
    
    # Score each candidate
    best_score = float('-inf')
    best_word = None
    
    context_tensor = torch.tensor([context_indices], dtype=torch.long).to(device)
    
    with torch.no_grad():
        for candidate in candidates:
            # Compute n-gram features (returns numpy array)
            ngram_features = compute_ngram_features_for_candidate(words, candidate, ngram_models)
            # Convert numpy array to tensor efficiently
            ngram_tensor = torch.from_numpy(ngram_features).unsqueeze(0).to(device)
            
            # Get model output
            output = model(context_tensor, ngram_tensor)
            candidate_idx = word2idx.get(candidate, word2idx['<UNK>'])
            score = output[0, candidate_idx].item()
            
            if score > best_score:
                best_score = score
                best_word = candidate
    
    test_predictions.append(best_word if best_word else list(candidates)[0])

# Save predictions
with open('test_predictions_bilstm_ngram_v2.txt', 'w', encoding='utf-8') as f:
    for pred in test_predictions:
        f.write(f"{pred}\n")

print("\nTest predictions saved to 'test_predictions_bilstm_ngram_v2.txt'")
print(f"Total predictions: {len(test_predictions)}")

## 8.14 Summary

**Bi-LSTM with N-gram Features v2 Performance:**
- Architecture: Bidirectional LSTM (2 layers, 512 hidden units)
- N-gram features: 4 features (unigram, bigram, trigram, 4-gram probabilities)
- Embedding dimension: 256
- Dropout: 0.3
- Expected accuracy: 60-75%

**Key Differences from v1:**
- Validation on dev_set.csv after each epoch
- Both loss and accuracy tracked on dev set
- Best model saved based on dev accuracy
- Per-epoch checkpoints record dev loss and accuracy alongside train loss

**Advantages:**
- Better monitoring of generalization
- Early stopping possible based on dev performance
- More faithful to actual task format
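
The training loop in this notebook always runs for the full `NUM_EPOCHS`. As a sketch of the early stopping mentioned above, a small patience-based helper could be called once per epoch with the dev accuracy. The `EarlyStopping` class and the accuracy values below are hypothetical and only illustrate the logic.

```python
class EarlyStopping:
    """Stop when dev accuracy has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('-inf')
        self.bad_epochs = 0

    def step(self, dev_accuracy):
        """Record one epoch's dev accuracy; return True when training should stop."""
        if dev_accuracy > self.best + self.min_delta:
            self.best = dev_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
history = [41.0, 43.5, 43.2, 43.4, 44.0]  # made-up dev accuracies, one per epoch
stopped_at = None
for epoch, acc in enumerate(history, start=1):
    if stopper.step(acc):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

In the training loop, `stopper.step(dev_accuracy)` would go right after the best-model save, followed by a `break` when it returns `True`. With a small patience, a later improvement (such as the last value in the toy history) is never reached, which is the usual trade-off of this approach.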