# 🤗 RoBERTa Model Training: Emotion Classification

This notebook fine-tunes a **RoBERTa (Robustly Optimized BERT Pretraining Approach)** model on emotion data for Part B of the assignment.

**Model:** RoBERTa-base (about 125M parameters), a variant of BERT with an improved pretraining procedure.

**Why RoBERTa?** RoBERTa keeps BERT's architecture but changes the pretraining recipe: longer training on more data, larger batches, dynamic masking, and no Next Sentence Prediction objective. In the original paper, these changes gave better results than BERT on GLUE, SQuAD, and RACE.


## 📚 Import Libraries


## 📦 Install Dependencies (Run on Colab)


In [None]:
# Check if running on Google Colab
try:
    import google.colab  # noqa: F401 (imported only to detect Colab)
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

# Install dependencies only on Colab
if IN_COLAB:
    print("Running on Google Colab - Installing dependencies...")
    # Note: torch 2.1.0 predates NumPy 2.x; if NumPy-related errors appear, pin numpy<2 here
    %pip install -q transformers==4.36.0 torch==2.1.0 datasets==2.16.0 accelerate==0.25.0
    %pip install -q pandas==2.3.3 numpy==2.2.5 scikit-learn==1.7.2 matplotlib==3.10.6 seaborn==0.13.2
    print("Dependencies installed successfully!")
else:
    print("Running locally - Using local dependencies")


## 🖥️ GPU Configuration Check


In [None]:
# Check GPU availability for PyTorch
import torch

print("=" * 60)
print("🔍 CHECKING GPU AVAILABILITY")
print("=" * 60)
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print("\n✅ GPU IS AVAILABLE - Training will use GPU acceleration!")
    print(f"   GPU Device: {torch.cuda.get_device_name(0)}")
    print(f"   Number of GPUs: {torch.cuda.device_count()}")
    print(f"   CUDA Version: {torch.version.cuda}")
    print("\n   🚀 Typical speedup: roughly 10-20x over CPU (hardware dependent)")
    device = torch.device("cuda")
else:
    print("\n⚠️  NO GPU DETECTED - Training will use CPU only")
    print("   Note: RoBERTa training on CPU is very slow")
    device = torch.device("cpu")

print("=" * 60)
print(f"Using device: {device}")


In [None]:
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import warnings
warnings.filterwarnings('ignore')

# Transformers imports
from transformers import (
    RobertaTokenizer, 
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments,
    EarlyStoppingCallback
)
from datasets import Dataset
import torch
from torch.utils.data import DataLoader

import random
import os

# Set random seeds for reproducibility
SEED = 42
os.environ['PYTHONHASHSEED'] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)


---

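**📝 Important Note on GPU Usage:**

The Hugging Face `Trainer` API **handles device placement automatically**: when a CUDA GPU is available, it moves the model and each batch to the GPU for you. Plain PyTorch code does not do this on its own and needs explicit `.to(device)` calls.

- If a GPU was detected above ✅, all training will run on the GPU
- RoBERTa fine-tuning is typically much faster on a GPU than on a CPU (often around 10-20x, depending on hardware)
- Monitor GPU usage from a terminal: `watch -n 1 nvidia-smi`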

---


## 📂 Load Data

**Note on Preprocessing:** This notebook uses the preprocessed data from `01_preprocessing.ipynb` (same as Part A models) for fair comparison. While transformers typically perform better with raw text, using consistent preprocessing across all models ensures a valid comparative analysis.


In [None]:
# Load preprocessed data (following the same flow as GRU/LSTM notebooks)
# Note: Transformers typically work with raw text, but for fair comparison
# we use the same preprocessed data as Part A models
train_df = pd.read_pickle('./data/train_preprocessed.pkl')
val_df = pd.read_pickle('./data/validation_preprocessed.pkl')

print(f"Training data shape: {train_df.shape}")
print(f"Validation data shape: {val_df.shape}")
print(f"\nColumns: {train_df.columns.tolist()}")
print(f"\nFirst few rows:")
print(train_df.head())

# Emotion labels
emotion_labels = ['Sadness', 'Joy', 'Love', 'Anger', 'Fear', 'Surprise']
num_labels = len(emotion_labels)

print(f"\n📊 Number of classes: {num_labels}")
print(f"Labels: {emotion_labels}")
print(f"\nLabel distribution in training set:")
print(train_df['Label'].value_counts().sort_index())


## 🔠 RoBERTa Tokenization

RoBERTa uses byte-level Byte-Pair Encoding (BPE) tokenization with a vocabulary of about 50K subword units, unlike BERT's WordPiece tokenizer.
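
To make the merge idea behind BPE concrete, here is a minimal sketch in plain Python on a made-up four-word corpus. It works on characters rather than bytes and is not the `RobertaTokenizer` implementation; the names `learn_bpe` and `toy_corpus` are hypothetical and exist only for this illustration.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: count} dict (toy, character-level)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab


# Made-up word frequencies, purely for illustration.
toy_corpus = {"happy": 5, "happier": 3, "sad": 4, "sadly": 2}
merges, segmented = learn_bpe(toy_corpus, num_merges=3)
print(merges)
print(segmented)
```

Frequent character sequences end up as single symbols, which is why common words tend to map to one or two tokens while rare words are split into several pieces.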


In [None]:
# Load RoBERTa tokenizer
model_name = 'roberta-base'
tokenizer = RobertaTokenizer.from_pretrained(model_name)

print(f"✅ Loaded tokenizer: {model_name}")
print(f"Vocabulary size: {tokenizer.vocab_size}")

# Find optimal max length using preprocessed text
train_lengths = train_df['Text'].apply(lambda x: len(tokenizer.encode(x)))
print("\n📊 Token length statistics:")
print(f"   Mean: {train_lengths.mean():.1f}")
print(f"   Median: {train_lengths.median():.1f}")
print(f"   95th percentile: {train_lengths.quantile(0.95):.1f}")
print(f"   Max: {train_lengths.max()}")

# Use 128 as max_length (covers ~99% of samples while being efficient)
max_length = 128
print(f"\n✅ Using max_length={max_length} for tokenization")


In [None]:
# Tokenize function for datasets
def tokenize_function(examples):
    # Return plain Python lists; set_format(type='torch') below handles tensor conversion
    return tokenizer(
        examples['Text'],
        padding='max_length',
        truncation=True,
        max_length=max_length
    )

# Create HuggingFace datasets from preprocessed data
# Note: Column names are 'Text' and 'Label' (capitalized) in preprocessed files
train_dataset = Dataset.from_pandas(train_df[['Text', 'Label']].rename(columns={'Label': 'label'}))
val_dataset = Dataset.from_pandas(val_df[['Text', 'Label']].rename(columns={'Label': 'label'}))

# Tokenize datasets
print("Tokenizing training data...")
train_dataset = train_dataset.map(tokenize_function, batched=True)
print("Tokenizing validation data...")
val_dataset = val_dataset.map(tokenize_function, batched=True)

# Set format for PyTorch
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
val_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

print("\n✅ Datasets prepared:")
print(f"   Training samples: {len(train_dataset)}")
print(f"   Validation samples: {len(val_dataset)}")


## 🏗️ Load Pre-trained RoBERTa Model

We'll fine-tune RoBERTa for sequence classification with 6 emotion labels.
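
As a rough illustration of what a 6-way classification head produces, the sketch below turns one row of logits into probabilities and a label name. The logit values are invented for the example, the helper names `softmax` and `predict_label` are hypothetical, and the label order is assumed to match `emotion_labels` in this notebook.

```python
import math

emotion_labels = ['Sadness', 'Joy', 'Love', 'Anger', 'Fear', 'Surprise']


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing on large logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, labels):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return labels[idx], probs[idx]


# Made-up logits for a single sentence, one value per class.
example_logits = [0.2, 3.1, 1.4, -0.5, -1.0, 0.3]
label, confidence = predict_label(example_logits, emotion_labels)
print(label, round(confidence, 3))
```

For accuracy, only the argmax matters, so metrics can be computed on raw logits without applying softmax first.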


In [None]:
# Load pre-trained RoBERTa model for sequence classification
model = RobertaForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    problem_type="single_label_classification"
)

# Move model to GPU if available
model.to(device)

print(f"✅ Loaded pre-trained RoBERTa model: {model_name}")
print(f"   Number of parameters: {model.num_parameters():,}")
print(f"   Model device: {next(model.parameters()).device}")

# Model architecture summary
print("\n📋 Model Architecture:")
print(f"   RoBERTa layers: {model.config.num_hidden_layers}")
print(f"   Hidden size: {model.config.hidden_size}")
print(f"   Attention heads: {model.config.num_attention_heads}")
print(f"   Vocabulary size: {model.config.vocab_size}")


## ⚙️ Configure Training Arguments

Set up hyperparameters for fine-tuning RoBERTa.
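
The warmup setting is easier to picture with numbers. The sketch below reproduces the shape of linear warmup followed by linear decay, assuming the Trainer's default linear scheduler is in use. The function name `linear_warmup_lr` is hypothetical, and `total_steps=5000` is an assumed figure: the real value depends on dataset size, batch size, and number of epochs.

```python
def linear_warmup_lr(step, base_lr=2e-5, warmup_steps=500, total_steps=5000):
    """Learning rate at a given optimizer step for linear warmup + linear decay.

    total_steps is an assumption for illustration only.
    """
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup period.
        return base_lr * step / max(1, warmup_steps)
    # After warmup, decay linearly so the rate reaches 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 2750, 5000):
    print(step, f"{linear_warmup_lr(step):.2e}")
```

If the training run has fewer total steps than `warmup_steps`, the peak learning rate is never reached, so it is worth comparing the two numbers before training.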


In [None]:
# Define output directory
output_dir = './data/roberta'
os.makedirs(output_dir, exist_ok=True)

# Training arguments optimized for emotion classification
training_args = TrainingArguments(
    output_dir=output_dir,
    
    # Training hyperparameters
    num_train_epochs=5,              # 3-5 epochs typical for fine-tuning
    per_device_train_batch_size=16,  # Adjust based on GPU memory
    per_device_eval_batch_size=32,   # Can be larger for evaluation
    learning_rate=2e-5,              # Standard for RoBERTa fine-tuning
    weight_decay=0.01,               # Decoupled weight decay (AdamW)
    warmup_steps=500,                # Gradual learning rate warmup
    
    # Evaluation and saving
    evaluation_strategy="epoch",     # Evaluate after each epoch (renamed eval_strategy in newer transformers)
    save_strategy="epoch",           # Save checkpoint after each epoch
    save_total_limit=2,              # Keep at most 2 checkpoints on disk
    load_best_model_at_end=True,     # Load best model after training
    metric_for_best_model="accuracy",
    greater_is_better=True,
    
    # Logging
    logging_dir=f'{output_dir}/logs',
    logging_steps=100,
    logging_strategy="steps",
    
    # Performance
    fp16=torch.cuda.is_available(),  # Mixed precision training on GPU
    dataloader_num_workers=2,
    
    # Reproducibility
    seed=SEED,
    
    # Other
    report_to="none",                # Disable wandb/tensorboard
    push_to_hub=False,
)

print("✅ Training arguments configured:")
print(f"   Epochs: {training_args.num_train_epochs}")
print(f"   Batch size (train): {training_args.per_device_train_batch_size}")
print(f"   Batch size (eval): {training_args.per_device_eval_batch_size}")
print(f"   Learning rate: {training_args.learning_rate}")
print(f"   Weight decay: {training_args.weight_decay}")
print(f"   Warmup steps: {training_args.warmup_steps}")
print(f"   FP16 training: {training_args.fp16}")


In [None]:
# Compute metrics function for Trainer
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_score(labels, predictions)
    return {'accuracy': accuracy}

print("✅ Metrics function defined")


## 🚀 Initialize Trainer and Start Fine-tuning
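
The Trainer in this section is given `EarlyStoppingCallback(early_stopping_patience=2)`. A rough sketch of the patience logic follows, in plain Python with made-up accuracy values. The function name `run_with_patience` is hypothetical, and this is a simplification rather than the library's implementation.

```python
def run_with_patience(val_accuracies, patience=2):
    """Return (last epoch that ran, best accuracy seen) under early stopping."""
    best = float("-inf")
    evals_without_improvement = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            # New best score: remember it and reset the counter.
            best = acc
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
        if evals_without_improvement >= patience:
            # Stop once `patience` evaluations in a row failed to improve.
            return epoch, best
    return len(val_accuracies), best


# Made-up per-epoch validation accuracies for a 5-epoch schedule.
made_up_accuracies = [0.90, 0.93, 0.925, 0.928, 0.94]
stopped_at, best_acc = run_with_patience(made_up_accuracies, patience=2)
print(stopped_at, best_acc)
```

In this toy run, training stops before the final epoch, which would have scored higher. That is the trade-off of a small patience value with only a handful of epochs.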


In [None]:
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

print("✅ Trainer initialized")
print("\n🚀 Starting RoBERTa fine-tuning...")
print("=" * 60)


In [None]:
# Train the model
import time

start_time = time.time()
train_result = trainer.train()
training_time = time.time() - start_time

print("\n✅ Training completed!")
print(f"   Training time: {training_time/60:.2f} minutes")
print(f"   Best checkpoint: {trainer.state.best_model_checkpoint}")


## 📊 Visualize Training Progress


In [None]:
# Extract training history
log_history = trainer.state.log_history

# Separate train and eval logs
train_logs = [log for log in log_history if 'loss' in log and 'eval_loss' not in log]
eval_logs = [log for log in log_history if 'eval_loss' in log]

# Extract metrics
train_loss = [log['loss'] for log in train_logs]
train_steps = [log['step'] for log in train_logs]

eval_loss = [log['eval_loss'] for log in eval_logs]
eval_accuracy = [log['eval_accuracy'] for log in eval_logs]
eval_epochs = [log['epoch'] for log in eval_logs]

# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot loss
ax1 = axes[0]
ax1_twin = ax1.twiny()
ax1.plot(train_steps, train_loss, label='Training Loss', alpha=0.7, color='blue')
ax1_twin.plot(eval_epochs, eval_loss, 'o-', label='Validation Loss', color='red', markersize=8)
ax1.set_xlabel('Training Steps')
ax1_twin.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training and Validation Loss', fontsize=14, fontweight='bold')
ax1.legend(loc='upper left')
ax1_twin.legend(loc='upper right')
ax1.grid(True, alpha=0.3)

# Plot accuracy
axes[1].plot(eval_epochs, eval_accuracy, 's-', markersize=8, color='green', linewidth=2)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Validation Accuracy', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim([min(eval_accuracy) - 0.02, 1.0])

plt.tight_layout()
plt.show()

print("\n📊 Final Training Metrics:")
print(f"   Best Validation Accuracy: {max(eval_accuracy):.4f}")
print(f"   Final Validation Loss: {eval_loss[-1]:.4f}")
print(f"   Training completed in: {training_time/60:.2f} minutes")


## 📈 Evaluate Final Model Performance


In [None]:
# Evaluate on validation set
eval_results = trainer.evaluate()

print("\n📊 Final Model Performance:")
print(f"  Validation Loss: {eval_results['eval_loss']:.4f}")
print(f"  Validation Accuracy: {eval_results['eval_accuracy']:.4f}")

# Generate predictions for confusion matrix
predictions = trainer.predict(val_dataset)
y_pred = np.argmax(predictions.predictions, axis=-1)
y_true = predictions.label_ids

print(f"\n✅ Predictions generated for {len(y_pred)} samples")


## 🎯 Confusion Matrix
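
For reference, a confusion matrix simply counts (true label, predicted label) pairs. The sketch below builds one by hand from toy labels, with rows as true classes and columns as predicted classes, the same layout scikit-learn's `confusion_matrix` uses. The helper name `confusion_counts` and the toy data are hypothetical.

```python
def confusion_counts(y_true, y_pred, num_classes):
    """Build a num_classes x num_classes matrix: rows = true, columns = predicted."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


# Toy integer labels for a 3-class problem (not from the emotion dataset).
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_matrix = confusion_counts(toy_true, toy_pred, num_classes=3)
for row in toy_matrix:
    print(row)
```

Diagonal entries are correct predictions. An off-diagonal entry at row `i`, column `j` counts samples of class `i` that were predicted as class `j`.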


In [None]:
# Create confusion matrix
cm = confusion_matrix(y_true, y_pred)

# Plot confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Greens', 
            xticklabels=emotion_labels,
            yticklabels=emotion_labels)
plt.title('Confusion Matrix - RoBERTa Model', fontsize=14, fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()


## 📝 Classification Report
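
The per-class numbers in a classification report can be derived from a confusion matrix whose rows are true classes and whose columns are predicted classes. The sketch below shows that arithmetic on a toy 3-class matrix. The helper name `per_class_metrics` and the matrix values are hypothetical, and zero denominators are mapped to 0.0 as a simplifying assumption.

```python
def per_class_metrics(matrix):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_positives = matrix[k][k]
        predicted_as_k = sum(matrix[row][k] for row in range(n))  # column sum
        actually_k = sum(matrix[k])                               # row sum
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


# Toy matrix: rows = true class, columns = predicted class.
toy_cm = [[1, 1, 0],
          [0, 2, 0],
          [1, 0, 1]]
toy_metrics = per_class_metrics(toy_cm)
for class_id, (p, r, f) in enumerate(toy_metrics):
    print(class_id, round(p, 3), round(r, 3), round(f, 3))
```

Precision answers "of everything predicted as this class, how much was right?", while recall answers "of everything that truly is this class, how much was found?".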


In [None]:
# Generate classification report
report = classification_report(y_true, y_pred, target_names=emotion_labels)
print("\n📝 Classification Report:")
print("=" * 60)
print(report)


## 📊 Model Statistics Summary


In [None]:
# Calculate model size
model_size_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / (1024 * 1024)

print("\n" + "=" * 60)
print("📊 RoBERTa MODEL STATISTICS")
print("=" * 60)
print(f"Model Name:              {model_name}")
print(f"Total Parameters:        {model.num_parameters():,}")
print(f"Model Size:              {model_size_mb:.2f} MB")
print(f"Training Time:           {training_time/60:.2f} minutes")
print(f"Validation Accuracy:     {eval_results['eval_accuracy']:.4f} ({eval_results['eval_accuracy']*100:.2f}%)")
print(f"Validation Loss:         {eval_results['eval_loss']:.4f}")
print(f"Max Sequence Length:     {max_length}")
print("=" * 60)


## 💾 Save Final Model and Tokenizer


In [None]:
# Save the final model and tokenizer
final_model_dir = './data/roberta/final_model'
os.makedirs(final_model_dir, exist_ok=True)

# Save model
trainer.save_model(final_model_dir)
print(f"✅ Model saved to: {final_model_dir}")

# Save tokenizer
tokenizer.save_pretrained(final_model_dir)
print(f"✅ Tokenizer saved to: {final_model_dir}")

# Save metadata
metadata = {
    'model_name': model_name,
    'num_parameters': model.num_parameters(),
    'model_size_mb': model_size_mb,
    'max_length': max_length,
    'num_labels': num_labels,
    'emotion_labels': emotion_labels,
    'val_accuracy': eval_results['eval_accuracy'],
    'val_loss': eval_results['eval_loss'],
    'training_time_minutes': training_time/60,
    'training_args': {
        'num_epochs': training_args.num_train_epochs,
        'batch_size': training_args.per_device_train_batch_size,
        'learning_rate': training_args.learning_rate,
        'weight_decay': training_args.weight_decay,
        'warmup_steps': training_args.warmup_steps
    }
}

metadata_path = os.path.join(output_dir, 'roberta_metadata.pkl')
with open(metadata_path, 'wb') as f:
    pickle.dump(metadata, f)
print(f"✅ Metadata saved to: {metadata_path}")

print("\n" + "=" * 60)
print("✅ All files saved successfully!")
print("=" * 60)
print(f"\nFinal Model Performance:")
print(f"  Validation Accuracy: {eval_results['eval_accuracy']:.4f}")
print(f"  Validation Loss: {eval_results['eval_loss']:.4f}")
print(f"  Model Size: {model_size_mb:.2f} MB")
print(f"  Parameters: {model.num_parameters():,}")
