# ![Banner](https://github.com/LittleHouse75/flatiron-resources/raw/main/NevitsBanner.png)
---
# Experiment 2 — BART & T5 (Pretrained Seq2Seq Models)
### Purpose-Built Encoder-Decoder Summarization
---

This notebook implements **Experiment 2** for the capstone project:

**Goal:**  
Evaluate purpose-built sequence-to-sequence models for dialogue summarization:

- **BART** (`facebook/bart-base`) — Denoising autoencoder pretrained for seq2seq tasks
- **T5** (`t5-small`) — Text-to-text transformer pretrained on C4

Unlike Experiment 1's "Frankenstein" BERT→GPT-2, these models have **pretrained cross-attention**
layers, meaning the encoder and decoder already know how to communicate.
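
To make the cross-attention point concrete, here is a minimal pure-Python sketch of scaled dot-product cross-attention, in which decoder states supply the queries and encoder states supply the keys and values. It is an illustration only: the function name and the toy vectors are hypothetical, and the learned query/key/value projections that real BART and T5 layers apply are left out. Those learned projections are the part that arrives already trained in these checkpoints.

```python
import math


def cross_attention(decoder_queries, encoder_keys, encoder_values):
    """Toy scaled dot-product cross-attention on plain Python lists.

    Assumption: no learned projections, single head, no masking.
    """
    d_k = len(encoder_keys[0])
    outputs = []
    for query in decoder_queries:
        # Similarity of this decoder position to every encoder position
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
            for key in encoder_keys
        ]
        # Numerically stable softmax over encoder positions
        max_score = max(scores)
        exps = [math.exp(s - max_score) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the encoder values
        dim = len(encoder_values[0])
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, encoder_values)) for j in range(dim)]
        )
    return outputs


# One decoder position attending over two encoder positions (made-up numbers)
context = cross_attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]])
print(context)
```

The query here is closer to the first key, so the output leans toward the first value vector. In a trained model the same mechanism decides which dialogue tokens each summary token draws on.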

**What This Notebook Covers:**
1. Model construction and tokenizer setup
2. Fine-tuning on SAMSum using `Seq2SeqTrainer`
3. Evaluation on both validation and **test sets**
4. ROUGE metrics and qualitative analysis
5. Side-by-side comparison of BART vs T5

**Note:** This notebook parallels the structure of `02_experiment1_bert_gpt2-revised.ipynb`
for consistency across experiments.

## 1. Environment Setup

In [None]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import random
import numpy as np
import pandas as pd
import torch
from pathlib import Path
import sys
import warnings

# Mute common warnings
warnings.filterwarnings("ignore", message=".*requires_grad.*")
warnings.filterwarnings("ignore", category=FutureWarning)

# Project root for imports
PROJECT_ROOT = Path("..").resolve()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

# Reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

if device.type == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 2. Configuration

All hyperparameters and paths in one place for easy modification.

**Note:** Parameters are aligned with Experiment 1 for fair comparison,
with adjustments where appropriate for pretrained seq2seq models.
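
As a rough sanity check on these settings, the sketch below works out the effective batch size, the approximate number of optimizer steps, and the share of training spent in warmup. The helper name is hypothetical, the training-set size of 14,732 is an assumption (substitute `len(train_df)` once the data is loaded), and the step count is an approximation rather than the exact figure the trainer will report.

```python
import math


def training_budget(num_examples, batch_size, grad_accum_steps, num_epochs, warmup_steps):
    """Approximate step counts for a single-device run (illustrative helper)."""
    effective_batch = batch_size * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_fraction": warmup_steps / total_steps,
    }


# 14_732 is an assumed dataset size; the other values mirror the configuration cell
budget = training_budget(
    num_examples=14_732,
    batch_size=4,
    grad_accum_steps=2,
    num_epochs=6,
    warmup_steps=500,
)
print(budget)
```

Under these assumptions warmup covers a little under 5% of the run if all six epochs complete, and a larger share if early stopping ends training sooner.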

In [None]:
# =============================================================================
# TRAINING FLAGS
# =============================================================================
RUN_TRAINING_BART = True  # Set False to load from checkpoint
RUN_TRAINING_T5 = True    # Set False to load from checkpoint

# =============================================================================
# MODEL CONFIGURATION
# =============================================================================
BART_MODEL_NAME = "facebook/bart-base"
T5_MODEL_NAME = "t5-small"
T5_PREFIX = "summarize: "  # Task prefix T5 expects for summarization (trailing space is intentional)

# =============================================================================
# SEQUENCE LENGTHS
# =============================================================================
MAX_SOURCE_LEN = 512  # Dialogue input length
MAX_TARGET_LEN = 128  # Summary output length

# =============================================================================
# TRAINING HYPERPARAMETERS
# (Aligned with Experiment 1 for fair comparison)
# =============================================================================
BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 2        # Effective batch size = 4 * 2 = 8
NUM_EPOCHS = 6              # BART/T5 converge faster than BERT→GPT2
LEARNING_RATE = 5e-5
WARMUP_STEPS = 500
WEIGHT_DECAY = 0.01
LOGGING_STEPS = 100

# Early stopping
EARLY_STOPPING_PATIENCE = 2

# Generation settings
NUM_BEAMS = 4
NO_REPEAT_NGRAM_SIZE = 3
LENGTH_PENALTY = 2.0
MIN_LENGTH = 5

# =============================================================================
# PATHS
# =============================================================================
# BART paths
BART_OUTPUT_DIR = PROJECT_ROOT / "models" / "bart"
BART_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
BART_CHECKPOINT_DIR = BART_OUTPUT_DIR / "checkpoints"
BART_BEST_DIR = BART_OUTPUT_DIR / "best"
BART_HISTORY_PATH = BART_OUTPUT_DIR / "history.csv"
BART_TEST_RESULTS_PATH = BART_OUTPUT_DIR / "test_results.csv"

# T5 paths
T5_OUTPUT_DIR = PROJECT_ROOT / "models" / "t5"
T5_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
T5_CHECKPOINT_DIR = T5_OUTPUT_DIR / "checkpoints"
T5_BEST_DIR = T5_OUTPUT_DIR / "best"
T5_HISTORY_PATH = T5_OUTPUT_DIR / "history.csv"
T5_TEST_RESULTS_PATH = T5_OUTPUT_DIR / "test_results.csv"

print("Configuration loaded.")
print(f"  BART model: {BART_MODEL_NAME}")
print(f"  T5 model: {T5_MODEL_NAME}")
print(f"  Effective batch size: {BATCH_SIZE * GRAD_ACCUM_STEPS}")
print(f"  BART output: {BART_OUTPUT_DIR}")
print(f"  T5 output: {T5_OUTPUT_DIR}")

## 3. Load SAMSum Data

In [None]:
from src.data.load_data import load_samsum

train_df, val_df, test_df = load_samsum()

print(f"Dataset sizes:")
print(f"  Train:      {len(train_df):,} examples")
print(f"  Validation: {len(val_df):,} examples")
print(f"  Test:       {len(test_df):,} examples")

In [None]:
# Quick peek at the data
print("Sample dialogue:")
print("-" * 40)
print(train_df.iloc[0]["dialogue"][:300], "...")
print()
print("Sample summary:")
print("-" * 40)
print(train_df.iloc[0]["summary"])

## 4. Shared Imports and Utilities

Import common components used by both BART and T5.
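
For intuition about what the ROUGE scores measure, here is a minimal sketch of ROUGE-1 F1 as clipped unigram overlap. It is a simplification, not the implementation behind `evaluate.load("rouge")`: it splits on whitespace and skips the stemming that `use_stemmer=True` enables, and the function name is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection clips each word to its minimum count in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Made-up example: every predicted word is correct, but two reference words are missed
print(rouge1_f1("amanda baked cookies", "amanda baked cookies for jerry"))
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L replaces n-gram overlap with the longest common subsequence.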

In [None]:
from datasets import Dataset
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    EarlyStoppingCallback,
)
import evaluate

# Load ROUGE metric (shared by both models)
rouge_metric = evaluate.load("rouge")

def compute_metrics(eval_pred, tokenizer):
    """
    Compute ROUGE scores for evaluation.
    This function is parameterized by tokenizer to work with both BART and T5.
    """
    predictions, labels = eval_pred
    
    # Replace -100 with pad token id for decoding
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    
    # Decode predictions and labels
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Clean up whitespace
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [label.strip() for label in decoded_labels]
    
    # Compute ROUGE
    result = rouge_metric.compute(
        predictions=decoded_preds,
        references=decoded_labels,
        use_stemmer=True,
    )
    
    # Convert to percentages for readability
    result = {k: round(v * 100, 2) for k, v in result.items()}
    
    return result

print("Shared utilities loaded.")

---
# PART A: BART
---

BART (Bidirectional and Auto-Regressive Transformers) is a denoising autoencoder
pretrained by corrupting text and learning to reconstruct it. This makes it
naturally suited for seq2seq tasks like summarization.
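
The corrupt-then-reconstruct idea can be illustrated with a toy version of text infilling, one of the noising schemes used in BART pretraining: a span of tokens is replaced by a single mask token, and the model must regenerate the original text. This sketch is deliberately simplified. The function name is hypothetical, the span is chosen by hand rather than sampled, and it works on whitespace tokens rather than BART's subword vocabulary.

```python
def infill_span(tokens, start, span_len, mask_token="<mask>"):
    """Replace tokens[start:start + span_len] with one mask token."""
    if start < 0 or span_len < 0 or start + span_len > len(tokens):
        raise ValueError("span must lie inside the token list")
    return tokens[:start] + [mask_token] + tokens[start + span_len:]


original = "Amanda baked cookies for Jerry".split()
corrupted = infill_span(original, start=1, span_len=2)

# During pretraining the encoder would read `corrupted`
# and the decoder would be trained to emit `original`.
print(corrupted)
print(original)
```

Because one mask can stand for several tokens, the model has to infer how much text is missing, which is close in spirit to producing a summary of a different length than its input.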

## 5A. BART Tokenizer Setup

In [None]:
from transformers import BartTokenizer

bart_tokenizer = BartTokenizer.from_pretrained(BART_MODEL_NAME)

print(f"BART Tokenizer: {bart_tokenizer.__class__.__name__}")
print(f"  Vocab size: {len(bart_tokenizer):,}")
print(f"  Pad token: '{bart_tokenizer.pad_token}' (id: {bart_tokenizer.pad_token_id})")
print(f"  BOS token: '{bart_tokenizer.bos_token}' (id: {bart_tokenizer.bos_token_id})")
print(f"  EOS token: '{bart_tokenizer.eos_token}' (id: {bart_tokenizer.eos_token_id})")

## 6A. Build or Load BART Model
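
The generation defaults set in the next cell include `LENGTH_PENALTY = 2.0`. The sketch below shows how such a penalty changes which beam wins, following the Hugging Face description of `length_penalty` as an exponent on sequence length that divides the summed log-probability. The helper names and the two candidate hypotheses are made up, and the exact length the library uses can differ between versions.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized beam score: summed log-probability / length ** penalty."""
    return sum_logprobs / (length ** length_penalty)


def preferred(short, long, length_penalty):
    """Return which of two hypotheses scores higher under the given penalty."""
    short_score = beam_score(short["sum_logprobs"], short["length"], length_penalty)
    long_score = beam_score(long["sum_logprobs"], long["length"], length_penalty)
    return "short" if short_score >= long_score else "long"


# Hypothetical candidates: the longer one has a lower total log-probability
short_hyp = {"sum_logprobs": -4.0, "length": 5}
long_hyp = {"sum_logprobs": -6.0, "length": 10}

for penalty in (0.0, 1.0, 2.0):
    print(penalty, preferred(short_hyp, long_hyp, penalty))
```

Since log-probabilities are negative, dividing by a larger power of the length moves long hypotheses closer to zero, so a penalty of 2.0 nudges beam search toward longer summaries.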

In [None]:
from transformers import BartForConditionalGeneration

if RUN_TRAINING_BART:
    print("Building fresh BART model for training...")
    
    bart_model = BartForConditionalGeneration.from_pretrained(BART_MODEL_NAME)
    
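    # Configure generation defaults on generation_config
    # (setting generation parameters on model.config is deprecated in recent transformers releases)
    bart_model.generation_config.max_length = MAX_TARGET_LEN
    bart_model.generation_config.min_length = MIN_LENGTH
    bart_model.generation_config.num_beams = NUM_BEAMS
    bart_model.generation_config.no_repeat_ngram_size = NO_REPEAT_NGRAM_SIZE
    bart_model.generation_config.length_penalty = LENGTH_PENALTY
    bart_model.generation_config.early_stopping = True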
    
    print(f"\nBART model built successfully!")
    
else:
    print(f"Loading BART model from checkpoint: {BART_BEST_DIR}")
    bart_model = BartForConditionalGeneration.from_pretrained(BART_BEST_DIR)
    bart_tokenizer = BartTokenizer.from_pretrained(BART_BEST_DIR)
    print("BART model loaded successfully!")

# Move to device
bart_model = bart_model.to(device)

# Count parameters
total_params = sum(p.numel() for p in bart_model.parameters())
trainable_params = sum(p.numel() for p in bart_model.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

## 7A. Prepare BART Datasets
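
The preprocessing function in this section replaces padding positions in the labels with `-100`. That value is the default `ignore_index` of PyTorch's cross-entropy loss, so padded positions contribute nothing to the loss. The sketch below shows the masking step on a plain list; the helper names and token ids are illustrative only.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX so the loss skips them."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in label_ids]


def tokens_in_loss(masked_labels):
    """Count the label positions that still contribute to the loss."""
    return sum(1 for tok in masked_labels if tok != IGNORE_INDEX)


# Made-up label sequence padded to length 6 with pad id 1
labels = [0, 713, 16, 2, 1, 1]
masked = mask_padding(labels, pad_token_id=1)
print(masked, tokens_in_loss(masked))
```

Without this step the model would also be trained to predict long runs of padding, which dilutes the signal from the real summary tokens.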

In [None]:
def preprocess_bart(examples):
    """
    Tokenize dialogues and summaries for BART.
    BART doesn't need a task prefix.
    """
    # Encode dialogues
    model_inputs = bart_tokenizer(
        examples["dialogue"],
        max_length=MAX_SOURCE_LEN,
        truncation=True,
        padding="max_length",
    )
    
    # Encode summaries (as labels)
    labels = bart_tokenizer(
        text_target=examples["summary"],
        max_length=MAX_TARGET_LEN,
        truncation=True,
        padding="max_length",
    )
    
    # Replace padding token id with -100 for loss calculation
    labels_ids = np.array(labels["input_ids"])
    labels_ids[labels_ids == bart_tokenizer.pad_token_id] = -100
    
    model_inputs["labels"] = labels_ids.tolist()
    return model_inputs


# Convert DataFrames to HuggingFace Datasets
print("Converting to HuggingFace Datasets...")
bart_train_dataset = Dataset.from_pandas(train_df[["dialogue", "summary"]])
bart_val_dataset = Dataset.from_pandas(val_df[["dialogue", "summary"]])
bart_test_dataset = Dataset.from_pandas(test_df[["dialogue", "summary"]])

# Tokenize
print("Tokenizing datasets for BART...")
bart_tokenized_train = bart_train_dataset.map(
    preprocess_bart,
    batched=True,
    remove_columns=["dialogue", "summary"],
    desc="Tokenizing train",
)

bart_tokenized_val = bart_val_dataset.map(
    preprocess_bart,
    batched=True,
    remove_columns=["dialogue", "summary"],
    desc="Tokenizing validation",
)

bart_tokenized_test = bart_test_dataset.map(
    preprocess_bart,
    batched=True,
    remove_columns=["dialogue", "summary"],
    desc="Tokenizing test",
)

print(f"\nBART tokenized datasets:")
print(f"  Train: {len(bart_tokenized_train):,} examples")
print(f"  Validation: {len(bart_tokenized_val):,} examples")
print(f"  Test: {len(bart_tokenized_test):,} examples")

## 8A. BART Data Collator

In [None]:
bart_data_collator = DataCollatorForSeq2Seq(
    tokenizer=bart_tokenizer,
    model=bart_model,
    label_pad_token_id=-100,
)

# Create metrics function with BART tokenizer
def bart_compute_metrics(eval_pred):
    return compute_metrics(eval_pred, bart_tokenizer)

print("BART data collator configured.")

## 9A. Train BART with Seq2SeqTrainer
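
Training uses early stopping on validation ROUGE-L with a patience of `EARLY_STOPPING_PATIENCE`. The sketch below simulates that rule on a list of per-epoch scores. It is a simplified model of the behavior, not the `EarlyStoppingCallback` source: the function name is hypothetical and the scores are invented for illustration.

```python
def simulate_early_stopping(scores, patience, threshold=0.0):
    """Return (epoch at which training stops, epoch with the best score).

    Epochs are 1-indexed. A score counts as an improvement only if it
    beats the best score so far by more than `threshold`.
    """
    best = None
    best_epoch = None
    bad_epochs = 0
    for epoch, score in enumerate(scores, start=1):
        if best is None or score - best > threshold:
            best, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return len(scores), best_epoch


# Invented validation ROUGE-L values, one per epoch
stopped_at, best_at = simulate_early_stopping([40.1, 42.3, 42.0, 41.8, 43.0], patience=2)
print(stopped_at, best_at)
```

In this invented run the fifth epoch would have set a new best, but training stops after the fourth. That is the usual trade-off with a small patience value: less compute spent, at the risk of missing a late improvement.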

In [None]:
if RUN_TRAINING_BART:
    print("Setting up BART Seq2SeqTrainer...")
    print(f"\nTraining configuration:")
    print(f"  Epochs: {NUM_EPOCHS}")
    print(f"  Batch size: {BATCH_SIZE}")
    print(f"  Gradient accumulation: {GRAD_ACCUM_STEPS}")
    print(f"  Effective batch size: {BATCH_SIZE * GRAD_ACCUM_STEPS}")
    print(f"  Learning rate: {LEARNING_RATE}")
    print(f"  Warmup steps: {WARMUP_STEPS}")
    print(f"  Weight decay: {WEIGHT_DECAY}")
    
    bart_training_args = Seq2SeqTrainingArguments(
        output_dir=str(BART_CHECKPOINT_DIR),
        
        # Training
        num_train_epochs=NUM_EPOCHS,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRAD_ACCUM_STEPS,
        
        # Optimization
        learning_rate=LEARNING_RATE,
        warmup_steps=WARMUP_STEPS,
        weight_decay=WEIGHT_DECAY,
        
        # Evaluation & Saving
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="rougeL",  # Use ROUGE-L, not val_loss!
        greater_is_better=True,
        save_total_limit=2,
        
        # Generation during evaluation
        predict_with_generate=True,
        generation_max_length=MAX_TARGET_LEN,
        generation_num_beams=NUM_BEAMS,
        
        # Logging
        logging_dir=str(BART_OUTPUT_DIR / "logs"),
        logging_strategy="steps",
        logging_steps=LOGGING_STEPS,
        
        # Performance
        fp16=torch.cuda.is_available(),
        dataloader_num_workers=0,
        
        # Reproducibility
        seed=SEED,
    )
    
    bart_trainer = Seq2SeqTrainer(
        model=bart_model,
        args=bart_training_args,
        train_dataset=bart_tokenized_train,
        eval_dataset=bart_tokenized_val,
        tokenizer=bart_tokenizer,
        data_collator=bart_data_collator,
        compute_metrics=bart_compute_metrics,
        callbacks=[EarlyStoppingCallback(
            early_stopping_patience=EARLY_STOPPING_PATIENCE,
            early_stopping_threshold=0.0,
        )],
    )
    
    print("\nBART Trainer configured successfully!")

In [None]:
if RUN_TRAINING_BART:
    print("="*60)
    print("STARTING BART TRAINING")
    print("="*60)
    
    bart_train_result = bart_trainer.train()
    
    print("\n" + "="*60)
    print("BART TRAINING COMPLETE")
    print("="*60)
    print(f"\nTraining time: {bart_train_result.metrics['train_runtime']:.1f} seconds")
    print(f"Final training loss: {bart_train_result.metrics['train_loss']:.4f}")

## 10A. Save BART Model

In [None]:
if RUN_TRAINING_BART:
    print(f"\nSaving best BART model to: {BART_BEST_DIR}")
    bart_trainer.save_model(str(BART_BEST_DIR))
    bart_tokenizer.save_pretrained(BART_BEST_DIR)
    bart_trainer.save_state()
    print("BART model and tokenizer saved!")

## 11A. BART Training History & Loss Curves

In [None]:
if RUN_TRAINING_BART:
    # Extract training history from trainer state
    bart_log_history = bart_trainer.state.log_history
    
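    # Separate step-level training logs ("loss") from epoch-level evaluation logs
    bart_train_logs = [log for log in bart_log_history if "loss" in log]
    bart_eval_logs = [log for log in bart_log_history if "eval_loss" in log]
    
    # Create history DataFrame
    bart_history_data = []
    prev_epoch = 0.0
    for eval_log in bart_eval_logs:
        epoch = eval_log.get("epoch", 0)
        # Eval logs carry no training loss, so average the step-level losses
        # logged since the previous evaluation
        epoch_losses = [
            log["loss"] for log in bart_train_logs
            if prev_epoch < log.get("epoch", 0) <= epoch
        ]
        prev_epoch = epoch
        bart_history_data.append({
            "epoch": int(round(epoch)),  # round first: the logged epoch is a float
            "train_loss": float(np.mean(epoch_losses)) if epoch_losses else np.nan,
            "val_loss": eval_log.get("eval_loss", np.nan),
            "rouge1": eval_log.get("eval_rouge1", np.nan),
            "rouge2": eval_log.get("eval_rouge2", np.nan),
            "rougeL": eval_log.get("eval_rougeL", np.nan),
            "rougeLsum": eval_log.get("eval_rougeLsum", np.nan),
        })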
    
    bart_history_df = pd.DataFrame(bart_history_data)
    bart_history_df.to_csv(BART_HISTORY_PATH, index=False)
    print(f"BART training history saved to: {BART_HISTORY_PATH}")
    
else:
    if BART_HISTORY_PATH.exists():
        bart_history_df = pd.read_csv(BART_HISTORY_PATH)
        print(f"Loaded BART training history from: {BART_HISTORY_PATH}")
    else:
        bart_history_df = None
        print("No BART training history found.")

In [None]:
if bart_history_df is not None:
    print("\n" + "="*60)
    print("BART TRAINING HISTORY")
    print("="*60)
    display(bart_history_df)

In [None]:
import matplotlib.pyplot as plt

if bart_history_df is not None and len(bart_history_df) > 0:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Loss curves
    ax = axes[0]
    if "train_loss" in bart_history_df.columns:
        ax.plot(bart_history_df["epoch"], bart_history_df["train_loss"], marker="o", label="Train Loss")
    ax.plot(bart_history_df["epoch"], bart_history_df["val_loss"], marker="o", label="Val Loss")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Loss")
    ax.set_title("BART — Loss Curves")
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # ROUGE curves
    ax = axes[1]
    ax.plot(bart_history_df["epoch"], bart_history_df["rouge1"], marker="o", label="ROUGE-1")
    ax.plot(bart_history_df["epoch"], bart_history_df["rouge2"], marker="o", label="ROUGE-2")
    ax.plot(bart_history_df["epoch"], bart_history_df["rougeL"], marker="o", label="ROUGE-L")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("ROUGE Score")
    ax.set_title("BART — ROUGE Scores (Validation)")
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    
    fig_path = BART_OUTPUT_DIR / "training_curves.png"
    plt.savefig(fig_path, dpi=150, bbox_inches="tight")
    print(f"Saved BART training curves to: {fig_path}")
    
    plt.show()

## 12A. BART Validation Qualitative Examples

In [None]:
from src.eval.qualitative import qualitative_samples

print("\n" + "="*60)
print("BART VALIDATION SET: Qualitative Examples")
print("="*60)

qualitative_samples(
    df=val_df,
    model=bart_model,
    encoder_tokenizer=bart_tokenizer,
    decoder_tokenizer=bart_tokenizer,
    device=device,
    max_source_len=MAX_SOURCE_LEN,
    max_target_len=MAX_TARGET_LEN,
    source_prefix="",  # No prefix for BART
    n=5,
    seed=SEED,
)

## 13A. BART Test Set Evaluation

Evaluate on the **held-out test set** — the final measure of model performance.

In [None]:
print("\n" + "="*60)
print("BART TEST SET EVALUATION")
print("="*60)

if RUN_TRAINING_BART:
    print("\nRunning BART evaluation on test set...")
    bart_test_results = bart_trainer.evaluate(eval_dataset=bart_tokenized_test)
else:
    # Create a trainer just for evaluation
    print("\nCreating BART trainer for evaluation...")
    
    bart_eval_args = Seq2SeqTrainingArguments(
        output_dir=str(BART_OUTPUT_DIR / "eval_temp"),
        per_device_eval_batch_size=BATCH_SIZE,
        predict_with_generate=True,
        generation_max_length=MAX_TARGET_LEN,
        generation_num_beams=NUM_BEAMS,
        fp16=torch.cuda.is_available(),
    )
    
    bart_eval_trainer = Seq2SeqTrainer(
        model=bart_model,
        args=bart_eval_args,
        tokenizer=bart_tokenizer,
        data_collator=bart_data_collator,
        compute_metrics=bart_compute_metrics,
    )
    
    print("Running BART evaluation on test set...")
    bart_test_results = bart_eval_trainer.evaluate(eval_dataset=bart_tokenized_test)

# Display results
print("\n" + "-"*40)
print("BART TEST SET RESULTS")
print("-"*40)
print(f"  Loss:      {bart_test_results['eval_loss']:.4f}")
print(f"  ROUGE-1:   {bart_test_results['eval_rouge1']:.2f}")
print(f"  ROUGE-2:   {bart_test_results['eval_rouge2']:.2f}")
print(f"  ROUGE-L:   {bart_test_results['eval_rougeL']:.2f}")
print(f"  ROUGE-Lsum:{bart_test_results['eval_rougeLsum']:.2f}")

In [None]:
# Save BART test results
bart_test_results_df = pd.DataFrame([{
    "model": "BART",
    "test_loss": bart_test_results["eval_loss"],
    "rouge1": bart_test_results["eval_rouge1"],
    "rouge2": bart_test_results["eval_rouge2"],
    "rougeL": bart_test_results["eval_rougeL"],
    "rougeLsum": bart_test_results["eval_rougeLsum"],
}])

bart_test_results_df.to_csv(BART_TEST_RESULTS_PATH, index=False)
print(f"\nBART test results saved to: {BART_TEST_RESULTS_PATH}")
display(bart_test_results_df)

## 14A. BART Test Set Qualitative Examples

In [None]:
print("\n" + "="*60)
print("BART TEST SET: Qualitative Examples")
print("="*60)

qualitative_samples(
    df=test_df,
    model=bart_model,
    encoder_tokenizer=bart_tokenizer,
    decoder_tokenizer=bart_tokenizer,
    device=device,
    max_source_len=MAX_SOURCE_LEN,
    max_target_len=MAX_TARGET_LEN,
    source_prefix="",
    n=5,
    seed=SEED,
)

## 15A. Generate All BART Test Predictions

In [None]:
from tqdm.auto import tqdm
from src.eval.qualitative import generate_summary

print("\nGenerating BART predictions for all test examples...")
print("This may take a few minutes.\n")

bart_test_predictions = []

bart_model.eval()
for idx, row in tqdm(test_df.iterrows(), total=len(test_df), desc="BART Generating"):
    pred = generate_summary(
        model=bart_model,
        encoder_tokenizer=bart_tokenizer,
        decoder_tokenizer=bart_tokenizer,
        text=row["dialogue"],
        device=device,
        max_source_len=MAX_SOURCE_LEN,
        max_target_len=MAX_TARGET_LEN,
        source_prefix="",
    )
    bart_test_predictions.append(pred)

# Create results DataFrame
bart_full_test_results = test_df.copy()
bart_full_test_results["model_prediction"] = bart_test_predictions

# Save
bart_predictions_path = BART_OUTPUT_DIR / "test_predictions.csv"
bart_full_test_results.to_csv(bart_predictions_path, index=False)
print(f"\nSaved {len(bart_test_predictions)} BART predictions to: {bart_predictions_path}")

In [None]:
# Verify ROUGE scores match
from src.eval.rouge_eval import compute_rouge_from_lists

print("Verifying BART ROUGE scores on full test set predictions...")

bart_rouge_verify = compute_rouge_from_lists(
    predictions=bart_test_predictions,
    references=test_df["summary"].tolist(),
)

print(f"\nBART ROUGE Scores (verification):")
print(f"  ROUGE-1:   {bart_rouge_verify['rouge1']*100:.2f}")
print(f"  ROUGE-2:   {bart_rouge_verify['rouge2']*100:.2f}")
print(f"  ROUGE-L:   {bart_rouge_verify['rougeL']*100:.2f}")

---
# PART B: T5
---

T5 (Text-to-Text Transfer Transformer) frames all NLP tasks as text-to-text problems.
For summarization, we prepend `"summarize: "` to the input text.
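
A minimal sketch of the prefixing step is shown below. The helper name is hypothetical; the notebook's own preprocessing does the same thing inline with `T5_PREFIX`. The trailing space in the prefix matters, because without it the prefix runs straight into the first speaker name.

```python
def add_task_prefix(dialogues, prefix="summarize: "):
    """Prepend the T5 task prefix to every dialogue string."""
    return [prefix + dialogue for dialogue in dialogues]


# Made-up dialogue snippets
batch = ["Amanda: I baked cookies.", "Jerry: Sure!"]
print(add_task_prefix(batch))

# Dropping the trailing space changes the model input
print(add_task_prefix(batch, prefix="summarize:"))
```

The same prefix string has to be used at training time and at inference time, since the model only ever sees the concatenated text.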

## 5B. T5 Tokenizer Setup

In [None]:
from transformers import T5Tokenizer

t5_tokenizer = T5Tokenizer.from_pretrained(T5_MODEL_NAME)

print(f"T5 Tokenizer: {t5_tokenizer.__class__.__name__}")
print(f"  Vocab size: {len(t5_tokenizer):,}")
print(f"  Pad token: '{t5_tokenizer.pad_token}' (id: {t5_tokenizer.pad_token_id})")
print(f"  EOS token: '{t5_tokenizer.eos_token}' (id: {t5_tokenizer.eos_token_id})")
print(f"  Task prefix: '{T5_PREFIX}'")

## 6B. Build or Load T5 Model

In [None]:
from transformers import T5ForConditionalGeneration

if RUN_TRAINING_T5:
    print("Building fresh T5 model for training...")
    
    t5_model = T5ForConditionalGeneration.from_pretrained(T5_MODEL_NAME)
    
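    # Configure generation defaults on generation_config
    # (setting generation parameters on model.config is deprecated in recent transformers releases)
    t5_model.generation_config.max_length = MAX_TARGET_LEN
    t5_model.generation_config.min_length = MIN_LENGTH
    t5_model.generation_config.num_beams = NUM_BEAMS
    t5_model.generation_config.no_repeat_ngram_size = NO_REPEAT_NGRAM_SIZE
    t5_model.generation_config.length_penalty = LENGTH_PENALTY
    t5_model.generation_config.early_stopping = True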
    
    print(f"\nT5 model built successfully!")
    
else:
    print(f"Loading T5 model from checkpoint: {T5_BEST_DIR}")
    t5_model = T5ForConditionalGeneration.from_pretrained(T5_BEST_DIR)
    t5_tokenizer = T5Tokenizer.from_pretrained(T5_BEST_DIR)
    
    # Load the prefix used during training
    prefix_path = T5_BEST_DIR / "source_prefix.txt"
    if prefix_path.exists():
        # Strip only a stray newline: .strip() would also drop the prefix's trailing space
        loaded_prefix = prefix_path.read_text().rstrip("\n")
        if loaded_prefix != T5_PREFIX:
            print(f"  WARNING: Loaded prefix '{loaded_prefix}' differs from config '{T5_PREFIX}'")
            print(f"  Using loaded prefix for consistency.")
            T5_PREFIX = loaded_prefix
    
    print("T5 model loaded successfully!")

# Move to device
t5_model = t5_model.to(device)

# Count parameters
total_params = sum(p.numel() for p in t5_model.parameters())
trainable_params = sum(p.numel() for p in t5_model.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

## 7B. Prepare T5 Datasets

In [None]:
def preprocess_t5(examples):
    """
    Tokenize dialogues and summaries for T5.
    T5 requires a task prefix (e.g., "summarize: ").
    """
    # Add prefix to dialogues
    inputs = [T5_PREFIX + dialogue for dialogue in examples["dialogue"]]
    
    # Encode dialogues with prefix
    model_inputs = t5_tokenizer(
        inputs,
        max_length=MAX_SOURCE_LEN,
        truncation=True,
        padding="max_length",
    )
    
    # Encode summaries (as labels)
    labels = t5_tokenizer(
        text_target=examples["summary"],
        max_length=MAX_TARGET_LEN,
        truncation=True,
        padding="max_length",
    )
    
    # Replace padding token id with -100 for loss calculation
    labels_ids = np.array(labels["input_ids"])
    labels_ids[labels_ids == t5_tokenizer.pad_token_id] = -100
    
    model_inputs["labels"] = labels_ids.tolist()
    return model_inputs


# Convert DataFrames to HuggingFace Datasets
print("Converting to HuggingFace Datasets...")
t5_train_dataset = Dataset.from_pandas(train_df[["dialogue", "summary"]])
t5_val_dataset = Dataset.from_pandas(val_df[["dialogue", "summary"]])
t5_test_dataset = Dataset.from_pandas(test_df[["dialogue", "summary"]])

# Tokenize
print(f"Tokenizing datasets for T5 (with prefix '{T5_PREFIX}')...")
t5_tokenized_train = t5_train_dataset.map(
    preprocess_t5,
    batched=True,
    remove_columns=["dialogue", "summary"],
    desc="Tokenizing train",
)

t5_tokenized_val = t5_val_dataset.map(
    preprocess_t5,
    batched=True,
    remove_columns=["dialogue", "summary"],
    desc="Tokenizing validation",
)

t5_tokenized_test = t5_test_dataset.map(
    preprocess_t5,
    batched=True,
    remove_columns=["dialogue", "summary"],
    desc="Tokenizing test",
)

print(f"\nT5 tokenized datasets:")
print(f"  Train: {len(t5_tokenized_train):,} examples")
print(f"  Validation: {len(t5_tokenized_val):,} examples")
print(f"  Test: {len(t5_tokenized_test):,} examples")

## 8B. T5 Data Collator

In [None]:
t5_data_collator = DataCollatorForSeq2Seq(
    tokenizer=t5_tokenizer,
    model=t5_model,
    label_pad_token_id=-100,
)

# Create metrics function with T5 tokenizer
def t5_compute_metrics(eval_pred):
    return compute_metrics(eval_pred, t5_tokenizer)

print("T5 data collator configured.")

## 9B. Train T5 with Seq2SeqTrainer
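
The T5 run reuses the learning rate and warmup settings from the configuration cell. Assuming the trainer's default linear scheduler is in effect (the training arguments here do not set `lr_scheduler_type`), the learning rate rises linearly over the warmup steps and then decays linearly to zero. The sketch below reproduces that shape; the function name and the total of 10,000 steps are hypothetical.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# 10_000 total steps is an assumed value; the real total depends on dataset size
for step in (0, 250, 500, 5_250, 10_000):
    print(step, linear_schedule_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=10_000))
```

If early stopping ends training before the planned number of epochs, the learning rate never reaches zero, so the final checkpoints are trained at a higher rate than the schedule's tail would suggest.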

In [None]:
if RUN_TRAINING_T5:
    print("Setting up T5 Seq2SeqTrainer...")
    print(f"\nTraining configuration:")
    print(f"  Epochs: {NUM_EPOCHS}")
    print(f"  Batch size: {BATCH_SIZE}")
    print(f"  Gradient accumulation: {GRAD_ACCUM_STEPS}")
    print(f"  Effective batch size: {BATCH_SIZE * GRAD_ACCUM_STEPS}")
    print(f"  Learning rate: {LEARNING_RATE}")
    print(f"  Warmup steps: {WARMUP_STEPS}")
    print(f"  Weight decay: {WEIGHT_DECAY}")
    
    t5_training_args = Seq2SeqTrainingArguments(
        output_dir=str(T5_CHECKPOINT_DIR),
        
        # Training
        num_train_epochs=NUM_EPOCHS,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRAD_ACCUM_STEPS,
        
        # Optimization
        learning_rate=LEARNING_RATE,
        warmup_steps=WARMUP_STEPS,
        weight_decay=WEIGHT_DECAY,
        
        # Evaluation & Saving
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="rougeL",  # Use ROUGE-L, not val_loss!
        greater_is_better=True,
        save_total_limit=2,
        
        # Generation during evaluation
        predict_with_generate=True,
        generation_max_length=MAX_TARGET_LEN,
        generation_num_beams=NUM_BEAMS,
        
        # Logging
        logging_dir=str(T5_OUTPUT_DIR / "logs"),
        logging_strategy="steps",
        logging_steps=LOGGING_STEPS,
        
        # Performance
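        # NOTE: T5 checkpoints were pretrained in bfloat16 and can overflow under fp16;
        # if the loss turns NaN, try bf16=True on supported GPUs or full precision.
        fp16=torch.cuda.is_available(),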
        dataloader_num_workers=0,
        
        # Reproducibility
        seed=SEED,
    )
    
    t5_trainer = Seq2SeqTrainer(
        model=t5_model,
        args=t5_training_args,
        train_dataset=t5_tokenized_train,
        eval_dataset=t5_tokenized_val,
        tokenizer=t5_tokenizer,
        data_collator=t5_data_collator,
        compute_metrics=t5_compute_metrics,
        callbacks=[EarlyStoppingCallback(
            early_stopping_patience=EARLY_STOPPING_PATIENCE,
            early_stopping_threshold=0.0,
        )],
    )
    
    print("\nT5 Trainer configured successfully!")

In [None]:
if RUN_TRAINING_T5:
    print("="*60)
    print("STARTING T5 TRAINING")
    print("="*60)
    
    t5_train_result = t5_trainer.train()
    
    print("\n" + "="*60)
    print("T5 TRAINING COMPLETE")
    print("="*60)
    print(f"\nTraining time: {t5_train_result.metrics['train_runtime']:.1f} seconds")
    print(f"Final training loss: {t5_train_result.metrics['train_loss']:.4f}")

## 10B. Save T5 Model

In [None]:
if RUN_TRAINING_T5:
    print(f"\nSaving best T5 model to: {T5_BEST_DIR}")
    t5_trainer.save_model(str(T5_BEST_DIR))
    t5_tokenizer.save_pretrained(T5_BEST_DIR)
    t5_trainer.save_state()
    
    # CRITICAL: Save the prefix used during training
    prefix_path = T5_BEST_DIR / "source_prefix.txt"
    prefix_path.write_text(T5_PREFIX)
    print(f"Saved T5 source prefix: '{T5_PREFIX}'")
    
    print("T5 model and tokenizer saved!")

## 11B. T5 Training History & Loss Curves

In [None]:
if RUN_TRAINING_T5:
    # Extract training history from trainer state
    t5_log_history = t5_trainer.state.log_history
    
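    # Separate step-level training logs ("loss") from epoch-level evaluation logs
    t5_train_logs = [log for log in t5_log_history if "loss" in log]
    t5_eval_logs = [log for log in t5_log_history if "eval_loss" in log]
    
    # Create history DataFrame
    t5_history_data = []
    prev_epoch = 0.0
    for eval_log in t5_eval_logs:
        epoch = eval_log.get("epoch", 0)
        # Eval logs carry no training loss, so average the step-level losses
        # logged since the previous evaluation
        epoch_losses = [
            log["loss"] for log in t5_train_logs
            if prev_epoch < log.get("epoch", 0) <= epoch
        ]
        prev_epoch = epoch
        t5_history_data.append({
            "epoch": int(round(epoch)),  # round first: the logged epoch is a float
            "train_loss": float(np.mean(epoch_losses)) if epoch_losses else np.nan,
            "val_loss": eval_log.get("eval_loss", np.nan),
            "rouge1": eval_log.get("eval_rouge1", np.nan),
            "rouge2": eval_log.get("eval_rouge2", np.nan),
            "rougeL": eval_log.get("eval_rougeL", np.nan),
            "rougeLsum": eval_log.get("eval_rougeLsum", np.nan),
        })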
    
    t5_history_df = pd.DataFrame(t5_history_data)
    t5_history_df.to_csv(T5_HISTORY_PATH, index=False)
    print(f"T5 training history saved to: {T5_HISTORY_PATH}")
    
else:
    if T5_HISTORY_PATH.exists():
        t5_history_df = pd.read_csv(T5_HISTORY_PATH)
        print(f"Loaded T5 training history from: {T5_HISTORY_PATH}")
    else:
        t5_history_df = None
        print("No T5 training history found.")

In [None]:
if t5_history_df is not None:
    print("\n" + "="*60)
    print("T5 TRAINING HISTORY")
    print("="*60)
    display(t5_history_df)

In [None]:
if t5_history_df is not None and len(t5_history_df) > 0:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Loss curves
    ax = axes[0]
    if "train_loss" in t5_history_df.columns:
        ax.plot(t5_history_df["epoch"], t5_history_df["train_loss"], marker="o", label="Train Loss")
    ax.plot(t5_history_df["epoch"], t5_history_df["val_loss"], marker="o", label="Val Loss")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Loss")
    ax.set_title("T5 — Loss Curves")
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # ROUGE curves
    ax = axes[1]
    ax.plot(t5_history_df["epoch"], t5_history_df["rouge1"], marker="o", label="ROUGE-1")
    ax.plot(t5_history_df["epoch"], t5_history_df["rouge2"], marker="o", label="ROUGE-2")
    ax.plot(t5_history_df["epoch"], t5_history_df["rougeL"], marker="o", label="ROUGE-L")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("ROUGE Score")
    ax.set_title("T5 — ROUGE Scores (Validation)")
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    
    fig_path = T5_OUTPUT_DIR / "training_curves.png"
    plt.savefig(fig_path, dpi=150, bbox_inches="tight")
    print(f"Saved T5 training curves to: {fig_path}")
    
    plt.show()

## 12B. T5 Validation Qualitative Examples

In [None]:
print("\n" + "="*60)
print("T5 VALIDATION SET: Qualitative Examples")
print("="*60)

qualitative_samples(
    df=val_df,
    model=t5_model,
    encoder_tokenizer=t5_tokenizer,
    decoder_tokenizer=t5_tokenizer,
    device=device,
    max_source_len=MAX_SOURCE_LEN,
    max_target_len=MAX_TARGET_LEN,
    source_prefix=T5_PREFIX,  # T5 needs the prefix!
    n=5,
    seed=SEED,
)

## 13B. T5 Test Set Evaluation

In [None]:
print("\n" + "="*60)
print("T5 TEST SET EVALUATION")
print("="*60)

if RUN_TRAINING_T5:
    print("\nRunning T5 evaluation on test set...")
    t5_test_results = t5_trainer.evaluate(eval_dataset=t5_tokenized_test)
else:
    # Create a trainer just for evaluation
    print("\nCreating T5 trainer for evaluation...")
    
    t5_eval_args = Seq2SeqTrainingArguments(
        output_dir=str(T5_OUTPUT_DIR / "eval_temp"),
        per_device_eval_batch_size=BATCH_SIZE,
        predict_with_generate=True,
        generation_max_length=MAX_TARGET_LEN,
        generation_num_beams=NUM_BEAMS,
        fp16=torch.cuda.is_available(),
    )
    
    t5_eval_trainer = Seq2SeqTrainer(
        model=t5_model,
        args=t5_eval_args,
        tokenizer=t5_tokenizer,
        data_collator=t5_data_collator,
        compute_metrics=t5_compute_metrics,
    )
    
    print("Running T5 evaluation on test set...")
    t5_test_results = t5_eval_trainer.evaluate(eval_dataset=t5_tokenized_test)

# Display results
print("\n" + "-"*40)
print("T5 TEST SET RESULTS")
print("-"*40)
print(f"  Loss:      {t5_test_results['eval_loss']:.4f}")
print(f"  ROUGE-1:   {t5_test_results['eval_rouge1']:.2f}")
print(f"  ROUGE-2:   {t5_test_results['eval_rouge2']:.2f}")
print(f"  ROUGE-L:   {t5_test_results['eval_rougeL']:.2f}")
print(f"  ROUGE-Lsum:{t5_test_results['eval_rougeLsum']:.2f}")

In [None]:
# Save T5 test results
t5_test_results_df = pd.DataFrame([{
    "model": "T5",
    "test_loss": t5_test_results["eval_loss"],
    "rouge1": t5_test_results["eval_rouge1"],
    "rouge2": t5_test_results["eval_rouge2"],
    "rougeL": t5_test_results["eval_rougeL"],
    "rougeLsum": t5_test_results["eval_rougeLsum"],
}])

t5_test_results_df.to_csv(T5_TEST_RESULTS_PATH, index=False)
print(f"\nT5 test results saved to: {T5_TEST_RESULTS_PATH}")
display(t5_test_results_df)

## 14B. T5 Test Set Qualitative Examples

In [None]:
print("\n" + "="*60)
print("T5 TEST SET: Qualitative Examples")
print("="*60)

qualitative_samples(
    df=test_df,
    model=t5_model,
    encoder_tokenizer=t5_tokenizer,
    decoder_tokenizer=t5_tokenizer,
    device=device,
    max_source_len=MAX_SOURCE_LEN,
    max_target_len=MAX_TARGET_LEN,
    source_prefix=T5_PREFIX,
    n=5,
    seed=SEED,
)

## 15B. Generate All T5 Test Predictions

In [None]:
print("\nGenerating T5 predictions for all test examples...")
print("This may take a few minutes.\n")

t5_test_predictions = []

t5_model.eval()
for idx, row in tqdm(test_df.iterrows(), total=len(test_df), desc="T5 Generating"):
    pred = generate_summary(
        model=t5_model,
        encoder_tokenizer=t5_tokenizer,
        decoder_tokenizer=t5_tokenizer,
        text=row["dialogue"],
        device=device,
        max_source_len=MAX_SOURCE_LEN,
        max_target_len=MAX_TARGET_LEN,
        source_prefix=T5_PREFIX,
    )
    t5_test_predictions.append(pred)

# Create results DataFrame
t5_full_test_results = test_df.copy()
t5_full_test_results["model_prediction"] = t5_test_predictions

# Save
t5_predictions_path = T5_OUTPUT_DIR / "test_predictions.csv"
t5_full_test_results.to_csv(t5_predictions_path, index=False)
print(f"\nSaved {len(t5_test_predictions)} T5 predictions to: {t5_predictions_path}")

In [None]:
# Recompute ROUGE from the saved predictions as a sanity check against the trainer's test metrics
print("Recomputing T5 ROUGE scores from the full test set predictions...")

t5_rouge_verify = compute_rouge_from_lists(
    predictions=t5_test_predictions,
    references=test_df["summary"].tolist(),
)

print(f"\nT5 ROUGE Scores (verification):")
print(f"  ROUGE-1:   {t5_rouge_verify['rouge1']*100:.2f}")
print(f"  ROUGE-2:   {t5_rouge_verify['rouge2']*100:.2f}")
print(f"  ROUGE-L:   {t5_rouge_verify['rougeL']*100:.2f}")
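
The verification cell prints the recomputed scores but leaves the comparison with the trainer's metrics to the reader. A minimal sketch of an explicit check follows. It assumes the trainer metrics are on a 0-100 scale with an `eval_` prefix and the recomputed scores are on a 0-1 scale, as the print statements in this notebook suggest. The helper name `rouge_scores_agree`, the tolerance, and the example numbers are hypothetical.

```python
def rouge_scores_agree(trainer_metrics, verify_scores,
                       keys=("rouge1", "rouge2", "rougeL"), tol=0.5):
    """Compare trainer metrics (assumed 0-100, 'eval_' prefix) with
    recomputed scores (assumed 0-1) and flag gaps larger than `tol` points."""
    report = {}
    for key in keys:
        trainer_value = trainer_metrics[f"eval_{key}"]
        verify_value = verify_scores[key] * 100  # rescale to 0-100
        diff = abs(trainer_value - verify_value)
        report[key] = {
            "trainer": trainer_value,
            "verify": verify_value,
            "diff": diff,
            "ok": diff <= tol,
        }
    return report


# Hypothetical numbers, for illustration only (not results from this notebook)
example_trainer = {"eval_rouge1": 47.10, "eval_rouge2": 22.80, "eval_rougeL": 39.00}
example_verify = {"rouge1": 0.4702, "rouge2": 0.2291, "rougeL": 0.3840}

report = rouge_scores_agree(example_trainer, example_verify)
for name, entry in report.items():
    status = "OK" if entry["ok"] else "MISMATCH"
    print(f"{name}: trainer={entry['trainer']:.2f} verify={entry['verify']:.2f} [{status}]")
```

With the notebook's own variables the call would be `rouge_scores_agree(t5_test_results, t5_rouge_verify)`. Small gaps can arise if the two generation paths use different decoding settings.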

---
# 16. Side-by-Side Comparison: BART vs T5
---

In [None]:
print("\n" + "="*70)
print("EXPERIMENT 2 — FINAL COMPARISON: BART vs T5")
print("="*70)

# Combine test results
comparison_df = pd.concat([bart_test_results_df, t5_test_results_df], ignore_index=True)
comparison_df = comparison_df.sort_values("rougeL", ascending=False).reset_index(drop=True)

print("\nTest Set Results (sorted by ROUGE-L):")
display(comparison_df)

In [None]:
# Visualization
fig, ax = plt.subplots(figsize=(10, 6))

models = comparison_df["model"].tolist()
x = np.arange(len(models))
width = 0.25

r1 = comparison_df["rouge1"].tolist()
r2 = comparison_df["rouge2"].tolist()
rL = comparison_df["rougeL"].tolist()

bars1 = ax.bar(x - width, r1, width, label="ROUGE-1", color="#2ecc71")
bars2 = ax.bar(x, r2, width, label="ROUGE-2", color="#3498db")
bars3 = ax.bar(x + width, rL, width, label="ROUGE-L", color="#9b59b6")

ax.set_xlabel("Model")
ax.set_ylabel("ROUGE Score")
ax.set_title("Experiment 2: BART vs T5 — Test Set ROUGE Scores")
ax.set_xticks(x)
ax.set_xticklabels(models)
ax.legend()
ax.grid(True, alpha=0.3, axis="y")

# Add value labels
for bars in [bars1, bars2, bars3]:
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height:.1f}',
                   xy=(bar.get_x() + bar.get_width() / 2, height),
                   xytext=(0, 3),
                   textcoords="offset points",
                   ha='center', va='bottom',
                   fontsize=10)

plt.tight_layout()
fig_path = PROJECT_ROOT / "experiments" / "exp2_comparison.png"
fig_path.parent.mkdir(parents=True, exist_ok=True)
plt.savefig(fig_path, dpi=150, bbox_inches="tight")
print(f"Saved comparison chart to: {fig_path}")
plt.show()

## Same Examples: BART vs T5

In [None]:
print("\n" + "="*70)
print("QUALITATIVE COMPARISON: Same Test Examples")
print("="*70)

# Sample 3 test examples
sample_indices = test_df.sample(3, random_state=SEED).index.tolist()

for idx in sample_indices:
    row = test_df.loc[idx]
    
    print(f"\n{'='*70}")
    print(f"TEST EXAMPLE (index {idx})")
    print("="*70)
    
    print(f"\n[DIALOGUE]\n{row['dialogue'][:400]}{'...' if len(row['dialogue']) > 400 else ''}")
    print(f"\n[HUMAN SUMMARY]\n{row['summary']}")
    
    # BART prediction
    bart_pred = bart_full_test_results.loc[idx, "model_prediction"]
    
    # T5 prediction
    t5_pred = t5_full_test_results.loc[idx, "model_prediction"]
    
    print(f"\n[BART]\n{bart_pred}")
    print(f"\n[T5]\n{t5_pred}")
    print("-" * 70)
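
Reading the two predictions side by side is easier with a rough per-example number next to each. The sketch below computes a simplified unigram-overlap F1 between a prediction and the human summary. It is only a stand-in for ROUGE-1: it lowercases and splits on whitespace, with no stemming or other normalization, so its values will not match the reported ROUGE scores exactly. The function name `unigram_f1` is hypothetical.

```python
from collections import Counter


def unigram_f1(prediction, reference):
    """Simplified unigram-overlap F1 (a rough stand-in for ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Illustrative strings, not taken from the notebook's outputs
score = unigram_f1("Amanda baked cookies", "Amanda baked cookies for Jerry")
print(f"unigram F1: {score:.2f}")
```

Inside the loop above, this could be called as `unigram_f1(bart_pred, row["summary"])` and `unigram_f1(t5_pred, row["summary"])`.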

## 17. Final Summary

In [None]:
print("\n" + "="*70)
print("EXPERIMENT 2 — FINAL SUMMARY")
print("="*70)

print(f"\nModels Evaluated:")
print(f"  BART: {BART_MODEL_NAME}")
print(f"  T5:   {T5_MODEL_NAME} (prefix: '{T5_PREFIX}')")

print(f"\nTraining Configuration:")
print(f"  Epochs: {NUM_EPOCHS}")
print(f"  Effective batch size: {BATCH_SIZE * GRAD_ACCUM_STEPS}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Best model selection: ROUGE-L (not val_loss)")

print(f"\nTest Set Performance:")
print(f"\n  BART:")
print(f"    ROUGE-1: {bart_test_results['eval_rouge1']:.2f}")
print(f"    ROUGE-2: {bart_test_results['eval_rouge2']:.2f}")
print(f"    ROUGE-L: {bart_test_results['eval_rougeL']:.2f}")

print(f"\n  T5:")
print(f"    ROUGE-1: {t5_test_results['eval_rouge1']:.2f}")
print(f"    ROUGE-2: {t5_test_results['eval_rouge2']:.2f}")
print(f"    ROUGE-L: {t5_test_results['eval_rougeL']:.2f}")

# Determine winner (a tie is reported as a tie instead of defaulting to T5)
bart_rougeL = bart_test_results['eval_rougeL']
t5_rougeL = t5_test_results['eval_rougeL']
margin = abs(bart_rougeL - t5_rougeL)

if bart_rougeL > t5_rougeL:
    winner = "BART"
elif t5_rougeL > bart_rougeL:
    winner = "T5"
else:
    winner = None

if winner is None:
    print(f"\n  Result: tie on ROUGE-L ({bart_rougeL:.2f})")
else:
    print(f"\n  Winner: {winner} (by {margin:.2f} ROUGE-L points)")

print(f"\nArtifacts:")
print(f"  BART model: {BART_BEST_DIR}")
print(f"  BART predictions: {bart_predictions_path}")
print(f"  T5 model: {T5_BEST_DIR}")
print(f"  T5 predictions: {t5_predictions_path}")

print("\n" + "="*70)

## 18. Key Takeaways

### Architecture

**BART:**
- Denoising autoencoder pretrained by corrupting and reconstructing text
- Naturally suited for generation tasks like summarization
- No task prefix required
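
To make "corrupting and reconstructing" concrete, here is a toy sketch of one kind of corruption: random token masking. It is an assumption-laden simplification, since BART's pretraining combines several noising schemes (such as text infilling and sentence permutation) and works on subword tokens rather than whitespace-split words. The function name and the masking probability are hypothetical.

```python
import random


def mask_tokens(text, mask_prob=0.3, mask_token="<mask>", seed=42):
    """Replace each whitespace token with `mask_token` with probability `mask_prob`.

    Toy illustration of a denoising objective: the model would receive the
    corrupted text as input and be trained to reproduce the original text.
    """
    rng = random.Random(seed)  # local RNG so the global seed is left untouched
    tokens = text.split()
    corrupted = [mask_token if rng.random() < mask_prob else tok for tok in tokens]
    return " ".join(corrupted)


original = "Amanda baked cookies and will bring Jerry some tomorrow"
corrupted = mask_tokens(original)
print("input :", corrupted)
print("target:", original)
```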

**T5:**
- Text-to-text framework that frames all tasks uniformly
- Requires task prefix ("summarize: ") to specify the task
- More flexible for multi-task learning
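
The practical difference between the two models at input time is small, and can be sketched as a single helper: T5 inputs get the task prefix prepended, BART inputs are passed through unchanged. The helper name `build_model_inputs` is hypothetical, and the prefix value is assumed to be the `"summarize: "` string mentioned above.

```python
def build_model_inputs(dialogues, prefix=""):
    """Prepend an optional task prefix to each dialogue before tokenization."""
    return [prefix + dialogue for dialogue in dialogues]


dialogues = ["Amanda: I baked cookies. Jerry: Sure!"]

bart_inputs = build_model_inputs(dialogues)                       # no prefix for BART
t5_inputs = build_model_inputs(dialogues, prefix="summarize: ")   # task prefix for T5

print(bart_inputs[0])
print(t5_inputs[0])
```

Forgetting the prefix at inference time after training with it gives the model inputs that differ from what it was fine-tuned on, which is why the qualitative and prediction cells pass `source_prefix=T5_PREFIX`.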

### Performance

*[Fill in after training completes]*

- Which model achieved better ROUGE scores?
- How many epochs before convergence?
- Training time comparison?
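
The convergence question can be answered from the saved training history. The sketch below is one possible way to do it, not the notebook's method: it reports the best epoch by a chosen metric and the first epoch whose gain over the previous epoch falls below a threshold. The function name, the `min_gain` threshold, and the example numbers are hypothetical; the `epoch` and `rougeL` keys mirror the history columns plotted earlier.

```python
def best_and_plateau_epoch(records, metric="rougeL", min_gain=0.1):
    """Return the best epoch and the first epoch where improvement stalls.

    `records` is a list of dicts with an 'epoch' key and the metric key,
    for example the output of `history_df.to_dict("records")`.
    """
    ordered = sorted(records, key=lambda r: r["epoch"])
    best = max(ordered, key=lambda r: r[metric])

    plateau_epoch = None
    for prev, curr in zip(ordered, ordered[1:]):
        if curr[metric] - prev[metric] < min_gain:
            plateau_epoch = curr["epoch"]
            break

    return {
        "best_epoch": best["epoch"],
        "best_score": best[metric],
        "plateau_epoch": plateau_epoch,
    }


# Hypothetical history, for illustration only (not results from this notebook)
example_history = [
    {"epoch": 1, "rougeL": 35.00},
    {"epoch": 2, "rougeL": 38.20},
    {"epoch": 3, "rougeL": 39.10},
    {"epoch": 4, "rougeL": 39.15},
]
summary = best_and_plateau_epoch(example_history)
print(summary)
```

With the notebook's data this could be called as `best_and_plateau_epoch(t5_history_df.to_dict("records"))`, provided the history was loaded.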

### Comparison to Experiment 1 (BERT→GPT-2)

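Both BART and T5 are expected to outperform the custom BERT→GPT-2 architecture from Experiment 1, for three reasons:
1. **Pretrained cross-attention**: The encoder and decoder were pretrained together, so the decoder already knows how to attend to encoder outputs
2. **Purpose-built for seq2seq**: Both models were designed and pretrained for text generation tasks
3. **Better initialization**: No layers start from random weights, whereas the BERT→GPT-2 cross-attention had to be learned from scratch during fine-tuning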
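
One way to check the initialization point on the Experiment 1 model is to list which parameters belong to cross-attention. The sketch below works on parameter names only. The marker substrings `crossattention` and `ln_cross_attn` are an assumption about how a Hugging Face BERT→GPT-2 encoder-decoder names those modules and should be verified against the actual output of `model.named_parameters()`. The helper name and the example names are hypothetical.

```python
def split_cross_attention_params(param_names,
                                 markers=("crossattention", "ln_cross_attn")):
    """Split parameter names into cross-attention and everything else.

    `markers` are substrings assumed to identify cross-attention modules.
    """
    cross, other = [], []
    for name in param_names:
        if any(marker in name for marker in markers):
            cross.append(name)
        else:
            other.append(name)
    return cross, other


# Hypothetical parameter names, for illustration only
example_names = [
    "encoder.encoder.layer.0.attention.self.query.weight",
    "decoder.transformer.h.0.attn.c_attn.weight",
    "decoder.transformer.h.0.crossattention.c_attn.weight",
    "decoder.transformer.h.0.ln_cross_attn.weight",
]
cross, other = split_cross_attention_params(example_names)
print(f"cross-attention params: {len(cross)}, other params: {len(other)}")
```

On a real model the input would be `[name for name, _ in model.named_parameters()]`.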

### Next Steps

- Compare with **Experiment 3** (frontier LLMs via API) as an upper bound
- Run the final comparison in `05_evaluation_and_comparison.ipynb`