# Clinical Note Summarization: End-to-End Pipeline

**From-Scratch Hierarchical Pointer-Generator Network with Coverage**

This notebook consolidates the entire clinical summarization project into a single, executable pipeline.  
All training artifacts are preserved and reused where possible.

---

## 📋 File Consolidation Plan

### ✅ **Files KEPT (Active)**
- `Clinical_Summarization_EndToEnd.ipynb` ← **This notebook (primary entry point)**
- `train.py` - Training script (can be called from notebook or run standalone)
- `evaluate.py` - Evaluation script (can be called from notebook)
- `baselines.py` - Baseline comparison scripts
- `models/` - Model architecture modules
- `utils/` - Dataset, metrics, beam search utilities
- `tools/` - Diagnostic and monitoring tools
- `configs/` - YAML configuration files
- `requirements.txt` - Python dependencies
- `README.md` - Project documentation

### 📦 **Artifacts PRESERVED**
- `artifacts/tokenizer/spm.model` - Trained SentencePiece tokenizer
- `artifacts/checkpoints/final_check/best_model.pt` - Best trained model
- `artifacts/checkpoints/final_check/checkpoint_step_500.pt` - Training checkpoint
- `data/splits/` - Train/val/test split files
- `data/tokenized/subset20000/` - Tokenized parquet shards

### 🗄️ **Files ARCHIVED** (Moved to `archive/`)
- `01_data_explore.ipynb` - Exploratory data analysis
- `02_small_subset_train.ipynb` - Small-scale training experiments
- `03_full_train.ipynb` - Full training notebook
- `04_results_and_examples.ipynb` - Results visualization
- `infer.py` - Standalone inference (logic moved into evaluate.py)
- `sweep.py` - Hyperparameter sweep experiments
- `COMMAND_LOG.md`, `EXECUTIVE_SUMMARY.md`, `QA_*.md` - Documentation artifacts

### 🚫 **NO NEW FILES CREATED**
This notebook reuses existing code and artifacts. Future work updates this notebook only.

---

## 1️⃣ Project Overview

**Goal:** Build a hierarchical pointer-generator network with coverage from scratch to summarize clinical notes.  
**Dataset:** MIMIC-IV clinical notes  
**Model:** Custom Seq2Seq with bidirectional LSTM encoder, attentional decoder, pointer-generator, and coverage mechanism.

## 2️⃣ Environment Setup

**Platform:** Windows 11 with NVIDIA GPU  
**Environment:** Python virtual environment (`.venv`)  
**Expected Runtime:** ~10 seconds

**What this cell does:**
- Installs required packages from `requirements.txt`
- Imports core libraries
- Sets random seeds for reproducibility

**Common Issues:**
- If CUDA is not found: confirm that a CUDA-enabled PyTorch build is installed (`torch.cuda.is_available()` should return `True`)
- If the `sentencepiece` import fails: run `pip install sentencepiece`
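
The seeding step can be sketched as one reusable helper. This is a minimal sketch rather than the notebook's own code: `seed_everything` is a hypothetical name, and NumPy and PyTorch are seeded only when they can be imported. Seeding alone may not make GPU runs bit-for-bit identical.

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Seed Python, plus NumPy and PyTorch when installed (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)           # seeds the default generators
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


# Re-seeding reproduces the same random draws.
seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
```

Calling the helper once at the top of the notebook keeps every later cell on the same seed.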

In [None]:
# Install dependencies (uncomment if needed)
# !pip install -r requirements.txt

import sys
import os
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Core libraries
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import yaml
import sentencepiece as spm
import matplotlib.pyplot as plt
import seaborn as sns

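# Set random seeds (Python, NumPy, PyTorch CPU and CUDA)
import random
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)  # no-op when CUDA is unavailable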

# Add project root to path
project_root = Path.cwd()
sys.path.insert(0, str(project_root))

print(f"✓ Project root: {project_root}")
print(f"✓ Python: {sys.version.split()[0]}")
print(f"✓ PyTorch: {torch.__version__}")
print(f"✓ NumPy: {np.__version__}")
print(f"✓ Pandas: {pd.__version__}")

## 3️⃣ GPU Verification

**Expected Runtime:** ~5 seconds  
**What this cell does:**
- Checks if CUDA is available
- Displays GPU name and memory
- Runs a small matrix multiplication test to verify GPU compute

**Common Issues:**
- CUDA unavailable: Check NVIDIA drivers and PyTorch CUDA installation
- Low memory: Close other GPU applications

In [None]:
# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"Total Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    
    # Test GPU compute
    print("\n🧪 Running GPU compute test...")
    x = torch.randn(1000, 1000, device=device)
    y = torch.randn(1000, 1000, device=device)
    z = torch.matmul(x, y)
    print(f"✓ Matrix multiplication successful: {z.shape}")
    print(f"✓ Result sample: {z[0, 0].item():.4f}")
    
    # Memory check
    allocated = torch.cuda.memory_allocated(0) / 1e9
    reserved = torch.cuda.memory_reserved(0) / 1e9
    print(f"\nMemory Allocated: {allocated:.2f} GB")
    print(f"Memory Reserved: {reserved:.2f} GB")
else:
    print("⚠️ CUDA not available. Training will be VERY slow on CPU.")

## 4️⃣ Dataset Path Configuration

**Expected Runtime:** <1 second  
**What this cell does:**
- Defines paths to dataset, splits, tokenized data, artifacts
- Validates that critical files exist

**Inputs:**
- `DATA_ROOT`: Path to your MIMIC clinical notes CSV
- Update this path to match your dataset location

**Common Issues:**
- FileNotFoundError: Update `DATA_ROOT` to your actual dataset path
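
The validation step can also be written so that it reports every missing file at once. The sketch below is an illustration under assumptions: `find_missing` is a hypothetical helper, and a temporary directory stands in for the real project tree.

```python
from pathlib import Path
import tempfile


def find_missing(required: dict) -> list:
    """Return the labels of required paths that do not exist (hypothetical helper)."""
    return [label for label, path in required.items() if not Path(path).exists()]


# Demo: a temporary directory stands in for the project tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "spm.model").write_bytes(b"")  # pretend the tokenizer exists
    missing = find_missing({
        "tokenizer": root / "spm.model",
        "checkpoint": root / "best_model.pt",  # deliberately absent
    })
```

In the notebook, a non-empty result could be turned into a `FileNotFoundError` so that later cells do not run against missing artifacts.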

In [None]:
# ========== CONFIGURE THESE PATHS ==========
# Update DATA_ROOT to your dataset location
DATA_ROOT = Path(r"C:\Users\antor\Desktop\mimic_clinical_notes.csv")  # Change this!

# Project paths (relative to notebook)
PROJECT_ROOT = Path.cwd()
DATA_DIR = PROJECT_ROOT / "data"
SPLITS_DIR = DATA_DIR / "splits"
TOKENIZED_DIR = DATA_DIR / "tokenized" / "subset20000"
ARTIFACTS_DIR = PROJECT_ROOT / "artifacts"
TOKENIZER_DIR = ARTIFACTS_DIR / "tokenizer"
CHECKPOINTS_DIR = ARTIFACTS_DIR / "checkpoints" / "final_check"
LOGS_DIR = ARTIFACTS_DIR / "logs"
PREDICTIONS_DIR = ARTIFACTS_DIR / "predictions"

# Config file
CONFIG_PATH = PROJECT_ROOT / "configs" / "small_gpu.yaml"

# ========== VALIDATION ==========
def _status(path):
    """Return a short existence marker for the validation printout."""
    return "✓ EXISTS" if path.exists() else "✗ NOT FOUND"

print("📁 Path Validation:")
print(f"  Project Root: {PROJECT_ROOT}")
print(f"  Dataset: {DATA_ROOT} [{_status(DATA_ROOT)}]")
print(f"  Tokenizer: {TOKENIZER_DIR / 'spm.model'} [{_status(TOKENIZER_DIR / 'spm.model')}]")
print(f"  Checkpoint: {CHECKPOINTS_DIR / 'best_model.pt'} [{_status(CHECKPOINTS_DIR / 'best_model.pt')}]")
print(f"  Tokenized Data: {TOKENIZED_DIR} [{_status(TOKENIZED_DIR)}]")

# Create missing directories
for dir_path in [DATA_DIR, SPLITS_DIR, ARTIFACTS_DIR, LOGS_DIR, PREDICTIONS_DIR]:
    dir_path.mkdir(parents=True, exist_ok=True)

print("\n✓ Directory structure validated")

## 5️⃣ Data Loading & Exploration

**Expected Runtime:** ~30 seconds (for 20K subset)  
**What this cell does:**
- Loads dataset in streaming/chunked mode (memory-efficient)
- Shows dataset statistics
- Displays sample clinical note + summary

**Memory:** Processes in chunks to avoid loading full dataset into RAM

**Common Issues:**
- CSV encoding errors: Dataset should be UTF-8 encoded
- Missing columns: Expects `note_id`, `text`, `summary` columns

In [None]:
# Load dataset (streaming to avoid memory issues)
print("📊 Loading dataset...")

if DATA_ROOT.exists():
    # Read first 5 rows to check structure
    df_sample = pd.read_csv(DATA_ROOT, nrows=5)
    print(f"\nColumns: {df_sample.columns.tolist()}")
    print(f"Sample shape: {df_sample.shape}")
    
    # Get dataset size efficiently
    chunk_iterator = pd.read_csv(DATA_ROOT, chunksize=10000)
    total_rows = sum(len(chunk) for chunk in chunk_iterator)
    print(f"Total rows: {total_rows:,}")
    
    # Display sample
    print("\n" + "="*80)
    print("Sample Clinical Note:")
    print("="*80)
    print(f"Note ID: {df_sample.iloc[0]['note_id']}")
    print(f"\nText (first 500 chars):\n{df_sample.iloc[0]['text'][:500]}...")
    print(f"\nSummary:\n{df_sample.iloc[0]['summary']}")
    print("="*80)
    
    # Statistics
    print(f"\nText length (chars): {len(df_sample.iloc[0]['text'])}")
    print(f"Summary length (chars): {len(df_sample.iloc[0]['summary'])}")
    print(f"Compression ratio: {len(df_sample.iloc[0]['text']) / len(df_sample.iloc[0]['summary']):.1f}x")
else:
    print("⚠️ Dataset not found. Update DATA_ROOT path in previous cell.")

## 6️⃣ Train/Val/Test Split Verification

**Expected Runtime:** <5 seconds  
**What this cell does:**
- Checks if train/val/test splits already exist
- Validates split sizes
- Option to regenerate splits if needed

**Skip Condition:** `RUN_SPLIT = False` (default, use existing splits)

**Outputs:**
- `data/splits/train.csv`
- `data/splits/val.csv`
- `data/splits/test.csv`
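
When validating split sizes, note that clinical notes often contain line breaks inside a quoted CSV field, so the number of physical lines in a file can exceed the number of records. The toy example below (in-memory data, standard-library `csv` only) illustrates the difference; the sample rows are invented for illustration.

```python
import csv
import io

# Two records; the first note has an embedded newline inside its quoted text field.
raw = 'note_id,text,summary\n1,"line one\nline two","short"\n2,"single line","short"\n'

# Naive count: physical lines minus the header.
physical_lines = raw.count("\n") - 1

# Parsed count: rows seen by a CSV parser minus the header.
records = sum(1 for _ in csv.reader(io.StringIO(raw))) - 1
```

Here the naive count reports 3 while the parser finds 2 records. For real files with very long notes, the `csv` module's default field size limit may need to be raised with `csv.field_size_limit`.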

In [None]:
# ========== CONFIGURATION ==========
RUN_SPLIT = False  # Set True to regenerate splits

# ========== CHECK EXISTING SPLITS ==========
train_path = SPLITS_DIR / "train.csv"
val_path = SPLITS_DIR / "val.csv"
test_path = SPLITS_DIR / "test.csv"

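def count_rows(csv_path):
    """Count CSV records via pandas so quoted multi-line notes count once."""
    chunks = pd.read_csv(csv_path, usecols=[0], chunksize=50_000)
    return sum(len(chunk) for chunk in chunks)

if train_path.exists() and val_path.exists() and test_path.exists():
    print("✓ Existing splits found:")
    train_size = count_rows(train_path)
    val_size = count_rows(val_path)
    test_size = count_rows(test_path)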
    
    print(f"  Train: {train_size:,} examples")
    print(f"  Val:   {val_size:,} examples")
    print(f"  Test:  {test_size:,} examples")
    print(f"  Total: {train_size + val_size + test_size:,} examples")
    print(f"\n  Split ratio: {train_size/(train_size+val_size+test_size)*100:.1f}% / "
          f"{val_size/(train_size+val_size+test_size)*100:.1f}% / "
          f"{test_size/(train_size+val_size+test_size)*100:.1f}%")
else:
    print("⚠️ Splits not found. Set RUN_SPLIT = True to generate.")

# ========== REGENERATE SPLITS (if needed) ==========
if RUN_SPLIT:
    print("\n🔄 Regenerating splits...")
    import subprocess
    result = subprocess.run(
        [sys.executable, "data/02_make_splits.py"],
        capture_output=True,
        text=True
    )
    print(result.stdout)
    if result.returncode != 0:
        print(f"Error: {result.stderr}")
    else:
        print("✓ Splits regenerated successfully")

## 7️⃣ Train SentencePiece Tokenizer

**Expected Runtime:** ~5 minutes (if training from scratch)  
**What this cell does:**
- Checks if tokenizer already exists
- If not, trains a SentencePiece BPE tokenizer on clinical text
- Saves tokenizer to `artifacts/tokenizer/spm.model`

**Skip Condition:** `RUN_TOKENIZER = False` (default, use existing tokenizer)

**Parameters:**
- Vocab size: 32,000
- Model type: BPE (Byte Pair Encoding)

**Common Issues:**
- OOM during training: Reduce vocab size or input sample size
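
The parameters above map onto keyword arguments of `SentencePieceTrainer.train`. The sketch below shows one possible set of arguments, not the contents of `data/03_train_sentencepiece.py`: the corpus filename and the special-token IDs are assumptions, and `build_spm_args` and `train_tokenizer` are hypothetical helpers.

```python
def build_spm_args(corpus_path: str, model_prefix: str, vocab_size: int = 32000) -> dict:
    """Keyword arguments for SentencePieceTrainer.train (token IDs are assumptions)."""
    return {
        "input": corpus_path,
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": "bpe",
        # SentencePiece disables <pad> by default (pad_id=-1); enable it explicitly.
        "pad_id": 0,
        "unk_id": 1,
        "bos_id": 2,
        "eos_id": 3,
    }


def train_tokenizer(corpus_path: str, model_prefix: str) -> None:
    """Train the tokenizer; sentencepiece is imported lazily, only when training runs."""
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(**build_spm_args(corpus_path, model_prefix))


# Hypothetical corpus file: one clinical note (or sentence) per line.
args = build_spm_args("clinical_corpus.txt", "artifacts/tokenizer/spm")
```

Whatever IDs are chosen here must agree with `pad_id`, `bos_id`, and `eos_id` in the YAML config, since the model and beam search read them from there.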

In [None]:
# ========== CONFIGURATION ==========
RUN_TOKENIZER = False  # Set True to retrain tokenizer

# ========== CHECK EXISTING TOKENIZER ==========
tokenizer_model_path = TOKENIZER_DIR / "spm.model"
tokenizer_vocab_path = TOKENIZER_DIR / "spm.vocab"

if tokenizer_model_path.exists():
    print(f"✓ Tokenizer found: {tokenizer_model_path}")
    
    # Load and test tokenizer
    sp = spm.SentencePieceProcessor()
    sp.load(str(tokenizer_model_path))
    
    print(f"  Vocab size: {sp.vocab_size():,}")
    # An ID of -1 means the token was disabled at training time; id_to_piece would fail on it
    special_ids = {"PAD": sp.pad_id(), "UNK": sp.unk_id(), "BOS": sp.bos_id(), "EOS": sp.eos_id()}
    for name, tok_id in special_ids.items():
        piece = sp.id_to_piece(tok_id) if tok_id >= 0 else "<disabled>"
        print(f"  {name} token: {piece} (ID: {tok_id})")
    
    # Test encoding/decoding
    test_text = "The patient presents with chest pain and shortness of breath."
    encoded = sp.encode(test_text)
    decoded = sp.decode(encoded)
    print(f"\n  Test encoding:")
    print(f"    Original: {test_text}")
    print(f"    Encoded:  {encoded[:10]}... ({len(encoded)} tokens)")
    print(f"    Decoded:  {decoded}")
else:
    print("⚠️ Tokenizer not found. Set RUN_TOKENIZER = True to train.")

# ========== TRAIN TOKENIZER (if needed) ==========
if RUN_TOKENIZER:
    print("\n🔄 Training SentencePiece tokenizer...")
    TOKENIZER_DIR.mkdir(parents=True, exist_ok=True)
    
    import subprocess
    result = subprocess.run(
        [sys.executable, "data/03_train_sentencepiece.py"],
        capture_output=True,
        text=True
    )
    print(result.stdout)
    if result.returncode != 0:
        print(f"Error: {result.stderr}")
    else:
        print(f"✓ Tokenizer saved to {TOKENIZER_DIR}")

## 8️⃣ Tokenization to Parquet Shards

**Expected Runtime:** ~10-30 minutes (for full dataset)  
**What this cell does:**
- Tokenizes train/val/test splits using SentencePiece
- Saves to memory-efficient Parquet format
- Shards data for faster loading during training

**Skip Condition:** `RUN_TOKENIZATION = False` (default, use existing tokenized data)

**Outputs:**
- `data/tokenized/subset20000/train.parquet`
- `data/tokenized/subset20000/val.parquet`
- `data/tokenized/subset20000/test.parquet`

**Common Issues:**
- OOM: Process in smaller chunks (adjust `chunk_size` in tokenization script)

In [None]:
# ========== CONFIGURATION ==========
RUN_TOKENIZATION = False  # Set True to retokenize

# ========== CHECK EXISTING TOKENIZED DATA ==========
tokenized_train = TOKENIZED_DIR / "train.parquet"
tokenized_val = TOKENIZED_DIR / "val.parquet"
tokenized_test = TOKENIZED_DIR / "test.parquet"

if tokenized_train.exists() and tokenized_val.exists() and tokenized_test.exists():
    print("✓ Tokenized data found:")
    
    # Load each split fully into memory to report sizes and show a sample
    train_df = pd.read_parquet(tokenized_train)
    val_df = pd.read_parquet(tokenized_val)
    test_df = pd.read_parquet(tokenized_test)
    
    print(f"  Train: {len(train_df):,} examples")
    print(f"  Val:   {len(val_df):,} examples")
    print(f"  Test:  {len(test_df):,} examples")
    
    # Show sample
    print(f"\n  Sample tokenized example:")
    sample = train_df.iloc[0]
    print(f"    Note ID: {sample['note_id']}")
    print(f"    Source tokens: {len(sample['source_ids'])} tokens")
    print(f"    Target tokens: {len(sample['target_ids'])} tokens")
    print(f"    Source IDs (first 20): {sample['source_ids'][:20]}")
    print(f"    Target IDs (first 20): {sample['target_ids'][:20]}")
    
    # File sizes on disk (MB); separate names so the split counts from section 6 are not overwritten
    train_mb = tokenized_train.stat().st_size / 1e6
    val_mb = tokenized_val.stat().st_size / 1e6
    test_mb = tokenized_test.stat().st_size / 1e6
    print(f"\n  File sizes:")
    print(f"    Train: {train_mb:.1f} MB")
    print(f"    Val:   {val_mb:.1f} MB")
    print(f"    Test:  {test_mb:.1f} MB")
else:
    print("⚠️ Tokenized data not found. Set RUN_TOKENIZATION = True.")

# ========== TOKENIZE DATA (if needed) ==========
if RUN_TOKENIZATION:
    print("\n🔄 Tokenizing data...")
    TOKENIZED_DIR.mkdir(parents=True, exist_ok=True)
    
    import subprocess
    result = subprocess.run(
        [sys.executable, "data/04_tokenize_to_parquet.py",
         "--output_dir", str(TOKENIZED_DIR)],
        capture_output=True,
        text=True
    )
    print(result.stdout)
    if result.returncode != 0:
        print(f"Error: {result.stderr}")
    else:
        print(f"✓ Tokenized data saved to {TOKENIZED_DIR}")

## 9️⃣ Model Architecture

**Expected Runtime:** ~5 seconds  
**What this cell does:**
- Loads config file
- Instantiates PointerGeneratorModel from scratch
- Displays model summary and parameter count

**Architecture:**
- **Encoder:** 2-layer BiLSTM (512 hidden units)
- **Decoder:** 2-layer LSTM with additive attention (512 hidden units)
- **Pointer-Generator:** Learns when to copy from source vs generate
- **Coverage:** Tracks attention history to reduce repetition
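
To make the last two bullets concrete, here is a pure-Python sketch of one decoder step in the standard pointer-generator formulation. It is an assumption that `models/` follows this formulation, and the function names and toy numbers are illustrative only. The final distribution mixes a generation distribution with copy probabilities scattered onto source token IDs, and the coverage loss penalizes attention that revisits positions already attended.

```python
def final_distribution(p_gen, vocab_dist, attention, src_ext_ids, extended_size):
    """Mix generation and copy probabilities over the extended vocabulary."""
    dist = [0.0] * extended_size
    for token_id, prob in enumerate(vocab_dist):
        dist[token_id] = p_gen * prob
    for attn, token_id in zip(attention, src_ext_ids):
        dist[token_id] += (1.0 - p_gen) * attn  # copy mass lands on source tokens
    return dist


def coverage_step(attention, coverage):
    """Return (coverage loss for this step, updated coverage vector)."""
    loss = sum(min(a, c) for a, c in zip(attention, coverage))
    updated = [c + a for a, c in zip(attention, coverage)]
    return loss, updated


# Toy example: a 4-token vocabulary and one source OOV mapped to extended ID 4.
vocab_dist = [0.1, 0.2, 0.3, 0.4]
attention = [0.5, 0.3, 0.2]  # over 3 source positions
src_ext_ids = [2, 4, 2]      # position 1 holds the OOV token
dist = final_distribution(0.6, vocab_dist, attention, src_ext_ids, extended_size=5)
loss, coverage = coverage_step(attention, coverage=[0.0, 0.0, 0.0])
```

The mixed distribution still sums to 1, and the OOV token receives probability only through the copy path. On the first step the coverage loss is zero because nothing has been attended yet.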

**Parameters:** ~50-60M (trained from random initialization)
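
A rough way to sanity-check the parameter count is to add up the largest components by hand. The sketch below counts LSTM parameters using the layout of `torch.nn.LSTM` (four gates, input and recurrent weights, two bias vectors per direction). The embedding size of 256 is an assumption, since this notebook does not state it, and the helper names are hypothetical.

```python
def lstm_params(input_size: int, hidden: int, layers: int, bidirectional: bool) -> int:
    """Parameter count of a stacked LSTM laid out like torch.nn.LSTM."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(layers):
        in_size = input_size if layer == 0 else hidden * directions
        # 4 gates, input + recurrent weights, and two bias vectors per direction
        total += directions * (4 * hidden * (in_size + hidden) + 8 * hidden)
    return total


def fp32_megabytes(num_params: int) -> float:
    """Size in MB at 4 bytes per parameter, matching the notebook's printout."""
    return num_params * 4 / 1e6


EMBED_DIM = 256  # assumption: not stated in this notebook
encoder = lstm_params(EMBED_DIM, 512, layers=2, bidirectional=True)
embedding = 32000 * EMBED_DIM
```

The remaining parameters come from the decoder, attention, and the output projection onto the 32,000-token vocabulary. The authoritative number is the one printed by the cell below.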

In [None]:
# Load configuration
print(f"📋 Loading config from {CONFIG_PATH}...")
with open(CONFIG_PATH, 'r') as f:
    config = yaml.safe_load(f)

print("\nModel Configuration:")
print(yaml.dump(config['model'], default_flow_style=False, indent=2))

# Import model
from models.model import PointerGeneratorModel

# Create model
print("\n🏗️ Building model architecture...")
model = PointerGeneratorModel(config).to(device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\n✓ Model created successfully")
print(f"  Total parameters: {total_params:,}")
print(f"  Trainable parameters: {trainable_params:,}")
print(f"  Model size (MB, FP32): {total_params * 4 / 1e6:.2f}")

# Display model structure
print(f"\nModel Structure:")
print(model)

# Test forward pass
print("\n🧪 Testing forward pass...")
batch_size = 2
src_len = config['model']['chunk_len'] * config['model']['num_chunks']
tgt_len = 64

dummy_src = torch.randint(0, config['model']['vocab_size'], (batch_size, src_len)).to(device)
dummy_tgt = torch.randint(0, config['model']['vocab_size'], (batch_size, tgt_len)).to(device)
dummy_src_ext = dummy_src.clone()
dummy_oov_list = [[] for _ in range(batch_size)]  # independent lists, not one shared object

with torch.no_grad():
    loss, _ = model(dummy_src, dummy_tgt, dummy_src_ext, dummy_oov_list)

print(f"✓ Forward pass successful")
print(f"  Loss shape: {loss.shape}")
print(f"  Loss value: {loss.item():.4f}")

## 🔟 Training

**Expected Runtime:** ~6-12 hours (for 10K steps on RTX 4070)  
**What this cell does:**
- Trains the model using FP16 mixed precision
- Saves checkpoints every N steps
- Evaluates on validation set periodically
- Can resume from existing checkpoint

**Skip Condition:** `RUN_TRAIN = False` (default, use existing checkpoint)

**Training Features:**
- FP16 mixed precision (faster, less memory)
- Gradient accumulation (effective batch size = batch_size × accum_steps)
- Learning rate warmup + decay
- Gradient clipping (prevents exploding gradients)
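
Two of these features reduce to simple arithmetic. The sketch below computes the effective batch size and one possible warmup-plus-decay schedule (linear warmup, then linear decay). The schedule shape and all numeric values are assumptions; the actual settings live in `configs/small_gpu.yaml` and `train.py`.

```python
def effective_batch_size(batch_size: int, accum_steps: int) -> int:
    """Number of examples contributing to each optimizer update."""
    return batch_size * accum_steps


def lr_at(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to base_lr, then linear decay to zero (one possible schedule)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)


# Hypothetical values for illustration only.
eff = effective_batch_size(batch_size=8, accum_steps=4)
schedule = [lr_at(s, base_lr=1e-3, warmup_steps=100, total_steps=10000)
            for s in (0, 99, 100, 10000)]
```

With these values each optimizer step sees 32 examples, and the learning rate rises to its peak by step 99 before decaying to zero at step 10,000.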

**Checkpoints Saved:**
- Every 500 steps: `checkpoint_step_<N>.pt`
- Best validation ROUGE: `best_model.pt`

**Common Issues:**
- OOM: Reduce batch_size in config or enable gradient checkpointing
- Slow training: Check GPU utilization (should be >80%)
- NaN loss: Reduce learning rate or check data

In [None]:
# ========== CONFIGURATION ==========
RUN_TRAIN = False  # Set True to train (or resume training)
RESUME_FROM_CHECKPOINT = True  # Resume from existing checkpoint if available
MAX_STEPS = 10000  # Total training steps

# ========== CHECK EXISTING CHECKPOINT ==========
checkpoint_path = CHECKPOINTS_DIR / "best_model.pt"

if checkpoint_path.exists():
    print(f"✓ Checkpoint found: {checkpoint_path}")
    
    # Load checkpoint metadata
    checkpoint = torch.load(checkpoint_path, map_location='cpu', weights_only=False)
    print(f"  Step: {checkpoint.get('step', 'N/A')}")
    print(f"  Epoch: {checkpoint.get('epoch', 'N/A')}")
    print(f"  Best ROUGE-L: {checkpoint.get('best_rouge', 0):.4f}")
    print(f"  Model params: {sum(p.numel() for p in checkpoint['model_state_dict'].values()):,}")
else:
    print("⚠️ No checkpoint found. Training will start from scratch.")

# ========== TRAIN (if enabled) ==========
if RUN_TRAIN:
    print("\n🚀 Starting training...")
    print(f"  Max steps: {MAX_STEPS}")
    print(f"  Resume: {RESUME_FROM_CHECKPOINT}")
    print(f"  Device: {device}")
    print(f"  Config: {CONFIG_PATH}")
    
    # Build training command
    train_cmd = [
        sys.executable, "train.py",
        "--config", str(CONFIG_PATH),
        "--tokenized_dir", str(TOKENIZED_DIR),
        "--run_name", "notebook_training",
        "--max_steps", str(MAX_STEPS)
    ]
    
    if RESUME_FROM_CHECKPOINT and checkpoint_path.exists():
        train_cmd.extend(["--resume", str(checkpoint_path)])
    
    # Run training
    import subprocess
    print(f"\nCommand: {' '.join(train_cmd)}\n")
    
    process = subprocess.Popen(
        train_cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1
    )
    
    # Stream output
    for line in process.stdout:
        print(line, end='')
    
    process.wait()
    
    if process.returncode == 0:
        print("\n✓ Training completed successfully")
    else:
        print(f"\n✗ Training failed with exit code {process.returncode}")
else:
    print("\n⏭️ Training skipped (RUN_TRAIN = False)")

## 1️⃣1️⃣ Load Trained Model

**Expected Runtime:** ~5 seconds  
**What this cell does:**
- Loads the best trained checkpoint
- Restores model weights
- Verifies model is ready for inference

**Checkpoint Used:** `artifacts/checkpoints/final_check/best_model.pt`

**Common Issues:**
- Checkpoint mismatch: Ensure config matches training config
- Missing checkpoint: Train model first (set RUN_TRAIN = True)

In [None]:
# Load best checkpoint
checkpoint_path = CHECKPOINTS_DIR / "best_model.pt"

if not checkpoint_path.exists():
    raise FileNotFoundError(f"Checkpoint not found: {checkpoint_path}")

print(f"📦 Loading checkpoint: {checkpoint_path}")
checkpoint = torch.load(checkpoint_path, map_location=device, weights_only=False)

# Recreate model (in case not already created)
from models.model import PointerGeneratorModel
model = PointerGeneratorModel(config).to(device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

print(f"✓ Model loaded successfully")
print(f"  Training step: {checkpoint.get('step', 'N/A')}")
print(f"  Training epoch: {checkpoint.get('epoch', 'N/A')}")
print(f"  Best validation ROUGE-L: {checkpoint.get('best_rouge', 0):.4f}")
print(f"  Parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"  Mode: Evaluation")

## 1️⃣2️⃣ Decoding + Prediction Dump

**Expected Runtime:** ~30-60 minutes (for full validation set)  
**What this cell does:**
- Runs beam search decoding on validation set
- Generates summaries for all examples
- Saves predictions to CSV

**Outputs:**
- `artifacts/predictions/val_predictions.csv` (note_id, prediction, reference)

**Beam Search Parameters:**
- Beam size: 4 (explores 4 candidate sequences)
- Max length: 128 tokens
- Length penalty: Prevents overly short summaries
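
The length penalty can be illustrated with the widely used GNMT-style normalization, where a hypothesis score is its summed log-probability divided by a factor that grows with length. This is a sketch of that one formulation: the `beam_search` call in the cell below passes no penalty argument, so how `utils/beam_search.py` actually handles length is not visible here. The `alpha` value and the toy scores are assumptions.

```python
def length_penalty(length: int, alpha: float = 0.6) -> float:
    """GNMT-style penalty: grows with length so longer hypotheses are not unfairly punished."""
    return ((5.0 + length) / 6.0) ** alpha


def normalized_score(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """Length-normalized beam score; higher is better."""
    return log_prob / length_penalty(length, alpha)


# Two finished hypotheses: (token IDs, summed log-probability); values are made up.
hypotheses = [([7, 8], -2.0), ([7, 8, 9, 10, 11, 12], -2.5)]
raw_best = max(hypotheses, key=lambda h: h[1])
norm_best = max(hypotheses, key=lambda h: normalized_score(h[1], len(h[0])))
```

Ranking by raw log-probability picks the two-token hypothesis, while the normalized score prefers the six-token one. That is the bias toward short outputs the penalty is meant to counter.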

**Common Issues:**
- Slow inference: Reduce beam_size or max_length
- OOM: Reduce batch_size to 1

In [None]:
# Import utilities
from utils.dataset import get_dataloader
from utils.beam_search import beam_search

# Load tokenizer
print("🔤 Loading tokenizer...")
tokenizer = spm.SentencePieceProcessor()
tokenizer.load(str(TOKENIZER_DIR / 'spm.model'))
print(f"  Vocab size: {tokenizer.vocab_size():,}")

# Create validation dataloader
print("\n📊 Loading validation data...")
val_dataloader = get_dataloader(
    str(TOKENIZED_DIR / 'val.parquet'),
    batch_size=1,  # Beam search works on single examples
    shuffle=False,
    max_src_len=config['model']['chunk_len'] * config['model']['num_chunks'],
    max_tgt_len=config['model']['max_target_len'],
    pad_id=config['data']['pad_id']
)

print(f"  Validation examples: {len(val_dataloader)}")

# Run inference
print("\n🔮 Generating predictions...")
predictions = []

with torch.no_grad():
    for batch_idx, batch in enumerate(tqdm(val_dataloader, desc="Decoding")):
        # Move to device
        src_ids = batch['source_ids'].to(device)
        src_ext_ids = batch['source_ext_ids'].to(device)
        tgt_ids = batch['target_ids'].to(device)
        oov_list = batch['oov_list']
        note_ids = batch['note_id']
        
        # Run beam search
        pred_ids = beam_search(
            model=model,
            src_ids=src_ids[0],  # Batch size = 1
            src_ext_ids=src_ext_ids[0],
            oov_list=oov_list[0],
            beam_size=4,
            max_len=config['model']['max_target_len'],
            bos_id=config['data']['bos_id'],
            eos_id=config['data']['eos_id'],
            pad_id=config['data']['pad_id'],
            device=device
        )
        
        # Decode prediction and reference
        pred_text = tokenizer.decode(pred_ids)
        ref_ids = tgt_ids[0].cpu().tolist()
        special_ids = {config['data']['pad_id'], config['data']['bos_id'], config['data']['eos_id']}
        ref_text = tokenizer.decode([tok for tok in ref_ids if tok not in special_ids])
        
        predictions.append({
            'note_id': note_ids[0],
            'prediction': pred_text,
            'reference': ref_text
        })
        
        # Limit to first 100 for demo (remove for full evaluation)
        if batch_idx >= 99:
            print("\n⚠️ Limited to 100 examples for demo. Remove limit for full evaluation.")
            break

# Save predictions
predictions_df = pd.DataFrame(predictions)
output_path = PREDICTIONS_DIR / "val_predictions.csv"
PREDICTIONS_DIR.mkdir(parents=True, exist_ok=True)
predictions_df.to_csv(output_path, index=False)

print(f"\n✓ Predictions saved: {output_path}")
print(f"  Total predictions: {len(predictions)}")

# Display sample predictions
print("\n" + "="*80)
print("Sample Predictions:")
print("="*80)
for i in range(min(3, len(predictions))):
    print(f"\nExample {i+1}:")
    print(f"  Note ID: {predictions[i]['note_id']}")
    print(f"  Prediction: {predictions[i]['prediction'][:200]}...")
    print(f"  Reference:  {predictions[i]['reference'][:200]}...")
print("="*80)

## 1️⃣3️⃣ ROUGE Evaluation

**Expected Runtime:** ~30 seconds  
**What this cell does:**
- Computes ROUGE-1, ROUGE-2, ROUGE-L metrics
- Compares model vs baseline (lead-k)
- Displays results table

**Metrics:**
- **ROUGE-1:** Unigram overlap (word-level content match)
- **ROUGE-2:** Bigram overlap (phrase-level matching)
- **ROUGE-L:** Longest common subsequence (in-order overlap, a rough proxy for fluency)
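
For intuition, the three metrics can be computed by hand on a single prediction and reference pair. The sketch below uses lowercased whitespace tokens and no stemming, so its numbers need not match `utils.metrics.RougeMetric`, whose tokenization is not shown here. The function names and example sentences are illustrative.

```python
from collections import Counter


def _f1(overlap: int, pred_len: int, ref_len: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / pred_len, overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(pred: str, ref: str, n: int = 1) -> float:
    """F1 over clipped n-gram matches, whitespace-tokenized."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    p, r = ngrams(pred), ngrams(ref)
    overlap = sum((p & r).values())  # Counter intersection clips repeated n-grams
    return _f1(overlap, sum(p.values()), sum(r.values()))


def rouge_l_f1(pred: str, ref: str) -> float:
    """F1 based on the longest common subsequence of tokens."""
    a, b = pred.lower().split(), ref.lower().split()
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return _f1(prev[-1], len(a), len(b))


pred = "patient admitted with chest pain"
ref = "patient was admitted with acute chest pain"
r1, r2, rl = rouge_n_f1(pred, ref, 1), rouge_n_f1(pred, ref, 2), rouge_l_f1(pred, ref)
```

Every predicted word appears in the reference in the same order, so ROUGE-1 and ROUGE-L agree here. ROUGE-2 is lower because the inserted words "was" and "acute" break most of the bigrams.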

**Baseline:**
- **Lead-K:** First K sentences of source as summary (simple baseline)
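
A lead-k baseline can be sketched in a few lines. This is an illustration, not the implementation in `baselines.py`, which takes a tokenized directory and may work on token IDs instead. The regex sentence splitter is deliberately naive and will mis-split clinical abbreviations such as "Dr.".

```python
import re


def lead_k(text: str, k: int = 3) -> str:
    """Return the first k sentences, splitting naively after '.', '!' or '?'."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return " ".join(sentences[:k])


# Invented example note for illustration.
note = ("Pt admitted with chest pain. ECG showed ST elevation. "
        "Cath lab activated. Discharged on day 3.")
summary = lead_k(note, k=2)
```

Because clinical notes often open with the chief complaint and history, this baseline can be hard to beat early in training.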

**Common Issues:**
- Low scores: Check if predictions are empty or truncated

In [None]:
# Compute ROUGE scores
from utils.metrics import RougeMetric, format_rouge_scores

print("📊 Computing ROUGE scores...\n")

# Load predictions
predictions_path = PREDICTIONS_DIR / "val_predictions.csv"
predictions_df = pd.read_csv(predictions_path)

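# Extract predictions and references (empty strings reload from CSV as NaN, so coerce back)
predictions_list = predictions_df['prediction'].fillna('').astype(str).tolist()
references_list = predictions_df['reference'].fillna('').astype(str).tolist()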

# Compute ROUGE
metric = RougeMetric()
for pred, ref in zip(predictions_list, references_list):
    metric.update(pred, ref)

scores = metric.compute()

# Display results
print("="*80)
print("ROUGE Evaluation Results (Validation Set)")
print("="*80)
print(f"\nModel: Pointer-Generator with Coverage")
print(f"Checkpoint: {CHECKPOINTS_DIR / 'best_model.pt'}")
print(f"Examples evaluated: {len(predictions_list)}\n")

print(format_rouge_scores(scores))

# Create results table
results_table = pd.DataFrame({
    'Metric': ['ROUGE-1', 'ROUGE-2', 'ROUGE-L'],
    'Precision': [
        scores['rouge1']['precision'],
        scores['rouge2']['precision'],
        scores['rougeL']['precision']
    ],
    'Recall': [
        scores['rouge1']['recall'],
        scores['rouge2']['recall'],
        scores['rougeL']['recall']
    ],
    'F1': [
        scores['rouge1']['f1'],
        scores['rouge2']['f1'],
        scores['rougeL']['f1']
    ]
})

print("\n" + results_table.to_string(index=False))
print("="*80)

# Save results
results_path = PREDICTIONS_DIR / "rouge_results.csv"
results_table.to_csv(results_path, index=False)
print(f"\n✓ Results saved: {results_path}")

## 1️⃣4️⃣ Baseline Comparison

**Expected Runtime:** ~2 minutes  
**What this cell does:**
- Computes lead-k baseline (first K sentences)
- Compares model performance vs baseline

**Skip Condition:** `RUN_BASELINES = False` (use existing baseline results)

**Common Issues:**
- Baseline not found: Run `baselines.py` first

In [None]:
# ========== CONFIGURATION ==========
RUN_BASELINES = False  # Set True to recompute baselines

# ========== CHECK EXISTING BASELINES ==========
baseline_results_path = ARTIFACTS_DIR / "baselines" / "val_results.csv"

if baseline_results_path.exists():
    print("✓ Baseline results found\n")
    baseline_df = pd.read_csv(baseline_results_path)
    print(baseline_df.to_string(index=False))
else:
    print("⚠️ Baseline results not found. Set RUN_BASELINES = True.")

# ========== RUN BASELINES (if needed) ==========
if RUN_BASELINES:
    print("\n🔄 Computing baselines...")
    import subprocess
    result = subprocess.run(
        [sys.executable, "baselines.py",
         "--tokenized_dir", str(TOKENIZED_DIR),
         "--split", "val"],
        capture_output=True,
        text=True
    )
    print(result.stdout)
    if result.returncode != 0:
        print(f"Error: {result.stderr}")
    else:
        print(f"✓ Baselines saved to {ARTIFACTS_DIR / 'baselines'}")

# ========== COMPARISON ==========
if baseline_results_path.exists() and (PREDICTIONS_DIR / "rouge_results.csv").exists():
    print("\n" + "="*80)
    print("Model vs Baseline Comparison")
    print("="*80)
    
    baseline_df = pd.read_csv(baseline_results_path)
    model_df = pd.read_csv(PREDICTIONS_DIR / "rouge_results.csv")
    
    # Find the lead-k baseline (typically the strongest baseline); na=False skips missing method names
    lead_k = baseline_df[baseline_df['Method'].str.contains('lead', case=False, na=False)].iloc[0]
    
    comparison = pd.DataFrame({
        'Method': ['Lead-K Baseline', 'Pointer-Generator (Ours)'],
        'ROUGE-1 F1': [lead_k['ROUGE-1 F1'], model_df[model_df['Metric'] == 'ROUGE-1']['F1'].values[0]],
        'ROUGE-2 F1': [lead_k['ROUGE-2 F1'], model_df[model_df['Metric'] == 'ROUGE-2']['F1'].values[0]],
        'ROUGE-L F1': [lead_k['ROUGE-L F1'], model_df[model_df['Metric'] == 'ROUGE-L']['F1'].values[0]]
    })
    
    print("\n" + comparison.to_string(index=False))
    print("\n" + "="*80)

## 1️⃣5️⃣ Results Summary & Next Steps

**Final Summary:**
- ✅ Model trained from scratch (no pretrained weights)
- ✅ Custom SentencePiece tokenizer trained on clinical text
- ✅ Hierarchical pointer-generator with coverage mechanism
- ✅ ROUGE evaluation on validation set
- ✅ All artifacts preserved and reusable

**Key Files:**
- Tokenizer: `artifacts/tokenizer/spm.model`
- Best model: `artifacts/checkpoints/final_check/best_model.pt`
- Predictions: `artifacts/predictions/val_predictions.csv`
- Results: `artifacts/predictions/rouge_results.csv`

**Next Steps:**
1. **Error Analysis:** Inspect failed examples, identify common issues
2. **Hyperparameter Tuning:** Adjust learning rate, beam size, coverage weight
3. **Longer Training:** Current checkpoint at ~500-600 steps; try 10K+ steps
4. **Test Set Evaluation:** Run `evaluate.py --split test` for final results
5. **Deployment:** Export model for production inference

**Troubleshooting:**
- Low ROUGE: Try longer training, larger model, or better preprocessing
- Repetitive summaries: Increase coverage weight in config
- OOM errors: Reduce batch size, enable gradient checkpointing

**Questions?** Check `README.md` or archived notebooks in `archive/`

In [None]:
# Final checkpoint
print("="*80)
print("🎉 Pipeline Complete!")
print("="*80)
print(f"\n✓ Tokenizer: {TOKENIZER_DIR / 'spm.model'}")
print(f"✓ Model: {CHECKPOINTS_DIR / 'best_model.pt'}")
print(f"✓ Predictions: {PREDICTIONS_DIR / 'val_predictions.csv'}")
print(f"✓ Results: {PREDICTIONS_DIR / 'rouge_results.csv'}")
print("\nAll artifacts preserved. Notebook can be rerun without retraining.")
print("="*80)