# ChronoTranscriber Evaluation: CER & WER Analysis

This notebook evaluates transcription quality across multiple LLMs by computing:
- **Character Error Rate (CER)**: character-level edit distance divided by the number of characters in the reference
- **Word Error Rate (WER)**: word-level edit distance divided by the number of words in the reference
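
Both metrics normalize a Levenshtein edit distance by the length of the reference. The sketch below illustrates that definition on a toy pair of strings. It is not the implementation in `metrics.py`, which is not shown here, and the function names are chosen for illustration only.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate; this sketch returns 0.0 for an empty reference."""
    return edit_distance(ref, hyp) / len(ref) if ref else 0.0


def wer(ref: str, hyp: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = ref.split()
    if not ref_words:
        return 0.0
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


reference = "Basel 1900"
hypothesis = "Basel 1960"
print(f"CER: {cer(reference, hypothesis):.2f}")  # CER: 0.10 (1 error in 10 characters)
print(f"WER: {wer(reference, hypothesis):.2f}")  # WER: 0.50 (1 error in 2 words)
```

A single wrong digit costs 10% CER but 50% WER here, which is why the two metrics are reported side by side.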

## Evaluation Method
Metrics are computed **page by page** from the temporary JSONL files that the transcriber writes, rather than from the final post-processed text output.
Comparing at this stage avoids penalizing a model for formatting changes introduced by post-processing.
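
As a rough illustration of what reading page records from a JSONL file involves, here is a minimal sketch. The key names (`page_index`, `image_name`, `transcription`) are assumptions borrowed from the attribute names used later in this notebook. The actual schema is defined by `parse_transcription_jsonl` in `jsonl_eval`, which is not shown here.

```python
import json
from typing import Dict, List


def read_page_records(jsonl_text: str) -> List[Dict]:
    """Parse one JSON object per non-empty line, skipping malformed lines."""
    records = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # tolerate a partially written or corrupted line
    return records


# Hypothetical records; the key names and values are assumptions for illustration.
sample = "\n".join([
    json.dumps({"page_index": 0, "image_name": "page_001.jpg", "transcription": "Abt, Karl"}),
    json.dumps({"page_index": 1, "image_name": "page_002.jpg", "transcription": "Bauer, Emil"}),
    "{not valid json",
])

pages = read_page_records(sample)
print(len(pages))  # 2
```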

## Models Evaluated
| Provider | Model | Reasoning Effort |
|----------|-------|------------------|
| OpenAI | GPT-5.1 | Medium |
| OpenAI | GPT-5 Mini | Medium |
| Google | Gemini 3.0 Pro | Medium |
| Google | Gemini 2.5 Flash | None |
| Anthropic | Claude Sonnet 4.5 | Medium |
| Anthropic | Claude Haiku 4.5 | Medium |

## Dataset Categories
1. **Address Books** - Swiss address book pages (Basel 1900)
2. **Bibliography** - European culinary bibliographies
3. **Military Records** - Brazilian military enlistment cards

In [None]:
# Standard library imports
import json
import os
import sys
from pathlib import Path
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass
import warnings
warnings.filterwarnings('ignore')

# Add parent directory to path for imports
EVAL_DIR = Path.cwd()
PROJECT_ROOT = EVAL_DIR.parent
sys.path.insert(0, str(PROJECT_ROOT))
sys.path.insert(0, str(EVAL_DIR))

# Import evaluation metrics
from metrics import (
    compute_metrics,
    aggregate_metrics,
    TranscriptionMetrics,
    format_metrics_table,
)

# Import JSONL page-level utilities
from jsonl_eval import (
    PageTranscription,
    DocumentTranscriptions,
    parse_transcription_jsonl,
    find_jsonl_file,
    load_page_transcriptions,
    load_ground_truth_pages,
    align_pages,
)

# Data handling
import yaml

print(f"Evaluation directory: {EVAL_DIR}")
print(f"Project root: {PROJECT_ROOT}")

## 1. Configuration
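
The loading cell in this section reads a fixed set of keys from `eval_config.yaml`. The sketch below mirrors that structure as a Python dictionary and checks that the required keys are present. Only the key names come from this notebook. All values, and the helper `missing_keys`, are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical values; only the key names are taken from the loading cell in this notebook.
example_config: Dict[str, Any] = {
    "dataset": {
        "input_path": "test_data/input",
        "output_path": "test_data/output",
        "ground_truth_path": "test_data/ground_truth",
        "categories": [
            {"name": "address_books"},
            {"name": "bibliography"},
            {"name": "military_records"},
        ],
    },
    "evaluation": {"reports_path": "reports"},
    "models": [{"name": "example_model", "description": "Placeholder entry"}],
}

# Key paths that the loading cell accesses.
REQUIRED_KEYS: List[Tuple[str, ...]] = [
    ("dataset", "input_path"),
    ("dataset", "output_path"),
    ("dataset", "ground_truth_path"),
    ("dataset", "categories"),
    ("evaluation", "reports_path"),
    ("models",),
]


def missing_keys(config: Dict[str, Any]) -> List[str]:
    """Return dotted paths of required keys that are absent from the config."""
    missing = []
    for path in REQUIRED_KEYS:
        node: Any = config
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


print(missing_keys(example_config))  # []
```

Running a check like this before the loading cell would turn a bare `KeyError` into a readable list of what is missing.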

In [None]:
# Load evaluation configuration
CONFIG_PATH = EVAL_DIR / "eval_config.yaml"

with open(CONFIG_PATH, 'r', encoding='utf-8') as f:
    config = yaml.safe_load(f)

# Extract paths
INPUT_PATH = EVAL_DIR / config['dataset']['input_path']
OUTPUT_PATH = EVAL_DIR / config['dataset']['output_path']
GROUND_TRUTH_PATH = EVAL_DIR / config['dataset']['ground_truth_path']
REPORTS_PATH = EVAL_DIR / config['evaluation']['reports_path']

# Create reports directory
REPORTS_PATH.mkdir(parents=True, exist_ok=True)

# Extract categories and models
CATEGORIES = [cat['name'] for cat in config['dataset']['categories']]
MODELS = {m['name']: m for m in config['models']}

print("Configuration loaded successfully.")
print(f"\nCategories: {CATEGORIES}")
print(f"\nModels: {list(MODELS.keys())}")
print(f"\nPaths:")
print(f"  Input: {INPUT_PATH}")
print(f"  Output: {OUTPUT_PATH}")
print(f"  Ground Truth: {GROUND_TRUTH_PATH}")
print(f"  Reports: {REPORTS_PATH}")

## 2. Discover Available Data

In [None]:
def discover_sources(category: str) -> List[str]:
    """
    Discover source files/folders in the input directory for a category.
    
    Args:
        category: Dataset category
        
    Returns:
        List of source names
    """
    input_dir = INPUT_PATH / category
    
    if not input_dir.exists():
        return []
    
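    # Image extensions accepted both for single files and inside source folders
    image_suffixes = {'.jpg', '.jpeg', '.png', '.tiff'}
    sources = []
    
    for item in input_dir.iterdir():
        if item.is_file() and item.suffix.lower() in image_suffixes | {'.pdf'}:
            sources.append(item.name)
        elif item.is_dir():
            # Check if folder contains images (same extensions, case-insensitive)
            if any(p.is_file() and p.suffix.lower() in image_suffixes for p in item.iterdir()):
                sources.append(item.name)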
    
    return sorted(sources)


def discover_available_models(category: str) -> List[str]:
    """
    Discover which models have JSONL output for a given category.
    
    Args:
        category: Dataset category
        
    Returns:
        List of model names with available output
    """
    output_dir = OUTPUT_PATH / category
    
    if not output_dir.exists():
        return []
    
    models = []
    for d in output_dir.iterdir():
        if d.is_dir():
            # Check if model directory has any JSONL files
            jsonl_files = list(d.rglob('*.jsonl'))
            if jsonl_files:
                models.append(d.name)
    
    return sorted(models)


def check_ground_truth_available(category: str) -> Tuple[bool, int]:
    """
    Check if ground truth JSONL files exist for a category.
    
    Returns:
        Tuple of (has_ground_truth, count_of_files)
    """
    gt_dir = GROUND_TRUTH_PATH / category
    if not gt_dir.exists():
        return False, 0
    
    jsonl_files = list(gt_dir.glob('*.jsonl'))
    return len(jsonl_files) > 0, len(jsonl_files)


# Discover and display available data
print("=" * 60)
print("AVAILABLE DATA SUMMARY")
print("=" * 60)

data_summary = {}

for category in CATEGORIES:
    sources = discover_sources(category)
    available_models = discover_available_models(category)
    gt_available, gt_count = check_ground_truth_available(category)
    
    data_summary[category] = {
        'sources': sources,
        'models': available_models,
        'ground_truth_available': gt_available,
        'ground_truth_count': gt_count,
    }
    
    print(f"\n{category.upper()}")
    print("-" * 40)
    print(f"  Input sources: {len(sources)}")
    if sources:
        for s in sources[:5]:
            print(f"    - {s}")
        if len(sources) > 5:
            print(f"    ... and {len(sources) - 5} more")
    print(f"  Models with JSONL output: {len(available_models)}")
    for m in available_models:
        print(f"    - {m}")
    print(f"  Ground truth JSONL: {'Yes' if gt_available else 'No'} ({gt_count} files)")

## 3. Page-Level Evaluation

Metrics are computed by comparing each page of the model output against the corresponding
ground truth page. This ensures:
- No formatting penalties from whitespace differences in final TXT output
- Accurate per-page error attribution
- Better isolation of transcription quality from post-processing effects
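
Page-level comparison depends on pairing each output page with its ground truth page. The sketch below shows one way to do this, matching on page index and keeping unmatched pages so they can be reported. It is an assumption about the general approach, not the `align_pages` implementation from `jsonl_eval`. The `Page` class is a hypothetical stand-in.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Page:
    """Hypothetical stand-in for a page record."""
    page_index: int
    transcription: str


def align_by_index(
    hyp_pages: List[Page],
    gt_pages: List[Page],
) -> List[Tuple[Optional[Page], Optional[Page]]]:
    """Pair pages by index; a missing side is represented by None."""
    hyp_by_index = {p.page_index: p for p in hyp_pages}
    gt_by_index = {p.page_index: p for p in gt_pages}
    all_indices = sorted(set(hyp_by_index) | set(gt_by_index))
    return [(hyp_by_index.get(i), gt_by_index.get(i)) for i in all_indices]


hyp = [Page(0, "Abt, Karl"), Page(2, "Cron, Hans")]
gt = [Page(0, "Abt, Karl"), Page(1, "Bauer, Emil")]

pairs = align_by_index(hyp, gt)
for hyp_page, gt_page in pairs:
    if hyp_page is not None and gt_page is not None:
        print(gt_page.page_index, "ok")
    elif gt_page is not None:
        print(gt_page.page_index, "missing output")
    else:
        print(hyp_page.page_index, "missing ground truth")
# Prints:
# 0 ok
# 1 missing output
# 2 missing ground truth
```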

In [None]:
@dataclass
class PageEvaluationResult:
    """Container for per-page evaluation results."""
    page_index: int
    image_name: str
    metrics: Optional[TranscriptionMetrics]
    ground_truth_found: bool
    output_found: bool
    error: Optional[str] = None


@dataclass
class SourceEvaluationResult:
    """Container for source-level evaluation results."""
    category: str
    model_name: str
    source_name: str
    page_results: List[PageEvaluationResult]
    aggregated_metrics: Optional[TranscriptionMetrics]
    ground_truth_found: bool
    output_found: bool
    error: Optional[str] = None
    
    @property
    def total_pages(self) -> int:
        return len(self.page_results)
    
    @property
    def evaluated_pages(self) -> int:
        return sum(1 for p in self.page_results if p.metrics is not None)


def evaluate_source_pages(
    category: str,
    model_name: str,
    source_name: str,
) -> SourceEvaluationResult:
    """
    Evaluate a source by comparing pages from model output to ground truth.
    
    Args:
        category: Dataset category
        model_name: Model identifier
        source_name: Source file/folder name
        
    Returns:
        SourceEvaluationResult with per-page and aggregated metrics
    """
    # Load ground truth pages
    gt_doc = load_ground_truth_pages(GROUND_TRUTH_PATH, category, source_name)
    if gt_doc is None or not gt_doc.pages:
        return SourceEvaluationResult(
            category=category,
            model_name=model_name,
            source_name=source_name,
            page_results=[],
            aggregated_metrics=None,
            ground_truth_found=False,
            output_found=False,
            error="Ground truth JSONL not found or empty",
        )
    
    # Load model output pages
    hyp_doc = load_page_transcriptions(OUTPUT_PATH, category, model_name, source_name)
    if hyp_doc is None or not hyp_doc.pages:
        return SourceEvaluationResult(
            category=category,
            model_name=model_name,
            source_name=source_name,
            page_results=[],
            aggregated_metrics=None,
            ground_truth_found=True,
            output_found=False,
            error="Model output JSONL not found or empty",
        )
    
    # Align pages
    aligned = align_pages(hyp_doc, gt_doc)
    
    # Compute per-page metrics
    page_results: List[PageEvaluationResult] = []
    valid_metrics: List[TranscriptionMetrics] = []
    
    for hyp_page, gt_page in aligned:
        # Determine page info
        if gt_page:
            page_index = gt_page.page_index
            # Fall back to an empty string so later string formatting never sees None
            image_name = gt_page.image_name or (hyp_page.image_name if hyp_page else "") or ""
        elif hyp_page:
            page_index = hyp_page.page_index
            image_name = hyp_page.image_name or ""
        else:
            continue
        
        # Check availability
        gt_found = gt_page is not None and gt_page.has_text()
        hyp_found = hyp_page is not None and hyp_page.has_text()
        
        if not gt_found:
            page_results.append(PageEvaluationResult(
                page_index=page_index,
                image_name=image_name,
                metrics=None,
                ground_truth_found=False,
                output_found=hyp_found,
                error="No ground truth for page",
            ))
            continue
        
        if not hyp_found:
            page_results.append(PageEvaluationResult(
                page_index=page_index,
                image_name=image_name,
                metrics=None,
                ground_truth_found=True,
                output_found=False,
                error="No model output for page",
            ))
            continue
        
        # Compute metrics
        try:
            metrics = compute_metrics(
                gt_page.transcription,
                hyp_page.transcription,
                normalize=True,
            )
            page_results.append(PageEvaluationResult(
                page_index=page_index,
                image_name=image_name,
                metrics=metrics,
                ground_truth_found=True,
                output_found=True,
            ))
            valid_metrics.append(metrics)
        except Exception as e:
            page_results.append(PageEvaluationResult(
                page_index=page_index,
                image_name=image_name,
                metrics=None,
                ground_truth_found=True,
                output_found=True,
                error=str(e),
            ))
    
    # Aggregate metrics
    aggregated = aggregate_metrics(valid_metrics) if valid_metrics else None
    
    return SourceEvaluationResult(
        category=category,
        model_name=model_name,
        source_name=source_name,
        page_results=page_results,
        aggregated_metrics=aggregated,
        ground_truth_found=True,
        output_found=True,
    )


def evaluate_model_category(
    category: str,
    model_name: str,
) -> Tuple[List[SourceEvaluationResult], Optional[TranscriptionMetrics]]:
    """
    Evaluate all sources in a category for a given model.
    
    Args:
        category: Dataset category
        model_name: Model identifier
        
    Returns:
        Tuple of (list of per-source results, aggregated metrics)
    """
    sources = discover_sources(category)
    results = []
    all_page_metrics = []
    
    for source in sources:
        result = evaluate_source_pages(category, model_name, source)
        results.append(result)
        
        # Collect valid page metrics for aggregation
        for page_result in result.page_results:
            if page_result.metrics is not None:
                all_page_metrics.append(page_result.metrics)
    
    aggregated = aggregate_metrics(all_page_metrics) if all_page_metrics else None
    
    return results, aggregated


print("Page-level evaluation functions defined.")

In [None]:
# Run full evaluation
print("=" * 60)
print("RUNNING PAGE-LEVEL EVALUATION")
print("=" * 60)

all_results: Dict[str, Dict[str, List[SourceEvaluationResult]]] = {}
aggregated_metrics: Dict[str, Dict[str, TranscriptionMetrics]] = {}

for category in CATEGORIES:
    all_results[category] = {}
    aggregated_metrics[category] = {}
    
    available_models = discover_available_models(category)
    gt_available, gt_count = check_ground_truth_available(category)
    
    if not available_models:
        print(f"\n{category}: No model outputs found (skipping)")
        continue
    
    if not gt_available:
        print(f"\n{category}: No ground truth JSONL files (skipping)")
        print(f"  Hint: Run 'python main/prepare_ground_truth.py --extract' to create editable files")
        continue
    
    print(f"\n{category.upper()}")
    print("-" * 40)
    
    for model_name in available_models:
        results, agg_metrics = evaluate_model_category(category, model_name)
        all_results[category][model_name] = results
        
        if agg_metrics:
            aggregated_metrics[category][model_name] = agg_metrics
            total_pages = sum(r.total_pages for r in results)
            eval_pages = sum(r.evaluated_pages for r in results)
            print(f"  {model_name}:")
            print(f"    CER: {agg_metrics.cer*100:.2f}%  |  WER: {agg_metrics.wer*100:.2f}%")
            print(f"    Pages evaluated: {eval_pages}/{total_pages}")
        else:
            errors = [r.error for r in results if r.error]
            print(f"  {model_name}: No valid evaluations")
            if errors:
                print(f"    Error: {errors[0]}")

print("\n" + "=" * 60)
print("EVALUATION COMPLETE")
print("=" * 60)

## 4. Results Summary
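
When reading the summary figures, it helps to know that per-page results can be combined into a single rate in more than one way. The sketch below contrasts a pooled rate (total edits over total reference length) with a plain mean of per-page rates. Which of these `aggregate_metrics` uses is not shown in this notebook, so treat this as background only. The page values are hypothetical.

```python
from typing import List, Tuple

# Hypothetical (edit_distance, reference_length) pairs for three pages.
pages: List[Tuple[int, int]] = [(2, 100), (10, 20), (0, 400)]


def pooled_rate(pages: List[Tuple[int, int]]) -> float:
    """Total edits divided by total reference length (long pages weigh more)."""
    total_distance = sum(d for d, _ in pages)
    total_length = sum(n for _, n in pages)
    return total_distance / total_length if total_length else 0.0


def mean_of_rates(pages: List[Tuple[int, int]]) -> float:
    """Unweighted mean of per-page rates (every page weighs the same)."""
    rates = [d / n for d, n in pages if n]
    return sum(rates) / len(rates) if rates else 0.0


print(f"pooled: {pooled_rate(pages):.4f}")                # pooled: 0.0231
print(f"mean of page rates: {mean_of_rates(pages):.4f}")  # mean of page rates: 0.1733
```

The short, error-heavy second page dominates the unweighted mean but barely moves the pooled rate, so the two numbers should not be compared with each other.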

In [None]:
# Restructure for display: model -> category -> metrics
model_category_metrics: Dict[str, Dict[str, TranscriptionMetrics]] = {}

for category, models in aggregated_metrics.items():
    for model_name, metrics in models.items():
        if model_name not in model_category_metrics:
            model_category_metrics[model_name] = {}
        model_category_metrics[model_name][category] = metrics

# Display as formatted table
if model_category_metrics:
    print("\n" + "=" * 80)
    print("RESULTS SUMMARY TABLE")
    print("=" * 80 + "\n")
    print(format_metrics_table(model_category_metrics, CATEGORIES))
else:
    print("\nNo evaluation results available.")
    print("\nTo generate results:")
    print("1. Run transcriptions for each model (outputs go to test_data/output/{category}/{model_name}/)")
    print("2. Create ground truth JSONL files:")
    print("   a. python main/prepare_ground_truth.py --extract --input test_data/output/{category}/{model}")
    print("   b. Edit the generated _editable.txt files")
    print("   c. python main/prepare_ground_truth.py --apply --input {edited_file}")
    print("3. Re-run this notebook")

In [None]:
# Compute overall metrics per model (across all categories)
print("\n" + "=" * 80)
print("OVERALL MODEL PERFORMANCE (All Categories Combined)")
print("=" * 80 + "\n")

overall_model_metrics = {}

for model_name, cat_metrics in model_category_metrics.items():
    all_metrics = list(cat_metrics.values())
    if all_metrics:
        overall = aggregate_metrics(all_metrics)
        overall_model_metrics[model_name] = overall

# Sort by CER for ranking
if overall_model_metrics:
    ranked = sorted(overall_model_metrics.items(), key=lambda x: x[1].cer)
    
    print(f"{'Rank':<6} {'Model':<30} {'CER (%)':<12} {'WER (%)':<12} {'Chars':<12} {'Words':<10}")
    print("-" * 80)
    
    for rank, (model_name, metrics) in enumerate(ranked, 1):
        print(f"{rank:<6} {model_name:<30} {metrics.cer*100:<12.2f} {metrics.wer*100:<12.2f} "
              f"{metrics.ref_char_count:<12,} {metrics.ref_word_count:<10,}")
else:
    print("No overall metrics available yet.")

## 5. Detailed Per-Page Results

In [None]:
# Show detailed page-level results for a specific category/model (configurable)
SHOW_CATEGORY = "address_books"  # Change as needed
SHOW_MODEL = None  # Set to specific model name or None for first available
SHOW_SOURCE = None  # Set to specific source name or None for first available

if SHOW_CATEGORY in all_results and all_results[SHOW_CATEGORY]:
    print(f"\n{'='*80}")
    print(f"DETAILED PAGE-LEVEL RESULTS: {SHOW_CATEGORY.upper()}")
    print(f"{'='*80}\n")
    
    models_to_show = [SHOW_MODEL] if SHOW_MODEL else list(all_results[SHOW_CATEGORY].keys())[:1]
    
    for model_name in models_to_show:
        if model_name not in all_results[SHOW_CATEGORY]:
            continue
            
        results = all_results[SHOW_CATEGORY][model_name]
        sources_to_show = [r for r in results if r.source_name == SHOW_SOURCE] if SHOW_SOURCE else results[:1]
        
        for source_result in sources_to_show:
            print(f"\n{model_name} / {source_result.source_name}")
            print("-" * 70)
            print(f"{'Page':<6} {'Image':<35} {'CER (%)':<12} {'WER (%)':<12} {'Status'}")
            print("-" * 70)
            
            for page_result in source_result.page_results:
                page_num = page_result.page_index + 1
                img_display = page_result.image_name[:33] + ".." if len(page_result.image_name) > 35 else page_result.image_name
                
                if page_result.metrics:
                    print(f"{page_num:<6} {img_display:<35} {page_result.metrics.cer*100:<12.2f} {page_result.metrics.wer*100:<12.2f} OK")
                else:
                    status = page_result.error or "Error"
                    print(f"{page_num:<6} {img_display:<35} {'--':<12} {'--':<12} {status}")
            
            if source_result.aggregated_metrics:
                m = source_result.aggregated_metrics
                print("-" * 70)
                print(f"{'TOTAL':<6} {'':<35} {m.cer*100:<12.2f} {m.wer*100:<12.2f}")
else:
    print(f"Category '{SHOW_CATEGORY}' not found in results or has no evaluated models.")

## 6. Export Results

In [None]:
import csv
from datetime import datetime

# Generate timestamp for reports
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Export aggregated metrics to JSON
json_report = {
    "timestamp": timestamp,
    "evaluation_method": "page_level_jsonl",
    "categories": CATEGORIES,
    "models": list(MODELS.keys()),
    "results": {},
}

for model_name, cat_metrics in model_category_metrics.items():
    json_report["results"][model_name] = {
        "per_category": {cat: m.to_dict() for cat, m in cat_metrics.items()},
    }
    if model_name in overall_model_metrics:
        json_report["results"][model_name]["overall"] = overall_model_metrics[model_name].to_dict()

json_path = REPORTS_PATH / f"eval_results_{timestamp}.json"
with open(json_path, 'w', encoding='utf-8') as f:
    json.dump(json_report, f, indent=2)
print(f"JSON report saved: {json_path}")

# Export to CSV (per-page detail)
csv_path = REPORTS_PATH / f"eval_results_{timestamp}.csv"
with open(csv_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(["Model", "Category", "Source", "Page", "Image", "CER (%)", "WER (%)", 
                     "Char Distance", "Ref Chars", "Status"])
    
    for category, models in all_results.items():
        for model_name, sources in models.items():
            for source_result in sources:
                for page_result in source_result.page_results:
                    if page_result.metrics:
                        writer.writerow([
                            model_name,
                            category,
                            source_result.source_name,
                            page_result.page_index + 1,
                            page_result.image_name,
                            round(page_result.metrics.cer * 100, 2),
                            round(page_result.metrics.wer * 100, 2),
                            page_result.metrics.char_distance,
                            page_result.metrics.ref_char_count,
                            "OK",
                        ])
                    else:
                        writer.writerow([
                            model_name,
                            category,
                            source_result.source_name,
                            page_result.page_index + 1,
                            page_result.image_name,
                            "",
                            "",
                            "",
                            "",
                            page_result.error or "Error",
                        ])

print(f"CSV report saved: {csv_path}")

# Export Markdown summary
md_path = REPORTS_PATH / f"eval_results_{timestamp}.md"
with open(md_path, 'w', encoding='utf-8') as f:
    f.write("# ChronoTranscriber Evaluation Results\n\n")
    f.write(f"**Generated:** {timestamp}\n\n")
    f.write("**Evaluation Method:** Page-level JSONL comparison\n\n")
    f.write("## Models Evaluated\n\n")
    for name, info in MODELS.items():
        f.write(f"- **{name}**: {info.get('description', '')}\n")
    f.write("\n## Results by Category\n\n")
    f.write(format_metrics_table(model_category_metrics, CATEGORIES))
    f.write("\n\n## Overall Rankings\n\n")
    if overall_model_metrics:
        ranked = sorted(overall_model_metrics.items(), key=lambda x: x[1].cer)
        f.write("| Rank | Model | CER (%) | WER (%) |\n")
        f.write("|------|-------|---------|---------|\n")
        for rank, (model_name, metrics) in enumerate(ranked, 1):
            f.write(f"| {rank} | {model_name} | {metrics.cer*100:.2f} | {metrics.wer*100:.2f} |\n")

print(f"Markdown report saved: {md_path}")

## 7. Visualization (Optional)

In [None]:
# Optional: Create visualizations if matplotlib is available
try:
    import matplotlib.pyplot as plt
    import numpy as np
    
    PLOT_AVAILABLE = True
except ImportError:
    PLOT_AVAILABLE = False
    print("matplotlib not available - skipping visualizations")
    print("Install with: pip install matplotlib")

if PLOT_AVAILABLE and overall_model_metrics:
    # Prepare data
    models = list(overall_model_metrics.keys())
    cer_values = [overall_model_metrics[m].cer * 100 for m in models]
    wer_values = [overall_model_metrics[m].wer * 100 for m in models]
    
    # Create figure
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
    
    # CER bar chart
    ax1 = axes[0]
    bars1 = ax1.barh(models, cer_values, color='steelblue')
    ax1.set_xlabel('Character Error Rate (%)')
    ax1.set_title('CER by Model')
    ax1.bar_label(bars1, fmt='%.2f%%', padding=3)
    ax1.set_xlim(0, max(cer_values) * 1.2 if cer_values else 10)
    
    # WER bar chart
    ax2 = axes[1]
    bars2 = ax2.barh(models, wer_values, color='darkorange')
    ax2.set_xlabel('Word Error Rate (%)')
    ax2.set_title('WER by Model')
    ax2.bar_label(bars2, fmt='%.2f%%', padding=3)
    ax2.set_xlim(0, max(wer_values) * 1.2 if wer_values else 10)
    
    plt.tight_layout()
    
    # Save figure
    fig_path = REPORTS_PATH / f"eval_chart_{timestamp}.png"
    plt.savefig(fig_path, dpi=150, bbox_inches='tight')
    print(f"Chart saved: {fig_path}")
    
    plt.show()

---

## Next Steps

### Ground Truth Workflow

1. **Extract transcriptions for editing**
   ```bash
   python main/prepare_ground_truth.py --extract --input eval/test_data/output/{category}/{model}
   ```

2. **Edit the generated `_editable.txt` files**
   - Each page is marked with `=== page NNN ===`
   - Correct transcription errors directly in the text
   - Use `[NO TRANSCRIBABLE TEXT]` for blank pages
   - Use `[TRANSCRIPTION NOT POSSIBLE]` for illegible pages

3. **Apply corrections to create ground truth**
   ```bash
   python main/prepare_ground_truth.py --apply --input eval/test_data/output/{category}/{model}
   ```

4. **Check ground truth status**
   ```bash
   python main/prepare_ground_truth.py --status
   ```
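
To make the editable file format from step 2 concrete, the sketch below splits a text on `=== page NNN ===` markers. It assumes that `NNN` is a run of digits and that each marker sits on its own line. The real parsing is done by `prepare_ground_truth.py`, which is not shown here, and `split_editable_text` is a hypothetical helper.

```python
import re
from typing import Dict

# Assumed marker shape: "=== page 001 ===" alone on a line.
PAGE_MARKER = re.compile(r"^=== page (\d+) ===[ \t]*$", re.MULTILINE)


def split_editable_text(text: str) -> Dict[int, str]:
    """Map each page number to the text between its marker and the next one."""
    pages: Dict[int, str] = {}
    matches = list(PAGE_MARKER.finditer(text))
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following is not None else len(text)
        pages[int(current.group(1))] = text[current.end():end].strip()
    return pages


sample = """=== page 001 ===
Abt, Karl, Schreiner
=== page 002 ===
[NO TRANSCRIBABLE TEXT]
"""

pages = split_editable_text(sample)
print(pages)  # {1: 'Abt, Karl, Schreiner', 2: '[NO TRANSCRIBABLE TEXT]'}
```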

### Expected Directory Structure
```
eval/
├── test_data/
│   ├── input/                    # Source documents
│   │   ├── address_books/
│   │   ├── bibliography/
│   │   └── military_records/
│   ├── output/                   # Model outputs (JSONL per source)
│   │   └── {category}/
│   │       └── {model_name}/
│   │           └── {source}/
│   │               └── {source}.jsonl
│   └── ground_truth/             # Corrected transcriptions (JSONL)
│       └── {category}/
│           └── {source}.jsonl
└── reports/                      # Generated evaluation reports
```
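
The layout above can be expressed as two small path helpers. This is a sketch under the assumption that `{source}` is used unchanged as both the folder name and the JSONL file stem. Whether the project strips file extensions such as `.pdf` from source names is not stated here. The helper names and the example model and source names are hypothetical.

```python
from pathlib import Path


def output_jsonl_path(eval_dir: Path, category: str, model_name: str, source: str) -> Path:
    """Model output: test_data/output/{category}/{model_name}/{source}/{source}.jsonl"""
    return eval_dir / "test_data" / "output" / category / model_name / source / f"{source}.jsonl"


def ground_truth_jsonl_path(eval_dir: Path, category: str, source: str) -> Path:
    """Ground truth: test_data/ground_truth/{category}/{source}.jsonl"""
    return eval_dir / "test_data" / "ground_truth" / category / f"{source}.jsonl"


eval_dir = Path("eval")
out_path = output_jsonl_path(eval_dir, "address_books", "example_model", "basel_1900")
gt_path = ground_truth_jsonl_path(eval_dir, "address_books", "basel_1900")
print(out_path.as_posix())
print(gt_path.as_posix())
```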