# ChronoTranscriber Evaluation: CER & WER Analysis

This notebook evaluates transcription quality across several transcription systems (a local OCR baseline and LLMs from several providers) using edit-distance–based accuracy metrics computed against manually corrected ground truth.

The primary evaluation outputs are:
- **Character Error Rate (CER)**: edit distance at the character level, divided by the number of reference characters.
- **Word Error Rate (WER)**: edit distance at the word level, divided by the number of reference words.
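
To make the two definitions concrete, here is a minimal, self-contained sketch of CER and WER built on a plain Levenshtein distance. It is an illustration only and not the notebook's `metrics` module; the function names `edit_distance`, `cer`, and `wer` are hypothetical, and no text normalization is applied.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character-level edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(ref: str, hyp: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


reference = "Basel 1900"
hypothesis = "Base1 1900"  # one character misread: 'l' -> '1'
print(f"CER: {cer(reference, hypothesis):.2%}")  # 1 edit / 10 characters
print(f"WER: {wer(reference, hypothesis):.2%}")  # 1 edit / 2 words
```

A single misread character costs 10% CER but 50% WER here, which is why the two metrics are reported side by side. Both rates can exceed 100% when the hypothesis is much longer than the reference.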

## Evaluation Method
The evaluation is performed **page-by-page** using the temporary JSONL files produced by the ChronoTranscriber pipeline. Page-level evaluation is preferred over comparing final exported plain-text files because it:
- Avoids penalizing downstream whitespace or post-processing differences that do not reflect transcription quality.
- Preserves page boundaries to support error localization and qualitative inspection.
- Enables fairer comparisons across models when outputs vary in formatting.

The metrics implementation distinguishes:
- **Overall metrics**: computed on the full text (including markup and page markers if present).
- **Content-only metrics**: computed after removing formatting artifacts to focus on raw text recognition.
- **Formatting metrics** (available in the underlying metric implementation): separate accounting for page markers and common Markdown constructs.
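
The difference between overall and content-only metrics can be pictured as a stripping step applied to both texts before the edit distance is computed. The sketch below is an assumption about what such a step might look like; the regular expressions and the name `strip_formatting` are hypothetical, and the actual rules live in the underlying metric implementation.

```python
import re

# Hypothetical patterns for the formatting artifacts named above.
PAGE_MARKER = re.compile(r"<page_number>.*?</page_number>")
IMAGE_TAG = re.compile(r"!\[Image:[^\]]*\]")
HEADING_PREFIX = re.compile(r"^#{1,6}\s+", flags=re.MULTILINE)
EMPHASIS = re.compile(r"(\*\*|\*)(.+?)\1")


def strip_formatting(text: str) -> str:
    """Remove page markers, image tags, and simple Markdown, then collapse whitespace."""
    text = PAGE_MARKER.sub("", text)
    text = IMAGE_TAG.sub("", text)
    text = HEADING_PREFIX.sub("", text)
    text = EMPHASIS.sub(r"\2", text)  # keep the emphasized words, drop the asterisks
    return " ".join(text.split())


sample = "<page_number>12</page_number>\n# Verzeichnis\n**Meyer**, Karl ![Image: vignette]"
print(strip_formatting(sample))
```

Under these assumptions, a model that reads every word correctly but omits a heading marker would be penalized in the overall metrics and not in the content-only metrics.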

## Models Evaluated
| Provider | Model | `model_id` | Reasoning effort |
|----------|-------|-----------|----------|
| Local | Tesseract OCR | `tesseract` | None (baseline) |
| OpenAI | GPT-5.2 | `gpt-5.2` | Medium |
| OpenAI | GPT-5 Mini | `gpt-5-mini` | Medium |
| Google | Gemini 3 Pro | `gemini-3-pro` | Medium |
| Google | Gemini 3 Flash | `gemini-3-flash-preview` | None |
| Anthropic | Claude Sonnet 4.5 | `claude-sonnet-4-5-20250929` | Medium |
| Anthropic | Claude Haiku 4.5 | `claude-haiku-4-5` | Medium |

## Dataset Categories
The evaluation dataset is organized into three document categories defined in the configuration:
1. **Address Books** — Swiss address book pages (Basel 1900); 31 pages processed as one source
2. **Bibliography** — European culinary bibliography (Oxford 1913); 187 pages
3. **Military Records** — Brazilian military enlistment cards; 3 sources × 2 pages each

## Ground Truth
Manually corrected reference transcriptions are stored as JSONL in `test_data/ground_truth/`
(converted from `Korrekturen.zip` via `setup_ground_truth.py`). Schema normalizations applied:
- Image tags: `[Image: ...]` → `![Image: ...]`
- Page markers: `<page_number>X<page_number>` → `<page_number>X</page_number>`
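
The two normalizations can be sketched with regular expressions as follows. This is an illustrative assumption about how such a step could be written, not the contents of `setup_ground_truth.py`; the function name `normalize_ground_truth` is hypothetical.

```python
import re

# "[Image: ...]" that is not already written as "![Image: ...]".
BARE_IMAGE_TAG = re.compile(r"(?<!!)\[Image:")
# A page marker whose closing tag is missing its slash.
BROKEN_PAGE_MARKER = re.compile(r"<page_number>([^<]*)<page_number>")


def normalize_ground_truth(text: str) -> str:
    """Apply both schema normalizations; running it twice changes nothing."""
    text = BARE_IMAGE_TAG.sub("![Image:", text)
    text = BROKEN_PAGE_MARKER.sub(r"<page_number>\1</page_number>", text)
    return text


raw = "<page_number>7<page_number>\n[Image: coat of arms]"
print(normalize_ground_truth(raw))
```

The negative lookbehind and the `[^<]*` group keep the function idempotent, so already-normalized files pass through unchanged.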

## Reproducibility
This notebook is designed to be paper-ready and reproducible:
- Key result tables are rendered inline as HTML for stable visual inspection in the notebook.
- If `SAVE_TABLES_LATEX = True`, tables are also exported as TeX table files to `LATEX_OUTPUT_DIR` with captions and labels suitable for manuscript inclusion.
- Run metadata (timestamp and output locations) is printed by the setup cell for provenance.


## Table of Contents

1. **Configuration**
   Load the evaluation configuration (YAML), resolve all dataset/report paths, and document the run-time settings used for this evaluation.

2. **Discover Available Data**
   Enumerate available input sources, model output JSONL files, and ground-truth JSONL files to determine which model–category combinations can be evaluated.

3. **Page-Level Evaluation**
   Define the evaluation data structures and compute page-aligned CER/WER metrics by comparing each model’s JSONL transcription output against ground truth.

4. **Results Summary**
   Produce paper-ready summary tables (inline HTML; optional LaTeX export) showing model performance by category and overall rankings across categories.

5. **Detailed Per-Page Results**
   Display (and optionally export) a per-page table for a selected `SHOW_CATEGORY` / `SHOW_MODEL` / `SHOW_SOURCE` to support qualitative inspection and debugging.

6. **Export Results**
   Save machine-readable artifacts (JSON + CSV + Markdown) into the reports directory for archiving and downstream analysis; show an export summary table.

7. **Visualization (Optional)**
   Generate a compact figure comparing models on CER and WER; save the plot to disk, optionally also as a PDF for LaTeX workflows.

In [None]:
# =============================================================================
# Standard library imports
# =============================================================================
import json
import os
import sys
from pathlib import Path
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# =============================================================================
# Data handling and display
# =============================================================================
import yaml
import pandas as pd
from IPython.display import display, HTML

# =============================================================================
# Path Configuration
# =============================================================================
EVAL_DIR = Path.cwd()
PROJECT_ROOT = EVAL_DIR.parent
sys.path.insert(0, str(PROJECT_ROOT))
sys.path.insert(0, str(EVAL_DIR))

# =============================================================================
# OUTPUT CONFIGURATION
# =============================================================================
# Set to True to save tables as LaTeX .tex files
SAVE_TABLES_LATEX = True

# Directory for LaTeX table output (relative to EVAL_DIR)
LATEX_OUTPUT_DIR = EVAL_DIR / "reports" / "latex_tables"

# Create output directory if saving LaTeX
if SAVE_TABLES_LATEX:
    LATEX_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# =============================================================================
# Import evaluation metrics
# =============================================================================
from metrics import (
    compute_metrics,
    aggregate_metrics,
    TranscriptionMetrics,
    format_metrics_table,
)

# Import JSONL page-level utilities
from jsonl_eval import (
    PageTranscription,
    DocumentTranscriptions,
    parse_transcription_jsonl,
    find_jsonl_file,
    load_page_transcriptions,
    load_ground_truth_pages,
    align_pages,
)

# =============================================================================
# Run Summary
# =============================================================================
print(f"Analysis run: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Evaluation directory: {EVAL_DIR}")
print(f"Project root: {PROJECT_ROOT}")
print(f"Save tables as LaTeX: {SAVE_TABLES_LATEX}")
if SAVE_TABLES_LATEX:
    print(f"LaTeX output directory: {LATEX_OUTPUT_DIR.resolve()}")

## 1. Configuration

This section loads the evaluation configuration (YAML) and resolves all key paths used throughout the notebook. The configuration defines:

- **Dataset locations**
  - `INPUT_PATH`: source documents (images or PDFs), grouped by category.
  - `OUTPUT_PATH`: model-generated JSONL transcriptions (the hypothesis).
  - `GROUND_TRUTH_PATH`: manually corrected JSONL transcriptions (the reference).
  - `REPORTS_PATH`: where exported evaluation artifacts are written.

- **Evaluation scope**
  - `CATEGORIES`: which dataset categories are included.
  - `MODELS`: which model identifiers are available or expected.

**Output produced in this section**
- **Table 1**: a compact configuration table documenting the paths and scope used for the current run.
- Optional: a TeX export of Table 1 saved to `LATEX_OUTPUT_DIR` if `SAVE_TABLES_LATEX = True`.
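
For orientation, the configuration shape that the loading cell expects can be sketched as a Python dictionary mirroring the parsed YAML. The key names follow the lookups in the code cell below; all values (paths, category names other than `address_books`, and the two model entries) are placeholders chosen for illustration, and `resolve_config` is a hypothetical helper.

```python
from pathlib import Path

# Hypothetical stand-in for the result of yaml.safe_load(eval_config.yaml).
example_config = {
    "dataset": {
        "input_path": "test_data/input",            # assumed value
        "output_path": "test_data/output",
        "ground_truth_path": "test_data/ground_truth",
        "categories": [
            {"name": "address_books"},
            {"name": "bibliography"},               # assumed name
            {"name": "military_records"},           # assumed name
        ],
    },
    "evaluation": {"reports_path": "reports"},      # assumed value
    "models": [
        {"name": "tesseract"},
        {"name": "gpt-5-mini"},
    ],
}


def resolve_config(config: dict, base_dir: Path) -> dict:
    """Anchor relative paths at base_dir and extract category and model names."""
    dataset = config["dataset"]
    return {
        "input_path": base_dir / dataset["input_path"],
        "output_path": base_dir / dataset["output_path"],
        "ground_truth_path": base_dir / dataset["ground_truth_path"],
        "reports_path": base_dir / config["evaluation"]["reports_path"],
        "categories": [cat["name"] for cat in dataset["categories"]],
        "models": {m["name"]: m for m in config["models"]},
    }


resolved = resolve_config(example_config, Path("evaluation"))
print(resolved["categories"])
print(resolved["ground_truth_path"])
```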

In [None]:
# =============================================================================
# Load evaluation configuration
# =============================================================================
CONFIG_PATH = EVAL_DIR / "eval_config.yaml"

with open(CONFIG_PATH, 'r', encoding='utf-8') as f:
    config = yaml.safe_load(f)

# Extract paths
INPUT_PATH = EVAL_DIR / config['dataset']['input_path']
OUTPUT_PATH = EVAL_DIR / config['dataset']['output_path']
GROUND_TRUTH_PATH = EVAL_DIR / config['dataset']['ground_truth_path']
REPORTS_PATH = EVAL_DIR / config['evaluation']['reports_path']

# Create reports directory
REPORTS_PATH.mkdir(parents=True, exist_ok=True)

# Extract categories and models
CATEGORIES = [cat['name'] for cat in config['dataset']['categories']]
MODELS = {m['name']: m for m in config['models']}

# =============================================================================
# Configuration Summary Table
# =============================================================================
config_data = {
    'Parameter': ['Input Path', 'Output Path', 'Ground Truth Path', 'Reports Path',
                  'Categories', 'Models'],
    'Value': [str(INPUT_PATH), str(OUTPUT_PATH), str(GROUND_TRUTH_PATH), str(REPORTS_PATH),
              ', '.join(CATEGORIES), str(len(MODELS))]
}
df_config = pd.DataFrame(config_data)

display(HTML('<h4>Table 1: Evaluation Configuration</h4>'))
display(HTML(df_config.to_html(index=False)))

if SAVE_TABLES_LATEX:
    latex_path = LATEX_OUTPUT_DIR / 'table_01_configuration.tex'
    df_config.to_latex(latex_path, index=False,
                       caption='Evaluation Configuration Parameters',
                       label='tab:eval_config')
    print(f'Saved: {latex_path}')

## 2. Discover Available Data

This section audits the evaluation data on disk to determine what can be evaluated in the current run.

Using the configured paths, it:
- Lists available **input sources** for each category.
- Detects which **models have produced JSONL output** for each category.
- Checks whether **ground truth JSONL files** exist for each category.

**Why this matters**
- The evaluation proceeds only for category–model combinations where both hypothesis output and ground truth exist.
- This step makes missing artifacts explicit before running more expensive computations.

**Output produced in this section**
- A console summary for each category showing:
  - Number of detected sources.
  - Which models have JSONL output.
  - Whether ground truth exists (and how many files).
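
The rule stated under "Why this matters" amounts to a simple filter over the discovery results. The sketch below uses a hypothetical helper, `evaluable_pairs`, and made-up discovery output to show which category–model combinations would proceed to evaluation.

```python
from typing import Dict, List, Tuple


def evaluable_pairs(
    models_by_category: Dict[str, List[str]],
    gt_file_counts: Dict[str, int],
) -> List[Tuple[str, str]]:
    """Keep (category, model) pairs that have both model output and ground truth."""
    pairs: List[Tuple[str, str]] = []
    for category, models in models_by_category.items():
        if gt_file_counts.get(category, 0) == 0:
            continue  # no reference text, so nothing can be scored
        pairs.extend((category, model) for model in sorted(models))
    return pairs


# Made-up discovery results for illustration.
models_found = {
    "address_books": ["tesseract", "gpt-5-mini"],
    "bibliography": ["tesseract"],
    "military_records": [],
}
gt_counts = {"address_books": 1, "bibliography": 0, "military_records": 3}
print(evaluable_pairs(models_found, gt_counts))
```

In this example only `address_books` yields pairs: `bibliography` lacks ground truth and `military_records` lacks model output.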

In [None]:
def discover_sources(category: str) -> List[str]:
    """
    Discover source files/folders in the input directory for a category.
    
    Args:
        category: Dataset category
        
    Returns:
        List of source names
    """
    input_dir = INPUT_PATH / category
    
    if not input_dir.exists():
        return []
    
    sources = []
    has_direct_images = False
    
    image_suffixes = {'.jpg', '.jpeg', '.png', '.tiff'}

    for item in input_dir.iterdir():
        suffix = item.suffix.lower()
        if item.is_file() and suffix == '.pdf':
            sources.append(item.name)
        elif item.is_file() and suffix in image_suffixes:
            has_direct_images = True
        elif item.is_dir():
            # Count a folder as a source if it contains any supported image.
            # Uses the same extensions as above, matched case-insensitively.
            if any(p.is_file() and p.suffix.lower() in image_suffixes
                   for p in item.iterdir()):
                sources.append(item.name)
    
    # If images are directly in the category folder (not in subfolders),
    # treat the whole folder as a single source named after the category.
    if has_direct_images and not sources:
        sources.append(category)
    
    return sorted(sources)


def discover_available_models(category: str) -> List[str]:
    """
    Discover which models have JSONL output for a given category.
    
    Args:
        category: Dataset category
        
    Returns:
        List of model names with available output
    """
    output_dir = OUTPUT_PATH / category
    
    if not output_dir.exists():
        return []
    
    models = []
    for d in output_dir.iterdir():
        if d.is_dir():
            # Check if model directory has any JSONL files
            jsonl_files = list(d.rglob('*.jsonl'))
            if jsonl_files:
                models.append(d.name)
    
    return sorted(models)


def check_ground_truth_available(category: str) -> Tuple[bool, int]:
    """
    Check if ground truth JSONL files exist for a category.
    
    Returns:
        Tuple of (has_ground_truth, count_of_files)
    """
    gt_dir = GROUND_TRUTH_PATH / category
    if not gt_dir.exists():
        return False, 0
    
    jsonl_files = list(gt_dir.glob('*.jsonl'))
    return len(jsonl_files) > 0, len(jsonl_files)


# Discover and display available data
print("=" * 60)
print("AVAILABLE DATA SUMMARY")
print("=" * 60)

data_summary = {}

for category in CATEGORIES:
    sources = discover_sources(category)
    available_models = discover_available_models(category)
    gt_available, gt_count = check_ground_truth_available(category)
    
    data_summary[category] = {
        'sources': sources,
        'models': available_models,
        'ground_truth_available': gt_available,
        'ground_truth_count': gt_count,
    }
    
    print(f"\n{category.upper()}")
    print("-" * 40)
    print(f"  Input sources: {len(sources)}")
    if sources:
        for s in sources[:5]:
            print(f"    - {s}")
        if len(sources) > 5:
            print(f"    ... and {len(sources) - 5} more")
    print(f"  Models with JSONL output: {len(available_models)}")
    for m in available_models:
        print(f"    - {m}")
    print(f"  Ground truth JSONL: {'Yes' if gt_available else 'No'} ({gt_count} files)")


## 3. Page-Level Evaluation

This section defines the evaluation logic and computes accuracy metrics at the **page** level.

### Unit of analysis
The unit of analysis is a page transcription parsed from a JSONL file. Pages are aligned between:
- **Reference**: ground truth JSONL
- **Hypothesis**: model output JSONL

A page is evaluated only if:
- The ground-truth page exists and contains transcribable text, and
- The model output page exists and contains transcribable text.

Pages flagged as `no_transcribable_text` or `transcription_not_possible` are treated as non-evaluable.
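
The alignment and evaluability rules above can be sketched as follows. This is a simplified illustration: `Page` is a hypothetical stand-in for `PageTranscription`, representing the two flags as boolean fields is an assumption, and aligning by `page_index` is an assumption about what `align_pages` does (it may instead match on image names).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Page:
    """Hypothetical stand-in for a parsed JSONL page record."""
    page_index: int
    transcription: str = ""
    no_transcribable_text: bool = False
    transcription_not_possible: bool = False

    def has_text(self) -> bool:
        flagged = self.no_transcribable_text or self.transcription_not_possible
        return not flagged and bool(self.transcription.strip())


def align_by_index(
    hyp_pages: List[Page], gt_pages: List[Page]
) -> List[Tuple[Optional[Page], Optional[Page]]]:
    """Outer join on page_index; a missing side is represented by None."""
    hyp = {p.page_index: p for p in hyp_pages}
    gt = {p.page_index: p for p in gt_pages}
    return [(hyp.get(i), gt.get(i)) for i in sorted(hyp.keys() | gt.keys())]


def is_evaluable(hyp: Optional[Page], gt: Optional[Page]) -> bool:
    """Both pages must exist and carry transcribable text."""
    return hyp is not None and gt is not None and hyp.has_text() and gt.has_text()


hyp_pages = [Page(0, "Meyer"), Page(1, no_transcribable_text=True), Page(3, "extra")]
gt_pages = [Page(0, "Meyer"), Page(1, "Müller"), Page(2, "Schmid")]
aligned = align_by_index(hyp_pages, gt_pages)
print([is_evaluable(h, g) for h, g in aligned])
```

Only page 0 is evaluable in this example: page 1 is flagged on the model side, page 2 has no model output, and page 3 has no ground truth.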

### What is computed
For each aligned, evaluable page, the notebook calls `compute_metrics(...)` to produce:
- Overall CER and WER
- Content-only CER and WER (formatting stripped)

Page-level results are then aggregated with `aggregate_metrics(...)` (micro-averaging by reference length).
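
Micro-averaging by reference length means that edits and reference lengths are summed across pages before dividing, so long pages weigh more than short ones. The sketch below contrasts this with a macro-average (the mean of per-page rates). It assumes each page is summarized as an `(edits, reference_length)` pair, which is a simplification of whatever `TranscriptionMetrics` actually stores.

```python
import math
from typing import List, Tuple

PageCounts = Tuple[int, int]  # (edit distance, reference length)


def micro_average(pages: List[PageCounts]) -> float:
    """Sum of edits divided by sum of reference lengths."""
    total_edits = sum(edits for edits, _ in pages)
    total_length = sum(length for _, length in pages)
    return total_edits / total_length


def macro_average(pages: List[PageCounts]) -> float:
    """Unweighted mean of the per-page error rates."""
    return sum(edits / length for edits, length in pages) / len(pages)


# One long, clean page and one short, noisy page (made-up numbers).
pages = [(2, 1000), (30, 100)]
print(f"micro: {micro_average(pages):.4f}")  # 32 / 1100
print(f"macro: {macro_average(pages):.4f}")  # (0.002 + 0.300) / 2
```

The micro-average stays close to the error rate of the long page, whereas the macro-average is pulled up by the short, noisy page. Micro-averaging therefore reflects the error rate a reader would encounter over the whole corpus.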

**Output produced in this section**
- In-memory structures:
  - `all_results`: nested results by category, model, source, and page.
  - `aggregated_metrics`: summary metrics by category and model.
- A console progress summary for each evaluated model/category showing:
  - CER and WER
  - Count of evaluated pages.

In [None]:
@dataclass
class PageEvaluationResult:
    """Container for per-page evaluation results."""
    page_index: int
    image_name: str
    metrics: Optional[TranscriptionMetrics]
    ground_truth_found: bool
    output_found: bool
    error: Optional[str] = None


@dataclass
class SourceEvaluationResult:
    """Container for source-level evaluation results."""
    category: str
    model_name: str
    source_name: str
    page_results: List[PageEvaluationResult]
    aggregated_metrics: Optional[TranscriptionMetrics]
    ground_truth_found: bool
    output_found: bool
    error: Optional[str] = None
    
    @property
    def total_pages(self) -> int:
        return len(self.page_results)
    
    @property
    def evaluated_pages(self) -> int:
        return sum(1 for p in self.page_results if p.metrics is not None)


def evaluate_source_pages(
    category: str,
    model_name: str,
    source_name: str,
) -> SourceEvaluationResult:
    """
    Evaluate a source by comparing pages from model output to ground truth.
    
    Args:
        category: Dataset category
        model_name: Model identifier
        source_name: Source file/folder name
        
    Returns:
        SourceEvaluationResult with per-page and aggregated metrics
    """
    # Load ground truth pages
    gt_doc = load_ground_truth_pages(GROUND_TRUTH_PATH, category, source_name)
    if gt_doc is None or not gt_doc.pages:
        return SourceEvaluationResult(
            category=category,
            model_name=model_name,
            source_name=source_name,
            page_results=[],
            aggregated_metrics=None,
            ground_truth_found=False,
            output_found=False,
            error="Ground truth JSONL not found",
        )
    
    # Load model output pages
    hyp_doc = load_page_transcriptions(OUTPUT_PATH, category, model_name, source_name)
    if hyp_doc is None or not hyp_doc.pages:
        return SourceEvaluationResult(
            category=category,
            model_name=model_name,
            source_name=source_name,
            page_results=[],
            aggregated_metrics=None,
            ground_truth_found=True,
            output_found=False,
            error="Model output JSONL not found",
        )
    
    # Align pages
    aligned = align_pages(hyp_doc, gt_doc)
    
    # Compute per-page metrics
    page_results: List[PageEvaluationResult] = []
    valid_metrics: List[TranscriptionMetrics] = []
    
    for hyp_page, gt_page in aligned:
        # Determine page info
        if gt_page:
            page_index = gt_page.page_index
            image_name = gt_page.image_name or (hyp_page.image_name if hyp_page else "")
        elif hyp_page:
            page_index = hyp_page.page_index
            image_name = hyp_page.image_name or ""
        else:
            continue
        
        # Check availability
        gt_found = gt_page is not None and gt_page.has_text()
        hyp_found = hyp_page is not None and hyp_page.has_text()
        
        if not gt_found:
            page_results.append(PageEvaluationResult(
                page_index=page_index,
                image_name=image_name,
                metrics=None,
                ground_truth_found=False,
                output_found=hyp_found,
                error="No ground truth for page",
            ))
            continue
        
        if not hyp_found:
            page_results.append(PageEvaluationResult(
                page_index=page_index,
                image_name=image_name,
                metrics=None,
                ground_truth_found=True,
                output_found=False,
                error="No model output for page",
            ))
            continue
        
        # Compute metrics
        try:
            metrics = compute_metrics(
                gt_page.transcription,
                hyp_page.transcription,
                normalize=True,
            )
            page_results.append(PageEvaluationResult(
                page_index=page_index,
                image_name=image_name,
                metrics=metrics,
                ground_truth_found=True,
                output_found=True,
            ))
            valid_metrics.append(metrics)
        except Exception as e:
            page_results.append(PageEvaluationResult(
                page_index=page_index,
                image_name=image_name,
                metrics=None,
                ground_truth_found=True,
                output_found=True,
                error=str(e),
            ))
    
    # Aggregate metrics
    aggregated = aggregate_metrics(valid_metrics) if valid_metrics else None
    
    return SourceEvaluationResult(
        category=category,
        model_name=model_name,
        source_name=source_name,
        page_results=page_results,
        aggregated_metrics=aggregated,
        ground_truth_found=True,
        output_found=True,
    )


def evaluate_model_category(
    category: str,
    model_name: str,
) -> Tuple[List[SourceEvaluationResult], Optional[TranscriptionMetrics]]:
    """
    Evaluate all sources in a category for a given model.
    
    Args:
        category: Dataset category
        model_name: Model identifier
        
    Returns:
        Tuple of (list of per-source results, aggregated metrics)
    """
    sources = discover_sources(category)
    results = []
    all_page_metrics = []
    
    for source in sources:
        result = evaluate_source_pages(category, model_name, source)
        results.append(result)
        
        # Collect valid page metrics for aggregation
        for page_result in result.page_results:
            if page_result.metrics is not None:
                all_page_metrics.append(page_result.metrics)
    
    aggregated = aggregate_metrics(all_page_metrics) if all_page_metrics else None
    
    return results, aggregated


print("Page-level evaluation functions defined.")

In [None]:
# Run full evaluation
print("=" * 60)
print("RUNNING PAGE-LEVEL EVALUATION")
print("=" * 60)

all_results: Dict[str, Dict[str, List[SourceEvaluationResult]]] = {}
aggregated_metrics: Dict[str, Dict[str, TranscriptionMetrics]] = {}

for category in CATEGORIES:
    all_results[category] = {}
    aggregated_metrics[category] = {}
    
    available_models = discover_available_models(category)
    gt_available, gt_count = check_ground_truth_available(category)
    
    if not available_models:
        print(f"\n{category}: No model outputs found (skipping)")
        continue
    
    if not gt_available:
        print(f"\n{category}: No ground truth JSONL files (skipping)")
        print(f"  Hint: Run 'python main/prepare_ground_truth.py --extract' to create editable files")
        continue
    
    print(f"\n{category.upper()}")
    print("-" * 40)
    
    for model_name in available_models:
        results, agg_metrics = evaluate_model_category(category, model_name)
        all_results[category][model_name] = results
        
        if agg_metrics:
            aggregated_metrics[category][model_name] = agg_metrics
            total_pages = sum(r.total_pages for r in results)
            eval_pages = sum(r.evaluated_pages for r in results)
            print(f"  {model_name}:")
            print(f"    CER: {agg_metrics.cer*100:.2f}%  |  WER: {agg_metrics.wer*100:.2f}%")
            print(f"    Pages evaluated: {eval_pages}/{total_pages}")
        else:
            errors = [r.error for r in results if r.error]
            print(f"  {model_name}: No valid evaluations")
            if errors:
                print(f"    Error: {errors[0]}")

print("\n" + "=" * 60)
print("EVALUATION COMPLETE")
print("=" * 60)

## 4. Results Summary

This section converts the evaluation results into publication-ready summary tables.

### What is summarized
- **By category**: performance for each model within each dataset category (`CATEGORIES`).
- **Overall**: performance for each model aggregated across all categories.

### Outputs produced
- **Table 2**: Transcription accuracy by model and category (inline HTML; optional LaTeX export).
- **Table 3**: Overall model ranking across categories (inline HTML; optional LaTeX export).

If `SAVE_TABLES_LATEX = True`, the corresponding TeX table files are written to `LATEX_OUTPUT_DIR` using `DataFrame.to_latex(...)` with captions and labels suitable for manuscript inclusion.

In [None]:
# =============================================================================
# 4. Results Summary - Restructure for display
# =============================================================================

# Restructure for display: model -> category -> metrics
model_category_metrics: Dict[str, Dict[str, TranscriptionMetrics]] = {}

for category, models in aggregated_metrics.items():
    for model_name, metrics in models.items():
        if model_name not in model_category_metrics:
            model_category_metrics[model_name] = {}
        model_category_metrics[model_name][category] = metrics

# =============================================================================
# Build Results DataFrame for HTML/LaTeX output
# =============================================================================
if model_category_metrics:
    results_rows = []
    for model_name in sorted(model_category_metrics.keys()):
        for category in CATEGORIES:
            if category in model_category_metrics[model_name]:
                m = model_category_metrics[model_name][category]
                results_rows.append({
                    'Model': model_name,
                    'Category': category,
                    'CER (%)': f'{m.cer*100:.2f}',
                    'WER (%)': f'{m.wer*100:.2f}',
                    'Content CER (%)': f'{m.content_cer*100:.2f}',
                    'Content WER (%)': f'{m.content_wer*100:.2f}',
                    'Ref. Characters': f'{m.ref_char_count:,}',
                    'Ref. Words': f'{m.ref_word_count:,}',
                })

    df_results = pd.DataFrame(results_rows)

    display(HTML('<h4>Table 2: Transcription Accuracy by Model and Category</h4>'))
    display(HTML(df_results.to_html(index=False)))

    if SAVE_TABLES_LATEX:
        latex_path = LATEX_OUTPUT_DIR / 'table_02_results_by_category.tex'
        df_results.to_latex(latex_path, index=False,
                            caption='Transcription Accuracy Metrics by Model and Document Category',
                            label='tab:results_by_category')
        print(f'Saved: {latex_path}')
else:
    print("\nNo evaluation results available.")
    print("\nTo generate results:")
    print("1. Run transcriptions for each model (outputs go to test_data/output/{category}/{model_name}/)")
    print("2. Create ground truth JSONL files:")
    print("   a. python main/prepare_ground_truth.py --extract --input test_data/output/{category}/{model}")
    print("   b. Edit the generated _editable.txt files")
    print("   c. python main/prepare_ground_truth.py --apply --input {edited_file}")
    print("3. Re-run this notebook")

In [None]:
# =============================================================================
# Overall Model Performance (All Categories Combined)
# =============================================================================

overall_model_metrics = {}

for model_name, cat_metrics in model_category_metrics.items():
    all_metrics = list(cat_metrics.values())
    if all_metrics:
        overall = aggregate_metrics(all_metrics)
        overall_model_metrics[model_name] = overall

# Build ranking DataFrame
if overall_model_metrics:
    ranked = sorted(overall_model_metrics.items(), key=lambda x: x[1].cer)

    ranking_rows = []
    for rank, (model_name, metrics) in enumerate(ranked, 1):
        ranking_rows.append({
            'Rank': rank,
            'Model': model_name,
            'CER (%)': f'{metrics.cer*100:.2f}',
            'WER (%)': f'{metrics.wer*100:.2f}',
            'Content CER (%)': f'{metrics.content_cer*100:.2f}',
            'Content WER (%)': f'{metrics.content_wer*100:.2f}',
            'Total Characters': f'{metrics.ref_char_count:,}',
            'Total Words': f'{metrics.ref_word_count:,}',
        })

    df_ranking = pd.DataFrame(ranking_rows)

    display(HTML('<h4>Table 3: Overall Model Rankings (All Categories Combined)</h4>'))
    display(HTML(df_ranking.to_html(index=False)))

    if SAVE_TABLES_LATEX:
        latex_path = LATEX_OUTPUT_DIR / 'table_03_overall_rankings.tex'
        df_ranking.to_latex(latex_path, index=False,
                            caption='Overall Model Rankings by Character Error Rate (All Categories Combined)',
                            label='tab:overall_rankings')
        print(f'Saved: {latex_path}')
else:
    print("No overall metrics available yet.")

## 5. Detailed Per-Page Results

This section supports qualitative inspection by drilling down from aggregate metrics to individual pages.

### How to use this section
The code cell below is parameterized by:
- `SHOW_CATEGORY`: which dataset category to inspect.
- `SHOW_MODEL`: which model to inspect (or `None` to use the first available).
- `SHOW_SOURCE`: which source document to inspect (or `None` to use the first available).

### Output produced
- **Table 4**: Per-page CER/WER, with a status message for missing or unevaluable pages, for the selected model/source.
- Optional: a TeX export of the same per-page table to `LATEX_OUTPUT_DIR`.

This table is intended primarily for:
- Diagnosing systematic failure modes (layout, scripts, tables, degraded scans).
- Identifying outlier pages that dominate aggregate error rates.
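
One way to spot pages that dominate a micro-averaged error rate is to rank pages by their share of the total edit count rather than by their own CER. The sketch below assumes per-page edit counts and reference lengths are available as plain integers; the helper name and the sample numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_pages_by_error_share(
    pages: Dict[str, Tuple[int, int]],
) -> List[Tuple[str, float, float]]:
    """Return (page, page CER, share of all edits), largest share first."""
    total_edits = sum(edits for edits, _ in pages.values())
    rows = []
    for name, (edits, ref_chars) in pages.items():
        page_cer = edits / ref_chars if ref_chars else float("nan")
        share = edits / total_edits if total_edits else 0.0
        rows.append((name, page_cer, share))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up per-page counts: (character edits, reference characters).
page_counts = {
    "page_001": (5, 1000),
    "page_002": (120, 800),
    "page_003": (3, 900),
}
for name, page_cer, share in rank_pages_by_error_share(page_counts):
    print(f"{name}: CER {page_cer:.2%}, share of edits {share:.2%}")
```

In this example a single page accounts for most of the edits, so inspecting it first is likely to explain most of the aggregate CER.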

In [None]:
# =============================================================================
# 5. Detailed Per-Page Results
# =============================================================================

# Configurable: select category/model/source to display
SHOW_CATEGORY = "address_books"  # Change as needed
SHOW_MODEL = None  # Set to specific model name or None for first available
SHOW_SOURCE = None  # Set to specific source name or None for first available

if SHOW_CATEGORY in all_results and all_results[SHOW_CATEGORY]:
    print(f"Detailed Page-Level Results: {SHOW_CATEGORY.upper()}")

    models_to_show = [SHOW_MODEL] if SHOW_MODEL else list(all_results[SHOW_CATEGORY].keys())[:1]

    for model_name in models_to_show:
        if model_name not in all_results[SHOW_CATEGORY]:
            continue

        results = all_results[SHOW_CATEGORY][model_name]
        sources_to_show = [r for r in results if r.source_name == SHOW_SOURCE] if SHOW_SOURCE else results[:1]

        for source_result in sources_to_show:
            # Build per-page DataFrame
            page_rows = []
            for page_result in source_result.page_results:
                page_num = page_result.page_index + 1
                img_name = page_result.image_name[:40] + '...' if len(page_result.image_name) > 40 else page_result.image_name

                if page_result.metrics:
                    page_rows.append({
                        'Page': page_num,
                        'Image': img_name,
                        'CER (%)': f'{page_result.metrics.cer*100:.2f}',
                        'WER (%)': f'{page_result.metrics.wer*100:.2f}',
                        'Status': 'OK',
                    })
                else:
                    page_rows.append({
                        'Page': page_num,
                        'Image': img_name,
                        'CER (%)': '--',
                        'WER (%)': '--',
                        'Status': page_result.error or 'Error',
                    })

            # Add totals row if aggregated metrics exist
            if source_result.aggregated_metrics:
                m = source_result.aggregated_metrics
                page_rows.append({
                    'Page': 'TOTAL',
                    'Image': '',
                    'CER (%)': f'{m.cer*100:.2f}',
                    'WER (%)': f'{m.wer*100:.2f}',
                    'Status': '',
                })

            df_pages = pd.DataFrame(page_rows)

            display(HTML(f'<h4>Table 4: Per-Page Results - {model_name} / {source_result.source_name}</h4>'))
            display(HTML(df_pages.to_html(index=False)))

            if SAVE_TABLES_LATEX:
                # Sanitize filename
                safe_model = model_name.replace('.', '_').replace(' ', '_')
                safe_source = source_result.source_name.replace('.', '_').replace(' ', '_')[:30]
                latex_path = LATEX_OUTPUT_DIR / f'table_04_pages_{safe_model}_{safe_source}.tex'
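                # Escape underscores so names such as 'address_books' compile in the LaTeX caption
                caption_model = model_name.replace('_', r'\_')
                caption_source = source_result.source_name.replace('_', r'\_')
                df_pages.to_latex(latex_path, index=False,
                                  caption=f'Per-Page Transcription Results: {caption_model}, {caption_source}',
                                  label=f'tab:pages_{safe_model}_{safe_source}')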
                print(f'Saved: {latex_path}')
else:
    print(f"Category '{SHOW_CATEGORY}' not found in results or has no evaluated models.")

## 6. Export Results

This section exports evaluation outputs to disk for reproducibility, archiving, and downstream analysis.

### Files written
- A timestamped JSON report containing structured metrics suitable for programmatic reuse.
- A timestamped CSV file containing a row per page-level result for spreadsheet workflows.
- A timestamped Markdown report containing a human-readable summary (including a Markdown-formatted table).

### Output produced
- **Table 5**: Export summary (paths and record counts for the JSON report, the per-page CSV, and the LaTeX tables), rendered inline as HTML.
- Optional: a LaTeX (`.tex`) export of the export summary table saved to `LATEX_OUTPUT_DIR`.

All export paths are anchored to `REPORTS_PATH` (for JSON/CSV/Markdown) and `LATEX_OUTPUT_DIR` (for LaTeX artifacts).
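
For downstream analysis, the per-page CSV can be read back and pooled per model. The following is a minimal sketch, not part of the pipeline: it assumes the column names written by the export cell (`Model`, `Char Distance`, `Ref Chars`, `Status`) and assumes that a pooled CER is the summed character distance divided by the summed reference characters. The sample rows and the helper name `micro_cer_by_model` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample that mirrors the columns of the per-page CSV export.
SAMPLE_CSV = """Model,Category,Source,Page,Image,CER (%),WER (%),Char Distance,Ref Chars,Status
model_a,address_books,src1,1,p1.png,2.0,5.0,4,200,OK
model_a,address_books,src1,2,p2.png,10.0,20.0,10,100,OK
model_a,address_books,src1,3,p3.png,,,,,Error
model_b,address_books,src1,1,p1.png,1.0,3.0,2,200,OK
"""


def micro_cer_by_model(csv_file):
    """Return {model: pooled CER in percent}, skipping pages without metrics."""
    distance = defaultdict(float)
    ref_chars = defaultdict(float)
    for row in csv.DictReader(csv_file):
        if row["Status"] != "OK":
            continue  # failed pages carry empty metric fields
        distance[row["Model"]] += float(row["Char Distance"])
        ref_chars[row["Model"]] += float(row["Ref Chars"])
    # Assumption: pooled CER = total edit distance / total reference characters.
    return {
        model: 100 * distance[model] / ref_chars[model]
        for model in distance
        if ref_chars[model] > 0
    }


result = micro_cer_by_model(io.StringIO(SAMPLE_CSV))
```

With a real export, `io.StringIO(SAMPLE_CSV)` would be replaced by an open file handle on the CSV path printed by the export cell. Pooling weights long pages more heavily than averaging the per-page `CER (%)` column would, so the two figures can differ.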

In [None]:
# =============================================================================
# 6. Export Results
# =============================================================================

# Generate timestamp for reports
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# =============================================================================
# Export aggregated metrics to JSON
# =============================================================================
json_report = {
    "timestamp": timestamp,
    "evaluation_method": "page_level_jsonl",
    "categories": CATEGORIES,
    "models": list(MODELS.keys()),
    "results": {},
}

for model_name, cat_metrics in model_category_metrics.items():
    json_report["results"][model_name] = {
        "per_category": {cat: m.to_dict() for cat, m in cat_metrics.items()},
    }
    if model_name in overall_model_metrics:
        json_report["results"][model_name]["overall"] = overall_model_metrics[model_name].to_dict()

json_path = REPORTS_PATH / f"eval_results_{timestamp}.json"
with open(json_path, 'w', encoding='utf-8') as f:
    json.dump(json_report, f, indent=2)
print(f"JSON report saved: {json_path}")

# =============================================================================
# Export to CSV (per-page detail)
# =============================================================================
csv_rows = []
for category, models in all_results.items():
    for model_name, sources in models.items():
        for source_result in sources:
            for page_result in source_result.page_results:
                if page_result.metrics:
                    csv_rows.append({
                        'Model': model_name,
                        'Category': category,
                        'Source': source_result.source_name,
                        'Page': page_result.page_index + 1,
                        'Image': page_result.image_name,
                        'CER (%)': round(page_result.metrics.cer * 100, 2),
                        'WER (%)': round(page_result.metrics.wer * 100, 2),
                        'Char Distance': page_result.metrics.char_distance,
                        'Ref Chars': page_result.metrics.ref_char_count,
                        'Status': 'OK',
                    })
                else:
                    csv_rows.append({
                        'Model': model_name,
                        'Category': category,
                        'Source': source_result.source_name,
                        'Page': page_result.page_index + 1,
                        'Image': page_result.image_name,
                        'CER (%)': '',
                        'WER (%)': '',
                        'Char Distance': '',
                        'Ref Chars': '',
                        'Status': page_result.error or 'Error',
                    })

df_csv = pd.DataFrame(csv_rows)
csv_path = REPORTS_PATH / f"eval_results_{timestamp}.csv"
df_csv.to_csv(csv_path, index=False)
print(f"CSV report saved: {csv_path}")

# Display summary of exported data
display(HTML('<h4>Table 5: Export Summary</h4>'))
export_summary = pd.DataFrame({
    'Export Type': ['JSON Report', 'CSV (Per-Page)', 'LaTeX Tables'],
    'File Path': [str(json_path), str(csv_path),
                  str(LATEX_OUTPUT_DIR) if SAVE_TABLES_LATEX else 'Disabled'],
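    # The LaTeX count covers every .tex file currently in LATEX_OUTPUT_DIR,
    # including files left over from earlier runs.
    'Records': [len(json_report['results']), len(csv_rows),
                len(list(LATEX_OUTPUT_DIR.glob('*.tex'))) if SAVE_TABLES_LATEX else 0],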
})
display(HTML(export_summary.to_html(index=False)))

if SAVE_TABLES_LATEX:
    latex_path = LATEX_OUTPUT_DIR / 'table_05_export_summary.tex'
    export_summary.to_latex(latex_path, index=False,
                            caption='Summary of Exported Evaluation Results',
                            label='tab:export_summary')
    print(f'Saved: {latex_path}')

# =============================================================================
# Export Markdown summary
# =============================================================================
md_path = REPORTS_PATH / f"eval_results_{timestamp}.md"
with open(md_path, 'w', encoding='utf-8') as f:
    f.write("# ChronoTranscriber Evaluation Results\n\n")
    f.write(f"**Generated:** {timestamp}\n\n")
    f.write("**Evaluation Method:** Page-level JSONL comparison\n\n")
    f.write("## Models Evaluated\n\n")
    for name, info in MODELS.items():
        f.write(f"- **{name}**: {info.get('description', '')}\n")
    f.write("\n## Results by Category\n\n")
    f.write(format_metrics_table(model_category_metrics, CATEGORIES))
    f.write("\n\n## Overall Rankings\n\n")
    if overall_model_metrics:
        ranked = sorted(overall_model_metrics.items(), key=lambda x: x[1].cer)
        f.write("| Rank | Model | CER (%) | WER (%) |\n")
        f.write("|------|-------|---------|--------|\n")
        for rank, (model_name, metrics) in enumerate(ranked, 1):
            f.write(f"| {rank} | {model_name} | {metrics.cer*100:.2f} | {metrics.wer*100:.2f} |\n")

print(f"Markdown report saved: {md_path}")

## 7. Visualization (Optional)

This section produces a compact figure comparing model error rates.

### What is plotted
If `matplotlib` is available and `overall_model_metrics` has data, the code:
- Sorts models by overall CER for consistent ordering.
- Plots horizontal bar charts for:
  - CER (percent)
  - WER (percent)

### Files written
- A timestamped PNG figure is saved to `REPORTS_PATH` for quick viewing and sharing.
- If `SAVE_TABLES_LATEX` is `True`, a PDF version is also saved to `LATEX_OUTPUT_DIR` for LaTeX workflows. The PDF uses a fixed filename (`figure_01_error_rates.pdf`), so each run overwrites the previous one.

### Output produced
- The figure is displayed inline in the notebook.
- **Table 6**: A small performance summary table (sorted by CER) is displayed inline and optionally exported as TeX.

In [None]:
# =============================================================================
# 7. Visualization (Optional)
# =============================================================================

try:
    import matplotlib.pyplot as plt

    PLOT_AVAILABLE = True
except ImportError:
    PLOT_AVAILABLE = False
    print("matplotlib not available - skipping visualizations")
    print("Install with: pip install matplotlib")

if PLOT_AVAILABLE and overall_model_metrics:
    # Prepare data - sort by CER for consistent ordering
    ranked = sorted(overall_model_metrics.items(), key=lambda x: x[1].cer)
    models = [m[0] for m in ranked]
    cer_values = [m[1].cer * 100 for m in ranked]
    wer_values = [m[1].wer * 100 for m in ranked]

    # Create figure
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))

    # CER bar chart
    ax1 = axes[0]
    bars1 = ax1.barh(models, cer_values, color='steelblue')
    ax1.set_xlabel('Character Error Rate (%)')
    ax1.set_title('CER by Model (Lower is Better)')
    ax1.bar_label(bars1, fmt='%.2f%%', padding=3)
    ax1.set_xlim(0, max(cer_values) * 1.25 if cer_values else 10)

    # WER bar chart
    ax2 = axes[1]
    bars2 = ax2.barh(models, wer_values, color='darkorange')
    ax2.set_xlabel('Word Error Rate (%)')
    ax2.set_title('WER by Model (Lower is Better)')
    ax2.bar_label(bars2, fmt='%.2f%%', padding=3)
    ax2.set_xlim(0, max(wer_values) * 1.25 if wer_values else 10)

    plt.tight_layout()

    # Save figure
    fig_path = REPORTS_PATH / f"eval_chart_{timestamp}.png"
    plt.savefig(fig_path, dpi=150, bbox_inches='tight')
    print(f"Chart saved: {fig_path}")

    # Also save as PDF for LaTeX inclusion
    if SAVE_TABLES_LATEX:
        pdf_path = LATEX_OUTPUT_DIR / "figure_01_error_rates.pdf"
        plt.savefig(pdf_path, dpi=300, bbox_inches='tight')
        print(f"PDF chart saved: {pdf_path}")

    plt.show()

    # =============================================================================
    # Create summary statistics table for visualization
    # =============================================================================
    viz_stats = pd.DataFrame({
        'Model': models,
        'CER (%)': [f'{v:.2f}' for v in cer_values],
        'WER (%)': [f'{v:.2f}' for v in wer_values],
        'CER Rank': range(1, len(models) + 1),
    })

    display(HTML('<h4>Table 6: Model Performance Summary (Sorted by CER)</h4>'))
    display(HTML(viz_stats.to_html(index=False)))

    if SAVE_TABLES_LATEX:
        latex_path = LATEX_OUTPUT_DIR / 'table_06_performance_summary.tex'
        viz_stats.to_latex(latex_path, index=False,
                           caption='Model Performance Summary Sorted by Character Error Rate',
                           label='tab:performance_summary')
        print(f'Saved: {latex_path}')

---

## Next Steps

### Ground Truth Workflow

1. **Extract transcriptions for editing**
   ```bash
   python main/prepare_ground_truth.py --extract --input eval/test_data/output/{category}/{model}
   ```

2. **Edit the generated `_editable.txt` files**
   - Each page is marked with `=== page NNN ===`
   - Correct transcription errors directly in the text
   - Use `[NO TRANSCRIBABLE TEXT]` for blank pages
   - Use `[TRANSCRIPTION NOT POSSIBLE]` for illegible pages

3. **Apply corrections to create ground truth**
   ```bash
   python main/prepare_ground_truth.py --apply --input eval/test_data/output/{category}/{model}
   ```

4. **Check ground truth status**
   ```bash
   python main/prepare_ground_truth.py --status
   ```
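
The editing conventions in step 2 can be checked before applying corrections. The sketch below is an illustration only and does not reproduce the logic of `prepare_ground_truth.py`: it assumes page markers sit on their own line in the form `=== page NNN ===` with a numeric `NNN`, and the helper name `split_pages` and the sample text are hypothetical.

```python
import re

# Assumption: each marker occupies a full line and NNN is a run of digits.
PAGE_MARKER = re.compile(r"^=== page (\d+) ===[ \t]*$", re.MULTILINE)
PLACEHOLDERS = {"[NO TRANSCRIBABLE TEXT]", "[TRANSCRIPTION NOT POSSIBLE]"}


def split_pages(text):
    """Split editable text into {page_number: page_text} using the page markers."""
    pages = {}
    matches = list(PAGE_MARKER.finditer(text))
    for i, match in enumerate(matches):
        # A page body runs from the end of its marker to the next marker (or EOF).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        pages[int(match.group(1))] = text[match.end():end].strip()
    return pages


sample = """=== page 001 ===
Johann Meier, Baker
=== page 002 ===
[NO TRANSCRIBABLE TEXT]
=== page 003 ===
[TRANSCRIPTION NOT POSSIBLE]
"""

pages = split_pages(sample)
# Pages whose entire body is one of the two placeholder strings.
placeholder_pages = sorted(n for n, body in pages.items() if body in PLACEHOLDERS)
```

A check like this makes it easy to spot a page whose marker was accidentally deleted during editing, since its number would be missing from `pages`.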

### Expected Directory Structure
```
eval/
├── test_data/
│   ├── input/                    # Source documents
│   │   ├── address_books/
│   │   ├── bibliography/
│   │   └── military_records/
│   ├── output/                   # Model outputs (JSONL per source)
│   │   └── {category}/
│   │       └── {model_name}/
│   │           └── {source}/
│   │               └── {source}.jsonl
│   └── ground_truth/             # Corrected transcriptions (JSONL)
│       └── {category}/
│           └── {source}.jsonl
└── reports/                      # Generated evaluation reports
```
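
Given this layout, a model output file can be paired with its ground truth by path alone. The following is a minimal sketch under that assumption; the helper name `ground_truth_path`, the default root `eval`, and the example source name `directory_1900` are hypothetical and only illustrate the mapping.

```python
from pathlib import Path


def ground_truth_path(output_jsonl, eval_root="eval"):
    """Map a model-output JSONL path to the matching ground-truth JSONL path."""
    output_root = Path(eval_root) / "test_data" / "output"
    # Expected relative layout: {category}/{model_name}/{source}/{source}.jsonl
    # relative_to raises ValueError if the file is not under the output root.
    parts = Path(output_jsonl).relative_to(output_root).parts
    if len(parts) != 4:
        raise ValueError(f"Unexpected output layout: {output_jsonl}")
    category, _model_name, source, _filename = parts
    return Path(eval_root) / "test_data" / "ground_truth" / category / f"{source}.jsonl"


example = ground_truth_path(
    "eval/test_data/output/address_books/gpt-5-mini/directory_1900/directory_1900.jsonl"
)
```

Because the ground truth is stored per category and source rather than per model, every model's output for the same source maps to the same reference file.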