# ChronoMiner Extraction Evaluation

This notebook evaluates the quality of structured data extraction produced by different LLMs on the ChronoMiner evaluation set. The unit of evaluation is a document chunk: each model output chunk is compared against its corresponding ground-truth chunk, and metrics are aggregated across chunks, sources, categories, and models.

## Evaluation method (chunk-level comparison)

The evaluation operates on the temporary JSONL outputs produced by the extraction pipeline. For each (category, source, model) combination, the notebook:

- Loads ground-truth chunk extractions from test_data/ground_truth/{category}/.
- Loads model output chunk extractions from test_data/output/{category}/{model_name}/.
- Aligns chunks and computes metrics per aligned chunk using compute_extraction_metrics().
- Aggregates metrics across chunks using aggregate_metrics().

This design:
- Avoids penalizing irrelevant formatting differences in serialized JSON.
- Preserves per-chunk error attribution for debugging and model comparison.
- Keeps evaluation logic consistent across heterogeneous document types.
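
As a small illustration of the first point, the sketch below (not the notebook's own loader) shows that two JSONL records differing only in whitespace and key order compare equal once parsed. The record layout and values are invented for illustration.

```python
import json

# Two serializations of the same hypothetical chunk record: different key
# order and whitespace, identical content.
line_a = '{"custom_id": "doc-chunk-1", "entries": [{"title": "Opus", "year": 1850}]}'
line_b = '{ "entries":[ {"year":1850,"title":"Opus"} ],  "custom_id":"doc-chunk-1" }'

# A raw string comparison reports a mismatch...
strings_equal = line_a == line_b

# ...whereas comparing the parsed objects ignores serialization details.
parsed_equal = json.loads(line_a) == json.loads(line_b)

print(strings_equal, parsed_equal)
```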

## Reproducibility and paper-ready outputs

This notebook is intended to support academic reporting and reproducible evaluation:

- Results tables are rendered as HTML in the notebook for inspection.
- If SAVE_TABLES_LATEX is set to True, the same tables are exported as LaTeX (.tex) files into OUTPUT_DIR for direct inclusion in manuscripts.
- Machine-readable exports (JSON and CSV) are written to reports_path as configured in eval_config.yaml.

## Required inputs (expected directory layout)

Before running the notebook, ensure the evaluation data are present:

1. Input sources (used to discover which documents exist):
   test_data/input/{category}/*.txt

2. Ground truth chunk extractions (JSONL preferred):
   test_data/ground_truth/{category}/*.jsonl

3. Model output chunk extractions (JSONL):
   test_data/output/{category}/{model_name}/**/*.jsonl
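
A quick way to confirm this layout before running anything is to count files at each location. The helper below is a hypothetical sketch (`count_layout_files` is not part of the notebook); the category and model names are placeholders, and the demo runs on a throwaway directory rather than on real data.

```python
import tempfile
from pathlib import Path
from typing import Dict


def count_layout_files(root: Path, category: str, model_name: str) -> Dict[str, int]:
    """Count the files found at each expected location (0 = missing or empty)."""
    return {
        "input_txt": len(list((root / "input" / category).glob("*.txt"))),
        "ground_truth_jsonl": len(list((root / "ground_truth" / category).glob("*.jsonl"))),
        # rglob mirrors the recursive **/*.jsonl pattern used for model outputs
        "output_jsonl": len(list((root / "output" / category / model_name).rglob("*.jsonl"))),
    }


# Demo on a throwaway tree that deliberately lacks ground truth.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "input" / "bibliography").mkdir(parents=True)
    (root / "input" / "bibliography" / "source_a.txt").write_text("...", encoding="utf-8")
    run_dir = root / "output" / "bibliography" / "gpt_5.1_medium" / "run_1"
    run_dir.mkdir(parents=True)
    (run_dir / "source_a.jsonl").write_text("{}\n", encoding="utf-8")
    counts = count_layout_files(root, "bibliography", "gpt_5.1_medium")

print(counts)
```

A zero in any slot points at the directory that still needs to be populated; here the ground-truth count is zero by construction.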

## Table of Contents

1. **Load Configuration**
   Load eval_config.yaml, resolve dataset/report paths, and set evaluation parameters (e.g., string similarity threshold).

2. **Discover Available Data**
   Inspect test_data/ to list available sources, available model outputs, and ground-truth coverage by category.

3. **Chunk-Level Evaluation**
   Define the evaluation data structures and the chunk-alignment logic used to compare predicted vs. ground-truth extractions.

4. **Run Evaluation**
   Execute the evaluation across all configured models and categories, aggregate chunk-level metrics, and store results for reporting.

5. **Results Summary Table**
   Produce a publication-ready summary table of entry-level and micro-averaged metrics by model and category; display as HTML and optionally export as LaTeX.

6. **Field-Level Breakdown**
   Report field-wise precision/recall/F1 and error counts (TP/FP/FN) for each model/category; display as HTML and optionally export as LaTeX.

7. **Per-Source Details (Optional)**
   Provide an optional snippet to inspect per-source performance for a selected model and category.

8. **Save Reports**
   Export machine-readable results as JSON and CSV (and LaTeX tables if enabled) into the configured reports directories.

9. **Visualization (Optional)**
   Create a grouped bar chart of micro F1 by model and category; save as PNG and optionally export a PDF for LaTeX inclusion.

In [None]:
# ================================================================
# Imports and Global Configuration
# ================================================================

import json
import csv
import sys
from pathlib import Path
from datetime import datetime
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple
import warnings
warnings.filterwarnings('ignore')

import yaml
import pandas as pd
from IPython.display import display, HTML

# Add parent directory for imports
EVAL_DIR = Path.cwd()
PROJECT_ROOT = EVAL_DIR.parent
sys.path.insert(0, str(PROJECT_ROOT))
sys.path.insert(0, str(EVAL_DIR))

# Import extraction metrics
from metrics import (
    ExtractionMetrics,
    aggregate_metrics,
    compute_extraction_metrics,
)

# Import JSONL chunk-level utilities
from jsonl_eval import (
    ChunkExtraction,
    DocumentExtractions,
    parse_extraction_jsonl,
    find_jsonl_file,
    load_chunk_extractions,
    load_ground_truth_chunks,
    align_chunks,
)

# ================================================================
# OUTPUT CONFIGURATION
# ================================================================
# Set to True to save all result tables as LaTeX (.tex) files
SAVE_TABLES_LATEX = True

# Directory where LaTeX tables will be saved (relative to EVAL_DIR)
OUTPUT_DIR = EVAL_DIR / "reports" / "tables"

# ================================================================
# Display Configuration
# ================================================================
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', 120)

# Create output directory if saving LaTeX
if SAVE_TABLES_LATEX:
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Print configuration summary
print(f"Analysis run: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Evaluation directory: {EVAL_DIR}")
print(f"Project root: {PROJECT_ROOT}")
print(f"Save tables as LaTeX: {SAVE_TABLES_LATEX}")
if SAVE_TABLES_LATEX:
    print(f"LaTeX output directory: {OUTPUT_DIR.resolve()}")
print("Imports successful!")

## Load Configuration

This section loads the evaluation configuration from eval_config.yaml and materializes all key runtime parameters used throughout the notebook.

Specifically, the configuration determines:

- Dataset paths
  - input_path: where input .txt sources are located (by category)
  - output_path: where model-produced JSONL extractions are located
  - ground_truth_path: where ground-truth JSONL extractions are located
  - reports_path: where exported reports (JSON/CSV/figures) should be written (read from the evaluation section of the config)

- Evaluation parameters
  - threshold: string similarity threshold used inside compute_extraction_metrics() (config key string_similarity_threshold; defaults to 0.85)
  - case_sensitive and normalize_whitespace: normalization controls applied to string comparison (defaults: False and True)
  - schema_fields: per-schema field selection used to restrict evaluation to a subset of fields
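
For orientation, the sketch below shows the shape of the parsed configuration as the loading code reads it. Only the key names come from this notebook; every value (paths, the category and model names, the schema name `BibliographySchema`, and the field list) is an assumption made up for illustration.

```python
# Illustrative stand-in for what yaml.safe_load() would return for
# eval_config.yaml. Key names follow the loading code; values are assumed.
config = {
    "dataset": {
        "input_path": "test_data/input",
        "output_path": "test_data/output",
        "ground_truth_path": "test_data/ground_truth",
        "categories": [{"name": "bibliography", "schema": "BibliographySchema"}],
    },
    "models": [{"name": "gpt_5.1_medium"}],
    "evaluation": {
        "reports_path": "reports",
        "string_similarity_threshold": 0.9,
        "schema_fields": {"BibliographySchema": ["title", "year"]},
    },
}

evaluation = config["evaluation"]
threshold = evaluation.get("string_similarity_threshold", 0.85)
case_sensitive = evaluation.get("case_sensitive", False)     # key absent: default applies
normalize_ws = evaluation.get("normalize_whitespace", True)  # key absent: default applies

# Field selection is looked up by the schema name attached to each category.
schema_name = config["dataset"]["categories"][0]["schema"]
fields = evaluation.get("schema_fields", {}).get(schema_name, [])

print(threshold, case_sensitive, normalize_ws, fields)
```

Omitted optional keys fall back to the defaults listed above, so a minimal config only needs the dataset paths, categories, models, and reports_path.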

The code cell below prints a short configuration summary so each run leaves an auditable record of paths and parameter values.

In [None]:
# ================================================================
# Load Evaluation Configuration
# ================================================================

config_path = EVAL_DIR / "eval_config.yaml"

with open(config_path, "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Extract paths
input_path = EVAL_DIR / config["dataset"]["input_path"]
output_path = EVAL_DIR / config["dataset"]["output_path"]
ground_truth_path = EVAL_DIR / config["dataset"]["ground_truth_path"]
reports_path = EVAL_DIR / config["evaluation"]["reports_path"]

# Evaluation settings
threshold = config["evaluation"].get("string_similarity_threshold", 0.85)
case_sensitive = config["evaluation"].get("case_sensitive", False)
normalize_ws = config["evaluation"].get("normalize_whitespace", True)
schema_fields = config["evaluation"].get("schema_fields", {})

categories = config["dataset"]["categories"]
models = config["models"]

# Create reports directory
reports_path.mkdir(parents=True, exist_ok=True)

# Display configuration summary
print(f"Categories: {[c['name'] for c in categories]}")
print(f"Models: {[m['name'] for m in models]}")
print(f"Similarity threshold: {threshold}")
print(f"\nPaths:")
print(f"  Input: {input_path}")
print(f"  Output: {output_path}")
print(f"  Ground Truth: {ground_truth_path}")
print(f"  Reports: {reports_path}")

## Discover Available Data

This section performs a lightweight inventory of the evaluation dataset and outputs to help you validate that all expected inputs are present before running the full evaluation.

For each configured category, it reports:

- Input sources: discovered from input_path / {category} by scanning *.txt files, excluding helper files whose names end in _line_ranges or _context.
- Available model outputs: discovered under output_path / {category} by checking which model subdirectories contain JSONL extraction outputs.
- Ground truth availability: checked under ground_truth_path / {category} (JSONL preferred, with optional legacy JSON fallback).

The printed summary is primarily a diagnostic aid:
- If a category has no ground truth, it cannot be evaluated.
- If a model has no outputs for a category, it will be skipped for that category.

In [None]:
def discover_sources(category_name: str) -> List[str]:
    """
    Discover source files in the input directory for a category.
    
    Returns:
        List of source names (without extension)
    """
    input_dir = input_path / category_name
    
    if not input_dir.exists():
        return []
    
    sources = []
    for input_file in sorted(input_dir.glob("*.txt")):
        stem = input_file.stem
        if stem.endswith("_line_ranges") or stem.endswith("_context"):
            continue
        sources.append(stem)
    
    return sources


def discover_available_models(category_name: str) -> List[str]:
    """
    Discover which models have JSONL output for a category.
    
    Returns:
        List of model names with available output
    """
    cat_output = output_path / category_name
    
    if not cat_output.exists():
        return []
    
    available = []
    for d in cat_output.iterdir():
        if d.is_dir():
            # Check if model directory has any JSONL files
            jsonl_files = list(d.rglob("*.jsonl"))
            # Filter out batch tracking files
            jsonl_files = [f for f in jsonl_files if "_batch_" not in f.name]
            if jsonl_files:
                available.append(d.name)
    
    return sorted(available)


def check_ground_truth_available(category_name: str) -> Tuple[bool, int, str]:
    """
    Check if ground truth files exist for a category.
    
    Returns:
        Tuple of (has_ground_truth, count_of_files, format)
    """
    gt_dir = ground_truth_path / category_name
    if not gt_dir.exists():
        return False, 0, "none"
    
    # Check for JSONL format (preferred)
    jsonl_files = list(gt_dir.glob("*.jsonl"))
    if jsonl_files:
        return True, len(jsonl_files), "jsonl"
    
    # Fall back to JSON format (legacy)
    json_files = list(gt_dir.glob("*.json"))
    if json_files:
        return True, len(json_files), "json"
    
    return False, 0, "none"


# Discover and display available data
print("=" * 60)
print("AVAILABLE DATA SUMMARY")
print("=" * 60)

data_summary = {}

for cat in categories:
    cat_name = cat["name"]
    sources = discover_sources(cat_name)
    available_models = discover_available_models(cat_name)
    gt_available, gt_count, gt_format = check_ground_truth_available(cat_name)
    
    data_summary[cat_name] = {
        "sources": sources,
        "models": available_models,
        "ground_truth_available": gt_available,
        "ground_truth_count": gt_count,
        "ground_truth_format": gt_format,
    }
    
    print(f"\n{cat_name.upper()}")
    print("-" * 40)
    print(f"  Input sources: {len(sources)}")
    if sources:
        for s in sources[:5]:
            print(f"    - {s}")
        if len(sources) > 5:
            print(f"    ... and {len(sources) - 5} more")
    print(f"  Models with JSONL output: {len(available_models)}")
    for m in available_models:
        print(f"    - {m}")
    print(f"  Ground truth: {'Yes' if gt_available else 'No'} ({gt_count} files, format: {gt_format})")

## Chunk-Level Evaluation

This section defines the chunk-level evaluation logic used throughout the notebook.

### Unit of analysis

- A chunk corresponds to one JSONL extraction record produced by the extraction pipeline for a contiguous portion of a source document.
- Each chunk can contain zero or more extracted entries.
- Metrics are computed per chunk and then aggregated.

### Core data structures and logic

The code cell below defines:

- ChunkEvaluationResult: stores per-chunk evaluation status and metrics (including failure modes such as missing ground truth or missing model output).
- SourceEvaluationResult: stores all chunk results for one (category, model, source) combination plus an aggregated metric summary.
- evaluate_source_chunks(...): loads ground truth and model outputs, aligns chunks, computes chunk metrics via compute_extraction_metrics(), and aggregates them via aggregate_metrics().
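
The alignment step comes from jsonl_eval.align_chunks(), whose matching rule is not shown in this notebook. The evaluation loop only relies on it yielding (model chunk, ground-truth chunk) pairs in which either side may be missing. The sketch below is one way such an outer join could look, assuming chunks are matched on chunk_index; the `Chunk` class and `align_by_index` are hypothetical stand-ins, not the real implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk record (not the ChunkExtraction class)."""
    chunk_index: int
    entries: List[dict] = field(default_factory=list)


def align_by_index(
    hyp_chunks: List[Chunk], gt_chunks: List[Chunk]
) -> List[Tuple[Optional[Chunk], Optional[Chunk]]]:
    """Outer-join two chunk lists on chunk_index, yielding (hyp, gt) pairs."""
    hyp_by_idx: Dict[int, Chunk] = {c.chunk_index: c for c in hyp_chunks}
    gt_by_idx: Dict[int, Chunk] = {c.chunk_index: c for c in gt_chunks}
    all_indices = sorted(set(hyp_by_idx) | set(gt_by_idx))
    # A side with no chunk at this index is represented by None.
    return [(hyp_by_idx.get(i), gt_by_idx.get(i)) for i in all_indices]


gt = [Chunk(1, [{"title": "A"}]), Chunk(2, [{"title": "B"}])]
hyp = [Chunk(2, [{"title": "B"}]), Chunk(3, [{"title": "C"}])]
pairs = align_by_index(hyp, gt)

for hyp_chunk, gt_chunk in pairs:
    if hyp_chunk is not None and gt_chunk is not None:
        status = "both"
    elif gt_chunk is not None:
        status = "ground truth only"
    else:
        status = "model output only"
    print((gt_chunk or hyp_chunk).chunk_index, status)
```

Keeping unmatched chunks in the result, rather than dropping them, is what allows missing ground truth and missing model output to be recorded as distinct failure modes.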

### Methodological notes

- Field selection: the evaluated fields are drawn from schema_fields[schema_name] when available. This supports schema-specific evaluation (e.g., bibliography vs. address books).
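- Aggregation: aggregated results summarize performance across all valid chunks for a source (chunks where both ground truth and model output contain entries and metric computation succeeded), and later across all sources within a category/model. Chunks that are empty on both sides are skipped; chunks with entries on only one side are recorded with an error message but contribute no metrics.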
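
compute_extraction_metrics() also receives threshold, case_sensitive, and normalize_ws. How it applies them internally is not shown in this notebook. The sketch below is one plausible reading, using difflib.SequenceMatcher from the standard library as a stand-in similarity measure; `normalize` and `is_match` are hypothetical helpers.

```python
from difflib import SequenceMatcher


def normalize(text: str, case_sensitive: bool = False, normalize_ws: bool = True) -> str:
    """Apply the two normalization switches before comparison."""
    if normalize_ws:
        text = " ".join(text.split())  # collapse whitespace runs and trim the ends
    if not case_sensitive:
        text = text.lower()
    return text


def is_match(gt: str, hyp: str, threshold: float = 0.85, **norm_kwargs) -> bool:
    """Treat two strings as matching when their similarity ratio meets the threshold."""
    a = normalize(gt, **norm_kwargs)
    b = normalize(hyp, **norm_kwargs)
    return SequenceMatcher(None, a, b).ratio() >= threshold


# Differences in case and spacing vanish after normalization.
same_after_normalization = is_match("Leipzig,  Brockhaus", "leipzig, brockhaus")
# With case sensitivity switched on, the same pair no longer matches exactly.
strict_case = is_match("LEIPZIG", "leipzig", case_sensitive=True)
# Unrelated strings fall well below the threshold.
unrelated = is_match("Leipzig", "Berlin")

print(same_after_normalization, strict_case, unrelated)
```

A threshold below 1.0 tolerates small transcription differences, at the cost of occasionally accepting near-miss values as correct.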

In [None]:
@dataclass
class ChunkEvaluationResult:
    """Container for per-chunk evaluation results."""
    chunk_index: int
    custom_id: str
    metrics: Optional[ExtractionMetrics]
    ground_truth_found: bool
    output_found: bool
    gt_entry_count: int = 0
    hyp_entry_count: int = 0
    error: Optional[str] = None


@dataclass
class SourceEvaluationResult:
    """Container for source-level evaluation results."""
    category: str
    model_name: str
    source_name: str
    chunk_results: List[ChunkEvaluationResult]
    aggregated_metrics: Optional[ExtractionMetrics]
    ground_truth_found: bool
    output_found: bool
    error: Optional[str] = None
    
    @property
    def total_chunks(self) -> int:
        return len(self.chunk_results)
    
    @property
    def evaluated_chunks(self) -> int:
        return sum(1 for c in self.chunk_results if c.metrics is not None)


def evaluate_source_chunks(
    category_name: str,
    model_name: str,
    source_name: str,
    schema_name: str,
) -> SourceEvaluationResult:
    """
    Evaluate a source by comparing chunks from model output to ground truth.
    
    Args:
        category_name: Dataset category
        model_name: Model identifier
        source_name: Source file name
        schema_name: Schema name for field selection
        
    Returns:
        SourceEvaluationResult with per-chunk and aggregated metrics
    """
    # Load ground truth chunks
    gt_doc = load_ground_truth_chunks(ground_truth_path, category_name, source_name)
    if gt_doc is None or not gt_doc.chunks:
        return SourceEvaluationResult(
            category=category_name,
            model_name=model_name,
            source_name=source_name,
            chunk_results=[],
            aggregated_metrics=None,
            ground_truth_found=False,
            output_found=False,
            error="Ground truth not found",
        )
    
    # Load model output chunks
    hyp_doc = load_chunk_extractions(output_path, category_name, model_name, source_name)
    if hyp_doc is None or not hyp_doc.chunks:
        return SourceEvaluationResult(
            category=category_name,
            model_name=model_name,
            source_name=source_name,
            chunk_results=[],
            aggregated_metrics=None,
            ground_truth_found=True,
            output_found=False,
            error="Model output not found",
        )
    
    # Get fields to evaluate for this schema
    fields = schema_fields.get(schema_name, [])
    
    # Align chunks
    aligned = align_chunks(hyp_doc, gt_doc)
    
    # Compute per-chunk metrics
    chunk_results: List[ChunkEvaluationResult] = []
    valid_metrics: List[ExtractionMetrics] = []
    
    for hyp_chunk, gt_chunk in aligned:
        # Determine chunk info
        if gt_chunk:
            chunk_index = gt_chunk.chunk_index
            custom_id = gt_chunk.custom_id or (hyp_chunk.custom_id if hyp_chunk else "")
        elif hyp_chunk:
            chunk_index = hyp_chunk.chunk_index
            custom_id = hyp_chunk.custom_id
        else:
            continue
        
        # Check availability
        gt_found = gt_chunk is not None and gt_chunk.has_entries()
        hyp_found = hyp_chunk is not None and hyp_chunk.has_entries()
        
        gt_entries = gt_chunk.get_entries() if gt_chunk else []
        hyp_entries = hyp_chunk.get_entries() if hyp_chunk else []
        
        if not gt_found and not hyp_found:
            # Both empty - skip
            continue
        
        if not gt_found:
            chunk_results.append(ChunkEvaluationResult(
                chunk_index=chunk_index,
                custom_id=custom_id,
                metrics=None,
                ground_truth_found=False,
                output_found=hyp_found,
                gt_entry_count=0,
                hyp_entry_count=len(hyp_entries),
                error="No ground truth for chunk",
            ))
            continue
        
        if not hyp_found:
            chunk_results.append(ChunkEvaluationResult(
                chunk_index=chunk_index,
                custom_id=custom_id,
                metrics=None,
                ground_truth_found=True,
                output_found=False,
                gt_entry_count=len(gt_entries),
                hyp_entry_count=0,
                error="No model output for chunk",
            ))
            continue
        
        # Compute metrics for this chunk
        try:
            gt_data = {"entries": gt_entries}
            hyp_data = {"entries": hyp_entries}
            
            metrics = compute_extraction_metrics(
                ground_truth=gt_data,
                hypothesis=hyp_data,
                fields_to_evaluate=fields if fields else None,
                threshold=threshold,
                case_sensitive=case_sensitive,
                normalize_ws=normalize_ws,
            )
            
            chunk_results.append(ChunkEvaluationResult(
                chunk_index=chunk_index,
                custom_id=custom_id,
                metrics=metrics,
                ground_truth_found=True,
                output_found=True,
                gt_entry_count=len(gt_entries),
                hyp_entry_count=len(hyp_entries),
            ))
            valid_metrics.append(metrics)
        except Exception as e:
            chunk_results.append(ChunkEvaluationResult(
                chunk_index=chunk_index,
                custom_id=custom_id,
                metrics=None,
                ground_truth_found=True,
                output_found=True,
                gt_entry_count=len(gt_entries),
                hyp_entry_count=len(hyp_entries),
                error=str(e),
            ))
    
    # Aggregate metrics
    aggregated = aggregate_metrics(valid_metrics) if valid_metrics else None
    
    return SourceEvaluationResult(
        category=category_name,
        model_name=model_name,
        source_name=source_name,
        chunk_results=chunk_results,
        aggregated_metrics=aggregated,
        ground_truth_found=True,
        output_found=True,
    )


print("Chunk-level evaluation functions defined.")

## Run Evaluation

This section executes the evaluation across the full grid of configured models and categories.

### What happens in the code cell

1. For each (model, category) pair, the notebook iterates over all discovered sources in that category.
2. For each source, it calls evaluate_source_chunks(...) to compute:
   - per-chunk results (including errors and missing-data flags)
   - per-source aggregated metrics (when available)
3. It aggregates all valid chunk metrics within the (model, category) pair into a single ExtractionMetrics object and stores:
   - all_results[model_name][cat_name]: the detailed per-source results (for optional inspection)
   - all_metrics[model_name][cat_name]: the aggregated metrics used for publication tables and exports

### Output

During execution, the notebook prints one summary line per evaluated (model, category) with:
- Entry-level F1 (Entry F1)
- Field micro-averaged F1 (Micro F1)
- Counts of evaluated sources and of recorded chunk results (the chunk count includes chunks that could not be scored)

These printed summaries are intended for quick sanity checking; the canonical outputs are the tables and exported files produced later.

In [None]:
def evaluate_category(
    category: dict,
    model: dict,
) -> Tuple[List[SourceEvaluationResult], Optional[ExtractionMetrics]]:
    """
    Evaluate all sources in a category for a given model.
    
    Args:
        category: Category config dict
        model: Model config dict
        
    Returns:
        Tuple of (list of per-source results, aggregated metrics)
    """
    cat_name = category["name"]
    schema_name = category["schema"]
    model_name = model["name"]
    
    sources = discover_sources(cat_name)
    results = []
    all_chunk_metrics = []
    
    for source in sources:
        result = evaluate_source_chunks(cat_name, model_name, source, schema_name)
        results.append(result)
        
        # Collect valid chunk metrics for aggregation
        for chunk_result in result.chunk_results:
            if chunk_result.metrics is not None:
                all_chunk_metrics.append(chunk_result.metrics)
    
    aggregated = aggregate_metrics(all_chunk_metrics) if all_chunk_metrics else None
    
    return results, aggregated


# Run evaluation
all_metrics = {}
all_results = {}  # Store detailed results for reporting

for model in models:
    model_name = model["name"]
    all_metrics[model_name] = {}
    all_results[model_name] = {}
    
    for category in categories:
        cat_name = category["name"]
        
        results, aggregated = evaluate_category(category, model)
        all_results[model_name][cat_name] = results
        
        if aggregated and aggregated.total_gt_entries > 0:
            all_metrics[model_name][cat_name] = aggregated
            evaluated_sources = sum(1 for r in results if r.aggregated_metrics is not None)
            total_chunks = sum(r.total_chunks for r in results)
            print(f"{model_name} / {cat_name}: "
                  f"Entry F1={aggregated.entry_f1:.2%}, "
                  f"Micro F1={aggregated.micro_f1:.2%} "
                  f"({evaluated_sources} sources, {total_chunks} chunks)")

print("\nEvaluation complete!")

## Results Summary Table

This section produces the main paper-ready summary of extraction quality by model and document category.

### Table contents

For each (model, category) combination with available metrics, the table reports:

- Entry-level metrics:
  - Entry P (%), Entry R (%), Entry F1 (%)
- Field-level micro-averaged metrics:
  - Micro P (%), Micro R (%), Micro F1 (%)
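
The implementation of aggregate_metrics() is not shown in this notebook. Under the standard definition, micro-averaging sums the raw counts over all fields before computing the ratios, so frequent fields weigh more than rare ones. A minimal sketch with made-up counts follows; `prf` is a hypothetical helper.

```python
from typing import Dict, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical per-field counts (TP, FP, FN) for one model/category.
field_counts: Dict[str, Tuple[int, int, int]] = {
    "title": (90, 5, 10),
    "year": (8, 2, 12),
}

# Micro-average: sum the counts over all fields first, then compute the ratios.
tp = sum(c[0] for c in field_counts.values())
fp = sum(c[1] for c in field_counts.values())
fn = sum(c[2] for c in field_counts.values())
micro_p, micro_r, micro_f1 = prf(tp, fp, fn)

# Macro-average (for contrast): compute F1 per field, then take the plain mean.
macro_f1 = sum(prf(*c)[2] for c in field_counts.values()) / len(field_counts)

print(f"micro F1 = {micro_f1:.3f}, macro F1 = {macro_f1:.3f}")
```

In this example the well-populated title field dominates the micro average, which is why it sits above the macro average that weights the weak year field equally.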

### Outputs

- The table is displayed as HTML in the notebook for readability and copy/paste workflows.
- If SAVE_TABLES_LATEX is True, the same table is exported as:
  - OUTPUT_DIR/table_1_extraction_summary.tex

This table is the primary high-level result suitable for inclusion in a social science paper.

In [None]:
# ================================================================
# Results Summary Table
# ================================================================

def metrics_to_summary_dataframe(
    model_metrics: Dict[str, Dict[str, ExtractionMetrics]],
) -> pd.DataFrame:
    """
    Convert model metrics dictionary to a pandas DataFrame for display and export.

    Args:
        model_metrics: Dict mapping model_name -> category -> ExtractionMetrics

    Returns:
        DataFrame with one row per model/category combination
    """
    rows = []
    for model_name in sorted(model_metrics.keys()):
        for cat_name, m in model_metrics[model_name].items():
            rows.append({
                "Model": model_name,
                "Category": cat_name,
                "Entry P (%)": round(m.entry_precision * 100, 2),
                "Entry R (%)": round(m.entry_recall * 100, 2),
                "Entry F1 (%)": round(m.entry_f1 * 100, 2),
                "Micro P (%)": round(m.micro_precision * 100, 2),
                "Micro R (%)": round(m.micro_recall * 100, 2),
                "Micro F1 (%)": round(m.micro_f1 * 100, 2),
            })
    return pd.DataFrame(rows)


# all_metrics holds one (possibly empty) dict per model, so test the inner dicts
if any(all_metrics.values()):
    # Convert to DataFrame
    df_summary = metrics_to_summary_dataframe(all_metrics)

    # Display as HTML
    display(HTML("<h3>Table 1: Extraction Quality Summary</h3>"))
    display(HTML(df_summary.to_html(index=False)))

    # Save as LaTeX if configured
    if SAVE_TABLES_LATEX:
        latex_path = OUTPUT_DIR / "table_1_extraction_summary.tex"
        df_summary.to_latex(
            latex_path,
            index=False,
            escape=True,  # escape LaTeX special characters (e.g. % and _) in the table
            caption="Extraction Quality by Model and Category",
            label="tab:extraction_summary",
            float_format="%.2f",
        )
        print(f"Saved: {latex_path}")
else:
    print("No metrics computed. Check that ground truth and model outputs exist.")

## Field-Level Breakdown

This section provides a detailed, field-by-field view of extraction performance for each evaluated (model, category) pair.

### Table contents

For each extracted field, the table reports:

- Precision (%), Recall (%), F1 (%)
- TP, FP, FN counts (true positives, false positives, false negatives)

This breakdown is useful for:
- identifying which fields drive overall performance differences
- diagnosing systematic failure modes (e.g., consistently low recall on specific attributes)
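
As a sketch of how the counts support this kind of diagnosis, the snippet below ranks fields by total errors and labels the dominant error type. The rows are invented and only mimic the shape of the field table; `dominant_error` is a hypothetical helper.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, int]]

# Hypothetical rows in the shape of the field-level table.
rows: List[Row] = [
    {"Field": "title", "TP": 90, "FP": 5, "FN": 10},
    {"Field": "year", "TP": 8, "FP": 2, "FN": 12},
    {"Field": "publisher", "TP": 40, "FP": 30, "FN": 5},
]


def dominant_error(row: Row) -> str:
    """Label a field by whichever error type occurs more often."""
    if row["FN"] > row["FP"]:
        return "misses (recall problem)"
    if row["FP"] > row["FN"]:
        return "spurious values (precision problem)"
    return "balanced"


# Fields with the most errors first.
ranked = sorted(rows, key=lambda r: r["FP"] + r["FN"], reverse=True)

for row in ranked:
    errors = row["FP"] + row["FN"]
    print(f"{row['Field']}: {errors} errors, mostly {dominant_error(row)}")
```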

### Outputs

For each (model, category) pair:

- The field table is rendered as HTML in the notebook.
- If SAVE_TABLES_LATEX is True, a corresponding LaTeX table is written to OUTPUT_DIR with a filename that encodes the table number, model, and category (to avoid collisions).

Because there is one table per model-category pair, the tables can be numerous; each is exported to its own file and also rendered in the notebook.

In [None]:
# ================================================================
# Field-Level Breakdown
# ================================================================

def field_metrics_to_dataframe(metrics: ExtractionMetrics) -> pd.DataFrame:
    """
    Convert field-level metrics to a pandas DataFrame.

    Args:
        metrics: ExtractionMetrics object

    Returns:
        DataFrame with one row per field
    """
    rows = []
    for field_name, fm in sorted(metrics.field_metrics.items()):
        rows.append({
            "Field": field_name,
            "Precision (%)": round(fm.precision * 100, 2),
            "Recall (%)": round(fm.recall * 100, 2),
            "F1 (%)": round(fm.f1 * 100, 2),
            "TP": fm.true_positives,
            "FP": fm.false_positives,
            "FN": fm.false_negatives,
        })
    return pd.DataFrame(rows)


# Display field-level breakdown for each model/category
table_counter = 2  # Start after Table 1

for model_name, cat_metrics in all_metrics.items():
    for cat_name, metrics in cat_metrics.items():
        # Convert to DataFrame
        df_fields = field_metrics_to_dataframe(metrics)

        # Display as HTML
        display(HTML(f"<h4>Table {table_counter}: Field-Level Metrics - {model_name} / {cat_name}</h4>"))
        display(HTML(df_fields.to_html(index=False)))

        # Save as LaTeX if configured
        if SAVE_TABLES_LATEX:
            # Create safe filename
            safe_model = model_name.replace(" ", "_").replace("/", "_")
            safe_cat = cat_name.replace(" ", "_").replace("/", "_")
            latex_path = OUTPUT_DIR / f"table_{table_counter}_fields_{safe_model}_{safe_cat}.tex"
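            df_fields.to_latex(
                latex_path,
                index=False,
                escape=True,  # escape LaTeX special characters (e.g. % and _) in the table
                # Underscores in model or category names would otherwise break the caption
                caption=f"Field-Level Metrics for {model_name} on {cat_name}".replace("_", r"\_"),
                label=f"tab:fields_{safe_model}_{safe_cat}",
                float_format="%.2f",
            )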
            print(f"Saved: {latex_path}")

        table_counter += 1
        print()

## Per-Source Details (Optional)

This section is intentionally optional and is meant for targeted inspection during development or analysis.

The code cell below contains a commented snippet that can be enabled to print per-source summaries for a selected:

- SHOW_MODEL
- SHOW_CATEGORY

This helps identify whether a model's aggregate performance is driven by a small number of difficult sources or is consistent across sources.

No outputs are produced unless you uncomment and run the snippet.

In [None]:
# Show per-source breakdown for a specific model/category
# Uncomment and modify as needed:

# SHOW_MODEL = "gpt_5.1_medium"
# SHOW_CATEGORY = "bibliography"

# if SHOW_MODEL in all_results and SHOW_CATEGORY in all_results[SHOW_MODEL]:
#     results = all_results[SHOW_MODEL][SHOW_CATEGORY]
#     print(f"\n{SHOW_MODEL} / {SHOW_CATEGORY}:")
#     print("-" * 60)
#     for r in results:
#         if r.aggregated_metrics:
#             m = r.aggregated_metrics
#             print(f"  {r.source_name}: F1={m.entry_f1:.2%}, "
#                   f"{r.evaluated_chunks}/{r.total_chunks} chunks")
#         else:
#             print(f"  {r.source_name}: {r.error or 'No data'}")

## Save Reports

This section exports evaluation results in formats suitable for reproducible research workflows and downstream analysis.

### Files written

The code cell writes:

- JSON (eval_results_{timestamp}.json)
  A structured, machine-readable report containing:
  - run timestamp
  - evaluation parameter settings (threshold, case_sensitive, normalize_whitespace)
  - aggregated metrics serialized via ExtractionMetrics.to_dict()

- CSV (eval_results_{timestamp}.csv)
  A flat summary table (matching the Results Summary Table section) for easy import into:
  - statistical software
  - spreadsheets
  - manuscript workflows

Additionally, if SAVE_TABLES_LATEX is True, the section prints a count of .tex files present in OUTPUT_DIR, which serves as a quick validation that LaTeX exports were created.
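
For downstream analysis, the timestamped filenames make it easy to pick up the most recent run. The sketch below assumes only the filename pattern and top-level keys described above; `latest_report` is a hypothetical helper, and the demo uses stub files in a throwaway directory.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def latest_report(reports_dir: Path) -> Optional[dict]:
    """Load the newest eval_results_*.json, or return None if there is none."""
    # The %Y%m%d_%H%M%S timestamp sorts chronologically as plain text.
    candidates = sorted(reports_dir.glob("eval_results_*.json"))
    if not candidates:
        return None
    with open(candidates[-1], "r", encoding="utf-8") as f:
        return json.load(f)


# Demo with two stub reports.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for ts in ("20240101_090000", "20240315_120000"):
        stub = {"timestamp": ts, "evaluation_method": "chunk-level", "config": {}, "results": {}}
        (tmp_dir / f"eval_results_{ts}.json").write_text(json.dumps(stub), encoding="utf-8")
    report = latest_report(tmp_dir)
    empty = latest_report(tmp_dir / "missing")

print(report["timestamp"])
```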

In [None]:
# ================================================================
# Save Reports (JSON, CSV)
# ================================================================

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
reports_path.mkdir(parents=True, exist_ok=True)

# --- Save JSON Report ---
json_path = reports_path / f"eval_results_{timestamp}.json"
json_data = {
    "timestamp": timestamp,
    "evaluation_method": "chunk-level",
    "config": {
        "threshold": threshold,
        "case_sensitive": case_sensitive,
        "normalize_whitespace": normalize_ws,
    },
    "results": {
        model: {cat: m.to_dict() for cat, m in cats.items()}
        for model, cats in all_metrics.items()
    },
}
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(json_data, f, indent=2, ensure_ascii=False)
print(f"Saved JSON report: {json_path}")

# --- Save CSV Report ---
csv_path = reports_path / f"eval_results_{timestamp}.csv"
# all_metrics holds one (possibly empty) dict per model; write the CSV only if any has results
csv_written = any(all_metrics.values())
if csv_written:
    df_summary = metrics_to_summary_dataframe(all_metrics)
    df_summary.to_csv(csv_path, index=False, encoding="utf-8")
    print(f"Saved CSV report: {csv_path}")

# --- Summary ---
print(f"\n{'=' * 60}")
print("REPORTS SAVED")
print(f"{'=' * 60}")
print(f"  JSON: {json_path.name}")
print(f"  CSV:  {csv_path.name if csv_written else 'not written (no metrics)'}")
if SAVE_TABLES_LATEX:
    tex_files = list(OUTPUT_DIR.glob("*.tex"))
    print(f"  LaTeX tables: {len(tex_files)} files in {OUTPUT_DIR.relative_to(EVAL_DIR)}/")

## Visualization (Optional)

This section produces a compact visual summary of model performance across categories.

### Figure

If metrics are available, the code creates a grouped bar chart:

- x-axis: model
- bars: category
- y-axis: micro-averaged F1 (Micro F1 (%))

### Files written

- A PNG figure is saved to reports_path as eval_chart_{timestamp}.png.
- If SAVE_TABLES_LATEX is True, a PDF copy is also saved to OUTPUT_DIR/figure_extraction_quality.pdf for LaTeX inclusion.

If matplotlib or numpy is not installed, the section prints a notice and is skipped.

In [None]:
# ================================================================
# Visualization
# ================================================================

try:
    import matplotlib.pyplot as plt
    import numpy as np

    if not all_metrics:
        print("No metrics to visualize")
    else:
        # Prepare data for plotting
        model_names = list(all_metrics.keys())
        # Sorted so that bar order and legend are stable across runs
        cat_names = sorted(set(
            cat for cats in all_metrics.values() for cat in cats.keys()
        ))

        if model_names and cat_names:
            # Create grouped bar chart for F1 scores
            fig, ax = plt.subplots(figsize=(12, 6))

            x = np.arange(len(model_names))
            width = 0.8 / len(cat_names)

            for i, cat in enumerate(cat_names):
                f1_scores = [
                    all_metrics[model].get(cat, ExtractionMetrics()).micro_f1 * 100
                    for model in model_names
                ]
                offset = (i - len(cat_names) / 2 + 0.5) * width
                ax.bar(x + offset, f1_scores, width, label=cat)

            ax.set_ylabel("Micro F1 Score (%)")
            ax.set_xlabel("Model")
            ax.set_title("Extraction Quality by Model and Category (Chunk-Level Evaluation)")
            ax.set_xticks(x)
            ax.set_xticklabels(model_names, rotation=45, ha="right")
            ax.legend(title="Category")
            ax.set_ylim(0, 100)
            ax.grid(axis="y", alpha=0.3)

            plt.tight_layout()

            # Save chart
            chart_path = reports_path / f"eval_chart_{timestamp}.png"
            plt.savefig(chart_path, dpi=150)
            print(f"Saved chart: {chart_path}")

            # Save as PDF for LaTeX inclusion
            if SAVE_TABLES_LATEX:
                pdf_path = OUTPUT_DIR / "figure_extraction_quality.pdf"
                plt.savefig(pdf_path, format="pdf", bbox_inches="tight")
                print(f"Saved PDF figure: {pdf_path}")
            
            plt.show()
        else:
            print("No data to visualize")
    
except ImportError:
    print("matplotlib not available - skipping visualization")