# Contract-Driven Data Pipelines

Extract structured data from heterogeneous documents using declarative JSON contracts.

One schema. Many providers. Full async.

## Design

A **data contract** (JSON) declares everything the pipeline needs:
- Which LLM model to use
- How to classify tables (keyword matching)
- What output schema to produce (columns, types, aliases, formats)
- How to enrich records (constants, metadata from filenames/titles)

The pipeline code is 100% generic — swap the contract, not the code.
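
To make the idea concrete, here is a minimal sketch of what such a contract might look like, parsed with the standard library only. Every field name in it (`provider`, `model`, `categories`, `outputs`, `schema`, `enrichment`) is an assumption made for illustration; the format that `load_contract()` actually accepts may differ.

```python
import json

# Hypothetical contract: field names and values are illustrative assumptions,
# not the real format read by pdf_ocr.contracts.load_contract().
CONTRACT_JSON = """
{
  "provider": "example_provider",
  "model": "example-llm-model",
  "categories": {"harvest": {"keywords": ["harvest", "yield"]}},
  "outputs": {
    "harvest": {
      "category": "harvest",
      "schema": {
        "columns": [
          {"name": "region", "type": "string", "aliases": ["Region", "Oblast"]},
          {"name": "area_kha", "type": "float", "aliases": ["Area", "Area harvested"]}
        ]
      },
      "enrichment": {
        "source": {"constant": "example"},
        "report_date": {"from": "filename"}
      }
    }
  }
}
"""

contract = json.loads(CONTRACT_JSON)

# Summarize each output: its schema columns and its enrichment fields.
for out_name, spec in contract["outputs"].items():
    columns = [col["name"] for col in spec["schema"]["columns"]]
    print(out_name, columns, sorted(spec["enrichment"]))
```

The point of the sketch is the split of responsibilities: everything domain-specific sits in data, so a different domain means a different JSON file rather than different code.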

The interpretation step uses a **deterministic-first** architecture: when schema
aliases fully cover every ` / `-separated header part, records are built via pure
string matching with zero LLM calls. The LLM pipeline activates only as a fallback
for pages with unmatched columns.
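
The coverage test described above can be sketched in a few lines. This is an illustration of the idea, not the library's implementation: the helper names (`normalize`, `build_alias_index`, `unmatched_parts`) and the sample aliases are hypothetical, and the case-insensitive, whitespace-collapsing match is an assumption.

```python
def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so matching is forgiving."""
    return " ".join(text.lower().split())


def build_alias_index(aliases_by_column: dict[str, list[str]]) -> dict[str, str]:
    """Map every normalized alias (and the column name itself) to its column."""
    index = {}
    for column, aliases in aliases_by_column.items():
        for alias in [column, *aliases]:
            index[normalize(alias)] = column
    return index


def unmatched_parts(headers: list[str], alias_index: dict[str, str]) -> list[str]:
    """Return header parts with no alias; an empty list means full coverage."""
    missing = []
    for header in headers:
        for part in header.split(" / "):
            if normalize(part) not in alias_index:
                missing.append(part)
    return missing


# Hypothetical aliases and headers, for illustration only.
alias_index = build_alias_index({
    "region": ["Region"],
    "crop": ["spring crops", "winter crops"],
    "metric": ["MOA Target 2025", "Planted"],
})

covered = ["Region", "spring crops / MOA Target 2025"]
partial = ["Region", "spring crops / Forecast"]

print(unmatched_parts(covered, alias_index))  # [] -> deterministic path
print(unmatched_parts(partial, alias_index))  # ['Forecast'] -> LLM fallback
```

Returning the unmatched parts, rather than a bare boolean, also tells a contract author which aliases are missing.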

**Three use cases** demonstrate the same pipeline across different domains:

| # | Use Case | Format | Documents | Challenge |
|---|----------|--------|-----------|----------|
| 1 | Russian agricultural reports | DOCX | 2 weekly reports | Multi-category extraction, dynamic pivot years, deterministic mapping |
| 2 | Australian shipping stems | PDF | 6 providers | One canonical model, 6 different layouts, full concurrency |
| 3 | ACEA car registrations | PDF | 1 press release | Pivoted table → flat records, deterministic unpivoting |

In [None]:
# Setup
import asyncio
import time
from pathlib import Path

import nest_asyncio
import pandas as pd

from pdf_ocr import (
    CanonicalSchema, ColumnDef,
    compress_spatial_text,
    to_pandas, to_records, to_parquet,
)

nest_asyncio.apply()

print("Setup complete.")

In [None]:
# Display helpers
import base64, html as html_mod, shutil, subprocess, tempfile
from IPython.display import display, HTML


def render_document_page(path, page=0, dpi=150):
    """Render a document page as a base64 PNG. Supports PDF and DOCX."""
    import fitz

    path = str(path)
    if path.lower().endswith((".docx", ".doc")):
        # Convert to PDF with LibreOffice; fall back to the default macOS install path.
        soffice = shutil.which("soffice") or "/Applications/LibreOffice.app/Contents/MacOS/soffice"
        with tempfile.TemporaryDirectory() as tmpdir:
            subprocess.run(
                [soffice, "--headless", "--convert-to", "pdf", "--outdir", tmpdir, path],
                capture_output=True, check=True,
            )
            pdf_path = next(Path(tmpdir).glob("*.pdf"))
            # Load the bytes into memory so the document outlives the temp directory.
            doc = fitz.open(stream=pdf_path.read_bytes(), filetype="pdf")
    else:
        doc = fitz.open(path)

    page_count = len(doc)
    pix = doc[page].get_pixmap(dpi=dpi)
    b64 = base64.b64encode(pix.tobytes("png")).decode()
    doc.close()
    return b64, page_count


def side_by_side_display(*, image_b64=None, compressed_text=None, dataframe=None, max_height=500):
    """Display up to 3 panels side-by-side: document image | compressed text | DataFrame."""
    panels = []
    style = f"overflow-y:auto; max-height:{max_height}px; border:1px solid #ddd; padding:6px; flex:1"
    if image_b64:
        panels.append(f'<div style="{style}"><img src="data:image/png;base64,{image_b64}" style="width:100%"></div>')
    if compressed_text:
        escaped = html_mod.escape(compressed_text)
        panels.append(f'<div style="{style}"><pre style="font-size:11px; margin:0; white-space:pre">{escaped}</pre></div>')
    if dataframe is not None:
        df_html = dataframe.to_html(index=False, max_rows=30)
        panels.append(f'<div style="{style}; font-size:11px">{df_html}</div>')
    display(HTML(f'<div style="display:flex; gap:8px; align-items:flex-start">{" ".join(panels)}</div>'))


print("Display helpers ready.")

## Pipeline Architecture

```
contract (JSON) → load_contract() → process_document_async() → save()
```

**Deterministic bypass**: Before any LLM call, each page is tested for deterministic
mapping. If every ` / `-separated header part matches a schema alias, records are built
via string matching — no LLM call, no latency, no cost. This fires automatically for
Russian DOCX (compound headers) and ACEA PDF (hierarchical headers) when the contract
has complete aliases.

Five levels of concurrency — all on a single event loop, no sync wrappers:

| Level | Scope | Pattern |
|---|---|---|
| **Document-level** | Process N documents simultaneously | `asyncio.gather(*[process_document_async(doc, cc) for doc in docs])` |
| **Compression** | PDF compression in thread pool (CPU-bound) | `asyncio.to_thread(compress_spatial_text, ...)` |
| **Category-level** | Multiple outputs per document (e.g. harvest + planting) | `asyncio.gather(*[interpret_output_async(...)])` |
| **Table/Page-level** | DOCX: N tables concurrent; PDF: N pages concurrent | `interpret_tables_async()` / `_interpret_pages_batched_async()` |
| **Batch-level** | Step 2 mapping in chunks of 20 rows | Built into `_interpret_pages_batched_async()` |
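
The batch-level row can be illustrated with a small stand-alone sketch: rows are split into chunks of 20 and the chunks are mapped concurrently. `chunked`, `map_batch`, and `map_rows` are hypothetical names, and `map_batch` is a placeholder for whatever the real mapping step does. That the batches run concurrently is inferred from the table above, not confirmed by the library source.

```python
import asyncio


def chunked(rows: list, size: int = 20) -> list[list]:
    """Split rows into consecutive batches of at most `size` items."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


async def map_batch(batch: list[int]) -> list[int]:
    """Placeholder for one mapping call; here it just doubles each value."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [value * 2 for value in batch]


async def map_rows(rows: list[int]) -> list[int]:
    """Map all batches concurrently, then flatten in the original order."""
    batch_results = await asyncio.gather(*(map_batch(b) for b in chunked(rows)))
    return [value for batch in batch_results for value in batch]


rows = list(range(45))
print([len(b) for b in chunked(rows)])  # [20, 20, 5]
mapped = asyncio.run(map_rows(rows))
```

`asyncio.gather` returns results in the order the awaitables were passed, so the flattened output keeps the original row order. Inside a notebook with a running event loop, `await map_rows(rows)` replaces the `asyncio.run(...)` call.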

Key optimization: compression and interpretation are **pipelined per document** — as soon as
one PDF finishes compression, its LLM interpretation starts immediately while other PDFs
are still compressing. No sequential fetch barrier.
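
The pipelining can be demonstrated without the library. In the sketch below, `fake_compress` and `fake_interpret` are stand-ins (a blocking sleep and an awaitable sleep) for the real compression and LLM steps, and the file names and durations are made up. It shows only the scheduling pattern: each document is its own coroutine chain, so nothing waits on the slowest compression.

```python
import asyncio
import time

events: list[str] = []  # ordered log of what happened


def fake_compress(name: str, seconds: float) -> str:
    """Stand-in for a blocking compression step (runs in a worker thread)."""
    time.sleep(seconds)
    events.append(f"compressed {name}")
    return f"text of {name}"


async def fake_interpret(name: str, text: str) -> str:
    """Stand-in for an awaitable LLM call."""
    events.append(f"interpreting {name}")
    await asyncio.sleep(0.01)
    return f"records from {text}"


async def process(name: str, seconds: float) -> str:
    """Per-document chain: interpretation starts once *this* document is compressed."""
    text = await asyncio.to_thread(fake_compress, name, seconds)
    return await fake_interpret(name, text)


async def main() -> list[str]:
    # Both chains start together; the fast one never waits for the slow one.
    return await asyncio.gather(process("fast.pdf", 0.05), process("slow.pdf", 0.4))


results = asyncio.run(main())
print(events)
```

The event log shows `interpreting fast.pdf` before `compressed slow.pdf`, which is the absence of a barrier in practice. A two-phase design (gather all compressions, then gather all interpretations) would reverse that order. In a notebook, use `await main()` instead of `asyncio.run(main())`.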

The three pipeline helpers live in `pdf_ocr.pipeline`:
- `compress_and_classify_async()` — compress + classify (Layer 1)
- `interpret_output_async()` — interpret + enrich + format one category (Layer 2)
- `process_document_async()` — full per-document pipeline (Layer 3)

In [None]:
# ── Generic Pipeline Functions ────────────────────────────────────────────────
# These functions are contract-agnostic. Swap the JSON contract and re-run.

from pdf_ocr.contracts import load_contract, ContractContext
from pdf_ocr.pipeline import process_document_async, DocumentResult


def save(results: list[DocumentResult], output_dir="outputs"):
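    """Write one Parquet file per output name, concatenated across documents.

    Returns (merged, paths): `merged` maps each output name to the list of
    per-document DataFrames; `paths` maps it to the written Parquet file.
    """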
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    merged = {}

    for result in results:
        for out_name, df in result.dataframes.items():
            if out_name not in merged:
                merged[out_name] = []
            merged[out_name].append(df)

    paths = {}
    for out_name, frames in merged.items():
        df = pd.concat(frames, ignore_index=True)
        path = output_dir / f"{out_name}.parquet"
        df.to_parquet(path, index=False)
        paths[out_name] = path
        print(f"  {out_name}: {path} ({len(df)} rows)")

    return merged, paths


async def run_pipeline_async(contract_path, doc_paths, output_dir="outputs"):
    """Orchestrate full pipeline with document-level concurrency."""
    t0 = time.perf_counter()

    print("PREPARE")
    cc = load_contract(contract_path)
    print(f"  Contract: {cc.provider}, Model: {cc.model}, Outputs: {list(cc.outputs.keys())}")

    print("\nPROCESS (async — compress + classify + interpret per document)")
    results = await asyncio.gather(
        *[process_document_async(str(p), cc) for p in doc_paths]
    )
    for r in results:
        cats = list(r.dataframes.keys())
        rows = sum(len(df) for df in r.dataframes.values())
        print(f"  {Path(r.doc_path).name[:50]}: {cats} → {rows} records")

    print("\nSAVE")
    merged, paths = save(results, output_dir)

    elapsed = time.perf_counter() - t0
    print(f"\nDone in {elapsed:.1f}s")
    return results, merged, paths, elapsed


print("Pipeline functions defined: load_contract -> run_pipeline_async -> save")

---
## Use Case 1: Russian Agricultural DOCX Reports

Multi-category extraction from Russian Ministry of Agriculture weekly grain reports.
Each DOCX contains harvest, planting, and export tables. The contract classifies
tables by keywords and extracts crop names from table titles, report dates from
filenames, and year values from dynamic pivot headers.

**Deterministic mapping**: The DOCX extractor produces compound headers with ` / `
separators (e.g., `spring crops / MOA Target 2025`). When the contract aliases cover
every header part, the interpretation step resolves the full table — including
unpivoting across crop groups — with zero LLM calls.
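
The filename and pivot-header enrichment can be sketched with two regular expressions. Both helpers are hypothetical, not part of `pdf_ocr`. The sketch assumes the leading `YYYY-MM-DD_` timestamp is the date of interest; the contract may instead use the `June 20-21` range inside the name, which the source does not specify. It also assumes a pivot header carries a single four-digit year.

```python
import re
from datetime import date
from typing import Optional

DATE_PREFIX = re.compile(r"^(\d{4})-(\d{2})-(\d{2})_")
YEAR_IN_HEADER = re.compile(r"\b(19|20)\d{2}\b")


def leading_date_from_filename(filename: str) -> Optional[date]:
    """Parse a leading YYYY-MM-DD_ timestamp; return None if it is absent."""
    match = DATE_PREFIX.match(filename)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)


def split_pivot_year(header_part: str) -> tuple[str, Optional[int]]:
    """Separate a header part such as 'MOA Target 2025' into label and year."""
    match = YEAR_IN_HEADER.search(header_part)
    if match is None:
        return header_part.strip(), None
    label = header_part[:match.start()] + header_part[match.end():]
    return " ".join(label.split()), int(match.group())


name = "2025-06-24_11-58-45.Russian weekly grain EOW June 20-21 2025-1.docx"
print(leading_date_from_filename(name))     # 2025-06-24
print(split_pivot_year("MOA Target 2025"))  # ('MOA Target', 2025)
```

Splitting the year out of the header is what lets a single alias such as `MOA Target` keep matching when next year's report changes the year in the column title.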

In [None]:
# Load the Russian agricultural contract
ru_cc = load_contract("contracts/ru_ag_ministry.json")
print(f"  Contract: {ru_cc.provider}, Model: {ru_cc.model}, Outputs: {list(ru_cc.outputs.keys())}")

print("\nSchema summary:")
for name, spec in ru_cc.outputs.items():
    cols = [c.name for c in spec.schema.columns]
    enrich = list(spec.enrichment.keys())
    print(f"  {name}: LLM cols={cols}, enriched={enrich}")

In [None]:
# Run pipeline on the June DOCX
june_path = "inputs/docx/input/2025-06-24_11-58-45.Russian weekly grain EOW June 20-21 2025-1.docx"

results_june, merged_june, _, elapsed_june = await run_pipeline_async(
    "contracts/ru_ag_ministry.json", [june_path], output_dir="outputs/ru_june"
)

In [None]:
# Side-by-side: DOCX page | pipe-table | DataFrame (first available category)
try:
    img_b64, _ = render_document_page(june_path, page=0, dpi=120)
except Exception:
    img_b64 = None  # LibreOffice not available

# Get the first available category from this document
june_result = results_june[0]
first_cat = next(iter(june_result.dataframes), None)

if first_cat:
    cat_data = june_result.compressed_by_category.get(first_cat)
    # DOCX: list of (md, meta) tuples; PDF: compressed text string
    if isinstance(cat_data, list):
        sample_md = cat_data[0][0]
    elif isinstance(cat_data, str):
        sample_md = cat_data.split("\f")[0]  # first page
    else:
        sample_md = "(no compressed text)"
    df_display = june_result.dataframes[first_cat]
    print(f"Showing category: {first_cat} ({len(df_display)} rows)")
    side_by_side_display(image_b64=img_b64, compressed_text=sample_md, dataframe=df_display)
else:
    print("No tables classified for this document.")

In [None]:
# Display output DataFrames
for name, frames in merged_june.items():
    df = frames[0] if isinstance(frames, list) else frames
    print(f"=== {name.upper()} ({len(df)} rows) ===")
    display(df.head(10))
    print()

In [None]:
# Run pipeline on the July DOCX — same contract, different document
july_path = "inputs/docx/input/2025-07-17_10-16-25.Russian weekly grain EOW July 11-12 2025-1.docx"

results_july, merged_july, _, elapsed_july = await run_pipeline_async(
    "contracts/ru_ag_ministry.json", [july_path], output_dir="outputs/ru_july"
)

print(f"\nJune: {elapsed_june:.1f}s, July: {elapsed_july:.1f}s")
for name in merged_july:
    df = merged_july[name][0] if isinstance(merged_july[name], list) else merged_july[name]
    print(f"  {name}: {len(df)} rows")

---
## Use Case 2: Australian Shipping Stems — XXL

**One canonical model, six providers, full concurrency.**

Six shipping stem PDFs from six different providers: each expresses the same semantic data
(vessel name, port, commodity, tonnage, ETA) with completely different layouts, column
names, and formatting. A single canonical schema with rich aliases normalizes them all
into one unified DataFrame.
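
As a rough illustration of alias-based normalization, the sketch below renames provider-specific column names to canonical ones with a plain dictionary lookup. The alias lists, column names, and sample rows are all made up for the example; the real aliases live in `contracts/au_shipping_stem.json` and are printed by the next cell.

```python
# Hypothetical canonical columns and aliases, for illustration only.
ALIASES = {
    "vessel_name": ["Vessel", "Ship Name", "Vessel Name"],
    "port": ["Port", "Load Port", "Loading Port"],
    "commodity": ["Commodity", "Grain", "Cargo"],
    "tonnage": ["Tonnes", "Quantity (mt)", "Volume"],
    "eta": ["ETA", "Est. Arrival"],
}

# Reverse index: case-folded alias -> canonical column name.
LOOKUP = {
    alias.casefold(): column
    for column, names in ALIASES.items()
    for alias in names
}


def to_canonical(row: dict[str, str]) -> dict[str, str]:
    """Rename provider-specific keys to canonical names; drop unknown keys."""
    out = {}
    for key, value in row.items():
        column = LOOKUP.get(key.strip().casefold())
        if column is not None:
            out[column] = value
    return out


# Two made-up rows with different provider layouts.
provider_a = {"Ship Name": "MV Example", "Load Port": "Newcastle",
              "Grain": "Wheat", "Tonnes": "30000", "ETA": "2025-10-01"}
provider_b = {"Vessel": "MV Sample", "Port": "Geelong",
              "Cargo": "Barley", "Quantity (mt)": "25000", "Est. Arrival": "2025-10-03"}

unified = [to_canonical(provider_a), to_canonical(provider_b)]
print(unified[0].keys() == unified[1].keys())  # True
```

Both rows come out with the same five keys, which is what allows the per-provider results to be concatenated into one DataFrame later.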

In [None]:
# Load the shipping stem contract
ship_cc = load_contract("contracts/au_shipping_stem.json")
print(f"  Contract: {ship_cc.provider}, Model: {ship_cc.model}, Outputs: {list(ship_cc.outputs.keys())}")

print("\nSchema columns with aliases:")
for col in ship_cc.outputs["vessels"].schema.columns:
    print(f"  {col.name:15s} {col.type:6s} aliases={col.aliases}")

In [None]:
# List all 6 PDFs
shipping_pdfs = {
    "Newcastle":  "inputs/2857439.pdf",
    "Bunge":      "inputs/Bunge_loadingstatement_2025-09-25.pdf",
    "CBH":        "inputs/CBH Shipping Stem 26092025.pdf",
    "GrainCorp":  "inputs/shipping-stem-2025-11-13.pdf",
    "Riordan":    "inputs/shipping_stem-accc-30092025-1.pdf",
    "Queensland": "inputs/document (1).pdf",
}

import fitz
print(f"{'Provider':<14s} {'Filename':<50s} {'Pages':>5s}")
print("-" * 72)
for provider, path in shipping_pdfs.items():
    doc = fitz.open(path)
    pages = len(doc)
    doc.close()
    print(f"{provider:<14s} {Path(path).name:<50s} {pages:>5d}")

In [None]:
# Run pipeline on ALL 6 PDFs concurrently
results_ship, merged_ship, paths_ship, elapsed_ship = await run_pipeline_async(
    "contracts/au_shipping_stem.json",
    list(shipping_pdfs.values()),
    output_dir="outputs/shipping",
)

In [None]:
# Per-document summary
print(f"{'Provider':<14s} {'Records':>8s}")
print("-" * 24)
total = 0
for (provider, _), result in zip(shipping_pdfs.items(), results_ship):
    n = sum(len(df) for df in result.dataframes.values())
    total += n
    print(f"{provider:<14s} {n:>8d}")
print("-" * 24)
print(f"{'TOTAL':<14s} {total:>8d}")
print(f"\nWall-clock time: {elapsed_ship:.1f}s")

In [None]:
# Side-by-side gallery: first page of the first three providers
for provider, path in list(shipping_pdfs.items())[:3]:
    img_b64, _ = render_document_page(path, page=0, dpi=100)
    compressed = compress_spatial_text(path, refine_headers=False)
    # Truncate compressed text for display
    lines = compressed.splitlines()[:25]
    truncated = "\n".join(lines) + "\n..."
    print(f"\n--- {provider} ---")
    side_by_side_display(image_b64=img_b64, compressed_text=truncated, max_height=350)

In [None]:
# Unified DataFrame — all providers in one schema
df_vessel_frames = merged_ship["vessels"]
df_all = pd.concat(df_vessel_frames, ignore_index=True) if isinstance(df_vessel_frames, list) else df_vessel_frames
print(f"Unified DataFrame: {df_all.shape}")
display(df_all.head(20))

In [None]:
# Distribution by provider (inferred from source document)
# Add source_provider from result ordering
provider_frames = []
for (provider, _), result in zip(shipping_pdfs.items(), results_ship):
    for df in result.dataframes.values():
        dfp = df.copy()
        dfp["source_provider"] = provider
        provider_frames.append(dfp)

if provider_frames:
    df_with_provider = pd.concat(provider_frames, ignore_index=True)
    print("Records by provider:")
    print(df_with_provider.groupby("source_provider").size().to_string())

---
## Use Case 3: ACEA Car Registrations

**Normalization: pivoted table to flat records.**

The ACEA press release PDF contains a dense pivoted table: 28 countries x 7 power types
x 3 metrics. The pipeline unpivots this into flat records — a 23-column wide table
becomes a 4-column long DataFrame.

**Deterministic mapping**: After header separator standardization (stacked headers
joined with ` / `), headers like `BATTERY ELECTRIC / Dec-25` have parts that match
schema aliases for motorization type and period. When aliases cover all header parts,
the deterministic mapper handles the full unpivot — each data row produces one record
per power type — with zero LLM calls. Vision mode is only needed when headers are
garbled and aliases cannot match.
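
The unpivot itself is simple once the headers are in `POWER TYPE / PERIOD` form. The sketch below is a stand-alone illustration with made-up numbers, not the library's mapper. The column names `country`, `car_motorization`, and `new_car_registration` are taken from the pivot cell later in this notebook; `period` is an assumed name for the fourth column.

```python
def unpivot(headers: list[str], rows: list[list[str]]) -> list[dict]:
    """Turn each `POWER TYPE / PERIOD` column into its own record per data row."""
    records = []
    for row in rows:
        country = row[0]
        # Skip the first column (country) and walk the pivoted value columns.
        for header, cell in zip(headers[1:], row[1:]):
            power_type, period = header.split(" / ")
            records.append({
                "country": country,
                "car_motorization": power_type,
                "period": period,
                "new_car_registration": int(cell.replace(",", "")),
            })
    return records


# Made-up numbers, for illustration only.
headers = ["Country", "BATTERY ELECTRIC / Dec-25", "PETROL / Dec-25"]
rows = [["Austria", "1,000", "2,000"], ["Belgium", "3,000", "4,000"]]

records = unpivot(headers, rows)
print(len(records))  # 4
print(records[0])
```

Two data rows with two value columns yield four records, so the long table grows with the number of countries times the number of pivoted columns.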

In [None]:
# Load the ACEA contract
acea_cc = load_contract("contracts/acea_car_registrations.json")
print(f"  Contract: {acea_cc.provider}, Model: {acea_cc.model}, Outputs: {list(acea_cc.outputs.keys())}")

print("\nSchema:")
for col in acea_cc.outputs["registrations_by_market"].schema.columns:
    print(f"  {col.name:25s} {col.type:6s} {col.description}")

In [None]:
# Run pipeline on ACEA press release PDF (vision-enabled)
acea_pdf = "inputs/Press_release_car_registrations_December_2025.pdf"

results_acea, merged_acea, _, elapsed_acea = await run_pipeline_async(
    "contracts/acea_car_registrations.json",
    [acea_pdf],
    output_dir="outputs/acea",
)

In [None]:
# Side-by-side: PDF | compressed pivot table | unpivoted DataFrame
img_b64, _ = render_document_page(acea_pdf, page=0, dpi=120)
acea_compressed = compress_spatial_text(acea_pdf, refine_headers=False)
acea_df = list(merged_acea.values())[0]
if isinstance(acea_df, list):
    acea_df = acea_df[0]

# Show first page of compressed text
first_page = acea_compressed.split("\f")[0]
side_by_side_display(image_b64=img_b64, compressed_text=first_page, dataframe=acea_df)

In [None]:
# Display unpivoted DataFrame
print(f"Shape: {acea_df.shape} (from wide pivoted table to long flat records)")
display(acea_df.head(20))

In [None]:
# Round-trip verification: pivot back to wide format
if "country" in acea_df.columns and "car_motorization" in acea_df.columns:
    try:
        pivot = acea_df.pivot_table(
            index="country",
            columns="car_motorization",
            values="new_car_registration",
            aggfunc="sum",
        )
        print(f"Pivoted back: {pivot.shape} (countries x power types)")
        display(pivot.head(10))
    except Exception as e:
        print(f"Pivot failed: {e}")
else:
    print("Columns not available for pivot")

---
## Summary

| Use Case | Format | Documents | Elapsed | Key Feature |
|----------|--------|-----------|---------|-------------|
| Russian Agriculture | DOCX | 2 reports | see above | Multi-category, pivot years, deterministic mapping |
| Australian Shipping | PDF | 6 providers | see above | One schema, 6 layouts, full concurrency |
| ACEA Registrations | PDF | 1 press release | see above | Deterministic unpivot, vision fallback |

**Deterministic-first architecture**: Tables with complete alias coverage in the schema
are interpreted via pure string matching — no LLM calls, no latency, no cost. The LLM
pipeline activates only as a fallback for pages with unmatched header parts.

In [None]:
# Summary statistics
total_docs = 2 + 6 + 1  # June + July + 6 shipping + 1 ACEA
total_records = 0
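for m in [merged_june, merged_july, merged_ship, merged_acea]:
    for v in m.values():
        # save() returns a list of per-document frames per output; count all of them
        frames = v if isinstance(v, list) else [v]
        total_records += sum(len(df) for df in frames)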

total_time = elapsed_june + elapsed_july + elapsed_ship + elapsed_acea

print(f"Total documents processed: {total_docs}")
print(f"Total records extracted:   {total_records:,}")
print(f"Total wall-clock time:     {total_time:.1f}s")
print(f"Throughput:                {total_records / total_time:.0f} records/s")