# pdf-ocr: Capabilities Walkthrough

You have documents. You need data.

This notebook walks through the complete transformation pipeline — from raw PDF/DOCX
to clean, typed DataFrames — one step at a time. Each section builds on the previous,
showing how the library turns messy, unstructured documents into structured records.

**Story arc:** The Problem → Spatial Reconstruction → Compression → Structure Detection → Classification → Schema Definition → LLM Interpretation → Serialization

In [None]:
# Setup — file paths and imports
from pathlib import Path

# PDF inputs
PDF_2857439    = "inputs/2857439.pdf"
PDF_CBH        = "inputs/CBH Shipping Stem 26092025.pdf"
PDF_BUNGE      = "inputs/Bunge_loadingstatement_2025-09-25.pdf"
PDF_GRAINCORP  = "inputs/shipping-stem-2025-11-13.pdf"

# DOCX inputs
DOCX_HSPAN     = "inputs/docx/synthetic/hspan.docx"
DOCX_MULTI     = "inputs/docx/synthetic/multi_category.docx"
DOCX_JULY      = "inputs/docx/input/2025-07-17_10-16-25.Russian weekly grain EOW July 11-12 2025-1.docx"

print("Paths configured.")

In [None]:
# Display helpers — render document pages and side-by-side comparisons
import base64, html as html_mod, shutil, subprocess, tempfile
from IPython.display import display, HTML


def render_document_page(path, page=0, dpi=150):
    """Render a document page as a base64 PNG image.

    Supports PDF (via fitz) and DOCX (via LibreOffice → PDF → fitz).
    Returns (base64_png, page_count).
    """
    import fitz

    path = str(path)
    if path.lower().endswith((".docx", ".doc")):
        soffice = shutil.which("soffice") or "/Applications/LibreOffice.app/Contents/MacOS/soffice"
        with tempfile.TemporaryDirectory() as tmpdir:
            subprocess.run(
                [soffice, "--headless", "--convert-to", "pdf", "--outdir", tmpdir, path],
                capture_output=True, check=True,
            )
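            pdf_path = next(Path(tmpdir).glob("*.pdf"))
            # Load the PDF into memory so the document outlives the temporary directory
            doc = fitz.open(stream=pdf_path.read_bytes(), filetype="pdf")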
    else:
        doc = fitz.open(path)

    page_count = len(doc)
    pix = doc[page].get_pixmap(dpi=dpi)
    b64 = base64.b64encode(pix.tobytes("png")).decode()
    doc.close()
    return b64, page_count


def side_by_side_display(*, image_b64=None, compressed_text=None, dataframe=None, max_height=500):
    """Display up to 3 panels side-by-side: document image | compressed text | DataFrame."""
    panels = []
    style = f"overflow-y:auto; max-height:{max_height}px; border:1px solid #ddd; padding:6px; flex:1"
    if image_b64:
        panels.append(f'<div style="{style}"><img src="data:image/png;base64,{image_b64}" style="width:100%"></div>')
    if compressed_text:
        escaped = html_mod.escape(compressed_text)
        panels.append(f'<div style="{style}"><pre style="font-size:11px; margin:0; white-space:pre">{escaped}</pre></div>')
    if dataframe is not None:
        df_html = dataframe.to_html(index=False, max_rows=30)
        panels.append(f'<div style="{style}; font-size:11px">{df_html}</div>')
    display(HTML(f'<div style="display:flex; gap:8px; align-items:flex-start">{" ".join(panels)}</div>'))


print("Display helpers ready.")

---
## Section 1: The Problem

Standard text extraction tools produce jumbled output from tabular documents.
PDFs store text in arbitrary file order. DOCX merged cells create duplicate values.
The result: unusable text that can't be parsed into structured data.

In [None]:
# PDF: page.get_text() (PyMuPDF / fitz) returns text in file storage order, not reading order
import fitz

doc = fitz.open(PDF_2857439)
raw = doc[0].get_text()
doc.close()

print(f"Raw text length: {len(raw)} chars")
print("First 600 chars (note the jumbled column order):")
print(raw[:600])

In [None]:
# DOCX: python-docx returns duplicate _tc references for horizontally merged cells
from docx import Document

doc = Document(DOCX_HSPAN)
table = doc.tables[0]

print(f"Table: {len(table.rows)} rows x {len(table.columns)} columns")
print("\nRaw cell text (note duplicates from merged cells):")
for i, row in enumerate(table.rows):
    cells = [c.text for c in row.cells]
    print(f"  Row {i}: {cells}")

---
## Section 2: Spatial Reconstruction

The first step is format-specific: reconstruct the visual layout from the raw document.

- **PDF**: Project every text span onto a monospace character grid using (x, y) coordinates
- **DOCX**: Deduplicate merged `_tc` references and stack multi-row headers with ` / ` separators
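
The PDF half of this step can be pictured with a small standalone sketch. It is not the library's implementation: the `project_to_grid` name, the `(x, y, text)` span tuples, and the fixed `char_width` / `line_height` values are all assumptions made for illustration. Each span is simply written into a character grid at the column and row implied by its coordinates.

```python
def project_to_grid(spans, char_width=6.0, line_height=12.0):
    """Place (x, y, text) spans on a character grid.

    Illustrative only: char_width and line_height are assumed constants,
    and a later span overwrites an earlier one if they collide.
    """
    rows = {}
    for x, y, text in spans:
        row = round(y / line_height)
        col = round(x / char_width)
        line = rows.setdefault(row, [])
        # Pad the line with spaces so the span fits at its column.
        if len(line) < col + len(text):
            line.extend(" " * (col + len(text) - len(line)))
        line[col:col + len(text)] = text
    if not rows:
        return ""
    first, last = min(rows), max(rows)
    return "\n".join("".join(rows.get(r, [])).rstrip() for r in range(first, last + 1))


# Hypothetical spans, deliberately listed out of reading order.
spans = [
    (180.0, 24.0, "Tonnes"),
    (0.0, 36.0, "MV Example"),
    (0.0, 24.0, "Vessel"),
    (180.0, 36.0, "55000"),
]
grid = project_to_grid(spans)
print(grid)
```

Because position comes from the coordinates rather than from list order, the two columns line up regardless of how the spans were stored.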

In [None]:
# PDF: Spatial text preserves the visual layout as a monospace grid
from pdf_ocr import pdf_to_spatial_text

spatial = pdf_to_spatial_text(PDF_2857439)
print(f"Spatial text: {len(spatial)} chars, {len(spatial.splitlines())} lines")
print()
# Show first 30 lines — columns are aligned as they appear in the PDF
for line in spatial.splitlines()[:30]:
    print(line)

In [None]:
# DOCX: Extract tables with proper merge handling and compound headers
from pdf_ocr import extract_tables_from_docx

tables = extract_tables_from_docx(DOCX_HSPAN)
t = tables[0]
print(f"hspan.docx: {len(t.column_names)} columns, {len(t.data)} data rows")
print(f"Source format: {t.source_format}")
print(f"\nCompound column names (stacked with ' / '):")
for col in t.column_names:
    print(f"  {col}")

In [None]:
# Side-by-side: PDF page image vs spatial text reconstruction
img_b64, _ = render_document_page(PDF_2857439, page=0)
side_by_side_display(image_b64=img_b64, compressed_text=spatial)

---
## Section 3: Compression

Spatial text is verbose — it preserves every whitespace character. Compression analyzes
the page structure and renders each region in its optimal format:

- **Tables** → pipe-delimited markdown
- **Text blocks** → flowing paragraphs
- **Key-value pairs** → `key: value` lines

From this point forward, both PDF and DOCX produce the **same pipe-table format**.
All downstream processing (classification, interpretation, serialization) is format-agnostic.
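
To make the shared target format concrete, here is a minimal sketch that stacks multi-row headers with ` / ` and renders the result as a pipe table. The helper names `stack_headers` and `to_pipe_table` are hypothetical, and the sample rows are invented. The sketch assumes a merged header cell has already been expanded so its text repeats in every column it spans.

```python
def stack_headers(header_rows):
    """Combine multi-row headers column by column, joining distinct parts with ' / '."""
    stacked = []
    for parts in zip(*header_rows):
        unique = []
        for part in parts:
            part = part.strip()
            # Skip blanks and vertical repeats left over from merged cells.
            if part and part not in unique:
                unique.append(part)
        stacked.append(" / ".join(unique))
    return stacked


def to_pipe_table(headers, rows):
    """Render headers and rows as pipe-delimited markdown."""
    lines = ["| " + " | ".join(headers) + " |", "|" + "---|" * len(headers)]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines)


# Invented sample: "Region" is merged vertically, "spring crops" horizontally.
header_rows = [
    ["Region", "spring crops", "spring crops"],
    ["Region", "MOA Target 2025", "2025"],
]
data = [["Belgorod", 100, 90], ["Bryansk", 80, 70]]
table_md = to_pipe_table(stack_headers(header_rows), data)
print(table_md)
```

Once a table is in this shape, nothing downstream needs to know whether it came from a PDF or a DOCX file.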

In [None]:
# PDF compression without header refinement (deterministic, no LLM call)
from pdf_ocr import compress_spatial_text

compressed_raw = compress_spatial_text(PDF_2857439, refine_headers=False)
print(f"Compressed (no refinement): {len(compressed_raw)} chars")
print()
for line in compressed_raw.splitlines()[:20]:
    print(line)

In [None]:
# PDF compression WITH header refinement (calls LLM to fix stacked headers)
compressed = compress_spatial_text(PDF_2857439, refine_headers=True)
print(f"Compressed (refined): {len(compressed)} chars")
print(f"Compression ratio: {len(spatial) / len(compressed):.1f}x vs spatial text")
print()
for line in compressed.splitlines()[:20]:
    print(line)

In [None]:
# DOCX compression produces the SAME pipe-table format
from pdf_ocr import compress_docx_tables

docx_results = compress_docx_tables(DOCX_HSPAN)
docx_md, docx_meta = docx_results[0]
print(f"DOCX pipe-table: {docx_meta['row_count']} rows x {docx_meta['col_count']} cols")
print()
print(docx_md)

In [None]:
# Compression ratio comparison across 4 PDFs
pdfs = {
    "Newcastle (2857439)": PDF_2857439,
    "CBH": PDF_CBH,
    "Bunge": PDF_BUNGE,
    "GrainCorp (3pp)": PDF_GRAINCORP,
}

print(f"{'Document':<22s} {'Spatial':>8s} {'Compressed':>11s} {'Ratio':>6s}")
print("-" * 50)
for name, path in pdfs.items():
    sp = pdf_to_spatial_text(path)
    cp = compress_spatial_text(path, refine_headers=False)
    ratio = len(sp) / len(cp) if len(cp) > 0 else 0
    print(f"{name:<22s} {len(sp):>7,d}c {len(cp):>10,d}c {ratio:>5.1f}x")

---
## Section 4: Structure Detection Heuristics

Before compression, the library runs three layers of heuristics to detect table structure:

- **Text heuristics (TH)**: Cell type classification, header row estimation
- **Visual heuristics (VH)**: Grid lines, header fills, zebra striping, section separators
- **Font heuristics (FH)**: Size hierarchy, bold patterns, monospace detection

Results are cross-validated: agreements strengthen confidence, contradictions trigger fallbacks.
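
One way to picture the cross-validation idea is a simple vote over per-layer estimates of the header row count. This is a sketch under assumptions, not the library's logic: the `cross_validate` function, the `0.25` fallback confidence, and the choice of the text layer as fallback are all made up for illustration.

```python
from collections import Counter


def cross_validate(estimates, fallback_source="text"):
    """Combine per-layer estimates into (value, confidence, reason).

    estimates maps a layer name to its estimate, or None if the layer had no signal.
    """
    votes = {name: v for name, v in estimates.items() if v is not None}
    if not votes:
        return None, 0.0, "no signal"
    value, count = Counter(votes.values()).most_common(1)[0]
    if count == 1 and len(votes) > 1:
        # Every layer disagrees: fall back to one designated layer at low confidence.
        chosen = votes.get(fallback_source, value)
        return chosen, 0.25, f"fallback to {fallback_source}"
    reason = "agreement" if count > 1 else "single signal"
    # Confidence grows with the share of layers backing the winning value.
    return value, count / len(estimates), reason


print(cross_validate({"text": 2, "visual": 2, "font": 2}))
print(cross_validate({"text": 2, "visual": 2, "font": None}))
print(cross_validate({"text": 1, "visual": 2, "font": 3}))
```

The point of the sketch is the shape of the decision: agreement raises confidence, and a full contradiction routes to a conservative default rather than picking a winner arbitrarily.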

In [None]:
# Visual and font heuristics on CBH (has bold headers, visual grid)
from pdf_ocr import compress_spatial_text_structured
from pdf_ocr.compress import _analyze_visual_structure, _analyze_font_structure
from pdf_ocr.spatial_text import _extract_page_layout
import fitz

# Get the page layout with visual elements and font spans
doc = fitz.open(PDF_CBH)
layout = _extract_page_layout(doc[0], extract_visual=True)
doc.close()

# Run visual heuristics (VH1-VH4)
row_y = list(layout.row_y_positions.values()) if layout.row_y_positions else []
vis = _analyze_visual_structure(layout.visual, row_y)

print("=== Visual Heuristics (CBH) ===")
if vis.grid:
    print(f"  VH1 Grid: {vis.grid.h_line_count} h-lines, {vis.grid.v_line_count} v-lines, has_grid={vis.grid.has_grid}")
if vis.header:
    print(f"  VH2 Header fill: {len(vis.header.header_fill_rows)} rows highlighted")
if vis.zebra:
    print(f"  VH3 Zebra: rows {vis.zebra.zebra_start_row}-{vis.zebra.zebra_end_row}")
if vis.separators:
    print(f"  VH4 Separators: {len(vis.separators.separator_y_positions)} section breaks")

# Run font heuristics (FH1-FH6)
font = _analyze_font_structure(layout.font_spans, row_y, header_row_estimate=2)

print("\n=== Font Heuristics (CBH) ===")
if font.hierarchy:
    print(f"  FH1 Size tiers: {len(font.hierarchy.size_tiers)} tiers")
    print(f"       Header size: {font.hierarchy.header_size}, Body size: {font.hierarchy.body_size}")
if font.bold:
    print(f"  FH2 Bold: header={font.bold.header_bold_ratio:.0%}, data={font.bold.data_bold_ratio:.0%}")
if font.monospace:
    print(f"  FH4 Monospace columns: {font.monospace.monospace_columns}")

---
## Section 5: Classification

When a document contains multiple tables, classification assigns each table to a
category using keyword matching against header text. This is format-agnostic — both
PDF and DOCX tables are classified using the same `classify_tables()` function
operating on pipe-table markdown + metadata tuples.
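
Keyword matching against header text can be sketched in a few lines. The function below is a hypothetical stand-in, not `classify_tables()` itself: it counts case-insensitive substring hits per category and returns the best one, using an assumed `"other"` label when nothing matches.

```python
def classify_by_keywords(header_text, categories, min_hits=1):
    """Return the category with the most keyword hits in header_text, else 'other'."""
    text = header_text.lower()
    best, best_hits = "other", 0
    for category, keywords in categories.items():
        hits = sum(1 for kw in keywords if kw.lower() in text)
        if hits >= min_hits and hits > best_hits:
            best, best_hits = category, hits
    return best


sample_categories = {
    "shipping": ["cargo", "port", "vessel"],
    "hr": ["employee", "salary", "department"],
}
print(classify_by_keywords("| Vessel | Port | Cargo |", sample_categories))
print(classify_by_keywords("| Employee | Salary |", sample_categories))
print(classify_by_keywords("| Foo | Bar |", sample_categories))
```

Note that plain substring matching means `port` also hits `export`; whole-word matching may be a better fit when categories share word stems.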

In [None]:
# DOCX classification: multi_category.docx has 4 different table types
from pdf_ocr import classify_docx_tables

categories = {
    "shipping": ["cargo", "port", "vessel"],
    "hr": ["employee", "salary", "department"],
    "finance": ["revenue", "expenses"],
    "inventory": ["stock", "warehouse"],
}

classes = classify_docx_tables(DOCX_MULTI, categories)
print(f"multi_category.docx: {len(classes)} tables classified")
for c in classes:
    print(f"  Table {c['index']}: {c['category']:12s} title={c['title'] or '(none)':25s} {c['rows']}r x {c['cols']}c")

In [None]:
# PDF classification: CBH has side-by-side tables → StructuredTable.to_compressed()
from pdf_ocr import classify_tables

structured_tables = compress_spatial_text_structured(PDF_CBH)
compressed_tuples = [t.to_compressed() for t in structured_tables]

shipping_cats = {
    "shipping": ["vessel", "ship", "cargo", "commodity", "eta", "port", "loading"],
}

cbh_classes = classify_tables(compressed_tuples, shipping_cats)
print(f"CBH PDF: {len(structured_tables)} tables found, {len(cbh_classes)} classified")
for c in cbh_classes:
    print(f"  Table {c['index']}: {c['category']:12s} {c['rows']}r x {c['cols']}c")

In [None]:
# Real-world DOCX: Russian July report — harvest, planting, and export tables
ag_categories = {
    "harvest":  ["area harvested", "yield", "collected", "bunker", "centner"],
    "planting": ["spring crops", "moa target", "sown area", "planting", "sowing"],
    "export":   ["export", "shipment", "ports", "fob"],
}

july_classes = classify_docx_tables(DOCX_JULY, ag_categories)
counts = {}
for c in july_classes:
    counts[c["category"]] = counts.get(c["category"], 0) + 1
print(f"Russian July report: {len(july_classes)} tables")
print(f"  By category: {dict(sorted(counts.items()))}")

---
## Section 6: Schema Definition

A `CanonicalSchema` tells the LLM what output structure you want. Each `ColumnDef`
specifies a column name, type, description, and aliases (alternative names the LLM
should recognize in the source table). The schema is domain-specific — you define it
once and reuse it across any number of documents.

In [None]:
from pdf_ocr import CanonicalSchema, ColumnDef

# Define a shipping schema — works for any Australian shipping stem PDF
shipping_schema = CanonicalSchema.from_dict({
    "description": "Vessel loading records by port",
    "columns": [
        {"name": "load_port",    "type": "string", "description": "Loading port name",
         "aliases": ["Port", "PORT", "Loading Port"]},
        {"name": "vessel_name",  "type": "string", "description": "Name of the vessel",
         "aliases": ["Ship Name", "NAME OF SHIP", "Vessel Name", "Vessel"]},
        {"name": "shipper",      "type": "string", "description": "Exporting company",
         "aliases": ["Exporter", "EXPORTER", "Client", "Shipper"]},
        {"name": "commodity",    "type": "string", "description": "Type of commodity",
         "aliases": ["Commodity", "COMMODITY", "Cargo"]},
        {"name": "tons",         "type": "int",    "description": "Quantity in metric tonnes",
         "aliases": ["Quantity (tonnes)", "QUANTITY (TONNES)", "Volume", "Tonnes"]},
        {"name": "eta",          "type": "string", "description": "Estimated time of arrival",
         "aliases": ["ETA", "Date ETA of Ship To"]},
    ],
})

print(f"Schema: {shipping_schema.description}")
print(f"Columns ({len(shipping_schema.columns)}):")
for col in shipping_schema.columns:
    print(f"  {col.name:15s} {col.type:6s}  aliases={col.aliases}")

---
## Section 7: Interpretation

The interpretation pipeline uses a **deterministic-first** architecture:

1. **Deterministic mapping** (no LLM): Split each header on ` / `, match parts against
   schema aliases. If every part matches, build records via string matching — zero LLM calls.
   Handles unpivoting, sections, and dimension/measure classification purely from aliases.

2. **LLM fallback** (2-step pipeline): Only for pages with unmatched columns.
   - **Step 1 — Parse**: Analyze table structure, detect headers, merge multi-row records
   - **Step 2 — Map**: Map parsed columns to canonical schema using aliases and descriptions

Multi-page documents are auto-split and processed concurrently. Step 2 is batched
(default 20 rows per batch) to prevent LLM truncation on dense pages.
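
The batching idea can be sketched independently of any LLM client. In the snippet below, `batch_rows` and `map_batch` are hypothetical names, and `map_batch` is a stand-in that merely upper-cases its input where a real mapping call would go. Only the batch size of 20 comes from the text above.

```python
from concurrent.futures import ThreadPoolExecutor


def batch_rows(rows, batch_size=20):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def map_batch(batch):
    # Stand-in for the per-batch mapping call (assumption: one request per batch).
    return [row.upper() for row in batch]


rows = [f"row {i}" for i in range(45)]
batches = list(batch_rows(rows))
print([len(b) for b in batches])

# pool.map returns results in input order, so records stay in table order.
with ThreadPoolExecutor(max_workers=4) as pool:
    mapped = [record for out in pool.map(map_batch, batches) for record in out]
print(len(mapped))
```

Small batches keep each response short enough to come back complete, at the cost of more requests per page.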

**When does deterministic mapping fire?** Whenever the schema aliases fully cover
every header part. This is typical for DOCX compound headers (`spring crops / 2025`)
and PDF hierarchical headers (`BATTERY ELECTRIC / Dec-25`) when the schema defines
aliases for every part.
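
The alias-coverage rule can be illustrated with a standalone sketch. Everything here is an assumption for illustration: the plain-dict schema, the `parse_pipe_table` and `map_deterministically` helpers, and the rule that a string-typed column takes its value from the header part while any other type takes it from the cell. The library's `_try_deterministic` may work differently.

```python
def parse_pipe_table(text):
    """Split a pipe table into (header cells, body rows), ignoring non-table lines."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")
    ]
    return rows[0], rows[2:]  # rows[1] is the |---| separator


def map_deterministically(text, schema):
    """schema: {column: (type, [aliases])}. Returns records, or None if any part is unmatched."""
    alias_to_col = {a.lower(): name for name, (_, aliases) in schema.items() for a in aliases}
    header, body = parse_pipe_table(text)
    plans = []
    for head in header:
        parts = [p.strip() for p in head.split(" / ")]
        cols = [alias_to_col.get(p.lower()) for p in parts]
        if None in cols:
            return None  # uncovered header part: a fallback would take over here
        plans.append(list(zip(cols, parts)))
    records = []
    for row in body:
        shared, groups = {}, {}
        for plan, cell in zip(plans, row):
            if len(plan) == 1:
                col = plan[0][0]
                shared[col] = cell if schema[col][0] == "string" else float(cell)
                continue
            # String-typed parts act as dimensions; the rest receive the cell value.
            dims = {c: p for c, p in plan if schema[c][0] == "string"}
            group = groups.setdefault(tuple(sorted(dims.items())), dict(dims))
            for c, _ in plan:
                if schema[c][0] != "string":
                    group[c] = float(cell)
        records.extend({**shared, **g} for g in groups.values())
    return records


sketch_table = """| Region | spring crops / MOA Target 2025 | spring crops / 2025 |
|---|---|---|
| Belgorod | 100 | 90 |
| Bryansk | 80 | 70 |"""
sketch_schema = {
    "region": ("string", ["Region"]),
    "area": ("float", ["MOA Target 2025"]),
    "value": ("float", ["2025"]),
    "crop": ("string", ["spring crops"]),
}
sketch_records = map_deterministically(sketch_table, sketch_schema)
for rec in sketch_records:
    print(rec)
```

Dropping the `"2025"` alias from the schema makes the function return `None`, which is the point where an LLM fallback would be needed.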

In [None]:
# Deterministic mapping demo — zero LLM calls
from pdf_ocr.interpret import _try_deterministic

# A Russian-style compound header table
demo_text = """## PLANTING
| Region | spring crops / MOA Target 2025 | spring crops / 2025 | spring grain / MOA Target 2025 | spring grain / 2025 |
|---|---|---|---|---|
| Belgorod | 100 | 90 | 50 | 45 |
| Bryansk | 80 | 70 | 40 | 35 |"""

demo_schema = CanonicalSchema(columns=[
    ColumnDef("region", "string", "Region", aliases=["Region"]),
    ColumnDef("area", "float", "Target area", aliases=["MOA Target 2025"]),
    ColumnDef("value", "float", "Actual area", aliases=["2025"]),
    ColumnDef("crop", "string", "Crop type", aliases=["spring crops", "spring grain"]),
])

det_result = _try_deterministic(demo_text, demo_schema)
print(f"Deterministic result: {len(det_result.records)} records (0 LLM calls)")
print(f"Model: {det_result.metadata.model}")
print(f"Strategy: {det_result.metadata.table_type_inference.mapping_strategy_used}")
print()
for r in det_result.records:
    print(f"  region={r.region:10s} crop={r.crop:15s} area={str(r.area):>4s} value={r.value}")


In [None]:
# Single PDF: interpret_table() on Newcastle shipping stem
import time
from pdf_ocr import interpret_table, to_records, to_pandas

t0 = time.perf_counter()
result = interpret_table(compressed, shipping_schema, model="openai/gpt-4o")
elapsed = time.perf_counter() - t0

records = to_records(result)
print(f"Newcastle: {len(records)} records in {elapsed:.1f}s")
print(f"Pages: {sorted(result.keys())}")
print(f"\nFirst 3 records:")
for r in records[:3]:
    print(f"  {r}")

In [None]:
# Side-by-side: PDF page | compressed text | DataFrame
import pandas as pd

df_newcastle = to_pandas(result, shipping_schema)
img_b64, _ = render_document_page(PDF_2857439, page=0)
side_by_side_display(image_b64=img_b64, compressed_text=compressed, dataframe=df_newcastle)

In [None]:
# Multi-page PDF: GrainCorp (3 pages, ~180 records, concurrent processing)
gc_compressed = compress_spatial_text(PDF_GRAINCORP, refine_headers=True)
pages = gc_compressed.split("\f")
print(f"GrainCorp: {len(pages)} pages")

t0 = time.perf_counter()
gc_result = interpret_table(gc_compressed, shipping_schema, model="openai/gpt-4o")
elapsed = time.perf_counter() - t0

gc_records = to_records(gc_result)
print(f"Total records: {len(gc_records)} in {elapsed:.1f}s (all pages concurrent)")
for page_num, mapped in sorted(gc_result.items()):
    print(f"  Page {page_num}: {len(mapped.records)} records")

In [None]:
# DOCX: interpret multiple harvest tables from Russian July report concurrently
from pdf_ocr import interpret_tables, extract_pivot_values

harvest_idx = [c["index"] for c in july_classes if c["category"] == "harvest"]
july_results = compress_docx_tables(DOCX_JULY, table_indices=harvest_idx)

# Build harvest schema with dynamic year aliases
pivot_years = extract_pivot_values(july_results[0][0])
harvest_schema = CanonicalSchema.from_dict({
    "description": "Harvest progress by region, metric, and year",
    "columns": [
        {"name": "region",  "type": "string", "aliases": ["Region"]},
        {"name": "metric",  "type": "string", "aliases": ["Area harvested", "collected", "Yield"]},
        {"name": "year",    "type": "int",    "aliases": pivot_years[-2:]},
        {"name": "value",   "type": "float",  "aliases": []},
    ],
})
print(f"Year aliases from headers: {pivot_years[-2:]}")

texts = [md for md, _ in july_results]
titles = [meta["title"] or "Unknown" for _, meta in july_results]

t0 = time.perf_counter()
mapped_tables = interpret_tables(texts, harvest_schema, model="openai/gpt-4o")
elapsed = time.perf_counter() - t0

frames = []
for title, mapped in zip(titles, mapped_tables):
    df = to_pandas(mapped, harvest_schema)
    df["crop"] = title
    frames.append(df)
    print(f"  {title}: {len(df)} records")

df_july = pd.concat(frames, ignore_index=True)
print(f"\nTotal July harvest: {len(df_july)} records ({elapsed:.1f}s concurrent)")
df_july.head(12)

---
## Section 8: Serialization

Typed export to CSV, TSV, Parquet, pandas, and polars. The serializer validates records
against the schema using Pydantic, with automatic coercion of OCR artifacts
(e.g., `"1,234"` → `1234`, `"(500)"` → `-500`). Output formatting is driven by
`ColumnDef.format` — date patterns, number patterns, string case transformations.
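
The kind of coercion described above can be sketched in plain Python. This is not the library's Pydantic validator. `coerce_int` is a hypothetical helper that handles only the two artifacts named in the text (thousands separators and accounting-style negatives) and returns `None` for anything else.

```python
import re


def coerce_int(raw):
    """Coerce strings such as '1,234' or '(500)' to int; return None if not possible."""
    if isinstance(raw, int) and not isinstance(raw, bool):
        return raw
    text = str(raw).strip()
    # Accounting notation wraps negative numbers in parentheses.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    text = text.replace(",", "").replace(" ", "")
    if not re.fullmatch(r"[+-]?\d+", text):
        return None
    value = int(text)
    return -value if negative else value


for sample in ["1,234", "(500)", " 42 ", "n/a"]:
    print(repr(sample), "->", coerce_int(sample))
```

Returning `None` instead of raising keeps a single bad cell from aborting a whole export; whether that or a hard failure is preferable depends on the downstream consumer.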

---
## Summary

| Step | Function | Input | Output | Format-Specific? |
|------|----------|-------|--------|-------------------|
| 1. Extract | `pdf_to_spatial_text()` / `extract_tables_from_docx()` | Raw document | Spatial text / StructuredTable | Yes |
| 2. Compress | `compress_spatial_text()` / `compress_docx_tables()` | Spatial text / StructuredTable | Pipe-table markdown (` / ` headers) | Yes (entry point) |
| 3. Classify | `classify_tables()` / `classify_docx_tables()` | Pipe-tables + keywords | Category assignments | No |
| 4. Define | `CanonicalSchema.from_dict()` | Column definitions | Schema object | No |
| 5. Interpret | `interpret_table()` / `interpret_tables()` | Pipe-table + schema | MappedTable records | No |
| 6. Serialize | `to_csv()` / `to_pandas()` / `to_parquet()` | MappedTable + schema | CSV / DataFrame / Parquet | No |

Steps 1-2 are format-specific. Steps 3-6 are completely format-agnostic — the same
code works identically for PDF, DOCX, XLSX, PPTX, and HTML inputs.

Step 5 uses a **deterministic-first** architecture: when schema aliases fully cover
every ` / `-separated header part, records are built via string matching with zero LLM
calls. The LLM pipeline activates only as a fallback for unmatched columns.
