# PDF-OCR Walkthrough

This notebook demonstrates why standard PDF extraction tools fail on real-world tabular documents, and how pdf-ocr's spatial approach solves the problem — producing clean, structured text that LLMs can reliably interpret.

We use `2857439.pdf` (a shipping stem with 11 vessel records) as the running example. It has dense multi-row records: each vessel spans 3 rows (dates, data, times) with stacked headers and no drawn table borders.

In [None]:
PDF_PATH = "inputs/2857439.pdf"
BUNGE_PATH = "inputs/Bunge_loadingstatement_2025-09-25.pdf"

## 1. Traditional PDF Extraction Tools

Let's try the most common Python libraries for extracting text and tables from PDFs, and see where they break down.

In [13]:
import fitz  # PyMuPDF

doc = fitz.open(PDF_PATH)
page = doc[0]

# Raw text extraction — returns text in PDF stream order
raw_text = page.get_text()
doc.close()

print("=== fitz page.get_text() ===")
print(raw_text)

=== fitz page.get_text() ===
UNIQUE SLOT 
REFERENCE 
NUMBER
NAME OF SHIP
DATE AT WHICH 
NOMINATION 
WAS RECEIVED
TIME AT WHICH 
NOMINATION 
WAS RECEIVED
DATE AT WHICH 
NOMINATION 
WAS ACCEPTED
TIME AT WHICH 
NOMINATION 
WAS ACCEPTED
PORT
DATE ETA OF 
SHIP FROM
TIME ETA SHIP 
FROM
DATE ETA OF 
SHIP TO
TIME ETA OF 
SHIP TO
DATE ETA OF 
GRAIN 
LOADING 
COMMENCEM
ENT
TIME ETA OF 
GRAIN 
LOADING 
COMMENCEM
ENT
DATE ETD OF 
SHIP
TIME ETD OF 
SHIP
EXPORTER
QUANTITY 
(TONNES)
COMMODITY
LOADING 
"COMMENCED" 
OR 
"COMPLETED" 
DATE LOADING 
COMPLETED
TIME LOADING 
COMPLETED
NOTES
BG20250025
AFRICAN DOVE
05/06/2025
4:04:00 PM
05/06/2025
4:53:00 PM
BUNBURY
01/07/2025
9:00:00 AM
10/07/2025
12:00:00 PM
07/07/2025
9:00:00 AM
17/07/2025
7:00:00 PM BUNGE
33020 WHEAT
COMPLETED
16/07/2025
8:29:00 AM
BG20250026
THE ETERNAL
25/06/2025
7:31:00 PM
26/06/2025
10:48:00 AM
BUNBURY
10/07/2025
9:00:00 AM
30/07/2025
12:00:00 PM
15/07/2025
9:00:00 AM
25/07/2025
7:00:00 PM BUNGE
62666 BARLEY
COMPLETED
26/07/2025
1:35

`get_text()` returns text in **PDF stream order** — the order objects were written into the file. Column headers, dates, ship names, and numeric values are jumbled into a single stream. There is no way to tell which value belongs to which column.

In [14]:
import pymupdf4llm

md_text = pymupdf4llm.to_markdown(PDF_PATH)
print("=== pymupdf4llm.to_markdown() ===")
print(md_text)

=== pymupdf4llm.to_markdown() ===
TIME ETA OF
GRAIN
LOADING
COMMENCEM
ENT



UNIQUE SLOT
REFERENCE
NUMBER



TIME AT WHICH

NOMINATION
WAS RECEIVED



DATE AT WHICH

NOMINATION
WAS ACCEPTED



TIME AT WHICH

NOMINATION
WAS ACCEPTED



TIME ETA SHIP
FROM



DATE ETA OF
SHIP TO



TIME ETA OF
SHIP TO



DATE ETA OF
GRAIN
LOADING
COMMENCEM
ENT



DATE ETD OF
SHIP



TIME ETD OF QUANTITY
EXPORTER COMMODITY
SHIP (TONNES)



LOADING
"COMMENCED"
OR
"COMPLETED"



DATE LOADING
COMPLETED



TIME LOADING

COMPLETED



NAME OF SHIP



DATE AT WHICH

NOMINATION
WAS RECEIVED



DATE ETA OF
PORT
SHIP FROM



NOTES



BG20250025 AFRICAN DOVE 05/06/2025 4:04:00 PM 05/06/2025 4:53:00 PM BUNBURY 01/07/2025 9:00:00 AM 10/07/2025 12:00:00 PM 07/07/2025 9:00:00 AM 17/07/2025 7:00:00 PM BUNGE 33020 WHEAT COMPLETED 16/07/2025 8:29:00 AM
BG20250026 THE ETERNAL 25/06/2025 7:31:00 PM 26/06/2025 10:48:00 AM BUNBURY 10/07/2025 9:00:00 AM 30/07/2025 12:00:00 PM 15/07/2025 9:00:00 AM 25/07/2025 7:00:00 PM BUNGE 626

`pymupdf4llm` attempts to reconstruct markdown from the PDF. It detects there is a table, but the result is garbled — headers and data don't align into the correct columns. Multi-row records (dates/data/times per vessel) are impossible to reconstruct from this output.

In [6]:
try:
    import pdfplumber

    with pdfplumber.open(PDF_PATH) as pdf:
        page = pdf.pages[0]
        tables = page.extract_tables()
        print(f"=== pdfplumber: {len(tables)} table(s) detected ===\n")

        for i, table in enumerate(tables):
            print(f"--- Table {i+1} ({len(table)} rows) ---")
            for row in table[:8]:
                print(row)
            if len(table) > 8:
                print(f"  ... ({len(table) - 8} more rows)")

        if not tables:
            text = page.extract_text()
            print("No tables detected. Raw text (first 500 chars):")
            print(text[:500])

except ImportError:
    print("pdfplumber is not installed (pip install pdfplumber)")
    print()
    print("Typical pdfplumber behavior on this PDF:")
    print("- extract_tables() returns 0 tables (no visible borders/lines)")
    print("- extract_text() returns stream-order text, similar to fitz")
    print()
    print("pdfplumber relies on visible table borders (lines/rectangles)")
    print("to detect table boundaries. Most shipping stems use whitespace")
    print("alignment without drawn borders, so pdfplumber finds nothing.")

pdfplumber is not installed (pip install pdfplumber)

Typical pdfplumber behavior on this PDF:
- extract_tables() returns 0 tables (no visible borders/lines)
- extract_text() returns stream-order text, similar to fitz

pdfplumber relies on visible table borders (lines/rectangles)
to detect table boundaries. Most shipping stems use whitespace
alignment without drawn borders, so pdfplumber finds nothing.


## 2. Why Traditional Tools Fail

The fundamental issue is **stream order vs. visual layout**:

- PDFs store text as positioned drawing commands — each text span has (x, y) coordinates and a string. The **file order is arbitrary** and rarely matches reading order.
- `get_text()` returns spans in file order, **destroying spatial relationships**.
- `pymupdf4llm` applies heuristics to reorder text, but breaks on multi-row records, narrow columns, and stacked headers.
- `pdfplumber`'s default table finder relies on **drawn table borders** (lines/rectangles) to detect tables. Most real-world documents (shipping stems, loading statements, financial reports) use whitespace alignment with no borders at all.

These tools try to *reorder* text into reading order. But tabular PDFs need **spatial positioning**: knowing that `"ADAGIO"` sits directly below `"Ship Name"` and to the right of `"Newcastle"`. No amount of reordering recovers that relationship — you need the original (x, y) coordinates.
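To make the distinction concrete, here is a toy sketch with hand-made spans rather than real PDF output. The coordinates and the `column_for` helper are assumptions made up for illustration. In stream order the values are just a list of strings, but with their x coordinates each value can be matched to the header it sits under.

```python
# Hypothetical spans as (x, y, text); coordinates are invented for illustration.
headers = [(50.0, 10.0, "Port"), (150.0, 10.0, "Ship Name"), (260.0, 10.0, "Tonnes")]
values = [(262.0, 30.0, "33020"), (51.0, 30.0, "Newcastle"), (149.0, 30.0, "ADAGIO")]

# Stream order alone: a flat list of strings with no column information.
stream_text = [text for _, _, text in values]


def column_for(x, headers):
    """Return the name of the header whose x origin is closest to x."""
    return min(headers, key=lambda h: abs(h[0] - x))[2]


# With coordinates, every value finds its column regardless of file order.
record = {column_for(x, headers): text for x, _, text in values}

print(stream_text)
print(record)
```

Nearest-header matching is the simplest possible rule; it is only meant to show that the (x, y) coordinates carry the information that reordering throws away.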

### A deeper problem: merged spans

Beyond stream order, some PDFs encode **multiple column values as a single text string**. The Bunge loading statement is a prime example — PyMuPDF returns the time and exporter as one span: `"7:00:00 PM BUNGE"`, and the quantity and commodity as another: `"33020 WHEAT"`.

No amount of reordering can split these — the values are fused at the PDF encoding level.

In [None]:
import fitz

doc = fitz.open(BUNGE_PATH)
page = doc[0]
data = page.get_text("dict")

print("=== Bunge: raw PDF spans (right side of table) ===\n")
for block in data["blocks"]:
    if block["type"] != 0:
        continue
    for line in block["lines"]:
        for span in line["spans"]:
            text = span["text"].strip()
            if not text:
                continue
            x = span["origin"][0]
            y = span["origin"][1]
            # Focus on the right side of the first few data rows
            if x >= 440 and 85 < y < 105:
                w = span["bbox"][2] - span["bbox"][0]
                print(f'  x={x:7.1f}  width={w:5.1f}  "{text}"')

print()
print("Notice: \"7:00:00 PM BUNGE\" is ONE span (width=38.7pt).")
print("The time and exporter are fused — no tool can separate them")
print("without knowing where the column boundary should be.")
doc.close()

## 3. Spatial Grid Rendering

Instead of reordering text, pdf-ocr projects every text span onto a **monospace character grid** that mirrors the physical page:

```
PDF span (x=120pt, y=200pt, "ADAGIO")  →  grid[row=15][col=20] = "ADAGIO"
```

The pipeline:

1. **Extract spans** — `page.get_text("dict")` returns each span with its text, origin (x, y), and bounding box
2. **Compute cell width** — median character width across the page (adapts to font size)
3. **Cluster y-coordinates** — group spans into rows (2pt tolerance for sub-pixel jitter)
4. **Map to grid** — `col = round((x - x_min) / cell_w)`, `row = cluster_index`
5. **Render** — write each span into a character buffer at its grid position

The result is plain text where columns, tables, and scattered labels appear exactly where they sit visually in the PDF.
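The five steps can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not pdf-ocr's actual implementation: the spans are hand-made `(x, y, text, width)` tuples, and `render_grid` is a hypothetical name.

```python
from statistics import median

# Hypothetical spans as (x, y, text, width_in_points).
spans = [
    (100.0, 50.0, "PORT", 24.0),
    (200.0, 50.3, "SHIP", 24.0),
    (100.0, 62.0, "BUNBURY", 42.0),
    (200.0, 61.8, "ADAGIO", 36.0),
]


def render_grid(spans, y_tol=2.0):
    # Step 2: cell width = median width of one character across all spans.
    cell_w = median(w / len(t) for _, _, t, w in spans)
    x_min = min(x for x, _, _, _ in spans)

    # Step 3: cluster y values into rows; a new row starts when y jumps by more than y_tol.
    row_ys = []
    for y in sorted(s[1] for s in spans):
        if not row_ys or y - row_ys[-1] > y_tol:
            row_ys.append(y)

    # Step 4: map every span to (row, col).
    rows = [[] for _ in row_ys]
    for x, y, text, _ in spans:
        r = min(range(len(row_ys)), key=lambda i: abs(row_ys[i] - y))
        col = round((x - x_min) / cell_w)
        rows[r].append((col, text))

    # Step 5: write each span into a character buffer at its column.
    lines = []
    for entries in rows:
        buf = []
        for col, text in sorted(entries):
            buf.extend(" " * (col - len(buf)))  # pad with spaces up to col
            buf[col:col + len(text)] = text
        lines.append("".join(buf))
    return "\n".join(lines)


grid = render_grid(spans)
print(grid)
```

Both `SHIP` and `ADAGIO` land at the same grid column because their x origins match, even though their y values differ slightly within the row tolerance.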

In [15]:
from pdf_ocr import pdf_to_spatial_text

spatial = pdf_to_spatial_text(PDF_PATH)
print("=== pdf_to_spatial_text() ===")
print(spatial)

=== pdf_to_spatial_text() ===
                                                                                                                                                                    DATE ETA OF   TIME ETA OF
                                                                                                                                                                                                                                                                   LOADING
UNIQUE SLOT                      DATE AT WHICH  TIME AT WHICH   DATE AT WHICH  TIME AT WHICH                                                                        GRAIN         GRAIN
                                                                                                           DATE ETA OF   TIME ETA SHIP  DATE ETA OF   TIME ETA OF                              DATE ETD OF   TIME ETD OF                 QUANTITY                  "COMMENCED"    DATE LOADING     TIME LOADING
REFERENCE     NAME OF SHI

The spatial grid **preserves the visual layout perfectly**. Column headers sit above their data. Multi-row records (dates / data / times) stay visually grouped per vessel. You can read the table just as you would in the PDF.

But there's a problem: this ~6,000 character output is **mostly whitespace padding** — an expensive waste of LLM tokens. The next step compresses this into a token-efficient structured format.

### Spatial text reveals column boundaries

Even when spans are merged, the spatial grid positions every character correctly. Looking at the Bunge spatial text, you can see the header labels `"EXPORTER"` at column 219 and `"COMMODITY"` at column 245 — exactly above where the exporter and commodity values should appear in the data rows.

The merged span `"7:00:00 PM BUNGE"` starts at column 208 and occupies columns 208 through 223 (16 characters). The character at 0-based index 11 (the `"B"` of `"BUNGE"`) lands at column 208 + 11 = 219, the same position as the `"EXPORTER"` header. The spatial grid tells us exactly where to split.
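The arithmetic above can be turned into a small splitting routine. This is a minimal sketch, assuming we already know the header columns; `split_at_boundaries` is a hypothetical helper, and the rule that a cut must fall right after a space is an assumption added here to avoid cutting through a word.

```python
def split_at_boundaries(start_col, text, header_cols):
    """Split `text`, placed at grid column `start_col`, at each header column
    that falls strictly inside the span and directly follows a space."""
    cuts = [
        c - start_col
        for c in sorted(header_cols)
        if start_col < c < start_col + len(text) and text[c - start_col - 1] == " "
    ]
    parts, prev = [], 0
    for cut in cuts:
        parts.append((start_col + prev, text[prev:cut].strip()))
        prev = cut
    parts.append((start_col + prev, text[prev:].strip()))
    return parts


# Header columns taken from the text above: EXPORTER at 219, COMMODITY at 245.
print(split_at_boundaries(208, "7:00:00 PM BUNGE", [219, 245]))
```

Only the boundary at 219 falls inside the span, so the result is two parts: the time at column 208 and the exporter at column 219. A span that no header column cuts through is returned unchanged.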

In [None]:
from pdf_ocr.spatial_text import _extract_page_layout, _open_pdf

doc = _open_pdf(BUNGE_PATH)
layout = _extract_page_layout(doc[0])

print("=== Bunge: header rows (define column boundaries) ===\n")
for ri in [3, 4]:  # Key header rows
    entries = sorted(layout.rows[ri])
    for col, text in entries:
        if col >= 190:
            print(f"  row {ri}  col={col:4d}  \"{text}\"")

print("\n=== Bunge: data row 9 (has merged spans) ===\n")
entries = sorted(layout.rows[9])
for col, text in entries:
    if col >= 190:
        marker = " <-- MERGED" if len(text) > 12 else ""
        print(f"  row 9  col={col:4d}  \"{text}\"{marker}")

print()
print("Header col 219 = EXPORTER, col 245 = COMMODITY")
print("Data   col 208 = \"7:00:00 PM BUNGE\" (16 chars, cols 208-223)")
print("Split position: 219 - 208 = char index 11 → between \"PM\" and \"BUNGE\"")
doc.close()

In [16]:
import os
from pdf_ocr import pdf_to_spatial_text

print(f"{'File':<55} {'Lines':>6} {'Width':>6} {'Chars':>7}")
print("-" * 78)

for fname in sorted(os.listdir("inputs")):
    if not fname.endswith(".pdf"):
        continue
    path = os.path.join("inputs", fname)
    text = pdf_to_spatial_text(path)
    lines = text.split("\n")
    max_width = max((len(line) for line in lines), default=0)
    print(f"{fname:<55} {len(lines):>6} {max_width:>6} {len(text):>7}")

File                                                     Lines  Width   Chars
------------------------------------------------------------------------------
2857439.pdf                                                 40    181    6471
Bunge_loadingstatement_2025-09-25.pdf                       17    336    4456
CBH Shipping Stem 26092025.pdf                              66    380   19589
Loading-Statement-for-Web-Portal-20250923.pdf                9    201    1141
Press_release_car_registrations_December_2025.pdf          270    313   38757
Shipping-Stem-2025-09-30.pdf                               212    307   62357
document (1).pdf                                           208    150   12677
shipping-stem-2025-11-13.pdf                               194    309   57602
shipping_stem-accc-30092025-1.pdf                           17    262    3626


## 4. Compressed Spatial Text

`compress_spatial_text()` works from the same raw span data but produces a **token-efficient structured representation**. It:

1. **Splits merged spans** — uses column boundaries from header rows to split data spans that the PDF fused (e.g., `"7:00:00 PM BUNGE"` → `"7:00:00 PM"` + `"BUNGE"`)
2. **Classifies page regions** — tables, headings, text blocks, key-value pairs, scattered text
3. **Detects multi-row records** — merges repeating row patterns (e.g., 3-row shipping records) into single logical rows
4. **Renders each region** using its natural format — markdown pipe tables, flowing paragraphs, `key: value` lines

When `refine_headers=True` (the default), a lightweight LLM call (GPT-4o-mini) further improves results:

- **Header refinement**: Cleans stacked/multiline column headers, handles hierarchical spanning headers
- **Table detection fallback**: When heuristics find no table on a page, the LLM detects and extracts tabular data

Setting `refine_headers=False` gives pure heuristic output with no LLM calls (no API key required). Even without the LLM, span splitting and pipe table rendering work correctly.
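To give a feel for the pipe-table rendering step, here is a simplified sketch. It assumes each data cell belongs to the nearest header column; that rule and the `to_pipe_table` name are assumptions for illustration and not necessarily how `compress_spatial_text()` decides.

```python
def to_pipe_table(header_cols, header_names, rows):
    """Render rows of (col, text) cells as a markdown pipe table.
    Each cell is assigned to the header whose grid column is nearest."""

    def line(cells):
        return "|" + "|".join(cells) + "|"

    out = [line(header_names), line(["---"] * len(header_names))]
    for entries in rows:
        cells = [""] * len(header_cols)
        for col, text in entries:
            i = min(range(len(header_cols)), key=lambda k: abs(header_cols[k] - col))
            cells[i] = (cells[i] + " " + text).strip()
        out.append(line(cells))
    return "\n".join(out)


# Hypothetical grid columns and cells, loosely modeled on the example document.
table = to_pipe_table(
    header_cols=[0, 14, 35],
    header_names=["REF", "SHIP", "PORT"],
    rows=[
        [(0, "BG20250025"), (14, "AFRICAN DOVE"), (36, "BUNBURY")],
        [(0, "BG20250026"), (15, "THE ETERNAL")],
    ],
)
print(table)
```

Missing cells are rendered as empty strings, so every row keeps the same number of pipes as the header.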

In [17]:
from pdf_ocr import compress_spatial_text

# Pure heuristic compression — no LLM, no API key needed
compressed_heuristic = compress_spatial_text(PDF_PATH, refine_headers=False)

print("=== compress_spatial_text(refine_headers=False) — heuristic only ===")
print(f"Characters: {len(compressed_heuristic)}")
print(f"Pipe tables: {'Yes' if '|---|' in compressed_heuristic else 'No'}\n")
print(compressed_heuristic)

=== compress_spatial_text(refine_headers=False) — heuristic only ===
Characters: 2142
Pipe tables: Yes

DATE ETA OF LOADING: TIME ETA OF

UNIQUE SLOT	DATE AT WHICH	TIME AT WHICH	DATE AT WHICH	TIME AT WHICH	GRAIN	GRAIN

DATE ETA OF	TIME ETA SHIP	DATE ETA OF	TIME ETA OF	DATE ETD OF	TIME ETD OF	QUANTITY	"COMMENCED"	DATE LOADING	TIME LOADING

REFERENCE	NAME OF SHIP	NOMINATION	NOMINATION	NOMINATION	NOMINATION	PORT	LOADING	LOADING	EXPORTER	COMMODITY	NOTES

SHIP FROM	FROM	SHIP TO	SHIP TO	SHIP	SHIP	(TONNES)	OR	COMPLETED	COMPLETED

NUMBER	WAS RECEIVED	WAS RECEIVED	WAS ACCEPTED	WAS ACCEPTED	COMMENCEM	COMMENCEM

"COMPLETED"

||||||||||||ENT|ENT||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|BG20250025|AFRICAN DOVE|05/06/2025|4:04:00 PM|05/06/2025|4:53:00 PM|BUNBURY|01/07/2025|9:00:00 AM|10/07/2025|12:00:00 PM|07/07/2025|9:00:00 AM|17/07/2025|7:00:00 PM BUNGE|33020 WHEAT|COMPLETED|16/07/2025|8:29:00 AM||
|BG20250026|THE ETERNAL|25/06/2025|7:31:00 PM|26/0

For `2857439.pdf`, heuristic compression produces structured output (headings, key-value pairs, tab-separated data) but **no pipe table** — the multi-row record layout doesn't meet the heuristic's strict requirements (2+ spans, 2+ shared column anchors, 3+ qualifying rows).

With `refine_headers=True`, the LLM fallback detects the table structure and produces a clean pipe table.

In [None]:
# Bunge: the merged-span PDF — now with clean pipe tables
compressed_bunge = compress_spatial_text(BUNGE_PATH, refine_headers=False)

print("=== Bunge compress_spatial_text(refine_headers=False) ===")
print(f"Characters: {len(compressed_bunge)}")
print(f"Pipe tables: {'Yes' if '|---|' in compressed_bunge else 'No'}\n")

# Show the pipe table rows with cell breakdown
lines = compressed_bunge.split("\n")
pipe_data = [ln for ln in lines if ln.startswith("|") and "---" not in ln and ln.count("|") > 3]
if len(pipe_data) > 1:
    # Show first data row cell by cell (pipe_data[0] is the header row).
    # Slice off only the outer pipes: strip("|") would also drop empty trailing cells.
    cells_list = pipe_data[1].strip()[1:-1].split("|")
    print(f"First data row ({len(cells_list)} cells):")
    for i, cell in enumerate(cells_list):
        print(f"  [{i:2d}] {cell}")

Every cell contains exactly one logical value. The span splitting fixed what PyMuPDF fused:

| Before | After |
|---|---|
| `\|7:00:00 PM BUNGE\|` | `\|7:00:00 PM\|BUNGE\|` |
| `\|33020 WHEAT\|` | `\|33020\|WHEAT\|` |

This structural integrity is the foundation for reliable downstream processing. Each pipe table row has a consistent cell count matching the header — making it a **natural chunking key** for parallel interpretation.
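A consistent cell count is easy to verify before handing rows to a downstream step. The sketch below is an illustration only; `pipe_rows` and `check_cell_counts` are hypothetical helpers, and the sample table is hand-written to include one row where a merged span was deliberately left unsplit.

```python
def pipe_rows(text):
    """Return the cell lists of all pipe-table rows, skipping separator rows."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = line[1:-1].split("|")  # slice, so empty outer cells survive
        if all(c.strip() and set(c.strip()) <= {"-", ":"} for c in cells):
            continue  # separator row such as |---|---|
        rows.append(cells)
    return rows


def check_cell_counts(rows):
    """Return indices of rows whose cell count differs from the header row."""
    expected = len(rows[0])
    return [i for i, r in enumerate(rows) if len(r) != expected]


sample = "\n".join([
    "|REF|SHIP|TIME|EXPORTER|",
    "|---|---|---|---|",
    "|BG20250025|AFRICAN DOVE|7:00:00 PM|BUNGE|",
    "|BG20250026|THE ETERNAL|7:00:00 PM BUNGE|",  # merged span left unsplit
])
bad = check_cell_counts(pipe_rows(sample))
print(bad)
```

The row with the fused `"7:00:00 PM BUNGE"` cell is flagged because it has three cells against a four-column header.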

In [18]:
# LLM-assisted compression — refines headers + catches missed tables
compressed_llm = compress_spatial_text(PDF_PATH, refine_headers=True)

print("=== compress_spatial_text(refine_headers=True) — with LLM ===")
print(f"Characters: {len(compressed_llm)}")
print(f"Pipe tables: {'Yes' if '|---|' in compressed_llm else 'No'}\n")
print(compressed_llm)

2026-02-05T06:20:38.764 [BAML INFO] Function RefineTableHeaders:
    Client: CustomGPT4oMini (gpt-4o-mini-2024-07-18) - 6871ms. StopReason: stop. Tokens(in/out): 1329/167
    ---PROMPT---
    system: You are a table-structure analyst. You receive a monospace spatial text excerpt
    of a table region from a PDF. Your job is to identify the header rows and produce
    clean column names.
    
    SPATIAL TEXT (monospace — positions matter):
                                                                                                                                                                        ENT           ENT
    BG20250025    AFRICAN DOVE         05/06/2025      4:04:00 PM     05/06/2025      4:53:00 PM   BUNBURY        01/07/2025     9:00:00 AM    10/07/2025   12:00:00 PM    07/07/2025    9:00:00 AM   17/07/2025    7:00:00 PM BUNGE               33020 WHEAT         COMPLETED           16/07/2025     8:29:00 AM
    BG20

In [11]:
import os
from pdf_ocr import pdf_to_spatial_text, compress_spatial_text

print(f"{'File':<55} {'Spatial':>8} {'Compress':>9} {'Reduc.':>7} {'Pipes':>6}")
print("-" * 89)

for fname in sorted(os.listdir("inputs")):
    if not fname.endswith(".pdf"):
        continue
    path = os.path.join("inputs", fname)
    s = pdf_to_spatial_text(path)
    c = compress_spatial_text(path, refine_headers=False)
    reduction = (1 - len(c) / len(s)) * 100 if len(s) > 0 else 0
    pipes = c.count("|---|")
    print(f"{fname:<55} {len(s):>8} {len(c):>9} {reduction:>6.0f}% {pipes:>6}")

print()
print("'Pipes' counts pipe-table separator rows (|---|).")
print("Files with 0 pipes have tables that heuristics missed —")
print("refine_headers=True (the default) catches these via an LLM fallback.")

File                                                     Spatial  Compress  Reduc.  Pipes
-----------------------------------------------------------------------------------------
2857439.pdf                                                 6471      2021     69%      0
Bunge_loadingstatement_2025-09-25.pdf                       4456      2142     52%     10
CBH Shipping Stem 26092025.pdf                             19589      7212     63%     46
Loading-Statement-for-Web-Portal-20250923.pdf               1141       364     68%      0
Press_release_car_registrations_December_2025.pdf          38757     21199     45%     33
Shipping-Stem-2025-09-30.pdf                               62357     20866     67%     36
document (1).pdf                                           12677      7593     40%     44
shipping-stem-2025-11-13.pdf                               57602     19234     67%     27
shipping_stem-accc-30092025-1.pdf                           3626      1826     50%      9

'Pipes' c

## Table Interpretation

`interpret_table()` takes compressed text and a **canonical schema** describing the columns your application expects, then uses an LLM pipeline to extract structured records.

The schema maps inconsistent PDF column names (e.g. "Ship Name", "Vessel", "Vessel Name") to stable canonical names via aliases. Two modes are available:

- **2-step** (`interpret_table`) — parse table structure first, then map to schema. Step 2 is **batched**: each page's parsed rows are split into chunks (default 20 rows) so the LLM produces complete output without truncation. All batches across all pages run concurrently.
- **Single-shot** (`interpret_table_single_shot`) — one LLM call per page. Faster for simple flat tables, but cannot batch and may truncate on dense pages (50+ rows).

Both modes **auto-split** multi-page input (pages joined by `\f`) and process all pages **concurrently** via `asyncio.gather()`. The return value is a `dict[int, MappedTable]` keyed by 1-indexed page number — each page gets its own complete result (records, unmapped columns, mapping notes, metadata). Records contain only canonical schema fields.

Use `to_records(result)` to flatten all pages into a single `list[dict]`, or `to_records_by_page(result)` for `{page: [dicts]}`.
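The alias idea can be illustrated without an LLM. The sketch below uses a plain normalized-string lookup, which is an assumption made for illustration: the real pipeline maps columns with an LLM, and `normalize` and `build_alias_index` are hypothetical helpers.

```python
import re


def normalize(name):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def build_alias_index(columns):
    """Map every normalized name or alias to its canonical column name."""
    index = {}
    for col in columns:
        for label in [col["name"], *col.get("aliases", [])]:
            index[normalize(label)] = col["name"]
    return index


columns = [
    {"name": "vessel_name", "aliases": ["Ship Name", "Vessel", "Vessel Name"]},
    {"name": "tons", "aliases": ["Quantity(tonnes)"]},
]
index = build_alias_index(columns)

# Headers as they might appear in a PDF; unmatched headers map to None.
pdf_headers = ["VESSEL NAME", "QUANTITY (TONNES)", "NOTES"]
mapping = {h: index.get(normalize(h)) for h in pdf_headers}
print(mapping)
```

Exact matching after normalization handles case and punctuation differences but not paraphrases, which is one reason an LLM-based mapping step is useful.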

### Vision-based schema inference (optional)

Some PDFs have dense tables with stacked/multi-line headers where text extraction produces **garbled column names or concatenated cell values** (e.g. `"7:00:00 PM BUNGE"` or `"33020 WHEAT"` as single text runs). For these cases, pass `pdf_path=` to `interpret_table()` to enable a vision pre-step:

```
Step 0 (vision):  page image + compressed text → InferredTableSchema
Step 1 (guided):  compressed text + InferredTableSchema → ParsedTable
Step 2 (unchanged): ParsedTable → MappedTable
```

The vision step renders each PDF page as an image and uses a vision-capable LLM to read the correct column headers from the visual layout, then step 1 uses that schema to correctly split compound values. When `pdf_path` is omitted, the pipeline behaves exactly as before (no vision overhead).

In [None]:

from pdf_ocr import CanonicalSchema

# Define the canonical schema as a plain dict (e.g. loaded from a JSON file).
# CanonicalSchema.from_dict() converts it into the typed dataclass.
#
# Note: a column with an empty "aliases" list is inferred from context (section headers,
# document title, or repeated contextual values) rather than matched to a column name.
# Here every column, including "load_port", is matched through its aliases.
schema_dict = {
    "description": "Shipping stem vessel loading records",
    "columns": [
        {"name": "load_port", "type": "string", "description": "Loading port name", "aliases": ["Port"], "format": "uppercase"},
        {"name": "vessel_name", "type": "string", "description": "Name of the vessel", "aliases": ["Name of Ship"], "format": "uppercase"},
        {"name": "unique_shipping_slot_id", "type": "string", "description": "Reference number", "aliases": ["Unique Slot Reference Number"]},
        {"name": "shipper", "type": "string", "description": "Exporting company", "aliases": ["Exporter"], "format": "uppercase"},
        {"name": "commodity", "type": "string", "description": "Type of commodity", "aliases": ["Commodity"], "format": "titlecase"},
        {"name": "tons", "type": "int", "description": "Quantity in metric tonnes", "aliases": ["Quantity(tonnes)"], "format": "#,###"},
        {"name": "eta", "type": "string", "description": "Estimated time of arrival", "aliases": ["Date ETA of Ship To"], "format": "YYYY-MM-DD HH:mm"},
        {"name": "status", "type": "string", "description": "Loading status", "aliases": ["Status", "Load Status"], "format": "titlecase"},
    ],
}

schema = CanonicalSchema.from_dict(schema_dict)

print(f"Schema: {schema.description}")
print(f"Columns ({len(schema.columns)}):")
for col in schema.columns:
    print(f"  {col.name:20s}  {col.type:6s}  format={col.format or 'None':20s}  aliases={col.aliases}")


## Multi-page auto-split with batching

`compress_spatial_text()` joins pages with `\f` (form-feed). When `interpret_table()` receives multi-page input, it splits on `\f` and processes all pages **concurrently**.

Step 2 (schema mapping) is **batched** — each page's parsed rows are split into chunks of `batch_size` rows (default 20) before calling the LLM. This prevents truncation on dense pages with many data rows. All batches across all pages run concurrently via `asyncio.gather()`.
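The batching pattern can be sketched with the standard library alone. In this illustration `map_batch` is a hypothetical stand-in for one step-2 LLM call, and the page contents are dummy integers; only the chunk-then-gather structure is the point.

```python
import asyncio


def chunk(rows, batch_size=20):
    """Split rows into consecutive batches of at most batch_size rows."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


async def map_batch(page, batch):
    """Hypothetical stand-in for one LLM mapping call on a single batch."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return page, [{"row": r} for r in batch]


async def map_all(pages, batch_size=20):
    # One task per batch, across all pages, all awaited together.
    tasks = [
        map_batch(page, b)
        for page, rows in pages.items()
        for b in chunk(rows, batch_size)
    ]
    results = await asyncio.gather(*tasks)  # results keep task order
    merged = {page: [] for page in pages}
    for page, records in results:
        merged[page].extend(records)
    return merged


pages = {1: list(range(45)), 2: list(range(7))}
merged = asyncio.run(map_all(pages))
print({page: len(recs) for page, recs in merged.items()})
```

Because `asyncio.gather()` returns results in the order the tasks were passed, rows stay in their original order after merging. Inside a notebook, where an event loop is already running, use `await map_all(pages)` instead of `asyncio.run(...)`.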

The result is a `dict[int, MappedTable]` keyed by 1-indexed page number. Each page has its own `records`, `unmapped_columns`, `mapping_notes`, and `metadata`. Use `to_records()` to flatten or `to_records_by_page()` for page-grouped dicts.
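The shape of the input and output can be mimicked with a small stand-in. In this sketch `PageResult` is a hypothetical substitute for `MappedTable` that models only `records`, and `split_pages`, `flatten`, and `by_page` are illustrative helpers, not the library's functions.

```python
from dataclasses import dataclass, field


@dataclass
class PageResult:
    """Hypothetical stand-in for MappedTable; only `records` is modeled."""
    records: list = field(default_factory=list)


def split_pages(text):
    """Split form-feed-joined text into a dict keyed by 1-indexed page number."""
    return {i: page for i, page in enumerate(text.split("\f"), start=1)}


def flatten(result):
    """All records from all pages, in page order, as one list."""
    return [rec for page in sorted(result) for rec in result[page].records]


def by_page(result):
    """Records grouped by page number."""
    return {page: list(mt.records) for page, mt in sorted(result.items())}


pages = split_pages("page one text\fpage two text")
result = {
    1: PageResult(records=[{"vessel_name": "AFRICAN DOVE"}]),
    2: PageResult(records=[{"vessel_name": "THE ETERNAL"}, {"vessel_name": "ADAGIO"}]),
}
print(sorted(pages))
print(flatten(result))
print(by_page(result))
```

Keeping results keyed by page until the last moment makes it easy to trace a record back to the page it came from.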

Below we run the full pipeline on `shipping-stem-2025-11-13.pdf` (3 pages, 180+ records) — no manual splitting needed.

## Vision-based interpretation (garbled-header PDFs)

The Bunge loading statement has dense stacked headers where text extraction produces concatenated spans. Passing `pdf_path=` enables the vision pipeline: each page is rendered as an image, a vision LLM infers the correct column structure, and the guided parser uses that schema to split compound values.

In [10]:
import logging
logging.basicConfig(level=logging.INFO)

from pdf_ocr import (
    # Core functions
    compress_spatial_text,
    pdf_to_spatial_text,
    # Table interpretation
    interpret_table,
    interpret_table_single_shot,
    CanonicalSchema,
    ColumnDef,
    to_records,
    to_records_by_page,
    # PDF filtering
    filter_pdf_by_table_titles,
    extract_table_titles,
    FilterMatch,
)

In [None]:
# Vision-enabled pipeline on a garbled-header PDF.
# The only difference from normal usage is pdf_path= which enables step 0 (vision).

newcastle_pdf = "inputs/2857439.pdf"
compressed_newcastle = compress_spatial_text(newcastle_pdf)
print(f"Compressed chars: {len(compressed_newcastle)}")
print(compressed_newcastle[:500])
print("...")

# Define a schema suitable for Newcastle loading statements
newcastle_schema_dict = {
    "description": "Shipping stem vessel loading records",
    "columns": [
        {"name": "load_port", "type": "string", "description": "Loading port name", "aliases": ["port"], "format": "uppercase"},
        {"name": "vessel_name", "type": "string", "description": "Name of the vessel", "aliases": ["ship name"], "format": "uppercase"},
        {"name": "unique_shipping_slot_id", "type": "string", "description": "Reference number", "aliases": ["unique slot reference number"]},
        {"name": "shipper", "type": "string", "description": "Exporting company", "aliases": ["exporter"], "format": "uppercase"},
        {"name": "commodity", "type": "string", "description": "Type of commodity", "aliases": ["commodity"], "format": "titlecase"},
        {"name": "tons", "type": "int", "description": "Quantity in metric tonnes", "aliases": ["quantity(tonnes)"], "format": "#,###"},
        {"name": "eta", "type": "string", "description": "Estimated time of arrival", "aliases": ["eta"], "format": "YYYY-MM-DD HH:mm"},
        {"name": "status", "type": "string", "description": "Loading status", "aliases": ["load status"], "format": "titlecase"},
    ],
}

newcastle_schema = CanonicalSchema.from_dict(newcastle_schema_dict)

# Run WITH vision (pdf_path= enables step 0)
result_newcastle = interpret_table(
    compressed_newcastle,
    newcastle_schema,
    model="openai/gpt-4o",
    pdf_path=newcastle_pdf,
)

# Result is dict[int, MappedTable] — one entry per page
records_newcastle = to_records(result_newcastle)
print(f"Records extracted (vision): {len(records_newcastle)}")
for page, mt in sorted(result_newcastle.items()):
    print(f"Page {page}: {len(mt.records)} records, unmapped={mt.unmapped_columns}")
print(f"\n--- First 5 records ---\n")
for i, rec in enumerate(records_newcastle[:5], 1):
    print(f"[{i}] {rec}")

# Inspect per-page structure: each page has its own records, unmapped_columns, metadata
{page: mt.model_dump() for page, mt in result_newcastle.items()}


In [None]:
# Vision-enabled pipeline on the Bunge PDF (garbled stacked headers).

bunge_pdf = "inputs/Bunge_loadingstatement_2025-09-25.pdf"
compressed_bunge = compress_spatial_text(bunge_pdf)
print(f"Compressed chars: {len(compressed_bunge)}")
print(compressed_bunge[:500])
print("...")

# Define a schema suitable for Bunge loading statements
bunge_schema_dict = {
    "description": "Shipping stem vessel loading records",
    "columns": [
        {"name": "load_port", "type": "string", "description": "Loading port name", "aliases": ["Port"], "format": "uppercase"},
        {"name": "vessel_name", "type": "string", "description": "Name of the vessel", "aliases": ["Name of Ship"], "format": "uppercase"},
        {"name": "unique_shipping_slot_id", "type": "string", "description": "Reference number", "aliases": ["Unique Slot Reference Number"]},
        {"name": "shipper", "type": "string", "description": "Exporting company", "aliases": ["Exporter"], "format": "uppercase"},
        {"name": "commodity", "type": "string", "description": "Type of commodity", "aliases": ["Commodity"], "format": "titlecase"},
        {"name": "tons", "type": "int", "description": "Quantity in metric tonnes", "aliases": ["Quantity(tonnes)"], "format": "#,###"},
        {"name": "eta", "type": "string", "description": "Estimated time of arrival", "aliases": ["Date ETA of Ship To"], "format": "YYYY-MM-DD HH:mm"},
        {"name": "status", "type": "string", "description": "Loading status", "aliases": ["Status", "Load Status"], "format": "titlecase"},
    ],
}

bunge_schema = CanonicalSchema.from_dict(bunge_schema_dict)

# Run WITH vision (pdf_path= enables step 0)
result_bunge = interpret_table(
    compressed_bunge,
    bunge_schema,
    model="openai/gpt-4o",
    pdf_path=bunge_pdf,
)

# Result is dict[int, MappedTable] — one entry per page
records_bunge = to_records(result_bunge)
print(f"Records extracted (vision): {len(records_bunge)}")
for page, mt in sorted(result_bunge.items()):
    print(f"Page {page}: {len(mt.records)} records, unmapped={mt.unmapped_columns}")
print(f"\n--- First 5 records ---\n")
for i, rec in enumerate(records_bunge[:5], 1):
    print(f"[{i}] {rec}")

# Inspect per-page structure: each page has its own records, unmapped_columns, metadata
{page: mt.model_dump() for page, mt in result_bunge.items()}


In [None]:
# Vision-enabled pipeline on the CBH PDF.

cbh_pdf = "inputs/CBH Shipping Stem 26092025.pdf"
compressed_cbh = compress_spatial_text(cbh_pdf)
print(f"Compressed chars: {len(compressed_cbh)}")
print(compressed_cbh[:500])
print("...")

# Define a schema suitable for CBH loading statements
cbh_schema_dict = {
    "description": "Shipping stem vessel loading records",
    "columns": [
        {"name": "load_port", "type": "string", "description": "Loading port name, not captured in the original schema but present inside the header", "aliases": [], "format": "uppercase"},
        {"name": "vessel_name", "type": "string", "description": "Name of the vessel", "aliases": ["vessel name"], "format": "uppercase"},
        {"name": "unique_shipping_slot_id", "type": "string", "description": "Reference number", "aliases": ["vna #"]},
        {"name": "shipper", "type": "string", "description": "Exporting company", "aliases": ["client"], "format": "uppercase"},
        {"name": "commodity", "type": "string", "description": "Type of commodity", "aliases": ["Commodity"], "format": "titlecase"},
        {"name": "tons", "type": "int", "description": "Quantity in metric tonnes", "aliases": ["volume"], "format": "#,###"},
        {"name": "eta", "type": "string", "description": "Estimated time of arrival", "aliases": ["ETA"], "format": "YYYY-MM-DD"},
        {"name": "status", "type": "string", "description": "Loading status", "aliases": ["Status", "Loading Status"], "format": "titlecase"},
    ],
}

cbh_schema = CanonicalSchema.from_dict(cbh_schema_dict)

# Run WITH vision (pdf_path= enables step 0)
result_cbh = interpret_table(
    compressed_cbh,
    cbh_schema,
    model="openai/gpt-4o",
    pdf_path=cbh_pdf,
)

# Result is dict[int, MappedTable] — one entry per page
records_cbh = to_records(result_cbh)
print(f"Records extracted (vision): {len(records_cbh)}")
for page, mt in sorted(result_cbh.items()):
    print(f"Page {page}: {len(mt.records)} records, unmapped={mt.unmapped_columns}")
print("\n--- First 5 records ---\n")
for i, rec in enumerate(records_cbh[:5], 1):
    print(f"[{i}] {rec}")

# Inspect per-page structure: each page has its own records, unmapped_columns, metadata
{page: mt.model_dump() for page, mt in result_cbh.items()}

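The per-page result shape is easy to lose track of, so here is a minimal sketch of how a `dict[int, ...]` of page results could be flattened into one record list. `PageTable` and `flatten_pages` are hypothetical stand-ins written for this illustration, not pdf-ocr's `MappedTable` or `to_records`; iterating pages in ascending order is an assumption, and the vessel names are made up.

```python
from dataclasses import dataclass, field


@dataclass
class PageTable:
    """Hypothetical stand-in for a per-page result (not pdf-ocr's MappedTable)."""
    records: list[dict] = field(default_factory=list)
    unmapped_columns: list[str] = field(default_factory=list)


def flatten_pages(result: dict[int, PageTable], include_page: bool = False) -> list[dict]:
    """Concatenate per-page records in ascending page order (assumed ordering)."""
    flat = []
    for page, table in sorted(result.items()):
        for rec in table.records:
            # Optionally tag each record with the page it came from.
            flat.append({"page": page, **rec} if include_page else dict(rec))
    return flat


# Illustrative data only: pages are deliberately inserted out of order.
demo = {
    2: PageTable(records=[{"vessel_name": "VESSEL C"}]),
    1: PageTable(records=[{"vessel_name": "VESSEL A"}, {"vessel_name": "VESSEL B"}]),
}
flat = flatten_pages(demo, include_page=True)
print(len(flat))
print([r["page"] for r in flat])
```

Keeping the page number on each record makes it possible to trace a suspicious value back to the page it was extracted from.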

In [None]:
# Vision-enabled pipeline on the Queensland PDF.

queensland_pdf = "inputs/document (1).pdf"
compressed_queensland = compress_spatial_text(queensland_pdf)
print(f"Compressed chars: {len(compressed_queensland)}")
print(compressed_queensland[:500])
print("...")

# Define a schema suitable for Queensland loading statements
queensland_schema_dict = {
    "description": "Shipping stem vessel loading records",
    "columns": [
        {"name": "load_port", "type": "string", "description": "Loading port name", "aliases": ["port"], "format": "uppercase"},
        {"name": "vessel_name", "type": "string", "description": "Name of the vessel", "aliases": ["name of ship"], "format": "uppercase"},
        {"name": "unique_shipping_slot_id", "type": "string", "description": "Reference number", "aliases": ["unique slot reference number"]},
        {"name": "shipper", "type": "string", "description": "Exporting company", "aliases": ["exporter"], "format": "uppercase"},
        {"name": "commodity", "type": "string", "description": "Type of commodity", "aliases": ["commodity"], "format": "titlecase"},
        {"name": "tons", "type": "int", "description": "Quantity in metric tonnes", "aliases": ["quantity(tonnes)"], "format": "#,###"},
        {"name": "eta", "type": "string", "description": "Estimated time of arrival", "aliases": ["date of eta of ship to"], "format": "YYYY-MM-DD HH:mm"},
        {"name": "status", "type": "string", "description": "Loading status", "aliases": ["Status", "loading ' commenced' or ' completed'"], "format": "titlecase"},
    ],
}

queensland_schema = CanonicalSchema.from_dict(queensland_schema_dict)

# Run WITH vision (pdf_path= enables step 0)
result_queensland = interpret_table(
    compressed_queensland,
    queensland_schema,
    model="openai/gpt-4o",
    pdf_path=queensland_pdf,
)

# Result is dict[int, MappedTable] — one entry per page
records_queensland = to_records(result_queensland)
print(f"Records extracted (vision): {len(records_queensland)}")
for page, mt in sorted(result_queensland.items()):
    print(f"Page {page}: {len(mt.records)} records, unmapped={mt.unmapped_columns}")
print("\n--- First 5 records ---\n")
for i, rec in enumerate(records_queensland[:5], 1):
    print(f"[{i}] {rec}")

# Inspect per-page structure: each page has its own records, unmapped_columns, metadata
{page: mt.model_dump() for page, mt in result_queensland.items()}


In [None]:
acea_pdf = "inputs/Press_release_car_registrations_December_2025.pdf"
# filter_pdf_by_table_titles returns a (filtered_pdf, matches) pair, so unpack both
acea_pdf_filtered, matches = filter_pdf_by_table_titles(
    acea_pdf,
    ["new car registrations by market and power source, monthly"],
)


In [5]:
# Vision-enabled pipeline on the ACEA car registrations PDF.

acea_pdf = "inputs/Press_release_car_registrations_December_2025.pdf"
acea_pdf_filtered, matches = filter_pdf_by_table_titles(
    acea_pdf,
    ["new car registrations by market and power source, monthly"],
)
compressed_acea = compress_spatial_text(acea_pdf_filtered)
print(f"Compressed chars: {len(compressed_acea)}")
print(compressed_acea[:500])
print("...")

# Define a schema for ACEA car registrations
# Note: car_motorization aliases match header parts to trigger unpivot
# Note: date has empty aliases - the LLM infers year values from headers
acea_schema_dict = {
    "description": "ACEA new car registrations by market and power source, monthly",
    "columns": [
        {"name": "country", "type": "string", "description": "Country of registration", "aliases": [], "format": "titlecase"},
        {"name": "car_motorization", "type": "string", "description": "Car motorization type", "aliases": ["battery electric", "plug-in hybrid", "hybrid electric", "others", "petrol", "diesel"], "format": "titlecase"},
        {"name": "new_car_registration", "type": "int", "description": "Number of new car registrations", "aliases": [], "format": "#,###"},
        {"name": "date", "type": "string", "description": "Registration period (year from column header, month from document context)", "aliases": [], "format": "YYYY-MM"},
    ],
}

acea_schema = CanonicalSchema.from_dict(acea_schema_dict)

# Run WITH vision (pdf_path= enables step 0)
result_acea = interpret_table(
    compressed_acea,
    acea_schema,
    model="openai/gpt-4o",
    pdf_path=acea_pdf,
)

# Result is dict[int, MappedTable] — one entry per page
records_acea = to_records(result_acea)
print(f"Records extracted (vision): {len(records_acea)}")
for page, mt in sorted(result_acea.items()):
    print(f"Page {page}: {len(mt.records)} records, unmapped={mt.unmapped_columns}")
print("\n--- First 5 records ---\n")
for i, rec in enumerate(records_acea[:5], 1):
    print(f"[{i}] {rec}")

# Inspect per-page structure: each page has its own records, unmapped_columns, metadata
{page: mt.model_dump() for page, mt in result_acea.items()}


INFO:pdf_ocr.interpret:Vision enabled: rendering 1 page(s)


Compressed chars: 4943
NEW CAR REGISTRATIONS BY MARKET AND POWER SOURCE

2

MONTHLY

1	2

||BATTERY ELECTRIC|||PLUG-IN HYBRID||HYBRID ELECTRIC|||OTHERS|||PETROL||DIESEL|||TOTAL||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
||December|December % change|December|December|% change|December|December % change|December|December|% change|December|December|% change|December December|% change|December|December|% change|
||2025|2024|25/24|2025|2024 25/24|2025|2024|25/24|2025|2024 25/24|2025|202
...


INFO:pdf_ocr.interpret:Step 0 page 1: inferred 22 columns: ['Country', 'Battery Electric 2025', 'Battery Electric 2024', 'Battery Electric % Change', 'Plug-In Hybrid 2025', 'Plug-In Hybrid 2024', 'Plug-In Hybrid % Change', 'Hybrid Electric 2025', 'Hybrid Electric 2024', 'Hybrid Electric % Change', 'Others 2025', 'Others 2024', 'Others % Change', 'Petrol 2025', 'Petrol 2024', 'Petrol % Change', 'Diesel 2025', 'Diesel 2024', 'Diesel % Change', 'Total 2025', 'Total 2024', 'Total % Change']


2026-02-04T16:25:16.914 [BAML INFO] Function InferTableSchemaFromImage:
    Client: openai/gpt-4o (gpt-4o-2024-08-06) - 12937ms. StopReason: stop. Tokens(in/out): 4847/273
    ---PROMPT---
    system: You are a table structure analyst. You are given:
    1. An image of a PDF page containing a table
    2. Compressed text extracted from the same page (which may have parsing errors)
    
    Your task: determine the COMPLETE table column structure. Use a DATA-FIRST
    approach — the data rows are the ground truth for column count.
    
    === STEP 1: ANALYZE DATA ROWS (Ground Truth) ===
    
    Look at the DATA ROWS in the image (below the headers):
    1. Pick a complete data row and count the distinct columns visually
    2. Note column boundaries — where values end and new values begin
    3. Identify the DATA TYPE in each column:
       - Dates (DD/MM/YYYY, YYYY-MM-DD, etc.)
       - Times (HH:MM, HH:MM:SS AM/PM)
       - Number

INFO:pdf_ocr.interpret:Step 1 page 1: parsed 31 data rows, 2 header levels, type=TableType.HierarchicalHeader
INFO:pdf_ocr.interpret:  Step 1 page 1 headers[0]: ['Country', 'Battery Electric', 'Battery Electric', 'Battery Electric % Change', 'Plug-In Hybrid', 'Plug-In Hybrid', 'Plug-In Hybrid % Change', 'Hybrid Electric', 'Hybrid Electric', 'Hybrid Electric % Change', 'Others', 'Others', 'Others % Change', 'Petrol', 'Petrol', 'Petrol % Change', 'Diesel', 'Diesel', 'Diesel % Change', 'Total', 'Total', 'Total % Change']
INFO:pdf_ocr.interpret:  Step 1 page 1 headers[1]: ['Country', '2025', '2024', '% Change', '2025', '2024', '% Change', '2025', '2024', '% Change', '2025', '2024', '% Change', '2025', '2024', '% Change', '2025', '2024', '% Change', '2025', '2024', '% Change']
INFO:pdf_ocr.interpret:  Step 1 page 1 row[0] (22 cells): ['Austria', '4,621', '4,263', '+8.4', '2,776', '1,282', '+116.5', '7,253', '5,864', '+23.7', '0', '0', '0', '5,750', '7,076', '-18.7', '1,976', '3,204', '-38.3

2026-02-04T16:25:59.710 [BAML INFO] Function AnalyzeAndParseTableGuided:
    Client: openai/gpt-4o (gpt-4o-2024-08-06) - 42761ms. StopReason: stop. Tokens(in/out): 4362/4165
    ---PROMPT---
    system: You are a table-structure analyst. Given compressed text extracted from a PDF
    AND a visual schema describing the correct column structure (inferred from the
    page image), parse the table data rows.
    
    VISUAL SCHEMA (from image analysis — treat as ground truth for column count
    and header names):
    {
        "column_count": 22,
        "column_names": [
            "Country",
            "Battery Electric 2025",
            "Battery Electric 2024",
            "Battery Electric % Change",
            "Plug-In Hybrid 2025",
            "Plug-In Hybrid 2024",
            "Plug-In Hybrid % Change",
            "Hybrid Electric 2025",
            "Hybrid Electric 2024",
            "Hybrid Electric % Change",
            

{1: {'records': [{'country': 'Austria',
    'car_motorization': 'Battery Electric',
    'new_car_registration': 4621,
    'date': '2025'},
   {'country': 'Austria',
    'car_motorization': 'Battery Electric',
    'new_car_registration': 4263,
    'date': '2024'},
   {'country': 'Austria',
    'car_motorization': 'Plug-In Hybrid',
    'new_car_registration': 2776,
    'date': '2025'},
   {'country': 'Austria',
    'car_motorization': 'Plug-In Hybrid',
    'new_car_registration': 1282,
    'date': '2024'},
   {'country': 'Austria',
    'car_motorization': 'Hybrid Electric',
    'new_car_registration': 7253,
    'date': '2025'},
   {'country': 'Austria',
    'car_motorization': 'Hybrid Electric',
    'new_car_registration': 5864,
    'date': '2024'},
   {'country': 'Austria',
    'car_motorization': 'Petrol',
    'new_car_registration': 5750,
    'date': '2025'},
   {'country': 'Austria',
    'car_motorization': 'Petrol',
    'new_car_registration': 7076,
    'date': '2024'},
   {'country

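The records above are the wide ACEA table unpivoted into one row per country, motorization and year. In the pipeline this mapping is produced by the LLM; the snippet below is only a deterministic sketch of the same reshaping, using the first seven column names from the Step 0 log and the matching cells of the Austria row from the Step 1 log. The `unpivot` helper and its prefix-matching rule are assumptions made for illustration, not pdf-ocr code.

```python
# Compound column names as inferred in Step 0 (first seven of the 22 columns).
columns = [
    "Country",
    "Battery Electric 2025", "Battery Electric 2024", "Battery Electric % Change",
    "Plug-In Hybrid 2025", "Plug-In Hybrid 2024", "Plug-In Hybrid % Change",
]
# Matching cells from the Austria row in the Step 1 log.
row = ["Austria", "4,621", "4,263", "+8.4", "2,776", "1,282", "+116.5"]

# Subset of the car_motorization aliases from the schema, lower-case for matching.
motorization_aliases = ["battery electric", "plug-in hybrid"]


def unpivot(columns, row, aliases):
    """Turn one wide row into long records, one per (motorization, year) cell."""
    records = []
    country = row[0]
    for name, cell in zip(columns[1:], row[1:]):
        lowered = name.lower()
        alias = next((a for a in aliases if lowered.startswith(a)), None)
        if alias is None:
            continue  # column does not belong to any motorization group
        suffix = name[len(alias):].strip()
        if not suffix.isdigit():
            continue  # skip "% Change" columns; only year columns carry counts
        records.append({
            "country": country,
            "car_motorization": name[:len(alias)],
            "new_car_registration": int(cell.replace(",", "")),
            "date": suffix,
        })
    return records


long_records = unpivot(columns, row, motorization_aliases)
for rec in long_records:
    print(rec)
```

Each alias acts as a group key: any header that starts with it contributes one record per year column, which is why listing the header parts as aliases is enough to trigger the reshaping.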
In [None]:
{page: mt.model_dump() for page, mt in result_acea.items()}

## Serialization

After interpreting tables, use the `serialize` module to export results as CSV, TSV, or Parquet, or as pandas or polars DataFrames. All functions validate records against the schema and coerce OCR artifacts (e.g., `"1,234"` → `1234`).
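To make the coercion step concrete, here is a minimal sketch of how an OCR'd cell could be converted to an integer. `coerce_int` is a hypothetical helper written for this illustration, not the `serialize` module's implementation; treating commas as thousands separators is an assumption that would not hold for locales using a decimal comma.

```python
import re


def coerce_int(value):
    """Best-effort conversion of an OCR'd cell to int; None if it is not a whole number."""
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, int):
        return value
    # Drop thousands separators and stray whitespace: "1,234" or "1 234" -> "1234".
    text = re.sub(r"[,\s]", "", str(value))
    if re.fullmatch(r"[+-]?\d+", text):
        return int(text)
    return None  # e.g. "+8.4" or an empty cell


for raw in ["1,234", " 4,621 ", "7253", "+8.4", "", None]:
    print(repr(raw), "->", coerce_int(raw))
```

Returning `None` for values that cannot be coerced, rather than raising, lets a downstream DataFrame keep the row with a nullable integer column.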

In [3]:
# Serialize interpretation results to various formats
from pdf_ocr import to_csv, to_tsv, to_pandas

# Export to CSV string
csv_str = to_csv(result_acea, acea_schema)
print("=== CSV (first 500 chars) ===")
print(csv_str[:500])
print("...")

# Export to CSV file with page column
to_csv(result_acea, acea_schema, path="/tmp/acea_output.csv", include_page=True)
print("\nWrote /tmp/acea_output.csv")

# Export to pandas DataFrame with proper nullable dtypes
df = to_pandas(result_acea, acea_schema, include_page=True)
print("\n=== pandas DataFrame ===")
print(df.head(10))
print(f"\nShape: {df.shape}")
print(f"\nDtypes:\n{df.dtypes}")

=== CSV (first 500 chars) ===
country,car_motorization,new_car_registration,date
Austria,Battery Electric,4621,2025-12
Austria,Plug-in Hybrid,2776,2025-12
Austria,Hybrid Electric,7253,2025-12
Austria,Others,0,2025-12
Austria,Petrol,5750,2025-12
Austria,Diesel,1976,2025-12
Belgium,Battery Electric,11333,2025-12
Belgium,Plug-in Hybrid,3345,2025-12
Belgium,Hybrid Electric,3591,2025-12
Belgium,Others,114,2025-12
Belgium,Petrol,9900,2025-12
Belgium,Diesel,594,2025-12
Bulgaria,Battery Electric,236,2025-12
Bulgaria,Pl
...

Wrote /tmp/acea_output.csv

=== pandas DataFrame ===
   page  country  car_motorization new_car_registration     date
0     1  Austria  Battery Electric                 4621  2025-12
1     1  Austria    Plug-in Hybrid                 2776  2025-12
2     1  Austria   Hybrid Electric                 7253  2025-12
3     1  Austria            Others                    0  2025-12
4     1  Austria            Petrol                 5750  2025-12
5     1  Austria            Diesel 

In [None]:
df.head(50)

In [None]:
from pdf_ocr.interpret import analyze_and_parse
from pdf_ocr import compress_spatial_text, filter_pdf_by_table_titles

acea_pdf = "inputs/Press_release_car_registrations_December_2025.pdf"
filtered, _ = filter_pdf_by_table_titles(acea_pdf, pages=[2])
compressed = compress_spatial_text(filtered)

# Check what Step 1 outputs
parsed = analyze_and_parse(compressed, model="openai/gpt-4o")
print(f"table_type: {parsed.table_type}")
print(f"headers: {parsed.headers}")
print(f"notes: {parsed.notes}")
print(f"data_rows count: {len(parsed.data_rows)}")