feat: add OpenDataLoader PDF engine and include in benchmark by Mihailorama · Pull Request #7 · Mihailorama/docfold

Mihailorama · 2026-04-18T09:00:50Z

Wraps the Java-backed opendataloader-pdf Python package as a new
engine adapter. Produces markdown/html/json/text with per-element
bounding boxes and page numbers derived from the JSON kids tree.
Registered in benchmark.py and exposed via the [opendataloader]
extra.
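As a rough illustration of deriving per-element boxes and page numbers from a nested kids tree, the walk could look like the sketch below. The key names (`kids`, `type`, `content`, `bounding box`, `page number`), the `Element` dataclass, and the `iter_elements` helper are assumptions for illustration, not the adapter's confirmed schema or code.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Iterator


@dataclass
class Element:
    kind: str
    page: int | None
    bbox: tuple[float, ...] | None
    text: str


def iter_elements(node: dict[str, Any], page: int | None = None) -> Iterator[Element]:
    """Depth-first walk over a nested ``kids`` tree.

    All key names are assumed. A child without its own page number
    inherits the page of its nearest ancestor that has one.
    """
    page = node.get("page number", page)
    bbox = node.get("bounding box")
    if bbox is not None:
        # Only nodes that carry a box are reported as elements.
        yield Element(
            kind=node.get("type", "unknown"),
            page=page,
            bbox=tuple(bbox),
            text=node.get("content", ""),
        )
    for kid in node.get("kids", []):
        yield from iter_elements(kid, page)


# Hypothetical document: a heading on page 1 with one nested span.
doc = {
    "kids": [
        {
            "type": "heading",
            "page number": 1,
            "bounding box": [72, 700, 300, 720],
            "content": "Title",
            "kids": [
                {"type": "span", "bounding box": [72, 700, 120, 720], "content": "Ti"}
            ],
        }
    ]
}
elements = list(iter_elements(doc))
```

Yielding flat elements from a generator keeps the adapter free to build markdown, HTML, or JSON output from the same pass.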

Uses PyMuPDF insert_htmlbox with Noto Naskh Arabic to render a shaped RTL document, and PyMuPDF's own extraction as ground truth. Surfaces divergence between engines on Arabic: PyMuPDF returns visual order, opendataloader flips to logical order (CER ~0.87 on this doc), which is a useful real-world signal. Skipped automatically when no Arabic font is installed.
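To make the CER figure concrete, here is a minimal sketch of character error rate as Levenshtein distance divided by reference length. The function names are illustrative, and the benchmark's own CER implementation may normalise or tokenise differently. The sample uses a reversed string as a crude stand-in for visual versus logical order.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(
                min(
                    prev[j] + 1,               # delete ca
                    cur[j - 1] + 1,            # insert cb
                    prev[j - 1] + (ca != cb),  # substitute or match
                )
            )
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


# Same characters, opposite order: a rough proxy for one engine
# returning visual order and the other returning logical order.
visual_order = "abcdef"
logical_order = visual_order[::-1]
reversed_cer = cer(visual_order, logical_order)
```

Because every character is present in both outputs, a high CER here points to ordering rather than missing or wrong glyphs.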

Relying on a system package (fonts-noto-core) made the Arabic benchmark doc conditional, which defeated its purpose. Drop the font (OFL-1.1, 176KB) into tests/fixtures/fonts/ and load it from there; fall back to system paths only as a safety net.
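The bundled-first lookup described above might be sketched as follows. `find_font`, the font filename used in the usage note, and the system directories are hypothetical; only `tests/fixtures/fonts/` comes from the commit message.

```python
from pathlib import Path
from typing import Optional, Sequence

# Bundled fixture directory named in the commit message.
BUNDLED_FONT_DIR = Path("tests/fixtures/fonts")

# Assumed system locations, used only as a safety net.
SYSTEM_FONT_DIRS: Sequence[Path] = (
    Path("/usr/share/fonts/truetype/noto"),
    Path("/usr/share/fonts/opentype/noto"),
)


def find_font(
    filename: str,
    bundled_dir: Path = BUNDLED_FONT_DIR,
    system_dirs: Sequence[Path] = SYSTEM_FONT_DIRS,
) -> Optional[Path]:
    """Return the first existing copy of ``filename``, bundled copy first."""
    for directory in (bundled_dir, *system_dirs):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None
```

A caller would do something like `find_font("NotoNaskhArabic-Regular.ttf")` (filename assumed) and treat `None` as a hard failure rather than a skip, so the Arabic document always runs.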

Subsets Noto Sans CJK SC and Noto Sans Hebrew with fontTools to cover exactly the benchmark text (~60 KB + 5 KB vs. 20 MB full). Hebrew acts as a no-shaping RTL control against Arabic's full shaping; CJK covers multi-byte Unicode with no shaping and LTR direction. Confirms the pattern from the Arabic doc: opendataloader diverges from pymupdf on RTL scripts regardless of shaping (CER ~0.85 on both Arabic and Hebrew) but agrees on CJK (CER 0.03), so the issue is reading-order direction, not character-level extraction. Devanagari and Thai are intentionally omitted: insert_htmlbox produces PDFs that don't round-trip cleanly for those scripts. Left as a follow-up for when real fixture PDFs are added.
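Subsetting a font to exactly the benchmark text could look roughly like this with the fontTools subset API. `codepoints_for` and `subset_font` are illustrative names, the default `Options()` are an assumption about which tables and layout features to keep, and a `.ttc` collection would need an extra font index that this sketch does not handle.

```python
from pathlib import Path


def codepoints_for(text: str) -> list[int]:
    """Sorted unique code points in ``text``, skipping line breaks."""
    return sorted({ord(ch) for ch in text if ch not in "\r\n"})


def subset_font(src: Path, dst: Path, text: str) -> None:
    """Write a copy of ``src`` that keeps only the glyphs ``text`` needs."""
    # Imported lazily so codepoints_for() works without fontTools installed.
    from fontTools import subset
    from fontTools.ttLib import TTFont

    font = TTFont(str(src))
    subsetter = subset.Subsetter(options=subset.Options())
    subsetter.populate(unicodes=codepoints_for(text))
    subsetter.subset(font)
    font.save(str(dst))
```

Any change to the benchmark text means regenerating the subset, since glyphs outside the original text are no longer in the file.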

Adds OpenDataLoader to both engine tables (matrix + install guide) and records the engine + multi-script benchmark additions under an Unreleased heading.

claude added 5 commits April 16, 2026 20:28

docs: document OpenDataLoader engine in CHANGELOG and README

962fcae

Mihailorama merged commit 6f5eae8 into main Apr 18, 2026
9 checks passed

Mihailorama deleted the claude/add-tool-update-benchmark-E5ZYr branch April 18, 2026 09:01

Mihailorama merged 5 commits into main from claude/add-tool-update-benchmark-E5ZYr
