feat: add OpenDataLoader PDF engine and include in benchmark #7

Merged
Mihailorama merged 5 commits into main from claude/add-tool-update-benchmark-E5ZYr on Apr 18, 2026

@Mihailorama
Owner

Wraps the Java-backed opendataloader-pdf Python package as a new
engine adapter. Produces markdown/html/json/text with per-element
bounding boxes and page numbers derived from the JSON kids tree.
Registered in benchmark.py and exposed via the [opendataloader]
extra.
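
As a rough illustration of how page numbers and bounding boxes might be derived from the JSON kids tree, the sketch below walks a nested dict depth-first. The key names (`"kids"`, `"page number"`, `"bounding box"`, `"type"`, `"content"`), the `Element` dataclass, and the sample document are assumptions made for illustration, not taken from this PR; they should be checked against the actual opendataloader-pdf output.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterator, Optional, Tuple


@dataclass
class Element:
    """Hypothetical normalized element; not the adapter's real type."""
    kind: str
    text: str
    page: Optional[int]
    bbox: Optional[Tuple[float, ...]]


def iter_elements(node: Dict[str, Any]) -> Iterator[Element]:
    """Yield one Element per node that carries text, depth-first."""
    content = node.get("content")
    if isinstance(content, str) and content:
        bbox = node.get("bounding box")  # assumed key name
        yield Element(
            kind=node.get("type", "unknown"),
            text=content,
            page=node.get("page number"),  # assumed key name
            bbox=tuple(bbox) if bbox else None,
        )
    # Recurse into children; leaf nodes simply have no "kids" entry.
    for child in node.get("kids", []):
        yield from iter_elements(child)


# Made-up sample shaped like the assumed schema.
sample_doc = {
    "type": "document",
    "kids": [
        {"type": "heading", "page number": 1,
         "bounding box": [72.0, 700.0, 300.0, 720.0], "content": "Title"},
        {"type": "list", "kids": [
            {"type": "list item", "page number": 2,
             "bounding box": [72.0, 600.0, 300.0, 615.0], "content": "First"},
        ]},
    ],
}

elements = list(iter_elements(sample_doc))
```

Container nodes without text (the document root and the list above) yield nothing themselves, so only leaf content reaches the flattened result.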

claude added 5 commits April 16, 2026 20:28
Wraps the Java-backed opendataloader-pdf Python package as a new
engine adapter. Produces markdown/html/json/text with per-element
bounding boxes and page numbers derived from the JSON kids tree.
Registered in benchmark.py and exposed via the [opendataloader]
extra.

Uses PyMuPDF insert_htmlbox with Noto Naskh Arabic to render a shaped
RTL document, and PyMuPDF's own extraction as ground truth. Surfaces
divergence between engines on Arabic: PyMuPDF returns visual order,
opendataloader flips to logical order (CER ~0.87 on this doc), which
is a useful real-world signal.
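
The CER figure quoted above can be read as an edit distance normalized by the reference length. The commit does not say how the benchmark computes it, so the sketch below assumes the common definition (Levenshtein distance divided by the number of reference characters); the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of hypothesis against reference."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


# Stand-in for an RTL line: one engine emits the characters in visual
# order, the other in logical order, i.e. the same characters reversed.
visual_order = "abcdef"
logical_order = visual_order[::-1]
order_mismatch_cer = cer(visual_order, logical_order)
```

Reversing a string leaves every character intact yet still drives the CER toward 1, which is why a high score here points to reading order rather than to misrecognized characters.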

Skipped automatically when no Arabic font is installed.

Relying on a system package (fonts-noto-core) made the Arabic
benchmark doc conditional, which defeated its purpose. Drop the font
(OFL-1.1, 176 KB) into tests/fixtures/fonts/ and load it from there;
fall back to system paths only as a safety net.
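
A minimal sketch of the lookup order described here: the bundled fixture directory first, system directories only as a fallback. The `tests/fixtures/fonts` path comes from the commit message; the system directories, the font filename, and the `find_font` helper are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Optional

BUNDLED_FONT_DIR = Path("tests/fixtures/fonts")
# Hypothetical fallback locations; real system paths vary by distribution.
SYSTEM_FONT_DIRS = (
    Path("/usr/share/fonts/truetype/noto"),
    Path("/usr/share/fonts/noto"),
)


def find_font(
    filename: str,
    bundled_dir: Path = BUNDLED_FONT_DIR,
    system_dirs: Iterable[Path] = SYSTEM_FONT_DIRS,
) -> Optional[Path]:
    """Return the first existing font file, preferring the bundled copy."""
    for directory in (bundled_dir, *system_dirs):
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None


# Hypothetical filename; None means the caller should skip the document.
arabic_font = find_font("NotoNaskhArabic-Regular.ttf")
```

With the font checked into the repository, the `None` branch should only be reached in a broken checkout, so the Arabic document is no longer skipped in normal runs.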

Subsets Noto Sans CJK SC and Noto Sans Hebrew with fontTools to fit
the exact benchmark text (~60 KB + 5 KB vs. 20 MB full). Hebrew acts
as a no-shaping RTL control vs. Arabic's full shaping; CJK covers
multi-byte Unicode with no shaping and LTR.
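
One way such a subset could be produced with fontTools is sketched below. It uses `fontTools.subset` with default options and keeps only the glyphs needed for a given text; the helper names and file paths are hypothetical, and the exact options used in this PR are not stated.

```python
from pathlib import Path
from typing import Set, Union


def needed_codepoints(text: str) -> Set[int]:
    """Unique Unicode code points that the subset font must cover."""
    return {ord(ch) for ch in text}


def subset_font(src: Union[str, Path], dst: Union[str, Path], text: str) -> None:
    """Write a copy of src that only covers the characters in text."""
    # Imported lazily so the helper above works without fontTools installed.
    from fontTools import subset

    options = subset.Options()
    font = subset.load_font(str(src), options)
    subsetter = subset.Subsetter(options=options)
    subsetter.populate(unicodes=needed_codepoints(text))
    subsetter.subset(font)
    subset.save_font(font, str(dst), options)


# Hypothetical usage; both paths and the benchmark text are placeholders:
# subset_font("NotoSansHebrew-Regular.ttf",
#             "tests/fixtures/fonts/NotoSansHebrew-subset.ttf",
#             benchmark_text)
```

Because the subset is tied to the exact benchmark text, any change to that text means the subset has to be regenerated, or newly added characters will have no glyph in the fixture font.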

Confirms the pattern from the Arabic doc: opendataloader diverges
from pymupdf on RTL scripts regardless of shaping (CER ~0.85 on
both Arabic and Hebrew) but agrees on CJK (CER 0.03) — the issue is
reading-order direction, not character-level extraction.

Devanagari and Thai intentionally omitted — insert_htmlbox produces
PDFs that don't round-trip cleanly for those scripts; left as a
follow-up for when real fixture PDFs are added.

Adds OpenDataLoader to both engine tables (matrix + install guide)
and records the engine + multi-script benchmark additions under an
Unreleased heading.
@Mihailorama merged commit 6f5eae8 into main on Apr 18, 2026
9 checks passed
@Mihailorama deleted the claude/add-tool-update-benchmark-E5ZYr branch on Apr 18, 2026 09:01