feat: add OpenDataLoader PDF engine and include in benchmark #7
Merged
Wraps the Java-backed opendataloader-pdf Python package as a new engine adapter. Produces markdown/html/json/text with per-element bounding boxes and page numbers derived from the JSON kids tree. Registered in benchmark.py and exposed via the [opendataloader] extra.
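As a rough illustration of deriving per-element boxes and page numbers from a nested kids tree, here is a minimal sketch. The key names (`kids`, `bounding box`, `page number`, `content`, `type`) and the sample document are assumptions made for the example; the actual opendataloader-pdf JSON schema and the adapter's code may differ.

```python
from typing import Any, Iterator

# Hypothetical key names; the real opendataloader-pdf JSON schema may differ.
KIDS_KEY = "kids"
BBOX_KEY = "bounding box"
PAGE_KEY = "page number"


def iter_elements(node: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Depth-first walk that yields every node carrying a bounding box."""
    if BBOX_KEY in node:
        yield {
            "type": node.get("type"),
            "page": node.get(PAGE_KEY),
            "bbox": tuple(node[BBOX_KEY]),
            "text": node.get("content", ""),
        }
    # Recurse into children whether or not this node had a box itself.
    for kid in node.get(KIDS_KEY, []):
        yield from iter_elements(kid)


# Made-up sample tree, only to exercise the walk.
sample = {
    "type": "document",
    "kids": [
        {
            "type": "heading",
            "page number": 1,
            "bounding box": [72.0, 700.0, 300.0, 720.0],
            "content": "Title",
        },
        {
            "type": "list",
            "kids": [
                {
                    "type": "paragraph",
                    "page number": 2,
                    "bounding box": [72.0, 600.0, 500.0, 640.0],
                    "content": "Body text",
                }
            ],
        },
    ],
}

elements = list(iter_elements(sample))
```

Container nodes without a box (the document root and the list above) are skipped but still traversed, so nested elements keep their own page numbers.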
Uses PyMuPDF insert_htmlbox with Noto Naskh Arabic to render a shaped RTL document, and uses PyMuPDF's own extraction as ground truth. Surfaces a divergence between engines on Arabic: PyMuPDF returns visual order, while opendataloader flips to logical order (CER ~0.87 on this doc), which is a useful real-world signal. The doc is skipped automatically when no Arabic font is installed.
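For context on the CER figure, character error rate is commonly computed as the Levenshtein edit distance between the extracted text and the ground truth, divided by the ground-truth length. The sketch below uses that common definition; the benchmark's own CER implementation is not shown in this PR, so any normalization it applies is not reflected here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using a rolling one-row DP table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,               # drop ca from a
                    current[j - 1] + 1,            # insert cb into a
                    previous[j - 1] + (ca != cb),  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


# A reversed string stands in for a visual-order vs. logical-order mismatch.
same_order = cer("abcd", "abcd")
flipped_order = cer("abcd", "dcba")
```

Because every character can be present while the order is reversed, a direction mismatch alone pushes CER toward 1 even when no individual character is misread.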
Relying on a system package (fonts-noto-core) made the Arabic benchmark doc conditional, which defeated its purpose. Bundle the font (OFL-1.1, 176 KB) in tests/fixtures/fonts/ and load it from there; fall back to system paths only as a safety net.
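The bundled-first lookup with a system fallback could look roughly like the sketch below. The function name `resolve_font`, the font file name, and the directory layout in the demo are hypothetical; only the tests/fixtures/fonts/ location comes from the commit message.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def resolve_font(bundled: Path, system_candidates: Iterable[Path]) -> Optional[Path]:
    """Prefer the bundled fixture font; fall back to system paths if it is missing."""
    if bundled.is_file():
        return bundled
    for candidate in system_candidates:
        if candidate.is_file():
            return candidate
    return None  # caller decides whether to skip the benchmark doc


# Demo in a temporary directory so nothing depends on the host machine.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical file name under the fixture directory named in the commit.
    bundled_path = root / "tests" / "fixtures" / "fonts" / "NotoNaskhArabic-Regular.ttf"
    system_path = root / "system" / "NotoNaskhArabic-Regular.ttf"

    system_path.parent.mkdir(parents=True)
    system_path.write_bytes(b"")
    only_system = resolve_font(bundled_path, [system_path])

    bundled_path.parent.mkdir(parents=True)
    bundled_path.write_bytes(b"")
    both_present = resolve_font(bundled_path, [system_path])
```

Returning `None` instead of raising keeps the skip decision with the caller, which matches the idea of system paths being a safety net and not a requirement.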
Subsets Noto Sans CJK SC and Noto Sans Hebrew with fontTools to cover exactly the benchmark text (~60 KB + 5 KB vs. 20 MB unsubsetted). Hebrew acts as a no-shaping RTL control against Arabic's full shaping; CJK covers multi-byte Unicode in an LTR script with no shaping. This confirms the pattern from the Arabic doc: opendataloader diverges from pymupdf on RTL scripts regardless of shaping (CER ~0.85 on both Arabic and Hebrew) but agrees on CJK (CER 0.03), so the issue is reading-order direction, not character-level extraction. Devanagari and Thai are intentionally omitted: insert_htmlbox produces PDFs that don't round-trip cleanly for those scripts. They are left as a follow-up for when real fixture PDFs are added.
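Subsetting to the exact benchmark text starts from the set of code points that text uses. The sketch below computes that set and formats it as a comma-separated `U+XXXX` list, the form accepted by fontTools' `pyftsubset --unicodes` option. The helper name `subset_unicodes` is hypothetical, and whether this PR used the CLI or the fontTools Python API is not stated.

```python
def subset_unicodes(text: str) -> str:
    """Return the unique code points of text as a sorted U+XXXX list."""
    # Line breaks and tabs need no glyph; spaces are kept on purpose.
    codepoints = sorted({ord(ch) for ch in text if ch not in "\n\r\t"})
    return ",".join(f"U+{cp:04X}" for cp in codepoints)


# Four Hebrew letters written as escapes to keep the source free of RTL text.
hebrew_list = subset_unicodes("\u05e9\u05dc\u05d5\u05dd")
ascii_list = subset_unicodes("aab\n")
```

The resulting string could then be passed along the lines of `pyftsubset <font file> --unicodes=<list>`, with the font path left to the caller.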
Adds OpenDataLoader to both engine tables (matrix + install guide) and records the engine + multi-script benchmark additions under an Unreleased heading.