
DocFailBench

DocFailBench is a failure-oriented benchmark for PDF-to-Markdown, OCR, and VLM document parsers on Chinese and Chinese-English documents.

Instead of asking whether an extracted page looks roughly right, DocFailBench checks small, auditable facts: a table value stayed in the right cell, a formula survived, a two-column page was read in order, a caption stayed near its figure, and optional bbox elements really ground text to the page.

DocFailBench community benchmark summary

Why It Exists

Most OCR and document parsing benchmarks report aggregate similarity. That is useful, but it does not tell a parser maintainer which fact broke. DocFailBench is built for diagnosis:

  • executable assertions instead of fuzzy page-level judgments,
  • visual review packets with source page evidence,
  • parser-agnostic markdown + elements predictions,
  • public and private evaluation modes,
  • coverage of Chinese, mixed-script, table, formula, reading-order, page-furniture, and grounding failures.

Which Release Should I Use?

| Release | Status | Best use | Size |
| --- | --- | --- | --- |
| DocFailBench-v0.1-combined-public-rc | frozen RC | recommended community comparison with broader public source diversity | 116 cases / 877 assertions |
| DocFailBench-v0.1-public-real-rc | frozen RC | smaller public-real comparison target | 74 cases / 674 main assertions |
| DocFailBench-v0.1-non-gov-public-stage7-rc | frozen auxiliary RC | non-government public PDF stress test | 24 cases / 165 assertions |
| DocFailBench-v0.1-diagnostic | frozen RC | local regression and failure analysis | 54 cases / 506 assertions |
| Stage8 non-government batch2 | included audit input | second-reviewed contribution to the combined RC; original staging files kept for audit | 18 cases / 38 accepted assertions |

For new parser submissions, use DocFailBench-v0.1-combined-public-rc unless you need the smaller public-real RC for a faster comparison.

Combined Public RC Leaderboard

DocFailBench-v0.1-combined-public-rc is the recommended community-facing target. It keeps the public-real RC as the largest profile, then adds the frozen Stage7 non-government structural track and the second-reviewed Stage8 non-government expansion.

  • 116 cases / 877 assertions
  • 7 cached parser baselines
  • Profile labels preserved: public_real_rc, non_gov_stage7_structural, non_gov_stage8_reviewed
  • Ulang deepseek-ocr2 is not included because authenticated image smoke tests returned upstream 500 errors on 2026-05-09
| Parser | Passed | Failed | Score |
| --- | --- | --- | --- |
| Marker | 621 | 256 | 0.7081 |
| PyMuPDF4LLM bbox | 612 | 265 | 0.6978 |
| Docling | 599 | 278 | 0.6830 |
| PyMuPDF4LLM plain | 589 | 288 | 0.6716 |
| Qwen-VL API (qwen-vl-ocr-latest, run 2026-05) | 559 | 318 | 0.6374 |
| MinerU | 496 | 381 | 0.5656 |
| PaddleOCR | 334 | 543 | 0.3808 |
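
The Score column is consistent with the simple pass fraction Passed / (Passed + Failed): for example, Marker's 621 / (621 + 256) = 621 / 877 ≈ 0.7081.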

Frozen combined artifacts:

Verify the cached release scores from frozen predictions:

powershell -ExecutionPolicy Bypass -File scripts\run_combined_public_compare.ps1

Public-Real RC Leaderboard

DocFailBench-v0.1-public-real-rc freezes the first strict-reviewed real-public PDF expansion on top of the diagnostic set.

  • 7 official public PDFs
  • 20 strict-reviewed public pages
  • 168 public-real assertions, plus 3 secondary hygiene checks excluded from score
  • 74 merged cases / 674 main assertions
  • 7 cached parser baselines with parser metadata

DocFailBench public-real expansion leaderboard

| Parser | Passed | Failed | Score |
| --- | --- | --- | --- |
| PyMuPDF4LLM bbox | 550 | 124 | 0.8160 |
| PyMuPDF4LLM plain | 541 | 133 | 0.8027 |
| Marker | 522 | 152 | 0.7745 |
| Docling | 465 | 209 | 0.6899 |
| Qwen-VL API (qwen-vl-ocr-latest, run 2026-05) | 443 | 231 | 0.6573 |
| MinerU | 388 | 286 | 0.5757 |
| PaddleOCR | 317 | 357 | 0.4703 |

Frozen artifacts:

Scores are computed from frozen case files and cached prediction artifacts. Parser/runtime metadata and spot-check notes are linked above. For community evaluation, cached-score checks, and maintainer-only artifact rebuilds, see docs/reproducibility-public-real-rc.md.

Diagnostic Release

DocFailBench-v0.1-diagnostic is the older local regression set. It is synthetic-heavy by design and useful for controlled failure analysis. Use the combined public RC above for new community parser submissions.

DocFailBench v0.1 leaderboard

| Parser | Passed | Failed | Score |
| --- | --- | --- | --- |
| PyMuPDF4LLM bbox | 436 | 70 | 0.8617 |
| Marker | 435 | 71 | 0.8597 |
| PyMuPDF4LLM plain | 427 | 79 | 0.8439 |
| Docling | 388 | 118 | 0.7668 |
| Qwen-VL API (qwen-vl-ocr-latest, run 2026-05) | 352 | 154 | 0.6957 |
| MinerU | 321 | 185 | 0.6344 |
| PaddleOCR | 273 | 233 | 0.5395 |

Artifacts: benchmark card, cases, leaderboard.

Non-Government Public RC

DocFailBench-v0.1-non-gov-public-stage7-rc freezes the reviewed Stage7 non-government public subset. It expands source diversity with OpenStax, ACL Anthology, PMC/PeerJ, Frontiers, and BioMed Central pages.

Treat it as an auxiliary track, not a standalone replacement for the combined public RC. It is stronger for academic/table/textbook layouts but smaller; in the combined target, it is reported with its profile label preserved.

| Set | Scope | Status |
| --- | --- | --- |
| Stage7 strict reviewed set | 23 pages / 44 assertions | staging audit subset |
| Stage7 structural-v2 RC | 24 pages / 165 assertions | frozen auxiliary RC |
| Stage8 batch2 | 24 pages / 181 candidates; 38 accepted assertions | second-review accepted, 7-parser baselined, included in combined RC |

Stage7 cached comparisons remain useful as a second leaderboard axis. In the combined RC, Stage7 and Stage8 keep separate profile labels so the aggregate score does not hide source-family behavior.

Stage7 artifacts: progress report, strict compare, structural-v2 compare, next public PDF queue.

Frozen Stage7 RC artifacts: card, cases, leaderboard, manifest.

Stage8 batch2 started as a strict, second-review accepted staging subset and is now included in DocFailBench-v0.1-combined-public-rc. The original Stage8 files remain audit artifacts. Its 38 accepted checks have cached 7-parser diagnostics:

| Parser | Passed | Failed | Score |
| --- | --- | --- | --- |
| PyMuPDF4LLM bbox | 30 | 8 | 0.7895 |
| PyMuPDF4LLM plain | 23 | 15 | 0.6053 |
| Marker | 9 | 29 | 0.2368 |
| Qwen-VL API (qwen-vl-ocr-latest, run 2026-05) | 8 | 30 | 0.2105 |
| Docling | 7 | 31 | 0.1842 |
| MinerU | 6 | 32 | 0.1579 |
| PaddleOCR | 5 | 33 | 0.1316 |

Stage8 artifacts: first review, human second-review acceptance, 7-parser compare, staging manifest, source/license manifest, parser metadata.

Quick Start

Evaluate a prediction file against the recommended combined public RC:

python -m docfailbench.cli evaluate `
  --cases data/releases/docfailbench_v0_1_combined_public_rc_cases.json `
  --predictions path/to/your_predictions.json `
  --out runs/submissions/YOUR_PARSER/combined_public_rc_results.json

Run a parser adapter end to end:

python -m docfailbench.cli baseline `
  --manifest examples/parser_manifest.json `
  --parser pymupdf4llm `
  --cases data/releases/docfailbench_v0_1_combined_public_rc_cases.json `
  --out runs/combined_public_rc_rerun/pymupdf4llm/predictions.json `
  --results runs/combined_public_rc_rerun/pymupdf4llm/results.json `
  --html runs/combined_public_rc_rerun/pymupdf4llm/report.html

Run the small built-in smoke sample:

python -m docfailbench.cli evaluate `
  --cases data/cases/sample_cases.json `
  --predictions data/predictions/sample_parser_predictions.json `
  --out runs/sample/results.json `
  --html runs/sample/report.html

Open the HTML report in a browser to inspect source metadata, parser Markdown, assertion results, and evidence.

Prediction Format

A parser submission is a JSON file with one prediction per case:

{
  "case_id": "public_real_nist_ai_rmf_p017",
  "parser": "your_parser_name",
  "markdown": "extracted Markdown or text",
  "elements": [
    {
      "type": "text",
      "text": "optional spatially grounded text",
      "bbox": [72, 100, 300, 140]
    }
  ],
  "metadata": {
    "version": "1.2.3",
    "command": "your reproducible command"
  }
}

elements may be empty. Parsers with bbox or polygon elements can pass bbox-aware element_grounded checks; plain Markdown parsers are expected to fail them.
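
To make the format concrete, the minimal sketch below assembles a predictions file in Python. It assumes the submission file is a JSON list of per-case objects shaped like the example above; if your release expects a different top-level layout, follow the release case file and the submission docs instead.

import json

def make_prediction(case_id, markdown, elements=None):
    # One entry per case, matching the fields shown above.
    return {
        "case_id": case_id,
        "parser": "your_parser_name",
        "markdown": markdown,
        "elements": elements or [],  # may stay empty for plain-Markdown parsers
        "metadata": {"version": "1.2.3", "command": "your reproducible command"},
    }

predictions = [
    make_prediction(
        "public_real_nist_ai_rmf_p017",
        "extracted Markdown or text",
        elements=[{"type": "text", "text": "optional spatially grounded text", "bbox": [72, 100, 300, 140]}],
    ),
]

with open("path/to/your_predictions.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)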

Submit Parser Results

Open an issue or PR with:

  • parser name, version, and installation notes,
  • exact command or adapter manifest entry,
  • prediction JSON and result JSON,
  • OS, Python/runtime, GPU if used,
  • API endpoint family, model name, and run date for hosted models,
  • machine-readable per-case wrapper metadata for API results, including requested model, endpoint host, status, and elapsed time when available,
  • any known caveats such as OCR-only output or no bbox support.

See docs/submitting-parser-results.md for the full submission flow. For a concrete metadata example, see docs/parser-result-submission-example.md.
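
For the machine-readable per-case wrapper metadata requested above for API results, a hypothetical record might look like the sketch below. The field names are illustrative assumptions, not a fixed schema; docs/parser-result-submission-example.md shows the concrete expected shape.

# Hypothetical per-case wrapper record for a hosted-model run; field names are
# illustrative only.
wrapper_metadata = {
    "case_id": "public_real_nist_ai_rmf_p017",
    "requested_model": "qwen-vl-ocr-latest",
    "endpoint_host": "api.example.com",   # placeholder host
    "status": "ok",
    "http_status": 200,
    "elapsed_seconds": 4.2,
    "run_date": "2026-05",
}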

DocFailBench submission labels

How It Works

DocFailBench workflow

Each case contains executable assertions. During evaluation, a parser prediction is normalized into Markdown plus optional spatial elements, and each assertion returns pass/fail evidence.

Common assertion types include:

  • table_cell_exists: a visible value must remain table-cell-like,
  • table_grid_cell: a value must stay at a specific row and column,
  • formula_contains: a formula fragment must survive normalization,
  • reading_order: two page-local anchors must appear in the expected order,
  • caption_binding: a caption must stay near its figure or table anchor,
  • element_grounded: optional bbox/poly elements must ground text to the page,
  • regex_absence / text_absence: page furniture and pollution checks.
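
As an illustration of how one of these checks might behave, the sketch below implements a simplified reading_order-style assertion in plain Python. It is not the project's handler code (the real handlers live in docfailbench/); the normalization step and the evidence fields are assumptions made for illustration.

import re

def _normalize(markdown: str) -> str:
    # Collapse whitespace so line wrapping does not move anchor positions.
    return re.sub(r"\s+", " ", markdown).strip()

def check_reading_order(markdown: str, first_anchor: str, second_anchor: str) -> dict:
    # Pass only if both page-local anchors are present and in the expected order.
    text = _normalize(markdown)
    i = text.find(first_anchor)
    j = text.find(second_anchor)
    passed = i != -1 and j != -1 and i < j
    return {"passed": passed, "evidence": {"first_index": i, "second_index": j}}

# A two-column page read column by column keeps the anchors in order; a
# row-by-row misread interleaves the columns and fails the check.
print(check_reading_order("left column text ... right column text", "left column", "right column"))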

Assertion distribution

Combined public RC failure types

Review Examples

The review packet keeps assertions that are visible, specific, and diagnostic. For public tables, grid-position checks are preferred over easy headers:

Table assertion review example

For formulas, concise symbol-level checks are kept when they test structure that plain text similarity can miss:

Formula assertion review example

Scope And Limits

DocFailBench is strongest as a diagnostic parser benchmark. It should not be used as the sole basis for broad OCR quality claims, model training suitability, or production accuracy across all document types.

Current limitations:

  • the combined RC is still small enough to be diagnostic, not a population-scale OCR benchmark,
  • the public-real profile is government-source heavy, while non-government profiles are smaller,
  • element_grounded checks bbox existence, not exact gold-region overlap,
  • hosted VLM results can drift when providers update latest models,
  • there is no hosted leaderboard service yet.

Local And Private Use

Private mode keeps aggregate scores and failure taxonomy counts, but redacts assertion text, Markdown, elements, paths, messages, and evidence payloads:

python -m docfailbench.cli evaluate `
  --cases data/cases/sample_cases.json `
  --predictions data/predictions/sample_parser_predictions.json `
  --out runs/private/results.private.json `
  --private `
  --private-profile runs/private/profile.json

Private mode rejects --html and --raw-dir, because those outputs can contain page images, parser Markdown, bboxes, metadata, or source excerpts.

Repository Map

docfailbench/            assertion handlers, evaluator, adapters, reports
examples/                parser manifest and parser wrapper examples
data/releases/           frozen release artifacts and leaderboards
data/cases/              development and diagnostic case files
docs/                    benchmark cards, source policy, submission guides
runs/stage7_non_gov_public/  non-government public RC audit workspace
runs/stage8_non_gov_public_batch2/  combined RC audit workspace
tests/                   unit and integration tests

Key docs:

Roadmap

Stage7 and Stage8 have now been folded into DocFailBench-v0.1-combined-public-rc with profile labels preserved. The next community-quality step is broader participation and source diversity:

  1. keep the one-command cached-score verification script green,
  2. gather external parser submissions against the combined target,
  3. add 20-40 more non-government public pages for the next release,
  4. keep broader page-furniture checks in a secondary hygiene profile,
  5. prototype stricter region-overlap checks for element_grounded.

Audit artifacts:

Stage8 has been included in DocFailBench-v0.1-combined-public-rc; the original Stage8 staging files remain available as audit artifacts.
