SoMarkAI/DocParsingBench



Why DocParsingBench?

DocParsingBench is an evaluation toolkit purpose-built for intelligent document parsing products, bridging academia and real‑world industry.

  • Industry‑aligned: Built with real‑world enterprise documents, not just academic datasets.
  • Compatible: Directly compares GT Markdown vs. predicted Markdown. Works with any parsing solution.
  • Optimal segment matching: Segments by text, inline formulas, and tables, then matches within same types. More accurate than full‑text string comparison.
  • Engineering-friendly: CLI + quick start + visual dashboards. Easily plug into your model experiment pipeline for fast iteration.

If this project helps you, please consider giving it a ⭐ Star in the top-right corner. Your support is a huge encouragement to the team.

Latest Updates

[2026.04.17] DocParsingBench evaluation toolkit released. It provides unified scoring for the three core elements of document parsing (text, formula, and table), along with CLI batch evaluation, segment matching, visualization analysis, and leaderboard generation. 📊

[2026.03.09] DocParsingBench dataset released. The first intelligent document parsing dataset built for real industry scenarios, covering finance, legal, scientific research, manufacturing, and education. Now available on Hugging Face and ModelScope! 🔥🔥🔥

Dataset

We systematically collected and annotated document samples from real business workflows, preserving scan noise, stamp occlusion, and blurry characters.

| Dimension | Category |
|---|---|
| Total Samples | 1400 pages |
| Languages | Chinese, English, bilingual |
| Industry Coverage | Finance / Legal / Scientific Research / Manufacturing / Education |
| Layout Coverage | Single-column / Double-column / Triple-column / Mixed |
| Annotation Format | Markdown |
| Chemical Annotation | Uses the SoMarkdown specification, combining SMILES with LaTeX to render chemical structure formulas completely |

Metric Overview

DocParsingBench is an evaluation toolkit for document parsing. It takes two Markdown files (prediction and ground truth), performs segment-level matching and scoring by category, and outputs both overall and per-category scores with reusable metric wrappers and visualization tools.

  • Segment categories: text (with inline formulas), display_formula, table, image (currently dropped in evaluation)
  • Segmentation: text and display formulas are split by line boundaries; tables are bounded by <table> ... </table>
  • Matching: Hungarian matching is applied within each category using configured matching metrics (NED, CDM, TEDS)
  • Metric wrappers: NED/CER, CDM, TEDS/TEDS-S
  • Overall metric: DPB (Document Parsing Benchmark), a weighted average with default weights α=0.5, β=0.3, γ=0.2
$$\begin{aligned} \text{text\_score} &= \alpha \cdot \operatorname{avg}(1 - \text{NED}) + (1 - \alpha) \cdot \operatorname{avg}(\text{CDM}) \\ \text{display\_formula\_score} &= \operatorname{avg}(\text{CDM}) \\ \text{table\_score} &= \operatorname{avg}(\text{TEDS}) \\ \text{DPB} &= \alpha \cdot \text{text\_score} + \beta \cdot \text{display\_formula\_score} + \gamma \cdot \text{table\_score} \end{aligned}$$
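The segmentation rules above can be sketched in a few lines of Python. This is an illustrative approximation (a hypothetical `segment_markdown` helper), not the toolkit's actual implementation:

```python
import re

def segment_markdown(md: str):
    """Split Markdown into typed segments: tables bounded by <table>...</table>,
    display formulas on $$...$$ lines, and remaining non-empty lines as text."""
    segments = []
    # Pull out <table>...</table> blocks first, keeping surrounding text intact.
    parts = re.split(r"(<table>.*?</table>)", md, flags=re.DOTALL)
    for part in parts:
        if part.startswith("<table>"):
            segments.append(("table", part))
            continue
        for line in part.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith("$$") and line.endswith("$$") and len(line) > 2:
                segments.append(("display_formula", line))
            else:
                segments.append(("text", line))
    return segments
```

Each segment then gets matched and scored only against segments of the same category, which is what makes this more robust than full-text string comparison.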

Evaluation Leaderboard

| Rank | Methods | DPB | Text | Formula | Table |
|---|---|---|---|---|---|
| 1 | PaddleOCR-1.5 | 0.8535 | 0.8959 | 0.7527 | 0.7104 |
| 2 | MonkeyOCR-Pro-3B | 0.8260 | 0.8669 | 0.7206 | 0.7014 |
| 3 | MinerU2.5 | 0.8164 | 0.8426 | 0.7993 | 0.7557 |
| 4 | Qwen3-VL-235B-Instruct | 0.7971 | 0.8496 | 0.4355 | 0.6691 |
| 5 | ChandraOCR-2 | 0.7906 | 0.8361 | 0.7772 | 0.7242 |
| 6 | Deepseek-OCR-2 | 0.7403 | 0.7917 | 0.6775 | 0.5741 |
| 7 | GLM-OCR | 0.7348 | 0.7695 | 0.5773 | 0.5046 |
| 8 | dots.ocr-1.5 | 0.6564 | 0.6885 | 0.6236 | 0.5655 |
| 9 | HunyuanOCR | 0.5128 | 0.5319 | 0.6018 | 0.6428 |

Summary Chart

Interactive Leaderboard

Installation

  • Requires Python 3.8+
  • Dependencies are declared in pyproject.toml
```bash
git clone https://github.com/SoMarkAI/DocParsingBench.git
cd DocParsingBench

pip install .

# For local development:
pip install -e .
```

Configuration

Configuration is defined in YAML and maps 1:1 to the internal Config dataclass. A reference file is provided at config.example.yaml.

Key options:

  • chromedriver_path: if unset or null, fastcdm uses its own default.
  • visualize: whether to generate CDM visualization images during evaluation (effective only when formula.metric: "CDM"). Output images are saved in <output>/cdm_vis/.
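For orientation, an illustrative fragment built only from the options mentioned in this README (`chromedriver_path`, `visualize`, `formula.metric`, `summary_chart.*`, `perf.enable`). The exact nesting and defaults are assumptions; `config.example.yaml` is the authoritative schema:

```yaml
chromedriver_path: null   # null -> fastcdm falls back to its own default
visualize: true           # only effective when formula.metric is "CDM"
formula:
  metric: "CDM"
summary_chart:
  enable: true
  y_min: 30
  y_max: 100
perf:
  enable: false
```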

Local Development With fastcdm Source

To use local fastcdm source code instead of an installed package, set FASTCDM_SRC to the source root:

```bash
export FASTCDM_SRC=/path/to/fastcdm
```

You can add this line to ~/.zshrc or ~/.bashrc for persistence. If not set, the installed fastcdm package is used.

Usage

Evaluation

dpb is packaged as a CLI entrypoint and is equivalent to python -m docparsingbench.cli.

```bash
python -m dpb eval \
  --gt path/to/gt.md \
  --pred path/to/pred.md \
  --config config.yaml \
  --out result.json
```

If --gt and --pred are directories, matching filenames are evaluated in batch.

```bash
# gt_dir contains a.md, b.md, c.md ...
# pred_dir contains a.md, b.md, c.md ...
python -m dpb eval \
  --gt gt_dir/ \
  --pred pred_dir/ \
  --config config.yaml \
  --out result.json
```
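Batch mode pairs GT and prediction files by filename. A minimal sketch of that pairing logic (a hypothetical `pair_markdown_files` helper, not the toolkit's actual code):

```python
from pathlib import Path

def pair_markdown_files(gt_dir: str, pred_dir: str):
    """Pair GT and prediction Markdown files by matching filename.
    Returns the matched pairs plus GT files with no prediction."""
    gt = {p.name: p for p in Path(gt_dir).glob("*.md")}
    pred = {p.name: p for p in Path(pred_dir).glob("*.md")}
    common = sorted(gt.keys() & pred.keys())
    missing_pred = sorted(gt.keys() - pred.keys())
    return [(gt[n], pred[n]) for n in common], missing_pred
```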

After eval, the terminal prints a model-level one-line summary with Model, Files, DPB, Text, Formula, Table, FormulaRenderFailures, and Output.

Segment Testing

```bash
python -m dpb segment \
  --in path/to/md \
  --out segments.json
```

Visualization

```bash
dpb visualize \
  --labels path/to/labels.json \
  --img path/to/images_dir \
  --gt path/to/gt_markdowns_dir \
  --pred path/to/pred_markdowns_dir \
  --result path/to/model.result.json
```

  • labels.json stores only sample-to-industry/sub-industry mappings. If --labels is omitted, it is auto-generated alongside gt.

Summary Bar Chart (summary-chart)

```bash
dpb summary-chart \
  --labels path/to/labels.json \
  --results path/to/results_dir \
  --exclude-model-prefix deepseek_ocr \
  --y-min 30 \
  --y-max 100 \
  --output path/to/summary_chart.png
```

  • Optional: repeat --exclude-model-prefix to hide model families by result filename prefix.
  • Optional: set y-axis range via --y-min / --y-max (defaults: 30 / 100).

Batch evaluation can auto-generate the chart when all conditions are met:

  • --gt and --pred are both directories
  • summary_chart.enable: true (default: true)
  • summary_chart.y_min / summary_chart.y_max (defaults: 30 / 100)
  • --labels is omitted and can be auto-generated from gt
```bash
dpb eval \
  --gt data/gt/DocParsingBench/markdowns \
  --pred data/pred/some_model_md \
  --config config.yaml \
  --out data/results/some_model_md.result.json
```

Interactive HTML Leaderboard (leaderboard-html)

Generates a single self-contained .html file with interactive sorting and filtering. Open it in any browser or share it directly without a server.

```bash
dpb leaderboard-html \
  --labels path/to/labels.json \
  --results path/to/results_dir \
  --output leaderboard.html \
  --exclude-model-prefix deepseek_ocr   # optional, repeatable
```

  • All data (All + per-industry views) is embedded inline as JSON
  • Industry switch: All / Education / Finance / Legal / Manufacturing / Research
  • Metrics and ranking in one table: DPB / Text / Formula / Table
  • Default sort: DPB descending; click any column header to cycle desc/asc
  • Hover a metric cell → cursor-following tooltip with the 4-decimal raw value
  • Save as image button exports the current view as PNG via html2canvas
  • Smooth bar-width transitions when switching industries or sort columns

Metric Notes

  • NED (Normalized Edit Distance): edit distance normalized by string length, computed after character-level normalization (lower is better)
  • CER (Character Error Rate): edit distance divided by GT length (lower is better)
  • CDM: formula matching metric based on fastcdm; returns F1/recall/precision (F1 is used by default)
  • TEDS/TEDS-S: table similarity based on tree edit distance (higher is better); TEDS-S compares structure only
  • Hungarian matching: optimal one-to-one matching within each segment category; unmatched segments are scored as 0 similarity
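The matching step can be illustrated with NED-based similarity. The sketch below finds the optimal one-to-one assignment by brute force for clarity; the toolkit itself uses Hungarian matching, which finds the same optimum in polynomial time. Both function names are hypothetical:

```python
from itertools import permutations

def ned(a: str, b: str) -> float:
    """Normalized edit distance: Levenshtein distance / max(len(a), len(b))."""
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1] / max(len(a), len(b))

def best_matching(gt_segs, pred_segs):
    """Optimal one-to-one matching maximizing total (1 - NED) similarity.
    Sides are padded with empty slots; unmatched segments score 0."""
    n = max(len(gt_segs), len(pred_segs))
    g = list(gt_segs) + [""] * (n - len(gt_segs))
    p = list(pred_segs) + [""] * (n - len(pred_segs))
    best_total, best_pairs = -1.0, []
    for perm in permutations(range(n)):
        pairs, total = [], 0.0
        for i, j in enumerate(perm):
            sim = 0.0 if (not g[i] or not p[j]) else 1.0 - ned(g[i], p[j])
            total += sim
            pairs.append((g[i], p[j], sim))
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_total, best_pairs
```

Note that the matching is order-independent: a prediction that emits segments in a different order than the GT is not penalized, only the per-segment similarity matters.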

DPB Calculation

  • Text: text_score = α * avg(1 - NED) + (1 - α) * avg(CDM)
  • Display formula: avg(CDM) (or NED depending on config)
  • Table: avg(TEDS) (or TEDS-S)

DPB = α * text_score + β * display_formula_score + γ * table_score

Different domains (for example paper, finance, tech) can define their own weight presets.
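Plugging the default weights into the formula, a short worked example (the per-category scores are made up for illustration):

```python
# DPB weighted average with the default weights from this README
alpha, beta, gamma = 0.5, 0.3, 0.2
text_score = 0.90             # illustrative values, not a real model's scores
display_formula_score = 0.70
table_score = 0.60
dpb = alpha * text_score + beta * display_formula_score + gamma * table_score
# 0.5*0.90 + 0.3*0.70 + 0.2*0.60 = 0.45 + 0.21 + 0.12 = 0.78
```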

Model Runner Scripts

The scripts/ directory provides OCR model runner scaffolding with a unified pipeline: scan image directory -> call model -> post-process -> output Markdown.

Usage

```bash
# Deepseek-OCR example
python -m scripts.deepseek_ocr ./images ./output/deepseek_ocr_md
```

Add a New Model

Inherit BaseModelRunner and implement parse_md:

```python
from scripts.base import BaseModelRunner

class MyModelRunner(BaseModelRunner):
    name = "my_model"

    def parse_md(self, img_path: str) -> str:
        # call model API / SDK / local inference and return markdown
        ...

    def postprocess(self, md: str) -> str:
        # optional: cleanup / formatting
        return md
```

The base class handles image scanning, tqdm progress display, resume behavior (skip existing outputs), and failure statistics.

Implemented Models

| Script | Model | Status |
|---|---|---|
| deepseek_ocr.py | DeepSeek OCR | Implemented |
| dots_ocr.py | Dots OCR | Implemented |
| glm_ocr.py | GLM OCR | Implemented |
| hunyuan_ocr.py | Hunyuan OCR | Implemented |
| mineru.py | MinerU | Implemented |
| monkey_ocr.py | Monkey OCR Pro 3B | Implemented |
| paddle.py | PaddleOCR | Implemented |
| qwen3_vl.py | Qwen3-VL | Implemented |
| chandra_ocr.py | Chandra OCR | Implemented |

Performance Evaluation Design

This project reserves hooks and a unified output schema for performance benchmarking. Real model invocation can be driven externally.

  • In CLI eval, when perf.enable=true, it records:
    • segmentation time, matching time, each metric's time, and total time
    • document count and throughput (docs/s)
  • Output is written to perf in result.json:
    • phases: timing by phase
    • throughput: document throughput
    • notes: external model invocation marker (empty by default or filled by upper layers)

Benchmark speed reporting should use evaluation-phase runtime + document throughput, excluding external model generation latency. External model latency should be recorded by upper-layer systems.
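The top-level field names (`phases`, `throughput`, `notes`) come from the list above; the nested keys and values below are illustrative assumptions about what a `perf` block in result.json might look like:

```json
{
  "perf": {
    "phases": {
      "segmentation_s": 1.8,
      "matching_s": 0.9,
      "metrics_s": 12.4,
      "total_s": 15.1
    },
    "throughput": { "docs": 140, "docs_per_s": 9.3 },
    "notes": ""
  }
}
```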

WeChat Group

WeChat Group QR Code
