SoMarkAI/DocParsingBench



Why DocParsingBench?

DocParsingBench is an evaluation toolkit purpose-built for intelligent document parsing products, bridging academia and real‑world industry.

  • Industry‑aligned: Built with real‑world enterprise documents, not just academic datasets.
  • Compatible: Directly compares GT Markdown vs. predicted Markdown. Works with any parsing solution.
  • Optimal segment matching: Segments by text, inline formulas, and tables, then matches within same types. More accurate than full‑text string comparison.
  • Engineering-friendly: CLI + quick start + visual dashboards. Easily plug into your model experiment pipeline for fast iteration.

If this project helps you, please consider giving it a ⭐ Star in the top-right corner. Your support is a huge encouragement to the team.

Latest Updates

[2026.04.17] DocParsingBench evaluation toolkit released. It provides unified scoring for the three core elements of document parsing (text, formula, and table), along with CLI batch evaluation, segment matching, visualization analysis, and leaderboard generation. 📊

[2026.03.09] DocParsingBench dataset released. The first intelligent document parsing dataset built for real industry scenarios, covering finance, legal, scientific research, manufacturing, and education. Now available on Hugging Face and ModelScope! 🔥🔥🔥

Dataset

We systematically collected and annotated document samples from real business workflows, preserving scan noise, stamp occlusion, and blurry characters.

| Dimension | Category |
|---|---|
| Total Samples | 1400 pages |
| Languages | Chinese, English, bilingual |
| Industry Coverage | Finance / Legal / Scientific Research / Manufacturing / Education |
| Layout Coverage | Single-column / Double-column / Triple-column / Mixed |
| Annotation Format | Markdown |
| Chemical Annotation | Uses the SoMarkdown specification, combining SMILES with LaTeX to render chemical structure formulas completely |

Metric Overview

DocParsingBench is an evaluation toolkit for document parsing. It takes two Markdown files (prediction and ground truth), performs segment-level matching and scoring by category, and outputs both overall and per-category scores with reusable metric wrappers and visualization tools.

  • Segment categories: text (with inline formulas), display_formula, table, image (currently dropped in evaluation)
  • Segmentation: text and display formulas are split by line boundaries; tables are bounded by <table> ... </table>
  • Matching: Hungarian matching is applied within each category using configured matching metrics (NED, CDM, TEDS)
  • Metric wrappers: NED/CER, CDM, TEDS/TEDS-S
  • Overall metric: DPB (Document Parsing Benchmark), a weighted average with default weights α=0.5, β=0.3, γ=0.2
$$\begin{aligned} \text{text\_score} &= \alpha \cdot \operatorname{avg}(1 - \text{NED}) + (1 - \alpha) \cdot \operatorname{avg}(\text{CDM}) \\ \text{display\_formula\_score} &= \operatorname{avg}(\text{CDM}) \\ \text{table\_score} &= \operatorname{avg}(\text{TEDS}) \\ \text{DPB} &= \alpha \cdot \text{text\_score} + \beta \cdot \text{display\_formula\_score} + \gamma \cdot \text{table\_score} \end{aligned}$$
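The segmentation rules above can be sketched in a few lines of Python. This is an illustrative approximation (a hypothetical `segment_markdown` helper), not the toolkit's actual implementation:

```python
import re

def segment_markdown(md: str):
    """Split Markdown into typed segments: tables bounded by <table>...</table>,
    display formulas on $$...$$ lines, and remaining non-empty lines as text."""
    segments = []
    # Pull out <table>...</table> blocks first, keeping surrounding text intact.
    parts = re.split(r"(<table>.*?</table>)", md, flags=re.DOTALL)
    for part in parts:
        if part.startswith("<table>"):
            segments.append(("table", part))
            continue
        for line in part.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith("$$") and line.endswith("$$") and len(line) > 2:
                segments.append(("display_formula", line))
            else:
                segments.append(("text", line))
    return segments
```

Each segment then gets matched and scored only against segments of the same category, which is what makes this more robust than full-text string comparison.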

Evaluation Leaderboard

| Rank | Methods | DPB | Text | Formula | Table |
|---|---|---|---|---|---|
| 1 | PaddleOCR-1.5 | 0.8535 | 0.8959 | 0.7527 | 0.7104 |
| 2 | MonkeyOCR-Pro-3B | 0.8260 | 0.8669 | 0.7206 | 0.7014 |
| 3 | MinerU2.5 | 0.8164 | 0.8426 | 0.7993 | 0.7557 |
| 4 | Qwen3-VL-235B-Instruct | 0.7971 | 0.8496 | 0.4355 | 0.6691 |
| 5 | ChandraOCR-2 | 0.7906 | 0.8361 | 0.7772 | 0.7242 |
| 6 | Deepseek-OCR-2 | 0.7403 | 0.7917 | 0.6775 | 0.5741 |
| 7 | GLM-OCR | 0.7348 | 0.7695 | 0.5773 | 0.5046 |
| 8 | dots.ocr-1.5 | 0.6564 | 0.6885 | 0.6236 | 0.5655 |
| 9 | HunyuanOCR | 0.5128 | 0.5319 | 0.6018 | 0.6428 |

Summary Chart

Interactive Leaderboard

Installation

  • Requires Python 3.8+
  • Dependencies are declared in pyproject.toml
```bash
git clone https://github.com/SoMarkAI/DocParsingBench.git
cd DocParsingBench

pip install .

# For local development:
pip install -e .
```

Configuration

Configuration is defined in YAML and maps 1:1 to the internal Config dataclass. A reference file is provided at config.example.yaml.

Key options:

  • chromedriver_path: if unset or null, fastcdm uses its own default.
  • visualize: whether to generate CDM visualization images during evaluation (effective only when formula.metric: "CDM"). Output images are saved in <output>/cdm_vis/.
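For orientation, an illustrative fragment built only from the options mentioned in this README (`chromedriver_path`, `visualize`, `formula.metric`, `summary_chart.*`, `perf.enable`). The exact nesting and defaults are assumptions; `config.example.yaml` is the authoritative schema:

```yaml
chromedriver_path: null   # null -> fastcdm falls back to its own default
visualize: true           # only effective when formula.metric is "CDM"
formula:
  metric: "CDM"
summary_chart:
  enable: true
  y_min: 30
  y_max: 100
perf:
  enable: false
```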

Local Development With fastcdm Source

To use local fastcdm source code instead of an installed package, set FASTCDM_SRC to the source root:

```bash
export FASTCDM_SRC=/path/to/fastcdm
```

You can add this line to ~/.zshrc or ~/.bashrc for persistence. If not set, the installed fastcdm package is used.

Usage

Evaluation

dpb is packaged as a CLI entrypoint and is equivalent to python -m docparsingbench.cli.

```bash
python -m dpb eval \
  --gt path/to/gt.md \
  --pred path/to/pred.md \
  --config config.yaml \
  --out result.json
```

If --gt and --pred are directories, matching filenames are evaluated in batch.

```bash
# gt_dir contains a.md, b.md, c.md ...
# pred_dir contains a.md, b.md, c.md ...
python -m dpb eval \
  --gt gt_dir/ \
  --pred pred_dir/ \
  --config config.yaml \
  --out result.json
```
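Batch mode pairs GT and prediction files by filename. A minimal sketch of that pairing logic (a hypothetical `pair_markdown_files` helper, not the toolkit's actual code):

```python
from pathlib import Path

def pair_markdown_files(gt_dir: str, pred_dir: str):
    """Pair GT and prediction Markdown files by matching filename.
    Returns the matched pairs plus GT files with no prediction."""
    gt = {p.name: p for p in Path(gt_dir).glob("*.md")}
    pred = {p.name: p for p in Path(pred_dir).glob("*.md")}
    common = sorted(gt.keys() & pred.keys())
    missing_pred = sorted(gt.keys() - pred.keys())
    return [(gt[n], pred[n]) for n in common], missing_pred
```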

After eval, the terminal prints a model-level one-line summary with Model, Files, DPB, Text, Formula, Table, FormulaRenderFailures, and Output.

Segment Testing

```bash
python -m dpb segment \
  --in path/to/md \
  --out segments.json
```

Visualization

```bash
dpb visualize \
  --labels path/to/labels.json \
  --img path/to/images_dir \
  --gt path/to/gt_markdowns_dir \
  --pred path/to/pred_markdowns_dir \
  --result path/to/model.result.json
```

  • labels.json stores only sample-to-industry/sub-industry mappings. If --labels is omitted, it is auto-generated alongside gt.

Summary Bar Chart (summary-chart)

```bash
dpb summary-chart \
  --labels path/to/labels.json \
  --results path/to/results_dir \
  --exclude-model-prefix deepseek_ocr \
  --y-min 30 \
  --y-max 100 \
  --output path/to/summary_chart.png
```

  • Optional: repeat --exclude-model-prefix to hide model families by result filename prefix.
  • Optional: set y-axis range via --y-min / --y-max (defaults: 30 / 100).

Batch evaluation can auto-generate the chart when all conditions are met:

  • --gt and --pred are both directories
  • summary_chart.enable: true (default: true)
  • summary_chart.y_min / summary_chart.y_max (defaults: 30 / 100)
  • --labels is omitted and can be auto-generated from gt
```bash
dpb eval \
  --gt data/gt/DocParsingBench/markdowns \
  --pred data/pred/some_model_md \
  --config config.yaml \
  --out data/results/some_model_md.result.json
```

Interactive HTML Leaderboard (leaderboard-html)

Generates a single self-contained .html file with interactive sorting and filtering. Open it in any browser or share it directly without a server.

```bash
dpb leaderboard-html \
  --labels path/to/labels.json \
  --results path/to/results_dir \
  --output leaderboard.html \
  --exclude-model-prefix deepseek_ocr   # optional, repeatable
```

  • All data (All + per-industry views) is embedded inline as JSON
  • Industry switch: All / Education / Finance / Legal / Manufacturing / Research
  • Metrics and ranking in one table: DPB / Text / Formula / Table
  • Default sort: DPB descending; click any column header to cycle desc/asc
  • Hover a metric cell → cursor-following tooltip with the 4-decimal raw value
  • Save as image button exports the current view as PNG via html2canvas
  • Smooth bar-width transitions when switching industries or sort columns

Metric Notes

  • NED (Normalized Edit Distance): edit distance normalized by string length, computed after character-level normalization (lower is better)
  • CER (Character Error Rate): edit distance divided by GT length (lower is better)
  • CDM: formula matching metric based on fastcdm; returns F1/recall/precision (F1 is used by default)
  • TEDS/TEDS-S: table similarity based on tree edit distance (higher is better); TEDS-S compares structure only
  • Hungarian matching: optimal one-to-one matching within each segment category; unmatched segments are scored as 0 similarity
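The matching step can be illustrated with NED-based similarity. The sketch below finds the optimal one-to-one assignment by brute force for clarity; the toolkit itself uses Hungarian matching, which finds the same optimum in polynomial time. Both function names are hypothetical:

```python
from itertools import permutations

def ned(a: str, b: str) -> float:
    """Normalized edit distance: Levenshtein distance / max(len(a), len(b))."""
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1] / max(len(a), len(b))

def best_matching(gt_segs, pred_segs):
    """Optimal one-to-one matching maximizing total (1 - NED) similarity.
    Sides are padded with empty slots; unmatched segments score 0."""
    n = max(len(gt_segs), len(pred_segs))
    g = list(gt_segs) + [""] * (n - len(gt_segs))
    p = list(pred_segs) + [""] * (n - len(pred_segs))
    best_total, best_pairs = -1.0, []
    for perm in permutations(range(n)):
        pairs, total = [], 0.0
        for i, j in enumerate(perm):
            sim = 0.0 if (not g[i] or not p[j]) else 1.0 - ned(g[i], p[j])
            total += sim
            pairs.append((g[i], p[j], sim))
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_total, best_pairs
```

Note that the matching is order-independent: a prediction that emits segments in a different order than the GT is not penalized, only the per-segment similarity matters.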

DPB Calculation

  • Text: text_score = α * avg(1 - NED) + (1 - α) * avg(CDM)
  • Display formula: avg(CDM) (or NED depending on config)
  • Table: avg(TEDS) (or TEDS-S)

DPB = α * text_score + β * display_formula_score + γ * table_score

Different domains (for example paper, finance, tech) can define their own weight presets.
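Plugging the default weights into the formula, a short worked example (the per-category scores are made up for illustration):

```python
# DPB weighted average with the default weights from this README
alpha, beta, gamma = 0.5, 0.3, 0.2
text_score = 0.90             # illustrative values, not a real model's scores
display_formula_score = 0.70
table_score = 0.60
dpb = alpha * text_score + beta * display_formula_score + gamma * table_score
# 0.5*0.90 + 0.3*0.70 + 0.2*0.60 = 0.45 + 0.21 + 0.12 = 0.78
```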

Model Runner Scripts

The scripts/ directory provides OCR model runner scaffolding with a unified pipeline: scan image directory -> call model -> post-process -> output Markdown.

Usage

```bash
# Deepseek-OCR example
python -m scripts.deepseek_ocr ./images ./output/deepseek_ocr_md
```

Add a New Model

Inherit BaseModelRunner and implement parse_md:

```python
from scripts.base import BaseModelRunner

class MyModelRunner(BaseModelRunner):
    name = "my_model"

    def parse_md(self, img_path: str) -> str:
        # call model API / SDK / local inference and return markdown
        ...

    def postprocess(self, md: str) -> str:
        # optional: cleanup / formatting
        return md
```

The base class handles image scanning, tqdm progress display, resume behavior (skip existing outputs), and failure statistics.

Implemented Models

| Script | Model | Status |
|---|---|---|
| deepseek_ocr.py | DeepSeek OCR | Implemented |
| dots_ocr.py | Dots OCR | Implemented |
| glm_ocr.py | GLM OCR | Implemented |
| hunyuan_ocr.py | Hunyuan OCR | Implemented |
| mineru.py | MinerU | Implemented |
| monkey_ocr.py | Monkey OCR Pro 3B | Implemented |
| paddle.py | PaddleOCR | Implemented |
| qwen3_vl.py | Qwen3-VL | Implemented |
| chandra_ocr.py | Chandra OCR | Implemented |

Performance Evaluation Design

This project reserves hooks and a unified output schema for performance benchmarking. Real model invocation can be driven externally.

  • In CLI eval, when perf.enable=true, it records:
    • segmentation time, matching time, each metric's time, and total time
    • document count and throughput (docs/s)
  • Output is written to perf in result.json:
    • phases: timing by phase
    • throughput: document throughput
    • notes: external model invocation marker (empty by default or filled by upper layers)

Benchmark speed reporting should use evaluation-phase runtime + document throughput, excluding external model generation latency. External model latency should be recorded by upper-layer systems.
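The top-level field names (`phases`, `throughput`, `notes`) come from the list above; the nested keys and values below are illustrative assumptions about what a `perf` block in result.json might look like:

```json
{
  "perf": {
    "phases": {
      "segmentation_s": 1.8,
      "matching_s": 0.9,
      "metrics_s": 12.4,
      "total_s": 15.1
    },
    "throughput": { "docs": 140, "docs_per_s": 9.3 },
    "notes": ""
  }
}
```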

WeChat Group

WeChat Group QR Code
