DocParsingBench is an evaluation toolkit purpose-built for intelligent document parsing products, bridging academia and real‑world industry.
- Industry‑aligned: Built with real‑world enterprise documents, not just academic datasets.
- Compatible: Directly compares GT Markdown vs. predicted Markdown. Works with any parsing solution.
- Optimal segment matching: Splits documents into text (with inline formulas), formula, and table segments, then matches segments of the same type. More accurate than full‑text string comparison.
- Engineering-friendly: CLI + quick start + visual dashboards. Easily plug into your model experiment pipeline for fast iteration.
If this project helps you, please consider giving it a ⭐ Star in the top-right corner. Your support is a huge encouragement to the team.
[2026.04.17] DocParsingBench evaluation toolkit release. It provides unified scoring for the three core elements of document parsing (text, formulas, and tables), along with CLI batch evaluation, segment matching, visualization analysis, and leaderboard generation. 📊

[2026.03.09] DocParsingBench dataset release. The first intelligent document parsing dataset built for real industry scenarios, covering finance, legal, scientific research, manufacturing, and education. Now available on Hugging Face and ModelScope! 🔥🔥🔥
We systematically collected and annotated document samples from real business workflows, preserving scan noise, stamp occlusion, and blurry characters.
| Dimension | Category |
|---|---|
| Total Samples | 1400 pages |
| Languages | Chinese, English, bilingual |
| Industry Coverage | Finance / Legal / Scientific Research / Manufacturing / Education |
| Layout Coverage | Single-column / Double-column / Triple-column / Mixed |
| Annotation Format | Markdown |
| Chemical Annotation | Uses the SoMarkdown specification, combining SMILES with LaTeX to render chemical structure formulas completely |
DocParsingBench is an evaluation toolkit for document parsing. It takes two Markdown files (prediction and ground truth), performs segment-level matching and scoring by category, and outputs both overall and per-category scores with reusable metric wrappers and visualization tools.
- Segment categories: `text` (with inline formulas), `display_formula`, `table`, `image` (currently dropped in evaluation)
- Segmentation: text and display formulas are split by line boundaries; tables are bounded by `<table> ... </table>`
- Matching: Hungarian matching is applied within each category using the configured matching metrics (`NED`, `CDM`, `TEDS`)
- Metric wrappers: `NED`/`CER`, `CDM`, `TEDS`/`TEDS-S`
- Overall metric: DPB (Document Parsing Benchmark), a weighted average with default weights α=0.5, β=0.3, γ=0.2
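Per-category matching can be sketched as follows. This is an illustration, not the toolkit's code: it uses 1 - NED as the similarity and brute-forces the optimal one-to-one assignment over permutations (fine for tiny inputs; the toolkit uses the Hungarian algorithm), with unmatched segments contributing 0 similarity:

```python
from itertools import permutations

def edit_distance(a: str, b: str) -> int:
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ned_similarity(gt: str, pred: str) -> float:
    # 1 - NED: normalized edit distance turned into a similarity in [0, 1]
    if not gt and not pred:
        return 1.0
    return 1.0 - edit_distance(gt, pred) / max(len(gt), len(pred))

def match_category(gt_segs, pred_segs):
    # Optimal one-to-one matching within one segment category.
    # The matrix is padded to square so unmatched segments score 0.
    n = max(len(gt_segs), len(pred_segs))
    sim = [[0.0] * n for _ in range(n)]
    for i, g in enumerate(gt_segs):
        for j, p in enumerate(pred_segs):
            sim[i][j] = ned_similarity(g, p)
    # brute force over permutations (only viable for tiny examples)
    best = max(permutations(range(n)),
               key=lambda perm: sum(sim[i][perm[i]] for i in range(n)))
    return sum(sim[i][best[i]] for i in range(n)) / n

score = match_category(["hello world", "some text"],
                       ["some text", "hello wrld"])
```

Note that order does not matter: the reversed prediction list still matches each segment to its counterpart.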
| Rank | Methods | DPB | Text | Formula | Table |
|---|---|---|---|---|---|
| 1 | PaddleOCR-1.5 | 0.8535 | 0.8959 | 0.7527 | 0.7104 |
| 2 | MonkeyOCR-Pro-3B | 0.8260 | 0.8669 | 0.7206 | 0.7014 |
| 3 | MinerU2.5 | 0.8164 | 0.8426 | 0.7993 | 0.7557 |
| 4 | Qwen3-VL-235B-Instruct | 0.7971 | 0.8496 | 0.4355 | 0.6691 |
| 5 | ChandraOCR-2 | 0.7906 | 0.8361 | 0.7772 | 0.7242 |
| 6 | Deepseek-OCR-2 | 0.7403 | 0.7917 | 0.6775 | 0.5741 |
| 7 | GLM-OCR | 0.7348 | 0.7695 | 0.5773 | 0.5046 |
| 8 | dots.ocr-1.5 | 0.6564 | 0.6885 | 0.6236 | 0.5655 |
| 9 | HunyuanOCR | 0.5128 | 0.5319 | 0.6018 | 0.6428 |
- Requires Python 3.8+
- Dependencies are declared in `pyproject.toml`
```bash
git clone https://github.com/SoMarkAI/DocParsingBench.git
cd docparsingbench
pip install .

# For local development:
pip install -e .
```

Configuration is defined in YAML and maps 1:1 to the internal `Config` dataclass. A reference file is provided at `config.example.yaml`.
Key options:

- `chromedriver_path`: if unset or `null`, `fastcdm` uses its own default.
- `visualize`: whether to generate CDM visualization images during evaluation (effective only when `formula.metric: "CDM"`). Output images are saved in `<output>/cdm_vis/`.
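Putting the options above together, a hypothetical fragment might look like this. Only keys mentioned in this README are shown; consult `config.example.yaml` for the full, authoritative schema:

```yaml
# Hypothetical fragment; see config.example.yaml for the real schema
chromedriver_path: null   # unset/null: fastcdm uses its own default
visualize: true           # CDM visualization images go to <output>/cdm_vis/
formula:
  metric: "CDM"           # visualize only takes effect with CDM
summary_chart:
  enable: true            # auto-generate chart in batch mode
  y_min: 30
  y_max: 100
```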
To use local fastcdm source code instead of an installed package, set `FASTCDM_SRC` to the source root:

```bash
export FASTCDM_SRC=/path/to/fastcdm
```

You can add this line to `~/.zshrc` or `~/.bashrc` for persistence. If not set, the installed fastcdm package is used.
`dpb` is packaged as a CLI entrypoint and is equivalent to `python -m docparsingbench.cli`.
```bash
python -m dpb eval \
  --gt path/to/gt.md \
  --pred path/to/pred.md \
  --config config.yaml \
  --out result.json
```

If `--gt` and `--pred` are directories, matching filenames are evaluated in batch.
```bash
# gt_dir contains a.md, b.md, c.md ...
# pred_dir contains a.md, b.md, c.md ...
python -m dpb eval \
  --gt gt_dir/ \
  --pred pred_dir/ \
  --config config.yaml \
  --out result.json
```

After `eval`, the terminal prints a model-level one-line summary with Model, Files, DPB, Text, Formula, Table, FormulaRenderFailures, and Output.
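Conceptually, the directory mode pairs files by name. This is a hypothetical sketch of that pairing, not the toolkit's actual code:

```python
from pathlib import Path

def pair_markdown_files(gt_dir: str, pred_dir: str):
    # pair files that exist under the same name in both directories
    gt = {p.name: p for p in Path(gt_dir).glob("*.md")}
    pred = {p.name: p for p in Path(pred_dir).glob("*.md")}
    common = sorted(gt.keys() & pred.keys())
    missing = sorted(gt.keys() - pred.keys())  # GT files with no prediction
    return [(gt[n], pred[n]) for n in common], missing
```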
```bash
python -m dpb segment \
  --in path/to/md \
  --out segments.json
```

```bash
dpb visualize \
  --labels path/to/labels.json \
  --img path/to/images_dir \
  --gt path/to/gt_markdowns_dir \
  --pred path/to/pred_markdowns_dir \
  --result path/to/model.result.json
```

`labels.json` stores only sample-to-industry/sub-industry mappings. If `--labels` is omitted, it is auto-generated alongside `gt`.
```bash
dpb summary-chart \
  --labels path/to/labels.json \
  --results path/to/results_dir \
  --exclude-model-prefix deepseek_ocr \
  --y-min 30 \
  --y-max 100 \
  --output path/to/summary_chart.png
```

- Optional: repeat `--exclude-model-prefix` to hide model families by result filename prefix.
- Optional: set the y-axis range via `--y-min`/`--y-max` (defaults: 30/100).
Batch evaluation can auto-generate the chart when all of the following conditions are met:

- `--gt` and `--pred` are both directories
- `summary_chart.enable: true` (default: `true`)
- `summary_chart.y_min`/`summary_chart.y_max` set the range (defaults: 30/100)
- `--labels` is omitted and can be auto-generated from `gt`
```bash
dpb eval \
  --gt data/gt/DocParsingBench/markdowns \
  --pred data/pred/some_model_md \
  --config config.yaml \
  --out data/results/some_model_md.result.json
```

Generates a single self-contained `.html` file with interactive sorting and filtering. Open it in any browser or share it directly without a server.
```bash
dpb leaderboard-html \
  --labels path/to/labels.json \
  --results path/to/results_dir \
  --output leaderboard.html \
  --exclude-model-prefix deepseek_ocr  # optional, repeatable
```

- All data (the All view plus per-industry views) is embedded inline as JSON
- Industry switch: All / Education / Finance / Legal / Manufacturing / Research
- Metrics and ranking in one table: DPB / Text / Formula / Table
- Default sort: DPB descending; click any column header to cycle desc/asc
- Hovering a metric cell shows a cursor-following tooltip with the 4-decimal raw value
- The "Save as image" button exports the current view as a PNG via `html2canvas`
- Smooth bar-width transitions when switching industries or sort columns
- NED (Normalized Edit Distance): normalized edit distance computed after character-level normalization
- CER (Character Error Rate): edit distance divided by the GT length
- CDM: formula matching metric based on `fastcdm`; returns F1/recall/precision (F1 is used by default)
- TEDS/TEDS-S: table similarity based on tree edit distance (greater is better); TEDS-S compares structure only
- Hungarian matching: one-to-one matching within each segment category; unmatched pairs are scored as 0 similarity
- Text: `text_score = α * avg(1 - NED) + (1 - α) * avg(CDM)`
- Display formula: `avg(CDM)` (or NED, depending on config)
- Table: `avg(TEDS)` (or TEDS-S)
DPB = α * text_score + β * display_formula_score + γ * table_score
Different domain presets (for example paper/finance/tech) can define different weight presets.
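The scoring formulas above are plain weighted averages. A minimal sketch follows; function names are illustrative, not the toolkit's API, and the example inputs are made-up numbers:

```python
def _avg(xs):
    return sum(xs) / len(xs) if xs else 0.0

def text_score(ned_values, cdm_values, alpha=0.5):
    # text_score = α * avg(1 - NED) + (1 - α) * avg(CDM)
    return alpha * _avg([1 - n for n in ned_values]) + (1 - alpha) * _avg(cdm_values)

def dpb_score(text, formula, table, alpha=0.5, beta=0.3, gamma=0.2):
    # DPB = α * text_score + β * display_formula_score + γ * table_score
    return alpha * text + beta * formula + gamma * table

# hypothetical per-category scores of 0.90 / 0.80 / 0.70:
overall = dpb_score(0.90, 0.80, 0.70)  # 0.5*0.9 + 0.3*0.8 + 0.2*0.7 ≈ 0.83
```

A domain preset would simply supply different `alpha`/`beta`/`gamma` values.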
The `scripts/` directory provides OCR model runner scaffolding with a unified pipeline: scan image directory -> call model -> post-process -> output Markdown.

```bash
# Deepseek-OCR example
python -m scripts.deepseek_ocr ./images ./output/deepseek_ocr_md
```

Inherit `BaseModelRunner` and implement `parse_md`:
```python
from scripts.base import BaseModelRunner

class MyModelRunner(BaseModelRunner):
    name = "my_model"

    def parse_md(self, img_path: str) -> str:
        # call model API / SDK / local inference and return markdown
        ...

    def postprocess(self, md: str) -> str:
        # optional: cleanup / formatting
        return md
```

The base class handles image scanning, tqdm progress display, resume behavior (skip existing outputs), and failure statistics.
| Script | Model | Status |
|---|---|---|
| `deepseek_ocr.py` | DeepSeek OCR | Implemented |
| `dots_ocr.py` | Dots OCR | Implemented |
| `glm_ocr.py` | GLM OCR | Implemented |
| `hunyuan_ocr.py` | Hunyuan OCR | Implemented |
| `mineru.py` | MinerU | Implemented |
| `monkey_ocr.py` | Monkey OCR Pro 3B | Implemented |
| `paddle.py` | PaddleOCR | Implemented |
| `qwen3_vl.py` | Qwen3-VL | Implemented |
| `chandra_ocr.py` | Chandra OCR | Implemented |
This project reserves hooks and a unified output schema for performance benchmarking. Real model invocation can be driven externally.
- In CLI `eval`, when `perf.enable=true`, it records:
  - segmentation time, matching time, each metric's time, and total time
  - document count and throughput (docs/s)
- Output is written to `perf` in `result.json`:
  - `phases`: timing by phase
  - `throughput`: document throughput
  - `notes`: external model invocation marker (empty by default or filled by upper layers)
Benchmark speed reporting should use evaluation-phase runtime + document throughput, excluding external model generation latency. External model latency should be recorded by upper-layer systems.
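Throughput here is simply documents over evaluation wall time. A minimal sketch of assembling such a perf block; the field names follow the description above, but the exact schema is the toolkit's to define:

```python
def build_perf(phases: dict, doc_count: int) -> dict:
    # phases maps phase name -> seconds, e.g. {"segmentation": 1.2, "matching": 0.8}
    total = sum(phases.values())
    return {
        "phases": {**phases, "total": total},
        "throughput": doc_count / total if total > 0 else 0.0,  # docs/s
        "notes": "",  # external model latency recorded by upper layers
    }

perf = build_perf({"segmentation": 2.0, "matching": 1.0, "metrics": 2.0}, 100)
```

Note that only evaluation phases enter the denominator, which is exactly why external model generation latency stays out of the reported speed.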



