Add Chandra OCR 2 engine — state-of-the-art document OCR by Mihailorama · Pull Request #4 · Mihailorama/docfold

Mihailorama · 2026-03-31T20:37:10Z

Summary

Adds Chandra OCR 2 (by Datalab) as a new document processing engine to docfold. Chandra is a 5B-parameter Vision Language Model achieving 85.9% on the olmOCR benchmark — significantly outperforming existing engines like Marker (76.5%) and Mistral OCR (72.0%). It converts images and PDFs to structured Markdown/HTML/JSON with layout preservation, supports 90+ languages, and excels at handwriting, tables, math, and complex layouts.

Key Changes

New engine adapter (src/docfold/engines/chandra_engine.py):
- Implements DocumentEngine ABC with support for both vLLM (remote server) and HuggingFace (local) inference backends
- Lazy model loading on first process() call to avoid startup overhead
- Handles PDF-to-image conversion and per-page processing
- Maps Chandra's native Markdown output to EngineResult with support for JSON/HTML formats
- Supports the following image/PDF extensions: pdf, png, jpg, jpeg, tiff, bmp, webp
- Declares capabilities: table_structure=True, heading_detection=True, reading_order=True
Router integration (src/docfold/engines/router.py):
- Added "chandra" at high priority in the _IMAGE_PRIORITY and _PDF_PRIORITY lists (given its superior benchmark scores)
- Added to _DEFAULT_FALLBACK list
CLI registration (src/docfold/cli.py):
- Registered ChandraEngine in _build_router() with graceful fallback if dependencies missing
Optional dependency (pyproject.toml):
- Added chandra extra group with chandra-ocr>=0.1 dependency
- Included in all extra
Comprehensive test coverage (tests/engines/test_adapters.py):
- TestChandraEngine class with 7 test methods covering name, extensions, availability, config storage, defaults, and capabilities
- Added to TestAllEnginesImplementInterface parametrized tests
Documentation:
- Added docs/RESEARCH_CHANDRA_OCR.md — detailed research document with benchmarks, capabilities, usage examples, and integration rationale
- Added docs/TASK_CHANDRA_ENGINE.md — implementation task specification
- Updated docs/benchmarks.md — added Chandra to Quick Comparison table and Engine Profiles section
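
The router and CLI changes listed above (priority ordering plus registration that degrades gracefully when the optional dependency is missing) can be pictured with a small sketch. It is an illustration only: the helper names `is_available`, `build_registry`, and `pick_engine`, the module names, and every entry in the priority list other than `chandra` are assumptions, not docfold's actual code.

```python
import importlib.util
from typing import Dict, List, Optional, Set

# Hypothetical priority list. Only "chandra" being ranked high comes from the PR text;
# the other entries are placeholders for illustration.
IMAGE_PRIORITY: List[str] = ["chandra", "fallback_engine"]


def is_available(module_name: str) -> bool:
    """Return True if a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def build_registry(candidates: Dict[str, str]) -> Set[str]:
    """Register each engine only if its backing module is installed.

    `candidates` maps an engine name to the top-level module it depends on.
    Engines whose module is missing are skipped instead of raising.
    """
    return {name for name, module in candidates.items() if is_available(module)}


def pick_engine(priority: List[str], registered: Set[str]) -> Optional[str]:
    """Return the first engine in priority order that is actually registered."""
    for name in priority:
        if name in registered:
            return name
    return None


# "json" stands in for an installed dependency; the other module name is
# deliberately one that does not exist.
registered = build_registry(
    {"chandra": "module_that_is_not_installed_xyz", "fallback_engine": "json"}
)
chosen = pick_engine(IMAGE_PRIORITY, registered)
```

In this sketch, a missing optional package simply leaves the engine out of the registry, so the router falls through to the next name in the priority list.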

Implementation Details

Dual backend support: vLLM (default, recommended for production) and HuggingFace Transformers (local inference). Configurable via method parameter.
No model loading at import/init time: Model only loads on first process() call, keeping startup fast.
Async-safe: Uses loop.run_in_executor() to offload blocking inference to a thread pool, so the event loop stays responsive.
License-aware: OpenRAIL-M model license documented; free for research, personal use, and startups <$2M; a commercial license is required otherwise.
Follows existing patterns: Implementation mirrors other VLM-based engines (Nougat, Zerox) in docfold.
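
The lazy-loading and async-offload points above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ChandraEngine implementation: `LazyEngine`, `fake_loader`, and the `load_count` attribute are hypothetical, and the "model" is a stand-in function. Only the `method` values `vllm` and `hf` and the use of `loop.run_in_executor()` come from the PR text.

```python
import asyncio
import threading
from typing import Callable, Optional

Model = Callable[[str], str]


class LazyEngine:
    """Hypothetical adapter that loads its model on first use, not at init."""

    def __init__(self, loader: Callable[[], Model], method: str = "vllm") -> None:
        if method not in ("vllm", "hf"):
            raise ValueError(f"unknown method: {method!r}")
        self.method = method
        self._loader = loader
        self._model: Optional[Model] = None
        self._lock = threading.Lock()
        self.load_count = 0  # illustration only: counts how often the loader ran

    def _ensure_model(self) -> Model:
        # The lock keeps concurrent first calls from loading the model twice.
        with self._lock:
            if self._model is None:
                self._model = self._loader()
                self.load_count += 1
            return self._model

    def _process_sync(self, path: str) -> str:
        # Blocking work: load the model if needed, then run inference.
        return self._ensure_model()(path)

    async def process(self, path: str) -> str:
        # Run the blocking call in the default thread pool executor.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self._process_sync, path)


def fake_loader() -> Model:
    """Stand-in for an expensive model load; returns a trivial 'model'."""
    return lambda path: f"# Markdown for {path}"


engine = LazyEngine(fake_loader)
loads_before = engine.load_count
first = asyncio.run(engine.process("page1.png"))
second = asyncio.run(engine.process("page2.png"))
```

Constructing the engine does no loading; the loader runs once on the first `process()` call and the cached model is reused afterwards.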

https://claude.ai/code/session_014XarzdnTKLSQmJW7VPVNcW

Analyze Chandra OCR 2 (Datalab) as a candidate engine for docfold:
- RESEARCH_CHANDRA_OCR.md: model details, benchmarks (85.9% olmOCR SOTA), capabilities, API usage, confidence scoring, and comparison with existing engines
- TASK_CHANDRA_ENGINE.md: step-by-step integration plan following TDD approach, covering engine adapter, router, CLI, tests, and documentation updates

Sources: GitHub repo, HuggingFace model card, Datalab blog, confidence scoring docs

https://claude.ai/code/session_014XarzdnTKLSQmJW7VPVNcW

Add Datalab Chandra OCR 2 as the 19th engine in docfold: a 5B VLM achieving 85.9% on the olmOCR benchmark (SOTA). Supports 90+ languages, handwriting, tables, math, and complex layouts via vLLM or HuggingFace.
- New ChandraEngine adapter with dual backend (vllm/hf), lazy model loading
- Tests: 6 unit tests + interface compliance test
- Router: chandra added to PDF, image, and default fallback priorities
- CLI: registered in _build_router()
- pyproject.toml: chandra optional dependency group, added to [all]
- benchmarks.md: quick comparison, engine profile, feature/format/hw/cost matrices

All 279 tests pass, ruff clean.

https://claude.ai/code/session_014XarzdnTKLSQmJW7VPVNcW

claude added 2 commits March 31, 2026 19:55

Mihailorama merged commit 0f3fa36 into main Mar 31, 2026
9 checks passed

Mihailorama deleted the claude/analyze-chandra-ocr-Nd4T1 branch March 31, 2026 20:37

Mihailorama merged 2 commits into main from claude/analyze-chandra-ocr-Nd4T1
