Add Chandra OCR 2 engine: state-of-the-art document OCR #4

Merged: Mihailorama merged 2 commits into main from claude/analyze-chandra-ocr-Nd4T1 on Mar 31, 2026

Conversation

@Mihailorama
Owner

Summary

Adds Chandra OCR 2 (by Datalab) to docfold as a new document processing engine. Chandra is a 5B-parameter vision-language model that scores 85.9% on the olmOCR benchmark, ahead of existing engines such as Marker (76.5%) and Mistral OCR (72.0%). It converts images and PDFs to structured Markdown/HTML/JSON with layout preservation, supports 90+ languages, and excels at handwriting, tables, math, and complex layouts.

Key Changes

  • New engine adapter (src/docfold/engines/chandra_engine.py):

    • Implements DocumentEngine ABC with support for both vLLM (remote server) and HuggingFace (local) inference backends
    • Lazy model loading on first process() call to avoid startup overhead
    • Handles PDF-to-image conversion and per-page processing
    • Maps Chandra's native Markdown output to EngineResult with support for JSON/HTML formats
    • Supports these image/PDF extensions: pdf, png, jpg, jpeg, tiff, bmp, webp
    • Declares capabilities: table_structure=True, heading_detection=True, reading_order=True
  • Router integration (src/docfold/engines/router.py):

    • Added "chandra" at high priority in the _IMAGE_PRIORITY and _PDF_PRIORITY lists, given its higher benchmark scores
    • Added "chandra" to the _DEFAULT_FALLBACK list
  • CLI registration (src/docfold/cli.py):

    • Registered ChandraEngine in _build_router() with a graceful fallback if its dependencies are missing
  • Optional dependency (pyproject.toml):

    • Added a chandra extra group with the chandra-ocr>=0.1 dependency
    • Included chandra in the all extra
  • Comprehensive test coverage (tests/engines/test_adapters.py):

    • TestChandraEngine class with 7 test methods covering name, extensions, availability, config storage, defaults, and capabilities
    • Added to TestAllEnginesImplementInterface parametrized tests
  • Documentation:

    • Added docs/RESEARCH_CHANDRA_OCR.md — detailed research document with benchmarks, capabilities, usage examples, and integration rationale
    • Added docs/TASK_CHANDRA_ENGINE.md — implementation task specification
    • Updated docs/benchmarks.md — added Chandra to Quick Comparison table and Engine Profiles section

Implementation Details

  • Dual backend support: vLLM (default, recommended for production) and HuggingFace Transformers (local inference). Configurable via method parameter.
  • No model loading at import/init time: The model loads only on the first process() call, keeping startup fast.
  • Async-safe: Uses loop.run_in_executor() to offload blocking inference to a thread pool so the event loop stays responsive.
  • License-aware: OpenRAIL-M model license documented; free for research/personal/startups <$2M, requires commercial license otherwise.
  • Follows existing patterns: Implementation mirrors other VLM-based engines (Nougat, Zerox) in docfold.
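
The lazy-loading and async-offload points above combine naturally. The following is a minimal sketch of that combination, assuming a loader callable that returns the inference function. `LazyEngine`, `fake_loader`, and `load_count` are hypothetical names for illustration and do not reflect the actual ChandraEngine implementation.

```python
import asyncio
import threading
from typing import Callable, Optional, Tuple

Model = Callable[[str], str]


class LazyEngine:
    """Hypothetical stand-in for an engine adapter with deferred model loading."""

    def __init__(self, loader: Callable[[], Model]) -> None:
        self._loader = loader  # stored but not called, so init stays cheap
        self._model: Optional[Model] = None
        self._lock = threading.Lock()  # guards against loading twice from worker threads
        self.load_count = 0

    def _ensure_model(self) -> Model:
        with self._lock:
            if self._model is None:
                self._model = self._loader()
                self.load_count += 1
            return self._model

    def _process_sync(self, path: str) -> str:
        # Blocking work: load the model on first use, then run inference.
        return self._ensure_model()(path)

    async def process(self, path: str) -> str:
        # Run the blocking call in the default thread pool executor.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self._process_sync, path)


def fake_loader() -> Model:
    # Placeholder for an expensive model load.
    return lambda path: f"# Markdown for {path}"


async def main() -> Tuple[int, str, str]:
    engine = LazyEngine(fake_loader)
    assert engine.load_count == 0  # nothing loaded at init time
    first = await engine.process("a.pdf")
    second = await engine.process("b.png")
    return engine.load_count, first, second
```

Running `asyncio.run(main())` loads the placeholder model once, on the first `process()` call, and reuses it for the second.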

https://claude.ai/code/session_014XarzdnTKLSQmJW7VPVNcW

claude added 2 commits March 31, 2026 19:55
Analyze Chandra OCR 2 (Datalab) as a candidate engine for docfold:
- RESEARCH_CHANDRA_OCR.md: model details, benchmarks (85.9% olmOCR SOTA),
  capabilities, API usage, confidence scoring, and comparison with existing engines
- TASK_CHANDRA_ENGINE.md: step-by-step integration plan following TDD approach,
  covering engine adapter, router, CLI, tests, and documentation updates

Sources: GitHub repo, HuggingFace model card, Datalab blog, confidence scoring docs

https://claude.ai/code/session_014XarzdnTKLSQmJW7VPVNcW
Add Datalab Chandra OCR 2 as the 19th engine in docfold — a 5B VLM
achieving 85.9% on olmOCR benchmark (SOTA). Supports 90+ languages,
handwriting, tables, math, and complex layouts via vLLM or HuggingFace.

- New ChandraEngine adapter with dual backend (vllm/hf), lazy model loading
- Tests: 6 unit tests + interface compliance test
- Router: chandra added to PDF, image, and default fallback priorities
- CLI: registered in _build_router()
- pyproject.toml: chandra optional dependency group, added to [all]
- benchmarks.md: quick comparison, engine profile, feature/format/hw/cost matrices

All 279 tests pass, ruff clean.

https://claude.ai/code/session_014XarzdnTKLSQmJW7VPVNcW
@Mihailorama Mihailorama merged commit 0f3fa36 into main Mar 31, 2026
9 checks passed
@Mihailorama Mihailorama deleted the claude/analyze-chandra-ocr-Nd4T1 branch March 31, 2026 20:37