Custom OCR #1502

Closed
pusapatiakhilraju opened this issue May 1, 2025 · 5 comments
Labels
question Further information is requested

Comments

pusapatiakhilraju commented May 1, 2025

Question

Can I create a custom OCR class and pass it to ocr_options? Is there any example code that could help me get started?
...

Will this work?

import os
from pathlib import Path
from typing import Iterable

from PIL import Image
from surya.detection import DetectionPredictor
from surya.recognition import RecognitionPredictor

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode, EasyOcrOptions, TesseractOcrOptions, OcrMacOptions, RapidOcrOptions, smolvlm_picture_description
from docling.datamodel.settings import settings
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.models.base_model import BaseEnrichmentModel
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
from docling_core.types.doc import BoundingBox, ImageRefMode, TextItem


class SuryaOcrModel(BaseEnrichmentModel):
    def __init__(self, enabled: bool = True):
        self.enabled = enabled
        self.recognition_predictor = RecognitionPredictor()
        self.detection_predictor = DetectionPredictor()

    def is_processable(self, doc, element) -> bool:
        return self.enabled and element.type == "page_image"

    def __call__(self, doc, element_batch: Iterable) -> Iterable:
        for element in element_batch:
            image: Image.Image = element.get_image(doc).convert("RGB")

            # Run Surya OCR
            prediction = self.recognition_predictor([image], [None], self.detection_predictor)[0][0]

            for line in prediction.text_lines:
                text = line.text.strip()
                if not text:
                    continue

                l, t, r, b = line.bbox  # Already in LTRB
                bbox = BoundingBox.from_ltrb(l, t, r, b)
                doc.add_item(TextItem(text=text, bbox=bbox, page_no=element.page_no))

            yield element

class SuryaOcrPipeline(StandardPdfPipeline):
    def __init__(self, pipeline_options):
        super().__init__(pipeline_options)
        self.enrichment_pipe = []
        self.enrichment_pipe.append(SuryaOcrModel(enabled=True))

    @classmethod
    def get_default_options(cls):
        return PdfPipelineOptions(
            generate_page_images=True,
            images_scale=2.0,
            do_ocr=True
        )

Converting

input_pdf_path = Path("./img/test.png")
output_dir = Path("parsed-doc-advanced/test")
output_dir.mkdir(parents=True, exist_ok=True)

pipeline_options = SuryaOcrPipeline.get_default_options()

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=SuryaOcrPipeline,
            pipeline_options=pipeline_options
        )
    }
)

result = converter.convert(input_pdf_path)

Is this the right way to use a custom OCR? I created a class and passed it as pipeline_cls.

pusapatiakhilraju added the question label May 1, 2025
@dolfim-ibm
Contributor

Note that we won't accept contributions that add dependencies with an incompatible license (Surya is licensed under the GPL).

This is why we have a plugin system that lets users contribute their own third-party integrations. You can read more in the plugin docs: https://docling-project.github.io/docling/concepts/plugins/

Bill-XU commented May 8, 2025

Hi, @dolfim-ibm

I added the following configuration to pyproject.toml, which sits in the root of my project, named "fastapi_test".

[project.entry-points."docling"]
custom_ocr = "fastapi_test.docling_custom"

I also created a Python file "docling_custom.py" in my project's root, in which the ocr_engines function is defined as follows:

def ocr_engines():
    return {
        "ocr_engines": [
            CustomOcrModel,
        ]
    }

But when I ran the converter, errors occurred.

...

KeyError: <class 'docling_custom.CustomOcrOptions'>

During handling of the above exception, another exception occurred:

...

RuntimeError: No class found with the name 'custom_ocr', known classes are:
	'easyocr' => <class 'docling.models.easyocr_model.EasyOcrModel'>
	'ocrmac' => <class 'docling.models.ocr_mac_model.OcrMacModel'>
	'rapidocr' => <class 'docling.models.rapid_ocr_model.RapidOcrModel'>
	'tesserocr' => <class 'docling.models.tesseract_ocr_model.TesseractOcrModel'>
	'tesseract' => <class 'docling.models.tesseract_ocr_cli_model.TesseractOcrCliModel'>

I have set the ocr_options of pipeline_options to my custom OCR options, in which the value of kind is "custom_ocr".
It seems that the OCR factory does not know about custom_ocr. How can I fix this?

Best regards,
Bill

Bill-XU commented May 8, 2025

@dolfim-ibm @pusapatiakhilraju
Okay, I figured it out myself. Below is how I got it working, for future reference.

  1. To create a plugin and use it in your project, first create a separate project for the plugin.
    In my case, I created a new project "custom_docling".
  2. Then set up the folder structure of the new project so that it can be installed later.
    In my case, the project looks like this:
custom_docling
├───src
│   └───custom_docling
│       ├───plugins
│       │   ├───__init__.py
│       │   └───custom_ocr.py
│       └───__init__.py
└───pyproject.toml
  3. Choose a packaging tool and edit the project's packaging information.
    In my case, I chose setuptools (following docling's plugin guide: https://docling-project.github.io/docling/concepts/plugins/), so I wrote pyproject.toml as below.
[build-system]
requires = ["setuptools >= 65.0.0"]
build-backend = "setuptools.build_meta"

[project]
name = "custom_docling"
version = "0.0.1"
dependencies = [
  "docling>=2.30.0",
  "openai>=1.65.0",
]

[project.entry-points."docling"]
custom_ocr = "custom_docling.plugins.custom_ocr"

* The entry-point path must not begin with "src": it is a Python import path, and in a src layout the src directory is not part of the package name.
  4. After that, write the plugin code, and don't forget to add an "ocr_engines" function to the plugin module.
    In my case, I added the function to custom_ocr.py.

def ocr_engines():
    return {
        "ocr_engines": [
            CustomOcrModel,
        ]
    }
  5. When writing the options class for the model, set kind correctly.
    In my case, I defined the options like this:
class CustomOcrOptions(OcrOptions):
    kind : ClassVar[str] = "custom_ocr"
    ...

Some notes on implementing an OCR plugin.

  • The options class should inherit from OcrOptions (from docling.datamodel.pipeline_options).
  • The options class must declare a class variable "kind" whose value matches the name configured in pyproject.toml.
  • The model class should inherit from BaseOcrModel (from docling.models.base_ocr_model).
  • The model class should implement the following methods:
    • __init__
    • __call__
    • get_options_type
  • get_options_type is a classmethod and should return the options class.
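
The notes above can be illustrated with a toy registry. This is a minimal sketch under stated assumptions, not docling's actual implementation: OcrOptions and BaseOcrModel here are stand-in classes defined in the snippet itself, and ToyOcrFactory is a hypothetical name. It only shows why a model class has to be returned by ocr_engines and expose get_options_type for a lookup by options class to succeed, and how a missing registration could lead to errors like the KeyError and RuntimeError quoted earlier in this thread.

```python
from typing import ClassVar, Dict, Type


class OcrOptions:
    """Stand-in for docling's OcrOptions (assumption, not the real class)."""
    kind: ClassVar[str]


class BaseOcrModel:
    """Stand-in for docling's BaseOcrModel (assumption, not the real class)."""

    def __init__(self, options: OcrOptions):
        self.options = options

    @classmethod
    def get_options_type(cls) -> Type[OcrOptions]:
        raise NotImplementedError


class CustomOcrOptions(OcrOptions):
    kind: ClassVar[str] = "custom_ocr"


class CustomOcrModel(BaseOcrModel):
    @classmethod
    def get_options_type(cls) -> Type[OcrOptions]:
        return CustomOcrOptions

    def __call__(self, pages):
        # A real model would run OCR here; this sketch passes pages through.
        yield from pages


def ocr_engines():
    # Same shape as the plugin hook shown in this thread.
    return {"ocr_engines": [CustomOcrModel]}


class ToyOcrFactory:
    """Hypothetical registry that maps an options class to a model class."""

    def __init__(self):
        self._by_options: Dict[Type[OcrOptions], Type[BaseOcrModel]] = {}

    def load_plugin(self, plugin_hook) -> None:
        for model_cls in plugin_hook()["ocr_engines"]:
            self._by_options[model_cls.get_options_type()] = model_cls

    def create(self, options: OcrOptions) -> BaseOcrModel:
        try:
            model_cls = self._by_options[type(options)]
        except KeyError:
            known = sorted(opt.kind for opt in self._by_options)
            raise RuntimeError(
                f"No class found with the name {options.kind!r}, known: {known}"
            )
        return model_cls(options)


factory = ToyOcrFactory()
factory.load_plugin(ocr_engines)
model = factory.create(CustomOcrOptions())
print(type(model).__name__)
```

If load_plugin is never called for the plugin, for example because the plugin package is not installed, create fails in the same two-step way as the traceback above.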

After everything is done, use pip install -e xxx to install the plugin into the main project's environment.

That's all.

@pusapatiakhilraju
Author

Thank you.

@cau-git
Contributor

cau-git commented May 23, 2025

@Bill-XU Thanks for outlining your findings and solution. I will close this issue as resolved.
@pusapatiakhilraju Feel free to re-open if you have further questions or feedback.

@cau-git cau-git closed this as completed May 23, 2025