Azure OCR Integration? #3346

shartzog · 2025-06-30T18:52:39Z


I've created an extension to pypdf's text extraction called pypdftotext. It falls back to Azure Document Intelligence OCR for pages where pypdf's extraction routines produce no output. When OCR is triggered, the Azure OCR response is restructured into a fixed-width page representation using the machinery I introduced with layout mode a while ago.
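
To make the fallback idea concrete, here is a minimal sketch of the per-page logic. It is not pypdftotext's actual code: `text_pages_with_fallback`, `PageLike`, and the `ocr_page` callable are hypothetical names used only for illustration, and the only thing assumed about a page is that it has an `extract_text()` method like pypdf's page objects.

```python
from typing import Callable, Iterable, Protocol


class PageLike(Protocol):
    """Anything exposing an extract_text() method, as pypdf page objects do."""

    def extract_text(self) -> str: ...


def text_pages_with_fallback(
    pages: Iterable[PageLike],
    ocr_page: Callable[[int], str],  # hypothetical: page index in, OCR text out
) -> list[str]:
    """Return one string per page, calling ocr_page only for pages with no text."""
    results = []
    for index, page in enumerate(pages):
        text = page.extract_text()
        if not text.strip():
            # No usable text layer (e.g. a scanned image), so defer to OCR.
            text = ocr_page(index)
        results.append(text)
    return results
```

With pypdf installed, the `pages` argument could be `PdfReader("some_pdf.pdf").pages`, and `ocr_page` would wrap whichever OCR service is in use.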

I think this behavior would be high value for pypdf users. E.g. when someone sitting on an image-based PDF asks 'why didn't I get text?', you could tell them to sign up for Azure and set a couple of constants, and they'd be off to the races. Same when they say "the formatting is all weird": the OCR output generally comes back nice and neat.

It also comes with a lot of potential headaches, so I'd definitely understand a hard pass. E.g.:

- Azure problems occasionally become your problems (why isn't text extraction working? b/c Azure is on the fritz)
- Maintainers would need a basic familiarity w/ Azure's OCR services
- Potentially vulnerable to changes in Azure's APIs
- Requests for 'bring your own OCR'
- etc.

In any event, if your interest is piqued, you can create a Python 3.10+ env and pip install pypdftotext to check it out. To see how the OCR stuff behaves, you'd need to set up an Azure account and follow the steps in the link above. Then you'd run:

```python
from pathlib import Path
import pypdftotext

# Azure Document Intelligence credentials (placeholders; substitute your own)
pypdftotext.constants.AZURE_DOCINTEL_ENDPOINT = "https://your.document-intelligence.endpoint/"
pypdftotext.constants.AZURE_DOCINTEL_SUBSCRIPTION_KEY = "your_Sub$cript!onKEY"
pdf = Path("some_pdf.pdf").read_bytes()  # can be PdfReader, bytes, or io.BytesIO
pdf_text = "\n".join(pypdftotext.pdf_text_pages(pdf))  # one string per page
print(pdf_text)
```

In its current form, pypdftotext.pdf_text_pages() accepts a PdfReader, BytesIO, or bytes object as input and returns a list of multiline strings, one per page. All of that would need restructuring, of course. At this point I'm just gauging interest in the possibility of pulling something like this in. ;)

(NOTE: in the 'real world' you'd set environment variables with the same names as those AZURE_DOCINTEL_ constants. I'm only setting the constants directly here for convenience.)
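
For the environment-variable route, a small validation helper along these lines can catch a missing setting before any OCR call is attempted. This is only a sketch: `load_azure_settings` is a hypothetical helper, not part of pypdftotext, and how pypdftotext itself reads these variables may differ. The variable names are assumed to match the constants shown above.

```python
import os
from typing import Mapping

# Assumed to match the AZURE_DOCINTEL_ constant names mentioned above.
ENDPOINT_VAR = "AZURE_DOCINTEL_ENDPOINT"
KEY_VAR = "AZURE_DOCINTEL_SUBSCRIPTION_KEY"


def load_azure_settings(env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Return (endpoint, key), raising RuntimeError if either is unset or empty."""
    missing = [name for name in (ENDPOINT_VAR, KEY_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[ENDPOINT_VAR], env[KEY_VAR]
```

Passing the mapping as a parameter keeps the helper easy to test without touching the real process environment.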

stefan6419846 · 2025-07-01T07:02:24Z

Maintainer

Thanks for your proposal. I do not think that we should add support for provider-specific (commercial) OCR solutions to pypdf directly. I tend to mostly use Tesseract as an OCR fallback, while others might want to use their own trained ML model, their favorite LLM, another existing FOSS OCR tool, or one of the other cloud providers (besides Microsoft Azure) like Amazon or Google, which provide similar functionality AFAIK. Production systems also tend to block all external connections.

We might want to add some more sophisticated examples for such cases into our docs and refer to tools like pypdftotext.
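
As a rough idea of what such a provider-neutral docs example could look like, the sketch below treats every OCR option as a plain callable and tries them in order. None of this is pypdf API: `OcrBackend` and `ocr_with_backends` are hypothetical names, and the backends themselves (Tesseract, a cloud service, a local model) are assumed to be wrapped by the user.

```python
from typing import Callable, Sequence

# Hypothetical shape: page image bytes in, recognized text out.
OcrBackend = Callable[[bytes], str]


def ocr_with_backends(
    image: bytes,
    backends: Sequence[tuple[str, OcrBackend]],
) -> tuple[str, str]:
    """Try each (name, backend) in order; return (name, text) from the first success."""
    errors = []
    for name, backend in backends:
        try:
            return name, backend(image)
        except Exception as exc:  # backend may be offline, blocked, or misconfigured
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All OCR backends failed: " + "; ".join(errors))
```

Ordering the list with a local tool first and a cloud provider last would suit systems where external connections are usually blocked.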

2 replies

shartzog Jul 1, 2025
Author

Sounds good and makes perfect sense. Different tools certainly fit different use cases, so I agree that locking in a specific provider probably isn't the best idea.

As an aside, we started w/ Tesseract and needed something better at handwriting, so we switched to Google. It was better but still not as good as we'd hoped, so we ended up testing Google against Azure about a year and a half ago. Azure was cheaper and hands down better at the handwriting, but who knows if that still holds up as fast as things change these days? Just pointing out that Azure might be helpful if you're ever struggling w/ a handwritten form use case... ;)

stefan6419846 Jul 1, 2025
Maintainer

Yes, handwriting detection tends to be a hassle. Tesseract and other better-known OCR libraries either do not support it, have no trained models for it, or are unmaintained. For some cases, we use a commercial tool that works locally, but we have been experimenting with our own training and with open-weight LLMs as well. Given how much even I sometimes struggle to interpret handwritten text, it is no real surprise that automating this is complicated as well - hopefully general direct digital submissions solve this in the long term for the most important use cases.
