-
Thanks for your proposal. I do not think that we should add support for provider-specific (commercial) OCR solutions to pypdf directly. I mostly use Tesseract as an OCR fallback, while others might want to use their personally trained ML model, their favorite LLM, another existing FOSS OCR tool, or one of the other cloud providers (besides Microsoft Azure) like Amazon or Google, which provide similar functionality AFAIK. Production systems also tend to block all external connections. We might want to add some more sophisticated examples for such cases to our docs and refer to tools like pypdftotext.
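A provider-neutral fallback of the kind described here could look roughly like the sketch below. It is only an illustration, not part of pypdf: `extract_with_fallback` and the `ocr` callback are hypothetical names, and the only pypdf behavior assumed is that each page object has an `extract_text()` method. The caller plugs in whatever OCR backend they prefer (Tesseract, a local model, a cloud service).

```python
from typing import Callable, Iterable, List, Protocol


class PageLike(Protocol):
    """Anything exposing a page-level ``extract_text()`` like pypdf pages do."""

    def extract_text(self) -> str: ...


def extract_with_fallback(
    pages: Iterable[PageLike],
    ocr: Callable[[int, PageLike], str],
) -> List[str]:
    """Return one string per page, calling ``ocr`` only for empty pages.

    ``ocr`` is a hypothetical user-supplied callback that receives the
    zero-based page index and the page object; pypdf ships no such hook.
    """
    texts: List[str] = []
    for index, page in enumerate(pages):
        text = page.extract_text() or ""
        if not text.strip():
            # Regular extraction produced nothing usable, so defer to the
            # caller's OCR backend of choice.
            text = ocr(index, page)
        texts.append(text)
    return texts
```

With pypdf this might be called as `extract_with_fallback(PdfReader("scan.pdf").pages, my_ocr)`, where `my_ocr` wraps Tesseract or any other tool. Keeping the OCR step behind a callback means no provider-specific code or network access is needed in the library itself.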
-
I've created an extension to pypdf's text extraction called pypdftotext that falls back to using Azure Document Intelligence OCR for pages that fail to produce output with pypdf's extraction routines. If OCR is triggered, the Azure OCR response is restructured into a fixed-width page representation using the machinery I introduced with layout mode a while ago.
I think this behavior would be high value for pypdf users. E.g. when you're asked "why didn't I get text?" by someone sitting on an image-based PDF, you could tell them to sign up for Azure and set a couple of constants and they'd be off to the races. Same when they say "the formatting is all weird". The OCR stuff generally comes back nice and neat.
It also comes with a lot of potential headaches, so I'd definitely understand a hard pass. E.g.:
In any event, if your interest is piqued, you can create a Python 3.10+ env and pip install pypdftotext to check it out. To see how the OCR stuff behaves, you'd need to set up an Azure account and follow the steps in the link above. Then you'd run:
In its current form, pypdftotext.pdf_text_pages() will accept a PdfReader, BytesIO, or bytes object as input and return a list of multiline strings, one per page. All that would need restructuring, of course. Just gauging interest in the possibility of pulling something like this in at this point. ;)
(NOTE: in the 'real world' you'd set ENV vars of the same name as those AZURE_DOCINTEL_ constants. Just showing it that way for convenience.)
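To illustrate the input handling mentioned above (a reader, a BytesIO, or raw bytes), here is a minimal sketch of how such inputs might be normalized before extraction. It is not pypdftotext's actual code: `as_stream` is a hypothetical helper, and anything that is neither bytes nor a BytesIO is simply assumed to be an already-constructed reader and passed through unchanged.

```python
from io import BytesIO
from typing import Any


def as_stream(source: Any) -> Any:
    """Normalize raw PDF input to something a reader can consume.

    Hypothetical helper: bytes are wrapped in a BytesIO, an existing
    BytesIO is rewound to its start, and any other object (assumed to be
    a reader instance) is returned as-is.
    """
    if isinstance(source, (bytes, bytearray)):
        return BytesIO(bytes(source))
    if isinstance(source, BytesIO):
        # A previously read stream may be positioned at its end.
        source.seek(0)
        return source
    return source
```

A function taking all three input kinds could call this first and only construct a reader when it gets a stream back.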