Adds pytesseract, tesseract and poopler-utils #1648
Thanks 🙏 Could you tell me the size difference it incurs?
Sure, here you go:
It is already massive, and hopefully this doesn't change much. I saw you tried a multi-stage build that might bring the size down. I was surprised how fast my volume was filling up; now I know exactly why! :) Tell somewhere in dev dock to
PS: probably see you soon in Paris! ;)
Damn, I really need to tackle this issue. We can't continue like this. I'll try it on Tuesday; I'm on vacation with limited internet access. Thanks a lot, and check your email ;)
@StanGirard looks like a more general problem with the tests! :)
Yeah, they don't run on contributors' PRs 🗡️
Alright, should I do anything else? (PR title?)
🤖 I have created a release *beep* *boop*

---

## 0.0.118 (2023-11-22)

## What's Changed
* docs: add api based brains by @mamadoudicko in #1685
* Adds pytesseract, tesseract and poopler-utils by @MeTaNoV in #1648

## New Contributors
* @MeTaNoV made their first contribution in #1648

**Full Changelog**: v0.0.117...v0.0.118

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
To enable the ingestion of copy-protected PDFs via OCR instead of text extraction
Description
Copy-protected PDFs can't be imported properly via the standard LangChain loader.
See the following errors:
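To illustrate the idea of falling back to OCR when text extraction fails, here is a minimal sketch. It is not the code from this PR: the function names (`needs_ocr`, `ocr_pdf`, `load_pdf_text`), the character threshold, and the DPI value are all hypothetical, and it assumes the `pdf2image` package (which relies on poppler-utils) and `pytesseract` (which relies on the tesseract binary) are installed.

```python
from typing import Callable, List


def needs_ocr(extracted_pages: List[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic (an assumption, not from the PR): fall back to OCR when
    plain text extraction returns almost no characters."""
    if not extracted_pages:
        return True
    total_chars = sum(len(page.strip()) for page in extracted_pages)
    return total_chars < min_chars_per_page * len(extracted_pages)


def ocr_pdf(path: str, dpi: int = 200) -> List[str]:
    """Render each page to an image and run Tesseract on it."""
    # Imported lazily so this module loads even without the OCR stack.
    from pdf2image import convert_from_path  # needs poppler-utils installed
    import pytesseract  # needs the tesseract binary installed

    images = convert_from_path(path, dpi=dpi)
    return [pytesseract.image_to_string(image) for image in images]


def load_pdf_text(
    path: str,
    extract: Callable[[str], List[str]],
    ocr: Callable[[str], List[str]] = ocr_pdf,
) -> List[str]:
    """Try the regular extractor first; use OCR only if it yields too little text."""
    pages = extract(path)
    return ocr(path) if needs_ocr(pages) else pages
```

Passing the extractor and the OCR step in as callables keeps the fallback decision testable without a real PDF or the OCR toolchain.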