
Adds pytesseract, tesseract and poppler-utils #1648

Merged
merged 6 commits into from
Nov 22, 2023

Conversation

MeTaNoV
Contributor

@MeTaNoV MeTaNoV commented Nov 16, 2023

Enables the ingestion of copy-protected PDFs via OCR instead of text extraction.

Description

Copy-protected PDFs can't be properly imported via the standard langchain loader.

See the following errors:

2023-11-15 14:16:31,927 [INFO] models.files: Computing documents from file Cradle to Cradle Criteria for the built environmen.pdf
[nltk_data] Downloading package punkt to
[nltk_data]     /home/pascal_gula_luccid_ai/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/pascal_gula_luccid_ai/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
Error processing file: detectron2 is not installed, pytesseract is not installed and the text of the PDF is not extractable. To process this file, install detectron2, install pytesseract, or remove copy protection from the PDF.
2023-11-15 15:04:14,624 [INFO] models.files: Computing documents from file Cradle to Cradle Criteria for the built environmen.pdf
Error processing file: Unable to get page count. Is poppler installed and in PATH?
2023-11-15 15:59:11,886 [INFO] models.files: Computing documents from file Cradle to Cradle Criteria for the built environmen.pdf
Error processing file: tesseract is not installed or it's not in your PATH. See README file for more information.
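
Each of the three errors above points to a different missing piece: the `pytesseract` Python package, the poppler tools, and the `tesseract` binary. As a rough sketch (not part of this PR), a small pre-flight check could report all of them at once instead of failing one upload at a time. The helper name `missing_ocr_dependencies` is hypothetical, and the sketch assumes that `pdfinfo` (shipped with poppler-utils) is the tool behind the "Unable to get page count" message.

```python
import importlib.util
import shutil


def missing_ocr_dependencies(which=shutil.which, find_spec=importlib.util.find_spec):
    """Return the names of OCR prerequisites that cannot be found.

    Hypothetical helper for illustration; `which` and `find_spec` are
    injectable so the check can be exercised without touching the system.
    """
    missing = []
    # pytesseract is only a Python wrapper; it still needs the tesseract binary.
    if find_spec("pytesseract") is None:
        missing.append("pytesseract")
    # Matches "tesseract is not installed or it's not in your PATH".
    if which("tesseract") is None:
        missing.append("tesseract")
    # Assumption: the page count is read with pdfinfo from poppler-utils.
    if which("pdfinfo") is None:
        missing.append("poppler-utils")
    return missing


if __name__ == "__main__":
    missing = missing_ocr_dependencies()
    print("missing:", ", ".join(missing) if missing else "none")
```

Running this inside the backend container before and after the change should show whether the image now carries everything the OCR path needs.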

Checklist before requesting a review

Please delete options that are not relevant.

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented hard-to-understand areas
  • I have ideally added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged

Screenshots (if appropriate):

None

@dosubot dosubot bot added the area: backend Related to backend functionality or under the /backend directory label Nov 16, 2023
@StanGirard
Collaborator

Thanks 🙏 Could you tell me how much this increases the image size?

@MeTaNoV
Contributor Author

MeTaNoV commented Nov 17, 2023

Sure, here you go:

Old:
quivr-backend-core   latest    056750d86247   2 minutes ago   13.4GB
New:
quivr-backend-core   latest    742c5d235313   15 hours ago    13.5GB

The image is already massive, so hopefully this doesn't change much. I saw you tried a multi-stage build that might bring the size down. I was surprised how fast my volume was filling up; now I know exactly why! :)

It would be worth mentioning somewhere in the dev docs to run `docker image prune`, since during development you end up with tons of dangling images lying around! :D

PS: probably see you soon in Paris! ;)

@StanGirard
Collaborator

Damn, I really need to tackle this issue. We can't continue like that.

I'll try it on Tuesday. I'm on vacation with limited internet access.

Thanks a lot, and check your email ;)


@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Nov 21, 2023
@MeTaNoV
Contributor Author

MeTaNoV commented Nov 21, 2023

@StanGirard looks like a more general problem with the tests! :)

@StanGirard
Collaborator

Yeah, they don't run on contributors' PRs 🗡️

@MeTaNoV
Contributor Author

MeTaNoV commented Nov 21, 2023

Alright, should I do anything else? (PR title?)

@StanGirard StanGirard merged commit 8bbe6e7 into QuivrHQ:main Nov 22, 2023
mamadoudicko pushed a commit that referenced this pull request Nov 23, 2023
🤖 I have created a release *beep* *boop*
---


## 0.0.118 (2023-11-22)

## What's Changed
* docs: add api based brains by @mamadoudicko in #1685
* Adds pytesseract, tesseract and poppler-utils by @MeTaNoV in #1648

## New Contributors
* @MeTaNoV made their first contribution in #1648

**Full Changelog**: v0.0.117...v0.0.118

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
coolCatalyst added a commit to coolCatalyst/quivr that referenced this pull request Jun 1, 2024