Feature/pdf processed by tika #60
Pull Request Overview
This PR migrates PDF text extraction from the pypdf library to an external Tika microservice, improving content extraction capabilities and modernizing the PDF processing pipeline.
- Replaces `pypdf` with Tika-based PDF extraction functions across all plugins (HAL, OAPEN, OpenAlex)
- Removes PDF page size validation previously done with `pypdf`
- Updates dependencies and adds environment configuration for the Tika service
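To illustrate the kind of call such a migration involves, here is a minimal sketch of sending PDF bytes to a Tika server. It is not the PR's code: the helper names `build_tika_request` and `extract_text`, the default address, and the choice of the standard library's `urllib` are all assumptions made for illustration. It relies on Tika Server's `PUT /tika` endpoint returning plain text when `Accept: text/plain` is sent, and assumes `TIKA_ADDRESS` holds the server's base URL.

```python
import os
import urllib.request
from typing import Optional

# Assumption: illustrative fallback only; the real deployment sets TIKA_ADDRESS.
DEFAULT_TIKA_ADDRESS = "http://localhost:9998"


def build_tika_request(pdf_bytes: bytes, tika_address: Optional[str] = None) -> urllib.request.Request:
    """Build a PUT request that asks a Tika server for the plain text of a PDF."""
    base = tika_address or os.environ.get("TIKA_ADDRESS", DEFAULT_TIKA_ADDRESS)
    return urllib.request.Request(
        url=base.rstrip("/") + "/tika",
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
    )


def extract_text(pdf_bytes: bytes, tika_address: Optional[str] = None, timeout: float = 60.0) -> str:
    """Send the PDF to Tika and return the decoded response body (hypothetical helper)."""
    request = build_tika_request(pdf_bytes, tika_address)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Separating request construction from the network call keeps the first function testable without a running Tika instance.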
Reviewed Changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| welearn_datastack/modules/pdf_extractor.py | Implements new Tika-based PDF extraction functions to replace pypdf functionality |
| welearn_datastack/plugins/rest_requesters/*.py | Updates PDF processing in HAL, OAPEN, and OpenAlex plugins to use Tika extraction |
| tests/document_collector_hub/test_pdf_extractor.py | Refactors tests to cover new Tika-based extraction methods |
| tests/document_collector_hub/plugins_test/*.py | Updates plugin tests to mock Tika functions instead of pypdf |
| pyproject.toml | Updates dependencies: removes pypdf, upgrades refinedoc, pins qdrant-client |
| k8s/welearn-datastack/values.yaml | Adds TIKA_ADDRESS environment variable configuration |
This pull request changes how PDF content is extracted and processed, migrating from the `pypdf` library to an external Tika microservice for PDF text extraction. It also includes dependency updates, a new environment variable, and test refactoring to support these changes. The most important updates are grouped below.

PDF Extraction and Processing Modernization
- Replaced all usage of `pypdf` for PDF text extraction with new functions that send PDF files to a Tika microservice (`extract_txt_from_pdf_with_tika`, `_send_pdf_to_tika`, and `_parse_tika_content` in `pdf_extractor.py`). This affects multiple plugins, including HAL and OAPEN, and removes the page size checks previously done with `pypdf`.
- Updated the environment configuration to add the `TIKA_ADDRESS` variable in the Kubernetes values file, so the Tika microservice address can be set via an environment variable.

Dependency and Compatibility Updates
- `pyproject.toml`: upgraded `refinedoc`, pinned `qdrant-client` to version 1.12.2, and added `azure-storage-blob` as a new dependency. Removed `pypdf`, which is no longer used.

Testing Refactor and Coverage
- Refactored the tests to cover the new Tika-based extraction functions; plugin tests now mock the Tika functions instead of `pypdf`.

Other Notable Changes
- […] `document_related.qty_document_in_qdrant`.