lpi-tn commented Aug 28, 2025

This pull request changes how PDF content is extracted and processed: text extraction moves from the pypdf library to an external Tika microservice. It also updates dependencies, adds an environment variable, and refactors tests to support the change. The most important updates are grouped below.

PDF Extraction and Processing Modernization

  • Replaced all uses of pypdf for PDF text extraction with new functions in pdf_extractor.py that send PDF files to a Tika microservice (extract_txt_from_pdf_with_tika, _send_pdf_to_tika, and _parse_tika_content). This affects multiple plugins, including HAL and OAPEN, and removes the page size checks previously done with pypdf.

  • Added the TIKA_ADDRESS variable to the Kubernetes values file so that the Tika microservice address can be set through the environment.
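
As a rough illustration of the flow described above, the following sketch reads the service address from TIKA_ADDRESS and sends raw PDF bytes to a Tika server for plain-text extraction. It is not the code from this PR: `build_tika_request` and `extract_text_with_tika` are hypothetical names, the fallback address is an assumption based on the Tika server's default port, and the real `_send_pdf_to_tika` and `_parse_tika_content` may use a different endpoint or response format.

```python
import os
import urllib.request
from typing import Optional

# Assumed fallback; 9998 is the Tika server's default port.
DEFAULT_TIKA_ADDRESS = "http://localhost:9998"


def build_tika_request(
    pdf_bytes: bytes, tika_address: Optional[str] = None
) -> urllib.request.Request:
    """Build a PUT request asking Tika to return the PDF's text as plain text.

    Hypothetical helper: the address comes from the argument, else from the
    TIKA_ADDRESS environment variable, else from the assumed default.
    """
    base = tika_address or os.environ.get("TIKA_ADDRESS", DEFAULT_TIKA_ADDRESS)
    return urllib.request.Request(
        url=f"{base.rstrip('/')}/tika",
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
    )


def extract_text_with_tika(
    pdf_bytes: bytes, tika_address: Optional[str] = None, timeout: float = 60.0
) -> str:
    """Send the PDF to Tika and return the extracted text (hypothetical helper)."""
    request = build_tika_request(pdf_bytes, tika_address)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a Tika server reachable at the configured address, `extract_text_with_tika(open("paper.pdf", "rb").read())` would return the document text. Keeping request construction separate from the network call makes the address handling easy to check without a running service.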

Dependency and Compatibility Updates

  • Updated dependencies in pyproject.toml: upgraded refinedoc, pinned qdrant-client to version 1.12.2, and added azure-storage-blob. Removed pypdf, which is no longer used.

Testing Refactor and Coverage

  • Refactored and expanded the PDF extraction tests: added tests for the Tika-based extraction methods and updated mocks to patch Tika-related functions instead of pypdf.

Other Notable Changes

  • Lowercased the corpus category title in a SQL migration for consistency.
  • Removed the unused materialized view document_related.qty_document_in_qdrant.

Copilot AI left a comment

Pull Request Overview

This PR migrates PDF text extraction from the pypdf library to an external Tika microservice, improving content extraction capabilities and modernizing the PDF processing pipeline.

  • Replaces pypdf with Tika-based PDF extraction functions across all plugins (HAL, OAPEN, OpenAlex)
  • Removes PDF page size validation previously done with pypdf
  • Updates dependencies and adds environment configuration for the Tika service

Reviewed Changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| welearn_datastack/modules/pdf_extractor.py | Implements new Tika-based PDF extraction functions to replace pypdf functionality |
| welearn_datastack/plugins/rest_requesters/*.py | Updates PDF processing in HAL, OAPEN, and OpenAlex plugins to use Tika extraction |
| tests/document_collector_hub/test_pdf_extractor.py | Refactors tests to cover new Tika-based extraction methods |
| tests/document_collector_hub/plugins_test/*.py | Updates plugin tests to mock Tika functions instead of pypdf |
| pyproject.toml | Updates dependencies: removes pypdf, upgrades refinedoc, pins qdrant-client |
| k8s/welearn-datastack/values.yaml | Adds TIKA_ADDRESS environment variable configuration |

lpi-tn and others added 3 commits August 29, 2025 11:09
Co-authored-by: Sandra Guerreiro <sandragjacinto@gmail.com>
lpi-tn merged commit 1e0f57e into main Aug 29, 2025
7 checks passed
lpi-tn deleted the Feature/pdf-processed-by-tika branch August 29, 2025 09:16
3 participants