Feature/pdf processed by tika #60

lpi-tn · 2025-08-28T15:13:18Z

This pull request changes how PDF content is extracted and processed: it migrates from the pypdf library to an external Tika microservice for PDF text extraction. It also updates dependencies, adds an environment variable, and refactors tests to support the migration. The most important updates are grouped below.

PDF Extraction and Processing Modernization

Replaced all usage of pypdf for PDF text extraction with new functions that send PDF files to a Tika microservice (extract_txt_from_pdf_with_tika, _send_pdf_to_tika, and _parse_tika_content in pdf_extractor.py). This affects multiple plugins, including HAL and OAPEN, and removes the page size checks previously done with pypdf.
Added the TIKA_ADDRESS variable to the Kubernetes values file so that the Tika microservice address can be set through the environment.
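
For readers unfamiliar with Tika, the flow described above can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: the helper names (build_tika_request, clean_tika_text, extract_text), the default address, and the whitespace cleanup are assumptions made for the example. It relies only on Tika Server's documented behaviour of accepting a document via PUT on /tika and returning plain text when asked with Accept: text/plain.

```python
import os
import urllib.request

# Assumption: fallback used only when TIKA_ADDRESS is unset
# (9998 is Tika Server's default port).
DEFAULT_TIKA_ADDRESS = "http://localhost:9998"


def build_tika_request(pdf_bytes: bytes, tika_address: str) -> urllib.request.Request:
    """Build a PUT request asking Tika's /tika endpoint for plain text."""
    return urllib.request.Request(
        url=f"{tika_address.rstrip('/')}/tika",
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
    )


def clean_tika_text(raw: str) -> str:
    """Collapse runs of whitespace inside each line and drop empty lines."""
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def extract_text(pdf_bytes: bytes, timeout: float = 60.0) -> str:
    """Send a PDF to the Tika service named by TIKA_ADDRESS and return its text."""
    tika_address = os.environ.get("TIKA_ADDRESS", DEFAULT_TIKA_ADDRESS)
    request = build_tika_request(pdf_bytes, tika_address)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return clean_tika_text(response.read().decode("utf-8"))
```

Keeping request construction and text cleanup separate from the network call means both can be unit tested without a running Tika instance.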

Dependency and Compatibility Updates

Updated dependencies in pyproject.toml: upgraded refinedoc, pinned qdrant-client to version 1.12.2, and added azure-storage-blob as a new dependency. Removed pypdf, which is no longer used.

Testing Refactor and Coverage

Refactored and expanded tests for PDF extraction, adding tests for the Tika-based extraction methods and updating mocks to patch Tika-related functions instead of pypdf.
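
The mocking approach mentioned above might look something like the sketch below. It is illustrative only: the pdf_extractor namespace here is a stand-in object rather than the real welearn_datastack module, and the signature given to _send_pdf_to_tika, as well as the extract_txt helper, are assumptions for the example.

```python
import types
from unittest import mock


def _send_pdf_to_tika(pdf_bytes: bytes) -> str:
    # Stand-in for the real network call; unit tests should never reach it.
    raise ConnectionError("no Tika service available in unit tests")


# Hypothetical stand-in for the pdf_extractor module.
pdf_extractor = types.SimpleNamespace(_send_pdf_to_tika=_send_pdf_to_tika)


def extract_txt(pdf_bytes: bytes) -> str:
    """Hypothetical caller: fetches text through the Tika helper and trims it."""
    return pdf_extractor._send_pdf_to_tika(pdf_bytes).strip()


def test_extract_txt_uses_tika() -> None:
    # Patch the Tika helper so the test runs without any network access.
    with mock.patch.object(
        pdf_extractor, "_send_pdf_to_tika", return_value="  page text \n"
    ) as fake_send:
        assert extract_txt(b"%PDF-1.4") == "page text"
    fake_send.assert_called_once_with(b"%PDF-1.4")
```

Patching at the point where the helper is looked up keeps plugin tests independent of a running Tika service.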

Other Notable Changes

Lowercased the corpus category title in a SQL migration for consistency.
Removed the unused materialized view document_related.qty_document_in_qdrant.

Copilot

Pull Request Overview

This PR migrates PDF text extraction from the pypdf library to an external Tika microservice, improving content extraction capabilities and modernizing the PDF processing pipeline.

Replaces pypdf with Tika-based PDF extraction functions across all plugins (HAL, OAPEN, OpenAlex)
Removes PDF page size validation previously done with pypdf
Updates dependencies and adds environment configuration for the Tika service

Reviewed Changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 6 comments.

File	Description
welearn_datastack/modules/pdf_extractor.py	Implements new Tika-based PDF extraction functions to replace pypdf functionality
welearn_datastack/plugins/rest_requesters/*.py	Updates PDF processing in HAL, OAPEN, and OpenAlex plugins to use Tika extraction
tests/document_collector_hub/test_pdf_extractor.py	Refactors tests to cover new Tika-based extraction methods
tests/document_collector_hub/plugins_test/*.py	Updates plugin tests to mock Tika functions instead of pypdf
pyproject.toml	Updates dependencies: removes pypdf, upgrades refinedoc, pins qdrant-client
k8s/welearn-datastack/values.yaml	Adds TIKA_ADDRESS environment variable configuration

Co-authored-by: Sandra Guerreiro <sandragjacinto@gmail.com>

lpi-tn added 16 commits August 26, 2025 16:45

various update + httpx

f9f52c2

Tika support for PDF

7083def

rm httpx

b063c36

linter

1845237

use new method

6d5c866

use new method

c093156

reliability

c655ccd

test pdf processed by tika

dc4bc18

test pdf processed by tika

2e46bd9

add tika address

d63db82

remove useless lib

d397e7c

remove useless lib

6c1681d

remove useless lib

1a58c99

typo

f9b08f3

view

e046028

rm french comments

45efd1b

lpi-tn requested review from Copilot and sandragjacinto August 28, 2025 15:13

Copilot AI reviewed Aug 28, 2025

typo

f622ae1

sandragjacinto reviewed Aug 29, 2025

welearn_datastack/modules/pdf_extractor.py (outdated)

sandragjacinto approved these changes Aug 29, 2025

lpi-tn and others added 3 commits August 29, 2025 11:09

delete

e67525c

Update welearn_datastack/modules/pdf_extractor.py

caa67d5

Co-authored-by: Sandra Guerreiro <sandragjacinto@gmail.com>

import re

4abd80c

lpi-tn merged commit 1e0f57e into main Aug 29, 2025
7 checks passed

lpi-tn deleted the Feature/pdf-processed-by-tika branch August 29, 2025 09:16
