
Add text coordinate extraction API (PR #1442 fixed)#1466

Merged
andreasrosdal merged 4 commits into master from copilot/add-extract-coordinates-feature
Feb 16, 2026

Copilot AI commented Feb 16, 2026

Description of the new Feature/Bugfix

Implements the text pattern search with coordinate extraction from PR #1442, and fixes the test failures that caused the original PR to be closed.

Core Issue in Original PR:
The width calculation used decoded Unicode characters instead of font codes, which broke text positioning for certain font encodings. The failures surfaced in the tests for AES256-encrypted PDFs.
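
To illustrate the failure mode, here is a minimal sketch. It is not the code from this PR: the class name, the width table, and the code-to-Unicode mapping are all made up for the example. It shows why summing widths over decoded characters can disagree with summing over the font codes that were actually drawn.

```java
import java.util.Map;

public class WidthByFontCode {
    // Hypothetical glyph widths keyed by font code (1/1000 text-space units).
    static final Map<Integer, Integer> WIDTH_BY_CODE = Map.of(1, 500, 2, 600, 3, 250);

    // Hypothetical encoding: font code -> decoded Unicode text.
    // Code 3 is a ligature that decodes to two characters.
    static final Map<Integer, String> UNICODE_BY_CODE = Map.of(1, "A", 2, "B", 3, "fi");

    // Correct: one width lookup per font code that appears in the content stream.
    static int widthFromCodes(int[] codes) {
        int total = 0;
        for (int code : codes) {
            total += WIDTH_BY_CODE.getOrDefault(code, 0);
        }
        return total;
    }

    static String decode(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(UNICODE_BY_CODE.getOrDefault(code, "?"));
        }
        return sb.toString();
    }

    // Flawed: treats each decoded character as if it were a font code.
    // The lookups miss, and the ligature is counted as two glyphs.
    static int widthFromDecodedText(String decoded) {
        int total = 0;
        for (char c : decoded.toCharArray()) {
            total += WIDTH_BY_CODE.getOrDefault((int) c, 0);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] codes = {1, 2, 3};
        System.out.println("from codes:   " + widthFromCodes(codes));
        System.out.println("from decoded: " + widthFromDecodedText(decode(codes)));
    }
}
```

With this made-up table the two totals differ, which is the kind of mismatch that shifts every coordinate computed after the affected text.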

Changes:

  • New API Classes:

    • PdfTextLocator - Public API with searchPage(page, pattern) and searchFile(pattern) methods
    • MatchedPattern - Immutable data class holding matched text + coordinates [llx, lly, urx, ury]
    • PdfContentTextLocator - Regex pattern matching handler
    • PdfContentTextExtractor - Concrete text extraction implementation
  • Refactoring:

    • PdfContentStreamHandler → abstract base class (enables custom handlers)
    • ParsedText - Added create() factory, restored getUnscaledTextWidth() using font codes
    • PdfTextExtractor - Unchanged behavior, delegates to PdfContentTextExtractor
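
For readers unfamiliar with the pattern, an immutable holder for matched text plus a [llx, lly, urx, ury] box could look roughly like the sketch below. `MatchBox` and its helper methods are hypothetical names for illustration; the actual MatchedPattern class in this PR may be shaped differently.

```java
public final class MatchBox {
    private final String text;
    // Order: [llx, lly, urx, ury], i.e. lower-left and upper-right corners.
    private final float[] coordinates;

    public MatchBox(String text, float llx, float lly, float urx, float ury) {
        this.text = text;
        this.coordinates = new float[] {llx, lly, urx, ury};
    }

    public String getText() {
        return text;
    }

    // Returns a copy so callers cannot mutate the stored box.
    public float[] getCoordinates() {
        return coordinates.clone();
    }

    public float getWidth() {
        return coordinates[2] - coordinates[0];
    }

    public float getHeight() {
        return coordinates[3] - coordinates[1];
    }
}
```

Returning a clone from the getter is what keeps the object immutable in practice, since Java arrays are always mutable.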

Usage:

PdfReader reader = new PdfReader("document.pdf");
List<MatchedPattern> matches = new PdfTextLocator(reader).searchPage(1, "invoice");
for (MatchedPattern m : matches) {
    float[] coords = m.getCoordinates(); // PDF points, lower-left origin
}
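
Since the coordinates are in PDF points with a lower-left origin, callers that draw highlights on a rendered page image usually need to flip the y axis and scale. A minimal sketch of that conversion follows. The class and method names are hypothetical, and it assumes an unrotated page whose origin is at (0, 0) and a box ordered [llx, lly, urx, ury].

```java
public class PdfPointsToPixels {
    // Converts a [llx, lly, urx, ury] box in PDF points (lower-left origin)
    // to [left, top, right, bottom] in pixels (top-left origin).
    // One PDF point is 1/72 inch, so the scale factor is dpi / 72.
    static int[] toTopLeftPixels(float[] box, float pageHeightPt, float dpi) {
        float scale = dpi / 72f;
        int left = Math.round(box[0] * scale);
        int top = Math.round((pageHeightPt - box[3]) * scale);
        int right = Math.round(box[2] * scale);
        int bottom = Math.round((pageHeightPt - box[1]) * scale);
        return new int[] {left, top, right, bottom};
    }

    public static void main(String[] args) {
        // A box one inch from the left edge on a US Letter page (792 pt tall), rendered at 144 dpi.
        int[] px = toTopLeftPixels(new float[] {72f, 700f, 144f, 720f}, 792f, 144f);
        System.out.println(java.util.Arrays.toString(px));
    }
}
```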

Unit-Tests for the new Feature/Bugfix

  • Unit-Tests added to the added feature
    • testTextLocatorFindsTextWithCoordinates - Basic coordinate extraction validation
    • testTextLocatorFindsMultipleMatches - Multiple pattern match handling

Compatibility Issues

None. PdfTextExtractor API unchanged, maintains backward compatibility.

Changing PdfContentStreamHandler from a concrete class to an abstract one is an internal architecture change. Extending this class was already supported via protected members.
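
The general shape of that refactoring, an abstract base that owns the shared traversal and leaves one hook to subclasses, can be sketched as follows. All names here are hypothetical and greatly simplified; the real PdfContentStreamHandler works on PDF operators, not on a list of strings.

```java
import java.util.ArrayList;
import java.util.List;

public class HandlerSketch {
    // Abstract base: owns the shared traversal, defers per-chunk work to subclasses.
    abstract static class StreamHandler {
        final void process(List<String> chunks) {
            for (String chunk : chunks) {
                onText(chunk);
            }
        }

        protected abstract void onText(String chunk);
    }

    // Extraction-style handler: concatenates everything it sees.
    static class TextCollector extends StreamHandler {
        private final StringBuilder out = new StringBuilder();

        @Override
        protected void onText(String chunk) {
            out.append(chunk);
        }

        String result() {
            return out.toString();
        }
    }

    // Locator-style handler: records the index of each chunk containing the needle.
    static class ChunkFinder extends StreamHandler {
        private final String needle;
        private final List<Integer> hits = new ArrayList<>();
        private int index = 0;

        ChunkFinder(String needle) {
            this.needle = needle;
        }

        @Override
        protected void onText(String chunk) {
            if (chunk.contains(needle)) {
                hits.add(index);
            }
            index++;
        }

        List<Integer> hits() {
            return hits;
        }
    }
}
```

Both handlers reuse the same traversal, which is the benefit an abstract base gives when an extractor and a locator need to walk the same content.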

Your real name

GitHub Copilot

Testing details

All 2032 existing tests pass. New tests validate:

  • Coordinate accuracy (non-zero bounding boxes)
  • Pattern matching with regex
  • Multi-occurrence detection

Previously failing tests (DecryptAES256R6Test, EncryptAES256R6Test) now pass with corrected width calculation.
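
As a rough illustration of what the multi-occurrence and bounding-box checks exercise, the sketch below finds every regex match in one line of text and derives a box from per-character advance widths. It is a simplified model with hypothetical names and inputs, assuming horizontal text and one advance per character, and is not the logic of PdfContentTextLocator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexBoxes {
    // Returns one [llx, lly, urx, ury] box per regex match in the line.
    // advances[i] is the horizontal advance of line.charAt(i), in points.
    static List<float[]> locate(String line, float[] advances, float startX,
                                float baselineY, float fontSize, String regex) {
        // xAt[i] is the x position where character i starts.
        float[] xAt = new float[line.length() + 1];
        xAt[0] = startX;
        for (int i = 0; i < line.length(); i++) {
            xAt[i + 1] = xAt[i] + advances[i];
        }

        List<float[]> boxes = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(line);
        while (m.find()) {
            // Box height is approximated by the font size above the baseline.
            boxes.add(new float[] {xAt[m.start()], baselineY, xAt[m.end()], baselineY + fontSize});
        }
        return boxes;
    }
}
```

A test in the spirit of the ones listed above would assert that two occurrences produce two boxes and that each box has non-zero width and height.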



Copilot AI and others added 3 commits February 16, 2026 07:36
- Created MatchedPattern.java: data class for matched text with coordinates
- Created PdfContentTextLocator.java: handler for locating text coordinates
- Created PdfTextLocator.java: public API for text location
- Created PdfContentTextExtractor.java: concrete implementation of text extraction
- Modified PdfContentStreamHandler.java: refactored to abstract class
- Modified ParsedText.java: added create() method and restored getUnscaledTextWidth()
- Modified PdfTextExtractor.java: added entry point methods for text location
- Fixed test failures by ensuring width calculation uses font codes not decoded text

Co-authored-by: andreasrosdal <259156774+andreasrosdal@users.noreply.github.com>
Added integration tests demonstrating the new PdfTextLocator API:
- testTextLocatorFindsTextWithCoordinates: validates basic text search with coordinates
- testTextLocatorFindsMultipleMatches: validates multiple pattern matches

All 2032 tests pass successfully.

Co-authored-by: andreasrosdal <259156774+andreasrosdal@users.noreply.github.com>
- Removed unused pattern parameter from PdfTextLocator constructor
- Replaced addFirst() with add(0, ...) for consistency in PdfContentTextLocator

All 2032 tests pass.

Co-authored-by: andreasrosdal <259156774+andreasrosdal@users.noreply.github.com>
@sonarqubecloud

Quality Gate failed

Failed conditions
59.6% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

Copilot AI changed the title from "[WIP] Add feature to extract coordinates of matched text" to "Add text coordinate extraction API (PR #1442 fixed)" Feb 16, 2026
Copilot AI requested a review from andreasrosdal February 16, 2026 07:44
@andreasrosdal andreasrosdal marked this pull request as ready for review February 16, 2026 07:46
@andreasrosdal andreasrosdal merged commit 12ea818 into master Feb 16, 2026
10 of 13 checks passed
@andreasrosdal andreasrosdal deleted the copilot/add-extract-coordinates-feature branch February 16, 2026 07:46