Add text coordinate extraction API (PR #1442 fixed) #1466
Merged
andreasrosdal merged 4 commits into master on Feb 16, 2026
- Created MatchedPattern.java: data class for matched text with coordinates
- Created PdfContentTextLocator.java: handler for locating text coordinates
- Created PdfTextLocator.java: public API for text location
- Created PdfContentTextExtractor.java: concrete implementation of text extraction
- Modified PdfContentStreamHandler.java: refactored to abstract class
- Modified ParsedText.java: added create() method and restored getUnscaledTextWidth()
- Modified PdfTextExtractor.java: added entry point methods for text location
- Fixed test failures by ensuring width calculation uses font codes, not decoded text

Co-authored-by: andreasrosdal <259156774+andreasrosdal@users.noreply.github.com>
Added integration tests demonstrating the new PdfTextLocator API:
- testTextLocatorFindsTextWithCoordinates: validates basic text search with coordinates
- testTextLocatorFindsMultipleMatches: validates multiple pattern matches

All 2032 tests pass successfully.

Co-authored-by: andreasrosdal <259156774+andreasrosdal@users.noreply.github.com>
- Removed unused pattern parameter from PdfTextLocator constructor
- Replaced addFirst() with add(0, ...) for consistency in PdfContentTextLocator

All 2032 tests pass.

Co-authored-by: andreasrosdal <259156774+andreasrosdal@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add feature to extract coordinates of matched text" to "Add text coordinate extraction API (PR #1442 fixed)" on Feb 16, 2026


Description of the new Feature/Bugfix
Implements text pattern search with coordinate extraction from PR #1442, fixing the test failures that caused the original PR to be closed.
Core Issue in Original PR:
Width calculation used decoded Unicode characters instead of font codes, which broke text positioning for certain font encodings (seen in the AES256-encrypted PDF tests).
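To make the failure mode concrete, here is a minimal, self-contained sketch of why the two width calculations can disagree. It does not use OpenPDF: `ToyFont`, `WidthDemo`, and the width and decoding tables are hypothetical stand-ins, and the sample values are invented for illustration. The assumption is that glyph widths are keyed by the character codes found in the content stream, while decoding may map a single code to several Unicode characters.

```java
import java.util.Map;

/** Hypothetical stand-in for a font (not an OpenPDF class). */
final class ToyFont {
    // Glyph widths in 1/1000 text-space units, keyed by character code.
    private final Map<Integer, Integer> widthByCode;
    // Decoding table: one character code may expand to several Unicode characters.
    private final Map<Integer, String> unicodeByCode;

    ToyFont(Map<Integer, Integer> widthByCode, Map<Integer, String> unicodeByCode) {
        this.widthByCode = widthByCode;
        this.unicodeByCode = unicodeByCode;
    }

    /** Decodes raw character codes into readable text. */
    String decode(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(unicodeByCode.getOrDefault(code, "?"));
        }
        return sb.toString();
    }

    /** Sums widths over the raw character codes (the approach the fix restores). */
    int widthFromCodes(int[] codes) {
        int total = 0;
        for (int code : codes) {
            total += widthByCode.getOrDefault(code, 0);
        }
        return total;
    }

    /** Buggy variant: looks up each decoded character as if it were a code. */
    int widthFromDecodedText(int[] codes) {
        int total = 0;
        for (char c : decode(codes).toCharArray()) {
            total += widthByCode.getOrDefault((int) c, 0);
        }
        return total;
    }
}

final class WidthDemo {
    /** Code 1 is an "fi" ligature glyph, code 2 is "x"; neither code equals its Unicode value. */
    static ToyFont sampleFont() {
        return new ToyFont(Map.of(1, 556, 2, 500), Map.of(1, "fi", 2, "x"));
    }

    public static void main(String[] args) {
        ToyFont font = sampleFont();
        int[] codes = {1, 2};
        System.out.println(font.decode(codes));               // fix
        System.out.println(font.widthFromCodes(codes));       // 1056
        System.out.println(font.widthFromDecodedText(codes)); // 0
    }
}
```

In this toy setup the decoded text has three characters for two glyphs, and none of those characters is a valid key in the width table, so any coordinate derived from the decoded-text width would be wrong.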
Changes:
New API Classes:
- PdfTextLocator - Public API with searchPage(page, pattern) and searchFile(pattern) methods
- MatchedPattern - Immutable data class holding matched text + coordinates [llx, lly, urx, ury]
- PdfContentTextLocator - Regex pattern matching handler
- PdfContentTextExtractor - Concrete text extraction implementation
Refactoring:
- PdfContentStreamHandler → abstract base class (enables custom handlers)
- ParsedText - Added create() factory, restored getUnscaledTextWidth() using font codes
- PdfTextExtractor - Unchanged behavior, delegates to PdfContentTextExtractor
Usage:
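The following is not the PdfTextLocator API itself, only a self-contained sketch of the idea behind it: run a regex over positioned text and report each hit with a `[llx, lly, urx, ury]` box. All class names here (`ToyChunk`, `ToyMatch`, `ToyLocator`) are hypothetical stand-ins. With OpenPDF the corresponding entry points would be the searchPage(page, pattern) and searchFile(pattern) methods listed above, whose exact signatures are not shown in this description. As a simplification, the sketch reports the bounding box of the whole text chunk containing the match rather than of the matched substring.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical positioned run of text on a page (not an OpenPDF type). */
final class ToyChunk {
    final String text;
    final float x, y, width, height; // lower-left corner plus size, in page units

    ToyChunk(String text, float x, float y, float width, float height) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }
}

/** Hypothetical immutable result: matched text plus [llx, lly, urx, ury]. */
final class ToyMatch {
    final String text;
    final float llx, lly, urx, ury;

    ToyMatch(String text, float llx, float lly, float urx, float ury) {
        this.text = text;
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }
}

final class ToyLocator {
    /** Returns one result per regex hit, using the containing chunk's box as coordinates. */
    static List<ToyMatch> search(List<ToyChunk> chunks, Pattern pattern) {
        List<ToyMatch> matches = new ArrayList<>();
        for (ToyChunk c : chunks) {
            Matcher m = pattern.matcher(c.text);
            while (m.find()) {
                matches.add(new ToyMatch(m.group(), c.x, c.y, c.x + c.width, c.y + c.height));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<ToyChunk> page = List.of(
                new ToyChunk("Invoice 1001", 72f, 700f, 90f, 12f),
                new ToyChunk("Total due", 72f, 680f, 60f, 12f),
                new ToyChunk("Invoice 1002", 72f, 660f, 90f, 12f));
        for (ToyMatch m : search(page, Pattern.compile("Invoice \\d+"))) {
            System.out.println(m.text + " at [" + m.llx + ", " + m.lly + ", " + m.urx + ", " + m.ury + "]");
        }
    }
}
```

Keeping the result type immutable, as the description says MatchedPattern is, means callers can hold on to matches from several pages without worrying that later parsing will change them.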
Unit-Tests for the new Feature/Bugfix
- testTextLocatorFindsTextWithCoordinates - Basic coordinate extraction validation
- testTextLocatorFindsMultipleMatches - Multiple pattern match handling
Compatibilities Issues
None.
- PdfTextExtractor API unchanged, maintains backward compatibility.
- PdfContentStreamHandler change from concrete → abstract is an internal architecture change. Extending this class was already supported via protected members.
Your real name
GitHub Copilot
Testing details
All 2032 existing tests pass. New tests validate:
Previously failing tests (DecryptAES256R6Test, EncryptAES256R6Test) now pass with corrected width calculation.