Open
Labels
enhancement, priority:high
Overview
Implement PDF parser to extract document structure, metadata, and text while skipping binary image data.
Parent Epic
Part of #91 - Document & Office Format Awareness
Description
Parse PDF structure (objects, streams, cross-reference tables) and extract meaningful strings from metadata, annotations, bookmarks, and text streams.
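As a starting point for locating the cross-reference table, a parser can read the `startxref` keyword near the end of the file, which is followed by the byte offset of the last cross-reference section. The sketch below uses only the standard library; the function name `find_startxref` and the 1024-byte tail window are assumptions for illustration, not part of `lopdf` or `pdf`.

```rust
/// Locate the byte offset recorded after the last `startxref` keyword.
/// Hypothetical helper: searches only the final 1024 bytes of the file,
/// which is an assumed window size, not a value taken from this issue.
pub fn find_startxref(data: &[u8]) -> Option<usize> {
    const KEYWORD: &[u8] = b"startxref";

    // Restrict the search to the tail of the file.
    let tail_start = data.len().saturating_sub(1024);
    let tail = &data[tail_start..];

    // Find the last occurrence of the keyword in the tail.
    let pos = tail
        .windows(KEYWORD.len())
        .rposition(|window| window == KEYWORD)?;

    // Skip whitespace after the keyword, then collect the decimal digits.
    let digits: Vec<u8> = tail[pos + KEYWORD.len()..]
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .collect();

    // An empty digit run fails to parse and yields None.
    std::str::from_utf8(&digits).ok()?.parse().ok()
}
```

A real implementation would probably also follow `/Prev` entries and handle cross-reference streams, which this sketch does not attempt.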
Implementation Details
- Use `lopdf` or `pdf` crate
- Parse PDF object structure
- Extract document info dictionary (Title, Author, Subject, Keywords)
- Parse catalog and page tree
- Extract text from content streams
- Identify and skip image streams
- Parse annotations and form fields
- Extract JavaScript from actions
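Many of the strings listed above appear as PDF literal strings, written in parentheses with backslash escapes and balanced nested parentheses. The following is a minimal sketch of decoding one such string, assuming the input slice starts at the opening `(`. The name `parse_literal_string` is hypothetical, and either crate may already expose decoded strings directly.

```rust
/// Decode a PDF literal string that starts at `input[0] == b'('`.
/// Returns the decoded bytes and the number of input bytes consumed,
/// or None if the string is missing its opening or closing parenthesis.
pub fn parse_literal_string(input: &[u8]) -> Option<(Vec<u8>, usize)> {
    if input.first() != Some(&b'(') {
        return None;
    }
    let mut out = Vec::new();
    let mut depth = 1usize; // nesting level of unescaped parentheses
    let mut i = 1;
    while i < input.len() {
        match input[i] {
            b'\\' => {
                i += 1;
                let c = *input.get(i)?;
                match c {
                    b'n' => out.push(b'\n'),
                    b'r' => out.push(b'\r'),
                    b't' => out.push(b'\t'),
                    b'b' => out.push(0x08),
                    b'f' => out.push(0x0C),
                    b'0'..=b'7' => {
                        // Octal escape: one to three digits, high bits dropped.
                        let mut value = 0u32;
                        let mut digits = 0;
                        while digits < 3 {
                            match input.get(i).copied() {
                                Some(d @ b'0'..=b'7') => {
                                    value = value * 8 + u32::from(d - b'0');
                                    i += 1;
                                    digits += 1;
                                }
                                _ => break,
                            }
                        }
                        out.push((value & 0xFF) as u8);
                        continue; // `i` already points past the digits
                    }
                    b'\r' => {
                        // Line continuation: swallow CR or CRLF.
                        if input.get(i + 1) == Some(&b'\n') {
                            i += 1;
                        }
                    }
                    b'\n' => {} // line continuation: swallow LF
                    other => out.push(other), // \(, \), \\ and unknown escapes
                }
                i += 1;
            }
            b'(' => {
                depth += 1;
                out.push(b'(');
                i += 1;
            }
            b')' => {
                depth -= 1;
                i += 1;
                if depth == 0 {
                    return Some((out, i));
                }
                out.push(b')');
            }
            b => {
                out.push(b);
                i += 1;
            }
        }
    }
    None // ran out of input before the closing parenthesis
}
```

The result is raw bytes. For text shown in content streams, mapping those bytes to Unicode depends on the font's encoding, which is outside the scope of this sketch.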
String Sources
- Document metadata (Title, Author, Subject, Keywords, Creator, Producer)
- Bookmark titles
- Annotation text
- Form field names and values
- Font names
- JavaScript code
- Hyperlink URLs
- Named destinations
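Text strings from sources such as metadata and bookmark titles are commonly stored either as UTF-16BE with a leading `FE FF` byte order mark or in PDFDocEncoding. A rough sketch of turning such bytes into a Rust `String` follows. The function name `decode_text_string` is hypothetical, and the fallback branch treats bytes as Latin-1, which only approximates PDFDocEncoding (the two differ for some code points).

```rust
/// Decode a PDF text string into a Rust String.
/// Assumption: bytes without a UTF-16BE byte order mark are mapped
/// one-to-one to Unicode code points (Latin-1), which approximates
/// PDFDocEncoding but is not an exact implementation of it.
pub fn decode_text_string(bytes: &[u8]) -> String {
    if bytes.starts_with(&[0xFE, 0xFF]) {
        // UTF-16BE: combine each byte pair into one code unit.
        // A trailing odd byte, if any, is ignored by chunks_exact.
        let units: Vec<u16> = bytes[2..]
            .chunks_exact(2)
            .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
            .collect();
        String::from_utf16_lossy(&units)
    } else {
        bytes.iter().map(|&b| char::from(b)).collect()
    }
}
```

Invalid UTF-16 sequences are replaced rather than rejected here, on the assumption that a string extractor would rather emit partial text than fail.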
Acceptance Criteria
- Parse PDF structure (v1.4-1.7)
- Extract all metadata dictionary entries
- Parse page content streams for text
- Skip binary image streams (entropy-based)
- Extract annotations and bookmarks
- Handle encrypted PDFs (extract metadata only)
- Tests with diverse PDF samples
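The entropy-based criterion could be sketched as a Shannon entropy measurement over a stream's bytes. The names `shannon_entropy`, `looks_binary`, and the threshold of 7.0 bits per byte are illustrative assumptions; a suitable threshold would need tuning against real samples.

```rust
/// Shannon entropy of a byte slice, in bits per byte (0.0 to 8.0).
pub fn shannon_entropy(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    // Count how often each byte value occurs.
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let len = data.len() as f64;
    // Sum -p * log2(p) over the byte values that actually occur.
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}

/// Hypothetical cutoff: streams above this are treated as binary.
pub const BINARY_ENTROPY_THRESHOLD: f64 = 7.0;

/// Returns true when the data's entropy exceeds the assumed threshold.
pub fn looks_binary(data: &[u8]) -> bool {
    shannon_entropy(data) > BINARY_ENTROPY_THRESHOLD
}
```

Streams compressed with FlateDecode also look high-entropy, so the measurement is likely more meaningful on decoded data, or in combination with a check of the stream dictionary for `/Subtype /Image`.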
Test Cases
- Simple text PDFs
- PDFs with images
- PDFs with forms
- PDFs with JavaScript
- Encrypted PDFs
- Large PDFs (>100MB)
Related
Project: #76