
Feature: PDF structure parser and metadata extraction #92

@coderabbitai

Overview

Implement a PDF parser that extracts document structure, metadata, and text while skipping binary image data.

Parent Epic

Part of #91 - Document & Office Format Awareness

Description

Parse PDF structure (objects, streams, cross-reference tables) and extract meaningful strings from metadata, annotations, bookmarks, and text streams.
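As a rough illustration of the first structural step, the sketch below reads the version from the `%PDF-x.y` header and locates the byte offset that follows the last `startxref` keyword, which is where a reader starts when looking for the cross-reference table. It uses only the standard library rather than lopdf or pdf, and the function names `pdf_version` and `startxref_offset` are hypothetical, chosen for this example only.

```rust
/// Parse the version from a header such as `%PDF-1.7`.
/// Returns (major, minor), or None if the header is missing or malformed.
fn pdf_version(data: &[u8]) -> Option<(u8, u8)> {
    let rest = data.strip_prefix(b"%PDF-")?;
    match rest {
        // Expect `<digit>.<digit>` immediately after the prefix.
        [major, b'.', minor, ..] if major.is_ascii_digit() && minor.is_ascii_digit() => {
            Some((*major - b'0', *minor - b'0'))
        }
        _ => None,
    }
}

/// Find the byte offset written after the last `startxref` keyword.
/// Returns None if the keyword or the number after it is missing.
fn startxref_offset(data: &[u8]) -> Option<usize> {
    const KEYWORD: &[u8] = b"startxref";
    // Search backwards so that the most recent trailer wins in
    // incrementally updated files.
    let pos = data.windows(KEYWORD.len()).rposition(|w| w == KEYWORD)?;
    let tail = &data[pos + KEYWORD.len()..];
    let digits: Vec<u8> = tail
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .collect();
    std::str::from_utf8(&digits).ok()?.parse().ok()
}
```

A caller could reject files whose version falls outside the supported range before doing any further work, and hand the offset to whichever crate ends up parsing the cross-reference table itself.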

Implementation Details

  • Use the lopdf or pdf crate
  • Parse PDF object structure
  • Extract document info dictionary (Title, Author, Subject, Keywords)
  • Parse catalog and page tree
  • Extract text from content streams
  • Identify and skip image streams
  • Parse annotations and form fields
  • Extract JavaScript from actions
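To make the "extract text from content streams" step more concrete, here is a minimal sketch that collects literal strings, the `(...)` operands shown by operators such as `Tj` and `TJ`, from an already decompressed content stream. It is a simplification under stated assumptions: it does not tokenize operators, ignores hex strings and octal escapes, and would misfire on inline image data. The function name `literal_strings` is hypothetical and the code uses only the standard library.

```rust
/// Collect the literal strings `( ... )` found in a decoded content stream.
/// Handles nested parentheses and the common backslash escapes.
fn literal_strings(stream: &[u8]) -> Vec<String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < stream.len() {
        if stream[i] != b'(' {
            i += 1;
            continue;
        }
        i += 1; // step past the opening parenthesis
        let mut depth = 1usize;
        let mut buf = Vec::new();
        while i < stream.len() && depth > 0 {
            match stream[i] {
                // An escape: take the next byte, translating the common ones.
                b'\\' if i + 1 < stream.len() => {
                    i += 1;
                    buf.push(match stream[i] {
                        b'n' => b'\n',
                        b'r' => b'\r',
                        b't' => b'\t',
                        other => other, // covers \( \) and \\
                    });
                }
                // Unescaped parentheses may nest as long as they balance.
                b'(' => {
                    depth += 1;
                    buf.push(b'(');
                }
                b')' => {
                    depth -= 1;
                    if depth > 0 {
                        buf.push(b')');
                    }
                }
                byte => buf.push(byte),
            }
            i += 1;
        }
        // Keep only strings that were properly terminated.
        if depth == 0 {
            out.push(String::from_utf8_lossy(&buf).into_owned());
        }
    }
    out
}
```

For the stream `BT /F1 12 Tf (Hello \(PDF\)) Tj ET` this yields a single string, `Hello (PDF)`. The bytes are interpreted as UTF-8 only for illustration; real text extraction has to map them through the font's encoding.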

String Sources

  • Document metadata (Title, Author, Subject, Keywords, Creator, Producer)
  • Bookmark titles
  • Annotation text
  • Form field names and values
  • Font names
  • JavaScript code
  • Hyperlink URLs
  • Named destinations
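Several of these sources (metadata values, bookmark titles, annotation text) are stored as PDF text strings, which are either UTF-16BE with a leading `FE FF` byte order mark or PDFDocEncoding. The sketch below shows one way the decoding might look. The function name `decode_text_string` is hypothetical, and the non-UTF-16 branch maps each byte to the Unicode code point of the same value, which is only an approximation of PDFDocEncoding (the two differ for some byte values).

```rust
/// Decode a PDF text string into a Rust `String`.
/// UTF-16BE input is detected by its FE FF byte order mark; anything else is
/// treated byte-for-byte as a code point, approximating PDFDocEncoding.
fn decode_text_string(raw: &[u8]) -> String {
    if let [0xFE, 0xFF, body @ ..] = raw {
        // Pair up the remaining bytes as big-endian 16-bit code units.
        // A trailing odd byte, if any, is dropped.
        let units: Vec<u16> = body
            .chunks_exact(2)
            .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
            .collect();
        String::from_utf16_lossy(&units)
    } else {
        raw.iter().map(|&b| b as char).collect()
    }
}
```

Passing the bytes `FE FF 00 48 00 69` returns `Hi`, and plain ASCII input such as `Annual Report` comes back unchanged.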

Acceptance Criteria

  • Parse PDF structure for PDF versions 1.4 through 1.7
  • Extract all metadata dictionary entries
  • Parse page content streams for text
  • Skip binary image streams using an entropy-based check
  • Extract annotations and bookmarks
  • Handle encrypted PDFs by extracting metadata only
  • Tests with diverse PDF samples
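The entropy-based criterion could be sketched as a Shannon entropy calculation over a stream's bytes, measured in bits per byte. The threshold of 7.0 and the names `shannon_entropy`, `ENTROPY_THRESHOLD`, and `looks_binary` are assumptions made for this example and are not taken from the issue; a real cutoff would need tuning against sample PDFs.

```rust
/// Shannon entropy of `data` in bits per byte (0.0 for empty input, at most 8.0).
fn shannon_entropy(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    // Count how often each byte value occurs.
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let len = data.len() as f64;
    // Sum -p * log2(p) over the byte values that actually appear.
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}

/// Hypothetical cutoff: streams above this are treated as binary data.
const ENTROPY_THRESHOLD: f64 = 7.0;

/// Returns true when the stream's entropy exceeds the cutoff.
fn looks_binary(data: &[u8]) -> bool {
    shannon_entropy(data) > ENTROPY_THRESHOLD
}
```

Compressed text streams also score high, so a check like this is more meaningful after stream filters have been decoded, and it can be combined with the stream dictionary's /Subtype /Image entry, which marks image XObjects directly.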

Test Cases

  • Simple text PDFs
  • PDFs with images
  • PDFs with forms
  • PDFs with JavaScript
  • Encrypted PDFs
  • Large PDFs (> 100 MB)

Related

Project: #76
