feat(extractor): adopt Detective-XH/pdf v0.6.0 Metadata API #65

Merged: Detective-XH merged 1 commit into main from chore/pdf-v0.6.0-metadata-api on Jun 2, 2026
@Detective-XH (Owner)

Summary

Upgrades the PDF dependency to v0.6.0 and adopts its new Metadata API.

  • Dep bump: github.com/Detective-XH/pdf v0.5.0 → v0.6.0. Public API unchanged (OpenBytes/Page/Font/Value signatures identical), so no call-site changes.
    • Automatic gains: broader CJK CMap coverage (GB-EUC / GBKp-EUC / ETenms-B5 / KSC-EUC / KSCms-UHC) and upstream hardening (inline-image crash, zero-length stream, multi-stream OOM, object-nesting depth) that strengthen the malformed-PDF panic guard in extractor.go.
  • Metadata API adoption: extractPDF reads the Info dict via r.Info() (Title/Author/Subject/Keywords) instead of hand-walking Trailer → Info.
  • ⚠️ Behavior change: pdf.creation_date is now a normalized RFC3339 UTC timestamp via info.CreationDate() (was the raw D:YYYYMMDD… PDF string); it is empty when the date is absent or unparseable. pdf.go is the sole writer and no consumer parses the old format, so this is display-only.
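
To make the behavior change concrete, here is a minimal sketch of the kind of normalization described above: a raw PDF date string goes in, an RFC3339 UTC timestamp (or an empty string) comes out. It is not the implementation behind info.CreationDate() in Detective-XH/pdf. The helper name normalizePDFDate, the set of accepted layouts, and the choice to treat offset-less dates as UTC are all assumptions made for illustration.

```go
package main

import (
	"strings"
	"time"
)

// normalizePDFDate is a hypothetical helper (not part of Detective-XH/pdf).
// It converts a PDF date string such as D:20240115093000+05'00' into an
// RFC3339 UTC timestamp, and returns "" when the input is absent or
// cannot be parsed.
func normalizePDFDate(raw string) string {
	s := strings.TrimSpace(raw)
	s = strings.TrimPrefix(s, "D:")
	// PDF writes the offset as +HH'mm'; drop the apostrophes so it
	// becomes +HHmm, which Go's numeric zone layout can parse.
	s = strings.ReplaceAll(s, "'", "")
	if s == "" {
		return ""
	}

	// Trailing fields are optional in PDF dates, so try the most
	// specific layout first and fall back to shorter ones.
	layouts := []string{
		"20060102150405-0700", // date, time, numeric offset
		"20060102150405Z",     // date, time, explicit UTC marker
		"20060102150405",      // no offset: treated as UTC (assumption)
		"200601021504",
		"2006010215",
		"20060102",
		"200601",
		"2006",
	}
	for _, layout := range layouts {
		if t, err := time.Parse(layout, s); err == nil {
			return t.UTC().Format(time.RFC3339)
		}
	}
	return ""
}
```

Under these assumptions, normalizePDFDate("D:20240115093000+05'00'") yields 2024-01-15T04:30:00Z, the same mapping the new test pins, while an empty or malformed input yields an empty string.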

Tests

  • Extend buildMinimalPDFEnc with an /Info dict param.
  • New TestExtractPDF_CreationDateNormalized pins D:20240115093000+05'00' → 2024-01-15T04:30:00Z.
  • go build ./... ✅ · go test -race ./... 18/0 ✅ · gofmt -s / vet clean ✅

Touchpoints

  • README:289 lists which Info fields are indexed (not the format) — still accurate, no doc change.
  • No schema change → no .docgraph migration needed.

🤖 Generated with Claude Code

- bump github.com/Detective-XH/pdf v0.5.0 -> v0.6.0. Public API unchanged
  (OpenBytes/Page/Font/Value signatures identical), so no call-site changes.
  Automatic gains: broader CJK CMap coverage (GB-EUC/GBKp-EUC/ETenms-B5/
  KSC-EUC/KSCms-UHC) and upstream hardening (inline-image crash, zero-length
  stream, multi-stream OOM, object-nesting depth) that strengthen the
  malformed-PDF panic guard in extractor.go.
- extractPDF reads the Info dict via r.Info() (Title/Author/Subject/Keywords)
  instead of hand-walking Trailer -> Info.
- BEHAVIOR CHANGE: pdf.creation_date is now a normalized RFC3339 UTC timestamp
  via info.CreationDate() (was the raw 'D:YYYYMMDD...' PDF string); empty on
  absent/unparseable date. pdf.go is the sole writer and no consumer parsed
  the old format, so this is display-only.
- test: extend buildMinimalPDFEnc with an /Info dict param; add
  TestExtractPDF_CreationDateNormalized pinning +05'00' -> 04:30:00Z.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Detective-XH merged commit fdf7f9d into main on Jun 2, 2026
10 checks passed
Detective-XH deleted the chore/pdf-v0.6.0-metadata-api branch on June 2, 2026 01:20
Detective-XH mentioned this pull request on Jun 2, 2026