feat(extractor): adopt Detective-XH/pdf v0.6.0 Metadata API by Detective-XH · Pull Request #65 · Detective-XH/DocGraph

Detective-XH · 2026-06-02T01:11:32Z

Summary

Upgrades the PDF dependency to v0.6.0 and adopts its new Metadata API.

- Dep bump: github.com/Detective-XH/pdf v0.5.0 → v0.6.0. Public API unchanged (OpenBytes/Page/Font/Value signatures identical), so no call-site changes.
  - Automatic gains: broader CJK CMap coverage (GB-EUC / GBKp-EUC / ETenms-B5 / KSC-EUC / KSCms-UHC) and upstream hardening (inline-image crash, zero-length stream, multi-stream OOM, object-nesting depth) that strengthen the malformed-PDF panic guard in extractor.go.
- Metadata API adoption: extractPDF reads the Info dict via r.Info() (Title/Author/Subject/Keywords) instead of hand-walking Trailer → Info.
- ⚠️ Behavior change: pdf.creation_date is now a normalized RFC3339 UTC timestamp via info.CreationDate() (was the raw D:YYYYMMDD… PDF string); empty on absent/unparseable date. pdf.go is the sole writer and no consumer parses the old format, so this is display-only.
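
For reviewers unfamiliar with the raw PDF date format, the normalization described above can be sketched with the standard library alone. This is an illustration only, not the code inside Detective-XH/pdf: parsePDFDate, parseZone, isDigits and normalizePDFDate are hypothetical names, and treating a missing zone suffix as UTC is an assumption of this sketch.

```go
package main

import (
	"strconv"
	"strings"
	"time"
)

// isDigits reports whether s is non-empty and consists only of ASCII digits.
func isDigits(s string) bool {
	if s == "" {
		return false
	}
	for i := 0; i < len(s); i++ {
		if s[i] < '0' || s[i] > '9' {
			return false
		}
	}
	return true
}

// parseZone parses the zone suffix of a PDF date: "", "Z", "+HH'mm'" or "-HH'mm'".
// Assumption of this sketch: a missing suffix is treated as UTC.
func parseZone(z string) (*time.Location, bool) {
	if z == "" || z[0] == 'Z' {
		return time.UTC, true
	}
	sign := 1
	switch z[0] {
	case '+':
	case '-':
		sign = -1
	default:
		return nil, false
	}
	// Drop the apostrophes so "+05'00'" becomes "0500".
	d := strings.ReplaceAll(z[1:], "'", "")
	if !isDigits(d) || (len(d) != 2 && len(d) != 4) {
		return nil, false
	}
	hh, _ := strconv.Atoi(d[:2])
	mm := 0
	if len(d) == 4 {
		mm, _ = strconv.Atoi(d[2:])
	}
	if hh > 23 || mm > 59 {
		return nil, false
	}
	return time.FixedZone("", sign*(hh*3600+mm*60)), true
}

// parsePDFDate parses "D:YYYYMMDDHHmmSSOHH'mm'"; every field after the year is optional.
func parsePDFDate(s string) (time.Time, bool) {
	s = strings.TrimPrefix(strings.TrimSpace(s), "D:")
	// Split the leading digits (at most 14) from the zone suffix.
	n := 0
	for n < len(s) && n < 14 && s[n] >= '0' && s[n] <= '9' {
		n++
	}
	if n < 4 || n%2 != 0 {
		return time.Time{}, false
	}
	digits, zone := s[:n], s[n:]
	// Defaults for omitted fields: month 1, day 1, time 00:00:00.
	f := []int{0, 1, 1, 0, 0, 0}
	f[0], _ = strconv.Atoi(digits[:4])
	for i, pos := 1, 4; pos < n; i, pos = i+1, pos+2 {
		f[i], _ = strconv.Atoi(digits[pos : pos+2])
	}
	loc, ok := parseZone(zone)
	if !ok {
		return time.Time{}, false
	}
	t := time.Date(f[0], time.Month(f[1]), f[2], f[3], f[4], f[5], 0, loc)
	// time.Date silently normalizes out-of-range fields (month 13 rolls into
	// the next year), so reject any input that does not round-trip.
	if t.Year() != f[0] || int(t.Month()) != f[1] || t.Day() != f[2] ||
		t.Hour() != f[3] || t.Minute() != f[4] || t.Second() != f[5] {
		return time.Time{}, false
	}
	return t, true
}

// normalizePDFDate returns the RFC3339 UTC form, or "" when the date is
// absent or unparseable, mirroring the behavior described in this PR.
func normalizePDFDate(s string) string {
	t, ok := parsePDFDate(s)
	if !ok {
		return ""
	}
	return t.UTC().Format(time.RFC3339)
}
```

With this sketch, normalizePDFDate("D:20240115093000+05'00'") yields 2024-01-15T04:30:00Z, the same pair the new test pins: 09:30 at UTC+05:00 is 04:30 UTC.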

Tests

- Extend buildMinimalPDFEnc with an /Info dict param.
- New TestExtractPDF_CreationDateNormalized pins D:20240115093000+05'00' → 2024-01-15T04:30:00Z.
- go build ./... ✅ · go test -race ./... 18/0 ✅ · gofmt -s / vet clean ✅

Touchpoints

- README:289 lists which Info fields are indexed (not the format), so it is still accurate and needs no doc change.
- No schema change → no .docgraph migration needed.

🤖 Generated with Claude Code

- bump github.com/Detective-XH/pdf v0.5.0 -> v0.6.0. Public API unchanged (OpenBytes/Page/Font/Value signatures identical), so no call-site changes. Automatic gains: broader CJK CMap coverage (GB-EUC/GBKp-EUC/ETenms-B5/KSC-EUC/KSCms-UHC) and upstream hardening (inline-image crash, zero-length stream, multi-stream OOM, object-nesting depth) that strengthen the malformed-PDF panic guard in extractor.go.
- extractPDF reads the Info dict via r.Info() (Title/Author/Subject/Keywords) instead of hand-walking Trailer -> Info.
- BEHAVIOR CHANGE: pdf.creation_date is now a normalized RFC3339 UTC timestamp via info.CreationDate() (was the raw 'D:YYYYMMDD...' PDF string); empty on absent/unparseable date. pdf.go is the sole writer and no consumer parsed the old format, so this is display-only.
- test: extend buildMinimalPDFEnc with an /Info dict param; add TestExtractPDF_CreationDateNormalized pinning +05'00' -> 04:30:00Z.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Detective-XH merged commit fdf7f9d into main Jun 2, 2026
10 checks passed

Detective-XH deleted the chore/pdf-v0.6.0-metadata-api branch June 2, 2026 01:20

Detective-XH mentioned this pull request Jun 2, 2026

release: v0.3.0 #67

Merged

Detective-XH merged 1 commit into main from chore/pdf-v0.6.0-metadata-api
