Skip to content

fix(zip): shared SafeArchive with four-axis decompression-bomb caps#72

Merged
pratyush618 merged 2 commits intomainfrom
fix/safezip-limits
Apr 24, 2026
Merged

fix(zip): shared SafeArchive with four-axis decompression-bomb caps#72
pratyush618 merged 2 commits intomainfrom
fix/safezip-limits

Conversation

@pratyush618
Copy link
Copy Markdown
Collaborator

Summary

Completes the SafeZip work sketched during the security pass — the per-entry caps that landed in #69 become one of four caps enforced across every archive read, and the ~170-line implementation that was duplicated between paperjam-epub and paperjam-pptx is now a single shared module in paperjam-model.

What changed

paperjam-model::zip_safety (new, feature-gated)

Opt-in via a zip_safety Cargo feature so the PDF engine and other non-ZIP consumers keep a zero-dependency paperjam-model. Surface:

  • ArchiveLimits with four independent caps and a sensible DEFAULT
  • SafeArchive<R> wrapper that tracks running totals for the scan
  • ZipSafetyError with typed variants for each way a read can fail

Four caps enforced on every read:

Cap Default Attack it stops
max_entry_bytes 100 MB One oversized entry
max_total_bytes 500 MB Many medium entries summing huge
max_entries 10,000 Millions of trivial entries
max_ratio 100× Compression-ratio bomb (1 byte → 1 MB)

Seven tests cover the happy path, each cap, missing-entry handling, and non-UTF-8 decoding.

paperjam-epub + paperjam-pptx (consolidate)

  • Delete the per-crate safe_read.rs modules (~170 lines each, identical)
  • Enable zip_safety on paperjam-model
  • Thread a single SafeArchive through the parser so the caps apply across every read (images, TOC, slide notes) rather than per-entry only
  • Collapse 4 duplicated error variants per crate into one Archive(#[from] ZipSafetyError) — external error surface is #[error(transparent)] so callers see the underlying typed error directly

Net: ~200 lines of security-critical duplicated code collapses to one authoritative implementation.

Why now

Auditing for "clean, production-grade, maintainable" — the first-pass security fix worked but violated DRY in the most sensitive module. Any future tuning (adjusting defaults, adding caps, fixing a bug) would have to land in two places.

Test plan

  • cargo test --workspace — 4 xlsx reader + 5 mcp sandbox + 7 new zip_safety = 16 Rust tests, no regressions
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo fmt --all --check
  • uv run pytest tests/python/ — 88 passed, 4 skipped
  • pre-commit run --all-files — every hook passes

What's still outstanding from the audit

After this PR:

  • DOCX metadata reader still uses an inline per-entry cap. It only reads one entry (core.xml) so the total/count caps would be no-ops; left alone.
  • Layer discipline (paperjam-epubpaperjam-html), paperjam-studio scope, Rust test strategy — all still need your direction.
  • Remaining low-priority polish: Rust CI matrix, MSRV verification, dep dedup, calamine / docx-rs pin refresh.

Introduce a shared hardened ZIP reader in paperjam-model, gated behind
the optional `zip_safety` feature so the PDF engine and other non-ZIP
consumers keep a zero-dependency paperjam-model.

`SafeArchive` wraps a `zip::ZipArchive` and enforces four independent
caps across the lifetime of a single archive scan:

  1. per-entry decompressed size
  2. aggregate decompressed-byte budget
  3. entry count
  4. compression ratio (declared / compressed)

Each cap surfaces as a structured `ZipSafetyError` variant so callers
can convert the failure into their own error type with `#[from]`.
`ArchiveLimits::DEFAULT` is tuned for ordinary office / EPUB files
(100 MB per entry, 500 MB total, 10k entries, 100x ratio); consumers
can override any field.

Seven tests cover the normal path, each cap, missing-entry handling,
and non-UTF-8 decoding.
…model

Replace the per-crate `safe_read` modules with the new
`paperjam_model::zip_safety::SafeArchive`. Each format crate now:

- enables the `zip_safety` feature on paperjam-model
- deletes its local safe_read.rs (~170 lines each, identical logic)
- threads a single SafeArchive through its parser, so the per-entry,
  total-bytes, entry-count, and compression-ratio caps apply across
  every archive read rather than just individual entries
- exposes the archive-safety failures as a single `Archive(#[from]
  ZipSafetyError)` variant on EpubError / PptxError; the per-crate
  duplicates of EntryTooLarge / ArchiveTotalExceeded / ... are dropped

Net effect: ~200 lines of duplicated security-critical code collapses
into one implementation with seven tests in paperjam-model, plus the
same bounded-read semantics now cover images, TOC entries, and slide
notes that were previously capped only on per-entry size.
@github-actions github-actions Bot added the rust Pull requests that update rust code label Apr 24, 2026
@pratyush618 pratyush618 merged commit ea13265 into main Apr 24, 2026
13 checks passed
@pratyush618 pratyush618 mentioned this pull request Apr 24, 2026
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rust Pull requests that update rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant