fix(zip): shared SafeArchive with four-axis decompression-bomb caps by pratyush618 · Pull Request #72 · ByteVeda/paperjam

pratyush618 · 2026-04-24T11:33:43Z

Summary

Completes the SafeZip work sketched during the security pass — the per-entry caps that landed in #69 become one of four caps enforced across every archive read, and the ~170-line implementation that was duplicated between paperjam-epub and paperjam-pptx is now a single shared module in paperjam-model.

What changed

`paperjam-model::zip_safety` (new, feature-gated)

Opt-in via a zip_safety Cargo feature so the PDF engine and other non-ZIP consumers keep a zero-dependency paperjam-model. Surface:

ArchiveLimits with four independent caps and a sensible DEFAULT
SafeArchive<R> wrapper that tracks running totals for the scan
ZipSafetyError with typed variants for each way a read can fail

Four caps enforced on every read:

Cap	Default	Attack it stops
`max_entry_bytes`	100 MB	One oversized entry
`max_total_bytes`	500 MB	Many medium entries summing huge
`max_entries`	10,000	Millions of trivial entries
`max_ratio`	100×	Compression-ratio bomb (1 byte → 1 MB)

Seven tests cover the happy path, each cap, missing-entry handling, and non-UTF-8 decoding.

`paperjam-epub` + `paperjam-pptx` (consolidate)

Delete the per-crate safe_read.rs modules (~170 lines each, identical)
Enable zip_safety on paperjam-model
Thread a single SafeArchive through the parser so the caps apply across every read (images, TOC, slide notes) rather than per-entry only
Collapse 4 duplicated error variants per crate into one Archive(#[from] ZipSafetyError) — external error surface is #[error(transparent)] so callers see the underlying typed error directly

Net: ~200 lines of security-critical duplicated code collapses to one authoritative implementation.

Why now

Auditing for "clean, production-grade, maintainable" — the first-pass security fix worked but violated DRY in the most sensitive module. Any future tuning (adjusting defaults, adding caps, fixing a bug) would have to land in two places.

Test plan

cargo test --workspace — 4 xlsx reader + 5 mcp sandbox + 7 new zip_safety = 16 Rust tests, no regressions
cargo clippy --workspace --all-targets -- -D warnings
cargo fmt --all --check
uv run pytest tests/python/ — 88 passed, 4 skipped
pre-commit run --all-files — every hook passes

What's still outstanding from the audit

After this PR:

DOCX metadata reader still uses an inline per-entry cap. It only reads one entry (core.xml) so the total/count caps would be no-ops; left alone.
Layer discipline (paperjam-epub → paperjam-html), paperjam-studio scope, Rust test strategy — all still need your direction.
Remaining low-priority polish: Rust CI matrix, MSRV verification, dep dedup, calamine / docx-rs pin refresh.

Introduce a shared hardened ZIP reader in paperjam-model, gated behind the optional `zip_safety` feature so the PDF engine and other non-ZIP consumers keep a zero-dependency paperjam-model. `SafeArchive` wraps a `zip::ZipArchive` and enforces four independent caps across the lifetime of a single archive scan: 1. per-entry decompressed size 2. aggregate decompressed-byte budget 3. entry count 4. compression ratio (declared / compressed) Each cap surfaces as a structured `ZipSafetyError` variant so callers can convert the failure into their own error type with `#[from]`. `ArchiveLimits::DEFAULT` is tuned for ordinary office / EPUB files (100 MB per entry, 500 MB total, 10k entries, 100x ratio); consumers can override any field. Seven tests cover the normal path, each cap, missing-entry handling, and non-UTF-8 decoding.

…model Replace the per-crate `safe_read` modules with the new `paperjam_model::zip_safety::SafeArchive`. Each format crate now: - enables the `zip_safety` feature on paperjam-model - deletes its local safe_read.rs (~170 lines each, identical logic) - threads a single SafeArchive through its parser, so the per-entry, total-bytes, entry-count, and compression-ratio caps apply across every archive read rather than just individual entries - exposes the archive-safety failures as a single `Archive(#[from] ZipSafetyError)` variant on EpubError / PptxError; the per-crate duplicates of EntryTooLarge / ArchiveTotalExceeded / ... are dropped Net effect: ~200 lines of duplicated security-critical code collapses into one implementation with seven tests in paperjam-model, plus the same bounded-read semantics now cover images, TOC entries, and slide notes that were previously capped only on per-entry size.

pratyush618 added 2 commits April 24, 2026 17:02

github-actions Bot added the rust Pull requests that update rust code label Apr 24, 2026

pratyush618 merged commit ea13265 into main Apr 24, 2026
13 checks passed

pratyush618 mentioned this pull request Apr 24, 2026

chore(release): 0.2.1 #74

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(zip): shared SafeArchive with four-axis decompression-bomb caps#72

fix(zip): shared SafeArchive with four-axis decompression-bomb caps#72
pratyush618 merged 2 commits intomainfrom
fix/safezip-limits

pratyush618 commented Apr 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pratyush618 commented Apr 24, 2026

Summary

What changed

paperjam-model::zip_safety (new, feature-gated)

paperjam-epub + paperjam-pptx (consolidate)

Why now

Test plan

What's still outstanding from the audit

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`paperjam-model::zip_safety` (new, feature-gated)

`paperjam-epub` + `paperjam-pptx` (consolidate)