
7z: raw LZMA2 decoder + BCJ2 4-stream filter (#74) #79

Merged
MagicalTux merged 2 commits into master from lzma2-bcj2-7z
May 30, 2026

@MagicalTux
Member

Summary

Closes #74 — both pieces done.

Raw LZMA2 decoder (lzma2) — the priority

Decodes the raw 7-Zip LZMA2 chunk stream (codec id 21): control-byte-framed chunks, self-terminating on 0x00, distinct from the .xz container. The 1-byte 7z dict-size coder property is supplied via DecoderConfig::with_dict_prop (decoded with the same lzma2_dict_size logic xz uses); with_dict_size/advisory with_len also offered. Decode-only (Unsupported encoder stub), DecoderReader-compatible.
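
The framing described above can be sketched as follows. This is an illustration only, not the crate's code: `dict_size_from_prop`, `classify_control`, and the `Chunk` enum are hypothetical names. The dict-size formula is the one the xz format defines for the LZMA2 dictionary-size byte, and the control-byte ranges follow the usual LZMA2 layout (0x00 ends the stream, 0x01/0x02 are uncompressed chunks, 0x80 and above are LZMA chunks).

```rust
/// Hypothetical stand-in for the crate's `lzma2_dict_size`.
/// Maps the 1-byte 7z coder property to a dictionary size in bytes.
fn dict_size_from_prop(prop: u8) -> Option<u32> {
    match prop {
        // Sizes alternate between 2^n and 2^n + 2^(n-1), starting at 4 KiB.
        0..=39 => Some((2 | u32::from(prop & 1)) << (u32::from(prop) / 2 + 11)),
        // 40 is the special "4 GiB - 1" value.
        40 => Some(u32::MAX),
        // Anything larger is not a valid property byte.
        _ => None,
    }
}

/// Hypothetical classification of an LZMA2 control byte.
#[derive(Debug, PartialEq)]
enum Chunk {
    /// 0x00: end of the raw LZMA2 stream.
    End,
    /// 0x01 (with dictionary reset) or 0x02 (without).
    Uncompressed { reset_dict: bool },
    /// 0x80..=0xFF: LZMA chunk. `reset` is the 2-bit reset mode,
    /// `size_hi` holds bits 16..=20 of (uncompressed size - 1).
    Lzma { reset: u8, size_hi: u8 },
    /// 0x03..=0x7F: not a valid control byte.
    Invalid,
}

fn classify_control(c: u8) -> Chunk {
    match c {
        0x00 => Chunk::End,
        0x01 => Chunk::Uncompressed { reset_dict: true },
        0x02 => Chunk::Uncompressed { reset_dict: false },
        0x03..=0x7F => Chunk::Invalid,
        _ => Chunk::Lzma { reset: (c >> 5) & 0x3, size_hi: c & 0x1F },
    }
}
```

For example, a property byte of 0 corresponds to a 4 KiB dictionary, and a control byte of 0xE0 is an LZMA chunk with reset mode 3 (state, properties, and dictionary all reset).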

Reuses the existing xz LZMA2 engine rather than reimplementing LZMA: the shared LzmaCore/Lzma2Props/lzma2_dict_size were relocated (git mv) from src/xz/ into a crate-internal src/lzma2_internal/ reachable by both xz and lzma2; xz's import path updated. xz behavior is unchanged — its 31 tests still pass. Added LzmaCore::append_literals so uncompressed chunks feed the LZ window.
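
To illustrate why uncompressed chunks have to feed the LZ window, here is a minimal sketch. The `Window` type and its `copy_match` method are hypothetical and far simpler than the real `LzmaCore` (the buffer is unbounded here, whereas a real window is limited by the dictionary size); only the name `append_literals` comes from this PR.

```rust
/// Hypothetical, simplified LZ history buffer; not the crate's `LzmaCore`.
struct Window {
    buf: Vec<u8>,
}

impl Window {
    fn new() -> Self {
        Window { buf: Vec::new() }
    }

    /// Record bytes from an uncompressed chunk so that matches in later
    /// LZMA chunks can refer back to them.
    fn append_literals(&mut self, data: &[u8]) {
        self.buf.extend_from_slice(data);
    }

    /// Copy `len` bytes starting `dist` bytes back from the end.
    /// Returns `None` if the distance reaches before the start of history.
    fn copy_match(&mut self, dist: usize, len: usize) -> Option<()> {
        if dist == 0 || dist > self.buf.len() {
            return None;
        }
        for _ in 0..len {
            // Byte-by-byte copy so overlapping matches (len > dist) work.
            let b = self.buf[self.buf.len() - dist];
            self.buf.push(b);
        }
        Some(())
    }
}
```

If the three bytes of an uncompressed chunk were not appended, a later match with distance 3 would have no history to copy from; with them in place, the match resolves normally.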

BCJ2 filter (bcj2)

The 7-Zip 4-stream x86 branch filter (0303011B) — distinct from the single-stream BCJ x86 (03030103) already shipped. The 4-input shape (main + call + jump + range-coded control) doesn't fit the single-stream Decoder trait, so it's a dedicated function API: compcol::bcj2::decode(main, call, jump, rc, out_len) + an encode inverse. Public-domain LZMA SDK algorithm (E8/E9/0F8x detection, per-opcode range-coded control bit, abs↔rel address math cross-checked against the repo's already-validated single-stream x86 BCJ).
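
The per-opcode model selection and the address math can be sketched as below. The helper names `prob_index`, `rel_to_abs`, and `abs_to_rel` are hypothetical; the mapping (E8 to slot 2 + previous byte, E9 to slot 1, 0F 8x to slot 0) and the offset of operand position plus 4 are taken from the commit message of this PR, and wrapping 32-bit arithmetic is an assumption of this sketch.

```rust
/// Probability-model slot for a branch candidate, or `None` if the byte
/// pair is not a candidate. `prev` is the byte before `op` in the stream.
fn prob_index(prev: u8, op: u8) -> Option<usize> {
    match (prev, op) {
        // CALL rel32: one model per possible previous byte (slots 2..=257).
        (_, 0xE8) => Some(2 + usize::from(prev)),
        // JMP rel32: a single shared model.
        (_, 0xE9) => Some(1),
        // Jcc rel32 (0F 80..=0F 8F): a single shared model.
        (0x0F, 0x80..=0x8F) => Some(0),
        _ => None,
    }
}

/// Encoder direction: relative displacement to absolute target.
/// `operand_pos` is the offset of the 4-byte operand in the output.
fn rel_to_abs(rel: u32, operand_pos: u32) -> u32 {
    rel.wrapping_add(operand_pos.wrapping_add(4))
}

/// Decoder direction: absolute target back to relative displacement.
fn abs_to_rel(abs: u32, operand_pos: u32) -> u32 {
    abs.wrapping_sub(operand_pos.wrapping_add(4))
}
```

The two conversions are exact inverses for any operand position, which is what the round-trip tests rely on; the 258 model slots correspond to the "all 256 prev-byte E8 models" case plus the two shared ones.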

DoS hygiene

#![forbid(unsafe_code)]; clamped dict allocation; no panics on crafted input; truncation → UnexpectedEnd, malformed → Corrupt, poison-on-error.
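
A minimal sketch of the error discipline listed above, under stated assumptions: the `Error` enum, `clamp_dict`, and `PoisoningDecoder` are hypothetical names, the 4 KiB..128 MiB bounds come from the commit message, and `step` treats its argument as the entire remaining input, so an empty slice means truncation rather than "more data may follow".

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Error {
    /// Input ended before the stream's end marker.
    UnexpectedEnd,
    /// Input is structurally invalid.
    Corrupt,
}

/// Clamp a requested dictionary size before allocating, so a crafted
/// property byte cannot force a huge allocation.
fn clamp_dict(requested: u32) -> usize {
    requested.clamp(4 * 1024, 128 * 1024 * 1024) as usize
}

/// Hypothetical decoder shell showing poison-on-error: after the first
/// failure, every later call reports the same error.
struct PoisoningDecoder {
    poisoned: Option<Error>,
}

impl PoisoningDecoder {
    fn new() -> Self {
        PoisoningDecoder { poisoned: None }
    }

    /// Read one control byte from the remaining input.
    fn step(&mut self, input: &[u8]) -> Result<u8, Error> {
        if let Some(e) = self.poisoned {
            return Err(e);
        }
        let result = match input.first() {
            None => Err(Error::UnexpectedEnd),
            // 0x03..=0x7F is not a valid LZMA2 control byte.
            Some(&c) if (0x03..=0x7F).contains(&c) => Err(Error::Corrupt),
            Some(&c) => Ok(c),
        };
        if let Err(e) = result {
            self.poisoned = Some(e);
        }
        result
    }
}
```

Poisoning keeps a caller from accidentally resuming a decoder whose internal state is no longer trustworthy: once `step` has failed, feeding it valid bytes still returns the original error.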

Validation

  • raw LZMA2: 12 in-module + 7 integration tests (single/multi-chunk, dict resets, uncompressed chunks, 1-byte streaming, truncation/corruption, dict-prop config, factory wiring) — round-trip vs the xz LZMA2 encoder.
  • BCJ2: 11 in-module + 3 integration round-trips (random incl. 64 KiB, synthetic x86 all branch kinds, all 256 prev-byte E8 models, tail cases, truncation).
  • Gates: builds (lzma2, lzma2 std, bcj2, xz, all-features, default), cargo test (xz 31, lzma2 7, bcj2 3, 290 lib), clippy -D warnings (narrow + all), cargo fmt --check, strict cargo doc — all green.

Closes #74.

🤖 Generated with Claude Code

MagicalTux and others added 2 commits May 31, 2026 07:51
Part 1 (priority) — raw LZMA2 `Decoder` (7z coder id 21):
- New `lzma2 = ["alloc", "lzma"]` feature, `src/lzma2/` module, marker
  `Lzma2` (NAME "lzma2"), decode-only (encoder is an Unsupported stub).
- Decodes the raw LZMA2 chunk stream (dict-reset control bytes + LZMA
  chunks), self-terminating on the 0x00 end-control byte — distinct from
  the `.xz` container. The 7z coder property (1-byte dict-size code) is
  accepted via `DecoderConfig::with_dict_prop`, decoded the same way xz
  derives the LZMA2 dict size; `with_dict_size` / advisory `with_len`
  also offered.
- Reuses the existing xz LZMA2 machinery rather than reimplementing LZMA:
  the shared `LzmaCore` / `Lzma2Props` / `lzma2_dict_size` are moved into
  a crate-internal `src/lzma2_internal/` module reachable by both `xz`
  and `lzma2` (the LZMA payload *encoder* compiles only under `xz`/test).
  Added `LzmaCore::append_literals` so uncompressed chunks feed the LZ
  window.
- DoS hygiene: bounded dict allocation (clamped 4 KiB..128 MiB), checked
  arithmetic, truncation -> UnexpectedEnd, malformed -> Corrupt, poison
  on error.

Part 2 — BCJ2 filter (0303011B, 4-stream):
- New `bcj2 = ["alloc"]` feature, `src/bcj2/` module. Dedicated function
  API `compcol::bcj2::decode(main, call, jump, rc, out_len)` (the 4-input
  shape does not fit the single-input `Decoder` trait), plus an `encode`
  inverse for round-trip validation.
- Implements the public-domain LZMA SDK BCJ2 algorithm: E8/E9/0F8x
  candidate detection, per-opcode range-coded control bit (prob model
  E8 -> 2+prev, E9 -> 1, 0F8x -> 0), E8 -> call stream / E9,0F8x -> jump
  stream, abs<->rel = +/- (operand_pos + 4) matching the crate's
  validated single-stream x86 BCJ. Range coder is LZMA-style.

Wiring: features (+ `all` meta-feature), `src/lib.rs` modules, LZMA2
registered in `src/factory.rs` (encoder/decoder/names/extension "lzma2").

Validation: lzma2 — 12 in-module round-trips (single/multi compressed
chunk, dict resets, uncompressed chunks, 1-byte streaming, truncation,
corruption, reset reuse, dict-prop) + 7 public-API integration tests;
bcj2 — 11 in-module + 3 integration round-trips (random, synthetic x86
with all branch kinds, all 256 prev-byte E8 models, tail/no-room,
truncation errors). bcj2 is DONE (round-trip validated; address math
cross-checked against the repo's validated single-stream x86 BCJ).
All gates pass: builds (lzma2, lzma2+std, bcj2, xz, all-features),
clippy -D warnings, cargo test, fmt --check, strict rustdoc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@MagicalTux MagicalTux merged commit 5331055 into master May 30, 2026
31 checks passed
@MagicalTux MagicalTux mentioned this pull request May 30, 2026
Development

Successfully merging this pull request may close these issues.

7z: raw LZMA2 decoder entry point + BCJ2 filter