7z: raw LZMA2 decoder + BCJ2 4-stream filter (#74) #79
Merged
Part 1 (priority): raw LZMA2 `Decoder` (7z coder id 21)
- New `lzma2 = ["alloc", "lzma"]` feature, `src/lzma2/` module, marker `Lzma2` (NAME "lzma2"), decode-only (encoder is an Unsupported stub).
- Decodes the raw LZMA2 chunk stream (dict-reset control bytes + LZMA chunks), self-terminating on the 0x00 end-control byte; this is distinct from the `.xz` container. The 7z coder property (1-byte dict-size code) is accepted via `DecoderConfig::with_dict_prop` and decoded the same way xz derives the LZMA2 dict size; `with_dict_size` / advisory `with_len` are also offered.
- Reuses the existing xz LZMA2 machinery rather than reimplementing LZMA: the shared `LzmaCore` / `Lzma2Props` / `lzma2_dict_size` are moved into a crate-internal `src/lzma2_internal/` module reachable by both `xz` and `lzma2` (the LZMA payload *encoder* compiles only under `xz`/test). Added `LzmaCore::append_literals` so uncompressed chunks feed the LZ window.
- DoS hygiene: bounded dict allocation (clamped 4 KiB..128 MiB), checked arithmetic, truncation -> UnexpectedEnd, malformed -> Corrupt, poison on error.

Part 2: BCJ2 filter (0303011B, 4-stream)
- New `bcj2 = ["alloc"]` feature, `src/bcj2/` module. Dedicated function API `compcol::bcj2::decode(main, call, jump, rc, out_len)` (the 4-input shape does not fit the single-input `Decoder` trait), plus an `encode` inverse for round-trip validation.
- Implements the public-domain LZMA SDK BCJ2 algorithm: E8/E9/0F8x candidate detection, per-opcode range-coded control bit (prob model E8 -> 2+prev, E9 -> 1, 0F8x -> 0), E8 -> call stream / E9,0F8x -> jump stream, abs<->rel = +/- (operand_pos + 4) matching the crate's validated single-stream x86 BCJ. Range coder is LZMA-style.

Wiring: features (+ `all` meta-feature), `src/lib.rs` modules, LZMA2 registered in `src/factory.rs` (encoder/decoder/names/extension "lzma2").
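To make the 1-byte dict-size property concrete, here is a minimal sketch of the standard xz / LZMA SDK encoding combined with the 4 KiB..128 MiB clamp mentioned above. The function and constant names are hypothetical and are not the crate's actual `lzma2_dict_size`; whether the crate clamps or rejects out-of-range sizes is an assumption of this sketch.

```rust
// Hypothetical bounds, taken from the clamp range quoted in the commit message.
const MIN_DICT: u32 = 4 * 1024;
const MAX_DICT: u32 = 128 * 1024 * 1024;

/// Decode the 1-byte LZMA2 dictionary-size property using the xz / LZMA SDK
/// encoding: values 0..=39 give (2 | (prop & 1)) << (prop / 2 + 11),
/// 40 means 4 GiB - 1, and anything larger is undefined (None here).
fn dict_size_from_prop(prop: u8) -> Option<u32> {
    match prop {
        0..=39 => {
            let mantissa = 2u32 | u32::from(prop & 1);
            let shift = u32::from(prop / 2) + 11;
            // Largest case is 3 << 30, which still fits in a u32.
            Some(mantissa << shift)
        }
        40 => Some(u32::MAX),
        _ => None,
    }
}

/// Assumed policy: clamp the decoded size into the bounded allocation range
/// instead of trusting the stream to request a sane dictionary.
fn clamped_dict_size(prop: u8) -> Option<u32> {
    dict_size_from_prop(prop).map(|size| size.clamp(MIN_DICT, MAX_DICT))
}
```

Under these assumptions, property 0 maps to exactly 4 KiB, property 30 to exactly 128 MiB, and every larger valid property is capped at 128 MiB, so a crafted header cannot force a multi-gigabyte allocation.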
Validation:
- lzma2: 12 in-module round-trips (single/multi compressed chunk, dict resets, uncompressed chunks, 1-byte streaming, truncation, corruption, reset reuse, dict-prop) + 7 public-API integration tests.
- bcj2: 11 in-module + 3 integration round-trips (random, synthetic x86 with all branch kinds, all 256 prev-byte E8 models, tail/no-room, truncation errors). bcj2 is DONE (round-trip validated; address math cross-checked against the repo's validated single-stream x86 BCJ).

All gates pass: builds (lzma2, lzma2+std, bcj2, xz, all-features), clippy -D warnings, cargo test, fmt --check, strict rustdoc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Closes #74 — both pieces done.
Raw LZMA2 decoder (`lzma2`): the priority

Decodes the raw 7-Zip LZMA2 chunk stream (codec id 21): control-byte-framed chunks, self-terminating on `0x00`, distinct from the `.xz` container. The 1-byte 7z dict-size coder property is supplied via `DecoderConfig::with_dict_prop` (decoded with the same `lzma2_dict_size` logic xz uses); `with_dict_size` / advisory `with_len` are also offered. Decode-only (`Unsupported` encoder stub), `DecoderReader`-compatible.

Reuses the existing xz LZMA2 engine rather than reimplementing LZMA: the shared `LzmaCore` / `Lzma2Props` / `lzma2_dict_size` were relocated (`git mv`) from `src/xz/` into a crate-internal `src/lzma2_internal/` reachable by both `xz` and `lzma2`; `xz`'s import path was updated. xz behavior is unchanged: its 31 tests still pass. Added `LzmaCore::append_literals` so uncompressed chunks feed the LZ window.

BCJ2 filter (`bcj2`)

The 7-Zip 4-stream x86 branch filter (`0303011B`), distinct from the single-stream BCJ x86 (`03030103`) already shipped. The 4-input shape (main + call + jump + range-coded control) doesn't fit the single-stream `Decoder` trait, so it's a dedicated function API: `compcol::bcj2::decode(main, call, jump, rc, out_len)` + an `encode` inverse. Public-domain LZMA SDK algorithm (E8/E9/0F8x detection, per-opcode range-coded control bit, abs↔rel address math cross-checked against the repo's already-validated single-stream x86 BCJ).

DoS hygiene

`#![forbid(unsafe_code)]`; clamped dict allocation; no panics on crafted input; truncation → `UnexpectedEnd`, malformed → `Corrupt`, poison-on-error.

Validation

Builds (`lzma2`, `lzma2 std`, `bcj2`, `xz`, all-features, default), `cargo test` (xz 31, lzma2 7, bcj2 3, 290 lib), clippy `-D warnings` (narrow + all), `cargo fmt --check`, strict `cargo doc`: all green.

Closes #74.
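As an illustration of the BCJ2 rules summarized in this PR (E8/E9/0F8x candidate detection, the per-opcode probability slot, and the abs↔rel conversion by `operand_pos + 4`), the following is a minimal sketch. All names here (`Branch`, `classify`, `prob_index`, `rel_to_abs`, `abs_to_rel`) are hypothetical and are not the crate's internals; 32-bit wrapping arithmetic is an assumption that matches how x86 relative operands behave.

```rust
/// Which side stream a branch operand would be routed to.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Branch {
    Call, // E8 -> call stream
    Jump, // E9 and 0F 8x -> jump stream
}

/// Candidate detection: E8, E9, or a 0F-prefixed 8x conditional jump.
fn classify(prev: u8, byte: u8) -> Option<Branch> {
    match byte {
        0xE8 => Some(Branch::Call),
        0xE9 => Some(Branch::Jump),
        0x80..=0x8F if prev == 0x0F => Some(Branch::Jump),
        _ => None,
    }
}

/// Probability slot for the range-coded control bit, following the mapping
/// stated in the commit message: E8 -> 2 + prev, E9 -> 1, 0F8x -> 0.
fn prob_index(prev: u8, byte: u8) -> Option<usize> {
    match byte {
        0xE8 => Some(2 + usize::from(prev)),
        0xE9 => Some(1),
        0x80..=0x8F if prev == 0x0F => Some(0),
        _ => None,
    }
}

/// Relative -> absolute: add the address of the byte after the 4-byte operand.
fn rel_to_abs(rel: u32, operand_pos: u32) -> u32 {
    rel.wrapping_add(operand_pos.wrapping_add(4))
}

/// Absolute -> relative: the exact inverse of `rel_to_abs`.
fn abs_to_rel(abs: u32, operand_pos: u32) -> u32 {
    abs.wrapping_sub(operand_pos.wrapping_add(4))
}
```

Because both conversions wrap modulo 2^32, `abs_to_rel(rel_to_abs(x, p), p) == x` holds for every `x` and `p`, which is the property a round-trip test of the address math relies on. With 256 possible previous bytes, the E8 slots occupy indices 2 through 257.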
🤖 Generated with Claude Code