Skip to content

feat(be-tier4): BE support for Y210/Y212/Y216/V210 row kernels#88

Open
uqio wants to merge 2 commits intomainfrom
feat/be-tier4
Open

feat(be-tier4): BE support for Y210/Y212/Y216/V210 row kernels#88
uqio wants to merge 2 commits intomainfrom
feat/be-tier4

Conversation

@uqio
Copy link
Copy Markdown
Collaborator

@uqio uqio commented May 7, 2026

Summary

Phase 2 — Tier 4 BE rollout. Adds <const BE: bool> to all Y210 / Y212 / Y216 / V210 row kernels across all 6 backends + dispatcher.

Implementation:

  • Scalar reads via centralized load_endian_u16<BE> / load_endian_u32<BE> helpers in src/row/scalar/mod.rs — target-endian aware (u16::from_be_bytes / u32::from_le_bytes semantics, matching the SIMD load_endian_*::<BE> helpers from feat(be-infra): endian-aware SIMD loaders across 5 backends #81). No naive if BE { swap_bytes() } pattern (which would corrupt rows on s390x).
  • All 5 SIMD backends use load_endian_u16xN::<BE> / load_endian_u32xN::<BE> from feat(be-infra): endian-aware SIMD loaders across 5 backends #81.
  • BE path uses if !BE { /* SIMD */ } else { /* scalar fallback */ } for backends where the SIMD deinterleave path is too complex to easily make endian-aware (Y210/Y212/Y216 chroma deinterleave).

BE-host scalar fix note

Codex flagged a high-severity scalar BE bug in tier 10b (fix(be-tier10b): make scalar BE conversion target-endian aware, 26e5077): the inline if BE { x.swap_bytes() } else { x } pattern is unconditional — it swaps bytes on every load when BE = true, regardless of host endianness, so on s390x both BE-input and LE-input rows are corrupted. The codex-recommended fix is if BE { u16::from_be(x) } else { u16::from_le(x) }, which is target-endian aware (no-op when source byte order matches the host).

Tier 4 audit finding: the buggy if BE { swap_bytes() } pattern does not exist in tier 4 production scalar code. Tier 4 was implemented from the start using the helper functions load_endian_u16<BE> / load_endian_u32<BE> declared in src/row/scalar/mod.rs, which decode raw bytes via u16::from_be_bytes / u16::from_le_bytes (and the u32 pair). Those byte-array decoders have the same target-endian-aware semantics as u16::from_be / u16::from_le, so tier 4 scalar paths agree with their SIMD counterparts on both LE and BE hosts.

The follow-up commit fix(be-tier4): make scalar BE conversion target-endian aware (2835895) records this audit finding by upgrading the doc-comments on load_endian_u16<BE> / load_endian_u32<BE> to spell out the target-endian-aware contract and cite the codex motivation, preventing a future regression that introduces the buggy inline pattern.

Tests

  • 2171 lib tests pass on aarch64.
  • All 4 BE parity tests added per scalar kernel: scalar_v210_be_*, scalar_y210_be_*, y216_be_* — verify (LE-encoded buffer + BE = false) ≡ (BE-encoded buffer + BE = true).
  • Per-arch SIMD test files (sse41/avx2/avx512/wasm/neon) updated for the new <BE> const generic.

Test plan

  • cargo test --target aarch64-apple-darwin --lib (2171 passed)
  • cargo build --target x86_64-apple-darwin --tests (0 warnings)
  • cargo build --target wasm32-unknown-unknown --tests (0 warnings; -C target-feature=+simd128)
  • cargo build --no-default-features
  • cargo fmt --check
  • cargo clippy --all-targets --all-features -- -D warnings
  • s390x QEMU (Phase 3)

🤖 Generated with Claude Code

uqio and others added 2 commits May 8, 2026 01:22
Adds `<const BE: bool>` const-generic gating to all four Tier 4 packed
YUV 4:2:2 formats across scalar, NEON, x86 (SSE4.1/AVX2/AVX512), and
WASM SIMD128 backends, plus dispatch functions and sinker call sites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex flagged a high-severity scalar BE bug in tier 10b: the
inline `if BE { x.swap_bytes() } else { x }` pattern is wrong
on big-endian hosts because `swap_bytes()` is unconditional —
it swaps even when the data already matches the host's byte
order. The matching SIMD `load_endian_*::<BE>` helpers from
PR #81 are target-endian aware (cfg-gated reverses; no-op when
source order matches host order), so the buggy scalar paths
diverge on s390x, corrupting both BE-input and LE-input rows
when run through scalar tails or the (always-scalar) luma
kernels.

Audit of tier 4 scalar code confirms tier 4 was implemented
from the start using the helper functions `load_endian_u16::<BE>`
and `load_endian_u32::<BE>` declared in `src/row/scalar/mod.rs`,
which build a fresh `[u8; N]` from the source pointer and decode
via `u16::from_be_bytes` / `u16::from_le_bytes` (and the u32
pair). Those byte-array decoders are target-endian aware: each
is a no-op when the data byte order matches the host CPU and a
hardware byte-swap when they differ — the same semantics as
`u16::from_be` / `u16::from_le` and the SIMD `load_endian_*`
helpers. No `if BE { x.swap_bytes() } else { x }` pattern exists
in tier 4 production scalar code (`src/row/scalar/{v210,y2xx,y216}.rs`),
so no scalar production fix is needed for s390x correctness.

To prevent a future regression that introduces the buggy
pattern (a real risk now that the codex finding is on file
across tier 5/8/10b/10-float/11), this commit upgrades the
doc-comments on `load_endian_u16<BE>` and `load_endian_u32<BE>`
to:
- Spell out the **target-endian aware** contract (no swap on
  matching host order, swap on differing order).
- Cite the codex finding and reference the tier 10b fix commit
  message for the full motivation.
- Mark the inline-`swap_bytes` pattern as the "naive alternative"
  that the helpers exist specifically to avoid.

Test helpers `to_be_u16` (`src/row/scalar/y2xx.rs`,
`src/row/scalar/y216.rs`) and `pack_v210_word_be`
(`src/row/scalar/v210.rs`) are intentionally left unchanged —
they synthesize BE-encoded fixtures from LE inputs on the
LE-host CI, mirroring the tier 5/8/10b convention; a future
phase 3 s390x QEMU run will revisit them.

Verified:
- cargo test --target aarch64-apple-darwin --lib (2171 passed, 0 failed)
- cargo build --target x86_64-apple-darwin --tests (clean, 0 warnings)
- RUSTFLAGS='-C target-feature=+simd128' cargo build --target wasm32-unknown-unknown --tests
- cargo build --no-default-features
- cargo fmt --check
- cargo clippy --all-targets --all-features -- -D warnings

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant