feat(be-yuv-hb): BE support for YUV planar high-bit + P-format row kernels#89
Open
feat(be-yuv-hb): BE support for YUV planar high-bit + P-format row kernels#89
Conversation
…rnels (scalar + NEON) Adds `<const BE: bool>` to all scalar and NEON row kernels for high-bit YUV planar (yuv420p/422p/444p 9–16-bit, yuva 4:2:0/4:2:2/4:4:4) and P-format semi-planar (P010, P012, P016, P410, P412, P416) → RGB/RGBA conversion. BE loads use `load_u16::<BE>` at scalar sites and `load_endian_u16x8::<BE>` / `deinterleave_endian::<BE>` at NEON SIMD sites; x86 and wasm backends wired as `<false>` pending a follow-up tranche. All 2159 existing tests pass; dispatch + test call sites forward `<false>` to preserve LE-only behaviour until BE callers land. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mat row kernels Extends the BE-aware kernel rollout (scalar + NEON in 5b989ba) to the remaining four SIMD backends — SSE4.1, AVX2, AVX-512, and wasm-simd128 — for high-bit YUV planar and P-format kernels. Files updated (16 backend files × 4 kernel families): - src/row/arch/x86_sse41/{yuv_planar_high_bit, yuv_planar_16bit, subsampled_high_bit_pn_4_2_0, subsampled_high_bit_pn_4_4_4}.rs - src/row/arch/x86_avx2/{yuv_planar_high_bit, yuv_planar_16bit, subsampled_high_bit_pn_4_2_0, subsampled_high_bit_pn_4_4_4}.rs - src/row/arch/x86_avx512/{yuv_planar_high_bit, yuv_planar_16bit, subsampled_high_bit_pn_4_2_0, subsampled_high_bit_pn_4_4_4}.rs - src/row/arch/wasm_simd128/{yuv_planar_high_bit, yuv_planar_16bit, subsampled_high_bit_pn_4_2_0, subsampled_high_bit_pn_4_4_4}.rs Each pub(crate) row kernel now takes <const BE: bool>; SIMD u16 loads go through endian::load_endian_u16x{8,16,32}::<BE> per backend. Tail fallbacks forward BE to the corresponding scalar kernels. Lane strategy decisions per backend: - All planar 4:2:0 / 4:4:4 / 16-bit kernels: native endian-aware `load_endian_u16x{N}::<BE>` for every Y/U/V load. - Pn/P016/P416 semi-planar interleaved-UV kernels: deinterleave first via the existing `deinterleave_uv_u16{,_avx2,_avx512,_wasm}` helper, then byte-swap each U/V vector with a local `byteswap_u16x{8,16,32} ::<BE>` helper (per-arch). Compiles away when BE = false. Cost is one extra pshufb per deinterleaved vector on BE — strictly minor versus the deinterleave itself, and keeps the LE fast path identical. - AVX2/AVX-512 P016 u16-output paths use a half-width 256-bit / 128-bit inline `shuffle_epi8` for the `_mm256_loadu_si256` / `_mm_loadu_si128` half-loads (no public 256-bit / 128-bit endian helper in the AVX2 / AVX-512 endian module). Same pattern for AVX2/AVX-512/wasm 4:2:0 half-load chroma — fold per-u16-lane swap into the deinterleave mask where possible (P016 u16 paths) or use a dedicated post-load swizzle (4:2:0 u16 outputs). - wasm yuv_420p16 u16-output: half-width `v128_load64_zero` uses an inline `u8x16_swizzle` for the BE byte-swap (high 8 bytes are zero so same shuffle leaves them unchanged). All 9 NEON BE parity tests pass on aarch64 (+9 vs 5b989ba=2159, total 2168). Five new tests/be_parity.rs files added across all 5 backends — 36 BE parity tests total exercising yuv_420p_n, yuv_444p_n, yuv_*p16, P010, P410, P016, P416 across u8 + u16 outputs. Each test takes a randomized LE input, byte-swaps every u16, and asserts kernel<true> on swapped == kernel<false> on original. x86/wasm test sites updated to forward `, false` through generic chain (matches NEON's existing pattern from 5b989ba). Verified: - cargo test --target aarch64-apple-darwin: 2168 passed, 0 failed - cargo build --target x86_64-apple-darwin --tests: clean - RUSTFLAGS="-C target-feature=+simd128" cargo build --target wasm32-unknown-unknown --tests: clean - cargo build --no-default-features: clean - cargo fmt --check: clean - cargo clippy --all-targets -- -D warnings: clean (aarch64 host) Sinker call sites remain hardcoded `<false>` per task spec (out of scope for this tranche). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
YUV-HB scalar reads route through a centralized helper
`load_u16<const BE: bool>` in `src/row/scalar/mod.rs`. The original
implementation used the naive `if BE { v.swap_bytes() } else { v }`
pattern which is wrong on big-endian hosts (s390x): it
unconditionally swaps when BE=true regardless of host endianness,
diverging from the SIMD `load_endian_u16x*::<BE>` helpers which are
target-endian aware (a swap is needed only when source byte order
differs from host CPU's native byte order).
Replaced with `if BE { u16::from_be(v) } else { u16::from_le(v) }`.
`u16::from_be`/`from_le` each emit a `bswap` only when the source
byte order differs from the host — exactly matching the SIMD helper
semantics. All 9/10/12/14/16-bit YUV planar + P010/012/016/410/412/
416 kernels go through this single helper, so this fix corrects
every YUV-HB BE scalar path crate-wide in one commit.
Verified:
- cargo test --target aarch64-apple-darwin --lib: 2168 passed
- cargo build --target x86_64-apple-darwin --tests: 0 warnings
- cargo build --target wasm32-unknown-unknown --tests
(RUSTFLAGS=-C target-feature=+simd128): clean
- cargo build --no-default-features: clean
- cargo fmt --check: clean
- cargo clippy --all-targets --all-features -D warnings: clean
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 2 — YUV-HB BE rollout. Adds
<const BE: bool>to all 9/10/12/14/16-bit YUV planar (Yuv420p_n, Yuv422p_n, Yuv444p_n, plus 16-bit i64 family) and all P-format (P010/012/016 4:2:0, P210/212/216 4:2:2, P410/412/416 4:4:4) row kernels across scalar + 5 SIMD backends.Implementation:
load_u16<const BE: bool>helper insrc/row/scalar/mod.rs— target-endian aware (u16::from_be/u16::from_le) matching SIMDload_endian_u16x*::<BE>helpers from feat(be-infra): endian-aware SIMD loaders across 5 backends #81. No naiveif BE { swap_bytes() }pattern.load_endian_u16xN::<BE>from feat(be-infra): endian-aware SIMD loaders across 5 backends #81 + per-archbyteswap_u16x{8,16,32}::<BE>helpers folded into existing UV deinterleave shuffles where possible_mm{,256}_shuffle_epi8post-load (no public 128/256-bit endian load helper for those widths)v128_load64_zerohalf-loads use inlineu8x16_swizzlewith high-zero maskTest results:
-DwarningsPublic dispatchers stay LE-only by design —
p010_to_rgb_row,yuv420p10_to_rgb_row, etc. take their endianness from the format name (P010 = LE-encoded by spec). No caller plumbing BE through. Phase 4 will add type aliases (P010LeFrame/P010BeFrame) at the Frame layer.Test plan
cargo test --target aarch64-apple-darwin --lib(2168 passed)cargo build --target x86_64-apple-darwin --tests(0 warnings)RUSTFLAGS="-C target-feature=+simd128" cargo build --target wasm32-unknown-unknown --testscargo build --no-default-featurescargo fmt --checkcargo clippy --all-targets --all-features -- -D warnings🤖 Generated with Claude Code