Skip to content

feat(be-yuv-hb): BE support for YUV planar high-bit + P-format row kernels#89

Open
uqio wants to merge 3 commits intomainfrom
feat/be-yuv-hb
Open

feat(be-yuv-hb): BE support for YUV planar high-bit + P-format row kernels#89
uqio wants to merge 3 commits intomainfrom
feat/be-yuv-hb

Conversation

@uqio
Copy link
Copy Markdown
Collaborator

@uqio uqio commented May 7, 2026

Summary

Phase 2 — YUV-HB BE rollout. Adds <const BE: bool> to all 9/10/12/14/16-bit YUV planar (Yuv420p_n, Yuv422p_n, Yuv444p_n, plus 16-bit i64 family) and all P-format (P010/012/016 4:2:0, P210/212/216 4:2:2, P410/412/416 4:4:4) row kernels across scalar + 5 SIMD backends.

Implementation:

  • Scalar reads route through centralized load_u16<const BE: bool> helper in src/row/scalar/mod.rs — target-endian aware (u16::from_be / u16::from_le) matching SIMD load_endian_u16x*::<BE> helpers from feat(be-infra): endian-aware SIMD loaders across 5 backends #81. No naive if BE { swap_bytes() } pattern.
  • All 5 SIMD backends use load_endian_u16xN::<BE> from feat(be-infra): endian-aware SIMD loaders across 5 backends #81 + per-arch byteswap_u16x{8,16,32}::<BE> helpers folded into existing UV deinterleave shuffles where possible
  • Folded byteswap into deinterleave shuffle for SSE4.1/AVX-512 P016 u16 paths (zero extra cost on LE)
  • AVX2/AVX-512 4:2:0 u16-output half-loads use inline _mm{,256}_shuffle_epi8 post-load (no public 128/256-bit endian load helper for those widths)
  • wasm v128_load64_zero half-loads use inline u8x16_swizzle with high-zero mask

Test results:

  • 2168 tests pass on aarch64
  • 36 new BE parity tests across all 5 backends (9 NEON, 9 SSE4.1, 9 AVX2, 9 AVX-512, 9 wasm-simd128)
  • All cross-target builds clean with -Dwarnings

Public dispatchers stay LE-only by designp010_to_rgb_row, yuv420p10_to_rgb_row, etc. take their endianness from the format name (P010 = LE-encoded by spec). No caller plumbing BE through. Phase 4 will add type aliases (P010LeFrame / P010BeFrame) at the Frame layer.

Test plan

  • cargo test --target aarch64-apple-darwin --lib (2168 passed)
  • cargo build --target x86_64-apple-darwin --tests (0 warnings)
  • RUSTFLAGS="-C target-feature=+simd128" cargo build --target wasm32-unknown-unknown --tests
  • cargo build --no-default-features
  • cargo fmt --check
  • cargo clippy --all-targets --all-features -- -D warnings
  • s390x QEMU (Phase 3)

🤖 Generated with Claude Code

uqio and others added 3 commits May 8, 2026 02:11
…rnels (scalar + NEON)

Adds `<const BE: bool>` to all scalar and NEON row kernels for high-bit
YUV planar (yuv420p/422p/444p 9–16-bit, yuva 4:2:0/4:2:2/4:4:4) and
P-format semi-planar (P010, P012, P016, P410, P412, P416) → RGB/RGBA
conversion. BE loads use `load_u16::<BE>` at scalar sites and
`load_endian_u16x8::<BE>` / `deinterleave_endian::<BE>` at NEON SIMD
sites; x86 and wasm backends wired as `<false>` pending a follow-up
tranche. All 2159 existing tests pass; dispatch + test call sites
forward `<false>` to preserve LE-only behaviour until BE callers land.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mat row kernels

Extends the BE-aware kernel rollout (scalar + NEON in 5b989ba) to the
remaining four SIMD backends — SSE4.1, AVX2, AVX-512, and wasm-simd128
— for high-bit YUV planar and P-format kernels.

Files updated (16 backend files × 4 kernel families):
- src/row/arch/x86_sse41/{yuv_planar_high_bit, yuv_planar_16bit,
  subsampled_high_bit_pn_4_2_0, subsampled_high_bit_pn_4_4_4}.rs
- src/row/arch/x86_avx2/{yuv_planar_high_bit, yuv_planar_16bit,
  subsampled_high_bit_pn_4_2_0, subsampled_high_bit_pn_4_4_4}.rs
- src/row/arch/x86_avx512/{yuv_planar_high_bit, yuv_planar_16bit,
  subsampled_high_bit_pn_4_2_0, subsampled_high_bit_pn_4_4_4}.rs
- src/row/arch/wasm_simd128/{yuv_planar_high_bit, yuv_planar_16bit,
  subsampled_high_bit_pn_4_2_0, subsampled_high_bit_pn_4_4_4}.rs

Each pub(crate) row kernel now takes <const BE: bool>; SIMD u16 loads
go through endian::load_endian_u16x{8,16,32}::<BE> per backend. Tail
fallbacks forward BE to the corresponding scalar kernels.

Lane strategy decisions per backend:

- All planar 4:2:0 / 4:4:4 / 16-bit kernels: native endian-aware
  `load_endian_u16x{N}::<BE>` for every Y/U/V load.

- Pn/P016/P416 semi-planar interleaved-UV kernels: deinterleave first
  via the existing `deinterleave_uv_u16{,_avx2,_avx512,_wasm}` helper,
  then byte-swap each U/V vector with a local `byteswap_u16x{8,16,32}
  ::<BE>` helper (per-arch). Compiles away when BE = false. Cost is
  one extra pshufb per deinterleaved vector on BE — strictly minor
  versus the deinterleave itself, and keeps the LE fast path identical.

- AVX2/AVX-512 P016 u16-output paths use a half-width 256-bit / 128-bit
  inline `shuffle_epi8` for the `_mm256_loadu_si256` / `_mm_loadu_si128`
  half-loads (no public 256-bit / 128-bit endian helper in the AVX2 /
  AVX-512 endian module). Same pattern for AVX2/AVX-512/wasm 4:2:0
  half-load chroma — fold per-u16-lane swap into the deinterleave mask
  where possible (P016 u16 paths) or use a dedicated post-load swizzle
  (4:2:0 u16 outputs).

- wasm yuv_420p16 u16-output: half-width `v128_load64_zero` uses an
  inline `u8x16_swizzle` for the BE byte-swap (high 8 bytes are zero so
  same shuffle leaves them unchanged).

All 9 NEON BE parity tests pass on aarch64 (+9 vs 5b989ba=2159, total
2168). Five new tests/be_parity.rs files added across all 5 backends —
36 BE parity tests total exercising yuv_420p_n, yuv_444p_n, yuv_*p16,
P010, P410, P016, P416 across u8 + u16 outputs. Each test takes a
randomized LE input, byte-swaps every u16, and asserts kernel<true>
on swapped == kernel<false> on original.

x86/wasm test sites updated to forward `, false` through generic chain
(matches NEON's existing pattern from 5b989ba).

Verified:
- cargo test --target aarch64-apple-darwin: 2168 passed, 0 failed
- cargo build --target x86_64-apple-darwin --tests: clean
- RUSTFLAGS="-C target-feature=+simd128" cargo build
  --target wasm32-unknown-unknown --tests: clean
- cargo build --no-default-features: clean
- cargo fmt --check: clean
- cargo clippy --all-targets -- -D warnings: clean (aarch64 host)

Sinker call sites remain hardcoded `<false>` per task spec (out of
scope for this tranche).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
YUV-HB scalar reads route through a centralized helper
`load_u16<const BE: bool>` in `src/row/scalar/mod.rs`.  The original
implementation used the naive `if BE { v.swap_bytes() } else { v }`
pattern which is wrong on big-endian hosts (s390x): it
unconditionally swaps when BE=true regardless of host endianness,
diverging from the SIMD `load_endian_u16x*::<BE>` helpers which are
target-endian aware (a swap is needed only when source byte order
differs from host CPU's native byte order).

Replaced with `if BE { u16::from_be(v) } else { u16::from_le(v) }`.
`u16::from_be`/`from_le` each emit a `bswap` only when the source
byte order differs from the host — exactly matching the SIMD helper
semantics.  All 9/10/12/14/16-bit YUV planar + P010/012/016/410/412/
416 kernels go through this single helper, so this fix corrects
every YUV-HB BE scalar path crate-wide in one commit.

Verified:
  - cargo test --target aarch64-apple-darwin --lib: 2168 passed
  - cargo build --target x86_64-apple-darwin --tests: 0 warnings
  - cargo build --target wasm32-unknown-unknown --tests
    (RUSTFLAGS=-C target-feature=+simd128): clean
  - cargo build --no-default-features: clean
  - cargo fmt --check: clean
  - cargo clippy --all-targets --all-features -D warnings: clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant