feat(be-yuv-hb): BE support for YUV planar high-bit + P-format row kernels by uqio · Pull Request #89 · Findit-AI/colconv

uqio · 2026-05-07T14:13:17Z

Summary

Phase 2 — YUV-HB BE rollout. Adds <const BE: bool> to all 9/10/12/14/16-bit YUV planar (Yuv420p_n, Yuv422p_n, Yuv444p_n, plus 16-bit i64 family) and all P-format (P010/012/016 4:2:0, P210/212/216 4:2:2, P410/412/416 4:4:4) row kernels across scalar + 5 SIMD backends.

Implementation:

Scalar reads route through centralized load_u16<const BE: bool> helper in src/row/scalar/mod.rs — target-endian aware (u16::from_be / u16::from_le) matching SIMD load_endian_u16x*::<BE> helpers from feat(be-infra): endian-aware SIMD loaders across 5 backends #81. No naive if BE { swap_bytes() } pattern.
All 5 SIMD backends use load_endian_u16xN::<BE> from feat(be-infra): endian-aware SIMD loaders across 5 backends #81 + per-arch byteswap_u16x{8,16,32}::<BE> helpers folded into existing UV deinterleave shuffles where possible
Folded byteswap into deinterleave shuffle for SSE4.1/AVX-512 P016 u16 paths (zero extra cost on LE)
AVX2/AVX-512 4:2:0 u16-output half-loads use inline _mm{,256}_shuffle_epi8 post-load (no public 128/256-bit endian load helper for those widths)
wasm v128_load64_zero half-loads use inline u8x16_swizzle with high-zero mask

Test results:

2168 tests pass on aarch64
36 new BE parity tests across all 5 backends (9 NEON, 9 SSE4.1, 9 AVX2, 9 AVX-512, 9 wasm-simd128)
All cross-target builds clean with -Dwarnings

Public dispatchers stay LE-only by design — p010_to_rgb_row, yuv420p10_to_rgb_row, etc. take their endianness from the format name (P010 = LE-encoded by spec). No caller plumbing BE through. Phase 4 will add type aliases (P010LeFrame / P010BeFrame) at the Frame layer.

Test plan

cargo test --target aarch64-apple-darwin --lib (2168 passed)
cargo build --target x86_64-apple-darwin --tests (0 warnings)
RUSTFLAGS="-C target-feature=+simd128" cargo build --target wasm32-unknown-unknown --tests
cargo build --no-default-features
cargo fmt --check
cargo clippy --all-targets --all-features -- -D warnings
s390x QEMU (Phase 3)

🤖 Generated with Claude Code

…rnels (scalar + NEON) Adds `<const BE: bool>` to all scalar and NEON row kernels for high-bit YUV planar (yuv420p/422p/444p 9–16-bit, yuva 4:2:0/4:2:2/4:4:4) and P-format semi-planar (P010, P012, P016, P410, P412, P416) → RGB/RGBA conversion. BE loads use `load_u16::<BE>` at scalar sites and `load_endian_u16x8::<BE>` / `deinterleave_endian::<BE>` at NEON SIMD sites; x86 and wasm backends wired as `<false>` pending a follow-up tranche. All 2159 existing tests pass; dispatch + test call sites forward `<false>` to preserve LE-only behaviour until BE callers land. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…mat row kernels Extends the BE-aware kernel rollout (scalar + NEON in 5b989ba) to the remaining four SIMD backends — SSE4.1, AVX2, AVX-512, and wasm-simd128 — for high-bit YUV planar and P-format kernels. Files updated (16 backend files × 4 kernel families): - src/row/arch/x86_sse41/{yuv_planar_high_bit, yuv_planar_16bit, subsampled_high_bit_pn_4_2_0, subsampled_high_bit_pn_4_4_4}.rs - src/row/arch/x86_avx2/{yuv_planar_high_bit, yuv_planar_16bit, subsampled_high_bit_pn_4_2_0, subsampled_high_bit_pn_4_4_4}.rs - src/row/arch/x86_avx512/{yuv_planar_high_bit, yuv_planar_16bit, subsampled_high_bit_pn_4_2_0, subsampled_high_bit_pn_4_4_4}.rs - src/row/arch/wasm_simd128/{yuv_planar_high_bit, yuv_planar_16bit, subsampled_high_bit_pn_4_2_0, subsampled_high_bit_pn_4_4_4}.rs Each pub(crate) row kernel now takes <const BE: bool>; SIMD u16 loads go through endian::load_endian_u16x{8,16,32}::<BE> per backend. Tail fallbacks forward BE to the corresponding scalar kernels. Lane strategy decisions per backend: - All planar 4:2:0 / 4:4:4 / 16-bit kernels: native endian-aware `load_endian_u16x{N}::<BE>` for every Y/U/V load. - Pn/P016/P416 semi-planar interleaved-UV kernels: deinterleave first via the existing `deinterleave_uv_u16{,_avx2,_avx512,_wasm}` helper, then byte-swap each U/V vector with a local `byteswap_u16x{8,16,32} ::<BE>` helper (per-arch). Compiles away when BE = false. Cost is one extra pshufb per deinterleaved vector on BE — strictly minor versus the deinterleave itself, and keeps the LE fast path identical. - AVX2/AVX-512 P016 u16-output paths use a half-width 256-bit / 128-bit inline `shuffle_epi8` for the `_mm256_loadu_si256` / `_mm_loadu_si128` half-loads (no public 256-bit / 128-bit endian helper in the AVX2 / AVX-512 endian module). Same pattern for AVX2/AVX-512/wasm 4:2:0 half-load chroma — fold per-u16-lane swap into the deinterleave mask where possible (P016 u16 paths) or use a dedicated post-load swizzle (4:2:0 u16 outputs). - wasm yuv_420p16 u16-output: half-width `v128_load64_zero` uses an inline `u8x16_swizzle` for the BE byte-swap (high 8 bytes are zero so same shuffle leaves them unchanged). All 9 NEON BE parity tests pass on aarch64 (+9 vs 5b989ba=2159, total 2168). Five new tests/be_parity.rs files added across all 5 backends — 36 BE parity tests total exercising yuv_420p_n, yuv_444p_n, yuv_*p16, P010, P410, P016, P416 across u8 + u16 outputs. Each test takes a randomized LE input, byte-swaps every u16, and asserts kernel<true> on swapped == kernel<false> on original. x86/wasm test sites updated to forward `, false` through generic chain (matches NEON's existing pattern from 5b989ba). Verified: - cargo test --target aarch64-apple-darwin: 2168 passed, 0 failed - cargo build --target x86_64-apple-darwin --tests: clean - RUSTFLAGS="-C target-feature=+simd128" cargo build --target wasm32-unknown-unknown --tests: clean - cargo build --no-default-features: clean - cargo fmt --check: clean - cargo clippy --all-targets -- -D warnings: clean (aarch64 host) Sinker call sites remain hardcoded `<false>` per task spec (out of scope for this tranche). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

YUV-HB scalar reads route through a centralized helper `load_u16<const BE: bool>` in `src/row/scalar/mod.rs`. The original implementation used the naive `if BE { v.swap_bytes() } else { v }` pattern which is wrong on big-endian hosts (s390x): it unconditionally swaps when BE=true regardless of host endianness, diverging from the SIMD `load_endian_u16x*::<BE>` helpers which are target-endian aware (a swap is needed only when source byte order differs from host CPU's native byte order). Replaced with `if BE { u16::from_be(v) } else { u16::from_le(v) }`. `u16::from_be`/`from_le` each emit a `bswap` only when the source byte order differs from the host — exactly matching the SIMD helper semantics. All 9/10/12/14/16-bit YUV planar + P010/012/016/410/412/ 416 kernels go through this single helper, so this fix corrects every YUV-HB BE scalar path crate-wide in one commit. Verified: - cargo test --target aarch64-apple-darwin --lib: 2168 passed - cargo build --target x86_64-apple-darwin --tests: 0 warnings - cargo build --target wasm32-unknown-unknown --tests (RUSTFLAGS=-C target-feature=+simd128): clean - cargo build --no-default-features: clean - cargo fmt --check: clean - cargo clippy --all-targets --all-features -D warnings: clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

uqio and others added 3 commits May 8, 2026 02:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(be-yuv-hb): BE support for YUV planar high-bit + P-format row kernels#89

feat(be-yuv-hb): BE support for YUV planar high-bit + P-format row kernels#89
uqio wants to merge 3 commits intomainfrom
feat/be-yuv-hb

uqio commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

uqio commented May 7, 2026

Summary

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant