feat(be-tier9): BE support for Rgbf32 / Rgbf16 row kernels#83

Open
uqio wants to merge 3 commits into main from feat/be-tier9

Conversation


@uqio commented on May 7, 2026

Summary

Phase 2 — Tier 9 BE rollout. Stacked on #81 (BE infra). Adds `<const BE: bool>` to all Rgbf32 / Rgbf16 packed-RGB float row kernels across all 6 backends plus the dispatcher.

Implementation:

  • Scalar: `to_bits().swap_bytes()` for both f32 and f16 elements; the LE f32 fast path uses `copy_from_slice`
  • NEON: `load_f32x4::<BE>` via `load_endian_u32x4`; `widen_f16x4::<BE>` via `load_endian_u16x8`; a separate BE deinterleave path for layouts where `vld3q_f32` cannot be used directly
  • SSE4.1 / AVX2 / AVX-512: `_mm*_castsi*_ps(load_endian_u32xN::<BE>(...))`; F16C widening via `_mm{,256,512}_cvtph_ps` with endian-aware u16 loads
  • wasm-simd128: `load_f32x4::<BE>` via `load_endian_u32x4`; scalar f16 byte-swap before widening (wasm has no native f16 SIMD)
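As a reference point for the per-backend list above, here is a minimal scalar sketch of the `<const BE: bool>` kernel shape. The kernel name matches the call sites discussed in the commits below, but the signature and body are illustrative assumptions, not the crate's actual code:

```rust
// Illustrative scalar sketch: decode packed f32 bytes into host-native
// f32 lanes, honoring the BE/LE encoding selected at compile time.
fn rgbf32_to_rgb_f32_row<const BE: bool>(src: &[u8], dst: &mut [f32]) {
    for (chunk, out) in src.chunks_exact(4).zip(dst.iter_mut()) {
        let bytes: [u8; 4] = chunk.try_into().unwrap();
        // from_be_bytes / from_le_bytes decode correctly on any host
        *out = if BE {
            f32::from_be_bytes(bytes)
        } else {
            f32::from_le_bytes(bytes)
        };
    }
}
```

On an LE host the `BE = false` branch is equivalent to a plain byte copy, matching the `copy_from_slice` fast path described above.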

Test results: 2862 tests pass in total (33 new BE parity tests). `cargo clippy` and `cargo fmt` clean.

Stacking

Base: feat/be-infra (#81). Will rebase onto main once #81 merges.

Test plan

  • cargo test --target aarch64-apple-darwin
  • cargo build --target x86_64-apple-darwin --tests
  • cargo build --target wasm32-unknown-unknown --tests
  • s390x QEMU (Phase 3)

🤖 Generated with Claude Code

@uqio force-pushed the feat/be-tier9 branch from c99a0a6 to d604016 on May 7, 2026 at 12:26
uqio added a commit that referenced this pull request on May 7, 2026
`widen_f16x4_sse<BE=true>` was calling `load_endian_u16x8` (which reads 16
bytes via `_mm_loadu_si128`) but the kernel only guarantees 8 readable
bytes per call (4 × f16). The third widen call per loop iteration reads
[lane*2+16, lane*2+32) while the buffer ends at lane*2+24 when
`lane+12 == total_lanes` — an 8-byte tail-overread that ASan caught on
PR #83's CI sanitizer job.

Add `load_endian_u16x4` (8-byte load via `_mm_loadl_epi64` + low-half
byte-swap; upper half zeroed). The fix is correct because `_mm_cvtph_ps`
only reads the low 64 bits (4 × f16) of its `__m128i` operand, so the
zeroed upper half is harmless. AVX2 (`widen_f16x8_avx`) and AVX-512
(`widen_f16x16_avx512`) need a full 16/32-byte u16 region per call so
they keep using `load_endian_u16x8` / `load_endian_u16x16`.

Verified locally with `RUSTFLAGS=-Zsanitizer=address cargo +nightly test
--target x86_64-apple-darwin` (2862 tests pass).
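A sketch of the fix described in this commit message; this is an illustrative reconstruction (assuming SSSE3 for the byte shuffle), not the crate's actual code:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Illustrative sketch of `load_endian_u16x4`: read exactly 8 bytes
/// (four u16 lanes) and byte-swap the low half when BE. The upper 64
/// bits stay zeroed, which is harmless because `_mm_cvtph_ps` only
/// reads the low 64 bits of its operand.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn load_endian_u16x4<const BE: bool>(src: *const u8) -> __m128i {
    // _mm_loadl_epi64 reads exactly 8 bytes; no tail-overread
    let v = _mm_loadl_epi64(src as *const __m128i);
    if BE {
        // swap the two bytes of each low u16 lane; -1 zeroes a byte
        let idx = _mm_setr_epi8(1, 0, 3, 2, 5, 4, 7, 6,
                                -1, -1, -1, -1, -1, -1, -1, -1);
        _mm_shuffle_epi8(v, idx)
    } else {
        v
    }
}
```

The AVX2/AVX-512 variants keep the wider loads because, as the message notes, their widen calls legitimately consume a full 16/32-byte u16 region.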
Base automatically changed from feat/be-infra to main on May 7, 2026 at 12:37
uqio and others added 2 commits on May 8, 2026 at 00:50
Add `<const BE: bool>` to all Rgbf32 and Rgbf16 row kernels across
scalar, NEON, SSE4.1, AVX2, AVX-512, and wasm-simd128 backends, plus
both dispatchers. BE parity tests added to every backend; existing
callers (sinkers, scalar tests, arch tests) updated to `<false>`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@uqio force-pushed the feat/be-tier9 branch from 471a794 to 5967967 on May 7, 2026 at 12:52
Tier 9 packed-float-RGB scalar BE conversion used unconditional
`x.swap_bytes()`, which always swaps regardless of host endianness. On
big-endian hosts (powerpc64, s390x) the source bytes are already in
host-native order, so an extra swap corrupts every BE row. The SIMD
`load_endian_*::<BE>` helpers shipped with feat/be-infra are already
target-endian aware (a no-op on a matching host), so the scalar and
per-arch tail paths would have produced wrong output relative to the
SIMD body on a hypothetical s390x runner.

Replaced every `bits.swap_bytes()` / `to_bits().swap_bytes()` site in
the source files with `u32::from_be` / `u32::from_le` (for f32) or
`u16::from_be` / `u16::from_le` (for f16):

- `if BE { x.swap_bytes() } else { x }`
  → `if BE { u32::from_be(x) } else { u32::from_le(x) }`
- `f32::from_bits(raw.to_bits().swap_bytes())` (BE-only)
  → `f32::from_bits(if BE { u32::from_be(raw.to_bits()) }
       else { u32::from_le(raw.to_bits()) })`

`from_be` / `from_le` is a no-op when the encoded byte order matches
the host, a byte-swap when they differ — exactly mirroring the SIMD
helper semantics so LE and BE hosts now produce bit-identical output.
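The replacement pattern can be sketched as a standalone helper; the function itself is illustrative, since the commit applies the expression inline at each call site:

```rust
// from_be / from_le are no-ops when the encoded byte order matches the
// host and byte swaps when it differs, mirroring the SIMD load_endian
// helper semantics.
fn decode_f32<const BE: bool>(raw: f32) -> f32 {
    let bits = raw.to_bits();
    f32::from_bits(if BE { u32::from_be(bits) } else { u32::from_le(bits) })
}
```

`u32::from_be(bits)` compiles to `bits.swap_bytes()` on an LE host and to nothing on a BE host, which is exactly the property the commit message relies on.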

Special note for the f32 / f16 pass-through kernels: previously the
`else` branch fell back to `copy_from_slice`, which is a byte-level
copy. On a BE host that copies LE-encoded bytes into f32 / f16 lanes
verbatim, leaving the destination in non-host-native order — the
docstring claims "output is always host-native". The fix routes both
branches through `from_bits(from_be/from_le(to_bits()))`, which is a
no-op on LE host (correct, byte order matches) and a swap on BE host
(correct, since the data is LE-encoded).

Source-file call sites fixed:

- u32 (f32 → bits, target-endian decoded): 7 — scalar `load_f32`,
  scalar `rgbf32_to_rgb_f32_row`, neon / x86_sse41 / x86_avx2 /
  x86_avx512 / wasm_simd128 `rgbf32_to_rgb_f32_row` BE tails.
- u16 (f16 → bits, target-endian decoded): 12 — scalar `load_f16`,
  scalar `rgbf16_to_rgb_f32_row`, scalar `rgbf16_to_rgb_f16_row`,
  neon `widen_f16_tail`, x86_sse41 `load_f16_scalar`, x86_avx2 /
  x86_avx512 `rgbf16_to_rgb_f32_row` f16 widen tails, wasm_simd128
  five f16 widen lanes (`rgbf16_to_rgb_row`, `rgbf16_to_rgba_row`,
  `rgbf16_to_rgb_u16_row`, `rgbf16_to_rgba_u16_row`,
  `rgbf16_to_rgb_f32_row`).
- f32 special case: covered by the u32 sites (scalar
  `rgbf32_to_rgb_f32_row` and the per-arch BE tails go through
  `f32::from_bits(u32::from_be/le(to_bits()))`).
- f16 special case: covered by the u16 sites (scalar
  `rgbf16_to_rgb_f16_row` and `rgbf16_to_rgb_f32_row` go through
  `half::f16::from_bits(u16::from_be/le(to_bits()))`).

Test helpers (`be_rgbf32` / `be_rgbf16` in `tests/packed_rgb_float.rs`
across all arch backends) intentionally still use `swap_bytes()`:
they synthesize a BE-encoded buffer from an LE host input, so the
unconditional swap is correct there and is left unchanged. The neon
`widen_f16_tail` helper additionally became `<const BE: bool>` (it was
previously calling `to_f32()` directly, treating BE-encoded f16 bits
as host-native and producing garbage; the test
`neon_rgbf16_to_rgb_f32_be_matches_le` failed at widths where the
4-lane SIMD body left a non-zero tail).
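For illustration, a helper in the spirit of `be_rgbf32` (the signature is assumed; only the unconditional-swap idea comes from the text above):

```rust
// Illustrative test helper: synthesize a BE-encoded byte buffer from
// host-native f32 input. The unconditional swap_bytes() is correct only
// on an LE host, where host-native bytes are LE and swapping yields BE;
// that is the assumption the real test helpers run under.
fn be_rgbf32(src: &[f32]) -> Vec<u8> {
    src.iter()
        .flat_map(|x| x.to_bits().swap_bytes().to_ne_bytes())
        .collect()
}
```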

Verified: 2170 lib tests pass on `aarch64-apple-darwin`;
`cargo build --target x86_64-apple-darwin --tests` clean;
`RUSTFLAGS="-C target-feature=+simd128" cargo build --target
wasm32-unknown-unknown --tests` clean (warnings pre-existing);
`cargo build --no-default-features` clean; `cargo fmt --check` clean;
`cargo clippy --all-targets --all-features -- -D warnings` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@uqio force-pushed the feat/be-tier9 branch from e83608b to bcc366b on May 7, 2026 at 13:10