feat(be-tier9): BE support for Rgbf32 / Rgbf16 row kernels#83
Open
Conversation
uqio added a commit that referenced this pull request on May 7, 2026
`widen_f16x4_sse<BE=true>` was calling `load_endian_u16x8` (which reads 16 bytes via `_mm_loadu_si128`), but the kernel only guarantees 8 readable bytes per call (4 × f16). The third widen call per loop iteration reads [lane*2+16, lane*2+32) while the buffer ends at lane*2+24 when `lane+12 == total_lanes` — an 8-byte tail overread that ASan caught on PR #83's CI sanitizer job.

Fix: add `load_endian_u16x4` (8-byte load via `_mm_loadl_epi64` plus a low-half byte swap; upper half zeroed). This is correct because `_mm_cvtph_ps` only reads the low 64 bits (4 × f16) of its `__m128i` operand, so the zeroed upper half is harmless. AVX2 (`widen_f16x8_avx`) and AVX-512 (`widen_f16x16_avx512`) need a full 16- or 32-byte u16 region per call, so they keep using `load_endian_u16x8` / `load_endian_u16x16`.

Verified locally with `RUSTFLAGS=-Zsanitizer=address cargo +nightly test --target x86_64-apple-darwin` (2862 tests pass).
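As a rough model of the fix (hypothetical name; the real helper is a SIMD load built from `_mm_loadl_epi64` and a low-half byte swap), the 8-byte endian-aware load behaves like this scalar sketch:

```rust
// Scalar model (illustrative, not the real SIMD helper) of an 8-byte
// endian-aware u16x4 load: it reads exactly 8 bytes, decodes each u16
// from the declared byte order, and leaves the upper four lanes of the
// conceptual 128-bit register zeroed. The zeroed upper half is harmless
// because _mm_cvtph_ps only consumes the low 64 bits of its operand.
fn load_endian_u16x4_model<const BE: bool>(src: &[u8; 8]) -> [u16; 8] {
    let mut out = [0u16; 8];
    for i in 0..4 {
        // Read in host order, then reinterpret from the declared order:
        // a no-op when they match, a byte swap when they differ.
        let raw = u16::from_ne_bytes([src[2 * i], src[2 * i + 1]]);
        out[i] = if BE { u16::from_be(raw) } else { u16::from_le(raw) };
    }
    out
}
```

Crucially, the model touches only `src[0..8]`, which is why the tail overread disappears: the kernel's 8-byte-per-call contract is now matched by the load width.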
Add `<const BE: bool>` to all Rgbf32 and Rgbf16 row kernels across the scalar, NEON, SSE4.1, AVX2, AVX-512, and wasm-simd128 backends, plus both dispatchers. BE parity tests added to every backend; existing callers (sinkers, scalar tests, arch tests) updated to `<false>`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tier 9 packed-float-RGB scalar BE conversion used unconditional
`x.swap_bytes()`, which always swaps regardless of host endianness. On
big-endian hosts (powerpc64, s390x) the source bytes are already in
host-native order, so an extra swap corrupts every BE row. The SIMD
`load_endian_*::<BE>` helpers shipped with feat/be-infra are already
target-endian aware (no-op on a matching host), so the scalar and
per-arch tail paths produced wrong output relative to the SIMD body
on a hypothetical s390x runner.
Replaced every `bits.swap_bytes()` / `to_bits().swap_bytes()` site in
the source files with `u32::from_be` / `u32::from_le` (for f32) or
`u16::from_be` / `u16::from_le` (for f16):
- `if BE { x.swap_bytes() } else { x }`
→ `if BE { u32::from_be(x) } else { u32::from_le(x) }`
- `f32::from_bits(raw.to_bits().swap_bytes())` (BE-only)
→ `f32::from_bits(if BE { u32::from_be(raw.to_bits()) }
else { u32::from_le(raw.to_bits()) })`
`from_be` / `from_le` is a no-op when the encoded byte order matches
the host, a byte-swap when they differ — exactly mirroring the SIMD
helper semantics so LE and BE hosts now produce bit-identical output.
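In scalar form, the decode pattern the commit settles on looks like this sketch (hypothetical function name, shown for the f32 case):

```rust
// Illustrative sketch of the from_be / from_le decode for f32 lanes.
// from_be and from_le are no-ops when the encoded byte order matches
// the host and a byte swap otherwise, so the returned value is the
// same on LE and BE hosts; an unconditional swap_bytes() is not.
fn load_f32_model<const BE: bool>(bytes: [u8; 4]) -> f32 {
    let raw = u32::from_ne_bytes(bytes);
    f32::from_bits(if BE { u32::from_be(raw) } else { u32::from_le(raw) })
}
```

The f16 sites follow the same shape with `u16::from_be` / `u16::from_le`.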
Special note for the f32 / f16 pass-through kernels: previously the
`else` branch fell back to `copy_from_slice`, which is a byte-level
copy. On a BE host that copies LE-encoded bytes into f32 / f16 lanes
verbatim, leaving the destination in non-host-native order — the
docstring claims "output is always host-native". The fix routes both
branches through `from_bits(from_be/from_le(to_bits()))`, which is a
no-op on LE host (correct, byte order matches) and a swap on BE host
(correct, since the data is LE-encoded).
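A minimal sketch of the fixed pass-through (hypothetical signature; the real kernels operate on interleaved RGB rows):

```rust
// Both branches route through from_bits(from_be/from_le(to_bits())),
// so the destination is always host-native: a no-op when the encoded
// byte order matches the host, a swap when it differs. The removed
// copy_from_slice fallback skipped this decode and left the bytes
// verbatim, which is wrong whenever encoding and host order differ.
fn passthrough_f32_row_model<const BE: bool>(src: &[f32], dst: &mut [f32]) {
    for (d, s) in dst.iter_mut().zip(src) {
        let raw = s.to_bits();
        *d = f32::from_bits(if BE { u32::from_be(raw) } else { u32::from_le(raw) });
    }
}
```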
Source-file call sites fixed:
- u32 (f32 → bits, target-endian decoded): 7 — scalar `load_f32`,
scalar `rgbf32_to_rgb_f32_row`, neon / x86_sse41 / x86_avx2 /
x86_avx512 / wasm_simd128 `rgbf32_to_rgb_f32_row` BE tails.
- u16 (f16 → bits, target-endian decoded): 12 — scalar `load_f16`,
scalar `rgbf16_to_rgb_f32_row`, scalar `rgbf16_to_rgb_f16_row`,
neon `widen_f16_tail`, x86_sse41 `load_f16_scalar`, x86_avx2 /
x86_avx512 `rgbf16_to_rgb_f32_row` f16 widen tails, wasm_simd128
five f16 widen lanes (`rgbf16_to_rgb_row`, `rgbf16_to_rgba_row`,
`rgbf16_to_rgb_u16_row`, `rgbf16_to_rgba_u16_row`,
`rgbf16_to_rgb_f32_row`).
- f32 special case: covered by the u32 sites (scalar
`rgbf32_to_rgb_f32_row` and the per-arch BE tails go through
`f32::from_bits(u32::from_be/le(to_bits()))`).
- f16 special case: covered by the u16 sites (scalar
`rgbf16_to_rgb_f16_row` and `rgbf16_to_rgb_f32_row` go through
`half::f16::from_bits(u16::from_be/le(to_bits()))`).
Test helpers (`be_rgbf32` / `be_rgbf16` in `tests/packed_rgb_float.rs`
across all arch backends) intentionally still use `swap_bytes()`
because they synthesize a BE-encoded buffer from an LE host input —
the unconditional swap is correct there and is intentionally left
unchanged. The neon `widen_f16_tail` helper additionally became
`<const BE: bool>` (was previously calling `to_f32()` directly on
host-native bits, producing garbage when fed BE-encoded f16 — the
test `neon_rgbf16_to_rgb_f32_be_matches_le` failed at widths where
the 4-lane SIMD body left a non-zero tail).
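A sketch of the BE-aware tail (hypothetical names; the real helper widens via the `half` crate's `f16::to_f32`, modeled here with a hand-rolled conversion so the example is self-contained):

```rust
// Hand-rolled f16 -> f32 widen (stand-in for half::f16::to_f32).
fn f16_to_f32(bits: u16) -> f32 {
    let sign = ((bits >> 15) as u32) << 31;
    let exp = ((bits >> 10) & 0x1f) as u32;
    let frac = (bits & 0x3ff) as u32;
    let out = match exp {
        0 if frac == 0 => sign, // signed zero
        0 => {
            // Subnormal: renormalize into the f32 exponent range.
            let mut e: u32 = 127 - 15 + 1;
            let mut f = frac;
            while (f & 0x400) == 0 {
                f <<= 1;
                e -= 1;
            }
            sign | (e << 23) | ((f & 0x3ff) << 13)
        }
        31 => sign | 0x7f80_0000 | (frac << 13), // inf / NaN
        _ => sign | ((exp + 127 - 15) << 23) | (frac << 13),
    };
    f32::from_bits(out)
}

// Model of the BE-aware tail: decode each raw u16 from the declared
// byte order first, then widen. Feeding BE-encoded bits straight into
// the widen (the old behavior) produces garbage lanes.
fn widen_f16_tail_model<const BE: bool>(src: &[u16], dst: &mut [f32]) {
    for (d, &raw) in dst.iter_mut().zip(src) {
        let bits = if BE { u16::from_be(raw) } else { u16::from_le(raw) };
        *d = f16_to_f32(bits);
    }
}
```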
Verified: 2170 lib tests pass on `aarch64-apple-darwin`;
`cargo build --target x86_64-apple-darwin --tests` clean;
`RUSTFLAGS="-C target-feature=+simd128" cargo build --target
wasm32-unknown-unknown --tests` clean (warnings pre-existing);
`cargo build --no-default-features` clean; `cargo fmt --check` clean;
`cargo clippy --all-targets --all-features -- -D warnings` clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Phase 2 — Tier 9 BE rollout. Stacked on #81 (BE infra). Adds `<const BE: bool>` to all Rgbf32 / Rgbf16 packed-RGB float row kernels across all 6 backends + dispatcher.
Implementation:
- scalar: `to_bits().swap_bytes()` for both f32 and f16 elements; the LE f32 fast path uses `copy_from_slice`
- neon: `load_f32x4::<BE>` via `load_endian_u32x4`; `widen_f16x4::<BE>` via `load_endian_u16x8`; BE-only deinterleave path for `vld3q_f32`-inaccessible layouts
- x86 sse41 / avx2 / avx512: `_mm*_castsi*_ps(load_endian_u32xN::<BE>(...))`; F16C widening via `_mm{,256,512}_cvtph_ps` with endian-aware u16 loads
- wasm_simd128: `load_f32x4::<BE>` via `load_endian_u32x4`; scalar f16 byte-swap before widening (no native f16 SIMD on wasm)

Test results: 2862 tests pass total (33 new BE parity tests).
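The const-generic pattern described above can be sketched as follows (hypothetical names; real kernels take interleaved RGB rows and have per-backend SIMD bodies):

```rust
// One kernel compiled twice: the endianness branch is resolved at
// compile time in each monomorphization, so the hot loop has no
// per-lane runtime check.
fn rgbf32_row_model<const BE: bool>(src: &[u8], dst: &mut [f32]) {
    for (d, px) in dst.iter_mut().zip(src.chunks_exact(4)) {
        let raw = u32::from_ne_bytes([px[0], px[1], px[2], px[3]]);
        *d = f32::from_bits(if BE { u32::from_be(raw) } else { u32::from_le(raw) });
    }
}

// Dispatcher side: a runtime flag picks the instantiation once per row.
fn dispatch_model(src: &[u8], dst: &mut [f32], big_endian: bool) {
    if big_endian {
        rgbf32_row_model::<true>(src, dst);
    } else {
        rgbf32_row_model::<false>(src, dst);
    }
}
```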
`cargo clippy` and `cargo fmt` clean.
Stacking
Base: `feat/be-infra` (#81). Will rebase onto `main` once #81 merges.
Test plan
- `cargo test --target aarch64-apple-darwin`
- `cargo build --target x86_64-apple-darwin --tests`
- `cargo build --target wasm32-unknown-unknown --tests`
🤖 Generated with Claude Code