feat(be-tier9): BE support for Rgbf32 / Rgbf16 row kernels#83

Open
uqio wants to merge 3 commits into main from feat/be-tier9

Conversation


@uqio commented on May 7, 2026

Summary

Phase 2 — Tier 9 BE rollout. Stacked on #81 (BE infra). Adds `<const BE: bool>` to all Rgbf32 / Rgbf16 packed-RGB float row kernels across all 6 backends plus the dispatcher.

Implementation:

  • Scalar: `to_bits().swap_bytes()` for both f32 and f16 elements; the LE f32 fast path uses `copy_from_slice`
  • NEON: `load_f32x4::<BE>` via `load_endian_u32x4`; `widen_f16x4::<BE>` via `load_endian_u16x8`; a separate BE deinterleave path for layouts where `vld3q_f32` cannot be used directly
  • SSE4.1 / AVX2 / AVX-512: `_mm*_castsi*_ps(load_endian_u32xN::<BE>(...))`; F16C widening via `_mm{,256,512}_cvtph_ps` with endian-aware u16 loads
  • wasm-simd128: `load_f32x4::<BE>` via `load_endian_u32x4`; scalar f16 byte-swap before widening (wasm has no native f16 SIMD)
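As a reference point for the per-backend list above, here is a minimal scalar sketch of the `<const BE: bool>` kernel shape. The kernel name matches the call sites discussed in the commits below, but the signature and body are illustrative assumptions, not the crate's actual code:

```rust
// Illustrative scalar sketch: decode packed f32 bytes into host-native
// f32 lanes, honoring the BE/LE encoding selected at compile time.
fn rgbf32_to_rgb_f32_row<const BE: bool>(src: &[u8], dst: &mut [f32]) {
    for (chunk, out) in src.chunks_exact(4).zip(dst.iter_mut()) {
        let bytes: [u8; 4] = chunk.try_into().unwrap();
        // from_be_bytes / from_le_bytes decode correctly on any host
        *out = if BE {
            f32::from_be_bytes(bytes)
        } else {
            f32::from_le_bytes(bytes)
        };
    }
}
```

On an LE host the `BE = false` branch is equivalent to a plain byte copy, matching the `copy_from_slice` fast path described above.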

Test results: 2862 tests pass in total (33 new BE parity tests). `cargo clippy` and `cargo fmt` clean.

Stacking

Base: feat/be-infra (#81). Will rebase onto main once #81 merges.

Test plan

  • cargo test --target aarch64-apple-darwin
  • cargo build --target x86_64-apple-darwin --tests
  • cargo build --target wasm32-unknown-unknown --tests
  • s390x QEMU (Phase 3)

🤖 Generated with Claude Code

@uqio force-pushed the feat/be-tier9 branch from c99a0a6 to d604016 on May 7, 2026 at 12:26
uqio added a commit that referenced this pull request on May 7, 2026
`widen_f16x4_sse<BE=true>` was calling `load_endian_u16x8` (which reads 16
bytes via `_mm_loadu_si128`) but the kernel only guarantees 8 readable
bytes per call (4 × f16). The third widen call per loop iteration reads
[lane*2+16, lane*2+32) while the buffer ends at lane*2+24 when
`lane+12 == total_lanes` — an 8-byte tail-overread that ASan caught on
PR #83's CI sanitizer job.

Add `load_endian_u16x4` (8-byte load via `_mm_loadl_epi64` + low-half
byte-swap; upper half zeroed). The fix is correct because `_mm_cvtph_ps`
only reads the low 64 bits (4 × f16) of its `__m128i` operand, so the
zeroed upper half is harmless. AVX2 (`widen_f16x8_avx`) and AVX-512
(`widen_f16x16_avx512`) need a full 16/32-byte u16 region per call so
they keep using `load_endian_u16x8` / `load_endian_u16x16`.

Verified locally with `RUSTFLAGS=-Zsanitizer=address cargo +nightly test
--target x86_64-apple-darwin` (2862 tests pass).
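A sketch of the fix described in this commit message; this is an illustrative reconstruction (assuming SSSE3 for the byte shuffle), not the crate's actual code:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Illustrative sketch of `load_endian_u16x4`: read exactly 8 bytes
/// (four u16 lanes) and byte-swap the low half when BE. The upper 64
/// bits stay zeroed, which is harmless because `_mm_cvtph_ps` only
/// reads the low 64 bits of its operand.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn load_endian_u16x4<const BE: bool>(src: *const u8) -> __m128i {
    // _mm_loadl_epi64 reads exactly 8 bytes; no tail-overread
    let v = _mm_loadl_epi64(src as *const __m128i);
    if BE {
        // swap the two bytes of each low u16 lane; -1 zeroes a byte
        let idx = _mm_setr_epi8(1, 0, 3, 2, 5, 4, 7, 6,
                                -1, -1, -1, -1, -1, -1, -1, -1);
        _mm_shuffle_epi8(v, idx)
    } else {
        v
    }
}
```

The AVX2/AVX-512 variants keep the wider loads because, as the message notes, their widen calls legitimately consume a full 16/32-byte u16 region.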
Base automatically changed from feat/be-infra to main on May 7, 2026 at 12:37
uqio and others added 2 commits on May 8, 2026 at 00:50
Add `<const BE: bool>` to all Rgbf32 and Rgbf16 row kernels across
scalar, NEON, SSE4.1, AVX2, AVX-512, and wasm-simd128 backends, plus
both dispatchers. BE parity tests added to every backend; existing
callers (sinkers, scalar tests, arch tests) updated to `<false>`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@uqio force-pushed the feat/be-tier9 branch from 471a794 to 5967967 on May 7, 2026 at 12:52
Tier 9 packed-float-RGB scalar BE conversion used unconditional
`x.swap_bytes()`, which always swaps regardless of host endianness. On
big-endian hosts (powerpc64, s390x) the source bytes are already in
host-native order, so an extra swap corrupts every BE row. The SIMD
`load_endian_*::<BE>` helpers shipped with feat/be-infra are already
target-endian aware (a no-op on a matching host), so the scalar and
per-arch tail paths would have produced wrong output relative to the
SIMD body on a hypothetical s390x runner.

Replaced every `bits.swap_bytes()` / `to_bits().swap_bytes()` site in
the source files with `u32::from_be` / `u32::from_le` (for f32) or
`u16::from_be` / `u16::from_le` (for f16):

- `if BE { x.swap_bytes() } else { x }`
  → `if BE { u32::from_be(x) } else { u32::from_le(x) }`
- `f32::from_bits(raw.to_bits().swap_bytes())` (BE-only)
  → `f32::from_bits(if BE { u32::from_be(raw.to_bits()) }
       else { u32::from_le(raw.to_bits()) })`

`from_be` / `from_le` is a no-op when the encoded byte order matches
the host, a byte-swap when they differ — exactly mirroring the SIMD
helper semantics so LE and BE hosts now produce bit-identical output.
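The replacement pattern can be sketched as a standalone helper; the function itself is illustrative, since the commit applies the expression inline at each call site:

```rust
// from_be / from_le are no-ops when the encoded byte order matches the
// host and byte swaps when it differs, mirroring the SIMD load_endian
// helper semantics.
fn decode_f32<const BE: bool>(raw: f32) -> f32 {
    let bits = raw.to_bits();
    f32::from_bits(if BE { u32::from_be(bits) } else { u32::from_le(bits) })
}
```

`u32::from_be(bits)` compiles to `bits.swap_bytes()` on an LE host and to nothing on a BE host, which is exactly the property the commit message relies on.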

Special note for the f32 / f16 pass-through kernels: previously the
`else` branch fell back to `copy_from_slice`, which is a byte-level
copy. On a BE host that copies LE-encoded bytes into f32 / f16 lanes
verbatim, leaving the destination in non-host-native order — the
docstring claims "output is always host-native". The fix routes both
branches through `from_bits(from_be/from_le(to_bits()))`, which is a
no-op on LE host (correct, byte order matches) and a swap on BE host
(correct, since the data is LE-encoded).

Source-file call sites fixed:

- u32 (f32 → bits, target-endian decoded): 7 — scalar `load_f32`,
  scalar `rgbf32_to_rgb_f32_row`, neon / x86_sse41 / x86_avx2 /
  x86_avx512 / wasm_simd128 `rgbf32_to_rgb_f32_row` BE tails.
- u16 (f16 → bits, target-endian decoded): 12 — scalar `load_f16`,
  scalar `rgbf16_to_rgb_f32_row`, scalar `rgbf16_to_rgb_f16_row`,
  neon `widen_f16_tail`, x86_sse41 `load_f16_scalar`, x86_avx2 /
  x86_avx512 `rgbf16_to_rgb_f32_row` f16 widen tails, wasm_simd128
  five f16 widen lanes (`rgbf16_to_rgb_row`, `rgbf16_to_rgba_row`,
  `rgbf16_to_rgb_u16_row`, `rgbf16_to_rgba_u16_row`,
  `rgbf16_to_rgb_f32_row`).
- f32 special case: covered by the u32 sites (scalar
  `rgbf32_to_rgb_f32_row` and the per-arch BE tails go through
  `f32::from_bits(u32::from_be/le(to_bits()))`).
- f16 special case: covered by the u16 sites (scalar
  `rgbf16_to_rgb_f16_row` and `rgbf16_to_rgb_f32_row` go through
  `half::f16::from_bits(u16::from_be/le(to_bits()))`).

Test helpers (`be_rgbf32` / `be_rgbf16` in `tests/packed_rgb_float.rs`
across all arch backends) intentionally still use `swap_bytes()`:
they synthesize a BE-encoded buffer from an LE host input, so the
unconditional swap is correct there and is left unchanged. The neon
`widen_f16_tail` helper additionally became `<const BE: bool>` (it was
previously calling `to_f32()` directly, treating BE-encoded f16 bits
as host-native and producing garbage; the test
`neon_rgbf16_to_rgb_f32_be_matches_le` failed at widths where the
4-lane SIMD body left a non-zero tail).
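For illustration, a helper in the spirit of `be_rgbf32` (the signature is assumed; only the unconditional-swap idea comes from the text above):

```rust
// Illustrative test helper: synthesize a BE-encoded byte buffer from
// host-native f32 input. The unconditional swap_bytes() is correct only
// on an LE host, where host-native bytes are LE and swapping yields BE;
// that is the assumption the real test helpers run under.
fn be_rgbf32(src: &[f32]) -> Vec<u8> {
    src.iter()
        .flat_map(|x| x.to_bits().swap_bytes().to_ne_bytes())
        .collect()
}
```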

Verified: 2170 lib tests pass on `aarch64-apple-darwin`;
`cargo build --target x86_64-apple-darwin --tests` clean;
`RUSTFLAGS="-C target-feature=+simd128" cargo build --target
wasm32-unknown-unknown --tests` clean (warnings pre-existing);
`cargo build --no-default-features` clean; `cargo fmt --check` clean;
`cargo clippy --all-targets --all-features -- -D warnings` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@uqio force-pushed the feat/be-tier9 branch from e83608b to bcc366b on May 7, 2026 at 13:10