Skip to content

feat(be-tier10-float): BE support for Gbrpf32/Gbrapf32/Gbrpf16/Gbrapf16 kernels#84

Open
uqio wants to merge 3 commits intomainfrom
feat/be-tier10-float
Open

feat(be-tier10-float): BE support for Gbrpf32/Gbrapf32/Gbrpf16/Gbrapf16 kernels#84
uqio wants to merge 3 commits intomainfrom
feat/be-tier10-float

Conversation

@uqio
Copy link
Copy Markdown
Collaborator

@uqio uqio commented May 7, 2026

Summary

Phase 2 — Tier 10 float BE rollout. Stacked on #81 (BE infra). Adds <const BE: bool> to all Gbrpf32 / Gbrapf32 / Gbrpf16 / Gbrapf16 planar GBR float row kernels across all 6 backends + dispatcher.

Implementation:

  • Scalar: new load_f32::<BE> and load_f16::<BE> helpers; all 18 kernels (15 f32 + 3 lossless f16) parameterized
  • NEON: load_f32x4::<BE> + load_endian_u16x4::<BE>; 31 pub fns parameterized
  • SSE4.1 / AVX2 / AVX-512: _mm*_castsi*_ps(load_endian_u32xN::<BE>(...)); f16 loads via load_endian_u16xN::<BE>
  • wasm-simd128: 22 pub fns parameterized via load_endian_u32x4::<BE>

Endian helper additions (extending #81 infra surface):

  • NEON: load_le_u16x4, load_be_u16x4, load_endian_u16x4 (4-lane via vld1_u16)
  • SSE4.1: load_le_u16x4, load_be_u16x4, load_endian_u16x4 (4-lane via _mm_loadl_epi64)
  • AVX-512: load_le_u16x16, load_be_u16x16, load_endian_u16x16 (16-lane __m256i for _mm512_cvtph_ps)

These were added here rather than in #81 because they're consumed only by the float planar GBR f16 kernels.

Test results: 2183 tests pass total (6 new BE parity tests). cargo build and cargo test clean.

Stacking

Base: feat/be-infra (#81). Will rebase onto main once #81 merges.

Test plan

  • cargo test --target aarch64-apple-darwin
  • cargo build --target x86_64-apple-darwin --tests
  • cargo build --target wasm32-unknown-unknown --tests
  • s390x QEMU (Phase 3)

🤖 Generated with Claude Code

Base automatically changed from feat/be-infra to main May 7, 2026 12:37
@uqio uqio force-pushed the feat/be-tier10-float branch from 8b7ba10 to 5b1717c Compare May 7, 2026 12:52
uqio and others added 3 commits May 8, 2026 00:52
…16 row kernels

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply post-rebase fixups to make the tier 10 float work pass CI on all
targets (x86_64, wasm32, no-default-features) — original branch was
forked off pre-be-infra and only verified on aarch64.

Compile fixes:
- Add load_endian_u16x8 helper to x86_avx2/endian.rs (128-bit lane load
  for use with _mm256_cvtph_ps in 8-pixel f16 widening); planar_gbr_f16
  AVX2 paths needed it.
- Import endian module in x86_sse41/planar_gbr_float.rs (was previously
  resolved via a missing path; SSE4.1 backend's f16 widening calls
  endian::load_endian_u16x4 / load_endian_u32x4 helpers).
- Add ::<BE> turbofish to recursive gbrpf16_to_rgb_row_f16c calls in
  AVX-512 / AVX2 / SSE4.1 luma/HSV staged paths (3 sites each).
- Add ::<BE> turbofish to wasm_simd128 inner gbrpf32_to_rgba_row /
  gbrpf32_to_rgb_u16_row / gbrpf32_to_rgba_u16_row recursive calls in
  the f16-widen + f32-SIMD path (3 sites).

Test-only fixes:
- Add ::<false> turbofish to f16-row scalar/SIMD calls in wasm_simd128
  tests/planar_gbr_float.rs (now that the kernels are <const BE: bool>).
- Add ::<false> turbofish to overflow-panic dispatch tests in
  dispatch/planar_gbr_float.rs and sinker/mixed/tests/planar_gbr_float.rs.

Pre-existing wasm-only unused-imports cleanup (caught now under -Dwarnings
on wasm32):
- src/row/arch/wasm_simd128/tests/high_bit_4_2_0.rs: drop unused
  high_bit_plane_wasm, interleave_uv_wasm imports.
- src/row/arch/wasm_simd128/tests/planar_8bit_and_nv.rs: drop full
  helper-list import (none used in this file).
- src/row/arch/wasm_simd128/tests/yuva.rs: drop unused
  p_n_packed_plane, p010_uv_interleave imports.

cargo fmt rewraps long lines in the existing tier10-float kernels (line
length over 100 — pre-existing in the original commit, only surfaced
after rebase exposed cross-arch build).

Verified:
- cargo test --target aarch64-apple-darwin: 2183 passed, 0 failed
- RUSTFLAGS=-Dwarnings cargo build --target x86_64-apple-darwin --tests: ok
- RUSTFLAGS="-C target-feature=+simd128 -Dwarnings" cargo build
  --target wasm32-unknown-unknown --tests: ok
- RUSTFLAGS="-C target-feature=+simd128 -Dwarnings" cargo build
  --target wasm32-wasip1 --tests: ok
- cargo build --no-default-features: ok
- cargo fmt --check: clean
- cargo clippy --all-targets --all-features -- -D warnings: ok

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The scalar `load_f32::<BE>` / `load_f16::<BE>` helpers used an unconditional
`swap_bytes()` regardless of host endianness. The corresponding SIMD
`load_endian_u32x4::<BE>` / `load_endian_u16x8::<BE>` helpers (added in PR #81
be-infra) are target-endian aware via `cfg(target_endian = ...)`, so SIMD and
scalar disagreed on big-endian hosts. Tail loops dispatch to the scalar
fallback, so any width whose tail is non-zero on s390x corrupted the row.

Why s390x corrupts with the old code: when reading a `&[f32]` reinterpreted
from raw bytes, the host CPU reads the four bytes in host-native order. On
LE hosts that matches LE-on-disk; on BE hosts it matches BE-on-disk. An
unconditional swap therefore:
  - LE host + BE data: correct (swap turns BE bytes into native LE) — the
    case the original code targeted.
  - BE host + LE data: correct (swap turns LE bytes into native BE).
  - BE host + BE data: WRONG (host-native is already BE, swap inverts it).
  - LE host + LE data: handled by `BE = false` no-op — fine.

The fix routes both branches through `u32::from_be` / `u32::from_le` (and
`u16::from_be` / `u16::from_le` for f16):

  BE  branch: `f32::from_bits(u32::from_be(raw.to_bits()))`
  LE  branch: `f32::from_bits(u32::from_le(raw.to_bits()))`

`u32::from_le` is a no-op on LE hosts and a byte-swap on BE hosts; symmetric
for `from_be`. This makes both `<BE>` monomorphizations correct on every
target endianness and matches the contract the SIMD endian helpers already
implement.

f32 / f16 paths use `from_bits(u{32,16}::from_be(raw.to_bits()))` so the
result is host-native f32 / `half::f16` regardless of the source encoding.

The test helpers (`be_encode` in `planar_gbr_float.rs`, `be_encode_f16` in
`planar_gbr_f16.rs`) intentionally use unconditional `swap_bytes` to
synthesise BE-on-disk fixtures from LE input on an LE host. They are not
load helpers and remain unchanged.

No SIMD code paths needed changes — the per-arch `load_endian_*` helpers
already use `cfg(target_endian = ...)`. Tail loops still call the scalar
helpers, which are now correct.

Verified:
- `cargo test --target aarch64-apple-darwin --lib`: 2176 passed
- `cargo build --target x86_64-apple-darwin --tests`: 0 warnings
- `RUSTFLAGS="-C target-feature=+simd128" cargo build --target wasm32-unknown-unknown --tests`: clean
- `cargo build --no-default-features`: clean
- `cargo fmt --check`: clean
- `cargo clippy --all-targets --all-features -- -D warnings`: clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@uqio uqio force-pushed the feat/be-tier10-float branch from 5b1717c to 1897dad Compare May 7, 2026 13:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant