feat(be-tier10-float): BE support for Gbrpf32/Gbrapf32/Gbrpf16/Gbrapf16 kernels by uqio · Pull Request #84 · Findit-AI/colconv

uqio · 2026-05-07T11:39:29Z

Summary

Phase 2 — Tier 10 float BE rollout. Stacked on #81 (BE infra). Adds <const BE: bool> to all Gbrpf32 / Gbrapf32 / Gbrpf16 / Gbrapf16 planar GBR float row kernels across all 6 backends + dispatcher.

Implementation:

Scalar: new load_f32::<BE> and load_f16::<BE> helpers; all 18 kernels (15 f32 + 3 lossless f16) parameterized
NEON: load_f32x4::<BE> + load_endian_u16x4::<BE>; 31 pub fns parameterized
SSE4.1 / AVX2 / AVX-512: _mm*_castsi*_ps(load_endian_u32xN::<BE>(...)); f16 loads via load_endian_u16xN::<BE>
wasm-simd128: 22 pub fns parameterized via load_endian_u32x4::<BE>

Endian helper additions (extending #81 infra surface):

NEON: load_le_u16x4, load_be_u16x4, load_endian_u16x4 (4-lane via vld1_u16)
SSE4.1: load_le_u16x4, load_be_u16x4, load_endian_u16x4 (4-lane via _mm_loadl_epi64)
AVX-512: load_le_u16x16, load_be_u16x16, load_endian_u16x16 (16-lane __m256i for _mm512_cvtph_ps)

These were added here rather than in #81 because they're consumed only by the float planar GBR f16 kernels.

Test results: 2183 tests pass total (6 new BE parity tests). cargo build and cargo test clean.

Stacking

Base: feat/be-infra (#81). Will rebase onto main once #81 merges.

Test plan

cargo test --target aarch64-apple-darwin
cargo build --target x86_64-apple-darwin --tests
cargo build --target wasm32-unknown-unknown --tests
s390x QEMU (Phase 3)

🤖 Generated with Claude Code

…16 row kernels Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Apply post-rebase fixups to make the tier 10 float work pass CI on all targets (x86_64, wasm32, no-default-features) — original branch was forked off pre-be-infra and only verified on aarch64. Compile fixes: - Add load_endian_u16x8 helper to x86_avx2/endian.rs (128-bit lane load for use with _mm256_cvtph_ps in 8-pixel f16 widening); planar_gbr_f16 AVX2 paths needed it. - Import endian module in x86_sse41/planar_gbr_float.rs (was previously resolved via a missing path; SSE4.1 backend's f16 widening calls endian::load_endian_u16x4 / load_endian_u32x4 helpers). - Add ::<BE> turbofish to recursive gbrpf16_to_rgb_row_f16c calls in AVX-512 / AVX2 / SSE4.1 luma/HSV staged paths (3 sites each). - Add ::<BE> turbofish to wasm_simd128 inner gbrpf32_to_rgba_row / gbrpf32_to_rgb_u16_row / gbrpf32_to_rgba_u16_row recursive calls in the f16-widen + f32-SIMD path (3 sites). Test-only fixes: - Add ::<false> turbofish to f16-row scalar/SIMD calls in wasm_simd128 tests/planar_gbr_float.rs (now that the kernels are <const BE: bool>). - Add ::<false> turbofish to overflow-panic dispatch tests in dispatch/planar_gbr_float.rs and sinker/mixed/tests/planar_gbr_float.rs. Pre-existing wasm-only unused-imports cleanup (caught now under -Dwarnings on wasm32): - src/row/arch/wasm_simd128/tests/high_bit_4_2_0.rs: drop unused high_bit_plane_wasm, interleave_uv_wasm imports. - src/row/arch/wasm_simd128/tests/planar_8bit_and_nv.rs: drop full helper-list import (none used in this file). - src/row/arch/wasm_simd128/tests/yuva.rs: drop unused p_n_packed_plane, p010_uv_interleave imports. cargo fmt rewraps long lines in the existing tier10-float kernels (line length over 100 — pre-existing in the original commit, only surfaced after rebase exposed cross-arch build). Verified: - cargo test --target aarch64-apple-darwin: 2183 passed, 0 failed - RUSTFLAGS=-Dwarnings cargo build --target x86_64-apple-darwin --tests: ok - RUSTFLAGS="-C target-feature=+simd128 -Dwarnings" cargo build --target wasm32-unknown-unknown --tests: ok - RUSTFLAGS="-C target-feature=+simd128 -Dwarnings" cargo build --target wasm32-wasip1 --tests: ok - cargo build --no-default-features: ok - cargo fmt --check: clean - cargo clippy --all-targets --all-features -- -D warnings: ok Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The scalar `load_f32::<BE>` / `load_f16::<BE>` helpers used an unconditional `swap_bytes()` regardless of host endianness. The corresponding SIMD `load_endian_u32x4::<BE>` / `load_endian_u16x8::<BE>` helpers (added in PR #81 be-infra) are target-endian aware via `cfg(target_endian = ...)`, so SIMD and scalar disagreed on big-endian hosts. Tail loops dispatch to the scalar fallback, so any width whose tail is non-zero on s390x corrupted the row. Why s390x corrupts with the old code: when reading a `&[f32]` reinterpreted from raw bytes, the host CPU reads the four bytes in host-native order. On LE hosts that matches LE-on-disk; on BE hosts it matches BE-on-disk. An unconditional swap therefore: - LE host + BE data: correct (swap turns BE bytes into native LE) — the case the original code targeted. - BE host + LE data: correct (swap turns LE bytes into native BE). - BE host + BE data: WRONG (host-native is already BE, swap inverts it). - LE host + LE data: handled by `BE = false` no-op — fine. The fix routes both branches through `u32::from_be` / `u32::from_le` (and `u16::from_be` / `u16::from_le` for f16): BE branch: `f32::from_bits(u32::from_be(raw.to_bits()))` LE branch: `f32::from_bits(u32::from_le(raw.to_bits()))` `u32::from_le` is a no-op on LE hosts and a byte-swap on BE hosts; symmetric for `from_be`. This makes both `<BE>` monomorphizations correct on every target endianness and matches the contract the SIMD endian helpers already implement. f32 / f16 paths use `from_bits(u{32,16}::from_be(raw.to_bits()))` so the result is host-native f32 / `half::f16` regardless of the source encoding. The test helpers (`be_encode` in `planar_gbr_float.rs`, `be_encode_f16` in `planar_gbr_f16.rs`) intentionally use unconditional `swap_bytes` to synthesise BE-on-disk fixtures from LE input on an LE host. They are not load helpers and remain unchanged. No SIMD code paths needed changes — the per-arch `load_endian_*` helpers already use `cfg(target_endian = ...)`. Tail loops still call the scalar helpers, which are now correct. Verified: - `cargo test --target aarch64-apple-darwin --lib`: 2176 passed - `cargo build --target x86_64-apple-darwin --tests`: 0 warnings - `RUSTFLAGS="-C target-feature=+simd128" cargo build --target wasm32-unknown-unknown --tests`: clean - `cargo build --no-default-features`: clean - `cargo fmt --check`: clean - `cargo clippy --all-targets --all-features -- -D warnings`: clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Base automatically changed from feat/be-infra to main May 7, 2026 12:37

uqio force-pushed the feat/be-tier10-float branch from 8b7ba10 to 5b1717c Compare May 7, 2026 12:52

uqio and others added 3 commits May 8, 2026 00:52

feat(be-tier10-float): BE support for Gbrpf32/Gbrapf32/Gbrpf16/Gbrapf…

604064f

…16 row kernels Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

uqio force-pushed the feat/be-tier10-float branch from 5b1717c to 1897dad Compare May 7, 2026 13:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(be-tier10-float): BE support for Gbrpf32/Gbrapf32/Gbrpf16/Gbrapf16 kernels#84

feat(be-tier10-float): BE support for Gbrpf32/Gbrapf32/Gbrpf16/Gbrapf16 kernels#84
uqio wants to merge 3 commits intomainfrom
feat/be-tier10-float

uqio commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

uqio commented May 7, 2026

Summary

Stacking

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant