feat(be-tier10-float): BE support for Gbrpf32/Gbrapf32/Gbrpf16/Gbrapf16 kernels#84
Open
feat(be-tier10-float): BE support for Gbrpf32/Gbrapf32/Gbrpf16/Gbrapf16 kernels#84
Conversation
…16 row kernels Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply post-rebase fixups to make the tier 10 float work pass CI on all targets (x86_64, wasm32, no-default-features) — original branch was forked off pre-be-infra and only verified on aarch64. Compile fixes: - Add load_endian_u16x8 helper to x86_avx2/endian.rs (128-bit lane load for use with _mm256_cvtph_ps in 8-pixel f16 widening); planar_gbr_f16 AVX2 paths needed it. - Import endian module in x86_sse41/planar_gbr_float.rs (was previously resolved via a missing path; SSE4.1 backend's f16 widening calls endian::load_endian_u16x4 / load_endian_u32x4 helpers). - Add ::<BE> turbofish to recursive gbrpf16_to_rgb_row_f16c calls in AVX-512 / AVX2 / SSE4.1 luma/HSV staged paths (3 sites each). - Add ::<BE> turbofish to wasm_simd128 inner gbrpf32_to_rgba_row / gbrpf32_to_rgb_u16_row / gbrpf32_to_rgba_u16_row recursive calls in the f16-widen + f32-SIMD path (3 sites). Test-only fixes: - Add ::<false> turbofish to f16-row scalar/SIMD calls in wasm_simd128 tests/planar_gbr_float.rs (now that the kernels are <const BE: bool>). - Add ::<false> turbofish to overflow-panic dispatch tests in dispatch/planar_gbr_float.rs and sinker/mixed/tests/planar_gbr_float.rs. Pre-existing wasm-only unused-imports cleanup (caught now under -Dwarnings on wasm32): - src/row/arch/wasm_simd128/tests/high_bit_4_2_0.rs: drop unused high_bit_plane_wasm, interleave_uv_wasm imports. - src/row/arch/wasm_simd128/tests/planar_8bit_and_nv.rs: drop full helper-list import (none used in this file). - src/row/arch/wasm_simd128/tests/yuva.rs: drop unused p_n_packed_plane, p010_uv_interleave imports. cargo fmt rewraps long lines in the existing tier10-float kernels (line length over 100 — pre-existing in the original commit, only surfaced after rebase exposed cross-arch build). Verified: - cargo test --target aarch64-apple-darwin: 2183 passed, 0 failed - RUSTFLAGS=-Dwarnings cargo build --target x86_64-apple-darwin --tests: ok - RUSTFLAGS="-C target-feature=+simd128 -Dwarnings" cargo build --target wasm32-unknown-unknown --tests: ok - RUSTFLAGS="-C target-feature=+simd128 -Dwarnings" cargo build --target wasm32-wasip1 --tests: ok - cargo build --no-default-features: ok - cargo fmt --check: clean - cargo clippy --all-targets --all-features -- -D warnings: ok Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The scalar `load_f32::<BE>` / `load_f16::<BE>` helpers used an unconditional `swap_bytes()` regardless of host endianness. The corresponding SIMD `load_endian_u32x4::<BE>` / `load_endian_u16x8::<BE>` helpers (added in PR #81 be-infra) are target-endian aware via `cfg(target_endian = ...)`, so SIMD and scalar disagreed on big-endian hosts. Tail loops dispatch to the scalar fallback, so any width whose tail is non-zero on s390x corrupted the row. Why s390x corrupts with the old code: when reading a `&[f32]` reinterpreted from raw bytes, the host CPU reads the four bytes in host-native order. On LE hosts that matches LE-on-disk; on BE hosts it matches BE-on-disk. An unconditional swap therefore: - LE host + BE data: correct (swap turns BE bytes into native LE) — the case the original code targeted. - BE host + LE data: correct (swap turns LE bytes into native BE). - BE host + BE data: WRONG (host-native is already BE, swap inverts it). - LE host + LE data: handled by `BE = false` no-op — fine. The fix routes both branches through `u32::from_be` / `u32::from_le` (and `u16::from_be` / `u16::from_le` for f16): BE branch: `f32::from_bits(u32::from_be(raw.to_bits()))` LE branch: `f32::from_bits(u32::from_le(raw.to_bits()))` `u32::from_le` is a no-op on LE hosts and a byte-swap on BE hosts; symmetric for `from_be`. This makes both `<BE>` monomorphizations correct on every target endianness and matches the contract the SIMD endian helpers already implement. f32 / f16 paths use `from_bits(u{32,16}::from_be(raw.to_bits()))` so the result is host-native f32 / `half::f16` regardless of the source encoding. The test helpers (`be_encode` in `planar_gbr_float.rs`, `be_encode_f16` in `planar_gbr_f16.rs`) intentionally use unconditional `swap_bytes` to synthesise BE-on-disk fixtures from LE input on an LE host. They are not load helpers and remain unchanged. No SIMD code paths needed changes — the per-arch `load_endian_*` helpers already use `cfg(target_endian = ...)`. Tail loops still call the scalar helpers, which are now correct. Verified: - `cargo test --target aarch64-apple-darwin --lib`: 2176 passed - `cargo build --target x86_64-apple-darwin --tests`: 0 warnings - `RUSTFLAGS="-C target-feature=+simd128" cargo build --target wasm32-unknown-unknown --tests`: clean - `cargo build --no-default-features`: clean - `cargo fmt --check`: clean - `cargo clippy --all-targets --all-features -- -D warnings`: clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 2 — Tier 10 float BE rollout. Stacked on #81 (BE infra). Adds
<const BE: bool>to all Gbrpf32 / Gbrapf32 / Gbrpf16 / Gbrapf16 planar GBR float row kernels across all 6 backends + dispatcher.Implementation:
load_f32::<BE>andload_f16::<BE>helpers; all 18 kernels (15 f32 + 3 lossless f16) parameterizedload_f32x4::<BE>+load_endian_u16x4::<BE>; 31 pub fns parameterized_mm*_castsi*_ps(load_endian_u32xN::<BE>(...)); f16 loads viaload_endian_u16xN::<BE>load_endian_u32x4::<BE>Endian helper additions (extending #81 infra surface):
load_le_u16x4,load_be_u16x4,load_endian_u16x4(4-lane viavld1_u16)load_le_u16x4,load_be_u16x4,load_endian_u16x4(4-lane via_mm_loadl_epi64)load_le_u16x16,load_be_u16x16,load_endian_u16x16(16-lane__m256ifor_mm512_cvtph_ps)These were added here rather than in #81 because they're consumed only by the float planar GBR f16 kernels.
Test results: 2183 tests pass total (6 new BE parity tests).
cargo buildandcargo testclean.Stacking
Base:
feat/be-infra(#81). Will rebase ontomainonce #81 merges.Test plan
cargo test --target aarch64-apple-darwincargo build --target x86_64-apple-darwin --testscargo build --target wasm32-unknown-unknown --tests🤖 Generated with Claude Code