
fix(simd): VBMI gate for permute_bytes + Inf clamp for simd_exp_f32 #142

Merged
AdaWorldAPI merged 5 commits into master from claude/ndarray-simd-review-S0zXK on May 13, 2026

Conversation

@AdaWorldAPI (Owner)

Summary

Two soundness/correctness bugs were surfaced by a 15-agent CCA2A review fleet on this branch (12 file-scoped Sonnet agents → meta-orchestrator → brutally-honest reviewer; all entries appended to .claude/board/AGENT_LOG.md via tee -a). Both were confirmed real by the brutally-honest reviewer, which built the workspace and ran the test suite before any change was made.

Fixes

Bug 1 (P0, SIGILL): simd_avx512::U8x64::permute_bytes called
_mm512_permutexvar_epi8 (AVX-512VBMI) as a safe pub fn with no gate:
SIGILL on Skylake-X / Cascade Lake / Ice Lake-SP (AVX-512F but no VBMI).
The doc claimed a fallback existed; none did.
Fix: added an avx512vbmi: bool field to SimdCaps. permute_bytes now
runtime-branches via the singleton: VBMI hosts use the hardware intrinsic
(a #[target_feature(enable = "avx512vbmi")]-gated inner unsafe leaf, a
Rust language requirement); non-VBMI AVX-512F hosts use a scalar fallback
mirroring simd_avx2.rs:1435.

Bug 2 (P1, silent wrong output): simd_exp_f32(+Inf) silently returned
~0.5 in release and panicked in debug. pow2n_from_int saturated
f32::INFINITY as i32 to i32::MAX; (i32::MAX + 127) as u32 wrapped,
producing a garbage IEEE bit pattern via f32::from_bits.
Fix: pre-clamp the simd_exp_f32 input to [-87.336, 88.722], plus a
defense-in-depth clamp in pow2n_from_int to ni ∈ [-126, 127] before the
bias add. Three regression tests (+Inf, -Inf, x = 200) all assert finite
output.
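The two clamps in the exp fix can be modeled in scalar code. This is an illustrative sketch, not the shipped implementation: the real code is 16-lane SIMD and evaluates a range-reduced polynomial, which is stood in for here by std's `exp`.

```rust
// Scalar model of the fixed simd_exp_f32 path (illustrative).

// Build 2^n through the IEEE-754 single-precision exponent field.
fn pow2n_from_int(n: f32) -> f32 {
    // Bug: f32::INFINITY as i32 saturates to i32::MAX, and the +127 bias
    // add then wrapped. Fix: clamp to [-126, 127] before the bias add.
    let ni = (n as i32).clamp(-126, 127);
    f32::from_bits(((ni + 127) as u32) << 23)
}

fn exp_f32(x: f32) -> f32 {
    // Pre-clamp to the f32-representable domain of exp. (std's clamp already
    // propagates a NaN self, unlike the SIMD max/min clamp.)
    let x = x.clamp(-87.336, 88.722);
    x.exp() // stand-in for the range-reduced polynomial
}

fn main() {
    assert!(exp_f32(f32::INFINITY).is_finite());     // no wrapped bit pattern
    assert!(exp_f32(f32::NEG_INFINITY).is_finite()); // underflow side is finite too
    assert_eq!(pow2n_from_int(5.0), 32.0);
}
```

Without the inner clamp, `pow2n_from_int(f32::INFINITY)` alone is enough to reproduce the garbage bit pattern, which is why the fix keeps both clamps rather than relying on the outer one.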

Design pattern (for the VBMI fix)

This is the same conditional-routing shape planned for future AVX-512 sub-feature work (BF16, VNNI sub-paths). One LazyLock CPU detect (simd_caps()) is consulted at the method boundary; the inner intrinsic leaf carries the #[target_feature] attribute (Rust requires it to use intrinsics from a function not compiled with the feature globally; there is no other legal way). Consumers call U8x64::permute_bytes(idx) and get the right path on any AVX-512 CPU.
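A minimal sketch of that dispatch shape, with hypothetical names (the real SimdCaps carries more fields and the real type is U8x64); the VBMI intrinsic leaf is reduced to a comment so the sketch runs on any target via the scalar path:

```rust
use std::sync::LazyLock;

// Hypothetical caps struct standing in for the PR's SimdCaps.
struct SimdCaps {
    avx512vbmi: bool,
}

fn detect_vbmi() -> bool {
    #[cfg(target_arch = "x86_64")]
    return std::arch::is_x86_feature_detected!("avx512vbmi");
    #[cfg(not(target_arch = "x86_64"))]
    return false;
}

// One LazyLock CPU detect, consulted at the method boundary.
static SIMD_CAPS: LazyLock<SimdCaps> =
    LazyLock::new(|| SimdCaps { avx512vbmi: detect_vbmi() });

// Safe entry point: picks the right path on any host.
fn permute_bytes(data: &[u8; 64], idx: &[u8; 64]) -> [u8; 64] {
    if SIMD_CAPS.avx512vbmi {
        // The real code calls an inner unsafe leaf here:
        //   #[target_feature(enable = "avx512vbmi")]
        //   unsafe fn permute_bytes_vbmi(..) { _mm512_permutexvar_epi8(..) }
        // This sketch falls through to the scalar path so it runs anywhere.
    }
    // Scalar fallback: out[i] = data[idx[i] & 63], matching the intrinsic.
    let mut out = [0u8; 64];
    for i in 0..64 {
        out[i] = data[(idx[i] & 63) as usize];
    }
    out
}

fn main() {
    let data: [u8; 64] = core::array::from_fn(|i| i as u8);
    let rev: [u8; 64] = core::array::from_fn(|i| (63 - i) as u8);
    let out = permute_bytes(&data, &rev);
    assert_eq!(out[0], 63);
    assert_eq!(out[63], 0);
}
```

The index masking with `& 63` mirrors `_mm512_permutexvar_epi8`, which only consumes the low six bits of each index byte, so the scalar fallback and the hardware path agree on out-of-range indices.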

What this PR does NOT touch

Strictly additive — full breakdown including explicit deferrals (AVX2-tier cfg work, cosmetic-SIMD sweep, AMX detection consolidation) lives in .claude/board/SIMD_REVIEW_FIXES_2026_05_13.md, included in this PR. AMX inline-asm encodings, _mm512_* calls in other methods, existing #[target_feature] annotations: all untouched.

Test plan

  • cargo test --features rayon --lib → 1786 passed (was 1783; +3 simd_exp_f32 regression tests)
  • cargo clippy --features rayon -- -D warnings → clean
  • cargo check --lib (default features, no rayon) → clean

Hardware test matrix

Target                                                     Pre-PR permute_bytes   Post-PR permute_bytes
Sapphire Rapids (avx512f + avx512vbmi)                     works                  works (same VBMI hardware path, now via dispatch)
Skylake-X / Cascade Lake / Ice Lake-SP (avx512f, no VBMI)  SIGILL                 works (scalar fallback)
Pre-AVX-512 (avx2 only)                                    type unavailable       type unavailable (unchanged)
ARM aarch64                                                type unavailable       type unavailable (unchanged)

Review fleet output

15 agents, 13 entries in .claude/board/AGENT_LOG.md (12 file-scoped Sonnet + 1 Opus meta + 1 Opus brutally-honest reviewer). The brutally-honest reviewer corrected the meta on several findings (e.g., framebuffer::project_ortho was a clarity improvement, not a UB fix — Rust 1.45+ saturates float→int casts) and identified the actual P0 SIGILL/correctness items shipped here.

Most other fleet findings (cosmetic-SIMD sweep, AMX detection duplication, SAFETY-comment audit) were classified as real-but-not-Bevy-blocking and deferred per the surgical-and-additive constraint.


Generated by Claude Code

claude added 3 commits May 13, 2026 11:02
Add integrate_simd_par gated on the existing `rayon` feature. Splits the
position/velocity buffers into BLOCK_FLOATS-sized chunks (1024 floats =
4 KB, L1-resident) and runs the existing F32x16::mul_add inner loop on
each block in parallel via par_chunks_mut + zip.

Composes 16 SIMD lanes × N rayon threads. Block size is chosen so each
sub-slice stays a multiple of 16, so the inner as_chunks_mut::<16>() tail
is always empty.

Tests:
  integrate_simd_par_matches_sequential — bit-identical output vs
    sequential integrate_simd (FMA + mul are deterministic).
  integrate_simd_par_advances_positions_exactly — single-tick contract
    x[i] += v[i] * dt holds within f32 epsilon.

Both gated behind #[cfg(feature = "rayon")]; default build is unchanged.
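The block partitioning described above can be sketched in std-only scalar code. This is illustrative: names are taken from the commit message, the F32x16 inner loop is replaced by a scalar one, and the rayon `par_chunks_mut` call is replaced by a sequential `chunks_mut` loop of the same shape.

```rust
// BLOCK_FLOATS is a multiple of the 16-lane width, so the real F32x16
// inner loop sees an empty as_chunks_mut::<16>() tail in every block.
const BLOCK_FLOATS: usize = 1024; // 1024 f32 = 4 KB, L1-resident

fn integrate_block(pos: &mut [f32], vel: &[f32], dt: f32) {
    // Real code: F32x16::mul_add over 16-wide chunks. Scalar stand-in:
    for (p, v) in pos.iter_mut().zip(vel) {
        *p += v * dt;
    }
}

fn integrate(pos: &mut [f32], vel: &[f32], dt: f32) {
    // Real code: pos.par_chunks_mut(BLOCK_FLOATS) zipped with the velocity
    // chunks via rayon; sequential loop here keeps the sketch std-only.
    for (pc, vc) in pos.chunks_mut(BLOCK_FLOATS).zip(vel.chunks(BLOCK_FLOATS)) {
        integrate_block(pc, vc, dt);
    }
}

fn main() {
    let mut pos = vec![1.0_f32; 2048];
    let vel = vec![2.0_f32; 2048];
    integrate(&mut pos, &vel, 0.5);
    // 1.0 + 2.0 * 0.5 is exact in f32, matching the bit-identical test.
    assert!(pos.iter().all(|&p| p == 2.0));
}
```

Because each block is a whole multiple of 16 and FMA/mul are deterministic per lane, the parallel and sequential versions produce bit-identical output, which is what the `integrate_simd_par_matches_sequential` test asserts.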
…rtho

f32-to-usize `as` casts saturate in Rust 1.45+ (negative and NaN inputs
become 0), so the previous code was not UB as first reported; the
brutally-honest reviewer reclassified this as a clarity improvement.
Previously project_ortho cast `(pos*scale + offset) as usize` directly
with only a post-cast `.min(screen_w-1)` clamp, leaving the intended
[0, screen_dim-1] bound implicit in cast-saturation behavior.

Fix: clamp in the float domain to [0, screen_dim - 1] BEFORE the cast, so
the cast input is always finite, non-negative, and within usize range.
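The clamp-before-cast shape can be sketched as follows; the signature is hypothetical and simplified relative to the real project_ortho:

```rust
// Illustrative sketch: clamp in the float domain before the float->int cast.
fn project_ortho(pos: f32, scale: f32, offset: f32, screen_w: usize) -> usize {
    let max_x = (screen_w - 1) as f32;
    // The cast input is now explicitly finite, non-negative, and in range.
    // A NaN input stays NaN through clamp and saturates to 0 in the cast
    // (Rust 1.45+ saturating float->int cast semantics).
    (pos * scale + offset).clamp(0.0, max_x) as usize
}

fn main() {
    assert_eq!(project_ortho(-5.0, 1.0, 0.0, 640), 0);    // negative clamps to 0
    assert_eq!(project_ortho(10.0, 2.0, 0.0, 640), 20);   // in-range passes through
    assert_eq!(project_ortho(1.0e9, 1.0, 0.0, 640), 639); // overflow clamps to the edge
}
```

The behavior is the same as the old post-cast `.min()` shape on current Rust; the gain is that the bound is stated in the float domain rather than inherited from cast-saturation rules.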

Also adds AGENT_LOG.md (CCA2A file blackboard) used by the 12-agent fleet
that surfaced this bug + the broader polyfill-violation audit.

Reported-by: agent #8 framebuffer (sonnet) in fleet review.
Two soundness/correctness bugs surfaced by the 15-agent CCA2A fleet
review on this branch and confirmed real by the brutally-honest reviewer
(see .claude/board/AGENT_LOG.md for full fleet output).

1. permute_bytes (P0 SIGILL) — U8x64::permute_bytes called
   _mm512_permutexvar_epi8 (AVX-512VBMI) as safe pub fn with no gate.
   SIGILL on Skylake-X / Cascade Lake / Ice Lake-SP (AVX-512F-but-no-VBMI).
   Doc claimed a fallback existed; none did.

   Fix: added avx512vbmi: bool field to SimdCaps. permute_bytes now
   runtime-branches via the singleton — VBMI hosts use the hardware
   intrinsic (gated #[target_feature(enable = "avx512vbmi")] inner unsafe
   leaf, Rust language requirement); non-VBMI AVX-512F hosts use a scalar
   fallback mirroring the AVX2-tier shape at simd_avx2.rs:1435.

   The #[target_feature] attribute on the inner permute_bytes_vbmi leaf
   stays — Rust requires it to call VBMI intrinsics from a function not
   compiled with VBMI globally. The user-facing permute_bytes method is
   safe and works on any AVX-512 CPU.

2. simd_exp_f32(Inf) (P1 silent-wrong-output) — pow2n_from_int saturated
   f32::INFINITY as i32 to i32::MAX; (i32::MAX + 127) wrapped, producing
   garbage IEEE bits via from_bits, polynomial × garbage ≈ 0.5. exp(+Inf)
   silently returned ~0.5 in release / panicked in debug.

   Fix: pre-clamp simd_exp_f32 input to [-87.336, 88.722] (the
   f32-representable domain of exp). Defense-in-depth: pow2n_from_int
   clamps ni to [-126, 127] before bias add. NaN propagates naturally.
   3 regression tests added: +Inf, -Inf, x=200. All assert finite output.

Strictly additive — no existing intrinsic touched, no public API broken,
no existing test modified, AMX inline-asm encodings untouched.

Test surface: 1786 passed, 0 failed (was 1783 — added 3 exp regression
tests). cargo clippy --features rayon -- -D warnings: clean.

Documentation: see .claude/board/SIMD_REVIEW_FIXES_2026_05_13.md for the
full PR breakdown including findings explicitly deferred (AVX2 cfg gate,
cosmetic-SIMD sweep, AMX detection consolidation, SAFETY-comment audit).

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9b3674bfa0


Comment thread on src/simd.rs:
// Pre-clamp to the safe domain. Outside this band exp() is non-representable
// anyway (overflow → +Inf at ~88.7, underflow → +0 at ~-87.3) so the clamp
// is observable only at the saturation boundary.
let x = x.simd_clamp(F32x16::splat(-87.336_f32), F32x16::splat(88.722_f32));

P2: Preserve NaNs when clamping exp inputs

For lanes where callers pass NaN, this clamp does not preserve the documented NaN behavior: the SIMD simd_clamp implementations are max(lo).min(hi), and both the scalar fallback and x86 intrinsics select one of the bounds for a NaN operand. That makes simd_exp_f32(NaN) return a small finite value near exp(-87.336) instead of propagating NaN, which can hide invalid values in activation/softmax paths that rely on exp semantics. Use comparisons that leave NaN lanes unchanged before clamping finite inputs.


claude added 2 commits May 13, 2026 14:38
Per project convention, every AVX-512 build has AVX2 codegen HARD-disabled
by design — AVX-512 builds use _mm512_* exclusively (zmm registers only,
no mixed-width). The 256-bit AVX2-tier types in simd_avx512.rs are
unreachable in any valid build configuration where the soundness concern
would apply. The reviewer's AVX2-types-ungated finding was theoretical,
not real.
The pre-clamp via simd_clamp silently destroyed NaN inputs. simd_clamp is
implemented as max(lo).min(hi); _mm512_max_ps returns the SECOND operand
when the first is NaN (per Intel SDM § MAXPS), so NaN got clamped to lo
(-87.336) and exp(-87.336) ≈ 1.4e-38 — a tiny finite value pretending to
be valid.

Fix: capture NaN lanes via x.simd_ne(x) (NaN ≠ itself per IEEE 754) BEFORE
the clamp, then mask-select NaN back into those lanes after the polynomial.
NaN propagates per-lane; finite lanes are unchanged.
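A scalar per-lane model of this fix, assuming the x86 max/min NaN semantics the commit describes; `x86_max`/`x86_min` are illustrative helpers modeling `_mm512_max_ps`/`_mm512_min_ps`, and std's `exp` stands in for the polynomial:

```rust
// maxps/minps return the SECOND operand when the comparison involves a NaN
// first operand, which is what destroyed NaN in the original clamp.
fn x86_max(a: f32, b: f32) -> f32 { if a > b { a } else { b } }
fn x86_min(a: f32, b: f32) -> f32 { if a < b { a } else { b } }

fn exp_lane(x: f32) -> f32 {
    let nan_lane = x != x; // capture NaN lanes first: NaN != NaN per IEEE 754
    // The max(lo).min(hi) clamp shape; a NaN input lands on lo here.
    let clamped = x86_min(x86_max(x, -87.336), 88.722);
    let y = clamped.exp(); // stand-in for the polynomial
    if nan_lane { f32::NAN } else { y } // mask-select NaN back into its lanes
}

fn main() {
    assert!(exp_lane(f32::NAN).is_nan());         // NaN propagates
    assert_eq!(exp_lane(0.0), 1.0);               // finite lanes unchanged
    assert!(exp_lane(f32::INFINITY).is_finite()); // clamp still caps overflow
}
```

Without the `nan_lane` capture, the model returns `exp(-87.336)` (about 1.4e-38) for a NaN input, reproducing the tiny-finite-value bug Codex flagged.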

Two regression tests:
  simd_exp_f32_propagates_nan — full-NaN vector returns full-NaN
  simd_exp_f32_propagates_nan_per_lane — mixed NaN/0.0 input; NaN lanes
    propagate, finite lanes compute exp(0)=1 unaffected

1788 passed (+2 from 1786).

Reported-by: codex review on PR #142.
@AdaWorldAPI AdaWorldAPI merged commit 8259600 into master May 13, 2026
13 of 14 checks passed
AdaWorldAPI pushed a commit that referenced this pull request May 13, 2026
This is the keystone for the cosmetic-SIMD sweep that agent #11 audited on PR #142.
That audit found 8 confirmed cosmetic SIMD wrappers in hpc/byte_scan.rs,
hpc/palette_codec.rs, and hpc/aabb.rs — `#[target_feature(enable = "avx2")]`
decorating scalar bodies that gave zero speedup over plain scalar. The
root cause: there was no `U8x32` type in the polyfill, so consumers
couldn't write SIMD byte code at AVX2's natural width (32 bytes = one
__m256i ymm register).

This PR adds U8x32 with real __m256i storage and 26 polyfill methods
mirroring `simd_avx512::U8x64`:

Constructors:    splat, from_slice, from_array, to_array, copy_to_slice
Reductions:      reduce_sum (wrap-add), reduce_min, reduce_max, sum_bytes_u64
Min/max:         simd_min, simd_max  (_mm256_min_epu8, _mm256_max_epu8)
Compare→mask:    cmpeq_mask → u32, cmpgt_mask → u32 (unsigned via xor 0x80),
                 movemask → u32  (matches _mm256_movemask_epi8 width)
Saturating:      saturating_add, saturating_sub  (_mm256_adds/subs_epu8)
Avg:             pairwise_avg  (_mm256_avg_epu8, round-up)
Shifts:          shr_epi16, shl_epi16  (16-bit lane shifts via _mm256_srl/sll_epi16)
Shuffles:        shuffle_bytes  (within-128-bit-lane, _mm256_shuffle_epi8)
                 permute_bytes  (cross-lane, scalar fallback — AVX2 has no
                 native cross-lane byte permute; matches U8x64's behavior
                 on AVX-512F-without-VBMI hosts)
                 unpack_lo_epi8, unpack_hi_epi8  (_mm256_unpacklo/hi_epi8)
Conditional:     mask_blend  (_mm256_blendv_epi8, MSB-driven, NOT bitmask)
LUT:             nibble_popcount_lut

Plus operators: BitAnd, BitOr, BitXor, Add (wrapping), Sub (wrapping),
Debug, Default. All ~26 methods.

Re-exported from `crate::simd::U8x32` for both AVX-512 and AVX2 build
tiers — U8x32 is the natural AVX2 byte width and is needed regardless
of whether AVX-512's U8x64 is the consumer's preferred width.

Soundness model matches the rest of simd_avx2.rs: `_mm256_*` intrinsics
are wrapped in `unsafe { }` blocks inside safe `pub fn`, trusting that
AVX2 is the compile target (x86-64-v3 is project baseline). The codebase
uses this pattern already in the AVX2 popcount at simd_avx2.rs:357.

Test coverage:
- 18 new tests in `mod u8x32_tests` covering: roundtrip, sum/min/max
  reductions, unsigned cmp masks (incl. high-byte > 127 to verify the
  XOR-0x80 unsigned trick), saturating add/sub clamps, pairwise_avg
  round-up, shr_epi16 nibble extraction, permute_bytes reverse,
  mask_blend per-MSB selection, nibble_popcount_lut via shuffle_bytes.
- All 18 pass. Total test count 1786 → 1804 with no regressions.

clippy --features rayon -- -D warnings: clean.

Companion: this PR unblocks the round-3 consumer fleet which will
rewrite byte_find_all_avx2 / pack_indices / aabb_intersect_batch_sse41
and friends to use `crate::simd::U8x32` instead of `#[target_feature]`
wrappers around scalar code. Each consumer rewrite ships as its own PR
in the next wave.
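Two of the lane tricks listed above are easy to model per byte in scalar code; these are illustrative single-lane models of the vectorized methods, not the polyfill itself:

```rust
// AVX2 has only signed byte compares; XOR-ing 0x80 into both operands maps
// unsigned order onto signed order, so a signed > yields the unsigned answer.
// Models one lane of cmpgt_mask.
fn cmpgt_epu8(a: u8, b: u8) -> bool {
    ((a ^ 0x80) as i8) > ((b ^ 0x80) as i8)
}

// _mm256_avg_epu8 semantics: (a + b + 1) >> 1, i.e. average rounded up.
// Models one lane of pairwise_avg.
fn avg_epu8(a: u8, b: u8) -> u8 {
    ((a as u16 + b as u16 + 1) >> 1) as u8
}

fn main() {
    // High-byte case (> 127) that a naive signed compare gets wrong:
    assert!(cmpgt_epu8(200, 100));
    assert!(!cmpgt_epu8(100, 200));
    assert_eq!(avg_epu8(1, 2), 2);      // rounds up
    assert_eq!(avg_epu8(255, 255), 255); // no overflow through the u16 widen
}
```

The high-byte test case mirrors the `u8x32_tests` coverage: interpreting 200 as i8 gives -56, so without the XOR trick a signed compare would claim 200 < 100.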
AdaWorldAPI pushed a commit that referenced this pull request May 13, 2026
Complete the portable-simd backend started in the scaffold commit.
12 Sonnet agents (round-3-portable-simd fleet) populated each of the
12 sub-files in `src/simd_nightly/` via the A2A blackboard pattern at
`.claude/board/AGENT_LOG.md`.

Total: ~4,022 LOC of wrapper code + 76 parity tests.

Per-file (line counts at commit):
  - f32_types.rs (395)    — F32x16, F32x8
  - f64_types.rs (307)    — F64x8, F64x4
  - u8_types.rs (1043)    — U8x32, U8x64 + 26 in-file tests
  - u_word_types.rs (520) — U16x32, U32x16, U32x8, U64x8, U64x4
  - i8_types.rs (263)     — I8x32, I8x64
  - i_word_types.rs (449) — I16x16, I16x32, I32x16, I64x8
  - masks.rs (196)        — F32Mask16, F32Mask8, F64Mask8, F64Mask4
  - bf16_types.rs (248)   — BF16x16, BF16x8 (scalar emulation;
                            core::simd has no half-precision)
  - f16_types.rs (220)    — F16x16 (scalar IEEE-754 binary16 emulation)
  - ops.rs (265)          — Add/Sub/Mul/Div/Neg + bitwise + Default
                            macros, applied to all 17 numeric types
  - exotic_methods.rs (329) — permute_bytes / shuffle_bytes / mask_blend /
                              unpack_lo_epi8 / unpack_hi_epi8 scalar
                              fallbacks for U8x32 + U8x64 (core::simd
                              has no native cross-lane byte ops or
                              bitmask-driven blend)
  - tests.rs (815)        — 76 parity tests vs scalar reference

30 types total (mirrors the AVX-512 / AVX2 polyfill surface 1:1).
All re-exported flat from `crate::simd_nightly::*` via the mod.rs
aggregator.

Verification:
  rustup run nightly cargo check --features nightly-simd -p ndarray --lib
    → Finished, 0 errors
  rustup run nightly cargo test --features nightly-simd -p ndarray --lib simd_nightly
    → test result: ok. 153 passed; 0 failed
  cargo check --lib (stable, default features, no nightly-simd)
    → Finished, 0 errors (the existing intrinsics dispatch is unchanged)

Cross-agent findings worth folding into a handover note:
  - `std::simd::StdFloat` is the trait that provides mul_add/sqrt/round/
    floor on core::simd float vectors. `core::simd::num::SimdFloat`
    provides reduce/min/max/clamp but NOT the transcendentals.
  - `core::simd::cmp::SimdOrd` is needed for simd_min/simd_max on
    integer vectors (SimdPartialOrd alone is not sufficient).
  - `core::simd::Mask::to_bitmask()` always returns u64 regardless of
    lane count. Wrappers cast `as u8` / `as u16` / `as u32` for narrower
    bitmask shapes.
  - `core::simd::Simd::swizzle` is `const N: usize` — cannot take a
    runtime index vector. permute_bytes / shuffle_bytes need scalar
    fallback. Same shape as the AVX-512F-without-VBMI fallback path in
    simd_avx512.rs added in PR #142.

What this enables:
  Miri can execute every method here (intrinsics-based backends are
  opaque to miri). Consumers who want miri-runnable SIMD tests import
  from `ndarray::simd_nightly::*` explicitly. The main polyfill via
  `crate::simd::*` continues to use intrinsics — the nightly-simd
  feature does NOT replace the production dispatch, it provides a
  parallel namespace for miri tooling.

Fleet output in .claude/board/AGENT_LOG.md (round-3-portable-simd
section). 6 of 12 agents hit the same pre-existing AGENT_LOG-write
permission block from round 2; their entries were backfilled by the main thread.