
fix(simd): VBMI gate for permute_bytes + Inf clamp for simd_exp_f32 #142

Merged
AdaWorldAPI merged 5 commits into master from claude/ndarray-simd-review-S0zXK on May 13, 2026

Conversation

@AdaWorldAPI (Owner)

Summary

Two soundness/correctness bugs were surfaced by a 15-agent CCA2A review fleet on this branch (12 file-scoped Sonnet agents → meta-orchestrator → brutally-honest reviewer; all entries appended to .claude/board/AGENT_LOG.md via tee -a). Both were confirmed real by the brutally-honest reviewer, which built the workspace and ran the test suite before any change was made.

Fixes

Bug 1 (P0, SIGILL): simd_avx512::U8x64::permute_bytes called
_mm512_permutexvar_epi8 (AVX-512VBMI) as a safe pub fn with no gate:
SIGILL on Skylake-X / Cascade Lake / Ice Lake-SP (AVX-512F but no VBMI).
The doc claimed a fallback existed; none did.
Fix: added an avx512vbmi: bool field to SimdCaps. permute_bytes now
runtime-branches via the singleton: VBMI hosts use the hardware intrinsic
(a #[target_feature(enable = "avx512vbmi")]-gated inner unsafe leaf, a
Rust language requirement); non-VBMI AVX-512F hosts use a scalar fallback
mirroring simd_avx2.rs:1435.

Bug 2 (P1, silent wrong output): simd_exp_f32(+Inf) silently returned
~0.5 in release and panicked in debug. pow2n_from_int saturated
f32::INFINITY as i32 to i32::MAX; (i32::MAX + 127) as u32 wrapped,
producing a garbage IEEE bit pattern via f32::from_bits.
Fix: pre-clamp the simd_exp_f32 input to [-87.336, 88.722], plus a
defense-in-depth clamp in pow2n_from_int to ni ∈ [-126, 127] before the
bias add. Three regression tests (+Inf, -Inf, x = 200) all assert finite
output.
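The two clamps in the exp fix can be modeled in scalar code. This is an illustrative sketch, not the shipped implementation: the real code is 16-lane SIMD and evaluates a range-reduced polynomial, which is stood in for here by std's `exp`.

```rust
// Scalar model of the fixed simd_exp_f32 path (illustrative).

// Build 2^n through the IEEE-754 single-precision exponent field.
fn pow2n_from_int(n: f32) -> f32 {
    // Bug: f32::INFINITY as i32 saturates to i32::MAX, and the +127 bias
    // add then wrapped. Fix: clamp to [-126, 127] before the bias add.
    let ni = (n as i32).clamp(-126, 127);
    f32::from_bits(((ni + 127) as u32) << 23)
}

fn exp_f32(x: f32) -> f32 {
    // Pre-clamp to the f32-representable domain of exp. (std's clamp already
    // propagates a NaN self, unlike the SIMD max/min clamp.)
    let x = x.clamp(-87.336, 88.722);
    x.exp() // stand-in for the range-reduced polynomial
}

fn main() {
    assert!(exp_f32(f32::INFINITY).is_finite());     // no wrapped bit pattern
    assert!(exp_f32(f32::NEG_INFINITY).is_finite()); // underflow side is finite too
    assert_eq!(pow2n_from_int(5.0), 32.0);
}
```

Without the inner clamp, `pow2n_from_int(f32::INFINITY)` alone is enough to reproduce the garbage bit pattern, which is why the fix keeps both clamps rather than relying on the outer one.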

Design pattern (for the VBMI fix)

This is the same conditional-routing shape planned for future AVX-512 sub-feature work (BF16, VNNI sub-paths). One LazyLock CPU detect (simd_caps()) is consulted at the method boundary; the inner intrinsic leaf carries the #[target_feature] attribute (Rust requires it to use intrinsics from a function not compiled with the feature globally; there is no other legal way). Consumers call U8x64::permute_bytes(idx) and get the right path on any AVX-512 CPU.
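A minimal sketch of that dispatch shape, with hypothetical names (the real SimdCaps carries more fields and the real type is U8x64); the VBMI intrinsic leaf is reduced to a comment so the sketch runs on any target via the scalar path:

```rust
use std::sync::LazyLock;

// Hypothetical caps struct standing in for the PR's SimdCaps.
struct SimdCaps {
    avx512vbmi: bool,
}

fn detect_vbmi() -> bool {
    #[cfg(target_arch = "x86_64")]
    return std::arch::is_x86_feature_detected!("avx512vbmi");
    #[cfg(not(target_arch = "x86_64"))]
    return false;
}

// One LazyLock CPU detect, consulted at the method boundary.
static SIMD_CAPS: LazyLock<SimdCaps> =
    LazyLock::new(|| SimdCaps { avx512vbmi: detect_vbmi() });

// Safe entry point: picks the right path on any host.
fn permute_bytes(data: &[u8; 64], idx: &[u8; 64]) -> [u8; 64] {
    if SIMD_CAPS.avx512vbmi {
        // The real code calls an inner unsafe leaf here:
        //   #[target_feature(enable = "avx512vbmi")]
        //   unsafe fn permute_bytes_vbmi(..) { _mm512_permutexvar_epi8(..) }
        // This sketch falls through to the scalar path so it runs anywhere.
    }
    // Scalar fallback: out[i] = data[idx[i] & 63], matching the intrinsic.
    let mut out = [0u8; 64];
    for i in 0..64 {
        out[i] = data[(idx[i] & 63) as usize];
    }
    out
}

fn main() {
    let data: [u8; 64] = core::array::from_fn(|i| i as u8);
    let rev: [u8; 64] = core::array::from_fn(|i| (63 - i) as u8);
    let out = permute_bytes(&data, &rev);
    assert_eq!(out[0], 63);
    assert_eq!(out[63], 0);
}
```

The index masking with `& 63` mirrors `_mm512_permutexvar_epi8`, which only consumes the low six bits of each index byte, so the scalar fallback and the hardware path agree on out-of-range indices.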

What this PR does NOT touch

Strictly additive — full breakdown including explicit deferrals (AVX2-tier cfg work, cosmetic-SIMD sweep, AMX detection consolidation) lives in .claude/board/SIMD_REVIEW_FIXES_2026_05_13.md, included in this PR. AMX inline-asm encodings, _mm512_* calls in other methods, existing #[target_feature] annotations: all untouched.

Test plan

  • cargo test --features rayon --lib → 1786 passed (was 1783; +3 simd_exp_f32 regression tests)
  • cargo clippy --features rayon -- -D warnings → clean
  • cargo check --lib (default features, no rayon) → clean

Hardware test matrix

Target                                                     Pre-PR permute_bytes   Post-PR permute_bytes
Sapphire Rapids (avx512f + avx512vbmi)                     works                  works (same VBMI hardware path, now via dispatch)
Skylake-X / Cascade Lake / Ice Lake-SP (avx512f, no VBMI)  SIGILL                 works (scalar fallback)
Pre-AVX-512 (avx2 only)                                    type unavailable       type unavailable (unchanged)
ARM aarch64                                                type unavailable       type unavailable (unchanged)

Review fleet output

15 agents, 13 entries in .claude/board/AGENT_LOG.md (12 file-scoped Sonnet + 1 Opus meta + 1 Opus brutally-honest reviewer). The brutally-honest reviewer corrected the meta on several findings (e.g., framebuffer::project_ortho was a clarity improvement, not a UB fix — Rust 1.45+ saturates float→int casts) and identified the actual P0 SIGILL/correctness items shipped here.

Most other fleet findings (cosmetic-SIMD sweep, AMX detection duplication, SAFETY-comment audit) were classified as real-but-not-Bevy-blocking and deferred per the surgical-and-additive constraint.


Generated by Claude Code

claude added 3 commits May 13, 2026 11:02
Add integrate_simd_par gated on the existing `rayon` feature. Splits the
position/velocity buffers into BLOCK_FLOATS-sized chunks (1024 floats =
4 KB, L1-resident) and runs the existing F32x16::mul_add inner loop on
each block in parallel via par_chunks_mut + zip.

Composes 16 SIMD lanes × N rayon threads. Block size is chosen so each
sub-slice stays a multiple of 16, so the inner as_chunks_mut::<16>() tail
is always empty.

Tests:
  integrate_simd_par_matches_sequential — bit-identical output vs
    sequential integrate_simd (FMA + mul are deterministic).
  integrate_simd_par_advances_positions_exactly — single-tick contract
    x[i] += v[i] * dt holds within f32 epsilon.

Both gated behind #[cfg(feature = "rayon")]; default build is unchanged.
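The block partitioning described above can be sketched in std-only scalar code. This is illustrative: names are taken from the commit message, the F32x16 inner loop is replaced by a scalar one, and the rayon `par_chunks_mut` call is replaced by a sequential `chunks_mut` loop of the same shape.

```rust
// BLOCK_FLOATS is a multiple of the 16-lane width, so the real F32x16
// inner loop sees an empty as_chunks_mut::<16>() tail in every block.
const BLOCK_FLOATS: usize = 1024; // 1024 f32 = 4 KB, L1-resident

fn integrate_block(pos: &mut [f32], vel: &[f32], dt: f32) {
    // Real code: F32x16::mul_add over 16-wide chunks. Scalar stand-in:
    for (p, v) in pos.iter_mut().zip(vel) {
        *p += v * dt;
    }
}

fn integrate(pos: &mut [f32], vel: &[f32], dt: f32) {
    // Real code: pos.par_chunks_mut(BLOCK_FLOATS) zipped with the velocity
    // chunks via rayon; sequential loop here keeps the sketch std-only.
    for (pc, vc) in pos.chunks_mut(BLOCK_FLOATS).zip(vel.chunks(BLOCK_FLOATS)) {
        integrate_block(pc, vc, dt);
    }
}

fn main() {
    let mut pos = vec![1.0_f32; 2048];
    let vel = vec![2.0_f32; 2048];
    integrate(&mut pos, &vel, 0.5);
    // 1.0 + 2.0 * 0.5 is exact in f32, matching the bit-identical test.
    assert!(pos.iter().all(|&p| p == 2.0));
}
```

Because each block is a whole multiple of 16 and FMA/mul are deterministic per lane, the parallel and sequential versions produce bit-identical output, which is what the `integrate_simd_par_matches_sequential` test asserts.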
…rtho

f32-to-usize `as` casts saturate in Rust 1.45+ (negative and NaN inputs
become 0), so the previous code was not UB as first reported; the
brutally-honest reviewer reclassified this as a clarity improvement.
Previously project_ortho cast `(pos*scale + offset) as usize` directly
with only a post-cast `.min(screen_w-1)` clamp, leaving the intended
[0, screen_dim-1] bound implicit in cast-saturation behavior.

Fix: clamp in the float domain to [0, screen_dim - 1] BEFORE the cast, so
the cast input is always finite, non-negative, and within usize range.
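The clamp-before-cast shape can be sketched as follows; the signature is hypothetical and simplified relative to the real project_ortho:

```rust
// Illustrative sketch: clamp in the float domain before the float->int cast.
fn project_ortho(pos: f32, scale: f32, offset: f32, screen_w: usize) -> usize {
    let max_x = (screen_w - 1) as f32;
    // The cast input is now explicitly finite, non-negative, and in range.
    // A NaN input stays NaN through clamp and saturates to 0 in the cast
    // (Rust 1.45+ saturating float->int cast semantics).
    (pos * scale + offset).clamp(0.0, max_x) as usize
}

fn main() {
    assert_eq!(project_ortho(-5.0, 1.0, 0.0, 640), 0);    // negative clamps to 0
    assert_eq!(project_ortho(10.0, 2.0, 0.0, 640), 20);   // in-range passes through
    assert_eq!(project_ortho(1.0e9, 1.0, 0.0, 640), 639); // overflow clamps to the edge
}
```

The behavior is the same as the old post-cast `.min()` shape on current Rust; the gain is that the bound is stated in the float domain rather than inherited from cast-saturation rules.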

Also adds AGENT_LOG.md (CCA2A file blackboard) used by the 12-agent fleet
that surfaced this bug + the broader polyfill-violation audit.

Reported-by: agent #8 framebuffer (sonnet) in fleet review.
Two soundness/correctness bugs surfaced by the 15-agent CCA2A fleet
review on this branch and confirmed real by the brutally-honest reviewer
(see .claude/board/AGENT_LOG.md for full fleet output).

1. permute_bytes (P0 SIGILL) — U8x64::permute_bytes called
   _mm512_permutexvar_epi8 (AVX-512VBMI) as safe pub fn with no gate.
   SIGILL on Skylake-X / Cascade Lake / Ice Lake-SP (AVX-512F-but-no-VBMI).
   Doc claimed a fallback existed; none did.

   Fix: added avx512vbmi: bool field to SimdCaps. permute_bytes now
   runtime-branches via the singleton — VBMI hosts use the hardware
   intrinsic (gated #[target_feature(enable = "avx512vbmi")] inner unsafe
   leaf, Rust language requirement); non-VBMI AVX-512F hosts use a scalar
   fallback mirroring the AVX2-tier shape at simd_avx2.rs:1435.

   The #[target_feature] attribute on the inner permute_bytes_vbmi leaf
   stays — Rust requires it to call VBMI intrinsics from a function not
   compiled with VBMI globally. The user-facing permute_bytes method is
   safe and works on any AVX-512 CPU.

2. simd_exp_f32(Inf) (P1 silent-wrong-output) — pow2n_from_int saturated
   f32::INFINITY as i32 to i32::MAX; (i32::MAX + 127) wrapped, producing
   garbage IEEE bits via from_bits, polynomial × garbage ≈ 0.5. exp(+Inf)
   silently returned ~0.5 in release / panicked in debug.

   Fix: pre-clamp simd_exp_f32 input to [-87.336, 88.722] (the
   f32-representable domain of exp). Defense-in-depth: pow2n_from_int
   clamps ni to [-126, 127] before bias add. NaN propagates naturally.
   3 regression tests added: +Inf, -Inf, x=200. All assert finite output.

Strictly additive — no existing intrinsic touched, no public API broken,
no existing test modified, AMX inline-asm encodings untouched.

Test surface: 1786 passed, 0 failed (was 1783 — added 3 exp regression
tests). cargo clippy --features rayon -- -D warnings: clean.

Documentation: see .claude/board/SIMD_REVIEW_FIXES_2026_05_13.md for the
full PR breakdown including findings explicitly deferred (AVX2 cfg gate,
cosmetic-SIMD sweep, AMX detection consolidation, SAFETY-comment audit).

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9b3674bfa0


Comment thread on src/simd.rs:
// Pre-clamp to the safe domain. Outside this band exp() is non-representable
// anyway (overflow → +Inf at ~88.7, underflow → +0 at ~-87.3) so the clamp
// is observable only at the saturation boundary.
let x = x.simd_clamp(F32x16::splat(-87.336_f32), F32x16::splat(88.722_f32));

P2: Preserve NaNs when clamping exp inputs

For lanes where callers pass NaN, this clamp does not preserve the documented NaN behavior: the SIMD simd_clamp implementations are max(lo).min(hi), and both the scalar fallback and x86 intrinsics select one of the bounds for a NaN operand. That makes simd_exp_f32(NaN) return a small finite value near exp(-87.336) instead of propagating NaN, which can hide invalid values in activation/softmax paths that rely on exp semantics. Use comparisons that leave NaN lanes unchanged before clamping finite inputs.


claude added 2 commits May 13, 2026 14:38
Per project convention, every AVX-512 build has AVX2 codegen HARD-disabled
by design — AVX-512 builds use _mm512_* exclusively (zmm registers only,
no mixed-width). The 256-bit AVX2-tier types in simd_avx512.rs are
unreachable in any valid build configuration where the soundness concern
would apply. The reviewer's AVX2-types-ungated finding was theoretical,
not real.
The pre-clamp via simd_clamp silently destroyed NaN inputs. simd_clamp is
implemented as max(lo).min(hi); _mm512_max_ps returns the SECOND operand
when the first is NaN (per Intel SDM § MAXPS), so NaN got clamped to lo
(-87.336) and exp(-87.336) ≈ 1.4e-38 — a tiny finite value pretending to
be valid.

Fix: capture NaN lanes via x.simd_ne(x) (NaN ≠ itself per IEEE 754) BEFORE
the clamp, then mask-select NaN back into those lanes after the polynomial.
NaN propagates per-lane; finite lanes are unchanged.
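A scalar per-lane model of this fix, assuming the x86 max/min NaN semantics the commit describes; `x86_max`/`x86_min` are illustrative helpers modeling `_mm512_max_ps`/`_mm512_min_ps`, and std's `exp` stands in for the polynomial:

```rust
// maxps/minps return the SECOND operand when the comparison involves a NaN
// first operand, which is what destroyed NaN in the original clamp.
fn x86_max(a: f32, b: f32) -> f32 { if a > b { a } else { b } }
fn x86_min(a: f32, b: f32) -> f32 { if a < b { a } else { b } }

fn exp_lane(x: f32) -> f32 {
    let nan_lane = x != x; // capture NaN lanes first: NaN != NaN per IEEE 754
    // The max(lo).min(hi) clamp shape; a NaN input lands on lo here.
    let clamped = x86_min(x86_max(x, -87.336), 88.722);
    let y = clamped.exp(); // stand-in for the polynomial
    if nan_lane { f32::NAN } else { y } // mask-select NaN back into its lanes
}

fn main() {
    assert!(exp_lane(f32::NAN).is_nan());         // NaN propagates
    assert_eq!(exp_lane(0.0), 1.0);               // finite lanes unchanged
    assert!(exp_lane(f32::INFINITY).is_finite()); // clamp still caps overflow
}
```

Without the `nan_lane` capture, the model returns `exp(-87.336)` (about 1.4e-38) for a NaN input, reproducing the tiny-finite-value bug Codex flagged.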

Two regression tests:
  simd_exp_f32_propagates_nan — full-NaN vector returns full-NaN
  simd_exp_f32_propagates_nan_per_lane — mixed NaN/0.0 input; NaN lanes
    propagate, finite lanes compute exp(0)=1 unaffected

1788 passed (+2 from 1786).

Reported-by: codex review on PR #142.
@AdaWorldAPI AdaWorldAPI merged commit 8259600 into master May 13, 2026
13 of 14 checks passed
AdaWorldAPI pushed a commit that referenced this pull request May 13, 2026
This is the keystone for the cosmetic-SIMD sweep that agent #11 audited on PR #142.
That audit found 8 confirmed cosmetic SIMD wrappers in hpc/byte_scan.rs,
hpc/palette_codec.rs, and hpc/aabb.rs — `#[target_feature(enable = "avx2")]`
decorating scalar bodies that gave zero speedup over plain scalar. The
root cause: there was no `U8x32` type in the polyfill, so consumers
couldn't write SIMD byte code at AVX2's natural width (32 bytes = one
__m256i ymm register).

This PR adds U8x32 with real __m256i storage and 26 polyfill methods
mirroring `simd_avx512::U8x64`:

Constructors:    splat, from_slice, from_array, to_array, copy_to_slice
Reductions:      reduce_sum (wrap-add), reduce_min, reduce_max, sum_bytes_u64
Min/max:         simd_min, simd_max  (_mm256_min_epu8, _mm256_max_epu8)
Compare→mask:    cmpeq_mask → u32, cmpgt_mask → u32 (unsigned via xor 0x80),
                 movemask → u32  (matches _mm256_movemask_epi8 width)
Saturating:      saturating_add, saturating_sub  (_mm256_adds/subs_epu8)
Avg:             pairwise_avg  (_mm256_avg_epu8, round-up)
Shifts:          shr_epi16, shl_epi16  (16-bit lane shifts via _mm256_srl/sll_epi16)
Shuffles:        shuffle_bytes  (within-128-bit-lane, _mm256_shuffle_epi8)
                 permute_bytes  (cross-lane, scalar fallback — AVX2 has no
                 native cross-lane byte permute; matches U8x64's behavior
                 on AVX-512F-without-VBMI hosts)
                 unpack_lo_epi8, unpack_hi_epi8  (_mm256_unpacklo/hi_epi8)
Conditional:     mask_blend  (_mm256_blendv_epi8, MSB-driven, NOT bitmask)
LUT:             nibble_popcount_lut

Plus operators: BitAnd, BitOr, BitXor, Add (wrapping), Sub (wrapping),
Debug, Default. All ~26 methods.

Re-exported from `crate::simd::U8x32` for both AVX-512 and AVX2 build
tiers — U8x32 is the natural AVX2 byte width and is needed regardless
of whether AVX-512's U8x64 is the consumer's preferred width.

Soundness model matches the rest of simd_avx2.rs: `_mm256_*` intrinsics
are wrapped in `unsafe { }` blocks inside safe `pub fn`, trusting that
AVX2 is the compile target (x86-64-v3 is project baseline). The codebase
uses this pattern already in the AVX2 popcount at simd_avx2.rs:357.

Test coverage:
- 18 new tests in `mod u8x32_tests` covering: roundtrip, sum/min/max
  reductions, unsigned cmp masks (incl. high-byte > 127 to verify the
  XOR-0x80 unsigned trick), saturating add/sub clamps, pairwise_avg
  round-up, shr_epi16 nibble extraction, permute_bytes reverse,
  mask_blend per-MSB selection, nibble_popcount_lut via shuffle_bytes.
- All 18 pass. Total test count 1786 → 1804 with no regressions.

clippy --features rayon -- -D warnings: clean.

Companion: this PR unblocks the round-3 consumer fleet which will
rewrite byte_find_all_avx2 / pack_indices / aabb_intersect_batch_sse41
and friends to use `crate::simd::U8x32` instead of `#[target_feature]`
wrappers around scalar code. Each consumer rewrite ships as its own PR
in the next wave.
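Two of the lane tricks listed above are easy to model per byte in scalar code; these are illustrative single-lane models of the vectorized methods, not the polyfill itself:

```rust
// AVX2 has only signed byte compares; XOR-ing 0x80 into both operands maps
// unsigned order onto signed order, so a signed > yields the unsigned answer.
// Models one lane of cmpgt_mask.
fn cmpgt_epu8(a: u8, b: u8) -> bool {
    ((a ^ 0x80) as i8) > ((b ^ 0x80) as i8)
}

// _mm256_avg_epu8 semantics: (a + b + 1) >> 1, i.e. average rounded up.
// Models one lane of pairwise_avg.
fn avg_epu8(a: u8, b: u8) -> u8 {
    ((a as u16 + b as u16 + 1) >> 1) as u8
}

fn main() {
    // High-byte case (> 127) that a naive signed compare gets wrong:
    assert!(cmpgt_epu8(200, 100));
    assert!(!cmpgt_epu8(100, 200));
    assert_eq!(avg_epu8(1, 2), 2);      // rounds up
    assert_eq!(avg_epu8(255, 255), 255); // no overflow through the u16 widen
}
```

The high-byte test case mirrors the `u8x32_tests` coverage: interpreting 200 as i8 gives -56, so without the XOR trick a signed compare would claim 200 < 100.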
AdaWorldAPI pushed a commit that referenced this pull request May 13, 2026
Complete the portable-simd backend started in the scaffold commit.
12 Sonnet agents (round-3-portable-simd fleet) populated each of the
12 sub-files in `src/simd_nightly/` via the A2A blackboard pattern at
`.claude/board/AGENT_LOG.md`.

Total: ~4,022 LOC of wrapper code + 76 parity tests.

Per-file (line counts at commit):
  - f32_types.rs (395)    — F32x16, F32x8
  - f64_types.rs (307)    — F64x8, F64x4
  - u8_types.rs (1043)    — U8x32, U8x64 + 26 in-file tests
  - u_word_types.rs (520) — U16x32, U32x16, U32x8, U64x8, U64x4
  - i8_types.rs (263)     — I8x32, I8x64
  - i_word_types.rs (449) — I16x16, I16x32, I32x16, I64x8
  - masks.rs (196)        — F32Mask16, F32Mask8, F64Mask8, F64Mask4
  - bf16_types.rs (248)   — BF16x16, BF16x8 (scalar emulation;
                            core::simd has no half-precision)
  - f16_types.rs (220)    — F16x16 (scalar IEEE-754 binary16 emulation)
  - ops.rs (265)          — Add/Sub/Mul/Div/Neg + bitwise + Default
                            macros, applied to all 17 numeric types
  - exotic_methods.rs (329) — permute_bytes / shuffle_bytes / mask_blend /
                              unpack_lo_epi8 / unpack_hi_epi8 scalar
                              fallbacks for U8x32 + U8x64 (core::simd
                              has no native cross-lane byte ops or
                              bitmask-driven blend)
  - tests.rs (815)        — 76 parity tests vs scalar reference

30 types total (mirrors the AVX-512 / AVX2 polyfill surface 1:1).
All re-exported flat from `crate::simd_nightly::*` via the mod.rs
aggregator.

Verification:
  rustup run nightly cargo check --features nightly-simd -p ndarray --lib
    → Finished, 0 errors
  rustup run nightly cargo test --features nightly-simd -p ndarray --lib simd_nightly
    → test result: ok. 153 passed; 0 failed
  cargo check --lib (stable, default features, no nightly-simd)
    → Finished, 0 errors (the existing intrinsics dispatch is unchanged)

Cross-agent findings worth folding into a handover note:
  - `std::simd::StdFloat` is the trait that provides mul_add/sqrt/round/
    floor on core::simd float vectors. `core::simd::num::SimdFloat`
    provides reduce/min/max/clamp but NOT the transcendentals.
  - `core::simd::cmp::SimdOrd` is needed for simd_min/simd_max on
    integer vectors (SimdPartialOrd alone is not sufficient).
  - `core::simd::Mask::to_bitmask()` always returns u64 regardless of
    lane count. Wrappers cast `as u8` / `as u16` / `as u32` for narrower
    bitmask shapes.
  - `core::simd::Simd::swizzle` is `const N: usize` — cannot take a
    runtime index vector. permute_bytes / shuffle_bytes need scalar
    fallback. Same shape as the AVX-512F-without-VBMI fallback path in
    simd_avx512.rs added in PR #142.

What this enables:
  Miri can execute every method here (intrinsics-based backends are
  opaque to miri). Consumers who want miri-runnable SIMD tests import
  from `ndarray::simd_nightly::*` explicitly. The main polyfill via
  `crate::simd::*` continues to use intrinsics — the nightly-simd
  feature does NOT replace the production dispatch, it provides a
  parallel namespace for miri tooling.

Fleet output in .claude/board/AGENT_LOG.md (round-3-portable-simd
section). 6 of 12 agents hit the same pre-existing AGENT_LOG-write
permission block from round 2; their entries were backfilled by the main thread.