feat(hpc): SIMD-accelerate activations and vml via F32x16 compat types

activations.rs:
- Add standalone sigmoid_f32, softmax_f32, log_softmax_f32 functions using F32x16 polynomial exp. 3-pass softmax: SIMD max → SIMD exp+sum → SIMD normalize.
- Generic trait impl unchanged. 6 new tests.

vml.rs (linter-applied changes preserved):
- vsexp, vssqrt, vsabs now use F32x16 SIMD (16-wide main loop + scalar tail)
- vsadd, vsmul, vsdiv use F32x16 operator overloads

820 tests pass, zero regressions.

https://claude.ai/code/session_01CdqyUTUfjKZuk8YGJzv6LB#21
…able types

The wrapper IS the dispatch boundary. F32x16/F64x8 compile to optimal instructions per target (AVX-512 on x86_64, scalar loops elsewhere).

- D1: simd_compat.rs provides F32x16, F64x8, F32Mask16, F64Mask8 with AVX-512 backend and scalar fallback. Full API: splat, from_slice, reduce_sum, mul_add, abs, sqrt, comparisons, mask select, all operators.
- D2: kernels_avx512.rs refactored: BLAS-1 (dot, axpy, scal, nrm2, asum) and element-wise ops now use compat types instead of raw __m512 intrinsics. GEMM microkernels retain raw intrinsics for masked stores/broadcast-FMA.
- D5: vml.rs wired through SIMD: vsexp uses simd_exp_f32 polynomial, vssqrt/vsabs/vsadd/vsmul/vsdiv use F32x16 ops. 16-wide SIMD main loop with scalar tail.
- 11 new tests for compat types, all 1052+ existing tests pass.

https://claude.ai/code/session_01CdqyUTUfjKZuk8YGJzv6LB
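For readers unfamiliar with the "16-wide main loop with scalar tail" pattern named in the commit message, here is a minimal sketch in plain Rust. It does not reproduce the PR's F32x16 type or its API. `LANES` and `vsabs_sketch` are hypothetical names, and the fixed-size inner loop only stands in for a single 16-lane vector operation.

```rust
/// Assumed lane count, matching the 16 x f32 width described in the commit.
const LANES: usize = 16;

/// Hypothetical element-wise |x| using a 16-wide main loop and a scalar tail.
/// This is an illustration of the loop shape, not the PR's vsabs.
fn vsabs_sketch(x: &[f32], out: &mut [f32]) {
    assert_eq!(x.len(), out.len());

    let mut x_chunks = x.chunks_exact(LANES);
    let mut out_chunks = out.chunks_exact_mut(LANES);

    // Main loop: each iteration handles exactly LANES elements, which is
    // where a vector load, abs, and store would go in a SIMD version.
    for (xi, oi) in (&mut x_chunks).zip(&mut out_chunks) {
        for lane in 0..LANES {
            oi[lane] = xi[lane].abs();
        }
    }

    // Scalar tail: the final len % LANES elements that do not fill a vector.
    for (xi, oi) in x_chunks.remainder().iter().zip(out_chunks.into_remainder()) {
        *oi = xi.abs();
    }
}
```

With an input of 37 elements, the main loop runs twice (32 elements) and the tail handles the remaining 5.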
💡 Codex Review
ndarray/src/hpc/activations.rs
Lines 76 to 78 in 3b928a6
sigmoid_f32 sends -x straight into simd_exp_f32, but that helper is only implemented for roughly [-87, 87] and reconstructs 2^n by writing exponent bits directly (src/backend/simd_compat.rs:913-947). On the SIMD path (x.len() >= 16), perfectly valid activations like x = 100.0 or x = -100.0 therefore produce negative or inverted results instead of saturating near 1 or 0, which is a correctness regression versus the scalar implementation.
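The failure mode described above can be illustrated with a scalar model. This is a sketch under stated assumptions, not the PR's `simd_exp_f32`: `exp_bits_unclamped`, `exp_bits_clamped`, and `sigmoid_sketch` are hypothetical names, the polynomial is a plain degree-5 Taylor series, and the ±87 clamp bound is taken from the approximate range quoted in the review. Clamping the input before the exponent-bit reconstruction is one possible mitigation; it saturates the result instead of letting the exponent field wrap into the sign bit.

```rust
/// Scalar model of a range-limited exp: e^x = 2^n * e^r, with 2^n rebuilt by
/// writing the biased exponent directly into the f32 bit pattern.
/// Only meaningful when n + 127 lies in 1..=254, i.e. roughly |x| <= 87.
fn exp_bits_unclamped(x: f32) -> f32 {
    let n = (x * std::f32::consts::LOG2_E).round();
    let r = x - n * std::f32::consts::LN_2;
    // Degree-5 Taylor polynomial for e^r on the reduced range |r| <= ln(2)/2.
    let p = 1.0
        + r * (1.0 + r * (0.5 + r * (1.0 / 6.0 + r * (1.0 / 24.0 + r * (1.0 / 120.0)))));
    // Out-of-range n spills past the 8-bit exponent field and into the sign bit.
    let bits = ((n as i32 + 127) as u32) << 23;
    p * f32::from_bits(bits)
}

/// Same approximation with the input clamped to the assumed valid range.
fn exp_bits_clamped(x: f32) -> f32 {
    exp_bits_unclamped(x.clamp(-87.0, 87.0))
}

/// Hypothetical sigmoid built on the clamped exp; saturates near 0 and 1.
fn sigmoid_sketch(x: f32) -> f32 {
    1.0 / (1.0 + exp_bits_clamped(-x))
}
```

In this model, `exp_bits_unclamped(-100.0)` comes out negative because the biased exponent goes below zero, which is the kind of result the review describes. The clamped variant keeps `sigmoid_sketch(100.0)` close to 1 and `sigmoid_sketch(-100.0)` close to 0.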
ndarray/src/hpc/activations.rs
Lines 113 to 114 in 3b928a6
After subtracting max, softmax_f32 and log_softmax_f32 routinely feed very negative deltas into simd_exp_f32, but that helper is only valid on about [-87, 87] (src/backend/simd_compat.rs:913-947). For any vectorized chunk with a logit spread above ~88, the approximation returns huge/negative values instead of underflowing to 0, so softmax_f32 can assign probability ~1 to the wrong class and log_softmax_f32 can end up taking ln of a negative sum. The generic implementations did not have this failure mode.
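A minimal scalar sketch of the 3-pass softmax with a guard on the exp input follows. It assumes a range-limited exp like the one the review describes. `exp_limited`, `EXP_LO`, and `softmax_sketch` are hypothetical names, and the -87 cutoff is the approximate bound quoted above, not a value taken from the PR. Because every delta is at most 0 after subtracting the max, only the lower bound needs a guard, and flushing to 0 matches what a true exp would underflow to.

```rust
/// Assumed lower validity bound of the approximate exp (from the review's ~[-87, 87]).
const EXP_LO: f32 = -87.0;

/// Hypothetical range-limited exp: 2^n rebuilt via exponent bits, valid only
/// for inputs within roughly [-87, 87].
fn exp_limited(x: f32) -> f32 {
    let n = (x * std::f32::consts::LOG2_E).round();
    let r = x - n * std::f32::consts::LN_2;
    let p = 1.0
        + r * (1.0 + r * (0.5 + r * (1.0 / 6.0 + r * (1.0 / 24.0 + r * (1.0 / 120.0)))));
    p * f32::from_bits(((n as i32 + 127) as u32) << 23)
}

/// 3-pass softmax sketch: max, then exp + sum, then normalize.
fn softmax_sketch(x: &[f32], out: &mut [f32]) {
    assert_eq!(x.len(), out.len());
    if x.is_empty() {
        return;
    }
    // Pass 1: maximum logit.
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Pass 2: exp of (logit - max), flushing far-negative deltas to 0
    // instead of feeding them to the range-limited exp.
    let mut sum = 0.0f32;
    for (o, &v) in out.iter_mut().zip(x) {
        let d = v - max;
        *o = if d < EXP_LO { 0.0 } else { exp_limited(d) };
        sum += *o;
    }
    // Pass 3: normalize. sum >= 1 because the max element contributes exp(0) = 1.
    let inv = 1.0 / sum;
    for o in out.iter_mut() {
        *o *= inv;
    }
}
```

The same guard would apply to a log-softmax built on this exp, since it keeps the accumulated sum positive before the logarithm is taken.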