
feat(hpc): SIMD-accelerate activations and vml via F32x16 compat types #21

Merged
AdaWorldAPI merged 2 commits into master from claude/continue-session-0mAVa
Mar 22, 2026

Conversation

@AdaWorldAPI
Owner

No description provided.

claude added 2 commits March 22, 2026 10:40
…able types

The wrapper IS the dispatch boundary. F32x16/F64x8 compile to optimal
instructions per target (AVX-512 on x86_64, scalar loops elsewhere).

- D1: simd_compat.rs — F32x16, F64x8, F32Mask16, F64Mask8 with AVX-512
  backend and scalar fallback. Full API: splat, from_slice, reduce_sum,
  mul_add, abs, sqrt, comparisons, mask select, all operators.
- D2: kernels_avx512.rs refactored — BLAS-1 (dot, axpy, scal, nrm2, asum)
  and element-wise ops now use compat types instead of raw __m512 intrinsics.
  GEMM microkernels retain raw intrinsics for masked stores/broadcast-FMA.
- D5: vml.rs wired through SIMD — vsexp uses simd_exp_f32 polynomial,
  vssqrt/vsabs/vsadd/vsmul/vsdiv use F32x16 ops. 16-wide SIMD main loop
  with scalar tail.
- 11 new tests for compat types, all 1052+ existing tests pass.

https://claude.ai/code/session_01CdqyUTUfjKZuk8YGJzv6LB
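
The "16-wide SIMD main loop with scalar tail" pattern described in D5 can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Lanes16` and `sqrt_in_chunks` are hypothetical names, and `Lanes16` only models the scalar fallback path of a 16-lane wrapper rather than the real `F32x16` API.

```rust
/// Hypothetical stand-in for a 16-lane f32 vector type.
/// Only the scalar-fallback behaviour is modelled here.
#[derive(Clone, Copy)]
struct Lanes16([f32; 16]);

impl Lanes16 {
    /// Load the first 16 elements of `s` (panics if `s` is shorter).
    fn from_slice(s: &[f32]) -> Self {
        let mut a = [0.0f32; 16];
        a.copy_from_slice(&s[..16]);
        Lanes16(a)
    }

    /// Lane-wise square root.
    fn sqrt(self) -> Self {
        let mut a = self.0;
        for v in a.iter_mut() {
            *v = v.sqrt();
        }
        Lanes16(a)
    }

    /// Store all 16 lanes into the start of `out`.
    fn copy_to_slice(self, out: &mut [f32]) {
        out[..16].copy_from_slice(&self.0);
    }
}

/// Element-wise sqrt: 16-wide main loop, then a scalar tail
/// for the remaining `len % 16` elements.
fn sqrt_in_chunks(x: &[f32], out: &mut [f32]) {
    assert_eq!(x.len(), out.len());
    let main = x.len() / 16 * 16; // largest multiple of 16 <= len
    let mut i = 0;
    while i < main {
        Lanes16::from_slice(&x[i..]).sqrt().copy_to_slice(&mut out[i..]);
        i += 16;
    }
    for j in main..x.len() {
        out[j] = x[j].sqrt();
    }
}
```

With this shape, callers see a single function regardless of length; inputs shorter than 16 elements simply run entirely through the scalar tail.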
activations.rs:
- Add standalone sigmoid_f32, softmax_f32, log_softmax_f32 functions
  using F32x16 polynomial exp. 3-pass softmax: SIMD max → SIMD exp+sum → SIMD normalize.
- Generic trait impl unchanged. 6 new tests.

vml.rs (linter-applied changes preserved):
- vsexp, vssqrt, vsabs now use F32x16 SIMD (16-wide main loop + scalar tail)
- vsadd, vsmul, vsdiv use F32x16 operator overloads

820 tests pass, zero regressions.

https://claude.ai/code/session_01CdqyUTUfjKZuk8YGJzv6LB
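
For reference, the three passes named above (max, then exp plus sum, then normalize) look roughly like this in scalar form. This is a sketch under the assumption that the SIMD version follows the same structure; `softmax_three_pass` is a hypothetical name and uses the standard library's `f32::exp` rather than the polynomial `simd_exp_f32`.

```rust
/// Scalar sketch of a 3-pass softmax writing into `out`.
fn softmax_three_pass(x: &[f32], out: &mut [f32]) {
    assert_eq!(x.len(), out.len());
    if x.is_empty() {
        return;
    }

    // Pass 1: find the maximum so every exponent argument is <= 0.
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);

    // Pass 2: exponentiate the shifted values and accumulate their sum.
    let mut sum = 0.0f32;
    for (o, &v) in out.iter_mut().zip(x) {
        let e = (v - max).exp();
        *o = e;
        sum += e;
    }

    // Pass 3: normalize so the outputs sum to 1.
    let inv = 1.0 / sum;
    for o in out.iter_mut() {
        *o *= inv;
    }
}
```

Subtracting the maximum keeps the largest term at exactly `exp(0) = 1`, so the sum is always at least 1 and the division in pass 3 is safe for finite inputs.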
@AdaWorldAPI merged commit 0a28084 into master on Mar 22, 2026

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

let neg_v = -v;
let exp_neg = simd_exp_f32(neg_v);
let sigmoid = one / (one + exp_neg);

P1: Clamp sigmoid's SIMD exp input to its supported range

sigmoid_f32 sends -x straight into simd_exp_f32, but that helper is only implemented for roughly [-87, 87] and reconstructs 2^n by writing exponent bits directly (src/backend/simd_compat.rs:913-947). On the SIMD path (x.len() >= 16), perfectly valid activations like x = 100.0 or x = -100.0 therefore produce negative or inverted results instead of saturating near 1 or 0, which is a correctness regression versus the scalar implementation.
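
One possible shape for the suggested fix, shown in scalar form: clamp the argument before it reaches the limited-range exp. This is only a sketch. `EXP_LIMIT`, `exp_limited`, and `sigmoid_clamped` are hypothetical names, the bound of 87 is taken from the "roughly [-87, 87]" figure in this review rather than from the source file, and `exp_limited` stands in for `simd_exp_f32` by asserting its input range and then calling `f32::exp`.

```rust
/// Assumed valid input bound for the limited-range exp (from the review text).
const EXP_LIMIT: f32 = 87.0;

/// Stand-in for a range-limited exp: rejects finite inputs outside
/// [-EXP_LIMIT, EXP_LIMIT] so that missing clamps show up as panics.
fn exp_limited(x: f32) -> f32 {
    assert!(
        x.is_nan() || (-EXP_LIMIT..=EXP_LIMIT).contains(&x),
        "exp_limited called outside its supported range"
    );
    x.exp()
}

/// Sigmoid that clamps the exp argument, so large |x| saturates
/// toward 0 or 1 instead of leaving the supported range.
fn sigmoid_clamped(x: f32) -> f32 {
    let t = (-x).clamp(-EXP_LIMIT, EXP_LIMIT);
    1.0 / (1.0 + exp_limited(t))
}
```

In a vectorized version the same idea would be a lane-wise min/max against splatted bounds before the exp call; whether the compat types expose such operations is not stated in this PR.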


let shifted = F32x16::from_slice(&x[i..]) - max_v;
let exp_v = simd_exp_f32(shifted);

P1: Avoid SIMD exp for softmax deltas below about -87

After subtracting max, softmax_f32 and log_softmax_f32 routinely feed very negative deltas into simd_exp_f32, but that helper is only valid on about [-87, 87] (src/backend/simd_compat.rs:913-947). For any vectorized chunk with a logit spread above ~88, the approximation returns huge/negative values instead of underflowing to 0, so softmax_f32 can assign probability ~1 to the wrong class and log_softmax_f32 can end up taking ln of a negative sum. The generic implementations did not have this failure mode.
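
A scalar sketch of one way to guard against this: treat any delta below the cutoff as contributing exactly zero, which is what a correct exp would underflow to anyway. `EXP_MIN_INPUT`, `exp_flush_to_zero`, and `softmax_guarded` are hypothetical names, the cutoff of -87 is assumed from the figure quoted in this review, and `f32::exp` stands in for `simd_exp_f32`.

```rust
/// Assumed lower bound of the exp helper's valid range (from the review text).
const EXP_MIN_INPUT: f32 = -87.0;

/// exp that returns 0 for arguments below the supported range
/// instead of evaluating the approximation there.
fn exp_flush_to_zero(delta: f32) -> f32 {
    if delta < EXP_MIN_INPUT {
        0.0
    } else {
        delta.exp()
    }
}

/// Softmax using the guarded exp on max-shifted inputs.
fn softmax_guarded(x: &[f32]) -> Vec<f32> {
    if x.is_empty() {
        return Vec::new();
    }
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = x.iter().map(|&v| exp_flush_to_zero(v - max)).collect();
    // The maximum element always contributes exp(0) = 1, so sum >= 1.
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

Because the sum is at least 1, a log-softmax built on the same guard never takes the logarithm of a non-positive value. In SIMD form the branch would presumably become a comparison mask plus a select against zero, which matches the "comparisons, mask select" operations listed for the compat types, though the exact method names are not given here.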
