feat(simd): Phase 3 scaffold — NEON tier flavors (baseline / dotprod / bf16)#176

Merged
AdaWorldAPI merged 1 commit into master from claude/pr-x-phase3-neon-tiers on May 20, 2026
@AdaWorldAPI
Owner

Summary

Phase 3 of the integration plan in .claude/knowledge/simd-dispatch-architecture.md. Scaffold only — no dispatch changes, no behavior change. Lays out the structural skeleton + intrinsic maps + cargo configs that the actual NEON tier implementations will fill in.

aarch64 isn't monolithic, so this PR splits the dispatch surface into three tier files mirroring the x86 v3/v4/native triplet:

src/simd_neon_baseline.rs — ARMv8.0-A +neon

Floor tier. Pi 3 (A53), Pi 4 (A72), anything without dotprod/fp16/bf16. Documents the silicon and stubs out the future home of the existing simd_neon.rs::aarch64_simd 128-bit wrappers (I8x16/I16x8/U8x16/U16x8/U32x4/U64x2/I32x4/I64x2). Also lists the 8 missing 512-bit composed [neon_native; 4] wrappers (currently routed through scalar::* at simd.rs:1593).
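
As a rough illustration of what a composed 512-bit wrapper could look like, the sketch below adds sixteen i32 lanes as four 128-bit NEON operations on aarch64 and falls back to scalar code elsewhere. The type name `I32x16`, the method `add_lanes`, and the plain-array storage are assumptions made for this example; the PR only states that the wrappers will be composed as [neon_native; 4] and does not show their API.

```rust
/// Hypothetical 512-bit wrapper: sixteen i32 lanes, processed as four
/// 128-bit quarters. Storage is a plain array here for simplicity.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct I32x16(pub [i32; 16]);

impl I32x16 {
    /// Lane-wise wrapping add using baseline NEON (four `vaddq_s32` calls).
    #[cfg(target_arch = "aarch64")]
    pub fn add_lanes(self, rhs: Self) -> Self {
        use core::arch::aarch64::{vaddq_s32, vld1q_s32, vst1q_s32};
        let mut out = [0i32; 16];
        for q in 0..4 {
            // SAFETY: each quarter covers indices q*4 .. q*4+4, which stay
            // inside the 16-element arrays; NEON is baseline on aarch64.
            unsafe {
                let a = vld1q_s32(self.0.as_ptr().add(q * 4));
                let b = vld1q_s32(rhs.0.as_ptr().add(q * 4));
                vst1q_s32(out.as_mut_ptr().add(q * 4), vaddq_s32(a, b));
            }
        }
        Self(out)
    }

    /// Scalar fallback with the same wrapping semantics.
    #[cfg(not(target_arch = "aarch64"))]
    pub fn add_lanes(self, rhs: Self) -> Self {
        let mut out = [0i32; 16];
        for i in 0..16 {
            out[i] = self.0[i].wrapping_add(rhs.0[i]);
        }
        Self(out)
    }
}
```

Both arms wrap on overflow, so a test written against the scalar arm should also hold for the NEON arm once aarch64 hardware is available.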

src/simd_neon_dotprod.rs — ARMv8.2-A +dotprod,+fp16

Pi 5 (A76, BCM2712), Cortex-A75+, Apple A11+, Snapdragon 8 Gen 1+. The dotprod functions already exist in simd_neon.rs:191-237 and will migrate here. The F16 stubs are new: the module documents the full vfmaq_f16 intrinsic map (vaddq_f16, vfmaq_f16, vsqrtq_f16, vaddvq_f16, …) together with the stable-Rust asm-byte workaround, following the simd_amx.rs precedent (Rust issue #112800 keeps the intrinsics nightly-only).
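
For readers unfamiliar with the dotprod tier, here is a scalar reference model of the signed dot-product operation (`vdotq_s32` / SDOT): each 32-bit accumulator lane adds the dot product of its own group of four i8 pairs. The function name `sdot_reference` is made up for this sketch and is not taken from simd_neon.rs; it is meant as a possible test oracle, not as the existing implementation.

```rust
/// Scalar model of SDOT on one 128-bit register pair:
/// out[lane] = acc[lane] + sum over j in 0..4 of a[4*lane + j] * b[4*lane + j].
/// `sdot_reference` is a hypothetical helper name.
pub fn sdot_reference(acc: [i32; 4], a: [i8; 16], b: [i8; 16]) -> [i32; 4] {
    let mut out = acc;
    for lane in 0..4 {
        for j in 0..4 {
            let idx = lane * 4 + j;
            // Widen to i32 before multiplying so the product cannot overflow i8.
            let product = a[idx] as i32 * b[idx] as i32;
            out[lane] = out[lane].wrapping_add(product);
        }
    }
    out
}
```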

src/simd_neon_bf16.rs — ARMv8.6-A +bf16

Apple M2/M3/M4, Snapdragon X Elite, Cortex-A510+, Graviton 3/4, NVIDIA Grace, Ampere One. Apple M1 explicitly NOT in this tier (M1 is ARMv8.5-A, no BF16). Stubs BF16x8 (bfloat16x8_t) and BF16x16 ([bfloat16x8_t; 2]). Documents BFMMLA as the prize intrinsic — 2×2 outer product in one instruction, ~32 GFLOP/s/core on M2 in bf16-matmul-bound kernels. Same asm-byte fallback strategy (Rust issue #117222).
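
To make the BFMMLA description concrete, the following scalar sketch models the arithmetic as I understand it from the Arm documentation: the first operand is a 2×4 bf16 matrix, the second holds two rows of four bf16 values that act as the columns of a 4×2 matrix, and the 2×2 f32 product is added to the accumulator. The helper names are hypothetical, and the model uses ordinary f32 arithmetic, so it does not reproduce the instruction's exact rounding behaviour.

```rust
/// Widen a bf16 bit pattern to f32 (bf16 is the top 16 bits of an f32).
pub fn bf16_to_f32(x: u16) -> f32 {
    f32::from_bits((x as u32) << 16)
}

/// Truncating f32 -> bf16 conversion; good enough for test inputs that are
/// exactly representable in bf16.
pub fn f32_to_bf16_trunc(x: f32) -> u16 {
    (x.to_bits() >> 16) as u16
}

/// Scalar model of BFMMLA. `a` is a row-major 2x4 matrix, `b` holds two rows
/// of four values (the transposed right-hand side), `acc` is row-major 2x2.
/// out[i][j] = acc[i][j] + sum over k in 0..4 of a[i][k] * b[j][k].
pub fn bfmmla_reference(acc: [f32; 4], a: [u16; 8], b: [u16; 8]) -> [f32; 4] {
    let mut out = acc;
    for i in 0..2 {
        for j in 0..2 {
            let mut sum = 0.0f32;
            for k in 0..4 {
                sum += bf16_to_f32(a[i * 4 + k]) * bf16_to_f32(b[j * 4 + k]);
            }
            out[i * 2 + j] += sum;
        }
    }
    out
}
```

A model like this could serve as an oracle when the byte-encoded asm version is eventually checked on real hardware, with a tolerance chosen to absorb the rounding differences.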

.cargo/config-{pi5,apple-m2,graviton}.toml

Cargo configs matching the x86 v3/v4/native triplet shape:

| Config | target-cpu | target-feature | Tier |
| --- | --- | --- | --- |
| config-pi5.toml | cortex-a76 | +dotprod,+fp16 | dotprod |
| config-apple-m2.toml | apple-m2 | +bf16,+dotprod,+fp16,+i8mm | bf16 |
| config-graviton.toml | neoverse-v2 | +bf16,+dotprod,+fp16,+i8mm | bf16 |

Runtime detection (already wired)

simd.rs::detect_tier() already distinguishes Tier::Neon vs Tier::NeonDotProd via is_aarch64_feature_detected!("dotprod") at line 63. The Tier::NeonBf16 variant + check is a TODO for when the bf16 impls land.
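
A minimal sketch of how the three-way detection might look once the bf16 check is added. The variant names Neon, NeonDotProd and NeonBf16 come from the text above; the `Scalar` variant, the `classify` helper, and the rule that the bf16 tier also requires dotprod are assumptions for illustration and may not match simd.rs.

```rust
/// Tier names follow the PR text; `Scalar` is an assumed non-aarch64 fallback.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    Scalar,
    Neon,
    NeonDotProd,
    NeonBf16,
}

/// Pure decision logic, kept separate so it can be unit-tested on any host.
/// Assumption: the bf16 tier is only selected when dotprod is also present.
pub fn classify(is_aarch64: bool, has_dotprod: bool, has_bf16: bool) -> Tier {
    if !is_aarch64 {
        Tier::Scalar
    } else if has_dotprod && has_bf16 {
        Tier::NeonBf16
    } else if has_dotprod {
        Tier::NeonDotProd
    } else {
        Tier::Neon
    }
}

/// Runtime probe on aarch64 via the standard feature-detection macro.
#[cfg(target_arch = "aarch64")]
pub fn detect_tier() -> Tier {
    classify(
        true,
        std::arch::is_aarch64_feature_detected!("dotprod"),
        std::arch::is_aarch64_feature_detected!("bf16"),
    )
}

/// Every other architecture reports the assumed scalar tier.
#[cfg(not(target_arch = "aarch64"))]
pub fn detect_tier() -> Tier {
    classify(false, false, false)
}
```

Keeping `classify` free of `cfg` attributes means the tier ordering can be tested in the existing x86_64 CI even though the probe itself only runs on aarch64.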

What this is NOT

  • No dispatch changes in simd.rs — the new modules aren't wired into crate::simd::* yet. They're declared in lib.rs as pub mod simd_neon_* so they participate in fmt/clippy/check from day one, but nothing reaches into them.
  • No functional NEON code added — only docs + stubs + intrinsic maps. The F16x16Stub / BF16x8Stub / BF16x16Stub placeholder structs are deliberately useless (unimplemented!() with pointers to module docs); they exist so consumers can grep for the name and find the implementation roadmap.
  • No tests added — there's nothing to test until the asm-byte intrinsics are written. The non-aarch64 CI is unaffected (everything is #[cfg(target_arch = "aarch64")] gated).
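
The placeholder pattern described in the list above might look roughly like this. The struct name BF16x8Stub is from the PR; the `LANES` constant, the `splat` method, and the panic message are invented for the sketch, and the aarch64 `cfg` gate is left off so the snippet compiles on any host.

```rust
/// Placeholder for the future BF16x8 type. In the PR the stubs are gated on
/// `target_arch = "aarch64"`; the gate is omitted here on purpose.
#[derive(Clone, Copy, Debug, Default)]
pub struct BF16x8Stub;

impl BF16x8Stub {
    /// Lane count of the eventual 128-bit bf16 vector (hypothetical constant).
    pub const LANES: usize = 8;

    /// Hypothetical constructor: panics and points the caller at the module
    /// docs instead of doing any work.
    pub fn splat(_value: f32) -> Self {
        unimplemented!("BF16x8 is a stub; see the module docs in simd_neon_bf16.rs")
    }
}
```

Because the type is zero-sized and every method panics, nothing can silently depend on it, while a grep for the name still leads to the roadmap.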

Why scaffold first

Without aarch64 CI silicon we can't verify byte-encoded asm correctness. Landing the scaffold, docs, and intrinsic maps first has three benefits:

  1. The next contributor (or future Claude session) can start from a precise spec rather than re-deriving the silicon matrix.
  2. Cargo configs already exist for cross-build smoke tests on whatever Pi 5 / Mac / Graviton hardware becomes available.
  3. F16/BF16 stub names enter the consumer namespace (gated), so any code path that wants to special-case them can already write the #[cfg(target_feature = "bf16")] arm.
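
A consumer-side `cfg` arm of the kind mentioned in the last item could be sketched as follows. The function name `bf16_kernel_name` and the returned labels are hypothetical; the point is only that the bf16 arm compiles in when the build enables the feature (for example through one of the bf16 cargo configs) and the portable arm compiles in otherwise.

```rust
/// Compiled only when the build targets aarch64 with bf16 enabled.
#[cfg(all(target_arch = "aarch64", target_feature = "bf16"))]
pub fn bf16_kernel_name() -> &'static str {
    // A real consumer would call into the bf16 tier here once it exists.
    "neon-bf16"
}

/// Compiled for every other target or feature set.
#[cfg(not(all(target_arch = "aarch64", target_feature = "bf16")))]
pub fn bf16_kernel_name() -> &'static str {
    "portable"
}
```

This is a compile-time choice, so it complements rather than replaces the runtime check in detect_tier().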

Test plan

  • CI green — pure additive on non-aarch64; no x86_64 dispatch arm changed.
  • cargo check --target=aarch64-unknown-linux-gnu from any host — verifies module declarations + stubs compile clean.
  • (Future) cargo --config .cargo/config-pi5.toml check --target=aarch64-unknown-linux-gnu once the dotprod tier implementation lands.

Generated by Claude Code

Splits the aarch64 dispatch surface into three tier files mirroring
the x86 v3/v4/native split. Each file documents the silicon, the
runtime and compile-time detection paths, and stubs out the tier-
specific types with intrinsic maps for future implementation.

src/simd_neon_baseline.rs
-------------------------

Tier floor — ARMv8.0-A `+neon` only. Pi 3 (A53), Pi 4 (A72), anything
that doesn't have dotprod/fp16/bf16. Native 128-bit lanes only;
composed 512-bit wrappers TODO (currently routed through `scalar::*`
fallback in `simd.rs:1593`). Placeholder for the future migration of
`simd_neon.rs::aarch64_simd` (lines 463-1126) into this file.

src/simd_neon_dotprod.rs
------------------------

ARMv8.2-A `+dotprod,+fp16`. Pi 5 (BCM2712, A76), Cortex-A75 and later,
Apple A11+, Snapdragon 8 Gen 1+. dotprod functions already implemented
in `simd_neon.rs:191-237` (will migrate to this file); F16 stubs new.
F16 intrinsic map documents the `vfmaq_f16` family with the stable-
Rust asm-byte workaround (issue #112800) following the AMX precedent
in `src/simd_amx.rs`.

src/simd_neon_bf16.rs
---------------------

ARMv8.6-A `+bf16` (or ARMv8.4-A + optional `+bf16`). Apple M2/M3/M4,
Snapdragon X Elite, Cortex-A510+, Graviton 3/4, Grace, Ampere One.
Apple M1 explicitly NOT in this tier (M1 is v8.5-A). Stubs `BF16x8`
(`bfloat16x8_t`) and `BF16x16` (`[bfloat16x8_t; 2]`). Documents BFMMLA
as the prize intrinsic (2×2 outer product in one instruction, ~32
GFLOP/s/core on M2 in bf16-matmul-bound kernels) and the asm-byte
fallback for stable Rust (issue #117222, mirrors `simd_amx.rs`).

.cargo/config-{pi5,apple-m2,graviton}.toml
------------------------------------------

Three cargo configs matching the x86 v3/v4/native triplet shape:
  - config-pi5.toml → -Ctarget-cpu=cortex-a76 +dotprod,+fp16
  - config-apple-m2.toml → -Ctarget-cpu=apple-m2 +bf16,+dotprod,+fp16,+i8mm
  - config-graviton.toml → -Ctarget-cpu=neoverse-v2 +bf16,+dotprod,+fp16,+i8mm

src/lib.rs
----------

Three new gated module declarations (`#[cfg(all(target_arch =
"aarch64", feature = "std"))]`). No dispatch changes in `simd.rs` —
this PR is scaffold + docs only. The dispatch wiring lands in a
follow-up once the F16/BF16 implementations exist to dispatch to.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ac2a9647de


#
# Also works on:
# - Cortex-X3 / X4 / X925 generic Linux servers
# - Ampere Altra (V1-class — same baseline)

P2: Remove Altra compatibility claim from Graviton V2 config

This config hard-codes -Ctarget-cpu=neoverse-v2 plus +bf16,+i8mm (line 19), but line 12 says it also works on Ampere Altra. That pairing is unsafe: Altra-class deployments are commonly Neoverse-N1/v8.2 and do not provide the same feature baseline, so binaries built with this profile can execute unsupported instructions and crash with illegal-instruction faults on those hosts.


@AdaWorldAPI AdaWorldAPI merged commit 96d01ce into master May 20, 2026
16 checks passed