feat(simd): Phase 3 scaffold — NEON tier flavors (baseline / dotprod / bf16) by AdaWorldAPI · Pull Request #176 · AdaWorldAPI/ndarray

AdaWorldAPI · 2026-05-20T13:50:20Z

Summary

Phase 3 of the integration plan in .claude/knowledge/simd-dispatch-architecture.md. Scaffold only — no dispatch changes, no behavior change. Lays out the structural skeleton + intrinsic maps + cargo configs that the actual NEON tier implementations will fill in.

aarch64 isn't monolithic, so this PR splits the dispatch surface into three tier files mirroring the x86 v3/v4/native triplet:

`src/simd_neon_baseline.rs` — ARMv8.0-A `+neon`

Floor tier. Pi 3 (A53), Pi 4 (A72), anything without dotprod/fp16/bf16. Documents the silicon and stubs out the future home of the existing simd_neon.rs::aarch64_simd 128-bit wrappers (I8x16/I16x8/U8x16/U16x8/U32x4/U64x2/I32x4/I64x2). Also lists the 8 missing 512-bit composed [neon_native; 4] wrappers (currently routed through scalar::* at simd.rs:1593).
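To illustrate what a composed 512-bit wrapper over `[neon_native; 4]` could look like, here is a minimal sketch. The name `I32x16` and its methods are hypothetical (the PR does not name the eight missing wrappers), and the non-aarch64 arm exists only so the snippet compiles on any host.

```rust
// Hypothetical composed wrapper: sixteen i32 lanes stored as four 128-bit quarters.
#[cfg(target_arch = "aarch64")]
mod imp {
    use core::arch::aarch64::{int32x4_t, vaddq_s32, vld1q_s32, vst1q_s32};

    #[derive(Clone, Copy)]
    pub struct I32x16([int32x4_t; 4]);

    impl I32x16 {
        pub fn from_array(a: [i32; 16]) -> Self {
            // SAFETY: NEON is baseline on aarch64; each load reads 4 in-bounds i32s.
            unsafe {
                Self([
                    vld1q_s32(a.as_ptr()),
                    vld1q_s32(a.as_ptr().add(4)),
                    vld1q_s32(a.as_ptr().add(8)),
                    vld1q_s32(a.as_ptr().add(12)),
                ])
            }
        }

        pub fn to_array(self) -> [i32; 16] {
            let mut out = [0i32; 16];
            // SAFETY: each store writes 4 in-bounds i32s.
            unsafe {
                for (i, q) in self.0.iter().enumerate() {
                    vst1q_s32(out.as_mut_ptr().add(4 * i), *q);
                }
            }
            out
        }

        /// Lane-wise wrapping add, one 128-bit instruction per quarter.
        pub fn add(self, rhs: Self) -> Self {
            // SAFETY: NEON is baseline on aarch64.
            unsafe {
                Self([
                    vaddq_s32(self.0[0], rhs.0[0]),
                    vaddq_s32(self.0[1], rhs.0[1]),
                    vaddq_s32(self.0[2], rhs.0[2]),
                    vaddq_s32(self.0[3], rhs.0[3]),
                ])
            }
        }
    }
}

// Portable stand-in with the same shape, used on non-aarch64 hosts.
#[cfg(not(target_arch = "aarch64"))]
mod imp {
    #[derive(Clone, Copy)]
    pub struct I32x16([[i32; 4]; 4]);

    impl I32x16 {
        pub fn from_array(a: [i32; 16]) -> Self {
            let mut q = [[0i32; 4]; 4];
            for (i, v) in a.iter().enumerate() {
                q[i / 4][i % 4] = *v;
            }
            Self(q)
        }

        pub fn to_array(self) -> [i32; 16] {
            let mut out = [0i32; 16];
            for (i, slot) in out.iter_mut().enumerate() {
                *slot = self.0[i / 4][i % 4];
            }
            out
        }

        pub fn add(self, rhs: Self) -> Self {
            let mut q = self.0;
            for i in 0..4 {
                for j in 0..4 {
                    q[i][j] = q[i][j].wrapping_add(rhs.0[i][j]);
                }
            }
            Self(q)
        }
    }
}

pub use imp::I32x16;
```

The point of the shape is that each 512-bit operation becomes four independent 128-bit NEON operations instead of a scalar loop; whether the real wrappers follow this layout is up to the follow-up PR.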

`src/simd_neon_dotprod.rs` — ARMv8.2-A `+dotprod,+fp16`

Pi 5 (A76, BCM2712), Cortex-A75+, Apple A11+, Snapdragon 8 Gen 1+. The dotprod functions already exist in simd_neon.rs:191-237 and will migrate here. The F16 stubs are new: the module documents the vfmaq_f16 intrinsic family (vaddq_f16, vfmaq_f16, vsqrtq_f16, vaddvq_f16, …) together with the stable-Rust asm-byte workaround, following the simd_amx.rs precedent (per Rust issue #112800 the intrinsics remain nightly-only).

`src/simd_neon_bf16.rs` — ARMv8.6-A `+bf16`

Apple M2/M3/M4, Snapdragon X Elite, Cortex-A510+, Graviton 3/4, NVIDIA Grace, Ampere One. Apple M1 is explicitly NOT in this tier (M1 is ARMv8.5-A, no BF16). Stubs BF16x8 (bfloat16x8_t) and BF16x16 ([bfloat16x8_t; 2]). Documents BFMMLA as the prize intrinsic: a single instruction multiplies a 2×4 BF16 matrix by a 4×2 BF16 matrix and accumulates into a 2×2 FP32 tile, quoted here at ~32 GFLOP/s/core on M2 in bf16-matmul-bound kernels. Same asm-byte fallback strategy (Rust issue #117222).
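As a reading aid for what BFMMLA computes, the following is a scalar reference model, not the intrinsic itself. It assumes the usual Arm description of the instruction: the first source holds a 2×4 BF16 matrix row by row, the second source holds the four-element columns of a 4×2 matrix back to back, and the 2×2 FP32 accumulator is row-major. The function names are hypothetical, and the model ignores the hardware's exact rounding and denormal behavior.

```rust
/// Widen a bf16 bit pattern to f32: bf16 is the upper 16 bits of an IEEE-754 f32.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Scalar model of one BFMMLA step: acc (2x2, row-major) += A (2x4) * B (4x2).
/// `a` stores A row-major; `b` stores column j of B at b[4*j..4*j+4] (assumed layout).
fn bfmmla_ref(acc: [f32; 4], a: [u16; 8], b: [u16; 8]) -> [f32; 4] {
    let mut out = acc;
    for i in 0..2 {
        for j in 0..2 {
            let mut sum = 0.0f32;
            for k in 0..4 {
                sum += bf16_to_f32(a[4 * i + k]) * bf16_to_f32(b[4 * j + k]);
            }
            out[2 * i + j] += sum;
        }
    }
    out
}
```

Each call covers 16 multiplies and 16 adds, which is why a matmul kernel built around this one instruction is attractive. A model like this could also serve as the oracle for testing a future asm-byte implementation.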

`.cargo/config-{pi5,apple-m2,graviton}.toml`

Cargo configs matching the x86 v3/v4/native triplet shape:

| Config | target-cpu | target-feature | Tier |
| --- | --- | --- | --- |
| `config-pi5.toml` | `cortex-a76` | `+dotprod,+fp16` | dotprod |
| `config-apple-m2.toml` | `apple-m2` | `+bf16,+dotprod,+fp16,+i8mm` | bf16 |
| `config-graviton.toml` | `neoverse-v2` | `+bf16,+dotprod,+fp16,+i8mm` | bf16 |

Runtime detection (already wired)

simd.rs::detect_tier() already distinguishes Tier::Neon vs Tier::NeonDotProd via is_aarch64_feature_detected!("dotprod") at line 63. The Tier::NeonBf16 variant + check is a TODO for when the bf16 impls land.
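A minimal sketch of how the three-way runtime check might look once the bf16 variant exists. This is not the code in simd.rs: the `Scalar` variant and the ordering of the checks are assumptions, and only `Tier::Neon`, `Tier::NeonDotProd`, and the planned `Tier::NeonBf16` come from the PR text.

```rust
/// Tiers ordered from least to most capable. `Scalar` is an assumed
/// placeholder for hosts that are not aarch64.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Tier {
    Scalar,
    Neon,
    NeonDotProd,
    NeonBf16, // planned variant, still a TODO in the PR
}

#[cfg(target_arch = "aarch64")]
pub fn detect_tier() -> Tier {
    use std::arch::is_aarch64_feature_detected;
    // Probe the most capable tier first and fall through to the floor.
    if is_aarch64_feature_detected!("bf16") && is_aarch64_feature_detected!("dotprod") {
        Tier::NeonBf16
    } else if is_aarch64_feature_detected!("dotprod") {
        Tier::NeonDotProd
    } else {
        Tier::Neon
    }
}

#[cfg(not(target_arch = "aarch64"))]
pub fn detect_tier() -> Tier {
    Tier::Scalar
}
```

Deriving `Ord` lets callers write checks such as `detect_tier() >= Tier::NeonDotProd` instead of matching on every variant.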

What this is NOT

- No dispatch changes in simd.rs: the new modules aren't wired into crate::simd::* yet. They're declared in lib.rs as pub mod simd_neon_* so they participate in fmt/clippy/check from day one, but nothing reaches into them.
- No functional NEON code added: only docs + stubs + intrinsic maps. The F16x16Stub / BF16x8Stub / BF16x16Stub placeholder structs are deliberately useless (unimplemented!() with pointers to module docs); they exist so consumers can grep for the name and find the implementation roadmap.
- No tests added: there's nothing to test until the asm-byte intrinsics are written. The non-aarch64 CI is unaffected (everything is #[cfg(target_arch = "aarch64")] gated).

Why scaffold first

Without aarch64 CI silicon we can't verify byte-encoded asm correctness. Landing the scaffold + docs + intrinsic maps first means:

- The next contributor (or future Claude session) can start from a precise spec rather than re-deriving the silicon matrix.
- Cargo configs already exist for cross-build smoke tests on whatever Pi 5 / Mac / Graviton hardware becomes available.
- F16/BF16 stub names enter the consumer namespace (gated), so any code path that wants to special-case them can already write the #[cfg(target_feature = "bf16")] arm.
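The compile-time arm mentioned above can be sketched as follows. The function names are hypothetical, and the bf16 arm simply defers to the portable path because the stubs in this PR are `unimplemented!()`.

```rust
/// Portable path: widen each bf16 bit pattern to f32 and sum.
fn sum_bf16_portable(xs: &[u16]) -> f32 {
    xs.iter().map(|&b| f32::from_bits((b as u32) << 16)).sum()
}

/// Arm compiled only when building for aarch64 with `+bf16` enabled.
#[cfg(all(target_arch = "aarch64", target_feature = "bf16"))]
pub fn sum_bf16(xs: &[u16]) -> f32 {
    // A real implementation would call into the bf16 tier here; until the
    // stubs are filled in, this hypothetical arm reuses the portable path.
    sum_bf16_portable(xs)
}

/// Fallback arm for every other target or feature set.
#[cfg(not(all(target_arch = "aarch64", target_feature = "bf16")))]
pub fn sum_bf16(xs: &[u16]) -> f32 {
    sum_bf16_portable(xs)
}
```

Because the choice is made by `cfg`, a build using one of the new cargo configs selects the bf16 arm with no runtime branch, while default builds keep the fallback.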

Test plan

- CI green: purely additive on non-aarch64; no x86_64 dispatch arm changed.
- cargo check --target=aarch64-unknown-linux-gnu from any host: verifies that the module declarations + stubs compile clean.
- (Future) cargo --config .cargo/config-pi5.toml check --target=aarch64-unknown-linux-gnu once the dotprod tier implementation lands.

Generated by Claude Code

Splits the aarch64 dispatch surface into three tier files mirroring the x86 v3/v4/native split. Each file documents the silicon, the runtime and compile-time detection paths, and stubs out the tier-specific types with intrinsic maps for future implementation.

src/simd_neon_baseline.rs
-------------------------
Tier floor: ARMv8.0-A `+neon` only. Pi 3 (A53), Pi 4 (A72), anything that doesn't have dotprod/fp16/bf16. Native 128-bit lanes only; composed 512-bit wrappers TODO (currently routed through `scalar::*` fallback in `simd.rs:1593`). Placeholder for the future migration of `simd_neon.rs::aarch64_simd` (lines 463-1126) into this file.

src/simd_neon_dotprod.rs
------------------------
ARMv8.2-A `+dotprod,+fp16`. Pi 5 (BCM2712, A76), Cortex-A75 and later, Apple A11+, Snapdragon 8 Gen 1+. dotprod functions already implemented in `simd_neon.rs:191-237` (will migrate to this file); F16 stubs new. F16 intrinsic map documents the `vfmaq_f16` family with the stable-Rust asm-byte workaround (issue #112800) following the AMX precedent in `src/simd_amx.rs`.

src/simd_neon_bf16.rs
---------------------
ARMv8.6-A `+bf16` (or ARMv8.4-A + optional `+bf16`). Apple M2/M3/M4, Snapdragon X Elite, Cortex-A510+, Graviton 3/4, Grace, Ampere One. Apple M1 explicitly NOT in this tier (M1 is v8.5-A). Stubs `BF16x8` (`bfloat16x8_t`) and `BF16x16` (`[bfloat16x8_t; 2]`). Documents BFMMLA as the prize intrinsic (2×2 outer product in one instruction, ~32 GFLOP/s/core on M2 in bf16-matmul-bound kernels) and the asm-byte fallback for stable Rust (issue #117222, mirrors `simd_amx.rs`).

.cargo/config-{pi5,apple-m2,graviton}.toml
------------------------------------------
Three cargo configs matching the x86 v3/v4/native triplet shape:

- config-pi5.toml → -Ctarget-cpu=cortex-a76 +dotprod,+fp16
- config-apple-m2.toml → -Ctarget-cpu=apple-m2 +bf16,+dotprod,+fp16,+i8mm
- config-graviton.toml → -Ctarget-cpu=neoverse-v2 +bf16,+dotprod,+fp16,+i8mm

src/lib.rs
----------
Three new gated module declarations (`#[cfg(all(target_arch = "aarch64", feature = "std"))]`).

No dispatch changes in `simd.rs`: this PR is scaffold + docs only. The dispatch wiring lands in a follow-up once the F16/BF16 implementations exist to dispatch to.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ac2a9647de


chatgpt-codex-connector · 2026-05-20T13:54:57Z

+#
+# Also works on:
+#   - Cortex-X3 / X4 / X925 generic Linux servers
+#   - Ampere Altra (V1-class — same baseline)


Remove Altra compatibility claim from Graviton V2 config

This config hard-codes -Ctarget-cpu=neoverse-v2 plus +bf16,+i8mm (line 19), but line 12 says it also works on Ampere Altra. That pairing is unsafe: Altra-class deployments are commonly Neoverse-N1 (ARMv8.2-A) and do not provide the same feature baseline, so binaries built with this profile can execute unsupported instructions and crash with illegal-instruction faults on those hosts.
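One way to make such a mismatch visible before a crash is a feature preflight, sketched below. The names are hypothetical, and the feature list simply mirrors the `target-feature` string of `config-graviton.toml`. One caveat: `is_aarch64_feature_detected!` evaluates to `true` at compile time for any feature already enabled through `-C target-feature`, so a check like this is only meaningful in a launcher or installer built for the baseline tier, not in the binary built with the profile itself.

```rust
/// Features the bf16-tier cargo configs enable (taken from the config table).
pub const REQUIRED: [&str; 4] = ["bf16", "dotprod", "fp16", "i8mm"];

/// Returns the required features that the current host does not report.
#[cfg(target_arch = "aarch64")]
pub fn missing_features() -> Vec<&'static str> {
    use std::arch::is_aarch64_feature_detected;
    let mut missing = Vec::new();
    // The macro needs string literals, so each feature is probed explicitly.
    if !is_aarch64_feature_detected!("bf16") {
        missing.push("bf16");
    }
    if !is_aarch64_feature_detected!("dotprod") {
        missing.push("dotprod");
    }
    if !is_aarch64_feature_detected!("fp16") {
        missing.push("fp16");
    }
    if !is_aarch64_feature_detected!("i8mm") {
        missing.push("i8mm");
    }
    missing
}

/// On other architectures none of the aarch64 features are available.
#[cfg(not(target_arch = "aarch64"))]
pub fn missing_features() -> Vec<&'static str> {
    REQUIRED.to_vec()
}
```

On a Neoverse-N1 host this would be expected to report at least `bf16` and `i8mm` as missing, which is the situation the review comment warns about.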


AdaWorldAPI merged commit 96d01ce into master May 20, 2026
16 checks passed

AdaWorldAPI merged 1 commit into master from claude/pr-x-phase3-neon-tiers
