simd: SimdProfile fine-grained detection + cpu-* pinning + leaf 7,1 CPUID by AdaWorldAPI · Pull Request #190 · AdaWorldAPI/ndarray

AdaWorldAPI · 2026-05-21T11:11:56Z

Summary

Four orthogonal additions that complement (don't duplicate) the simd_runtime/ + CpuOps work shipped in PRs #185–#187. The parallel session built the runtime-dispatch trampolines and the GCC-codename DTO lookup; this PR adds the fine-grained CPUID detection + silicon-identity enum + compile-time pinning half of the matrix-doc plan.

Layer	Where it lives now
Per-op `LazyLock<fn>` trampolines (`vnni_dot_u8_i8`, `add_mul_f32`, casts, matmul)	`crate::simd_runtime::*` (PR #185)
Codename-to-tier DTO lookup (`cpu_ops_for_cpu("sapphirerapids")`)	`crate::simd_runtime::cpu_ops` (PR #187)
Silicon-identity enum + runtime CPUID detect ladder	*`crate::hpc::simd_profile::` (this PR)**
Compile-time profile pinning via cargo features	*`cpu-` features (this PR)**

SimdProfile::detect() is strictly more granular than CpuOps-tier selection — it discriminates GraniteRapids vs SapphireRapids (amx_fp16 bit), CooperLake vs IceLakeSp (BF16 / VBMI mutex), TigerLakeU vs IceLakeSp (VP2INTERSECT). The two pieces compose cleanly; simd_profile() could feed CpuOps selection in a future commit if you want, but neither subsumes the other today.

Commits

de52a446 fix(simd):  SimdProfile::detect() consults amx_available() — Risk #3 closure
03b30e54 feat(examples): simd_profile_probe — hardware verification binary
b9a6fa0e feat(simd):  cpu-* cargo features for compile-time SimdProfile pinning
db3669e8 feat(simd):  SimdProfile enum + detect() implements dispatch matrix

What each commit adds

`db3669e8` — `SimdProfile` enum + `detect()`

crate::hpc::simd_profile::SimdProfile — 14-variant enum that names silicon generations directly:

pub enum SimdProfile {
    GraniteRapids, SapphireRapids,
    Zen4Avx512, CooperLake,
    TigerLakeU, IceLakeSp, CascadeLake, SkylakeX,
    ArrowLake, HaswellAvx2,
    A76DotProd, A72Fast, A53Baseline,
    Scalar,
}

detect() walks the decision tree from .claude/knowledge/td-simd-cpu-dispatch-matrix.md § "SimdProfile::detect() mapping" lines 271-305 verbatim. Four load-bearing invariants preserved:

GraniteRapids checked before SapphireRapids — GNR is a strict superset; checking SPR first would route GNR silicon as SPR and leave AMX-FP16 unused.
Zen4 vs SPR via amx_tile (both have F+VBMI+BF16+FP16).
CooperLake vs IceLakeSp via mutually-exclusive bits (CPL has BF16 without VBMI; ICX has VBMI without BF16).
TigerLakeU vs IceLakeSp via avx512vp2intersect.

Also adds 3 new SimdCaps fields: avx512fp16, avx512vp2intersect, amx_fp16 — plus the CPUID leaf 7,1 reader (gated on leaf 7,0 EAX max-subleaf ≥ 1). Master's simd_caps.rs currently has no leaf 7,1 read; GNR's AMX-FP16 lives at CPUID.07H.1H:EAX bit 21, so the fix is a prerequisite for GNR dispatch regardless of which detection ladder uses it.

LazyLock-cached simd_profile() accessor.

`b9a6fa0e` — `cpu-*` cargo features (compile-time pinning)

13 mutually-exclusive features that fold simd_profile() to a const at compile time and elide the LazyLock entirely:

cpu-gnr  cpu-spr  cpu-zen4  cpu-cpl  cpu-tigerlake  cpu-icx  cpu-clx  cpu-skx
cpu-arrowlake  cpu-haswell  cpu-a76  cpu-a72  cpu-a53

Mutual exclusion enforced by a const assert (_PIN_COUNT <= 1); enabling two simultaneously produces a build error. Pair with -Ctarget-cpu=<codename> for the codegen side. Use case: per-silicon distribution where --features cpu-spr produces an SPR-specialized binary that bypasses runtime detection.

Complementary to runtime-dispatch (PR #185): runtime-dispatch ships one binary that adapts across silicon; cpu-* ships per-silicon binaries that fold dispatch out entirely.

`03b30e54` — `examples/simd_profile_probe`

Boot-on-silicon diagnostic per the matrix doc's TEST-promotion checklist § step 1: "Boot the binary on the silicon and confirm simd_profile() returns the expected variant." Prints:

Resolved SimdProfile variant + arch/family flags
Compile-time pinning status (with pinned variant if active)
Every CPUID-derived SimdCaps bit
ARM heuristic profile (when running aarch64)
Active compile-time target features
Per-variant matrix-doc cell summary
Re-runs the pinning_consistency invariant so regressions in the cfg cascade are caught post-deploy

cargo run --example simd_profile_probe --release                 # runtime
cargo run --example simd_profile_probe --release --features cpu-spr   # pinned

`de52a446` — Risk #3 closure (AMX OS-state gating)

SimdProfile::detect() now consults simd_amx::amx_available() (CPUID + OSXSAVE + XCR0[17,18] + arch_prctl(XCOMP_PERM, 18) on Linux 5.19+) and demotes when CPUID reports AMX but the OS/hypervisor hasn't enabled the tile XSAVE state.

Verified on this Sapphire Rapids build host: CPUID reports amx_tile=1, amx_bf16=1, but amx_available() returns false (hypervisor masks XCR0 bits 17/18 or the prctl request fails). Without the fix, detect() would resolve as SapphireRapids and dispatch tables would route to AMX kernels that SIGILL. With the fix, it demotes to Zen4Avx512 (AVX-512 BF16/FP16 path), matching what amx_available() consumers already do inline.

Probe surfaces the CPUID-vs-OS gap directly so the demotion is visible without reading source.

Diff vs current master (`eb6444f`)

 Cargo.toml                     |  30 +++
 examples/simd_profile_probe.rs | 202 ++++++++++++++
 src/hpc/mod.rs                 |   3 +
 src/hpc/simd_caps.rs           |  99 ++++++-
 src/hpc/simd_profile.rs        | 597 +++++++++++++++++++++++++++++++++++++++++
 src/simd.rs                    |  16 ++
 6 files changed, 943 insertions(+), 4 deletions(-)

Test plan

2106 lib tests pass on default config (was 2095 on master; +9 in simd_profile::tests, +2 in simd_caps::tests).
9 new simd_profile::tests:
- detect_returns_a_valid_profile, determinism, arch_partitioning_is_consistent
- x86_target_lands_inside_x86_family, aarch64_target_lands_inside_aarch64_family
- has_avx512_is_subset_of_is_x86 (subset invariants on has_avx512() + has_amx())
- names_are_stable_and_unique (O(n²) sweep)
- pinning_default_is_off, pinning_consistency
2 new simd_caps::tests: fp16_fields_consistent_on_non_x86, has_amx_fp16_requires_amx_tile (defense-in-depth: AMX-FP16 requires AMX-TILE in the convenience method).
cargo clippy --lib -- -D warnings clean under default, --features cpu-spr, --features runtime-dispatch.
cargo fmt --all --check clean.
Mutex compile_error verified by attempting --features "cpu-spr,cpu-zen4" — fails as expected with the const-assert citation.
Probe runs on this SPR host, resolves to Zen4Avx512 (Risk This PR ports high-performance computing (HPC) features from the rustynum library into ndarray, adding comprehensive linear algebra, statistical operations, hyperdimensional computing (HDC), and signal processing capabilities. The implementation uses a pluggable backend architecture with runtime CPU detection (AVX-512 → AVX2 → scalar) for optimal performance across different hardware. #3 demotion), pinning bit shows OFF; with --features cpu-zen4 shows ACTIVE pinning.

How it composes with PRs #185–#187

                ┌─ crate::simd_runtime::* (PR #185)  ←  runtime LazyLock trampolines
                │
caps = simd_caps()
                │
                ├─ crate::simd_runtime::cpu_ops (PR #187)  ←  named-target lookup
                │
                └─ crate::hpc::simd_profile::detect() (THIS PR)  ←  CPUID-based 14-variant detect
                                                                    + amx_available() OS-gate
                                                                    + cpu-* pinning constants

The matrix doc's TEST-promotion workflow needs the fine-grained variant (you can't promote "SPR" cells if your detector only says "amx_int8 tier"). SimdProfile is that surface; PR #187's CpuOps is the dispatch-table glue.

Out of scope (separate PRs)

Wiring simd_profile() as the seed for CpuOps selection (would replace the current is_x86_feature_detected! inside cpu_ops()).
Promoting matrix-doc cells DOC → TEST using the probe binary on real SPR/GNR/Zen4 silicon.
A simd_profile().coarse() -> &'static str codename mapping for parity with cpu_ops_for_cpu().

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS

Generated by Claude Code

Phase 3 T3.1 of the SIMD integration plan: introduce crate::hpc::simd_profile::SimdProfile, the silicon-grained dispatch identity that replaces the coarse three-Tier collapse called out in audit findings TD-T12/T13/T14. The decision tree in SimdProfile::detect() implements .claude/knowledge/td-simd-cpu-dispatch-matrix.md lines 271-305 verbatim, preserving the four load-bearing invariants from the "Detection invariants" section: GraniteRapids-before-SapphireRapids, Zen4-vs-SPR via amx_tile, CooperLake-vs-IceLakeSp via the mutually exclusive BF16/VBMI bit pattern, and TigerLakeU-vs-IceLakeSp via VP2INTERSECT. Risk #4 of the integration plan (no GNR detection without leaf 7,1 reader) closed in the same change: SimdCaps gains avx512fp16, avx512vp2intersect, and amx_fp16 fields, with the x86 detect() arm adding a __cpuid_count(7, 1) read gated on the leaf 7,0 EAX max subleaf advertising support. has_amx_fp16() requires amx_tile in addition to the FP16 bit, mirroring the defense-in-depth pattern in simd_amx::amx_available(). Surface follows cognitive-shader-foundation.md: SimdProfile + simd_profile() re-exported through crate::simd::* so consumers import a single public path. The existing private Tier / tier() machinery in src/simd.rs is untouched; this lands alongside, with incremental migration deferred to T3.5/T3.6. Tests: 7 new in simd_profile (detection determinism, arch partitioning, AVX-512 subset invariant, x86_64-only Scalar fallback, name uniqueness), 2 new in simd_caps (FP16 fields false on non-x86, has_amx_fp16 requires amx_tile). 2075/2075 lib tests pass, clippy -D warnings clean.

Phase 3 T3.2: add 13 mutually-exclusive cargo features that pin simd_profile() to a const at compile time, bypassing the runtime LazyLock detection from T3.1. One feature per non-Scalar variant of the SimdProfile enum. Features (mapping to LLVM target-cpu codenames): cpu-gnr → GraniteRapids (graniterapids) cpu-spr → SapphireRapids (sapphirerapids) cpu-zen4 → Zen4Avx512 (znver4) cpu-cpl → CooperLake (cooperlake) cpu-tigerlake → TigerLakeU (tigerlake) cpu-icx → IceLakeSp (icelake-server) cpu-clx → CascadeLake (cascadelake) cpu-skx → SkylakeX (skylake-avx512) cpu-arrowlake → ArrowLake (arrowlake) cpu-haswell → HaswellAvx2 (haswell) cpu-a76 → A76DotProd (cortex-a76) cpu-a72 → A72Fast (cortex-a72) cpu-a53 → A53Baseline (cortex-a53) Mutual exclusion (per integration-plan risk #1: "if a user sets cpu-spr AND has runtime detection on Zen 4, the binary SIGILLs on AMX instructions") is enforced via a const assert: each cpu-* contributes 1 to _PIN_COUNT and the assert fires at compile time if the sum exceeds one. Verified: enabling cpu-spr+cpu-zen4 simultaneously produces a build error citing the mutex. Implementation: a const pinned_profile() -> Option<SimdProfile> walks a cfg cascade and returns the active variant or None. The simd_profile() function exists in two cfg-gated forms — a runtime LazyLock variant compiled when no cpu-* feature is set, and a const-foldable variant compiled when any is set. The LazyLock is not linked into pinned binaries. is_pinned() const helper exposes whether compile-time dispatch is active, useful both for consumer-facing diagnostics and for gating arch-detection tests that no longer apply when pinning overrides hardware. Existing x86_target_lands_inside_x86_family / aarch64_target_lands_inside_aarch64_family tests early-return when pinned; two new tests (pinning_default_is_off, pinning_consistency) verify the const + runtime paths agree. Tests: 2077/2077 lib tests pass under both default and --features cpu-spr configurations. cargo clippy --lib -- -D warnings clean in both modes. Mutex compile_error verified by attempting --features "cpu-spr,cpu-zen4" — fails as expected with the const assert citation. The default (no feature) path is byte-identical to the T3.1 runtime detection — no regression risk to the merged #181 work or to the e40f3a3 SimdProfile commit.

Supports step 1 of the TEST-promotion checklist from .claude/knowledge/td-simd-cpu-dispatch-matrix.md § "TEST verification checklist": "Boot the binary on the silicon and confirm simd_profile() returns the expected variant." The probe prints: - Resolved SimdProfile variant + arch/family flags - Compile-time pinning status (and pinned variant if active) - Every CPUID-derived SimdCaps bit, ticked/unticked - ARM heuristic profile when running on aarch64 - Active compile-time target features (avx512f / avx2) - Per-variant matrix-doc cell summary (terse — matrix is source of truth) - Runs the same pinning_consistency invariant the unit test checks, so a probe deployed on real silicon flags regressions in the cfg cascade. First-hardware results on the build host (Sapphire Rapids): - simd_profile() resolves to SapphireRapids ✓ - amx_tile + amx_bf16 + avx512fp16 all set ✓ - amx_fp16 unset (correctly NOT promoting to GraniteRapids) ✓ - GNR-before-SPR ordering invariant verified end-to-end This is the first end-to-end pass of the e40f3a3 SimdProfile detect chain plus the 5a3a663 cpu-* pinning machinery on real silicon — confirms the dispatch axis is functional, not just doc-checked. Smoke-tested with --features cpu-zen4: probe correctly reports ACTIVE pinning and Zen4Avx512 as the resolved variant, even on SPR silicon (pinning intentionally overrides hardware detection).

…losure Integration plan risk #3 ("Detection robustness across hypervisors"): CPUID may advertise AMX-TILE while the OS/hypervisor has not enabled the tile XSAVE state. Without the OS-level check, the dispatch table routes to AMX kernels that SIGILL at first use. Fix: SimdProfile::detect() now reads `simd_amx::amx_available()` (the existing 4-step gate: CPUID → OSXSAVE → XCR0[17,18] → arch_prctl XCOMP_PERM on Linux 5.19+) and demotes when CPUID and OS disagree. The GraniteRapids and SapphireRapids arms now require both the CPUID bits AND `amx_usable`; the Zen4Avx512 arm catches SPR-class CPUID with locked-down hypervisor XSAVE so dispatch falls to the AVX-512 BF16/FP16 path instead. Verified on the build host (Sapphire Rapids silicon, kernel 6.18.5): - CPUID reports amx_tile=1, amx_int8=1, amx_bf16=1 (all true) - simd_amx::amx_available() returns false (hypervisor masks XCR0[17,18] or the arch_prctl(XCOMP_PERM) request fails) - SimdProfile::detect() correctly resolves to Zen4Avx512, not SapphireRapids — the AMX kernels are not reachable from dispatch on this OS state. Without this fix, the e40f3a3 detect path would have resolved to SapphireRapids on this exact silicon/OS combination, then SIGILL'd the first time a dispatch table called an AMX kernel. Bug closed before any consumer was wired to the dispatch table. The probe binary (examples/simd_profile_probe.rs) gains a new "AMX gating (CPUID vs OS)" section so the CPUID-vs-OS gap is visible without reading source. Format mirrors how the matrix-doc cell summary appears: terse, two lines plus an optional demotion note when the bits disagree. Pinned mode (cpu-* cargo features) intentionally bypasses this gate since pinning is a build-time assertion that the target OS supports the chosen variant — pinned binaries are non-portable by design. Tests: 2077/2077 lib pass. cargo clippy --lib clean under default and --features cpu-spr. Behaviour on hardware with proper AMX enablement (full prctl path success) is unchanged: SapphireRapids still resolves to SapphireRapids when amx_available() returns true.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: de52a446bd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-21T11:17:37Z

+    #[cfg(feature = "cpu-a76")]
+    {
+        return Some(SimdProfile::A76DotProd);


Guard cpu- pinning by target architecture*

The new pinning map accepts ARM pins on x86_64 (and x86 pins on aarch64) with no compile-time guard, so --features cpu-a76 on an x86_64 build will still compile and force simd_profile() to report A76DotProd. That makes is_x86()/is_aarch64() reflect the feature flag rather than the actual target and can silently route callers into the wrong dispatch family if a build matrix or release script passes the wrong cpu-* feature. Please add compile_error!/#[cfg] checks so incompatible pin features fail fast for the current target_arch.

Useful? React with 👍 / 👎.

…ample Two independent bug fixes for codex P2 on de52a44 and the no-default-features CI failure: 1) codex P2 — Reject ARM cpu-* on x86 builds (and vice versa). Without target_arch guards, `--features cpu-a76` on an x86_64 build silently routes simd_profile() to A76DotProd, breaking the is_x86()/is_aarch64() partitioning and routing callers into the wrong dispatch family. Add compile_error! checks that fail fast for the current target_arch — same fail-fast pattern as the existing _PIN_COUNT mutual-exclusion assert. 2) CI fix — examples/simd_profile_probe.rs uses ndarray::hpc::* and ndarray::simd_amx::* which are gated behind the "std" feature. On `cargo test --no-default-features` the example target still tries to compile, producing E0433 "cannot find hpc in ndarray". Add `required-features = ["std"]` to the example's Cargo.toml entry so it is skipped when std is disabled, matching the existing pattern for ocr_benchmark. No behavioral change on default builds. Both fixes are independent of the architectural question about whether cpu-* features should exist at all (which is for the originating session to revisit if they want to unify with cpu_ops_for_cpu); this just makes the existing features less of a footgun.

… AMX OS-gate in cpu_ops Salvages the detection-only subset of closed PR #190 — three real gaps in the substrate runtime dispatch without inheriting any of PR #190's consumer-facing additions (no SimdProfile enum, no public dispatch-identity API, no cpu-* features). What lands here: 1) CPUID leaf 7,1 read for AMX-FP16 (CPUID.07H.1H:EAX bit 21). Lives on a different subleaf than the existing AMX bits; GraniteRapids is the only silicon advertising it today. Guarded by leaf 7,0 EAX >= 1 so older CPUs that don't expose subleaf 1 stay correct. 2) Three new SimdCaps fields (additive, all default false on non-x86): - avx512fp16 — CPUID.07H.0H:EDX bit 23 — `__m512h` math. Discriminates SPR-class from CascadeLake/ IceLakeSp/SkylakeX for any future FP16 kernel. - avx512vp2intersect — CPUID.07H.0H:EDX bit 8 — TigerLake mobile only; absent from Ice Lake-SP and every later server part. Exposed for completeness. - amx_fp16 — CPUID.07H.1H:EAX bit 21 — Granite Rapids. Plus convenience methods has_avx512_fp16() and has_amx_fp16() (the latter defense-in-depths the amx_tile bit). 3) AMX OS-state gate in cpu_ops() selection. The CPU-reports-AMX path now AND-gates on `simd_amx::amx_available()` which runs the full four-step check (CPUID + OSXSAVE + XCR0[17,18] + arch_prctl(XCOMP_PERM, 18) on Linux 5.19+). This closes the SIGILL hole when a hypervisor masks XCR0 or the OS hasn't honoured the prctl: previously cpu_ops() would route to CPU_OPS_AMX_INT8 and AMX instructions would SIGILL despite the CPUID bit. Now it demotes to CPU_OPS_AVX512_VNNI cleanly. What's deliberately NOT here (rejected from PR #190): - No `SimdProfile` enum — would expose dispatch identity to consumer code and invite `match profile { ... }` arms that defeat the polyfill contract. - No `cpu-*` cargo features — build-time silicon pinning that defeats polyfill at an earlier binding time. - No `simd_profile_probe` example — diagnostic-only, rebuilds the SimdProfile surface this PR doesn't bring. - No public dispatch-identity API at any layer. The new bits are internal substrate detection; consumers continue to use `crate::simd::*` polyfilled types and `crate::simd_runtime::*` per-op trampolines. The new fields slot into existing `cpu_ops()` selection by extension (e.g. a future AMX-FP16 tier would AND-gate on `caps.amx_fp16 && simd_amx::amx_available()` between the AMX-INT8 and AVX-512-VNNI arms). No selection logic uses them yet — they're laying the runway, not consuming it. Tests: - 4 new simd_caps tests: cpuid_extended_bits_smoke, has_amx_fp16_requires_amx_tile, x86_extended_bits_are_false_on_non_x86, plus extended determinism coverage. - All 6 existing cpu_ops tests still pass; the AMX OS-gate change passes through transparently on hosts where amx_available() agrees with CPUID (the typical case). - fmt + clippy clean on `--features runtime-dispatch`.

feat(simd_caps): CPUID 7,1 + new x86 caps fields + AMX OS-gate in cpu_ops (salvage from #190)

claude added 4 commits May 21, 2026 11:07

chatgpt-codex-connector Bot reviewed May 21, 2026

View reviewed changes

AdaWorldAPI mentioned this pull request May 21, 2026

fix+test: cpu_tier_for_cpu cross-arch + Pillar 12/13/14 drift-checks #191

Merged

AdaWorldAPI closed this May 21, 2026

AdaWorldAPI mentioned this pull request May 21, 2026

feat(simd_caps): CPUID 7,1 + new x86 caps fields + AMX OS-gate in cpu_ops (salvage from #190) #192

Merged

AdaWorldAPI added a commit that referenced this pull request May 21, 2026

Merge pull request #192 from AdaWorldAPI/claude/continue-ndarray-x0Oaw

adeba8a

feat(simd_caps): CPUID 7,1 + new x86 caps fields + AMX OS-gate in cpu_ops (salvage from #190)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

simd: SimdProfile fine-grained detection + cpu-* pinning + leaf 7,1 CPUID#190

simd: SimdProfile fine-grained detection + cpu-* pinning + leaf 7,1 CPUID#190
AdaWorldAPI wants to merge 5 commits into
masterfrom
claude/resolve-pr-369-conflicts-ozMXd

AdaWorldAPI commented May 21, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AdaWorldAPI commented May 21, 2026

Summary

Commits

What each commit adds

db3669e8 — SimdProfile enum + detect()

b9a6fa0e — cpu-* cargo features (compile-time pinning)

03b30e54 — examples/simd_profile_probe

de52a446 — Risk #3 closure (AMX OS-state gating)

Diff vs current master (eb6444f)

Test plan

How it composes with PRs #185–#187

Out of scope (separate PRs)

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`db3669e8` — `SimdProfile` enum + `detect()`

`b9a6fa0e` — `cpu-*` cargo features (compile-time pinning)

`03b30e54` — `examples/simd_profile_probe`

`de52a446` — Risk #3 closure (AMX OS-state gating)

Diff vs current master (`eb6444f`)