simd: SimdProfile fine-grained detection + cpu-* pinning + leaf 7,1 CPUID#190
simd: SimdProfile fine-grained detection + cpu-* pinning + leaf 7,1 CPUID#190AdaWorldAPI wants to merge 5 commits into
Conversation
Phase 3 T3.1 of the SIMD integration plan: introduce crate::hpc::simd_profile::SimdProfile, the silicon-grained dispatch identity that replaces the coarse three-Tier collapse called out in audit findings TD-T12/T13/T14. The decision tree in SimdProfile::detect() implements .claude/knowledge/td-simd-cpu-dispatch-matrix.md lines 271-305 verbatim, preserving the four load-bearing invariants from the "Detection invariants" section: GraniteRapids-before-SapphireRapids, Zen4-vs-SPR via amx_tile, CooperLake-vs-IceLakeSp via the mutually exclusive BF16/VBMI bit pattern, and TigerLakeU-vs-IceLakeSp via VP2INTERSECT. Risk #4 of the integration plan (no GNR detection without leaf 7,1 reader) closed in the same change: SimdCaps gains avx512fp16, avx512vp2intersect, and amx_fp16 fields, with the x86 detect() arm adding a __cpuid_count(7, 1) read gated on the leaf 7,0 EAX max subleaf advertising support. has_amx_fp16() requires amx_tile in addition to the FP16 bit, mirroring the defense-in-depth pattern in simd_amx::amx_available(). Surface follows cognitive-shader-foundation.md: SimdProfile + simd_profile() re-exported through crate::simd::* so consumers import a single public path. The existing private Tier / tier() machinery in src/simd.rs is untouched; this lands alongside, with incremental migration deferred to T3.5/T3.6. Tests: 7 new in simd_profile (detection determinism, arch partitioning, AVX-512 subset invariant, x86_64-only Scalar fallback, name uniqueness), 2 new in simd_caps (FP16 fields false on non-x86, has_amx_fp16 requires amx_tile). 2075/2075 lib tests pass, clippy -D warnings clean.
Phase 3 T3.2: add 13 mutually-exclusive cargo features that pin simd_profile() to a const at compile time, bypassing the runtime LazyLock detection from T3.1. One feature per non-Scalar variant of the SimdProfile enum. Features (mapping to LLVM target-cpu codenames): cpu-gnr → GraniteRapids (graniterapids) cpu-spr → SapphireRapids (sapphirerapids) cpu-zen4 → Zen4Avx512 (znver4) cpu-cpl → CooperLake (cooperlake) cpu-tigerlake → TigerLakeU (tigerlake) cpu-icx → IceLakeSp (icelake-server) cpu-clx → CascadeLake (cascadelake) cpu-skx → SkylakeX (skylake-avx512) cpu-arrowlake → ArrowLake (arrowlake) cpu-haswell → HaswellAvx2 (haswell) cpu-a76 → A76DotProd (cortex-a76) cpu-a72 → A72Fast (cortex-a72) cpu-a53 → A53Baseline (cortex-a53) Mutual exclusion (per integration-plan risk #1: "if a user sets cpu-spr AND has runtime detection on Zen 4, the binary SIGILLs on AMX instructions") is enforced via a const assert: each cpu-* contributes 1 to _PIN_COUNT and the assert fires at compile time if the sum exceeds one. Verified: enabling cpu-spr+cpu-zen4 simultaneously produces a build error citing the mutex. Implementation: a const pinned_profile() -> Option<SimdProfile> walks a cfg cascade and returns the active variant or None. The simd_profile() function exists in two cfg-gated forms — a runtime LazyLock variant compiled when no cpu-* feature is set, and a const-foldable variant compiled when any is set. The LazyLock is not linked into pinned binaries. is_pinned() const helper exposes whether compile-time dispatch is active, useful both for consumer-facing diagnostics and for gating arch-detection tests that no longer apply when pinning overrides hardware. Existing x86_target_lands_inside_x86_family / aarch64_target_lands_inside_aarch64_family tests early-return when pinned; two new tests (pinning_default_is_off, pinning_consistency) verify the const + runtime paths agree. Tests: 2077/2077 lib tests pass under both default and --features cpu-spr configurations. cargo clippy --lib -- -D warnings clean in both modes. Mutex compile_error verified by attempting --features "cpu-spr,cpu-zen4" — fails as expected with the const assert citation. The default (no feature) path is byte-identical to the T3.1 runtime detection — no regression risk to the merged #181 work or to the e40f3a3 SimdProfile commit.
Supports step 1 of the TEST-promotion checklist from
.claude/knowledge/td-simd-cpu-dispatch-matrix.md § "TEST verification
checklist": "Boot the binary on the silicon and confirm
simd_profile() returns the expected variant."
The probe prints:
- Resolved SimdProfile variant + arch/family flags
- Compile-time pinning status (and pinned variant if active)
- Every CPUID-derived SimdCaps bit, ticked/unticked
- ARM heuristic profile when running on aarch64
- Active compile-time target features (avx512f / avx2)
- Per-variant matrix-doc cell summary (terse — matrix is source of truth)
- Runs the same pinning_consistency invariant the unit test checks,
so a probe deployed on real silicon flags regressions in the cfg
cascade.
First-hardware results on the build host (Sapphire Rapids):
- simd_profile() resolves to SapphireRapids ✓
- amx_tile + amx_bf16 + avx512fp16 all set ✓
- amx_fp16 unset (correctly NOT promoting to GraniteRapids) ✓
- GNR-before-SPR ordering invariant verified end-to-end
This is the first end-to-end pass of the e40f3a3 SimdProfile detect
chain plus the 5a3a663 cpu-* pinning machinery on real silicon —
confirms the dispatch axis is functional, not just doc-checked.
Smoke-tested with --features cpu-zen4: probe correctly reports
ACTIVE pinning and Zen4Avx512 as the resolved variant, even on SPR
silicon (pinning intentionally overrides hardware detection).
…losure Integration plan risk #3 ("Detection robustness across hypervisors"): CPUID may advertise AMX-TILE while the OS/hypervisor has not enabled the tile XSAVE state. Without the OS-level check, the dispatch table routes to AMX kernels that SIGILL at first use. Fix: SimdProfile::detect() now reads `simd_amx::amx_available()` (the existing 4-step gate: CPUID → OSXSAVE → XCR0[17,18] → arch_prctl XCOMP_PERM on Linux 5.19+) and demotes when CPUID and OS disagree. The GraniteRapids and SapphireRapids arms now require both the CPUID bits AND `amx_usable`; the Zen4Avx512 arm catches SPR-class CPUID with locked-down hypervisor XSAVE so dispatch falls to the AVX-512 BF16/FP16 path instead. Verified on the build host (Sapphire Rapids silicon, kernel 6.18.5): - CPUID reports amx_tile=1, amx_int8=1, amx_bf16=1 (all true) - simd_amx::amx_available() returns false (hypervisor masks XCR0[17,18] or the arch_prctl(XCOMP_PERM) request fails) - SimdProfile::detect() correctly resolves to Zen4Avx512, not SapphireRapids — the AMX kernels are not reachable from dispatch on this OS state. Without this fix, the e40f3a3 detect path would have resolved to SapphireRapids on this exact silicon/OS combination, then SIGILL'd the first time a dispatch table called an AMX kernel. Bug closed before any consumer was wired to the dispatch table. The probe binary (examples/simd_profile_probe.rs) gains a new "AMX gating (CPUID vs OS)" section so the CPUID-vs-OS gap is visible without reading source. Format mirrors how the matrix-doc cell summary appears: terse, two lines plus an optional demotion note when the bits disagree. Pinned mode (cpu-* cargo features) intentionally bypasses this gate since pinning is a build-time assertion that the target OS supports the chosen variant — pinned binaries are non-portable by design. Tests: 2077/2077 lib pass. cargo clippy --lib clean under default and --features cpu-spr. Behaviour on hardware with proper AMX enablement (full prctl path success) is unchanged: SapphireRapids still resolves to SapphireRapids when amx_available() returns true.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: de52a446bd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| #[cfg(feature = "cpu-a76")] | ||
| { | ||
| return Some(SimdProfile::A76DotProd); |
There was a problem hiding this comment.
Guard cpu- pinning by target architecture*
The new pinning map accepts ARM pins on x86_64 (and x86 pins on aarch64) with no compile-time guard, so --features cpu-a76 on an x86_64 build will still compile and force simd_profile() to report A76DotProd. That makes is_x86()/is_aarch64() reflect the feature flag rather than the actual target and can silently route callers into the wrong dispatch family if a build matrix or release script passes the wrong cpu-* feature. Please add compile_error!/#[cfg] checks so incompatible pin features fail fast for the current target_arch.
Useful? React with 👍 / 👎.
…ample Two independent bug fixes for codex P2 on de52a44 and the no-default-features CI failure: 1) codex P2 — Reject ARM cpu-* on x86 builds (and vice versa). Without target_arch guards, `--features cpu-a76` on an x86_64 build silently routes simd_profile() to A76DotProd, breaking the is_x86()/is_aarch64() partitioning and routing callers into the wrong dispatch family. Add compile_error! checks that fail fast for the current target_arch — same fail-fast pattern as the existing _PIN_COUNT mutual-exclusion assert. 2) CI fix — examples/simd_profile_probe.rs uses ndarray::hpc::* and ndarray::simd_amx::* which are gated behind the "std" feature. On `cargo test --no-default-features` the example target still tries to compile, producing E0433 "cannot find hpc in ndarray". Add `required-features = ["std"]` to the example's Cargo.toml entry so it is skipped when std is disabled, matching the existing pattern for ocr_benchmark. No behavioral change on default builds. Both fixes are independent of the architectural question about whether cpu-* features should exist at all (which is for the originating session to revisit if they want to unify with cpu_ops_for_cpu); this just makes the existing features less of a footgun.
… AMX OS-gate in cpu_ops Salvages the detection-only subset of closed PR #190 — three real gaps in the substrate runtime dispatch without inheriting any of PR #190's consumer-facing additions (no SimdProfile enum, no public dispatch-identity API, no cpu-* features). What lands here: 1) CPUID leaf 7,1 read for AMX-FP16 (CPUID.07H.1H:EAX bit 21). Lives on a different subleaf than the existing AMX bits; GraniteRapids is the only silicon advertising it today. Guarded by leaf 7,0 EAX >= 1 so older CPUs that don't expose subleaf 1 stay correct. 2) Three new SimdCaps fields (additive, all default false on non-x86): - avx512fp16 — CPUID.07H.0H:EDX bit 23 — `__m512h` math. Discriminates SPR-class from CascadeLake/ IceLakeSp/SkylakeX for any future FP16 kernel. - avx512vp2intersect — CPUID.07H.0H:EDX bit 8 — TigerLake mobile only; absent from Ice Lake-SP and every later server part. Exposed for completeness. - amx_fp16 — CPUID.07H.1H:EAX bit 21 — Granite Rapids. Plus convenience methods has_avx512_fp16() and has_amx_fp16() (the latter defense-in-depths the amx_tile bit). 3) AMX OS-state gate in cpu_ops() selection. The CPU-reports-AMX path now AND-gates on `simd_amx::amx_available()` which runs the full four-step check (CPUID + OSXSAVE + XCR0[17,18] + arch_prctl(XCOMP_PERM, 18) on Linux 5.19+). This closes the SIGILL hole when a hypervisor masks XCR0 or the OS hasn't honoured the prctl: previously cpu_ops() would route to CPU_OPS_AMX_INT8 and AMX instructions would SIGILL despite the CPUID bit. Now it demotes to CPU_OPS_AVX512_VNNI cleanly. What's deliberately NOT here (rejected from PR #190): - No `SimdProfile` enum — would expose dispatch identity to consumer code and invite `match profile { ... }` arms that defeat the polyfill contract. - No `cpu-*` cargo features — build-time silicon pinning that defeats polyfill at an earlier binding time. - No `simd_profile_probe` example — diagnostic-only, rebuilds the SimdProfile surface this PR doesn't bring. - No public dispatch-identity API at any layer. The new bits are internal substrate detection; consumers continue to use `crate::simd::*` polyfilled types and `crate::simd_runtime::*` per-op trampolines. The new fields slot into existing `cpu_ops()` selection by extension (e.g. a future AMX-FP16 tier would AND-gate on `caps.amx_fp16 && simd_amx::amx_available()` between the AMX-INT8 and AVX-512-VNNI arms). No selection logic uses them yet — they're laying the runway, not consuming it. Tests: - 4 new simd_caps tests: cpuid_extended_bits_smoke, has_amx_fp16_requires_amx_tile, x86_extended_bits_are_false_on_non_x86, plus extended determinism coverage. - All 6 existing cpu_ops tests still pass; the AMX OS-gate change passes through transparently on hosts where amx_available() agrees with CPUID (the typical case). - fmt + clippy clean on `--features runtime-dispatch`.
feat(simd_caps): CPUID 7,1 + new x86 caps fields + AMX OS-gate in cpu_ops (salvage from #190)
Summary
Four orthogonal additions that complement (don't duplicate) the
simd_runtime/+CpuOpswork shipped in PRs #185–#187. The parallel session built the runtime-dispatch trampolines and the GCC-codename DTO lookup; this PR adds the fine-grained CPUID detection + silicon-identity enum + compile-time pinning half of the matrix-doc plan.LazyLock<fn>trampolines (vnni_dot_u8_i8,add_mul_f32, casts, matmul)crate::simd_runtime::*(PR #185)cpu_ops_for_cpu("sapphirerapids"))crate::simd_runtime::cpu_ops(PR #187)crate::hpc::simd_profile::*(this PR)cpu-*features (this PR)SimdProfile::detect()is strictly more granular thanCpuOps-tier selection — it discriminates GraniteRapids vs SapphireRapids (amx_fp16bit), CooperLake vs IceLakeSp (BF16 / VBMI mutex), TigerLakeU vs IceLakeSp (VP2INTERSECT). The two pieces compose cleanly;simd_profile()could feedCpuOpsselection in a future commit if you want, but neither subsumes the other today.Commits
What each commit adds
db3669e8—SimdProfileenum +detect()crate::hpc::simd_profile::SimdProfile— 14-variant enum that names silicon generations directly:detect()walks the decision tree from.claude/knowledge/td-simd-cpu-dispatch-matrix.md§ "SimdProfile::detect() mapping" lines 271-305 verbatim. Four load-bearing invariants preserved:amx_tile(both have F+VBMI+BF16+FP16).avx512vp2intersect.Also adds 3 new
SimdCapsfields:avx512fp16,avx512vp2intersect,amx_fp16— plus the CPUID leaf 7,1 reader (gated on leaf 7,0 EAX max-subleaf ≥ 1). Master'ssimd_caps.rscurrently has no leaf 7,1 read; GNR's AMX-FP16 lives at CPUID.07H.1H:EAX bit 21, so the fix is a prerequisite for GNR dispatch regardless of which detection ladder uses it.LazyLock-cached
simd_profile()accessor.b9a6fa0e—cpu-*cargo features (compile-time pinning)13 mutually-exclusive features that fold
simd_profile()to a const at compile time and elide the LazyLock entirely:Mutual exclusion enforced by a const assert (
_PIN_COUNT <= 1); enabling two simultaneously produces a build error. Pair with-Ctarget-cpu=<codename>for the codegen side. Use case: per-silicon distribution where--features cpu-sprproduces an SPR-specialized binary that bypasses runtime detection.Complementary to
runtime-dispatch(PR #185): runtime-dispatch ships one binary that adapts across silicon;cpu-*ships per-silicon binaries that fold dispatch out entirely.03b30e54—examples/simd_profile_probeBoot-on-silicon diagnostic per the matrix doc's TEST-promotion checklist § step 1: "Boot the binary on the silicon and confirm
simd_profile()returns the expected variant." Prints:SimdProfilevariant + arch/family flagsSimdCapsbitpinning_consistencyinvariant so regressions in the cfg cascade are caught post-deployde52a446— Risk #3 closure (AMX OS-state gating)SimdProfile::detect()now consultssimd_amx::amx_available()(CPUID + OSXSAVE + XCR0[17,18] +arch_prctl(XCOMP_PERM, 18)on Linux 5.19+) and demotes when CPUID reports AMX but the OS/hypervisor hasn't enabled the tile XSAVE state.Verified on this Sapphire Rapids build host: CPUID reports
amx_tile=1, amx_bf16=1, butamx_available()returnsfalse(hypervisor masks XCR0 bits 17/18 or the prctl request fails). Without the fix,detect()would resolve as SapphireRapids and dispatch tables would route to AMX kernels that SIGILL. With the fix, it demotes toZen4Avx512(AVX-512 BF16/FP16 path), matching whatamx_available()consumers already do inline.Probe surfaces the CPUID-vs-OS gap directly so the demotion is visible without reading source.
Diff vs current master (eb6444f)
Test plan
simd_profile::tests, +2 insimd_caps::tests).simd_profile::tests:detect_returns_a_valid_profile,determinism,arch_partitioning_is_consistentx86_target_lands_inside_x86_family,aarch64_target_lands_inside_aarch64_familyhas_avx512_is_subset_of_is_x86(subset invariants onhas_avx512()+has_amx())names_are_stable_and_unique(O(n²) sweep)pinning_default_is_off,pinning_consistencysimd_caps::tests:fp16_fields_consistent_on_non_x86,has_amx_fp16_requires_amx_tile(defense-in-depth: AMX-FP16 requires AMX-TILE in the convenience method).cargo clippy --lib -- -D warningsclean underdefault,--features cpu-spr,--features runtime-dispatch.cargo fmt --all --checkclean.--features "cpu-spr,cpu-zen4"— fails as expected with the const-assert citation.Zen4Avx512(Risk This PR ports high-performance computing (HPC) features from the rustynum library into ndarray, adding comprehensive linear algebra, statistical operations, hyperdimensional computing (HDC), and signal processing capabilities. The implementation uses a pluggable backend architecture with runtime CPU detection (AVX-512 → AVX2 → scalar) for optimal performance across different hardware. #3 demotion), pinning bit shows OFF; with--features cpu-zen4shows ACTIVE pinning.How it composes with PRs #185–#187
The matrix doc's TEST-promotion workflow needs the fine-grained variant (you can't promote "SPR" cells if your detector only says "amx_int8 tier").
SimdProfileis that surface; PR #187'sCpuOpsis the dispatch-table glue.Out of scope (separate PRs)
simd_profile()as the seed forCpuOpsselection (would replace the currentis_x86_feature_detected!insidecpu_ops()).DOC→TESTusing the probe binary on real SPR/GNR/Zen4 silicon.simd_profile().coarse() -> &'static strcodename mapping for parity withcpu_ops_for_cpu().https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS
Generated by Claude Code