feat(simd_caps): CPUID 7,1 + new x86 caps fields + AMX OS-gate in cpu_ops (salvage from #190)#192
Merged
Merged
Conversation
… AMX OS-gate in cpu_ops Salvages the detection-only subset of closed PR #190 — three real gaps in the substrate runtime dispatch without inheriting any of PR #190's consumer-facing additions (no SimdProfile enum, no public dispatch-identity API, no cpu-* features). What lands here: 1) CPUID leaf 7,1 read for AMX-FP16 (CPUID.07H.1H:EAX bit 21). Lives on a different subleaf than the existing AMX bits; GraniteRapids is the only silicon advertising it today. Guarded by leaf 7,0 EAX >= 1 so older CPUs that don't expose subleaf 1 stay correct. 2) Three new SimdCaps fields (additive, all default false on non-x86): - avx512fp16 — CPUID.07H.0H:EDX bit 23 — `__m512h` math. Discriminates SPR-class from CascadeLake/ IceLakeSp/SkylakeX for any future FP16 kernel. - avx512vp2intersect — CPUID.07H.0H:EDX bit 8 — TigerLake mobile only; absent from Ice Lake-SP and every later server part. Exposed for completeness. - amx_fp16 — CPUID.07H.1H:EAX bit 21 — Granite Rapids. Plus convenience methods has_avx512_fp16() and has_amx_fp16() (the latter defense-in-depths the amx_tile bit). 3) AMX OS-state gate in cpu_ops() selection. The CPU-reports-AMX path now AND-gates on `simd_amx::amx_available()` which runs the full four-step check (CPUID + OSXSAVE + XCR0[17,18] + arch_prctl(XCOMP_PERM, 18) on Linux 5.19+). This closes the SIGILL hole when a hypervisor masks XCR0 or the OS hasn't honoured the prctl: previously cpu_ops() would route to CPU_OPS_AMX_INT8 and AMX instructions would SIGILL despite the CPUID bit. Now it demotes to CPU_OPS_AVX512_VNNI cleanly. What's deliberately NOT here (rejected from PR #190): - No `SimdProfile` enum — would expose dispatch identity to consumer code and invite `match profile { ... }` arms that defeat the polyfill contract. - No `cpu-*` cargo features — build-time silicon pinning that defeats polyfill at an earlier binding time. - No `simd_profile_probe` example — diagnostic-only, rebuilds the SimdProfile surface this PR doesn't bring. - No public dispatch-identity API at any layer. The new bits are internal substrate detection; consumers continue to use `crate::simd::*` polyfilled types and `crate::simd_runtime::*` per-op trampolines. The new fields slot into existing `cpu_ops()` selection by extension (e.g. a future AMX-FP16 tier would AND-gate on `caps.amx_fp16 && simd_amx::amx_available()` between the AMX-INT8 and AVX-512-VNNI arms). No selection logic uses them yet — they're laying the runway, not consuming it. Tests: - 4 new simd_caps tests: cpuid_extended_bits_smoke, has_amx_fp16_requires_amx_tile, x86_extended_bits_are_false_on_non_x86, plus extended determinism coverage. - All 6 existing cpu_ops tests still pass; the AMX OS-gate change passes through transparently on hosts where amx_available() agrees with CPUID (the typical case). - fmt + clippy clean on `--features runtime-dispatch`.
Same canonical-fmt collapse as the prior pillar-branch hotfixes. No behavioral change.
AdaWorldAPI
pushed a commit
that referenced
this pull request
May 21, 2026
…t re-export First step in the substrate-graduation thread documented in #192's wrap-up: lift the substrate-tier modules out of `hpc/` (which was the rustynum migration staging area) to crate root, where they sit in scope of the W1a polyfill contract and no longer carry the spurious `std`-gate inherited from `hpc/`. `simd_caps` is the smallest and cleanest first move: * No internal `hpc/` dependencies (only `use std::sync::LazyLock`). * 8 internal callers; back-compat re-export keeps them working. * Pure CPU-detection metadata; the most polyfill-adjacent module in the entire `hpc/` set. Changes: 1. `src/hpc/simd_caps.rs` → `src/simd_caps.rs` (file move). 2. `src/lib.rs` adds `#[cfg(feature = "std")] pub mod simd_caps;`. The std-gate is retained for now (uses `std::sync::LazyLock`); lifting it to `core::sync::LazyLock` is a separate follow-up. 3. `src/hpc/mod.rs` replaces `pub mod simd_caps;` with `pub use crate::simd_caps;` — keeps `crate::hpc::simd_caps::*` resolving for cross-repo consumers (lance-graph, WoA, MedCare, q2 may have `use ndarray::hpc::simd_caps::*` imports that this preserves untouched). No public-API breakage; the test suite picks up the new path (test names now `simd_caps::tests::*` rather than `hpc::simd_caps::*`), all 10 tests pass under both default and `runtime-dispatch` configs. The 8 internal callers (`crate::simd_avx512`, `crate::hpc::p64_bridge`, `crate::simd_runtime::{cpu_ops, add_mul, vnni_dot}`) continue using `crate::hpc::simd_caps::*` via the re-export and work unmodified. Next graduation candidates (deferred to follow-up PRs): - `fingerprint` (bitwise substrate; raw `u64` polyfill audit) - `dn_tree` (bitwise substrate; same audit) - `ogit_bridge` (pure logic, no SIMD primitives) - `splat3d` (already uses `crate::simd::*` polyfilled types) Each move follows the same pattern: relocate file, drop std-gate inheritance where unneeded, keep back-compat re-export. Cognitive layer (pillar, plane, seal, merkle_tree, deepnsm, …) stays inside `hpc/` and keeps its legitimate std-gate.
AdaWorldAPI
pushed a commit
that referenced
this pull request
May 21, 2026
Continues the substrate-graduation thread documented in #192's wrap-up and extended in #193's simd_caps lift. Five more modules move from `crate::hpc::*` (the rustynum migration staging area) to crate root where they sit alongside `simd.rs`, `simd_runtime/`, `simd_caps`, and the W1a polyfill surface they're supposed to compose with. | Module | Reason | |---|---| | `bitwise` | Pure SIMD primitives (popcount, hamming over byte slices); already uses `crate::simd::U64x8` polyfill internally; already re-exported via `simd.rs:512`. | | `heel_f64x8` | All-F64x8 polyfill consumer (dot, cosine, sum-sq, weighted-hamming); already re-exported via `simd.rs:563`. | | `distance` | Spatial 3D + slice-shape L1/L2/L∞ (PR-X10 A6); the linalg/mod.rs hard-boundary comment now points here at root. | | `byte_scan` | Pure SIMD utility (needle search, delimiter find). | | `spatial_hash` | Pure SIMD utility (bucketing, candidate gather). | # Why these five, why now All five satisfied the low-hanging-fruit criteria from #193's wrap-up discussion: 1. No internal `hpc/` dependencies (only `super::simd_caps` which still resolves correctly because `simd_caps` is itself at crate root post-#192). 2. Already polyfill-clean — no raw-intrinsic refactor needed before the move. 3. Already partially exposed via `crate::simd::*` re-exports. The next graduation tier (`fingerprint`, `dn_tree`, `ogit_bridge`, `splat3d`) needs a polyfill audit before it can move, and `fingerprint` in particular is gated on the W1a-#5 POPCOUNT-U64 primitive landing (so its bit ops can route through `U64xN.popcnt()` instead of raw `u64.count_ones()`). # Back-compat preserved end-to-end Every cross-repo consumer using `ndarray::hpc::{bitwise, heel_f64x8, distance, byte_scan, spatial_hash}::*` continues to compile unmodified. The `src/hpc/mod.rs` declarations change from `pub mod X;` to `pub use crate::X;` — Rust re-exports modules just like other items, so `crate::hpc::X::*` resolves through to the same items as `crate::X::*`. Internal `super::simd_caps::simd_caps()` calls inside the moved files continue to work because `super::` at crate root resolves to `crate::*` which has `simd_caps` (graduated in #192). # Changes - `git mv` five files from `src/hpc/` to `src/`. - `src/lib.rs` gains five `#[cfg(feature = "std")] pub mod X;` declarations next to the existing `simd_caps` block, each with a one-liner docstring naming the graduation source and the substrate-tier reason for the move. - `src/hpc/mod.rs` replaces five `pub mod X;` with `pub use crate::X;` (back-compat re-exports). - `src/hpc/linalg/mod.rs` updates the hard-boundary comment from "No distance metrics — those live in `crate::hpc::distance`" to point at `crate::distance` (the new canonical path) with a parenthetical noting the back-compat re-export. - The `bitwise.rs` declaration in `src/hpc/mod.rs` is now a comment instead of being interleaved with `pub mod hdc`/`pub mod projection` to make the graduation status visible at a glance. # Verification - `cargo build -p ndarray --lib` — clean - `cargo build -p ndarray --lib --no-default-features` — clean (the new `#[cfg(feature = "std")]` gates match the existing `simd_caps` pattern; nostd targets see no change) - `cargo test -p ndarray --lib bitwise:: distance:: heel_f64x8:: byte_scan:: spatial_hash::` — all 119 tests on the five graduated modules pass at the new path (test names now `bitwise::tests::*` rather than `hpc::bitwise::tests::*`) - `cargo test -p ndarray --lib --features "pillar,ogit_bridge, runtime-dispatch" hpc::` — 2167 passed, 0 failed, 28 ignored - `cargo fmt --all --check` — clean - `cargo clippy --features "pillar,ogit_bridge,runtime-dispatch" --lib -- -D warnings` — clean # Next graduation candidates (deferred) - `hpc::fingerprint` — needs W1a-#5 POPCOUNT-U64 to land first so bit ops can route through `U64xN.popcnt()` instead of raw `u64.count_ones()`. Cognitive-shader-foundation explicitly names `Fingerprint<N>` as a MUST-be-in-`ndarray::simd::*` type. - `hpc::dn_tree` (bitwise core) — same polyfill-audit dependency. The cognitive DNTree/DNConfig/TraversalHit state stays in `hpc/` after the split. - `hpc::ogit_bridge` — pure logic, no SIMD, can move once the fingerprint + dn_tree audits are out of the way (avoids three partial graduations in flight at once). - `hpc::splat3d` — already mostly polyfill-clean; pure path rewrite. Defer because it's a larger consumer surface than the five in this PR.
AdaWorldAPI
pushed a commit
that referenced
this pull request
May 21, 2026
Continues the substrate-graduation thread from #192 (simd_caps), #193 (clippy/doc cleanup), and #194 (bitwise/heel_f64x8/distance/ byte_scan/spatial_hash). Same low-hanging-fruit criteria — no internal hpc/ deps, polyfill-clean, single-line back-compat shim keeps every existing import resolving. | Module | Reason | |---|---| | `aabb` | SIMD AABB intersection/expansion/distance; only deps are | | | `crate::simd::F32x16` + `super::simd_caps` (graduated #192). | | `nibble` | 4-bit packed nibble batch ops; only dep is `crate::simd::U8x64`.| | `palette_codec` | Variable-width palette index codec (1-8 bit packing); zero deps.| | `property_mask` | AVX-512 VPTERNLOGD bitset queries on block state bits; | | | only dep is `crate::simd::U64x8`. | # Why these four, why now All four satisfy the criteria from #194's wrap-up: 1. No internal `hpc/` dependencies — only `crate::simd::*` (polyfill surface) and `super::simd_caps` (which is itself at crate root post-#192). 2. Polyfill-clean — no raw-intrinsic refactor required. 3. Single in-tree downstream caller (`hpc::framebuffer` uses `palette_codec`) → the `pub use crate::palette_codec;` back-compat shim keeps that resolution working zero-touch. # Mechanical changes - `git mv src/hpc/{aabb,nibble,palette_codec,property_mask}.rs src/` - `src/lib.rs`: added four `pub mod` declarations under `#[cfg(feature = "std")]`, each with a `# Example` rustdoc block per CLAUDE.md "all public APIs need doc comments with examples". - `src/hpc/mod.rs`: replaced the four `pub mod` declarations with `pub use crate::{aabb, nibble, palette_codec, property_mask};` back-compat re-exports. `crate::hpc::aabb::*` and friends keep resolving for every existing call site, identical to how `crate::hpc::bitwise::*` works post-#194. # Clippy / lint cleanup 17 clippy errors surfaced under `-D warnings` once the modules left the `hpc/mod.rs` `#![allow(clippy::all, ...)]` umbrella. Fixed each at the canonical Rust idiom (the #194 cleanup pattern, 417131b), no umbrella re-application: - **manual_div_ceil (6 sites)** — `(n + d - 1) / d` → `n.div_ceil(d)` in `nibble.rs` (x2), `palette_codec.rs` (x3), `property_mask.rs`. - **needless_range_loop (10 sites)** — `for i in start..vec.len()` rewrites to `for x in &vec[start..]` (when index unused) or `for (i, &x) in iter().enumerate().skip(start)` (when index used). Sites: `aabb.rs` x4, `nibble.rs` x3, `palette_codec.rs` x1, `property_mask.rs` x2. - **missing_docs (4 sites)** — added field doc comments on `pub struct Aabb { min, max }` and `pub struct Ray { origin, inv_dir }`. Previously masked by the `hpc/mod.rs` umbrella's `#![allow(missing_docs)]`. # Doctest correction Initial `# Example` in `src/lib.rs` for `palette_codec` asserted `bits_for_palette_size(1) == 1` per the module's own docstring table, but the impl returns 0 for `palette_size <= 1` (trivial- palette special case). Changed assertion to use `bits_for_palette_ size(2) == 1` — exercises the same code path with input the impl actually handles per spec. # Verification ``` cargo check --lib green cargo clippy --lib -- -D warnings green cargo clippy --lib --features rayon -- -D warnings green cargo clippy --features approx,serde,rayon -- -D warnings green cargo test --doc (15 graduated-module doctests) pass cargo test --lib (104 unit tests across 4 modules) pass ``` # What's next `hpc/` inventory: ~55 → ~51 modules at the staging path. Next-batch candidates per the same criteria need a deps audit before move: `framebuffer` (uses `palette_codec` shim, otherwise crate-root), `ocr_simd`/`ocr_felt`, `audio`. Filed in AGENT_LOG entry for the follow-up pass. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Salvages the detection-only subset of the closed PR #190 — three real gaps in the substrate runtime dispatch — without inheriting any of #190's consumer-facing additions. No
SimdProfileenum, nocpu-*cargo features, no public dispatch-identity API;crate::simd::*remains the sole consumer entry point.What lands
src/hpc/simd_caps.rsSimdCapsfields:avx512fp16,avx512vp2intersect,amx_fp16falseon non-x86); ready for future kernels that route on FP16 / VP2INTERSECThas_avx512_fp16(),has_amx_fp16()has_amx,has_avx512_bf16etc. patterns.has_amx_fp16()defense-in-depths theamx_tilebit.cpu_ops()selectionsrc/simd_runtime/cpu_ops.rssimd_amx::amx_available()(the existing 4-step CPUID + OSXSAVE + XCR0 + arch_prctl check). Closes the SIGILL hole when a hypervisor masks XCR0 bits 17/18 or the OS hasn't honouredarch_prctl(XCOMP_PERM, 18)on Linux 5.19+.What's deliberately NOT here (rejected from #190)
SimdProfile14-variant enum +simd_profile()accessormatch profile { ... }arms — the polyfill-defeat pattern.cpu-*cargo features (13 mutually-exclusive flags)pub use ... SimdProfilefromsrc/simd.rssimd_profile_probeexample binaryWhy now
The AMX OS-gate fix is load-bearing on Sapphire Rapids hosts where the hypervisor masks the tile XSAVE state — currently
cpu_ops()routes toCPU_OPS_AMX_INT8, calls SIGILL on the first AMX instruction. After this PR it demotes toCPU_OPS_AVX512_VNNIcleanly without the consumer noticing. The bit-exact polyfill contract still holds (both tiers produce identical results — that's the W1a guarantee); the choice between them was just runtime-broken on AMX-but-OS-blocked hosts.The new
SimdCapsfields don't feed any selection yet; they're laying the runway. When an AMX-FP16 kernel lands (probably tied to the BF16/FP16 work hinted at in the matrix doc § J Phase 3b), the new tier would AND-gate oncaps.amx_fp16 && simd_amx::amx_available()between the AMX-INT8 and AVX-512-VNNI arms.Tests
simd_caps::tests:cpuid_extended_bits_smoke,has_amx_fp16_requires_amx_tile,x86_extended_bits_are_false_on_non_x86, plus extended determinism coverage of the new fields.cpu_ops::testsstill pass; the AMX OS-gate change passes through transparently on hosts whereamx_available()agrees with CPUID (the typical case).cargo build --no-default-featuresclean (new fields zero-init on non-x86 / scalar stubs).cargo fmt --all --checkclean.cargo clippy --features runtime-dispatch --lib -- -D warningsclean.Relation to closed #190
The closed PR's
simd_caps.rsadditions were correct and necessary; only the architectural overreach (SimdProfile / cpu-* / public dispatch-identity API) was the problem. This PR is the minimal extraction — ~130 lines, one commit, zero new public types.Generated by Claude Code