docs(simd): per-CPU dispatch matrix + Codex review fixes#181
Conversation
Follow-up to merged PR #180. Two changes: 1. NEW: .claude/knowledge/td-simd-cpu-dispatch-matrix.md Per-CPU feature table, every cell sourced from official spec (Intel ARK, AMD data sheets, WikiChip, Wikipedia AVX-512 article cross-referenced against primary sources). Rows: SkylakeX, CascadeLake, CooperLake, IceLakeSp, TigerLakeU, SapphireRapids, EmeraldRapids, GraniteRapids, Zen4, Zen5, ArrowLake, HaswellAvx2 (x86_64) plus Cortex-A53, A72, A76+ (aarch64). Columns: F, CD, VL, DQ, BW, IFMA, VBMI, VBMI2, VNNI, BF16, FP16, VPOPCNTDQ, BITALG, GFNI, VAES, VPCLMULQDQ, VP2INTERSECT, AMX-TILE, AMX-INT8, AMX-BF16, AMX-FP16, AVX-VNNI, AVX-VNNI-INT8, AVX-IFMA. Status legend: DOC (from official spec, dispatch safe but not yet hardware-verified by this project) vs TEST (verified on real silicon). Every cell is currently DOC — promotion to TEST happens as we acquire hardware for each profile. Includes: - Per-CPU microarchitecture notes with citations - CPUID leaf/register/bit table for hand-coded detection - OS XSAVE state requirements (XCR0 bits, arch_prctl for AMX) - SimdProfile::detect() pseudocode with the GraniteRapids- before-SapphireRapids ordering invariant - Out-of-scope CPUs listed (Knights Mill, Cannon Lake, Alder Lake firmware-disabled AVX-512, Sierra Forest, etc.) Critical detection invariants: - GraniteRapids checked BEFORE SapphireRapids (GNR has SPR bits + AMX-FP16; if SPR first, AMX-FP16 stays unused). - Zen4 vs SPR distinguished by amx_tile present/absent. - CooperLake vs IceLakeSp: mutually exclusive bit patterns (CPL has BF16 no VBMI; ICX has VBMI no BF16). - TigerLakeU vs IceLakeSp: discriminated by VP2INTERSECT. 2. Fixes to td-simd-integration-plan.md per Codex bot review on PR #180: - AMX MAC count comments corrected: TDPBF16PS = 8192 mul-adds per instruction (16×16 output × K=32), TDPBUSD = 16384 mul-adds per instruction (16×16 output × K=64). Previous "256 mul-adds/instr" understated by 32× and 64× respectively and would have skewed Phase 1 prioritization. Numbers now align with src/hpc/amx_matmul.rs:15 and bf16_tile_gemm.rs:155-157. - A72 vs A53 detection: replaced the unreliable `neon && aes` heuristic (both A53 with crypto and A72 have identical HWCAP flags) with explicit `Armv8Neon` fallback and a doc comment stating that /proc/cpuinfo `CPU part` reading is required to split them. SimdProfile enum, dispatch table, and quick-reference tables collapsed to single Armv8Neon variant. Future improvement: split into A72Fast/A53Baseline when /proc/cpuinfo lookup is wired. No code changes. Documentation only.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a9102cddc8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| SimdProfile::A76DotProd => &A76_GEMM, | ||
| SimdProfile::A72Fast => &A72_GEMM, | ||
| SimdProfile::A53Baseline => &A53_GEMM, | ||
| SimdProfile::Armv8Neon => &ARMV8_NEON_GEMM, |
There was a problem hiding this comment.
Define Armv8Neon table before using it in dispatch
The updated match arm now returns &ARMV8_NEON_GEMM, but this symbol is never defined in the example (the snippet still defines A72_GEMM and A53_GEMM). As written, anyone implementing Phase 3 from this “authoritative” plan will hit a compile error and may be unclear whether to alias or merge the old A72/A53 tables. Please add the missing ARMV8_NEON_GEMM definition (or update the surrounding table definitions) so the documented migration is internally consistent.
Useful? React with 👍 / 👎.
Self-correction. The previous commit collapsed A72Fast + A53Baseline into a new `Armv8Neon` variant, claiming the two could not be distinguished by HWCAP. That reinvented something the codebase already solves. `src/hpc/simd_caps.rs:317-336` has had `ArmProfile::arm_profile()` in tree since the SBC support landed. Its decision tree: asimd_dotprod present → A76DotProd (Pi 5 / A76+) aes present (no dotprod) → A72Fast (Pi 4 / Pi 3 / Pi Zero 2W) no aes → A53Baseline (QEMU / minimal aarch64) The line 329 comment explicitly admits the A72Fast branch catches both A72 silicon (Pi 4) and A53-with-crypto silicon (Pi 3, Pi Zero 2W): "we report A72-tier since most deployments target Pi 4." The dispatch tables would be identical at the ISA level (both are ARMv8.0+crypto, no dotprod), so this is intentional. The `A53Baseline` variant catches the rare case of NEON-without- crypto (QEMU, minimal aarch64 builds), which my `Armv8Neon` collapse lost. Changes: - Reverted SimdProfile enum to A76DotProd / A72Fast / A53Baseline. - detect() pseudocode now delegates to existing arm_profile() helper. - GemmDispatch table restored to 3 aarch64 entries. - Quick-reference tables list both A72Fast and A53Baseline rows with a note that they share the same kernel. - Dispatch matrix split into 4 rows: A53+crypto (→A72Fast), A53-no-crypto (→A53Baseline), A72 (→A72Fast), A76+ (→A76DotProd). This is more honest than the Armv8Neon collapse: it preserves the existing in-tree pattern, names it correctly, and documents the A72Fast-as-ARMv8.0+crypto-catch-all semantic that the codebase already chose.
Phase 3 T3.2: add 13 mutually-exclusive cargo features that pin simd_profile() to a const at compile time, bypassing the runtime LazyLock detection from T3.1. One feature per non-Scalar variant of the SimdProfile enum. Features (mapping to LLVM target-cpu codenames): cpu-gnr → GraniteRapids (graniterapids) cpu-spr → SapphireRapids (sapphirerapids) cpu-zen4 → Zen4Avx512 (znver4) cpu-cpl → CooperLake (cooperlake) cpu-tigerlake → TigerLakeU (tigerlake) cpu-icx → IceLakeSp (icelake-server) cpu-clx → CascadeLake (cascadelake) cpu-skx → SkylakeX (skylake-avx512) cpu-arrowlake → ArrowLake (arrowlake) cpu-haswell → HaswellAvx2 (haswell) cpu-a76 → A76DotProd (cortex-a76) cpu-a72 → A72Fast (cortex-a72) cpu-a53 → A53Baseline (cortex-a53) Mutual exclusion (per integration-plan risk #1: "if a user sets cpu-spr AND has runtime detection on Zen 4, the binary SIGILLs on AMX instructions") is enforced via a const assert: each cpu-* contributes 1 to _PIN_COUNT and the assert fires at compile time if the sum exceeds one. Verified: enabling cpu-spr+cpu-zen4 simultaneously produces a build error citing the mutex. Implementation: a const pinned_profile() -> Option<SimdProfile> walks a cfg cascade and returns the active variant or None. The simd_profile() function exists in two cfg-gated forms — a runtime LazyLock variant compiled when no cpu-* feature is set, and a const-foldable variant compiled when any is set. The LazyLock is not linked into pinned binaries. is_pinned() const helper exposes whether compile-time dispatch is active, useful both for consumer-facing diagnostics and for gating arch-detection tests that no longer apply when pinning overrides hardware. Existing x86_target_lands_inside_x86_family / aarch64_target_lands_inside_aarch64_family tests early-return when pinned; two new tests (pinning_default_is_off, pinning_consistency) verify the const + runtime paths agree. Tests: 2077/2077 lib tests pass under both default and --features cpu-spr configurations. cargo clippy --lib -- -D warnings clean in both modes. Mutex compile_error verified by attempting --features "cpu-spr,cpu-zen4" — fails as expected with the const assert citation. The default (no feature) path is byte-identical to the T3.1 runtime detection — no regression risk to the merged #181 work or to the e40f3a3 SimdProfile commit.
Phase 3 T3.2: add 13 mutually-exclusive cargo features that pin simd_profile() to a const at compile time, bypassing the runtime LazyLock detection from T3.1. One feature per non-Scalar variant of the SimdProfile enum. Features (mapping to LLVM target-cpu codenames): cpu-gnr → GraniteRapids (graniterapids) cpu-spr → SapphireRapids (sapphirerapids) cpu-zen4 → Zen4Avx512 (znver4) cpu-cpl → CooperLake (cooperlake) cpu-tigerlake → TigerLakeU (tigerlake) cpu-icx → IceLakeSp (icelake-server) cpu-clx → CascadeLake (cascadelake) cpu-skx → SkylakeX (skylake-avx512) cpu-arrowlake → ArrowLake (arrowlake) cpu-haswell → HaswellAvx2 (haswell) cpu-a76 → A76DotProd (cortex-a76) cpu-a72 → A72Fast (cortex-a72) cpu-a53 → A53Baseline (cortex-a53) Mutual exclusion (per integration-plan risk #1: "if a user sets cpu-spr AND has runtime detection on Zen 4, the binary SIGILLs on AMX instructions") is enforced via a const assert: each cpu-* contributes 1 to _PIN_COUNT and the assert fires at compile time if the sum exceeds one. Verified: enabling cpu-spr+cpu-zen4 simultaneously produces a build error citing the mutex. Implementation: a const pinned_profile() -> Option<SimdProfile> walks a cfg cascade and returns the active variant or None. The simd_profile() function exists in two cfg-gated forms — a runtime LazyLock variant compiled when no cpu-* feature is set, and a const-foldable variant compiled when any is set. The LazyLock is not linked into pinned binaries. is_pinned() const helper exposes whether compile-time dispatch is active, useful both for consumer-facing diagnostics and for gating arch-detection tests that no longer apply when pinning overrides hardware. Existing x86_target_lands_inside_x86_family / aarch64_target_lands_inside_aarch64_family tests early-return when pinned; two new tests (pinning_default_is_off, pinning_consistency) verify the const + runtime paths agree. Tests: 2077/2077 lib tests pass under both default and --features cpu-spr configurations. cargo clippy --lib -- -D warnings clean in both modes. Mutex compile_error verified by attempting --features "cpu-spr,cpu-zen4" — fails as expected with the const assert citation. The default (no feature) path is byte-identical to the T3.1 runtime detection — no regression risk to the merged #181 work or to the e40f3a3 SimdProfile commit.
…able Rebased onto master post-#181, #182, #183. Replaces the polyfill-based add_mul_f32/f64 with LazyLock-cached function pointers picking real hardware FMA per silicon, and adds two more LazyLock-cached primitives the consumer needs: is_amx_available() and vnni_dot_u8_i8. WHY: F32x16::mul_add on AVX2 builds drops to per-lane scalar f32::mul_add (simd_avx2.rs:586). The polyfill abstracts lane width but cannot pick between _mm256_fmadd_ps and _mm512_fmadd_ps — that is an instruction-family choice, not a lane-width one. LazyLock amortises a one-time simd_caps() read into a frozen fn pointer; every subsequent call is a single indirect jump with zero is_x86_feature_detected! overhead. No SimdProfile exposed at the consumer surface — agnostic contract preserved. add_mul_f32(acc, a, b) — acc[i] += a[i]*b[i] AVX-512F+FMA → _mm512_fmadd_ps 16-wide + 8-wide tail + scalar tail AVX2+FMA → _mm256_fmadd_ps 8-wide + scalar tail NEON → vfmaq_f32 4-wide + scalar tail scalar → f32::mul_add per lane no_std build → preserves the polyfill F32x16::mul_add path (LazyLock requires std) add_mul_f64(acc, a, b) — f64 sibling, same shape with 8/4/2 lanes. is_amx_available() — wraps simd_amx::amx_available() (CPUID + OSXSAVE + XCR0[17,18] + Linux arch_prctl(XCOMP_PERM)) in LazyLock<bool>. The 4-step gate, including the syscall, fires exactly once per process. Always false on non-x86_64. vnni_dot_u8_i8(a, b) — i32 dot of u8 × i8 slices: AVX-512 VNNI → delegates to simd_amx::vnni_dot_u8_i8 wrapped with scalar tail handling (the existing kernel processes only n - (n%64) since its cognitive-shader caller pre-aligns rows; general-purpose callers need the tail) AVX-VNNI 256 → delegates to simd_amx::vnni2_dot_u8_i8 directly (that one already handles its scalar tail) scalar → simd_amx::vnni_dot_u8_i8_scalar No intrinsic code is duplicated. The dispatcher composes existing simd_amx::* kernels (which #182/#184 also build on) into a safe LazyLock-cached consumer-facing wrapper. simd_amx::matvec_dispatch runs the same selection logic but uses is_x86_feature_detected! per call; this wrapper amortises that to once at startup. PARITY CONTRACT: - add_mul_f32 / add_mul_f64: bit-identical to f32::mul_add / f64::mul_add per lane via to_bits() assertion. All vector backends emit single-rounded IEEE-754 FMA. - vnni_dot_u8_i8: bit-identical i32 to scalar widen-and-multiply. VPDPBUSD does not saturate the accumulator (intermediate u8*i8 products bounded by 32385, four-element sums by 129540). Tests: 2101/2101 lib pass (7 new lazylock_dispatch_tests over 12 problem sizes / tail lengths). cargo clippy --lib clean under default and --features cpu-spr. On Sapphire Rapids host the LazyLock resolved to AVX-512+FMA for add_mul, AVX-512 VNNI for vnni_dot; AMX is_amx_available returns false (hypervisor masks XCR0[17,18]) — matches the Risk #3 demotion from 61b4563. This commit was rebased atop master after the parallel session shipped PR #182 (BF16 AMX tile kernels), #183 (F16C cast batch), and prepared #184 (TDPBUSD int8 tile + matmul_i8_to_i32 wiring). The earlier 469ecc7 (coarse + SimdTier) and 77e3971 (mul_add_f32_into + walkback) and be65595 (is_amx_available + vnni_dot duplicating intrinsics) are subsumed by this single clean commit: no public SimdProfile / SimdTier re-export, no duplicated intrinsic code, no mul_add_f32_into (master's add_mul_f32 shape is the right primitive).
Phase 3 T3.2: add 13 mutually-exclusive cargo features that pin simd_profile() to a const at compile time, bypassing the runtime LazyLock detection from T3.1. One feature per non-Scalar variant of the SimdProfile enum. Features (mapping to LLVM target-cpu codenames): cpu-gnr → GraniteRapids (graniterapids) cpu-spr → SapphireRapids (sapphirerapids) cpu-zen4 → Zen4Avx512 (znver4) cpu-cpl → CooperLake (cooperlake) cpu-tigerlake → TigerLakeU (tigerlake) cpu-icx → IceLakeSp (icelake-server) cpu-clx → CascadeLake (cascadelake) cpu-skx → SkylakeX (skylake-avx512) cpu-arrowlake → ArrowLake (arrowlake) cpu-haswell → HaswellAvx2 (haswell) cpu-a76 → A76DotProd (cortex-a76) cpu-a72 → A72Fast (cortex-a72) cpu-a53 → A53Baseline (cortex-a53) Mutual exclusion (per integration-plan risk #1: "if a user sets cpu-spr AND has runtime detection on Zen 4, the binary SIGILLs on AMX instructions") is enforced via a const assert: each cpu-* contributes 1 to _PIN_COUNT and the assert fires at compile time if the sum exceeds one. Verified: enabling cpu-spr+cpu-zen4 simultaneously produces a build error citing the mutex. Implementation: a const pinned_profile() -> Option<SimdProfile> walks a cfg cascade and returns the active variant or None. The simd_profile() function exists in two cfg-gated forms — a runtime LazyLock variant compiled when no cpu-* feature is set, and a const-foldable variant compiled when any is set. The LazyLock is not linked into pinned binaries. is_pinned() const helper exposes whether compile-time dispatch is active, useful both for consumer-facing diagnostics and for gating arch-detection tests that no longer apply when pinning overrides hardware. Existing x86_target_lands_inside_x86_family / aarch64_target_lands_inside_aarch64_family tests early-return when pinned; two new tests (pinning_default_is_off, pinning_consistency) verify the const + runtime paths agree. Tests: 2077/2077 lib tests pass under both default and --features cpu-spr configurations. cargo clippy --lib -- -D warnings clean in both modes. Mutex compile_error verified by attempting --features "cpu-spr,cpu-zen4" — fails as expected with the const assert citation. The default (no feature) path is byte-identical to the T3.1 runtime detection — no regression risk to the merged #181 work or to the e40f3a3 SimdProfile commit.
Summary
Follow-up to merged #180. Two changes, both documentation:
NEW
.claude/knowledge/td-simd-cpu-dispatch-matrix.md— authoritative per-CPU feature table, every cell sourced from official spec. This is the fileSimdProfile::detect()(Phase 3 of docs(simd): tier debt audit + 4-phase integration plan #180's integration plan) will be implemented against.FIX
.claude/knowledge/td-simd-integration-plan.md— addresses both Codex bot review comments on docs(simd): tier debt audit + 4-phase integration plan #180.The dispatch matrix
CPUs the user specifically asked about, every cell
DOC-sourced from Intel ARK / AMD data sheets / WikiChip / Wikipedia (cross-referenced against primary specs):Sapphire Rapids HBM (Xeon Max) has identical ISA to standard SPR — one profile covers both. The full matrix in the PR also covers Cascade Lake, Ice Lake-SP, Emerald Rapids, Granite Rapids, Arrow Lake, plus GFNI/VAES/VPCLMULQDQ/VP2INTERSECT/BITALG/VPOPCNTDQ columns.
Zen 5 has identical ISA to Zen 4 — the difference is microarchitectural (full 512-bit datapath vs Zen 4's double-pumped 256-bit). Same
Zen4Avx512profile for both; splitting only matters if tile sizes need to differ, which can wait for benchmarks.Status legend (the user's "safe until we have CPU to test")
Per cell:
The TEST promotion checklist is in the matrix doc — exercise one instruction per feature column, compare against scalar reference, then flip DOC → TEST with the verification commit cited.
Detection invariants
The matrix doc spells out the
SimdProfile::detect()ordering invariants:amx_tile(SPR yes, Zen 4 no).Also includes the full CPUID leaf/register/bit table so the existing
simd_caps.rs::detect()can be extended (currently doesn't read leaf 7,1 — needs that for GNR's AMX-FP16 and Arrow Lake's AVX-VNNI-INT8).Codex bot fixes
Codex review on #180 raised two P2 findings. Both addressed:
AMX MAC counts — my SPR dispatch comments said "256 mul-adds/instr" for both TDPBF16PS and TDPBUSD. Bot correctly pointed out these are tile-level dot products with much higher throughput. Fixed:
src/hpc/amx_matmul.rs:15andbf16_tile_gemm.rs:155-157.A72 vs A53 detection — my proposed
neon && aesheuristic was unreliable (both A53-with-crypto and A72 have identical HWCAP flags; the difference is microarchitectural, not ISA). Replaced with singleArmv8Neonfallback variant + doc comment stating that/proc/cpuinfoCPU partreading is the only reliable discriminator. SimdProfile enum, dispatch table, and quick-reference tables collapsed accordingly. Future split into A72Fast/A53Baseline is opt-in once that lookup is wired.Test plan
SimdProfilevariant exercised via cargocpu-*feature matrix.simd_profile()returns expected variant, exercise one instruction per DOC-marked feature column, promote cells DOC → TEST.Generated by Claude Code