docs(simd): per-CPU dispatch matrix + Codex review fixes by AdaWorldAPI · Pull Request #181 · AdaWorldAPI/ndarray

AdaWorldAPI · 2026-05-20T19:48:32Z

Summary

Follow-up to merged #180. Two changes, both documentation:

NEW .claude/knowledge/td-simd-cpu-dispatch-matrix.md — authoritative per-CPU feature table, every cell sourced from official spec. This is the file SimdProfile::detect() (Phase 3 of docs(simd): tier debt audit + 4-phase integration plan #180's integration plan) will be implemented against.
FIX .claude/knowledge/td-simd-integration-plan.md — addresses both Codex bot review comments on docs(simd): tier debt audit + 4-phase integration plan #180.

The dispatch matrix

CPUs the user specifically asked about, every cell DOC-sourced from Intel ARK / AMD data sheets / WikiChip / Wikipedia (cross-referenced against primary specs):

CPU profile	F	CD	VL	DQ	BW	IFMA	VBMI	VBMI2	VNNI	BF16	FP16	AMX-TILE	AMX-INT8	AMX-BF16	AMX-FP16
SkylakeX	✓	✓	✓	✓	✓	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗
CooperLake	✓	✓	✓	✓	✓	✗	✗	✗	✓	✓	✗	✗	✗	✗	✗
TigerLakeU	✓	✓	✓	✓	✓	✓	✓	✓	✓	✗	✗	✗	✗	✗	✗
SapphireRapids	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✗
Zen4	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✗	✗	✗	✗
Zen5	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✗	✗	✗	✗

Sapphire Rapids HBM (Xeon Max) has identical ISA to standard SPR — one profile covers both. The full matrix in the PR also covers Cascade Lake, Ice Lake-SP, Emerald Rapids, Granite Rapids, Arrow Lake, plus GFNI/VAES/VPCLMULQDQ/VP2INTERSECT/BITALG/VPOPCNTDQ columns.

Zen 5 has identical ISA to Zen 4 — the difference is microarchitectural (full 512-bit datapath vs Zen 4's double-pumped 256-bit). Same Zen4Avx512 profile for both; splitting only matters if tile sizes need to differ, which can wait for benchmarks.

Status legend (the user's "safe until we have CPU to test")

Per cell:

DOC — feature presence stated in official spec. Dispatch is safe (Intel/AMD say it works) but unverified on physical silicon by this project. Every cell is currently DOC.
TEST — verified on real hardware by this project. Promoted from DOC after first hardware run.
✗ — explicitly absent per source.

The TEST promotion checklist is in the matrix doc — exercise one instruction per feature column, compare against scalar reference, then flip DOC → TEST with the verification commit cited.

Detection invariants

The matrix doc spells out the SimdProfile::detect() ordering invariants:

GraniteRapids checked BEFORE SapphireRapids — GNR has all SPR bits plus AMX-FP16. If you check SPR first, AMX-FP16 stays unused on GNR silicon.
Zen4 vs SapphireRapids — both have F+VBMI+BF16+FP16. Discriminator is amx_tile (SPR yes, Zen 4 no).
CooperLake vs IceLakeSp — both have AVX-512+VNNI. CPL has BF16 without VBMI. ICX has VBMI without BF16. Mutually exclusive bit patterns.
TigerLakeU vs IceLakeSp — same feature set EXCEPT VP2INTERSECT (TigerLake mobile only).

Also includes the full CPUID leaf/register/bit table so the existing simd_caps.rs::detect() can be extended (currently doesn't read leaf 7,1 — needs that for GNR's AMX-FP16 and Arrow Lake's AVX-VNNI-INT8).

Codex bot fixes

Codex review on #180 raised two P2 findings. Both addressed:

AMX MAC counts — my SPR dispatch comments said "256 mul-adds/instr" for both TDPBF16PS and TDPBUSD. Bot correctly pointed out these are tile-level dot products with much higher throughput. Fixed:
- TDPBF16PS: 16×16 output × K=32 = 8192 mul-adds/instr
- TDPBUSD: 16×16 output × K=64 = 16384 mul-adds/instr
- Numbers now align with src/hpc/amx_matmul.rs:15 and bf16_tile_gemm.rs:155-157.
A72 vs A53 detection — my proposed neon && aes heuristic was unreliable (both A53-with-crypto and A72 have identical HWCAP flags; the difference is microarchitectural, not ISA). Replaced with single Armv8Neon fallback variant + doc comment stating that /proc/cpuinfo CPU part reading is the only reliable discriminator. SimdProfile enum, dispatch table, and quick-reference tables collapsed accordingly. Future split into A72Fast/A53Baseline is opt-in once that lookup is wired.

Test plan

No code changes — nothing to test in CI for this PR.
Phase 3 implementation (separate PR): each SimdProfile variant exercised via cargo cpu-* feature matrix.
Hardware verification (per-profile, parallel): boot binary on real silicon, confirm simd_profile() returns expected variant, exercise one instruction per DOC-marked feature column, promote cells DOC → TEST.

Generated by Claude Code

Follow-up to merged PR #180. Two changes: 1. NEW: .claude/knowledge/td-simd-cpu-dispatch-matrix.md Per-CPU feature table, every cell sourced from official spec (Intel ARK, AMD data sheets, WikiChip, Wikipedia AVX-512 article cross-referenced against primary sources). Rows: SkylakeX, CascadeLake, CooperLake, IceLakeSp, TigerLakeU, SapphireRapids, EmeraldRapids, GraniteRapids, Zen4, Zen5, ArrowLake, HaswellAvx2 (x86_64) plus Cortex-A53, A72, A76+ (aarch64). Columns: F, CD, VL, DQ, BW, IFMA, VBMI, VBMI2, VNNI, BF16, FP16, VPOPCNTDQ, BITALG, GFNI, VAES, VPCLMULQDQ, VP2INTERSECT, AMX-TILE, AMX-INT8, AMX-BF16, AMX-FP16, AVX-VNNI, AVX-VNNI-INT8, AVX-IFMA. Status legend: DOC (from official spec, dispatch safe but not yet hardware-verified by this project) vs TEST (verified on real silicon). Every cell is currently DOC — promotion to TEST happens as we acquire hardware for each profile. Includes: - Per-CPU microarchitecture notes with citations - CPUID leaf/register/bit table for hand-coded detection - OS XSAVE state requirements (XCR0 bits, arch_prctl for AMX) - SimdProfile::detect() pseudocode with the GraniteRapids- before-SapphireRapids ordering invariant - Out-of-scope CPUs listed (Knights Mill, Cannon Lake, Alder Lake firmware-disabled AVX-512, Sierra Forest, etc.) Critical detection invariants: - GraniteRapids checked BEFORE SapphireRapids (GNR has SPR bits + AMX-FP16; if SPR first, AMX-FP16 stays unused). - Zen4 vs SPR distinguished by amx_tile present/absent. - CooperLake vs IceLakeSp: mutually exclusive bit patterns (CPL has BF16 no VBMI; ICX has VBMI no BF16). - TigerLakeU vs IceLakeSp: discriminated by VP2INTERSECT. 2. Fixes to td-simd-integration-plan.md per Codex bot review on PR #180: - AMX MAC count comments corrected: TDPBF16PS = 8192 mul-adds per instruction (16×16 output × K=32), TDPBUSD = 16384 mul-adds per instruction (16×16 output × K=64). Previous "256 mul-adds/instr" understated by 32× and 64× respectively and would have skewed Phase 1 prioritization. Numbers now align with src/hpc/amx_matmul.rs:15 and bf16_tile_gemm.rs:155-157. - A72 vs A53 detection: replaced the unreliable `neon && aes` heuristic (both A53 with crypto and A72 have identical HWCAP flags) with explicit `Armv8Neon` fallback and a doc comment stating that /proc/cpuinfo `CPU part` reading is required to split them. SimdProfile enum, dispatch table, and quick-reference tables collapsed to single Armv8Neon variant. Future improvement: split into A72Fast/A53Baseline when /proc/cpuinfo lookup is wired. No code changes. Documentation only.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a9102cddc8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-20T19:51:29Z

            SimdProfile::A76DotProd => &A76_GEMM,
-            SimdProfile::A72Fast => &A72_GEMM,
-            SimdProfile::A53Baseline => &A53_GEMM,
+            SimdProfile::Armv8Neon => &ARMV8_NEON_GEMM,


Define Armv8Neon table before using it in dispatch

The updated match arm now returns &ARMV8_NEON_GEMM, but this symbol is never defined in the example (the snippet still defines A72_GEMM and A53_GEMM). As written, anyone implementing Phase 3 from this “authoritative” plan will hit a compile error and may be unclear whether to alias or merge the old A72/A53 tables. Please add the missing ARMV8_NEON_GEMM definition (or update the surrounding table definitions) so the documented migration is internally consistent.

Useful? React with 👍 / 👎.

Self-correction. The previous commit collapsed A72Fast + A53Baseline into a new `Armv8Neon` variant, claiming the two could not be distinguished by HWCAP. That reinvented something the codebase already solves. `src/hpc/simd_caps.rs:317-336` has had `ArmProfile::arm_profile()` in tree since the SBC support landed. Its decision tree: asimd_dotprod present → A76DotProd (Pi 5 / A76+) aes present (no dotprod) → A72Fast (Pi 4 / Pi 3 / Pi Zero 2W) no aes → A53Baseline (QEMU / minimal aarch64) The line 329 comment explicitly admits the A72Fast branch catches both A72 silicon (Pi 4) and A53-with-crypto silicon (Pi 3, Pi Zero 2W): "we report A72-tier since most deployments target Pi 4." The dispatch tables would be identical at the ISA level (both are ARMv8.0+crypto, no dotprod), so this is intentional. The `A53Baseline` variant catches the rare case of NEON-without- crypto (QEMU, minimal aarch64 builds), which my `Armv8Neon` collapse lost. Changes: - Reverted SimdProfile enum to A76DotProd / A72Fast / A53Baseline. - detect() pseudocode now delegates to existing arm_profile() helper. - GemmDispatch table restored to 3 aarch64 entries. - Quick-reference tables list both A72Fast and A53Baseline rows with a note that they share the same kernel. - Dispatch matrix split into 4 rows: A53+crypto (→A72Fast), A53-no-crypto (→A53Baseline), A72 (→A72Fast), A76+ (→A76DotProd). This is more honest than the Armv8Neon collapse: it preserves the existing in-tree pattern, names it correctly, and documents the A72Fast-as-ARMv8.0+crypto-catch-all semantic that the codebase already chose.

Phase 3 T3.2: add 13 mutually-exclusive cargo features that pin simd_profile() to a const at compile time, bypassing the runtime LazyLock detection from T3.1. One feature per non-Scalar variant of the SimdProfile enum. Features (mapping to LLVM target-cpu codenames): cpu-gnr → GraniteRapids (graniterapids) cpu-spr → SapphireRapids (sapphirerapids) cpu-zen4 → Zen4Avx512 (znver4) cpu-cpl → CooperLake (cooperlake) cpu-tigerlake → TigerLakeU (tigerlake) cpu-icx → IceLakeSp (icelake-server) cpu-clx → CascadeLake (cascadelake) cpu-skx → SkylakeX (skylake-avx512) cpu-arrowlake → ArrowLake (arrowlake) cpu-haswell → HaswellAvx2 (haswell) cpu-a76 → A76DotProd (cortex-a76) cpu-a72 → A72Fast (cortex-a72) cpu-a53 → A53Baseline (cortex-a53) Mutual exclusion (per integration-plan risk #1: "if a user sets cpu-spr AND has runtime detection on Zen 4, the binary SIGILLs on AMX instructions") is enforced via a const assert: each cpu-* contributes 1 to _PIN_COUNT and the assert fires at compile time if the sum exceeds one. Verified: enabling cpu-spr+cpu-zen4 simultaneously produces a build error citing the mutex. Implementation: a const pinned_profile() -> Option<SimdProfile> walks a cfg cascade and returns the active variant or None. The simd_profile() function exists in two cfg-gated forms — a runtime LazyLock variant compiled when no cpu-* feature is set, and a const-foldable variant compiled when any is set. The LazyLock is not linked into pinned binaries. is_pinned() const helper exposes whether compile-time dispatch is active, useful both for consumer-facing diagnostics and for gating arch-detection tests that no longer apply when pinning overrides hardware. Existing x86_target_lands_inside_x86_family / aarch64_target_lands_inside_aarch64_family tests early-return when pinned; two new tests (pinning_default_is_off, pinning_consistency) verify the const + runtime paths agree. Tests: 2077/2077 lib tests pass under both default and --features cpu-spr configurations. cargo clippy --lib -- -D warnings clean in both modes. Mutex compile_error verified by attempting --features "cpu-spr,cpu-zen4" — fails as expected with the const assert citation. The default (no feature) path is byte-identical to the T3.1 runtime detection — no regression risk to the merged #181 work or to the e40f3a3 SimdProfile commit.

…able Rebased onto master post-#181, #182, #183. Replaces the polyfill-based add_mul_f32/f64 with LazyLock-cached function pointers picking real hardware FMA per silicon, and adds two more LazyLock-cached primitives the consumer needs: is_amx_available() and vnni_dot_u8_i8. WHY: F32x16::mul_add on AVX2 builds drops to per-lane scalar f32::mul_add (simd_avx2.rs:586). The polyfill abstracts lane width but cannot pick between _mm256_fmadd_ps and _mm512_fmadd_ps — that is an instruction-family choice, not a lane-width one. LazyLock amortises a one-time simd_caps() read into a frozen fn pointer; every subsequent call is a single indirect jump with zero is_x86_feature_detected! overhead. No SimdProfile exposed at the consumer surface — agnostic contract preserved. add_mul_f32(acc, a, b) — acc[i] += a[i]*b[i] AVX-512F+FMA → _mm512_fmadd_ps 16-wide + 8-wide tail + scalar tail AVX2+FMA → _mm256_fmadd_ps 8-wide + scalar tail NEON → vfmaq_f32 4-wide + scalar tail scalar → f32::mul_add per lane no_std build → preserves the polyfill F32x16::mul_add path (LazyLock requires std) add_mul_f64(acc, a, b) — f64 sibling, same shape with 8/4/2 lanes. is_amx_available() — wraps simd_amx::amx_available() (CPUID + OSXSAVE + XCR0[17,18] + Linux arch_prctl(XCOMP_PERM)) in LazyLock<bool>. The 4-step gate, including the syscall, fires exactly once per process. Always false on non-x86_64. vnni_dot_u8_i8(a, b) — i32 dot of u8 × i8 slices: AVX-512 VNNI → delegates to simd_amx::vnni_dot_u8_i8 wrapped with scalar tail handling (the existing kernel processes only n - (n%64) since its cognitive-shader caller pre-aligns rows; general-purpose callers need the tail) AVX-VNNI 256 → delegates to simd_amx::vnni2_dot_u8_i8 directly (that one already handles its scalar tail) scalar → simd_amx::vnni_dot_u8_i8_scalar No intrinsic code is duplicated. The dispatcher composes existing simd_amx::* kernels (which #182/#184 also build on) into a safe LazyLock-cached consumer-facing wrapper. simd_amx::matvec_dispatch runs the same selection logic but uses is_x86_feature_detected! per call; this wrapper amortises that to once at startup. PARITY CONTRACT: - add_mul_f32 / add_mul_f64: bit-identical to f32::mul_add / f64::mul_add per lane via to_bits() assertion. All vector backends emit single-rounded IEEE-754 FMA. - vnni_dot_u8_i8: bit-identical i32 to scalar widen-and-multiply. VPDPBUSD does not saturate the accumulator (intermediate u8*i8 products bounded by 32385, four-element sums by 129540). Tests: 2101/2101 lib pass (7 new lazylock_dispatch_tests over 12 problem sizes / tail lengths). cargo clippy --lib clean under default and --features cpu-spr. On Sapphire Rapids host the LazyLock resolved to AVX-512+FMA for add_mul, AVX-512 VNNI for vnni_dot; AMX is_amx_available returns false (hypervisor masks XCR0[17,18]) — matches the Risk #3 demotion from 61b4563. This commit was rebased atop master after the parallel session shipped PR #182 (BF16 AMX tile kernels), #183 (F16C cast batch), and prepared #184 (TDPBUSD int8 tile + matmul_i8_to_i32 wiring). The earlier 469ecc7 (coarse + SimdTier) and 77e3971 (mul_add_f32_into + walkback) and be65595 (is_amx_available + vnni_dot duplicating intrinsics) are subsumed by this single clean commit: no public SimdProfile / SimdTier re-export, no duplicated intrinsic code, no mul_add_f32_into (master's add_mul_f32 shape is the right primitive).

Phase 3 T3.2: add 13 mutually-exclusive cargo features that pin simd_profile() to a const at compile time, bypassing the runtime LazyLock detection from T3.1. One feature per non-Scalar variant of the SimdProfile enum. Features (mapping to LLVM target-cpu codenames): cpu-gnr → GraniteRapids (graniterapids) cpu-spr → SapphireRapids (sapphirerapids) cpu-zen4 → Zen4Avx512 (znver4) cpu-cpl → CooperLake (cooperlake) cpu-tigerlake → TigerLakeU (tigerlake) cpu-icx → IceLakeSp (icelake-server) cpu-clx → CascadeLake (cascadelake) cpu-skx → SkylakeX (skylake-avx512) cpu-arrowlake → ArrowLake (arrowlake) cpu-haswell → HaswellAvx2 (haswell) cpu-a76 → A76DotProd (cortex-a76) cpu-a72 → A72Fast (cortex-a72) cpu-a53 → A53Baseline (cortex-a53) Mutual exclusion (per integration-plan risk #1: "if a user sets cpu-spr AND has runtime detection on Zen 4, the binary SIGILLs on AMX instructions") is enforced via a const assert: each cpu-* contributes 1 to _PIN_COUNT and the assert fires at compile time if the sum exceeds one. Verified: enabling cpu-spr+cpu-zen4 simultaneously produces a build error citing the mutex. Implementation: a const pinned_profile() -> Option<SimdProfile> walks a cfg cascade and returns the active variant or None. The simd_profile() function exists in two cfg-gated forms — a runtime LazyLock variant compiled when no cpu-* feature is set, and a const-foldable variant compiled when any is set. The LazyLock is not linked into pinned binaries. is_pinned() const helper exposes whether compile-time dispatch is active, useful both for consumer-facing diagnostics and for gating arch-detection tests that no longer apply when pinning overrides hardware. Existing x86_target_lands_inside_x86_family / aarch64_target_lands_inside_aarch64_family tests early-return when pinned; two new tests (pinning_default_is_off, pinning_consistency) verify the const + runtime paths agree. Tests: 2077/2077 lib tests pass under both default and --features cpu-spr configurations. cargo clippy --lib -- -D warnings clean in both modes. Mutex compile_error verified by attempting --features "cpu-spr,cpu-zen4" — fails as expected with the const assert citation. The default (no feature) path is byte-identical to the T3.1 runtime detection — no regression risk to the merged #181 work or to the e40f3a3 SimdProfile commit.

chatgpt-codex-connector Bot reviewed May 20, 2026

View reviewed changes

AdaWorldAPI merged commit 3568003 into master May 20, 2026
17 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(simd): per-CPU dispatch matrix + Codex review fixes#181

docs(simd): per-CPU dispatch matrix + Codex review fixes#181
AdaWorldAPI merged 2 commits into
masterfrom
claude/pr-x-cpu-dispatch-matrix

AdaWorldAPI commented May 20, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AdaWorldAPI commented May 20, 2026

Summary

The dispatch matrix

Status legend (the user's "safe until we have CPU to test")

Detection invariants

Codex bot fixes

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants