Skip to content

docs(simd): per-CPU dispatch matrix + Codex review fixes#181

Merged
AdaWorldAPI merged 2 commits into
masterfrom
claude/pr-x-cpu-dispatch-matrix
May 20, 2026
Merged

docs(simd): per-CPU dispatch matrix + Codex review fixes#181
AdaWorldAPI merged 2 commits into
masterfrom
claude/pr-x-cpu-dispatch-matrix

Conversation

@AdaWorldAPI
Copy link
Copy Markdown
Owner

Summary

Follow-up to merged #180. Two changes, both documentation:

  1. NEW .claude/knowledge/td-simd-cpu-dispatch-matrix.md — authoritative per-CPU feature table, every cell sourced from official spec. This is the file SimdProfile::detect() (Phase 3 of docs(simd): tier debt audit + 4-phase integration plan #180's integration plan) will be implemented against.

  2. FIX .claude/knowledge/td-simd-integration-plan.md — addresses both Codex bot review comments on docs(simd): tier debt audit + 4-phase integration plan #180.

The dispatch matrix

CPUs the user specifically asked about, every cell DOC-sourced from Intel ARK / AMD data sheets / WikiChip / Wikipedia (cross-referenced against primary specs):

CPU profile F CD VL DQ BW IFMA VBMI VBMI2 VNNI BF16 FP16 AMX-TILE AMX-INT8 AMX-BF16 AMX-FP16
SkylakeX
CooperLake
TigerLakeU
SapphireRapids
Zen4
Zen5

Sapphire Rapids HBM (Xeon Max) has identical ISA to standard SPR — one profile covers both. The full matrix in the PR also covers Cascade Lake, Ice Lake-SP, Emerald Rapids, Granite Rapids, Arrow Lake, plus GFNI/VAES/VPCLMULQDQ/VP2INTERSECT/BITALG/VPOPCNTDQ columns.

Zen 5 has identical ISA to Zen 4 — the difference is microarchitectural (full 512-bit datapath vs Zen 4's double-pumped 256-bit). Same Zen4Avx512 profile for both; splitting only matters if tile sizes need to differ, which can wait for benchmarks.

Status legend (the user's "safe until we have CPU to test")

Per cell:

  • DOC — feature presence stated in official spec. Dispatch is safe (Intel/AMD say it works) but unverified on physical silicon by this project. Every cell is currently DOC.
  • TEST — verified on real hardware by this project. Promoted from DOC after first hardware run.
  • — explicitly absent per source.

The TEST promotion checklist is in the matrix doc — exercise one instruction per feature column, compare against scalar reference, then flip DOC → TEST with the verification commit cited.

Detection invariants

The matrix doc spells out the SimdProfile::detect() ordering invariants:

  1. GraniteRapids checked BEFORE SapphireRapids — GNR has all SPR bits plus AMX-FP16. If you check SPR first, AMX-FP16 stays unused on GNR silicon.
  2. Zen4 vs SapphireRapids — both have F+VBMI+BF16+FP16. Discriminator is amx_tile (SPR yes, Zen 4 no).
  3. CooperLake vs IceLakeSp — both have AVX-512+VNNI. CPL has BF16 without VBMI. ICX has VBMI without BF16. Mutually exclusive bit patterns.
  4. TigerLakeU vs IceLakeSp — same feature set EXCEPT VP2INTERSECT (TigerLake mobile only).

Also includes the full CPUID leaf/register/bit table so the existing simd_caps.rs::detect() can be extended (currently doesn't read leaf 7,1 — needs that for GNR's AMX-FP16 and Arrow Lake's AVX-VNNI-INT8).

Codex bot fixes

Codex review on #180 raised two P2 findings. Both addressed:

  1. AMX MAC counts — my SPR dispatch comments said "256 mul-adds/instr" for both TDPBF16PS and TDPBUSD. Bot correctly pointed out these are tile-level dot products with much higher throughput. Fixed:

    • TDPBF16PS: 16×16 output × K=32 = 8192 mul-adds/instr
    • TDPBUSD: 16×16 output × K=64 = 16384 mul-adds/instr
    • Numbers now align with src/hpc/amx_matmul.rs:15 and bf16_tile_gemm.rs:155-157.
  2. A72 vs A53 detection — my proposed neon && aes heuristic was unreliable (both A53-with-crypto and A72 have identical HWCAP flags; the difference is microarchitectural, not ISA). Replaced with single Armv8Neon fallback variant + doc comment stating that /proc/cpuinfo CPU part reading is the only reliable discriminator. SimdProfile enum, dispatch table, and quick-reference tables collapsed accordingly. Future split into A72Fast/A53Baseline is opt-in once that lookup is wired.

Test plan

  • No code changes — nothing to test in CI for this PR.
  • Phase 3 implementation (separate PR): each SimdProfile variant exercised via cargo cpu-* feature matrix.
  • Hardware verification (per-profile, parallel): boot binary on real silicon, confirm simd_profile() returns expected variant, exercise one instruction per DOC-marked feature column, promote cells DOC → TEST.

Generated by Claude Code

Follow-up to merged PR #180. Two changes:

1. NEW: .claude/knowledge/td-simd-cpu-dispatch-matrix.md

   Per-CPU feature table, every cell sourced from official spec
   (Intel ARK, AMD data sheets, WikiChip, Wikipedia AVX-512
   article cross-referenced against primary sources).

   Rows: SkylakeX, CascadeLake, CooperLake, IceLakeSp,
   TigerLakeU, SapphireRapids, EmeraldRapids, GraniteRapids,
   Zen4, Zen5, ArrowLake, HaswellAvx2 (x86_64) plus
   Cortex-A53, A72, A76+ (aarch64).

   Columns: F, CD, VL, DQ, BW, IFMA, VBMI, VBMI2, VNNI, BF16,
   FP16, VPOPCNTDQ, BITALG, GFNI, VAES, VPCLMULQDQ,
   VP2INTERSECT, AMX-TILE, AMX-INT8, AMX-BF16, AMX-FP16,
   AVX-VNNI, AVX-VNNI-INT8, AVX-IFMA.

   Status legend: DOC (from official spec, dispatch safe but
   not yet hardware-verified by this project) vs TEST (verified
   on real silicon). Every cell is currently DOC — promotion to
   TEST happens as we acquire hardware for each profile.

   Includes:
   - Per-CPU microarchitecture notes with citations
   - CPUID leaf/register/bit table for hand-coded detection
   - OS XSAVE state requirements (XCR0 bits, arch_prctl for AMX)
   - SimdProfile::detect() pseudocode with the GraniteRapids-
     before-SapphireRapids ordering invariant
   - Out-of-scope CPUs listed (Knights Mill, Cannon Lake,
     Alder Lake firmware-disabled AVX-512, Sierra Forest, etc.)

   Critical detection invariants:
   - GraniteRapids checked BEFORE SapphireRapids (GNR has SPR
     bits + AMX-FP16; if SPR first, AMX-FP16 stays unused).
   - Zen4 vs SPR distinguished by amx_tile present/absent.
   - CooperLake vs IceLakeSp: mutually exclusive bit patterns
     (CPL has BF16 no VBMI; ICX has VBMI no BF16).
   - TigerLakeU vs IceLakeSp: discriminated by VP2INTERSECT.

2. Fixes to td-simd-integration-plan.md per Codex bot review on
   PR #180:

   - AMX MAC count comments corrected: TDPBF16PS = 8192 mul-adds
     per instruction (16×16 output × K=32), TDPBUSD = 16384
     mul-adds per instruction (16×16 output × K=64). Previous
     "256 mul-adds/instr" understated by 32× and 64× respectively
     and would have skewed Phase 1 prioritization. Numbers now
     align with src/hpc/amx_matmul.rs:15 and
     bf16_tile_gemm.rs:155-157.

   - A72 vs A53 detection: replaced the unreliable
     `neon && aes` heuristic (both A53 with crypto and A72 have
     identical HWCAP flags) with explicit `Armv8Neon` fallback
     and a doc comment stating that /proc/cpuinfo `CPU part`
     reading is required to split them. SimdProfile enum,
     dispatch table, and quick-reference tables collapsed to
     single Armv8Neon variant. Future improvement: split into
     A72Fast/A53Baseline when /proc/cpuinfo lookup is wired.

No code changes. Documentation only.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a9102cddc8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

SimdProfile::A76DotProd => &A76_GEMM,
SimdProfile::A72Fast => &A72_GEMM,
SimdProfile::A53Baseline => &A53_GEMM,
SimdProfile::Armv8Neon => &ARMV8_NEON_GEMM,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Define Armv8Neon table before using it in dispatch

The updated match arm now returns &ARMV8_NEON_GEMM, but this symbol is never defined in the example (the snippet still defines A72_GEMM and A53_GEMM). As written, anyone implementing Phase 3 from this “authoritative” plan will hit a compile error and may be unclear whether to alias or merge the old A72/A53 tables. Please add the missing ARMV8_NEON_GEMM definition (or update the surrounding table definitions) so the documented migration is internally consistent.

Useful? React with 👍 / 👎.

Self-correction. The previous commit collapsed A72Fast + A53Baseline
into a new `Armv8Neon` variant, claiming the two could not be
distinguished by HWCAP. That reinvented something the codebase
already solves.

`src/hpc/simd_caps.rs:317-336` has had `ArmProfile::arm_profile()`
in tree since the SBC support landed. Its decision tree:

  asimd_dotprod present  → A76DotProd (Pi 5 / A76+)
  aes present (no dotprod) → A72Fast   (Pi 4 / Pi 3 / Pi Zero 2W)
  no aes                 → A53Baseline (QEMU / minimal aarch64)

The line 329 comment explicitly admits the A72Fast branch catches
both A72 silicon (Pi 4) and A53-with-crypto silicon (Pi 3, Pi Zero
2W): "we report A72-tier since most deployments target Pi 4." The
dispatch tables would be identical at the ISA level (both are
ARMv8.0+crypto, no dotprod), so this is intentional.

The `A53Baseline` variant catches the rare case of NEON-without-
crypto (QEMU, minimal aarch64 builds), which my `Armv8Neon` collapse
lost.

Changes:
- Reverted SimdProfile enum to A76DotProd / A72Fast / A53Baseline.
- detect() pseudocode now delegates to existing arm_profile() helper.
- GemmDispatch table restored to 3 aarch64 entries.
- Quick-reference tables list both A72Fast and A53Baseline rows with
  a note that they share the same kernel.
- Dispatch matrix split into 4 rows: A53+crypto (→A72Fast),
  A53-no-crypto (→A53Baseline), A72 (→A72Fast), A76+ (→A76DotProd).

This is more honest than the Armv8Neon collapse: it preserves the
existing in-tree pattern, names it correctly, and documents the
A72Fast-as-ARMv8.0+crypto-catch-all semantic that the codebase
already chose.
@AdaWorldAPI AdaWorldAPI merged commit 3568003 into master May 20, 2026
17 checks passed
AdaWorldAPI pushed a commit that referenced this pull request May 20, 2026
Phase 3 T3.2: add 13 mutually-exclusive cargo features that pin
simd_profile() to a const at compile time, bypassing the runtime
LazyLock detection from T3.1. One feature per non-Scalar variant of
the SimdProfile enum.

Features (mapping to LLVM target-cpu codenames):
  cpu-gnr         → GraniteRapids    (graniterapids)
  cpu-spr         → SapphireRapids   (sapphirerapids)
  cpu-zen4        → Zen4Avx512       (znver4)
  cpu-cpl         → CooperLake       (cooperlake)
  cpu-tigerlake   → TigerLakeU       (tigerlake)
  cpu-icx         → IceLakeSp        (icelake-server)
  cpu-clx         → CascadeLake      (cascadelake)
  cpu-skx         → SkylakeX         (skylake-avx512)
  cpu-arrowlake   → ArrowLake        (arrowlake)
  cpu-haswell     → HaswellAvx2      (haswell)
  cpu-a76         → A76DotProd       (cortex-a76)
  cpu-a72         → A72Fast          (cortex-a72)
  cpu-a53         → A53Baseline      (cortex-a53)

Mutual exclusion (per integration-plan risk #1: "if a user sets
cpu-spr AND has runtime detection on Zen 4, the binary SIGILLs on
AMX instructions") is enforced via a const assert: each cpu-*
contributes 1 to _PIN_COUNT and the assert fires at compile time if
the sum exceeds one. Verified: enabling cpu-spr+cpu-zen4
simultaneously produces a build error citing the mutex.

Implementation: a const pinned_profile() -> Option<SimdProfile>
walks a cfg cascade and returns the active variant or None. The
simd_profile() function exists in two cfg-gated forms — a runtime
LazyLock variant compiled when no cpu-* feature is set, and a
const-foldable variant compiled when any is set. The LazyLock is
not linked into pinned binaries.

is_pinned() const helper exposes whether compile-time dispatch is
active, useful both for consumer-facing diagnostics and for gating
arch-detection tests that no longer apply when pinning overrides
hardware. Existing x86_target_lands_inside_x86_family /
aarch64_target_lands_inside_aarch64_family tests early-return when
pinned; two new tests (pinning_default_is_off, pinning_consistency)
verify the const + runtime paths agree.

Tests: 2077/2077 lib tests pass under both default and --features
cpu-spr configurations. cargo clippy --lib -- -D warnings clean in
both modes. Mutex compile_error verified by attempting
--features "cpu-spr,cpu-zen4" — fails as expected with the const
assert citation. The default (no feature) path is byte-identical to
the T3.1 runtime detection — no regression risk to the merged #181
work or to the e40f3a3 SimdProfile commit.
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
Phase 3 T3.2: add 13 mutually-exclusive cargo features that pin
simd_profile() to a const at compile time, bypassing the runtime
LazyLock detection from T3.1. One feature per non-Scalar variant of
the SimdProfile enum.

Features (mapping to LLVM target-cpu codenames):
  cpu-gnr         → GraniteRapids    (graniterapids)
  cpu-spr         → SapphireRapids   (sapphirerapids)
  cpu-zen4        → Zen4Avx512       (znver4)
  cpu-cpl         → CooperLake       (cooperlake)
  cpu-tigerlake   → TigerLakeU       (tigerlake)
  cpu-icx         → IceLakeSp        (icelake-server)
  cpu-clx         → CascadeLake      (cascadelake)
  cpu-skx         → SkylakeX         (skylake-avx512)
  cpu-arrowlake   → ArrowLake        (arrowlake)
  cpu-haswell     → HaswellAvx2      (haswell)
  cpu-a76         → A76DotProd       (cortex-a76)
  cpu-a72         → A72Fast          (cortex-a72)
  cpu-a53         → A53Baseline      (cortex-a53)

Mutual exclusion (per integration-plan risk #1: "if a user sets
cpu-spr AND has runtime detection on Zen 4, the binary SIGILLs on
AMX instructions") is enforced via a const assert: each cpu-*
contributes 1 to _PIN_COUNT and the assert fires at compile time if
the sum exceeds one. Verified: enabling cpu-spr+cpu-zen4
simultaneously produces a build error citing the mutex.

Implementation: a const pinned_profile() -> Option<SimdProfile>
walks a cfg cascade and returns the active variant or None. The
simd_profile() function exists in two cfg-gated forms — a runtime
LazyLock variant compiled when no cpu-* feature is set, and a
const-foldable variant compiled when any is set. The LazyLock is
not linked into pinned binaries.

is_pinned() const helper exposes whether compile-time dispatch is
active, useful both for consumer-facing diagnostics and for gating
arch-detection tests that no longer apply when pinning overrides
hardware. Existing x86_target_lands_inside_x86_family /
aarch64_target_lands_inside_aarch64_family tests early-return when
pinned; two new tests (pinning_default_is_off, pinning_consistency)
verify the const + runtime paths agree.

Tests: 2077/2077 lib tests pass under both default and --features
cpu-spr configurations. cargo clippy --lib -- -D warnings clean in
both modes. Mutex compile_error verified by attempting
--features "cpu-spr,cpu-zen4" — fails as expected with the const
assert citation. The default (no feature) path is byte-identical to
the T3.1 runtime detection — no regression risk to the merged #181
work or to the e40f3a3 SimdProfile commit.
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
…able

Rebased onto master post-#181, #182, #183. Replaces the polyfill-based
add_mul_f32/f64 with LazyLock-cached function pointers picking real
hardware FMA per silicon, and adds two more LazyLock-cached
primitives the consumer needs: is_amx_available() and vnni_dot_u8_i8.

WHY: F32x16::mul_add on AVX2 builds drops to per-lane scalar
f32::mul_add (simd_avx2.rs:586). The polyfill abstracts lane width
but cannot pick between _mm256_fmadd_ps and _mm512_fmadd_ps — that
is an instruction-family choice, not a lane-width one. LazyLock
amortises a one-time simd_caps() read into a frozen fn pointer;
every subsequent call is a single indirect jump with zero
is_x86_feature_detected! overhead. No SimdProfile exposed at the
consumer surface — agnostic contract preserved.

add_mul_f32(acc, a, b) — acc[i] += a[i]*b[i]
  AVX-512F+FMA  → _mm512_fmadd_ps 16-wide + 8-wide tail + scalar tail
  AVX2+FMA      → _mm256_fmadd_ps 8-wide + scalar tail
  NEON          → vfmaq_f32 4-wide + scalar tail
  scalar        → f32::mul_add per lane
  no_std build  → preserves the polyfill F32x16::mul_add path
                  (LazyLock requires std)

add_mul_f64(acc, a, b) — f64 sibling, same shape with 8/4/2 lanes.

is_amx_available() — wraps simd_amx::amx_available() (CPUID +
OSXSAVE + XCR0[17,18] + Linux arch_prctl(XCOMP_PERM)) in
LazyLock<bool>. The 4-step gate, including the syscall, fires
exactly once per process. Always false on non-x86_64.

vnni_dot_u8_i8(a, b) — i32 dot of u8 × i8 slices:
  AVX-512 VNNI  → delegates to simd_amx::vnni_dot_u8_i8 wrapped with
                  scalar tail handling (the existing kernel processes
                  only n - (n%64) since its cognitive-shader caller
                  pre-aligns rows; general-purpose callers need the
                  tail)
  AVX-VNNI 256  → delegates to simd_amx::vnni2_dot_u8_i8 directly
                  (that one already handles its scalar tail)
  scalar        → simd_amx::vnni_dot_u8_i8_scalar

No intrinsic code is duplicated. The dispatcher composes existing
simd_amx::* kernels (which #182/#184 also build on) into a safe
LazyLock-cached consumer-facing wrapper. simd_amx::matvec_dispatch
runs the same selection logic but uses is_x86_feature_detected! per
call; this wrapper amortises that to once at startup.

PARITY CONTRACT:
  - add_mul_f32 / add_mul_f64: bit-identical to f32::mul_add /
    f64::mul_add per lane via to_bits() assertion. All vector
    backends emit single-rounded IEEE-754 FMA.
  - vnni_dot_u8_i8: bit-identical i32 to scalar widen-and-multiply.
    VPDPBUSD does not saturate the accumulator (intermediate u8*i8
    products bounded by 32385, four-element sums by 129540).

Tests: 2101/2101 lib pass (7 new lazylock_dispatch_tests over 12
problem sizes / tail lengths). cargo clippy --lib clean under
default and --features cpu-spr. On Sapphire Rapids host the
LazyLock resolved to AVX-512+FMA for add_mul, AVX-512 VNNI for
vnni_dot; AMX is_amx_available returns false (hypervisor masks
XCR0[17,18]) — matches the Risk #3 demotion from 61b4563.

This commit was rebased atop master after the parallel session
shipped PR #182 (BF16 AMX tile kernels), #183 (F16C cast batch), and
prepared #184 (TDPBUSD int8 tile + matmul_i8_to_i32 wiring). The
earlier 469ecc7 (coarse + SimdTier) and 77e3971 (mul_add_f32_into +
walkback) and be65595 (is_amx_available + vnni_dot duplicating
intrinsics) are subsumed by this single clean commit: no public
SimdProfile / SimdTier re-export, no duplicated intrinsic code, no
mul_add_f32_into (master's add_mul_f32 shape is the right primitive).
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
Phase 3 T3.2: add 13 mutually-exclusive cargo features that pin
simd_profile() to a const at compile time, bypassing the runtime
LazyLock detection from T3.1. One feature per non-Scalar variant of
the SimdProfile enum.

Features (mapping to LLVM target-cpu codenames):
  cpu-gnr         → GraniteRapids    (graniterapids)
  cpu-spr         → SapphireRapids   (sapphirerapids)
  cpu-zen4        → Zen4Avx512       (znver4)
  cpu-cpl         → CooperLake       (cooperlake)
  cpu-tigerlake   → TigerLakeU       (tigerlake)
  cpu-icx         → IceLakeSp        (icelake-server)
  cpu-clx         → CascadeLake      (cascadelake)
  cpu-skx         → SkylakeX         (skylake-avx512)
  cpu-arrowlake   → ArrowLake        (arrowlake)
  cpu-haswell     → HaswellAvx2      (haswell)
  cpu-a76         → A76DotProd       (cortex-a76)
  cpu-a72         → A72Fast          (cortex-a72)
  cpu-a53         → A53Baseline      (cortex-a53)

Mutual exclusion (per integration-plan risk #1: "if a user sets
cpu-spr AND has runtime detection on Zen 4, the binary SIGILLs on
AMX instructions") is enforced via a const assert: each cpu-*
contributes 1 to _PIN_COUNT and the assert fires at compile time if
the sum exceeds one. Verified: enabling cpu-spr+cpu-zen4
simultaneously produces a build error citing the mutex.

Implementation: a const pinned_profile() -> Option<SimdProfile>
walks a cfg cascade and returns the active variant or None. The
simd_profile() function exists in two cfg-gated forms — a runtime
LazyLock variant compiled when no cpu-* feature is set, and a
const-foldable variant compiled when any is set. The LazyLock is
not linked into pinned binaries.

is_pinned() const helper exposes whether compile-time dispatch is
active, useful both for consumer-facing diagnostics and for gating
arch-detection tests that no longer apply when pinning overrides
hardware. Existing x86_target_lands_inside_x86_family /
aarch64_target_lands_inside_aarch64_family tests early-return when
pinned; two new tests (pinning_default_is_off, pinning_consistency)
verify the const + runtime paths agree.

Tests: 2077/2077 lib tests pass under both default and --features
cpu-spr configurations. cargo clippy --lib -- -D warnings clean in
both modes. Mutex compile_error verified by attempting
--features "cpu-spr,cpu-zen4" — fails as expected with the const
assert citation. The default (no feature) path is byte-identical to
the T3.1 runtime detection — no regression risk to the merged #181
work or to the e40f3a3 SimdProfile commit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants