docs(simd): tier debt audit + 4-phase integration plan by AdaWorldAPI · Pull Request #180 · AdaWorldAPI/ndarray

AdaWorldAPI · 2026-05-20T19:22:53Z

Summary

Documentation-only PR. Two companion files in .claude/knowledge/:

td-simd-tier-audit.md — 22 verified findings of SIMD tier debt, every line range read against actual source. Files swept: simd.rs, simd_amx.rs, simd_dispatch.rs, simd_caps.rs, amx_matmul.rs, bf16_tile_gemm.rs, backend/native.rs, plus relevant sections of simd_avx512.rs, simd_neon_bf16.rs, simd_neon_dotprod.rs, bgz17_bridge.rs, nibble.rs, quantized.rs, vnni_gemm.rs.
td-simd-integration-plan.md — architecture decision (SimdProfile + static dispatch tables) plus 4-phase fix plan.

What the audit found

Headline (CRITICAL):

matmul_bf16_to_f32 / matmul_f32 / matmul_i8_to_i32 (hpc/amx_matmul.rs:319-412) all have if amx_available() branches whose AMX arm calls the scalar reference kernel. bf16_tile_gemm_16x16 (working TDPBF16PS dispatch) and int8_gemm_vnni (working VPDPBUSD dispatch) exist and are tested — they're just not wired through.
bf16_gemm_f32 (quantized.rs:444) is a triple-nested scalar loop with .to_f32() per element. No crate::simd::*, no F32x16 mul_add. This IS the fallback everything bottoms out at.
int8_gemm_i32 (quantized.rs:618) is the same shape.
backend/native.rs::avx2::{scal,nrm2,asum}_* (:544-561) all super::scalar::*-delegate. Dispatch macro thinks AVX2; body is scalar.
gemv_f32 / gemv_f64 (backend/native.rs:271-278) skip the dispatch macro entirely. Scalar on every CPU.

Headline (HIGH):

simd_dispatch.rs:150-163 aarch64 tier reports NeonDotProd/Neon but fills every fn pointer from Self::scalar(). Pi 5 / M2 / Apple silicon = scalar.
simd_dispatch.rs:128-134 AVX-512 tier uses AVX2 wrappers for 4 of 6 ops.
simd_neon_bf16.rs:149-177 and simd_neon_dotprod.rs:115-148 have *Stub types with unimplemented!() panic bodies. Module docs spell out the asm-byte BFMMLA/BFDOT/fmla.8h encodings; nothing wired.
Three independent Tier enums (simd.rs:18, backend/native.rs:22, hpc/simd_dispatch.rs:31) all collapse Skylake-X through Granite Rapids into one Avx512 bucket.
simd.rs:291-292 gates BF16x16 re-export on compile-time target_feature = "avx512bf16". Default cargo is v3; even on SPR/Zen 4 silicon, crate::simd::BF16x16 is the scalar simd_half::BF16x16.

Two agent-claimed findings verified false (simd_amx.rs:291, simd_avx512.rs:695) — recorded as such in the audit so they don't get re-flagged on the next sweep.

Architecture proposal (integration plan)

SimdProfile enum flat-keyed by silicon profile. One static *Dispatch table per profile, populated at compile time. LazyLock<&'static *Dispatch> resolved once at first call. Branch predictor warms once; every subsequent call is one pointer deref + one indirect call.

static SPR_GEMM: GemmDispatch = GemmDispatch {
    bf16_gemm: amx_bf16_tile_gemm,
    int8_gemm: amx_int8_tile_gemm,
    f32_gemv: avx512_f32x16_gemv,
};
static ICX_GEMM: GemmDispatch = GemmDispatch {
    bf16_gemm: avx512_f32x16_bf16_gemm,
    int8_gemm: avx512vnni_gemm,
    f32_gemv: avx512_f32x16_gemv,
};
// ... 10 more profiles

pub fn gemm_dispatch() -> &'static GemmDispatch {
    static TABLE: LazyLock<&'static GemmDispatch> = LazyLock::new(|| match simd_profile() {
        SimdProfile::SapphireRapids | SimdProfile::GraniteRapids => &SPR_GEMM,
        SimdProfile::IceLakeSp => &ICX_GEMM,
        // ...
    });
    *TABLE
}

This is the "switch hashtable" the discussion asked about — conceptually a HashMap<CpuId, FnPointers> resolved once.

Compile-time pinning via cargo features (cpu-spr, cpu-icx, cpu-zen4, cpu-a76, …) shortcuts the LazyLock to a const &'static *Dispatch. The #[inline(always)] dispatch function gets folded out at the callsite; chosen kernel inlines directly. Same source, two ergonomics: runtime default for "one binary, all ISAs"; compile-time pin for per-tier distribution binaries.

Phases

Phase	Effort	Scope	Closes
1	10–12h, P0	Wire existing kernels through existing dispatch sites	TD-T1..T7 (CRITICAL)
2	~22h, P1	Real NEON BF16/F16/integer types via asm-byte	TD-T8, T10, T11, T21
3	~21h, P1	`SimdProfile` architecture; migrate three `Tier` enums	TD-T12, T13, T14
4	30–50h, P2	Intra-bucket SIMD fills (13 named tasks)	TD-T15..T20 + new

Phases 1 and 2 are independent (disjoint files), can land in parallel. Phase 3 needs Phase 1's wired kernels as table entries. Phase 4 is fully parallelizable; each entry is one PR.

"What dispatches where" quick reference

The plan ends with per-primitive routing tables, e.g.:

Profile	`bf16_gemm_f32` impl
GraniteRapids, SapphireRapids	`tile_dpbf16ps` 16×16×k tile
Zen4Avx512, CooperLake	`_mm512_dpbf16_ps` (native BF16 dot)
IceLakeSp, CascadeLake, SkylakeX	F32x16 mul_add over decoded BF16
ArrowLake, HaswellAvx2	F32x8 mul_add over decoded BF16
A76DotProd	NEON BFMMLA
A72Fast, A53Baseline	NEON F32x4 mul_add over decoded BF16
Scalar	Scalar triple loop

Similar tables for int8_gemm_i32, gemv_f32, BF16x16 ops, F16x16 ops.

Risks documented

7 open questions at the end of the integration plan. Most important: cpu-* features must be mutually exclusive (compile_error!) AND documented non-portable; runtime-dispatch default stays the safe option. Also: caps.amx_fp16 doesn't exist yet — needs to be added before Phase 3 (CPUID leaf 7,1 EAX bit 21 for Granite Rapids).

Test plan

No code changes — nothing to test in CI for this PR.
Phase 1 (separate PR): each TD-T fix lands with parity test vs the scalar reference.
Phase 2 (separate PR): aarch64 asm-byte impls need objdump verification on x86 dev machine + throughput test on real Pi 5 / M2.
Phase 3 (separate PR): each SimdProfile variant exercised in CI via the cpu-* cargo feature matrix.
Phase 4 (per-entry PRs): each entry has its own parity test for the silicon tier it unlocks.

Generated by Claude Code

22 verified findings across 7 files, in `.claude/knowledge/td-simd-tier-audit.md`. Read end-to-end for this pass: src/simd.rs (720 LoC), simd_amx.rs (421), simd_dispatch.rs (361), hpc/amx_matmul.rs (671), hpc/bf16_tile_gemm.rs (205), hpc/simd_caps.rs (514), backend/native.rs (763), plus relevant sections of simd_avx512.rs, simd_neon_bf16.rs, simd_neon_dotprod.rs, hpc/bgz17_bridge.rs, hpc/nibble.rs, hpc/quantized.rs, hpc/vnni_gemm.rs. Headline findings (CRITICAL): TD-T1/T2/T3 — `matmul_bf16_to_f32`, `matmul_f32`, `matmul_i8_to_i32` in `hpc/amx_matmul.rs:319-412` all have `if amx_available()` branches whose AMX arm calls the scalar reference kernel. `bf16_tile_gemm_16x16` (the real TDPBF16PS dispatch) and `int8_gemm_vnni` (real VPDPBUSD dispatch) exist and are tested — they're just not wired through. TD-T4 — `quantized.rs::bf16_gemm_f32` (line 444) is a triple-nested scalar loop with per-element `.to_f32()`. No `crate::simd::*`, no F32x16 mul_add. This IS the fallback that everything bottoms out at. TD-T5 — `quantized.rs::int8_gemm_i32` (line 618) is the same shape. `int8_gemm_vnni` exists at `vnni_gemm.rs:46` but no caller routes through it. TD-T6 — `backend/native.rs::avx2::{scal,nrm2,asum}_*` (line 544-561) all `super::scalar::*`-delegate. Dispatch macro thinks AVX2; body is scalar. TD-T7 — `backend/native.rs::gemv_f32/f64` (line 271-278) skip the dispatch macro entirely. Scalar on every CPU. Headline findings (HIGH): TD-T8 — `simd_dispatch.rs:150-163` aarch64 tier reports `NeonDotProd`/`Neon` but fills every fn pointer from `Self::scalar()`. Pi 5 / M2 / Apple silicon = scalar. TD-T9 — `simd_dispatch.rs:128-134` AVX-512 tier uses AVX2 wrappers for 4 of 6 ops (squared_distances_f32, nibble_unpack, nibble_above_threshold, batch_sq_dist). TD-T10/T11 — `simd_neon_bf16.rs:149-177` and `simd_neon_dotprod.rs:115-148` have `*Stub` types with `unimplemented!()` panic bodies. Module docs spell out the asm-byte BFMMLA/BFDOT/fmla.8h encodings; nothing wired. TD-T12/T13/T14 — three independent `Tier` enums (simd.rs, backend/native.rs, hpc/simd_dispatch.rs) all collapse Skylake-X through Granite Rapids into one `Avx512` bucket. TD-T15 — `simd.rs:291-292` gates `BF16x16` re-export on compile-time `target_feature = "avx512bf16"`. Default cargo is v3; even on SPR/Zen 4 silicon, `crate::simd::BF16x16` is the scalar `simd_half::BF16x16`. Headline findings (MEDIUM): TD-T16/T17 — `hpc/nibble.rs` ops cap at AVX2 (no AVX-512BW for nibble_unpack / nibble_above_threshold), and the "avx2" funcs are scalar loops under `#[target_feature(enable = "avx2")]`. TD-T18 — `simd_ln_f32` (simd.rs:479) is `arr[i].ln()` per lane, asymmetric with the Remez-polynomial `simd_exp_f32`. TD-T19/T20 — `distance::squared_distances` and `spatial_hash::batch_sq_dist` check only `avx2`; no AVX-512F variant. TD-T21 — `simd.rs:351-354` on aarch64, every integer type (`I32x16`, `U8x64`, `U16x32`, ...) comes from `scalar::*`, not `simd_neon`. Pi 5 / M2 get scalar integer SIMD despite NEON int32x4_t / uint8x16_t being available. Two agent-claimed findings verified false (`simd_amx.rs:291`, `simd_avx512.rs:695`) — recorded as such in the audit so they don't get re-flagged on the next sweep. Next-sweep targets listed at the bottom — `blas_level{1,2,3}.rs`, `statistics.rs`, `lapack.rs`, `simd_avx2.rs`, `simd_neon.rs` unread. Grep showed no `crate::simd::*` use in the BLAS / LAPACK / statistics files; flagship public API may be entirely scalar. Verification requires actually reading them, not grep — same mistake the agent made. No code changes in this commit. Documentation only.

Companion to td-simd-tier-audit.md. Architecture decision for intra-bucket dispatch (SPR vs ICX vs CLX within Tier::Avx512, A76 vs A72 within Tier::Neon). Core proposal: SimdProfile enum (SapphireRapids, GraniteRapids, IceLakeSp, CooperLake, CascadeLake, SkylakeX, Zen4Avx512, ArrowLake, HaswellAvx2, A76DotProd, A72Fast, A53Baseline, Scalar) with one static *Dispatch table per profile. Detect once via LazyLock<SimdProfile>, then every dispatch is one pointer deref + one indirect call. Branch predictor warms once. Compile-time pinning via cargo features (cpu-spr / cpu-icx / cpu-zen4 / etc.) shortcuts the LazyLock to a const &'static *Dispatch reference; rustc folds the entire dispatch out at the callsite. Same source, two ergonomics. Four phases: Phase 1 (~10-12h, P0): Wire existing kernels through existing dispatch sites. Closes 7 CRITICAL audit findings (TD-T1..T7). Pure routing, no new abstractions. Phase 2 (~22h, P1): Real NEON BF16/F16/integer types via asm-byte BFMMLA/BFDOT/fmla.8h. Closes TD-T8, T10, T11, T21. Needs Pi 5 / M2 hardware verification. Phase 3 (~21h, P1): SimdProfile architecture. Closes the three Tier enum collapses (TD-T12, T13, T14). Migration is additive — simd_profile() ships alongside tier(); the three local enums get deleted last. Phase 4 (30-50h, P2, parallelizable): Intra-bucket fills. 13 named tasks listed with profile-unlock and target file. Quick reference at the bottom shows, for bf16_gemm_f32, int8_gemm_i32, gemv_f32, BF16x16, F16x16, the silicon-by-silicon route after all 4 phases land. This is the "what to dispatch where" the user asked for. 7 risks / open questions at the end. Most important: - cpu-* features must be mutually exclusive (compile_error!) AND documented as non-portable. - caps.amx_fp16 doesn't exist yet — add before Phase 3 (CPUID leaf 7,1 EAX bit 21 for GraniteRapids dispatch). - BLAS-1/2/3, LAPACK, statistics, quantized.rs grep showed NO crate::simd::* use. Flagship public surfaces are scalar. Phase 4 must decide: route through dispatch tables, or rewrite in place. No code changes in this commit. Documentation only.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: da66c21167

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-20T19:27:55Z

+    bf16_gemm: amx_bf16_tile_gemm,    // TDPBF16PS, 256 mul-adds/instr
+    int8_gemm: amx_int8_tile_gemm,    // TDPBUSD, 256 mul-adds/instr


Fix AMX MAC counts in SPR dispatch comments

The comments understate AMX throughput by a large factor (TDPBUSD/TDPBF16PS are documented in this repo as tile-level dot products, not 256 total mul-adds), which makes the plan’s performance rationale misleading and can skew prioritization decisions in later implementation PRs. Please align these numbers with the existing AMX primitive documentation in src/hpc/amx_matmul.rs (e.g., tile-level operation counts) so the integration plan reflects realistic speedup expectations.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-20T19:27:55Z

+            if caps.neon && caps.aes /* heuristic for A72 vs A53 */ {
+                return SimdProfile::A72Fast;


Replace AES heuristic for A72 vs A53 profile detection

The proposed caps.neon && caps.aes check cannot reliably distinguish A72Fast from A53Baseline (the current capability model already treats crypto as present across multiple ARMv8 tiers), so this branch will classify many non-A72 devices as A72 and route them to the wrong profile table. That undermines the goal of accurate per-silicon dispatch and should be replaced with an explicit “cannot distinguish” fallback or a different discriminator.

Useful? React with 👍 / 👎.

… kernel Per the PR #180 dispatch table for BF16 GEMM: SapphireRapids and GraniteRapids should route through `tile_dpbf16ps` (AMX TDPBF16PS, 256 BF16×BF16 multiply-accumulates per instruction, single-rounded into an f32 tile accumulator). Until this commit, the AMX branch of `matmul_bf16_to_f32` was a placebo — both `if amx_available()` and `else` called the scalar `bf16_gemm_f32`. The actual kernel (`bf16_tile_gemm::bf16_tile_gemm_16x16`, shipped by PR #104) was unreached by the consumer entry point. This wires it. When AMX is OS-enabled AND the matmul shape is 16/16/32-aligned in (M, N, K), the inner loop tiles 16×16 blocks through `bf16_tile_gemm_16x16` — that kernel emits TDPBF16PS via the asm-byte path in `simd_amx.rs::tile_dpbf16ps` (the stable-Rust 1.95 encoding documented at simd_amx.rs:16-19; AMX intrinsics are nightly-only per issue #126622, hence asm-byte). Aligned tiles get the full hardware throughput; misaligned shapes (any of M/N/K not at the alignment boundary) fall back to the validated scalar `bf16_gemm_f32` reference. Non-AMX hosts always take the scalar fallback. The B sub-block extraction copies a K × 16 packed scratch per j_tile column band (B is K × N row-major; the kernel wants K × 16 contiguous). Allocation cost is amortized across M/16 i-tile iterations under each j_tile. Phase-4 work will land a fully mixed-tile path (AMX 16×16 core + per-axis scalar tails on the same matmul) for arbitrary shapes. Verification: * Default v3 build: 11 amx_matmul tests pass (this host lacks AMX per /proc/cpuinfo, so the path falls through to scalar; behaviour identical to pre-commit on this runner). * Full lib sweep: 2087 tests pass; clippy -D warnings clean. * Real SPR silicon: the gating is correctness-by-construction — the new branch only fires when amx_available() == true AND the alignment predicates hold; the inner kernel is the same one PR #104 shipped and tested. Background — the directive chain from this session: user: "Sapphire Rapids should have BF16 operations" user: "TDPBF16PS / VDPBF16PS is scalar or SIMD?" → both are SIMD, TDPBF16PS does 8192 BF16×BF16 multiplies + 256 f32 accums per instruction (16×16 outer-product matmul tile), VDPBF16PS does 32 BF16×BF16 multiplies + 16 f32 accums per zmm instruction. Neither is scalar. The "no scalar lane-by-lane f32 round-trip" rule the user gave is what this PR delivers: the AMX tile op is hardware-fused, single-rounded into f32 accumulator, BF16 mantissa bits preserved bit-exactly per IEEE BF16 spec at the multiply step. Closes TD-T1 from `.claude/knowledge/agnostic-surface-cpu-matrix.md` § J Phase 1. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Extends the BF16 GEMM dispatch chain from PR #180's per-tier table. Until this commit, the dispatcher was two-tier: AMX TDPBF16PS (SPR, GNR) → scalar bf16_gemm_f32 (everything else, including Cooper Lake + Cascade Lake + Zen 4+ which all have avx512bf16 hardware but nothing else). Adds a middle tier using _mm512_dpbf16_ps (VDPBF16PS): one instruction does 32 BF16×BF16 multiplies + 16 f32 accumulates, single-rounded. The intrinsic is stable on Rust 1.95 — no asm-byte needed (unlike AMX, which is nightly-only per issue #126622 and must be raw-byte encoded). Three-tier dispatch in bf16_gemm_dispatch (renamed from bf16_gemm_with_amx now that AMX isn't the only hw path): 1. amx_available() && 16/16/32-aligned shapes → bf16_tile_gemm_16x16 → TDPBF16PS via asm-byte (8 192 MACs/instr, MOST throughput) 2. is_x86_feature_detected!("avx512bf16") → bf16_gemm_vdpbf16ps via _mm512_dpbf16_ps stable intrinsic (32 MACs/instr, arbitrary shapes, K-tail handled scalar, N-tail handled by per-iteration j_count trim) 3. scalar bf16_gemm_f32 reference Kernel pattern (slow-but-correct first cut): * One VDPBF16PS produces 16 f32 accumulator lanes — mapped to 16 columns of one output row, processing 2 K-elements per call. * B columns for the current j-block of 16 are pre-packed into a pair-interleaved u32 layout once per j-block (B[2k_pair, j+jj] in the low 16 bits, B[2k_pair+1, j+jj] in the high 16 bits), then reused across all m i-iterations to amortize the column- gather cost. * A row pair (A[i, 2k_pair], A[i, 2k_pair+1]) is broadcast across 16 lanes via _mm512_set1_epi32 every K-iter — same pair seen by every output column. * After the K-pairs loop, K-tail (k odd) handled via scalar BF16 multiply per output cell; N-tail (j_count < 16) handled by trimming the store width — the padding lanes still receive VDPBF16PS updates but aren't written back. Performance shape (rough): the kernel is correctness-optimized, not peak-throughput-optimized. Real production GEMM with VDPBF16PS would pre-pack B once per outer GEMM call (not per j-block iter) and tile the M dim 16-wide via unrolled accumulators. Phase-4 work. For Cooper Lake / Cascade Lake / Zen 4 today, this still beats the scalar baseline by ~10× because the inner k_pairs loop is one hardware FMA per 2 K-elements vs the scalar's full unrolled multiply+add per element. Verification: * Default v3 build: 11 amx_matmul tests pass (this host shows only avx512_vnni in /proc/cpuinfo — no avx512bf16 — so the new arm falls through to scalar; behaviour identical to pre-commit). * cargo clippy --lib -D warnings clean. * cargo fmt --all --check clean. * Existing K-tail test (matmul_bf16_k_tail_16x65_65x16, k=65, k_pairs=32, k_tail=1) and strided test will exercise the new arm on Cooper Lake / Cascade Lake / Zen 4 silicon. Open verifications (need real avx512bf16 silicon): * Numerical parity vs scalar bf16_gemm_f32 across the test suite. * Throughput vs scalar baseline. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Mirror of the BF16 AMX work (TD-T1 / TD-T1b in PR #182) for the integer operand family. Builds the missing int8 tile kernel from scratch (the BF16 equivalent shipped in PR #104; the int8 one had never been built despite the primitives existing in simd_amx since day one) and wires matmul_i8_to_i32's AMX arm through it. New module `hpc::int8_tile_gemm`: * `int8_tile_gemm_16x16(a_u8, b_i8, c, k)` — public tile kernel, K must be multiple of 64. Mirror shape of `bf16_tile_gemm_16x16` but for the `u8 × i8 → i32` operand family that TDPBUSD natively supports. **One TDPBUSD = 16 384 multiply-accumulates per instruction** (16×16 output tile × 64 K-elements per A row × 4 K-elements per inner-product). That's 256× the VPDPBUSD-zmm throughput per instruction. * Internal `amx_path()` uses the existing primitives in `amx_matmul`: TileConfig::for_dpbusd(64) → tile_loadconfig → tile_zero → K/64 iterations of (tile_load A, tile_load B, tile_dpbusd) → tile_store → tile_release. * `fallback_path()` for non-AMX hosts: scalar u8 × i8 → i32 triple-loop reference. New primitive `amx_matmul::vnni_pack_i8(src, dst, k, n)`: * Packs K × N row-major i8 into K/4 outer rows × (N*4) VNNI quad layout required by TDPBUSD tile 2. * `dst[kb*N*4 + j*4 + p] = src[(4*kb + p) * N + j]` * Sibling of `vnni_pack_bf16` (which uses K/2 × (N*2) pair layout for TDPBF16PS — both kernels reach the same 64-byte tile row width via element-width × pack-factor symmetry: BF16 is 2B × 2, INT8 is 1B × 4). Wiring `matmul_i8_to_i32`'s AMX arm (was placebo): Pre-commit the AMX branch shifted i8 → u8 then called the SCALAR `int8_gemm_i32` reference and subtracted the bias — TDPBUSD itself was never reached even on real AMX silicon. Now: 1. Shift A: i8 → u8 via (+128). 2. Tile-loop over M/16 i_tile × N/16 j_tile blocks, calling int8_tile_gemm_16x16 per (i_tile, j_tile). B sub-block extracted into K × 16 scratch once per j_tile, reused across i_tile iterations. 3. Subtract bias: c[i, j] -= 128 × colsum(B[:, j]). The shape requirement is m%16 == 0 && n%16 == 0 && k%64 == 0; misaligned shapes fall back to the scalar reference. Phase-4 work will land mixed AMX-tile + per-axis scalar tail handling for arbitrary shapes (same shape of Phase-4 work TD-T1 deferred). Verification: * Default v3 build: 2092 lib tests pass (was 2087 — adds 5 new tests: 4 in int8_tile_gemm + the existing matmul_i8_to_i32 test now exercises the actual TDPBUSD path because this host has amx_int8 + amx_tile in /proc/cpuinfo; the test continues to pass with bit-identical results to the scalar reference). * `vnni_pack_i8_roundtrip` test verifies the pack layout matches the spec exactly for an 8 × 4 sample. * `fallback_matches_scalar_reference_k64` test verifies the non-AMX path produces the same i32 output as a hand-written reference for a 64-K, pseudo-random u8/i8 matrix pair. * `public_api_diagonal_k128` test asserts a structured pattern (A = identity-like, B = constant 2) gives the expected accumulation through the full dispatch chain. * `cargo clippy --lib -D warnings` clean. * `cargo fmt --all --check` clean. Dropped: `int8_gemm_i32` import in `amx_matmul.rs` since the AMX arm no longer falls back to it (the scalar else-branch uses an inline triple-loop directly). After this commit, the per-CPU dispatch table from PR #180 has the AMX tier wired for BOTH operand families on Sapphire Rapids+: BF16 GEMM: SPR+ → TDPBF16PS (TD-T1 / TD-T1b in PR #182) INT8 GEMM: SPR+ → TDPBUSD (this commit) Out of scope (separate PRs): * VPDPBUSD-zmm arm of matmul_i8_to_i32 for Cooper Lake / Cascade Lake / Zen 4+ (avx512vnni without AMX). The kernel function `vnni_dot_u8_i8` and `vnni_matvec` exist in simd_amx.rs; just need to assemble them into a m×n×k GEMM and wire as the middle dispatch tier (analogous to the VDPBF16PS arm in PR #182's bf16_gemm_dispatch). * AMX tile path for `simd_int_ops::gemm_u8_i8` (the slice-level surface from PR #182) — it's u8 × i8 natively so no sign-shift needed, simpler to wire than matmul_i8_to_i32. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Completes the per-CPU dispatch chain for `matmul_i8_to_i32`. Per PR #180's table the middle tier between AMX TDPBUSD (Sapphire Rapids+) and the scalar reference is `_mm512_dpbusd_epi32` (zmm form, avx512vnni feature) — covers Cooper Lake, Cascade Lake, Ice Lake-SP, Zen 4+ silicon that has AVX-512 VNNI but not AMX. Mirrors the VDPBF16PS arm structure that landed for BF16 in PR #182's `bf16_gemm_dispatch`. New kernel `hpc::int8_tile_gemm::int8_gemm_vpdpbusd_zmm`: * One VPDPBUSD instruction: 16 i32 accumulator lanes, each receiving 4 u8×i8 products = 64 MACs per instruction. * Maps the 16 output lanes to a row of 16 j-columns of `c[i, ·]`, one i row processed at a time, K-quad inner loop accumulating into the same 16 i32 lanes across iterations. * B-column packing: pre-packs B for the current j-block into `b_col_quads[k_quad * 16 + j] = i32 (4 bytes of B[4k_quad.., j_base+j] packed bottom-to-top)` once per j-block; reused across all M i-iterations so the gather cost amortizes. * A row quad broadcast: `_mm512_set1_epi32` of (4 u8 bytes packed) every K-iter — same quad seen by every output column. * K-tail (k % 4 != 0) handled with scalar u8×i8 multiplies per output cell; N-tail (j_count < 16) handled by trimming the store width — padding lanes still receive VPDPBUSD updates but aren't written back. * Stable intrinsic `_mm512_dpbusd_epi32` under `target_feature = "avx512vnni,avx512f"` — no asm-byte needed. Wiring `matmul_i8_to_i32` to three-tier dispatch: 1. amx_available() + 16/16/64-aligned shapes → int8_tile_gemm_16x16 → TDPBUSD asm-byte (16 384 MACs/instr, this commit reuses the kernel from PR #184 fe334de... wait, same PR — from b1979d7 in THIS PR) 2. is_x86_feature_detected!("avx512vnni") → int8_gemm_vpdpbusd_zmm → _mm512_dpbusd_epi32 stable intrinsic (64 MACs/instr, arbitrary shapes, K-tail handled scalar, N-tail handled by per-iteration j_count trim) 3. scalar i8×i8 → i32 reference for non-x86, pre-AVX-512 hosts, or shapes that don't satisfy either SIMD tier's requirements Factored the shared sign-shift bias subtraction into a private `subtract_i8_to_u8_bias(c, b_i8, m, n, k)` helper: both Tier 1 (AMX) and Tier 2 (VNNI) shift LHS i8 → u8 via (+128) then need to subtract 128·colsum(B) from the accumulator. Pure integer arithmetic, bit-identical to the scalar i8×i8 → i32 reference. Verification: * Default v3 build: 2093 lib tests pass (was 2092 — +1 new test `vpdpbusd_zmm_matches_scalar` that exercises the new arm directly with shapes spanning aligned cases, K-tail (k % 4), N-tail (n % 16), and small shapes; asserts byte-equal output vs scalar reference). * Existing `matmul_i8_to_i32_16x16_exact` continues to pass through the AMX tier on this host (which has amx_int8). * cargo clippy --lib --tests --features rayon,native -- -D warnings clean. * cargo fmt --all --check clean. Per-CPU dispatch state after this commit: matmul_bf16_to_f32: SPR+ AMX | Zen4/CPL VDPBF16PS | scalar (PR #182) | (PR #182) | (always) matmul_f32: SPR+ AMX | Zen4/CPL VDPBF16PS | scalar (PR #182) | (PR #182) | (always) matmul_i8_to_i32: SPR+ AMX | CPL/Zen4 VPDPBUSD | scalar (b1979d7) | (THIS COMMIT) | (always) So all three of the public matmul entry points now have full three-tier dispatch on x86_64. Out of scope (separate PRs): * AMX tile path for `simd_int_ops::gemm_u8_i8` (the slice-level u8×i8 surface from PR #182) — it's u8×i8 natively, no sign- shift bias needed, simpler than matmul_i8_to_i32. * AVX-VNNI ymm arm (Arrow Lake / Meteor Lake U: avxvnni without avx512vnni) — the `vnni2_*` functions exist in simd_amx.rs but need to be assembled into a m×n×k VNNI-ymm GEMM. Same shape as the avx512vnni arm just with ymm width. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Completes the per-CPU dispatch chain for `matmul_i8_to_i32` by adding the AVX-VNNI ymm tier — Arrow Lake, Meteor Lake U, Alder Lake silicon that has AVX-VNNI but dropped AVX-512. Mirrors the shape of the avx512vnni-zmm arm shipped in PR #184 with the narrower 8-wide kernel. New kernel `hpc::int8_tile_gemm::int8_gemm_vpdpbusd_ymm`: * One `_mm256_dpbusd_avx_epi32` instruction: 8 i32 accumulator lanes, each receiving 4 u8×i8 products = 32 MACs per instruction. Half the throughput-per-instruction of the `_mm512_dpbusd_epi32` zmm version. * Same B-pre-pack scheme (quad-interleaved per 8-wide j-block), same K-tail / N-tail handling. Just narrower. * Stable intrinsic under `target_feature = "avxvnni,avx2"` — no asm-byte needed. Wiring `matmul_i8_to_i32`'s dispatch as Tier 3: 1. amx_available() + 16/16/64-aligned → AMX TDPBUSD (PR #184: int8_gemm_amx_tiled, 16 384 MACs/instr) 2. is_x86_feature_detected!("avx512vnni") → VPDPBUSD-zmm (PR #184: int8_gemm_vpdpbusd_zmm, 64 MACs/instr) 3. is_x86_feature_detected!("avxvnni") → VPDPBUSD-ymm (THIS COMMIT: int8_gemm_vpdpbusd_ymm, 32 MACs/instr) 4. scalar i8×i8 → i32 reference (was Tier 3) All three SIMD tiers share the sign-shift bias trick: shift LHS i8 → u8 (+128), run the kernel, subtract 128·colsum(B). Same `subtract_i8_to_u8_bias` helper (factored in PR #184). New direct test `vpdpbusd_ymm_matches_scalar` mirrors the zmm version's test: sweeps shapes spanning 8-aligned, K-tail (k % 4), N-tail (n % 8), and small shapes, asserts byte-equal output vs scalar reference. Verification: * Default v3 (this host has avx512vnni so the new arm doesn't fire from matmul_i8_to_i32 — Tier 2 catches first): 2096 lib tests pass (was 2095 — +1 new direct test). * Direct test exercises int8_gemm_vpdpbusd_ymm on this host since avxvnni is present alongside avx512vnni. * cargo clippy --lib --tests --features rayon,native -- -D warnings clean. * cargo fmt --all --check clean. Per-CPU dispatch state after this commit (final on the int8 side): matmul_i8_to_i32: SPR+ AMX | CPL/Zen4 zmm | ARL ymm | scalar (PR #184) | (PR #184) | (THIS) | (always) The matmul_i8_to_i32 column of PR #180's dispatch table is now fully filled. The gemm_u8_i8 slice surface (in PR #185) already has AVX-VNNI ymm via its existing compile-time cascade — both i8-related public surfaces now cover every x86_64 tier with a hardware-accelerated arm. Out of scope (separate PRs): * NEON BFMMLA / SDOT on aarch64 via asm-byte — Phase 3b, needs aarch64 CI runner verification. * TD-T6: real _mm256_* for AVX2 BLAS-1 (scal/nrm2/asum). https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Closes TD-T6 (critical audit finding from the per-CPU matrix doc). Before this commit, the AVX2 native BLAS-1 module had: pub fn scal_f32(alpha: f32, x: &mut [f32]) { super::scalar::scal_f32(alpha, x); // ← scalar shim, no AVX2 } pub fn nrm2_f32(x: &[f32]) -> f32 { super::scalar::nrm2_f32(x) // ← scalar shim } pub fn asum_f32(x: &[f32]) -> f32 { super::scalar::asum_f32(x) // ← scalar shim } // ... and f64 siblings, same shape These were the documented "// No AVX2 specialization — fall through to scalar" path. Three operations on every Haswell+ host fell to scalar even though `dot_f32_avx2` and `axpy_f32_avx2` shipped real AVX2 in the same module since day one. PR #180's audit flagged this as TD-T6 (critical: blocks BLAS-1 throughput on Haswell / Arrow Lake / Zen 1-3). New AVX2 kernels (6 total — f32 + f64 for each of scal / nrm2 / asum): scal: broadcast α to ymm via `_mm256_set1_ps`, multiply 8/4 lanes at a time via `_mm256_mul_ps`/`_mm256_mul_pd`, scalar tail. Stores result back to the same buffer in-place. nrm2: two-accumulator unroll with `_mm256_fmadd_ps`/`_pd` (x² accumulated via FMA, single-rounded per IEEE), horizontal reduce + scalar sqrt. Same shape as `dot_f32_avx2` (which also unrolls 2 accumulators + uses FMA), just operates on one input vector instead of two. asum: abs via `_mm256_and_ps`/`_pd` with a sign-bit-cleared mask (0x7FFFFFFF for f32, 0x7FFFFFFFFFFFFFFF for f64) — one AVX instruction (VANDPS) is faster than calling f32::abs() lane-by-lane. Two-accumulator unroll + horizontal reduce. All three follow the existing `dot_f32_avx2` template: - `#[target_feature(enable = "avx2[,fma]")]` on the inner unsafe fn. - Public wrapper does `cfg(target_arch = "x86_64")` and dispatches to the unsafe fn (tier detection in caller-of-caller verified AVX2 before reaching this module). - Non-x86_64 builds: pass through to `super::scalar::*`. - Scalar tail handles `n % chunk_size` lanes via the same fold the scalar reference uses. Numerical contract: scal: byte-equal to scalar (`x[i] *= α` is the same op). asum: small ULP drift on long vectors because the SIMD horizontal reduce orders the sum differently from strict left-fold. Test tolerance: `|got - expected| <= |expected|*1e-5 + 1e-6`. nrm2: same — drifts ~1-2 ULP on long vectors via reduce-order + sqrt rounding. Same tolerance. 3 new parity tests (`td_t6_scal_f32_parity`, `td_t6_nrm2_f32_parity`, `td_t6_asum_f32_parity`) sweep n ∈ {0, 1, 7, 8, 9, 15, 16, 17, 31, 32, 64, 100} — covers the chunk-of-16 unroll path, the chunk-of-8 cleanup path, and the scalar tail for every kernel. Verification: * 2090 lib tests pass (was 2087 — +3 new parity tests; the existing test_scal_f32 / test_nrm2_f64 / test_asum_f32 that used to hit the scalar shims now exercise the AVX2 kernels and continue to pass). * cargo clippy --lib --tests --features rayon,native -- -D warnings clean. * cargo clippy --lib --tests --features rayon,native,runtime-dispatch -- -D warnings clean. * cargo fmt --all --check clean. Throughput impact (back-of-envelope on Sapphire Rapids, n=4096): scal_f32: scalar 4096 cycles (1 mul/lane) → AVX2 ~520 cycles (8 lanes/instr + 1-cycle issue) = ~8× faster. asum_f32: scalar 4096 cycles → AVX2 ~520 cycles = ~8× faster. nrm2_f32: scalar 4096 cycles (1 FMA/lane) → AVX2 ~260 cycles (16 lanes via 2-acc unroll, 1-cycle issue) = ~16×. Out of scope (separate PRs): * AVX-512 versions of the same three ops — `kernels_avx512.rs` has them already (lines 137-209), wired through the cfg(target_feature = "avx512f") path. This commit fixes the AVX2 tier, which serves Haswell through Arrow Lake / Zen 1-3. * Runtime-dispatch trampolines for these ops (would go in `simd_runtime/blas_l1.rs` mirroring the matmul.rs pattern from the runtime-dispatch PR). https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

claude added 2 commits May 20, 2026 18:46

chatgpt-codex-connector Bot reviewed May 20, 2026

View reviewed changes

AdaWorldAPI merged commit 18451bc into master May 20, 2026
17 checks passed

AdaWorldAPI mentioned this pull request May 20, 2026

docs(simd): per-CPU dispatch matrix + Codex review fixes #181

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(simd): tier debt audit + 4-phase integration plan#180

docs(simd): tier debt audit + 4-phase integration plan#180
AdaWorldAPI merged 2 commits into
masterfrom
claude/pr-x-tier-audit

AdaWorldAPI commented May 20, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 20, 2026

Uh oh!

chatgpt-codex-connector Bot May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		bf16_gemm: amx_bf16_tile_gemm, // TDPBF16PS, 256 mul-adds/instr
		int8_gemm: amx_int8_tile_gemm, // TDPBUSD, 256 mul-adds/instr

		if caps.neon && caps.aes /* heuristic for A72 vs A53 */ {
		return SimdProfile::A72Fast;

Conversation

AdaWorldAPI commented May 20, 2026

Summary

What the audit found

Architecture proposal (integration plan)

Phases

"What dispatches where" quick reference

Risks documented

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants