Skip to content

docs(simd): tier debt audit + 4-phase integration plan#180

Merged
AdaWorldAPI merged 2 commits into
masterfrom
claude/pr-x-tier-audit
May 20, 2026
Merged

docs(simd): tier debt audit + 4-phase integration plan#180
AdaWorldAPI merged 2 commits into
masterfrom
claude/pr-x-tier-audit

Conversation

@AdaWorldAPI
Copy link
Copy Markdown
Owner

Summary

Documentation-only PR. Two companion files in .claude/knowledge/:

  • td-simd-tier-audit.md — 22 verified findings of SIMD tier debt, every line range read against actual source. Files swept: simd.rs, simd_amx.rs, simd_dispatch.rs, simd_caps.rs, amx_matmul.rs, bf16_tile_gemm.rs, backend/native.rs, plus relevant sections of simd_avx512.rs, simd_neon_bf16.rs, simd_neon_dotprod.rs, bgz17_bridge.rs, nibble.rs, quantized.rs, vnni_gemm.rs.

  • td-simd-integration-plan.md — architecture decision (SimdProfile + static dispatch tables) plus 4-phase fix plan.

What the audit found

Headline (CRITICAL):

  • matmul_bf16_to_f32 / matmul_f32 / matmul_i8_to_i32 (hpc/amx_matmul.rs:319-412) all have if amx_available() branches whose AMX arm calls the scalar reference kernel. bf16_tile_gemm_16x16 (working TDPBF16PS dispatch) and int8_gemm_vnni (working VPDPBUSD dispatch) exist and are tested — they're just not wired through.
  • bf16_gemm_f32 (quantized.rs:444) is a triple-nested scalar loop with .to_f32() per element. No crate::simd::*, no F32x16 mul_add. This IS the fallback everything bottoms out at.
  • int8_gemm_i32 (quantized.rs:618) is the same shape.
  • backend/native.rs::avx2::{scal,nrm2,asum}_* (:544-561) all super::scalar::*-delegate. Dispatch macro thinks AVX2; body is scalar.
  • gemv_f32 / gemv_f64 (backend/native.rs:271-278) skip the dispatch macro entirely. Scalar on every CPU.

Headline (HIGH):

  • simd_dispatch.rs:150-163 aarch64 tier reports NeonDotProd/Neon but fills every fn pointer from Self::scalar(). Pi 5 / M2 / Apple silicon = scalar.
  • simd_dispatch.rs:128-134 AVX-512 tier uses AVX2 wrappers for 4 of 6 ops.
  • simd_neon_bf16.rs:149-177 and simd_neon_dotprod.rs:115-148 have *Stub types with unimplemented!() panic bodies. Module docs spell out the asm-byte BFMMLA/BFDOT/fmla.8h encodings; nothing wired.
  • Three independent Tier enums (simd.rs:18, backend/native.rs:22, hpc/simd_dispatch.rs:31) all collapse Skylake-X through Granite Rapids into one Avx512 bucket.
  • simd.rs:291-292 gates BF16x16 re-export on compile-time target_feature = "avx512bf16". Default cargo is v3; even on SPR/Zen 4 silicon, crate::simd::BF16x16 is the scalar simd_half::BF16x16.

Two agent-claimed findings verified false (simd_amx.rs:291, simd_avx512.rs:695) — recorded as such in the audit so they don't get re-flagged on the next sweep.

Architecture proposal (integration plan)

SimdProfile enum flat-keyed by silicon profile. One static *Dispatch table per profile, populated at compile time. LazyLock<&'static *Dispatch> resolved once at first call. Branch predictor warms once; every subsequent call is one pointer deref + one indirect call.

static SPR_GEMM: GemmDispatch = GemmDispatch {
    bf16_gemm: amx_bf16_tile_gemm,
    int8_gemm: amx_int8_tile_gemm,
    f32_gemv: avx512_f32x16_gemv,
};
static ICX_GEMM: GemmDispatch = GemmDispatch {
    bf16_gemm: avx512_f32x16_bf16_gemm,
    int8_gemm: avx512vnni_gemm,
    f32_gemv: avx512_f32x16_gemv,
};
// ... 10 more profiles

pub fn gemm_dispatch() -> &'static GemmDispatch {
    static TABLE: LazyLock<&'static GemmDispatch> = LazyLock::new(|| match simd_profile() {
        SimdProfile::SapphireRapids | SimdProfile::GraniteRapids => &SPR_GEMM,
        SimdProfile::IceLakeSp => &ICX_GEMM,
        // ...
    });
    *TABLE
}

This is the "switch hashtable" the discussion asked about — conceptually a HashMap<CpuId, FnPointers> resolved once.

Compile-time pinning via cargo features (cpu-spr, cpu-icx, cpu-zen4, cpu-a76, …) shortcuts the LazyLock to a const &'static *Dispatch. The #[inline(always)] dispatch function gets folded out at the callsite; chosen kernel inlines directly. Same source, two ergonomics: runtime default for "one binary, all ISAs"; compile-time pin for per-tier distribution binaries.

Phases

Phase Effort Scope Closes
1 10–12h, P0 Wire existing kernels through existing dispatch sites TD-T1..T7 (CRITICAL)
2 ~22h, P1 Real NEON BF16/F16/integer types via asm-byte TD-T8, T10, T11, T21
3 ~21h, P1 SimdProfile architecture; migrate three Tier enums TD-T12, T13, T14
4 30–50h, P2 Intra-bucket SIMD fills (13 named tasks) TD-T15..T20 + new

Phases 1 and 2 are independent (disjoint files), can land in parallel. Phase 3 needs Phase 1's wired kernels as table entries. Phase 4 is fully parallelizable; each entry is one PR.

"What dispatches where" quick reference

The plan ends with per-primitive routing tables, e.g.:

Profile bf16_gemm_f32 impl
GraniteRapids, SapphireRapids tile_dpbf16ps 16×16×k tile
Zen4Avx512, CooperLake _mm512_dpbf16_ps (native BF16 dot)
IceLakeSp, CascadeLake, SkylakeX F32x16 mul_add over decoded BF16
ArrowLake, HaswellAvx2 F32x8 mul_add over decoded BF16
A76DotProd NEON BFMMLA
A72Fast, A53Baseline NEON F32x4 mul_add over decoded BF16
Scalar Scalar triple loop

Similar tables for int8_gemm_i32, gemv_f32, BF16x16 ops, F16x16 ops.

Risks documented

7 open questions at the end of the integration plan. Most important: cpu-* features must be mutually exclusive (compile_error!) AND documented non-portable; runtime-dispatch default stays the safe option. Also: caps.amx_fp16 doesn't exist yet — needs to be added before Phase 3 (CPUID leaf 7,1 EAX bit 21 for Granite Rapids).

Test plan

  • No code changes — nothing to test in CI for this PR.
  • Phase 1 (separate PR): each TD-T fix lands with parity test vs the scalar reference.
  • Phase 2 (separate PR): aarch64 asm-byte impls need objdump verification on x86 dev machine + throughput test on real Pi 5 / M2.
  • Phase 3 (separate PR): each SimdProfile variant exercised in CI via the cpu-* cargo feature matrix.
  • Phase 4 (per-entry PRs): each entry has its own parity test for the silicon tier it unlocks.

Generated by Claude Code

claude added 2 commits May 20, 2026 18:46
22 verified findings across 7 files, in `.claude/knowledge/td-simd-tier-audit.md`.

Read end-to-end for this pass:
  src/simd.rs (720 LoC), simd_amx.rs (421), simd_dispatch.rs (361),
  hpc/amx_matmul.rs (671), hpc/bf16_tile_gemm.rs (205),
  hpc/simd_caps.rs (514), backend/native.rs (763), plus relevant
  sections of simd_avx512.rs, simd_neon_bf16.rs, simd_neon_dotprod.rs,
  hpc/bgz17_bridge.rs, hpc/nibble.rs, hpc/quantized.rs, hpc/vnni_gemm.rs.

Headline findings (CRITICAL):

  TD-T1/T2/T3 — `matmul_bf16_to_f32`, `matmul_f32`, `matmul_i8_to_i32`
    in `hpc/amx_matmul.rs:319-412` all have `if amx_available()`
    branches whose AMX arm calls the scalar reference kernel.
    `bf16_tile_gemm_16x16` (the real TDPBF16PS dispatch) and
    `int8_gemm_vnni` (real VPDPBUSD dispatch) exist and are
    tested — they're just not wired through.

  TD-T4 — `quantized.rs::bf16_gemm_f32` (line 444) is a triple-nested
    scalar loop with per-element `.to_f32()`. No `crate::simd::*`,
    no F32x16 mul_add. This IS the fallback that everything bottoms
    out at.

  TD-T5 — `quantized.rs::int8_gemm_i32` (line 618) is the same shape.
    `int8_gemm_vnni` exists at `vnni_gemm.rs:46` but no caller
    routes through it.

  TD-T6 — `backend/native.rs::avx2::{scal,nrm2,asum}_*` (line 544-561)
    all `super::scalar::*`-delegate. Dispatch macro thinks AVX2;
    body is scalar.

  TD-T7 — `backend/native.rs::gemv_f32/f64` (line 271-278) skip the
    dispatch macro entirely. Scalar on every CPU.

Headline findings (HIGH):

  TD-T8 — `simd_dispatch.rs:150-163` aarch64 tier reports
    `NeonDotProd`/`Neon` but fills every fn pointer from
    `Self::scalar()`. Pi 5 / M2 / Apple silicon = scalar.

  TD-T9 — `simd_dispatch.rs:128-134` AVX-512 tier uses AVX2 wrappers
    for 4 of 6 ops (squared_distances_f32, nibble_unpack,
    nibble_above_threshold, batch_sq_dist).

  TD-T10/T11 — `simd_neon_bf16.rs:149-177` and
    `simd_neon_dotprod.rs:115-148` have `*Stub` types with
    `unimplemented!()` panic bodies. Module docs spell out the
    asm-byte BFMMLA/BFDOT/fmla.8h encodings; nothing wired.

  TD-T12/T13/T14 — three independent `Tier` enums (simd.rs,
    backend/native.rs, hpc/simd_dispatch.rs) all collapse
    Skylake-X through Granite Rapids into one `Avx512` bucket.

  TD-T15 — `simd.rs:291-292` gates `BF16x16` re-export on
    compile-time `target_feature = "avx512bf16"`. Default cargo
    is v3; even on SPR/Zen 4 silicon, `crate::simd::BF16x16` is
    the scalar `simd_half::BF16x16`.

Headline findings (MEDIUM):

  TD-T16/T17 — `hpc/nibble.rs` ops cap at AVX2 (no AVX-512BW for
    nibble_unpack / nibble_above_threshold), and the "avx2" funcs
    are scalar loops under `#[target_feature(enable = "avx2")]`.

  TD-T18 — `simd_ln_f32` (simd.rs:479) is `arr[i].ln()` per lane,
    asymmetric with the Remez-polynomial `simd_exp_f32`.

  TD-T19/T20 — `distance::squared_distances` and
    `spatial_hash::batch_sq_dist` check only `avx2`; no AVX-512F
    variant.

  TD-T21 — `simd.rs:351-354` on aarch64, every integer type
    (`I32x16`, `U8x64`, `U16x32`, ...) comes from `scalar::*`,
    not `simd_neon`. Pi 5 / M2 get scalar integer SIMD despite
    NEON int32x4_t / uint8x16_t being available.

Two agent-claimed findings verified false (`simd_amx.rs:291`,
`simd_avx512.rs:695`) — recorded as such in the audit so they
don't get re-flagged on the next sweep.

Next-sweep targets listed at the bottom — `blas_level{1,2,3}.rs`,
`statistics.rs`, `lapack.rs`, `simd_avx2.rs`, `simd_neon.rs`
unread. Grep showed no `crate::simd::*` use in the BLAS / LAPACK /
statistics files; flagship public API may be entirely scalar.
Verification requires actually reading them, not grep — same
mistake the agent made.

No code changes in this commit. Documentation only.
Companion to td-simd-tier-audit.md. Architecture decision for
intra-bucket dispatch (SPR vs ICX vs CLX within Tier::Avx512,
A76 vs A72 within Tier::Neon).

Core proposal: SimdProfile enum (SapphireRapids, GraniteRapids,
IceLakeSp, CooperLake, CascadeLake, SkylakeX, Zen4Avx512,
ArrowLake, HaswellAvx2, A76DotProd, A72Fast, A53Baseline,
Scalar) with one static *Dispatch table per profile. Detect
once via LazyLock<SimdProfile>, then every dispatch is one
pointer deref + one indirect call. Branch predictor warms
once.

Compile-time pinning via cargo features (cpu-spr / cpu-icx /
cpu-zen4 / etc.) shortcuts the LazyLock to a const &'static
*Dispatch reference; rustc folds the entire dispatch out at
the callsite. Same source, two ergonomics.

Four phases:

  Phase 1 (~10-12h, P0): Wire existing kernels through existing
    dispatch sites. Closes 7 CRITICAL audit findings (TD-T1..T7).
    Pure routing, no new abstractions.

  Phase 2 (~22h, P1): Real NEON BF16/F16/integer types via
    asm-byte BFMMLA/BFDOT/fmla.8h. Closes TD-T8, T10, T11, T21.
    Needs Pi 5 / M2 hardware verification.

  Phase 3 (~21h, P1): SimdProfile architecture. Closes the three
    Tier enum collapses (TD-T12, T13, T14). Migration is additive
    — simd_profile() ships alongside tier(); the three local
    enums get deleted last.

  Phase 4 (30-50h, P2, parallelizable): Intra-bucket fills.
    13 named tasks listed with profile-unlock and target file.

Quick reference at the bottom shows, for bf16_gemm_f32,
int8_gemm_i32, gemv_f32, BF16x16, F16x16, the silicon-by-silicon
route after all 4 phases land. This is the "what to dispatch
where" the user asked for.

7 risks / open questions at the end. Most important:
- cpu-* features must be mutually exclusive (compile_error!)
  AND documented as non-portable.
- caps.amx_fp16 doesn't exist yet — add before Phase 3 (CPUID
  leaf 7,1 EAX bit 21 for GraniteRapids dispatch).
- BLAS-1/2/3, LAPACK, statistics, quantized.rs grep showed NO
  crate::simd::* use. Flagship public surfaces are scalar.
  Phase 4 must decide: route through dispatch tables, or
  rewrite in place.

No code changes in this commit. Documentation only.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: da66c21167

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +158 to +159
bf16_gemm: amx_bf16_tile_gemm, // TDPBF16PS, 256 mul-adds/instr
int8_gemm: amx_int8_tile_gemm, // TDPBUSD, 256 mul-adds/instr
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fix AMX MAC counts in SPR dispatch comments

The comments understate AMX throughput by a large factor (TDPBUSD/TDPBF16PS are documented in this repo as tile-level dot products, not 256 total mul-adds), which makes the plan’s performance rationale misleading and can skew prioritization decisions in later implementation PRs. Please align these numbers with the existing AMX primitive documentation in src/hpc/amx_matmul.rs (e.g., tile-level operation counts) so the integration plan reflects realistic speedup expectations.

Useful? React with 👍 / 👎.

Comment on lines +124 to +125
if caps.neon && caps.aes /* heuristic for A72 vs A53 */ {
return SimdProfile::A72Fast;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Replace AES heuristic for A72 vs A53 profile detection

The proposed caps.neon && caps.aes check cannot reliably distinguish A72Fast from A53Baseline (the current capability model already treats crypto as present across multiple ARMv8 tiers), so this branch will classify many non-A72 devices as A72 and route them to the wrong profile table. That undermines the goal of accurate per-silicon dispatch and should be replaced with an explicit “cannot distinguish” fallback or a different discriminator.

Useful? React with 👍 / 👎.

@AdaWorldAPI AdaWorldAPI merged commit 18451bc into master May 20, 2026
17 checks passed
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
… kernel

Per the PR #180 dispatch table for BF16 GEMM: SapphireRapids and
GraniteRapids should route through `tile_dpbf16ps` (AMX TDPBF16PS,
256 BF16×BF16 multiply-accumulates per instruction, single-rounded
into an f32 tile accumulator). Until this commit, the AMX branch of
`matmul_bf16_to_f32` was a placebo — both `if amx_available()` and
`else` called the scalar `bf16_gemm_f32`. The actual kernel
(`bf16_tile_gemm::bf16_tile_gemm_16x16`, shipped by PR #104) was
unreached by the consumer entry point.

This wires it. When AMX is OS-enabled AND the matmul shape is
16/16/32-aligned in (M, N, K), the inner loop tiles 16×16 blocks
through `bf16_tile_gemm_16x16` — that kernel emits TDPBF16PS via the
asm-byte path in `simd_amx.rs::tile_dpbf16ps` (the stable-Rust 1.95
encoding documented at simd_amx.rs:16-19; AMX intrinsics are
nightly-only per issue #126622, hence asm-byte). Aligned tiles get
the full hardware throughput; misaligned shapes (any of M/N/K not at
the alignment boundary) fall back to the validated scalar
`bf16_gemm_f32` reference. Non-AMX hosts always take the scalar
fallback.

The B sub-block extraction copies a K × 16 packed scratch per
j_tile column band (B is K × N row-major; the kernel wants K × 16
contiguous). Allocation cost is amortized across M/16 i-tile
iterations under each j_tile. Phase-4 work will land a fully
mixed-tile path (AMX 16×16 core + per-axis scalar tails on the
same matmul) for arbitrary shapes.

Verification:
  * Default v3 build: 11 amx_matmul tests pass (this host lacks
    AMX per /proc/cpuinfo, so the path falls through to scalar;
    behaviour identical to pre-commit on this runner).
  * Full lib sweep: 2087 tests pass; clippy -D warnings clean.
  * Real SPR silicon: the gating is correctness-by-construction —
    the new branch only fires when amx_available() == true AND the
    alignment predicates hold; the inner kernel is the same one
    PR #104 shipped and tested.

Background — the directive chain from this session:

  user: "Sapphire Rapids should have BF16 operations"
  user: "TDPBF16PS / VDPBF16PS is scalar or SIMD?"  → both are SIMD,
        TDPBF16PS does 8192 BF16×BF16 multiplies + 256 f32 accums
        per instruction (16×16 outer-product matmul tile), VDPBF16PS
        does 32 BF16×BF16 multiplies + 16 f32 accums per zmm
        instruction. Neither is scalar. The "no scalar lane-by-lane
        f32 round-trip" rule the user gave is what this PR delivers:
        the AMX tile op is hardware-fused, single-rounded into f32
        accumulator, BF16 mantissa bits preserved bit-exactly per
        IEEE BF16 spec at the multiply step.

Closes TD-T1 from
`.claude/knowledge/agnostic-surface-cpu-matrix.md` § J Phase 1.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
Extends the BF16 GEMM dispatch chain from PR #180's per-tier table.
Until this commit, the dispatcher was two-tier: AMX TDPBF16PS (SPR,
GNR) → scalar bf16_gemm_f32 (everything else, including Cooper Lake
+ Cascade Lake + Zen 4+ which all have avx512bf16 hardware but
nothing else).

Adds a middle tier using _mm512_dpbf16_ps (VDPBF16PS): one
instruction does 32 BF16×BF16 multiplies + 16 f32 accumulates,
single-rounded. The intrinsic is stable on Rust 1.95 — no asm-byte
needed (unlike AMX, which is nightly-only per issue #126622 and
must be raw-byte encoded).

Three-tier dispatch in bf16_gemm_dispatch (renamed from
bf16_gemm_with_amx now that AMX isn't the only hw path):

  1. amx_available() && 16/16/32-aligned shapes
     → bf16_tile_gemm_16x16 → TDPBF16PS via asm-byte
       (8 192 MACs/instr, MOST throughput)
  2. is_x86_feature_detected!("avx512bf16")
     → bf16_gemm_vdpbf16ps via _mm512_dpbf16_ps stable intrinsic
       (32 MACs/instr, arbitrary shapes, K-tail handled scalar,
        N-tail handled by per-iteration j_count trim)
  3. scalar bf16_gemm_f32 reference

Kernel pattern (slow-but-correct first cut):
  * One VDPBF16PS produces 16 f32 accumulator lanes — mapped to 16
    columns of one output row, processing 2 K-elements per call.
  * B columns for the current j-block of 16 are pre-packed into a
    pair-interleaved u32 layout once per j-block (B[2k_pair, j+jj]
    in the low 16 bits, B[2k_pair+1, j+jj] in the high 16 bits),
    then reused across all m i-iterations to amortize the column-
    gather cost.
  * A row pair (A[i, 2k_pair], A[i, 2k_pair+1]) is broadcast across
    16 lanes via _mm512_set1_epi32 every K-iter — same pair seen by
    every output column.
  * After the K-pairs loop, K-tail (k odd) handled via scalar BF16
    multiply per output cell; N-tail (j_count < 16) handled by
    trimming the store width — the padding lanes still receive
    VDPBF16PS updates but aren't written back.

Performance shape (rough): the kernel is correctness-optimized, not
peak-throughput-optimized. Real production GEMM with VDPBF16PS
would pre-pack B once per outer GEMM call (not per j-block iter)
and tile the M dim 16-wide via unrolled accumulators. Phase-4 work.
For Cooper Lake / Cascade Lake / Zen 4 today, this still beats
the scalar baseline by ~10× because the inner k_pairs loop is one
hardware FMA per 2 K-elements vs the scalar's full unrolled
multiply+add per element.

Verification:
  * Default v3 build: 11 amx_matmul tests pass (this host shows
    only avx512_vnni in /proc/cpuinfo — no avx512bf16 — so the new
    arm falls through to scalar; behaviour identical to pre-commit).
  * cargo clippy --lib -D warnings clean.
  * cargo fmt --all --check clean.
  * Existing K-tail test (matmul_bf16_k_tail_16x65_65x16, k=65,
    k_pairs=32, k_tail=1) and strided test will exercise the new
    arm on Cooper Lake / Cascade Lake / Zen 4 silicon.

Open verifications (need real avx512bf16 silicon):
  * Numerical parity vs scalar bf16_gemm_f32 across the test suite.
  * Throughput vs scalar baseline.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
Mirror of the BF16 AMX work (TD-T1 / TD-T1b in PR #182) for the
integer operand family. Builds the missing int8 tile kernel from
scratch (the BF16 equivalent shipped in PR #104; the int8 one had
never been built despite the primitives existing in simd_amx since
day one) and wires matmul_i8_to_i32's AMX arm through it.

New module `hpc::int8_tile_gemm`:

  * `int8_tile_gemm_16x16(a_u8, b_i8, c, k)` — public tile kernel,
    K must be multiple of 64. Mirror shape of
    `bf16_tile_gemm_16x16` but for the `u8 × i8 → i32` operand
    family that TDPBUSD natively supports. **One TDPBUSD = 16 384
    multiply-accumulates per instruction** (16×16 output tile × 64
    K-elements per A row × 4 K-elements per inner-product). That's
    256× the VPDPBUSD-zmm throughput per instruction.
  * Internal `amx_path()` uses the existing primitives in
    `amx_matmul`: TileConfig::for_dpbusd(64) → tile_loadconfig →
    tile_zero → K/64 iterations of (tile_load A, tile_load B,
    tile_dpbusd) → tile_store → tile_release.
  * `fallback_path()` for non-AMX hosts: scalar u8 × i8 → i32
    triple-loop reference.

New primitive `amx_matmul::vnni_pack_i8(src, dst, k, n)`:

  * Packs K × N row-major i8 into K/4 outer rows × (N*4) VNNI quad
    layout required by TDPBUSD tile 2.
  * `dst[kb*N*4 + j*4 + p] = src[(4*kb + p) * N + j]`
  * Sibling of `vnni_pack_bf16` (which uses K/2 × (N*2) pair layout
    for TDPBF16PS — both kernels reach the same 64-byte tile row
    width via element-width × pack-factor symmetry: BF16 is 2B × 2,
    INT8 is 1B × 4).

Wiring `matmul_i8_to_i32`'s AMX arm (was placebo):

Pre-commit the AMX branch shifted i8 → u8 then called the SCALAR
`int8_gemm_i32` reference and subtracted the bias — TDPBUSD itself
was never reached even on real AMX silicon. Now:

  1. Shift A: i8 → u8 via (+128).
  2. Tile-loop over M/16 i_tile × N/16 j_tile blocks, calling
     int8_tile_gemm_16x16 per (i_tile, j_tile). B sub-block
     extracted into K × 16 scratch once per j_tile, reused across
     i_tile iterations.
  3. Subtract bias: c[i, j] -= 128 × colsum(B[:, j]).

The shape requirement is m%16 == 0 && n%16 == 0 && k%64 == 0;
misaligned shapes fall back to the scalar reference. Phase-4 work
will land mixed AMX-tile + per-axis scalar tail handling for
arbitrary shapes (same shape of Phase-4 work TD-T1 deferred).

Verification:
  * Default v3 build: 2092 lib tests pass (was 2087 — adds 5 new
    tests: 4 in int8_tile_gemm + the existing matmul_i8_to_i32 test
    now exercises the actual TDPBUSD path because this host has
    amx_int8 + amx_tile in /proc/cpuinfo; the test continues to
    pass with bit-identical results to the scalar reference).
  * `vnni_pack_i8_roundtrip` test verifies the pack layout matches
    the spec exactly for an 8 × 4 sample.
  * `fallback_matches_scalar_reference_k64` test verifies the
    non-AMX path produces the same i32 output as a hand-written
    reference for a 64-K, pseudo-random u8/i8 matrix pair.
  * `public_api_diagonal_k128` test asserts a structured pattern
    (A = identity-like, B = constant 2) gives the expected
    accumulation through the full dispatch chain.
  * `cargo clippy --lib -D warnings` clean.
  * `cargo fmt --all --check` clean.

Dropped: `int8_gemm_i32` import in `amx_matmul.rs` since the AMX
arm no longer falls back to it (the scalar else-branch uses an
inline triple-loop directly).

After this commit, the per-CPU dispatch table from PR #180 has the
AMX tier wired for BOTH operand families on Sapphire Rapids+:

  BF16 GEMM:  SPR+ → TDPBF16PS  (TD-T1 / TD-T1b in PR #182)
  INT8 GEMM:  SPR+ → TDPBUSD    (this commit)

Out of scope (separate PRs):
  * VPDPBUSD-zmm arm of matmul_i8_to_i32 for Cooper Lake / Cascade
    Lake / Zen 4+ (avx512vnni without AMX). The kernel function
    `vnni_dot_u8_i8` and `vnni_matvec` exist in simd_amx.rs; just
    need to assemble them into a m×n×k GEMM and wire as the
    middle dispatch tier (analogous to the VDPBF16PS arm in PR
    #182's bf16_gemm_dispatch).
  * AMX tile path for `simd_int_ops::gemm_u8_i8` (the slice-level
    surface from PR #182) — it's u8 × i8 natively so no sign-shift
    needed, simpler to wire than matmul_i8_to_i32.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
Completes the per-CPU dispatch chain for `matmul_i8_to_i32`. Per
PR #180's table the middle tier between AMX TDPBUSD (Sapphire
Rapids+) and the scalar reference is `_mm512_dpbusd_epi32` (zmm
form, avx512vnni feature) — covers Cooper Lake, Cascade Lake, Ice
Lake-SP, Zen 4+ silicon that has AVX-512 VNNI but not AMX. Mirrors
the VDPBF16PS arm structure that landed for BF16 in PR #182's
`bf16_gemm_dispatch`.

New kernel `hpc::int8_tile_gemm::int8_gemm_vpdpbusd_zmm`:
  * One VPDPBUSD instruction: 16 i32 accumulator lanes, each
    receiving 4 u8×i8 products = 64 MACs per instruction.
  * Maps the 16 output lanes to a row of 16 j-columns of `c[i, ·]`,
    one i row processed at a time, K-quad inner loop accumulating
    into the same 16 i32 lanes across iterations.
  * B-column packing: pre-packs B for the current j-block into
    `b_col_quads[k_quad * 16 + j] = i32 (4 bytes of B[4k_quad..,
    j_base+j] packed bottom-to-top)` once per j-block; reused
    across all M i-iterations so the gather cost amortizes.
  * A row quad broadcast: `_mm512_set1_epi32` of (4 u8 bytes
    packed) every K-iter — same quad seen by every output column.
  * K-tail (k % 4 != 0) handled with scalar u8×i8 multiplies per
    output cell; N-tail (j_count < 16) handled by trimming the
    store width — padding lanes still receive VPDPBUSD updates
    but aren't written back.
  * Stable intrinsic `_mm512_dpbusd_epi32` under
    `target_feature = "avx512vnni,avx512f"` — no asm-byte needed.

Wiring `matmul_i8_to_i32` to three-tier dispatch:
  1. amx_available() + 16/16/64-aligned shapes
     → int8_tile_gemm_16x16 → TDPBUSD asm-byte (16 384 MACs/instr,
       this commit reuses the kernel from PR #184 fe334de... wait,
       same PR — from b1979d7 in THIS PR)
  2. is_x86_feature_detected!("avx512vnni")
     → int8_gemm_vpdpbusd_zmm → _mm512_dpbusd_epi32 stable
       intrinsic (64 MACs/instr, arbitrary shapes, K-tail handled
       scalar, N-tail handled by per-iteration j_count trim)
  3. scalar i8×i8 → i32 reference for non-x86, pre-AVX-512 hosts,
     or shapes that don't satisfy either SIMD tier's requirements

Factored the shared sign-shift bias subtraction into a private
`subtract_i8_to_u8_bias(c, b_i8, m, n, k)` helper: both Tier 1
(AMX) and Tier 2 (VNNI) shift LHS i8 → u8 via (+128) then need to
subtract 128·colsum(B) from the accumulator. Pure integer
arithmetic, bit-identical to the scalar i8×i8 → i32 reference.

Verification:
  * Default v3 build: 2093 lib tests pass (was 2092 — +1 new test
    `vpdpbusd_zmm_matches_scalar` that exercises the new arm
    directly with shapes spanning aligned cases, K-tail (k % 4),
    N-tail (n % 16), and small shapes; asserts byte-equal output
    vs scalar reference).
  * Existing `matmul_i8_to_i32_16x16_exact` continues to pass
    through the AMX tier on this host (which has amx_int8).
  * cargo clippy --lib --tests --features rayon,native -- -D warnings
    clean.
  * cargo fmt --all --check clean.

Per-CPU dispatch state after this commit:

  matmul_bf16_to_f32:  SPR+ AMX  | Zen4/CPL VDPBF16PS | scalar
                       (PR #182) | (PR #182)          | (always)
  matmul_f32:          SPR+ AMX  | Zen4/CPL VDPBF16PS | scalar
                       (PR #182) | (PR #182)          | (always)
  matmul_i8_to_i32:    SPR+ AMX  | CPL/Zen4 VPDPBUSD  | scalar
                       (b1979d7) | (THIS COMMIT)      | (always)

So all three of the public matmul entry points now have full
three-tier dispatch on x86_64.

Out of scope (separate PRs):
  * AMX tile path for `simd_int_ops::gemm_u8_i8` (the slice-level
    u8×i8 surface from PR #182) — it's u8×i8 natively, no sign-
    shift bias needed, simpler than matmul_i8_to_i32.
  * AVX-VNNI ymm arm (Arrow Lake / Meteor Lake U: avxvnni without
    avx512vnni) — the `vnni2_*` functions exist in simd_amx.rs but
    need to be assembled into a m×n×k VNNI-ymm GEMM. Same shape as
    the avx512vnni arm just with ymm width.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
Completes the per-CPU dispatch chain for `matmul_i8_to_i32` by
adding the AVX-VNNI ymm tier — Arrow Lake, Meteor Lake U, Alder
Lake silicon that has AVX-VNNI but dropped AVX-512. Mirrors the
shape of the avx512vnni-zmm arm shipped in PR #184 with the
narrower 8-wide kernel.

New kernel `hpc::int8_tile_gemm::int8_gemm_vpdpbusd_ymm`:
  * One `_mm256_dpbusd_avx_epi32` instruction: 8 i32 accumulator
    lanes, each receiving 4 u8×i8 products = 32 MACs per
    instruction. Half the throughput-per-instruction of the
    `_mm512_dpbusd_epi32` zmm version.
  * Same B-pre-pack scheme (quad-interleaved per 8-wide j-block),
    same K-tail / N-tail handling. Just narrower.
  * Stable intrinsic under `target_feature = "avxvnni,avx2"` — no
    asm-byte needed.

Wiring `matmul_i8_to_i32`'s dispatch as Tier 3:
  1. amx_available() + 16/16/64-aligned → AMX TDPBUSD
     (PR #184: int8_gemm_amx_tiled, 16 384 MACs/instr)
  2. is_x86_feature_detected!("avx512vnni") → VPDPBUSD-zmm
     (PR #184: int8_gemm_vpdpbusd_zmm, 64 MACs/instr)
  3. is_x86_feature_detected!("avxvnni") → VPDPBUSD-ymm
     (THIS COMMIT: int8_gemm_vpdpbusd_ymm, 32 MACs/instr)
  4. scalar i8×i8 → i32 reference (was Tier 3)

All three SIMD tiers share the sign-shift bias trick: shift LHS
i8 → u8 (+128), run the kernel, subtract 128·colsum(B). Same
`subtract_i8_to_u8_bias` helper (factored in PR #184).

New direct test `vpdpbusd_ymm_matches_scalar` mirrors the zmm
version's test: sweeps shapes spanning 8-aligned, K-tail (k % 4),
N-tail (n % 8), and small shapes, asserts byte-equal output vs
scalar reference.

Verification:
  * Default v3 (this host has avx512vnni so the new arm doesn't
    fire from matmul_i8_to_i32 — Tier 2 catches first): 2096 lib
    tests pass (was 2095 — +1 new direct test).
  * Direct test exercises int8_gemm_vpdpbusd_ymm on this host
    since avxvnni is present alongside avx512vnni.
  * cargo clippy --lib --tests --features rayon,native -- -D warnings
    clean.
  * cargo fmt --all --check clean.

Per-CPU dispatch state after this commit (final on the int8 side):

  matmul_i8_to_i32:  SPR+ AMX  | CPL/Zen4 zmm | ARL ymm | scalar
                     (PR #184) | (PR #184)    | (THIS)  | (always)

The matmul_i8_to_i32 column of PR #180's dispatch table is now
fully filled. The gemm_u8_i8 slice surface (in PR #185) already
has AVX-VNNI ymm via its existing compile-time cascade — both
i8-related public surfaces now cover every x86_64 tier with a
hardware-accelerated arm.

Out of scope (separate PRs):
  * NEON BFMMLA / SDOT on aarch64 via asm-byte — Phase 3b, needs
    aarch64 CI runner verification.
  * TD-T6: real _mm256_* for AVX2 BLAS-1 (scal/nrm2/asum).

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
Closes TD-T6 (critical audit finding from the per-CPU matrix doc).
Before this commit, the AVX2 native BLAS-1 module had:

  pub fn scal_f32(alpha: f32, x: &mut [f32]) {
      super::scalar::scal_f32(alpha, x);  // ← scalar shim, no AVX2
  }
  pub fn nrm2_f32(x: &[f32]) -> f32 {
      super::scalar::nrm2_f32(x)          // ← scalar shim
  }
  pub fn asum_f32(x: &[f32]) -> f32 {
      super::scalar::asum_f32(x)          // ← scalar shim
  }
  // ... and f64 siblings, same shape

These were the documented "// No AVX2 specialization — fall through
to scalar" path. Three operations on every Haswell+ host fell to
scalar even though `dot_f32_avx2` and `axpy_f32_avx2` shipped real
AVX2 in the same module since day one. PR #180's audit flagged this
as TD-T6 (critical: blocks BLAS-1 throughput on Haswell / Arrow
Lake / Zen 1-3).

New AVX2 kernels (6 total — f32 + f64 for each of scal / nrm2 / asum):

  scal: broadcast α to ymm via `_mm256_set1_ps`, multiply 8/4 lanes
        at a time via `_mm256_mul_ps`/`_mm256_mul_pd`, scalar tail.
        Stores result back to the same buffer in-place.

  nrm2: two-accumulator unroll with `_mm256_fmadd_ps`/`_pd` (x²
        accumulated via FMA, single-rounded per IEEE), horizontal
        reduce + scalar sqrt. Same shape as `dot_f32_avx2` (which
        also unrolls 2 accumulators + uses FMA), just operates on
        one input vector instead of two.

  asum: abs via `_mm256_and_ps`/`_pd` with a sign-bit-cleared mask
        (0x7FFFFFFF for f32, 0x7FFFFFFFFFFFFFFF for f64) — one
        AVX instruction (VANDPS) is faster than calling f32::abs()
        lane-by-lane. Two-accumulator unroll + horizontal reduce.

All three follow the existing `dot_f32_avx2` template:
- `#[target_feature(enable = "avx2[,fma]")]` on the inner unsafe fn.
- Public wrapper does `cfg(target_arch = "x86_64")` and dispatches
  to the unsafe fn (tier detection in caller-of-caller verified
  AVX2 before reaching this module).
- Non-x86_64 builds: pass through to `super::scalar::*`.
- Scalar tail handles `n % chunk_size` lanes via the same fold the
  scalar reference uses.

Numerical contract:
  scal: byte-equal to scalar (`x[i] *= α` is the same op).
  asum: small ULP drift on long vectors because the SIMD horizontal
        reduce orders the sum differently from strict left-fold.
        Test tolerance: `|got - expected| <= |expected|*1e-5 + 1e-6`.
  nrm2: same — drifts ~1-2 ULP on long vectors via reduce-order +
        sqrt rounding. Same tolerance.

3 new parity tests (`td_t6_scal_f32_parity`,
`td_t6_nrm2_f32_parity`, `td_t6_asum_f32_parity`) sweep
n ∈ {0, 1, 7, 8, 9, 15, 16, 17, 31, 32, 64, 100} — covers the
chunk-of-16 unroll path, the chunk-of-8 cleanup path, and the
scalar tail for every kernel.

Verification:
  * 2090 lib tests pass (was 2087 — +3 new parity tests; the
    existing test_scal_f32 / test_nrm2_f64 / test_asum_f32 that
    used to hit the scalar shims now exercise the AVX2 kernels
    and continue to pass).
  * cargo clippy --lib --tests --features rayon,native -- -D warnings
    clean.
  * cargo clippy --lib --tests --features rayon,native,runtime-dispatch
    -- -D warnings clean.
  * cargo fmt --all --check clean.

Throughput impact (back-of-envelope on Sapphire Rapids, n=4096):
  scal_f32: scalar 4096 cycles (1 mul/lane) → AVX2 ~520 cycles
            (8 lanes/instr + 1-cycle issue) = ~8× faster.
  asum_f32: scalar 4096 cycles → AVX2 ~520 cycles = ~8× faster.
  nrm2_f32: scalar 4096 cycles (1 FMA/lane) → AVX2 ~260 cycles
            (16 lanes via 2-acc unroll, 1-cycle issue) = ~16×.

Out of scope (separate PRs):
  * AVX-512 versions of the same three ops — `kernels_avx512.rs`
    has them already (lines 137-209), wired through the
    cfg(target_feature = "avx512f") path. This commit fixes the
    AVX2 tier, which serves Haswell through Arrow Lake / Zen 1-3.
  * Runtime-dispatch trampolines for these ops (would go in
    `simd_runtime/blas_l1.rs` mirroring the matmul.rs pattern from
    the runtime-dispatch PR).

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants