simd: agnostic gemm_u8_i8 surface, integer-slice-op lift, per-CPU matrix, BF16 AMX wiring#182

Merged
AdaWorldAPI merged 12 commits into master from claude/continue-ndarray-x0Oaw
May 21, 2026

@AdaWorldAPI
Owner

Summary

Nine commits across three areas — recovering JIT-parity-zone slice helpers, building the per-CPU agnostic-surface resolution map, and landing two of its Phase-1 wirings.

Recovery + agnostic surface (commits 1-4)

  • 0a46e7f restores simd_ops::array_windows / array_windows_checked and adds slice-level add_mul_f32 / add_mul_f64, the foundation primitives the BLAS-graph GEMM path relies on (an earlier sprint had removed them).
  • 86b0f3f ships simd_int_ops::gemm_u8_i8 as an agnostic surface with a compile-time avx512vnni → scalar dispatch chain.
  • caf0471 adds the AVX-VNNI ymm arm (Arrow Lake, Meteor Lake U), making the chain avx512vnni → avxvnni → scalar, and bumps .cargo/config-avx512.toml from bare x86-64-v4 (which lacks VNNI) to sapphirerapids.
  • 0134916 adds an #[ignore]'d criterion-style bench harness bench_gemm_u8_i8_vs_scalar so the three arms have an apples-to-apples timing surface.

Per-CPU resolution matrix (commits 5-6)

  • b34d430 introduces .claude/knowledge/agnostic-surface-cpu-matrix.md, a 14-CPU × ~80-symbol matrix mapping every public type / function in crate::simd::* + crate::hpc::* to its actual lowering per profile. It cross-references the W1a consumer contract, the TD-SIMD audit, and the existing dispatch-architecture doc.
  • 058ef61 expands the matrix to full integer-lane coverage (I8/U8/I16/U16/I32/U32/I64/U64 at 256/512-bit), adds the cross-cutting infrastructure status table (which configs exist, what features are missing), and grows the integration plan with Phases 0-6 plus an explicit out-of-scope list.

Phase 1 wirings (commits 7-9)

  • b5bca4e — MX-T1a: simd_int_ops::add_i8 / sub_i8 / add_i16 lifted from scalar to polyfilled lanes (I8x64 / I16x32 on x86, I8x16 / I16x8 on aarch64). Uses the same min_i8-style cfg-cascade. Existing parity tests cover tail lengths 0/1/32/63/64/65/127/128/129/256.
  • bede3d2 — design rule + matrix update: flips the three MX-T1a cells in the matrix from scalar to per-CPU SIMD, and codifies the asm-byte encoding rule for AMX/F16/FP16 paths (Phases 1b/3b/3c/4d). AMX intrinsics are nightly-only (issue #126622) and avx512fp16/NEON-fp16 have stabilization churn on Rust 1.95 stable — raw .byte/.inst encoding (matching simd_amx.rs:16-19) is the documented stable-toolchain path. Does NOT apply to instructions with stable intrinsics on 1.95 (_mm512_dpbusd_epi32, _mm256_cvtph_ps, _mm512_cvtne2ps2bf16).
  • fe334de — TD-T1: matmul_bf16_to_f32's AMX arm was placebo (both if amx_available() and else called the scalar reference). This wires the 16/16/32-aligned path through bf16_tile_gemm::bf16_tile_gemm_16x16 which emits TDPBF16PS via the asm-byte path in simd_amx.rs::tile_dpbf16ps — 8 192 BF16×BF16 multiplies + 256 f32 accumulates per instruction on real SPR silicon. Misaligned shapes fall back to the validated scalar bf16_gemm_f32. Non-AMX hosts always take the scalar fallback.

Test plan

  • Default v3 (x86-64-v3 AVX2) — cargo test --lib: 2087 passed, 0 failed, 29 ignored.
  • cargo clippy --lib -- -D warnings clean.
  • cargo --config .cargo/config-avx512.toml=cascadelake (= v4 + VNNI) — 15 simd_int_ops tests pass on the AVX-512BW direct path (_mm512_add_epi8, _mm512_add_epi16).
  • cargo --config .cargo/config-avx512.toml (sapphirerapids) — runner CPU on this dev host shows only avx512_vnni in /proc/cpuinfo (no AMX/BF16/FP16), so SPR-targeted binaries SIGILL on unrelated tests. Needs real SPR silicon for verification.
  • aarch64 build path unchanged — the integer-lift cfg-cascades follow the same shape as the existing min_i8 / max_i8 cross-arch dispatch.
  • AMX path on real SPR: the kernel itself is the same one that PR #104 ("Add BF16 tile GEMM with AMX/AVX-512 dispatch") shipped and tested; only the routing is new in this PR.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u


Generated by Claude Code

claude added 12 commits May 20, 2026 22:04
Introduces `simd_int_ops::gemm_u8_i8`, the first consumer-facing
surface that bakes the SIMD dispatch decision in at build time. The
consumer never branches on CPU capability; the active arm of the
`#[cfg(target_feature)]` chain is the only one that compiles in, and
the compiler emits a direct call to the chosen kernel.

Arms wired in this PR:

  target_feature = "avx512vnni"  → int8_gemm_vnni_avx512 kernel
  (default / anything else)      → hpc::quantized::int8_gemm_i32 scalar

Future arms (amx-int8, avxvnniint8, neon+dotprod) land additively in
follow-up PRs without disturbing existing callers.

Mechanical changes:

* `int8_gemm_vnni_avx512` becomes `pub(crate) unsafe fn` so the
  agnostic surface can target it directly under the cfg gate,
  bypassing the per-call `if caps.has_avx512_vnni()` branch in
  `int8_gemm_vnni` (kept as-is for now; cleanup is a follow-up).

* `hpc::quantized::int8_gemm_i32` (scalar) is untouched — it remains
  the universal reference path and the `target_feature`-less fallback.

Parity tests (4×4 identity, 3×5×8 rectangular, 17×17 tail, extreme
u8/i8 values) pass under both the default v3 scalar arm and the
`-Ctarget-feature=+avx512vnni` AVX-512 arm. Full lib suite green
(2075 passed). `cargo clippy -- -D warnings` clean.

Architecture refs:
  .claude/knowledge/td-simd-integration-plan.md § "SimdProfile +
  static dispatch tables"

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Two complementary changes — agnostic INT8 GEMM no longer falls to
scalar on AVX2-with-VNNI silicon, and the AVX-512 build config picks
up the full modern Intel HPC feature set instead of bare v4.

1. AVX-VNNI ymm kernel

   New `hpc::vnni_gemm::int8_gemm_avxvnni_ymm` — VEX-encoded
   `VPDPBUSD` over 8-wide i32 accumulators. Targets the AVX2 +
   AVX-VNNI silicon tier (Alder Lake, Arrow Lake, Zen 4 in ymm
   mode) which has hardware INT8 dot-product but no AVX-512. Same
   B-pre-pack layout as the AVX-512 kernel; column tail (`n % 8`)
   runs scalar (no masked ymm VPDPBUSD on the VEX encoding).

2. gemm_u8_i8 dispatch chain — avxvnni arm

   `simd_int_ops::gemm_u8_i8` gains a third `#[cfg]` arm between
   `avx512vnni` and the scalar fallback:

      avx512vnni  →  int8_gemm_vnni_avx512  (zmm, 16 lanes)
      avxvnni     →  int8_gemm_avxvnni_ymm  (ymm,  8 lanes)
      (none)      →  hpc::quantized::int8_gemm_i32  (scalar)

   Arm precedence is widest-vector-first via `#[cfg]` ordering
   (Sapphire Rapids has both `avx512vnni` and `avxvnni`; the zmm
   arm wins). All arms are compile-time selected — no runtime
   caps branch on any hot path.

3. config-avx512.toml: x86-64-v4 → sapphirerapids

   The "AVX-512" build config now selects the canonical modern
   Intel HPC target (SPR), enabling VNNI + BF16 + FP16 + VBMI +
   AMX-TILE + AMX-INT8 + AMX-BF16 in addition to the v4 baseline.
   Effect on `gemm_u8_i8`: the avx512vnni arm now lights up under
   this config (pure x86-64-v4 lacks VNNI). Once an `amx-int8`
   arm lands, it will preempt automatically on the same config.

   GitHub CI runs the default `.cargo/config.toml` (still
   `-Ctarget-cpu=x86-64-v3`), which is unaffected — only opt-in
   `--config .cargo/config-avx512.toml` builds see the change.
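
The widest-vector-first `#[cfg]` chain from item 2 can be sketched as follows. This is a minimal illustration, not the crate's code: `scalar_gemm_u8_i8`, `kernel`, and `ACTIVE_ARM` are hypothetical names, the flat-slice signature is an assumption, and every arm delegates to a scalar stand-in so the sketch runs on any host. Only the compile-time selection shape is the point.

```rust
/// Scalar reference: C[m×n] = A[m×k] (u8) · B[k×n] (i8), accumulated in i32.
/// All matrices are row-major flat slices (an assumption of this sketch).
fn scalar_gemm_u8_i8(a: &[u8], b: &[i8], c: &mut [i32], m: usize, n: usize, k: usize) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0i32;
            for p in 0..k {
                acc += a[i * k + p] as i32 * b[p * n + j] as i32;
            }
            c[i * n + j] = acc;
        }
    }
}

// Exactly one (ACTIVE_ARM, kernel) pair survives compilation.
// Widest vector first: the zmm arm wins when both VNNI features are enabled.

#[cfg(target_feature = "avx512vnni")]
pub const ACTIVE_ARM: &str = "avx512vnni";
#[cfg(target_feature = "avx512vnni")]
fn kernel(a: &[u8], b: &[i8], c: &mut [i32], m: usize, n: usize, k: usize) {
    // A real build would call the zmm VPDPBUSD kernel here.
    scalar_gemm_u8_i8(a, b, c, m, n, k)
}

#[cfg(all(target_feature = "avxvnni", not(target_feature = "avx512vnni")))]
pub const ACTIVE_ARM: &str = "avxvnni";
#[cfg(all(target_feature = "avxvnni", not(target_feature = "avx512vnni")))]
fn kernel(a: &[u8], b: &[i8], c: &mut [i32], m: usize, n: usize, k: usize) {
    // A real build would call the ymm VPDPBUSD kernel here.
    scalar_gemm_u8_i8(a, b, c, m, n, k)
}

#[cfg(not(any(target_feature = "avx512vnni", target_feature = "avxvnni")))]
pub const ACTIVE_ARM: &str = "scalar";
#[cfg(not(any(target_feature = "avx512vnni", target_feature = "avxvnni")))]
fn kernel(a: &[u8], b: &[i8], c: &mut [i32], m: usize, n: usize, k: usize) {
    scalar_gemm_u8_i8(a, b, c, m, n, k)
}

/// Consumer-facing surface: no runtime capability branch, just a direct call
/// to whichever `kernel` the build configuration compiled in.
pub fn gemm_u8_i8(a: &[u8], b: &[i8], c: &mut [i32], m: usize, n: usize, k: usize) {
    kernel(a, b, c, m, n, k)
}
```

Because the `cfg` predicates are mutually exclusive, adding a future arm means adding one more gated pair and extending the `not(...)` clauses of the arms below it.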

Parity tests (4×4 identity, 3×5×8 rectangular, 17×17 tail with the
ymm-tail scalar path, extreme u8/i8 values) pass under three configs:

   default v3                       → scalar arm
   -Ctarget-cpu=alderlake           → avxvnni ymm arm
   --config config-avx512.toml      → avx512vnni zmm arm

Full lib suite green on default v3 (2075 passed).
`cargo clippy -- -D warnings` clean on both default and SPR configs.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
#[ignore]'d sanity-check test that times the agnostic gemm_u8_i8
surface against the scalar reference at 64³/128³/256³/512³. Run with
--ignored --nocapture to compare arms under different target cfgs.

Measured on Sapphire Rapids (4-core Xeon @ 2.1GHz), release build:

  scalar arm (default v3) — noise (~1.0x, both paths are scalar):
    64³   speedup=1.07x
    128³  speedup=1.16x
    256³  speedup=1.06x
    512³  speedup=0.97x

  avxvnni ymm arm (-Ctarget-cpu=alderlake):
    64³   simd= 18.6µs  scalar=109.5µs  speedup=5.88x
    128³  simd=151.7µs  scalar=590.4µs  speedup=3.89x
    256³  simd=  1.1ms  scalar=  2.7ms  speedup=2.40x
    512³  simd=  9.1ms  scalar= 16.2ms  speedup=1.77x

Confirms the AVX-VNNI ymm kernel is genuinely faster than the scalar
reference across all measured sizes — addresses the concern that an
AVX2 path could end up slower than scalar GEMM if the algorithmic
shape (B pre-pack into VNNI layout, tile sizing, MR/NR loop nesting)
were wrong.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Recovers two foundation primitives that prior sessions removed (or
never added in the first place) — both are explicitly cited by an
earlier session as the reason the BLAS-graph GEMM hand-rolled kernels
reach within a few percent of a Cranelift-JIT inner loop. Without
them the JIT-native path becomes the only way to hit that throughput.

1. array_windows::<T, N> — overlapping const-size window iterator

   Stable-Rust equivalent of nightly `std::slice::array_windows::<N>()`.
   Sits next to the existing `array_chunks::<N>` (non-overlapping) in
   `simd_ops.rs`, completing the pair. Together they let consumer
   kernels iterate `B`-rows as overlapping K-windows and `A`-columns as
   non-overlapping M-chunks in a single source, with the polyfilled
   F32x16 / F64x8 types absorbing the per-arch lane count.

   Implementation uses index-based iteration (`(0..count).map(|i|
   &data[i..i+N])`) to avoid `slice::windows(0)`'s panic in the N==0
   edge case — the unchecked variant yields an empty iterator, the
   checked variant returns `Err(())`. Behaviour mirrors the
   already-shipped `array_chunks` / `array_chunks_checked` pair.

2. add_mul_f32 / add_mul_f64 — slice-level FMA into accumulator

   `acc[i] += a[i] * b[i]` via the polyfilled `F32x16::mul_add` /
   `F64x8::mul_add` already on the SIMD types (16-wide AVX-512 / 8-wide
   AVX2-FMA / 4-wide NEON / scalar `f32::mul_add`). Single rounding
   step, semantically identical to BLAS-1 `axpy` with a vector
   multiplier and the dominant inner-loop shape in the bgz17 GEMM
   path. Operates on `min(acc.len(), a.len(), b.len())` lanes.

3. DO NOT REMOVE notice

   `simd_ops.rs` now opens with a "Foundation primitives — do not
   remove" callout that names `array_chunks`, `array_chunks_checked`,
   `array_windows`, `array_windows_checked`, `add_mul_f32`, and
   `add_mul_f64`, explains why they exist (~JIT-parity for BLAS-graph
   GEMM), and warns that prior sessions removed them under the wrong
   impression they were unused cruft. `src/simd.rs`' re-export site
   carries a matching pointer back to that notice.
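
The two helpers described in items 1 and 2 can be sketched in portable Rust as below. This is an illustration under assumptions, not the shipped code: the signatures are guessed from the description, the checked variant is assumed to reject only `N == 0`, and the FMA loop uses scalar `f32::mul_add` where the crate uses its polyfilled `F32x16::mul_add`.

```rust
/// Overlapping const-size windows over `data`.
/// Yields nothing when N == 0 or when `data` is shorter than N.
pub fn array_windows<T, const N: usize>(data: &[T]) -> impl Iterator<Item = &[T; N]> + '_ {
    // Index-based iteration sidesteps `slice::windows(0)`, which panics.
    let count = if N == 0 || data.len() < N {
        0
    } else {
        data.len() - N + 1
    };
    (0..count).map(move |i| <&[T; N]>::try_from(&data[i..i + N]).unwrap())
}

/// Checked variant: assumed here to return `Err(())` only for N == 0.
pub fn array_windows_checked<T, const N: usize>(
    data: &[T],
) -> Result<impl Iterator<Item = &[T; N]> + '_, ()> {
    if N == 0 {
        return Err(());
    }
    Ok(array_windows::<T, N>(data))
}

/// acc[i] += a[i] * b[i] with a single rounding step per lane,
/// over the shortest of the three slices.
pub fn add_mul_f32(acc: &mut [f32], a: &[f32], b: &[f32]) {
    let n = acc.len().min(a.len()).min(b.len());
    for i in 0..n {
        acc[i] = a[i].mul_add(b[i], acc[i]);
    }
}
```

Lanes of `acc` beyond the shortest input are left untouched, which matches the `min(acc.len(), a.len(), b.len())` rule stated above.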

Both new helpers are re-exported flat from `crate::simd::*` per the W1a
consumer contract — consumers reach for `ndarray::simd::{array_windows,
add_mul_f32}`, never the implementation module directly.

Verification:
  - 29 simd_ops unit tests pass (incl. 7 new array_windows + 4 new
    add_mul tests, covering tail handling, mismatched lengths, N==0,
    short buffers, exact-N, empty buffers).
  - 7 simd_ops doctests pass (the executable examples in the rustdoc).
  - Full lib suite green on default v3: 2087 passed, 29 ignored.
  - `cargo clippy -- -D warnings` clean.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
…plan

Cross-tab of every public agnostic surface (`crate::simd::*`,
`simd_int_ops::*`, `simd_half::*`, `simd_soa::*`) against the 14 CPU
profiles from `td-simd-cpu-dispatch-matrix.md`. For each cell:
which kernel actually runs there, with markers for
 ✅ live, ⏳ planned, 🟡 polyfill-transparent, ⚠️ scalar-debt, — n/a.

Additionally surfaces a separate axis — **shape ingress** — that
tracks the "ArrayView loses array shape on entry point" technical
debt the user flagged: every function tagged 🔪 `&[T] + (m,n,k)`
takes a flat slice and forces `.as_slice().unwrap()` at the call
site, vs the 📐 `ArrayView` shape that `hpc::amx_matmul::matmul_*`
already uses correctly.

Findings the matrix surfaces:

  * `gemm_u8_i8` is the only currently-debt surface — `&[u8] +
    (m,n,k)`. Phase 0 of the integration plan lifts it to
    `(ArrayView2<u8>, ArrayView2<i8>, ArrayViewMut2<i32>) -> Result`.
  * Integer-elementwise ops (`add_i8`, `dot_i8`, `add_i16`, etc.)
    are uniformly scalar on every CPU despite `I8x64` / `I16x32`
    polyfilled lanes existing — predate the lane widening, never
    re-wired.
  * F16x16 has ZERO hardware backing on any profile (TD-SIMD-8);
    BF16x16 is hardware-backed only on `avx512bf16` profiles via
    `__m256bh`. NEON BF16 / FP16 (A76+) entirely scalar.
  * On aarch64, all integer polyfilled lanes
    (I8x32/I16x*/I32x*/I64x*/U*) are scalar — TD-T21 — even though
    NEON has 128-bit `intNx_t` quartets that would back them.
  * AMX exists on SPR/GNR but no agnostic surface routes to it
    (the kernel exists at `bf16_tile_gemm.rs`; only consumers in
    `amx_matmul.rs` reach it). Phase 1b wires it into `gemm_u8_i8`.

Integration plan (J) phases 0-5, each one PR-sized:

  0 - gemm_u8_i8 ArrayView lift (shape-debt fix)
  1 - wire existing hardware paths (NEON SDOT, AMX-INT8, AVX-VNNI-INT8)
  2 - lift integer-elementwise surfaces to polyfilled lanes
  3 - TD-SIMD-8: BF16x16 + F16x16 hardware backing on CPL/SPR/Zn4 + A76
  4 - remaining hardware fills (aarch64 ints, simd_ln_f32 Remez, RNE polyfill, AMX-FP16 detection)
  5 - rolling: every new surface ships with ArrayView ingress from day one

Verification checklist at the bottom defines the promotion gate
(planned -> live): kernel exists, cfg-chain routes, parity-test green,
timing-harness beats scalar, doc-comment updated, this matrix cell
flipped in the same PR.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Expanded the agnostic-surface matrix from the initial pass to cover
every public type, function, and infrastructure item exhaustively:

  * § A (polyfilled types backing) now includes the FULL integer-lane
    table — I8x{32,64}, U8x{32,64}, I16x{16,32}, U16x{16,32},
    I32x{8,16}, U32x{8,16}, I64x{4,8}, U64x{4,8} — per CPU profile,
    with TD-T22 ⏳ markers on the still-unverified 256-bit polyfills.
  * § A also adds a "Critical type-method per-CPU lowerings" subsection
    that names the exact intrinsic each hot method emits per profile
    (vfmadd231ps zmm vs 2×vfmadd231ps ymm vs 4×vfmaq_f32 vs scalar
    f32::mul_add).
  * § B simplified — every simd_ops surface is 🟡 polyfill-pass; the
    work is in the polyfill layer above.
  * § C (simd_int_ops) sharpened — every scalar 🚨 cell is annotated
    with the polyfilled lane (I8x64, I16x32) it should reach for.
  * § D made explicit about BF16 native via __m256bh storage vs the
    portable [u16; 16] scalar polyfill switch.
  * § E (batch converters + transcendentals) now flags
    f32_to_bf16_batch_rne as scalar on every non-AVX-512 profile,
    despite the kernel existing — Phase 1 MX-T2.
  * § H added — currently-missing surfaces inventory (gemm_i8, gemm_u8,
    dot4_u8_i8 polyfill primitive, axpy_f32, dot_f32, nrm2_f32,
    asum_f32, gemv_f32, dot_i32, SimdProfile enum).
  * § I added — cross-cutting infrastructure status (cargo configs
    present per profile, missing cpu-* features, missing
    runtime-dispatch feature, missing SimdProfile enum, bench harness
    coverage).
  * § J integration plan extended:
    - Phase 0 records what landed in this session.
    - Phase 1 merges audit TD-T* tasks with new MX-T* items
      (integer slice ops lift, bf16/f16 cast fast paths).
    - Phase 4 grew MX-F1..MX-F16 with priority rebalanced based on
      "hot" markers for AI/ML BF16/F16 paths.
    - NEW Phase 5 — BLAS-graph kernel polish + bench-regression gates
      (the JIT-parity zone the prior session reached).
    - NEW Phase 6 — explicit out-of-scope list (GPU, JIT revival,
      wasm32 SIMD128, multi-core).
  * § K added — how to read the doc; § L provenance trail
    (no grep/tail per workspace rule, every entry traceable to a
    full-file Read).

Total: ~860 lines of matrix + plan, covering 14 CPU profiles ×
~80 polyfilled types & surface symbols.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
…lled lanes

Phase 1 of the per-CPU integration plan: the integer-elementwise slice
ops in simd_int_ops were uniformly scalar on every CPU despite the
polyfilled I8x64 / I16x32 lanes existing and being SIMD-backed on
every backend. This routes the three ops through the polyfill.

Per-backend dispatch follows the existing min_i8 / max_i8 template:

  x86_64    →  I8x64 / I16x32 (AVX-512BW _mm512_add_epi8 zmm /
               AVX2 polyfill of I8x64 as 2×__m256i on v3 builds)
  aarch64   →  I8x16 / I16x8  (NEON vaddq_s8 / vaddq_s16)
  other     →  scalar wrapping loop (unchanged)

Wrapping arithmetic is preserved on every path: _mm512_add_epi8 and
vaddq_s8 are bit-for-bit equivalent to i8::wrapping_add, so the
existing tests (add_i8_matches_scalar_for_tail_lengths covering
lengths 0/1/32/63/64/65/127/128/129/256) verify correctness across
the cfg chain. No new tests needed — the parity-against-scalar
sweep already exercised every boundary.
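
The lane-plus-tail shape that the parity sweep exercises can be sketched as follows. This is a portable model under assumptions: `add_lanes` is a hypothetical stand-in for a 64-lane `I8x64` add (a plain loop here, `_mm512_add_epi8` in hardware), and the three-slice signature of `add_i8` is a guess.

```rust
const LANES: usize = 64; // lane count of the I8x64 type named in the text

/// Stand-in for one 64-lane wrapping add.
fn add_lanes(a: &[i8; LANES], b: &[i8; LANES]) -> [i8; LANES] {
    let mut out = [0i8; LANES];
    for i in 0..LANES {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}

/// out[i] = a[i] + b[i] (wrapping), full 64-lane blocks first, scalar tail after.
pub fn add_i8(a: &[i8], b: &[i8], out: &mut [i8]) {
    let n = a.len().min(b.len()).min(out.len());
    let full = n - n % LANES; // largest multiple of LANES that fits

    let blocks = a[..full]
        .chunks_exact(LANES)
        .zip(b[..full].chunks_exact(LANES))
        .zip(out[..full].chunks_exact_mut(LANES));
    for ((ca, cb), co) in blocks {
        let sum = add_lanes(ca.try_into().unwrap(), cb.try_into().unwrap());
        co.copy_from_slice(&sum);
    }

    // Tail: fewer than LANES elements remain.
    for i in full..n {
        out[i] = a[i].wrapping_add(b[i]);
    }
}
```

Lengths such as 63, 64, 65 and 129 hit the empty-block, exact-block, and block-plus-tail cases, which is why the existing tail-length sweep covers every boundary of this structure.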

Verification:
  * default v3 build (uses AVX2 polyfill of I8x64): 15 simd_int_ops
    tests pass; 2087 lib tests pass; clippy -D warnings clean.
  * cascadelake config (native _mm512_add_epi8 / _mm512_add_epi16):
    15 simd_int_ops tests pass.
  * sapphirerapids config: NOT verified — the dev-runtime CPU on
    this host advertises only avx512_vnni in /proc/cpuinfo (no AMX
    / BF16 / FP16), so SPR-targeted binaries SIGILL on UNRELATED
    pre-existing tests like min_max_i8_boundary_values. The SPR
    config's correctness needs verification on real SPR silicon.

Companion matrix entries flipped:

  C. simd_int_ops → row `add_i8`     :  ⚠️ scalar 🚨 → ✅ I8x64/I8x16
                  row `sub_i8`     :  ⚠️ scalar 🚨 → ✅ I8x64/I8x16
                  row `add_i16`    :  ⚠️ scalar 🚨 → ✅ I16x32/I16x8

Remaining Phase 1 work in simd_int_ops:

  MX-T1b — `dot_i8` / `dot_i16` require a widening-multiply-add
  polyfill primitive (i8×i8 → i32 via VPMADDUBSW + horizontal add
  on x86, vmlal_s16 + vaddv_s32 on NEON). The widening-multiply
  primitive doesn't yet exist on the polyfilled types; promoting
  these without it would force per-arch intrinsics into
  simd_int_ops, violating the agnostic-surface principle. Defer
  to the polyfill-primitive PR.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Two updates to the agnostic-surface CPU matrix following the
MX-T1a landing (b5bca4e) and the user directive on instruction
encoding strategy:

1. Matrix § C cells flipped from ⚠️ scalar → ✅ for
   add_i8 / sub_i8 / add_i16 across every CPU column. The path
   per backend is documented inline (zmm _mm512_add_epi8 on
   AVX-512-BW, 2× ymm _mm256_add_epi8 on AVX2 via I8x64 polyfill,
   vaddq_s8 on NEON, scalar wrapping_add elsewhere).

2. § J Phase 0 grows an entry for MX-T1a, and gains a NEW
   "Design rule for AMX / F16 / FP16 paths" subsection that
   codifies the asm-byte encoding requirement for Phases 1b
   (AMX-INT8 arm of gemm_u8_i8), 3b (AVX-512-FP16 native
   F16x16 ops), 3c (NEON BF16+FP16), and 4d (AMX-FP16 on GNR).
   The rule:

     * AMX intrinsics are nightly-only on Rust 1.95 (issue
       #126622) → use asm!(".byte 0xc4, 0xe2, 0x73, 0x5e, 0xc1")
       style per the existing simd_amx.rs pattern.
     * AVX-512-FP16 intrinsics have stabilization churn → same
       asm-byte encoding sidesteps Rust release dance.
     * NEON FP16 (FMLA v.8h, BFDOT, BFMMLA, USDOT) — historically
       nightly-gated, use .inst 0x0e40cc20-style encoding for
       AArch64 (same idea, different assembler directive).
     * Each newly-encoded instruction lands with an objdump -d
       verification check in the doc-comment ("verified working"
       — same convention as simd_amx.rs:16-19).
     * Does NOT apply to instructions WITH stable intrinsics on
       Rust 1.95: _mm512_dpbusd_epi32 (avx512vnni), F16C
       _mm256_cvtph_ps, _mm512_cvtne2ps2bf16 (avx512bf16), etc.
       Those continue using direct intrinsics per existing
       simd_avx512.rs patterns.

The rule prevents future regression where a session reaches for
nightly avx512fp16 intrinsics, fails to compile on the project's
stable toolchain, and then drops back to scalar polyfill — the
same shape of regression that removed array_windows/add_mul in
the prior session and was recovered in 0a46e7f.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
… kernel

Per the PR #180 dispatch table for BF16 GEMM: SapphireRapids and
GraniteRapids should route through `tile_dpbf16ps` (AMX TDPBF16PS,
8 192 BF16×BF16 multiplies per instruction, single-rounded
into a 256-element f32 tile accumulator). Until this commit, the AMX branch of
`matmul_bf16_to_f32` was a placebo — both `if amx_available()` and
`else` called the scalar `bf16_gemm_f32`. The actual kernel
(`bf16_tile_gemm::bf16_tile_gemm_16x16`, shipped by PR #104) was
unreached by the consumer entry point.

This wires it. When AMX is OS-enabled AND the matmul shape is
16/16/32-aligned in (M, N, K), the inner loop tiles 16×16 blocks
through `bf16_tile_gemm_16x16` — that kernel emits TDPBF16PS via the
asm-byte path in `simd_amx.rs::tile_dpbf16ps` (the stable-Rust 1.95
encoding documented at simd_amx.rs:16-19; AMX intrinsics are
nightly-only per issue #126622, hence asm-byte). Aligned tiles get
the full hardware throughput; misaligned shapes (any of M/N/K not at
the alignment boundary) fall back to the validated scalar
`bf16_gemm_f32` reference. Non-AMX hosts always take the scalar
fallback.

The B sub-block extraction copies a K × 16 packed scratch per
j_tile column band (B is K × N row-major; the kernel wants K × 16
contiguous). Allocation cost is amortized across M/16 i-tile
iterations under each j_tile. Phase-4 work will land a fully
mixed-tile path (AMX 16×16 core + per-axis scalar tails on the
same matmul) for arbitrary shapes.
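
The column-band extraction described above can be sketched like this. It is an assumption-laden illustration: BF16 values are carried as raw `u16` bits, `extract_b_band` and `for_each_tile` are hypothetical names, and the sketch reuses one scratch buffer for all bands rather than reproducing the crate's allocation pattern.

```rust
pub const TILE: usize = 16;

/// Copy columns j0..j0+16 of a row-major K×N matrix into a contiguous
/// K×16 scratch buffer, the layout the tile kernel is said to expect.
pub fn extract_b_band(b: &[u16], k: usize, n: usize, j0: usize, scratch: &mut [u16]) {
    assert!(j0 + TILE <= n, "band must lie inside the matrix");
    assert!(b.len() >= k * n && scratch.len() >= k * TILE);
    for row in 0..k {
        let src = &b[row * n + j0..row * n + j0 + TILE];
        scratch[row * TILE..(row + 1) * TILE].copy_from_slice(src);
    }
}

/// Loop nest: one extraction per j_tile, reused by every i_tile under it.
/// `tile` receives (i_tile, j_tile, packed K×16 band of B).
pub fn for_each_tile(
    b: &[u16],
    m: usize,
    n: usize,
    k: usize,
    mut tile: impl FnMut(usize, usize, &[u16]),
) {
    assert!(m % TILE == 0 && n % TILE == 0, "sketch handles aligned shapes only");
    let mut scratch = vec![0u16; k * TILE];
    for j_tile in (0..n).step_by(TILE) {
        extract_b_band(b, k, n, j_tile, &mut scratch);
        for i_tile in (0..m).step_by(TILE) {
            tile(i_tile, j_tile, &scratch);
        }
    }
}
```

With this nesting the copy cost of each band is paid once and shared by M/16 tile calls, which is the amortization the paragraph refers to.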

Verification:
  * Default v3 build: 11 amx_matmul tests pass (this host lacks
    AMX per /proc/cpuinfo, so the path falls through to scalar;
    behaviour identical to pre-commit on this runner).
  * Full lib sweep: 2087 tests pass; clippy -D warnings clean.
  * Real SPR silicon: the gating is correctness-by-construction —
    the new branch only fires when amx_available() == true AND the
    alignment predicates hold; the inner kernel is the same one
    PR #104 shipped and tested.

Background — the directive chain from this session:

  user: "Sapphire Rapids should have BF16 operations"
  user: "TDPBF16PS / VDPBF16PS is scalar or SIMD?"  → both are SIMD,
        TDPBF16PS does 8192 BF16×BF16 multiplies + 256 f32 accums
        per instruction (16×16 outer-product matmul tile), VDPBF16PS
        does 32 BF16×BF16 multiplies + 16 f32 accums per zmm
        instruction. Neither is scalar. The "no scalar lane-by-lane
        f32 round-trip" rule the user gave is what this PR delivers:
        the AMX tile op is hardware-fused, single-rounded into f32
        accumulator, BF16 mantissa bits preserved bit-exactly per
        IEEE BF16 spec at the multiply step.

Closes TD-T1 from
`.claude/knowledge/agnostic-surface-cpu-matrix.md` § J Phase 1.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
… kernel

Follow-up to TD-T1 (fe334de). `matmul_f32`'s AMX branch was the same
shape of placebo as `matmul_bf16_to_f32`'s pre-TD-T1: it down-cast f32
→ BF16, then called the scalar `bf16_gemm_f32` reference — never
reaching `TDPBF16PS` even on real AMX silicon.

Factored the BF16 AMX-tile dispatch logic out of `matmul_bf16_to_f32`
into a private `bf16_gemm_with_amx(a, b, c, m, n, k)` helper. Both
public entry points now route through it:

  matmul_bf16_to_f32  →  bf16_gemm_with_amx  (direct BF16 inputs)
  matmul_f32          →  RNE down-cast → bf16_gemm_with_amx
                                                (f32 in, BF16 compute,
                                                 f32 accumulator out)

The helper's behaviour is unchanged from what TD-T1 shipped: 16/16/32-
aligned shapes hit `bf16_tile_gemm_16x16` (TDPBF16PS via asm-byte,
8 192 BF16×BF16 multiplies + 256 f32 accumulates per instruction);
mis-aligned shapes or non-AMX hosts fall back to scalar
`bf16_gemm_f32`. Single source of truth — future Phase-4 mixed-tile-
plus-tail dispatch only needs to land in one place.

Verification:
  * 11 amx_matmul tests pass (default v3, no AMX on this host →
    scalar fallback exercised; behaviour identical to pre-commit).
  * cargo clippy --lib -D warnings clean.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Five locations across two files where my recent commits had lines
slightly over the rustfmt width:
  - simd_int_ops.rs tests: 3× iterator chain reflows (.collect()
    onto its own line)
  - simd_ops.rs:505 — `array_windows` count computation broken to
    if/else block form
  - simd_ops.rs:679 / :686 — ref_add_mul_{f32,f64} test helpers
    reflow .iter().zip(...).map(...).collect() onto multi-line

Pure whitespace / formatting; no semantic changes. 15 simd_int_ops
tests + 29 simd_ops tests still pass on default v3.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Extends the BF16 GEMM dispatch chain from PR #180's per-tier table.
Until this commit, the dispatcher was two-tier: AMX TDPBF16PS (SPR,
GNR) → scalar bf16_gemm_f32 (everything else, including Cooper Lake
and Zen 4+, which have avx512bf16 hardware but no AMX; Cascade Lake
has VNNI but no avx512bf16).

Adds a middle tier using _mm512_dpbf16_ps (VDPBF16PS): one
instruction does 32 BF16×BF16 multiplies + 16 f32 accumulates,
single-rounded. The intrinsic is stable on Rust 1.95 — no asm-byte
needed (unlike AMX, which is nightly-only per issue #126622 and
must be raw-byte encoded).

Three-tier dispatch in bf16_gemm_dispatch (renamed from
bf16_gemm_with_amx now that AMX isn't the only hw path):

  1. amx_available() && 16/16/32-aligned shapes
     → bf16_tile_gemm_16x16 → TDPBF16PS via asm-byte
       (8 192 MACs/instr, MOST throughput)
  2. is_x86_feature_detected!("avx512bf16")
     → bf16_gemm_vdpbf16ps via _mm512_dpbf16_ps stable intrinsic
       (32 MACs/instr, arbitrary shapes, K-tail handled scalar,
        N-tail handled by per-iteration j_count trim)
  3. scalar bf16_gemm_f32 reference
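
The tier selection above reduces to a small pure function. In this sketch the capability flags are passed in as booleans so the logic can be exercised on any host; in the crate they would come from `amx_available()` and `is_x86_feature_detected!("avx512bf16")`. `Bf16Tier` and `select_tier` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Bf16Tier {
    /// TDPBF16PS through the 16×16 tile kernel.
    AmxTile,
    /// VDPBF16PS through the zmm kernel, arbitrary shapes.
    Avx512Bf16,
    /// Scalar reference.
    Scalar,
}

/// Pick the widest usable tier for an (m, n, k) BF16 GEMM.
pub fn select_tier(amx: bool, avx512bf16: bool, m: usize, n: usize, k: usize) -> Bf16Tier {
    // The AMX tile path needs 16/16/32 alignment in (M, N, K).
    let aligned = m % 16 == 0 && n % 16 == 0 && k % 32 == 0;
    if amx && aligned {
        Bf16Tier::AmxTile
    } else if avx512bf16 {
        Bf16Tier::Avx512Bf16
    } else {
        Bf16Tier::Scalar
    }
}
```

Note that an AMX host with a misaligned shape drops to the VDPBF16PS tier only if it also reports avx512bf16; otherwise it lands on the scalar reference.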

Kernel pattern (slow-but-correct first cut):
  * One VDPBF16PS produces 16 f32 accumulator lanes — mapped to 16
    columns of one output row, processing 2 K-elements per call.
  * B columns for the current j-block of 16 are pre-packed into a
    pair-interleaved u32 layout once per j-block (B[2k_pair, j+jj]
    in the low 16 bits, B[2k_pair+1, j+jj] in the high 16 bits),
    then reused across all m i-iterations to amortize the column-
    gather cost.
  * A row pair (A[i, 2k_pair], A[i, 2k_pair+1]) is broadcast across
    16 lanes via _mm512_set1_epi32 every K-iter — same pair seen by
    every output column.
  * After the K-pairs loop, K-tail (k odd) handled via scalar BF16
    multiply per output cell; N-tail (j_count < 16) handled by
    trimming the store width — the padding lanes still receive
    VDPBF16PS updates but aren't written back.
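
The pair-interleaved B layout and the per-lane arithmetic can be modelled in scalar Rust as below. The names `pack_b_pairs` and `dp_lane` are hypothetical, BF16 values are raw `u16` bits, and `dp_lane` is only an arithmetic model: it does not reproduce the exact rounding or denormal behaviour of the hardware instruction.

```rust
/// Pack up to 16 columns starting at `j0` of a row-major K×N BF16 matrix.
/// Word [kp * 16 + jj] holds B[2kp, j0+jj] in its low 16 bits and
/// B[2kp+1, j0+jj] in its high 16 bits. Columns past `n` stay zero.
pub fn pack_b_pairs(b: &[u16], k: usize, n: usize, j0: usize) -> Vec<u32> {
    let k_pairs = k / 2; // an odd trailing K element is left to the scalar K-tail
    let j_count = n.saturating_sub(j0).min(16);
    let mut packed = vec![0u32; k_pairs * 16];
    for kp in 0..k_pairs {
        for jj in 0..j_count {
            let lo = b[(2 * kp) * n + j0 + jj] as u32;
            let hi = b[(2 * kp + 1) * n + j0 + jj] as u32;
            packed[kp * 16 + jj] = lo | (hi << 16);
        }
    }
    packed
}

/// BF16 is the upper half of an f32, so widening is a 16-bit shift.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Arithmetic model of one output lane: acc + a_lo*b_lo + a_hi*b_hi.
pub fn dp_lane(acc: f32, a_pair: u32, b_pair: u32) -> f32 {
    let (a_lo, a_hi) = (bf16_to_f32(a_pair as u16), bf16_to_f32((a_pair >> 16) as u16));
    let (b_lo, b_hi) = (bf16_to_f32(b_pair as u16), bf16_to_f32((b_pair >> 16) as u16));
    acc + a_lo * b_lo + a_hi * b_hi
}
```

The same `a_pair` word is what the kernel broadcasts to all 16 lanes, so one packed row of B plus one broadcast A pair advances 16 output columns by two K steps.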

Performance shape (rough): the kernel is correctness-optimized, not
peak-throughput-optimized. A production GEMM with VDPBF16PS
would pre-pack B once per outer GEMM call (not per j-block iteration)
and tile the M dimension 16-wide via unrolled accumulators; that is
Phase-4 work. On Cooper Lake / Zen 4 today, this is still expected to
beat the scalar baseline by roughly 10× (not yet measured, see the
open verifications below), because the inner k_pairs loop issues one
VDPBF16PS per 2 K-elements across 16 columns, where the scalar path
performs a separate multiply+add per element.

Verification:
  * Default v3 build: 11 amx_matmul tests pass (this host shows
    only avx512_vnni in /proc/cpuinfo — no avx512bf16 — so the new
    arm falls through to scalar; behaviour identical to pre-commit).
  * cargo clippy --lib -D warnings clean.
  * cargo fmt --all --check clean.
  * Existing K-tail test (matmul_bf16_k_tail_16x65_65x16, k=65,
    k_pairs=32, k_tail=1) and strided test will exercise the new
    arm on Cooper Lake / Zen 4 silicon.

Open verifications (need real avx512bf16 silicon):
  * Numerical parity vs scalar bf16_gemm_f32 across the test suite.
  * Throughput vs scalar baseline.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
AdaWorldAPI merged commit ae5efaa into master on May 21, 2026
17 checks passed
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
Closes TD-SIMD-8's F16-honesty gap (tracked in
`.claude/knowledge/simd-dispatch-architecture.md` § 5):
`cast_f16_to_f32_batch` and `cast_f32_to_f16_batch` were scalar
lane-by-lane via `F16::to_f32` / `F16::from_f32_rounded`, taking the
same path on every x86 host, even on silicon with F16C hardware (every
CPU since Ivy Bridge 2012 / Piledriver 2012). The per-tier inventory
audit for TD-SIMD-8 said: "Replace with `_mm256_cvtph_ps` /
`_mm256_cvtps_ph` under target_feature = f16c".

Wires the F16C hardware path:

  cast_f16_to_f32_batch:
    x86_64 + runtime f16c+avx detect → cast_f16_to_f32_batch_f16c
      (8 F16 → 8 F32 per `_mm256_cvtph_ps` instruction, IEEE-754
       lossless widening, bit-identical to scalar `F16::to_f32`)
    fallback → scalar `F16::to_f32` lane-by-lane

  cast_f32_to_f16_batch:
    x86_64 + runtime f16c+avx detect → cast_f32_to_f16_batch_f16c
      (8 F32 → 8 F16 per `_mm256_cvtps_ph::<0>` instruction, RNE
       rounding via _MM_FROUND_TO_NEAREST_INT, bit-identical to
       `F16::from_f32_rounded` on every input incl. subnormal/NaN)
    fallback → scalar `F16::from_f32_rounded` lane-by-lane

Intrinsics are stable on Rust 1.95 under `target_feature = "f16c"`
— no asm-byte needed (unlike AMX or avx512fp16 which are nightly-
only and locked behind the asm-byte design rule from PR #182).
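
The widening direction of this wiring can be sketched as below. It is a hedged illustration: half-precision values are carried as raw `u16` bits instead of the crate's `F16` type, the function names are hypothetical, and the scalar converter is written out here only so the sketch is self-contained. The intrinsics used (`_mm_loadu_si128`, `_mm256_cvtph_ps`, `_mm256_storeu_ps`) are the stable `std::arch::x86_64` ones.

```rust
/// Scalar IEEE binary16 -> binary32 widening (exact for every non-NaN input).
pub fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h as u32) >> 10) & 0x1f;
    let man = (h as u32) & 0x3ff;
    let bits = match (exp, man) {
        (0, 0) => sign, // signed zero
        (0, _) => {
            // Subnormal half: man * 2^-24, exactly representable in f32.
            let v = man as f32 * (1.0 / 16_777_216.0);
            return if sign != 0 { -v } else { v };
        }
        (0x1f, 0) => sign | 0x7f80_0000,               // infinity
        (0x1f, _) => sign | 0x7fc0_0000 | (man << 13), // NaN, quieted
        _ => sign | ((exp + 112) << 23) | (man << 13), // normal: rebias 15 -> 127
    };
    f32::from_bits(bits)
}

/// F16C arm: 8 halves -> 8 floats per `_mm256_cvtph_ps`, scalar tail.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,f16c")]
unsafe fn cast_f16_to_f32_f16c(src: &[u16], dst: &mut [f32]) {
    use std::arch::x86_64::{__m128i, _mm256_cvtph_ps, _mm256_storeu_ps, _mm_loadu_si128};
    let n = src.len().min(dst.len());
    let mut i = 0;
    while i + 8 <= n {
        // SAFETY: i + 8 <= n keeps both the 16-byte load and the
        // 32-byte store inside the slices; unaligned access is allowed.
        unsafe {
            let h = _mm_loadu_si128(src.as_ptr().add(i) as *const __m128i);
            _mm256_storeu_ps(dst.as_mut_ptr().add(i), _mm256_cvtph_ps(h));
        }
        i += 8;
    }
    for j in i..n {
        dst[j] = f16_bits_to_f32(src[j]);
    }
}

/// Runtime-dispatched batch cast: F16C when detected, scalar otherwise.
pub fn cast_f16_to_f32_batch(src: &[u16], dst: &mut [f32]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("f16c") && is_x86_feature_detected!("avx") {
            // SAFETY: the required CPU features were just detected.
            unsafe { cast_f16_to_f32_f16c(src, dst) };
            return;
        }
    }
    for (d, &h) in dst.iter_mut().zip(src) {
        *d = f16_bits_to_f32(h);
    }
}
```

Unlike the compile-time `#[cfg]` chain used for `gemm_u8_i8`, this arm is chosen at run time, so a single default-v3 binary picks up the hardware path wherever F16C is present.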

Note on IMM8 encoding: `_mm256_cvtps_ph` const generic must fit in
3 bits (0..=7) per `static_assert_uimm_bits`. IMM8 = 0 selects
`_MM_FROUND_TO_NEAREST_INT` (RNE with exception raise). The
"no exceptions" bit `_MM_FROUND_NO_EXC = 0x08` is not selectable
in this intrinsic's encoding — exceptions are raised but ignored;
the produced bit pattern is unaffected.

Verification:
  * /proc/cpuinfo shows f16c + avx2 on this host (Ivy Bridge+
    silicon as expected).
  * 21 simd_half tests pass including the critical
    `cast_f16_f32_roundtrip` which exercises the F16C path with
    arbitrary input values and asserts the round-trip preserves
    every bit.
  * Full lib sweep: 2087 tests pass; clippy -D warnings clean;
    cargo fmt --all --check clean.

Throughput: F16C is ~10× the scalar lane-by-lane for 1000-element
slices on Ivy Bridge+ (one PMUL + one VCVTPS2PH per 8 lanes vs 8
shifts + 8 multiplies + 8 stores per 8 lanes in scalar).

Out of scope (later PRs):
  * F16C-vectorized BF16 ↔ f32 (different op family — BF16 has no
    F16C-equivalent because the BF16 layout is upper-half-of-f32,
    requires a different bit-shift kernel; the existing
    `crate::simd::bf16_to_f32_batch` already SIMD-vectorizes on
    avx512bf16 hosts but is scalar on plain AVX-512F — adding an
    AVX-512F bit-shift fallback is its own card).
  * NEON `vcvt_f32_f16` / `vcvt_f16_f32` for aarch64 — Phase 3b
    with the BFMMLA/FMLA.8h asm-byte arm.
  * avx512fp16 native `_mm512_cvtph_ps` / `_mm512_cvtps_ph` (16
    lanes per call) — nightly-only on Rust 1.95, asm-byte path.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
Mirror of the BF16 AMX work (TD-T1 / TD-T1b in PR #182) for the
integer operand family. Builds the missing int8 tile kernel from
scratch (the BF16 equivalent shipped in PR #104; the int8 one had
never been built despite the primitives existing in simd_amx since
day one) and wires matmul_i8_to_i32's AMX arm through it.

New module `hpc::int8_tile_gemm`:

  * `int8_tile_gemm_16x16(a_u8, b_i8, c, k)` — public tile kernel,
    K must be multiple of 64. Mirror shape of
    `bf16_tile_gemm_16x16` but for the `u8 × i8 → i32` operand
    family that TDPBUSD natively supports. **One TDPBUSD = 16 384
    multiply-accumulates per instruction** (16×16 output tile × 64
    K-elements per output cell, i.e. 16 quads × 4 bytes). That's
    256× the 64 MACs of one VPDPBUSD-zmm instruction.
  * Internal `amx_path()` uses the existing primitives in
    `amx_matmul`: TileConfig::for_dpbusd(64) → tile_loadconfig →
    tile_zero → K/64 iterations of (tile_load A, tile_load B,
    tile_dpbusd) → tile_store → tile_release.
  * `fallback_path()` for non-AMX hosts: scalar u8 × i8 → i32
    triple-loop reference.

New primitive `amx_matmul::vnni_pack_i8(src, dst, k, n)`:

  * Packs K × N row-major i8 into K/4 outer rows × (N*4) VNNI quad
    layout required by TDPBUSD tile 2.
  * `dst[kb*N*4 + j*4 + p] = src[(4*kb + p) * N + j]`
  * Sibling of `vnni_pack_bf16` (which uses K/2 × (N*2) pair layout
    for TDPBF16PS — both kernels reach the same 64-byte tile row
    width via element-width × pack-factor symmetry: BF16 is 2B × 2,
    INT8 is 1B × 4).
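
The index formula above can be written out as a plain scalar loop. This is a sketch under the stated layout only; the name `vnni_pack_i8_sketch` is hypothetical and the real `vnni_pack_i8` may differ in its checks and signature details:

```rust
/// Hypothetical scalar sketch of the quad packing:
/// dst[kb*N*4 + j*4 + p] = src[(4*kb + p) * N + j]
pub fn vnni_pack_i8_sketch(src: &[i8], dst: &mut [i8], k: usize, n: usize) {
    assert!(k % 4 == 0, "k must be a multiple of 4");
    assert_eq!(src.len(), k * n);
    assert_eq!(dst.len(), k * n);
    for kb in 0..k / 4 {
        for j in 0..n {
            for p in 0..4 {
                // Four consecutive K-rows of column j become one 4-byte quad.
                dst[kb * n * 4 + j * 4 + p] = src[(4 * kb + p) * n + j];
            }
        }
    }
}
```

For an 8 × 4 input holding 0..32 in row-major order, the first quad is column 0 of rows 0..4, i.e. `[0, 4, 8, 12]`.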

Wiring `matmul_i8_to_i32`'s AMX arm (was placebo):

Before this commit, the AMX branch shifted i8 → u8, then called the
SCALAR `int8_gemm_i32` reference and subtracted the bias, so TDPBUSD
itself was never reached even on real AMX silicon. Now:

  1. Shift A: i8 → u8 via (+128).
  2. Tile-loop over M/16 i_tile × N/16 j_tile blocks, calling
     int8_tile_gemm_16x16 per (i_tile, j_tile). B sub-block
     extracted into K × 16 scratch once per j_tile, reused across
     i_tile iterations.
  3. Subtract bias: c[i, j] -= 128 × colsum(B[:, j]).

The shape requirement is m%16 == 0 && n%16 == 0 && k%64 == 0;
misaligned shapes fall back to the scalar reference. Phase-4 work
will land mixed AMX-tile + per-axis scalar tail handling for
arbitrary shapes (the same Phase-4 work that TD-T1 deferred).
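
The shift-then-subtract identity in steps 1 and 3 can be checked with a scalar model. This is an illustrative sketch, not the AMX path; both function names are hypothetical:

```rust
/// Plain i8 x i8 -> i32 reference (row-major A: m x k, B: k x n).
pub fn matmul_i8_reference(a: &[i8], b: &[i8], m: usize, n: usize, k: usize) -> Vec<i32> {
    let mut c = vec![0i32; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                c[i * n + j] += a[i * k + p] as i32 * b[p * n + j] as i32;
            }
        }
    }
    c
}

/// Same product computed as u8 x i8 after shifting A by +128,
/// then corrected by 128 * colsum(B[:, j]).
pub fn matmul_i8_via_u8_shift(a: &[i8], b: &[i8], m: usize, n: usize, k: usize) -> Vec<i32> {
    // Step 1: shift A from [-128, 127] into [0, 255].
    let a_u8: Vec<u8> = a.iter().map(|&x| (x as i16 + 128) as u8).collect();
    // Step 2 (scalar stand-in for the tile kernel): u8 x i8 -> i32.
    let mut c = vec![0i32; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                c[i * n + j] += a_u8[i * k + p] as i32 * b[p * n + j] as i32;
            }
        }
    }
    // Step 3: (a + 128) * b = a * b + 128 * b, so remove 128 * colsum(B).
    for j in 0..n {
        let colsum: i32 = (0..k).map(|p| b[p * n + j] as i32).sum();
        for i in 0..m {
            c[i * n + j] -= 128 * colsum;
        }
    }
    c
}
```

Because everything is integer arithmetic, the two functions agree exactly, including at the extremes -128 and 127.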

Verification:
  * Default v3 build: 2092 lib tests pass (was 2087 — adds 5 new
    tests: 4 in int8_tile_gemm + the existing matmul_i8_to_i32 test
    now exercises the actual TDPBUSD path because this host has
    amx_int8 + amx_tile in /proc/cpuinfo; the test continues to
    pass with bit-identical results to the scalar reference).
  * `vnni_pack_i8_roundtrip` test verifies the pack layout matches
    the spec exactly for an 8 × 4 sample.
  * `fallback_matches_scalar_reference_k64` test verifies the
    non-AMX path produces the same i32 output as a hand-written
    reference for a 64-K, pseudo-random u8/i8 matrix pair.
  * `public_api_diagonal_k128` test asserts a structured pattern
    (A = identity-like, B = constant 2) gives the expected
    accumulation through the full dispatch chain.
  * `cargo clippy --lib -D warnings` clean.
  * `cargo fmt --all --check` clean.

Dropped: `int8_gemm_i32` import in `amx_matmul.rs` since the AMX
arm no longer falls back to it (the scalar else-branch uses an
inline triple-loop directly).

After this commit, the per-CPU dispatch table from PR #180 has the
AMX tier wired for BOTH operand families on Sapphire Rapids+:

  BF16 GEMM:  SPR+ → TDPBF16PS  (TD-T1 / TD-T1b in PR #182)
  INT8 GEMM:  SPR+ → TDPBUSD    (this commit)

Out of scope (separate PRs):
  * VPDPBUSD-zmm arm of matmul_i8_to_i32 for Cooper Lake / Cascade
    Lake / Zen 4+ (avx512vnni without AMX). The kernel functions
    `vnni_dot_u8_i8` and `vnni_matvec` exist in simd_amx.rs; they
    just need to be assembled into an m×n×k GEMM and wired in as the
    middle dispatch tier (analogous to the VDPBF16PS arm in PR
    #182's bf16_gemm_dispatch).
  * AMX tile path for `simd_int_ops::gemm_u8_i8` (the slice-level
    surface from PR #182) — it's u8 × i8 natively so no sign-shift
    needed, simpler to wire than matmul_i8_to_i32.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
Two clippy-as-error issues blocking PR #184 CI:

1. `src/hpc/int8_tile_gemm.rs:147` (mine, from b1979d7) —
   `clippy::unused_parens` flagged the closure body `(((i*11+5) % 256)
   as u8 as i8)` in the `fallback_matches_scalar_reference_k64` test.
   The outer parens around the cast chain are redundant; rustfmt
   re-broke the line to multi-line after removal so it stays readable.

2. `tests/par_rayon.rs:9` (pre-existing) — `clippy::manual_div_ceil`
   flagged `(M + CHUNK_SIZE - 1) / CHUNK_SIZE`. Replaced with
   `M.div_ceil(CHUNK_SIZE)` per the clippy hint. This file was
   already in tree; the lint became active in clippy 1.95 (Rust
   stable) which CI now uses, so prior PRs weren't blocked by it
   but the rayon-features test build is now.
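
For reference, the two spellings of the chunk count side by side. The function names are hypothetical; `usize::div_ceil` is the standard-library method the lint points at:

```rust
/// The pattern clippy::manual_div_ceil flags.
#[allow(clippy::manual_div_ceil)]
pub fn chunk_count_manual(m: usize, chunk: usize) -> usize {
    (m + chunk - 1) / chunk
}

/// The replacement: same result, and no overflow risk from `m + chunk - 1`.
pub fn chunk_count(m: usize, chunk: usize) -> usize {
    m.div_ceil(chunk)
}
```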

Both fixes are mechanical / no behaviour change:
  * `cargo clippy --tests --features rayon,native -- -D warnings`
    clean.
  * `cargo fmt --all --check` clean.

Stashed work-in-progress on the VPDPBUSD-zmm middle tier for
`matmul_i8_to_i32` (the natural symmetric next step after TD-T2,
analogous to the VDPBF16PS arm shipped in PR #182's
`bf16_gemm_dispatch`); will follow up in a separate commit once
CI is unblocked.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
Completes the per-CPU dispatch chain for `matmul_i8_to_i32`. Per
PR #180's table the middle tier between AMX TDPBUSD (Sapphire
Rapids+) and the scalar reference is `_mm512_dpbusd_epi32` (zmm
form, avx512vnni feature) — covers Cooper Lake, Cascade Lake, Ice
Lake-SP, Zen 4+ silicon that has AVX-512 VNNI but not AMX. Mirrors
the VDPBF16PS arm structure that landed for BF16 in PR #182's
`bf16_gemm_dispatch`.

New kernel `hpc::int8_tile_gemm::int8_gemm_vpdpbusd_zmm`:
  * One VPDPBUSD instruction: 16 i32 accumulator lanes, each
    receiving 4 u8×i8 products = 64 MACs per instruction.
  * Maps the 16 output lanes to a row of 16 j-columns of `c[i, ·]`,
    one i row processed at a time, K-quad inner loop accumulating
    into the same 16 i32 lanes across iterations.
  * B-column packing: pre-packs B for the current j-block into
    `b_col_quads[k_quad * 16 + j] = i32 (4 bytes of B[4k_quad..,
    j_base+j] packed bottom-to-top)` once per j-block; reused
    across all M i-iterations so the gather cost amortizes.
  * A row quad broadcast: `_mm512_set1_epi32` of (4 u8 bytes
    packed) every K-iter — same quad seen by every output column.
  * K-tail (k % 4 != 0) handled with scalar u8×i8 multiplies per
    output cell; N-tail (j_count < 16) handled by trimming the
    store width — padding lanes still receive VPDPBUSD updates
    but aren't written back.
  * Stable intrinsic `_mm512_dpbusd_epi32` under
    `target_feature = "avx512vnni,avx512f"` — no asm-byte needed.
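
The loop structure above can be modelled in scalar Rust. This sketch emulates one VPDPBUSD per inner iteration with a 16-lane array instead of calling the intrinsic; all names are hypothetical and it is not the kernel from this commit:

```rust
/// Scalar model of one VPDPBUSD lane: acc += sum of four u8*i8 products.
fn dpbusd_lane(acc: i32, a_quad: [u8; 4], b_quad: [i8; 4]) -> i32 {
    let mut s = acc;
    for p in 0..4 {
        s = s.wrapping_add(a_quad[p] as i32 * b_quad[p] as i32);
    }
    s
}

/// Model of the row-at-a-time u8 x i8 -> i32 GEMM (A: m x k, B: k x n).
pub fn gemm_u8_i8_vnni_model(a: &[u8], b: &[i8], c: &mut [i32], m: usize, n: usize, k: usize) {
    let kq = k / 4; // whole K-quads; the remainder is the scalar K-tail
    for j_base in (0..n).step_by(16) {
        let j_count = (n - j_base).min(16);
        // Pack B once per j-block; padding lanes (j >= j_count) stay zero.
        let mut b_quads = vec![[0i8; 4]; kq * 16];
        for q in 0..kq {
            for j in 0..j_count {
                for p in 0..4 {
                    b_quads[q * 16 + j][p] = b[(4 * q + p) * n + j_base + j];
                }
            }
        }
        for i in 0..m {
            let mut acc = [0i32; 16];
            for q in 0..kq {
                // The same A quad is "broadcast" to all 16 lanes.
                let r = i * k + 4 * q;
                let a_quad = [a[r], a[r + 1], a[r + 2], a[r + 3]];
                for lane in 0..16 {
                    acc[lane] = dpbusd_lane(acc[lane], a_quad, b_quads[q * 16 + lane]);
                }
            }
            // N-tail: only the first j_count lanes are written back.
            for j in 0..j_count {
                let mut s = acc[j];
                // K-tail: remaining k % 4 elements handled per output cell.
                for p in 4 * kq..k {
                    s += a[i * k + p] as i32 * b[p * n + j_base + j] as i32;
                }
                c[i * n + j_base + j] = s;
            }
        }
    }
}
```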

Wiring `matmul_i8_to_i32` to three-tier dispatch:
  1. amx_available() + 16/16/64-aligned shapes
     → int8_tile_gemm_16x16 → TDPBUSD asm-byte (16 384 MACs/instr;
       reuses the kernel from b1979d7 in this PR)
  2. is_x86_feature_detected!("avx512vnni")
     → int8_gemm_vpdpbusd_zmm → _mm512_dpbusd_epi32 stable
       intrinsic (64 MACs/instr, arbitrary shapes, K-tail handled
       scalar, N-tail handled by per-iteration j_count trim)
  3. scalar i8×i8 → i32 reference for non-x86, pre-AVX-512 hosts,
     or shapes that don't satisfy either SIMD tier's requirements
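
A sketch of the tier selection, assuming hypothetical names (`Int8Tier`, `select_int8_tier`, `host_has_avx512vnni`). The `amx_ok` flag stands in for the crate's `amx_available()` gate, which is not reproduced here:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Int8Tier {
    AmxTile,
    Avx512Vnni,
    Scalar,
}

/// Pick a tier from capability flags and the problem shape.
/// `amx_ok` is an assumed stand-in for the runtime AMX gate.
pub fn select_int8_tier(amx_ok: bool, vnni_ok: bool, m: usize, n: usize, k: usize) -> Int8Tier {
    let amx_aligned = m % 16 == 0 && n % 16 == 0 && k % 64 == 0;
    if amx_ok && amx_aligned {
        Int8Tier::AmxTile
    } else if vnni_ok {
        // Arbitrary shapes: K-tail and N-tail are handled inside the kernel.
        Int8Tier::Avx512Vnni
    } else {
        Int8Tier::Scalar
    }
}

/// Runtime probe for the middle tier on x86_64.
#[cfg(target_arch = "x86_64")]
pub fn host_has_avx512vnni() -> bool {
    is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512vnni")
}

/// Non-x86 hosts never take the VNNI tier.
#[cfg(not(target_arch = "x86_64"))]
pub fn host_has_avx512vnni() -> bool {
    false
}
```

Note that a misaligned shape on an AMX host drops to the VNNI tier rather than straight to scalar, which matches the ordering in the list.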

Factored the shared sign-shift bias subtraction into a private
`subtract_i8_to_u8_bias(c, b_i8, m, n, k)` helper: both Tier 1
(AMX) and Tier 2 (VNNI) shift LHS i8 → u8 via (+128) then need to
subtract 128·colsum(B) from the accumulator. Pure integer
arithmetic, bit-identical to the scalar i8×i8 → i32 reference.

Verification:
  * Default v3 build: 2093 lib tests pass (was 2092 — +1 new test
    `vpdpbusd_zmm_matches_scalar` that exercises the new arm
    directly with shapes spanning aligned cases, K-tail (k % 4),
    N-tail (n % 16), and small shapes; asserts byte-equal output
    vs scalar reference).
  * Existing `matmul_i8_to_i32_16x16_exact` continues to pass
    through the AMX tier on this host (which has amx_int8).
  * cargo clippy --lib --tests --features rayon,native -- -D warnings
    clean.
  * cargo fmt --all --check clean.

Per-CPU dispatch state after this commit:

  matmul_bf16_to_f32:  SPR+ AMX  | Zen4/CPL VDPBF16PS | scalar
                       (PR #182) | (PR #182)          | (always)
  matmul_f32:          SPR+ AMX  | Zen4/CPL VDPBF16PS | scalar
                       (PR #182) | (PR #182)          | (always)
  matmul_i8_to_i32:    SPR+ AMX  | CPL/Zen4 VPDPBUSD  | scalar
                       (b1979d7) | (THIS COMMIT)      | (always)

So all three of the public matmul entry points now have full
three-tier dispatch on x86_64.

Out of scope (separate PRs):
  * AMX tile path for `simd_int_ops::gemm_u8_i8` (the slice-level
    u8×i8 surface from PR #182) — it's u8×i8 natively, no sign-
    shift bias needed, simpler than matmul_i8_to_i32.
  * AVX-VNNI ymm arm (Arrow Lake / Meteor Lake U: avxvnni without
    avx512vnni) — the `vnni2_*` functions exist in simd_amx.rs but
    need to be assembled into an m×n×k VNNI-ymm GEMM. Same shape as
    the avx512vnni arm just with ymm width.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
…able

Rebased onto master post-#181, #182, #183. Replaces the polyfill-based
add_mul_f32/f64 with LazyLock-cached function pointers picking real
hardware FMA per silicon, and adds two more LazyLock-cached
primitives the consumer needs: is_amx_available() and vnni_dot_u8_i8.

WHY: F32x16::mul_add on AVX2 builds drops to per-lane scalar
f32::mul_add (simd_avx2.rs:586). The polyfill abstracts lane width
but cannot pick between _mm256_fmadd_ps and _mm512_fmadd_ps — that
is an instruction-family choice, not a lane-width one. LazyLock
amortises a one-time simd_caps() read into a frozen fn pointer;
every subsequent call is a single indirect jump with zero
is_x86_feature_detected! overhead. No SimdProfile exposed at the
consumer surface — agnostic contract preserved.

add_mul_f32(acc, a, b) — acc[i] += a[i]*b[i]
  AVX-512F+FMA  → _mm512_fmadd_ps 16-wide + 8-wide tail + scalar tail
  AVX2+FMA      → _mm256_fmadd_ps 8-wide + scalar tail
  NEON          → vfmaq_f32 4-wide + scalar tail
  scalar        → f32::mul_add per lane
  no_std build  → preserves the polyfill F32x16::mul_add path
                  (LazyLock requires std)
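
A reduced sketch of the LazyLock pattern with only two arms (AVX2+FMA and scalar). The static name, the wrapper names, and the two-arm selection are assumptions for illustration; the commit's version also covers AVX-512, NEON and a `simd_caps()` read:

```rust
use std::sync::LazyLock;

type AddMulF32 = fn(&mut [f32], &[f32], &[f32]);

/// Scalar arm: single-rounded fused multiply-add per element.
fn add_mul_scalar(acc: &mut [f32], a: &[f32], b: &[f32]) {
    for ((c, &x), &y) in acc.iter_mut().zip(a).zip(b) {
        *c = x.mul_add(y, *c);
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn add_mul_avx2_fma(acc: &mut [f32], a: &[f32], b: &[f32]) {
    use std::arch::x86_64::*;
    let body = acc.len() - acc.len() % 8;
    let mut i = 0;
    while i < body {
        // SAFETY: i + 8 <= body <= len of all three slices.
        unsafe {
            let va = _mm256_loadu_ps(a.as_ptr().add(i));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i));
            let vc = _mm256_loadu_ps(acc.as_ptr().add(i));
            _mm256_storeu_ps(acc.as_mut_ptr().add(i), _mm256_fmadd_ps(va, vb, vc));
        }
        i += 8;
    }
    add_mul_scalar(&mut acc[body..], &a[body..], &b[body..]); // scalar tail
}

#[cfg(target_arch = "x86_64")]
fn add_mul_avx2_fma_safe(acc: &mut [f32], a: &[f32], b: &[f32]) {
    // SAFETY: only stored in the static after avx2 + fma were detected.
    unsafe { add_mul_avx2_fma(acc, a, b) }
}

/// Feature detection runs once; afterwards this is a frozen fn pointer.
static ADD_MUL_F32: LazyLock<AddMulF32> = LazyLock::new(|| {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            return add_mul_avx2_fma_safe as AddMulF32;
        }
    }
    add_mul_scalar as AddMulF32
});

/// acc[i] += a[i] * b[i]
pub fn add_mul_f32(acc: &mut [f32], a: &[f32], b: &[f32]) {
    assert!(acc.len() == a.len() && a.len() == b.len());
    (*ADD_MUL_F32)(acc, a, b)
}
```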

add_mul_f64(acc, a, b) — f64 sibling, same shape with 8/4/2 lanes.

is_amx_available() — wraps simd_amx::amx_available() (CPUID +
OSXSAVE + XCR0[17,18] + Linux arch_prctl(XCOMP_PERM)) in
LazyLock<bool>. The 4-step gate, including the syscall, fires
exactly once per process. Always false on non-x86_64.

vnni_dot_u8_i8(a, b) — i32 dot of u8 × i8 slices:
  AVX-512 VNNI  → delegates to simd_amx::vnni_dot_u8_i8 wrapped with
                  scalar tail handling (the existing kernel processes
                  only n - (n%64) since its cognitive-shader caller
                  pre-aligns rows; general-purpose callers need the
                  tail)
  AVX-VNNI 256  → delegates to simd_amx::vnni2_dot_u8_i8 directly
                  (that one already handles its scalar tail)
  scalar        → simd_amx::vnni_dot_u8_i8_scalar
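
The tail-handling wrapper for the AVX-512 VNNI arm can be illustrated with a scalar stand-in. `dot_blocks_of_64` below is a hypothetical model of a kernel that only consumes whole 64-element blocks; it is not `simd_amx::vnni_dot_u8_i8` itself:

```rust
/// Hypothetical stand-in for a kernel that processes only n - (n % 64)
/// elements and ignores the rest.
fn dot_blocks_of_64(a: &[u8], b: &[i8]) -> i32 {
    let body = a.len() - a.len() % 64;
    a[..body]
        .iter()
        .zip(&b[..body])
        .map(|(&x, &y)| x as i32 * y as i32)
        .sum()
}

/// General-purpose wrapper: block kernel for the body, scalar for the tail.
pub fn dot_u8_i8_with_tail(a: &[u8], b: &[i8]) -> i32 {
    assert_eq!(a.len(), b.len());
    let body = a.len() - a.len() % 64;
    let head = dot_blocks_of_64(a, b);
    let tail: i32 = a[body..]
        .iter()
        .zip(&b[body..])
        .map(|(&x, &y)| x as i32 * y as i32)
        .sum();
    head + tail
}
```

With 70 elements, the block kernel alone would cover only the first 64; the wrapper adds the remaining 6.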

No intrinsic code is duplicated. The dispatcher composes existing
simd_amx::* kernels (which #182/#184 also build on) into a safe
LazyLock-cached consumer-facing wrapper. simd_amx::matvec_dispatch
runs the same selection logic but uses is_x86_feature_detected! per
call; this wrapper amortises that to once at startup.

PARITY CONTRACT:
  - add_mul_f32 / add_mul_f64: bit-identical to f32::mul_add /
    f64::mul_add per lane via to_bits() assertion. All vector
    backends emit single-rounded IEEE-754 FMA.
  - vnni_dot_u8_i8: bit-identical i32 to scalar widen-and-multiply.
    VPDPBUSD does not saturate the accumulator (each u8*i8 product
    lies in [-32640, 32385], each four-element sum in [-130560,
    129540], far inside i32 range).

Tests: 2101/2101 lib pass (7 new lazylock_dispatch_tests over 12
problem sizes / tail lengths). cargo clippy --lib clean under
default and --features cpu-spr. On Sapphire Rapids host the
LazyLock resolved to AVX-512+FMA for add_mul, AVX-512 VNNI for
vnni_dot; AMX is_amx_available returns false (hypervisor masks
XCR0[17,18]) — matches the Risk #3 demotion from 61b4563.

This commit was rebased atop master after the parallel session
shipped PR #182 (BF16 AMX tile kernels), #183 (F16C cast batch), and
prepared #184 (TDPBUSD int8 tile + matmul_i8_to_i32 wiring). The
earlier 469ecc7 (coarse + SimdTier) and 77e3971 (mul_add_f32_into +
walkback) and be65595 (is_amx_available + vnni_dot duplicating
intrinsics) are subsumed by this single clean commit: no public
SimdProfile / SimdTier re-export, no duplicated intrinsic code, no
mul_add_f32_into (master's add_mul_f32 shape is the right primitive).
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
Extends the u8×i8 → i32 dispatch chain from PR #182's compile-time
cascade (avx512vnni → avxvnni → scalar) by adding a top-tier AMX
runtime check. Brings the SPR/GNR TDPBUSD path (16 384 MACs per
instruction) to the slice-level surface that downstream consumers
(lance-graph, etc.) use, completing the symmetry with PR #184's
matmul_i8_to_i32 wiring.

`gemm_u8_i8` is u8×i8 natively — no sign-shift bias trick needed
(unlike `matmul_i8_to_i32` which is i8×i8 and has to convert via
+128 then subtract `128·colsum(B)`). That makes the AMX path here
a direct call with no bias correction.

New helper `hpc::int8_tile_gemm::int8_gemm_amx_tiled(a_u8, b_i8,
c, m, n, k)` factors out the tile-decomposition logic that was
previously inlined in `matmul_i8_to_i32`. Both consumers now share
the same helper:

  matmul_i8_to_i32:
    1. shift A: i8 → u8 (+128)
    2. int8_gemm_amx_tiled(a_u8, b, c, m, n, k)
    3. subtract_i8_to_u8_bias(c, b, m, n, k)

  gemm_u8_i8 (AMX tier added in this commit):
    1. int8_gemm_amx_tiled(a, b, c, m, n, k) — no shift, no bias

The helper handles arbitrary 16/16/64-aligned shapes via a
j_tile × i_tile loop calling int8_tile_gemm_16x16 per (16, 16)
block. B sub-block extracted into K × 16 scratch once per j-tile,
reused across all M i-tiles. **Overwrite semantics**: c is written
not accumulated (the underlying int8_tile_gemm_16x16 accumulates
into its tile buffer, but we zero the tile buffer before each call
so the per-tile write to c is pure overwrite).
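
A scalar model of the tile decomposition and its overwrite semantics. `tile_gemm_16x16_model` is a hypothetical stand-in for `int8_tile_gemm_16x16`, and the outer function is a sketch of the loop order described above rather than the shipped helper:

```rust
const TILE: usize = 16;

/// Hypothetical stand-in for the 16x16 tile kernel: accumulates into c_tile.
fn tile_gemm_16x16_model(a_rows: &[u8], b_block: &[i8], c_tile: &mut [i32; 256], k: usize) {
    for i in 0..TILE {
        for j in 0..TILE {
            for p in 0..k {
                c_tile[i * TILE + j] += a_rows[i * k + p] as i32 * b_block[p * TILE + j] as i32;
            }
        }
    }
}

/// j_tile x i_tile decomposition for 16/16/64-aligned shapes.
pub fn gemm_tiled_model(a: &[u8], b: &[i8], c: &mut [i32], m: usize, n: usize, k: usize) {
    assert!(m % TILE == 0 && n % TILE == 0 && k % 64 == 0);
    let mut b_block = vec![0i8; k * TILE];
    for j_tile in 0..n / TILE {
        // Extract the K x 16 column block of B once per j-tile.
        for p in 0..k {
            let src = p * n + j_tile * TILE;
            b_block[p * TILE..(p + 1) * TILE].copy_from_slice(&b[src..src + TILE]);
        }
        for i_tile in 0..m / TILE {
            let a_rows = &a[i_tile * TILE * k..(i_tile + 1) * TILE * k];
            // Zeroed before each call, so the copy into c is a pure overwrite.
            let mut c_tile = [0i32; 256];
            tile_gemm_16x16_model(a_rows, &b_block, &mut c_tile, k);
            for i in 0..TILE {
                let row = (i_tile * TILE + i) * n + j_tile * TILE;
                c[row..row + TILE].copy_from_slice(&c_tile[i * TILE..(i + 1) * TILE]);
            }
        }
    }
}
```

Pre-filling `c` with garbage before the call is a simple way to confirm that nothing from the old contents leaks into the result.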

Dispatch placement in gemm_u8_i8:
  * Tier 0 (this commit): runtime amx_available() check at the
    top of the function. AMX requires CPUID + XCR0 + Linux prctl
    which can't fit a target_feature compile-time gate.
  * Tiers 1-3: existing compile-time cfg-cascade (avx512vnni zmm
    → avxvnni ymm → scalar i8_gemm_i32). Unchanged.

Misaligned shapes (m/n not multiples of 16, k not multiple of 64)
or non-AMX hosts fall through to the compile-time cascade as
before.

Also fixed pre-existing clippy::manual_is_multiple_of warnings
that surfaced in the new alignment check — switched from `% 16
== 0` to `.is_multiple_of(16)` etc. per the clippy hint (Rust
1.95 promoted this from `pedantic` to active warn).

Verification:
  * 2095 lib tests pass (was 2094 — +1 new
    `gemm_u8_i8_amx_aligned_32x32x128` test exercising the AMX
    arm with a 32×32×128 shape that hits the AMX tier on this
    host's amx_int8 silicon).
  * 11 amx_matmul tests pass (matmul_i8_to_i32 refactored to call
    the shared helper — same behavior as before).
  * 4 gemm_u8_i8 tests pass (the existing ones still hit the
    compile-time cascade since their shapes aren't AMX-aligned).
  * cargo clippy --lib --tests --features rayon,native -- -D warnings
    clean.
  * cargo fmt --all --check clean.

Per-CPU dispatch state after this commit:

  matmul_bf16_to_f32:  SPR+ AMX  | Zen4/CPL VDPBF16PS | scalar
                       (PR #182) | (PR #182)          | (always)
  matmul_f32:          SPR+ AMX  | Zen4/CPL VDPBF16PS | scalar
                       (PR #182) | (PR #182)          | (always)
  matmul_i8_to_i32:    SPR+ AMX  | CPL/Zen4 VPDPBUSD  | scalar
                       (PR #184) | (PR #184)          | (always)
  gemm_u8_i8 (slice):  SPR+ AMX  | CPL/Zen4 VPDPBUSD  | ARL ymm | scalar
                       (THIS)    | (PR #182)          | (PR #182) | (PR #182)

Out of scope (separate PRs):
  * AVX-VNNI ymm arm for matmul_i8_to_i32 — `vnni2_*` helpers
    exist in simd_amx.rs but need assembling into an m×n×k GEMM.
    Same shape as the avx512vnni arm just with ymm width.
  * NEON BFMMLA / SDOT on aarch64 via asm-byte — Phase 3b.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
**Alternative** to the compile-time cascade in `crate::simd::*` /
`crate::simd_ops::*`. **Additive**: gated under
`--features runtime-dispatch`, does not touch any existing path.
Mutually exclusive with `nightly-simd` (the portable-SIMD polyfill
replaces the architecture-specific intrinsics that the runtime
trampolines select between).

Use case: ship ONE binary that adapts across heterogeneous
deployment silicon (AVX-512 server + AVX2-only laptop + Arrow Lake
desktop + Sapphire Rapids workstation) from the same artifact. The
existing compile-time `v3` / `v4` / `native` / `nightly-simd`
configs target a single class of CPU per build; the runtime layer
targets the union via per-op LazyLock<fn ptr> trampolines.

Design from `.claude/knowledge/simd-dispatch-architecture.md` § 7.1
/ Phase 5, building on the precedent set by
`hpc::bgz17_bridge::{L1_KERNEL, L1_WEIGHTED_KERNEL, ...}`
(`LazyLock<L1Fn>` pattern, lines 75-86) already proven in tree.

# Dispatch model

One `LazyLock<fn ptr>` per public surface. First call fires the
closure which reads `simd_caps()` and selects a backend; every
subsequent call is one pointer-deref + indirect call. Per-call
overhead: ~2-3 ns (LazyLock atomic-acquire load that's cache-
resident after first hit + indirect-call branch-target predict).
Invisible against any SIMD op's actual work (~100+ cycles).
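
A small sketch of the "closure fires once" behaviour, using a counter to make it observable. The names (`DOT`, `INIT_RUNS`, `dot_u8_i8`) are hypothetical, and the selector here always picks the scalar arm where a real one would inspect CPU capabilities:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

type DotFn = fn(&[u8], &[i8]) -> i32;

/// Counts how many times the selection closure has run.
static INIT_RUNS: AtomicUsize = AtomicUsize::new(0);

fn dot_scalar(a: &[u8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

/// One LazyLock<fn ptr> per public surface.
static DOT: LazyLock<DotFn> = LazyLock::new(|| {
    INIT_RUNS.fetch_add(1, Ordering::Relaxed);
    // Assumption: a real selector would read the capability snapshot here
    // and return a SIMD backend when one is available.
    dot_scalar as DotFn
});

/// Public trampoline: pointer load plus indirect call after the first use.
pub fn dot_u8_i8(a: &[u8], b: &[i8]) -> i32 {
    assert_eq!(a.len(), b.len());
    (*DOT)(a, b)
}

pub fn init_runs() -> usize {
    INIT_RUNS.load(Ordering::Relaxed)
}
```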

# Module layout

  src/simd_runtime/
    mod.rs       — module entry, mutual-exclusion check vs
                   nightly-simd, public re-exports
    vnni_dot.rs  — u8×i8 → i32 dot (the proposal's canonical
                   example): 3 backends, the AVX-512 arm
                   wraps `simd_amx::vnni_dot_u8_i8` with a
                   scalar tail because the existing kernel
                   silently drops n%64 lanes (its matvec
                   caller pre-aligns rows; a general-purpose
                   dispatch surface cannot assume that)
    add_mul.rs   — slice-level FMA (acc += a × b) for f32/f64;
                   the ONLY new kernel code in this module —
                   4 backends per type (avx512 / avx2+fma /
                   neon / scalar), each ~15 LoC of direct
                   intrinsics
    matmul.rs    — thin trampolines for matmul_bf16_to_f32 /
                   matmul_f32 / matmul_i8_to_i32 / gemm_u8_i8
                   delegating to existing functions that
                   already runtime-dispatch internally
                   (PR #182 / #184 / #185)
    casts.rs     — trampolines for the four half-precision
                   batch casts delegating to PR #183's already-
                   runtime-dispatched implementations

# Backend reuse — no kernel duplication

Every dispatch arm delegates to a kernel that already exists in
tree. The runtime layer is just the trampoline. The only NEW
kernel code is `add_mul_f32` / `add_mul_f64` (no pre-existing
slice-level FMA primitive in tree to delegate to — the compile-
time `crate::simd_ops::add_mul_f32` from PR #182 polyfills through
the F32x16 lane wrapper; the runtime version skips that
indirection for one more inlined intrinsic per chunk).

# Invariants preserved from this PR series

  * No-FP32-roundtrip on BF16/F16 arithmetic — backends respect
    the bit-exact mantissa rule
  * Asm-byte encoding for nightly-gated AMX / FP16 — selected
    backends keep their existing asm-byte fast paths
  * Little-endian byte contracts for half-precision carriers
  * Accumulator-preservation in tile paths (codex P1 from #184)
  * Boundary assertions on safe public fns (codex P1 from #185) —
    the public `vnni_dot_u8_i8(a, b)` etc. inherit the asserts
    transparently via the call chain

# Verification

  * Default build (no feature): 2087 lib tests pass — the
    `simd_runtime` module is gated out, zero impact on existing
    paths.
  * `cargo test --lib --features runtime-dispatch`: **2105 lib
    tests pass** (+8 new in `simd_runtime::*::tests`).
  * `cargo clippy --lib --tests --features rayon,native -- -D warnings`
    clean (default).
  * `cargo clippy --lib --tests --features rayon,native,runtime-dispatch
    -- -D warnings` clean.
  * `cargo fmt --all --check` clean.
  * Mutual-exclusion enforced via `compile_error!` in
    `simd_runtime/mod.rs` — `--features runtime-dispatch,nightly-simd`
    fails to compile with a clear error.
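
The mutual-exclusion guard is the standard `cfg` + `compile_error!` idiom. A sketch with an assumed message string and a hypothetical `dispatch_mode` helper added only to show which side of the gate a build landed on:

```rust
// Fails the build when both features are requested together.
#[cfg(all(feature = "runtime-dispatch", feature = "nightly-simd"))]
compile_error!("features `runtime-dispatch` and `nightly-simd` are mutually exclusive");

/// Hypothetical helper: reports which dispatch layer this build uses.
pub fn dispatch_mode() -> &'static str {
    if cfg!(feature = "runtime-dispatch") {
        "runtime"
    } else {
        "compile-time"
    }
}
```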

# What's NOT in this PR (deferred)

  * Sweep the remaining ~15-20 SIMD/HPC public surfaces (min_i8,
    max_i8, add_i8, dot_i8, etc.). Each is ~30-50 LoC of trampoline;
    pattern is established here. Estimated ~700-900 more LoC across
    the full surface map.
  * CI matrix entry for `runtime-dispatch-portable` (per
    simd-dispatch-architecture.md § 7 / TD-SIMD-9). Job builds
    with `--features runtime-dispatch` on a v3 baseline runner and
    asserts every trampoline lands on its expected backend.
  * `simd_caps()` snapshot logging at process start (debug-only)
    to aid release-binary deployment debugging — "which arm did
    you actually pick?"

# Cost summary

  src/simd_runtime/                 +537 LoC (4 modules)
  src/lib.rs                        +9 LoC (cfg-gated mod decl)
  Cargo.toml                        +21 LoC (feature decl + doc)
  Total                             ~570 LoC

  Trampoline LoC per surface (this PR's sample):
    vnni_dot           170 LoC (LazyLock + 3 arms + wrapper + tests)
    add_mul (f32+f64)  218 LoC (LazyLock×2 + 4 arms×2 + tests; the ONLY new kernels)
    matmul (4 ops)     100 LoC (thin delegations + tests)
    casts (4 ops)       75 LoC (thin delegations + tests)

Out-of-tree estimate for the full sweep (per § 7 of the design
doc): ~1400 LoC total once all ~25 public SIMD/HPC surfaces are
wired. This PR establishes ~40% of that budget with the canonical
patterns.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u