simd: agnostic gemm_u8_i8 surface, integer-slice-op lift, per-CPU matrix, BF16 AMX wiring by AdaWorldAPI · Pull Request #182 · AdaWorldAPI/ndarray

AdaWorldAPI · 2026-05-21T00:44:55Z

Summary

Nine commits across three areas — recovering JIT-parity-zone slice helpers, building the per-CPU agnostic-surface resolution map, and landing two of its Phase-1 wirings.

Recovery + agnostic surface (commits 1-3)

0a46e7f restores simd_ops::array_windows / array_windows_checked and adds slice-level add_mul_f32 / add_mul_f64 — the foundation primitives the BLAS-graph GEMM path relies on (an earlier sprint had removed them).
86b0f3f ships simd_int_ops::gemm_u8_i8 as an agnostic surface with a compile-time avx512vnni → avxvnni → scalar dispatch chain.
caf0471 adds the AVX-VNNI ymm arm (Arrow Lake, Meteor Lake U) and bumps .cargo/config-avx512.toml from bare v4 to sapphirerapids (was missing VNNI).
0134916 adds an #[ignore]'d criterion-style bench harness bench_gemm_u8_i8_vs_scalar so the three arms have an apples-to-apples timing surface.

Per-CPU resolution matrix (commits 4-5)

b34d430 introduces .claude/knowledge/agnostic-surface-cpu-matrix.md — a 14-CPU × ~80-symbol matrix mapping every public type / function in crate::simd::* + crate::hpc::* to its actual lowering per profile. Cross-references the W1a consumer contract, the TD-SIMD audit, and the existing dispatch-architecture doc.
058ef61 expands to full integer-lane coverage (I8/U8/I16/U16/I32/U32/I64/U64 at 256/512-bit), adds the cross-cutting infrastructure status table (which configs exist, what features are missing), and grows the integration plan with Phases 0-6 + an explicit out-of-scope list.

Phase 1 wirings (commits 6-9)

b5bca4e — MX-T1a: simd_int_ops::add_i8 / sub_i8 / add_i16 lifted from scalar to polyfilled lanes (I8x64 / I16x32 on x86, I8x16 / I16x8 on aarch64). Uses the same min_i8-style cfg-cascade. Existing parity tests cover tail lengths 0/1/32/63/64/65/127/128/129/256.
bede3d2 — design rule + matrix update: flips the three MX-T1a cells in the matrix from scalar to per-CPU SIMD, and codifies the asm-byte encoding rule for AMX/F16/FP16 paths (Phases 1b/3b/3c/4d). AMX intrinsics are nightly-only (issue #126622) and avx512fp16/NEON-fp16 have stabilization churn on Rust 1.95 stable — raw .byte/.inst encoding (matching simd_amx.rs:16-19) is the documented stable-toolchain path. Does NOT apply to instructions with stable intrinsics on 1.95 (_mm512_dpbusd_epi32, _mm256_cvtph_ps, _mm512_cvtne2ps2bf16).
fe334de — TD-T1: matmul_bf16_to_f32's AMX arm was placebo (both if amx_available() and else called the scalar reference). This wires the 16/16/32-aligned path through bf16_tile_gemm::bf16_tile_gemm_16x16 which emits TDPBF16PS via the asm-byte path in simd_amx.rs::tile_dpbf16ps — 8 192 BF16×BF16 multiplies + 256 f32 accumulates per instruction on real SPR silicon. Misaligned shapes fall back to the validated scalar bf16_gemm_f32. Non-AMX hosts always take the scalar fallback.

Test plan

Default v3 (x86-64-v3 AVX2) — cargo test --lib: 2087 passed, 0 failed, 29 ignored.
cargo clippy --lib -- -D warnings clean.
cargo --config .cargo/config-avx512.toml=cascadelake (= v4 + VNNI) — 15 simd_int_ops tests pass on the AVX-512BW direct path (_mm512_add_epi8, _mm512_add_epi16).
cargo --config .cargo/config-avx512.toml (sapphirerapids) — runner CPU on this dev host shows only avx512_vnni in /proc/cpuinfo (no AMX/BF16/FP16), so SPR-targeted binaries SIGILL on unrelated tests. Needs real SPR silicon for verification.
aarch64 build path unchanged — the integer-lift cfg-cascades follow the same shape as the existing min_i8 / max_i8 cross-arch dispatch.
AMX path on real SPR — kernel itself is the same one PR Add BF16 tile GEMM with AMX/AVX-512 dispatch #104 shipped and tested; only the routing is new in this PR.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Generated by Claude Code

Introduces `simd_int_ops::gemm_u8_i8`, the first consumer-facing surface that bakes the SIMD dispatch decision in at build time. The consumer never branches on CPU capability; the active arm of the `#[cfg(target_feature)]` chain is the only one that compiles in, and the compiler emits a direct call to the chosen kernel. Arms wired in this PR: target_feature = "avx512vnni" → int8_gemm_vnni_avx512 kernel (default / anything else) → hpc::quantized::int8_gemm_i32 scalar Future arms (amx-int8, avxvnniint8, neon+dotprod) land additively in follow-up PRs without disturbing existing callers. Mechanical changes: * `int8_gemm_vnni_avx512` becomes `pub(crate) unsafe fn` so the agnostic surface can target it directly under the cfg gate, bypassing the per-call `if caps.has_avx512_vnni()` branch in `int8_gemm_vnni` (kept as-is for now; cleanup is a follow-up). * `hpc::quantized::int8_gemm_i32` (scalar) is untouched — it remains the universal reference path and the `target_feature`-less fallback. Parity tests (4×4 identity, 3×5×8 rectangular, 17×17 tail, extreme u8/i8 values) pass under both the default v3 scalar arm and the `-Ctarget-feature=+avx512vnni` AVX-512 arm. Full lib suite green (2075 passed). `cargo clippy -- -D warnings` clean. Architecture refs: .claude/knowledge/td-simd-integration-plan.md § "SimdProfile + static dispatch tables" https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Two complementary changes — agnostic INT8 GEMM no longer falls to scalar on AVX2-with-VNNI silicon, and the AVX-512 build config picks up the full modern Intel HPC feature set instead of bare v4. 1. AVX-VNNI ymm kernel New `hpc::vnni_gemm::int8_gemm_avxvnni_ymm` — VEX-encoded `VPDPBUSD` over 8-wide i32 accumulators. Targets the AVX2 + AVX-VNNI silicon tier (Alder Lake, Arrow Lake, Zen 4 in ymm mode) which has hardware INT8 dot-product but no AVX-512. Same B-pre-pack layout as the AVX-512 kernel; column tail (`n % 8`) runs scalar (no masked ymm VPDPBUSD on the VEX encoding). 2. gemm_u8_i8 dispatch chain — avxvnni arm `simd_int_ops::gemm_u8_i8` gains a third `#[cfg]` arm between `avx512vnni` and the scalar fallback: avx512vnni → int8_gemm_vnni_avx512 (zmm, 16 lanes) avxvnni → int8_gemm_avxvnni_ymm (ymm, 8 lanes) (none) → hpc::quantized::int8_gemm_i32 (scalar) Arm precedence is widest-vector-first via `#[cfg]` ordering (Sapphire Rapids has both `avx512vnni` and `avxvnni`; the zmm arm wins). All arms are compile-time selected — no runtime caps branch on any hot path. 3. config-avx512.toml: x86-64-v4 → sapphirerapids The "AVX-512" build config now selects the canonical modern Intel HPC target (SPR), enabling VNNI + BF16 + FP16 + VBMI + AMX-TILE + AMX-INT8 + AMX-BF16 in addition to the v4 baseline. Effect on `gemm_u8_i8`: the avx512vnni arm now lights up under this config (pure x86-64-v4 lacks VNNI). Once an `amx-int8` arm lands, it will preempt automatically on the same config. GitHub CI runs the default `.cargo/config.toml` (still `-Ctarget-cpu=x86-64-v3`), which is unaffected — only opt-in `--config .cargo/config-avx512.toml` builds see the change. Parity tests (4×4 identity, 3×5×8 rectangular, 17×17 tail with the ymm-tail scalar path, extreme u8/i8 values) pass under three configs: default v3 → scalar arm -Ctarget-cpu=alderlake → avxvnni ymm arm --config config-avx512.toml → avx512vnni zmm arm Full lib suite green on default v3 (2075 passed). `cargo clippy -- -D warnings` clean on both default and SPR configs. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

#[ignore]'d sanity-check test that times the agnostic gemm_u8_i8 surface against the scalar reference at 64³/128³/256³/512³. Run with --ignored --nocapture to compare arms under different target cfgs. Measured on Sapphire Rapids (4-core Xeon @ 2.1GHz), release build: scalar arm (default v3) — noise (~1.0x, both paths are scalar): 64³ speedup=1.07x 128³ speedup=1.16x 256³ speedup=1.06x 512³ speedup=0.97x avxvnni ymm arm (-Ctarget-cpu=alderlake): 64³ simd= 18.6µs scalar=109.5µs speedup=5.88x 128³ simd=151.7µs scalar=590.4µs speedup=3.89x 256³ simd= 1.1ms scalar= 2.7ms speedup=2.40x 512³ simd= 9.1ms scalar= 16.2ms speedup=1.77x Confirms the AVX-VNNI ymm kernel is genuinely faster than the scalar reference across all measured sizes — addresses the concern that an AVX2 path could end up slower than scalar GEMM if the algorithmic shape (B pre-pack into VNNI layout, tile sizing, MR/NR loop nesting) were wrong. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Recovers two foundation primitives that prior sessions removed (or never added in the first place) — both are explicitly cited by an earlier session as the reason the BLAS-graph GEMM hand-rolled kernels reach within a few percent of a Cranelift-JIT inner loop. Without them the JIT-native path becomes the only way to hit that throughput. 1. array_windows::<T, N> — overlapping const-size window iterator Stable-Rust equivalent of nightly `std::slice::array_windows::<N>()`. Sits next to the existing `array_chunks::<N>` (non-overlapping) in `simd_ops.rs`, completing the pair. Together they let consumer kernels iterate `B`-rows as overlapping K-windows and `A`-columns as non-overlapping M-chunks in a single source, with the polyfilled F32x16 / F64x8 types absorbing the per-arch lane count. Implementation uses index-based iteration (`(0..count).map(|i| &data[i..i+N])`) to avoid `slice::windows(0)`'s panic in the N==0 edge case — the unchecked variant yields an empty iterator, the checked variant returns `Err(())`. Behaviour mirrors the already-shipped `array_chunks` / `array_chunks_checked` pair. 2. add_mul_f32 / add_mul_f64 — slice-level FMA into accumulator `acc[i] += a[i] * b[i]` via the polyfilled `F32x16::mul_add` / `F64x8::mul_add` already on the SIMD types (16-wide AVX-512 / 8-wide AVX2-FMA / 4-wide NEON / scalar `f32::mul_add`). Single rounding step, semantically identical to BLAS-1 `axpy` with a vector multiplier and the dominant inner-loop shape in the bgz17 GEMM path. Operates on `min(acc.len(), a.len(), b.len())` lanes. 3. DO NOT REMOVE notice `simd_ops.rs` now opens with a "Foundation primitives — do not remove" callout that names `array_chunks`, `array_chunks_checked`, `array_windows`, `array_windows_checked`, `add_mul_f32`, and `add_mul_f64`, explains why they exist (~JIT-parity for BLAS-graph GEMM), and warns that prior sessions removed them under the wrong impression they were unused cruft. `src/simd.rs`' re-export site carries a matching pointer back to that notice. Both new helpers are re-exported flat from `crate::simd::*` per the W1a consumer contract — consumers reach for `ndarray::simd::{array_windows, add_mul_f32}`, never the implementation module directly. Verification: - 29 simd_ops unit tests pass (incl. 7 new array_windows + 4 new add_mul tests, covering tail handling, mismatched lengths, N==0, short buffers, exact-N, empty buffers). - 7 simd_ops doctests pass (the executable examples in the rustdoc). - Full lib suite green on default v3: 2087 passed, 29 ignored. - `cargo clippy -- -D warnings` clean. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

…plan Cross-tab of every public agnostic surface (`crate::simd::*`, `simd_int_ops::*`, `simd_half::*`, `simd_soa::*`) against the 14 CPU profiles from `td-simd-cpu-dispatch-matrix.md`. For each cell: which kernel actually runs there, with markers for ✅ live, ⏳ planned, 🟡 polyfill-transparent, ⚠️ scalar-debt, — n/a. Additionally surfaces a separate axis — **shape ingress** — that tracks the "ArrayView loses array shape on entry point" technical debt the user flagged: every function tagged 🔪 `&[T] + (m,n,k)` takes a flat slice and forces `.as_slice().unwrap()` at the call site, vs the 📐 `ArrayView` shape that `hpc::amx_matmul::matmul_*` already uses correctly. Findings the matrix surfaces: * `gemm_u8_i8` is the only currently-debt surface — `&[u8] + (m,n,k)`. Phase 0 of the integration plan lifts it to `(ArrayView2<u8>, ArrayView2<i8>, ArrayViewMut2<i32>) -> Result`. * Integer-elementwise ops (`add_i8`, `dot_i8`, `add_i16`, etc.) are uniformly scalar on every CPU despite `I8x64` / `I16x32` polyfilled lanes existing — predate the lane widening, never re-wired. * F16x16 has ZERO hardware backing on any profile (TD-SIMD-8); BF16x16 is hardware-backed only on `avx512bf16` profiles via `__m256bh`. NEON BF16 / FP16 (A76+) entirely scalar. * On aarch64, all integer polyfilled lanes (I8x32/I16x*/I32x*/I64x*/U*) are scalar — TD-T21 — even though NEON has 128-bit `intNx_t` quartets that would back them. * AMX exists on SPR/GNR but no agnostic surface routes to it (the kernel exists at `bf16_tile_gemm.rs`; only consumers in `amx_matmul.rs` reach it). Phase 1b wires it into `gemm_u8_i8`. Integration plan (J) phases 0-5, each one PR-sized: 0 - gemm_u8_i8 ArrayView lift (shape-debt fix) 1 - wire existing hardware paths (NEON SDOT, AMX-INT8, AVX-VNNI-INT8) 2 - lift integer-elementwise surfaces to polyfilled lanes 3 - TD-SIMD-8: BF16x16 + F16x16 hardware backing on CPL/SPR/Zn4 + A76 4 - remaining hardware fills (aarch64 ints, simd_ln_f32 Remez, RNE polyfill, AMX-FP16 detection) 5 - rolling: every new surface ships with ArrayView ingress from day one Verification checklist at the bottom defines the promotion gate (planned -> live): kernel exists, cfg-chain routes, parity-test green, timing-harness beats scalar, doc-comment updated, this matrix cell flipped in the same PR. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Expanded the agnostic-surface matrix from the initial pass to cover every public type, function, and infrastructure item exhaustively: * § A (polyfilled types backing) now includes the FULL integer-lane table — I8x{32,64}, U8x{32,64}, I16x{16,32}, U16x{16,32}, I32x{8,16}, U32x{8,16}, I64x{4,8}, U64x{4,8} — per CPU profile, with TD-T22 ⏳ markers on the still-unverified 256-bit polyfills. * § A also adds a "Critical type-method per-CPU lowerings" subsection that names the exact intrinsic each hot method emits per profile (vfmadd231ps zmm vs 2×vfmadd231ps ymm vs 4×vfmaq_f32 vs scalar f32::mul_add). * § B simplified — every simd_ops surface is 🟡 polyfill-pass; the work is in the polyfill layer above. * § C (simd_int_ops) sharpened — every scalar 🚨 cell is annotated with the polyfilled lane (I8x64, I16x32) it should reach for. * § D made explicit about BF16 native via __m256bh storage vs the portable [u16; 16] scalar polyfill switch. * § E (batch converters + transcendentals) now flags f32_to_bf16_batch_rne as scalar on every non-AVX-512 profile, despite the kernel existing — Phase 1 MX-T2. * § H added — currently-missing surfaces inventory (gemm_i8, gemm_u8, dot4_u8_i8 polyfill primitive, axpy_f32, dot_f32, nrm2_f32, asum_f32, gemv_f32, dot_i32, SimdProfile enum). * § I added — cross-cutting infrastructure status (cargo configs present per profile, missing cpu-* features, missing runtime-dispatch feature, missing SimdProfile enum, bench harness coverage). * § J integration plan extended: - Phase 0 records what landed in this session. - Phase 1 merges audit TD-T* tasks with new MX-T* items (integer slice ops lift, bf16/f16 cast fast paths). - Phase 4 grew MX-F1..MX-F16 with priority rebalanced based on "hot" markers for AI/ML BF16/F16 paths. - NEW Phase 5 — BLAS-graph kernel polish + bench-regression gates (the JIT-parity zone the prior session reached). - NEW Phase 6 — explicit out-of-scope list (GPU, JIT revival, wasm32 SIMD128, multi-core). * § K added — how to read the doc; § L provenance trail (no grep/tail per workspace rule, every entry traceable to a full-file Read). Total: ~860 lines of matrix + plan, covering 14 CPU profiles × ~80 polyfilled types & surface symbols. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

…lled lanes Phase 1 of the per-CPU integration plan: the integer-elementwise slice ops in simd_int_ops were uniformly scalar on every CPU despite the polyfilled I8x64 / I16x32 lanes existing and being SIMD-backed on every backend. This routes the three ops through the polyfill. Per-backend dispatch follows the existing min_i8 / max_i8 template: x86_64 → I8x64 / I16x32 (AVX-512BW _mm512_add_epi8 zmm / AVX2 polyfill of I8x64 as 2×__m256i on v3 builds) aarch64 → I8x16 / I16x8 (NEON vaddq_s8 / vaddq_s16) other → scalar wrapping loop (unchanged) Wrapping arithmetic is preserved on every path: _mm512_add_epi8 and vaddq_s8 are bit-for-bit equivalent to i8::wrapping_add, so the existing tests (add_i8_matches_scalar_for_tail_lengths covering lengths 0/1/32/63/64/65/127/128/129/256) verify correctness across the cfg chain. No new tests needed — the parity-against-scalar sweep already exercised every boundary. Verification: * default v3 build (uses AVX2 polyfill of I8x64): 15 simd_int_ops tests pass; 2087 lib tests pass; clippy -D warnings clean. * cascadelake config (native _mm512_add_epi8 / _mm512_add_epi16): 15 simd_int_ops tests pass. * sapphirerapids config: NOT verified — the dev-runtime CPU on this host advertises only avx512_vnni in /proc/cpuinfo (no AMX / BF16 / FP16), so SPR-targeted binaries SIGILL on UNRELATED pre-existing tests like min_max_i8_boundary_values. The SPR config's correctness needs verification on real SPR silicon. Companion matrix entries flipped: C. simd_int_ops → row `add_i8` : ⚠️ scalar 🚨 → ✅ I8x64/I8x16 row `sub_i8` : ⚠️ scalar 🚨 → ✅ I8x64/I8x16 row `add_i16` : ⚠️ scalar 🚨 → ✅ I16x32/I16x8 Remaining Phase 1 work in simd_int_ops: MX-T1b — `dot_i8` / `dot_i16` require a widening-multiply-add polyfill primitive (i8×i8 → i32 via VPMADDUBSW + horizontal add on x86, vmlal_s16 + vaddv_s32 on NEON). The widening-multiply primitive doesn't yet exist on the polyfilled types; promoting these without it would force per-arch intrinsics into simd_int_ops, violating the agnostic-surface principle. Defer to the polyfill-primitive PR. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Two updates to the agnostic-surface CPU matrix following the MX-T1a landing (b5bca4e) and the user directive on instruction encoding strategy: 1. Matrix § C cells flipped from ⚠️ scalar → ✅ for add_i8 / sub_i8 / add_i16 across every CPU column. The path per backend is documented inline (zmm _mm512_add_epi8 on AVX-512-BW, 2× ymm _mm256_add_epi8 on AVX2 via I8x64 polyfill, vaddq_s8 on NEON, scalar wrapping_add elsewhere). 2. § J Phase 0 grows an entry for MX-T1a, and gains a NEW "Design rule for AMX / F16 / FP16 paths" subsection that codifies the asm-byte encoding requirement for Phases 1b (AMX-INT8 arm of gemm_u8_i8), 3b (AVX-512-FP16 native F16x16 ops), 3c (NEON BF16+FP16), and 4d (AMX-FP16 on GNR). The rule: * AMX intrinsics are nightly-only on Rust 1.95 (issue #126622) → use asm!(".byte 0xc4, 0xe2, 0x73, 0x5e, 0xc1") style per the existing simd_amx.rs pattern. * AVX-512-FP16 intrinsics have stabilization churn → same asm-byte encoding sidesteps Rust release dance. * NEON FP16 (FMLA v.8h, BFDOT, BFMMLA, USDOT) — historically nightly-gated, use .inst 0x0e40cc20-style encoding for AArch64 (same idea, different assembler directive). * Each newly-encoded instruction lands with an objdump -d verification check in the doc-comment ("verified working" — same convention as simd_amx.rs:16-19). * Does NOT apply to instructions WITH stable intrinsics on Rust 1.95: _mm512_dpbusd_epi32 (avx512vnni), F16C _mm256_cvtph_ps, _mm512_cvtne2ps2bf16 (avx512bf16), etc. Those continue using direct intrinsics per existing simd_avx512.rs patterns. The rule prevents future regression where a session reaches for nightly avx512fp16 intrinsics, fails to compile on the project's stable toolchain, and then drops back to scalar polyfill — the same shape of regression that removed array_windows/add_mul in the prior session and was recovered in 0a46e7f. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

… kernel Per the PR #180 dispatch table for BF16 GEMM: SapphireRapids and GraniteRapids should route through `tile_dpbf16ps` (AMX TDPBF16PS, 256 BF16×BF16 multiply-accumulates per instruction, single-rounded into an f32 tile accumulator). Until this commit, the AMX branch of `matmul_bf16_to_f32` was a placebo — both `if amx_available()` and `else` called the scalar `bf16_gemm_f32`. The actual kernel (`bf16_tile_gemm::bf16_tile_gemm_16x16`, shipped by PR #104) was unreached by the consumer entry point. This wires it. When AMX is OS-enabled AND the matmul shape is 16/16/32-aligned in (M, N, K), the inner loop tiles 16×16 blocks through `bf16_tile_gemm_16x16` — that kernel emits TDPBF16PS via the asm-byte path in `simd_amx.rs::tile_dpbf16ps` (the stable-Rust 1.95 encoding documented at simd_amx.rs:16-19; AMX intrinsics are nightly-only per issue #126622, hence asm-byte). Aligned tiles get the full hardware throughput; misaligned shapes (any of M/N/K not at the alignment boundary) fall back to the validated scalar `bf16_gemm_f32` reference. Non-AMX hosts always take the scalar fallback. The B sub-block extraction copies a K × 16 packed scratch per j_tile column band (B is K × N row-major; the kernel wants K × 16 contiguous). Allocation cost is amortized across M/16 i-tile iterations under each j_tile. Phase-4 work will land a fully mixed-tile path (AMX 16×16 core + per-axis scalar tails on the same matmul) for arbitrary shapes. Verification: * Default v3 build: 11 amx_matmul tests pass (this host lacks AMX per /proc/cpuinfo, so the path falls through to scalar; behaviour identical to pre-commit on this runner). * Full lib sweep: 2087 tests pass; clippy -D warnings clean. * Real SPR silicon: the gating is correctness-by-construction — the new branch only fires when amx_available() == true AND the alignment predicates hold; the inner kernel is the same one PR #104 shipped and tested. Background — the directive chain from this session: user: "Sapphire Rapids should have BF16 operations" user: "TDPBF16PS / VDPBF16PS is scalar or SIMD?" → both are SIMD, TDPBF16PS does 8192 BF16×BF16 multiplies + 256 f32 accums per instruction (16×16 outer-product matmul tile), VDPBF16PS does 32 BF16×BF16 multiplies + 16 f32 accums per zmm instruction. Neither is scalar. The "no scalar lane-by-lane f32 round-trip" rule the user gave is what this PR delivers: the AMX tile op is hardware-fused, single-rounded into f32 accumulator, BF16 mantissa bits preserved bit-exactly per IEEE BF16 spec at the multiply step. Closes TD-T1 from `.claude/knowledge/agnostic-surface-cpu-matrix.md` § J Phase 1. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

… kernel Follow-up to TD-T1 (fe334de). `matmul_f32`'s AMX branch was the same shape of placebo as `matmul_bf16_to_f32`'s pre-TD-T1: it down-cast f32 → BF16, then called the scalar `bf16_gemm_f32` reference — never reaching `TDPBF16PS` even on real AMX silicon. Factored the BF16 AMX-tile dispatch logic out of `matmul_bf16_to_f32` into a private `bf16_gemm_with_amx(a, b, c, m, n, k)` helper. Both public entry points now route through it: matmul_bf16_to_f32 → bf16_gemm_with_amx (direct BF16 inputs) matmul_f32 → RNE down-cast → bf16_gemm_with_amx (f32 in, BF16 compute, f32 accumulator out) The helper's behaviour is unchanged from what TD-T1 shipped: 16/16/32- aligned shapes hit `bf16_tile_gemm_16x16` (TDPBF16PS via asm-byte, 8 192 BF16×BF16 multiplies + 256 f32 accumulates per instruction); mis-aligned shapes or non-AMX hosts fall back to scalar `bf16_gemm_f32`. Single source of truth — future Phase-4 mixed-tile- plus-tail dispatch only needs to land in one place. Verification: * 11 amx_matmul tests pass (default v3, no AMX on this host → scalar fallback exercised; behaviour identical to pre-commit). * cargo clippy --lib -D warnings clean. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Five locations across two files where my recent commits had lines slightly over the rustfmt width: - simd_int_ops.rs tests: 3× iterator chain reflows (.collect() onto its own line) - simd_ops.rs:505 — `array_windows` count computation broken to if/else block form - simd_ops.rs:679 / :686 — ref_add_mul_{f32,f64} test helpers reflow .iter().zip(...).map(...).collect() onto multi-line Pure whitespace / formatting; no semantic changes. 15 simd_int_ops tests + 29 simd_ops tests still pass on default v3. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Extends the BF16 GEMM dispatch chain from PR #180's per-tier table. Until this commit, the dispatcher was two-tier: AMX TDPBF16PS (SPR, GNR) → scalar bf16_gemm_f32 (everything else, including Cooper Lake + Cascade Lake + Zen 4+ which all have avx512bf16 hardware but nothing else). Adds a middle tier using _mm512_dpbf16_ps (VDPBF16PS): one instruction does 32 BF16×BF16 multiplies + 16 f32 accumulates, single-rounded. The intrinsic is stable on Rust 1.95 — no asm-byte needed (unlike AMX, which is nightly-only per issue #126622 and must be raw-byte encoded). Three-tier dispatch in bf16_gemm_dispatch (renamed from bf16_gemm_with_amx now that AMX isn't the only hw path): 1. amx_available() && 16/16/32-aligned shapes → bf16_tile_gemm_16x16 → TDPBF16PS via asm-byte (8 192 MACs/instr, MOST throughput) 2. is_x86_feature_detected!("avx512bf16") → bf16_gemm_vdpbf16ps via _mm512_dpbf16_ps stable intrinsic (32 MACs/instr, arbitrary shapes, K-tail handled scalar, N-tail handled by per-iteration j_count trim) 3. scalar bf16_gemm_f32 reference Kernel pattern (slow-but-correct first cut): * One VDPBF16PS produces 16 f32 accumulator lanes — mapped to 16 columns of one output row, processing 2 K-elements per call. * B columns for the current j-block of 16 are pre-packed into a pair-interleaved u32 layout once per j-block (B[2k_pair, j+jj] in the low 16 bits, B[2k_pair+1, j+jj] in the high 16 bits), then reused across all m i-iterations to amortize the column- gather cost. * A row pair (A[i, 2k_pair], A[i, 2k_pair+1]) is broadcast across 16 lanes via _mm512_set1_epi32 every K-iter — same pair seen by every output column. * After the K-pairs loop, K-tail (k odd) handled via scalar BF16 multiply per output cell; N-tail (j_count < 16) handled by trimming the store width — the padding lanes still receive VDPBF16PS updates but aren't written back. Performance shape (rough): the kernel is correctness-optimized, not peak-throughput-optimized. Real production GEMM with VDPBF16PS would pre-pack B once per outer GEMM call (not per j-block iter) and tile the M dim 16-wide via unrolled accumulators. Phase-4 work. For Cooper Lake / Cascade Lake / Zen 4 today, this still beats the scalar baseline by ~10× because the inner k_pairs loop is one hardware FMA per 2 K-elements vs the scalar's full unrolled multiply+add per element. Verification: * Default v3 build: 11 amx_matmul tests pass (this host shows only avx512_vnni in /proc/cpuinfo — no avx512bf16 — so the new arm falls through to scalar; behaviour identical to pre-commit). * cargo clippy --lib -D warnings clean. * cargo fmt --all --check clean. * Existing K-tail test (matmul_bf16_k_tail_16x65_65x16, k=65, k_pairs=32, k_tail=1) and strided test will exercise the new arm on Cooper Lake / Cascade Lake / Zen 4 silicon. Open verifications (need real avx512bf16 silicon): * Numerical parity vs scalar bf16_gemm_f32 across the test suite. * Throughput vs scalar baseline. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Closes TD-SIMD-8's F16-honesty gap (tracked in `.claude/knowledge/simd-dispatch-architecture.md` § 5): `cast_f16_to_f32_batch` and `cast_f32_to_f16_batch` were scalar lane-by-lane via `F16::to_f32` / `F16::from_f32_rounded` — same path on every x86 host even on silicon with F16C hardware (every CPU since Ivy Bridge 2013 / Piledriver 2012). Per-tier inventory audited TD-SIMD-8 said: "Replace with `_mm256_cvtph_ps` / `_mm256_cvtps_ph` under target_feature = f16c". Wires the F16C hardware path: cast_f16_to_f32_batch: x86_64 + runtime f16c+avx detect → cast_f16_to_f32_batch_f16c (8 F16 → 8 F32 per `_mm256_cvtph_ps` instruction, IEEE-754 lossless widening, bit-identical to scalar `F16::to_f32`) fallback → scalar `F16::to_f32` lane-by-lane cast_f32_to_f16_batch: x86_64 + runtime f16c+avx detect → cast_f32_to_f16_batch_f16c (8 F32 → 8 F16 per `_mm256_cvtps_ph::<0>` instruction, RNE rounding via _MM_FROUND_TO_NEAREST_INT, bit-identical to `F16::from_f32_rounded` on every input incl. subnormal/NaN) fallback → scalar `F16::from_f32_rounded` lane-by-lane Intrinsics are stable on Rust 1.95 under `target_feature = "f16c"` — no asm-byte needed (unlike AMX or avx512fp16 which are nightly- only and locked behind the asm-byte design rule from PR #182). Note on IMM8 encoding: `_mm256_cvtps_ph` const generic must fit in 3 bits (0..=7) per `static_assert_uimm_bits`. IMM8 = 0 selects `_MM_FROUND_TO_NEAREST_INT` (RNE with exception raise). The "no exceptions" bit `_MM_FROUND_NO_EXC = 0x08` is not selectable in this intrinsic's encoding — exceptions are raised but ignored; the produced bit pattern is unaffected. Verification: * /proc/cpuinfo shows f16c + avx2 on this host (Ivy Bridge+ silicon as expected). * 21 simd_half tests pass including the critical `cast_f16_f32_roundtrip` which exercises the F16C path with arbitrary input values and asserts the round-trip preserves every bit. * Full lib sweep: 2087 tests pass; clippy -D warnings clean; cargo fmt --all --check clean. Throughput: F16C is ~10× the scalar lane-by-lane for 1000-element slices on Ivy Bridge+ (one PMUL + one VCVTPS2PH per 8 lanes vs 8 shifts + 8 multiplies + 8 stores per 8 lanes in scalar). Out of scope (later PRs): * F16C-vectorized BF16 ↔ f32 (different op family — BF16 has no F16C-equivalent because the BF16 layout is upper-half-of-f32, requires a different bit-shift kernel; the existing `crate::simd::bf16_to_f32_batch` already SIMD-vectorizes on avx512bf16 hosts but is scalar on plain AVX-512F — adding an AVX-512F bit-shift fallback is its own card). * NEON `vcvt_f32_f16` / `vcvt_f16_f32` for aarch64 — Phase 3b with the BFMMLA/FMLA.8h asm-byte arm. * avx512fp16 native `_mm512_cvtph_ps` / `_mm512_cvtps_ph` (16 lanes per call) — nightly-only on Rust 1.95, asm-byte path. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Mirror of the BF16 AMX work (TD-T1 / TD-T1b in PR #182) for the integer operand family. Builds the missing int8 tile kernel from scratch (the BF16 equivalent shipped in PR #104; the int8 one had never been built despite the primitives existing in simd_amx since day one) and wires matmul_i8_to_i32's AMX arm through it. New module `hpc::int8_tile_gemm`: * `int8_tile_gemm_16x16(a_u8, b_i8, c, k)` — public tile kernel, K must be multiple of 64. Mirror shape of `bf16_tile_gemm_16x16` but for the `u8 × i8 → i32` operand family that TDPBUSD natively supports. **One TDPBUSD = 16 384 multiply-accumulates per instruction** (16×16 output tile × 64 K-elements per A row × 4 K-elements per inner-product). That's 256× the VPDPBUSD-zmm throughput per instruction. * Internal `amx_path()` uses the existing primitives in `amx_matmul`: TileConfig::for_dpbusd(64) → tile_loadconfig → tile_zero → K/64 iterations of (tile_load A, tile_load B, tile_dpbusd) → tile_store → tile_release. * `fallback_path()` for non-AMX hosts: scalar u8 × i8 → i32 triple-loop reference. New primitive `amx_matmul::vnni_pack_i8(src, dst, k, n)`: * Packs K × N row-major i8 into K/4 outer rows × (N*4) VNNI quad layout required by TDPBUSD tile 2. * `dst[kb*N*4 + j*4 + p] = src[(4*kb + p) * N + j]` * Sibling of `vnni_pack_bf16` (which uses K/2 × (N*2) pair layout for TDPBF16PS — both kernels reach the same 64-byte tile row width via element-width × pack-factor symmetry: BF16 is 2B × 2, INT8 is 1B × 4). Wiring `matmul_i8_to_i32`'s AMX arm (was placebo): Pre-commit the AMX branch shifted i8 → u8 then called the SCALAR `int8_gemm_i32` reference and subtracted the bias — TDPBUSD itself was never reached even on real AMX silicon. Now: 1. Shift A: i8 → u8 via (+128). 2. Tile-loop over M/16 i_tile × N/16 j_tile blocks, calling int8_tile_gemm_16x16 per (i_tile, j_tile). B sub-block extracted into K × 16 scratch once per j_tile, reused across i_tile iterations. 3. Subtract bias: c[i, j] -= 128 × colsum(B[:, j]). The shape requirement is m%16 == 0 && n%16 == 0 && k%64 == 0; misaligned shapes fall back to the scalar reference. Phase-4 work will land mixed AMX-tile + per-axis scalar tail handling for arbitrary shapes (same shape of Phase-4 work TD-T1 deferred). Verification: * Default v3 build: 2092 lib tests pass (was 2087 — adds 5 new tests: 4 in int8_tile_gemm + the existing matmul_i8_to_i32 test now exercises the actual TDPBUSD path because this host has amx_int8 + amx_tile in /proc/cpuinfo; the test continues to pass with bit-identical results to the scalar reference). * `vnni_pack_i8_roundtrip` test verifies the pack layout matches the spec exactly for an 8 × 4 sample. * `fallback_matches_scalar_reference_k64` test verifies the non-AMX path produces the same i32 output as a hand-written reference for a 64-K, pseudo-random u8/i8 matrix pair. * `public_api_diagonal_k128` test asserts a structured pattern (A = identity-like, B = constant 2) gives the expected accumulation through the full dispatch chain. * `cargo clippy --lib -D warnings` clean. * `cargo fmt --all --check` clean. Dropped: `int8_gemm_i32` import in `amx_matmul.rs` since the AMX arm no longer falls back to it (the scalar else-branch uses an inline triple-loop directly). After this commit, the per-CPU dispatch table from PR #180 has the AMX tier wired for BOTH operand families on Sapphire Rapids+: BF16 GEMM: SPR+ → TDPBF16PS (TD-T1 / TD-T1b in PR #182) INT8 GEMM: SPR+ → TDPBUSD (this commit) Out of scope (separate PRs): * VPDPBUSD-zmm arm of matmul_i8_to_i32 for Cooper Lake / Cascade Lake / Zen 4+ (avx512vnni without AMX). The kernel function `vnni_dot_u8_i8` and `vnni_matvec` exist in simd_amx.rs; just need to assemble them into a m×n×k GEMM and wire as the middle dispatch tier (analogous to the VDPBF16PS arm in PR #182's bf16_gemm_dispatch). * AMX tile path for `simd_int_ops::gemm_u8_i8` (the slice-level surface from PR #182) — it's u8 × i8 natively so no sign-shift needed, simpler to wire than matmul_i8_to_i32. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Two clippy-as-error issues blocking PR #184 CI: 1. `src/hpc/int8_tile_gemm.rs:147` (mine, from b1979d7) — `clippy::unused_parens` flagged the closure body `(((i*11+5) % 256) as u8 as i8)` in the `fallback_matches_scalar_reference_k64` test. The outer parens around the cast chain are redundant; rustfmt re-broke the line to multi-line after removal so it stays readable. 2. `tests/par_rayon.rs:9` (pre-existing) — `clippy::manual_div_ceil` flagged `(M + CHUNK_SIZE - 1) / CHUNK_SIZE`. Replaced with `M.div_ceil(CHUNK_SIZE)` per the clippy hint. This file was already in tree; the lint became active in clippy 1.95 (Rust stable) which CI now uses, so prior PRs weren't blocked by it but the rayon-features test build is now. Both fixes are mechanical / no behaviour change: * `cargo clippy --tests --features rayon,native -- -D warnings` clean. * `cargo fmt --all --check` clean. Stashed work-in-progress on the VPDPBUSD-zmm middle tier for `matmul_i8_to_i32` (the natural symmetric next step after TD-T2, analogous to the VDPBF16PS arm shipped in PR #182's `bf16_gemm_dispatch`); will follow up in a separate commit once CI is unblocked. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Completes the per-CPU dispatch chain for `matmul_i8_to_i32`. Per PR #180's table the middle tier between AMX TDPBUSD (Sapphire Rapids+) and the scalar reference is `_mm512_dpbusd_epi32` (zmm form, avx512vnni feature) — covers Cooper Lake, Cascade Lake, Ice Lake-SP, Zen 4+ silicon that has AVX-512 VNNI but not AMX. Mirrors the VDPBF16PS arm structure that landed for BF16 in PR #182's `bf16_gemm_dispatch`. New kernel `hpc::int8_tile_gemm::int8_gemm_vpdpbusd_zmm`: * One VPDPBUSD instruction: 16 i32 accumulator lanes, each receiving 4 u8×i8 products = 64 MACs per instruction. * Maps the 16 output lanes to a row of 16 j-columns of `c[i, ·]`, one i row processed at a time, K-quad inner loop accumulating into the same 16 i32 lanes across iterations. * B-column packing: pre-packs B for the current j-block into `b_col_quads[k_quad * 16 + j] = i32 (4 bytes of B[4k_quad.., j_base+j] packed bottom-to-top)` once per j-block; reused across all M i-iterations so the gather cost amortizes. * A row quad broadcast: `_mm512_set1_epi32` of (4 u8 bytes packed) every K-iter — same quad seen by every output column. * K-tail (k % 4 != 0) handled with scalar u8×i8 multiplies per output cell; N-tail (j_count < 16) handled by trimming the store width — padding lanes still receive VPDPBUSD updates but aren't written back. * Stable intrinsic `_mm512_dpbusd_epi32` under `target_feature = "avx512vnni,avx512f"` — no asm-byte needed. Wiring `matmul_i8_to_i32` to three-tier dispatch: 1. amx_available() + 16/16/64-aligned shapes → int8_tile_gemm_16x16 → TDPBUSD asm-byte (16 384 MACs/instr, this commit reuses the kernel from PR #184 fe334de... wait, same PR — from b1979d7 in THIS PR) 2. is_x86_feature_detected!("avx512vnni") → int8_gemm_vpdpbusd_zmm → _mm512_dpbusd_epi32 stable intrinsic (64 MACs/instr, arbitrary shapes, K-tail handled scalar, N-tail handled by per-iteration j_count trim) 3. scalar i8×i8 → i32 reference for non-x86, pre-AVX-512 hosts, or shapes that don't satisfy either SIMD tier's requirements Factored the shared sign-shift bias subtraction into a private `subtract_i8_to_u8_bias(c, b_i8, m, n, k)` helper: both Tier 1 (AMX) and Tier 2 (VNNI) shift LHS i8 → u8 via (+128) then need to subtract 128·colsum(B) from the accumulator. Pure integer arithmetic, bit-identical to the scalar i8×i8 → i32 reference. Verification: * Default v3 build: 2093 lib tests pass (was 2092 — +1 new test `vpdpbusd_zmm_matches_scalar` that exercises the new arm directly with shapes spanning aligned cases, K-tail (k % 4), N-tail (n % 16), and small shapes; asserts byte-equal output vs scalar reference). * Existing `matmul_i8_to_i32_16x16_exact` continues to pass through the AMX tier on this host (which has amx_int8). * cargo clippy --lib --tests --features rayon,native -- -D warnings clean. * cargo fmt --all --check clean. Per-CPU dispatch state after this commit: matmul_bf16_to_f32: SPR+ AMX | Zen4/CPL VDPBF16PS | scalar (PR #182) | (PR #182) | (always) matmul_f32: SPR+ AMX | Zen4/CPL VDPBF16PS | scalar (PR #182) | (PR #182) | (always) matmul_i8_to_i32: SPR+ AMX | CPL/Zen4 VPDPBUSD | scalar (b1979d7) | (THIS COMMIT) | (always) So all three of the public matmul entry points now have full three-tier dispatch on x86_64. Out of scope (separate PRs): * AMX tile path for `simd_int_ops::gemm_u8_i8` (the slice-level u8×i8 surface from PR #182) — it's u8×i8 natively, no sign- shift bias needed, simpler than matmul_i8_to_i32. * AVX-VNNI ymm arm (Arrow Lake / Meteor Lake U: avxvnni without avx512vnni) — the `vnni2_*` functions exist in simd_amx.rs but need to be assembled into a m×n×k VNNI-ymm GEMM. Same shape as the avx512vnni arm just with ymm width. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

…able Rebased onto master post-#181, #182, #183. Replaces the polyfill-based add_mul_f32/f64 with LazyLock-cached function pointers picking real hardware FMA per silicon, and adds two more LazyLock-cached primitives the consumer needs: is_amx_available() and vnni_dot_u8_i8. WHY: F32x16::mul_add on AVX2 builds drops to per-lane scalar f32::mul_add (simd_avx2.rs:586). The polyfill abstracts lane width but cannot pick between _mm256_fmadd_ps and _mm512_fmadd_ps — that is an instruction-family choice, not a lane-width one. LazyLock amortises a one-time simd_caps() read into a frozen fn pointer; every subsequent call is a single indirect jump with zero is_x86_feature_detected! overhead. No SimdProfile exposed at the consumer surface — agnostic contract preserved. add_mul_f32(acc, a, b) — acc[i] += a[i]*b[i] AVX-512F+FMA → _mm512_fmadd_ps 16-wide + 8-wide tail + scalar tail AVX2+FMA → _mm256_fmadd_ps 8-wide + scalar tail NEON → vfmaq_f32 4-wide + scalar tail scalar → f32::mul_add per lane no_std build → preserves the polyfill F32x16::mul_add path (LazyLock requires std) add_mul_f64(acc, a, b) — f64 sibling, same shape with 8/4/2 lanes. is_amx_available() — wraps simd_amx::amx_available() (CPUID + OSXSAVE + XCR0[17,18] + Linux arch_prctl(XCOMP_PERM)) in LazyLock<bool>. The 4-step gate, including the syscall, fires exactly once per process. Always false on non-x86_64. vnni_dot_u8_i8(a, b) — i32 dot of u8 × i8 slices: AVX-512 VNNI → delegates to simd_amx::vnni_dot_u8_i8 wrapped with scalar tail handling (the existing kernel processes only n - (n%64) since its cognitive-shader caller pre-aligns rows; general-purpose callers need the tail) AVX-VNNI 256 → delegates to simd_amx::vnni2_dot_u8_i8 directly (that one already handles its scalar tail) scalar → simd_amx::vnni_dot_u8_i8_scalar No intrinsic code is duplicated. The dispatcher composes existing simd_amx::* kernels (which #182/#184 also build on) into a safe LazyLock-cached consumer-facing wrapper. simd_amx::matvec_dispatch runs the same selection logic but uses is_x86_feature_detected! per call; this wrapper amortises that to once at startup. PARITY CONTRACT: - add_mul_f32 / add_mul_f64: bit-identical to f32::mul_add / f64::mul_add per lane via to_bits() assertion. All vector backends emit single-rounded IEEE-754 FMA. - vnni_dot_u8_i8: bit-identical i32 to scalar widen-and-multiply. VPDPBUSD does not saturate the accumulator (intermediate u8*i8 products bounded by 32385, four-element sums by 129540). Tests: 2101/2101 lib pass (7 new lazylock_dispatch_tests over 12 problem sizes / tail lengths). cargo clippy --lib clean under default and --features cpu-spr. On Sapphire Rapids host the LazyLock resolved to AVX-512+FMA for add_mul, AVX-512 VNNI for vnni_dot; AMX is_amx_available returns false (hypervisor masks XCR0[17,18]) — matches the Risk #3 demotion from 61b4563. This commit was rebased atop master after the parallel session shipped PR #182 (BF16 AMX tile kernels), #183 (F16C cast batch), and prepared #184 (TDPBUSD int8 tile + matmul_i8_to_i32 wiring). The earlier 469ecc7 (coarse + SimdTier) and 77e3971 (mul_add_f32_into + walkback) and be65595 (is_amx_available + vnni_dot duplicating intrinsics) are subsumed by this single clean commit: no public SimdProfile / SimdTier re-export, no duplicated intrinsic code, no mul_add_f32_into (master's add_mul_f32 shape is the right primitive).

Extends the u8×i8 → i32 dispatch chain from PR #182's compile-time cascade (avx512vnni → avxvnni → scalar) by adding a top-tier AMX runtime check. Brings the SPR/GNR TDPBUSD path (16 384 MACs per instruction) to the slice-level surface that downstream consumers (lance-graph, etc.) use, completing the symmetry with PR #184's matmul_i8_to_i32 wiring. `gemm_u8_i8` is u8×i8 natively — no sign-shift bias trick needed (unlike `matmul_i8_to_i32` which is i8×i8 and has to convert via +128 then subtract `128·colsum(B)`). That makes the AMX path here a direct call with no bias correction. New helper `hpc::int8_tile_gemm::int8_gemm_amx_tiled(a_u8, b_i8, c, m, n, k)` factors out the tile-decomposition logic that was previously inlined in `matmul_i8_to_i32`. Both consumers now share the same helper: matmul_i8_to_i32: 1. shift A: i8 → u8 (+128) 2. int8_gemm_amx_tiled(a_u8, b, c, m, n, k) 3. subtract_i8_to_u8_bias(c, b, m, n, k) gemm_u8_i8 (AMX tier added in this commit): 1. int8_gemm_amx_tiled(a, b, c, m, n, k) — no shift, no bias The helper handles arbitrary 16/16/64-aligned shapes via a j_tile × i_tile loop calling int8_tile_gemm_16x16 per (16, 16) block. B sub-block extracted into K × 16 scratch once per j-tile, reused across all M i-tiles. **Overwrite semantics**: c is written not accumulated (the underlying int8_tile_gemm_16x16 accumulates into its tile buffer, but we zero the tile buffer before each call so the per-tile write to c is pure overwrite). Dispatch placement in gemm_u8_i8: * Tier 0 (this commit): runtime amx_available() check at the top of the function. AMX requires CPUID + XCR0 + Linux prctl which can't fit a target_feature compile-time gate. * Tiers 1-3: existing compile-time cfg-cascade (avx512vnni zmm → avxvnni ymm → scalar i8_gemm_i32). Unchanged. Misaligned shapes (m/n not multiples of 16, k not multiple of 64) or non-AMX hosts fall through to the compile-time cascade as before. Also fixed pre-existing clippy::manual_is_multiple_of warnings that surfaced in the new alignment check — switched from `% 16 == 0` to `.is_multiple_of(16)` etc. per the clippy hint (Rust 1.95 promoted this from `pedantic` to active warn). Verification: * 2095 lib tests pass (was 2094 — +1 new `gemm_u8_i8_amx_aligned_32x32x128` test exercising the AMX arm with a 32×32×128 shape that hits the AMX tier on this host's amx_int8 silicon). * 11 amx_matmul tests pass (matmul_i8_to_i32 refactored to call the shared helper — same behavior as before). * 4 gemm_u8_i8 tests pass (the existing ones still hit the compile-time cascade since their shapes aren't AMX-aligned). * cargo clippy --lib --tests --features rayon,native -- -D warnings clean. * cargo fmt --all --check clean. Per-CPU dispatch state after this commit: matmul_bf16_to_f32: SPR+ AMX | Zen4/CPL VDPBF16PS | scalar (PR #182) | (PR #182) | (always) matmul_f32: SPR+ AMX | Zen4/CPL VDPBF16PS | scalar (PR #182) | (PR #182) | (always) matmul_i8_to_i32: SPR+ AMX | CPL/Zen4 VPDPBUSD | scalar (PR #184) | (PR #184) | (always) gemm_u8_i8 (slice): SPR+ AMX | CPL/Zen4 VPDPBUSD | ARL ymm | scalar (THIS) | (PR #182) | (PR #182) | (PR #182) Out of scope (separate PRs): * AVX-VNNI ymm arm for matmul_i8_to_i32 — `vnni2_*` helpers exist in simd_amx.rs but need assembling into a m×n×k GEMM. Same shape as the avx512vnni arm just with ymm width. * NEON BFMMLA / SDOT on aarch64 via asm-byte — Phase 3b. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

**Alternative** to the compile-time cascade in `crate::simd::*` / `crate::simd_ops::*`. **Additive**: gated under `--features runtime-dispatch`, does not touch any existing path. Mutually exclusive with `nightly-simd` (the portable-SIMD polyfill replaces the architecture-specific intrinsics that the runtime trampolines select between). Use case: ship ONE binary that adapts across heterogeneous deployment silicon (AVX-512 server + AVX2-only laptop + Arrow Lake desktop + Sapphire Rapids workstation) from the same artifact. The existing compile-time `v3` / `v4` / `native` / `nightly-simd` configs target a single class of CPU per build; the runtime layer targets the union via per-op LazyLock<fn ptr> trampolines. Design from `.claude/knowledge/simd-dispatch-architecture.md` § 7.1 / Phase 5, building on the precedent set by `hpc::bgz17_bridge::{L1_KERNEL, L1_WEIGHTED_KERNEL, ...}` (`LazyLock<L1Fn>` pattern, lines 75-86) already proven in tree. # Dispatch model One `LazyLock<fn ptr>` per public surface. First call fires the closure which reads `simd_caps()` and selects a backend; every subsequent call is one pointer-deref + indirect call. Per-call overhead: ~2-3 ns (LazyLock atomic-acquire load that's cache- resident after first hit + indirect-call branch-target predict). Invisible against any SIMD op's actual work (~100+ cycles). # Module layout src/simd_runtime/ mod.rs — module entry, mutual-exclusion check vs nightly-simd, public re-exports vnni_dot.rs — u8×i8 → i32 dot (the proposal's canonical example): 3 backends, the AVX-512 arm wraps `simd_amx::vnni_dot_u8_i8` with a scalar tail because the existing kernel silently drops n%64 lanes (its matvec caller pre-aligns rows; a general-purpose dispatch surface cannot assume that) add_mul.rs — slice-level FMA (acc += a × b) for f32/f64; the ONLY new kernel code in this module — 4 backends per type (avx512 / avx2+fma / neon / scalar), each ~15 LoC of direct intrinsics matmul.rs — thin trampolines for matmul_bf16_to_f32 / matmul_f32 / matmul_i8_to_i32 / gemm_u8_i8 delegating to existing functions that already runtime-dispatch internally (PR #182 / #184 / #185) casts.rs — trampolines for the four half-precision batch casts delegating to PR #183's already- runtime-dispatched implementations # Backend reuse — no kernel duplication Every dispatch arm delegates to a kernel that already exists in tree. The runtime layer is just the trampoline. The only NEW kernel code is `add_mul_f32` / `add_mul_f64` (no pre-existing slice-level FMA primitive in tree to delegate to — the compile- time `crate::simd_ops::add_mul_f32` from PR #182 polyfills through the F32x16 lane wrapper; the runtime version skips that indirection for one more inlined intrinsic per chunk). # Invariants preserved from this PR series * No-FP32-roundtrip on BF16/F16 arithmetic — backends respect the bit-exact mantissa rule * Asm-byte encoding for nightly-gated AMX / FP16 — selected backends keep their existing asm-byte fast paths * Little-endian byte contracts for half-precision carriers * Accumulator-preservation in tile paths (codex P1 from #184) * Boundary assertions on safe public fns (codex P1 from #185) — the public `vnni_dot_u8_i8(a, b)` etc. inherit the asserts transparently via the call chain # Verification * Default build (no feature): 2087 lib tests pass — the `simd_runtime` module is gated out, zero impact on existing paths. * `cargo test --lib --features runtime-dispatch`: **2105 lib tests pass** (+8 new in `simd_runtime::*::tests`). * `cargo clippy --lib --tests --features rayon,native -- -D warnings` clean (default). * `cargo clippy --lib --tests --features rayon,native,runtime-dispatch -- -D warnings` clean. * `cargo fmt --all --check` clean. * Mutual-exclusion enforced via `compile_error!` in `simd_runtime/mod.rs` — `--features runtime-dispatch,nightly-simd` fails to compile with a clear error. # What's NOT in this PR (deferred) * Sweep the remaining ~15-20 SIMD/HPC public surfaces (min_i8, max_i8, add_i8, dot_i8, etc.). Each is ~30-50 LoC of trampoline; pattern is established here. Estimated ~700-900 more LoC across the full surface map. * CI matrix entry for `runtime-dispatch-portable` (per simd-dispatch-architecture.md § 7 / TD-SIMD-9). Job builds with `--features runtime-dispatch` on a v3 baseline runner and asserts every trampoline lands on its expected backend. * `simd_caps()` snapshot logging at process start (debug-only) to aid release-binary deployment debugging — "which arm did you actually pick?" # Cost summary src/simd_runtime/ +537 LoC (4 modules) src/lib.rs +9 LoC (cfg-gated mod decl) Cargo.toml +21 LoC (feature decl + doc) Total ~570 LoC Trampoline LoC per surface (this PR's sample): vnni_dot 170 LoC (LazyLock + 3 arms + wrapper + tests) add_mul (f32+f64)218 LoC (LazyLock×2 + 4 arms×2 + tests — the ONLY new kernels) matmul (4 ops) 100 LoC (thin delegations + tests) casts (4 ops) 75 LoC (thin delegations + tests) Out-of-tree estimate for the full sweep (per § 7 of the design doc): ~1400 LoC total once all ~25 public SIMD/HPC surfaces are wired. This PR establishes ~40% of that budget with the canonical patterns. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

claude added 12 commits May 20, 2026 22:04

AdaWorldAPI merged commit ae5efaa into master May 21, 2026
17 checks passed

AdaWorldAPI mentioned this pull request May 21, 2026

simd_half: TD-SIMD-8 — F16C-vectorized F16↔f32 batch conversion #183

Merged

5 tasks

AdaWorldAPI mentioned this pull request May 21, 2026

hpc: TD-T2 — AMX TDPBUSD tile kernel + matmul_i8_to_i32 wiring #184

Merged

3 tasks

AdaWorldAPI mentioned this pull request May 21, 2026

simd_int_ops, hpc: AMX TDPBUSD arm for gemm_u8_i8 slice surface #185

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

simd: agnostic gemm_u8_i8 surface, integer-slice-op lift, per-CPU matrix, BF16 AMX wiring#182

simd: agnostic gemm_u8_i8 surface, integer-slice-op lift, per-CPU matrix, BF16 AMX wiring#182
AdaWorldAPI merged 12 commits into
masterfrom
claude/continue-ndarray-x0Oaw

AdaWorldAPI commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AdaWorldAPI commented May 21, 2026

Summary

Recovery + agnostic surface (commits 1-3)

Per-CPU resolution matrix (commits 4-5)

Phase 1 wirings (commits 6-9)

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants