Skip to content

simd_half: TD-SIMD-8 — F16C-vectorized F16↔f32 batch conversion#183

Merged
AdaWorldAPI merged 3 commits into
masterfrom
claude/continue-ndarray-x0Oaw
May 21, 2026
Merged

simd_half: TD-SIMD-8 — F16C-vectorized F16↔f32 batch conversion#183
AdaWorldAPI merged 3 commits into
masterfrom
claude/continue-ndarray-x0Oaw

Conversation

@AdaWorldAPI
Copy link
Copy Markdown
Owner

Summary

Closes TD-SIMD-8 (the F16-honesty gap tracked in .claude/knowledge/simd-dispatch-architecture.md § 5): cast_f16_to_f32_batch and cast_f32_to_f16_batch were scalar lane-by-lane via F16::to_f32 / F16::from_f32_rounded on every host — including silicon with F16C hardware (every x86 since Ivy Bridge 2013 / Piledriver 2012).

This wires the F16C hardware paths:

  • cast_f16_to_f32_batch_mm256_cvtph_ps (8 F16 → 8 F32 per instruction, IEEE-754 lossless widening, bit-identical to F16::to_f32)
  • cast_f32_to_f16_batch_mm256_cvtps_ph::<0> (8 F32 → 8 F16 per instruction, IEEE-754 RNE via _MM_FROUND_TO_NEAREST_INT, bit-identical to F16::from_f32_rounded on every input — subnormal / NaN / Inf included)

Both intrinsics are stable on Rust 1.95 under target_feature = "f16c" — no asm-byte needed (unlike AMX, avx512fp16, NEON FP16, which are nightly-only and locked behind the asm-byte design rule from PR #182).

Runtime dispatch via std::is_x86_feature_detected!("f16c") && std::is_x86_feature_detected!("avx"); falls back to the scalar lane-by-lane reference on non-x86_64 hosts and on pre-F16C silicon (none in practice).

Notes

  • IMM8 encoding for _mm256_cvtps_ph: the const generic must fit in 3 bits (static_assert_uimm_bits enforces 0..=7). IMM8 = 0 = _MM_FROUND_TO_NEAREST_INT = RNE with exception raise; the "no exceptions" bit (_MM_FROUND_NO_EXC = 0x08) is not selectable in this intrinsic's encoding — exceptions are raised but ignored, the produced bit pattern is unaffected.
  • Throughput: F16C is ~10× the scalar lane-by-lane for typical 1000-element slices on Ivy Bridge+ (one PMUL + one VCVTPS2PH per 8 lanes vs 8 shifts + 8 multiplies + 8 stores per 8 lanes in scalar).

Out of scope (separate PRs)

  • F16C-vectorized BF16 ↔ f32 — different op family (BF16 is upper-half-of-f32 bit-shift, not the IEEE 754 conversion F16C implements). The existing crate::simd::bf16_to_f32_batch already SIMD-vectorizes on avx512bf16 hosts; adding an AVX-512F bit-shift fallback for non-BF16 silicon is its own card.
  • NEON vcvt_f32_f16 / vcvt_f16_f32 for aarch64 — Phase 3b with the BFMMLA / FMLA.8h asm-byte arm.
  • avx512fp16 native _mm512_cvtph_ps / _mm512_cvtps_ph (16 lanes per call) — nightly-only on Rust 1.95, asm-byte path.

Test plan

  • /proc/cpuinfo shows f16c + avx2 on this runner (Ivy Bridge+ silicon).
  • 21 simd_half tests pass including the critical cast_f16_f32_roundtrip which exercises the new F16C path with arbitrary input values and asserts the round-trip preserves every bit.
  • Full cargo test --lib: 2087 passed, 0 failed, 29 ignored.
  • cargo clippy --lib -- -D warnings clean.
  • cargo fmt --all --check clean.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u


Generated by Claude Code

Closes TD-SIMD-8's F16-honesty gap (tracked in
`.claude/knowledge/simd-dispatch-architecture.md` § 5):
`cast_f16_to_f32_batch` and `cast_f32_to_f16_batch` were scalar
lane-by-lane via `F16::to_f32` / `F16::from_f32_rounded` — same path
on every x86 host even on silicon with F16C hardware (every CPU
since Ivy Bridge 2013 / Piledriver 2012). Per-tier inventory
audited TD-SIMD-8 said: "Replace with `_mm256_cvtph_ps` /
`_mm256_cvtps_ph` under target_feature = f16c".

Wires the F16C hardware path:

  cast_f16_to_f32_batch:
    x86_64 + runtime f16c+avx detect → cast_f16_to_f32_batch_f16c
      (8 F16 → 8 F32 per `_mm256_cvtph_ps` instruction, IEEE-754
       lossless widening, bit-identical to scalar `F16::to_f32`)
    fallback → scalar `F16::to_f32` lane-by-lane

  cast_f32_to_f16_batch:
    x86_64 + runtime f16c+avx detect → cast_f32_to_f16_batch_f16c
      (8 F32 → 8 F16 per `_mm256_cvtps_ph::<0>` instruction, RNE
       rounding via _MM_FROUND_TO_NEAREST_INT, bit-identical to
       `F16::from_f32_rounded` on every input incl. subnormal/NaN)
    fallback → scalar `F16::from_f32_rounded` lane-by-lane

Intrinsics are stable on Rust 1.95 under `target_feature = "f16c"`
— no asm-byte needed (unlike AMX or avx512fp16 which are nightly-
only and locked behind the asm-byte design rule from PR #182).

Note on IMM8 encoding: `_mm256_cvtps_ph` const generic must fit in
3 bits (0..=7) per `static_assert_uimm_bits`. IMM8 = 0 selects
`_MM_FROUND_TO_NEAREST_INT` (RNE with exception raise). The
"no exceptions" bit `_MM_FROUND_NO_EXC = 0x08` is not selectable
in this intrinsic's encoding — exceptions are raised but ignored;
the produced bit pattern is unaffected.

Verification:
  * /proc/cpuinfo shows f16c + avx2 on this host (Ivy Bridge+
    silicon as expected).
  * 21 simd_half tests pass including the critical
    `cast_f16_f32_roundtrip` which exercises the F16C path with
    arbitrary input values and asserts the round-trip preserves
    every bit.
  * Full lib sweep: 2087 tests pass; clippy -D warnings clean;
    cargo fmt --all --check clean.

Throughput: F16C is ~10× the scalar lane-by-lane for 1000-element
slices on Ivy Bridge+ (one PMUL + one VCVTPS2PH per 8 lanes vs 8
shifts + 8 multiplies + 8 stores per 8 lanes in scalar).

Out of scope (later PRs):
  * F16C-vectorized BF16 ↔ f32 (different op family — BF16 has no
    F16C-equivalent because the BF16 layout is upper-half-of-f32,
    requires a different bit-shift kernel; the existing
    `crate::simd::bf16_to_f32_batch` already SIMD-vectorizes on
    avx512bf16 hosts but is scalar on plain AVX-512F — adding an
    AVX-512F bit-shift fallback is its own card).
  * NEON `vcvt_f32_f16` / `vcvt_f16_f32` for aarch64 — Phase 3b
    with the BFMMLA/FMLA.8h asm-byte arm.
  * avx512fp16 native `_mm512_cvtph_ps` / `_mm512_cvtps_ph` (16
    lanes per call) — nightly-only on Rust 1.95, asm-byte path.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cce37e1835

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/simd_half.rs
for c in 0..chunks {
let off = c * 8;
let f = _mm256_loadu_ps(src.as_ptr().add(off));
let h = _mm256_cvtps_ph::<0>(f);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use no-exception rounding mode for F16C downcast

cast_f32_to_f16_batch_f16c currently uses _mm256_cvtps_ph::<0>, which performs round-to-nearest-even but does not request exception suppression. This means conversions of NaN/Inf/overflow/underflow inputs can set MXCSR exception flags (and can trap if FP exceptions are unmasked), which is a behavior change from the previous pure bit-manipulation scalar path and contradicts the function-level contract that says “no exceptions.” Use the _MM_FROUND_NO_EXC variant (imm8 with bit 3 set) to preserve non-trapping behavior.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks — the underlying concern (the F16C path can set MXCSR flags that the scalar bit-fiddle path doesn't) is valid. The proposed fix isn't quite right though: _mm256_cvtps_ph's IMM8 is constrained to 3 bits in Rust stdarch (static_assert_uimm_bits!(IMM8, 3) — fails to compile for IMM8 = 8), and in the underlying VCVTPS2PH spec bit 3 selects MXCSR.RM not _MM_FROUND_NO_EXC (NO_EXC is an AVX-512 convention; F16C predates the SAE family). Only valid IMM8 values here are 0..=3 (the four rounding modes).

The right fix is MXCSR save/restore via inline asm!(stmxcsr/ldmxcsr) — landed in 1a73c37. STMXCSR before the SIMD region, LDMXCSR after, preserves every bit of the saved control/status word including the exception flags the SIMD path may have set. Net effect: callers observe zero MXCSR change vs. the scalar path. Inline asm rather than _mm_getcsr/_mm_setcsr because those wrappers are deprecated on Rust 1.95 stable (unsoundness across thread MXCSR visibility; the deprecation notice explicitly recommends inline asm).

Same fix applied to cast_f16_to_f32_batch_f16c since _mm256_cvtph_ps can also raise #I/#D on SNaN/denormal F16 inputs. New test f16c_cast_preserves_mxcsr exercises both directions with inputs that trigger every relevant exception (overflow/underflow/precision/invalid/denormal); snapshots MXCSR before and after via stmxcsr, asserts byte-equal. Test passes.

This fix preserves the MXCSR FLAG state. It does not prevent traps when the caller has unmasked FP exceptions before invoking us — those would fire from the SIMD ops themselves and bypass our restore. That's the same trap behaviour as any plain a + b on overflow-prone f32, and the default OS-set MXCSR has all exception masks set so it's a non-issue for the common case.


Generated by Claude Code

claude added 2 commits May 21, 2026 01:34
Closes the dispatch-table gap for BF16 decode on AVX-512F silicon
without the BF16 extension (Skylake-X, Cascade Lake, Ice Lake-SP).
Before this commit, `bf16_to_f32_batch` was two-tier: avx512bf16
SIMD path (Cooper Lake, SPR+, Zen 4+) or scalar lane-by-lane
fallback. The middle tier — every Intel AVX-512 CPU from 2017 to
2021 plus AMD Zen 1-3 with avx512f — was forced through scalar even
though the BF16 → f32 conversion is just a 16-bit left shift and
AVX-512F has had `_mm512_cvtepu16_epi32` + `_mm512_slli_epi32` since
day one.

The new `convert_bf16_to_f32_avx512f` uses three AVX-512F
instructions per 16-lane chunk:

  _mm256_loadu_si256                  // 16 u16 → __m256i
  _mm512_cvtepu16_epi32               // zero-extend to 16 u32 → __m512i
  _mm512_slli_epi32::<16>             // shift left by 16 (BF16 → f32 bits)
  _mm512_castsi512_ps                 // bit-cast i32 → f32
  _mm512_storeu_ps                    // store 16 f32

Plus a scalar tail for the last n % 16 lanes (handled via the
existing `bf16_to_f32_scalar` reference).

BF16 → f32 is mathematically exact (BF16 IS the upper 16 bits of
f32), so the AVX-512F path is byte-equal to the scalar reference on
every input, including subnormal, NaN, ±Inf, ±0 — verified in the
new direct test against a corpus that sweeps every (sign × exponent
× representative-mantissa) triple plus a 5-element tail to exercise
both the 16-aligned loop and the scalar tail.

Dispatch order after this commit:
  1. avx512bf16 + avx512vl → `_mm512_cvtpbh_ps` path (best — 1 op)
  2. avx512f → bit-shift path (this commit — 4 ops, no rounding)
  3. scalar lane-by-lane fallback

Verification:
  * Direct test `batch_bf16_to_f32_avx512f_matches_scalar` runs on
    the `cascadelake` config (avx512f + bw + vl, no bf16) and
    passes — asserts byte-equal output against scalar reference
    across the full corpus.
  * Existing `batch_conversion_matches_scalar` test on this host
    (avx512_bf16 present) still hits the avx512bf16 path; the new
    arm is dead code there, which is correct — the dispatch order
    prefers the better intrinsic when available.
  * Default v3 build (no AVX-512): 2087 lib tests pass; the new arm
    isn't compiled because the surrounding test module is gated on
    `target_feature = "avx512f"`.
  * cargo clippy -- -D warnings clean.
  * cargo fmt --all --check clean.

The symmetric f32 → BF16 direction already had its AVX-512F-only
RNE path (`f32_to_bf16_batch_rne` shipped in PR #126, byte-exact
vs `_mm512_cvtneps_pbh`). This commit closes the asymmetry so both
directions have AVX-512F-only paths on top of the avx512bf16 fast
path.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Per codex review on PR #183: `cast_f32_to_f16_batch_f16c` and
`cast_f16_to_f32_batch_f16c` use F16C intrinsics that can raise
FP exceptions (#O / #U / #P / #I / #D) on edge inputs — setting
bits in the MXCSR status word. The scalar reference paths
(`F16::to_f32`, `F16::from_f32_rounded`) are pure bit
manipulation and never touch MXCSR, so the F16C fast path was
introducing observable FP control-state side effects.

Codex's proposed fix (`_mm256_cvtps_ph::<8>` with bit 3 set for
`_MM_FROUND_NO_EXC`) does not apply here: the Rust stdarch
intrinsic enforces `static_assert_uimm_bits!(IMM8, 3)` so IMM8
is constrained to `0..=7`, and the underlying VCVTPS2PH IMM8
encoding has no SAE bit — bit 3 selects MXCSR.RM (not NO_EXC,
which is an AVX-512 convention). The only valid IMM8 values for
F16C `_mm256_cvtps_ph` are 0..=3 (the four rounding modes).

The actual fix: save MXCSR via STMXCSR before the SIMD region,
restore via LDMXCSR after. Preserves every bit of the original
control/status word (rounding mode, exception masks, flush-to-
zero, and importantly the exception flag bits that the SIMD path
may have set). Net effect: callers observe no MXCSR change vs.
the scalar path.

Implementation uses inline `asm!(stmxcsr/ldmxcsr)` rather than
`_mm_getcsr` / `_mm_setcsr` because those wrappers are deprecated
on stable Rust 1.95 (rustc deemed them unsound for cross-thread
visibility reasons; the official guidance is exactly this — use
inline asm). Two ops per batch call: one STMXCSR save at entry,
one LDMXCSR restore at exit. Cost: ~5 cycles total, dwarfed by
even a single 8-lane cvtps_ph chunk.

New test `f16c_cast_preserves_mxcsr` exercises the fix:
constructs input arrays containing 1e30 / -1e30 (overflow #O),
1e-30 (underflow / denormal #U / #D / #P), 1.0/3.0 (precision
#P), NaN, Inf, ±0, 1.0 — values designed to trigger every
relevant F16C exception. Snapshots MXCSR before, runs the cast,
snapshots after, asserts byte-equal. Same check for the upcast
direction with SNaN-encoded F16 inputs that trigger #I/#D in
`_mm256_cvtph_ps`. Both pass on this host (F16C + avx2 silicon).

Note: this fix does NOT prevent traps from firing on hosts where
the caller has unmasked FP exceptions before calling us. Trap
behaviour is the same as for any plain `a + b` of f32 that
overflows — fires from the SIMD ops themselves, not under our
control. Default MXCSR has all exception masks set (the
process-startup state on Linux/macOS/Windows), so this is the
common case and traps don't fire there.

Verification:
  * 22 simd_half tests pass (was 21 before, +1 new MXCSR-
    preservation test).
  * Full lib sweep: 2087 tests pass.
  * cargo clippy -- -D warnings clean (no deprecation warning
    from _mm_getcsr / _mm_setcsr — we use inline asm instead).
  * cargo fmt --all --check clean.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
@AdaWorldAPI AdaWorldAPI merged commit 098c5aa into master May 21, 2026
17 checks passed
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
…able

Rebased onto master post-#181, #182, #183. Replaces the polyfill-based
add_mul_f32/f64 with LazyLock-cached function pointers picking real
hardware FMA per silicon, and adds two more LazyLock-cached
primitives the consumer needs: is_amx_available() and vnni_dot_u8_i8.

WHY: F32x16::mul_add on AVX2 builds drops to per-lane scalar
f32::mul_add (simd_avx2.rs:586). The polyfill abstracts lane width
but cannot pick between _mm256_fmadd_ps and _mm512_fmadd_ps — that
is an instruction-family choice, not a lane-width one. LazyLock
amortises a one-time simd_caps() read into a frozen fn pointer;
every subsequent call is a single indirect jump with zero
is_x86_feature_detected! overhead. No SimdProfile exposed at the
consumer surface — agnostic contract preserved.

add_mul_f32(acc, a, b) — acc[i] += a[i]*b[i]
  AVX-512F+FMA  → _mm512_fmadd_ps 16-wide + 8-wide tail + scalar tail
  AVX2+FMA      → _mm256_fmadd_ps 8-wide + scalar tail
  NEON          → vfmaq_f32 4-wide + scalar tail
  scalar        → f32::mul_add per lane
  no_std build  → preserves the polyfill F32x16::mul_add path
                  (LazyLock requires std)

add_mul_f64(acc, a, b) — f64 sibling, same shape with 8/4/2 lanes.

is_amx_available() — wraps simd_amx::amx_available() (CPUID +
OSXSAVE + XCR0[17,18] + Linux arch_prctl(XCOMP_PERM)) in
LazyLock<bool>. The 4-step gate, including the syscall, fires
exactly once per process. Always false on non-x86_64.

vnni_dot_u8_i8(a, b) — i32 dot of u8 × i8 slices:
  AVX-512 VNNI  → delegates to simd_amx::vnni_dot_u8_i8 wrapped with
                  scalar tail handling (the existing kernel processes
                  only n - (n%64) since its cognitive-shader caller
                  pre-aligns rows; general-purpose callers need the
                  tail)
  AVX-VNNI 256  → delegates to simd_amx::vnni2_dot_u8_i8 directly
                  (that one already handles its scalar tail)
  scalar        → simd_amx::vnni_dot_u8_i8_scalar

No intrinsic code is duplicated. The dispatcher composes existing
simd_amx::* kernels (which #182/#184 also build on) into a safe
LazyLock-cached consumer-facing wrapper. simd_amx::matvec_dispatch
runs the same selection logic but uses is_x86_feature_detected! per
call; this wrapper amortises that to once at startup.

PARITY CONTRACT:
  - add_mul_f32 / add_mul_f64: bit-identical to f32::mul_add /
    f64::mul_add per lane via to_bits() assertion. All vector
    backends emit single-rounded IEEE-754 FMA.
  - vnni_dot_u8_i8: bit-identical i32 to scalar widen-and-multiply.
    VPDPBUSD does not saturate the accumulator (intermediate u8*i8
    products bounded by 32385, four-element sums by 129540).

Tests: 2101/2101 lib pass (7 new lazylock_dispatch_tests over 12
problem sizes / tail lengths). cargo clippy --lib clean under
default and --features cpu-spr. On Sapphire Rapids host the
LazyLock resolved to AVX-512+FMA for add_mul, AVX-512 VNNI for
vnni_dot; AMX is_amx_available returns false (hypervisor masks
XCR0[17,18]) — matches the Risk #3 demotion from 61b4563.

This commit was rebased atop master after the parallel session
shipped PR #182 (BF16 AMX tile kernels), #183 (F16C cast batch), and
prepared #184 (TDPBUSD int8 tile + matmul_i8_to_i32 wiring). The
earlier 469ecc7 (coarse + SimdTier) and 77e3971 (mul_add_f32_into +
walkback) and be65595 (is_amx_available + vnni_dot duplicating
intrinsics) are subsumed by this single clean commit: no public
SimdProfile / SimdTier re-export, no duplicated intrinsic code, no
mul_add_f32_into (master's add_mul_f32 shape is the right primitive).
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
**Alternative** to the compile-time cascade in `crate::simd::*` /
`crate::simd_ops::*`. **Additive**: gated under
`--features runtime-dispatch`, does not touch any existing path.
Mutually exclusive with `nightly-simd` (the portable-SIMD polyfill
replaces the architecture-specific intrinsics that the runtime
trampolines select between).

Use case: ship ONE binary that adapts across heterogeneous
deployment silicon (AVX-512 server + AVX2-only laptop + Arrow Lake
desktop + Sapphire Rapids workstation) from the same artifact. The
existing compile-time `v3` / `v4` / `native` / `nightly-simd`
configs target a single class of CPU per build; the runtime layer
targets the union via per-op LazyLock<fn ptr> trampolines.

Design from `.claude/knowledge/simd-dispatch-architecture.md` § 7.1
/ Phase 5, building on the precedent set by
`hpc::bgz17_bridge::{L1_KERNEL, L1_WEIGHTED_KERNEL, ...}`
(`LazyLock<L1Fn>` pattern, lines 75-86) already proven in tree.

# Dispatch model

One `LazyLock<fn ptr>` per public surface. First call fires the
closure which reads `simd_caps()` and selects a backend; every
subsequent call is one pointer-deref + indirect call. Per-call
overhead: ~2-3 ns (LazyLock atomic-acquire load that's cache-
resident after first hit + indirect-call branch-target predict).
Invisible against any SIMD op's actual work (~100+ cycles).

# Module layout

  src/simd_runtime/
    mod.rs       — module entry, mutual-exclusion check vs
                   nightly-simd, public re-exports
    vnni_dot.rs  — u8×i8 → i32 dot (the proposal's canonical
                   example): 3 backends, the AVX-512 arm
                   wraps `simd_amx::vnni_dot_u8_i8` with a
                   scalar tail because the existing kernel
                   silently drops n%64 lanes (its matvec
                   caller pre-aligns rows; a general-purpose
                   dispatch surface cannot assume that)
    add_mul.rs   — slice-level FMA (acc += a × b) for f32/f64;
                   the ONLY new kernel code in this module —
                   4 backends per type (avx512 / avx2+fma /
                   neon / scalar), each ~15 LoC of direct
                   intrinsics
    matmul.rs    — thin trampolines for matmul_bf16_to_f32 /
                   matmul_f32 / matmul_i8_to_i32 / gemm_u8_i8
                   delegating to existing functions that
                   already runtime-dispatch internally
                   (PR #182 / #184 / #185)
    casts.rs     — trampolines for the four half-precision
                   batch casts delegating to PR #183's already-
                   runtime-dispatched implementations

# Backend reuse — no kernel duplication

Every dispatch arm delegates to a kernel that already exists in
tree. The runtime layer is just the trampoline. The only NEW
kernel code is `add_mul_f32` / `add_mul_f64` (no pre-existing
slice-level FMA primitive in tree to delegate to — the compile-
time `crate::simd_ops::add_mul_f32` from PR #182 polyfills through
the F32x16 lane wrapper; the runtime version skips that
indirection for one more inlined intrinsic per chunk).

# Invariants preserved from this PR series

  * No-FP32-roundtrip on BF16/F16 arithmetic — backends respect
    the bit-exact mantissa rule
  * Asm-byte encoding for nightly-gated AMX / FP16 — selected
    backends keep their existing asm-byte fast paths
  * Little-endian byte contracts for half-precision carriers
  * Accumulator-preservation in tile paths (codex P1 from #184)
  * Boundary assertions on safe public fns (codex P1 from #185) —
    the public `vnni_dot_u8_i8(a, b)` etc. inherit the asserts
    transparently via the call chain

# Verification

  * Default build (no feature): 2087 lib tests pass — the
    `simd_runtime` module is gated out, zero impact on existing
    paths.
  * `cargo test --lib --features runtime-dispatch`: **2105 lib
    tests pass** (+8 new in `simd_runtime::*::tests`).
  * `cargo clippy --lib --tests --features rayon,native -- -D warnings`
    clean (default).
  * `cargo clippy --lib --tests --features rayon,native,runtime-dispatch
    -- -D warnings` clean.
  * `cargo fmt --all --check` clean.
  * Mutual-exclusion enforced via `compile_error!` in
    `simd_runtime/mod.rs` — `--features runtime-dispatch,nightly-simd`
    fails to compile with a clear error.

# What's NOT in this PR (deferred)

  * Sweep the remaining ~15-20 SIMD/HPC public surfaces (min_i8,
    max_i8, add_i8, dot_i8, etc.). Each is ~30-50 LoC of trampoline;
    pattern is established here. Estimated ~700-900 more LoC across
    the full surface map.
  * CI matrix entry for `runtime-dispatch-portable` (per
    simd-dispatch-architecture.md § 7 / TD-SIMD-9). Job builds
    with `--features runtime-dispatch` on a v3 baseline runner and
    asserts every trampoline lands on its expected backend.
  * `simd_caps()` snapshot logging at process start (debug-only)
    to aid release-binary deployment debugging — "which arm did
    you actually pick?"

# Cost summary

  src/simd_runtime/                 +537 LoC (4 modules)
  src/lib.rs                        +9 LoC (cfg-gated mod decl)
  Cargo.toml                        +21 LoC (feature decl + doc)
  Total                             ~570 LoC

  Trampoline LoC per surface (this PR's sample):
    vnni_dot         170 LoC (LazyLock + 3 arms + wrapper + tests)
    add_mul (f32+f64)218 LoC (LazyLock×2 + 4 arms×2 + tests — the ONLY new kernels)
    matmul (4 ops)   100 LoC (thin delegations + tests)
    casts (4 ops)     75 LoC (thin delegations + tests)

Out-of-tree estimate for the full sweep (per § 7 of the design
doc): ~1400 LoC total once all ~25 public SIMD/HPC surfaces are
wired. This PR establishes ~40% of that budget with the canonical
patterns.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants