diff --git a/.claude/knowledge/simd-dispatch-architecture.md b/.claude/knowledge/simd-dispatch-architecture.md new file mode 100644 index 00000000..82fdd482 --- /dev/null +++ b/.claude/knowledge/simd-dispatch-architecture.md @@ -0,0 +1,305 @@ +# SIMD Dispatch Architecture — design, parity, tech debt, integration plan + +> Date: 2026-05-20 · Status: design v1 (post PR #170 PR-X12 A1 discussion). +> Companion to: `vertical-simd-consumer-contract.md` (W1a consumer contract), +> `databend-ndarray-simd-prompt.md`, `ndarray-simd-trojan-horse-prompt.md`. + +## 1. Why this exists + +`ndarray::simd::*` is the single public surface every cognitive-shader, +splat, codec, BLAS, and FFI consumer reaches for. The current dispatch +in `src/simd.rs` is **compile-time-only** with arms keyed off +`target_feature = "avx512f"` / `target_arch = "aarch64"` / scalar +fallback. `.cargo/config.toml` pins `target-cpu = x86-64-v4`, baking +AVX-512 into every compiled artifact. + +The consequence surfaced on PR #170 (`tests/1.95.0` CI run +[26151746204/76920666348](https://github.com/AdaWorldAPI/ndarray/actions/runs/26151746204/job/76920666348)): +**38 tests in `simd_avx2`, `simd_amx`, `simd_ops`, `simd_soa` SIGILL** on +a GitHub runner without AVX-512 silicon, all timing out uniformly +~19 s — the symptom of "binary cannot execute" rather than assertion +failure. The same configuration also leaves `simd_nightly/*` (the +portable-SIMD polyfill backend) unreachable because no dispatch arm in +`simd.rs` re-exports from it. + +This document pins the target architecture, captures the parity gaps, +ranks the technical debt, and sequences the integration. + +## 2. Dispatch model — three build configs, one runtime mode + +Each build mode is a **conscious cargo invocation** via a distinct +`.cargo/config*.toml`. No silent fallbacks, no surprise hardware +mismatch. Whoever builds with `v3` / `v4` / `native` / `nightly-simd` +chose it deliberately. + +| Config file | `target-cpu` | Dispatch strategy | Default? | Use case | +|---|---|---|---|---| +| `.cargo/config.toml` | `x86-64-v3` (AVX2) | compile-time → `simd_avx2` | ✅ default, GitHub CI | portable artifact across all x86_64 silicon ≥ 2013 | +| `.cargo/config-avx512.toml` | `x86-64-v4` (AVX-512) | compile-time → `simd_avx512` | explicit | benchmarking, AVX-512 deployment | +| `.cargo/config-native.toml` | `native` | compile-time, build-machine CPUID resolved at rustc invocation → whatever arm matches the build host | explicit | developer machine builds | +| `.cargo/config-nightly.toml` (+ `--features nightly-simd`) | `x86-64-v3` (or any) | compile-time → `simd_nightly` (`std::simd::*` polyfill) | explicit | miri / cargo-careful / portable-SIMD experiments | + +The aarch64 path is automatic: any `target_arch = "aarch64"` build +selects `simd_neon` regardless of the config above. + +**Runtime LazyLock dispatch** is a separate, fifth opt-in mode used +when shipping a single release binary that must adapt at process +start across heterogeneous deployment silicon (one binary running on +AVX-512 + AVX2-only machines from the same artifact). It compiles all +backends in and uses `LazyLock` trampolines. Reserved for the +release-binary distribution path; never the dev / CI default. + +### Dispatch precedence in `simd.rs` + +Compile-time arms read like a cascade, **not** like priority overrides +— each cargo config sets exactly one `target_feature` / `feature` such +that exactly one arm matches. The order below is the source-of-truth +ranking the compiler walks: + +```rust +// 1. Explicit portable-SIMD polyfill (nightly + opt-in feature). +// No `target_arch` constraint — `core::simd` is portable, so this +// arm is the one true backend on wasm32 / riscv / any other target +// as soon as `nightly-simd` is on. Keeping it unconditional on +// `feature = "nightly-simd"` is what makes the `not(feature = +// "nightly-simd")` exclusion on every other arm sound. +#[cfg(feature = "nightly-simd")] +pub use crate::simd_nightly::{F32x16, F64x8, U8x32, U8x64, U16x32, U32x16, U64x8, I8x32, I8x64, I16x16, I16x32, I32x16, I64x8, F32Mask16, F64Mask8, BF16x16, BF16x8}; + +// 2. AVX-512 (target_feature = "avx512f"; set by `v4` and `native` configs on AVX-512 hosts) +#[cfg(all(target_arch = "x86_64", target_feature = "avx512f", not(feature = "nightly-simd")))] +pub use crate::simd_avx512::{...}; + +// 3. AVX2 baseline (the v3 / GitHub-CI default) +#[cfg(all(target_arch = "x86_64", target_feature = "avx2", not(target_feature = "avx512f"), not(feature = "nightly-simd")))] +pub use crate::simd_avx2::{...}; + +// 4. NEON (aarch64) +#[cfg(all(target_arch = "aarch64", not(feature = "nightly-simd")))] +pub use crate::simd_neon::aarch64_simd::{...}; + +// 5. Scalar fallback (everything else: wasm32, riscv, x86_64 without +// AVX2, etc.). The predicate is the negation of arms 1-4 so that +// *exactly one* arm matches on every (target, feature) pair. +#[cfg(not(any( + feature = "nightly-simd", + all(target_arch = "x86_64", target_feature = "avx2"), + target_arch = "aarch64", +)))] +pub use scalar::{...}; +``` + +Runtime dispatch via `LazyLock` lives in a separate +`simd_runtime` module (TBD per § 7.1) reached by a `--features runtime-dispatch` +flag, mutually exclusive with the compile-time arms above. + +## 3. Module roles + +``` +crate::simd::* ← user-facing key registry (re-exports only) + │ + ├── simd.rs = dispatch arms; no implementation, only `pub use` + │ + ├── simd_ops.rs = slice-level ops over crate::simd::* primitives + │ (add_f32, scale_f64, array_chunks, …) + │ + ├── simd_avx512.rs = __m512* values, native 512-bit registers + ├── simd_avx2.rs = __m256* values + (F32x16, F64x8) as two-half + │ wrappers (struct F32x16(pub f32x8, pub f32x8)) + ├── simd_neon.rs = float32x4_t / uint64x2_t natives + larger shapes + │ composed as [float32x4_t; 4] etc. + ├── simd_nightly/ = std::simd::* polyfill — portable, miri-executable + │ ├── f32_types.rs F32x16, F32x8 + │ ├── f64_types.rs F64x8, F64x4 + │ ├── u8_types.rs U8x64, U8x32 + │ ├── u_word_types.rs U16x32, U32x16, U64x8 + │ ├── i8_types.rs I8x64, I8x32 + │ ├── i_word_types.rs I16x16, I16x32, I32x16, I64x8 + │ ├── bf16_types.rs BF16x16, BF16x8 + │ ├── f16_types.rs F16x16 + │ ├── masks.rs F32Mask16, F32Mask8, F64Mask4, F64Mask8 + │ └── ops.rs op impls + └── scalar (inline `mod scalar` in simd.rs) + = pure-Rust fallback for unknown arch +``` + +Every `simd_.rs` is just a SOURCE of typed primitives. `simd.rs` +chooses the source; the cargo config chooses how `simd.rs` chooses. + +## 4. Parity matrix — typed lane primitives per backend + +Legend: ✅ native, 🟡 composed wrapper (two-half / four-quarter), 🔵 +scalar polyfill via `core::simd`, ❌ missing, ⛔ N/A for this arch. + +| Lane type | `simd_avx512` (v4) | `simd_avx2` (v3) | `simd_neon` (aarch64) | `simd_nightly` | `scalar` | +|---|---|---|---|---|---| +| `F32x16` | ✅ `__m512` | 🟡 `(f32x8, f32x8)` | 🟡 `[float32x4_t; 4]` | 🔵 `core::simd::f32x16` | ✅ `[f32; 16]` | +| `F32x8` | ✅ `__m256` | ❌ | ⛔ | 🔵 | ✅ | +| `F64x8` | ✅ `__m512d` | 🟡 `(f64x4, f64x4)` | 🟡 `[float64x2_t; 4]` | 🔵 | ✅ | +| `F64x4` | ✅ `__m256d` | ❌ | ⛔ | 🔵 | ✅ | +| `U8x64` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ | +| `U8x32` | ✅ `__m256i` | ✅ `__m256i` | ❌ | 🔵 | ✅ | +| `U16x32` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ | +| `U32x16` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ | +| `U64x8` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ | +| `I8x32` | ✅ `__m256i` | ❌ | ❌ | 🔵 | ✅ | +| `I8x64` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ | +| `I16x16` | ✅ `__m256i` | ❌ | ❌ | 🔵 | ✅ | +| `I16x32` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ | +| `I32x16` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ | +| `I64x8` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ | +| `BF16x8` | ✅ `__m128bh` | ❌ | ❌ | 🔵 | ✅ | +| `BF16x16` | ✅ `__m256bh` | ❌ | ❌ | 🔵 | ✅ | +| `F16x16` | ❌ | 🟡 `F16Scaler` (scalar) | ❌ | 🔵 | ✅ | +| `F32Mask16` | ✅ `__mmask16` | ✅ `u16` bitmask | ✅ `u16` bitmask | 🔵 | ✅ | +| `F64Mask8` | ✅ `__mmask8` | ✅ `u8` bitmask | ✅ `u8` bitmask | 🔵 | ✅ | + +**Aarch64-native narrower types** (only useful directly when the +consumer wants 128-bit shapes): `I8x16`, `I16x8`, `U8x16`, `U16x8`, +`U32x4`, `U64x2`, `I32x4`, `I64x2`. These are not in the cross-arch +parity surface — consumers requesting 256-bit / 512-bit shapes go +through the composed wrappers. + +### Read of the matrix + +- **F32x16 + F64x8 are universal** — all four backends ship them. Hot + paths can rely on these without branching. +- **`simd_avx2` is the bottleneck.** It only exposes `F32x16`, `F64x8`, + `F32Mask16`, `F64Mask8`, `U8x32`, and an `F16Scaler`. Every other + cross-arch lane is missing — making the v3 default config crash any + consumer that reaches for `U64x8`, `I32x16`, `U16x32`, etc. +- **NEON is even sparser** at the 256/512-bit level. +- **`simd_nightly` is the most complete** but is unreachable today + because `simd.rs` has no arm wiring `feature = "nightly-simd"` to its + re-exports. +- **`scalar`** has comprehensive cover and is the safest fallback for + any arch the others miss, but lives inline in `simd.rs` rather than + in a dedicated `simd_scalar.rs`. Symmetry would help. + +## 5. Technical debt matrix + +Ranked by P0 (blocks current CI / consumers) → P3 (nice-to-have). + +| ID | Severity | Description | Detection | Fix scope | +|---|---|---|---|---| +| **TD-SIMD-1** | **P0** | `.cargo/config.toml` defaults to `x86-64-v4` → every CI runner without AVX-512 silicon SIGILLs on the first SIMD op. 38 tests fail at 19 s timeout each on `tests/1.95.0`. | PR #170 CI run | Change default to `x86-64-v3`; add `.cargo/config-avx512.toml` for the opt-in AVX-512 path. ~5 LoC. | +| **TD-SIMD-2** | **P0** | `simd_avx2.rs` ships `F32x16`/`F64x8`/`U8x32` only. Consumers requesting `U64x8`, `I32x16`, `U16x32`, `BF16x16`, etc. fail to compile on the v3 path. | grep `pub use crate::simd_avx2::` then cross-ref against the parity matrix | Add the missing types as two-half wrappers (`U64x8(pub u64x4, pub u64x4)` etc.) over native `__m256i` halves. ~500 LoC. | +| **TD-SIMD-3** | **P1** | `simd.rs` has no dispatch arm for `#[cfg(feature = "nightly-simd")]` → the `simd_nightly` polyfill is unreachable. miri / cargo-careful jobs that should exercise the portable path fall through to whatever cfg cascade matches, never to `std::simd::*`. | grep `simd_nightly` in `simd.rs` (returns 0 dispatch arms) | Add the `feature = "nightly-simd"` arm at the top of the cascade per § 2. ~30 LoC. | +| **TD-SIMD-4** | **P1** | `simd_neon.rs` only ships `F32x16` / `F64x8` cross-arch shapes. Consumers reaching for `U8x64`, `U64x8`, `I32x16`, etc. on aarch64 have no path. | grep + parity matrix | Compose larger shapes from native NEON 128-bit lanes (`U8x64([uint8x16_t; 4])`, `U64x8([uint64x2_t; 4])`, etc.). ~400 LoC. | +| **TD-SIMD-5** | **P1** | Scalar fallback inline in `simd.rs` (`pub(crate) mod scalar`) makes symmetry hard — every other backend is its own file. | inspection | Promote to `src/simd_scalar.rs`; `simd.rs` becomes pure dispatch. ~mechanical refactor. | +| **TD-SIMD-6** | **P2** | No `runtime-dispatch` feature / `simd_runtime` module exists yet. Release-binary distribution to heterogeneous silicon requires recompile per target today. | `grep -r "LazyLock"` only matches reporting code in `simd.rs:52-55` | New module wiring per-op trampolines from the compiled-in backends. ~300 LoC + one new cargo feature. | +| **TD-SIMD-7** | **P2** | Compile-time arms in `simd.rs:153-194` are duplicated four times (one per type group: F32x16, F64x8, U8x32, BF16x16). Adding a new lane requires copy-pasting four `#[cfg(...)]` arms. | inspection | Single source-of-truth macro emitting the arms. ~one macro_rules!, 50 LoC. | +| **TD-SIMD-8** | **P2** | `F16Scaler` in `simd_avx2.rs:2566` is a scalar implementation masquerading as a SIMD type. Consumers using `F16x16` on v3 get scalar perf without warning. | grep `F16Scaler` | Either gate `F16x16` behind `target_feature = "f16c"` or rename / document the scalar nature. ~20 LoC + docs. | +| **TD-SIMD-9** | **P3** | No CI matrix entry for the `nightly-simd` polyfill path. | `.github/workflows/ci.yaml` | Add a `nightly-simd-polyfill` job that builds with `--features nightly-simd` on nightly rustc. ~20 LoC YAML. | +| **TD-SIMD-10** | **P3** | No CI matrix entry for `.cargo/config-avx512.toml`. AVX-512 deployment path silently bit-rots between PRs. | `.github/workflows/ci.yaml` | Add an `avx-512-explicit` job using a runner with AVX-512 silicon. ~20 LoC YAML; runner availability TBD. | + +## 6. Integration plan — sequenced sprints + +Each phase is a single-PR worker (sized for one Sonnet impl-sprint per +the `.claude/EN/agents/worker-template.md` shape). Phases sequence so +each lands a green CI; the next phase depends only on shipped state. + +### Phase 1 — Unblock CI (P0 fixes) + +**Goal:** GitHub `tests/1.95.0` job green. The default `.cargo/config.toml` +build runs end-to-end on AVX2-only silicon. + +| # | Worker | Scope | Files | Acceptance | +|---|---|---|---|---| +| 1.1 | flip baseline | Change `target-cpu` from `v4` → `v3`. Add `.cargo/config-avx512.toml` with the old `v4` value. | `.cargo/config.toml`, `.cargo/config-avx512.toml` | `cargo check` clean on default; `tests/1.95.0` no longer SIGILLs | +| 1.2 | AVX2 two-half wrappers — float | Add `U8x64`, `U64x8`, `U32x16`, `U16x32`, `I8x32`, `I8x64`, `I16x16`, `I16x32`, `I32x16`, `I64x8` as two-half wrappers over native AVX2 `__m256i` halves. | `src/simd_avx2.rs` | per-type parity test vs `simd_avx512` on AVX-512 host; per-type unit test on AVX2-only | +| 1.3 | simd.rs dispatch refresh | Add the AVX2 cfg arm wiring the new wrappers; tighten existing arms with the new precedence (per § 2). | `src/simd.rs` | `cargo check --features approx,serde,rayon` clean on default config; `cargo check` clean on `--config .cargo/config-avx512.toml` | + +After Phase 1, PR #170 (PR-X12 A1) and any future consumer PR ships +green CI by default. AVX-512 testing becomes an explicit job. + +### Phase 2 — Unblock the polyfill (P1: `nightly-simd`) + +**Goal:** `cargo +nightly check --features nightly-simd` reaches +`simd_nightly/*` via `crate::simd::*`. miri can execute the portable +path. + +| # | Worker | Scope | Files | Acceptance | +|---|---|---|---|---| +| 2.1 | nightly-simd dispatch arm | Add `#[cfg(feature = "nightly-simd")]` arms in `simd.rs` re-exporting every typed lane from `crate::simd_nightly::*`. | `src/simd.rs` | `crate::simd::F32x16` resolves to `core::simd::f32x16` under the feature | +| 2.2 | nightly-simd parity tests | Run the existing simd_ops / simd_soa test suite against the polyfill backend. | `src/simd_nightly/tests.rs` | all simd_ops + simd_soa tests pass under `--features nightly-simd` | +| 2.3 | CI matrix | Add `nightly-simd-polyfill` job to `.github/workflows/ci.yaml`. | `.github/workflows/ci.yaml` | job green on nightly rustc with the feature | + +### Phase 3 — NEON parity (P1) + +**Goal:** aarch64 build reaches the same cross-arch lane set as the v3 +config. + +| # | Worker | Scope | Files | Acceptance | +|---|---|---|---|---| +| 3.1 | NEON quartet wrappers | Compose `U8x64`, `U64x8`, `U32x16`, `U16x32`, `I8x32`, `I8x64`, `I16x16`, `I16x32`, `I32x16`, `I64x8` from native 128-bit NEON lanes. | `src/simd_neon.rs` | parity vs `simd_avx2` two-half wrappers on a 16-pair fixture | +| 3.2 | simd.rs aarch64 arms | Extend `aarch64` arms to re-export the new types. | `src/simd.rs` | `cargo check --target aarch64-unknown-linux-gnu` clean | + +### Phase 4 — Symmetry + ergonomics (P1/P2) + +| # | Worker | Scope | Files | Acceptance | +|---|---|---|---|---| +| 4.1 | scalar → file | Promote `mod scalar` to `src/simd_scalar.rs`. | `src/simd.rs`, new `src/simd_scalar.rs` | no behaviour change; `cargo check` clean on all configs | +| 4.2 | dispatch macro | Collapse the 4× duplicated `#[cfg(...)]` blocks into one macro. | `src/simd.rs` | adding a new lane type is one macro invocation | +| 4.3 | F16 honesty | Either rename `F16Scaler` or gate `F16x16` behind `f16c`. | `src/simd_avx2.rs` | scalar perf no longer surprises hot-path consumers | + +### Phase 5 — Runtime dispatch (P2, opt-in) + +**Goal:** ship-once binaries that adapt across heterogeneous deployment +silicon. + +| # | Worker | Scope | Files | Acceptance | +|---|---|---|---|---| +| 5.1 | `simd_runtime` module | New module compiling all backends in and selecting per-op trampolines via `LazyLock`. | `src/simd_runtime.rs` | one binary runs on AVX-512 + AVX2-only hosts from the same artifact | +| 5.2 | feature flag | New `runtime-dispatch` cargo feature, mutually exclusive with `nightly-simd`. | `Cargo.toml`, `src/simd.rs` | `cargo check --features runtime-dispatch` clean on the v3 baseline | +| 5.3 | CI matrix | Add a `runtime-dispatch-portable` job. | `.github/workflows/ci.yaml` | job green | + +### Phase 6 — CI matrix for explicit AVX-512 (P3) + +| # | Worker | Scope | Files | Acceptance | +|---|---|---|---|---| +| 6.1 | AVX-512 explicit job | Add `avx-512-explicit` to `.github/workflows/ci.yaml` using `--config .cargo/config-avx512.toml`. Requires AVX-512-capable runner. | `.github/workflows/ci.yaml` | green on the AVX-512 runner | + +## 7. Open questions + +1. **Runtime trampoline cost class.** Phase 5's per-op indirection + adds one indirect call per `F32x16::add(...)`. Acceptable for the + typical 100+ cycle SIMD-op cost, but consumer benchmarks should + sanity-check before declaring the path production-ready. +2. **`feature = "nightly-simd"` precedence.** § 2 puts it at the top + of the cascade; alternative reading is "polyfill is for miri only, + so put it BELOW the arch-specific arms and only fire on non-x86_64, + non-aarch64 targets." The current proposal matches the user's + "explicit opt-in wins" framing; revisit if there's a use case for + `--features nightly-simd` on an AVX-512 host wanting the AVX-512 + path. +3. **AMX status.** `simd_amx.rs` (Sapphire Rapids+ tile ops) is + x86_64-only and orthogonal to the F32x16 / U8x64 cross-arch surface. + Out of scope for this document; tracked under PR-X10 A6 + (`linalg::distance`) follow-ups. + +## 8. Cross-references + +- `.claude/knowledge/vertical-simd-consumer-contract.md` — W1a consumer + contract every new SIMD primitive follows (struct methods on typed + wrappers, three-backend parity test, saturating/overflow semantics + documented). +- `.claude/knowledge/databend-ndarray-simd-prompt.md` — Databend + integration consumer of `crate::simd::*`. +- `.claude/knowledge/ndarray-simd-trojan-horse-prompt.md` — ClickHouse + + Tantivy injection plan; depends on Phase 1 + 2 landing. +- `src/simd.rs` lines 52-55 — existing `is_x86_feature_detected!` + reporting (NOT dispatch) — repurpose for Phase 5 trampoline. +- `src/simd_nightly/mod.rs` lines 37-44 — complete `pub use` set + ready to be wired into `simd.rs` dispatch (Phase 2). + +## 9. TL;DR + +Default cargo config drops to **`x86-64-v3`** (AVX2) → GitHub CI green by +default. **`.cargo/config-avx512.toml`** is the explicit AVX-512 path. +`simd_avx2.rs` needs ~10 missing two-half wrappers (P0, Phase 1). +`simd.rs` needs a `nightly-simd` dispatch arm so `simd_nightly/*` +becomes reachable (P1, Phase 2). NEON gets quartet wrappers (P1, Phase +3). Scalar / macros / runtime-dispatch / explicit-AVX-512 CI are +P2-P3 follow-ups (Phases 4-6). Each phase is one PR; landing +in order keeps every step green.