Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
305 changes: 305 additions & 0 deletions .claude/knowledge/simd-dispatch-architecture.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,305 @@
# SIMD Dispatch Architecture — design, parity, tech debt, integration plan

> Date: 2026-05-20 · Status: design v1 (post PR #170 PR-X12 A1 discussion).
> Companion to: `vertical-simd-consumer-contract.md` (W1a consumer contract),
> `databend-ndarray-simd-prompt.md`, `ndarray-simd-trojan-horse-prompt.md`.

## 1. Why this exists

`ndarray::simd::*` is the single public surface every cognitive-shader,
splat, codec, BLAS, and FFI consumer reaches for. The current dispatch
in `src/simd.rs` is **compile-time-only** with arms keyed off
`target_feature = "avx512f"` / `target_arch = "aarch64"` / scalar
fallback. `.cargo/config.toml` pins `target-cpu = x86-64-v4`, baking
AVX-512 into every compiled artifact.

The consequence surfaced on PR #170 (`tests/1.95.0` CI run
[26151746204/76920666348](https://github.com/AdaWorldAPI/ndarray/actions/runs/26151746204/job/76920666348)):
**38 tests in `simd_avx2`, `simd_amx`, `simd_ops`, `simd_soa` SIGILL** on
a GitHub runner without AVX-512 silicon, all timing out uniformly
~19 s — the symptom of "binary cannot execute" rather than assertion
failure. The same configuration also leaves `simd_nightly/*` (the
portable-SIMD polyfill backend) unreachable because no dispatch arm in
`simd.rs` re-exports from it.

This document pins the target architecture, captures the parity gaps,
ranks the technical debt, and sequences the integration.

## 2. Dispatch model — three build configs, one runtime mode

Each build mode is a **conscious cargo invocation** via a distinct
`.cargo/config*.toml`. No silent fallbacks, no surprise hardware
mismatch. Whoever builds with `v3` / `v4` / `native` / `nightly-simd`
chose it deliberately.

| Config file | `target-cpu` | Dispatch strategy | Default? | Use case |
|---|---|---|---|---|
| `.cargo/config.toml` | `x86-64-v3` (AVX2) | compile-time → `simd_avx2` | ✅ default, GitHub CI | portable artifact across all x86_64 silicon ≥ 2013 |
| `.cargo/config-avx512.toml` | `x86-64-v4` (AVX-512) | compile-time → `simd_avx512` | explicit | benchmarking, AVX-512 deployment |
| `.cargo/config-native.toml` | `native` | compile-time, build-machine CPUID resolved at rustc invocation → whatever arm matches the build host | explicit | developer machine builds |
| `.cargo/config-nightly.toml` (+ `--features nightly-simd`) | `x86-64-v3` (or any) | compile-time → `simd_nightly` (`std::simd::*` polyfill) | explicit | miri / cargo-careful / portable-SIMD experiments |

The aarch64 path is automatic: any `target_arch = "aarch64"` build
selects `simd_neon` regardless of the config above.

**Runtime LazyLock dispatch** is a separate, fifth opt-in mode used
when shipping a single release binary that must adapt at process
start across heterogeneous deployment silicon (one binary running on
AVX-512 + AVX2-only machines from the same artifact). It compiles all
backends in and uses `LazyLock<CpuCaps>` trampolines. Reserved for the
release-binary distribution path; never the dev / CI default.

### Dispatch precedence in `simd.rs`

Compile-time arms read like a cascade, **not** like priority overrides
— each cargo config sets exactly one `target_feature` / `feature` such
that exactly one arm matches. The order below is the source-of-truth
ranking the compiler walks:

```rust
// 1. Explicit portable-SIMD polyfill (nightly + opt-in feature).
// No `target_arch` constraint — `core::simd` is portable, so this
// arm is the one true backend on wasm32 / riscv / any other target
// as soon as `nightly-simd` is on. Keeping it unconditional on
// `feature = "nightly-simd"` is what makes the `not(feature =
// "nightly-simd")` exclusion on every other arm sound.
#[cfg(feature = "nightly-simd")]
pub use crate::simd_nightly::{F32x16, F64x8, U8x32, U8x64, U16x32, U32x16, U64x8, I8x32, I8x64, I16x16, I16x32, I32x16, I64x8, F32Mask16, F64Mask8, BF16x16, BF16x8};

// 2. AVX-512 (target_feature = "avx512f"; set by `v4` and `native` configs on AVX-512 hosts)
#[cfg(all(target_arch = "x86_64", target_feature = "avx512f", not(feature = "nightly-simd")))]
pub use crate::simd_avx512::{...};

// 3. AVX2 baseline (the v3 / GitHub-CI default)
#[cfg(all(target_arch = "x86_64", target_feature = "avx2", not(target_feature = "avx512f"), not(feature = "nightly-simd")))]
pub use crate::simd_avx2::{...};

// 4. NEON (aarch64)
#[cfg(all(target_arch = "aarch64", not(feature = "nightly-simd")))]
pub use crate::simd_neon::aarch64_simd::{...};

// 5. Scalar fallback (everything else: wasm32, riscv, x86_64 without
// AVX2, etc.). The predicate is the negation of arms 1-4 so that
// *exactly one* arm matches on every (target, feature) pair.
#[cfg(not(any(
feature = "nightly-simd",
all(target_arch = "x86_64", target_feature = "avx2"),
target_arch = "aarch64",
)))]
pub use scalar::{...};
```

Runtime dispatch via `LazyLock<CpuCaps>` lives in a separate
`simd_runtime` module (TBD per § 7.1) reached by a `--features runtime-dispatch`
flag, mutually exclusive with the compile-time arms above.

## 3. Module roles

```
crate::simd::* ← user-facing key registry (re-exports only)
├── simd.rs = dispatch arms; no implementation, only `pub use`
├── simd_ops.rs = slice-level ops over crate::simd::* primitives
│ (add_f32, scale_f64, array_chunks, …)
├── simd_avx512.rs = __m512* values, native 512-bit registers
├── simd_avx2.rs = __m256* values + (F32x16, F64x8) as two-half
│ wrappers (struct F32x16(pub f32x8, pub f32x8))
├── simd_neon.rs = float32x4_t / uint64x2_t natives + larger shapes
│ composed as [float32x4_t; 4] etc.
├── simd_nightly/ = std::simd::* polyfill — portable, miri-executable
│ ├── f32_types.rs F32x16, F32x8
│ ├── f64_types.rs F64x8, F64x4
│ ├── u8_types.rs U8x64, U8x32
│ ├── u_word_types.rs U16x32, U32x16, U64x8
│ ├── i8_types.rs I8x64, I8x32
│ ├── i_word_types.rs I16x16, I16x32, I32x16, I64x8
│ ├── bf16_types.rs BF16x16, BF16x8
│ ├── f16_types.rs F16x16
│ ├── masks.rs F32Mask16, F32Mask8, F64Mask4, F64Mask8
│ └── ops.rs op impls
└── scalar (inline `mod scalar` in simd.rs)
= pure-Rust fallback for unknown arch
```

Every `simd_<arch>.rs` is just a SOURCE of typed primitives. `simd.rs`
chooses the source; the cargo config chooses how `simd.rs` chooses.

## 4. Parity matrix — typed lane primitives per backend

Legend: ✅ native, 🟡 composed wrapper (two-half / four-quarter), 🔵
scalar polyfill via `core::simd`, ❌ missing, ⛔ N/A for this arch.

| Lane type | `simd_avx512` (v4) | `simd_avx2` (v3) | `simd_neon` (aarch64) | `simd_nightly` | `scalar` |
|---|---|---|---|---|---|
| `F32x16` | ✅ `__m512` | 🟡 `(f32x8, f32x8)` | 🟡 `[float32x4_t; 4]` | 🔵 `core::simd::f32x16` | ✅ `[f32; 16]` |
| `F32x8` | ✅ `__m256` | ❌ | ⛔ | 🔵 | ✅ |
| `F64x8` | ✅ `__m512d` | 🟡 `(f64x4, f64x4)` | 🟡 `[float64x2_t; 4]` | 🔵 | ✅ |
| `F64x4` | ✅ `__m256d` | ❌ | ⛔ | 🔵 | ✅ |
| `U8x64` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ |
| `U8x32` | ✅ `__m256i` | ✅ `__m256i` | ❌ | 🔵 | ✅ |
| `U16x32` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ |
| `U32x16` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ |
| `U64x8` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ |
| `I8x32` | ✅ `__m256i` | ❌ | ❌ | 🔵 | ✅ |
| `I8x64` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ |
| `I16x16` | ✅ `__m256i` | ❌ | ❌ | 🔵 | ✅ |
| `I16x32` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ |
| `I32x16` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ |
| `I64x8` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ |
| `BF16x8` | ✅ `__m128bh` | ❌ | ❌ | 🔵 | ✅ |
| `BF16x16` | ✅ `__m256bh` | ❌ | ❌ | 🔵 | ✅ |
| `F16x16` | ❌ | 🟡 `F16Scaler` (scalar) | ❌ | 🔵 | ✅ |
| `F32Mask16` | ✅ `__mmask16` | ✅ `u16` bitmask | ✅ `u16` bitmask | 🔵 | ✅ |
| `F64Mask8` | ✅ `__mmask8` | ✅ `u8` bitmask | ✅ `u8` bitmask | 🔵 | ✅ |

**Aarch64-native narrower types** (only useful directly when the
consumer wants 128-bit shapes): `I8x16`, `I16x8`, `U8x16`, `U16x8`,
`U32x4`, `U64x2`, `I32x4`, `I64x2`. These are not in the cross-arch
parity surface — consumers requesting 256-bit / 512-bit shapes go
through the composed wrappers.

### Read of the matrix

- **F32x16 + F64x8 are universal** — all four backends ship them. Hot
paths can rely on these without branching.
- **`simd_avx2` is the bottleneck.** It only exposes `F32x16`, `F64x8`,
`F32Mask16`, `F64Mask8`, `U8x32`, and an `F16Scaler`. Every other
cross-arch lane is missing — making the v3 default config crash any
consumer that reaches for `U64x8`, `I32x16`, `U16x32`, etc.
- **NEON is even sparser** at the 256/512-bit level.
- **`simd_nightly` is the most complete** but is unreachable today
because `simd.rs` has no arm wiring `feature = "nightly-simd"` to its
re-exports.
- **`scalar`** has comprehensive cover and is the safest fallback for
any arch the others miss, but lives inline in `simd.rs` rather than
in a dedicated `simd_scalar.rs`. Symmetry would help.

## 5. Technical debt matrix

Ranked by P0 (blocks current CI / consumers) → P3 (nice-to-have).

| ID | Severity | Description | Detection | Fix scope |
|---|---|---|---|---|
| **TD-SIMD-1** | **P0** | `.cargo/config.toml` defaults to `x86-64-v4` → every CI runner without AVX-512 silicon SIGILLs on the first SIMD op. 38 tests fail at 19 s timeout each on `tests/1.95.0`. | PR #170 CI run | Change default to `x86-64-v3`; add `.cargo/config-avx512.toml` for the opt-in AVX-512 path. ~5 LoC. |
| **TD-SIMD-2** | **P0** | `simd_avx2.rs` ships `F32x16`/`F64x8`/`U8x32` only. Consumers requesting `U64x8`, `I32x16`, `U16x32`, `BF16x16`, etc. fail to compile on the v3 path. | grep `pub use crate::simd_avx2::` then cross-ref against the parity matrix | Add the missing types as two-half wrappers (`U64x8(pub u64x4, pub u64x4)` etc.) over native `__m256i` halves. ~500 LoC. |
| **TD-SIMD-3** | **P1** | `simd.rs` has no dispatch arm for `#[cfg(feature = "nightly-simd")]` → the `simd_nightly` polyfill is unreachable. miri / cargo-careful jobs that should exercise the portable path fall through to whatever cfg cascade matches, never to `std::simd::*`. | grep `simd_nightly` in `simd.rs` (returns 0 dispatch arms) | Add the `feature = "nightly-simd"` arm at the top of the cascade per § 2. ~30 LoC. |
| **TD-SIMD-4** | **P1** | `simd_neon.rs` only ships `F32x16` / `F64x8` cross-arch shapes. Consumers reaching for `U8x64`, `U64x8`, `I32x16`, etc. on aarch64 have no path. | grep + parity matrix | Compose larger shapes from native NEON 128-bit lanes (`U8x64([uint8x16_t; 4])`, `U64x8([uint64x2_t; 4])`, etc.). ~400 LoC. |
| **TD-SIMD-5** | **P1** | Scalar fallback inline in `simd.rs` (`pub(crate) mod scalar`) makes symmetry hard — every other backend is its own file. | inspection | Promote to `src/simd_scalar.rs`; `simd.rs` becomes pure dispatch. ~mechanical refactor. |
| **TD-SIMD-6** | **P2** | No `runtime-dispatch` feature / `simd_runtime` module exists yet. Release-binary distribution to heterogeneous silicon requires recompile per target today. | `grep -r "LazyLock<CpuCaps>"` only matches reporting code in `simd.rs:52-55` | New module wiring per-op trampolines from the compiled-in backends. ~300 LoC + one new cargo feature. |
| **TD-SIMD-7** | **P2** | Compile-time arms in `simd.rs:153-194` are duplicated four times (one per type group: F32x16, F64x8, U8x32, BF16x16). Adding a new lane requires copy-pasting four `#[cfg(...)]` arms. | inspection | Single source-of-truth macro emitting the arms. ~one macro_rules!, 50 LoC. |
| **TD-SIMD-8** | **P2** | `F16Scaler` in `simd_avx2.rs:2566` is a scalar implementation masquerading as a SIMD type. Consumers using `F16x16` on v3 get scalar perf without warning. | grep `F16Scaler` | Either gate `F16x16` behind `target_feature = "f16c"` or rename / document the scalar nature. ~20 LoC + docs. |
| **TD-SIMD-9** | **P3** | No CI matrix entry for the `nightly-simd` polyfill path. | `.github/workflows/ci.yaml` | Add a `nightly-simd-polyfill` job that builds with `--features nightly-simd` on nightly rustc. ~20 LoC YAML. |
| **TD-SIMD-10** | **P3** | No CI matrix entry for `.cargo/config-avx512.toml`. AVX-512 deployment path silently bit-rots between PRs. | `.github/workflows/ci.yaml` | Add an `avx-512-explicit` job using a runner with AVX-512 silicon. ~20 LoC YAML; runner availability TBD. |

## 6. Integration plan — sequenced sprints

Each phase is a single-PR worker (sized for one Sonnet impl-sprint per
the `.claude/EN/agents/worker-template.md` shape). Phases sequence so
each lands a green CI; the next phase depends only on shipped state.

### Phase 1 — Unblock CI (P0 fixes)

**Goal:** GitHub `tests/1.95.0` job green. The default `.cargo/config.toml`
build runs end-to-end on AVX2-only silicon.

| # | Worker | Scope | Files | Acceptance |
|---|---|---|---|---|
| 1.1 | flip baseline | Change `target-cpu` from `v4` → `v3`. Add `.cargo/config-avx512.toml` with the old `v4` value. | `.cargo/config.toml`, `.cargo/config-avx512.toml` | `cargo check` clean on default; `tests/1.95.0` no longer SIGILLs |
| 1.2 | AVX2 two-half wrappers — float | Add `U8x64`, `U64x8`, `U32x16`, `U16x32`, `I8x32`, `I8x64`, `I16x16`, `I16x32`, `I32x16`, `I64x8` as two-half wrappers over native AVX2 `__m256i` halves. | `src/simd_avx2.rs` | per-type parity test vs `simd_avx512` on AVX-512 host; per-type unit test on AVX2-only |
| 1.3 | simd.rs dispatch refresh | Add the AVX2 cfg arm wiring the new wrappers; tighten existing arms with the new precedence (per § 2). | `src/simd.rs` | `cargo check --features approx,serde,rayon` clean on default config; `cargo check` clean on `--config .cargo/config-avx512.toml` |

After Phase 1, PR #170 (PR-X12 A1) and any future consumer PR ships
green CI by default. AVX-512 testing becomes an explicit job.

### Phase 2 — Unblock the polyfill (P1: `nightly-simd`)

**Goal:** `cargo +nightly check --features nightly-simd` reaches
`simd_nightly/*` via `crate::simd::*`. miri can execute the portable
path.

| # | Worker | Scope | Files | Acceptance |
|---|---|---|---|---|
| 2.1 | nightly-simd dispatch arm | Add `#[cfg(feature = "nightly-simd")]` arms in `simd.rs` re-exporting every typed lane from `crate::simd_nightly::*`. | `src/simd.rs` | `crate::simd::F32x16` resolves to `core::simd::f32x16` under the feature |
| 2.2 | nightly-simd parity tests | Run the existing simd_ops / simd_soa test suite against the polyfill backend. | `src/simd_nightly/tests.rs` | all simd_ops + simd_soa tests pass under `--features nightly-simd` |
| 2.3 | CI matrix | Add `nightly-simd-polyfill` job to `.github/workflows/ci.yaml`. | `.github/workflows/ci.yaml` | job green on nightly rustc with the feature |

### Phase 3 — NEON parity (P1)

**Goal:** aarch64 build reaches the same cross-arch lane set as the v3
config.

| # | Worker | Scope | Files | Acceptance |
|---|---|---|---|---|
| 3.1 | NEON quartet wrappers | Compose `U8x64`, `U64x8`, `U32x16`, `U16x32`, `I8x32`, `I8x64`, `I16x16`, `I16x32`, `I32x16`, `I64x8` from native 128-bit NEON lanes. | `src/simd_neon.rs` | parity vs `simd_avx2` two-half wrappers on a 16-pair fixture |
| 3.2 | simd.rs aarch64 arms | Extend `aarch64` arms to re-export the new types. | `src/simd.rs` | `cargo check --target aarch64-unknown-linux-gnu` clean |

### Phase 4 — Symmetry + ergonomics (P1/P2)

| # | Worker | Scope | Files | Acceptance |
|---|---|---|---|---|
| 4.1 | scalar → file | Promote `mod scalar` to `src/simd_scalar.rs`. | `src/simd.rs`, new `src/simd_scalar.rs` | no behaviour change; `cargo check` clean on all configs |
| 4.2 | dispatch macro | Collapse the 4× duplicated `#[cfg(...)]` blocks into one macro. | `src/simd.rs` | adding a new lane type is one macro invocation |
| 4.3 | F16 honesty | Either rename `F16Scaler` or gate `F16x16` behind `f16c`. | `src/simd_avx2.rs` | scalar perf no longer surprises hot-path consumers |

### Phase 5 — Runtime dispatch (P2, opt-in)

**Goal:** ship-once binaries that adapt across heterogeneous deployment
silicon.

| # | Worker | Scope | Files | Acceptance |
|---|---|---|---|---|
| 5.1 | `simd_runtime` module | New module compiling all backends in and selecting per-op trampolines via `LazyLock<CpuCaps>`. | `src/simd_runtime.rs` | one binary runs on AVX-512 + AVX2-only hosts from the same artifact |
| 5.2 | feature flag | New `runtime-dispatch` cargo feature, mutually exclusive with `nightly-simd`. | `Cargo.toml`, `src/simd.rs` | `cargo check --features runtime-dispatch` clean on the v3 baseline |
| 5.3 | CI matrix | Add a `runtime-dispatch-portable` job. | `.github/workflows/ci.yaml` | job green |

### Phase 6 — CI matrix for explicit AVX-512 (P3)

| # | Worker | Scope | Files | Acceptance |
|---|---|---|---|---|
| 6.1 | AVX-512 explicit job | Add `avx-512-explicit` to `.github/workflows/ci.yaml` using `--config .cargo/config-avx512.toml`. Requires AVX-512-capable runner. | `.github/workflows/ci.yaml` | green on the AVX-512 runner |

## 7. Open questions

1. **Runtime trampoline cost class.** Phase 5's per-op indirection
adds one indirect call per `F32x16::add(...)`. Acceptable for the
typical 100+ cycle SIMD-op cost, but consumer benchmarks should
sanity-check before declaring the path production-ready.
2. **`feature = "nightly-simd"` precedence.** § 2 puts it at the top
of the cascade; alternative reading is "polyfill is for miri only,
so put it BELOW the arch-specific arms and only fire on non-x86_64,
non-aarch64 targets." The current proposal matches the user's
"explicit opt-in wins" framing; revisit if there's a use case for
`--features nightly-simd` on an AVX-512 host wanting the AVX-512
path.
3. **AMX status.** `simd_amx.rs` (Sapphire Rapids+ tile ops) is
x86_64-only and orthogonal to the F32x16 / U8x64 cross-arch surface.
Out of scope for this document; tracked under PR-X10 A6
(`linalg::distance`) follow-ups.

## 8. Cross-references

- `.claude/knowledge/vertical-simd-consumer-contract.md` — W1a consumer
contract every new SIMD primitive follows (struct methods on typed
wrappers, three-backend parity test, saturating/overflow semantics
documented).
- `.claude/knowledge/databend-ndarray-simd-prompt.md` — Databend
integration consumer of `crate::simd::*`.
- `.claude/knowledge/ndarray-simd-trojan-horse-prompt.md` — ClickHouse +
Tantivy injection plan; depends on Phase 1 + 2 landing.
- `src/simd.rs` lines 52-55 — existing `is_x86_feature_detected!`
reporting (NOT dispatch) — repurpose for Phase 5 trampoline.
- `src/simd_nightly/mod.rs` lines 37-44 — complete `pub use` set
ready to be wired into `simd.rs` dispatch (Phase 2).

## 9. TL;DR

Default cargo config drops to **`x86-64-v3`** (AVX2) → GitHub CI green by
default. **`.cargo/config-avx512.toml`** is the explicit AVX-512 path.
`simd_avx2.rs` needs ~10 missing two-half wrappers (P0, Phase 1).
`simd.rs` needs a `nightly-simd` dispatch arm so `simd_nightly/*`
becomes reachable (P1, Phase 2). NEON gets quartet wrappers (P1, Phase
3). Scalar / macros / runtime-dispatch / explicit-AVX-512 CI are
P2-P3 follow-ups (Phases 4-6). Each phase is one PR; landing
in order keeps every step green.
Loading