Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
328 changes: 328 additions & 0 deletions .claude/knowledge/vertical-simd-consumer-contract.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,328 @@
# KNOWLEDGE: Vertical SIMD — W1a Consumer Contract

## READ BY:
- `savant-architect` agent — before designing any new public `pub fn` in `src/simd_*.rs`
- `sentinel-qa` agent — when auditing the saturating / bounds-aware / scalar-fallback discipline on a SIMD addition
- Any contributor opening a PR that adds an `impl` block on `F32x16` / `I8x16` / `U8x32` / `U64x8` etc.
- Any contributor adding a new public function under `src/simd_ops.rs` or `src/simd_int_ops.rs`

## P0 TRIGGERS:
- About to file a PR adding `pub fn` to `src/simd_*.rs` → read this first
- About to claim "X SIMD instruction saturates by ISA" → read §"VPABSB correction" first
- Five `TD-NDARRAY-SIMD-*` issues are about to be filed against this repo from the `AdaWorldAPI/lance-graph` consumer contract → those are the W1a queue described below

---

## Why this doc exists

`AdaWorldAPI/lance-graph` (the obligatory spine for the Ada architecture) carries a hard architectural invariant: **all SIMD must come from `ndarray::simd` via the polyfill — `simd.rs` + `simd_ops.rs` > `simd_{type}.rs` per-arch. Raw intrinsics outside `ndarray/src/simd_*.rs` are a violation**, enforced by the `simd-savant` agent at `lance-graph:.claude/agents/simd-savant.md`.

A PRE-MERGE audit of `lance-graph` main on 2026-05-16 surfaced **158 raw-intrinsic violations across 5 consumer crates** plus **3 missing primitives** in `ndarray::simd` that block clean remediation. The lance-graph side is staged to migrate (in 5 sequential consumer PRs); the missing primitives must land in ndarray FIRST. This doc is the contract for what those primitives must do, with implementation details called out where consumer-side correctness depends on getting the semantics right.

The architectural shape this doc serves is captured in detail at:
- `AdaWorldAPI/lance-graph:.claude/knowledge/ndarray-vertical-simd-alien-magic.md` — the canonical reference, "alien magic" framing
- `AdaWorldAPI/lance-graph:.claude/agents/simd-savant.md` — the consumer-side enforcement card
- `AdaWorldAPI/lance-graph:.claude/board/EPIPHANIES.md` § `E-SIMD-SWEEP-1` (2026-05-16) — the 158-violation finding

---

## The pattern (one paragraph)

ndarray's SIMD surface is shaped to fit exactly what the Ada stack vertically needs — not as a generic library that consumers wrap, but as **struct methods on typed wrappers** (`I8x16`, `U8x32`, `F32x16`, `U64x8`, …) plus **closure-parameterized batch primitives** that absorb the consumer's domain semantics. Consumers see zero raw intrinsics, zero `cfg(target_arch)`, zero runtime feature-detect — they call `I8x16::from_i4_packed_u64(...)`, `I8x16::saturating_abs(...)`, `batch_packed_i4_16(..., |lanes, aux| { ... })`. The polyfill owns the runtime feature dispatch, lane chunking, tail handling, and scalar fallback. Per-arch code lives in `simd_avx512.rs` / `simd_neon.rs` / `simd_wasm.rs`; nothing arch-specific leaks above the `src/simd*.rs` namespace.

---

## VPABSB correction (P0 — read before implementing saturating_abs)

**`_mm512_abs_epi8` (VPABSB) does NOT saturate `i8::MIN`.** The Intel intrinsic returns the same bit pattern for `0x80` — i.e., `abs(i8::MIN) = i8::MIN` because `+128` does not fit in `i8`. An earlier draft of the consumer contract (2026-05-16 morning) claimed the instruction saturated `i8::MIN → 127` by ISA. Codex caught this on `lance-graph` PR #400; the correction is binding.

**Correct AVX-512 implementation of `I8x16::saturating_abs`:**

```rust
// AVX-512 path
let raw_abs = unsafe { _mm512_abs_epi8(self.0) };
let clamped = unsafe {
_mm512_min_epu8(raw_abs, _mm512_set1_epi8(0x7f))
};
I8x16(clamped)
```

The mechanic:
1. **VPABSB** computes the bit-pattern absolute value lane-wise. For `0x80` it returns `0x80` (the bit pattern of `+128` interpreted as unsigned). For everything else, `abs(x) < 0x80`, so the result fits in `i8` correctly.
2. **VPMINUB** (unsigned-byte min) then clamps `0x80` (=128 unsigned) down to `0x7f` (=127). All lanes with `abs(x) < 0x80` are unaffected because `min_epu8(x, 0x7f) = x` for `x ≤ 0x7f` and `min_epu8(0x80, 0x7f) = 0x7f`.

Equivalent NEON:
```rust
// vqabsq_s8 is hardware-saturating (the `q` suffix means saturating)
I8x16(unsafe { vqabsq_s8(self.0) })
// Returns 127 for i8::MIN, identical to the AVX-512 + clamp result
```

Scalar fused-loop:
```rust
for lane in 0..16 {
out[lane] = input[lane].saturating_abs(); // stdlib, well-defined
}
```

**Mandatory test** (binding for the PR):
```rust
#[test]
fn saturating_abs_i8_min_matches_across_backends() {
let input = I8x16::splat(i8::MIN);
let result = input.saturating_abs();
assert_eq!(result.lane_i8::<0>(), i8::MAX);
// ... and assert all 16 lanes equal i8::MAX
}
```

Any saturating-abs primitive in ndarray that does NOT produce `i8::MAX` for `i8::MIN` input is broken. The widen-then-negate trick (i8 → i64, then negate, then compare against threshold) used in `lance-graph` PR #398's mul.rs is a different mechanism and **not a substitute** — the new `I8x16::saturating_abs` must produce the saturating result in the same byte-wide register without widening, because downstream consumers will rely on byte-wide semantics for tight i4/i8 packed loops.

---

## W1a queue — 5 primitives ndarray must ship

Each is a tight-scope PR. Recommended: one branch per primitive, parallel review.

### W1a-#1 — `TD-NDARRAY-SIMD-UNPACK-I4-16D`

**Purpose:** unpack a `u64` of 16 packed signed nibbles (i4) into an `I8x16` with sign extension. Plus the closure-batch entry that the consumer's `mul::i4_eval::batch` dispatch calls.

**API surface:**
```rust
impl I8x16 {
/// Unpack 16 signed i4 nibbles from a u64 into 16 i8 lanes
/// (sign-extended). Nibble layout: lane[i] = sign_extend_4((packed >> (4*i)) & 0xf, i8).
pub fn from_i4_packed_u64(packed: u64) -> Self;

/// Const-folded lane extract.
pub fn lane_i8<const N: usize>(self) -> i8;
}

/// Closure-parameterized batch: run `f` over each (unpacked_i8x16, aux[i]) pair.
/// Bounds-aware tail handling; scalar fallback on unsupported arch.
pub fn batch_packed_i4_16<E, F>(
packed: &[u64],
aux: &[i8],
out: &mut [E],
f: F,
)
where
F: Fn(I8x16, i8) -> E + Sync + Send,
E: Copy;
```

**Per-arch implementation hints:**
- **AVX-512:** load 16 × i8 from u64 via `_mm_cvtsi64_si128` + extend with `_mm512_cvtepi8_epi16` + nibble shuffle (PEXTRB or VPSHUFB with a mask LUT), then sign-extend by `_mm_cvtepi8_epi16`. Bench against alternative: PDEP (`_pdep_u64` × 2) into two u64 halves, then load + `vpmovsxbw` for sign-extend. Pick whichever benches faster on Zen4 + Sapphire Rapids.
- **NEON:** `vld1_u8` 8 bytes into `uint8x8_t`, then nibble-split via `vshl_n_s8(v, 4)` and `vshr_n_s8(v, 4)`. Sign-extension is automatic from `vshr_n_s8`.
- **Scalar:** fused loop reading 16 nibbles via `((packed >> (4*i)) & 0xf) as i8` with manual sign-extend (`if x > 7 { x - 16 } else { x }`).

**Consumer call site:** `lance-graph:crates/lance-graph-contract/src/mul.rs::i4_eval::batch` (5 batch fns over `QualiaI4_16D(u64)`). The closure-batch absorbs the 5 fns into closures + classifier names.

**PR #398 codex P1 (NEON OOB at `len==2`) is closed by this primitive** because the batch entry owns tail handling; consumers no longer reach for raw `vld1q_u64(&qualia[i+1].0 as *const u64)`.

---

### W1a-#2 — `TD-NDARRAY-SIMD-SATURATING-ABS-I8`

**Purpose:** byte-wide saturating absolute value. Closes codex P2 i8::MIN divergence on `lance-graph` PR #398 by giving consumers a single source-of-truth.

**API surface:**
```rust
impl I8x16 {
/// Lane-wise saturating absolute value. saturating_abs(i8::MIN) == i8::MAX.
/// All lanes are independently saturated.
pub fn saturating_abs(self) -> Self;
}

impl I8x32 {
pub fn saturating_abs(self) -> Self; // parity
}
```

**Per-arch implementation:** see § "VPABSB correction" above. The AVX-512 path is `_mm512_min_epu8(_mm512_abs_epi8(x), _mm512_set1_epi8(0x7f))`; NEON is `vqabsq_s8`; scalar is `i8::saturating_abs`.

**Consumer:** `lance-graph:crates/lance-graph-contract/src/mul.rs` (Direction-B fix from PP-16 preflight-drift-auditor 2026-05-16). Spec line 233 of `lance-graph:.claude/specs/pr-sprint-13-simd-i4.md`: `|signed_mantissa| ≤ 1 → ValleyOfDespair` represents weak rule signal, NOT sign-extreme; `i8::MIN` must classify as `Slope/Plateau`, not `ValleyOfDespair`. Scalar in PR #398 is buggy (uses `unsigned_abs() as i8` which wraps `i8::MIN → -128`); the new primitive lets the fix be a one-liner: `lanes.saturating_abs().lane_i8::<0>()` ≤ 1.

---

### W1a-#3 — `TD-NDARRAY-SIMD-GATHER`

**Purpose:** SIMD gather for palette / lookup-table consumers. Currently `bgz17/src/simd.rs:88` inlines `_mm256_i32gather_epi32` (AP-SIMD-1 violation).

**API surface:**
```rust
impl U16x8 {
/// Gather 8 u16 values from `table` at the given indices.
/// indices[i] >= table.len() => panic in debug, scalar-fallback safe in release.
pub fn gather_u16(indices: U16x8, table: &[u16]) -> Self;
}

/// Convenience: lookup 8 bytes from a u8 LUT by u16 indices.
pub fn palette_lookup_u8x8(idx_v: U16x8, lut: &[u8]) -> U8x8;
```

**Per-arch implementation:**
- **AVX2/AVX-512:** `_mm256_i32gather_epi32` with index widening + downcast (caveat: `_mm256_i32gather_epi32` reads 32 bits per index; for u16 values pack two indices per gather slot, or downcast post-gather).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid reading past u16 tables with 32-bit gathers

For gather_u16, using _mm256_i32gather_epi32 against a &[u16] can read four bytes starting at table[index]; when an allowed index is table.len() - 1, that crosses the end of the slice even though the later bounds contract only requires max(indices) < table.len(). If a future implementation follows this hint, release builds can perform an out-of-bounds SIMD load for valid last-element lookups instead of safely returning the final u16.

Useful? React with 👍 / 👎.

- **NEON:** no native gather instruction. Scalar loop is fine for 8 lanes — `(0..8).map(|i| table[indices.lane(i) as usize])`.
- **Scalar:** identical to the NEON fallback.

**Bounds:** `gather_u16` MUST validate `max(indices) < table.len()` before the SIMD gather call (debug panic; in release, fall through to scalar with `.get()` for safety).

---

### W1a-#4 — `TD-NDARRAY-SIMD-PREFETCH`

**Purpose:** cross-arch prefetch hint. Currently `bgz17/src/prefetch.rs:96,100` inlines `_mm_prefetch` and `_prefetch` directly.

**API surface:**
```rust
/// Hint that `ptr` will be read soon; load into L1 (T0) cache.
pub fn prefetch_read_t0(ptr: *const u8);

/// Hint to load into L2 (T1) cache.
pub fn prefetch_read_t1(ptr: *const u8);

/// Hint to load into L3 (T2) cache.
pub fn prefetch_read_t2(ptr: *const u8);
```

**Per-arch implementation:**
- **x86_64:** `_mm_prefetch(ptr as *const i8, _MM_HINT_T0)` / `_T1` / `_T2`.
- **aarch64:** `__pld(ptr)` via inline asm `prfm pldl1keep, [ptr]` (T0), `pldl2keep` (T1), `pldl3keep` (T2). Or wrap `core::intrinsics::prefetch_read_data` if/when stable.
- **Other arches:** no-op (the prefetch contract is a hint, not a guarantee — silent no-op is correct).

**Safety:** `ptr` is allowed to be invalid (prefetch on an unmapped page is a hint that the CPU silently drops on x86). No `assert!` needed.

---

### W1a-#5 — `TD-NDARRAY-SIMD-POPCOUNT-U64`

**Purpose:** lane-wise popcount of u64 vectors. Currently `holograph/hamming.rs` and `lance-graph:crates/lance-graph/src/graph/blasgraph/types.rs` use `_mm512_popcnt_epi64` directly for Hamming-distance reduction.

**API surface:**
```rust
impl U64x8 {
/// Lane-wise population count. Each lane returns its u64 bit-count (0..=64).
pub fn popcnt(self) -> Self;

/// XOR + lane-wise popcount + horizontal sum across 8 lanes.
/// Optimized for Hamming-distance reductions.
pub fn xor_popcount(self, other: Self) -> u64;
}

impl U64x4 {
pub fn popcnt(self) -> Self; // AVX2 parity
}
```

**Per-arch implementation:**
- **AVX-512 VPOPCNTDQ:** `_mm512_popcnt_epi64` directly. Feature flag `avx512vpopcntdq`.
- **AVX-512 without VPOPCNTDQ:** fallback via `_mm512_sad_epu8` on a per-byte popcount LUT (Mula's algorithm using VPSHUFB).
- **NEON:** `vcntq_u8` for byte popcount, then horizontal sum within each u64 via `vaddvq_u8` or `vpaddlq_u8` cascade.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve u64 lanes in the NEON popcount recipe

For lane-wise U64x8::popcnt, vaddvq_u8 reduces all bytes in a NEON vector to one scalar, so using it as described here would merge the counts of multiple u64 lanes instead of returning one count per lane. This only shows up on the NEON backend, where a future implementation following this contract would disagree with the scalar and AVX results for inputs such as [1, 0, ...] versus [0, 1, ...]; the hint should require a per-u64 widening/pairwise reduction instead.

Useful? React with 👍 / 👎.

- **Scalar:** `u64::count_ones` fused loop.

**Note:** the existing `ndarray::hpc::bitwise::popcount_raw` and `hamming_distance_raw` cover the slice case but DO NOT expose a lane-wise method. The new `U64x8::popcnt` fills that gap so consumers can compose Hamming-distance pipelines without dropping back to slice ops.

---

## W1.5 — DEFERRED primitives (gated on `lance-graph:crates/sigker` certification)

Three more primitives are queued behind a certification gate. `crates/sigker` is `lance-graph`'s path-signature codec — it's pure-scalar Rust today (zero raw intrinsics, zero ndarray dep), and is positioned as the **Index-regime third encoding lane** alongside palette-distance (bgz17) and NSM tiling (deepnsm). It explicitly bypasses the `I-NOISE-FLOOR-JIRAK` iron rule (Jirak 2016 Berry-Esseen for weak-dependence data) via Hambly-Lyons 2010 path-signature uniqueness.

When `jc Pillar 11` (Hambly-Lyons signature uniqueness on lance-graph paths) activates and sigker is benchmarked at production carrier widths, the W1.5 queue lights up:

### W1.5-#6 — `TD-NDARRAY-SIMD-SIGNATURE-PDE-SWEEP`

**Purpose:** signature kernel `〈S(X), S(Y)〉` via Goursat PDE — depth-∞ in O(T₁·T₂) flops, no signature materialization.

**API surface (sketch):**
```rust
pub fn signature_pde_sweep<F>(
x: &[F32x16],
y: &[F32x16],
kernel_fn: F,
) -> f32
where
F: Fn(F32x16, F32x16) -> F32x16;
```

2D banded grid sweep; closure-parameterized kernel evaluator per step.

### W1.5-#7 — `TD-NDARRAY-SIMD-RANDOMIZED-PROJECTION`

Cuchiero-Schmocker-Teichmann (2021) randomized signatures: Gaussian random-matrix-vector update with `F32x16` state. Same closure-batch shape as W1a-#1, different lane type.

### W1.5-#8 — `TD-NDARRAY-SIMD-LYNDON-PACK`

Log-signature compression in the Lyndon basis of the free Lie algebra (7-13× compression, lossless). Pack/unpack primitives on `I16x16` state with combinatorial-index awareness.

**No code needed today for W1.5.** Mentioned here so W1a additions are designed broad enough to compose with these later (in particular: the closure-batch shape introduced in W1a-#1 is the foundation for W1.5-#7).

---

## Acceptance criteria for each W1a PR

Every PR adding a primitive from this queue MUST:

1. **Implement all three backends** (AVX-512/AVX2/SSE, NEON, scalar). Missing scalar fallback is a P0 reject — the scalar path is the correctness anchor.
2. **Document the saturating / overflow / signedness semantics** in the doc-comment. State explicitly what happens at edge cases (`i8::MIN`, `u8::MAX`, empty slices, indices out-of-range).
3. **Mandatory parity test** asserting all three backends produce identical output on a fixed-seed randomized corpus that includes edge cases (`i8::MIN`, `0`, `i8::MAX`, mantissa = -128, etc.). Use `proptest` or `quickcheck` if available; otherwise hand-roll 50+ test inputs.
4. **Bench against scalar** — record AVX-512 / NEON speedup ratios in the PR body. No SHIP/LAND gate required for the primitive PR itself (the consumer-side migration PRs will benchmark end-to-end), but a 0.5× anti-speedup ratio is a reject.
5. **`// SAFETY:` comments on every `unsafe` block** per ndarray's existing discipline (`CLAUDE.md` § Hard Rules).
6. **No new `is_*_feature_detected!` calls outside `src/hpc/simd_caps.rs`** — dispatch through the existing `simd_caps()` singleton.
7. **PR description must include the consumer site** (`lance-graph:crates/lance-graph-contract/src/mul.rs:NNN`, etc.) so the post-merge consumer-PR has a known target.

The `simd-savant` agent on the `lance-graph` side runs PRE-MERGE against every W1a PR to verify compliance.

---

## Cross-references

**ndarray-side (this repo):**
- `src/simd.rs` — the public re-export hub. New primitives surface here.
- `src/simd_avx512.rs` — AVX-512 typed wrappers (`I64x8`, `U64x8`, `I8x32`, `F32x16`, `F64x8`, …).
- `src/simd_avx2.rs` — AVX2 typed wrappers (`U8x32`).
- `src/simd_neon.rs` — NEON typed wrappers.
- `src/simd_ops.rs` — high-level vector→vector ops (`add_f32`, `mul_f32`, …).
- `src/simd_int_ops.rs` — integer batch ops (`add_i8`, `dot_i8`, `min_i8`, …).
- `src/hpc/simd_caps.rs` — runtime feature-detect singleton.
- `src/hpc/bitwise.rs` — already-exposed `hamming_distance_raw` + `popcount_raw` (slice case).

**lance-graph-side (the consumer driving this contract):**
- `AdaWorldAPI/lance-graph:.claude/knowledge/ndarray-vertical-simd-alien-magic.md` — full architectural doc + per-workload table
- `AdaWorldAPI/lance-graph:.claude/agents/simd-savant.md` — PRE-MERGE audit gate
- `AdaWorldAPI/lance-graph:.claude/board/EPIPHANIES.md` § `E-SIMD-SWEEP-1` — the 158-violation finding
- `AdaWorldAPI/lance-graph:.claude/board/TECH_DEBT.md` § `TD-NDARRAY-SIMD-*` and § `TD-SIMD-SWEEP-W*` — full debt ledger
- `AdaWorldAPI/lance-graph:.claude/specs/pr-sprint-13-simd-i4.md` — D-CSV-13b spec (the consumer workload spec)
- PR #398 (sprint-13 W-I1 retry) — the codex P1 (NEON OOB) + P2 (i8::MIN divergence) origin
- PR #399 (`simd-savant` card + autoattended-pattern doc) — invariant declaration
- PR #400 (architectural capture commit) — the canonical reference + tech-debt entries

**External references:**
- Intel Intrinsics Guide — `_mm512_abs_epi8` (VPABSB; does NOT saturate `i8::MIN`)
- Intel Intrinsics Guide — `_mm512_min_epu8` (VPMINUB; unsigned-byte minimum, used to clamp the VPABSB result)
- Intel Intrinsics Guide — `_mm512_popcnt_epi64` (VPOPCNTDQ; AVX-512 feature `avx512vpopcntdq`)
- Intel Intrinsics Guide — `_mm256_i32gather_epi32` (VPGATHERDD AVX2)
- ARM Architecture Reference — VQABS (`vqabsq_s8`, hardware-saturating)
- ARM Architecture Reference — VCNT (`vcntq_u8`, byte-wise popcount)
- Hambly & Lyons (2010), "Uniqueness for the signature of a path of bounded variation and the reduced path group"
- Cuchiero, Schmocker & Teichmann (2021), "Random feature neural networks learn Black-Scholes type PDEs without curse of dimensionality"
- Jirak (2016), "Berry-Esseen theorems under weak dependence" — the iron rule sigker bypasses

## Litmus tests (for any contributor proposing an addition to this queue)

> **Does the new primitive go on a typed-wrapper struct, or as a free function?**
> Free function = reject; the surface fragments. Struct method = accept.

> **Does the doc-comment state the edge-case behavior (saturating? wrapping? UB? scalar-fallback?)?**
> Missing = reject. The consumer needs to know without reading the code.

> **Are all three backends implemented (AVX*, NEON, scalar)?**
> Missing scalar = reject. Scalar is the correctness anchor.

> **Is there a parity test asserting all three backends produce identical output on a fixed-seed randomized corpus including edge cases?**
> Missing = reject. The codex P2 i8::MIN divergence on `lance-graph` PR #398 happened because no such test existed.

> **Is the consumer site cited in the PR description?**
> Missing = reject. We're shipping primitives for known workloads, not speculative ones.
1 change: 1 addition & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@ This project uses specialized agents in `.claude/agents/`. Follow these rules:
- Every `unsafe` block needs a `// SAFETY:` comment.
- All public APIs need `///` doc comments with examples.
- `cargo clippy -- -D warnings` must pass.
- **All new public `pub fn` in `src/simd_*.rs` follows the W1a consumer contract** at `.claude/knowledge/vertical-simd-consumer-contract.md` — struct methods on typed wrappers, closure-parameterized batch primitives, all three backends (AVX*/NEON/scalar) implemented, parity test mandatory, saturating/overflow semantics documented. The Ada stack (lance-graph + downstream) enforces "all SIMD from `ndarray::simd`" via its `simd-savant` agent; missing primitives in ndarray force consumer-side raw-intrinsic violations, so additions here are gating the consumer-side sweep. **VPABSB does NOT saturate `i8::MIN`** — see § "VPABSB correction" in the contract doc before implementing `saturating_abs` or any abs primitive.

## Compaction Preservation
When summarizing this conversation, preserve:
Expand Down
Loading