Skip to content

docs(simd): W1a consumer contract — 5 primitive specs + VPABSB correction#149

Merged
AdaWorldAPI merged 1 commit into
masterfrom
claude/vertical-simd-consumer-contract-w1a-spec
May 16, 2026
Merged

docs(simd): W1a consumer contract — 5 primitive specs + VPABSB correction#149
AdaWorldAPI merged 1 commit into
masterfrom
claude/vertical-simd-consumer-contract-w1a-spec

Conversation

@AdaWorldAPI
Copy link
Copy Markdown
Owner

Captures the consumer-side architectural contract that the AdaWorldAPI/lance-graph spine is staging against this ndarray fork. A PRE-MERGE audit of lance-graph main on 2026-05-16 surfaced 158 raw-intrinsic violations across 5 consumer crates plus 3 missing primitives here that block clean remediation. This doc is the spec for what the W1a primitive queue must do; consumer-side migrations cannot proceed until these primitives ship.

Files

  1. .claude/knowledge/vertical-simd-consumer-contract.md (NEW, 328 LOC) — the canonical W1a spec.
  2. CLAUDE.md § Hard Rules (+1 line) — pointer + VPABSB callout + gating-relationship statement.

The pattern

ndarray's SIMD surface is designed AS-IF for our exact workloads: struct methods on typed wrappers (I8x16, U8x32, F32x16, U64x8, …) plus closure-parameterized batch primitives that absorb the consumer's domain semantics. Consumers see zero raw intrinsics, zero cfg(target_arch), zero runtime feature-detect — they call I8x16::from_i4_packed_u64(...), I8x16::saturating_abs(...), batch_packed_i4_16(..., |lanes, aux| { ... }). The polyfill (this repo) owns dispatch, chunking, tail handling, and scalar fallback.

P0: VPABSB correction

The original lance-graph PR #400 capture claimed _mm512_abs_epi8 saturates i8::MIN → 127 by ISA. This is wrong — VPABSB returns the same bit pattern for 0x80 (i.e., abs(i8::MIN) = i8::MIN, since +128 doesn't fit in i8). Codex caught this on PR #400; the binding correction is:

// AVX-512 saturating_abs
let raw_abs = _mm512_abs_epi8(self.0);
let clamped = _mm512_min_epu8(raw_abs, _mm512_set1_epi8(0x7f));
I8x16(clamped)

The VPMINUB (unsigned-byte min) clamp remaps 0x80 (= 128 unsigned > 127) down to 0x7f. All other lanes are unchanged since abs(x) < 0x80 for x ≠ i8::MIN. NEON vqabsq_s8 is already hardware-saturating (the q suffix); scalar i8::saturating_abs is correct.

Mandatory parity test:

let input = I8x16::splat(i8::MIN);
assert_eq!(input.saturating_abs().lane_i8::<0>(), i8::MAX);

The widen-then-negate trick used in lance-graph PR #398's mul.rs is NOT a substitute — the new primitive must produce saturating semantics in the byte-wide register without widening, since downstream consumers will rely on byte-wide semantics for tight i4/i8 packed loops.

W1a queue — 5 primitives this PR specs

Each will be a separate PR (parallel review, tight scope):

TD API Consumer driving the spec
TD-NDARRAY-SIMD-UNPACK-I4-16D I8x16::from_i4_packed_u64 + batch_packed_i4_16<E, F> closure-batch lance-graph::mul::i4_eval::batch (5 fns)
TD-NDARRAY-SIMD-SATURATING-ABS-I8 I8x16::saturating_abs (VPABSB + VPMINUB clamp) lance-graph PR #398 Direction-B fix
TD-NDARRAY-SIMD-GATHER U16x8::gather_u16 + palette_lookup_u8x8 bgz17/src/simd.rs:88
TD-NDARRAY-SIMD-PREFETCH prefetch_read_t0/t1/t2 cross-arch bgz17/src/prefetch.rs:96,100
TD-NDARRAY-SIMD-POPCOUNT-U64 U64x8::popcnt + xor_popcount holograph/hamming.rs, blasgraph/types.rs

W1.5 — deferred primitives (gated on lance-graph::sigker certification)

Three more queued behind jc Pillar 11 activation: signature-PDE-sweep, randomized-projection, lyndon-pack. The W1a additions must be designed broad enough to compose with these — in particular, the closure-batch shape introduced in W1a-#1 is the foundation for W1.5-#7 (randomized signatures).

Acceptance criteria for each W1a PR

  1. All three backends (AVX*, NEON, scalar) — scalar is the correctness anchor
  2. Doc-comment states saturating/overflow/signedness semantics explicitly
  3. Mandatory parity test on randomized + edge-case corpus
  4. No new is_*_feature_detected! outside src/hpc/simd_caps.rs
  5. // SAFETY: comments on all unsafe blocks
  6. Consumer site cited in PR description

The simd-savant agent on the lance-graph side runs PRE-MERGE against every W1a PR to verify compliance.

Cross-references

  • AdaWorldAPI/lance-graph PR #399 — introduced the simd-savant agent + autoattended-multiagent pattern doc
  • AdaWorldAPI/lance-graph PR #400 — architectural capture (alien-magic + EPIPHANIES + TECH_DEBT)
  • AdaWorldAPI/lance-graph PR #398 — codex P1 (NEON OOB) + P2 (i8::MIN divergence) — the trigger
  • AdaWorldAPI/lance-graph PR (open) — corrects the VPABSB claim in lance-graph's knowledge doc
  • Intel Intrinsics Guide: _mm512_abs_epi8, _mm512_min_epu8, _mm512_popcnt_epi64, _mm256_i32gather_epi32
  • ARM Architecture Reference: VQABS (vqabsq_s8), VCNT (vcntq_u8)

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS


Generated by Claude Code

…tion

Captures the consumer-side architectural contract that the AdaWorldAPI/
lance-graph spine is staging against this ndarray fork. A PRE-MERGE
audit of lance-graph main on 2026-05-16 surfaced 158 raw-intrinsic
violations across 5 consumer crates plus 3 missing primitives here
that block clean remediation. This doc is the spec for what the
W1a primitive queue must do; consumer-side migrations cannot proceed
until these primitives ship.

Files (2):

1. .claude/knowledge/vertical-simd-consumer-contract.md (NEW, 328L)
   The pattern: struct methods on typed wrappers + closure-
   parameterized batch primitives. Consumers see zero raw intrinsics
   and zero arch-specific cfg; the polyfill (this repo) owns runtime
   feature dispatch, lane chunking, tail handling, scalar fallback.

   **VPABSB correction (P0):** _mm512_abs_epi8 does NOT saturate
   i8::MIN — it returns the same bit pattern (0x80 → 0x80, which is
   still -128 when interpreted as i8). The lance-graph PR #400 codex
   P1 review caught the original claim that VPABSB saturates by ISA;
   the correct AVX-512 saturating_abs is
     _mm512_min_epu8(_mm512_abs_epi8(x), _mm512_set1_epi8(0x7f))
   The VPMINUB clamp remaps 0x80 (unsigned 128 > 127) down to 0x7f.
   NEON vqabsq_s8 is already hardware-saturating (q-suffix); scalar
   i8::saturating_abs is correct.

   Five W1a primitive specs with per-arch implementation hints,
   API surfaces, mandatory parity-test requirements, and consumer
   call sites:
   - TD-NDARRAY-SIMD-UNPACK-I4-16D: I8x16::from_i4_packed_u64 +
     batch_packed_i4_16<E, F> closure-batch (consumer:
     lance-graph mul::i4_eval::batch, 5 fns)
   - TD-NDARRAY-SIMD-SATURATING-ABS-I8: I8x16::saturating_abs (the
     VPABSB+VPMINUB fix above)
   - TD-NDARRAY-SIMD-GATHER: U16x8::gather_u16 + palette_lookup_u8x8
     (consumer: bgz17/src/simd.rs:88)
   - TD-NDARRAY-SIMD-PREFETCH: prefetch_read_t0/t1/t2 cross-arch
     (consumer: bgz17/src/prefetch.rs:96,100)
   - TD-NDARRAY-SIMD-POPCOUNT-U64: U64x8::popcnt + U64x8::xor_popcount
     (consumer: holograph/hamming.rs, blasgraph/types.rs)

   Three W1.5 deferred primitives (gated on lance-graph:crates/sigker
   benchmarking + jc Pillar 11 activation): signature-PDE-sweep,
   randomized-projection, lyndon-pack. Mentioned so W1a additions
   are designed broad enough to compose with these later.

   Acceptance criteria: all three backends mandatory, parity test
   mandatory, saturating/overflow semantics documented, no new
   is_*_feature_detected! outside simd_caps, // SAFETY: on all
   unsafe blocks, consumer site cited in PR description.

   Cross-links to lance-graph knowledge doc, simd-savant card,
   E-SIMD-SWEEP-1 epiphany, TD-NDARRAY-SIMD-* / TD-SIMD-SWEEP-W*
   debt ledger, PR #398/#399/#400 history. Plus Intel + ARM
   intrinsic references, Hambly-Lyons 2010, Cuchiero-Schmocker-
   Teichmann 2021, Jirak 2016.

2. CLAUDE.md § Hard Rules (+1 row)
   "All new public pub fn in src/simd_*.rs follows the W1a consumer
   contract" — pointer to the doc, callout that VPABSB does NOT
   saturate i8::MIN, and statement of the gating relationship
   (missing primitives here force consumer-side raw-intrinsic
   violations, so additions here are blocking the consumer sweep).

Once this lands, W1a workers can spawn against this branch in
parallel: 5 small PRs against ndarray master, each implementing
one primitive from the queue with the parity tests mandated by
this doc.

https://claude.ai/code/session_01UwJuKqP828qyX1VkLgGJFS
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c2161b4a0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

```

**Per-arch implementation:**
- **AVX2/AVX-512:** `_mm256_i32gather_epi32` with index widening + downcast (caveat: `_mm256_i32gather_epi32` reads 32 bits per index; for u16 values pack two indices per gather slot, or downcast post-gather).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid reading past u16 tables with 32-bit gathers

For gather_u16, using _mm256_i32gather_epi32 against a &[u16] can read four bytes starting at table[index]; when an allowed index is table.len() - 1, that crosses the end of the slice even though the later bounds contract only requires max(indices) < table.len(). If a future implementation follows this hint, release builds can perform an out-of-bounds SIMD load for valid last-element lookups instead of safely returning the final u16.

Useful? React with 👍 / 👎.

**Per-arch implementation:**
- **AVX-512 VPOPCNTDQ:** `_mm512_popcnt_epi64` directly. Feature flag `avx512vpopcntdq`.
- **AVX-512 without VPOPCNTDQ:** fallback via `_mm512_sad_epu8` on a per-byte popcount LUT (Mula's algorithm using VPSHUFB).
- **NEON:** `vcntq_u8` for byte popcount, then horizontal sum within each u64 via `vaddvq_u8` or `vpaddlq_u8` cascade.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve u64 lanes in the NEON popcount recipe

For lane-wise U64x8::popcnt, vaddvq_u8 reduces all bytes in a NEON vector to one scalar, so using it as described here would merge the counts of multiple u64 lanes instead of returning one count per lane. This only shows up on the NEON backend, where a future implementation following this contract would disagree with the scalar and AVX results for inputs such as [1, 0, ...] versus [0, 1, ...]; the hint should require a per-u64 widening/pairwise reduction instead.

Useful? React with 👍 / 👎.

@AdaWorldAPI AdaWorldAPI merged commit d0627b8 into master May 16, 2026
14 checks passed
AdaWorldAPI pushed a commit that referenced this pull request May 16, 2026
Resolves add/add conflict on .claude/knowledge/vertical-simd-consumer-contract.md
by taking master's version (PR #149 — the polished READ BY / P0 TRIGGERS form
with agent routing). CLAUDE.md gains the W1a contract hard rule pointer
from master.

https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants