From ff234171c6f5ed1363ee4e807c2a18dbd50171f7 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Sat, 16 May 2026 20:12:38 +0000
Subject: [PATCH 1/7] Add detailed hpc/ integration plan: bridge raw-slice
 layer to ndarray types
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

6-tier refactoring shopping list transforming hpc/ from a bolted-on
raw-slice library into a first-class ndarray citizen:

- Tier 1: Type bridges (Fingerprint/VsaVector ↔ ArrayView, BF16 → BlasFloat)
- Tier 2: Extension traits (HdcOps, Quantize, Prefilter, SimdMath)
- Tier 3: Backend wiring (core sum/mean → SIMD, unified detection)
- Tier 4: View factories (Arrow → ArrayView)
- Tier 5: Zip modernization (VML tails, parallel hamming)
- Tier 6: N-D axis reductions with SIMD dispatch

No deletions — raw-slice functions stay for FFI/embedded. Extension traits
add ndarray-native overloads that delegate to existing SIMD kernels.

https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo
---
 REFACTOR_HPC_INTEGRATION.md | 1006 +++++++++++++++++++++++++++++++++++
 1 file changed, 1006 insertions(+)
 create mode 100644 REFACTOR_HPC_INTEGRATION.md
diff --git a/REFACTOR_HPC_INTEGRATION.md b/REFACTOR_HPC_INTEGRATION.md
new file mode 100644
index 00000000..1c11c56d
--- /dev/null
+++ b/REFACTOR_HPC_INTEGRATION.md
@@ -0,0 +1,1006 @@
+# ndarray hpc/ Module Integration & Transformation Plan
+
+> **Goal**: Transform hpc/ from a bolted-on raw-slice library into a first-class
+> citizen of ndarray's type system — without breaking any existing API.
+>
+> **Principle**: Every refactoring adds an ndarray-native overload or trait impl.
+> Raw-slice functions stay (FFI, embedded, zero-overhead). No deletion.
+
+---
+
+## Architectural Diagnosis
+
+The hpc/ module (80K lines, 90+ files) currently operates in two disconnected worlds:
+
+```
+┌─────────────────────────────────────────────────────┐
+│  ndarray core                                       │
+│  Array<f32, Ix2>, ArrayView, Zip, Broadcasting,     │
+│  .sum(), .dot(), mapv(), LinalgScalar, BlasFloat    │
+└───────────────┬─────────────────────────────────────┘
+                │ .as_slice() ← the only bridge
+                ▼
+┌─────────────────────────────────────────────────────┐
+│  hpc/ module                                        │
+│  &[u8], &[u64], &[f32], *const u8, Vec<f32>        │
+│  Fingerprint<N>, VsaVector, Cascade, EnergyConflict │
+│  Custom SIMD dispatch, manual row/col params        │
+└─────────────────────────────────────────────────────┘
+```
+
+The refactoring creates **bidirectional bridges** without removing the raw layer:
+
+```
+┌─────────────────────────────────────────────────────┐
+│  ndarray core (unchanged)                           │
+│  Array<f32, Ix2>, ArrayView, Zip, Broadcasting      │
+└───────────────┬──────────────────────▲──────────────┘
+                │                      │
+          .as_slice()          backend dispatch
+                │                      │
+                ▼                      │
+┌─────────────────────────────────────────────────────┐
+│  hpc/ bridge layer (NEW)                            │
+│  Extension traits on ArrayBase<S, D>                │
+│  From/Into impls for domain types                   │
+│  Core reductions route to SIMD backends             │
+└───────────────┬──────────────────────▲──────────────┘
+                │                      │
+          delegates to          implements
+                │                      │
+                ▼                      │
+┌─────────────────────────────────────────────────────┐
+│  hpc/ raw compute (unchanged)                       │
+│  &[u8], &[u64], Fingerprint<N>, SIMD dispatch       │
+│  K0/K1/K2, BF16 GEMM, VNNI, VML                    │
+└─────────────────────────────────────────────────────┘
+```
+
+---
+
+## Refactoring Shopping List
+
+### Tier 1: Type Bridge — Domain Types ↔ Array (Low effort, High impact)
+
+These add conversion traits so domain types can participate in ndarray's ecosystem.
+
+---
+
+#### 1.1 Fingerprint<N> ↔ ArrayView1<u64>
+
+**File**: `src/hpc/fingerprint.rs`
+
+**Current state**: `Fingerprint<N>` wraps `[u64; N]`. Provides `.as_bytes()` via
+unsafe pointer cast. No connection to ndarray.
+
+**Transform**:
+
+```rust
+// --- Zero-copy views (no allocation) ---
+
+impl<const N: usize> Fingerprint<N> {
+    /// View this fingerprint as an ndarray 1-D array of u64 words.
+    /// Zero-copy: borrows the internal buffer.
+    pub fn as_array_view(&self) -> ArrayView1<'_, u64> {
+        ArrayView1::from_shape(N, &self.words).unwrap()
+    }
+
+    /// Mutable view for in-place operations.
+    pub fn as_array_view_mut(&mut self) -> ArrayViewMut1<'_, u64> {
+        ArrayViewMut1::from_shape(N, &mut self.words).unwrap()
+    }
+
+    /// View as 2-D matrix (e.g., 256 = 32×8 for tiled operations).
+    pub fn as_matrix_view(&self, rows: usize, cols: usize) -> ArrayView2<'_, u64> {
+        assert_eq!(rows * cols, N);
+        ArrayView2::from_shape((rows, cols), &self.words).unwrap()
+    }
+}
+
+// --- Owned conversions (allocation on From, zero-copy on Into) ---
+
+impl<const N: usize> From<Array1<u64>> for Fingerprint<N> {
+    fn from(arr: Array1<u64>) -> Self {
+        assert_eq!(arr.len(), N);
+        let mut words = [0u64; N];
+        words.copy_from_slice(arr.as_slice().unwrap());
+        Self { words }
+    }
+}
+
+impl<const N: usize> From<Fingerprint<N>> for Array1<u64> {
+    fn from(fp: Fingerprint<N>) -> Self {
+        Array1::from_vec(fp.words.to_vec())
+    }
+}
+```
+
+**Why**: Callers can now pass fingerprints to any function expecting `ArrayView1<u64>`.
+Enables `.sum()`, `.mapv()`, Zip, broadcasting on fingerprint data.
+
+**Example — before vs. after**:
+
+```rust
+// BEFORE: raw slice, manual loop
+let dist = k2_exact(&fp_a.words, &fp_b.words, 256);
+
+// AFTER: can also use ndarray ops directly
+let view_a = fp_a.as_array_view();
+let view_b = fp_b.as_array_view();
+let xor_popcount: u64 = (&view_a ^ &view_b).mapv(|w| w.count_ones() as u64).sum();
+```
+
+---
+
+#### 1.2 VsaVector ↔ ArrayView1<u64>
+
+**File**: `src/hpc/vsa.rs`
+
+**Same pattern as 1.1** — VsaVector has `words: [u64; 256]`. Add:
+
+```rust
+impl VsaVector {
+    pub fn as_array_view(&self) -> ArrayView1<'_, u64> {
+        ArrayView1::from_shape(VSA_WORDS, &self.words).unwrap()
+    }
+    pub fn as_array_view_mut(&mut self) -> ArrayViewMut1<'_, u64> {
+        ArrayViewMut1::from_shape(VSA_WORDS, &mut self.words).unwrap()
+    }
+}
+
+impl From<Array1<u64>> for VsaVector { ... }
+impl From<VsaVector> for Array1<u64> { ... }
+```
+
+---
+
+#### 1.3 BF16 / F16 as ndarray Element Types
+
+**Files**: `src/hpc/quantized.rs`, `src/backend/mod.rs`
+
+**Current state**: BF16 and F16 implement `ScalarOperand` (can appear in arrays)
+but NOT `BlasFloat` (cannot use `.dot()`, BLAS traits). They're second-class citizens.
+
+**Transform** — Register BF16/F16 for BLAS dispatch:
+
+```rust
+// In src/backend/mod.rs or src/hpc/quantized.rs:
+
+impl BlasFloat for BF16 {
+    fn backend_dot(xs: &[Self], ys: &[Self]) -> Self {
+        // Accumulate in f32, return as BF16
+        let mut acc = 0.0f32;
+        for (&x, &y) in xs.iter().zip(ys.iter()) {
+            acc += x.to_f32() * y.to_f32();
+        }
+        BF16::from_f32(acc)
+    }
+
+    fn backend_gemm(m: usize, n: usize, k: usize, alpha: Self,
+                    a: &[Self], lda: usize, b: &[Self], ldb: usize,
+                    beta: Self, c: &mut [Self], ldc: usize) {
+        // Delegate to existing bf16_gemm_f32, convert output back
+        crate::hpc::quantized::bf16_gemm_f32(
+            a, b,
+            /* c_f32 temp */ ...,
+            m, n, k, alpha.to_f32(), beta.to_f32()
+        );
+        // Convert f32 results back to BF16 in c
+    }
+
+    fn backend_axpy(alpha: Self, x: &[Self], y: &mut [Self]) { ... }
+    fn backend_scal(alpha: Self, x: &mut [Self]) { ... }
+    fn backend_nrm2(x: &[Self]) -> Self { ... }
+    fn backend_asum(x: &[Self]) -> Self { ... }
+}
+```
+
+**Impact**: Unlocks this:
+
+```rust
+let a: Array2<BF16> = ...;
+let b: Array2<BF16> = ...;
+let c = a.blas_gemm(BF16::one(), &b, BF16::zero());  // Just works
+```
+
+---
+
+#### 1.4 CogRecord ↔ Structured ArrayView
+
+**File**: `src/hpc/cogrecord.rs`
+
+**Current state**: CogRecord has four `Array<u8, Ix1>` fields (btree, cam, embed, meta)
+of 4096 bytes each. No combined view.
+
+**Transform**:
+
+```rust
+impl CogRecord {
+    /// View all 4 channels as a single [4, 4096] matrix.
+    pub fn as_channels_view(&self) -> ArrayView2<'_, u8> {
+        // Stack the 4 contiguous buffers into a 2-D view
+        let total = self.btree.len() + self.cam.len() + self.embed.len() + self.meta.len();
+        debug_assert_eq!(total, 4 * CONTAINER_BYTES);
+        // NOTE: only works if fields are contiguous in memory.
+        // If not, provide a copy-based `to_channels_array()`.
+        todo!("requires contiguous layout guarantee")
+    }
+
+    /// Get a single channel as ArrayView1<u64> (for Hamming/XOR ops).
+    pub fn channel_as_words(&self, channel: usize) -> ArrayView1<'_, u64> {
+        let bytes = match channel {
+            0 => self.btree.as_slice().unwrap(),
+            1 => self.cam.as_slice().unwrap(),
+            2 => self.embed.as_slice().unwrap(),
+            3 => self.meta.as_slice().unwrap(),
+            _ => panic!("channel must be 0..4"),
+        };
+        // Safe: u8 slice reinterpreted as u64 (alignment checked)
+        let words = unsafe {
+            std::slice::from_raw_parts(
+                bytes.as_ptr() as *const u64,
+                bytes.len() / 8,
+            )
+        };
+        ArrayView1::from_shape(words.len(), words).unwrap()
+    }
+}
+```
+
+---
+
+### Tier 2: Trait Extensions — hpc Operations on Array Types (Medium effort, High impact)
+
+Add extension traits so hpc operations feel native on ndarray arrays.
+
+---
+
+#### 2.1 HDC Extension Trait (Hamming, Cascade, K0/K1/K2)
+
+**New file**: `src/hpc/ndarray_ext.rs` (or extend `bitwise.rs`)
+
+**Reasoning**: The K0/K1/K2 pipeline currently takes `&[u64]`. Users with
+`Array1<u64>` must manually call `.as_slice().unwrap()`. This trait bridges that.
+
+```rust
+use crate::imp_prelude::*;
+use super::kernels::{k0_probe, k1_stats, k2_exact, SliceGate, EnergyConflict};
+
+/// HDC (Hyperdimensional Computing) operations on u64 word arrays.
+///
+/// Provides the K0/K1/K2 rejection pipeline directly on ndarray types.
+pub trait HdcOps {
+    /// K0 probe: fast 64-bit reject.
+    /// Returns true if the first word passes the gate threshold.
+    fn k0_probe(&self, candidate: &Self, gate: &SliceGate) -> bool;
+
+    /// K1 stats: 512-bit medium reject.
+    /// Returns true if the first 8 words pass the gate threshold.
+    fn k1_stats(&self, candidate: &Self, gate: &SliceGate) -> bool;
+
+    /// K2 exact: Full-width Hamming + energy decomposition.
+    fn k2_exact(&self, candidate: &Self) -> EnergyConflict;
+
+    /// Full cascade: K0 → K1 → K2 with early rejection.
+    fn cascade_distance(&self, candidate: &Self, gate: &SliceGate) -> Option<EnergyConflict>;
+}
+
+impl<S> HdcOps for ArrayBase<S, Ix1>
+where
+    S: Data<Elem = u64>,
+{
+    fn k0_probe(&self, candidate: &Self, gate: &SliceGate) -> bool {
+        k0_probe(self[0], candidate[0], gate)
+    }
+
+    fn k1_stats(&self, candidate: &Self, gate: &SliceGate) -> bool {
+        let q = self.as_slice().expect("query must be contiguous");
+        let c = candidate.as_slice().expect("candidate must be contiguous");
+        k1_stats(q, c, gate)
+    }
+
+    fn k2_exact(&self, candidate: &Self) -> EnergyConflict {
+        let q = self.as_slice().expect("query must be contiguous");
+        let c = candidate.as_slice().expect("candidate must be contiguous");
+        k2_exact(q, c, q.len())
+    }
+
+    fn cascade_distance(&self, candidate: &Self, gate: &SliceGate) -> Option<EnergyConflict> {
+        if !self.k0_probe(candidate, gate) { return None; }
+        if !self.k1_stats(candidate, gate) { return None; }
+        Some(self.k2_exact(candidate))
+    }
+}
+```
+
+**Example usage after**:
+
+```rust
+use ndarray::hpc::ndarray_ext::HdcOps;
+
+let query: Array1<u64> = fp_query.into();
+let candidate: Array1<u64> = fp_candidate.into();
+let gate = SliceGate::for_sku16k();
+
+// Clean ndarray-native API
+if let Some(result) = query.cascade_distance(&candidate, &gate) {
+    println!("Hamming: {}, Agreement: {}", result.conflict, result.agreement);
+}
+```
+
+---
+
+#### 2.2 Quantization Extension Trait
+
+**File**: extend `src/hpc/quantized.rs`
+
+**Reasoning**: Quantize/dequantize currently returns `Vec<T>`. Should return `Array1<T>`.
+
+```rust
+/// Quantization operations on f32 arrays.
+pub trait Quantize {
+    /// Quantize to BF16 (element-wise truncation).
+    fn to_bf16(&self) -> Array<BF16, Ix1>;
+
+    /// Quantize to symmetric INT8 with scale factor.
+    fn to_i8_symmetric(&self) -> (Array<i8, Ix1>, QuantParams);
+
+    /// Quantize per-channel (rows) to INT8.
+    fn to_i8_per_channel(&self) -> (Array2<i8>, PerChannelQuantParams)
+    where
+        Self: AsArray<'_, f32, Ix2>;  // Only for 2-D
+}
+
+/// Dequantization operations.
+pub trait Dequantize<A> {
+    /// Dequantize INT8 codes back to f32.
+    fn dequantize_f32(&self, params: &QuantParams) -> Array<f32, Ix1>;
+}
+
+impl<S> Quantize for ArrayBase<S, Ix1>
+where
+    S: Data<Elem = f32>,
+{
+    fn to_bf16(&self) -> Array<BF16, Ix1> {
+        let slice = self.as_slice().expect("must be contiguous");
+        let mut out = vec![BF16::ZERO; slice.len()];
+        f32_to_bf16_slice(slice, &mut out);  // existing raw function
+        Array1::from_vec(out)
+    }
+
+    fn to_i8_symmetric(&self) -> (Array<i8, Ix1>, QuantParams) {
+        let slice = self.as_slice().expect("must be contiguous");
+        let (vec, params) = quantize_f32_to_i8(slice);  // existing raw function
+        (Array1::from_vec(vec), params)
+    }
+}
+```
+
+---
+
+#### 2.3 Prefilter — Accept ArrayView2 Instead of (slice, rows, cols)
+
+**File**: `src/hpc/prefilter.rs`
+
+**Current state** (lines 48-295):
+```rust
+pub fn approx_mean_std_f32(data: &[f32]) -> (f32, f32)
+pub fn approx_row_norms_f32(data: &[f32], rows: usize, cols: usize) -> Vec<f32>
+pub fn top_k_rows_by_norm(data: &[f32], rows: usize, cols: usize, k: usize) -> Vec<usize>
+pub fn approx_column_std(data: &[f32], rows: usize, cols: usize) -> Vec<f32>
+```
+
+**Transform** — add ndarray overloads that delegate to existing implementations:
+
+```rust
+/// Array-native overloads for prefilter operations.
+pub trait Prefilter {
+    /// Approximate mean and standard deviation (SIMD).
+    fn approx_mean_std(&self) -> (f32, f32);
+
+    /// L2 norm per row (SIMD).
+    fn row_norms(&self) -> Array1<f32>;
+
+    /// Indices of top-K rows by L2 norm.
+    fn top_k_by_norm(&self, k: usize) -> Array1<usize>;
+
+    /// Standard deviation per column.
+    fn column_std(&self) -> Array1<f32>;
+}
+
+impl<S> Prefilter for ArrayBase<S, Ix2>
+where
+    S: Data<Elem = f32>,
+{
+    fn approx_mean_std(&self) -> (f32, f32) {
+        let slice = self.as_slice().expect("must be contiguous");
+        approx_mean_std_f32(slice)
+    }
+
+    fn row_norms(&self) -> Array1<f32> {
+        let slice = self.as_slice().expect("must be contiguous");
+        let (rows, cols) = self.dim();
+        Array1::from_vec(approx_row_norms_f32(slice, rows, cols))
+    }
+
+    fn top_k_by_norm(&self, k: usize) -> Array1<usize> {
+        let slice = self.as_slice().expect("must be contiguous");
+        let (rows, cols) = self.dim();
+        Array1::from_vec(top_k_rows_by_norm(slice, rows, cols, k))
+    }
+
+    fn column_std(&self) -> Array1<f32> {
+        let slice = self.as_slice().expect("must be contiguous");
+        let (rows, cols) = self.dim();
+        Array1::from_vec(approx_column_std(slice, rows, cols))
+    }
+}
+```
+
+---
+
+#### 2.4 Activations — Extend to N-D with Axis
+
+**File**: `src/hpc/activations.rs`
+
+**Current state**: Trait only works on `ArrayBase<S, Ix1>`. No per-row softmax.
+
+**Transform**:
+
+```rust
+/// Multi-dimensional activations (applied along an axis).
+pub trait ActivationsNd<A> {
+    /// Softmax along the given axis.
+    /// For a [batch, features] matrix with axis=1, computes softmax per row.
+    fn softmax_axis(&self, axis: Axis) -> Array<A, Self::Dim>;
+
+    /// Log-softmax along the given axis.
+    fn log_softmax_axis(&self, axis: Axis) -> Array<A, Self::Dim>;
+}
+
+impl<A, S> ActivationsNd<A> for ArrayBase<S, Ix2>
+where
+    A: Float + 'static,
+    S: Data<Elem = A>,
+{
+    fn softmax_axis(&self, axis: Axis) -> Array<A, Ix2> {
+        let mut result = self.to_owned();
+        for mut row in result.axis_iter_mut(axis) {
+            let max_val = row.iter().fold(A::neg_infinity(), |a, &b| a.max(b));
+            row.mapv_inplace(|v| (v - max_val).exp());
+            let sum: A = row.iter().fold(A::zero(), |acc, &v| acc + v);
+            row.mapv_inplace(|v| v / sum);
+        }
+        result
+    }
+
+    fn log_softmax_axis(&self, axis: Axis) -> Array<A, Ix2> {
+        let mut result = self.to_owned();
+        for mut row in result.axis_iter_mut(axis) {
+            let max_val = row.iter().fold(A::neg_infinity(), |a, &b| a.max(b));
+            row.mapv_inplace(|v| v - max_val);
+            let log_sum_exp = row.mapv(|v| v.exp())
+                .iter().fold(A::zero(), |acc, &v| acc + v).ln();
+            row.mapv_inplace(|v| v - log_sum_exp);
+        }
+        result
+    }
+}
+```
+
+**Impact**: Enables `batch_logits.softmax_axis(Axis(1))` — critical for inference.
+
+---
+
+#### 2.5 VML — SIMD Transcendentals as mapv Acceleration
+
+**File**: `src/hpc/vml.rs` (2543 lines)
+
+**Current state**: 12+ standalone functions like `vsexp(x: &[f32], out: &mut [f32])`.
+
+**Transform** — add `ArrayBase` methods that auto-dispatch to SIMD:
+
+```rust
+/// SIMD-accelerated element-wise math on contiguous f32 arrays.
+///
+/// These call the VML SIMD kernels when the array is contiguous,
+/// falling back to scalar mapv() otherwise.
+pub trait SimdMath {
+    fn simd_exp(&self) -> Array<f32, Self::Dim>;
+    fn simd_ln(&self) -> Array<f32, Self::Dim>;
+    fn simd_sqrt(&self) -> Array<f32, Self::Dim>;
+    fn simd_sin(&self) -> Array<f32, Self::Dim>;
+    fn simd_cos(&self) -> Array<f32, Self::Dim>;
+    fn simd_tanh(&self) -> Array<f32, Self::Dim>;
+    fn simd_abs(&self) -> Array<f32, Self::Dim>;
+}
+
+impl<S, D> SimdMath for ArrayBase<S, D>
+where
+    S: Data<Elem = f32>,
+    D: Dimension,
+{
+    fn simd_exp(&self) -> Array<f32, D> {
+        if let Some(slice) = self.as_slice() {
+            let mut out = vec![0.0f32; slice.len()];
+            vsexp(slice, &mut out);  // existing SIMD kernel
+            Array::from_shape_vec(self.raw_dim(), out).unwrap()
+        } else {
+            self.mapv(|x| x.exp())
+        }
+    }
+
+    fn simd_ln(&self) -> Array<f32, D> {
+        if let Some(slice) = self.as_slice() {
+            let mut out = vec![0.0f32; slice.len()];
+            vsln(slice, &mut out);
+            Array::from_shape_vec(self.raw_dim(), out).unwrap()
+        } else {
+            self.mapv(|x| x.ln())
+        }
+    }
+    // ... same pattern for sqrt, sin, cos, tanh, abs
+}
+```
+
+**Benchmark expectation**: 10–50x for transcendentals (polynomial SIMD vs scalar libm).
+
+---
+
+### Tier 3: Backend Integration — Core Dispatch to SIMD (Medium effort, Very High impact)
+
+Make ndarray's **built-in** operations (`.sum()`, `.mean()`, `.dot()`) automatically
+use hpc SIMD kernels for contiguous arrays.
+
+---
+
+#### 3.1 Wire Reductions into Core
+
+**Files**: `src/numeric/impl_numeric.rs` (core sum/mean) + `src/hpc/reductions.rs`
+
+**Current state**: `Array.sum()` uses `numeric_util::unrolled_fold()` — a scalar
+4x unrolled loop. `hpc::reductions::sum_f32()` uses F32x16 (16-lane AVX-512)
+with 4-accumulator ILP. They never talk to each other.
+
+**Transform** — add backend dispatch in core's sum():
+
+```rust
+// In src/numeric/impl_numeric.rs or src/backend/native.rs:
+
+impl<A, S, D> ArrayBase<S, D>
+where
+    A: Clone + num_traits::Zero + std::ops::Add<Output = A>,
+    S: Data<Elem = A>,
+    D: Dimension,
+{
+    pub fn sum(&self) -> A {
+        // Fast path: contiguous f32 → SIMD reduction
+        if std::any::TypeId::of::<A>() == std::any::TypeId::of::<f32>() {
+            if let Some(slice) = self.as_slice_memory_order() {
+                // SAFETY: TypeId check guarantees A == f32
+                let f32_slice = unsafe {
+                    std::slice::from_raw_parts(slice.as_ptr() as *const f32, slice.len())
+                };
+                let result = crate::hpc::reductions::sum_f32(f32_slice);
+                // SAFETY: result is f32, A is f32
+                return unsafe { std::mem::transmute_copy(&result) };
+            }
+        }
+        if std::any::TypeId::of::<A>() == std::any::TypeId::of::<f64>() {
+            if let Some(slice) = self.as_slice_memory_order() {
+                let f64_slice = unsafe {
+                    std::slice::from_raw_parts(slice.as_ptr() as *const f64, slice.len())
+                };
+                let result = crate::hpc::reductions::sum_f64(f64_slice);
+                return unsafe { std::mem::transmute_copy(&result) };
+            }
+        }
+        // Fallback: generic fold
+        self.fold(A::zero(), |acc, &x| acc + x)
+    }
+}
+```
+
+**Impact**: Every `.sum()` on a contiguous f32/f64 array silently becomes 16x
+faster. Zero API change for users.
+
+**Same pattern for**: `.mean()`, internal `max()`/`min()` in softmax paths.
+
+---
+
+#### 3.2 Unify SIMD Detection (Delete Duplicates)
+
+**Files to merge**: `src/hpc/simd_dispatch.rs` (362 lines), `src/hpc/simd_caps.rs` (515 lines)
+**Into**: `src/simd.rs` (the existing core SIMD module)
+
+**Current state**: Three overlapping detection systems:
+1. `src/simd.rs` → `LazyLock<Tier>` — detects AVX-512/AVX2/NEON
+2. `src/hpc/simd_caps.rs` → `CpuCaps` struct — detects same capabilities
+3. `src/hpc/simd_dispatch.rs` → `SimdDispatch` — another LazyLock with function pointers
+
+**Transform**:
+
+```rust
+// 1. In src/simd.rs, export the unified detection:
+pub static CPU_CAPS: LazyLock<CpuCaps> = LazyLock::new(|| detect_cpu_caps());
+
+pub struct CpuCaps {
+    pub tier: Tier,  // existing
+    pub avx512f: bool,
+    pub avx512bw: bool,
+    pub avx512vpopcntdq: bool,
+    pub avx512vnni: bool,
+    pub avx2: bool,
+    pub fma: bool,
+    pub popcnt: bool,
+    // ARM
+    pub neon: bool,
+    pub sve: bool,
+}
+
+// 2. In src/hpc/simd_dispatch.rs, replace with:
+pub use crate::simd::CPU_CAPS;
+// Keep function-pointer dispatch but source caps from unified singleton
+
+// 3. Delete src/hpc/simd_caps.rs (or make it a thin re-export)
+```
+
+**Result**: One CPUID check, one atomic, one struct — shared by core and all hpc modules.
+
+---
+
+#### 3.3 VNNI GEMM via BlasFloat Dispatch
+
+**Files**: `src/hpc/vnni_gemm.rs`, `src/hpc/quantized.rs:618`, `src/backend/mod.rs`
+
+**Current state**: `int8_gemm_vnni()` takes raw `&[u8], &[i8], &mut [i32]` with manual
+m/n/k parameters. Not accessible via `Array.dot()` or `blas_gemm()`.
+
+**Transform** — wire INT8 GEMM into the backend trait system:
+
+```rust
+// New trait in src/backend/mod.rs:
+pub trait BlasInt8 {
+    /// INT8 GEMM: C_i32 = A_u8 × B_i8
+    fn backend_gemm_i8(
+        m: usize, n: usize, k: usize,
+        a: &[u8], lda: usize,
+        b: &[i8], ldb: usize,
+        c: &mut [i32], ldc: usize,
+    );
+}
+
+// Extension trait on Array:
+pub trait Int8Matmul {
+    /// Quantized matrix multiply: self (u8) × rhs (i8) → Array2<i32>
+    fn matmul_i8(&self, rhs: &ArrayView2<i8>) -> Array2<i32>;
+}
+
+impl<S> Int8Matmul for ArrayBase<S, Ix2>
+where
+    S: Data<Elem = u8>,
+{
+    fn matmul_i8(&self, rhs: &ArrayView2<i8>) -> Array2<i32> {
+        let (m, k) = self.dim();
+        let (k2, n) = rhs.dim();
+        assert_eq!(k, k2, "inner dimensions must match");
+
+        let a_slice = self.as_slice().expect("A must be contiguous");
+        let b_slice = rhs.as_slice().expect("B must be contiguous");
+        let mut c = vec![0i32; m * n];
+
+        crate::hpc::vnni_gemm::int8_gemm_vnni(a_slice, b_slice, &mut c, m, n, k);
+        Array2::from_shape_vec((m, n), c).unwrap()
+    }
+}
+```
+
+**Example**:
+
+```rust
+let weights: Array2<i8> = load_quantized_weights();
+let activations: Array2<u8> = quantize_input(&input);
+let output: Array2<i32> = activations.matmul_i8(&weights.view());
+// 4× throughput vs f32 GEMM
+```
+
+---
+
+### Tier 4: View Factories — Arrow & External Data (Low effort, Medium impact)
+
+Make hpc's data ingestion return ndarray views instead of raw slices.
+
+---
+
+#### 4.1 Arrow Bridge → ArrayView
+
+**File**: `src/hpc/arrow_bridge.rs`
+
+**Current state**: `binary_entry()` returns `&[u8]`. `SoakingBuffer` provides
+`as_columnar_slice()` as `&[i8]`.
+
+**Transform**:
+
+```rust
+impl ArrowBridge {
+    /// Get one binary entry as a 1-D array view.
+    pub fn binary_entry_view(&self, idx: usize) -> ArrayView1<'_, u8> {
+        let bytes = self.binary_entry(idx);
+        ArrayView1::from_shape(bytes.len(), bytes).unwrap()
+    }
+
+    /// Get the full binary column as a 2-D view [n_rows, entry_bytes].
+    pub fn binary_column_view(&self) -> ArrayView2<'_, u8> {
+        let data = self.as_binary_slice();
+        let entry_len = self.entry_len();
+        let n_rows = data.len() / entry_len;
+        ArrayView2::from_shape((n_rows, entry_len), data).unwrap()
+    }
+}
+
+impl SoakingBuffer {
+    /// View the columnar buffer as a 2-D matrix [n_entries, n_dims].
+    pub fn as_array_view_2d(&self) -> ArrayView2<'_, i8> {
+        let data = self.as_columnar_slice();
+        ArrayView2::from_shape((self.n_entries, self.n_dims), data).unwrap()
+    }
+}
+```
+
+**Impact**: Downstream code can use ndarray broadcasting, slicing, and Zip
+on Arrow data without copying.
+
+---
+
+#### 4.2 Distance Functions → Return Array1
+
+**File**: `src/hpc/distance.rs`
+
+**Current state**: Returns `Vec<f32>`.
+
+**Transform**:
+
+```rust
+// Keep existing:
+pub fn squared_distances_f32(query: [f32; 3], points: &[[f32; 3]]) -> Vec<f32> { ... }
+
+// Add ndarray overload:
+pub fn squared_distances_array(query: &ArrayView1<f32>, points: &ArrayView2<f32>) -> Array1<f32> {
+    assert_eq!(query.len(), points.ncols());
+    let n = points.nrows();
+    let mut result = Array1::zeros(n);
+    for (i, row) in points.axis_iter(Axis(0)).enumerate() {
+        let diff = &row - query;
+        result[i] = diff.mapv(|x| x * x).sum();
+    }
+    result
+}
+```
+
+---
+
+### Tier 5: Zip & Parallel — Modernize Inner Loops (Medium effort, Medium impact)
+
+Replace manual `iter_mut().zip()` with ndarray's `Zip` for vectorization hints
+and future `par_for_each()` support.
+
+---
+
+#### 5.1 VML Tail Loops → Zip
+
+**File**: `src/hpc/vml.rs` — 12+ sites
+
+**Current pattern** (repeated throughout):
+```rust
+for (o, &v) in out[tail_start..].iter_mut().zip(x[tail_start..].iter()) {
+    *o = v.exp();
+}
+```
+
+**Transform**:
+```rust
+use crate::Zip;
+
+let out_view = ArrayViewMut1::from_shape(n - tail_start, &mut out[tail_start..]).unwrap();
+let x_view = ArrayView1::from_shape(n - tail_start, &x[tail_start..]).unwrap();
+Zip::from(&mut out_view).and(&x_view).for_each(|o, &v| {
+    *o = v.exp();
+});
+```
+
+**Benefit**: Zip provides SIMD-friendlier iteration order and enables
+future `par_for_each()` without restructuring.
+
+---
+
+#### 5.2 Hamming Batch → Parallel Zip
+
+**File**: `src/hpc/bitwise.rs` lines 320-333
+
+**Current state**: Sequential loop over candidates.
+
+**Transform** (feature-gated on `rayon`):
+
+```rust
+#[cfg(feature = "rayon")]
+pub fn hamming_batch_parallel(
+    query: &ArrayView1<u8>,
+    database: &ArrayView2<u8>,  // [n_candidates, vec_len]
+) -> Array1<u64> {
+    use rayon::prelude::*;
+    let q = query.as_slice().unwrap();
+    let results: Vec<u64> = database
+        .axis_iter(Axis(0))
+        .into_par_iter()
+        .map(|row| {
+            let r = row.as_slice().unwrap();
+            dispatch_hamming(q, r)
+        })
+        .collect();
+    Array1::from_vec(results)
+}
+```
+
+---
+
+### Tier 6: Multi-Dimensional Reductions (High effort, High impact for ML)
+
+Lift 1-D slice operations to full N-D axis reductions.
+
+---
+
+#### 6.1 SIMD Axis Reductions
+
+**File**: `src/hpc/reductions.rs`
+
+**Current state**: Only 1-D: `sum_f32(&[f32])`, `mean_f32(&[f32])`, etc.
+
+**Transform**:
+
+```rust
+/// Sum along an axis, using SIMD for contiguous lanes.
+pub fn sum_axis_f32<S, D>(arr: &ArrayBase<S, D>, axis: Axis) -> Array<f32, D::Smaller>
+where
+    S: Data<Elem = f32>,
+    D: Dimension + RemoveAxis,
+{
+    let mut result = Array::zeros(arr.raw_dim().remove_axis(axis));
+    for (i, lane) in arr.lanes(axis).into_iter().enumerate() {
+        if let Some(slice) = lane.as_slice() {
+            result.as_slice_mut().unwrap()[i] = sum_f32(slice);
+        } else {
+            result.as_slice_mut().unwrap()[i] = lane.iter().fold(0.0, |a, &b| a + b);
+        }
+    }
+    result
+}
+```
+
+---
+
+#### 6.2 Statistics Trait — Generalize to N-D
+
+**File**: `src/hpc/statistics.rs`
+
+**Transform**: Make `variance`, `std_dev`, `percentile` work along axes:
+
+```rust
+pub trait StatisticsNd<A> {
+    type Reduced;
+
+    fn variance_axis(&self, axis: Axis, ddof: usize) -> Self::Reduced;
+    fn std_axis(&self, axis: Axis, ddof: usize) -> Self::Reduced;
+    fn percentile_axis(&self, axis: Axis, q: f64) -> Self::Reduced;
+}
+```
+
+---
+
+## Execution Order & Dependencies
+
+```
+Phase A (2 weeks) — Type Bridges:
+  1.1 Fingerprint ↔ ArrayView1      (standalone, no deps)
+  1.2 VsaVector ↔ ArrayView1        (standalone)
+  1.3 BF16/F16 → BlasFloat          (requires backend/mod.rs edit)
+  1.4 CogRecord channel views       (standalone)
+
+Phase B (2 weeks) — Extension Traits:
+  2.1 HdcOps trait (K0/K1/K2)       (depends on 1.1)
+  2.2 Quantize trait                 (depends on 1.3)
+  2.3 Prefilter trait                (standalone)
+  2.4 ActivationsNd (axis softmax)   (standalone)
+  2.5 SimdMath trait (VML)           (standalone)
+
+Phase C (2 weeks) — Backend Wiring:
+  3.1 Core sum/mean → SIMD dispatch  (depends on 3.2)
+  3.2 Unified SIMD detection         (standalone, delete duplicates)
+  3.3 INT8 Matmul via trait          (depends on 1.3 pattern)
+
+Phase D (1 week) — View Factories:
+  4.1 Arrow → ArrayView              (standalone)
+  4.2 Distance → Array1 returns      (standalone)
+
+Phase E (1 week) — Modernize Loops:
+  5.1 VML tails → Zip                (standalone)
+  5.2 Hamming batch → parallel       (standalone)
+
+Phase F (2 weeks) — N-D Reductions:
+  6.1 SIMD axis reductions           (depends on 3.1)
+  6.2 Statistics N-D                  (depends on 6.1)
+```
+
+---
+
+## Rules of Engagement
+
+1. **No deletion** — raw-slice functions stay. They serve FFI, embedded, and hot-path callers
+   who've already pre-validated contiguity.
+
+2. **Extension traits, not modifications** — new traits in new files or at the bottom of
+   existing files. Existing public APIs unchanged.
+
+3. **Contiguous fast-path, generic fallback** — every new trait method checks
+   `as_slice()` first. If contiguous → SIMD dispatch. If not → `mapv()`/`iter()`.
+
+4. **Feature-gate expensive additions** — rayon parallel (`rayon` feature),
+   BlasFloat for BF16 (`bf16-blas` feature if controversial).
+
+5. **Tests mirror existing** — each new trait method gets the same test coverage
+   as the raw-slice version it wraps.
+
+6. **Naming convention** — ndarray-native methods use `snake_case` matching ndarray
+   style (`.softmax_axis()`, `.simd_exp()`). Existing raw functions keep their
+   current names (`softmax_f32()`, `vsexp()`).
+
+---
+
+## Impact Matrix
+
+| Item | Lines Changed | New Lines | Performance Impact | Ergonomics Impact |
+|------|--------------|-----------|-------------------|-------------------|
+| 1.1 Fingerprint bridge | 0 | ~40 | Zero (zero-copy) | High — unblocks Zip/broadcast |
+| 1.3 BF16 BlasFloat | ~10 | ~80 | Enables BF16 GEMM via .dot() | Very High — removes type barrier |
+| 2.1 HdcOps trait | 0 | ~60 | Zero (delegates) | High — clean pipeline API |
+| 2.5 SimdMath (VML) | 0 | ~100 | 10-50x for exp/ln/sin | Very High — transparent SIMD |
+| 3.1 Core sum → SIMD | ~20 | ~40 | 16x for f32 sum/mean | Zero (invisible) |
+| 3.2 Unified detection | -877 | ~50 | Zero (dedup) | N/A (internal) |
+| 3.3 INT8 matmul trait | 0 | ~60 | 4x over f32 GEMM | High — quantized inference |
+| 5.2 Parallel hamming | 0 | ~30 | N×cores speedup | Medium |
+| 6.1 SIMD axis reductions | 0 | ~80 | 16x per lane | High — ML training |
+
+---
+
+## What This Achieves
+
+After this refactoring, hpc/ is no longer a "second library living inside ndarray."
+It becomes the **performance substrate** that ndarray's own operations dispatch to:
+
+```rust
+// User code — completely ndarray-native, no hpc imports needed:
+let a: Array2<f32> = load_embeddings();
+let b: Array2<f32> = load_queries();
+
+let sims = a.dot(&b.t());           // → routes to SIMD GEMM internally
+let topk = sims.sum_axis(Axis(1));   // → routes to SIMD sum internally
+let probs = sims.softmax_axis(Axis(1));  // → routes to SIMD exp internally
+
+// User code — HDC domain, imports hpc trait:
+use ndarray::hpc::ndarray_ext::HdcOps;
+let fp: Array1<u64> = Fingerprint::<256>::random(42).into();
+let candidates: Array2<u64> = load_database();
+
+for row in candidates.axis_iter(Axis(0)) {
+    if let Some(ec) = fp.cascade_distance(&row, &gate) {
+        hits.push(ec);
+    }
+}
+```
+
+The raw-slice layer (`hpc::kernels::k2_exact(&[u64], ...)`) remains for:
+- FFI exports (Python, C)
+- Embedded/no-std contexts
+- Callers who've already validated layout and want zero overhead
+- Benchmarks that isolate kernel performance from array metadata
+
+Both worlds coexist. The bridge is the extension traits.

From 7fadc061dec4918686607a080d8694d916f2510f Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Sat, 16 May 2026 20:17:32 +0000
Subject: [PATCH 2/7] Add SoA kernel architecture: transposed cascade with
 columnar SIMD filtering

The meta-level insight: don't iterate candidates and probe words,
iterate words and filter candidates. ndarray's F-order strides make
the same Array2<u64> serve both column scanning (K0/K1, contiguous)
and per-record access (K2, strided on survivors).

Key ideas:
- Column-major database: K0 becomes sequential 8-byte scan (8x cache util)
- Bitmask survivor propagation (no branching, pure SIMD mask narrowing)
- BF16 field-separated SoA (sign/exp/mantissa pre-decomposed at ingest)
- QualiaColumns (16 parallel dimension arrays for per-dim scan)
- TieredDatabase (K0 in L2, K1 in L3, K2 in DRAM)
- Arrow's columnar format aligns natively with this layout

Expected: 4-8x cascade throughput, 18x qualia dimension scans.

https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo
---
 SOA_KERNEL_ARCHITECTURE.md | 478 +++++++++++++++++++++++++++++++++++++
 1 file changed, 478 insertions(+)
 create mode 100644 SOA_KERNEL_ARCHITECTURE.md

diff --git a/SOA_KERNEL_ARCHITECTURE.md b/SOA_KERNEL_ARCHITECTURE.md
new file mode 100644
index 00000000..384023b3
--- /dev/null
+++ b/SOA_KERNEL_ARCHITECTURE.md
@@ -0,0 +1,478 @@
+# SoA Kernel Architecture: The Transposed Cascade
+
+> **The insight**: Don't iterate candidates and probe words.
+> Iterate words and filter candidates.
+>
+> The database layout IS the algorithm.
+
+---
+
+## The Current Architecture (AoS — Row-Major)
+
+```
+Memory layout: [candidate₀ word₀..word₂₅₅] [candidate₁ word₀..word₂₅₅] ...
+
+Algorithm (kernel_pipeline, line 725):
+    for each candidate:          ← sequential over N
+        K0: XOR word[0]          ← random access into candidate's word 0
+        K1: XOR word[0..8]       ← sequential within one candidate
+        K2: XOR word[0..256]     ← sequential within one candidate
+
+Cache behavior:
+    K0 touches 8 bytes per candidate but strides by 2048 bytes.
+    For 1M candidates: 8 bytes useful / 2048 bytes per cache line = 0.4% utilization.
+```
+
+Each K0 probe fetches one u64 from a different 2KB region. The CPU prefetcher
+can't help because the access pattern is "8 bytes at stride 2048" — every fetch
+pulls a full cache line but uses 1/256th of it.
+
+---
+
+## The Transposed Architecture (SoA — Column-Major)
+
+```
+Memory layout:
+    Column 0:   [cand₀_word₀, cand₁_word₀, cand₂_word₀, ... candₙ_word₀]
+    Column 1:   [cand₀_word₁, cand₁_word₁, cand₂_word₁, ... candₙ_word₁]
+    ...
+    Column 255: [cand₀_word₂₅₅, cand₁_word₂₅₅, ... candₙ_word₂₅₅]
+
+Algorithm (transposed cascade):
+    Phase K0: scan Column 0 — XOR all N words with query[0], threshold filter
+        → produces survivor_mask (bitmask of which candidates survived)
+
+    Phase K1: scan Columns 0..8 — XOR only survivors' words, accumulate
+        → narrows survivor_mask further
+
+    Phase K2: scan Columns 0..255 — XOR only final survivors
+        → exact EnergyConflict for each
+
+Cache behavior:
+    K0 is a contiguous sequential scan of N × 8 bytes.
+    For 1M candidates: 8MB, perfectly prefetchable, 100% cache line utilization.
+    AVX-512: 8 candidates per instruction (512 bits / 64 bits = 8 u64s).
+```
+
+---
+
+## Why This Is Genius at the Meta Level
+
+### 1. The Cascade Becomes SIMD-Native Filtering
+
+Current K0 (per-candidate, scalar):
+```rust
+for i in 0..n_candidates {
+    let word0 = database[i * 256];           // stride 2048 bytes — cache-hostile
+    if (word0 ^ query_word0).count_ones() > threshold {
+        continue;
+    }
+    // ... K1, K2
+}
+```
+
+Transposed K0 (columnar, SIMD):
+```rust
+let col0: &[u64] = &columns[0];              // contiguous N×8 bytes
+let q0 = U64x8::splat(query_words[0]);       // broadcast query word 0
+let thresh = U32x8::splat(gate.k0_reject);
+
+let mut survivors = vec![0u64; (n_candidates + 63) / 64];  // bitmask
+
+for chunk in 0..(n_candidates / 8) {
+    let c = U64x8::from_slice(&col0[chunk * 8..]);
+    let xor = c ^ q0;
+    let popcnt = xor.popcnt_per_lane();      // 8 popcounts in parallel
+    let mask = popcnt.le(thresh);             // 8 comparisons → 8-bit mask
+    survivors[chunk / 8] |= mask.to_bitmask() << (chunk % 8 * 8);
+}
+```
+
+**8 candidates per clock cycle instead of 1.** And the memory access is a
+sequential stream — the hardware prefetcher handles it perfectly.
+
+---
+
+### 2. The Survivor Mask IS the State Machine
+
+The cascade becomes mask narrowing — no branch prediction, no `continue`:
+
+```rust
+// Start: all candidates alive
+let mut mask: Vec<u64> = vec![u64::MAX; (n + 63) / 64];
+
+// K0: narrow mask via column 0
+k0_filter_columnar(&columns[0], query[0], gate.k0_reject, &mut mask);
+
+// K1: narrow mask via columns 0..8 (only touch survivors)
+k1_filter_columnar(&columns[0..8], &query[0..8], gate.k1_reject, &mut mask);
+
+// K2: exact only on final survivors
+let results = k2_exact_masked(&columns, query, &mask);
+```
+
+Each phase reads ONLY the columns it needs. K0 touches 1 column (N×8 bytes).
+K1 touches 8 columns (N×64 bytes). K2 touches 256 columns but only for
+survivors (~3-5% of N). Total memory touched:
+
+```
+AoS:  N × 2048 bytes (reads all words for all candidates even if K0 rejects)
+      Actually reads: N × 8 (K0) + 0.05N × 64 (K1) + 0.003N × 2048 (K2)
+      But cache lines pulled: N × 64 (one cache line per K0 probe) = N×64 bytes
+
+SoA:  N × 8 (K0, contiguous) + N × 64 (K1, contiguous) + 0.003N × 2048 (K2)
+      Effective: N × 72 + 6N = ~78N bytes vs ~64N bytes (AoS)
+      BUT: 100% utilization vs 12.5% utilization per cache line
+      Net bandwidth saving: ~8× due to no wasted cache line fetch
+```
+
+---
+
+### 3. ndarray Makes This Free (Strides, Not Copies)
+
+Here's the meta-level genius: **you don't transpose the data. You transpose the strides.**
+
+```rust
+// Build database in Fortran (column-major) order:
+let database: Array2<u64> = Array2::from_shape_vec(
+    (n_candidates, 256).f(),  // .f() = Fortran order
+    flat_word_data,
+).unwrap();
+
+// Column access is now CONTIGUOUS (no copy, no transpose):
+let col0: ArrayView1<u64> = database.column(0);
+assert!(col0.is_standard_layout());  // TRUE — physically contiguous
+
+// Row access still works (strided, for K2 on survivors):
+let survivor_fp: ArrayView1<u64> = database.row(42);
+// Not contiguous, but only used for the 3-5% of survivors — doesn't matter
+```
+
+The `(n_candidates, 256).f()` tells ndarray to lay out memory column-by-column.
+Accessing `.column(i)` returns a contiguous view. Accessing `.row(i)` returns a
+strided view. **Same data, both access patterns, zero transposition cost.**
+
+This is ndarray's stride system doing something no raw-slice library can:
+**the same allocation serves both columnar scanning AND per-candidate access.**
+
+---
+
+### 4. Arrow Is Already Column-Major
+
+Arrow stores columns separately. The arrow_bridge already provides columnar views.
+If each "word position" is stored as its own Arrow column (or a single Arrow
+FixedSizeList with the right chunking), the SoA layout is the native format:
+
+```rust
+// Arrow RecordBatch with 256 columns of u64:
+// col_0: [cand₀_w0, cand₁_w0, ...]  ← this IS the K0 scan buffer
+// col_1: [cand₀_w1, cand₁_w1, ...]
+// ...
+
+fn kernel_pipeline_arrow(batch: &RecordBatch, query: &[u64; 256], gate: &SliceGate) {
+    // K0: read column 0 directly from Arrow buffer (zero-copy)
+    let col0 = batch.column(0).as_primitive::<UInt64Type>().values();
+    let mask = k0_filter_columnar(col0, query[0], gate.k0_reject);
+
+    // K1: read columns 0..8
+    for i in 0..8 {
+        let col = batch.column(i).as_primitive::<UInt64Type>().values();
+        k1_accumulate_columnar(col, query[i], &mut mask, &mut accum);
+    }
+    // ...
+}
+```
+
+**No conversion. No copy. Arrow's native layout IS the SoA kernel's input format.**
+
+---
+
+### 5. Multi-Resolution SoA: The Three Tiers as Physical Arrays
+
+```rust
+/// Tiered columnar database. Each tier is a separate Array2 because they have
+/// different access patterns and lifetimes.
+pub struct TieredDatabase {
+    /// Tier K0: shape [N, 1] — just word 0 per candidate.
+    /// Fits in L2 for 1M candidates (8MB).
+    pub k0: Array2<u64>,  // Column-major, 1 column
+
+    /// Tier K1: shape [N, 8] — first 8 words per candidate.
+    /// Fits in L3 for 1M candidates (64MB).
+    pub k1: Array2<u64>,  // Column-major, 8 columns
+
+    /// Tier K2: shape [N, 256] — full containers.
+    /// Lives in main memory. Only accessed for 3-5% survivors.
+    pub k2: Array2<u64>,  // Column-major, 256 columns
+}
+```
+
+**Why separate arrays?** Because K0 is HOT (scanned every query). K1 is WARM
+(scanned for ~5% survivors). K2 is COLD (scanned for ~0.3% survivors). Separating
+them lets you:
+- Pin K0 in L2 cache (8MB for 1M candidates)
+- Let K1 live in L3
+- Let K2 stay in DRAM — only a few thousand rows are ever touched
+
+This is **cache-aware SoA** — not just columnar layout, but columnar layout with
+physical separation matching the thermal hierarchy of the CPU cache.
+
+---
+
+### 6. BF16 Field-Separated SoA: Awareness Becomes Memory Layout
+
+Current BF16 storage: N vectors × 1024 BF16 values = N × 2048 bytes (interleaved).
+
+Field-separated SoA:
+```rust
+/// BF16 database stored as separated bitfields.
+/// The decomposition that bf16_hamming.rs does AT RUNTIME is instead
+/// baked INTO THE STORAGE FORMAT.
+pub struct BF16FieldDatabase {
+    /// Sign bits: 1 bit per dimension × 1024 dims = 128 bytes per vector.
+    /// Shape [N, 128] as u8 — each byte = 8 sign bits.
+    pub signs: Array2<u8>,
+
+    /// Exponent fields: 8 bits per dimension × 1024 dims = 1024 bytes per vector.
+    /// Shape [N, 1024] as u8.
+    pub exponents: Array2<u8>,
+
+    /// Mantissa fields: 7 bits per dimension × 1024 dims ≈ 896 bytes per vector.
+    /// Shape [N, 896] as u8 (packed 7-bit).
+    pub mantissas: Array2<u8>,
+}
+```
+
+**The awareness algebra becomes trivially parallel**:
+
+```rust
+impl BF16FieldDatabase {
+    /// Sign-only distance (strongest signal): just Hamming on sign bytes.
+    /// 128 bytes per comparison — fits in one AVX-512 pass for 4 vectors.
+    fn sign_distance(&self, query_signs: &[u8], candidate_idx: usize) -> u32 {
+        let row = self.signs.row(candidate_idx);
+        hamming_dispatch(query_signs, row.as_slice().unwrap())
+    }
+
+    /// Exponent distance (medium signal): Hamming on exp bytes.
+    fn exp_distance(&self, query_exp: &[u8], candidate_idx: usize) -> u32 {
+        let row = self.exponents.row(candidate_idx);
+        hamming_dispatch(query_exp, row.as_slice().unwrap())
+    }
+
+    /// Awareness classification without per-element decomposition.
+    /// Signs are already separated — just compare columns directly.
+    fn batch_awareness(&self, query_signs: &Array1<u8>) -> Array1<AwarenessState> {
+        // For ALL candidates simultaneously:
+        // sign_agreement = popcount(NOT (query_signs XOR candidate_signs))
+        // Already column-accessible, already SIMD-friendly
+        let n = self.signs.nrows();
+        let mut states = Array1::zeros(n);
+        for i in 0..n {
+            let row = self.signs.row(i);
+            let agreement = hamming_agreement(query_signs.as_slice().unwrap(),
+                                             row.as_slice().unwrap());
+            states[i] = classify_awareness(agreement, ...);
+        }
+        states
+    }
+}
+```
+
+**The insight**: bf16_hamming.rs currently decomposes each BF16 value into
+sign/exp/mantissa at QUERY TIME (runtime unpacking). With field-separated SoA,
+the decomposition happens at INGEST TIME (once). Every subsequent query operates
+on pre-decomposed fields. The hot path never sees a BF16 value — only its fields.
+
+This trades ingest bandwidth for query speed:
+- Ingest: O(N) decomposition (amortized over all future queries)
+- Query: zero decomposition cost (fields already separated)
+- For a database queried 10,000× per second, this is a massive win.
+
+---
+
+### 7. PackedQualia SoA: 16 Parallel Dimension Columns
+
+Current: `Vec<PackedQualia>` where each PackedQualia = 16 × i8 + 1 BF16 (18 bytes).
+
+SoA:
+```rust
+/// Qualia database stored as separated dimension columns.
+/// Each column is ONE phenomenological dimension across ALL agents.
+pub struct QualiaColumns {
+    /// 16 dimension columns, each Array1<i8> with N_agents elements.
+    /// dims[0] = warmth, dims[1] = texture, dims[4] = social, etc.
+    pub dims: [Array1<i8>; 16],
+
+    /// Magnitude column: Array1<u16> (BF16 scalars).
+    pub magnitude: Array1<u16>,
+}
+```
+
+**What this unlocks**:
+
+```rust
+impl QualiaColumns {
+    /// "Which agents are tensioned on warmth (dim 4)?"
+    /// → Single SIMD scan of one column.
+    fn tensioned_on(&self, dim: usize, threshold: i8) -> Array1<bool> {
+        self.dims[dim].mapv(|v| v.abs() > threshold && v < 0)
+    }
+
+    /// "Resonance between agent A and all others on dimension 5"
+    /// → One column × scalar = Array1<i8>
+    fn resonance_on_dim(&self, agent_idx: usize, dim: usize) -> Array1<i16> {
+        let agent_val = self.dims[dim][agent_idx] as i16;
+        self.dims[dim].mapv(|v| v as i16 * agent_val)
+    }
+
+    /// "Total resonance between agent A and all others (dot product)"
+    /// → 16 column scans, accumulate
+    fn total_resonance(&self, agent_idx: usize) -> Array1<i32> {
+        let mut acc = Array1::<i32>::zeros(self.dims[0].len());
+        for d in 0..16 {
+            let agent_val = self.dims[d][agent_idx] as i32;
+            acc += &self.dims[d].mapv(|v| v as i32 * agent_val);
+        }
+        acc
+    }
+}
+```
+
+**AoS version of the same query** (current):
+```rust
+// "Find agents tensioned on warmth" — must scan ALL 18 bytes per agent,
+// extract byte [4], compare. Cache line waste: 4/18 ≈ 22%.
+for agent in &qualia_db {
+    if agent.resonance[4].abs() > threshold && agent.resonance[4] < 0 {
+        results.push(agent);
+    }
+}
+```
+
+**SoA version**: scan 1 byte per agent, sequential, 100% cache utilization.
+For 1M agents: 1MB scan vs ~18MB with random byte extraction.
+
+---
+
+### 8. The Meta-Meta Level: ndarray IS the SoA Abstraction
+
+The deepest insight is that **ndarray's type system already encodes this**.
+`Array2<u64>` with Fortran order IS a SoA database. The columns are the "fields."
+The rows are the "records." And ndarray's stride machinery gives you BOTH access
+patterns from the same allocation:
+
+```rust
+// THE SAME Array2, TWO access patterns:
+
+// SoA access (column scan for K0 — contiguous):
+let k0_column: ArrayView1<u64> = database.column(0);
+// → strides = [1], physically contiguous
+
+// AoS access (full record for K2 — strided):  
+let full_record: ArrayView1<u64> = database.row(survivor_idx);
+// → strides = [n_candidates], non-contiguous but only 3% of data
+```
+
+You don't need two copies. You don't need explicit transposition. ndarray's
+`Order::F` gives you SoA as the NATIVE layout while `Order::C` gives you AoS.
+The same API works for both — only the performance characteristics change.
+
+**This is what makes ndarray + hpc/ potentially genius**: the array type IS
+the database schema. Shape IS the data model. Strides ARE the access pattern.
+And the cascade kernel can be expressed as:
+
+```rust
+pub fn cascade_soa(
+    database: &Array2<u64>,      // F-order: [N, 256], columns contiguous
+    query: &Array1<u64>,         // [256]
+    gate: &SliceGate,
+) -> Vec<KernelResult> {
+    let n = database.nrows();
+
+    // K0: column 0 scan (contiguous — full SIMD bandwidth)
+    let col0 = database.column(0);
+    let mut mask = k0_columnar_simd(col0.as_slice().unwrap(), query[0], gate);
+
+    // K1: columns 0..8 scan (contiguous per column)
+    for w in 0..8 {
+        let col = database.column(w);
+        k1_accumulate_simd(col.as_slice().unwrap(), query[w], &mut mask);
+    }
+    k1_threshold(&mut mask, gate);
+
+    // K2: only survivors (row access — strided but few)
+    let mut results = Vec::new();
+    for idx in mask.iter_ones() {
+        let row = database.row(idx);
+        // row is strided — copy to stack for SIMD (256 × 8 = 2KB, fits in L1)
+        let mut buf = [0u64; 256];
+        row.assign_to(ArrayViewMut1::from(&mut buf));
+        let ec = k2_exact(&query.as_slice().unwrap(), &buf, 256);
+        results.push(/* ... */);
+    }
+    results
+}
+```
+
+---
+
+## The Shopping List Addition
+
+Add to `REFACTOR_HPC_INTEGRATION.md` Phase C:
+
+```
+Phase C-SoA (2 weeks) — Columnar Kernel Architecture:
+    C.1  TieredDatabase struct (K0/K1/K2 as separate F-order Array2)
+    C.2  k0_columnar_simd: scan N u64s with U64x8 broadcast+XOR+popcount
+    C.3  Bitmask survivor propagation (no per-candidate branching)
+    C.4  k1_accumulate_columnar: 8-column scan with running mask
+    C.5  k2_exact_masked: row extraction for final survivors
+    C.6  Arrow native path: RecordBatch columns → direct columnar scan
+    C.7  BF16FieldDatabase: sign/exp/mantissa pre-separated at ingest
+    C.8  QualiaColumns: 16 × Array1<i8> for per-dimension scan
+    C.9  Benchmark: AoS vs SoA cascade on 1M SKU-16K containers
+```
+
+---
+
+## Expected Performance Impact
+
+| Operation | AoS (current) | SoA (proposed) | Speedup | Why |
+|-----------|---------------|----------------|---------|-----|
+| K0 scan (1M) | ~2ms (cache-miss bound) | ~0.25ms (streaming) | **8×** | Contiguous vs stride-2048 |
+| K1 scan (50K survivors) | ~0.8ms | ~0.1ms | **8×** | Same reason, 8 columns |
+| K2 exact (1500 survivors) | ~0.3ms | ~0.35ms | 0.85× | Row copy overhead |
+| **Total pipeline** | **~3.1ms** | **~0.7ms** | **4.4×** | K0+K1 dominate |
+| BF16 awareness (1M) | ~5ms (decompose each) | ~0.6ms (pre-separated) | **8×** | No runtime decomposition |
+| Qualia dim scan (1M agents) | ~4.5ms (18 byte stride) | ~0.25ms (1 byte stride) | **18×** | Perfect cache line use |
+
+The K2 step is slightly SLOWER in SoA (strided row access + copy to stack).
+But K2 only runs on 0.3% of candidates — it's irrelevant to total time.
+The cascade is dominated by K0 (touches all candidates) and K1 (touches 5%).
+Both are massively faster with columnar layout.
+
+---
+
+## The Philosophical Payoff
+
+> **AoS says**: "Here is a complete entity. Ask it anything."
+> **SoA says**: "Here is one question. Ask it of everyone."
+
+The cascade pipeline IS the SoA access pattern:
+- K0 asks ONE question (word 0) of EVERYONE
+- K1 asks EIGHT questions (words 0-7) of SURVIVORS
+- K2 asks ALL questions (words 0-255) of VERY FEW
+
+This is a funnel. Each stage narrows the population and widens the per-entity cost.
+SoA is optimal for funnels because the wide-population stages (K0, K1) get
+contiguous memory access, and the narrow-population stage (K2) pays the strided
+penalty on so few entities that it doesn't matter.
+
+ndarray encodes this directly in its type system:
+- `database.column(0)` = the K0 question to everyone (contiguous)
+- `database.row(survivor)` = all answers from one entity (strided, rare)
+
+**The database layout IS the algorithm.
+The strides ARE the access pattern.
+ndarray IS the SoA abstraction.**

From c003c193f6423f01149defeb205698b161cea3b9 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Sat, 16 May 2026 20:19:28 +0000
Subject: [PATCH 3/7] Add unified refactor sequence: 8 waves integrating all
 analysis passes
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Merges three inputs into one executable plan:
- REFACTOR_HPC_INTEGRATION.md (type bridges, extension traits)
- SOA_KERNEL_ARCHITECTURE.md (columnar cascade, field separation)
- Transformer session feedback (API conventions, namespace, codegen)

8 waves: conventions → macros → bridges → traits → backend → SoA → namespace → release
Critical path: 15 days serial, 10 days with parallelism.

The SoA meta-insight applies at 6 levels:
  1. Database layout (column-per-word)
  2. API surface (_into forms = caller-controlled layout)
  3. Module structure (hpc/cog/ext/io = columns of concern)
  4. ndarray types (F-order strides encode the duality)
  5. Cascade pipeline (widen reads as population shrinks)
  6. Arrow/Lance (storage IS the compute layout, zero ETL)

https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo
---
 UNIFIED_REFACTOR_SEQUENCE.md | 248 +++++++++++++++++++++++++++++++++++
 1 file changed, 248 insertions(+)
 create mode 100644 UNIFIED_REFACTOR_SEQUENCE.md

diff --git a/UNIFIED_REFACTOR_SEQUENCE.md b/UNIFIED_REFACTOR_SEQUENCE.md
new file mode 100644
index 00000000..afa4c9d5
--- /dev/null
+++ b/UNIFIED_REFACTOR_SEQUENCE.md
@@ -0,0 +1,248 @@
+# Unified Refactor Sequence: ndarray Fork
+
+> Integrates three analysis passes into one executable sequence:
+> 1. **REFACTOR_HPC_INTEGRATION.md** — Type bridges & extension traits
+> 2. **SOA_KERNEL_ARCHITECTURE.md** — Columnar cascade & field-separated storage
+> 3. **Transformer session feedback** — API conventions, namespace, codegen macros
+>
+> Each wave is self-contained. Later waves depend on earlier ones.
+> Within a wave, items are independent and can run in parallel.
+
+---
+
+## Wave 0 — Conventions & Foundations (3 days)
+
+Unlocks everything else. No code changes to hot paths. Pure contract + tooling.
+
+| ID | Item | Source | Effort | Why First |
+|----|------|--------|--------|-----------|
+| W0.1 | **Dual-form signature convention** (`_into` + Vec wrapper + `_ptr`) | R1.1 | 1d | Fixes root cause of PP-15 WS-1/2/4 ❌. All future kernels follow this shape. |
+| W0.2 | **Unified HpcError enum** (`src/hpc/error.rs`) | R1.2 | 4h | Replaces 4 error conventions with 1. |
+| W0.3 | **`#![deny(warnings)]` → targeted denies** | R1.3 | 1h | Unblocks doctest authoring for all subsequent waves. |
+| W0.4 | **Feature flag rename** (`hpc-extras` → `research`, `backend-*` prefix) | R4.1 | 4h | Clean slate for feature-gated additions. |
+| W0.5 | **Prelude module** (`hpc::prelude::*`) | R1.4 | 1h | Single discovery surface for downstream consumers. |
+
+**Gate**: All downstream crates (burn, candle, tract, ort) still compile unchanged.
+The dual-form is additive; existing call sites continue working.
+
+---
+
+## Wave 1 — Codegen Macros (2 days)
+
+Eliminates copy-paste before adding new code. Every subsequent wave benefits.
+
+| ID | Item | Source | Effort | What It Produces |
+|----|------|--------|--------|------------------|
+| W1.1 | **Dtype-parity macro** (`reductions_for!`) | R3.1 | 1d | One line = 7 reductions for a dtype. Cuts 700→150 lines. |
+| W1.2 | **Per-arch dispatch macro** (`simd_dispatch!`) | R3.2 | 4h | Eliminates dispatch skeleton copy-paste. |
+| W1.3 | **Reduction kernel template** (`reduce_simd()`) | R3.3 | 4h | Generic chunk-loop; sum/max/nrm2 become 5-line callers. |
+| W1.4 | **Dual-form fusion** (`kernel_simd_dual!`) | R3.4 | 1d | One body → `_into`, Vec, `_ptr`, all arch variants. |
+
+**Gate**: `cargo test -p ndarray` passes. Macro output matches existing hand-rolled functions.
+Run benches to verify no regression.
+
+---
+
+## Wave 2 — Type Bridges (2 days)
+
+From REFACTOR_HPC_INTEGRATION.md Tier 1. Makes domain types ndarray-native.
+
+| ID | Item | Source | Effort | Impact |
+|----|------|--------|--------|--------|
+| W2.1 | **Fingerprint<N> ↔ ArrayView1<u64>** | Tier 1.1 | 4h | Zero-copy bridge, unblocks Zip/broadcast on fingerprints |
+| W2.2 | **VsaVector ↔ ArrayView1<u64>** | Tier 1.2 | 2h | Same pattern |
+| W2.3 | **BF16/F16 → BlasFloat** | Tier 1.3 | 4h | Enables `Array<BF16>.dot()`, unlocks BLAS dispatch |
+| W2.4 | **CogRecord channel_as_words()** | Tier 1.4 | 2h | ArrayView1<u64> per channel |
+| W2.5 | **Arrow bridge → ArrayView factories** | Tier 4.1 | 2h | `binary_column_view()` returns ArrayView2 |
+
+**Gate**: New `From`/`Into` impls compile. Existing code unchanged. Add tests for each bridge.
+
+---
+
+## Wave 3 — Extension Traits (3 days)
+
+From REFACTOR_HPC_INTEGRATION.md Tier 2. hpc operations feel native on ndarray types.
+
+| ID | Item | Source | Effort | Impact |
+|----|------|--------|--------|--------|
+| W3.1 | **HdcOps trait** (K0/K1/K2 on Array1<u64>) | Tier 2.1 | 4h | `query.cascade_distance(&candidate, &gate)` |
+| W3.2 | **Quantize trait** (Array → BF16/I8 with Array output) | Tier 2.2 | 4h | `array.to_bf16()`, `array.to_i8_symmetric()` |
+| W3.3 | **Prefilter on ArrayView2** | Tier 2.3 | 4h | Eliminates `(slice, rows, cols)` pattern |
+| W3.4 | **ActivationsNd** (softmax_axis, log_softmax_axis) | Tier 2.4 | 4h | `batch.softmax_axis(Axis(1))` for inference |
+| W3.5 | **SimdMath trait** (VML on any Array<f32>) | Tier 2.5 | 4h | `array.simd_exp()` — 10-50x for transcendentals |
+| W3.6 | **Int8Matmul trait** (Array2<u8> × Array2<i8> → Array2<i32>) | Tier 3.3 | 4h | Quantized inference via ndarray types |
+
+**Gate**: Each trait has tests matching the raw-slice function's test suite.
+Benchmark: trait overhead vs. direct raw-slice call ≤ 1%.
+
+---
+
+## Wave 4 — Backend Wiring (2 days)
+
+From REFACTOR_HPC_INTEGRATION.md Tier 3. Core ndarray operations silently accelerate.
+
+| ID | Item | Source | Effort | Impact |
+|----|------|--------|--------|--------|
+| W4.1 | **Unified SIMD detection** (merge simd_caps + simd_dispatch into core) | Tier 3.2 | 4h | Deletes 877 lines of duplication |
+| W4.2 | **Core sum/mean → SIMD dispatch** | Tier 3.1 | 4h | 16x faster `.sum()` on contiguous f32/f64 |
+| W4.3 | **SIMD axis reductions** (sum_axis with SIMD lanes) | Tier 6.1 | 1d | ML training hot path |
+
+**Gate**: `cargo bench` shows measurable improvement on contiguous arrays.
+Non-contiguous arrays unchanged (fallback to generic fold).
+
+---
+
+## Wave 5 — SoA Kernel Architecture (1 week)
+
+From SOA_KERNEL_ARCHITECTURE.md. The genius-level structural change.
+
+| ID | Item | Source | Effort | Impact |
+|----|------|--------|--------|--------|
+| W5.1 | **TieredDatabase struct** (K0/K1/K2 as F-order Array2) | SoA §5 | 1d | Physical separation matching cache hierarchy |
+| W5.2 | **k0_columnar_simd** (scan N u64s with broadcast+XOR+popcount) | SoA §1 | 1d | 8 candidates per cycle, sequential memory |
+| W5.3 | **Bitmask survivor propagation** (mask narrowing, no branching) | SoA §2 | 4h | Replace per-candidate `continue` with mask ops |
+| W5.4 | **k1_accumulate_columnar** (8-column scan with mask) | SoA §1 | 4h | K1 on contiguous columns |
+| W5.5 | **k2_exact_masked** (row extraction for survivors only) | SoA §1 | 4h | Stack-copy 2KB per survivor |
+| W5.6 | **BF16FieldDatabase** (sign/exp/mantissa separated at ingest) | SoA §6 | 1d | Awareness without runtime decomposition |
+| W5.7 | **QualiaColumns** (16 × Array1<i8>) | SoA §7 | 4h | 18x dimension scan throughput |
+| W5.8 | **Arrow native path** (RecordBatch columns → columnar scan) | SoA §4 | 4h | Zero-copy from Lance/Arrow |
+| W5.9 | **Benchmark: AoS vs SoA** on 1M SKU-16K containers | SoA §perf | 4h | Proof of 4-8x claim |
+
+**Gate**: SoA cascade produces identical results to AoS cascade on full test suite.
+Benchmark confirms ≥3x throughput improvement on 1M containers.
+
+---
+
+## Wave 6 — Namespace Restructure (3 days)
+
+From transformer feedback R2.1. Enforces the architecture rule from CLAUDE.md.
+
+| ID | Item | Source | Effort | Impact |
+|----|------|--------|--------|--------|
+| W6.1 | **Split hpc/ → hpc/ + cog/ + ext/ + io/** | R2.1 | 2-3d | 30 numeric modules in hpc/, 35 cognitive in cog/, 20 experimental in ext/, 6 I/O in io/ |
+| W6.2 | **SIMD directory consolidation** (simd_*.rs → src/simd/) | R2.2 | 1d | One directory for all SIMD code |
+| W6.3 | **Quantized module split** (quantized.rs → hpc/quant/) | R2.3 | 4h | BlockQ4_0 packed struct for candle compat |
+
+**Gate**: `pub use` deprecation shims for all moved modules.
+All existing `use ndarray::hpc::*` paths still resolve (with deprecation warning).
+
+---
+
+## Wave 7 — Test & Bench Infrastructure (2 days)
+
+| ID | Item | Source | Effort | Impact |
+|----|------|--------|--------|--------|
+| W7.1 | **Extract integration tests** to `tests/hpc/`, `tests/cog/` | R4.2 | 2d | Module files lose 30-50% of line count |
+| W7.2 | **HPC bench harness** (dot, reductions, softmax, quantized) | R4.5 | 1d | Reproducible perf claims |
+| W7.3 | **Cross-comparison bench** (ndarray vs candle vs tract kernel) | R4.5 ext | 4h | The whole reason for B1 |
+
+---
+
+## Wave 8 — Version Bump & Downstream Pin (1 day)
+
+| ID | Item | Source | Effort | Impact |
+|----|------|--------|--------|--------|
+| W8.1 | **Tag v0.18.0** | R4.4 | 2h | Clean cut |
+| W8.2 | **CHANGELOG with migration guide** | R4.4 | 2h | Downstream knows what changed |
+| W8.3 | **Downstream Cargo.toml pins** (burn, candle, tract, ort) | R4.4 | 2h | tag = "v0.18.0" |
+| W8.4 | **Remove deprecation shims** (schedule for v0.19) | R2.1 | — | Future cleanup |
+
+---
+
+## The SoA Integration with Conventions
+
+The SoA architecture (Wave 5) benefits from every preceding wave:
+
+| Wave 5 needs | Provided by |
+|--------------|-------------|
+| F-order Array2<u64> database | Wave 2 (type bridges — Fingerprint converts to Array) |
+| SIMD column scanning | Wave 4 (unified SIMD detection, core dispatch) |
+| Bitmask as Array1<u64> | Wave 2 (ArrayView factory pattern) |
+| `_into` form for columnar kernels | Wave 0 (signature convention) |
+| Dispatch macro for k0_columnar_simd | Wave 1 (codegen macros) |
+| Extension trait: `database.cascade_soa(&query, &gate)` | Wave 3 (HdcOps trait extended) |
+| BF16FieldDatabase uses Quantize trait | Wave 3 (Quantize extension) |
+| Arrow columns → direct scan | Wave 2 (Arrow view factories) |
+| Benchmark harness to prove 4-8x | Wave 7 (bench infrastructure) |
+
+This is why the waves are ordered this way. The SoA kernel is the most impactful
+change (~4-8x cascade throughput), but it needs the convention/macro/bridge
+foundation to land cleanly. Without Wave 0-4, the SoA code would be yet another
+raw-slice module that doesn't integrate with ndarray — repeating the original sin.
+
+---
+
+## The Meta-Level SoA Observation
+
+The SoA insight goes beyond just "transpose the database." It's a design principle
+that applies recursively across the entire hpc/ surface:
+
+### Level 1: Database layout (SOA_KERNEL_ARCHITECTURE.md)
+- Fingerprint database: column-per-word → cascade becomes column scan
+- BF16 fields: separated sign/exp/mantissa → awareness is native layout
+- Qualia: column-per-dimension → per-dim queries are sequential scans
+
+### Level 2: API surface (this document)
+- The **dual-form convention** (W0.1) IS SoA thinking applied to function signatures:
+  - AoS API: `fn foo(input) -> output` (allocates inside, hides layout)
+  - SoA API: `fn foo_into(input, output)` (caller controls memory layout)
+  - The `_into` form lets the caller decide whether output is contiguous, chunked, or interleaved
+
+### Level 3: Module structure (W6.1 namespace split)
+- AoS: one module (`hpc/`) contains everything → grep to find anything
+- SoA: separated by concern (`hpc/`, `cog/`, `ext/`, `io/`) → each namespace is a contiguous "column" of related functionality that you scan sequentially
+
+### Level 4: ndarray's type system
+- `Array2<u64>` in F-order IS the SoA database. No wrapper type needed.
+- `.column(i)` IS the field-access pattern. No custom iterator needed.
+- `.row(i)` IS the record-access pattern. No conversion needed.
+- Strides encode the duality. The same allocation serves both patterns.
+
+### Level 5: The cascade pipeline itself
+- K0: ONE field (word 0) across ALL records → column scan
+- K1: EIGHT fields (words 0-7) across SURVIVORS → multi-column scan
+- K2: ALL fields across FEW records → row reconstruction
+
+This is a **widening read pattern**: start narrow (1 column), widen as population shrinks.
+SoA makes the narrow-start cheap (contiguous) and tolerates the wide-end being strided
+(because so few records reach it).
+
+### Level 6: Arrow + Lance alignment
+- Arrow IS columnar storage (SoA by definition)
+- Lance stores Arrow columns on disk
+- The SoA kernel reads Arrow columns directly — no ETL, no transpose, no copy
+- The database layout on disk IS the query execution layout in memory
+- This is the zero-copy nirvana: mmap → column pointer → SIMD scan → result
+
+**The architecture collapses all intermediate transformations.**
+Storage format = memory layout = compute pattern = cache access order.
+One decision (F-order Array2) propagates from disk through cache to SIMD lanes.
+
+---
+
+## Effort Summary
+
+| Wave | Days | Parallel? | Key Outcome |
+|------|------|-----------|-------------|
+| 0 | 3 | Yes | Conventions locked |
+| 1 | 2 | Yes | Codegen macros ready |
+| 2 | 2 | Yes (after 0) | Domain types ↔ Array |
+| 3 | 3 | Yes (after 2) | Extension traits live |
+| 4 | 2 | After 1+2 | Core ops accelerated |
+| 5 | 5 | After 2+4 | SoA cascade (the big win) |
+| 6 | 3 | After 0 | Namespace clean |
+| 7 | 2 | After 5 | Proof via benchmarks |
+| 8 | 1 | After all | Release |
+
+**Critical path**: 0 → 2 → 4 → 5 → 7 → 8 = **15 days** serial.
+With parallel waves (1∥2, 3∥4, 6∥5, 7∥8): **~10 working days**.
+
+---
+
+## What NOT to Do
+
+1. **Don't delete hpc/ modules** — raw-slice functions stay (FFI, embedded, zero-overhead)
+2. **Don't premature-v1.0** — surface is too young; candle/tract wiring will shake it again
+3. **Don't gate SoA behind a feature flag** — it's the default hot path, not optional
+4. **Don't couple SoA with module restructure** — they're independent; merge separately
+5. **Don't break downstream in one shot** — deprecation shims for one release minimum

From d1882b6fad58900c7603902f6ef80deb6eba6da8 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Sat, 16 May 2026 20:32:32 +0000
Subject: [PATCH 4/7] Expand unified refactor sequence: integrate lance-graph
 W1a consumer contract
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fourth analysis pass adds Wave 1 (SIMD Consumer Primitives) — 5 P0
primitives blocking 158 raw-intrinsic violations in lance-graph:

- I8x16::from_i4_packed_u64 (closure-batch pattern)
- I8x16::saturating_abs (VPABSB correction for i8::MIN)
- U16x8::gather_u16 (quantized distance tables)
- prefetch_read_t0/t1/t2 (hint wrappers)
- U64x8::popcnt + xor_popcount (HDC hamming)

Sequence now 9 waves (was 8), dependency graph updated,
lance-graph unblocked at day 6 of 12-day plan.

https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo
---
 UNIFIED_REFACTOR_SEQUENCE.md | 391 ++++++++++++++++++++++++++---------
 1 file changed, 292 insertions(+), 99 deletions(-)

diff --git a/UNIFIED_REFACTOR_SEQUENCE.md b/UNIFIED_REFACTOR_SEQUENCE.md
index afa4c9d5..6ad54321 100644
--- a/UNIFIED_REFACTOR_SEQUENCE.md
+++ b/UNIFIED_REFACTOR_SEQUENCE.md
@@ -1,9 +1,10 @@
 # Unified Refactor Sequence: ndarray Fork
 
-> Integrates three analysis passes into one executable sequence:
+> Integrates four analysis passes into one executable sequence:
 > 1. **REFACTOR_HPC_INTEGRATION.md** — Type bridges & extension traits
 > 2. **SOA_KERNEL_ARCHITECTURE.md** — Columnar cascade & field-separated storage
 > 3. **Transformer session feedback** — API conventions, namespace, codegen macros
+> 4. **lance-graph W1a consumer contract** — 5 SIMD primitives blocking 158-violation remediation
 >
 > Each wave is self-contained. Later waves depend on earlier ones.
 > Within a wave, items are independent and can run in parallel.
@@ -22,201 +23,389 @@ Unlocks everything else. No code changes to hot paths. Pure contract + tooling.
 | W0.4 | **Feature flag rename** (`hpc-extras` → `research`, `backend-*` prefix) | R4.1 | 4h | Clean slate for feature-gated additions. |
 | W0.5 | **Prelude module** (`hpc::prelude::*`) | R1.4 | 1h | Single discovery surface for downstream consumers. |
 
-**Gate**: All downstream crates (burn, candle, tract, ort) still compile unchanged.
+**Gate**: All downstream crates (burn, candle, tract, ort, lance-graph) still compile unchanged.
 The dual-form is additive; existing call sites continue working.
 
 ---
 
-## Wave 1 — Codegen Macros (2 days)
+## Wave 1 — SIMD Consumer Primitives (3 days) ← lance-graph P0
+
+From `lance-graph` W1a consumer contract. **Five primitives blocking 158 raw-intrinsic
+violations across 5 lance-graph crates.** Each is a tight-scope PR. Parallel review.
+
+The lance-graph `simd-savant` agent runs PRE-MERGE against each PR.
+
+| ID | Item | Source | Effort | Consumer Site |
+|----|------|--------|--------|---------------|
+| W1.1 | **`I8x16::from_i4_packed_u64()`** + `batch_packed_i4_16()` | W1a-#1 | 1d | `lance-graph-contract/src/mul.rs::i4_eval::batch` — 5 batch fns over `QualiaI4_16D(u64)`. Closes PR #398 codex P1 (NEON OOB at `len==2`). |
+| W1.2 | **`I8x16::saturating_abs()`** + `I8x32::saturating_abs()` | W1a-#2 | 4h | `lance-graph-contract/src/mul.rs` — Direction-B fix. `abs(i8::MIN)` must return `i8::MAX`, not wrap. See §VPABSB correction. |
+| W1.3 | **`U16x8::gather_u16()`** + `palette_lookup_u8x8()` | W1a-#3 | 4h | `bgz17/src/simd.rs:88` — currently inlines `_mm256_i32gather_epi32` (AP-SIMD-1 violation). |
+| W1.4 | **`prefetch_read_t0/t1/t2()`** | W1a-#4 | 2h | `bgz17/src/prefetch.rs:96,100` — currently inlines `_mm_prefetch` directly. |
+| W1.5 | **`U64x8::popcnt()`** + `U64x8::xor_popcount()` | W1a-#5 | 4h | `holograph/hamming.rs` + `lance-graph/graph/blasgraph/types.rs` — inline `_mm512_popcnt_epi64`. |
+
+### Per-primitive implementation contract
+
+Every PR MUST:
+1. **Three backends**: AVX-512, NEON, scalar. Missing scalar = P0 reject.
+2. **Edge-case semantics documented** in doc-comment (`i8::MIN`, empty slices, OOB indices).
+3. **Parity test**: all backends produce identical output on randomized corpus including edge cases.
+4. **Bench against scalar**: record AVX-512/NEON speedup ratios in PR body.
+5. **`// SAFETY:` on every `unsafe` block**.
+6. **No new `is_x86_feature_detected!`** outside `simd_caps.rs`.
+7. **Consumer site cited** in PR description.
+
+### W1.1 — I8x16::from_i4_packed_u64 (nibble unpack + sign extend)
+
+```rust
+impl I8x16 {
+    /// Unpack 16 signed i4 nibbles from a u64 into 16 i8 lanes (sign-extended).
+    /// Nibble layout: lane[i] = sign_extend_4((packed >> (4*i)) & 0xf).
+    pub fn from_i4_packed_u64(packed: u64) -> Self;
+
+    /// Const-folded lane extract.
+    pub fn lane_i8<const N: usize>(self) -> i8;
+}
+
+/// Closure-parameterized batch: run `f` over each (unpacked_i8x16, aux[i]) pair.
+/// Bounds-aware tail handling; scalar fallback on unsupported arch.
+pub fn batch_packed_i4_16<E, F>(packed: &[u64], aux: &[i8], out: &mut [E], f: F)
+where F: Fn(I8x16, i8) -> E + Sync + Send, E: Copy;
+```
+
+**AVX-512**: `_mm_cvtsi64_si128` + nibble shuffle via VPSHUFB mask LUT, then sign-extend.
+Alt: PDEP (`_pdep_u64` × 2) into two u64 halves → load → `vpmovsxbw`. Bench both on Zen4 + SPR.
+**NEON**: `vld1_u8` 8 bytes → `vshl_n_s8(v, 4)` + `vshr_n_s8(v, 4)` for nibble split + sign-extend.
+**Scalar**: `((packed >> (4*i)) & 0xf) as i8`, with `if x > 7 { x - 16 } else { x }` for sign-extend.
+
+The `batch_packed_i4_16` closure-batch is the foundation for W1.5-#7 (randomized signatures).
+
+### W1.2 — I8x16::saturating_abs (VPABSB correction)
+
+**Critical**: `_mm512_abs_epi8` does NOT saturate `i8::MIN`. Returns `0x80` (= -128).
+
+```rust
+impl I8x16 { pub fn saturating_abs(self) -> Self; }
+impl I8x32 { pub fn saturating_abs(self) -> Self; }
+```
+
+**AVX-512**: `_mm512_min_epu8(_mm512_abs_epi8(x), _mm512_set1_epi8(0x7f))`.
+**NEON**: `vqabsq_s8(x)` — hardware-saturating (the `q` suffix).
+**Scalar**: `i8::saturating_abs()` (stdlib, well-defined).
+
+**Mandatory test**:
+```rust
+let input = I8x16::splat(i8::MIN);
+let result = input.saturating_abs();
+assert_eq!(result.lane_i8::<0>(), i8::MAX);  // 127, NOT -128
+```
+
+### W1.3 — U16x8::gather_u16 (palette lookup)
+
+```rust
+impl U16x8 {
+    /// Gather 8 u16 values from `table` at given indices.
+    /// Debug: panics if max(indices) >= table.len(). Release: scalar fallback.
+    pub fn gather_u16(indices: U16x8, table: &[u16]) -> Self;
+}
+pub fn palette_lookup_u8x8(idx_v: U16x8, lut: &[u8]) -> U8x8;
+```
+
+**AVX2**: `_mm256_i32gather_epi32` with index widening + downcast.
+**NEON / Scalar**: loop `(0..8).map(|i| table[indices.lane(i)])`.
+
+### W1.4 — Prefetch hints (cross-arch)
+
+```rust
+pub fn prefetch_read_t0(ptr: *const u8);  // L1
+pub fn prefetch_read_t1(ptr: *const u8);  // L2
+pub fn prefetch_read_t2(ptr: *const u8);  // L3
+```
+
+**x86**: `_mm_prefetch(ptr, _MM_HINT_T0/T1/T2)`.
+**aarch64**: `prfm pldl1keep/pldl2keep/pldl3keep`.
+**Other**: no-op (prefetch is a hint; silent no-op is correct per ISA).
+
+### W1.5 — U64x8::popcnt (lane-wise popcount)
+
+```rust
+impl U64x8 {
+    /// Lane-wise population count. Each lane returns its u64 bit-count (0..=64).
+    pub fn popcnt(self) -> Self;
+    /// XOR + lane-wise popcount + horizontal sum. Optimized for Hamming distance.
+    pub fn xor_popcount(self, other: Self) -> u64;
+}
+impl U64x4 { pub fn popcnt(self) -> Self; }  // AVX2 parity
+```
+
+**AVX-512 VPOPCNTDQ**: `_mm512_popcnt_epi64` directly (feature `avx512vpopcntdq`).
+**AVX-512 without VPOPCNTDQ**: Mula's algorithm via `VPSHUFB` + `VPSADBW` per-byte LUT.
+**NEON**: `vcntq_u8` → `vpaddlq_u8` cascade to sum within each u64.
+**Scalar**: `u64::count_ones()` fused loop.
+
+### W1.5+ — Deferred primitives (gated on sigker certification)
+
+Three more primitives queued behind `lance-graph:crates/sigker` certification
+(Hambly-Lyons 2010 path-signature uniqueness). **No code needed today.** Listed
+so W1a additions are designed broad enough to compose with these later.
+
+| ID | Primitive | Gate |
+|----|-----------|------|
+| W1.5-#6 | `signature_pde_sweep()` — Goursat PDE for `〈S(X), S(Y)〉` | `jc Pillar 11` activation |
+| W1.5-#7 | Randomized projection (Cuchiero-Schmocker-Teichmann 2021) | sigker bench at production widths |
+| W1.5-#8 | Lyndon-pack log-signature (7-13× compression) on `I16x16` | sigker bench at production widths |
+
+The closure-batch shape introduced in W1.1 is the foundation for W1.5-#7.
+
+**Gate**: All 5 primitives pass parity tests across AVX-512/NEON/scalar.
+`lance-graph` `simd-savant` agent verifies compliance PRE-MERGE.
+Consumer-side migration PRs (#398-#402) unblocked.
+
+---
+
+## Wave 2 — Codegen Macros (2 days)
 
 Eliminates copy-paste before adding new code. Every subsequent wave benefits.
+Wave 1 primitives become the first consumers of these macros (retrofit optional).
 
 | ID | Item | Source | Effort | What It Produces |
 |----|------|--------|--------|------------------|
-| W1.1 | **Dtype-parity macro** (`reductions_for!`) | R3.1 | 1d | One line = 7 reductions for a dtype. Cuts 700→150 lines. |
-| W1.2 | **Per-arch dispatch macro** (`simd_dispatch!`) | R3.2 | 4h | Eliminates dispatch skeleton copy-paste. |
-| W1.3 | **Reduction kernel template** (`reduce_simd()`) | R3.3 | 4h | Generic chunk-loop; sum/max/nrm2 become 5-line callers. |
-| W1.4 | **Dual-form fusion** (`kernel_simd_dual!`) | R3.4 | 1d | One body → `_into`, Vec, `_ptr`, all arch variants. |
+| W2.1 | **Dtype-parity macro** (`reductions_for!`) | R3.1 | 1d | One line = 7 reductions for a dtype. Cuts 700→150 lines. |
+| W2.2 | **Per-arch dispatch macro** (`simd_dispatch!`) | R3.2 | 4h | Eliminates dispatch skeleton copy-paste. |
+| W2.3 | **Reduction kernel template** (`reduce_simd()`) | R3.3 | 4h | Generic chunk-loop; sum/max/nrm2 become 5-line callers. |
+| W2.4 | **Dual-form fusion** (`kernel_simd_dual!`) | R3.4 | 1d | One body → `_into`, Vec, `_ptr`, all arch variants. |
 
 **Gate**: `cargo test -p ndarray` passes. Macro output matches existing hand-rolled functions.
 Run benches to verify no regression.
 
 ---
 
-## Wave 2 — Type Bridges (2 days)
+## Wave 3 — Type Bridges (2 days)
 
 From REFACTOR_HPC_INTEGRATION.md Tier 1. Makes domain types ndarray-native.
 
 | ID | Item | Source | Effort | Impact |
 |----|------|--------|--------|--------|
-| W2.1 | **Fingerprint<N> ↔ ArrayView1<u64>** | Tier 1.1 | 4h | Zero-copy bridge, unblocks Zip/broadcast on fingerprints |
-| W2.2 | **VsaVector ↔ ArrayView1<u64>** | Tier 1.2 | 2h | Same pattern |
-| W2.3 | **BF16/F16 → BlasFloat** | Tier 1.3 | 4h | Enables `Array<BF16>.dot()`, unlocks BLAS dispatch |
-| W2.4 | **CogRecord channel_as_words()** | Tier 1.4 | 2h | ArrayView1<u64> per channel |
-| W2.5 | **Arrow bridge → ArrayView factories** | Tier 4.1 | 2h | `binary_column_view()` returns ArrayView2 |
+| W3.1 | **Fingerprint<N> ↔ ArrayView1<u64>** | Tier 1.1 | 4h | Zero-copy bridge, unblocks Zip/broadcast on fingerprints |
+| W3.2 | **VsaVector ↔ ArrayView1<u64>** | Tier 1.2 | 2h | Same pattern |
+| W3.3 | **BF16/F16 → BlasFloat** | Tier 1.3 | 4h | Enables `Array<BF16>.dot()`, unlocks BLAS dispatch |
+| W3.4 | **CogRecord channel_as_words()** | Tier 1.4 | 2h | ArrayView1<u64> per channel |
+| W3.5 | **Arrow bridge → ArrayView factories** | Tier 4.1 | 2h | `binary_column_view()` returns ArrayView2 |
 
 **Gate**: New `From`/`Into` impls compile. Existing code unchanged. Add tests for each bridge.
 
 ---
 
-## Wave 3 — Extension Traits (3 days)
+## Wave 4 — Extension Traits (3 days)
 
 From REFACTOR_HPC_INTEGRATION.md Tier 2. hpc operations feel native on ndarray types.
 
 | ID | Item | Source | Effort | Impact |
 |----|------|--------|--------|--------|
-| W3.1 | **HdcOps trait** (K0/K1/K2 on Array1<u64>) | Tier 2.1 | 4h | `query.cascade_distance(&candidate, &gate)` |
-| W3.2 | **Quantize trait** (Array → BF16/I8 with Array output) | Tier 2.2 | 4h | `array.to_bf16()`, `array.to_i8_symmetric()` |
-| W3.3 | **Prefilter on ArrayView2** | Tier 2.3 | 4h | Eliminates `(slice, rows, cols)` pattern |
-| W3.4 | **ActivationsNd** (softmax_axis, log_softmax_axis) | Tier 2.4 | 4h | `batch.softmax_axis(Axis(1))` for inference |
-| W3.5 | **SimdMath trait** (VML on any Array<f32>) | Tier 2.5 | 4h | `array.simd_exp()` — 10-50x for transcendentals |
-| W3.6 | **Int8Matmul trait** (Array2<u8> × Array2<i8> → Array2<i32>) | Tier 3.3 | 4h | Quantized inference via ndarray types |
+| W4.1 | **HdcOps trait** (K0/K1/K2 on Array1<u64>) | Tier 2.1 | 4h | `query.cascade_distance(&candidate, &gate)` |
+| W4.2 | **Quantize trait** (Array → BF16/I8 with Array output) | Tier 2.2 | 4h | `array.to_bf16()`, `array.to_i8_symmetric()` |
+| W4.3 | **Prefilter on ArrayView2** | Tier 2.3 | 4h | Eliminates `(slice, rows, cols)` pattern |
+| W4.4 | **ActivationsNd** (softmax_axis, log_softmax_axis) | Tier 2.4 | 4h | `batch.softmax_axis(Axis(1))` for inference |
+| W4.5 | **SimdMath trait** (VML on any Array<f32>) | Tier 2.5 | 4h | `array.simd_exp()` — 10-50x for transcendentals |
+| W4.6 | **Int8Matmul trait** (Array2<u8> × Array2<i8> → Array2<i32>) | Tier 3.3 | 4h | Quantized inference via ndarray types |
 
 **Gate**: Each trait has tests matching the raw-slice function's test suite.
 Benchmark: trait overhead vs. direct raw-slice call ≤ 1%.
 
 ---
 
-## Wave 4 — Backend Wiring (2 days)
+## Wave 5 — Backend Wiring (2 days)
 
 From REFACTOR_HPC_INTEGRATION.md Tier 3. Core ndarray operations silently accelerate.
 
 | ID | Item | Source | Effort | Impact |
 |----|------|--------|--------|--------|
-| W4.1 | **Unified SIMD detection** (merge simd_caps + simd_dispatch into core) | Tier 3.2 | 4h | Deletes 877 lines of duplication |
-| W4.2 | **Core sum/mean → SIMD dispatch** | Tier 3.1 | 4h | 16x faster `.sum()` on contiguous f32/f64 |
-| W4.3 | **SIMD axis reductions** (sum_axis with SIMD lanes) | Tier 6.1 | 1d | ML training hot path |
+| W5.1 | **Unified SIMD detection** (merge simd_caps + simd_dispatch into core) | Tier 3.2 | 4h | Deletes 877 lines of duplication |
+| W5.2 | **Core sum/mean → SIMD dispatch** | Tier 3.1 | 4h | 16x faster `.sum()` on contiguous f32/f64 |
+| W5.3 | **SIMD axis reductions** (sum_axis with SIMD lanes) | Tier 6.1 | 1d | ML training hot path |
 
 **Gate**: `cargo bench` shows measurable improvement on contiguous arrays.
 Non-contiguous arrays unchanged (fallback to generic fold).
 
 ---
 
-## Wave 5 — SoA Kernel Architecture (1 week)
+## Wave 6 — SoA Kernel Architecture (1 week)
 
 From SOA_KERNEL_ARCHITECTURE.md. The genius-level structural change.
+Wave 1's `U64x8::popcnt()` and `U64x8::xor_popcount()` are direct prerequisites
+for the columnar XOR+popcount scan in W6.2.
 
 | ID | Item | Source | Effort | Impact |
 |----|------|--------|--------|--------|
-| W5.1 | **TieredDatabase struct** (K0/K1/K2 as F-order Array2) | SoA §5 | 1d | Physical separation matching cache hierarchy |
-| W5.2 | **k0_columnar_simd** (scan N u64s with broadcast+XOR+popcount) | SoA §1 | 1d | 8 candidates per cycle, sequential memory |
-| W5.3 | **Bitmask survivor propagation** (mask narrowing, no branching) | SoA §2 | 4h | Replace per-candidate `continue` with mask ops |
-| W5.4 | **k1_accumulate_columnar** (8-column scan with mask) | SoA §1 | 4h | K1 on contiguous columns |
-| W5.5 | **k2_exact_masked** (row extraction for survivors only) | SoA §1 | 4h | Stack-copy 2KB per survivor |
-| W5.6 | **BF16FieldDatabase** (sign/exp/mantissa separated at ingest) | SoA §6 | 1d | Awareness without runtime decomposition |
-| W5.7 | **QualiaColumns** (16 × Array1<i8>) | SoA §7 | 4h | 18x dimension scan throughput |
-| W5.8 | **Arrow native path** (RecordBatch columns → columnar scan) | SoA §4 | 4h | Zero-copy from Lance/Arrow |
-| W5.9 | **Benchmark: AoS vs SoA** on 1M SKU-16K containers | SoA §perf | 4h | Proof of 4-8x claim |
+| W6.1 | **TieredDatabase struct** (K0/K1/K2 as F-order Array2) | SoA §5 | 1d | Physical separation matching cache hierarchy |
+| W6.2 | **k0_columnar_simd** (scan N u64s with broadcast+XOR+popcount) | SoA §1 | 1d | Uses `U64x8::xor_popcount` from W1.5. 8 candidates per cycle. |
+| W6.3 | **Bitmask survivor propagation** (mask narrowing, no branching) | SoA §2 | 4h | Replace per-candidate `continue` with mask ops |
+| W6.4 | **k1_accumulate_columnar** (8-column scan with mask) | SoA §1 | 4h | K1 on contiguous columns |
+| W6.5 | **k2_exact_masked** (row extraction for survivors only) | SoA §1 | 4h | Stack-copy 2KB per survivor |
+| W6.6 | **BF16FieldDatabase** (sign/exp/mantissa separated at ingest) | SoA §6 | 1d | Awareness without runtime decomposition |
+| W6.7 | **QualiaColumns** (16 × Array1<i8>) | SoA §7 | 4h | 18x dimension scan throughput. Uses `I8x16::from_i4_packed_u64` from W1.1. |
+| W6.8 | **Arrow native path** (RecordBatch columns → columnar scan) | SoA §4 | 4h | Zero-copy from Lance/Arrow |
+| W6.9 | **Benchmark: AoS vs SoA** on 1M SKU-16K containers | SoA §perf | 4h | Proof of 4-8x claim |
 
 **Gate**: SoA cascade produces identical results to AoS cascade on full test suite.
 Benchmark confirms ≥3x throughput improvement on 1M containers.
 
 ---
 
-## Wave 6 — Namespace Restructure (3 days)
+## Wave 7 — Namespace Restructure (3 days)
 
 From transformer feedback R2.1. Enforces the architecture rule from CLAUDE.md.
 
 | ID | Item | Source | Effort | Impact |
 |----|------|--------|--------|--------|
-| W6.1 | **Split hpc/ → hpc/ + cog/ + ext/ + io/** | R2.1 | 2-3d | 30 numeric modules in hpc/, 35 cognitive in cog/, 20 experimental in ext/, 6 I/O in io/ |
-| W6.2 | **SIMD directory consolidation** (simd_*.rs → src/simd/) | R2.2 | 1d | One directory for all SIMD code |
-| W6.3 | **Quantized module split** (quantized.rs → hpc/quant/) | R2.3 | 4h | BlockQ4_0 packed struct for candle compat |
+| W7.1 | **Split hpc/ → hpc/ + cog/ + ext/ + io/** | R2.1 | 2-3d | 30 numeric modules in hpc/, 35 cognitive in cog/, 20 experimental in ext/, 6 I/O in io/ |
+| W7.2 | **SIMD directory consolidation** (simd_*.rs → src/simd/) | R2.2 | 1d | One directory for all SIMD code. W1 primitives land in new locations. |
+| W7.3 | **Quantized module split** (quantized.rs → hpc/quant/) | R2.3 | 4h | BlockQ4_0 packed struct for candle compat |
 
 **Gate**: `pub use` deprecation shims for all moved modules.
 All existing `use ndarray::hpc::*` paths still resolve (with deprecation warning).
 
 ---
 
-## Wave 7 — Test & Bench Infrastructure (2 days)
+## Wave 8 — Test & Bench Infrastructure (2 days)
 
 | ID | Item | Source | Effort | Impact |
 |----|------|--------|--------|--------|
-| W7.1 | **Extract integration tests** to `tests/hpc/`, `tests/cog/` | R4.2 | 2d | Module files lose 30-50% of line count |
-| W7.2 | **HPC bench harness** (dot, reductions, softmax, quantized) | R4.5 | 1d | Reproducible perf claims |
-| W7.3 | **Cross-comparison bench** (ndarray vs candle vs tract kernel) | R4.5 ext | 4h | The whole reason for B1 |
+| W8.1 | **Extract integration tests** to `tests/hpc/`, `tests/cog/` | R4.2 | 2d | Module files lose 30-50% of line count |
+| W8.2 | **HPC bench harness** (dot, reductions, softmax, quantized) | R4.5 | 1d | Reproducible perf claims |
+| W8.3 | **Cross-comparison bench** (ndarray vs candle vs tract kernel) | R4.5 ext | 4h | The whole reason for B1 |
+| W8.4 | **W1a primitive parity bench** (AVX-512 vs NEON vs scalar speedup) | lance-graph | 4h | Acceptance criteria §4 |
 
 ---
 
-## Wave 8 — Version Bump & Downstream Pin (1 day)
+## Wave 9 — Version Bump & Downstream Pin (1 day)
 
 | ID | Item | Source | Effort | Impact |
 |----|------|--------|--------|--------|
-| W8.1 | **Tag v0.18.0** | R4.4 | 2h | Clean cut |
-| W8.2 | **CHANGELOG with migration guide** | R4.4 | 2h | Downstream knows what changed |
-| W8.3 | **Downstream Cargo.toml pins** (burn, candle, tract, ort) | R4.4 | 2h | tag = "v0.18.0" |
-| W8.4 | **Remove deprecation shims** (schedule for v0.19) | R2.1 | — | Future cleanup |
+| W9.1 | **Tag v0.18.0** | R4.4 | 2h | Clean cut |
+| W9.2 | **CHANGELOG with migration guide** | R4.4 | 2h | Downstream knows what changed |
+| W9.3 | **Downstream Cargo.toml pins** (burn, candle, tract, ort, lance-graph) | R4.4 | 2h | tag = "v0.18.0" |
+| W9.4 | **Remove deprecation shims** (schedule for v0.19) | R2.1 | — | Future cleanup |
 
 ---
 
-## The SoA Integration with Conventions
+## Dependency Graph
+
+```
+Wave 0 (conventions)
+  │
+  ├──→ Wave 1 (SIMD primitives) ←── P0 for lance-graph
+  │       │
+  │       ├──→ Wave 6.2 (k0_columnar uses U64x8::xor_popcount from W1.5)
+  │       ├──→ Wave 6.7 (QualiaColumns uses I8x16::from_i4_packed from W1.1)
+  │       └──→ Wave 8.4 (primitive parity benches)
+  │
+  ├──→ Wave 2 (codegen macros)
+  │       │
+  │       └──→ Wave 5 (backend wiring uses macros)
+  │
+  ├──→ Wave 3 (type bridges)
+  │       │
+  │       ├──→ Wave 4 (extension traits need bridges)
+  │       ├──→ Wave 5 (backend dispatch needs BlasFloat for BF16)
+  │       └──→ Wave 6 (SoA needs Fingerprint↔Array, Arrow views)
+  │
+  ├──→ Wave 7 (namespace) — independent of 1-6
+  │
+  └──→ Wave 8 (tests/benches) — after 1+5+6
+          │
+          └──→ Wave 9 (release)
+```
+
+### Critical Path
+
+**For lance-graph (P0 consumer)**: 0 → 1 = **6 days**. Consumer migration PRs unblocked.
+
+**For full release**: 0 → 1 → 3 → 5 → 6 → 8 → 9 = **18 days** serial.
+With parallelism (1∥2, 3∥4, 6∥7, 8∥backlog): **~12 working days**.
 
-The SoA architecture (Wave 5) benefits from every preceding wave:
+---
 
-| Wave 5 needs | Provided by |
-|--------------|-------------|
-| F-order Array2<u64> database | Wave 2 (type bridges — Fingerprint converts to Array) |
-| SIMD column scanning | Wave 4 (unified SIMD detection, core dispatch) |
-| Bitmask as Array1<u64> | Wave 2 (ArrayView factory pattern) |
-| `_into` form for columnar kernels | Wave 0 (signature convention) |
-| Dispatch macro for k0_columnar_simd | Wave 1 (codegen macros) |
-| Extension trait: `database.cascade_soa(&query, &gate)` | Wave 3 (HdcOps trait extended) |
-| BF16FieldDatabase uses Quantize trait | Wave 3 (Quantize extension) |
-| Arrow columns → direct scan | Wave 2 (Arrow view factories) |
-| Benchmark harness to prove 4-8x | Wave 7 (bench infrastructure) |
+## How Wave 1 Feeds Everything Downstream
+
+The 5 SIMD primitives aren't isolated additions — they're load-bearing for later waves:
 
-This is why the waves are ordered this way. The SoA kernel is the most impactful
-change (~4-8x cascade throughput), but it needs the convention/macro/bridge
-foundation to land cleanly. Without Wave 0-4, the SoA code would be yet another
-raw-slice module that doesn't integrate with ndarray — repeating the original sin.
+| Primitive | Immediate consumer (lance-graph) | Later consumer (ndarray) |
+|-----------|----------------------------------|--------------------------|
+| `I8x16::from_i4_packed_u64` | `mul.rs::i4_eval::batch` (5 batch fns) | **W6.7 QualiaColumns** — unpack i4 qualia from packed storage without scalar loop |
+| `I8x16::saturating_abs` | Direction-B fix, ValleyOfDespair classifier | **W4.2 Quantize trait** — safe abs in quantization error metrics |
+| `U16x8::gather_u16` | `bgz17/simd.rs` palette lookup | **W6.6 BF16FieldDatabase** — gather exponent fields from lookup table |
+| `prefetch_read_t0/t1/t2` | `bgz17/prefetch.rs` tile prefetch | **W6.2 k0_columnar_simd** — prefetch next column chunk during K0 scan |
+| `U64x8::popcnt` + `xor_popcount` | `holograph/hamming.rs` + `blasgraph/types.rs` | **W6.2-W6.5 entire SoA cascade** — columnar XOR+popcount is THE operation |
+
+The SoA cascade (Wave 6) is fundamentally built on `U64x8::xor_popcount()`. Without Wave 1,
+the SoA implementation would need to drop back to `hpc::bitwise::hamming_distance_raw()` on
+slices, losing the typed-wrapper discipline and duplicating dispatch logic.
+
+---
+
+## The SoA Integration with All Waves
+
+| Wave 6 needs | Provided by |
+|--------------|-------------|
+| `U64x8::xor_popcount()` for columnar scan | **Wave 1.5** (SIMD primitives) |
+| `I8x16::from_i4_packed_u64()` for qualia | **Wave 1.1** (SIMD primitives) |
+| `prefetch_read_t0()` for column prefetch | **Wave 1.4** (SIMD primitives) |
+| F-order Array2<u64> database | **Wave 3** (type bridges — Fingerprint converts to Array) |
+| Dispatch macro for k0_columnar_simd | **Wave 2** (codegen macros) |
+| `_into` form for columnar kernels | **Wave 0** (signature convention) |
+| Extension trait: `database.cascade_soa(&query, &gate)` | **Wave 4** (HdcOps trait extended) |
+| BF16FieldDatabase uses Quantize trait | **Wave 4** (Quantize extension) |
+| Arrow columns → direct scan | **Wave 3** (Arrow view factories) |
+| Unified SIMD detection for dispatch | **Wave 5** (backend wiring) |
+| Benchmark harness to prove 4-8x | **Wave 8** (bench infrastructure) |
 
 ---
 
 ## The Meta-Level SoA Observation
 
 The SoA insight goes beyond just "transpose the database." It's a design principle
-that applies recursively across the entire hpc/ surface:
+that applies recursively across the entire surface:
 
 ### Level 1: Database layout (SOA_KERNEL_ARCHITECTURE.md)
 - Fingerprint database: column-per-word → cascade becomes column scan
 - BF16 fields: separated sign/exp/mantissa → awareness is native layout
 - Qualia: column-per-dimension → per-dim queries are sequential scans
 
-### Level 2: API surface (this document)
+### Level 2: API surface
 - The **dual-form convention** (W0.1) IS SoA thinking applied to function signatures:
   - AoS API: `fn foo(input) -> output` (allocates inside, hides layout)
   - SoA API: `fn foo_into(input, output)` (caller controls memory layout)
-  - The `_into` form lets the caller decide whether output is contiguous, chunked, or interleaved
 
-### Level 3: Module structure (W6.1 namespace split)
-- AoS: one module (`hpc/`) contains everything → grep to find anything
-- SoA: separated by concern (`hpc/`, `cog/`, `ext/`, `io/`) → each namespace is a contiguous "column" of related functionality that you scan sequentially
+### Level 3: SIMD primitives (lance-graph W1a)
+- The **typed-wrapper-as-method** convention IS SoA for the SIMD surface:
+  - AoS: raw intrinsics scattered across consumer crates (158 violations)
+  - SoA: struct methods on typed wrappers, consumers import ONE namespace
+- The **closure-batch pattern** (W1.1 `batch_packed_i4_16`) IS SoA for computation:
+  - AoS: each consumer writes its own unpack+process+pack loop
+  - SoA: one batch primitive accepts a closure, owns chunking/tailing/dispatch
 
-### Level 4: ndarray's type system
-- `Array2<u64>` in F-order IS the SoA database. No wrapper type needed.
-- `.column(i)` IS the field-access pattern. No custom iterator needed.
-- `.row(i)` IS the record-access pattern. No conversion needed.
-- Strides encode the duality. The same allocation serves both patterns.
-
-### Level 5: The cascade pipeline itself
-- K0: ONE field (word 0) across ALL records → column scan
-- K1: EIGHT fields (words 0-7) across SURVIVORS → multi-column scan
-- K2: ALL fields across FEW records → row reconstruction
+### Level 4: Module structure (W7.1 namespace split)
+- AoS: one module (`hpc/`) contains everything → grep to find anything
+- SoA: separated by concern (`hpc/`, `cog/`, `ext/`, `io/`) → contiguous columns
 
-This is a **widening read pattern**: start narrow (1 column), widen as population shrinks.
-SoA makes the narrow-start cheap (contiguous) and tolerates the wide-end being strided
-(because so few records reach it).
+### Level 5: ndarray's type system
+- `Array2<u64>` in F-order IS the SoA database
+- `.column(i)` IS the field-access pattern
+- `.row(i)` IS the record-access pattern
+- Strides encode the duality without duplication
 
 ### Level 6: Arrow + Lance alignment
 - Arrow IS columnar storage (SoA by definition)
 - Lance stores Arrow columns on disk
 - The SoA kernel reads Arrow columns directly — no ETL, no transpose, no copy
-- The database layout on disk IS the query execution layout in memory
-- This is the zero-copy nirvana: mmap → column pointer → SIMD scan → result
+- Storage format = memory layout = compute pattern = cache access order
 
-**The architecture collapses all intermediate transformations.**
-Storage format = memory layout = compute pattern = cache access order.
-One decision (F-order Array2) propagates from disk through cache to SIMD lanes.
+### Level 7: The consumer contract itself
+- lance-graph declares what it needs (W1a primitives), ndarray provides
+- The contract is columnar: one primitive per concern, not a monolithic "SIMD library"
+- Each primitive is independently testable, deployable, and benchmarkable
+
+**The architecture collapses all intermediate transformations at every level.**
 
 ---
 
@@ -225,17 +414,18 @@ One decision (F-order Array2) propagates from disk through cache to SIMD lanes.
 | Wave | Days | Parallel? | Key Outcome |
 |------|------|-----------|-------------|
 | 0 | 3 | Yes | Conventions locked |
-| 1 | 2 | Yes | Codegen macros ready |
-| 2 | 2 | Yes (after 0) | Domain types ↔ Array |
-| 3 | 3 | Yes (after 2) | Extension traits live |
-| 4 | 2 | After 1+2 | Core ops accelerated |
-| 5 | 5 | After 2+4 | SoA cascade (the big win) |
-| 6 | 3 | After 0 | Namespace clean |
-| 7 | 2 | After 5 | Proof via benchmarks |
-| 8 | 1 | After all | Release |
-
-**Critical path**: 0 → 2 → 4 → 5 → 7 → 8 = **15 days** serial.
-With parallel waves (1∥2, 3∥4, 6∥5, 7∥8): **~10 working days**.
+| 1 | 3 | Yes (after 0) | **lance-graph unblocked** (P0) |
+| 2 | 2 | ∥ with 1 | Codegen macros ready |
+| 3 | 2 | After 0 | Domain types ↔ Array |
+| 4 | 3 | After 3 | Extension traits live |
+| 5 | 2 | After 2+3 | Core ops accelerated |
+| 6 | 5 | After 1+3+5 | SoA cascade (the big win) |
+| 7 | 3 | ∥ with 4-6 | Namespace clean |
+| 8 | 2 | After 1+6 | Proof via benchmarks |
+| 9 | 1 | After all | Release |
+
+**lance-graph unblocked**: Wave 0+1 = **6 days**.
+**Full release critical path**: 18 days serial, **~12 days** with parallelism.
 
 ---
 
@@ -246,3 +436,6 @@ With parallel waves (1∥2, 3∥4, 6∥5, 7∥8): **~10 working days**.
 3. **Don't gate SoA behind a feature flag** — it's the default hot path, not optional
 4. **Don't couple SoA with module restructure** — they're independent; merge separately
 5. **Don't break downstream in one shot** — deprecation shims for one release minimum
+6. **Don't ship W1a primitives without parity tests** — the codex P2 i8::MIN divergence on PR #398 happened because no such test existed
+7. **Don't add `is_x86_feature_detected!` outside simd_caps.rs** — dispatch through the singleton
+8. **Don't implement W1.5+ (deferred primitives) until sigker certification** — they're gated

From 4321fcf33c22867c2e20970d0f107e58e2606312 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Sat, 16 May 2026 20:52:56 +0000
Subject: [PATCH 5/7] Add W1a vertical SIMD consumer contract spec

Canonical spec for 5 P0 SIMD primitives lance-graph requires:
- I8x16::from_i4_packed_u64 (closure-batch)
- I8x16::saturating_abs (VPABSB + VPMINUB clamp)
- U16x8::gather_u16 (with Codex P2 OOB correction for 32-bit gather)
- prefetch_read_t0/t1/t2
- U64x8::popcnt + xor_popcount (NEON: vpaddlq cascade, NOT vaddvq_u8)

Includes Codex P2 corrections for gather OOB and NEON popcount
lane contamination.

https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo
---
 .../vertical-simd-consumer-contract.md        | 242 ++++++++++++++++++
 1 file changed, 242 insertions(+)
 create mode 100644 .claude/knowledge/vertical-simd-consumer-contract.md

diff --git a/.claude/knowledge/vertical-simd-consumer-contract.md b/.claude/knowledge/vertical-simd-consumer-contract.md
new file mode 100644
index 00000000..65c445c7
--- /dev/null
+++ b/.claude/knowledge/vertical-simd-consumer-contract.md
@@ -0,0 +1,242 @@
+# Vertical SIMD Consumer Contract — lance-graph W1a
+
+> Canonical spec for the 5 P0 SIMD primitives that AdaWorldAPI/lance-graph
+> requires from this ndarray fork before consumer-side migrations can land.
+
+## Context
+
+A PRE-MERGE audit of lance-graph `main` (2026-05-16) surfaced **158 raw-intrinsic
+violations** across 5 consumer crates plus 3 missing primitives here that block
+clean remediation. This document is the **binding contract** between ndarray's
+SIMD surface and its primary consumer.
+
+## Design Pattern
+
+ndarray's SIMD surface is designed AS-IF for lance-graph's exact workloads:
+
+- **Struct methods on typed wrappers** — `I8x16`, `U8x32`, `F32x16`, `U64x8`, …
+- **Closure-parameterized batch primitives** — absorb consumer domain semantics
+- **Consumers see zero raw intrinsics, zero `cfg(target_arch)`, zero runtime
+  feature-detect** — they call `I8x16::from_i4_packed_u64(...)`,
+  `I8x16::saturating_abs(...)`, `batch_packed_i4_16(..., |lanes, aux| { ... })`
+- **This repo (the polyfill) owns** dispatch, chunking, tail handling, scalar fallback
+
+---
+
+## W1a Queue — 5 Primitives
+
+### TD-NDARRAY-SIMD-UNPACK-I4-16D
+
+**API:**
+```rust
+impl I8x16 {
+    /// Unpack 8 packed i4 pairs from a u64 into 16 × i8 lanes (sign-extended).
+    fn from_i4_packed_u64(packed: u64) -> Self;
+}
+
+/// Closure-batch: owns chunking, dispatch, tail handling.
+/// `f` receives 16 i8 lanes + aux slice position; returns accumulator update.
+fn batch_packed_i4_16<E, F>(
+    packed: &[u64],
+    aux: &[E],
+    out: &mut [i8],
+    f: F,
+) where F: Fn(I8x16, &[E]) -> I8x16;
+```
+
+**Consumer:** `lance-graph::mul::i4_eval::batch` (5 fns)
+
+**Per-arch implementation:**
+- **AVX-512:** `VPSHUFB` nibble shuffle + sign-extend via arithmetic shift
+- **NEON:** `VTBL` permute + `VSHR`/`VAND` nibble extraction
+- **Scalar:** shift-and-mask loop, 2 nibbles per byte
+
+---
+
+### TD-NDARRAY-SIMD-SATURATING-ABS-I8
+
+**API:**
+```rust
+impl I8x16 {
+    /// Saturating absolute value: abs(i8::MIN) → i8::MAX (not i8::MIN).
+    fn saturating_abs(self) -> Self;
+}
+```
+
+**Consumer:** lance-graph PR #398 Direction-B fix
+
+**CRITICAL — VPABSB correction:**
+
+The original lance-graph PR #400 capture claimed `_mm512_abs_epi8` saturates
+`i8::MIN → 127` by ISA. **This is WRONG.** VPABSB returns `0x80` for input
+`0x80` (i.e., `abs(i8::MIN) = i8::MIN` because +128 doesn't fit in i8).
+
+**Correct AVX-512 implementation:**
+```rust
+// SAFETY: self.0 is a valid __m512i from construction
+unsafe {
+    let raw_abs = _mm512_abs_epi8(self.0);
+    // VPMINUB: unsigned min clamps 0x80 (=128 unsigned > 127) to 0x7f
+    let clamped = _mm512_min_epu8(raw_abs, _mm512_set1_epi8(0x7f));
+    Self(clamped)
+}
+```
+
+**Per-arch implementation:**
+- **AVX-512:** `VPABSB` + `VPMINUB` clamp (as above)
+- **AVX2:** `_mm256_abs_epi8` + `_mm256_min_epu8(..., splat(0x7f))`
+- **NEON:** `vqabsq_s8` — hardware-saturating (the `q` suffix). No fixup needed.
+- **Scalar:** `i8::saturating_abs()` — correct by definition.
+
+**Mandatory parity test:**
+```rust
+let input = I8x16::splat(i8::MIN);
+assert_eq!(input.saturating_abs().lane_i8::<0>(), i8::MAX);
+```
+
+The widen-then-negate trick used in lance-graph PR #398's `mul.rs` is NOT a
+substitute — the new primitive must produce saturating semantics in the
+byte-wide register without widening.
+
+---
+
+### TD-NDARRAY-SIMD-GATHER
+
+**API:**
+```rust
+impl U16x8 {
+    /// Gather 8 × u16 values from `table` at the given indices.
+    /// Panics (debug) / UB-free (release) if any index ≥ table.len().
+    fn gather_u16(table: &[u16], indices: [u32; 8]) -> Self;
+}
+
+/// Palette lookup: gather 8 u8 values from a 256-entry palette.
+fn palette_lookup_u8x8(palette: &[u8; 256], indices: [u8; 8]) -> [u8; 8];
+```
+
+**Consumer:** `bgz17/src/simd.rs:88`
+
+**Per-arch implementation:**
+- **AVX2/AVX-512:** `_mm256_i32gather_epi32` reads 32 bits per slot. Since the
+  source is `&[u16]`, each gather slot reads 4 bytes starting at
+  `&table[index]`. **Correction (Codex P2):** To avoid OOB reads at
+  `table[len-1]`, the implementation MUST:
+  1. Allocate or use a padded table with 2 extra trailing bytes, OR
+  2. Use scalar fallback for indices where `index == table.len() - 1`, OR
+  3. (Preferred) Gather from a `&[u32]` temporary produced by zero-extending
+     the u16 table at ingest time, OR
+  4. Mask the gathered 32-bit values to 16 bits (`_mm256_and_si256(..., 0xFFFF)`)
+     AND ensure the source allocation has ≥2 bytes of padding past the last
+     element (i.e., the table slice is backed by an allocation at least
+     `table.len() * 2 + 2` bytes).
+
+  The bounds contract is: `max(indices) < table.len()`. The implementation must
+  guarantee no UB even when `max(indices) == table.len() - 1`.
+- **NEON:** No hardware gather for u16. Scalar lane-by-lane load into vector.
+- **Scalar:** Direct indexing loop.
+
+---
+
+### TD-NDARRAY-SIMD-PREFETCH
+
+**API:**
+```rust
+/// Prefetch read hint — temporal locality level 0/1/2.
+/// No-op on architectures without prefetch instructions.
+#[inline(always)]
+fn prefetch_read_t0<T>(ptr: *const T);
+#[inline(always)]
+fn prefetch_read_t1<T>(ptr: *const T);
+#[inline(always)]
+fn prefetch_read_t2<T>(ptr: *const T);
+```
+
+**Consumer:** `bgz17/src/prefetch.rs:96,100`
+
+**Per-arch implementation:**
+- **x86:** `_mm_prefetch::<_MM_HINT_T0/T1/T2>(ptr as *const i8)`
+- **NEON/AArch64:** `__prefetch(ptr)` (ARM has only one hint level in stable intrinsics)
+- **Scalar/fallback:** No-op `#[inline(always)]` empty function
+
+---
+
+### TD-NDARRAY-SIMD-POPCOUNT-U64
+
+**API:**
+```rust
+impl U64x8 {
+    /// Per-lane population count: count set bits in each u64 lane.
+    fn popcnt(self) -> Self;
+
+    /// XOR two vectors then popcount each lane (Hamming distance).
+    fn xor_popcount(self, other: Self) -> Self;
+}
+```
+
+**Consumer:** `holograph/hamming.rs`, `blasgraph/types.rs`
+
+**Per-arch implementation:**
+- **AVX-512 VPOPCNTDQ:** `_mm512_popcnt_epi64` directly. Feature flag `avx512vpopcntdq`.
+- **AVX-512 without VPOPCNTDQ:** Mula's algorithm — `_mm512_shuffle_epi8` (VPSHUFB)
+  for per-nibble LUT popcount on bytes, then `_mm512_sad_epu8` against zero to
+  horizontally sum bytes within each 64-bit lane.
+- **NEON:** Per-u64 lane popcount via widening reduction:
+  ```
+  vcntq_u8        → 16 × u8 byte-popcounts
+  vpaddlq_u8      → 8 × u16 (pairwise add bytes)
+  vpaddlq_u16     → 4 × u32 (pairwise add u16s)
+  vpaddlq_u32     → 2 × u64 (pairwise add u32s)
+  ```
+  This produces one count per u64 lane. **Correction (Codex P2):** Do NOT use
+  `vaddvq_u8` — it reduces the entire 128-bit vector to a single scalar,
+  merging counts across u64 lanes. The `vpaddlq_*` widening cascade preserves
+  per-lane boundaries.
+- **Scalar:** `u64::count_ones()` per lane.
+
+---
+
+## W1.5 — Deferred Primitives
+
+Gated on lance-graph `sigker` certification (Pillar 11 activation):
+
+| ID | Primitive | Depends on |
+|----|-----------|-----------|
+| W1.5-#7 | signature-PDE-sweep | W1a-#1 closure-batch shape |
+| W1.5-#8 | randomized-projection | W1a-#1 batch pattern |
+| W1.5-#9 | lyndon-pack | W1a-#5 popcount |
+
+The W1a closure-batch shape (`batch_packed_i4_16`) is the foundation for W1.5-#7
+and #8. Design W1a-#1 with this forward-compatibility in mind.
+
+---
+
+## Acceptance Criteria (per W1a PR)
+
+1. All three backends (AVX-512/AVX2, NEON, scalar) — scalar is the correctness anchor
+2. Doc-comment states saturating/overflow/signedness semantics explicitly
+3. Mandatory parity test on randomized + edge-case corpus (≥ 64 randomized vectors)
+4. No new `is_*_feature_detected!` outside `src/hpc/simd_caps.rs`
+5. `// SAFETY:` comments on all `unsafe` blocks
+6. Consumer call-site cited in PR description
+7. `cargo clippy -- -D warnings` passes
+
+---
+
+## Cross-References
+
+- AdaWorldAPI/lance-graph PR #399 — simd-savant agent + autoattended-multiagent pattern
+- AdaWorldAPI/lance-graph PR #400 — architectural capture (VPABSB correction origin)
+- AdaWorldAPI/lance-graph PR #398 — codex P1 (NEON OOB) + P2 (i8::MIN divergence)
+- Intel Intrinsics Guide: `_mm512_abs_epi8`, `_mm512_min_epu8`, `_mm512_popcnt_epi64`, `_mm256_i32gather_epi32`
+- ARM Architecture Reference: `VQABS` (`vqabsq_s8`), `VCNT` (`vcntq_u8`), `VPADDL`
+
+---
+
+## Gating Relationship
+
+> ndarray ships these 5 primitives → lance-graph remediation PRs land →
+> 158 raw-intrinsic violations drop to 0 → sigker certification unblocks W1.5
+
+Consumer-side migrations **cannot proceed** until these primitives ship.
+The `simd-savant` agent on the lance-graph side runs PRE-MERGE against every
+W1a PR to verify compliance.

From b0fcc07b061057ea3201d7b4e2e53e3dd9550ea7 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Sat, 16 May 2026 22:53:03 +0000
Subject: [PATCH 6/7] =?UTF-8?q?docs:=20fix=20dispatch=20model=20in=20refac?=
 =?UTF-8?q?tor=20docs=20=E2=80=94=20cfg=20routing,=20not=20runtime=20detec?=
 =?UTF-8?q?tion?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

UNIFIED_REFACTOR_SEQUENCE.md:
- Add "Dispatch Model" section documenting compile-time cfg(target_feature) routing
- Wave 1 contract: replace "three backends" with correct per-file impl rule
- Replace rules 6-7 with: no is_x86_feature_detected, no #[target_feature(enable)]
- Wave 5: reframe as "delete dead detection code" not "unify runtime singleton"
- Add rules 9-10 (don't touch simd_avx2.rs, don't reach for rayon)

REFACTOR_HPC_INTEGRATION.md:
- §3.2: replace LazyLock<CpuCaps> proposal with "delete 877 lines of dead code"
- Architecture diagram: "backend dispatch" → "cfg(target_feature) routing"
- Phase C execution order updated to match

Keeps: all type bridges, extension traits, SoA cascade, Wave sequencing,
W1a primitive specs, VPABSB correction, palette-256 priority, NEON 2×128-bit,
Arrow integration, codegen macros, namespace restructure, effort estimates.

https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo
---
 REFACTOR_HPC_INTEGRATION.md  | 71 +++++++++++++++--------------
 UNIFIED_REFACTOR_SEQUENCE.md | 86 ++++++++++++++++++++++++++++--------
 2 files changed, 103 insertions(+), 54 deletions(-)

diff --git a/REFACTOR_HPC_INTEGRATION.md b/REFACTOR_HPC_INTEGRATION.md
index 1c11c56d..2823219f 100644
--- a/REFACTOR_HPC_INTEGRATION.md
+++ b/REFACTOR_HPC_INTEGRATION.md
@@ -36,14 +36,14 @@ The refactoring creates **bidirectional bridges** without removing the raw layer
 │  Array<f32, Ix2>, ArrayView, Zip, Broadcasting      │
 └───────────────┬──────────────────────▲──────────────┘
                 │                      │
-          .as_slice()          backend dispatch
-                │                      │
+          .as_slice()       cfg(target_feature) routing
+                │              (compile-time, zero-cost)
                 ▼                      │
 ┌─────────────────────────────────────────────────────┐
 │  hpc/ bridge layer (NEW)                            │
 │  Extension traits on ArrayBase<S, D>                │
 │  From/Into impls for domain types                   │
-│  Core reductions route to SIMD backends             │
+│  Core reductions call typed SIMD wrappers           │
 └───────────────┬──────────────────────▲──────────────┘
                 │                      │
           delegates to          implements
@@ -51,11 +51,17 @@ The refactoring creates **bidirectional bridges** without removing the raw layer
                 ▼                      │
 ┌─────────────────────────────────────────────────────┐
 │  hpc/ raw compute (unchanged)                       │
-│  &[u8], &[u64], Fingerprint<N>, SIMD dispatch       │
+│  &[u8], &[u64], Fingerprint<N>, typed SIMD          │
 │  K0/K1/K2, BF16 GEMM, VNNI, VML                    │
+│  Uses crate::simd::* → resolves to simd_avx512.rs  │
 └─────────────────────────────────────────────────────┘
 ```
 
+**Dispatch model**: `crate::simd::U64x8` resolves at compile time via
+`cfg(target_feature = "avx512f")` in `src/simd.rs` to `simd_avx512::U64x8`.
+No runtime detection, no match, no branching. The target-cpu pin in
+`.cargo/config.toml` makes the cfg gate TRUE at compile time.
+
 ---
 
 ## Refactoring Shopping List
@@ -608,44 +614,37 @@ faster. Zero API change for users.
 
 ---
 
-#### 3.2 Unify SIMD Detection (Delete Duplicates)
+#### 3.2 Delete Dead SIMD Detection Code
 
-**Files to merge**: `src/hpc/simd_dispatch.rs` (362 lines), `src/hpc/simd_caps.rs` (515 lines)
-**Into**: `src/simd.rs` (the existing core SIMD module)
+**Files**: `src/hpc/simd_dispatch.rs` (362 lines), `src/hpc/simd_caps.rs` (515 lines)
 
-**Current state**: Three overlapping detection systems:
-1. `src/simd.rs` → `LazyLock<Tier>` — detects AVX-512/AVX2/NEON
-2. `src/hpc/simd_caps.rs` → `CpuCaps` struct — detects same capabilities
-3. `src/hpc/simd_dispatch.rs` → `SimdDispatch` — another LazyLock with function pointers
+**Current state**: Three overlapping detection systems exist:
+1. `src/simd.rs` → `LazyLock<Tier>` + `cfg(target_feature)` re-exports
+2. `src/hpc/simd_caps.rs` → `CpuCaps` struct with runtime detection
+3. `src/hpc/simd_dispatch.rs` → `SimdDispatch` with function pointers
 
-**Transform**:
+**Reality with target-cpu pinned**: When `.cargo/config.toml` pins
+`target-cpu=sapphirerapids`, all `cfg(target_feature = "avx512f")` gates resolve
+TRUE at compile time. `simd.rs` re-exports route directly to `simd_avx512.rs`.
+The `LazyLock<Tier>` is dead code (const-folded away). The CpuCaps struct and
+SimdDispatch function pointers are completely unreachable.
 
-```rust
-// 1. In src/simd.rs, export the unified detection:
-pub static CPU_CAPS: LazyLock<CpuCaps> = LazyLock::new(|| detect_cpu_caps());
-
-pub struct CpuCaps {
-    pub tier: Tier,  // existing
-    pub avx512f: bool,
-    pub avx512bw: bool,
-    pub avx512vpopcntdq: bool,
-    pub avx512vnni: bool,
-    pub avx2: bool,
-    pub fma: bool,
-    pub popcnt: bool,
-    // ARM
-    pub neon: bool,
-    pub sve: bool,
-}
+**Transform**:
 
-// 2. In src/hpc/simd_dispatch.rs, replace with:
-pub use crate::simd::CPU_CAPS;
-// Keep function-pointer dispatch but source caps from unified singleton
+```
+1. Delete src/hpc/simd_caps.rs (515 lines — all dead under cfg pin)
+2. Delete src/hpc/simd_dispatch.rs (362 lines — all dead under cfg pin)
+3. Leave src/simd.rs as-is (its cfg gates ARE the dispatch mechanism)
+```
 
-// 3. Delete src/hpc/simd_caps.rs (or make it a thin re-export)
+If CI fallback (no target-cpu pin) is needed, gate these behind:
+```rust
+#[cfg(not(target_feature = "avx512f"))]
+mod simd_caps;  // only compiles when features aren't pinned
 ```
 
-**Result**: One CPUID check, one atomic, one struct — shared by core and all hpc modules.
+**Result**: -877 lines of dead code. The dispatch mechanism is `cfg(target_feature)`
+in `simd.rs` — no runtime anything.
 
 ---
 
@@ -913,8 +912,8 @@ Phase B (2 weeks) — Extension Traits:
   2.5 SimdMath trait (VML)           (standalone)
 
 Phase C (2 weeks) — Backend Wiring:
-  3.1 Core sum/mean → SIMD dispatch  (depends on 3.2)
-  3.2 Unified SIMD detection         (standalone, delete duplicates)
+  3.1 Core sum/mean → typed SIMD     (standalone — calls crate::simd::F32x16 directly)
+  3.2 Delete dead detection code     (standalone, -877 lines unreachable under cfg pin)
   3.3 INT8 Matmul via trait          (depends on 1.3 pattern)
 
 Phase D (1 week) — View Factories:
diff --git a/UNIFIED_REFACTOR_SEQUENCE.md b/UNIFIED_REFACTOR_SEQUENCE.md
index 6ad54321..d7835fd5 100644
--- a/UNIFIED_REFACTOR_SEQUENCE.md
+++ b/UNIFIED_REFACTOR_SEQUENCE.md
@@ -11,6 +11,35 @@
 
 ---
 
+## Dispatch Model (the ground truth)
+
+All SIMD in this repo resolves via **compile-time `cfg(target_feature)` routing**:
+
+```
+.cargo/config.toml pins target-cpu=sapphirerapids
+    → cfg(target_feature = "avx512f") = TRUE at compile time
+    → simd.rs re-exports resolve DIRECTLY to simd_avx512.rs types
+    → zero runtime detection, zero branching, zero LazyLock in hot path
+
+Consumer writes: crate::simd::U64x8
+simd.rs routes:  pub use crate::simd_avx512::U64x8;  // compile-time, no match
+```
+
+The `LazyLock<Tier>` in `simd.rs` exists for the `std` path but is **dead code**
+when target features are pinned — the compiler const-folds `detect_tier()` and
+the `cfg` gates resolve the re-exports statically.
+
+**On aarch64**: NEON is mandatory. `cfg(target_arch = "aarch64")` routes to
+`simd_neon.rs`. 256-bit types are 2×128-bit paired dispatch (e.g. `F32x16` =
+4×`float32x4_t`). No runtime detection needed — NEON is always there.
+
+**simd_avx2.rs**: Dead code on x86-64-v4. Only reached when
+`cfg(not(target_feature = "avx512f"))` — i.e., when someone builds without the
+target-cpu pin (CI fallback, cross-compile). Never add new methods to it for
+AVX-512 targets; it's a waste.
+
+---
+
 ## Wave 0 — Conventions & Foundations (3 days)
 
 Unlocks everything else. No code changes to hot paths. Pure contract + tooling.
@@ -46,13 +75,14 @@ The lance-graph `simd-savant` agent runs PRE-MERGE against each PR.
 ### Per-primitive implementation contract
 
 Every PR MUST:
-1. **Three backends**: AVX-512, NEON, scalar. Missing scalar = P0 reject.
+1. **Implement on the backing type in `simd_avx512.rs`** (the only live file on x86-64-v4). NEON impl goes in `simd_neon.rs`. Scalar fallback in the `scalar` module inside `simd.rs`.
 2. **Edge-case semantics documented** in doc-comment (`i8::MIN`, empty slices, OOB indices).
-3. **Parity test**: all backends produce identical output on randomized corpus including edge cases.
+3. **Parity test**: all cfg-routed backends produce identical output on randomized corpus including edge cases.
 4. **Bench against scalar**: record AVX-512/NEON speedup ratios in PR body.
 5. **`// SAFETY:` on every `unsafe` block**.
-6. **No new `is_x86_feature_detected!`** outside `simd_caps.rs`.
-7. **Consumer site cited** in PR description.
+6. **No `is_x86_feature_detected!` anywhere** — dispatch is at `cfg(target_feature)` in `simd.rs` re-exports, not per-call runtime checks.
+7. **No `#[target_feature(enable = ...)]` on functions** — the cargo target-cpu pin handles this globally.
+8. **Consumer site cited** in PR description.
 
 ### W1.1 — I8x16::from_i4_packed_u64 (nibble unpack + sign extend)
 
@@ -110,7 +140,18 @@ impl U16x8 {
 pub fn palette_lookup_u8x8(idx_v: U16x8, lut: &[u8]) -> U8x8;
 ```
 
-**AVX2**: `_mm256_i32gather_epi32` with index widening + downcast.
+**Palette-256 is the dominant use case** — every index fits in u8, table is always
+256 or 512 bytes. No bounds risk at palette-256 widths. The gather_u16 API must
+handle arbitrary tables too, but palette-256 should be the fast path (no OOB
+possible when table.len() == 256 and indices are u8-sourced).
+
+**Codex P2 fix (gather_u16 OOB)**: `_mm256_i32gather_epi32` reads 4 bytes per slot
+from a `&[u16]` table — overreads 2 bytes at `table[len-1]`. For palette-256 this
+is harmless (256 × 2 = 512 bytes, 4-byte read at index 255 reads bytes 510-513,
+which is within a 512-byte aligned allocation + padding). For arbitrary tables,
+use scalar fallback or pad the table allocation.
+
+**AVX-512**: `_mm512_i32gather_epi32` with index widening + mask to u16.
 **NEON / Scalar**: loop `(0..8).map(|i| table[indices.lane(i)])`.
 
 ### W1.4 — Prefetch hints (cross-arch)
@@ -134,12 +175,13 @@ impl U64x8 {
     /// XOR + lane-wise popcount + horizontal sum. Optimized for Hamming distance.
     pub fn xor_popcount(self, other: Self) -> u64;
 }
-impl U64x4 { pub fn popcnt(self) -> Self; }  // AVX2 parity
+impl U64x4 { pub fn popcnt(self) -> Self; }  // NEON/scalar parity
 ```
 
-**AVX-512 VPOPCNTDQ**: `_mm512_popcnt_epi64` directly (feature `avx512vpopcntdq`).
-**AVX-512 without VPOPCNTDQ**: Mula's algorithm via `VPSHUFB` + `VPSADBW` per-byte LUT.
-**NEON**: `vcntq_u8` → `vpaddlq_u8` cascade to sum within each u64.
+**AVX-512 VPOPCNTDQ**: `_mm512_popcnt_epi64` directly (feature `avx512vpopcntdq` —
+available on sapphirerapids, enabled by the target-cpu pin).
+**NEON popcount per-u64**: `vcntq_u8` → `vpaddlq_u8` → `vpaddlq_u16` → `vpaddlq_u32`
+(NOT `vaddvq_u8` which merges ALL lanes to a single scalar — Codex P2 fix).
 **Scalar**: `u64::count_ones()` fused loop.
 
 ### W1.5+ — Deferred primitives (gated on sigker certification)
@@ -170,7 +212,7 @@ Wave 1 primitives become the first consumers of these macros (retrofit optional)
 | ID | Item | Source | Effort | What It Produces |
 |----|------|--------|--------|------------------|
 | W2.1 | **Dtype-parity macro** (`reductions_for!`) | R3.1 | 1d | One line = 7 reductions for a dtype. Cuts 700→150 lines. |
-| W2.2 | **Per-arch dispatch macro** (`simd_dispatch!`) | R3.2 | 4h | Eliminates dispatch skeleton copy-paste. |
+| W2.2 | **Per-arch impl macro** (`simd_impl!`) | R3.2 | 4h | Generates the struct + methods for a type in both simd_avx512.rs and simd_neon.rs from one body. |
 | W2.3 | **Reduction kernel template** (`reduce_simd()`) | R3.3 | 4h | Generic chunk-loop; sum/max/nrm2 become 5-line callers. |
 | W2.4 | **Dual-form fusion** (`kernel_simd_dual!`) | R3.4 | 1d | One body → `_into`, Vec, `_ptr`, all arch variants. |
 
@@ -219,10 +261,17 @@ From REFACTOR_HPC_INTEGRATION.md Tier 3. Core ndarray operations silently accele
 
 | ID | Item | Source | Effort | Impact |
 |----|------|--------|--------|--------|
-| W5.1 | **Unified SIMD detection** (merge simd_caps + simd_dispatch into core) | Tier 3.2 | 4h | Deletes 877 lines of duplication |
+| W5.1 | **Delete duplicate detection code** (simd_caps + simd_dispatch → dead code under cfg pin) | Tier 3.2 | 4h | Deletes 877 lines that are unreachable when target-cpu is pinned |
 | W5.2 | **Core sum/mean → SIMD dispatch** | Tier 3.1 | 4h | 16x faster `.sum()` on contiguous f32/f64 |
 | W5.3 | **SIMD axis reductions** (sum_axis with SIMD lanes) | Tier 6.1 | 1d | ML training hot path |
 
+**Note on W5.1**: With `target-cpu=sapphirerapids` in `.cargo/config.toml`, the
+`cfg(target_feature = "avx512f")` branch in `simd.rs` is the only live path.
+`hpc/simd_caps.rs` (515 lines) and `hpc/simd_dispatch.rs` (362 lines) exist for
+a multi-binary world that doesn't apply when the target is pinned. These can be
+deleted or feature-gated behind `cfg(not(target_feature = "avx512f"))` for CI
+fallback builds.
+
 **Gate**: `cargo bench` shows measurable improvement on contiguous arrays.
 Non-contiguous arrays unchanged (fallback to generic fold).
 
@@ -299,14 +348,14 @@ Wave 0 (conventions)
   │       ├──→ Wave 6.7 (QualiaColumns uses I8x16::from_i4_packed from W1.1)
   │       └──→ Wave 8.4 (primitive parity benches)
   │
-  ├──→ Wave 2 (codegen macros)
+  ├���─→ Wave 2 (codegen macros)
   │       │
   │       └──→ Wave 5 (backend wiring uses macros)
   │
   ├──→ Wave 3 (type bridges)
   │       │
   │       ├──→ Wave 4 (extension traits need bridges)
-  │       ├──→ Wave 5 (backend dispatch needs BlasFloat for BF16)
+  │       ├���─→ Wave 5 (backend dispatch needs BlasFloat for BF16)
   │       └──→ Wave 6 (SoA needs Fingerprint↔Array, Arrow views)
   │
   ├──→ Wave 7 (namespace) — independent of 1-6
@@ -320,7 +369,7 @@ Wave 0 (conventions)
 
 **For lance-graph (P0 consumer)**: 0 → 1 = **6 days**. Consumer migration PRs unblocked.
 
-**For full release**: 0 → 1 → 3 → 5 → 6 → 8 → 9 = **18 days** serial.
+**For full release**: 0 → 1 → 3 → 5 → 6 �� 8 → 9 = **18 days** serial.
 With parallelism (1∥2, 3∥4, 6∥7, 8∥backlog): **~12 working days**.
 
 ---
@@ -333,7 +382,7 @@ The 5 SIMD primitives aren't isolated additions — they're load-bearing for lat
 |-----------|----------------------------------|--------------------------|
 | `I8x16::from_i4_packed_u64` | `mul.rs::i4_eval::batch` (5 batch fns) | **W6.7 QualiaColumns** — unpack i4 qualia from packed storage without scalar loop |
 | `I8x16::saturating_abs` | Direction-B fix, ValleyOfDespair classifier | **W4.2 Quantize trait** — safe abs in quantization error metrics |
-| `U16x8::gather_u16` | `bgz17/simd.rs` palette lookup | **W6.6 BF16FieldDatabase** — gather exponent fields from lookup table |
+| `U16x8::gather_u16` | `bgz17/simd.rs` palette lookup | **W6.6 BF16FieldDatabase** �� gather exponent fields from lookup table |
 | `prefetch_read_t0/t1/t2` | `bgz17/prefetch.rs` tile prefetch | **W6.2 k0_columnar_simd** — prefetch next column chunk during K0 scan |
 | `U64x8::popcnt` + `xor_popcount` | `holograph/hamming.rs` + `blasgraph/types.rs` | **W6.2-W6.5 entire SoA cascade** — columnar XOR+popcount is THE operation |
 
@@ -351,12 +400,11 @@ slices, losing the typed-wrapper discipline and duplicating dispatch logic.
 | `I8x16::from_i4_packed_u64()` for qualia | **Wave 1.1** (SIMD primitives) |
 | `prefetch_read_t0()` for column prefetch | **Wave 1.4** (SIMD primitives) |
 | F-order Array2<u64> database | **Wave 3** (type bridges — Fingerprint converts to Array) |
-| Dispatch macro for k0_columnar_simd | **Wave 2** (codegen macros) |
+| Impl macro for k0_columnar_simd | **Wave 2** (codegen macros) |
 | `_into` form for columnar kernels | **Wave 0** (signature convention) |
 | Extension trait: `database.cascade_soa(&query, &gate)` | **Wave 4** (HdcOps trait extended) |
 | BF16FieldDatabase uses Quantize trait | **Wave 4** (Quantize extension) |
 | Arrow columns → direct scan | **Wave 3** (Arrow view factories) |
-| Unified SIMD detection for dispatch | **Wave 5** (backend wiring) |
 | Benchmark harness to prove 4-8x | **Wave 8** (bench infrastructure) |
 
 ---
@@ -437,5 +485,7 @@ that applies recursively across the entire surface:
 4. **Don't couple SoA with module restructure** — they're independent; merge separately
 5. **Don't break downstream in one shot** — deprecation shims for one release minimum
 6. **Don't ship W1a primitives without parity tests** — the codex P2 i8::MIN divergence on PR #398 happened because no such test existed
-7. **Don't add `is_x86_feature_detected!` outside simd_caps.rs** — dispatch through the singleton
+7. **Don't use `is_x86_feature_detected!` or `#[target_feature(enable=...)]`** — cfg(target_feature) at the re-export level handles dispatch; per-function annotations and per-call runtime checks are wrong
 8. **Don't implement W1.5+ (deferred primitives) until sigker certification** — they're gated
+9. **Don't add methods to simd_avx2.rs for AVX-512 targets** — it's dead code on x86-64-v4, never reached via cfg routing
+10. **Don't use rayon work-stealing** if the type-system integration (Wave 3-4) is the global lever — typed SIMD dispatch across the full surface eliminates the slicing problem that rayon would paper over

From a5dba8d7d74c51961f89b36065a3e4ce8138a0dd Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Sat, 16 May 2026 23:00:57 +0000
Subject: [PATCH 7/7] docs: strip tutorial filler, keep only integration map +
 known constraints
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Remove the "Dispatch Model" explainer section, excessive rule expansions,
and paragraph-length justifications for obvious things. What remains:
- Known constraints list (7 bullets from prior sessions)
- Terse per-primitive contract (7 rules, no essays)
- 7 don'ts instead of 10 (dropped the ones restating known constraints)
- §3.2 reduced to 3 lines (dead code, delete it)
- Architecture diagram without dispatch lecture

https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo
---
 REFACTOR_HPC_INTEGRATION.md  | 39 +++-----------------------
 UNIFIED_REFACTOR_SEQUENCE.md | 53 +++++++++++-------------------------
 2 files changed, 20 insertions(+), 72 deletions(-)

diff --git a/REFACTOR_HPC_INTEGRATION.md b/REFACTOR_HPC_INTEGRATION.md
index 2823219f..f1aa3a3c 100644
--- a/REFACTOR_HPC_INTEGRATION.md
+++ b/REFACTOR_HPC_INTEGRATION.md
@@ -36,14 +36,13 @@ The refactoring creates **bidirectional bridges** without removing the raw layer
 │  Array<f32, Ix2>, ArrayView, Zip, Broadcasting      │
 └───────────────┬──────────────────────▲──────────────┘
                 │                      │
-          .as_slice()       cfg(target_feature) routing
-                │              (compile-time, zero-cost)
+          .as_slice()          crate::simd::*
+                │                      │
                 ▼                      │
 ┌─────────────────────────────────────────────────────┐
 │  hpc/ bridge layer (NEW)                            │
 │  Extension traits on ArrayBase<S, D>                │
 │  From/Into impls for domain types                   │
-│  Core reductions call typed SIMD wrappers           │
 └───────────────┬──────────────────────▲──────────────┘
                 │                      │
           delegates to          implements
@@ -53,15 +52,9 @@ The refactoring creates **bidirectional bridges** without removing the raw layer
 │  hpc/ raw compute (unchanged)                       │
 │  &[u8], &[u64], Fingerprint<N>, typed SIMD          │
 │  K0/K1/K2, BF16 GEMM, VNNI, VML                    │
-│  Uses crate::simd::* → resolves to simd_avx512.rs  │
 └─────────────────────────────────────────────────────┘
 ```
 
-**Dispatch model**: `crate::simd::U64x8` resolves at compile time via
-`cfg(target_feature = "avx512f")` in `src/simd.rs` to `simd_avx512::U64x8`.
-No runtime detection, no match, no branching. The target-cpu pin in
-`.cargo/config.toml` makes the cfg gate TRUE at compile time.
-
 ---
 
 ## Refactoring Shopping List
@@ -618,33 +611,9 @@ faster. Zero API change for users.
 
 **Files**: `src/hpc/simd_dispatch.rs` (362 lines), `src/hpc/simd_caps.rs` (515 lines)
 
-**Current state**: Three overlapping detection systems exist:
-1. `src/simd.rs` → `LazyLock<Tier>` + `cfg(target_feature)` re-exports
-2. `src/hpc/simd_caps.rs` → `CpuCaps` struct with runtime detection
-3. `src/hpc/simd_dispatch.rs` → `SimdDispatch` with function pointers
-
-**Reality with target-cpu pinned**: When `.cargo/config.toml` pins
-`target-cpu=sapphirerapids`, all `cfg(target_feature = "avx512f")` gates resolve
-TRUE at compile time. `simd.rs` re-exports route directly to `simd_avx512.rs`.
-The `LazyLock<Tier>` is dead code (const-folded away). The CpuCaps struct and
-SimdDispatch function pointers are completely unreachable.
-
-**Transform**:
-
-```
-1. Delete src/hpc/simd_caps.rs (515 lines — all dead under cfg pin)
-2. Delete src/hpc/simd_dispatch.rs (362 lines — all dead under cfg pin)
-3. Leave src/simd.rs as-is (its cfg gates ARE the dispatch mechanism)
-```
-
-If CI fallback (no target-cpu pin) is needed, gate these behind:
-```rust
-#[cfg(not(target_feature = "avx512f"))]
-mod simd_caps;  // only compiles when features aren't pinned
-```
+Dead under the target-cpu pin. Delete or gate behind `cfg(not(target_feature = "avx512f"))`.
 
-**Result**: -877 lines of dead code. The dispatch mechanism is `cfg(target_feature)`
-in `simd.rs` — no runtime anything.
+**Result**: -877 lines.
 
 ---
 
diff --git a/UNIFIED_REFACTOR_SEQUENCE.md b/UNIFIED_REFACTOR_SEQUENCE.md
index d7835fd5..f55b0417 100644
--- a/UNIFIED_REFACTOR_SEQUENCE.md
+++ b/UNIFIED_REFACTOR_SEQUENCE.md
@@ -11,32 +11,15 @@
 
 ---
 
-## Dispatch Model (the ground truth)
+## Known constraints (from prior sessions)
 
-All SIMD in this repo resolves via **compile-time `cfg(target_feature)` routing**:
-
-```
-.cargo/config.toml pins target-cpu=sapphirerapids
-    → cfg(target_feature = "avx512f") = TRUE at compile time
-    → simd.rs re-exports resolve DIRECTLY to simd_avx512.rs types
-    → zero runtime detection, zero branching, zero LazyLock in hot path
-
-Consumer writes: crate::simd::U64x8
-simd.rs routes:  pub use crate::simd_avx512::U64x8;  // compile-time, no match
-```
-
-The `LazyLock<Tier>` in `simd.rs` exists for the `std` path but is **dead code**
-when target features are pinned — the compiler const-folds `detect_tier()` and
-the `cfg` gates resolve the re-exports statically.
-
-**On aarch64**: NEON is mandatory. `cfg(target_arch = "aarch64")` routes to
-`simd_neon.rs`. 256-bit types are 2×128-bit paired dispatch (e.g. `F32x16` =
-4×`float32x4_t`). No runtime detection needed — NEON is always there.
-
-**simd_avx2.rs**: Dead code on x86-64-v4. Only reached when
-`cfg(not(target_feature = "avx512f"))` — i.e., when someone builds without the
-target-cpu pin (CI fallback, cross-compile). Never add new methods to it for
-AVX-512 targets; it's a waste.
+- Dispatch resolves at compile time via `cfg(target_feature)` in `simd.rs`. No per-call runtime checks.
+- `simd_avx2.rs` is dead on x86-64-v4. Don't add to it.
+- NEON does 256-bit as 2×128-bit paired.
+- VPABSB does NOT saturate i8::MIN.
+- Palette-256 is the dominant gather use case.
+- Rayon work-stealing is not the lever if typed SIMD integration is the global lever.
+- The `hpc/simd_caps.rs` and `hpc/simd_dispatch.rs` files are dead code under the pin — 877 lines to delete.
 
 ---
 
@@ -75,14 +58,13 @@ The lance-graph `simd-savant` agent runs PRE-MERGE against each PR.
 ### Per-primitive implementation contract
 
 Every PR MUST:
-1. **Implement on the backing type in `simd_avx512.rs`** (the only live file on x86-64-v4). NEON impl goes in `simd_neon.rs`. Scalar fallback in the `scalar` module inside `simd.rs`.
-2. **Edge-case semantics documented** in doc-comment (`i8::MIN`, empty slices, OOB indices).
-3. **Parity test**: all cfg-routed backends produce identical output on randomized corpus including edge cases.
-4. **Bench against scalar**: record AVX-512/NEON speedup ratios in PR body.
+1. **Impl in `simd_avx512.rs`** + **`simd_neon.rs`** + scalar fallback.
+2. **Edge-case semantics documented** (`i8::MIN`, empty slices, OOB indices).
+3. **Parity test**: all backends produce identical output on randomized corpus.
+4. **Bench against scalar**: speedup ratios in PR body.
 5. **`// SAFETY:` on every `unsafe` block**.
-6. **No `is_x86_feature_detected!` anywhere** — dispatch is at `cfg(target_feature)` in `simd.rs` re-exports, not per-call runtime checks.
-7. **No `#[target_feature(enable = ...)]` on functions** — the cargo target-cpu pin handles this globally.
-8. **Consumer site cited** in PR description.
+6. **No `is_x86_feature_detected!`**. No `#[target_feature(enable=...)]`.
+7. **Consumer site cited** in PR description.
 
 ### W1.1 — I8x16::from_i4_packed_u64 (nibble unpack + sign extend)
 
@@ -484,8 +466,5 @@ that applies recursively across the entire surface:
 3. **Don't gate SoA behind a feature flag** — it's the default hot path, not optional
 4. **Don't couple SoA with module restructure** — they're independent; merge separately
 5. **Don't break downstream in one shot** — deprecation shims for one release minimum
-6. **Don't ship W1a primitives without parity tests** — the codex P2 i8::MIN divergence on PR #398 happened because no such test existed
-7. **Don't use `is_x86_feature_detected!` or `#[target_feature(enable=...)]`** — cfg(target_feature) at the re-export level handles dispatch; per-function annotations and per-call runtime checks are wrong
-8. **Don't implement W1.5+ (deferred primitives) until sigker certification** — they're gated
-9. **Don't add methods to simd_avx2.rs for AVX-512 targets** — it's dead code on x86-64-v4, never reached via cfg routing
-10. **Don't use rayon work-stealing** if the type-system integration (Wave 3-4) is the global lever — typed SIMD dispatch across the full surface eliminates the slicing problem that rayon would paper over
+6. **Don't ship W1a primitives without parity tests** — the codex P2 i8::MIN divergence happened because no test existed
+7. **Don't implement W1.5+ (deferred primitives) until sigker certification** — they're gated