From ff234171c6f5ed1363ee4e807c2a18dbd50171f7 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 16 May 2026 20:12:38 +0000 Subject: [PATCH 1/7] Add detailed hpc/ integration plan: bridge raw-slice layer to ndarray types MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 6-tier refactoring shopping list transforming hpc/ from a bolted-on raw-slice library into a first-class ndarray citizen: - Tier 1: Type bridges (Fingerprint/VsaVector ↔ ArrayView, BF16 → BlasFloat) - Tier 2: Extension traits (HdcOps, Quantize, Prefilter, SimdMath) - Tier 3: Backend wiring (core sum/mean → SIMD, unified detection) - Tier 4: View factories (Arrow → ArrayView) - Tier 5: Zip modernization (VML tails, parallel hamming) - Tier 6: N-D axis reductions with SIMD dispatch No deletions — raw-slice functions stay for FFI/embedded. Extension traits add ndarray-native overloads that delegate to existing SIMD kernels. https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo --- REFACTOR_HPC_INTEGRATION.md | 1006 +++++++++++++++++++++++++++++++++++ 1 file changed, 1006 insertions(+) create mode 100644 REFACTOR_HPC_INTEGRATION.md diff --git a/REFACTOR_HPC_INTEGRATION.md b/REFACTOR_HPC_INTEGRATION.md new file mode 100644 index 00000000..1c11c56d --- /dev/null +++ b/REFACTOR_HPC_INTEGRATION.md @@ -0,0 +1,1006 @@ +# ndarray hpc/ Module Integration & Transformation Plan + +> **Goal**: Transform hpc/ from a bolted-on raw-slice library into a first-class +> citizen of ndarray's type system — without breaking any existing API. +> +> **Principle**: Every refactoring adds an ndarray-native overload or trait impl. +> Raw-slice functions stay (FFI, embedded, zero-overhead). No deletion. + +--- + +## Architectural Diagnosis + +The hpc/ module (80K lines, 90+ files) currently operates in two disconnected worlds: + +``` +┌─────────────────────────────────────────────────────┐ +│ ndarray core │ +│ Array, ArrayView, Zip, Broadcasting, │ +│ .sum(), .dot(), mapv(), LinalgScalar, BlasFloat │ +└───────────────┬─────────────────────────────────────┘ + │ .as_slice() ← the only bridge + ▼ +┌─────────────────────────────────────────────────────┐ +│ hpc/ module │ +│ &[u8], &[u64], &[f32], *const u8, Vec │ +│ Fingerprint, VsaVector, Cascade, EnergyConflict │ +│ Custom SIMD dispatch, manual row/col params │ +└─────────────────────────────────────────────────────┘ +``` + +The refactoring creates **bidirectional bridges** without removing the raw layer: + +``` +┌─────────────────────────────────────────────────────┐ +│ ndarray core (unchanged) │ +│ Array, ArrayView, Zip, Broadcasting │ +└───────────────┬──────────────────────▲──────────────┘ + │ │ + .as_slice() backend dispatch + │ │ + ▼ │ +┌─────────────────────────────────────────────────────┐ +│ hpc/ bridge layer (NEW) │ +│ Extension traits on ArrayBase │ +│ From/Into impls for domain types │ +│ Core reductions route to SIMD backends │ +└───────────────┬──────────────────────▲──────────────┘ + │ │ + delegates to implements + │ │ + ▼ │ +┌─────────────────────────────────────────────────────┐ +│ hpc/ raw compute (unchanged) │ +│ &[u8], &[u64], Fingerprint, SIMD dispatch │ +│ K0/K1/K2, BF16 GEMM, VNNI, VML │ +└─────────────────────────────────────────────────────┘ +``` + +--- + +## Refactoring Shopping List + +### Tier 1: Type Bridge — Domain Types ↔ Array (Low effort, High impact) + +These add conversion traits so domain types can participate in ndarray's ecosystem. + +--- + +#### 1.1 Fingerprint ↔ ArrayView1 + +**File**: `src/hpc/fingerprint.rs` + +**Current state**: `Fingerprint` wraps `[u64; N]`. Provides `.as_bytes()` via +unsafe pointer cast. No connection to ndarray. + +**Transform**: + +```rust +// --- Zero-copy views (no allocation) --- + +impl Fingerprint { + /// View this fingerprint as an ndarray 1-D array of u64 words. + /// Zero-copy: borrows the internal buffer. + pub fn as_array_view(&self) -> ArrayView1<'_, u64> { + ArrayView1::from_shape(N, &self.words).unwrap() + } + + /// Mutable view for in-place operations. + pub fn as_array_view_mut(&mut self) -> ArrayViewMut1<'_, u64> { + ArrayViewMut1::from_shape(N, &mut self.words).unwrap() + } + + /// View as 2-D matrix (e.g., 256 = 32×8 for tiled operations). + pub fn as_matrix_view(&self, rows: usize, cols: usize) -> ArrayView2<'_, u64> { + assert_eq!(rows * cols, N); + ArrayView2::from_shape((rows, cols), &self.words).unwrap() + } +} + +// --- Owned conversions (allocation on From, zero-copy on Into) --- + +impl From> for Fingerprint { + fn from(arr: Array1) -> Self { + assert_eq!(arr.len(), N); + let mut words = [0u64; N]; + words.copy_from_slice(arr.as_slice().unwrap()); + Self { words } + } +} + +impl From> for Array1 { + fn from(fp: Fingerprint) -> Self { + Array1::from_vec(fp.words.to_vec()) + } +} +``` + +**Why**: Callers can now pass fingerprints to any function expecting `ArrayView1`. +Enables `.sum()`, `.mapv()`, Zip, broadcasting on fingerprint data. + +**Example — before vs. after**: + +```rust +// BEFORE: raw slice, manual loop +let dist = k2_exact(&fp_a.words, &fp_b.words, 256); + +// AFTER: can also use ndarray ops directly +let view_a = fp_a.as_array_view(); +let view_b = fp_b.as_array_view(); +let xor_popcount: u64 = (&view_a ^ &view_b).mapv(|w| w.count_ones() as u64).sum(); +``` + +--- + +#### 1.2 VsaVector ↔ ArrayView1 + +**File**: `src/hpc/vsa.rs` + +**Same pattern as 1.1** — VsaVector has `words: [u64; 256]`. Add: + +```rust +impl VsaVector { + pub fn as_array_view(&self) -> ArrayView1<'_, u64> { + ArrayView1::from_shape(VSA_WORDS, &self.words).unwrap() + } + pub fn as_array_view_mut(&mut self) -> ArrayViewMut1<'_, u64> { + ArrayViewMut1::from_shape(VSA_WORDS, &mut self.words).unwrap() + } +} + +impl From> for VsaVector { ... } +impl From for Array1 { ... } +``` + +--- + +#### 1.3 BF16 / F16 as ndarray Element Types + +**Files**: `src/hpc/quantized.rs`, `src/backend/mod.rs` + +**Current state**: BF16 and F16 implement `ScalarOperand` (can appear in arrays) +but NOT `BlasFloat` (cannot use `.dot()`, BLAS traits). They're second-class citizens. + +**Transform** — Register BF16/F16 for BLAS dispatch: + +```rust +// In src/backend/mod.rs or src/hpc/quantized.rs: + +impl BlasFloat for BF16 { + fn backend_dot(xs: &[Self], ys: &[Self]) -> Self { + // Accumulate in f32, return as BF16 + let mut acc = 0.0f32; + for (&x, &y) in xs.iter().zip(ys.iter()) { + acc += x.to_f32() * y.to_f32(); + } + BF16::from_f32(acc) + } + + fn backend_gemm(m: usize, n: usize, k: usize, alpha: Self, + a: &[Self], lda: usize, b: &[Self], ldb: usize, + beta: Self, c: &mut [Self], ldc: usize) { + // Delegate to existing bf16_gemm_f32, convert output back + crate::hpc::quantized::bf16_gemm_f32( + a, b, + /* c_f32 temp */ ..., + m, n, k, alpha.to_f32(), beta.to_f32() + ); + // Convert f32 results back to BF16 in c + } + + fn backend_axpy(alpha: Self, x: &[Self], y: &mut [Self]) { ... } + fn backend_scal(alpha: Self, x: &mut [Self]) { ... } + fn backend_nrm2(x: &[Self]) -> Self { ... } + fn backend_asum(x: &[Self]) -> Self { ... } +} +``` + +**Impact**: Unlocks this: + +```rust +let a: Array2 = ...; +let b: Array2 = ...; +let c = a.blas_gemm(BF16::one(), &b, BF16::zero()); // Just works +``` + +--- + +#### 1.4 CogRecord ↔ Structured ArrayView + +**File**: `src/hpc/cogrecord.rs` + +**Current state**: CogRecord has four `Array` fields (btree, cam, embed, meta) +of 4096 bytes each. No combined view. + +**Transform**: + +```rust +impl CogRecord { + /// View all 4 channels as a single [4, 4096] matrix. + pub fn as_channels_view(&self) -> ArrayView2<'_, u8> { + // Stack the 4 contiguous buffers into a 2-D view + let total = self.btree.len() + self.cam.len() + self.embed.len() + self.meta.len(); + debug_assert_eq!(total, 4 * CONTAINER_BYTES); + // NOTE: only works if fields are contiguous in memory. + // If not, provide a copy-based `to_channels_array()`. + todo!("requires contiguous layout guarantee") + } + + /// Get a single channel as ArrayView1 (for Hamming/XOR ops). + pub fn channel_as_words(&self, channel: usize) -> ArrayView1<'_, u64> { + let bytes = match channel { + 0 => self.btree.as_slice().unwrap(), + 1 => self.cam.as_slice().unwrap(), + 2 => self.embed.as_slice().unwrap(), + 3 => self.meta.as_slice().unwrap(), + _ => panic!("channel must be 0..4"), + }; + // Safe: u8 slice reinterpreted as u64 (alignment checked) + let words = unsafe { + std::slice::from_raw_parts( + bytes.as_ptr() as *const u64, + bytes.len() / 8, + ) + }; + ArrayView1::from_shape(words.len(), words).unwrap() + } +} +``` + +--- + +### Tier 2: Trait Extensions — hpc Operations on Array Types (Medium effort, High impact) + +Add extension traits so hpc operations feel native on ndarray arrays. + +--- + +#### 2.1 HDC Extension Trait (Hamming, Cascade, K0/K1/K2) + +**New file**: `src/hpc/ndarray_ext.rs` (or extend `bitwise.rs`) + +**Reasoning**: The K0/K1/K2 pipeline currently takes `&[u64]`. Users with +`Array1` must manually call `.as_slice().unwrap()`. This trait bridges that. + +```rust +use crate::imp_prelude::*; +use super::kernels::{k0_probe, k1_stats, k2_exact, SliceGate, EnergyConflict}; + +/// HDC (Hyperdimensional Computing) operations on u64 word arrays. +/// +/// Provides the K0/K1/K2 rejection pipeline directly on ndarray types. +pub trait HdcOps { + /// K0 probe: fast 64-bit reject. + /// Returns true if the first word passes the gate threshold. + fn k0_probe(&self, candidate: &Self, gate: &SliceGate) -> bool; + + /// K1 stats: 512-bit medium reject. + /// Returns true if the first 8 words pass the gate threshold. + fn k1_stats(&self, candidate: &Self, gate: &SliceGate) -> bool; + + /// K2 exact: Full-width Hamming + energy decomposition. + fn k2_exact(&self, candidate: &Self) -> EnergyConflict; + + /// Full cascade: K0 → K1 → K2 with early rejection. + fn cascade_distance(&self, candidate: &Self, gate: &SliceGate) -> Option; +} + +impl HdcOps for ArrayBase +where + S: Data, +{ + fn k0_probe(&self, candidate: &Self, gate: &SliceGate) -> bool { + k0_probe(self[0], candidate[0], gate) + } + + fn k1_stats(&self, candidate: &Self, gate: &SliceGate) -> bool { + let q = self.as_slice().expect("query must be contiguous"); + let c = candidate.as_slice().expect("candidate must be contiguous"); + k1_stats(q, c, gate) + } + + fn k2_exact(&self, candidate: &Self) -> EnergyConflict { + let q = self.as_slice().expect("query must be contiguous"); + let c = candidate.as_slice().expect("candidate must be contiguous"); + k2_exact(q, c, q.len()) + } + + fn cascade_distance(&self, candidate: &Self, gate: &SliceGate) -> Option { + if !self.k0_probe(candidate, gate) { return None; } + if !self.k1_stats(candidate, gate) { return None; } + Some(self.k2_exact(candidate)) + } +} +``` + +**Example usage after**: + +```rust +use ndarray::hpc::ndarray_ext::HdcOps; + +let query: Array1 = fp_query.into(); +let candidate: Array1 = fp_candidate.into(); +let gate = SliceGate::for_sku16k(); + +// Clean ndarray-native API +if let Some(result) = query.cascade_distance(&candidate, &gate) { + println!("Hamming: {}, Agreement: {}", result.conflict, result.agreement); +} +``` + +--- + +#### 2.2 Quantization Extension Trait + +**File**: extend `src/hpc/quantized.rs` + +**Reasoning**: Quantize/dequantize currently returns `Vec`. Should return `Array1`. + +```rust +/// Quantization operations on f32 arrays. +pub trait Quantize { + /// Quantize to BF16 (element-wise truncation). + fn to_bf16(&self) -> Array; + + /// Quantize to symmetric INT8 with scale factor. + fn to_i8_symmetric(&self) -> (Array, QuantParams); + + /// Quantize per-channel (rows) to INT8. + fn to_i8_per_channel(&self) -> (Array2, PerChannelQuantParams) + where + Self: AsArray<'_, f32, Ix2>; // Only for 2-D +} + +/// Dequantization operations. +pub trait Dequantize { + /// Dequantize INT8 codes back to f32. + fn dequantize_f32(&self, params: &QuantParams) -> Array; +} + +impl Quantize for ArrayBase +where + S: Data, +{ + fn to_bf16(&self) -> Array { + let slice = self.as_slice().expect("must be contiguous"); + let mut out = vec![BF16::ZERO; slice.len()]; + f32_to_bf16_slice(slice, &mut out); // existing raw function + Array1::from_vec(out) + } + + fn to_i8_symmetric(&self) -> (Array, QuantParams) { + let slice = self.as_slice().expect("must be contiguous"); + let (vec, params) = quantize_f32_to_i8(slice); // existing raw function + (Array1::from_vec(vec), params) + } +} +``` + +--- + +#### 2.3 Prefilter — Accept ArrayView2 Instead of (slice, rows, cols) + +**File**: `src/hpc/prefilter.rs` + +**Current state** (lines 48-295): +```rust +pub fn approx_mean_std_f32(data: &[f32]) -> (f32, f32) +pub fn approx_row_norms_f32(data: &[f32], rows: usize, cols: usize) -> Vec +pub fn top_k_rows_by_norm(data: &[f32], rows: usize, cols: usize, k: usize) -> Vec +pub fn approx_column_std(data: &[f32], rows: usize, cols: usize) -> Vec +``` + +**Transform** — add ndarray overloads that delegate to existing implementations: + +```rust +/// Array-native overloads for prefilter operations. +pub trait Prefilter { + /// Approximate mean and standard deviation (SIMD). + fn approx_mean_std(&self) -> (f32, f32); + + /// L2 norm per row (SIMD). + fn row_norms(&self) -> Array1; + + /// Indices of top-K rows by L2 norm. + fn top_k_by_norm(&self, k: usize) -> Array1; + + /// Standard deviation per column. + fn column_std(&self) -> Array1; +} + +impl Prefilter for ArrayBase +where + S: Data, +{ + fn approx_mean_std(&self) -> (f32, f32) { + let slice = self.as_slice().expect("must be contiguous"); + approx_mean_std_f32(slice) + } + + fn row_norms(&self) -> Array1 { + let slice = self.as_slice().expect("must be contiguous"); + let (rows, cols) = self.dim(); + Array1::from_vec(approx_row_norms_f32(slice, rows, cols)) + } + + fn top_k_by_norm(&self, k: usize) -> Array1 { + let slice = self.as_slice().expect("must be contiguous"); + let (rows, cols) = self.dim(); + Array1::from_vec(top_k_rows_by_norm(slice, rows, cols, k)) + } + + fn column_std(&self) -> Array1 { + let slice = self.as_slice().expect("must be contiguous"); + let (rows, cols) = self.dim(); + Array1::from_vec(approx_column_std(slice, rows, cols)) + } +} +``` + +--- + +#### 2.4 Activations — Extend to N-D with Axis + +**File**: `src/hpc/activations.rs` + +**Current state**: Trait only works on `ArrayBase`. No per-row softmax. + +**Transform**: + +```rust +/// Multi-dimensional activations (applied along an axis). +pub trait ActivationsNd { + /// Softmax along the given axis. + /// For a [batch, features] matrix with axis=1, computes softmax per row. + fn softmax_axis(&self, axis: Axis) -> Array; + + /// Log-softmax along the given axis. + fn log_softmax_axis(&self, axis: Axis) -> Array; +} + +impl ActivationsNd for ArrayBase +where + A: Float + 'static, + S: Data, +{ + fn softmax_axis(&self, axis: Axis) -> Array { + let mut result = self.to_owned(); + for mut row in result.axis_iter_mut(axis) { + let max_val = row.iter().fold(A::neg_infinity(), |a, &b| a.max(b)); + row.mapv_inplace(|v| (v - max_val).exp()); + let sum: A = row.iter().fold(A::zero(), |acc, &v| acc + v); + row.mapv_inplace(|v| v / sum); + } + result + } + + fn log_softmax_axis(&self, axis: Axis) -> Array { + let mut result = self.to_owned(); + for mut row in result.axis_iter_mut(axis) { + let max_val = row.iter().fold(A::neg_infinity(), |a, &b| a.max(b)); + row.mapv_inplace(|v| v - max_val); + let log_sum_exp = row.mapv(|v| v.exp()) + .iter().fold(A::zero(), |acc, &v| acc + v).ln(); + row.mapv_inplace(|v| v - log_sum_exp); + } + result + } +} +``` + +**Impact**: Enables `batch_logits.softmax_axis(Axis(1))` — critical for inference. + +--- + +#### 2.5 VML — SIMD Transcendentals as mapv Acceleration + +**File**: `src/hpc/vml.rs` (2543 lines) + +**Current state**: 12+ standalone functions like `vsexp(x: &[f32], out: &mut [f32])`. + +**Transform** — add `ArrayBase` methods that auto-dispatch to SIMD: + +```rust +/// SIMD-accelerated element-wise math on contiguous f32 arrays. +/// +/// These call the VML SIMD kernels when the array is contiguous, +/// falling back to scalar mapv() otherwise. +pub trait SimdMath { + fn simd_exp(&self) -> Array; + fn simd_ln(&self) -> Array; + fn simd_sqrt(&self) -> Array; + fn simd_sin(&self) -> Array; + fn simd_cos(&self) -> Array; + fn simd_tanh(&self) -> Array; + fn simd_abs(&self) -> Array; +} + +impl SimdMath for ArrayBase +where + S: Data, + D: Dimension, +{ + fn simd_exp(&self) -> Array { + if let Some(slice) = self.as_slice() { + let mut out = vec![0.0f32; slice.len()]; + vsexp(slice, &mut out); // existing SIMD kernel + Array::from_shape_vec(self.raw_dim(), out).unwrap() + } else { + self.mapv(|x| x.exp()) + } + } + + fn simd_ln(&self) -> Array { + if let Some(slice) = self.as_slice() { + let mut out = vec![0.0f32; slice.len()]; + vsln(slice, &mut out); + Array::from_shape_vec(self.raw_dim(), out).unwrap() + } else { + self.mapv(|x| x.ln()) + } + } + // ... same pattern for sqrt, sin, cos, tanh, abs +} +``` + +**Benchmark expectation**: 10–50x for transcendentals (polynomial SIMD vs scalar libm). + +--- + +### Tier 3: Backend Integration — Core Dispatch to SIMD (Medium effort, Very High impact) + +Make ndarray's **built-in** operations (`.sum()`, `.mean()`, `.dot()`) automatically +use hpc SIMD kernels for contiguous arrays. + +--- + +#### 3.1 Wire Reductions into Core + +**Files**: `src/numeric/impl_numeric.rs` (core sum/mean) + `src/hpc/reductions.rs` + +**Current state**: `Array.sum()` uses `numeric_util::unrolled_fold()` — a scalar +4x unrolled loop. `hpc::reductions::sum_f32()` uses F32x16 (16-lane AVX-512) +with 4-accumulator ILP. They never talk to each other. + +**Transform** — add backend dispatch in core's sum(): + +```rust +// In src/numeric/impl_numeric.rs or src/backend/native.rs: + +impl ArrayBase +where + A: Clone + num_traits::Zero + std::ops::Add, + S: Data, + D: Dimension, +{ + pub fn sum(&self) -> A { + // Fast path: contiguous f32 → SIMD reduction + if std::any::TypeId::of::() == std::any::TypeId::of::() { + if let Some(slice) = self.as_slice_memory_order() { + // SAFETY: TypeId check guarantees A == f32 + let f32_slice = unsafe { + std::slice::from_raw_parts(slice.as_ptr() as *const f32, slice.len()) + }; + let result = crate::hpc::reductions::sum_f32(f32_slice); + // SAFETY: result is f32, A is f32 + return unsafe { std::mem::transmute_copy(&result) }; + } + } + if std::any::TypeId::of::() == std::any::TypeId::of::() { + if let Some(slice) = self.as_slice_memory_order() { + let f64_slice = unsafe { + std::slice::from_raw_parts(slice.as_ptr() as *const f64, slice.len()) + }; + let result = crate::hpc::reductions::sum_f64(f64_slice); + return unsafe { std::mem::transmute_copy(&result) }; + } + } + // Fallback: generic fold + self.fold(A::zero(), |acc, &x| acc + x) + } +} +``` + +**Impact**: Every `.sum()` on a contiguous f32/f64 array silently becomes 16x +faster. Zero API change for users. + +**Same pattern for**: `.mean()`, internal `max()`/`min()` in softmax paths. + +--- + +#### 3.2 Unify SIMD Detection (Delete Duplicates) + +**Files to merge**: `src/hpc/simd_dispatch.rs` (362 lines), `src/hpc/simd_caps.rs` (515 lines) +**Into**: `src/simd.rs` (the existing core SIMD module) + +**Current state**: Three overlapping detection systems: +1. `src/simd.rs` → `LazyLock` — detects AVX-512/AVX2/NEON +2. `src/hpc/simd_caps.rs` → `CpuCaps` struct — detects same capabilities +3. `src/hpc/simd_dispatch.rs` → `SimdDispatch` — another LazyLock with function pointers + +**Transform**: + +```rust +// 1. In src/simd.rs, export the unified detection: +pub static CPU_CAPS: LazyLock = LazyLock::new(|| detect_cpu_caps()); + +pub struct CpuCaps { + pub tier: Tier, // existing + pub avx512f: bool, + pub avx512bw: bool, + pub avx512vpopcntdq: bool, + pub avx512vnni: bool, + pub avx2: bool, + pub fma: bool, + pub popcnt: bool, + // ARM + pub neon: bool, + pub sve: bool, +} + +// 2. In src/hpc/simd_dispatch.rs, replace with: +pub use crate::simd::CPU_CAPS; +// Keep function-pointer dispatch but source caps from unified singleton + +// 3. Delete src/hpc/simd_caps.rs (or make it a thin re-export) +``` + +**Result**: One CPUID check, one atomic, one struct — shared by core and all hpc modules. + +--- + +#### 3.3 VNNI GEMM via BlasFloat Dispatch + +**Files**: `src/hpc/vnni_gemm.rs`, `src/hpc/quantized.rs:618`, `src/backend/mod.rs` + +**Current state**: `int8_gemm_vnni()` takes raw `&[u8], &[i8], &mut [i32]` with manual +m/n/k parameters. Not accessible via `Array.dot()` or `blas_gemm()`. + +**Transform** — wire INT8 GEMM into the backend trait system: + +```rust +// New trait in src/backend/mod.rs: +pub trait BlasInt8 { + /// INT8 GEMM: C_i32 = A_u8 × B_i8 + fn backend_gemm_i8( + m: usize, n: usize, k: usize, + a: &[u8], lda: usize, + b: &[i8], ldb: usize, + c: &mut [i32], ldc: usize, + ); +} + +// Extension trait on Array: +pub trait Int8Matmul { + /// Quantized matrix multiply: self (u8) × rhs (i8) → Array2 + fn matmul_i8(&self, rhs: &ArrayView2) -> Array2; +} + +impl Int8Matmul for ArrayBase +where + S: Data, +{ + fn matmul_i8(&self, rhs: &ArrayView2) -> Array2 { + let (m, k) = self.dim(); + let (k2, n) = rhs.dim(); + assert_eq!(k, k2, "inner dimensions must match"); + + let a_slice = self.as_slice().expect("A must be contiguous"); + let b_slice = rhs.as_slice().expect("B must be contiguous"); + let mut c = vec![0i32; m * n]; + + crate::hpc::vnni_gemm::int8_gemm_vnni(a_slice, b_slice, &mut c, m, n, k); + Array2::from_shape_vec((m, n), c).unwrap() + } +} +``` + +**Example**: + +```rust +let weights: Array2 = load_quantized_weights(); +let activations: Array2 = quantize_input(&input); +let output: Array2 = activations.matmul_i8(&weights.view()); +// 4× throughput vs f32 GEMM +``` + +--- + +### Tier 4: View Factories — Arrow & External Data (Low effort, Medium impact) + +Make hpc's data ingestion return ndarray views instead of raw slices. + +--- + +#### 4.1 Arrow Bridge → ArrayView + +**File**: `src/hpc/arrow_bridge.rs` + +**Current state**: `binary_entry()` returns `&[u8]`. `SoakingBuffer` provides +`as_columnar_slice()` as `&[i8]`. + +**Transform**: + +```rust +impl ArrowBridge { + /// Get one binary entry as a 1-D array view. + pub fn binary_entry_view(&self, idx: usize) -> ArrayView1<'_, u8> { + let bytes = self.binary_entry(idx); + ArrayView1::from_shape(bytes.len(), bytes).unwrap() + } + + /// Get the full binary column as a 2-D view [n_rows, entry_bytes]. + pub fn binary_column_view(&self) -> ArrayView2<'_, u8> { + let data = self.as_binary_slice(); + let entry_len = self.entry_len(); + let n_rows = data.len() / entry_len; + ArrayView2::from_shape((n_rows, entry_len), data).unwrap() + } +} + +impl SoakingBuffer { + /// View the columnar buffer as a 2-D matrix [n_entries, n_dims]. + pub fn as_array_view_2d(&self) -> ArrayView2<'_, i8> { + let data = self.as_columnar_slice(); + ArrayView2::from_shape((self.n_entries, self.n_dims), data).unwrap() + } +} +``` + +**Impact**: Downstream code can use ndarray broadcasting, slicing, and Zip +on Arrow data without copying. + +--- + +#### 4.2 Distance Functions → Return Array1 + +**File**: `src/hpc/distance.rs` + +**Current state**: Returns `Vec`. + +**Transform**: + +```rust +// Keep existing: +pub fn squared_distances_f32(query: [f32; 3], points: &[[f32; 3]]) -> Vec { ... } + +// Add ndarray overload: +pub fn squared_distances_array(query: &ArrayView1, points: &ArrayView2) -> Array1 { + assert_eq!(query.len(), points.ncols()); + let n = points.nrows(); + let mut result = Array1::zeros(n); + for (i, row) in points.axis_iter(Axis(0)).enumerate() { + let diff = &row - query; + result[i] = diff.mapv(|x| x * x).sum(); + } + result +} +``` + +--- + +### Tier 5: Zip & Parallel — Modernize Inner Loops (Medium effort, Medium impact) + +Replace manual `iter_mut().zip()` with ndarray's `Zip` for vectorization hints +and future `par_for_each()` support. + +--- + +#### 5.1 VML Tail Loops → Zip + +**File**: `src/hpc/vml.rs` — 12+ sites + +**Current pattern** (repeated throughout): +```rust +for (o, &v) in out[tail_start..].iter_mut().zip(x[tail_start..].iter()) { + *o = v.exp(); +} +``` + +**Transform**: +```rust +use crate::Zip; + +let out_view = ArrayViewMut1::from_shape(n - tail_start, &mut out[tail_start..]).unwrap(); +let x_view = ArrayView1::from_shape(n - tail_start, &x[tail_start..]).unwrap(); +Zip::from(&mut out_view).and(&x_view).for_each(|o, &v| { + *o = v.exp(); +}); +``` + +**Benefit**: Zip provides SIMD-friendlier iteration order and enables +future `par_for_each()` without restructuring. + +--- + +#### 5.2 Hamming Batch → Parallel Zip + +**File**: `src/hpc/bitwise.rs` lines 320-333 + +**Current state**: Sequential loop over candidates. + +**Transform** (feature-gated on `rayon`): + +```rust +#[cfg(feature = "rayon")] +pub fn hamming_batch_parallel( + query: &ArrayView1, + database: &ArrayView2, // [n_candidates, vec_len] +) -> Array1 { + use rayon::prelude::*; + let q = query.as_slice().unwrap(); + let results: Vec = database + .axis_iter(Axis(0)) + .into_par_iter() + .map(|row| { + let r = row.as_slice().unwrap(); + dispatch_hamming(q, r) + }) + .collect(); + Array1::from_vec(results) +} +``` + +--- + +### Tier 6: Multi-Dimensional Reductions (High effort, High impact for ML) + +Lift 1-D slice operations to full N-D axis reductions. + +--- + +#### 6.1 SIMD Axis Reductions + +**File**: `src/hpc/reductions.rs` + +**Current state**: Only 1-D: `sum_f32(&[f32])`, `mean_f32(&[f32])`, etc. + +**Transform**: + +```rust +/// Sum along an axis, using SIMD for contiguous lanes. +pub fn sum_axis_f32(arr: &ArrayBase, axis: Axis) -> Array +where + S: Data, + D: Dimension + RemoveAxis, +{ + let mut result = Array::zeros(arr.raw_dim().remove_axis(axis)); + for (i, lane) in arr.lanes(axis).into_iter().enumerate() { + if let Some(slice) = lane.as_slice() { + result.as_slice_mut().unwrap()[i] = sum_f32(slice); + } else { + result.as_slice_mut().unwrap()[i] = lane.iter().fold(0.0, |a, &b| a + b); + } + } + result +} +``` + +--- + +#### 6.2 Statistics Trait — Generalize to N-D + +**File**: `src/hpc/statistics.rs` + +**Transform**: Make `variance`, `std_dev`, `percentile` work along axes: + +```rust +pub trait StatisticsNd { + type Reduced; + + fn variance_axis(&self, axis: Axis, ddof: usize) -> Self::Reduced; + fn std_axis(&self, axis: Axis, ddof: usize) -> Self::Reduced; + fn percentile_axis(&self, axis: Axis, q: f64) -> Self::Reduced; +} +``` + +--- + +## Execution Order & Dependencies + +``` +Phase A (2 weeks) — Type Bridges: + 1.1 Fingerprint ↔ ArrayView1 (standalone, no deps) + 1.2 VsaVector ↔ ArrayView1 (standalone) + 1.3 BF16/F16 → BlasFloat (requires backend/mod.rs edit) + 1.4 CogRecord channel views (standalone) + +Phase B (2 weeks) — Extension Traits: + 2.1 HdcOps trait (K0/K1/K2) (depends on 1.1) + 2.2 Quantize trait (depends on 1.3) + 2.3 Prefilter trait (standalone) + 2.4 ActivationsNd (axis softmax) (standalone) + 2.5 SimdMath trait (VML) (standalone) + +Phase C (2 weeks) — Backend Wiring: + 3.1 Core sum/mean → SIMD dispatch (depends on 3.2) + 3.2 Unified SIMD detection (standalone, delete duplicates) + 3.3 INT8 Matmul via trait (depends on 1.3 pattern) + +Phase D (1 week) — View Factories: + 4.1 Arrow → ArrayView (standalone) + 4.2 Distance → Array1 returns (standalone) + +Phase E (1 week) — Modernize Loops: + 5.1 VML tails → Zip (standalone) + 5.2 Hamming batch → parallel (standalone) + +Phase F (2 weeks) — N-D Reductions: + 6.1 SIMD axis reductions (depends on 3.1) + 6.2 Statistics N-D (depends on 6.1) +``` + +--- + +## Rules of Engagement + +1. **No deletion** — raw-slice functions stay. They serve FFI, embedded, and hot-path callers + who've already pre-validated contiguity. + +2. **Extension traits, not modifications** — new traits in new files or at the bottom of + existing files. Existing public APIs unchanged. + +3. **Contiguous fast-path, generic fallback** — every new trait method checks + `as_slice()` first. If contiguous → SIMD dispatch. If not → `mapv()`/`iter()`. + +4. **Feature-gate expensive additions** — rayon parallel (`rayon` feature), + BlasFloat for BF16 (`bf16-blas` feature if controversial). + +5. **Tests mirror existing** — each new trait method gets the same test coverage + as the raw-slice version it wraps. + +6. **Naming convention** — ndarray-native methods use `snake_case` matching ndarray + style (`.softmax_axis()`, `.simd_exp()`). Existing raw functions keep their + current names (`softmax_f32()`, `vsexp()`). + +--- + +## Impact Matrix + +| Item | Lines Changed | New Lines | Performance Impact | Ergonomics Impact | +|------|--------------|-----------|-------------------|-------------------| +| 1.1 Fingerprint bridge | 0 | ~40 | Zero (zero-copy) | High — unblocks Zip/broadcast | +| 1.3 BF16 BlasFloat | ~10 | ~80 | Enables BF16 GEMM via .dot() | Very High — removes type barrier | +| 2.1 HdcOps trait | 0 | ~60 | Zero (delegates) | High — clean pipeline API | +| 2.5 SimdMath (VML) | 0 | ~100 | 10-50x for exp/ln/sin | Very High — transparent SIMD | +| 3.1 Core sum → SIMD | ~20 | ~40 | 16x for f32 sum/mean | Zero (invisible) | +| 3.2 Unified detection | -877 | ~50 | Zero (dedup) | N/A (internal) | +| 3.3 INT8 matmul trait | 0 | ~60 | 4x over f32 GEMM | High — quantized inference | +| 5.2 Parallel hamming | 0 | ~30 | N×cores speedup | Medium | +| 6.1 SIMD axis reductions | 0 | ~80 | 16x per lane | High — ML training | + +--- + +## What This Achieves + +After this refactoring, hpc/ is no longer a "second library living inside ndarray." +It becomes the **performance substrate** that ndarray's own operations dispatch to: + +```rust +// User code — completely ndarray-native, no hpc imports needed: +let a: Array2 = load_embeddings(); +let b: Array2 = load_queries(); + +let sims = a.dot(&b.t()); // → routes to SIMD GEMM internally +let topk = sims.sum_axis(Axis(1)); // → routes to SIMD sum internally +let probs = sims.softmax_axis(Axis(1)); // → routes to SIMD exp internally + +// User code — HDC domain, imports hpc trait: +use ndarray::hpc::ndarray_ext::HdcOps; +let fp: Array1 = Fingerprint::<256>::random(42).into(); +let candidates: Array2 = load_database(); + +for row in candidates.axis_iter(Axis(0)) { + if let Some(ec) = fp.cascade_distance(&row, &gate) { + hits.push(ec); + } +} +``` + +The raw-slice layer (`hpc::kernels::k2_exact(&[u64], ...)`) remains for: +- FFI exports (Python, C) +- Embedded/no-std contexts +- Callers who've already validated layout and want zero overhead +- Benchmarks that isolate kernel performance from array metadata + +Both worlds coexist. The bridge is the extension traits. From 7fadc061dec4918686607a080d8694d916f2510f Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 16 May 2026 20:17:32 +0000 Subject: [PATCH 2/7] Add SoA kernel architecture: transposed cascade with columnar SIMD filtering The meta-level insight: don't iterate candidates and probe words, iterate words and filter candidates. ndarray's F-order strides make the same Array2 serve both column scanning (K0/K1, contiguous) and per-record access (K2, strided on survivors). Key ideas: - Column-major database: K0 becomes sequential 8-byte scan (8x cache util) - Bitmask survivor propagation (no branching, pure SIMD mask narrowing) - BF16 field-separated SoA (sign/exp/mantissa pre-decomposed at ingest) - QualiaColumns (16 parallel dimension arrays for per-dim scan) - TieredDatabase (K0 in L2, K1 in L3, K2 in DRAM) - Arrow's columnar format aligns natively with this layout Expected: 4-8x cascade throughput, 18x qualia dimension scans. https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo --- SOA_KERNEL_ARCHITECTURE.md | 478 +++++++++++++++++++++++++++++++++++++ 1 file changed, 478 insertions(+) create mode 100644 SOA_KERNEL_ARCHITECTURE.md diff --git a/SOA_KERNEL_ARCHITECTURE.md b/SOA_KERNEL_ARCHITECTURE.md new file mode 100644 index 00000000..384023b3 --- /dev/null +++ b/SOA_KERNEL_ARCHITECTURE.md @@ -0,0 +1,478 @@ +# SoA Kernel Architecture: The Transposed Cascade + +> **The insight**: Don't iterate candidates and probe words. +> Iterate words and filter candidates. +> +> The database layout IS the algorithm. + +--- + +## The Current Architecture (AoS — Row-Major) + +``` +Memory layout: [candidate₀ word₀..word₂₅₅] [candidate₁ word₀..word₂₅₅] ... + +Algorithm (kernel_pipeline, line 725): + for each candidate: ← sequential over N + K0: XOR word[0] ← random access into candidate's word 0 + K1: XOR word[0..8] ← sequential within one candidate + K2: XOR word[0..256] ← sequential within one candidate + +Cache behavior: + K0 touches 8 bytes per candidate but strides by 2048 bytes. + For 1M candidates: 8 bytes useful / 2048 bytes per cache line = 0.4% utilization. +``` + +Each K0 probe fetches one u64 from a different 2KB region. The CPU prefetcher +can't help because the access pattern is "8 bytes at stride 2048" — every fetch +pulls a full cache line but uses 1/256th of it. + +--- + +## The Transposed Architecture (SoA — Column-Major) + +``` +Memory layout: + Column 0: [cand₀_word₀, cand₁_word₀, cand₂_word₀, ... candₙ_word₀] + Column 1: [cand₀_word₁, cand₁_word₁, cand₂_word₁, ... candₙ_word₁] + ... + Column 255: [cand₀_word₂₅₅, cand₁_word₂₅₅, ... candₙ_word₂₅₅] + +Algorithm (transposed cascade): + Phase K0: scan Column 0 — XOR all N words with query[0], threshold filter + → produces survivor_mask (bitmask of which candidates survived) + + Phase K1: scan Columns 0..8 — XOR only survivors' words, accumulate + → narrows survivor_mask further + + Phase K2: scan Columns 0..255 — XOR only final survivors + → exact EnergyConflict for each + +Cache behavior: + K0 is a contiguous sequential scan of N × 8 bytes. + For 1M candidates: 8MB, perfectly prefetchable, 100% cache line utilization. + AVX-512: 8 candidates per instruction (512 bits / 64 bits = 8 u64s). +``` + +--- + +## Why This Is Genius at the Meta Level + +### 1. The Cascade Becomes SIMD-Native Filtering + +Current K0 (per-candidate, scalar): +```rust +for i in 0..n_candidates { + let word0 = database[i * 256]; // stride 2048 bytes — cache-hostile + if (word0 ^ query_word0).count_ones() > threshold { + continue; + } + // ... K1, K2 +} +``` + +Transposed K0 (columnar, SIMD): +```rust +let col0: &[u64] = &columns[0]; // contiguous N×8 bytes +let q0 = U64x8::splat(query_words[0]); // broadcast query word 0 +let thresh = U32x8::splat(gate.k0_reject); + +let mut survivors = vec![0u64; (n_candidates + 63) / 64]; // bitmask + +for chunk in 0..(n_candidates / 8) { + let c = U64x8::from_slice(&col0[chunk * 8..]); + let xor = c ^ q0; + let popcnt = xor.popcnt_per_lane(); // 8 popcounts in parallel + let mask = popcnt.le(thresh); // 8 comparisons → 8-bit mask + survivors[chunk / 8] |= mask.to_bitmask() << (chunk % 8 * 8); +} +``` + +**8 candidates per clock cycle instead of 1.** And the memory access is a +sequential stream — the hardware prefetcher handles it perfectly. + +--- + +### 2. The Survivor Mask IS the State Machine + +The cascade becomes mask narrowing — no branch prediction, no `continue`: + +```rust +// Start: all candidates alive +let mut mask: Vec = vec![u64::MAX; (n + 63) / 64]; + +// K0: narrow mask via column 0 +k0_filter_columnar(&columns[0], query[0], gate.k0_reject, &mut mask); + +// K1: narrow mask via columns 0..8 (only touch survivors) +k1_filter_columnar(&columns[0..8], &query[0..8], gate.k1_reject, &mut mask); + +// K2: exact only on final survivors +let results = k2_exact_masked(&columns, query, &mask); +``` + +Each phase reads ONLY the columns it needs. K0 touches 1 column (N×8 bytes). +K1 touches 8 columns (N×64 bytes). K2 touches 256 columns but only for +survivors (~3-5% of N). Total memory touched: + +``` +AoS: N × 2048 bytes (reads all words for all candidates even if K0 rejects) + Actually reads: N × 8 (K0) + 0.05N × 64 (K1) + 0.003N × 2048 (K2) + But cache lines pulled: N × 64 (one cache line per K0 probe) = N×64 bytes + +SoA: N × 8 (K0, contiguous) + N × 64 (K1, contiguous) + 0.003N × 2048 (K2) + Effective: N × 72 + 6N = ~78N bytes vs ~64N bytes (AoS) + BUT: 100% utilization vs 12.5% utilization per cache line + Net bandwidth saving: ~8× due to no wasted cache line fetch +``` + +--- + +### 3. ndarray Makes This Free (Strides, Not Copies) + +Here's the meta-level genius: **you don't transpose the data. You transpose the strides.** + +```rust +// Build database in Fortran (column-major) order: +let database: Array2 = Array2::from_shape_vec( + (n_candidates, 256).f(), // .f() = Fortran order + flat_word_data, +).unwrap(); + +// Column access is now CONTIGUOUS (no copy, no transpose): +let col0: ArrayView1 = database.column(0); +assert!(col0.is_standard_layout()); // TRUE — physically contiguous + +// Row access still works (strided, for K2 on survivors): +let survivor_fp: ArrayView1 = database.row(42); +// Not contiguous, but only used for the 3-5% of survivors — doesn't matter +``` + +The `(n_candidates, 256).f()` tells ndarray to lay out memory column-by-column. +Accessing `.column(i)` returns a contiguous view. Accessing `.row(i)` returns a +strided view. **Same data, both access patterns, zero transposition cost.** + +This is ndarray's stride system doing something no raw-slice library can: +**the same allocation serves both columnar scanning AND per-candidate access.** + +--- + +### 4. Arrow Is Already Column-Major + +Arrow stores columns separately. The arrow_bridge already provides columnar views. +If each "word position" is stored as its own Arrow column (or a single Arrow +FixedSizeList with the right chunking), the SoA layout is the native format: + +```rust +// Arrow RecordBatch with 256 columns of u64: +// col_0: [cand₀_w0, cand₁_w0, ...] ← this IS the K0 scan buffer +// col_1: [cand₀_w1, cand₁_w1, ...] +// ... + +fn kernel_pipeline_arrow(batch: &RecordBatch, query: &[u64; 256], gate: &SliceGate) { + // K0: read column 0 directly from Arrow buffer (zero-copy) + let col0 = batch.column(0).as_primitive::().values(); + let mask = k0_filter_columnar(col0, query[0], gate.k0_reject); + + // K1: read columns 0..8 + for i in 0..8 { + let col = batch.column(i).as_primitive::().values(); + k1_accumulate_columnar(col, query[i], &mut mask, &mut accum); + } + // ... +} +``` + +**No conversion. No copy. Arrow's native layout IS the SoA kernel's input format.** + +--- + +### 5. Multi-Resolution SoA: The Three Tiers as Physical Arrays + +```rust +/// Tiered columnar database. Each tier is a separate Array2 because they have +/// different access patterns and lifetimes. +pub struct TieredDatabase { + /// Tier K0: shape [N, 1] — just word 0 per candidate. + /// Fits in L2 for 1M candidates (8MB). + pub k0: Array2, // Column-major, 1 column + + /// Tier K1: shape [N, 8] — first 8 words per candidate. + /// Fits in L3 for 1M candidates (64MB). + pub k1: Array2, // Column-major, 8 columns + + /// Tier K2: shape [N, 256] — full containers. + /// Lives in main memory. Only accessed for 3-5% survivors. + pub k2: Array2, // Column-major, 256 columns +} +``` + +**Why separate arrays?** Because K0 is HOT (scanned every query). K1 is WARM +(scanned for ~5% survivors). K2 is COLD (scanned for ~0.3% survivors). Separating +them lets you: +- Pin K0 in L2 cache (8MB for 1M candidates) +- Let K1 live in L3 +- Let K2 stay in DRAM — only a few thousand rows are ever touched + +This is **cache-aware SoA** — not just columnar layout, but columnar layout with +physical separation matching the thermal hierarchy of the CPU cache. + +--- + +### 6. BF16 Field-Separated SoA: Awareness Becomes Memory Layout + +Current BF16 storage: N vectors × 1024 BF16 values = N × 2048 bytes (interleaved). + +Field-separated SoA: +```rust +/// BF16 database stored as separated bitfields. +/// The decomposition that bf16_hamming.rs does AT RUNTIME is instead +/// baked INTO THE STORAGE FORMAT. +pub struct BF16FieldDatabase { + /// Sign bits: 1 bit per dimension × 1024 dims = 128 bytes per vector. + /// Shape [N, 128] as u8 — each byte = 8 sign bits. + pub signs: Array2, + + /// Exponent fields: 8 bits per dimension × 1024 dims = 1024 bytes per vector. + /// Shape [N, 1024] as u8. + pub exponents: Array2, + + /// Mantissa fields: 7 bits per dimension × 1024 dims ≈ 896 bytes per vector. + /// Shape [N, 896] as u8 (packed 7-bit). + pub mantissas: Array2, +} +``` + +**The awareness algebra becomes trivially parallel**: + +```rust +impl BF16FieldDatabase { + /// Sign-only distance (strongest signal): just Hamming on sign bytes. + /// 128 bytes per comparison — fits in one AVX-512 pass for 4 vectors. + fn sign_distance(&self, query_signs: &[u8], candidate_idx: usize) -> u32 { + let row = self.signs.row(candidate_idx); + hamming_dispatch(query_signs, row.as_slice().unwrap()) + } + + /// Exponent distance (medium signal): Hamming on exp bytes. + fn exp_distance(&self, query_exp: &[u8], candidate_idx: usize) -> u32 { + let row = self.exponents.row(candidate_idx); + hamming_dispatch(query_exp, row.as_slice().unwrap()) + } + + /// Awareness classification without per-element decomposition. + /// Signs are already separated — just compare columns directly. + fn batch_awareness(&self, query_signs: &Array1) -> Array1 { + // For ALL candidates simultaneously: + // sign_agreement = popcount(NOT (query_signs XOR candidate_signs)) + // Already column-accessible, already SIMD-friendly + let n = self.signs.nrows(); + let mut states = Array1::zeros(n); + for i in 0..n { + let row = self.signs.row(i); + let agreement = hamming_agreement(query_signs.as_slice().unwrap(), + row.as_slice().unwrap()); + states[i] = classify_awareness(agreement, ...); + } + states + } +} +``` + +**The insight**: bf16_hamming.rs currently decomposes each BF16 value into +sign/exp/mantissa at QUERY TIME (runtime unpacking). With field-separated SoA, +the decomposition happens at INGEST TIME (once). Every subsequent query operates +on pre-decomposed fields. The hot path never sees a BF16 value — only its fields. + +This trades ingest bandwidth for query speed: +- Ingest: O(N) decomposition (amortized over all future queries) +- Query: zero decomposition cost (fields already separated) +- For a database queried 10,000× per second, this is a massive win. + +--- + +### 7. PackedQualia SoA: 16 Parallel Dimension Columns + +Current: `Vec` where each PackedQualia = 16 × i8 + 1 BF16 (18 bytes). + +SoA: +```rust +/// Qualia database stored as separated dimension columns. +/// Each column is ONE phenomenological dimension across ALL agents. +pub struct QualiaColumns { + /// 16 dimension columns, each Array1 with N_agents elements. + /// dims[0] = warmth, dims[1] = texture, dims[4] = social, etc. + pub dims: [Array1; 16], + + /// Magnitude column: Array1 (BF16 scalars). + pub magnitude: Array1, +} +``` + +**What this unlocks**: + +```rust +impl QualiaColumns { + /// "Which agents are tensioned on warmth (dim 4)?" + /// → Single SIMD scan of one column. + fn tensioned_on(&self, dim: usize, threshold: i8) -> Array1 { + self.dims[dim].mapv(|v| v.abs() > threshold && v < 0) + } + + /// "Resonance between agent A and all others on dimension 5" + /// → One column × scalar = Array1 + fn resonance_on_dim(&self, agent_idx: usize, dim: usize) -> Array1 { + let agent_val = self.dims[dim][agent_idx] as i16; + self.dims[dim].mapv(|v| v as i16 * agent_val) + } + + /// "Total resonance between agent A and all others (dot product)" + /// → 16 column scans, accumulate + fn total_resonance(&self, agent_idx: usize) -> Array1 { + let mut acc = Array1::::zeros(self.dims[0].len()); + for d in 0..16 { + let agent_val = self.dims[d][agent_idx] as i32; + acc += &self.dims[d].mapv(|v| v as i32 * agent_val); + } + acc + } +} +``` + +**AoS version of the same query** (current): +```rust +// "Find agents tensioned on warmth" — must scan ALL 18 bytes per agent, +// extract byte [4], compare. Cache line waste: 4/18 ≈ 22%. +for agent in &qualia_db { + if agent.resonance[4].abs() > threshold && agent.resonance[4] < 0 { + results.push(agent); + } +} +``` + +**SoA version**: scan 1 byte per agent, sequential, 100% cache utilization. +For 1M agents: 1MB scan vs ~18MB with random byte extraction. + +--- + +### 8. The Meta-Meta Level: ndarray IS the SoA Abstraction + +The deepest insight is that **ndarray's type system already encodes this**. +`Array2` with Fortran order IS a SoA database. The columns are the "fields." +The rows are the "records." And ndarray's stride machinery gives you BOTH access +patterns from the same allocation: + +```rust +// THE SAME Array2, TWO access patterns: + +// SoA access (column scan for K0 — contiguous): +let k0_column: ArrayView1 = database.column(0); +// → strides = [1], physically contiguous + +// AoS access (full record for K2 — strided): +let full_record: ArrayView1 = database.row(survivor_idx); +// → strides = [n_candidates], non-contiguous but only 3% of data +``` + +You don't need two copies. You don't need explicit transposition. ndarray's +`Order::F` gives you SoA as the NATIVE layout while `Order::C` gives you AoS. +The same API works for both — only the performance characteristics change. + +**This is what makes ndarray + hpc/ potentially genius**: the array type IS +the database schema. Shape IS the data model. Strides ARE the access pattern. +And the cascade kernel can be expressed as: + +```rust +pub fn cascade_soa( + database: &Array2, // F-order: [N, 256], columns contiguous + query: &Array1, // [256] + gate: &SliceGate, +) -> Vec { + let n = database.nrows(); + + // K0: column 0 scan (contiguous — full SIMD bandwidth) + let col0 = database.column(0); + let mut mask = k0_columnar_simd(col0.as_slice().unwrap(), query[0], gate); + + // K1: columns 0..8 scan (contiguous per column) + for w in 0..8 { + let col = database.column(w); + k1_accumulate_simd(col.as_slice().unwrap(), query[w], &mut mask); + } + k1_threshold(&mut mask, gate); + + // K2: only survivors (row access — strided but few) + let mut results = Vec::new(); + for idx in mask.iter_ones() { + let row = database.row(idx); + // row is strided — copy to stack for SIMD (256 × 8 = 2KB, fits in L1) + let mut buf = [0u64; 256]; + row.assign_to(ArrayViewMut1::from(&mut buf)); + let ec = k2_exact(&query.as_slice().unwrap(), &buf, 256); + results.push(/* ... */); + } + results +} +``` + +--- + +## The Shopping List Addition + +Add to `REFACTOR_HPC_INTEGRATION.md` Phase C: + +``` +Phase C-SoA (2 weeks) — Columnar Kernel Architecture: + C.1 TieredDatabase struct (K0/K1/K2 as separate F-order Array2) + C.2 k0_columnar_simd: scan N u64s with U64x8 broadcast+XOR+popcount + C.3 Bitmask survivor propagation (no per-candidate branching) + C.4 k1_accumulate_columnar: 8-column scan with running mask + C.5 k2_exact_masked: row extraction for final survivors + C.6 Arrow native path: RecordBatch columns → direct columnar scan + C.7 BF16FieldDatabase: sign/exp/mantissa pre-separated at ingest + C.8 QualiaColumns: 16 × Array1 for per-dimension scan + C.9 Benchmark: AoS vs SoA cascade on 1M SKU-16K containers +``` + +--- + +## Expected Performance Impact + +| Operation | AoS (current) | SoA (proposed) | Speedup | Why | +|-----------|---------------|----------------|---------|-----| +| K0 scan (1M) | ~2ms (cache-miss bound) | ~0.25ms (streaming) | **8×** | Contiguous vs stride-2048 | +| K1 scan (50K survivors) | ~0.8ms | ~0.1ms | **8×** | Same reason, 8 columns | +| K2 exact (1500 survivors) | ~0.3ms | ~0.35ms | 0.85× | Row copy overhead | +| **Total pipeline** | **~3.1ms** | **~0.7ms** | **4.4×** | K0+K1 dominate | +| BF16 awareness (1M) | ~5ms (decompose each) | ~0.6ms (pre-separated) | **8×** | No runtime decomposition | +| Qualia dim scan (1M agents) | ~4.5ms (18 byte stride) | ~0.25ms (1 byte stride) | **18×** | Perfect cache line use | + +The K2 step is slightly SLOWER in SoA (strided row access + copy to stack). +But K2 only runs on 0.3% of candidates — it's irrelevant to total time. +The cascade is dominated by K0 (touches all candidates) and K1 (touches 5%). +Both are massively faster with columnar layout. + +--- + +## The Philosophical Payoff + +> **AoS says**: "Here is a complete entity. Ask it anything." +> **SoA says**: "Here is one question. Ask it of everyone." + +The cascade pipeline IS the SoA access pattern: +- K0 asks ONE question (word 0) of EVERYONE +- K1 asks EIGHT questions (words 0-7) of SURVIVORS +- K2 asks ALL questions (words 0-255) of VERY FEW + +This is a funnel. Each stage narrows the population and widens the per-entity cost. +SoA is optimal for funnels because the wide-population stages (K0, K1) get +contiguous memory access, and the narrow-population stage (K2) pays the strided +penalty on so few entities that it doesn't matter. + +ndarray encodes this directly in its type system: +- `database.column(0)` = the K0 question to everyone (contiguous) +- `database.row(survivor)` = all answers from one entity (strided, rare) + +**The database layout IS the algorithm. +The strides ARE the access pattern. +ndarray IS the SoA abstraction.** From c003c193f6423f01149defeb205698b161cea3b9 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 16 May 2026 20:19:28 +0000 Subject: [PATCH 3/7] Add unified refactor sequence: 8 waves integrating all analysis passes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Merges three inputs into one executable plan: - REFACTOR_HPC_INTEGRATION.md (type bridges, extension traits) - SOA_KERNEL_ARCHITECTURE.md (columnar cascade, field separation) - Transformer session feedback (API conventions, namespace, codegen) 8 waves: conventions → macros → bridges → traits → backend → SoA → namespace → release Critical path: 15 days serial, 10 days with parallelism. The SoA meta-insight applies at 6 levels: 1. Database layout (column-per-word) 2. API surface (_into forms = caller-controlled layout) 3. Module structure (hpc/cog/ext/io = columns of concern) 4. ndarray types (F-order strides encode the duality) 5. Cascade pipeline (widen reads as population shrinks) 6. Arrow/Lance (storage IS the compute layout, zero ETL) https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo --- UNIFIED_REFACTOR_SEQUENCE.md | 248 +++++++++++++++++++++++++++++++++++ 1 file changed, 248 insertions(+) create mode 100644 UNIFIED_REFACTOR_SEQUENCE.md diff --git a/UNIFIED_REFACTOR_SEQUENCE.md b/UNIFIED_REFACTOR_SEQUENCE.md new file mode 100644 index 00000000..afa4c9d5 --- /dev/null +++ b/UNIFIED_REFACTOR_SEQUENCE.md @@ -0,0 +1,248 @@ +# Unified Refactor Sequence: ndarray Fork + +> Integrates three analysis passes into one executable sequence: +> 1. **REFACTOR_HPC_INTEGRATION.md** — Type bridges & extension traits +> 2. **SOA_KERNEL_ARCHITECTURE.md** — Columnar cascade & field-separated storage +> 3. **Transformer session feedback** — API conventions, namespace, codegen macros +> +> Each wave is self-contained. Later waves depend on earlier ones. +> Within a wave, items are independent and can run in parallel. + +--- + +## Wave 0 — Conventions & Foundations (3 days) + +Unlocks everything else. No code changes to hot paths. Pure contract + tooling. + +| ID | Item | Source | Effort | Why First | +|----|------|--------|--------|-----------| +| W0.1 | **Dual-form signature convention** (`_into` + Vec wrapper + `_ptr`) | R1.1 | 1d | Fixes root cause of PP-15 WS-1/2/4 ❌. All future kernels follow this shape. | +| W0.2 | **Unified HpcError enum** (`src/hpc/error.rs`) | R1.2 | 4h | Replaces 4 error conventions with 1. | +| W0.3 | **`#![deny(warnings)]` → targeted denies** | R1.3 | 1h | Unblocks doctest authoring for all subsequent waves. | +| W0.4 | **Feature flag rename** (`hpc-extras` → `research`, `backend-*` prefix) | R4.1 | 4h | Clean slate for feature-gated additions. | +| W0.5 | **Prelude module** (`hpc::prelude::*`) | R1.4 | 1h | Single discovery surface for downstream consumers. | + +**Gate**: All downstream crates (burn, candle, tract, ort) still compile unchanged. +The dual-form is additive; existing call sites continue working. + +--- + +## Wave 1 — Codegen Macros (2 days) + +Eliminates copy-paste before adding new code. Every subsequent wave benefits. + +| ID | Item | Source | Effort | What It Produces | +|----|------|--------|--------|------------------| +| W1.1 | **Dtype-parity macro** (`reductions_for!`) | R3.1 | 1d | One line = 7 reductions for a dtype. Cuts 700→150 lines. | +| W1.2 | **Per-arch dispatch macro** (`simd_dispatch!`) | R3.2 | 4h | Eliminates dispatch skeleton copy-paste. | +| W1.3 | **Reduction kernel template** (`reduce_simd()`) | R3.3 | 4h | Generic chunk-loop; sum/max/nrm2 become 5-line callers. | +| W1.4 | **Dual-form fusion** (`kernel_simd_dual!`) | R3.4 | 1d | One body → `_into`, Vec, `_ptr`, all arch variants. | + +**Gate**: `cargo test -p ndarray` passes. Macro output matches existing hand-rolled functions. +Run benches to verify no regression. + +--- + +## Wave 2 — Type Bridges (2 days) + +From REFACTOR_HPC_INTEGRATION.md Tier 1. Makes domain types ndarray-native. + +| ID | Item | Source | Effort | Impact | +|----|------|--------|--------|--------| +| W2.1 | **Fingerprint ↔ ArrayView1** | Tier 1.1 | 4h | Zero-copy bridge, unblocks Zip/broadcast on fingerprints | +| W2.2 | **VsaVector ↔ ArrayView1** | Tier 1.2 | 2h | Same pattern | +| W2.3 | **BF16/F16 → BlasFloat** | Tier 1.3 | 4h | Enables `Array.dot()`, unlocks BLAS dispatch | +| W2.4 | **CogRecord channel_as_words()** | Tier 1.4 | 2h | ArrayView1 per channel | +| W2.5 | **Arrow bridge → ArrayView factories** | Tier 4.1 | 2h | `binary_column_view()` returns ArrayView2 | + +**Gate**: New `From`/`Into` impls compile. Existing code unchanged. Add tests for each bridge. + +--- + +## Wave 3 — Extension Traits (3 days) + +From REFACTOR_HPC_INTEGRATION.md Tier 2. hpc operations feel native on ndarray types. + +| ID | Item | Source | Effort | Impact | +|----|------|--------|--------|--------| +| W3.1 | **HdcOps trait** (K0/K1/K2 on Array1) | Tier 2.1 | 4h | `query.cascade_distance(&candidate, &gate)` | +| W3.2 | **Quantize trait** (Array → BF16/I8 with Array output) | Tier 2.2 | 4h | `array.to_bf16()`, `array.to_i8_symmetric()` | +| W3.3 | **Prefilter on ArrayView2** | Tier 2.3 | 4h | Eliminates `(slice, rows, cols)` pattern | +| W3.4 | **ActivationsNd** (softmax_axis, log_softmax_axis) | Tier 2.4 | 4h | `batch.softmax_axis(Axis(1))` for inference | +| W3.5 | **SimdMath trait** (VML on any Array) | Tier 2.5 | 4h | `array.simd_exp()` — 10-50x for transcendentals | +| W3.6 | **Int8Matmul trait** (Array2 × Array2 → Array2) | Tier 3.3 | 4h | Quantized inference via ndarray types | + +**Gate**: Each trait has tests matching the raw-slice function's test suite. +Benchmark: trait overhead vs. direct raw-slice call ≤ 1%. + +--- + +## Wave 4 — Backend Wiring (2 days) + +From REFACTOR_HPC_INTEGRATION.md Tier 3. Core ndarray operations silently accelerate. + +| ID | Item | Source | Effort | Impact | +|----|------|--------|--------|--------| +| W4.1 | **Unified SIMD detection** (merge simd_caps + simd_dispatch into core) | Tier 3.2 | 4h | Deletes 877 lines of duplication | +| W4.2 | **Core sum/mean → SIMD dispatch** | Tier 3.1 | 4h | 16x faster `.sum()` on contiguous f32/f64 | +| W4.3 | **SIMD axis reductions** (sum_axis with SIMD lanes) | Tier 6.1 | 1d | ML training hot path | + +**Gate**: `cargo bench` shows measurable improvement on contiguous arrays. +Non-contiguous arrays unchanged (fallback to generic fold). + +--- + +## Wave 5 — SoA Kernel Architecture (1 week) + +From SOA_KERNEL_ARCHITECTURE.md. The genius-level structural change. + +| ID | Item | Source | Effort | Impact | +|----|------|--------|--------|--------| +| W5.1 | **TieredDatabase struct** (K0/K1/K2 as F-order Array2) | SoA §5 | 1d | Physical separation matching cache hierarchy | +| W5.2 | **k0_columnar_simd** (scan N u64s with broadcast+XOR+popcount) | SoA §1 | 1d | 8 candidates per cycle, sequential memory | +| W5.3 | **Bitmask survivor propagation** (mask narrowing, no branching) | SoA §2 | 4h | Replace per-candidate `continue` with mask ops | +| W5.4 | **k1_accumulate_columnar** (8-column scan with mask) | SoA §1 | 4h | K1 on contiguous columns | +| W5.5 | **k2_exact_masked** (row extraction for survivors only) | SoA §1 | 4h | Stack-copy 2KB per survivor | +| W5.6 | **BF16FieldDatabase** (sign/exp/mantissa separated at ingest) | SoA §6 | 1d | Awareness without runtime decomposition | +| W5.7 | **QualiaColumns** (16 × Array1) | SoA §7 | 4h | 18x dimension scan throughput | +| W5.8 | **Arrow native path** (RecordBatch columns → columnar scan) | SoA §4 | 4h | Zero-copy from Lance/Arrow | +| W5.9 | **Benchmark: AoS vs SoA** on 1M SKU-16K containers | SoA §perf | 4h | Proof of 4-8x claim | + +**Gate**: SoA cascade produces identical results to AoS cascade on full test suite. +Benchmark confirms ≥3x throughput improvement on 1M containers. + +--- + +## Wave 6 — Namespace Restructure (3 days) + +From transformer feedback R2.1. Enforces the architecture rule from CLAUDE.md. + +| ID | Item | Source | Effort | Impact | +|----|------|--------|--------|--------| +| W6.1 | **Split hpc/ → hpc/ + cog/ + ext/ + io/** | R2.1 | 2-3d | 30 numeric modules in hpc/, 35 cognitive in cog/, 20 experimental in ext/, 6 I/O in io/ | +| W6.2 | **SIMD directory consolidation** (simd_*.rs → src/simd/) | R2.2 | 1d | One directory for all SIMD code | +| W6.3 | **Quantized module split** (quantized.rs → hpc/quant/) | R2.3 | 4h | BlockQ4_0 packed struct for candle compat | + +**Gate**: `pub use` deprecation shims for all moved modules. +All existing `use ndarray::hpc::*` paths still resolve (with deprecation warning). + +--- + +## Wave 7 — Test & Bench Infrastructure (2 days) + +| ID | Item | Source | Effort | Impact | +|----|------|--------|--------|--------| +| W7.1 | **Extract integration tests** to `tests/hpc/`, `tests/cog/` | R4.2 | 2d | Module files lose 30-50% of line count | +| W7.2 | **HPC bench harness** (dot, reductions, softmax, quantized) | R4.5 | 1d | Reproducible perf claims | +| W7.3 | **Cross-comparison bench** (ndarray vs candle vs tract kernel) | R4.5 ext | 4h | The whole reason for B1 | + +--- + +## Wave 8 — Version Bump & Downstream Pin (1 day) + +| ID | Item | Source | Effort | Impact | +|----|------|--------|--------|--------| +| W8.1 | **Tag v0.18.0** | R4.4 | 2h | Clean cut | +| W8.2 | **CHANGELOG with migration guide** | R4.4 | 2h | Downstream knows what changed | +| W8.3 | **Downstream Cargo.toml pins** (burn, candle, tract, ort) | R4.4 | 2h | tag = "v0.18.0" | +| W8.4 | **Remove deprecation shims** (schedule for v0.19) | R2.1 | — | Future cleanup | + +--- + +## The SoA Integration with Conventions + +The SoA architecture (Wave 5) benefits from every preceding wave: + +| Wave 5 needs | Provided by | +|--------------|-------------| +| F-order Array2 database | Wave 2 (type bridges — Fingerprint converts to Array) | +| SIMD column scanning | Wave 4 (unified SIMD detection, core dispatch) | +| Bitmask as Array1 | Wave 2 (ArrayView factory pattern) | +| `_into` form for columnar kernels | Wave 0 (signature convention) | +| Dispatch macro for k0_columnar_simd | Wave 1 (codegen macros) | +| Extension trait: `database.cascade_soa(&query, &gate)` | Wave 3 (HdcOps trait extended) | +| BF16FieldDatabase uses Quantize trait | Wave 3 (Quantize extension) | +| Arrow columns → direct scan | Wave 2 (Arrow view factories) | +| Benchmark harness to prove 4-8x | Wave 7 (bench infrastructure) | + +This is why the waves are ordered this way. The SoA kernel is the most impactful +change (~4-8x cascade throughput), but it needs the convention/macro/bridge +foundation to land cleanly. Without Wave 0-4, the SoA code would be yet another +raw-slice module that doesn't integrate with ndarray — repeating the original sin. + +--- + +## The Meta-Level SoA Observation + +The SoA insight goes beyond just "transpose the database." It's a design principle +that applies recursively across the entire hpc/ surface: + +### Level 1: Database layout (SOA_KERNEL_ARCHITECTURE.md) +- Fingerprint database: column-per-word → cascade becomes column scan +- BF16 fields: separated sign/exp/mantissa → awareness is native layout +- Qualia: column-per-dimension → per-dim queries are sequential scans + +### Level 2: API surface (this document) +- The **dual-form convention** (W0.1) IS SoA thinking applied to function signatures: + - AoS API: `fn foo(input) -> output` (allocates inside, hides layout) + - SoA API: `fn foo_into(input, output)` (caller controls memory layout) + - The `_into` form lets the caller decide whether output is contiguous, chunked, or interleaved + +### Level 3: Module structure (W6.1 namespace split) +- AoS: one module (`hpc/`) contains everything → grep to find anything +- SoA: separated by concern (`hpc/`, `cog/`, `ext/`, `io/`) → each namespace is a contiguous "column" of related functionality that you scan sequentially + +### Level 4: ndarray's type system +- `Array2` in F-order IS the SoA database. No wrapper type needed. +- `.column(i)` IS the field-access pattern. No custom iterator needed. +- `.row(i)` IS the record-access pattern. No conversion needed. +- Strides encode the duality. The same allocation serves both patterns. + +### Level 5: The cascade pipeline itself +- K0: ONE field (word 0) across ALL records → column scan +- K1: EIGHT fields (words 0-7) across SURVIVORS → multi-column scan +- K2: ALL fields across FEW records → row reconstruction + +This is a **widening read pattern**: start narrow (1 column), widen as population shrinks. +SoA makes the narrow-start cheap (contiguous) and tolerates the wide-end being strided +(because so few records reach it). + +### Level 6: Arrow + Lance alignment +- Arrow IS columnar storage (SoA by definition) +- Lance stores Arrow columns on disk +- The SoA kernel reads Arrow columns directly — no ETL, no transpose, no copy +- The database layout on disk IS the query execution layout in memory +- This is the zero-copy nirvana: mmap → column pointer → SIMD scan → result + +**The architecture collapses all intermediate transformations.** +Storage format = memory layout = compute pattern = cache access order. +One decision (F-order Array2) propagates from disk through cache to SIMD lanes. + +--- + +## Effort Summary + +| Wave | Days | Parallel? | Key Outcome | +|------|------|-----------|-------------| +| 0 | 3 | Yes | Conventions locked | +| 1 | 2 | Yes | Codegen macros ready | +| 2 | 2 | Yes (after 0) | Domain types ↔ Array | +| 3 | 3 | Yes (after 2) | Extension traits live | +| 4 | 2 | After 1+2 | Core ops accelerated | +| 5 | 5 | After 2+4 | SoA cascade (the big win) | +| 6 | 3 | After 0 | Namespace clean | +| 7 | 2 | After 5 | Proof via benchmarks | +| 8 | 1 | After all | Release | + +**Critical path**: 0 → 2 → 4 → 5 → 7 → 8 = **15 days** serial. +With parallel waves (1∥2, 3∥4, 6∥5, 7∥8): **~10 working days**. + +--- + +## What NOT to Do + +1. **Don't delete hpc/ modules** — raw-slice functions stay (FFI, embedded, zero-overhead) +2. **Don't premature-v1.0** — surface is too young; candle/tract wiring will shake it again +3. **Don't gate SoA behind a feature flag** — it's the default hot path, not optional +4. **Don't couple SoA with module restructure** — they're independent; merge separately +5. **Don't break downstream in one shot** — deprecation shims for one release minimum From d1882b6fad58900c7603902f6ef80deb6eba6da8 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 16 May 2026 20:32:32 +0000 Subject: [PATCH 4/7] Expand unified refactor sequence: integrate lance-graph W1a consumer contract MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fourth analysis pass adds Wave 1 (SIMD Consumer Primitives) — 5 P0 primitives blocking 158 raw-intrinsic violations in lance-graph: - I8x16::from_i4_packed_u64 (closure-batch pattern) - I8x16::saturating_abs (VPABSB correction for i8::MIN) - U16x8::gather_u16 (quantized distance tables) - prefetch_read_t0/t1/t2 (hint wrappers) - U64x8::popcnt + xor_popcount (HDC hamming) Sequence now 9 waves (was 8), dependency graph updated, lance-graph unblocked at day 6 of 12-day plan. https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo --- UNIFIED_REFACTOR_SEQUENCE.md | 391 ++++++++++++++++++++++++++--------- 1 file changed, 292 insertions(+), 99 deletions(-) diff --git a/UNIFIED_REFACTOR_SEQUENCE.md b/UNIFIED_REFACTOR_SEQUENCE.md index afa4c9d5..6ad54321 100644 --- a/UNIFIED_REFACTOR_SEQUENCE.md +++ b/UNIFIED_REFACTOR_SEQUENCE.md @@ -1,9 +1,10 @@ # Unified Refactor Sequence: ndarray Fork -> Integrates three analysis passes into one executable sequence: +> Integrates four analysis passes into one executable sequence: > 1. **REFACTOR_HPC_INTEGRATION.md** — Type bridges & extension traits > 2. **SOA_KERNEL_ARCHITECTURE.md** — Columnar cascade & field-separated storage > 3. **Transformer session feedback** — API conventions, namespace, codegen macros +> 4. **lance-graph W1a consumer contract** — 5 SIMD primitives blocking 158-violation remediation > > Each wave is self-contained. Later waves depend on earlier ones. > Within a wave, items are independent and can run in parallel. @@ -22,201 +23,389 @@ Unlocks everything else. No code changes to hot paths. Pure contract + tooling. | W0.4 | **Feature flag rename** (`hpc-extras` → `research`, `backend-*` prefix) | R4.1 | 4h | Clean slate for feature-gated additions. | | W0.5 | **Prelude module** (`hpc::prelude::*`) | R1.4 | 1h | Single discovery surface for downstream consumers. | -**Gate**: All downstream crates (burn, candle, tract, ort) still compile unchanged. +**Gate**: All downstream crates (burn, candle, tract, ort, lance-graph) still compile unchanged. The dual-form is additive; existing call sites continue working. --- -## Wave 1 — Codegen Macros (2 days) +## Wave 1 — SIMD Consumer Primitives (3 days) ← lance-graph P0 + +From `lance-graph` W1a consumer contract. **Five primitives blocking 158 raw-intrinsic +violations across 5 lance-graph crates.** Each is a tight-scope PR. Parallel review. + +The lance-graph `simd-savant` agent runs PRE-MERGE against each PR. + +| ID | Item | Source | Effort | Consumer Site | +|----|------|--------|--------|---------------| +| W1.1 | **`I8x16::from_i4_packed_u64()`** + `batch_packed_i4_16()` | W1a-#1 | 1d | `lance-graph-contract/src/mul.rs::i4_eval::batch` — 5 batch fns over `QualiaI4_16D(u64)`. Closes PR #398 codex P1 (NEON OOB at `len==2`). | +| W1.2 | **`I8x16::saturating_abs()`** + `I8x32::saturating_abs()` | W1a-#2 | 4h | `lance-graph-contract/src/mul.rs` — Direction-B fix. `abs(i8::MIN)` must return `i8::MAX`, not wrap. See §VPABSB correction. | +| W1.3 | **`U16x8::gather_u16()`** + `palette_lookup_u8x8()` | W1a-#3 | 4h | `bgz17/src/simd.rs:88` — currently inlines `_mm256_i32gather_epi32` (AP-SIMD-1 violation). | +| W1.4 | **`prefetch_read_t0/t1/t2()`** | W1a-#4 | 2h | `bgz17/src/prefetch.rs:96,100` — currently inlines `_mm_prefetch` directly. | +| W1.5 | **`U64x8::popcnt()`** + `U64x8::xor_popcount()` | W1a-#5 | 4h | `holograph/hamming.rs` + `lance-graph/graph/blasgraph/types.rs` — inline `_mm512_popcnt_epi64`. | + +### Per-primitive implementation contract + +Every PR MUST: +1. **Three backends**: AVX-512, NEON, scalar. Missing scalar = P0 reject. +2. **Edge-case semantics documented** in doc-comment (`i8::MIN`, empty slices, OOB indices). +3. **Parity test**: all backends produce identical output on randomized corpus including edge cases. +4. **Bench against scalar**: record AVX-512/NEON speedup ratios in PR body. +5. **`// SAFETY:` on every `unsafe` block**. +6. **No new `is_x86_feature_detected!`** outside `simd_caps.rs`. +7. **Consumer site cited** in PR description. + +### W1.1 — I8x16::from_i4_packed_u64 (nibble unpack + sign extend) + +```rust +impl I8x16 { + /// Unpack 16 signed i4 nibbles from a u64 into 16 i8 lanes (sign-extended). + /// Nibble layout: lane[i] = sign_extend_4((packed >> (4*i)) & 0xf). + pub fn from_i4_packed_u64(packed: u64) -> Self; + + /// Const-folded lane extract. + pub fn lane_i8(self) -> i8; +} + +/// Closure-parameterized batch: run `f` over each (unpacked_i8x16, aux[i]) pair. +/// Bounds-aware tail handling; scalar fallback on unsupported arch. +pub fn batch_packed_i4_16(packed: &[u64], aux: &[i8], out: &mut [E], f: F) +where F: Fn(I8x16, i8) -> E + Sync + Send, E: Copy; +``` + +**AVX-512**: `_mm_cvtsi64_si128` + nibble shuffle via VPSHUFB mask LUT, then sign-extend. +Alt: PDEP (`_pdep_u64` × 2) into two u64 halves → load → `vpmovsxbw`. Bench both on Zen4 + SPR. +**NEON**: `vld1_u8` 8 bytes → `vshl_n_s8(v, 4)` + `vshr_n_s8(v, 4)` for nibble split + sign-extend. +**Scalar**: `((packed >> (4*i)) & 0xf) as i8`, with `if x > 7 { x - 16 } else { x }` for sign-extend. + +The `batch_packed_i4_16` closure-batch is the foundation for W1.5-#7 (randomized signatures). + +### W1.2 — I8x16::saturating_abs (VPABSB correction) + +**Critical**: `_mm512_abs_epi8` does NOT saturate `i8::MIN`. Returns `0x80` (= -128). + +```rust +impl I8x16 { pub fn saturating_abs(self) -> Self; } +impl I8x32 { pub fn saturating_abs(self) -> Self; } +``` + +**AVX-512**: `_mm512_min_epu8(_mm512_abs_epi8(x), _mm512_set1_epi8(0x7f))`. +**NEON**: `vqabsq_s8(x)` — hardware-saturating (the `q` suffix). +**Scalar**: `i8::saturating_abs()` (stdlib, well-defined). + +**Mandatory test**: +```rust +let input = I8x16::splat(i8::MIN); +let result = input.saturating_abs(); +assert_eq!(result.lane_i8::<0>(), i8::MAX); // 127, NOT -128 +``` + +### W1.3 — U16x8::gather_u16 (palette lookup) + +```rust +impl U16x8 { + /// Gather 8 u16 values from `table` at given indices. + /// Debug: panics if max(indices) >= table.len(). Release: scalar fallback. + pub fn gather_u16(indices: U16x8, table: &[u16]) -> Self; +} +pub fn palette_lookup_u8x8(idx_v: U16x8, lut: &[u8]) -> U8x8; +``` + +**AVX2**: `_mm256_i32gather_epi32` with index widening + downcast. +**NEON / Scalar**: loop `(0..8).map(|i| table[indices.lane(i)])`. + +### W1.4 — Prefetch hints (cross-arch) + +```rust +pub fn prefetch_read_t0(ptr: *const u8); // L1 +pub fn prefetch_read_t1(ptr: *const u8); // L2 +pub fn prefetch_read_t2(ptr: *const u8); // L3 +``` + +**x86**: `_mm_prefetch(ptr, _MM_HINT_T0/T1/T2)`. +**aarch64**: `prfm pldl1keep/pldl2keep/pldl3keep`. +**Other**: no-op (prefetch is a hint; silent no-op is correct per ISA). + +### W1.5 — U64x8::popcnt (lane-wise popcount) + +```rust +impl U64x8 { + /// Lane-wise population count. Each lane returns its u64 bit-count (0..=64). + pub fn popcnt(self) -> Self; + /// XOR + lane-wise popcount + horizontal sum. Optimized for Hamming distance. + pub fn xor_popcount(self, other: Self) -> u64; +} +impl U64x4 { pub fn popcnt(self) -> Self; } // AVX2 parity +``` + +**AVX-512 VPOPCNTDQ**: `_mm512_popcnt_epi64` directly (feature `avx512vpopcntdq`). +**AVX-512 without VPOPCNTDQ**: Mula's algorithm via `VPSHUFB` + `VPSADBW` per-byte LUT. +**NEON**: `vcntq_u8` → `vpaddlq_u8` cascade to sum within each u64. +**Scalar**: `u64::count_ones()` fused loop. + +### W1.5+ — Deferred primitives (gated on sigker certification) + +Three more primitives queued behind `lance-graph:crates/sigker` certification +(Hambly-Lyons 2010 path-signature uniqueness). **No code needed today.** Listed +so W1a additions are designed broad enough to compose with these later. + +| ID | Primitive | Gate | +|----|-----------|------| +| W1.5-#6 | `signature_pde_sweep()` — Goursat PDE for `〈S(X), S(Y)〉` | `jc Pillar 11` activation | +| W1.5-#7 | Randomized projection (Cuchiero-Schmocker-Teichmann 2021) | sigker bench at production widths | +| W1.5-#8 | Lyndon-pack log-signature (7-13× compression) on `I16x16` | sigker bench at production widths | + +The closure-batch shape introduced in W1.1 is the foundation for W1.5-#7. + +**Gate**: All 5 primitives pass parity tests across AVX-512/NEON/scalar. +`lance-graph` `simd-savant` agent verifies compliance PRE-MERGE. +Consumer-side migration PRs (#398-#402) unblocked. + +--- + +## Wave 2 — Codegen Macros (2 days) Eliminates copy-paste before adding new code. Every subsequent wave benefits. +Wave 1 primitives become the first consumers of these macros (retrofit optional). | ID | Item | Source | Effort | What It Produces | |----|------|--------|--------|------------------| -| W1.1 | **Dtype-parity macro** (`reductions_for!`) | R3.1 | 1d | One line = 7 reductions for a dtype. Cuts 700→150 lines. | -| W1.2 | **Per-arch dispatch macro** (`simd_dispatch!`) | R3.2 | 4h | Eliminates dispatch skeleton copy-paste. | -| W1.3 | **Reduction kernel template** (`reduce_simd()`) | R3.3 | 4h | Generic chunk-loop; sum/max/nrm2 become 5-line callers. | -| W1.4 | **Dual-form fusion** (`kernel_simd_dual!`) | R3.4 | 1d | One body → `_into`, Vec, `_ptr`, all arch variants. | +| W2.1 | **Dtype-parity macro** (`reductions_for!`) | R3.1 | 1d | One line = 7 reductions for a dtype. Cuts 700→150 lines. | +| W2.2 | **Per-arch dispatch macro** (`simd_dispatch!`) | R3.2 | 4h | Eliminates dispatch skeleton copy-paste. | +| W2.3 | **Reduction kernel template** (`reduce_simd()`) | R3.3 | 4h | Generic chunk-loop; sum/max/nrm2 become 5-line callers. | +| W2.4 | **Dual-form fusion** (`kernel_simd_dual!`) | R3.4 | 1d | One body → `_into`, Vec, `_ptr`, all arch variants. | **Gate**: `cargo test -p ndarray` passes. Macro output matches existing hand-rolled functions. Run benches to verify no regression. --- -## Wave 2 — Type Bridges (2 days) +## Wave 3 — Type Bridges (2 days) From REFACTOR_HPC_INTEGRATION.md Tier 1. Makes domain types ndarray-native. | ID | Item | Source | Effort | Impact | |----|------|--------|--------|--------| -| W2.1 | **Fingerprint ↔ ArrayView1** | Tier 1.1 | 4h | Zero-copy bridge, unblocks Zip/broadcast on fingerprints | -| W2.2 | **VsaVector ↔ ArrayView1** | Tier 1.2 | 2h | Same pattern | -| W2.3 | **BF16/F16 → BlasFloat** | Tier 1.3 | 4h | Enables `Array.dot()`, unlocks BLAS dispatch | -| W2.4 | **CogRecord channel_as_words()** | Tier 1.4 | 2h | ArrayView1 per channel | -| W2.5 | **Arrow bridge → ArrayView factories** | Tier 4.1 | 2h | `binary_column_view()` returns ArrayView2 | +| W3.1 | **Fingerprint ↔ ArrayView1** | Tier 1.1 | 4h | Zero-copy bridge, unblocks Zip/broadcast on fingerprints | +| W3.2 | **VsaVector ↔ ArrayView1** | Tier 1.2 | 2h | Same pattern | +| W3.3 | **BF16/F16 → BlasFloat** | Tier 1.3 | 4h | Enables `Array.dot()`, unlocks BLAS dispatch | +| W3.4 | **CogRecord channel_as_words()** | Tier 1.4 | 2h | ArrayView1 per channel | +| W3.5 | **Arrow bridge → ArrayView factories** | Tier 4.1 | 2h | `binary_column_view()` returns ArrayView2 | **Gate**: New `From`/`Into` impls compile. Existing code unchanged. Add tests for each bridge. --- -## Wave 3 — Extension Traits (3 days) +## Wave 4 — Extension Traits (3 days) From REFACTOR_HPC_INTEGRATION.md Tier 2. hpc operations feel native on ndarray types. | ID | Item | Source | Effort | Impact | |----|------|--------|--------|--------| -| W3.1 | **HdcOps trait** (K0/K1/K2 on Array1) | Tier 2.1 | 4h | `query.cascade_distance(&candidate, &gate)` | -| W3.2 | **Quantize trait** (Array → BF16/I8 with Array output) | Tier 2.2 | 4h | `array.to_bf16()`, `array.to_i8_symmetric()` | -| W3.3 | **Prefilter on ArrayView2** | Tier 2.3 | 4h | Eliminates `(slice, rows, cols)` pattern | -| W3.4 | **ActivationsNd** (softmax_axis, log_softmax_axis) | Tier 2.4 | 4h | `batch.softmax_axis(Axis(1))` for inference | -| W3.5 | **SimdMath trait** (VML on any Array) | Tier 2.5 | 4h | `array.simd_exp()` — 10-50x for transcendentals | -| W3.6 | **Int8Matmul trait** (Array2 × Array2 → Array2) | Tier 3.3 | 4h | Quantized inference via ndarray types | +| W4.1 | **HdcOps trait** (K0/K1/K2 on Array1) | Tier 2.1 | 4h | `query.cascade_distance(&candidate, &gate)` | +| W4.2 | **Quantize trait** (Array → BF16/I8 with Array output) | Tier 2.2 | 4h | `array.to_bf16()`, `array.to_i8_symmetric()` | +| W4.3 | **Prefilter on ArrayView2** | Tier 2.3 | 4h | Eliminates `(slice, rows, cols)` pattern | +| W4.4 | **ActivationsNd** (softmax_axis, log_softmax_axis) | Tier 2.4 | 4h | `batch.softmax_axis(Axis(1))` for inference | +| W4.5 | **SimdMath trait** (VML on any Array) | Tier 2.5 | 4h | `array.simd_exp()` — 10-50x for transcendentals | +| W4.6 | **Int8Matmul trait** (Array2 × Array2 → Array2) | Tier 3.3 | 4h | Quantized inference via ndarray types | **Gate**: Each trait has tests matching the raw-slice function's test suite. Benchmark: trait overhead vs. direct raw-slice call ≤ 1%. --- -## Wave 4 — Backend Wiring (2 days) +## Wave 5 — Backend Wiring (2 days) From REFACTOR_HPC_INTEGRATION.md Tier 3. Core ndarray operations silently accelerate. | ID | Item | Source | Effort | Impact | |----|------|--------|--------|--------| -| W4.1 | **Unified SIMD detection** (merge simd_caps + simd_dispatch into core) | Tier 3.2 | 4h | Deletes 877 lines of duplication | -| W4.2 | **Core sum/mean → SIMD dispatch** | Tier 3.1 | 4h | 16x faster `.sum()` on contiguous f32/f64 | -| W4.3 | **SIMD axis reductions** (sum_axis with SIMD lanes) | Tier 6.1 | 1d | ML training hot path | +| W5.1 | **Unified SIMD detection** (merge simd_caps + simd_dispatch into core) | Tier 3.2 | 4h | Deletes 877 lines of duplication | +| W5.2 | **Core sum/mean → SIMD dispatch** | Tier 3.1 | 4h | 16x faster `.sum()` on contiguous f32/f64 | +| W5.3 | **SIMD axis reductions** (sum_axis with SIMD lanes) | Tier 6.1 | 1d | ML training hot path | **Gate**: `cargo bench` shows measurable improvement on contiguous arrays. Non-contiguous arrays unchanged (fallback to generic fold). --- -## Wave 5 — SoA Kernel Architecture (1 week) +## Wave 6 — SoA Kernel Architecture (1 week) From SOA_KERNEL_ARCHITECTURE.md. The genius-level structural change. +Wave 1's `U64x8::popcnt()` and `U64x8::xor_popcount()` are direct prerequisites +for the columnar XOR+popcount scan in W6.2. | ID | Item | Source | Effort | Impact | |----|------|--------|--------|--------| -| W5.1 | **TieredDatabase struct** (K0/K1/K2 as F-order Array2) | SoA §5 | 1d | Physical separation matching cache hierarchy | -| W5.2 | **k0_columnar_simd** (scan N u64s with broadcast+XOR+popcount) | SoA §1 | 1d | 8 candidates per cycle, sequential memory | -| W5.3 | **Bitmask survivor propagation** (mask narrowing, no branching) | SoA §2 | 4h | Replace per-candidate `continue` with mask ops | -| W5.4 | **k1_accumulate_columnar** (8-column scan with mask) | SoA §1 | 4h | K1 on contiguous columns | -| W5.5 | **k2_exact_masked** (row extraction for survivors only) | SoA §1 | 4h | Stack-copy 2KB per survivor | -| W5.6 | **BF16FieldDatabase** (sign/exp/mantissa separated at ingest) | SoA §6 | 1d | Awareness without runtime decomposition | -| W5.7 | **QualiaColumns** (16 × Array1) | SoA §7 | 4h | 18x dimension scan throughput | -| W5.8 | **Arrow native path** (RecordBatch columns → columnar scan) | SoA §4 | 4h | Zero-copy from Lance/Arrow | -| W5.9 | **Benchmark: AoS vs SoA** on 1M SKU-16K containers | SoA §perf | 4h | Proof of 4-8x claim | +| W6.1 | **TieredDatabase struct** (K0/K1/K2 as F-order Array2) | SoA §5 | 1d | Physical separation matching cache hierarchy | +| W6.2 | **k0_columnar_simd** (scan N u64s with broadcast+XOR+popcount) | SoA §1 | 1d | Uses `U64x8::xor_popcount` from W1.5. 8 candidates per cycle. | +| W6.3 | **Bitmask survivor propagation** (mask narrowing, no branching) | SoA §2 | 4h | Replace per-candidate `continue` with mask ops | +| W6.4 | **k1_accumulate_columnar** (8-column scan with mask) | SoA §1 | 4h | K1 on contiguous columns | +| W6.5 | **k2_exact_masked** (row extraction for survivors only) | SoA §1 | 4h | Stack-copy 2KB per survivor | +| W6.6 | **BF16FieldDatabase** (sign/exp/mantissa separated at ingest) | SoA §6 | 1d | Awareness without runtime decomposition | +| W6.7 | **QualiaColumns** (16 × Array1) | SoA §7 | 4h | 18x dimension scan throughput. Uses `I8x16::from_i4_packed_u64` from W1.1. | +| W6.8 | **Arrow native path** (RecordBatch columns → columnar scan) | SoA §4 | 4h | Zero-copy from Lance/Arrow | +| W6.9 | **Benchmark: AoS vs SoA** on 1M SKU-16K containers | SoA §perf | 4h | Proof of 4-8x claim | **Gate**: SoA cascade produces identical results to AoS cascade on full test suite. Benchmark confirms ≥3x throughput improvement on 1M containers. --- -## Wave 6 — Namespace Restructure (3 days) +## Wave 7 — Namespace Restructure (3 days) From transformer feedback R2.1. Enforces the architecture rule from CLAUDE.md. | ID | Item | Source | Effort | Impact | |----|------|--------|--------|--------| -| W6.1 | **Split hpc/ → hpc/ + cog/ + ext/ + io/** | R2.1 | 2-3d | 30 numeric modules in hpc/, 35 cognitive in cog/, 20 experimental in ext/, 6 I/O in io/ | -| W6.2 | **SIMD directory consolidation** (simd_*.rs → src/simd/) | R2.2 | 1d | One directory for all SIMD code | -| W6.3 | **Quantized module split** (quantized.rs → hpc/quant/) | R2.3 | 4h | BlockQ4_0 packed struct for candle compat | +| W7.1 | **Split hpc/ → hpc/ + cog/ + ext/ + io/** | R2.1 | 2-3d | 30 numeric modules in hpc/, 35 cognitive in cog/, 20 experimental in ext/, 6 I/O in io/ | +| W7.2 | **SIMD directory consolidation** (simd_*.rs → src/simd/) | R2.2 | 1d | One directory for all SIMD code. W1 primitives land in new locations. | +| W7.3 | **Quantized module split** (quantized.rs → hpc/quant/) | R2.3 | 4h | BlockQ4_0 packed struct for candle compat | **Gate**: `pub use` deprecation shims for all moved modules. All existing `use ndarray::hpc::*` paths still resolve (with deprecation warning). --- -## Wave 7 — Test & Bench Infrastructure (2 days) +## Wave 8 — Test & Bench Infrastructure (2 days) | ID | Item | Source | Effort | Impact | |----|------|--------|--------|--------| -| W7.1 | **Extract integration tests** to `tests/hpc/`, `tests/cog/` | R4.2 | 2d | Module files lose 30-50% of line count | -| W7.2 | **HPC bench harness** (dot, reductions, softmax, quantized) | R4.5 | 1d | Reproducible perf claims | -| W7.3 | **Cross-comparison bench** (ndarray vs candle vs tract kernel) | R4.5 ext | 4h | The whole reason for B1 | +| W8.1 | **Extract integration tests** to `tests/hpc/`, `tests/cog/` | R4.2 | 2d | Module files lose 30-50% of line count | +| W8.2 | **HPC bench harness** (dot, reductions, softmax, quantized) | R4.5 | 1d | Reproducible perf claims | +| W8.3 | **Cross-comparison bench** (ndarray vs candle vs tract kernel) | R4.5 ext | 4h | The whole reason for B1 | +| W8.4 | **W1a primitive parity bench** (AVX-512 vs NEON vs scalar speedup) | lance-graph | 4h | Acceptance criteria §4 | --- -## Wave 8 — Version Bump & Downstream Pin (1 day) +## Wave 9 — Version Bump & Downstream Pin (1 day) | ID | Item | Source | Effort | Impact | |----|------|--------|--------|--------| -| W8.1 | **Tag v0.18.0** | R4.4 | 2h | Clean cut | -| W8.2 | **CHANGELOG with migration guide** | R4.4 | 2h | Downstream knows what changed | -| W8.3 | **Downstream Cargo.toml pins** (burn, candle, tract, ort) | R4.4 | 2h | tag = "v0.18.0" | -| W8.4 | **Remove deprecation shims** (schedule for v0.19) | R2.1 | — | Future cleanup | +| W9.1 | **Tag v0.18.0** | R4.4 | 2h | Clean cut | +| W9.2 | **CHANGELOG with migration guide** | R4.4 | 2h | Downstream knows what changed | +| W9.3 | **Downstream Cargo.toml pins** (burn, candle, tract, ort, lance-graph) | R4.4 | 2h | tag = "v0.18.0" | +| W9.4 | **Remove deprecation shims** (schedule for v0.19) | R2.1 | — | Future cleanup | --- -## The SoA Integration with Conventions +## Dependency Graph + +``` +Wave 0 (conventions) + │ + ├──→ Wave 1 (SIMD primitives) ←── P0 for lance-graph + │ │ + │ ├──→ Wave 6.2 (k0_columnar uses U64x8::xor_popcount from W1.5) + │ ├──→ Wave 6.7 (QualiaColumns uses I8x16::from_i4_packed from W1.1) + │ └──→ Wave 8.4 (primitive parity benches) + │ + ├──→ Wave 2 (codegen macros) + │ │ + │ └──→ Wave 5 (backend wiring uses macros) + │ + ├──→ Wave 3 (type bridges) + │ │ + │ ├──→ Wave 4 (extension traits need bridges) + │ ├──→ Wave 5 (backend dispatch needs BlasFloat for BF16) + │ └──→ Wave 6 (SoA needs Fingerprint↔Array, Arrow views) + │ + ├──→ Wave 7 (namespace) — independent of 1-6 + │ + └──→ Wave 8 (tests/benches) — after 1+5+6 + │ + └──→ Wave 9 (release) +``` + +### Critical Path + +**For lance-graph (P0 consumer)**: 0 → 1 = **6 days**. Consumer migration PRs unblocked. + +**For full release**: 0 → 1 → 3 → 5 → 6 → 8 → 9 = **18 days** serial. +With parallelism (1∥2, 3∥4, 6∥7, 8∥backlog): **~12 working days**. -The SoA architecture (Wave 5) benefits from every preceding wave: +--- -| Wave 5 needs | Provided by | -|--------------|-------------| -| F-order Array2 database | Wave 2 (type bridges — Fingerprint converts to Array) | -| SIMD column scanning | Wave 4 (unified SIMD detection, core dispatch) | -| Bitmask as Array1 | Wave 2 (ArrayView factory pattern) | -| `_into` form for columnar kernels | Wave 0 (signature convention) | -| Dispatch macro for k0_columnar_simd | Wave 1 (codegen macros) | -| Extension trait: `database.cascade_soa(&query, &gate)` | Wave 3 (HdcOps trait extended) | -| BF16FieldDatabase uses Quantize trait | Wave 3 (Quantize extension) | -| Arrow columns → direct scan | Wave 2 (Arrow view factories) | -| Benchmark harness to prove 4-8x | Wave 7 (bench infrastructure) | +## How Wave 1 Feeds Everything Downstream + +The 5 SIMD primitives aren't isolated additions — they're load-bearing for later waves: -This is why the waves are ordered this way. The SoA kernel is the most impactful -change (~4-8x cascade throughput), but it needs the convention/macro/bridge -foundation to land cleanly. Without Wave 0-4, the SoA code would be yet another -raw-slice module that doesn't integrate with ndarray — repeating the original sin. +| Primitive | Immediate consumer (lance-graph) | Later consumer (ndarray) | +|-----------|----------------------------------|--------------------------| +| `I8x16::from_i4_packed_u64` | `mul.rs::i4_eval::batch` (5 batch fns) | **W6.7 QualiaColumns** — unpack i4 qualia from packed storage without scalar loop | +| `I8x16::saturating_abs` | Direction-B fix, ValleyOfDespair classifier | **W4.2 Quantize trait** — safe abs in quantization error metrics | +| `U16x8::gather_u16` | `bgz17/simd.rs` palette lookup | **W6.6 BF16FieldDatabase** — gather exponent fields from lookup table | +| `prefetch_read_t0/t1/t2` | `bgz17/prefetch.rs` tile prefetch | **W6.2 k0_columnar_simd** — prefetch next column chunk during K0 scan | +| `U64x8::popcnt` + `xor_popcount` | `holograph/hamming.rs` + `blasgraph/types.rs` | **W6.2-W6.5 entire SoA cascade** — columnar XOR+popcount is THE operation | + +The SoA cascade (Wave 6) is fundamentally built on `U64x8::xor_popcount()`. Without Wave 1, +the SoA implementation would need to drop back to `hpc::bitwise::hamming_distance_raw()` on +slices, losing the typed-wrapper discipline and duplicating dispatch logic. + +--- + +## The SoA Integration with All Waves + +| Wave 6 needs | Provided by | +|--------------|-------------| +| `U64x8::xor_popcount()` for columnar scan | **Wave 1.5** (SIMD primitives) | +| `I8x16::from_i4_packed_u64()` for qualia | **Wave 1.1** (SIMD primitives) | +| `prefetch_read_t0()` for column prefetch | **Wave 1.4** (SIMD primitives) | +| F-order Array2 database | **Wave 3** (type bridges — Fingerprint converts to Array) | +| Dispatch macro for k0_columnar_simd | **Wave 2** (codegen macros) | +| `_into` form for columnar kernels | **Wave 0** (signature convention) | +| Extension trait: `database.cascade_soa(&query, &gate)` | **Wave 4** (HdcOps trait extended) | +| BF16FieldDatabase uses Quantize trait | **Wave 4** (Quantize extension) | +| Arrow columns → direct scan | **Wave 3** (Arrow view factories) | +| Unified SIMD detection for dispatch | **Wave 5** (backend wiring) | +| Benchmark harness to prove 4-8x | **Wave 8** (bench infrastructure) | --- ## The Meta-Level SoA Observation The SoA insight goes beyond just "transpose the database." It's a design principle -that applies recursively across the entire hpc/ surface: +that applies recursively across the entire surface: ### Level 1: Database layout (SOA_KERNEL_ARCHITECTURE.md) - Fingerprint database: column-per-word → cascade becomes column scan - BF16 fields: separated sign/exp/mantissa → awareness is native layout - Qualia: column-per-dimension → per-dim queries are sequential scans -### Level 2: API surface (this document) +### Level 2: API surface - The **dual-form convention** (W0.1) IS SoA thinking applied to function signatures: - AoS API: `fn foo(input) -> output` (allocates inside, hides layout) - SoA API: `fn foo_into(input, output)` (caller controls memory layout) - - The `_into` form lets the caller decide whether output is contiguous, chunked, or interleaved -### Level 3: Module structure (W6.1 namespace split) -- AoS: one module (`hpc/`) contains everything → grep to find anything -- SoA: separated by concern (`hpc/`, `cog/`, `ext/`, `io/`) → each namespace is a contiguous "column" of related functionality that you scan sequentially +### Level 3: SIMD primitives (lance-graph W1a) +- The **typed-wrapper-as-method** convention IS SoA for the SIMD surface: + - AoS: raw intrinsics scattered across consumer crates (158 violations) + - SoA: struct methods on typed wrappers, consumers import ONE namespace +- The **closure-batch pattern** (W1.1 `batch_packed_i4_16`) IS SoA for computation: + - AoS: each consumer writes its own unpack+process+pack loop + - SoA: one batch primitive accepts a closure, owns chunking/tailing/dispatch -### Level 4: ndarray's type system -- `Array2` in F-order IS the SoA database. No wrapper type needed. -- `.column(i)` IS the field-access pattern. No custom iterator needed. -- `.row(i)` IS the record-access pattern. No conversion needed. -- Strides encode the duality. The same allocation serves both patterns. - -### Level 5: The cascade pipeline itself -- K0: ONE field (word 0) across ALL records → column scan -- K1: EIGHT fields (words 0-7) across SURVIVORS → multi-column scan -- K2: ALL fields across FEW records → row reconstruction +### Level 4: Module structure (W7.1 namespace split) +- AoS: one module (`hpc/`) contains everything → grep to find anything +- SoA: separated by concern (`hpc/`, `cog/`, `ext/`, `io/`) → contiguous columns -This is a **widening read pattern**: start narrow (1 column), widen as population shrinks. -SoA makes the narrow-start cheap (contiguous) and tolerates the wide-end being strided -(because so few records reach it). +### Level 5: ndarray's type system +- `Array2` in F-order IS the SoA database +- `.column(i)` IS the field-access pattern +- `.row(i)` IS the record-access pattern +- Strides encode the duality without duplication ### Level 6: Arrow + Lance alignment - Arrow IS columnar storage (SoA by definition) - Lance stores Arrow columns on disk - The SoA kernel reads Arrow columns directly — no ETL, no transpose, no copy -- The database layout on disk IS the query execution layout in memory -- This is the zero-copy nirvana: mmap → column pointer → SIMD scan → result +- Storage format = memory layout = compute pattern = cache access order -**The architecture collapses all intermediate transformations.** -Storage format = memory layout = compute pattern = cache access order. -One decision (F-order Array2) propagates from disk through cache to SIMD lanes. +### Level 7: The consumer contract itself +- lance-graph declares what it needs (W1a primitives), ndarray provides +- The contract is columnar: one primitive per concern, not a monolithic "SIMD library" +- Each primitive is independently testable, deployable, and benchmarkable + +**The architecture collapses all intermediate transformations at every level.** --- @@ -225,17 +414,18 @@ One decision (F-order Array2) propagates from disk through cache to SIMD lanes. | Wave | Days | Parallel? | Key Outcome | |------|------|-----------|-------------| | 0 | 3 | Yes | Conventions locked | -| 1 | 2 | Yes | Codegen macros ready | -| 2 | 2 | Yes (after 0) | Domain types ↔ Array | -| 3 | 3 | Yes (after 2) | Extension traits live | -| 4 | 2 | After 1+2 | Core ops accelerated | -| 5 | 5 | After 2+4 | SoA cascade (the big win) | -| 6 | 3 | After 0 | Namespace clean | -| 7 | 2 | After 5 | Proof via benchmarks | -| 8 | 1 | After all | Release | - -**Critical path**: 0 → 2 → 4 → 5 → 7 → 8 = **15 days** serial. -With parallel waves (1∥2, 3∥4, 6∥5, 7∥8): **~10 working days**. +| 1 | 3 | Yes (after 0) | **lance-graph unblocked** (P0) | +| 2 | 2 | ∥ with 1 | Codegen macros ready | +| 3 | 2 | After 0 | Domain types ↔ Array | +| 4 | 3 | After 3 | Extension traits live | +| 5 | 2 | After 2+3 | Core ops accelerated | +| 6 | 5 | After 1+3+5 | SoA cascade (the big win) | +| 7 | 3 | ∥ with 4-6 | Namespace clean | +| 8 | 2 | After 1+6 | Proof via benchmarks | +| 9 | 1 | After all | Release | + +**lance-graph unblocked**: Wave 0+1 = **6 days**. +**Full release critical path**: 18 days serial, **~12 days** with parallelism. --- @@ -246,3 +436,6 @@ With parallel waves (1∥2, 3∥4, 6∥5, 7∥8): **~10 working days**. 3. **Don't gate SoA behind a feature flag** — it's the default hot path, not optional 4. **Don't couple SoA with module restructure** — they're independent; merge separately 5. **Don't break downstream in one shot** — deprecation shims for one release minimum +6. **Don't ship W1a primitives without parity tests** — the codex P2 i8::MIN divergence on PR #398 happened because no such test existed +7. **Don't add `is_x86_feature_detected!` outside simd_caps.rs** — dispatch through the singleton +8. **Don't implement W1.5+ (deferred primitives) until sigker certification** — they're gated From 4321fcf33c22867c2e20970d0f107e58e2606312 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 16 May 2026 20:52:56 +0000 Subject: [PATCH 5/7] Add W1a vertical SIMD consumer contract spec Canonical spec for 5 P0 SIMD primitives lance-graph requires: - I8x16::from_i4_packed_u64 (closure-batch) - I8x16::saturating_abs (VPABSB + VPMINUB clamp) - U16x8::gather_u16 (with Codex P2 OOB correction for 32-bit gather) - prefetch_read_t0/t1/t2 - U64x8::popcnt + xor_popcount (NEON: vpaddlq cascade, NOT vaddvq_u8) Includes Codex P2 corrections for gather OOB and NEON popcount lane contamination. https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo --- .../vertical-simd-consumer-contract.md | 242 ++++++++++++++++++ 1 file changed, 242 insertions(+) create mode 100644 .claude/knowledge/vertical-simd-consumer-contract.md diff --git a/.claude/knowledge/vertical-simd-consumer-contract.md b/.claude/knowledge/vertical-simd-consumer-contract.md new file mode 100644 index 00000000..65c445c7 --- /dev/null +++ b/.claude/knowledge/vertical-simd-consumer-contract.md @@ -0,0 +1,242 @@ +# Vertical SIMD Consumer Contract — lance-graph W1a + +> Canonical spec for the 5 P0 SIMD primitives that AdaWorldAPI/lance-graph +> requires from this ndarray fork before consumer-side migrations can land. + +## Context + +A PRE-MERGE audit of lance-graph `main` (2026-05-16) surfaced **158 raw-intrinsic +violations** across 5 consumer crates plus 3 missing primitives here that block +clean remediation. This document is the **binding contract** between ndarray's +SIMD surface and its primary consumer. + +## Design Pattern + +ndarray's SIMD surface is designed AS-IF for lance-graph's exact workloads: + +- **Struct methods on typed wrappers** — `I8x16`, `U8x32`, `F32x16`, `U64x8`, … +- **Closure-parameterized batch primitives** — absorb consumer domain semantics +- **Consumers see zero raw intrinsics, zero `cfg(target_arch)`, zero runtime + feature-detect** — they call `I8x16::from_i4_packed_u64(...)`, + `I8x16::saturating_abs(...)`, `batch_packed_i4_16(..., |lanes, aux| { ... })` +- **This repo (the polyfill) owns** dispatch, chunking, tail handling, scalar fallback + +--- + +## W1a Queue — 5 Primitives + +### TD-NDARRAY-SIMD-UNPACK-I4-16D + +**API:** +```rust +impl I8x16 { + /// Unpack 8 packed i4 pairs from a u64 into 16 × i8 lanes (sign-extended). + fn from_i4_packed_u64(packed: u64) -> Self; +} + +/// Closure-batch: owns chunking, dispatch, tail handling. +/// `f` receives 16 i8 lanes + aux slice position; returns accumulator update. +fn batch_packed_i4_16( + packed: &[u64], + aux: &[E], + out: &mut [i8], + f: F, +) where F: Fn(I8x16, &[E]) -> I8x16; +``` + +**Consumer:** `lance-graph::mul::i4_eval::batch` (5 fns) + +**Per-arch implementation:** +- **AVX-512:** `VPSHUFB` nibble shuffle + sign-extend via arithmetic shift +- **NEON:** `VTBL` permute + `VSHR`/`VAND` nibble extraction +- **Scalar:** shift-and-mask loop, 2 nibbles per byte + +--- + +### TD-NDARRAY-SIMD-SATURATING-ABS-I8 + +**API:** +```rust +impl I8x16 { + /// Saturating absolute value: abs(i8::MIN) → i8::MAX (not i8::MIN). + fn saturating_abs(self) -> Self; +} +``` + +**Consumer:** lance-graph PR #398 Direction-B fix + +**CRITICAL — VPABSB correction:** + +The original lance-graph PR #400 capture claimed `_mm512_abs_epi8` saturates +`i8::MIN → 127` by ISA. **This is WRONG.** VPABSB returns `0x80` for input +`0x80` (i.e., `abs(i8::MIN) = i8::MIN` because +128 doesn't fit in i8). + +**Correct AVX-512 implementation:** +```rust +// SAFETY: self.0 is a valid __m512i from construction +unsafe { + let raw_abs = _mm512_abs_epi8(self.0); + // VPMINUB: unsigned min clamps 0x80 (=128 unsigned > 127) to 0x7f + let clamped = _mm512_min_epu8(raw_abs, _mm512_set1_epi8(0x7f)); + Self(clamped) +} +``` + +**Per-arch implementation:** +- **AVX-512:** `VPABSB` + `VPMINUB` clamp (as above) +- **AVX2:** `_mm256_abs_epi8` + `_mm256_min_epu8(..., splat(0x7f))` +- **NEON:** `vqabsq_s8` — hardware-saturating (the `q` suffix). No fixup needed. +- **Scalar:** `i8::saturating_abs()` — correct by definition. + +**Mandatory parity test:** +```rust +let input = I8x16::splat(i8::MIN); +assert_eq!(input.saturating_abs().lane_i8::<0>(), i8::MAX); +``` + +The widen-then-negate trick used in lance-graph PR #398's `mul.rs` is NOT a +substitute — the new primitive must produce saturating semantics in the +byte-wide register without widening. + +--- + +### TD-NDARRAY-SIMD-GATHER + +**API:** +```rust +impl U16x8 { + /// Gather 8 × u16 values from `table` at the given indices. + /// Panics (debug) / UB-free (release) if any index ≥ table.len(). + fn gather_u16(table: &[u16], indices: [u32; 8]) -> Self; +} + +/// Palette lookup: gather 8 u8 values from a 256-entry palette. +fn palette_lookup_u8x8(palette: &[u8; 256], indices: [u8; 8]) -> [u8; 8]; +``` + +**Consumer:** `bgz17/src/simd.rs:88` + +**Per-arch implementation:** +- **AVX2/AVX-512:** `_mm256_i32gather_epi32` reads 32 bits per slot. Since the + source is `&[u16]`, each gather slot reads 4 bytes starting at + `&table[index]`. **Correction (Codex P2):** To avoid OOB reads at + `table[len-1]`, the implementation MUST: + 1. Allocate or use a padded table with 2 extra trailing bytes, OR + 2. Use scalar fallback for indices where `index == table.len() - 1`, OR + 3. (Preferred) Gather from a `&[u32]` temporary produced by zero-extending + the u16 table at ingest time, OR + 4. Mask the gathered 32-bit values to 16 bits (`_mm256_and_si256(..., 0xFFFF)`) + AND ensure the source allocation has ≥2 bytes of padding past the last + element (i.e., the table slice is backed by an allocation at least + `table.len() * 2 + 2` bytes). + + The bounds contract is: `max(indices) < table.len()`. The implementation must + guarantee no UB even when `max(indices) == table.len() - 1`. +- **NEON:** No hardware gather for u16. Scalar lane-by-lane load into vector. +- **Scalar:** Direct indexing loop. + +--- + +### TD-NDARRAY-SIMD-PREFETCH + +**API:** +```rust +/// Prefetch read hint — temporal locality level 0/1/2. +/// No-op on architectures without prefetch instructions. +#[inline(always)] +fn prefetch_read_t0(ptr: *const T); +#[inline(always)] +fn prefetch_read_t1(ptr: *const T); +#[inline(always)] +fn prefetch_read_t2(ptr: *const T); +``` + +**Consumer:** `bgz17/src/prefetch.rs:96,100` + +**Per-arch implementation:** +- **x86:** `_mm_prefetch::<_MM_HINT_T0/T1/T2>(ptr as *const i8)` +- **NEON/AArch64:** `__prefetch(ptr)` (ARM has only one hint level in stable intrinsics) +- **Scalar/fallback:** No-op `#[inline(always)]` empty function + +--- + +### TD-NDARRAY-SIMD-POPCOUNT-U64 + +**API:** +```rust +impl U64x8 { + /// Per-lane population count: count set bits in each u64 lane. + fn popcnt(self) -> Self; + + /// XOR two vectors then popcount each lane (Hamming distance). + fn xor_popcount(self, other: Self) -> Self; +} +``` + +**Consumer:** `holograph/hamming.rs`, `blasgraph/types.rs` + +**Per-arch implementation:** +- **AVX-512 VPOPCNTDQ:** `_mm512_popcnt_epi64` directly. Feature flag `avx512vpopcntdq`. +- **AVX-512 without VPOPCNTDQ:** Mula's algorithm — `_mm512_shuffle_epi8` (VPSHUFB) + for per-nibble LUT popcount on bytes, then `_mm512_sad_epu8` against zero to + horizontally sum bytes within each 64-bit lane. +- **NEON:** Per-u64 lane popcount via widening reduction: + ``` + vcntq_u8 → 16 × u8 byte-popcounts + vpaddlq_u8 → 8 × u16 (pairwise add bytes) + vpaddlq_u16 → 4 × u32 (pairwise add u16s) + vpaddlq_u32 → 2 × u64 (pairwise add u32s) + ``` + This produces one count per u64 lane. **Correction (Codex P2):** Do NOT use + `vaddvq_u8` — it reduces the entire 128-bit vector to a single scalar, + merging counts across u64 lanes. The `vpaddlq_*` widening cascade preserves + per-lane boundaries. +- **Scalar:** `u64::count_ones()` per lane. + +--- + +## W1.5 — Deferred Primitives + +Gated on lance-graph `sigker` certification (Pillar 11 activation): + +| ID | Primitive | Depends on | +|----|-----------|-----------| +| W1.5-#7 | signature-PDE-sweep | W1a-#1 closure-batch shape | +| W1.5-#8 | randomized-projection | W1a-#1 batch pattern | +| W1.5-#9 | lyndon-pack | W1a-#5 popcount | + +The W1a closure-batch shape (`batch_packed_i4_16`) is the foundation for W1.5-#7 +and #8. Design W1a-#1 with this forward-compatibility in mind. + +--- + +## Acceptance Criteria (per W1a PR) + +1. All three backends (AVX-512/AVX2, NEON, scalar) — scalar is the correctness anchor +2. Doc-comment states saturating/overflow/signedness semantics explicitly +3. Mandatory parity test on randomized + edge-case corpus (≥ 64 randomized vectors) +4. No new `is_*_feature_detected!` outside `src/hpc/simd_caps.rs` +5. `// SAFETY:` comments on all `unsafe` blocks +6. Consumer call-site cited in PR description +7. `cargo clippy -- -D warnings` passes + +--- + +## Cross-References + +- AdaWorldAPI/lance-graph PR #399 — simd-savant agent + autoattended-multiagent pattern +- AdaWorldAPI/lance-graph PR #400 — architectural capture (VPABSB correction origin) +- AdaWorldAPI/lance-graph PR #398 — codex P1 (NEON OOB) + P2 (i8::MIN divergence) +- Intel Intrinsics Guide: `_mm512_abs_epi8`, `_mm512_min_epu8`, `_mm512_popcnt_epi64`, `_mm256_i32gather_epi32` +- ARM Architecture Reference: `VQABS` (`vqabsq_s8`), `VCNT` (`vcntq_u8`), `VPADDL` + +--- + +## Gating Relationship + +> ndarray ships these 5 primitives → lance-graph remediation PRs land → +> 158 raw-intrinsic violations drop to 0 → sigker certification unblocks W1.5 + +Consumer-side migrations **cannot proceed** until these primitives ship. +The `simd-savant` agent on the lance-graph side runs PRE-MERGE against every +W1a PR to verify compliance. From b0fcc07b061057ea3201d7b4e2e53e3dd9550ea7 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 16 May 2026 22:53:03 +0000 Subject: [PATCH 6/7] =?UTF-8?q?docs:=20fix=20dispatch=20model=20in=20refac?= =?UTF-8?q?tor=20docs=20=E2=80=94=20cfg=20routing,=20not=20runtime=20detec?= =?UTF-8?q?tion?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit UNIFIED_REFACTOR_SEQUENCE.md: - Add "Dispatch Model" section documenting compile-time cfg(target_feature) routing - Wave 1 contract: replace "three backends" with correct per-file impl rule - Replace rules 6-7 with: no is_x86_feature_detected, no #[target_feature(enable)] - Wave 5: reframe as "delete dead detection code" not "unify runtime singleton" - Add rules 9-10 (don't touch simd_avx2.rs, don't reach for rayon) REFACTOR_HPC_INTEGRATION.md: - §3.2: replace LazyLock proposal with "delete 877 lines of dead code" - Architecture diagram: "backend dispatch" → "cfg(target_feature) routing" - Phase C execution order updated to match Keeps: all type bridges, extension traits, SoA cascade, Wave sequencing, W1a primitive specs, VPABSB correction, palette-256 priority, NEON 2×128-bit, Arrow integration, codegen macros, namespace restructure, effort estimates. https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo --- REFACTOR_HPC_INTEGRATION.md | 71 +++++++++++++++-------------- UNIFIED_REFACTOR_SEQUENCE.md | 86 ++++++++++++++++++++++++++++-------- 2 files changed, 103 insertions(+), 54 deletions(-) diff --git a/REFACTOR_HPC_INTEGRATION.md b/REFACTOR_HPC_INTEGRATION.md index 1c11c56d..2823219f 100644 --- a/REFACTOR_HPC_INTEGRATION.md +++ b/REFACTOR_HPC_INTEGRATION.md @@ -36,14 +36,14 @@ The refactoring creates **bidirectional bridges** without removing the raw layer │ Array, ArrayView, Zip, Broadcasting │ └───────────────┬──────────────────────▲──────────────┘ │ │ - .as_slice() backend dispatch - │ │ + .as_slice() cfg(target_feature) routing + │ (compile-time, zero-cost) ▼ │ ┌─────────────────────────────────────────────────────┐ │ hpc/ bridge layer (NEW) │ │ Extension traits on ArrayBase │ │ From/Into impls for domain types │ -│ Core reductions route to SIMD backends │ +│ Core reductions call typed SIMD wrappers │ └───────────────┬──────────────────────▲──────────────┘ │ │ delegates to implements @@ -51,11 +51,17 @@ The refactoring creates **bidirectional bridges** without removing the raw layer ▼ │ ┌─────────────────────────────────────────────────────┐ │ hpc/ raw compute (unchanged) │ -│ &[u8], &[u64], Fingerprint, SIMD dispatch │ +│ &[u8], &[u64], Fingerprint, typed SIMD │ │ K0/K1/K2, BF16 GEMM, VNNI, VML │ +│ Uses crate::simd::* → resolves to simd_avx512.rs │ └─────────────────────────────────────────────────────┘ ``` +**Dispatch model**: `crate::simd::U64x8` resolves at compile time via +`cfg(target_feature = "avx512f")` in `src/simd.rs` to `simd_avx512::U64x8`. +No runtime detection, no match, no branching. The target-cpu pin in +`.cargo/config.toml` makes the cfg gate TRUE at compile time. + --- ## Refactoring Shopping List @@ -608,44 +614,37 @@ faster. Zero API change for users. --- -#### 3.2 Unify SIMD Detection (Delete Duplicates) +#### 3.2 Delete Dead SIMD Detection Code -**Files to merge**: `src/hpc/simd_dispatch.rs` (362 lines), `src/hpc/simd_caps.rs` (515 lines) -**Into**: `src/simd.rs` (the existing core SIMD module) +**Files**: `src/hpc/simd_dispatch.rs` (362 lines), `src/hpc/simd_caps.rs` (515 lines) -**Current state**: Three overlapping detection systems: -1. `src/simd.rs` → `LazyLock` — detects AVX-512/AVX2/NEON -2. `src/hpc/simd_caps.rs` → `CpuCaps` struct — detects same capabilities -3. `src/hpc/simd_dispatch.rs` → `SimdDispatch` — another LazyLock with function pointers +**Current state**: Three overlapping detection systems exist: +1. `src/simd.rs` → `LazyLock` + `cfg(target_feature)` re-exports +2. `src/hpc/simd_caps.rs` → `CpuCaps` struct with runtime detection +3. `src/hpc/simd_dispatch.rs` → `SimdDispatch` with function pointers -**Transform**: +**Reality with target-cpu pinned**: When `.cargo/config.toml` pins +`target-cpu=sapphirerapids`, all `cfg(target_feature = "avx512f")` gates resolve +TRUE at compile time. `simd.rs` re-exports route directly to `simd_avx512.rs`. +The `LazyLock` is dead code (const-folded away). The CpuCaps struct and +SimdDispatch function pointers are completely unreachable. -```rust -// 1. In src/simd.rs, export the unified detection: -pub static CPU_CAPS: LazyLock = LazyLock::new(|| detect_cpu_caps()); - -pub struct CpuCaps { - pub tier: Tier, // existing - pub avx512f: bool, - pub avx512bw: bool, - pub avx512vpopcntdq: bool, - pub avx512vnni: bool, - pub avx2: bool, - pub fma: bool, - pub popcnt: bool, - // ARM - pub neon: bool, - pub sve: bool, -} +**Transform**: -// 2. In src/hpc/simd_dispatch.rs, replace with: -pub use crate::simd::CPU_CAPS; -// Keep function-pointer dispatch but source caps from unified singleton +``` +1. Delete src/hpc/simd_caps.rs (515 lines — all dead under cfg pin) +2. Delete src/hpc/simd_dispatch.rs (362 lines — all dead under cfg pin) +3. Leave src/simd.rs as-is (its cfg gates ARE the dispatch mechanism) +``` -// 3. Delete src/hpc/simd_caps.rs (or make it a thin re-export) +If CI fallback (no target-cpu pin) is needed, gate these behind: +```rust +#[cfg(not(target_feature = "avx512f"))] +mod simd_caps; // only compiles when features aren't pinned ``` -**Result**: One CPUID check, one atomic, one struct — shared by core and all hpc modules. +**Result**: -877 lines of dead code. The dispatch mechanism is `cfg(target_feature)` +in `simd.rs` — no runtime anything. --- @@ -913,8 +912,8 @@ Phase B (2 weeks) — Extension Traits: 2.5 SimdMath trait (VML) (standalone) Phase C (2 weeks) — Backend Wiring: - 3.1 Core sum/mean → SIMD dispatch (depends on 3.2) - 3.2 Unified SIMD detection (standalone, delete duplicates) + 3.1 Core sum/mean → typed SIMD (standalone — calls crate::simd::F32x16 directly) + 3.2 Delete dead detection code (standalone, -877 lines unreachable under cfg pin) 3.3 INT8 Matmul via trait (depends on 1.3 pattern) Phase D (1 week) — View Factories: diff --git a/UNIFIED_REFACTOR_SEQUENCE.md b/UNIFIED_REFACTOR_SEQUENCE.md index 6ad54321..d7835fd5 100644 --- a/UNIFIED_REFACTOR_SEQUENCE.md +++ b/UNIFIED_REFACTOR_SEQUENCE.md @@ -11,6 +11,35 @@ --- +## Dispatch Model (the ground truth) + +All SIMD in this repo resolves via **compile-time `cfg(target_feature)` routing**: + +``` +.cargo/config.toml pins target-cpu=sapphirerapids + → cfg(target_feature = "avx512f") = TRUE at compile time + → simd.rs re-exports resolve DIRECTLY to simd_avx512.rs types + → zero runtime detection, zero branching, zero LazyLock in hot path + +Consumer writes: crate::simd::U64x8 +simd.rs routes: pub use crate::simd_avx512::U64x8; // compile-time, no match +``` + +The `LazyLock` in `simd.rs` exists for the `std` path but is **dead code** +when target features are pinned — the compiler const-folds `detect_tier()` and +the `cfg` gates resolve the re-exports statically. + +**On aarch64**: NEON is mandatory. `cfg(target_arch = "aarch64")` routes to +`simd_neon.rs`. 256-bit types are 2×128-bit paired dispatch (e.g. `F32x16` = +4×`float32x4_t`). No runtime detection needed — NEON is always there. + +**simd_avx2.rs**: Dead code on x86-64-v4. Only reached when +`cfg(not(target_feature = "avx512f"))` — i.e., when someone builds without the +target-cpu pin (CI fallback, cross-compile). Never add new methods to it for +AVX-512 targets; it's a waste. + +--- + ## Wave 0 — Conventions & Foundations (3 days) Unlocks everything else. No code changes to hot paths. Pure contract + tooling. @@ -46,13 +75,14 @@ The lance-graph `simd-savant` agent runs PRE-MERGE against each PR. ### Per-primitive implementation contract Every PR MUST: -1. **Three backends**: AVX-512, NEON, scalar. Missing scalar = P0 reject. +1. **Implement on the backing type in `simd_avx512.rs`** (the only live file on x86-64-v4). NEON impl goes in `simd_neon.rs`. Scalar fallback in the `scalar` module inside `simd.rs`. 2. **Edge-case semantics documented** in doc-comment (`i8::MIN`, empty slices, OOB indices). -3. **Parity test**: all backends produce identical output on randomized corpus including edge cases. +3. **Parity test**: all cfg-routed backends produce identical output on randomized corpus including edge cases. 4. **Bench against scalar**: record AVX-512/NEON speedup ratios in PR body. 5. **`// SAFETY:` on every `unsafe` block**. -6. **No new `is_x86_feature_detected!`** outside `simd_caps.rs`. -7. **Consumer site cited** in PR description. +6. **No `is_x86_feature_detected!` anywhere** — dispatch is at `cfg(target_feature)` in `simd.rs` re-exports, not per-call runtime checks. +7. **No `#[target_feature(enable = ...)]` on functions** — the cargo target-cpu pin handles this globally. +8. **Consumer site cited** in PR description. ### W1.1 — I8x16::from_i4_packed_u64 (nibble unpack + sign extend) @@ -110,7 +140,18 @@ impl U16x8 { pub fn palette_lookup_u8x8(idx_v: U16x8, lut: &[u8]) -> U8x8; ``` -**AVX2**: `_mm256_i32gather_epi32` with index widening + downcast. +**Palette-256 is the dominant use case** — every index fits in u8, table is always +256 or 512 bytes. No bounds risk at palette-256 widths. The gather_u16 API must +handle arbitrary tables too, but palette-256 should be the fast path (no OOB +possible when table.len() == 256 and indices are u8-sourced). + +**Codex P2 fix (gather_u16 OOB)**: `_mm256_i32gather_epi32` reads 4 bytes per slot +from a `&[u16]` table — overreads 2 bytes at `table[len-1]`. For palette-256 this +is harmless (256 × 2 = 512 bytes, 4-byte read at index 255 reads bytes 510-513, +which is within a 512-byte aligned allocation + padding). For arbitrary tables, +use scalar fallback or pad the table allocation. + +**AVX-512**: `_mm512_i32gather_epi32` with index widening + mask to u16. **NEON / Scalar**: loop `(0..8).map(|i| table[indices.lane(i)])`. ### W1.4 — Prefetch hints (cross-arch) @@ -134,12 +175,13 @@ impl U64x8 { /// XOR + lane-wise popcount + horizontal sum. Optimized for Hamming distance. pub fn xor_popcount(self, other: Self) -> u64; } -impl U64x4 { pub fn popcnt(self) -> Self; } // AVX2 parity +impl U64x4 { pub fn popcnt(self) -> Self; } // NEON/scalar parity ``` -**AVX-512 VPOPCNTDQ**: `_mm512_popcnt_epi64` directly (feature `avx512vpopcntdq`). -**AVX-512 without VPOPCNTDQ**: Mula's algorithm via `VPSHUFB` + `VPSADBW` per-byte LUT. -**NEON**: `vcntq_u8` → `vpaddlq_u8` cascade to sum within each u64. +**AVX-512 VPOPCNTDQ**: `_mm512_popcnt_epi64` directly (feature `avx512vpopcntdq` — +available on sapphirerapids, enabled by the target-cpu pin). +**NEON popcount per-u64**: `vcntq_u8` → `vpaddlq_u8` → `vpaddlq_u16` → `vpaddlq_u32` +(NOT `vaddvq_u8` which merges ALL lanes to a single scalar — Codex P2 fix). **Scalar**: `u64::count_ones()` fused loop. ### W1.5+ — Deferred primitives (gated on sigker certification) @@ -170,7 +212,7 @@ Wave 1 primitives become the first consumers of these macros (retrofit optional) | ID | Item | Source | Effort | What It Produces | |----|------|--------|--------|------------------| | W2.1 | **Dtype-parity macro** (`reductions_for!`) | R3.1 | 1d | One line = 7 reductions for a dtype. Cuts 700→150 lines. | -| W2.2 | **Per-arch dispatch macro** (`simd_dispatch!`) | R3.2 | 4h | Eliminates dispatch skeleton copy-paste. | +| W2.2 | **Per-arch impl macro** (`simd_impl!`) | R3.2 | 4h | Generates the struct + methods for a type in both simd_avx512.rs and simd_neon.rs from one body. | | W2.3 | **Reduction kernel template** (`reduce_simd()`) | R3.3 | 4h | Generic chunk-loop; sum/max/nrm2 become 5-line callers. | | W2.4 | **Dual-form fusion** (`kernel_simd_dual!`) | R3.4 | 1d | One body → `_into`, Vec, `_ptr`, all arch variants. | @@ -219,10 +261,17 @@ From REFACTOR_HPC_INTEGRATION.md Tier 3. Core ndarray operations silently accele | ID | Item | Source | Effort | Impact | |----|------|--------|--------|--------| -| W5.1 | **Unified SIMD detection** (merge simd_caps + simd_dispatch into core) | Tier 3.2 | 4h | Deletes 877 lines of duplication | +| W5.1 | **Delete duplicate detection code** (simd_caps + simd_dispatch → dead code under cfg pin) | Tier 3.2 | 4h | Deletes 877 lines that are unreachable when target-cpu is pinned | | W5.2 | **Core sum/mean → SIMD dispatch** | Tier 3.1 | 4h | 16x faster `.sum()` on contiguous f32/f64 | | W5.3 | **SIMD axis reductions** (sum_axis with SIMD lanes) | Tier 6.1 | 1d | ML training hot path | +**Note on W5.1**: With `target-cpu=sapphirerapids` in `.cargo/config.toml`, the +`cfg(target_feature = "avx512f")` branch in `simd.rs` is the only live path. +`hpc/simd_caps.rs` (515 lines) and `hpc/simd_dispatch.rs` (362 lines) exist for +a multi-binary world that doesn't apply when the target is pinned. These can be +deleted or feature-gated behind `cfg(not(target_feature = "avx512f"))` for CI +fallback builds. + **Gate**: `cargo bench` shows measurable improvement on contiguous arrays. Non-contiguous arrays unchanged (fallback to generic fold). @@ -299,14 +348,14 @@ Wave 0 (conventions) │ ├──→ Wave 6.7 (QualiaColumns uses I8x16::from_i4_packed from W1.1) │ └──→ Wave 8.4 (primitive parity benches) │ - ├──→ Wave 2 (codegen macros) + ├���─→ Wave 2 (codegen macros) │ │ │ └──→ Wave 5 (backend wiring uses macros) │ ├──→ Wave 3 (type bridges) │ │ │ ├──→ Wave 4 (extension traits need bridges) - │ ├──→ Wave 5 (backend dispatch needs BlasFloat for BF16) + │ ├���─→ Wave 5 (backend dispatch needs BlasFloat for BF16) │ └──→ Wave 6 (SoA needs Fingerprint↔Array, Arrow views) │ ├──→ Wave 7 (namespace) — independent of 1-6 @@ -320,7 +369,7 @@ Wave 0 (conventions) **For lance-graph (P0 consumer)**: 0 → 1 = **6 days**. Consumer migration PRs unblocked. -**For full release**: 0 → 1 → 3 → 5 → 6 → 8 → 9 = **18 days** serial. +**For full release**: 0 → 1 → 3 → 5 → 6 �� 8 → 9 = **18 days** serial. With parallelism (1∥2, 3∥4, 6∥7, 8∥backlog): **~12 working days**. --- @@ -333,7 +382,7 @@ The 5 SIMD primitives aren't isolated additions — they're load-bearing for lat |-----------|----------------------------------|--------------------------| | `I8x16::from_i4_packed_u64` | `mul.rs::i4_eval::batch` (5 batch fns) | **W6.7 QualiaColumns** — unpack i4 qualia from packed storage without scalar loop | | `I8x16::saturating_abs` | Direction-B fix, ValleyOfDespair classifier | **W4.2 Quantize trait** — safe abs in quantization error metrics | -| `U16x8::gather_u16` | `bgz17/simd.rs` palette lookup | **W6.6 BF16FieldDatabase** — gather exponent fields from lookup table | +| `U16x8::gather_u16` | `bgz17/simd.rs` palette lookup | **W6.6 BF16FieldDatabase** �� gather exponent fields from lookup table | | `prefetch_read_t0/t1/t2` | `bgz17/prefetch.rs` tile prefetch | **W6.2 k0_columnar_simd** — prefetch next column chunk during K0 scan | | `U64x8::popcnt` + `xor_popcount` | `holograph/hamming.rs` + `blasgraph/types.rs` | **W6.2-W6.5 entire SoA cascade** — columnar XOR+popcount is THE operation | @@ -351,12 +400,11 @@ slices, losing the typed-wrapper discipline and duplicating dispatch logic. | `I8x16::from_i4_packed_u64()` for qualia | **Wave 1.1** (SIMD primitives) | | `prefetch_read_t0()` for column prefetch | **Wave 1.4** (SIMD primitives) | | F-order Array2 database | **Wave 3** (type bridges — Fingerprint converts to Array) | -| Dispatch macro for k0_columnar_simd | **Wave 2** (codegen macros) | +| Impl macro for k0_columnar_simd | **Wave 2** (codegen macros) | | `_into` form for columnar kernels | **Wave 0** (signature convention) | | Extension trait: `database.cascade_soa(&query, &gate)` | **Wave 4** (HdcOps trait extended) | | BF16FieldDatabase uses Quantize trait | **Wave 4** (Quantize extension) | | Arrow columns → direct scan | **Wave 3** (Arrow view factories) | -| Unified SIMD detection for dispatch | **Wave 5** (backend wiring) | | Benchmark harness to prove 4-8x | **Wave 8** (bench infrastructure) | --- @@ -437,5 +485,7 @@ that applies recursively across the entire surface: 4. **Don't couple SoA with module restructure** — they're independent; merge separately 5. **Don't break downstream in one shot** — deprecation shims for one release minimum 6. **Don't ship W1a primitives without parity tests** — the codex P2 i8::MIN divergence on PR #398 happened because no such test existed -7. **Don't add `is_x86_feature_detected!` outside simd_caps.rs** — dispatch through the singleton +7. **Don't use `is_x86_feature_detected!` or `#[target_feature(enable=...)]`** — cfg(target_feature) at the re-export level handles dispatch; per-function annotations and per-call runtime checks are wrong 8. **Don't implement W1.5+ (deferred primitives) until sigker certification** — they're gated +9. **Don't add methods to simd_avx2.rs for AVX-512 targets** — it's dead code on x86-64-v4, never reached via cfg routing +10. **Don't use rayon work-stealing** if the type-system integration (Wave 3-4) is the global lever — typed SIMD dispatch across the full surface eliminates the slicing problem that rayon would paper over From a5dba8d7d74c51961f89b36065a3e4ce8138a0dd Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 16 May 2026 23:00:57 +0000 Subject: [PATCH 7/7] docs: strip tutorial filler, keep only integration map + known constraints MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Remove the "Dispatch Model" explainer section, excessive rule expansions, and paragraph-length justifications for obvious things. What remains: - Known constraints list (7 bullets from prior sessions) - Terse per-primitive contract (7 rules, no essays) - 7 don'ts instead of 10 (dropped the ones restating known constraints) - §3.2 reduced to 3 lines (dead code, delete it) - Architecture diagram without dispatch lecture https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo --- REFACTOR_HPC_INTEGRATION.md | 39 +++----------------------- UNIFIED_REFACTOR_SEQUENCE.md | 53 +++++++++++------------------------- 2 files changed, 20 insertions(+), 72 deletions(-) diff --git a/REFACTOR_HPC_INTEGRATION.md b/REFACTOR_HPC_INTEGRATION.md index 2823219f..f1aa3a3c 100644 --- a/REFACTOR_HPC_INTEGRATION.md +++ b/REFACTOR_HPC_INTEGRATION.md @@ -36,14 +36,13 @@ The refactoring creates **bidirectional bridges** without removing the raw layer │ Array, ArrayView, Zip, Broadcasting │ └───────────────┬──────────────────────▲──────────────┘ │ │ - .as_slice() cfg(target_feature) routing - │ (compile-time, zero-cost) + .as_slice() crate::simd::* + │ │ ▼ │ ┌─────────────────────────────────────────────────────┐ │ hpc/ bridge layer (NEW) │ │ Extension traits on ArrayBase │ │ From/Into impls for domain types │ -│ Core reductions call typed SIMD wrappers │ └───────────────┬──────────────────────▲──────────────┘ │ │ delegates to implements @@ -53,15 +52,9 @@ The refactoring creates **bidirectional bridges** without removing the raw layer │ hpc/ raw compute (unchanged) │ │ &[u8], &[u64], Fingerprint, typed SIMD │ │ K0/K1/K2, BF16 GEMM, VNNI, VML │ -│ Uses crate::simd::* → resolves to simd_avx512.rs │ └─────────────────────────────────────────────────────┘ ``` -**Dispatch model**: `crate::simd::U64x8` resolves at compile time via -`cfg(target_feature = "avx512f")` in `src/simd.rs` to `simd_avx512::U64x8`. -No runtime detection, no match, no branching. The target-cpu pin in -`.cargo/config.toml` makes the cfg gate TRUE at compile time. - --- ## Refactoring Shopping List @@ -618,33 +611,9 @@ faster. Zero API change for users. **Files**: `src/hpc/simd_dispatch.rs` (362 lines), `src/hpc/simd_caps.rs` (515 lines) -**Current state**: Three overlapping detection systems exist: -1. `src/simd.rs` → `LazyLock` + `cfg(target_feature)` re-exports -2. `src/hpc/simd_caps.rs` → `CpuCaps` struct with runtime detection -3. `src/hpc/simd_dispatch.rs` → `SimdDispatch` with function pointers - -**Reality with target-cpu pinned**: When `.cargo/config.toml` pins -`target-cpu=sapphirerapids`, all `cfg(target_feature = "avx512f")` gates resolve -TRUE at compile time. `simd.rs` re-exports route directly to `simd_avx512.rs`. -The `LazyLock` is dead code (const-folded away). The CpuCaps struct and -SimdDispatch function pointers are completely unreachable. - -**Transform**: - -``` -1. Delete src/hpc/simd_caps.rs (515 lines — all dead under cfg pin) -2. Delete src/hpc/simd_dispatch.rs (362 lines — all dead under cfg pin) -3. Leave src/simd.rs as-is (its cfg gates ARE the dispatch mechanism) -``` - -If CI fallback (no target-cpu pin) is needed, gate these behind: -```rust -#[cfg(not(target_feature = "avx512f"))] -mod simd_caps; // only compiles when features aren't pinned -``` +Dead under the target-cpu pin. Delete or gate behind `cfg(not(target_feature = "avx512f"))`. -**Result**: -877 lines of dead code. The dispatch mechanism is `cfg(target_feature)` -in `simd.rs` — no runtime anything. +**Result**: -877 lines. --- diff --git a/UNIFIED_REFACTOR_SEQUENCE.md b/UNIFIED_REFACTOR_SEQUENCE.md index d7835fd5..f55b0417 100644 --- a/UNIFIED_REFACTOR_SEQUENCE.md +++ b/UNIFIED_REFACTOR_SEQUENCE.md @@ -11,32 +11,15 @@ --- -## Dispatch Model (the ground truth) +## Known constraints (from prior sessions) -All SIMD in this repo resolves via **compile-time `cfg(target_feature)` routing**: - -``` -.cargo/config.toml pins target-cpu=sapphirerapids - → cfg(target_feature = "avx512f") = TRUE at compile time - → simd.rs re-exports resolve DIRECTLY to simd_avx512.rs types - → zero runtime detection, zero branching, zero LazyLock in hot path - -Consumer writes: crate::simd::U64x8 -simd.rs routes: pub use crate::simd_avx512::U64x8; // compile-time, no match -``` - -The `LazyLock` in `simd.rs` exists for the `std` path but is **dead code** -when target features are pinned — the compiler const-folds `detect_tier()` and -the `cfg` gates resolve the re-exports statically. - -**On aarch64**: NEON is mandatory. `cfg(target_arch = "aarch64")` routes to -`simd_neon.rs`. 256-bit types are 2×128-bit paired dispatch (e.g. `F32x16` = -4×`float32x4_t`). No runtime detection needed — NEON is always there. - -**simd_avx2.rs**: Dead code on x86-64-v4. Only reached when -`cfg(not(target_feature = "avx512f"))` — i.e., when someone builds without the -target-cpu pin (CI fallback, cross-compile). Never add new methods to it for -AVX-512 targets; it's a waste. +- Dispatch resolves at compile time via `cfg(target_feature)` in `simd.rs`. No per-call runtime checks. +- `simd_avx2.rs` is dead on x86-64-v4. Don't add to it. +- NEON does 256-bit as 2×128-bit paired. +- VPABSB does NOT saturate i8::MIN. +- Palette-256 is the dominant gather use case. +- Rayon work-stealing is not the lever if typed SIMD integration is the global lever. +- The `hpc/simd_caps.rs` and `hpc/simd_dispatch.rs` files are dead code under the pin — 877 lines to delete. --- @@ -75,14 +58,13 @@ The lance-graph `simd-savant` agent runs PRE-MERGE against each PR. ### Per-primitive implementation contract Every PR MUST: -1. **Implement on the backing type in `simd_avx512.rs`** (the only live file on x86-64-v4). NEON impl goes in `simd_neon.rs`. Scalar fallback in the `scalar` module inside `simd.rs`. -2. **Edge-case semantics documented** in doc-comment (`i8::MIN`, empty slices, OOB indices). -3. **Parity test**: all cfg-routed backends produce identical output on randomized corpus including edge cases. -4. **Bench against scalar**: record AVX-512/NEON speedup ratios in PR body. +1. **Impl in `simd_avx512.rs`** + **`simd_neon.rs`** + scalar fallback. +2. **Edge-case semantics documented** (`i8::MIN`, empty slices, OOB indices). +3. **Parity test**: all backends produce identical output on randomized corpus. +4. **Bench against scalar**: speedup ratios in PR body. 5. **`// SAFETY:` on every `unsafe` block**. -6. **No `is_x86_feature_detected!` anywhere** — dispatch is at `cfg(target_feature)` in `simd.rs` re-exports, not per-call runtime checks. -7. **No `#[target_feature(enable = ...)]` on functions** — the cargo target-cpu pin handles this globally. -8. **Consumer site cited** in PR description. +6. **No `is_x86_feature_detected!`**. No `#[target_feature(enable=...)]`. +7. **Consumer site cited** in PR description. ### W1.1 — I8x16::from_i4_packed_u64 (nibble unpack + sign extend) @@ -484,8 +466,5 @@ that applies recursively across the entire surface: 3. **Don't gate SoA behind a feature flag** — it's the default hot path, not optional 4. **Don't couple SoA with module restructure** — they're independent; merge separately 5. **Don't break downstream in one shot** — deprecation shims for one release minimum -6. **Don't ship W1a primitives without parity tests** — the codex P2 i8::MIN divergence on PR #398 happened because no such test existed -7. **Don't use `is_x86_feature_detected!` or `#[target_feature(enable=...)]`** — cfg(target_feature) at the re-export level handles dispatch; per-function annotations and per-call runtime checks are wrong -8. **Don't implement W1.5+ (deferred primitives) until sigker certification** — they're gated -9. **Don't add methods to simd_avx2.rs for AVX-512 targets** — it's dead code on x86-64-v4, never reached via cfg routing -10. **Don't use rayon work-stealing** if the type-system integration (Wave 3-4) is the global lever — typed SIMD dispatch across the full surface eliminates the slicing problem that rayon would paper over +6. **Don't ship W1a primitives without parity tests** — the codex P2 i8::MIN divergence happened because no test existed +7. **Don't implement W1.5+ (deferred primitives) until sigker certification** — they're gated