diff --git a/REFACTOR_HPC_INTEGRATION.md b/REFACTOR_HPC_INTEGRATION.md new file mode 100644 index 00000000..f1aa3a3c --- /dev/null +++ b/REFACTOR_HPC_INTEGRATION.md @@ -0,0 +1,974 @@ +# ndarray hpc/ Module Integration & Transformation Plan + +> **Goal**: Transform hpc/ from a bolted-on raw-slice library into a first-class +> citizen of ndarray's type system — without breaking any existing API. +> +> **Principle**: Every refactoring adds an ndarray-native overload or trait impl. +> Raw-slice functions stay (FFI, embedded, zero-overhead). No deletion. + +--- + +## Architectural Diagnosis + +The hpc/ module (80K lines, 90+ files) currently operates in two disconnected worlds: + +``` +┌─────────────────────────────────────────────────────┐ +│ ndarray core │ +│ Array, ArrayView, Zip, Broadcasting, │ +│ .sum(), .dot(), mapv(), LinalgScalar, BlasFloat │ +└───────────────┬─────────────────────────────────────┘ + │ .as_slice() ← the only bridge + ▼ +┌─────────────────────────────────────────────────────┐ +│ hpc/ module │ +│ &[u8], &[u64], &[f32], *const u8, Vec │ +│ Fingerprint, VsaVector, Cascade, EnergyConflict │ +│ Custom SIMD dispatch, manual row/col params │ +└─────────────────────────────────────────────────────┘ +``` + +The refactoring creates **bidirectional bridges** without removing the raw layer: + +``` +┌─────────────────────────────────────────────────────┐ +│ ndarray core (unchanged) │ +│ Array, ArrayView, Zip, Broadcasting │ +└───────────────┬──────────────────────▲──────────────┘ + │ │ + .as_slice() crate::simd::* + │ │ + ▼ │ +┌─────────────────────────────────────────────────────┐ +│ hpc/ bridge layer (NEW) │ +│ Extension traits on ArrayBase │ +│ From/Into impls for domain types │ +└───────────────┬──────────────────────▲──────────────┘ + │ │ + delegates to implements + │ │ + ▼ │ +┌─────────────────────────────────────────────────────┐ +│ hpc/ raw compute (unchanged) │ +│ &[u8], &[u64], Fingerprint, typed SIMD │ +│ K0/K1/K2, BF16 GEMM, VNNI, VML │ +└─────────────────────────────────────────────────────┘ +``` + +--- + +## Refactoring Shopping List + +### Tier 1: Type Bridge — Domain Types ↔ Array (Low effort, High impact) + +These add conversion traits so domain types can participate in ndarray's ecosystem. + +--- + +#### 1.1 Fingerprint ↔ ArrayView1 + +**File**: `src/hpc/fingerprint.rs` + +**Current state**: `Fingerprint` wraps `[u64; N]`. Provides `.as_bytes()` via +unsafe pointer cast. No connection to ndarray. + +**Transform**: + +```rust +// --- Zero-copy views (no allocation) --- + +impl Fingerprint { + /// View this fingerprint as an ndarray 1-D array of u64 words. + /// Zero-copy: borrows the internal buffer. + pub fn as_array_view(&self) -> ArrayView1<'_, u64> { + ArrayView1::from_shape(N, &self.words).unwrap() + } + + /// Mutable view for in-place operations. + pub fn as_array_view_mut(&mut self) -> ArrayViewMut1<'_, u64> { + ArrayViewMut1::from_shape(N, &mut self.words).unwrap() + } + + /// View as 2-D matrix (e.g., 256 = 32×8 for tiled operations). + pub fn as_matrix_view(&self, rows: usize, cols: usize) -> ArrayView2<'_, u64> { + assert_eq!(rows * cols, N); + ArrayView2::from_shape((rows, cols), &self.words).unwrap() + } +} + +// --- Owned conversions (allocation on From, zero-copy on Into) --- + +impl From> for Fingerprint { + fn from(arr: Array1) -> Self { + assert_eq!(arr.len(), N); + let mut words = [0u64; N]; + words.copy_from_slice(arr.as_slice().unwrap()); + Self { words } + } +} + +impl From> for Array1 { + fn from(fp: Fingerprint) -> Self { + Array1::from_vec(fp.words.to_vec()) + } +} +``` + +**Why**: Callers can now pass fingerprints to any function expecting `ArrayView1`. +Enables `.sum()`, `.mapv()`, Zip, broadcasting on fingerprint data. + +**Example — before vs. after**: + +```rust +// BEFORE: raw slice, manual loop +let dist = k2_exact(&fp_a.words, &fp_b.words, 256); + +// AFTER: can also use ndarray ops directly +let view_a = fp_a.as_array_view(); +let view_b = fp_b.as_array_view(); +let xor_popcount: u64 = (&view_a ^ &view_b).mapv(|w| w.count_ones() as u64).sum(); +``` + +--- + +#### 1.2 VsaVector ↔ ArrayView1 + +**File**: `src/hpc/vsa.rs` + +**Same pattern as 1.1** — VsaVector has `words: [u64; 256]`. Add: + +```rust +impl VsaVector { + pub fn as_array_view(&self) -> ArrayView1<'_, u64> { + ArrayView1::from_shape(VSA_WORDS, &self.words).unwrap() + } + pub fn as_array_view_mut(&mut self) -> ArrayViewMut1<'_, u64> { + ArrayViewMut1::from_shape(VSA_WORDS, &mut self.words).unwrap() + } +} + +impl From> for VsaVector { ... } +impl From for Array1 { ... } +``` + +--- + +#### 1.3 BF16 / F16 as ndarray Element Types + +**Files**: `src/hpc/quantized.rs`, `src/backend/mod.rs` + +**Current state**: BF16 and F16 implement `ScalarOperand` (can appear in arrays) +but NOT `BlasFloat` (cannot use `.dot()`, BLAS traits). They're second-class citizens. + +**Transform** — Register BF16/F16 for BLAS dispatch: + +```rust +// In src/backend/mod.rs or src/hpc/quantized.rs: + +impl BlasFloat for BF16 { + fn backend_dot(xs: &[Self], ys: &[Self]) -> Self { + // Accumulate in f32, return as BF16 + let mut acc = 0.0f32; + for (&x, &y) in xs.iter().zip(ys.iter()) { + acc += x.to_f32() * y.to_f32(); + } + BF16::from_f32(acc) + } + + fn backend_gemm(m: usize, n: usize, k: usize, alpha: Self, + a: &[Self], lda: usize, b: &[Self], ldb: usize, + beta: Self, c: &mut [Self], ldc: usize) { + // Delegate to existing bf16_gemm_f32, convert output back + crate::hpc::quantized::bf16_gemm_f32( + a, b, + /* c_f32 temp */ ..., + m, n, k, alpha.to_f32(), beta.to_f32() + ); + // Convert f32 results back to BF16 in c + } + + fn backend_axpy(alpha: Self, x: &[Self], y: &mut [Self]) { ... } + fn backend_scal(alpha: Self, x: &mut [Self]) { ... } + fn backend_nrm2(x: &[Self]) -> Self { ... } + fn backend_asum(x: &[Self]) -> Self { ... } +} +``` + +**Impact**: Unlocks this: + +```rust +let a: Array2 = ...; +let b: Array2 = ...; +let c = a.blas_gemm(BF16::one(), &b, BF16::zero()); // Just works +``` + +--- + +#### 1.4 CogRecord ↔ Structured ArrayView + +**File**: `src/hpc/cogrecord.rs` + +**Current state**: CogRecord has four `Array` fields (btree, cam, embed, meta) +of 4096 bytes each. No combined view. + +**Transform**: + +```rust +impl CogRecord { + /// View all 4 channels as a single [4, 4096] matrix. + pub fn as_channels_view(&self) -> ArrayView2<'_, u8> { + // Stack the 4 contiguous buffers into a 2-D view + let total = self.btree.len() + self.cam.len() + self.embed.len() + self.meta.len(); + debug_assert_eq!(total, 4 * CONTAINER_BYTES); + // NOTE: only works if fields are contiguous in memory. + // If not, provide a copy-based `to_channels_array()`. + todo!("requires contiguous layout guarantee") + } + + /// Get a single channel as ArrayView1 (for Hamming/XOR ops). + pub fn channel_as_words(&self, channel: usize) -> ArrayView1<'_, u64> { + let bytes = match channel { + 0 => self.btree.as_slice().unwrap(), + 1 => self.cam.as_slice().unwrap(), + 2 => self.embed.as_slice().unwrap(), + 3 => self.meta.as_slice().unwrap(), + _ => panic!("channel must be 0..4"), + }; + // Safe: u8 slice reinterpreted as u64 (alignment checked) + let words = unsafe { + std::slice::from_raw_parts( + bytes.as_ptr() as *const u64, + bytes.len() / 8, + ) + }; + ArrayView1::from_shape(words.len(), words).unwrap() + } +} +``` + +--- + +### Tier 2: Trait Extensions — hpc Operations on Array Types (Medium effort, High impact) + +Add extension traits so hpc operations feel native on ndarray arrays. + +--- + +#### 2.1 HDC Extension Trait (Hamming, Cascade, K0/K1/K2) + +**New file**: `src/hpc/ndarray_ext.rs` (or extend `bitwise.rs`) + +**Reasoning**: The K0/K1/K2 pipeline currently takes `&[u64]`. Users with +`Array1` must manually call `.as_slice().unwrap()`. This trait bridges that. + +```rust +use crate::imp_prelude::*; +use super::kernels::{k0_probe, k1_stats, k2_exact, SliceGate, EnergyConflict}; + +/// HDC (Hyperdimensional Computing) operations on u64 word arrays. +/// +/// Provides the K0/K1/K2 rejection pipeline directly on ndarray types. +pub trait HdcOps { + /// K0 probe: fast 64-bit reject. + /// Returns true if the first word passes the gate threshold. + fn k0_probe(&self, candidate: &Self, gate: &SliceGate) -> bool; + + /// K1 stats: 512-bit medium reject. + /// Returns true if the first 8 words pass the gate threshold. + fn k1_stats(&self, candidate: &Self, gate: &SliceGate) -> bool; + + /// K2 exact: Full-width Hamming + energy decomposition. + fn k2_exact(&self, candidate: &Self) -> EnergyConflict; + + /// Full cascade: K0 → K1 → K2 with early rejection. + fn cascade_distance(&self, candidate: &Self, gate: &SliceGate) -> Option; +} + +impl HdcOps for ArrayBase +where + S: Data, +{ + fn k0_probe(&self, candidate: &Self, gate: &SliceGate) -> bool { + k0_probe(self[0], candidate[0], gate) + } + + fn k1_stats(&self, candidate: &Self, gate: &SliceGate) -> bool { + let q = self.as_slice().expect("query must be contiguous"); + let c = candidate.as_slice().expect("candidate must be contiguous"); + k1_stats(q, c, gate) + } + + fn k2_exact(&self, candidate: &Self) -> EnergyConflict { + let q = self.as_slice().expect("query must be contiguous"); + let c = candidate.as_slice().expect("candidate must be contiguous"); + k2_exact(q, c, q.len()) + } + + fn cascade_distance(&self, candidate: &Self, gate: &SliceGate) -> Option { + if !self.k0_probe(candidate, gate) { return None; } + if !self.k1_stats(candidate, gate) { return None; } + Some(self.k2_exact(candidate)) + } +} +``` + +**Example usage after**: + +```rust +use ndarray::hpc::ndarray_ext::HdcOps; + +let query: Array1 = fp_query.into(); +let candidate: Array1 = fp_candidate.into(); +let gate = SliceGate::for_sku16k(); + +// Clean ndarray-native API +if let Some(result) = query.cascade_distance(&candidate, &gate) { + println!("Hamming: {}, Agreement: {}", result.conflict, result.agreement); +} +``` + +--- + +#### 2.2 Quantization Extension Trait + +**File**: extend `src/hpc/quantized.rs` + +**Reasoning**: Quantize/dequantize currently returns `Vec`. Should return `Array1`. + +```rust +/// Quantization operations on f32 arrays. +pub trait Quantize { + /// Quantize to BF16 (element-wise truncation). + fn to_bf16(&self) -> Array; + + /// Quantize to symmetric INT8 with scale factor. + fn to_i8_symmetric(&self) -> (Array, QuantParams); + + /// Quantize per-channel (rows) to INT8. + fn to_i8_per_channel(&self) -> (Array2, PerChannelQuantParams) + where + Self: AsArray<'_, f32, Ix2>; // Only for 2-D +} + +/// Dequantization operations. +pub trait Dequantize { + /// Dequantize INT8 codes back to f32. + fn dequantize_f32(&self, params: &QuantParams) -> Array; +} + +impl Quantize for ArrayBase +where + S: Data, +{ + fn to_bf16(&self) -> Array { + let slice = self.as_slice().expect("must be contiguous"); + let mut out = vec![BF16::ZERO; slice.len()]; + f32_to_bf16_slice(slice, &mut out); // existing raw function + Array1::from_vec(out) + } + + fn to_i8_symmetric(&self) -> (Array, QuantParams) { + let slice = self.as_slice().expect("must be contiguous"); + let (vec, params) = quantize_f32_to_i8(slice); // existing raw function + (Array1::from_vec(vec), params) + } +} +``` + +--- + +#### 2.3 Prefilter — Accept ArrayView2 Instead of (slice, rows, cols) + +**File**: `src/hpc/prefilter.rs` + +**Current state** (lines 48-295): +```rust +pub fn approx_mean_std_f32(data: &[f32]) -> (f32, f32) +pub fn approx_row_norms_f32(data: &[f32], rows: usize, cols: usize) -> Vec +pub fn top_k_rows_by_norm(data: &[f32], rows: usize, cols: usize, k: usize) -> Vec +pub fn approx_column_std(data: &[f32], rows: usize, cols: usize) -> Vec +``` + +**Transform** — add ndarray overloads that delegate to existing implementations: + +```rust +/// Array-native overloads for prefilter operations. +pub trait Prefilter { + /// Approximate mean and standard deviation (SIMD). + fn approx_mean_std(&self) -> (f32, f32); + + /// L2 norm per row (SIMD). + fn row_norms(&self) -> Array1; + + /// Indices of top-K rows by L2 norm. + fn top_k_by_norm(&self, k: usize) -> Array1; + + /// Standard deviation per column. + fn column_std(&self) -> Array1; +} + +impl Prefilter for ArrayBase +where + S: Data, +{ + fn approx_mean_std(&self) -> (f32, f32) { + let slice = self.as_slice().expect("must be contiguous"); + approx_mean_std_f32(slice) + } + + fn row_norms(&self) -> Array1 { + let slice = self.as_slice().expect("must be contiguous"); + let (rows, cols) = self.dim(); + Array1::from_vec(approx_row_norms_f32(slice, rows, cols)) + } + + fn top_k_by_norm(&self, k: usize) -> Array1 { + let slice = self.as_slice().expect("must be contiguous"); + let (rows, cols) = self.dim(); + Array1::from_vec(top_k_rows_by_norm(slice, rows, cols, k)) + } + + fn column_std(&self) -> Array1 { + let slice = self.as_slice().expect("must be contiguous"); + let (rows, cols) = self.dim(); + Array1::from_vec(approx_column_std(slice, rows, cols)) + } +} +``` + +--- + +#### 2.4 Activations — Extend to N-D with Axis + +**File**: `src/hpc/activations.rs` + +**Current state**: Trait only works on `ArrayBase`. No per-row softmax. + +**Transform**: + +```rust +/// Multi-dimensional activations (applied along an axis). +pub trait ActivationsNd { + /// Softmax along the given axis. + /// For a [batch, features] matrix with axis=1, computes softmax per row. + fn softmax_axis(&self, axis: Axis) -> Array; + + /// Log-softmax along the given axis. + fn log_softmax_axis(&self, axis: Axis) -> Array; +} + +impl ActivationsNd for ArrayBase +where + A: Float + 'static, + S: Data, +{ + fn softmax_axis(&self, axis: Axis) -> Array { + let mut result = self.to_owned(); + for mut row in result.axis_iter_mut(axis) { + let max_val = row.iter().fold(A::neg_infinity(), |a, &b| a.max(b)); + row.mapv_inplace(|v| (v - max_val).exp()); + let sum: A = row.iter().fold(A::zero(), |acc, &v| acc + v); + row.mapv_inplace(|v| v / sum); + } + result + } + + fn log_softmax_axis(&self, axis: Axis) -> Array { + let mut result = self.to_owned(); + for mut row in result.axis_iter_mut(axis) { + let max_val = row.iter().fold(A::neg_infinity(), |a, &b| a.max(b)); + row.mapv_inplace(|v| v - max_val); + let log_sum_exp = row.mapv(|v| v.exp()) + .iter().fold(A::zero(), |acc, &v| acc + v).ln(); + row.mapv_inplace(|v| v - log_sum_exp); + } + result + } +} +``` + +**Impact**: Enables `batch_logits.softmax_axis(Axis(1))` — critical for inference. + +--- + +#### 2.5 VML — SIMD Transcendentals as mapv Acceleration + +**File**: `src/hpc/vml.rs` (2543 lines) + +**Current state**: 12+ standalone functions like `vsexp(x: &[f32], out: &mut [f32])`. + +**Transform** — add `ArrayBase` methods that auto-dispatch to SIMD: + +```rust +/// SIMD-accelerated element-wise math on contiguous f32 arrays. +/// +/// These call the VML SIMD kernels when the array is contiguous, +/// falling back to scalar mapv() otherwise. +pub trait SimdMath { + fn simd_exp(&self) -> Array; + fn simd_ln(&self) -> Array; + fn simd_sqrt(&self) -> Array; + fn simd_sin(&self) -> Array; + fn simd_cos(&self) -> Array; + fn simd_tanh(&self) -> Array; + fn simd_abs(&self) -> Array; +} + +impl SimdMath for ArrayBase +where + S: Data, + D: Dimension, +{ + fn simd_exp(&self) -> Array { + if let Some(slice) = self.as_slice() { + let mut out = vec![0.0f32; slice.len()]; + vsexp(slice, &mut out); // existing SIMD kernel + Array::from_shape_vec(self.raw_dim(), out).unwrap() + } else { + self.mapv(|x| x.exp()) + } + } + + fn simd_ln(&self) -> Array { + if let Some(slice) = self.as_slice() { + let mut out = vec![0.0f32; slice.len()]; + vsln(slice, &mut out); + Array::from_shape_vec(self.raw_dim(), out).unwrap() + } else { + self.mapv(|x| x.ln()) + } + } + // ... same pattern for sqrt, sin, cos, tanh, abs +} +``` + +**Benchmark expectation**: 10–50x for transcendentals (polynomial SIMD vs scalar libm). + +--- + +### Tier 3: Backend Integration — Core Dispatch to SIMD (Medium effort, Very High impact) + +Make ndarray's **built-in** operations (`.sum()`, `.mean()`, `.dot()`) automatically +use hpc SIMD kernels for contiguous arrays. + +--- + +#### 3.1 Wire Reductions into Core + +**Files**: `src/numeric/impl_numeric.rs` (core sum/mean) + `src/hpc/reductions.rs` + +**Current state**: `Array.sum()` uses `numeric_util::unrolled_fold()` — a scalar +4x unrolled loop. `hpc::reductions::sum_f32()` uses F32x16 (16-lane AVX-512) +with 4-accumulator ILP. They never talk to each other. + +**Transform** — add backend dispatch in core's sum(): + +```rust +// In src/numeric/impl_numeric.rs or src/backend/native.rs: + +impl ArrayBase +where + A: Clone + num_traits::Zero + std::ops::Add, + S: Data, + D: Dimension, +{ + pub fn sum(&self) -> A { + // Fast path: contiguous f32 → SIMD reduction + if std::any::TypeId::of::() == std::any::TypeId::of::() { + if let Some(slice) = self.as_slice_memory_order() { + // SAFETY: TypeId check guarantees A == f32 + let f32_slice = unsafe { + std::slice::from_raw_parts(slice.as_ptr() as *const f32, slice.len()) + }; + let result = crate::hpc::reductions::sum_f32(f32_slice); + // SAFETY: result is f32, A is f32 + return unsafe { std::mem::transmute_copy(&result) }; + } + } + if std::any::TypeId::of::() == std::any::TypeId::of::() { + if let Some(slice) = self.as_slice_memory_order() { + let f64_slice = unsafe { + std::slice::from_raw_parts(slice.as_ptr() as *const f64, slice.len()) + }; + let result = crate::hpc::reductions::sum_f64(f64_slice); + return unsafe { std::mem::transmute_copy(&result) }; + } + } + // Fallback: generic fold + self.fold(A::zero(), |acc, &x| acc + x) + } +} +``` + +**Impact**: Every `.sum()` on a contiguous f32/f64 array silently becomes 16x +faster. Zero API change for users. + +**Same pattern for**: `.mean()`, internal `max()`/`min()` in softmax paths. + +--- + +#### 3.2 Delete Dead SIMD Detection Code + +**Files**: `src/hpc/simd_dispatch.rs` (362 lines), `src/hpc/simd_caps.rs` (515 lines) + +Dead under the target-cpu pin. Delete or gate behind `cfg(not(target_feature = "avx512f"))`. + +**Result**: -877 lines. + +--- + +#### 3.3 VNNI GEMM via BlasFloat Dispatch + +**Files**: `src/hpc/vnni_gemm.rs`, `src/hpc/quantized.rs:618`, `src/backend/mod.rs` + +**Current state**: `int8_gemm_vnni()` takes raw `&[u8], &[i8], &mut [i32]` with manual +m/n/k parameters. Not accessible via `Array.dot()` or `blas_gemm()`. + +**Transform** — wire INT8 GEMM into the backend trait system: + +```rust +// New trait in src/backend/mod.rs: +pub trait BlasInt8 { + /// INT8 GEMM: C_i32 = A_u8 × B_i8 + fn backend_gemm_i8( + m: usize, n: usize, k: usize, + a: &[u8], lda: usize, + b: &[i8], ldb: usize, + c: &mut [i32], ldc: usize, + ); +} + +// Extension trait on Array: +pub trait Int8Matmul { + /// Quantized matrix multiply: self (u8) × rhs (i8) → Array2 + fn matmul_i8(&self, rhs: &ArrayView2) -> Array2; +} + +impl Int8Matmul for ArrayBase +where + S: Data, +{ + fn matmul_i8(&self, rhs: &ArrayView2) -> Array2 { + let (m, k) = self.dim(); + let (k2, n) = rhs.dim(); + assert_eq!(k, k2, "inner dimensions must match"); + + let a_slice = self.as_slice().expect("A must be contiguous"); + let b_slice = rhs.as_slice().expect("B must be contiguous"); + let mut c = vec![0i32; m * n]; + + crate::hpc::vnni_gemm::int8_gemm_vnni(a_slice, b_slice, &mut c, m, n, k); + Array2::from_shape_vec((m, n), c).unwrap() + } +} +``` + +**Example**: + +```rust +let weights: Array2 = load_quantized_weights(); +let activations: Array2 = quantize_input(&input); +let output: Array2 = activations.matmul_i8(&weights.view()); +// 4× throughput vs f32 GEMM +``` + +--- + +### Tier 4: View Factories — Arrow & External Data (Low effort, Medium impact) + +Make hpc's data ingestion return ndarray views instead of raw slices. + +--- + +#### 4.1 Arrow Bridge → ArrayView + +**File**: `src/hpc/arrow_bridge.rs` + +**Current state**: `binary_entry()` returns `&[u8]`. `SoakingBuffer` provides +`as_columnar_slice()` as `&[i8]`. + +**Transform**: + +```rust +impl ArrowBridge { + /// Get one binary entry as a 1-D array view. + pub fn binary_entry_view(&self, idx: usize) -> ArrayView1<'_, u8> { + let bytes = self.binary_entry(idx); + ArrayView1::from_shape(bytes.len(), bytes).unwrap() + } + + /// Get the full binary column as a 2-D view [n_rows, entry_bytes]. + pub fn binary_column_view(&self) -> ArrayView2<'_, u8> { + let data = self.as_binary_slice(); + let entry_len = self.entry_len(); + let n_rows = data.len() / entry_len; + ArrayView2::from_shape((n_rows, entry_len), data).unwrap() + } +} + +impl SoakingBuffer { + /// View the columnar buffer as a 2-D matrix [n_entries, n_dims]. + pub fn as_array_view_2d(&self) -> ArrayView2<'_, i8> { + let data = self.as_columnar_slice(); + ArrayView2::from_shape((self.n_entries, self.n_dims), data).unwrap() + } +} +``` + +**Impact**: Downstream code can use ndarray broadcasting, slicing, and Zip +on Arrow data without copying. + +--- + +#### 4.2 Distance Functions → Return Array1 + +**File**: `src/hpc/distance.rs` + +**Current state**: Returns `Vec`. + +**Transform**: + +```rust +// Keep existing: +pub fn squared_distances_f32(query: [f32; 3], points: &[[f32; 3]]) -> Vec { ... } + +// Add ndarray overload: +pub fn squared_distances_array(query: &ArrayView1, points: &ArrayView2) -> Array1 { + assert_eq!(query.len(), points.ncols()); + let n = points.nrows(); + let mut result = Array1::zeros(n); + for (i, row) in points.axis_iter(Axis(0)).enumerate() { + let diff = &row - query; + result[i] = diff.mapv(|x| x * x).sum(); + } + result +} +``` + +--- + +### Tier 5: Zip & Parallel — Modernize Inner Loops (Medium effort, Medium impact) + +Replace manual `iter_mut().zip()` with ndarray's `Zip` for vectorization hints +and future `par_for_each()` support. + +--- + +#### 5.1 VML Tail Loops → Zip + +**File**: `src/hpc/vml.rs` — 12+ sites + +**Current pattern** (repeated throughout): +```rust +for (o, &v) in out[tail_start..].iter_mut().zip(x[tail_start..].iter()) { + *o = v.exp(); +} +``` + +**Transform**: +```rust +use crate::Zip; + +let out_view = ArrayViewMut1::from_shape(n - tail_start, &mut out[tail_start..]).unwrap(); +let x_view = ArrayView1::from_shape(n - tail_start, &x[tail_start..]).unwrap(); +Zip::from(&mut out_view).and(&x_view).for_each(|o, &v| { + *o = v.exp(); +}); +``` + +**Benefit**: Zip provides SIMD-friendlier iteration order and enables +future `par_for_each()` without restructuring. + +--- + +#### 5.2 Hamming Batch → Parallel Zip + +**File**: `src/hpc/bitwise.rs` lines 320-333 + +**Current state**: Sequential loop over candidates. + +**Transform** (feature-gated on `rayon`): + +```rust +#[cfg(feature = "rayon")] +pub fn hamming_batch_parallel( + query: &ArrayView1, + database: &ArrayView2, // [n_candidates, vec_len] +) -> Array1 { + use rayon::prelude::*; + let q = query.as_slice().unwrap(); + let results: Vec = database + .axis_iter(Axis(0)) + .into_par_iter() + .map(|row| { + let r = row.as_slice().unwrap(); + dispatch_hamming(q, r) + }) + .collect(); + Array1::from_vec(results) +} +``` + +--- + +### Tier 6: Multi-Dimensional Reductions (High effort, High impact for ML) + +Lift 1-D slice operations to full N-D axis reductions. + +--- + +#### 6.1 SIMD Axis Reductions + +**File**: `src/hpc/reductions.rs` + +**Current state**: Only 1-D: `sum_f32(&[f32])`, `mean_f32(&[f32])`, etc. + +**Transform**: + +```rust +/// Sum along an axis, using SIMD for contiguous lanes. +pub fn sum_axis_f32(arr: &ArrayBase, axis: Axis) -> Array +where + S: Data, + D: Dimension + RemoveAxis, +{ + let mut result = Array::zeros(arr.raw_dim().remove_axis(axis)); + for (i, lane) in arr.lanes(axis).into_iter().enumerate() { + if let Some(slice) = lane.as_slice() { + result.as_slice_mut().unwrap()[i] = sum_f32(slice); + } else { + result.as_slice_mut().unwrap()[i] = lane.iter().fold(0.0, |a, &b| a + b); + } + } + result +} +``` + +--- + +#### 6.2 Statistics Trait — Generalize to N-D + +**File**: `src/hpc/statistics.rs` + +**Transform**: Make `variance`, `std_dev`, `percentile` work along axes: + +```rust +pub trait StatisticsNd { + type Reduced; + + fn variance_axis(&self, axis: Axis, ddof: usize) -> Self::Reduced; + fn std_axis(&self, axis: Axis, ddof: usize) -> Self::Reduced; + fn percentile_axis(&self, axis: Axis, q: f64) -> Self::Reduced; +} +``` + +--- + +## Execution Order & Dependencies + +``` +Phase A (2 weeks) — Type Bridges: + 1.1 Fingerprint ↔ ArrayView1 (standalone, no deps) + 1.2 VsaVector ↔ ArrayView1 (standalone) + 1.3 BF16/F16 → BlasFloat (requires backend/mod.rs edit) + 1.4 CogRecord channel views (standalone) + +Phase B (2 weeks) — Extension Traits: + 2.1 HdcOps trait (K0/K1/K2) (depends on 1.1) + 2.2 Quantize trait (depends on 1.3) + 2.3 Prefilter trait (standalone) + 2.4 ActivationsNd (axis softmax) (standalone) + 2.5 SimdMath trait (VML) (standalone) + +Phase C (2 weeks) — Backend Wiring: + 3.1 Core sum/mean → typed SIMD (standalone — calls crate::simd::F32x16 directly) + 3.2 Delete dead detection code (standalone, -877 lines unreachable under cfg pin) + 3.3 INT8 Matmul via trait (depends on 1.3 pattern) + +Phase D (1 week) — View Factories: + 4.1 Arrow → ArrayView (standalone) + 4.2 Distance → Array1 returns (standalone) + +Phase E (1 week) — Modernize Loops: + 5.1 VML tails → Zip (standalone) + 5.2 Hamming batch → parallel (standalone) + +Phase F (2 weeks) — N-D Reductions: + 6.1 SIMD axis reductions (depends on 3.1) + 6.2 Statistics N-D (depends on 6.1) +``` + +--- + +## Rules of Engagement + +1. **No deletion** — raw-slice functions stay. They serve FFI, embedded, and hot-path callers + who've already pre-validated contiguity. + +2. **Extension traits, not modifications** — new traits in new files or at the bottom of + existing files. Existing public APIs unchanged. + +3. **Contiguous fast-path, generic fallback** — every new trait method checks + `as_slice()` first. If contiguous → SIMD dispatch. If not → `mapv()`/`iter()`. + +4. **Feature-gate expensive additions** — rayon parallel (`rayon` feature), + BlasFloat for BF16 (`bf16-blas` feature if controversial). + +5. **Tests mirror existing** — each new trait method gets the same test coverage + as the raw-slice version it wraps. + +6. **Naming convention** — ndarray-native methods use `snake_case` matching ndarray + style (`.softmax_axis()`, `.simd_exp()`). Existing raw functions keep their + current names (`softmax_f32()`, `vsexp()`). + +--- + +## Impact Matrix + +| Item | Lines Changed | New Lines | Performance Impact | Ergonomics Impact | +|------|--------------|-----------|-------------------|-------------------| +| 1.1 Fingerprint bridge | 0 | ~40 | Zero (zero-copy) | High — unblocks Zip/broadcast | +| 1.3 BF16 BlasFloat | ~10 | ~80 | Enables BF16 GEMM via .dot() | Very High — removes type barrier | +| 2.1 HdcOps trait | 0 | ~60 | Zero (delegates) | High — clean pipeline API | +| 2.5 SimdMath (VML) | 0 | ~100 | 10-50x for exp/ln/sin | Very High — transparent SIMD | +| 3.1 Core sum → SIMD | ~20 | ~40 | 16x for f32 sum/mean | Zero (invisible) | +| 3.2 Unified detection | -877 | ~50 | Zero (dedup) | N/A (internal) | +| 3.3 INT8 matmul trait | 0 | ~60 | 4x over f32 GEMM | High — quantized inference | +| 5.2 Parallel hamming | 0 | ~30 | N×cores speedup | Medium | +| 6.1 SIMD axis reductions | 0 | ~80 | 16x per lane | High — ML training | + +--- + +## What This Achieves + +After this refactoring, hpc/ is no longer a "second library living inside ndarray." +It becomes the **performance substrate** that ndarray's own operations dispatch to: + +```rust +// User code — completely ndarray-native, no hpc imports needed: +let a: Array2 = load_embeddings(); +let b: Array2 = load_queries(); + +let sims = a.dot(&b.t()); // → routes to SIMD GEMM internally +let topk = sims.sum_axis(Axis(1)); // → routes to SIMD sum internally +let probs = sims.softmax_axis(Axis(1)); // → routes to SIMD exp internally + +// User code — HDC domain, imports hpc trait: +use ndarray::hpc::ndarray_ext::HdcOps; +let fp: Array1 = Fingerprint::<256>::random(42).into(); +let candidates: Array2 = load_database(); + +for row in candidates.axis_iter(Axis(0)) { + if let Some(ec) = fp.cascade_distance(&row, &gate) { + hits.push(ec); + } +} +``` + +The raw-slice layer (`hpc::kernels::k2_exact(&[u64], ...)`) remains for: +- FFI exports (Python, C) +- Embedded/no-std contexts +- Callers who've already validated layout and want zero overhead +- Benchmarks that isolate kernel performance from array metadata + +Both worlds coexist. The bridge is the extension traits. diff --git a/SOA_KERNEL_ARCHITECTURE.md b/SOA_KERNEL_ARCHITECTURE.md new file mode 100644 index 00000000..384023b3 --- /dev/null +++ b/SOA_KERNEL_ARCHITECTURE.md @@ -0,0 +1,478 @@ +# SoA Kernel Architecture: The Transposed Cascade + +> **The insight**: Don't iterate candidates and probe words. +> Iterate words and filter candidates. +> +> The database layout IS the algorithm. + +--- + +## The Current Architecture (AoS — Row-Major) + +``` +Memory layout: [candidate₀ word₀..word₂₅₅] [candidate₁ word₀..word₂₅₅] ... + +Algorithm (kernel_pipeline, line 725): + for each candidate: ← sequential over N + K0: XOR word[0] ← random access into candidate's word 0 + K1: XOR word[0..8] ← sequential within one candidate + K2: XOR word[0..256] ← sequential within one candidate + +Cache behavior: + K0 touches 8 bytes per candidate but strides by 2048 bytes. + For 1M candidates: 8 bytes useful / 2048 bytes per cache line = 0.4% utilization. +``` + +Each K0 probe fetches one u64 from a different 2KB region. The CPU prefetcher +can't help because the access pattern is "8 bytes at stride 2048" — every fetch +pulls a full cache line but uses 1/256th of it. + +--- + +## The Transposed Architecture (SoA — Column-Major) + +``` +Memory layout: + Column 0: [cand₀_word₀, cand₁_word₀, cand₂_word₀, ... candₙ_word₀] + Column 1: [cand₀_word₁, cand₁_word₁, cand₂_word₁, ... candₙ_word₁] + ... + Column 255: [cand₀_word₂₅₅, cand₁_word₂₅₅, ... candₙ_word₂₅₅] + +Algorithm (transposed cascade): + Phase K0: scan Column 0 — XOR all N words with query[0], threshold filter + → produces survivor_mask (bitmask of which candidates survived) + + Phase K1: scan Columns 0..8 — XOR only survivors' words, accumulate + → narrows survivor_mask further + + Phase K2: scan Columns 0..255 — XOR only final survivors + → exact EnergyConflict for each + +Cache behavior: + K0 is a contiguous sequential scan of N × 8 bytes. + For 1M candidates: 8MB, perfectly prefetchable, 100% cache line utilization. + AVX-512: 8 candidates per instruction (512 bits / 64 bits = 8 u64s). +``` + +--- + +## Why This Is Genius at the Meta Level + +### 1. The Cascade Becomes SIMD-Native Filtering + +Current K0 (per-candidate, scalar): +```rust +for i in 0..n_candidates { + let word0 = database[i * 256]; // stride 2048 bytes — cache-hostile + if (word0 ^ query_word0).count_ones() > threshold { + continue; + } + // ... K1, K2 +} +``` + +Transposed K0 (columnar, SIMD): +```rust +let col0: &[u64] = &columns[0]; // contiguous N×8 bytes +let q0 = U64x8::splat(query_words[0]); // broadcast query word 0 +let thresh = U32x8::splat(gate.k0_reject); + +let mut survivors = vec![0u64; (n_candidates + 63) / 64]; // bitmask + +for chunk in 0..(n_candidates / 8) { + let c = U64x8::from_slice(&col0[chunk * 8..]); + let xor = c ^ q0; + let popcnt = xor.popcnt_per_lane(); // 8 popcounts in parallel + let mask = popcnt.le(thresh); // 8 comparisons → 8-bit mask + survivors[chunk / 8] |= mask.to_bitmask() << (chunk % 8 * 8); +} +``` + +**8 candidates per clock cycle instead of 1.** And the memory access is a +sequential stream — the hardware prefetcher handles it perfectly. + +--- + +### 2. The Survivor Mask IS the State Machine + +The cascade becomes mask narrowing — no branch prediction, no `continue`: + +```rust +// Start: all candidates alive +let mut mask: Vec = vec![u64::MAX; (n + 63) / 64]; + +// K0: narrow mask via column 0 +k0_filter_columnar(&columns[0], query[0], gate.k0_reject, &mut mask); + +// K1: narrow mask via columns 0..8 (only touch survivors) +k1_filter_columnar(&columns[0..8], &query[0..8], gate.k1_reject, &mut mask); + +// K2: exact only on final survivors +let results = k2_exact_masked(&columns, query, &mask); +``` + +Each phase reads ONLY the columns it needs. K0 touches 1 column (N×8 bytes). +K1 touches 8 columns (N×64 bytes). K2 touches 256 columns but only for +survivors (~3-5% of N). Total memory touched: + +``` +AoS: N × 2048 bytes (reads all words for all candidates even if K0 rejects) + Actually reads: N × 8 (K0) + 0.05N × 64 (K1) + 0.003N × 2048 (K2) + But cache lines pulled: N × 64 (one cache line per K0 probe) = N×64 bytes + +SoA: N × 8 (K0, contiguous) + N × 64 (K1, contiguous) + 0.003N × 2048 (K2) + Effective: N × 72 + 6N = ~78N bytes vs ~64N bytes (AoS) + BUT: 100% utilization vs 12.5% utilization per cache line + Net bandwidth saving: ~8× due to no wasted cache line fetch +``` + +--- + +### 3. ndarray Makes This Free (Strides, Not Copies) + +Here's the meta-level genius: **you don't transpose the data. You transpose the strides.** + +```rust +// Build database in Fortran (column-major) order: +let database: Array2 = Array2::from_shape_vec( + (n_candidates, 256).f(), // .f() = Fortran order + flat_word_data, +).unwrap(); + +// Column access is now CONTIGUOUS (no copy, no transpose): +let col0: ArrayView1 = database.column(0); +assert!(col0.is_standard_layout()); // TRUE — physically contiguous + +// Row access still works (strided, for K2 on survivors): +let survivor_fp: ArrayView1 = database.row(42); +// Not contiguous, but only used for the 3-5% of survivors — doesn't matter +``` + +The `(n_candidates, 256).f()` tells ndarray to lay out memory column-by-column. +Accessing `.column(i)` returns a contiguous view. Accessing `.row(i)` returns a +strided view. **Same data, both access patterns, zero transposition cost.** + +This is ndarray's stride system doing something no raw-slice library can: +**the same allocation serves both columnar scanning AND per-candidate access.** + +--- + +### 4. Arrow Is Already Column-Major + +Arrow stores columns separately. The arrow_bridge already provides columnar views. +If each "word position" is stored as its own Arrow column (or a single Arrow +FixedSizeList with the right chunking), the SoA layout is the native format: + +```rust +// Arrow RecordBatch with 256 columns of u64: +// col_0: [cand₀_w0, cand₁_w0, ...] ← this IS the K0 scan buffer +// col_1: [cand₀_w1, cand₁_w1, ...] +// ... + +fn kernel_pipeline_arrow(batch: &RecordBatch, query: &[u64; 256], gate: &SliceGate) { + // K0: read column 0 directly from Arrow buffer (zero-copy) + let col0 = batch.column(0).as_primitive::().values(); + let mask = k0_filter_columnar(col0, query[0], gate.k0_reject); + + // K1: read columns 0..8 + for i in 0..8 { + let col = batch.column(i).as_primitive::().values(); + k1_accumulate_columnar(col, query[i], &mut mask, &mut accum); + } + // ... +} +``` + +**No conversion. No copy. Arrow's native layout IS the SoA kernel's input format.** + +--- + +### 5. Multi-Resolution SoA: The Three Tiers as Physical Arrays + +```rust +/// Tiered columnar database. Each tier is a separate Array2 because they have +/// different access patterns and lifetimes. +pub struct TieredDatabase { + /// Tier K0: shape [N, 1] — just word 0 per candidate. + /// Fits in L2 for 1M candidates (8MB). + pub k0: Array2, // Column-major, 1 column + + /// Tier K1: shape [N, 8] — first 8 words per candidate. + /// Fits in L3 for 1M candidates (64MB). + pub k1: Array2, // Column-major, 8 columns + + /// Tier K2: shape [N, 256] — full containers. + /// Lives in main memory. Only accessed for 3-5% survivors. + pub k2: Array2, // Column-major, 256 columns +} +``` + +**Why separate arrays?** Because K0 is HOT (scanned every query). K1 is WARM +(scanned for ~5% survivors). K2 is COLD (scanned for ~0.3% survivors). Separating +them lets you: +- Pin K0 in L2 cache (8MB for 1M candidates) +- Let K1 live in L3 +- Let K2 stay in DRAM — only a few thousand rows are ever touched + +This is **cache-aware SoA** — not just columnar layout, but columnar layout with +physical separation matching the thermal hierarchy of the CPU cache. + +--- + +### 6. BF16 Field-Separated SoA: Awareness Becomes Memory Layout + +Current BF16 storage: N vectors × 1024 BF16 values = N × 2048 bytes (interleaved). + +Field-separated SoA: +```rust +/// BF16 database stored as separated bitfields. +/// The decomposition that bf16_hamming.rs does AT RUNTIME is instead +/// baked INTO THE STORAGE FORMAT. +pub struct BF16FieldDatabase { + /// Sign bits: 1 bit per dimension × 1024 dims = 128 bytes per vector. + /// Shape [N, 128] as u8 — each byte = 8 sign bits. + pub signs: Array2, + + /// Exponent fields: 8 bits per dimension × 1024 dims = 1024 bytes per vector. + /// Shape [N, 1024] as u8. + pub exponents: Array2, + + /// Mantissa fields: 7 bits per dimension × 1024 dims ≈ 896 bytes per vector. + /// Shape [N, 896] as u8 (packed 7-bit). + pub mantissas: Array2, +} +``` + +**The awareness algebra becomes trivially parallel**: + +```rust +impl BF16FieldDatabase { + /// Sign-only distance (strongest signal): just Hamming on sign bytes. + /// 128 bytes per comparison — fits in one AVX-512 pass for 4 vectors. + fn sign_distance(&self, query_signs: &[u8], candidate_idx: usize) -> u32 { + let row = self.signs.row(candidate_idx); + hamming_dispatch(query_signs, row.as_slice().unwrap()) + } + + /// Exponent distance (medium signal): Hamming on exp bytes. + fn exp_distance(&self, query_exp: &[u8], candidate_idx: usize) -> u32 { + let row = self.exponents.row(candidate_idx); + hamming_dispatch(query_exp, row.as_slice().unwrap()) + } + + /// Awareness classification without per-element decomposition. + /// Signs are already separated — just compare columns directly. + fn batch_awareness(&self, query_signs: &Array1) -> Array1 { + // For ALL candidates simultaneously: + // sign_agreement = popcount(NOT (query_signs XOR candidate_signs)) + // Already column-accessible, already SIMD-friendly + let n = self.signs.nrows(); + let mut states = Array1::zeros(n); + for i in 0..n { + let row = self.signs.row(i); + let agreement = hamming_agreement(query_signs.as_slice().unwrap(), + row.as_slice().unwrap()); + states[i] = classify_awareness(agreement, ...); + } + states + } +} +``` + +**The insight**: bf16_hamming.rs currently decomposes each BF16 value into +sign/exp/mantissa at QUERY TIME (runtime unpacking). With field-separated SoA, +the decomposition happens at INGEST TIME (once). Every subsequent query operates +on pre-decomposed fields. The hot path never sees a BF16 value — only its fields. + +This trades ingest bandwidth for query speed: +- Ingest: O(N) decomposition (amortized over all future queries) +- Query: zero decomposition cost (fields already separated) +- For a database queried 10,000× per second, this is a massive win. + +--- + +### 7. PackedQualia SoA: 16 Parallel Dimension Columns + +Current: `Vec` where each PackedQualia = 16 × i8 + 1 BF16 (18 bytes). + +SoA: +```rust +/// Qualia database stored as separated dimension columns. +/// Each column is ONE phenomenological dimension across ALL agents. +pub struct QualiaColumns { + /// 16 dimension columns, each Array1 with N_agents elements. + /// dims[0] = warmth, dims[1] = texture, dims[4] = social, etc. + pub dims: [Array1; 16], + + /// Magnitude column: Array1 (BF16 scalars). + pub magnitude: Array1, +} +``` + +**What this unlocks**: + +```rust +impl QualiaColumns { + /// "Which agents are tensioned on warmth (dim 4)?" + /// → Single SIMD scan of one column. + fn tensioned_on(&self, dim: usize, threshold: i8) -> Array1 { + self.dims[dim].mapv(|v| v.abs() > threshold && v < 0) + } + + /// "Resonance between agent A and all others on dimension 5" + /// → One column × scalar = Array1 + fn resonance_on_dim(&self, agent_idx: usize, dim: usize) -> Array1 { + let agent_val = self.dims[dim][agent_idx] as i16; + self.dims[dim].mapv(|v| v as i16 * agent_val) + } + + /// "Total resonance between agent A and all others (dot product)" + /// → 16 column scans, accumulate + fn total_resonance(&self, agent_idx: usize) -> Array1 { + let mut acc = Array1::::zeros(self.dims[0].len()); + for d in 0..16 { + let agent_val = self.dims[d][agent_idx] as i32; + acc += &self.dims[d].mapv(|v| v as i32 * agent_val); + } + acc + } +} +``` + +**AoS version of the same query** (current): +```rust +// "Find agents tensioned on warmth" — must scan ALL 18 bytes per agent, +// extract byte [4], compare. Cache line waste: 4/18 ≈ 22%. +for agent in &qualia_db { + if agent.resonance[4].abs() > threshold && agent.resonance[4] < 0 { + results.push(agent); + } +} +``` + +**SoA version**: scan 1 byte per agent, sequential, 100% cache utilization. +For 1M agents: 1MB scan vs ~18MB with random byte extraction. + +--- + +### 8. The Meta-Meta Level: ndarray IS the SoA Abstraction + +The deepest insight is that **ndarray's type system already encodes this**. +`Array2` with Fortran order IS a SoA database. The columns are the "fields." +The rows are the "records." And ndarray's stride machinery gives you BOTH access +patterns from the same allocation: + +```rust +// THE SAME Array2, TWO access patterns: + +// SoA access (column scan for K0 — contiguous): +let k0_column: ArrayView1 = database.column(0); +// → strides = [1], physically contiguous + +// AoS access (full record for K2 — strided): +let full_record: ArrayView1 = database.row(survivor_idx); +// → strides = [n_candidates], non-contiguous but only 3% of data +``` + +You don't need two copies. You don't need explicit transposition. ndarray's +`Order::F` gives you SoA as the NATIVE layout while `Order::C` gives you AoS. +The same API works for both — only the performance characteristics change. + +**This is what makes ndarray + hpc/ potentially genius**: the array type IS +the database schema. Shape IS the data model. Strides ARE the access pattern. +And the cascade kernel can be expressed as: + +```rust +pub fn cascade_soa( + database: &Array2, // F-order: [N, 256], columns contiguous + query: &Array1, // [256] + gate: &SliceGate, +) -> Vec { + let n = database.nrows(); + + // K0: column 0 scan (contiguous — full SIMD bandwidth) + let col0 = database.column(0); + let mut mask = k0_columnar_simd(col0.as_slice().unwrap(), query[0], gate); + + // K1: columns 0..8 scan (contiguous per column) + for w in 0..8 { + let col = database.column(w); + k1_accumulate_simd(col.as_slice().unwrap(), query[w], &mut mask); + } + k1_threshold(&mut mask, gate); + + // K2: only survivors (row access — strided but few) + let mut results = Vec::new(); + for idx in mask.iter_ones() { + let row = database.row(idx); + // row is strided — copy to stack for SIMD (256 × 8 = 2KB, fits in L1) + let mut buf = [0u64; 256]; + row.assign_to(ArrayViewMut1::from(&mut buf)); + let ec = k2_exact(&query.as_slice().unwrap(), &buf, 256); + results.push(/* ... */); + } + results +} +``` + +--- + +## The Shopping List Addition + +Add to `REFACTOR_HPC_INTEGRATION.md` Phase C: + +``` +Phase C-SoA (2 weeks) — Columnar Kernel Architecture: + C.1 TieredDatabase struct (K0/K1/K2 as separate F-order Array2) + C.2 k0_columnar_simd: scan N u64s with U64x8 broadcast+XOR+popcount + C.3 Bitmask survivor propagation (no per-candidate branching) + C.4 k1_accumulate_columnar: 8-column scan with running mask + C.5 k2_exact_masked: row extraction for final survivors + C.6 Arrow native path: RecordBatch columns → direct columnar scan + C.7 BF16FieldDatabase: sign/exp/mantissa pre-separated at ingest + C.8 QualiaColumns: 16 × Array1 for per-dimension scan + C.9 Benchmark: AoS vs SoA cascade on 1M SKU-16K containers +``` + +--- + +## Expected Performance Impact + +| Operation | AoS (current) | SoA (proposed) | Speedup | Why | +|-----------|---------------|----------------|---------|-----| +| K0 scan (1M) | ~2ms (cache-miss bound) | ~0.25ms (streaming) | **8×** | Contiguous vs stride-2048 | +| K1 scan (50K survivors) | ~0.8ms | ~0.1ms | **8×** | Same reason, 8 columns | +| K2 exact (1500 survivors) | ~0.3ms | ~0.35ms | 0.85× | Row copy overhead | +| **Total pipeline** | **~3.1ms** | **~0.7ms** | **4.4×** | K0+K1 dominate | +| BF16 awareness (1M) | ~5ms (decompose each) | ~0.6ms (pre-separated) | **8×** | No runtime decomposition | +| Qualia dim scan (1M agents) | ~4.5ms (18 byte stride) | ~0.25ms (1 byte stride) | **18×** | Perfect cache line use | + +The K2 step is slightly SLOWER in SoA (strided row access + copy to stack). +But K2 only runs on 0.3% of candidates — it's irrelevant to total time. +The cascade is dominated by K0 (touches all candidates) and K1 (touches 5%). +Both are massively faster with columnar layout. + +--- + +## The Philosophical Payoff + +> **AoS says**: "Here is a complete entity. Ask it anything." +> **SoA says**: "Here is one question. Ask it of everyone." + +The cascade pipeline IS the SoA access pattern: +- K0 asks ONE question (word 0) of EVERYONE +- K1 asks EIGHT questions (words 0-7) of SURVIVORS +- K2 asks ALL questions (words 0-255) of VERY FEW + +This is a funnel. Each stage narrows the population and widens the per-entity cost. +SoA is optimal for funnels because the wide-population stages (K0, K1) get +contiguous memory access, and the narrow-population stage (K2) pays the strided +penalty on so few entities that it doesn't matter. + +ndarray encodes this directly in its type system: +- `database.column(0)` = the K0 question to everyone (contiguous) +- `database.row(survivor)` = all answers from one entity (strided, rare) + +**The database layout IS the algorithm. +The strides ARE the access pattern. +ndarray IS the SoA abstraction.** diff --git a/UNIFIED_REFACTOR_SEQUENCE.md b/UNIFIED_REFACTOR_SEQUENCE.md new file mode 100644 index 00000000..f55b0417 --- /dev/null +++ b/UNIFIED_REFACTOR_SEQUENCE.md @@ -0,0 +1,470 @@ +# Unified Refactor Sequence: ndarray Fork + +> Integrates four analysis passes into one executable sequence: +> 1. **REFACTOR_HPC_INTEGRATION.md** — Type bridges & extension traits +> 2. **SOA_KERNEL_ARCHITECTURE.md** — Columnar cascade & field-separated storage +> 3. **Transformer session feedback** — API conventions, namespace, codegen macros +> 4. **lance-graph W1a consumer contract** — 5 SIMD primitives blocking 158-violation remediation +> +> Each wave is self-contained. Later waves depend on earlier ones. +> Within a wave, items are independent and can run in parallel. + +--- + +## Known constraints (from prior sessions) + +- Dispatch resolves at compile time via `cfg(target_feature)` in `simd.rs`. No per-call runtime checks. +- `simd_avx2.rs` is dead on x86-64-v4. Don't add to it. +- NEON does 256-bit as 2×128-bit paired. +- VPABSB does NOT saturate i8::MIN. +- Palette-256 is the dominant gather use case. +- Rayon work-stealing is not the lever if typed SIMD integration is the global lever. +- The `hpc/simd_caps.rs` and `hpc/simd_dispatch.rs` files are dead code under the pin — 877 lines to delete. + +--- + +## Wave 0 — Conventions & Foundations (3 days) + +Unlocks everything else. No code changes to hot paths. Pure contract + tooling. + +| ID | Item | Source | Effort | Why First | +|----|------|--------|--------|-----------| +| W0.1 | **Dual-form signature convention** (`_into` + Vec wrapper + `_ptr`) | R1.1 | 1d | Fixes root cause of PP-15 WS-1/2/4 ❌. All future kernels follow this shape. | +| W0.2 | **Unified HpcError enum** (`src/hpc/error.rs`) | R1.2 | 4h | Replaces 4 error conventions with 1. | +| W0.3 | **`#![deny(warnings)]` → targeted denies** | R1.3 | 1h | Unblocks doctest authoring for all subsequent waves. | +| W0.4 | **Feature flag rename** (`hpc-extras` → `research`, `backend-*` prefix) | R4.1 | 4h | Clean slate for feature-gated additions. | +| W0.5 | **Prelude module** (`hpc::prelude::*`) | R1.4 | 1h | Single discovery surface for downstream consumers. | + +**Gate**: All downstream crates (burn, candle, tract, ort, lance-graph) still compile unchanged. +The dual-form is additive; existing call sites continue working. + +--- + +## Wave 1 — SIMD Consumer Primitives (3 days) ← lance-graph P0 + +From `lance-graph` W1a consumer contract. **Five primitives blocking 158 raw-intrinsic +violations across 5 lance-graph crates.** Each is a tight-scope PR. Parallel review. + +The lance-graph `simd-savant` agent runs PRE-MERGE against each PR. + +| ID | Item | Source | Effort | Consumer Site | +|----|------|--------|--------|---------------| +| W1.1 | **`I8x16::from_i4_packed_u64()`** + `batch_packed_i4_16()` | W1a-#1 | 1d | `lance-graph-contract/src/mul.rs::i4_eval::batch` — 5 batch fns over `QualiaI4_16D(u64)`. Closes PR #398 codex P1 (NEON OOB at `len==2`). | +| W1.2 | **`I8x16::saturating_abs()`** + `I8x32::saturating_abs()` | W1a-#2 | 4h | `lance-graph-contract/src/mul.rs` — Direction-B fix. `abs(i8::MIN)` must return `i8::MAX`, not wrap. See §VPABSB correction. | +| W1.3 | **`U16x8::gather_u16()`** + `palette_lookup_u8x8()` | W1a-#3 | 4h | `bgz17/src/simd.rs:88` — currently inlines `_mm256_i32gather_epi32` (AP-SIMD-1 violation). | +| W1.4 | **`prefetch_read_t0/t1/t2()`** | W1a-#4 | 2h | `bgz17/src/prefetch.rs:96,100` — currently inlines `_mm_prefetch` directly. | +| W1.5 | **`U64x8::popcnt()`** + `U64x8::xor_popcount()` | W1a-#5 | 4h | `holograph/hamming.rs` + `lance-graph/graph/blasgraph/types.rs` — inline `_mm512_popcnt_epi64`. | + +### Per-primitive implementation contract + +Every PR MUST: +1. **Impl in `simd_avx512.rs`** + **`simd_neon.rs`** + scalar fallback. +2. **Edge-case semantics documented** (`i8::MIN`, empty slices, OOB indices). +3. **Parity test**: all backends produce identical output on randomized corpus. +4. **Bench against scalar**: speedup ratios in PR body. +5. **`// SAFETY:` on every `unsafe` block**. +6. **No `is_x86_feature_detected!`**. No `#[target_feature(enable=...)]`. +7. **Consumer site cited** in PR description. + +### W1.1 — I8x16::from_i4_packed_u64 (nibble unpack + sign extend) + +```rust +impl I8x16 { + /// Unpack 16 signed i4 nibbles from a u64 into 16 i8 lanes (sign-extended). + /// Nibble layout: lane[i] = sign_extend_4((packed >> (4*i)) & 0xf). + pub fn from_i4_packed_u64(packed: u64) -> Self; + + /// Const-folded lane extract. + pub fn lane_i8(self) -> i8; +} + +/// Closure-parameterized batch: run `f` over each (unpacked_i8x16, aux[i]) pair. +/// Bounds-aware tail handling; scalar fallback on unsupported arch. +pub fn batch_packed_i4_16(packed: &[u64], aux: &[i8], out: &mut [E], f: F) +where F: Fn(I8x16, i8) -> E + Sync + Send, E: Copy; +``` + +**AVX-512**: `_mm_cvtsi64_si128` + nibble shuffle via VPSHUFB mask LUT, then sign-extend. +Alt: PDEP (`_pdep_u64` × 2) into two u64 halves → load → `vpmovsxbw`. Bench both on Zen4 + SPR. +**NEON**: `vld1_u8` 8 bytes → `vshl_n_s8(v, 4)` + `vshr_n_s8(v, 4)` for nibble split + sign-extend. +**Scalar**: `((packed >> (4*i)) & 0xf) as i8`, with `if x > 7 { x - 16 } else { x }` for sign-extend. + +The `batch_packed_i4_16` closure-batch is the foundation for W1.5-#7 (randomized signatures). + +### W1.2 — I8x16::saturating_abs (VPABSB correction) + +**Critical**: `_mm512_abs_epi8` does NOT saturate `i8::MIN`. Returns `0x80` (= -128). + +```rust +impl I8x16 { pub fn saturating_abs(self) -> Self; } +impl I8x32 { pub fn saturating_abs(self) -> Self; } +``` + +**AVX-512**: `_mm512_min_epu8(_mm512_abs_epi8(x), _mm512_set1_epi8(0x7f))`. +**NEON**: `vqabsq_s8(x)` — hardware-saturating (the `q` suffix). +**Scalar**: `i8::saturating_abs()` (stdlib, well-defined). + +**Mandatory test**: +```rust +let input = I8x16::splat(i8::MIN); +let result = input.saturating_abs(); +assert_eq!(result.lane_i8::<0>(), i8::MAX); // 127, NOT -128 +``` + +### W1.3 — U16x8::gather_u16 (palette lookup) + +```rust +impl U16x8 { + /// Gather 8 u16 values from `table` at given indices. + /// Debug: panics if max(indices) >= table.len(). Release: scalar fallback. + pub fn gather_u16(indices: U16x8, table: &[u16]) -> Self; +} +pub fn palette_lookup_u8x8(idx_v: U16x8, lut: &[u8]) -> U8x8; +``` + +**Palette-256 is the dominant use case** — every index fits in u8, table is always +256 or 512 bytes. No bounds risk at palette-256 widths. The gather_u16 API must +handle arbitrary tables too, but palette-256 should be the fast path (no OOB +possible when table.len() == 256 and indices are u8-sourced). + +**Codex P2 fix (gather_u16 OOB)**: `_mm256_i32gather_epi32` reads 4 bytes per slot +from a `&[u16]` table — overreads 2 bytes at `table[len-1]`. For palette-256 this +is harmless (256 × 2 = 512 bytes, 4-byte read at index 255 reads bytes 510-513, +which is within a 512-byte aligned allocation + padding). For arbitrary tables, +use scalar fallback or pad the table allocation. + +**AVX-512**: `_mm512_i32gather_epi32` with index widening + mask to u16. +**NEON / Scalar**: loop `(0..8).map(|i| table[indices.lane(i)])`. + +### W1.4 — Prefetch hints (cross-arch) + +```rust +pub fn prefetch_read_t0(ptr: *const u8); // L1 +pub fn prefetch_read_t1(ptr: *const u8); // L2 +pub fn prefetch_read_t2(ptr: *const u8); // L3 +``` + +**x86**: `_mm_prefetch(ptr, _MM_HINT_T0/T1/T2)`. +**aarch64**: `prfm pldl1keep/pldl2keep/pldl3keep`. +**Other**: no-op (prefetch is a hint; silent no-op is correct per ISA). + +### W1.5 — U64x8::popcnt (lane-wise popcount) + +```rust +impl U64x8 { + /// Lane-wise population count. Each lane returns its u64 bit-count (0..=64). + pub fn popcnt(self) -> Self; + /// XOR + lane-wise popcount + horizontal sum. Optimized for Hamming distance. + pub fn xor_popcount(self, other: Self) -> u64; +} +impl U64x4 { pub fn popcnt(self) -> Self; } // NEON/scalar parity +``` + +**AVX-512 VPOPCNTDQ**: `_mm512_popcnt_epi64` directly (feature `avx512vpopcntdq` — +available on sapphirerapids, enabled by the target-cpu pin). +**NEON popcount per-u64**: `vcntq_u8` → `vpaddlq_u8` → `vpaddlq_u16` → `vpaddlq_u32` +(NOT `vaddvq_u8` which merges ALL lanes to a single scalar — Codex P2 fix). +**Scalar**: `u64::count_ones()` fused loop. + +### W1.5+ — Deferred primitives (gated on sigker certification) + +Three more primitives queued behind `lance-graph:crates/sigker` certification +(Hambly-Lyons 2010 path-signature uniqueness). **No code needed today.** Listed +so W1a additions are designed broad enough to compose with these later. + +| ID | Primitive | Gate | +|----|-----------|------| +| W1.5-#6 | `signature_pde_sweep()` — Goursat PDE for `〈S(X), S(Y)〉` | `jc Pillar 11` activation | +| W1.5-#7 | Randomized projection (Cuchiero-Schmocker-Teichmann 2021) | sigker bench at production widths | +| W1.5-#8 | Lyndon-pack log-signature (7-13× compression) on `I16x16` | sigker bench at production widths | + +The closure-batch shape introduced in W1.1 is the foundation for W1.5-#7. + +**Gate**: All 5 primitives pass parity tests across AVX-512/NEON/scalar. +`lance-graph` `simd-savant` agent verifies compliance PRE-MERGE. +Consumer-side migration PRs (#398-#402) unblocked. + +--- + +## Wave 2 — Codegen Macros (2 days) + +Eliminates copy-paste before adding new code. Every subsequent wave benefits. +Wave 1 primitives become the first consumers of these macros (retrofit optional). + +| ID | Item | Source | Effort | What It Produces | +|----|------|--------|--------|------------------| +| W2.1 | **Dtype-parity macro** (`reductions_for!`) | R3.1 | 1d | One line = 7 reductions for a dtype. Cuts 700→150 lines. | +| W2.2 | **Per-arch impl macro** (`simd_impl!`) | R3.2 | 4h | Generates the struct + methods for a type in both simd_avx512.rs and simd_neon.rs from one body. | +| W2.3 | **Reduction kernel template** (`reduce_simd()`) | R3.3 | 4h | Generic chunk-loop; sum/max/nrm2 become 5-line callers. | +| W2.4 | **Dual-form fusion** (`kernel_simd_dual!`) | R3.4 | 1d | One body → `_into`, Vec, `_ptr`, all arch variants. | + +**Gate**: `cargo test -p ndarray` passes. Macro output matches existing hand-rolled functions. +Run benches to verify no regression. + +--- + +## Wave 3 — Type Bridges (2 days) + +From REFACTOR_HPC_INTEGRATION.md Tier 1. Makes domain types ndarray-native. + +| ID | Item | Source | Effort | Impact | +|----|------|--------|--------|--------| +| W3.1 | **Fingerprint ↔ ArrayView1** | Tier 1.1 | 4h | Zero-copy bridge, unblocks Zip/broadcast on fingerprints | +| W3.2 | **VsaVector ↔ ArrayView1** | Tier 1.2 | 2h | Same pattern | +| W3.3 | **BF16/F16 → BlasFloat** | Tier 1.3 | 4h | Enables `Array.dot()`, unlocks BLAS dispatch | +| W3.4 | **CogRecord channel_as_words()** | Tier 1.4 | 2h | ArrayView1 per channel | +| W3.5 | **Arrow bridge → ArrayView factories** | Tier 4.1 | 2h | `binary_column_view()` returns ArrayView2 | + +**Gate**: New `From`/`Into` impls compile. Existing code unchanged. Add tests for each bridge. + +--- + +## Wave 4 — Extension Traits (3 days) + +From REFACTOR_HPC_INTEGRATION.md Tier 2. hpc operations feel native on ndarray types. + +| ID | Item | Source | Effort | Impact | +|----|------|--------|--------|--------| +| W4.1 | **HdcOps trait** (K0/K1/K2 on Array1) | Tier 2.1 | 4h | `query.cascade_distance(&candidate, &gate)` | +| W4.2 | **Quantize trait** (Array → BF16/I8 with Array output) | Tier 2.2 | 4h | `array.to_bf16()`, `array.to_i8_symmetric()` | +| W4.3 | **Prefilter on ArrayView2** | Tier 2.3 | 4h | Eliminates `(slice, rows, cols)` pattern | +| W4.4 | **ActivationsNd** (softmax_axis, log_softmax_axis) | Tier 2.4 | 4h | `batch.softmax_axis(Axis(1))` for inference | +| W4.5 | **SimdMath trait** (VML on any Array) | Tier 2.5 | 4h | `array.simd_exp()` — 10-50x for transcendentals | +| W4.6 | **Int8Matmul trait** (Array2 × Array2 → Array2) | Tier 3.3 | 4h | Quantized inference via ndarray types | + +**Gate**: Each trait has tests matching the raw-slice function's test suite. +Benchmark: trait overhead vs. direct raw-slice call ≤ 1%. + +--- + +## Wave 5 — Backend Wiring (2 days) + +From REFACTOR_HPC_INTEGRATION.md Tier 3. Core ndarray operations silently accelerate. + +| ID | Item | Source | Effort | Impact | +|----|------|--------|--------|--------| +| W5.1 | **Delete duplicate detection code** (simd_caps + simd_dispatch → dead code under cfg pin) | Tier 3.2 | 4h | Deletes 877 lines that are unreachable when target-cpu is pinned | +| W5.2 | **Core sum/mean → SIMD dispatch** | Tier 3.1 | 4h | 16x faster `.sum()` on contiguous f32/f64 | +| W5.3 | **SIMD axis reductions** (sum_axis with SIMD lanes) | Tier 6.1 | 1d | ML training hot path | + +**Note on W5.1**: With `target-cpu=sapphirerapids` in `.cargo/config.toml`, the +`cfg(target_feature = "avx512f")` branch in `simd.rs` is the only live path. +`hpc/simd_caps.rs` (515 lines) and `hpc/simd_dispatch.rs` (362 lines) exist for +a multi-binary world that doesn't apply when the target is pinned. These can be +deleted or feature-gated behind `cfg(not(target_feature = "avx512f"))` for CI +fallback builds. + +**Gate**: `cargo bench` shows measurable improvement on contiguous arrays. +Non-contiguous arrays unchanged (fallback to generic fold). + +--- + +## Wave 6 — SoA Kernel Architecture (1 week) + +From SOA_KERNEL_ARCHITECTURE.md. The genius-level structural change. +Wave 1's `U64x8::popcnt()` and `U64x8::xor_popcount()` are direct prerequisites +for the columnar XOR+popcount scan in W6.2. + +| ID | Item | Source | Effort | Impact | +|----|------|--------|--------|--------| +| W6.1 | **TieredDatabase struct** (K0/K1/K2 as F-order Array2) | SoA §5 | 1d | Physical separation matching cache hierarchy | +| W6.2 | **k0_columnar_simd** (scan N u64s with broadcast+XOR+popcount) | SoA §1 | 1d | Uses `U64x8::xor_popcount` from W1.5. 8 candidates per cycle. | +| W6.3 | **Bitmask survivor propagation** (mask narrowing, no branching) | SoA §2 | 4h | Replace per-candidate `continue` with mask ops | +| W6.4 | **k1_accumulate_columnar** (8-column scan with mask) | SoA §1 | 4h | K1 on contiguous columns | +| W6.5 | **k2_exact_masked** (row extraction for survivors only) | SoA §1 | 4h | Stack-copy 2KB per survivor | +| W6.6 | **BF16FieldDatabase** (sign/exp/mantissa separated at ingest) | SoA §6 | 1d | Awareness without runtime decomposition | +| W6.7 | **QualiaColumns** (16 × Array1) | SoA §7 | 4h | 18x dimension scan throughput. Uses `I8x16::from_i4_packed_u64` from W1.1. | +| W6.8 | **Arrow native path** (RecordBatch columns → columnar scan) | SoA §4 | 4h | Zero-copy from Lance/Arrow | +| W6.9 | **Benchmark: AoS vs SoA** on 1M SKU-16K containers | SoA §perf | 4h | Proof of 4-8x claim | + +**Gate**: SoA cascade produces identical results to AoS cascade on full test suite. +Benchmark confirms ≥3x throughput improvement on 1M containers. + +--- + +## Wave 7 — Namespace Restructure (3 days) + +From transformer feedback R2.1. Enforces the architecture rule from CLAUDE.md. + +| ID | Item | Source | Effort | Impact | +|----|------|--------|--------|--------| +| W7.1 | **Split hpc/ → hpc/ + cog/ + ext/ + io/** | R2.1 | 2-3d | 30 numeric modules in hpc/, 35 cognitive in cog/, 20 experimental in ext/, 6 I/O in io/ | +| W7.2 | **SIMD directory consolidation** (simd_*.rs → src/simd/) | R2.2 | 1d | One directory for all SIMD code. W1 primitives land in new locations. | +| W7.3 | **Quantized module split** (quantized.rs → hpc/quant/) | R2.3 | 4h | BlockQ4_0 packed struct for candle compat | + +**Gate**: `pub use` deprecation shims for all moved modules. +All existing `use ndarray::hpc::*` paths still resolve (with deprecation warning). + +--- + +## Wave 8 — Test & Bench Infrastructure (2 days) + +| ID | Item | Source | Effort | Impact | +|----|------|--------|--------|--------| +| W8.1 | **Extract integration tests** to `tests/hpc/`, `tests/cog/` | R4.2 | 2d | Module files lose 30-50% of line count | +| W8.2 | **HPC bench harness** (dot, reductions, softmax, quantized) | R4.5 | 1d | Reproducible perf claims | +| W8.3 | **Cross-comparison bench** (ndarray vs candle vs tract kernel) | R4.5 ext | 4h | The whole reason for B1 | +| W8.4 | **W1a primitive parity bench** (AVX-512 vs NEON vs scalar speedup) | lance-graph | 4h | Acceptance criteria §4 | + +--- + +## Wave 9 — Version Bump & Downstream Pin (1 day) + +| ID | Item | Source | Effort | Impact | +|----|------|--------|--------|--------| +| W9.1 | **Tag v0.18.0** | R4.4 | 2h | Clean cut | +| W9.2 | **CHANGELOG with migration guide** | R4.4 | 2h | Downstream knows what changed | +| W9.3 | **Downstream Cargo.toml pins** (burn, candle, tract, ort, lance-graph) | R4.4 | 2h | tag = "v0.18.0" | +| W9.4 | **Remove deprecation shims** (schedule for v0.19) | R2.1 | — | Future cleanup | + +--- + +## Dependency Graph + +``` +Wave 0 (conventions) + │ + ├──→ Wave 1 (SIMD primitives) ←── P0 for lance-graph + │ │ + │ ├──→ Wave 6.2 (k0_columnar uses U64x8::xor_popcount from W1.5) + │ ├──→ Wave 6.7 (QualiaColumns uses I8x16::from_i4_packed from W1.1) + │ └──→ Wave 8.4 (primitive parity benches) + │ + ├���─→ Wave 2 (codegen macros) + │ │ + │ └──→ Wave 5 (backend wiring uses macros) + │ + ├──→ Wave 3 (type bridges) + │ │ + │ ├──→ Wave 4 (extension traits need bridges) + │ ├���─→ Wave 5 (backend dispatch needs BlasFloat for BF16) + │ └──→ Wave 6 (SoA needs Fingerprint↔Array, Arrow views) + │ + ├──→ Wave 7 (namespace) — independent of 1-6 + │ + └──→ Wave 8 (tests/benches) — after 1+5+6 + │ + └──→ Wave 9 (release) +``` + +### Critical Path + +**For lance-graph (P0 consumer)**: 0 → 1 = **6 days**. Consumer migration PRs unblocked. + +**For full release**: 0 → 1 → 3 → 5 → 6 �� 8 → 9 = **18 days** serial. +With parallelism (1∥2, 3∥4, 6∥7, 8∥backlog): **~12 working days**. + +--- + +## How Wave 1 Feeds Everything Downstream + +The 5 SIMD primitives aren't isolated additions — they're load-bearing for later waves: + +| Primitive | Immediate consumer (lance-graph) | Later consumer (ndarray) | +|-----------|----------------------------------|--------------------------| +| `I8x16::from_i4_packed_u64` | `mul.rs::i4_eval::batch` (5 batch fns) | **W6.7 QualiaColumns** — unpack i4 qualia from packed storage without scalar loop | +| `I8x16::saturating_abs` | Direction-B fix, ValleyOfDespair classifier | **W4.2 Quantize trait** — safe abs in quantization error metrics | +| `U16x8::gather_u16` | `bgz17/simd.rs` palette lookup | **W6.6 BF16FieldDatabase** �� gather exponent fields from lookup table | +| `prefetch_read_t0/t1/t2` | `bgz17/prefetch.rs` tile prefetch | **W6.2 k0_columnar_simd** — prefetch next column chunk during K0 scan | +| `U64x8::popcnt` + `xor_popcount` | `holograph/hamming.rs` + `blasgraph/types.rs` | **W6.2-W6.5 entire SoA cascade** — columnar XOR+popcount is THE operation | + +The SoA cascade (Wave 6) is fundamentally built on `U64x8::xor_popcount()`. Without Wave 1, +the SoA implementation would need to drop back to `hpc::bitwise::hamming_distance_raw()` on +slices, losing the typed-wrapper discipline and duplicating dispatch logic. + +--- + +## The SoA Integration with All Waves + +| Wave 6 needs | Provided by | +|--------------|-------------| +| `U64x8::xor_popcount()` for columnar scan | **Wave 1.5** (SIMD primitives) | +| `I8x16::from_i4_packed_u64()` for qualia | **Wave 1.1** (SIMD primitives) | +| `prefetch_read_t0()` for column prefetch | **Wave 1.4** (SIMD primitives) | +| F-order Array2 database | **Wave 3** (type bridges — Fingerprint converts to Array) | +| Impl macro for k0_columnar_simd | **Wave 2** (codegen macros) | +| `_into` form for columnar kernels | **Wave 0** (signature convention) | +| Extension trait: `database.cascade_soa(&query, &gate)` | **Wave 4** (HdcOps trait extended) | +| BF16FieldDatabase uses Quantize trait | **Wave 4** (Quantize extension) | +| Arrow columns → direct scan | **Wave 3** (Arrow view factories) | +| Benchmark harness to prove 4-8x | **Wave 8** (bench infrastructure) | + +--- + +## The Meta-Level SoA Observation + +The SoA insight goes beyond just "transpose the database." It's a design principle +that applies recursively across the entire surface: + +### Level 1: Database layout (SOA_KERNEL_ARCHITECTURE.md) +- Fingerprint database: column-per-word → cascade becomes column scan +- BF16 fields: separated sign/exp/mantissa → awareness is native layout +- Qualia: column-per-dimension → per-dim queries are sequential scans + +### Level 2: API surface +- The **dual-form convention** (W0.1) IS SoA thinking applied to function signatures: + - AoS API: `fn foo(input) -> output` (allocates inside, hides layout) + - SoA API: `fn foo_into(input, output)` (caller controls memory layout) + +### Level 3: SIMD primitives (lance-graph W1a) +- The **typed-wrapper-as-method** convention IS SoA for the SIMD surface: + - AoS: raw intrinsics scattered across consumer crates (158 violations) + - SoA: struct methods on typed wrappers, consumers import ONE namespace +- The **closure-batch pattern** (W1.1 `batch_packed_i4_16`) IS SoA for computation: + - AoS: each consumer writes its own unpack+process+pack loop + - SoA: one batch primitive accepts a closure, owns chunking/tailing/dispatch + +### Level 4: Module structure (W7.1 namespace split) +- AoS: one module (`hpc/`) contains everything → grep to find anything +- SoA: separated by concern (`hpc/`, `cog/`, `ext/`, `io/`) → contiguous columns + +### Level 5: ndarray's type system +- `Array2` in F-order IS the SoA database +- `.column(i)` IS the field-access pattern +- `.row(i)` IS the record-access pattern +- Strides encode the duality without duplication + +### Level 6: Arrow + Lance alignment +- Arrow IS columnar storage (SoA by definition) +- Lance stores Arrow columns on disk +- The SoA kernel reads Arrow columns directly — no ETL, no transpose, no copy +- Storage format = memory layout = compute pattern = cache access order + +### Level 7: The consumer contract itself +- lance-graph declares what it needs (W1a primitives), ndarray provides +- The contract is columnar: one primitive per concern, not a monolithic "SIMD library" +- Each primitive is independently testable, deployable, and benchmarkable + +**The architecture collapses all intermediate transformations at every level.** + +--- + +## Effort Summary + +| Wave | Days | Parallel? | Key Outcome | +|------|------|-----------|-------------| +| 0 | 3 | Yes | Conventions locked | +| 1 | 3 | Yes (after 0) | **lance-graph unblocked** (P0) | +| 2 | 2 | ∥ with 1 | Codegen macros ready | +| 3 | 2 | After 0 | Domain types ↔ Array | +| 4 | 3 | After 3 | Extension traits live | +| 5 | 2 | After 2+3 | Core ops accelerated | +| 6 | 5 | After 1+3+5 | SoA cascade (the big win) | +| 7 | 3 | ∥ with 4-6 | Namespace clean | +| 8 | 2 | After 1+6 | Proof via benchmarks | +| 9 | 1 | After all | Release | + +**lance-graph unblocked**: Wave 0+1 = **6 days**. +**Full release critical path**: 18 days serial, **~12 days** with parallelism. + +--- + +## What NOT to Do + +1. **Don't delete hpc/ modules** — raw-slice functions stay (FFI, embedded, zero-overhead) +2. **Don't premature-v1.0** — surface is too young; candle/tract wiring will shake it again +3. **Don't gate SoA behind a feature flag** — it's the default hot path, not optional +4. **Don't couple SoA with module restructure** — they're independent; merge separately +5. **Don't break downstream in one shot** — deprecation shims for one release minimum +6. **Don't ship W1a primitives without parity tests** — the codex P2 i8::MIN divergence happened because no test existed +7. **Don't implement W1.5+ (deferred primitives) until sigker certification** — they're gated