diff --git a/REFACTOR_HPC_INTEGRATION.md b/REFACTOR_HPC_INTEGRATION.md
new file mode 100644
index 00000000..f1aa3a3c
--- /dev/null
+++ b/REFACTOR_HPC_INTEGRATION.md
@@ -0,0 +1,974 @@
+# ndarray hpc/ Module Integration & Transformation Plan
+
+> **Goal**: Transform hpc/ from a bolted-on raw-slice library into a first-class
+> citizen of ndarray's type system — without breaking any existing API.
+>
+> **Principle**: Every refactoring adds an ndarray-native overload or trait impl.
+> Raw-slice functions stay (FFI, embedded, zero-overhead). No deletion.
+
+---
+
+## Architectural Diagnosis
+
+The hpc/ module (80K lines, 90+ files) currently operates in two disconnected worlds:
+
+```
+┌─────────────────────────────────────────────────────┐
+│  ndarray core                                       │
+│  Array<f32, Ix2>, ArrayView, Zip, Broadcasting,     │
+│  .sum(), .dot(), mapv(), LinalgScalar, BlasFloat    │
+└───────────────┬─────────────────────────────────────┘
+                │ .as_slice() ← the only bridge
+                ▼
+┌─────────────────────────────────────────────────────┐
+│  hpc/ module                                        │
+│  &[u8], &[u64], &[f32], *const u8, Vec<f32>        │
+│  Fingerprint<N>, VsaVector, Cascade, EnergyConflict │
+│  Custom SIMD dispatch, manual row/col params        │
+└─────────────────────────────────────────────────────┘
+```
+
+The refactoring creates **bidirectional bridges** without removing the raw layer:
+
+```
+┌─────────────────────────────────────────────────────┐
+│  ndarray core (unchanged)                           │
+│  Array<f32, Ix2>, ArrayView, Zip, Broadcasting      │
+└───────────────┬──────────────────────▲──────────────┘
+                │                      │
+          .as_slice()          crate::simd::*
+                │                      │
+                ▼                      │
+┌─────────────────────────────────────────────────────┐
+│  hpc/ bridge layer (NEW)                            │
+│  Extension traits on ArrayBase<S, D>                │
+│  From/Into impls for domain types                   │
+└───────────────┬──────────────────────▲──────────────┘
+                │                      │
+          delegates to          implements
+                │                      │
+                ▼                      │
+┌─────────────────────────────────────────────────────┐
+│  hpc/ raw compute (unchanged)                       │
+│  &[u8], &[u64], Fingerprint<N>, typed SIMD          │
+│  K0/K1/K2, BF16 GEMM, VNNI, VML                    │
+└─────────────────────────────────────────────────────┘
+```
+
+---
+
+## Refactoring Shopping List
+
+### Tier 1: Type Bridge — Domain Types ↔ Array (Low effort, High impact)
+
+These add conversion traits so domain types can participate in ndarray's ecosystem.
+
+---
+
+#### 1.1 Fingerprint<N> ↔ ArrayView1<u64>
+
+**File**: `src/hpc/fingerprint.rs`
+
+**Current state**: `Fingerprint<N>` wraps `[u64; N]`. Provides `.as_bytes()` via
+unsafe pointer cast. No connection to ndarray.
+
+**Transform**:
+
+```rust
+// --- Zero-copy views (no allocation) ---
+
+impl<const N: usize> Fingerprint<N> {
+    /// View this fingerprint as an ndarray 1-D array of u64 words.
+    /// Zero-copy: borrows the internal buffer.
+    pub fn as_array_view(&self) -> ArrayView1<'_, u64> {
+        ArrayView1::from_shape(N, &self.words).unwrap()
+    }
+
+    /// Mutable view for in-place operations.
+    pub fn as_array_view_mut(&mut self) -> ArrayViewMut1<'_, u64> {
+        ArrayViewMut1::from_shape(N, &mut self.words).unwrap()
+    }
+
+    /// View as 2-D matrix (e.g., 256 = 32×8 for tiled operations).
+    pub fn as_matrix_view(&self, rows: usize, cols: usize) -> ArrayView2<'_, u64> {
+        assert_eq!(rows * cols, N);
+        ArrayView2::from_shape((rows, cols), &self.words).unwrap()
+    }
+}
+
+// --- Owned conversions (allocation on From, zero-copy on Into) ---
+
+impl<const N: usize> From<Array1<u64>> for Fingerprint<N> {
+    fn from(arr: Array1<u64>) -> Self {
+        assert_eq!(arr.len(), N);
+        let mut words = [0u64; N];
+        words.copy_from_slice(arr.as_slice().unwrap());
+        Self { words }
+    }
+}
+
+impl<const N: usize> From<Fingerprint<N>> for Array1<u64> {
+    fn from(fp: Fingerprint<N>) -> Self {
+        Array1::from_vec(fp.words.to_vec())
+    }
+}
+```
+
+**Why**: Callers can now pass fingerprints to any function expecting `ArrayView1<u64>`.
+Enables `.sum()`, `.mapv()`, Zip, broadcasting on fingerprint data.
+
+**Example — before vs. after**:
+
+```rust
+// BEFORE: raw slice, manual loop
+let dist = k2_exact(&fp_a.words, &fp_b.words, 256);
+
+// AFTER: can also use ndarray ops directly
+let view_a = fp_a.as_array_view();
+let view_b = fp_b.as_array_view();
+let xor_popcount: u64 = (&view_a ^ &view_b).mapv(|w| w.count_ones() as u64).sum();
+```
+
+---
+
+#### 1.2 VsaVector ↔ ArrayView1<u64>
+
+**File**: `src/hpc/vsa.rs`
+
+**Same pattern as 1.1** — VsaVector has `words: [u64; 256]`. Add:
+
+```rust
+impl VsaVector {
+    pub fn as_array_view(&self) -> ArrayView1<'_, u64> {
+        ArrayView1::from_shape(VSA_WORDS, &self.words).unwrap()
+    }
+    pub fn as_array_view_mut(&mut self) -> ArrayViewMut1<'_, u64> {
+        ArrayViewMut1::from_shape(VSA_WORDS, &mut self.words).unwrap()
+    }
+}
+
+impl From<Array1<u64>> for VsaVector { ... }
+impl From<VsaVector> for Array1<u64> { ... }
+```
+
+---
+
+#### 1.3 BF16 / F16 as ndarray Element Types
+
+**Files**: `src/hpc/quantized.rs`, `src/backend/mod.rs`
+
+**Current state**: BF16 and F16 implement `ScalarOperand` (can appear in arrays)
+but NOT `BlasFloat` (cannot use `.dot()`, BLAS traits). They're second-class citizens.
+
+**Transform** — Register BF16/F16 for BLAS dispatch:
+
+```rust
+// In src/backend/mod.rs or src/hpc/quantized.rs:
+
+impl BlasFloat for BF16 {
+    fn backend_dot(xs: &[Self], ys: &[Self]) -> Self {
+        // Accumulate in f32, return as BF16
+        let mut acc = 0.0f32;
+        for (&x, &y) in xs.iter().zip(ys.iter()) {
+            acc += x.to_f32() * y.to_f32();
+        }
+        BF16::from_f32(acc)
+    }
+
+    fn backend_gemm(m: usize, n: usize, k: usize, alpha: Self,
+                    a: &[Self], lda: usize, b: &[Self], ldb: usize,
+                    beta: Self, c: &mut [Self], ldc: usize) {
+        // Delegate to existing bf16_gemm_f32, convert output back
+        crate::hpc::quantized::bf16_gemm_f32(
+            a, b,
+            /* c_f32 temp */ ...,
+            m, n, k, alpha.to_f32(), beta.to_f32()
+        );
+        // Convert f32 results back to BF16 in c
+    }
+
+    fn backend_axpy(alpha: Self, x: &[Self], y: &mut [Self]) { ... }
+    fn backend_scal(alpha: Self, x: &mut [Self]) { ... }
+    fn backend_nrm2(x: &[Self]) -> Self { ... }
+    fn backend_asum(x: &[Self]) -> Self { ... }
+}
+```
+
+**Impact**: Unlocks this:
+
+```rust
+let a: Array2<BF16> = ...;
+let b: Array2<BF16> = ...;
+let c = a.blas_gemm(BF16::one(), &b, BF16::zero());  // Just works
+```
+
+---
+
+#### 1.4 CogRecord ↔ Structured ArrayView
+
+**File**: `src/hpc/cogrecord.rs`
+
+**Current state**: CogRecord has four `Array<u8, Ix1>` fields (btree, cam, embed, meta)
+of 4096 bytes each. No combined view.
+
+**Transform**:
+
+```rust
+impl CogRecord {
+    /// View all 4 channels as a single [4, 4096] matrix.
+    pub fn as_channels_view(&self) -> ArrayView2<'_, u8> {
+        // Stack the 4 contiguous buffers into a 2-D view
+        let total = self.btree.len() + self.cam.len() + self.embed.len() + self.meta.len();
+        debug_assert_eq!(total, 4 * CONTAINER_BYTES);
+        // NOTE: only works if fields are contiguous in memory.
+        // If not, provide a copy-based `to_channels_array()`.
+        todo!("requires contiguous layout guarantee")
+    }
+
+    /// Get a single channel as ArrayView1<u64> (for Hamming/XOR ops).
+    pub fn channel_as_words(&self, channel: usize) -> ArrayView1<'_, u64> {
+        let bytes = match channel {
+            0 => self.btree.as_slice().unwrap(),
+            1 => self.cam.as_slice().unwrap(),
+            2 => self.embed.as_slice().unwrap(),
+            3 => self.meta.as_slice().unwrap(),
+            _ => panic!("channel must be 0..4"),
+        };
+        // Safe: u8 slice reinterpreted as u64 (alignment checked)
+        let words = unsafe {
+            std::slice::from_raw_parts(
+                bytes.as_ptr() as *const u64,
+                bytes.len() / 8,
+            )
+        };
+        ArrayView1::from_shape(words.len(), words).unwrap()
+    }
+}
+```
+
+---
+
+### Tier 2: Trait Extensions — hpc Operations on Array Types (Medium effort, High impact)
+
+Add extension traits so hpc operations feel native on ndarray arrays.
+
+---
+
+#### 2.1 HDC Extension Trait (Hamming, Cascade, K0/K1/K2)
+
+**New file**: `src/hpc/ndarray_ext.rs` (or extend `bitwise.rs`)
+
+**Reasoning**: The K0/K1/K2 pipeline currently takes `&[u64]`. Users with
+`Array1<u64>` must manually call `.as_slice().unwrap()`. This trait bridges that.
+
+```rust
+use crate::imp_prelude::*;
+use super::kernels::{k0_probe, k1_stats, k2_exact, SliceGate, EnergyConflict};
+
+/// HDC (Hyperdimensional Computing) operations on u64 word arrays.
+///
+/// Provides the K0/K1/K2 rejection pipeline directly on ndarray types.
+pub trait HdcOps {
+    /// K0 probe: fast 64-bit reject.
+    /// Returns true if the first word passes the gate threshold.
+    fn k0_probe(&self, candidate: &Self, gate: &SliceGate) -> bool;
+
+    /// K1 stats: 512-bit medium reject.
+    /// Returns true if the first 8 words pass the gate threshold.
+    fn k1_stats(&self, candidate: &Self, gate: &SliceGate) -> bool;
+
+    /// K2 exact: Full-width Hamming + energy decomposition.
+    fn k2_exact(&self, candidate: &Self) -> EnergyConflict;
+
+    /// Full cascade: K0 → K1 → K2 with early rejection.
+    fn cascade_distance(&self, candidate: &Self, gate: &SliceGate) -> Option<EnergyConflict>;
+}
+
+impl<S> HdcOps for ArrayBase<S, Ix1>
+where
+    S: Data<Elem = u64>,
+{
+    fn k0_probe(&self, candidate: &Self, gate: &SliceGate) -> bool {
+        k0_probe(self[0], candidate[0], gate)
+    }
+
+    fn k1_stats(&self, candidate: &Self, gate: &SliceGate) -> bool {
+        let q = self.as_slice().expect("query must be contiguous");
+        let c = candidate.as_slice().expect("candidate must be contiguous");
+        k1_stats(q, c, gate)
+    }
+
+    fn k2_exact(&self, candidate: &Self) -> EnergyConflict {
+        let q = self.as_slice().expect("query must be contiguous");
+        let c = candidate.as_slice().expect("candidate must be contiguous");
+        k2_exact(q, c, q.len())
+    }
+
+    fn cascade_distance(&self, candidate: &Self, gate: &SliceGate) -> Option<EnergyConflict> {
+        if !self.k0_probe(candidate, gate) { return None; }
+        if !self.k1_stats(candidate, gate) { return None; }
+        Some(self.k2_exact(candidate))
+    }
+}
+```
+
+**Example usage after**:
+
+```rust
+use ndarray::hpc::ndarray_ext::HdcOps;
+
+let query: Array1<u64> = fp_query.into();
+let candidate: Array1<u64> = fp_candidate.into();
+let gate = SliceGate::for_sku16k();
+
+// Clean ndarray-native API
+if let Some(result) = query.cascade_distance(&candidate, &gate) {
+    println!("Hamming: {}, Agreement: {}", result.conflict, result.agreement);
+}
+```
+
+---
+
+#### 2.2 Quantization Extension Trait
+
+**File**: extend `src/hpc/quantized.rs`
+
+**Reasoning**: Quantize/dequantize currently returns `Vec<T>`. Should return `Array1<T>`.
+
+```rust
+/// Quantization operations on f32 arrays.
+pub trait Quantize {
+    /// Quantize to BF16 (element-wise truncation).
+    fn to_bf16(&self) -> Array<BF16, Ix1>;
+
+    /// Quantize to symmetric INT8 with scale factor.
+    fn to_i8_symmetric(&self) -> (Array<i8, Ix1>, QuantParams);
+
+    /// Quantize per-channel (rows) to INT8.
+    fn to_i8_per_channel(&self) -> (Array2<i8>, PerChannelQuantParams)
+    where
+        Self: AsArray<'_, f32, Ix2>;  // Only for 2-D
+}
+
+/// Dequantization operations.
+pub trait Dequantize<A> {
+    /// Dequantize INT8 codes back to f32.
+    fn dequantize_f32(&self, params: &QuantParams) -> Array<f32, Ix1>;
+}
+
+impl<S> Quantize for ArrayBase<S, Ix1>
+where
+    S: Data<Elem = f32>,
+{
+    fn to_bf16(&self) -> Array<BF16, Ix1> {
+        let slice = self.as_slice().expect("must be contiguous");
+        let mut out = vec![BF16::ZERO; slice.len()];
+        f32_to_bf16_slice(slice, &mut out);  // existing raw function
+        Array1::from_vec(out)
+    }
+
+    fn to_i8_symmetric(&self) -> (Array<i8, Ix1>, QuantParams) {
+        let slice = self.as_slice().expect("must be contiguous");
+        let (vec, params) = quantize_f32_to_i8(slice);  // existing raw function
+        (Array1::from_vec(vec), params)
+    }
+}
+```
+
+---
+
+#### 2.3 Prefilter — Accept ArrayView2 Instead of (slice, rows, cols)
+
+**File**: `src/hpc/prefilter.rs`
+
+**Current state** (lines 48-295):
+```rust
+pub fn approx_mean_std_f32(data: &[f32]) -> (f32, f32)
+pub fn approx_row_norms_f32(data: &[f32], rows: usize, cols: usize) -> Vec<f32>
+pub fn top_k_rows_by_norm(data: &[f32], rows: usize, cols: usize, k: usize) -> Vec<usize>
+pub fn approx_column_std(data: &[f32], rows: usize, cols: usize) -> Vec<f32>
+```
+
+**Transform** — add ndarray overloads that delegate to existing implementations:
+
+```rust
+/// Array-native overloads for prefilter operations.
+pub trait Prefilter {
+    /// Approximate mean and standard deviation (SIMD).
+    fn approx_mean_std(&self) -> (f32, f32);
+
+    /// L2 norm per row (SIMD).
+    fn row_norms(&self) -> Array1<f32>;
+
+    /// Indices of top-K rows by L2 norm.
+    fn top_k_by_norm(&self, k: usize) -> Array1<usize>;
+
+    /// Standard deviation per column.
+    fn column_std(&self) -> Array1<f32>;
+}
+
+impl<S> Prefilter for ArrayBase<S, Ix2>
+where
+    S: Data<Elem = f32>,
+{
+    fn approx_mean_std(&self) -> (f32, f32) {
+        let slice = self.as_slice().expect("must be contiguous");
+        approx_mean_std_f32(slice)
+    }
+
+    fn row_norms(&self) -> Array1<f32> {
+        let slice = self.as_slice().expect("must be contiguous");
+        let (rows, cols) = self.dim();
+        Array1::from_vec(approx_row_norms_f32(slice, rows, cols))
+    }
+
+    fn top_k_by_norm(&self, k: usize) -> Array1<usize> {
+        let slice = self.as_slice().expect("must be contiguous");
+        let (rows, cols) = self.dim();
+        Array1::from_vec(top_k_rows_by_norm(slice, rows, cols, k))
+    }
+
+    fn column_std(&self) -> Array1<f32> {
+        let slice = self.as_slice().expect("must be contiguous");
+        let (rows, cols) = self.dim();
+        Array1::from_vec(approx_column_std(slice, rows, cols))
+    }
+}
+```
+
+---
+
+#### 2.4 Activations — Extend to N-D with Axis
+
+**File**: `src/hpc/activations.rs`
+
+**Current state**: Trait only works on `ArrayBase<S, Ix1>`. No per-row softmax.
+
+**Transform**:
+
+```rust
+/// Multi-dimensional activations (applied along an axis).
+pub trait ActivationsNd<A> {
+    /// Softmax along the given axis.
+    /// For a [batch, features] matrix with axis=1, computes softmax per row.
+    fn softmax_axis(&self, axis: Axis) -> Array<A, Self::Dim>;
+
+    /// Log-softmax along the given axis.
+    fn log_softmax_axis(&self, axis: Axis) -> Array<A, Self::Dim>;
+}
+
+impl<A, S> ActivationsNd<A> for ArrayBase<S, Ix2>
+where
+    A: Float + 'static,
+    S: Data<Elem = A>,
+{
+    fn softmax_axis(&self, axis: Axis) -> Array<A, Ix2> {
+        let mut result = self.to_owned();
+        for mut row in result.axis_iter_mut(axis) {
+            let max_val = row.iter().fold(A::neg_infinity(), |a, &b| a.max(b));
+            row.mapv_inplace(|v| (v - max_val).exp());
+            let sum: A = row.iter().fold(A::zero(), |acc, &v| acc + v);
+            row.mapv_inplace(|v| v / sum);
+        }
+        result
+    }
+
+    fn log_softmax_axis(&self, axis: Axis) -> Array<A, Ix2> {
+        let mut result = self.to_owned();
+        for mut row in result.axis_iter_mut(axis) {
+            let max_val = row.iter().fold(A::neg_infinity(), |a, &b| a.max(b));
+            row.mapv_inplace(|v| v - max_val);
+            let log_sum_exp = row.mapv(|v| v.exp())
+                .iter().fold(A::zero(), |acc, &v| acc + v).ln();
+            row.mapv_inplace(|v| v - log_sum_exp);
+        }
+        result
+    }
+}
+```
+
+**Impact**: Enables `batch_logits.softmax_axis(Axis(1))` — critical for inference.
+
+---
+
+#### 2.5 VML — SIMD Transcendentals as mapv Acceleration
+
+**File**: `src/hpc/vml.rs` (2543 lines)
+
+**Current state**: 12+ standalone functions like `vsexp(x: &[f32], out: &mut [f32])`.
+
+**Transform** — add `ArrayBase` methods that auto-dispatch to SIMD:
+
+```rust
+/// SIMD-accelerated element-wise math on contiguous f32 arrays.
+///
+/// These call the VML SIMD kernels when the array is contiguous,
+/// falling back to scalar mapv() otherwise.
+pub trait SimdMath {
+    fn simd_exp(&self) -> Array<f32, Self::Dim>;
+    fn simd_ln(&self) -> Array<f32, Self::Dim>;
+    fn simd_sqrt(&self) -> Array<f32, Self::Dim>;
+    fn simd_sin(&self) -> Array<f32, Self::Dim>;
+    fn simd_cos(&self) -> Array<f32, Self::Dim>;
+    fn simd_tanh(&self) -> Array<f32, Self::Dim>;
+    fn simd_abs(&self) -> Array<f32, Self::Dim>;
+}
+
+impl<S, D> SimdMath for ArrayBase<S, D>
+where
+    S: Data<Elem = f32>,
+    D: Dimension,
+{
+    fn simd_exp(&self) -> Array<f32, D> {
+        if let Some(slice) = self.as_slice() {
+            let mut out = vec![0.0f32; slice.len()];
+            vsexp(slice, &mut out);  // existing SIMD kernel
+            Array::from_shape_vec(self.raw_dim(), out).unwrap()
+        } else {
+            self.mapv(|x| x.exp())
+        }
+    }
+
+    fn simd_ln(&self) -> Array<f32, D> {
+        if let Some(slice) = self.as_slice() {
+            let mut out = vec![0.0f32; slice.len()];
+            vsln(slice, &mut out);
+            Array::from_shape_vec(self.raw_dim(), out).unwrap()
+        } else {
+            self.mapv(|x| x.ln())
+        }
+    }
+    // ... same pattern for sqrt, sin, cos, tanh, abs
+}
+```
+
+**Benchmark expectation**: 10–50x for transcendentals (polynomial SIMD vs scalar libm).
+
+---
+
+### Tier 3: Backend Integration — Core Dispatch to SIMD (Medium effort, Very High impact)
+
+Make ndarray's **built-in** operations (`.sum()`, `.mean()`, `.dot()`) automatically
+use hpc SIMD kernels for contiguous arrays.
+
+---
+
+#### 3.1 Wire Reductions into Core
+
+**Files**: `src/numeric/impl_numeric.rs` (core sum/mean) + `src/hpc/reductions.rs`
+
+**Current state**: `Array.sum()` uses `numeric_util::unrolled_fold()` — a scalar
+4x unrolled loop. `hpc::reductions::sum_f32()` uses F32x16 (16-lane AVX-512)
+with 4-accumulator ILP. They never talk to each other.
+
+**Transform** — add backend dispatch in core's sum():
+
+```rust
+// In src/numeric/impl_numeric.rs or src/backend/native.rs:
+
+impl<A, S, D> ArrayBase<S, D>
+where
+    A: Clone + num_traits::Zero + std::ops::Add<Output = A>,
+    S: Data<Elem = A>,
+    D: Dimension,
+{
+    pub fn sum(&self) -> A {
+        // Fast path: contiguous f32 → SIMD reduction
+        if std::any::TypeId::of::<A>() == std::any::TypeId::of::<f32>() {
+            if let Some(slice) = self.as_slice_memory_order() {
+                // SAFETY: TypeId check guarantees A == f32
+                let f32_slice = unsafe {
+                    std::slice::from_raw_parts(slice.as_ptr() as *const f32, slice.len())
+                };
+                let result = crate::hpc::reductions::sum_f32(f32_slice);
+                // SAFETY: result is f32, A is f32
+                return unsafe { std::mem::transmute_copy(&result) };
+            }
+        }
+        if std::any::TypeId::of::<A>() == std::any::TypeId::of::<f64>() {
+            if let Some(slice) = self.as_slice_memory_order() {
+                let f64_slice = unsafe {
+                    std::slice::from_raw_parts(slice.as_ptr() as *const f64, slice.len())
+                };
+                let result = crate::hpc::reductions::sum_f64(f64_slice);
+                return unsafe { std::mem::transmute_copy(&result) };
+            }
+        }
+        // Fallback: generic fold
+        self.fold(A::zero(), |acc, &x| acc + x)
+    }
+}
+```
+
+**Impact**: Every `.sum()` on a contiguous f32/f64 array silently becomes 16x
+faster. Zero API change for users.
+
+**Same pattern for**: `.mean()`, internal `max()`/`min()` in softmax paths.
+
+---
+
+#### 3.2 Delete Dead SIMD Detection Code
+
+**Files**: `src/hpc/simd_dispatch.rs` (362 lines), `src/hpc/simd_caps.rs` (515 lines)
+
+Dead under the target-cpu pin. Delete or gate behind `cfg(not(target_feature = "avx512f"))`.
+
+**Result**: -877 lines.
+
+---
+
+#### 3.3 VNNI GEMM via BlasFloat Dispatch
+
+**Files**: `src/hpc/vnni_gemm.rs`, `src/hpc/quantized.rs:618`, `src/backend/mod.rs`
+
+**Current state**: `int8_gemm_vnni()` takes raw `&[u8], &[i8], &mut [i32]` with manual
+m/n/k parameters. Not accessible via `Array.dot()` or `blas_gemm()`.
+
+**Transform** — wire INT8 GEMM into the backend trait system:
+
+```rust
+// New trait in src/backend/mod.rs:
+pub trait BlasInt8 {
+    /// INT8 GEMM: C_i32 = A_u8 × B_i8
+    fn backend_gemm_i8(
+        m: usize, n: usize, k: usize,
+        a: &[u8], lda: usize,
+        b: &[i8], ldb: usize,
+        c: &mut [i32], ldc: usize,
+    );
+}
+
+// Extension trait on Array:
+pub trait Int8Matmul {
+    /// Quantized matrix multiply: self (u8) × rhs (i8) → Array2<i32>
+    fn matmul_i8(&self, rhs: &ArrayView2<i8>) -> Array2<i32>;
+}
+
+impl<S> Int8Matmul for ArrayBase<S, Ix2>
+where
+    S: Data<Elem = u8>,
+{
+    fn matmul_i8(&self, rhs: &ArrayView2<i8>) -> Array2<i32> {
+        let (m, k) = self.dim();
+        let (k2, n) = rhs.dim();
+        assert_eq!(k, k2, "inner dimensions must match");
+
+        let a_slice = self.as_slice().expect("A must be contiguous");
+        let b_slice = rhs.as_slice().expect("B must be contiguous");
+        let mut c = vec![0i32; m * n];
+
+        crate::hpc::vnni_gemm::int8_gemm_vnni(a_slice, b_slice, &mut c, m, n, k);
+        Array2::from_shape_vec((m, n), c).unwrap()
+    }
+}
+```
+
+**Example**:
+
+```rust
+let weights: Array2<i8> = load_quantized_weights();
+let activations: Array2<u8> = quantize_input(&input);
+let output: Array2<i32> = activations.matmul_i8(&weights.view());
+// 4× throughput vs f32 GEMM
+```
+
+---
+
+### Tier 4: View Factories — Arrow & External Data (Low effort, Medium impact)
+
+Make hpc's data ingestion return ndarray views instead of raw slices.
+
+---
+
+#### 4.1 Arrow Bridge → ArrayView
+
+**File**: `src/hpc/arrow_bridge.rs`
+
+**Current state**: `binary_entry()` returns `&[u8]`. `SoakingBuffer` provides
+`as_columnar_slice()` as `&[i8]`.
+
+**Transform**:
+
+```rust
+impl ArrowBridge {
+    /// Get one binary entry as a 1-D array view.
+    pub fn binary_entry_view(&self, idx: usize) -> ArrayView1<'_, u8> {
+        let bytes = self.binary_entry(idx);
+        ArrayView1::from_shape(bytes.len(), bytes).unwrap()
+    }
+
+    /// Get the full binary column as a 2-D view [n_rows, entry_bytes].
+    pub fn binary_column_view(&self) -> ArrayView2<'_, u8> {
+        let data = self.as_binary_slice();
+        let entry_len = self.entry_len();
+        let n_rows = data.len() / entry_len;
+        ArrayView2::from_shape((n_rows, entry_len), data).unwrap()
+    }
+}
+
+impl SoakingBuffer {
+    /// View the columnar buffer as a 2-D matrix [n_entries, n_dims].
+    pub fn as_array_view_2d(&self) -> ArrayView2<'_, i8> {
+        let data = self.as_columnar_slice();
+        ArrayView2::from_shape((self.n_entries, self.n_dims), data).unwrap()
+    }
+}
+```
+
+**Impact**: Downstream code can use ndarray broadcasting, slicing, and Zip
+on Arrow data without copying.
+
+---
+
+#### 4.2 Distance Functions → Return Array1
+
+**File**: `src/hpc/distance.rs`
+
+**Current state**: Returns `Vec<f32>`.
+
+**Transform**:
+
+```rust
+// Keep existing:
+pub fn squared_distances_f32(query: [f32; 3], points: &[[f32; 3]]) -> Vec<f32> { ... }
+
+// Add ndarray overload:
+pub fn squared_distances_array(query: &ArrayView1<f32>, points: &ArrayView2<f32>) -> Array1<f32> {
+    assert_eq!(query.len(), points.ncols());
+    let n = points.nrows();
+    let mut result = Array1::zeros(n);
+    for (i, row) in points.axis_iter(Axis(0)).enumerate() {
+        let diff = &row - query;
+        result[i] = diff.mapv(|x| x * x).sum();
+    }
+    result
+}
+```
+
+---
+
+### Tier 5: Zip & Parallel — Modernize Inner Loops (Medium effort, Medium impact)
+
+Replace manual `iter_mut().zip()` with ndarray's `Zip` for vectorization hints
+and future `par_for_each()` support.
+
+---
+
+#### 5.1 VML Tail Loops → Zip
+
+**File**: `src/hpc/vml.rs` — 12+ sites
+
+**Current pattern** (repeated throughout):
+```rust
+for (o, &v) in out[tail_start..].iter_mut().zip(x[tail_start..].iter()) {
+    *o = v.exp();
+}
+```
+
+**Transform**:
+```rust
+use crate::Zip;
+
+let out_view = ArrayViewMut1::from_shape(n - tail_start, &mut out[tail_start..]).unwrap();
+let x_view = ArrayView1::from_shape(n - tail_start, &x[tail_start..]).unwrap();
+Zip::from(&mut out_view).and(&x_view).for_each(|o, &v| {
+    *o = v.exp();
+});
+```
+
+**Benefit**: Zip provides SIMD-friendlier iteration order and enables
+future `par_for_each()` without restructuring.
+
+---
+
+#### 5.2 Hamming Batch → Parallel Zip
+
+**File**: `src/hpc/bitwise.rs` lines 320-333
+
+**Current state**: Sequential loop over candidates.
+
+**Transform** (feature-gated on `rayon`):
+
+```rust
+#[cfg(feature = "rayon")]
+pub fn hamming_batch_parallel(
+    query: &ArrayView1<u8>,
+    database: &ArrayView2<u8>,  // [n_candidates, vec_len]
+) -> Array1<u64> {
+    use rayon::prelude::*;
+    let q = query.as_slice().unwrap();
+    let results: Vec<u64> = database
+        .axis_iter(Axis(0))
+        .into_par_iter()
+        .map(|row| {
+            let r = row.as_slice().unwrap();
+            dispatch_hamming(q, r)
+        })
+        .collect();
+    Array1::from_vec(results)
+}
+```
+
+---
+
+### Tier 6: Multi-Dimensional Reductions (High effort, High impact for ML)
+
+Lift 1-D slice operations to full N-D axis reductions.
+
+---
+
+#### 6.1 SIMD Axis Reductions
+
+**File**: `src/hpc/reductions.rs`
+
+**Current state**: Only 1-D: `sum_f32(&[f32])`, `mean_f32(&[f32])`, etc.
+
+**Transform**:
+
+```rust
+/// Sum along an axis, using SIMD for contiguous lanes.
+pub fn sum_axis_f32<S, D>(arr: &ArrayBase<S, D>, axis: Axis) -> Array<f32, D::Smaller>
+where
+    S: Data<Elem = f32>,
+    D: Dimension + RemoveAxis,
+{
+    let mut result = Array::zeros(arr.raw_dim().remove_axis(axis));
+    for (i, lane) in arr.lanes(axis).into_iter().enumerate() {
+        if let Some(slice) = lane.as_slice() {
+            result.as_slice_mut().unwrap()[i] = sum_f32(slice);
+        } else {
+            result.as_slice_mut().unwrap()[i] = lane.iter().fold(0.0, |a, &b| a + b);
+        }
+    }
+    result
+}
+```
+
+---
+
+#### 6.2 Statistics Trait — Generalize to N-D
+
+**File**: `src/hpc/statistics.rs`
+
+**Transform**: Make `variance`, `std_dev`, `percentile` work along axes:
+
+```rust
+pub trait StatisticsNd<A> {
+    type Reduced;
+
+    fn variance_axis(&self, axis: Axis, ddof: usize) -> Self::Reduced;
+    fn std_axis(&self, axis: Axis, ddof: usize) -> Self::Reduced;
+    fn percentile_axis(&self, axis: Axis, q: f64) -> Self::Reduced;
+}
+```
+
+---
+
+## Execution Order & Dependencies
+
+```
+Phase A (2 weeks) — Type Bridges:
+  1.1 Fingerprint ↔ ArrayView1      (standalone, no deps)
+  1.2 VsaVector ↔ ArrayView1        (standalone)
+  1.3 BF16/F16 → BlasFloat          (requires backend/mod.rs edit)
+  1.4 CogRecord channel views       (standalone)
+
+Phase B (2 weeks) — Extension Traits:
+  2.1 HdcOps trait (K0/K1/K2)       (depends on 1.1)
+  2.2 Quantize trait                 (depends on 1.3)
+  2.3 Prefilter trait                (standalone)
+  2.4 ActivationsNd (axis softmax)   (standalone)
+  2.5 SimdMath trait (VML)           (standalone)
+
+Phase C (2 weeks) — Backend Wiring:
+  3.1 Core sum/mean → typed SIMD     (standalone — calls crate::simd::F32x16 directly)
+  3.2 Delete dead detection code     (standalone, -877 lines unreachable under cfg pin)
+  3.3 INT8 Matmul via trait          (depends on 1.3 pattern)
+
+Phase D (1 week) — View Factories:
+  4.1 Arrow → ArrayView              (standalone)
+  4.2 Distance → Array1 returns      (standalone)
+
+Phase E (1 week) — Modernize Loops:
+  5.1 VML tails → Zip                (standalone)
+  5.2 Hamming batch → parallel       (standalone)
+
+Phase F (2 weeks) — N-D Reductions:
+  6.1 SIMD axis reductions           (depends on 3.1)
+  6.2 Statistics N-D                  (depends on 6.1)
+```
+
+---
+
+## Rules of Engagement
+
+1. **No deletion** — raw-slice functions stay. They serve FFI, embedded, and hot-path callers
+   who've already pre-validated contiguity.
+
+2. **Extension traits, not modifications** — new traits in new files or at the bottom of
+   existing files. Existing public APIs unchanged.
+
+3. **Contiguous fast-path, generic fallback** — every new trait method checks
+   `as_slice()` first. If contiguous → SIMD dispatch. If not → `mapv()`/`iter()`.
+
+4. **Feature-gate expensive additions** — rayon parallel (`rayon` feature),
+   BlasFloat for BF16 (`bf16-blas` feature if controversial).
+
+5. **Tests mirror existing** — each new trait method gets the same test coverage
+   as the raw-slice version it wraps.
+
+6. **Naming convention** — ndarray-native methods use `snake_case` matching ndarray
+   style (`.softmax_axis()`, `.simd_exp()`). Existing raw functions keep their
+   current names (`softmax_f32()`, `vsexp()`).
+
+---
+
+## Impact Matrix
+
+| Item | Lines Changed | New Lines | Performance Impact | Ergonomics Impact |
+|------|--------------|-----------|-------------------|-------------------|
+| 1.1 Fingerprint bridge | 0 | ~40 | Zero (zero-copy) | High — unblocks Zip/broadcast |
+| 1.3 BF16 BlasFloat | ~10 | ~80 | Enables BF16 GEMM via .dot() | Very High — removes type barrier |
+| 2.1 HdcOps trait | 0 | ~60 | Zero (delegates) | High — clean pipeline API |
+| 2.5 SimdMath (VML) | 0 | ~100 | 10-50x for exp/ln/sin | Very High — transparent SIMD |
+| 3.1 Core sum → SIMD | ~20 | ~40 | 16x for f32 sum/mean | Zero (invisible) |
+| 3.2 Unified detection | -877 | ~50 | Zero (dedup) | N/A (internal) |
+| 3.3 INT8 matmul trait | 0 | ~60 | 4x over f32 GEMM | High — quantized inference |
+| 5.2 Parallel hamming | 0 | ~30 | N×cores speedup | Medium |
+| 6.1 SIMD axis reductions | 0 | ~80 | 16x per lane | High — ML training |
+
+---
+
+## What This Achieves
+
+After this refactoring, hpc/ is no longer a "second library living inside ndarray."
+It becomes the **performance substrate** that ndarray's own operations dispatch to:
+
+```rust
+// User code — completely ndarray-native, no hpc imports needed:
+let a: Array2<f32> = load_embeddings();
+let b: Array2<f32> = load_queries();
+
+let sims = a.dot(&b.t());           // → routes to SIMD GEMM internally
+let topk = sims.sum_axis(Axis(1));   // → routes to SIMD sum internally
+let probs = sims.softmax_axis(Axis(1));  // → routes to SIMD exp internally
+
+// User code — HDC domain, imports hpc trait:
+use ndarray::hpc::ndarray_ext::HdcOps;
+let fp: Array1<u64> = Fingerprint::<256>::random(42).into();
+let candidates: Array2<u64> = load_database();
+
+for row in candidates.axis_iter(Axis(0)) {
+    if let Some(ec) = fp.cascade_distance(&row, &gate) {
+        hits.push(ec);
+    }
+}
+```
+
+The raw-slice layer (`hpc::kernels::k2_exact(&[u64], ...)`) remains for:
+- FFI exports (Python, C)
+- Embedded/no-std contexts
+- Callers who've already validated layout and want zero overhead
+- Benchmarks that isolate kernel performance from array metadata
+
+Both worlds coexist. The bridge is the extension traits.
diff --git a/SOA_KERNEL_ARCHITECTURE.md b/SOA_KERNEL_ARCHITECTURE.md
new file mode 100644
index 00000000..384023b3
--- /dev/null
+++ b/SOA_KERNEL_ARCHITECTURE.md
@@ -0,0 +1,478 @@
+# SoA Kernel Architecture: The Transposed Cascade
+
+> **The insight**: Don't iterate candidates and probe words.
+> Iterate words and filter candidates.
+>
+> The database layout IS the algorithm.
+
+---
+
+## The Current Architecture (AoS — Row-Major)
+
+```
+Memory layout: [candidate₀ word₀..word₂₅₅] [candidate₁ word₀..word₂₅₅] ...
+
+Algorithm (kernel_pipeline, line 725):
+    for each candidate:          ← sequential over N
+        K0: XOR word[0]          ← random access into candidate's word 0
+        K1: XOR word[0..8]       ← sequential within one candidate
+        K2: XOR word[0..256]     ← sequential within one candidate
+
+Cache behavior:
+    K0 touches 8 bytes per candidate but strides by 2048 bytes.
+    For 1M candidates: 8 bytes useful / 2048 bytes per cache line = 0.4% utilization.
+```
+
+Each K0 probe fetches one u64 from a different 2KB region. The CPU prefetcher
+can't help because the access pattern is "8 bytes at stride 2048" — every fetch
+pulls a full cache line but uses 1/256th of it.
+
+---
+
+## The Transposed Architecture (SoA — Column-Major)
+
+```
+Memory layout:
+    Column 0:   [cand₀_word₀, cand₁_word₀, cand₂_word₀, ... candₙ_word₀]
+    Column 1:   [cand₀_word₁, cand₁_word₁, cand₂_word₁, ... candₙ_word₁]
+    ...
+    Column 255: [cand₀_word₂₅₅, cand₁_word₂₅₅, ... candₙ_word₂₅₅]
+
+Algorithm (transposed cascade):
+    Phase K0: scan Column 0 — XOR all N words with query[0], threshold filter
+        → produces survivor_mask (bitmask of which candidates survived)
+
+    Phase K1: scan Columns 0..8 — XOR only survivors' words, accumulate
+        → narrows survivor_mask further
+
+    Phase K2: scan Columns 0..255 — XOR only final survivors
+        → exact EnergyConflict for each
+
+Cache behavior:
+    K0 is a contiguous sequential scan of N × 8 bytes.
+    For 1M candidates: 8MB, perfectly prefetchable, 100% cache line utilization.
+    AVX-512: 8 candidates per instruction (512 bits / 64 bits = 8 u64s).
+```
+
+---
+
+## Why This Is Genius at the Meta Level
+
+### 1. The Cascade Becomes SIMD-Native Filtering
+
+Current K0 (per-candidate, scalar):
+```rust
+for i in 0..n_candidates {
+    let word0 = database[i * 256];           // stride 2048 bytes — cache-hostile
+    if (word0 ^ query_word0).count_ones() > threshold {
+        continue;
+    }
+    // ... K1, K2
+}
+```
+
+Transposed K0 (columnar, SIMD):
+```rust
+let col0: &[u64] = &columns[0];              // contiguous N×8 bytes
+let q0 = U64x8::splat(query_words[0]);       // broadcast query word 0
+let thresh = U32x8::splat(gate.k0_reject);
+
+let mut survivors = vec![0u64; (n_candidates + 63) / 64];  // bitmask
+
+for chunk in 0..(n_candidates / 8) {
+    let c = U64x8::from_slice(&col0[chunk * 8..]);
+    let xor = c ^ q0;
+    let popcnt = xor.popcnt_per_lane();      // 8 popcounts in parallel
+    let mask = popcnt.le(thresh);             // 8 comparisons → 8-bit mask
+    survivors[chunk / 8] |= mask.to_bitmask() << (chunk % 8 * 8);
+}
+```
+
+**8 candidates per clock cycle instead of 1.** And the memory access is a
+sequential stream — the hardware prefetcher handles it perfectly.
+
+---
+
+### 2. The Survivor Mask IS the State Machine
+
+The cascade becomes mask narrowing — no branch prediction, no `continue`:
+
+```rust
+// Start: all candidates alive
+let mut mask: Vec<u64> = vec![u64::MAX; (n + 63) / 64];
+
+// K0: narrow mask via column 0
+k0_filter_columnar(&columns[0], query[0], gate.k0_reject, &mut mask);
+
+// K1: narrow mask via columns 0..8 (only touch survivors)
+k1_filter_columnar(&columns[0..8], &query[0..8], gate.k1_reject, &mut mask);
+
+// K2: exact only on final survivors
+let results = k2_exact_masked(&columns, query, &mask);
+```
+
+Each phase reads ONLY the columns it needs. K0 touches 1 column (N×8 bytes).
+K1 touches 8 columns (N×64 bytes). K2 touches 256 columns but only for
+survivors (~3-5% of N). Total memory touched:
+
+```
+AoS:  N × 2048 bytes (reads all words for all candidates even if K0 rejects)
+      Actually reads: N × 8 (K0) + 0.05N × 64 (K1) + 0.003N × 2048 (K2)
+      But cache lines pulled: N × 64 (one cache line per K0 probe) = N×64 bytes
+
+SoA:  N × 8 (K0, contiguous) + N × 64 (K1, contiguous) + 0.003N × 2048 (K2)
+      Effective: N × 72 + 6N = ~78N bytes vs ~64N bytes (AoS)
+      BUT: 100% utilization vs 12.5% utilization per cache line
+      Net bandwidth saving: ~8× due to no wasted cache line fetch
+```
+
+---
+
+### 3. ndarray Makes This Free (Strides, Not Copies)
+
+Here's the meta-level genius: **you don't transpose the data. You transpose the strides.**
+
+```rust
+// Build database in Fortran (column-major) order:
+let database: Array2<u64> = Array2::from_shape_vec(
+    (n_candidates, 256).f(),  // .f() = Fortran order
+    flat_word_data,
+).unwrap();
+
+// Column access is now CONTIGUOUS (no copy, no transpose):
+let col0: ArrayView1<u64> = database.column(0);
+assert!(col0.is_standard_layout());  // TRUE — physically contiguous
+
+// Row access still works (strided, for K2 on survivors):
+let survivor_fp: ArrayView1<u64> = database.row(42);
+// Not contiguous, but only used for the 3-5% of survivors — doesn't matter
+```
+
+The `(n_candidates, 256).f()` tells ndarray to lay out memory column-by-column.
+Accessing `.column(i)` returns a contiguous view. Accessing `.row(i)` returns a
+strided view. **Same data, both access patterns, zero transposition cost.**
+
+This is ndarray's stride system doing something no raw-slice library can:
+**the same allocation serves both columnar scanning AND per-candidate access.**
+
+---
+
+### 4. Arrow Is Already Column-Major
+
+Arrow stores columns separately. The arrow_bridge already provides columnar views.
+If each "word position" is stored as its own Arrow column (or a single Arrow
+FixedSizeList with the right chunking), the SoA layout is the native format:
+
+```rust
+// Arrow RecordBatch with 256 columns of u64:
+// col_0: [cand₀_w0, cand₁_w0, ...]  ← this IS the K0 scan buffer
+// col_1: [cand₀_w1, cand₁_w1, ...]
+// ...
+
+fn kernel_pipeline_arrow(batch: &RecordBatch, query: &[u64; 256], gate: &SliceGate) {
+    // K0: read column 0 directly from Arrow buffer (zero-copy)
+    let col0 = batch.column(0).as_primitive::<UInt64Type>().values();
+    let mask = k0_filter_columnar(col0, query[0], gate.k0_reject);
+
+    // K1: read columns 0..8
+    for i in 0..8 {
+        let col = batch.column(i).as_primitive::<UInt64Type>().values();
+        k1_accumulate_columnar(col, query[i], &mut mask, &mut accum);
+    }
+    // ...
+}
+```
+
+**No conversion. No copy. Arrow's native layout IS the SoA kernel's input format.**
+
+---
+
+### 5. Multi-Resolution SoA: The Three Tiers as Physical Arrays
+
+```rust
+/// Tiered columnar database. Each tier is a separate Array2 because they have
+/// different access patterns and lifetimes.
+pub struct TieredDatabase {
+    /// Tier K0: shape [N, 1] — just word 0 per candidate.
+    /// Fits in L2 for 1M candidates (8MB).
+    pub k0: Array2<u64>,  // Column-major, 1 column
+
+    /// Tier K1: shape [N, 8] — first 8 words per candidate.
+    /// Fits in L3 for 1M candidates (64MB).
+    pub k1: Array2<u64>,  // Column-major, 8 columns
+
+    /// Tier K2: shape [N, 256] — full containers.
+    /// Lives in main memory. Only accessed for 3-5% survivors.
+    pub k2: Array2<u64>,  // Column-major, 256 columns
+}
+```
+
+**Why separate arrays?** Because K0 is HOT (scanned every query). K1 is WARM
+(scanned for ~5% survivors). K2 is COLD (scanned for ~0.3% survivors). Separating
+them lets you:
+- Pin K0 in L2 cache (8MB for 1M candidates)
+- Let K1 live in L3
+- Let K2 stay in DRAM — only a few thousand rows are ever touched
+
+This is **cache-aware SoA** — not just columnar layout, but columnar layout with
+physical separation matching the thermal hierarchy of the CPU cache.
+
+---
+
+### 6. BF16 Field-Separated SoA: Awareness Becomes Memory Layout
+
+Current BF16 storage: N vectors × 1024 BF16 values = N × 2048 bytes (interleaved).
+
+Field-separated SoA:
+```rust
+/// BF16 database stored as separated bitfields.
+/// The decomposition that bf16_hamming.rs does AT RUNTIME is instead
+/// baked INTO THE STORAGE FORMAT.
+pub struct BF16FieldDatabase {
+    /// Sign bits: 1 bit per dimension × 1024 dims = 128 bytes per vector.
+    /// Shape [N, 128] as u8 — each byte = 8 sign bits.
+    pub signs: Array2<u8>,
+
+    /// Exponent fields: 8 bits per dimension × 1024 dims = 1024 bytes per vector.
+    /// Shape [N, 1024] as u8.
+    pub exponents: Array2<u8>,
+
+    /// Mantissa fields: 7 bits per dimension × 1024 dims ≈ 896 bytes per vector.
+    /// Shape [N, 896] as u8 (packed 7-bit).
+    pub mantissas: Array2<u8>,
+}
+```
+
+**The awareness algebra becomes trivially parallel**:
+
+```rust
+impl BF16FieldDatabase {
+    /// Sign-only distance (strongest signal): just Hamming on sign bytes.
+    /// 128 bytes per comparison — fits in one AVX-512 pass for 4 vectors.
+    fn sign_distance(&self, query_signs: &[u8], candidate_idx: usize) -> u32 {
+        let row = self.signs.row(candidate_idx);
+        hamming_dispatch(query_signs, row.as_slice().unwrap())
+    }
+
+    /// Exponent distance (medium signal): Hamming on exp bytes.
+    fn exp_distance(&self, query_exp: &[u8], candidate_idx: usize) -> u32 {
+        let row = self.exponents.row(candidate_idx);
+        hamming_dispatch(query_exp, row.as_slice().unwrap())
+    }
+
+    /// Awareness classification without per-element decomposition.
+    /// Signs are already separated — just compare columns directly.
+    fn batch_awareness(&self, query_signs: &Array1<u8>) -> Array1<AwarenessState> {
+        // For ALL candidates simultaneously:
+        // sign_agreement = popcount(NOT (query_signs XOR candidate_signs))
+        // Already column-accessible, already SIMD-friendly
+        let n = self.signs.nrows();
+        let mut states = Array1::zeros(n);
+        for i in 0..n {
+            let row = self.signs.row(i);
+            let agreement = hamming_agreement(query_signs.as_slice().unwrap(),
+                                             row.as_slice().unwrap());
+            states[i] = classify_awareness(agreement, ...);
+        }
+        states
+    }
+}
+```
+
+**The insight**: bf16_hamming.rs currently decomposes each BF16 value into
+sign/exp/mantissa at QUERY TIME (runtime unpacking). With field-separated SoA,
+the decomposition happens at INGEST TIME (once). Every subsequent query operates
+on pre-decomposed fields. The hot path never sees a BF16 value — only its fields.
+
+This trades ingest bandwidth for query speed:
+- Ingest: O(N) decomposition (amortized over all future queries)
+- Query: zero decomposition cost (fields already separated)
+- For a database queried 10,000× per second, this is a massive win.
+
+---
+
+### 7. PackedQualia SoA: 16 Parallel Dimension Columns
+
+Current: `Vec<PackedQualia>` where each PackedQualia = 16 × i8 + 1 BF16 (18 bytes).
+
+SoA:
+```rust
+/// Qualia database stored as separated dimension columns.
+/// Each column is ONE phenomenological dimension across ALL agents.
+pub struct QualiaColumns {
+    /// 16 dimension columns, each Array1<i8> with N_agents elements.
+    /// dims[0] = warmth, dims[1] = texture, dims[4] = social, etc.
+    pub dims: [Array1<i8>; 16],
+
+    /// Magnitude column: Array1<u16> (BF16 scalars).
+    pub magnitude: Array1<u16>,
+}
+```
+
+**What this unlocks**:
+
+```rust
+impl QualiaColumns {
+    /// "Which agents are tensioned on warmth (dim 4)?"
+    /// → Single SIMD scan of one column.
+    fn tensioned_on(&self, dim: usize, threshold: i8) -> Array1<bool> {
+        self.dims[dim].mapv(|v| v.abs() > threshold && v < 0)
+    }
+
+    /// "Resonance between agent A and all others on dimension 5"
+    /// → One column × scalar = Array1<i8>
+    fn resonance_on_dim(&self, agent_idx: usize, dim: usize) -> Array1<i16> {
+        let agent_val = self.dims[dim][agent_idx] as i16;
+        self.dims[dim].mapv(|v| v as i16 * agent_val)
+    }
+
+    /// "Total resonance between agent A and all others (dot product)"
+    /// → 16 column scans, accumulate
+    fn total_resonance(&self, agent_idx: usize) -> Array1<i32> {
+        let mut acc = Array1::<i32>::zeros(self.dims[0].len());
+        for d in 0..16 {
+            let agent_val = self.dims[d][agent_idx] as i32;
+            acc += &self.dims[d].mapv(|v| v as i32 * agent_val);
+        }
+        acc
+    }
+}
+```
+
+**AoS version of the same query** (current):
+```rust
+// "Find agents tensioned on warmth" — must scan ALL 18 bytes per agent,
+// extract byte [4], compare. Cache line waste: 4/18 ≈ 22%.
+for agent in &qualia_db {
+    if agent.resonance[4].abs() > threshold && agent.resonance[4] < 0 {
+        results.push(agent);
+    }
+}
+```
+
+**SoA version**: scan 1 byte per agent, sequential, 100% cache utilization.
+For 1M agents: 1MB scan vs ~18MB with random byte extraction.
+
+---
+
+### 8. The Meta-Meta Level: ndarray IS the SoA Abstraction
+
+The deepest insight is that **ndarray's type system already encodes this**.
+`Array2<u64>` with Fortran order IS a SoA database. The columns are the "fields."
+The rows are the "records." And ndarray's stride machinery gives you BOTH access
+patterns from the same allocation:
+
+```rust
+// THE SAME Array2, TWO access patterns:
+
+// SoA access (column scan for K0 — contiguous):
+let k0_column: ArrayView1<u64> = database.column(0);
+// → strides = [1], physically contiguous
+
+// AoS access (full record for K2 — strided):  
+let full_record: ArrayView1<u64> = database.row(survivor_idx);
+// → strides = [n_candidates], non-contiguous but only 3% of data
+```
+
+You don't need two copies. You don't need explicit transposition. ndarray's
+`Order::F` gives you SoA as the NATIVE layout while `Order::C` gives you AoS.
+The same API works for both — only the performance characteristics change.
+
+**This is what makes ndarray + hpc/ potentially genius**: the array type IS
+the database schema. Shape IS the data model. Strides ARE the access pattern.
+And the cascade kernel can be expressed as:
+
+```rust
+pub fn cascade_soa(
+    database: &Array2<u64>,      // F-order: [N, 256], columns contiguous
+    query: &Array1<u64>,         // [256]
+    gate: &SliceGate,
+) -> Vec<KernelResult> {
+    let n = database.nrows();
+
+    // K0: column 0 scan (contiguous — full SIMD bandwidth)
+    let col0 = database.column(0);
+    let mut mask = k0_columnar_simd(col0.as_slice().unwrap(), query[0], gate);
+
+    // K1: columns 0..8 scan (contiguous per column)
+    for w in 0..8 {
+        let col = database.column(w);
+        k1_accumulate_simd(col.as_slice().unwrap(), query[w], &mut mask);
+    }
+    k1_threshold(&mut mask, gate);
+
+    // K2: only survivors (row access — strided but few)
+    let mut results = Vec::new();
+    for idx in mask.iter_ones() {
+        let row = database.row(idx);
+        // row is strided — copy to stack for SIMD (256 × 8 = 2KB, fits in L1)
+        let mut buf = [0u64; 256];
+        row.assign_to(ArrayViewMut1::from(&mut buf));
+        let ec = k2_exact(&query.as_slice().unwrap(), &buf, 256);
+        results.push(/* ... */);
+    }
+    results
+}
+```
+
+---
+
+## The Shopping List Addition
+
+Add to `REFACTOR_HPC_INTEGRATION.md` Phase C:
+
+```
+Phase C-SoA (2 weeks) — Columnar Kernel Architecture:
+    C.1  TieredDatabase struct (K0/K1/K2 as separate F-order Array2)
+    C.2  k0_columnar_simd: scan N u64s with U64x8 broadcast+XOR+popcount
+    C.3  Bitmask survivor propagation (no per-candidate branching)
+    C.4  k1_accumulate_columnar: 8-column scan with running mask
+    C.5  k2_exact_masked: row extraction for final survivors
+    C.6  Arrow native path: RecordBatch columns → direct columnar scan
+    C.7  BF16FieldDatabase: sign/exp/mantissa pre-separated at ingest
+    C.8  QualiaColumns: 16 × Array1<i8> for per-dimension scan
+    C.9  Benchmark: AoS vs SoA cascade on 1M SKU-16K containers
+```
+
+---
+
+## Expected Performance Impact
+
+| Operation | AoS (current) | SoA (proposed) | Speedup | Why |
+|-----------|---------------|----------------|---------|-----|
+| K0 scan (1M) | ~2ms (cache-miss bound) | ~0.25ms (streaming) | **8×** | Contiguous vs stride-2048 |
+| K1 scan (50K survivors) | ~0.8ms | ~0.1ms | **8×** | Same reason, 8 columns |
+| K2 exact (1500 survivors) | ~0.3ms | ~0.35ms | 0.85× | Row copy overhead |
+| **Total pipeline** | **~3.1ms** | **~0.7ms** | **4.4×** | K0+K1 dominate |
+| BF16 awareness (1M) | ~5ms (decompose each) | ~0.6ms (pre-separated) | **8×** | No runtime decomposition |
+| Qualia dim scan (1M agents) | ~4.5ms (18 byte stride) | ~0.25ms (1 byte stride) | **18×** | Perfect cache line use |
+
+The K2 step is slightly SLOWER in SoA (strided row access + copy to stack).
+But K2 only runs on 0.3% of candidates — it's irrelevant to total time.
+The cascade is dominated by K0 (touches all candidates) and K1 (touches 5%).
+Both are massively faster with columnar layout.
+
+---
+
+## The Philosophical Payoff
+
+> **AoS says**: "Here is a complete entity. Ask it anything."
+> **SoA says**: "Here is one question. Ask it of everyone."
+
+The cascade pipeline IS the SoA access pattern:
+- K0 asks ONE question (word 0) of EVERYONE
+- K1 asks EIGHT questions (words 0-7) of SURVIVORS
+- K2 asks ALL questions (words 0-255) of VERY FEW
+
+This is a funnel. Each stage narrows the population and widens the per-entity cost.
+SoA is optimal for funnels because the wide-population stages (K0, K1) get
+contiguous memory access, and the narrow-population stage (K2) pays the strided
+penalty on so few entities that it doesn't matter.
+
+ndarray encodes this directly in its type system:
+- `database.column(0)` = the K0 question to everyone (contiguous)
+- `database.row(survivor)` = all answers from one entity (strided, rare)
+
+**The database layout IS the algorithm.
+The strides ARE the access pattern.
+ndarray IS the SoA abstraction.**
diff --git a/UNIFIED_REFACTOR_SEQUENCE.md b/UNIFIED_REFACTOR_SEQUENCE.md
new file mode 100644
index 00000000..f55b0417
--- /dev/null
+++ b/UNIFIED_REFACTOR_SEQUENCE.md
@@ -0,0 +1,470 @@
+# Unified Refactor Sequence: ndarray Fork
+
+> Integrates four analysis passes into one executable sequence:
+> 1. **REFACTOR_HPC_INTEGRATION.md** — Type bridges & extension traits
+> 2. **SOA_KERNEL_ARCHITECTURE.md** — Columnar cascade & field-separated storage
+> 3. **Transformer session feedback** — API conventions, namespace, codegen macros
+> 4. **lance-graph W1a consumer contract** — 5 SIMD primitives blocking 158-violation remediation
+>
+> Each wave is self-contained. Later waves depend on earlier ones.
+> Within a wave, items are independent and can run in parallel.
+
+---
+
+## Known constraints (from prior sessions)
+
+- Dispatch resolves at compile time via `cfg(target_feature)` in `simd.rs`. No per-call runtime checks.
+- `simd_avx2.rs` is dead on x86-64-v4. Don't add to it.
+- NEON does 256-bit as 2×128-bit paired.
+- VPABSB does NOT saturate i8::MIN.
+- Palette-256 is the dominant gather use case.
+- Rayon work-stealing is not the lever if typed SIMD integration is the global lever.
+- The `hpc/simd_caps.rs` and `hpc/simd_dispatch.rs` files are dead code under the pin — 877 lines to delete.
+
+---
+
+## Wave 0 — Conventions & Foundations (3 days)
+
+Unlocks everything else. No code changes to hot paths. Pure contract + tooling.
+
+| ID | Item | Source | Effort | Why First |
+|----|------|--------|--------|-----------|
+| W0.1 | **Dual-form signature convention** (`_into` + Vec wrapper + `_ptr`) | R1.1 | 1d | Fixes root cause of PP-15 WS-1/2/4 ❌. All future kernels follow this shape. |
+| W0.2 | **Unified HpcError enum** (`src/hpc/error.rs`) | R1.2 | 4h | Replaces 4 error conventions with 1. |
+| W0.3 | **`#![deny(warnings)]` → targeted denies** | R1.3 | 1h | Unblocks doctest authoring for all subsequent waves. |
+| W0.4 | **Feature flag rename** (`hpc-extras` → `research`, `backend-*` prefix) | R4.1 | 4h | Clean slate for feature-gated additions. |
+| W0.5 | **Prelude module** (`hpc::prelude::*`) | R1.4 | 1h | Single discovery surface for downstream consumers. |
+
+**Gate**: All downstream crates (burn, candle, tract, ort, lance-graph) still compile unchanged.
+The dual-form is additive; existing call sites continue working.
+
+---
+
+## Wave 1 — SIMD Consumer Primitives (3 days) ← lance-graph P0
+
+From `lance-graph` W1a consumer contract. **Five primitives blocking 158 raw-intrinsic
+violations across 5 lance-graph crates.** Each is a tight-scope PR. Parallel review.
+
+The lance-graph `simd-savant` agent runs PRE-MERGE against each PR.
+
+| ID | Item | Source | Effort | Consumer Site |
+|----|------|--------|--------|---------------|
+| W1.1 | **`I8x16::from_i4_packed_u64()`** + `batch_packed_i4_16()` | W1a-#1 | 1d | `lance-graph-contract/src/mul.rs::i4_eval::batch` — 5 batch fns over `QualiaI4_16D(u64)`. Closes PR #398 codex P1 (NEON OOB at `len==2`). |
+| W1.2 | **`I8x16::saturating_abs()`** + `I8x32::saturating_abs()` | W1a-#2 | 4h | `lance-graph-contract/src/mul.rs` — Direction-B fix. `abs(i8::MIN)` must return `i8::MAX`, not wrap. See §VPABSB correction. |
+| W1.3 | **`U16x8::gather_u16()`** + `palette_lookup_u8x8()` | W1a-#3 | 4h | `bgz17/src/simd.rs:88` — currently inlines `_mm256_i32gather_epi32` (AP-SIMD-1 violation). |
+| W1.4 | **`prefetch_read_t0/t1/t2()`** | W1a-#4 | 2h | `bgz17/src/prefetch.rs:96,100` — currently inlines `_mm_prefetch` directly. |
+| W1.5 | **`U64x8::popcnt()`** + `U64x8::xor_popcount()` | W1a-#5 | 4h | `holograph/hamming.rs` + `lance-graph/graph/blasgraph/types.rs` — inline `_mm512_popcnt_epi64`. |
+
+### Per-primitive implementation contract
+
+Every PR MUST:
+1. **Impl in `simd_avx512.rs`** + **`simd_neon.rs`** + scalar fallback.
+2. **Edge-case semantics documented** (`i8::MIN`, empty slices, OOB indices).
+3. **Parity test**: all backends produce identical output on randomized corpus.
+4. **Bench against scalar**: speedup ratios in PR body.
+5. **`// SAFETY:` on every `unsafe` block**.
+6. **No `is_x86_feature_detected!`**. No `#[target_feature(enable=...)]`.
+7. **Consumer site cited** in PR description.
+
+### W1.1 — I8x16::from_i4_packed_u64 (nibble unpack + sign extend)
+
+```rust
+impl I8x16 {
+    /// Unpack 16 signed i4 nibbles from a u64 into 16 i8 lanes (sign-extended).
+    /// Nibble layout: lane[i] = sign_extend_4((packed >> (4*i)) & 0xf).
+    pub fn from_i4_packed_u64(packed: u64) -> Self;
+
+    /// Const-folded lane extract.
+    pub fn lane_i8<const N: usize>(self) -> i8;
+}
+
+/// Closure-parameterized batch: run `f` over each (unpacked_i8x16, aux[i]) pair.
+/// Bounds-aware tail handling; scalar fallback on unsupported arch.
+pub fn batch_packed_i4_16<E, F>(packed: &[u64], aux: &[i8], out: &mut [E], f: F)
+where F: Fn(I8x16, i8) -> E + Sync + Send, E: Copy;
+```
+
+**AVX-512**: `_mm_cvtsi64_si128` + nibble shuffle via VPSHUFB mask LUT, then sign-extend.
+Alt: PDEP (`_pdep_u64` × 2) into two u64 halves → load → `vpmovsxbw`. Bench both on Zen4 + SPR.
+**NEON**: `vld1_u8` 8 bytes → `vshl_n_s8(v, 4)` + `vshr_n_s8(v, 4)` for nibble split + sign-extend.
+**Scalar**: `((packed >> (4*i)) & 0xf) as i8`, with `if x > 7 { x - 16 } else { x }` for sign-extend.
+
+The `batch_packed_i4_16` closure-batch is the foundation for W1.5-#7 (randomized signatures).
+
+### W1.2 — I8x16::saturating_abs (VPABSB correction)
+
+**Critical**: `_mm512_abs_epi8` does NOT saturate `i8::MIN`. Returns `0x80` (= -128).
+
+```rust
+impl I8x16 { pub fn saturating_abs(self) -> Self; }
+impl I8x32 { pub fn saturating_abs(self) -> Self; }
+```
+
+**AVX-512**: `_mm512_min_epu8(_mm512_abs_epi8(x), _mm512_set1_epi8(0x7f))`.
+**NEON**: `vqabsq_s8(x)` — hardware-saturating (the `q` suffix).
+**Scalar**: `i8::saturating_abs()` (stdlib, well-defined).
+
+**Mandatory test**:
+```rust
+let input = I8x16::splat(i8::MIN);
+let result = input.saturating_abs();
+assert_eq!(result.lane_i8::<0>(), i8::MAX);  // 127, NOT -128
+```
+
+### W1.3 — U16x8::gather_u16 (palette lookup)
+
+```rust
+impl U16x8 {
+    /// Gather 8 u16 values from `table` at given indices.
+    /// Debug: panics if max(indices) >= table.len(). Release: scalar fallback.
+    pub fn gather_u16(indices: U16x8, table: &[u16]) -> Self;
+}
+pub fn palette_lookup_u8x8(idx_v: U16x8, lut: &[u8]) -> U8x8;
+```
+
+**Palette-256 is the dominant use case** — every index fits in u8, table is always
+256 or 512 bytes. No bounds risk at palette-256 widths. The gather_u16 API must
+handle arbitrary tables too, but palette-256 should be the fast path (no OOB
+possible when table.len() == 256 and indices are u8-sourced).
+
+**Codex P2 fix (gather_u16 OOB)**: `_mm256_i32gather_epi32` reads 4 bytes per slot
+from a `&[u16]` table — overreads 2 bytes at `table[len-1]`. For palette-256 this
+is harmless (256 × 2 = 512 bytes, 4-byte read at index 255 reads bytes 510-513,
+which is within a 512-byte aligned allocation + padding). For arbitrary tables,
+use scalar fallback or pad the table allocation.
+
+**AVX-512**: `_mm512_i32gather_epi32` with index widening + mask to u16.
+**NEON / Scalar**: loop `(0..8).map(|i| table[indices.lane(i)])`.
+
+### W1.4 — Prefetch hints (cross-arch)
+
+```rust
+pub fn prefetch_read_t0(ptr: *const u8);  // L1
+pub fn prefetch_read_t1(ptr: *const u8);  // L2
+pub fn prefetch_read_t2(ptr: *const u8);  // L3
+```
+
+**x86**: `_mm_prefetch(ptr, _MM_HINT_T0/T1/T2)`.
+**aarch64**: `prfm pldl1keep/pldl2keep/pldl3keep`.
+**Other**: no-op (prefetch is a hint; silent no-op is correct per ISA).
+
+### W1.5 — U64x8::popcnt (lane-wise popcount)
+
+```rust
+impl U64x8 {
+    /// Lane-wise population count. Each lane returns its u64 bit-count (0..=64).
+    pub fn popcnt(self) -> Self;
+    /// XOR + lane-wise popcount + horizontal sum. Optimized for Hamming distance.
+    pub fn xor_popcount(self, other: Self) -> u64;
+}
+impl U64x4 { pub fn popcnt(self) -> Self; }  // NEON/scalar parity
+```
+
+**AVX-512 VPOPCNTDQ**: `_mm512_popcnt_epi64` directly (feature `avx512vpopcntdq` —
+available on sapphirerapids, enabled by the target-cpu pin).
+**NEON popcount per-u64**: `vcntq_u8` → `vpaddlq_u8` → `vpaddlq_u16` → `vpaddlq_u32`
+(NOT `vaddvq_u8` which merges ALL lanes to a single scalar — Codex P2 fix).
+**Scalar**: `u64::count_ones()` fused loop.
+
+### W1.5+ — Deferred primitives (gated on sigker certification)
+
+Three more primitives queued behind `lance-graph:crates/sigker` certification
+(Hambly-Lyons 2010 path-signature uniqueness). **No code needed today.** Listed
+so W1a additions are designed broad enough to compose with these later.
+
+| ID | Primitive | Gate |
+|----|-----------|------|
+| W1.5-#6 | `signature_pde_sweep()` — Goursat PDE for `〈S(X), S(Y)〉` | `jc Pillar 11` activation |
+| W1.5-#7 | Randomized projection (Cuchiero-Schmocker-Teichmann 2021) | sigker bench at production widths |
+| W1.5-#8 | Lyndon-pack log-signature (7-13× compression) on `I16x16` | sigker bench at production widths |
+
+The closure-batch shape introduced in W1.1 is the foundation for W1.5-#7.
+
+**Gate**: All 5 primitives pass parity tests across AVX-512/NEON/scalar.
+`lance-graph` `simd-savant` agent verifies compliance PRE-MERGE.
+Consumer-side migration PRs (#398-#402) unblocked.
+
+---
+
+## Wave 2 — Codegen Macros (2 days)
+
+Eliminates copy-paste before adding new code. Every subsequent wave benefits.
+Wave 1 primitives become the first consumers of these macros (retrofit optional).
+
+| ID | Item | Source | Effort | What It Produces |
+|----|------|--------|--------|------------------|
+| W2.1 | **Dtype-parity macro** (`reductions_for!`) | R3.1 | 1d | One line = 7 reductions for a dtype. Cuts 700→150 lines. |
+| W2.2 | **Per-arch impl macro** (`simd_impl!`) | R3.2 | 4h | Generates the struct + methods for a type in both simd_avx512.rs and simd_neon.rs from one body. |
+| W2.3 | **Reduction kernel template** (`reduce_simd()`) | R3.3 | 4h | Generic chunk-loop; sum/max/nrm2 become 5-line callers. |
+| W2.4 | **Dual-form fusion** (`kernel_simd_dual!`) | R3.4 | 1d | One body → `_into`, Vec, `_ptr`, all arch variants. |
+
+**Gate**: `cargo test -p ndarray` passes. Macro output matches existing hand-rolled functions.
+Run benches to verify no regression.
+
+---
+
+## Wave 3 — Type Bridges (2 days)
+
+From REFACTOR_HPC_INTEGRATION.md Tier 1. Makes domain types ndarray-native.
+
+| ID | Item | Source | Effort | Impact |
+|----|------|--------|--------|--------|
+| W3.1 | **Fingerprint<N> ↔ ArrayView1<u64>** | Tier 1.1 | 4h | Zero-copy bridge, unblocks Zip/broadcast on fingerprints |
+| W3.2 | **VsaVector ↔ ArrayView1<u64>** | Tier 1.2 | 2h | Same pattern |
+| W3.3 | **BF16/F16 → BlasFloat** | Tier 1.3 | 4h | Enables `Array<BF16>.dot()`, unlocks BLAS dispatch |
+| W3.4 | **CogRecord channel_as_words()** | Tier 1.4 | 2h | ArrayView1<u64> per channel |
+| W3.5 | **Arrow bridge → ArrayView factories** | Tier 4.1 | 2h | `binary_column_view()` returns ArrayView2 |
+
+**Gate**: New `From`/`Into` impls compile. Existing code unchanged. Add tests for each bridge.
+
+---
+
+## Wave 4 — Extension Traits (3 days)
+
+From REFACTOR_HPC_INTEGRATION.md Tier 2. hpc operations feel native on ndarray types.
+
+| ID | Item | Source | Effort | Impact |
+|----|------|--------|--------|--------|
+| W4.1 | **HdcOps trait** (K0/K1/K2 on Array1<u64>) | Tier 2.1 | 4h | `query.cascade_distance(&candidate, &gate)` |
+| W4.2 | **Quantize trait** (Array → BF16/I8 with Array output) | Tier 2.2 | 4h | `array.to_bf16()`, `array.to_i8_symmetric()` |
+| W4.3 | **Prefilter on ArrayView2** | Tier 2.3 | 4h | Eliminates `(slice, rows, cols)` pattern |
+| W4.4 | **ActivationsNd** (softmax_axis, log_softmax_axis) | Tier 2.4 | 4h | `batch.softmax_axis(Axis(1))` for inference |
+| W4.5 | **SimdMath trait** (VML on any Array<f32>) | Tier 2.5 | 4h | `array.simd_exp()` — 10-50x for transcendentals |
+| W4.6 | **Int8Matmul trait** (Array2<u8> × Array2<i8> → Array2<i32>) | Tier 3.3 | 4h | Quantized inference via ndarray types |
+
+**Gate**: Each trait has tests matching the raw-slice function's test suite.
+Benchmark: trait overhead vs. direct raw-slice call ≤ 1%.
+
+---
+
+## Wave 5 — Backend Wiring (2 days)
+
+From REFACTOR_HPC_INTEGRATION.md Tier 3. Core ndarray operations silently accelerate.
+
+| ID | Item | Source | Effort | Impact |
+|----|------|--------|--------|--------|
+| W5.1 | **Delete duplicate detection code** (simd_caps + simd_dispatch → dead code under cfg pin) | Tier 3.2 | 4h | Deletes 877 lines that are unreachable when target-cpu is pinned |
+| W5.2 | **Core sum/mean → SIMD dispatch** | Tier 3.1 | 4h | 16x faster `.sum()` on contiguous f32/f64 |
+| W5.3 | **SIMD axis reductions** (sum_axis with SIMD lanes) | Tier 6.1 | 1d | ML training hot path |
+
+**Note on W5.1**: With `target-cpu=sapphirerapids` in `.cargo/config.toml`, the
+`cfg(target_feature = "avx512f")` branch in `simd.rs` is the only live path.
+`hpc/simd_caps.rs` (515 lines) and `hpc/simd_dispatch.rs` (362 lines) exist for
+a multi-binary world that doesn't apply when the target is pinned. These can be
+deleted or feature-gated behind `cfg(not(target_feature = "avx512f"))` for CI
+fallback builds.
+
+**Gate**: `cargo bench` shows measurable improvement on contiguous arrays.
+Non-contiguous arrays unchanged (fallback to generic fold).
+
+---
+
+## Wave 6 — SoA Kernel Architecture (1 week)
+
+From SOA_KERNEL_ARCHITECTURE.md. The genius-level structural change.
+Wave 1's `U64x8::popcnt()` and `U64x8::xor_popcount()` are direct prerequisites
+for the columnar XOR+popcount scan in W6.2.
+
+| ID | Item | Source | Effort | Impact |
+|----|------|--------|--------|--------|
+| W6.1 | **TieredDatabase struct** (K0/K1/K2 as F-order Array2) | SoA §5 | 1d | Physical separation matching cache hierarchy |
+| W6.2 | **k0_columnar_simd** (scan N u64s with broadcast+XOR+popcount) | SoA §1 | 1d | Uses `U64x8::xor_popcount` from W1.5. 8 candidates per cycle. |
+| W6.3 | **Bitmask survivor propagation** (mask narrowing, no branching) | SoA §2 | 4h | Replace per-candidate `continue` with mask ops |
+| W6.4 | **k1_accumulate_columnar** (8-column scan with mask) | SoA §1 | 4h | K1 on contiguous columns |
+| W6.5 | **k2_exact_masked** (row extraction for survivors only) | SoA §1 | 4h | Stack-copy 2KB per survivor |
+| W6.6 | **BF16FieldDatabase** (sign/exp/mantissa separated at ingest) | SoA §6 | 1d | Awareness without runtime decomposition |
+| W6.7 | **QualiaColumns** (16 × Array1<i8>) | SoA §7 | 4h | 18x dimension scan throughput. Uses `I8x16::from_i4_packed_u64` from W1.1. |
+| W6.8 | **Arrow native path** (RecordBatch columns → columnar scan) | SoA §4 | 4h | Zero-copy from Lance/Arrow |
+| W6.9 | **Benchmark: AoS vs SoA** on 1M SKU-16K containers | SoA §perf | 4h | Proof of 4-8x claim |
+
+**Gate**: SoA cascade produces identical results to AoS cascade on full test suite.
+Benchmark confirms ≥3x throughput improvement on 1M containers.
+
+---
+
+## Wave 7 — Namespace Restructure (3 days)
+
+From transformer feedback R2.1. Enforces the architecture rule from CLAUDE.md.
+
+| ID | Item | Source | Effort | Impact |
+|----|------|--------|--------|--------|
+| W7.1 | **Split hpc/ → hpc/ + cog/ + ext/ + io/** | R2.1 | 2-3d | 30 numeric modules in hpc/, 35 cognitive in cog/, 20 experimental in ext/, 6 I/O in io/ |
+| W7.2 | **SIMD directory consolidation** (simd_*.rs → src/simd/) | R2.2 | 1d | One directory for all SIMD code. W1 primitives land in new locations. |
+| W7.3 | **Quantized module split** (quantized.rs → hpc/quant/) | R2.3 | 4h | BlockQ4_0 packed struct for candle compat |
+
+**Gate**: `pub use` deprecation shims for all moved modules.
+All existing `use ndarray::hpc::*` paths still resolve (with deprecation warning).
+
+---
+
+## Wave 8 — Test & Bench Infrastructure (2 days)
+
+| ID | Item | Source | Effort | Impact |
+|----|------|--------|--------|--------|
+| W8.1 | **Extract integration tests** to `tests/hpc/`, `tests/cog/` | R4.2 | 2d | Module files lose 30-50% of line count |
+| W8.2 | **HPC bench harness** (dot, reductions, softmax, quantized) | R4.5 | 1d | Reproducible perf claims |
+| W8.3 | **Cross-comparison bench** (ndarray vs candle vs tract kernel) | R4.5 ext | 4h | The whole reason for B1 |
+| W8.4 | **W1a primitive parity bench** (AVX-512 vs NEON vs scalar speedup) | lance-graph | 4h | Acceptance criteria §4 |
+
+---
+
+## Wave 9 — Version Bump & Downstream Pin (1 day)
+
+| ID | Item | Source | Effort | Impact |
+|----|------|--------|--------|--------|
+| W9.1 | **Tag v0.18.0** | R4.4 | 2h | Clean cut |
+| W9.2 | **CHANGELOG with migration guide** | R4.4 | 2h | Downstream knows what changed |
+| W9.3 | **Downstream Cargo.toml pins** (burn, candle, tract, ort, lance-graph) | R4.4 | 2h | tag = "v0.18.0" |
+| W9.4 | **Remove deprecation shims** (schedule for v0.19) | R2.1 | — | Future cleanup |
+
+---
+
+## Dependency Graph
+
+```
+Wave 0 (conventions)
+  │
+  ├──→ Wave 1 (SIMD primitives) ←── P0 for lance-graph
+  │       │
+  │       ├──→ Wave 6.2 (k0_columnar uses U64x8::xor_popcount from W1.5)
+  │       ├──→ Wave 6.7 (QualiaColumns uses I8x16::from_i4_packed from W1.1)
+  │       └──→ Wave 8.4 (primitive parity benches)
+  │
+  ├���─→ Wave 2 (codegen macros)
+  │       │
+  │       └──→ Wave 5 (backend wiring uses macros)
+  │
+  ├──→ Wave 3 (type bridges)
+  │       │
+  │       ├──→ Wave 4 (extension traits need bridges)
+  │       ├���─→ Wave 5 (backend dispatch needs BlasFloat for BF16)
+  │       └──→ Wave 6 (SoA needs Fingerprint↔Array, Arrow views)
+  │
+  ├──→ Wave 7 (namespace) — independent of 1-6
+  │
+  └──→ Wave 8 (tests/benches) — after 1+5+6
+          │
+          └──→ Wave 9 (release)
+```
+
+### Critical Path
+
+**For lance-graph (P0 consumer)**: 0 → 1 = **6 days**. Consumer migration PRs unblocked.
+
+**For full release**: 0 → 1 → 3 → 5 → 6 �� 8 → 9 = **18 days** serial.
+With parallelism (1∥2, 3∥4, 6∥7, 8∥backlog): **~12 working days**.
+
+---
+
+## How Wave 1 Feeds Everything Downstream
+
+The 5 SIMD primitives aren't isolated additions — they're load-bearing for later waves:
+
+| Primitive | Immediate consumer (lance-graph) | Later consumer (ndarray) |
+|-----------|----------------------------------|--------------------------|
+| `I8x16::from_i4_packed_u64` | `mul.rs::i4_eval::batch` (5 batch fns) | **W6.7 QualiaColumns** — unpack i4 qualia from packed storage without scalar loop |
+| `I8x16::saturating_abs` | Direction-B fix, ValleyOfDespair classifier | **W4.2 Quantize trait** — safe abs in quantization error metrics |
+| `U16x8::gather_u16` | `bgz17/simd.rs` palette lookup | **W6.6 BF16FieldDatabase** �� gather exponent fields from lookup table |
+| `prefetch_read_t0/t1/t2` | `bgz17/prefetch.rs` tile prefetch | **W6.2 k0_columnar_simd** — prefetch next column chunk during K0 scan |
+| `U64x8::popcnt` + `xor_popcount` | `holograph/hamming.rs` + `blasgraph/types.rs` | **W6.2-W6.5 entire SoA cascade** — columnar XOR+popcount is THE operation |
+
+The SoA cascade (Wave 6) is fundamentally built on `U64x8::xor_popcount()`. Without Wave 1,
+the SoA implementation would need to drop back to `hpc::bitwise::hamming_distance_raw()` on
+slices, losing the typed-wrapper discipline and duplicating dispatch logic.
+
+---
+
+## The SoA Integration with All Waves
+
+| Wave 6 needs | Provided by |
+|--------------|-------------|
+| `U64x8::xor_popcount()` for columnar scan | **Wave 1.5** (SIMD primitives) |
+| `I8x16::from_i4_packed_u64()` for qualia | **Wave 1.1** (SIMD primitives) |
+| `prefetch_read_t0()` for column prefetch | **Wave 1.4** (SIMD primitives) |
+| F-order Array2<u64> database | **Wave 3** (type bridges — Fingerprint converts to Array) |
+| Impl macro for k0_columnar_simd | **Wave 2** (codegen macros) |
+| `_into` form for columnar kernels | **Wave 0** (signature convention) |
+| Extension trait: `database.cascade_soa(&query, &gate)` | **Wave 4** (HdcOps trait extended) |
+| BF16FieldDatabase uses Quantize trait | **Wave 4** (Quantize extension) |
+| Arrow columns → direct scan | **Wave 3** (Arrow view factories) |
+| Benchmark harness to prove 4-8x | **Wave 8** (bench infrastructure) |
+
+---
+
+## The Meta-Level SoA Observation
+
+The SoA insight goes beyond just "transpose the database." It's a design principle
+that applies recursively across the entire surface:
+
+### Level 1: Database layout (SOA_KERNEL_ARCHITECTURE.md)
+- Fingerprint database: column-per-word → cascade becomes column scan
+- BF16 fields: separated sign/exp/mantissa → awareness is native layout
+- Qualia: column-per-dimension → per-dim queries are sequential scans
+
+### Level 2: API surface
+- The **dual-form convention** (W0.1) IS SoA thinking applied to function signatures:
+  - AoS API: `fn foo(input) -> output` (allocates inside, hides layout)
+  - SoA API: `fn foo_into(input, output)` (caller controls memory layout)
+
+### Level 3: SIMD primitives (lance-graph W1a)
+- The **typed-wrapper-as-method** convention IS SoA for the SIMD surface:
+  - AoS: raw intrinsics scattered across consumer crates (158 violations)
+  - SoA: struct methods on typed wrappers, consumers import ONE namespace
+- The **closure-batch pattern** (W1.1 `batch_packed_i4_16`) IS SoA for computation:
+  - AoS: each consumer writes its own unpack+process+pack loop
+  - SoA: one batch primitive accepts a closure, owns chunking/tailing/dispatch
+
+### Level 4: Module structure (W7.1 namespace split)
+- AoS: one module (`hpc/`) contains everything → grep to find anything
+- SoA: separated by concern (`hpc/`, `cog/`, `ext/`, `io/`) → contiguous columns
+
+### Level 5: ndarray's type system
+- `Array2<u64>` in F-order IS the SoA database
+- `.column(i)` IS the field-access pattern
+- `.row(i)` IS the record-access pattern
+- Strides encode the duality without duplication
+
+### Level 6: Arrow + Lance alignment
+- Arrow IS columnar storage (SoA by definition)
+- Lance stores Arrow columns on disk
+- The SoA kernel reads Arrow columns directly — no ETL, no transpose, no copy
+- Storage format = memory layout = compute pattern = cache access order
+
+### Level 7: The consumer contract itself
+- lance-graph declares what it needs (W1a primitives), ndarray provides
+- The contract is columnar: one primitive per concern, not a monolithic "SIMD library"
+- Each primitive is independently testable, deployable, and benchmarkable
+
+**The architecture collapses all intermediate transformations at every level.**
+
+---
+
+## Effort Summary
+
+| Wave | Days | Parallel? | Key Outcome |
+|------|------|-----------|-------------|
+| 0 | 3 | Yes | Conventions locked |
+| 1 | 3 | Yes (after 0) | **lance-graph unblocked** (P0) |
+| 2 | 2 | ∥ with 1 | Codegen macros ready |
+| 3 | 2 | After 0 | Domain types ↔ Array |
+| 4 | 3 | After 3 | Extension traits live |
+| 5 | 2 | After 2+3 | Core ops accelerated |
+| 6 | 5 | After 1+3+5 | SoA cascade (the big win) |
+| 7 | 3 | ∥ with 4-6 | Namespace clean |
+| 8 | 2 | After 1+6 | Proof via benchmarks |
+| 9 | 1 | After all | Release |
+
+**lance-graph unblocked**: Wave 0+1 = **6 days**.
+**Full release critical path**: 18 days serial, **~12 days** with parallelism.
+
+---
+
+## What NOT to Do
+
+1. **Don't delete hpc/ modules** — raw-slice functions stay (FFI, embedded, zero-overhead)
+2. **Don't premature-v1.0** — surface is too young; candle/tract wiring will shake it again
+3. **Don't gate SoA behind a feature flag** — it's the default hot path, not optional
+4. **Don't couple SoA with module restructure** — they're independent; merge separately
+5. **Don't break downstream in one shot** — deprecation shims for one release minimum
+6. **Don't ship W1a primitives without parity tests** — the codex P2 i8::MIN divergence happened because no test existed
+7. **Don't implement W1.5+ (deferred primitives) until sigker certification** — they're gated