Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
130 changes: 130 additions & 0 deletions SIMD_COMPARISON.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,130 @@
# SIMD Wishlist Audit: AdaWorldAPI/ndarray (March 2026)

## Codebase Snapshot

- **57 HPC modules** in `src/hpc/` (52K+ lines)
- **2,846 lines** of portable SIMD polyfill: `src/simd.rs` → `src/simd_avx512.rs` → scalar fallback
- SIMD types: `F32x16`, `F64x8`, `U8x64`, `I32x16`, `U32x16`, `U64x8` (AVX-512) + `f32x8`, `f64x4` (AVX2)
- Full operator overloading, `mul_add` (FMA), `sqrt`, `reduce_sum/min/max`, `simd_clamp`
- AVX2 dot product with 4× unrolled accumulators (`src/simd_avx2.rs:52`)
- Runtime dispatch via `is_x86_feature_detected!` (65+ sites in `bitwise.rs`)

---

## Wishlist Scorecard

| # | Item | Status | Key Evidence |
|---|---|---|---|
| 1 | `simd_map()` | **PARTIAL** | SIMD types exist (`F32x16` etc.), VML scalar. Missing: generic lane iteration |
| 2 | `SpatialArray3<T>` | **PARTIAL** | `cam_index.rs` CAM + `dn_tree.rs` spatial tree. Missing: f32 3D coordinates |
| 3 | `xor_diff()` | **DONE** | `bitwise.rs` AVX-512BW/AVX2/scalar XOR + popcount |
| 4 | `gather_scatter()` | **MISSING** | Only vpshufb nibble gathers in bitwise.rs |
| 5 | `columnar_view()` | **PARTIAL** | `arrow_bridge.rs` schema + `SoakingBuffer`. Missing: `ArrayView` bridge |
| 6 | `Zip::simd_apply()` | **PARTIAL** | `kernels.rs` K0→K1→K2 fusion. Missing: generic over closures |
| 7 | `runtime_dispatch()` | **DONE** | 65+ `is_x86_feature_detected!` sites + scalar fallbacks |
| 8 | `stencil()` | **MISSING** | BNN has neighbor patterns but no 3D stencil API |
| 9 | `compact_palette()` | **PARTIAL** | `palette_distance.rs` 256-entry codebook + `quantized.rs` + vpshufb |
| 10 | `prefetch/stream` | **PARTIAL** | `packed.rs` layout-for-prefetch. No explicit `_mm_prefetch` |

---

## Per-Item Detail

### 1. `simd_map()` — Lane-Native SIMD Iteration

**Exists:** `F32x16::from_slice()`, `copy_to_slice()`, `mul_add()`, `sqrt()`, all operators. AVX2 `dot_f32()` with 4-accumulator unrolling in `simd_avx2.rs:52-90`.

**Exists:** `src/hpc/vml.rs` has `vsexp`, `vssqrt`, `vsln`, `vsabs`, `vsadd`, `vsmul`, `vsdiv` — but ALL are scalar loops.

**Gap:** Need `vml.rs` to use `F32x16`/`f32x8` types. The types exist, the functions exist, they just aren't connected. Example:
```rust
// Current vml.rs:
pub fn vssqrt(x: &[f32], out: &mut [f32]) {
for (o, &v) in out.iter_mut().zip(x.iter()) { *o = v.sqrt(); }
}
// Should be:
pub fn vssqrt(x: &[f32], out: &mut [f32]) {
let chunks = x.len() / 16;
for i in 0..chunks {
let v = F32x16::from_slice(&x[i*16..]);
v.sqrt().copy_to_slice(&mut out[i*16..]);
}
// scalar remainder
}
```

### 2. `SpatialArray3<T>` — Content-Addressable Memory

**Exists:** `cam_index.rs` — multi-probe LSH CAM for 49,152-bit `GraphHV` binary vectors. `dn_tree.rs` — hierarchical spatial partitioning (739 lines). `parallel_search.rs` — dual-path HHTL + CLAM tree search.

**Gap:** All CAM infrastructure operates on binary hypervectors, not f32 spatial coordinates. Need a `SpatialCam3D` adapter that uses spatial hashing (floor(x/cell_size)) for the Pumpkin entity bind/unbind pattern.

### 3. `xor_diff()` — SIMD XOR Change Detection

**DONE.** `bitwise.rs`:
- `hamming_avx2()` (line 62): 32 bytes/iter via vpshufb
- `hamming_avx512bw()` (line 117): 64 bytes/iter via vpshufb-512
- `hamming_avx512_vpopcnt()`: native VPOPCNTDQ when available
- Runtime dispatch (line 234): `avx512vpopcntdq` → `avx512bw` → `avx2` → scalar
- `hamming_query_batch()`: batch mode for tick N vs N+1 comparison

**Only gap:** No sparse `nonzero_iter()` returning positions of changed elements.

### 4. `gather_scatter()` — Vectorized Gather

**MISSING.** `tekamolo.rs` and `cam_index.rs` use hash-based lookups (conceptually gather) but no `VPGATHERDD`/`VGATHERDPS` intrinsics anywhere.

### 5. `columnar_view()` — Zero-Copy Arrow Interop

**Exists:** `arrow_bridge.rs` has schema constants (`s_binary`, `p_binary`, `o_binary`, `node_id`), `GateState` lifecycle (Form→Flow→Freeze), `SoakingBuffer { data: Vec<i8>, n_entries, n_dims }`.

**Gap:** Missing the one-liner: `unsafe { ArrayView1::from_shape_ptr(len, arrow_buf.as_ptr()) }`.

### 6. `Zip::simd_apply()` — Multi-Array Fused SIMD Kernel

**Exists:** `kernels.rs` K0→K1→K2 fused cascade (1589 lines). `packed.rs` stroke-aligned cascade query. Both fuse multiple passes into one traversal.

**Gap:** Fusion is hardcoded for binary Hamming. Need generic version accepting `Fn(F32x16, F32x16) -> F32x16`.

### 7. `runtime_dispatch()` — CPU Feature Detection

**DONE.** Two complementary systems:
1. `bitwise.rs`: 65+ `is_x86_feature_detected!` with 4-tier fallback
2. `simd.rs` polyfill: compile-time dispatch via `#[cfg(target_arch)]` with scalar fallback types

### 8. `stencil()` — 3D Neighbor-Aware SIMD

**MISSING.** `bnn_causal_trajectory.rs`, `deepnsm.rs`, `clam_search.rs` have neighbor traversal patterns but nothing 3D-stencil-specific.

### 9. `compact_palette()` — Bit-Packed SIMD

**PARTIAL.** Three relevant modules:
- `palette_distance.rs`: 256-entry `Palette` codebook with precomputed pairwise L1 distance matrix
- `quantized.rs`: f32→u8 quantization with scale/zero-point
- `bitwise.rs`: vpshufb nibble lookup (4-bit table, proven in SIMD)

**Gap:** No variable-width (4-15 bit) pack/unpack for Minecraft block state encoding.

### 10. `prefetch_region()` + `stream_store()`

**PARTIAL.** `packed.rs` uses stroke-aligned layout for hardware prefetcher ("the prefetcher handles sequential access"). No explicit `_mm_prefetch` or `_mm_stream_ps`.

---

## What Changed Since Last Audit

| New Module | Lines | Wishlist Impact |
|---|---|---|
| `src/simd.rs` | 829 | #1 #6 #7 — portable SIMD types with scalar fallback |
| `src/simd_avx512.rs` | 1399 | #1 — F32x16/F64x8/U8x64 with FMA, sqrt, reduce |
| `src/simd_avx2.rs` | 618 | #1 — f32x8/f64x4 dot product, GEMM tile sizes |
| `hpc/holo.rs` | new | Phase + focus + carrier (94 tests) |
| `hpc/zeck.rs` | new | Zeckendorf encoding + batch/top_k |
| `hpc/palette_distance.rs` | new | #9 — 256-entry palette with O(1) distance |
| `hpc/parallel_search.rs` | new | #2 — dual-path HHTL + CLAM search |
| `hpc/layered_distance.rs` | new | O(1) distance via palette index + precomputed matrix |
| `hpc/bgz17_bridge.rs` | new | Base17 bridge for palette interop |

---

*Audit generated 2026-03-23. AdaWorldAPI/ndarray master @ 11633d06.*
Loading