diff --git a/.claude/knowledge/cognitive-distance-typing.md b/.claude/knowledge/cognitive-distance-typing.md
new file mode 100644
index 00000000..654fe69e
--- /dev/null
+++ b/.claude/knowledge/cognitive-distance-typing.md
@@ -0,0 +1,166 @@
+# KNOWLEDGE: Cognitive Distance Metrics Are Typed — No Generic Umbrella, No Roundtrips
+
+## READ BY
+- Any agent designing or modifying distance APIs in `src/hpc/{plane,vsa,distance,cascade,causal_diff,seal,merkle_tree,bnn,clam,fingerprint}.rs`
+- Any agent proposing a generic `fn distance<T>(a: &T, b: &T) -> f32` umbrella
+- W7 implementers (when W7 — cognitive bulk ops — moves from deferred to active)
+- The W3-W6 plan-review savant currently auditing the SoA/AoS design (this doc constrains W7's scope and bounds what the SoA helpers should NOT grow into)
+
+## P0 TRIGGERS
+- About to design a generic `distance(a, b) -> f32` that picks the metric internally → STOP, distance is typed; build one fn per metric and name it
+- About to chain `palette 256 → fisher z → cosine → hamming → popcount → palette 256` in one call path → STOP, that's the canonical worst-case roundtrip and erases the typing
+- About to drop the `buckets` / Euler-gamma-offset arguments from a palette-256 distance call → STOP, those are PART of the typed distance, not optional context
+- About to silently convert between metric types in an intermediate step → STOP, conversions must be EXPLICIT (named fns) and documented at the call site as escape hatches
+
+---
+
+## The taxonomy
+
+Each row is a distinct typed primitive. Mixing them at runtime is the bug this doc exists to prevent.
+
+| Metric | Computes | Input | Output | Cascade role | Notes |
+|---|---|---|---|---|---|
+| **Palette 256 distance** | precomputed 256×256 table lookup with bucket + Euler-gamma-offset context | two `PaletteIdx` (u8) + `Buckets` + `EulerGammaOffset` | `PaletteDistance` (typed f32 newtype) | **Level 3** (finalist scoring, ~200 candidates) | The buckets and Euler offset are INTEGRAL to the metric. Dropping them changes the answer; passing zero offset is NOT the same as omitting. |
+| **HDR popcount early-exit** | Hamming on 256-bit bitpacked fingerprints with under-threshold short-circuit | two `&Fingerprint256` + `u16` threshold | `Option<HammingDistance>` (None = exceeded threshold, early exit) | **Level 1** (cosine REPLACEMENT, ~1M → ~20K) | This IS the cosine replacement on the cascade — NOT a derivative or approximation of cosine. The popcount metric directly substitutes for FP cosine in the search topology. |
+| **Base17 L1** | L1 (Manhattan) distance on 17-dim i16 vectors | two `&[i16; 17]` | i32 | **Level 2** (~20K → ~200) | Fits in one AVX-512 load or two NEON loads. The 17-dimension shape is specific; don't pad to 16 or 18. |
+| **Fisher-z transform** | variance-stabilizing transform of correlation → z-score | f32 correlation | f32 z-score | **NOT a distance** — a normalization applied to palette 256 OUTPUT when distance distributions need comparison across heterogeneous buckets | Calling Fisher-z on a non-correlation value is a category error. |
+| **BF16 mantissa exact transformation** | direct palette-256 → palette-256 mapping using BF16 mantissa context | `PaletteIdx` + `EulerGammaOffset` (+ mantissa context) | new `PaletteIdx` | **bypasses the cascade entirely** when the direct mapping is known | The fast path: when you already have a palette index and need the transformed palette index under a known offset, this is one typed hop in palette space. No metric translation, no cascade levels. |
+
+---
+
+## The roundtrip anti-pattern (worst case — do not write this code path)
+
+```
+  palette 256 distance        [Level 3 typed]
+        │
+        ▼
+   fisher z normalize         [valid: variance-stabilize a Level-3 result]
+        │
+        ▼
+   "treated as cosine"        ←── BUG: popcount IS cosine replacement,
+        │                          fisher-z of palette ≠ cosine input
+        ▼
+   hamming distance           [Level 1 typed — wrong scale, wrong topology]
+        │
+        ▼
+   HDR popcount preheat       [Level 1 detail — re-derived from wrong source]
+        │
+        ▼
+   early exit                 [decisions made on un-rounded round-trip]
+        │
+        ▼
+   palette 256 distance       [back to start, BUT: buckets + Euler offset lost
+                               somewhere in the chain — answer differs from
+                               the original Level-3 result by quantization noise
+                               + loss of bucket assignment]
+```
+
+Each arrow:
+- pays an arithmetic / cache cost
+- loses the typed-distance identity (the type system stops protecting the call site)
+- introduces conversion error that compounds along the chain
+- can converge back to "approximately the same number" — which makes the bug invisible in unit tests but wrong in deployment
+
+## The direct path (preferred)
+
+```
+  palette 256 distance ──[Euler gamma offset known]──▶ palette 256 BF16 mantissa exact transformation
+                                                                    │
+                                                                    ▼
+                                                              new PaletteIdx
+                                                              (stays in palette space)
+```
+
+ONE typed step. Stays in palette-256 type space. Preserves bucket + offset throughout. No cascade traversal, no metric translation, no conversion noise.
+
+When the BF16-mantissa direct path is applicable (caller has a `PaletteIdx` and an `EulerGammaOffset`), use it. The cascade exists for the case where you don't yet know which palette band the target lives in.
+
+---
+
+## API design rule (binding)
+
+1. **One fn per metric. Named.**
+   ```rust
+   pub fn palette256_distance(
+       a: PaletteIdx, b: PaletteIdx,
+       buckets: &Buckets, offset: EulerGammaOffset,
+   ) -> PaletteDistance;
+
+   pub fn hdr_popcount_early_exit(
+       a: &Fingerprint256, b: &Fingerprint256, threshold: u16,
+   ) -> Option<HammingDistance>;
+
+   pub fn base17_l1(a: &[i16; 17], b: &[i16; 17]) -> i32;
+
+   pub fn palette256_bf16_mantissa_transform(
+       p: PaletteIdx, offset: EulerGammaOffset, mantissa: BF16MantissaCtx,
+   ) -> PaletteIdx;
+   ```
+
+2. **Conversions are explicit and named.** When a caller must cross metric boundaries (escape-hatch case):
+   ```rust
+   pub fn hamming_distance_to_palette_index_estimate(d: HammingDistance) -> PaletteIdx;
+   //                                  ^-- name says "estimate" so callers can't mistake it for exact
+   ```
+   The call site MUST carry a comment naming WHY the conversion is happening (not the default path).
+
+3. **No `Box<dyn Distance>` / no `enum DistanceMetric { Palette, Hamming, Base17, … }` / no `fn distance<T: HasMetric>(a, b) -> f32` umbrella.**
+   The type system distinguishes the metrics for a reason.
+
+4. **Newtype the output.** `PaletteDistance(f32)`, `HammingDistance(u16)`, `Base17L1(i32)` — different output types prevent accidental cross-metric arithmetic. `let d = palette256_distance(...) + hamming_dist;` should not compile.
+
+5. **The buckets and Euler-gamma-offset arguments to palette-256 fns are REQUIRED, not optional.** Default values for those parameters are domain-meaningful (changing them changes the metric); a `Default::default()` impl on `EulerGammaOffset` is acceptable only with strong documentation that the default is a deliberate calibration constant, not a sentinel.
+
+---
+
+## What this means for the SoA/AoS sprint (W3-W6) and beyond
+
+### W3-W6 (currently in-flight on `claude/w3-w6-soa-aos-helpers`)
+- `SoaVec<T, N>`, `soa_struct!`, `aos_to_soa`, `soa_to_aos`, `bulk_apply`, `bulk_scan` are **generic over T**. They do NOT bake in any distance metric.
+- **DO NOT** during the sprint or in post-review extend any of these primitives toward distance computation. Distance stays out of the helper layer.
+- If a worker is tempted to add `fn bulk_distance<T>(...)` to `bulk.rs` → STOP, that's the umbrella anti-pattern.
+
+### W7 (deferred — cognitive bulk ops)
+- When W7 lands, EACH metric gets its own bulk primitive named for the metric:
+  ```rust
+  pub fn bulk_hdr_popcount_early_exit(
+      query: &Fingerprint256, db: &[Fingerprint256], threshold: u16,
+  ) -> Vec<Option<HammingDistance>>;
+
+  pub fn bulk_palette256_distance(
+      query: PaletteIdx, db: &[PaletteIdx],
+      buckets: &Buckets, offset: EulerGammaOffset,
+  ) -> Vec<PaletteDistance>;
+
+  pub fn bulk_palette256_mantissa_transform(
+      palettes: &[PaletteIdx], offset: EulerGammaOffset, mantissa: BF16MantissaCtx,
+  ) -> Vec<PaletteIdx>;
+  ```
+- Underneath, the bulk fns MAY use `SoaVec` / `bulk_apply` from W3/W4 for layout staging. That's fine — those are layout helpers, not distance helpers.
+- The cascade orchestrator in `hpc/cascade.rs` calls each Level's typed bulk primitive directly. It does NOT internally translate Level-1 outputs to Level-3 inputs by passing through Fisher-z.
+
+### Bench harness (prereq for W7's SIMD acceleration)
+- Per-metric benches: one per typed primitive, no umbrella `bench_distance` macro.
+- The bench output should report the typed primitive's name in the column header so regressions are attributable to the specific metric.
+
+---
+
+## Cross-references
+
+- `CLAUDE.md` § "Three-Level Cascade: How the Search Actually Works" — describes L1 Hamming sweep, L2 Base17 L1, L3 Palette lookup. The Levels are typed distance bands, NOT tiers of a generic distance abstraction.
+- `src/hpc/cascade.rs` — the orchestrator. Inspect before adding any new distance code; the call chain there is the canonical correct example.
+- `src/hpc/distance.rs` — distance utilities. Audit for any `fn distance<T>` umbrella that may have crept in; refactor to typed fns if found.
+- `src/hpc/plane.rs`, `src/hpc/vsa.rs` — produce values that feed the cascade. Their output types must be the typed distance newtypes, not raw `f32`.
+- `.claude/knowledge/w3-w6-soa-aos-design.md` — the W3-W6 helper design (this doc constrains what those helpers must NOT grow into).
+- `.claude/knowledge/vertical-simd-consumer-contract.md` — the layering rule (user → crate::simd → simd_{type}). Same family: layering and typing are both about preserving identity across abstraction levels.
+
+---
+
+## TL;DR for an agent reading this in 30 seconds
+
+1. Palette-256 distance ≠ Hamming popcount ≠ Base17 L1. Don't put them under one API.
+2. Palette-256 distance carries `buckets + EulerGammaOffset` always. Don't drop them.
+3. The fast path inside palette space is the BF16-mantissa direct transform — one hop, no cascade.
+4. The cascade is THREE typed levels in sequence, NOT a generic distance pipeline with intermediate conversions.
+5. Conversions between metric types must be explicit, named, documented per call site.
+6. No `Box<dyn Distance>`, no umbrella `fn distance<T>(...)`, no `enum DistanceMetric`.
diff --git a/.claude/knowledge/w3-w6-codex-audit.md b/.claude/knowledge/w3-w6-codex-audit.md
new file mode 100644
index 00000000..eb4ae7d7
--- /dev/null
+++ b/.claude/knowledge/w3-w6-codex-audit.md
@@ -0,0 +1,55 @@
+# W3-W6 Codex Audit
+
+Auditor: codex P0 review agent
+Branch: `claude/w3-w6-soa-aos-helpers` (rebased onto master at `3f35170d`)
+Commits under review:
+- `5095853c` feat(hpc/soa): `SoaVec` + `soa_struct!` + `aos_to_soa` + `soa_to_aos` (W3+W5+W6)
+- `5845ab2d` feat(hpc/bulk): `bulk_apply` + `bulk_scan` (W4)
+
+## Verdict
+
+**READY-FOR-PR.** Zero P0 findings. One P1 cosmetic docstring gap (patched on this branch). Three P2 items intentionally deferred per the design contract.
+
+## Verification exit codes (all 0)
+
+| Command | Exit | Notes |
+|---|---|---|
+| `cargo check -p ndarray --no-default-features --features std` | 0 | |
+| `cargo test -p ndarray --lib --no-default-features --features std hpc::soa` | 0 | 29 passed, 0 failed |
+| `cargo test -p ndarray --lib --no-default-features --features std hpc::bulk` | 0 | 16 passed, 0 failed |
+| `cargo test --doc -p ndarray --no-default-features --features std hpc::soa` | 0 | 10 passed, 1 intentionally ignored |
+| `cargo test --doc -p ndarray --no-default-features --features std hpc::bulk` | 0 | 2 passed, 1 intentionally ignored |
+| `cargo fmt --all -- --check` | 0 | |
+| `cargo clippy -p ndarray --no-default-features --features std -- -D warnings` | 0 | |
+
+## P0 findings
+
+None.
+
+## P1 findings
+
+- **F4** — `usize::MAX` chunk-size behavior is tested at `src/hpc/bulk.rs:182-194` but NOT documented in the public docstring of `bulk_apply` (`src/hpc/bulk.rs:46-66`) or `bulk_scan` (`src/hpc/bulk.rs:80-97`). One-line addition: "`chunk_size == usize::MAX` yields the entire slice as a single chunk." **Patched on this branch** before PR opens.
+
+## P2 findings (deferred per design contract)
+
+- **G1** (`bulk_scan` naming): the savant flagged that "scan" conventionally means fold-with-state. Kept as `bulk_scan` for symmetry with `bulk_apply`. Rename to `bulk_for_each` / `bulk_inspect` is a follow-up if downstream consumers find the name misleading.
+- **G2** (`SoaVec::iter_rows`): row iterator yielding `[&T; N]` per row is absent. Use `soa.chunks(1)` for the same effect. Deferred to a follow-up once a real use case exists.
+- **G3** (`SoaVec` lacks `#[derive(Clone, Debug)]`): macro-generated structs DO support derive passthrough (verified by test at `src/hpc/soa.rs:733-742`), but the generic container does not. Deliberately deferred — adding derives would require `where T: Clone + Debug` bounds that callers don't always want.
+
+## D4 — integration test gate
+
+The `bulk_apply` × `aos_to_soa` integration test at `src/hpc/bulk.rs:295` correctly uses `#[cfg(any())]` (canonical never-compile sentinel). The test body at `src/hpc/bulk.rs:297-324` is sound. Now that `hpc/soa.rs` and `hpc/bulk.rs` are landing in the same PR, the gate could be removed as a follow-up — but worker B's `cfg(any())` gate preserves the safe deferral if the PR review wants to keep them independently mergeable.
+
+## Compliance summary
+
+| Concern | Status | Notes |
+|---|---|---|
+| Layering rule (no `#[target_feature]`, no per-arch imports, no raw intrinsics) | ✅ clean | Only doc-prose mentions in module headers; zero actual attributes |
+| Distance typing (no umbrella `fn distance<T>`, no `enum DistanceMetric`, no `Box<dyn Distance>`) | ✅ clean | Both module headers cite `cognitive-distance-typing.md` and warn against extension toward distance |
+| Spec API match (per design doc v2 §C1-C7, §D1-D4) | ✅ exact | `field_n::<const I>()` uses `const { assert!(I < N) }`; all method signatures verbatim |
+| Doc coverage (every `pub fn` has `///` doc with working `# Example`) | ✅ complete | After F4 patch |
+| Test coverage (per design doc §"Tests" per fn) | ✅ complete | 29 + 16 = 45 unit tests + 12 doctests |
+
+## Recommended next step
+
+Open the W3-W6 PR. F4 fix landed on the same branch as a follow-up commit; integration test stays `cfg(any())`-gated for the PR; P2 items deferred.
diff --git a/.claude/knowledge/w3-w6-plan-review.md b/.claude/knowledge/w3-w6-plan-review.md
new file mode 100644
index 00000000..9634e728
--- /dev/null
+++ b/.claude/knowledge/w3-w6-plan-review.md
@@ -0,0 +1,100 @@
+# W3-W6 Plan Review
+
+Reviewer: plan-review savant
+Branch: `claude/w3-w6-soa-aos-helpers`
+Design under review: `/home/user/ndarray/.claude/knowledge/w3-w6-soa-aos-design.md`
+Date: 2026-05-18
+
+## Verdict
+
+**READY-WITH-DOC-FIXES** — design is sound, layering rule respected, no irreducible ambiguity. One P0 doc fix (module placement), six P1 clarifications, and several P2 polish notes should land before workers spawn. Implementation can proceed in parallel once the doc patches commit.
+
+## Findings
+
+### Blockers (P0 — must fix before workers spawn)
+
+- **P0-1 — `aos_to_soa` / `soa_to_aos` are wired to the wrong module.** Design doc §"W5 + W6" (lines 264–330) places them in `src/simd_ops.rs`. That module is *exclusively* SIMD-dispatch glue (every existing fn uses `F32x16` / `F64x8`; see `simd_ops.rs:1–10` doc header: "Slice-level elementwise ops built on the polyfill SIMD types"). Putting *pure-scalar, no-SIMD* helpers there:
+  1. Contradicts the module's documented charter (`simd_ops.rs:1`).
+  2. Violates the W1a consumer contract (`vertical-simd-consumer-contract.md:31`): "ndarray's SIMD surface is shaped to fit exactly what the Ada stack vertically needs — not as a generic library". Free-function `aos_to_soa` in `simd_ops.rs` is exactly the free-function shape the litmus tests at lines 315–317 say to reject.
+  3. Forces the future SIMD swap to either (a) leave the scalar fn at the same module path while adding `_avx512` / `_neon` cousins in different files — semantically incoherent — or (b) rename, breaking callers.
+
+  **Recommended fix in the design doc:** move both fns to `src/hpc/soa.rs` (same module as `SoaVec`). Doc example becomes:
+  ```rust
+  use ndarray::hpc::soa::{SoaVec, aos_to_soa, soa_to_aos};
+  ```
+  This co-locates types and their conversions and keeps `simd_ops.rs` pure-SIMD. When the future SIMD swap happens, the dispatcher at `hpc::soa::aos_to_soa` can grow per-arch arms internally (delegating to per-arch impls under `simd_*.rs` only for the SIMD bits, not for the user-visible API).
+
+  Cross-ref: E2 below (answer: `hpc::soa`).
+
+### Important (P1 — fix in doc or first commit)
+
+- **P1-1 — `SoaVec::field(i)` runtime panic is avoidable for known indices.** Design doc lines 132–136. `N` is const-generic and `i: usize` is runtime — most call sites will use literal indices. Recommend adding a const-generic accessor alongside the runtime one:
+  ```rust
+  pub fn field_n<const I: usize>(&self) -> &[T] {
+      const { assert!(I < N) };  // compile-time bounds check (Rust 1.79+)
+      &self.fields[I]
+  }
+  ```
+  Keep the runtime `field(i)` for dynamic-index callers. Doc-comment must state which to use when. Without this, every hot-path access pays an unnecessary bounds check that the optimizer often can't elide.
+
+- **P1-2 — `T: Copy` bound is silent in `aos_to_soa` / `soa_to_aos`.** Design doc lines 290–298: `extract: F where F: Fn(&T) -> [f32; N]` — the closure returns `[f32; N]` by value, so the *return type* is `Copy`. `T` itself does NOT need `Copy` because `extract(item)` borrows via `&T`. But the doc-comment must say so explicitly. The current example uses `struct Item { a: f32, b: f32, c: f32 }` which is implicitly `Copy`; a consumer with `String` fields would be uncertain. Add a doc note: "`T` need not be `Copy`; only the extracted `[f32; N]` row is materialized."
+
+- **P1-3 — `soa_struct!` macro has zero reserved-word / field-name-collision handling.** Design doc lines 202–253. The macro generates a `push(&mut self, $($field: $ty),*)` method — if a struct has a field named `len`, `clear`, `new`, `with_capacity`, `is_empty`, or `default`, the generated impl will fail to compile due to method-name collision (the macro generates `pub fn len(&self)` AND the field `pub len: Vec<...>`; user accessing `self.len` is ambiguous between the field and the method). The compile error will be cryptic. Doc must add a §"Reserved field names" section listing the six names that conflict and stating "the macro deliberately does not work around this; choose different field names."
+
+- **P1-4 — D3 (pub-field invariant) is left undecided.** Design doc lines 211–212: `$vis struct $name { $($field_vis $field: ::std::vec::Vec<$ty>),* }`. The macro respects user-provided `pub` on fields, which means callers CAN do `batch.means_x.truncate(5)` and break the field-length invariant silently (release-mode debug_assert disabled). The doc never specifies whether the invariant is owned by the type or the caller. Worker will have to guess. See D3 recommendation below.
+
+- **P1-5 — `aos_to_soa::<_, 3, _>` turbofish is awkward; inference is brittle.** Design doc line 285 example: `aos_to_soa::<_, 3, _>(&aos, |it| [it.a, it.b, it.c])`. Two `_` placeholders means callers MUST remember the positional order (T, N, F). The const-generic `N` can usually be inferred from the closure return type `[f32; 3]`, but Rust's inference for const generics from array literals is brittle. Doc must include an "if inference fails" note showing the alternative: `aos_to_soa(&aos, |it| -> [f32; 3] { [it.a, it.b, it.c] })`. Tag as "verified on Rust 1.94".
+
+- **P1-6 — `bulk_apply` panic on `chunk_size == 0` is documented; `usize::MAX` is not.** Design doc lines 374–384. `slice::chunks_mut(usize::MAX)` on a non-empty slice returns a single-chunk iterator (stdlib behavior). Doc should add one line: "`chunk_size == usize::MAX` yields the entire slice as a single chunk." Trivial fix, prevents future confusion.
+
+- **P1-7 — Module registration spec ambiguity in `src/lib.rs`.** Design doc lines 419–429: "the macro is already `#[macro_export]` which puts it at the crate root, so `ndarray::soa_struct!` works. No manual re-export needed." Technically correct, but workers must NOT also write `pub use crate::hpc::soa::soa_struct;` (would conflict). Add an explicit "do not re-export the macro; `#[macro_export]` handles it" line in the spec.
+
+### Nice to have (P2 — workers may apply, codex can flag later)
+
+- **P2-1 — `bulk_scan` is a misnomer.** Design doc lines 386–409. In Rust idiom (`Iterator::scan`), FP, APL/GPU, "scan" = prefix-sum / fold-with-state. The fn is a read-only `chunks` walker. Better names: `bulk_each`, `bulk_for_each`, `bulk_inspect`, or `bulk_apply_ref` to pair with the mut version. Doc should at least call out the divergence from convention if `bulk_scan` is kept.
+
+- **P2-2 — `SoaVec` is missing `iter_rows()` and `iter_rows_mut()`.** Design doc lines 98–148. Chunked iterator is there, row iterator (yielding `[&T; N]` per row) is absent. Consumers will want it for `for row in soa.iter_rows() { ... }` patterns. Either add it to v0 or document explicitly: "row iteration is not provided; use `soa.chunks(1)` if needed".
+
+- **P2-3 — `core::array::from_fn(|k| fields[k][i])` in `soa_to_aos` (line 325) is O(N) per row.** For N=2/3/4 it's fine. Add one sentence: "Complexity: O(N·len) where N is field count and len is row count." Settles the "should this be SIMD'd later?" question by making the cost visible.
+
+- **P2-4 — `SoaVec` doesn't have a `Clone` derive (or any derives).** Design doc line 94. SoA containers commonly want `Clone` and sometimes `Debug`. The macro at line 211 captures `$(#[$meta:meta])*` for the struct attrs — verify that user `#[derive(Clone)]` on the macro invocation passes through correctly; if not, add a derive-passthrough fragment.
+
+- **P2-5 — Test §"hpc::soa::push debug-assert fires" is hard to write for `SoaVec`.** Design doc line 262. For the *macro*-generated struct, fields are `pub`, so the test is straightforward. For `SoaVec`, fields are private, so the corrupt-state test is impossible without `unsafe`. Either drop the `debug_assert!` in `SoaVec::len()` (it can never fire from safe code), or keep it as defense-in-depth and skip the SoaVec test (only test the macro-generated case). Worker spec should pick one.
+
+- **P2-6 — `SoaVec` lacks `pop`, `truncate`, `swap_remove`.** Symmetry with `Vec`. Not required for v0; add a future-work note in the module-level doc.
+
+## Specific recommendation for each open question
+
+- **D3 (macro: pub fields stay pub OR go private?):** **Keep fields `pub` (or whatever vis the user specifies), document the invariant as caller-owned for direct mutation.** Rationale: (1) the SoA layout's entire ergonomic win is direct `&[T]` access for SIMD-style loops — hiding fields behind getters forces `batch.means_x()` everywhere, ugly; (2) the existing `GaussianBatch` in `splat3d/gaussian.rs:93–110` already uses `pub` fields, so changing the pattern would create inconsistency; (3) the `debug_assert!` in `len()` catches mistakes during dev. **Add doc line:** "Fields are `pub` (or use the visibility you specify). Mutation through `push` / `clear` preserves the field-length invariant. If you mutate fields directly (e.g., `batch.means_x.truncate(5)`), you OWN the invariant — `len()` will debug-assert but release builds will return arbitrary results until you restore equality."
+
+- **E2 (`simd_ops` vs `hpc::soa` for the helpers?):** **`hpc::soa`.** See P0-1. The helpers have no SIMD content; they live with `SoaVec`; `simd_ops.rs` module charter (line 1) is SIMD-only. This is the load-bearing change to the design doc.
+
+- **F1 (future SIMD wave: rename `_scalar` or grow internal arms?):** **Grow internal arms.** Keep `pub fn aos_to_soa(...)` at `hpc::soa::aos_to_soa` as the stable user-facing entry. When SIMD lands, the dispatcher inside the fn calls `simd_dispatch::table().aos_to_soa_f32` (or similar function-pointer) which selects between `aos_to_soa_scalar`, `aos_to_soa_avx512`, `aos_to_soa_neon` at LazyLock startup. Public API is forever stable. Reason: per `vertical-simd-consumer-contract.md:31`, the consumer pattern is "closure-parameterized batch primitives that absorb the consumer's domain semantics" — the closure `Fn(&T) -> [f32; N]` IS the consumer's domain semantics, and the dispatch happens *inside* the entry.
+
+  **F2 caveat acknowledged:** a future SIMD impl will want `(*const T, stride)` (or `field_offsets: [usize; N]`) not `Fn(&T) -> [f32; N]`, because the closure is per-row scalar and prevents true vectorization. The migration plan: when SIMD lands, the closure-API stays as fallback for "extract logic too complex to vectorize", and a NEW entry `aos_to_soa_strided(packed: &[T], field_offsets: [usize; N]) -> SoaVec<f32, N>` ships alongside for the "dense POD struct" case where stride/offset can drive a real gather. Two entries, both stable. No rename.
+
+## Layering rule compliance summary
+
+- **A1 (per-arch leakage):** Clean. Zero `#[target_feature]`, zero `is_x86_feature_detected!`, zero `cfg(target_arch)` in any design surface. Doc explicitly forbids them at lines 42–48.
+- **A2 (const-generic forcing specialization):** No. `const N: usize` is a *value* generic, not a SIMD-feature generic. Future SIMD impls can dispatch on `N` (e.g., specialize N=3 for LD3 NEON, N=4 for AVX-512 gather-stride-4) inside the dispatcher without changing the public signature.
+- **A3 (future SIMD swap):** Holds, conditional on E2 (`hpc::soa` placement) and F2 (strided alt-entry as the migration path).
+
+## Confidence to proceed
+
+**HIGH.** Design is implementable with the one P0 module-relocation patch. No irreducible ambiguity. Workers can be spawned after the doc absorbs P0-1, P1-3, P1-4 (D3), P1-7, and the explicit D3 / E2 / F1 picks above. P2 items are codex-audit territory.
+
+## Action items before spawning workers
+
+1. **Doc patch:** apply P0-1 — change all "src/simd_ops.rs" references in §"W5 + W6" to "src/hpc/soa.rs", update import paths in examples.
+2. **Doc patch:** apply D3 and E2 decisions explicitly in the spec text.
+3. **Doc patch:** add P1-1 (`field_n<const I>`), P1-3 (reserved-name section), P1-7 (no manual re-export).
+4. **Optional doc patch:** add P1-2, P1-5, P1-6 clarifying notes.
+5. Spawn W3+W5+W6 as **one** worker (now all in `hpc::soa.rs`), W4 worker separately.
+
+## Sequencing concern flagged by E2
+
+If the E2 recommendation is taken (helpers live in `hpc::soa`), W3 and W5+W6 both touch `src/hpc/soa.rs`. Options:
+- **(a) Sequential:** W3 commits first (just `SoaVec` + macro), then W5+W6 commits the helpers on top.
+- **(b) Single combined worker:** W3+W5+W6 spawn as one worker producing one file in one commit. W4 (`hpc/bulk.rs`) stays parallel.
+
+Option (b) is cleaner — one commit per logical unit (SoA infrastructure), one PR. Recommend (b).
diff --git a/.claude/knowledge/w3-w6-soa-aos-design.md b/.claude/knowledge/w3-w6-soa-aos-design.md
new file mode 100644
index 00000000..4524452e
--- /dev/null
+++ b/.claude/knowledge/w3-w6-soa-aos-design.md
@@ -0,0 +1,550 @@
+# KNOWLEDGE: W3–W6 — SoA/AoS Handoff Helpers (Shape Settlement, No SIMD Yet)
+
+## READ BY
+- Worker agents executing W3 (`src/hpc/soa.rs`), W5+W6 (additions to `src/simd_ops.rs`), W4 (`src/hpc/bulk.rs`) — each agent reads this BEFORE writing code
+- The plan-review savant agent (before workers spawn)
+- The code-review codex agent (after workers commit)
+- Downstream consumer sessions that need SoA containers or AoS↔SoA conversion
+
+## P0 TRIGGERS
+- About to add a per-arch SIMD intrinsic (`#[target_feature]`, raw `_mm*_*` call, `vld3q_*` / `vst3q_*`, etc.) to any W3–W6 file → STOP, that's not in scope, see §"Why no SIMD yet"
+- About to import directly from `simd_avx512.rs`, `simd_avx2.rs`, `simd_neon.rs` from user code → STOP, violates the layering rule, see §"Layering rule"
+- About to use `&[T]` / `&mut [T]` as the public type for a kernel-layer fn → STOP, kernels use `ArrayView` per W2 (PR #154); only the SIMD primitive layer takes slices
+
+## What this wave is and is not
+
+**Is:** Establish the SoA container shape, AoS↔SoA conversion ergonomics, and a chunked `bulk_apply` wrapper across the codebase. Pure-scalar implementations. Pure user-facing API. No SIMD.
+
+**Is not:** SIMD-accelerated deinterleave (VPGATHERDD on AVX-512, LD3/LD4 on NEON), gather/scatter primitives, perf-claim work. Those wait for bench data on actual hot paths.
+
+**Why no SIMD yet:** Per the §"Recommendation" of the W3-W7 outlook plan, designing SIMD primitives without measured hot-path data produces speculative APIs. The scalar helpers in this wave are useful in their own right (settles SoA shape so cognitive bulk ops, splat3d, and quantized batches share one ergonomic). A future wave (post-bench-harness) can swap the scalar bodies for per-arch SIMD without changing the public API.
+
+## Layering rule (re-stated — violation is P0)
+
+```
+                ┌─────────────────────────────────────┐
+                │ user code (hpc/*, splat3d/, downstream crates)
+                └──────────────┬──────────────────────┘
+                               │ only these imports allowed
+                               ▼
+                ┌─────────────────────────────────────┐
+                │ crate::simd, crate::simd_ops        │  ← dispatch layer
+                │ (src/simd.rs, src/simd_ops.rs)      │     LazyLock-frozen function-pointer tables
+                └──────────────┬──────────────────────┘     in simd_dispatch.rs
+                               │ internal
+                               ▼
+                ┌─────────────────────────────────────┐
+                │ simd_avx512.rs, simd_avx2.rs,        │  ← per-tier impls
+                │ simd_neon.rs, simd_wasm.rs           │     these CARRY #[target_feature]
+                └─────────────────────────────────────┘
+```
+
+User code (this wave's W3, W4, W7 files) MUST NOT:
+1. Add `#[target_feature(enable = "…")]` attributes
+2. Add `#[cfg(target_feature = "…")]` gates
+3. `use crate::simd_avx512::*` / `use crate::simd_avx2::*` / etc.
+4. Call `is_x86_feature_detected!()` on hot paths
+5. Use raw `_mm*_*` / `vld*_*` / `_pdep_u64` intrinsics
+
+W5/W6 additions to `src/simd_ops.rs` are at the **dispatch layer**. In this wave they are scalar-only; future SIMD impls would live in `simd_avx512.rs` / `simd_avx2.rs` and be dispatched via the LazyLock table. Even now, do NOT add `#[target_feature]` to the simd_ops.rs entries — they stay scalar.
+
+## Exact API contracts (workers do not deviate)
+
+### W3 — `src/hpc/soa.rs` (NEW FILE)
+
+Two complementary primitives: a **macro** for named-field SoA structs, and a **generic struct** for ad-hoc cases.
+
+```rust
+//! SoA (Struct of Arrays) containers.
+//!
+//! Two complementary primitives:
+//! - [`soa_struct!`] macro — generates a named-field SoA struct from a struct-like
+//!   declaration. Use when the field names matter to callers (e.g. `means_x`,
+//!   `means_y`, `means_z` for a Gaussian batch).
+//! - [`SoaVec`] generic — `[Vec<T>; N]` wrapper. Use when fields are positional /
+//!   anonymous and you want a single type to talk about an N-field SoA batch.
+//!
+//! Both are SIMD-friendly storage shapes: each field is a contiguous `Vec<T>`,
+//! so per-field SIMD loops just iterate one `Vec`. AoS↔SoA conversion helpers
+//! live in [`crate::simd_ops`].
+//!
+//! This module is intentionally scalar — no `#[target_feature]`, no per-arch
+//! dispatch. A future wave will add SIMD-accelerated chunk iteration once
+//! bench data justifies the per-arch implementations. The public API on this
+//! module is forward-compatible with that future swap.
+
+use core::array;
+
+/// SoA container generic over field type `T` and field count `N`.
+///
+/// Internally: `[Vec<T>; N]`. All N fields are guaranteed to have the same
+/// length (enforced by `push` and asserted in `len()`).
+///
+/// # Example
+/// ```
+/// use ndarray::hpc::soa::SoaVec;
+/// let mut soa: SoaVec<f32, 3> = SoaVec::new();
+/// soa.push([1.0, 2.0, 3.0]);
+/// soa.push([4.0, 5.0, 6.0]);
+/// assert_eq!(soa.len(), 2);
+/// assert_eq!(soa.field(0), &[1.0, 4.0]);
+/// assert_eq!(soa.field(1), &[2.0, 5.0]);
+/// assert_eq!(soa.field(2), &[3.0, 6.0]);
+/// ```
+pub struct SoaVec<T, const N: usize> {
+    fields: [Vec<T>; N],
+}
+
+impl<T, const N: usize> SoaVec<T, N> {
+    /// Construct an empty SoaVec.
+    pub fn new() -> Self {
+        Self { fields: array::from_fn(|_| Vec::new()) }
+    }
+
+    /// Construct an empty SoaVec with each field pre-allocated to `cap`.
+    pub fn with_capacity(cap: usize) -> Self {
+        Self { fields: array::from_fn(|_| Vec::with_capacity(cap)) }
+    }
+
+    /// Append a row (one value per field) to the SoaVec.
+    pub fn push(&mut self, row: [T; N]) {
+        for (slot, val) in self.fields.iter_mut().zip(row) {
+            slot.push(val);
+        }
+    }
+
+    /// Number of rows. All fields are guaranteed to have this length.
+    ///
+    /// # Panics
+    /// In debug builds, panics if fields disagree on length (a bug in custom
+    /// `unsafe` mutation paths). In release, returns the length of field 0.
+    pub fn len(&self) -> usize {
+        let n = self.fields[0].len();
+        debug_assert!(
+            self.fields.iter().all(|f| f.len() == n),
+            "SoaVec field-length invariant violated"
+        );
+        n
+    }
+
+    pub fn is_empty(&self) -> bool { self.len() == 0 }
+
+    /// Borrow field `i` as a slice. Panics if `i >= N`.
+    ///
+    /// For known-at-compile-time indices, prefer [`field_n`](Self::field_n)
+    /// to elide the bounds check.
+    pub fn field(&self, i: usize) -> &[T] { &self.fields[i] }
+
+    /// Compile-time-checked field accessor. Use when the index is a literal.
+    /// `field_n::<2>()` is free of runtime bounds checking.
+    pub fn field_n<const I: usize>(&self) -> &[T] {
+        const { assert!(I < N, "SoaVec::field_n: I out of bounds for N"); }
+        &self.fields[I]
+    }
+
+    /// Mutably borrow field `i` as a slice. Panics if `i >= N`.
+    pub fn field_mut(&mut self, i: usize) -> &mut [T] { &mut self.fields[i] }
+
+    /// Compile-time-checked mutable field accessor.
+    pub fn field_n_mut<const I: usize>(&mut self) -> &mut [T] {
+        const { assert!(I < N, "SoaVec::field_n_mut: I out of bounds for N"); }
+        &mut self.fields[I]
+    }
+
+    /// Borrow all fields at once as an array of slices.
+    pub fn all_fields(&self) -> [&[T]; N] {
+        array::from_fn(|i| self.fields[i].as_slice())
+    }
+
+    /// Iterate over chunks of `chunk_len` rows, yielding `[&[T]; N]` per chunk.
+    /// The last chunk may be shorter than `chunk_len`. Fast: zero-copy slice borrow.
+    pub fn chunks(&self, chunk_len: usize) -> SoaChunks<'_, T, N> {
+        SoaChunks { soa: self, chunk_len, cursor: 0 }
+    }
+}
+
+impl<T, const N: usize> Default for SoaVec<T, N> {
+    fn default() -> Self { Self::new() }
+}
+
+/// Iterator yielded by [`SoaVec::chunks`].
+pub struct SoaChunks<'a, T, const N: usize> {
+    soa: &'a SoaVec<T, N>,
+    chunk_len: usize,
+    cursor: usize,
+}
+
+impl<'a, T, const N: usize> Iterator for SoaChunks<'a, T, N> {
+    type Item = [&'a [T]; N];
+
+    fn next(&mut self) -> Option<Self::Item> {
+        let len = self.soa.len();
+        if self.cursor >= len { return None; }
+        let end = (self.cursor + self.chunk_len).min(len);
+        let chunk: [&'a [T]; N] = array::from_fn(|i| &self.soa.fields[i][self.cursor..end]);
+        self.cursor = end;
+        Some(chunk)
+    }
+}
+
+/// Generate a named-field SoA struct from a struct-like declaration.
+///
+/// # Example
+/// ```
+/// use ndarray::soa_struct;
+///
+/// soa_struct! {
+///     pub struct GaussianBatch {
+///         pub means_x: f32,
+///         pub means_y: f32,
+///         pub means_z: f32,
+///     }
+/// }
+///
+/// // Generates:
+/// // pub struct GaussianBatch {
+/// //     pub means_x: Vec<f32>,
+/// //     pub means_y: Vec<f32>,
+/// //     pub means_z: Vec<f32>,
+/// // }
+/// // impl GaussianBatch { pub fn new() -> Self {...}, ... }
+///
+/// let mut b = GaussianBatch::new();
+/// b.push(1.0, 2.0, 3.0);
+/// b.push(4.0, 5.0, 6.0);
+/// assert_eq!(b.len(), 2);
+/// assert_eq!(b.means_x.as_slice(), &[1.0, 4.0]);
+/// ```
+#[macro_export]
+macro_rules! soa_struct {
+    (
+        $(#[$meta:meta])*
+        $vis:vis struct $name:ident {
+            $($field_vis:vis $field:ident : $ty:ty),* $(,)?
+        }
+    ) => {
+        $(#[$meta])*
+        $vis struct $name {
+            $($field_vis $field: ::std::vec::Vec<$ty>),*
+        }
+
+        impl $name {
+            /// Construct an empty instance.
+            pub fn new() -> Self {
+                Self { $($field: ::std::vec::Vec::new()),* }
+            }
+
+            /// Construct with each field pre-allocated to `cap`.
+            pub fn with_capacity(cap: usize) -> Self {
+                Self { $($field: ::std::vec::Vec::with_capacity(cap)),* }
+            }
+
+            /// Append one row across all fields.
+            pub fn push(&mut self, $($field: $ty),*) {
+                $(self.$field.push($field);)*
+            }
+
+            /// Length (all fields share this length; debug-asserted).
+            pub fn len(&self) -> usize {
+                let lens = [$(self.$field.len()),*];
+                debug_assert!(
+                    lens.iter().all(|&l| l == lens[0]),
+                    concat!(stringify!($name), ": field-length invariant violated")
+                );
+                lens[0]
+            }
+
+            pub fn is_empty(&self) -> bool { self.len() == 0 }
+
+            /// Clear all fields. Capacity is retained.
+            pub fn clear(&mut self) {
+                $(self.$field.clear();)*
+            }
+        }
+
+        impl ::std::default::Default for $name {
+            fn default() -> Self { Self::new() }
+        }
+    };
+}
+```
+
+**Tests** (inline `#[cfg(test)] mod tests`):
+- `SoaVec::new` / `with_capacity` / `push` / `len` / `is_empty`
+- `SoaVec::field` / `field_mut` / `all_fields`
+- `SoaVec::chunks` for chunk_len that divides len, chunk_len that doesn't, chunk_len > len
+- `soa_struct!` macro: generate a 3-field struct, push, len, clear, default
+- `soa_struct!` macro with `pub` / private fields
+- `soa_struct!` debug-assert fires on field-length-mismatch (use a custom block to corrupt, then call len, assert panic)
+
+### W5 + W6 — additions to `src/hpc/soa.rs`
+
+**Plan-review correction (P0-1):** these helpers were originally specced into `src/simd_ops.rs`, but that module's charter (`simd_ops.rs:1`) is "Slice-level elementwise ops built on the polyfill SIMD types" — every existing fn dispatches through `F32x16`/`F64x8`. Pure-scalar helpers in that module would violate the W1a consumer contract ("ndarray's SIMD surface is shaped to fit exactly what the Ada stack vertically needs — not a generic library", `vertical-simd-consumer-contract.md:31`). The free-function `aos_to_soa(&[T], extract)` shape is exactly the kind the W1a litmus rejects.
+
+**Decision: helpers live in `src/hpc/soa.rs`, co-located with `SoaVec`.** When a future bench-justified SIMD wave lands, the dispatcher inside the entry grows internal per-tier arms (calling per-arch impls under `simd_*.rs` for the gather/scatter intrinsics). The public API at `ndarray::hpc::soa::aos_to_soa` stays stable forever.
+
+```rust
+//! Add at the bottom of src/hpc/soa.rs (same file as SoaVec):
+
+/// Deinterleave an AoS slice into a `SoaVec` by extracting `N` field values
+/// per item via the user-supplied `extract` closure.
+///
+/// Scalar implementation. A future wave may add per-arch SIMD gather
+/// (VPGATHERDD on AVX-512, LD3/LD4 on NEON) for stride-known dense layouts;
+/// the public API is forward-compatible.
+///
+/// `T` need not be `Copy`; only the extracted `[f32; N]` row is materialized.
+///
+/// # Inference
+/// If the const-generic `N` fails to infer from the closure return type, annotate:
+/// `aos_to_soa::<_, 3, _>(&aos, |it| [it.a, it.b, it.c])`
+/// or  `aos_to_soa(&aos, |it| -> [f32; 3] { [it.a, it.b, it.c] })`.
+///
+/// # Example
+/// ```
+/// use ndarray::hpc::soa::aos_to_soa;
+/// struct Item { a: f32, b: f32, c: f32 }
+/// let aos = vec![Item { a: 1.0, b: 2.0, c: 3.0 }, Item { a: 4.0, b: 5.0, c: 6.0 }];
+/// let soa = aos_to_soa::<_, 3, _>(&aos, |it| [it.a, it.b, it.c]);
+/// assert_eq!(soa.field(0), &[1.0, 4.0]);
+/// assert_eq!(soa.field(1), &[2.0, 5.0]);
+/// assert_eq!(soa.field(2), &[3.0, 6.0]);
+/// ```
+pub fn aos_to_soa<T, const N: usize, F>(aos: &[T], extract: F) -> SoaVec<f32, N>
+where
+    F: Fn(&T) -> [f32; N],
+{
+    let mut soa = SoaVec::<f32, N>::with_capacity(aos.len());
+    for item in aos {
+        soa.push(extract(item));
+    }
+    soa
+}
+
+/// Interleave a `SoaVec` into an AoS `Vec<T>` by building each item from
+/// the per-field values via the user-supplied `build` closure.
+///
+/// Scalar implementation. See [`aos_to_soa`] for the forward-compatibility
+/// note.
+///
+/// # Example
+/// ```
+/// use ndarray::hpc::soa::{aos_to_soa, soa_to_aos};
+/// struct Item { a: f32, b: f32, c: f32 }
+/// let aos = vec![Item { a: 1.0, b: 2.0, c: 3.0 }, Item { a: 4.0, b: 5.0, c: 6.0 }];
+/// let soa = aos_to_soa::<_, 3, _>(&aos, |it| [it.a, it.b, it.c]);
+/// let back: Vec<Item> = soa_to_aos(&soa, |[a, b, c]| Item { a, b, c });
+/// assert_eq!(back[0].a, 1.0);
+/// assert_eq!(back[1].c, 6.0);
+/// ```
+pub fn soa_to_aos<T, const N: usize, F>(soa: &SoaVec<f32, N>, build: F) -> Vec<T>
+where
+    F: Fn([f32; N]) -> T,
+{
+    let n = soa.len();
+    let fields = soa.all_fields();
+    let mut out = Vec::with_capacity(n);
+    for i in 0..n {
+        let row: [f32; N] = core::array::from_fn(|k| fields[k][i]);
+        out.push(build(row));
+    }
+    out
+}
+```
+
+**Tests** (inline in `simd_ops.rs` `#[cfg(test)] mod tests`):
+- `aos_to_soa` round-trip: build AoS, deinterleave, check per-field values
+- `aos_to_soa` with N=2, N=3, N=4 (most common cases)
+- `aos_to_soa` empty input → empty SoaVec
+- `soa_to_aos` round-trip: aos_to_soa then soa_to_aos, deep-equal original
+- `soa_to_aos` with N=2, N=3, N=4
+
+### W4 — `src/hpc/bulk.rs` (NEW FILE)
+
+Thin chunked wrapper. Caller-managed scalar parallelism.
+
+```rust
+//! Bulk traversal helpers for AoS slices.
+//!
+//! [`bulk_apply`] chunks a `&mut [T]` and invokes a closure with each chunk
+//! plus its starting index. Useful when you want predictable cache behavior
+//! (chunk_size matched to L1 working-set) or when staging chunks to SoA for
+//! SIMD processing inside the closure.
+//!
+//! Scalar — no `#[target_feature]`, no per-arch dispatch. Composes with
+//! [`crate::simd_ops::aos_to_soa`] / [`crate::simd_ops::soa_to_aos`] inside
+//! the closure body if the caller wants SIMD per chunk.
+
+/// Apply `f` to consecutive chunks of `items`. Each invocation receives the
+/// chunk slice and the absolute index of the chunk's first element.
+///
+/// The last chunk may be shorter than `chunk_size`.
+///
+/// # Panics
+/// Panics if `chunk_size == 0` (would loop forever).
+///
+/// # Example
+/// ```
+/// use ndarray::hpc::bulk::bulk_apply;
+/// let mut v: Vec<i32> = (0..10).collect();
+/// bulk_apply(&mut v, 3, |chunk, start| {
+///     for (i, x) in chunk.iter_mut().enumerate() {
+///         *x = (start + i) as i32 * 10;
+///     }
+/// });
+/// assert_eq!(v, vec![0, 10, 20, 30, 40, 50, 60, 70, 80, 90]);
+/// ```
+pub fn bulk_apply<T, F>(items: &mut [T], chunk_size: usize, mut f: F)
+where
+    F: FnMut(&mut [T], usize),
+{
+    assert!(chunk_size > 0, "bulk_apply: chunk_size must be > 0");
+    let mut start = 0;
+    for chunk in items.chunks_mut(chunk_size) {
+        f(chunk, start);
+        start += chunk.len();
+    }
+}
+
+/// Read-only sibling of [`bulk_apply`]. Same chunking semantics, immutable
+/// chunks.
+///
+/// # Example
+/// ```
+/// use ndarray::hpc::bulk::bulk_scan;
+/// let v: Vec<i32> = (0..10).collect();
+/// let mut sum = 0i32;
+/// bulk_scan(&v, 4, |chunk, _start| {
+///     sum += chunk.iter().sum::<i32>();
+/// });
+/// assert_eq!(sum, 45);
+/// ```
+pub fn bulk_scan<T, F>(items: &[T], chunk_size: usize, mut f: F)
+where
+    F: FnMut(&[T], usize),
+{
+    assert!(chunk_size > 0, "bulk_scan: chunk_size must be > 0");
+    let mut start = 0;
+    for chunk in items.chunks(chunk_size) {
+        f(chunk, start);
+        start += chunk.len();
+    }
+}
+```
+
+**Tests** (inline):
+- `bulk_apply` with chunk_size that divides len, doesn't divide, > len
+- `bulk_apply` start index is correct across multiple chunks
+- `bulk_apply` panics on chunk_size == 0
+- `bulk_scan` same coverage
+- `bulk_apply` composed with `aos_to_soa` inside the closure (integration smoke test)
+
+## Reserved field names (do not collide)
+
+The `soa_struct!` macro generates inherent methods on the struct. Fields named identically to these methods will cause cryptic compile errors (the macro deliberately does NOT alias around them — choose different field names):
+
+| Reserved name | Reason |
+|---|---|
+| `new` | macro generates `pub fn new()` |
+| `with_capacity` | macro generates `pub fn with_capacity(cap)` |
+| `len` | macro generates `pub fn len(&self)` |
+| `is_empty` | macro generates `pub fn is_empty(&self)` |
+| `clear` | macro generates `pub fn clear(&mut self)` |
+| `push` | macro generates `pub fn push(...)` |
+| `default` | macro implements `Default` trait |
+
+If you need a `len` field semantically (e.g., a per-row count), name it `count`, `n`, or `row_len`.
+
+## Invariant ownership (macro fields are `pub` by design)
+
+The macro respects user-specified visibility per field (`pub means_x: f32` stays `pub`; private stays private). Fields stay `pub` (or whatever the user wrote) intentionally:
+
+1. The SoA layout's entire ergonomic win is direct `&[T]` access for SIMD-style loops; hiding fields behind getters defeats the purpose.
+2. The existing hand-rolled SoA pattern in `splat3d/gaussian.rs` uses `pub` fields — staying consistent.
+3. The macro's generated `len()` carries a `debug_assert!` that catches field-length-mismatch during development.
+
+**Caller-owned invariant rule:** if you mutate fields directly (e.g., `batch.means_x.truncate(5)` to shrink one field), you OWN the field-length invariant until you restore it. `len()` will `debug_assert!` in dev builds; release builds will return the length of field 0 and downstream code may misbehave. Push and clear are the safe mutation paths.
+
+## Module registration
+
+Both new files need to be registered in `src/hpc/mod.rs`. Worker writing each new file is responsible for the corresponding `pub mod` line:
+
+```rust
+// In src/hpc/mod.rs (alphabetical position):
+pub mod bulk;
+pub mod soa;
+```
+
+The macro is already `#[macro_export]` which puts it at the crate root, so `ndarray::soa_struct!` works. **Do NOT** add a manual `pub use crate::hpc::soa::soa_struct;` re-export in `lib.rs` — `#[macro_export]` already does this, and a manual re-export will fail to compile (macro and item namespaces collide).
+
+## Worker sprint plan (post plan-review)
+
+Two workers in parallel, each isolated worktree:
+
+| Worker | Files | Scope |
+|---|---|---|
+| **Worker A — SoA combined** | `src/hpc/soa.rs` (new), `src/hpc/mod.rs` (add `pub mod soa;`) | W3 + W5 + W6: `SoaVec<T,N>` + `soa_struct!` macro + `aos_to_soa` + `soa_to_aos`. Single file, single commit. |
+| **Worker B — bulk** | `src/hpc/bulk.rs` (new), `src/hpc/mod.rs` (add `pub mod bulk;`) | W4: `bulk_apply` + `bulk_scan`. Single file, single commit. |
+
+Worker A and Worker B can both edit `src/hpc/mod.rs` (one line each, different lines — merge clean). Otherwise non-overlapping.
+
+After both commit, codex audit reviews the combined diff for:
+- Zero `#[target_feature]` attributes added
+- Zero `use crate::simd_avx512::*` / `simd_avx2::*` / `simd_neon::*` imports
+- Zero `cfg(target_feature = …)` gates
+- Zero raw `_mm*_*` / `vld*_*` / `_pdep_*` intrinsics
+- All public fns have working `///` doc-examples
+- Tests cover all spec'd cases
+
+## What workers commit per file
+
+1. Implement the spec above exactly. No deviation in API.
+2. Add inline tests covering the cases listed under §"Tests" for the file.
+3. Add the `pub mod` registration in `src/hpc/mod.rs`.
+4. Run from worktree root:
+   - `cargo check -p ndarray --no-default-features --features std`
+   - `cargo test -p ndarray --lib --no-default-features --features std hpc::soa hpc::bulk` (worker scopes its `--test` filter)
+   - `cargo test --doc -p ndarray --no-default-features --features std hpc::soa hpc::bulk`
+   - `cargo fmt --all --check`
+   - `cargo clippy --no-default-features --features std -- -D warnings`
+   - All green before commit.
+5. Commit message format:
+   - Worker A: `feat(hpc/soa): add SoaVec + soa_struct! macro + aos_to_soa + soa_to_aos (W3+W5+W6)`
+   - Worker B: `feat(hpc/bulk): add bulk_apply + bulk_scan (W4)`
+
+## Out of scope — distance metrics
+
+This sprint is layout helpers only. **Workers MUST NOT** extend `SoaVec`, the macro, `aos_to_soa`/`soa_to_aos`, or `bulk_apply` toward distance computation. Specifically:
+
+- No `fn bulk_distance<T>(...)` umbrella API
+- No `enum DistanceMetric { Palette256, Hamming, Base17, … }`
+- No `Box<dyn Distance>` trait object
+- No generic `fn distance<T>(a: &T, b: &T) -> f32`
+- No collapsing palette-256 distance and HDR popcount early-exit into one helper
+
+Distance metrics in this codebase are **typed** — each metric is its own named fn with its own output type, and conversions between metrics are explicit. See `/home/user/ndarray/.claude/knowledge/cognitive-distance-typing.md` for the binding rule and the canonical worst-case roundtrip anti-pattern (palette-256 → Fisher-z → "cosine" → hamming → popcount → palette-256). The W3-W6 helpers stay generic over `T` and never bake in any metric.
+
+## W7 explicit deferral
+
+W7 (cognitive bulk ops on `&[Plane]` / `&[VSA]` / `&[Fingerprint256]` / `&[PaletteIdx]`) is **NOT** in this wave. Two reasons:
+
+1. **No bench data.** Actual hot paths in the cognitive layer aren't measured. The SIMD wins require per-arch gather/scatter primitives whose design should follow bench evidence, not imagination.
+2. **Typed distance scope.** W7's bulk primitives must respect the per-metric typing rule (see `cognitive-distance-typing.md`). Each metric — palette-256 distance, HDR popcount early-exit, Base17 L1, BF16 mantissa direct transform — gets its own named bulk fn with its own output type. Designing those entries without a measured hot-path-vs-target-metric pairing risks shipping a generic API that erases the typing.
+
+When W7 revisits, the typed-bulk primitives will be (representative shape, not exhaustive):
+```rust
+pub fn bulk_hdr_popcount_early_exit(
+    query: &Fingerprint256, db: &[Fingerprint256], threshold: u16,
+) -> Vec<Option<HammingDistance>>;
+
+pub fn bulk_palette256_distance(
+    query: PaletteIdx, db: &[PaletteIdx],
+    buckets: &Buckets, offset: EulerGammaOffset,
+) -> Vec<PaletteDistance>;
+
+pub fn bulk_palette256_bf16_mantissa_transform(
+    palettes: &[PaletteIdx], offset: EulerGammaOffset, mantissa: BF16MantissaCtx,
+) -> Vec<PaletteIdx>;
+```
+
+These MAY internally use `SoaVec` + `bulk_apply` from W3/W4 for layout staging. They MUST NOT collapse into a `bulk_distance<T>` umbrella.
diff --git a/src/hpc/bulk.rs b/src/hpc/bulk.rs
new file mode 100644
index 00000000..3a2a9e9d
--- /dev/null
+++ b/src/hpc/bulk.rs
@@ -0,0 +1,327 @@
+//! Bulk traversal helpers for AoS slices.
+//!
+//! [`bulk_apply`] chunks a `&mut [T]` and invokes a closure with each chunk
+//! plus its starting index. Useful when you want predictable cache behavior
+//! (chunk_size matched to L1 working-set) or when staging chunks to SoA for
+//! SIMD processing inside the closure.
+//!
+//! [`bulk_scan`] is the read-only sibling for non-mutating traversal.
+//!
+//! Both helpers are scalar wrappers — no `#[target_feature]`, no per-arch
+//! dispatch. They are user-level code per the layering rule in
+//! `.claude/knowledge/vertical-simd-consumer-contract.md`: only the dispatch
+//! layer (`crate::simd`, `crate::simd_ops`) and per-tier impls
+//! (`simd_avx512.rs`, `simd_avx2.rs`, `simd_neon.rs`) may carry SIMD
+//! attributes. The public API here is forward-compatible: a future
+//! bench-justified wave can swap in SIMD-accelerated chunk iteration via
+//! the dispatch layer without breaking callers.
+//!
+//! # Composition with SoA staging
+//!
+//! `bulk_apply` composes naturally with `crate::hpc::soa::aos_to_soa` inside
+//! the closure body when the caller wants per-chunk SoA staging (e.g. for
+//! cache-blocked SIMD-style loops on each field):
+//!
+//! ```ignore
+//! use ndarray::hpc::bulk::bulk_apply;
+//! use ndarray::hpc::soa::aos_to_soa;
+//! struct Item { a: f32, b: f32, c: f32 }
+//! let mut items: Vec<Item> = (0..100)
+//!     .map(|i| Item { a: i as f32, b: (i * 2) as f32, c: (i * 3) as f32 })
+//!     .collect();
+//! bulk_apply(&mut items, 16, |chunk, _start| {
+//!     let soa = aos_to_soa::<_, 3, _>(chunk, |it| [it.a, it.b, it.c]);
+//!     // ... per-field SIMD-style loops over soa.field(0), soa.field(1), ...
+//!     let _ = soa;
+//! });
+//! ```
+//!
+//! # Out of scope — distance metrics
+//!
+//! These helpers stay generic over `T`. They MUST NOT grow toward distance
+//! computation (no `bulk_distance<T>` umbrella, no `enum DistanceMetric`).
+//! Distance metrics in this codebase are typed — one named fn per metric.
+//! See `.claude/knowledge/cognitive-distance-typing.md` for the binding rule.
+
+/// Apply `f` to consecutive chunks of `items`. Each invocation receives the
+/// chunk slice and the absolute index of the chunk's first element.
+///
+/// The last chunk may be shorter than `chunk_size` when `chunk_size` does
+/// not divide `items.len()`. A `chunk_size` of `usize::MAX` yields the
+/// entire slice as a single chunk.
+///
+/// # Panics
+/// Panics if `chunk_size == 0` (`chunks_mut(0)` would otherwise return an
+/// iterator that does not make progress).
+///
+/// # Example
+/// ```
+/// use ndarray::hpc::bulk::bulk_apply;
+/// let mut v: Vec<i32> = (0..10).collect();
+/// bulk_apply(&mut v, 3, |chunk, start| {
+///     for (i, x) in chunk.iter_mut().enumerate() {
+///         *x = (start + i) as i32 * 10;
+///     }
+/// });
+/// assert_eq!(v, vec![0, 10, 20, 30, 40, 50, 60, 70, 80, 90]);
+/// ```
+pub fn bulk_apply<T, F>(items: &mut [T], chunk_size: usize, mut f: F)
+where
+    F: FnMut(&mut [T], usize),
+{
+    assert!(chunk_size > 0, "bulk_apply: chunk_size must be > 0");
+    let mut start = 0;
+    for chunk in items.chunks_mut(chunk_size) {
+        let n = chunk.len();
+        f(chunk, start);
+        start += n;
+    }
+}
+
+/// Read-only sibling of [`bulk_apply`]. Applies `f` to consecutive immutable
+/// chunks of `items`, passing the absolute starting index of each chunk.
+///
+/// The last chunk may be shorter than `chunk_size`. A `chunk_size` of
+/// `usize::MAX` yields the entire slice as a single chunk.
+///
+/// # Panics
+/// Panics if `chunk_size == 0`.
+///
+/// # Example
+/// ```
+/// use ndarray::hpc::bulk::bulk_scan;
+/// let v: Vec<i32> = (0..10).collect();
+/// let mut sum = 0i32;
+/// bulk_scan(&v, 4, |chunk, _start| {
+///     sum += chunk.iter().sum::<i32>();
+/// });
+/// assert_eq!(sum, 45);
+/// ```
+pub fn bulk_scan<T, F>(items: &[T], chunk_size: usize, mut f: F)
+where
+    F: FnMut(&[T], usize),
+{
+    assert!(chunk_size > 0, "bulk_scan: chunk_size must be > 0");
+    let mut start = 0;
+    for chunk in items.chunks(chunk_size) {
+        let n = chunk.len();
+        f(chunk, start);
+        start += n;
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    // ----- bulk_apply -----
+
+    #[test]
+    fn bulk_apply_chunk_size_divides_len() {
+        // len == 10, chunk_size == 5 → exactly 2 chunks of 5.
+        let mut v: Vec<i32> = (0..10).collect();
+        let mut sizes = Vec::new();
+        bulk_apply(&mut v, 5, |chunk, _start| {
+            sizes.push(chunk.len());
+        });
+        assert_eq!(sizes, vec![5, 5]);
+    }
+
+    #[test]
+    fn bulk_apply_chunk_size_does_not_divide_len() {
+        // len == 10, chunk_size == 3 → 3 + 3 + 3 + 1.
+        let mut v: Vec<i32> = (0..10).collect();
+        let mut sizes = Vec::new();
+        bulk_apply(&mut v, 3, |chunk, _start| {
+            sizes.push(chunk.len());
+        });
+        assert_eq!(sizes, vec![3, 3, 3, 1]);
+    }
+
+    #[test]
+    fn bulk_apply_chunk_size_greater_than_len() {
+        // chunk_size > len → a single chunk of len rows.
+        let mut v: Vec<i32> = (0..10).collect();
+        let mut sizes = Vec::new();
+        bulk_apply(&mut v, 100, |chunk, start| {
+            assert_eq!(start, 0);
+            sizes.push(chunk.len());
+        });
+        assert_eq!(sizes, vec![10]);
+    }
+
+    #[test]
+    fn bulk_apply_start_indices_3_3_3_1() {
+        // The 3-3-3-1 chunking should produce starts [0, 3, 6, 9].
+        let mut v: Vec<i32> = (0..10).collect();
+        let mut start_indices: Vec<usize> = Vec::new();
+        bulk_apply(&mut v, 3, |_chunk, start| {
+            start_indices.push(start);
+        });
+        assert_eq!(start_indices, vec![0, 3, 6, 9]);
+    }
+
+    #[test]
+    fn bulk_apply_mutates_via_start_plus_offset() {
+        // The closure can compute each element's absolute index from
+        // `start + i` and overwrite. Verifies the start index is correct.
+        let mut v: Vec<i32> = vec![0; 10];
+        bulk_apply(&mut v, 3, |chunk, start| {
+            for (i, x) in chunk.iter_mut().enumerate() {
+                *x = (start + i) as i32 * 10;
+            }
+        });
+        assert_eq!(v, vec![0, 10, 20, 30, 40, 50, 60, 70, 80, 90]);
+    }
+
+    #[test]
+    #[should_panic(expected = "chunk_size must be > 0")]
+    fn bulk_apply_panics_on_zero_chunk_size() {
+        let mut v: Vec<i32> = (0..4).collect();
+        bulk_apply(&mut v, 0, |_, _| {});
+    }
+
+    #[test]
+    fn bulk_apply_chunk_size_usize_max_single_chunk() {
+        // stdlib `chunks_mut(usize::MAX)` yields a single chunk equal to the
+        // whole slice. Smoke-test: doesn't loop, doesn't panic, one chunk.
+        let mut v: Vec<i32> = (0..4).collect();
+        let mut count = 0;
+        bulk_apply(&mut v, usize::MAX, |chunk, start| {
+            count += 1;
+            assert_eq!(start, 0);
+            assert_eq!(chunk.len(), 4);
+        });
+        assert_eq!(count, 1);
+    }
+
+    #[test]
+    fn bulk_apply_empty_slice() {
+        // Empty input: closure never invoked.
+        let mut v: Vec<i32> = Vec::new();
+        let mut count = 0;
+        bulk_apply(&mut v, 4, |_, _| {
+            count += 1;
+        });
+        assert_eq!(count, 0);
+    }
+
+    // ----- bulk_scan -----
+
+    #[test]
+    fn bulk_scan_chunk_size_divides_len() {
+        let v: Vec<i32> = (0..10).collect();
+        let mut sizes = Vec::new();
+        bulk_scan(&v, 5, |chunk, _start| {
+            sizes.push(chunk.len());
+        });
+        assert_eq!(sizes, vec![5, 5]);
+    }
+
+    #[test]
+    fn bulk_scan_chunk_size_does_not_divide_len() {
+        let v: Vec<i32> = (0..10).collect();
+        let mut sizes = Vec::new();
+        bulk_scan(&v, 3, |chunk, _start| {
+            sizes.push(chunk.len());
+        });
+        assert_eq!(sizes, vec![3, 3, 3, 1]);
+    }
+
+    #[test]
+    fn bulk_scan_chunk_size_greater_than_len() {
+        let v: Vec<i32> = (0..10).collect();
+        let mut sizes = Vec::new();
+        bulk_scan(&v, 100, |chunk, start| {
+            assert_eq!(start, 0);
+            sizes.push(chunk.len());
+        });
+        assert_eq!(sizes, vec![10]);
+    }
+
+    #[test]
+    fn bulk_scan_start_indices_3_3_3_1() {
+        let v: Vec<i32> = (0..10).collect();
+        let mut start_indices: Vec<usize> = Vec::new();
+        bulk_scan(&v, 3, |_chunk, start| {
+            start_indices.push(start);
+        });
+        assert_eq!(start_indices, vec![0, 3, 6, 9]);
+    }
+
+    #[test]
+    fn bulk_scan_sums_chunks() {
+        let v: Vec<i32> = (0..10).collect();
+        let mut sum = 0i32;
+        bulk_scan(&v, 4, |chunk, _start| {
+            sum += chunk.iter().sum::<i32>();
+        });
+        assert_eq!(sum, 45);
+    }
+
+    #[test]
+    #[should_panic(expected = "chunk_size must be > 0")]
+    fn bulk_scan_panics_on_zero_chunk_size() {
+        let v: Vec<i32> = (0..4).collect();
+        bulk_scan(&v, 0, |_, _| {});
+    }
+
+    #[test]
+    fn bulk_scan_chunk_size_usize_max_single_chunk() {
+        let v: Vec<i32> = (0..4).collect();
+        let mut count = 0;
+        bulk_scan(&v, usize::MAX, |chunk, start| {
+            count += 1;
+            assert_eq!(start, 0);
+            assert_eq!(chunk.len(), 4);
+        });
+        assert_eq!(count, 1);
+    }
+
+    #[test]
+    fn bulk_scan_empty_slice() {
+        let v: Vec<i32> = Vec::new();
+        let mut count = 0;
+        bulk_scan(&v, 4, |_, _| {
+            count += 1;
+        });
+        assert_eq!(count, 0);
+    }
+
+    // ----- integration with aos_to_soa -----
+    //
+    // TODO: enable once W3+W5+W6 land — depends on
+    // `crate::hpc::soa::{aos_to_soa, SoaVec::field}` from Worker A. The
+    // soa module is not yet registered on this branch's base. When
+    // Worker A's commit lands, drop the `#[cfg(any())]` gate.
+    #[cfg(any())]
+    #[test]
+    fn bulk_apply_composes_with_aos_to_soa() {
+        use crate::hpc::soa::aos_to_soa;
+
+        struct Item {
+            a: f32,
+            b: f32,
+            c: f32,
+        }
+
+        let mut items: Vec<Item> = (0..100)
+            .map(|i| Item {
+                a: i as f32,
+                b: (i * 2) as f32,
+                c: (i * 3) as f32,
+            })
+            .collect();
+
+        let mut chunk_count = 0;
+        bulk_apply(&mut items, 16, |chunk, start_idx| {
+            let soa = aos_to_soa::<_, 3, _>(chunk, |it| [it.a, it.b, it.c]);
+            assert_eq!(soa.len(), chunk.len());
+            // First row of the chunk corresponds to absolute index start_idx.
+            assert_eq!(soa.field(0)[0], start_idx as f32);
+            chunk_count += 1;
+        });
+        // 100 / 16 = 6 full chunks of 16 + 1 tail of 4 = 7 chunks total.
+        assert_eq!(chunk_count, 7);
+    }
+}
diff --git a/src/hpc/mod.rs b/src/hpc/mod.rs
index 93eaaa88..09b3b333 100644
--- a/src/hpc/mod.rs
+++ b/src/hpc/mod.rs
@@ -43,6 +43,8 @@ pub mod plane;
 #[allow(missing_docs)]
 pub mod seal;
 #[allow(missing_docs)]
+pub mod soa;
+#[allow(missing_docs)]
 pub mod node;
 #[allow(missing_docs)]
 pub mod cascade;
@@ -70,6 +72,7 @@ pub mod styles;
 pub mod nars;
 #[allow(missing_docs)]
 pub mod blackboard;
+pub mod bulk;
 #[allow(missing_docs)]
 pub mod bnn;
 #[allow(missing_docs)]
diff --git a/src/hpc/soa.rs b/src/hpc/soa.rs
new file mode 100644
index 00000000..d7a1ff22
--- /dev/null
+++ b/src/hpc/soa.rs
@@ -0,0 +1,852 @@
+//! SoA (Struct of Arrays) containers and AoS↔SoA conversion helpers.
+//!
+//! This module provides two complementary primitives for the
+//! "struct-of-arrays" storage shape, plus scalar deinterleave / interleave
+//! free functions:
+//!
+//! - [`soa_struct!`] macro — generates a named-field SoA struct from a
+//!   struct-like declaration. Use when field names matter to callers
+//!   (e.g. `means_x`, `means_y`, `means_z` for a Gaussian batch).
+//! - [`SoaVec`] generic — `[Vec<T>; N]` wrapper. Use when fields are
+//!   positional / anonymous and you want a single type to talk about an
+//!   N-field SoA batch.
+//! - [`aos_to_soa`] / [`soa_to_aos`] — scalar deinterleave / interleave
+//!   between an AoS slice and a `SoaVec<f32, N>`, parameterized on a
+//!   user-supplied extract / build closure.
+//!
+//! Both shapes are SIMD-friendly storage layouts: each field is a
+//! contiguous `Vec<T>`, so per-field SIMD loops iterate one `Vec`.
+//!
+//! # Layering
+//!
+//! This module is **scalar only**. It contains no `#[target_feature]`
+//! attributes, no `cfg(target_feature = ...)` gates, no per-arch imports.
+//! The public API is forward-compatible with a future bench-justified
+//! SIMD swap: the dispatcher inside any conversion entry can grow per-arch
+//! arms internally (delegating to `simd_avx512.rs` / `simd_neon.rs` for
+//! gather / scatter intrinsics) without changing the user-visible
+//! signature. See `.claude/knowledge/vertical-simd-consumer-contract.md`
+//! for the binding layering rule.
+//!
+//! # Out of scope — distance metrics
+//!
+//! These helpers are layout-only and generic over `T`. They never bake in
+//! a distance metric. See `.claude/knowledge/cognitive-distance-typing.md`
+//! for the rule: each cognitive distance metric (palette 256 distance,
+//! HDR popcount early-exit, Base17 L1, BF16 mantissa transform) gets its
+//! own named function with its own typed output. Do NOT extend `SoaVec`,
+//! the macro, `aos_to_soa`, or `soa_to_aos` toward a generic
+//! `bulk_distance<T>` umbrella.
+
+use core::array;
+
+/// SoA container generic over field type `T` and field count `N`.
+///
+/// Internally: `[Vec<T>; N]`. All `N` fields are guaranteed to have the
+/// same length (enforced by [`push`](Self::push) and asserted in
+/// [`len`](Self::len) under `debug_assertions`).
+///
+/// # Example
+///
+/// ```
+/// use ndarray::hpc::soa::SoaVec;
+/// let mut soa: SoaVec<f32, 3> = SoaVec::new();
+/// soa.push([1.0, 2.0, 3.0]);
+/// soa.push([4.0, 5.0, 6.0]);
+/// assert_eq!(soa.len(), 2);
+/// assert_eq!(soa.field(0), &[1.0, 4.0]);
+/// assert_eq!(soa.field(1), &[2.0, 5.0]);
+/// assert_eq!(soa.field(2), &[3.0, 6.0]);
+/// ```
+pub struct SoaVec<T, const N: usize> {
+    fields: [Vec<T>; N],
+}
+
+impl<T, const N: usize> SoaVec<T, N> {
+    /// Construct an empty `SoaVec`.
+    ///
+    /// # Example
+    ///
+    /// ```
+    /// use ndarray::hpc::soa::SoaVec;
+    /// let soa: SoaVec<u32, 2> = SoaVec::new();
+    /// assert!(soa.is_empty());
+    /// ```
+    pub fn new() -> Self {
+        Self {
+            fields: array::from_fn(|_| Vec::new()),
+        }
+    }
+
+    /// Construct an empty `SoaVec` with each field pre-allocated to `cap`.
+    ///
+    /// # Example
+    ///
+    /// ```
+    /// use ndarray::hpc::soa::SoaVec;
+    /// let soa: SoaVec<u32, 2> = SoaVec::with_capacity(128);
+    /// assert!(soa.is_empty());
+    /// ```
+    pub fn with_capacity(cap: usize) -> Self {
+        Self {
+            fields: array::from_fn(|_| Vec::with_capacity(cap)),
+        }
+    }
+
+    /// Append a row (one value per field) to the `SoaVec`.
+    ///
+    /// # Example
+    ///
+    /// ```
+    /// use ndarray::hpc::soa::SoaVec;
+    /// let mut soa: SoaVec<i32, 2> = SoaVec::new();
+    /// soa.push([10, 20]);
+    /// assert_eq!(soa.len(), 1);
+    /// ```
+    pub fn push(&mut self, row: [T; N]) {
+        for (slot, val) in self.fields.iter_mut().zip(row) {
+            slot.push(val);
+        }
+    }
+
+    /// Number of rows. All fields are guaranteed to have this length.
+    ///
+    /// # Panics
+    ///
+    /// In debug builds, panics if fields disagree on length (a bug in
+    /// custom `unsafe` mutation paths). In release, returns the length
+    /// of field 0.
+    pub fn len(&self) -> usize {
+        let n = self.fields[0].len();
+        debug_assert!(self.fields.iter().all(|f| f.len() == n), "SoaVec field-length invariant violated");
+        n
+    }
+
+    /// Returns `true` if the `SoaVec` has zero rows.
+    pub fn is_empty(&self) -> bool {
+        self.len() == 0
+    }
+
+    /// Borrow field `i` as a slice.
+    ///
+    /// # Panics
+    ///
+    /// Panics if `i >= N`.
+    ///
+    /// For known-at-compile-time indices, prefer
+    /// [`field_n`](Self::field_n) to elide the bounds check.
+    pub fn field(&self, i: usize) -> &[T] {
+        &self.fields[i]
+    }
+
+    /// Compile-time-checked field accessor. Use when the index is a
+    /// literal; `field_n::<2>()` is free of runtime bounds checking.
+    ///
+    /// # Example
+    ///
+    /// ```
+    /// use ndarray::hpc::soa::SoaVec;
+    /// let mut soa: SoaVec<f32, 3> = SoaVec::new();
+    /// soa.push([1.0, 2.0, 3.0]);
+    /// assert_eq!(soa.field_n::<1>(), &[2.0]);
+    /// ```
+    pub fn field_n<const I: usize>(&self) -> &[T] {
+        const { assert!(I < N, "SoaVec::field_n: I out of bounds for N") };
+        &self.fields[I]
+    }
+
+    /// Mutably borrow field `i` as a slice.
+    ///
+    /// # Panics
+    ///
+    /// Panics if `i >= N`.
+    pub fn field_mut(&mut self, i: usize) -> &mut [T] {
+        &mut self.fields[i]
+    }
+
+    /// Compile-time-checked mutable field accessor.
+    pub fn field_n_mut<const I: usize>(&mut self) -> &mut [T] {
+        const { assert!(I < N, "SoaVec::field_n_mut: I out of bounds for N") };
+        &mut self.fields[I]
+    }
+
+    /// Borrow all fields at once as an array of slices, indexed by field
+    /// position.
+    ///
+    /// # Example
+    ///
+    /// ```
+    /// use ndarray::hpc::soa::SoaVec;
+    /// let mut soa: SoaVec<u8, 2> = SoaVec::new();
+    /// soa.push([1, 2]);
+    /// soa.push([3, 4]);
+    /// let fields = soa.all_fields();
+    /// assert_eq!(fields[0], &[1, 3]);
+    /// assert_eq!(fields[1], &[2, 4]);
+    /// ```
+    pub fn all_fields(&self) -> [&[T]; N] {
+        array::from_fn(|i| self.fields[i].as_slice())
+    }
+
+    /// Iterate over chunks of `chunk_len` rows, yielding `[&[T]; N]` per
+    /// chunk. The last chunk may be shorter than `chunk_len`. Fast:
+    /// zero-copy slice borrow.
+    ///
+    /// # Panics
+    ///
+    /// Per stdlib `slice::chunks` semantics, panics if `chunk_len == 0`.
+    ///
+    /// # Example
+    ///
+    /// ```
+    /// use ndarray::hpc::soa::SoaVec;
+    /// let mut soa: SoaVec<u32, 2> = SoaVec::new();
+    /// for i in 0..5 {
+    ///     soa.push([i, i * 10]);
+    /// }
+    /// let mut total = 0u32;
+    /// for chunk in soa.chunks(2) {
+    ///     total += chunk[0].iter().sum::<u32>();
+    /// }
+    /// assert_eq!(total, 0 + 1 + 2 + 3 + 4);
+    /// ```
+    pub fn chunks(&self, chunk_len: usize) -> SoaChunks<'_, T, N> {
+        assert!(chunk_len > 0, "SoaVec::chunks: chunk_len must be > 0");
+        SoaChunks {
+            soa: self,
+            chunk_len,
+            cursor: 0,
+        }
+    }
+}
+
+impl<T, const N: usize> Default for SoaVec<T, N> {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+/// Iterator yielded by [`SoaVec::chunks`].
+pub struct SoaChunks<'a, T, const N: usize> {
+    soa: &'a SoaVec<T, N>,
+    chunk_len: usize,
+    cursor: usize,
+}
+
+impl<'a, T, const N: usize> Iterator for SoaChunks<'a, T, N> {
+    type Item = [&'a [T]; N];
+
+    fn next(&mut self) -> Option<Self::Item> {
+        let len = self.soa.len();
+        if self.cursor >= len {
+            return None;
+        }
+        let end = (self.cursor + self.chunk_len).min(len);
+        let chunk: [&'a [T]; N] = array::from_fn(|i| &self.soa.fields[i][self.cursor..end]);
+        self.cursor = end;
+        Some(chunk)
+    }
+}
+
+/// Generate a named-field SoA struct from a struct-like declaration.
+///
+/// Each declared field `name: T` becomes `name: Vec<T>` on the generated
+/// struct, alongside inherent `new`, `with_capacity`, `push`, `len`,
+/// `is_empty`, `clear` methods plus an `impl Default`. Per-field
+/// visibility is respected (`pub means_x: f32` stays `pub`); struct-level
+/// meta-attributes (e.g. `#[derive(Clone)]`) pass through to the
+/// generated struct.
+///
+/// # Reserved field names
+///
+/// The macro generates inherent methods on the struct. Choosing a field
+/// name that collides with a generated method will produce a cryptic
+/// compile error — the macro deliberately does not alias around them.
+/// Reserved names: `new`, `with_capacity`, `push`, `len`, `is_empty`,
+/// `clear`, `default`. Pick a different field name (`count`, `n`,
+/// `row_len`) if you need that semantic.
+///
+/// # Invariant ownership
+///
+/// Fields stay `pub` (or whatever visibility the user specifies) on
+/// purpose: the SoA ergonomic win is direct `&[T]` access for SIMD-style
+/// loops. The generated `len()` carries a `debug_assert!` that catches
+/// field-length-mismatch during development. If you mutate fields
+/// directly (e.g. `batch.means_x.truncate(5)`), you OWN the field-length
+/// invariant until you restore it. `push` and `clear` are the safe
+/// mutation paths.
+///
+/// # Example
+///
+/// ```
+/// use ndarray::soa_struct;
+///
+/// soa_struct! {
+///     pub struct GaussianBatch {
+///         pub means_x: f32,
+///         pub means_y: f32,
+///         pub means_z: f32,
+///     }
+/// }
+///
+/// let mut b = GaussianBatch::new();
+/// b.push(1.0, 2.0, 3.0);
+/// b.push(4.0, 5.0, 6.0);
+/// assert_eq!(b.len(), 2);
+/// assert_eq!(b.means_x.as_slice(), &[1.0, 4.0]);
+/// assert_eq!(b.means_y.as_slice(), &[2.0, 5.0]);
+/// assert_eq!(b.means_z.as_slice(), &[3.0, 6.0]);
+/// ```
+#[macro_export]
+macro_rules! soa_struct {
+    (
+        $(#[$meta:meta])*
+        $vis:vis struct $name:ident {
+            $($field_vis:vis $field:ident : $ty:ty),* $(,)?
+        }
+    ) => {
+        $(#[$meta])*
+        $vis struct $name {
+            $($field_vis $field: ::std::vec::Vec<$ty>),*
+        }
+
+        impl $name {
+            /// Construct an empty instance.
+            pub fn new() -> Self {
+                Self { $($field: ::std::vec::Vec::new()),* }
+            }
+
+            /// Construct with each field pre-allocated to `cap`.
+            pub fn with_capacity(cap: usize) -> Self {
+                Self { $($field: ::std::vec::Vec::with_capacity(cap)),* }
+            }
+
+            /// Append one row across all fields.
+            #[allow(clippy::too_many_arguments)]
+            pub fn push(&mut self, $($field: $ty),*) {
+                $(self.$field.push($field);)*
+            }
+
+            /// Length (all fields share this length; debug-asserted).
+            pub fn len(&self) -> usize {
+                let lens = [$(self.$field.len()),*];
+                debug_assert!(
+                    lens.iter().all(|&l| l == lens[0]),
+                    concat!(stringify!($name), ": field-length invariant violated")
+                );
+                lens[0]
+            }
+
+            /// Returns `true` if there are zero rows.
+            pub fn is_empty(&self) -> bool { self.len() == 0 }
+
+            /// Clear all fields. Capacity is retained.
+            pub fn clear(&mut self) {
+                $(self.$field.clear();)*
+            }
+        }
+
+        impl ::std::default::Default for $name {
+            fn default() -> Self { Self::new() }
+        }
+    };
+}
+
+/// Deinterleave an AoS slice into a [`SoaVec`] by extracting `N` field
+/// values per item via the user-supplied `extract` closure.
+///
+/// Scalar implementation. A future bench-justified wave may add per-arch
+/// SIMD gather (VPGATHERDD on AVX-512, LD3/LD4 on NEON) for stride-known
+/// dense layouts; the public API is forward-compatible — the dispatcher
+/// will grow internal per-arch arms without changing this signature.
+///
+/// `T` need not be `Copy`; only the extracted `[f32; N]` row is
+/// materialized.
+///
+/// # Inference
+///
+/// If the const-generic `N` fails to infer from the closure return type,
+/// annotate either with a turbofish or a closure return-type ascription:
+///
+/// ```ignore
+/// aos_to_soa::<_, 3, _>(&aos, |it| [it.a, it.b, it.c]);
+/// aos_to_soa(&aos, |it| -> [f32; 3] { [it.a, it.b, it.c] });
+/// ```
+///
+/// (Verified on Rust 1.94.)
+///
+/// # Example
+///
+/// ```
+/// use ndarray::hpc::soa::aos_to_soa;
+/// struct Item { a: f32, b: f32, c: f32 }
+/// let aos = vec![
+///     Item { a: 1.0, b: 2.0, c: 3.0 },
+///     Item { a: 4.0, b: 5.0, c: 6.0 },
+/// ];
+/// let soa = aos_to_soa::<_, 3, _>(&aos, |it| [it.a, it.b, it.c]);
+/// assert_eq!(soa.field(0), &[1.0, 4.0]);
+/// assert_eq!(soa.field(1), &[2.0, 5.0]);
+/// assert_eq!(soa.field(2), &[3.0, 6.0]);
+/// ```
+pub fn aos_to_soa<T, const N: usize, F>(aos: &[T], extract: F) -> SoaVec<f32, N>
+where
+    F: Fn(&T) -> [f32; N],
+{
+    let mut soa = SoaVec::<f32, N>::with_capacity(aos.len());
+    for item in aos {
+        soa.push(extract(item));
+    }
+    soa
+}
+
+/// Interleave a [`SoaVec`] into an AoS `Vec<T>` by building each item
+/// from the per-field values via the user-supplied `build` closure.
+///
+/// Scalar implementation. See [`aos_to_soa`] for the forward-compatible
+/// note on future SIMD acceleration.
+///
+/// Complexity: O(N·len) where N is the field count and len is the row
+/// count.
+///
+/// # Example
+///
+/// ```
+/// use ndarray::hpc::soa::{aos_to_soa, soa_to_aos};
+/// struct Item { a: f32, b: f32, c: f32 }
+/// let aos = vec![
+///     Item { a: 1.0, b: 2.0, c: 3.0 },
+///     Item { a: 4.0, b: 5.0, c: 6.0 },
+/// ];
+/// let soa = aos_to_soa::<_, 3, _>(&aos, |it| [it.a, it.b, it.c]);
+/// let back: Vec<Item> = soa_to_aos(&soa, |[a, b, c]| Item { a, b, c });
+/// assert_eq!(back[0].a, 1.0);
+/// assert_eq!(back[1].c, 6.0);
+/// ```
+pub fn soa_to_aos<T, const N: usize, F>(soa: &SoaVec<f32, N>, build: F) -> Vec<T>
+where
+    F: Fn([f32; N]) -> T,
+{
+    let n = soa.len();
+    let fields = soa.all_fields();
+    let mut out = Vec::with_capacity(n);
+    for i in 0..n {
+        let row: [f32; N] = core::array::from_fn(|k| fields[k][i]);
+        out.push(build(row));
+    }
+    out
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    // -------------------------------------------------------------------
+    // SoaVec basics
+    // -------------------------------------------------------------------
+
+    #[test]
+    fn soa_vec_new_smoke() {
+        let soa: SoaVec<f32, 3> = SoaVec::new();
+        assert_eq!(soa.len(), 0);
+        assert!(soa.is_empty());
+        assert_eq!(soa.field(0), &[] as &[f32]);
+        assert_eq!(soa.field(1), &[] as &[f32]);
+        assert_eq!(soa.field(2), &[] as &[f32]);
+    }
+
+    #[test]
+    fn soa_vec_with_capacity_smoke() {
+        let soa: SoaVec<i64, 2> = SoaVec::with_capacity(64);
+        assert!(soa.is_empty());
+        // capacity is not directly observable through the public API,
+        // but constructing without panic is enough for the smoke test.
+    }
+
+    #[test]
+    fn soa_vec_default() {
+        let soa: SoaVec<u32, 4> = SoaVec::default();
+        assert!(soa.is_empty());
+        assert_eq!(soa.len(), 0);
+    }
+
+    #[test]
+    fn soa_vec_push_len_is_empty() {
+        let mut soa: SoaVec<f32, 3> = SoaVec::new();
+        assert!(soa.is_empty());
+        soa.push([1.0, 2.0, 3.0]);
+        assert!(!soa.is_empty());
+        assert_eq!(soa.len(), 1);
+        soa.push([4.0, 5.0, 6.0]);
+        soa.push([7.0, 8.0, 9.0]);
+        assert_eq!(soa.len(), 3);
+    }
+
+    #[test]
+    fn soa_vec_field_in_range() {
+        let mut soa: SoaVec<u32, 3> = SoaVec::new();
+        soa.push([10, 20, 30]);
+        soa.push([40, 50, 60]);
+        assert_eq!(soa.field(0), &[10, 40]);
+        assert_eq!(soa.field(1), &[20, 50]);
+        assert_eq!(soa.field(2), &[30, 60]);
+    }
+
+    #[test]
+    #[should_panic]
+    fn soa_vec_field_out_of_range_panics() {
+        let soa: SoaVec<u32, 3> = SoaVec::new();
+        // index N == 3 is out of range: panics via [Vec; N] bounds.
+        let _ = soa.field(3);
+    }
+
+    #[test]
+    fn soa_vec_field_n_compile_time() {
+        let mut soa: SoaVec<f32, 4> = SoaVec::new();
+        soa.push([1.0, 2.0, 3.0, 4.0]);
+        soa.push([5.0, 6.0, 7.0, 8.0]);
+        assert_eq!(soa.field_n::<0>(), &[1.0, 5.0]);
+        assert_eq!(soa.field_n::<1>(), &[2.0, 6.0]);
+        assert_eq!(soa.field_n::<2>(), &[3.0, 7.0]);
+        assert_eq!(soa.field_n::<3>(), &[4.0, 8.0]);
+    }
+
+    #[test]
+    fn soa_vec_field_mut_mutates() {
+        let mut soa: SoaVec<i32, 2> = SoaVec::new();
+        soa.push([1, 100]);
+        soa.push([2, 200]);
+        {
+            let f0 = soa.field_mut(0);
+            f0[0] = 999;
+        }
+        assert_eq!(soa.field(0), &[999, 2]);
+        // field 1 unaffected.
+        assert_eq!(soa.field(1), &[100, 200]);
+    }
+
+    #[test]
+    fn soa_vec_field_n_mut_compile_time() {
+        let mut soa: SoaVec<i32, 3> = SoaVec::new();
+        soa.push([1, 2, 3]);
+        soa.push([4, 5, 6]);
+        {
+            let f2 = soa.field_n_mut::<2>();
+            f2[1] = -1;
+        }
+        assert_eq!(soa.field_n::<2>(), &[3, -1]);
+    }
+
+    #[test]
+    fn soa_vec_all_fields_in_index_order() {
+        let mut soa: SoaVec<u8, 4> = SoaVec::new();
+        soa.push([1, 2, 3, 4]);
+        soa.push([5, 6, 7, 8]);
+        let fields = soa.all_fields();
+        assert_eq!(fields[0], &[1, 5]);
+        assert_eq!(fields[1], &[2, 6]);
+        assert_eq!(fields[2], &[3, 7]);
+        assert_eq!(fields[3], &[4, 8]);
+    }
+
+    // -------------------------------------------------------------------
+    // SoaVec::chunks
+    // -------------------------------------------------------------------
+
+    #[test]
+    fn soa_vec_chunks_divides_len() {
+        let mut soa: SoaVec<u32, 2> = SoaVec::new();
+        for i in 0..6 {
+            soa.push([i, i * 10]);
+        }
+        let chunks: Vec<[&[u32]; 2]> = soa.chunks(2).collect();
+        assert_eq!(chunks.len(), 3);
+        assert_eq!(chunks[0][0], &[0, 1]);
+        assert_eq!(chunks[0][1], &[0, 10]);
+        assert_eq!(chunks[1][0], &[2, 3]);
+        assert_eq!(chunks[1][1], &[20, 30]);
+        assert_eq!(chunks[2][0], &[4, 5]);
+        assert_eq!(chunks[2][1], &[40, 50]);
+    }
+
+    #[test]
+    fn soa_vec_chunks_does_not_divide_len() {
+        let mut soa: SoaVec<u32, 2> = SoaVec::new();
+        for i in 0..5 {
+            soa.push([i, i * 10]);
+        }
+        let chunks: Vec<[&[u32]; 2]> = soa.chunks(2).collect();
+        assert_eq!(chunks.len(), 3);
+        assert_eq!(chunks[0][0], &[0, 1]);
+        assert_eq!(chunks[1][0], &[2, 3]);
+        // tail chunk is shorter than chunk_len.
+        assert_eq!(chunks[2][0], &[4]);
+        assert_eq!(chunks[2][1], &[40]);
+    }
+
+    #[test]
+    fn soa_vec_chunks_chunk_len_greater_than_len() {
+        let mut soa: SoaVec<u32, 2> = SoaVec::new();
+        soa.push([1, 100]);
+        soa.push([2, 200]);
+        let chunks: Vec<[&[u32]; 2]> = soa.chunks(10).collect();
+        assert_eq!(chunks.len(), 1);
+        assert_eq!(chunks[0][0], &[1, 2]);
+        assert_eq!(chunks[0][1], &[100, 200]);
+    }
+
+    #[test]
+    fn soa_vec_chunks_chunk_len_one() {
+        let mut soa: SoaVec<u32, 2> = SoaVec::new();
+        soa.push([1, 100]);
+        soa.push([2, 200]);
+        soa.push([3, 300]);
+        let chunks: Vec<[&[u32]; 2]> = soa.chunks(1).collect();
+        assert_eq!(chunks.len(), 3);
+        for (i, c) in chunks.iter().enumerate() {
+            assert_eq!(c[0].len(), 1);
+            assert_eq!(c[1].len(), 1);
+            assert_eq!(c[0][0], (i + 1) as u32);
+            assert_eq!(c[1][0], (i + 1) as u32 * 100);
+        }
+    }
+
+    #[test]
+    #[should_panic]
+    fn soa_vec_chunks_chunk_len_zero_panics() {
+        // Mirrors stdlib `slice::chunks(0)`: documented to panic.
+        let mut soa: SoaVec<u32, 2> = SoaVec::new();
+        soa.push([1, 2]);
+        let _ = soa.chunks(0);
+    }
+
+    #[test]
+    fn soa_vec_chunks_on_empty_yields_nothing() {
+        let soa: SoaVec<u32, 2> = SoaVec::new();
+        let chunks: Vec<[&[u32]; 2]> = soa.chunks(4).collect();
+        assert!(chunks.is_empty());
+    }
+
+    // -------------------------------------------------------------------
+    // soa_struct! macro
+    // -------------------------------------------------------------------
+
+    soa_struct! {
+        /// 2-field test struct (private fields).
+        struct Soa2 {
+            a: f32,
+            b: f32,
+        }
+    }
+
+    soa_struct! {
+        /// 3-field test struct with public fields.
+        pub struct Soa3 {
+            pub x: f32,
+            pub y: f32,
+            pub z: f32,
+        }
+    }
+
+    soa_struct! {
+        /// 4-field test struct, mixed visibility, derives Clone.
+        #[derive(Clone)]
+        pub struct Soa4 {
+            pub a: i32,
+            pub b: i32,
+            pub c: i32,
+            pub d: i32,
+        }
+    }
+
+    #[test]
+    fn macro_2_fields_push_len_clear() {
+        let mut s = Soa2::new();
+        assert!(s.is_empty());
+        s.push(1.0, 2.0);
+        s.push(3.0, 4.0);
+        assert_eq!(s.len(), 2);
+        // private fields: not accessible from outer scope, but since the
+        // tests module is inside the same module, we can read them here.
+        assert_eq!(s.a.as_slice(), &[1.0, 3.0]);
+        assert_eq!(s.b.as_slice(), &[2.0, 4.0]);
+        s.clear();
+        assert!(s.is_empty());
+        assert_eq!(s.a.len(), 0);
+        assert_eq!(s.b.len(), 0);
+    }
+
+    #[test]
+    fn macro_3_fields_push_len_clear() {
+        let mut s = Soa3::new();
+        s.push(1.0, 2.0, 3.0);
+        s.push(4.0, 5.0, 6.0);
+        s.push(7.0, 8.0, 9.0);
+        assert_eq!(s.len(), 3);
+        assert_eq!(s.x.as_slice(), &[1.0, 4.0, 7.0]);
+        assert_eq!(s.y.as_slice(), &[2.0, 5.0, 8.0]);
+        assert_eq!(s.z.as_slice(), &[3.0, 6.0, 9.0]);
+        s.clear();
+        assert!(s.is_empty());
+    }
+
+    #[test]
+    fn macro_4_fields_push_len_clear() {
+        let mut s = Soa4::new();
+        s.push(1, 2, 3, 4);
+        s.push(5, 6, 7, 8);
+        assert_eq!(s.len(), 2);
+        assert_eq!(s.a, vec![1, 5]);
+        assert_eq!(s.b, vec![2, 6]);
+        assert_eq!(s.c, vec![3, 7]);
+        assert_eq!(s.d, vec![4, 8]);
+        s.clear();
+        assert!(s.is_empty());
+    }
+
+    #[test]
+    fn macro_default_impl() {
+        let s: Soa3 = Soa3::default();
+        assert!(s.is_empty());
+        assert_eq!(s.len(), 0);
+    }
+
+    #[test]
+    fn macro_with_capacity() {
+        let s = Soa3::with_capacity(32);
+        assert!(s.is_empty());
+        // capacity not directly observable; smoke test only.
+    }
+
+    #[test]
+    fn macro_public_visibility_passthrough() {
+        // Soa3 has `pub` fields; verify the field is accessible
+        // (compilation alone proves visibility).
+        let mut s = Soa3::new();
+        s.x.push(1.0);
+        s.y.push(2.0);
+        s.z.push(3.0);
+        assert_eq!(s.len(), 1);
+    }
+
+    #[test]
+    fn macro_derive_clone_passthrough() {
+        // Soa4 has `#[derive(Clone)]`; this test compiles iff Clone is
+        // actually derived on the generated struct.
+        let mut s = Soa4::new();
+        s.push(1, 2, 3, 4);
+        let cloned = s.clone();
+        assert_eq!(cloned.len(), 1);
+        assert_eq!(cloned.a, vec![1]);
+        assert_eq!(cloned.d, vec![4]);
+    }
+
+    // -------------------------------------------------------------------
+    // aos_to_soa / soa_to_aos
+    // -------------------------------------------------------------------
+
+    #[derive(Clone, PartialEq, Debug)]
+    struct ItemN2 {
+        a: f32,
+        b: f32,
+    }
+
+    #[derive(Clone, PartialEq, Debug)]
+    struct ItemN3 {
+        a: f32,
+        b: f32,
+        c: f32,
+    }
+
+    #[derive(Clone, PartialEq, Debug)]
+    struct ItemN4 {
+        a: f32,
+        b: f32,
+        c: f32,
+        d: f32,
+    }
+
+    #[test]
+    fn aos_to_soa_n2_roundtrip() {
+        let aos = vec![ItemN2 { a: 1.0, b: 2.0 }, ItemN2 { a: 3.0, b: 4.0 }, ItemN2 { a: 5.0, b: 6.0 }];
+        let soa = aos_to_soa::<_, 2, _>(&aos, |it| [it.a, it.b]);
+        assert_eq!(soa.len(), 3);
+        assert_eq!(soa.field(0), &[1.0, 3.0, 5.0]);
+        assert_eq!(soa.field(1), &[2.0, 4.0, 6.0]);
+        let back: Vec<ItemN2> = soa_to_aos(&soa, |[a, b]| ItemN2 { a, b });
+        assert_eq!(back, aos);
+    }
+
+    #[test]
+    fn aos_to_soa_n3_roundtrip() {
+        let aos = vec![ItemN3 { a: 1.0, b: 2.0, c: 3.0 }, ItemN3 { a: 4.0, b: 5.0, c: 6.0 }];
+        let soa = aos_to_soa::<_, 3, _>(&aos, |it| [it.a, it.b, it.c]);
+        assert_eq!(soa.field(0), &[1.0, 4.0]);
+        assert_eq!(soa.field(1), &[2.0, 5.0]);
+        assert_eq!(soa.field(2), &[3.0, 6.0]);
+        let back: Vec<ItemN3> = soa_to_aos(&soa, |[a, b, c]| ItemN3 { a, b, c });
+        assert_eq!(back, aos);
+    }
+
+    #[test]
+    fn aos_to_soa_n4_roundtrip() {
+        let aos = vec![
+            ItemN4 {
+                a: 1.0,
+                b: 2.0,
+                c: 3.0,
+                d: 4.0,
+            },
+            ItemN4 {
+                a: 5.0,
+                b: 6.0,
+                c: 7.0,
+                d: 8.0,
+            },
+            ItemN4 {
+                a: 9.0,
+                b: 10.0,
+                c: 11.0,
+                d: 12.0,
+            },
+        ];
+        let soa = aos_to_soa::<_, 4, _>(&aos, |it| [it.a, it.b, it.c, it.d]);
+        assert_eq!(soa.field(0), &[1.0, 5.0, 9.0]);
+        assert_eq!(soa.field(1), &[2.0, 6.0, 10.0]);
+        assert_eq!(soa.field(2), &[3.0, 7.0, 11.0]);
+        assert_eq!(soa.field(3), &[4.0, 8.0, 12.0]);
+        let back: Vec<ItemN4> = soa_to_aos(&soa, |[a, b, c, d]| ItemN4 { a, b, c, d });
+        assert_eq!(back, aos);
+    }
+
+    #[test]
+    fn aos_to_soa_empty_input() {
+        let aos: Vec<ItemN3> = Vec::new();
+        let soa = aos_to_soa::<_, 3, _>(&aos, |it| [it.a, it.b, it.c]);
+        assert!(soa.is_empty());
+        assert_eq!(soa.field(0), &[] as &[f32]);
+        assert_eq!(soa.field(1), &[] as &[f32]);
+        assert_eq!(soa.field(2), &[] as &[f32]);
+        let back: Vec<ItemN3> = soa_to_aos(&soa, |[a, b, c]| ItemN3 { a, b, c });
+        assert!(back.is_empty());
+    }
+
+    #[test]
+    fn aos_to_soa_closure_captures_external_constant() {
+        // Verifies the `Fn(&T) -> [f32; N]` accepts a closure that
+        // captures an external constant and that the captured value is
+        // applied per row.
+        let scale: f32 = 10.0;
+        let aos = vec![ItemN2 { a: 1.0, b: 2.0 }, ItemN2 { a: 3.0, b: 4.0 }];
+        let soa = aos_to_soa::<_, 2, _>(&aos, |it| [it.a * scale, it.b * scale]);
+        assert_eq!(soa.field(0), &[10.0, 30.0]);
+        assert_eq!(soa.field(1), &[20.0, 40.0]);
+    }
+
+    #[test]
+    fn soa_to_aos_empty() {
+        let soa: SoaVec<f32, 2> = SoaVec::new();
+        let back: Vec<ItemN2> = soa_to_aos(&soa, |[a, b]| ItemN2 { a, b });
+        assert!(back.is_empty());
+    }
+}