Merged
18 commits
2d59517
docs(hhtl): PR-X4 amend — SIMD-bundle contract + Railway smoke gates
claude May 19, 2026
b63c4b1
.claude: widen permissions to cover compound 'cd && X' commands
claude May 19, 2026
2e6fcf7
.claude: planning scaffold for PR-X4 12-agent fan-out
claude May 19, 2026
7fb3492
docs(hhtl): PR-X4 Phase-1 planning briefs (A1-A4 + risk + simd-stub)
claude May 19, 2026
ebf578a
chore(splat4d): PR-X4 anticipatory skeletons — salvage from off-path …
claude May 19, 2026
8e2f8ab
chore: post-cleanup commit after disk crash
claude May 19, 2026
f2a7ded
revert: remove PR-X4 anticipatory salvage (off-path)
claude May 19, 2026
a445eef
chore(pr-x4-planning): drop stub sentinel briefs
claude May 19, 2026
449e73e
draft(pr-x1): carved-out final files (no bodies)
claude May 19, 2026
fa0ebae
impl(pr-x1): fill carved-out bodies + savant fixes + simd re-export
claude May 19, 2026
2a2dfbf
refactor(pr-x1): array_window → array_windows (plural, std convention)
claude May 20, 2026
9a5cb6a
docs(array_windows): clarify non-overlapping semantics vs std::slice:…
claude May 20, 2026
5474d1b
fix(pr-x1): rename array_windows → array_chunks (collision with std)
claude May 20, 2026
8483ae3
refactor(pr-x1): move SIMD primitives to simd_soa.rs
claude May 20, 2026
d64c5e0
refactor(pr-x1): split simd_soa (carriers) / simd_ops (slicing)
claude May 20, 2026
6b52a46
refactor(simd_soa): iterators yield typed lanes via crate::simd::*
claude May 20, 2026
c317041
fix(pr-x1): two P2 codex findings on PR #167
claude May 20, 2026
b0f16b2
ci(pr-x1): fix fmt + clippy/1.95.0 + hpc-stream-parallel/rayon
claude May 20, 2026
129 changes: 108 additions & 21 deletions .claude/knowledge/hhtl-pr-x4-splat-cascade-pre-sprint-prompt.md
@@ -235,7 +235,7 @@ the level-4 fix is the only blocker. **PR-X4 must NOT re-introduce its
own Hilbert-3D encode** — wait for the A12b fix, then consume
`linalg::hilbert::hilbert3d_encode`.

## Schedule slot — W4-W5 (5 workers), no new Q-marker
## Schedule slot — W4-W5 (6 workers), no new Q-marker

The 8-week schedule from `hhtl-substrate-execution-prompt.md`:76-91
already places PR-X4 at W4-W5 alongside PR-X12 (codec, 8 workers):
@@ -245,7 +245,7 @@ already places PR-X4 at W4-W5 alongside PR-X12 (codec, 8 workers):
| W1-W2 | PR-X10 (linalg-core foundation) | 12 |
| **W2.5/W3** | **PR-X1 + PR-X2 (GridLake) + PR-X14′ (contract)** ← Q-NEW-1/Q-NEW-2 pending | 4-10 depending on cell |
| W3 | PR-X11 (jc consolidation) + PR-X13 (OGIT bridge) | 6 + 4 |
| **W4-W5** | **PR-X12 (codec, 8) + PR-X4 (splat cascade, 5)** | **13** |
| **W4-W5** | **PR-X12 (codec, 8) + PR-X4 (splat cascade, 6)** | **14** |
| W6-W7 | PR-X9 (basin-codebook) | 6 |
| W8 | Integration + canary | 3 |

@@ -266,7 +266,7 @@ If GridLake + X14′ slip past W3 (e.g., Cell A-β extension), PR-X4
starts late by the same margin. **No schedule extension owed by
PR-X4 itself.**

## Worker spawn shape (5 workers)
## Worker spawn shape (6 workers)

```
A1: TileInstance v2 + BlockedGrid<SplatBinList, 1, 1> refactor (chain dep)
@@ -279,14 +279,25 @@ A1: TileInstance v2 + BlockedGrid<SplatBinList, 1, 1> refactor (chain dep)
├──→ A4: G2 INT4×32 packed dot (3 backends + parity test)
└──→ A5: G3 NARS truth-revision kernel + G4 fast_exp_x16
precision audit (combined; A5 verdict determines if
precise_exp_x16 follow-up needed)
├──→ A5: G3 NARS truth-revision kernel + G4 fast_exp_x16
│ precision audit (combined; A5 verdict determines if
│ precise_exp_x16 follow-up needed)
└──→ A6: Railway smoke deployment — splat4d::cascade::frame_pipeline
wired to HLS over a minimal axum/warp service, Prom metrics
endpoint, FPS + jitter histogram surfaced in the player UI.
Depends on A1 + A5 only (no cross-deps to A2/A3/A4) so the
banal smoke test can ship even if A12b's L4 Hilbert fix
slips past W3 — A6 exercises L1-L3 cascade and the
composition closure, which is enough to falsify a latency
regression.
```

A1 is the only chain dep. A2-A5 are parallel after A1 lands.
A1 is the only chain dep. A2-A6 are parallel after A1 lands.
A2 has an additional gate dep on PR-X10 A12b's L4 fix landing on
master (not just PR-X10 W2 completion).
master (not just PR-X10 W2 completion). A6 has an additional dep on
A5's composition closure being callable from the pipeline (needed for
SG4); the alpha-only branch of A6 can ship without A5.

## Done criteria

@@ -343,6 +354,60 @@ The sprint is done when ALL of the following hold:
union currently green on master) still passes after the refactor.
`cargo clippy -- -D warnings` clean.

7. **Smoke gates pass on Railway** (see § "Smoke acceptance gates"):
SG1 ≥ 60 fps median, SG2 ≤ 20 ms p95, SG3 zero stutter events
over 10 minutes, SG4 same envelope under the `splat4d-nars-compose`
feature flag. A6 must be deployed and metrics scraped before the
sprint closes.

## Smoke acceptance gates — Railway-hosted video player

The cascade ships not just as a refactor but as a service: a small
Railway-deployed binary that streams a video through the splat4d
pipeline. **Banality is the test.** If the cascade can stream a 1080p
video on a Railway hobby tier without stuttering, every bundle is
honoring its latency contract under sustained load — and any cliff
that hides on a workstation surfaces in the deployment envelope.

### Why this beats synthetic benchmarks (PSNR, throughput-only)

- **PSNR is a number; stuttering is a sensation.** A dropped frame is
unambiguous: you either see it or you don't. PSNR averages across
frames and hides lane-traffic spikes — exactly the pathology bundle
contracts are designed to prevent.
- **Railway adds the deployment envelope for free.** Hobby-tier CPU
caps, real network egress, container memory limits. Anything that
hides on a workstation surfaces here.
- **60 fps × 10 minutes = 36,000 frames, each inside a 16.7 ms budget.**
No batch averaging, no "we'll catch up." The harshest possible test
of per-bundle latency dressed up as the most boring deliverable.

### What ships in A6

- **Service**: minimal Railway-deployed binary (axum or warp), HTML5
player + FPS counter + jitter histogram in the UI
- **Server path**: video frames flow through
`splat4d::cascade::frame_pipeline` — decode → B-Interleave-Transpose
→ cascade L1..L4 (L1-L3 only if A12b slipped) → B-Compose (alpha) →
emit
- **Client**: stock `<video>` element over HLS, no special player
- **Metrics**: median FPS, p95 frame time, stutter events (count of
inter-frame gaps > 33 ms), exposed on a Prom endpoint
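
The three metrics above can be derived from nothing more than per-frame emit timestamps. The following is a minimal sketch, not A6's actual implementation: `SmokeMetrics`, `summarize`, and `percentile` are hypothetical names, "frame time" is assumed to mean the gap between consecutive emits, and median FPS is assumed to be the reciprocal of the median gap.

```rust
/// Summary of one smoke run (hypothetical type, not part of the brief).
#[derive(Debug, Clone, PartialEq)]
pub struct SmokeMetrics {
    pub median_fps: f64,
    pub p95_frame_ms: f64,
    pub stutter_events: usize,
}

/// Inter-frame gaps above this threshold count as a stutter (33 ms, per the brief).
pub const STUTTER_GAP_MS: f64 = 33.0;

/// `emit_ms` holds monotonically increasing frame emit times in milliseconds.
/// Returns `None` when fewer than two frames were observed.
pub fn summarize(emit_ms: &[f64]) -> Option<SmokeMetrics> {
    if emit_ms.len() < 2 {
        return None;
    }
    // One gap per consecutive pair of frames.
    let mut gaps: Vec<f64> = emit_ms.windows(2).map(|w| w[1] - w[0]).collect();
    let stutter_events = gaps.iter().filter(|&&g| g > STUTTER_GAP_MS).count();
    gaps.sort_by(|a, b| a.total_cmp(b));
    Some(SmokeMetrics {
        median_fps: 1000.0 / percentile(&gaps, 0.50),
        p95_frame_ms: percentile(&gaps, 0.95),
        stutter_events,
    })
}

/// Nearest-rank percentile on an ascending-sorted, non-empty slice.
fn percentile(sorted: &[f64], q: f64) -> f64 {
    let rank = (q * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}
```

Nearest-rank is chosen here because it always returns an observed gap, so a single 40 ms frame cannot be averaged away by interpolation.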

### Gates

| Gate | Target | What it proves |
|---|---|---|
| **SG1: median FPS** | ≥ 60 fps for a 1080p H.264 input, 10-minute Big Buck Bunny | Steady-state throughput; the cascade isn't dropping behind |
| **SG2: p95 frame time** | ≤ 20 ms | No bundle is silently decomposing into its constituent ops |
| **SG3: stutter count** | 0 events > 33 ms over 10 minutes | Every bundle honors its latency contract under sustained load |
| **SG4: closure swap** | Same SG1-SG3 envelope with `splat4d-nars-compose` feature on | NARS-revision path has the same latency class as alpha, as designed |

SG4 is conditional on G3 (the NARS truth-revision kernel) being
ULP-correct against the scalar reference; an SG4 failure with G3
passing is evidence that the NARS B-Compose kernel has a worse latency
class than alpha and must be re-staged before the W7 closure swap.
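
For orientation, a scalar reference of the kind G3 would be compared against might look like the sketch below. It assumes the textbook NAL revision rule (evidence weights `w = k·c / (1 − c)`, pooled additively) with evidential horizon `k = 1`; the brief does not state which formula `revise_truth_f32x16` implements, and `Truth`, `K`, and `revise_scalar` are hypothetical names.

```rust
/// A NARS truth value: frequency in [0, 1], confidence in [0, 1).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Truth {
    pub frequency: f32,
    pub confidence: f32,
}

/// Evidential horizon. 1.0 is an assumption, not a value taken from the brief.
pub const K: f32 = 1.0;

/// Scalar revision: pool the evidence behind two independent beliefs.
pub fn revise_scalar(a: Truth, b: Truth) -> Truth {
    // Convert each confidence into an evidence weight: w = k * c / (1 - c).
    let w_a = K * a.confidence / (1.0 - a.confidence);
    let w_b = K * b.confidence / (1.0 - b.confidence);
    let w = w_a + w_b;
    if w == 0.0 {
        // No evidence on either side; 0.5 is an arbitrary convention for this sketch.
        return Truth { frequency: 0.5, confidence: 0.0 };
    }
    Truth {
        // Evidence-weighted mean of the two frequencies.
        frequency: (w_a * a.frequency + w_b * b.frequency) / w,
        // Map pooled evidence back to a confidence: c = w / (w + k).
        confidence: w / (w + K),
    }
}
```

Under this rule the revised confidence is never lower than either input, which gives a cheap sanity property to check alongside a ULP comparison.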

## Forbidden constraints

Five invariants the sprint MUST NOT violate:
@@ -353,11 +418,29 @@ Five invariants the sprint MUST NOT violate:
worker stubs the L4 path (returns `Err(NotReadyL4)`) and ships L1-L3
addressing only.

2. **No `crate::simd::*` extension from inside PR-X4**. Any new SIMD
primitive (e.g., a missing lane width for G2 INT4×32) must be
proposed against `vertical-simd-consumer-contract.md` and land in
ndarray's `src/simd_*.rs` before PR-X4 consumes it. PR-X4 must not
reach for raw `std::arch::*` intrinsics.
2. **PR-X4 consumes — and must not extend — the following SIMD bundles
from `ndarray::simd`.** Each bundle is a fused multi-op transaction
with its own latency budget; reaching past a bundle into raw
`std::arch::*` intrinsics, or proposing new lane primitives without
going through `vertical-simd-consumer-contract.md`, breaks the
contract and re-introduces the bespoke-binner pathology v1 is
leaving behind.

| Bundle | Composition | Cognitive role |
|---|---|---|
| **B-Splat** | `splat_f32x16`, `splat_i32x16` | Broadcast a Gaussian center / NARS truth-value across the 16 tile lanes of a single L_k cell. The identity of a single belief across its support. |
| **B-Gather-FMA** | `gather_idx_f32x16` ∘ `fmadd_f32x16` | Pick up the 16 neighbouring Gaussians of a tile and fuse-multiply-add their contributions in one shot. Evidence-aggregation across siblings. |
| **B-Pack-Dot** | `pack_int4x32` ∘ `dot_i4x32_to_i32` ∘ `dequant_f32` | The INT4×32 packed dot of A4. SH-coefficient evaluation, NARS confidence × frequency products. Three backends (AVX-512 VNNI, NEON UDOT, scalar) with parity tests. |
| **B-Cascade-Permute** | `shuffle_lanes_4x4` ∘ `transpose_16x16` | Cross-tier rotation L_k → L_{k+1}. The 4×4 stride identity made executable — without this bundle the cascade is just a hierarchy of independent grids. |
| **B-Compose** | `hreduce_sum_f32x16` for alpha; `revise_truth_f32x16` for NARS | Closure-swappable horizontal reduction. The `splat4d-nars-compose` feature gate selects which kernel binds; same lane width, same latency class, different algebra. |
| **B-Interleave-Transpose** | `interleave_f32x16` ∘ `transpose_inplace` | Row-major splat3d ↔ lane-major splat4d. Boundary primitive between v1 binner and v2 cascade. |

The forbidden thing is reaching past a bundle into its internal lane
primitives — that breaks the latency contract that the A6 Railway
smoke gates (SG2 p95 ≤ 20 ms, SG3 zero stutter) are designed to
falsify. Missing bundles get proposed against
`vertical-simd-consumer-contract.md` and land in `src/simd_*.rs`
before PR-X4 consumes them; PR-X4 itself never adds a primitive.

3. **No write to lance-graph upstream**. PR-X4 lives entirely in
ndarray (`src/hpc/splat3d_v2/`, `src/hpc/splat4d/`). It consumes
@@ -427,12 +510,16 @@ Five invariants the sprint MUST NOT violate:
PR-X4 promotes splat3d from "bespoke 16×16 tile binner" to "typed
multi-resolution cognitive evolution operator" with the
(4×4)×(4×4)×(4×4)×(4×4) tier scheme as its load-bearing structural
identity. Slots at W4-W5 (5 workers). Consumes GridLake +
identity. Slots at W4-W5 (6 workers). Consumes GridLake +
`lance-graph-contract::column` + PR-X10 A6/A8/A12b + PR-X11 jc Spd3.
Gates on PR-X10 A12b's L4 Hilbert-3D P0-4 fix (`hilbert3d_encode([15,15,15], 4)
→ 2925, expected 4095`). Ships four splat gaps (G1 deg-3 SH inquiry-
direction, G2 INT4×32 packed dot, G3 NARS truth-revision kernel,
G4 fast_exp_x16 precision audit). Alpha-compositing stays default;
NARS-revision composition gated behind `splat4d-nars-compose` feature
flag until W7 closure swap. CTU-mode encoder is PR-X9's deliverable,
not PR-X4's.
Consumes (and does not extend) six SIMD bundles from `ndarray::simd`:
B-Splat, B-Gather-FMA, B-Pack-Dot, B-Cascade-Permute, B-Compose,
B-Interleave-Transpose. Gates on PR-X10 A12b's L4 Hilbert-3D P0-4 fix
(`hilbert3d_encode([15,15,15], 4) → 2925, expected 4095`). Ships four
splat gaps (G1 deg-3 SH inquiry-direction, G2 INT4×32 packed dot, G3
NARS truth-revision kernel, G4 fast_exp_x16 precision audit) and four
Railway smoke gates (SG1 ≥ 60 fps, SG2 p95 ≤ 20 ms, SG3 zero stutter,
SG4 same envelope under NARS feature flag). Alpha-compositing stays
default; NARS-revision composition gated behind `splat4d-nars-compose`
feature flag until W7 closure swap. CTU-mode encoder is PR-X9's
deliverable, not PR-X4's.
104 changes: 104 additions & 0 deletions .claude/knowledge/pr-x4-planning/01-a1-tileinstance-v2-brief.md
@@ -0,0 +1,104 @@
# A1: TileInstance v2 + BlockedGrid<SplatBinList, 1, 1> refactor

Worker A1 of PR-X4 (W4-W5). **Chain dep** for A2-A6 — must land
before any other worker spawns. Owns the structural port from the
bespoke 16×16 binner to the typed BlockedGrid substrate.

## Scope

Lift the existing splat3d binner (`src/hpc/splat3d/{tile,frame,
gaussian,project,raster,sh,spd3,ply,mod}.rs`) into `splat3d_v2/` as a
sibling tree on `BlockedGrid<SplatBinList, 1, 1>` from PR-X3. Same
algorithmic shape (project → bin → sort → rasterize) on the typed
substrate, with the tier field A2 will populate.

## File moves

| v1 path | v2 path |
|-------------------------------|--------------------------------------|
| `splat3d/tile.rs` | `splat3d_v2/tile.rs` |
| `splat3d/frame.rs` | `splat3d_v2/frame.rs` |
| `splat3d/gaussian.rs` | `splat3d_v2/gaussian.rs` |
| `splat3d/project.rs` | `splat3d_v2/project.rs` |
| `splat3d/raster.rs` | `splat3d_v2/raster.rs` |
| `splat3d/sh.rs` | `splat3d_v2/sh.rs` (A3 expands) |
| `splat3d/spd3.rs` | `splat3d_v2/spd3.rs` |
| `splat3d/mod.rs` | `splat3d_v2/mod.rs` |

Side-by-side per pr-x4-design § Q1 — `splat3d/` stays unchanged until
W7 closure swap. Both compile.

## Verbatim struct (pre-sprint lines 95-103)

```rust
#[repr(C, align(16))]
pub struct TileInstance {
pub tier: u8, // 1 = L1, 2 = L2, 3 = L3, 4 = L4
pub _pad: [u8; 3],
pub block_row: u16,
pub block_col: u16,
pub gaussian_id: u32,
pub confidence: f32, // replaces depth — sort key, highest-first
}
```

A1 emits `tier == 1` only. L2-L4 emission is A2's deliverable, so
**A1 is NOT gated on PR-X10 A12b's L4 Hilbert-3D fix.** For the
graphics-compat layer, `confidence = 1.0 / (depth + EPS)` so
highest-first sort recovers front-to-back order under the new key.
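
The depth-to-confidence transform can be sketched as follows. This is an illustration under assumptions: the value of `EPS` is not given in the brief, and `confidence_from_depth`, `tier1`, and `sort_highest_first` are hypothetical helper names. The transform is only order-reversing for non-negative depths.

```rust
/// Mirrors the layout quoted above: 16 bytes, 16-byte aligned.
#[repr(C, align(16))]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TileInstance {
    pub tier: u8,
    pub _pad: [u8; 3],
    pub block_row: u16,
    pub block_col: u16,
    pub gaussian_id: u32,
    pub confidence: f32,
}

/// Hypothetical epsilon; the brief does not give a value for EPS.
pub const EPS: f32 = 1e-6;

/// Graphics-compat transform: nearer splats (smaller depth) get higher confidence.
pub fn confidence_from_depth(depth: f32) -> f32 {
    1.0 / (depth + EPS)
}

/// Build a tier-1 instance, which is all A1 emits.
pub fn tier1(block_row: u16, block_col: u16, gaussian_id: u32, depth: f32) -> TileInstance {
    TileInstance {
        tier: 1,
        _pad: [0; 3],
        block_row,
        block_col,
        gaussian_id,
        confidence: confidence_from_depth(depth),
    }
}

/// Highest confidence first, which is front-to-back for non-negative depths.
pub fn sort_highest_first(instances: &mut [TileInstance]) {
    instances.sort_by(|a, b| b.confidence.total_cmp(&a.confidence));
}
```

Sorting three instances with depths 5.0, 1.0, and 3.0 yields the depth-1.0 instance first, matching the front-to-back order v1 obtains by sorting on depth directly.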

## BlockedGrid<SplatBinList, 1, 1> migration

`SplatBinList` is the per-block payload — `SmallVec<[TileInstance; 8]>`
or equivalent — replacing v1's `Vec<TileInstance> + Vec<u32> prefix`
hand-rolled CSR. The `<1, 1>` block-params mean **1×1 cells per
substrate block**: each tile is its own atomic block. Cascade-tier
striding belongs to A2.

Constructor: `BlockedGrid::<SplatBinList, 1, 1>::with_dims(rows, cols)`,
populated by the two-pass count+emit pattern v1 uses. The packed-u64
radix sort survives unchanged; the prefix-sum CSR is replaced by
`BlockedGrid::iter_blocks()`.
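
For readers unfamiliar with the two-pass count+emit pattern, here is a minimal sketch of the hand-rolled CSR shape that v1 is described as using. It is not the v1 source: `CsrBins` and `bin_two_pass` are hypothetical names, and the payload is reduced to a bare `u32` Gaussian id. In v2 the `offsets` array is what `BlockedGrid` iteration is meant to replace.

```rust
/// CSR bins: entries for tile `t` live in `items[offsets[t]..offsets[t + 1]]`.
pub struct CsrBins {
    pub offsets: Vec<usize>,
    pub items: Vec<u32>,
}

impl CsrBins {
    /// Borrow the Gaussian ids binned into tile `t`.
    pub fn tile(&self, t: usize) -> &[u32] {
        &self.items[self.offsets[t]..self.offsets[t + 1]]
    }
}

/// `pairs` is (tile_id, gaussian_id); every tile_id must be < `n_tiles`.
pub fn bin_two_pass(pairs: &[(usize, u32)], n_tiles: usize) -> CsrBins {
    // Pass 1: count entries per tile, stored one slot to the right.
    let mut offsets = vec![0usize; n_tiles + 1];
    for &(tile, _) in pairs {
        offsets[tile + 1] += 1;
    }
    // Prefix sum turns per-tile counts into start offsets.
    for t in 0..n_tiles {
        offsets[t + 1] += offsets[t];
    }
    // Pass 2: emit each entry at its tile's write cursor.
    let mut cursor = offsets[..n_tiles].to_vec();
    let mut items = vec![0u32; pairs.len()];
    for &(tile, id) in pairs {
        items[cursor[tile]] = id;
        cursor[tile] += 1;
    }
    CsrBins { offsets, items }
}
```

The emit pass preserves input order within each tile, which matters if the subsequent sort is expected to be stable with respect to emission order.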

The PP-13 PR4 P0 boundary-tile fix (`floor + 1` instead of `ceil`) at
`splat3d/tile.rs:241-243` MUST be preserved verbatim in the v2 port
— a regression here silently breaks SG3.
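
The difference between the two bounds only shows up when an extent lands exactly on a tile edge. The sketch below illustrates that case; it is an interpretation, not the code at `splat3d/tile.rs:241-243`, and it assumes a 16-pixel tile, a non-negative pixel coordinate, and an exclusive upper tile index. Both function names are hypothetical.

```rust
/// Tile edge length in pixels (assumed from the 16×16 binner).
pub const TILE: f32 = 16.0;

/// Exclusive upper tile index for an extent ending at `max_px`, via `ceil`.
pub fn upper_tile_ceil(max_px: f32) -> u32 {
    (max_px / TILE).ceil() as u32
}

/// The same bound via `floor + 1`.
pub fn upper_tile_floor_plus_one(max_px: f32) -> u32 {
    (max_px / TILE).floor() as u32 + 1
}
```

For `max_px = 40.0` both return 3. For `max_px = 32.0`, exactly on the edge, `ceil` returns 2 and excludes the tile that starts at pixel 32, while `floor + 1` returns 3 and includes it.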

## SIMD bundles — B-Splat + B-Interleave-Transpose

A1 consumes exactly two bundles:

- **B-Splat** (`splat_f32x16`, `splat_i32x16`): broadcast a Gaussian
center across 16 tile lanes during the bin step.
- **B-Interleave-Transpose** (`interleave_f32x16 ∘ transpose_inplace`):
the row-major splat3d ↔ lane-major splat4d boundary primitive. A1
IS the v1↔v2 boundary, so this is its primary tool.

B-Gather-FMA and B-Cascade-Permute belong to A3 and A2 respectively. A1 must
not reach past either bundle into raw intrinsics; doing so breaks the SG2 p95 gate.

## Parity tests

1. **v1-vs-v2 binner parity**: feed both binners the same 16-Gaussian
fixture (seed `0xA1_B1_NA_RY`), assert emitted `TileInstance`
streams agree on `(tile_id ↔ (block_row, block_col), gaussian_id,
sort order under the depth/confidence transform)`.
2. **Boundary-tile coverage regression**: the PP-13 PR4 P0 case —
3σ-ellipse straddling a tile boundary — must produce identical
per-tile splat counts in v2 as v1.

The 2370-test no-regression line (done-criteria #6) requires the v1
suite to still pass with `splat3d/` untouched.

## Exit criteria — when A2-A6 may spawn

- [ ] `cargo test -p ndarray --lib splat3d_v2::` green
- [ ] v1↔v2 binner parity on the 16-Gaussian fixture
- [ ] Boundary-tile coverage regression passes
- [ ] `cargo clippy -- -D warnings` clean
- [ ] `splat3d_v2::TileInstance` + `BlockedGrid<SplatBinList,1,1>`
exported from `splat3d_v2::mod`
- [ ] A6's `frame_pipeline` skeleton can call into `splat3d_v2`
without depending on A2..A5

No AABB or Hilbert dep, no SH or INT4 dep. A1 is the chain dep gate.
95 changes: 95 additions & 0 deletions .claude/knowledge/pr-x4-planning/02-a2-cascadeaddr-brief.md
@@ -0,0 +1,95 @@
# A2: CascadeAddr + from_position/to_position_center

Worker A2 of PR-X4 (W4-W5). Spawns after A1's TileInstance v2
refactor lands. **Hard gate on PR-X10 A12b's L4 Hilbert-3D fix
landing on master.**

## Gate — PR-X10 A12b L4 fix

Verbatim symptom (pp13-brutally-honest-tester-verdict.md P0-4):
`hilbert3d_encode([15,15,15], 4) → 2925, expected 4095`. A12b must
ship the `NEXT_STATE`/`H_TO_XYZ` re-derivation from Hamilton 2006
Table 2 + `round_trip_level4_exhaustive` (4096 cells × 4 µs ≈ 16 ms)
before A2 starts. **A2 must NOT re-introduce a bespoke Hilbert-3D**
(forbidden constraint #1).

If A12b slips past W3, A2 stubs the L4 path: `Err(NotReadyL4)`,
ships L1-L3 addressing only. `parent()`/`children()` remain
functional since they're pure nibble ops.

## API surface

```rust
pub struct CascadeAddr(u16); // 4 nibbles, one per tier level

impl CascadeAddr {
pub fn level(&self, l: u8) -> u8 { (self.0 >> (l * 4) & 0xF) as u8 }
pub fn parent(&self) -> CascadeAddr { CascadeAddr(self.0 & !0xF000) }
pub fn children(&self) -> [CascadeAddr; 16] { ... }
pub fn from_position(p: Vec3, bbox: AABB, level: u8) -> CascadeAddr {
CascadeAddr(linalg::hilbert::hilbert3d_encode(p_quantised, level) as u16)
}
pub fn to_position_center(&self, bbox: AABB) -> Vec3 { ... }
}
```

The 4-nibble layout: one nibble per L1..L4 tier, 16 children per
parent. `parent()` masks off the L4 nibble. `children()` enumerates
all 16 nibble values at the L4 slot.
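
The pure nibble operations can be made concrete as below. This is a sketch under one assumption the brief implies but does not state: since `parent()` masks `0xF000`, the L4 nibble occupies bits 12..16, so `level(l)` takes a 0-based slot index with slot 3 being L4. `L4_SHIFT` and `L4_MASK` are hypothetical constants, and `from_position` / `to_position_center` are omitted because they depend on A12b's encode.

```rust
/// Four nibbles packed into a u16, one per tier level.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CascadeAddr(pub u16);

/// Bit position and mask of the L4 slot (assumed to be the top nibble).
const L4_SHIFT: u16 = 12;
const L4_MASK: u16 = 0xF000;

impl CascadeAddr {
    /// Nibble at 0-based slot `l` (0 = lowest nibble, 3 = the L4 slot).
    pub fn level(&self, l: u8) -> u8 {
        assert!(l < 4, "only four nibble slots exist");
        ((self.0 >> (u16::from(l) * 4)) & 0xF) as u8
    }

    /// Clear the L4 nibble, leaving the three coarser tiers intact.
    pub fn parent(&self) -> CascadeAddr {
        CascadeAddr(self.0 & !L4_MASK)
    }

    /// Enumerate all 16 values of the L4 nibble under this address's parent.
    pub fn children(&self) -> [CascadeAddr; 16] {
        let base = self.0 & !L4_MASK;
        std::array::from_fn(|i| CascadeAddr(base | ((i as u16) << L4_SHIFT)))
    }
}
```

Note that `addr.children()[i].parent() == addr` holds only when `addr` already has a zero L4 nibble; for a leaf address the children's parent equals `addr.parent()`.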

## AABB quantisation convention

Per `linalg::hilbert::hilbert3d_encode` contract, the input is a
quantised 3-tuple of unsigned ints. At level `k`, each axis has
`2^k` cells:

| level | cells/axis | index range | bits |
|-------|------------|--------------|------|
| 1 | 2 | [0, 8) | 3 |
| 2 | 4 | [0, 64) | 6 |
| 3 | 8 | [0, 512) | 9 |
| 4 | 16 | [0, 4096) | 12 |

L4's 12-bit range fits within 3 of the 4 cascade-addr nibbles. NB:
if A12b's actual encode returns a monolithic per-call index rather
than packed cascade, A2 must call once per tier and assemble nibbles
itself. **Flag this discrepancy with the A12b author at spawn.**

Quantisation: `q = floor((p.x - bbox.min.x) / (bbox.size.x) * (1 << level))`
clamped to `[0, (1 << level) - 1]`. Same for y, z.
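
The quantisation rule can be sketched as follows. `Vec3` and `Aabb` here are minimal stand-ins, not the crate's real types, and the box size is computed as `max - min` rather than read from a `size` field; `quantise_axis` and `quantise` are hypothetical names.

```rust
/// Minimal stand-in for the crate's vector type.
#[derive(Debug, Clone, Copy)]
pub struct Vec3 {
    pub x: f32,
    pub y: f32,
    pub z: f32,
}

/// Minimal stand-in for the crate's axis-aligned bounding box.
#[derive(Debug, Clone, Copy)]
pub struct Aabb {
    pub min: Vec3,
    pub max: Vec3,
}

/// Quantise one coordinate into `[0, (1 << level) - 1]`.
fn quantise_axis(p: f32, lo: f32, hi: f32, level: u8) -> u32 {
    let cells = 1u32 << level;
    let t = (p - lo) / (hi - lo) * cells as f32;
    // `as u32` saturates: negative values and NaN become 0, so only the
    // upper clamp needs to be explicit.
    (t.floor() as u32).min(cells - 1)
}

/// Quantise a position to per-axis cell indices at the given level.
pub fn quantise(p: Vec3, bbox: Aabb, level: u8) -> [u32; 3] {
    [
        quantise_axis(p.x, bbox.min.x, bbox.max.x, level),
        quantise_axis(p.y, bbox.min.y, bbox.max.y, level),
        quantise_axis(p.z, bbox.min.z, bbox.max.z, level),
    ]
}
```

The upper clamp is what keeps a point sitting exactly on `bbox.max` inside the last cell instead of producing an out-of-range index of `1 << level`.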

## Tests

- **Exhaustive level=4 round-trip** (16 cells per axis, 16³ = 4096 cells): for each
of 4096 quantised positions, `decode(encode(p)) == p`. ~16 ms.
- **Exhaustive level=1..3 round-trip**: already pass under current
A12b — just verify under the splat4d call sites.
- **AABB sanity**: the `bbox.min` and `bbox.max` corners quantise to `0` and
`(1 << level) - 1` respectively on each axis.
- **parent/children round-trip**: for any addr,
`addr.children()[i].parent() == addr` for all i.

## SIMD bundle — B-Cascade-Permute

A2 consumes one bundle:

- **B-Cascade-Permute** (`shuffle_lanes_4x4 ∘ transpose_16x16`): the
cross-tier rotation L_k → L_{k+1}. The 4×4 stride identity made
executable. Without this bundle the cascade is just a hierarchy
of independent grids.

A2 must not reach past into raw shuffle intrinsics. If the bundle
primitive is missing in `ndarray::simd`, file a pre-PR-X4 gating
PR against the vertical-simd-consumer-contract before spawning.

## Exit criteria

- [ ] A12b's `hilbert3d_encode([15,15,15], 4) == 4095` and
`round_trip_level4_exhaustive` green on master
- [ ] A2's exhaustive level=4 round-trip green
- [ ] `CascadeAddr::from_position` and `to_position_center`
round-trip on 10K random positions within the unit AABB
- [ ] `parent`/`children` round-trip exhaustive
- [ ] L1-L3 addressing exercisable from A6's frame_pipeline (smoke
gate the cascade addressing layer)
- [ ] `cargo clippy -- -D warnings` clean