
splat3d: CPU-SIMD 3D Gaussian Splatting forward renderer (Kerbl 2023) #153

Merged
AdaWorldAPI merged 15 commits into master from claude/splat3d-cpu-simd-renderer-MAOO0
May 18, 2026
@AdaWorldAPI
Owner

Summary

Lands ndarray::hpc::splat3d — a CPU-SIMD 3D Gaussian Splatting forward
renderer (Zwicker 2001 / Kerbl 2023), gated behind feature = "splat3d".
Pure SIMD via crate::simd::F32x16, no GPU, no wgpu, no new top-level
deps. Mirrors the Pillar-7 EWA-sandwich math certified in
lance-graph/crates/jc/src/ewa_sandwich_3d.rs.

End-to-end: load .ply scene → SoA gaussians → EWA projection →
16×16 tile bin → depth-sorted alpha-blend → RGB framebuffer.

GaussianBatch                Camera
  (μ, scale, quat,            (V, K, near/far,
   opacity, SH)                image dims)
        │                          │
        └───────────┬──────────────┘
                    ▼
             project_batch  ── J·W·Σ·Wᵀ·Jᵀ (EWA-sandwich, Pillar 7)
                    │           depth + 2D conic + 3σ radius + SH→RGB
                    ▼
             ProjectedBatch (SoA)
                    │
                    ▼
             TileBinning   ── 16×16 tile grid, AABB intersection,
                    │         (tile_id, depth)-sorted instance list
                    ▼
             raster_frame  ── per-tile alpha-blend front-to-back,
                    │         F32x16-wide pixels per inner loop
                    ▼
             framebuffer: Vec<f32>  (RGB, length = 3·W·H)

PRs landed (7-step orchestrator loop, plan → review → fix → commit)

| # | Module | Description | Tests |
|---|--------|-------------|-------|
| 1 | spd3.rs | Smith-1961 closed-form 3×3 SPD eigendecomp; pow(t), sqrt, log_spd, from_scale_quat; sandwich(M,N) + sandwich_x16 SIMD batch | 20 |
| 2 | gaussian.rs + sh.rs | GaussianBatch SoA (12 channels padded to PREFERRED_F32_LANES); Gaussian3D AoS constructor; covariance_x16 SIMD-batch; degree-3 SH RGB eval | 19 |
| 3 | project.rs | EWA projection kernel: μ-transform → depth/frustum cull → perspective Jacobian → W·Σ·Wᵀ (3×3 asymmetric) → J·Σ·Jᵀ (2×2) → 2D conic → 3σ radius → SH view-dependent RGB | 12 |
| 4 | tile.rs | 16×16 tile grid; per-gaussian 3σ AABB → tile coverage; packed-u64 (tile_id, depth) sort; prefix-sum offsets for O(1) per-tile slice retrieval | 13 |
| 5 | raster.rs | Per-tile alpha-blend with F32x16 16-pixel-row inner loop; simd_exp for alpha; T-saturation early-out at 1e-4 | 13 |
| 6 | frame.rs | SplatFrame (per-tick state) + SplatRenderer (atomic double-buffer driver, sibling of hpc::renderer::Renderer) | 10 |
| 7 | ply.rs + demo + e2e test | Inria 3DGS canonical PLY reader (62-property layout, sigmoid/exp/normalize activations); examples/splat3d_flex.rs (PPM output, circular camera path, p50/p95/p99 timing); 5-test integration suite | 4 + 5 |

Total tests: 1958 passing (1859 prior + 90 splat3d lib + 5 e2e + 4
PLY = 96 splat3d-related). No regression in existing tests.

Audit discipline

Every PR went through a PP-13 brutally-honest-tester subagent
review before merge, following the autoattended multi-agent pattern at
.claude/EN/CLAUDE-AGENT-PATTERN.md. Real bugs caught:

  • PR 1A: recover_eigvecs returned the SAME unit vector for
    repeated eigenvalues → V rank-deficient → wrong sqrt/log_spd on
    axisymmetric Σ. Fixed by duplicate-detection pass + Gram-Schmidt
    complement. (Same bug independently surfaced by an external reviewer
    against the lance-graph Pillar-7 probe — fixed there on the
    follow-up branch claude/jc-pillar-7-eigvec-duplicate-fix-MAOO0.)
  • PR 1A: is_spd doc said "Cheap" but called full eig();
    promoted to a misuse-prevention rename plus a # Complexity note.
  • PR 1A: bench fixtures used identical inputs across 16 lanes,
    letting the compiler constant-fold the scalar loop and producing
    meaningless SIMD-vs-scalar numbers.
  • PR 2: no analytical-ground-truth SH test → a wrong constant
    would have affected scalar AND SIMD identically. Added
    sh_eval_analytical_ground_truth_at_positive_z pinning the four
    non-zero basis indices.
  • PR 3: Test 7 parity test used identity view (W=I), so the
    W·Σ·Wᵀ sandwich was never exercised with non-trivial W. Added
    analytical project_non_identity_view_rotation_matches_analytical
    (90° Y-rotation, scale = [2, 1, 0.5] → Σ_cam = diag(0.25, 1, 4)).
  • PR 3: project_one_scalar_inner was 102 LoC of dead code.
    Deleted.
  • PR 4: ceil(px_max / TILE_SIZE) under-counted by one tile when
    px_max was an exact multiple of TILE_SIZE → PR 5's pixel-exact
    rasterizer would render one-pixel seams along tile boundaries.
    Fixed by floor + 1.
  • PR 5: bottom-edge height guard was per-scatter, not per-row →
    wasted ~6-8% compute per frame on images with height not a multiple
    of 16. Moved to top of row loop.
  • PR 5: opacity=1.0 / 0.99 clamp never exercised by tests →
    removing the clamp would have vanished every back gaussian. Added
    explicit test.

Net: ~10 P0s caught before merge, 0 P0s shipped.

Architectural invariants honored

  1. ✅ Zero-dep on hot path. No serde, tokio, glam, nalgebra.
  2. ✅ SoA, 64-byte aligned, padded to PREFERRED_F32_LANES (or
    CHUNK_WIDTH=16 for the projection kernel — documented).
  3. ✅ Click P-1 method discipline (frame.tick(), not tick(frame)).
  4. #[repr(C, align(N))] on cross-FFI structs.
  5. ✅ Per-tier SIMD via crate::simd::F32x16 only. No raw intrinsics.
  6. ✅ Module docs lead with the math + paper citation.
  7. ✅ The cognitive lance_graph_contract::splat is untouched; this
    ndarray::hpc::splat3d is the graphics sibling.

Bench baseline (benches/RESULTS.md)

Hardware: Intel Xeon Sapphire Rapids, AVX-512F+BW+VL+VNNI+BF16, 2.1 GHz.

| Bench | Default build | target-cpu=native (AVX-512) |
|-------|---------------|-----------------------------|
| spd3_sandwich_scalar_x16_loop | 210 ns | 166 ns |
| spd3_sandwich_simd_x16 | 1226 ns (0.17×) | 90 ns (1.83×) |
| spd3_eig_smith_1961 | 131 ns | 126 ns |
| spd3_from_scale_quat | 11 ns | 9 ns |

The 1.83× sandwich SIMD speedup is below the sprint's 10× aspiration —
the 6-field-per-Spd3 AoS↔SoA transpose dominates the microbench. The
rasterizer hot path keeps data SoA-native at the call site, so the
amortized speedup downstream is higher. TECH_DEBT filed for the
performance sprint: SoA-input variant + AVX2-tier runtime dispatch.
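The transpose cost described above can be pictured with a minimal scalar sketch. `Sym3` and `Sym3x16` are hypothetical stand-ins for the crate's `Spd3` and its 16-lane staging layout, not the actual types; the point is only that every batch call pays 6 × 16 scattered loads and stores on each side of the arithmetic.

```rust
/// Hypothetical AoS record: the six unique entries of a symmetric 3x3 matrix.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Sym3 {
    pub a11: f32,
    pub a12: f32,
    pub a13: f32,
    pub a22: f32,
    pub a23: f32,
    pub a33: f32,
}

/// Hypothetical SoA staging buffer: one 16-lane array per field.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Sym3x16 {
    pub a11: [f32; 16],
    pub a12: [f32; 16],
    pub a13: [f32; 16],
    pub a22: [f32; 16],
    pub a23: [f32; 16],
    pub a33: [f32; 16],
}

/// AoS -> SoA: gather field `f` of matrix `i` into lane `i` of array `f`.
pub fn aos_to_soa(input: &[Sym3; 16]) -> Sym3x16 {
    Sym3x16 {
        a11: std::array::from_fn(|i| input[i].a11),
        a12: std::array::from_fn(|i| input[i].a12),
        a13: std::array::from_fn(|i| input[i].a13),
        a22: std::array::from_fn(|i| input[i].a22),
        a23: std::array::from_fn(|i| input[i].a23),
        a33: std::array::from_fn(|i| input[i].a33),
    }
}

/// SoA -> AoS: scatter lane `i` of every field array back into matrix `i`.
pub fn soa_to_aos(soa: &Sym3x16) -> [Sym3; 16] {
    std::array::from_fn(|i| Sym3 {
        a11: soa.a11[i],
        a12: soa.a12[i],
        a13: soa.a13[i],
        a22: soa.a22[i],
        a23: soa.a23[i],
        a33: soa.a33[i],
    })
}
```

A caller that already stores its data as six lane arrays skips both functions entirely, which is the motivation for the SoA-input variant filed as TECH_DEBT.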

Smoke test of examples/splat3d_flex on the AVX2-emulated default
build, 1000-gaussian synthetic cube at 256×256:
p50 = 133 ms, p95 = 146 ms, p99 = 146 ms, 7.5 fps

Test plan

  • cargo test --features splat3d — green
  • cargo test --features splat3d --test splat3d_correctness — 5
    e2e tests green
  • cargo check --features splat3d --benches — clean
  • cargo check --features splat3d --example splat3d_flex — clean
  • cargo run --release --features splat3d --example splat3d_flex -- --frames 5 --width 256 --height 256 — emits 1 PPM frame at
    /tmp/splat3d_render/frame_0000.ppm
  • No regression: cargo test --lib shows the prior 1859 tests
    still passing
  • --no-default-features build (pre-existing breakage on
    merkle_tree.rs, not introduced by this PR)

Deferred to follow-up sprints

  • Inria bicycle .ply SSIM-vs-reference-CUDA comparison (asset not in
    this remote container).
  • 1080p × 500K real-data benchmark (same).
  • PNG output via image/png crate (gated on the no-new-deps
    invariant; PPM works for v1).
  • Performance: AVX2-tier SIMD path; tile-binner radix sort; rayon-
    parallel rasterize_frame.
  • Backward pass / training pipeline.
  • lance-graph-render consumer crate (sprint's "after the sprint"
    follow-on).

Companion lance-graph PR

The Pillar-7 EWA-sandwich-3D probe was merged via lance-graph PR
#403 (commit 70866b6). A follow-up fix for the same repeated-
eigenvalue bug class is on
claude/jc-pillar-7-eigvec-duplicate-fix-MAOO0 (commit 257f541,
not yet PR'd — happy to open separately on request).

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41


Generated by Claude Code

claude added 13 commits May 18, 2026 01:29
Lands the math foundation for the CPU-SIMD 3D Gaussian Splatting
renderer behind the new `splat3d` feature. Pure SIMD via the existing
`crate::simd::F32x16` polyfill — no GPU, no wgpu, no new top-level
deps. Sibling slice (Pillar-7 probe certifying the math) ships in
parallel in `lance-graph/crates/jc/src/ewa_sandwich_3d.rs`.

Module surface (`src/hpc/splat3d/`):

- `mod.rs` — doc-first entry: math + pipeline + architectural
  invariants, declares `spd3` and re-exports `Spd3`, `sandwich`,
  `sandwich_x16`. Subsequent PRs (gaussian, sh, project, tile,
  raster, frame) will fill the remaining slots.
- `spd3.rs` — symmetric 3×3 SPD storage (`#[repr(C, align(32))]`,
  24 B payload + 8 B pad = 32 B; two per cache line). Smith 1961
  closed-form eigendecomp (no Jacobi, no QR — branchless with
  diagonal fast path). Eigenvector recovery via row-pair cross
  product + Gram-Schmidt fallback for degenerate eigenspaces.
  `pow(t)`, `sqrt`, `log_spd` via spectral lift. `from_scale_quat`
  builds the 3DGS canonical Σ = R·diag(s²)·Rᵀ. `sandwich(M, N)`
  computes M·N·Mᵀ for symmetric M, N with off-diagonal averaging
  to suppress f32 rounding asymmetry; `sandwich_x16` runs the
  same op 16-wide via `F32x16` on AVX-512/AVX2/NEON/scalar
  (compile-time dispatch via the polyfill).
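As an illustration of the Σ = R·diag(s²)·Rᵀ builder named above, here is a minimal scalar sketch. The function names, the dense `[[f32; 3]; 3]` storage, and the (w, x, y, z) quaternion order are assumptions made for the example; the shipped `Spd3::from_scale_quat` stores only the upper triangle and may differ in detail.

```rust
/// Row-major dense 3x3 matrix (the real Spd3 keeps only 6 unique entries).
pub type Mat3 = [[f32; 3]; 3];

/// Rotation matrix from a quaternion in (w, x, y, z) order (assumed order).
/// The quaternion is normalized first so non-unit input still yields a rotation.
pub fn quat_to_mat3(q: [f32; 4]) -> Mat3 {
    let n = (q[0] * q[0] + q[1] * q[1] + q[2] * q[2] + q[3] * q[3]).sqrt();
    let (w, x, y, z) = (q[0] / n, q[1] / n, q[2] / n, q[3] / n);
    [
        [1.0 - 2.0 * (y * y + z * z), 2.0 * (x * y - w * z), 2.0 * (x * z + w * y)],
        [2.0 * (x * y + w * z), 1.0 - 2.0 * (x * x + z * z), 2.0 * (y * z - w * x)],
        [2.0 * (x * z - w * y), 2.0 * (y * z + w * x), 1.0 - 2.0 * (x * x + y * y)],
    ]
}

/// Sigma = R * diag(s^2) * R^T, written as a sum over the three scaled axes.
pub fn cov_from_scale_quat(scale: [f32; 3], quat: [f32; 4]) -> Mat3 {
    let r = quat_to_mat3(quat);
    let mut out = [[0.0f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            for k in 0..3 {
                out[i][j] += r[i][k] * scale[k] * scale[k] * r[j][k];
            }
        }
    }
    out
}
```

With the identity quaternion the result is diag(s²); a 90° rotation about Z swaps the first two diagonal entries, which is the kind of axis permutation the analytical tests pin down.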

Math reference: Smith 1961, "Eigenvalues of a symmetric 3×3 matrix",
Communications of the ACM 4(4):168.

Tests (13 passing):
- size_alignment_invariants (size_of==32, align_of==32)
- identity_round_trip, diagonal_fast_path
- eigenvalues_sorted_descending (200 randomized SPD inputs)
- from_scale_quat_identity_rotation_gives_diag_scale_sq
- from_scale_quat_yields_spd (100 trials)
- sqrt_squared_equals_original (100 trials, sandwich(sqrt(Σ), I) ≈ Σ)
- pow_one_is_identity_op (50 trials)
- log_of_identity_is_zero
- sandwich_identity_is_input, sandwich_preserves_spd (200 trials)
- sandwich_x16_matches_scalar_loop (16-lane SIMD parity vs scalar)
- determinant_matches_product_of_eigenvalues (100 trials, det == λ₁λ₂λ₃)

Bench (`benches/splat3d_bench.rs`, gated `required-features = ["splat3d"]`):
- spd3_sandwich_scalar_x16_loop vs spd3_sandwich_simd_x16
  (scalar loop baseline; SIMD batch path on the renderer hot loop)
- spd3_eig_smith_1961 (eigendecomp throughput)
- spd3_from_scale_quat (3DGS canonical builder)

Acceptance:
  cargo test --features splat3d --lib hpc::splat3d → 13 passed
  cargo check --features splat3d --lib            → clean
  cargo check --features splat3d --benches        → clean

A PP-13 brutally-honest-tester audit is running in parallel; any P0
findings will land as a fix commit on this branch before PR 2 starts.

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
… fidelity

Folds the PP-13 brutally-honest-tester audit findings against
f570b7b. Two P0s + one promoted-to-P0 finding addressed, plus four
P1 coverage gaps the audit called out as latent-bug risks.

## Real bug found (not in PP-13's P0 list — surfaced by adding the
   test PP-13 recommended)

`recover_eigvecs` mis-handled repeated eigenvalues: when λ₁ = λ₂,
both `null_space_vec` calls returned the SAME unit vector (the
preferred direction picked by the cross-product tiebreak), so the
eigenvector matrix ended up rank-deficient and the closing
Gram-Schmidt pass collapsed one column to noise. Reconstruction
Σ = V·diag(λ)·Vᵀ then drifted by ~5% on a 30° rotation of
diag(2, 2, 1). Fix: after the first pass, detect column pairs with
|cos θ| > 0.99 and demote the later column to the Gram-Schmidt-
complement path — any orthogonal completion spans the degenerate
eigenspace equally well, so the reconstruction is invariant.
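The duplicate-detection idea can be sketched as follows. This is a simplified illustration, not the code in `recover_eigvecs`: it only handles a duplicate between columns 0 and 1 and assumes column 2 is a valid eigenvector for the distinct eigenvalue. All helper names are hypothetical.

```rust
type Vec3 = [f32; 3];

fn dot(a: Vec3, b: Vec3) -> f32 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}

fn cross(a: Vec3, b: Vec3) -> Vec3 {
    [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]
}

fn normalize(a: Vec3) -> Vec3 {
    let n = dot(a, a).sqrt();
    [a[0] / n, a[1] / n, a[2] / n]
}

/// If eigenvectors 0 and 1 are nearly (anti)parallel (|cos theta| > 0.99),
/// replace vector 1 with the unit vector orthogonal to vectors 0 and 2.
/// Any such completion lies in the degenerate eigenspace.
pub fn repair_duplicate(mut v: [Vec3; 3]) -> [Vec3; 3] {
    if dot(v[0], v[1]).abs() > 0.99 {
        v[1] = normalize(cross(v[2], v[0]));
    }
    v
}

/// Sigma = sum_k lambda_k * v_k * v_k^T.
pub fn reconstruct(v: [Vec3; 3], lambda: [f32; 3]) -> [[f32; 3]; 3] {
    let mut out = [[0.0f32; 3]; 3];
    for k in 0..3 {
        for i in 0..3 {
            for j in 0..3 {
                out[i][j] += lambda[k] * v[k][i] * v[k][j];
            }
        }
    }
    out
}
```

For diag(2, 2, 1) with the first two eigenvectors both reported as (1, 0, 0), the unrepaired reconstruction yields diag(4, 0, 1); after the repair it returns diag(2, 2, 1).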

The pre-existing 13 tests did not exercise this path because every
randomized SPD sample had distinct eigenvalues. The new
`eig_degenerate_eigenspace_via_rotated_diag` test reproduces the
failure with a deterministic input.

## PP-13 P0 fixes

- `Spd3::is_spd` doc: "Cheap SPD predicate" was inverted — the
  Sylvester-criterion short-circuit IS cheap, but the post-condition
  `Spd3::eig` call dominates the runtime on the SPD-passing common
  case. Renamed to "Exact SPD predicate" + added a `# Complexity`
  note warning against per-pixel use.
- `benches/splat3d_bench.rs`: scalar and SIMD fixtures used
  `[m; 16]` / `[n; 16]` (identical-input arrays) — the compiler
  could fold the scalar 16-iter loop into one `sandwich` × 16,
  making the SIMD-vs-scalar comparison meaningless. Replaced with
  `build_distinct_pairs()` producing 16 differing (scale, quat)
  pairs across two rotation axis families so the SoA transpose
  actually has varying lane inputs.
- `benches/RESULTS.md`: created the stub regression-gate file
  referenced by the bench module-doc and the PR checklist;
  populated with the four PR-1 bench rows and TBD baseline cells.

## PP-13 P1 promotions (cheap + high-value, landed now)

- `from_scale_quat_90deg_{x,y,z}_rotation_permutes_axes` — three
  analytical ground-truth tests for the quaternion-to-rotation-matrix
  formula. Each rotation hits a different cross-term family
  (`wx` / `wy` / `wz`), so a sign flip in any one of them would
  fail at least one of the three tests. PP-13 called this gap out
  as the largest residual bug risk in the original 13 tests.
- `is_spd_rejects_non_spd` — negative-case coverage: negative
  diagonal entry (fails 1×1 minor), oversized off-diagonal (fails
  2×2 minor), negative determinant (fails 3×3 minor), zero matrix
  (eigenvalues zero).
- `pow_two_inverts_sqrt` — `Σ.sqrt().pow(2.0) ≈ Σ` composition test;
  exercises the `pow(t)` general path with `t = 2`, not the dedicated
  `sqrt` shim.
- `log_spd_diagonal_matches_log_of_eigenvalues` — directly verifies
  the spectral lift for diagonal SPD, hitting the eigendecomp's
  fast path so any bug in `reconstruct_symm` is caught even when
  eigenvector recovery is trivially the identity.
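The three leading-minor checks referenced by `is_spd_rejects_non_spd` correspond to Sylvester's criterion. The sketch below shows only that cheap part, under an assumed `[a11, a12, a13, a22, a23, a33]` storage order; it does not reproduce the eigenvalue post-condition that the real `Spd3::is_spd` also runs.

```rust
/// Sylvester's criterion for a symmetric 3x3 matrix stored as
/// [a11, a12, a13, a22, a23, a33] (storage order is an assumption).
/// Returns true only if all three leading principal minors are strictly positive.
pub fn passes_sylvester(m: [f32; 6]) -> bool {
    let [a11, a12, a13, a22, a23, a33] = m;
    // 1x1 leading minor.
    let minor1 = a11;
    // 2x2 leading minor.
    let minor2 = a11 * a22 - a12 * a12;
    // 3x3 determinant, cofactor expansion along the first row.
    let det = a11 * (a22 * a33 - a23 * a23)
        - a12 * (a12 * a33 - a23 * a13)
        + a13 * (a12 * a23 - a22 * a13);
    minor1 > 0.0 && minor2 > 0.0 && det > 0.0
}
```

Each negative case in the test list trips a different minor, and the zero matrix fails at the first one because the comparisons are strict.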

## P1 deferred (TECH_DEBT)

- `Spd3::exp_spd` API for log/exp roundtrip — not in PR 1 spec; the
  Pillar-7 probe doesn't need it. Add when PR 6 (training/backward)
  surfaces a real consumer.
- Ill-conditioned-matrix coverage (eigenvalues spanning many orders
  of magnitude) — defer to PR 5 acceptance, where the reference
  Inria scene exercises real-world conditioning.

## Test count

  cargo test --features splat3d --lib hpc::splat3d
    → 20 passed; 0 failed  (was 13 in f570b7b)

  cargo check --features splat3d --benches --bench splat3d_bench
    → clean

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
…e C)

- GaussianBatch: SoA layout, all 12 channels padded to
  PREFERRED_F32_LANES (mirror of RenderFrame). 59 floats per
  gaussian (3 mean + 3 scale + 4 quat + 1 opacity + 48 SH)
  = 236 B; 500K gaussians ≈ 118 MB, which is at or beyond the L3
  capacity of most server CPUs.
- Gaussian3D convenience constructor for tests/demos.
- covariance(i): delegates to Spd3::from_scale_quat for one
  gaussian.
- covariance_x16(start, out): SIMD batch via F32x16 — SoA
  transposes 7 input lanes, computes R = quat→matrix + Σ = R·diag(s²)·Rᵀ
  in lockstep, scatters upper-triangle output to [Spd3; 16].
- 8 tests: padding invariant, push/clear, panic-at-capacity,
  unit-quat → diag(s²) ground truth, 90° Y-rotation delegation
  check, covariance_x16 == scalar loop parity.

Acceptance:
  cargo test --features splat3d --lib hpc::splat3d::gaussian → 8 passed

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
- sh_eval_deg3: scalar reference; 16 basis × 3 channel dot-product
  + Inria +0.5 offset + [0, 1] clamp. 48-float coefficient layout
  matches GaussianBatch::sh (gaussian-major, channel-major).
- sh_eval_deg3_x16: SIMD batch via F32x16 — three RGB accumulators
  per gaussian, lane = gaussian index; one mul_add per (basis,
  channel) over the 16 basis functions. AVX-512 native 16-wide,
  AVX2 2×8 emulation, NEON 4×4, scalar fallback all share the
  polyfill API.
- 7 tests: deg-0 constancy, zero-coeff = 0.5 background, view-
  dependent change with non-zero deg-1 coeff, [0,1] clamp, x16 vs
  scalar parity, constant-input lane invariance, SH_C0
  normalization sanity.

Acceptance:
  cargo test --features splat3d --lib hpc::splat3d::sh → 7 passed

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
… offset coverage

Folds the PP-13 brutally-honest-tester audit findings against
231e2f3 + f9e4487. Zero P0 bugs surfaced — but four P1 coverage
gaps logged, three promoted to "land now" per the rule from PR 1
(catch correlated-bug classes that the scalar↔SIMD parity tests
miss). One doc-only fix.

## P1 → P0 promotions (closes correlated-bug holes)

### sh.rs: analytical ground-truth test at d = (0, 0, 1)

The seven prior sh tests all compare scalar vs SIMD or check
degenerate inputs (zero coeffs, clamp behavior, normalization
constant ratio). A WRONG SH CONSTANT — sign flip on one of the 14
SH_C* entries, or a magnitude typo in the 16th decimal — would
affect scalar AND SIMD identically and pass every existing test.
That's the bug class PP-13 flagged as the biggest residual risk.

Fix: `sh_eval_analytical_ground_truth_at_positive_z` pins basis
outputs to closed-form values:
  - At d=(0,0,1), basis k ∈ {0, 2, 6, 12} produce non-zero values
    exactly equal to SH_C0, SH_C1, SH_C2[2]·2, SH_C3[3]·2 — so a
    single-coefficient test isolates one constant at a time.
  - The other 12 basis indices must vanish at d=(0,0,1) (all carry
    x or y factors), so a sign error that creates spurious value
    at the wrong basis is also caught.
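A sketch of the basis evaluation this test pins down, assuming the real-valued degree-3 SH convention and constants of the reference Inria 3DGS code. The constants are reproduced from memory of that reference and the function name is hypothetical; neither is taken from `sh.rs`.

```rust
// Normalization constants for real SH up to degree 3 (assumed to match the
// reference Inria implementation; verify against sh.rs before relying on them).
pub const SH_C0: f32 = 0.28209479177387814;
pub const SH_C1: f32 = 0.4886025119029199;
pub const SH_C2: [f32; 5] = [
    1.0925484305920792,
    -1.0925484305920792,
    0.31539156525252005,
    -1.0925484305920792,
    0.5462742152960396,
];
pub const SH_C3: [f32; 7] = [
    -0.5900435899266435,
    2.890611442640554,
    -0.4570457994644658,
    0.3731763325901154,
    -0.4570457994644658,
    1.445305721320277,
    -0.5900435899266435,
];

/// The 16 basis values at unit direction d = (x, y, z).
pub fn sh_basis_deg3(d: [f32; 3]) -> [f32; 16] {
    let [x, y, z] = d;
    let (xx, yy, zz) = (x * x, y * y, z * z);
    [
        SH_C0,
        -SH_C1 * y,
        SH_C1 * z,
        -SH_C1 * x,
        SH_C2[0] * x * y,
        SH_C2[1] * y * z,
        SH_C2[2] * (2.0 * zz - xx - yy),
        SH_C2[3] * x * z,
        SH_C2[4] * (xx - yy),
        SH_C3[0] * y * (3.0 * xx - yy),
        SH_C3[1] * x * y * z,
        SH_C3[2] * y * (4.0 * zz - xx - yy),
        SH_C3[3] * z * (2.0 * zz - 3.0 * xx - 3.0 * yy),
        SH_C3[4] * x * (4.0 * zz - xx - yy),
        SH_C3[5] * z * (xx - yy),
        SH_C3[6] * x * (xx - 3.0 * yy),
    ]
}
```

At d = (0, 0, 1) every term carrying an x or y factor vanishes, leaving indices 0, 2, 6, and 12, which is why a single-coefficient test at that direction isolates one constant at a time.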

### gaussian.rs: covariance_x16 with start > 0

`covariance_x16_matches_scalar_loop` always uses start=0. Any
off-by-one in `self.quat_w[start..start+16]` slice arithmetic
would be invisible (constant offset of 0 collapses to identity).

Fix: `covariance_x16_with_nonzero_start_matches_scalar` pushes 32
gaussians and walks `covariance_x16(16, ...)` so each input index
`16+k` differs from lane index `k`.

### gaussian.rs: SH round-trip through SoA

No existing test bridged the `GaussianBatch::push` SH copy with
`sh::sh_eval_deg3`. A bug in `SH_COEFFS_PER_GAUSSIAN` definition
(off by some multiple of 16) or in `push`'s SH-block memcpy offset
would silently corrupt color and only surface in PR 5's rasterizer
output diff.

Fix: `push_then_sh_eval_round_trips_through_soa` pushes 5 unit
gaussians + 1 with a known DC coefficient + a coefficient at the
LAST SH slot (sh[47]), reads the SoA span back directly to verify
slot-by-slot survival, and then runs `sh_eval_deg3` against the
SoA-derived slice to confirm the analytical RGB.

## P1 → doc-only fix (no test added)

### gaussian.rs::covariance_x16 doc precondition

The fn's bound is on `capacity`, not `len`. Lanes ≥ len have
zero-norm quats → degenerate zero matrix that is NOT SPD.
Downstream consumers (PR 3 `project_batch`) must mask. Added a
`# Precondition on padded lanes` block to the doc comment
explaining the contract + pointing at `ProjectedBatch::valid`
(PR 3) as the canonical masking site.

## Test count

  cargo test --features splat3d --lib hpc::splat3d
    → 38 passed; 0 failed  (was 35: +3 tests, all green first try)

  cargo check --features splat3d --benches --bench splat3d_bench
    → clean

## Deferred to TECH_DEBT (low-value vs cost)

- `Spd3::exp_spd` API (PR 6 deferred per PR 1 fix commit).
- Ill-conditioned-matrix coverage (deferred to PR 5 with real Inria scene).

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
The math heat of the splat3d sprint, certified by the Pillar-7 probe
in jc::ewa_sandwich_3d. Per-gaussian forward kernel:

1. μ_cam = V·μ_world (camera transform), depth + frustum cull
2. screen_xy = (fx · μ_cam.x / z + cx, fy · μ_cam.y / z + cy)
3. Perspective Jacobian J ∈ ℝ^{2×3} at μ_cam
4. Σ_cam   = W · Σ_world · Wᵀ  (3×3 asymmetric W — NOT spd3::sandwich)
5. Σ_image = J · Σ_cam · Jᵀ    (2×2, symmetric by construction)
6. ½-pixel anti-aliasing dilation (+0.3 on the diagonals)
7. 2D conic = inv(Σ_image), 3σ screen radius, on-screen cull
8. View direction → sh_eval_deg3 → view-dependent RGB
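Steps 3 and 5 to 7 can be sketched in scalar form as below. The struct and function names are hypothetical, the radius is left unrounded, and the on-screen cull is omitted; the shipped kernel runs the same algebra 16-wide through `F32x16`.

```rust
/// Inverse 2D covariance (conic) plus the 3-sigma screen radius.
#[derive(Clone, Copy, Debug)]
pub struct Conic2D {
    pub a: f32,
    pub b: f32,
    pub c: f32,
    pub radius: f32,
}

/// Projects a camera-space covariance to screen space.
/// Returns None for points at or behind the camera or a degenerate 2D covariance.
pub fn project_cov(
    sigma_cam: [[f32; 3]; 3],
    mu_cam: [f32; 3],
    fx: f32,
    fy: f32,
) -> Option<Conic2D> {
    let [x, y, z] = mu_cam;
    if z <= 0.0 {
        return None;
    }
    // Perspective Jacobian of (fx*x/z, fy*y/z) with respect to (x, y, z).
    let j = [
        [fx / z, 0.0, -fx * x / (z * z)],
        [0.0, fy / z, -fy * y / (z * z)],
    ];
    // Sigma_image = J * Sigma_cam * J^T (2x2).
    let mut s = [[0.0f32; 2]; 2];
    for r in 0..2 {
        for q in 0..2 {
            for k in 0..3 {
                for l in 0..3 {
                    s[r][q] += j[r][k] * sigma_cam[k][l] * j[q][l];
                }
            }
        }
    }
    // Anti-aliasing dilation: +0.3 on the diagonal.
    let (sa, sb, sc) = (s[0][0] + 0.3, s[0][1], s[1][1] + 0.3);
    let det = sa * sc - sb * sb;
    if det <= 0.0 {
        return None;
    }
    // Conic = inverse of [[sa, sb], [sb, sc]].
    let (a, b, c) = (sc / det, -sb / det, sa / det);
    // Larger eigenvalue of the dilated 2x2 covariance gives the 3-sigma radius.
    let mid = 0.5 * (sa + sc);
    let lambda_max = mid + (mid * mid - det).max(0.0).sqrt();
    Some(Conic2D { a, b, c, radius: 3.0 * lambda_max.sqrt() })
}
```

For an on-axis gaussian (x = y = 0) the Jacobian is diagonal, so a diagonal Σ_cam stays diagonal on screen and the conic's cross term is zero.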

Surface:
- Camera (pinhole, row-major view matrix, focal + principal point,
  near/far, image dims, world-space camera origin)
- ProjectedBatch SoA: screen_x/y, depth, conic_a/b/c, radius,
  color_r/g/b, opacity, valid mask
- project_batch(gaussians, camera, &mut projected) — outer driver
- project_chunk_x16 — F32x16 SIMD inner loop, 16 gaussians/step via
  Chunk16 staging buffer (tier-portable: works on AVX-512/AVX2/NEON)

Conic + depth + radius math goes through F32x16; SH eval stays
scalar (16 distinct view directions defeats SH SIMD batch).

Tests (10):
- screen-center landing at unit depth, near/far cull, off-screen
  cull, conic-is-SPD, x16-vs-scalar parity, radius scales with
  covariance, SH view-dir delegation, identity-camera sanity,
  clear() resets len + valid.

Acceptance:
  cargo test --features splat3d --lib hpc::splat3d::project → 10 passed
  cargo test --features splat3d --lib hpc::splat3d → 48 passed (38 + 10)

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
…ad scalar fn

Folds the PP-13 brutally-honest-tester findings against a00ec09 (PR 3).
Both P0s addressed; two P1s promoted to "land now" per the rule from
PR 1 (close correlated-bug holes the SIMD-parity tests miss).

## P0.1 — Analytical ground truth for non-trivial W

Tests 2-10 all use `Camera::identity_at_origin` (W=I₃ in the
upper-left 3×3 of the view matrix), so the W·Σ·Wᵀ sandwich is
trivially Σ on every existing test. A sign error in the SIMD
`sc12/sc13/sc23` cross-term accumulators in `project_chunk_x16`
would produce wrong projected ellipses for any rotated camera
while passing all 48 tests.

Fix: `project_non_identity_view_rotation_matches_analytical` pins
the W·Σ·Wᵀ output to a closed-form value:
  - View = R_y(90°), gaussian at world (-5, 0, 0) → camera-frame
    position (0, 0, 5) at depth 5.
  - scale = [2, 1, 0.5] ⇒ Σ_world = diag(4, 1, 0.25).
  - Analytical Σ_cam = R_y(90°)·diag(4,1,0.25)·R_y(90°)ᵀ
                     = diag(0.25, 1, 4)  (axes permuted by rotation).
  - J at z=5: [[fx/5, 0, 0], [0, fy/5, 0]] (offdiag vanish since
    cam_x = cam_y = 0 by construction).
  - Σ_img = diag((fx/5)²·0.25, (fy/5)²·1) = diag(fx²/100, fy²/25).
  - conic_a, conic_b=0, conic_c computed against this analytical
    Σ_img after the +0.3 AA dilation; tolerance 1e-6 absolute.
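The first three bullets of this worked example can be checked with a small dense sketch. `sandwich_general` and `transform_point` are hypothetical names for the illustration; the shipped kernel computes the same W·Σ·Wᵀ product in SIMD with separate cross-term accumulators.

```rust
/// Row-major dense 3x3 matrix.
pub type Mat3 = [[f32; 3]; 3];

/// p_cam = W * p_world (rotation part only; translation omitted).
pub fn transform_point(w: Mat3, p: [f32; 3]) -> [f32; 3] {
    let mut out = [0.0f32; 3];
    for i in 0..3 {
        for k in 0..3 {
            out[i] += w[i][k] * p[k];
        }
    }
    out
}

/// Sigma_cam = W * Sigma * W^T for a general (non-symmetric) W.
pub fn sandwich_general(w: Mat3, sigma: Mat3) -> Mat3 {
    let mut out = [[0.0f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            for k in 0..3 {
                for l in 0..3 {
                    out[i][j] += w[i][k] * sigma[k][l] * w[j][l];
                }
            }
        }
    }
    out
}
```

With W = R_y(90°) = [[0, 0, 1], [0, 1, 0], [-1, 0, 0]], the world point (-5, 0, 0) maps to (0, 0, 5) and diag(4, 1, 0.25) maps to diag(0.25, 1, 4), matching the closed-form values above.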

A transpose error in the asymmetric 3×3 SIMD sandwich (e.g.
swapping the X and Z axis projections in Σ_cam) would fail this
test. The test passes first try, confirming no such bug exists
in the shipped a00ec09.

## P0.2 — Remove dead `project_one_scalar_inner`

The 102-LoC private fn at the top of the module was declared but
never called from production OR tests. PP-13 flagged it as
"creates false confidence that a scalar fallback exists". The
test module already had its own near-duplicate `project_one_scalar`
inline helper that test 7 actually uses.

Fix: delete `project_one_scalar_inner` entirely. Net: 1017 → ~915
LoC for the file, no behavioral change. The test-module
`project_one_scalar` remains as the SIMD-parity reference.

## P1 — Partial-chunk lane masking test (promoted)

The `k >= count || idx >= gaussians.len` guard in
`project_chunk_x16` was untested — all prior tests had len =
multiple of 16 OR len = 1. A bug there only appears at inference
time when the final chunk is partial.

Fix: `project_partial_chunk_masks_padded_lanes` walks n ∈
{1, 7, 15, 17, 23, 31}, asserts all `n` real slots are valid and
all `capacity - n` padded slots are invalid. Passes first try —
confirms the mask path works.

## P1 deferred (TECH_DEBT)

- `with_capacity` pads to CHUNK_WIDTH=16 not PREFERRED_F32_LANES.
  Doc-comment fix: 16 is the right bound for THIS module (the
  SIMD chunk width is the kernel's natural unit, independent of
  the polyfill's per-tier preferred lane count). Documented inline
  rather than realigned — refactoring to PREFERRED_F32_LANES would
  pessimize the AVX-512 native-16-wide path on no benefit.
- SPD-before-dilation intermediate test. Defer to PR 5 (rasterizer)
  where a real Inria scene exercises the corner cases.
- Near/far boundary tests at exactly z=near and z=far. The closed-
  interval `<`/`>` cull semantics are deliberate (matches Inria's
  convention) — documented decision, not a correctness bug.

## Test count

  cargo test --features splat3d --lib hpc::splat3d
    → 50 passed; 0 failed  (was 48: +2 new tests)

  src/hpc/splat3d/project.rs: 1017 → 915 LoC (-102 dead, +2 tests)

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
…t (PR 4)

Bridge between project_batch (PR 3) and the per-tile rasterizer
(PR 5). For each visible projected gaussian, compute the 3σ
screen-space AABB, walk the touched 16×16 tiles, and emit one
TileInstance per (tile, gaussian). Sort by packed u64 key
(tile_id << 32 | depth_bits) so each tile's slice is
depth-ascending (front-to-back) for the alpha-blend in PR 5.

API:
- TileInstance: tile_id + gaussian_id + depth_bits + pad
  (#[repr(C, align(16))], 16 B per instance — 4 per cache line)
- TileBinning: tile_cols × tile_rows grid, instances Vec,
  tile_offsets prefix-sum (length n_tiles + 1)
- TileBinning::from_projected(projected, camera) → constructor
- TileBinning::tile_instances(tx, ty) → O(1) slice retrieval

First-cut sort: slice::sort_unstable_by_key on the packed u64
key. If the rasterizer bench surfaces this as the hot spot,
PR4-fix follows with an LSD radix sort.
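A minimal sketch of the packed-key sort and the prefix-sum offsets, assuming positive finite depths. `Instance`, `pack_key`, and `sort_and_offsets` are hypothetical names and the field layout is simplified relative to `TileInstance`.

```rust
/// Simplified stand-in for TileInstance (no padding or alignment attributes).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Instance {
    pub tile_id: u32,
    pub gaussian_id: u32,
    pub depth: f32,
}

/// Packs (tile_id, depth) into one u64 so a single integer sort orders by
/// tile first, then by depth. For positive finite f32 values the raw bit
/// pattern is monotonic in the float value, so no float comparison is needed.
pub fn pack_key(tile_id: u32, depth: f32) -> u64 {
    debug_assert!(depth > 0.0 && depth.is_finite());
    ((tile_id as u64) << 32) | depth.to_bits() as u64
}

/// Sorts instances in place and returns prefix-sum offsets of length
/// n_tiles + 1, so tile t owns instances[offsets[t]..offsets[t + 1]].
pub fn sort_and_offsets(instances: &mut [Instance], n_tiles: usize) -> Vec<usize> {
    instances.sort_unstable_by_key(|i| pack_key(i.tile_id, i.depth));
    let mut offsets = vec![0usize; n_tiles + 1];
    // Count instances per tile, shifted by one slot.
    for inst in instances.iter() {
        offsets[inst.tile_id as usize + 1] += 1;
    }
    // Running sum turns counts into start offsets.
    for t in 0..n_tiles {
        offsets[t + 1] += offsets[t];
    }
    offsets
}
```

Because the tile id occupies the high 32 bits, any instance in tile 0 sorts before any instance in tile 1 regardless of depth.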

Tests (10): tile-size constant; ceil-div grid dims; single
gaussian on tile boundary touches 1 tile; large 50-radius
touches 64-tile patch; depth-sorted within tile; empty tiles
return empty slice; culled gaussians not binned; AABB clamped
to grid (no negative coords); off-screen gaussian zero
instances; tile_offsets monotonically non-decreasing.

Acceptance:
  cargo test --features splat3d --lib hpc::splat3d::tile → 10 passed
  cargo test --features splat3d --lib hpc::splat3d      → 60 passed

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
…+ sub-tile coverage

Folds the PP-13 brutally-honest-tester findings against ab58d17 (PR 4).
One P0 (promoted from a P1 marked "promote if PR 5 is pixel-exact"),
plus three P1s landed for API contract clarity and coverage gaps.

## P0 promoted — ceil-div under-counted at exact tile boundaries

The PR 4 binner used `ceil(px_max / TILE_SIZE)` for the exclusive
upper tile bound. When `px_max` was an EXACT multiple of 16, ceil
produced the wrong value:

  cx = 88, r = 8 → px_max = 96 = 6·16
    tx_max_old = ceil(96/16)     = 6   → range [_, 6) misses tile 6
    tx_max_new = floor(96/16) + 1 = 7   → range [_, 7) includes tile 6

But pixel 96 sits in tile 6 (`floor(96/16) = 6`), and the gaussian's
3σ extent reaches it. PR 5's rasterizer iterates the EXACT pixel
range inside each bound tile; any gaussian whose 3σ edge lands on a
tile boundary (16-pixel-aligned cx ± r) would lose its contribution
to the row/column of pixels at that boundary, producing one-pixel
rendering seams.

PP-13 flagged this as P1 with "Promote to P0 if PR 5 is pixel-exact."
PR 5 IS pixel-exact — promoting. The `floor + 1` formula:
  - Is correct for both integer-boundary AND fractional px_max values
  - Is backwards-compatible with the existing 10 tests (Worker F used
    radii 4, 50, 100, 12 that produced non-multiple px_max values)
  - Same op count as ceil (one floor + one add vs one ceil)
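The two bounds can be compared side by side in a short sketch. The function names are hypothetical and the clamp to the tile grid is left out.

```rust
pub const TILE_SIZE: f32 = 16.0;

/// Old exclusive upper tile bound: under-counts when px_max is an exact
/// multiple of TILE_SIZE.
pub fn tile_upper_ceil(px_max: f32) -> u32 {
    (px_max / TILE_SIZE).ceil() as u32
}

/// Fixed exclusive upper tile bound: index of the tile containing px_max, plus one.
pub fn tile_upper_floor_plus_one(px_max: f32) -> u32 {
    (px_max / TILE_SIZE).floor() as u32 + 1
}
```

For px_max = 96 the old bound gives 6 and the fixed bound gives 7; for fractional values such as 95.5 or 100.0 the two agree, which is why the earlier tests with non-multiple radii kept passing.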

## P1 — clarify `tile_instances(tx, ty)` out-of-range semantics

The fn returns an empty slice silently for OOB coordinates (no panic,
no Result). PR 5's per-tile driver iterates `0..tile_rows × 0..tile_cols`
with its own bounds, so the OOB path is defensive only. Doc-only fix:
added a `# Returns` block making the silent-empty contract explicit.

## P1 — defensive debug_assert on positive depth

The IEEE-754 positive-f32→u32 sort trick relies on `depth > 0`: for
positive finite floats, the raw bit pattern compared as an unsigned
integer orders the same way as the float value. PR 3's near cull
guarantees this for `valid == 1` slots, but a caller violating the
precondition would silently produce wrong sort order in release
builds. `debug_assert!(depth > 0 && is_finite())` in the emit pass
catches misuse in debug builds at no release-build cost.

## New tests (+3, total now 63)

- `gaussian_edge_on_exact_tile_boundary_includes_the_boundary_tile` —
  pins the P0 regression. cx=88, r=8 → 2×2 = 4 instances spanning
  tiles {5,6}². The old ceil bound stopped at tile 5 on each axis, so
  it emitted only (5,5) and missed (6,5), (5,6), and (6,6).
- `sub_tile_size_image_has_single_tile_grid` — 8×8 image yields
  tile_cols = tile_rows = 1; single gaussian fits in tile (0,0).
  PP-13 P1: previously untested.
- `tile_offsets_sentinel_equals_instances_len` — explicit assertion
  that `tile_offsets[n_tiles] == instances.len()`. PR 5's
  uniform `instances[offsets[t]..offsets[t+1]]` slice bracket
  depends on this; previously only checked via monotonicity bound.

## P1 deferred (TECH_DEBT)

- Two-phase index-shift comment in the count-to-prefix loop. Readability
  only; the inline code is already short and obvious to a reader who
  has seen the standard prefix-sum pattern.
- Negative center + small radius coverage (e.g. cx=-5, r=2). The
  existing Test 8 (cx=0, r=100) covers the negative-AABB clamp; the
  small-radius variant is a near-duplicate.

## Test count

  cargo test --features splat3d --lib hpc::splat3d
    → 63 passed; 0 failed  (was 60: +3 new)

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
…PR 5)

The second math-heat PR of the sprint. For each 16×16 tile, walk
its (tile_id, depth)-sorted TileInstance slice front-to-back; per
row of 16 pixels (one F32x16), accumulate alpha-blended RGB via
Kerbl 2023 §4. Front-to-back early-out at T < 1e-4 (below 8-bit
quantization floor).

Inner loop:
  dx, dy   = gaussian_xy_broadcast - pixel_xy_vec
  power    = -0.5 · (a·dx² + 2b·dx·dy + c·dy²)       [2D Mahalanobis]
  alpha    = min(0.99, opacity · fast_exp(power))
  mask     = (power ≤ 0) & (alpha ≥ 1/255)
  T_next   = T · (1 − alpha)         [via mask.select]
  C       += mask.select(T · alpha · color, 0)
  break if T_next.reduce_max() < 1e-4
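A scalar, single-pixel version of this loop may make the masking easier to follow. `Splat2D` and `blend_pixel` are hypothetical names, `f32::exp` stands in for the SIMD fast-exp, and the final `C + T·bg` background composite is an assumption about how the framebuffer is finished.

```rust
/// Hypothetical projected gaussian: screen center, conic (a, b, c), opacity, RGB.
#[derive(Clone, Copy, Debug)]
pub struct Splat2D {
    pub x: f32,
    pub y: f32,
    pub conic: [f32; 3],
    pub opacity: f32,
    pub color: [f32; 3],
}

pub const T_SATURATION_EPS: f32 = 1e-4;

/// Front-to-back alpha blend of depth-sorted splats at one pixel.
pub fn blend_pixel(px: f32, py: f32, splats: &[Splat2D], bg: [f32; 3]) -> [f32; 3] {
    let mut t = 1.0f32; // remaining transmittance
    let mut c = [0.0f32; 3];
    for s in splats {
        let (dx, dy) = (s.x - px, s.y - py);
        let [a, b, cc] = s.conic;
        // 2D Mahalanobis exponent; must be <= 0 for a valid conic.
        let power = -0.5 * (a * dx * dx + 2.0 * b * dx * dy + cc * dy * dy);
        let alpha = (s.opacity * power.exp()).min(0.99);
        // Scalar form of the mask: skip invalid or negligible contributions.
        if power > 0.0 || alpha < 1.0 / 255.0 {
            continue;
        }
        for k in 0..3 {
            c[k] += t * alpha * s.color[k];
        }
        t *= 1.0 - alpha;
        // Early-out once the pixel is effectively opaque.
        if t < T_SATURATION_EPS {
            break;
        }
    }
    for k in 0..3 {
        c[k] += t * bg[k];
    }
    c
}
```

With two fully opaque splats stacked on the pixel, the 0.99 clamp leaves T = 0.01 after the first hit, so the back splat still contributes about 0.01 × 0.99 of its color.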

API:
- rasterize_tile(tile_x, tile_y, binning, projected, fb, w, h, bg)
- rasterize_frame(binning, projected, fb, w, h, bg) — walks every tile
- T_SATURATION_EPS = 1e-4

Tests (10): empty scene = background; opaque-white center pixel;
two-gaussian front-to-back composite; 50-stack early-out; outside-
3σ skip; per-tile write isolation; rasterize_frame == sum of
rasterize_tile; partial-tile-at-image-edge; alpha-low background
visibility; empty tile preserves background.

Acceptance:
  cargo test --features splat3d --lib hpc::splat3d::raster → 10 passed
  cargo test --features splat3d --lib hpc::splat3d        → 73 passed

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
…vergence tests

Folds the PP-13 audit findings against 190ea35 (PR 5). Zero P0 bugs in
the alpha-blend math; the audit confirmed pixel-exact correctness on
every Kerbl 2023 §4 invariant traced (accumulation order, factor-of-2
cross-term, 0.99 clamp, simd_le boundary, background composite,
reduce_max early-out, mask-AND portability across all three SIMD
tiers). Two P1s promoted per the pattern: real bug-class holes the
existing tests would miss.

## P1 → P0 promotion — bottom-edge row guard

The pre-fix code guarded `pix_y >= height` at the per-pixel scatter
step, AFTER the inner blend loop had already computed alpha, exp,
conic, T-update for the entire row. On any image whose height isn't
a multiple of TILE_SIZE, that work was computed and then discarded
(e.g. 1080 → 67.5 tile rows → 8 out-of-bounds pixel rows in the last
tile row, each paying the per-gaussian fast-exp across 50K gaussians,
estimated at ~6-8% wasted compute per frame).

Fix: move the height guard to the top of the row loop (line 121-123),
saving the entire row's blend loop on OOB rows. Test 13 covers this
with a 16×17 image (one partial tile row exercising the guard) +
both empty-scene and one-gaussian-at-bottom-row variants.
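The effect of hoisting the guard can be illustrated by counting how many rows of a tile actually reach the blend loop. Both function names are hypothetical and the blend itself is replaced by a counter.

```rust
pub const TILE_SIZE: usize = 16;

/// Closed form: number of pixel rows of tile row `tile_y` that lie inside
/// an image with `height` rows.
pub fn rows_in_tile(tile_y: usize, height: usize) -> usize {
    height.saturating_sub(tile_y * TILE_SIZE).min(TILE_SIZE)
}

/// Row loop with the height guard at the top: out-of-bounds rows never
/// reach the (stand-in) blend work.
pub fn blended_rows_with_guard(tile_y: usize, height: usize) -> usize {
    let mut blended = 0;
    for row in 0..TILE_SIZE {
        let pix_y = tile_y * TILE_SIZE + row;
        if pix_y >= height {
            break; // rows only increase, so every later row is also out of bounds
        }
        blended += 1; // stand-in for the per-row alpha-blend loop
    }
    blended
}
```

For the 16×17 test image, tile row 0 blends all 16 rows and tile row 1 blends exactly one.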

## P1 → P0 promotion — opacity=1.0 / 0.99 clamp regression test

Every prior test used opacity ≤ 0.99, so the 0.99 alpha clamp never
actually fired in the suite. Removing or retuning the clamp would
break opacity=1.0 scenes (common in pre-trained Inria models, with
fully opaque foreground splats) by zeroing T after the first hit and
making every back gaussian vanish. Before this change the clamp
could regress silently.

Fix: Test 11 sets BOTH gaussians' opacity = 1.0, asserts the back
(blue) channel value is in the analytical range [0.005, 0.02] (=
0.01 × 0.99) that the clamped formula produces. An unclamped path
gives B=0 (back vanished); a re-tuned clamp at 0.999 gives B≈0.001
(still distinguishable, still wrong).
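
The arithmetic behind those three cases can be checked with a scalar sketch. `composite_channel` is a hypothetical helper, not the SIMD rasterizer; it assumes both gaussians sit exactly on the pixel (so alpha equals opacity) and ignores the background term.

```rust
/// Front-to-back composite of one colour channel for a single pixel.
/// `splats` holds (alpha, channel value) pairs sorted nearest-first.
fn composite_channel(splats: &[(f32, f32)], alpha_clamp: f32) -> f32 {
    let mut transmittance = 1.0_f32; // T: fraction of light still passing
    let mut out = 0.0_f32;
    for &(alpha, value) in splats {
        let a = alpha.min(alpha_clamp); // the clamp under test
        out += transmittance * a * value;
        transmittance *= 1.0 - a;
    }
    out
}
```

With a red front splat (blue = 0) and a blue back splat (blue = 1), both at opacity 1.0, a clamp of 0.99 leaves T = 0.01 after the first hit, so the back splat contributes 0.01 × 0.99 to the blue channel. A clamp of 1.0 (effectively no clamp) drives T to zero and the back splat disappears.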

## P1 — spatial-separation test (per-lane divergence)

Every prior multi-gaussian test stacked gaussians at IDENTICAL
screen coordinates, a degenerate case where each pixel in the tile
sees the same (dx, dy) for every gaussian. A broadcasted-wrong-id
bug (reading gaussian_id+1 instead of gaussian_id, or transposing
the per-gaussian lane offset) would still pass those tests, because
it produces identical pixels in the degenerate case.

Fix: Test 12 places two opaque gaussians at separated positions
((4,4) red, (12,12) blue) in the SAME tile, asserts pixel (4,4) is
red-dominant and pixel (12,12) is blue-dominant — confirms the
F32x16 per-lane divergence math distinguishes pixels correctly.

## P1 deferred (TECH_DEBT)

- Explicit early-out fire-count test (Test 4 only verifies the
  resulting pixel color, not that the inner loop broke at gaussian
  3). A test-only counter via cfg(test) would close this. Until then
  the color check still guards correctness, since running without the
  early-out over 50 opaque gaussians produces the same final pixel
  anyway; only the early-out's performance benefit goes unchecked.
- Explicit power=0 boundary test. Test 3 already exercises this
  case (gaussians centered exactly on the pixel produce power=0)
  and the simd_le path includes it, so coverage is incidental but real.

## Test count

  cargo test --features splat3d --lib hpc::splat3d
    → 76 passed; 0 failed  (was 73: +3 new tests, all green first try)

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
Sibling of hpc::renderer::Renderer for the SPO graph viz. Same
shape: two RwLock<SplatFrame>s, AtomicUsize front_idx, atomic
swap(). The instance pattern (vs module-level globals) lets
medvol and lance-graph-render each own their own SplatRenderer.

SplatFrame::tick runs the full PR 1-5 pipeline:
  project_batch → TileBinning::from_projected → rasterize_frame
  → frame_id += 1
The state mutation is guarded by &mut self (frame) or the back
RwLock write guard (renderer).

SplatRenderer::tick overrides frame_id with a global AtomicU64
tick_count so front_frame_id() is monotonically increasing across
both frame slots (not per-slot).
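
The double-buffer shape described above can be sketched as follows. This is an illustration under stated assumptions, not the frame.rs code: `Frame` and `DoubleBuffered` are hypothetical names, rendering is replaced by a placeholder, `tick` is assumed to swap at the end, and a single writer thread is assumed.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::RwLock;

/// Hypothetical stand-in for SplatFrame: an id plus a framebuffer.
struct Frame {
    frame_id: u64,
    pixels: Vec<f32>,
}

struct DoubleBuffered {
    slots: [RwLock<Frame>; 2],
    front_idx: AtomicUsize,
    tick_count: AtomicU64,
}

impl DoubleBuffered {
    fn new(len: usize) -> Self {
        let mk = || RwLock::new(Frame { frame_id: 0, pixels: vec![0.0; len] });
        Self {
            slots: [mk(), mk()],
            front_idx: AtomicUsize::new(0),
            tick_count: AtomicU64::new(0),
        }
    }

    /// Flip front and back with an atomic XOR on the slot index.
    fn swap(&self) {
        self.front_idx.fetch_xor(1, Ordering::AcqRel);
    }

    /// Render into the back slot, stamp it with the global tick, then swap.
    fn tick(&self) {
        let back = self.front_idx.load(Ordering::Acquire) ^ 1;
        {
            let mut frame = self.slots[back].write().unwrap();
            frame.pixels.fill(0.0); // placeholder for the real render pass
            // A global counter keeps ids monotonic across both slots.
            frame.frame_id = self.tick_count.fetch_add(1, Ordering::Relaxed) + 1;
        } // write guard dropped before the swap publishes the frame
        self.swap();
    }

    fn front_frame_id(&self) -> u64 {
        let front = self.front_idx.load(Ordering::Acquire);
        self.slots[front].read().unwrap().frame_id
    }
}
```

Readers holding the front slot's read lock never contend with the writer, which only takes the back slot's write lock. Stamping from one shared counter is what makes the front id advance 1, 2, 3 rather than per-slot.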

GaussianBatch and TileBinning do not implement Debug, so
SplatFrame/SplatRenderer omit #[derive(Debug)] rather than touch
PR 2/4 files.

Tests (10): with_capacity sanity, tick increments frame_id,
tick renders a visible gaussian, monotonic id, front/back
complementarity, swap XOR-flip idempotence, tick advances
front_frame_id, concurrent read doesn't block write, byte
footprint > 0, two ticks render to DIFFERENT buffers (pointer
identity check confirms double-buffer is using both slots).

Acceptance:
  cargo test --features splat3d --lib hpc::splat3d::frame  → 10 passed
  cargo test --features splat3d --lib hpc::splat3d        → 86 passed

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
Closes the splat3d sprint's "Definition of done" — the full PR 1-6
pipeline now runs end-to-end on the CPU with a real binary that takes
a .ply scene as input and produces image output.

## Shipped

### src/hpc/splat3d/ply.rs (~370 LoC, 4 unit tests)

Minimal Inria 3DGS PLY reader. Parses ASCII header up to `end_header`,
validates the canonical 62-property vertex layout (x/y/z, normals,
SH DC + 45 rest, opacity, scale × 3, quat × 4), reads the binary
little-endian body, applies the canonical activations inline
(sigmoid opacity, exp scale, normalize quat), and reorders SH into
the gaussian-major channel-major layout `sh_eval_deg3` expects.

Rejects ASCII bodies, big-endian, unexpected properties, and
truncated files with typed `PlyError` variants. No new top-level
deps — single-file hand-rolled binary parser.
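
The three activations named above are small enough to sketch in scalar form. `Activated` and `activate` are hypothetical names, and the identity fallback for a zero-length quaternion (with the scalar component stored first) is an assumption of this sketch, not something the PR text confirms.

```rust
/// Logistic sigmoid: maps a stored opacity logit into (0, 1).
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Hypothetical container for one gaussian's activated non-SH fields.
#[derive(Debug)]
struct Activated {
    opacity: f32,
    scale: [f32; 3],
    quat: [f32; 4],
}

fn activate(raw_opacity: f32, raw_scale: [f32; 3], raw_quat: [f32; 4]) -> Activated {
    let norm = raw_quat.iter().map(|q| q * q).sum::<f32>().sqrt();
    // Assumption: fall back to the identity rotation (scalar-first layout)
    // when the stored quaternion has zero length.
    let quat = if norm > 0.0 {
        raw_quat.map(|q| q / norm)
    } else {
        [1.0, 0.0, 0.0, 0.0]
    };
    Activated {
        opacity: sigmoid(raw_opacity), // logit -> (0, 1)
        scale: raw_scale.map(f32::exp), // log-scale -> positive scale
        quat,
    }
}
```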

### tests/splat3d_correctness.rs (5 e2e integration tests)

Walks the full PR 1-6 pipeline against a synthetic 1000-gaussian
cube scene (10×10×10 grid spanning [-2,2]³, colored by position via
SH DC term).

- `end_to_end_synthetic_cube_renders_without_panic` — pipeline
  produces non-trivial pixel variance (>100 lit pixels, <50%
  saturated) on a 256×256 render.
- `end_to_end_double_buffer_swap_preserves_consistency` — SplatRenderer
  tick 2x; front_frame_id advances 1, 2 across both buffers.
- `end_to_end_camera_translation_changes_render` — two cameras at
  different world positions produce DIFFERENT framebuffers (SSD > 1).
- `end_to_end_empty_scene_yields_pure_background` — zero gaussians ⇒
  pixel-exact background fill.
- `end_to_end_three_consecutive_ticks_preserve_invariants` — 3 ticks,
  frame_id monotonic 1/2/3, all pixels finite (no NaN bleed).

### examples/splat3d_flex.rs (~200 LoC, runnable demo)

CLI binary that loads a `.ply` scene (or falls back to the synthetic
cube), bakes a circular camera path around the origin, renders N
frames, writes PPM output, reports p50/p95/p99 frame timing + fps.

PPM over PNG: the sprint's "no new top-level deps" invariant rules
out flate2 / png crates. PPM is a ~14-byte header + raw RGB bytes,
viewable in most image tools, and `splat3d_flex.rs` documents the
choice + the deferred PNG-as-followup option.
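
For reference, a binary PPM ("P6") writer needs only the standard library. The sketch below is not the code in `splat3d_flex.rs`; `write_ppm` is a hypothetical helper that assumes the framebuffer holds RGB floats in [0, 1] with length 3·W·H.

```rust
use std::io::{self, Write};

/// Write an RGB f32 framebuffer (values in [0, 1], length 3*width*height)
/// as a binary PPM: ASCII header, then one byte per channel.
fn write_ppm<W: Write>(mut out: W, width: usize, height: usize, rgb: &[f32]) -> io::Result<()> {
    assert_eq!(rgb.len(), 3 * width * height, "framebuffer size mismatch");
    // Header: magic number, dimensions, max channel value.
    write!(out, "P6\n{} {}\n255\n", width, height)?;
    // Clamp to [0, 1] and quantise each channel to 8 bits.
    let bytes: Vec<u8> = rgb
        .iter()
        .map(|v| (v.clamp(0.0, 1.0) * 255.0).round() as u8)
        .collect();
    out.write_all(&bytes)
}
```

Passing a `std::fs::File` (or a `BufWriter` around one) as `out` writes the frame to disk; passing `&mut Vec<u8>` keeps it in memory for tests.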

Smoke test (5 frames × 256² synthetic cube on AVX2-emulated build):
  p50=133.63 ms, p95=146.57 ms, p99=146.57 ms, 7.5 fps
The 1080p × 500K-gaussian acceptance target awaits the Inria
bicycle .ply asset and a benchmarking-only session.

### benches/RESULTS.md (real measured numbers)

Baselined the four PR 1 microbenches under both default (AVX2-
emulated F32x16) and `target-cpu=native` (AVX-512F) builds. Honest
findings:
- `sandwich_simd_x16` on AVX-512 native: 1.83× over scalar loop
  (below the spec's 10× aspiration; the AoS↔SoA transpose at 6
  fields × 16 lanes dominates the inner-loop savings for this
  microbench). Filed as TECH_DEBT for the performance sprint.
- `sandwich_simd_x16` on AVX2-emulated default: 0.17× (slower).
  Documented as the polyfill's two-`__m256`-per-`F32x16` cost.
  TECH_DEBT: add runtime tier dispatch so AVX2 builds prefer the
  scalar loop, or restructure to take SoA inputs directly.
- `from_scale_quat`: 9 ns on AVX-512 native (the 3DGS canonical
  Σ builder; GaussianBatch::covariance_x16 SIMD-batches it).
- `eig_smith_1961`: 126 ns (acos dominates; diagonal fast-path
  bypasses the trig).

Documented the per-PR follow-up bench rows that should populate
when the rasterizer-driven full-pipeline bench lands.

## Sprint state (Definition of done)

- [x] 7 PRs merged to splat3d branch
- [x] `cargo test --features splat3d -p ndarray` green
  (1859 prior tests + 90 splat3d lib tests + 5 e2e + 4 PLY = 1958)
- [x] `cargo bench --features splat3d` baselined in RESULTS.md
- [x] `cargo run --features splat3d --example splat3d_flex` runs
  end-to-end (synthetic fallback OR a .ply scene)
- [x] No regression in existing ndarray benches
- [x] Pillar-7 probe certified in lance-graph jc (PR #403 + the
  rotated-axisymmetric fix in claude/jc-pillar-7-eigvec-duplicate-fix-MAOO0)

## Deferred to follow-up sprint

- Inria bicycle .ply SSIM comparison vs reference CUDA (asset
  download required; not in this remote container).
- 1080p × 500K real-data benchmark (same).
- PNG output via `image`/`png` crate (gated on the no-new-deps
  invariant; PPM works for the v1 demo deliverable).
- Performance: AVX2-tier SIMD path optimization; tile-binner radix
  sort; rayon-parallel rasterize_frame.
- Backward pass / training pipeline (separate sprint per the
  sprint prompt's "After the sprint" section).

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9e96459645

Comment thread src/hpc/splat3d/ply.rs Outdated
claude added 2 commits May 18, 2026 07:01
External-reviewer bug report against PR #153:

> When a malformed or fuzzed PLY header advertises a vertex count
> larger than usize::MAX / (62 * 4), this size calculation overflows
> (panics in debug, wraps in release). In release that allocates a
> too-small bytes buffer and the subsequent per-vertex loop indexes
> past it instead of returning a PlyError, so a bad input can crash
> the loader; use checked multiplication before allocating/reading
> the body.

## Root cause

`read_ply` computed the body byte count via:

    let mut bytes = vec![0u8; n_vertices * PROPERTIES_PER_VERTEX * 4];

For `n_vertices > usize::MAX / 248`:
- debug: panic on the unchecked `*`.
- release: wraps to a small number, allocates a too-small buffer,
  `read_exact` succeeds (reads only the wrapped count of bytes —
  often zero), then the per-vertex loop indexes far past the
  allocation. Crash or — worse — silent corruption if the wrapped
  size happens to land at a valid index.

## Fix

Gate the body size with `checked_mul` BEFORE allocation:

    let body_bytes = n_vertices
        .checked_mul(PROPERTIES_PER_VERTEX)
        .and_then(|n| n.checked_mul(4))
        .ok_or_else(|| PlyError::BadElement(format!(
            "vertex count {n_vertices} × {PROPERTIES_PER_VERTEX} props × 4 bytes \
             overflows usize on this target ({} bits)", usize::BITS,
        )))?;
    let mut bytes = vec![0u8; body_bytes];

The downstream per-vertex `i * stride` math is now safe by
transitivity — for any `i < n_vertices`, `i * stride ≤ body_bytes ≤
usize::MAX`. No further bounds work needed.

## Regression test

`rejects_overflowing_vertex_count`:
- Computes `overflow_count = usize::MAX / (PROPERTIES_PER_VERTEX * 4) + 1`
  (the smallest count that overflows on the current target).
- Builds a valid PLY header advertising that count, with NO body
  bytes — the overflow check must fire BEFORE any I/O is attempted.
- Asserts `PlyError::BadElement` with a message containing "overflows".

Verified green in BOTH debug and release builds, where the wrapping
(not panicking) release path is the actual security concern.
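
The check and its regression test can be reproduced in isolation. In this sketch `body_byte_count` is a hypothetical helper that wraps the `checked_mul` chain from the fix, and `PlyError` is reduced to the single variant involved; neither is the actual ply.rs surface.

```rust
/// Properties per vertex in the canonical Inria layout described above.
const PROPERTIES_PER_VERTEX: usize = 62;

/// Reduced error type for this sketch (the real enum has more variants).
#[derive(Debug, PartialEq)]
enum PlyError {
    BadElement(String),
}

/// Body size in bytes (4 bytes per f32 property), or an error if the
/// multiplication would overflow usize on this target.
fn body_byte_count(n_vertices: usize) -> Result<usize, PlyError> {
    n_vertices
        .checked_mul(PROPERTIES_PER_VERTEX)
        .and_then(|n| n.checked_mul(4))
        .ok_or_else(|| {
            PlyError::BadElement(format!(
                "vertex count {n_vertices} overflows usize ({} bits)",
                usize::BITS
            ))
        })
}
```

The smallest overflowing count is `usize::MAX / (PROPERTIES_PER_VERTEX * 4) + 1`; one less than that still fits, which gives the test a tight boundary on either side.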

## Test count

  cargo test --features splat3d --lib hpc::splat3d::ply
    → 5 passed; 0 failed (was 4: +1 overflow regression)
  cargo test --features splat3d --lib hpc::splat3d
    → 91 passed; 0 failed (was 90: +1)
  cargo test --features splat3d --release --lib hpc::splat3d::ply
    → 5 passed; 0 failed (release-build confirms no wrap-then-corrupt)

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
Mechanical formatting fixes from `cargo fmt --all` — no semantic
changes. Brings the 12 splat3d files (PR 1-7 + fixes) into rustfmt
compliance so the workspace gate stays green.

Files reformatted:
  benches/splat3d_bench.rs
  examples/splat3d_flex.rs
  src/hpc/splat3d/{mod,spd3,gaussian,sh,project,tile,raster,frame,ply}.rs
  tests/splat3d_correctness.rs

Acceptance:
  cargo fmt --all --check                                            → clean
  cargo test --features splat3d --lib hpc::splat3d                  → 91 passed
  cargo test --features splat3d --test splat3d_correctness          → 5 passed
  cargo check --features splat3d --benches --bench splat3d_bench    → clean
  cargo check --features splat3d --example splat3d_flex             → clean

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
@AdaWorldAPI AdaWorldAPI merged commit ab20d11 into master May 18, 2026
15 checks passed
AdaWorldAPI pushed a commit that referenced this pull request May 18, 2026
…d / cognitive

Review of the three uploaded sprint prompts (splat3d_sprint_prompt,
splat4d_cascade_sprint, splat4d_skeleton_anchored_sprint) in context of
the cognitive-shader work drafted in PR-X4 / PR-X9 / PR-Z1.

Tags every arithmetic primitive shipped / drafted / gap across 9 layers
(L0 SPD substrate → L8 cognitive overlay), flags 3 precision classes
(EXACT / FAST OK / VERIFY), and identifies 5 concrete gaps that gate
the joint sprint:

1. Hilbert-3D encode/decode (mentioned in splat4d cascade but not
   specified anywhere — single shared dependency of medical AND
   cognitive paths)
2. INT4×32 packed dot product (PR-X7 thinking-style + qualia signature
   — needs VNNI/dotprod strategy decision)
3. NARS truth-revision kernel + precision class (replaces alpha-compose
   in W7 closure swap)
4. x265-style CTU mode encoder (skip/merge/delta/escape for PR-X9
   lazy storage)
5. fast_exp_x16 precision audit for NARS context (3% rel err is OK for
   alpha but suspect for cognitive confidence cascade)

Five new cross-cutting research items consolidated (atop the five from
the three sprint docs):
- Hilbert-3D algorithm choice (Butz vs Skilling vs precomputed table)
- INT4×N hardware strategy (VNNI vs software unpack vs AMX widening)
- NARS revise precision class decision (G5 (a/b/c) — lean toward (b),
  drop exp from cognitive path entirely)
- CTU mode encoder λ-RDO calibration
- Codebook size const-generic strategy

Recommended ordering: Phase 0 (Hilbert-3D + INT4×N) unblocks BOTH the
medical sprint (splat4d skeleton-anchored) AND the cognitive sprint
(PR-X4 + PR-X9). Build the shared substrate first; both stacks
accelerate together. Phase 1 medical+cognitive co-substrate
(Pillar-8 + moment-match + mesh-fit). Phase 2 cognitive-only
(basin XOR-popcount + CTU + NARS). Phase 3 W7 closure swap.

Recommended 30-min math workshop before the joint plan-review savant
to lock σ_temporal values, Hilbert-3D algorithm, and NARS precision
class — removes 3 open questions per design doc and accelerates the
sprint.

Key strategic claim: Pillar-7 SPD-sandwich is the most-reused single
math op in the entire stack. It's the projection (J·W·Σ·Wᵀ·Jᵀ), the
temporal cascade (Σ_{t+1} = M·Σ_t·Mᵀ), the moment-match aggregate-up
(via Δμ·Δμᵀ outer products), and the cognitive-spacetime evolution.
Shipped in splat3d PR #153. Everything else is a semantic
reinterpretation of M.
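
As a plain-array illustration of the sandwich product M·Σ·Mᵀ (not the `Spd3` API or its SIMD batch variant), the sketch below uses a local `Mat3` alias and hypothetical helper names.

```rust
/// Local 3×3 row-major matrix alias for this sketch only.
type Mat3 = [[f32; 3]; 3];

fn matmul(a: &Mat3, b: &Mat3) -> Mat3 {
    let mut out = [[0.0_f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            for k in 0..3 {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

fn transpose(a: &Mat3) -> Mat3 {
    let mut out = [[0.0_f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            out[j][i] = a[i][j];
        }
    }
    out
}

/// Sandwich product M·Σ·Mᵀ: maps a covariance Σ through the linear map M.
/// The result stays symmetric whenever Σ is symmetric.
fn sandwich(m: &Mat3, sigma: &Mat3) -> Mat3 {
    matmul(&matmul(m, sigma), &transpose(m))
}
```

A 90° rotation about z applied to Σ = diag(1, 4, 9) swaps the x and y variances and yields diag(4, 1, 9). The same function covers the projection case when M = J·W and the temporal case when M is the per-step transition.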
AdaWorldAPI pushed a commit that referenced this pull request May 18, 2026
…ow LAPACK

Strategic shift: the biggest arithmetic gap in the stack isn't the
cognitive overlay or even the splat4d cascade — it's the shared
linear-algebra layer below LAPACK that splat3d backward, openchat/gpt2
inference, AND the jc Pillar probes are all hand-rolling against.

Today's duplication:
- splat3d ships its own Spd3 (Smith-1961, PR #153)
- lance-graph jc has THREE separate Spd2/Spd3 copies in
  ewa_sandwich.rs / ewa_sandwich_3d.rs / koestenberger.rs
- hpc::{gpt2, openchat, stable_diffusion} inline RMSNorm/SiLU/RoPE/
  attention because there's no canonical fn

PR-X10 consolidates everything into crate::hpc::linalg::*:
- MatN<const N> carrier + Mat2/Mat3/Mat4 type aliases
- Quat algebra (mul, conjugate, slerp, from_axis_angle, to_mat)
- Matrix inverse (3×3 / 4×4 closed-form + general LU-backsolve)
- Symmetric eig (closed-form ≤4, Jacobi 5-64, QR > 64)
- SVD (Golub-Reinsch + one-sided Jacobi)
- Polar decomposition + mat_exp + mat_log (Padé scaling-and-squaring)
- SH deg 0..=7 (supersedes splat3d's deg-3-only)
- Conv1D + Conv2D (im2col + direct-3x3/5x5)
- Batched gemm + RMSNorm/LayerNorm/GroupNorm + GELU/SiLU/Swish/Mish
- RoPE + fused attention (naive + flash-attention)
- Cross-entropy + softmax-backward
- Tier-3 extensions: SIMD RNG dists, vml special fns (erf/gamma/Bessel),
  Bluestein FFT, irfft, DCT-II/IV, wavelets, sparse GEMM, tridiagonal

Closed-form fast paths coexist with general-N (invariant 12) — Spd3
Smith-1961 is 10× faster than Jacobi-3 on the splat3d hot path.
Don't delete the fast paths when ripping out the duplication.

Worker decomposition: A1 MatN (foundation, sequential), then A2-A12
PARALLEL (max fan-out: 12 workers, all writing to separate files,
all consuming MatN + crate::simd::F32x16). Matches the user's
"12 agents + 1 coordinator" cadence. ~2 weeks parallel /
~5 weeks sequential.

jc consolidation queued as follow-ons:
- jc-X1: consolidate Spd2/Spd3 into private jc::hadamard (keeps jc
  zero-dep on ndarray; mirrors PR-X10's canonical surface)
- jc-X2: Wasserstein-1 / Sinkhorn-Knopp + Hungarian for Pillar 10
- jc-X3: signature transform for Pillar 11
- jc-X4: SPD-cone ops + manifold log/exp (SO(n), Grassmannian,
  Stiefel) — unblocks Pillar 2 Cartan-Kuranishi

PR-X10 is INDEPENDENT of PR-X4 / PR-X9 / PR-Z1 (zero file overlap),
ships concurrently from claude/pr-x10-linalg-core-design branch.
Maximum sprint parallelism: cognitive-shader stack AND linalg-core
can spawn workers simultaneously.

7 open questions for plan-review savant. Most load-bearing:
- Q1: both closed-form AND general-N? (lean: yes — invariant 12)
- Q2: const-generic MatN vs concrete Mat2/3/4? (lean: both)
- Q5: flash-attention in v1? (lean: yes — needed for any seq > 512)
- Q7: PR-X10 concurrent with PR-X4/X9/Z1? (lean: yes)

Also adds shopping-list addendum to pr-arithmetic-inventory.md
cross-referencing PR-X10 as the consolidating sprint.