splat3d: CPU-SIMD 3D Gaussian Splatting forward renderer (Kerbl 2023)#153
Merged
Conversation
Lands the math foundation for the CPU-SIMD 3D Gaussian Splatting renderer behind the new `splat3d` feature. Pure SIMD via the existing `crate::simd::F32x16` polyfill — no GPU, no wgpu, no new top-level deps. Sibling slice (Pillar-7 probe certifying the math) ships in parallel in `lance-graph/crates/jc/src/ewa_sandwich_3d.rs`. Module surface (`src/hpc/splat3d/`): - `mod.rs` — doc-first entry: math + pipeline + architectural invariants, declares `spd3` and re-exports `Spd3`, `sandwich`, `sandwich_x16`. Subsequent PRs (gaussian, sh, project, tile, raster, frame) will fill the remaining slots. - `spd3.rs` — symmetric 3×3 SPD storage (`#[repr(C, align(32))]`, 24 B payload + 8 B pad = 32 B; two per cache line). Smith 1961 closed-form eigendecomp (no Jacobi, no QR — branchless with diagonal fast path). Eigenvector recovery via row-pair cross product + Gram-Schmidt fallback for degenerate eigenspaces. `pow(t)`, `sqrt`, `log_spd` via spectral lift. `from_scale_quat` builds the 3DGS canonical Σ = R·diag(s²)·Rᵀ. `sandwich(M, N)` computes M·N·Mᵀ for symmetric M, N with off-diagonal averaging to suppress f32 rounding asymmetry; `sandwich_x16` runs the same op 16-wide via `F32x16` on AVX-512/AVX2/NEON/scalar (compile-time dispatch via the polyfill). Math reference: Smith 1961, "Eigenvalues of a symmetric 3×3 matrix", Communications of the ACM 4(4):168. Tests (13 passing): - size_alignment_invariants (size_of==32, align_of==32) - identity_round_trip, diagonal_fast_path - eigenvalues_sorted_descending (200 randomized SPD inputs) - from_scale_quat_identity_rotation_gives_diag_scale_sq - from_scale_quat_yields_spd (100 trials) - sqrt_squared_equals_original (100 trials, sandwich(sqrt(Σ), I) ≈ Σ) - pow_one_is_identity_op (50 trials) - log_of_identity_is_zero - sandwich_identity_is_input, sandwich_preserves_spd (200 trials) - sandwich_x16_matches_scalar_loop (16-lane SIMD parity vs scalar) - determinant_matches_product_of_eigenvalues (100 trials, det == λ₁λ₂λ₃) Bench (`benches/splat3d_bench.rs`, gated `required-features = ["splat3d"]`): - spd3_sandwich_scalar_x16_loop vs spd3_sandwich_simd_x16 (scalar loop baseline; SIMD batch path on the renderer hot loop) - spd3_eig_smith_1961 (eigendecomp throughput) - spd3_from_scale_quat (3DGS canonical builder) Acceptance: cargo test --features splat3d --lib hpc::splat3d → 13 passed cargo check --features splat3d --lib → clean cargo check --features splat3d --benches → clean A PP-13 brutally-honest-tester audit is running in parallel; any P0 findings will land as a fix commit on this branch before PR 2 starts. https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
… fidelity Folds the PP-13 brutally-honest-tester audit findings against f570b7b. Two P0s + one promoted-to-P0 finding addressed, plus four P1 coverage gaps the audit called out as latent-bug risks. ## Real bug found (not in PP-13's P0 list — surfaced by adding the test PP-13 recommended) `recover_eigvecs` mis-handled repeated eigenvalues: when λ₁ = λ₂, both `null_space_vec` calls returned the SAME unit vector (the preferred direction picked by the cross-product tiebreak), so the eigenvector matrix ended up rank-deficient and the closing Gram-Schmidt pass collapsed one column to noise. Reconstruction Σ = V·diag(λ)·Vᵀ then drifted by ~5% on a 30° rotation of diag(2, 2, 1). Fix: after the first pass, detect column pairs with |cos θ| > 0.99 and demote the later column to the Gram-Schmidt- complement path — any orthogonal completion spans the degenerate eigenspace equally well, so the reconstruction is invariant. The pre-existing 13 tests did not exercise this path because every randomized SPD sample had distinct eigenvalues. The new `eig_degenerate_eigenspace_via_rotated_diag` test reproduces the failure with a deterministic input. ## PP-13 P0 fixes - `Spd3::is_spd` doc: "Cheap SPD predicate" was inverted — the Sylvester-criterion short-circuit IS cheap, but the post-condition `Spd3::eig` call dominates the runtime on the SPD-passing common case. Renamed to "Exact SPD predicate" + added a `# Complexity` note warning against per-pixel use. - `benches/splat3d_bench.rs`: scalar and SIMD fixtures used `[m; 16]` / `[n; 16]` (identical-input arrays) — the compiler could fold the scalar 16-iter loop into one `sandwich` × 16, making the SIMD-vs-scalar comparison meaningless. Replaced with `build_distinct_pairs()` producing 16 differing (scale, quat) pairs across two rotation axis families so the SoA transpose actually has varying lane inputs. - `benches/RESULTS.md`: created the stub regression-gate file referenced by the bench module-doc and the PR checklist; populated with the four PR-1 bench rows and TBD baseline cells. ## PP-13 P1 promotions (cheap + high-value, landed now) - `from_scale_quat_90deg_{x,y,z}_rotation_permutes_axes` — three analytical ground-truth tests for the quaternion-to-rotation-matrix formula. Each rotation hits a different cross-term family (`wx` / `wy` / `wz`), so a sign flip in any one of them would fail at least one of the three tests. PP-13 called this gap out as the largest residual bug risk in the original 13 tests. - `is_spd_rejects_non_spd` — negative-case coverage: negative diagonal entry (fails 1×1 minor), oversized off-diagonal (fails 2×2 minor), negative determinant (fails 3×3 minor), zero matrix (eigenvalues zero). - `pow_two_inverts_sqrt` — `Σ.sqrt().pow(2.0) ≈ Σ` composition test; exercises the `pow(t)` general path with `t = 2`, not the dedicated `sqrt` shim. - `log_spd_diagonal_matches_log_of_eigenvalues` — directly verifies the spectral lift for diagonal SPD, hitting the eigendecomp's fast path so any bug in `reconstruct_symm` is caught even when eigenvector recovery is trivially the identity. ## P1 deferred (TECH_DEBT) - `Spd3::exp_spd` API for log/exp roundtrip — not in PR 1 spec; the Pillar-7 probe doesn't need it. Add when PR 6 (training/backward) surfaces a real consumer. - Ill-conditioned-matrix coverage (eigenvalues spanning many orders of magnitude) — defer to PR 5 acceptance, where the reference Inria scene exercises real-world conditioning. ## Test count cargo test --features splat3d --lib hpc::splat3d → 20 passed; 0 failed (was 13 in f570b7b) cargo check --features splat3d --benches --bench splat3d_bench → clean https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
…e C) - GaussianBatch: SoA layout, all 12 channels padded to PREFERRED_F32_LANES (mirror of RenderFrame). 56 floats per gaussian (3 mean + 3 scale + 4 quat + 1 opacity + 48 SH) = 224 B; 500K gaussians ≈ 112 MB, fits L3 with room. - Gaussian3D convenience constructor for tests/demos. - covariance(i): delegates to Spd3::from_scale_quat for one gaussian. - covariance_x16(start, out): SIMD batch via F32x16 — SoA transposes 7 input lanes, computes R = quat→matrix + Σ = R·diag(s²)·Rᵀ in lockstep, scatters upper-triangle output to [Spd3; 16]. - 8 tests: padding invariant, push/clear, panic-at-capacity, unit-quat → diag(s²) ground truth, 90° Y-rotation delegation check, covariance_x16 == scalar loop parity. Acceptance: cargo test --features splat3d --lib hpc::splat3d::gaussian → 8 passed https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
- sh_eval_deg3: scalar reference; 16 basis × 3 channel dot-product + Inria +0.5 offset + [0, 1] clamp. 48-float coefficient layout matches GaussianBatch::sh (gaussian-major, channel-major). - sh_eval_deg3_x16: SIMD batch via F32x16 — three RGB accumulators per gaussian, lane = gaussian index; one mul_add per (basis, channel) over the 16 basis functions. AVX-512 native 16-wide, AVX2 2×8 emulation, NEON 4×4, scalar fallback all share the polyfill API. - 7 tests: deg-0 constancy, zero-coeff = 0.5 background, view- dependent change with non-zero deg-1 coeff, [0,1] clamp, x16 vs scalar parity, constant-input lane invariance, SH_C0 normalization sanity. Acceptance: cargo test --features splat3d --lib hpc::splat3d::sh → 7 passed https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
… offset coverage Folds the PP-13 brutally-honest-tester audit findings against 231e2f3 + f9e4487. Zero P0 bugs surfaced — but four P1 coverage gaps logged, three promoted to "land now" per the rule from PR 1 (catch correlated-bug classes that the scalar↔SIMD parity tests miss). One doc-only fix. ## P1 → P0 promotions (closes correlated-bug holes) ### sh.rs: analytical ground-truth test at d = (0, 0, 1) The seven prior sh tests all compare scalar vs SIMD or check degenerate inputs (zero coeffs, clamp behavior, normalization constant ratio). A WRONG SH CONSTANT — sign flip on one of the 14 SH_C* entries, or a magnitude typo in the 16th decimal — would affect scalar AND SIMD identically and pass every existing test. That's the bug class PP-13 flagged as the biggest residual risk. Fix: `sh_eval_analytical_ground_truth_at_positive_z` pins basis outputs to closed-form values: - At d=(0,0,1), basis k ∈ {0, 2, 6, 12} produce non-zero values exactly equal to SH_C0, SH_C1, SH_C2[2]·2, SH_C3[3]·2 — so a single-coefficient test isolates one constant at a time. - The other 12 basis indices must vanish at d=(0,0,1) (all carry x or y factors), so a sign error that creates spurious value at the wrong basis is also caught. ### gaussian.rs: covariance_x16 with start > 0 `covariance_x16_matches_scalar_loop` always uses start=0. Any off-by-one in `self.quat_w[start..start+16]` slice arithmetic would be invisible (constant offset of 0 collapses to identity). Fix: `covariance_x16_with_nonzero_start_matches_scalar` pushes 32 gaussians and walks `covariance_x16(16, ...)` so each input index `16+k` differs from lane index `k`. ### gaussian.rs: SH round-trip through SoA No existing test bridged the `GaussianBatch::push` SH copy with `sh::sh_eval_deg3`. A bug in `SH_COEFFS_PER_GAUSSIAN` definition (off by some multiple of 16) or in `push`'s SH-block memcpy offset would silently corrupt color and only surface in PR 5's rasterizer output diff. Fix: `push_then_sh_eval_round_trips_through_soa` pushes 5 unit gaussians + 1 with a known DC coefficient + a coefficient at the LAST SH slot (sh[47]), reads the SoA span back directly to verify slot-by-slot survival, and then runs `sh_eval_deg3` against the SoA-derived slice to confirm the analytical RGB. ## P1 → doc-only fix (no test added) ### gaussian.rs::covariance_x16 doc precondition The fn's bound is on `capacity`, not `len`. Lanes ≥ len have zero-norm quats → degenerate zero matrix that is NOT SPD. Downstream consumers (PR 3 `project_batch`) must mask. Added a `# Precondition on padded lanes` block to the doc comment explaining the contract + pointing at `ProjectedBatch::valid` (PR 3) as the canonical masking site. ## Test count cargo test --features splat3d --lib hpc::splat3d → 38 passed; 0 failed (was 35: +3 tests, all green first try) cargo check --features splat3d --benches --bench splat3d_bench → clean ## Deferred to TECH_DEBT (low-value vs cost) - `Spd3::exp_spd` API (PR 6 deferred per PR 1 fix commit). - Ill-conditioned-matrix coverage (deferred to PR 5 with real Inria scene). https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
The math heat of the splat3d sprint, certified by the Pillar-7 probe
in jc::ewa_sandwich_3d. Per-gaussian forward kernel:
1. μ_cam = V·μ_world (camera transform), depth + frustum cull
2. screen_xy = (fx · μ_cam.x / z + cx, fy · μ_cam.y / z + cy)
3. Perspective Jacobian J ∈ ℝ^{2×3} at μ_cam
4. Σ_cam = W · Σ_world · Wᵀ (3×3 asymmetric W — NOT spd3::sandwich)
5. Σ_image = J · Σ_cam · Jᵀ (2×2, symmetric by construction)
6. ½-pixel anti-aliasing dilation (+0.3 on the diagonals)
7. 2D conic = inv(Σ_image), 3σ screen radius, on-screen cull
8. View direction → sh_eval_deg3 → view-dependent RGB
Surface:
- Camera (pinhole, row-major view matrix, focal + principal point,
near/far, image dims, world-space camera origin)
- ProjectedBatch SoA: screen_x/y, depth, conic_a/b/c, radius,
color_r/g/b, opacity, valid mask
- project_batch(gaussians, camera, &mut projected) — outer driver
- project_chunk_x16 — F32x16 SIMD inner loop, 16 gaussians/step via
Chunk16 staging buffer (tier-portable: works on AVX-512/AVX2/NEON)
Conic + depth + radius math goes through F32x16; SH eval stays
scalar (16 distinct view directions defeats SH SIMD batch).
Tests (10):
- screen-center landing at unit depth, near/far cull, off-screen
cull, conic-is-SPD, x16-vs-scalar parity, radius scales with
covariance, SH view-dir delegation, identity-camera sanity,
clear() resets len + valid.
Acceptance:
cargo test --features splat3d --lib hpc::splat3d::project → 10 passed
cargo test --features splat3d --lib hpc::splat3d → 48 passed (38 + 10)
https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
…ad scalar fn Folds the PP-13 brutally-honest-tester findings against a00ec09 (PR 3). Both P0s addressed; two P1s promoted to "land now" per the rule from PR 1 (close correlated-bug holes the SIMD-parity tests miss). ## P0.1 — Analytical ground truth for non-trivial W Tests 2-10 all use `Camera::identity_at_origin` (W=I₃ in the upper-left 3×3 of the view matrix), so the W·Σ·Wᵀ sandwich is trivially Σ on every existing test. A sign error in the SIMD `sc12/sc13/sc23` cross-term accumulators in `project_chunk_x16` would produce wrong projected ellipses for any rotated camera while passing all 48 tests. Fix: `project_non_identity_view_rotation_matches_analytical` pins the W·Σ·Wᵀ output to a closed-form value: - View = R_y(90°), gaussian at world (-5, 0, 0) → camera-frame position (0, 0, 5) at depth 5. - scale = [2, 1, 0.5] ⇒ Σ_world = diag(4, 1, 0.25). - Analytical Σ_cam = R_y(90°)·diag(4,1,0.25)·R_y(90°)ᵀ = diag(0.25, 1, 4) (axes permuted by rotation). - J at z=5: [[fx/5, 0, 0], [0, fy/5, 0]] (offdiag vanish since cam_x = cam_y = 0 by construction). - Σ_img = diag((fx/5)²·0.25, (fy/5)²·1) = diag(fx²/100, fy²/25). - conic_a, conic_b=0, conic_c computed against this analytical Σ_img after the +0.3 AA dilation; tolerance 1e-6 absolute. A transpose error in the asymmetric 3×3 SIMD sandwich (e.g. swapping the X and Z axis projections in Σ_cam) would fail this test. The test passes first try, confirming no such bug exists in the shipped a00ec09. ## P0.2 — Remove dead `project_one_scalar_inner` The 102-LoC private fn at the top of the module was declared but never called from production OR tests. PP-13 flagged it as "creates false confidence that a scalar fallback exists". The test module already had its own near-duplicate `project_one_scalar` inline helper that test 7 actually uses. Fix: delete `project_one_scalar_inner` entirely. Net: 1017 → ~915 LoC for the file, no behavioral change. The test-module `project_one_scalar` remains as the SIMD-parity reference. ## P1 — Partial-chunk lane masking test (promoted) The `k >= count || idx >= gaussians.len` guard in `project_chunk_x16` was untested — all prior tests had len = multiple of 16 OR len = 1. A bug there only appears at inference time when the final chunk is partial. Fix: `project_partial_chunk_masks_padded_lanes` walks n ∈ {1, 7, 15, 17, 23, 31}, asserts all `n` real slots are valid and all `capacity - n` padded slots are invalid. Passes first try — confirms the mask path works. ## P1 deferred (TECH_DEBT) - `with_capacity` pads to CHUNK_WIDTH=16 not PREFERRED_F32_LANES. Doc-comment fix: 16 is the right bound for THIS module (the SIMD chunk width is the kernel's natural unit, independent of the polyfill's per-tier preferred lane count). Documented inline rather than realigned — refactoring to PREFERRED_F32_LANES would pessimize the AVX-512 native-16-wide path on no benefit. - SPD-before-dilation intermediate test. Defer to PR 5 (rasterizer) where a real Inria scene exercises the corner cases. - Near/far boundary tests at exactly z=near and z=far. The closed- interval `<`/`>` cull semantics are deliberate (matches Inria's convention) — documented decision, not a correctness bug. ## Test count cargo test --features splat3d --lib hpc::splat3d → 50 passed; 0 failed (was 48: +2 new tests) src/hpc/splat3d/project.rs: 1017 → 915 LoC (-102 dead, +2 tests) https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
…t (PR 4) Bridge between project_batch (PR 3) and the per-tile rasterizer (PR 5). For each visible projected gaussian, compute the 3σ screen-space AABB, walk the touched 16×16 tiles, and emit one TileInstance per (tile, gaussian). Sort by packed u64 key (tile_id << 32 | depth_bits) so each tile's slice is depth-ascending (front-to-back) for the alpha-blend in PR 5. API: - TileInstance: tile_id + gaussian_id + depth_bits + pad (#[repr(C, align(16))], 16 B per instance — 4 per cache line) - TileBinning: tile_cols × tile_rows grid, instances Vec, tile_offsets prefix-sum (length n_tiles + 1) - TileBinning::from_projected(projected, camera) → constructor - TileBinning::tile_instances(tx, ty) → O(1) slice retrieval First-cut sort: slice::sort_unstable_by_key on the packed u64 key. If the rasterizer bench surfaces this as the hot spot, PR4-fix follows with an LSD radix sort. Tests (10): tile-size constant; ceil-div grid dims; single gaussian on tile boundary touches 1 tile; large 50-radius touches 64-tile patch; depth-sorted within tile; empty tiles return empty slice; culled gaussians not binned; AABB clamped to grid (no negative coords); off-screen gaussian zero instances; tile_offsets monotonically non-decreasing. Acceptance: cargo test --features splat3d --lib hpc::splat3d::tile → 10 passed cargo test --features splat3d --lib hpc::splat3d → 60 passed https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
…+ sub-tile coverage Folds the PP-13 brutally-honest-tester findings against ab58d17 (PR 4). One P0 (promoted from a P1 marked "promote if PR 5 is pixel-exact"), plus three P1s landed for API contract clarity and coverage gaps. ## P0 promoted — ceil-div under-counted at exact tile boundaries The PR 4 binner used `ceil(px_max / TILE_SIZE)` for the exclusive upper tile bound. When `px_max` was an EXACT multiple of 16, ceil produced the wrong value: cx = 88, r = 8 → px_max = 96 = 6·16 tx_max_old = ceil(96/16) = 6 → range [_, 6) misses tile 6 tx_max_new = floor(96/16) + 1 = 7 → range [_, 7) includes tile 6 But pixel 96 sits in tile 6 (`floor(96/16) = 6`), and the gaussian's 3σ extent reaches it. PR 5's rasterizer iterates the EXACT pixel range inside each bound tile; any gaussian whose 3σ edge lands on a tile boundary (16-pixel-aligned cx ± r) would lose its contribution to the row/column of pixels at that boundary, producing one-pixel rendering seams. PP-13 flagged this as P1 with "Promote to P0 if PR 5 is pixel-exact." PR 5 IS pixel-exact — promoting. The `floor + 1` formula: - Is correct for both integer-boundary AND fractional px_max values - Is backwards-compatible with the existing 10 tests (Worker F used radii 4, 50, 100, 12 that produced non-multiple px_max values) - Same op count as ceil (one floor + one add vs one ceil) ## P1 — clarify `tile_instances(tx, ty)` out-of-range semantics The fn returns an empty slice silently for OOB coordinates (no panic, no Result). PR 5's per-tile driver iterates `0..tile_rows × 0..tile_cols` with its own bounds, so the OOB path is defensive only. Doc-only fix: added a `# Returns` block making the silent-empty contract explicit. ## P1 — defensive debug_assert on positive depth The IEEE-754 positive-f32→u32 sort trick relies on `depth > 0`. PR 3's near cull guarantees this for `valid == 1` slots, but a caller violating the precondition would silently produce wrong sort order in release builds. `debug_assert!(depth > 0 && is_finite())` in the emit pass catches misuse without runtime cost. ## New tests (+3, total now 63) - `gaussian_edge_on_exact_tile_boundary_includes_the_boundary_tile` — pins the P0 regression. cx=88, r=8 → 2×2 = 4 instances spanning tiles {5,6}². The (6,6) corner is the one the old ceil missed. - `sub_tile_size_image_has_single_tile_grid` — 8×8 image yields tile_cols = tile_rows = 1; single gaussian fits in tile (0,0). PP-13 P1: previously untested. - `tile_offsets_sentinel_equals_instances_len` — explicit assertion that `tile_offsets[n_tiles] == instances.len()`. PR 5's uniform `instances[offsets[t]..offsets[t+1]]` slice bracket depends on this; previously only checked via monotonicity bound. ## P1 deferred (TECH_DEBT) - Two-phase index-shift comment in the count-to-prefix loop. Readability only; the inline code is already short and obvious to a reader who has seen the standard prefix-sum pattern. - Negative center + small radius coverage (e.g. cx=-5, r=2). The existing Test 8 (cx=0, r=100) covers the negative-AABB clamp; the small-radius variant is a near-duplicate. ## Test count cargo test --features splat3d --lib hpc::splat3d → 63 passed; 0 failed (was 60: +3 new) https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
…PR 5) The second math-heat PR of the sprint. For each 16×16 tile, walk its (tile_id, depth)-sorted TileInstance slice front-to-back; per row of 16 pixels (one F32x16), accumulate alpha-blended RGB via Kerbl 2023 §4. Front-to-back early-out at T < 1e-4 (below 8-bit quantization floor). Inner loop: dx, dy = gaussian_xy_broadcast - pixel_xy_vec power = -0.5 · (a·dx² + 2b·dx·dy + c·dy²) [2D Mahalanobis] alpha = min(0.99, opacity · fast_exp(power)) mask = (power ≤ 0) & (alpha ≥ 1/255) T_next = T · (1 − alpha) [via mask.select] C += mask.select(T · alpha · color, 0) break if T_next.reduce_max() < 1e-4 API: - rasterize_tile(tile_x, tile_y, binning, projected, fb, w, h, bg) - rasterize_frame(binning, projected, fb, w, h, bg) — walks every tile - T_SATURATION_EPS = 1e-4 Tests (10): empty scene = background; opaque-white center pixel; two-gaussian front-to-back composite; 50-stack early-out; outside- 3σ skip; per-tile write isolation; rasterize_frame == sum of rasterize_tile; partial-tile-at-image-edge; alpha-low background visibility; empty tile preserves background. Acceptance: cargo test --features splat3d --lib hpc::splat3d::raster → 10 passed cargo test --features splat3d --lib hpc::splat3d → 73 passed https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
…vergence tests Folds the PP-13 audit findings against 190ea35 (PR 5). Zero P0 bugs in the alpha-blend math; the audit confirmed pixel-exact correctness on every Kerbl 2023 §4 invariant traced (accumulation order, factor-of-2 cross-term, 0.99 clamp, simd_le boundary, background composite, reduce_max early-out, mask-AND portability across all three SIMD tiers). Two P1s promoted per the pattern: real bug-class holes the existing tests would miss. ## P1 → P0 promotion — bottom-edge row guard The pre-fix code guarded `pix_y >= height` at the per-pixel scatter step, AFTER the inner blend loop had already computed alpha, exp, conic, T-update for the entire row. On any image whose height isn't a multiple of TILE_SIZE (e.g. 1080 → 67.5 tile rows → 4 wasted rows per frame × 50K gaussians × per-gaussian fast-exp = ~6-8% wasted compute per frame), the dropped result was a meaningful cost. Fix: move the height guard to the top of the row loop (line 121-123), saving the entire row's blend loop on OOB rows. Test 13 covers this with a 16×17 image (one partial tile row exercising the guard) + both empty-scene and one-gaussian-at-bottom-row variants. ## P1 → P0 promotion — opacity=1.0 / 0.99 clamp regression test Every prior test used opacity ≤ 0.99, so the 0.99 alpha clamp never actually fires in the suite. Removing or retuning the clamp would break opacity=1.0 scenes (common in pre-trained Inria models — fully opaque foreground splats) by zeroing T after the first hit, vanishing every back gaussian. Pre-fix the clamp could regress silently. Fix: Test 11 sets BOTH gaussians' opacity = 1.0, asserts the back (blue) channel value is in the analytical range [0.005, 0.02] (= 0.01 × 0.99) that the clamped formula produces. An unclamped path gives B=0 (back vanished); a re-tuned clamp at 0.999 gives B≈0.001 (still distinguishable, still wrong). ## P1 — spatial-separation test (per-lane divergence) Every prior multi-gaussian test stacked gaussians at IDENTICAL screen coordinates — degenerate case where each pixel in the tile sees the same (dx, dy) for every gaussian. A broadcasted-wrong-id bug (reading gaussian_id+1 instead of gaussian_id, or transposing the per-gaussian lane offset) would pass those tests AND produce identical pixels in the degenerate case. Fix: Test 12 places two opaque gaussians at separated positions ((4,4) red, (12,12) blue) in the SAME tile, asserts pixel (4,4) is red-dominant and pixel (12,12) is blue-dominant — confirms the F32x16 per-lane divergence math distinguishes pixels correctly. ## P1 deferred (TECH_DEBT) - Explicit early-out fire-count test (Test 4 only verifies the resulting pixel color, not that the inner loop broke at gaussian 3). A test-only counter via cfg(test) would close this — but the color check IS a regression guard because no early-out + 50 opaque gaussians produces the same final pixel anyway. - Explicit power=0 boundary test. Test 3 already exercises this case (gaussians centered exactly on the pixel produce power=0), the simd_le path includes it — coverage is incidental but real. ## Test count cargo test --features splat3d --lib hpc::splat3d → 76 passed; 0 failed (was 73: +3 new tests, all green first try) https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
Sibling of hpc::renderer::Renderer for the SPO graph viz. Same shape: two RwLock<SplatFrame>s, AtomicUsize front_idx, atomic swap(). The instance pattern (vs module-level globals) lets medvol and lance-graph-render each own their own SplatRenderer. SplatFrame::tick runs the full PR 1-5 pipeline: project_batch → TileBinning::from_projected → rasterize_frame → frame_id += 1 The state mutation is guarded by &mut self (frame) or the back RwLock write guard (renderer). SplatRenderer::tick overrides frame_id with a global AtomicU64 tick_count so front_frame_id() is monotonically increasing across both frame slots (not per-slot). GaussianBatch and TileBinning do not implement Debug, so SplatFrame/SplatRenderer omit #[derive(Debug)] rather than touch PR 2/4 files. Tests (10): with_capacity sanity, tick increments frame_id, tick renders a visible gaussian, monotonic id, front/back complementarity, swap XOR-flip idempotence, tick advances front_frame_id, concurrent read doesn't block write, byte footprint > 0, two ticks render to DIFFERENT buffers (pointer identity check confirms double-buffer is using both slots). Acceptance: cargo test --features splat3d --lib hpc::splat3d::frame → 10 passed cargo test --features splat3d --lib hpc::splat3d → 86 passed https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
Closes the splat3d sprint's "Definition of done" — the full PR 1-6 pipeline now runs end-to-end on the CPU with a real binary that takes a .ply scene as input and produces image output. ## Shipped ### src/hpc/splat3d/ply.rs (~370 LoC, 4 unit tests) Minimal Inria 3DGS PLY reader. Parses ASCII header up to `end_header`, validates the canonical 62-property vertex layout (x/y/z, normals, SH DC + 45 rest, opacity, scale × 3, quat × 4), reads the binary little-endian body, applies the canonical activations inline (sigmoid opacity, exp scale, normalize quat), and reorders SH into the gaussian-major channel-major layout `sh_eval_deg3` expects. Rejects ASCII bodies, big-endian, unexpected properties, and truncated files with typed `PlyError` variants. No new top-level deps — single-file hand-rolled binary parser. ### tests/splat3d_correctness.rs (5 e2e integration tests) Walks the full PR 1-6 pipeline against a synthetic 1000-gaussian cube scene (10×10×10 grid spanning [-2,2]³, colored by position via SH DC term). - `end_to_end_synthetic_cube_renders_without_panic` — pipeline produces non-trivial pixel variance (>100 lit pixels, <50% saturated) on a 256×256 render. - `end_to_end_double_buffer_swap_preserves_consistency` — SplatRenderer tick 2x; front_frame_id advances 1, 2 across both buffers. - `end_to_end_camera_translation_changes_render` — two cameras at different world positions produce DIFFERENT framebuffers (SSD > 1). - `end_to_end_empty_scene_yields_pure_background` — zero gaussians ⇒ pixel-exact background fill. - `end_to_end_three_consecutive_ticks_preserve_invariants` — 3 ticks, frame_id monotonic 1/2/3, all pixels finite (no NaN bleed). ### examples/splat3d_flex.rs (~200 LoC, runnable demo) CLI binary that loads a `.ply` scene (or falls back to the synthetic cube), bakes a circular camera path around the origin, renders N frames, writes PPM output, reports p50/p95/p99 frame timing + fps. PPM over PNG: the sprint's "no new top-level deps" invariant rules out flate2 / png crates. PPM is 14-byte header + raw RGB bytes, trivially viewable in every image tool, and `splat3d_flex.rs` documents the choice + the deferred PNG-as-followup option. Smoke test (5 frames × 256² synthetic cube on AVX2-emulated build): p50=133.63 ms, p95=146.57 ms, p99=146.57 ms, 7.5 fps The 1080p × 500K-gaussian acceptance target awaits the Inria bicycle .ply asset and a benchmarking-only session. ### benches/RESULTS.md (real measured numbers) Baselined the four PR 1 microbenches under both default (AVX2- emulated F32x16) and `target-cpu=native` (AVX-512F) builds. Honest findings: - `sandwich_simd_x16` on AVX-512 native: 1.83× over scalar loop (below the spec's 10× aspiration; the AoS↔SoA transpose at 6 fields × 16 lanes dominates the inner-loop savings for this microbench). Filed as TECH_DEBT for the performance sprint. - `sandwich_simd_x16` on AVX2-emulated default: 0.17× (slower). Documented as the polyfill's two-`__m256`-per-`F32x16` cost. TECH_DEBT: add runtime tier dispatch so AVX2 builds prefer the scalar loop, or restructure to take SoA inputs directly. - `from_scale_quat`: 9 ns on AVX-512 native (the 3DGS canonical Σ builder; GaussianBatch::covariance_x16 SIMD-batches it). - `eig_smith_1961`: 126 ns (acos dominates; diagonal fast-path bypasses the trig). Documented the per-PR follow-up bench rows that should populate when the rasterizer-driven full-pipeline bench lands. ## Sprint state (Definition of done) - [x] 7 PRs merged to splat3d branch - [x] `cargo test --features splat3d -p ndarray` green (1859 prior tests + 90 splat3d lib tests + 5 e2e + 4 PLY = 1958) - [x] `cargo bench --features splat3d` baselined in RESULTS.md - [x] `cargo run --features splat3d --example splat3d_flex` runs end-to-end (synthetic fallback OR a .ply scene) - [x] No regression in existing ndarray benches - [x] Pillar-7 probe certified in lance-graph jc (PR #403 + the rotated-axisymmetric fix in claude/jc-pillar-7-eigvec-duplicate-fix-MAOO0) ## Deferred to follow-up sprint - Inria bicycle .ply SSIM comparison vs reference CUDA (asset download required; not in this remote container). - 1080p × 500K real-data benchmark (same). - PNG output via `image`/`png` crate (gated on the no-new-deps invariant; PPM works for the v1 demo deliverable). - Performance: AVX2-tier SIMD path optimization; tile-binner radix sort; rayon-parallel rasterize_frame. - Backward pass / training pipeline (separate sprint per the sprint prompt's "After the sprint" section). https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9e96459645
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
External-reviewer bug report against PR #153: > When a malformed or fuzzed PLY header advertises a vertex count > larger than usize::MAX / (62 * 4), this size calculation overflows > (panics in debug, wraps in release). In release that allocates a > too-small bytes buffer and the subsequent per-vertex loop indexes > past it instead of returning a PlyError, so a bad input can crash > the loader; use checked multiplication before allocating/reading > the body. ## Root cause `read_ply` computed the body byte count via: let mut bytes = vec![0u8; n_vertices * PROPERTIES_PER_VERTEX * 4]; For `n_vertices > usize::MAX / 248`: - debug: panic on the unchecked `*`. - release: wraps to a small number, allocates a too-small buffer, `read_exact` succeeds (reads only the wrapped count of bytes — often zero), then the per-vertex loop indexes far past the allocation. Crash or — worse — silent corruption if the wrapped size happens to land at a valid index. ## Fix Gate the body size with `checked_mul` BEFORE allocation: let body_bytes = n_vertices .checked_mul(PROPERTIES_PER_VERTEX) .and_then(|n| n.checked_mul(4)) .ok_or_else(|| PlyError::BadElement(format!( "vertex count {n_vertices} × {PROPERTIES_PER_VERTEX} props × 4 bytes \ overflows usize on this target ({} bits)", usize::BITS, )))?; let mut bytes = vec![0u8; body_bytes]; The downstream per-vertex `i * stride` math is now safe by transitivity — for any `i < n_vertices`, `i * stride ≤ body_bytes ≤ usize::MAX`. No further bounds work needed. ## Regression test `rejects_overflowing_vertex_count`: - Computes `overflow_count = usize::MAX / (PROPERTIES_PER_VERTEX * 4) + 1` (the smallest count that overflows on the current target). - Builds a valid PLY header advertising that count, with NO body bytes — the overflow check must fire BEFORE any I/O is attempted. - Asserts `PlyError::BadElement` with a message containing "overflows". Verified green in BOTH debug and release builds, where the wrapping (not panicking) release path is the actual security concern. ## Test count cargo test --features splat3d --lib hpc::splat3d::ply → 5 passed; 0 failed (was 4: +1 overflow regression) cargo test --features splat3d --lib hpc::splat3d → 91 passed; 0 failed (was 90: +1) cargo test --features splat3d --release --lib hpc::splat3d::ply → 5 passed; 0 failed (release-build confirms no wrap-then-corrupt) https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
Mechanical formatting fixes from `cargo fmt --all` — no semantic
changes. Brings the 12 splat3d files (PR 1-7 + fixes) into rustfmt
compliance so the workspace gate stays green.
Files reformatted:
benches/splat3d_bench.rs
examples/splat3d_flex.rs
src/hpc/splat3d/{mod,spd3,gaussian,sh,project,tile,raster,frame,ply}.rs
tests/splat3d_correctness.rs
Acceptance:
cargo fmt --all --check → clean
cargo test --features splat3d --lib hpc::splat3d → 91 passed
cargo test --features splat3d --test splat3d_correctness → 5 passed
cargo check --features splat3d --benches --bench splat3d_bench → clean
cargo check --features splat3d --example splat3d_flex → clean
https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
AdaWorldAPI
pushed a commit
that referenced
this pull request
May 18, 2026
…d / cognitive
Review of the three uploaded sprint prompts (splat3d_sprint_prompt,
splat4d_cascade_sprint, splat4d_skeleton_anchored_sprint) in context of
the cognitive-shader work drafted in PR-X4 / PR-X9 / PR-Z1.
Tags every arithmetic primitive shipped / drafted / gap across 9 layers
(L0 SPD substrate → L8 cognitive overlay), flags 3 precision classes
(EXACT / FAST OK / VERIFY), and identifies 5 concrete gaps that gate
the joint sprint:
1. Hilbert-3D encode/decode (mentioned in splat4d cascade but not
specified anywhere — single shared dependency of medical AND
cognitive paths)
2. INT4×32 packed dot product (PR-X7 thinking-style + qualia signature
— needs VNNI/dotprod strategy decision)
3. NARS truth-revision kernel + precision class (replaces alpha-compose
in W7 closure swap)
4. x265-style CTU mode encoder (skip/merge/delta/escape for PR-X9
lazy storage)
5. fast_exp_x16 precision audit for NARS context (3% rel err is OK for
alpha but suspect for cognitive confidence cascade)
Five new cross-cutting research items consolidated (atop the five from
the three sprint docs):
- Hilbert-3D algorithm choice (Butz vs Skilling vs precomputed table)
- INT4×N hardware strategy (VNNI vs software unpack vs AMX widening)
- NARS revise precision class decision (G5 (a/b/c) — lean toward (b),
drop exp from cognitive path entirely)
- CTU mode encoder λ-RDO calibration
- Codebook size const-generic strategy
Recommended ordering: Phase 0 (Hilbert-3D + INT4×N) unblocks BOTH the
medical sprint (splat4d skeleton-anchored) AND the cognitive sprint
(PR-X4 + PR-X9). Build the shared substrate first; both stacks
accelerate together. Phase 1 medical+cognitive co-substrate
(Pillar-8 + moment-match + mesh-fit). Phase 2 cognitive-only
(basin XOR-popcount + CTU + NARS). Phase 3 W7 closure swap.
Recommended 30-min math workshop before the joint plan-review savant
to lock σ_temporal values, Hilbert-3D algorithm, and NARS precision
class — removes 3 open questions per design doc and accelerates the
sprint.
Key strategic claim: Pillar-7 SPD-sandwich is the most-reused single
math op in the entire stack. It's the projection (J·W·Σ·Wᵀ·Jᵀ), the
temporal cascade (Σ_{t+1} = M·Σ_t·Mᵀ), the moment-match aggregate-up
(via Δμ·Δμᵀ outer products), and the cognitive-spacetime evolution.
Shipped in splat3d PR #153. Everything else is a semantic
reinterpretation of M.
AdaWorldAPI
pushed a commit
that referenced
this pull request
May 18, 2026
…ow LAPACK Strategic shift: the biggest arithmetic gap in the stack isn't the cognitive overlay or even the splat4d cascade — it's the shared linear-algebra layer below LAPACK that splat3d backward, openchat/gpt2 inference, AND the jc Pillar probes are all hand-rolling against. Today's duplication: - splat3d ships its own Spd3 (Smith-1961, PR #153) - lance-graph jc has THREE separate Spd2/Spd3 copies in ewa_sandwich.rs / ewa_sandwich_3d.rs / koestenberger.rs - hpc::{gpt2, openchat, stable_diffusion} inline RMSNorm/SiLU/RoPE/ attention because there's no canonical fn PR-X10 consolidates everything into crate::hpc::linalg::*: - MatN<const N> carrier + Mat2/Mat3/Mat4 type aliases - Quat algebra (mul, conjugate, slerp, from_axis_angle, to_mat) - Matrix inverse (3×3 / 4×4 closed-form + general LU-backsolve) - Symmetric eig (closed-form ≤4, Jacobi 5-64, QR > 64) - SVD (Golub-Reinsch + one-sided Jacobi) - Polar decomposition + mat_exp + mat_log (Padé scaling-and-squaring) - SH deg 0..=7 (supersedes splat3d's deg-3-only) - Conv1D + Conv2D (im2col + direct-3x3/5x5) - Batched gemm + RMSNorm/LayerNorm/GroupNorm + GELU/SiLU/Swish/Mish - RoPE + fused attention (naive + flash-attention) - Cross-entropy + softmax-backward - Tier-3 extensions: SIMD RNG dists, vml special fns (erf/gamma/Bessel), Bluestein FFT, irfft, DCT-II/IV, wavelets, sparse GEMM, tridiagonal Closed-form fast paths coexist with general-N (invariant 12) — Spd3 Smith-1961 is 10× faster than Jacobi-3 on the splat3d hot path. Don't delete the fast paths when ripping out the duplication. Worker decomposition: A1 MatN (foundation, sequential), then A2-A12 PARALLEL (max fan-out: 12 workers, all writing to separate files, all consuming MatN + crate::simd::F32x16). Matches the user's "12 agenten + 1 Koordinator" cadence. ~2 weeks parallel / ~5 weeks sequential. jc consolidation queued as follow-ons: - jc-X1: consolidate Spd2/Spd3 into private jc::hadamard (keeps jc zero-dep on ndarray; mirrors PR-X10's canonical surface) - jc-X2: Wasserstein-1 / Sinkhorn-Knopp + Hungarian for Pillar 10 - jc-X3: signature transform for Pillar 11 - jc-X4: SPD-cone ops + manifold log/exp (SO(n), Grassmannian, Stiefel) — unblocks Pillar 2 Cartan-Kuranishi PR-X10 is INDEPENDENT of PR-X4 / PR-X9 / PR-Z1 (zero file overlap), ships concurrently from claude/pr-x10-linalg-core-design branch. Maximum sprint parallelism: cognitive-shader stack AND linalg-core can spawn workers simultaneously. 7 open questions for plan-review savant. Most load-bearing: - Q1: both closed-form AND general-N? (lean: yes — invariant 12) - Q2: const-generic MatN vs concrete Mat2/3/4? (lean: both) - Q5: flash-attention in v1? (lean: yes — needed for any seq > 512) - Q7: PR-X10 concurrent with PR-X4/X9/Z1? (lean: yes) Also adds shopping-list addendum to pr-arithmetic-inventory.md cross-referencing PR-X10 as the consolidating sprint.
AdaWorldAPI
pushed a commit
that referenced
this pull request
May 18, 2026
…d / cognitive
Review of the three uploaded sprint prompts (splat3d_sprint_prompt,
splat4d_cascade_sprint, splat4d_skeleton_anchored_sprint) in context of
the cognitive-shader work drafted in PR-X4 / PR-X9 / PR-Z1.
Tags every arithmetic primitive shipped / drafted / gap across 9 layers
(L0 SPD substrate → L8 cognitive overlay), flags 3 precision classes
(EXACT / FAST OK / VERIFY), and identifies 5 concrete gaps that gate
the joint sprint:
1. Hilbert-3D encode/decode (mentioned in splat4d cascade but not
specified anywhere — single shared dependency of medical AND
cognitive paths)
2. INT4×32 packed dot product (PR-X7 thinking-style + qualia signature
— needs VNNI/dotprod strategy decision)
3. NARS truth-revision kernel + precision class (replaces alpha-compose
in W7 closure swap)
4. x265-style CTU mode encoder (skip/merge/delta/escape for PR-X9
lazy storage)
5. fast_exp_x16 precision audit for NARS context (3% rel err is OK for
alpha but suspect for cognitive confidence cascade)
Five new cross-cutting research items consolidated (atop the five from
the three sprint docs):
- Hilbert-3D algorithm choice (Butz vs Skilling vs precomputed table)
- INT4×N hardware strategy (VNNI vs software unpack vs AMX widening)
- NARS revise precision class decision (G5 (a/b/c) — lean toward (b),
drop exp from cognitive path entirely)
- CTU mode encoder λ-RDO calibration
- Codebook size const-generic strategy
Recommended ordering: Phase 0 (Hilbert-3D + INT4×N) unblocks BOTH the
medical sprint (splat4d skeleton-anchored) AND the cognitive sprint
(PR-X4 + PR-X9). Build the shared substrate first; both stacks
accelerate together. Phase 1 medical+cognitive co-substrate
(Pillar-8 + moment-match + mesh-fit). Phase 2 cognitive-only
(basin XOR-popcount + CTU + NARS). Phase 3 W7 closure swap.
Recommended 30-min math workshop before the joint plan-review savant
to lock σ_temporal values, Hilbert-3D algorithm, and NARS precision
class — removes 3 open questions per design doc and accelerates the
sprint.
Key strategic claim: Pillar-7 SPD-sandwich is the most-reused single
math op in the entire stack. It's the projection (J·W·Σ·Wᵀ·Jᵀ), the
temporal cascade (Σ_{t+1} = M·Σ_t·Mᵀ), the moment-match aggregate-up
(via Δμ·Δμᵀ outer products), and the cognitive-spacetime evolution.
Shipped in splat3d PR #153. Everything else is a semantic
reinterpretation of M.
AdaWorldAPI
pushed a commit
that referenced
this pull request
May 18, 2026
…ow LAPACK Strategic shift: the biggest arithmetic gap in the stack isn't the cognitive overlay or even the splat4d cascade — it's the shared linear-algebra layer below LAPACK that splat3d backward, openchat/gpt2 inference, AND the jc Pillar probes are all hand-rolling against. Today's duplication: - splat3d ships its own Spd3 (Smith-1961, PR #153) - lance-graph jc has THREE separate Spd2/Spd3 copies in ewa_sandwich.rs / ewa_sandwich_3d.rs / koestenberger.rs - hpc::{gpt2, openchat, stable_diffusion} inline RMSNorm/SiLU/RoPE/ attention because there's no canonical fn PR-X10 consolidates everything into crate::hpc::linalg::*: - MatN<const N> carrier + Mat2/Mat3/Mat4 type aliases - Quat algebra (mul, conjugate, slerp, from_axis_angle, to_mat) - Matrix inverse (3×3 / 4×4 closed-form + general LU-backsolve) - Symmetric eig (closed-form ≤4, Jacobi 5-64, QR > 64) - SVD (Golub-Reinsch + one-sided Jacobi) - Polar decomposition + mat_exp + mat_log (Padé scaling-and-squaring) - SH deg 0..=7 (supersedes splat3d's deg-3-only) - Conv1D + Conv2D (im2col + direct-3x3/5x5) - Batched gemm + RMSNorm/LayerNorm/GroupNorm + GELU/SiLU/Swish/Mish - RoPE + fused attention (naive + flash-attention) - Cross-entropy + softmax-backward - Tier-3 extensions: SIMD RNG dists, vml special fns (erf/gamma/Bessel), Bluestein FFT, irfft, DCT-II/IV, wavelets, sparse GEMM, tridiagonal Closed-form fast paths coexist with general-N (invariant 12) — Spd3 Smith-1961 is 10× faster than Jacobi-3 on the splat3d hot path. Don't delete the fast paths when ripping out the duplication. Worker decomposition: A1 MatN (foundation, sequential), then A2-A12 PARALLEL (max fan-out: 12 workers, all writing to separate files, all consuming MatN + crate::simd::F32x16). Matches the user's "12 agenten + 1 Koordinator" cadence. ~2 weeks parallel / ~5 weeks sequential. jc consolidation queued as follow-ons: - jc-X1: consolidate Spd2/Spd3 into private jc::hadamard (keeps jc zero-dep on ndarray; mirrors PR-X10's canonical surface) - jc-X2: Wasserstein-1 / Sinkhorn-Knopp + Hungarian for Pillar 10 - jc-X3: signature transform for Pillar 11 - jc-X4: SPD-cone ops + manifold log/exp (SO(n), Grassmannian, Stiefel) — unblocks Pillar 2 Cartan-Kuranishi PR-X10 is INDEPENDENT of PR-X4 / PR-X9 / PR-Z1 (zero file overlap), ships concurrently from claude/pr-x10-linalg-core-design branch. Maximum sprint parallelism: cognitive-shader stack AND linalg-core can spawn workers simultaneously. 7 open questions for plan-review savant. Most load-bearing: - Q1: both closed-form AND general-N? (lean: yes — invariant 12) - Q2: const-generic MatN vs concrete Mat2/3/4? (lean: both) - Q5: flash-attention in v1? (lean: yes — needed for any seq > 512) - Q7: PR-X10 concurrent with PR-X4/X9/Z1? (lean: yes) Also adds shopping-list addendum to pr-arithmetic-inventory.md cross-referencing PR-X10 as the consolidating sprint.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Lands
ndarray::hpc::splat3d— a CPU-SIMD 3D Gaussian Splatting forwardrenderer (Zwicker 2001 / Kerbl 2023), gated behind
feature = "splat3d".Pure SIMD via
crate::simd::F32x16, no GPU, no wgpu, no new top-leveldeps. Mirror's the Pillar-7 EWA-sandwich math certified in
lance-graph/crates/jc/src/ewa_sandwich_3d.rs.End-to-end: load
.plyscene → SoA gaussians → EWA projection →16×16 tile bin → depth-sorted alpha-blend → RGB framebuffer.
PRs landed (7-step orchestrator loop, plan → review → fix → commit)
spd3.rspow(t),sqrt,log_spd,from_scale_quat;sandwich(M,N)+sandwich_x16SIMD batchgaussian.rs+sh.rsGaussianBatchSoA (12 channels padded toPREFERRED_F32_LANES);Gaussian3DAoS constructor;covariance_x16SIMD-batch; degree-3 SH RGB evalproject.rstile.rsraster.rsF32x1616-pixel-row inner loop;simd_expfor alpha; T-saturation early-out at 1e-4frame.rsSplatFrame(per-tick state) +SplatRenderer(atomic double-buffer driver, sibling ofhpc::renderer::Renderer)ply.rs+ demo + e2e testexamples/splat3d_flex.rs(PPM output, circular camera path, p50/p95/p99 timing); 5-test integration suiteTotal tests: 1958 passing (1859 prior + 90 splat3d lib + 5 e2e + 4
PLY = 96 splat3d-related). No regression in existing tests.
Audit discipline
Every PR went through a PP-13 brutally-honest-tester subagent
review before merge, following the autoattended multi-agent pattern at
.claude/EN/CLAUDE-AGENT-PATTERN.md. Real bugs caught:recover_eigvecsreturned the SAME unit vector forrepeated eigenvalues → V rank-deficient → wrong
sqrt/log_spdonaxisymmetric Σ. Fixed by duplicate-detection pass + Gram-Schmidt
complement. (Same bug independently surfaced by an external reviewer
against the lance-graph Pillar-7 probe — fixed there on the
follow-up branch
claude/jc-pillar-7-eigvec-duplicate-fix-MAOO0.)is_spddoc said "Cheap" but called fulleig()—promoted to misuse-prevention rename +
# Complexitynote.letting the compiler constant-fold the scalar loop and producing
meaningless SIMD-vs-scalar numbers.
would have affected scalar AND SIMD identically. Added
sh_eval_analytical_ground_truth_at_positive_zpinning the fournon-zero basis indices.
W·Σ·Wᵀ sandwich was never exercised with non-trivial W. Added
analytical
project_non_identity_view_rotation_matches_analytical(90° Y-rotation, scale = [2, 1, 0.5] → Σ_cam = diag(0.25, 1, 4)).
project_one_scalar_innerwas 102 LoC of dead code.Deleted.
ceil(px_max / TILE_SIZE)under-counted by one tile whenpx_maxwas an exact multiple ofTILE_SIZE→ PR 5's pixel-exactrasterizer would render one-pixel seams along tile boundaries.
Fixed by
floor + 1.wasted ~6-8% compute per frame on images with height not a multiple
of 16. Moved to top of row loop.
removing the clamp would have vanished every back gaussian. Added
explicit test.
Net: ~10 P0s caught before merge, 0 P0s shipped.
Architectural invariants honored
serde,tokio,glam,nalgebra.PREFERRED_F32_LANES(orCHUNK_WIDTH=16for the projection kernel — documented).frame.tick(), nottick(frame)).#[repr(C, align(N))]on cross-FFI structs.crate::simd::F32x16only. No raw intrinsics.lance_graph_contract::splatis untouched; thisndarray::hpc::splat3dis the graphics sibling.Bench baseline (
benches/RESULTS.md)Hardware: Intel Xeon Sapphire Rapids, AVX-512F+BW+VL+VNNI+BF16, 2.1 GHz.
target-cpu=native(AVX-512)spd3_sandwich_scalar_x16_loopspd3_sandwich_simd_x16spd3_eig_smith_1961spd3_from_scale_quatThe 1.83× sandwich SIMD speedup is below the sprint's 10× aspiration —
the 6-field-per-
Spd3AoS↔SoA transpose dominates the microbench. Therasterizer hot path keeps data SoA-native at the call site, so the
amortized speedup downstream is higher. TECH_DEBT filed for the
performance sprint: SoA-input variant + AVX2-tier runtime dispatch.
Smoke test of
examples/splat3d_flexon the AVX2-emulated defaultbuild, 1000-gaussian synthetic cube at 256×256:
p50 = 133 ms, p95 = 146 ms, p99 = 146 ms, 7.5 fps
Test plan
cargo test --features splat3d— greencargo test --features splat3d --test splat3d_correctness— 5e2e tests green
cargo check --features splat3d --benches— cleancargo check --features splat3d --example splat3d_flex— cleancargo run --release --features splat3d --example splat3d_flex -- --frames 5 --width 256 --height 256— emits 1 PPM frame at/tmp/splat3d_render/frame_0000.ppmcargo test --libshows the prior 1859 testsstill passing
--no-default-featuresbuild (pre-existing breakage onmerkle_tree.rs, not introduced by this PR)Deferred to follow-up sprints
this remote container).
image/pngcrate (gated on the no-new-depsinvariant; PPM works for v1).
parallel
rasterize_frame.lance-graph-renderconsumer crate (sprint's "after the sprint"follow-on).
Companion lance-graph PR
The Pillar-7 EWA-sandwich-3D probe was merged via
lance-graphPR#403 (commit
70866b6). A follow-up fix for the same repeated-eigenvalue bug class is on
claude/jc-pillar-7-eigvec-duplicate-fix-MAOO0(commit257f541,not yet PR'd — happy to open separately on request).
https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
Generated by Claude Code