1BRC-on-substrate probe + BF16 tile-GEMM tier ladder (VDPBF16PS, PackedBf16B, LE contract) + AMX Gotcha 14#227
Conversation
…+ AMX BF16 tile-GEMM leg
Three certified paths for min/mean/max per station (413 stations, integer
tenths, bit-exact against a scalar i16/i64 reference):
- Morton scatter: stations minted as cells on a 64x64 Z-order grid,
morsel-batched (64K rows) into L1-resident SoA accumulators; measured
substrate tax vs raw hash-style reference ~2% (443 vs 453 Mrows/s).
- BF16 tile-GEMM group-by: (sum, n) as C += A[16xK]*B[Kx16] with per-row
one-hot station indicators and A rows {1, hi(t), lo(t), bf16(t)};
hi/lo split keeps every operand bf16-exact, f32 tile accumulation
exact at K=4096. Routed through the new simd re-export (below).
bf16-direct row measures the no-split cost: max |dmean| = 0.0123
tenths at 10M rows; single readings off by <= 4 tenths.
- Aggregate pyramid over Morton tiles: hierarchical (min,mean,max) per
tile/region/root in one pass, band-prune queries (90.2% prune).
simd.rs: re-export hpc::bf16_tile_gemm::bf16_tile_gemm_16x16 as
simd::bf16_tile_gemm_16x16_amx (W1a surface alignment, same pattern as
matmul_i8_to_i32; _amx suffix disambiguates the pure-FMA polyfill kernel).
AMX_GOTCHAS.md: new Gotcha 14, discovered by this probe - on an
oversubscribed VM, AMX tile state silently corrupts under host CPU
contention (idle: 413/413 exact at 100M rows; 4 busy loops: 89-152/413,
rows lost without faulting; guest core pinning does not mitigate;
AVX-512 path in the same run stays exact). Certification of AMX numerics
requires bare metal or provably idle hosts, and parity tests must also
run under deliberate load.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01E7wPMi796LPvp4A6JubdWH
The jc onebrc_agg certification (lance-graph 406b3a0) measured the exact bound exhaustively: RNE errs by at most HALF an ulp, so the bf16-direct single-reading error is <= 2 tenths (0.2 C), not 4. Attained at the range extremes. Wording-only fix in the probe report and blackboard. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01E7wPMi796LPvp4A6JubdWH
Three-tier runtime ladder, polyfill kernel untouched: AMX TDPBF16PS -> AVX-512 VDPBF16PS -> decode + F32x16 FMA polyfill. - avx512bf16_path (private): native bf16-pair multiply per zmm via _mm512_dpbf16_ps (stable Rust 1.94, verified), f32 lane accumulators, no bf16->f32 decode. Same VNNI operand layout as the AMX tile, so one packed buffer serves both tile tiers. - PackedBf16B + bf16_tile_gemm_16x16_packed: hoist the per-call VNNI pack and its allocation out of hot loops; vnni_index(row, col) lets consumers stage B directly in VNNI layout (zero pack cost for sparse / one-hot staging). - bf16_tile_gemm_tier(): names the dispatch tier for run reports (Gotcha 9 discipline). - simd.rs: re-export the new surface through ndarray::simd::* (W1a), polyfill name untouched. Exactness boundary preserved: all tiers bit-exact for bf16-exact integer operands with accumulation < 2^24, asserted by assert_eq! parity tests (vnni_index vs vnni_pack_bf16, packed == unpacked == i64 reference, VDPBF16PS exact-integer + float tolerance vs polyfill, accumulate semantics). Gotcha-14 contention parity test ships #[ignore]d - it fails on oversubscribed VMs by design; run --ignored on dedicated silicon. onebrc_cascade_probe measured effect (direct-VNNI one-hot staging): GEMM leg 3.6 -> 21.3 Mrows/s (5.9x), 23.7 -> 141.9 GMAC/s single-thread, 413/413 stations still bit-exact. 8 lib tests + 2 doctests green; clippy -D warnings clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01E7wPMi796LPvp4A6JubdWH
…_le_bytes) The persistence/mailbox face of the packed tile buffer, per the lance-graph SoaEnvelope discipline (envelope bytes are LE from creation to tombstone): - as_le_bytes(): zero-cost &[u16] -> &[u8] reinterpret; LE by construction since this module is cfg(x86_64) and x86_64 is LE-only. - from_le_bytes(): endian-correct rebuild via u16::from_le_bytes (compiles to a plain copy on LE targets). - Contract test asserts byte 2i == low byte of lane i (true LE, not just native) and that a GEMM over the roundtripped buffer stays bit-exact. First brick of the SoA-Morton batch-writer / write-hiding design; the writer itself lands lance-graph-side. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01E7wPMi796LPvp4A6JubdWH
|
Warning Review limit reached
Next review available in: 42 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (6)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…iLaneColumn Follow-up unblocking the gridlake wiring (lance-graph #635 COMMENTARY): lane J's GridBatch carries i32 min/max and i64 sum columns, but MultiLaneColumn only exposed f32/f64/u64/u8 lane views — #227's onebrc gridlake probe got away with f32 min/max columns. Add the signed integer lane widths so a batch SoA can be viewed through the gridlake carrier directly, no f32 recast. - `i32x16_from_chunk` / `i64x8_from_chunk` — LE decoders mirroring the existing `f32x16_from_chunk` / `u64x8_from_chunk` (scalar `from_le_bytes` loop, lowered to a single register-width load on LE targets; no pointer cast of the u8-aligned Arc<[u8]>). - `iter_i32x16` / `iter_i64x8` methods + `len_i32x16` / `len_i64x8`, routed through `crate::simd::{I32x16, I64x8}` per the W1a layering rule (never dipping into simd_avx512/simd_neon/scalar directly). - Parity tests: `iter_i32x16_le_round_trip` (incl. negatives, proves sign-extension survives the decode) + `iter_i64x8_le_round_trip`; extended the empty-count, 3-lane-count, and len asserts. These are layout-only zero-copy reinterpretations of the backing store (the same category as the existing typed iterators), not new compute kernels — no per-arch AVX/NEON/scalar backend needed beyond the lane types crate::simd already provides. simd_soa: 13/13 tests pass; clippy -D warnings clean; fmt clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01MLBnPuScZy6w9di2QEjsXM
What
Restates the One Billion Row Challenge workload (min/mean/max per station) on the Morton/gridlake substrate as a certified probe, and hardens the BF16 tile-GEMM path it exercises into a three-tier ladder with a pre-packed operand carrier and an explicit little-endian byte contract.
examples/onebrc_cascade_probe.rs— the probeThree aggregation paths, all certified bit-exact against a scalar
i16/i64reference (413 stations, integer tenths):(min, max, Σ, n)monoid fold. Measured substrate tax vs the raw reference: ~2% (448 vs 457 Mrows/s).(Σ, n)asC += A[16×K]·B[K×16]with per-row one-hot station indicators. The hi/lo split (hi = (t/256)·256,lo = t − hi) keeps every operand bf16-exact; A-row 3 carries naive bf16-RNE temps through the same tile as a measured answer to "is BF16 precise enough": per-station mean error 0.0123 tenths at 10M rows, single readings ≤ 2 tenths (half-ulp).src/hpc/bf16_tile_gemm.rs— tier ladder + packed BVDPBF16PS— native bf16-pair multiply per zmm, f32 lane accumulators, no bf16→f32 decode. Same VNNI operand layout as the AMX tile, so one packed buffer serves both tile tiers._mm512_dpbf16_psverified stable on Rust 1.94. Ladder: AMXTDPBF16PS→ AVX-512VDPBF16PS→ decode +F32x16FMA polyfill (polyfill kernel untouched).PackedBf16B+bf16_tile_gemm_16x16_packed— hoists the per-call VNNI pack (and its allocation) out of hot loops;vnni_index(row, col)supports staging B directly in VNNI layout (zero pack cost for one-hot/sparse staging). Probe effect: GEMM leg 3.6 → 21.3 Mrows/s (5.9×), 23.7 → 141.9 GMAC/s single-thread.as_le_bytes()(zero-cost reinterpret; LE by construction, module is x86_64-only) /from_le_bytes()(endian-correct anywhere). The persistence/mailbox face for a downstream batch writer, per the lance-graph SoaEnvelope discipline; contract test asserts byte2i= low byte of laneiand GEMM-over-roundtripped-bytes stays bit-exact.bf16_tile_gemm_tier()— names the dispatch tier for run reports (Gotcha 9 discipline).src/simd.rsre-exports the new surface throughndarray::simd::*(W1a); the_amxsuffix keeps the pure-polyfill kernel and the tile-dispatching wrapper distinct..claude/AMX_GOTCHAS.md— new Gotcha 14 (discovered by the probe)On an oversubscribed VM, AMX tile state silently corrupts under host CPU contention: idle = 413/413 exact at 100M rows; with 4 busy-loop competitors = 89–152/413 (whole rows lost, no fault); guest-side core pinning does not mitigate; the AVX-512 path in the same run stays exact, isolating the corruption to TMM state. Suspected host-vCPU-switch
XTILEDATAloss. Consequences documented: never certify AMX numerics on shared VMs; parity tests must also run under deliberate load. The correspondingtile_parity_under_cpu_contentiontest ships#[ignore]d (fails on oversubscribed VMs by design; run--ignoredon dedicated silicon).Exactness boundary
All tiers are bit-exact for bf16-exact integer operands with accumulation < 2²⁴, asserted with
assert_eq!(never tolerance) in the parity tests. Cross-repo: the algebra (partition/regroup invariance of the monoid fold, bf16 hi/lo decomposition over all 1999 tenth-values) is certified independently inlance-graph/crates/jc(onebrc_aggprobe, same branch) — kernels here, proof there.Testing
cargo test --release --lib bf16_tile_gemm: 9 passed, 1 ignored (Gotcha 14, by design) — includes VDPBF16PS exact-integer parity on realavx512bf16siliconcargo clippy --release --lib -- -D warningsclean;cargo fmt --checkclean[AMX TDPBF16PS]tier confirmed active🤖 Generated with Claude Code
https://claude.ai/code/session_01E7wPMi796LPvp4A6JubdWH
Generated by Claude Code