perf(queue): software prefetch + contiguous batch copy for SPSC rings#10
Merged
Covers Phase 0 (codegen baseline), Phase 1 (software prefetch), Phase 2 (contiguous batch copy), Phase 3 (codegen regression gate), and Phase 4 (future hugepages). Based on literature review of mratsim/weave, Snellman, psy-lob-saw, rigtorp, and B-Queue papers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use existing prefetch API (compiler_hints.rs), not new trait
- Fix aarch64 prefetch: inline asm (PRFM), not nonexistent __pld
- Fix x86_64 write prefetch: _MM_HINT_ET0 for exclusive hint
- Correct architecture: two separate engines, not shared
- Document storage contiguity invariant for batch copy
- Add pop_batch contiguous design (was missing)
- Add contiguous batch copy test strategy (differential, proptest)
- Acknowledge existing check-asm.sh script
- Add rollback criteria for prefetch feature
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
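A minimal sketch of the two architecture-specific write hints listed above, assuming a free function named prefetch_write (a hypothetical name; the actual API in compiler_hints.rs is not shown in this conversation). On x86_64 it calls the _mm_prefetch intrinsic with _MM_HINT_ET0, on aarch64 it issues PRFM PSTL1KEEP through stable inline asm, and on any other target it does nothing.

```rust
/// Hypothetical helper: hint that the cache line at `ptr` will soon be written.
/// A prefetch never faults, so any pointer value is acceptable.
#[cfg(target_arch = "x86_64")]
#[inline(always)]
pub fn prefetch_write<T>(ptr: *const T) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_ET0};
    // SAFETY: prefetch is only a hint; it does not dereference `ptr`.
    unsafe { _mm_prefetch::<_MM_HINT_ET0>(ptr.cast::<i8>()) };
}

#[cfg(target_arch = "aarch64")]
#[inline(always)]
pub fn prefetch_write<T>(ptr: *const T) {
    // SAFETY: PRFM is only a hint; it does not fault on invalid addresses.
    unsafe {
        core::arch::asm!(
            "prfm pstl1keep, [{p}]",
            p = in(reg) ptr,
            options(nostack, preserves_flags),
        );
    }
}

/// Fallback for targets without a known prefetch instruction: no-op.
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
#[inline(always)]
pub fn prefetch_write<T>(_ptr: *const T) {}
```

A read-side variant would differ only in the hint: _MM_HINT_T0 on x86_64 and PLDL1KEEP on aarch64.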
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
10-task plan covering Phases 0-3: codegen baseline, prefetch enhancement (x86_64 ET0 + aarch64 PRFM), feature flags, RingEngine and CopyRingEngine prefetch integration, contiguous batch push/pop via copy_nonoverlapping, differential + property-based tests, codegen verification gate, and benchmarking. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add one-ahead write prefetch in try_push and read prefetch in try_pop, both gated on #[cfg(feature = "prefetch")]. Unsafe slot_ptr calls are encapsulated in new raw::prefetch_slot_write / prefetch_slot_read helpers so engine.rs stays within the #![deny(unsafe_code)] boundary. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
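The one-ahead pattern can be sketched as follows. This is a single-threaded simplification, not the PR's engine: the Ring type, its fields, and the no-op prefetch_slot_write stand-in are assumptions for illustration, and the atomics of a real SPSC ring are omitted. The point is only where the hint goes: right after a push, the slot the next push will write is prefetched.

```rust
use core::mem::MaybeUninit;

/// Stand-in for the PR's `raw::prefetch_slot_write` (real signature unknown).
/// A real version would issue a prefetch instruction; this one is a no-op.
#[allow(dead_code)]
#[inline(always)]
fn prefetch_slot_write<T>(_slot: *const MaybeUninit<T>) {}

/// Hypothetical single-threaded ring with a power-of-two capacity `N`.
pub struct Ring<T, const N: usize> {
    slots: [MaybeUninit<T>; N],
    head: usize, // total pushes so far
    tail: usize, // total pops so far
}

impl<T: Copy, const N: usize> Ring<T, N> {
    pub fn new() -> Self {
        assert!(N.is_power_of_two());
        Self { slots: [MaybeUninit::uninit(); N], head: 0, tail: 0 }
    }

    pub fn try_push(&mut self, value: T) -> Result<(), T> {
        if self.head.wrapping_sub(self.tail) == N {
            return Err(value); // full
        }
        self.slots[self.head & (N - 1)].write(value);
        self.head = self.head.wrapping_add(1);
        // One-ahead: hint the slot that the *next* push will write.
        #[cfg(feature = "prefetch")]
        prefetch_slot_write(&self.slots[self.head & (N - 1)]);
        Ok(())
    }

    pub fn try_pop(&mut self) -> Option<T> {
        if self.head == self.tail {
            return None; // empty
        }
        // SAFETY: every slot in [tail, head) was written by try_push.
        let value = unsafe { self.slots[self.tail & (N - 1)].assume_init() };
        self.tail = self.tail.wrapping_add(1);
        Some(value)
    }
}
```

With the feature disabled the hint compiles away entirely, so the default build pays nothing for it.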
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace per-element CopyPolicy dispatch in push_batch with at most two copy_nonoverlapping calls (one per contiguous chunk around the ring wrap point). This compiles to memcpy which auto-vectorizes, versus 1000 separate copies for a 1000-element batch. Add write_batch_copy to copy_ring::raw using Storage::base_ptr() for correct Miri-validated pointer provenance over the full slot array. Add base_ptr() to the Storage trait (InlineStorage and HeapStorage), returning a *mut MaybeUninit<T> with provenance covering all capacity slots — required for multi-element bulk copies. Add push_batch_wraparound_contiguous test exercising the two-chunk path where head crosses the end of the backing array back to index 0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
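The two-chunk write can be sketched like this. The function name write_two_chunks and its slice-based signature are assumptions for illustration; the PR's write_batch_copy works from Storage::base_ptr() and its exact signature is not shown here. Taking the whole slot slice plays the same role as base_ptr(): the pointer's provenance covers every slot, so both copies stay in bounds.

```rust
use core::mem::MaybeUninit;
use core::ptr;

/// Hypothetical sketch: copy `src` into the ring `slots`, starting at logical
/// index `head`, with at most two contiguous copies split at the wrap point.
/// The caller is assumed to have checked that enough free space exists.
pub fn write_two_chunks<T: Copy>(slots: &mut [MaybeUninit<T>], head: usize, src: &[T]) {
    let cap = slots.len();
    assert!(cap > 0 && src.len() <= cap);

    let start = head % cap;
    let first = src.len().min(cap - start); // elements up to the end of the array
    let second = src.len() - first;         // elements that wrap to index 0

    let base = slots.as_mut_ptr().cast::<T>();
    // SAFETY: start + first <= cap and second <= cap, so both destination
    // ranges lie inside `slots`; `src` is a separate borrow, so no overlap.
    unsafe {
        ptr::copy_nonoverlapping(src.as_ptr(), base.add(start), first);
        ptr::copy_nonoverlapping(src.as_ptr().add(first), base, second);
    }
}
```

When the batch does not cross the end of the array, second is zero and the second call copies nothing.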
Replace the per-element pop_batch loop with a two-chunk copy_nonoverlapping pattern symmetric to push_batch. Add read_batch_copy in copy_ring/raw/mod.rs (mirrors write_batch_copy). Add pop_batch_wraparound_contiguous test covering the wrap path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
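The read side mirrors the write side. The sketch below uses a hypothetical name, read_two_chunks, and a slice-based signature chosen for illustration; the PR's read_batch_copy is not shown in this conversation. It is marked unsafe because the caller must guarantee that the slots being read were previously written.

```rust
use core::mem::MaybeUninit;
use core::ptr;

/// Hypothetical sketch: fill `dst` from the ring `slots`, starting at logical
/// index `tail`, with at most two contiguous copies split at the wrap point.
///
/// # Safety
/// The `dst.len()` slots starting at `tail % slots.len()` (wrapping) must be
/// initialized.
pub unsafe fn read_two_chunks<T: Copy>(slots: &[MaybeUninit<T>], tail: usize, dst: &mut [T]) {
    let cap = slots.len();
    assert!(cap > 0 && dst.len() <= cap);

    let start = tail % cap;
    let first = dst.len().min(cap - start); // elements up to the end of the array
    let second = dst.len() - first;         // elements that wrap to index 0

    let base = slots.as_ptr().cast::<T>();
    // SAFETY: both source ranges lie inside `slots` and are initialized per
    // the contract above; `dst` is a separate mutable borrow, so no overlap.
    unsafe {
        ptr::copy_nonoverlapping(base.add(start), dst.as_mut_ptr(), first);
        ptr::copy_nonoverlapping(base, dst.as_mut_ptr().add(first), second);
    }
}
```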
…ch ops Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Batch/1000: -68% (contiguous copy_nonoverlapping vs per-element loop)
Single-op copy: -2.4% (prefetch on aarch64 Apple M4 Pro)
General single-op: +2% (within noise threshold)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benchmark Report
Linux: latency (ns/op, lower is better) for Single Push+Pop, Burst 100, Burst 1000, Batch 100, Batch 1000, and Full Drain; instructions per op (lower is better).
macOS: latency (ns/op, lower is better) for Single Push+Pop, Burst 100, Burst 1000, Batch 100, Batch 1000, and Full Drain; instructions per op (lower is better).
All implementations (inline, copy, rtrb, crossbeam) now test identical workload shapes: single, burst_100, burst_1000 × u64, msg48, msg64. Batch benchmarks remain mantis-only (copy ring). Removed non-comparable [u8;64] and [u8;256] benchmarks. Split functions to satisfy clippy too_many_lines lint. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Software prefetch: x86_64 (_MM_HINT_ET0) + aarch64 PRFM PLDL1KEEP/PSTL1KEEP via stable asm!, integrated into both RingEngine and CopyRingEngine hot paths with a one-ahead prefetch pattern. Behind the prefetch feature flag (off by default).
- Contiguous batch copy: replaced per-element push_batch/pop_batch loops with two-region copy_nonoverlapping (split at the wrap boundary), bypassing CopyPolicy SIMD dispatch for bulk memcpy auto-vectorization. Added base_ptr() to the Storage trait for correct Miri provenance.
- Codegen regression gate: check-asm.sh (warns if the hot path grows by >2 instructions).
Benchmark Results (Apple M4 Pro, target-cpu=native)
Test plan
- cargo +nightly test --features alloc,std: all tests pass
- cargo +nightly miri test -p mantis-queue: no UB (including batch provenance via base_ptr())
- cargo +nightly clippy --all-targets --features alloc,std -- -D warnings: clean
- cargo fmt --all --check: clean
🤖 Generated with Claude Code