perf(queue): software prefetch + contiguous batch copy for SPSC rings #10

Merged
Milerius merged 16 commits into main from feat/spsc-ring-buffer
Mar 27, 2026
Conversation

@Milerius
Owner

Summary

  • Software prefetch: x86_64 write-exclusive hint (_MM_HINT_ET0) + aarch64 PRFM PLDL1KEEP/PSTL1KEEP via stable asm!, integrated into both the RingEngine and CopyRingEngine hot paths with a one-ahead prefetch pattern. Gated behind the prefetch feature flag (off by default).
  • Contiguous batch copy: Replace the per-element push_batch/pop_batch loops with a two-region copy_nonoverlapping (split at the wrap boundary). This bypasses the per-element CopyPolicy SIMD dispatch so the bulk copy can lower to an auto-vectorized memcpy. Added base_ptr() to the Storage trait for correct Miri provenance.
  • Codegen regression gate: Soft instruction-count threshold in check-asm.sh (warns if hot-path grows by >2 instructions).
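
For readers unfamiliar with the hints named in the first bullet, here is a minimal sketch of what per-architecture prefetch helpers can look like. The function names `prefetch_write` and `prefetch_read` are hypothetical and are not taken from compiler_hints.rs; only `_mm_prefetch`, `_MM_HINT_ET0`, `_MM_HINT_T0`, and the PRFM mnemonics are real.

```rust
// Hypothetical helper names; the real API lives in compiler_hints.rs.

/// x86_64: prefetch with intent to write (ET0 = exclusive, all cache levels).
#[cfg(target_arch = "x86_64")]
#[inline(always)]
pub fn prefetch_write<T>(ptr: *const T) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_ET0};
    // SAFETY: a prefetch is only a cache hint; it does not load or store
    // through `ptr` architecturally.
    unsafe { _mm_prefetch::<_MM_HINT_ET0>(ptr.cast::<i8>()) };
}

/// x86_64: prefetch for reading into all cache levels.
#[cfg(target_arch = "x86_64")]
#[inline(always)]
pub fn prefetch_read<T>(ptr: *const T) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    // SAFETY: same reasoning as `prefetch_write`.
    unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr.cast::<i8>()) };
}

/// aarch64: PRFM PSTL1KEEP = prefetch for store, L1, keep in cache.
#[cfg(target_arch = "aarch64")]
#[inline(always)]
pub fn prefetch_write<T>(ptr: *const T) {
    // SAFETY: PRFM is a hint instruction and does not access memory
    // architecturally.
    unsafe {
        core::arch::asm!(
            "prfm pstl1keep, [{addr}]",
            addr = in(reg) ptr,
            options(nostack, preserves_flags)
        );
    }
}

/// aarch64: PRFM PLDL1KEEP = prefetch for load, L1, keep in cache.
#[cfg(target_arch = "aarch64")]
#[inline(always)]
pub fn prefetch_read<T>(ptr: *const T) {
    // SAFETY: same reasoning as `prefetch_write`.
    unsafe {
        core::arch::asm!(
            "prfm pldl1keep, [{addr}]",
            addr = in(reg) ptr,
            options(nostack, preserves_flags)
        );
    }
}

/// Other targets: the hints compile to nothing.
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
#[inline(always)]
pub fn prefetch_write<T>(_ptr: *const T) {}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
#[inline(always)]
pub fn prefetch_read<T>(_ptr: *const T) {}
```

Because the hints never change program state, a caller observes the same values with or without them; only timing can differ.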

Benchmark Results (Apple M4 Pro, target-cpu=native)

| Benchmark | mantis | rtrb | crossbeam |
|---|---|---|---|
| single/u64 | 2.32ns | 2.61ns | 4.02ns |
| burst/100/u64 | 382ns | 370ns | 548ns |
| single/msg48 | 3.61ns (copy) | 5.09ns | 5.42ns |
| batch/100/msg48 | 55.9ns | N/A | N/A |
| batch/1000/msg48 | 620.6ns (was 1932ns, -68%) | N/A | N/A |

Test plan

  • cargo +nightly test --features alloc,std — all tests pass
  • cargo +nightly miri test -p mantis-queue — no UB (including batch provenance via base_ptr())
  • Differential tests: 49 exhaustive combos for batch wrap-around correctness
  • Property-based tests: proptest FIFO ordering preserved for random batch sizes/fill levels
  • cargo +nightly clippy --all-targets --features alloc,std -- -D warnings — clean
  • cargo fmt --all --check — clean
  • CI benchmarks on x86_64 + aarch64
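
As an illustration of the differential-test idea in the plan above, the sketch below compares a two-region copy against a per-element reference model over a grid of (head, len) pairs. The function names and the 7 × 7 grid are assumptions made for this example; the PR does not say which 49 combinations its tests cover.

```rust
/// Reference model: one element at a time, wrapping with `%`.
fn write_per_element(ring: &mut [u64], head: usize, src: &[u64]) {
    let cap = ring.len();
    for (i, &v) in src.iter().enumerate() {
        ring[(head + i) % cap] = v;
    }
}

/// Candidate: at most two contiguous copies, split at the wrap boundary.
/// Requires `head < ring.len()` and `src.len() <= ring.len()`.
fn write_two_region(ring: &mut [u64], head: usize, src: &[u64]) {
    let cap = ring.len();
    let first = src.len().min(cap - head); // elements before the wrap
    ring[head..head + first].copy_from_slice(&src[..first]);
    ring[..src.len() - first].copy_from_slice(&src[first..]); // after the wrap
}

/// Runs both implementations over every (head, len) pair for one capacity
/// and returns how many pairs were checked.
fn differential_check(cap: usize) -> usize {
    let mut checked = 0;
    for head in 0..cap {
        for len in 1..=cap {
            let src: Vec<u64> = (0..len as u64).map(|i| 100 + i).collect();
            let mut expected = vec![0u64; cap];
            let mut actual = vec![0u64; cap];
            write_per_element(&mut expected, head, &src);
            write_two_region(&mut actual, head, &src);
            assert_eq!(expected, actual, "head={head} len={len}");
            checked += 1;
        }
    }
    checked
}
```

With a capacity of 7 this grid has 7 head positions times 7 batch lengths, so `differential_check(7)` visits 49 pairs.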

🤖 Generated with Claude Code

Milerius and others added 15 commits March 26, 2026 17:01
Covers Phase 0 (codegen baseline), Phase 1 (software prefetch),
Phase 2 (contiguous batch copy), Phase 3 (codegen regression gate),
and Phase 4 (future hugepages). Based on literature review of
mratsim/weave, Snellman, psy-lob-saw, rigtorp, and B-Queue papers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use existing prefetch API (compiler_hints.rs), not new trait
- Fix aarch64 prefetch: inline asm (PRFM), not nonexistent __pld
- Fix x86_64 write prefetch: _MM_HINT_ET0 for exclusive hint
- Correct architecture: two separate engines, not shared
- Document storage contiguity invariant for batch copy
- Add pop_batch contiguous design (was missing)
- Add contiguous batch copy test strategy (differential, proptest)
- Acknowledge existing check-asm.sh script
- Add rollback criteria for prefetch feature

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
10-task plan covering Phases 0-3: codegen baseline, prefetch
enhancement (x86_64 ET0 + aarch64 PRFM), feature flags, RingEngine
and CopyRingEngine prefetch integration, contiguous batch push/pop
via copy_nonoverlapping, differential + property-based tests, codegen
verification gate, and benchmarking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add one-ahead write prefetch in try_push and read prefetch in try_pop,
both gated on #[cfg(feature = "prefetch")]. Unsafe slot_ptr calls are
encapsulated in new raw::prefetch_slot_write / prefetch_slot_read
helpers so engine.rs stays within the #![deny(unsafe_code)] boundary.
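
A rough sketch of the one-ahead pattern described in this commit, under several assumptions: the ring is a single-threaded toy without atomics, the helper signatures are invented for the example (only the names prefetch_slot_write / prefetch_slot_read come from the commit message), "one-ahead" is read as "hint the slot the next call will touch", and the hints are only wired up for x86_64. In the PR the prefetch calls sit behind #[cfg(feature = "prefetch")]; the gate is omitted here so the snippet builds standalone.

```rust
const CAP: usize = 8; // power of two, so `& MASK` wraps an index
const MASK: usize = CAP - 1;

// Stand-ins for the `raw` helpers: the only code here that uses `unsafe`.
#[cfg(target_arch = "x86_64")]
fn prefetch_slot_write(slots: &[u64; CAP], index: usize) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_ET0};
    let ptr: *const u64 = &slots[index & MASK];
    // SAFETY: `ptr` comes from a live reference; the prefetch is only a hint.
    unsafe { _mm_prefetch::<_MM_HINT_ET0>(ptr.cast::<i8>()) };
}

#[cfg(target_arch = "x86_64")]
fn prefetch_slot_read(slots: &[u64; CAP], index: usize) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    let ptr: *const u64 = &slots[index & MASK];
    // SAFETY: same reasoning as `prefetch_slot_write`.
    unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr.cast::<i8>()) };
}

// No-op fallbacks so the sketch builds on other targets.
#[cfg(not(target_arch = "x86_64"))]
fn prefetch_slot_write(_slots: &[u64; CAP], _index: usize) {}

#[cfg(not(target_arch = "x86_64"))]
fn prefetch_slot_read(_slots: &[u64; CAP], _index: usize) {}

/// Toy single-threaded ring; the real engines use atomics for head/tail.
struct Ring {
    slots: [u64; CAP],
    head: usize, // next slot to write
    tail: usize, // next slot to read
}

// The "engine" side contains no `unsafe`, mirroring the deny boundary.
#[deny(unsafe_code)]
impl Ring {
    fn new() -> Self {
        Ring { slots: [0; CAP], head: 0, tail: 0 }
    }

    fn try_push(&mut self, value: u64) -> Result<(), u64> {
        if self.head.wrapping_sub(self.tail) == CAP {
            return Err(value); // full
        }
        self.slots[self.head & MASK] = value;
        self.head = self.head.wrapping_add(1);
        // One ahead: warm the slot the *next* push will write.
        prefetch_slot_write(&self.slots, self.head);
        Ok(())
    }

    fn try_pop(&mut self) -> Option<u64> {
        if self.head == self.tail {
            return None; // empty
        }
        let value = self.slots[self.tail & MASK];
        self.tail = self.tail.wrapping_add(1);
        // One ahead: warm the slot the *next* pop will read.
        prefetch_slot_read(&self.slots, self.tail);
        Some(value)
    }
}
```

Keeping the pointer handling inside the two helper functions means the engine methods stay free of `unsafe`, which is the property the commit message describes.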

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace per-element CopyPolicy dispatch in push_batch with at most two
copy_nonoverlapping calls (one per contiguous chunk around the ring
wrap point). This compiles to memcpy which auto-vectorizes, versus 1000
separate copies for a 1000-element batch.

Add write_batch_copy to copy_ring::raw using Storage::base_ptr() for
correct Miri-validated pointer provenance over the full slot array.
Add base_ptr() to the Storage trait (InlineStorage and HeapStorage),
returning a *mut MaybeUninit<T> with provenance covering all capacity
slots — required for multi-element bulk copies.

Add push_batch_wraparound_contiguous test exercising the two-chunk path
where head crosses the end of the backing array back to index 0.
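
The two-chunk write can be sketched roughly as follows. The name write_batch_copy comes from this commit, but the signature, the safety contract, and the demo function are assumptions made for illustration; the real helper takes its base pointer from Storage::base_ptr().

```rust
use core::mem::MaybeUninit;
use core::ptr;

/// Copies `src` into the ring starting at slot `head`, in at most two chunks.
///
/// # Safety
/// `base` must be valid for writes of `cap` slots, `head < cap`,
/// `src.len() <= cap`, and `src` must not overlap the slot array.
unsafe fn write_batch_copy<T: Copy>(
    base: *mut MaybeUninit<T>,
    cap: usize,
    head: usize,
    src: &[T],
) {
    let first = src.len().min(cap - head); // slots before the wrap point
    let second = src.len() - first; // slots after wrapping to index 0
    unsafe {
        // `MaybeUninit<T>` has the same layout as `T`, so the cast is sound.
        ptr::copy_nonoverlapping(src.as_ptr().cast::<MaybeUninit<T>>(), base.add(head), first);
        ptr::copy_nonoverlapping(src.as_ptr().add(first).cast::<MaybeUninit<T>>(), base, second);
    }
}

/// Demo: a 3-element batch written at head = 3 in a 4-slot ring wraps to 0.
fn demo_push_wraparound() -> [u64; 4] {
    let mut storage = [MaybeUninit::<u64>::uninit(); 4];
    // SAFETY: `storage` has 4 slots and both calls satisfy the contract above.
    unsafe {
        write_batch_copy(storage.as_mut_ptr(), 4, 0, &[0u64; 4]); // initialise all slots
        write_batch_copy(storage.as_mut_ptr(), 4, 3, &[10, 11, 12]);
    }
    // SAFETY: every slot was initialised by the first call.
    storage.map(|slot| unsafe { slot.assume_init() })
}
```

In the demo, 10 lands in slot 3 and the remaining two elements wrap to slots 0 and 1, so the whole batch costs two copies regardless of its length.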

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the per-element pop_batch loop with a two-chunk
copy_nonoverlapping pattern symmetric to push_batch. Add
read_batch_copy in copy_ring/raw/mod.rs (mirrors write_batch_copy).
Add pop_batch_wraparound_contiguous test covering the wrap path.
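
The read side mirrors the write side. Below is a sketch under the same caveats: read_batch_copy is the name used in this commit, while the signature, safety contract, and demo are hypothetical.

```rust
use core::mem::MaybeUninit;
use core::ptr;

/// Copies `dst.len()` slots out of the ring starting at slot `tail`,
/// in at most two chunks.
///
/// # Safety
/// `base` must be valid for reads of `cap` slots, `tail < cap`,
/// `dst.len() <= cap`, and `dst` must not overlap the slot array.
unsafe fn read_batch_copy<T: Copy>(
    base: *const MaybeUninit<T>,
    cap: usize,
    tail: usize,
    dst: &mut [MaybeUninit<T>],
) {
    let first = dst.len().min(cap - tail); // slots before the wrap point
    let second = dst.len() - first; // slots after wrapping to index 0
    unsafe {
        ptr::copy_nonoverlapping(base.add(tail), dst.as_mut_ptr(), first);
        ptr::copy_nonoverlapping(base, dst.as_mut_ptr().add(first), second);
    }
}

/// Demo: reading 3 elements at tail = 2 from a 4-slot ring wraps to slot 0.
fn demo_pop_wraparound() -> [u64; 3] {
    let storage: [MaybeUninit<u64>; 4] = [0u64, 1, 2, 3].map(MaybeUninit::new);
    let mut out = [MaybeUninit::<u64>::uninit(); 3];
    // SAFETY: `storage` has 4 initialised slots, tail = 2 < 4, and 3 <= 4.
    unsafe { read_batch_copy(storage.as_ptr(), 4, 2, &mut out) };
    // SAFETY: all three output slots were written by the copy above.
    out.map(|slot| unsafe { slot.assume_init() })
}
```

The demo reads slots 2 and 3 in the first chunk and slot 0 in the second, preserving FIFO order across the wrap.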

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ch ops

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Batch/1000: -68% (contiguous copy_nonoverlapping vs per-element loop)
Single-op copy: -2.4% (prefetch on aarch64 Apple M4 Pro)
General single-op: +2% (within noise threshold)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

github-actions bot commented Mar 27, 2026

Benchmark Report

Commit: 62f655e0684d1e4fa356bfe9f2f3763b68e05ba3

Linux

CPU: AMD EPYC 7763 64-Core Processor | Arch: x86_64 | Compiler: rustc 1.96.0-nightly (23903d01c 2026-03-26)

Latency (ns/op, lower is better)

Single Push+Pop

Element crossbeam mantis/copy mantis/general mantis/inline rtrb
msg48 23.26 9.35 🏆 31.73 31.85 21.22
msg64 7.98 3.74 🏆 20.32 20.39 6.99
u64 5.01 2.51 🏆 - 3.77 3.42

Burst 100

Element crossbeam mantis/copy mantis/inline rtrb
msg48 3025.85 235.53 🏆 2595.34 2339.89
msg64 1048.93 218.79 🏆 1498.92 778.14
u64 780.8 176.14 🏆 376.36 311.86

Burst 1000

Element crossbeam mantis/copy mantis/inline rtrb
msg48 30234.29 2461.5 🏆 26208.54 23343.07
msg64 11541.63 2354.53 🏆 15330.93 8181.16
u64 7812 1661.06 🏆 3780.27 3062.3

Batch 100

Element mantis/copy
msg48 71.41 🏆
u64 14.98 🏆

Batch 1000

Element mantis/copy
msg48 1560.3 🏆
u64 100.02 🏆

Full Drain

Element mantis/inline
u64 1617.27 🏆
Instructions per Op (lower is better)
Full results (all fields)
Workload ns/op p50 p99 cycles insns bmiss l1d llc
spsc/inline/single_item/u64 3.77 3.7 4 12 - - - -
spsc/inline/single_item/msg48 31.85 31.7 34.4 107.6 - - - -
spsc/inline/single_item/msg64 20.39 20.3 21.6 80.8 - - - -
spsc/inline/burst_100/u64 376.36 374.9 394.2 1124.4 - - - -
spsc/inline/burst_100/msg48 2595.34 2592.9 2619.7 9444.3 - - - -
spsc/inline/burst_100/msg64 1498.92 1494.8 1517.6 4585.1 - - - -
spsc/inline/burst_1000/u64 3780.27 3739.8 4224.3 13231.5 - - - -
spsc/inline/burst_1000/msg48 26208.54 26140.1 27406.1 110451.6 - - - -
spsc/inline/burst_1000/msg64 15330.93 15252.9 16246 54501.5 - - - -
spsc/inline/full_drain/u64 1617.27 1611.9 1675.9 5147.2 - - - -
copy/single/u64 2.51 2.5 2.7 9.6 - - - -
copy/single/msg48 9.35 9.3 9.4 57.9 - - - -
copy/single/msg64 3.74 3.7 4.4 11.7 - - - -
general/single/msg48 31.73 31.7 32 147 - - - -
general/single/msg64 20.32 20.3 20.4 80.8 - - - -
copy/burst/100/u64 176.14 175.6 183.9 754 - - - -
copy/burst/100/msg48 235.53 237.5 244.2 770.5 - - - -
copy/burst/100/msg64 218.79 217.6 252.8 708.9 - - - -
copy/burst/1000/u64 1661.06 1648.9 1830.9 5414.4 - - - -
copy/burst/1000/msg48 2461.5 2463.4 2550.6 9827.5 - - - -
copy/burst/1000/msg64 2354.53 2362.8 2464.2 9165 - - - -
copy/batch/100/u64 14.98 14.9 16.2 48.3 - - - -
copy/batch/100/msg48 71.41 71.3 72.9 264.8 - - - -
copy/batch/1000/u64 100.02 99.5 105.6 301.7 - - - -
copy/batch/1000/msg48 1560.3 1551.9 1659.1 4886.2 - - - -
spsc/rtrb/single_item/u64 3.42 3.4 3.5 10.2 - - - -
spsc/rtrb/single_item/msg48 21.22 20.9 23.1 62 - - - -
spsc/rtrb/single_item/msg64 6.99 6.6 8.6 19.4 - - - -
spsc/rtrb/burst_100/u64 311.86 306.9 340.2 1224.5 - - - -
spsc/rtrb/burst_100/msg48 2339.89 2381.8 2610.6 9611.6 - - - -
spsc/rtrb/burst_100/msg64 778.14 839.6 1015 2946.1 - - - -
spsc/rtrb/burst_1000/u64 3062.3 3012.9 3291.8 9484.2 - - - -
spsc/rtrb/burst_1000/msg48 23343.07 23774.9 25589.4 77457.5 - - - -
spsc/rtrb/burst_1000/msg64 8181.16 7166 10853.7 25637.6 - - - -
spsc/crossbeam/single_item/u64 5.01 5 5.1 19.2 - - - -
spsc/crossbeam/single_item/msg48 23.26 23.1 24.3 66.1 - - - -
spsc/crossbeam/single_item/msg64 7.98 8 8.6 26.3 - - - -
spsc/crossbeam/burst_100/u64 780.8 779.4 796.4 2410.3 - - - -
spsc/crossbeam/burst_100/msg48 3025.85 2995.3 3285.6 9342.3 - - - -
spsc/crossbeam/burst_100/msg64 1048.93 1040.5 1146.9 3839 - - - -
spsc/crossbeam/burst_1000/u64 7812 7793 8097.5 27999 - - - -
spsc/crossbeam/burst_1000/msg48 30234.29 30058.8 32378.9 110312.2 - - - -
spsc/crossbeam/burst_1000/msg64 11541.63 11527.9 11680.5 36106.5 - - - -
macOS

CPU: Apple M1 (Virtual) | Arch: aarch64 | Compiler: rustc 1.96.0-nightly (23903d01c 2026-03-26)

Latency (ns/op, lower is better)

Single Push+Pop

Element crossbeam mantis/copy mantis/general mantis/inline rtrb
msg48 10.93 11.17 10.68 9.86 9.3 🏆
msg64 11.71 9.74 🏆 11.19 10.56 10.53
u64 8.55 9.58 - 8.59 7.16 🏆

Burst 100

Element crossbeam mantis/copy mantis/inline rtrb
msg48 1781.58 1909.82 1998.51 744.66 🏆
msg64 1923.23 1817.29 1868.02 805.86 🏆
u64 1747.66 1739.75 1625.03 649.78 🏆

Burst 1000

Element crossbeam mantis/copy mantis/inline rtrb
msg48 18495.54 18871.19 22051.55 7488.12 🏆
msg64 19562.54 18307.81 🏆 21628.76 21395.23
u64 17313.62 17110.55 17114.42 6762.28 🏆

Batch 100

Element mantis/copy
msg48 85.31 🏆
u64 24.7 🏆

Batch 1000

Element mantis/copy
msg48 839.11 🏆
u64 130.84 🏆

Full Drain

Element mantis/inline
u64 17826.35 🏆
Instructions per Op (lower is better)
Full results (all fields)
Workload ns/op p50 p99 cycles insns bmiss l1d llc
spsc/inline/single_item/u64 8.59 8.5 8.9 0.3 - - - -
spsc/inline/single_item/msg48 9.86 9.8 10.1 0.4 - - - -
spsc/inline/single_item/msg64 10.56 10.4 11.2 0.4 - - - -
spsc/inline/burst_100/u64 1625.03 1621.4 1659.7 50.9 - - - -
spsc/inline/burst_100/msg48 1998.51 1895.7 2431.5 68.5 - - - -
spsc/inline/burst_100/msg64 1868.02 1836 2121.5 61.7 - - - -
spsc/inline/burst_1000/u64 17114.42 17206.3 18301 649.2 - - - -
spsc/inline/burst_1000/msg48 22051.55 21758.5 25166.9 944.7 - - - -
spsc/inline/burst_1000/msg64 21628.76 21281.6 25375 866.6 - - - -
spsc/inline/full_drain/u64 17826.35 17975.1 20140.5 682.2 - - - -
copy/single/u64 9.58 9.4 11.1 0.4 - - - -
copy/single/msg48 11.17 10.5 14.2 0.4 - - - -
copy/single/msg64 9.74 9.8 10.5 0.4 - - - -
general/single/msg48 10.68 10.7 11.9 0.4 - - - -
general/single/msg64 11.19 11.3 12.6 0.3 - - - -
copy/burst/100/u64 1739.75 1745.1 1870.8 54.3 - - - -
copy/burst/100/msg48 1909.82 1912.5 2327.4 63.8 - - - -
copy/burst/100/msg64 1817.29 1847.1 1966.8 59.5 - - - -
copy/burst/1000/u64 17110.55 17432.3 20006.9 658.5 - - - -
copy/burst/1000/msg48 18871.19 18624.9 20524.9 749.6 - - - -
copy/burst/1000/msg64 18307.81 18383.6 19594.8 711.1 - - - -
copy/batch/100/u64 24.7 23.6 31.1 0.7 - - - -
copy/batch/100/msg48 85.31 83.6 101.5 3.4 - - - -
copy/batch/1000/u64 130.84 128.9 153.1 4.5 - - - -
copy/batch/1000/msg48 839.11 816 1061.2 24 - - - -
spsc/rtrb/single_item/u64 7.16 7.1 8.2 0.2 - - - -
spsc/rtrb/single_item/msg48 9.3 9.1 10.7 0.3 - - - -
spsc/rtrb/single_item/msg64 10.53 10.1 13.1 0.4 - - - -
spsc/rtrb/burst_100/u64 649.78 630.3 825 25.6 - - - -
spsc/rtrb/burst_100/msg48 744.66 733.9 824.5 21.8 - - - -
spsc/rtrb/burst_100/msg64 805.86 795.5 906.8 24.7 - - - -
spsc/rtrb/burst_1000/u64 6762.28 6567.5 7784.2 214.6 - - - -
spsc/rtrb/burst_1000/msg48 7488.12 7411.6 8021.6 257.8 - - - -
spsc/rtrb/burst_1000/msg64 21395.23 21535.8 23285.9 951.3 - - - -
spsc/crossbeam/single_item/u64 8.55 8.6 8.7 0.3 - - - -
spsc/crossbeam/single_item/msg48 10.93 10.9 11.5 0.4 - - - -
spsc/crossbeam/single_item/msg64 11.71 11.6 12.9 0.3 - - - -
spsc/crossbeam/burst_100/u64 1747.66 1745.9 1850.3 58.3 - - - -
spsc/crossbeam/burst_100/msg48 1781.58 1766.7 1886.4 58.7 - - - -
spsc/crossbeam/burst_100/msg64 1923.23 1894.8 2162.7 63.7 - - - -
spsc/crossbeam/burst_1000/u64 17313.62 17160 18866 674.7 - - - -
spsc/crossbeam/burst_1000/msg48 18495.54 18425.5 19152.4 726.5 - - - -
spsc/crossbeam/burst_1000/msg64 19562.54 19467 20690.7 797.2 - - - -

All implementations (inline, copy, rtrb, crossbeam) now test identical
workload shapes: single, burst_100, burst_1000 × u64, msg48, msg64.
Batch benchmarks remain mantis-only (copy ring). Removed non-comparable
[u8;64] and [u8;256] benchmarks. Split functions to satisfy clippy
too_many_lines lint.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Milerius Milerius merged commit e229eda into main Mar 27, 2026
17 checks passed