perf(queue): software prefetch + contiguous batch copy for SPSC rings by Milerius · Pull Request #10 · Milerius/Mantis

Milerius · 2026-03-27T04:33:18Z

Summary

Software prefetch: x86_64 write-exclusive (_MM_HINT_ET0) + aarch64 PRFM PLDL1KEEP/PSTL1KEEP via stable asm!, integrated into both RingEngine and CopyRingEngine hot paths with a one-ahead prefetch pattern. Gated behind the prefetch feature flag (off by default).
Contiguous batch copy: Replaces the per-element push_batch/pop_batch loops with a two-region copy_nonoverlapping (split at the wrap boundary), bypassing the CopyPolicy SIMD dispatch so the bulk copy lowers to an auto-vectorized memcpy. Adds base_ptr() to the Storage trait for correct Miri provenance.
Codegen regression gate: Soft instruction-count threshold in check-asm.sh (warns if a hot path grows by more than 2 instructions).
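The soft instruction-count gate from the last item can be pictured roughly as follows. This is only a Rust sketch of the idea: the real check lives in check-asm.sh, whose contents are not shown here, and the names `Gate`, `count_instructions`, and `soft_gate`, along with the rules for what counts as an instruction line, are assumptions made for illustration.

```rust
/// Outcome of the soft gate: it only warns, it never fails the build.
#[derive(Debug, PartialEq)]
enum Gate {
    Ok,
    Warn { grew_by: usize },
}

/// Counts instruction lines in an assembly listing.
/// Assumption: blank lines, labels (`foo:`), directives (`.cfi_*`) and
/// comment-only lines are not instructions; everything else is.
fn count_instructions(asm: &str) -> usize {
    asm.lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .filter(|l| !l.ends_with(':'))
        .filter(|l| !l.starts_with('.'))
        .filter(|l| !l.starts_with('#') && !l.starts_with("//") && !l.starts_with(';'))
        .count()
}

/// Compares a hot path against its baseline and warns when it grew by
/// more than `slack` instructions (the PR description uses 2).
fn soft_gate(baseline: &str, current: &str, slack: usize) -> Gate {
    let before = count_instructions(baseline);
    let after = count_instructions(current);
    let grew_by = after.saturating_sub(before);
    if grew_by > slack {
        Gate::Warn { grew_by }
    } else {
        Gate::Ok
    }
}
```

A shrinking hot path never trips the gate because the growth is computed with a saturating subtraction.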

Benchmark Results (Apple M4 Pro, `target-cpu=native`)

Benchmark	mantis	rtrb	crossbeam
single/u64	2.32ns	2.61ns	4.02ns
burst/100/u64	382ns	370ns	548ns
single/msg48	3.61ns (copy)	5.09ns	5.42ns
batch/100/msg48	55.9ns	N/A	N/A
batch/1000/msg48	620.6ns (was 1932ns, -68%)	N/A	N/A

Test plan

cargo +nightly test --features alloc,std — all tests pass
cargo +nightly miri test -p mantis-queue — no UB (including batch provenance via base_ptr())
Differential tests: 49 exhaustive combos for batch wrap-around correctness
Property-based tests: proptest FIFO ordering preserved for random batch sizes/fill levels
cargo +nightly clippy --all-targets --features alloc,std -- -D warnings — clean
cargo fmt --all --check — clean
CI benchmarks on x86_64 + aarch64
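The differential tests in this plan check the batch path against a trusted reference across wrap-around positions. A minimal sketch of that style of test is below. It assumes a toy single-threaded ring (`ToyRing`, hypothetical) over already-initialized storage, uses the safe `copy_from_slice` as a stand-in for `copy_nonoverlapping`, and uses `VecDeque` as the reference model. The grid of start offsets and batch sizes is chosen for illustration and is not the PR's 49 combinations.

```rust
use std::collections::VecDeque;

/// Toy ring: power-of-two capacity, monotonic head/tail counters.
struct ToyRing {
    buf: Vec<u32>,
    head: usize,
    tail: usize,
}

impl ToyRing {
    fn with_capacity(cap: usize) -> Self {
        assert!(cap.is_power_of_two());
        Self { buf: vec![0; cap], head: 0, tail: 0 }
    }

    fn len(&self) -> usize {
        self.head - self.tail
    }

    /// Pushes as many items as fit, in at most two contiguous copies.
    fn push_batch(&mut self, src: &[u32]) -> usize {
        let cap = self.buf.len();
        let n = src.len().min(cap - self.len());
        let start = self.head & (cap - 1);
        let first = n.min(cap - start); // items before the wrap point
        self.buf[start..start + first].copy_from_slice(&src[..first]);
        self.buf[..n - first].copy_from_slice(&src[first..n]);
        self.head += n;
        n
    }

    /// Pops up to `dst.len()` items, in at most two contiguous copies.
    fn pop_batch(&mut self, dst: &mut [u32]) -> usize {
        let cap = self.buf.len();
        let n = dst.len().min(self.len());
        let start = self.tail & (cap - 1);
        let first = n.min(cap - start);
        dst[..first].copy_from_slice(&self.buf[start..start + first]);
        dst[first..n].copy_from_slice(&self.buf[..n - first]);
        self.tail += n;
        n
    }
}

/// For every (start offset, batch size) pair, the batch path must return
/// exactly what the VecDeque model returns. Returns the number of pairs run.
fn differential_check(cap: usize) -> usize {
    let mut combos = 0;
    for offset in 0..cap {
        for batch in 0..=cap {
            let mut ring = ToyRing::with_capacity(cap);
            let mut model: VecDeque<u32> = VecDeque::new();

            // Move head and tail to `offset` so larger batches straddle the wrap.
            let pad: Vec<u32> = (0..offset as u32).collect();
            assert_eq!(ring.push_batch(&pad), offset);
            let mut sink = vec![0; offset];
            assert_eq!(ring.pop_batch(&mut sink), offset);

            let src: Vec<u32> = (0..batch as u32).map(|i| 1000 + i).collect();
            assert_eq!(ring.push_batch(&src), batch);
            model.extend(src.iter().copied());

            let mut out = vec![0; batch];
            let popped = ring.pop_batch(&mut out);
            let expected: Vec<u32> = model.drain(..).collect();
            assert_eq!(popped, expected.len());
            assert_eq!(out, expected);
            combos += 1;
        }
    }
    combos
}
```

Calling `differential_check(8)` walks 8 start offsets by 9 batch sizes; any mismatch between the two-chunk copy and the model panics with the offending values.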

🤖 Generated with Claude Code

Covers Phase 0 (codegen baseline), Phase 1 (software prefetch), Phase 2 (contiguous batch copy), Phase 3 (codegen regression gate), and Phase 4 (future hugepages). Based on literature review of mratsim/weave, Snellman, psy-lob-saw, rigtorp, and B-Queue papers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Use existing prefetch API (compiler_hints.rs), not new trait
- Fix aarch64 prefetch: inline asm (PRFM), not nonexistent __pld
- Fix x86_64 write prefetch: _MM_HINT_ET0 for exclusive hint
- Correct architecture: two separate engines, not shared
- Document storage contiguity invariant for batch copy
- Add pop_batch contiguous design (was missing)
- Add contiguous batch copy test strategy (differential, proptest)
- Acknowledge existing check-asm.sh script
- Add rollback criteria for prefetch feature

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

10-task plan covering Phases 0-3: codegen baseline, prefetch enhancement (x86_64 ET0 + aarch64 PRFM), feature flags, RingEngine and CopyRingEngine prefetch integration, contiguous batch push/pop via copy_nonoverlapping, differential + property-based tests, codegen verification gate, and benchmarking. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add one-ahead write prefetch in try_push and read prefetch in try_pop, both gated on #[cfg(feature = "prefetch")]. Unsafe slot_ptr calls are encapsulated in new raw::prefetch_slot_write / prefetch_slot_read helpers so engine.rs stays within the #![deny(unsafe_code)] boundary. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
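The one-ahead pattern described in this commit can be sketched as follows. The PR's actual helpers (raw::prefetch_slot_write / prefetch_slot_read) are not shown on this page, so the `hint` module and the `Ring` type below are hypothetical stand-ins: a single-threaded toy ring without atomics or a feature flag, intended only to show where the prefetch for the next slot sits relative to the current write or read.

```rust
use std::mem::MaybeUninit;

#[cfg(target_arch = "x86_64")]
mod hint {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_ET0, _MM_HINT_T0};
    /// Prefetch with intent to write (exclusive hint).
    #[inline(always)]
    pub fn write(p: *const u8) {
        // SAFETY: a prefetch is only a hint; it does not fault or access data.
        unsafe { _mm_prefetch::<_MM_HINT_ET0>(p as *const i8) }
    }
    /// Prefetch for read into all cache levels.
    #[inline(always)]
    pub fn read(p: *const u8) {
        // SAFETY: as above.
        unsafe { _mm_prefetch::<_MM_HINT_T0>(p as *const i8) }
    }
}

#[cfg(target_arch = "aarch64")]
mod hint {
    use core::arch::asm;
    #[inline(always)]
    pub fn write(p: *const u8) {
        // SAFETY: PRFM is a hint and does not fault.
        unsafe { asm!("prfm pstl1keep, [{p}]", p = in(reg) p, options(nostack, preserves_flags)) }
    }
    #[inline(always)]
    pub fn read(p: *const u8) {
        // SAFETY: as above.
        unsafe { asm!("prfm pldl1keep, [{p}]", p = in(reg) p, options(nostack, preserves_flags)) }
    }
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
mod hint {
    pub fn write(_p: *const u8) {}
    pub fn read(_p: *const u8) {}
}

/// Toy ring with monotonic head/tail counters; N must be a power of two.
struct Ring<T: Copy, const N: usize> {
    slots: [MaybeUninit<T>; N],
    head: usize,
    tail: usize,
}

impl<T: Copy, const N: usize> Ring<T, N> {
    fn new() -> Self {
        assert!(N.is_power_of_two());
        Self { slots: [MaybeUninit::uninit(); N], head: 0, tail: 0 }
    }

    fn try_push(&mut self, value: T) -> Result<(), T> {
        if self.head - self.tail == N {
            return Err(value); // full
        }
        self.slots[self.head & (N - 1)].write(value);
        // One-ahead: warm the slot that the *next* push will write.
        let next = (self.head + 1) & (N - 1);
        hint::write(self.slots[next].as_ptr() as *const u8);
        self.head += 1;
        Ok(())
    }

    fn try_pop(&mut self) -> Option<T> {
        if self.head == self.tail {
            return None; // empty
        }
        // SAFETY: slots in [tail, head) were written by try_push.
        let value = unsafe { self.slots[self.tail & (N - 1)].assume_init() };
        // One-ahead: warm the slot that the *next* pop will read.
        let next = (self.tail + 1) & (N - 1);
        hint::read(self.slots[next].as_ptr() as *const u8);
        self.tail += 1;
        Some(value)
    }
}
```

Keeping the raw pointer handling inside the small `hint` functions mirrors the commit's goal of confining unsafe code to a helper module.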

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace per-element CopyPolicy dispatch in push_batch with at most two copy_nonoverlapping calls (one per contiguous chunk around the ring wrap point). This compiles to memcpy which auto-vectorizes, versus 1000 separate copies for a 1000-element batch.

Add write_batch_copy to copy_ring::raw using Storage::base_ptr() for correct Miri-validated pointer provenance over the full slot array.

Add base_ptr() to the Storage trait (InlineStorage and HeapStorage), returning a *mut MaybeUninit<T> with provenance covering all capacity slots, as required for multi-element bulk copies.

Add push_batch_wraparound_contiguous test exercising the two-chunk path where head crosses the end of the backing array back to index 0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
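The two-chunk write described in this commit can be sketched as below. The function reuses the name write_batch_copy from the commit message, but its signature and body here are assumptions: it takes a plain slice of slots instead of the Storage trait, and the slice's `as_mut_ptr()` plays the role of base_ptr() by supplying a pointer whose provenance covers every slot.

```rust
use std::mem::MaybeUninit;
use std::ptr;

/// Copies `src` into the ring starting at logical index `head`, using at
/// most two contiguous copies split at the wrap point.
/// Assumptions: capacity is a power of two, and the caller has already
/// checked that `src.len()` free slots are available.
fn write_batch_copy<T: Copy>(slots: &mut [MaybeUninit<T>], head: usize, src: &[T]) {
    let cap = slots.len();
    assert!(cap.is_power_of_two() && src.len() <= cap);

    let start = head & (cap - 1);
    let first = src.len().min(cap - start); // elements that fit before the wrap
    let second = src.len() - first; // elements that continue at index 0

    // One pointer derived from the whole slice, valid for all `cap` slots.
    let base = slots.as_mut_ptr().cast::<T>();

    // SAFETY: `start + first <= cap` and `second <= start`, so both
    // destination ranges are in bounds; `src` and `slots` cannot overlap
    // because one is borrowed shared and the other mutably.
    unsafe {
        ptr::copy_nonoverlapping(src.as_ptr(), base.add(start), first);
        ptr::copy_nonoverlapping(src.as_ptr().add(first), base, second);
    }
}
```

For a capacity of 8, a head index of 6, and five elements, the first copy fills slots 6 and 7 and the second fills slots 0 through 2.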

Replace the per-element pop_batch loop with a two-chunk copy_nonoverlapping pattern symmetric to push_batch. Add read_batch_copy in copy_ring/raw/mod.rs (mirrors write_batch_copy). Add pop_batch_wraparound_contiguous test covering the wrap path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
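The symmetric read side can be sketched the same way. The name read_batch_copy comes from the commit message, while the signature, the plain-slice storage, and the safety contract below are assumptions made for illustration.

```rust
use std::mem::MaybeUninit;
use std::ptr;

/// Copies `dst.len()` elements out of the ring starting at logical index
/// `tail`, using at most two contiguous copies split at the wrap point.
///
/// # Safety
/// The `dst.len()` slots starting at `tail` (modulo capacity) must have
/// been initialized by a previous write.
unsafe fn read_batch_copy<T: Copy>(slots: &[MaybeUninit<T>], tail: usize, dst: &mut [T]) {
    let cap = slots.len();
    assert!(cap.is_power_of_two() && dst.len() <= cap);

    let start = tail & (cap - 1);
    let first = dst.len().min(cap - start); // elements before the wrap
    let second = dst.len() - first; // elements that continue at index 0

    let base = slots.as_ptr().cast::<T>();
    let out = dst.as_mut_ptr();

    // SAFETY: both source ranges are in bounds and initialized per the
    // contract above; `dst` is a distinct mutable borrow, so no overlap.
    unsafe {
        ptr::copy_nonoverlapping(base.add(start), out, first);
        ptr::copy_nonoverlapping(base, out.add(first), second);
    }
}
```

The function is marked unsafe because only the caller (in the PR, the engine tracking head and tail) knows which slots currently hold initialized values.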

…ch ops Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Batch/1000: -68% (contiguous copy_nonoverlapping vs per-element loop)
Single-op copy: -2.4% (prefetch on aarch64 Apple M4 Pro)
General single-op: +2% (within noise threshold)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
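The Batch/1000 figure can be reproduced from the before/after latencies given in the PR description (1932 ns before, 620.6 ns after). A small sketch of the arithmetic follows; the helper name `relative_change_pct` is hypothetical.

```rust
/// Relative change from `before` to `after`, in percent.
/// Negative values mean the new measurement is lower (faster).
fn relative_change_pct(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

With `before = 1932.0` and `after = 620.6` this evaluates to about -67.9, which rounds to the reported -68%.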

github-actions · 2026-03-27T04:39:00Z

Benchmark Report

Commit: 62f655e0684d1e4fa356bfe9f2f3763b68e05ba3

Linux

CPU: AMD EPYC 7763 64-Core Processor | Arch: x86_64 | Compiler: rustc 1.96.0-nightly (23903d01c 2026-03-26)

Latency (ns/op, lower is better)

Single Push+Pop

Element	crossbeam	mantis/copy	mantis/general	mantis/inline	rtrb
`msg48`	23.26	9.35 🏆	31.73	31.85	21.22
`msg64`	7.98	3.74 🏆	20.32	20.39	6.99
`u64`	5.01	2.51 🏆	-	3.77	3.42

Burst 100

Element	crossbeam	mantis/copy	mantis/inline	rtrb
`msg48`	3025.85	235.53 🏆	2595.34	2339.89
`msg64`	1048.93	218.79 🏆	1498.92	778.14
`u64`	780.8	176.14 🏆	376.36	311.86

Burst 1000

Element	crossbeam	mantis/copy	mantis/inline	rtrb
`msg48`	30234.29	2461.5 🏆	26208.54	23343.07
`msg64`	11541.63	2354.53 🏆	15330.93	8181.16
`u64`	7812	1661.06 🏆	3780.27	3062.3

Batch 100

Element	mantis/copy
`msg48`	71.41 🏆
`u64`	14.98 🏆

Batch 1000

Element	mantis/copy
`msg48`	1560.3 🏆
`u64`	100.02 🏆

Full Drain

Element	mantis/inline
`u64`	1617.27 🏆

Instructions per Op (lower is better)

Full results (all fields)

Workload	ns/op	p50	p99	cycles	insns	bmiss	l1d	llc
spsc/inline/single_item/u64	3.77	3.7	4	12	-	-	-	-
spsc/inline/single_item/msg48	31.85	31.7	34.4	107.6	-	-	-	-
spsc/inline/single_item/msg64	20.39	20.3	21.6	80.8	-	-	-	-
spsc/inline/burst_100/u64	376.36	374.9	394.2	1124.4	-	-	-	-
spsc/inline/burst_100/msg48	2595.34	2592.9	2619.7	9444.3	-	-	-	-
spsc/inline/burst_100/msg64	1498.92	1494.8	1517.6	4585.1	-	-	-	-
spsc/inline/burst_1000/u64	3780.27	3739.8	4224.3	13231.5	-	-	-	-
spsc/inline/burst_1000/msg48	26208.54	26140.1	27406.1	110451.6	-	-	-	-
spsc/inline/burst_1000/msg64	15330.93	15252.9	16246	54501.5	-	-	-	-
spsc/inline/full_drain/u64	1617.27	1611.9	1675.9	5147.2	-	-	-	-
copy/single/u64	2.51	2.5	2.7	9.6	-	-	-	-
copy/single/msg48	9.35	9.3	9.4	57.9	-	-	-	-
copy/single/msg64	3.74	3.7	4.4	11.7	-	-	-	-
general/single/msg48	31.73	31.7	32	147	-	-	-	-
general/single/msg64	20.32	20.3	20.4	80.8	-	-	-	-
copy/burst/100/u64	176.14	175.6	183.9	754	-	-	-	-
copy/burst/100/msg48	235.53	237.5	244.2	770.5	-	-	-	-
copy/burst/100/msg64	218.79	217.6	252.8	708.9	-	-	-	-
copy/burst/1000/u64	1661.06	1648.9	1830.9	5414.4	-	-	-	-
copy/burst/1000/msg48	2461.5	2463.4	2550.6	9827.5	-	-	-	-
copy/burst/1000/msg64	2354.53	2362.8	2464.2	9165	-	-	-	-
copy/batch/100/u64	14.98	14.9	16.2	48.3	-	-	-	-
copy/batch/100/msg48	71.41	71.3	72.9	264.8	-	-	-	-
copy/batch/1000/u64	100.02	99.5	105.6	301.7	-	-	-	-
copy/batch/1000/msg48	1560.3	1551.9	1659.1	4886.2	-	-	-	-
spsc/rtrb/single_item/u64	3.42	3.4	3.5	10.2	-	-	-	-
spsc/rtrb/single_item/msg48	21.22	20.9	23.1	62	-	-	-	-
spsc/rtrb/single_item/msg64	6.99	6.6	8.6	19.4	-	-	-	-
spsc/rtrb/burst_100/u64	311.86	306.9	340.2	1224.5	-	-	-	-
spsc/rtrb/burst_100/msg48	2339.89	2381.8	2610.6	9611.6	-	-	-	-
spsc/rtrb/burst_100/msg64	778.14	839.6	1015	2946.1	-	-	-	-
spsc/rtrb/burst_1000/u64	3062.3	3012.9	3291.8	9484.2	-	-	-	-
spsc/rtrb/burst_1000/msg48	23343.07	23774.9	25589.4	77457.5	-	-	-	-
spsc/rtrb/burst_1000/msg64	8181.16	7166	10853.7	25637.6	-	-	-	-
spsc/crossbeam/single_item/u64	5.01	5	5.1	19.2	-	-	-	-
spsc/crossbeam/single_item/msg48	23.26	23.1	24.3	66.1	-	-	-	-
spsc/crossbeam/single_item/msg64	7.98	8	8.6	26.3	-	-	-	-
spsc/crossbeam/burst_100/u64	780.8	779.4	796.4	2410.3	-	-	-	-
spsc/crossbeam/burst_100/msg48	3025.85	2995.3	3285.6	9342.3	-	-	-	-
spsc/crossbeam/burst_100/msg64	1048.93	1040.5	1146.9	3839	-	-	-	-
spsc/crossbeam/burst_1000/u64	7812	7793	8097.5	27999	-	-	-	-
spsc/crossbeam/burst_1000/msg48	30234.29	30058.8	32378.9	110312.2	-	-	-	-
spsc/crossbeam/burst_1000/msg64	11541.63	11527.9	11680.5	36106.5	-	-	-	-

macOS

CPU: Apple M1 (Virtual) | Arch: aarch64 | Compiler: rustc 1.96.0-nightly (23903d01c 2026-03-26)

Latency (ns/op, lower is better)

Single Push+Pop

Element	crossbeam	mantis/copy	mantis/general	mantis/inline	rtrb
`msg48`	10.93	11.17	10.68	9.86	9.3 🏆
`msg64`	11.71	9.74 🏆	11.19	10.56	10.53
`u64`	8.55	9.58	-	8.59	7.16 🏆

Burst 100

Element	crossbeam	mantis/copy	mantis/inline	rtrb
`msg48`	1781.58	1909.82	1998.51	744.66 🏆
`msg64`	1923.23	1817.29	1868.02	805.86 🏆
`u64`	1747.66	1739.75	1625.03	649.78 🏆

Burst 1000

Element	crossbeam	mantis/copy	mantis/inline	rtrb
`msg48`	18495.54	18871.19	22051.55	7488.12 🏆
`msg64`	19562.54	18307.81 🏆	21628.76	21395.23
`u64`	17313.62	17110.55	17114.42	6762.28 🏆

Batch 100

Element	mantis/copy
`msg48`	85.31 🏆
`u64`	24.7 🏆

Batch 1000

Element	mantis/copy
`msg48`	839.11 🏆
`u64`	130.84 🏆

Full Drain

Element	mantis/inline
`u64`	17826.35 🏆

Instructions per Op (lower is better)

Full results (all fields)

Workload	ns/op	p50	p99	cycles	insns	bmiss	l1d	llc
spsc/inline/single_item/u64	8.59	8.5	8.9	0.3	-	-	-	-
spsc/inline/single_item/msg48	9.86	9.8	10.1	0.4	-	-	-	-
spsc/inline/single_item/msg64	10.56	10.4	11.2	0.4	-	-	-	-
spsc/inline/burst_100/u64	1625.03	1621.4	1659.7	50.9	-	-	-	-
spsc/inline/burst_100/msg48	1998.51	1895.7	2431.5	68.5	-	-	-	-
spsc/inline/burst_100/msg64	1868.02	1836	2121.5	61.7	-	-	-	-
spsc/inline/burst_1000/u64	17114.42	17206.3	18301	649.2	-	-	-	-
spsc/inline/burst_1000/msg48	22051.55	21758.5	25166.9	944.7	-	-	-	-
spsc/inline/burst_1000/msg64	21628.76	21281.6	25375	866.6	-	-	-	-
spsc/inline/full_drain/u64	17826.35	17975.1	20140.5	682.2	-	-	-	-
copy/single/u64	9.58	9.4	11.1	0.4	-	-	-	-
copy/single/msg48	11.17	10.5	14.2	0.4	-	-	-	-
copy/single/msg64	9.74	9.8	10.5	0.4	-	-	-	-
general/single/msg48	10.68	10.7	11.9	0.4	-	-	-	-
general/single/msg64	11.19	11.3	12.6	0.3	-	-	-	-
copy/burst/100/u64	1739.75	1745.1	1870.8	54.3	-	-	-	-
copy/burst/100/msg48	1909.82	1912.5	2327.4	63.8	-	-	-	-
copy/burst/100/msg64	1817.29	1847.1	1966.8	59.5	-	-	-	-
copy/burst/1000/u64	17110.55	17432.3	20006.9	658.5	-	-	-	-
copy/burst/1000/msg48	18871.19	18624.9	20524.9	749.6	-	-	-	-
copy/burst/1000/msg64	18307.81	18383.6	19594.8	711.1	-	-	-	-
copy/batch/100/u64	24.7	23.6	31.1	0.7	-	-	-	-
copy/batch/100/msg48	85.31	83.6	101.5	3.4	-	-	-	-
copy/batch/1000/u64	130.84	128.9	153.1	4.5	-	-	-	-
copy/batch/1000/msg48	839.11	816	1061.2	24	-	-	-	-
spsc/rtrb/single_item/u64	7.16	7.1	8.2	0.2	-	-	-	-
spsc/rtrb/single_item/msg48	9.3	9.1	10.7	0.3	-	-	-	-
spsc/rtrb/single_item/msg64	10.53	10.1	13.1	0.4	-	-	-	-
spsc/rtrb/burst_100/u64	649.78	630.3	825	25.6	-	-	-	-
spsc/rtrb/burst_100/msg48	744.66	733.9	824.5	21.8	-	-	-	-
spsc/rtrb/burst_100/msg64	805.86	795.5	906.8	24.7	-	-	-	-
spsc/rtrb/burst_1000/u64	6762.28	6567.5	7784.2	214.6	-	-	-	-
spsc/rtrb/burst_1000/msg48	7488.12	7411.6	8021.6	257.8	-	-	-	-
spsc/rtrb/burst_1000/msg64	21395.23	21535.8	23285.9	951.3	-	-	-	-
spsc/crossbeam/single_item/u64	8.55	8.6	8.7	0.3	-	-	-	-
spsc/crossbeam/single_item/msg48	10.93	10.9	11.5	0.4	-	-	-	-
spsc/crossbeam/single_item/msg64	11.71	11.6	12.9	0.3	-	-	-	-
spsc/crossbeam/burst_100/u64	1747.66	1745.9	1850.3	58.3	-	-	-	-
spsc/crossbeam/burst_100/msg48	1781.58	1766.7	1886.4	58.7	-	-	-	-
spsc/crossbeam/burst_100/msg64	1923.23	1894.8	2162.7	63.7	-	-	-	-
spsc/crossbeam/burst_1000/u64	17313.62	17160	18866	674.7	-	-	-	-
spsc/crossbeam/burst_1000/msg48	18495.54	18425.5	19152.4	726.5	-	-	-	-
spsc/crossbeam/burst_1000/msg64	19562.54	19467	20690.7	797.2	-	-	-	-

All implementations (inline, copy, rtrb, crossbeam) now test identical workload shapes: single, burst_100, burst_1000 × u64, msg48, msg64. Batch benchmarks remain mantis-only (copy ring). Removed non-comparable [u8;64] and [u8;256] benchmarks. Split functions to satisfy clippy too_many_lines lint. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Milerius and others added 15 commits March 26, 2026 17:01

docs: fix aarch64 asm! nightly note — stable since 1.59

083e74a

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: add copy-ring batch shims to asm baseline

4b151a0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(platform): x86_64 write-exclusive prefetch + aarch64 PRFM support

57a1166

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat: add prefetch feature flag to platform and queue crates

5c0504e

feat(queue): integrate prefetch into CopyRingEngine push/pop hot paths

7595eb9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test(queue): differential and property-based tests for contiguous bat…

5b0eb17

…ch ops Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore: add instruction count regression gate to check-asm.sh

15e7512

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(queue): resolve clippy cast_sign_loss in batch tests

a7e9821

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

pull-request-size bot added the size/XXL label Mar 27, 2026

Milerius merged commit e229eda into main Mar 27, 2026
17 checks passed

Milerius merged 16 commits into main from feat/spsc-ring-buffer

2 participants
