Add MpscRingBuffer primitive for pre-allocated slot rings#11492
Draft
dougqh wants to merge 8 commits into
Draft
Conversation
Bounded multi-producer / single-consumer ring buffer of long-lived T instances. Producers mutate slots in place via callbacks; the consumer reads them the same way. No allocation per write/read after construction. BiConsumer/TriConsumer variants take context object(s) before the slot, matching the TagMap.forEach / Hashtable.forEach convention so callers can use static final non-capturing lambdas. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five benchmarks: producer-only throughput at 1/8/16 threads with a background drainer, plus an e2e @group bench pairing 8 producers with 1 consumer for system throughput. Ring capacity is a @Param so runs can sweep capacities without recompiling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Head-to-head benches in dd-trace-core's jmh source set (which already depends on jctools): MpscRingBuffer mutating pre-allocated slots vs MpscArrayQueue with a fresh Slot allocated per publish -- the latter mirrors the existing SpanSnapshot pattern in the CSS code. write_ring_8p / write_queue_8p compare publish cost with a background drainer; e2e_ring_8p / e2e_queue_8p use @group to pair 8 producers with 1 consumer for end-to-end throughput. Run with -prof gc to see per-op allocation rate where the ring's win shows up loudest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
🟢 Java Benchmark SLOs — All performance SLOs passed
PR vs. master resultsStartup Time
Commit: Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion. |
Spell out the contract that slot users rely on: plain (non-volatile) fields, no retention past handler return, slot-reference-not-shared. Producer fillers must not throw if possible -- and if they do, the slot is now published anyway (try/finally) so the consumer can't deadlock waiting for an unfinished sequence. Test covers the throw-then-recover path; the ring's cursors stay healthy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Saves one wrapper allocation per ring instance: producerCursor becomes a volatile long on the instance, paired with a static AtomicLongFieldUpdater for CAS. Same memory ordering as the prior AtomicLong (volatile field + field-updater CAS), but no per-instance wrapper object. publishedSequences stays AtomicLongArray -- the field updater approach doesn't apply to array element access. consumerCursor was already a plain volatile long with no wrapper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related cache-line fixes for the producer hot path under heavy contention: 1. Stride publishedSequences by 8 longs (one cache line). Without this, adjacent logical slots share cache lines and concurrent producers writing nearby sequences ping-pong the same line between cores. The array grows by 8x but the upfront cost is bounded by the ring's capacity (e.g. 8 MB at the CSS default cap=131072). 2. Cache-line-pad the producerCursor and consumerCursor against each other using the standard Disruptor class-hierarchy pattern. Every consumer-side advance of consumerCursor would otherwise invalidate the line producers read for producerCursor (and vice versa). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
A single CAS claims n contiguous slots, returning a Batch handle that the caller fills via fillAndPublish(slot...). Designed for callers whose work has a natural batch boundary (e.g. CSS publishing a trace's worth of metrics-eligible spans in one shot): cuts producer-cursor contention from O(N) CASes to O(1) per call. All-or-nothing: tryClaim(n) returns null if the ring can't fit the whole batch. The Batch is single-threaded (owned by the claiming thread), short-lived (scoped to one publish call), and has no thread-shutdown hazard -- the batch is fully consumed before returning. Filler-throw safety matches the existing tryWrite contract: the slot is published in a finally block so the consumer can advance, and the batch's published counter increments either way. Tests cover: requested size, capacity rejection, all-or-nothing, three filler overloads, over-publish IllegalStateException, throw recovery, and 8-producer concurrency (200 batches/thread x 16 size = 25600 items, single consumer sees every value exactly once). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Batch handle from tryClaim was supposed to be scalar-replaced by escape analysis, but JMH measurements showed it's not -- the inner-class implicit this$0 plus the CAS-retry inside tryClaim block scalarization on HotSpot. Result: ~24 bytes of Batch + cursor state allocated per publish on the hot path, ~50% throughput drop on single-element claims in CSS-style benches. Add three sequence-based primitives that callers manage directly: long tryClaimRange(int n) -> start sequence or -1L T slotAt(long seq) -> slot for that sequence void publish(long seq) -> release the slot to the consumer No per-call allocation, no callback dispatch. Callers handle the sequence arithmetic themselves and trade safety (forget to publish -> ring stuck) for hot-path predictability. The Batch API stays for safer use cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
New
datadog.trace.util.concurrent.MpscRingBuffer<T>— a bounded multi-producer / single-consumer ring buffer of pre-allocated, recyclableTinstances. Producers and consumers mutate and read slots in place via callbacks; no allocation occurs per write/read after construction.Lives in
internal-api. No callers yet — this PR adds the primitive and tests/benches only. CSS integration to follow in a separate PR.Motivation
Producer/consumer split in CSS v1.3.0 (#11381) allocates one
SpanSnapshotper metrics-eligible span and hands the reference through anMpscArrayQueue. At typical heap sizes (≥256m) this is fine — the snapshots die young. At tight heap (we hit it at Xmx64m in spring-petclinic load testing) the in-flight snapshots overflow G1 survivor regions, triggering To-space Exhausted → fallback Full GC storms.Moving from "reference passing of fresh objects" to "in-place mutation of pre-allocated slots" eliminates the per-publish allocation entirely. Smoke benchmarks (capacity=1024, 8 producers, 1 consumer on M-series Mac):
MpscArrayQueue<Slot>+ per-publishnew Slot(...)MpscRingBuffer<Slot>write_*_8p(publish only)e2e_*_8p(8P→1C end-to-end)Allocation rate falls to zero on the ring path (vs
sizeof(Slot)× producer ops/s on the queue path) — that's the heap-pressure side of the win, not visible in the throughput numbers above but the key reason this matters for tight-heap workloads.API
Callback-style, with 0/1/2 context-object overloads. Contexts come before the slot, matching
TagMap.forEach/Hashtable.forEach:tryWritereturnsfalsewhen full (producer drops the value).drainreturns the count processed.Design
AtomicLong.compareAndSetin a retry loop). Stale read of the consumer cursor is fine — a false "full" reading just causes a drop.AtomicLongArray: a slot is published at sequencesiffsequences[s & mask] == s. This handles out-of-order publishes from concurrent producers — the consumer only advances over contiguous published slots.volatile; written only by the consumer thread, read by producers to detect free space.mask = capacity - 1works.Test/bench coverage
MpscRingBufferTest— 12 JUnit 5 tests covering construction, FIFO order, capacity bounds, fill-drain-fill round trip, 0/1/2 context variants, and an 8-producer × 50,000-writes concurrency test (400K writes, ~150ms).MpscRingBufferBenchmark(internal-api/src/jmh/...) —write_1p/8p/16pat varying producer counts +e2e_8p@Group(8 producers + 1 consumer). Parameterised oncapacity.RingVsQueueBenchmark(dd-trace-core/src/jmh/...— sits alongside the CSS benches, where jctools is already a dep) — head-to-head againstMpscArrayQueue<Slot>+ per-publish allocation. Numbers above come from this.Test plan
:internal-api:test—MpscRingBufferTestpasses (12/12 locally):internal-api:jmhrunsMpscRingBufferBenchmark(smoke run passes):dd-trace-core:jmhrunsRingVsQueueBenchmark(smoke run passes; 1.65×–1.92× ratio)🤖 Generated with Claude Code