Add MpscRingBuffer primitive for pre-allocated slot rings #11492

Draft: dougqh wants to merge 8 commits into master from dougqh/ring-buffer

@dougqh dougqh commented May 28, 2026

Summary

New datadog.trace.util.concurrent.MpscRingBuffer<T>: a bounded multi-producer / single-consumer ring buffer of pre-allocated, recyclable T instances. Producers mutate slots and the consumer reads them in place via callbacks; no allocation occurs per write/read after construction.

Lives in internal-api. No callers yet — this PR adds the primitive and tests/benches only. CSS integration to follow in a separate PR.

Motivation

The producer/consumer split in CSS v1.3.0 (#11381) allocates one SpanSnapshot per metrics-eligible span and hands the reference through an MpscArrayQueue. At typical heap sizes (≥256m) this is fine because the snapshots die young. With a tight heap (we hit it at Xmx64m in spring-petclinic load testing) the in-flight snapshots overflow G1 survivor regions, triggering To-space Exhausted → fallback Full GC storms.

Moving from "reference passing of fresh objects" to "in-place mutation of pre-allocated slots" eliminates the per-publish allocation entirely. Smoke benchmarks (capacity=1024, 8 producers, 1 consumer on M-series Mac):

| Benchmark | MpscArrayQueue<Slot> + per-publish new Slot(...) | MpscRingBuffer<Slot> | Ratio |
| --- | --- | --- | --- |
| write_*_8p (publish only) | 399 M ops/s | 768 M ops/s | 1.92× |
| e2e_*_8p (8P→1C end-to-end) | 514 M ops/s | 846 M ops/s | 1.65× |

Allocation rate falls to zero on the ring path (vs sizeof(Slot) × producer ops/s on the queue path). That is the heap-pressure side of the win: it is not visible in the throughput numbers above, but it is the key reason this matters for tight-heap workloads.

API

Callback-style, with 0/1/2 context-object overloads. Contexts come before the slot, matching TagMap.forEach / Hashtable.forEach:

// 0 context
ring.tryWrite(slot -> { slot.value = 1; });
ring.drain(slot -> handle(slot));

// 1 context — static BiConsumer<C, T> avoids per-call lambda capture
static final BiConsumer<Span, Slot> FILL = (span, slot) -> slot.svc = span.getService();
ring.tryWrite(span, FILL);

// 2 contexts — TriConsumer<C1, C2, T>
static final TriConsumer<Span, Tag, Slot> FILL2 = (span, tag, slot) -> { ... };
ring.tryWrite(span, tag, FILL2);

tryWrite returns false when full (producer drops the value). drain returns the count processed.
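
The point of the context overloads is that the filler can be a constant, non-capturing lambda while the per-call state travels as arguments. The sketch below only illustrates the three overload shapes on a toy single-slot holder; `TriConsumerSketch`, `ContextOverloadSketch`, and `FillerDemo` are hypothetical names, and the PR's actual `TriConsumer` and `tryWrite` signatures may differ.

```java
import java.util.function.BiConsumer;
import java.util.function.Consumer;

/** Hypothetical three-argument callback; the PR's TriConsumer may be declared differently. */
@FunctionalInterface
interface TriConsumerSketch<C1, C2, T> {
  void accept(C1 first, C2 second, T slot);
}

/** Toy holder with a single pre-allocated slot, only to show the overload shapes. */
final class ContextOverloadSketch<T> {
  private final T slot;

  ContextOverloadSketch(T slot) {
    this.slot = slot;
  }

  // 0 contexts: the lambda has to capture anything it needs from the call site.
  void write(Consumer<T> filler) {
    filler.accept(slot);
  }

  // 1 context: caller state is passed as an argument, so the filler can be a constant.
  <C> void write(C context, BiConsumer<C, T> filler) {
    filler.accept(context, slot);
  }

  // 2 contexts: same idea with one more argument, contexts first and slot last.
  <C1, C2> void write(C1 first, C2 second, TriConsumerSketch<C1, C2, T> filler) {
    filler.accept(first, second, slot);
  }

  T slot() {
    return slot;
  }
}

final class FillerDemo {
  static final class Slot {
    String service;
    int tagCount;
  }

  // Non-capturing lambdas: they touch only their own parameters.
  static final BiConsumer<String, Slot> FILL = (service, slot) -> slot.service = service;
  static final TriConsumerSketch<String, Integer, Slot> FILL2 =
      (service, tagCount, slot) -> {
        slot.service = service;
        slot.tagCount = tagCount;
      };
}
```

A call site would then look like `holder.write("web", FillerDemo.FILL)`, with no lambda object created per call.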

Design

  • Producer cursor is CAS-claimed (AtomicLong.compareAndSet in a retry loop). Stale read of the consumer cursor is fine — a false "full" reading just causes a drop.
  • Per-slot publication is gated by an AtomicLongArray: a slot is published at sequence s iff sequences[s & mask] == s. This handles out-of-order publishes from concurrent producers — the consumer only advances over contiguous published slots.
  • Consumer cursor is volatile; written only by the consumer thread, read by producers to detect free space.
  • Capacity is rounded up to a power of two so that mask = capacity - 1 works.
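
The four design points above can be put together in a compact sketch. This is not the PR's code: `MpscRingSketch` is a hypothetical class that uses a plain `AtomicLong` for the producer cursor, has no padding or striding, and initialises every published sequence to -1 (an assumption, needed so that slot 0 does not look published at sequence 0 before anything is written).

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.function.Consumer;
import java.util.function.Supplier;

final class MpscRingSketch<T> {
  private final Object[] slots;                                // pre-allocated, reused forever
  private final int mask;                                      // capacity - 1
  private final AtomicLong producerCursor = new AtomicLong();  // next sequence to claim
  private final AtomicLongArray published;                     // published[s & mask] == s: s is ready
  private volatile long consumerCursor;                        // next sequence to read

  MpscRingSketch(int requestedCapacity, Supplier<T> factory) {
    if (requestedCapacity < 1 || requestedCapacity > (1 << 30)) {
      throw new IllegalArgumentException("capacity out of range: " + requestedCapacity);
    }
    int capacity = 1;
    while (capacity < requestedCapacity) {
      capacity <<= 1; // round up to a power of two
    }
    mask = capacity - 1;
    slots = new Object[capacity];
    published = new AtomicLongArray(capacity);
    for (int i = 0; i < capacity; i++) {
      slots[i] = factory.get();
      published.set(i, -1L); // nothing published yet
    }
  }

  int capacity() {
    return slots.length;
  }

  @SuppressWarnings("unchecked")
  private T slot(long seq) {
    return (T) slots[(int) (seq & mask)];
  }

  /** Returns false when the ring looks full; a stale consumer cursor can only cause a drop. */
  boolean tryWrite(Consumer<T> filler) {
    long seq;
    do {
      seq = producerCursor.get();
      if (seq - consumerCursor >= slots.length) {
        return false;
      }
    } while (!producerCursor.compareAndSet(seq, seq + 1));
    try {
      filler.accept(slot(seq));
    } finally {
      // Volatile write: makes the plain slot fields visible and marks the sequence ready.
      published.set((int) (seq & mask), seq);
    }
    return true;
  }

  /** Consumer thread only. Advances over contiguous published sequences and returns the count. */
  int drain(Consumer<T> handler) {
    long seq = consumerCursor;
    int count = 0;
    while (published.get((int) (seq & mask)) == seq) {
      handler.accept(slot(seq));
      consumerCursor = ++seq; // frees the slot for producers only after the handler returns
      count++;
    }
    return count;
  }
}
```

Because the consumer stops at the first sequence whose slot is not yet published, a producer that claimed early but publishes late simply delays the slots behind it; nothing is read out of order.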

Test/bench coverage

  • MpscRingBufferTest — 12 JUnit 5 tests covering construction, FIFO order, capacity bounds, fill-drain-fill round trip, 0/1/2 context variants, and an 8-producer × 50,000-writes concurrency test (400K writes, ~150ms).
  • MpscRingBufferBenchmark (internal-api/src/jmh/...) — write_1p/8p/16p at varying producer counts + e2e_8p @Group (8 producers + 1 consumer). Parameterised on capacity.
  • RingVsQueueBenchmark (dd-trace-core/src/jmh/... — sits alongside the CSS benches, where jctools is already a dep) — head-to-head against MpscArrayQueue<Slot> + per-publish allocation. Numbers above come from this.

Test plan

  • :internal-api:test passes for MpscRingBufferTest (12/12 locally)
  • :internal-api:jmh runs MpscRingBufferBenchmark (smoke run passes)
  • :dd-trace-core:jmh runs RingVsQueueBenchmark (smoke run passes; 1.65×–1.92× ratio)
  • No callers yet — production code paths unchanged

🤖 Generated with Claude Code

dougqh and others added 3 commits May 28, 2026 09:54
Bounded multi-producer / single-consumer ring buffer of long-lived T
instances. Producers mutate slots in place via callbacks; the consumer
reads them the same way. No allocation per write/read after construction.

BiConsumer/TriConsumer variants take context object(s) before the slot,
matching the TagMap.forEach / Hashtable.forEach convention so callers can
use static final non-capturing lambdas.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five benchmarks: producer-only throughput at 1/8/16 threads with a
background drainer, plus an e2e @Group bench pairing 8 producers with
1 consumer for system throughput. Ring capacity is a @Param so runs
can sweep capacities without recompiling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Head-to-head benches in dd-trace-core's jmh source set (which already
depends on jctools): MpscRingBuffer mutating pre-allocated slots vs
MpscArrayQueue with a fresh Slot allocated per publish -- the latter
mirrors the existing SpanSnapshot pattern in the CSS code.

write_ring_8p / write_queue_8p compare publish cost with a background
drainer; e2e_ring_8p / e2e_queue_8p use @Group to pair 8 producers
with 1 consumer for end-to-end throughput. Run with -prof gc to see
per-op allocation rate where the ring's win shows up loudest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dougqh added the labels type: enhancement, comp: core, tag: performance, tag: no release notes, and tag: ai generated on May 28, 2026
dd-octo-sts Bot commented May 28, 2026

🟢 Java Benchmark SLOs — All performance SLOs passed

| Suite | Status |
| --- | --- |
| Startup | 🟢 pass |

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results

Startup Time

| Scenario | This PR | master | Change |
| --- | --- | --- | --- |
| insecure-bank / iast | 14,799 ms | 14,733 ms | +0.4% |
| insecure-bank / tracing | 13,584 ms | 13,752 ms | -1.2% |
| petclinic / appsec | 16,477 ms | 16,424 ms | +0.3% |
| petclinic / iast | 16,430 ms | 16,576 ms | -0.9% |
| petclinic / profiling | 16,330 ms | 15,538 ms | +5.1% |
| petclinic / tracing | 15,677 ms | 15,859 ms | -1.1% |

Commit: 780a2042 · CI Pipeline · Benchmarking Platform UI


Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

dougqh and others added 3 commits May 28, 2026 11:17
Spell out the contract that slot users rely on: plain (non-volatile)
fields, no retention past handler return, slot-reference-not-shared.
Producer fillers must not throw if possible -- and if they do, the slot
is now published anyway (try/finally) so the consumer can't deadlock
waiting for an unfinished sequence. Test covers the throw-then-recover
path; the ring's cursors stay healthy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Saves one wrapper allocation per ring instance: producerCursor becomes
a volatile long on the instance, paired with a static AtomicLongFieldUpdater
for CAS. Same memory ordering as the prior AtomicLong (volatile field +
field-updater CAS), but no per-instance wrapper object.

publishedSequences stays AtomicLongArray -- the field updater approach
doesn't apply to array element access. consumerCursor was already a
plain volatile long with no wrapper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
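
The field-updater pattern described in the commit above can be sketched as follows. `CursorSketch` and its `tryClaim(limit)` method are hypothetical and stand in for the ring's real claim logic; the only point is the pairing of a `volatile long` field with a static `AtomicLongFieldUpdater`.

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

final class CursorSketch {
  // One updater per class, shared by all instances; no per-instance AtomicLong wrapper.
  private static final AtomicLongFieldUpdater<CursorSketch> PRODUCER_CURSOR =
      AtomicLongFieldUpdater.newUpdater(CursorSketch.class, "producerCursor");

  // Must be a volatile long for the updater to accept it.
  private volatile long producerCursor;

  /** Claims the next sequence while fewer than {@code limit} have been claimed, else returns -1. */
  long tryClaim(long limit) {
    long seq;
    do {
      seq = producerCursor; // volatile read
      if (seq >= limit) {
        return -1L;
      }
    } while (!PRODUCER_CURSOR.compareAndSet(this, seq, seq + 1));
    return seq;
  }
}
```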
Two related cache-line fixes for the producer hot path under heavy
contention:

1. Stride publishedSequences by 8 longs (one cache line). Without this,
   adjacent logical slots share cache lines and concurrent producers
   writing nearby sequences ping-pong the same line between cores. The
   array grows by 8x but the upfront cost is bounded by the ring's
   capacity (e.g. 8 MB at the CSS default cap=131072).

2. Cache-line-pad the producerCursor and consumerCursor against each
   other using the standard Disruptor class-hierarchy pattern. Every
   consumer-side advance of consumerCursor would otherwise invalidate
   the line producers read for producerCursor (and vice versa).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
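
A rough sketch of the striding in point 1, including the footprint arithmetic behind the 8 MB figure. `StridedSequencesSketch` is a hypothetical class; it assumes 64-byte cache lines and does not control the array's base alignment, so it only approximates one line per logical slot. The cursor padding from point 2 is not shown.

```java
import java.util.concurrent.atomic.AtomicLongArray;

final class StridedSequencesSketch {
  // 8 longs x 8 bytes = 64 bytes, the assumed cache-line size.
  static final int STRIDE_SHIFT = 3;

  private final AtomicLongArray sequences;
  private final int mask;

  StridedSequencesSketch(int capacityPow2) {
    mask = capacityPow2 - 1;
    sequences = new AtomicLongArray(capacityPow2 << STRIDE_SHIFT);
    for (int i = 0; i < capacityPow2; i++) {
      sequences.set(i << STRIDE_SHIFT, -1L); // nothing published yet
    }
  }

  /** Maps a sequence to its array index; only every 8th element is ever used. */
  static int index(long seq, int mask) {
    return ((int) (seq & mask)) << STRIDE_SHIFT;
  }

  void publish(long seq) {
    sequences.set(index(seq, mask), seq);
  }

  boolean isPublished(long seq) {
    return sequences.get(index(seq, mask)) == seq;
  }

  /** Payload bytes of the strided array, ignoring the array header. */
  static long footprintBytes(int capacityPow2) {
    return ((long) capacityPow2 << STRIDE_SHIFT) * Long.BYTES;
  }
}
```

For capacity 131072 this gives 131072 × 8 × 8 = 8,388,608 bytes, which matches the 8 MB quoted in the commit message.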
dougqh and others added 2 commits May 28, 2026 16:17
A single CAS claims n contiguous slots, returning a Batch handle that
the caller fills via fillAndPublish(slot...). Designed for callers
whose work has a natural batch boundary (e.g. CSS publishing a trace's
worth of metrics-eligible spans in one shot): cuts producer-cursor
contention from O(N) CASes to O(1) per call.

All-or-nothing: tryClaim(n) returns null if the ring can't fit the
whole batch. The Batch is single-threaded (owned by the claiming
thread), short-lived (scoped to one publish call), and has no
thread-shutdown hazard -- the batch is fully consumed before returning.

Filler-throw safety matches the existing tryWrite contract: the slot
is published in a finally block so the consumer can advance, and the
batch's published counter increments either way.

Tests cover: requested size, capacity rejection, all-or-nothing, three
filler overloads, over-publish IllegalStateException, throw recovery,
and 8-producer concurrency (8 threads x 200 batches x 16 slots = 25600
items, single consumer sees every value exactly once).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Batch handle from tryClaim was supposed to be scalar-replaced by
escape analysis, but JMH measurements showed it's not -- the inner-class
implicit this$0 plus the CAS-retry inside tryClaim block scalarization
on HotSpot. Result: ~24 bytes of Batch + cursor state allocated per
publish on the hot path, ~50% throughput drop on single-element claims
in CSS-style benches.

Add three sequence-based primitives that callers manage directly:
  long tryClaimRange(int n)  -> start sequence or -1L
  T    slotAt(long seq)      -> slot for that sequence
  void publish(long seq)     -> release the slot to the consumer

No per-call allocation, no callback dispatch. Callers handle the sequence
arithmetic themselves and trade safety (forget to publish -> ring stuck)
for hot-path predictability. The Batch API stays for safer use cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
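
The three sequence-based primitives from the commit above might be used as in the sketch below. `RangeClaimSketch` is a hypothetical, simplified ring (no padding, no striding, `AtomicLong` cursor) that borrows the method names from the commit message; the internals are an assumption, not the PR's implementation.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.function.Consumer;
import java.util.function.Supplier;

final class RangeClaimSketch<T> {
  private final Object[] slots;
  private final int mask;
  private final AtomicLong producerCursor = new AtomicLong();
  private final AtomicLongArray published;
  private volatile long consumerCursor;

  /** capacityPow2 must already be a power of two in this sketch. */
  RangeClaimSketch(int capacityPow2, Supplier<T> factory) {
    mask = capacityPow2 - 1;
    slots = new Object[capacityPow2];
    published = new AtomicLongArray(capacityPow2);
    for (int i = 0; i < capacityPow2; i++) {
      slots[i] = factory.get();
      published.set(i, -1L);
    }
  }

  /** One CAS claims n contiguous sequences; returns the start sequence, or -1 if they do not all fit. */
  long tryClaimRange(int n) {
    if (n <= 0 || n > slots.length) {
      return -1L;
    }
    long start;
    do {
      start = producerCursor.get();
      if (start + n - consumerCursor > slots.length) {
        return -1L; // all-or-nothing
      }
    } while (!producerCursor.compareAndSet(start, start + n));
    return start;
  }

  @SuppressWarnings("unchecked")
  T slotAt(long seq) {
    return (T) slots[(int) (seq & mask)];
  }

  /** Must be called for every claimed sequence, or the consumer stalls at that sequence. */
  void publish(long seq) {
    published.set((int) (seq & mask), seq);
  }

  /** Consumer thread only. */
  int drain(Consumer<T> handler) {
    long seq = consumerCursor;
    int count = 0;
    while (published.get((int) (seq & mask)) == seq) {
      handler.accept(slotAt(seq));
      consumerCursor = ++seq;
      count++;
    }
    return count;
  }
}
```

A caller would typically loop over the claimed range and publish each sequence in a `finally` block, since a claimed but unpublished sequence blocks every sequence behind it.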