This repository contains a set of corrected and extended ring-buffer implementations for the single-consumer, multiple-producer (SCMP) setting.
It is derived from an existing single-producer, single-consumer (SPSC) ring buffer implementation. The original SPSC code is correct under its intended concurrency model, but it exhibits concurrency bugs when used with multiple producers, primarily data races and incorrect publication ordering among concurrent producers.
In particular, when multiple producers try to reserve and publish slots concurrently, the original design does not provide a safe mechanism to:
- serialize slot reservation (avoid multiple producers writing the same slot),
- enforce in-order publication (ensure the consumer does not observe partially-written entries), or
- coordinate tail/commit progression across producers.
This repository provides multiple alternative fixes and a set of further optimizations.
```
.
├── data                  # Results
├── include
│   ├── common.hpp        # Common functions
│   ├── lock.hpp          # Simple locking
│   ├── notify.hpp        # Wait-for-notification
│   ├── optimized.hpp     # Optimized implementation
│   ├── single.hpp        # Single producer (original)
│   ├── spin.hpp          # Busy waiting for prior commits
│   ├── tail.hpp          # Change tail pointer to non-atomic
│   ├── yield.hpp         # Yielding in spin lock
│   └── free.hpp          # Lock-free producer (same as `single` but with `&` wrapping)
└── src
    ├── main.cpp          # Driver application
    └── single.cpp        # Driver for single producer
```

To check differences between implementations, run `diff` directly. For example:

```
diff include/spin.hpp include/notify.hpp
```

The target setting is multiple producers inserting into a shared ring buffer, with a single consumer draining it. In this setting, the original SPSC ring buffer (in `include/single.hpp`) is unsafe because it implicitly assumes that only one producer updates producer-side indices and publishes writes; the sketch after the following list marks where that assumption breaks.
With multiple producers, typical failure modes include:
- Duplicate slot reservation: two producers choose the same slot index and overwrite each other.
- Out-of-order publication: a later producer makes its slot visible before an earlier producer has finished writing, allowing the consumer to read incomplete or inconsistent data.
- Incorrect index advancement: concurrent updates to producer-side indices are not coordinated, causing missed entries or buffer corruption.
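To make the failure concrete, here is a minimal sketch of a typical SPSC producer path in the style of the original design. All identifiers (`SpscRing`, `head_`, `tail_`, `buf_`) are illustrative assumptions, not the repository's actual names; the point is only to show where the single-producer assumption matters.

```cpp
// Minimal SPSC-style producer sketch (assumed names, not the repository's
// code). Safe with one producer; with two, both can load the same `h`,
// write the same slot, and overwrite each other's element.
#include <atomic>
#include <cstddef>

template <typename T, size_t N>
struct SpscRing {
    T buf_[N];
    std::atomic<size_t> head_{0};  // producer-side index (next slot to write)
    std::atomic<size_t> tail_{0};  // consumer-side index (next slot to read)

    bool push(const T& v) {
        size_t h = head_.load(std::memory_order_relaxed);   // (1) read index
        if (h - tail_.load(std::memory_order_acquire) == N)
            return false;                                    // buffer full
        buf_[h % N] = v;                                     // (2) write slot
        head_.store(h + 1, std::memory_order_release);       // (3) publish
        return true;
    }
};
```

Steps (1)–(3) form a read-modify-write sequence on `head_` that is only safe when exactly one thread executes it; with several producers they interleave freely, producing the failure modes listed above.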
We implement three producer-side coordination strategies to resolve the above issues:
- **Locking (`lock`)**: Use a mutex to gain exclusive access to the producer critical section (reserve slot, write element, publish/advance tail).
  - Pros: simplest correctness argument; good at low producer counts.
  - Cons: contention increases rapidly as the producer count scales.
- **Spin-lock-style commit ordering (`spin`)**: Producers atomically claim space (reserve a slot) and then busy-spin until all earlier producers have published their writes, enforcing ordered commits (see the sketch after this list).
  - Pros: avoids kernel transitions; can be fast at moderate contention.
  - Cons: wastes CPU under high contention.
- **Wait-for-notification (`notify`)**: Producers atomically claim space, then block (or wait efficiently) until they are notified that earlier producers have published, instead of spinning.
  - Pros: reduces wasted CPU cycles under contention.
  - Cons: higher coordination overhead; sensitive to the notification strategy.
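As a concrete illustration of the reserve-then-ordered-publish idea behind `spin`, here is a minimal sketch. The identifiers (`McspRing`, `claim_`, `published_`, `read_`) and the protocol are deliberately simplified assumptions for illustration; they do not reproduce the code in `include/spin.hpp`.

```cpp
// Sketch of multi-producer reserve + ordered publish (assumed names; not the
// repository's actual API). Producers claim a slot with fetch_add, write it,
// then wait for all earlier slots to be published before publishing theirs.
#include <atomic>
#include <cstddef>

template <typename T, size_t N>
struct McspRing {
    T buf_[N];
    std::atomic<size_t> claim_{0};      // next slot any producer may reserve
    std::atomic<size_t> published_{0};  // consumer may read slots below this
    std::atomic<size_t> read_{0};       // consumer position

    void push(const T& v) {
        size_t slot = claim_.fetch_add(1, std::memory_order_relaxed); // reserve
        // Wait until the consumer has freed this physical slot.
        while (slot - read_.load(std::memory_order_acquire) >= N) { /* spin */ }
        buf_[slot % N] = v;                                           // write
        // Enforce in-order publication: earlier producers must publish first.
        // (`notify` would block here instead of spinning; `lock` serializes
        // the whole reserve/write/publish sequence with a mutex.)
        while (published_.load(std::memory_order_acquire) != slot) { /* spin */ }
        published_.store(slot + 1, std::memory_order_release);        // publish
    }

    bool pop(T& out) {   // single consumer
        size_t r = read_.load(std::memory_order_relaxed);
        if (r == published_.load(std::memory_order_acquire))
            return false;                                             // empty
        out = buf_[r % N];
        read_.store(r + 1, std::memory_order_release);
        return true;
    }
};
```

Because each producer publishes only after every earlier slot is published, the consumer can safely read everything below `published_` without ever observing a gap or a partially written entry.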
In addition to correctness fixes, we provide an optimized configuration (include/optimized.hpp) with several micro- and macro-optimizations:
- **Replace modulo operations with bit masking**: When `RING_SIZE` is a power of two, replace `idx % RING_SIZE` with `idx & (RING_SIZE - 1)` (a combined sketch of this and the yield fallback follows the list).
- **Make `SafeTail` non-atomic**: Under the chosen publication protocol, `SafeTail` can be demoted from atomic to non-atomic to reduce synchronization overhead (only safe if the specific ordering guarantees are preserved).
- **Relax memory barriers**: Some implementations include stronger-than-necessary fences. The optimized version relaxes barriers while preserving correctness (the trade-off is architecture- and compiler-sensitive, so review changes carefully when porting).
- **Adaptive strategy (spin → yield under contention)**: When contention is high, spinning becomes wasteful. The `yield` variant yields the CPU after a threshold, improving system-wide throughput and reducing tail latency.
- **Use locking upon resource overcommitment**: When the system is overcommitted (for example, `N` cores with `N+1` runnable producer threads), aggressive spinning can degrade performance. The optimized strategy can fall back to locking or blocking behavior in these regimes.
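The masking and yield-fallback points lend themselves to a short sketch. `RING_SIZE`, `SPIN_LIMIT`, and the helper names below are illustrative assumptions rather than the constants and functions used in `include/optimized.hpp`.

```cpp
// Illustrative sketch of two of the optimizations (assumed names): power-of-two
// index masking instead of modulo, and a spin loop that falls back to
// std::this_thread::yield() after a bounded number of iterations.
#include <atomic>
#include <cstddef>
#include <thread>

constexpr size_t RING_SIZE = 1024;               // must be a power of two
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "RING_SIZE must be 2^k");

inline size_t slot_index(size_t idx) {
    return idx & (RING_SIZE - 1);                // same result as idx % RING_SIZE
}

// Spin until `published` reaches `slot`, yielding after SPIN_LIMIT attempts so
// an overcommitted machine does not burn a full core per blocked producer.
inline void wait_for_turn(const std::atomic<size_t>& published, size_t slot) {
    constexpr int SPIN_LIMIT = 128;              // illustrative threshold
    int spins = 0;
    while (published.load(std::memory_order_acquire) != slot) {
        if (++spins >= SPIN_LIMIT) {
            std::this_thread::yield();
            spins = 0;
        }
    }
}
```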
Build everything with:

```
make all
```

> [!NOTE]
> Change `TOTAL_CORES` in `include/common.hpp` to the number of cores on your machine (default: 32, the number of logical cores on our test machine).

> [!CAUTION]
> Use the `-DARM` flag to compile for the ARM architecture.
All experiments were run on a bare-metal server with:
- CPU: Intel Xeon E5-2630 v3 @ 2.40GHz
- Cores: 16 physical CPUs (32 logical cores)
- Memory: 64 GiB DRAM
We sweep the number of producer threads from 1 to 32 in logarithmic steps, holding the consumer configuration fixed, to observe scaling behavior and contention regimes.
- Warmup: ignore the first 5% of requests as warmup.
- Repetitions: 3 runs per data point.
- Aggregation: report the arithmetic mean across runs.
- Uncertainty: show 95% confidence intervals.
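For reference, the 95% confidence interval over the three runs can be computed with a standard Student's t interval (a textbook formulation; the repository's exact aggregation script is not shown here):

```math
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,\qquad
s = \sqrt{\tfrac{1}{n-1}\sum_{i=1}^{n}\bigl(x_i-\bar{x}\bigr)^2},\qquad
\mathrm{CI}_{95\%} = \bar{x} \pm t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}}
```

with $t_{0.975,\,2} \approx 4.30$ for $n = 3$ runs.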
The figure below demonstrates how each strategy fixes the multi-producer data race compared to the original (buggy) baseline.

The figure below illustrates the performance impact of the successive optimizations.

The figure below shows the ablation studies isolating each optimization.
