A high-performance single-producer single-consumer (SPSC) ring buffer implemented in C++, with a focus on low-latency communication, contention behavior, and performance tradeoffs under different synchronization strategies.
This project explores how design decisions such as wait strategies and buffer capacity impact throughput, latency distribution, and contention in concurrent systems.
```
Producer Thread
      |
      |  try_push(message)
      v
+----------------------+
| Lock-Free RingBuffer |
|                      |
|  [ ][ ][ ][ ][ ][ ]  |
|   ^           ^      |
|  head        tail    |
+----------------------+
      |
      |  try_pop(message)
      v
Consumer Thread
```
The producer writes messages into the ring buffer using try_push, while the consumer reads messages using try_pop. The buffer uses atomic head/tail indices to coordinate between threads without a mutex.
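A minimal sketch of that scheme is shown below. The class shape, member names, and power-of-two masking are illustrative assumptions, not the exact contents of ring_buffer.hpp:

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Illustrative SPSC ring buffer; details are assumptions, not the project's code.
template <typename T, std::size_t Capacity>
class SpscRingBuffer {
    static_assert((Capacity & (Capacity - 1)) == 0, "capacity must be a power of two");
public:
    bool try_push(const T& item) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed); // producer owns tail
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (tail - head == Capacity) return false;         // full
        slots_[tail & (Capacity - 1)] = item;              // write message first
        tail_.store(tail + 1, std::memory_order_release);  // then publish it
        return true;
    }

    bool try_pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed); // consumer owns head
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head == tail) return false;                    // empty
        out = slots_[head & (Capacity - 1)];
        head_.store(head + 1, std::memory_order_release);  // free the slot
        return true;
    }

private:
    std::array<T, Capacity> slots_{};
    alignas(64) std::atomic<std::size_t> head_{0}; // written only by consumer
    alignas(64) std::atomic<std::size_t> tail_{0}; // written only by producer
};
```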
Wait Strategy Under Contention
```
try_push / try_pop fails
           |
           v
+------------------+
|  Wait Strategy   |
+------------------+
| Spin             | keep retrying immediately
| Yield            | give CPU back to scheduler
| SpinThenYield    | spin briefly, then yield
+------------------+
```
- Fixed-size circular buffer with atomic head/tail indices
- Wait-free try_push / try_pop: each call completes or fails in a bounded number of steps rather than blocking
- Cache-line aligned to reduce false sharing
Supports multiple retry behaviors under contention (sketched after this list):
- Spin (busy-wait)
- Yield (scheduler handoff)
- Spin-Then-Yield (hybrid strategy)
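As a rough illustration, each strategy can be expressed as a small callable policy invoked after a failed try_push or try_pop. The interface below is an assumption, not the actual contents of wait_strategy.hpp:

```cpp
#include <cstdint>
#include <thread>

// Sketch of the three fixed wait strategies as callable policies (assumed interface).
struct Spin {
    void operator()(std::uint64_t /*failures*/) const {
        // Busy-wait: retry immediately, trading CPU for minimum latency.
    }
};

struct Yield {
    void operator()(std::uint64_t /*failures*/) const {
        std::this_thread::yield(); // hand the CPU back to the scheduler
    }
};

struct SpinThenYield {
    static constexpr std::uint64_t kSpinLimit = 64; // assumed threshold
    void operator()(std::uint64_t failures) const {
        if (failures > kSpinLimit) std::this_thread::yield(); // spin briefly, then yield
    }
};

// Typical call site: retry a failed push under the chosen policy.
template <typename Queue, typename T, typename WaitStrategy>
void push_blocking(Queue& q, const T& item, WaitStrategy wait) {
    std::uint64_t failures = 0;
    while (!q.try_push(item)) wait(++failures);
}
```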
Measures:
- Messages per second
- Failed push/pop attempts (contention proxy)
Captures end-to-end latency using timestamped messages:
- Average latency
- p50 / p95 / p99 / p99.9
- Max latency
Evaluates how buffer size affects:
- Throughput
- Latency distribution
- Backpressure and contention
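The percentile metrics can be computed with a simple nearest-rank pass over the recorded samples. This is a plausible sketch; benchmark_math.hpp may compute them differently:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Nearest-rank percentile over recorded latency samples (illustrative).
inline double percentile(std::vector<double> samples, double p) {
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    const std::size_t rank =
        static_cast<std::size_t>(p / 100.0 * (samples.size() - 1));
    return samples[rank];
}

// Usage: percentile(latencies_ns, 50.0), percentile(latencies_ns, 99.9), ...
```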
```
=== Throughput: Wait Strategy Comparison ===
Benchmark              msgs/s(M)   fail_push    fail_pop
----------------------------------------------------------------------
Spin                       17.17        2088     1719467
Yield                      24.04         477       57196
SpinThenYield              19.77        1662     1186547
```
```
=== Latency: Wait Strategy Comparison ===
Benchmark        msgs/s(M)    avg(ns)       p50        p95        p99      p99.9        max  fail_push  fail_pop
----------------------------------------------------------------------------------------------------------------------------------------
Spin                 11.02   92762.43  91441.60   98108.40  123666.60  193225.20  234341.80    1116163        55
Yield                13.11   77623.62  75766.80   88949.60  109983.20  176575.00  189375.40     201993         0
SpinThenYield        11.07   92333.90  91316.80  101749.80  112850.00  155141.40  160617.00    1061926        11
```
An exponential backoff strategy was introduced as an adaptive wait mechanism under contention.
Unlike fixed strategies (Spin, Yield, SpinThenYield), this approach increases wait time after repeated failures, attempting to reduce contention by backing off more aggressively when the buffer is busy.
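A sketch of the idea follows; the constants and pause primitive are assumptions, and the actual strategy may differ:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <thread>

// Sketch of an exponential backoff policy: the pause doubles after each
// consecutive failure, up to a cap. Constants are assumptions.
struct ExponentialBackoff {
    static constexpr std::uint64_t kMaxPauseNs = 32768; // assumed cap

    void operator()(std::uint64_t failures) const {
        // 1, 2, 4, ... ns, capped; actual sleep granularity is OS-dependent.
        const std::uint64_t shift = std::min<std::uint64_t>(failures, 15);
        const std::uint64_t pause = std::min<std::uint64_t>(1ULL << shift, kMaxPauseNs);
        std::this_thread::sleep_for(std::chrono::nanoseconds(pause));
    }
};
```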
```
=== Throughput: Wait Strategy Comparison ===
Benchmark              msgs/s(M)   fail_push    fail_pop
----------------------------------------------------------------------
Spin                       16.41          19     1921676
Yield                      27.38         204       32985
SpinThenYield              24.55        1499      551720
ExponentialBackoff         20.21        2645      327247
```
```
=== Latency: Wait Strategy Comparison ===
Benchmark           msgs/s(M)    avg(ns)       p50        p95        p99      p99.9        max  fail_push  fail_pop
----------------------------------------------------------------------------------------------------------------------------------------
Spin                    10.88   93846.92  92441.60  106425.20  115458.20  130000.00  134708.60    1221062         0
Yield                   13.01   78351.52  77100.00   88166.40   96366.60  123441.80  139583.20     214151         1
SpinThenYield           11.00   92853.83  92175.00  100424.80  110566.40  125058.40  128641.80    1150151         0
ExponentialBackoff      10.45   98410.68  95683.60  120675.00  131950.00  149158.40  174291.60    1117606         5
```
```
=== Throughput: Capacity Sweep (SpinThenYield) ===
Capacity            msgs/s(M)
--------------------------------------
64                      18.90
256                     19.26
1024                    19.90
8192                    20.14
```
```
=== Latency: Capacity Sweep (SpinThenYield) ===
Capacity              p99(us)
------------------------------------
64                        7.7
256                      29.7
1024                    112.7
8192                   1291.4
```
- Fixed-capacity circular buffer
- Uses atomic head/tail indices
- Employs memory ordering (acquire/release) for correctness
- Designed for SPSC usage
- Producer thread generates messages
- Consumer thread processes messages
- Configurable wait strategies control contention behavior
- Results aggregated and printed in structured tables
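Putting the pieces together, the measurement loop might look like the sketch below. The message layout and function names are assumptions about benchmark.cpp, not its exact code:

```cpp
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Each message carries a send timestamp so the consumer can measure
// end-to-end latency on receipt (illustrative layout).
struct Message {
    std::uint64_t sequence;
    std::chrono::steady_clock::time_point sent_at;
};

template <typename Queue, typename WaitStrategy>
std::vector<double> run_benchmark(Queue& q, std::uint64_t n, WaitStrategy wait) {
    std::vector<double> latencies_ns;
    latencies_ns.reserve(n);

    std::thread producer([&] {
        for (std::uint64_t i = 0; i < n; ++i) {
            Message m{i, std::chrono::steady_clock::now()};
            std::uint64_t failures = 0;
            while (!q.try_push(m)) wait(++failures); // failure count is the contention proxy
        }
    });

    // Consumer runs on the benchmark thread and records per-message latency.
    for (std::uint64_t received = 0; received < n; ++received) {
        Message m;
        std::uint64_t failures = 0;
        while (!q.try_pop(m)) wait(++failures);
        latencies_ns.push_back(std::chrono::duration<double, std::nano>(
            std::chrono::steady_clock::now() - m.sent_at).count());
    }

    producer.join();
    return latencies_ns; // feed into percentile() and throughput tables
}
```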
This ring buffer is designed for single-producer single-consumer usage.
Only the producer writes to tail, and only the consumer writes to head. Both threads may read the other index, but neither contends on writes to the same index.
This ownership model keeps the implementation simpler and faster than a multi-producer multi-consumer queue, which requires additional coordination around shared writes, compare/exchange loops, and more complex contention handling.
The ring buffer uses acquire/release memory ordering to safely publish messages between producer and consumer without a mutex.
Producer:
- Writes the message into the buffer slot.
- Release-stores the updated tail index.
Consumer:
- Acquire-loads the tail index.
- If data is available, reads the message.
The release/acquire pair guarantees that once the consumer observes the updated tail, it also observes the message written before that tail update.
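The same guarantee can be seen in isolation in this minimal two-thread demo, which publishes a single slot rather than a full ring:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Minimal demonstration of release/acquire publication (illustrative only).
static int payload = 0;
static std::atomic<bool> published{false};

int main() {
    std::thread producer([] {
        payload = 42;                                      // write the message
        published.store(true, std::memory_order_release);  // then publish
    });
    std::thread consumer([] {
        while (!published.load(std::memory_order_acquire)) {} // spin until published
        // The acquire load synchronizes with the release store, so the
        // payload write above is guaranteed to be visible here.
        std::printf("payload = %d\n", payload);
    });
    producer.join();
    consumer.join();
}
```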
The benchmark includes thread affinity support to reduce scheduling noise during measurement.
Producer and consumer threads are assigned to separate cores when supported. This helps reduce thread migration, improves cache locality, and makes benchmark behavior more consistent.
On macOS, strict CPU pinning is not exposed in the same way as Linux, so the implementation uses an affinity hint rather than a hard guarantee.
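A cross-platform helper along these lines would match that description. The function name is hypothetical; on Linux the mask is a hard affinity, while on macOS the Mach affinity tag is only a hint and may be ignored on newer hardware:

```cpp
#include <thread>

#if defined(__linux__)
#include <pthread.h>
#include <sched.h>
#elif defined(__APPLE__)
#include <mach/mach.h>
#include <mach/thread_policy.h>
#include <pthread.h>
#endif

// Pin (Linux) or hint (macOS) a thread onto a core. Hypothetical helper,
// not necessarily the project's exact implementation.
inline void set_affinity(std::thread& t, int core) {
#if defined(__linux__)
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
#elif defined(__APPLE__)
    // Non-zero tags group threads; the kernel treats this as a hint only.
    thread_affinity_policy_data_t policy{core + 1};
    thread_policy_set(pthread_mach_thread_np(t.native_handle()),
                      THREAD_AFFINITY_POLICY,
                      reinterpret_cast<thread_policy_t>(&policy),
                      THREAD_AFFINITY_POLICY_COUNT);
#else
    (void)t; (void)core; // unsupported platform: no-op
#endif
}
```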
In this benchmark, thread affinity exposed queueing behavior more clearly: with larger buffers, the producer can run ahead more consistently, causing messages to spend longer waiting in the queue.
Using a compile-time capacity:
- avoids dynamic allocation
- improves cache locality
- enables predictable performance
Atomic operations eliminate:
- mutex lock/unlock overhead
- kernel-mediated blocking and wakeup costs on the hot path
Separating retry behavior allows:
- controlled experimentation
- comparison of CPU vs latency tradeoffs
Benchmarks are:
- repeatable
- isolated by variable (strategy vs capacity)
- reported with percentile metrics
- Spin
- low median latency
- highest contention due to aggressive retrying
- Yield
- highest throughput
- lowest overall latency in this workload due to better CPU sharing
- Spin-Then-Yield
- balanced approach
- reduces contention while maintaining stable latency
- Exponential Backoff
- lower throughput and higher latency in this workload
- adaptive waiting increased delay under sustained contention
Simpler strategies can outperform adaptive ones when contention is continuous. Yield performed best because handing the CPU back to the scheduler let the other thread drain or fill the buffer sooner.
- Small buffers
- lower latency due to minimal queueing
- higher contention and backpressure
- Large buffers
- slightly higher throughput
- significantly increased latency due to queueing delay
Larger buffers reduce backpressure but allow the producer to run ahead, increasing queue depth and tail latency.
- Pinning threads reduced scheduling variability and made behavior more consistent
- More stable execution exposed queueing effects more clearly
- Latency increased under larger buffers as producer-consumer imbalance became more consistent
Improved stability does not always reduce latency: removing scheduling noise can reveal underlying system imbalances.
```
mkdir build
cd build
cmake ..
make
./ring_buffer_benchmark
```
```
include/
  ring_buffer.hpp
  mutex_queue.hpp
  wait_strategy.hpp
  benchmark_math.hpp
  benchmark_format.hpp
benchmarks/
  benchmark.cpp
examples/
  basic_usage.cpp
```
- Multi-producer / multi-consumer (MPMC) support
- NUMA-aware benchmarking
- Cache line padding experiments
- Lock-free memory reclamation strategies
Low-latency systems require careful balancing of:
- throughput
- latency
- contention
- CPU utilization
This project explores those tradeoffs through controlled experiments, providing insight into how synchronization strategies and buffer design impact real-world performance.
Throughput, latency, and contention are tightly coupled. Optimizing one often comes at the cost of another, and the best strategy depends on workload characteristics and system behavior.