A price-time priority order book matching engine benchmarked across a 2×2 implementation matrix. Built to demonstrate the performance impact of two independent design axes: price→level mapping strategy and field memory layout.
| Implementation | p50 | p99 | p999 |
|---|---|---|---|
| VectorAoS | 30.1 ns | 250.5 ns | 731.4 ns |
| VectorSoA | 30.1 ns | 210.4 ns | 541.0 ns |
| MapAoS | 70.1 ns | 2254.2 ns | 15108.4 ns |
| MapSoA | 50.1 ns | 2414.5 ns | 14827.8 ns |
Latency measured with RDTSC, calibrated against QueryPerformanceFrequency. 1,048,576 samples per implementation. Release x64, Windows.
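For reference, a minimal sketch of how such a calibration can be done on Windows; the function name and the ~100 ms spin window are illustrative, not taken from the repo:

```cpp
#include <windows.h>
#include <intrin.h>

// Illustrative calibration: the ratio of elapsed TSC ticks to
// QPC-measured wall time gives the TSC frequency in GHz.
double calibrate_tsc_ghz() {
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    const unsigned long long c0 = __rdtsc();
    do {
        QueryPerformanceCounter(&t1);
    } while (t1.QuadPart - t0.QuadPart < freq.QuadPart / 10);   // spin ~100 ms
    const unsigned long long c1 = __rdtsc();
    const double seconds = static_cast<double>(t1.QuadPart - t0.QuadPart)
                         / static_cast<double>(freq.QuadPart);
    return static_cast<double>(c1 - c0) / seconds / 1e9;        // cycles per ns
}
```

Per-order latency in nanoseconds then falls out as `(tsc_end - tsc_start) / tsc_ghz`.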
Two orthogonal design axes, four implementations, one OrderBookLike concept.
| Storage | Description | Lookup |
|---|---|---|
| Vector | pmr::vector pre-allocated to 20,000 levels, indexed directly by price_idx | O(1) |
| Map | std::map<uint32_t, Level>, sparse, heap-allocated nodes | O(log n) |
| Layout | Description |
|---|---|
| AoS | total_quantity lives inside each level struct alongside its ring buffer |
| SoA | total_quantity hoisted into a separate flat uint32_t[20000] array across all levels |
The SoA flat array enables a contiguous drain scan — after a level empties, finding the next populated level reads sequential uint32_t values rather than striding across ~8KB ring buffers. AVX2 can sweep 8 levels per instruction.
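A sketch of what that sweep could look like with AVX2 intrinsics (requires /arch:AVX2; the function name and bounds handling are assumptions, and the shipping scan is still scalar per the roadmap):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical drain scan: find the first level index >= start whose
// total quantity is non-zero, reading the flat SoA array 8 lanes at a time.
std::size_t next_populated(const std::uint32_t* qty, std::size_t start, std::size_t n) {
    std::size_t i = start;
    for (; i < n && (i & 7u); ++i)             // scalar prologue to 8-lane alignment
        if (qty[i]) return i;
    const __m256i zero = _mm256_setzero_si256();
    for (; i + 8 <= n; i += 8) {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(qty + i));
        int mask  = _mm256_movemask_epi8(_mm256_cmpeq_epi32(v, zero));
        if (mask != -1)                        // at least one lane is non-zero
            return i + _tzcnt_u32(~static_cast<unsigned>(mask)) / 4;
    }
    for (; i < n; ++i)                         // scalar epilogue for the tail
        if (qty[i]) return i;
    return n;                                  // no populated level above start
}
```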
VectorSoA beats VectorAoS by 16% at p99. MapSoA shows no gain over MapAoS. This isolates the SoA win — it comes from the contiguous sweep, not from field separation alone. The Map defeats the sweep because levels are heap-scattered regardless of layout.
p50 is identical between VectorAoS and VectorSoA because steady-state matching only touches ring.front() — there is no cross-level scan to vectorise.
[Feed Thread] → SPSC Queue → [Match Thread] → SPSC Queue → [Publisher Thread]
- Feed thread — generates a simulated order stream, pushes to a lock-free SPSC queue
- Match thread — spin-polls on a pinned core, processes orders via price-time priority matching, records RDTSC latency per order
- Publisher thread — (in progress) uWebSockets event loop, broadcasts book state via WebSocket, serves REST endpoints
Each thread pinned to a dedicated core via SetThreadAffinityMask. Startup synchronised with std::latch{3}.
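A condensed sketch of that startup pattern, with core numbers and thread bodies as placeholders:

```cpp
#include <windows.h>
#include <latch>
#include <thread>

int main() {
    std::latch start{3};   // all three threads rendezvous before work begins

    // Helper: spawn a thread pinned to `core`, blocked on the latch.
    auto pinned = [&start](unsigned core, auto body) {
        return std::thread([&start, core, body] {
            SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR{1} << core);
            start.arrive_and_wait();
            body();
        });
    };

    auto feed  = pinned(1, [] { /* generate orders, push to the SPSC queue */ });
    auto match = pinned(2, [] { /* spin-poll, match, record RDTSC per order */ });
    auto pub   = pinned(3, [] { /* drain results, publish */ });

    feed.join(); match.join(); pub.join();
}
```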
One 64MB PMR slab allocated at startup. All book allocation happens at construction — zero heap allocation on the hot path. An unsynchronised monotonic_buffer_resource is sufficient because only the match thread ever touches the book.
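A minimal sketch of the slab setup, under one assumption: a null upstream resource, so exhausting the 64MB slab would fail loudly rather than silently falling back to the global heap:

```cpp
#include <cstdint>
#include <memory_resource>
#include <vector>

// One upfront slab; monotonic because the book never frees on the hot path.
std::pmr::monotonic_buffer_resource pool{
    64 * 1024 * 1024,                    // 64MB, allocated once at startup
    std::pmr::null_memory_resource()     // assumption: overflow throws, no heap fallback
};

// Book containers draw from the slab at construction; nothing allocates later.
std::pmr::vector<std::uint32_t> total_quantities(20'000, &pool);
```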
All four books are one templated struct:
```cpp
template<typename StoragePolicy, typename LayoutPolicy, std::size_t Capacity = 1024>
struct OrderBook { ... };

using VectorAoS = OrderBook<VectorStorage, AoS>;
using VectorSoA = OrderBook<VectorStorage, SoA>;
using MapAoS    = OrderBook<MapStorage, AoS>;
using MapSoA    = OrderBook<MapStorage, SoA>;
```

`if constexpr` on policy tags resolves all divergence at compile time. Matcher and engine are templated on an OrderBookLike concept — swapping implementations is one line in main.cpp, zero runtime cost.
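A toy illustration of the `if constexpr` dispatch, with tag and member names assumed rather than lifted from the repo (the real book would select the storage itself by tag; here both members exist so the snippet compiles standalone):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <type_traits>

struct AoS {}; struct SoA {};

struct Level { std::uint32_t total_quantity{}; /* ring buffer elided */ };

template <typename Layout, std::size_t N>
struct Levels {
    std::array<Level, N>         levels{};
    std::array<std::uint32_t, N> flat_qty{};   // the SoA side array

    std::uint32_t quantity(std::size_t i) const {
        // The dead branch is discarded at instantiation, so each layout
        // compiles to a single direct load with no runtime test.
        if constexpr (std::is_same_v<Layout, SoA>)
            return flat_qty[i];
        else
            return levels[i].total_quantity;
    }
};
```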
Fixed-capacity ring buffer (RingLevel<1024>) used across all four implementations. O(1) push/pop, no heap allocation. Overflow is a fatal assert — bounded by market microstructure, not a runtime concern.
The v2 baseline used pmr::vector with erase(begin()) — O(n) shift directly visible as a ~4μs p99 spike. RingLevel dropped p99 from ~4μs to ~250ns.
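A minimal sketch of that shape, with field names assumed:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fixed-capacity ring level: head/tail only ever increase and indexing
// masks by Capacity - 1, so push/pop are O(1) and nothing reallocates.
template <std::size_t Capacity = 1024>
struct RingLevel {
    static_assert((Capacity & (Capacity - 1)) == 0, "power-of-two capacity");
    struct Order { std::uint64_t id; std::uint32_t qty; };

    std::array<Order, Capacity> buf;
    std::size_t head = 0, tail = 0;   // tail - head == current size

    void push(Order o) {
        assert(tail - head < Capacity && "level overflow is fatal by design");
        buf[tail++ & (Capacity - 1)] = o;
    }
    Order& front() { return buf[head & (Capacity - 1)]; }
    void   pop()   { ++head; }        // no element shift, unlike erase(begin())
    bool   empty() const { return head == tail; }
};
```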
Windows, MSVC, Visual Studio 2022 17.13+.
- Standard: `/std:c++latest`
- Scan Sources for Module Dependencies: Yes
- All `.ixx` files: Compile As C++ Module Interface
BookingEngine.sln
To benchmark a different implementation, change the template argument in main.cpp:
```cpp
auto engine = std::make_unique<
    BookingEngine::EngineThread<BookingEngine::Containers::VectorSoA<>>
>(&pool, tsc_ghz);
```

- Publisher thread — uWebSockets event loop, SPSC drain via `loop->defer()`
- REST API — `/stats`, `/book`, `/health`
- WebSocket — JWT-authenticated live feed (Supabase, RS256 validated locally)
- React dashboard — live bid/ask, spread, depth, latency percentiles
- Deploy — `orderbook.parkinson.industries`, DigitalOcean VPS, nginx
- AVX2 scan — replace scalar drain scan in `VectorSoA` with `_mm256` sweep