A modern C++23 header-only thread pool with dependency-driven scheduling, priority domains, and lock-free SPSC buffering.
- Header-only — Drop the `include/freebyrd/` directory into your project; no compiled library needed
- Lock-free task queues — Bounded MPMC ring buffers using Vyukov's algorithm with per-slot sequence counters and cache-line-padded head/tail atomics
- Dependency gates — Atomic countdown barriers that auto-schedule a completion task when all dependencies are satisfied; chain gates for multi-phase pipelines
- Priority domains — Named task categories with dedicated queues and configurable priority ordering; workers dequeue from higher-priority domains first
- Per-worker local queues — Cache-affine task submission with work stealing across workers for load balancing
- SPSC swath buffers — Double/triple/quad-buffered handoff for producer–consumer pipelines; all state packed into a single `atomic<uint64_t>` with C++20 `wait`/`notify`
- Small-buffer-optimized tasks — Type-erased callables with 48-byte SBO; lambdas with typical captures avoid heap allocation entirely
- Clean lifecycle — `std::jthread` with `std::stop_token` for cooperative shutdown; no condition variables, no mutexes on the hot path
- C++23 compiler (GCC 13+, Clang 17+)
- CMake 3.20+ (only needed for tests)
- pthreads (Linux/macOS)
Copy the `include/freebyrd/` directory into your project's include path. Then:

```cpp
#include <freebyrd/freebyrd.h>

int main() {
    frb::thread_pool pool;
    pool.submit(frb::task([] { /* work */ }));
    pool.wait_idle();
}
```

Compile with `-std=c++23 -lpthread`.
```cmake
add_subdirectory(path/to/freebyrd)
target_link_libraries(your_target PRIVATE freebyrd)
```

The CMake target `freebyrd` is an INTERFACE library — it only propagates include paths, the C++23 standard requirement, and the Threads dependency. No static or shared library is built.
```cmake
include(FetchContent)
FetchContent_Declare(freebyrd
  GIT_REPOSITORY https://your-repo/freebyrd.git
  GIT_TAG main
)
FetchContent_MakeAvailable(freebyrd)
target_link_libraries(your_target PRIVATE freebyrd)
```

```sh
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
./freebird_tests
```

Run a subset of tests by name filter:

```sh
./freebird_tests pipeline   # runs only tests containing "pipeline"
./freebird_tests stress     # runs only stress tests
```

| Header | Class | Purpose |
|---|---|---|
| `cache_line.h` | `cache_aligned<T>` | Pads any type to a full cache line (64 bytes or `hardware_destructive_interference_size`) to prevent false sharing. Uses a variadic constructor so non-movable types like `std::atomic` can be constructed in place. |
| `task.h` | `task` | Type-erased callable with 48-byte small-buffer optimization. Captures ≤ 48 bytes are stored inline (no heap allocation); larger captures fall back to `new`. Move-only; uses bitwise copy of the SBO storage for zero-overhead moves. |
| `task_queue.h` | `task_queue` | Bounded MPMC (multi-producer, multi-consumer) ring buffer based on Dmitry Vyukov's algorithm. Each slot has an atomic sequence counter that acts as both a ticket and a fence. Head and tail atomics are cache-line padded to eliminate false sharing. |
| `dependency_gate.h` | `dependency_gate` | Atomic dependency counter. Constructed with a count and a completion task. Each call to `arrive()` decrements the counter; when it hits zero, the completion task is automatically submitted to the pool. Gates can be reused via `reset()`. |
| `task_domain.h` | `task_domain` | A named task category with its own `task_queue` and a `sequence_index` that determines dequeue priority (lower = higher priority). Workers sweep domains in priority order. |
| `swath_buffer.h` | `swath_buffer<T>` | SPSC (single-producer, single-consumer) double/triple/quad-buffered handoff. Packs the consumer-available count (M), producer-available count (D), and a wait flag (W) into a single `atomic<uint64_t>`. Uses C++20 `atomic::wait()`/`notify_all()` for blocking consumers — no condition variables. |
| `thread_pool.h` | `thread_pool` | The pool orchestrator. Manages `std::jthread` workers, each with a local `task_queue`. Supports global-queue fallback, domain-priority dequeue, and round-robin work stealing. Idle workers spin briefly, then sleep via `atomic::wait()`. `wait_idle()` blocks until all submitted tasks (including those spawned by running tasks) complete. |
Each worker thread follows this priority when looking for work:
- Own local queue — best cache locality
- Preferred domain queue — the domain this worker is associated with
- Other domain queues — swept in `sequence_index` order (lower = higher priority)
- Global queue — catch-all fallback
- Steal from other workers — round-robin across peer local queues
- Back off — spin for 64 iterations (with CPU pause), then sleep via `atomic::wait()`
Traditional barrier-based parallelism (all tasks in phase N must complete before any task in phase N+1 can start) leaves cores idle at phase boundaries. Freebird uses dependency gates instead:
┌─────────────────────────────────────────────────────┐
│ Phase 1: Decode blocks (20,480 tasks) │
│ Each block calls gate.arrive() on its strip's │
│ dependency_gate when finished │
└───────┬─────────────────────────────────────────────┘
│ (no barrier — strips start as soon as
│ their blocks complete)
▼
┌─────────────────────────────────────────────────────┐
│ Phase 2: Horizontal filter (640 tasks) │
│ Each strip's gate fires a h_filter task when │
│ all blocks in that strip are decoded. │
│ Each h_filter task calls arrive() on the │
│ resolution's v_gate. │
└───────┬─────────────────────────────────────────────┘
│ (no barrier — resolutions start as soon
│ as all their strips complete)
▼
┌─────────────────────────────────────────────────────┐
│ Phase 3: Vertical filter (40 tasks) │
│ Fires when all h_filter strips for a resolution │
│ are done. │
└───────┬─────────────────────────────────────────────┘
│ (v_filter completion submits disk write)
▼
┌─────────────────────────────────────────────────────┐
│ Phase 4: Disk write + fdatasync (40 tasks) │
│ Writes decoded strip data to the real filesystem │
│ and forces to storage with fdatasync(). Runs │
│ concurrently with decode/filter work on │
│ remaining tiles. │
└─────────────────────────────────────────────────────┘
This means Phase 2 tasks for early strips can overlap with Phase 1 tasks for later strips, and Phase 4 disk writes overlap with all earlier phases — the pipeline stays full and all cores stay busy.
On a 16-core machine (GCC 15, Release, -O3):
| Metric | Pipeline only | Full suite (33 tests) |
|---|---|---|
| CPU utilization | 1465% (91.6% of 16 cores) | 1398% (87.4%) |
| Wall time | 2.25s | 2.50s |
| User+System time | 33.0s | 34.8s |
| Disk I/O (NVMe, fdatasync) | ~20 MB, 0% CPU impact | ~20 MB |
The disk writes (512 KB × 40 strips with fdatasync() to btrfs/NVMe) complete within the compute budget of remaining tiles — zero measurable impact on CPU utilization.
```cpp
frb::pool_config cfg{
    .num_threads = 0,  // 0 = hardware_concurrency()
    .global_queue_capacity = 65536,
    .local_queue_capacity = 1024,
};
frb::thread_pool pool(cfg);

// Submit a task (default domain).
pool.submit(frb::task([] { do_work(); }));

// Submit to a specific domain.
auto dom_id = pool.register_domain("decode", /*sequence_index=*/0);
pool.submit(frb::task([] { decode_block(); }), dom_id);

// Submit to a specific worker's local queue (cache affinity).
pool.submit_local(frb::task([] { hot_path(); }), worker_id, dom_id);

// Block until all pending tasks complete.
pool.wait_idle();

// Query the current worker index from within a task (-1 if called
// from a non-worker thread).
int32_t wid = frb::thread_pool::current_worker_id();
```

```cpp
frb::thread_pool pool;

// Create a gate that fires when 10 dependencies arrive.
frb::dependency_gate gate(pool, 10, frb::task([] {
    // Runs exactly once, when the 10th arrive() is called.
    std::printf("All dependencies satisfied!\n");
}));

// From various tasks:
gate.arrive();   // decrement by 1
gate.arrive(3);  // decrement by 3

// Reuse the gate for the next round:
gate.reset(pool, 5, frb::task([] { /* new completion */ }));
```

```cpp
frb::swath_buffer<std::vector<float>> buf(3);  // triple-buffered

// Producer thread:
auto idx = buf.try_acquire_producer();  // returns std::optional<uint32_t>
if (idx) {
    buf[*idx] = produce_data();
    buf.release_to_consumer();
}

// Consumer thread (blocking):
auto slot = buf.acquire_consumer_blocking();  // blocks until data available
process(buf[slot]);
buf.release_to_producer();

// Non-blocking consumer:
if (auto c = buf.try_acquire_consumer()) {
    process(buf[*c]);
    buf.release_to_producer();
}
```

```cpp
// Register domains with priority ordering (lower sequence_index = higher priority).
auto d_decode = pool.register_domain("decode", /*sequence_index=*/0);
auto d_transform = pool.register_domain("transform", /*sequence_index=*/1);
auto d_output = pool.register_domain("output", /*sequence_index=*/2);

// Submit to specific domains — workers service higher-priority domains first.
pool.submit(frb::task([] { decode(); }), d_decode);
pool.submit(frb::task([] { transform(); }), d_transform);
```

33 tests covering unit, integration, and stress scenarios:
| Category | Tests | Description |
|---|---|---|
| `task` | 4 | SBO path, heap path, move semantics, empty state |
| `task_queue` | 4 | Push/pop, empty pop, fill/drain, MPMC stress (100K tasks, 4P/4C) |
| `thread_pool` | 5 | Single task, 10K tasks, wait-idle-empty, submit-from-task chains, worker ID |
| `domains` | 1 | Multi-domain submit and completion |
| `dependency_gate` | 4 | Single dep, multi-dep, concurrent arrive (10K), 3-stage chain |
| `swath_buffer` | 3 | Basic handoff, double-buffer fill/drain, SPSC concurrent (50K items) |
| `stress` | 11 | Rapid submit/wait (100 rounds × 1K), fan-in (500-wide, 20 rounds), 200-deep gate chains, mixed domains (10K), pipeline simulation (512 blocks → 32 strips), recursive fan-out (4-way, depth 5: 1365 tasks), contention hammer (100K tasks, 16 threads), local submit (10K), pool create/destroy (20 cycles), gate reset/reuse (50 rounds), 3-stage swath pipeline (50K items) |
| `pipeline` | 1 | Full-scale multi-tile multi-resolution codec pipeline with disk I/O: 8 tiles × 5 resolutions × 16 strips × 32 blocks = 20,480 decode tasks → 640 h_filter tasks → 40 v_filter tasks → 40 disk write+fdatasync tasks, 4 priority domains, parallel tile submission, real filesystem output |
All tests pass under:
- Release (optimized, no sanitizers)
- Debug + AddressSanitizer (memory safety)
- RelWithDebInfo + ThreadSanitizer (data race detection)
freebyrd/
├── CMakeLists.txt # INTERFACE library + test target
├── README.md
├── include/
│ └── freebyrd/
│ ├── freebyrd.h # umbrella header (include this)
│ ├── cache_line.h # cache_aligned<T>
│ ├── task.h # task (type-erased callable, 48-byte SBO)
│ ├── task_queue.h # task_queue (Vyukov MPMC ring buffer)
│ ├── task_domain.h # task_domain (named priority queue)
│ ├── dependency_gate.h # dependency_gate (atomic countdown)
│ ├── swath_buffer.h # swath_buffer<T> (SPSC multi-buffered handoff)
│ └── thread_pool.h # thread_pool (orchestrator + all inline impl)
├── tests/
│ └── test_pool.cpp # 33 tests (unit, stress, full pipeline)
└── build/ # out-of-source build directory
This project is licensed under the GNU General Public License v3.0.
Copyright (C) 2025-2026 Grok Image Compression Inc.