LLAM is a stackful user-thread runtime for C applications. It lets C code express concurrency with task-oriented APIs such as spawn, join, sleep, channels, read, write, accept, connect, and poll, while the runtime schedules many user tasks over a smaller set of OS worker threads.
LLAM is not Linux-only. The Linux backend uses io_uring/liburing, the macOS/Darwin backend uses kqueue-based watch and completion paths, and the native Windows 10/11 backend uses IOCP for overlapped Winsock read/write/accept/connect plus generic HANDLE ReadFile/WriteFile requests.
- Stackful tasks with natural C control flow.
- N:M scheduling over runtime worker threads.
- Linux I/O backend based on io_uring/liburing.
- macOS/Darwin I/O backend based on kqueue.
- Windows 10/11 backend with IOCP request completions for sockets and overlapped HANDLEs, Windows wake handles, and x86_64 context-switch assembly.
- Task primitives: spawn, yield, join, sleep, deadlines, and task metadata.
- Synchronization primitives: mutex, condition variable, channel, and cancellation token.
- Channel multiplexing with llam_channel_select() and focused select benchmarks.
- Blocking integration through llam_call_blocking, llam_enter_blocking, and llam_leave_blocking.
- Runtime tuning through profiles, dynamic workers, worker rings, SQPOLL, and idle-spin controls.
- Observability through runtime stats and debug dumps.
- Stable ABI metadata for dynamic language-runtime loaders.
- Static and shared library build targets.
- Built-in demo, chat server, stress, benchmark, Docker verification, and Go/Tokio comparison scripts.
| Platform | Status | I/O backend | Recommended compiler | Verification |
|---|---|---|---|---|
| Linux x86_64 | Primary Linux path | io_uring/liburing | GCC or Clang | make verify-linux CC=gcc |
| Linux aarch64 | Supported | io_uring/liburing | GCC or Clang | make verify-linux CC=gcc |
| macOS arm64 | Primary macOS path | kqueue | Apple Clang | CC=clang make verify-darwin |
| macOS x86_64 | Supported | kqueue + x86_64 asm context switch | Apple Clang | CC=clang make verify-darwin |
| Windows 10/11 | Supported native x86_64 backend | IOCP for WSARecv/WSASend/AcceptEx/ConnectEx, overlapped HANDLE ReadFile/WriteFile, plus gated TCP POLLOUT and UDP POLLIN; TCP POLLIN defaults to fallback unless LLAM_WINDOWS_IOCP_TCP_POLLIN=1 is enabled | MinGW and MSVC/MASM via CMake | CMake Windows build plus test_windows_policy, test_windows_runtime_smoke, test_windows_iocp_io, and test_windows_handle_io; scripts/verify_windows.ps1 -Native |
Native Windows runtime support covers scheduler/core, wake handles, x86_64 context switching, IOCP-backed socket requests, and overlapped HANDLE I/O. Windows 10 and Windows 11 use the same public API; LLAM selects conservative Windows 10 tuning or batched Windows 11 tuning at runtime, and CI forces both policy branches on native Windows runners.
Production and stress-operation guidance is documented in docs/operations.md.
Install Linux/WSL dependencies:

```sh
sudo apt install build-essential liburing-dev
```

Install macOS command-line tools:

```sh
xcode-select --install
```

Build on Linux:

```sh
make -j4 CC=gcc
```

Build on macOS:

```sh
CC=clang make -j4
```

Build native Windows with CMake:

```sh
cmake -S . -B build-windows -G "Ninja" -DCMAKE_BUILD_TYPE=Release
cmake --build build-windows
ctest --test-dir build-windows --output-on-failure
```

.\scripts\verify_windows.ps1 still verifies the Linux backend through WSL. .\scripts\verify_windows.ps1 -Native builds the native Windows CMake targets and runs the Windows CTest suite.
Build with CMake:

```sh
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j4
```

Install with CMake:

```sh
cmake --install build --prefix "$HOME/.local"
```

Run the included programs:

```sh
./demo
./stress
./bench
./server 7777
```

Stress the chat server with real TCP clients:

```sh
make server-stress
python3 scripts/stress_server.py --clients 64 --messages 16 --payload-bytes 64
```

Run the native maximum-throughput flood driver:

```sh
make server-flood
./server_flood --clients 16 --duration 60 --message-bytes 8 --batch 64 --target-mps 0.30
```

server_flood reports both inbound messages/sec and observed broadcast deliveries/sec. For chat fanout, one inbound message can produce clients - 1 peer deliveries, so million-level delivery rates can appear before inbound message rates reach the same scale.
Run the full composite server stress suite:

```sh
make server-stress-composite
make server-stress-composite-quick
make server-stress-composite-hour
python3 scripts/stress_server_composite.py --quick
```

The composite suite combines exact fanout checks, a 60-second native flood, payload-size variation, connection churn, slow receivers, half-close/reset patterns, and RSS/fd sampling.

--quick is intended for hosted CI runners. It keeps exact-delivery checks but uses a lower absolute flood delivery threshold than the standard and hour-long runs.

The one-hour profile runs the same classes of checks with a long soak layout: 30 minutes of main flood, two 5-minute payload flood phases, and 20 minutes of mixed edge stress.
Run focused API/ABI tests:

```sh
make test
```

Build outputs:

- demo: runnable examples of the public runtime API.
- stress: regression coverage for scheduling, sync, timeouts, I/O, and dynamic workers.
- bench: microbenchmarks for spawn/join, channels, channel select, I/O, poll, sleep fanout, and opaque blocking.
- server: minimal LLAM-backed TCP chat backend for local testing.
- server_flood: native nonblocking throughput flood driver for the chat server.
- scripts/stress_server.py: TCP fanout stress test for the chat server.
- scripts/stress_server_composite.py: long-running composite server stability suite.
- test_abi_contract: ABI metadata and size handshakes.
- test_connect_io: direct and runtime-managed llam_connect() success and invalid-input checks.
- test_runtime_core: lifecycle, task metadata, yielding, sleeping, blocking callbacks, and stats checks.
- test_sync_primitives: mutex, condition variable, channel, timeout, and close semantics.
- test_io_buffers: direct and managed poll/read/write, owned buffers, and MSG_PEEK.
- test_shared_load: dlopen() coverage for the shared library ABI surface.
The top-level Makefile builds the bundled executables directly. For application integration, the simplest path is the CMake target llam_runtime.
```cmake
add_subdirectory(path/to/LLAM)
add_executable(my_app main.c)
target_link_libraries(my_app PRIVATE llam_runtime)
```

Use llam_runtime_shared when a language runtime needs to load LLAM dynamically.
The Makefile equivalent is make shared.
Release archives include the public headers, docs, bundled examples, runtime
libraries, pkg-config metadata, and CMake package files. Tag pushes such as
v1.0.1 build and publish .tar.xz archives for Linux x86_64, Linux aarch64,
macOS x86_64, and macOS arm64, plus a native Windows x86_64 .zip archive
through .github/workflows/release.yml.
The 1.0 release gate is intentionally platform-local: Linux must pass
make verify-linux or Docker verification, macOS must pass the Darwin verify
path, and Windows must pass native CMake/CTest plus Windows 2022/2025 stress
smoke. The full operational checklist is in docs/operations.md.
Use an installed SDK with CMake:

```cmake
find_package(llam CONFIG REQUIRED)
add_executable(my_app main.c)
target_link_libraries(my_app PRIVATE llam::runtime)
```

Use an installed SDK with pkg-config:

```sh
cc main.c $(pkg-config --cflags --libs llam) -o my_app
```

Install on Linux/macOS:

```sh
curl -fsSL https://github.com/Feralthedogg/LLAM/releases/download/1.0.1/install.sh | sh -s -- --version 1.0.1 --prefix "$HOME/.local"
```

Install a specific Linux/macOS target:

```sh
curl -fsSL https://github.com/Feralthedogg/LLAM/releases/download/1.0.1/install.sh | sh -s -- --version 1.0.1 --target macos-aarch64 --prefix "$HOME/.local"
```

Install on Windows x86_64:

```powershell
Invoke-WebRequest "https://github.com/Feralthedogg/LLAM/releases/download/1.0.1/install.ps1" -OutFile install.ps1; .\install.ps1 -Version 1.0.1 -Prefix "$env:LOCALAPPDATA\LLAM"
```

Include the canonical public API:

```c
#include "llam/runtime.h"
```

Dynamic loaders should check llam_abi_version() or llam_abi_get_info() before binding the rest of the API. FFI bindings should prefer llam_runtime_init_ex() and llam_spawn_ex() so inbound option structs carry an explicit caller-side size. The ABI and semantic contract is documented in docs/abi.md.
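The loader-side check described above can be sketched as a small guard. This is illustrative only: the exact return type of llam_abi_version() is defined in docs/abi.md, and the EXPECTED_LLAM_ABI constant stands in for whatever version a binding was generated against.

```c
#include "llam/runtime.h"
#include <stdio.h>

/* Hypothetical binding-side constant; the real expected value comes from
 * the ABI contract in docs/abi.md. */
#define EXPECTED_LLAM_ABI 1

static int bind_llam_or_fail(void) {
    /* An integer-like return from llam_abi_version() is assumed here. */
    if ((int)llam_abi_version() != EXPECTED_LLAM_ABI) {
        fprintf(stderr, "LLAM ABI mismatch; refusing to bind\n");
        return -1;
    }
    return 0;
}
```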
Embedding code should use llam_runtime_create(), llam_runtime_run_handle(), and llam_runtime_destroy(), while treating LLAM 1.0 as one active runtime per process.
True multi-runtime isolation is a post-1.0 migration item; do not create/destroy
LLAM concurrently from multiple host runtime instances.
macOS-specific performance gates and remaining structural work are covered by the platform-local release checklist in docs/operations.md.
Windows backend scope, policy split, and acceptance gates are tracked in docs/operations.md.
A typical LLAM program follows this lifecycle:

- Initialize the runtime with llam_runtime_init().
- Spawn one or more root tasks with llam_spawn().
- Run the scheduler with llam_run().
- Shut the runtime down with llam_runtime_shutdown().
```c
#include "llam/runtime.h"
#include <stdio.h>

static void worker(void *arg) {
    const char *name = arg;
    printf("hello from %s\n", name);
    llam_yield();
    printf("bye from %s\n", name);
}

int main(void) {
    if (llam_runtime_init(NULL) != 0) {
        return 1;
    }
    if (llam_spawn(worker, "LLAM", NULL) == NULL) {
        llam_runtime_shutdown();
        return 1;
    }
    int rc = llam_run();
    llam_runtime_shutdown();
    return rc;
}
```

A task is a void (*)(void *) function. Pass shared state through the task
argument and use llam_join() when a parent task needs the child to finish.
Every task handle returned by llam_spawn*() must be consumed by either a
successful join or llam_detach().
```c
#include "llam/runtime.h"
#include <stdint.h>
#include <stdio.h>

typedef struct job {
    int input;
    int output;
} job_t;

static void child(void *arg) {
    job_t *job = arg;
    llam_sleep_ns(1ULL * 1000ULL * 1000ULL);
    job->output = job->input * job->input;
}

static void root(void *arg) {
    (void)arg;
    job_t job = {.input = 12};
    llam_task_t *task = llam_spawn(child, &job, NULL);
    if (task != NULL && llam_join(task) == 0) {
        printf("result=%d\n", job.output);
    }
}
```

Deadline-based APIs use absolute timestamps from llam_now_ns().

```c
uint64_t deadline = llam_now_ns() + 10ULL * 1000ULL * 1000ULL;
int rc = llam_join_until(task, deadline);
```

A channel transfers pointer values between tasks. Capacity must be at least 1, and the channel behaves as a bounded buffer of that capacity.
```c
#include "llam/runtime.h"
#include <stdio.h>

typedef struct pipe_state {
    llam_channel_t *channel;
} pipe_state_t;

static void producer(void *arg) {
    pipe_state_t *state = arg;
    (void)llam_channel_send(state->channel, "ping");
    (void)llam_channel_send(state->channel, "pong");
    (void)llam_channel_close(state->channel);
}

static void consumer(void *arg) {
    pipe_state_t *state = arg;
    const char *msg;
    while ((msg = llam_channel_recv(state->channel)) != NULL) {
        printf("recv=%s\n", msg);
    }
}

static void root(void *arg) {
    (void)arg;
    pipe_state_t state = {
        .channel = llam_channel_create(2),
    };
    if (state.channel == NULL) {
        return;
    }
    llam_task_t *a = llam_spawn(producer, &state, NULL);
    llam_task_t *b = llam_spawn(consumer, &state, NULL);
    if (a != NULL) {
        (void)llam_join(a);
    }
    if (b != NULL) {
        (void)llam_join(b);
    }
    llam_channel_destroy(state.channel);
}
```

LLAM I/O calls are written like blocking calls from inside a task, while the runtime backend handles readiness and completion. Linux uses io_uring, macOS uses kqueue, and Windows uses IOCP for overlapped Winsock read, write, accept, connect, generic HANDLE ReadFile/WriteFile, gated TCP POLLOUT, and UDP POLLIN requests. Windows TCP POLLIN defaults to the cooperative/direct fallback path unless LLAM_WINDOWS_IOCP_TCP_POLLIN=1 is enabled for controlled smoke or benchmark runs; unsupported poll masks remain on the fallback path. The current I/O primitive set covers read, read_when_ready, write, HANDLE read/write, accept, connect, fd polling, HANDLE polling, and owned-buffer reads on supported native backends. Use LLAM_INVALID_FD or LLAM_FD_IS_INVALID(fd) for descriptor-returning failures such as llam_accept(), and LLAM_INVALID_HANDLE or LLAM_HANDLE_IS_INVALID(handle) for HANDLE-returning integrations.
```c
#include "llam/runtime.h"
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

typedef struct echo_state {
    int reader;
    int writer;
} echo_state_t;

static void reader_task(void *arg) {
    echo_state_t *state = arg;
    char buf[64];
    ssize_t n = llam_read(state->reader, buf, sizeof(buf));
    if (n > 0) {
        printf("read=%.*s\n", (int)n, buf);
    }
}

static void writer_task(void *arg) {
    echo_state_t *state = arg;
    const char *msg = "hello";
    (void)llam_write(state->writer, msg, strlen(msg));
}

static void root(void *arg) {
    (void)arg;
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return;
    }
    echo_state_t state = {
        .reader = sv[0],
        .writer = sv[1],
    };
    llam_task_t *reader = llam_spawn(reader_task, &state, NULL);
    llam_task_t *writer = llam_spawn(writer_task, &state, NULL);
    if (reader != NULL) {
        (void)llam_join(reader);
    }
    if (writer != NULL) {
        (void)llam_join(writer);
    }
    close(sv[0]);
    close(sv[1]);
}
```

The owned-buffer API lets the runtime allocate the I/O buffer. Release it with llam_io_buffer_release(). EOF or a zero-byte read returns 0 with buffer == NULL; failures return -1, set errno, and also leave buffer == NULL.
```c
llam_io_buffer_t *buffer = NULL;
ssize_t n = llam_read_owned(fd, 4096, &buffer);
if (n > 0 && buffer != NULL) {
    void *data = llam_io_buffer_data(buffer);
    size_t size = llam_io_buffer_size(buffer);
    (void)data;
    (void)size;
}
llam_io_buffer_release(buffer);
```

Long CPU work or blocking syscalls can pin a worker if they run directly inside a task. Use llam_call_blocking_result() to offload such work without ambiguity, or wrap explicit blocking regions with llam_enter_blocking() and llam_leave_blocking().
```c
#include "llam/runtime.h"
#include <unistd.h>

static void *slow_syscall(void *arg) {
    (void)arg;
    sleep(1);
    return NULL;
}

static void task(void *arg) {
    void *result = NULL;
    (void)arg;
    (void)llam_call_blocking_result(slow_syscall, NULL, &result);
}
```

Manual blocking region:
```c
if (llam_enter_blocking() == 0) {
    /* Run a blocking syscall or external library call here. */
    llam_leave_blocking();
}
```

Runtime lifecycle:
| API | Purpose |
|---|---|
| llam_runtime_opts_init | Fill runtime options with ABI-safe library defaults. |
| llam_runtime_init_ex | Initialize the runtime with an explicit option struct size for FFI. |
| llam_runtime_init | Initialize the runtime. |
| llam_runtime_request_stop | Request cooperative scheduler stop and wake workers. |
| llam_runtime_shutdown | Shut the runtime down and release resources. |
| llam_runtime_collect_stats_ex | Collect stats with an explicit output struct size for FFI. |
| llam_runtime_collect_stats | Collect scheduler, I/O, blocking, and queue statistics. |
| llam_runtime_write_stats_json | Write a newline-terminated JSON stats snapshot to an fd. |
Task scheduling:
| API | Purpose |
|---|---|
| llam_spawn_opts_init | Fill spawn options with ABI-safe library defaults. |
| llam_spawn_ex | Create a task with an explicit option struct size for FFI. |
| llam_spawn | Create a task. |
| llam_run | Run the scheduler. |
| llam_yield | Yield the current task. |
| llam_task_safepoint | Mark progress in CPU-bound loops without forcing an immediate yield. |
| llam_join | Wait for task completion. |
| llam_join_until | Wait for task completion until a deadline. |
| llam_detach | Consume a task handle without waiting for completion. |
| llam_sleep_ns | Sleep for a duration. |
| llam_sleep_until | Sleep until an absolute deadline. |
| llam_task_set_class | Change the current task class; invalid class values fail with EINVAL. |
| llam_current_task | Return the current task handle. |
| llam_task_id | Return a task id. |
| llam_task_state_name | Return a task state string. |
| llam_task_class | Return a task class. |
| llam_task_flags | Return task flags. |
Spawn options:
| Type/value | Meaning |
|---|---|
| LLAM_TASK_CLASS_LATENCY | Latency-sensitive task. |
| LLAM_TASK_CLASS_DEFAULT | Default task class. |
| LLAM_TASK_CLASS_BATCH | Batch-oriented task. |
| LLAM_STACK_CLASS_DEFAULT | Default stack size class. |
| LLAM_STACK_CLASS_LARGE | Larger stack size class. |
| LLAM_STACK_CLASS_HUGE | Very large stack size class. |
| LLAM_SPAWN_F_PINNED | Hint that the task should stay pinned. |
| LLAM_SPAWN_F_NO_PREEMPT | Hint that preemption should be restricted. |
| LLAM_SPAWN_F_SYS_TASK | Runtime/system task hint. |
| LLAM_SPAWN_F_LATENCY_CRITICAL | Latency-critical task hint. |
Blocking:
| API | Purpose |
|---|---|
| llam_call_blocking_result | Run a blocking function through the unambiguous int + out API. |
| llam_call_blocking | Convenience blocking API; ambiguous when the callback returns NULL. |
| llam_enter_blocking | Mark the current task as entering a blocking region. |
| llam_leave_blocking | Mark the current task as leaving a blocking region. |
Cancellation:
| API | Purpose |
|---|---|
| llam_cancel_token_create | Create a cancellation token. |
| llam_cancel_token_destroy | Destroy a cancellation token; live observers make it fail with EBUSY. |
| llam_cancel_token_cancel | Request cancellation. |
| llam_cancel_token_is_cancelled | Check cancellation state. |
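A cancellation token typically gates a cooperative loop. The sketch below assumes the handle type is named llam_cancel_token_t, by analogy with the other llam_*_t handles; the actual type name lives in llam/runtime.h.

```c
#include "llam/runtime.h"

typedef struct {
    llam_cancel_token_t *token;  /* type name assumed from the API prefix */
} loop_state_t;

/* A cancellable worker: check the token between units of work and keep
 * the scheduler informed through safepoints and yields. */
static void cancellable_loop(void *arg) {
    loop_state_t *state = arg;
    while (!llam_cancel_token_is_cancelled(state->token)) {
        llam_task_safepoint();   /* mark progress in CPU-bound work */
        /* ... one unit of work ... */
        llam_yield();
    }
}
```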
Mutex and condition variables:
| API | Purpose |
|---|---|
| llam_mutex_create / llam_mutex_destroy | Create or destroy a mutex; destroy returns EBUSY while owned or waited on. |
| llam_mutex_lock / llam_mutex_unlock | Lock or unlock a non-recursive mutex; self-lock returns EDEADLK, non-owner unlock returns EPERM. |
| llam_mutex_lock_until | Wait for a mutex until a deadline; self-lock returns EDEADLK. |
| llam_mutex_trylock | Try to lock immediately; returns EBUSY when already locked. |
| llam_cond_create / llam_cond_destroy | Create or destroy a condition variable; destroy returns EBUSY while waited on. |
| llam_cond_wait | Wait on a condition variable; caller must own the mutex and wait in a predicate loop. |
| llam_cond_wait_until | Wait on a condition variable until a deadline; reacquires the mutex before returning. |
| llam_cond_signal | Wake one waiter; may be called with or without the mutex and outside a managed task. |
| llam_cond_broadcast | Wake all waiters; may be called with or without the mutex and outside a managed task. |
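The predicate-loop requirement in the table above can be sketched as a simple gate. The handle type names llam_mutex_t and llam_cond_t are assumed from the create APIs; the error-code semantics follow the table.

```c
#include "llam/runtime.h"

typedef struct {
    llam_mutex_t *mu;  /* handle type names assumed from the create APIs */
    llam_cond_t *cv;
    int ready;
} gate_t;

/* Waiter: hold the mutex and re-check the predicate after every wake,
 * as llam_cond_wait requires. */
static void waiter(void *arg) {
    gate_t *g = arg;
    if (llam_mutex_lock(g->mu) != 0) {
        return;
    }
    while (!g->ready) {
        (void)llam_cond_wait(g->cv, g->mu);
    }
    (void)llam_mutex_unlock(g->mu);
}

/* Signaler: update the predicate under the mutex, then signal. Per the
 * table, the signal itself is legal without holding the mutex. */
static void signaler(void *arg) {
    gate_t *g = arg;
    if (llam_mutex_lock(g->mu) != 0) {
        return;
    }
    g->ready = 1;
    (void)llam_mutex_unlock(g->mu);
    (void)llam_cond_signal(g->cv);
}
```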
Channels:
| API | Purpose |
|---|---|
| llam_channel_create / llam_channel_destroy | Create or destroy a channel; destroy returns EBUSY while buffered values or waiters remain. |
| llam_channel_send | Send a value. |
| llam_channel_send_until | Send a value until a deadline. |
| llam_channel_recv_result | Receive a value through an unambiguous int + out API. |
| llam_channel_recv_until_result | Receive a value until a deadline through an unambiguous int + out API. |
| llam_channel_recv | Convenience receive API; use the result form if NULL is a valid payload. |
| llam_channel_recv_until | Convenience timed receive API; use the result form if NULL is a valid payload. |
| llam_channel_close | Idempotently close a channel; buffered values remain drainable and sends fail with EPIPE. |
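The feature list also mentions llam_channel_select() for multiplexing. Its real signature is defined in llam/runtime.h; the sketch below only illustrates the fan-in shape and assumes an index-returning form over a channel array, which may not match the actual API.

```c
#include "llam/runtime.h"

/* Illustrative fan-in over two channels. The (chans, count, out-msg)
 * signature of llam_channel_select() is an assumption for this sketch. */
static void fan_in(llam_channel_t *a, llam_channel_t *b) {
    llam_channel_t *chans[2] = { a, b };
    void *msg = NULL;
    int idx = llam_channel_select(chans, 2, &msg);  /* assumed signature */
    if (idx >= 0) {
        /* msg was received from chans[idx]. */
    }
}
```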
I/O:
| API | Purpose |
|---|---|
| llam_read | Read from an fd. |
| llam_write | Write to an fd. |
| llam_read_handle | Read from a platform handle; Windows uses overlapped ReadFile through IOCP when possible, POSIX aliases to fd read. |
| llam_write_handle | Write to a platform handle; Windows uses overlapped WriteFile through IOCP when possible, POSIX aliases to fd write. |
| llam_read_owned | Read into a runtime-owned buffer. |
| llam_recv_owned | Receive with flags into a runtime-owned buffer. |
| llam_io_buffer_release | Release an owned buffer. |
| llam_io_buffer_data | Return the owned buffer data pointer. |
| llam_io_buffer_size | Return the number of bytes read. |
| llam_io_buffer_capacity | Return the owned buffer capacity. |
| llam_accept | Accept a connection from a listener fd; returns LLAM_INVALID_FD on failure. |
| llam_connect | Connect a socket without blocking the scheduler worker. |
| llam_poll_fd | Wait for fd readiness. |
| llam_poll_handle | Wait for platform handle state; Windows uses WaitForSingleObject semantics and POSIX aliases to fd poll. |
Time, debug, and platform:
| API | Purpose |
|---|---|
| llam_now_ns | Return a monotonic nanosecond timestamp. |
| llam_dump_runtime_state | Dump runtime state to an fd. |
| llam_fd_t | Platform-specific fd/socket handle type. |
| llam_handle_t | Platform-specific generic handle type for HANDLE I/O APIs. |
| LLAM_INVALID_FD / LLAM_FD_IS_INVALID | Platform-correct invalid descriptor sentinel and predicate. |
| LLAM_INVALID_HANDLE / LLAM_HANDLE_IS_INVALID | Platform-correct invalid generic-handle sentinel and predicate. |
| LLAM_PLATFORM_LINUX | Linux build flag. |
| LLAM_PLATFORM_DARWIN | macOS/Darwin build flag. |
| LLAM_PLATFORM_WINDOWS | Windows build flag. |
| LLAM_PLATFORM_NAME | Platform name string. |
Pass NULL to llam_runtime_init() for the default runtime configuration. Pass llam_runtime_opts_t when you need explicit tuning. Dynamic loaders and language bindings should initialize option structs with llam_runtime_opts_init(&opts, LLAM_RUNTIME_OPTS_CURRENT_SIZE) and llam_spawn_opts_init(&opts, LLAM_SPAWN_OPTS_CURRENT_SIZE), then call llam_runtime_init_ex(&opts, LLAM_RUNTIME_OPTS_CURRENT_SIZE), llam_spawn_ex(fn, arg, &opts, LLAM_SPAWN_OPTS_CURRENT_SIZE), and llam_runtime_collect_stats_ex(&stats, LLAM_RUNTIME_STATS_CURRENT_SIZE).
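The explicit-size sequence described above can be put together as follows. The calls and size constants are taken from this README; llam_spawn_opts_t is an assumed type name, by analogy with llam_runtime_opts_t, and worker is a placeholder task function.

```c
#include "llam/runtime.h"

static void worker(void *arg) { (void)arg; }

/* FFI-style initialization sketch using the explicit-size entry points.
 * The spawn option struct type name is assumed. */
static int start_runtime(void) {
    llam_runtime_opts_t ropts;
    llam_runtime_opts_init(&ropts, LLAM_RUNTIME_OPTS_CURRENT_SIZE);
    if (llam_runtime_init_ex(&ropts, LLAM_RUNTIME_OPTS_CURRENT_SIZE) != 0) {
        return -1;
    }
    llam_spawn_opts_t sopts;
    llam_spawn_opts_init(&sopts, LLAM_SPAWN_OPTS_CURRENT_SIZE);
    if (llam_spawn_ex(worker, NULL, &sopts, LLAM_SPAWN_OPTS_CURRENT_SIZE) == NULL) {
        llam_runtime_shutdown();
        return -1;
    }
    return 0;
}
```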
Public option and stats structs use fixed-width integer storage for ABI-facing
scalar fields. Enum constants remain available for C readability, but FFI
bindings should model task classes, stack classes, profiles, flags, and
32-bit state counters as uint32_t; sqpoll_cpu is int32_t.
```c
llam_runtime_opts_t opts = {
    .deterministic = 0,
    .forced_yield_every = 0,
    .experimental_flags =
        LLAM_RUNTIME_EXPERIMENTAL_F_DYNAMIC_WORKERS |
        LLAM_RUNTIME_EXPERIMENTAL_F_LOCKFREE_NORMQ,
    .profile = LLAM_RUNTIME_PROFILE_BALANCED,
};
if (llam_runtime_init(&opts) != 0) {
    return 1;
}
```

Important fields:
| Field | Meaning |
|---|---|
| deterministic | Deterministic scheduling mode. |
| forced_yield_every | Force a yield at a fixed interval. |
| experimental_flags | Bitwise OR of LLAM_RUNTIME_EXPERIMENTAL_F_* flags. |
| idle_spin_ns | Spin before idle poll fallback. |
| idle_spin_max_iters | Maximum idle-spin iterations. |
| sqpoll_cpu | CPU reserved for SQPOLL. |
| profile | Runtime policy profile: balanced, release-fast, debug-safe, or io-latency. |
Experimental flags:
| Flag | Meaning |
|---|---|
| LLAM_RUNTIME_EXPERIMENTAL_F_WORKER_RINGS | Experimental per-worker I/O ring mode. |
| LLAM_RUNTIME_EXPERIMENTAL_F_WORKER_RINGS_MULTISHOT | Allow multishot watches with worker rings. |
| LLAM_RUNTIME_EXPERIMENTAL_F_DYNAMIC_WORKERS | Soft-park and reactivate idle workers. |
| LLAM_RUNTIME_EXPERIMENTAL_F_LOCKFREE_NORMQ | Use the lock-free normal queue. |
| LLAM_RUNTIME_EXPERIMENTAL_F_HUGE_ALLOC | Prefer hugepage-friendly allocator backing. |
| LLAM_RUNTIME_EXPERIMENTAL_F_SQPOLL | Experimental Linux io_uring SQPOLL mode. |
Selected environment variables:
| Variable | Example values | Meaning |
|---|---|---|
| LLAM_RUNTIME_PROFILE | balanced, release-fast, debug-safe, io-latency | Override the runtime profile. |
| LLAM_EXPERIMENTAL_DYNAMIC_WORKERS | 0, 1 | Toggle dynamic workers. |
| LLAM_EXPERIMENTAL_LOCKFREE_NORMQ | 0, 1 | Toggle the lock-free normal queue. |
| LLAM_EXPERIMENTAL_WORKER_RINGS | 0, 1 | Toggle worker ring mode. |
| LLAM_EXPERIMENTAL_WORKER_RINGS_MULTISHOT | 0, 1 | Toggle worker-ring multishot watches. |
| LLAM_EXPERIMENTAL_HUGE_ALLOC | 0, 1 | Toggle huge allocator mode. |
| LLAM_EXPERIMENTAL_SQPOLL | 0, 1 | Toggle Linux SQPOLL. |
| LLAM_SQPOLL_CPU | CPU number | Select the SQPOLL CPU. |
| LLAM_IDLE_SPIN_NS | nanoseconds | Idle spin time. |
| LLAM_IDLE_SPIN_ITERS | iteration count | Idle spin iteration limit. |
| LLAM_BIND_WORKERS | 0, 1 | Bind worker threads to platform CPUs when supported. |
| LLAM_DARWIN_MACH_SCHED | 0, 1 | Toggle Darwin Mach/QoS scheduler hints; default is enabled on macOS. |
| LLAM_WINDOWS_UNSAFE_SKIP_TASK_SIMD | 0, 1 | Experimental Windows x64 ceiling mode: skip task-context XMM6-XMM15 save/restore. Only valid when managed tasks do not rely on callee-saved SIMD state across LLAM yields/waits. |
| LLAM_AARCH64_UNSAFE_SKIP_SCHEDULER_SIMD | 0, 1 | Experimental macOS/Linux ARM64 ceiling mode: skip scheduler-context SIMD save/restore while task contexts still preserve ABI-required d8-d15. |
| LLAM_ARM64_UNSAFE_SKIP_SCHEDULER_SIMD | 0, 1 | Alias for LLAM_AARCH64_UNSAFE_SKIP_SCHEDULER_SIMD. |
| LLAM_DIRECT_BLOCKING_IO | 0, 1 | Allow eligible blocking socket read/write operations to run through compensated direct blocking regions. |
| LLAM_DIRECT_BLOCKING_POLL | 0, 1, unset | Control direct blocking poll fallback; Linux/Windows auto mode handles finite waits directly when profitable. |
| LLAM_ACCEPT_DIRECT_BLOCKING | 0, 1 | Route managed accept calls that cannot use multishot accept-watch through a compensated helper poll loop; default is enabled on macOS and disabled elsewhere. |
| LLAM_IO_POLL_REDIRECT_TIMEOUT_MS | milliseconds | Redirect long direct-poll waits through opaque blocking compensation on Linux. |
| LLAM_IO_COOP_YIELD | 0, 1 | Enable cooperative yields around direct I/O fast paths; default is enabled on macOS, Linux, and Windows. |
| LLAM_IO_POLL_COOP_YIELD | 0, 1 | Enable cooperative yields in poll readiness paths; default is enabled on macOS, Linux, and Windows. |
| LLAM_IO_POLL_PRE_YIELD | 0, 1 | Let poll hand off to same-shard runnable producers before the first readiness probe; default is enabled on macOS and Windows. |
| LLAM_IO_POLL_EXTRA_YIELD | 0, 1 | Add an extra poll-readiness yield; default is enabled on macOS and Windows. |
| LLAM_IO_POLL_READY_YIELDS | 0-8 | Bound short same-shard ready-yield probes before poll parks in the backend. |
| LLAM_READ_READY_INITIAL_HANDOFF | 0, 1 | Let llam_read_when_ready() hand off once to local producers before its first read probe; default is disabled. |
| LLAM_READ_READY_DIRECT_BLOCKING | 0, 1 | Let infinite llam_read_when_ready() use compensated direct blocking reads; default is disabled. |
| LLAM_POLL_SOCKET_PEEK | 0, 1 | Use MSG_PEEK for socket POLLIN fast checks; default is enabled on macOS and opt-in elsewhere. |
| LLAM_IO_WRITE_HANDOFF | 0, 1 | Yield after small socket writes so local readers can run; default is enabled on macOS and Linux. |
| LLAM_IO_WRITE_DIRECT_LOCAL_HANDOFF | 0, 1 | Prefer direct same-shard task handoff after eligible socket writes; default is enabled on macOS, Linux, and Windows. |
| LLAM_YIELD_DIRECT_HANDOFF | 0, 1, unset | Allow ordinary yields to switch directly to same-shard runnable work when no timers or inject work are pending. |
| LLAM_OPAQUE_REDIRECT_FASTPATH | 0, 1 | Prefer redirect over helper handoff for opaque blocking; default is enabled on Linux. |
| LLAM_TIMER_HEAP_PREWARM | timer slots | Preallocate shard timer heap slots to avoid growth during sleep/deadline fanout. |
| LLAM_STACK_CACHE_PREWARM | stack count | Prewarm the default stack cache before high fanout workloads. |
| LLAM_TASK_CACHE_PREWARM | task count | Prewarm task metadata slabs before high fanout workloads. |
| LLAM_STACK_SAMPLING | 0, 1 | Enable stack high-water sampling diagnostics. |
| LLAM_TRACE_EVENTS | 0, 1 | Enable per-worker trace ring diagnostics. |
| LLAM_WAKE_LATENCY_METRICS | 0, 1 | Enable wake-latency diagnostics. |
| LLAM_STRESS_DYNAMIC_LIVE_POLL_WAITERS | waiter count | Stress live poll/accept/inflight waiters; automatically clamped by the fd budget. |
Run all LLAM benchmark cases:

```sh
./bench
```

Run one benchmark case:

```sh
LLAM_BENCH_ONLY=spawn_join ./bench
LLAM_BENCH_ONLY=channel_pingpong ./bench
LLAM_BENCH_ONLY=io_echo ./bench
LLAM_BENCH_ONLY=poll_wake ./bench
LLAM_BENCH_ONLY=sleep_fanout ./bench
LLAM_BENCH_ONLY=opaque_block ./bench
```

Scale benchmark size:

```sh
LLAM_BENCH_ROUNDS=31 LLAM_BENCH_WARMUP_ROUNDS=5 ./bench
LLAM_BENCH_SPAWN_TASKS=512 ./bench
LLAM_BENCH_CHANNEL_MESSAGES=4096 ./bench
LLAM_BENCH_IO_MESSAGES=512 ./bench
LLAM_BENCH_POLL_EVENTS=512 ./bench
LLAM_BENCH_SLEEP_TASKS=1024 ./bench
LLAM_BENCH_OPAQUE_SCOPES=64 ./bench
```

Compare against Go:

```sh
go run scripts/bench_go_compare.go
```

Compare LLAM, Go, and Tokio:

```sh
python3 scripts/bench_runtime_compare.py --runtime all
```

Graph generation requires Python matplotlib. Without it, the script still writes CSV and prints tables.

The scheduled Runtime Benchmarks workflow runs the same comparison on Linux x86_64, macOS arm64, macOS x86_64, Windows Server 2022, and Windows Server 2025, then uploads CSV/PNG artifacts for regression tracking.

Run the benchmark matrix:

```sh
make bench-matrix
```

Run focused tests:

```sh
make test
```

Build a local release archive:

```sh
make clean all test
./scripts/package_release.sh
```

Or use the Makefile package target:

```sh
make package
```

Verify Linux:

```sh
make verify-linux CC=gcc
```

Verify Linux with experimental paths:

```sh
LLAM_VERIFY_LINUX_EXPERIMENTAL=1 make verify-linux CC=gcc
```

Verify macOS:

```sh
CC=clang make verify-darwin
```

Verify macOS with experimental paths:

```sh
LLAM_VERIFY_DARWIN_EXPERIMENTAL=1 CC=clang make verify-darwin
```

Verify Linux in Docker:

```sh
./scripts/docker_verify_linux.sh
```

Check Windows status:

```powershell
.\scripts\verify_windows.ps1
.\scripts\verify_windows.ps1 -Native
```

The default command verifies through WSL when available. The -Native command builds native Windows targets and runs the Windows CTest suite.

Remove generated files:

```sh
make clean
```

make clean removes generated files such as object/, build/, CMake cache files, example and benchmark binaries, and perf.data*.
LLAM is a user-level N:M thread scheduler. A small number of OS worker threads (typically one per CPU core) run many lightweight tasks. Each task has its own stack and can be suspended and resumed without kernel intervention.
```mermaid
flowchart TB
    subgraph UserSpace["User Space"]
        direction TB
        Tasks["Tasks (N lightweight fibers)"] --> Shards
        subgraph Shards["Scheduler Shards (per worker)"]
            direction LR
            S0["Shard 0\nhot_q / norm_q / inject_q\ntimers / allocator"]
            S1["Shard 1\nhot_q / norm_q / inject_q\ntimers / allocator"]
            S0 <-->|work steal| S1
        end
        Shards --> Nodes
        subgraph Nodes["I/O Nodes"]
            direction LR
            N0["Node 0\nio_uring / kqueue\nwatch tables"]
            N1["Node 1\nio_uring / kqueue\nwatch tables"]
        end
        Watchdog["Watchdog\nprobe / scale / merge / rehome"] -.-> Shards
        BlockPool["Blocking Thread Pool"] -.-> Shards
        OpaqueHelpers["Opaque Helpers\n(per-shard compensation)"] -.-> Shards
    end
    Shards -->|runs on| Workers["OS Threads (M pthreads)"]
```

```mermaid
flowchart LR
    App["Application"] --> API["include/llam\npublic API"]
    API --> Core["src/core\nscheduler / tasks / sync"]
    API --> IO["src/io\nI/O API + backends"]
    Core --> Engine["src/engine\nworkers / watchdog"]
    Engine --> Blocking["blocking\ncompensation"]
    IO --> Linux["src/io/linux\nio_uring"]
    IO --> Darwin["src/io/darwin\nkqueue"]
    IO --> Windows["Windows\nIOCP sockets + HANDLEs"]
    Core --> ASM["src/asm\ncontext switch"]
```
Tasks are the fundamental unit of execution. Each task is a void (*)(void *) function with its own fiber stack allocated via mmap with a guard page. Tasks are scheduled cooperatively onto OS worker threads; the runtime never preempts a task without its participation (safepoints, yields, or I/O waits).
Shards are per-worker scheduler partitions. Each shard owns:
- Three run queues: hot_q (latency-critical, capacity 1024), norm_q (normal, capacity 4096), and inject_q (cross-shard, capacity 1024).
- A timer heap: a min-heap ordered by deadline, used for llam_sleep_until and timed waits.
- A per-shard allocator: slab-based pools for tasks, wait nodes, timer nodes, I/O requests, and I/O buffers, each with lock-free remote-free queues for cross-shard deallocation.
- A stack cache: a per-class (default/large/huge) stack mapping reuse pool.
- Scheduler context: the fiber context (llam_ctx_t) the scheduler loop itself runs on.
- An opaque helper thread: a pre-spawned compensation thread that takes over scheduling when the primary worker enters a blocking region.
The norm_q has two implementations selected at init time: a mutex-guarded FIFO queue, or a Chase-Lev lock-free deque (llam_cldeque_t) for work-stealing. The lock-free deque is a bounded circular buffer of 4096 task pointers with separated top (steal end) and bottom (push/pop end) atomics, each on its own cache line.
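The Chase-Lev shape described above can be sketched in portable C11 atomics. This is not LLAM's llam_cldeque_t; it is a simplified illustration under the constraints the text states (a bounded 4096-entry ring of task pointers, with top and bottom on separate cache lines), omitting the production details a real work-stealing deque needs.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define DEQ_CAP 4096 /* bounded ring, matching the 4096-slot norm_q */

/* top is the steal end (thieves), bottom the owner's push/pop end;
 * each lives on its own cache line to avoid false sharing. */
typedef struct {
    _Alignas(64) _Atomic int64_t top;
    _Alignas(64) _Atomic int64_t bottom;
    void *_Atomic buf[DEQ_CAP];
} cldeque_t;

static void deq_init(cldeque_t *q) {
    atomic_init(&q->top, 0);
    atomic_init(&q->bottom, 0);
}

/* Owner-only push; fails when the bounded ring is full. */
static bool deq_push(cldeque_t *q, void *task) {
    int64_t b = atomic_load_explicit(&q->bottom, memory_order_relaxed);
    int64_t t = atomic_load_explicit(&q->top, memory_order_acquire);
    if (b - t >= DEQ_CAP) {
        return false; /* full */
    }
    atomic_store_explicit(&q->buf[b % DEQ_CAP], task, memory_order_relaxed);
    atomic_store_explicit(&q->bottom, b + 1, memory_order_release);
    return true;
}

/* Owner-only pop from the bottom (LIFO, for cache locality). */
static void *deq_take(cldeque_t *q) {
    int64_t b = atomic_load_explicit(&q->bottom, memory_order_relaxed) - 1;
    atomic_store_explicit(&q->bottom, b, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    int64_t t = atomic_load_explicit(&q->top, memory_order_relaxed);
    void *task = NULL;
    if (t <= b) {
        task = atomic_load_explicit(&q->buf[b % DEQ_CAP], memory_order_relaxed);
        if (t == b) {
            /* Last element: race against concurrent stealers via CAS. */
            if (!atomic_compare_exchange_strong_explicit(
                    &q->top, &t, t + 1,
                    memory_order_seq_cst, memory_order_relaxed)) {
                task = NULL; /* a thief won */
            }
            atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
        }
    } else {
        atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed); /* empty */
    }
    return task;
}

/* Thief-side steal from the top (FIFO). */
static void *deq_steal(cldeque_t *q) {
    int64_t t = atomic_load_explicit(&q->top, memory_order_acquire);
    atomic_thread_fence(memory_order_seq_cst);
    int64_t b = atomic_load_explicit(&q->bottom, memory_order_acquire);
    if (t >= b) {
        return NULL; /* empty */
    }
    void *task = atomic_load_explicit(&q->buf[t % DEQ_CAP], memory_order_relaxed);
    if (!atomic_compare_exchange_strong_explicit(
            &q->top, &t, t + 1,
            memory_order_seq_cst, memory_order_relaxed)) {
        return NULL; /* lost the race to another thief or the owner */
    }
    return task;
}
```

The seq_cst fences on the take and steal paths order the bottom update against the top read, which is what lets the owner and thieves safely race for the last element.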
Nodes are platform I/O event backends. Each node owns either an io_uring ring (Linux) or a kqueue fd (Darwin), plus watch tables, submit queues, and control queues. Shards submit I/O requests to nodes; nodes complete requests back to the owning shard's task.
The core scheduler loop (`llam_scheduler_loop`) runs on each worker thread:

```text
loop:
  1. Check runtime drain (live_tasks == 0 → stop)
  2. Handle merge-pause requests from the watchdog
  3. Handle dynamic-worker offline state
  4. Drain inject queue (up to 32 tasks per pass)
  5. Fire expired timers
  6. Dequeue from hot_q, then norm_q
  7. Attempt work-steal from a random sibling shard
  8. If no task found → idle wait (eventfd/kqueue/futex)
  9. Context switch to the selected task
  10. On return: record metrics, check safepoints, repeat
```
Task selection priority: hot queue → normal queue → inject queue → steal. The hot queue is reserved for latency-class tasks and I/O completions. The inject queue receives cross-shard work and is drained with a budget cap to prevent starvation.
Context switches are performed in hand-written assembly for each supported platform. The runtime saves and restores only the callee-saved registers required by the platform ABI:
| Platform | Saved registers | Mechanism |
|---|---|---|
| Linux x86_64 | `rbx`, `rbp`, `r12`-`r15`, `rsp` | Direct `mov`/`ret` in `context_x86_64.S` |
| Linux aarch64 | `x19`-`x29`, `x30`, `sp` | `stp`/`ldp` in `context_arm64.S` |
| Darwin x86_64 | `rbx`, `rbp`, `r12`-`r15`, `rsp` | Same register set as Linux x86_64 |
| Darwin arm64 | `x19`-`x29`, `x30`, `sp` | Same register set as Linux aarch64 |
| Fallback | Full register set | `ucontext` via `swapcontext()` |
On x86_64, the fast path is approximately:

```asm
mov [rdi], rsp   ; save current SP
mov rsp, [rsi]   ; restore target SP
ret              ; jump to target's saved return address
```

No syscall, no privilege transition, no TLB flush. Typical cost is tens of nanoseconds versus microseconds for a kernel thread switch.
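The table's fallback row uses POSIX `ucontext`, which goes through the generic full-register save/restore path. A minimal sketch of running a function once on its own fiber stack via `swapcontext()` (helper names here are illustrative, not llam API):

```c
#include <ucontext.h>
#include <stddef.h>

static ucontext_t main_ctx, task_ctx;
static int task_ran = 0;

/* The fiber entry point: runs on the fiber stack, then resumes uc_link. */
static void task_fn(void) {
    task_ran = 1;
}

/* Run task_fn once on the supplied stack, returning when it finishes. */
int run_once_on_fiber(void *stack, size_t stack_size) {
    if (getcontext(&task_ctx) != 0) return -1;
    task_ctx.uc_stack.ss_sp = stack;
    task_ctx.uc_stack.ss_size = stack_size;
    task_ctx.uc_link = &main_ctx;          /* where to resume on return */
    makecontext(&task_ctx, task_fn, 0);
    /* Saves the full register set here, restores task_ctx's. */
    return swapcontext(&main_ctx, &task_ctx);
}
```

The full save/restore (and, on some systems, a signal-mask syscall inside `swapcontext`) is why this path is markedly slower than the hand-written assembly switches.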
FP state isolation: the runtime detects XSAVE support at init and tracks the XSAVE mask and area size. Tasks that modify x87/SSE rounding modes (MXCSR, x87 CW) are isolated from one another across context switches.
LLAM provides scheduler-safe I/O through a multi-tier completion strategy. Each I/O call (llam_read, llam_write, llam_read_handle, llam_write_handle, llam_accept, llam_connect, llam_poll_fd, llam_poll_handle) follows this path:
1. Direct nonblocking fast path (try without parking)
2. Cooperative yield + retry if local work is pending
3. Direct blocking heuristic for short-lived ops
4. Async backend submission (io_uring SQE or kqueue kevent)
5. Blocking-worker fallback if the backend cannot handle the fd
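Tier 1 can be sketched as a plain nonblocking attempt whose "would block" outcome tells the caller to fall through to the async tiers. This helper is illustrative only, not part of the public llam API:

```c
#include <unistd.h>
#include <fcntl.h>   /* O_NONBLOCK, for callers that set the fd nonblocking */
#include <errno.h>

/* Tier 1: try the read without parking the task.
   >= 0  completed in the fast path
   -2    would block: fall through to tiers 2-5 (yield/heuristic/async)
   -1    hard error */
ssize_t read_fast_path(int fd, void *buf, size_t len) {
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0) return n;
        if (errno == EINTR) continue;          /* interrupted: just retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return -2;                         /* not ready: go async */
        return -1;
    }
}
```

Only when this tier reports "would block" does the runtime pay the cost of yielding, backend submission, or a blocking-worker handoff.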
Linux backend (src/io/linux/): each llam_node_t owns an io_uring instance with a ring depth of 256. Features include:
- One-shot and multishot poll, accept, and recv operations.
- Provided-buffer rings (128 entries per node) for zero-copy recv.
- SQPOLL mode (experimental) where the kernel polls the SQ without syscalls.
- Completion-driven task wake: CQE processing identifies the owning shard and re-enqueues the task.
Darwin backend (src/io/darwin/): each node uses kqueue with EVFILT_READ, EVFILT_WRITE, and EVFILT_USER for wake signaling. Darwin nodes use Mach ports (semaphore_t, mach_port_t) for cross-thread wake when available.
Windows backend (src/io/windows/): each node uses IOCP for overlapped Winsock operations and generic HANDLE ReadFile/WriteFile requests. Socket readiness policy stays conservative for stream POLLIN unless explicitly enabled; waitable HANDLE polling uses the platform wait path because it observes signaled handle state, not socket readiness.
Linux and Darwin backends support multishot watches — a single registered watch serves multiple waiters, avoiding redundant kernel registrations for the same fd. Watch tables track fd identity by (dev_t, ino_t) to detect fd reuse.
I/O request lifecycle: requests (llam_io_req_t) carry an atomic wait_mode that transitions through NONE → SUBMIT_QUEUE → INFLIGHT → (completion) or NONE → POLL_WATCH/ACCEPT_WATCH/RECV_WATCH → (completion). An atomic abort_reason field handles cancellation and timeout races without locks.
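The lock-free transitions can be sketched with two CAS operations: submission advances `wait_mode` only from NONE, and the first aborter to claim `abort_reason` wins. These names mirror the description above but are illustrative; the real `llam_io_req_t` layout is internal to the runtime:

```c
#include <stdatomic.h>

typedef enum {
    WM_NONE, WM_SUBMIT_QUEUE, WM_INFLIGHT, WM_POLL_WATCH
} wait_mode_t;

typedef struct {
    _Atomic int wait_mode;
    _Atomic int abort_reason;   /* 0 = none; first CAS winner decides */
} io_req_sketch_t;

/* Submitter: NONE -> SUBMIT_QUEUE; fails if the request already moved on. */
int req_submit(io_req_sketch_t *r) {
    int expect = WM_NONE;
    return atomic_compare_exchange_strong(&r->wait_mode, &expect,
                                          WM_SUBMIT_QUEUE) ? 0 : -1;
}

/* Canceller or timeout: record a reason exactly once, without locks.
   Returns -1 when a racing aborter already recorded one. */
int req_abort(io_req_sketch_t *r, int reason) {
    int expect = 0;
    return atomic_compare_exchange_strong(&r->abort_reason, &expect,
                                          reason) ? 0 : -1;
}
```

Because both cancellation and timeout go through the same single-winner CAS, a request can never observe two conflicting abort reasons.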
Per-shard allocator (`llam_allocator_t`): each shard maintains free lists for 5 object types:

| Object | Slab size | Purpose |
|---|---|---|
| `llam_task_t` | 16 per slab | Task metadata + embedded wait node, I/O req, timer |
| `llam_wait_node_t` | 64 per slab | Mutex/cond/channel waiter nodes |
| `llam_timer_node_t` | 64 per slab | Timer heap entries |
| `llam_io_req_t` | 64 per slab | I/O operation descriptors |
| `llam_io_buffer_t` | 16 per slab | Owned I/O buffers (4 KB inline data each) |
Remote-free queues: when a task completes on a different shard than it was allocated on, the deallocation goes through a lock-free atomic MPSC list (task_remote_free, wait_remote_free, etc.) and is drained into the local free list during llam_allocator_quiescent() at safe points. Each remote-free queue sits on its own cache line (_Alignas(64)) to avoid false sharing.
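The MPSC remote-free pattern can be sketched as a Treiber-style push from any shard plus a whole-list `exchange` drain by the owner. Types here are illustrative, not the runtime's internal layout:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Intrusive node: freed objects are linked through their own memory. */
typedef struct rf_node {
    struct rf_node *next;
} rf_node_t;

typedef struct {
    /* Own cache line, as in the runtime, to avoid false sharing. */
    _Alignas(64) _Atomic(rf_node_t *) head;
} remote_free_q_t;

/* Any shard (multi-producer): lock-free push onto the list head. */
void rf_push(remote_free_q_t *q, rf_node_t *n) {
    rf_node_t *h = atomic_load(&q->head);
    do {
        n->next = h;
    } while (!atomic_compare_exchange_weak(&q->head, &h, n));
}

/* Owning shard only (single consumer), at a quiescent point:
   detach the entire list in one atomic exchange, then walk it locally. */
rf_node_t *rf_drain(remote_free_q_t *q) {
    return atomic_exchange(&q->head, NULL);
}
```

The single `atomic_exchange` means the consumer never contends with producers node-by-node: producers keep pushing onto a fresh empty list while the owner refills its local free list from the detached one.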
Stack cache: two-tier cache with per-shard local pools and a runtime-global fallback:
| Stack class | Size | Shard cache limit | Global cache limit |
|---|---|---|---|
| Default | 64 KB | 256 (512 in release-fast) | 4096 |
| Large | 256 KB | 64 | 512 |
| Huge | 1 MB | 16 | 128 |
Stack mappings use mmap(MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK) with a guard page (mprotect(PROT_NONE)) at the bottom. Returned stacks are cached for reuse; the cache is pre-warmed at init via llam_runtime_prewarm_stack_cache.
When a task calls llam_enter_blocking() or llam_call_blocking(), the runtime must keep the shard's scheduler running while the worker thread is pinned in foreign code.
Opaque helper thread: each shard has a pre-spawned helper thread (opaque_helper_thread). When the primary worker enters a blocking region:
Primary worker:
1. Increment opaque_compensation_depth
2. Wake the helper thread (futex on Linux, Mach semaphore or condvar on Darwin)
3. Execute the blocking work
Helper thread:
1. Wake up and take over the shard's scheduler loop (using opaque_scheduler_ctx)
2. Run tasks from the shard's queues while the primary is blocked
3. When the primary calls llam_leave_blocking(), relinquish control
Blocking thread pool: a separate pool of block_worker_count threads handles llam_call_blocking jobs via a global FIFO job queue (block_head/block_tail). Jobs transition through QUEUED → RUNNING → FINISHED/ABORTED.
The connect fallback (llam_blocking_connect_impl) drives a nonblocking connect() + poll(POLLOUT, 10ms) loop with SO_ERROR verification, running in the blocking pool rather than pinning a scheduler worker.
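That fallback loop can be sketched as follows: a nonblocking `connect()` that reports `EINPROGRESS`, polled for writability in 10 ms slices, then verified with `SO_ERROR` (writability alone does not mean success). The function name and timeout argument are illustrative, not the runtime's internal signature:

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <poll.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>

int blocking_connect_sketch(int fd, const struct sockaddr *sa,
                            socklen_t len, int total_timeout_ms) {
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    if (connect(fd, sa, len) == 0) return 0;   /* connected immediately */
    if (errno != EINPROGRESS) return -1;

    int waited = 0;
    while (waited < total_timeout_ms) {
        struct pollfd p = { .fd = fd, .events = POLLOUT };
        int r = poll(&p, 1, 10);               /* 10 ms slices */
        waited += 10;
        if (r < 0) return -1;
        if (r == 0) continue;                  /* not writable yet */

        /* Writable: the connect finished, but check how it finished. */
        int err = 0;
        socklen_t el = sizeof err;
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &el) != 0) return -1;
        if (err != 0) { errno = err; return -1; }
        return 0;
    }
    errno = ETIMEDOUT;
    return -1;
}
```

Running this in the blocking pool keeps the (up to) `total_timeout_ms` of poll sleeps off the scheduler workers entirely.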
The watchdog thread (src/engine/) runs at 1ms intervals (LLAM_WATCHDOG_INTERVAL_NS) and performs:
| Module | File | Function |
|---|---|---|
| Probe | `runtime_watchdog_probe.c` | Detect stalled safepoints, measure queue pressure, suspect deadlocks after 4 consecutive observations |
| Scale | `runtime_watchdog_scale.c` | Dynamic worker scaling: scale up after 2 consecutive pressure observations, scale down after 12 consecutive idle observations, with a 4-tick cooldown |
| Merge | `runtime_watchdog_merge.c` | Offline a shard by draining its queues and migrating tasks to a target shard |
| Rehome | `runtime_watchdog_rehome.c` | Atomically transfer ownership of parked waiters, in-flight I/O, submit-queue entries, and multishot watch state from an offline shard to a target shard |
Rehome validates the entire waiter list before any migration. If a single entry cannot be rehomed (pinned task, incompatible I/O state), the entire list migration is aborted to prevent partial ownership inconsistency.
All sync primitives are runtime-aware: when called from a managed task, blocking waits park the task (freeing the worker thread) instead of blocking the OS thread.
- Mutex (`llam_mutex_t`): atomic owner fast path + `llam_wait_queue_t` for contention. Non-recursive. `EDEADLK` on self-lock, `EPERM` on non-owner unlock.
- Condition variable (`llam_cond_t`): FIFO waiter queue. Signal/broadcast can be called from outside managed tasks.
- Channel (`llam_channel_t`): bounded pointer-valued ring buffer with separate send and receive wait queues. Supports close semantics (sends fail with `EPIPE`, buffered values remain drainable).
- Cancel token (`llam_cancel_token_t`): explicit cancellation handle with a waiter list. Registered tasks and I/O operations observe cancellation through `ECANCELED`.
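The channel close semantics can be illustrated with a single-threaded model of the bounded ring buffer. This is a behavioral sketch only: the real channel parks tasks on its wait queues where this model returns `EAGAIN`, and all names here are hypothetical:

```c
#include <errno.h>
#include <stddef.h>

#define CH_CAP 4   /* illustrative bound; real channels are sized at create */

typedef struct {
    void  *buf[CH_CAP];
    size_t head, count;
    int    closed;
} chan_model_t;

void ch_close(chan_model_t *c) { c->closed = 1; }

/* Send fails with EPIPE once closed; a full channel would park the task. */
int ch_send(chan_model_t *c, void *v) {
    if (c->closed)         { errno = EPIPE;  return -1; }
    if (c->count == CH_CAP){ errno = EAGAIN; return -1; }
    c->buf[(c->head + c->count++) % CH_CAP] = v;
    return 0;
}

/* Buffered values stay drainable after close; only an empty closed
   channel reports EPIPE to the receiver. */
int ch_recv(chan_model_t *c, void **out) {
    if (c->count == 0) {
        errno = c->closed ? EPIPE : EAGAIN;
        return -1;
    }
    *out = c->buf[c->head];
    c->head = (c->head + 1) % CH_CAP;
    c->count--;
    return 0;
}
```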
LLAM is licensed under the Apache License 2.0.
