LLAM


LLAM is a stackful user-thread runtime for C applications. It lets C code express concurrency with task-oriented APIs such as spawn, join, sleep, channels, read, write, accept, connect, and poll, while the runtime schedules many user tasks over a smaller set of OS worker threads.

LLAM is not Linux-only. The Linux backend uses io_uring/liburing, the macOS/Darwin backend uses kqueue-based watch and completion paths, and the native Windows 10/11 backend uses IOCP for overlapped Winsock read/write/accept/connect plus generic HANDLE ReadFile/WriteFile requests.

Key Features

  • Stackful tasks with natural C control flow.
  • N:M scheduling over runtime worker threads.
  • Linux I/O backend based on io_uring/liburing.
  • macOS/Darwin I/O backend based on kqueue.
  • Windows 10/11 backend with IOCP request completions for sockets and overlapped HANDLEs, Windows wake handles, and x86_64 context-switch assembly.
  • Task primitives: spawn, yield, join, sleep, deadlines, and task metadata.
  • Synchronization primitives: mutex, condition variable, channel, and cancellation token.
  • Channel multiplexing with llam_channel_select() and focused select benchmarks.
  • Blocking integration through llam_call_blocking, llam_enter_blocking, and llam_leave_blocking.
  • Runtime tuning through profiles, dynamic workers, worker rings, SQPOLL, and idle-spin controls.
  • Observability through runtime stats and debug dumps.
  • Stable ABI metadata for dynamic language-runtime loaders.
  • Static and shared library build targets.
  • Built-in demo, chat server, stress, benchmark, Docker verification, and Go/Tokio comparison scripts.

Platform Support

| Platform | Status | I/O backend | Recommended compiler | Verification |
| --- | --- | --- | --- | --- |
| Linux x86_64 | Primary Linux path | io_uring/liburing | GCC or Clang | make verify-linux CC=gcc |
| Linux aarch64 | Supported | io_uring/liburing | GCC or Clang | make verify-linux CC=gcc |
| macOS arm64 | Primary macOS path | kqueue | Apple Clang | CC=clang make verify-darwin |
| macOS x86_64 | Supported | kqueue + x86_64 asm context switch | Apple Clang | CC=clang make verify-darwin |
| Windows 10/11 | Supported native x86_64 backend | IOCP for WSARecv/WSASend/AcceptEx/ConnectEx, overlapped HANDLE ReadFile/WriteFile, plus gated TCP POLLOUT and UDP POLLIN; TCP POLLIN defaults to fallback unless LLAM_WINDOWS_IOCP_TCP_POLLIN=1 is enabled | MinGW and MSVC/MASM via CMake | CMake Windows build plus test_windows_policy, test_windows_runtime_smoke, test_windows_iocp_io, and test_windows_handle_io; scripts/verify_windows.ps1 -Native |

Native Windows runtime support covers scheduler/core, wake handles, x86_64 context switching, IOCP-backed socket requests, and overlapped HANDLE I/O. Windows 10 and Windows 11 use the same public API; LLAM selects conservative Windows 10 tuning or batched Windows 11 tuning at runtime, and CI forces both policy branches on native Windows runners.

Production and stress-operation guidance is documented in docs/operations.md.

Getting Started

Install Linux/WSL dependencies:

sudo apt install build-essential liburing-dev

Install macOS command-line tools:

xcode-select --install

Build on Linux:

make -j4 CC=gcc

Build on macOS:

CC=clang make -j4

Build native Windows with CMake:

cmake -S . -B build-windows -G "Ninja" -DCMAKE_BUILD_TYPE=Release
cmake --build build-windows
ctest --test-dir build-windows --output-on-failure

.\scripts\verify_windows.ps1 still verifies the Linux backend through WSL. .\scripts\verify_windows.ps1 -Native builds the native Windows CMake targets and runs the Windows CTest suite.

Build with CMake:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j4

Install with CMake:

cmake --install build --prefix "$HOME/.local"

Run the included programs:

./demo
./stress
./bench
./server 7777

Stress the chat server with real TCP clients:

make server-stress
python3 scripts/stress_server.py --clients 64 --messages 16 --payload-bytes 64

Run the native maximum-throughput flood driver:

make server-flood
./server_flood --clients 16 --duration 60 --message-bytes 8 --batch 64 --target-mps 0.30

server_flood reports both inbound messages/sec and observed broadcast deliveries/sec. For chat fanout, one inbound message produces clients - 1 peer deliveries, so million-scale delivery rates can appear well before inbound message rates reach the same scale.

Run the full composite server stress suite:

make server-stress-composite
make server-stress-composite-quick
make server-stress-composite-hour
python3 scripts/stress_server_composite.py --quick

The composite suite combines exact fanout checks, 60-second native flood, payload-size variation, connection churn, slow receivers, half-close/reset patterns, and RSS/fd sampling.

--quick is intended for hosted CI runners. It keeps exact-delivery checks but uses a lower absolute flood delivery threshold than the standard and hour-long runs.

The one-hour profile runs the same classes of checks with a long soak layout: 30 minutes of main flood, two 5-minute payload flood phases, and 20 minutes of mixed edge stress.

Run focused API/ABI tests:

make test

Build outputs:

  • demo: runnable examples of the public runtime API.
  • stress: regression coverage for scheduling, sync, timeouts, I/O, and dynamic workers.
  • bench: microbenchmarks for spawn/join, channels, channel select, I/O, poll, sleep fanout, and opaque blocking.
  • server: minimal LLAM-backed TCP chat backend for local testing.
  • server_flood: native nonblocking throughput flood driver for the chat server.
  • scripts/stress_server.py: TCP fanout stress test for the chat server.
  • scripts/stress_server_composite.py: long-running composite server stability suite.
  • test_abi_contract: ABI metadata and size handshakes.
  • test_connect_io: direct and runtime-managed llam_connect() success and invalid-input checks.
  • test_runtime_core: lifecycle, task metadata, yielding, sleeping, blocking callbacks, and stats checks.
  • test_sync_primitives: mutex, condition variable, channel, timeout, and close semantics.
  • test_io_buffers: direct and managed poll/read/write, owned buffers, and MSG_PEEK.
  • test_shared_load: dlopen() coverage for the shared library ABI surface.

Using LLAM In An Application

The top-level Makefile builds the bundled executables directly. For application integration, the simplest path is the CMake target llam_runtime.

add_subdirectory(path/to/LLAM)

add_executable(my_app main.c)
target_link_libraries(my_app PRIVATE llam_runtime)

Use llam_runtime_shared when a language runtime needs to load LLAM dynamically. The Makefile equivalent is make shared.

Release archives include the public headers, docs, bundled examples, runtime libraries, pkg-config metadata, and CMake package files. Tag pushes such as v1.0.1 build and publish .tar.xz archives for Linux x86_64, Linux aarch64, macOS x86_64, and macOS arm64, plus a native Windows x86_64 .zip archive through .github/workflows/release.yml.

The 1.0 release gate is intentionally platform-local: Linux must pass make verify-linux or Docker verification, macOS must pass the Darwin verify path, and Windows must pass native CMake/CTest plus Windows Server 2022/2025 stress smoke. The full operational checklist is in docs/operations.md.

Use an installed SDK with CMake:

find_package(llam CONFIG REQUIRED)

add_executable(my_app main.c)
target_link_libraries(my_app PRIVATE llam::runtime)

Use an installed SDK with pkg-config:

cc main.c $(pkg-config --cflags --libs llam) -o my_app

Install on Linux/macOS:

curl -fsSL https://github.com/Feralthedogg/LLAM/releases/download/1.0.1/install.sh | sh -s -- --version 1.0.1 --prefix "$HOME/.local"

Install a specific Linux/macOS target:

curl -fsSL https://github.com/Feralthedogg/LLAM/releases/download/1.0.1/install.sh | sh -s -- --version 1.0.1 --target macos-aarch64 --prefix "$HOME/.local"

Install on Windows x86_64:

Invoke-WebRequest "https://github.com/Feralthedogg/LLAM/releases/download/1.0.1/install.ps1" -OutFile install.ps1; .\install.ps1 -Version 1.0.1 -Prefix "$env:LOCALAPPDATA\LLAM"

Include the canonical public API:

#include "llam/runtime.h"

Dynamic loaders should check llam_abi_version() or llam_abi_get_info() before binding the rest of the API. FFI bindings should prefer llam_runtime_init_ex() and llam_spawn_ex() so inbound option structs carry an explicit caller-side size. The ABI and semantic contract is documented in docs/abi.md.

Embedding code should use llam_runtime_create(), llam_runtime_run_handle(), and llam_runtime_destroy(), while treating LLAM 1.0 as one active runtime per process. True multi-runtime isolation is a post-1.0 migration item; do not create and destroy LLAM concurrently from multiple host runtime instances. macOS-specific performance gates, remaining structural work, and the Windows backend scope, policy split, and acceptance gates are all tracked in the platform-local release checklist in docs/operations.md.

Execution Model

A typical LLAM program follows this lifecycle:

  1. Initialize the runtime with llam_runtime_init().
  2. Spawn one or more root tasks with llam_spawn().
  3. Run the scheduler with llam_run().
  4. Shut the runtime down with llam_runtime_shutdown().

#include "llam/runtime.h"

#include <stdio.h>

static void worker(void *arg) {
    const char *name = arg;

    printf("hello from %s\n", name);
    llam_yield();
    printf("bye from %s\n", name);
}

int main(void) {
    if (llam_runtime_init(NULL) != 0) {
        return 1;
    }

    if (llam_spawn(worker, "LLAM", NULL) == NULL) {
        llam_runtime_shutdown();
        return 1;
    }

    int rc = llam_run();
    llam_runtime_shutdown();
    return rc;
}

Task, Join, And Sleep

A task is a void (*)(void *) function. Pass shared state through the task argument and use llam_join() when a parent task needs the child to finish. Every task handle returned by llam_spawn*() must be consumed by either a successful join or llam_detach().

#include "llam/runtime.h"

#include <stdint.h>
#include <stdio.h>

typedef struct job {
    int input;
    int output;
} job_t;

static void child(void *arg) {
    job_t *job = arg;

    llam_sleep_ns(1ULL * 1000ULL * 1000ULL);
    job->output = job->input * job->input;
}

static void root(void *arg) {
    (void)arg;

    job_t job = {.input = 12};
    llam_task_t *task = llam_spawn(child, &job, NULL);

    if (task != NULL && llam_join(task) == 0) {
        printf("result=%d\n", job.output);
    }
}

Deadline-based APIs use absolute timestamps from llam_now_ns().

uint64_t deadline = llam_now_ns() + 10ULL * 1000ULL * 1000ULL;
int rc = llam_join_until(task, deadline);

Channels

A channel transfers pointer values between tasks and behaves as a bounded buffer of its capacity. Capacity must be at least 1.

#include "llam/runtime.h"

#include <stdio.h>

typedef struct pipe_state {
    llam_channel_t *channel;
} pipe_state_t;

static void producer(void *arg) {
    pipe_state_t *state = arg;

    (void)llam_channel_send(state->channel, "ping");
    (void)llam_channel_send(state->channel, "pong");
    (void)llam_channel_close(state->channel);
}

static void consumer(void *arg) {
    pipe_state_t *state = arg;
    const char *msg;

    while ((msg = llam_channel_recv(state->channel)) != NULL) {
        printf("recv=%s\n", msg);
    }
}

static void root(void *arg) {
    (void)arg;

    pipe_state_t state = {
        .channel = llam_channel_create(2),
    };

    if (state.channel == NULL) {
        return;
    }

    llam_task_t *a = llam_spawn(producer, &state, NULL);
    llam_task_t *b = llam_spawn(consumer, &state, NULL);

    if (a != NULL) {
        (void)llam_join(a);
    }
    if (b != NULL) {
        (void)llam_join(b);
    }
    llam_channel_destroy(state.channel);
}

I/O

LLAM I/O calls are written like blocking calls from inside a task, while the runtime backend handles readiness and completion. Linux uses io_uring, macOS uses kqueue, and Windows uses IOCP for overlapped Winsock read, write, accept, connect, generic HANDLE ReadFile/WriteFile, gated TCP POLLOUT, and UDP POLLIN requests. Windows TCP POLLIN defaults to the cooperative/direct fallback path unless LLAM_WINDOWS_IOCP_TCP_POLLIN=1 is enabled for controlled smoke or benchmark runs; unsupported poll masks remain fallback. The current I/O primitive set covers read, read_when_ready, write, HANDLE read/write, accept, connect, fd polling, HANDLE polling, and owned-buffer reads on supported native backends. Use LLAM_INVALID_FD or LLAM_FD_IS_INVALID(fd) for descriptor-returning failures such as llam_accept(), and LLAM_INVALID_HANDLE or LLAM_HANDLE_IS_INVALID(handle) for HANDLE-returning integrations.

#include "llam/runtime.h"

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

typedef struct echo_state {
    int reader;
    int writer;
} echo_state_t;

static void reader_task(void *arg) {
    echo_state_t *state = arg;
    char buf[64];

    ssize_t n = llam_read(state->reader, buf, sizeof(buf));
    if (n > 0) {
        printf("read=%.*s\n", (int)n, buf);
    }
}

static void writer_task(void *arg) {
    echo_state_t *state = arg;
    const char *msg = "hello";

    (void)llam_write(state->writer, msg, strlen(msg));
}

static void root(void *arg) {
    (void)arg;

    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return;
    }

    echo_state_t state = {
        .reader = sv[0],
        .writer = sv[1],
    };

    llam_task_t *reader = llam_spawn(reader_task, &state, NULL);
    llam_task_t *writer = llam_spawn(writer_task, &state, NULL);

    if (reader != NULL) {
        (void)llam_join(reader);
    }
    if (writer != NULL) {
        (void)llam_join(writer);
    }

    close(sv[0]);
    close(sv[1]);
}

The owned-buffer API lets the runtime allocate the I/O buffer. Release it with llam_io_buffer_release(). EOF or a zero-byte read returns 0 with buffer == NULL; failures return -1, set errno, and also leave buffer == NULL.

llam_io_buffer_t *buffer = NULL;
ssize_t n = llam_read_owned(fd, 4096, &buffer);

if (n > 0 && buffer != NULL) {
    void *data = llam_io_buffer_data(buffer);
    size_t size = llam_io_buffer_size(buffer);
    (void)data;
    (void)size;
}
llam_io_buffer_release(buffer);

Blocking Work

Long CPU work or blocking syscalls can pin a worker if they run directly inside a task. Use llam_call_blocking_result() to offload such work without ambiguity, or wrap explicit blocking regions with llam_enter_blocking() and llam_leave_blocking().

#include "llam/runtime.h"

#include <unistd.h>

static void *slow_syscall(void *arg) {
    (void)arg;
    sleep(1);
    return NULL;
}

static void task(void *arg) {
    void *result = NULL;

    (void)arg;

    (void)llam_call_blocking_result(slow_syscall, NULL, &result);
}

Manual blocking region:

if (llam_enter_blocking() == 0) {
    /* Run a blocking syscall or external library call here. */
    llam_leave_blocking();
}

Public API Summary

Runtime lifecycle:

| API | Purpose |
| --- | --- |
| llam_runtime_opts_init | Fill runtime options with ABI-safe library defaults. |
| llam_runtime_init_ex | Initialize the runtime with an explicit option struct size for FFI. |
| llam_runtime_init | Initialize the runtime. |
| llam_runtime_request_stop | Request cooperative scheduler stop and wake workers. |
| llam_runtime_shutdown | Shut the runtime down and release resources. |
| llam_runtime_collect_stats_ex | Collect stats with an explicit output struct size for FFI. |
| llam_runtime_collect_stats | Collect scheduler, I/O, blocking, and queue statistics. |
| llam_runtime_write_stats_json | Write a newline-terminated JSON stats snapshot to an fd. |

Task scheduling:

| API | Purpose |
| --- | --- |
| llam_spawn_opts_init | Fill spawn options with ABI-safe library defaults. |
| llam_spawn_ex | Create a task with an explicit option struct size for FFI. |
| llam_spawn | Create a task. |
| llam_run | Run the scheduler. |
| llam_yield | Yield the current task. |
| llam_task_safepoint | Mark progress in CPU-bound loops without forcing an immediate yield. |
| llam_join | Wait for task completion. |
| llam_join_until | Wait for task completion until a deadline. |
| llam_detach | Consume a task handle without waiting for completion. |
| llam_sleep_ns | Sleep for a duration. |
| llam_sleep_until | Sleep until an absolute deadline. |
| llam_task_set_class | Change the current task class; invalid class values fail with EINVAL. |
| llam_current_task | Return the current task handle. |
| llam_task_id | Return a task id. |
| llam_task_state_name | Return a task state string. |
| llam_task_class | Return a task class. |
| llam_task_flags | Return task flags. |

Spawn options:

| Type/value | Meaning |
| --- | --- |
| LLAM_TASK_CLASS_LATENCY | Latency-sensitive task. |
| LLAM_TASK_CLASS_DEFAULT | Default task class. |
| LLAM_TASK_CLASS_BATCH | Batch-oriented task. |
| LLAM_STACK_CLASS_DEFAULT | Default stack size class. |
| LLAM_STACK_CLASS_LARGE | Larger stack size class. |
| LLAM_STACK_CLASS_HUGE | Very large stack size class. |
| LLAM_SPAWN_F_PINNED | Hint that the task should stay pinned. |
| LLAM_SPAWN_F_NO_PREEMPT | Hint that preemption should be restricted. |
| LLAM_SPAWN_F_SYS_TASK | Runtime/system task hint. |
| LLAM_SPAWN_F_LATENCY_CRITICAL | Latency-critical task hint. |

Blocking:

| API | Purpose |
| --- | --- |
| llam_call_blocking_result | Run a blocking function through the unambiguous int + out API. |
| llam_call_blocking | Convenience blocking API; ambiguous when the callback returns NULL. |
| llam_enter_blocking | Mark the current task as entering a blocking region. |
| llam_leave_blocking | Mark the current task as leaving a blocking region. |

Cancellation:

| API | Purpose |
| --- | --- |
| llam_cancel_token_create | Create a cancellation token. |
| llam_cancel_token_destroy | Destroy a cancellation token; live observers make it fail with EBUSY. |
| llam_cancel_token_cancel | Request cancellation. |
| llam_cancel_token_is_cancelled | Check cancellation state. |

Mutex and condition variables:

| API | Purpose |
| --- | --- |
| llam_mutex_create / llam_mutex_destroy | Create or destroy a mutex; destroy returns EBUSY while owned or waited on. |
| llam_mutex_lock / llam_mutex_unlock | Lock or unlock a non-recursive mutex; self-lock returns EDEADLK, non-owner unlock returns EPERM. |
| llam_mutex_lock_until | Wait for a mutex until a deadline; self-lock returns EDEADLK. |
| llam_mutex_trylock | Try to lock immediately; returns EBUSY when already locked. |
| llam_cond_create / llam_cond_destroy | Create or destroy a condition variable; destroy returns EBUSY while waited on. |
| llam_cond_wait | Wait on a condition variable; caller must own the mutex and wait in a predicate loop. |
| llam_cond_wait_until | Wait on a condition variable until a deadline; reacquires the mutex before returning. |
| llam_cond_signal | Wake one waiter; may be called with or without the mutex and outside a managed task. |
| llam_cond_broadcast | Wake all waiters; may be called with or without the mutex and outside a managed task. |

Channels:

| API | Purpose |
| --- | --- |
| llam_channel_create / llam_channel_destroy | Create or destroy a channel; destroy returns EBUSY while buffered values or waiters remain. |
| llam_channel_send | Send a value. |
| llam_channel_send_until | Send a value until a deadline. |
| llam_channel_recv_result | Receive a value through an unambiguous int + out API. |
| llam_channel_recv_until_result | Receive a value until a deadline through an unambiguous int + out API. |
| llam_channel_recv | Convenience receive API; use the result form if NULL is a valid payload. |
| llam_channel_recv_until | Convenience timed receive API; use the result form if NULL is a valid payload. |
| llam_channel_close | Idempotently close a channel; buffered values remain drainable and sends fail with EPIPE. |

I/O:

| API | Purpose |
| --- | --- |
| llam_read | Read from an fd. |
| llam_write | Write to an fd. |
| llam_read_handle | Read from a platform handle; Windows uses overlapped ReadFile through IOCP when possible, POSIX aliases to fd read. |
| llam_write_handle | Write to a platform handle; Windows uses overlapped WriteFile through IOCP when possible, POSIX aliases to fd write. |
| llam_read_owned | Read into a runtime-owned buffer. |
| llam_recv_owned | Receive with flags into a runtime-owned buffer. |
| llam_io_buffer_release | Release an owned buffer. |
| llam_io_buffer_data | Return the owned buffer data pointer. |
| llam_io_buffer_size | Return the number of bytes read. |
| llam_io_buffer_capacity | Return owned buffer capacity. |
| llam_accept | Accept a connection from a listener fd; returns LLAM_INVALID_FD on failure. |
| llam_connect | Connect a socket without blocking the scheduler worker. |
| llam_poll_fd | Wait for fd readiness. |
| llam_poll_handle | Wait for platform handle state; Windows uses WaitForSingleObject semantics and POSIX aliases to fd poll. |

Time, debug, and platform:

| API | Purpose |
| --- | --- |
| llam_now_ns | Return a monotonic nanosecond timestamp. |
| llam_dump_runtime_state | Dump runtime state to an fd. |
| llam_fd_t | Platform-specific fd/socket handle type. |
| llam_handle_t | Platform-specific generic handle type for HANDLE I/O APIs. |
| LLAM_INVALID_FD / LLAM_FD_IS_INVALID | Platform-correct invalid descriptor sentinel and predicate. |
| LLAM_INVALID_HANDLE / LLAM_HANDLE_IS_INVALID | Platform-correct invalid generic-handle sentinel and predicate. |
| LLAM_PLATFORM_LINUX | Linux build flag. |
| LLAM_PLATFORM_DARWIN | macOS/Darwin build flag. |
| LLAM_PLATFORM_WINDOWS | Windows build flag. |
| LLAM_PLATFORM_NAME | Platform name string. |

Runtime Options

Pass NULL to llam_runtime_init() for the default runtime configuration. Pass llam_runtime_opts_t when you need explicit tuning. Dynamic loaders and language bindings should initialize option structs with llam_runtime_opts_init(&opts, LLAM_RUNTIME_OPTS_CURRENT_SIZE) and llam_spawn_opts_init(&opts, LLAM_SPAWN_OPTS_CURRENT_SIZE), then call llam_runtime_init_ex(&opts, LLAM_RUNTIME_OPTS_CURRENT_SIZE), llam_spawn_ex(fn, arg, &opts, LLAM_SPAWN_OPTS_CURRENT_SIZE), and llam_runtime_collect_stats_ex(&stats, LLAM_RUNTIME_STATS_CURRENT_SIZE).

Public option and stats structs use fixed-width integer storage for ABI-facing scalar fields. Enum constants remain available for C readability, but FFI bindings should model task classes, stack classes, profiles, flags, and 32-bit state counters as uint32_t; sqpoll_cpu is int32_t.

llam_runtime_opts_t opts = {
    .deterministic = 0,
    .forced_yield_every = 0,
    .experimental_flags =
        LLAM_RUNTIME_EXPERIMENTAL_F_DYNAMIC_WORKERS |
        LLAM_RUNTIME_EXPERIMENTAL_F_LOCKFREE_NORMQ,
    .profile = LLAM_RUNTIME_PROFILE_BALANCED,
};

if (llam_runtime_init(&opts) != 0) {
    return 1;
}

Important fields:

| Field | Meaning |
| --- | --- |
| deterministic | Deterministic scheduling mode. |
| forced_yield_every | Force a yield at a fixed interval. |
| experimental_flags | Bitwise OR of LLAM_RUNTIME_EXPERIMENTAL_F_* flags. |
| idle_spin_ns | Spin before idle poll fallback. |
| idle_spin_max_iters | Maximum idle-spin iterations. |
| sqpoll_cpu | CPU reserved for SQPOLL. |
| profile | Runtime policy profile: balanced, release-fast, debug-safe, or io-latency. |

Experimental flags:

| Flag | Meaning |
| --- | --- |
| LLAM_RUNTIME_EXPERIMENTAL_F_WORKER_RINGS | Experimental per-worker I/O ring mode. |
| LLAM_RUNTIME_EXPERIMENTAL_F_WORKER_RINGS_MULTISHOT | Allow multishot watches with worker rings. |
| LLAM_RUNTIME_EXPERIMENTAL_F_DYNAMIC_WORKERS | Soft-park and reactivate idle workers. |
| LLAM_RUNTIME_EXPERIMENTAL_F_LOCKFREE_NORMQ | Use the lock-free normal queue. |
| LLAM_RUNTIME_EXPERIMENTAL_F_HUGE_ALLOC | Prefer hugepage-friendly allocator backing. |
| LLAM_RUNTIME_EXPERIMENTAL_F_SQPOLL | Experimental Linux io_uring SQPOLL mode. |

Selected environment variables:

| Variable | Example values | Meaning |
| --- | --- | --- |
| LLAM_RUNTIME_PROFILE | balanced, release-fast, debug-safe, io-latency | Override the runtime profile. |
| LLAM_EXPERIMENTAL_DYNAMIC_WORKERS | 0, 1 | Toggle dynamic workers. |
| LLAM_EXPERIMENTAL_LOCKFREE_NORMQ | 0, 1 | Toggle the lock-free normal queue. |
| LLAM_EXPERIMENTAL_WORKER_RINGS | 0, 1 | Toggle worker ring mode. |
| LLAM_EXPERIMENTAL_WORKER_RINGS_MULTISHOT | 0, 1 | Toggle worker-ring multishot watches. |
| LLAM_EXPERIMENTAL_HUGE_ALLOC | 0, 1 | Toggle huge allocator mode. |
| LLAM_EXPERIMENTAL_SQPOLL | 0, 1 | Toggle Linux SQPOLL. |
| LLAM_SQPOLL_CPU | CPU number | Select the SQPOLL CPU. |
| LLAM_IDLE_SPIN_NS | nanoseconds | Idle spin time. |
| LLAM_IDLE_SPIN_ITERS | iteration count | Idle spin iteration limit. |
| LLAM_BIND_WORKERS | 0, 1 | Bind worker threads to platform CPUs when supported. |
| LLAM_DARWIN_MACH_SCHED | 0, 1 | Toggle Darwin Mach/QoS scheduler hints; default is enabled on macOS. |
| LLAM_WINDOWS_UNSAFE_SKIP_TASK_SIMD | 0, 1 | Experimental Windows x64 ceiling mode: skip task-context XMM6-XMM15 save/restore. Only valid when managed tasks do not rely on callee-saved SIMD state across LLAM yields/waits. |
| LLAM_AARCH64_UNSAFE_SKIP_SCHEDULER_SIMD | 0, 1 | Experimental macOS/Linux ARM64 ceiling mode: skip scheduler-context SIMD save/restore while task contexts still preserve ABI-required d8-d15. |
| LLAM_ARM64_UNSAFE_SKIP_SCHEDULER_SIMD | 0, 1 | Alias for LLAM_AARCH64_UNSAFE_SKIP_SCHEDULER_SIMD. |
| LLAM_DIRECT_BLOCKING_IO | 0, 1 | Allow eligible blocking socket read/write operations to run through compensated direct blocking regions. |
| LLAM_DIRECT_BLOCKING_POLL | 0, 1, unset | Control direct blocking poll fallback; Linux/Windows auto mode handles finite waits directly when profitable. |
| LLAM_ACCEPT_DIRECT_BLOCKING | 0, 1 | Route managed accept calls that cannot use multishot accept-watch through a compensated helper poll loop; default is enabled on macOS and disabled elsewhere. |
| LLAM_IO_POLL_REDIRECT_TIMEOUT_MS | milliseconds | Redirect long direct-poll waits through opaque blocking compensation on Linux. |
| LLAM_IO_COOP_YIELD | 0, 1 | Enable cooperative yields around direct I/O fast paths; default is enabled on macOS, Linux, and Windows. |
| LLAM_IO_POLL_COOP_YIELD | 0, 1 | Enable cooperative yields in poll readiness paths; default is enabled on macOS, Linux, and Windows. |
| LLAM_IO_POLL_PRE_YIELD | 0, 1 | Let poll hand off to same-shard runnable producers before the first readiness probe; default is enabled on macOS and Windows. |
| LLAM_IO_POLL_EXTRA_YIELD | 0, 1 | Add an extra poll-readiness yield; default is enabled on macOS and Windows. |
| LLAM_IO_POLL_READY_YIELDS | 0-8 | Bound short same-shard ready-yield probes before poll parks in the backend. |
| LLAM_READ_READY_INITIAL_HANDOFF | 0, 1 | Let llam_read_when_ready() hand off once to local producers before its first read probe; default is disabled. |
| LLAM_READ_READY_DIRECT_BLOCKING | 0, 1 | Let infinite llam_read_when_ready() use compensated direct blocking reads; default is disabled. |
| LLAM_POLL_SOCKET_PEEK | 0, 1 | Use MSG_PEEK for socket POLLIN fast checks; default is enabled on macOS and opt-in elsewhere. |
| LLAM_IO_WRITE_HANDOFF | 0, 1 | Yield after small socket writes so local readers can run; default is enabled on macOS and Linux. |
| LLAM_IO_WRITE_DIRECT_LOCAL_HANDOFF | 0, 1 | Prefer direct same-shard task handoff after eligible socket writes; default is enabled on macOS, Linux, and Windows. |
| LLAM_YIELD_DIRECT_HANDOFF | 0, 1, unset | Allow ordinary yields to switch directly to same-shard runnable work when no timers or inject work are pending. |
| LLAM_OPAQUE_REDIRECT_FASTPATH | 0, 1 | Prefer redirect over helper handoff for opaque blocking; default is enabled on Linux. |
| LLAM_TIMER_HEAP_PREWARM | timer slots | Preallocate shard timer heap slots to avoid growth during sleep/deadline fanout. |
| LLAM_STACK_CACHE_PREWARM | stack count | Prewarm the default stack cache before high fanout workloads. |
| LLAM_TASK_CACHE_PREWARM | task count | Prewarm task metadata slabs before high fanout workloads. |
| LLAM_STACK_SAMPLING | 0, 1 | Enable stack high-water sampling diagnostics. |
| LLAM_TRACE_EVENTS | 0, 1 | Enable per-worker trace ring diagnostics. |
| LLAM_WAKE_LATENCY_METRICS | 0, 1 | Enable wake-latency diagnostics. |
| LLAM_STRESS_DYNAMIC_LIVE_POLL_WAITERS | waiter count | Stress live poll/accept/inflight waiters; automatically clamped by fd budget. |

Benchmarks

Run all LLAM benchmark cases:

./bench

Run one benchmark case:

LLAM_BENCH_ONLY=spawn_join ./bench
LLAM_BENCH_ONLY=channel_pingpong ./bench
LLAM_BENCH_ONLY=io_echo ./bench
LLAM_BENCH_ONLY=poll_wake ./bench
LLAM_BENCH_ONLY=sleep_fanout ./bench
LLAM_BENCH_ONLY=opaque_block ./bench

Scale benchmark size:

LLAM_BENCH_ROUNDS=31 LLAM_BENCH_WARMUP_ROUNDS=5 ./bench
LLAM_BENCH_SPAWN_TASKS=512 ./bench
LLAM_BENCH_CHANNEL_MESSAGES=4096 ./bench
LLAM_BENCH_IO_MESSAGES=512 ./bench
LLAM_BENCH_POLL_EVENTS=512 ./bench
LLAM_BENCH_SLEEP_TASKS=1024 ./bench
LLAM_BENCH_OPAQUE_SCOPES=64 ./bench

Compare against Go:

go run scripts/bench_go_compare.go

Compare LLAM, Go, and Tokio:

python3 scripts/bench_runtime_compare.py --runtime all

Graph generation requires Python matplotlib. Without it, the script still writes CSV and prints tables. The scheduled Runtime Benchmarks workflow runs the same comparison on Linux x86_64, macOS arm64, macOS x86_64, Windows Server 2022, and Windows Server 2025, then uploads CSV/PNG artifacts for regression tracking.

Run the benchmark matrix:

make bench-matrix

Verification And Cleanup

Run focused tests:

make test

Build a local release archive:

make clean all test
./scripts/package_release.sh

Or use the Makefile package target:

make package

Verify Linux:

make verify-linux CC=gcc

Verify Linux with experimental paths:

LLAM_VERIFY_LINUX_EXPERIMENTAL=1 make verify-linux CC=gcc

Verify macOS:

CC=clang make verify-darwin

Verify macOS with experimental paths:

LLAM_VERIFY_DARWIN_EXPERIMENTAL=1 CC=clang make verify-darwin

Verify Linux in Docker:

./scripts/docker_verify_linux.sh

Check Windows status:

.\scripts\verify_windows.ps1
.\scripts\verify_windows.ps1 -Native

The default command verifies through WSL when available. The -Native command builds native Windows targets and runs the Windows CTest suite.

Remove generated files:

make clean

make clean removes generated files such as object/, build/, CMake cache files, example and benchmark binaries, and perf.data*.

Architecture

Overview

LLAM is a user-level N:M thread scheduler. A small number of OS worker threads (typically one per CPU core) run many lightweight tasks. Each task has its own stack and can be suspended and resumed without kernel intervention.

flowchart TB
    subgraph UserSpace["User Space"]
        direction TB
        Tasks["Tasks (N lightweight fibers)"] --> Shards
        subgraph Shards["Scheduler Shards (per worker)"]
            direction LR
            S0["Shard 0\nhot_q / norm_q / inject_q\ntimers / allocator"]
            S1["Shard 1\nhot_q / norm_q / inject_q\ntimers / allocator"]
            S0 <-->|work steal| S1
        end
        Shards --> Nodes
        subgraph Nodes["I/O Nodes"]
            direction LR
            N0["Node 0\nio_uring / kqueue\nwatch tables"]
            N1["Node 1\nio_uring / kqueue\nwatch tables"]
        end
        Watchdog["Watchdog\nprobe / scale / merge / rehome"] -.-> Shards
        BlockPool["Blocking Thread Pool"] -.-> Shards
        OpaqueHelpers["Opaque Helpers\n(per-shard compensation)"] -.-> Shards
    end
    Shards -->|runs on| Workers["OS Threads (M pthreads)"]

flowchart LR
    App["Application"] --> API["include/llam\npublic API"]
    API --> Core["src/core\nscheduler / tasks / sync"]
    API --> IO["src/io\nI/O API + backends"]
    Core --> Engine["src/engine\nworkers / watchdog"]
    Engine --> Blocking["blocking\ncompensation"]
    IO --> Linux["src/io/linux\nio_uring"]
    IO --> Darwin["src/io/darwin\nkqueue"]
    IO --> Windows["Windows\nIOCP sockets + HANDLEs"]
    Core --> ASM["src/asm\ncontext switch"]

N:M Threading Model

Tasks are the fundamental unit of execution. Each task is a void (*)(void *) function with its own fiber stack allocated via mmap with a guard page. Tasks are scheduled cooperatively onto OS worker threads; the runtime never preempts a task without its participation (safepoints, yields, or I/O waits).

Shards are per-worker scheduler partitions. Each shard owns:

  • Three run queues: hot_q (latency-critical, capacity 1024), norm_q (normal, capacity 4096), and inject_q (cross-shard, capacity 1024).
  • A timer heap: min-heap ordered by deadline, used for llam_sleep_until and timed waits.
  • A per-shard allocator: slab-based pools for tasks, wait nodes, timer nodes, I/O requests, and I/O buffers, each with lock-free remote-free queues for cross-shard deallocation.
  • A stack cache: per-class (default/large/huge) stack mapping reuse pool.
  • Scheduler context: the fiber context (llam_ctx_t) the scheduler loop itself runs on.
  • An opaque helper thread: a pre-spawned compensation thread that takes over scheduling when the primary worker enters a blocking region.

The norm_q has two implementations selected at init time: a mutex-guarded FIFO queue, or a Chase-Lev lock-free deque (llam_cldeque_t) for work-stealing. The lock-free deque is a bounded circular buffer of 4096 task pointers with separated top (steal end) and bottom (push/pop end) atomics, each on its own cache line.
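The steal path of such a deque can be sketched with C11 atomics. This is a simplified bounded version with illustrative names, not LLAM's actual llam_cldeque_t layout; the owner-side pop and its subtle last-element race are omitted.

```c
#include <stdatomic.h>
#include <stddef.h>

#define DEQ_CAP 4096  /* power of two, matching the 4096-slot norm_q */

/* Owner calls push on the bottom end; thieves call steal on the top end.
 * top and bottom sit on separate cache lines to avoid false sharing. */
typedef struct {
    _Alignas(64) atomic_size_t top;     /* steal end */
    _Alignas(64) atomic_size_t bottom;  /* push/pop end */
    void *buf[DEQ_CAP];
} deque_t;

static int deque_push(deque_t *q, void *task) {
    size_t b = atomic_load_explicit(&q->bottom, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->top, memory_order_acquire);
    if (b - t >= DEQ_CAP) return 0;            /* full */
    q->buf[b & (DEQ_CAP - 1)] = task;
    atomic_store_explicit(&q->bottom, b + 1, memory_order_release);
    return 1;
}

static void *deque_steal(deque_t *q) {
    size_t t = atomic_load_explicit(&q->top, memory_order_acquire);
    size_t b = atomic_load_explicit(&q->bottom, memory_order_acquire);
    if (t >= b) return NULL;                   /* empty */
    void *task = q->buf[t & (DEQ_CAP - 1)];
    if (!atomic_compare_exchange_strong(&q->top, &t, t + 1))
        return NULL;                           /* lost the race to another thief */
    return task;
}
```

The CAS on top is what lets multiple sibling shards steal concurrently without a lock; a losing thief simply retries elsewhere.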

Nodes are platform I/O event backends. Each node owns either an io_uring ring (Linux) or a kqueue fd (Darwin), plus watch tables, submit queues, and control queues. Shards submit I/O requests to nodes; nodes complete requests back to the owning shard's task.

Scheduler Loop

The core scheduler loop (llam_scheduler_loop) runs on each worker thread:

loop:
    1. Check runtime drain (live_tasks == 0 → stop)
    2. Handle merge-pause requests from the watchdog
    3. Handle dynamic-worker offline state
    4. Drain inject queue (up to 32 tasks per pass)
    5. Fire expired timers
    6. Dequeue from hot_q, then norm_q
    7. Attempt work-steal from a random sibling shard
    8. If no task found → idle wait (eventfd/kqueue/futex)
    9. Context switch to the selected task
   10. On return: record metrics, check safepoints, repeat

Task selection priority: hot queue → normal queue → inject queue → steal. The hot queue is reserved for latency-class tasks and I/O completions. The inject queue receives cross-shard work and is drained with a budget cap to prevent starvation.
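The budget-capped inject drain can be sketched as follows; the queue type and helpers here are toy single-threaded stand-ins, not LLAM's internal queues.

```c
#include <stddef.h>

#define INJECT_BUDGET 32   /* matches the 32-task drain cap per pass */
#define QCAP 64

/* Toy ring queue with monotonically increasing head/tail indices. */
typedef struct { void *buf[QCAP]; size_t head, tail; } queue_t;

static int queue_push(queue_t *q, void *v) {
    if (q->tail - q->head == QCAP) return 0;   /* full */
    q->buf[q->tail++ % QCAP] = v;
    return 1;
}

static void *queue_pop(queue_t *q) {
    return q->head == q->tail ? NULL : q->buf[q->head++ % QCAP];
}

/* Move at most INJECT_BUDGET cross-shard tasks per scheduler pass, so a
 * flooded inject queue cannot starve the shard's own run queues. */
static int drain_inject(queue_t *inject, queue_t *norm) {
    int moved = 0;
    void *t;
    while (moved < INJECT_BUDGET && (t = queue_pop(inject)))
        if (queue_push(norm, t)) moved++;
    return moved;
}
```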

Context Switching

Context switches are performed in hand-written assembly for each supported platform. The runtime saves and restores only the callee-saved registers required by the platform ABI:

| Platform | Saved registers | Mechanism |
| --- | --- | --- |
| Linux x86_64 | rbx, rbp, r12-r15, rsp | Direct mov/ret in context_x86_64.S |
| Linux aarch64 | x19-x29, x30, sp | stp/ldp in context_arm64.S |
| Darwin x86_64 | rbx, rbp, r12-r15, rsp | Same register set as Linux x86_64 |
| Darwin arm64 | x19-x29, x30, sp | Same register set as Linux aarch64 |
| Fallback | Full register set | ucontext via swapcontext() |

On x86_64, the fast path is approximately:

mov [rdi], rsp        ; save current SP
mov rsp, [rsi]        ; restore target SP
ret                   ; jump to target's saved return address

No syscall, no privilege transition, no TLB flush. Typical cost is tens of nanoseconds versus microseconds for a kernel thread switch.

FP state isolation: the runtime detects XSAVE support at init and tracks the XSAVE mask and area size. Tasks that modify x87/SSE rounding modes (MXCSR, x87 CW) are isolated from one another across context switches.

I/O Backend

LLAM provides scheduler-safe I/O through a multi-tier completion strategy. Each I/O call (llam_read, llam_write, llam_read_handle, llam_write_handle, llam_accept, llam_connect, llam_poll_fd, llam_poll_handle) follows this path:

1. Direct nonblocking fast path (try without parking)
2. Cooperative yield + retry if local work is pending
3. Direct blocking heuristic for short-lived ops
4. Async backend submission (io_uring SQE or kqueue kevent)
5. Blocking-worker fallback if the backend cannot handle the fd
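Tier 1, the direct nonblocking fast path, amounts to a read that distinguishes "would block" from a real error. This is a minimal sketch with an illustrative helper name, not LLAM's actual implementation of llam_read:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try the read without parking the task. Returns the byte count on
 * success, -2 when the fd would block (caller escalates to the async
 * backend), and -1 on a real error. */
static ssize_t try_fast_read(int fd, void *buf, size_t len) {
    ssize_t n = read(fd, buf, len);
    if (n >= 0) return n;                      /* completed on the fast path */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return -2;                             /* would block: submit async */
    return -1;                                 /* real error */
}
```

When the fast path returns "would block", the real runtime consults local queue pressure before deciding between a yield-and-retry and a backend submission.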

Linux backend (src/io/linux/): each llam_node_t owns an io_uring instance with a ring depth of 256. Features include:

  • One-shot and multishot poll, accept, and recv operations.
  • Provided-buffer rings (128 entries per node) for zero-copy recv.
  • SQPOLL mode (experimental) where the kernel polls the SQ without syscalls.
  • Completion-driven task wake: CQE processing identifies the owning shard and re-enqueues the task.

Darwin backend (src/io/darwin/): each node uses kqueue with EVFILT_READ, EVFILT_WRITE, and EVFILT_USER for wake signaling. Darwin nodes use Mach ports (semaphore_t, mach_port_t) for cross-thread wake when available.

Windows backend (src/io/windows/): each node uses IOCP for overlapped Winsock operations and generic HANDLE ReadFile/WriteFile requests. The socket-readiness policy stays conservative for stream-socket POLLIN unless explicitly enabled; waitable-HANDLE polling uses the platform wait path instead, since it observes signaled handle state rather than socket readiness.

Linux and Darwin backends support multishot watches — a single registered watch serves multiple waiters, avoiding redundant kernel registrations for the same fd. Watch tables track fd identity by (dev_t, ino_t) to detect fd reuse.

I/O request lifecycle: requests (llam_io_req_t) carry an atomic wait_mode that transitions through NONE → SUBMIT_QUEUE → INFLIGHT → (completion) or NONE → POLL_WATCH/ACCEPT_WATCH/RECV_WATCH → (completion). An atomic abort_reason field handles cancellation and timeout races without locks.
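The lock-free transitions can be sketched with C11 compare-and-swap. The enum and struct names mirror the states in the text but are illustrative, not LLAM's actual definitions, and only two of the transitions are shown.

```c
#include <stdatomic.h>

enum wait_mode { WM_NONE, WM_SUBMIT_QUEUE, WM_INFLIGHT, WM_DONE };

typedef struct {
    _Atomic int wait_mode;
    _Atomic int abort_reason;   /* 0 = none; e.g. ECANCELED on cancellation */
} io_req_t;

/* Submitter: move NONE -> SUBMIT_QUEUE. Fails if a canceller already
 * advanced the state, so the request is never submitted after abort. */
static int req_submit(io_req_t *r) {
    int expect = WM_NONE;
    return atomic_compare_exchange_strong(&r->wait_mode, &expect,
                                          WM_SUBMIT_QUEUE);
}

/* Canceller: record the reason, then try to finish the request before it
 * is ever submitted. The CAS decides the race without any lock. */
static int req_cancel(io_req_t *r, int reason) {
    atomic_store(&r->abort_reason, reason);
    int expect = WM_NONE;
    return atomic_compare_exchange_strong(&r->wait_mode, &expect, WM_DONE);
}
```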

Memory Management

Per-shard allocator (llam_allocator_t): each shard maintains free lists for 5 object types:

| Object | Slab size | Purpose |
| --- | --- | --- |
| llam_task_t | 16 per slab | Task metadata + embedded wait node, I/O req, timer |
| llam_wait_node_t | 64 per slab | Mutex/cond/channel waiter nodes |
| llam_timer_node_t | 64 per slab | Timer heap entries |
| llam_io_req_t | 64 per slab | I/O operation descriptors |
| llam_io_buffer_t | 16 per slab | Owned I/O buffers (4KB inline data each) |

Remote-free queues: when a task completes on a different shard than it was allocated on, the deallocation goes through a lock-free atomic MPSC list (task_remote_free, wait_remote_free, etc.) and is drained into the local free list during llam_allocator_quiescent() at safe points. Each remote-free queue sits on its own cache line (_Alignas(64)) to avoid false sharing.
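A minimal version of such a remote-free queue — lock-free multi-producer push, owner-only drain — looks like this (illustrative names, not LLAM's task_remote_free layout):

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct free_node { struct free_node *next; } free_node_t;

typedef struct {
    /* Own cache line so pushes from one shard don't invalidate the
     * lines of neighboring shards' queues (false sharing). */
    _Alignas(64) _Atomic(free_node_t *) head;
} remote_free_t;

/* Any shard: push a freed object onto the list (lock-free). */
static void remote_free_push(remote_free_t *q, free_node_t *n) {
    free_node_t *old = atomic_load_explicit(&q->head, memory_order_relaxed);
    do { n->next = old; }
    while (!atomic_compare_exchange_weak_explicit(&q->head, &old, n,
            memory_order_release, memory_order_relaxed));
}

/* Owning shard only: atomically take the whole list at a quiescent
 * point, then splice it into the local free list. */
static free_node_t *remote_free_drain(remote_free_t *q) {
    return atomic_exchange_explicit(&q->head, NULL, memory_order_acquire);
}
```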

Stack cache: two-tier cache with per-shard local pools and a runtime-global fallback:

| Stack class | Size | Shard cache limit | Global cache limit |
| --- | --- | --- | --- |
| Default | 64 KB | 256 (512 in release-fast) | 4096 |
| Large | 256 KB | 64 | 512 |
| Huge | 1 MB | 16 | 128 |

Stack mappings use mmap(MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK) with a guard page (mprotect(PROT_NONE)) at the bottom. Returned stacks are cached for reuse; the cache is pre-warmed at init via llam_runtime_prewarm_stack_cache.

Blocking Compensation

When a task calls llam_enter_blocking() or llam_call_blocking(), the runtime must keep the shard's scheduler running while the worker thread is pinned in foreign code.

Opaque helper thread: each shard has a pre-spawned helper thread (opaque_helper_thread). When the primary worker enters a blocking region:

Primary worker:
  1. Increment opaque_compensation_depth
  2. Wake the helper thread (futex on Linux, Mach semaphore or condvar on Darwin)
  3. Execute the blocking work

Helper thread:
  1. Wake up and take over the shard's scheduler loop (using opaque_scheduler_ctx)
  2. Run tasks from the shard's queues while the primary is blocked
  3. When the primary calls llam_leave_blocking(), relinquish control

Blocking thread pool: a separate pool of block_worker_count threads handles llam_call_blocking jobs via a global FIFO job queue (block_head/block_tail). Jobs transition through QUEUED → RUNNING → FINISHED/ABORTED.

The connect fallback (llam_blocking_connect_impl) drives a nonblocking connect() + poll(POLLOUT, 10ms) loop with SO_ERROR verification, running in the blocking pool rather than pinning a scheduler worker.
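That pattern — nonblocking connect, short poll slices, SO_ERROR confirmation — can be sketched as follows. The helper name and the simplified timeout accounting are illustrative, not llam_blocking_connect_impl itself.

```c
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>

static int connect_with_poll(int fd, const struct sockaddr *sa,
                             socklen_t len, int total_timeout_ms) {
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    if (connect(fd, sa, len) == 0) return 0;   /* connected immediately */
    if (errno != EINPROGRESS) return -1;
    for (int waited = 0; waited < total_timeout_ms; waited += 10) {
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        if (poll(&pfd, 1, 10) > 0) {           /* 10 ms slices, as above */
            /* Writable means the connect finished; SO_ERROR says how. */
            int err = 0; socklen_t elen = sizeof err;
            getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &elen);
            if (err) { errno = err; return -1; }
            return 0;
        }
    }
    errno = ETIMEDOUT;
    return -1;
}
```

Because each slice is only 10 ms, a blocking-pool thread running this loop stays responsive to job-queue shutdown without pinning a scheduler worker.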

Watchdog System

The watchdog thread (src/engine/) runs at 1ms intervals (LLAM_WATCHDOG_INTERVAL_NS) and performs:

| Module | File | Function |
| --- | --- | --- |
| Probe | runtime_watchdog_probe.c | Detect stalled safepoints, measure queue pressure, suspect deadlocks after 4 consecutive observations |
| Scale | runtime_watchdog_scale.c | Dynamic worker scaling: scale up after 2 consecutive pressure observations, scale down after 12 consecutive idle observations, with a 4-tick cooldown |
| Merge | runtime_watchdog_merge.c | Offline a shard by draining its queues and migrating tasks to a target shard |
| Rehome | runtime_watchdog_rehome.c | Atomically transfer ownership of parked waiters, in-flight I/O, submit-queue entries, and multishot watch state from an offline shard to a target shard |

Rehome validates the entire waiter list before any migration. If a single entry cannot be rehomed (pinned task, incompatible I/O state), the entire list migration is aborted to prevent partial ownership inconsistency.

Synchronization Primitives

All sync primitives are runtime-aware: when called from a managed task, blocking waits park the task (freeing the worker thread) instead of blocking the OS thread.

  • Mutex (llam_mutex_t): atomic owner fast path + llam_wait_queue_t for contention. Non-recursive. EDEADLK on self-lock, EPERM on non-owner unlock.
  • Condition variable (llam_cond_t): FIFO waiter queue. Signal/broadcast can be called from outside managed tasks.
  • Channel (llam_channel_t): bounded pointer-valued ring buffer with separate send and receive wait queues. Supports close semantics (sends fail with EPIPE, buffered values remain drainable).
  • Cancel token (llam_cancel_token_t): explicit cancellation handle with a waiter list. Registered tasks and I/O operations observe cancellation through ECANCELED.
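The channel close semantics — sends fail with EPIPE while buffered values remain drainable — can be illustrated with a toy pthread-based channel. This is not llam_channel_t, which parks tasks on its wait queues rather than returning EAGAIN.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

#define CH_CAP 8
typedef struct {
    pthread_mutex_t mu;
    void *buf[CH_CAP];
    size_t head, tail;   /* monotonically increasing ring indices */
    int closed;
} chan_t;

static int chan_send(chan_t *c, void *v) {
    pthread_mutex_lock(&c->mu);
    int rc = 0;
    if (c->closed) rc = EPIPE;                        /* closed: send fails */
    else if (c->tail - c->head == CH_CAP) rc = EAGAIN; /* full (no parking here) */
    else c->buf[c->tail++ % CH_CAP] = v;
    pthread_mutex_unlock(&c->mu);
    return rc;
}

static int chan_recv(chan_t *c, void **out) {
    pthread_mutex_lock(&c->mu);
    int rc = 0;
    if (c->head != c->tail) *out = c->buf[c->head++ % CH_CAP]; /* drain buffer */
    else rc = c->closed ? EPIPE : EAGAIN;             /* empty: only fail if closed */
    pthread_mutex_unlock(&c->mu);
    return rc;
}

static void chan_close(chan_t *c) {
    pthread_mutex_lock(&c->mu);
    c->closed = 1;
    pthread_mutex_unlock(&c->mu);
}
```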

License

LLAM is licensed under the Apache License 2.0.

About

Low Level Asynchronous Machine
