LLAM


LLAM is a stackful user-thread runtime for C applications. It lets C code express concurrency with task-oriented APIs such as spawn, join, sleep, channels, read, write, accept, connect, and poll, while the runtime schedules many user tasks over a smaller set of OS worker threads.

LLAM is not Linux-only. The Linux backend uses io_uring/liburing, the macOS/Darwin backend uses kqueue-based watch and completion paths, and the native Windows 10/11 backend uses IOCP for overlapped Winsock read/write/accept/connect plus generic HANDLE ReadFile/WriteFile requests.

Key Features

  • Stackful tasks with natural C control flow.
  • N:M scheduling over runtime worker threads.
  • Linux I/O backend based on io_uring/liburing.
  • macOS/Darwin I/O backend based on kqueue.
  • Windows 10/11 backend with IOCP request completions for sockets and overlapped HANDLEs, Windows wake handles, and x86_64 context-switch assembly.
  • Task primitives: spawn, yield, join, sleep, deadlines, and task metadata.
  • Synchronization primitives: mutex, condition variable, channel, and cancellation token.
  • Channel multiplexing with llam_channel_select() and focused select benchmarks.
  • Blocking integration through llam_call_blocking, llam_enter_blocking, and llam_leave_blocking.
  • Runtime tuning through profiles, dynamic workers, worker rings, SQPOLL, and idle-spin controls.
  • Observability through runtime stats and debug dumps.
  • Stable ABI metadata for dynamic language-runtime loaders.
  • Static and shared library build targets.
  • Built-in demo, chat server, stress, benchmark, Docker verification, and Go/Tokio comparison scripts.

Platform Support

| Platform | Status | I/O backend | Recommended compiler | Verification |
| --- | --- | --- | --- | --- |
| Linux x86_64 | Primary Linux path | io_uring/liburing | GCC or Clang | make verify-linux CC=gcc |
| Linux aarch64 | Supported | io_uring/liburing | GCC or Clang | make verify-linux CC=gcc |
| macOS arm64 | Primary macOS path | kqueue | Apple Clang | CC=clang make verify-darwin |
| macOS x86_64 | Supported | kqueue + x86_64 asm context switch | Apple Clang | CC=clang make verify-darwin |
| Windows 10/11 | Supported native x86_64 backend | IOCP for WSARecv/WSASend/AcceptEx/ConnectEx, overlapped HANDLE ReadFile/WriteFile, plus gated TCP POLLOUT and UDP POLLIN; TCP POLLIN defaults to fallback unless LLAM_WINDOWS_IOCP_TCP_POLLIN=1 is enabled | MinGW and MSVC/MASM via CMake | CMake Windows build plus test_windows_policy, test_windows_runtime_smoke, test_windows_iocp_io, and test_windows_handle_io; scripts/verify_windows.ps1 -Native |

Native Windows runtime support covers scheduler/core, wake handles, x86_64 context switching, IOCP-backed socket requests, and overlapped HANDLE I/O. Windows 10 and Windows 11 use the same public API; LLAM selects conservative Windows 10 tuning or batched Windows 11 tuning at runtime, and CI forces both policy branches on native Windows runners.

Production and stress-operation guidance is documented in docs/operations.md.

Getting Started

Install Linux/WSL dependencies:

sudo apt install build-essential liburing-dev

Install macOS command-line tools:

xcode-select --install

Build on Linux:

make -j4 CC=gcc

Build on macOS:

CC=clang make -j4

Build native Windows with CMake:

cmake -S . -B build-windows -G "Ninja" -DCMAKE_BUILD_TYPE=Release
cmake --build build-windows
ctest --test-dir build-windows --output-on-failure

.\scripts\verify_windows.ps1 still verifies the Linux backend through WSL. .\scripts\verify_windows.ps1 -Native builds the native Windows CMake targets and runs the Windows CTest suite.

Build with CMake:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j4

Install with CMake:

cmake --install build --prefix "$HOME/.local"

Run the included programs:

./demo
./stress
./bench
./server 7777

Stress the chat server with real TCP clients:

make server-stress
python3 scripts/stress_server.py --clients 64 --messages 16 --payload-bytes 64

Run the native maximum-throughput flood driver:

make server-flood
./server_flood --clients 16 --duration 60 --message-bytes 8 --batch 64 --target-mps 0.30

server_flood reports both inbound messages/sec and observed broadcast deliveries/sec. For chat fanout, one inbound message produces clients - 1 peer deliveries, so million-scale delivery rates can appear well before inbound message rates reach the same scale.

Run the full composite server stress suite:

make server-stress-composite
make server-stress-composite-quick
make server-stress-composite-hour
python3 scripts/stress_server_composite.py --quick

The composite suite combines exact fanout checks, 60-second native flood, payload-size variation, connection churn, slow receivers, half-close/reset patterns, and RSS/fd sampling.

--quick is intended for hosted CI runners. It keeps exact-delivery checks but uses a lower absolute flood delivery threshold than the standard and hour-long runs.

The one-hour profile runs the same classes of checks with a long soak layout: 30 minutes of main flood, two 5-minute payload flood phases, and 20 minutes of mixed edge stress.

Run focused API/ABI tests:

make test

Build outputs:

  • demo: runnable examples of the public runtime API.
  • stress: regression coverage for scheduling, sync, timeouts, I/O, and dynamic workers.
  • bench: microbenchmarks for spawn/join, channels, channel select, I/O, poll, sleep fanout, and opaque blocking.
  • server: minimal LLAM-backed TCP chat backend for local testing.
  • server_flood: native nonblocking throughput flood driver for the chat server.
  • scripts/stress_server.py: TCP fanout stress test for the chat server.
  • scripts/stress_server_composite.py: long-running composite server stability suite.
  • test_abi_contract: ABI metadata and size handshakes.
  • test_connect_io: direct and runtime-managed llam_connect() success and invalid-input checks.
  • test_runtime_core: lifecycle, task metadata, yielding, sleeping, blocking callbacks, and stats checks.
  • test_sync_primitives: mutex, condition variable, channel, timeout, and close semantics.
  • test_io_buffers: direct and managed poll/read/write, owned buffers, and MSG_PEEK.
  • test_shared_load: dlopen() coverage for the shared library ABI surface.

Using LLAM In An Application

The top-level Makefile builds the bundled executables directly. For application integration, the simplest path is the CMake target llam_runtime.

add_subdirectory(path/to/LLAM)

add_executable(my_app main.c)
target_link_libraries(my_app PRIVATE llam_runtime)

Use llam_runtime_shared when a language runtime needs to load LLAM dynamically. The Makefile equivalent is make shared.

Release archives include the public headers, docs, bundled examples, runtime libraries, pkg-config metadata, and CMake package files. Tag pushes such as v1.0.1 build and publish .tar.xz archives for Linux x86_64, Linux aarch64, macOS x86_64, and macOS arm64, plus a native Windows x86_64 .zip archive through .github/workflows/release.yml.

The 1.0 release gate is intentionally platform-local: Linux must pass make verify-linux or Docker verification, macOS must pass the Darwin verify path, and Windows must pass native CMake/CTest plus Windows Server 2022/2025 stress smoke. The full operational checklist is in docs/operations.md.

Use an installed SDK with CMake:

find_package(llam CONFIG REQUIRED)

add_executable(my_app main.c)
target_link_libraries(my_app PRIVATE llam::runtime)

Use an installed SDK with pkg-config:

cc main.c $(pkg-config --cflags --libs llam) -o my_app

Install on Linux/macOS:

curl -fsSL https://github.com/Feralthedogg/LLAM/releases/download/1.0.1/install.sh | sh -s -- --version 1.0.1 --prefix "$HOME/.local"

Install a specific Linux/macOS target:

curl -fsSL https://github.com/Feralthedogg/LLAM/releases/download/1.0.1/install.sh | sh -s -- --version 1.0.1 --target macos-aarch64 --prefix "$HOME/.local"

Install on Windows x86_64:

Invoke-WebRequest "https://github.com/Feralthedogg/LLAM/releases/download/1.0.1/install.ps1" -OutFile install.ps1; .\install.ps1 -Version 1.0.1 -Prefix "$env:LOCALAPPDATA\LLAM"

Include the canonical public API:

#include "llam/runtime.h"

Dynamic loaders should check llam_abi_version() or llam_abi_get_info() before binding the rest of the API. FFI bindings should prefer llam_runtime_init_ex() and llam_spawn_ex() so inbound option structs carry an explicit caller-side size. The ABI and semantic contract is documented in docs/abi.md.

Embedding code should use llam_runtime_create(), llam_runtime_run_handle(), and llam_runtime_destroy(), while treating LLAM 1.0 as one active runtime per process. True multi-runtime isolation is a post-1.0 migration item; do not create and destroy LLAM concurrently from multiple host runtime instances. macOS-specific performance gates, remaining structural work, and the Windows backend scope, policy split, and acceptance gates are all tracked in the platform-local release checklist in docs/operations.md.

Execution Model

A typical LLAM program follows this lifecycle:

  1. Initialize the runtime with llam_runtime_init().
  2. Spawn one or more root tasks with llam_spawn().
  3. Run the scheduler with llam_run().
  4. Shut the runtime down with llam_runtime_shutdown().

#include "llam/runtime.h"

#include <stdio.h>

static void worker(void *arg) {
    const char *name = arg;

    printf("hello from %s\n", name);
    llam_yield();
    printf("bye from %s\n", name);
}

int main(void) {
    if (llam_runtime_init(NULL) != 0) {
        return 1;
    }

    if (llam_spawn(worker, "LLAM", NULL) == NULL) {
        llam_runtime_shutdown();
        return 1;
    }

    int rc = llam_run();
    llam_runtime_shutdown();
    return rc;
}

Task, Join, And Sleep

A task is a void (*)(void *) function. Pass shared state through the task argument and use llam_join() when a parent task needs the child to finish. Every task handle returned by llam_spawn*() must be consumed by either a successful join or llam_detach().

#include "llam/runtime.h"

#include <stdint.h>
#include <stdio.h>

typedef struct job {
    int input;
    int output;
} job_t;

static void child(void *arg) {
    job_t *job = arg;

    llam_sleep_ns(1ULL * 1000ULL * 1000ULL);
    job->output = job->input * job->input;
}

static void root(void *arg) {
    (void)arg;

    job_t job = {.input = 12};
    llam_task_t *task = llam_spawn(child, &job, NULL);

    if (task != NULL && llam_join(task) == 0) {
        printf("result=%d\n", job.output);
    }
}

Deadline-based APIs use absolute timestamps from llam_now_ns().

uint64_t deadline = llam_now_ns() + 10ULL * 1000ULL * 1000ULL;
int rc = llam_join_until(task, deadline);

Channels

A channel transfers pointer values between tasks and behaves as a bounded buffer of its capacity. Capacity must be at least 1.

#include "llam/runtime.h"

#include <stdio.h>

typedef struct pipe_state {
    llam_channel_t *channel;
} pipe_state_t;

static void producer(void *arg) {
    pipe_state_t *state = arg;

    (void)llam_channel_send(state->channel, "ping");
    (void)llam_channel_send(state->channel, "pong");
    (void)llam_channel_close(state->channel);
}

static void consumer(void *arg) {
    pipe_state_t *state = arg;
    const char *msg;

    while ((msg = llam_channel_recv(state->channel)) != NULL) {
        printf("recv=%s\n", msg);
    }
}

static void root(void *arg) {
    (void)arg;

    pipe_state_t state = {
        .channel = llam_channel_create(2),
    };

    if (state.channel == NULL) {
        return;
    }

    llam_task_t *a = llam_spawn(producer, &state, NULL);
    llam_task_t *b = llam_spawn(consumer, &state, NULL);

    if (a != NULL) {
        (void)llam_join(a);
    }
    if (b != NULL) {
        (void)llam_join(b);
    }
    llam_channel_destroy(state.channel);
}

I/O

LLAM I/O calls are written like blocking calls from inside a task, while the runtime backend handles readiness and completion. Linux uses io_uring, macOS uses kqueue, and Windows uses IOCP for overlapped Winsock read, write, accept, connect, generic HANDLE ReadFile/WriteFile, gated TCP POLLOUT, and UDP POLLIN requests. Windows TCP POLLIN defaults to the cooperative/direct fallback path unless LLAM_WINDOWS_IOCP_TCP_POLLIN=1 is enabled for controlled smoke or benchmark runs; unsupported poll masks remain fallback. The current I/O primitive set covers read, read_when_ready, write, HANDLE read/write, accept, connect, fd polling, HANDLE polling, and owned-buffer reads on supported native backends. Use LLAM_INVALID_FD or LLAM_FD_IS_INVALID(fd) for descriptor-returning failures such as llam_accept(), and LLAM_INVALID_HANDLE or LLAM_HANDLE_IS_INVALID(handle) for HANDLE-returning integrations.

#include "llam/runtime.h"

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

typedef struct echo_state {
    int reader;
    int writer;
} echo_state_t;

static void reader_task(void *arg) {
    echo_state_t *state = arg;
    char buf[64];

    ssize_t n = llam_read(state->reader, buf, sizeof(buf));
    if (n > 0) {
        printf("read=%.*s\n", (int)n, buf);
    }
}

static void writer_task(void *arg) {
    echo_state_t *state = arg;
    const char *msg = "hello";

    (void)llam_write(state->writer, msg, strlen(msg));
}

static void root(void *arg) {
    (void)arg;

    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return;
    }

    echo_state_t state = {
        .reader = sv[0],
        .writer = sv[1],
    };

    llam_task_t *reader = llam_spawn(reader_task, &state, NULL);
    llam_task_t *writer = llam_spawn(writer_task, &state, NULL);

    if (reader != NULL) {
        (void)llam_join(reader);
    }
    if (writer != NULL) {
        (void)llam_join(writer);
    }

    close(sv[0]);
    close(sv[1]);
}

The owned-buffer API lets the runtime allocate the I/O buffer. Release it with llam_io_buffer_release(). EOF or a zero-byte read returns 0 with buffer == NULL; failures return -1, set errno, and also leave buffer == NULL.

llam_io_buffer_t *buffer = NULL;
ssize_t n = llam_read_owned(fd, 4096, &buffer);

if (n > 0 && buffer != NULL) {
    void *data = llam_io_buffer_data(buffer);
    size_t size = llam_io_buffer_size(buffer);
    (void)data;
    (void)size;
}
llam_io_buffer_release(buffer);

Blocking Work

Long CPU work or blocking syscalls can pin a worker if they run directly inside a task. Use llam_call_blocking_result() to offload such work without ambiguity, or wrap explicit blocking regions with llam_enter_blocking() and llam_leave_blocking().

#include "llam/runtime.h"

#include <unistd.h>

static void *slow_syscall(void *arg) {
    (void)arg;
    sleep(1);
    return NULL;
}

static void task(void *arg) {
    void *result = NULL;

    (void)arg;

    (void)llam_call_blocking_result(slow_syscall, NULL, &result);
}

Manual blocking region:

if (llam_enter_blocking() == 0) {
    /* Run a blocking syscall or external library call here. */
    llam_leave_blocking();
}

Public API Summary

Runtime lifecycle:

| API | Purpose |
| --- | --- |
| llam_runtime_opts_init | Fill runtime options with ABI-safe library defaults. |
| llam_runtime_init_ex | Initialize the runtime with an explicit option struct size for FFI. |
| llam_runtime_init | Initialize the runtime. |
| llam_runtime_request_stop | Request cooperative scheduler stop and wake workers. |
| llam_runtime_shutdown | Shut the runtime down and release resources. |
| llam_runtime_collect_stats_ex | Collect stats with an explicit output struct size for FFI. |
| llam_runtime_collect_stats | Collect scheduler, I/O, blocking, and queue statistics. |
| llam_runtime_write_stats_json | Write a newline-terminated JSON stats snapshot to an fd. |

Task scheduling:

| API | Purpose |
| --- | --- |
| llam_spawn_opts_init | Fill spawn options with ABI-safe library defaults. |
| llam_spawn_ex | Create a task with an explicit option struct size for FFI. |
| llam_spawn | Create a task. |
| llam_run | Run the scheduler. |
| llam_yield | Yield the current task. |
| llam_task_safepoint | Mark progress in CPU-bound loops without forcing an immediate yield. |
| llam_join | Wait for task completion. |
| llam_join_until | Wait for task completion until a deadline. |
| llam_detach | Consume a task handle without waiting for completion. |
| llam_sleep_ns | Sleep for a duration. |
| llam_sleep_until | Sleep until an absolute deadline. |
| llam_task_set_class | Change the current task class; invalid class values fail with EINVAL. |
| llam_current_task | Return the current task handle. |
| llam_task_id | Return a task id. |
| llam_task_state_name | Return a task state string. |
| llam_task_class | Return a task class. |
| llam_task_flags | Return task flags. |

Spawn options:

| Type/value | Meaning |
| --- | --- |
| LLAM_TASK_CLASS_LATENCY | Latency-sensitive task. |
| LLAM_TASK_CLASS_DEFAULT | Default task class. |
| LLAM_TASK_CLASS_BATCH | Batch-oriented task. |
| LLAM_STACK_CLASS_DEFAULT | Default stack size class. |
| LLAM_STACK_CLASS_LARGE | Larger stack size class. |
| LLAM_STACK_CLASS_HUGE | Very large stack size class. |
| LLAM_SPAWN_F_PINNED | Hint that the task should stay pinned. |
| LLAM_SPAWN_F_NO_PREEMPT | Hint that preemption should be restricted. |
| LLAM_SPAWN_F_SYS_TASK | Runtime/system task hint. |
| LLAM_SPAWN_F_LATENCY_CRITICAL | Latency-critical task hint. |

Blocking:

| API | Purpose |
| --- | --- |
| llam_call_blocking_result | Run a blocking function through the unambiguous int + out API. |
| llam_call_blocking | Convenience blocking API; ambiguous when the callback returns NULL. |
| llam_enter_blocking | Mark the current task as entering a blocking region. |
| llam_leave_blocking | Mark the current task as leaving a blocking region. |

Cancellation:

| API | Purpose |
| --- | --- |
| llam_cancel_token_create | Create a cancellation token. |
| llam_cancel_token_destroy | Destroy a cancellation token; live observers make it fail with EBUSY. |
| llam_cancel_token_cancel | Request cancellation. |
| llam_cancel_token_is_cancelled | Check cancellation state. |

Mutex and condition variables:

| API | Purpose |
| --- | --- |
| llam_mutex_create / llam_mutex_destroy | Create or destroy a mutex; destroy returns EBUSY while owned or waited on. |
| llam_mutex_lock / llam_mutex_unlock | Lock or unlock a non-recursive mutex; self-lock returns EDEADLK, non-owner unlock returns EPERM. |
| llam_mutex_lock_until | Wait for a mutex until a deadline; self-lock returns EDEADLK. |
| llam_mutex_trylock | Try to lock immediately; returns EBUSY when already locked. |
| llam_cond_create / llam_cond_destroy | Create or destroy a condition variable; destroy returns EBUSY while waited on. |
| llam_cond_wait | Wait on a condition variable; caller must own the mutex and wait in a predicate loop. |
| llam_cond_wait_until | Wait on a condition variable until a deadline; reacquires the mutex before returning. |
| llam_cond_signal | Wake one waiter; may be called with or without the mutex and outside a managed task. |
| llam_cond_broadcast | Wake all waiters; may be called with or without the mutex and outside a managed task. |

Channels:

| API | Purpose |
| --- | --- |
| llam_channel_create / llam_channel_destroy | Create or destroy a channel; destroy returns EBUSY while buffered values or waiters remain. |
| llam_channel_send | Send a value. |
| llam_channel_send_until | Send a value until a deadline. |
| llam_channel_recv_result | Receive a value through an unambiguous int + out API. |
| llam_channel_recv_until_result | Receive a value until a deadline through an unambiguous int + out API. |
| llam_channel_recv | Convenience receive API; use the result form if NULL is a valid payload. |
| llam_channel_recv_until | Convenience timed receive API; use the result form if NULL is a valid payload. |
| llam_channel_close | Idempotently close a channel; buffered values remain drainable and sends fail with EPIPE. |

I/O:

| API | Purpose |
| --- | --- |
| llam_read | Read from an fd. |
| llam_write | Write to an fd. |
| llam_read_handle | Read from a platform handle; Windows uses overlapped ReadFile through IOCP when possible, POSIX aliases to fd read. |
| llam_write_handle | Write to a platform handle; Windows uses overlapped WriteFile through IOCP when possible, POSIX aliases to fd write. |
| llam_read_owned | Read into a runtime-owned buffer. |
| llam_recv_owned | Receive with flags into a runtime-owned buffer. |
| llam_io_buffer_release | Release an owned buffer. |
| llam_io_buffer_data | Return the owned buffer data pointer. |
| llam_io_buffer_size | Return the number of bytes read. |
| llam_io_buffer_capacity | Return owned buffer capacity. |
| llam_accept | Accept a connection from a listener fd; returns LLAM_INVALID_FD on failure. |
| llam_connect | Connect a socket without blocking the scheduler worker. |
| llam_poll_fd | Wait for fd readiness. |
| llam_poll_handle | Wait for platform handle state; Windows uses WaitForSingleObject semantics and POSIX aliases to fd poll. |

Time, debug, and platform:

| API | Purpose |
| --- | --- |
| llam_now_ns | Return a monotonic nanosecond timestamp. |
| llam_dump_runtime_state | Dump runtime state to an fd. |
| llam_fd_t | Platform-specific fd/socket handle type. |
| llam_handle_t | Platform-specific generic handle type for HANDLE I/O APIs. |
| LLAM_INVALID_FD / LLAM_FD_IS_INVALID | Platform-correct invalid descriptor sentinel and predicate. |
| LLAM_INVALID_HANDLE / LLAM_HANDLE_IS_INVALID | Platform-correct invalid generic-handle sentinel and predicate. |
| LLAM_PLATFORM_LINUX | Linux build flag. |
| LLAM_PLATFORM_DARWIN | macOS/Darwin build flag. |
| LLAM_PLATFORM_WINDOWS | Windows build flag. |
| LLAM_PLATFORM_NAME | Platform name string. |

Runtime Options

Pass NULL to llam_runtime_init() for the default runtime configuration. Pass llam_runtime_opts_t when you need explicit tuning. Dynamic loaders and language bindings should initialize option structs with llam_runtime_opts_init(&opts, LLAM_RUNTIME_OPTS_CURRENT_SIZE) and llam_spawn_opts_init(&opts, LLAM_SPAWN_OPTS_CURRENT_SIZE), then call llam_runtime_init_ex(&opts, LLAM_RUNTIME_OPTS_CURRENT_SIZE), llam_spawn_ex(fn, arg, &opts, LLAM_SPAWN_OPTS_CURRENT_SIZE), and llam_runtime_collect_stats_ex(&stats, LLAM_RUNTIME_STATS_CURRENT_SIZE).

Public option and stats structs use fixed-width integer storage for ABI-facing scalar fields. Enum constants remain available for C readability, but FFI bindings should model task classes, stack classes, profiles, flags, and 32-bit state counters as uint32_t; sqpoll_cpu is int32_t.

llam_runtime_opts_t opts = {
    .deterministic = 0,
    .forced_yield_every = 0,
    .experimental_flags =
        LLAM_RUNTIME_EXPERIMENTAL_F_DYNAMIC_WORKERS |
        LLAM_RUNTIME_EXPERIMENTAL_F_LOCKFREE_NORMQ,
    .profile = LLAM_RUNTIME_PROFILE_BALANCED,
};

if (llam_runtime_init(&opts) != 0) {
    return 1;
}

Important fields:

| Field | Meaning |
| --- | --- |
| deterministic | Deterministic scheduling mode. |
| forced_yield_every | Force a yield at a fixed interval. |
| experimental_flags | Bitwise OR of LLAM_RUNTIME_EXPERIMENTAL_F_* flags. |
| idle_spin_ns | Spin before idle poll fallback. |
| idle_spin_max_iters | Maximum idle-spin iterations. |
| sqpoll_cpu | CPU reserved for SQPOLL. |
| profile | Runtime policy profile: balanced, release-fast, debug-safe, or io-latency. |

Experimental flags:

| Flag | Meaning |
| --- | --- |
| LLAM_RUNTIME_EXPERIMENTAL_F_WORKER_RINGS | Experimental per-worker I/O ring mode. |
| LLAM_RUNTIME_EXPERIMENTAL_F_WORKER_RINGS_MULTISHOT | Allow multishot watches with worker rings. |
| LLAM_RUNTIME_EXPERIMENTAL_F_DYNAMIC_WORKERS | Soft-park and reactivate idle workers. |
| LLAM_RUNTIME_EXPERIMENTAL_F_LOCKFREE_NORMQ | Use the lock-free normal queue. |
| LLAM_RUNTIME_EXPERIMENTAL_F_HUGE_ALLOC | Prefer hugepage-friendly allocator backing. |
| LLAM_RUNTIME_EXPERIMENTAL_F_SQPOLL | Experimental Linux io_uring SQPOLL mode. |

Selected environment variables:

| Variable | Example values | Meaning |
| --- | --- | --- |
| LLAM_RUNTIME_PROFILE | balanced, release-fast, debug-safe, io-latency | Override the runtime profile. |
| LLAM_EXPERIMENTAL_DYNAMIC_WORKERS | 0, 1 | Toggle dynamic workers. |
| LLAM_EXPERIMENTAL_LOCKFREE_NORMQ | 0, 1 | Toggle the lock-free normal queue. |
| LLAM_EXPERIMENTAL_WORKER_RINGS | 0, 1 | Toggle worker ring mode. |
| LLAM_EXPERIMENTAL_WORKER_RINGS_MULTISHOT | 0, 1 | Toggle worker-ring multishot watches. |
| LLAM_EXPERIMENTAL_HUGE_ALLOC | 0, 1 | Toggle huge allocator mode. |
| LLAM_EXPERIMENTAL_SQPOLL | 0, 1 | Toggle Linux SQPOLL. |
| LLAM_SQPOLL_CPU | CPU number | Select the SQPOLL CPU. |
| LLAM_IDLE_SPIN_NS | nanoseconds | Idle spin time. |
| LLAM_IDLE_SPIN_ITERS | iteration count | Idle spin iteration limit. |
| LLAM_BIND_WORKERS | 0, 1 | Bind worker threads to platform CPUs when supported. |
| LLAM_DARWIN_MACH_SCHED | 0, 1 | Toggle Darwin Mach/QoS scheduler hints; default is enabled on macOS. |
| LLAM_WINDOWS_UNSAFE_SKIP_TASK_SIMD | 0, 1 | Experimental Windows x64 ceiling mode: skip task-context XMM6-XMM15 save/restore. Only valid when managed tasks do not rely on callee-saved SIMD state across LLAM yields/waits. |
| LLAM_AARCH64_UNSAFE_SKIP_SCHEDULER_SIMD | 0, 1 | Experimental macOS/Linux ARM64 ceiling mode: skip scheduler-context SIMD save/restore while task contexts still preserve ABI-required d8-d15. |
| LLAM_ARM64_UNSAFE_SKIP_SCHEDULER_SIMD | 0, 1 | Alias for LLAM_AARCH64_UNSAFE_SKIP_SCHEDULER_SIMD. |
| LLAM_DIRECT_BLOCKING_IO | 0, 1 | Allow eligible blocking socket read/write operations to run through compensated direct blocking regions. |
| LLAM_DIRECT_BLOCKING_POLL | 0, 1, unset | Control direct blocking poll fallback; Linux/Windows auto mode handles finite waits directly when profitable. |
| LLAM_ACCEPT_DIRECT_BLOCKING | 0, 1 | Route managed accept calls that cannot use multishot accept-watch through a compensated helper poll loop; default is enabled on macOS and disabled elsewhere. |
| LLAM_IO_POLL_REDIRECT_TIMEOUT_MS | milliseconds | Redirect long direct-poll waits through opaque blocking compensation on Linux. |
| LLAM_IO_COOP_YIELD | 0, 1 | Enable cooperative yields around direct I/O fast paths; default is enabled on macOS, Linux, and Windows. |
| LLAM_IO_POLL_COOP_YIELD | 0, 1 | Enable cooperative yields in poll readiness paths; default is enabled on macOS, Linux, and Windows. |
| LLAM_IO_POLL_PRE_YIELD | 0, 1 | Let poll hand off to same-shard runnable producers before the first readiness probe; default is enabled on macOS and Windows. |
| LLAM_IO_POLL_EXTRA_YIELD | 0, 1 | Add an extra poll-readiness yield; default is enabled on macOS and Windows. |
| LLAM_IO_POLL_READY_YIELDS | 0-8 | Bound short same-shard ready-yield probes before poll parks in the backend. |
| LLAM_READ_READY_INITIAL_HANDOFF | 0, 1 | Let llam_read_when_ready() hand off once to local producers before its first read probe; default is disabled. |
| LLAM_READ_READY_DIRECT_BLOCKING | 0, 1 | Let infinite llam_read_when_ready() use compensated direct blocking reads; default is disabled. |
| LLAM_POLL_SOCKET_PEEK | 0, 1 | Use MSG_PEEK for socket POLLIN fast checks; default is enabled on macOS and opt-in elsewhere. |
| LLAM_IO_WRITE_HANDOFF | 0, 1 | Yield after small socket writes so local readers can run; default is enabled on macOS and Linux. |
| LLAM_IO_WRITE_DIRECT_LOCAL_HANDOFF | 0, 1 | Prefer direct same-shard task handoff after eligible socket writes; default is enabled on macOS, Linux, and Windows. |
| LLAM_YIELD_DIRECT_HANDOFF | 0, 1, unset | Allow ordinary yields to switch directly to same-shard runnable work when no timers or inject work are pending. |
| LLAM_OPAQUE_REDIRECT_FASTPATH | 0, 1 | Prefer redirect over helper handoff for opaque blocking; default is enabled on Linux. |
| LLAM_TIMER_HEAP_PREWARM | timer slots | Preallocate shard timer heap slots to avoid growth during sleep/deadline fanout. |
| LLAM_STACK_CACHE_PREWARM | stack count | Prewarm the default stack cache before high fanout workloads. |
| LLAM_TASK_CACHE_PREWARM | task count | Prewarm task metadata slabs before high fanout workloads. |
| LLAM_STACK_SAMPLING | 0, 1 | Enable stack high-water sampling diagnostics. |
| LLAM_TRACE_EVENTS | 0, 1 | Enable per-worker trace ring diagnostics. |
| LLAM_WAKE_LATENCY_METRICS | 0, 1 | Enable wake-latency diagnostics. |
| LLAM_STRESS_DYNAMIC_LIVE_POLL_WAITERS | waiter count | Stress live poll/accept/inflight waiters; automatically clamped by fd budget. |

Benchmarks

Run all LLAM benchmark cases:

./bench

Run one benchmark case:

LLAM_BENCH_ONLY=spawn_join ./bench
LLAM_BENCH_ONLY=channel_pingpong ./bench
LLAM_BENCH_ONLY=io_echo ./bench
LLAM_BENCH_ONLY=poll_wake ./bench
LLAM_BENCH_ONLY=sleep_fanout ./bench
LLAM_BENCH_ONLY=opaque_block ./bench

Scale benchmark size:

LLAM_BENCH_ROUNDS=31 LLAM_BENCH_WARMUP_ROUNDS=5 ./bench
LLAM_BENCH_SPAWN_TASKS=512 ./bench
LLAM_BENCH_CHANNEL_MESSAGES=4096 ./bench
LLAM_BENCH_IO_MESSAGES=512 ./bench
LLAM_BENCH_POLL_EVENTS=512 ./bench
LLAM_BENCH_SLEEP_TASKS=1024 ./bench
LLAM_BENCH_OPAQUE_SCOPES=64 ./bench

Compare against Go:

go run scripts/bench_go_compare.go

Compare LLAM, Go, and Tokio:

python3 scripts/bench_runtime_compare.py --runtime all

Graph generation requires Python matplotlib. Without it, the script still writes CSV and prints tables. The scheduled Runtime Benchmarks workflow runs the same comparison on Linux x86_64, macOS arm64, macOS x86_64, Windows Server 2022, and Windows Server 2025, then uploads CSV/PNG artifacts for regression tracking.

Run the benchmark matrix:

make bench-matrix

Verification And Cleanup

Run focused tests:

make test

Build a local release archive:

make clean all test
./scripts/package_release.sh

Or use the Makefile package target:

make package

Verify Linux:

make verify-linux CC=gcc

Verify Linux with experimental paths:

LLAM_VERIFY_LINUX_EXPERIMENTAL=1 make verify-linux CC=gcc

Verify macOS:

CC=clang make verify-darwin

Verify macOS with experimental paths:

LLAM_VERIFY_DARWIN_EXPERIMENTAL=1 CC=clang make verify-darwin

Verify Linux in Docker:

./scripts/docker_verify_linux.sh

Check Windows status:

.\scripts\verify_windows.ps1
.\scripts\verify_windows.ps1 -Native

The default command verifies through WSL when available. The -Native command builds native Windows targets and runs the Windows CTest suite.

Remove generated files:

make clean

make clean removes generated files such as object/, build/, CMake cache files, example and benchmark binaries, and perf.data*.

Architecture

Overview

LLAM is a user-level N:M thread scheduler. A small number of OS worker threads (typically one per CPU core) run many lightweight tasks. Each task has its own stack and can be suspended and resumed without kernel intervention.

flowchart TB
    subgraph UserSpace["User Space"]
        direction TB
        Tasks["Tasks (N lightweight fibers)"] --> Shards
        subgraph Shards["Scheduler Shards (per worker)"]
            direction LR
            S0["Shard 0\nhot_q / norm_q / inject_q\ntimers / allocator"]
            S1["Shard 1\nhot_q / norm_q / inject_q\ntimers / allocator"]
            S0 <-->|work steal| S1
        end
        Shards --> Nodes
        subgraph Nodes["I/O Nodes"]
            direction LR
            N0["Node 0\nio_uring / kqueue\nwatch tables"]
            N1["Node 1\nio_uring / kqueue\nwatch tables"]
        end
        Watchdog["Watchdog\nprobe / scale / merge / rehome"] -.-> Shards
        BlockPool["Blocking Thread Pool"] -.-> Shards
        OpaqueHelpers["Opaque Helpers\n(per-shard compensation)"] -.-> Shards
    end
    Shards -->|runs on| Workers["OS Threads (M pthreads)"]

flowchart LR
    App["Application"] --> API["include/llam\npublic API"]
    API --> Core["src/core\nscheduler / tasks / sync"]
    API --> IO["src/io\nI/O API + backends"]
    Core --> Engine["src/engine\nworkers / watchdog"]
    Engine --> Blocking["blocking\ncompensation"]
    IO --> Linux["src/io/linux\nio_uring"]
    IO --> Darwin["src/io/darwin\nkqueue"]
    IO --> Windows["Windows\nIOCP sockets + HANDLEs"]
    Core --> ASM["src/asm\ncontext switch"]

N:M Threading Model

Tasks are the fundamental unit of execution. Each task is a void (*)(void *) function with its own fiber stack allocated via mmap with a guard page. Tasks are scheduled cooperatively onto OS worker threads; the runtime never preempts a task without its participation (safepoints, yields, or I/O waits).

Shards are per-worker scheduler partitions. Each shard owns:

  • Three run queues: hot_q (latency-critical, capacity 1024), norm_q (normal, capacity 4096), and inject_q (cross-shard, capacity 1024).
  • A timer heap: min-heap ordered by deadline, used for llam_sleep_until and timed waits.
  • A per-shard allocator: slab-based pools for tasks, wait nodes, timer nodes, I/O requests, and I/O buffers, each with lock-free remote-free queues for cross-shard deallocation.
  • A stack cache: per-class (default/large/huge) stack mapping reuse pool.
  • Scheduler context: the fiber context (llam_ctx_t) the scheduler loop itself runs on.
  • An opaque helper thread: a pre-spawned compensation thread that takes over scheduling when the primary worker enters a blocking region.

The norm_q has two implementations selected at init time: a mutex-guarded FIFO queue, or a Chase-Lev lock-free deque (llam_cldeque_t) for work-stealing. The lock-free deque is a bounded circular buffer of 4096 task pointers with separated top (steal end) and bottom (push/pop end) atomics, each on its own cache line.
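The steal path of such a deque can be sketched with C11 atomics. This is a simplified bounded version with illustrative names, not LLAM's actual llam_cldeque_t layout; the owner-side pop and its subtle last-element race are omitted.

```c
#include <stdatomic.h>
#include <stddef.h>

#define DEQ_CAP 4096  /* power of two, matching the 4096-slot norm_q */

/* Owner calls push on the bottom end; thieves call steal on the top end.
 * top and bottom sit on separate cache lines to avoid false sharing. */
typedef struct {
    _Alignas(64) atomic_size_t top;     /* steal end */
    _Alignas(64) atomic_size_t bottom;  /* push/pop end */
    void *buf[DEQ_CAP];
} deque_t;

static int deque_push(deque_t *q, void *task) {
    size_t b = atomic_load_explicit(&q->bottom, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->top, memory_order_acquire);
    if (b - t >= DEQ_CAP) return 0;            /* full */
    q->buf[b & (DEQ_CAP - 1)] = task;
    atomic_store_explicit(&q->bottom, b + 1, memory_order_release);
    return 1;
}

static void *deque_steal(deque_t *q) {
    size_t t = atomic_load_explicit(&q->top, memory_order_acquire);
    size_t b = atomic_load_explicit(&q->bottom, memory_order_acquire);
    if (t >= b) return NULL;                   /* empty */
    void *task = q->buf[t & (DEQ_CAP - 1)];
    if (!atomic_compare_exchange_strong(&q->top, &t, t + 1))
        return NULL;                           /* lost the race to another thief */
    return task;
}
```

The CAS on top is what lets multiple sibling shards steal concurrently without a lock; a losing thief simply retries elsewhere.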

Nodes are platform I/O event backends. Each node owns either an io_uring ring (Linux) or a kqueue fd (Darwin), plus watch tables, submit queues, and control queues. Shards submit I/O requests to nodes; nodes complete requests back to the owning shard's task.

Scheduler Loop

The core scheduler loop (llam_scheduler_loop) runs on each worker thread:

loop:
    1. Check runtime drain (live_tasks == 0 → stop)
    2. Handle merge-pause requests from the watchdog
    3. Handle dynamic-worker offline state
    4. Drain inject queue (up to 32 tasks per pass)
    5. Fire expired timers
    6. Dequeue from hot_q, then norm_q
    7. Attempt work-steal from a random sibling shard
    8. If no task found → idle wait (eventfd/kqueue/futex)
    9. Context switch to the selected task
   10. On return: record metrics, check safepoints, repeat

Task selection priority: hot queue → normal queue → inject queue → steal. The hot queue is reserved for latency-class tasks and I/O completions. The inject queue receives cross-shard work and is drained with a budget cap to prevent starvation.
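The budget-capped inject drain can be sketched as follows; the queue type and helpers here are toy single-threaded stand-ins, not LLAM's internal queues.

```c
#include <stddef.h>

#define INJECT_BUDGET 32   /* matches the 32-task drain cap per pass */
#define QCAP 64

/* Toy ring queue with monotonically increasing head/tail indices. */
typedef struct { void *buf[QCAP]; size_t head, tail; } queue_t;

static int queue_push(queue_t *q, void *v) {
    if (q->tail - q->head == QCAP) return 0;   /* full */
    q->buf[q->tail++ % QCAP] = v;
    return 1;
}

static void *queue_pop(queue_t *q) {
    return q->head == q->tail ? NULL : q->buf[q->head++ % QCAP];
}

/* Move at most INJECT_BUDGET cross-shard tasks per scheduler pass, so a
 * flooded inject queue cannot starve the shard's own run queues. */
static int drain_inject(queue_t *inject, queue_t *norm) {
    int moved = 0;
    void *t;
    while (moved < INJECT_BUDGET && (t = queue_pop(inject)))
        if (queue_push(norm, t)) moved++;
    return moved;
}
```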

Context Switching

Context switches are performed in hand-written assembly for each supported platform. The runtime saves and restores only the callee-saved registers required by the platform ABI:

| Platform | Saved registers | Mechanism |
| --- | --- | --- |
| Linux x86_64 | rbx, rbp, r12-r15, rsp | Direct mov/ret in context_x86_64.S |
| Linux aarch64 | x19-x29, x30, sp | stp/ldp in context_arm64.S |
| Darwin x86_64 | rbx, rbp, r12-r15, rsp | Same register set as Linux x86_64 |
| Darwin arm64 | x19-x29, x30, sp | Same register set as Linux aarch64 |
| Fallback | Full register set | ucontext via swapcontext() |

On x86_64, the fast path is approximately:

mov [rdi], rsp        ; save current SP
mov rsp, [rsi]        ; restore target SP
ret                   ; jump to target's saved return address

No syscall, no privilege transition, no TLB flush. Typical cost is tens of nanoseconds versus microseconds for a kernel thread switch.

FP state isolation: the runtime detects XSAVE support at init and tracks the XSAVE mask and area size. Tasks that modify x87/SSE rounding modes (MXCSR, x87 CW) are isolated from one another across context switches.

I/O Backend

LLAM provides scheduler-safe I/O through a multi-tier completion strategy. Each I/O call (llam_read, llam_write, llam_read_handle, llam_write_handle, llam_accept, llam_connect, llam_poll_fd, llam_poll_handle) follows this path:

1. Direct nonblocking fast path (try without parking)
2. Cooperative yield + retry if local work is pending
3. Direct blocking heuristic for short-lived ops
4. Async backend submission (io_uring SQE or kqueue kevent)
5. Blocking-worker fallback if the backend cannot handle the fd
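Tier 1, the direct nonblocking fast path, amounts to a read that distinguishes "would block" from a real error. This is a minimal sketch with an illustrative helper name, not LLAM's actual implementation of llam_read:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try the read without parking the task. Returns the byte count on
 * success, -2 when the fd would block (caller escalates to the async
 * backend), and -1 on a real error. */
static ssize_t try_fast_read(int fd, void *buf, size_t len) {
    ssize_t n = read(fd, buf, len);
    if (n >= 0) return n;                      /* completed on the fast path */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return -2;                             /* would block: submit async */
    return -1;                                 /* real error */
}
```

When the fast path returns "would block", the real runtime consults local queue pressure before deciding between a yield-and-retry and a backend submission.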

Linux backend (src/io/linux/): each llam_node_t owns an io_uring instance with a ring depth of 256. Features include:

  • One-shot and multishot poll, accept, and recv operations.
  • Provided-buffer rings (128 entries per node) for zero-copy recv.
  • SQPOLL mode (experimental) where the kernel polls the SQ without syscalls.
  • Completion-driven task wake: CQE processing identifies the owning shard and re-enqueues the task.

Darwin backend (src/io/darwin/): each node uses kqueue with EVFILT_READ, EVFILT_WRITE, and EVFILT_USER for wake signaling. Darwin nodes use Mach ports (semaphore_t, mach_port_t) for cross-thread wake when available.

Windows backend (src/io/windows/): each node uses IOCP for overlapped Winsock operations and generic HANDLE ReadFile/WriteFile requests. The socket-readiness policy stays conservative for stream-socket POLLIN unless explicitly enabled; waitable-HANDLE polling uses the platform wait path instead, since it observes signaled handle state rather than socket readiness.

Linux and Darwin backends support multishot watches — a single registered watch serves multiple waiters, avoiding redundant kernel registrations for the same fd. Watch tables track fd identity by (dev_t, ino_t) to detect fd reuse.

I/O request lifecycle: requests (llam_io_req_t) carry an atomic wait_mode that transitions through NONE → SUBMIT_QUEUE → INFLIGHT → (completion) or NONE → POLL_WATCH/ACCEPT_WATCH/RECV_WATCH → (completion). An atomic abort_reason field handles cancellation and timeout races without locks.
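The lock-free transitions can be sketched with C11 compare-and-swap. The enum and struct names mirror the states in the text but are illustrative, not LLAM's actual definitions, and only two of the transitions are shown.

```c
#include <stdatomic.h>

enum wait_mode { WM_NONE, WM_SUBMIT_QUEUE, WM_INFLIGHT, WM_DONE };

typedef struct {
    _Atomic int wait_mode;
    _Atomic int abort_reason;   /* 0 = none; e.g. ECANCELED on cancellation */
} io_req_t;

/* Submitter: move NONE -> SUBMIT_QUEUE. Fails if a canceller already
 * advanced the state, so the request is never submitted after abort. */
static int req_submit(io_req_t *r) {
    int expect = WM_NONE;
    return atomic_compare_exchange_strong(&r->wait_mode, &expect,
                                          WM_SUBMIT_QUEUE);
}

/* Canceller: record the reason, then try to finish the request before it
 * is ever submitted. The CAS decides the race without any lock. */
static int req_cancel(io_req_t *r, int reason) {
    atomic_store(&r->abort_reason, reason);
    int expect = WM_NONE;
    return atomic_compare_exchange_strong(&r->wait_mode, &expect, WM_DONE);
}
```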

Memory Management

Per-shard allocator (llam_allocator_t): each shard maintains free lists for 5 object types:

| Object | Slab size | Purpose |
| --- | --- | --- |
| llam_task_t | 16 per slab | Task metadata + embedded wait node, I/O req, timer |
| llam_wait_node_t | 64 per slab | Mutex/cond/channel waiter nodes |
| llam_timer_node_t | 64 per slab | Timer heap entries |
| llam_io_req_t | 64 per slab | I/O operation descriptors |
| llam_io_buffer_t | 16 per slab | Owned I/O buffers (4KB inline data each) |

Remote-free queues: when a task completes on a different shard than it was allocated on, the deallocation goes through a lock-free atomic MPSC list (task_remote_free, wait_remote_free, etc.) and is drained into the local free list during llam_allocator_quiescent() at safe points. Each remote-free queue sits on its own cache line (_Alignas(64)) to avoid false sharing.
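A minimal version of such a remote-free queue — lock-free multi-producer push, owner-only drain — looks like this (illustrative names, not LLAM's task_remote_free layout):

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct free_node { struct free_node *next; } free_node_t;

typedef struct {
    /* Own cache line so pushes from one shard don't invalidate the
     * lines of neighboring shards' queues (false sharing). */
    _Alignas(64) _Atomic(free_node_t *) head;
} remote_free_t;

/* Any shard: push a freed object onto the list (lock-free). */
static void remote_free_push(remote_free_t *q, free_node_t *n) {
    free_node_t *old = atomic_load_explicit(&q->head, memory_order_relaxed);
    do { n->next = old; }
    while (!atomic_compare_exchange_weak_explicit(&q->head, &old, n,
            memory_order_release, memory_order_relaxed));
}

/* Owning shard only: atomically take the whole list at a quiescent
 * point, then splice it into the local free list. */
static free_node_t *remote_free_drain(remote_free_t *q) {
    return atomic_exchange_explicit(&q->head, NULL, memory_order_acquire);
}
```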

Stack cache: two-tier cache with per-shard local pools and a runtime-global fallback:

| Stack class | Size | Shard cache limit | Global cache limit |
| --- | --- | --- | --- |
| Default | 64 KB | 256 (512 in release-fast) | 4096 |
| Large | 256 KB | 64 | 512 |
| Huge | 1 MB | 16 | 128 |

Stack mappings use mmap(MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK) with a guard page (mprotect(PROT_NONE)) at the bottom. Returned stacks are cached for reuse; the cache is pre-warmed at init via llam_runtime_prewarm_stack_cache.

Blocking Compensation

When a task calls llam_enter_blocking() or llam_call_blocking(), the runtime must keep the shard's scheduler running while the worker thread is pinned in foreign code.

Opaque helper thread: each shard has a pre-spawned helper thread (opaque_helper_thread). When the primary worker enters a blocking region:

Primary worker:
  1. Increment opaque_compensation_depth
  2. Wake the helper thread (futex on Linux, Mach semaphore or condvar on Darwin)
  3. Execute the blocking work

Helper thread:
  1. Wake up and take over the shard's scheduler loop (using opaque_scheduler_ctx)
  2. Run tasks from the shard's queues while the primary is blocked
  3. When the primary calls llam_leave_blocking(), relinquish control

Blocking thread pool: a separate pool of block_worker_count threads handles llam_call_blocking jobs via a global FIFO job queue (block_head/block_tail). Jobs transition through QUEUED → RUNNING → FINISHED/ABORTED.

The connect fallback (llam_blocking_connect_impl) drives a nonblocking connect() + poll(POLLOUT, 10ms) loop with SO_ERROR verification, running in the blocking pool rather than pinning a scheduler worker.
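That pattern — nonblocking connect, short poll slices, SO_ERROR confirmation — can be sketched as follows. The helper name and the simplified timeout accounting are illustrative, not llam_blocking_connect_impl itself.

```c
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>

static int connect_with_poll(int fd, const struct sockaddr *sa,
                             socklen_t len, int total_timeout_ms) {
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    if (connect(fd, sa, len) == 0) return 0;   /* connected immediately */
    if (errno != EINPROGRESS) return -1;
    for (int waited = 0; waited < total_timeout_ms; waited += 10) {
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        if (poll(&pfd, 1, 10) > 0) {           /* 10 ms slices, as above */
            /* Writable means the connect finished; SO_ERROR says how. */
            int err = 0; socklen_t elen = sizeof err;
            getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &elen);
            if (err) { errno = err; return -1; }
            return 0;
        }
    }
    errno = ETIMEDOUT;
    return -1;
}
```

Because each slice is only 10 ms, a blocking-pool thread running this loop stays responsive to job-queue shutdown without pinning a scheduler worker.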

Watchdog System

The watchdog thread (src/engine/) runs at 1ms intervals (LLAM_WATCHDOG_INTERVAL_NS) and performs:

| Module | File | Function |
| --- | --- | --- |
| Probe | runtime_watchdog_probe.c | Detect stalled safepoints, measure queue pressure, suspect deadlocks after 4 consecutive observations |
| Scale | runtime_watchdog_scale.c | Dynamic worker scaling: scale up after 2 consecutive pressure observations, scale down after 12 consecutive idle observations, with a 4-tick cooldown |
| Merge | runtime_watchdog_merge.c | Offline a shard by draining its queues and migrating tasks to a target shard |
| Rehome | runtime_watchdog_rehome.c | Atomically transfer ownership of parked waiters, in-flight I/O, submit-queue entries, and multishot watch state from an offline shard to a target shard |

Rehome validates the entire waiter list before any migration. If a single entry cannot be rehomed (pinned task, incompatible I/O state), the entire list migration is aborted to prevent partial ownership inconsistency.

Synchronization Primitives

All sync primitives are runtime-aware: when called from a managed task, blocking waits park the task (freeing the worker thread) instead of blocking the OS thread.

  • Mutex (llam_mutex_t): atomic owner fast path + llam_wait_queue_t for contention. Non-recursive. EDEADLK on self-lock, EPERM on non-owner unlock.
  • Condition variable (llam_cond_t): FIFO waiter queue. Signal/broadcast can be called from outside managed tasks.
  • Channel (llam_channel_t): bounded pointer-valued ring buffer with separate send and receive wait queues. Supports close semantics (sends fail with EPIPE, buffered values remain drainable).
  • Cancel token (llam_cancel_token_t): explicit cancellation handle with a waiter list. Registered tasks and I/O operations observe cancellation through ECANCELED.
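The channel close semantics — sends fail with EPIPE while buffered values remain drainable — can be illustrated with a toy pthread-based channel. This is not llam_channel_t, which parks tasks on its wait queues rather than returning EAGAIN.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

#define CH_CAP 8
typedef struct {
    pthread_mutex_t mu;
    void *buf[CH_CAP];
    size_t head, tail;   /* monotonically increasing ring indices */
    int closed;
} chan_t;

static int chan_send(chan_t *c, void *v) {
    pthread_mutex_lock(&c->mu);
    int rc = 0;
    if (c->closed) rc = EPIPE;                        /* closed: send fails */
    else if (c->tail - c->head == CH_CAP) rc = EAGAIN; /* full (no parking here) */
    else c->buf[c->tail++ % CH_CAP] = v;
    pthread_mutex_unlock(&c->mu);
    return rc;
}

static int chan_recv(chan_t *c, void **out) {
    pthread_mutex_lock(&c->mu);
    int rc = 0;
    if (c->head != c->tail) *out = c->buf[c->head++ % CH_CAP]; /* drain buffer */
    else rc = c->closed ? EPIPE : EAGAIN;             /* empty: only fail if closed */
    pthread_mutex_unlock(&c->mu);
    return rc;
}

static void chan_close(chan_t *c) {
    pthread_mutex_lock(&c->mu);
    c->closed = 1;
    pthread_mutex_unlock(&c->mu);
}
```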

License

LLAM is licensed under the Apache License 2.0.

About

Low Level Asynchronous Machine
