Phase 1: refactor engine to one shared context with batched decode by FuJacob · Pull Request #4 · FuJacob/cotabbyinference

FuJacob · 2026-05-28T10:37:54Z

Stacked on #3.

Summary

Replaces the per-sequence llama_context architecture with a single shared context (n_seq_max = MAX_SEQUENCES) and a dedicated decoder thread that coalesces sample-step requests from multiple sequences into one llama_decode call. Public C++ API (CotabbyInferenceEngine.h) is unchanged; Cotabby's Swift code does not need to be modified.

Why

Phase 0 (#3) measured 1.43x aggregate throughput at N=2 and 2.35x at N=4 on Gemma 3 1B / M-series. The win comes from fusing matmul weight reads across sequences in a single llama_decode — per-token decode is memory-bound on Apple Silicon, so batching reuses the same weight read for multiple sequences.

Design

One shared llama_context, n_ctx = configured * MAX_SEQUENCES. Each SequenceState owns a llama_seq_id slot (0..MAX_SEQUENCES-1).
Decoder thread loop: wait for a request, wait an additional BATCH_WINDOW_MICROS (200 µs) for siblings, build one llama_batch carrying all pending tokens with their seq_ids, llama_decode once, sample each sequence's next token using its own sampler chain at its assigned batch index, resolve every request's std::promise.
sampleNext fast path: deliver the seed token sampled at decodePrompt time. No decoder round-trip on the first sample after a prompt.
sampleNext steady-state path: queue a PendingRequest and wait on the promise.
decodePrompt holds decode_mutex for its chunk decodes and takes the seed sample inline while the prompt's logits are still resident.
trimKV holds decode_mutex, calls llama_memory_seq_rm, and invalidates pending seed/input so the caller re-primes via decodePrompt.
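The decoder-thread loop described above can be sketched without llama.cpp at all. This is a minimal illustration, not the engine's code: `BatchingDecoder`, `submit`, and the `DecodeFn` callback are hypothetical names, and the callback stands in for "build one llama_batch, llama_decode once, sample each sequence". Only the coalescing pattern (wait for a request, wait one window, drain everything pending, resolve each promise) is taken from the design.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the engine's request type.
struct PendingRequest {
    int seqId;                 // slot in the shared context
    int inputToken;            // previously sampled token to feed back
    std::promise<int> result;  // resolved by the decoder thread
};

class BatchingDecoder {
public:
    // Placeholder for the real batched decode + per-sequence sampling.
    // Must return one token per request, in request order.
    using DecodeFn =
        std::function<std::vector<int>(const std::vector<PendingRequest>&)>;

    BatchingDecoder(DecodeFn fn, std::chrono::microseconds window)
        : decode_(std::move(fn)), window_(window), worker_([this] { run(); }) {}

    ~BatchingDecoder() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_all();
        worker_.join();
    }

    // Called from any sequence's thread; the future resolves after the batch.
    std::future<int> submit(int seqId, int inputToken) {
        PendingRequest req{seqId, inputToken, {}};
        std::future<int> fut = req.result.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push_back(std::move(req));
        }
        cv_.notify_one();
        return fut;
    }

private:
    void run() {
        for (;;) {
            std::vector<PendingRequest> batch;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !pending_.empty(); });
                if (stopping_ && pending_.empty()) return;
                // Release the lock during the batching window so sibling
                // sequences can enqueue their requests.
                lock.unlock();
                std::this_thread::sleep_for(window_);
                lock.lock();
                batch.swap(pending_);
            }
            // One decode call serves every request gathered in the window.
            std::vector<int> tokens = decode_(batch);
            for (std::size_t i = 0; i < batch.size(); ++i) {
                batch[i].result.set_value(tokens[i]);
            }
        }
    }

    DecodeFn decode_;
    std::chrono::microseconds window_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::vector<PendingRequest> pending_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so it starts after the other members
};
```

A caller would construct it with a 200 µs window and block on the returned future, which mirrors sampleNext waiting on its promise. The sketch omits decode_mutex, cancellation, and error handling.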

The 200 µs window is the throughput knob. Multi-sequence workloads naturally fall into lockstep because each sequence resubmits as soon as its sample returns, so successive requests usually arrive within the window. Single-sequence callers pay one window per token (~2% of a ~10 ms decode).
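The ~2% figure is just the ratio of the two numbers quoted above. A tiny sketch of that arithmetic, with a hypothetical helper name:

```cpp
// Fraction of a decode's duration that one batching window adds for a
// single-sequence caller. Inputs are the figures quoted in the text.
constexpr double windowOverheadFraction(double windowMicros, double decodeMillis) {
    return windowMicros / (decodeMillis * 1000.0);
}
```

With a 200 µs window and a ~10 ms decode this evaluates to about 0.02, so halving the window would roughly halve the single-sequence penalty, at the cost of fewer coalesced requests.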

Validation

swift build
# Build complete!

COTABBY_TEST_MODEL_PATH=...Qwen3-0.6B-Q4_K_M.gguf swift test
# Executed 12 tests, with 0 failures

Two new tests gated on COTABBY_TEST_MODEL_PATH:

testInterleavedMultiSequenceSampling: alternates sampleNext between two sequences with greedy sampling and identical prompts; asserts identical token output. Validates the seed-token / feedback-decode handoff and per-sequence sampler isolation in the shared context.
testCancellationStopsSamplingPromptly: verifies sampleNext after cancelSequence returns was_cancelled=true without model work.

Existing testEndToEndWithModel passes unchanged — verifies the API stayed source-compatible for the single-sequence flow Cotabby uses today.

Follow-ups

Bench scenario c_engine_threaded that exercises the engine via its public API from two threads, for end-to-end throughput validation. The Phase 0 numbers in #3 (Phase 0 spike: batched-decode vs separate-context throughput benchmark) are at the raw llama.cpp level; the engine adds the decoder-thread coordination layer and the 200 µs batching window, so its end-to-end number may differ by a few percent.
README update: the "no shared decode mutex, no contention" line is no longer accurate. The new design has one decode_mutex serializing the single llama_context. The contention is productive (it enables batching), but the docs should reflect the new model.
Cotabby side (tabby-4): no code changes needed for correctness, but LlamaRuntimeCore.autocompleteLock is now redundant for context isolation (the engine handles its own locking). It still serializes autocomplete-specific Swift-side state (autocompletePromptTokens etc.), so leave as-is unless we refactor that state too.

Risk / rollout notes

Memory footprint goes up: n_ctx is multiplied by MAX_SEQUENCES = 4, so KV cache memory grows ~4x. For Gemma 3 1B with 2048 ctx, that's ~107 MB, which is tolerable on M-series. Worth checking on lower-spec hardware.
The seed-token-in-decodePrompt path is new and has the subtlest interaction: the seed is sampled while the prompt's logits are live in the shared context, before any other sequence can decode. decodePrompt holds decode_mutex for its entire body, which serializes against the decoder thread. Verified with the interleaved-multi-sequence test.
Cancellation latency is now bounded by one batch (~10 ms) rather than zero — a cancelled token's decode slot is still consumed even though the sampler is skipped. Acceptable trade-off.
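The ~4x scaling in the memory note follows from the KV cache growing linearly with n_ctx. The sketch below shows that relationship for a plain dense cache. `KvShape` and both helpers are hypothetical, the shape values must come from the actual model metadata, and the formula ignores sliding-window attention and quantized KV caches, so it is not a way to reproduce the ~107 MB figure.

```cpp
#include <cstdint>

// Hypothetical model-shape parameters; not read from any real model here.
struct KvShape {
    std::uint64_t nLayers;
    std::uint64_t nKvHeads;
    std::uint64_t headDim;
    std::uint64_t bytesPerElement;  // e.g. 2 for an f16 cache
};

// Dense KV cache size: one K and one V vector per layer per cached token.
constexpr std::uint64_t kvCacheBytes(const KvShape& s, std::uint64_t nCtx) {
    return 2 * s.nLayers * s.nKvHeads * s.headDim * s.bytesPerElement * nCtx;
}

// n_ctx of the shared context as described in the design.
constexpr std::uint64_t sharedCtx(std::uint64_t configuredCtx,
                                  std::uint64_t maxSequences) {
    return configuredCtx * maxSequences;
}
```

For any shape, `kvCacheBytes(s, sharedCtx(2048, 4))` is exactly four times `kvCacheBytes(s, 2048)`, which is the scaling to check against available memory on lower-spec hardware.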

Replaces the per-sequence llama_context architecture with a single shared context (n_seq_max = MAX_SEQUENCES) and a dedicated decoder thread that coalesces sample-step requests from multiple sequences into one llama_decode call. Public C++ API (CotabbyInferenceEngine.h) is unchanged; Cotabby's Swift code does not need to be modified.

Why
---

Phase 0 spike (see PR #3) showed that on M-series Metal, batched decode delivers 1.43x aggregate throughput at N=2 and up to 2.35x at N=4 vs the current "separate llama_context per sequence" design. The win comes from fusing matmul weight reads across sequences in a single llama_decode call: per-token decode is memory-bound on Apple Silicon, so a single decode that serves two sequences reuses the same weight read. The "Metal command queue serializes everything" pessimism does not survive empirically.

Design
------

- Impl owns one llama_context with n_ctx = configured_ctx * MAX_SEQUENCES and n_seq_max = MAX_SEQUENCES. Each SequenceState carries a llama_seq_id slot (0..MAX_SEQUENCES-1) used to tag tokens in the shared KV cache.
- Decoder thread loop: wait for at least one pending request, wait an additional BATCH_WINDOW_MICROS (200 µs by default) for siblings to pile in, then build one llama_batch carrying all pending tokens with their respective seq_ids, llama_decode once, sample each sequence's next token using its own sampler chain at its assigned batch index, and resolve every request's promise.
- sampleNext fast path: deliver the seed token sampled at decodePrompt time. This avoids the decoder round-trip for the very first sample after a prompt, where there is no input token to feedback-decode.
- sampleNext steady-state path: queue a PendingRequest (input token = previously-sampled token, position = current KV count, sampler = this sequence's chain) and wait on a std::promise resolved by the decoder thread.
- decodePrompt holds decode_mutex for the prompt's chunk decode and takes the seed sample inline while the prompt's logits are still resident in the shared context.
- trimKV holds decode_mutex, calls llama_memory_seq_rm for this sequence's seq_id, and invalidates any pending seed/input so the caller has to re-prime via decodePrompt before the next sampleNext.

The 200 µs window is the throughput knob. Multi-sequence workloads naturally fall into lockstep because each sequence resubmits as soon as its sample returns, so successive requests usually arrive within the window without any caller-side coordination. Single-sequence callers pay one window's worth of latency per token (~2% of a ~10 ms decode); acceptable. Tunable later via a setter if needed.

Cancellation
------------

- Existing one-way atomic flag preserved.
- Checked at sampleNext entry (returns immediately) and again in processBatch after llama_decode but before sampling (skips wasted sample work, returns was_cancelled=true). The decode slot for a cancelled token is still consumed, which is fine: the slot is cheap; the win is not running the sampler.

Tests
-----

Added two integration tests gated on COTABBY_TEST_MODEL_PATH:

- testInterleavedMultiSequenceSampling: alternates sampleNext between two sequences with greedy sampling and identical prompts, asserts identical output (validates the seed-token / feedback-decode handoff and per-sequence sampler isolation in the shared context).
- testCancellationStopsSamplingPromptly: verifies sampleNext after cancelSequence returns was_cancelled=true without model work.

Existing testEndToEndWithModel passes unchanged.

Follow-ups
----------

- Bench scenario c_engine_threaded that exercises the full engine via its public API from two threads, for end-to-end throughput validation (Phase 0 numbers above are at the raw llama.cpp level).
- README update: the "no shared decode mutex, no contention" claim is no longer accurate. The new design has a single decode_mutex serializing access to one llama_context. The contention is productive (it enables batching), but the README should reflect the new model.
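The sampleNext entry checks described in the commit message (cancellation first, then the seed fast path, otherwise fall through to the decoder thread) can be sketched as below. All type, field, and function names are hypothetical, and the sketch leaves out decode_mutex and the second cancellation check in processBatch. It only illustrates the order of the checks and the seed invalidation performed by trimKV.

```cpp
#include <atomic>

struct SampleResult {
    int token;
    bool wasCancelled;
};

// Hypothetical per-sequence state; field names are illustrative only.
struct SequenceState {
    std::atomic<bool> cancelled{false};  // one-way flag: set once, never cleared
    bool hasSeed = false;                // true after decodePrompt sampled a seed
    int seedToken = 0;
};

// Returns true and fills `out` when the call can finish without a decoder
// round-trip; returns false when the caller must queue a PendingRequest.
bool trySampleWithoutDecode(SequenceState& seq, SampleResult& out) {
    if (seq.cancelled.load(std::memory_order_acquire)) {
        out = {-1, true};  // cancelled: no model work at all
        return true;
    }
    if (seq.hasSeed) {
        out = {seq.seedToken, false};  // fast path: deliver the seed once
        seq.hasSeed = false;
        return true;
    }
    return false;  // steady state
}

// After a KV trim the seed no longer matches the cache, so drop it and
// force the caller to re-prime via decodePrompt.
void invalidateAfterTrim(SequenceState& seq) {
    seq.hasSeed = false;
}
```

The token value -1 for a cancelled result is a placeholder chosen for this sketch; the source only states that was_cancelled=true is returned.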

FuJacob deleted the branch feat/batched-decode-bench May 28, 2026 10:49

FuJacob closed this May 28, 2026

FuJacob mentioned this pull request May 28, 2026

Phase 1: refactor engine to one shared context with batched decode #5

Merged

FuJacob wants to merge 1 commit into feat/batched-decode-bench from feat/batched-decode-engine
