Phase 1: refactor engine to one shared context with batched decode #4

Closed
FuJacob wants to merge 1 commit into feat/batched-decode-bench from feat/batched-decode-engine


@FuJacob FuJacob commented May 28, 2026

Stacked on #3.

Summary

Replaces the per-sequence llama_context architecture with a single shared context (n_seq_max = MAX_SEQUENCES) and a dedicated decoder thread that coalesces sample-step requests from multiple sequences into one llama_decode call. Public C++ API (CotabbyInferenceEngine.h) is unchanged; Cotabby's Swift code does not need to be modified.

Why

Phase 0 (#3) measured 1.43x aggregate throughput at N=2 and 2.35x at N=4 on Gemma 3 1B / M-series. The win comes from fusing matmul weight reads across sequences in a single llama_decode — per-token decode is memory-bound on Apple Silicon, so batching reuses the same weight read for multiple sequences.

Design

  • One shared llama_context, n_ctx = configured * MAX_SEQUENCES. Each SequenceState owns a llama_seq_id slot (0..MAX_SEQUENCES-1).
  • Decoder thread loop: wait for a request, wait an additional BATCH_WINDOW_MICROS (200 µs) for siblings, build one llama_batch carrying all pending tokens with their seq_ids, llama_decode once, sample each sequence's next token using its own sampler chain at its assigned batch index, resolve every request's std::promise.
  • sampleNext fast path: deliver the seed token sampled at decodePrompt time. No decoder round-trip on the first sample after a prompt.
  • sampleNext steady-state path: queue a PendingRequest and wait on the promise.
  • decodePrompt holds decode_mutex for its chunk decodes and takes the seed sample inline while the prompt's logits are still resident.
  • trimKV holds decode_mutex, calls llama_memory_seq_rm, and invalidates pending seed/input so the caller re-primes via decodePrompt.
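
The decoder-thread loop above can be sketched roughly as follows. This is a minimal standalone illustration, not the engine's code: llama_decode and the per-sequence samplers are replaced by an injected callback, and BatchingDecoder, TokenInputs, DecodeFn, and the field layout of PendingRequest are assumptions made for the example. The real loop also takes decode_mutex around the decode, which is omitted here.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <future>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the engine's PendingRequest.
struct PendingRequest {
    std::int32_t seq_id = 0;          // sequence slot the token belongs to
    std::int32_t input_token = 0;     // previously sampled token to feed back
    std::promise<std::int32_t> done;  // resolved with the next token
};

// (seq_id, input_token) pairs for one batch.
using TokenInputs = std::vector<std::pair<std::int32_t, std::int32_t>>;
// Stand-in for "llama_decode once, then sample per sequence":
// must return exactly one next token per input entry, in order.
using DecodeFn = std::function<std::vector<std::int32_t>(const TokenInputs&)>;

class BatchingDecoder {
public:
    BatchingDecoder(DecodeFn decode, std::chrono::microseconds window)
        : decode_(std::move(decode)), window_(window), worker_([this] { run(); }) {}

    ~BatchingDecoder() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        worker_.join();  // the worker drains any queued requests before exiting
    }

    std::future<std::int32_t> submit(std::int32_t seq_id, std::int32_t input_token) {
        PendingRequest req;
        req.seq_id = seq_id;
        req.input_token = input_token;
        std::future<std::int32_t> result = req.done.get_future();
        {
            std::lock_guard<std::mutex> lock(mu_);
            pending_.push_back(std::move(req));
        }
        cv_.notify_one();
        return result;
    }

private:
    void run() {
        for (;;) {
            std::vector<PendingRequest> batch;
            {
                std::unique_lock<std::mutex> lock(mu_);
                // Wait for at least one pending request (or shutdown).
                cv_.wait(lock, [this] { return stop_ || !pending_.empty(); });
                if (pending_.empty()) return;  // shutdown with nothing queued
                // Hold the batch open for one window so siblings can join;
                // the wait ends early only on shutdown.
                cv_.wait_for(lock, window_, [this] { return stop_; });
                batch.swap(pending_);
            }
            TokenInputs inputs;
            for (const PendingRequest& r : batch) {
                inputs.emplace_back(r.seq_id, r.input_token);
            }
            std::vector<std::int32_t> next = decode_(inputs);  // one decode per batch
            assert(next.size() == batch.size());
            for (std::size_t i = 0; i < batch.size(); ++i) {
                batch[i].done.set_value(next[i]);  // wake the waiting caller
            }
        }
    }

    DecodeFn decode_;
    std::chrono::microseconds window_;
    std::mutex mu_;
    std::condition_variable cv_;
    std::vector<PendingRequest> pending_;
    bool stop_ = false;
    std::thread worker_;  // declared last so it starts after the other members exist
};
```

A caller would construct it with a 200 µs window, call submit(seq_id, token) from each sequence's thread, and block on the returned future. Two requests that arrive within the same window end up in one callback invocation.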

The 200 µs window is the throughput knob. Multi-sequence workloads naturally fall into lockstep because each sequence resubmits as soon as its sample returns, so successive requests usually arrive within the window. Single-sequence callers pay one window per token (~2% of a ~10 ms decode).
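
The single-sequence cost quoted above is plain arithmetic and can be checked with a small helper. The function names are made up for this example, and the 10 ms decode time is the approximate figure from the paragraph, not a measurement.

```cpp
#include <cassert>

// Fraction of extra per-token latency a lone sequence pays for the window.
inline double windowOverhead(double window_us, double decode_us) {
    assert(decode_us > 0.0);
    return window_us / decode_us;
}

// Single-sequence tokens per second once the window is added to each decode.
inline double tokensPerSecond(double window_us, double decode_us) {
    return 1e6 / (decode_us + window_us);
}
```

With a 200 µs window and a 10,000 µs decode, windowOverhead gives 0.02, and a lone sequence goes from about 100 tokens/s to about 98 tokens/s.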

Validation

swift build
# Build complete!

COTABBY_TEST_MODEL_PATH=...Qwen3-0.6B-Q4_K_M.gguf swift test
# Executed 12 tests, with 0 failures

Two new tests gated on COTABBY_TEST_MODEL_PATH:

  • testInterleavedMultiSequenceSampling: alternates sampleNext between two sequences with greedy sampling and identical prompts; asserts identical token output. Validates the seed-token / feedback-decode handoff and per-sequence sampler isolation in the shared context.
  • testCancellationStopsSamplingPromptly: verifies sampleNext after cancelSequence returns was_cancelled=true without model work.

Existing testEndToEndWithModel passes unchanged — verifies the API stayed source-compatible for the single-sequence flow Cotabby uses today.

Follow-ups

  • Bench scenario c_engine_threaded that exercises the engine via its public API from two threads, for end-to-end throughput validation. The Phase 0 numbers in #3 (Phase 0 spike: batched-decode vs separate-context throughput benchmark) are at the raw llama.cpp level; the engine adds the decoder-thread coordination layer and the 200 µs batching window, so its end-to-end number may differ by a few percent.
  • README update: the "no shared decode mutex, no contention" line is no longer accurate. The new design has one decode_mutex serializing the single llama_context. The contention is productive (it enables batching), but the docs should reflect the new model.
  • Cotabby side (tabby-4): no code changes needed for correctness, but LlamaRuntimeCore.autocompleteLock is now redundant for context isolation (the engine handles its own locking). It still serializes autocomplete-specific Swift-side state (autocompletePromptTokens etc.), so leave as-is unless we refactor that state too.

Risk / rollout notes

  • Memory footprint goes up: n_ctx is multiplied by MAX_SEQUENCES = 4, so KV cache memory grows ~4x. For Gemma 3 1B with 2048 ctx, that's ~107 MB, which is tolerable on M-series. Worth checking on lower-spec hardware.
  • The seed-token-in-decodePrompt path is new and has the subtlest interaction: the seed is sampled while the prompt's logits are live in the shared context, before any other sequence can decode. decodePrompt holds decode_mutex for its entire body, which serializes against the decoder thread. Verified with the interleaved-multi-sequence test.
  • Cancellation latency is now bounded by one batch (~10 ms) rather than zero — a cancelled token's decode slot is still consumed even though the sampler is skipped. Acceptable trade-off.
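
For checking the memory note on other hardware or models, the ~4x KV growth can be estimated with the usual dense-cache formula. This is a rough sketch under stated assumptions: KvShape and kvCacheBytes are hypothetical names, the shape values in the usage note are placeholders rather than any real model's, and caches that use sliding-window layers or quantized K/V will come in lower than this estimate.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model-shape parameters; real values come from the model metadata.
struct KvShape {
    std::uint64_t n_layers;
    std::uint64_t n_kv_heads;
    std::uint64_t head_dim;
    std::uint64_t bytes_per_elem;  // 2 for an f16 cache
};

// Approximate bytes for a dense KV cache: one K and one V vector
// per layer, per KV head, per context position.
inline std::uint64_t kvCacheBytes(const KvShape& s, std::uint64_t n_ctx) {
    assert(n_ctx > 0);
    return 2 * s.n_layers * s.n_kv_heads * s.head_dim * s.bytes_per_elem * n_ctx;
}
```

With a placeholder shape of 16 layers, 4 KV heads, head_dim 64, and f16 entries, a 2048-token context comes to 32 MiB and the shared 4 * 2048 context to 128 MiB. The cost is linear in n_ctx, so it scales directly with MAX_SEQUENCES.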

Replaces the per-sequence llama_context architecture with a single
shared context (n_seq_max = MAX_SEQUENCES) and a dedicated decoder
thread that coalesces sample-step requests from multiple sequences into
one llama_decode call. Public C++ API (CotabbyInferenceEngine.h) is
unchanged; Cotabby's Swift code does not need to be modified.

Why
---
Phase 0 spike (see PR #3) showed that on M-series Metal, batched decode
delivers 1.43x aggregate throughput at N=2 and up to 2.35x at N=4
vs the current "separate llama_context per sequence" design. The win
comes from fusing matmul weight reads across sequences in a single
llama_decode call: per-token decode is memory-bound on Apple Silicon,
so a single decode that serves two sequences reuses the same weight
read. The worry that the "Metal command queue serializes everything"
does not hold up in the Phase 0 measurements.

Design
------
- Impl owns one llama_context with n_ctx = configured_ctx * MAX_SEQUENCES
  and n_seq_max = MAX_SEQUENCES. Each SequenceState carries a
  llama_seq_id slot (0..MAX_SEQUENCES-1) used to tag tokens in the
  shared KV cache.
- Decoder thread loop: wait for at least one pending request, wait an
  additional BATCH_WINDOW_MICROS (200 µs by default) for siblings to
  pile in, then build one llama_batch carrying all pending tokens with
  their respective seq_ids, llama_decode once, sample each sequence's
  next token using its own sampler chain at its assigned batch index,
  and resolve every request's promise.
- sampleNext fast path: deliver the seed token sampled at decodePrompt
  time. This avoids the decoder round-trip for the very first sample
  after a prompt, where there is no input token to feedback-decode.
- sampleNext steady-state path: queue a PendingRequest (input token =
  previously-sampled token, position = current KV count, sampler =
  this sequence's chain) and wait on a std::promise resolved by the
  decoder thread.
- decodePrompt holds decode_mutex for the prompt's chunk decode and
  takes the seed sample inline while the prompt's logits are still
  resident in the shared context.
- trimKV holds decode_mutex, calls llama_memory_seq_rm for this
  sequence's seq_id, and invalidates any pending seed/input so the
  caller has to re-prime via decodePrompt before the next sampleNext.
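
The seed / feedback handoff described in the sampleNext, decodePrompt, and
trimKV items can be pictured as a small per-sequence state machine. The
sketch below is illustrative only: SequenceSlot, SamplePath, and the
helper names are assumptions, and the NeedsPrompt outcome is a guess at
how a sampleNext without a prior decodePrompt is reported.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-sequence bookkeeping; field names are illustrative.
struct SequenceSlot {
    bool has_seed = false;   // a token sampled inside decodePrompt is waiting
    std::int32_t seed = 0;
    bool has_input = false;  // a token is available to feedback-decode
    std::int32_t input = 0;
};

enum class SamplePath { SeedFastPath, QueueForDecoder, NeedsPrompt };

// decodePrompt: the seed is sampled while the prompt's logits are resident.
inline void onPromptDecoded(SequenceSlot& s, std::int32_t seed_token) {
    s.has_seed = true;
    s.seed = seed_token;
    s.has_input = false;
}

// sampleNext: decide which path this call takes.
inline SamplePath nextSamplePath(SequenceSlot& s, std::int32_t& out_token) {
    if (s.has_seed) {
        out_token = s.seed;  // fast path, no decoder round-trip
        s.input = s.seed;    // the seed becomes the next feedback input
        s.has_input = true;
        s.has_seed = false;
        return SamplePath::SeedFastPath;
    }
    return s.has_input ? SamplePath::QueueForDecoder : SamplePath::NeedsPrompt;
}

// Decoder thread resolved the promise: remember the token for the next step.
inline void onDecoderResult(SequenceSlot& s, std::int32_t token) {
    assert(s.has_input);  // a result only arrives for a queued request
    s.input = token;
}

// trimKV: drop pending seed/input so the caller must re-prime.
inline void onTrimKV(SequenceSlot& s) {
    s.has_seed = false;
    s.has_input = false;
}
```

The first sampleNext after a prompt returns SeedFastPath, later calls
return QueueForDecoder, and any call after onTrimKV returns NeedsPrompt
until onPromptDecoded runs again.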

The 200 µs window is the throughput knob. Multi-sequence workloads
naturally fall into lockstep because each sequence resubmits as soon as
its sample returns, so successive requests usually arrive within the
window without any caller-side coordination. Single-sequence callers
pay one window's worth of latency per token (~2% of a ~10 ms decode),
which is acceptable. The window can be made tunable later via a setter
if needed.

Cancellation
------------
- Existing one-way atomic flag preserved.
- Checked at sampleNext entry (returns immediately) and again in
  processBatch after llama_decode but before sampling (skips wasted
  sample work, returns was_cancelled=true). The decode slot for a
  cancelled token is still consumed, which is fine — the slot is
  cheap; the win is not running the sampler.
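
A rough sketch of the two cancellation checkpoints, with the decode and
sampler replaced by counters so the effect is observable. CancelFlag,
WorkLog, sampleNextSketch, and the cancel_during_decode switch are
hypothetical names for this example; in the engine the second check
lives in processBatch on the decoder thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct SampleResult {
    std::int32_t token;
    bool was_cancelled;
};

// One-way flag: once set it is never cleared.
struct CancelFlag {
    std::atomic<bool> cancelled{false};
    void cancel() { cancelled.store(true, std::memory_order_release); }
    bool isSet() const { return cancelled.load(std::memory_order_acquire); }
};

// Counts which stages actually ran.
struct WorkLog {
    int decodes = 0;
    int samples = 0;
};

inline SampleResult sampleNextSketch(CancelFlag& flag, WorkLog& log,
                                     bool cancel_during_decode) {
    // Checkpoint 1: at entry, before any model work.
    if (flag.isSet()) return {-1, true};

    ++log.decodes;  // stand-in for the batched llama_decode
    if (cancel_during_decode) flag.cancel();  // simulates a concurrent cancel

    // Checkpoint 2: after decode, before sampling. The decode slot was
    // consumed, but the sampler is skipped.
    if (flag.isSet()) return {-1, true};

    ++log.samples;  // stand-in for running the sampler chain
    assert(log.samples <= log.decodes);
    return {7, false};  // placeholder token id
}
```

A cancel that lands during the decode costs one decode but no sampler
run; every later call returns at the first checkpoint.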

Tests
-----
Added two integration tests gated on COTABBY_TEST_MODEL_PATH:

- testInterleavedMultiSequenceSampling: alternates sampleNext between
  two sequences with greedy sampling and identical prompts, asserts
  identical output (validates the seed-token / feedback-decode handoff
  and per-sequence sampler isolation in the shared context).
- testCancellationStopsSamplingPromptly: verifies sampleNext after
  cancelSequence returns was_cancelled=true without model work.

Existing testEndToEndWithModel passes unchanged.

Follow-ups
----------
- Bench scenario c_engine_threaded that exercises the full engine via
  its public API from two threads, for end-to-end throughput
  validation (Phase 0 numbers above are at the raw llama.cpp level).
- README update: the "no shared decode mutex, no contention" claim is
  no longer accurate. The new design has a single decode_mutex
  serializing access to one llama_context. The contention is
  productive — it enables batching — but the README should reflect the
  new model.
@FuJacob FuJacob deleted the branch feat/batched-decode-bench May 28, 2026 10:49
@FuJacob FuJacob closed this May 28, 2026