Summary
Add cooperative cancellation to the generate() token loop in LlamaRuntimeCore so that when the user types during active generation, the stale prediction is abandoned immediately instead of running to completion.
Problem
When the user types a character while a suggestion is being generated, the old Swift Task is cancelled and a new prediction is queued. However, generate() (LlamaRuntimeCore.swift:192–217) never checks Task.isCancelled, so the loop keeps sampling, up to 30 tokens for the default 12–20 word preset, to produce a result that will be discarded.
Because LlamaRuntimeCore is an actor, the new generation request is queued behind the old one — it cannot start until the wasted loop finishes. On smaller machines this adds noticeable latency between keystrokes and the next suggestion appearing.
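The queuing effect can be illustrated with a minimal sketch. RuntimeSketch and its generate(label:steps:) method are hypothetical stand-ins, not the real LlamaRuntimeCore API; the only assumption carried over is that the loop contains no suspension point, so the actor is not released until the method returns.

```swift
// Hypothetical stand-in for an actor-isolated runtime; not the real LlamaRuntimeCore.
actor RuntimeSketch {
    private(set) var log: [String] = []

    func generate(label: String, steps: Int) {
        log.append("start \(label)")
        var sampled = 0
        // Simulated sampling loop: no await inside, so no other call can run on the actor.
        for _ in 0..<steps { sampled += 1 }
        log.append("end \(label) after \(sampled) tokens")
    }
}

let core = RuntimeSketch()
let oldRequest = Task { await core.generate(label: "old", steps: 30) }
let newRequest = Task { await core.generate(label: "new", steps: 30) }
await oldRequest.value
await newRequest.value
let log = await core.log
```

Which request reaches the actor first is not guaranteed in this sketch, but the two log entries of one call are never interleaved with the other call's entries. A request that arrives second has to wait for the full loop of the first.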
Proposed direction
Add if Task.isCancelled { break } at the top of the token-sampling loop in generate(), matching the pattern already used in summarize() (LlamaRuntimeCore.swift:698–716). The existing defer cleanup block already handles partial generation correctly, so no additional teardown is needed.
for _ in 0 ..< options.maxPredictionTokens {
    if Task.isCancelled { break } // ← add this
    let nextToken = llama_sampler_sample(sampler, context, -1)
    // ...
}
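For readers who want to try the pattern outside the project, here is a self-contained sketch. generateSketch and its sampleNext closure are hypothetical stand-ins for the real generate() and the llama.cpp sampling call, and the print in the defer block only stands in for the existing cleanup. The usage cancels the surrounding task from inside the loop to make the early exit reproducible.

```swift
// Hypothetical stand-in for the token loop; the real generate() calls into llama.cpp.
func generateSketch(maxTokens: Int, sampleNext: () -> String) -> [String] {
    var tokens: [String] = []
    // Stand-in for the existing cleanup: runs on normal exit and on early break.
    defer { print("cleanup ran") }
    for _ in 0..<maxTokens {
        if Task.isCancelled { break }  // cooperative cancellation check
        tokens.append(sampleNext())
    }
    return tokens
}

// Usage: cancel the surrounding task once the third token has been sampled.
let generation = Task { () -> [String] in
    var sampled = 0
    return generateSketch(maxTokens: 30) {
        sampled += 1
        if sampled == 3 {
            withUnsafeCurrentTask { (current: UnsafeCurrentTask?) -> Void in
                current?.cancel()
            }
        }
        return "tok\(sampled)"
    }
}
let partial = await generation.value
```

The check sits at the top of the loop, so a token already being sampled is still appended. The loop stops before the next sampling call instead of running all 30 iterations.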
Additional context
- summarize() in the same file already implements this exact pattern, with a comment explaining cooperative cancellation.
- SuggestionWorkController already cancels the old task and bumps a monotonic workID, so stale results are rejected downstream; this change just stops burning GPU/CPU cycles on them.
- Highest ROI of the latency-related improvements: one line, zero architectural risk.
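The downstream rejection mentioned above can be pictured with a small sketch of a monotonic work ID. WorkTrackerSketch and its members are hypothetical and only illustrate the idea; the actual SuggestionWorkController may be structured differently.

```swift
// Hypothetical illustration of a monotonic work ID; not the real SuggestionWorkController.
struct WorkTrackerSketch {
    private(set) var currentID = 0

    // Each new request bumps the ID and remembers it as the only valid one.
    mutating func begin() -> Int {
        currentID += 1
        return currentID
    }

    // A result is accepted only if it belongs to the most recent request.
    func isCurrent(_ id: Int) -> Bool {
        id == currentID
    }
}

var tracker = WorkTrackerSketch()
let staleID = tracker.begin()   // request started before the keystroke
let freshID = tracker.begin()   // request started after the keystroke
let staleAccepted = tracker.isCurrent(staleID)
let freshAccepted = tracker.isCurrent(freshID)
```

A check like this keeps stale text off the screen, but it runs only after the work is done. That is why the in-loop cancellation check is still needed to stop the wasted sampling itself.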