[Feature] Cancel generation mid-token-loop for faster re-prediction #207

@FuJacob

Summary

Add cooperative cancellation to the generate() token loop in LlamaRuntimeCore so that when the user types during active generation, the stale prediction is abandoned at the next token boundary instead of running to completion.

Problem

When the user types a character while a suggestion is being generated, the old Swift Task is cancelled and a new prediction is queued. However, generate() (LlamaRuntimeCore.swift:192–217) never checks Task.isCancelled, so the loop keeps sampling up to its token limit (30 tokens for the default 12–20 word preset) to produce a result that will be discarded.
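Swift task cancellation is cooperative: `Task.cancel()` only sets a flag, and code that never reads it keeps running. The sketch below uses a hypothetical `sampleIgnoringCancellation` function (a plain counter standing in for repeated `llama_sampler_sample` calls, not llama.cpp itself) to show that a cancelled task still performs every iteration when nothing checks the flag.

```swift
// Hypothetical loop that, like the current generate(), never checks for
// cancellation. The counter stands in for llama_sampler_sample calls.
func sampleIgnoringCancellation(maxPredictionTokens: Int) -> Int {
    var sampled = 0
    for _ in 0 ..< maxPredictionTokens {
        sampled += 1
    }
    return sampled
}

let task = Task { () -> Int in
    // Task.sleep throws as soon as the task is cancelled; the error is ignored
    // so execution falls through to the loop, as a stale prediction would.
    try? await Task.sleep(nanoseconds: 1_000_000_000)
    return sampleIgnoringCancellation(maxPredictionTokens: 30)
}
task.cancel() // the "user typed" signal: this only sets the cancellation flag
let sampled = await task.value
print(task.isCancelled, sampled) // true 30
```

The task reports itself as cancelled, yet all 30 iterations still ran.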

Because LlamaRuntimeCore is an actor, the new generation request is queued behind the old one: it cannot start until the wasted loop finishes. On slower machines this adds noticeable latency between a keystroke and the next suggestion appearing.
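A minimal sketch of the queuing effect, assuming (as described above) that the token loop contains no `await` and therefore no suspension points. `SerialRuntime` is a hypothetical stand-in for `LlamaRuntimeCore`, and the arithmetic loop stands in for sampling. If the real loop did suspend, actor reentrancy could let calls interleave, so this only models the situation the issue describes.

```swift
// Hypothetical stand-in for LlamaRuntimeCore. generate() contains no `await`,
// so once the actor starts a call it runs that call to the end.
actor SerialRuntime {
    private(set) var log: [String] = []

    func generate(label: String, tokens: Int) -> Int {
        log.append("\(label) start")
        var checksum = 0
        for i in 0 ..< tokens {
            checksum &+= i &* i // busy work standing in for token sampling
        }
        log.append("\(label) end")
        return checksum
    }
}

let runtime = SerialRuntime()
let oldRequest = Task { await runtime.generate(label: "old", tokens: 2_000_000) }
let newRequest = Task { await runtime.generate(label: "new", tokens: 10) }
_ = await (oldRequest.value, newRequest.value)

let log = await runtime.log
print(log)
```

Whichever request the actor picks up first, the log always shows it finishing before the other one starts; a short request stuck behind a long one waits for the whole long loop.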

Proposed direction

Add if Task.isCancelled { break } at the top of the token-sampling loop in generate(), matching the pattern already used in summarize() (LlamaRuntimeCore.swift:698–716). The existing defer cleanup block already handles partial generation correctly, so no additional teardown is needed.

```swift
for _ in 0 ..< options.maxPredictionTokens {
    // ← add this: stop sampling once the enclosing Task has been cancelled
    if Task.isCancelled { break }
    let nextToken = llama_sampler_sample(sampler, context, -1)
    // ...
}
```
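The proposed check can be exercised without a model using the sketch below. `FakeRuntimeCore` and `fakeSample` are hypothetical stand-ins for `LlamaRuntimeCore` and `llama_sampler_sample`; the `defer` mirrors the existing cleanup block, which the issue says already copes with a partially completed loop.

```swift
// Hypothetical stand-in for LlamaRuntimeCore: llama.cpp calls are replaced by a
// fake sampler so the cancellation path can run without a model.
actor FakeRuntimeCore {
    private(set) var cleanups = 0

    func generate(maxPredictionTokens: Int) -> [Int] {
        var output: [Int] = []
        // Runs however the loop exits, like the existing defer cleanup block.
        defer { cleanups += 1 }
        for step in 0 ..< maxPredictionTokens {
            // Cooperative cancellation: the calling Task was cancelled
            // (e.g. the user typed again), so abandon the stale prediction.
            if Task.isCancelled { break }
            output.append(fakeSample(step))
        }
        return output
    }

    // Hypothetical placeholder for llama_sampler_sample(sampler, context, -1).
    private func fakeSample(_ step: Int) -> Int { (step * 7) % 101 }
}

let core = FakeRuntimeCore()

// Not cancelled: samples all maxPredictionTokens tokens.
let full = await core.generate(maxPredictionTokens: 30)

// Cancelled before sampling begins: the loop exits on its first check.
let stale = Task { () -> [Int] in
    try? await Task.sleep(nanoseconds: 1_000_000_000) // ends early once cancelled
    return await core.generate(maxPredictionTokens: 30)
}
stale.cancel()
let partial = await stale.value

let cleanups = await core.cleanups
print(full.count, partial.count, cleanups) // 30 0 2
```

`Task.isCancelled` reads the cancellation state of the calling task even inside an actor-isolated synchronous method, so the check needs no extra plumbing from the caller. Cleanup runs in both the full and the abandoned case.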

Additional context

  • summarize() in the same file already implements this exact pattern with a comment explaining cooperative cancellation.
  • SuggestionWorkController already cancels the old task and bumps a monotonic workID, so stale results are rejected downstream; this change just stops burning GPU/CPU cycles on them.
  • Of the latency-related improvements, this one has the highest ROI: one line, zero architectural risk.
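For context on the downstream guard, here is a minimal sketch of the cancel-and-bump-workID pattern. `MiniWorkController`, `schedule`, and `waitForCurrent` are hypothetical names chosen for illustration; the real `SuggestionWorkController` API may differ.

```swift
// Hypothetical, simplified model of SuggestionWorkController's stale-result guard.
@MainActor
final class MiniWorkController {
    private var currentTask: Task<Void, Never>?
    private var workID = 0
    private(set) var accepted: [String] = []

    func schedule(_ makeSuggestion: @escaping @Sendable () async -> String) {
        // Cancel the in-flight request and bump the monotonic ID.
        currentTask?.cancel()
        workID += 1
        let id = workID
        currentTask = Task {
            let suggestion = await makeSuggestion()
            // Reject the result if a newer request was scheduled meanwhile.
            guard id == self.workID else { return }
            self.accepted.append(suggestion)
        }
    }

    func waitForCurrent() async {
        _ = await currentTask?.value
    }
}

let controller = MiniWorkController()
controller.schedule { "stale suggestion" } // superseded by the next keystroke
controller.schedule { "fresh suggestion" }
await controller.waitForCurrent()
print(controller.accepted) // ["fresh suggestion"]
```

The stale closure is still allowed to run to completion here; its result is merely discarded. Checking `Task.isCancelled` inside the generation loop is what avoids paying for that work.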

Metadata

Assignees: No one assigned
Labels: enhancement (New feature or request)
Project status: Backlog
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests

    Issue actions