Problem
Tabby currently waits for a completed model response before showing a suggestion. That can make autocomplete feel slower than necessary, especially with larger local models or longer suggestion-length settings.
Goal
Explore whether Tabby can stream or incrementally surface useful completion chunks before the full generation is finished.
Proposed Scope
- Investigate whether the Open Source runtime can expose token streaming from LlamaRuntimeCore.
- Investigate what, if anything, Apple Intelligence exposes for incremental responses.
- Define a safe partial-output normalization strategy so incomplete text is not shown in awkward states.
- Decide when the overlay should first appear: after first word, first stable phrase, or a latency threshold.
- Ensure cancellation and stale-result handling work while streaming.
- Preserve partial acceptance semantics if the user presses accept before generation fully completes.
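One possible shape for the partial-output normalization item above, as a minimal sketch only. The function name `stablePrefix(of:minimumWords:)` and the word-boundary rule are assumptions made for illustration, not existing Tabby code. The idea is to treat everything after the last whitespace as a possibly unfinished word and hold it back until more tokens arrive.

```swift
/// Hypothetical helper: returns the longest prefix of `partial` that ends
/// at a word boundary, or nil if fewer than `minimumWords` complete words
/// are available yet.
func stablePrefix(of partial: String, minimumWords: Int = 1) -> String? {
    // Text after the last whitespace may still be an unfinished word,
    // so only the part before it is considered stable.
    guard let lastBreak = partial.lastIndex(where: { $0.isWhitespace }) else {
        return nil
    }
    let candidate = partial[..<lastBreak]

    // Require a minimum number of complete words before showing anything.
    let words = candidate.split(whereSeparator: { $0.isWhitespace })
    guard words.count >= minimumWords else { return nil }

    return String(candidate)
}
```

With this rule, a buffer of "the quick bro" and `minimumWords: 2` would surface "the quick", while "the qu" would surface nothing yet. A latency threshold or phrase-level rule could replace the word count without changing the call site.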
Acceptance Criteria
- A documented recommendation covers the feasibility of streaming or chunked generation for each engine.
- If feasible for Open Source, the runtime can emit incremental chunks to the coordinator.
- The overlay can show stable partial suggestions without waiting for full completion.
- Stale generations and focus changes cancel active streams safely.
- Streaming does not regress output normalization or acceptance behavior.
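The stale-generation criterion could be met with a generation counter, sketched below under stated assumptions. `StreamGate` and its methods are hypothetical names, and the sketch assumes all calls happen on a single actor (for example the main actor), so it does no locking of its own.

```swift
/// Hypothetical gate that lets the coordinator drop chunks from streams
/// that are no longer current. Assumes single-actor access.
final class StreamGate {
    private var currentID = 0

    /// Starts a new generation and invalidates any earlier one.
    func begin() -> Int {
        currentID += 1
        return currentID
    }

    /// Invalidates the active generation, e.g. on focus change or new input.
    func cancelAll() {
        currentID += 1
    }

    /// True only for chunks that belong to the active generation.
    func accepts(_ id: Int) -> Bool {
        id == currentID
    }
}
```

The coordinator would call `begin()` when it starts a stream, tag each incoming chunk with the returned ID, and ignore any chunk for which `accepts(_:)` returns false. This only filters late chunks; stopping the underlying generation still needs engine-level cancellation.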
Open Questions
- Should streaming be engine-specific or hidden behind one shared SuggestionGenerating interface?
- What minimum chunk is stable enough to show?
- Should the app keep generating after the user accepts the first shown chunk?
- Does streaming meaningfully improve perceived latency with current model sizes?
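To make the first question concrete, here is a rough sketch of what a single shared interface might look like. The method `suggestionChunks(for:)` and the `OneShotEngine` adapter are assumptions for illustration; the real SuggestionGenerating protocol may look quite different. The sketch uses Swift's standard `AsyncThrowingStream`, so an engine without streaming support can still conform by emitting its full response as one chunk.

```swift
/// Hypothetical streaming shape for the shared interface.
protocol SuggestionGenerating {
    func suggestionChunks(for prompt: String) -> AsyncThrowingStream<String, Error>
}

/// Hypothetical adapter for an engine that can only return a full response.
/// It yields the whole completion as a single chunk.
struct OneShotEngine: SuggestionGenerating {
    let generate: @Sendable (String) async throws -> String

    func suggestionChunks(for prompt: String) -> AsyncThrowingStream<String, Error> {
        let generate = self.generate
        return AsyncThrowingStream { continuation in
            let task = Task {
                do {
                    continuation.yield(try await generate(prompt))
                    continuation.finish()
                } catch {
                    continuation.finish(throwing: error)
                }
            }
            // If the consumer stops iterating, cancel the underlying work.
            continuation.onTermination = { _ in task.cancel() }
        }
    }
}
```

Under this shape the coordinator consumes every engine with the same `for try await` loop, and engine-specific behavior is limited to how many chunks arrive. The trade-off is that callers cannot tell a truly streaming engine from a one-shot one unless the interface also exposes that capability.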