Suppress low-confidence first-token suggestions (#81) #88
Open
Jam-Cai wants to merge 1 commit into
Adds a runtime-level abort for inline suggestions when the model's top-1 raw-logit softmax probability at position 0 is below a user-tunable threshold.

Distinct axis from the deny-list gate in #24:
- gating masks specific tokens
- confidence suppression aborts the whole suggestion

Why raw-logit softmax (not the sampled token's probability): temperature/top-p/min-p reshape the distribution. A sampled-token probability of 0.9 after top-p can correspond to a raw distribution where the true top-1 was 0.05 (the model was confused; the sampler concentrated mass on a survivor). For inline autocomplete we want to suppress when *the model itself* was uncertain.

Implementation:
- new LlamaRuntimeError.lowConfidenceSuppression carrying probability, threshold, and the token text for diagnostics
- numerically-stable softmax in LlamaRuntimeCore (one O(nVocab) scan, only on first token, only when gate is enabled)
- LlamaSuggestionEngine catches the error and returns an empty SuggestionResult; prompt-cache hint tracker is preserved because the KV stays valid (we threw before any sampled-token decode)
- new published settings + UserDefaults keys with bounded clamp at the setter to keep the runtime's trust boundary clean
- Settings UI: toggle + slider that only renders when the gate is on

Defaults: toggle off, threshold 0.10. Off by default because the heuristic can hide useful suggestions when mistuned; we'll dial it in once the new `app.tabby` / `first-token-confidence` log gives us empirical data from real usage.
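The error path in the implementation list can be pictured with a minimal sketch. Only the case name `lowConfidenceSuppression`, its three payload fields, and the idea of returning an empty `SuggestionResult` come from the description above; the function names `firstToken` and `suggest`, their signatures, and the shape of `SuggestionResult` are hypothetical stand-ins, not the code in this PR.

```swift
import Foundation

// Only the case name and its three payload fields come from the PR text;
// every other name and shape here is an assumption for illustration.
enum LlamaRuntimeError: Error {
    case lowConfidenceSuppression(probability: Float, threshold: Float, tokenText: String)
}

// Hypothetical minimal result type; the real SuggestionResult likely carries more.
struct SuggestionResult: Equatable {
    var text: String
    static let empty = SuggestionResult(text: "")
}

// Stand-in for the runtime's first-token check (hypothetical signature).
func firstToken(top1Probability: Float, topToken: String,
                gateEnabled: Bool, threshold: Float) throws -> String {
    if gateEnabled && top1Probability < threshold {
        // Abort before any sampled token is decoded.
        throw LlamaRuntimeError.lowConfidenceSuppression(
            probability: top1Probability, threshold: threshold, tokenText: topToken)
    }
    return topToken
}

// Stand-in for the engine layer: a suppression becomes an empty result,
// while any other error still propagates to the caller.
func suggest(top1Probability: Float, topToken: String,
             gateEnabled: Bool, threshold: Float) throws -> SuggestionResult {
    do {
        let text = try firstToken(top1Probability: top1Probability, topToken: topToken,
                                  gateEnabled: gateEnabled, threshold: threshold)
        return SuggestionResult(text: text)
    } catch LlamaRuntimeError.lowConfidenceSuppression(let probability, let limit, let tokenText) {
        // Placeholder diagnostic; the PR's real logging is not reproduced here.
        print("suppressed \"\(tokenText)\": p=\(probability) < \(limit)")
        return .empty
    }
}
```

With the gate enabled and a threshold of 0.10, a top-1 probability of 0.05 yields `.empty` in this sketch, while the same call with the gate disabled returns the token unchanged.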
This was referenced May 11, 2026
@Jam-Cai does this retry the completion if it is suppressed? I think it is worse for the user to get no completion than to see a poor one.
Closes #81.
Stacked on #84. The base will retarget to `main` when #84 merges.

Summary
- LlamaSuggestionEngine catches the error and returns an empty `SuggestionResult` rather than surfacing an error: from the user's perspective Tabby simply shows no ghost text on uncertain prompts.
- Suppressions are logged under `subsystem=app.tabby category=first-token-confidence`.

Design choice: why softmax over raw logits, not the sampled token's probability
The issue suggested inspecting "the sampled token's probability." That's not really a confidence signal: temperature, top-p, and min-p reshape the distribution before sampling, so a sampled-token probability of 0.9 after top-p can correspond to a raw distribution where the true top-1 was 0.05 (the model was confused; the sampler concentrated mass on a survivor token). For inline autocomplete we want to suppress when the model itself was uncertain — i.e. when its raw distribution at position 0 is flat. So we compute a numerically-stable softmax over the full vocabulary's raw logits and compare its top-1 mass to the threshold.
This is also the same logit access we already use for #24's gate-fire log, so the cost story is identical: one O(nVocab) scan, only on the first token, only when the gate is enabled.
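As a rough illustration of the check described in this section, the sketch below computes the top-1 mass of a numerically stable softmax and compares it to the threshold. The function names and the logit values are made up for the example, and it uses a low temperature rather than top-p to show how a sampler stage can make a flat raw distribution look confident; it is not the code in `LlamaRuntimeCore`.

```swift
import Foundation

/// Numerically stable softmax: subtracting the maximum logit before
/// exponentiating keeps exp() from overflowing on large logits.
func softmax(_ logits: [Float]) -> [Float] {
    guard let maxLogit = logits.max() else { return [] }
    let exps = logits.map { exp($0 - maxLogit) }
    let total = exps.reduce(0, +)
    return exps.map { $0 / total }
}

/// Top-1 probability without materialising the whole distribution.
/// The largest logit contributes exp(0) = 1, so its probability is 1 / total.
func top1Probability(rawLogits: [Float]) -> Float {
    guard let maxLogit = rawLogits.max() else { return 0 }
    var total: Float = 0
    for logit in rawLogits { total += exp(logit - maxLogit) }
    return 1 / total
}

// A nearly flat 40-token distribution (illustrative values, not real model output).
let rawLogits: [Float] = (0..<40).map { Float(-0.05) * Float($0) }
let rawTop1 = top1Probability(rawLogits: rawLogits)

// The same logits after a low-temperature rescale, as a sampler stage might apply.
let temperature: Float = 0.02
let sharpenedTop1 = top1Probability(rawLogits: rawLogits.map { $0 / temperature })

let threshold: Float = 0.10  // default threshold from this PR
let shouldSuppress = rawTop1 < threshold
print(rawTop1, sharpenedTop1, shouldSuppress)
```

In this example the raw top-1 mass falls below the 0.10 threshold while the sharpened value is above 0.9, which is why the comparison is made against the raw distribution.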
Distinct from the chat-opener gate (#24)
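- Chat-opener gate (#24): masks specific tokens with a `-inf` logit bias. Surgical, evidence-backed, ships on by default.
- Confidence suppression (this PR): aborts the whole suggestion. Off by default.

A single generation can fire neither, one, or both.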
Defaults
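The commit message gives the defaults as toggle off and threshold 0.10, with a bounded clamp at the setter. A minimal sketch of what such a setter could look like follows. The class name, the two `UserDefaults` key strings, and the 0.01...0.99 range are assumptions made for illustration, and the sketch is a plain class rather than the published settings object the PR adds.

```swift
import Foundation

/// Hypothetical settings wrapper. Key names and clamp bounds are assumed,
/// only the defaults (off, 0.10) come from the PR text.
final class ConfidenceSettings {
    static let enabledKey = "firstTokenConfidenceEnabled"      // hypothetical key
    static let thresholdKey = "firstTokenConfidenceThreshold"  // hypothetical key
    static let defaultThreshold: Double = 0.10
    static let allowedRange: ClosedRange<Double> = 0.01...0.99 // assumed bounds

    private let defaults: UserDefaults

    init(defaults: UserDefaults = .standard) {
        self.defaults = defaults
    }

    var isEnabled: Bool {
        // A missing key reads as false, which gives "off by default".
        get { defaults.bool(forKey: ConfidenceSettings.enabledKey) }
        set { defaults.set(newValue, forKey: ConfidenceSettings.enabledKey) }
    }

    var threshold: Double {
        get {
            guard defaults.object(forKey: ConfidenceSettings.thresholdKey) != nil else {
                return ConfidenceSettings.defaultThreshold
            }
            return defaults.double(forKey: ConfidenceSettings.thresholdKey)
        }
        set {
            // Ignore NaN, then clamp so readers of the stored value never see an out-of-range threshold.
            guard !newValue.isNaN else { return }
            let range = ConfidenceSettings.allowedRange
            let clamped = min(max(newValue, range.lowerBound), range.upperBound)
            defaults.set(clamped, forKey: ConfidenceSettings.thresholdKey)
        }
    }
}
```

Clamping in the setter means the stored value is always in range, so code that reads the threshold does not need to validate it again.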
KV-cache correctness
The runtime throws `lowConfidenceSuppression` before any sampled-token decode, so the prompt KV stays valid. The engine layer deliberately keeps the prompt-cache hint tracker intact in this branch; the next request can still benefit from prefix reuse against this exact prompt. The runtime's defer-block also short-circuits its `shouldResetPromptCache = true` path for this specific error.

Test plan
- `log stream --predicate 'subsystem == "app.tabby" AND category == "first-token-confidence"'` for the suppression line.
- `screencapture`: happy to attach later, or @FuJacob can capture.

🤖 Generated with Claude Code