Suppress low-confidence first-token suggestions (#81) by Jam-Cai · Pull Request #88 · FuJacob/cotabby

Jam-Cai · 2026-05-01T00:50:46Z

Closes #81.

Stacked on #84. The base will retarget to main when #84 merges.

Summary

Adds a runtime-level confidence gate that aborts inline suggestions when the model's top-1 raw-logit softmax probability at position 0 is below a user-tunable threshold.
Returns an empty SuggestionResult rather than surfacing an error: from the user's perspective Tabby simply shows no ghost text on uncertain prompts.
New Settings UI: toggle + threshold slider (only renders when the gate is enabled).
New filterable log channel: subsystem=app.tabby category=first-token-confidence.

Design choice — why softmax over raw logits, not the sampled token's probability

The issue suggested inspecting "the sampled token's probability." That's not really a confidence signal: temperature, top-p, and min-p reshape the distribution before sampling, so a sampled-token probability of 0.9 after top-p can correspond to a raw distribution where the true top-1 was 0.05 (the model was confused; the sampler concentrated mass on a survivor token). For inline autocomplete we want to suppress when the model itself was uncertain — i.e. when its raw distribution at position 0 is flat. So we compute a numerically-stable softmax over the full vocabulary's raw logits and compare its top-1 mass to the threshold.
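To make the reshaping effect concrete, here is a minimal sketch with invented numbers. The 40-token vocabulary, the logit values, and the sampler settings (temperature 0.2, min-p 0.05) are all assumptions chosen for illustration, and `softmax` and `applyMinP` are hypothetical helpers, not functions from this PR.

```swift
import Foundation

/// Numerically stable softmax with an optional temperature.
func softmax(_ logits: [Double], temperature: Double = 1.0) -> [Double] {
    precondition(!logits.isEmpty && temperature > 0)
    let scaled = logits.map { $0 / temperature }
    let maxLogit = scaled.max()!                      // subtract the max to avoid overflow
    let exps = scaled.map { exp($0 - maxLogit) }
    let total = exps.reduce(0, +)
    return exps.map { $0 / total }
}

/// min-p filter: zero out tokens below `minP` times the peak probability, then renormalize.
func applyMinP(_ probs: [Double], minP: Double) -> [Double] {
    let cutoff = (probs.max() ?? 0) * minP
    let kept = probs.map { $0 >= cutoff ? $0 : 0 }
    let total = kept.reduce(0, +)
    return kept.map { $0 / total }
}

// Hypothetical vocabulary: one slight favourite and 39 equally plausible alternatives.
let rawLogits = [0.7] + Array(repeating: 0.0, count: 39)

// Top-1 mass of the raw distribution (what the confidence gate looks at).
let rawTop1 = softmax(rawLogits).max()!

// Top-1 mass after assumed sampler settings: temperature 0.2, then min-p 0.05.
let reshapedTop1 = applyMinP(softmax(rawLogits, temperature: 0.2), minP: 0.05).max()!

print("raw top-1:", rawTop1)
print("post-sampler top-1:", reshapedTop1)
```

With these made-up inputs the raw top-1 mass is about 0.05, while the post-sampler distribution puts essentially all of its mass on the same token, which is why the sampled token's probability says little about how sure the model was.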

This is also the same logit access we already use for #24's gate-fire log, so the cost story is identical: one O(nVocab) scan, only on the first token, only when the gate is enabled.
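The top-1 check itself can be sketched as follows. This is an illustration under assumptions, not the code in `LlamaRuntimeCore`: the function names are hypothetical, the logits are taken to be a plain `[Float]` of length nVocab, and the sketch uses two linear passes (find the max, then sum the exponentials) rather than whatever scan the PR actually performs.

```swift
import Foundation

/// Returns the index and softmax probability of the largest logit,
/// or nil when the input is empty or has no finite maximum.
/// Hypothetical helper; accumulates in Double for stability over large vocabularies.
func top1Probability(of logits: [Float]) -> (index: Int, probability: Double)? {
    guard var maxIndex = logits.indices.first else { return nil }

    // Pass 1: locate the largest logit.
    for i in logits.indices where logits[i] > logits[maxIndex] {
        maxIndex = i
    }
    let maxLogit = Double(logits[maxIndex])
    guard maxLogit.isFinite else { return nil }

    // Pass 2: softmax denominator, shifted by the max so exp never overflows.
    // The top-1 numerator is exp(0) = 1, so its probability is 1 / denominator.
    var denominator = 0.0
    for logit in logits {
        denominator += exp(Double(logit) - maxLogit)
    }
    return (maxIndex, 1.0 / denominator)
}

/// True when the raw distribution is flat enough that the suggestion should be dropped.
func shouldSuppress(logits: [Float], threshold: Double) -> Bool {
    guard let top = top1Probability(of: logits) else { return false }
    return top.probability < threshold
}
```

Because the top-1 numerator is always 1 after the shift, only the denominator needs to be accumulated; the full probability vector is never materialized.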

Distinct from the chat-opener gate (#24)

Gating (Logit gating on first generated token #24): masks specific deny-listed tokens with -inf logit bias. Surgical, evidence-backed, ships on by default.
Confidence suppression (this PR): aborts the whole suggestion when the distribution is flat. Heuristic, opt-in, ships off by default.

A single generation can fire neither, one, or both.

Defaults

Toggle: off (heuristic can hide useful suggestions if mistuned).
Threshold: 0.10 (deliberately gentle starting point — local models often peak at 0.30–0.60 on unambiguous continuations, so 0.10 catches the genuinely-confused cases). Tunable via slider in Settings (range 0.0–0.5, step 0.01).
Both persist under new UserDefaults keys; values clamped at the setter.
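A minimal sketch of clamping at the setter, assuming a plain struct wrapper around `UserDefaults`. The type name and both key strings are hypothetical (the PR does not list the real keys here), and the actual settings are published properties rather than this simplified form.

```swift
import Foundation

struct ConfidenceGateSettings {
    // Hypothetical key names, for illustration only.
    static let enabledKey = "firstTokenConfidenceGateEnabled"
    static let thresholdKey = "firstTokenConfidenceThreshold"

    // Slider range and default as described above.
    static let thresholdRange: ClosedRange<Double> = 0.0...0.5
    static let defaultThreshold = 0.10

    let defaults: UserDefaults

    /// Off by default: `bool(forKey:)` returns false for a missing key.
    var isEnabled: Bool {
        get { defaults.bool(forKey: Self.enabledKey) }
        nonmutating set { defaults.set(newValue, forKey: Self.enabledKey) }
    }

    /// Out-of-range writes are clamped before they are persisted,
    /// so readers never see a value outside `thresholdRange`.
    var threshold: Double {
        get {
            guard defaults.object(forKey: Self.thresholdKey) != nil else {
                return Self.defaultThreshold
            }
            return defaults.double(forKey: Self.thresholdKey)
        }
        nonmutating set {
            let range = Self.thresholdRange
            let clamped = min(max(newValue, range.lowerBound), range.upperBound)
            defaults.set(clamped, forKey: Self.thresholdKey)
        }
    }
}
```

Clamping on write rather than on read means the runtime can use the stored value without validating it again.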

KV-cache correctness

The runtime throws lowConfidenceSuppression before any sampled-token decode, so the prompt KV stays valid. The engine layer deliberately keeps the prompt-cache hint tracker intact in this branch — the next request can still benefit from prefix reuse against this exact prompt. The runtime's defer-block also short-circuits its shouldResetPromptCache = true path for this specific error.
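The control flow described above might look roughly like this. Every type below is a simplified stand-in: `RuntimeSketch`, `promptCacheIsValid`, the `decodeFailed` case, and the `suggest` function are hypothetical, and only the names `LlamaRuntimeError.lowConfidenceSuppression`, `SuggestionResult`, and `shouldResetPromptCache` come from the PR text.

```swift
// Simplified stand-ins, not the project's real declarations.
enum LlamaRuntimeError: Error {
    case lowConfidenceSuppression(probability: Double, threshold: Double, tokenText: String)
    case decodeFailed // hypothetical "real" failure, for contrast
}

struct SuggestionResult {
    var text: String
    static let empty = SuggestionResult(text: "")
}

final class RuntimeSketch {
    /// Hypothetical flag standing in for "the prompt KV is still usable".
    private(set) var promptCacheIsValid = true

    func generate(top1Probability: Double, top1Text: String,
                  threshold: Double?, failDecode: Bool = false) throws -> String {
        var shouldResetPromptCache = true
        // Runs on every exit path; reads the flag's value at exit time.
        defer { if shouldResetPromptCache { promptCacheIsValid = false } }

        if let threshold = threshold, top1Probability < threshold {
            // Thrown before any sampled-token decode, so the cache is untouched.
            shouldResetPromptCache = false
            throw LlamaRuntimeError.lowConfidenceSuppression(
                probability: top1Probability, threshold: threshold, tokenText: top1Text)
        }
        if failDecode { throw LlamaRuntimeError.decodeFailed }

        shouldResetPromptCache = false
        return top1Text
    }
}

/// Engine-level wrapper: a suppression becomes an empty result, not a user-visible error.
func suggest(using runtime: RuntimeSketch, top1Probability: Double,
             top1Text: String, threshold: Double?) -> SuggestionResult {
    do {
        let text = try runtime.generate(top1Probability: top1Probability,
                                        top1Text: top1Text, threshold: threshold)
        return SuggestionResult(text: text)
    } catch LlamaRuntimeError.lowConfidenceSuppression(let probability, let limit, let tokenText) {
        print("suppressed: p=\(probability) < \(limit) for token \"\(tokenText)\"")
        return .empty
    } catch {
        return .empty
    }
}
```

In this sketch a suppression leaves `promptCacheIsValid` set, whereas the hypothetical `decodeFailed` path falls through to the `defer` with the reset flag still true.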

Test plan

swiftlint clean for the modified files (only pre-existing warnings remain).
xcodebuild build succeeds.
xcodebuild test succeeds (LlamaPromptRendererTests updated for the new SuggestionRequest args).
Manual: enable the gate at threshold 0.10 with a prompt that's known to confuse the model and verify the suggestion is suppressed; check log stream --predicate 'subsystem == "app.tabby" AND category == "first-token-confidence"' for the suppression line.
Manual: disable the gate and verify suggestions return as before.
TODO: screenshot of the new Settings UI (toggle + threshold slider). Terminal lacks Screen Recording permission in this session so I couldn't run screencapture — happy to attach later, or @FuJacob can capture.

🤖 Generated with Claude Code

Adds a runtime-level abort for inline suggestions when the model's top-1 raw-logit softmax probability at position 0 is below a user-tunable threshold.

Distinct axis from the deny-list gate in #24:
- gating masks specific tokens
- confidence suppression aborts the whole suggestion

Why raw-logit softmax (not the sampled token's probability): temperature/top-p/min-p reshape the distribution. A sampled-token probability of 0.9 after top-p can correspond to a raw distribution where the true top-1 was 0.05 (the model was confused; the sampler concentrated mass on a survivor). For inline autocomplete we want to suppress when *the model itself* was uncertain.

Implementation:
- new LlamaRuntimeError.lowConfidenceSuppression carrying probability, threshold, and the token text for diagnostics
- numerically-stable softmax in LlamaRuntimeCore (one O(nVocab) scan, only on first token, only when gate is enabled)
- LlamaSuggestionEngine catches the error and returns an empty SuggestionResult; prompt-cache hint tracker is preserved because the KV stays valid (we threw before any sampled-token decode)
- new published settings + UserDefaults keys with bounded clamp at the setter to keep the runtime's trust boundary clean
- Settings UI: toggle + slider that only renders when the gate is on

Defaults: toggle off, threshold 0.10. Off by default because the heuristic can hide useful suggestions when mistuned; we'll dial it in once the new `app.tabby` / `first-token-confidence` log gives us empirical data from real usage.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Jam-Cai · 2026-05-11T01:33:52Z

Tracking reminder: this PR is currently based on `feat/first-token-logit-gating` (PR #84) so the diff stays scoped. Once #84 merges, retarget the base of this PR to `main` before merging — otherwise the merge will pull in stale base-branch state.

```
gh pr edit 88 --base main
```

FuJacob · 2026-05-23T17:15:05Z

@Jam-Cai does this retry the completion if it is suppressed? I think it is worse for the user to get no completion at all than to see a poor one.

FuJacob requested a review from Copilot May 1, 2026 18:15

Copilot started reviewing on behalf of FuJacob May 1, 2026 18:16

Copilot AI reviewed May 1, 2026

This was referenced May 11, 2026

Log first-token top-1 probability distribution for confidence-gate threshold tuning #98

Closed

Surface first-token gate + confidence-suppression counters in Settings/Diagnostics #100

Closed

Jam-Cai wants to merge 1 commit into feat/first-token-logit-gating from feat/first-token-confidence-gating
