Suppress low-confidence first-token suggestions (#81) #88

Open
Jam-Cai wants to merge 1 commit into feat/first-token-logit-gating from feat/first-token-confidence-gating

@Jam-Cai
Collaborator

Jam-Cai commented May 1, 2026

Closes #81.

Stacked on #84. The base will retarget to main when #84 merges.

Summary

  • Adds a runtime-level confidence gate that aborts inline suggestions when the model's top-1 raw-logit softmax probability at position 0 is below a user-tunable threshold.
  • Returns an empty SuggestionResult rather than surfacing an error: from the user's perspective Tabby simply shows no ghost text on uncertain prompts.
  • New Settings UI: toggle + threshold slider (only renders when the gate is enabled).
  • New filterable log channel: subsystem=app.tabby category=first-token-confidence.

Design choice — why softmax over raw logits, not the sampled token's probability

The issue suggested inspecting "the sampled token's probability." That's not really a confidence signal: temperature, top-p, and min-p reshape the distribution before sampling, so a sampled-token probability of 0.9 after top-p can correspond to a raw distribution where the true top-1 was 0.05 (the model was confused; the sampler concentrated mass on a survivor token). For inline autocomplete we want to suppress when the model itself was uncertain — i.e. when its raw distribution at position 0 is flat. So we compute a numerically-stable softmax over the full vocabulary's raw logits and compare its top-1 mass to the threshold.

This is also the same logit access we already use for #24's gate-fire log, so the cost story is identical: one O(nVocab) scan, only on the first token, only when the gate is enabled.
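As a rough illustration of the check described above, here is a minimal sketch of a numerically stable top-1 softmax over raw logits. The names `topOneProbability` and `shouldSuppress` are hypothetical and not taken from the PR, and the sketch uses two linear passes for readability rather than whatever scan the runtime actually performs.

```swift
import Foundation

/// Hypothetical helper (not the PR's actual code): returns the index and
/// softmax probability of the largest raw logit, or nil when the input is
/// empty or the maximum found is not finite.
func topOneProbability(rawLogits: [Float]) -> (index: Int, probability: Float)? {
    guard let first = rawLogits.first else { return nil }

    // Pass 1: locate the maximum logit and its index.
    var topIndex = 0
    var maxLogit = first
    for (index, logit) in rawLogits.enumerated() where logit > maxLogit {
        maxLogit = logit
        topIndex = index
    }
    guard maxLogit.isFinite else { return nil }

    // Pass 2: sum exp(logit - max). Every exponent is <= 0, so exp cannot overflow.
    var denominator: Float = 0
    for logit in rawLogits {
        denominator += exp(logit - maxLogit)
    }

    // The top-1 numerator is exp(0) == 1.
    return (topIndex, 1 / denominator)
}

/// Hypothetical gate check: suppress when the top-1 mass is below the threshold.
func shouldSuppress(rawLogits: [Float], threshold: Float) -> Bool {
    guard let top = topOneProbability(rawLogits: rawLogits) else { return false }
    return top.probability < threshold
}
```

For example, a perfectly flat distribution over 20 tokens has a top-1 mass of 0.05, so it would be suppressed at the 0.10 default, while a sharply peaked distribution would pass.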

Distinct from the chat-opener gate (#24)

  • Gating (#24, "Logit gating on first generated token"): masks specific deny-listed tokens with a -inf logit bias. Surgical, evidence-backed, ships on by default.
  • Confidence suppression (this PR): aborts the whole suggestion when the distribution is flat. Heuristic, opt-in, ships off by default.

A single generation can fire neither, one, or both.
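To make the contrast concrete, the deny-list style of gate can be pictured as a per-token logit bias, as opposed to a whole-suggestion abort. The sketch below is an assumption about the general technique only; `applyDenyList` and its parameters are hypothetical names and do not come from #24.

```swift
/// Hypothetical sketch of a deny-list mask: each listed token ID has its
/// logit set to -infinity so it can never be sampled. All other logits are
/// left untouched, which is why this gate never aborts a suggestion by itself.
func applyDenyList(to logits: inout [Float], deniedTokenIDs: Set<Int>) {
    for tokenID in deniedTokenIDs where logits.indices.contains(tokenID) {
        logits[tokenID] = -.infinity
    }
}
```

The confidence gate, by comparison, changes no logits at all: it only reads the distribution and decides whether to abort.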

Defaults

  • Toggle: off (heuristic can hide useful suggestions if mistuned).
  • Threshold: 0.10 (deliberately gentle starting point — local models often peak at 0.30–0.60 on unambiguous continuations, so 0.10 catches the genuinely-confused cases). Tunable via slider in Settings (range 0.0–0.5, step 0.01).
  • Both persist under new UserDefaults keys; values clamped at the setter.
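A minimal sketch of what "clamped at the setter" could look like, assuming a plain wrapper around `UserDefaults`. The class name and both key strings are hypothetical (the PR does not list the real keys), and the actual settings are described as published properties, which this sketch leaves out.

```swift
import Foundation

/// Hypothetical settings wrapper; key names are placeholders, not the PR's keys.
final class ExampleConfidenceGateSettings {
    static let enabledKey = "example.firstTokenConfidenceGateEnabled"
    static let thresholdKey = "example.firstTokenConfidenceThreshold"
    static let thresholdRange: ClosedRange<Double> = 0.0...0.5
    static let defaultThreshold = 0.10

    private let defaults: UserDefaults

    init(defaults: UserDefaults = .standard) {
        self.defaults = defaults
    }

    /// Off by default: bool(forKey:) returns false when the key is unset.
    var isEnabled: Bool {
        get { defaults.bool(forKey: Self.enabledKey) }
        set { defaults.set(newValue, forKey: Self.enabledKey) }
    }

    /// Values are clamped on write, so readers can trust the stored value.
    var threshold: Double {
        get {
            guard defaults.object(forKey: Self.thresholdKey) != nil else {
                return Self.defaultThreshold
            }
            return defaults.double(forKey: Self.thresholdKey)
        }
        set { defaults.set(Self.clamp(newValue), forKey: Self.thresholdKey) }
    }

    /// Clamps into the slider range; NaN falls back to the default.
    static func clamp(_ value: Double) -> Double {
        guard !value.isNaN else { return defaultThreshold }
        return min(max(value, thresholdRange.lowerBound), thresholdRange.upperBound)
    }
}
```

Clamping on write rather than on read means the runtime never has to validate the threshold itself.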

KV-cache correctness

The runtime throws lowConfidenceSuppression before any sampled-token decode, so the prompt KV stays valid. The engine layer deliberately keeps the prompt-cache hint tracker intact in this branch — the next request can still benefit from prefix reuse against this exact prompt. The runtime's defer-block also short-circuits its shouldResetPromptCache = true path for this specific error.
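The control flow described above can be sketched roughly as follows. Every type here is a simplified stand-in with an `Example` prefix, not the real `LlamaRuntimeError`, `SuggestionResult`, or `LlamaSuggestionEngine`, and clearing the hint on other errors is an assumption made only to show the contrast.

```swift
/// Stand-in for the runtime error; the associated values mirror the ones the
/// PR says the real error carries (probability, threshold, token text).
enum ExampleRuntimeError: Error {
    case lowConfidenceSuppression(probability: Float, threshold: Float, tokenText: String)
    case decodeFailed
}

struct ExampleSuggestionResult: Equatable {
    var text: String
    static let empty = ExampleSuggestionResult(text: "")
}

final class ExampleSuggestionEngine {
    /// Stand-in for the prompt-cache hint tracker.
    private(set) var promptCacheHint: String?
    private(set) var suppressionLog: [String] = []

    func suggest(prompt: String, generate: (String) throws -> String) -> ExampleSuggestionResult {
        promptCacheHint = prompt
        do {
            let text = try generate(prompt)
            return ExampleSuggestionResult(text: text)
        } catch let ExampleRuntimeError.lowConfidenceSuppression(probability, threshold, tokenText) {
            // The prompt KV is still valid, so the hint is deliberately kept.
            suppressionLog.append(
                "suppressed \(tokenText.debugDescription): p=\(probability) < \(threshold)")
            return .empty
        } catch {
            // Assumption: any other failure invalidates the cached prompt state.
            promptCacheHint = nil
            return .empty
        }
    }
}
```

In this sketch the caller sees an empty result in both failure cases, but only the low-confidence branch leaves the cache hint available for the next request.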

Test plan

  • swiftlint clean for the modified files (only pre-existing warnings remain).
  • xcodebuild build succeeds.
  • xcodebuild test succeeds (LlamaPromptRendererTests updated for the new SuggestionRequest args).
  • Manual: enable the gate at threshold 0.10 with a prompt that's known to confuse the model and verify the suggestion is suppressed; check log stream --predicate 'subsystem == "app.tabby" AND category == "first-token-confidence"' for the suppression line.
  • Manual: disable the gate and verify suggestions return as before.
  • TODO: screenshot of the new Settings UI (toggle + threshold slider). Terminal lacks Screen Recording permission in this session so I couldn't run screencapture — happy to attach later, or @FuJacob can capture.

🤖 Generated with Claude Code

Adds a runtime-level abort for inline suggestions when the model's top-1
raw-logit softmax probability at position 0 is below a user-tunable
threshold. Distinct axis from the deny-list gate in #24:
- gating masks specific tokens
- confidence suppression aborts the whole suggestion

Why raw-logit softmax (not the sampled token's probability):
temperature/top-p/min-p reshape the distribution. A sampled-token
probability of 0.9 after top-p can correspond to a raw distribution
where the true top-1 was 0.05 (the model was confused; the sampler
concentrated mass on a survivor). For inline autocomplete we want to
suppress when *the model itself* was uncertain.

Implementation:
- new LlamaRuntimeError.lowConfidenceSuppression carrying probability,
  threshold, and the token text for diagnostics
- numerically-stable softmax in LlamaRuntimeCore (one O(nVocab) scan,
  only on first token, only when gate is enabled)
- LlamaSuggestionEngine catches the error and returns an empty
  SuggestionResult; prompt-cache hint tracker is preserved because the
  KV stays valid (we threw before any sampled-token decode)
- new published settings + UserDefaults keys with bounded clamp at the
  setter to keep the runtime's trust boundary clean
- Settings UI: toggle + slider that only renders when the gate is on

Defaults: toggle off, threshold 0.10. Off by default because the
heuristic can hide useful suggestions when mistuned; we'll dial it in
once the new `app.tabby` / `first-token-confidence` log gives us
empirical data from real usage.

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@Jam-Cai
Collaborator Author

Jam-Cai commented May 11, 2026

Tracking reminder: this PR is currently based on `feat/first-token-logit-gating` (PR #84) so the diff stays scoped. Once #84 merges, retarget the base of this PR to `main` before merging — otherwise the merge will pull in stale base-branch state.

```
gh pr edit 88 --base main
```

@FuJacob
Owner

FuJacob commented May 23, 2026

@Jam-Cai does this retry the completion if it is suppressed? I think it is worse for the user to get no completion at all than to see a poor one.