feat: First-token logit gating to suppress chat-residue openers by Jam-Cai · Pull Request #84 · FuJacob/cotabby

Jam-Cai · 2026-04-30T01:35:58Z

Implements issue #24. Instruction-tuned models routinely begin inline-autocomplete suggestions with conversational openers ("Sure,", "Here's", "Of course", a leading newline) — chat-reply behavior leaking into a continuation context. This adds a per-tokenizer deny list applied as a -inf logit mask on the first sampled token only, eliminating the residue deterministically without prompt engineering.

What's in here

Mechanism

tabby/Support/FirstTokenDenyList.swift — declarative deny strings. Pure data; no runtime deps.
LlamaRuntimeCore resolves strings to token IDs once per model load against the loaded vocabulary, deduplicating leading-token collisions.
A second sampler chain prepends llama_sampler_init_logit_bias with -inf entries and is used only at tokenIndex == 0; subsequent tokens use the standard sampler so deny-listed tokens can still appear naturally mid-suggestion (e.g. "I" as a second word).
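The resolve-then-mask flow described above can be sketched in plain Swift. This is a minimal illustration under stated assumptions, not the code in LlamaRuntimeCore: the `Tokenize` closure, the toy tokenization table, and the names `resolveDenyTokenIDs` and `gatedLogits` are all hypothetical, and the real implementation applies the mask through llama_sampler_init_logit_bias rather than by editing a logit array directly.

```swift
// Hypothetical stand-in for the loaded model's tokenizer.
typealias Tokenize = (String) -> [Int32]

/// Resolves each deny string to the ID of its leading token,
/// dropping duplicates while keeping first-seen order.
func resolveDenyTokenIDs(_ denyStrings: [String], tokenize: Tokenize) -> [Int32] {
    var seen = Set<Int32>()
    var ids: [Int32] = []
    for string in denyStrings {
        guard let first = tokenize(string).first else { continue }
        if seen.insert(first).inserted { ids.append(first) }
    }
    return ids
}

/// Forces deny-listed logits to -inf, but only for the first generated token.
func gatedLogits(_ logits: [Float], denyIDs: [Int32], tokenIndex: Int) -> [Float] {
    guard tokenIndex == 0 else { return logits }
    var masked = logits
    for id in denyIDs where masked.indices.contains(Int(id)) {
        masked[Int(id)] = -Float.infinity
    }
    return masked
}

// Toy tokenizations (assumed, not from any real vocabulary).
// "Of course" and "Of" share the leading token 2.
let toyTokenizations: [String: [Int32]] = [
    "Sure": [0], "Here": [1], "Of course": [2, 3], "Of": [2], "\n": [4]
]
let toyTokenize: Tokenize = { toyTokenizations[$0] ?? [] }

let denyIDs = resolveDenyTokenIDs(["Sure", "Here", "Of course", "Of", "\n"],
                                  tokenize: toyTokenize)

let logits: [Float] = [2.0, 1.0, 0.5, 0.1, 0.3, 1.5]
let firstTokenLogits = gatedLogits(logits, denyIDs: denyIDs, tokenIndex: 0)
let laterTokenLogits = gatedLogits(logits, denyIDs: denyIDs, tokenIndex: 1)
```

In this sketch the mask touches only the first position, so a token such as "Here" remains available from the second token onward.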

Conservative starting list
"Sure", "Here", "Of course", "Certainly", "\n". Every entry is a token that has essentially no legitimate use as the first token of a continuation. Notably not included despite being suggestive in the issue: "I " (high false-positive rate in real prose), "Let me" / "Let" (too broad), language-specific openers — none of these have measurement to back them. The per-model switch is kept as the extension point so the list can grow from gate-fire log evidence rather than intuition.

Settings toggle
"Suppress Chat Openers", default on, persisted under tabbyFirstTokenGatingEnabled. Apple Intelligence does not expose logit-level control, so the toggle only renders for the Open Source engine.
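One detail a default-on toggle has to handle is that `UserDefaults.bool(forKey:)` returns `false` for a key that was never written. A minimal sketch of one way to read the setting, assuming the key name from this description; the helper name and the suite name are hypothetical, and the PR may instead register a default value:

```swift
import Foundation

// Key name taken from the PR description.
let gatingKey = "tabbyFirstTokenGatingEnabled"

/// Reads the toggle, treating a missing value as enabled (default on).
func isFirstTokenGatingEnabled(in defaults: UserDefaults) -> Bool {
    // bool(forKey:) yields false for an absent key, so check for absence first.
    guard defaults.object(forKey: gatingKey) != nil else { return true }
    return defaults.bool(forKey: gatingKey)
}

// Hypothetical suite name, used so the sketch does not touch real app settings.
let sketchDefaults = UserDefaults(suiteName: "first-token-gate-sketch") ?? .standard
sketchDefaults.removeObject(forKey: gatingKey)
let valueWhenUnset = isFirstTokenGatingEnabled(in: sketchDefaults)

sketchDefaults.set(false, forKey: gatingKey)
let valueAfterOptOut = isFirstTokenGatingEnabled(in: sketchDefaults)

sketchDefaults.removeObject(forKey: gatingKey)
```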

Debug logging (acceptance criterion)
Subsystem app.tabby, category first-token-gate. Logs the resolved deny list at model load, plus a per-fire signal whenever the un-gated argmax of the raw logits at position 0 is in the deny set. Argmax is checked rather than the sampled token because the sampled token passes through stochastic top-p / min-p stages — argmax precisely answers "did the gate prevent the model's strongest preference from being chosen?".

log stream --predicate 'subsystem == "app.tabby" AND category == "first-token-gate"' --level debug
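The argmax-based fire check described above can be sketched as follows. The function names and the toy logits are hypothetical, and the actual logging call is omitted; the sketch only shows the decision of whether a gate-fire signal would be emitted.

```swift
/// Index of the largest logit, or nil for an empty array.
func argmax(_ logits: [Float]) -> Int? {
    logits.indices.max { logits[$0] < logits[$1] }
}

/// True when the un-gated top choice is deny-listed, i.e. the gate
/// kept the model's strongest preference from being chosen.
func gateFired(rawLogits: [Float], denySet: Set<Int32>) -> Bool {
    guard let top = argmax(rawLogits), let id = Int32(exactly: top) else {
        return false
    }
    return denySet.contains(id)
}

let denySet: Set<Int32> = [0, 1, 2, 4]

// Top logit is token 0, which is deny-listed: the gate fires.
let firedOnOpener = gateFired(rawLogits: [2.0, 1.0, 0.5, 0.1, 0.3, 1.5],
                              denySet: denySet)

// Top logit is token 5, which is allowed: no fire, even though
// deny-listed tokens still carry some probability mass.
let firedOnContinuation = gateFired(rawLogits: [0.2, 1.0, 0.5, 0.1, 0.3, 1.5],
                                    denySet: denySet)
```

Checking the raw argmax keeps the signal deterministic: the same logits always give the same answer, regardless of what the stochastic sampling stages would have picked.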

Validation

Ran locally:

swiftlint --reporter github-actions-logging — no new warnings on changed files
xcodebuild ... build CODE_SIGNING_ALLOWED=NO — BUILD SUCCEEDED
xcodebuild test ... CODE_SIGNING_ALLOWED=NO — TEST SUCCEEDED

LlamaPromptRendererTests updated to pass the new isFirstTokenGatingEnabled argument to its SuggestionRequest constructor.

Screenshot of the new Settings toggle: TODO — local build couldn't capture due to Screen Recording permissions; will attach when running on a machine with the permission, or happy to ship without if you'd prefer to verify in your build.

Linked issues

Closes #24.

Risk / rollout notes

Default-on is intentional: the gate only applies a -inf mask to a small set of tokens whose argmax-occurrence rate at position 0 is exactly the misbehavior we want to fix. False-positive cost on the conservative starting list should be near zero.
The gate is a pure sampling-time addition; if it ever needs to be turned off in the field, the toggle disables the second sampler chain and falls back to today's behavior.
Resolution is per-model-load, so adding entries to the deny list later is a no-op until the next model load.

Instruction-tuned models routinely begin inline-autocomplete suggestions with conversational openers ("Sure,", "Here's", "Let me", a leading newline, or a bare "I ") that read as chat replies rather than text continuation. Prompt engineering reduces this unevenly; a per-tokenizer deny list applied as a -inf logit mask on the first sampled token eliminates it deterministically and is model-agnostic.

What this adds

- FirstTokenDenyList: declarative per-model deny strings (Gemma instruct, Qwen3, conservative default for unknown GGUFs). Pure data; no runtime deps.
- LlamaRuntimeCore resolves the strings to token IDs once per model load using the loaded vocabulary, deduplicating leading-token collisions.
- A second sampler chain prepends llama_sampler_init_logit_bias with -inf entries and is used only at tokenIndex == 0; subsequent tokens use the standard sampler so deny-listed tokens can still appear naturally mid-suggestion (e.g. "I" as a second word).
- Settings toggle "Suppress Chat Openers", default on, persisted under tabbyFirstTokenGatingEnabled. Apple Intelligence does not expose logit-level control, so the toggle only renders for the Open Source engine.
- Debug logging under subsystem=app.tabby category=first-token-gate: resolved deny list at model load, plus a per-fire signal whenever the un-gated argmax of the raw logits at position 0 is in the deny set. Argmax is checked rather than the sampled token because the sampled token passes through stochastic top-p/min-p stages; argmax answers the precise question "did the gate prevent the model's strongest preference from being chosen?".

Stream the gate with:

log stream --predicate 'subsystem == "app.tabby" AND category == "first-token-gate"' --level debug

Closes #24.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Deny list trimmed to evidence-supported entries only: "Sure", "Here", "Of course", "Certainly", "\n"

Dropped "I " (high false-positive rate in prose), "Let me"/"Let" (too broad), and the per-model Chinese entries (no measurement). The per-model switch is kept as the extension point so we can add specific entries later from gate-fire log evidence rather than intuition.

Also wraps the Settings caption to satisfy the 140-char line-length warning, removes trailing commas in the deny-list literal, and adds the new isFirstTokenGatingEnabled argument to the SuggestionRequest constructor in LlamaPromptRendererTests so the test target builds.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

FuJacob · 2026-04-30T23:58:05Z

+    static func denyStrings(for modelFilename: String) -> [String] {
+        switch modelFilename {
+        case "gemma-3-1b-it-Q4_K_M.gguf",
+             "gemma-3n-E4B-it-Q4_K_M.gguf",
+             "Qwen3-0.6B-Q4_K_M.gguf":
+            return Self.conservativeDenyStrings
+        default:
+            return Self.conservativeDenyStrings
+        }
+    }


All cases return the same deny-string list. Do we need this switch, then?

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Jam-Cai and others added 2 commits April 29, 2026 20:55

FuJacob reviewed Apr 30, 2026


Jam-Cai mentioned this pull request May 1, 2026

Suppress low-confidence first-token suggestions (#81) #88

Open

6 tasks

FuJacob requested a review from Copilot May 1, 2026 18:15

Copilot started reviewing on behalf of FuJacob May 1, 2026 18:16

Copilot AI reviewed May 1, 2026


Jam-Cai mentioned this pull request May 11, 2026

Surface first-token gate + confidence-suppression counters in Settings/Diagnostics #100

Closed

Jam-Cai wants to merge 2 commits into main from feat/first-token-logit-gating
