feat: First-token logit gating to suppress chat-residue openers#84
Open
Jam-Cai wants to merge 2 commits into
Conversation
Instruction-tuned models routinely begin inline-autocomplete suggestions with conversational openers ("Sure,", "Here's", "Let me", a leading newline, or a bare "I ") that read as chat replies rather than text continuation. Prompt engineering reduces this unevenly; a per-tokenizer deny list applied as a -inf logit mask on the first sampled token eliminates it deterministically and is model-agnostic.

What this adds

- FirstTokenDenyList: declarative per-model deny strings (Gemma instruct, Qwen3, conservative default for unknown GGUFs). Pure data; no runtime deps.
- LlamaRuntimeCore resolves the strings to token IDs once per model load using the loaded vocabulary, deduplicating leading-token collisions.
- A second sampler chain prepends llama_sampler_init_logit_bias with -inf entries and is used only at tokenIndex == 0; subsequent tokens use the standard sampler so deny-listed tokens can still appear naturally mid-suggestion (e.g. "I" as a second word).
- Settings toggle "Suppress Chat Openers", default on, persisted under tabbyFirstTokenGatingEnabled. Apple Intelligence does not expose logit-level control, so the toggle only renders for the Open Source engine.
- Debug logging under subsystem=app.tabby category=first-token-gate: resolved deny list at model load, plus a per-fire signal whenever the un-gated argmax of the raw logits at position -1 is in the deny set. Argmax is checked rather than the sampled token because the sampled token passes through stochastic top-p/min-p stages; argmax answers the precise question "did the gate prevent the model's strongest preference from being chosen?".

Stream the gate with:

    log stream --predicate 'subsystem == "app.tabby" AND category == "first-token-gate"' --level debug

Closes #24.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
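The first-token gate described in this commit can be illustrated with a minimal, dependency-free sketch. It is not the PR's implementation: the real code builds a second llama.cpp sampler chain, whereas this version masks a plain array of logits. The names `gatedLogits` and `argmax`, and the toy five-token vocabulary, are assumptions made up for the example.

```swift
// Hypothetical sketch: these names are illustrative, not taken from the PR.

/// Returns a copy of `logits` with every deny-listed token set to -inf,
/// but only when the first token of a suggestion is being sampled.
func gatedLogits(_ logits: [Float], tokenIndex: Int, denyTokenIDs: Set<Int>) -> [Float] {
    // Later tokens pass through untouched, so deny-listed tokens
    // can still appear mid-suggestion.
    guard tokenIndex == 0 else { return logits }
    var masked = logits
    for id in denyTokenIDs where masked.indices.contains(id) {
        masked[id] = -Float.infinity
    }
    return masked
}

/// Greedy pick, standing in for the real sampler stages (top-p, min-p, ...).
func argmax(_ logits: [Float]) -> Int? {
    logits.indices.max(by: { logits[$0] < logits[$1] })
}

// Toy vocabulary of five tokens; ID 1 plays the role of a chat opener.
let logits: [Float] = [0.2, 3.1, 1.4, 0.9, -0.5]
let deny: Set<Int> = [1]

// At tokenIndex 0 the opener is masked and the runner-up wins.
let first = argmax(gatedLogits(logits, tokenIndex: 0, denyTokenIDs: deny))
// At any later index the same logits are left alone.
let later = argmax(gatedLogits(logits, tokenIndex: 3, denyTokenIDs: deny))
```

Because the mask is `-inf` rather than a finite penalty, a masked token gets zero probability after softmax, so no later sampling stage can select it.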
Deny list trimmed to evidence-supported entries only: "Sure", "Here", "Of course", "Certainly", "\n".

Dropped "I " (high false-positive rate in prose), "Let me"/"Let" (too broad), and the per-model Chinese entries (no measurement). The per-model switch is kept as the extension point so we can add specific entries later from gate-fire log evidence rather than intuition.

Also wraps the Settings caption to satisfy the 140-char line-length warning, removes trailing commas in the deny-list literal, and adds the new isFirstTokenGatingEnabled argument to the SuggestionRequest constructor in LlamaPromptRendererTests so the test target builds.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
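Resolving deny strings to token IDs and "deduplicating leading-token collisions" might look roughly like the sketch below. The `Tokenize` closure type, the word-level toy vocabulary, and the extra entry "Of note" are all invented here to make a collision visible; the PR resolves against the loaded model vocabulary instead.

```swift
// Hypothetical sketch: the tokenizer and vocabulary are toy stand-ins.

/// Assumed tokenizer shape: maps text to a sequence of token IDs.
typealias Tokenize = (String) -> [Int]

/// Resolves each deny string to the ID of its leading token, dropping
/// duplicates while preserving first-seen order.
func resolveDenyTokenIDs(_ denyStrings: [String], tokenize: Tokenize) -> [Int] {
    var seen = Set<Int>()
    var resolved: [Int] = []
    for text in denyStrings {
        // Only the first token matters: the gate acts on one sampling step.
        guard let leading = tokenize(text).first else { continue }
        if seen.insert(leading).inserted {
            resolved.append(leading)
        }
    }
    return resolved
}

// Toy word-level vocabulary, invented for this example.
let toyVocab: [String: Int] = [
    "Sure": 10, "Here": 11, "Of": 12, "course": 13, "Certainly": 14, "\n": 15
]
let toyTokenize: Tokenize = { text in
    text.split(separator: " ").compactMap { toyVocab[String($0)] }
}

// The five entries from the commit, plus "Of note" (not in the PR) to
// show that two strings sharing a leading token resolve to one ID.
let denyStrings = ["Sure", "Here", "Of course", "Certainly", "\n", "Of note"]
let denyTokenIDs = resolveDenyTokenIDs(denyStrings, tokenize: toyTokenize)
```

A real vocabulary may hold several tokens for the same surface word (for instance with and without a leading space), which this toy tokenizer does not model.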
FuJacob reviewed on Apr 30, 2026
Comment on lines +51 to +60
    static func denyStrings(for modelFilename: String) -> [String] {
        switch modelFilename {
        case "gemma-3-1b-it-Q4_K_M.gguf",
             "gemma-3n-E4B-it-Q4_K_M.gguf",
             "Qwen3-0.6B-Q4_K_M.gguf":
            return Self.conservativeDenyStrings
        default:
            return Self.conservativeDenyStrings
        }
    }
Owner
All cases return the same deny-string list; do we need the switch, then?
Implements issue #24. Instruction-tuned models routinely begin inline-autocomplete suggestions with conversational openers ("Sure,", "Here's", "Of course", a leading newline): chat-reply behavior leaking into a continuation context. This adds a per-tokenizer deny list applied as a `-inf` logit mask on the first sampled token only, eliminating the residue deterministically without prompt engineering.

What's in here
Mechanism
- `tabby/Support/FirstTokenDenyList.swift`: declarative deny strings. Pure data; no runtime deps.
- `LlamaRuntimeCore` resolves strings to token IDs once per model load against the loaded vocabulary, deduplicating leading-token collisions.
- A second sampler chain prepends `llama_sampler_init_logit_bias` with `-inf` entries and is used only at `tokenIndex == 0`; subsequent tokens use the standard sampler so deny-listed tokens can still appear naturally mid-suggestion (e.g. "I" as a second word).

Conservative starting list
`"Sure"`, `"Here"`, `"Of course"`, `"Certainly"`, `"\n"`. Every entry is a token that has essentially no legitimate use as the first token of a continuation. Notably not included, despite being suggested in the issue: `"I "` (high false-positive rate in real prose), `"Let me"`/`"Let"` (too broad), and language-specific openers. None of these have measurements to back them. The per-model switch is kept as the extension point so the list can grow from gate-fire log evidence rather than intuition.

Settings toggle
"Suppress Chat Openers", default on, persisted under `tabbyFirstTokenGatingEnabled`. Apple Intelligence does not expose logit-level control, so the toggle only renders for the Open Source engine.

Debug logging (acceptance criterion)
Subsystem `app.tabby`, category `first-token-gate`. Logs the resolved deny list at model load, plus a per-fire signal whenever the un-gated argmax of the raw logits at position 0 is in the deny set. Argmax is checked rather than the sampled token because the sampled token passes through stochastic top-p / min-p stages; argmax precisely answers "did the gate prevent the model's strongest preference from being chosen?".

Validation
Ran locally:
- `swiftlint --reporter github-actions-logging`: no new warnings on changed files
- `xcodebuild ... build CODE_SIGNING_ALLOWED=NO`: BUILD SUCCEEDED
- `xcodebuild test ... CODE_SIGNING_ALLOWED=NO`: TEST SUCCEEDED

`LlamaPromptRendererTests` updated to pass the new `isFirstTokenGatingEnabled` argument to its `SuggestionRequest` constructor.

Screenshot of the new Settings toggle: TODO. The local build couldn't capture one due to Screen Recording permissions; will attach when running on a machine with the permission, or happy to ship without if you'd prefer to verify in your build.
Linked issues
Closes #24.
Risk / rollout notes
`-inf` mask to a small set of tokens whose argmax-occurrence rate at position 0 is exactly the misbehavior we want to fix. False-positive cost on the conservative starting list should be near zero.
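The argmax-occurrence rate mentioned above could be estimated from the gate-fire signal along the following lines. This is a sketch under assumptions: `gateFire` and `fireRate` are hypothetical helpers, the logits are made-up toy values, and the actual logging call is omitted.

```swift
// Hypothetical sketch: helper names and sample logits are invented.

/// Returns the un-gated argmax token ID when it is deny-listed, i.e. when
/// the gate blocked the model's strongest first-token preference.
func gateFire(rawLogits: [Float], denyTokenIDs: Set<Int>) -> Int? {
    guard let top = rawLogits.indices.max(by: { rawLogits[$0] < rawLogits[$1] }),
          denyTokenIDs.contains(top) else { return nil }
    return top
}

/// Fraction of requests whose un-gated first-token argmax was deny-listed.
func fireRate(_ requests: [[Float]], denyTokenIDs: Set<Int>) -> Double {
    guard !requests.isEmpty else { return 0 }
    let fires = requests.filter {
        gateFire(rawLogits: $0, denyTokenIDs: denyTokenIDs) != nil
    }.count
    return Double(fires) / Double(requests.count)
}

// Four toy requests over a three-token vocabulary; token 1 is deny-listed.
let samples: [[Float]] = [
    [0.1, 2.0, 0.3],  // argmax 1: gate fires
    [1.5, 0.2, 0.1],  // argmax 0: no fire
    [0.0, 0.9, 0.8],  // argmax 1: gate fires
    [0.4, 0.1, 2.2]   // argmax 2: no fire
]
let rate = fireRate(samples, denyTokenIDs: [1])
```

Tracking the rate per deny entry, rather than in aggregate, would give the evidence needed to grow or shrink the list.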