
feat: First-token logit gating to suppress chat-residue openers #84

Open
Jam-Cai wants to merge 2 commits into main from feat/first-token-logit-gating


Collaborator

Jam-Cai commented Apr 30, 2026

Implements issue #24. Instruction-tuned models routinely begin inline-autocomplete suggestions with conversational openers ("Sure,", "Here's", "Of course", a leading newline) — chat-reply behavior leaking into a continuation context. This adds a per-tokenizer deny list applied as a -inf logit mask on the first sampled token only, eliminating the residue deterministically without prompt engineering.

What's in here

Mechanism

  • tabby/Support/FirstTokenDenyList.swift — declarative deny strings. Pure data; no runtime deps.
  • LlamaRuntimeCore resolves strings to token IDs once per model load against the loaded vocabulary, deduplicating leading-token collisions.
  • A second sampler chain prepends llama_sampler_init_logit_bias with -inf entries and is used only at tokenIndex == 0; subsequent tokens use the standard sampler so deny-listed tokens can still appear naturally mid-suggestion (e.g. "I" as a second word).
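
The three steps above can be sketched without llama.cpp. This is an illustration only, not the code in this PR: `resolveDenyTokenIDs`, `gatedLogits`, and the toy vocabulary are hypothetical names, and the actual change builds the mask with llama_sampler_init_logit_bias in a second sampler chain rather than editing a Swift array.

```swift
import Foundation

/// Hypothetical stand-in for the loaded vocabulary: maps a string to its token IDs.
typealias Tokenize = (String) -> [Int32]

/// Resolve each deny string to the ID of its leading token,
/// dropping duplicates when two strings start with the same token.
func resolveDenyTokenIDs(_ denyStrings: [String], tokenize: Tokenize) -> [Int32] {
    var seen = Set<Int32>()
    var resolved: [Int32] = []
    for string in denyStrings {
        guard let leading = tokenize(string).first else { continue }
        if seen.insert(leading).inserted {
            resolved.append(leading)
        }
    }
    return resolved
}

/// Apply the -inf mask only for the first sampled token, and only when the gate is on.
/// Later tokens see the untouched logits, so deny-listed tokens can still appear mid-suggestion.
func gatedLogits(_ logits: [Float], tokenIndex: Int, denyIDs: [Int32], isGatingEnabled: Bool) -> [Float] {
    guard isGatingEnabled, tokenIndex == 0 else { return logits }
    var masked = logits
    for id in denyIDs where masked.indices.contains(Int(id)) {
        masked[Int(id)] = -Float.infinity
    }
    return masked
}

// Toy vocabulary, assumed for illustration; real IDs come from the loaded model.
let toyVocabulary: [String: [Int32]] = [
    "Sure": [1],
    "Here": [2],
    "Of course": [3, 4],
    "Of": [3],
    "\n": [6],
]
let tokenize: Tokenize = { toyVocabulary[$0] ?? [] }

// "Of course" and "Of" share leading token 3, so it is recorded once.
let denyIDs = resolveDenyTokenIDs(["Sure", "Here", "Of course", "Of", "\n"], tokenize: tokenize)
let logits: [Float] = [0.1, 2.0, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
let first = gatedLogits(logits, tokenIndex: 0, denyIDs: denyIDs, isGatingEnabled: true)
let later = gatedLogits(logits, tokenIndex: 1, denyIDs: denyIDs, isGatingEnabled: true)
```

Resolving only the leading token is what makes a multi-token entry such as "Of course" block every first token that shares its prefix, which is one reason to keep the list short.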

Conservative starting list
"Sure", "Here", "Of course", "Certainly", "\n". Every entry is a string that has essentially no legitimate use at the start of a continuation. Deliberately excluded, although the issue suggested them: "I " (high false-positive rate in real prose), "Let me" / "Let" (too broad), and language-specific openers. None of these has measurements to back it. The per-model switch is kept as the extension point so the list can grow from gate-fire log evidence rather than intuition.

Settings toggle
"Suppress Chat Openers", default on, persisted under tabbyFirstTokenGatingEnabled. Apple Intelligence does not expose logit-level control, so the toggle only renders for the Open Source engine.

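A default-on toggle needs a little care with UserDefaults, because an unset Bool key reads as false. The sketch below shows one way to handle that. Only the key name comes from this PR; `isFirstTokenGatingEnabled(in:)`, `Engine`, `showsGatingToggle(for:)`, and the suite name are hypothetical and may not match how the app reads the setting.

```swift
import Foundation

/// Key named in the PR description.
let firstTokenGatingKey = "tabbyFirstTokenGatingEnabled"

/// Read the toggle, treating a missing value as "on".
/// `bool(forKey:)` alone would return false for a key that was never written.
func isFirstTokenGatingEnabled(in defaults: UserDefaults) -> Bool {
    defaults.object(forKey: firstTokenGatingKey) as? Bool ?? true
}

/// Hypothetical engine type: only the Open Source engine exposes logits,
/// so only it shows the toggle.
enum Engine { case openSource, appleIntelligence }

func showsGatingToggle(for engine: Engine) -> Bool {
    engine == .openSource
}

// Use a throwaway suite so the sketch does not touch real app settings.
let defaults = UserDefaults(suiteName: "first-token-gate-sketch")!
defaults.removeObject(forKey: firstTokenGatingKey)
let valueWhenUnset = isFirstTokenGatingEnabled(in: defaults)
defaults.set(false, forKey: firstTokenGatingKey)
let valueAfterDisable = isFirstTokenGatingEnabled(in: defaults)
defaults.removeObject(forKey: firstTokenGatingKey)
```
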
Debug logging (acceptance criterion)
Subsystem app.tabby, category first-token-gate. Logs the resolved deny list at model load, plus a per-fire signal whenever the un-gated argmax of the raw logits at position 0 is in the deny set. Argmax is checked rather than the sampled token because the sampled token passes through stochastic top-p / min-p stages — argmax precisely answers "did the gate prevent the model's strongest preference from being chosen?".

log stream --predicate 'subsystem == "app.tabby" AND category == "first-token-gate"' --level debug
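
The per-fire check described above reduces to an argmax over the raw logits followed by a set lookup. A minimal sketch, assuming the logits are available as a Swift array; `argmaxTokenID` and `gateFired` are hypothetical names, and in the app the result would go to the unified log under the subsystem and category above rather than being returned.

```swift
import Foundation

/// Index of the largest raw logit, or nil for an empty array.
func argmaxTokenID(_ logits: [Float]) -> Int32? {
    guard let best = logits.indices.max(by: { logits[$0] < logits[$1] }) else { return nil }
    return Int32(best)
}

/// True when the un-gated top choice for the first token is deny-listed,
/// meaning the gate kept the model's strongest preference from being chosen.
func gateFired(rawLogits: [Float], denyIDs: Set<Int32>) -> Bool {
    guard let top = argmaxTokenID(rawLogits) else { return false }
    return denyIDs.contains(top)
}

// Token 1 has the highest raw logit in this toy example.
let rawLogits: [Float] = [0.2, 3.1, 0.4, 1.0]
let fired = gateFired(rawLogits: rawLogits, denyIDs: [1, 2])
let quiet = gateFired(rawLogits: rawLogits, denyIDs: [2, 3])
```

Because the check runs on the logits before any mask or stochastic stage, it is deterministic for a given prompt, which makes the resulting counts comparable across runs.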

Validation

Ran locally:

  • swiftlint --reporter github-actions-logging: no new warnings on changed files
  • xcodebuild ... build CODE_SIGNING_ALLOWED=NO: BUILD SUCCEEDED
  • xcodebuild test ... CODE_SIGNING_ALLOWED=NO: TEST SUCCEEDED

LlamaPromptRendererTests updated to pass the new isFirstTokenGatingEnabled argument to its SuggestionRequest constructor.

Screenshot of the new Settings toggle: TODO — local build couldn't capture due to Screen Recording permissions; will attach when running on a machine with the permission, or happy to ship without if you'd prefer to verify in your build.

Linked issues

Closes #24.

Risk / rollout notes

  • Default-on is intentional: the gate applies a -inf mask only to a small set of tokens, and one of those tokens being the model's top choice at position 0 is exactly the misbehavior we want to fix. False-positive cost on the conservative starting list should be near zero.
  • The gate is a pure sampling-time addition; if it ever needs to be turned off in the field, the toggle disables the second sampler chain and falls back to today's behavior.
  • Resolution is per-model-load, so adding entries to the deny list later is a no-op until the next model load.

Jam-Cai and others added 2 commits April 29, 2026 20:55
Instruction-tuned models routinely begin inline-autocomplete suggestions
with conversational openers — "Sure,", "Here's", "Let me", a leading
newline, or a bare "I " — that read as chat replies rather than text
continuation. Prompt engineering reduces this unevenly; a per-tokenizer
deny list applied as a -inf logit mask on the first sampled token
eliminates it deterministically and is model-agnostic.

What this adds
- FirstTokenDenyList: declarative per-model deny strings (Gemma instruct,
  Qwen3, conservative default for unknown GGUFs). Pure data; no runtime
  deps.
- LlamaRuntimeCore resolves the strings to token IDs once per model load
  using the loaded vocabulary, deduplicating leading-token collisions.
- A second sampler chain prepends llama_sampler_init_logit_bias with
  -inf entries and is used only at tokenIndex == 0; subsequent tokens
  use the standard sampler so deny-listed tokens can still appear
  naturally mid-suggestion (e.g. "I" as a second word).
- Settings toggle "Suppress Chat Openers", default on, persisted under
  tabbyFirstTokenGatingEnabled. Apple Intelligence does not expose
  logit-level control, so the toggle only renders for the Open Source
  engine.
- Debug logging under subsystem=app.tabby category=first-token-gate:
  resolved deny list at model load, plus a per-fire signal whenever the
  un-gated argmax of the raw logits at position 0 is in the deny set.
  Argmax is checked rather than the sampled token because the sampled
  token passes through stochastic top-p/min-p stages — argmax answers
  the precise question "did the gate prevent the model's strongest
  preference from being chosen?".

Stream the gate with:
  log stream --predicate 'subsystem == "app.tabby" AND category == "first-token-gate"' --level debug

Closes #24.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Deny list trimmed to evidence-supported entries only:
  "Sure", "Here", "Of course", "Certainly", "\n"
Dropped "I " (high false-positive rate in prose), "Let me"/"Let" (too broad),
and the per-model Chinese entries (no measurement). The per-model switch is
kept as the extension point so we can add specific entries later from
gate-fire log evidence rather than intuition.

Also wraps the Settings caption to satisfy the 140-char line-length warning,
removes trailing commas in the deny-list literal, and adds the new
isFirstTokenGatingEnabled argument to the SuggestionRequest constructor in
LlamaPromptRendererTests so the test target builds.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment on lines +51 to +60
    static func denyStrings(for modelFilename: String) -> [String] {
        switch modelFilename {
        case "gemma-3-1b-it-Q4_K_M.gguf",
             "gemma-3n-E4B-it-Q4_K_M.gguf",
             "Qwen3-0.6B-Q4_K_M.gguf":
            return Self.conservativeDenyStrings
        default:
            return Self.conservativeDenyStrings
        }
    }
Owner

All cases return the same deny-string list. Do we need the switch, then?



Development

Successfully merging this pull request may close these issues.

Logit gating on first generated token

3 participants