Make the constrained beam finish the current word mid-word #520

Merged
FuJacob merged 1 commit into main from feat/beam-midword-guard
Jun 2, 2026

Conversation

FuJacob (Owner) commented Jun 2, 2026

Summary

When the caret sits inside a word (a mid-word completion), the constrained beam search's first token
should finish that word, not start a new one with punctuation, whitespace, or a symbol. This adds a
TokenProfile.continuesWordMidStream(_:) check — Unicode-aware, so letters (including CJK), digits,
and the within-word marks ' and - continue a word while breakers (an ASCII period, an em dash, an
arrow, a leading space, ...) do not — and the beam filters its first step to word-continuing tokens
when the request asks for word continuation.
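
The classification described above can be sketched roughly as follows. This is a minimal illustration, not the merged implementation: the free function `continuesWord(bytes:)` is a hypothetical stand-in that takes raw token bytes directly, whereas the real `TokenProfile.continuesWordMidStream(_:)` takes a token id and its body is not shown on this page.

```swift
/// Hypothetical stand-in for the mid-word check (assumed shape, not the PR's code).
/// Returns true when the token's first character can continue a word.
func continuesWord(bytes: [UInt8]) -> Bool {
    // Lossy decode: invalid or partial UTF-8 is repaired to U+FFFD,
    // which is neither a letter nor a number, so such tokens are rejected.
    let text = String(decoding: bytes, as: UTF8.self)
    // An empty byte sequence has no first character and is rejected.
    guard let first = text.first else { return false }
    // Letters (including CJK), digits, apostrophe, and hyphen continue a word.
    return first.isLetter || first.isNumber || first == "'" || first == "-"
}
```

Under these assumptions, a token such as `ing` is accepted, while `, and`, a leading space, or an em dash is rejected.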

This complements the engine's existing first-step word-continuation mask: it is a Swift-side guard
that holds regardless of how that mask behaves under the beam's trim-and-re-accept stepping.

Affects only the constrained beam path (the cotabbyConstrainedDecoderEnabled developer flag with
cotabbyConstrainedBeamWidth > 1, both off by default); the shipping sampler path is unchanged.

Validation

xcodebuild ... test ... CODE_SIGNING_ALLOWED=NO CODE_SIGNING_REQUIRED=NO \
  -only-testing:CotabbyTests/TokenProfileTests \
  -only-testing:CotabbyTests/ConstrainedBeamSearchTests
# ** TEST SUCCEEDED **
#   TokenProfileTests: continuesWordMidStream accepts letters/CJK/digits/'/- and rejects
#     space/period/punctuation/non-ASCII-symbol/empty
#   ConstrainedBeamSearchTests: a higher-logit word-breaking first token is filtered out mid-word,
#     so the word-continuing token starts the completion

swiftlint --strict --quiet   # exit 0 (clean)

Linked issues

None. A constrained-decoder quality guard for mid-word completions.

Risk / rollout notes

  • Constrained beam path only, default off, so there is no change to shipped behavior. The guard
    only ever removes a word-breaking first token mid-word; it can never introduce one.
  • Applies to the first generated token only; once the word is being continued, later tokens generate
    freely.

Greptile Summary

Adds a continuesWordMidStream check to TokenProfile and threads an isMidWord flag through the constrained beam search so that, when the caret sits inside a word, only word-continuing tokens (letters, digits, ', -, or non-ASCII lead bytes) are eligible for the first generated step. The shipping sampler path is entirely unaffected; the guard only fires under the two developer-facing flags that enable the constrained beam.

  • TokenProfile.continuesWordMidStream: Unicode-aware first-character inspection via a lossy UTF-8 decode; partial sequences and symbols map to U+FFFD and are correctly rejected.
  • ConstrainedBeamSearch: The isMidWord flag is plumbed through search(…) and Engine and applied inside expand only when branch.tokenIDs.isEmpty, ensuring later tokens generate freely.
  • Both new code paths have targeted unit tests; the swiftlint suppression on the intentionally-lossy String(decoding:as:) call is well-motivated.
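
The first-step gating described in the bullets above might look something like the sketch below. The type `RankedToken` and the function `firstStepCandidates` are hypothetical names introduced for illustration; only the condition (`isMidWord` together with an empty `tokenIDs` on the branch) comes from the review text.

```swift
/// Hypothetical candidate type; the real beam's candidate representation is not shown here.
struct RankedToken {
    let id: Int
    let logit: Float
}

/// Sketch of the guard: filter only on the first step of a mid-word request.
/// `continuesWord` is assumed to wrap a check like TokenProfile.continuesWordMidStream(_:).
func firstStepCandidates(
    _ ranked: [RankedToken],
    branchTokenIDs: [Int],
    isMidWord: Bool,
    continuesWord: (Int) -> Bool
) -> [RankedToken] {
    // Later steps, and requests that are not mid-word, pass through unchanged.
    guard isMidWord, branchTokenIDs.isEmpty else { return ranked }
    // The filter can only remove candidates; it never adds or reorders them.
    return ranked.filter { continuesWord($0.id) }
}
```

Because the function only ever returns a subset of its input, it matches the claim that the guard removes word-breaking tokens but cannot introduce one.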

Confidence Score: 4/5

Safe to merge; the guard is default-off, only removes candidates, and the empty-result path already degrades gracefully to an empty completion string.

The change is narrow and conservative — it can only ever remove word-breaking tokens from the first step, never introduce one. The caller already handles an empty candidate list with guard let best else { return "" }. Findings are documentation wording and a missing edge-case test, not behavioral defects.

No files require special attention, though a test pinning the all-candidates-filtered fallback in ConstrainedBeamSearchTests.swift would harden the contract.

Important Files Changed

| Filename | Overview |
| --- | --- |
| Cotabby/Support/TokenProfile.swift | Adds continuesWordMidStream(_:), a Unicode-aware classifier that decodes the first UTF-8 character and accepts letters/digits/apostrophe/hyphen. The lossy-decode approach is intentional and well-commented; one minor doc imprecision about the empty-bytes guard. |
| Cotabby/Support/ConstrainedBeamSearch.swift | Threads isMidWord: Bool = false through search(…) and Engine, filtering candidates to word-continuing tokens only on the first expansion step (branch.tokenIDs.isEmpty). Defaulting to false preserves backward compatibility for all existing callers. |
| Cotabby/Services/Runtime/LlamaRuntimeCore.swift | One-line change passing options.forceWordContinuation as isMidWord; the empty-candidates case is already handled gracefully by the existing guard let best else { return "" } below. |
| CotabbyTests/TokenProfileTests.swift | New test covers ASCII letters, digits, apostrophe, hyphen, CJK, space, period, punctuation, non-ASCII symbol, and empty/control. |
| CotabbyTests/ConstrainedBeamSearchTests.swift | New test verifies that the higher-logit word-breaking token is filtered out mid-word and the lower-logit word-continuing token wins. Complementary normal assertion confirms non-mid-word behavior is unchanged. Missing an explicit assertion for the all-candidates-filtered scenario. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["ConstrainedBeamSearch.search called<br/>isMidWord = options.forceWordContinuation"] --> B["Engine.run<br/>frontier = initial empty branch"]
    B --> C{"frontier empty?"}
    C -- yes --> G
    C -- no --> D["expand each branch"]
    D --> E["rankedAdmissibleTokens<br/>filters control + blocked tokens"]
    E --> F{"isMidWord AND<br/>branch.tokenIDs.isEmpty?"}
    F -- yes --> F1["filter candidates via<br/>continuesWordMidStream"]
    F1 --> F2{"first byte isLetter<br/>or isNumber or apostrophe<br/>or hyphen?"}
    F2 -- yes --> H["token passes"]
    F2 -- no --> I["token rejected"]
    F -- no --> H
    H --> J{"isEOG / newline<br/>/ sentence end?"}
    J -- yes --> K["branch → completed"]
    J -- no --> L["branch → live / nextFrontier"]
    L --> M["prune nextFrontier<br/>to beamWidth"]
    M --> C
    G["all completed branches<br/>sorted by meanLogprob"]
    G --> N{"best == nil?"}
    N -- yes --> O["return empty string"]
    N -- no --> P["return best.text"]

Reviews (1): Last reviewed commit: "Make the constrained beam finish the cur..."

Greptile also left 2 inline comments on this PR.

When the caret is inside a word, the beam's first token should finish that word, not start a new one with punctuation/whitespace/a symbol. Add TokenProfile.continuesWordMidStream (Unicode-aware: letters incl. CJK, digits, and the within-word marks ' and - continue a word; breakers do not), and filter the beam's first step to word-continuing tokens when word continuation is requested.

Constrained beam path only (default-off flags); the guard only removes a word-breaking first token, never introduces one. Applies to the first token only.
@FuJacob FuJacob merged commit 0a0ae64 into main Jun 2, 2026
4 checks passed
@FuJacob FuJacob deleted the feat/beam-midword-guard branch June 2, 2026 01:48
Comment on lines +108 to +112
/// Whether `id` can continue the current word mid-stream: its first byte is an ASCII letter or
/// digit, a common within-word mark (apostrophe or hyphen), or a non-ASCII lead byte (which starts
/// a multi-byte letter or ideograph). Tokens that begin with whitespace, breaking punctuation, or a
/// symbol are rejected, so a mid-word completion finishes the word instead of starting a new token.
/// False for an out-of-range or empty (control) token.

P2 The doc comment says "False for an out-of-range or empty (control) token", but isControl is never consulted — only bytes.isEmpty is checked. A non-control token whose bytes happen to be empty would also return false, and a control token with non-empty bytes starting with a letter would return true (though rankedAdmissibleTokens would already have excluded it). The parenthetical is misleading; tightening the wording removes the ambiguity.

Suggested change
  /// Whether `id` can continue the current word mid-stream: its first byte is an ASCII letter or
  /// digit, a common within-word mark (apostrophe or hyphen), or a non-ASCII lead byte (which starts
  /// a multi-byte letter or ideograph). Tokens that begin with whitespace, breaking punctuation, or a
  /// symbol are rejected, so a mid-word completion finishes the word instead of starting a new token.
- /// False for an out-of-range or empty (control) token.
+ /// False for an out-of-range id or a token whose byte sequence is empty.

Comment on lines 150 to 173
        XCTAssertFalse(recorder.paths.contains([0, 1]), "search must stop at the sentence and not step past it")
    }

    func test_search_midWord_firstTokenMustContinueTheWord() {
        // token 0 breaks the word (leading punctuation) but has the higher logit; token 1 continues it.
        // Mid-word, only a word-continuing token may start the completion.
        let profile = makeProfile(byteStrings: [", and", "ing"])
        let rows: [[Int]: [Float]] = [[]: row([0: 9, 1: 1], vocabSize: 2)]
        let normal = ConstrainedBeamSearch.search(
            nextLogits: provider(vocabSize: 2, rows: rows), profile: profile,
            configuration: BeamSearchConfiguration(beamWidth: 1, maxTokens: 1, topK: 5),
            isSingleLine: false, isMidWord: false)
        let midWord = ConstrainedBeamSearch.search(
            nextLogits: provider(vocabSize: 2, rows: rows), profile: profile,
            configuration: BeamSearchConfiguration(beamWidth: 1, maxTokens: 1, topK: 5),
            isSingleLine: false, isMidWord: true)

        XCTAssertEqual(normal.first?.tokenIDs, [0], "without mid-word, the highest-logit token wins")
        XCTAssertEqual(midWord.first?.tokenIDs, [1], "mid-word, the word-breaking token is filtered out")
    }

    func test_search_respectsMaxTokenBudget() {
        // No EOG / sentence end: every token keeps generating, so the budget bounds the length.
        let profile = makeProfile(byteStrings: ["a", "b"])

P2 No test for all-candidates-filtered path

The new test covers the normal filtered case, but there's no assertion for when isMidWord: true and every admissible token is word-breaking (e.g. the only tokens in the vocabulary are " x", ".y"). In that scenario candidates is empty after filtering, the branch is silently dropped from both live and completed, and search returns []. The caller in LlamaRuntimeCore handles this correctly with guard let best else { return "" }, but adding a single test case would pin that contract and prevent a future refactor from introducing a crash or incorrect fallback.
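
The contract this comment asks to pin can be illustrated with a small self-contained model. Everything below is a toy sketch under stated assumptions: `Hypothesis`, `toySearch`, and `completionText` are hypothetical names, and the one-step "search" only mimics the filter-then-fallback shape rather than calling the real `ConstrainedBeamSearch.search` or `LlamaRuntimeCore`.

```swift
/// Hypothetical result type standing in for a finished beam branch.
struct Hypothesis {
    let tokenIDs: [Int]
    let text: String
}

/// Toy one-step search: mid-word, keep only tokens whose first character is a letter.
func toySearch(vocabulary: [String], isMidWord: Bool) -> [Hypothesis] {
    var results: [Hypothesis] = []
    for (id, text) in vocabulary.enumerated() {
        let continues = text.first?.isLetter ?? false
        // Word-breaking tokens are dropped only when the request is mid-word.
        if isMidWord && !continues { continue }
        results.append(Hypothesis(tokenIDs: [id], text: text))
    }
    return results
}

/// Mirrors the caller-side fallback quoted in the review: no best branch means an empty completion.
func completionText(_ results: [Hypothesis]) -> String {
    guard let best = results.first else { return "" }
    return best.text
}
```

With a vocabulary of only `" x"` and `".y"`, the mid-word search yields no hypotheses and the completion degrades to an empty string, which is the behavior a regression test in the real suite would assert.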
