Make the constrained beam finish the current word mid-word by FuJacob · Pull Request #520 · FuJacob/cotabby

FuJacob · 2026-06-02T01:45:24Z

Summary

When the caret sits inside a word (a mid-word completion), the constrained beam search's first token
should finish that word, not start a new one with punctuation, whitespace, or a symbol. This adds a
TokenProfile.continuesWordMidStream(_:) check — Unicode-aware, so letters (including CJK), digits,
and the within-word marks ' and - continue a word while breakers (an ASCII period, an em dash, an
arrow, a leading space, ...) do not — and the beam filters its first step to word-continuing tokens
when the request asks for word continuation.
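
A minimal sketch of the kind of check described above, assuming a token's bytes are available as a `[UInt8]`. The function name `continuesWord` and its signature are hypothetical and are not the PR's actual `TokenProfile.continuesWordMidStream(_:)`; only the lossy `String(decoding:as:)` decode is mentioned in the review.

```swift
import Foundation

/// Hypothetical stand-in for the classifier described above. The name,
/// signature, and byte-array input are assumptions for illustration only.
func continuesWord(_ bytes: [UInt8]) -> Bool {
    // Lossy decode: invalid or partial UTF-8 is repaired to U+FFFD, which is
    // neither a letter nor a number, so such tokens are rejected below.
    let text = String(decoding: bytes, as: UTF8.self)
    // An empty byte sequence has no first character and cannot continue a word.
    guard let first = text.first else { return false }
    // The two within-word marks named in the summary.
    if first == "'" || first == "-" { return true }
    // Unicode-aware: covers ASCII letters and digits as well as CJK ideographs.
    return first.isLetter || first.isNumber
}
```

Under these assumptions, `continuesWord(Array("ing".utf8))` is true, while a token that starts with a space, a period, or an arrow is rejected because its first character is neither a letter, a number, nor one of the two marks.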

This complements the engine's existing first-step word-continuation mask: it is a Swift-side guard
that holds regardless of how that mask behaves under the beam's trim-and-re-accept stepping.

Affects only the constrained beam path (the cotabbyConstrainedDecoderEnabled developer flag with
cotabbyConstrainedBeamWidth > 1, both off by default); the shipping sampler path is unchanged.

Validation

xcodebuild ... test ... CODE_SIGNING_ALLOWED=NO CODE_SIGNING_REQUIRED=NO \
  -only-testing:CotabbyTests/TokenProfileTests \
  -only-testing:CotabbyTests/ConstrainedBeamSearchTests
# ** TEST SUCCEEDED **
#   TokenProfileTests: continuesWordMidStream accepts letters/CJK/digits/'/- and rejects
#     space/period/punctuation/non-ASCII-symbol/empty
#   ConstrainedBeamSearchTests: a higher-logit word-breaking first token is filtered out mid-word,
#     so the word-continuing token starts the completion

swiftlint --strict --quiet   # exit 0 (clean)

Linked issues

None. A constrained-decoder quality guard for mid-word completions.

Risk / rollout notes

Constrained beam path only, default off, so there is no change to shipped behavior. The guard
only ever removes a word-breaking first token mid-word; it can never introduce one.
Applies to the first generated token only; once the word is being continued, later tokens generate
freely.

Greptile Summary

Adds a continuesWordMidStream check to TokenProfile and threads an isMidWord flag through the constrained beam search so that, when the caret sits inside a word, only word-continuing tokens (letters, digits, ', -, or non-ASCII lead bytes) are eligible for the first generated step. The shipping sampler path is entirely unaffected; the guard only fires under the two developer-facing flags that enable the constrained beam.

TokenProfile.continuesWordMidStream: Unicode-aware first-character inspection via a lossy UTF-8 decode; partial sequences and symbols map to U+FFFD and are correctly rejected.
ConstrainedBeamSearch: The isMidWord flag is plumbed through search(…) and Engine and applied inside expand only when branch.tokenIDs.isEmpty, ensuring later tokens generate freely.
Both new code paths have targeted unit tests; the swiftlint suppression on the intentionally-lossy String(decoding:as:) call is well-motivated.
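
The first-step gating described above can be sketched as follows. This is a simplified illustration, not the PR's `expand` implementation: the `Candidate` type, the `firstStepFilter` function, and the closure parameter are hypothetical names introduced here.

```swift
/// Hypothetical, simplified candidate type; not one of the PR's actual types.
struct Candidate {
    let tokenID: Int
    let logit: Float
}

/// Keeps only word-continuing candidates on the first step of a mid-word
/// completion. On every other step, candidates pass through unchanged.
func firstStepFilter(
    _ candidates: [Candidate],
    branchTokenIDs: [Int],
    isMidWord: Bool,
    continuesWord: (Int) -> Bool
) -> [Candidate] {
    // The guard fires only when mid-word AND the branch has no tokens yet.
    guard isMidWord, branchTokenIDs.isEmpty else { return candidates }
    // Filtering can only remove candidates; it never adds or reorders them.
    return candidates.filter { continuesWord($0.tokenID) }
}
```

Because the function only filters, it cannot introduce a word-breaking token, and because it checks for an empty branch, tokens after the first are unaffected.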

Confidence Score: 4/5

Safe to merge; the guard is default-off, only removes candidates, and the empty-result path already degrades gracefully to an empty completion string.

The change is narrow and conservative — it can only ever remove word-breaking tokens from the first step, never introduce one. The caller already handles an empty candidate list with guard let best else { return "" }. Findings are documentation wording and a missing edge-case test, not behavioral defects.

No files require special attention, though a test pinning the all-candidates-filtered fallback in ConstrainedBeamSearchTests.swift would harden the contract.

Important Files Changed

Filename	Overview
Cotabby/Support/TokenProfile.swift	Adds `continuesWordMidStream(_:)` — a Unicode-aware classifier that decodes the first UTF-8 character and accepts letters/digits/apostrophe/hyphen. The lossy-decode approach is intentional and well-commented; one minor doc imprecision about the empty-bytes guard.
Cotabby/Support/ConstrainedBeamSearch.swift	Threads `isMidWord: Bool = false` through `search(…)` and `Engine`, filtering candidates to word-continuing tokens only on the first expansion step (`branch.tokenIDs.isEmpty`). Defaulting to `false` preserves backward compatibility for all existing callers.
Cotabby/Services/Runtime/LlamaRuntimeCore.swift	One-line change passing `options.forceWordContinuation` as `isMidWord`; the empty-candidates case is already handled gracefully by the existing `guard let best else { return "" }` below.
CotabbyTests/TokenProfileTests.swift	New test covers ASCII letters, digits, apostrophe, hyphen, CJK, space, period, punctuation, non-ASCII symbol, and empty/control.
CotabbyTests/ConstrainedBeamSearchTests.swift	New test verifies that the higher-logit word-breaking token is filtered out mid-word and the lower-logit word-continuing token wins. Complementary `normal` assertion confirms non-mid-word behavior is unchanged. Missing an explicit assertion for the all-candidates-filtered scenario.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["ConstrainedBeamSearch.search called<br/>isMidWord = options.forceWordContinuation"] --> B["Engine.run<br/>frontier = initial empty branch"]
    B --> C{"frontier empty?"}
    C -- yes --> G
    C -- no --> D["expand each branch"]
    D --> E["rankedAdmissibleTokens<br/>filters control + blocked tokens"]
    E --> F{"isMidWord AND<br/>branch.tokenIDs.isEmpty?"}
    F -- yes --> F1["filter candidates via<br/>continuesWordMidStream"]
    F1 --> F2{"first byte isLetter<br/>or isNumber or apostrophe<br/>or hyphen?"}
    F2 -- yes --> H["token passes"]
    F2 -- no --> I["token rejected"]
    F -- no --> H
    H --> J{"isEOG / newline<br/>/ sentence end?"}
    J -- yes --> K["branch → completed"]
    J -- no --> L["branch → live / nextFrontier"]
    L --> M["prune nextFrontier<br/>to beamWidth"]
    M --> C
    G["all completed branches<br/>sorted by meanLogprob"]
    G --> N{"best == nil?"}
    N -- yes --> O["return empty string"]
    N -- no --> P["return best.text"]

Reviews (1): Last reviewed commit: "Make the constrained beam finish the cur..."

Greptile also left 2 inline comments on this PR.

When the caret is inside a word, the beam's first token should finish that word, not start a new one with punctuation/whitespace/a symbol. Add TokenProfile.continuesWordMidStream (Unicode-aware: letters incl. CJK, digits, and the within-word marks ' and - continue a word; breakers do not), and filter the beam's first step to word-continuing tokens when word continuation is requested. Constrained beam path only (default-off flags); the guard only removes a word-breaking first token, never introduces one. Applies to the first token only.

greptile-apps · 2026-06-02T01:50:06Z

+    /// Whether `id` can continue the current word mid-stream: its first byte is an ASCII letter or
+    /// digit, a common within-word mark (apostrophe or hyphen), or a non-ASCII lead byte (which starts
+    /// a multi-byte letter or ideograph). Tokens that begin with whitespace, breaking punctuation, or a
+    /// symbol are rejected, so a mid-word completion finishes the word instead of starting a new token.
+    /// False for an out-of-range or empty (control) token.


The doc comment says "False for an out-of-range or empty (control) token", but isControl is never consulted — only bytes.isEmpty is checked. A non-control token whose bytes happen to be empty would also return false, and a control token with non-empty bytes starting with a letter would return true (though rankedAdmissibleTokens would already have excluded it). The parenthetical is misleading; tightening the wording removes the ambiguity.

Suggested change

-    /// Whether `id` can continue the current word mid-stream: its first byte is an ASCII letter or
-    /// digit, a common within-word mark (apostrophe or hyphen), or a non-ASCII lead byte (which starts
-    /// a multi-byte letter or ideograph). Tokens that begin with whitespace, breaking punctuation, or a
-    /// symbol are rejected, so a mid-word completion finishes the word instead of starting a new token.
-    /// False for an out-of-range or empty (control) token.
+    /// Whether `id` can continue the current word mid-stream: its first byte is an ASCII letter or
+    /// digit, a common within-word mark (apostrophe or hyphen), or a non-ASCII lead byte (which starts
+    /// a multi-byte letter or ideograph). Tokens that begin with whitespace, breaking punctuation, or a
+    /// symbol are rejected, so a mid-word completion finishes the word instead of starting a new token.
+    /// False for an out-of-range id or a token whose byte sequence is empty.

greptile-apps · 2026-06-02T01:50:07Z

        XCTAssertFalse(recorder.paths.contains([0, 1]), "search must stop at the sentence and not step past it")
    }

+    func test_search_midWord_firstTokenMustContinueTheWord() {
+        // token 0 breaks the word (leading punctuation) but has the higher logit; token 1 continues it.
+        // Mid-word, only a word-continuing token may start the completion.
+        let profile = makeProfile(byteStrings: [", and", "ing"])
+        let rows: [[Int]: [Float]] = [[]: row([0: 9, 1: 1], vocabSize: 2)]
+        let normal = ConstrainedBeamSearch.search(
+            nextLogits: provider(vocabSize: 2, rows: rows), profile: profile,
+            configuration: BeamSearchConfiguration(beamWidth: 1, maxTokens: 1, topK: 5),
+            isSingleLine: false, isMidWord: false)
+        let midWord = ConstrainedBeamSearch.search(
+            nextLogits: provider(vocabSize: 2, rows: rows), profile: profile,
+            configuration: BeamSearchConfiguration(beamWidth: 1, maxTokens: 1, topK: 5),
+            isSingleLine: false, isMidWord: true)
+
+        XCTAssertEqual(normal.first?.tokenIDs, [0], "without mid-word, the highest-logit token wins")
+        XCTAssertEqual(midWord.first?.tokenIDs, [1], "mid-word, the word-breaking token is filtered out")
+    }
+
    func test_search_respectsMaxTokenBudget() {
        // No EOG / sentence end: every token keeps generating, so the budget bounds the length.
        let profile = makeProfile(byteStrings: ["a", "b"])


No test for all-candidates-filtered path

The new test covers the normal filtered case, but there's no assertion for when isMidWord: true and every admissible token is word-breaking (e.g. the only tokens in the vocabulary are " x", ".y"). In that scenario candidates is empty after filtering, the branch is silently dropped from both live and completed, and search returns []. The caller in LlamaRuntimeCore handles this correctly with guard let best else { return "" }, but adding a single test case would pin that contract and prevent a future refactor from introducing a crash or incorrect fallback.
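
The fallback contract described above can be illustrated in isolation. This is a sketch under stated assumptions, not the code in LlamaRuntimeCore: the `Branch` type and the `bestCompletion(from:)` function are hypothetical, and only `text`, `meanLogprob`, and the `guard let best else { return "" }` line come from the review.

```swift
/// Hypothetical result type; only `text` and `meanLogprob` are named in the review.
struct Branch {
    let text: String
    let meanLogprob: Float
}

/// Picks the highest-scoring completed branch, or falls back to an empty
/// completion when the search produced no branches at all.
func bestCompletion(from results: [Branch]) -> String {
    // `max(by:)` returns nil for an empty array, e.g. when every first-step
    // candidate was filtered out as word-breaking.
    let best = results.max { $0.meanLogprob < $1.meanLogprob }
    guard let best else { return "" }
    return best.text
}
```

A test in this spirit would assert that an empty result list yields `""` rather than crashing, which is the behavior the comment asks to pin.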


FuJacob merged commit 0a0ae64 into main Jun 2, 2026
4 checks passed

FuJacob deleted the feat/beam-midword-guard branch June 2, 2026 01:48

greptile-apps Bot reviewed Jun 2, 2026
