Make the constrained beam finish the current word mid-word #520
When the caret is inside a word, the beam's first token should finish that word, not start a new one with punctuation/whitespace/a symbol. Add TokenProfile.continuesWordMidStream (Unicode-aware: letters incl. CJK, digits, and the within-word marks ' and - continue a word; breakers do not), and filter the beam's first step to word-continuing tokens when word continuation is requested. Constrained beam path only (default-off flags); the guard only removes a word-breaking first token, never introduces one. Applies to the first token only.
/// Whether `id` can continue the current word mid-stream: its first byte is an ASCII letter or
/// digit, a common within-word mark (apostrophe or hyphen), or a non-ASCII lead byte (which starts
/// a multi-byte letter or ideograph). Tokens that begin with whitespace, breaking punctuation, or a
/// symbol are rejected, so a mid-word completion finishes the word instead of starting a new token.
/// False for an out-of-range or empty (control) token.
The doc comment says "False for an out-of-range or empty (control) token", but isControl is never consulted; only bytes.isEmpty is checked. A non-control token whose bytes happen to be empty would also return false, and a control token with non-empty bytes starting with a letter would return true (though rankedAdmissibleTokens would already have excluded it). The parenthetical is misleading; tightening the wording removes the ambiguity.
- /// Whether `id` can continue the current word mid-stream: its first byte is an ASCII letter or
- /// digit, a common within-word mark (apostrophe or hyphen), or a non-ASCII lead byte (which starts
- /// a multi-byte letter or ideograph). Tokens that begin with whitespace, breaking punctuation, or a
- /// symbol are rejected, so a mid-word completion finishes the word instead of starting a new token.
- /// False for an out-of-range or empty (control) token.
+ /// Whether `id` can continue the current word mid-stream: its first byte is an ASCII letter or
+ /// digit, a common within-word mark (apostrophe or hyphen), or a non-ASCII lead byte (which starts
+ /// a multi-byte letter or ideograph). Tokens that begin with whitespace, breaking punctuation, or a
+ /// symbol are rejected, so a mid-word completion finishes the word instead of starting a new token.
+ /// False for an out-of-range id or a token whose byte sequence is empty.
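The distinction raised in this thread can be illustrated with a minimal sketch of the first-byte rule as the doc comment words it. `ToyToken`, `ToyProfile`, and their fields are hypothetical stand-ins, not the PR's `TokenProfile`, and treating any byte at or above 0xC0 as a "non-ASCII lead byte" is an assumption.

```swift
// Hypothetical stand-ins for the PR's TokenProfile; names and fields are assumptions.
struct ToyToken {
    let bytes: [UInt8]
    let isControl: Bool
}

struct ToyProfile {
    let tokens: [ToyToken]

    /// First-byte rule as worded in the doc comment under review.
    func continuesWordMidStream(_ id: Int) -> Bool {
        // Out-of-range id: nothing to inspect.
        guard tokens.indices.contains(id) else { return false }
        // Empty byte sequence: rejected whether or not the token is a control token.
        guard let first = tokens[id].bytes.first else { return false }
        switch first {
        case 0x30...0x39, 0x41...0x5A, 0x61...0x7A:
            return true   // ASCII digit or letter
        case 0x27, 0x2D:
            return true   // apostrophe or hyphen
        case 0xC0...0xFF:
            return true   // assumed range for a multi-byte lead byte
        default:
            return false  // whitespace, punctuation, symbols, continuation bytes
        }
    }
}

let profile = ToyProfile(tokens: [
    ToyToken(bytes: Array("ing".utf8), isControl: false),    // id 0
    ToyToken(bytes: Array(", and".utf8), isControl: false),  // id 1
    ToyToken(bytes: [], isControl: false),                   // id 2: empty, not control
    ToyToken(bytes: Array("eos".utf8), isControl: true),     // id 3: control, non-empty
])
```

In this sketch id 2 is rejected purely because its bytes are empty, and id 3 is accepted even though it is a control token, which is why the suggested wording drops the "(control)" parenthetical.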
        XCTAssertFalse(recorder.paths.contains([0, 1]), "search must stop at the sentence and not step past it")
    }

    func test_search_midWord_firstTokenMustContinueTheWord() {
        // token 0 breaks the word (leading punctuation) but has the higher logit; token 1 continues it.
        // Mid-word, only a word-continuing token may start the completion.
        let profile = makeProfile(byteStrings: [", and", "ing"])
        let rows: [[Int]: [Float]] = [[]: row([0: 9, 1: 1], vocabSize: 2)]
        let normal = ConstrainedBeamSearch.search(
            nextLogits: provider(vocabSize: 2, rows: rows), profile: profile,
            configuration: BeamSearchConfiguration(beamWidth: 1, maxTokens: 1, topK: 5),
            isSingleLine: false, isMidWord: false)
        let midWord = ConstrainedBeamSearch.search(
            nextLogits: provider(vocabSize: 2, rows: rows), profile: profile,
            configuration: BeamSearchConfiguration(beamWidth: 1, maxTokens: 1, topK: 5),
            isSingleLine: false, isMidWord: true)

        XCTAssertEqual(normal.first?.tokenIDs, [0], "without mid-word, the highest-logit token wins")
        XCTAssertEqual(midWord.first?.tokenIDs, [1], "mid-word, the word-breaking token is filtered out")
    }

    func test_search_respectsMaxTokenBudget() {
        // No EOG / sentence end: every token keeps generating, so the budget bounds the length.
        let profile = makeProfile(byteStrings: ["a", "b"])
No test for all-candidates-filtered path
The new test covers the normal filtered case, but there's no assertion for when isMidWord: true and every admissible token is word-breaking (e.g. the only tokens in the vocabulary are " x", ".y"). In that scenario candidates is empty after filtering, the branch is silently dropped from both live and completed, and search returns []. The caller in LlamaRuntimeCore handles this correctly with guard let best else { return "" }, but adding a single test case would pin that contract and prevent a future refactor from introducing a crash or incorrect fallback.
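The contract described above (every candidate filtered out, empty result, empty completion string) can be sketched in isolation. This is a toy one-token model, not the PR's `ConstrainedBeamSearch`; `Candidate`, `continuesWord`, `bestFirstToken`, and `completion` are hypothetical names introduced only for illustration.

```swift
// Hypothetical candidate type; the PR's real beam types are not shown in this thread.
struct Candidate {
    let tokenID: Int
    let text: String
    let logit: Float
}

/// Toy word-continuation check on decoded text (stand-in for continuesWordMidStream).
func continuesWord(_ text: String) -> Bool {
    guard let first = text.first else { return false }
    return first.isLetter || first.isNumber || first == "'" || first == "-"
}

/// One-token "search": optionally filter the first step, then take the highest logit.
func bestFirstToken(_ candidates: [Candidate], isMidWord: Bool) -> Candidate? {
    let eligible = isMidWord
        ? candidates.filter({ continuesWord($0.text) })
        : candidates
    return eligible.max(by: { $0.logit < $1.logit })
}

/// Mirrors the caller-side fallback quoted in the review: no best branch, empty completion.
func completion(_ candidates: [Candidate], isMidWord: Bool) -> String {
    guard let best = bestFirstToken(candidates, isMidWord: isMidWord) else { return "" }
    return best.text
}

// The vocabulary from the review comment: every token is word-breaking.
let allBreaking = [
    Candidate(tokenID: 0, text: " x", logit: 9),
    Candidate(tokenID: 1, text: ".y", logit: 1),
]
// The vocabulary from the existing test: one breaker, one continuation.
let mixed = [
    Candidate(tokenID: 0, text: ", and", logit: 9),
    Candidate(tokenID: 1, text: "ing", logit: 1),
]
```

A real test would presumably call `ConstrainedBeamSearch.search` with `isMidWord: true` on a profile built from `[" x", ".y"]` and assert that the result is empty.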
Summary
When the caret sits inside a word (a mid-word completion), the constrained beam search's first token
should finish that word, not start a new one with punctuation, whitespace, or a symbol. This adds a
TokenProfile.continuesWordMidStream(_:) check. It is Unicode-aware: letters (including CJK), digits,
and the within-word marks ' and - continue a word, while breakers (an ASCII period, an em dash, an
arrow, a leading space, ...) do not. The beam filters its first step to word-continuing tokens
when the request asks for word continuation.
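As a rough sketch of what a Unicode-aware check of this kind could look like, the function below classifies a token by its first decoded character. The name `startsWordContinuing` is hypothetical, and the lossy-decode approach is taken from the review summary further down rather than from the PR's source, so treat this as an approximation under those assumptions.

```swift
/// Sketch: classify a token's raw bytes by its first decoded character.
func startsWordContinuing(_ bytes: [UInt8]) -> Bool {
    // Lossy decode: invalid or truncated UTF-8 is repaired to U+FFFD,
    // which is neither a letter nor a number, so it falls through to false.
    guard let first = String(decoding: bytes, as: UTF8.self).first else {
        return false  // empty byte sequence
    }
    // Letters (including CJK ideographs) and digits continue a word,
    // as do the within-word marks apostrophe and hyphen.
    return first.isLetter || first.isNumber || first == "'" || first == "-"
}

// Word-continuing starts.
let continuing = ["ing", "42", "'s", "-based", "中文", "é"]
// Word-breaking starts: period, em dash, arrow, leading space.
let breaking = [".", "\u{2014}", "\u{2192}", " x"]
```

Decoding the first character, rather than inspecting the first byte, is what lets a non-ASCII breaker such as an em dash be rejected while a CJK ideograph is accepted.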
This complements the engine's existing first-step word-continuation mask: it is a Swift-side guard
that holds regardless of how that mask behaves under the beam's trim-and-re-accept stepping.
Affects only the constrained beam path (the cotabbyConstrainedDecoderEnabled developer flag with
cotabbyConstrainedBeamWidth > 1, both off by default); the shipping sampler path is unchanged.
Validation
Linked issues
None. A constrained-decoder quality guard for mid-word completions.
Risk / rollout notes
only ever removes a word-breaking first token mid-word; it can never introduce one.
freely.
Greptile Summary
Adds a continuesWordMidStream check to TokenProfile and threads an isMidWord flag through the constrained beam search so that, when the caret sits inside a word, only word-continuing tokens (letters, digits, ', -, or non-ASCII lead bytes) are eligible for the first generated step. The shipping sampler path is entirely unaffected; the guard only fires under the two developer-facing flags that enable the constrained beam.
TokenProfile.continuesWordMidStream: Unicode-aware first-character inspection via a lossy UTF-8 decode; partial sequences decode to U+FFFD and, like symbols, are correctly rejected.
ConstrainedBeamSearch: The isMidWord flag is plumbed through search(…) and Engine and applied inside expand only when branch.tokenIDs.isEmpty, ensuring later tokens generate freely.
swiftlint suppression on the intentionally-lossy String(decoding:as:) call is well-motivated.
Confidence Score: 4/5
Safe to merge; the guard is default-off, only removes candidates, and the empty-result path already degrades gracefully to an empty completion string.
The change is narrow and conservative: it can only ever remove word-breaking tokens from the first step, never introduce one. The caller already handles an empty candidate list with guard let best else { return "" }. Findings are documentation wording and a missing edge-case test, not behavioral defects.
No files require special attention, though a test pinning the all-candidates-filtered fallback in ConstrainedBeamSearchTests.swift would harden the contract.
Important Files Changed
continuesWordMidStream(_:), a Unicode-aware classifier that decodes the first UTF-8 character and accepts letters/digits/apostrophe/hyphen. The lossy-decode approach is intentional and well-commented; one minor doc imprecision about the empty-bytes guard.
isMidWord: Bool = false through search(…) and Engine, filtering candidates to word-continuing tokens only on the first expansion step (branch.tokenIDs.isEmpty). Defaulting to false preserves backward compatibility for all existing callers.
options.forceWordContinuation as isMidWord; the empty-candidates case is already handled gracefully by the existing guard let best else { return "" } below.
normal assertion confirms non-mid-word behavior is unchanged.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[ConstrainedBeamSearch.search called isMidWord = options.forceWordContinuation] --> B[Engine.run frontier = initial empty branch]
    B --> C{frontier empty?}
    C -- yes --> G
    C -- no --> D[expand each branch]
    D --> E[rankedAdmissibleTokens filters control + blocked tokens]
    E --> F{isMidWord AND branch.tokenIDs.isEmpty?}
    F -- yes --> F1[filter candidates via continuesWordMidStream]
    F1 --> F2{first byte isLetter or isNumber or apostrophe or hyphen?}
    F2 -- yes --> H[token passes]
    F2 -- no --> I[token rejected]
    F -- no --> H
    H --> J{isEOG / newline / sentence end?}
    J -- yes --> K[branch → completed]
    J -- no --> L[branch → live / nextFrontier]
    L --> M[prune nextFrontier to beamWidth]
    M --> C
    G[all completed branches sorted by meanLogprob]
    G --> N{best == nil?}
    N -- yes --> O[return empty string]
    N -- no --> P[return best.text]
Reviews (1): Last reviewed commit: "Make the constrained beam finish the cur..."