test: lock the prompt-encode wiring through the #9 seam: fake tokenizer, decidable clamp, prefill switch (#12) #22
Merged
…okenizer
The #9 spy injected tokenizer: nil, so the encode → clamp → makeDecodeOptions branch never ran under the seam. The DA proved that deleting it (or dropping the leading space, or always passing nil) kept every test green, and the promptTokens == nil canary was vacuously true.

A deterministic FakeTokenizer (UTF-8 bytes as ids, ~25 lines of inert stubbing) now rides the spy pipeline. Three things are locked: the pipeline must receive exactly encode(" " + prompt), with the leading space in-band; an overlong prompt must clamp to the TRAILING 224 tokens (nearest context wins); and the no-prompt canary runs WITH a tokenizer, so nil now proves the gate is the absent prompt.

Mutation probes: all three DA mutations (drop leading space, prefix-clamp, force nil) now go red. Production code untouched. Refs #12
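The wiring this commit locks can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PR's test code or WhisperKit's API: `FakeTokenizer` here is a local stand-in, and `maxPromptTokens` and `promptTokens(for:tokenizer:)` are hypothetical names. Only the byte-as-id encoding, the leading space, and the trailing-224 clamp come from the commit message.

```swift
// Hypothetical deterministic fake: each UTF-8 byte becomes one token id.
struct FakeTokenizer {
    func encode(text: String) -> [Int] {
        text.utf8.map { Int($0) }
    }
}

// Assumed limit, taken from the "TRAILING 224 tokens" in the commit message.
let maxPromptTokens = 224

// Hypothetical wiring: nil when the prompt or the tokenizer is absent,
// otherwise encode " " + prompt and keep only the trailing window.
func promptTokens(for prompt: String?, tokenizer: FakeTokenizer?) -> [Int]? {
    guard let prompt = prompt, let tokenizer = tokenizer else { return nil }
    let ids = tokenizer.encode(text: " " + prompt)
    return Array(ids.suffix(maxPromptTokens))
}

let fake = FakeTokenizer()
let short = promptTokens(for: "hi", tokenizer: fake)      // ids 32, 104, 105
let absent = promptTokens(for: nil, tokenizer: fake)      // nil: the gate is the missing prompt
let long = promptTokens(for: String(repeating: "x", count: 300), tokenizer: fake)
```

Because the canary case passes a real tokenizer, a nil result can only come from the absent prompt, which is the point the commit message makes about the earlier vacuous check.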
…ntity fake, prefill switch locked
The clamp test's homogeneous 'aaa…' data made suffix ≡ prefix, so its direction claim was vacuous. That also exposed that the earlier mutation-probe evidence for that direction was contaminated: the grep-based red-count was measuring noise, not the assertion.

Heterogeneous 150a+150b data now makes every 224-window distinct, with an explicit != prefix guard. The fake's encoding moves off identity UTF-8 to 1000+byte, so a production path that hardcoded raw bytes without calling the injected tokenizer cannot match. The DA's mutation find is locked too: usePrefillPrompt, the switch that makes WhisperKit actually CONSUME promptTokens, is now asserted at the seam. Unused fake stubs fail fast instead of returning plausible zeros.

All four mutations re-verified red by exit code (drop leading space / prefix clamp / force nil / prefill off). Refs #12
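A small self-contained sketch can illustrate why heterogeneous data makes the clamp direction decidable. It is stand-in code, not the PR's test file or WhisperKit's API: `OffsetFakeTokenizer`, `SketchDecodeOptions`, and `makeSketchOptions` are names invented for illustration. Only the 1000+byte mapping, the 150a+150b data, the 224-token window, and the field names `promptTokens` and `usePrefillPrompt` come from the commit message.

```swift
// Hypothetical fake: id = 1000 + UTF-8 byte, so a code path that emits raw
// bytes without calling the injected tokenizer cannot produce matching ids.
struct OffsetFakeTokenizer {
    func encode(text: String) -> [Int] {
        text.utf8.map { 1000 + Int($0) }
    }

    // Unused stub: fail fast rather than return a plausible-looking value.
    func decode(tokens: [Int]) -> String {
        fatalError("OffsetFakeTokenizer.decode is not expected to be called")
    }
}

// Hypothetical options type; it only mirrors two field names from the commit message.
struct SketchDecodeOptions {
    var promptTokens: [Int]?
    var usePrefillPrompt: Bool
}

// Assumed wiring: encode " " + prompt, keep the trailing `limit` tokens,
// and switch prefill on so the prompt tokens are consumed.
func makeSketchOptions(prompt: String,
                       tokenizer: OffsetFakeTokenizer,
                       limit: Int = 224) -> SketchDecodeOptions {
    let ids = tokenizer.encode(text: " " + prompt)
    return SketchDecodeOptions(promptTokens: Array(ids.suffix(limit)),
                               usePrefillPrompt: true)
}

let tokenizer = OffsetFakeTokenizer()
let prompt = String(repeating: "a", count: 150) + String(repeating: "b", count: 150)
let allIds = tokenizer.encode(text: " " + prompt)   // 301 ids; the first is 1032 (space)
let expectedSuffix = Array(allIds.suffix(224))       // 74 a-ids, then 150 b-ids
let wrongPrefix = Array(allIds.prefix(224))          // space id, 150 a-ids, 73 b-ids

let options = makeSketchOptions(prompt: prompt, tokenizer: tokenizer)
```

With this data a prefix-clamp mutant would return `wrongPrefix`, which differs from `expectedSuffix`, so an equality assertion on the suffix plus a `!=` guard on the prefix pins the direction.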
This was referenced Jul 2, 2026
Closed
Refs #12
Summary
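A FakeTokenizer (non-identity 1000+byte mapping, unused stubs fail fast) is injected into the #9 spy pipeline to lock the actual encode→clamp→makeDecodeOptions wiring: the leading space in-band (1032), the trailing-224 clamp (heterogeneous data so the direction is decidable, plus a second != prefix assertion), the usePrefillPrompt switch, and a canary that is no longer vacuous. Zero changes to production code.
Verification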
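For the 6-AI verify master report, see issue #12: one MEDIUM finding (homogeneous test data made the direction claim vacuous) and 2 LOW findings were fixed in the same round. All four mutations were judged red by exit code (drop leading space / prefix clamp / always nil / prefill off). 161/161 green. Along the way, a leftover from the #20 dist-sync on main was caught and hotfixed (BestASRVersion 0.3.1→0.4.0, ac191fb).
Checklist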
🤖 Generated by /idd-all. Do NOT add a GitHub close trailer.