Add SentencePiece/LLaMA tokenizer path (#464) #466
Merged
michalharakal merged 4 commits into develop (Apr 13, 2026)
Conversation
Introduces a llama.cpp-style SPM tokenizer (LLaMA, Gemma, TinyLlama, Mistral v0.1, ...). Implements:

- Whitespace escape (space -> U+2581) with configurable add_space_prefix
- Code-point symbol split with surrogate-pair safety
- Score-priority BPE: the adjacent pair whose concatenation is in the vocab with the highest score wins (the SPM rule; the opposite of the merge-rank rule used by QwenByteLevelBpeTokenizer)
- Byte fallback via <0xNN> hex tokens for any leftover symbol that isn't in the vocab
- Decode that recognizes <0xNN> tokens, accumulates them into raw bytes, UTF-8 decodes, and strips the leading space prefix

Covered by SentencePieceTokenizerCoreTest: whitespace round trip, merging to single tokens, byte fallback for unknown Latin and CJK characters, and interleaving of normal tokens and byte fallback in decode.

Follow-up to #464.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
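The score-priority merge rule from the commit above can be sketched roughly as follows. This is an illustrative stand-alone sketch, not the PR's actual code; the function name and the `scores` map are hypothetical:

```kotlin
// Illustrative sketch of SPM score-priority merging (names hypothetical).
// `scores` maps every vocab token to its SentencePiece score. Each round merges
// the adjacent pair whose concatenation is a vocab token with the highest
// score; this is the opposite of merge-rank BPE, which always picks the pair
// with the lowest (earliest-learned) merge rank.
fun spmMerge(symbols: MutableList<String>, scores: Map<String, Float>): List<String> {
    while (true) {
        var bestIdx = -1
        var bestScore = Float.NEGATIVE_INFINITY
        for (i in 0 until symbols.size - 1) {
            val s = scores[symbols[i] + symbols[i + 1]] ?: continue // pair not in vocab
            if (s > bestScore) { bestScore = s; bestIdx = i }
        }
        if (bestIdx < 0) break // no mergeable pair left; leftovers go to byte fallback
        symbols[bestIdx] = symbols[bestIdx] + symbols[bestIdx + 1]
        symbols.removeAt(bestIdx + 1)
    }
    return symbols
}
```

For example, with scores {"He": -1, "ll": -2, "Hell": -3, "Hello": -5}, the code-point split ["H","e","l","l","o"] merges step by step ("He" first, since -1 is the highest available score) all the way down to the single token "Hello".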
TokenizerFactory.fromGguf now dispatches "llama"/"sentencepiece" to SentencePieceTokenizer.fromGgufFields, and fromTokenizerJson dispatches model.type == "Unigram" to SentencePieceTokenizer.fromTokenizerJson. Both routes previously threw UnsupportedTokenizerException. WordPiece remains unimplemented.

Updates TokenizerFactoryDispatchTest: the llama and Unigram cases now assert successful dispatch (against small synthetic fixtures), and new bert / WordPiece cases pin down the still-unsupported branches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
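The dispatch shape this commit describes looks roughly like the following. All names here are hypothetical stand-ins, not the project's real signatures:

```kotlin
// Rough sketch of the factory dispatch (hypothetical names): the
// llama/sentencepiece branch now builds a tokenizer instead of throwing,
// while unknown model types still fail loudly.
class UnsupportedTokenizerException(message: String) : Exception(message)

enum class TokenizerKind { SENTENCEPIECE, BYTE_LEVEL_BPE }

fun dispatchGgufModel(model: String): TokenizerKind = when (model) {
    "llama", "sentencepiece" -> TokenizerKind.SENTENCEPIECE // new SPM route
    "gpt2" -> TokenizerKind.BYTE_LEVEL_BPE                  // existing Qwen/GPT-2 route
    else -> throw UnsupportedTokenizerException("unsupported tokenizer model: $model")
}
```

The test change mirrors this: cases that used to assert the exception now assert a successful dispatch, and the still-unsupported types keep asserting the throw.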
Adds a fixture-gated jvmTest that validates the full GGUF path against the real TinyLlama-1.1B-Chat-v1.0 tokenizer. Reference encodings match HuggingFace transformers exactly: "Hello" -> [15043], "The capital of France is" -> [450, 7483, 310, 3444, 338], CJK byte fallback round-trips, and BOS/EOS ids propagate from the GGUF metadata.

Adds a downloadTinyLlamaTokenizerFixtures Gradle task analogous to the Qwen one. Tests skip cleanly when the fixture is absent, so offline/CI builds stay green.

Also fixes a latent bug in both fromGgufFields builders: GGUF UINT32 fields arrive as kotlin.UInt (a value class, not a Number), so the "as? Number" cast was silently dropping bos_token_id / eos_token_id / unknown_token_id. The new toIntFlexible helper handles every signed and unsigned numeric type GGUF can produce.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
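The UInt bug and its fix can be sketched as below. This is a minimal illustration of the conversion the commit describes, not the repository's actual helper; the free-function signature is an assumption (the message only names the helper `toIntFlexible` in GgufFieldHelpers.kt):

```kotlin
// kotlin.UInt is a value class, NOT a Number subclass, so an `as? Number`
// cast on a boxed UInt yields null and a UINT32 GGUF field is silently lost.
// Exhausting the concrete numeric types fixes that. (Sketch; names assumed.)
fun toIntFlexible(value: Any?): Int? = when (value) {
    is Int -> value
    is UInt -> value.toInt()
    is Long -> value.toInt()
    is ULong -> value.toInt()
    is Short -> value.toInt()
    is UShort -> value.toInt()
    is Byte -> value.toInt()
    is UByte -> value.toInt()
    else -> null
}
```

The contrast with the old cast: `(1u as Any as? Number)?.toInt()` is null, while `toIntFlexible(1u)` returns 1.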
Closes #464.
Adds the SentencePiece path that was deferred from the #463 byte-level BPE fix: LLaMA / TinyLlama / Gemma / Mistral-v0.1 GGUFs now tokenize correctly through the same `TokenizerFactory` that already routes Qwen/GPT-2 to byte-level BPE.

Summary

`SentencePieceTokenizer` (commonMain), a llama.cpp-style SPM tokenizer:

- Whitespace escape (space -> `▁` / U+2581) with configurable `addSpacePrefix`
- Code-point symbol split with surrogate-pair safety
- Score-priority BPE: the adjacent pair whose concatenation is in the vocab with the highest score wins (the SPM rule; the opposite of the merge-rank rule used by `QwenByteLevelBpeTokenizer`)
- Byte fallback via `<0x00>`..`<0xFF>` hex tokens for any leftover symbol that isn't in the vocab
- Decode that recognizes `<0xNN>` tokens, accumulates them into raw bytes, UTF-8 decodes, and strips the leading space prefix

`TokenizerFactory` wiring: `fromGguf` now dispatches `tokenizer.ggml.model == "llama"` / `"sentencepiece"` to the new tokenizer; `fromTokenizerJson` dispatches `model.type == "Unigram"` likewise. Both branches previously threw `UnsupportedTokenizerException`.

UInt cast fix: `StreamingGGUFReader` returns `UINT32` fields as `kotlin.UInt` (a value class, not a `Number` subclass), so the previous `as? Number` cast was silently dropping `bos_token_id` / `eos_token_id` / `unknown_token_id`. The new `toIntFlexible` helper handles every signed/unsigned numeric type GGUF can produce, and the same fix is applied to `QwenByteLevelBpeTokenizer.fromGgufFields`.

Verification
End-to-end against real TinyLlama-1.1B-Chat-v1.0 (download-on-demand fixture, skips cleanly when absent):
"Hello"[15043]"The capital of France is"[450, 7483, 310, 3444, 338]"日本") byte-fallback round-tripbosTokenId/eosTokenIdfrom GGUF1/2SentencePieceTokenizertokenizer.jsondispatch + round-tripReference token IDs come from HuggingFace
transformersforTinyLlama/TinyLlama-1.1B-Chat-v1.0.The synthetic
SentencePieceTokenizerCoreTestin commonTest exercises whitespace escape, merging to single tokens, byte fallback for unknown Latin / CJK characters, and interleaved-token decode without needing any fixture (so CI without network stays green).QwenByteLevelBpeTokenizerFixtureTest(#463) is also re-run and stays green — the UInt cast fix didn't break the Qwen path.Files
New (commonMain):

- `SentencePieceTokenizer.kt`: core + GGUF and tokenizer.json builders
- `GgufFieldHelpers.kt`: `toIntFlexible` helper

New (commonTest):

- `SentencePieceTokenizerCoreTest.kt`: 9 synthetic algorithm tests

New (jvmTest):

- `SentencePieceTokenizerFixtureTest.kt`: 7 end-to-end TinyLlama tests

Modified:

- `TokenizerFactory.kt`: wire llama/Unigram branches
- `QwenByteLevelBpeTokenizer.kt`: adopt `toIntFlexible` for BOS/EOS
- `TokenizerFactoryDispatchTest.kt`: assert dispatch where it used to assert throws
- `build.gradle.kts`: `downloadTinyLlamaTokenizerFixtures` task
- `CHANGELOG.md`

Test plan
- `./gradlew :skainet-io:skainet-io-core:jvmTest`: full core suite green
- `./gradlew :skainet-io:skainet-io-core:downloadTinyLlamaTokenizerFixtures`, then re-run jvmTest: all 7 fixture tests pass against the real TinyLlama GGUF
- Cross-check against `transformers` (already done locally; outputs above)

🤖 Generated with Claude Code
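As a closing illustration, the `<0xNN>` byte-fallback decode summarized earlier can be sketched like this. The helper name is hypothetical and the sketch is deliberately minimal, not the PR's implementation:

```kotlin
// Sketch of SPM byte-fallback decode (illustrative; name hypothetical).
// <0xNN> tokens accumulate into a byte buffer that is flushed through UTF-8
// whenever a normal token arrives; afterwards U+2581 markers become spaces
// and the leading space prefix is stripped.
fun decodeSpm(tokens: List<String>): String {
    val out = StringBuilder()
    val pending = mutableListOf<Byte>()
    fun flush() {
        if (pending.isNotEmpty()) {
            out.append(pending.toByteArray().toString(Charsets.UTF_8))
            pending.clear()
        }
    }
    val byteToken = Regex("<0x([0-9A-Fa-f]{2})>")
    for (t in tokens) {
        val m = byteToken.matchEntire(t)
        if (m != null) {
            pending.add(m.groupValues[1].toInt(16).toByte()) // raw byte, decode later
        } else {
            flush() // finish any byte run before appending a normal token
            out.append(t)
        }
    }
    flush()
    return out.toString().replace('\u2581', ' ').removePrefix(" ")
}
```

For instance, `["▁Hello", "<0xE6>", "<0x97>", "<0xA5>"]` decodes to "Hello日": the three fallback bytes form the UTF-8 sequence for 日, and the leading `▁`-derived space is removed.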