
Add SentencePiece/LLaMA tokenizer path (#464)#466

Merged
michalharakal merged 4 commits into develop from feature/464-tokenizer-path
Apr 13, 2026

Conversation

@michalharakal
Contributor

Closes #464.

Adds the SentencePiece path that was deferred from the #463 byte-level BPE fix: LLaMA / TinyLlama / Gemma / Mistral-v0.1 GGUFs now tokenize correctly through the same TokenizerFactory that already routes Qwen/GPT-2 to byte-level BPE.

Summary

  • SentencePieceTokenizer (commonMain) — llama.cpp-style SPM:
    • Whitespace escape (space → ▁, U+2581) with configurable addSpacePrefix
    • Code-point symbol split (surrogate-pair safe)
    • Score-priority BPE: the adjacent pair whose concatenation is in the vocab with the highest score wins (the SPM rule, opposite of the merge-rank rule used by QwenByteLevelBpeTokenizer)
    • Byte fallback via <0x00>..<0xFF> hex tokens for any leftover symbol that isn't in the vocab
    • Decode that recognizes <0xNN> tokens, accumulates them into raw bytes, UTF-8 decodes, and strips the leading space prefix
  • TokenizerFactory wiring: fromGguf now dispatches tokenizer.ggml.model == "llama"/"sentencepiece" to the new tokenizer; fromTokenizerJson dispatches model.type == "Unigram" likewise. Both branches previously threw UnsupportedTokenizerException.
  • GGUF UInt cast bug fix (latent in #463, "Byte-level BPE broken for GPT-2/Qwen models (affects both GGUF and SafeTensors)"): StreamingGGUFReader returns UINT32 fields as kotlin.UInt (a value class — not a Number subclass), so the previous as? Number cast was silently dropping bos_token_id / eos_token_id / unknown_token_id. A new toIntFlexible helper handles every signed/unsigned numeric type GGUF can produce, and the same fix is applied to QwenByteLevelBpeTokenizer.fromGgufFields.
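The score-priority merge rule in the first bullet can be sketched roughly as follows. This is a minimal illustration against a toy vocab, not the PR's actual API — `spmEncode` and the `Map<String, Pair<Int, Double>>` vocab shape are hypothetical:

```kotlin
// Minimal sketch of SPM-style score-priority BPE (illustrative names).
// vocab maps piece -> (tokenId, score).
fun spmEncode(
    text: String,
    vocab: Map<String, Pair<Int, Double>>,
    addSpacePrefix: Boolean = true,
): List<Int> {
    // 1. Whitespace escape: optional leading space, then ' ' -> U+2581.
    val escaped = (if (addSpacePrefix) " $text" else text).replace(' ', '\u2581')

    // 2. Split into code-point symbols (surrogate-pair safe).
    val symbols = mutableListOf<String>()
    var i = 0
    while (i < escaped.length) {
        val n = Character.charCount(escaped.codePointAt(i))
        symbols += escaped.substring(i, i + n)
        i += n
    }

    // 3. Repeatedly merge the adjacent pair whose concatenation is in the
    //    vocab with the highest SCORE (the SPM rule — not merge rank).
    while (true) {
        var best = -1
        var bestScore = Double.NEGATIVE_INFINITY
        for (j in 0 until symbols.size - 1) {
            val score = vocab[symbols[j] + symbols[j + 1]]?.second ?: continue
            if (score > bestScore) { bestScore = score; best = j }
        }
        if (best < 0) break
        symbols[best] = symbols[best] + symbols[best + 1]
        symbols.removeAt(best + 1)
    }

    // 4. Look up ids; in the real path, any miss would go to <0xNN> byte fallback.
    return symbols.mapNotNull { vocab[it]?.first }
}
```

With a toy vocab where `"▁Hello"` and its prefixes carry increasing scores, `"Hello"` merges step by step (`▁H`, `▁He`, …) down to the single `"▁Hello"` token.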

Verification

End-to-end against real TinyLlama-1.1B-Chat-v1.0 (download-on-demand fixture, skips cleanly when absent):

| Test | Expected result |
| --- | --- |
| `"Hello"` | `[15043]` |
| `"The capital of France is"` | `[450, 7483, 310, 3444, 338]` |
| ASCII encode/decode round-trip | identity |
| CJK ("日本") byte-fallback round-trip | identity |
| `bosTokenId` / `eosTokenId` from GGUF | 1 / 2 |
| GGUF dispatches to `SentencePieceTokenizer` | yes |
| `tokenizer.json` dispatch + round-trip | yes |

Reference token IDs come from HuggingFace transformers for TinyLlama/TinyLlama-1.1B-Chat-v1.0.

The synthetic SentencePieceTokenizerCoreTest in commonTest exercises whitespace escape, merging to single tokens, byte fallback for unknown Latin / CJK characters, and interleaved-token decode without needing any fixture (so CI without network stays green).
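The interleaved-token decode those synthetic tests exercise can be sketched roughly like this — a hedged illustration, with `spmDecode` and its signature invented for the example rather than taken from the PR:

```kotlin
// Sketch of SPM decode: <0xNN> byte-fallback tokens accumulate into raw
// bytes and are UTF-8 decoded in one run; ordinary pieces pass through.
val byteToken = Regex("""<0x([0-9A-Fa-f]{2})>""")

fun spmDecode(pieces: List<String>): String {
    val out = StringBuilder()
    val pending = mutableListOf<Byte>()
    fun flush() {
        if (pending.isNotEmpty()) {
            out.append(pending.toByteArray().toString(Charsets.UTF_8))
            pending.clear()
        }
    }
    for (p in pieces) {
        val m = byteToken.matchEntire(p)
        if (m != null) {
            // Accumulate; consecutive <0xNN> tokens may form one UTF-8 sequence.
            pending += m.groupValues[1].toInt(16).toByte()
        } else {
            flush()
            out.append(p)
        }
    }
    flush()
    // Un-escape whitespace and strip the leading space added at encode time.
    return out.toString().replace('\u2581', ' ').removePrefix(" ")
}
```

For example, the three byte tokens `<0xE6> <0x97> <0xA5>` decode together to the single CJK character 日.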

QwenByteLevelBpeTokenizerFixtureTest (#463) is also re-run and stays green — the UInt cast fix didn't break the Qwen path.
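The UInt cast pitfall behind that fix is easy to reproduce. Below is a hedged sketch of what a `toIntFlexible`-style helper looks like (the exact branches in the PR's `GgufFieldHelpers.kt` may differ):

```kotlin
// kotlin.UInt/UShort/UByte/ULong are value classes, NOT Number subclasses,
// so `as? Number` on a UINT32 GGUF field silently yields null.
// Sketch of a helper that accepts every numeric type GGUF can produce.
fun toIntFlexible(value: Any?): Int? = when (value) {
    is Number -> value.toInt()   // Int, Long, Short, Byte, Float, Double
    is UByte -> value.toInt()
    is UShort -> value.toInt()
    is UInt -> value.toInt()
    is ULong -> value.toInt()
    else -> null
}
```

A boxed `UInt` fails an `as? Number` cast at runtime, which is exactly how `bos_token_id` / `eos_token_id` were being dropped before the fix.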

Files

New (commonMain):

  • SentencePieceTokenizer.kt — core + GGUF and tokenizer.json builders
  • GgufFieldHelpers.kt — toIntFlexible helper

New (commonTest):

  • SentencePieceTokenizerCoreTest.kt — 9 synthetic algorithm tests

New (jvmTest):

  • SentencePieceTokenizerFixtureTest.kt — 7 end-to-end TinyLlama tests

Modified:

  • TokenizerFactory.kt — wire llama/Unigram branches
  • QwenByteLevelBpeTokenizer.kt — adopt toIntFlexible for BOS/EOS
  • TokenizerFactoryDispatchTest.kt — assert dispatch where it used to assert throws
  • build.gradle.kts — downloadTinyLlamaTokenizerFixtures task
  • CHANGELOG.md

Test plan

  • ./gradlew :skainet-io:skainet-io-core:jvmTest — full core suite green
  • ./gradlew :skainet-io:skainet-io-core:downloadTinyLlamaTokenizerFixtures then re-run jvmTest — all 7 fixture tests pass against the real TinyLlama GGUF
  • Cross-check sample encodings against HuggingFace transformers (already done locally; outputs above)
  • Confirm the Qwen path from #463 ("Byte-level BPE broken for GPT-2/Qwen models (affects both GGUF and SafeTensors)") stays green after the UInt cast change
  • Verify offline build (no network) stays green via the skip path

🤖 Generated with Claude Code

michalharakal and others added 4 commits April 12, 2026 21:53
Introduces a llama.cpp-style SPM tokenizer (LLaMA, Gemma, TinyLlama,
Mistral v0.1, ...). Implements:

- Whitespace escape (space -> U+2581) with configurable add_space_prefix
- Code-point symbol split with surrogate-pair safety
- Score-priority BPE: the adjacent pair whose concatenation is in the
  vocab with the highest score wins (the SPM rule — opposite of the
  merge-rank rule used by QwenByteLevelBpeTokenizer)
- Byte fallback via <0xNN> hex tokens for any leftover symbol that
  isn't in the vocab
- Decode that recognizes <0xNN> tokens, accumulates them into raw
  bytes, UTF-8 decodes, and strips the leading space prefix

Covered by SentencePieceTokenizerCoreTest: whitespace round trip,
merging to single tokens, byte-fallback for unknown Latin and CJK
characters, and interleaving of normal tokens and byte fallback in
decode.

Follow-up #464.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TokenizerFactory.fromGguf now dispatches "llama"/"sentencepiece" to
SentencePieceTokenizer.fromGgufFields, and fromTokenizerJson dispatches
model.type == "Unigram" to SentencePieceTokenizer.fromTokenizerJson.
Both routes previously threw UnsupportedTokenizerException. WordPiece
remains unimplemented.

Updates TokenizerFactoryDispatchTest: the llama and Unigram cases now
assert successful dispatch (against small synthetic fixtures), and new
bert / WordPiece cases hold down the still-unsupported branches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a fixture-gated jvmTest that validates the full GGUF path against
the real TinyLlama-1.1B-Chat-v1.0 tokenizer. Reference encodings match
HuggingFace transformers exactly: "Hello" -> [15043],
"The capital of France is" -> [450, 7483, 310, 3444, 338], CJK byte
fallback round trips, and BOS/EOS ids propagate from the GGUF metadata.

Adds a downloadTinyLlamaTokenizerFixtures Gradle task analogous to the
Qwen one. Tests skip cleanly when the fixture is absent so offline/CI
builds stay green.

Also fixes a latent bug in both fromGgufFields builders: GGUF UINT32
fields arrive as kotlin.UInt (a value class, not a Number), so the
"as? Number" cast was silently dropping bos_token_id / eos_token_id /
unknown_token_id. The new toIntFlexible helper handles every signed
and unsigned numeric type GGUF can produce.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit 62a1d9b into develop Apr 13, 2026
4 checks passed
@michalharakal michalharakal deleted the feature/464-tokenizer-path branch April 13, 2026 07:36


Development

Successfully merging this pull request may close these issues.

Add GGUF SentencePiece/LLaMA tokenizer path

1 participant