
Add SentencePiece/LLaMA tokenizer path (#464)#466

Merged
michalharakal merged 4 commits into develop from feature/464-tokenizer-path
Apr 13, 2026

Conversation

@michalharakal
Contributor

Closes #464.

Adds the SentencePiece path that was deferred from the #463 byte-level BPE fix: LLaMA / TinyLlama / Gemma / Mistral-v0.1 GGUFs now tokenize correctly through the same TokenizerFactory that already routes Qwen/GPT-2 to byte-level BPE.

Summary

  • SentencePieceTokenizer (commonMain) — llama.cpp-style SPM:
    • Whitespace escape (space → ▁, U+2581) with configurable addSpacePrefix
    • Code-point symbol split (surrogate-pair safe)
    • Score-priority BPE: the adjacent pair whose concatenation is in the vocab with the highest score wins (the SPM rule, opposite of the merge-rank rule used by QwenByteLevelBpeTokenizer)
    • Byte fallback via <0x00>..<0xFF> hex tokens for any leftover symbol that isn't in the vocab
    • Decode that recognizes <0xNN> tokens, accumulates them into raw bytes, UTF-8 decodes, and strips the leading space prefix
  • TokenizerFactory wiring: fromGguf now dispatches tokenizer.ggml.model == "llama"/"sentencepiece" to the new tokenizer; fromTokenizerJson dispatches model.type == "Unigram" likewise. Both branches previously threw UnsupportedTokenizerException.
  • GGUF UInt cast bug fix (latent in #463, "Byte-level BPE broken for GPT-2/Qwen models (affects both GGUF and SafeTensors)"): StreamingGGUFReader returns UINT32 fields as kotlin.UInt (a value class — not a Number subclass), so the previous as? Number cast was silently dropping bos_token_id / eos_token_id / unknown_token_id. A new toIntFlexible helper handles every signed/unsigned numeric type GGUF can produce, and the same fix is applied to QwenByteLevelBpeTokenizer.fromGgufFields.
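The score-priority merge rule in the first bullet can be sketched roughly as follows. This is a minimal illustration against a toy vocab, not the PR's actual API — `spmEncode` and the `Map<String, Pair<Int, Double>>` vocab shape are hypothetical:

```kotlin
// Minimal sketch of SPM-style score-priority BPE (illustrative names).
// vocab maps piece -> (tokenId, score).
fun spmEncode(
    text: String,
    vocab: Map<String, Pair<Int, Double>>,
    addSpacePrefix: Boolean = true,
): List<Int> {
    // 1. Whitespace escape: optional leading space, then ' ' -> U+2581.
    val escaped = (if (addSpacePrefix) " $text" else text).replace(' ', '\u2581')

    // 2. Split into code-point symbols (surrogate-pair safe).
    val symbols = mutableListOf<String>()
    var i = 0
    while (i < escaped.length) {
        val n = Character.charCount(escaped.codePointAt(i))
        symbols += escaped.substring(i, i + n)
        i += n
    }

    // 3. Repeatedly merge the adjacent pair whose concatenation is in the
    //    vocab with the highest SCORE (the SPM rule — not merge rank).
    while (true) {
        var best = -1
        var bestScore = Double.NEGATIVE_INFINITY
        for (j in 0 until symbols.size - 1) {
            val score = vocab[symbols[j] + symbols[j + 1]]?.second ?: continue
            if (score > bestScore) { bestScore = score; best = j }
        }
        if (best < 0) break
        symbols[best] = symbols[best] + symbols[best + 1]
        symbols.removeAt(best + 1)
    }

    // 4. Look up ids; in the real path, any miss would go to <0xNN> byte fallback.
    return symbols.mapNotNull { vocab[it]?.first }
}
```

With a toy vocab where `"▁Hello"` and its prefixes carry increasing scores, `"Hello"` merges step by step (`▁H`, `▁He`, …) down to the single `"▁Hello"` token.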

Verification

End-to-end against real TinyLlama-1.1B-Chat-v1.0 (download-on-demand fixture, skips cleanly when absent):

| Test | Expected result |
| --- | --- |
| `"Hello"` | `[15043]` |
| `"The capital of France is"` | `[450, 7483, 310, 3444, 338]` |
| ASCII encode/decode round-trip | identity |
| CJK ("日本") byte-fallback round-trip | identity |
| `bosTokenId` / `eosTokenId` from GGUF | 1 / 2 |
| GGUF dispatches to `SentencePieceTokenizer` | yes |
| `tokenizer.json` dispatch + round-trip | yes |

Reference token IDs come from HuggingFace transformers for TinyLlama/TinyLlama-1.1B-Chat-v1.0.

The synthetic SentencePieceTokenizerCoreTest in commonTest exercises whitespace escape, merging to single tokens, byte fallback for unknown Latin / CJK characters, and interleaved-token decode without needing any fixture (so CI without network stays green).
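The interleaved-token decode those synthetic tests exercise can be sketched roughly like this — a hedged illustration, with `spmDecode` and its signature invented for the example rather than taken from the PR:

```kotlin
// Sketch of SPM decode: <0xNN> byte-fallback tokens accumulate into raw
// bytes and are UTF-8 decoded in one run; ordinary pieces pass through.
val byteToken = Regex("""<0x([0-9A-Fa-f]{2})>""")

fun spmDecode(pieces: List<String>): String {
    val out = StringBuilder()
    val pending = mutableListOf<Byte>()
    fun flush() {
        if (pending.isNotEmpty()) {
            out.append(pending.toByteArray().toString(Charsets.UTF_8))
            pending.clear()
        }
    }
    for (p in pieces) {
        val m = byteToken.matchEntire(p)
        if (m != null) {
            // Accumulate; consecutive <0xNN> tokens may form one UTF-8 sequence.
            pending += m.groupValues[1].toInt(16).toByte()
        } else {
            flush()
            out.append(p)
        }
    }
    flush()
    // Un-escape whitespace and strip the leading space added at encode time.
    return out.toString().replace('\u2581', ' ').removePrefix(" ")
}
```

For example, the three byte tokens `<0xE6> <0x97> <0xA5>` decode together to the single CJK character 日.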

QwenByteLevelBpeTokenizerFixtureTest (#463) is also re-run and stays green — the UInt cast fix didn't break the Qwen path.
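The UInt cast pitfall behind that fix is easy to reproduce. Below is a hedged sketch of what a `toIntFlexible`-style helper looks like (the exact branches in the PR's `GgufFieldHelpers.kt` may differ):

```kotlin
// kotlin.UInt/UShort/UByte/ULong are value classes, NOT Number subclasses,
// so `as? Number` on a UINT32 GGUF field silently yields null.
// Sketch of a helper that accepts every numeric type GGUF can produce.
fun toIntFlexible(value: Any?): Int? = when (value) {
    is Number -> value.toInt()   // Int, Long, Short, Byte, Float, Double
    is UByte -> value.toInt()
    is UShort -> value.toInt()
    is UInt -> value.toInt()
    is ULong -> value.toInt()
    else -> null
}
```

A boxed `UInt` fails an `as? Number` cast at runtime, which is exactly how `bos_token_id` / `eos_token_id` were being dropped before the fix.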

Files

New (commonMain):

  • SentencePieceTokenizer.kt — core + GGUF and tokenizer.json builders
  • GgufFieldHelpers.kt — toIntFlexible helper

New (commonTest):

  • SentencePieceTokenizerCoreTest.kt — 9 synthetic algorithm tests

New (jvmTest):

  • SentencePieceTokenizerFixtureTest.kt — 7 end-to-end TinyLlama tests

Modified:

  • TokenizerFactory.kt — wire llama/Unigram branches
  • QwenByteLevelBpeTokenizer.kt — adopt toIntFlexible for BOS/EOS
  • TokenizerFactoryDispatchTest.kt — assert dispatch where it used to assert throws
  • build.gradle.kts — downloadTinyLlamaTokenizerFixtures task
  • CHANGELOG.md

Test plan

  • ./gradlew :skainet-io:skainet-io-core:jvmTest — full core suite green
  • ./gradlew :skainet-io:skainet-io-core:downloadTinyLlamaTokenizerFixtures then re-run jvmTest — all 7 fixture tests pass against the real TinyLlama GGUF
  • Cross-check sample encodings against HuggingFace transformers (already done locally; outputs above)
  • Confirm the Qwen path from #463 ("Byte-level BPE broken for GPT-2/Qwen models (affects both GGUF and SafeTensors)") stays green after the UInt cast change
  • Verify offline build (no network) stays green via the skip path

🤖 Generated with Claude Code

michalharakal and others added 4 commits April 12, 2026 21:53
Introduces a llama.cpp-style SPM tokenizer (LLaMA, Gemma, TinyLlama,
Mistral v0.1, ...). Implements:

- Whitespace escape (space -> U+2581) with configurable add_space_prefix
- Code-point symbol split with surrogate-pair safety
- Score-priority BPE: the adjacent pair whose concatenation is in the
  vocab with the highest score wins (the SPM rule — opposite of the
  merge-rank rule used by QwenByteLevelBpeTokenizer)
- Byte fallback via <0xNN> hex tokens for any leftover symbol that
  isn't in the vocab
- Decode that recognizes <0xNN> tokens, accumulates them into raw
  bytes, UTF-8 decodes, and strips the leading space prefix

Covered by SentencePieceTokenizerCoreTest: whitespace round trip,
merging to single tokens, byte-fallback for unknown Latin and CJK
characters, and interleaving of normal tokens and byte fallback in
decode.

Follow-up #464.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TokenizerFactory.fromGguf now dispatches "llama"/"sentencepiece" to
SentencePieceTokenizer.fromGgufFields, and fromTokenizerJson dispatches
model.type == "Unigram" to SentencePieceTokenizer.fromTokenizerJson.
Both routes previously threw UnsupportedTokenizerException. WordPiece
remains unimplemented.

Updates TokenizerFactoryDispatchTest: the llama and Unigram cases now
assert successful dispatch (against small synthetic fixtures), and new
bert / WordPiece cases hold down the still-unsupported branches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a fixture-gated jvmTest that validates the full GGUF path against
the real TinyLlama-1.1B-Chat-v1.0 tokenizer. Reference encodings match
HuggingFace transformers exactly: "Hello" -> [15043],
"The capital of France is" -> [450, 7483, 310, 3444, 338], CJK byte
fallback round trips, and BOS/EOS ids propagate from the GGUF metadata.

Adds a downloadTinyLlamaTokenizerFixtures Gradle task analogous to the
Qwen one. Tests skip cleanly when the fixture is absent so offline/CI
builds stay green.

Also fixes a latent bug in both fromGgufFields builders: GGUF UINT32
fields arrive as kotlin.UInt (a value class, not a Number), so the
"as? Number" cast was silently dropping bos_token_id / eos_token_id /
unknown_token_id. The new toIntFlexible helper handles every signed
and unsigned numeric type GGUF can produce.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit 62a1d9b into develop Apr 13, 2026
4 checks passed
@michalharakal michalharakal deleted the feature/464-tokenizer-path branch April 13, 2026 07:36


Development

Successfully merging this pull request may close these issues.

Add GGUF SentencePiece/LLaMA tokenizer path

1 participant