Follow-up from #463.
The fix for #463 introduces a `TokenizerFactory` that dispatches tokenizer construction by architecture (`tokenizer.ggml.model`) rather than by file format. The Qwen/GPT-2 byte-level BPE path lands as part of #463. This issue tracks the remaining branch: SentencePiece / LLaMA.
Scope
`TokenizerFactory.fromGguf(fields)` currently throws `UnsupportedTokenizerException` when `tokenizer.ggml.model` is `"llama"` or `"sentencepiece"`. We need a real implementation so LLaMA/Gemma/TinyLlama GGUF files can be tokenized in this repo.
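For context, the dispatch this issue plugs into might look roughly like the sketch below. Only `TokenizerFactory.fromGguf`, `UnsupportedTokenizerException`, and the `tokenizer.ggml.model` key come from this issue; everything else (the `Tokenizer` interface, the exception constructor, the `Map` parameter shape) is illustrative, not the repo's actual API:

```kotlin
// Hypothetical sketch of the architecture-based dispatch from #463.
class UnsupportedTokenizerException(model: String) :
    Exception("No tokenizer implementation for tokenizer.ggml.model=\"$model\"")

interface Tokenizer {
    fun encode(text: String): List<Int>
}

object TokenizerFactory {
    fun fromGguf(fields: Map<String, Any?>): Tokenizer =
        when (val model = fields["tokenizer.ggml.model"] as? String) {
            // Byte-level BPE branch landed in #463.
            "gpt2" -> TODO("Qwen/GPT-2 byte-level BPE path (#463)")
            // This issue: replace the throw below with a real implementation.
            "llama", "sentencepiece" -> TODO("SentencePiece path (this issue)")
            else -> throw UnsupportedTokenizerException(model ?: "<missing>")
        }
}
```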
What's needed
- A `SentencePieceTokenizer` (or `GgufSentencePieceTokenizer`) in `skainet-io-core` that:
  - Reads `tokenizer.ggml.tokens`, `tokenizer.ggml.scores`, and `tokenizer.ggml.token_type` from GGUF metadata.
  - Implements SentencePiece unigram/BPE semantics: the highest-scoring merge wins (not merge rank).
  - Handles the SentencePiece whitespace prefix (`▁` / U+2581) correctly on both encode and decode.
  - Respects the BOS/EOS/unknown token IDs from `tokenizer.ggml.bos_token_id` etc.
- Wire it into `TokenizerFactory.fromGguf()`, replacing the `UnsupportedTokenizerException` branch.
- Wire it into `TokenizerFactory.fromTokenizerJson()` for HF `model.type == "Unigram"`.
- Tests against a small LLaMA/TinyLlama GGUF (download-on-demand fixture, same pattern as the Qwen tests).
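The score-driven merge loop is the main behavioral difference from the GPT-2 path, so here is a minimal sketch of the encode/decode semantics described above, assuming a flat token/score table as read from the GGUF fields. The constructor shape and names are illustrative only; the real implementation also needs `tokenizer.ggml.token_type` handling and the BOS/EOS wiring:

```kotlin
// Sketch of SentencePiece-style BPE: repeatedly merge the adjacent pair whose
// merged piece has the HIGHEST score in the vocab (score order, not the
// merge-rank order used by GPT-2 BPE). Whitespace maps to ▁ (U+2581).
class SentencePieceTokenizer(
    private val tokens: List<String>,   // tokenizer.ggml.tokens
    private val scores: List<Float>,    // tokenizer.ggml.scores
    private val unkId: Int = 0,         // tokenizer.ggml.unknown_token_id (assumed)
) {
    private val vocab: Map<String, Int> =
        tokens.withIndex().associate { (i, t) -> t to i }

    fun encode(text: String): List<Int> {
        // SentencePiece whitespace convention: prepend ▁ and replace " " with ▁.
        val pieces = ("\u2581" + text.replace(" ", "\u2581"))
            .map { it.toString() }.toMutableList()
        while (true) {
            var best = -1
            var bestScore = Float.NEGATIVE_INFINITY
            for (i in 0 until pieces.size - 1) {
                val id = vocab[pieces[i] + pieces[i + 1]] ?: continue
                if (scores[id] > bestScore) { bestScore = scores[id]; best = i }
            }
            if (best < 0) break // no adjacent pair merges to a known piece
            pieces[best] = pieces[best] + pieces[best + 1]
            pieces.removeAt(best + 1)
        }
        return pieces.map { vocab[it] ?: unkId }
    }

    fun decode(ids: List<Int>): String =
        ids.joinToString("") { tokens[it] }
            .replace("\u2581", " ")
            .removePrefix(" ") // drop the space from the prepended ▁
}
```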
Why separate from #463
Keeps the #463 PR focused on the byte-level BPE bug that is actively blocking Qwen chat mode and tool calling. SentencePiece support is currently missing but not broken — no LLaMA inference path exists in this repo yet to regress.
Related