Add GGUF SentencePiece/LLaMA tokenizer path #464

@michalharakal

Description

Follow-up from #463.

The fix for #463 introduces a TokenizerFactory that dispatches tokenizer construction by architecture (tokenizer.ggml.model) rather than by file format. The Qwen/GPT-2 byte-level BPE path lands as part of #463. This issue tracks the remaining branch: SentencePiece / LLaMA.

Scope

TokenizerFactory.fromGguf(fields) currently throws UnsupportedTokenizerException when tokenizer.ggml.model is "llama" or "sentencepiece". We need a real implementation so LLaMA/Gemma/TinyLlama GGUF files can be tokenized in this repo.
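For illustration, the dispatch-and-throw behaviour described above might look like the following. This is a Python sketch, not the repo's Kotlin code; the function name `tokenizer_from_gguf` and the `"byte-level-bpe"` placeholder are hypothetical, while `tokenizer.ggml.model` is the real GGUF metadata key.

```python
class UnsupportedTokenizerException(Exception):
    """Raised when tokenizer.ggml.model names a tokenizer we can't build yet."""

def tokenizer_from_gguf(fields: dict) -> str:
    # Dispatch on GGUF metadata, mirroring the factory described above.
    model = fields.get("tokenizer.ggml.model")
    if model == "gpt2":
        # Qwen/GPT-2 byte-level BPE path, landed in #463 (placeholder here).
        return "byte-level-bpe"
    if model in ("llama", "sentencepiece"):
        # The branch this issue replaces with a real SentencePiece tokenizer.
        raise UnsupportedTokenizerException(f"SentencePiece not implemented: {model}")
    raise UnsupportedTokenizerException(f"unknown tokenizer model: {model}")
```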

What's needed

  1. A SentencePieceTokenizer (or GgufSentencePieceTokenizer) in skainet-io-core that:
    • Reads tokenizer.ggml.tokens, tokenizer.ggml.scores, tokenizer.ggml.token_type from GGUF metadata.
    • Implements SentencePiece unigram/BPE semantics: highest score wins (not merge rank).
    • Handles the SentencePiece whitespace prefix (▁, U+2581) correctly on both encode and decode.
    • Respects BOS/EOS/unknown token IDs from tokenizer.ggml.bos_token_id etc.
  2. Wire it into TokenizerFactory.fromGguf() — replace the UnsupportedTokenizerException branch.
  3. Wire it into TokenizerFactory.fromTokenizerJson() for HF model.type == "Unigram".
  4. Tests against a small LLaMA/TinyLlama GGUF (download-on-demand fixture, same pattern as the Qwen tests).
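The "highest score wins" rule and the ▁ handling from item 1 can be sketched as follows. This is a minimal Python illustration with a toy vocabulary, not the skainet-io-core implementation: a real tokenizer would read tokenizer.ggml.tokens/scores from GGUF metadata, map pieces to token IDs, and handle byte-fallback for out-of-vocabulary characters, all of which this sketch omits. Ties are broken by the leftmost pair.

```python
WS = "\u2581"  # SentencePiece whitespace marker "▁"

def sp_encode(text: str, vocab_scores: dict) -> list:
    # Encode-side whitespace handling: prepend the marker and map spaces to it.
    text = WS + text.replace(" ", WS)
    symbols = list(text)
    # Repeatedly merge the adjacent pair whose concatenation exists in the
    # vocabulary with the highest score (score-driven, not merge-rank-driven).
    while True:
        best = None  # (score, index, merged_piece)
        for i in range(len(symbols) - 1):
            merged = symbols[i] + symbols[i + 1]
            score = vocab_scores.get(merged)
            if score is not None and (best is None or score > best[0]):
                best = (score, i, merged)
        if best is None:
            break
        _, i, merged = best
        symbols[i:i + 2] = [merged]
    return symbols

def sp_decode(pieces: list) -> str:
    # Decode-side: map the marker back to spaces and drop the leading one.
    return "".join(pieces).replace(WS, " ").lstrip(" ")
```

For example, with scores {"▁h": -1.0, "hi": -0.5, "▁hi": -0.2}, encoding "hi" first merges "hi" (score -0.5 beats -1.0), then "▁hi", yielding the single piece "▁hi"; decoding it returns "hi".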

Why separate from #463

Keeps the #463 PR focused on the byte-level BPE bug that is actively blocking Qwen chat mode and tool calling. SentencePiece support is currently missing but not broken — no LLaMA inference path exists in this repo yet to regress.

Related

#463