Add GGUF SentencePiece/LLaMA tokenizer path #464

@michalharakal

Description

Follow-up from #463.

The fix for #463 introduces a TokenizerFactory that dispatches tokenizer construction by architecture (tokenizer.ggml.model) rather than by file format. The Qwen/GPT-2 byte-level BPE path lands as part of #463. This issue tracks the remaining branch: SentencePiece / LLaMA.

Scope

TokenizerFactory.fromGguf(fields) currently throws UnsupportedTokenizerException when tokenizer.ggml.model is "llama" or "sentencepiece". We need a real implementation so LLaMA/Gemma/TinyLlama GGUF files can be tokenized in this repo.
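For illustration, the dispatch-and-throw behaviour described above might look like the following. This is a Python sketch, not the repo's Kotlin code; the function name `tokenizer_from_gguf` and the `"byte-level-bpe"` placeholder are hypothetical, while `tokenizer.ggml.model` is the real GGUF metadata key.

```python
class UnsupportedTokenizerException(Exception):
    """Raised when tokenizer.ggml.model names a tokenizer we can't build yet."""

def tokenizer_from_gguf(fields: dict) -> str:
    # Dispatch on GGUF metadata, mirroring the factory described above.
    model = fields.get("tokenizer.ggml.model")
    if model == "gpt2":
        # Qwen/GPT-2 byte-level BPE path, landed in #463 (placeholder here).
        return "byte-level-bpe"
    if model in ("llama", "sentencepiece"):
        # The branch this issue replaces with a real SentencePiece tokenizer.
        raise UnsupportedTokenizerException(f"SentencePiece not implemented: {model}")
    raise UnsupportedTokenizerException(f"unknown tokenizer model: {model}")
```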

What's needed

  1. A SentencePieceTokenizer (or GgufSentencePieceTokenizer) in skainet-io-core that:
    • Reads tokenizer.ggml.tokens, tokenizer.ggml.scores, tokenizer.ggml.token_type from GGUF metadata.
    • Implements SentencePiece unigram/BPE semantics: highest score wins (not merge rank).
    • Handles the SentencePiece whitespace prefix (▁, U+2581) correctly on both encode and decode.
    • Respects BOS/EOS/unknown token IDs from tokenizer.ggml.bos_token_id etc.
  2. Wire it into TokenizerFactory.fromGguf() — replace the UnsupportedTokenizerException branch.
  3. Wire it into TokenizerFactory.fromTokenizerJson() for HF model.type == "Unigram".
  4. Tests against a small LLaMA/TinyLlama GGUF (download-on-demand fixture, same pattern as the Qwen tests).
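The "highest score wins" rule and the ▁ handling from item 1 can be sketched as follows. This is a minimal Python illustration with a toy vocabulary, not the skainet-io-core implementation: a real tokenizer would read tokenizer.ggml.tokens/scores from GGUF metadata, map pieces to token IDs, and handle byte-fallback for out-of-vocabulary characters, all of which this sketch omits. Ties are broken by the leftmost pair.

```python
WS = "\u2581"  # SentencePiece whitespace marker "▁"

def sp_encode(text: str, vocab_scores: dict) -> list:
    # Encode-side whitespace handling: prepend the marker and map spaces to it.
    text = WS + text.replace(" ", WS)
    symbols = list(text)
    # Repeatedly merge the adjacent pair whose concatenation exists in the
    # vocabulary with the highest score (score-driven, not merge-rank-driven).
    while True:
        best = None  # (score, index, merged_piece)
        for i in range(len(symbols) - 1):
            merged = symbols[i] + symbols[i + 1]
            score = vocab_scores.get(merged)
            if score is not None and (best is None or score > best[0]):
                best = (score, i, merged)
        if best is None:
            break
        _, i, merged = best
        symbols[i:i + 2] = [merged]
    return symbols

def sp_decode(pieces: list) -> str:
    # Decode-side: map the marker back to spaces and drop the leading one.
    return "".join(pieces).replace(WS, " ").lstrip(" ")
```

For example, with scores {"▁h": -1.0, "hi": -0.5, "▁hi": -0.2}, encoding "hi" first merges "hi" (score -0.5 beats -1.0), then "▁hi", yielding the single piece "▁hi"; decoding it returns "hi".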

Why separate from #463

Keeps the #463 PR focused on the byte-level BPE bug that is actively blocking Qwen chat mode and tool calling. SentencePiece support is currently missing but not broken — no LLaMA inference path exists in this repo yet to regress.

Related

#463