
Feature/463 bpe qwen#465

Merged
michalharakal merged 7 commits into develop from feature/463-bpe-qwen on Apr 12, 2026

Conversation

@michalharakal
Contributor

No description provided.

michalharakal and others added 7 commits April 12, 2026 16:47
Introduces a common Tokenizer surface so tokenizer selection can be
per-architecture rather than per file format. This is the first step
toward fixing byte-level BPE for GPT-2/Qwen models (#463), where the
same tokenizer must be dispatched whether weights come from GGUF or
SafeTensors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
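The common tokenizer surface described above might look like the following minimal sketch. The interface shape and names here are assumptions for illustration; the PR's actual interface may differ. The toy byte-identity implementation exists only to exercise the surface.

```kotlin
// Sketch of a per-architecture tokenizer surface (names are assumptions).
interface Tokenizer {
    fun encode(text: String): List<Int>
    fun decode(ids: List<Int>): String
}

// Toy implementation: each UTF-8 byte becomes one "token id".
// Purely illustrative; real tokenizers map text to vocab ids via BPE.
object ByteIdentityTokenizer : Tokenizer {
    override fun encode(text: String): List<Int> =
        text.encodeToByteArray().map { it.toInt() and 0xFF }

    override fun decode(ids: List<Int>): String =
        ids.map { it.toByte() }.toByteArray().decodeToString()
}
```

Because the interface is format-agnostic, a factory can hand back the same implementation whether the vocab came from GGUF metadata or a tokenizer.json.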
Introduces the canonical Karpathy/HF byte-to-unicode mapping that
byte-level BPE tokenizers use to represent arbitrary bytes (including
newline, tab, space) as printable code points before BPE merging.
Covered by a commonTest round-trip over all 256 bytes plus the
canonical 0x0A -> U+010A and 0x20 -> U+0120 spot-checks that lock the
table to the GPT-2 reference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
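The byte-to-unicode table is canonical and can be reconstructed from the GPT-2 reference: printable bytes map to themselves, and the remaining 68 bytes are assigned code points 256 and up in order. A sketch (function name is an assumption; the logic mirrors the GPT-2 `bytes_to_unicode` reference):

```kotlin
// Canonical GPT-2 byte-to-unicode mapping: every byte 0..255 gets a
// printable code point so whitespace/control bytes survive BPE merging.
fun byteToUnicode(): Map<Int, Char> {
    // Bytes that are already printable map to themselves.
    val bytes = (('!'.code..'~'.code) +
                 ('\u00A1'.code..'\u00AC'.code) +
                 ('\u00AE'.code..'\u00FF'.code)).toMutableList()
    val chars = bytes.map { it.toChar() }.toMutableList()
    // Everything else (newline, tab, space, ...) is shifted to 256 + n.
    var n = 0
    for (b in 0..255) {
        if (b !in bytes) {
            bytes.add(b)
            chars.add((256 + n).toChar())
            n++
        }
    }
    return bytes.zip(chars).toMap()
}
```

This reproduces the spot-checks in the commit message: byte 0x0A (newline) is the 11th non-printable byte, landing at 256 + 10 = U+010A, and 0x20 (space) lands at 256 + 32 = U+0120.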
Implements GPT-2-style byte-level BPE with the full seven-step
pipeline: special-token splitting, GPT-2 pretokenization regex,
byte-to-unicode mapping, merge-rank BPE (lowest rank wins, not
vocab score), vocab lookup, and the reverse for decode.

Includes a synthetic commonTest that locks in the algorithm with a
hand-crafted vocab + merges — no model fixture required. End-to-end
assertions against real Qwen2.5 IDs will follow in step 6.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
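The merge-rank loop at the heart of the pipeline can be sketched as follows. This is an illustrative reimplementation, not the PR's code: at each step the adjacent pair with the *lowest* merge rank is fused, which is what distinguishes GPT-2-style BPE from score-based SentencePiece BPE.

```kotlin
// Merge-rank BPE sketch: repeatedly fuse the adjacent pair with the
// lowest rank in the merges table until no applicable merge remains.
fun bpe(word: List<String>, ranks: Map<Pair<String, String>, Int>): List<String> {
    var parts = word
    while (parts.size > 1) {
        // Pick the adjacent pair with the lowest merge rank.
        val best = parts.zipWithNext()
            .minByOrNull { ranks[it] ?: Int.MAX_VALUE } ?: break
        if (ranks[best] == null) break  // no pair is in the merges table
        // Fuse every occurrence of that pair in one left-to-right pass.
        val merged = mutableListOf<String>()
        var i = 0
        while (i < parts.size) {
            if (i < parts.size - 1 &&
                parts[i] == best.first && parts[i + 1] == best.second) {
                merged.add(parts[i] + parts[i + 1]); i += 2
            } else {
                merged.add(parts[i]); i += 1
            }
        }
        parts = merged
    }
    return parts
}
```

With a hand-crafted merges table like the synthetic commonTest uses, e.g. `h+e` at rank 0, `l+l` at rank 1, `he+ll` at rank 2, the word "hello" merges to `["hell", "o"]` in three iterations.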
Adds typed accessors for tokenizer.ggml.{model, tokens, merges,
token_type, bos_token_id, eos_token_id} and reworks vocabSize to
derive from the tokens list (the old code read
"tokenizer.ggml.tokens" as an Int, which was dead code — that
field is a string array). Includes a commonTest that parses a
stub map so the contract is locked in before the factory wiring
lands in the next step.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
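The accessor shape might look like this sketch, assuming the GGUF metadata has already been parsed into a plain key/value map. The field names follow the GGUF `tokenizer.ggml.*` convention quoted above; the class and property names are assumptions.

```kotlin
// Illustrative typed accessors over parsed GGUF metadata.
// Key point from the commit: tokenizer.ggml.tokens is a string array,
// so vocabSize must derive from the list, not from reading it as an Int.
class GgufTokenizerMetadata(private val fields: Map<String, Any?>) {
    @Suppress("UNCHECKED_CAST")
    val tokens: List<String>
        get() = fields["tokenizer.ggml.tokens"] as? List<String> ?: emptyList()

    val model: String?
        get() = fields["tokenizer.ggml.model"] as? String

    val bosTokenId: Int?
        get() = (fields["tokenizer.ggml.bos_token_id"] as? Number)?.toInt()

    // Derived, not read from a (nonexistent) scalar field.
    val vocabSize: Int
        get() = tokens.size
}
```

A stub map is enough to lock in the contract, which is exactly what the commonTest mentioned in the commit does.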
Introduces TokenizerFactory.fromGguf(fields) and .fromTokenizerJson(json)
that route by tokenizer type — gpt2/BPE goes to QwenByteLevelBpeTokenizer
regardless of file format. LLaMA/SentencePiece and WordPiece branches
throw UnsupportedTokenizerException and are tracked in #464.

QwenByteLevelBpeTokenizer gains two builders:
- fromGgufFields: reads tokens/merges/token_type from the raw GGUF field
  map and treats token_type == 3 entries as atomic specials.
- fromTokenizerJson: parses a HuggingFace tokenizer.json with
  kotlinx.serialization, inverting the vocab map and pulling added_tokens
  as specials.

Covered by TokenizerFactoryDispatchTest: gpt2 -> byte-level BPE (verified
by encoding a "Hello<|end|>" round trip), llama/Unigram throw, and a
synthetic tokenizer.json produces the expected merged token id.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
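The routing described above boils down to a dispatch on the tokenizer type string. A minimal sketch of that shape (the function signature and the string result standing in for a constructed tokenizer are assumptions; only the exception name and the gpt2/BPE-share-one-path behavior come from the commit message):

```kotlin
// Thrown for tokenizer families the factory does not yet support (#464).
class UnsupportedTokenizerException(message: String) : Exception(message)

// Sketch of the dispatch: gpt2 (GGUF) and BPE (tokenizer.json) route to
// the same byte-level BPE path regardless of which file format they
// were read from; SentencePiece/WordPiece families throw for now.
fun dispatchTokenizer(model: String): String =
    when (model.lowercase()) {
        "gpt2", "bpe" -> "byte-level-bpe"
        "llama", "unigram" ->
            throw UnsupportedTokenizerException("SentencePiece tokenizers tracked in #464")
        "wordpiece" ->
            throw UnsupportedTokenizerException("WordPiece tokenizers tracked in #464")
        else ->
            throw UnsupportedTokenizerException("unknown tokenizer model: $model")
    }
```

In the real factory the gpt2/BPE branch would construct a QwenByteLevelBpeTokenizer via `fromGgufFields` or `fromTokenizerJson`; the string result here is a placeholder.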
Adds a fixture-gated jvmTest that validates the full GGUF path against
the real Qwen2.5-0.5B-Instruct model: all seven reference assertions
from #463 pass ("Hello" -> [9707], "<|im_start|>" -> [151644],
"The capital of France is" -> [785, 6722, 315, 9625, 374], newline
-> [198], chat-template roundtrip, and GGUF vs tokenizer.json produce
identical ids).

Fixtures are downloaded on demand into build/test-fixtures/ via a new
downloadQwenTokenizerFixtures Gradle task; tests print a skip notice
and pass when the files are absent, so offline/CI builds stay green
without network access. The synthetic commonTest coverage added in
earlier steps still exercises the algorithm unconditionally.

Adds a jvmTest dependency on :skainet-io:skainet-io-gguf so the test
can load real GGUF files via StreamingGGUFReader.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
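The fixture-gating pattern described here, skip with a notice instead of failing when the downloaded files are absent, can be sketched as a small guard helper. The helper name is an assumption; the path and the Gradle task name come from the commit message.

```kotlin
import java.io.File

// Fixture gate: run the test body only when the on-demand fixture exists;
// otherwise print a skip notice and return null so offline/CI builds
// stay green without network access.
fun <T> withFixture(path: String, block: (File) -> T): T? {
    val fixture = File(path)
    if (!fixture.exists()) {
        println("SKIP: fixture $path missing; run downloadQwenTokenizerFixtures to fetch it")
        return null
    }
    return block(fixture)
}
```

A fixture-gated test then wraps its assertions in `withFixture("build/test-fixtures/...", ...)` and passes trivially when the model files were never downloaded, while the synthetic commonTest coverage keeps exercising the algorithm unconditionally.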
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit 968bfc1 into develop Apr 12, 2026
4 checks passed
@michalharakal michalharakal deleted the feature/463-bpe-qwen branch April 12, 2026 18:26