
Byte-level BPE broken for GPT-2/Qwen models (affects both GGUF and SafeTensors) #463

@michalharakal

Description


Summary

GGUFTokenizer.encodeBPE() (in llm-core/.../tokenizer/GGUFTokenizer.kt) does not implement byte-level BPE correctly for GPT-2/Qwen-family tokenizers.
When used to encode text containing chat-template special tokens (e.g., <|im_start|>) or arbitrary Unicode, it produces broken token sequences that cause the model to generate nonsense output (CJK characters, URL-encoded fragments, HTML entities).

This affects both file formats, not just GGUF:

  • GGUFTokenizer.fromRandomAccessSource(gguf) — broken for Qwen GGUF models
  • GGUFTokenizer.fromTokenizerJson(json) — broken for Qwen SafeTensors models (same code path)

The bug hasn't surfaced on SafeTensors simply because no Qwen SafeTensors model has been tested in this project yet. All SafeTensors testing so far used LLaMA/Gemma (SentencePiece), which goes through a different, working code path.

Tokenizer selection should be per-architecture (per tokenizer type), not per file format.
A Qwen model needs byte-level BPE whether its weights come from .gguf or .safetensors. A LLaMA model needs SentencePiece regardless of format.

This blocks tool calling and chat mode for Qwen2, Qwen3, Qwen2.5, Mistral-Nemo, and any other model that uses GPT-2-style byte-level BPE.

Symptoms

SKaiNET-transformers #52
(Repetitive/degenerate output instead of "Paris": the tokenizer encodes "The capital of France is" into a broken token sequence that drives the model into a loop.)

Root Cause

GPT-2-style tokenizers use byte-level BPE:

  1. Text is encoded as UTF-8 bytes.
  2. Each byte is mapped to a unique Unicode character via a fixed lookup table (byte_to_unicode) that avoids control characters.
  3. A regex pretokenization splits the byte-encoded string into word-like chunks.
  4. BPE merges are applied within each chunk using merge ranks (not scores — lower rank wins).
  5. Special tokens like <|im_start|> are looked up as atomic tokens, not decomposed.
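For reference, step 2's byte-to-Unicode table is fixed and well known; a Python sketch equivalent to the original GPT-2 `encoder.py` construction:

```python
def bytes_to_unicode():
    """Standard GPT-2 mapping: each of the 256 byte values gets a unique,
    printable Unicode character. Printable Latin-1 bytes map to themselves;
    the rest (controls, space, soft hyphen, ...) are shifted up past U+0100."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # shift unprintable bytes past U+0100
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

table = bytes_to_unicode()
# Space (0x20) maps to 'Ġ' (U+0120), newline (0x0A) to 'Ċ' (U+010A):
assert table[0x20] == "\u0120" and table[0x0a] == "\u010a"
```

This is why "Ġ" and "Ċ" show up in GPT-2/Qwen vocabularies: they are the byte-level stand-ins for space and newline.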

The current GGUFTokenizer.encodeBPE() implementation:

  • No byte-to-Unicode mapping — splits input text directly into Char-sized units. Non-ASCII bytes and control characters (newline, tab) are mis-encoded.
  • No special-token splitting — chat-template tokens like <|im_start|> get greedy-merged with surrounding text instead of being recognized as atomic units.
  • Uses vocab scores instead of merge ranks — GPT-2 BPE requires picking the merge with the lowest rank from the merges list, not the highest score.
  • Does not read tokenizer.ggml.merges — the GGUF-stored merge list is ignored.
  • No regex pretokenization — BPE merges can cross word boundaries, producing different tokens than llama.cpp / HuggingFace tokenizers.

The result: "The" might encode to [1820] in llama.cpp but to [54, 104, 101] here.
The model sees completely different tokens than it was trained on, and the output is garbage.
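Of the missing pieces, special-token splitting is the most self-contained; an illustrative Python sketch (the token names come from the Qwen chat template):

```python
import re

def split_on_specials(text, special_tokens):
    """Split text so every special token becomes its own atomic segment
    that BPE never touches. Longer tokens are tried first so overlapping
    names cannot shadow each other."""
    pattern = "(" + "|".join(
        re.escape(t) for t in sorted(special_tokens, key=len, reverse=True)
    ) + ")"
    # A capturing group makes re.split keep the delimiters; drop empty parts.
    return [seg for seg in re.split(pattern, text) if seg]

segments = split_on_specials("<|im_start|>user\nHi<|im_end|>",
                             ["<|im_start|>", "<|im_end|>"])
# -> ['<|im_start|>', 'user\nHi', '<|im_end|>']
```

Each special segment is then looked up directly in the vocabulary, and only the plain-text segments continue into pretokenization and BPE.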

Reproduction Test

A simple unit test that fails today and should pass after the fix:

// File: llm-core/src/jvmTest/kotlin/sk/ainet/apps/llm/tokenizer/GGUFTokenizerByteBpeTest.kt
package sk.ainet.apps.llm.tokenizer

import sk.ainet.io.JvmRandomAccessSource
import java.nio.file.Paths
import kotlin.test.Test
import kotlin.test.assertEquals
import kotlin.test.assertTrue

/**
 * Smoke tests for GPT-2-style byte-level BPE encoding in GGUFTokenizer.
 *
 * Uses Qwen2.5-0.5B-Instruct-Q8_0.gguf as a small, fast-loading reference.
 * Expected token IDs come from HuggingFace `transformers`:
 *
 *   from transformers import AutoTokenizer
 *   tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
 *   print(tok.encode("Hello"))                  # -> [9707]
 *   print(tok.encode("<|im_start|>"))            # -> [151644]
 *   print(tok.encode("The capital of France is")) # -> [785, 6722, 315, 9625, 374]
 */
class GGUFTokenizerByteBpeTest {

    private val modelPath = Paths.get(
        System.getProperty("user.dir"),
        "Qwen2.5-0.5B-Instruct-Q8_0.gguf"
    )

    private fun loadTokenizer(): GGUFTokenizer {
        return JvmRandomAccessSource.open(modelPath.toString()).use { src ->
            GGUFTokenizer.fromRandomAccessSource(src)
        }
    }

    @Test
    fun `single ASCII word encodes to single token`() {
        val tok = loadTokenizer()
        val ids = tok.encode("Hello")
        assertEquals(listOf(9707), ids.toList(),
            "Expected Qwen2.5 to encode 'Hello' as [9707]")
    }

    @Test
    fun `special chat template token encodes as one atomic token`() {
        val tok = loadTokenizer()
        val ids = tok.encode("<|im_start|>")
        assertEquals(listOf(151644), ids.toList(),
            "Expected <|im_start|> to encode as single special token 151644, " +
                "not decomposed into byte-level characters")
    }

    @Test
    fun `sentence encodes to known Qwen2_5 token sequence`() {
        val tok = loadTokenizer()
        val ids = tok.encode("The capital of France is")
        // From HuggingFace transformers for Qwen/Qwen2.5-0.5B-Instruct
        assertEquals(
            listOf(785, 6722, 315, 9625, 374),
            ids.toList(),
            "Token sequence must match HuggingFace transformers reference"
        )
    }

    @Test
    fun `encode then decode is identity for ASCII`() {
        val tok = loadTokenizer()
        val input = "The capital of France is"
        val ids = tok.encode(input)
        val decoded = tok.decode(ids)
        assertEquals(input, decoded)
    }

    @Test
    fun `encode then decode is identity for text with special tokens`() {
        val tok = loadTokenizer()
        val input = "<|im_start|>user\nHello<|im_end|>"
        val ids = tok.encode(input)
        val decoded = tok.decode(ids)
        assertEquals(input, decoded)
    }

    @Test
    fun `newline encodes as Ċ-byte token, not character-split`() {
        val tok = loadTokenizer()
        val ids = tok.encode("\n")
        // Qwen2.5: \n -> 198
        assertEquals(listOf(198), ids.toList())
    }

    @Test
    fun `chat template roundtrip matches HuggingFace`() {
        val tok = loadTokenizer()
        val prompt = "<|im_start|>system\nYou are helpful.<|im_end|>\n<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n"
        val ids = tok.encode(prompt)
        // Expected first tokens from HF: [151644, 8948, 198, 2610, 525, 10950, 13, ...]
        assertTrue(ids.size > 10, "Prompt should encode to many tokens")
        assertEquals(151644, ids[0], "First token must be <|im_start|>")
        assertEquals(8948, ids[1], "Second token must be 'system'")
        assertEquals(198, ids[2], "Third token must be newline (Ċ)")
    }
}

Format-Independence Test

To prove the fix is format-independent, the same test should run against both a GGUF file and a tokenizer.json file from the same model and produce identical token IDs:

// File: llm-core/src/jvmTest/kotlin/sk/ainet/apps/llm/tokenizer/TokenizerFactoryDispatchTest.kt
package sk.ainet.apps.llm.tokenizer

import sk.ainet.io.JvmRandomAccessSource
import java.nio.file.Paths
import kotlin.test.Test
import kotlin.test.assertEquals
import kotlin.test.assertTrue

/**
 * Proves the tokenizer is selected per-architecture and produces
 * the same output regardless of whether the source is GGUF or
 * tokenizer.json.
 */
class TokenizerFactoryDispatchTest {

    @Test
    fun `Qwen tokenizer from GGUF and from tokenizer_json match`() {
        val ggufTok = JvmRandomAccessSource.open("Qwen2.5-0.5B-Instruct-Q8_0.gguf").use {
            TokenizerFactory.fromGGUF(it)
        }
        val jsonTok = TokenizerFactory.fromTokenizerJson(
            Paths.get("Qwen2.5-0.5B-Instruct-HF/tokenizer.json").toFile().readText()
        )

        val samples = listOf(
            "Hello",
            "The capital of France is",
            "<|im_start|>user\nHi<|im_end|>",
            "\n",
            "What is 2 + 2?"
        )

        for (text in samples) {
            assertEquals(
                ggufTok.encode(text).toList(),
                jsonTok.encode(text).toList(),
                "Token IDs for '$text' must match between GGUF and tokenizer.json"
            )
        }
    }

    @Test
    fun `Qwen tokenizer dispatches to QwenByteLevelBPETokenizer`() {
        val tok = JvmRandomAccessSource.open("Qwen2.5-0.5B-Instruct-Q8_0.gguf").use {
            TokenizerFactory.fromGGUF(it)
        }
        assertTrue(
            tok is QwenByteLevelBPETokenizer,
            "GGUF with tokenizer.ggml.model=gpt2 must dispatch to QwenByteLevelBPETokenizer, " +
                "got ${tok::class.simpleName}"
        )
    }

    @Test
    fun `LLaMA tokenizer dispatches to GGUFTokenizer SentencePiece path`() {
        val tok = JvmRandomAccessSource.open("tinyllama-1.1b-chat-v1.0.Q8_0.gguf").use {
            TokenizerFactory.fromGGUF(it)
        }
        assertTrue(
            tok is GGUFTokenizer,
            "GGUF with tokenizer.ggml.model=llama must dispatch to GGUFTokenizer"
        )
    }
}

The first test proves format independence: the same text encodes to the same token IDs from either source.
The second and third prove the dispatch routes each architecture to the correct implementation.

Running this test today:

GGUFTokenizerByteBpeTest > single ASCII word encodes to single token FAILED
    expected:<[9707]> but was:<[39, 101, 108, 108, 111]>

GGUFTokenizerByteBpeTest > special chat template token encodes as one atomic token FAILED
    expected:<[151644]> but was:<[27, 91, 105, 109, 95, 115, 116, 97, 114, 116, 124, 62]>

All byte-BPE-sensitive assertions fail. The only test that might pass is the ASCII encode/decode round trip, because decode reverses whatever encode does — but the token IDs don't match what the model was trained on, so inference still produces garbage.

Fix Requirements

A correct implementation must:

  1. Read tokenizer.ggml.merges from GGUF metadata and build a merge pair -> rank map.
  2. Build a byte-to-Unicode map (the standard GPT-2 table, 256 entries).
  3. Collect special tokens from the vocabulary (tokens matching <|...|> or marked as type=3 / control in tokenizer.ggml.token_type).
  4. Split input on special tokens before any BPE processing, so they're encoded as atomic IDs.
  5. Apply the GPT-2 pretokenization regex to non-special segments:
     '(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
  6. Byte-encode each chunk: UTF-8 bytes → byte-to-Unicode chars.
  7. Apply BPE using merge ranks (lower rank wins) instead of vocab scores.
  8. Decode: reverse the byte-to-Unicode mapping on the output string.
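Step 7 is where the current code diverges most: it picks merges by vocab score, while GPT-2 BPE picks by rank. A minimal rank-based merge loop in Python, with a toy merge table for illustration:

```python
def bpe_merge(symbols, ranks):
    """Repeatedly merge the adjacent pair with the LOWEST rank
    (= earliest position in the merges list); stop when no adjacent
    pair has a known rank."""
    parts = list(symbols)
    while len(parts) > 1:
        # Best = lowest-rank, leftmost adjacent pair.
        rank, i = min(
            (ranks.get((parts[i], parts[i + 1]), float("inf")), i)
            for i in range(len(parts) - 1)
        )
        if rank == float("inf"):
            break  # no merge applies
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts

# Toy merge table: rank = position in the merges list.
toy_ranks = {("h", "e"): 0, ("l", "l"): 1, ("he", "ll"): 2, ("hell", "o"): 3}
assert bpe_merge("hello", toy_ranks) == ["hello"]
```

Note there is no score anywhere: priority comes entirely from merge order in the training-time merges list.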

A working reference already exists in the project: QwenByteLevelBPETokenizer in llm-core/.../tokenizer/QwenByteLevelBPETokenizer.kt.
It implements all of these steps correctly but is only wired to load from tokenizer.json files, not from GGUF metadata.

Recommended Fix: Per-Architecture Dispatch in TokenizerFactory

Move the tokenizer-type detection out of GGUFTokenizer and into TokenizerFactory, where it dispatches to the right implementation regardless of file format:

object TokenizerFactory {
    /** From a GGUF file: read tokenizer.ggml.model, dispatch by type. */
    fun fromGGUF(source: RandomAccessSource): Tokenizer {
        val fields = peekFields(source)
        val modelType = fields["tokenizer.ggml.model"] as? String
        return when (modelType) {
            "gpt2", "bpe" -> QwenByteLevelBPETokenizer.fromGGUFFields(fields)  // NEW
            "llama", "sentencepiece" -> GGUFTokenizer.fromRandomAccessSource(source)
            "bert", "wordpiece" -> GGUFTokenizer.fromRandomAccessSource(source)
            else -> error("Unknown GGUF tokenizer type: $modelType")
        }
    }

    /** From a HuggingFace tokenizer.json: parse model.type, dispatch by type. */
    fun fromTokenizerJson(json: String): Tokenizer {
        val type = parseModelType(json)  // "BPE", "Unigram", "WordPiece"
        return when (type) {
            "BPE" -> QwenByteLevelBPETokenizer.fromJson(json)
            "Unigram" -> HuggingFaceBPETokenizer.fromJson(json)  // SentencePiece
            "WordPiece" -> HuggingFaceBPETokenizer.fromJson(json)
            else -> error("Unknown tokenizer type: $type")
        }
    }
}

New work needed:

  1. QwenByteLevelBPETokenizer.fromGGUFFields(fields: Map<String, Any?>) — build a tokenizer from GGUF metadata (tokenizer.ggml.tokens, tokenizer.ggml.merges, tokenizer.ggml.token_type) instead of parsing tokenizer.json.
  2. TokenizerFactory.fromGGUF() dispatch — check tokenizer.ggml.model and route to the right implementation.
  3. Mark GGUFTokenizer.encodeBPE() deprecated — its character-level BPE path is wrong for GPT-2/Qwen but still correct for SentencePiece fallbacks. Either keep it for SentencePiece/WordPiece only, or split GGUFTokenizer into GGUFSentencePieceTokenizer + GGUFWordPieceTokenizer and remove the broken BPE branch entirely.
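For item 1, the rank map is cheap to build: each tokenizer.ggml.merges entry is, like a line of HuggingFace's merges.txt, the two merge halves separated by a space. An illustrative Python sketch (the real fromGGUFFields would of course be Kotlin):

```python
def build_merge_ranks(merges):
    """Map each merge pair to its position in the merges list:
    earlier position = lower rank = higher merge priority."""
    ranks = {}
    for i, entry in enumerate(merges):
        left, right = entry.split(" ", 1)
        ranks[(left, right)] = i
    return ranks

ranks = build_merge_ranks(["h e", "l l", "he ll"])
assert ranks[("h", "e")] == 0 and ranks[("he", "ll")] == 2
```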
