[BUG] Missing spaces between English words during streaming inference with Mistral/internlm3 models #398

@rubik-hua

Description

Title: [Bug] Missing spaces between English words during streaming inference with Mistral models

When generating text using Mistral-7B-Instruct-v0.1 in InfiniLM, spaces between English words are lost, causing the output to concatenate into a single string.

Steps to Reproduce:

  1. Load the Mistral-7B-Instruct-v0.1 model.
  2. Send the prompt: <s> [INST] introduce yourself [/INST]
  3. Observe the generated output.

Actual Output:

Hello!I'manAIlanguagemodelheretohelpyouwithanyquestionsortasksyoumighthave.HowcanIassistyoutoday?</s>

Expected Output:

Hello! I'm an AI language model here to help you with any questions or tasks you might have. How can I assist you today?</s>

Root Cause

This issue stems from the interaction between SentencePiece's space placeholder (U+2581) and the underlying mechanics of the Fast Tokenizer during incremental decoding.

  1. Incremental Decoding: During LLM streaming inference, tokens are generated one at a time. InfiniLM's decoding logic (e.g., in generation/utils.py and llm/llm.py) calls tokenizer.decode([token_id]) to retrieve the text for each new token.
  2. Rust Backend Trim Behavior: Mistral utilizes LlamaTokenizerFast (powered by the Rust-based tokenizers library). When decode() is called on a standalone token (e.g., ▁world), the Rust backend first converts ▁ to a space (" world"), but then treats the token as the beginning of a sentence and trims the leading space, ultimately returning "world".
  3. Why Patching convert_tokens_to_string Fails: For Fast Tokenizers, the decode() method directly invokes the Rust backend and returns the result, completely bypassing convert_tokens_to_string(). Therefore, patching convert_tokens_to_string—as done for ChatGLM—has no effect on Mistral.

Comparison with ChatGLM:

  • ChatGLM uses the slow Python tokenizer. Its decode() method internally calls convert_tokens_to_string(), so patching that method successfully intercepts the flow.
  • Mistral uses the Fast Rust tokenizer. Its decode() method bypasses convert_tokens_to_string(), meaning the patch must be applied directly to the decode() method itself.
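
The effect described above can be reproduced without loading any model. The following is a minimal sketch, assuming the trim amounts to "replace ▁ with a space, then drop one leading space"; `toy_decode` is a hypothetical stand-in written for illustration and is not the actual Rust backend or any tokenizers API.

```python
SPACE_MARK = "\u2581"  # SentencePiece's visible space placeholder (▁)


def toy_decode(tokens):
    """Hypothetical stand-in for a fast tokenizer's decode().

    Joins the pieces, turns ▁ into a space, then drops one leading
    space as if the text were the start of a sentence.
    """
    text = "".join(tokens).replace(SPACE_MARK, " ")
    return text[1:] if text.startswith(" ") else text


tokens = ["\u2581Hello", "!", "\u2581I", "'m", "\u2581an", "\u2581AI"]

# Whole-sequence decoding: only the very first space is trimmed.
whole = toy_decode(tokens)

# Streaming: each token is decoded alone, so every ▁-derived space is trimmed.
streamed = "".join(toy_decode([t]) for t in tokens)

print(repr(whole))
print(repr(streamed))
```

Under that assumption, the whole-sequence result keeps its inner spaces while the per-token result collapses into one run of characters, which matches the pattern in the Actual Output above.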

Proposed Fix

Introduce a MistralProcessor that patches tokenizer.decode() during instantiation. The patch bypasses the Rust trim logic by manually fetching raw token strings via convert_ids_to_tokens (which preserves the ▁ character), replacing ▁ with spaces, and handling SentencePiece byte fallback sequences.
Implementation: python/infinilm/processors/mistral_processor.py

import re
import types

from .basic_llm_processor import BasicLLMProcessor
from .processor import register_processor


@register_processor("mistral")
class MistralProcessor(BasicLLMProcessor):
    def __init__(self, model_dir_path: str):
        super().__init__(model_dir_path)
        self._fix_tokenizer_decode(self.tokenizer)

    @staticmethod
    def _fix_tokenizer_decode(tokenizer):
        """Fix Mistral tokenizer incremental decoding space loss.

        LlamaTokenizerFast.decode() calls the Rust backend directly, which
        trims leading spaces derived from ▁ (U+2581) during single-token
        decoding, causing English words to concatenate.

        Fix: patch tokenizer.decode() to:
        1. Convert token IDs to raw token strings (preserving ▁)
        2. Manually replace ▁ → space and handle byte fallback
        """

        def patched_decode(self_tok, token_ids, skip_special_tokens=False, **kwargs):
            # 1. Get raw token strings (preserving ▁)
            if isinstance(token_ids, int):
                token_ids = [token_ids]
            tokens = self_tok.convert_ids_to_tokens(
                token_ids, skip_special_tokens=skip_special_tokens
            )
            if isinstance(tokens, str):
                tokens = [tokens]

            # 2. Remove special tokens if requested
            if skip_special_tokens:
                special = set(self_tok.all_special_tokens)
                tokens = [t for t in tokens if t not in special]

            # 3. Join + replace ▁ (U+2581) with space
            text = "".join(tokens).replace("\u2581", " ")

            # 4. Handle SentencePiece byte fallback: consecutive <0xHH> → UTF-8
            def byte_fallback_replace(match):
                hex_strs = re.findall(r"<0x([0-9A-Fa-f]{2})>", match.group(0))
                byte_values = bytes([int(h, 16) for h in hex_strs])
                return byte_values.decode("utf-8", errors="replace")

            text = re.sub(r"(<0x[0-9A-Fa-f]{2}>)+", byte_fallback_replace, text)
            return text

        tokenizer.decode = types.MethodType(patched_decode, tokenizer)
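
Steps 3 and 4 of the patch can be checked in isolation, since they only operate on raw token strings. The sketch below is a standalone restatement of that logic for quick experimentation; the function name `tokens_to_text` and the sample token lists are hypothetical, and no tokenizer or model is loaded.

```python
import re

BYTE_RUN = re.compile(r"(?:<0x[0-9A-Fa-f]{2}>)+")  # one or more byte-fallback tokens
BYTE_ONE = re.compile(r"<0x([0-9A-Fa-f]{2})>")     # captures the two hex digits


def tokens_to_text(tokens):
    """Join raw token strings, map ▁ (U+2581) to a space, and turn runs of
    <0xHH> byte-fallback tokens into UTF-8 text."""
    text = "".join(tokens).replace("\u2581", " ")

    def bytes_to_text(match):
        raw = bytes(int(h, 16) for h in BYTE_ONE.findall(match.group(0)))
        return raw.decode("utf-8", errors="replace")

    return BYTE_RUN.sub(bytes_to_text, text)


# Per-token decoding now keeps the space carried by each ▁.
words = ["\u2581Hello", "\u2581world"]
streamed = "".join(tokens_to_text([t]) for t in words)

# Byte fallback: the three tokens below are the UTF-8 bytes of "你".
ni = ["<0xE4>", "<0xBD>", "<0xA0>"]
together = tokens_to_text(ni)
one_by_one = "".join(tokens_to_text([t]) for t in ni)

print(repr(streamed), repr(together), repr(one_by_one))
```

In this sketch the byte-fallback tokens decode correctly only when they arrive in the same call. Passed one at a time, each incomplete byte becomes U+FFFD, so a streaming caller that decodes token by token may need to buffer byte-fallback tokens until the character is complete.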

Environment

  • Model: Mistral-7B-Instruct-v0.1, Mistral-7B-Instruct-v0.2
  • InfiniLM: main branch
  • Transformers: 4.34.0.dev0
  • Tokenizer Class: LlamaTokenizerFast (is_fast=True)