Title: [Bug] Missing spaces between English words during streaming inference with Mistral models
Description
When generating text using Mistral-7B-Instruct-v0.1 in InfiniLM, spaces between English words are lost, causing the output to concatenate into a single string.
Steps to Reproduce:
- Load the Mistral-7B-Instruct-v0.1 model.
- Send the prompt:
<s> [INST] introduce yourself [/INST]
- Observe the generated output.
Actual Output:
Hello!I'manAIlanguagemodelheretohelpyouwithanyquestionsortasksyoumighthave.HowcanIassistyoutoday?</s>
Expected Output:
Hello! I'm an AI language model here to help you with any questions or tasks you might have. How can I assist you today?</s>
Root Cause
This issue stems from the interaction between SentencePiece's space placeholder ▁ (U+2581) and the underlying mechanics of the Fast Tokenizer during incremental decoding.
- Incremental Decoding: During LLM streaming inference, tokens are generated one at a time. InfiniLM's decoding logic (e.g., in generation/utils.py and llm/llm.py) calls tokenizer.decode([token_id]) to retrieve the text for each new token.
- Rust Backend Trim Behavior: Mistral uses LlamaTokenizerFast (powered by the Rust-based tokenizers library). When decode() is called on a standalone token (e.g., ▁world), the Rust backend first converts ▁ to a space (" world"), then treats the result as the start of the text and strips the leading space, ultimately returning "world".
- Why Patching convert_tokens_to_string Fails: For Fast Tokenizers, the decode() method invokes the Rust backend directly and returns its result, bypassing convert_tokens_to_string() entirely. Therefore, patching convert_tokens_to_string, as was done for ChatGLM, has no effect on Mistral.
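To make the effect concrete, here is a minimal pure-Python sketch of the behavior described above. It does not call the real tokenizers library; toy_fast_decode is a hypothetical stand-in that only mimics "replace ▁ with a space, then strip one leading space", and the token split is illustrative rather than Mistral's actual tokenization.

```python
SPACE_MARK = "\u2581"  # SentencePiece's visible space placeholder


def toy_fast_decode(tokens):
    """Toy stand-in (assumption): map the placeholder to a space,
    then drop a single leading space from the result."""
    text = "".join(tokens).replace(SPACE_MARK, " ")
    return text[1:] if text.startswith(" ") else text


# Illustrative token strings, not the real Mistral tokenization.
tokens = ["\u2581Hello", "!", "\u2581I", "'", "m", "\u2581an", "\u2581AI"]

# Decoding the whole sequence at once loses only the very first space.
full = toy_fast_decode(tokens)

# Decoding token by token strips the leading space of every word.
streamed = "".join(toy_fast_decode([t]) for t in tokens)

print(full)      # Hello! I'm an AI
print(streamed)  # Hello!I'manAI
```

The streamed result matches the shape of the Actual Output above: each piece is individually "correct" for a text start, but the concatenation has no word boundaries.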
Comparison with ChatGLM:
- ChatGLM uses the slow Python tokenizer. Its decode() method internally calls convert_tokens_to_string(), so patching that method successfully intercepts the flow.
- Mistral uses the Fast Rust tokenizer. Its decode() method bypasses convert_tokens_to_string(), meaning the patch must be applied directly to the decode() method itself.
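The difference in patch points can be sketched with two toy classes. Both classes, the _backend_decode helper, and keep_spaces are hypothetical names for illustration only; they take token strings instead of IDs and are not the Transformers implementation.

```python
import types


class ToySlowTokenizer:
    """Hypothetical slow-style tokenizer: decode() routes through
    convert_tokens_to_string(), so that hook is a usable patch point."""

    def convert_tokens_to_string(self, tokens):
        return "".join(tokens).replace("\u2581", " ").lstrip(" ")

    def decode(self, tokens):
        return self.convert_tokens_to_string(tokens)


class ToyFastTokenizer(ToySlowTokenizer):
    """Hypothetical fast-style tokenizer: decode() goes straight to a
    backend and never calls convert_tokens_to_string()."""

    def _backend_decode(self, tokens):
        return "".join(tokens).replace("\u2581", " ").lstrip(" ")

    def decode(self, tokens):
        return self._backend_decode(tokens)


def keep_spaces(self, tokens):
    # Replacement that keeps the leading space derived from the placeholder.
    return "".join(tokens).replace("\u2581", " ")


slow, fast = ToySlowTokenizer(), ToyFastTokenizer()

# Apply the ChatGLM-style patch to both instances.
for tok in (slow, fast):
    tok.convert_tokens_to_string = types.MethodType(keep_spaces, tok)

slow_out = slow.decode(["\u2581world"])  # patch takes effect
fast_out = fast.decode(["\u2581world"])  # patch is never reached

# Patching decode() itself is what changes the fast-style result.
fast.decode = types.MethodType(keep_spaces, fast)
fast_fixed = fast.decode(["\u2581world"])

print(repr(slow_out), repr(fast_out), repr(fast_fixed))
```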
Proposed Fix
Introduce a MistralProcessor that patches tokenizer.decode() during instantiation. The patch bypasses the Rust trim logic by manually fetching raw token strings via convert_ids_to_tokens (which preserves the ▁ character), replacing ▁ with spaces, and handling SentencePiece byte fallback sequences.
Implementation: python/infinilm/processors/mistral_processor.py
import re
import types

from .basic_llm_processor import BasicLLMProcessor
from .processor import register_processor


@register_processor("mistral")
class MistralProcessor(BasicLLMProcessor):
    def __init__(self, model_dir_path: str):
        super().__init__(model_dir_path)
        self._fix_tokenizer_decode(self.tokenizer)

    @staticmethod
    def _fix_tokenizer_decode(tokenizer):
        """Fix Mistral tokenizer incremental decoding space loss.

        LlamaTokenizerFast.decode() calls the Rust backend directly, which
        trims leading spaces derived from ▁ (U+2581) during single-token
        decoding, causing English words to concatenate.

        Fix: patch tokenizer.decode() to:
        1. Convert token IDs to raw token strings (preserving ▁)
        2. Manually replace ▁ → space and handle byte fallback
        """

        def patched_decode(self_tok, token_ids, skip_special_tokens=False, **kwargs):
            # 1. Get raw token strings (preserving ▁)
            if isinstance(token_ids, int):
                token_ids = [token_ids]
            tokens = self_tok.convert_ids_to_tokens(
                token_ids, skip_special_tokens=skip_special_tokens
            )
            if isinstance(tokens, str):
                tokens = [tokens]

            # 2. Remove special tokens if requested
            if skip_special_tokens:
                special = set(self_tok.all_special_tokens)
                tokens = [t for t in tokens if t not in special]

            # 3. Join + replace ▁ (U+2581) with space
            text = "".join(tokens).replace("\u2581", " ")

            # 4. Handle SentencePiece byte fallback: consecutive <0xHH> → UTF-8
            def byte_fallback_replace(match):
                hex_strs = re.findall(r"<0x([0-9A-Fa-f]{2})>", match.group(0))
                byte_values = bytes([int(h, 16) for h in hex_strs])
                return byte_values.decode("utf-8", errors="replace")

            text = re.sub(r"(<0x[0-9A-Fa-f]{2}>)+", byte_fallback_replace, text)
            return text

        tokenizer.decode = types.MethodType(patched_decode, tokenizer)
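Steps 3 and 4 of the patch can be checked in isolation without loading a model. The sketch below lifts them into a standalone function; tokens_to_text and the sample token lists are hypothetical names and inputs chosen for illustration, not part of InfiniLM.

```python
import re

BYTE_RUN = re.compile(r"(?:<0x[0-9A-Fa-f]{2}>)+")  # one or more <0xHH> tokens
BYTE_ONE = re.compile(r"<0x([0-9A-Fa-f]{2})>")     # captures the hex digits


def tokens_to_text(tokens):
    """Join raw token strings, map the placeholder to a space, and merge
    runs of byte-fallback tokens into UTF-8 text."""
    text = "".join(tokens).replace("\u2581", " ")

    def merge(match):
        raw = bytes(int(h, 16) for h in BYTE_ONE.findall(match.group(0)))
        return raw.decode("utf-8", errors="replace")

    return BYTE_RUN.sub(merge, text)


# A single word token keeps its leading space.
word = tokens_to_text(["\u2581world"])

# "€" is U+20AC, encoded in UTF-8 as E2 82 AC.
euro_tokens = ["<0xE2>", "<0x82>", "<0xAC>"]
together = tokens_to_text(euro_tokens)
one_by_one = "".join(tokens_to_text([t]) for t in euro_tokens)

print(repr(word), repr(together), repr(one_by_one))
```

Note the last case: when byte-fallback tokens are decoded one at a time, each incomplete byte becomes U+FFFD. If the streaming loop calls decode([token_id]) per token, multi-byte characters produced via byte fallback may therefore still render incorrectly; buffering pending byte tokens until they form valid UTF-8 is one possible mitigation, which this issue does not cover.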
Environment
- Model: Mistral-7B-Instruct-v0.1, Mistral-7B-Instruct-v0.2
- InfiniLM: main branch
- Transformers: 4.34.0.dev0
- Tokenizer Class: LlamaTokenizerFast (is_fast=True)
