Description
System Info
Environment
- CPU architecture: aarch64
- CPU/Host memory size: ~450GB (Estimated for GH200 systems)
- GPU properties:
  - GPU name: NVIDIA GH200 480GB
  - GPU memory size: 97871 MiB
- Libraries:
  - TensorRT-LLM branch or tag: release:1.2.0rc7
  - Versions:
    - PyTorch: 2.9.0a0+145a3a7bda.nv25.10
    - CUDA: 13.0
  - Container used: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc7
- NVIDIA driver version: 580.65.06
- OS: Linux (aarch64)
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Steps to Reproduce
Minimal Reproduction Script
Run this script against a deployed trtllm-serve instance hosting a Harmony model (e.g., gpt-oss-20b).
```python
import openai

# Configuration
API_KEY = "EMPTY"  # Functionality doesn't require actual auth for local repro
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "openai/gpt-oss-20b"

client = openai.OpenAI(api_key=API_KEY, base_url=BASE_URL)

try:
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": "Use search function"}],
        tools=[{
            "type": "function",
            "function": {
                "name": "search",
                "description": "Search function",
                "parameters": {
                    "type": "object",
                    "properties": {"q": {"type": "string"}}
                }
            }
        }],
        temperature=1.0,  # High temperature increases likelihood of hitting the bug
        max_tokens=100
    )
    content = response.choices[0].message.content
    print(f"Response Content: {content}")
    if "<|channel|>" in str(content) or "<|message|>" in str(content):
        print("FAIL: Leak detected! Internal tokens found in output.")
    else:
        print("PASS: Output looks clean (or bug not triggered this time).")
except Exception as e:
    print(f"Error: {e}")
```

Expected behavior
The model should generate the appropriate EOS tokens (e.g., <|call|>, <|return|>) even in tool-use scenarios and at high temperatures. stop_token_ids should be passed correctly to the generation engine so that:
- The output stream is terminated correctly at the end of a tool call or message.
- No internal raw tokens (like <|channel|>) are exposed to the parser or the user.
Actual behavior
Because stop_token_ids is conditionally set to [] (an empty list) in certain request paths (use_harmony=False), the model ignores stop conditions. This leads to:
- Uncontrolled generation: the model continues generating past the intended end of the message.
- Handling failures: the Harmony parser fails to interpret the malformed/extended sequence and falls back to raw text decoding.
- Content leakage: the user receives the raw internal representation, including <|channel|> tags and potential hallucinations/spam.
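For reference, the leak check from the repro script can be factored into a small standalone helper. The marker strings are Harmony-format control tokens; the helper name `leaked_internal_tokens` is my own, not part of any library:

```python
# Control-token markers from the Harmony response format that should
# never appear in user-visible output.
INTERNAL_MARKERS = ("<|channel|>", "<|message|>", "<|call|>", "<|return|>")

def leaked_internal_tokens(content):
    """Return True if raw Harmony control tokens appear in user-visible text."""
    text = "" if content is None else str(content)
    return any(marker in text for marker in INTERNAL_MARKERS)

print(leaked_internal_tokens("The capital of France is Paris."))   # False
print(leaked_internal_tokens("<|channel|>analysis<|message|>..."))  # True
```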
Additional notes
We identified the potential root cause in tensorrt_llm/serve/responses_utils.py.
The code conditionally sets stop_token_ids:
```python
sampling_params = request.to_sampling_params(
    default_sampling_params={
        "stop_token_ids":
            get_harmony_adapter().get_stop_tokens() if use_harmony else []
    })
```

When use_harmony is unexpectedly False (which occurs intermittently in certain request flows), stop_token_ids defaults to []. This leaves the model without explicit stop conditions.
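To illustrate the failure mode, here is a toy, self-contained model of decoder-side stop handling. The token strings and the generate() loop are illustrative only, not TensorRT-LLM internals: with the Harmony stop tokens present, decoding terminates at <|call|>; with an empty list (the use_harmony=False path), it runs on and leaks internal tokens until the max_tokens backstop:

```python
# Toy model of stop-token handling; not real engine code.
HARMONY_STOP_TOKENS = ["<|call|>", "<|return|>"]

def generate(model_stream, stop_token_ids, max_tokens=100):
    """Emit tokens until a stop token is hit or max_tokens is reached."""
    emitted = []
    for tok in model_stream:
        emitted.append(tok)
        if tok in stop_token_ids:
            break  # normal termination on a stop token
        if len(emitted) >= max_tokens:
            break  # only backstop left when the stop list is empty
    return emitted

# Hypothetical model output: a tool call followed by stray internal tokens.
stream = ["<|channel|>", "commentary", "<|message|>", '{"q": "x"}',
          "<|call|>", "<|channel|>", "spam", "spam"]

with_stops = generate(stream, HARMONY_STOP_TOKENS)
without_stops = generate(stream, [])  # the use_harmony=False path

print(with_stops)     # terminates at "<|call|>"
print(without_stops)  # runs past the tool call, leaking internal tokens
```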
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.