
[Bug]: Missing Stop Tokens in responses_utils.py Causes Content Leakage in Tool Use Scenarios #10651

@zzw09773

Description


System Info

Environment

  • CPU architecture: aarch64
  • CPU/Host memory size: ~450 GB (estimated for GH200 systems)
  • GPU name: NVIDIA GH200 480GB
  • GPU memory size: 97871 MiB
  • TensorRT-LLM branch or tag: release:1.2.0rc7
  • PyTorch: 2.9.0a0+145a3a7bda.nv25.10
  • CUDA: 13.0
  • Container used: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc7
  • NVIDIA driver version: 580.65.06
  • OS: Linux (aarch64)

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to Reproduce

Run the minimal script below against a deployed trtllm-serve instance hosting a Harmony-format model (e.g., gpt-oss-20b).

import openai
# Configuration
API_KEY = "EMPTY"  # Local trtllm-serve deployments don't require real auth
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "openai/gpt-oss-20b"
client = openai.OpenAI(api_key=API_KEY, base_url=BASE_URL)
try:
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": "Use search function"}],
        tools=[{
            "type": "function",
            "function": {
                "name": "search",
                "description": "Search function",
                "parameters": {
                    "type": "object",
                    "properties": {"q": {"type": "string"}}
                }
            }
        }],
        temperature=1.0, # High temperature increases likelihood of hitting the bug
        max_tokens=100
    )
    content = response.choices[0].message.content
    print(f"Response Content: {content}")
    
    if "<|channel|>" in str(content) or "<|message|>" in str(content):
        print("FAIL: Leak detected! Internal tokens found in output.")
    else:
        print("PASS: Output looks clean (or bug not triggered this time).")
except Exception as e:
    print(f"Error: {e}")
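The leak check in the script above can be factored into a small reusable helper (hypothetical names; the tag list is illustrative, based on the control tokens observed leaking in this report):

```python
# Hypothetical helper: returns the Harmony control tags found in a response.
# The tag list is illustrative, covering the tokens mentioned in this report.
HARMONY_CONTROL_TAGS = ("<|channel|>", "<|message|>", "<|call|>", "<|return|>")

def find_leaked_tags(content):
    """Scan model output for raw Harmony control tags that should never reach the user."""
    text = "" if content is None else str(content)
    return [tag for tag in HARMONY_CONTROL_TAGS if tag in text]

# A leaked response vs. a clean one.
assert find_leaked_tags("<|channel|>analysis<|message|>hi") == ["<|channel|>", "<|message|>"]
assert find_leaked_tags("The weather is sunny.") == []
```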

Expected behavior

Even in tool-use scenarios and at high temperatures, generation should terminate on the appropriate EOS tokens (e.g., <|call|>, <|return|>). stop_token_ids should be passed correctly to the generation engine so that:

The output stream terminates at the end of a tool call or message.
No raw internal tokens (such as <|channel|>) are exposed to the parser or the user.

Actual behavior

Because stop_token_ids is conditionally set to [] (an empty list) in certain request paths (use_harmony=False), the engine has no stop conditions to enforce. This leads to:

Uncontrolled generation: the model continues past the intended end of the message, up to max_tokens.
Parsing failures: the Harmony parser cannot interpret the malformed/extended sequence and falls back to raw text decoding.
Content leakage: the user receives the raw internal representation, including <|channel|> tags and potential hallucinations/spam.
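The failure mode can be illustrated with a toy decode loop (pure simulation, not TensorRT-LLM code; token ids are made up): with the Harmony stop ids present the loop halts at the tool-call token, while an empty list lets it run to max_tokens so control tokens and trailing junk end up in the output.

```python
# Toy decode loop illustrating the effect of an empty stop_token_ids list.
# Token ids are invented for the illustration; 200012 stands in for <|call|>.
def generate(token_stream, stop_token_ids, max_tokens):
    out = []
    for tok in token_stream[:max_tokens]:
        out.append(tok)
        if tok in stop_token_ids:  # engine-side stop condition
            break
    return out

stream = [1, 2, 3, 200012, 4, 5, 6]  # model emits <|call|>, then keeps going

with_stops = generate(stream, stop_token_ids=[200012], max_tokens=100)
without_stops = generate(stream, stop_token_ids=[], max_tokens=100)

assert with_stops == [1, 2, 3, 200012]              # terminates at the tool call
assert without_stops == [1, 2, 3, 200012, 4, 5, 6]  # runs past it: content leakage
```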

Additional notes

We identified the potential root cause in tensorrt_llm/serve/responses_utils.py.

The code conditionally sets stop_token_ids:

sampling_params = request.to_sampling_params(
    default_sampling_params={
        "stop_token_ids":
        get_harmony_adapter().get_stop_tokens() if use_harmony else []
    })

When use_harmony is unexpectedly False (which occurs intermittently in certain request flows), stop_token_ids defaults to []. This leaves the model without explicit stop conditions.
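One possible hardening (a sketch with hypothetical wiring, not an actual patch) would derive the default from the served model family rather than from the per-request use_harmony flag, so that a request whose flag is mis-detected as False still gets the Harmony stop ids:

```python
# Sketch of a defensive default (hypothetical helper; not the actual fix).
# Key the fallback on whether the served model itself is a Harmony model,
# not only on the per-request use_harmony flag.
def default_stop_token_ids(use_harmony, model_is_harmony, harmony_stop_tokens):
    if use_harmony or model_is_harmony:
        return list(harmony_stop_tokens)
    return []  # non-Harmony models keep the current behavior

# With the bug, use_harmony=False yields []; this sketch still returns the
# stop ids as long as the served model is known to be a Harmony model.
assert default_stop_token_ids(False, True, [200012, 200002]) == [200012, 200002]
assert default_stop_token_ids(False, False, [200012, 200002]) == []
```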

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels: bug (Something isn't working)