System Info
- GPU: NVIDIA GeForce RTX 5080 Laptop (SM120, 16GB)
- TRT-LLM: 1.3.0rc9 (NGC container nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc9)
- CUDA: 13.2 (host), container default
- Driver: 580.126.20
- OS: Ubuntu 25.10
Who can help?
No response
Information
Tasks
Reproduction
Docker command:
docker run --rm --gpus all --ipc=host -p 8000:8000 \
nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc9 \
bash -c "TRTLLM_ENABLE_PDL=1 trtllm-serve nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8 \
--host 0.0.0.0 --port 8000 --backend _autodeploy --trust_remote_code \
--max_batch_size 4 --max_seq_len 512"
Server starts successfully. Then:
# Single-token prompt — WORKS
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" \
-d '{"model":"nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8","prompt":"Hello","max_tokens":20,"temperature":0}'
# Returns: "Hello, world! Hello, universe!..."
# Multi-token prompt — FAILS
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" \
-d '{"model":"nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8","prompt":"What is AI?","max_tokens":20,"temperature":0}'
Expected behavior
Multi-token prompts should work the same as single-token prompts.
Actual behavior
expected size 1==4, stride 1==1 at dim=0
Error in op: torch.ops.auto_deploy.mamba_ssm_prepare_metadata.default
This error most often comes from an incorrect fake (aka meta) kernel for a custom op.
The 1==N in the error matches the number of prompt tokens (e.g., 1==4 for "What is AI?", 1==12 for longer prompts). Single-token prompts always succeed.
After the error, subsequent requests (including single-token) fail with AssertionError: Sampling failed.
Analysis
The _mamba_ssm_prepare_metadata_fake kernel in mamba_backend_common.py (line 81) uses position_ids.shape[:2] to determine (b, s). During torch.compile export, the graph appears to specialize s=1 (decode path), causing a shape mismatch when runtime prefill passes s>1.
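The suspected failure mode can be sketched in plain Python (a hypothetical simplification for illustration — prepare_metadata_size and the shapes below are illustrative, not the real op):

```python
# Hypothetical sketch of the suspected bug (simplified names; not the actual
# TRT-LLM code). The fake (meta) kernel derives metadata size from
# position_ids.shape[:2]; if export traces a decode-path example with s == 1,
# that size is baked into the compiled graph, and a runtime prefill with
# s > 1 then violates the recorded shape.

def prepare_metadata_size(position_ids_shape):
    """Per-token metadata size derived from (b, s) = position_ids.shape[:2]."""
    b, s = position_ids_shape[:2]
    return b * s

# Export traced with a single-token (decode) input: b=1, s=1.
compiled_size = prepare_metadata_size((1, 1))  # size 1 baked into the graph

# Runtime prefill for "What is AI?" (4 prompt tokens): b=1, s=4.
runtime_size = prepare_metadata_size((1, 4))   # size 4 at runtime

# The graph expects size 1 but receives 4 -- mirroring
# "expected size 1==4 ... at dim=0".
assert compiled_size == 1 and runtime_size == 4
```

If this reading is right, the fix would be to keep the sequence dimension dynamic during export (or derive the metadata size from the runtime token count) rather than specializing it to the decode-path value.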
The 30B MoE model (Nemotron-3-Nano-30B-A3B) is reported working in #12323 — this bug may be specific to the 4B dense variant's graph export path.
Before submitting a new issue...