[Bug][AutoDeploy]: Nemotron-3-Nano-4B-FP8 mamba_ssm_prepare_metadata shape mismatch on multi-token prefill #12573

@kimchi-developer

Description

System Info

  • GPU: NVIDIA GeForce RTX 5080 Laptop (SM120, 16GB)
  • TRT-LLM: 1.3.0rc9 (NGC container nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc9)
  • CUDA: 13.2 (host), container default
  • Driver: 580.126.20
  • OS: Ubuntu 25.10

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Docker command:

docker run --rm --gpus all --ipc=host -p 8000:8000 \
  nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc9 \
  bash -c "TRTLLM_ENABLE_PDL=1 trtllm-serve nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8 \
    --host 0.0.0.0 --port 8000 --backend _autodeploy --trust_remote_code \
    --max_batch_size 4 --max_seq_len 512"

Server starts successfully. Then:

# Single-token prompt — WORKS
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" \
  -d '{"model":"nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8","prompt":"Hello","max_tokens":20,"temperature":0}'
# Returns: "Hello, world! Hello, universe!..."

# Multi-token prompt — FAILS
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" \
  -d '{"model":"nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8","prompt":"What is AI?","max_tokens":20,"temperature":0}'

Expected behavior

Multi-token prompts should work the same as single-token prompts.

Actual behavior

expected size 1==4, stride 1==1 at dim=0
Error in op: torch.ops.auto_deploy.mamba_ssm_prepare_metadata.default
This error most often comes from an incorrect fake (aka meta) kernel for a custom op.

The N in the error's 1==N tracks the number of prompt tokens (1==4 for "What is AI?", 1==12 for longer prompts); single-token prompts always succeed.

After the error, subsequent requests (including single-token) fail with AssertionError: Sampling failed.
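
For context, here is a minimal, hypothetical repro of this class of failure (PyTorch >= 2.4; the op name and shapes are illustrative, not the actual TRT-LLM code): a real kernel that returns one metadata entry per token, paired with a fake (aka meta) kernel that bakes in the decode-path assumption of one token per sequence. The exact error text can vary across PyTorch versions.

import torch

@torch.library.custom_op("demo::prepare_metadata", mutates_args=())
def prepare_metadata(position_ids: torch.Tensor) -> torch.Tensor:
    # Real kernel: one metadata entry per token across the batch.
    b, s = position_ids.shape[:2]
    return position_ids.new_zeros(b * s)

@prepare_metadata.register_fake
def _(position_ids: torch.Tensor) -> torch.Tensor:
    # Wrong fake kernel: assumes s == 1 (decode), so it reports only b entries.
    b = position_ids.shape[0]
    return position_ids.new_zeros(b)

fn = torch.compile(lambda p: prepare_metadata(p) + 1)
fn(torch.zeros(1, 1, dtype=torch.long))  # decode-like, s == 1: fake and real agree
fn(torch.zeros(1, 4, dtype=torch.long))  # prefill-like, s == 4: fails with something like
                                         # "expected size 1==4, stride 1==1 at dim=0"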

Analysis

The _mamba_ssm_prepare_metadata_fake kernel in mamba_backend_common.py (line 81) derives (b, s) from position_ids.shape[:2]. During torch.compile export, the graph appears to specialize to s=1 (the decode path), so a shape mismatch occurs when a prefill request later passes s>1 at runtime.
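
If that reading is right, the direction of a fix would be to keep the fake kernel's output shape tied to the (possibly symbolic) token count, and to keep the sequence dimension dynamic during export. A hedged sketch, reusing the illustrative names from the repro above rather than the actual TRT-LLM code:

import torch

@torch.library.custom_op("demo::prepare_metadata_fixed", mutates_args=())
def prepare_metadata_fixed(position_ids: torch.Tensor) -> torch.Tensor:
    b, s = position_ids.shape[:2]
    return position_ids.new_zeros(b * s)

@prepare_metadata_fixed.register_fake
def _(position_ids: torch.Tensor) -> torch.Tensor:
    # Correct fake kernel: s stays symbolic, no s == 1 baked in.
    b, s = position_ids.shape[:2]
    return position_ids.new_zeros(b * s)

fn = torch.compile(lambda p: prepare_metadata_fixed(p) + 1)
example = torch.zeros(1, 4, dtype=torch.long)
torch._dynamo.mark_dynamic(example, 1)   # keep dim 1 (sequence length) dynamic
fn(example)                              # prefill-like s == 4 works
fn(torch.zeros(1, 1, dtype=torch.long))  # decode-like s == 1 also works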

The 30B MoE model (Nemotron-3-Nano-30B-A3B) is reported to work in #12323 — this bug may be specific to the 4B dense variant's graph export path.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

AutoDeploy<NV> AutoDeploy Backend

Status

Done