System Info
- GPU: NVIDIA GeForce RTX 5080 Laptop (SM120, 16GB)
- TRT-LLM: 1.3.0rc9 (NGC container nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc9)
- CUDA: 13.2 (host), container default
- Driver: 580.126.20
- OS: Ubuntu 25.10
Who can help?
No response
Information
Tasks
Reproduction
Docker command:
docker run --rm --gpus all --ipc=host -p 8000:8000 \
nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc9 \
bash -c "TRTLLM_ENABLE_PDL=1 trtllm-serve nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8 \
--host 0.0.0.0 --port 8000 --backend _autodeploy --trust_remote_code \
--max_batch_size 4 --max_seq_len 512"
Server starts successfully. Then:
# Single-token prompt — WORKS
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" \
-d '{"model":"nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8","prompt":"Hello","max_tokens":20,"temperature":0}'
# Returns: "Hello, world! Hello, universe!..."
# Multi-token prompt — FAILS
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" \
-d '{"model":"nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8","prompt":"What is AI?","max_tokens":20,"temperature":0}'
Expected behavior
Multi-token prompts should work the same as single-token prompts.
Actual behavior
expected size 1==4, stride 1==1 at dim=0
Error in op: torch.ops.auto_deploy.mamba_ssm_prepare_metadata.default
This error most often comes from an incorrect fake (aka meta) kernel for a custom op.
The 1==N in the error matches the number of prompt tokens (e.g., 1==4 for "What is AI?", 1==12 for longer prompts). Single-token prompts always succeed.
After the error, subsequent requests (including single-token) fail with AssertionError: Sampling failed.
Analysis
The _mamba_ssm_prepare_metadata_fake kernel in mamba_backend_common.py (line 81) uses position_ids.shape[:2] to determine (b, s). During torch.compile export, the graph appears to specialize s=1 (decode path), causing a shape mismatch when runtime prefill passes s>1.
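The suspected failure mode can be sketched in plain Python (a hypothetical simplification for illustration — prepare_metadata_size and the shapes below are illustrative, not the real op):

```python
# Hypothetical sketch of the suspected bug (simplified names; not the actual
# TRT-LLM code). The fake (meta) kernel derives metadata size from
# position_ids.shape[:2]; if export traces a decode-path example with s == 1,
# that size is baked into the compiled graph, and a runtime prefill with
# s > 1 then violates the recorded shape.

def prepare_metadata_size(position_ids_shape):
    """Per-token metadata size derived from (b, s) = position_ids.shape[:2]."""
    b, s = position_ids_shape[:2]
    return b * s

# Export traced with a single-token (decode) input: b=1, s=1.
compiled_size = prepare_metadata_size((1, 1))  # size 1 baked into the graph

# Runtime prefill for "What is AI?" (4 prompt tokens): b=1, s=4.
runtime_size = prepare_metadata_size((1, 4))   # size 4 at runtime

# The graph expects size 1 but receives 4 -- mirroring
# "expected size 1==4 ... at dim=0".
assert compiled_size == 1 and runtime_size == 4
```

If this reading is right, the fix would be to keep the sequence dimension dynamic during export (or derive the metadata size from the runtime token count) rather than specializing it to the decode-path value.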
The 30B MoE model (Nemotron-3-Nano-30B-A3B) is reported working in #12323 — this bug may be specific to the 4B dense variant's graph export path.
Before submitting a new issue...