[Bug]: Multi-node regression: segfault/hang in NCCL fabric detection on dual DGX Spark (1.3.0rc10 vs 1.2.0rc6) #12715

@yam-face

System Info

2x DGX Spark (GB10, SM121, 128GB unified memory)
Connected via QSFP 200GbE direct cable (RoCE)
Container: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
Working container: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6
CUDA 13.0, aarch64
Inter-node TP/EP over RoCE

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Reproduces on multiple models; the steps below use Qwen3-235B:

  1. Follow the TRT-LLM inference guide for two Sparks.
  2. Serve Qwen3-235B-A22B-FP4.
  3. Follow the playbook exactly, changing only the TRT-LLM version from 1.2.0rc6 to 1.3.0rc10.

Expected behavior

When MNNVL fabric is unavailable (finalMNNVL=0), TRT-LLM should fall through to standard NCCL IB transport as it did in 1.2.0rc6, not hang or segfault.
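
To make the expected behavior concrete, here is a minimal Python sketch of the fallback decision. It is not TRT-LLM code: the `FabricProbe` class and `select_allreduce_transport` function are hypothetical names, and the rule that MNNVL requires all three flags to be set is an assumption inferred from the log fields reported below.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricProbe:
    """Flags as they appear in the 'Inter-node topology' log line."""
    local_nvlink: bool
    local_fabric_valid: bool
    all_ranks_same_fabric: bool

    @property
    def final_mnnvl(self) -> bool:
        # Assumption: MNNVL is usable only if every condition holds.
        return (self.local_nvlink
                and self.local_fabric_valid
                and self.all_ranks_same_fabric)


def select_allreduce_transport(probe: FabricProbe) -> str:
    """Hypothetical selector: pick a transport without blocking."""
    if probe.final_mnnvl:
        return "MNNVL"
    # Expected path on dual DGX Spark: plain NCCL over IB/RoCE.
    return "NCCL"


# Values taken from the log in this report: 1, 0, 0.
spark = FabricProbe(local_nvlink=True,
                    local_fabric_valid=False,
                    all_ranks_same_fabric=False)
print(select_allreduce_transport(spark))  # prints "NCCL"
```

The point of the sketch is that a failed probe should be an ordinary branch that returns a transport, not a state that waits on fabric resources.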

Actual behavior

Description:
Regression from 1.2.0rc6 to 1.3.0rc10 on dual DGX Spark multi-node inference. Two distinct failure modes occur at the same point: NCCL fabric/allreduce initialization during autotuner warmup.

Failure 1 — Segfault (before NCCL env workarounds):
Qwen3-235B with --tp_size 2 segfaults in ncclGinGdakiQueryLastError → AllreduceOp::run during autotuner warmup. This model runs successfully on 1.2.0rc6 with the same configuration.
!!!!!!! Segfault encountered !!!!!!!
File "transport/net_ib/gdaki/gin_host_gdaki.cc", line 991, in ncclGinGdakiQueryLastError(void*, bool*)
File "gin/gin_host.cc", line 289, in ncclGinQueryLastError(ncclGinState*, bool*)
File "/dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/init.cc", line 3047, in ncclCommGetAsyncError
File "/dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/init.cc", line 402, in ncclCommEnsureReady(ncclComm*)
File "/dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/init.cc", line 3073, in ncclCommCount
File "", line 0, in tensorrt_llm::_v1::torch_ext::(anonymous namespace)::AllreduceOp::run(...)
File "", line 0, in tensorrt_llm::_v1::torch_ext::allreduce_raw(...)
Failure 2 — Indefinite hang (with NCCL env workarounds):
After adding NCCL_GIN_ENABLE=0 the segfault is eliminated, but both Qwen3-235B (--tp_size 2) and Nemotron-3-Super-120B-A12B-NVFP4 (--tp_size 2 --ep_size 2) hang indefinitely at:
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] Found inter-node TP group for rank 0, checking MNNVL support via fabric info
[TensorRT-LLM][INFO] Inter-node topology: localNVLink=1, localFabricValid=0, allRanksSameFabric=0, finalMNNVL=0

The hang occurs after MNNVL detection correctly resolves to finalMNNVL=0; execution never proceeds to the standard NCCL IB fallback path.
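
For anyone comparing logs across ranks or versions, the topology line can be pulled apart with a small helper. This is a diagnostic sketch only; `parse_topology_line` is a hypothetical name, and the log format is assumed to match the line quoted above.

```python
import re
from typing import Dict, Optional

# Matches the "Inter-node topology: k=v, k=v, ..." line from the log above.
TOPOLOGY_RE = re.compile(r"Inter-node topology:\s*(?P<fields>.+)$")


def parse_topology_line(line: str) -> Optional[Dict[str, int]]:
    """Return the key=value flags as integers, or None if the line differs."""
    match = TOPOLOGY_RE.search(line.strip())
    if match is None:
        return None
    flags: Dict[str, int] = {}
    for pair in match.group("fields").split(","):
        key, _, value = pair.strip().partition("=")
        flags[key] = int(value)
    return flags


line = ("[TensorRT-LLM][INFO] Inter-node topology: localNVLink=1, "
        "localFabricValid=0, allRanksSameFabric=0, finalMNNVL=0")
flags = parse_topology_line(line)
print(flags)
```

If `flags["finalMNNVL"]` is 0 on every rank and the process still stalls, the stall happens after the probe rather than inside it.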
Environment variables attempted — none resolve the hang:
NCCL_MNNVL_ENABLE=0
NCCL_GIN_ENABLE=0
NCCL_IB_MERGE_NICS=0
NCCL_SYMMETRIC_ENABLE=0
TRTLLM_ALLREDUCE_STRATEGY=NCCL
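
For reproducibility, the variables above can be applied from a launcher script instead of being exported by hand. The sketch below only builds the environment; `build_env` is a hypothetical helper, and the launch command itself is whatever the playbook specifies and is not reproduced here.

```python
import os
from typing import Dict, Mapping, Optional

# Values copied from the list above. None of them resolved the hang.
NCCL_WORKAROUNDS = {
    "NCCL_MNNVL_ENABLE": "0",
    "NCCL_GIN_ENABLE": "0",
    "NCCL_IB_MERGE_NICS": "0",
    "NCCL_SYMMETRIC_ENABLE": "0",
    "TRTLLM_ALLREDUCE_STRATEGY": "NCCL",
}


def build_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the base environment and overlay the attempted workarounds."""
    env = dict(os.environ if base is None else base)
    env.update(NCCL_WORKAROUNDS)
    return env


env = build_env({"PATH": "/usr/bin"})
for name in sorted(NCCL_WORKAROUNDS):
    print(f"{name}={env[name]}")
```

The resulting dictionary can be passed as `env=` to `subprocess.run`. The same values presumably need to be set on both nodes, since each rank reads its own environment.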

Qwen3-235B-A22B with --tp_size 2 works on 1.2.0rc6 and segfaults or hangs on 1.3.0rc10.

The issue reproduces with other models as well, not just Qwen3.
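
When testing across models and versions, it helps to tell the two failure modes apart automatically. The following sketch wraps a command with a timeout and inspects the exit status; `classify_run` is a hypothetical helper, the two child commands are stand-ins rather than the real server launch, and it assumes a POSIX host where a process killed by SIGSEGV reports a negative return code. A launcher such as mpirun may report the crash differently.

```python
import signal
import subprocess
import sys
from typing import List


def classify_run(cmd: List[str], timeout_s: float) -> str:
    """Run cmd and report 'ok', 'segfault', 'hang', or 'exit N'."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return "hang"
    if result.returncode == -signal.SIGSEGV:
        return "segfault"
    return "ok" if result.returncode == 0 else f"exit {result.returncode}"


# Stand-in children that imitate the two failure modes.
crash = [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
stall = [sys.executable, "-c", "import time; time.sleep(60)"]

print(classify_run(crash, timeout_s=10))
print(classify_run(stall, timeout_s=1))
```

For the real workload the timeout would need to be well above normal warmup time, so that a slow but successful start is not misreported as a hang.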

Additional notes

Probable regression source:
PR #11174 ("Fallback to NCCL instead of NCCL symmetric") was merged into 1.2.0rc6.post3 and fixed a similar NCCL fallback issue for non-MNNVL topologies. The 1.3.0 branch appears to have reintroduced or bypassed this fallback path with the new GIN/GDAKI transport stack and MNNVL fabric probing logic.

Metadata

Labels

Scale-out<NV> (Multi-GPU and distributed inference scaling issues, tensor/pipeline/data parallelism)
bug (Something isn't working)
