[Bug]: Multi-node regression: segfault/hang in NCCL fabric detection on dual DGX Spark (1.3.0rc10 vs 1.2.0rc6)

### System Info

System Info:

- 2x DGX Spark (GB10, SM121, 128 GB unified memory)
- Connected via a direct QSFP 200GbE cable (RoCE)
- Failing container: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
- Working container: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6
- CUDA 13.0, aarch64
- Inter-node TP/EP over RoCE

### Who can help?

_No response_

### Information

- [x] The official example scripts
- [ ] My own modified scripts

### Tasks

- [x] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

This happens on multiple models. To reproduce:
1. Follow the TensorRT-LLM inference guide for two DGX Sparks.
2. Serve Qwen3-235B-A22B-FP4.
3. Follow the playbook exactly, changing only the TensorRT-LLM container version from 1.2.0rc6 to 1.3.0rc10.

### Expected behavior

Expected behavior:
When MNNVL fabric is unavailable (finalMNNVL=0), TRT-LLM should fall through to standard NCCL IB transport as it did in 1.2.0rc6, not hang or segfault.

### actual behavior

Description:
Regression from 1.2.0rc6 to 1.3.0rc10 in multi-node inference on dual DGX Spark. Two distinct failure modes are observed at the same point: NCCL fabric/allreduce initialization during autotuner warmup.
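
To tell the two failure modes apart in an automated repro, the launch command can be run under a timeout and the result classified. The following is a minimal sketch and not part of TensorRT-LLM: `classify_run` is a hypothetical helper, the timeout is an arbitrary choice, and it assumes a POSIX system where a process killed by SIGSEGV reports a negative return code.

```python
import signal
import subprocess
import sys


def classify_run(cmd, timeout_s):
    """Run cmd and return 'ok', 'segfault', 'hang', or 'error:<code>'.

    Hypothetical triage helper: a timeout is treated as a hang.
    """
    try:
        proc = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the timeout.
        return "hang"
    if proc.returncode == 0:
        return "ok"
    if proc.returncode == -signal.SIGSEGV:
        return "segfault"
    return f"error:{proc.returncode}"


if __name__ == "__main__":
    # Placeholder command; substitute a launch command that terminates on success.
    print(classify_run([sys.executable, "-c", "pass"], timeout_s=30))
```

Two caveats apply. A healthy server never exits, so the harness is only meaningful for a command that terminates on success, such as a one-shot inference script. And when the workload is started through a launcher, the launcher may report its own exit code instead of the rank's signal, so a segfault can show up as `error:<code>`.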

Failure 1 (segfault, before NCCL env workarounds):
Qwen3-235B with --tp_size 2 segfaults in ncclGinGdakiQueryLastError, reached from AllreduceOp::run, during autotuner warmup. The same model and configuration run successfully on 1.2.0rc6.

!!!!!!! Segfault encountered !!!!!!!
  File "transport/net_ib/gdaki/gin_host_gdaki.cc", line 991, in ncclGinGdakiQueryLastError(void*, bool*)
  File "gin/gin_host.cc", line 289, in ncclGinQueryLastError(ncclGinState*, bool*)
  File "/dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/init.cc", line 3047, in ncclCommGetAsyncError
  File "/dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/init.cc", line 402, in ncclCommEnsureReady(ncclComm*)
  File "/dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/init.cc", line 3073, in ncclCommCount
  File "<unknown>", line 0, in tensorrt_llm::_v1::torch_ext::(anonymous namespace)::AllreduceOp::run(...)
  File "<unknown>", line 0, in tensorrt_llm::_v1::torch_ext::allreduce_raw(...)

Failure 2 (indefinite hang, with NCCL env workarounds):
Setting NCCL_GIN_ENABLE=0 eliminates the segfault, but both Qwen3-235B (--tp_size 2) and Nemotron-3-Super-120B-A12B-NVFP4 (--tp_size 2 --ep_size 2) then hang indefinitely after logging:
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] Found inter-node TP group for rank 0, checking MNNVL support via fabric info
[TensorRT-LLM][INFO] Inter-node topology: localNVLink=1, localFabricValid=0, allRanksSameFabric=0, finalMNNVL=0

The segfault occurs after MNNVL detection correctly resolves to finalMNNVL=0, and the hang occurs at the same point: execution never proceeds to the standard NCCL IB fallback path.
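
For triage across many runs, the topology line can be pulled out of the logs and checked mechanically. This is a minimal sketch that assumes the log format is exactly as quoted above, with comma-separated `key=value` pairs holding integer values. `parse_topology` and `last_line_is_topology` are hypothetical helpers, not TensorRT-LLM APIs.

```python
import re

TOPOLOGY_RE = re.compile(r"Inter-node topology:\s*(.*)")


def parse_topology(line):
    """Return the key=value fields of an 'Inter-node topology' line as ints, or None."""
    match = TOPOLOGY_RE.search(line)
    if match is None:
        return None
    fields = {}
    for part in match.group(1).split(","):
        key, _, value = part.strip().partition("=")
        fields[key] = int(value)
    return fields


def last_line_is_topology(log_lines):
    """True if the final non-empty line is the topology line (nothing logged after it)."""
    non_empty = [entry for entry in log_lines if entry.strip()]
    return bool(non_empty) and parse_topology(non_empty[-1]) is not None


if __name__ == "__main__":
    line = (
        "[TensorRT-LLM][INFO] Inter-node topology: localNVLink=1, "
        "localFabricValid=0, allRanksSameFabric=0, finalMNNVL=0"
    )
    print(parse_topology(line))
```

A log whose last non-empty line is the topology line with `finalMNNVL` equal to 0 matches the hang signature described in this report.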
Environment variables attempted (none resolve the hang):

- NCCL_MNNVL_ENABLE=0
- NCCL_GIN_ENABLE=0
- NCCL_IB_MERGE_NICS=0
- NCCL_SYMMETRIC_ENABLE=0
- TRTLLM_ALLREDUCE_STRATEGY=NCCL
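
To apply the same set of variables consistently on both nodes, it can be kept in one place and either merged into a process environment or printed as shell exports. This is a minimal sketch: the variable names and values are copied from the list above, while `build_env` and `export_lines` are hypothetical helper names.

```python
import os

# Variables and values exactly as listed in this report.
NCCL_WORKAROUND_ENV = {
    "NCCL_MNNVL_ENABLE": "0",
    "NCCL_GIN_ENABLE": "0",
    "NCCL_IB_MERGE_NICS": "0",
    "NCCL_SYMMETRIC_ENABLE": "0",
    "TRTLLM_ALLREDUCE_STRATEGY": "NCCL",
}


def build_env(base=None):
    """Return a copy of base (default: os.environ) with the workaround variables set."""
    env = dict(os.environ if base is None else base)
    env.update(NCCL_WORKAROUND_ENV)
    return env


def export_lines():
    """Return shell 'export' lines for pasting into a launch script."""
    return [f"export {key}={value}" for key, value in NCCL_WORKAROUND_ENV.items()]


if __name__ == "__main__":
    print("\n".join(export_lines()))
```

The result of `build_env()` can be passed as the `env` argument of `subprocess.run`. In a multi-node launch the variables have to reach every rank on both nodes, and how that is done depends on the launcher in use.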

Summary: Qwen3-235B-A22B with --tp_size 2 works on 1.2.0rc6 and segfaults or hangs on 1.3.0rc10.

The problem is reproducible with other models, not just Qwen3.

### additional notes

Probable regression source:
PR #11174 ("Fallback to NCCL instead of NCCL symmetric") was merged into 1.2.0rc6.post3 and fixed a similar NCCL fallback issue for non-MNNVL topologies. The 1.3.0 branch appears to have reintroduced or bypassed this fallback path with the new GIN/GDAKI transport stack and MNNVL fabric probing logic.

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and checked the [documentation](https://nvidia.github.io/TensorRT-LLM/) and [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for answers to frequently asked questions.

[Bug]: Multi-node regression: segfault/hang in NCCL fabric detection on dual DGX Spark (1.3.0rc10 vs 1.2.0rc6) #12715
