[None][feat] Add multi-node support for VisualGen diffusion workers via torchrun/SLURM #13140
chang-l merged 5 commits into NVIDIA:main from
Conversation
Detect torchrun/SLURM external launchers via environment variables (RANK/WORLD_SIZE and SLURM_PROCID/SLURM_NTASKS). Rank 0 acts as coordinator; ranks 1..N-1 run as pure workers and exit after completion. Adds local_rank parameter to run_diffusion_worker for correct per-node GPU device assignment in multi-node runs. Signed-off-by: Venmugil Elango <498703+venmugil@users.noreply.github.com>
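Based on that description and the helper referenced by this PR's tests, the detection likely boils down to something like the following minimal sketch. The helper name `_detect_external_launch()` and its return tuple come from the PR; the body below is an assumption, not the actual implementation.

```python
import os

def detect_external_launch_sketch():
    """Illustrative version of the PR's _detect_external_launch() helper (assumed body)."""
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:  # torchrun
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
    elif "SLURM_PROCID" in os.environ and "SLURM_NTASKS" in os.environ:  # SLURM
        rank = int(os.environ["SLURM_PROCID"])
        world_size = int(os.environ["SLURM_NTASKS"])
    else:
        return None  # no external launcher detected; keep single-node behavior

    # LOCAL_RANK may be absent; default to the global rank, matching the
    # behavior asserted by the unit test discussed later in this review.
    local_rank = int(os.environ.get("LOCAL_RANK", rank))

    master_addr = os.environ.get("MASTER_ADDR")
    if master_addr is None:
        # The PR raises RuntimeError when MASTER_ADDR is unset in both paths.
        raise RuntimeError("MASTER_ADDR must be set by torchrun/SLURM")
    master_port = int(os.environ.get("MASTER_PORT", "29500"))
    return rank, local_rank, world_size, master_addr, master_port
```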
Signed-off-by: Venmugil Elango <498703+venmugil@users.noreply.github.com>
…vice assignment Signed-off-by: Venmugil Elango <498703+venmugil@users.noreply.github.com>
📝 Walkthrough
The changes introduce multi-node distributed launcher support for visual generation. A new `_detect_external_launch()` helper identifies torchrun/SLURM launches from environment variables and routes rank 0 through the coordinator path while the remaining ranks run as pure workers.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Launcher as External Launcher<br/>(torchrun/SLURM)
participant Rank0 as Rank 0 Process<br/>(Coordinator)
participant RankN as Rank N Process<br/>(Worker)
participant Client as DiffusionRemoteClient
participant Executor as run_diffusion_worker
Launcher->>Rank0: spawn with RANK=0, LOCAL_RANK=X,<br/>MASTER_ADDR, MASTER_PORT
Launcher->>RankN: spawn with RANK=N, LOCAL_RANK=Y,<br/>MASTER_ADDR, MASTER_PORT
RankN->>RankN: _detect_external_launch()<br/>extracts launcher params
RankN->>Executor: run_diffusion_worker(local_rank=Y, ...)
Executor->>Executor: device_id = Y % device_count
Executor->>Executor: init_distributed(), serve loop
Executor-->>RankN: (running)
RankN->>RankN: sys.exit(0)
Rank0->>Rank0: _detect_external_launch()<br/>extracts launcher params
Rank0->>Client: __init__(ext=(rank=0, local_rank=X, ...))
Client->>Client: Single-node=False, use launcher addresses
Client->>Executor: spawn background thread<br/>run_diffusion_worker(rank=0, local_rank=X, ...)
Executor->>Executor: device_id = X % device_count
Executor-->>Client: (worker running in thread)
Client-->>Rank0: ready
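A rough Python rendering of the non-zero-rank path in the diagram above. `run_diffusion_worker` is the real entry point named in the PR, but its signature is abbreviated here with a stub, and the surrounding control flow is a simplification rather than the actual code.

```python
import sys
import torch

def run_diffusion_worker(rank: int, local_rank: int) -> None:
    """Stub standing in for the real run_diffusion_worker entry point (full signature omitted)."""
    ...

def externally_launched_worker(rank: int, local_rank: int) -> None:
    """Sketch of a rank-N worker launched by torchrun/SLURM (illustrative only)."""
    # Per-node GPU assignment, as in the diagram: device_id = local_rank % device_count.
    device_id = local_rank % torch.cuda.device_count()
    torch.cuda.set_device(device_id)

    # Pure worker: initialize distributed state and serve requests, then exit so
    # the torchrun/SLURM job terminates cleanly once the serve loop returns.
    run_diffusion_worker(rank=rank, local_rank=local_rank)
    sys.exit(0)
```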
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
🧹 Nitpick comments (2)
tests/unittest/_torch/visual_gen/multi_gpu/test_visual_gen_multinode.py (1)
62-69: Minor: Unused unpacked variables can be prefixed with underscore.

The static analysis flagged `rank` and `world_size` on line 66 as unused. While this is intentional for test clarity (showing the full return tuple), prefixing with underscore silences the warning and signals intent.

♻️ Optional fix to silence linter warnings
```diff
- rank, local_rank, world_size, master_addr, master_port = _detect_external_launch()
- assert local_rank == 0  # defaults to RANK when LOCAL_RANK absent
+ _rank, local_rank, _world_size, master_addr, master_port = _detect_external_launch()
+ assert local_rank == 0  # defaults to RANK when LOCAL_RANK absent
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/visual_gen/multi_gpu/test_visual_gen_multinode.py` around lines 62-69, change the unpacking in the test_torchrun_default_local_rank_and_master test to prefix unused returned values with an underscore to silence linter warnings: when calling _detect_external_launch(), replace the variables that are not used (rank and world_size) with _rank and _world_size (or _ and _world_size) and keep local_rank, master_addr, master_port unchanged so the assertions still refer to local_rank, master_addr, and master_port.

tensorrt_llm/visual_gen/visual_gen.py (1)
179-189: Consider potential port collision with hardcoded offsets.

The request and response ports are derived as `master_port + 1` and `master_port + 2`. If another service is using port 29501 or 29502 (when master_port is 29500), this will fail at bind time.

This is a minor concern since:
- The ports are only used within the cluster
- Failure would be evident with a clear bind error
Consider documenting this port assignment scheme or making the offsets configurable if users report conflicts.
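For context, a later commit in this PR replaces the fixed +1/+2 offsets with a find_free_port() call on rank 0 (see the commit notes further down). A common implementation of such a helper looks roughly like the sketch below; the address format in the comments is illustrative, not the PR's actual code.

```python
import socket

def find_free_port() -> int:
    """Typical free-port helper: bind to port 0 and let the OS pick an unused one."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

# Rank 0 could then build the ZMQ queue addresses from freshly allocated ports
# instead of deriving them as master_port + 1 / master_port + 2, e.g.:
#   request_queue_addr  = f"tcp://{master_addr}:{find_free_port()}"
#   response_queue_addr = f"tcp://{master_addr}:{find_free_port()}"
```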
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/visual_gen/visual_gen.py` around lines 179 - 189, The code in VisualGen (in the else branch where ext is unpacked) computes request/response ports as master_port+1 and master_port+2 which can collide with other services; update VisualGen.__init__ to accept configurable port offsets (e.g., req_port_offset and resp_port_offset) or a base_port argument and use those offsets when computing request_queue_addr/response_queue_addr and req_addr_connect/resp_addr_connect, and add fallback/retry logic or validation to detect bind failures and surface a clear error; also document the default offsets in the constructor docstring so users know the assignment scheme.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: d84cf15a-5c93-4866-b192-b99716cbbf78
📒 Files selected for processing (3)
tensorrt_llm/_torch/visual_gen/executor.py
tensorrt_llm/visual_gen/visual_gen.py
tests/unittest/_torch/visual_gen/multi_gpu/test_visual_gen_multinode.py
- Raise RuntimeError when MASTER_ADDR is unset in torchrun path (mirrors SLURM check)
- Validate world_size == n_workers on all ranks before worker launch
- Replace master_port+1/+2 ZMQ port derivation with find_free_port() on rank 0
- Make request_queue_addr/response_queue_addr Optional[str]; pass None for non-zero ranks
- Store n_workers on DiffusionRemoteClient; use self.n_workers consistently
- Pass log_level to rank-0 external worker thread for consistency

Signed-off-by: Venmugil Elango <498703+venmugil@users.noreply.github.com>
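A hedged sketch of the fail-fast validation the first two bullets describe. The helper below is hypothetical; only the checks themselves (MASTER_ADDR present, world_size matching n_workers) come from the commit notes above.

```python
from typing import Optional

def validate_external_launch(world_size: int, n_workers: int,
                             master_addr: Optional[str]) -> None:
    """Hypothetical helper illustrating the checks listed in the commit above."""
    if master_addr is None:
        # Torchrun path now mirrors the SLURM check instead of silently defaulting.
        raise RuntimeError("MASTER_ADDR must be set by the external launcher")
    if world_size != n_workers:
        # Checked on all ranks before any worker launches, so a mismatched
        # launcher geometry fails immediately rather than hanging at init.
        raise RuntimeError(
            f"world_size ({world_size}) does not match the requested number "
            f"of diffusion workers ({n_workers})")
```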
|
/bot run --disable-fail-fast |
|
PR_Github #44509 [ run ] triggered by Bot. Commit: |
|
PR_Github #44509 [ run ] completed with state
|
Thanks for the work!
Could we have a similar doc update like this one for LLM:
https://github.com/NVIDIA/TensorRT-LLM/blob/6e5a3392b4c9985ce6edc115b330904101c78ccd/examples/llm-api/llm_mgmn_llm_distributed.sh? (Feel free to address this in a separate PR)
Also, quick question—how does it work with trtllm-serve?
Add SLURM batch scripts demonstrating how to launch trtllm-serve and run benchmarks across multiple nodes with Ulysses sequence parallelism. Signed-off-by: Venmugil Elango <498703+venmugil@users.noreply.github.com>
I've added two example files |
|
/bot run |
|
PR_Github #45453 [ run ] triggered by Bot. Commit: |
|
PR_Github #45453 [ run ] completed with state
|
|
/bot run |
|
PR_Github #45762 [ run ] triggered by Bot. Commit: |
|
PR_Github #45762 [ run ] completed with state |
Approving to unblock other parallelization-related efforts and allow continued iteration, but please refer to trtllm-llmapi-launch and its doc for a future refactor.
https://jirasw.nvidia.com/browse/TRTLLM-12346
Summary by CodeRabbit
New Features
Tests
Description
VisualGen's DiffusionRemoteClient currently only supports single-node operation, spawning all workers locally via mp.Process. This PR extends it to multi-node operation via torchrun and SLURM (a rough sketch of the resulting dispatch follows the list below). It:
- detects torchrun/SLURM external launchers via environment variables (RANK/WORLD_SIZE and SLURM_PROCID/SLURM_NTASKS);
- runs rank 0 as the coordinator, with its own worker in a background thread, while ranks 1..N-1 run as pure workers and exit after completion;
- adds a local_rank parameter to run_diffusion_worker for correct per-node GPU device assignment in multi-node runs.
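To make that dispatch concrete, here is a rough sketch of the two paths. Apart from mp.Process and run_diffusion_worker, which the description and diagram mention, every name below is a hypothetical stand-in rather than the client's real API.

```python
import multiprocessing as mp
import threading

def run_diffusion_worker(rank: int, local_rank: int) -> None:
    """Stub for the real executor entry point (full signature omitted)."""
    ...

def start_workers(n_workers: int, ext=None):
    """Hypothetical dispatch between the single-node path and the external-launch path."""
    if ext is None:
        # Single-node (unchanged behavior): spawn every worker locally.
        procs = [mp.Process(target=run_diffusion_worker,
                            kwargs={"rank": r, "local_rank": r})
                 for r in range(n_workers)]
        for p in procs:
            p.start()
        return procs
    # torchrun/SLURM: ranks 1..N-1 already run as separate processes, so the
    # coordinator (rank 0) only starts its own worker, in a background thread.
    t = threading.Thread(target=run_diffusion_worker,
                         kwargs={"rank": ext.rank, "local_rank": ext.local_rank},
                         daemon=True)
    t.start()
    return [t]
```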
Test Coverage
Automated tests added in tests/unittest/_torch/visual_gen/multi_gpu/test_visual_gen_multinode.py (no GPU required):
End-to-end multi-node execution still requires a real cluster and was validated manually on 8-GPU single-node and multi-node SLURM jobs. Existing single-node behavior is unchanged and covered by existing VisualGen tests.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.