
[None][feat] Add multi-node support for VisualGen diffusion workers via torchrun/SLURM #13140

Merged
chang-l merged 5 commits into NVIDIA:main from venmugil:multi_node_support on Apr 29, 2026

Conversation

@venmugil (Collaborator) commented Apr 16, 2026

Summary by CodeRabbit

  • New Features

    • Added multi-node distributed execution support for visual generation with automatic external launcher detection (torchrun and SLURM)
    • Enhanced GPU device assignment for proper multi-node rank-to-device mapping
  • Tests

    • Added comprehensive test suite covering multi-node behavior, device assignment validation, and distributed launcher detection

Description

VisualGen's DiffusionRemoteClient currently only supports single-node operation, spawning all workers locally via mp.Process. This PR extends it to support multi-node execution via torchrun and SLURM. It:

  • Adds _detect_external_launch() to identify torchrun (RANK/WORLD_SIZE) and SLURM (SLURM_PROCID/SLURM_NTASKS) launch environments; see the sketch after this list.
  • In external-launch mode, VisualGen.__init__ intercepts non-zero ranks early: they call run_diffusion_worker directly and exit, never reaching user code. Only rank 0 continues as the ZMQ request coordinator.
  • Rank 0 runs its own worker in a background daemon thread instead of spawning a new process, and uses MASTER_ADDR/MASTER_PORT for ZMQ addressing so workers on all nodes can connect.
  • run_diffusion_worker now accepts an explicit local_rank argument (falling back to the LOCAL_RANK env var, then the global rank) and uses it for GPU device assignment, ensuring correct per-node GPU mapping.
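
For illustration, here is a minimal sketch of what the launcher detection could look like, assuming the return shape used by the tests (rank, local_rank, world_size, master_addr, master_port); the actual implementation lives in tensorrt_llm/visual_gen/visual_gen.py and may differ in details:

```python
import os
from typing import Optional, Tuple


def _detect_external_launch() -> Optional[Tuple[int, int, int, str, int]]:
    """Sketch: detect a torchrun or SLURM launch from environment variables."""
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        # torchrun sets RANK/WORLD_SIZE (and usually LOCAL_RANK).
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        local_rank = int(os.environ.get("LOCAL_RANK", rank))
    elif "SLURM_PROCID" in os.environ and "SLURM_NTASKS" in os.environ:
        # SLURM sets SLURM_PROCID/SLURM_NTASKS (and SLURM_LOCALID).
        rank = int(os.environ["SLURM_PROCID"])
        world_size = int(os.environ["SLURM_NTASKS"])
        local_rank = int(os.environ.get("SLURM_LOCALID", rank))
    else:
        return None
    if world_size <= 1:
        return None  # single rank: fall back to local spawn mode
    master_addr = os.environ.get("MASTER_ADDR")
    if master_addr is None:
        raise RuntimeError("MASTER_ADDR must be set for an external launch")
    master_port = int(os.environ.get("MASTER_PORT", "29500"))
    return rank, local_rank, world_size, master_addr, master_port
```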

Test Coverage

Automated tests added in tests/unittest/_torch/visual_gen/multi_gpu/test_visual_gen_multinode.py (no GPU required):

  • _detect_external_launch(): torchrun path, SLURM path, single-rank returns None, default values for optional env vars, missing MASTER_ADDR raises
  • run_diffusion_worker device assignment: an explicit local_rank overrides a stale env var, with fallback to the LOCAL_RANK env var, then to the global rank
  • DiffusionRemoteClient single-node spawn: regression test verifying each worker receives the correct local_rank in its kwargs (prevents all workers mapping to device_id=0)

End-to-end multi-node execution still requires a real cluster and was validated manually on 8-GPU single-node and multi-node SLURM jobs. Existing single-node behavior is unchanged and covered by existing VisualGen tests.
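
A representative detection test might look like the following sketch (assuming pytest's monkeypatch fixture; the actual assertions in the test module may differ):

```python
from tensorrt_llm.visual_gen.visual_gen import _detect_external_launch


def test_torchrun_detection(monkeypatch):
    # Simulate a torchrun environment; no GPUs or cluster required.
    monkeypatch.setenv("RANK", "3")
    monkeypatch.setenv("WORLD_SIZE", "8")
    monkeypatch.setenv("LOCAL_RANK", "3")
    monkeypatch.setenv("MASTER_ADDR", "node0")
    monkeypatch.setenv("MASTER_PORT", "29500")
    rank, local_rank, world_size, master_addr, _port = _detect_external_launch()
    assert (rank, local_rank, world_size) == (3, 3, 8)
    assert master_addr == "node0"
```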

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Detect torchrun/SLURM external launchers via environment variables
(RANK/WORLD_SIZE and SLURM_PROCID/SLURM_NTASKS). Rank 0 acts as
coordinator; ranks 1..N-1 run as pure workers and exit after completion.
Adds local_rank parameter to run_diffusion_worker for correct per-node
GPU device assignment in multi-node runs.

Signed-off-by: Venmugil Elango <498703+venmugil@users.noreply.github.com>
Signed-off-by: Venmugil Elango <498703+venmugil@users.noreply.github.com>
…vice assignment

Signed-off-by: Venmugil Elango <498703+venmugil@users.noreply.github.com>
@venmugil venmugil requested review from a team as code owners April 16, 2026 23:46
@venmugil venmugil requested a review from syuoni April 16, 2026 23:46
@coderabbitai Bot (Contributor) commented Apr 16, 2026

📝 Walkthrough

Walkthrough

The changes introduce multi-node distributed launcher support for visual generation. A new local_rank parameter enables per-node GPU device mapping, while detection of external launchers (torchrun/SLURM) enables branching between single-node and multi-node execution modes, with rank-0 serving as coordinator.

Changes

  • Worker executor enhancement (tensorrt_llm/_torch/visual_gen/executor.py): Added an optional local_rank parameter to run_diffusion_worker(). GPU device selection now derives device_id from local_rank (with fallback to the LOCAL_RANK env var, then the global rank) instead of the global rank, enabling correct per-node device mapping.
  • Multi-node client and initialization (tensorrt_llm/visual_gen/visual_gen.py): Added _detect_external_launch() to identify torchrun- or SLURM-based distributed execution and extract launcher parameters. Updated DiffusionRemoteClient to branch between single-node (spawn) and external multi-node (thread-based coordinator) modes. Updated VisualGen.__init__ to handle rank-based initialization with external launchers, allowing non-rank-0 processes to exit after running the worker. Added thread lifecycle management for external-mode workers.
  • Multi-node integration tests (tests/unittest/_torch/visual_gen/multi_gpu/test_visual_gen_multinode.py): New test module validating launcher detection for torchrun and SLURM configurations, device-assignment logic using an explicit local_rank argument, and single-node spawn behavior to ensure correct per-worker local_rank propagation.
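
The device-selection fallback described above amounts to something like this sketch (helper name and exact structure are assumptions; the real logic is inside run_diffusion_worker):

```python
import os
from typing import Optional

import torch


def _resolve_device_id(rank: int, local_rank: Optional[int] = None) -> int:
    # Prefer the explicit argument, then the LOCAL_RANK env var,
    # then fall back to the global rank (correct for single-node runs).
    # Assumes at least one CUDA device is visible.
    if local_rank is None:
        local_rank = int(os.environ.get("LOCAL_RANK", rank))
    return local_rank % torch.cuda.device_count()
```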

Sequence Diagram(s)

sequenceDiagram
    participant Launcher as External Launcher<br/>(torchrun/SLURM)
    participant Rank0 as Rank 0 Process<br/>(Coordinator)
    participant RankN as Rank N Process<br/>(Worker)
    participant Client as DiffusionRemoteClient
    participant Executor as run_diffusion_worker

    Launcher->>Rank0: spawn with RANK=0, LOCAL_RANK=X,<br/>MASTER_ADDR, MASTER_PORT
    Launcher->>RankN: spawn with RANK=N, LOCAL_RANK=Y,<br/>MASTER_ADDR, MASTER_PORT

    RankN->>RankN: _detect_external_launch()<br/>extracts launcher params
    RankN->>Executor: run_diffusion_worker(local_rank=Y, ...)
    Executor->>Executor: device_id = Y % device_count
    Executor->>Executor: init_distributed(), serve loop
    Executor-->>RankN: (running)
    RankN->>RankN: sys.exit(0)

    Rank0->>Rank0: _detect_external_launch()<br/>extracts launcher params
    Rank0->>Client: __init__(ext=(rank=0, local_rank=X, ...))
    Client->>Client: Single-node=False, use launcher addresses
    Client->>Executor: spawn background thread<br/>run_diffusion_worker(rank=0, local_rank=X, ...)
    Executor->>Executor: device_id = X % device_count
    Executor-->>Client: (worker running in thread)
    Client-->>Rank0: ready
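
The same flow, expressed as a sketch of the rank branching (module paths follow the changes table above; run_diffusion_worker's parameters other than local_rank are assumptions):

```python
import sys
import threading

from tensorrt_llm._torch.visual_gen.executor import run_diffusion_worker
from tensorrt_llm.visual_gen.visual_gen import _detect_external_launch

ext = _detect_external_launch()
if ext is not None:
    rank, local_rank, world_size, master_addr, master_port = ext
    if rank != 0:
        # Non-zero ranks become pure workers: run the serve loop,
        # then exit before any user code executes.
        run_diffusion_worker(rank=rank, local_rank=local_rank)
        sys.exit(0)
    # Rank 0 runs its worker on a background daemon thread and continues
    # as the ZMQ request coordinator; master_addr/master_port feed the
    # ZMQ addressing on rank 0 (not shown here).
    threading.Thread(
        target=run_diffusion_worker,
        kwargs={"rank": 0, "local_rank": local_rank},
        daemon=True,
    ).start()
```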

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 45.83%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The PR title clearly and specifically describes the main feature: adding multi-node support for VisualGen diffusion workers via torchrun/SLURM.
  • Description check ✅ Passed: The PR description follows the template and provides comprehensive coverage of what changed (external launch detection, rank-0 coordination, device assignment), why it matters (multi-node support), test coverage (unit tests with specific scenarios), and checklist verification.



Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai Bot left a comment

🧹 Nitpick comments (2)
tests/unittest/_torch/visual_gen/multi_gpu/test_visual_gen_multinode.py (1)

62-69: Minor: Unused unpacked variables can be prefixed with underscore.

The static analysis flagged rank and world_size on line 66 as unused. While this is intentional for test clarity (showing the full return tuple), prefixing with underscore silences the warning and signals intent.

♻️ Optional fix to silence linter warnings
-        rank, local_rank, world_size, master_addr, master_port = _detect_external_launch()
-        assert local_rank == 0  # defaults to RANK when LOCAL_RANK absent
+        _rank, local_rank, _world_size, master_addr, master_port = _detect_external_launch()
+        assert local_rank == 0  # defaults to RANK when LOCAL_RANK absent
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/visual_gen/multi_gpu/test_visual_gen_multinode.py`
around lines 62 - 69, Change the unpacking in the
test_torchrun_default_local_rank_and_master test to prefix unused returned
values with an underscore to silence linter warnings: when calling
_detect_external_launch(), replace the variables that are not used (rank and
world_size) with _rank and _world_size (or _ and _world_size) and keep
local_rank, master_addr, master_port unchanged so the assertions still refer to
local_rank, master_addr, and master_port.
tensorrt_llm/visual_gen/visual_gen.py (1)

179-189: Consider potential port collision with hardcoded offsets.

The request and response ports are derived as master_port + 1 and master_port + 2. If another service is using port 29501 or 29502 (when master_port is 29500), this will fail at bind time.

This is a minor concern since:

  1. The ports are only used within the cluster
  2. Failure would be evident with a clear bind error

Consider documenting this port assignment scheme or making the offsets configurable if users report conflicts.
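
A follow-up commit in this PR replaces the fixed offsets with find_free_port() on rank 0. The standard pattern is to bind to port 0 and let the OS pick (a generic sketch, not necessarily the repo's helper):

```python
import socket


def find_free_port() -> int:
    # Binding to port 0 asks the OS for any unused ephemeral port;
    # read the chosen port back, then release the socket.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]
```

There is still a small window in which another process could grab the port between release and reuse, so a bind-time failure should surface a clear error.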

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/visual_gen/visual_gen.py` around lines 179 - 189, The code in
VisualGen (in the else branch where ext is unpacked) computes request/response
ports as master_port+1 and master_port+2 which can collide with other services;
update VisualGen.__init__ to accept configurable port offsets (e.g.,
req_port_offset and resp_port_offset) or a base_port argument and use those
offsets when computing request_queue_addr/response_queue_addr and
req_addr_connect/resp_addr_connect, and add fallback/retry logic or validation
to detect bind failures and surface a clear error; also document the default
offsets in the constructor docstring so users know the assignment scheme.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: d84cf15a-5c93-4866-b192-b99716cbbf78

📥 Commits

Reviewing files that changed from the base of the PR and between 4d29d83 and 5e0be9b.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/visual_gen/executor.py
  • tensorrt_llm/visual_gen/visual_gen.py
  • tests/unittest/_torch/visual_gen/multi_gpu/test_visual_gen_multinode.py

- Raise RuntimeError when MASTER_ADDR is unset in torchrun path (mirrors SLURM check)
- Validate world_size == n_workers on all ranks before worker launch
- Replace master_port+1/+2 ZMQ port derivation with find_free_port() on rank 0
- Make request_queue_addr/response_queue_addr Optional[str]; pass None for non-zero ranks
- Store n_workers on DiffusionRemoteClient; use self.n_workers consistently
- Pass log_level to rank-0 external worker thread for consistency

Signed-off-by: Venmugil Elango <498703+venmugil@users.noreply.github.com>
@NVShreyas (Collaborator): /bot run --disable-fail-fast

@tensorrt-cicd (Collaborator): PR_Github #44509 [ run ] triggered by Bot. Commit: 139dec1

@tensorrt-cicd (Collaborator): PR_Github #44509 [ run ] completed with state SUCCESS. Commit: 139dec1
/LLM/main/L0_MergeRequest_PR pipeline #34908 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


@chang-l (Collaborator) left a comment


Thanks for the work!
Could we have a similar doc update like this one for LLM:
https://github.com/NVIDIA/TensorRT-LLM/blob/6e5a3392b4c9985ce6edc115b330904101c78ccd/examples/llm-api/llm_mgmn_llm_distributed.sh? (Feel free to address this in a separate PR)

Also, quick question—how does it work with trtllm-serve?

@chang-l chang-l requested review from JunyiXu-nv and QiJune April 21, 2026 23:10
Add SLURM batch scripts demonstrating how to launch trtllm-serve and
run benchmarks across multiple nodes with Ulysses sequence parallelism.

Signed-off-by: Venmugil Elango <498703+venmugil@users.noreply.github.com>
@venmugil venmugil requested a review from a team as a code owner April 23, 2026 16:55
@venmugil venmugil requested review from QiJune and Shixiaowei02 April 23, 2026 16:55
@venmugil (Collaborator, Author) commented Apr 23, 2026

> Thanks for the work! Could we have a similar doc update like this one for LLM: https://github.com/NVIDIA/TensorRT-LLM/blob/6e5a3392b4c9985ce6edc115b330904101c78ccd/examples/llm-api/llm_mgmn_llm_distributed.sh? (Feel free to address this in a separate PR)
>
> Also, quick question—how does it work with trtllm-serve?

I've added two example files, examples/visual_gen/serve/benchmark_visual_gen_mgmn_distributed.sh and examples/visual_gen/visual_gen_mgmn_distributed.sh, to this PR. The first of these contains a trtllm-serve usage example.

@chang-l (Collaborator) commented Apr 24, 2026

/bot run

@tensorrt-cicd (Collaborator): PR_Github #45453 [ run ] triggered by Bot. Commit: 30288f0

@tensorrt-cicd (Collaborator): PR_Github #45453 [ run ] completed with state SUCCESS. Commit: 30288f0
/LLM/main/L0_MergeRequest_PR pipeline #35685 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


@NVShreyas (Collaborator): /bot run

@tensorrt-cicd (Collaborator): PR_Github #45762 [ run ] triggered by Bot. Commit: 30288f0

@tensorrt-cicd (Collaborator): PR_Github #45762 [ run ] completed with state SUCCESS. Commit: 30288f0
/LLM/main/L0_MergeRequest_PR pipeline #35954 completed with status: 'SUCCESS'

CI Report


@chang-l (Collaborator) left a comment


Approving to unblock other parallelization-related efforts and allow continued iteration, but please refer to trtllm-llmapi-launch and its doc for a future refactor.
https://jirasw.nvidia.com/browse/TRTLLM-12346

@QiJune (Collaborator) left a comment


LGTM

@chang-l chang-l merged commit 271c4cc into NVIDIA:main Apr 29, 2026
5 checks passed
@venmugil venmugil deleted the multi_node_support branch April 29, 2026 15:20