[None][feat] Add multi-node support for VisualGen diffusion workers via torchrun/SLURM #13140
chang-l merged 5 commits into NVIDIA:main from
Conversation
Detect torchrun/SLURM external launchers via environment variables (RANK/WORLD_SIZE and SLURM_PROCID/SLURM_NTASKS). Rank 0 acts as coordinator; ranks 1..N-1 run as pure workers and exit after completion. Adds local_rank parameter to run_diffusion_worker for correct per-node GPU device assignment in multi-node runs. Signed-off-by: Venmugil Elango <498703+venmugil@users.noreply.github.com>
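Based on that description and the helper referenced by this PR's tests, the detection likely boils down to something like the following minimal sketch. The helper name `_detect_external_launch()` and its return tuple come from the PR; the body below is an assumption, not the actual implementation.

```python
import os

def detect_external_launch_sketch():
    """Illustrative version of the PR's _detect_external_launch() helper (assumed body)."""
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:  # torchrun
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
    elif "SLURM_PROCID" in os.environ and "SLURM_NTASKS" in os.environ:  # SLURM
        rank = int(os.environ["SLURM_PROCID"])
        world_size = int(os.environ["SLURM_NTASKS"])
    else:
        return None  # no external launcher detected; keep single-node behavior

    # LOCAL_RANK may be absent; default to the global rank, matching the
    # behavior asserted by the unit test discussed later in this review.
    local_rank = int(os.environ.get("LOCAL_RANK", rank))

    master_addr = os.environ.get("MASTER_ADDR")
    if master_addr is None:
        # The PR raises RuntimeError when MASTER_ADDR is unset in both paths.
        raise RuntimeError("MASTER_ADDR must be set by torchrun/SLURM")
    master_port = int(os.environ.get("MASTER_PORT", "29500"))
    return rank, local_rank, world_size, master_addr, master_port
```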
Signed-off-by: Venmugil Elango <498703+venmugil@users.noreply.github.com>
…vice assignment Signed-off-by: Venmugil Elango <498703+venmugil@users.noreply.github.com>
📝 Walkthrough
The changes introduce multi-node distributed launcher support for visual generation. A new `_detect_external_launch()` helper identifies torchrun/SLURM launches from environment variables and routes rank 0 through the coordinator path while the remaining ranks run as pure workers.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Launcher as External Launcher<br/>(torchrun/SLURM)
participant Rank0 as Rank 0 Process<br/>(Coordinator)
participant RankN as Rank N Process<br/>(Worker)
participant Client as DiffusionRemoteClient
participant Executor as run_diffusion_worker
Launcher->>Rank0: spawn with RANK=0, LOCAL_RANK=X,<br/>MASTER_ADDR, MASTER_PORT
Launcher->>RankN: spawn with RANK=N, LOCAL_RANK=Y,<br/>MASTER_ADDR, MASTER_PORT
RankN->>RankN: _detect_external_launch()<br/>extracts launcher params
RankN->>Executor: run_diffusion_worker(local_rank=Y, ...)
Executor->>Executor: device_id = Y % device_count
Executor->>Executor: init_distributed(), serve loop
Executor-->>RankN: (running)
RankN->>RankN: sys.exit(0)
Rank0->>Rank0: _detect_external_launch()<br/>extracts launcher params
Rank0->>Client: __init__(ext=(rank=0, local_rank=X, ...))
Client->>Client: Single-node=False, use launcher addresses
Client->>Executor: spawn background thread<br/>run_diffusion_worker(rank=0, local_rank=X, ...)
Executor->>Executor: device_id = X % device_count
Executor-->>Client: (worker running in thread)
Client-->>Rank0: ready
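A rough Python rendering of the non-zero-rank path in the diagram above. `run_diffusion_worker` is the real entry point named in the PR, but its signature is abbreviated here with a stub, and the surrounding control flow is a simplification rather than the actual code.

```python
import sys
import torch

def run_diffusion_worker(rank: int, local_rank: int) -> None:
    """Stub standing in for the real run_diffusion_worker entry point (full signature omitted)."""
    ...

def externally_launched_worker(rank: int, local_rank: int) -> None:
    """Sketch of a rank-N worker launched by torchrun/SLURM (illustrative only)."""
    # Per-node GPU assignment, as in the diagram: device_id = local_rank % device_count.
    device_id = local_rank % torch.cuda.device_count()
    torch.cuda.set_device(device_id)

    # Pure worker: initialize distributed state and serve requests, then exit so
    # the torchrun/SLURM job terminates cleanly once the serve loop returns.
    run_diffusion_worker(rank=rank, local_rank=local_rank)
    sys.exit(0)
```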
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
🧹 Nitpick comments (2)
tests/unittest/_torch/visual_gen/multi_gpu/test_visual_gen_multinode.py (1)
62-69: Minor: Unused unpacked variables can be prefixed with underscore.

The static analysis flagged `rank` and `world_size` on line 66 as unused. While this is intentional for test clarity (showing the full return tuple), prefixing with underscore silences the warning and signals intent.

♻️ Optional fix to silence linter warnings
```diff
- rank, local_rank, world_size, master_addr, master_port = _detect_external_launch()
- assert local_rank == 0  # defaults to RANK when LOCAL_RANK absent
+ _rank, local_rank, _world_size, master_addr, master_port = _detect_external_launch()
+ assert local_rank == 0  # defaults to RANK when LOCAL_RANK absent
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/visual_gen/multi_gpu/test_visual_gen_multinode.py` around lines 62-69, change the unpacking in the test_torchrun_default_local_rank_and_master test to prefix unused returned values with an underscore to silence linter warnings: when calling _detect_external_launch(), replace the variables that are not used (rank and world_size) with _rank and _world_size (or _ and _world_size) and keep local_rank, master_addr, master_port unchanged so the assertions still refer to local_rank, master_addr, and master_port.

tensorrt_llm/visual_gen/visual_gen.py (1)
179-189: Consider potential port collision with hardcoded offsets.

The request and response ports are derived as `master_port + 1` and `master_port + 2`. If another service is using port 29501 or 29502 (when master_port is 29500), this will fail at bind time.

This is a minor concern since:
- The ports are only used within the cluster
- Failure would be evident with a clear bind error
Consider documenting this port assignment scheme or making the offsets configurable if users report conflicts.
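For context, a later commit in this PR replaces the fixed +1/+2 offsets with a find_free_port() call on rank 0 (see the commit notes further down). A common implementation of such a helper looks roughly like the sketch below; the address format in the comments is illustrative, not the PR's actual code.

```python
import socket

def find_free_port() -> int:
    """Typical free-port helper: bind to port 0 and let the OS pick an unused one."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

# Rank 0 could then build the ZMQ queue addresses from freshly allocated ports
# instead of deriving them as master_port + 1 / master_port + 2, e.g.:
#   request_queue_addr  = f"tcp://{master_addr}:{find_free_port()}"
#   response_queue_addr = f"tcp://{master_addr}:{find_free_port()}"
```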
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/visual_gen/visual_gen.py` around lines 179 - 189, The code in VisualGen (in the else branch where ext is unpacked) computes request/response ports as master_port+1 and master_port+2 which can collide with other services; update VisualGen.__init__ to accept configurable port offsets (e.g., req_port_offset and resp_port_offset) or a base_port argument and use those offsets when computing request_queue_addr/response_queue_addr and req_addr_connect/resp_addr_connect, and add fallback/retry logic or validation to detect bind failures and surface a clear error; also document the default offsets in the constructor docstring so users know the assignment scheme.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: d84cf15a-5c93-4866-b192-b99716cbbf78
📒 Files selected for processing (3)
tensorrt_llm/_torch/visual_gen/executor.py
tensorrt_llm/visual_gen/visual_gen.py
tests/unittest/_torch/visual_gen/multi_gpu/test_visual_gen_multinode.py
- Raise RuntimeError when MASTER_ADDR is unset in torchrun path (mirrors SLURM check)
- Validate world_size == n_workers on all ranks before worker launch
- Replace master_port+1/+2 ZMQ port derivation with find_free_port() on rank 0
- Make request_queue_addr/response_queue_addr Optional[str]; pass None for non-zero ranks
- Store n_workers on DiffusionRemoteClient; use self.n_workers consistently
- Pass log_level to rank-0 external worker thread for consistency

Signed-off-by: Venmugil Elango <498703+venmugil@users.noreply.github.com>
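A hedged sketch of the fail-fast validation the first two bullets describe. The helper below is hypothetical; only the checks themselves (MASTER_ADDR present, world_size matching n_workers) come from the commit notes above.

```python
from typing import Optional

def validate_external_launch(world_size: int, n_workers: int,
                             master_addr: Optional[str]) -> None:
    """Hypothetical helper illustrating the checks listed in the commit above."""
    if master_addr is None:
        # Torchrun path now mirrors the SLURM check instead of silently defaulting.
        raise RuntimeError("MASTER_ADDR must be set by the external launcher")
    if world_size != n_workers:
        # Checked on all ranks before any worker launches, so a mismatched
        # launcher geometry fails immediately rather than hanging at init.
        raise RuntimeError(
            f"world_size ({world_size}) does not match the requested number "
            f"of diffusion workers ({n_workers})")
```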
|
/bot run --disable-fail-fast |
|
PR_Github #44509 [ run ] triggered by Bot. Commit: |
|
PR_Github #44509 [ run ] completed with state
|
Thanks for the work!
Could we have a similar doc update like this one for LLM:
https://github.com/NVIDIA/TensorRT-LLM/blob/6e5a3392b4c9985ce6edc115b330904101c78ccd/examples/llm-api/llm_mgmn_llm_distributed.sh? (Feel free to address this in a separate PR)
Also, quick question—how does it work with trtllm-serve?
Add SLURM batch scripts demonstrating how to launch trtllm-serve and run benchmarks across multiple nodes with Ulysses sequence parallelism. Signed-off-by: Venmugil Elango <498703+venmugil@users.noreply.github.com>
I've added two example files |
|
/bot run |
|
PR_Github #45453 [ run ] triggered by Bot. Commit: |
|
PR_Github #45453 [ run ] completed with state
|
|
/bot run |
|
PR_Github #45762 [ run ] triggered by Bot. Commit: |
|
PR_Github #45762 [ run ] completed with state |
Approving to unblock other parallelization-related efforts and allow continued iteration, but please refer to trtllm-llmapi-launch and its doc for a future refactor.
https://jirasw.nvidia.com/browse/TRTLLM-12346
Summary by CodeRabbit
New Features
Tests
Description
VisualGen's DiffusionRemoteClient currently only supports single-node operation, spawning all workers locally via mp.Process. This PR extends it to multi-node operation via torchrun and SLURM (a rough sketch of the resulting dispatch follows the list below). It:
- detects torchrun/SLURM external launchers via environment variables (RANK/WORLD_SIZE and SLURM_PROCID/SLURM_NTASKS);
- runs rank 0 as the coordinator, with its own worker in a background thread, while ranks 1..N-1 run as pure workers and exit after completion;
- adds a local_rank parameter to run_diffusion_worker for correct per-node GPU device assignment in multi-node runs.
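To make that dispatch concrete, here is a rough sketch of the two paths. Apart from mp.Process and run_diffusion_worker, which the description and diagram mention, every name below is a hypothetical stand-in rather than the client's real API.

```python
import multiprocessing as mp
import threading

def run_diffusion_worker(rank: int, local_rank: int) -> None:
    """Stub for the real executor entry point (full signature omitted)."""
    ...

def start_workers(n_workers: int, ext=None):
    """Hypothetical dispatch between the single-node path and the external-launch path."""
    if ext is None:
        # Single-node (unchanged behavior): spawn every worker locally.
        procs = [mp.Process(target=run_diffusion_worker,
                            kwargs={"rank": r, "local_rank": r})
                 for r in range(n_workers)]
        for p in procs:
            p.start()
        return procs
    # torchrun/SLURM: ranks 1..N-1 already run as separate processes, so the
    # coordinator (rank 0) only starts its own worker, in a background thread.
    t = threading.Thread(target=run_diffusion_worker,
                         kwargs={"rank": ext.rank, "local_rank": ext.local_rank},
                         daemon=True)
    t.start()
    return [t]
```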
Test Coverage
Automated tests added in tests/unittest/_torch/visual_gen/multi_gpu/test_visual_gen_multinode.py (no GPU required):
End-to-end multi-node execution still requires a real cluster and was validated manually on 8-GPU single-node and multi-node SLURM jobs. Existing single-node behavior is unchanged and covered by existing VisualGen tests.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.