[BugFix] fix num of rdma_comm_ports check #5168
Jiang-Jia-Jun merged 4 commits into PaddlePaddle:develop from
Thanks for your contribution!
Pull Request Overview
This PR fixes the validation logic for rdma_comm_ports count in multi-node deployments. The original check incorrectly validated against the total number of ranks across all nodes, but should validate against the number of ranks per node.
Key changes:
- Updated the `rdma_comm_ports` count validation to account for multi-node deployments by dividing `data_parallel_size` by the number of nodes (IPs)
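To make the difference concrete, here is a minimal sketch comparing the expected port count before and after the fix. The two helper functions and the example topology are hypothetical and are not part of FastDeploy; the sketch also assumes `ips` is a list with one entry per node, as the `len(self.ips)` expression in the diff suggests.

```python
from typing import Optional, Sequence


def expected_rdma_ports_old(tensor_parallel_size: int, data_parallel_size: int) -> int:
    """Pre-fix expectation: one port per rank across all nodes."""
    return tensor_parallel_size * data_parallel_size


def expected_rdma_ports_new(
    tensor_parallel_size: int,
    data_parallel_size: int,
    ips: Optional[Sequence[str]] = None,
) -> int:
    """Post-fix expectation: one port per rank on a single node."""
    # A missing or empty ips value is treated as a single-node deployment.
    num_nodes = len(ips) if ips else 1
    return tensor_parallel_size * (data_parallel_size // num_nodes)


# Hypothetical topology: 2 nodes, tensor_parallel_size=4, data_parallel_size=2.
ips = ["10.0.0.1", "10.0.0.2"]
old_count = expected_rdma_ports_old(4, 2)       # 4 * 2 = 8
new_count = expected_rdma_ports_new(4, 2, ips)  # 4 * (2 // 2) = 4
```

Under these assumptions, the old check would demand 8 ports on each node, while each node only hosts 4 ranks. On a single node the two formulas agree.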
PR Metadata Issues:
- Title format: The PR title should follow the format `[Tag]Title`. For this bug fix, it should be: `[BugFix] Fix num of rdma_comm_ports check` or `[BugFix]fix num of rdma_comm_ports check`
- Description: The PR description is incomplete. Please add:
  - Motivation: Explain why this change is needed (e.g., "The original validation incorrectly calculated expected RDMA ports for multi-node deployments")
  - Modifications: Describe what was changed (e.g., "Updated the validation formula to account for the number of nodes")
  - Usage or Command: Provide how to test or reproduce the fix
fastdeploy/engine/args_utils.py (outdated)
    raise ValueError(
        f"The number of rdma comm ports must be equal to number of ranks ({self.data_parallel_size=} * {self.tensor_parallel_size=} = {self.data_parallel_size * self.tensor_parallel_size}), but got {len(self.rdma_comm_ports)}."
The error message is inconsistent with the updated calculation logic. The message still displays the old formula (data_parallel_size * tensor_parallel_size), but the actual check now uses tensor_parallel_size * (data_parallel_size // (len(self.ips) if self.ips else 1)).
Update the error message to reflect the correct calculation:
    expected_ports = self.tensor_parallel_size * (self.data_parallel_size // (len(self.ips) if self.ips else 1))
    raise ValueError(
        f"The number of rdma comm ports must be equal to number of ranks "
        f"(tensor_parallel_size * (data_parallel_size / num_nodes) = "
        f"{self.tensor_parallel_size} * ({self.data_parallel_size} / {len(self.ips) if self.ips else 1}) = {expected_ports}), "
        f"but got {len(self.rdma_comm_ports)}."
    )

Suggested change:
    - raise ValueError(
    -     f"The number of rdma comm ports must be equal to number of ranks ({self.data_parallel_size=} * {self.tensor_parallel_size=} = {self.data_parallel_size * self.tensor_parallel_size}), but got {len(self.rdma_comm_ports)}."
    + expected_ports = self.tensor_parallel_size * (self.data_parallel_size // (len(self.ips) if self.ips else 1))
    + num_nodes = len(self.ips) if self.ips else 1
    + raise ValueError(
    +     f"The number of rdma comm ports must be equal to number of ranks "
    +     f"(tensor_parallel_size * (data_parallel_size / num_nodes) = "
    +     f"{self.tensor_parallel_size} * ({self.data_parallel_size} / {num_nodes}) = {expected_ports}), "
    +     f"but got {len(self.rdma_comm_ports)}."
fastdeploy/engine/args_utils.py (outdated)
    if len(self.rdma_comm_ports) != self.tensor_parallel_size * (
        self.data_parallel_size // (len(self.ips) if self.ips else 1)
    ):
The calculation formula is complex and difficult to understand. Consider extracting it into a well-named variable to improve readability and make the logic clearer:
    num_nodes = len(self.ips) if self.ips else 1
    data_parallel_size_per_node = self.data_parallel_size // num_nodes
    expected_rdma_ports = self.tensor_parallel_size * data_parallel_size_per_node
    if len(self.rdma_comm_ports) != expected_rdma_ports:

This makes it clear that the calculation is based on data parallel size per node, which appears to be the intent of this fix.
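For anyone who wants to try the suggested structure in isolation, the following is a rough, self-contained sketch that combines the named intermediate variables with an error message matching the new formula. `RdmaArgs` and `check_rdma_comm_ports` are hypothetical stand-ins for illustration only; they are not the actual class or method in `fastdeploy/engine/args_utils.py`, and the field types are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RdmaArgs:
    """Hypothetical stand-in for the relevant engine argument fields."""

    tensor_parallel_size: int = 1
    data_parallel_size: int = 1
    ips: Optional[List[str]] = None  # assumed: one entry per node
    rdma_comm_ports: List[int] = field(default_factory=list)

    def check_rdma_comm_ports(self) -> None:
        """Raise ValueError unless there is one RDMA port per rank on this node."""
        num_nodes = len(self.ips) if self.ips else 1
        data_parallel_size_per_node = self.data_parallel_size // num_nodes
        expected_rdma_ports = self.tensor_parallel_size * data_parallel_size_per_node
        if len(self.rdma_comm_ports) != expected_rdma_ports:
            raise ValueError(
                "The number of rdma comm ports must be equal to the number of ranks per node "
                "(tensor_parallel_size * (data_parallel_size // num_nodes) = "
                f"{self.tensor_parallel_size} * ({self.data_parallel_size} // {num_nodes}) "
                f"= {expected_rdma_ports}), but got {len(self.rdma_comm_ports)}."
            )


# Two nodes, 2 ranks per node: two ports per node pass the check.
args = RdmaArgs(
    tensor_parallel_size=2,
    data_parallel_size=2,
    ips=["10.0.0.1", "10.0.0.2"],
    rdma_comm_ports=[7000, 7001],
)
args.check_rdma_comm_ports()
```

Note that `//` is floor division, so a `data_parallel_size` that is not a multiple of the node count is silently rounded down in this sketch, just as in the formula under review.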
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #5168 +/- ##
==========================================
Coverage ? 57.83%
==========================================
Files ? 317
Lines ? 38315
Branches ? 5727
==========================================
Hits ? 22161
Misses ? 14389
Partials ? 1765
Motivation
The rdma_comm_ports check previously did not account for multi-node deployments.
Modifications
Usage or Command
Accuracy Tests
Checklist
[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
`pre-commit` before commit.
`release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.