[BugFix] fix num of rdma_comm_ports check by yuanlehome · Pull Request #5168 · PaddlePaddle/FastDeploy

yuanlehome · 2025-11-21T10:04:37Z

Motivation

rdma_comm_ports check之前没有考虑多机

Modifications

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2025-11-21T10:04:47Z

Thanks for your contribution!

Copilot

Pull Request Overview

This PR fixes the validation logic for rdma_comm_ports count in multi-node deployments. The original check incorrectly validated against the total number of ranks across all nodes, but should validate against the number of ranks per node.

Key changes:

Updated the rdma_comm_ports count validation to account for multi-node deployments by dividing data_parallel_size by the number of nodes (IPs)

PR Metadata Issues:

Title format: The PR title should follow the format [Tag]Title. For this bug fix, it should be: [BugFix] Fix num of rdma_comm_ports check or [BugFix]fix num of rdma_comm_ports check
Description: The PR description is incomplete. Please add:
- Motivation: Explain why this change is needed (e.g., "The original validation incorrectly calculated expected RDMA ports for multi-node deployments")
- Modifications: Describe what was changed (e.g., "Updated the validation formula to account for the number of nodes")
- Usage or Command: Provide how to test or reproduce the fix

Copilot · 2025-11-21T10:07:07Z

fastdeploy/engine/args_utils.py

                    raise ValueError(
                        f"The number of rdma comm ports must be equal to number of ranks ({self.data_parallel_size=} * {self.tensor_parallel_size=} = {self.data_parallel_size * self.tensor_parallel_size}), but got {len(self.rdma_comm_ports)}."


The error message is inconsistent with the updated calculation logic. The message still displays the old formula (data_parallel_size * tensor_parallel_size), but the actual check now uses tensor_parallel_size * (data_parallel_size // (len(self.ips) if self.ips else 1)).

Update the error message to reflect the correct calculation:

expected_ports = self.tensor_parallel_size * (self.data_parallel_size // (len(self.ips) if self.ips else 1)) raise ValueError( f"The number of rdma comm ports must be equal to number of ranks " f"(tensor_parallel_size * (data_parallel_size / num_nodes) = " f"{self.tensor_parallel_size} * ({self.data_parallel_size} / {len(self.ips) if self.ips else 1}) = {expected_ports}), " f"but got {len(self.rdma_comm_ports)}." )

Suggested change

raise ValueError(

f"The number of rdma comm ports must be equal to number of ranks ({self.data_parallel_size=} * {self.tensor_parallel_size=} = {self.data_parallel_size * self.tensor_parallel_size}), but got {len(self.rdma_comm_ports)}."

expected_ports = self.tensor_parallel_size * (self.data_parallel_size // (len(self.ips) if self.ips else 1))

num_nodes = len(self.ips) if self.ips else 1

raise ValueError(

f"The number of rdma comm ports must be equal to number of ranks "

f"(tensor_parallel_size * (data_parallel_size / num_nodes) = "

f"{self.tensor_parallel_size} * ({self.data_parallel_size} / {num_nodes}) = {expected_ports}), "

f"but got {len(self.rdma_comm_ports)}."

Copilot · 2025-11-21T10:07:07Z

fastdeploy/engine/args_utils.py

+                if len(self.rdma_comm_ports) != self.tensor_parallel_size * (
+                    self.data_parallel_size // (len(self.ips) if self.ips else 1)
+                ):


The calculation formula is complex and difficult to understand. Consider extracting it into a well-named variable to improve readability and make the logic clearer:

num_nodes = len(self.ips) if self.ips else 1 data_parallel_size_per_node = self.data_parallel_size // num_nodes expected_rdma_ports = self.tensor_parallel_size * data_parallel_size_per_node if len(self.rdma_comm_ports) != expected_rdma_ports:

This makes it clear that the calculation is based on data parallel size per node, which appears to be the intent of this fix.

codecov-commenter · 2025-11-21T11:29:36Z

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@6471dad). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/engine/args_utils.py	0.00%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #5168   +/-   ##
==========================================
  Coverage           ?   57.83%           
==========================================
  Files              ?      317           
  Lines              ?    38315           
  Branches           ?     5727           
==========================================
  Hits               ?    22161           
  Misses             ?    14389           
  Partials           ?     1765

Flag	Coverage Δ
diff	`57.83% <0.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

fix num of rdma_comm_ports check

300f40a

Copilot AI review requested due to automatic review settings November 21, 2025 10:04

Copilot started reviewing on behalf of yuanlehome November 21, 2025 10:05 View session

yuanlehome changed the title ~~fix num of rdma_comm_ports check~~ [BugFix] fix num of rdma_comm_ports check Nov 21, 2025

Copilot finished reviewing on behalf of yuanlehome November 21, 2025 10:06

Copilot AI reviewed Nov 21, 2025

View reviewed changes

yuanlehome added 3 commits November 21, 2025 18:09

update

c317793

update

e402a40

update

a094b7c

Jiang-Jia-Jun merged commit 5bcf79d into PaddlePaddle:develop Nov 21, 2025
12 of 16 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BugFix] fix num of rdma_comm_ports check#5168

[BugFix] fix num of rdma_comm_ports check#5168
Jiang-Jia-Jun merged 4 commits intoPaddlePaddle:developfrom
yuanlehome:fix_check_rdma_comm_ports

yuanlehome commented Nov 21, 2025 •

edited

Loading

Uh oh!

paddle-bot bot commented Nov 21, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Nov 21, 2025

Uh oh!

Copilot AI Nov 21, 2025

Uh oh!

Uh oh!

codecov-commenter commented Nov 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		raise ValueError(
		f"The number of rdma comm ports must be equal to number of ranks ({self.data_parallel_size=} * {self.tensor_parallel_size=} = {self.data_parallel_size * self.tensor_parallel_size}), but got {len(self.rdma_comm_ports)}."

-                    raise ValueError(
-                        f"The number of rdma comm ports must be equal to number of ranks ({self.data_parallel_size=} * {self.tensor_parallel_size=} = {self.data_parallel_size * self.tensor_parallel_size}), but got {len(self.rdma_comm_ports)}."
+                    expected_ports = self.tensor_parallel_size * (self.data_parallel_size // (len(self.ips) if self.ips else 1))
+                    num_nodes = len(self.ips) if self.ips else 1
+                    raise ValueError(
+                        f"The number of rdma comm ports must be equal to number of ranks "
+                        f"(tensor_parallel_size * (data_parallel_size / num_nodes) = "
+                        f"{self.tensor_parallel_size} * ({self.data_parallel_size} / {num_nodes}) = {expected_ports}), "
+                        f"but got {len(self.rdma_comm_ports)}."

Conversation

yuanlehome commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Usage or Command

Accuracy Tests

Checklist

Uh oh!

paddle-bot bot commented Nov 21, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov-commenter commented Nov 21, 2025

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yuanlehome commented Nov 21, 2025 •

edited

Loading