Skip to content

[BugFix] fix num of rdma_comm_ports check#5168

Merged
Jiang-Jia-Jun merged 4 commits intoPaddlePaddle:developfrom
yuanlehome:fix_check_rdma_comm_ports
Nov 21, 2025
Merged

[BugFix] fix num of rdma_comm_ports check#5168
Jiang-Jia-Jun merged 4 commits intoPaddlePaddle:developfrom
yuanlehome:fix_check_rdma_comm_ports

Conversation

@yuanlehome
Copy link
Collaborator

@yuanlehome yuanlehome commented Nov 21, 2025

Motivation

rdma_comm_ports check之前没有考虑多机

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings November 21, 2025 10:04
@paddle-bot
Copy link

paddle-bot bot commented Nov 21, 2025

Thanks for your contribution!

@yuanlehome yuanlehome changed the title fix num of rdma_comm_ports check [BugFix] fix num of rdma_comm_ports check Nov 21, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes the validation logic for rdma_comm_ports count in multi-node deployments. The original check incorrectly validated against the total number of ranks across all nodes, but should validate against the number of ranks per node.

Key changes:

  • Updated the rdma_comm_ports count validation to account for multi-node deployments by dividing data_parallel_size by the number of nodes (IPs)

PR Metadata Issues:

  1. Title format: The PR title should follow the format [Tag]Title. For this bug fix, it should be: [BugFix] Fix num of rdma_comm_ports check or [BugFix]fix num of rdma_comm_ports check
  2. Description: The PR description is incomplete. Please add:
    • Motivation: Explain why this change is needed (e.g., "The original validation incorrectly calculated expected RDMA ports for multi-node deployments")
    • Modifications: Describe what was changed (e.g., "Updated the validation formula to account for the number of nodes")
    • Usage or Command: Provide how to test or reproduce the fix

Comment on lines 513 to 514
raise ValueError(
f"The number of rdma comm ports must be equal to number of ranks ({self.data_parallel_size=} * {self.tensor_parallel_size=} = {self.data_parallel_size * self.tensor_parallel_size}), but got {len(self.rdma_comm_ports)}."
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message is inconsistent with the updated calculation logic. The message still displays the old formula (data_parallel_size * tensor_parallel_size), but the actual check now uses tensor_parallel_size * (data_parallel_size // (len(self.ips) if self.ips else 1)).

Update the error message to reflect the correct calculation:

expected_ports = self.tensor_parallel_size * (self.data_parallel_size // (len(self.ips) if self.ips else 1))
raise ValueError(
    f"The number of rdma comm ports must be equal to number of ranks "
    f"(tensor_parallel_size * (data_parallel_size / num_nodes) = "
    f"{self.tensor_parallel_size} * ({self.data_parallel_size} / {len(self.ips) if self.ips else 1}) = {expected_ports}), "
    f"but got {len(self.rdma_comm_ports)}."
)
Suggested change
raise ValueError(
f"The number of rdma comm ports must be equal to number of ranks ({self.data_parallel_size=} * {self.tensor_parallel_size=} = {self.data_parallel_size * self.tensor_parallel_size}), but got {len(self.rdma_comm_ports)}."
expected_ports = self.tensor_parallel_size * (self.data_parallel_size // (len(self.ips) if self.ips else 1))
num_nodes = len(self.ips) if self.ips else 1
raise ValueError(
f"The number of rdma comm ports must be equal to number of ranks "
f"(tensor_parallel_size * (data_parallel_size / num_nodes) = "
f"{self.tensor_parallel_size} * ({self.data_parallel_size} / {num_nodes}) = {expected_ports}), "
f"but got {len(self.rdma_comm_ports)}."

Copilot uses AI. Check for mistakes.
Comment on lines 510 to 512
if len(self.rdma_comm_ports) != self.tensor_parallel_size * (
self.data_parallel_size // (len(self.ips) if self.ips else 1)
):
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The calculation formula is complex and difficult to understand. Consider extracting it into a well-named variable to improve readability and make the logic clearer:

num_nodes = len(self.ips) if self.ips else 1
data_parallel_size_per_node = self.data_parallel_size // num_nodes
expected_rdma_ports = self.tensor_parallel_size * data_parallel_size_per_node

if len(self.rdma_comm_ports) != expected_rdma_ports:

This makes it clear that the calculation is based on data parallel size per node, which appears to be the intent of this fix.

Copilot generated this review using guidance from repository custom instructions.
@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 5bcf79d into PaddlePaddle:develop Nov 21, 2025
12 of 16 checks passed
@codecov-commenter
Copy link

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@6471dad). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/engine/args_utils.py 0.00% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #5168   +/-   ##
==========================================
  Coverage           ?   57.83%           
==========================================
  Files              ?      317           
  Lines              ?    38315           
  Branches           ?     5727           
==========================================
  Hits               ?    22161           
  Misses             ?    14389           
  Partials           ?     1765           
Flag Coverage Δ
diff 57.83% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants