Skip to content

Conversation

@isaki001
Copy link
Contributor

@isaki001 isaki001 commented Oct 28, 2025

Details

Do not mention proprietary info or link to internal work items in this PR.

Work item: "Internal", or link to GitHub issue (if applicable).

What were the changes?

  • reverted use of bit-reversal for channel-to-part mapping
  • prevent p2p-batching when recv/send bytes in a batch are different (this applies for a2av)
  • p2p-batching is supported only up to 32 nodes

Why were the changes made?

  • RCCL_P2P_BATCH_ENABLE=1 provides perf. boost for alltoall and should be enabled by default
  • linear mapping use was reverted in favor of bit-reversal to address regressions on e2e workloads, originally linear mapping was introduced to improve peak-bw, however this improvement is less impactful than the e2e workload regression and needs to be reverted
  • the requirement for recv/send bytes to be the same addresses a hang on a2av. The utilization of the RCCL_P2P_BATCH_THRESHOLD could lead to cases where the send bytes would require batching but the recv bytes would not, and this inconsistency would lead to the hang. This hang could also be addressed by not using a RCCL_P2P_BATCH_THRESHOLD, or forcing/disabling batching on all message sizes.
  • p2p-batching is disabled beyond 32 nodes to address hang for a2a as a temporary solution

How was the outcome achieved?

  • bit-reversal enablement only required changes in ncclP2pChannelForPart and ncclP2pChannelToPart
  • p2p-batching adjustments only required changes in enqueue.cc

Additional Documentation:

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

@isaki001 isaki001 requested a review from a team as a code owner October 28, 2025 15:34
@isaki001 isaki001 changed the title P2p batch update P2p batching default enablement Oct 28, 2025
@nileshnegi
Copy link
Contributor

@isaki001 please add a changelog entry, as batched P2P is now enabled by default

@isaki001 isaki001 requested a review from a team as a code owner October 29, 2025 21:46
@AbandiGa AbandiGa self-requested a review October 29, 2025 22:03
@isaki001 isaki001 dismissed mustafabar’s stale review October 29, 2025 23:57

Applying this requires propagating an env. variable value to the kernel and adding an extra if-statement in a ncclP2pChannelToPart, which could have negative impact. This will be addressed in another PR.

@isaki001 isaki001 changed the title P2p batching default enablement P2p batching hang-fix Oct 30, 2025
@nileshnegi
Copy link
Contributor

PR has passed CI checks in the past. Latest changes are doc/changelog/default parameter changes. Safe to merge.

@isaki001 isaki001 merged commit 641c0eb into ROCm:develop Oct 30, 2025
14 of 20 checks passed
nileshnegi pushed a commit to nileshnegi/rccl that referenced this pull request Nov 6, 2025
* prevent batching when send/recv bytes dont match, restore bit reversal for channel to part mapping, prevent batching beyond 32-nodes

* correct computation for channel to part mapping

* update changelog

* disabling p2p-batching by default

(cherry picked from commit 641c0eb)
shahamed pushed a commit that referenced this pull request Nov 10, 2025
* prevent batching when send/recv bytes dont match, restore bit reversal for channel to part mapping, prevent batching beyond 32-nodes

* correct computation for channel to part mapping

* update changelog

* disabling p2p-batching by default

(cherry picked from commit 641c0eb)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants