@dhonnappa-amd

Cherry-pick of #2447

…ntext test in test_c10d_nccl.py (#2447)

This PR reverts the change made in commit 71a21d9 to the
test_extra_cuda_context test in test_c10d_nccl.py. Instead, the test now
calls c10d.barrier() to ensure that all nodes have completed
init_process_group.

The barrier is needed because, if rank 0 takes a memory snapshot before
the other nodes have finished init_process_group, we see an artificial
bump in memory usage. As per the following comment, we are going to be
moving away from this function:
pytorch#154174 (comment)
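
The ordering argument above can be illustrated with a toy, standard-library-only simulation. This is a sketch under stated assumptions, not the code in test_c10d_nccl.py: threads stand in for ranks, `threading.Barrier` stands in for c10d.barrier(), a sleep plus a counter increment stands in for init_process_group, and the names `WORLD_SIZE`, `INIT_COST`, and `run` are hypothetical.

```python
import threading
import time

WORLD_SIZE = 4   # hypothetical number of ranks
INIT_COST = 10   # hypothetical memory units each rank allocates during init


def run(use_barrier: bool) -> int:
    """Return the total allocation that rank 0 observes in its snapshot."""
    allocated = 0
    lock = threading.Lock()
    barrier = threading.Barrier(WORLD_SIZE)
    snapshot = {}

    def worker(rank: int) -> None:
        nonlocal allocated
        # Stand-in for init_process_group: higher ranks finish later.
        time.sleep(0.05 * rank)
        with lock:
            allocated += INIT_COST
        if use_barrier:
            # Stand-in for c10d.barrier(): wait until every rank has initialized.
            barrier.wait()
        if rank == 0:
            # Stand-in for the memory snapshot taken on rank 0.
            with lock:
                snapshot["mem"] = allocated

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return snapshot["mem"]


with_barrier = run(use_barrier=True)
without_barrier = run(use_barrier=False)
print("snapshot with barrier:   ", with_barrier)
print("snapshot without barrier:", without_barrier)
```

With the barrier, rank 0's snapshot always includes every rank's initialization cost. Without it, the snapshot depends on timing and may miss allocations that land later, which would then show up as an apparent increase in memory usage.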

Tested by building PyTorch and running the test_extra_cuda_context test
with the Docker image:
**compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-rel-6.4:125_ubuntu24.04_py3.12_pytorch_rocm6.4_internal_testing_71a21d9**

rocm-repo-management-api bot commented Aug 14, 2025

Jenkins build for 029a4b4e76e9772e1386eb1e4146b8f28374c672 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@akashveramd akashveramd self-assigned this Aug 14, 2025
@akashveramd akashveramd marked this pull request as ready for review August 14, 2025 20:05
@jeffdaily jeffdaily changed the title from "[AUTOGENERATED] [release/2.8] [ROCm6.4_internal_testing] Using c10d.barrier() in test_extra_cuda_context test in test_c10d_nccl.py" to "[release/2.8] Using c10d.barrier() in test_extra_cuda_context test in test_c10d_nccl.py" Aug 15, 2025
@jeffdaily jeffdaily merged commit 75c8052 into release/2.8 Aug 15, 2025
0 of 2 checks passed
@jeffdaily jeffdaily deleted the autogenerated/release/2.8_cherry-pick_pr-2447 branch August 15, 2025 15:38
tvukovic-amd pushed a commit that referenced this pull request Aug 20, 2025
… test_c10d_nccl.py (#2522)

Cherry-pick of #2447

Co-authored-by: akashveramd <Akash.Verma3@amd.com>