dnikolaev-amd commented Jun 11, 2025

The base test class forces world_size=2 even when only 1 GPU is available. NCCL then fails with these errors:

ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device c000
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device c000

This PR skips the FSDP tests when world_size > torch.cuda.device_count(). Example run with a single visible device:
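The skip condition can be sketched roughly as follows. This is an illustration only, not the diff from this PR: the decorator name `skip_if_not_enough_devices` and its `device_count_fn` parameter are hypothetical, and in a real PyTorch test the callable would presumably be `torch.cuda.device_count`.

```python
import functools
import unittest


def skip_if_not_enough_devices(world_size, device_count_fn):
    """Hypothetical helper: skip a test when world_size exceeds visible devices.

    device_count_fn is a zero-argument callable returning the number of
    visible devices (assumed to be torch.cuda.device_count in practice).
    """
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            available = device_count_fn()
            if world_size > available:
                # Skipping here avoids launching two ranks on the same GPU,
                # which is what triggers the "Duplicate GPU detected" error.
                raise unittest.SkipTest(
                    f"Need at least {world_size} CUDA devices, found {available}"
                )
            return test_fn(*args, **kwargs)
        return wrapper
    return decorator


# Usage sketch: world_size=2 with one visible device raises SkipTest.
@skip_if_not_enough_devices(2, lambda: 1)
def example_test():
    return "ran"
```

Passing the device count as a callable keeps the check lazy, so it is evaluated when the test runs rather than at import time.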

HIP_VISIBLE_DEVICES=0 pytest -v distributed/_composable/fsdp/test_fully_shard_extensions.py::TestFullyShardAllGatherExtensionsMultiProcess::test_all_gather_extensions_train_parity

dist init r=0, world=2
dist init r=1, world=2
SKIPPED [15.5507s] (Need at least 2 CUDA devices)

Fixes SWDEV-535767

dnikolaev-amd marked this pull request as draft June 11, 2025 17:12

rocm-repo-management-api bot commented Jun 11, 2025

Jenkins build for 7eb15e34c4dfb06fe0fb6414a5f3394829499a5b commit finished as SUCCESS

dnikolaev-amd force-pushed the dnikolaev/fix_test_all_gather_extensions_train_parity branch from 37583c5 to 7eb15e3 June 11, 2025 20:24

rocm-repo-management-api bot commented Jun 11, 2025

Jenkins build for 7eb15e34c4dfb06fe0fb6414a5f3394829499a5b commit finished as FAILURE

pragupta (Collaborator) left a comment

We should also try to upstream this and hope upstream folks are okay with this fix.

dnikolaev-amd marked this pull request as ready for review June 11, 2025 23:52
jithunnair-amd merged commit 3057d15 into rocm7.0_internal_testing Jun 12, 2025
0 of 6 checks passed
jithunnair-amd deleted the dnikolaev/fix_test_all_gather_extensions_train_parity branch June 12, 2025 15:45
4 participants