@k-artem k-artem commented Sep 16, 2025

Manual cherry-pick of #2401
Fixes #SWDEV-548314

k-artem and others added 3 commits September 16, 2025 15:25
[release/2.5][SWDEV-489778] NAVI4x UT parity for distributed config (#2327)
I did a sweep of all the distributed failures on NAVI4x. At a high
level, we were running into the following issues:
- MEM_EFF_ATTENTION is not supported on NAVI4x for 2.5, causing "tensors not alike" failures
- Some UTs pass in future releases; those are skipped here
- Some needed slight tolerance adjustments, as this branch uses hipBLAS where future branches use hipBLASLt

Fixes #ISSUE_NUMBER
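
For readers unfamiliar with this kind of fix, the two patterns the commit message describes (skipping a test on an unsupported architecture, and loosening a comparison tolerance) could look roughly like the sketch below. It uses only the standard library so it runs without a GPU. Everything in it is an assumption for illustration: `current_gpu_arch`, `is_navi4x`, `tolerances`, the `gfx1200`/`gfx1201` names, and the tolerance values are hypothetical stand-ins, not the helpers or numbers used in the actual PyTorch test suite.

```python
import math
import unittest

# Assumption: architecture names used here to stand for NAVI4x parts.
NAVI4X_ARCHS = ("gfx1200", "gfx1201")


def current_gpu_arch() -> str:
    """Hypothetical stand-in for a real architecture query."""
    return "gfx1200"


def is_navi4x(arch: str) -> bool:
    """Return True if the given architecture string is in the NAVI4x set."""
    return arch in NAVI4X_ARCHS


def tolerances(arch: str) -> dict:
    """Pick comparison tolerances per architecture (illustrative values only)."""
    if is_navi4x(arch):
        # Looser bounds, e.g. when a different BLAS backend is in use.
        return {"rel_tol": 1e-3, "abs_tol": 1e-3}
    return {"rel_tol": 1e-5, "abs_tol": 1e-5}


class ExampleParityTest(unittest.TestCase):
    @unittest.skipIf(
        is_navi4x(current_gpu_arch()),
        "feature not supported on NAVI4x in this branch",
    )
    def test_needs_unsupported_feature(self):
        # Skipped on NAVI4x, so this body never runs there.
        self.fail("should have been skipped")

    def test_parity_with_relaxed_tolerance(self):
        expected = 1.0
        actual = 1.0004  # small numerical drift between backends
        tol = tolerances(current_gpu_arch())
        self.assertTrue(math.isclose(actual, expected, **tol))
```

The skip keeps the unsupported path out of the failure count, while the per-architecture tolerance leaves the stricter bound in place everywhere else.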
rocm-repo-management-api bot commented Sep 16, 2025

Jenkins build for 00f48c63a533c333865211f1f3b424cc5c16632c commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

k-artem commented Sep 23, 2025

@pruthvistony please review

rocm-repo-management-api bot commented Oct 1, 2025

Jenkins build for a6a6ee3f9096c92245b7794972d1916f906ebe73 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

k-artem commented Oct 1, 2025

Tests passed on Navi44.

Test commands:

TEMP_DIR=/tmp/ BACKEND=gloo WORLD_SIZE=2 PYTORCH_TEST_WITH_ROCM=1 python test/distributed/_composable/fsdp/test_fully_shard_training.py TestFullyShardRegisteredParams.test_param_registration_after_forward

TEMP_DIR=/tmp/ BACKEND=gloo WORLD_SIZE=2 PYTORCH_TEST_WITH_ROCM=1 python test/distributed/_composable/fsdp/test_fully_shard_training.py TestFullyShardCastAfterInit.test_to_float64_after_init

TEMP_DIR=/tmp/ BACKEND=gloo WORLD_SIZE=2 PYTORCH_TEST_WITH_ROCM=1 python test/distributed/_composable/fsdp/test_fully_shard_training.py TestFullyShard1DTrainingCore.test_train_parity_multi_group

TEMP_DIR=/tmp/ BACKEND=gloo WORLD_SIZE=2 PYTORCH_TEST_WITH_ROCM=1 python test/distributed/_tools/test_sac_ilp.py TestSACILP.test_sac_ilp_case2

TEMP_DIR=/tmp/ BACKEND=gloo WORLD_SIZE=2 PYTORCH_TEST_WITH_ROCM=1 python test/distributed/elastic/test_control_plane.py WorkerServerTest.test_tcp

TEMP_DIR=/tmp/ BACKEND=gloo WORLD_SIZE=2 PYTORCH_TEST_WITH_ROCM=1 python test/distributed/fsdp/test_fsdp_core.py TestParityWithDDPCUDA.test_transformer_offload_true_no_shard_cuda

TEMP_DIR=/tmp/ BACKEND=gloo WORLD_SIZE=2 PYTORCH_TEST_WITH_ROCM=1 python test/distributed/fsdp/test_fsdp_hybrid_shard.py TestFSDPHybridShard.test_fsdp_hybrid_shard_basic_setup

TEMP_DIR=/tmp/ BACKEND=gloo WORLD_SIZE=2 PYTORCH_TEST_WITH_ROCM=1 python test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py TestShardedGradScalerParityWithDDP.test_sharded_grad_scaler_found_inf

TEMP_DIR=/tmp/ BACKEND=gloo PYTORCH_TEST_WITH_ROCM=1 python test/test_linalg.py TestLinalgCUDA.test_matmul_45724_cuda
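
To re-run the commands above in one go, a small driver along these lines may help. It is only a sketch: the test paths, test IDs, and environment variables are copied from the list above, while `build_invocations`, `run_all`, and the use of `sys.executable` in place of `python` are my own choices and not part of the PR. It assumes it is started from the root of a PyTorch checkout.

```python
import os
import subprocess
import sys

# Environment shared by every command in the list above.
COMMON_ENV = {"TEMP_DIR": "/tmp/", "BACKEND": "gloo", "PYTORCH_TEST_WITH_ROCM": "1"}

# (test file, test id, whether the original command set WORLD_SIZE=2)
TESTS = [
    ("test/distributed/_composable/fsdp/test_fully_shard_training.py",
     "TestFullyShardRegisteredParams.test_param_registration_after_forward", True),
    ("test/distributed/_composable/fsdp/test_fully_shard_training.py",
     "TestFullyShardCastAfterInit.test_to_float64_after_init", True),
    ("test/distributed/_composable/fsdp/test_fully_shard_training.py",
     "TestFullyShard1DTrainingCore.test_train_parity_multi_group", True),
    ("test/distributed/_tools/test_sac_ilp.py",
     "TestSACILP.test_sac_ilp_case2", True),
    ("test/distributed/elastic/test_control_plane.py",
     "WorkerServerTest.test_tcp", True),
    ("test/distributed/fsdp/test_fsdp_core.py",
     "TestParityWithDDPCUDA.test_transformer_offload_true_no_shard_cuda", True),
    ("test/distributed/fsdp/test_fsdp_hybrid_shard.py",
     "TestFSDPHybridShard.test_fsdp_hybrid_shard_basic_setup", True),
    ("test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py",
     "TestShardedGradScalerParityWithDDP.test_sharded_grad_scaler_found_inf", True),
    ("test/test_linalg.py", "TestLinalgCUDA.test_matmul_45724_cuda", False),
]


def build_invocations(tests=TESTS):
    """Return a list of (extra_env, argv) pairs, one per test command."""
    invocations = []
    for path, test_id, distributed in tests:
        env = dict(COMMON_ENV)
        if distributed:
            env["WORLD_SIZE"] = "2"
        invocations.append((env, [sys.executable, path, test_id]))
    return invocations


def run_all(invocations, runner=subprocess.run):
    """Run each invocation in turn and return the ones that exited non-zero."""
    failures = []
    for extra_env, argv in invocations:
        # Layer the per-test variables on top of the caller's environment.
        completed = runner(argv, env={**os.environ, **extra_env})
        if completed.returncode != 0:
            failures.append(" ".join(argv[1:]))
    return failures
```

Calling `run_all(build_invocations())` returns the list of failing `file test_id` strings, which is empty when every test passes.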

@k-artem k-artem merged commit 9015dfd into release/2.7 Oct 1, 2025
@k-artem k-artem deleted the fix_unit_tests_2.7 branch October 1, 2025 11:33