@pragupta pragupta commented Jul 28, 2025

Fixes https://ontrack-internal.amd.com/browse/SWDEV-541056

  • Skip *_stress_cuda UTs on all archs
  • Symmetric Memory is not yet supported on the rocm7.0_internal_testing branch
  • test_extra_cuda_context: add a barrier so that all nodes finish init_process_group before the test continues
  • test_sac_ilp: skip on all ROCm archs (was already skipped on MI300 and NAVI)
  • test_fsdp2_mem_tracker: update tolerance
  • test_scaled_mm: depends on row-wise scaling; skipped for now
  • test_allreduce_inductor_cudagraph_trees: skipped; it is flaky upstream as well
  • test_distributed_spawn: skipped; will be fixed in the next IFU
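
For illustration only, the "skip on all ROCm archs" items above could look roughly like the sketch below. It uses only the standard library; `running_on_rocm`, `skip_if_rocm`, and `ExampleDistributedTest` are hypothetical names, and reading the `PYTORCH_TEST_WITH_ROCM` environment variable is an assumption standing in for the real test harness flag. It is not the code changed in this PR.

```python
import functools
import os
import unittest


def running_on_rocm():
    # Assumption: ROCm test runs export PYTORCH_TEST_WITH_ROCM=1.
    return os.environ.get("PYTORCH_TEST_WITH_ROCM", "0") == "1"


def skip_if_rocm(reason):
    """Hypothetical decorator: skip the test on every ROCm arch."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Checked when the test runs, not when the module is imported.
            if running_on_rocm():
                raise unittest.SkipTest(f"skipped on ROCm: {reason}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator


class ExampleDistributedTest(unittest.TestCase):
    @skip_if_rocm("depends on row-wise scaling")
    def test_scaled_mm(self):
        self.assertTrue(True)  # placeholder body

    def test_always_runs(self):
        self.assertEqual(1 + 1, 2)


def run_suite():
    """Run the example tests and return the unittest.TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleDistributedTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Skipping without an arch check is the point of the sketch: the decorator does not look at MI300, NAVI, or MI355 at all, so the test is skipped wherever the ROCm flag is set.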

Cherry-picked to release/2.7 branch via #2436
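
To make the test_extra_cuda_context item concrete, here is a minimal simulation of the ordering a barrier enforces. It is a stand-in, not the actual change: threads play the role of ranks, `time.sleep` stands in for `init_process_group`, and `threading.Barrier` stands in for a distributed barrier such as `torch.distributed.barrier()`. The function name `run_ranks` and the delays are hypothetical.

```python
import threading
import time


def run_ranks(world_size, init_delays):
    """Simulate ranks that initialize at different speeds.

    Returns the ordered list of (event, rank) tuples.
    """
    barrier = threading.Barrier(world_size)
    events = []
    lock = threading.Lock()

    def rank_main(rank):
        # Stand-in for init_process_group: each rank takes a different time.
        time.sleep(init_delays[rank])
        with lock:
            events.append(("init_done", rank))
        # Stand-in for the added barrier: wait until every rank is initialized.
        barrier.wait()
        with lock:
            events.append(("test_body", rank))

    threads = [
        threading.Thread(target=rank_main, args=(rank,))
        for rank in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

With the barrier in place, every `init_done` event is recorded before any `test_body` event, regardless of how uneven the init delays are. Without it, a fast rank could start the test body while a slow rank is still initializing.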

@pragupta pragupta changed the title [rocm7.0_internal_testing][MI355] Fix distributed failures [rocm7.0_internal_testing][SWDEV-541056][MI355] Fix distributed failures Jul 28, 2025
rocm-repo-management-api bot commented Jul 28, 2025

Jenkins build for cf511dda44b7d3c189c2761cc0e5e8da8f1d0171 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

Fixes: https://ontrack-internal.amd.com/browse/SWDEV-541056

- Symmetric Memory is not yet supported on the rocm7.0_internal_testing
  branch
- test_extra_cuda_context: add a barrier before running the UT to ensure
  that init_process_group finishes before the test continues
- test_sac_ilp: skip on all ROCm archs (was already skipped on MI300 and NAVI)
- test_fsdp2_mem_tracker: update tolerance
- test_scaled_mm: depends on row-wise scaling; skipped for now
- test_allreduce_inductor_cudagraph_trees: skipped; it is flaky upstream as well
- test_distributed_spawn: skipped; will be fixed in the next IFU
@pragupta pragupta force-pushed the pg-fix-mi350-dist-rocm7.0 branch from cf511dd to 27f705a Compare July 30, 2025 18:35
rocm-repo-management-api bot commented Jul 30, 2025

Jenkins build for 27f705ac55ac94824b94966cfdb1fbd1ab07ba75 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@jithunnair-amd jithunnair-amd merged commit fc180e5 into ROCm:rocm7.0_internal_testing Jul 30, 2025
0 of 2 checks passed
@jithunnair-amd
Collaborator

! cherry-pick --onto release/2.7

@okakarpa
Collaborator

Created branch autogenerated/release/2.7_cherry-pick_pr-2425 and #2436. It contains a merge conflict. Please resolve it

pragupta added a commit that referenced this pull request Jul 31, 2025
…][MI355] Fix distributed failures (#2436)

Fix distributed failures
- Skip *_stress_cuda UTs on all archs
- Symmetric Memory is not yet supported on the rocm7.0_internal_testing
  branch
- test_extra_cuda_context: add a barrier so that all nodes finish
  init_process_group before the test continues
- test_sac_ilp: skip on all ROCm archs (was already skipped on MI300
  and NAVI)
- test_fsdp2_mem_tracker: update tolerance
- test_scaled_mm: depends on row-wise scaling; skipped for now
- test_allreduce_inductor_cudagraph_trees: skipped; it is flaky upstream
  as well
- test_distributed_spawn: skipped; will be fixed in the next IFU

Also fixes: https://ontrack-internal.amd.com/browse/SWDEV-544875

Cherry-pick of #2425

Co-authored-by: Prachi Gupta <pracgupt@amd.com>