
Check unit test pass/failure and handle timeout error in halo test #323

Draft
amd-sriram wants to merge 2 commits into master from fix_halo_test_assert



@amd-sriram (Collaborator) commented Mar 27, 2026

Motivation

The halo test takes more than 10 hours to run: https://github.com/ROCm/apex/actions/runs/23252552706/job/67612063034?pr=320

The test also has no graceful way to exit when it times out, as in this run: https://github.com/ROCm/apex/actions/runs/23505553410/job/68420328568

Technical Details

To check for unit test failure, an assert statement similar to the one in NVIDIA apex is added:
https://github.com/NVIDIA/apex/blob/master/apex/contrib/test/peer_memory/test_peer_halo_exchange_module.py#L134

torch.testing.assert_close(list_y, list_y2, msg=memory_format_str) 
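
For reviewers less familiar with `torch.testing.assert_close`, here is a minimal sketch of how the call behaves on lists of tensors. The tensor shapes and the `"channels_last"` label are made up for illustration and are not taken from the halo test; only the call itself matches the line above.

```python
import torch

# Hypothetical stand-ins for the two lists of halo-exchange outputs.
list_y = [torch.ones(2, 3), torch.zeros(4)]
list_y2 = [t.clone() for t in list_y]
memory_format_str = "channels_last"  # assumed label, used as the failure message

# Sequences are compared element by element; matching inputs pass silently.
torch.testing.assert_close(list_y, list_y2, msg=memory_format_str)

# Introduce a mismatch so the assertion fires.
list_y2[1][0] = 1.0
failure_message = None
try:
    torch.testing.assert_close(list_y, list_y2, msg=memory_format_str)
except AssertionError as err:
    # A string msg replaces the default mismatch report.
    failure_message = str(err)

print(failure_message)
```

Because the call raises `AssertionError` on a mismatch, a wrong halo result now fails the test instead of passing unnoticed, and the message identifies which memory format was being exercised.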

The following changes address the timeout error:

  1. A shorter timeout on init_process_group so failures surface faster, plus a timeout for the unit test itself
  2. Try/except around the test functions to catch DistBackendError
  3. Proper cleanup with destroy_process_group in a finally block
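
The three changes above can be sketched roughly as follows. This is an illustration of the pattern, not the code in this PR: `run_guarded`, `all_reduce_check`, the 60 second timeout, the `gloo` backend, and the file-based rendezvous are all assumptions chosen so the sketch runs in a single CPU process. The real test presumably uses a GPU backend and multiple ranks, and the separate timeout for the unit test itself is not shown.

```python
import os
import tempfile
from datetime import timedelta

import torch
import torch.distributed as dist


def run_guarded(test_fn, rank, world_size, init_method, backend="gloo", timeout_s=60):
    """Run test_fn inside a process group (hypothetical helper).

    Returns True on success and False if a DistBackendError was caught.
    """
    try:
        # A short timeout makes a hung rendezvous or collective fail sooner.
        dist.init_process_group(
            backend=backend,
            init_method=init_method,
            rank=rank,
            world_size=world_size,
            timeout=timedelta(seconds=timeout_s),
        )
        test_fn()
        return True
    except dist.DistBackendError as err:
        # Report the backend failure instead of letting the job hang or crash.
        print(f"rank {rank}: distributed backend error: {err}")
        return False
    finally:
        # Always tear the group down, whether the test passed or failed.
        if dist.is_initialized():
            dist.destroy_process_group()


def all_reduce_check():
    """Tiny stand-in for a real test body (assumed, not the halo test)."""
    t = torch.ones(4)
    dist.all_reduce(t)
    torch.testing.assert_close(t, torch.ones(4) * dist.get_world_size())


if __name__ == "__main__":
    store_file = os.path.join(tempfile.mkdtemp(), "rendezvous")
    ok = run_guarded(all_reduce_check, rank=0, world_size=1,
                     init_method=f"file://{store_file}")
    print("passed" if ok else "failed")
```

Assertion failures from the test body are deliberately not caught here, so a wrong result still fails the run, while the `finally` block guarantees the process group is destroyed in every case.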

Test Plan

Run the CI and check how long the halo test takes with this change.

Test Result

Run - https://github.com/ROCm/apex/actions/runs/23668144565
The halo test takes ... hours.

Submission Checklist

@amd-sriram amd-sriram self-assigned this Mar 27, 2026
@amd-sriram amd-sriram changed the title add assert statement in halo test Handle failure and long time in halo test Mar 27, 2026
@amd-sriram amd-sriram changed the title Handle failure and long time in halo test Check unit test pass/failure and handle timeout error in halo test Mar 27, 2026
