Set cuda devices for example loops #329

Merged
kmontemayor2-sc merged 1 commit into main from kmonte/set-cuda-device
Sep 19, 2025

@kmontemayor2-sc
Collaborator

Without setting the CUDA device in the example loops, we could see NCCL issues and devices failing.

As a follow-up, we should consider renaming get_available_device to setup_device and having it set the CUDA device if applicable, but let's do the less invasive change first.
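A rough sketch of what the proposed follow-up could look like, purely for illustration. Both `pick_device_index` and this `setup_device` are hypothetical: they are not the existing get_available_device, and the round-robin mapping from local rank to device index is an assumption. The only PyTorch calls used are `torch.cuda.is_available`, `torch.cuda.device_count`, `torch.device`, and `torch.cuda.set_device`.

```python
def pick_device_index(local_rank: int, device_count: int) -> int:
    """Map a process's local rank to a CUDA device index (round-robin, an assumption)."""
    if device_count <= 0:
        raise ValueError("device_count must be positive")
    return local_rank % device_count


def setup_device(local_rank: int):
    """Hypothetical helper: pick this process's device and, on CUDA, make it current."""
    import torch  # imported lazily so pick_device_index is usable without PyTorch

    if not torch.cuda.is_available():
        return torch.device("cpu")

    index = pick_device_index(local_rank, torch.cuda.device_count())
    device = torch.device("cuda", index)
    # Make this the current device for the process. Otherwise the current
    # device stays at the default, cuda:0, in every process.
    torch.cuda.set_device(device)
    return device
```

Under these assumptions, each spawned process in an example loop would call `setup_device(local_rank)` once, before creating its process group, and then move its model and tensors to the returned device.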

Collaborator

@mkolodner-sc left a comment

qq why haven't we been seeing this occur in our current E2E runs in GiGL?

@kmontemayor2-sc
Collaborator Author

> qq why haven't we been seeing this occur in our current E2E runs in GiGL?

Not sure, maybe we need to use 4 devices instead of 2?

@kmontemayor2-sc added this pull request to the merge queue Sep 19, 2025
Merged via the queue into main with commit 4753cbb Sep 19, 2025
4 checks passed
@kmontemayor2-sc deleted the kmonte/set-cuda-device branch September 19, 2025 19:44