fix: distributed training hanging with submit_job #141
Merged
Contributor
Ndles
approved these changes
Apr 2, 2026
Member
Ndles
left a comment
Thanks for digging into it! If the CUDA version becomes a problem, I think we need to raise this with Fabian and Furkan.
Distributed training was not working from the noether-train-submit-job CLI.
There were two main issues:
cuda.set_device function to set the default device based on the local rank, and also pass the chosen device to init_process_group.
One problem remains, though it might be related to the installed CUDA version: if gpus_per_task is set instead of gpus_per_node, each process should normally get exactly one GPU assigned, but this doesn't work. Whenever the SLURM per-process GPU restriction is used with the currently installed version of CUDA, we get an error in ncclCommInitRankConfig, which seems to come from https://github.com/pytorch/pytorch/blob/v2.11.0/torch/csrc/distributed/c10d/NCCLUtils.cpp#L70. This is not a big problem, because we can always set the GPUs per node as well; we can come back to this only if really necessary.
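For reference, the device-binding part of the fix (set the default CUDA device from the local rank, then pass that device to init_process_group) can be sketched roughly as below. This is a minimal sketch and not the code from this PR. The helper names resolve_local_rank and init_distributed are hypothetical. Reading the local rank from LOCAL_RANK (exported by torchrun) or SLURM_LOCALID (exported by srun) is an assumption about how the launcher exposes it. The sketch also assumes the launcher has already exported RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for the default env:// rendezvous, and that each process can see all GPUs of its node (the gpus_per_node case).

```python
import os


def resolve_local_rank(env=None):
    """Return this process's rank within its node, defaulting to 0.

    Hypothetical helper: the variable names checked here are an assumption
    about the launcher, not something taken from the PR.
    """
    env = os.environ if env is None else env
    # torchrun exports LOCAL_RANK; SLURM's srun exports SLURM_LOCALID.
    for key in ("LOCAL_RANK", "SLURM_LOCALID"):
        value = env.get(key)
        if value is not None:
            return int(value)
    return 0


def init_distributed(backend="nccl"):
    """Bind this process to one GPU, then create the default process group."""
    # Imported lazily so the rank helper above is usable without PyTorch.
    import torch
    import torch.distributed as dist

    local_rank = resolve_local_rank()
    device = torch.device("cuda", local_rank)

    # Make this GPU the default device for the process.
    torch.cuda.set_device(device)

    # Pass the chosen device so the process group is tied to it explicitly.
    dist.init_process_group(backend=backend, device_id=device)
    return device
```

Each worker process would call init_distributed() once at startup, before building the model or wrapping it for distributed training.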
A few small changes I introduced with this fix: