multiple_communicators branch gets deadlock on Alltoall #5
Thanks @andrevitorelli for reporting this!
This was missing from the error message above:
@andrevitorelli @EiffL I've just tested fft_benchmark (2 nodes / 8 GPUs):
With nsys, the results look random... I wonder if nsys and fft_benchmark.py face race conditions when accessing NCCL. Can you tell me if you get the same results as I do?
@andrevitorelli @EiffL I can randomly reproduce the problem on 1 node / 4 GPUs: "/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds."... At 11:32 this morning:
At 11:40
At 11:50
I wonder if fft_benchmark has actually ended, but for some reason the step that writes the nsys report to the SCRATCH directory is very slow.
@kimchitsigai @EiffL Successful run, quoting from
Unsuccessful run:
The layout rules are in a different order.
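To make the "layout rules in a different order" observation concrete, here is a minimal sketch (the grouping scheme and function names are assumptions for illustration, not the actual mesh layout): if two processes derive their sub-communicator membership from layout rules applied in a different order, they disagree on who their Alltoall partners are, which is exactly the "subset of ranks" situation Horovod's stall inspector reports.

```python
# Hypothetical illustration: 16 ranks split into sub-communicators of
# size 4, grouped under two different orderings of the layout rules.

def groups_row_major(n_ranks, group_size):
    # Consecutive ranks grouped together: (0,1,2,3), (4,5,6,7), ...
    return [tuple(range(g, g + group_size))
            for g in range(0, n_ranks, group_size)]

def groups_column_major(n_ranks, group_size):
    # Strided ranks grouped together: (0,4,8,12), (1,5,9,13), ...
    n_groups = n_ranks // group_size
    return [tuple(range(g, n_ranks, n_groups)) for g in range(n_groups)]

a = groups_row_major(16, 4)
b = groups_column_major(16, 4)
print(a[0])  # (0, 1, 2, 3)
print(b[0])  # (0, 4, 8, 12)
# Under one ordering, rank 0 waits on ranks 1, 2, 3; under the other it
# waits on ranks 4, 8, 12. If different processes use different
# orderings, the Alltoall never completes: a deadlock.
```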
@andrevitorelli @EiffL Just in case you didn't know, you can set the Horovod layer's verbosity level with
(I've added the NCCL messages myself.)
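For reference, the verbosity settings mentioned above can be applied through environment variables before the libraries are initialized. A minimal sketch (HOROVOD_LOG_LEVEL is Horovod's own logging knob, NCCL_DEBUG=INFO makes NCCL print its collective-level messages; set both before importing/initializing Horovod):

```python
import os

# Must be set before Horovod/NCCL are initialized, so set them at the
# top of the script (or export them in the job script instead).
os.environ["HOROVOD_LOG_LEVEL"] = "debug"  # trace/debug/info/warning/error
os.environ["NCCL_DEBUG"] = "INFO"          # NCCL prints its own messages

# import horovod.tensorflow as hvd  # import only after setting the variables
# hvd.init()
```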
Thanks! I'll keep looking into it. I've eliminated the possibility of it being a node-specific issue.
Further, possibly related insights:
What does not work to solve this bug:
What seems to:
Current hypotheses:
@EiffL @andrevitorelli
Test results (with
Attempt to explain the 4 node / 16 GPU problem. This is a representation of the 16 processes and the 2 sub-communicators, as seen in the logs:
Let's suppose that:
So, we get the "Shutting down Background thread loop" message only for processes 12, 13, 14, 15, 3, 7, and 11.
At the end of fft_benchmark.py, replacing:
by
seems to give better results. The qdrep files are generated in all 4 of the above test cases. The jobs must still be cancelled with scancel.
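The replaced code is elided above, so the following is only a guess at the underlying idea: if the better-behaved variant synchronizes all ranks before the process exits (giving nsys time to write its report before any rank tears down), the effect can be sketched with threads standing in for ranks and threading.Barrier standing in for an MPI/NCCL barrier. Everything here is illustrative, not the actual fft_benchmark.py code.

```python
import threading

n_ranks = 4
barrier = threading.Barrier(n_ranks)
done = []

def worker(rank):
    # ... per-rank work (FFTs, Alltoall) would happen here ...
    barrier.wait()      # no rank proceeds to exit before the others arrive
    done.append(rank)   # past this point, every rank has finished its work

threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_ranks)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(done))  # [0, 1, 2, 3]
```

In the real script, the barrier role would be played by a communicator-wide synchronization before the process exits, so that no rank leaves collectives pending on the others.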
When running IDRIS-Hackathon fft_benchmark.job, I get the following message:

W <...>/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
The job never finishes, but doesn't crash.