ncclCommWatchdog always terminates the process and prevents error handling if CUDA context is corrupted #126544
Looking closer at the stack trace:
The trigger is in
From the watchdog's point of view it is a bit "innocent": it cannot distinguish whether the CUDA error came from compute kernels launched by the main thread or from NCCL kernels. If it came from an NCCL kernel, should the watchdog not report it? That is not ideal either. @eqy proposed an env var to control this in #126587. Maybe that's the way to go for the moment? If we cannot decide what to do, it may be better to leave the choice to the user.
…#126587) Doesn't affect current behavior by default (for #126544). I'm not sure what the exact mechanism is here, but CUDA errors appear to already be thrown in the main process, meaning that the watchdog is separately throwing CUDA errors again. However, this rethrown error causes the process to be terminated, as it cannot be handled from user code (which doesn't have visibility of the watchdog thread). Pull Request resolved: #126587 Approved by: https://github.com/kwen2501 (cherry picked from commit a76faff)
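For context, the change above gates the watchdog's rethrow of CUDA errors behind an environment variable. A minimal usage sketch, assuming the variable is named `TORCH_NCCL_RETHROW_CUDA_ERRORS` (my reading of the PR discussion; confirm the exact name against the merged code):

```python
# Hypothetical opt-out sketch; TORCH_NCCL_RETHROW_CUDA_ERRORS is assumed
# from the PR discussion, not confirmed by this page capture.
import os

# Set before creating the NCCL process group so the watchdog sees it.
os.environ["TORCH_NCCL_RETHROW_CUDA_ERRORS"] = "0"

import torch.distributed as dist

dist.init_process_group(backend="nccl")
# With the rethrow disabled, a CUDA error surfaces only in the main
# thread, where user code can catch it and clean up.
```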
🐛 Describe the bug
`ncclCommWatchdog` uses `abort` to terminate the Python interpreter process if the CUDA context becomes corrupted while an NCCL collective is being executed. It does not respect the settings `TORCH_NCCL_ASYNC_ERROR_HANDLING=0` (NoHandling), `TORCH_NCCL_ASYNC_ERROR_HANDLING=2` (CleanUpOnly), or `TORCH_NCCL_ENABLE_MONITORING=0`: the watchdog always terminates the process and prevents any possible error handling (e.g., performing cleanup, logging the failure, or notifying other ranks that an error happened).
Repro:
Run on a machine with at least 2 GPUs:
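The original repro script did not survive this page capture. Below is a hypothetical minimal sketch of the failure mode described above: enqueue an NCCL collective, then corrupt the CUDA context with a device-side assert; per this report, the watchdog's `abort` kills the process before the `except` block can run. The out-of-bounds indexing trick and the `torchrun` launch line are my assumptions, not the reporter's script.

```python
# repro.py -- hypothetical sketch, not the original reporter's script.
# Launch: torchrun --nproc-per-node=2 repro.py
import os

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)
    dist.init_process_group(backend="nccl")
    x = torch.ones(4, device="cuda")
    try:
        dist.all_reduce(x)  # enqueue an NCCL collective
        # Corrupt the CUDA context: an out-of-bounds gather fires a
        # device-side assert inside the indexing kernel.
        _ = x[torch.tensor([100], device="cuda")]
        torch.cuda.synchronize()  # surface the CUDA error on the host
    except RuntimeError as exc:
        # Per this issue, control never reaches here: ncclCommWatchdog
        # calls abort() first, regardless of
        # TORCH_NCCL_ASYNC_ERROR_HANDLING / TORCH_NCCL_ENABLE_MONITORING.
        print(f"rank {rank} caught {exc!r}; attempting cleanup")
        dist.destroy_process_group()


if __name__ == "__main__":
    main()
```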
All possible combinations of `TORCH_NCCL_ASYNC_ERROR_HANDLING={0,1,2,3}` x `TORCH_NCCL_ENABLE_MONITORING={0,1}` also trigger the same failure.

Traceback:
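Separately, a small driver can exercise that claim by sweeping every combination. A hypothetical sketch, assuming the repro sketch above is saved as repro.py:

```python
# sweep.py -- hypothetical driver; iterates all env combinations and
# launches the repro under each, expecting a watchdog abort every time.
import itertools
import os
import subprocess

for handling, monitoring in itertools.product((0, 1, 2, 3), (0, 1)):
    overrides = {
        "TORCH_NCCL_ASYNC_ERROR_HANDLING": str(handling),
        "TORCH_NCCL_ENABLE_MONITORING": str(monitoring),
    }
    result = subprocess.run(
        ["torchrun", "--nproc-per-node=2", "repro.py"],
        env={**os.environ, **overrides},
    )
    # abort() shows up as a non-zero exit code (SIGABRT -> -6 on Linux).
    print(f"{overrides} -> exit code {result.returncode}")
```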
Versions
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k