[NCCL][CUDA] Optionally avoid rethrowing CUDA Errors in NCCL Watchdog #126587
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126587
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 unrelated failure) As of commit 6a56cf2 with merge base 796dff7: UNSTABLE - the following job failed but was likely due to flakiness present on trunk and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Can you explain the mechanism for throwing the CUDA errors in the main thread? Is it because any current CUDA error on any stream/kernel will cause any future CPU synchronization call to report the error in the current CPU thread? If so, then we could argue that the watchdog does not need to rethrow CUDA errors, because users will discover them anyway unless the user code has stopped issuing new CUDA work. (That's still technically a gap, but probably not an important one?)
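For reference, here is a minimal standalone sketch of the "sticky error" behavior being asked about (plain CUDA, not PyTorch code): a fault in earlier device work surfaces on whichever host thread performs a later synchronization call, and subsequent runtime calls keep returning the same error.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void faultyKernel(int* p) {
  *p = 42; // deliberate illegal access: p is a null device pointer
}

int main() {
  faultyKernel<<<1, 1>>>(nullptr); // async launch; no error reported yet

  // The illegal access surfaces here, on the host thread doing the sync.
  cudaError_t err = cudaDeviceSynchronize();
  std::printf("sync:       %s\n", cudaGetErrorString(err));

  // The error is sticky: an unrelated later call sees it too, which is
  // why the main thread would notice it even without a watchdog rethrow.
  void* buf = nullptr;
  err = cudaMalloc(&buf, 16);
  std::printf("cudaMalloc: %s\n", cudaGetErrorString(err));
  return 0;
}
```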
```cpp
watchDogException_ =
    std::make_exception_ptr(C10_BUILD_ERROR(DistBackendError, exitMsg));
std::rethrow_exception(watchDogException_);
if (C10_LIKELY(rethrowCUDAErrors_) ||
```
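To make the shape of the change clearer, here is a hedged sketch of the gating pattern (not the exact PR diff): the watchdog rethrows only when rethrowCUDAErrors_ is set or when the failure is not a CUDA error. isCUDAError() is a hypothetical helper standing in for however the error is actually classified in ProcessGroupNCCL.cpp.

```cpp
// Sketch only; assumes the surrounding ProcessGroupNCCL watchdog context.
// isCUDAError() is a hypothetical classifier, not a real PyTorch helper.
if (C10_LIKELY(rethrowCUDAErrors_) || !isCUDAError(e)) {
  // Old behavior: propagate the failure out of the watchdog thread.
  watchDogException_ =
      std::make_exception_ptr(C10_BUILD_ERROR(DistBackendError, exitMsg));
  std::rethrow_exception(watchDogException_);
} else {
  // New opt-in behavior: CUDA errors are assumed to be sticky and thus
  // already visible to the main thread, so log instead of rethrowing,
  // which would otherwise terminate the whole process.
  LOG(ERROR) << exitMsg;
}
```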
Instead of defining a new env var to control the error handling, which makes things more complicated, could we re-use existing ones, e.g. calling SHOULD_TEAR_DOWN(asyncErrorHandling_) here?
Would it be expected that SHOULD_TEAR_DOWN(asyncErrorHandling_) would not rethrow CUDA errors? If so, we could consider repurposing that for this case as well.
I think asyncErrorHandling_ is supposed to handle any errors, including CUDA errors.
One thing to check is whether there are NCCL errors that we still need to handle, because those are not raised in the main thread (see the sketch below). Or, if all NCCL errors would be raised the same way as CUDA errors, the logic is simpler.
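A hedged note on why that check matters: unlike sticky CUDA runtime errors, asynchronous NCCL errors are reported per-communicator via ncclCommGetAsyncError() and do not surface through later CUDA API calls, so a watchdog-style poll is still needed for them. A minimal sketch, assuming an initialized ncclComm_t:

```cpp
#include <nccl.h>

// Returns true if `comm` has an asynchronous error pending. Such errors
// (e.g. a remote rank failing) do not poison the CUDA runtime the way a
// sticky CUDA error does, so only a poll like this can observe them.
bool hasNcclAsyncError(ncclComm_t comm) {
  ncclResult_t asyncErr = ncclSuccess;
  if (ncclCommGetAsyncError(comm, &asyncErr) != ncclSuccess) {
    return true; // the query itself failed
  }
  return asyncErr != ncclSuccess;
}
```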
@kwen2501 Yes, it does fix the repro in #126544.

@wconstab That matches the observed behavior (the CUDA error is still visible even if the watchdog does not rethrow it), but I'm not sure that's the exact mechanism here. I will check whether other PyTorch-NV folks have an explanation.
Giving an approval per my reasoning in the original issue: #126544 (comment)
It seems to me that this is the solution.
FYI @wconstab @eqy @shuqiangzhang
Thus, I don't think we need to spend time discussing TORCH_NCCL_ASYNC_ERROR_HANDLING here.
Edit: unless we decide to pull …
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This doesn't affect current behavior by default; it addresses #126544.
I'm not sure what the exact mechanism is here, but CUDA errors appear to already be thrown in the main process, meaning that the watchdog is separately throwing CUDA errors again. This rethrown error causes the process to be terminated, as it cannot be handled from user code (which has no visibility into the watchdog thread).
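As an illustration of that last point, here is a minimal, hedged C++ sketch (plain std::thread standing in for the NCCL watchdog) of why an exception rethrown on a background thread is fatal: it escapes the thread's entry function, std::terminate() is called, and a try/catch on the main thread never sees it.

```cpp
#include <iostream>
#include <stdexcept>
#include <thread>

int main() {
  try {
    std::thread watchdog([] {
      // Simulates the watchdog rethrowing an error it observed.
      throw std::runtime_error("simulated CUDA error");
    });
    watchdog.join();
  } catch (const std::exception& e) {
    // Never reached: the exception escapes on the watchdog thread, so
    // std::terminate() aborts the process before control returns here.
    std::cout << "caught: " << e.what() << '\n';
  }
  return 0;
}
```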
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k