
[NCCL][CUDA] Optionally avoid rethrowing CUDA Errors in NCCL Watchdog #126587

Closed · wants to merge 3 commits

Conversation

@eqy (Collaborator) commented May 17, 2024

Doesn't affect current behavior by default, for #126544
I'm not sure what the exact mechanism is here, but CUDA errors appear to already be thrown in the main process, meaning that the watchdog is separately throwing CUDA errors again. However, this rethrown error causes the process to be terminated, as it cannot be handled from user code (which has no visibility into the watchdog thread).
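Roughly, the change guards the rethrow behind an opt-out flag. A minimal sketch of the idea follows; the flag name comes from the diff under review, while the error-string check and the opt-out mechanism shown here are illustrative assumptions rather than the literal change:

```cpp
// Sketch only (not the literal diff): conditional rethrow in the watchdog's
// top-level catch block. rethrowCUDAErrors_ would default to true, so nothing
// changes unless the user opts out (e.g. via an environment variable; the
// exact knob name is an assumption here).
try {
  watchdogHandler();
} catch (const std::exception& e) {
  const auto exitMsg = c10::str(
      "NCCL watchdog thread terminated with exception: ", e.what());
  const bool isCudaError =
      std::string(e.what()).find("CUDA error") != std::string::npos;
  if (rethrowCUDAErrors_ || !isCudaError) {
    // Existing behavior: rethrowing from the watchdog thread terminates the
    // process, since user code cannot catch an exception thrown on a thread
    // it does not control.
    watchDogException_ =
        std::make_exception_ptr(C10_BUILD_ERROR(DistBackendError, exitMsg));
    std::rethrow_exception(watchDogException_);
  } else {
    // Opt-out path: log instead, and let the main thread surface the CUDA
    // error through its own synchronization calls.
    LOG(ERROR) << exitMsg;
  }
}
```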

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

@eqy requested review from wconstab and kwen2501 on May 17, 2024 23:30
pytorch-bot commented May 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126587

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 6a56cf2 with merge base 796dff7:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added the oncall: distributed and release notes: distributed (c10d) labels on May 17, 2024
@drisspg added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on May 20, 2024
@kwen2501 (Contributor) commented May 21, 2024

> Doesn't affect current behavior by default, for #126544

Do you mean that this PR is a fix for #126544?
If so, a question I have is how it can avoid the process termination mentioned in #126544.
Would appreciate your comment.

@wconstab (Contributor) commented:

Can you explain the mechanism for throwing the CUDA errors in the main thread?

Is it because any current CUDA error on any stream/kernel will cause any future CPU synchronization call to report the error in the current CPU thread? If so, then we could argue that the watchdog does not need to rethrow CUDA errors, because users will discover them unless the user code has stopped issuing new CUDA work. (That's still technically a gap, but probably not an important one?)
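For reference, a standalone illustration (plain CUDA runtime code, not from PyTorch) of the "sticky error" behavior being asked about: once a kernel faults, the error is reported by the next synchronization call on the calling CPU thread, and by every runtime call after that.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bad_kernel(int* p) {
  p[0] = 42;  // device-side fault: write through a null pointer
}

int main() {
  bad_kernel<<<1, 1>>>(nullptr);               // launch returns immediately; the error is asynchronous
  cudaError_t err = cudaDeviceSynchronize();   // fault is reported here, on the calling CPU thread
  std::printf("first sync: %s\n", cudaGetErrorString(err));

  // The error is "sticky": every later runtime call in this process also fails,
  // so user code on the main thread sees it even if no other thread rethrows.
  int* q = nullptr;
  err = cudaMalloc(&q, sizeof(int));
  std::printf("later call: %s\n", cudaGetErrorString(err));
  return 0;
}
```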

Review thread on these lines of the diff:

```cpp
watchDogException_ =
    std::make_exception_ptr(C10_BUILD_ERROR(DistBackendError, exitMsg));
std::rethrow_exception(watchDogException_);
if (C10_LIKELY(rethrowCUDAErrors_) ||
```
Contributor:

Instead of defining a new env var to control the error handling, which makes things more complicated, could we reuse an existing one, e.g. by calling SHOULD_TEAR_DOWN(asyncErrorHandling_) here?

Collaborator (Author) replied:

Would it be expected that SHOULD_TEAR_DOWN(asyncErrorHandling_) would not rethrow CUDA errors? If so, we could consider repurposing that for this case as well.

Contributor replied:

I think asyncErrorHandling_ is supposed to handle any errors, including CUDA errors.

Contributor replied:

One thing to check is whether there are still NCCL errors that we need to handle here, because they are not raised in the main thread. Or, if all NCCL errors would be raised the same way as CUDA errors, the logic is simpler.

@eqy (Collaborator, Author) commented May 21, 2024

@kwen2501 Yes, it does fix the repro in #126544.
It avoids process termination by catching the CUDA error (which the repro tries to do but is unable to if the watchdog rethrows the exception).

@wconstab That matches the observed behavior (the CUDA error is still visible even if the watchdog does not rethrow it), but I'm not sure that's the exact mechanism here. I'll check whether other PyTorch-NV folks have an explanation for this.

@kwen2501 (Contributor) left a comment

Giving an approval per my reasoning in the original issue: #126544 (comment)

It seems to me that this is the solution.

@kwen2501 (Contributor) commented:
FYI @wconstab @eqy @shuqiangzhang
What happened has nothing to do with handleException(), see #126544 (comment).
What killed the process is this line:

std::rethrow_exception(watchDogException_);

Thus, I don't think we need to spend time discussing TORCH_NCCL_ASYNC_ERROR_HANDLING here.
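For context, a standalone illustration (not PyTorch code) of why that line is fatal: an exception that escapes a thread's entry function triggers std::terminate, and no handler in user code on the main thread can intercept it.

```cpp
#include <exception>
#include <stdexcept>
#include <thread>

int main() {
  std::thread watchdog([] {
    auto ep = std::make_exception_ptr(std::runtime_error("CUDA error: ..."));
    // Nothing on this thread catches, so the exception escapes the thread's
    // entry function and std::terminate() is called, aborting the whole
    // process. A try/except in user code on the main thread cannot stop it.
    std::rethrow_exception(ep);
  });
  watchdog.join();  // never reached cleanly; the process aborts first
  return 0;
}
```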

@kwen2501 (Contributor) commented:

> Thus, I don't think we need to spend time discussing TORCH_NCCL_ASYNC_ERROR_HANDLING here.

Edit: unless we decide to pull TORCH_NCCL_ASYNC_ERROR_HANDLING out of handleException(), put it at a higher level (e.g. under ProcessGroupNCCL::ncclCommWatchdog()), and funnel all exceptions there.
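If we did go that route, the funnel might look roughly like the sketch below. This is only an illustration of the shape, assuming SHOULD_TEAR_DOWN and asyncErrorHandling_ keep their current meaning; it is not existing code.

```cpp
// Hypothetical shape only: apply the TORCH_NCCL_ASYNC_ERROR_HANDLING policy
// at the watchdog's top level instead of inside handleException().
void ProcessGroupNCCL::ncclCommWatchdog() {
  try {
    watchdogHandler();  // existing inner loop
  } catch (const std::exception& e) {
    // Single funnel point: decide here, based on asyncErrorHandling_, whether
    // to tear the process down or just log and leave the error visible to the
    // main thread.
    if (SHOULD_TEAR_DOWN(asyncErrorHandling_)) {
      watchDogException_ = std::make_exception_ptr(
          C10_BUILD_ERROR(DistBackendError, e.what()));
      std::rethrow_exception(watchDogException_);
    } else {
      LOG(ERROR) << "NCCL watchdog caught exception: " << e.what();
    }
  }
}
```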

@eqy (Collaborator, Author) commented May 28, 2024

@pytorchmergebot merge

pytorch-bot added the ciflow/trunk label (trigger trunk jobs on your pull request) on May 28, 2024
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Aidyn-A pushed a commit to tinglvv/pytorch that referenced this pull request May 30, 2024
[NCCL][CUDA] Optionally avoid rethrowing CUDA Errors in NCCL Watchdog (pytorch#126587)


Pull Request resolved: pytorch#126587
Approved by: https://github.com/kwen2501
Labels
ciflow/trunk, Merged, oncall: distributed, open source, release notes: distributed (c10d), triaged

7 participants