Leak in FIFO queue #1251
We now suspect NCCL_GRAPH_MIXING_SUPPORT (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-graph-mixing-support) is at issue here. We had turned it off to get a significant speedup, but we may be misusing that feature.
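For context, "turning it off" refers to the documented NCCL environment variable. A minimal sketch of the setting being tested here (set before launching the job; the default value 1 re-enables it):

```shell
# Disable NCCL's support for mixing graph-captured and non-graph NCCL calls.
# This was the setting that gave a significant speedup but may be misused here.
export NCCL_GRAPH_MIXING_SUPPORT=0
```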
We actually were still able to reproduce with graph mixing support turned on. Adding a synchronize between usages somehow also doesn't help. We're working on a more minimal reproducer, but it will take some time.
Can you elaborate on the number of graphs, the number of NCCL calls per graph, and the number of non-graph NCCL calls? With that, could you create a minimal reproducer? Also, @ben, can you help them collect an NCCL call trace?
Reproducer (on 2 H100s):
Here's a C++ version (thanks, Claude).
Resolved by this commit (I assume it will be added to master soon): ee3d92b
We are experiencing an issue where 8 processes, each controlling one GPU on a node, all lock up at the same time. It seems to be deterministic, though we don't know exactly which operation is causing trouble. But it looks something like "after N graph executions, all 8 processes stall at the same time".
We're relatively confident that this is a leak, because increasing NCCL_WORK_FIFO_DEPTH seems to increase the number of graphs that can be executed prior to stalling. And decreasing it causes the stall to happen sooner.
Here is the stack trace for where we get stuck:
This is on an older NCCL patch, a8511ca, but we have verified that the same issue is present on master as of yesterday. The issue affects both H100 and A100.
We'll update the issue as we gather more info on the exact operation that is causing trouble.