Enable TORCH_NCCL_AVOID_RECORD_STREAMS=1 by default #512
Conversation
Does this work, though? From the PyTorch source it would seem that it needs to be set when the process group is created.
As a side note, in relation to pytorch/pytorch#76861 (comment): it would be interesting to understand what happens if we automatically insert …
Thunder doesn't know anything about CUDA streams today.
Fair point.
Setting the environment variable from the command line or in …
Maybe the backward call is not affected by this decorator because it's a separate thread, and setting the env variable in …
Right, and this needs to be set at process group creation, meaning in usual scripts before the …
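A minimal sketch of that constraint (illustrative only, not Thunder code, assuming the usual `torch.distributed` launch via `torchrun`): the variable has to be in the environment before the NCCL process group is constructed, whether it is exported on the command line or set from Python.

```python
# Illustrative sketch only, not Thunder's implementation.
# Roughly equivalent to launching with:
#   TORCH_NCCL_AVOID_RECORD_STREAMS=1 torchrun --nproc_per_node=8 train.py
import os

# Must be in the environment before the NCCL process group is created;
# setdefault keeps any value the user has already exported.
os.environ.setdefault("TORCH_NCCL_AVOID_RECORD_STREAMS", "1")

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # the flag takes effect from here on
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
# ... build the model, wrap it with DDP/FSDP, run training ...
dist.destroy_process_group()
```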
This reverts commit 6b5c101.
This would help fix a lot of performance issues we have been seeing with large models/large batch sizes, where we see memory thrashing. Thanks for working on this Ivan 🚀
Seems like legit CI failures in distributed tests. This seems to be a common thing in these failures: …
It's unfortunate that the error is coming from the PyTorch source code and is not reproducible with my build. I'll fix it.
Thank you @IvanYashchuk
Awesome @IvanYashchuk!
This PR enables the TORCH_NCCL_AVOID_RECORD_STREAMS=1 environment variable by default for Thunder. The change is restricted to Thunder only.
This environment variable is supposed to fix a problem with allocator thrashing when using collectives from the NCCL backend of PyTorch.
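For illustration, enabling it by default could look roughly like the sketch below; the function name and call site are hypothetical, not the actual change in this PR. Using `os.environ.setdefault` keeps the change opt-out, since a value the user has already exported takes precedence.

```python
# Hypothetical sketch; names and call site are not Thunder's actual code.
import os


def _default_nccl_avoid_record_streams() -> None:
    # Only provide the default; keep any value the user already set.
    os.environ.setdefault("TORCH_NCCL_AVOID_RECORD_STREAMS", "1")


# Would need to run in the distributed setup path, before the NCCL
# process group is created, for the flag to take effect.
_default_nccl_avoid_record_streams()
```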
I have tested performance with the command provided by @parthmannan in #420, and this PR gives a ~2.11x performance improvement (1517 ms -> 716 ms).
Fixes #420.
Fixes #477.
cc @parthmannan