[QUESTION] Creation of Gloo groups #435

Closed
janEbert opened this issue Jul 27, 2023 · 2 comments
janEbert (Contributor) commented Jul 27, 2023

After the big core_transformers refactor, the same torch.distributed groups that NCCL already creates are re-created with the Gloo backend. Is there a specific reason for this, or can I safely remove these duplicate groups? The creation of the Gloo groups sometimes causes issues on my supercomputing cluster, so it would be desirable to stay on just the specified backend (i.e. NCCL). Edit: the issue with Gloo group creation was fixed by setting GLOO_SOCKET_IFNAME=ib0, which is obviously independent of Megatron-LM.

However, I would still like to know why or whether using Gloo in addition to NCCL is actually necessary.
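For context, the duplicated group creation in question boils down to roughly the following pattern; the ranks and group handles here are illustrative, not Megatron-LM's actual code:

```python
import torch.distributed as dist

# Default process group on the requested backend (NCCL for GPU training).
# Assumes env:// initialization, e.g. launched via torchrun.
dist.init_process_group(backend="nccl")

world_size = dist.get_world_size()
ranks = list(range(world_size))

# The NCCL group used for the actual GPU collectives ...
nccl_group = dist.new_group(ranks=ranks, backend="nccl")

# ... and a second group over the same ranks on the Gloo (CPU) backend.
# Gloo communicates over TCP sockets, which is why GLOO_SOCKET_IFNAME=ib0
# was needed on a cluster whose default network interface is not routable.
gloo_group = dist.new_group(ranks=ranks, backend="gloo")
```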

mayank31398 commented

@janEbert I think the Gloo groups are needed for saving the distributed optimizer: Megatron gathers the distributed optimizer state on CPU before saving. This differs from the earlier behaviour, where Megatron saved the optimizer state per process (a sharded optimizer).
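As a rough illustration of what a Gloo group enables here, the gather can happen entirely on CPU before a single rank writes the checkpoint. This is only a sketch under stated assumptions; the function and variable names are illustrative and the real Megatron checkpointing code is more involved:

```python
import torch
import torch.distributed as dist

def save_gathered_optimizer_state(local_shard, gloo_group, path):
    """Gather per-rank optimizer shards onto rank 0 via a Gloo (CPU) group
    and write a single checkpoint file there.

    Assumes `local_shard` is a picklable object (e.g. a dict of CPU tensors)
    holding this rank's portion of the optimizer state, and that `gloo_group`
    spans all ranks, so group rank == global rank.
    """
    rank = dist.get_rank(group=gloo_group)
    world_size = dist.get_world_size(group=gloo_group)

    # gather_object handles arbitrary picklable objects; on the Gloo backend
    # the communication stays on the CPU, so there is no GPU memory spike
    # while checkpointing.
    gathered = [None] * world_size if rank == 0 else None
    dist.gather_object(local_shard, gathered, dst=0, group=gloo_group)

    if rank == 0:
        torch.save(gathered, path)
```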

janEbert (Contributor, Author) commented

Ahh, that makes sense; I was wondering why CPU communication was suddenly required. Thank you!
