After the big core_transformers refactor, the same `torch.distributed` groups that are already created with NCCL are re-created with Gloo. Is there a specific reason for this, or can I safely remove these duplicate groups? Creating the Gloo groups sometimes causes issues on my supercomputing cluster, so it would be desirable to stay on just the specified backend (i.e., NCCL).

Edit: the issue with Gloo group creation was fixed by setting `GLOO_SOCKET_IFNAME=ib0`, which is obviously independent of Megatron-LM. However, I would still like to know why, or whether, using Gloo in addition to NCCL is actually necessary.
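For context, here is a minimal sketch of the pattern I mean, with an illustrative rank list rather than Megatron's actual parallel-state code: each rank set gets a NCCL group for GPU collectives, and the same rank set is then mirrored in a Gloo group.

```python
import torch.distributed as dist

# Illustrative sketch only, not Megatron's actual code. Launch with torchrun
# so that RANK / WORLD_SIZE / MASTER_ADDR are set in the environment.
dist.init_process_group(backend="nccl")

ranks = list(range(dist.get_world_size()))  # illustrative rank set

# A NCCL group is created for GPU collectives...
nccl_group = dist.new_group(ranks, backend="nccl")
# ...and the same rank set is re-created as a Gloo group.
gloo_group = dist.new_group(ranks, backend="gloo")
```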
@janEbert I think the Gloo groups are needed for saving the distributed optimizer: Megatron gathers the distributed optimizer state on CPU before saving. This differs from the earlier behaviour, where Megatron saved the optimizer per process (a sharded optimizer).
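To illustrate the point (a hedged sketch under my assumptions, not Megatron's actual implementation): NCCL collectives only operate on CUDA tensors, so gathering optimizer shards that live on CPU requires a group with a CPU-capable backend such as Gloo.

```python
import torch
import torch.distributed as dist

# Sketch of why a Gloo group is needed for the CPU gather; names and shapes
# are illustrative, not Megatron's actual code. Launch with torchrun.
dist.init_process_group(backend="nccl")      # NCCL stays the main backend
gloo_group = dist.new_group(backend="gloo")  # CPU-capable group for the gather

shard = torch.randn(1024)  # this rank's optimizer shard, already on CPU
if dist.get_rank() == 0:
    gathered = [torch.empty_like(shard) for _ in range(dist.get_world_size())]
    dist.gather(shard, gather_list=gathered, dst=0, group=gloo_group)
    # Rank 0 now holds every shard and can torch.save(...) the full state.
else:
    dist.gather(shard, dst=0, group=gloo_group)
```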