After the big core_transformers refactor, the same `torch.distributed` groups that are already created with NCCL are re-created with Gloo. Is there a specific reason for this, or can I safely remove these duplicate groups? Creating the Gloo groups sometimes causes issues on my supercomputing cluster, so it would be desirable to stay on just the specified backend (i.e., NCCL).

Edit: the issue with Gloo group creation was fixed by setting `GLOO_SOCKET_IFNAME=ib0`, which is obviously independent of Megatron-LM. However, I would still like to know why, or whether, using Gloo in addition to NCCL is actually necessary.
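For context, here is a minimal sketch of the pattern I mean, with an illustrative rank list rather than Megatron's actual parallel-state code: each rank set gets a NCCL group for GPU collectives, and the same rank set is then mirrored in a Gloo group.

```python
import torch.distributed as dist

# Illustrative sketch only, not Megatron's actual code. Launch with torchrun
# so that RANK / WORLD_SIZE / MASTER_ADDR are set in the environment.
dist.init_process_group(backend="nccl")

ranks = list(range(dist.get_world_size()))  # illustrative rank set

# A NCCL group is created for GPU collectives...
nccl_group = dist.new_group(ranks, backend="nccl")
# ...and the same rank set is re-created as a Gloo group.
gloo_group = dist.new_group(ranks, backend="gloo")
```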
@janEbert I think the Gloo groups are needed for saving the distributed optimizer: Megatron gathers the distributed optimizer state on CPU before saving. This differs from the earlier behaviour, where Megatron saved the optimizer per process (a sharded optimizer).
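To illustrate the point (a hedged sketch under my assumptions, not Megatron's actual implementation): NCCL collectives only operate on CUDA tensors, so gathering optimizer shards that live on CPU requires a group with a CPU-capable backend such as Gloo.

```python
import torch
import torch.distributed as dist

# Sketch of why a Gloo group is needed for the CPU gather; names and shapes
# are illustrative, not Megatron's actual code. Launch with torchrun.
dist.init_process_group(backend="nccl")      # NCCL stays the main backend
gloo_group = dist.new_group(backend="gloo")  # CPU-capable group for the gather

shard = torch.randn(1024)  # this rank's optimizer shard, already on CPU
if dist.get_rank() == 0:
    gathered = [torch.empty_like(shard) for _ in range(dist.get_world_size())]
    dist.gather(shard, gather_list=gathered, dst=0, group=gloo_group)
    # Rank 0 now holds every shard and can torch.save(...) the full state.
else:
    dist.gather(shard, dst=0, group=gloo_group)
```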