Closed
Labels
feature (is an improvement or enhancement), help wanted (open to be worked on)
Description
When using DDP, the user must set MASTER_PORT manually. We should set one automatically instead.
The problem is that we can't pick a port at random, because each process might pick a different one. Instead, I propose using the SLURM job ID as the seed for generating k candidate ports. Every process can then deterministically generate the same sequence of ports. With that list, the root node can initialize the NCCL connection, working its way down the list until it finds an open port.
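A minimal sketch of the idea above, not an implementation from the codebase. The function names `candidate_ports` and `first_free_port`, the port range 10000-19999, and the default `k=10` are all hypothetical choices made for illustration; it also assumes the job ID is a plain integer string, as read from the `SLURM_JOB_ID` environment variable.

```python
import random
import socket


def candidate_ports(job_id, k=10, low=10000, high=20000):
    """Return k distinct ports derived deterministically from the job ID.

    Every process seeding with the same job ID gets the same list.
    The range [low, high) is an assumption, not a requirement.
    """
    rng = random.Random(int(job_id))  # private RNG, leaves global state alone
    return rng.sample(range(low, high), k)


def first_free_port(ports, host=""):
    """Return the first port in `ports` that can be bound on this machine."""
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is taken, try the next candidate
            return port
    raise RuntimeError("no free port among the candidates")
```

On the root node this could be used as `os.environ["MASTER_PORT"] = str(first_free_port(candidate_ports(os.environ["SLURM_JOB_ID"])))`. The sketch only covers the root-side probing; how the other processes learn which candidate the root settled on is left open here.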
However, if we choose carefully how many digits of the job ID to use, we may avoid collisions altogether and not need to iterate over a list of candidate ports.
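The simpler variant could look roughly like this. It is a sketch under assumptions: the function name `port_from_job_id`, the choice of the last four digits, and the base offset of 10000 are all hypothetical, and the job ID is assumed to be purely numeric.

```python
def port_from_job_id(job_id, num_digits=4, base=10000):
    """Map the trailing digits of the job ID to a port.

    With num_digits=4 and base=10000 the result falls in 10000-19999.
    Two jobs collide only if their trailing digits match.
    """
    digits = str(job_id)[-num_digits:]
    return base + int(digits)


print(port_from_job_id("2745318"))  # prints 15318
```

Using the trailing digits rather than the leading ones means that jobs submitted close together, whose IDs typically differ only at the end, map to different ports.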