set MASTER_PORT automatically #234

@williamFalcon

When using DDP, the user must set MASTER_PORT manually. However, we should set one automatically.

The problem is that we can't pick a random port, since each process might generate a different one. Instead, I propose using the SLURM job ID as the seed for generating k candidate ports. Every process can then deterministically generate the same sequence of ports. With that list, the root node can initialize the NCCL connection by working its way down the list until it finds an open port.
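A minimal sketch of the seeded-list idea, not an actual implementation from this project. The function names, the port range, and the default `k` are all assumptions chosen for illustration; only `SLURM_JOB_ID` and `MASTER_PORT` come from the proposal above.

```python
import os
import random
import socket


def candidate_ports(job_id, k=5, low=10000, high=20000):
    """Return k ports derived deterministically from the job ID.

    Every process that passes the same job_id gets the same list,
    because the generator is seeded with the job ID only.
    The range [low, high) is an assumed, arbitrary choice.
    """
    rng = random.Random(int(job_id))
    return [rng.randrange(low, high) for _ in range(k)]


def first_open_port(ports, host=""):
    """Return the first port in the list that can currently be bound."""
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is taken, try the next candidate
            return port
    raise RuntimeError("none of the candidate ports are free")


if __name__ == "__main__":
    # Falls back to "0" when not running under SLURM (assumption).
    job_id = os.environ.get("SLURM_JOB_ID", "0")
    ports = candidate_ports(job_id)
    os.environ["MASTER_PORT"] = str(first_open_port(ports))
```

Note that only the root node can check which port is free on its own host, so the other processes would still need to try the same list in the same order when connecting.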

However, if we choose how many digits of the job ID to use carefully, we may avoid collisions altogether and won't need to iterate over a list of candidate ports.
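The simpler variant could look roughly like the sketch below, which maps the trailing digits of the job ID directly to a port. The helper name, the base offset of 15000, and the choice of four digits are assumptions for illustration; collisions remain possible if two concurrent jobs share the same trailing digits or another service already holds the port.

```python
import os


def port_from_job_id(job_id, digits=4, base=15000):
    """Map the last `digits` digits of the job ID to a port number.

    With digits=4 and base=15000 (both assumed values) the result
    falls in 15000..24999, well inside the valid port range.
    """
    suffix = str(job_id)[-digits:]
    return base + int(suffix)


if __name__ == "__main__":
    # Falls back to "0" when not running under SLURM (assumption).
    job_id = os.environ.get("SLURM_JOB_ID", "0")
    os.environ["MASTER_PORT"] = str(port_from_job_id(job_id))
```

Because the mapping involves no randomness, every process in the job computes the same value without any communication.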

Labels: feature, help wanted
