
Support alternative distributed communication #353

@rblake-llnl

Description

Is your feature request related to a problem? Please describe.
I am working on a cluster that forbids opening unauthenticated sockets on compute nodes, i.e. sockets that anyone can connect to. This means I can't use the 'env://' initialization method for torch.distributed. Furthermore, I may want to train on systems that do not have CUDA, so I would like the option of using an MPI backend for torch.distributed.
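For context, torch.distributed already supports a shared-filesystem rendezvous that avoids listening sockets entirely. A minimal sketch using the 'file://' init method; the path, rank, and world size here are placeholders that would normally come from the job launcher:

```python
import torch.distributed as dist

# Rendezvous through a file on a shared filesystem instead of a TCP socket.
# /mnt/shared/ddp_init is a placeholder path that must be visible to all ranks.
dist.init_process_group(
    backend="gloo",                             # CPU-friendly; "nccl" requires CUDA
    init_method="file:///mnt/shared/ddp_init",  # no sockets opened for rendezvous
    world_size=4,                               # total number of processes (placeholder)
    rank=0,                                     # this process's rank (placeholder)
)
```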

Describe the solution you'd like

  • An option to choose the backend ('nccl', 'gloo', or 'mpi') that torch.distributed uses (see the MPI sketch after this list).
  • An option to change the default bootstrapping, e.g. to support shared-filesystem initialization.
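For illustration, selecting the MPI backend is a one-liner against torch.distributed, provided PyTorch was compiled with MPI support (the stock wheels are not); rank and world size are then inferred from the MPI launcher:

```python
import torch.distributed as dist

# Requires a PyTorch build compiled against an MPI implementation.
# Rank and world size are read from the MPI environment set up by mpirun,
# so they can be omitted here.
dist.init_process_group(backend="mpi")
```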

Describe alternatives you've considered

  • I've debated coding everything up myself using Horovod, but I don't like how it locks me into MPI and won't let me use NCCL afterwards. Using Horovod also breaks pytorch-lightning's abstraction of handling such parallelism for you, and conflicts with pytorch-lightning's default of using the 'dp' backend when possible.

Labels: feature (Is an improvement or enhancement), help wanted (Open to be worked on), won't fix (This will not be worked on)
