Support local_rank computation with MPI<3 #17

Open
jacobhinkle opened this issue Apr 11, 2019 · 0 comments
Is your feature request related to a problem? Please describe.
Trying to use an MPI installation older than version 3 with, for example, `lagomorph lddmm atlas` currently results in an error, since we can't properly determine the local rank.

Describe the solution you'd like
The local rank should be determined in a uniform way regardless of MPI version. We should first try the method used now, which is what horovod uses, and fall back to a naive hostname-based method if that import fails.
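
For context, the method used now appears to rely on MPI-3's `MPI_Comm_split_type` (this matches the MPI>=3 requirement noted in the commit below). A minimal mpi4py sketch of that approach; the function name here is illustrative, not lagomorph's actual API:

```python
# Sketch of the current MPI>=3 approach: split COMM_WORLD into
# per-node communicators with MPI_Comm_split_type. Requires an MPI
# implementation (and mpi4py build) that supports MPI-3.
from mpi4py import MPI

def local_comm_mpi3(comm=MPI.COMM_WORLD):  # illustrative name
    # COMM_TYPE_SHARED groups ranks that can share memory,
    # i.e. the ranks running on the same node
    return comm.Split_type(MPI.COMM_TYPE_SHARED)

local_rank = local_comm_mpi3().Get_rank()
```

With an MPI<3 build, `MPI.COMM_TYPE_SHARED`/`Split_type` is unavailable, which is presumably where the current failure comes from and why a fallback is needed.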

Describe alternatives you've considered
Previously we did not need to compute the local rank because we accepted it as a command line argument. However, this is a bit cumbersome, and switching to computing it means the calling convention for lagomorph (which uses pytorch.distributed) will match horovod's.

Additional context
The following Stack Overflow answer outlines the basic method we need to fall back to:
https://stackoverflow.com/a/31792540
The steps required are:

  • On each rank, get the processor name or hostname
  • Perform an allgather to collect all of the node names
  • Sort the unique node names alphabetically
  • Find the integer index of this rank's hostname in that sorted list
  • Use `MPI_Comm_split` with the integer index found in the last step as the "color"

This can all be done inside lagomorph.utils.mpi_local_comm; a sketch is given below.
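
A minimal mpi4py sketch of that fallback, following the Stack Overflow recipe step by step. The function name mirrors `lagomorph.utils.mpi_local_comm`, but this is an illustrative sketch rather than the actual implementation:

```python
# Hostname-based fallback for computing the local communicator,
# usable with any MPI version (no MPI-3 Split_type required).
from mpi4py import MPI

def mpi_local_comm(comm=MPI.COMM_WORLD):  # illustrative sketch
    # 1. On each rank, get the processor name (hostname)
    name = MPI.Get_processor_name()
    # 2. Allgather to collect every rank's node name
    all_names = comm.allgather(name)
    # 3. Sort the unique node names alphabetically
    unique_names = sorted(set(all_names))
    # 4. Find the integer index of this rank's hostname
    color = unique_names.index(name)
    # 5. Split with that index as the "color"; ranks sharing a color
    #    (i.e. a node) end up in the same local communicator
    return comm.Split(color, comm.Get_rank())

local_rank = mpi_local_comm().Get_rank()
```

Since the split is keyed by global rank, `Get_rank()` on the resulting communicator should give the same local rank ordering as the MPI-3 `Split_type` method.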
jacobhinkle added a commit that referenced this issue Apr 11, 2019
This unifies our approach to parallelism. Any command line tool will
parse the command line and MPI environment in a uniform way. On Summit,
this corresponds to calling jsrun with `jsrun -n<N> -a6 -g6` just as is
expected by horovod. We need MPI>=3 in order to find local rank using
the method implemented here. In the future, we need a fallback to remove
this requirement (see Issue #17).