Distributed: extend MPI functions for arch::Distributed? #3318

Open · 0 comments
Labels: distributed 🕸️ (Our plan for total cluster domination), user interface/experience 💻

glwagner (Member) commented Oct 6, 2023

In all the scripts I found on distributed simulations, we seem to query the MPI state using COMM_WORLD directly.
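
The pattern looks something like this (a hypothetical sketch, standing in for the linked snippets):

    # Queries global MPI state, ignoring arch.communicator
    rank   = MPI.Comm_rank(MPI.COMM_WORLD)
    Nranks = MPI.Comm_size(MPI.COMM_WORLD)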

But Distributed allows the communicator to be set, so arch.communicator can be something other than COMM_WORLD, e.g.:

function Distributed(child_architecture = CPU();
                     topology,
                     ranks,
                     devices = nothing,
                     enable_overlapped_computation = true,
                     communicator = MPI.COMM_WORLD)
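
For concreteness, here is a hypothetical sketch of a script that passes a sub-communicator (the Comm_split arguments and the topology/ranks values are illustrative, not taken from any existing script):

    using MPI
    using Oceananigans

    MPI.Init()

    # Split COMM_WORLD in two; each half gets its own communicator.
    color    = MPI.Comm_rank(MPI.COMM_WORLD) % 2
    sub_comm = MPI.Comm_split(MPI.COMM_WORLD, color, 0)

    # arch.communicator is now sub_comm, not MPI.COMM_WORLD, so any code
    # that hard-codes COMM_WORLD reports the wrong rank and size.
    arch = Distributed(CPU();
                       topology = (Periodic, Periodic, Flat),
                       ranks = (2, 1, 1),
                       communicator = sub_comm)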

Presumably we should never write scripts that make assumptions about the communicator (and we should strongly discourage users from doing so). So, first of all, the validation scripts must be changed.

Secondly, I think something that could encourage clean, good practice would be to extend functions like Comm_rank to arch::Distributed, and possibly to the grid as well, so that we can call:

rank = MPI.Comm_rank(arch)
rank = MPI.Comm_rank(grid)
Nranks = MPI.Comm_size(arch)

We'd have the option of either throwing an error for non-distributed architectures, or returning sensible fallbacks like Comm_rank(arch) = 0 and Comm_size(arch) = 1; a sketch of the latter follows.
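
A minimal sketch of the fallback approach (assuming Oceananigans' AbstractArchitecture, AbstractGrid, and the architecture(grid) accessor; exact module paths may differ):

    import MPI
    using Oceananigans.Architectures: AbstractArchitecture, architecture
    using Oceananigans.Grids: AbstractGrid

    MPI.Comm_rank(arch::Distributed) = MPI.Comm_rank(arch.communicator)
    MPI.Comm_size(arch::Distributed) = MPI.Comm_size(arch.communicator)

    # Fallbacks so serial scripts run unchanged:
    MPI.Comm_rank(::AbstractArchitecture) = 0
    MPI.Comm_size(::AbstractArchitecture) = 1

    # Grid methods forward to the grid's architecture:
    MPI.Comm_rank(grid::AbstractGrid) = MPI.Comm_rank(architecture(grid))
    MPI.Comm_size(grid::AbstractGrid) = MPI.Comm_size(architecture(grid))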

Alternatively, we could develop our own API, since the MPI-based one is rather irregular and abbreviated, and thus difficult for newcomers to understand (I've been in this boat...).

For example, perhaps simply Base.size and rank would suffice:

Base.size(arch::Distributed) = MPI.Comm_size(arch.communicator)
Base.size(arch::AbstractArchitecture) = 1
rank(arch::Distributed) = MPI.Comm_rank(arch.communicator)
rank(arch::AbstractArchitecture) = 0
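
Usage in a script would then be architecture-agnostic (a sketch):

    Nranks = size(arch)   # 1 on a serial architecture
    r      = rank(arch)   # 0 on a serial architecture
    @info "Running on rank $r of $Nranks"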