
Tweak Distributed API #3314

Merged · 26 commits merged into main, Oct 10, 2023
Conversation

glwagner (Member) commented Oct 5, 2023

This PR changes the API for Distributed. The changes are:

  • Removal of the kwarg enable_overlapping_computation, which doesn't belong in architectures (architectures are model-agnostic). We can add that feature to specific models if desired.
  • Automatic MPI initialization when MPI.Initialized() is false
  • Implementation of default ranks (by default we distribute in x)

Perhaps more to come.

I'm also going to clean up the examples, since they are currently the only source of information about running distributed simulations.
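For illustration, a rough sketch of what these changes enable; the no-argument Distributed(CPU()) defaults shown here follow the PR description and are assumptions, not tested source:

using Oceananigans
using Oceananigans.DistributedComputations

# Building the architecture initializes MPI if MPI.Initialized() is false,
# and ranks default to a one-dimensional decomposition in x, so no explicit
# MPI.Init() call or ranks kwarg should be needed.
arch = Distributed(CPU())

# The rest of the script reads like a serial setup with the architecture swapped in.
grid = RectilinearGrid(arch, size = (64, 64, 64), extent = (1, 1, 1))
model = NonhydrostaticModel(; grid)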

glwagner (Member Author) commented Oct 5, 2023

Are there any tests for distributed hydrostatic models?

simone-silvestri (Collaborator)

yeah in test/test_distributed_models.jl

glwagner (Member Author) commented Oct 5, 2023

@simone-silvestri DistributedComputations is imported before models here:

include("DistributedComputations/DistributedComputations.jl")
include("TimeSteppers/TimeSteppers.jl")
include("Models/Models.jl")

What was "enable overlapping communication" intended to support? Is there something that depends on whether isnothing(mpi_requests)? Perhaps a more direct specification of this feature would help.

glwagner (Member Author) commented Oct 5, 2023

> yeah in test/test_distributed_models.jl

There are no tests for HydrostaticFreeSurfaceModel there currently, just NonhydrostaticModel.

Is there another file with tests for HydrostaticFreeSurfaceModel?

glwagner (Member Author) commented Oct 5, 2023

How important is it to call MPI.Finalize? What happens if we don't do it? If it's critical, how do we ensure this happens? (This seems like a bad design...)

glwagner (Member Author) commented Oct 5, 2023

I think users should put the global grid size into the grid constructor, not the local size. If the partitioning is irregular, this should go into the architecture. The reason we need this is to support a "one line" change from local to distributed computations. How do we implement this?

global_size = map(sum, concatenate_local_sizes(size, arch))
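For context, a hypothetical sketch (not Oceananigans source) of how a global extent can be split into per-rank local sizes that sum back to the global size, which is the relationship map(sum, concatenate_local_sizes(size, arch)) relies on:

function local_x_size(Nx, Rx, rank)
    # Hypothetical helper: split Nx points over Rx ranks, giving the first
    # (Nx % Rx) ranks one extra point so the local sizes sum back to Nx.
    q, r = divrem(Nx, Rx)
    return rank < r ? q + 1 : q
end

local_x_size.(10, 4, 0:3)       # [3, 3, 2, 2]
sum(local_x_size.(10, 4, 0:3))  # 10: the global size is recovered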

glwagner (Member Author) commented Oct 6, 2023

Ok, I found in the docs for MPI.jl that

> Calling MPI.Finalize() at the end of the program is optional, as it will be called automatically when Julia exits.

This is good news. So we can call MPI.Init() when we build the architecture, and avoid MPI.Finalize(). This brings us one step closer to the "one line to distributed simulations" goal.

https://github.com/JuliaParallel/MPI.jl
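A minimal sketch of the lazy-initialization pattern this allows; ensure_mpi_initialized is a hypothetical helper, not the PR's actual constructor code:

using MPI

# Hypothetical helper to call when building the architecture, so users never
# run MPI.Init() themselves. MPI.Finalize() can be skipped entirely because
# MPI.jl calls it automatically when Julia exits.
function ensure_mpi_initialized()
    MPI.Initialized() || MPI.Init()
    return MPI.COMM_WORLD
end

comm = ensure_mpi_initialized()
rank = MPI.Comm_rank(comm)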

glwagner (Member Author) commented Oct 6, 2023

Hopefully we can close #3318 here too, let's discuss.

Partition(Rx::Number, Ry::Number=1, Rz::Number=1) = Partition{Rx, Ry, Rz}()
Partition(lengths::Array{Int, 1}) = Partition(reshape(lengths, length(lengths), 1))
Base.size(::Partition{<:Any, Rx, Ry, Rz}) where {Rx, Ry, Rz} = (Rx, Ry, Rz)

Collaborator

Suggested change
Partition(; Rx = 1, Ry = 1, Rz = 1) = Partition(Rx, Ry, Rz)
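For illustration, with the Partition definitions quoted above plus this suggested keyword constructor, a four-rank decomposition in x could be written either way (a usage sketch, not source code):

p_positional = Partition(4)         # Rx = 4, Ry = Rz = 1 by default
p_keyword    = Partition(; Rx = 4)  # same decomposition via the suggested keyword form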

simone-silvestri (Collaborator)

return RectilinearGrid(arch, eltype(global_grid); size=local_size, x=x, y=y, z=z, halo=halo, topology=topo)

here we probably need something like

function scatter_local_grids(arch::Distributed, global_grid::RectilinearGrid, local_size)
    x, y, z, topo, halo = scatter_grid_properties(global_grid)
    # Reconstruct the global size from the per-rank local sizes so the constructor
    # receives global dimensions rather than this rank's local ones.
    global_size = map(sum, concatenate_local_sizes(local_size, arch))
    return RectilinearGrid(arch, eltype(global_grid); size=global_size, x=x, y=y, z=z, halo=halo, topology=topo)
end

The same applies to LatitudeLongitudeGrid; a sketch follows.
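A rough sketch of the analogous method; the property names returned by scatter_grid_properties for a LatitudeLongitudeGrid (longitude, latitude, z, topology, halo) are assumptions for illustration and not verified against the source:

function scatter_local_grids(arch::Distributed, global_grid::LatitudeLongitudeGrid, local_size)
    # Assumed property order for a lat-lon grid; radius and other metadata omitted for brevity.
    λ, φ, z, topo, halo = scatter_grid_properties(global_grid)
    global_size = map(sum, concatenate_local_sizes(local_size, arch))
    return LatitudeLongitudeGrid(arch, eltype(global_grid); size=global_size,
                                 longitude=λ, latitude=φ, z=z, halo=halo, topology=topo)
end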

glwagner and others added 2 commits October 6, 2023 13:13
…ustment.jl

Co-authored-by: Simone Silvestri <silvestri.simone0@gmail.com>
…ustment.jl

Co-authored-by: Simone Silvestri <silvestri.simone0@gmail.com>
glwagner (Member Author) commented Oct 7, 2023

I don’t think we should fix the untested features in this PR. We need to merge this ASAP so we can open a new PR with tests.

glwagner (Member Author) commented Oct 7, 2023

@simone-silvestri can you fix the failing test? As soon as tests pass, please merge. We need another PR that implements tests following this one.

Comment on lines 57 to 58
return nothing
end
Collaborator

Suggested change
return nothing
end
return nothing
end

@@ -84,7 +99,7 @@ function LatitudeLongitudeGrid(arch::Distributed,
radius = R_Earth,
halo = (1, 1, 1))

global_size = map(sum, concatenate_local_sizes(size, arch))
# global_size = map(sum, concatenate_local_sizes(size, arch))
Collaborator

why comment out? delete?

@@ -218,7 +222,7 @@ include("Biogeochemistry.jl")

# TODO: move above
include("ImmersedBoundaries/ImmersedBoundaries.jl")
include("DistributedComputations/DistributedComputations.jl")
# include("DistributedComputations/DistributedComputations.jl")
Collaborator

why comment out?

@@ -483,8 +485,9 @@ end
@testset "Time stepping ShallowWaterModel" begin
for child_arch in archs
topo = (Periodic, Periodic, Flat)
arch = Distributed(child_arch; ranks=(1, 4, 1), topology = topo, devices = (0, 0, 0, 0))
grid = RectilinearGrid(arch, topology=topo, size=(8, 2), extent=(1, 2), halo=(3, 3))
#arch = Distributed(child_arch; ranks=(1, 4, 1), topology = topo, devices = (0, 0, 0, 0))
Collaborator

delete the comment?


MPI.Barrier(MPI.COMM_WORLD)
# ?
Collaborator

# ? ?

simone-silvestri merged commit c823cb9 into main on Oct 10, 2023
47 checks passed
simone-silvestri deleted the glw/refactor-distributed branch on October 10, 2023 at 18:26