(0.88.0) MPI communication and computation overlap in the HydrostaticFreeSurfaceModel and NonhydrostaticModel #3125
Conversation
definitely bump a release (minor or patch, whichever is more appropriate) before this is merged :) |
Perhaps we wait for JuliaGPU/KernelAbstractions.jl#399 to be merged and tag a patch release for KA before we merge this? @vchuravy? |
Sure! Anyways, this PR should be in its final form, so ready for review :) (will still wait for KA changes before merging) |
So, it looks like the docs take longer to build in this PR compared to main (4:00 hrs vs 3:40 hrs). Probably good to benchmark a bit. I'll try some benchmarking on GPU and CPU |
benchmarking sounds good! but I thought this PR only made changes to the distributed grids, no? |
It also rearranges how time-stepping works to allow overlapping communication and computation, and removes the `views` halo passing |
just a few beautification remarks
oh dear, my PR created merge conflicts...... |
@simone-silvestri I think I resolved the conflicts alright, but it wouldn't hurt if you have a look at b51e681... |
Merged via squash commit: MPI communication and computation overlap in the `HydrostaticFreeSurfaceModel` and `NonhydrostaticModel` (#3125) |
do we use |
The top comment says this PR depends on another KA PR that isn't merged. However, this PR is merged (and tagged). Does it no longer depend on JuliaGPU/KernelAbstractions.jl#399? |
Yeap, it doesn't. I edited the first post ;) |
How do you guys handle MPI tracing? (A la MPE, for instance...) |
@PetrKryslUCSD I think we use `nsys` |
Using nsys it is possible to trace MPI with `--trace=mpi` |
Yep, we used nsys because our primary objective is to trace GPU execution. I think it will also work on CPU programs. There is nothing really specific about profiling Julia with nsys, provided that MPI is correctly configured (i.e., your script already works with MPI). An example of a batch script that traces MPI calls is

```bash
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=16
#SBATCH --mem=500GB
#SBATCH --time 24:00:00
#SBATCH --gres=gpus:4

cat > launch.sh << EoF_s
#! /bin/sh
export CUDA_VISIBLE_DEVICES=0,1,2,3
exec \$*
EoF_s
chmod +x launch.sh

srun nsys profile --trace=nvtx,cuda,mpi --output=report_%q{SLURM_PROCID} ./launch.sh julia --check-bounds=no --project scaling_experiments.jl
```

Here, nsys will produce one report per processor. You can use mpirun or mpiexec instead of srun. If you want to insert NVTX annotations inside the code you need to set the environment variable (ref https://github.com/JuliaGPU/NVTX.jl)

```bash
export JULIA_NVTX_CALLBACKS=gc
```
|
Very cool. Many thanks! |
Overlapping MPI communication and computation in the HydrostaticFreeSurfaceModel and the NonhydrostaticModel.
In particular, this PR introduces two keyword arguments for the `fill_halo_regions!` function, active only in the case of distributed halo-passing boundary conditions:

- an `async::Bool` keyword argument that allows launching MPI operations without waiting for the communication to complete.
- an `only_local_halos::Bool` keyword argument, which fills only the halos in the case of a local (i.e., Flux, Value, Gradient, Periodic, and, temporarily, MultiRegionCommunication) boundary condition. This is required for having explicit boundary conditions (like Value or Flux) for turbulent diffusivities (we directly calculate diffusivities in the halos in the case of distributed boundary conditions).

This PR allows hiding the MPI passing of barotropic auxiliary variables behind the implicit vertical solver, and of prognostic variables behind the tendency calculations. The latter is done by splitting the tendency kernels into an interior kernel that calculates tendencies between, e.g., `i = Hx` and `i = Nx - Hx`, and a boundary kernel, executed once communication is complete, that calculates tendencies adjacent to the boundaries. Before computing tendencies near the boundaries, boundary-adjacent auxiliary diagnostic variables are recalculated (hydrostatic pressure, vertical velocity, and diffusivities for the hydrostatic model; hydrostatic pressure and diffusivities for the non-hydrostatic model). A self-contained sketch of this pattern is given below.
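To make the interior/boundary split concrete, here is a self-contained 1D toy of the overlap pattern in plain MPI.jl (one halo point per side, a diffusion-like stencil). All names and the stencil here are illustrative, not Oceananigans' actual implementation:

```julia
using MPI

# Toy overlap pattern: post halo exchanges, compute the interior while messages
# are in flight, then compute boundary-adjacent points once communication is done.
function overlapped_step!(c, dcdt, comm)
    rank  = MPI.Comm_rank(comm)
    nproc = MPI.Comm_size(comm)
    left  = mod(rank - 1, nproc)   # periodic neighbors
    right = mod(rank + 1, nproc)
    N = length(c) - 2              # c[2:N+1] is owned; c[1] and c[N+2] are halos

    # 1. Launch communication without waiting (the `async` idea).
    reqs = [MPI.Isend(view(c, 2:2),      comm; dest = left,    tag = 0),
            MPI.Isend(view(c, N+1:N+1),  comm; dest = right,   tag = 1),
            MPI.Irecv!(view(c, N+2:N+2), comm; source = right, tag = 0),
            MPI.Irecv!(view(c, 1:1),     comm; source = left,  tag = 1)]

    # 2. Interior tendencies need no halo data, so they hide the communication.
    for i in 3:N
        dcdt[i] = c[i+1] - 2c[i] + c[i-1]
    end

    # 3. Wait for the halos, then fill in the boundary-adjacent tendencies.
    MPI.Waitall(reqs)
    for i in (2, N + 1)
        dcdt[i] = c[i+1] - 2c[i] + c[i-1]
    end
    return dcdt
end

MPI.Init()
c = rand(10)
dcdt = zero(c)
overlapped_step!(c, dcdt, MPI.COMM_WORLD)
MPI.Finalize()
```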
Also, this PR introduces `KernelParameters(size, offset)`, to be passed to the `launch!` function to start a kernel of size `size::Tuple` offset by `offset::Tuple`.
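The effect of an offset kernel window can be sketched with plain KernelAbstractions (a hedged sketch: inside Oceananigans, `launch!` and `KernelParameters` wrap this machinery, and the same offsetting is what allows kernels such as `update_hydrostatic_pressure!` to be enlarged into the ghost points):

```julia
using KernelAbstractions

# Write into a window of `a` shifted by `offset` — the effect that
# KernelParameters(size, offset) is designed to produce when passed to launch!.
@kernel function _fill_window!(a, offset)
    i, j = @index(Global, NTuple)
    a[i + offset[1], j + offset[2]] = 1
end

a = zeros(8, 8)
kernel! = _fill_window!(CPU(), (4, 4))
kernel!(a, (2, 2); ndrange = (4, 4))   # size (4, 4), offset (2, 2): fills a[3:6, 3:6]
KernelAbstractions.synchronize(CPU())
```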
todo:

- offsets in `src/Utils/kernel_launching.jl` (to remove when offsets are implemented upstream)
- MPI tags (currently limited to `Nranks < 100`)
- remove `views` halo passing

API changes
- In `RectilinearGrid(arch::DistributedArch, size = (Nx, Ny, Nz), ...)`, `(Nx, Ny, Nz)` are the per-rank local sizes, not the global size to be divided (an easy way to specify non-uniform partitioning; see `validation/distributed/mpi_geostrophic_adjustment.jl`)
- Added the `enable_overlapped_communication` keyword to `DistributedArch` (defaults to `true`)
- Removed the `use_buffers` keyword from `DistributedArch` (we always use buffers, as `views` did not give a significant enough speedup to justify maintaining two implementations)
- Added `active_cells_map::Bool = false` to `ImmersedBoundaryGrid` (e.g., `ImmersedBoundaryGrid(grid, ib, active_cells_map = true)`)
- Added a `required_halo_size::Int` keyword argument to `ScalarDiffusivity` (defaults to 1) and `ScalarBiharmonicDiffusivity` (defaults to 2), to be specified by the user, which sets the required halo size for the specific `ν` or `κ` function (closures now have an explicitly required number of halos); a hedged usage sketch of these API changes follows
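A hedged usage sketch tying the API changes above together. Only the keywords named in this list come from the PR; the remaining arguments (`ranks`, `extent`, `GridFittedBottom`, and the toy `bottom` and `my_ν` functions) are assumptions and may differ from the actual constructors:

```julia
using Oceananigans

# Distributed architecture with overlapped communication
# (`ranks` is an assumed partitioning argument).
arch = DistributedArch(CPU(); ranks = (2, 2, 1), enable_overlapped_communication = true)

# `size` is the per-rank *local* size: with a (2, 2, 1) rank layout this
# builds a 64 × 64 × 16 global grid.
grid = RectilinearGrid(arch; size = (32, 32, 16), extent = (1, 1, 1))

# Immersed boundary grid with the active-cells map enabled.
bottom(x, y) = -0.5                     # toy bathymetry
ibg = ImmersedBoundaryGrid(grid, GridFittedBottom(bottom); active_cells_map = true)

# Closure with a user-specified required halo size for a custom ν function.
my_ν(x, y, z, t) = 1e-4                 # toy viscosity function
closure = ScalarDiffusivity(ν = my_ν, required_halo_size = 1)
```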
Major internals change
The tendencies are calculated at the end of a time step. Therefore, at the end of a simulation, `model.timestepper` will hold the tendencies from the last completed time step.

Removed `fill_halo_regions!` for the hydrostatic pressure in both the non-hydrostatic and the hydrostatic model, and for the w-velocity in the hydrostatic model. The halos are filled by enlarging the kernels in `update_hydrostatic_pressure!` and `compute_w_from_continuity!` to incorporate the needed ghost points.

Removed `fill_halo_regions!` for diffusivities (only for halo-passing boundary conditions); the halo computation is now performed by launching the `calculate_diffusivity!` kernel inside the ghost nodes before recomputing the tendencies. This requires knowing how many halos each closure requires.

Added a required-halo parameter to `AbstractTurbulenceClosure`. This means that each parameterization has to specify explicitly the number of halos required to compute its diffusivity: the Leith closure, for example, requires 2 halos (one for the vorticity calculation and an additional one for the vorticity derivative). A minimal sketch of the idea is given below.
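A minimal, runnable sketch of how a closure can carry its required halo size in the type domain (illustrative only; Oceananigans' actual `AbstractTurbulenceClosure` and closure types differ):

```julia
# Toy version of the idea: the closure's required halo size is a type parameter,
# so kernels can query how many ghost points the diffusivity computation needs.
abstract type AbstractTurbulenceClosure{RequiredHalo} end

required_halo_size(::AbstractTurbulenceClosure{H}) where H = H

# A Smagorinsky-like closure needs 1 halo (first derivatives of the velocity field)...
struct SmagorinskyLike <: AbstractTurbulenceClosure{1} end

# ...while a Leith-like closure needs 2: one for the vorticity and one more for
# the vorticity derivative.
struct LeithLike <: AbstractTurbulenceClosure{2} end

@assert required_halo_size(SmagorinskyLike()) == 1
@assert required_halo_size(LeithLike()) == 2
```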
Minor internals change
Removed the general `calculate_nonlinear_viscosity!` and `calculate_nonlinear_diffusivity!` kernels (each turbulence closure now has its own kernel).

Requires JuliaGPU/KernelAbstractions.jl#399 (on hold at the moment; this dependency was later removed, see the discussion above)

Closes #615
Closes #1882
Closes #3067
Closes #3068
Supersedes #2953