
CI: add CUDA-aware MPI opt-in to GPU Nsight profiling #47

Open
cenamiller wants to merge 3 commits into master from feature-ci-nsys-cuda-aware-mpi

Conversation

@cenamiller
Collaborator

Summary

Adds an opt-in cuda_aware_mpi input to the GPU Nsight profiling workflow and traces MPI in the Nsight session so we can actually see halo exchanges in the timeline.

Why

The Nsight workflow always ran with host-staged MPI even though the MPAS-A OpenACC build wraps halo exchanges with acc host_data use_device(...). Without an explicit MPI-side opt-in, the MPI library silently cudaMemcpys device buffers through host memory; the trace shows lots of H↔D copy traffic and the MPI calls themselves are missing because nsys was not tracing MPI.

Changes

  • profile-gpu-nsight.yml: new cuda_aware_mpi boolean input (default false, preserves prior behaviour).
  • run-nsys-profile.sh: when the input is true, sets the right env vars per MPI implementation:
    • MPICH: MPICH_GPU_SUPPORT_ENABLED=1
    • OpenMPI: OMPI_MCA_pml=ucx, OMPI_MCA_osc=ucx, UCX_TLS=cuda,cuda_copy,cuda_ipc,sm,self, plus --mca pml ucx --mca osc ucx on the mpirun line.
      These do nothing unless the container's MPI was built with GPU support; if not, MPI aborts loudly rather than silently degrading (a sketch of this selection logic follows this list).
  • run-nsys-profile.sh: nsys trace targets now cuda,nvtx,osrt,mpi. MPI calls now show up in the timeline regardless of the cuda-aware setting.
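
The env-var selection in the second bullet amounts to roughly the sketch below. `CUDA_AWARE_MPI` and `MPI_FLAVOR` are placeholder names for however run-nsys-profile.sh actually receives the workflow input and identifies the MPI implementation, so treat this as illustrative rather than a copy of the script:

```bash
# Illustrative only: variable names are placeholders, not the script's actual ones.
if [ "${CUDA_AWARE_MPI:-false}" = "true" ]; then
  case "${MPI_FLAVOR:-mpich}" in
    mpich)
      export MPICH_GPU_SUPPORT_ENABLED=1
      ;;
    openmpi)
      export OMPI_MCA_pml=ucx
      export OMPI_MCA_osc=ucx
      export UCX_TLS=cuda,cuda_copy,cuda_ipc,sm,self
      # Also forwarded on the launch line: --mca pml ucx --mca osc ucx
      MPIRUN_EXTRA_ARGS="--mca pml ucx --mca osc ucx"
      ;;
  esac
fi
```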

How to use

Dispatch the workflow twice (e.g. with mpich + 240km), once with the input default and once with cuda_aware_mpi: true. Compare the resulting nsys-profile.nsys-rep artifacts:

  • Without: H↔D cudaMemcpyAsync calls bracketing each MPI send/recv.
  • With (if the container's MPICH is built --with-cuda): MPI calls go directly between device buffers.
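
If you dispatch from the CLI instead of the Actions UI, the comparison pair looks roughly like this; only `cuda_aware_mpi` is a confirmed input name, and the mpich/240km selection goes through whatever inputs the workflow already defines:

```bash
# Baseline run: host-staged MPI (default inputs)
gh workflow run profile-gpu-nsight.yml --ref feature-ci-nsys-cuda-aware-mpi

# Opt-in run: CUDA-aware MPI
gh workflow run profile-gpu-nsight.yml --ref feature-ci-nsys-cuda-aware-mpi \
  -f cuda_aware_mpi=true

# Download the artifacts from both runs for side-by-side inspection in Nsight Systems
gh run download <run-id>
```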

Test plan

  • CPU push triggers green on this PR
  • Dispatch profile-gpu-nsight.yml with default input (regression baseline)
  • Dispatch profile-gpu-nsight.yml with cuda_aware_mpi: true and confirm it either runs cleanly or aborts loudly (so we know whether the container's MPI is GPU-built)

The Nsight workflow always ran with host-staged MPI, even though the
MPAS-A OpenACC build ships with `acc host_data use_device(...)` around
halo exchanges. Without an explicit opt-in, MPI silently `cudaMemcpy`s
device buffers through host memory, so the profile shows a lot of
H<->D copy traffic and the MPI calls themselves are missing from the
timeline (nsys was not tracing MPI either).

Two changes:

1. Add `cuda_aware_mpi` workflow input (default false, preserves prior
   behaviour). When true, run-nsys-profile.sh sets the right env vars
   per MPI implementation:
     - MPICH:   MPICH_GPU_SUPPORT_ENABLED=1
     - OpenMPI: OMPI_MCA_pml=ucx, OMPI_MCA_osc=ucx, UCX_TLS includes
                cuda transports; also passes `--mca pml ucx --mca osc ucx`
                on the mpirun line.
   These do nothing useful unless the container's MPI is built with GPU
   support, but the failure mode is loud (MPI abort) rather than silent.

2. Add `mpi` to the nsys trace targets so halo exchanges show up in the
   timeline regardless of the cuda-aware setting. This lets us compare
   host-staged vs cuda-aware runs by dispatching the workflow twice.

Made-with: Cursor
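
In nsys terms, the second change above is just adding `mpi` to the trace list, roughly as below; the launch details (output name, mpirun wrapping, rank count) follow whatever run-nsys-profile.sh already does and are shown here only for illustration:

```bash
# Before: --trace=cuda,nvtx,osrt  (MPI calls absent from the timeline)
# After:  MPI tracing is on regardless of the cuda_aware_mpi input
nsys profile --trace=cuda,nvtx,osrt,mpi -o nsys-profile \
  mpirun -np "${RANKS:-4}" ./atmosphere_model   # launch line is illustrative
```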
cenamiller force-pushed the feature-ci-nsys-cuda-aware-mpi branch from ad8aeac to bdafcb3 on April 28, 2026 at 00:41
All three changes are motivated by debugging cuda-aware MPI on PR #47. Without per-rank
pinning, all ranks default to GPU 0 on the CIRRUS-4x8-gpu node, which
masks any cuda-aware MPI win because there is no cross-GPU traffic.

- pin-gpu.sh: tiny shim that sets CUDA_VISIBLE_DEVICES to
  (local_rank % visible_gpu_count) using whichever local-rank var the
  MPI runtime exports (MPI_LOCALRANKID / OMPI_COMM_WORLD_LOCAL_RANK /
  PMI_LOCAL_RANK / SLURM_LOCALID). No-op if no GPUs detected, so safe
  when the container has no GPU mapping. Round-robin only; not yet
  wired into _test-gpu / _test-bfb (those can adopt it next pass).

- run-nsys-profile.sh: launch the model through pin-gpu.sh inside
  mpirun so the pin happens per child process, not once for the whole
  job. Comment now flags why the wrapper is here.

- profile-gpu-nsight.yml: replace the old single nvidia-smi probe
  (which fails inside containers shipping the GDK stub libnvidia-ml.so)
  with a structured diagnostic block that lists /dev/nvidia*, runs
  nvidia-smi -L and --query-gpu (both bypass libnvidia-ml), and dumps
  CUDA / OpenACC / MPI env vars. Goes into the workflow log and step
  summary so we can tell whether the runner exposed >1 GPU and what
  the model actually launched with.

Made-with: Cursor
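
A minimal sketch of the round-robin shim described in the first bullet, assuming the local-rank variables listed above; GPU detection via `nvidia-smi -L` is an assumption, and the real pin-gpu.sh may handle it differently:

```bash
#!/usr/bin/env bash
# pin-gpu.sh (sketch): map each MPI rank to one GPU, then exec the real command.

# Use whichever local-rank variable this MPI runtime exports.
local_rank="${MPI_LOCALRANKID:-${OMPI_COMM_WORLD_LOCAL_RANK:-${PMI_LOCAL_RANK:-${SLURM_LOCALID:-0}}}}"

# Count visible GPUs (detection method assumed); if none are found the pin is a
# no-op, so the wrapper stays safe in containers without a GPU mapping.
ngpus=$(nvidia-smi -L 2>/dev/null | wc -l)
if [ "${ngpus}" -gt 0 ]; then
  export CUDA_VISIBLE_DEVICES=$(( local_rank % ngpus ))
fi

exec "$@"
```

Invoked as `mpirun -np N ./pin-gpu.sh <model binary>`, so the mapping runs once per child process rather than once for the whole job.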
The MPI-side env vars are only half the cuda-aware path. MPAS host-stages
halo exchanges itself unless config_gpu_aware_mpi=.true. is set in
&development; without it the MPI library never sees a device pointer and
MPICH_GPU_SUPPORT_ENABLED has no effect.

Make cuda_aware_mpi a single combined knob: when true, it now both sets
the MPI env vars (existing run-nsys-profile.sh logic) and patches
nsight-case/namelist.atmosphere so MPAS uses acc host_data use_device
around halo sends. Idempotent: appends a &development block if missing,
inserts the key if the block exists without it, or flips an existing
.false. to .true.

Made-with: Cursor
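
For reference, the idempotent namelist patch described above can be expressed roughly as follows; the path and the &development group name come from the commit message, while the grep/sed structure is illustrative:

```bash
NML=nsight-case/namelist.atmosphere

if ! grep -q '^&development' "${NML}"; then
  # No &development block yet: append one with the key enabled.
  printf '&development\n    config_gpu_aware_mpi = .true.\n/\n' >> "${NML}"
elif ! grep -q 'config_gpu_aware_mpi' "${NML}"; then
  # Block exists but the key is missing: insert it after the block header.
  sed -i '/^&development/a config_gpu_aware_mpi = .true.' "${NML}"
else
  # Key exists: flip .false. to .true. (no change if already .true.).
  sed -i 's/\(config_gpu_aware_mpi *= *\)\.false\./\1.true./' "${NML}"
fi
```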