
CI: add CUDA-aware MPI opt-in to GPU Nsight profiling #47

Open
cenamiller wants to merge 3 commits into master from feature-ci-nsys-cuda-aware-mpi

Conversation

@cenamiller
Collaborator

Summary

Adds an opt-in cuda_aware_mpi input to the GPU Nsight profiling workflow and traces MPI in the Nsight session so we can actually see halo exchanges in the timeline.

Why

The Nsight workflow always ran with host-staged MPI even though the MPAS-A OpenACC build wraps halo exchanges with acc host_data use_device(...). Without an explicit MPI-side opt-in, the MPI library silently cudaMemcpys device buffers through host memory; the trace shows lots of H↔D copy traffic and the MPI calls themselves are missing because nsys was not tracing MPI.

Changes

  • profile-gpu-nsight.yml: new cuda_aware_mpi boolean input (default false, preserves prior behaviour).
  • run-nsys-profile.sh: when the input is true, sets the right env vars per MPI implementation:
    • MPICH: MPICH_GPU_SUPPORT_ENABLED=1
    • OpenMPI: OMPI_MCA_pml=ucx, OMPI_MCA_osc=ucx, UCX_TLS=cuda,cuda_copy,cuda_ipc,sm,self, plus --mca pml ucx --mca osc ucx on the mpirun line.
      These do nothing unless the container's MPI was built with GPU support; if not, MPI aborts loudly rather than silently degrading (a sketch of this selection logic follows this list).
  • run-nsys-profile.sh: nsys trace targets now cuda,nvtx,osrt,mpi. MPI calls now show up in the timeline regardless of the cuda-aware setting.
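
The env-var selection in the second bullet amounts to roughly the sketch below. `CUDA_AWARE_MPI` and `MPI_FLAVOR` are placeholder names for however run-nsys-profile.sh actually receives the workflow input and identifies the MPI implementation, so treat this as illustrative rather than a copy of the script:

```bash
# Illustrative only: variable names are placeholders, not the script's actual ones.
if [ "${CUDA_AWARE_MPI:-false}" = "true" ]; then
  case "${MPI_FLAVOR:-mpich}" in
    mpich)
      export MPICH_GPU_SUPPORT_ENABLED=1
      ;;
    openmpi)
      export OMPI_MCA_pml=ucx
      export OMPI_MCA_osc=ucx
      export UCX_TLS=cuda,cuda_copy,cuda_ipc,sm,self
      # Also forwarded on the launch line: --mca pml ucx --mca osc ucx
      MPIRUN_EXTRA_ARGS="--mca pml ucx --mca osc ucx"
      ;;
  esac
fi
```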

How to use

Dispatch the workflow twice (e.g. with mpich + 240km), once with the input default and once with cuda_aware_mpi: true. Compare the resulting nsys-profile.nsys-rep artifacts:

  • Without: H↔D cudaMemcpyAsync calls bracketing each MPI send/recv.
  • With (if the container's MPICH is built --with-cuda): MPI calls go directly between device buffers.
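
If you dispatch from the CLI instead of the Actions UI, the comparison pair looks roughly like this; only `cuda_aware_mpi` is a confirmed input name, and the mpich/240km selection goes through whatever inputs the workflow already defines:

```bash
# Baseline run: host-staged MPI (default inputs)
gh workflow run profile-gpu-nsight.yml --ref feature-ci-nsys-cuda-aware-mpi

# Opt-in run: CUDA-aware MPI
gh workflow run profile-gpu-nsight.yml --ref feature-ci-nsys-cuda-aware-mpi \
  -f cuda_aware_mpi=true

# Download the artifacts from both runs for side-by-side inspection in Nsight Systems
gh run download <run-id>
```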

Test plan

  • CPU push triggers green on this PR
  • Dispatch profile-gpu-nsight.yml with default input (regression baseline)
  • Dispatch profile-gpu-nsight.yml with cuda_aware_mpi: true and confirm it either runs cleanly or aborts loudly (so we know whether the container's MPI is GPU-built)

The Nsight workflow always ran with host-staged MPI, even though the
MPAS-A OpenACC build ships with `acc host_data use_device(...)` around
halo exchanges. Without an explicit opt-in, MPI silently `cudaMemcpy`s
device buffers through host memory, so the profile shows a lot of
H<->D copy traffic and the MPI calls themselves are missing from the
timeline (nsys was not tracing MPI either).

Two changes:

1. Add `cuda_aware_mpi` workflow input (default false, preserves prior
   behaviour). When true, run-nsys-profile.sh sets the right env vars
   per MPI implementation:
     - MPICH:   MPICH_GPU_SUPPORT_ENABLED=1
     - OpenMPI: OMPI_MCA_pml=ucx, OMPI_MCA_osc=ucx, UCX_TLS includes
                cuda transports; also passes `--mca pml ucx --mca osc ucx`
                on the mpirun line.
   These do nothing useful unless the container's MPI is built with GPU
   support, but the failure mode is loud (MPI abort) rather than silent.

2. Add `mpi` to the nsys trace targets so halo exchanges show up in the
   timeline regardless of the cuda-aware setting. This lets us compare
   host-staged vs cuda-aware runs by dispatching the workflow twice.

Made-with: Cursor
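
In nsys terms, the second change above is just adding `mpi` to the trace list, roughly as below; the launch details (output name, mpirun wrapping, rank count) follow whatever run-nsys-profile.sh already does and are shown here only for illustration:

```bash
# Before: --trace=cuda,nvtx,osrt  (MPI calls absent from the timeline)
# After:  MPI tracing is on regardless of the cuda_aware_mpi input
nsys profile --trace=cuda,nvtx,osrt,mpi -o nsys-profile \
  mpirun -np "${RANKS:-4}" ./atmosphere_model   # launch line is illustrative
```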
cenamiller force-pushed the feature-ci-nsys-cuda-aware-mpi branch from ad8aeac to bdafcb3 on April 28, 2026 at 00:41
All three changes are motivated by debugging cuda-aware MPI on PR #47. Without per-rank
pinning, all ranks default to GPU 0 on the CIRRUS-4x8-gpu node, which
masks any cuda-aware MPI win because there is no cross-GPU traffic.

- pin-gpu.sh: tiny shim that sets CUDA_VISIBLE_DEVICES to
  (local_rank % visible_gpu_count) using whichever local-rank var the
  MPI runtime exports (MPI_LOCALRANKID / OMPI_COMM_WORLD_LOCAL_RANK /
  PMI_LOCAL_RANK / SLURM_LOCALID). No-op if no GPUs detected, so safe
  when the container has no GPU mapping. Round-robin only; not yet
  wired into _test-gpu / _test-bfb (those can adopt it next pass).

- run-nsys-profile.sh: launch the model through pin-gpu.sh inside
  mpirun so the pin happens per child process, not once for the whole
  job. Comment now flags why the wrapper is here.

- profile-gpu-nsight.yml: replace the old single nvidia-smi probe
  (which fails inside containers shipping the GDK stub libnvidia-ml.so)
  with a structured diagnostic block that lists /dev/nvidia*, runs
  nvidia-smi -L and --query-gpu (both bypass libnvidia-ml), and dumps
  CUDA / OpenACC / MPI env vars. Goes into the workflow log and step
  summary so we can tell whether the runner exposed >1 GPU and what
  the model actually launched with.

Made-with: Cursor
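
A minimal sketch of the round-robin shim described in the first bullet, assuming the local-rank variables listed above; GPU detection via `nvidia-smi -L` is an assumption, and the real pin-gpu.sh may handle it differently:

```bash
#!/usr/bin/env bash
# pin-gpu.sh (sketch): map each MPI rank to one GPU, then exec the real command.

# Use whichever local-rank variable this MPI runtime exports.
local_rank="${MPI_LOCALRANKID:-${OMPI_COMM_WORLD_LOCAL_RANK:-${PMI_LOCAL_RANK:-${SLURM_LOCALID:-0}}}}"

# Count visible GPUs (detection method assumed); if none are found the pin is a
# no-op, so the wrapper stays safe in containers without a GPU mapping.
ngpus=$(nvidia-smi -L 2>/dev/null | wc -l)
if [ "${ngpus}" -gt 0 ]; then
  export CUDA_VISIBLE_DEVICES=$(( local_rank % ngpus ))
fi

exec "$@"
```

Invoked as `mpirun -np N ./pin-gpu.sh <model binary>`, so the mapping runs once per child process rather than once for the whole job.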
The MPI-side env vars are only half the cuda-aware path. MPAS host-stages
halo exchanges itself unless config_gpu_aware_mpi=.true. is set in
&development; without it the MPI library never sees a device pointer and
MPICH_GPU_SUPPORT_ENABLED has no effect.

Make cuda_aware_mpi a single combined knob: when true, it now both sets
the MPI env vars (existing run-nsys-profile.sh logic) and patches
nsight-case/namelist.atmosphere so MPAS uses acc host_data use_device
around halo sends. Idempotent: appends a &development block if missing,
inserts the key if the block exists without it, or flips an existing
.false. to .true.

Made-with: Cursor
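
For reference, the idempotent namelist patch described above can be expressed roughly as follows; the path and the &development group name come from the commit message, while the grep/sed structure is illustrative:

```bash
NML=nsight-case/namelist.atmosphere

if ! grep -q '^&development' "${NML}"; then
  # No &development block yet: append one with the key enabled.
  printf '&development\n    config_gpu_aware_mpi = .true.\n/\n' >> "${NML}"
elif ! grep -q 'config_gpu_aware_mpi' "${NML}"; then
  # Block exists but the key is missing: insert it after the block header.
  sed -i '/^&development/a config_gpu_aware_mpi = .true.' "${NML}"
else
  # Key exists: flip .false. to .true. (no change if already .true.).
  sed -i 's/\(config_gpu_aware_mpi *= *\)\.false\./\1.true./' "${NML}"
fi
```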