
Conversation

anandrdbz
Contributor

@anandrdbz commented Aug 31, 2025

User description

Description

Fixes a bug with the MPI transfer of IBM markers on multiple ranks. Also adds the missing device-to-host and host-to-device transfers for the MPI transfer of ib_markers. Addresses issue #875.

Closes #875


Type of change

  • [x] Bug fix (non-breaking change which fixes an issue)

Scope

  • [x] This PR comprises a set of related changes with a common goal

If you cannot check the above box, please split your PR into multiple PRs that each have a common goal.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.
Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration.

  • Test A
  • Test B

Test Configuration:

  • What computers and compilers did you use to test this:

Checklist

  • I have added comments for the new code
  • I added Doxygen docstrings to the new code
  • I have made corresponding changes to the documentation (docs/)
  • I have added regression tests to the test suite so that people can verify in the future that the feature is behaving as expected
  • I have added example cases in examples/ that demonstrate my new feature performing as expected.
    They run to completion and demonstrate "interesting physics"
  • I ran ./mfc.sh format before committing my code
  • New and existing tests pass locally with my changes, including with GPU capability enabled (both NVIDIA hardware with NVHPC compilers and AMD hardware with CRAY compilers) and disabled
  • This PR does not introduce any repeated code (it follows the DRY principle)
  • I cannot think of a way to condense this code and reduce any introduced additional line count

If your code changes any source files (anything in src/simulation)

To make sure the code is performing as expected on GPU devices, I have:

  • Checked that the code compiles using NVHPC compilers
  • Checked that the code compiles using CRAY compilers
  • Ran the code on either V100, A100, or H100 GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • Ran the code on MI200+ GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • Enclosed the new feature via nvtx ranges so that they can be identified in profiles
  • Ran a Nsight Systems profile using ./mfc.sh run XXXX --gpu -t simulation --nsys, and have attached the output file (.nsys-rep) and plain text results to this PR
  • Ran a Rocprof Systems profile using ./mfc.sh run XXXX --gpu -t simulation --rsys --hip-trace, and have attached the output file and plain text results to this PR.
  • Ran my code using various numbers of different GPUs (1, 2, and 8, for example) in parallel and made sure that the results scale similarly to what happens if you run without the new code/feature

PR Type

Bug fix


Description

  • Fix MPI transfer for IBM markers on multiple GPUs

  • Add RDMA and non-RDMA communication paths

  • Implement proper GPU memory management for transfers

  • Add NVTX profiling ranges for performance monitoring


Diagram Walkthrough

flowchart LR
  A["IBM Markers"] --> B["Check RDMA Support"]
  B --> C["RDMA Path"]
  B --> D["Non-RDMA Path"]
  C --> E["GPU Direct Transfer"]
  D --> F["GPU to Host"]
  F --> G["MPI Transfer"]
  G --> H["Host to GPU"]
  E --> I["Complete Transfer"]
  H --> I

File Walkthrough

Relevant files

Bug fix: src/simulation/m_mpi_proxy.fpp (+33/-6)
Implement dual-path MPI communication for IBM markers

  • Replace single MPI_SENDRECV call with conditional RDMA/non-RDMA paths
  • Add GPU memory management with GPU_UPDATE directives
  • Implement GPU_HOST_DATA wrapper for RDMA transfers
  • Add detailed NVTX profiling ranges for performance tracking

@anandrdbz requested a review from a team as a code owner on August 31, 2025 03:17

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

The NVTX label "IB-MARKER-SENDRECV-NO-RMDA" appears misspelled; should likely be "NO-RDMA". This can hinder profiling searches and consistency across ranges.

call nvtxStartRange("IB-MARKER-SENDRECV-NO-RMDA")
call MPI_SENDRECV( &
        ib_buff_send, buffer_count, MPI_INTEGER, dst_proc, send_tag, &
        ib_buff_recv, buffer_count, MPI_INTEGER, src_proc, recv_tag, &
        MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
call nvtxEndRange 
RDMA Branch Logic

The conditional iterates over both RDMA and non-RDMA paths with an 'if (rdma_mpi .eqv. ...)' check. Ensure only the intended path executes at runtime and that the preprocessor generates a single active path; otherwise you may execute neither or duplicate logic depending on how macros expand.

#:for rdma_mpi in [False, True]
    if (rdma_mpi .eqv. ${'.true.' if rdma_mpi else '.false.'}$) then
        #:if rdma_mpi
            #:call GPU_HOST_DATA(use_device='[ib_buff_send, ib_buff_recv]')

            call nvtxStartRange("IB-MARKER-SENDRECV-RDMA")
            call MPI_SENDRECV( &
                ib_buff_send, buffer_count, MPI_INTEGER, dst_proc, send_tag, &
                ib_buff_recv, buffer_count, MPI_INTEGER, src_proc, recv_tag, &
                MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
            call nvtxEndRange 

            #:endcall GPU_HOST_DATA
            $:GPU_WAIT()
        #:else
            call nvtxStartRange("IB-MARKER-DEV2HOST")
            $:GPU_UPDATE(host='[ib_buff_send]')
            call nvtxEndRange

            call nvtxStartRange("IB-MARKER-SENDRECV-NO-RMDA")
            call MPI_SENDRECV( &
                    ib_buff_send, buffer_count, MPI_INTEGER, dst_proc, send_tag, &
                    ib_buff_recv, buffer_count, MPI_INTEGER, src_proc, recv_tag, &
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
            call nvtxEndRange 

            call nvtxStartRange("IB-MARKER-HOST2DEV")
            $:GPU_UPDATE(device='[ib_buff_recv]')
            call nvtxEndRange
        #:endif
    end if
#:endfor
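
For reference, a rough sketch of what this Fypp construct expands to after preprocessing (assuming rdma_mpi is a runtime logical, as in the surrounding code; branch bodies are elided to comments):

! Sketch of the generated Fortran: both guarded blocks are present in the
! source, and the runtime check on rdma_mpi selects exactly one per run.
if (rdma_mpi .eqv. .false.) then
    ! non-RDMA body: GPU_UPDATE(host=...), MPI_SENDRECV, GPU_UPDATE(device=...)
end if
if (rdma_mpi .eqv. .true.) then
    ! RDMA body: host_data region passing device pointers to MPI_SENDRECV
end if

Since a defined logical is either .true. or .false., exactly one block executes; the "neither path" concern would only arise if rdma_mpi were left undefined.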
GPU Sync/Updates

In the non-RDMA path, a host update is done for the send buffer and a device update for the receive buffer, but there are no explicit GPU_WAIT fences around the MPI call beyond the per-branch update calls. Confirm that the $:GPU_UPDATE and $:GPU_WAIT semantics guarantee correctness and that device pointers are valid when entering and leaving MPI, especially with overlapping streams.

call nvtxStartRange("IB-MARKER-DEV2HOST")
$:GPU_UPDATE(host='[ib_buff_send]')
call nvtxEndRange

call nvtxStartRange("IB-MARKER-SENDRECV-NO-RMDA")
call MPI_SENDRECV( &
        ib_buff_send, buffer_count, MPI_INTEGER, dst_proc, send_tag, &
        ib_buff_recv, buffer_count, MPI_INTEGER, src_proc, recv_tag, &
        MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
call nvtxEndRange 

call nvtxStartRange("IB-MARKER-HOST2DEV")
$:GPU_UPDATE(device='[ib_buff_recv]')
call nvtxEndRange


qodo-merge-pro bot commented Aug 31, 2025

PR Code Suggestions ✨

Latest suggestions up to ae30de5

Category | Suggestion | Impact
General
Fix buffer initialization order

Restore the initialization order to populate grid buffers after modules that may
modify fields (IBM, sources) to avoid stale halos. Move the buffer population
call after these initializations to ensure consistent boundary data.

src/simulation/m_start_up.fpp [1369-1375]

-! Populating the buffers of the grid variables using the boundary conditions
-call s_populate_grid_variables_buffers()
-
 if (model_eqns == 3) call s_initialize_internal_energy_equations(q_cons_ts(1)%vf)
 if (ib) call s_ibm_setup()
 if (bodyForces) call s_initialize_body_forces_module()
 if (acoustic_source) call s_precalculate_acoustic_spatial_sources()
 
+! Populating the buffers of the grid variables using the boundary conditions
+call s_populate_grid_variables_buffers()
+
Suggestion importance[1-10]: 9


Why: This suggestion correctly identifies a potential regression in the PR, where moving the s_populate_grid_variables_buffers call could lead to stale buffer data and incorrect simulation results.

High
Incremental [*]
Fix MPI buffer size formulas

Align buffer_counts with the exact number of elements packed in each branch to prevent MPI size mismatches; for dir=1 use (n+1)*(p+1)*buff_size, for dir=2 use (m+2*buff_size+1)*(p+1)*buff_size, and for dir=3 use (m+2*buff_size+1)*(n+2*buff_size+1)*buff_size.

src/simulation/m_mpi_proxy.fpp [272-276]

 buffer_counts = (/ &
-                buff_size*(n + 1)*(p + 1), &
-                buff_size*(m + 2*buff_size + 1)*(p + 1), &
-                buff_size*(m + 2*buff_size + 1)*(n + 2*buff_size + 1) &
+                (n + 1)*(p + 1)*buff_size, &
+                (m + 2*buff_size + 1)*(p + 1)*buff_size, &
+                (m + 2*buff_size + 1)*(n + 2*buff_size + 1)*buff_size &
                  /)
Suggestion importance[1-10]: 7


Why: The suggestion correctly points out that the multiplication order in the buffer_counts calculation differs from the loop indexing, but since integer multiplication is commutative the computed counts are identical, so this is a readability improvement rather than a functional bug fix.

Medium
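
As a quick numeric check of the claim above (hypothetical grid dimensions, not taken from the PR), the two orderings give identical counts because integer multiplication is commutative:

program check_counts
    implicit none
    integer, parameter :: m = 49, n = 49, p = 49, buff_size = 10
    integer :: a(3), b(3)
    ! Original ordering of factors
    a = (/ buff_size*(n + 1)*(p + 1), &
           buff_size*(m + 2*buff_size + 1)*(p + 1), &
           buff_size*(m + 2*buff_size + 1)*(n + 2*buff_size + 1) /)
    ! Suggested ordering of factors
    b = (/ (n + 1)*(p + 1)*buff_size, &
           (m + 2*buff_size + 1)*(p + 1)*buff_size, &
           (m + 2*buff_size + 1)*(n + 2*buff_size + 1)*buff_size /)
    print *, all(a == b)   ! prints T
end program check_counts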
Possible issue
Assert correct MPI buffer sizes

Validate that buffer_count matches the actual packed size for each direction to
avoid MPI mismatches. Add runtime assertions after computing buffer_count and
before communication to catch mis-sized buffers when buff_size changes.

src/simulation/m_mpi_proxy.fpp [272-276]

 buffer_counts = (/ &
                 buff_size*(n + 1)*(p + 1), &
                 buff_size*(m + 2*buff_size + 1)*(p + 1), &
                 buff_size*(m + 2*buff_size + 1)*(n + 2*buff_size + 1) &
                 /)
 
-...
+buffer_count = buffer_counts(mpi_dir)
+
+! Sanity checks to avoid MPI size mismatches
+if (buffer_count <= 0) then
+    print *, "Invalid buffer_count for MPI dir=", mpi_dir, " count=", buffer_count
+    error stop "Invalid MPI buffer count"
+end if
 
 call MPI_SENDRECV( &
     ib_buff_send, buffer_count, MPI_INTEGER, dst_proc, send_tag, &
     ib_buff_recv, buffer_count, MPI_INTEGER, src_proc, recv_tag, &
     MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
Suggestion importance[1-10]: 3


Why: The suggestion adds a defensive check for buffer_count, which is good practice for robustness, but it addresses a very unlikely scenario, making its impact minor.

Low

Previous suggestions

Suggestions up to commit f074f32
Category | Suggestion | Impact
High-level
Pre-transfer GPU sync for RDMA

In the RDMA path you invoke MPI_SENDRECV with device pointers without
synchronizing prior GPU work, risking stale or partially-written data being
sent. Move the GPU_WAIT (or stream/queue sync) to before the MPI call (and only
keep a post-call sync if proven necessary), and ensure the subsequent unpack
consistently operates on the same memory space (device vs. host) for both paths.
Additionally, gate the RDMA branch on a robust runtime check for GPU-aware MPI
to prevent undefined behavior when rdma_mpi is misconfigured.

Examples:

src/simulation/m_mpi_proxy.fpp [354-362]
                    call nvtxStartRange("IB-MARKER-SENDRECV-RDMA")
                    call MPI_SENDRECV( &
                        ib_buff_send, buffer_count, MPI_INTEGER, dst_proc, send_tag, &
                        ib_buff_recv, buffer_count, MPI_INTEGER, src_proc, recv_tag, &
                        MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
                    call nvtxEndRange 

                    #:endcall GPU_HOST_DATA
                    $:GPU_WAIT()

Solution Walkthrough:

Before:

#:if rdma_mpi
    // Potentially using device pointers for MPI
    // without ensuring prior GPU kernels are complete.
    call MPI_SENDRECV( &
        ib_buff_send,  // device buffer
        ...,
        ib_buff_recv,  // device buffer
        ...
    )
    // Sync happens *after* the MPI call, which is too late for the send buffer.
    $:GPU_WAIT()
#:else
    // Safe non-RDMA path with explicit copies
    $:GPU_UPDATE(host='[ib_buff_send]')
    call MPI_SENDRECV(...)
    $:GPU_UPDATE(device='[ib_buff_recv]')
#:endif

After:

// At runtime, check if GPU-aware MPI is available
if (gpu_aware_mpi_is_available) then
    // RDMA path
    // Sync GPU *before* MPI call to ensure send buffer is fully written.
    $:GPU_WAIT()
    call MPI_SENDRECV( &
        ib_buff_send,  // device buffer
        ...,
        ib_buff_recv,  // device buffer
        ...
    )
    // A post-call sync might also be needed depending on MPI implementation
    $:GPU_WAIT()
else
    // Fallback to non-RDMA path
    $:GPU_UPDATE(host='[ib_buff_send]')
    call MPI_SENDRECV(...)
    $:GPU_UPDATE(device='[ib_buff_recv]')
endif
Suggestion importance[1-10]: 9


Why: This suggestion correctly identifies a critical race condition in the RDMA path where MPI_SENDRECV might access GPU data before it's ready, and also proposes a crucial design improvement by replacing a compile-time flag with a runtime check for GPU-aware MPI.

High
Possible issue
Synchronize before RDMA transfer

Ensure device writes to ib_buff_send are complete before initiating CUDA-aware
MPI. Add a GPU_WAIT before the RDMA MPI_SENDRECV (keep the post-wait to
guarantee visibility of the received data to subsequent GPU work). This prevents
racing the MPI with in-flight GPU kernels.

src/simulation/m_mpi_proxy.fpp [352-362]

 #:call GPU_HOST_DATA(use_device='[ib_buff_send, ib_buff_recv]')
                     
+$:GPU_WAIT()
 call nvtxStartRange("IB-MARKER-SENDRECV-RDMA")
 call MPI_SENDRECV( &
     ib_buff_send, buffer_count, MPI_INTEGER, dst_proc, send_tag, &
     ib_buff_recv, buffer_count, MPI_INTEGER, src_proc, recv_tag, &
     MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
 call nvtxEndRange 
 
 #:endcall GPU_HOST_DATA
 $:GPU_WAIT()
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a potential race condition where the MPI_SENDRECV could start before the GPU has finished writing to ib_buff_send, and proposes adding $:GPU_WAIT() to prevent this critical bug.

High
Synchronize before host update

Add a synchronization before copying ib_buff_send to host in the non-RDMA path
to ensure prior GPU kernels have finished writing. This prevents stale or
partially updated data from being sent via MPI.

src/simulation/m_mpi_proxy.fpp [364-366]

 call nvtxStartRange("IB-MARKER-DEV2HOST")
+$:GPU_WAIT()
 $:GPU_UPDATE(host='[ib_buff_send]')
 call nvtxEndRange
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a potential race condition where ib_buff_send could be copied to the host before the GPU has finished writing to it, and proposes adding $:GPU_WAIT() to prevent this critical bug.

High
General
Correct profiling label typo

Fix the NVTX range label typo to maintain profiling clarity and avoid splitting
metrics across misspelled ranges. Rename "NO-RMDA" to "NO-RDMA".

src/simulation/m_mpi_proxy.fpp [368-373]

-call nvtxStartRange("IB-MARKER-SENDRECV-NO-RMDA")
+call nvtxStartRange("IB-MARKER-SENDRECV-NO-RDMA")
 call MPI_SENDRECV( &
         ib_buff_send, buffer_count, MPI_INTEGER, dst_proc, send_tag, &
         ib_buff_recv, buffer_count, MPI_INTEGER, src_proc, recv_tag, &
         MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
 call nvtxEndRange
Suggestion importance[1-10]: 4


Why: The suggestion correctly identifies and fixes a typo in an NVTX profiling label from "NO-RMDA" to "NO-RDMA", which improves the clarity and correctness of performance profiling data.

Low

@anandrdbz
Contributor Author

@sbryngelson macOS CI has a problem building cmake btw


codecov bot commented Aug 31, 2025

Codecov Report

❌ Patch coverage is 18.00000% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.92%. Comparing base (88c0a11) to head (ae30de5).
⚠️ Report is 1 commits behind head on master.

Files with missing lines | Patch % | Lines
src/simulation/m_mpi_proxy.fpp | 6.25% | 30 Missing ⚠️
src/simulation/m_ibm.fpp | 23.07% | 9 Missing and 1 partial ⚠️
src/common/m_helper_basic.fpp | 50.00% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #990      +/-   ##
==========================================
- Coverage   40.93%   40.92%   -0.02%     
==========================================
  Files          70       70              
  Lines       20288    20299      +11     
  Branches     2517     2521       +4     
==========================================
+ Hits         8305     8307       +2     
- Misses      10447    10454       +7     
- Partials     1536     1538       +2     



@anandrdbz changed the title from "MPI Transfer for IBM Markers on multiple GPUs" to "Resolving bug with multiple ranks using IBM" on Aug 31, 2025
@sbryngelson
Member

@sbryngelson macOS CI has a problem building cmake btw

Please fix

Removed 'cmake' from the list of installed packages on macOS.
@sbryngelson self-requested a review as a code owner on August 31, 2025 12:17
@sbryngelson
Member

This doesn't run with case optimization on CPU/Phoenix for some reason... @anandrdbz please check logs

@anandrdbz
Contributor Author

Actually, this case revealed an important corner case: the IBM boundary and the processor boundary match almost exactly. You actually need a larger buffer size than I thought; I kept a limit of 6 = 2*gp_layers for the image point, but this isn't strictly sufficient. In this particular case, it worked after increasing the limit to 8.

But there's actually no deterministic way of calculating this, as it depends on the IBM geometry. For now I'll push a fix to get this test to pass, and I'll add a print message in the source code telling the user to increase buff_size if they run into this issue again.
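
A minimal sketch of the kind of runtime guard described above (ip_extent and the messages are hypothetical; the actual check in the commit may differ):

! Hypothetical guard: abort with an actionable message when an IBM image
! point needs more halo layers than the current buffer provides.
if (ib .and. ip_extent > buff_size) then
    print *, 'IBM image point needs ', ip_extent, &
        ' halo layers but buff_size is ', buff_size
    print *, 'Increase buff_size in the source code and recompile.'
    call s_mpi_abort('IBM image point outside of MPI halo')
end if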

@sbryngelson
Member

sbryngelson commented Sep 1, 2025

hmm... @anandrdbz given that halo exchanges / ghost cell exchanges are relatively cheap except in extreme strong-scaling cases, can we just put a rather generous upper bound on the ghost cells? This is a good find though, and I don't want to lose it in case it comes up again. I'm not sure how you want to add a warning message, but try to place it carefully/mindfully. Presumably the vast majority of cases with an IB will fail for a reason that has nothing to do with this 😄

@anandrdbz
Contributor Author

Yeah, this will fail uniquely, with a print message that's unique too, so it should be clear (you can check out my latest commit).

I have kept a generous upper bound of 10 for IBM cases, but there's no way to calculate a strict upper bound as it's geometry-dependent.

@anandrdbz
Contributor Author

Also @sbryngelson, the benchmark case with IBM will eventually fail the check after this latest commit due to the increase in buffer size from 6 to 10 (I checked this for CPUs on Phoenix), with a 20% or so slowdown in grind time (I think the allowed limit is about 5-10%).

@sbryngelson self-requested a review on September 1, 2025 19:05
@sbryngelson merged commit fe87fc5 into MFlowCode:master on Sep 1, 2025
45 of 49 checks passed
Development

Successfully merging this pull request may close these issues.

IBM gives different results with MPI (#875)