
Conversation

anandrdbz
Contributor

@anandrdbz commented Aug 31, 2025

User description

Description

Fixes a bug with the MPI transfer of IBM markers on multiple ranks. Also adds the missing device-to-host and host-to-device transfers for the MPI transfer of ib_markers. Addresses issue #875.

Closes #875


Type of change

  • [x] Bug fix (non-breaking change which fixes an issue)

Scope

  • [x] This PR comprises a set of related changes with a common goal

If you cannot check the above box, please split your PR into multiple PRs that each have a common goal.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.
Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration.

  • Test A
  • Test B

Test Configuration:

  • What computers and compilers did you use to test this:

Checklist

  • I have added comments for the new code
  • I added Doxygen docstrings to the new code
  • I have made corresponding changes to the documentation (docs/)
  • I have added regression tests to the test suite so that people can verify in the future that the feature is behaving as expected
  • I have added example cases in examples/ that demonstrate my new feature performing as expected.
    They run to completion and demonstrate "interesting physics"
  • I ran ./mfc.sh format before committing my code
  • New and existing tests pass locally with my changes, including with GPU capability enabled (both NVIDIA hardware with NVHPC compilers and AMD hardware with CRAY compilers) and disabled
  • This PR does not introduce any repeated code (it follows the DRY principle)
  • I cannot think of a way to condense this code and reduce any introduced additional line count

If your code changes any source files (anything in src/simulation)

To make sure the code is performing as expected on GPU devices, I have:

  • Checked that the code compiles using NVHPC compilers
  • Checked that the code compiles using CRAY compilers
  • Ran the code on either V100, A100, or H100 GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • Ran the code on MI200+ GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • Enclosed the new feature via nvtx ranges so that they can be identified in profiles
  • Ran a Nsight Systems profile using ./mfc.sh run XXXX --gpu -t simulation --nsys, and have attached the output file (.nsys-rep) and plain text results to this PR
  • Ran a Rocprof Systems profile using ./mfc.sh run XXXX --gpu -t simulation --rsys --hip-trace, and have attached the output file and plain text results to this PR.
  • Ran my code using various numbers of different GPUs (1, 2, and 8, for example) in parallel and made sure that the results scale similarly to what happens if you run without the new code/feature

PR Type

Bug fix


Description

  • Fix MPI transfer for IBM markers on multiple GPUs

  • Add RDMA and non-RDMA communication paths

  • Implement proper GPU memory management for transfers

  • Add NVTX profiling ranges for performance monitoring


Diagram Walkthrough

flowchart LR
  A["IBM Markers"] --> B["Check RDMA Support"]
  B --> C["RDMA Path"]
  B --> D["Non-RDMA Path"]
  C --> E["GPU Direct Transfer"]
  D --> F["GPU to Host"]
  F --> G["MPI Transfer"]
  G --> H["Host to GPU"]
  E --> I["Complete Transfer"]
  H --> I

File Walkthrough

Relevant files

Bug fix: src/simulation/m_mpi_proxy.fpp (+33/-6)
Implement dual-path MPI communication for IBM markers

  • Replace single MPI_SENDRECV call with conditional RDMA/non-RDMA paths
  • Add GPU memory management with GPU_UPDATE directives
  • Implement GPU_HOST_DATA wrapper for RDMA transfers
  • Add detailed NVTX profiling ranges for performance tracking

@anandrdbz requested a review from a team as a code owner on August 31, 2025 03:17

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

The NVTX label "IB-MARKER-SENDRECV-NO-RMDA" appears misspelled; should likely be "NO-RDMA". This can hinder profiling searches and consistency across ranges.

call nvtxStartRange("IB-MARKER-SENDRECV-NO-RMDA")
call MPI_SENDRECV( &
        ib_buff_send, buffer_count, MPI_INTEGER, dst_proc, send_tag, &
        ib_buff_recv, buffer_count, MPI_INTEGER, src_proc, recv_tag, &
        MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
call nvtxEndRange 
RDMA Branch Logic

The conditional iterates over both RDMA and non-RDMA paths with an 'if (rdma_mpi .eqv. ...)' check. Ensure only the intended path executes at runtime and that the preprocessor generates a single active path; otherwise you may execute neither or duplicate logic depending on how macros expand.

#:for rdma_mpi in [False, True]
    if (rdma_mpi .eqv. ${'.true.' if rdma_mpi else '.false.'}$) then
        #:if rdma_mpi
            #:call GPU_HOST_DATA(use_device='[ib_buff_send, ib_buff_recv]')

            call nvtxStartRange("IB-MARKER-SENDRECV-RDMA")
            call MPI_SENDRECV( &
                ib_buff_send, buffer_count, MPI_INTEGER, dst_proc, send_tag, &
                ib_buff_recv, buffer_count, MPI_INTEGER, src_proc, recv_tag, &
                MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
            call nvtxEndRange 

            #:endcall GPU_HOST_DATA
            $:GPU_WAIT()
        #:else
            call nvtxStartRange("IB-MARKER-DEV2HOST")
            $:GPU_UPDATE(host='[ib_buff_send]')
            call nvtxEndRange

            call nvtxStartRange("IB-MARKER-SENDRECV-NO-RMDA")
            call MPI_SENDRECV( &
                    ib_buff_send, buffer_count, MPI_INTEGER, dst_proc, send_tag, &
                    ib_buff_recv, buffer_count, MPI_INTEGER, src_proc, recv_tag, &
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
            call nvtxEndRange 

            call nvtxStartRange("IB-MARKER-HOST2DEV")
            $:GPU_UPDATE(device='[ib_buff_recv]')
            call nvtxEndRange
        #:endif
    end if
#:endfor
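
For reference, a rough sketch of what this Fypp construct expands to after preprocessing (assuming rdma_mpi is a runtime logical, as in the surrounding code; branch bodies are elided to comments):

! Sketch of the generated Fortran: both guarded blocks are present in the
! source, and the runtime check on rdma_mpi selects exactly one per run.
if (rdma_mpi .eqv. .false.) then
    ! non-RDMA body: GPU_UPDATE(host=...), MPI_SENDRECV, GPU_UPDATE(device=...)
end if
if (rdma_mpi .eqv. .true.) then
    ! RDMA body: host_data region passing device pointers to MPI_SENDRECV
end if

Since a defined logical is either .true. or .false., exactly one block executes; the "neither path" concern would only arise if rdma_mpi were left undefined.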
GPU Sync/Updates

In the non-RDMA path, a host update is done for the send buffer and a device update for the receive buffer, but there are no explicit GPU_WAIT fences around the MPI call beyond the per-branch update calls. Confirm that the $:GPU_UPDATE and $:GPU_WAIT semantics guarantee correctness and that device pointers are valid when entering and leaving MPI, especially with overlapping streams.

call nvtxStartRange("IB-MARKER-DEV2HOST")
$:GPU_UPDATE(host='[ib_buff_send]')
call nvtxEndRange

call nvtxStartRange("IB-MARKER-SENDRECV-NO-RMDA")
call MPI_SENDRECV( &
        ib_buff_send, buffer_count, MPI_INTEGER, dst_proc, send_tag, &
        ib_buff_recv, buffer_count, MPI_INTEGER, src_proc, recv_tag, &
        MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
call nvtxEndRange 

call nvtxStartRange("IB-MARKER-HOST2DEV")
$:GPU_UPDATE(device='[ib_buff_recv]')
call nvtxEndRange


qodo-merge-pro bot commented Aug 31, 2025

PR Code Suggestions ✨

Latest suggestions up to ae30de5

Category | Suggestion | Impact
General
Fix buffer initialization order

Restore the initialization order to populate grid buffers after modules that may
modify fields (IBM, sources) to avoid stale halos. Move the buffer population
call after these initializations to ensure consistent boundary data.

src/simulation/m_start_up.fpp [1369-1375]

-! Populating the buffers of the grid variables using the boundary conditions
-call s_populate_grid_variables_buffers()
-
 if (model_eqns == 3) call s_initialize_internal_energy_equations(q_cons_ts(1)%vf)
 if (ib) call s_ibm_setup()
 if (bodyForces) call s_initialize_body_forces_module()
 if (acoustic_source) call s_precalculate_acoustic_spatial_sources()
 
+! Populating the buffers of the grid variables using the boundary conditions
+call s_populate_grid_variables_buffers()
+
Suggestion importance[1-10]: 9


Why: This suggestion correctly identifies a potential regression in the PR, where moving the s_populate_grid_variables_buffers call could lead to stale buffer data and incorrect simulation results.

High
Incremental [*]
Fix MPI buffer size formulas

Align buffer_counts with the exact number of elements packed in each branch to prevent MPI size mismatches; for dir=1 use (n+1)*(p+1)*buff_size, for dir=2 use (m+2*buff_size+1)*(p+1)*buff_size, and for dir=3 use (m+2*buff_size+1)*(n+2*buff_size+1)*buff_size.

src/simulation/m_mpi_proxy.fpp [272-276]

 buffer_counts = (/ &
-                buff_size*(n + 1)*(p + 1), &
-                buff_size*(m + 2*buff_size + 1)*(p + 1), &
-                buff_size*(m + 2*buff_size + 1)*(n + 2*buff_size + 1) &
+                (n + 1)*(p + 1)*buff_size, &
+                (m + 2*buff_size + 1)*(p + 1)*buff_size, &
+                (m + 2*buff_size + 1)*(n + 2*buff_size + 1)*buff_size &
                  /)
Suggestion importance[1-10]: 7


Why: The suggestion correctly points out that the multiplication order in the buffer_counts calculation differs from the loop indexing, but since integer multiplication is commutative the computed counts are identical, so this is a readability improvement rather than a functional bug fix.

Medium
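
As a quick numeric check of the claim above (hypothetical grid dimensions, not taken from the PR), the two orderings give identical counts because integer multiplication is commutative:

program check_counts
    implicit none
    integer, parameter :: m = 49, n = 49, p = 49, buff_size = 10
    integer :: a(3), b(3)
    ! Original ordering of factors
    a = (/ buff_size*(n + 1)*(p + 1), &
           buff_size*(m + 2*buff_size + 1)*(p + 1), &
           buff_size*(m + 2*buff_size + 1)*(n + 2*buff_size + 1) /)
    ! Suggested ordering of factors
    b = (/ (n + 1)*(p + 1)*buff_size, &
           (m + 2*buff_size + 1)*(p + 1)*buff_size, &
           (m + 2*buff_size + 1)*(n + 2*buff_size + 1)*buff_size /)
    print *, all(a == b)   ! prints T
end program check_counts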
Possible issue
Assert correct MPI buffer sizes

Validate that buffer_count matches the actual packed size for each direction to
avoid MPI mismatches. Add runtime assertions after computing buffer_count and
before communication to catch mis-sized buffers when buff_size changes.

src/simulation/m_mpi_proxy.fpp [272-276]

 buffer_counts = (/ &
                 buff_size*(n + 1)*(p + 1), &
                 buff_size*(m + 2*buff_size + 1)*(p + 1), &
                 buff_size*(m + 2*buff_size + 1)*(n + 2*buff_size + 1) &
                 /)
 
-...
+buffer_count = buffer_counts(mpi_dir)
+
+! Sanity checks to avoid MPI size mismatches
+if (buffer_count <= 0) then
+    print *, "Invalid buffer_count for MPI dir=", mpi_dir, " count=", buffer_count
+    error stop "Invalid MPI buffer count"
+end if
 
 call MPI_SENDRECV( &
     ib_buff_send, buffer_count, MPI_INTEGER, dst_proc, send_tag, &
     ib_buff_recv, buffer_count, MPI_INTEGER, src_proc, recv_tag, &
     MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
Suggestion importance[1-10]: 3


Why: The suggestion adds a defensive check for buffer_count, which is good practice for robustness, but it addresses a very unlikely scenario, making its impact minor.

Low

Previous suggestions

Suggestions up to commit f074f32
Category | Suggestion | Impact
High-level
Pre-transfer GPU sync for RDMA

In the RDMA path you invoke MPI_SENDRECV with device pointers without
synchronizing prior GPU work, risking stale or partially-written data being
sent. Move the GPU_WAIT (or stream/queue sync) to before the MPI call (and only
keep a post-call sync if proven necessary), and ensure the subsequent unpack
consistently operates on the same memory space (device vs. host) for both paths.
Additionally, gate the RDMA branch on a robust runtime check for GPU-aware MPI
to prevent undefined behavior when rdma_mpi is misconfigured.

Examples:

src/simulation/m_mpi_proxy.fpp [354-362]
                    call nvtxStartRange("IB-MARKER-SENDRECV-RDMA")
                    call MPI_SENDRECV( &
                        ib_buff_send, buffer_count, MPI_INTEGER, dst_proc, send_tag, &
                        ib_buff_recv, buffer_count, MPI_INTEGER, src_proc, recv_tag, &
                        MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
                    call nvtxEndRange 

                    #:endcall GPU_HOST_DATA
                    $:GPU_WAIT()

Solution Walkthrough:

Before:

#:if rdma_mpi
    // Potentially using device pointers for MPI
    // without ensuring prior GPU kernels are complete.
    call MPI_SENDRECV( &
        ib_buff_send,  // device buffer
        ...,
        ib_buff_recv,  // device buffer
        ...
    )
    // Sync happens *after* the MPI call, which is too late for the send buffer.
    $:GPU_WAIT()
#:else
    // Safe non-RDMA path with explicit copies
    $:GPU_UPDATE(host='[ib_buff_send]')
    call MPI_SENDRECV(...)
    $:GPU_UPDATE(device='[ib_buff_recv]')
#:endif

After:

// At runtime, check if GPU-aware MPI is available
if (gpu_aware_mpi_is_available) then
    // RDMA path
    // Sync GPU *before* MPI call to ensure send buffer is fully written.
    $:GPU_WAIT()
    call MPI_SENDRECV( &
        ib_buff_send,  // device buffer
        ...,
        ib_buff_recv,  // device buffer
        ...
    )
    // A post-call sync might also be needed depending on MPI implementation
    $:GPU_WAIT()
else
    // Fallback to non-RDMA path
    $:GPU_UPDATE(host='[ib_buff_send]')
    call MPI_SENDRECV(...)
    $:GPU_UPDATE(device='[ib_buff_recv]')
endif
Suggestion importance[1-10]: 9


Why: This suggestion correctly identifies a critical race condition in the RDMA path where MPI_SENDRECV might access GPU data before it's ready, and also proposes a crucial design improvement by replacing a compile-time flag with a runtime check for GPU-aware MPI.

High
Possible issue
Synchronize before RDMA transfer

Ensure device writes to ib_buff_send are complete before initiating CUDA-aware
MPI. Add a GPU_WAIT before the RDMA MPI_SENDRECV (keep the post-wait to
guarantee visibility of the received data to subsequent GPU work). This prevents
racing the MPI with in-flight GPU kernels.

src/simulation/m_mpi_proxy.fpp [352-362]

 #:call GPU_HOST_DATA(use_device='[ib_buff_send, ib_buff_recv]')
                     
+$:GPU_WAIT()
 call nvtxStartRange("IB-MARKER-SENDRECV-RDMA")
 call MPI_SENDRECV( &
     ib_buff_send, buffer_count, MPI_INTEGER, dst_proc, send_tag, &
     ib_buff_recv, buffer_count, MPI_INTEGER, src_proc, recv_tag, &
     MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
 call nvtxEndRange 
 
 #:endcall GPU_HOST_DATA
 $:GPU_WAIT()
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a potential race condition where the MPI_SENDRECV could start before the GPU has finished writing to ib_buff_send, and proposes adding $:GPU_WAIT() to prevent this critical bug.

High
Synchronize before host update

Add a synchronization before copying ib_buff_send to host in the non-RDMA path
to ensure prior GPU kernels have finished writing. This prevents stale or
partially updated data from being sent via MPI.

src/simulation/m_mpi_proxy.fpp [364-366]

 call nvtxStartRange("IB-MARKER-DEV2HOST")
+$:GPU_WAIT()
 $:GPU_UPDATE(host='[ib_buff_send]')
 call nvtxEndRange
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a potential race condition where ib_buff_send could be copied to the host before the GPU has finished writing to it, and proposes adding $:GPU_WAIT() to prevent this critical bug.

High
General
Correct profiling label typo

Fix the NVTX range label typo to maintain profiling clarity and avoid splitting
metrics across misspelled ranges. Rename "NO-RMDA" to "NO-RDMA".

src/simulation/m_mpi_proxy.fpp [368-373]

-call nvtxStartRange("IB-MARKER-SENDRECV-NO-RMDA")
+call nvtxStartRange("IB-MARKER-SENDRECV-NO-RDMA")
 call MPI_SENDRECV( &
         ib_buff_send, buffer_count, MPI_INTEGER, dst_proc, send_tag, &
         ib_buff_recv, buffer_count, MPI_INTEGER, src_proc, recv_tag, &
         MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
 call nvtxEndRange
Suggestion importance[1-10]: 4


Why: The suggestion correctly identifies and fixes a typo in an NVTX profiling label from "NO-RMDA" to "NO-RDMA", which improves the clarity and correctness of performance profiling data.

Low

@anandrdbz
Contributor Author

@sbryngelson macOS CI has a problem building cmake btw


codecov bot commented Aug 31, 2025

Codecov Report

❌ Patch coverage is 18.00000% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.92%. Comparing base (88c0a11) to head (ae30de5).
⚠️ Report is 1 commits behind head on master.

Files with missing lines | Patch % | Lines
src/simulation/m_mpi_proxy.fpp | 6.25% | 30 Missing ⚠️
src/simulation/m_ibm.fpp | 23.07% | 9 Missing and 1 partial ⚠️
src/common/m_helper_basic.fpp | 50.00% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #990      +/-   ##
==========================================
- Coverage   40.93%   40.92%   -0.02%     
==========================================
  Files          70       70              
  Lines       20288    20299      +11     
  Branches     2517     2521       +4     
==========================================
+ Hits         8305     8307       +2     
- Misses      10447    10454       +7     
- Partials     1536     1538       +2     



@anandrdbz changed the title from "MPI Transfer for IBM Markers on multiple GPUs" to "Resolving bug with multiple ranks using IBM" on Aug 31, 2025
@sbryngelson
Member

@sbryngelson macOS CI has a problem building cmake btw

Please fix

Removed 'cmake' from the list of installed packages on macOS.
@sbryngelson self-requested a review as a code owner on August 31, 2025 12:17
@sbryngelson
Member

This doesn't run with case optimization on CPU/Phoenix for some reason... @anandrdbz please check logs

@anandrdbz
Contributor Author

Actually, this case revealed an important corner case: the IBM boundary and the processor boundary match almost exactly. You actually need a larger buffer size than I thought; I kept a limit of 6 = 2*gp_layers for the image point, but this isn't strictly sufficient. In this particular case, it worked after increasing the limit to 8.

But there's actually no deterministic way of calculating this, as it depends on the IBM geometry. For now I'll push a fix to get this test to pass, and I'll add a print message in the source code telling the user to increase buff_size if they run into this issue again.
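
A minimal sketch of the kind of runtime guard described above (ip_extent and the messages are hypothetical; the actual check in the commit may differ):

! Hypothetical guard: abort with an actionable message when an IBM image
! point needs more halo layers than the current buffer provides.
if (ib .and. ip_extent > buff_size) then
    print *, 'IBM image point needs ', ip_extent, &
        ' halo layers but buff_size is ', buff_size
    print *, 'Increase buff_size in the source code and recompile.'
    call s_mpi_abort('IBM image point outside of MPI halo')
end if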

@sbryngelson
Member

sbryngelson commented Sep 1, 2025

hmm... @anandrdbz given that halo exchanges / ghost cell exchanges are relatively cheap except in extreme strong-scaling cases, can we just put a rather generous upper bound on the ghost cells? This is a good find though, and I don't want to lose it in case it comes up again. I'm not sure how you want to add a warning message, but try to place it carefully/mindfully. Presumably the vast majority of cases with an IB will fail for a reason that has nothing to do with this 😄

@anandrdbz
Contributor Author

Yeah, this will fail uniquely, with a print message that's unique too, so it should be clear (you can check out my latest commit).

I have kept a generous upper bound of 10 for IBM cases, but there's no way to calculate a strict upper bound as it's geometry-dependent.

@anandrdbz
Contributor Author

Also @sbryngelson, the benchmark case with IBM will eventually fail the check after this latest commit due to the increase in buffer size from 6 to 10 (I checked this for CPUs on Phoenix), with a 20% or so slowdown in grind time (I think the allowed limit is about 5-10%).

@sbryngelson self-requested a review on September 1, 2025 19:05
@sbryngelson merged commit fe87fc5 into MFlowCode:master on Sep 1, 2025
45 of 49 checks passed
Development

Successfully merging this pull request may close these issues.

IBM gives different results with MPI (#875)