Conversation
rohanbabbar04
commented
Feb 8, 2026
- Add CuPy Synchronization before MPI Calls.
@mrava87 When I was testing it with multiple GPUs, during some instances I was getting wrong values. I figured out that it was because the GPU works asynchronously, and when an MPI call was made the GPU work was not completed. We should ensure that the device is synchronized before an MPI call is made. |
@rohanbabbar04 thanks a lot for reporting this and for this PR! I am however a bit confused as I do not recall experiencing this before (including in our GPU CI)... can I ask a few questions:
Great, this starts to make more sense 😊 Ok, then looking at what you did I have some questions:
I tried it with both OpenMPI+CUDA and MPICH+CUDA. I found that the tests pass with OpenMPI but not with MPICH. I came across open-mpi/ompi#7733 (comment), where they discuss the same issue in OpenMPI. Digging deeper, it looks like OpenMPI synchronizes on its own, and that is why it works there (this looks like the fix, but I would need to look into the previous conversations).
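As a side note, a quick (purely diagnostic) way to confirm which MPI implementation mpi4py is linked against when comparing the two setups:

```python
# Diagnostic sketch: report which MPI implementation mpi4py is built against
from mpi4py import MPI

if MPI.COMM_WORLD.Get_rank() == 0:
    # prints e.g. "Open MPI v4.1.x, ..." or "MPICH Version: 4.x ..."
    print(MPI.Get_library_version())
```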
I think it makes sense that the lower-case (non-buffered) MPI calls do not need CUDA-related synchronization on top, because the data communication is done CPU-to-CPU (after copying from the GPU); MPI_Comm takes care of that. The issue becomes more complicated with CUDA-aware MPI (the buffered versions: Allreduce, etc.). In the particular thread @mrava87 mentioned, one of the maintainers said: At that time, I had the impression that "just the API call to Send/Recv is enough and no manual synchronization is needed in any scenario," which I then generalized to the other collective communication calls as well. From the two discussion threads pointed out by @rohanbabbar04, it looks like the community agrees that if the host (CPU) performs the data transfer (the lower-case, non-buffered communication in our case), no extra synchronization is needed. But when the data is generated by CUDA kernels and passed directly to MPI, extra synchronization is suggested. It looks like we need the additional housekeeping for CUDA-aware MPI that @rohanbabbar04 proposed.
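To make the distinction concrete, an illustrative sketch (not the PyLops-MPI implementation) of the two flavours of calls with a CuPy array: the lower-case method goes through the host, while the upper-case method hands the raw device buffer to MPI and therefore needs the producing kernels to have finished first.

```python
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
y = cp.ones(10) * comm.Get_rank()

# Lower-case (non-buffered): the object is serialized via the host, and the
# device-to-host copy implicitly waits for the kernels that produced it.
total_scalar = comm.allreduce(float(y.sum()), op=MPI.SUM)

# Upper-case (buffered, CUDA-aware): MPI receives the raw GPU pointer,
# so we must make sure the kernels that produced `y` have completed.
cp.cuda.get_current_stream().synchronize()
total_arr = cp.empty_like(y)
comm.Allreduce(y, total_arr, op=MPI.SUM)
```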
@tharittk thanks 😄 do you agree with either/both of the strategies suggested:
@mrava87 I agree on the first point. But on the second point... I lean more towards the lowest common denominator of all distributions, i.e., something that may hurt performance for some but works correctly for all. My gut feeling says adding distribution-specific logic to our high-level PyLops-MPI is a bit of a slippery slope.
Yeah I agree! I gave it a second thought and I feel the same: too many code bifurcations can cause headaches for us in the long term (and also make it harder for users if we keep adding env variables to define behaviors). Let's go for the solution proposed by Rohan in this PR - we want codes to always work, even if in some cases they may be a bit suboptimal 😊 This also shows once again that, if one can, they should use NCCL for GPU codes. I'll review this PR; @tharittk it would be great if you could also do it.
Just a minor push to only synchronize when engine="cupy".
Add check for engine="cupy"
force-pushed from 3940ce4 to bf0c427
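A rough sketch of what such a guard could look like (the helper name and the type check are illustrative, not the code in this commit): synchronize only when the buffer is actually a CuPy array, so NumPy-backed runs pay no extra cost.

```python
def _maybe_synchronize(arr):
    """Illustrative helper: sync the current CUDA stream only for CuPy arrays."""
    if type(arr).__module__.startswith("cupy"):
        import cupy as cp  # lazy import so CPU-only runs never require CuPy
        cp.cuda.get_current_stream().synchronize()

# usage before a buffered MPI call, e.g.
#   _maybe_synchronize(sendbuf)
#   comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
```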
Same here 😊 thanks Rohan! Just waiting for us (me and @hongyx11) to get the GPU CI back up so we can check it all runs fine, then I'll merge. PS: I think it's about time for a release, given that we have not done one since we changed our communication backend to the unified one. Will try to get it going this week...
