
Add CuPy synchronization #182

Merged
mrava87 merged 4 commits into main from cuy-sync on Feb 11, 2026

Conversation

@rohanbabbar04
Collaborator

  • Add CuPy Synchronization before MPI Calls.

@rohanbabbar04
Collaborator Author

rohanbabbar04 commented Feb 9, 2026

@mrava87 When I was testing it with multiple GPUs, in some instances I was getting wrong values. I figured out that this was because the GPU works asynchronously: when an MPI call was made, the GPU work had not yet completed. We should ensure that the device is synchronized before an MPI call is made.
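For illustration, a minimal sketch of the race and the fix (not the PyLops-MPI code; the array and the Allreduce here are made up for the example): CuPy launches kernels asynchronously, so a buffer-based MPI call issued right after a kernel may read GPU memory before the kernel has finished writing it. Synchronizing the current stream (or the whole device) before the call closes that window:

```python
# Minimal sketch of the race and the fix (illustrative, not PyLops-MPI code).
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD

x = cp.ones(10_000_000, dtype=cp.float32)
y = 2.0 * x  # kernel is launched asynchronously and may still be running

# With CUDA-aware MPI, the buffer-based call below may read y directly from
# GPU memory, so we must wait for the kernel to finish first.
cp.cuda.get_current_stream().synchronize()  # or cp.cuda.runtime.deviceSynchronize()

out = cp.empty_like(y)
comm.Allreduce(y, out)  # now guaranteed to see the completed data
```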

@mrava87
Contributor

mrava87 commented Feb 9, 2026

@rohanbabbar04 thanks a lot for reporting this and for this PR!

I am however a bit confused as I do not recall experiencing this before (including in our GPU CI)... can I ask a few questions:

  • can you provide a precise description of what you mean by "I was testing it with multiple GPUs"? Is it the test suite or something else?
  • did you run your GPU tests with or without CUDA-aware MPI (aka PYLOPS_MPI_CUDA_AWARE=1 or PYLOPS_MPI_CUDA_AWARE=0)? Our test suite runs with PYLOPS_MPI_CUDA_AWARE=0 as we couldn't get a CUDA-aware MPI version up and running... and I must say I had a few issues myself doing it locally (though I eventually managed). So if it is CUDA-aware MPI, that may be it; if it is not, it is a bit strange
  • @tharittk I recall from our interaction with the mpi4py developers that we thought manual synchronization was not needed (at least for the lowercase comm methods - Performance implication on Send/Recv vs send/recv for CuPy array mpi4py/mpi4py#657 (reply in thread))... what do you think about what @rohanbabbar04 is reporting and suggesting?

@mrava87 mrava87 requested review from mrava87 and tharittk and removed request for tharittk February 9, 2026 08:52
@rohanbabbar04
Collaborator Author

rohanbabbar04 commented Feb 9, 2026

  • Yes, I have access to 4 GPUs. I was running some examples with them when I came across some strange assertion errors.
  • I installed MPICH+CUDA using Spack and ran the tests with PYLOPS_MPI_CUDA_AWARE=1; that is when I got these errors. When PYLOPS_MPI_CUDA_AWARE=0, the tests pass.
  • It is not handled for the MPI standard (uppercase, buffer-based) funcs (like Allgather, Allreduce, etc.) in the library, whereas it looks like it works for the more Pythonic lowercase functions (allgather, allreduce, etc.); see the sketch at the end of this comment.
    This is what is mentioned in the conversation as well, I think.
[Screenshot: excerpt from the mpi4py discussion thread]

> In general, though, we want the users to synchronize on an as-needed basis, instead of us taking over the control, so as to reduce latency and not hamper the performance.

I think the two statements above are the ones that relate to this issue.
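For context, a minimal sketch of the two mpi4py flavors with CuPy arrays (illustrative; the arrays and values are made up): the lowercase methods stage data through the host, and the device-to-host copy already waits for pending kernels, while the uppercase buffer-based methods can read GPU memory directly under CUDA-aware MPI, which is where the missing synchronization bites:

```python
# Lowercase (pickle-based) vs uppercase (buffer-based) mpi4py calls with CuPy.
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
x = cp.arange(4, dtype=cp.float64) * (comm.Get_rank() + 1)

# Lowercase: the object is staged through the host; the device-to-host copy
# is stream-ordered after the kernel that produced x, so no extra sync needed.
total = comm.allreduce(float(x.sum()))

# Uppercase: with CUDA-aware MPI the library may read the GPU buffer
# directly, so pending kernels must be synchronized first.
y = cp.empty_like(x)
cp.cuda.get_current_stream().synchronize()
comm.Allreduce(x, y)
```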

@mrava87
Contributor

mrava87 commented Feb 9, 2026

Great, this starts to make more sense 😊

Ok, then looking at what you did I have some questions:

  • could we use the env variable to check if an explicit synchronization is needed (for CUDA-aware MPI) versus not needed (for non-CUDA-aware MPI)? I just want to avoid adding an extra sync when it's not needed, as it could potentially lead to a small slowdown
  • this is more for @hongyx11 really, do you think we could use spack in the GA to also get Cuda-aware MPI working?

@rohanbabbar04
Collaborator Author

rohanbabbar04 commented Feb 9, 2026

I tried it with both OpenMPI+CUDA and MPICH+CUDA. I found that the tests pass with OpenMPI but not with MPICH. I came across this link open-mpi/ompi#7733 (comment) where they discuss the same issue in OpenMPI. Digging deeper, it looks like OpenMPI synchronizes on its own, which is why it works there (this looks like the fix, but I would need to look into the previous conversations).

@mrava87 mrava87 added the bug label, and added and then removed the documentation label, Feb 9, 2026
@rohanbabbar04
Collaborator Author

rohanbabbar04 commented Feb 9, 2026

> Great, this starts to make more sense 😊
>
> Ok, then looking at what you did I have some questions:
>
>   • could we use the env variable to check if an explicit synchronization is needed (for CUDA-aware MPI) versus not needed (for non-CUDA-aware MPI)? I just want to avoid adding an extra sync when it's not needed, as it could potentially lead to a small slowdown
>   • this is more for @hongyx11 really, do you think we could use Spack in the GA to also get CUDA-aware MPI working?

  • I like the idea; we can introduce a new variable named GPU_DEVICE_SYNC. If the user sets it to 0, we skip the sync; otherwise we execute it. I think this is the best option, since some MPI implementations may handle it internally (see the sketch below).
  • Spack is open source - Spack-docs.
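A minimal sketch of what such an opt-out could look like (GPU_DEVICE_SYNC is the name proposed above; the helper itself is hypothetical, not the PR's actual code):

```python
# Hypothetical helper gating device synchronization on an env variable.
import os

import cupy as cp

# Proposed opt-out: synchronize unless the user explicitly sets
# GPU_DEVICE_SYNC=0 (correct by default, opt out for MPI builds that
# already synchronize internally).
GPU_DEVICE_SYNC = os.environ.get("GPU_DEVICE_SYNC", "1") == "1"


def _maybe_sync_device() -> None:
    """Synchronize the current CUDA stream before a buffer-based MPI call."""
    if GPU_DEVICE_SYNC:
        cp.cuda.get_current_stream().synchronize()
```

Called right before each buffer-based call (Allreduce, Allgather, ...), this keeps correctness as the default while letting users skip the sync when their MPI implementation handles it internally.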

@tharittk
Collaborator

tharittk commented Feb 9, 2026

I think it makes sense that the lower-case (non-buffered) MPI calls do not need CUDA-related synchronization on top, because the data communication is done via CPU-to-CPU (after copying from the GPU). MPI_Comm takes care of that.

Now the issue becomes more complicated with CUDA-aware MPI (the buffered version: Allreduce, etc.). In that particular thread @mrava87 mentioned, one of the maintainers said:

> Send/Recv are blocking p2p communications, and the MPI and CUDA programming guides still apply.

At that time, I had the impression that "just the API call to Send/Recv is enough and no manual synchronization is needed in any scenario," which I then generalized to the other collective communication calls as well.

From the two discussion threads pointed out by @rohanbabbar04, it looks like the community agrees that if the host (CPU) performs the data transfer (lower-case, non-buffered communication in our case), no extra synchronization is needed. But when the data is generated by CUDA kernels, extra synchronization is suggested.

It looks like we need additional housekeeping for CUDA-aware MPI, as @rohanbabbar04 proposed.

@mrava87
Contributor

mrava87 commented Feb 9, 2026

@tharittk thanks 😄 do you agree with either/both of the strategies suggested:

  • use PYLOPS_MPI_CUDA_AWARE to decide whether we need to synchronize;
  • be even more granular and add GPU_DEVICE_SYNC... I haven't read the links that Rohan posted yet, but it seems that for CUDA-aware MPI the need for synchronization depends even on the MPI distribution (so it is not enough to know that one is using CUDA-aware MPI to know that manual sync is needed; of course, doing it never does harm even if it is also done internally, but it can be suboptimal... not doing it does harm if it is not done internally...)

@tharittk
Collaborator

tharittk commented Feb 9, 2026

@mrava87 I agree on the first point. On the second point... I lean more towards the lowest common denominator of all distributions, i.e., it may hurt performance for some but works correctly for all. My gut feeling says adding distribution-specific logic to our high-level PyLops-MPI is a bit of a slippery slope.

@mrava87
Contributor

mrava87 commented Feb 9, 2026

Yeah, I agree! I gave it a second thought and I feel the same: too many code bifurcations can cause headaches in the long term for us (and also make it harder for users if we keep adding env variables to define behaviors).

Let's go for the solution proposed by Rohan in this PR - we want codes to always work, even if in some cases they may be a bit suboptimal 😊 and this also shows once again that, if one can, they should use NCCL for GPU codes.
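For reference, a minimal sketch of why NCCL sidesteps the problem (illustrative, assuming CuPy's cupy.cuda.nccl bindings and the common pattern of bootstrapping the NCCL id over MPI): NCCL collectives are enqueued on a CUDA stream, so they are ordered after the kernels that produce the data and need no explicit device synchronization:

```python
# Illustrative: NCCL collectives are stream-ordered, so no manual sync is
# needed between the producing kernel and the collective.
import cupy as cp
from cupy.cuda import nccl
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
# (Per-rank device selection, e.g. cp.cuda.Device(rank).use(), omitted.)

# Bootstrap: broadcast the NCCL unique id over MPI.
uid = nccl.get_unique_id() if rank == 0 else None
uid = comm.bcast(uid, root=0)
nccl_comm = nccl.NcclCommunicator(size, uid, rank)

x = cp.ones(1024, dtype=cp.float32) * rank  # async kernel
out = cp.empty_like(x)
stream = cp.cuda.get_current_stream()
# Enqueued on the same stream, hence ordered after the kernel above.
nccl_comm.allReduce(x.data.ptr, out.data.ptr, x.size,
                    nccl.NCCL_FLOAT32, nccl.NCCL_SUM, stream.ptr)
```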

I'll review this PR, @tharittk great if you could also do it

@rohanbabbar04
Collaborator Author

Just a minor push to only synchronize when PYLOPS_MPI_CUDA_AWARE=1.
I also agree it is better not to introduce a new env variable for syncing; as @tharittk suggested, the safest way to avoid issues is to keep it always turned on. As suggested in open-mpi/ompi#7733 (comment), it is better to call it before every MPI call to avoid any issues.
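A minimal sketch of the resulting behavior (a hypothetical helper; the actual PR code may differ): synchronize only when PYLOPS_MPI_CUDA_AWARE=1 and the operand lives on the GPU (engine="cupy"):

```python
# Hypothetical sketch of the merged behavior: sync only for CUDA-aware MPI
# and only for CuPy operands (engine="cupy").
import os

# Default value assumed for this sketch.
CUDA_AWARE = os.environ.get("PYLOPS_MPI_CUDA_AWARE", "0") == "1"


def _sync_if_needed(engine: str) -> None:
    """Synchronize the device before a buffer-based MPI call on GPU data."""
    if CUDA_AWARE and engine == "cupy":
        import cupy as cp

        cp.cuda.get_current_stream().synchronize()
```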

Add check for engine="cupy"
Collaborator

@tharittk tharittk left a comment


Looks good to me!

@mrava87
Contributor

mrava87 commented Feb 11, 2026

Same here 😊 thanks Rohan!

Just waiting for us (me and @hongyx11) to get the GPU CI back on so we can check it all runs fine, then I'll merge.

PS: I think it's about time for a release, given that we have not done one since we changed our communication backend to the unified one. Will try to get it going this week...

@mrava87 mrava87 merged commit 75eed01 into main Feb 11, 2026
55 of 67 checks passed
@mrava87 mrava87 deleted the cuy-sync branch February 11, 2026 16:19

Labels

bug Something isn't working
