Conversation
rohanbabbar04
commented
Feb 8, 2026
- Add CuPy Synchronization before MPI Calls.
@mrava87 When I was testing it with multiple GPUs, during some instances I was getting wrong values. I figured out that it was because the GPU works asynchronously, and when an MPI call was made the GPU work was not completed. We should ensure that the device is synchronized before an MPI call is made. |
@rohanbabbar04 thanks a lot for reporting this and for this PR! I am however a bit confused as I do not recall experiencing this before (including in our GPU CI)... can I ask a few questions:
Great, this starts to make more sense 😊 Ok, then looking at what you did I have some questions:
I tried it with both OpenMPI+CUDA and MPICH+CUDA. I found that the tests pass with OpenMPI but not with MPICH. I came across open-mpi/ompi#7733 (comment), where they discuss the same issue in OpenMPI. Digging deeper, it looks like OpenMPI synchronizes on its own, and that is why it works there (this looks like the fix, but I would need to look into the previous conversations).
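As a side note, a quick (purely diagnostic) way to confirm which MPI implementation mpi4py is linked against when comparing the two setups:

```python
# Diagnostic sketch: report which MPI implementation mpi4py is built against
from mpi4py import MPI

if MPI.COMM_WORLD.Get_rank() == 0:
    # prints e.g. "Open MPI v4.1.x, ..." or "MPICH Version: 4.x ..."
    print(MPI.Get_library_version())
```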
I think it makes sense that the lower-case (non-buffered) MPI calls do not need CUDA-related synchronization on top, because the data communication is done CPU-to-CPU (after copying from the GPU); MPI_Comm takes care of that. The issue becomes more complicated with CUDA-aware MPI (the buffered versions: Allreduce, etc.). In the particular thread @mrava87 mentioned, one of the maintainers said: At that time, I had the impression that "just the API call to Send/Recv is enough and no manual synchronization is needed in any scenario," which I then generalized to the other collective communication calls as well. From the two discussion threads pointed out by @rohanbabbar04, it looks like the community agrees that if the host (CPU) performs the data transfer (the lower-case, non-buffered communication in our case), no extra synchronization is needed. But when the data is generated by CUDA kernels and passed directly to MPI, extra synchronization is suggested. It looks like we need the additional housekeeping for CUDA-aware MPI that @rohanbabbar04 proposed.
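To make the distinction concrete, an illustrative sketch (not the PyLops-MPI implementation) of the two flavours of calls with a CuPy array: the lower-case method goes through the host, while the upper-case method hands the raw device buffer to MPI and therefore needs the producing kernels to have finished first.

```python
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
y = cp.ones(10) * comm.Get_rank()

# Lower-case (non-buffered): the object is serialized via the host, and the
# device-to-host copy implicitly waits for the kernels that produced it.
total_scalar = comm.allreduce(float(y.sum()), op=MPI.SUM)

# Upper-case (buffered, CUDA-aware): MPI receives the raw GPU pointer,
# so we must make sure the kernels that produced `y` have completed.
cp.cuda.get_current_stream().synchronize()
total_arr = cp.empty_like(y)
comm.Allreduce(y, total_arr, op=MPI.SUM)
```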
@tharittk thanks 😄 do you agree with either/both of the strategies suggested:
@mrava87 I agree on the first point. But on the second point... I lean more towards the lowest common denominator of all distributions, i.e., something that may hurt performance for some but works correctly for all. My gut feeling says adding distribution-specific logic to our high-level PyLops-MPI is a bit of a slippery slope.
Yeah I agree! I gave it a second thought and I feel the same: too many code bifurcations can cause headaches for us in the long term (and also make it harder for users if we keep adding env variables to define behaviors). Let's go for the solution proposed by Rohan in this PR - we want codes to always work, even if in some cases they may be a bit suboptimal 😊 This also shows once again that, if one can, they should use NCCL for GPU codes. I'll review this PR; @tharittk it would be great if you could also do it.
Just a minor push to only synchronize when engine="cupy".
Add check for engine="cupy"
force-pushed from 3940ce4 to bf0c427
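A rough sketch of what such a guard could look like (the helper name and the type check are illustrative, not the code in this commit): synchronize only when the buffer is actually a CuPy array, so NumPy-backed runs pay no extra cost.

```python
def _maybe_synchronize(arr):
    """Illustrative helper: sync the current CUDA stream only for CuPy arrays."""
    if type(arr).__module__.startswith("cupy"):
        import cupy as cp  # lazy import so CPU-only runs never require CuPy
        cp.cuda.get_current_stream().synchronize()

# usage before a buffered MPI call, e.g.
#   _maybe_synchronize(sendbuf)
#   comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
```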
Same here 😊 thanks Rohan! Just waiting for us (me and @hongyx11) to get the GPU CI back up so we can check it all runs fine, then I'll merge. PS: I think it's about time for a release, given that we have not done one since we changed our communication backend to the unified one. Will try to get it going this week...
