
Fix the lack of direct GPU-to-GPU communication in multi-device runs #642

Merged

Conversation

therault
Contributor

@therault therault commented Mar 8, 2024

In #570, we moved from using cuda_index to using device_index in the 'nvlink' mask that decides whether we can communicate directly with another GPU. However, this bitmask was initialized at query time, before devices get assigned a device_index. As a consequence, the bitmask was wrong and no direct device-to-device communication was happening.

In this PR, we add a step, after all devices have been registered, to complete this initialization.

An alternative is to go back to using cuda_index-based bitmasks, but then the decision whether two GPUs can communicate directly becomes device-type specific and needs to move from device_gpu.c to device_cuda_module.c, which means adding another function call to device_cuda_module.c inside the stage_in() function of device_gpu.c.
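For concreteness, here is a minimal sketch of the deferred-initialization idea, not the actual PArSEC code: the peer mask is filled in a second pass that runs only after every device has received its final device_index. cudaDeviceCanAccessPeer is the real CUDA runtime call; gpu_device_t, peer_mask, and complete_peer_masks are illustrative names.

```c
#include <stdint.h>
#include <cuda_runtime.h>

/* Illustrative device record (PArSEC's real structures differ):
 * cuda_index is the CUDA ordinal known at query time, device_index
 * is the runtime-wide index assigned only at registration. */
typedef struct {
    int      cuda_index;
    int      device_index;
    uint64_t peer_mask;   /* bit i set: direct D2D with device_index i */
} gpu_device_t;

/* Run once, after ALL devices are registered, so device_index is final.
 * Assumes device_index < 64 so it fits in the bitmask. */
static void complete_peer_masks(gpu_device_t *devs, int ndevs)
{
    for (int i = 0; i < ndevs; i++) {
        devs[i].peer_mask = 0;
        for (int j = 0; j < ndevs; j++) {
            int can_access = 0;
            if (i != j &&
                cudaSuccess == cudaDeviceCanAccessPeer(&can_access,
                                                       devs[i].cuda_index,
                                                       devs[j].cuda_index) &&
                can_access) {
                devs[i].peer_mask |= UINT64_C(1) << devs[j].device_index;
            }
        }
    }
}
```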

@therault therault requested a review from a team as a code owner March 8, 2024 19:55
@bosilca
Contributor

bosilca commented Mar 8, 2024

Why did we move from using cuda_index to using the device_index? We cannot use d2d on devices of different types, and we check for that in parsec_default_gpu_stage_in and parsec_default_gpu_stage_out, so using the device_index in the mask makes little sense. Going back and rebuilding this mask based on cuda_index seems like a simpler and more resilient solution (we could change the device_index without having to rebuild all the info).
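For comparison, a hedged sketch of the cuda_index-keyed alternative described above; all names here are hypothetical, not PArSEC's actual API. Because cuda_index is only meaningful among CUDA devices, this test would have to live in device-type-specific code such as device_cuda_module.c:

```c
#include <stdint.h>

/* Hypothetical CUDA device record (PArSEC's real structures differ). */
typedef struct {
    int      cuda_index;  /* stable CUDA ordinal, known at query time  */
    uint64_t nvlink_mask; /* bit i set: peer access to cuda_index i    */
} cuda_device_t;

/* Keyed by cuda_index, the mask can be filled at query time and
 * survives any later renumbering of device_index; the trade-off is
 * that this check belongs in CUDA-specific code, not device_gpu.c. */
static inline int can_use_d2d(const cuda_device_t *dst,
                              const cuda_device_t *src)
{
    return 0 != (dst->nvlink_mask & (UINT64_C(1) << src->cuda_index));
}
```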

@therault
Contributor Author

therault commented Mar 8, 2024 via email

to prevent their reuse as candidates on other gpus.
@abouteiller
Contributor

Passes all ctests and works with large-size POTRF -g 8 with NVLink active on Leconte.

@abouteiller abouteiller self-requested a review March 11, 2024 22:55
@abouteiller abouteiller added this to the v4.0 milestone Mar 11, 2024
Review threads:
parsec/mca/device/device_gpu.c (outdated, resolved)
parsec/mca/device/device_gpu.c (outdated, resolved)
parsec/mca/device/device_gpu.c (resolved)
@bosilca
Contributor

bosilca commented Mar 12, 2024

I can understand how the need to add debugging messages can be justified as part of this PR, but there are other things (such as changing the coherency state or moving data copy version manipulation code across function calls) that do not fit into the description of this PR.

Also, there are several instances where the indentation of the new code is incorrect, and there are 8 commits (half of them not signed) for a seemingly minor issue.

Review threads:
parsec/mca/device/device_gpu.c (outdated, resolved)
parsec/scheduling.c (outdated, resolved)
tests/runtime/cuda/stage_main.c (outdated, resolved)
tests/runtime/cuda/stage_main.c (outdated, resolved)
@abouteiller
Contributor

> I can understand how the need to add debugging messages can be justified as part of this PR, but there are other things (such as changing the coherency state or moving data copy version manipulation code across function calls) that do not fit into the description of this PR.
>
> Also, there are several instances where the indentation of the new code is incorrect, and there are 8 commits (half of them not signed) for a seemingly minor issue.

The issue is not minor at all. Turning D2D management back on unearthed a whole bunch of bugs that would produce wrong results (in particular in TRSM, where we exercise CPU->GPU1,2,3->CPU data motions). While I agree that the description of the PR does not match the real scope of what is achieved here, the changes are not random: they serve the larger goal of making D2D work at a basic level.

@abouteiller abouteiller self-requested a review March 12, 2024 21:59
@abouteiller
Contributor

I will be extracting the fix for #641 into a separate PR. @therault, that means I will be rewriting history in your branch to achieve this, sorry.

bugfix: properly compute the number of readers when we impersonate the other gpu-manager during end of D2D transfer
bugfix: d2d_complete tasks do not have a data_out set
Add some comments for clarification, address review remarks
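For context on the first bugfix commit above, a minimal, assumption-laden sketch of what "impersonating" the other gpu-manager at the end of a D2D transfer can involve: the completing manager releases a reader reference on the source copy on behalf of its peer, so the count must be updated atomically. data_copy_t and d2d_transfer_complete are illustrative names, not PArSEC's actual structures.

```c
#include <stdatomic.h>

/* Illustrative data-copy record; PArSEC's parsec_data_copy_t differs. */
typedef struct {
    atomic_int readers;   /* tasks currently reading this copy */
    int        version;
} data_copy_t;

/* Called by the gpu-manager acting for the peer device at the end of
 * a D2D transfer: atomically drop the reader reference the transfer
 * held on the source copy, so it can be reclaimed correctly. */
static void d2d_transfer_complete(data_copy_t *src_copy)
{
    atomic_fetch_sub_explicit(&src_copy->readers, 1,
                              memory_order_acq_rel);
}
```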
@abouteiller abouteiller force-pushed the fix-component-initialization-GPUs branch from 41e8e9b to 2ab2fc4 on March 12, 2024 22:38
@abouteiller abouteiller merged commit 1ababbe into ICLDisco:master Apr 1, 2024
3 of 4 checks passed