
Fix the lack of direct GPU-to-GPU communication in multi-device runs #642

Merged

Conversation

therault
Contributor

@therault therault commented Mar 8, 2024

In #570, we moved from using cuda_index to using device_index in the 'nvlink' mask that decides whether we can communicate directly with another GPU. However, this bitmask was initialized at query time, before devices get assigned a device_index. As a consequence, the bitmask was wrong and no direct device-to-device communication was happening.

In this PR, we add a step, after all devices have been registered, to complete this initialization.

An alternative is to go back to using cuda_index-based bitmasks, but then the decision whether two GPUs can communicate directly becomes device-type specific and needs to move from device_gpu.c to device_cuda_module.c, which means adding another function call to device_cuda_module.c inside the stage_in() function of device_gpu.c.
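For concreteness, here is a minimal sketch of the deferred-initialization idea, not the actual PArSEC code: the peer mask is filled in a second pass that runs only after every device has received its final device_index. cudaDeviceCanAccessPeer is the real CUDA runtime call; gpu_device_t, peer_mask, and complete_peer_masks are illustrative names.

```c
#include <stdint.h>
#include <cuda_runtime.h>

/* Illustrative device record (PArSEC's real structures differ):
 * cuda_index is the CUDA ordinal known at query time, device_index
 * is the runtime-wide index assigned only at registration. */
typedef struct {
    int      cuda_index;
    int      device_index;
    uint64_t peer_mask;   /* bit i set: direct D2D with device_index i */
} gpu_device_t;

/* Run once, after ALL devices are registered, so device_index is final.
 * Assumes device_index < 64 so it fits in the bitmask. */
static void complete_peer_masks(gpu_device_t *devs, int ndevs)
{
    for (int i = 0; i < ndevs; i++) {
        devs[i].peer_mask = 0;
        for (int j = 0; j < ndevs; j++) {
            int can_access = 0;
            if (i != j &&
                cudaSuccess == cudaDeviceCanAccessPeer(&can_access,
                                                       devs[i].cuda_index,
                                                       devs[j].cuda_index) &&
                can_access) {
                devs[i].peer_mask |= UINT64_C(1) << devs[j].device_index;
            }
        }
    }
}
```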

@therault therault requested a review from a team as a code owner March 8, 2024 19:55
@bosilca
Contributor

bosilca commented Mar 8, 2024

Why did we move from using cuda_index to using the device_index? We cannot use d2d on devices of different types, and we check for that in parsec_default_gpu_stage_in and parsec_default_gpu_stage_out, so using the device_index in the mask makes little sense. Going back and rebuilding this mask based on cuda_index seems like a simpler and more resilient solution (we could change the device_index without having to rebuild all the info).
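For comparison, a hedged sketch of the cuda_index-keyed alternative described above; all names here are hypothetical, not PArSEC's actual API. Because cuda_index is only meaningful among CUDA devices, this test would have to live in device-type-specific code such as device_cuda_module.c:

```c
#include <stdint.h>

/* Hypothetical CUDA device record (PArSEC's real structures differ). */
typedef struct {
    int      cuda_index;  /* stable CUDA ordinal, known at query time  */
    uint64_t nvlink_mask; /* bit i set: peer access to cuda_index i    */
} cuda_device_t;

/* Keyed by cuda_index, the mask can be filled at query time and
 * survives any later renumbering of device_index; the trade-off is
 * that this check belongs in CUDA-specific code, not device_gpu.c. */
static inline int can_use_d2d(const cuda_device_t *dst,
                              const cuda_device_t *src)
{
    return 0 != (dst->nvlink_mask & (UINT64_C(1) << src->cuda_index));
}
```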

@therault
Contributor Author

therault commented Mar 8, 2024 via email

to prevent their reuse as candidates on other gpus.
@abouteiller
Contributor

Passes all ctests and works with large-size POTRF -g 8 with NVLink active on Leconte.

@abouteiller abouteiller self-requested a review March 11, 2024 22:55
@abouteiller abouteiller added this to the v4.0 milestone Mar 11, 2024
Review threads:
parsec/mca/device/device_gpu.c (outdated, resolved)
parsec/mca/device/device_gpu.c (outdated, resolved)
parsec/mca/device/device_gpu.c (resolved)
@bosilca
Contributor

bosilca commented Mar 12, 2024

I can understand how the need to add debugging messages can be justified as part of this PR, but there are other things (such as changing the coherency state or moving data copy version manipulation code across function calls) that do not fit into the description of this PR.

Also, there are several instances where the indentation of the new code is incorrect, and there are 8 commits (half of them not signed) for a seemingly minor issue.

Review threads:
parsec/mca/device/device_gpu.c (outdated, resolved)
parsec/scheduling.c (outdated, resolved)
tests/runtime/cuda/stage_main.c (outdated, resolved)
tests/runtime/cuda/stage_main.c (outdated, resolved)
@abouteiller
Contributor

> I can understand how the need to add debugging messages can be justified as part of this PR, but there are other things (such as changing the coherency state or moving data copy version manipulation code across function calls) that do not fit into the description of this PR.
>
> Also, there are several instances where the indentation of the new code is incorrect, and there are 8 commits (half of them not signed) for a seemingly minor issue.

The issue is not minor at all. Turning D2D management back on unearthed a whole bunch of bugs that would produce wrong results (in particular in TRSM, where we exercise CPU->GPU1,2,3->CPU data motions). While I agree that the description of the PR does not match the real scope of what is achieved here, the changes are not random: they serve the larger goal of making D2D work at a basic level.

@abouteiller abouteiller self-requested a review March 12, 2024 21:59
@abouteiller
Contributor

I will be extracting the fix for #641 into a separate PR. @therault, that means I will be rewriting history in your branch to achieve this, sorry.

bugfix: properly compute the number of readers when we impersonate the other gpu-manager during end of D2D transfer
bugfix: d2d_complete tasks do not have a data_out set
Add some comments for clarification, address review remarks
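For context on the first bugfix commit above, a minimal, assumption-laden sketch of what "impersonating" the other gpu-manager at the end of a D2D transfer can involve: the completing manager releases a reader reference on the source copy on behalf of its peer, so the count must be updated atomically. data_copy_t and d2d_transfer_complete are illustrative names, not PArSEC's actual structures.

```c
#include <stdatomic.h>

/* Illustrative data-copy record; PArSEC's parsec_data_copy_t differs. */
typedef struct {
    atomic_int readers;   /* tasks currently reading this copy */
    int        version;
} data_copy_t;

/* Called by the gpu-manager acting for the peer device at the end of
 * a D2D transfer: atomically drop the reader reference the transfer
 * held on the source copy, so it can be reclaimed correctly. */
static void d2d_transfer_complete(data_copy_t *src_copy)
{
    atomic_fetch_sub_explicit(&src_copy->readers, 1,
                              memory_order_acq_rel);
}
```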
@abouteiller abouteiller force-pushed the fix-component-initialization-GPUs branch from 41e8e9b to 2ab2fc4 on March 12, 2024 22:38
@abouteiller abouteiller merged commit 1ababbe into ICLDisco:master Apr 1, 2024
3 of 4 checks passed