
Refactor GPU device to increase code factorization between the devices. #570

Merged
merged 11 commits into ICLDisco:master on Nov 4, 2023

Conversation

therault
Contributor

  • Expose 8 functions that are device hardware-specific in the gpu_module_s: set_device, memcpy_async, event_query, event_record, memory_info, memory_allocate, memory_free, find_incarnation
  • Make 90% of the code in device_cuda_module.c use these functions so that it remains hardware-oblivious
  • Move all hardware-oblivious code to device_gpu.c
  • Keep cuda_module_init and cuda_module_fini, as well as the other module-API functions that were 90% hardware-specific, in device_cuda_module.c
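The factorization described above can be sketched as a vtable of function pointers on the GPU module structure. The struct layout and signatures below are simplified illustrations of that idea, not PaRSEC's actual definitions:

```c
#include <stddef.h>

/* Illustrative reduction of the eight hardware-specific hooks exposed on
 * gpu_module_s. Signatures are simplified stand-ins, not PaRSEC's exact API. */
typedef struct gpu_module_s gpu_module_t;
struct gpu_module_s {
    int   (*set_device)(gpu_module_t *gpu);
    int   (*memcpy_async)(gpu_module_t *gpu, void *dst, const void *src,
                          size_t bytes, void *stream);
    int   (*event_query)(gpu_module_t *gpu, void *event);
    int   (*event_record)(gpu_module_t *gpu, void *event, void *stream);
    int   (*memory_info)(gpu_module_t *gpu, size_t *free_b, size_t *total_b);
    void *(*memory_allocate)(gpu_module_t *gpu, size_t bytes);
    int   (*memory_free)(gpu_module_t *gpu, void *ptr);
    void *(*find_incarnation)(gpu_module_t *gpu, const char *name);
};

/* Stub backend hook, standing in for what device_cuda_module.c would install. */
static int stub_set_device(gpu_module_t *gpu) { (void)gpu; return 0; }

static gpu_module_t stub_module = { .set_device = stub_set_device };

/* Hardware-oblivious caller, in the spirit of the shared code in device_gpu.c:
 * it only ever goes through the hooks, never through a backend API. */
static int activate_device(gpu_module_t *gpu) { return gpu->set_device(gpu); }
```

Each backend (CUDA, HIP, Level Zero) fills the same table, so the shared code never names a vendor API.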

continue;
if( NULL != chores[j].dyld_fn ) {
/* the function has been set for another device of the same type */
return PARSEC_SUCCESS;
Contributor

Don't we still need to iterate over the rest of the devices, as they may have different incarnations that are not already loaded?

@therault therault marked this pull request as ready for review October 23, 2023 19:04
@therault therault requested a review from a team as a code owner October 23, 2023 19:04
" return parsec_%s_kernel_scheduler( es, gpu_task, dev_index );\n"
"}\n\n",
dev_lower);
" return parsec_gpu_kernel_scheduler( es, gpu_task, dev_index );\n"
Contributor

How does this work when multiple devices are present? I think we need to name the scheduler appropriately, to allow support for multiple devices at the same time (even if we do not currently have the cmake configury nor the code-generation capability for it).

Contributor

One way to resolve this would be to keep this generic function as a switchyard that switches on devices[dev_index]->type and calls the appropriate function. That would keep the complexity in the generated code to a minimum.
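A minimal sketch of that switchyard idea, with made-up device types, return values, and scheduler names (not the actual generated code):

```c
/* Illustrative device types and per-backend schedulers; names and return
 * values are invented for this sketch. */
typedef enum { DEV_CUDA, DEV_HIP } dev_type_t;
typedef struct { dev_type_t type; } device_t;

/* Stand-in for parsec's device array, indexed by dev_index. */
static device_t devices[] = { { DEV_CUDA }, { DEV_HIP } };

static int cuda_kernel_scheduler(int dev_index) { return 100 + dev_index; }
static int hip_kernel_scheduler(int dev_index)  { return 200 + dev_index; }

/* The switchyard: generated code calls only this one function, and the
 * type-based dispatch stays out of the generated sources. */
static int gpu_kernel_scheduler(int dev_index)
{
    switch( devices[dev_index].type ) {
    case DEV_CUDA: return cuda_kernel_scheduler(dev_index);
    case DEV_HIP:  return hip_kernel_scheduler(dev_index);
    }
    return -1;
}
```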

Contributor

We could make the code more complicated instead of just adding the well-defined device type into the function name. In fact, I think I would prefer a solution where the scheduler function is part of the device, so we would directly call devices[dev_index]->scheduler(es, gpu_task, devices[dev_index])
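The alternative suggested here can be sketched as follows; the types, names, and return values are illustrative, not PaRSEC's. Each device carries its own scheduler pointer, so the call site needs neither a switch nor a name lookup:

```c
/* Sketch of the scheduler-as-device-member idea: every device installs the
 * scheduler it wants; accelerators can share a unified one while recursive
 * and CPU devices install specialized behavior. */
typedef struct device_s device_t;
struct device_s {
    int (*scheduler)(device_t *dev, int task_id);
};

static int unified_gpu_scheduler(device_t *dev, int task_id) { (void)dev; return task_id; }
/* Specialized behavior, e.g. for the recursive or CPU devices. */
static int recursive_scheduler(device_t *dev, int task_id)   { (void)dev; return -task_id; }

static device_t gpu_dev = { unified_gpu_scheduler };
static device_t rec_dev = { recursive_scheduler };
static device_t *all_devices[] = { &gpu_dev, &rec_dev };

/* The proposed call form: devices[dev_index]->scheduler(...) */
static int dispatch(int dev_index, int task_id)
{
    device_t *dev = all_devices[dev_index];
    return dev->scheduler(dev, task_id);
}
```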

Contributor Author

Renamed to parsec_device_kernel_scheduler in 9cdf7db.

When multiple accelerator devices are present, dev_index uniquely identifies a single device, and the code in parsec_device_kernel_scheduler is device-agnostic (the code uses the functions defined in the module to schedule tasklets, data movement and allocation).

Recursive and CPU devices do not call this function, as before.

Contributor

This approach limits the parsec runtime to a single enabled device per binary instance.

Contributor Author

Maybe the description of the PR was not clear. As part of the factorization effort, each accelerator device defines 8 functions (set_device, memcpy_async, event_query, event_record, memory_info, memory_allocate, memory_free, find_incarnation), and parsec_device_kernel_scheduler (and the other functions in device_gpu.c) uses those 8 device-specific functions to perform device-specific operations within device-agnostic code.

So, parsec_device_kernel_scheduler in device_gpu.c is the generic part of the existing code that does not depend on the device type, and when we do device-specific things (e.g. adding a CUDA event in a CUDA stream, or appending a Level Zero fence to a Level Zero command queue), we call one of the module-specific functions (gpu_module->event_record for that example).

This relies on the user / DSL putting the right thing in the gpu_task_t passed to parsec_device_kernel_scheduler: the submit, stage_in, and stage_out need to target the right type of device that corresponds to the dev_index in the call.

I believe this code should allow a HIP, a Level Zero, and a CUDA device to function simultaneously, but I may be missing something?
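The contract described above can be sketched like this; the struct layout is an illustration, with callback names mirroring the stage_in/submit/stage_out roles mentioned in the comment, not gpu_task_t's real definition:

```c
/* Illustrative gpu_task_t: the user/DSL installs stage_in, submit and
 * stage_out callbacks that match the backend of the chosen dev_index. */
typedef struct gpu_task_s gpu_task_t;
struct gpu_task_s {
    int (*stage_in)(gpu_task_t *task);
    int (*submit)(gpu_task_t *task);
    int (*stage_out)(gpu_task_t *task);
    int steps_run;   /* counts completed phases, for demonstration */
};

/* Device-agnostic driver in the spirit of parsec_device_kernel_scheduler:
 * it never names a backend, it only runs the callbacks it was handed. */
static int run_gpu_task(gpu_task_t *task)
{
    if( 0 != task->stage_in(task) )  return -1;
    if( 0 != task->submit(task) )    return -1;
    if( 0 != task->stage_out(task) ) return -1;
    return 0;
}

static int demo_phase(gpu_task_t *t) { t->steps_run++; return 0; }
static gpu_task_t demo_task = { demo_phase, demo_phase, demo_phase, 0 };

/* Runs the demo task and reports how many phases completed. */
static int run_demo_task(void)
{
    demo_task.steps_run = 0;
    return (0 == run_gpu_task(&demo_task)) ? demo_task.steps_run : -1;
}
```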

Contributor

It is clear. However, the only reason you need to define a set_device function is to be able to create a unified parsec_device_kernel_scheduler and therefore define a unique behavior across all devices. If instead you add another member, scheduler, to the device structure, then in addition to the unified behavior you describe you can have specialized behavior per device (hint: recursive and CPU).

@@ -141,7 +141,7 @@ static int device_cuda_component_query(mca_base_module_t **module, int *priority
PARSEC_CUDA_CHECK_ERROR( "(parsec_device_cuda_component_query) cuCtxEnablePeerAccess ", cudastatus,
{continue;} );
source_gpu->super.peer_access_mask = (int16_t)(source_gpu->super.peer_access_mask | (int16_t)(1 <<
target_gpu->cuda_index));
target_gpu->super.super.device_index));
Contributor

This is a major change! You repurpose the device index bitmask from a device naming scheme (cuda_index being the CUDA naming of the device) to a parsec naming scheme (device_index being the position in parsec's array of devices).

However, I don't see any change in the use of this peer_access_mask, which points to something weird in this patch.

Contributor Author

I did change the initialization of peer_access_mask in device_cuda_component.c:143: we use target_gpu->super.super.device_index instead of target_gpu->cuda_index, so I believe the access and initialization are consistent.

In parsec master, peer_access_mask is

  • initialized to 0 in parsec_cuda_module_init in device_cuda_module.c
  • then computed in device_cuda_component_query in device_cuda_component.c after all modules have been loaded
  • then read in parsec_cuda_data_stage_in in device_cuda_module.c

In this PR, parsec_cuda_data_stage_in is moved to device_gpu.c, and I have changed the index used in both the access and the initialization.

The user-facing MCA parameter device_cuda_nvlink_mask still uses the CUDA index to exclude pairs of devices that could rely on NVLINK (so nothing has changed here), and we only display peer_access_mask in debugging functions (the mask shows parsec device indices instead of cuda indices, so it is not a 1-to-1 match with the MCA parameter).

If this is not acceptable, then to make parsec_gpu_data_stage_in device-type independent and keep it in device_gpu.c, I can extend the device API with a function bool device_to_device_direct(parsec_device_module_t *src, parsec_device_module_t *dst) that each device implements. But since src and dst might be of different types, the first test will be to return false if src->type != dst->type.
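A sketch of that proposed device_to_device_direct hook, using a reduced stand-in for parsec_device_module_t (field names and type codes are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for parsec_device_module_t, reduced to the fields
 * the proposed hook would consult. */
typedef struct {
    int     type;              /* backend type: CUDA, HIP, Level Zero, ... */
    int     device_index;      /* position in parsec's device array */
    int16_t peer_access_mask;  /* bit i set: direct access to device_index i */
} device_module_t;

/* The proposed test: devices of different backends can never talk directly;
 * otherwise the peer mask decides. */
static bool device_to_device_direct(const device_module_t *src,
                                    const device_module_t *dst)
{
    if( src->type != dst->type ) return false;
    return 0 != (src->peer_access_mask & (int16_t)(1 << dst->device_index));
}

static device_module_t cuda0 = { 1, 0, (int16_t)(1 << 1) }; /* peers with index 1 */
static device_module_t cuda1 = { 1, 1, (int16_t)(1 << 0) };
static device_module_t lz2   = { 2, 2, 0 };                 /* different backend */
```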

@@ -76,7 +80,7 @@ parsec_gpu_check_space_needed(parsec_device_gpu_module_t *gpu_device,
void parsec_gpu_init_profiling(void)
{
if(parsec_gpu_profiling_initiated == 0) {
parsec_profiling_add_dictionary_keyword("cuda", "fill:#66ff66",
Contributor

Each accelerator should generate its own events, instead of generating generically named events.

Contributor

device->name will contain a string of the form "cuda(2)" where 2 is the dev_index.

Contributor

The device name is translated into the stream naming, but these events, being unique, will be unique across all profiling streams, making them matchable across execution domains. That makes sense when we are talking about threads, but for devices it might be a stretch.

… function of the devices, define cuda (and hip)-specific MCA parameter to enable it to the default sorting list algorithm if enabled
 - make static to device_gpu.c any event that is traced at the unified
   level
 - move MCA parameters to the per-device-type files
@bosilca bosilca merged commit 3d046a1 into ICLDisco:master Nov 4, 2023
2 of 4 checks passed
therault added a commit to therault/parsec that referenced this pull request Mar 8, 2024
In ICLDisco#570, we moved from using
cuda_index to using device_index in the 'nvlink' mask that decides
if we can directly communicate with another GPU. However, this bitmask
was initialized at query time, before devices get assigned a device_index.
As a consequence, the bitmask was wrong and no direct device-to-device
communication was happening.

In this PR, we add a step, after all devices have been registered,
to complete this initialization.
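The fix described in this commit message can be sketched as a two-phase initialization (types, device count, and field names are illustrative):

```c
#include <stdint.h>

/* Illustrative two-phase setup matching the fix: device indices exist only
 * after registration, so the peer mask is completed in a second pass
 * instead of at component-query time. */
typedef struct { int device_index; int16_t peer_access_mask; } dev_t;

static dev_t devs[2];

static void register_devices(void)
{
    for( int i = 0; i < 2; i++ ) {
        devs[i].device_index = i;     /* assigned at registration */
        devs[i].peer_access_mask = 0; /* deliberately left empty here */
    }
}

static void complete_peer_masks(void)
{
    /* Safe now: every device_index is final, so mask bits are meaningful. */
    for( int i = 0; i < 2; i++ )
        for( int j = 0; j < 2; j++ )
            if( i != j )
                devs[i].peer_access_mask |= (int16_t)(1 << devs[j].device_index);
}

/* Runs both phases and returns the resulting mask of one device. */
static int16_t mask_after_init(int which)
{
    register_devices();
    complete_peer_masks();
    return devs[which].peer_access_mask;
}
```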