
Refactor GPU device to increase code factorization between the devices. #570

Merged
merged 11 commits into ICLDisco:master on Nov 4, 2023

Conversation

therault
Contributor

  • Expose 8 functions that are device hardware-specific in the gpu_module_s: set_device, memcpy_async, event_query, event_record, memory_info, memory_allocate, memory_free, find_incarnation
  • Make 90% of the code in device_cuda_module.c use these functions so that it remains hardware-oblivious
  • Move all hardware-oblivious code to device_gpu.c
  • Keep cuda_module_init and cuda_module_fini, as well as the other module-API functions that were 90% hardware-specific, in device_cuda_module.c
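The factorization described above can be sketched as a vtable of function pointers on the GPU module structure. The struct layout and signatures below are simplified illustrations of that idea, not PaRSEC's actual definitions:

```c
#include <stddef.h>

/* Illustrative reduction of the eight hardware-specific hooks exposed on
 * gpu_module_s. Signatures are simplified stand-ins, not PaRSEC's exact API. */
typedef struct gpu_module_s gpu_module_t;
struct gpu_module_s {
    int   (*set_device)(gpu_module_t *gpu);
    int   (*memcpy_async)(gpu_module_t *gpu, void *dst, const void *src,
                          size_t bytes, void *stream);
    int   (*event_query)(gpu_module_t *gpu, void *event);
    int   (*event_record)(gpu_module_t *gpu, void *event, void *stream);
    int   (*memory_info)(gpu_module_t *gpu, size_t *free_b, size_t *total_b);
    void *(*memory_allocate)(gpu_module_t *gpu, size_t bytes);
    int   (*memory_free)(gpu_module_t *gpu, void *ptr);
    void *(*find_incarnation)(gpu_module_t *gpu, const char *name);
};

/* Stub backend hook, standing in for what device_cuda_module.c would install. */
static int stub_set_device(gpu_module_t *gpu) { (void)gpu; return 0; }

static gpu_module_t stub_module = { .set_device = stub_set_device };

/* Hardware-oblivious caller, in the spirit of the shared code in device_gpu.c:
 * it only ever goes through the hooks, never through a backend API. */
static int activate_device(gpu_module_t *gpu) { return gpu->set_device(gpu); }
```

Each backend (CUDA, HIP, Level Zero) fills the same table, so the shared code never names a vendor API.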

continue;
if( NULL != chores[j].dyld_fn ) {
/* the function has been set for another device of the same type */
return PARSEC_SUCCESS;
Contributor

Don't we still need to iterate over the rest of the devices, as they may have different incarnations that are not already loaded?

@therault therault marked this pull request as ready for review October 23, 2023 19:04
@therault therault requested a review from a team as a code owner October 23, 2023 19:04
" return parsec_%s_kernel_scheduler( es, gpu_task, dev_index );\n"
"}\n\n",
dev_lower);
" return parsec_gpu_kernel_scheduler( es, gpu_task, dev_index );\n"
Contributor

How does this work when multiple devices are present? I think we need to name the scheduler appropriately, to allow support for multiple devices at the same time (even if we do not currently have the cmake configury nor the code-generation capability for it).

Contributor

One way to resolve this would be to keep this generic function as a switchyard that switches on devices[dev_index]->type and calls the appropriate function. That would keep the complexity in the generated code to a minimum.
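A minimal sketch of that switchyard idea, with made-up device types, return values, and scheduler names (not the actual generated code):

```c
/* Illustrative device types and per-backend schedulers; names and return
 * values are invented for this sketch. */
typedef enum { DEV_CUDA, DEV_HIP } dev_type_t;
typedef struct { dev_type_t type; } device_t;

/* Stand-in for parsec's device array, indexed by dev_index. */
static device_t devices[] = { { DEV_CUDA }, { DEV_HIP } };

static int cuda_kernel_scheduler(int dev_index) { return 100 + dev_index; }
static int hip_kernel_scheduler(int dev_index)  { return 200 + dev_index; }

/* The switchyard: generated code calls only this one function, and the
 * type-based dispatch stays out of the generated sources. */
static int gpu_kernel_scheduler(int dev_index)
{
    switch( devices[dev_index].type ) {
    case DEV_CUDA: return cuda_kernel_scheduler(dev_index);
    case DEV_HIP:  return hip_kernel_scheduler(dev_index);
    }
    return -1;
}
```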

Contributor

We could make the code more complicated instead of just adding the well-defined device type into the function name. In fact, I think I would prefer a solution where the scheduler function is part of the device, so we would directly call devices[dev_index]->scheduler(es, gpu_task, devices[dev_index])
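The alternative suggested here can be sketched as follows; the types, names, and return values are illustrative, not PaRSEC's. Each device carries its own scheduler pointer, so the call site needs neither a switch nor a name lookup:

```c
/* Sketch of the scheduler-as-device-member idea: every device installs the
 * scheduler it wants; accelerators can share a unified one while recursive
 * and CPU devices install specialized behavior. */
typedef struct device_s device_t;
struct device_s {
    int (*scheduler)(device_t *dev, int task_id);
};

static int unified_gpu_scheduler(device_t *dev, int task_id) { (void)dev; return task_id; }
/* Specialized behavior, e.g. for the recursive or CPU devices. */
static int recursive_scheduler(device_t *dev, int task_id)   { (void)dev; return -task_id; }

static device_t gpu_dev = { unified_gpu_scheduler };
static device_t rec_dev = { recursive_scheduler };
static device_t *all_devices[] = { &gpu_dev, &rec_dev };

/* The proposed call form: devices[dev_index]->scheduler(...) */
static int dispatch(int dev_index, int task_id)
{
    device_t *dev = all_devices[dev_index];
    return dev->scheduler(dev, task_id);
}
```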

Contributor Author

Renamed to parsec_device_kernel_scheduler in 9cdf7db.

When multiple accelerator devices are present, dev_index uniquely identifies a single device, and the code in parsec_device_kernel_scheduler is device-agnostic (the code uses the functions defined in the module to schedule tasklets, data movement and allocation).

Recursive and CPU devices do not call this function, as before.

Contributor

This approach limits the parsec runtime to a single enabled device per binary instance.

Contributor Author

Maybe the description of the PR was not clear. As part of the factorization effort, each accelerator device defines 8 functions (set_device, memcpy_async, event_query, event_record, memory_info, memory_allocate, memory_free, find_incarnation), and parsec_device_kernel_scheduler (and the other functions in device_gpu.c) uses those 8 device-specific functions to perform device-specific operations within device-agnostic code.

So, parsec_device_kernel_scheduler in device_gpu.c is the generic part of the existing code that does not depend on the device type, and when we do device-specific things (e.g. adding a CUDA event in a CUDA stream, or appending a Level Zero fence to a Level Zero command queue), we call one of the module-specific functions (gpu_module->event_record for that example).

This relies on the user / DSL putting the right thing in the gpu_task_t passed to parsec_device_kernel_scheduler: the submit, stage_in, and stage_out need to target the right type of device that corresponds to the dev_index in the call.

I believe this code should allow a HIP, a Level Zero, and a CUDA device to function simultaneously, but I may be missing something?
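The contract described above can be sketched like this; the struct layout is an illustration, with callback names mirroring the stage_in/submit/stage_out roles mentioned in the comment, not gpu_task_t's real definition:

```c
/* Illustrative gpu_task_t: the user/DSL installs stage_in, submit and
 * stage_out callbacks that match the backend of the chosen dev_index. */
typedef struct gpu_task_s gpu_task_t;
struct gpu_task_s {
    int (*stage_in)(gpu_task_t *task);
    int (*submit)(gpu_task_t *task);
    int (*stage_out)(gpu_task_t *task);
    int steps_run;   /* counts completed phases, for demonstration */
};

/* Device-agnostic driver in the spirit of parsec_device_kernel_scheduler:
 * it never names a backend, it only runs the callbacks it was handed. */
static int run_gpu_task(gpu_task_t *task)
{
    if( 0 != task->stage_in(task) )  return -1;
    if( 0 != task->submit(task) )    return -1;
    if( 0 != task->stage_out(task) ) return -1;
    return 0;
}

static int demo_phase(gpu_task_t *t) { t->steps_run++; return 0; }
static gpu_task_t demo_task = { demo_phase, demo_phase, demo_phase, 0 };

/* Runs the demo task and reports how many phases completed. */
static int run_demo_task(void)
{
    demo_task.steps_run = 0;
    return (0 == run_gpu_task(&demo_task)) ? demo_task.steps_run : -1;
}
```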

Contributor

It is clear. However, the only reason you need to define a set_device function is to be able to create a unified parsec_device_kernel_scheduler and therefore define a unique behavior across all devices. If instead you add another member, scheduler, to the device structure, then in addition to the unified behavior you describe you can have specialized behavior per device (hint: recursive and CPU).

@@ -141,7 +141,7 @@ static int device_cuda_component_query(mca_base_module_t **module, int *priority
PARSEC_CUDA_CHECK_ERROR( "(parsec_device_cuda_component_query) cuCtxEnablePeerAccess ", cudastatus,
{continue;} );
source_gpu->super.peer_access_mask = (int16_t)(source_gpu->super.peer_access_mask | (int16_t)(1 <<
target_gpu->cuda_index));
target_gpu->super.super.device_index));
Contributor

This is a major change! You repurpose the device index bitmask from a device naming scheme (cuda_index being the CUDA naming of the device) to a parsec naming scheme (device_index being the position in parsec's array of devices).

However, I don't see any change in the use of this peer_access_mask, which points to something weird in this patch.

Contributor Author

I did change the initialization of peer_access_mask in device_cuda_component.c:143: we use target_gpu->super.super.device_index instead of target_gpu->cuda_index, so I believe the access and initialization are consistent.

In parsec master, peer_access_mask is

  • initialized to 0 in parsec_cuda_module_init in device_cuda_module.c
  • then computed in device_cuda_component_query in device_cuda_component.c after all modules have been loaded
  • then read in parsec_cuda_data_stage_in in device_cuda_module.c

In this PR, parsec_cuda_data_stage_in is moved to device_gpu.c, and I have changed the index used in both the access and the initialization.

The user-facing MCA parameter device_cuda_nvlink_mask still uses the CUDA index to exclude pairs of devices that could rely on NVLINK (so nothing has changed here), and we only display peer_access_mask in debugging functions (the mask shows parsec device indices instead of cuda indices, so it is not a 1-to-1 match with the MCA parameter).

If this is not acceptable, then to make parsec_gpu_data_stage_in device-type independent and keep it in device_gpu.c, I can extend the device API with a function bool device_to_device_direct(parsec_device_module_t *src, parsec_device_module_t *dst) that each device implements. But since src and dst might be of different types, the first test will be to return false if src->type != dst->type.
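A sketch of that proposed device_to_device_direct hook, using a reduced stand-in for parsec_device_module_t (field names and type codes are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for parsec_device_module_t, reduced to the fields
 * the proposed hook would consult. */
typedef struct {
    int     type;              /* backend type: CUDA, HIP, Level Zero, ... */
    int     device_index;      /* position in parsec's device array */
    int16_t peer_access_mask;  /* bit i set: direct access to device_index i */
} device_module_t;

/* The proposed test: devices of different backends can never talk directly;
 * otherwise the peer mask decides. */
static bool device_to_device_direct(const device_module_t *src,
                                    const device_module_t *dst)
{
    if( src->type != dst->type ) return false;
    return 0 != (src->peer_access_mask & (int16_t)(1 << dst->device_index));
}

static device_module_t cuda0 = { 1, 0, (int16_t)(1 << 1) }; /* peers with index 1 */
static device_module_t cuda1 = { 1, 1, (int16_t)(1 << 0) };
static device_module_t lz2   = { 2, 2, 0 };                 /* different backend */
```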

@@ -76,7 +80,7 @@ parsec_gpu_check_space_needed(parsec_device_gpu_module_t *gpu_device,
void parsec_gpu_init_profiling(void)
{
if(parsec_gpu_profiling_initiated == 0) {
parsec_profiling_add_dictionary_keyword("cuda", "fill:#66ff66",
Contributor

Each accelerator should generate its own events, instead of generating generically named events.

Contributor

device->name will contain a string of the form "cuda(2)" where 2 is the dev_index.

Contributor

The device name is translated into the stream naming, but these events, being unique, will be unique across all profiling streams, making them matchable across execution domains. That makes sense when we are talking about threads, but for devices it might be a stretch.

… function of the devices, define cuda (and hip)-specific MCA parameter to enable it to the default sorting list algorithm if enabled
 - make static to device_gpu.c any event that is traced at the unified
   level
 - move MCA parameters to the per-device-type files
@bosilca bosilca merged commit 3d046a1 into ICLDisco:master Nov 4, 2023
2 of 4 checks passed
therault added a commit to therault/parsec that referenced this pull request Mar 8, 2024
In ICLDisco#570, we moved from using
cuda_index to using device_index in the 'nvlink' mask that decides
if we can directly communicate with another GPU. However, this bitmask
was initialized at query time, before devices get assigned a device_index.
As a consequence, the bitmask was wrong and no direct device-to-device
communication was happening.

In this PR, we add a step, after all devices have been registered,
to complete this initialization.
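The fix described in this commit message can be sketched as a two-phase initialization (types, device count, and field names are illustrative):

```c
#include <stdint.h>

/* Illustrative two-phase setup matching the fix: device indices exist only
 * after registration, so the peer mask is completed in a second pass
 * instead of at component-query time. */
typedef struct { int device_index; int16_t peer_access_mask; } dev_t;

static dev_t devs[2];

static void register_devices(void)
{
    for( int i = 0; i < 2; i++ ) {
        devs[i].device_index = i;     /* assigned at registration */
        devs[i].peer_access_mask = 0; /* deliberately left empty here */
    }
}

static void complete_peer_masks(void)
{
    /* Safe now: every device_index is final, so mask bits are meaningful. */
    for( int i = 0; i < 2; i++ )
        for( int j = 0; j < 2; j++ )
            if( i != j )
                devs[i].peer_access_mask |= (int16_t)(1 << devs[j].device_index);
}

/* Runs both phases and returns the resulting mask of one device. */
static int16_t mask_after_init(int which)
{
    register_devices();
    complete_peer_masks();
    return devs[which].peer_access_mask;
}
```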