Refactor GPU device to increase code factorization between the devices. #570
Conversation
therault commented on Aug 28, 2023
- Expose 8 device hardware-specific functions in `gpu_module_s`: set_device, memcpy_async, event_query, event_record, memory_info, memory_allocate, memory_free, find_incarnation
- Make 90% of the code in device_cuda_module.c use these functions and remain hardware-oblivious
- Move all hardware-oblivious code into device_gpu.c
- Keep cuda_module_init and cuda_module_fini, as well as the other module-API functions that were 90% hardware-specific, in device_cuda_module.c
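To picture the shape this gives the module, here is a minimal sketch of the eight hooks and of hardware-oblivious code calling through them. The field names follow the list above, but the signatures, the toy stubs, and `stage_in` are illustrative assumptions, not the actual PaRSEC declarations:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative only: signatures are guesses based on the hook names,
 * not the actual PaRSEC declarations. */
typedef struct gpu_module_s gpu_module_t;
struct gpu_module_s {
    int   (*set_device)(gpu_module_t *gpu);
    int   (*memcpy_async)(gpu_module_t *gpu, void *dst, const void *src,
                          size_t count, void *stream);
    int   (*event_query)(gpu_module_t *gpu, void *event);
    int   (*event_record)(gpu_module_t *gpu, void *event, void *stream);
    int   (*memory_info)(gpu_module_t *gpu, size_t *free_b, size_t *total_b);
    void *(*memory_allocate)(gpu_module_t *gpu, size_t bytes);
    int   (*memory_free)(gpu_module_t *gpu, void *ptr);
    int   (*find_incarnation)(gpu_module_t *gpu, const char *fn_name);
};

/* Toy "incarnation" so the hardware-oblivious caller below can run. */
static int toy_set_device(gpu_module_t *gpu) { (void)gpu; return 0; }
static int toy_memcpy_async(gpu_module_t *gpu, void *dst, const void *src,
                            size_t count, void *stream)
{
    (void)gpu; (void)stream;
    memcpy(dst, src, count);  /* a real device would enqueue an async copy */
    return 0;
}

/* Hardware-oblivious code (the device_gpu.c side) only sees the hooks. */
static int stage_in(gpu_module_t *gpu, void *dst, const void *src, size_t n)
{
    if (0 != gpu->set_device(gpu)) return -1;
    return gpu->memcpy_async(gpu, dst, src, n, NULL);
}
```

The point of the pattern is that `stage_in` never names CUDA, HIP, or Level Zero; swapping the device means swapping the function pointers.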
Force-pushed from 5686fd6 to fe9733d
```c
                continue;
            if( NULL != chores[j].dyld_fn ) {
                /* the function has been set for another device of the same type */
                return PARSEC_SUCCESS;
```
Don't we still need to iterate over the rest of the devices, as they may have different incarnations that are not already loaded?
….c to device_gpu.c
…e currently non-implemented
…me points raised by PR review
Force-pushed from fe9733d to 3f9ae03
```diff
-"    return parsec_%s_kernel_scheduler( es, gpu_task, dev_index );\n"
+"    return parsec_gpu_kernel_scheduler( es, gpu_task, dev_index );\n"
 "}\n\n",
 dev_lower);
```
How is this working when multiple devices are present? I think we need to name the scheduler appropriately, to allow support for multiple devices at the same time (even if we do not currently have the cmake configury nor the code-generation capability for it).
One way to resolve this would be to keep this generic function as a switchyard that would `switch(devices[dev_index]->type)` and call the appropriate function based on this. That would keep the complexity in the generated code at a minimum.
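That switchyard could be sketched as follows. The type constants, struct layout, and per-type scheduler names here are made up for illustration, not PaRSEC's actual symbols:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type tags and per-type schedulers; real PaRSEC names differ. */
typedef enum { DEV_CUDA, DEV_HIP, DEV_LEVEL_ZERO } dev_type_t;
typedef struct { dev_type_t type; } device_t;

static int cuda_kernel_scheduler(device_t *d, void *task) { (void)d; (void)task; return 1; }
static int hip_kernel_scheduler(device_t *d, void *task)  { (void)d; (void)task; return 2; }

/* The generated code stays a thin switchyard over parsec's device array. */
static int generic_kernel_scheduler(device_t **devices, int dev_index, void *task)
{
    switch (devices[dev_index]->type) {
    case DEV_CUDA: return cuda_kernel_scheduler(devices[dev_index], task);
    case DEV_HIP:  return hip_kernel_scheduler(devices[dev_index], task);
    default:       return -1;  /* device type without a scheduler */
    }
}
```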
We could make the code more complicated instead of just adding the well-defined device type into the function name. In fact, I think I would prefer a solution where the scheduler function is part of the device, so we would directly call `devices[dev_index]->scheduler(es, gpu_task, devices[dev_index])`.
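The alternative suggested here, with the scheduler stored as a member of the device, might look like this (struct and field names are assumed for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative: the scheduler becomes a member of the device structure,
 * so dispatch is a single indirect call and per-device specialization
 * (e.g. recursive or CPU devices) needs no switch statement. */
typedef struct device_s device_t;
struct device_s {
    int (*scheduler)(void *es, void *gpu_task, device_t *dev);
};

static int gpu_scheduler(void *es, void *gpu_task, device_t *dev) {
    (void)es; (void)gpu_task; (void)dev; return 0;   /* unified accelerator path */
}
static int cpu_scheduler(void *es, void *gpu_task, device_t *dev) {
    (void)es; (void)gpu_task; (void)dev; return 100; /* specialized path */
}

static int schedule_on(device_t **devices, int dev_index, void *es, void *task)
{
    /* exactly the call shape from the comment above */
    return devices[dev_index]->scheduler(es, task, devices[dev_index]);
}
```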
Renamed to `parsec_device_kernel_scheduler` in 9cdf7db. When multiple accelerator devices are present, `dev_index` uniquely identifies a single device, and the code in `parsec_device_kernel_scheduler` is device-agnostic (it uses the functions defined in the module to schedule tasklets, data movement, and allocation). Recursive and CPU devices do not call this function, as before.
This approach limits the parsec runtime to a single enabled device per binary instance.
Maybe the description of the PR was not clear. As part of the factorization effort, each accelerator device defines 8 functions (set_device, memcpy_async, event_query, event_record, memory_info, memory_allocate, memory_free, find_incarnation), and `parsec_device_kernel_scheduler` (and the other functions in `device_gpu.c`) use those 8 device-specific functions to perform device-specific operations from device-agnostic code.

So, `parsec_device_kernel_scheduler` in `device_gpu.c` is the generic part of the existing code that does not depend on the device type, and when we do device-specific things (e.g. adding a CUDA event to a CUDA stream, or appending a Level Zero fence to a Level Zero command queue), we call one of the module-specific functions (`gpu_module->event_record` for that example).

This relies on the user / DSL putting the right thing in the `gpu_task_t` passed to `parsec_device_kernel_scheduler`: the `submit`, `stage_in`, and `stage_out` need to target the right type of device, the one that corresponds to the `dev_index` in the call.

I believe this code should allow a HIP, a Level Zero, and a CUDA device to function simultaneously, but I may be missing something?
It is clear. However, the only reason you need to define a `set_device` function is to be able to create a unified `parsec_device_kernel_scheduler`, and therefore define a unique behavior across all devices. If instead you add another member `scheduler` to the device structure, then in addition to the unified behavior you describe, you can have specialized behavior per device (hint: recursive and CPU).
```diff
@@ -141,7 +141,7 @@ static int device_cuda_component_query(mca_base_module_t **module, int *priority
     PARSEC_CUDA_CHECK_ERROR( "(parsec_device_cuda_component_query) cuCtxEnablePeerAccess ", cudastatus,
                              {continue;} );
     source_gpu->super.peer_access_mask = (int16_t)(source_gpu->super.peer_access_mask | (int16_t)(1 <<
-                                         target_gpu->cuda_index));
+                                         target_gpu->super.super.device_index));
```
This is a major change! You repurpose the device index bitmask from a device naming scheme (`cuda_index` being the CUDA naming of the device) to a parsec naming scheme (`device_index` being the position in parsec's array of devices). However, I don't see any changes in the use of this `peer_access_mask`, which points to something weird in this patch.
I did change the initialization of `peer_access_mask` in `device_cuda_component.c:143`: we use `target_gpu->super.super.device_index` instead of `target_gpu->cuda_index`, so I believe the access and the initialization are consistent.

In parsec master, `peer_access_mask` is:
- initialized to 0 in `parsec_cuda_module_init` in `device_cuda_module.c`;
- then computed in `device_cuda_component_query` in `device_cuda_component.c`, after all modules have been loaded;
- then read in `parsec_cuda_data_stage_in` in `device_cuda_module.c`.

In this PR, `parsec_cuda_data_stage_in` is moved into `device_gpu.c`, and I have changed the index used for both the access and the initialization.

The user-facing MCA parameter `device_cuda_nvlink_mask` still uses the CUDA index to exclude pairs of devices that could rely on NVLINK (so things have not changed here), and we only display `peer_access_mask` in debugging functions (the mask shows parsec device indices instead of CUDA indices, so it is not a 1-to-1 match with the MCA parameter).
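The index change can be pictured with a toy version of the mask logic; this is a sketch of the bit arithmetic only, not the PaRSEC code:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of peer_access_mask: bit i is set when direct access to the
 * device whose index is i is possible.  After this PR the bit position is
 * the parsec device_index, no longer the CUDA ordinal. */
static int16_t mask_add_peer(int16_t mask, int device_index)
{
    return (int16_t)(mask | (int16_t)(1 << device_index));
}

static int mask_has_peer(int16_t mask, int device_index)
{
    return (mask & (int16_t)(1 << device_index)) != 0;
}
```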
If this is not acceptable, then to make `parsec_gpu_data_stage_in` device-type independent and keep it in `device_gpu.c`, I can extend the device API with a function `bool device_to_device_direct(parsec_device_module_t *src, parsec_device_module_t *dst)` that each device will implement. But since `src` and `dst` might be of different types, the first test will be to return `false` if `src->type != dst->type`.
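The proposed hook could look roughly like this. Only the signature and the leading type check come from the comment above; the struct layout and the mask-based body are assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-in for parsec_device_module_t; the real layout differs. */
typedef struct {
    int     type;              /* device type tag                   */
    int     device_index;      /* position in parsec's device array */
    int16_t peer_access_mask;  /* bit i: direct access to device i  */
} parsec_device_module_t;

static bool device_to_device_direct(parsec_device_module_t *src,
                                    parsec_device_module_t *dst)
{
    if (src->type != dst->type)   /* first test, per the comment above */
        return false;
    return (src->peer_access_mask & (int16_t)(1 << dst->device_index)) != 0;
}
```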
```diff
@@ -76,7 +80,7 @@ parsec_gpu_check_space_needed(parsec_device_gpu_module_t *gpu_device,
 void parsec_gpu_init_profiling(void)
 {
     if(parsec_gpu_profiling_initiated == 0) {
         parsec_profiling_add_dictionary_keyword("cuda", "fill:#66ff66",
```
Each accelerator should generate its own events, instead of generating generically-named events.
`device->name` will contain a string of the form `"cuda(2)"`, where 2 is the `dev_index`.
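Assembling such a name is a one-liner; this helper is a hypothetical sketch, not the actual PaRSEC initialization code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds a profiling stream name of the form "cuda(2)" from the device
 * type string and the parsec device index. */
static void make_device_name(char *buf, size_t len,
                             const char *type_name, int dev_index)
{
    snprintf(buf, len, "%s(%d)", type_name, dev_index);
}
```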
The device name is translated into the stream naming, but these events being unique, they will be unique across all profiling streams, making them matchable across execution domains. That makes sense when we are talking about threads, but for devices it might be a stretch.
… function of the devices, define cuda (and hip)-specific MCA parameter to enable it to the default sorting list algorithm if enabled
Force-pushed from b098633 to f51e9b0
…d rename API fields to remove gpu
Force-pushed from 6490618 to 2332a73
- make static to device_gpu.c any event that is traced at the unified level
- move MCA parameters to the per-device-type files
In ICLDisco#570, we moved from using `cuda_index` to using `device_index` in the 'nvlink' mask that decides whether we can communicate directly with another GPU. However, this bitmask was initialized at query time, before devices are assigned a `device_index`. As a consequence, the bitmask was wrong and no direct device-to-device communication was happening. In this PR, we add a step, after all devices have been registered, to complete this initialization.