Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compute CPU and GPU versions without lying during kernel epilog (enable TTG/PTG versioning to coexist) #648

Merged

Conversation

abouteiller
Copy link
Contributor

@abouteiller abouteiller commented Apr 1, 2024

new: rework get_best_device to lock-in the device used (it still use the evaluate functions), this will let us consolidate version management for GPU and CPU for all DSLs

This incidentally make potrf_dtd -g 2 work (while it was suspicious before)

New: parsec_select_best_device
Removed: parsec_get_best_device

  • stress and stage crash when no GPU available at runtime (defer to Consolidated error handling when GPU only tests execute on CPU systems #644)
  • testing_best_device computes wrong values under some cases (on b00 notably)
  • re-enable the PTG device=123 body decorator
  • compute CPU versions upfront to enable TTG
  • optional: stop lying about data_out and data_in being the CPU copy all the time when it most of the time is a GPU copy that we are just passing down directly, there are complexities with updating the right copy->readers (especially in dtd)

evaluate functions), this will let us consolidate version management for
GPU and CPU for all DSLs

New: parsec_select_best_device
Removed: parsec_get_best_device
* always construct task_t objects
* iterate on incarnations using PARSEC_DEV_NONE as the terminating
  element marker
* detect case where dyld function did not find a hook during
  best_device (rather than crashing durint hook execution)
* incorrect use of flow indices to reference data_in objects
  replaced with correct in[i]->flow_index
* split abiding by W advise_on device and RO prefer device in two loops
  so that we always try hard to find the W dependency before looking at
  RO dependencies
* consolidate detection of case where no_valid_device should trigger
  when doing load-balancing on RO data on GPU
* an invalid case (no valid device found when we did select a device
  already) could cause an infinite loop, now it will assert
* reduced verbosity of gpu memory allocation debug_verbose
* device_load is now mca/device independent, done in scheduling.c
* documented better why we never use cpu device load (why it cannot be computed effectively atm)
@abouteiller abouteiller force-pushed the new/best-device-and-version-counting branch from 018d305 to fc18d78 Compare April 4, 2024 18:42
@abouteiller abouteiller marked this pull request as ready for review April 6, 2024 00:01
@abouteiller abouteiller requested a review from a team as a code owner April 6, 2024 00:01
@abouteiller abouteiller changed the title Compute CPU and GPU versions in the runtime without DSL input (enable TTG/PTG versioning to coexist) Compute CPU and GPU versions without lying during kernel epilog (enable TTG/PTG versioning to coexist) Apr 6, 2024
@abouteiller abouteiller requested a review from devreal April 8, 2024 22:06
Copy link
Contributor

@devreal devreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works with TTG and DPLASMA on Frontier. Thanks @abouteiller

@abouteiller abouteiller merged commit e95ec85 into ICLDisco:master Apr 11, 2024
3 of 4 checks passed
Copy link
Contributor

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forgot to submit my review.

@@ -126,88 +126,49 @@ int __parsec_execute( parsec_execution_stream_t* es,
parsec_task_t* task )
{
const parsec_task_class_t* tc = task->task_class;
parsec_evaluate_function_t* eval;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not being able to decide at runtime not to execute a particular chore is a significant loss of functionality. With this new code recursive kernels are forced to be recursive, instead of being allowed to fall back to the CPU version.

@@ -508,7 +475,6 @@ int __parsec_task_progress( parsec_execution_stream_t* es,
/* We're good to go ... */
switch(rc) {
case PARSEC_HOOK_RETURN_DONE: /* This execution succeeded */
task->status = PARSEC_TASK_STATUS_COMPLETE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a reason this was done here. I don't recall all the details, but it had something to do with the completion of recursive kernels.

@@ -527,6 +493,7 @@ int __parsec_task_progress( parsec_execution_stream_t* es,
case PARSEC_HOOK_RETURN_NEXT: /* Try next variant [if any] */
case PARSEC_HOOK_RETURN_DISABLE: /* Disable the device, something went wrong */
case PARSEC_HOOK_RETURN_ERROR: /* Some other major error happened */
default:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what else is left for the default case ?

@@ -3157,7 +3157,7 @@ static void jdf_generate_startup_tasks(const jdf_t *jdf, const jdf_function_entr
"%s vpid = (vpid + 1) %% context->nb_vp; /* spread the initial joy */\n"
"%s }\n"
"%s new_task = (%s*)parsec_thread_mempool_allocate( context->virtual_processes[vpid]->execution_streams[0]->context_mempool );\n"
"%s new_task->status = PARSEC_TASK_STATUS_NONE;\n",
"%s PARSEC_OBJ_CONSTRUCT(new_task, parsec_task_t); /* construct called only when new, force-construct it again */\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this not necessary before ?

@@ -4494,9 +4504,9 @@ static void jdf_generate_startup_hook( const jdf_t *jdf )
" chores[idx].hook = NULL;\n"
" /* Create the initialization tasks for each taskclass */\n"
" parsec_task_t* task = (parsec_task_t*)parsec_thread_mempool_allocate(context->virtual_processes[0]->execution_streams[0]->context_mempool);\n"
" PARSEC_OBJ_CONSTRUCT(task, parsec_task_t); /* construct called only when new, force-construct it again */\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed now ?

@@ -4481,7 +4491,7 @@ static void jdf_generate_startup_hook( const jdf_t *jdf )
" parsec_task_class_t* tc = (parsec_task_class_t*)__parsec_tp->super.super.task_classes_array[i];\n"
" __parsec_chore_t* chores = (__parsec_chore_t*)tc->incarnations;\n"
" uint32_t idx = 0, j;\n"
" for( j = 0; NULL != chores[j].hook; j++ ) {\n"
" for( j = 0; PARSEC_DEV_NONE != chores[j].type; j++ ) {\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why ? It seems that the two lead to similar outcome, but I question the need for the change.

@@ -4670,7 +4680,7 @@ static void jdf_generate_constructor( const jdf_t* jdf )
coutput(" for( i = 0; i < __parsec_tp->super.super.nb_task_classes; i++ ) {\n"
" __parsec_tp->super.super.task_classes_array[i] = tc = malloc(sizeof(parsec_task_class_t));\n"
" memcpy(tc, %s_task_classes[i], sizeof(parsec_task_class_t));\n"
" for( j = 0; NULL != tc->incarnations[j].hook; j++); /* compute the number of incarnations */\n"
" for( j = 0; PARSEC_DEV_NONE != tc->incarnations[j].type; j++); /* compute the number of incarnations */\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above. why ?

" __parsec_%s_internal_taskpool_t *__parsec_tp = (__parsec_%s_internal_taskpool_t *)this_task->taskpool; (void)__parsec_tp;\n"
" int dev_prop = %s;\n"
" int nb_devs_of_type, dev_index;\n"
" if( NULL != this_task->selected_device ) {\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can it be possible that the task already has a selected_device ?

coutput("static int eval_%s_%s_select_device(%s *this_task)\n"
"{\n"
" __parsec_%s_internal_taskpool_t *__parsec_tp = (__parsec_%s_internal_taskpool_t *)this_task->taskpool; (void)__parsec_tp;\n"
" int dev_prop = %s;\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is using the device_expr, but you can't just dump the user code into a variable, because you have no guarantees that this will result in valid C code. You might need to use jdf_generate_function_without_expression here.

" this_task->selected_device->device_index, this_task->selected_device->name);\n"
" return PARSEC_HOOK_RETURN_DONE;\n"
" }\n"
" for( dev_index = nb_devs_of_type = 0; dev_index < parsec_mca_device_enabled(); dev_index++ )\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code should have been into a parsec internal function. That would have simplified the entire code, because you could have just dumped the device property as a function, then call use it's pointer to call it in validation function, before doing the checks below.

((fl->flow_flags & JDF_FLOW_TYPE_CTL) ? "PARSEC_FLOW_ACCESS_NONE" :
((fl->flow_flags & JDF_FLOW_TYPE_READ) ?
((fl->flow_flags & JDF_FLOW_TYPE_WRITE) ? "PARSEC_FLOW_ACCESS_RW" : "PARSEC_FLOW_ACCESS_READ") : "PARSEC_FLOW_ACCESS_WRITE")));
if (fl->flow_flags & JDF_FLOW_TYPE_WRITE) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been a long time I haven't work on this particular piece of code, but if I recall well this function is mostly used for generating the hook for the CPU/GPU, and the GPU code is generated by jdf_generate_code_hook_gpu. Thus, any GPU related code you put in here will never have any impact on the execution.

"#if defined(PARSEC_DEBUG_NOISIER)\n"
" char tmp[128];\n"
"#endif\n"
" this_task->data._f_%s.data_out->version++; /* %s */\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a significant change, you move the version management from _complete, aka. after the body execution into the _hook, aka. for each execution. How would that work if the body returns AGAIN or NEXT ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants