-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compute CPU and GPU versions without lying during kernel epilog (enable TTG/PTG versioning to coexist) #648
Compute CPU and GPU versions without lying during kernel epilog (enable TTG/PTG versioning to coexist) #648
Conversation
evaluate functions), this will let us consolidate version management for GPU and CPU for all DSLs New: parsec_select_best_device Removed: parsec_get_best_device
* always construct task_t objects * iterate on incarnations using PARSEC_DEV_NONE as the terminating element marker * detect case where dyld function did not find a hook during best_device (rather than crashing durint hook execution) * incorrect use of flow indices to reference data_in objects replaced with correct in[i]->flow_index * split abiding by W advise_on device and RO prefer device in two loops so that we always try hard to find the W dependency before looking at RO dependencies * consolidate detection of case where no_valid_device should trigger when doing load-balancing on RO data on GPU * an invalid case (no valid device found when we did select a device already) could cause an infinite loop, now it will assert * reduced verbosity of gpu memory allocation debug_verbose * device_load is now mca/device independent, done in scheduling.c * documented better why we never use cpu device load (why it cannot be computed effectively atm)
018d305
to
fc18d78
Compare
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
debug: priority already printed in the task name
advise_on_device, so this is what we do now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works with TTG and DPLASMA on Frontier. Thanks @abouteiller
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
forgot to submit my review.
@@ -126,88 +126,49 @@ int __parsec_execute( parsec_execution_stream_t* es, | |||
parsec_task_t* task ) | |||
{ | |||
const parsec_task_class_t* tc = task->task_class; | |||
parsec_evaluate_function_t* eval; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not being able to decide at runtime not to execute a particular chore is a significant loss of functionality. With this new code recursive kernels are forced to be recursive, instead of being allowed to fall back to the CPU version.
@@ -508,7 +475,6 @@ int __parsec_task_progress( parsec_execution_stream_t* es, | |||
/* We're good to go ... */ | |||
switch(rc) { | |||
case PARSEC_HOOK_RETURN_DONE: /* This execution succeeded */ | |||
task->status = PARSEC_TASK_STATUS_COMPLETE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a reason this was done here. I don't recall all the details, but it had something to do with the completion of recursive kernels.
@@ -527,6 +493,7 @@ int __parsec_task_progress( parsec_execution_stream_t* es, | |||
case PARSEC_HOOK_RETURN_NEXT: /* Try next variant [if any] */ | |||
case PARSEC_HOOK_RETURN_DISABLE: /* Disable the device, something went wrong */ | |||
case PARSEC_HOOK_RETURN_ERROR: /* Some other major error happened */ | |||
default: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what else is left for the default
case ?
@@ -3157,7 +3157,7 @@ static void jdf_generate_startup_tasks(const jdf_t *jdf, const jdf_function_entr | |||
"%s vpid = (vpid + 1) %% context->nb_vp; /* spread the initial joy */\n" | |||
"%s }\n" | |||
"%s new_task = (%s*)parsec_thread_mempool_allocate( context->virtual_processes[vpid]->execution_streams[0]->context_mempool );\n" | |||
"%s new_task->status = PARSEC_TASK_STATUS_NONE;\n", | |||
"%s PARSEC_OBJ_CONSTRUCT(new_task, parsec_task_t); /* construct called only when new, force-construct it again */\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this not necessary before ?
@@ -4494,9 +4504,9 @@ static void jdf_generate_startup_hook( const jdf_t *jdf ) | |||
" chores[idx].hook = NULL;\n" | |||
" /* Create the initialization tasks for each taskclass */\n" | |||
" parsec_task_t* task = (parsec_task_t*)parsec_thread_mempool_allocate(context->virtual_processes[0]->execution_streams[0]->context_mempool);\n" | |||
" PARSEC_OBJ_CONSTRUCT(task, parsec_task_t); /* construct called only when new, force-construct it again */\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed now ?
@@ -4481,7 +4491,7 @@ static void jdf_generate_startup_hook( const jdf_t *jdf ) | |||
" parsec_task_class_t* tc = (parsec_task_class_t*)__parsec_tp->super.super.task_classes_array[i];\n" | |||
" __parsec_chore_t* chores = (__parsec_chore_t*)tc->incarnations;\n" | |||
" uint32_t idx = 0, j;\n" | |||
" for( j = 0; NULL != chores[j].hook; j++ ) {\n" | |||
" for( j = 0; PARSEC_DEV_NONE != chores[j].type; j++ ) {\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why ? It seems that the two lead to similar outcome, but I question the need for the change.
@@ -4670,7 +4680,7 @@ static void jdf_generate_constructor( const jdf_t* jdf ) | |||
coutput(" for( i = 0; i < __parsec_tp->super.super.nb_task_classes; i++ ) {\n" | |||
" __parsec_tp->super.super.task_classes_array[i] = tc = malloc(sizeof(parsec_task_class_t));\n" | |||
" memcpy(tc, %s_task_classes[i], sizeof(parsec_task_class_t));\n" | |||
" for( j = 0; NULL != tc->incarnations[j].hook; j++); /* compute the number of incarnations */\n" | |||
" for( j = 0; PARSEC_DEV_NONE != tc->incarnations[j].type; j++); /* compute the number of incarnations */\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above. why ?
" __parsec_%s_internal_taskpool_t *__parsec_tp = (__parsec_%s_internal_taskpool_t *)this_task->taskpool; (void)__parsec_tp;\n" | ||
" int dev_prop = %s;\n" | ||
" int nb_devs_of_type, dev_index;\n" | ||
" if( NULL != this_task->selected_device ) {\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how can it be possible that the task already has a selected_device
?
coutput("static int eval_%s_%s_select_device(%s *this_task)\n" | ||
"{\n" | ||
" __parsec_%s_internal_taskpool_t *__parsec_tp = (__parsec_%s_internal_taskpool_t *)this_task->taskpool; (void)__parsec_tp;\n" | ||
" int dev_prop = %s;\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is using the device_expr
, but you can't just dump the user code into a variable, because you have no guarantees that this will result in valid C code. You might need to use jdf_generate_function_without_expression
here.
" this_task->selected_device->device_index, this_task->selected_device->name);\n" | ||
" return PARSEC_HOOK_RETURN_DONE;\n" | ||
" }\n" | ||
" for( dev_index = nb_devs_of_type = 0; dev_index < parsec_mca_device_enabled(); dev_index++ )\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code should have been into a parsec internal function. That would have simplified the entire code, because you could have just dumped the device property as a function, then call use it's pointer to call it in validation function, before doing the checks below.
((fl->flow_flags & JDF_FLOW_TYPE_CTL) ? "PARSEC_FLOW_ACCESS_NONE" : | ||
((fl->flow_flags & JDF_FLOW_TYPE_READ) ? | ||
((fl->flow_flags & JDF_FLOW_TYPE_WRITE) ? "PARSEC_FLOW_ACCESS_RW" : "PARSEC_FLOW_ACCESS_READ") : "PARSEC_FLOW_ACCESS_WRITE"))); | ||
if (fl->flow_flags & JDF_FLOW_TYPE_WRITE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's been a long time I haven't work on this particular piece of code, but if I recall well this function is mostly used for generating the hook for the CPU/GPU, and the GPU code is generated by jdf_generate_code_hook_gpu
. Thus, any GPU related code you put in here will never have any impact on the execution.
"#if defined(PARSEC_DEBUG_NOISIER)\n" | ||
" char tmp[128];\n" | ||
"#endif\n" | ||
" this_task->data._f_%s.data_out->version++; /* %s */\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a significant change, you move the version management from _complete
, aka. after the body execution into the _hook
, aka. for each execution. How would that work if the body returns AGAIN or NEXT ?
new: rework get_best_device to lock-in the device used (it still use the evaluate functions), this will let us consolidate version management for GPU and CPU for all DSLs
This incidentally make potrf_dtd -g 2 work (while it was suspicious before)
New: parsec_select_best_device
Removed: parsec_get_best_device