Compute CPU and GPU versions without lying during kernel epilog (enable TTG/PTG versioning to coexist) #648

abouteiller · 2024-04-01T20:14:43Z

New: rework get_best_device to lock in the device used (it still uses the evaluate functions). This will let us consolidate version management for GPU and CPU across all DSLs.

This incidentally makes potrf_dtd -g 2 work (it was suspicious before).

New: parsec_select_best_device
Removed: parsec_get_best_device

stress and stage crash when no GPU available at runtime (defer to Consolidated error handling when GPU only tests execute on CPU systems #644)
testing_best_device computes wrong values in some cases (on b00 notably)
re-enable the PTG device=123 body decorator
compute CPU versions upfront to enable TTG
optional: stop lying about data_out and data_in always being the CPU copy when, most of the time, it is a GPU copy that we are just passing down directly; there are complexities with updating the right copy->readers (especially in dtd)


kernel_epilog

* always construct task_t objects
* iterate on incarnations using PARSEC_DEV_NONE as the terminating element marker
* detect the case where the dyld function did not find a hook during best_device (rather than crashing during hook execution)
* incorrect use of flow indices to reference data_in objects replaced with the correct in[i]->flow_index
* split abiding by W advise_on device and RO prefer device into two loops, so that we always try hard to find the W dependency before looking at RO dependencies
* consolidate detection of the case where no_valid_device should trigger when doing load-balancing on RO data on GPU
* an invalid case (no valid device found when we did select a device already) could cause an infinite loop; it now asserts
* reduced verbosity of gpu memory allocation debug_verbose
* device_load is now mca/device independent, done in scheduling.c
* better documented why we never use the cpu device load (why it cannot be computed effectively atm)
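
Two of the items above (the sentinel-terminated walk over incarnations with an early check for a missing hook, and the two-loop W-then-RO device choice) can be sketched roughly as follows. This is only an illustration under stated assumptions: `chore_t`, `flow_t`, `DEV_*`, `ACCESS_*`, `select_chore` and `select_device_from_flows` are hypothetical stand-ins, not the PaRSEC types or functions touched by this PR, and the single `device_hint` field is a simplification of the advise/prefer information.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the PaRSEC definitions. */
enum { DEV_NONE = 0, DEV_CPU = 1, DEV_GPU = 2 };  /* device type bits */
enum { ACCESS_READ = 1, ACCESS_WRITE = 2 };       /* flow access bits */

typedef int (*hook_fn)(void);

typedef struct {
    int     type;  /* device type; DEV_NONE terminates the array */
    hook_fn hook;  /* NULL when the dynamic lookup found no function */
} chore_t;

typedef struct {
    int access;       /* ACCESS_READ, or ACCESS_READ | ACCESS_WRITE */
    int device_hint;  /* advised (W) or preferred (RO) device, -1 if none */
} flow_t;

static int cpu_body(void) { return 0; }

/* Example table: the GPU entry has no hook, as if the lookup had failed. */
static const chore_t example_chores[] = {
    { DEV_GPU,  NULL     },
    { DEV_CPU,  cpu_body },
    { DEV_NONE, NULL     }
};

/* Walk the chores until the DEV_NONE sentinel. Return the index of the
 * first chore whose type is allowed and whose hook exists, or -1. A
 * missing hook is detected here, at selection time, not at execution. */
static int select_chore(const chore_t *chores, int allowed_types)
{
    assert(NULL != chores);
    for (int i = 0; DEV_NONE != chores[i].type; i++) {
        if (0 == (chores[i].type & allowed_types)) continue;
        if (NULL == chores[i].hook) continue;
        return i;
    }
    return -1;
}

/* Two loops: written flows are examined first, and read-only flows are
 * considered only if no written flow carries a device hint. */
static int select_device_from_flows(const flow_t *flows, int nb_flows)
{
    for (int i = 0; i < nb_flows; i++) {
        if ((flows[i].access & ACCESS_WRITE) && flows[i].device_hint >= 0)
            return flows[i].device_hint;
    }
    for (int i = 0; i < nb_flows; i++) {
        if (!(flows[i].access & ACCESS_WRITE) && flows[i].device_hint >= 0)
            return flows[i].device_hint;
    }
    return -1;  /* no hint at all: the caller has to pick by other means */
}
```

With the example table, asking for a GPU-only chore yields -1 because the GPU entry has no hook, so the caller can react before any hook is invoked.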

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>

debug: priority already printed in the task name

advise_on_device, so this is what we do now.

devreal

This works with TTG and DPLASMA on Frontier. Thanks @abouteiller

bosilca

forgot to submit my review.

bosilca · 2024-04-06T02:59:49Z

parsec/scheduling.c

@@ -126,88 +126,49 @@ int __parsec_execute( parsec_execution_stream_t* es,
                      parsec_task_t* task )
 {
    const parsec_task_class_t* tc = task->task_class;
-    parsec_evaluate_function_t* eval;


Not being able to decide at runtime not to execute a particular chore is a significant loss of functionality. With this new code, recursive kernels are forced to be recursive instead of being allowed to fall back to the CPU version.
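
To make the concern concrete, here is a minimal sketch of the fallback behavior being described, where a chore may decline at runtime and the next one is tried. All names (`hook_rc_t`, `HOOK_DONE`, `HOOK_NEXT`, `recursive_body`, `cpu_body`, `execute_with_fallback`) and the size threshold are hypothetical and only loosely mirror the PARSEC_HOOK_RETURN_* convention; this is not the code removed or added by the PR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes, for illustration only. */
typedef enum { HOOK_DONE = 0, HOOK_NEXT = 1 } hook_rc_t;

typedef hook_rc_t (*hook_fn)(int problem_size);

typedef struct {
    hook_fn hook;  /* NULL terminates the array */
} chore_t;

/* A recursive variant that declines small problems at runtime.
 * The 1024 threshold is an arbitrary value chosen for this sketch. */
static hook_rc_t recursive_body(int problem_size)
{
    return (problem_size >= 1024) ? HOOK_DONE : HOOK_NEXT;
}

/* A plain CPU variant that always accepts the task. */
static hook_rc_t cpu_body(int problem_size)
{
    (void)problem_size;
    return HOOK_DONE;
}

static const chore_t example_chores[] = {
    { recursive_body },
    { cpu_body },
    { NULL }
};

/* Try each chore in order; a chore that returns HOOK_NEXT is skipped and
 * the next one is tried. Returns the index of the chore that ran, or -1. */
static int execute_with_fallback(const chore_t *chores, int problem_size)
{
    for (int i = 0; NULL != chores[i].hook; i++) {
        if (HOOK_DONE == chores[i].hook(problem_size))
            return i;
    }
    return -1;
}
```

In this sketch a large problem is handled by the recursive variant (index 0), while a small one falls through to the CPU variant (index 1). If the chore is locked in before execution, that second path is no longer reachable.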

bosilca · 2024-04-06T03:02:01Z

parsec/scheduling.c

@@ -508,7 +475,6 @@ int __parsec_task_progress( parsec_execution_stream_t* es,
        /* We're good to go ... */
        switch(rc) {
        case PARSEC_HOOK_RETURN_DONE:    /* This execution succeeded */
-            task->status = PARSEC_TASK_STATUS_COMPLETE;


There was a reason this was done here. I don't recall all the details, but it had something to do with the completion of recursive kernels.

bosilca · 2024-04-06T03:02:35Z

parsec/scheduling.c

@@ -527,6 +493,7 @@ int __parsec_task_progress( parsec_execution_stream_t* es,
        case PARSEC_HOOK_RETURN_NEXT:    /* Try next variant [if any] */
        case PARSEC_HOOK_RETURN_DISABLE: /* Disable the device, something went wrong */
        case PARSEC_HOOK_RETURN_ERROR:   /* Some other major error happened */
+        default:


What else is left for the default case?

bosilca · 2024-04-11T15:19:05Z