Consolidated error handling when GPU only tests execute on CPU systems #644

Draft: abouteiller wants to merge 4 commits into master

Conversation

abouteiller (Contributor) commented Mar 12, 2024

Fixes issue #641

  • Capture the CI tag and compile for CUDA or HIP based on this tag, not on the availability of the software stack.
  • Decide whether we keep the check in stage_main that counts GPUs. We now detect the erroneous behavior (later) and produce a parsec_fatal, so maybe that is already clear enough?
  • Clean up the error cases for when hook == NULL; the cases have been split between the dyld function not having been populated (goto next_chore) and all other cases (that is a bug: assert/fail).
  • The stress tester is not symmetrical: stress:gpu crashes when compiled with CUDA but run without a device (#641 (comment)).

```diff
@@ -175,7 +175,9 @@ int __parsec_execute( parsec_execution_stream_t* es,
                            chore_id);
 #endif
     parsec_hook_t *hook = tc->incarnations[chore_id].hook;
-    assert( NULL != hook );
+    if( NULL == hook ) {
+        goto next_chore;  /* Gracefully manage the case when NO chores are available for this task */
```

Contributor:
I think this is coming up regularly, but as far as I recall the code, it is not correct. The chores are packed: no NULL chores can exist except at the end. Second, this loop is only executed for valid chores (i.e., non-NULL hooks). So the only way to get here is if no eval function accepted the work, in which case the correct outcome is to bail out of the loop (not go to the next chore).
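
For reference, a simplified sketch of the selection logic being described, with assumed signatures (this is not the actual `__parsec_execute()` source; only `tc`, `incarnations`, `hook`, and `chore_id` come from the diff above):

```c
/* Simplified sketch, not the PaRSEC source: the incarnations array is packed
 * and terminated by a NULL hook, so the loop never meets a NULL hook in the
 * middle; an optional evaluate function may veto an incarnation.            */
for( chore_id = 0; NULL != tc->incarnations[chore_id].hook; chore_id++ ) {
    if( NULL != tc->incarnations[chore_id].evaluate &&
        PARSEC_HOOK_RETURN_NEXT == tc->incarnations[chore_id].evaluate(task) )
        continue;                                    /* vetoed, try the next chore */
    rc = tc->incarnations[chore_id].hook(es, task);  /* accepted, execute the task */
    break;
}
```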

abouteiller (author):

Yes, I remembered this discussion and checked the history (#642 (comment)). Summary of last time: I dropped it because I could no longer reproduce it in the DTD case I was looking at.

Now it is back. What happens here is that we have only one valid chore (the JDF declares only a GPU chore), but the runtime disabled it (set hook = NULL by force) because we did not find a device of that type during task pool creation. The evaluate function is not invoked because it is NULL for this task class.
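
A conceptual illustration of the situation, in the style of JDF-generated code (the type, field, and symbol names here are assumptions for illustration, not copied from this PR):

```c
/* A task class that declares only a GPU chore.  The hook is NULL: either it
 * was never populated (the dyld symbol was not resolved because no CUDA
 * device was found) or it was force-disabled by the runtime at task pool
 * creation.  Only the NULL terminator of the packed chore array remains.    */
static __parsec_chore_t my_task_chores[] = {
    { .type     = PARSEC_DEV_CUDA,
      .dyld     = "my_gpu_kernel",   /* hypothetical dynamically-loaded symbol */
      .evaluate = NULL,              /* no evaluate function for this class    */
      .hook     = NULL },            /* disabled: no CUDA device on this node  */
    { .type     = PARSEC_DEV_NONE,   /* terminator of the packed chore array   */
      .evaluate = NULL,
      .hook     = NULL },
};
```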

Contributor:

OK, that makes sense. But in the end it is an incorrect application, as we do not have support available for all tasks. If we stop this at a single process, it will be difficult to properly report and act on it. So I think this is something that needs to be checked upfront, before the first task is even executed, and all processes must act uniformly.
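
One possible shape of such an upfront, uniform check (a sketch only, not part of this PR; the device API names and type constants are taken from the public PaRSEC headers to the best of my recollection and should be verified):

```c
#include <mpi.h>
#include <parsec.h>
#include <parsec/mca/device/device.h>

/* Sketch: decide collectively, before any task is scheduled, whether every
 * rank has at least one GPU device, so that all processes act uniformly.    */
static int gpu_run_is_possible(MPI_Comm comm)
{
    int local_gpus = 0, min_gpus = 0;
    for( int id = 0; id < (int)parsec_nb_devices; id++ ) {
        parsec_device_module_t *dev = parsec_mca_device_get(id);
        if( NULL != dev && (dev->type & (PARSEC_DEV_CUDA | PARSEC_DEV_HIP)) )
            local_gpus++;
    }
    /* every rank learns the smallest per-rank GPU count */
    MPI_Allreduce(&local_gpus, &min_gpus, 1, MPI_INT, MPI_MIN, comm);
    return min_gpus > 0;  /* same answer on all ranks */
}
```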

abouteiller (author):

The hook is NULL because it is a dyld function: when we tried to populate the dyld functions, the device was not active, so we never searched for the symbol at all.

In device.c:868 we would have removed the device from the task pool's device mask if we had tried, but we didn't try (we don't run register on disabled devices), so in the end we never initialized the hook and it is still NULL.

The proposed behavior (try the next evaluate/hook when the current hook is NULL) sounds reasonable; do you see a problem with it?

If we run out of valid hooks, it will generate a parsec_fatal and that will kill the whole app. In addition, I separated the dyld case (we go to the next chore) from all other cases (those are erroneous, jump directly to the error case). For some reason GitHub is not updating the PR with the new commits right now.
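
A sketch of that split (the `dyld` field test, the message text, and `tc->name` are assumptions for illustration, not the exact code in these commits):

```c
if( NULL == hook ) {
    if( NULL != tc->incarnations[chore_id].dyld ) {
        /* Dynamically-loaded hook that was never resolved because no device
         * of this type is active: quietly skip this incarnation.            */
        goto next_chore;
    }
    /* A statically defined chore must always provide a hook: this is a bug. */
    parsec_fatal("task class %s: NULL hook for a non-dyld incarnation", tc->name);
}
```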

Contributor:

The device was not active when we tried to dlsym, but it is now? If the device is active now, that's bad: the devices should have been frozen before. If the device is still inactive, then this is an incorrect application and we should have detected it earlier. The problem is that at this point there is no easy way to bail out; we have to abort the entire app.

abouteiller (author):

The app is correct (it has only CUDA incarnations, and that is not a bug); we have simply detected (late) that the app cannot run on this system (no CUDA devices, before or after the freeze).

Ideally we should not call parsec_fatal when we capture HOOK_RETURN_ERROR (whatever the reason), but instead bubble the error condition all the way up to taskpool_wait and return it as an rc code; that's work for another day. When we do that, the error cases discovered here will be handled correctly (see the sketch below).

The last thing left is to fix the CI to use the correct tag. Thomas wanted to wait for Geri's rework of the CI, but it may be pretty easy to do now.
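
A sketch of what that future direction would look like from the application side (hypothetical, not current behavior; `tp` and `parsec` stand for the application's taskpool and context):

```c
/* Hypothetical future behavior: the failure surfaces as a return code from
 * parsec_taskpool_wait() instead of aborting inside the runtime.            */
int rc = parsec_taskpool_wait(tp);
if( PARSEC_SUCCESS != rc ) {
    fprintf(stderr, "taskpool did not complete (rc=%d): "
                    "no device can run the declared chores?\n", rc);
    parsec_fini(&parsec);   /* clean shutdown under application control */
    return EXIT_FAILURE;
}
```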

Contributor:

As I said in my previous message, we cannot escalate this across a distributed run, so from the user's perspective it might not be the best solution. The user should have tested upfront whether a GPU-only run can start on this setup.

```diff
@@ -41,6 +42,19 @@ int main(int argc, char *argv[])
         exit(-1);
     }
 
+    for(int id = 0; id < (int)parsec_nb_devices; id++) {
```

abouteiller (author):
I don't think this is necessary if we have appropriate error handling in parsec?
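
For context, this is roughly what the check in the test's main() amounts to (a sketch with assumed API names, not the actual diff content): count the usable GPU devices up front and skip cleanly when none is present, instead of hitting a late parsec_fatal inside the runtime.

```c
int ngpus = 0;
for( int id = 0; id < (int)parsec_nb_devices; id++ ) {
    parsec_device_module_t *dev = parsec_mca_device_get(id);
    if( NULL != dev && (dev->type & PARSEC_DEV_CUDA) )
        ngpus++;                     /* count CUDA-capable devices */
}
if( 0 == ngpus ) {
    fprintf(stderr, "no CUDA device found, skipping stress:gpu\n");
    parsec_fini(&parsec);            /* tear down the context cleanly */
    return 0;                        /* or a dedicated 'skipped' code the CI recognizes */
}
```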

abouteiller changed the title from "Bugfix/no gpu found" to "Consolidated error handling when GPU only tests execute on CPU systems" on Mar 13, 2024
abouteiller added this to the v4.0 milestone on Mar 13, 2024
abouteiller added the "bug" label on Mar 13, 2024
Labels: bug (Something isn't working)

Successfully merging this pull request may close these issues:
  • stress:gpu crashes when compiled with CUDA but run without a device

3 participants