Consolidated error handling when GPU only tests execute on CPU systems #644

Draft: abouteiller wants to merge 4 commits into master

Conversation

abouteiller (Contributor) commented Mar 12, 2024

Fixes issue #641

  • Capture the CI tag and compile for CUDA or HIP based on this tag, not on the availability of the software stack.
  • Decide whether we keep the check in stage_main that counts GPUs. We now detect the erroneous behavior (later) and produce a parsec_fatal, so maybe that is already clear enough?
  • Clean up the error cases for when hook == NULL; the cases have been split between the dyld function not having been populated (goto next_chore) and all other cases (that is a bug: assert/fail).
  • The stress tester is not symmetrical: stress:gpu crashes when compiled with CUDA but run without a device (#641 (comment)).

```diff
@@ -175,7 +175,9 @@ int __parsec_execute( parsec_execution_stream_t* es,
                            chore_id);
 #endif
     parsec_hook_t *hook = tc->incarnations[chore_id].hook;
-    assert( NULL != hook );
+    if( NULL == hook ) {
+        goto next_chore;  /* Gracefully manage the case when NO chores are available for this task */
```

Contributor:
I think this is coming up regularly, but as far as I recall the code, it is not correct. The chores are packed: no NULL chores can exist except at the end. Second, this loop is only executed for valid chores (i.e., non-NULL hooks). So the only way to get here is if no eval function accepted the work, in which case the correct outcome is to bail out of the loop (not go to the next chore).
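
For reference, a simplified sketch of the selection logic being described, with assumed signatures (this is not the actual `__parsec_execute()` source; only `tc`, `incarnations`, `hook`, and `chore_id` come from the diff above):

```c
/* Simplified sketch, not the PaRSEC source: the incarnations array is packed
 * and terminated by a NULL hook, so the loop never meets a NULL hook in the
 * middle; an optional evaluate function may veto an incarnation.            */
for( chore_id = 0; NULL != tc->incarnations[chore_id].hook; chore_id++ ) {
    if( NULL != tc->incarnations[chore_id].evaluate &&
        PARSEC_HOOK_RETURN_NEXT == tc->incarnations[chore_id].evaluate(task) )
        continue;                                    /* vetoed, try the next chore */
    rc = tc->incarnations[chore_id].hook(es, task);  /* accepted, execute the task */
    break;
}
```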

abouteiller (author):

Yes, I remembered this discussion and checked the history (#642 (comment)). Summary of last time: I dropped it because I could no longer reproduce it in the DTD case I was looking at.

Now it is back. What happens here is that we have only one valid chore (the JDF declares only a GPU chore), but the runtime disabled it (set hook = NULL by force) because we did not find a device of that type during task pool creation. The evaluate function is not invoked because it is NULL for this task class.
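
A conceptual illustration of the situation, in the style of JDF-generated code (the type, field, and symbol names here are assumptions for illustration, not copied from this PR):

```c
/* A task class that declares only a GPU chore.  The hook is NULL: either it
 * was never populated (the dyld symbol was not resolved because no CUDA
 * device was found) or it was force-disabled by the runtime at task pool
 * creation.  Only the NULL terminator of the packed chore array remains.    */
static __parsec_chore_t my_task_chores[] = {
    { .type     = PARSEC_DEV_CUDA,
      .dyld     = "my_gpu_kernel",   /* hypothetical dynamically-loaded symbol */
      .evaluate = NULL,              /* no evaluate function for this class    */
      .hook     = NULL },            /* disabled: no CUDA device on this node  */
    { .type     = PARSEC_DEV_NONE,   /* terminator of the packed chore array   */
      .evaluate = NULL,
      .hook     = NULL },
};
```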

Contributor:

OK, that makes sense. But in the end it is an incorrect application, as we do not have support available for all tasks. If we stop this at a single process, it will be difficult to properly report and act on it. So I think this is something that needs to be checked upfront, before the first task is even executed, and all processes must act uniformly.
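
One possible shape of such an upfront, uniform check (a sketch only, not part of this PR; the device API names and type constants are taken from the public PaRSEC headers to the best of my recollection and should be verified):

```c
#include <mpi.h>
#include <parsec.h>
#include <parsec/mca/device/device.h>

/* Sketch: decide collectively, before any task is scheduled, whether every
 * rank has at least one GPU device, so that all processes act uniformly.    */
static int gpu_run_is_possible(MPI_Comm comm)
{
    int local_gpus = 0, min_gpus = 0;
    for( int id = 0; id < (int)parsec_nb_devices; id++ ) {
        parsec_device_module_t *dev = parsec_mca_device_get(id);
        if( NULL != dev && (dev->type & (PARSEC_DEV_CUDA | PARSEC_DEV_HIP)) )
            local_gpus++;
    }
    /* every rank learns the smallest per-rank GPU count */
    MPI_Allreduce(&local_gpus, &min_gpus, 1, MPI_INT, MPI_MIN, comm);
    return min_gpus > 0;  /* same answer on all ranks */
}
```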

abouteiller (author):

The hook is NULL because it is a dyld function: when we tried to populate the dyld functions, the device was not active, so we never searched for the symbol at all.

In device.c:868 we would have removed the device from the task pool's device mask if we had tried, but we didn't try (we don't run register on disabled devices), so in the end we never initialized the hook and it is still NULL.

The proposed behavior (try the next evaluate/hook when the current hook is NULL) sounds reasonable; do you see a problem with it?

If we run out of valid hooks, it will generate a parsec_fatal and that will kill the whole app. In addition, I separated the dyld case (we go to the next chore) from all other cases (those are erroneous, jump directly to the error case). For some reason GitHub is not updating the PR with the new commits right now.
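
A sketch of that split (the `dyld` field test, the message text, and `tc->name` are assumptions for illustration, not the exact code in these commits):

```c
if( NULL == hook ) {
    if( NULL != tc->incarnations[chore_id].dyld ) {
        /* Dynamically-loaded hook that was never resolved because no device
         * of this type is active: quietly skip this incarnation.            */
        goto next_chore;
    }
    /* A statically defined chore must always provide a hook: this is a bug. */
    parsec_fatal("task class %s: NULL hook for a non-dyld incarnation", tc->name);
}
```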

Contributor:

The device was not active when we tried to dlsym, but it is now? If the device is active now, that's bad: the devices should have been frozen before. If the device is still inactive, then this is an incorrect application and we should have detected it earlier. The problem is that at this point there is no easy way to bail out; we have to abort the entire app.

abouteiller (author):

The app is correct (it has only CUDA incarnations, and that is not a bug); we have simply detected (late) that the app cannot run on this system (no CUDA devices, before or after the freeze).

Ideally we should not call parsec_fatal when we capture HOOK_RETURN_ERROR (whatever the reason), but instead bubble the error condition all the way up to taskpool_wait and return it as an rc code; that's work for another day. When we do that, the error cases discovered here will be handled correctly (see the sketch below).

The last thing left is to fix the CI to use the correct tag. Thomas wanted to wait for Geri's rework of the CI, but it may be pretty easy to do now.
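
A sketch of what that future direction would look like from the application side (hypothetical, not current behavior; `tp` and `parsec` stand for the application's taskpool and context):

```c
/* Hypothetical future behavior: the failure surfaces as a return code from
 * parsec_taskpool_wait() instead of aborting inside the runtime.            */
int rc = parsec_taskpool_wait(tp);
if( PARSEC_SUCCESS != rc ) {
    fprintf(stderr, "taskpool did not complete (rc=%d): "
                    "no device can run the declared chores?\n", rc);
    parsec_fini(&parsec);   /* clean shutdown under application control */
    return EXIT_FAILURE;
}
```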

Contributor:

As I said in my previous message, we cannot escalate this across a distributed run, so from the user's perspective it might not be the best solution. The user should have tested upfront whether a GPU-only run can start on this setup.

```diff
@@ -41,6 +42,19 @@ int main(int argc, char *argv[])
         exit(-1);
     }
 
+    for(int id = 0; id < (int)parsec_nb_devices; id++) {
```

abouteiller (author):
I don't think this is necessary if we have appropriate error handling in parsec?
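
For context, this is roughly what the check in the test's main() amounts to (a sketch with assumed API names, not the actual diff content): count the usable GPU devices up front and skip cleanly when none is present, instead of hitting a late parsec_fatal inside the runtime.

```c
int ngpus = 0;
for( int id = 0; id < (int)parsec_nb_devices; id++ ) {
    parsec_device_module_t *dev = parsec_mca_device_get(id);
    if( NULL != dev && (dev->type & PARSEC_DEV_CUDA) )
        ngpus++;                     /* count CUDA-capable devices */
}
if( 0 == ngpus ) {
    fprintf(stderr, "no CUDA device found, skipping stress:gpu\n");
    parsec_fini(&parsec);            /* tear down the context cleanly */
    return 0;                        /* or a dedicated 'skipped' code the CI recognizes */
}
```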

abouteiller changed the title from "Bugfix/no gpu found" to "Consolidated error handling when GPU only tests execute on CPU systems" on Mar 13, 2024
abouteiller added this to the v4.0 milestone on Mar 13, 2024
abouteiller added the "bug" label on Mar 13, 2024
Labels: bug (Something isn't working)

Successfully merging this pull request may close these issues:
  • stress:gpu crashes when compiled with CUDA but run without a device

3 participants