Consolidated error handling when GPU-only tests execute on CPU systems #644
base: master
Conversation
parsec/scheduling.c
Outdated
@@ -175,7 +175,9 @@ int __parsec_execute( parsec_execution_stream_t* es,
                         chore_id);
 #endif
     parsec_hook_t *hook = tc->incarnations[chore_id].hook;
-    assert( NULL != hook );
+    if( NULL == hook ) {
+        goto next_chore; /* Gracefully manage the case when NO chores are available for this task */
I think this comes up regularly, but as far as I recall the code, this is not correct. The chores are packed, so no NULL chores can exist except at the end. Second, this loop is only executed for valid chores (i.e., non-NULL hooks). So the only way to get here is if no eval function accepted the work, in which case the correct outcome is to bail out of the loop (not go to the next chore).
Yes, I remembered this discussion and checked the history: #642 (comment). Summary of last time: I dropped it because I couldn't reproduce it anymore in the DTD case I was looking at.
Now it is back. What happens here is that we have only one valid chore (the JDF declares only a GPU chore), but the runtime has disabled it (set hook=NULL by force) because we didn't find a device of that type during task pool creation. The evaluate function is not invoked because it is NULL for the task class.
OK, that makes sense. But then it is an incorrect application, since we do not have support for all tasks available. If we stop this at a single process, it will be difficult to properly report and act on it. Thus, I think this might be something that needs to be checked upfront, before the first task is even executed, and all processes must act uniformly.
The chore is NULL because it is a dyld function, and when we tried to populate the dyld functions the device was not active, so we didn't search for it at all.
In device.c:868 we should have removed the device from the task pool's device mask if we had tried, but we didn't try (we don't run registration on disabled devices), so in the end we never initialized the hook and it is still NULL.
The proposed behavior (try the next evaluate/hook if the current hook is NULL) sounds reasonable; do you see a problem with it?
If we run out of valid hooks, it will generate a parsec_fatal and that will kill the whole app.
In addition, I separated the cases for dyld (we go to the next chore) and everything else (this is erroneous; jump directly to the error case). For some reason GitHub isn't updating the PR with the commits right now.
The device was not active when we tried to dlsym, but it is now? If the device is active now, that's bad: the devices should have been frozen before. If the device is still inactive, then this is an incorrect application and we should have detected it before. The problem is that at this point there is no easy way to bail out; we have to abort the entire app.
The app is correct (it has only CUDA incarnations, which is not a bug), but we have detected (late) that the app cannot run on this system (no CUDA devices, before and after the freeze).
Ideally we should not call parsec_fatal when we capture HOOK_RETURN_ERROR (whatever the reason), but instead bubble the error condition all the way up to taskpool_wait to return it as an rc code; that's work for another day. When we do that, the error cases discovered here will be handled correctly.
The last thing left is to fix the CI to use the correct tag. Thomas wanted to wait for Geri's rework of the CI, but it may be pretty easy to do now.
As I said in my previous message, we cannot escalate this across a distributed run, so from the user's perspective it might not be the best solution. The user should have tested upfront whether a GPU-only run can start on this setup.
@@ -41,6 +42,19 @@ int main(int argc, char *argv[])
         exit(-1);
     }

+    for(int id = 0; id < (int)parsec_nb_devices; id++) {
I don't think this is necessary if we have appropriate error handling in parsec?
Fixes issue #641
It already ends in a parsec_fatal, so maybe that is already clear enough?