Infinite core INTERRUPTED loop #1157
Comments
We have run into this before. When the core returns |
Is this due to an erroneous exit code return by the core then? If so, we can fix that. When you say you think this is fixed, do you mean the Client has some protection from falling into an infinite loop? |
I think the best course of action is to fix the core return code. It should not return this code in this situation. |
I've run into this on |
It's a problem if the core is still returning |
So is there currently any solution? We already have the latest |
This is a bug in the OpenMM core not the client. |
John: The client was advanced from 7.4.4, but those versions were never released. Fixing the bug in the FahCore is the best approach if somebody can do that. Joseph: Are you thinking that, from the perspective of the client, the problem was fixed after 7.4.4 but prior to or in 7.4.9? |
The 7.4.9 client has a workaround for this problem but it has never been fully beta tested and releasing this client is not currently on the radar. The client only stops the loop eventually anyway. The best solution is to fix the core. |
@jcoffland: Should I take a stab at the core code changes, or do you need to do this? Which exit code(s) cause INTERRUPTED? |
Please do take a stab at it. |
Sorry---what's the numerical exit code corresponding to |
All of the FAH related stuff is in libfah. https://github.com/FoldingAtHome/libfah/blob/master/src/fah/core/ExitCode.h |
FYI, we (@cxhernandez) got around this issue on VSP-FAH by upgrading the NVIDIA driver. So that might give you a hint as to why this is happening. The current driver version is 340.65. I don't know what it was before. Carlos, please correct me if I've got any of this wrong. |
Wouldn't there be a number of things (like SEGFAULTs) that would cause the Core to be INTERRUPTED and that might be indistinguishable from CTRL-C without additional code to differentiate between those interruptions? Updating the NVIDIA driver might have helped, but it would also have temporarily cleared, say, a slow memory leak. |
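To illustrate the point about telling interruptions apart, here is a minimal Python sketch. It is not how the FAH client or the OpenMM core classifies exits; it only assumes a parent process that launches a child via `subprocess` on a POSIX system, where a negative return code means the child was killed by that signal number. The helper name `describe_exit` is hypothetical.

```python
import signal


def describe_exit(returncode: int) -> str:
    """Describe a child's return code (hypothetical helper, POSIX semantics).

    With Python's subprocess module on POSIX, a negative return code -N
    means the child was terminated by signal N.
    """
    if returncode == 0:
        return "clean exit"
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number not known to this platform's signal module.
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exit code {returncode}"
```

Note that this only separates a segfault from CTRL-C when the signal actually kills the process. A core that catches the signal and returns an ordinary exit code of its own would still look the same to the parent, which is the concern raised above.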
Yeah, this definitely isn't a general fix for the
How could we test this? |
Any chance you can also look at the core logs, rather than the client logs? That will probably contain the exact error (e.g. that there is an NVIDIA driver error). |
There are two questions that need answers here:
|
I can address (2) by taking a pass through to clean up all of the return codes this weekend. This has long been planned as issue https://github.com/FoldingAtHome/openmm-core/issues/57. Addressing (1) requires that the maintainers of projects experiencing this issue look into the returned error result packets, unzip them, and look at the core logs. These result packets are in directories that begin with |
IMHO, we can't, since we don't know which of the various possible causes of INTERRUPTED produced the loop. All we can do is fix the THROW macro calls to provide more information the next time it happens. If somebody wants to unzip the results for Project: 10496 (Run 83, Clone 1, Gen 7), they might find something useful, but I'm guessing you'll only find the Core log for somebody else completing the same WU, not the log corresponding to the client log posted above. |
I believe this was solved in the core. |
One Slack donor (hayesk) had his core keep crashing in an infinite loop when reading the checkpoint file.
We may want to have the client detect repeated crash cycles and abort the WU if some number is exceeded.
(@jcoffland: You may want to restore the internal fah-client issue tracker so we can report issues like this internally.)
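The suggestion to detect repeated crash cycles could look roughly like the following Python sketch. This is not the FAH client's code (the client is not written in Python), and the class name `CrashLoopGuard`, the `wu_id` key, and the default limit of 5 are all assumptions made for illustration.

```python
class CrashLoopGuard:
    """Counts consecutive abnormal core exits per work unit (hypothetical)."""

    def __init__(self, max_consecutive: int = 5):
        # Assumed threshold; the issue only says "some number".
        self.max_consecutive = max_consecutive
        self._failures = {}

    def record(self, wu_id: str, interrupted: bool) -> bool:
        """Record one core run; return True when the WU should be aborted."""
        if not interrupted:
            # Any normal run resets the streak for this work unit.
            self._failures.pop(wu_id, None)
            return False
        count = self._failures.get(wu_id, 0) + 1
        self._failures[wu_id] = count
        return count >= self.max_consecutive
```

A client loop would call `record()` after each core exit and dump the WU once it returns `True`. Counting only consecutive failures means a single genuine CTRL-C does not count against a work unit that otherwise makes progress.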