Network connection and No heartbeat message #113
Comments
Commented by MikeMarsUK on 11 Oct 37304787 07:06 UTC This was Nickolas's analysis of the bug:

There is a network-related problem that can cause this. BOINC recently switched to synchronous DNS resolving, in an attempt to work around a DNS cache bug. That means the core client can't do anything while it's waiting for the DNS server to respond; it is essentially hung until it gets a reply. If the DNS server is not replying (for example, because your internet connection has problems), it takes a relatively long time, say 30 seconds, for the lookup to finally give up.

During this period the science app can't communicate with the core client (the core client is hung, so it can't reply). The science app may quit with the error "No heartbeat from core client for 30 seconds, exiting". When the core client finally gets either a reply from the DNS server or a timeout, and is able to do other things again, it notices that the science applications have suddenly disappeared. So it gives the error "Task [name] exited with zero status but no 'finished' file. If this happens repeatedly you may need to reset the project." That's the part where the clueless user follows instructions, resets the project, and makes the project lose a climate model, all because of a slow or non-working Internet connection!

Another problem this DNS thing causes is an unresponsive manager. BOINC Manager has always used blocking I/O for GUI RPCs. That means the BOINC Manager can't do anything while it's waiting for the core client to respond; it is essentially hung until it gets a reply. If the core client is hung waiting for DNS, it can't respond to the manager, so the manager can't respond to mouse clicks. It all ends in a completely unresponsive GUI, all because of a slow or non-working Internet connection!

Summary: a chain of nasty events. To solve everything I point out in this message, a lot of fixes would be needed.
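To make the chain of events above concrete, here is a toy model of the heartbeat timing. It is only a sketch and not BOINC's code: the function name `first_missed_heartbeat` is hypothetical, and it assumes the client sends one heartbeat after each main-loop step finishes and that the science app gives up once 30 seconds pass without one (the figure quoted in the error message).

```python
def first_missed_heartbeat(step_durations, timeout=30.0):
    """Return the time (seconds) at which the science app would give up,
    or None if every heartbeat arrives in time.

    step_durations: how long each main-loop step of the client blocks.
    Assumption: a heartbeat is sent only when a step has finished.
    """
    now = 0.0             # current simulated time
    last_heartbeat = 0.0  # time the app last heard from the client
    for duration in step_durations:
        finish = now + duration
        # While the client is blocked inside this step, no heartbeat is sent.
        if finish - last_heartbeat > timeout:
            return last_heartbeat + timeout
        now = finish
        last_heartbeat = now
    return None


# Short steps keep the app alive; one 35-second blocking DNS lookup does not.
print(first_missed_heartbeat([1, 1, 1]))
print(first_missed_heartbeat([1, 35, 1]))
```

In this model the only thing that matters is the longest single blocking step, which is why one stalled DNS lookup is enough to lose the task.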
Commented by MikeMarsUK on 7 Feb 37304800 14:40 UTC For 'Nickolas' please read Nicolas. I feel this is potentially a big bug for the following categories of users:
Not much of an issue for ADSL users running small work units, since it doesn't matter if a few get trashed sometimes. |
Commented by MikeMarsUK on 9 Sep 37311051 12:26 UTC
'''ARGH!!! again, Trac thinks the above links and links in the logs are spam. Can project sites be whitelisted? To fix, replace * with /''' vsmon.exe (ZoneAlarm's virus checker) crashed at 8:17 UTC while the PC was unattended, and as a result took down ZoneAlarm itself, blocking comms to the seasonal attribution project servers as well as localhost traffic. I came home 12 hours later, and all 3 of my SAP models on the PC had been killed. SAP doesn't upload its climate as it goes, so that's a loss of 300 CPU hours as far as the project goes. vsmon has crashed a few times in the past (once every few months), but it has always been with older versions of the BOINC manager (5.4.x) rather than 5.8.x, and it has never affected any work units before. The approximate time that vsmon crashed would have been 9:17 GMT in the log (8:17 UTC), so it was just over two hours before the models crashed.
Commented by MikeMarsUK on 1 Apr 37313295 23:06 UTC http://www.climateprediction.net/board/viewtopic.php?p=62746#62746 This one shows both a malariacontrol and a climate model crashing simultaneously with an 0xc0000142 exception. The log also shows 'communications deferred 1 minute' type messages, although the user (Haraldo) is not aware of any network issues. Boinc manager version is 5.8.15. 0xc0000142 is said to be a problem with initialising applications, in particular not being able to load DLLs. I'm guessing that the DNS issue causes the science application(s) to continually drop out and restart. If there is some sort of resource leak due to the crash then this would only happen a finite number of times, followed by all non-suspended work units on the PC crashing in quick succession? |
Commented by davea on 26 Apr 37395530 16:53 UTC |
Commented by Bunsen on 1 Aug 37406900 19:06 UTC |
Commented by MikeMarsUK on 20 May 37460494 14:13 UTC Best thing to do is to use 5.4.16 |
Commented by MikeMarsUK on 6 Feb 37460512 12:00 UTC See also: [ (don't know the version) http://boinc.berkeley.edu/trac/ticket/171 (duplicate of 113) |
Commented by Contact on 7 Dec 37461348 15:33 UTC 2nd part of problem as defined by Nicolas:
Technical details for the second problem: this is because the manager uses blocking I/O to communicate with the client. |
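As an illustration of the blocking-I/O point, the sketch below puts a timeout on a request/reply exchange so the caller gets control back even if the other side never answers. This is not the Manager's actual GUI RPC code; `rpc_with_timeout` is a hypothetical helper, and a real GUI would more likely run the call off the UI thread as well.

```python
import socket


def rpc_with_timeout(sock, request, timeout=2.0):
    """Send a request and wait at most `timeout` seconds for a reply.

    Returns the reply bytes, or None if the peer did not answer in time.
    Hypothetical helper for illustration only.
    """
    sock.settimeout(timeout)
    try:
        sock.sendall(request)
        return sock.recv(4096)
    except socket.timeout:
        # The peer is busy (for example, stuck in a DNS lookup);
        # give control back to the caller instead of hanging.
        return None


# A connected pair of sockets stands in for manager and client.
manager_end, client_end = socket.socketpair()
print(rpc_with_timeout(manager_end, b"get_state", timeout=0.1))  # peer is silent
manager_end.close()
client_end.close()
```

With a plain blocking `recv` and no timeout, the same call would wait for as long as the client stays hung.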
Commented by MikeMarsUK on 1 Nov 37463458 15:33 UTC |
Commented by Bunsen on 14 Mar 37845263 04:53 UTC |
Commented by Pepo on 24 Nov 37846904 06:13 UTC
(I suffered from this issue for nearly a year, often (but not only) when moving a running notebook between various networks, but not anymore since... this summer? (northern hemisphere))
Commented by Bunsen on 5 Oct 37848594 21:46 UTC
I'm a chemist, not a software engineer. (Insert a ''Star Trek'' joke here if you like.) I don't understand a lot about network operations and such-like, but I'm a reasonably competent observer. I noted a recurring problem with my projects resetting every minute or so while they weren't able to communicate with the servers, and was told that the probable cause was this DNS thing. My response is: "If you say so... can you fix it?" It doesn't ever happen with 5.4.11; it's happened with all the later versions that I've tried. And now I'm reporting that I'm seeing the same behaviour in 5.10.28: the manager gets into a state such that the projects reset every minute, and make no progress as a result. I'm hoping that the observation might be useful in diagnosing the problem... and of course, perhaps to nudge the problem back into someone's attention. I'd prefer to be using a current version of the manager, but for people like me who don't have a highly-reliable, always-on network connection, this problem is a serious disincentive to doing so. |
Commented by MikeMarsUK on 30 Mar 37884982 12:53 UTC As far as I can remember, 5.4.11 had asynchronous DNS, whereas the later versions went to synchronous DNS. That might explain why 5.4.11 works but the more recent versions fail. Is there any chance of reverting to async DNS? I feel this problem is more severe than the stale-DNS-cache issue which prompted the move to synchronous DNS in the first place. Perhaps the timeout on the sync DNS query should be shortened to a few seconds, with it falling back to an async call after the timeout.
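The "short timeout, then carry on" idea could be sketched roughly as follows. This is an illustration under stated assumptions, not a patch for BOINC: `resolve_with_timeout` is a hypothetical name, the 3-second default is arbitrary, and the lookup is simply handed to a worker thread so the caller waits only for a bounded time while the lookup itself finishes in the background.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Worker threads that run the (possibly slow) blocking lookups.
_pool = ThreadPoolExecutor(max_workers=4)


def resolve_with_timeout(host, timeout=3.0, resolver=socket.gethostbyname):
    """Return an address string for `host`, or None if the lookup fails
    or does not finish within `timeout` seconds.

    `resolver` is injectable so the function can be tried without a network.
    """
    future = _pool.submit(resolver, host)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        # Give up waiting; the lookup keeps running in its worker thread,
        # so the main loop can go back to sending heartbeats.
        return None
    except OSError:
        # socket.gaierror (name not found and similar) is a subclass of OSError.
        return None
```

A caller would treat `None` as "no network right now" and retry later, rather than blocking the main loop for the full resolver timeout.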
Commented by Dingo on 9 Apr 37905756 09:46 UTC
I am also having this problem over at ABC. If you look at computer 27978 you will see that work units 5342497 and 5342483 are both in error with "No heartbeat from core client for 31 sec - exiting". These two work units were running on a Linux PC with BOINC version 5.10.21 for Linux 64 bit when the network went down here at home. When it went down, BOINC was frozen, probably trying to connect, and the four cores (Q6600) were not crunching. I rebooted the PC, and when it restarted the ABC work units above were both in error with the no heartbeat error. You can see when my network stops, as there are between two and four work units together that are in error. This is happening on a few machines, all of them Linux 64 bit. BOINC freezes up and stops processing work whenever the network shuts down. I have the Activity set to "always available", as it usually is. I have 12 PCs here at home and I do not want to have to manage the network activity manually by "suspending activity" and restarting it when the network is up again. I am returning work units to multiple projects that are in error due to this problem.
Commented by Wang Solutions on 4 May 37916886 00:26 UTC |
Commented by MikeMarsUK on 20 Jan 37926280 19:33 UTC Perhaps as a temporary work-around you could install a reliable DNS server on your local network (rather than using your ISP's DNS?). Personally I'm using a router which has DNS included. |
There have been no reports of this happening in recent months.
Reported by KSMarksPsych on 25 Oct 37304634 05:46 UTC
As reported at
http://boinc.berkeley.edu/dev/forum_thread.php?id=1561
The behavior appears with 5.8.x, but not with 5.4.11.
"According to the message log, the repeated exit-and-reset stopped when SETI@Home started asking to connect to the net to fetch work and report results. Einstein@Home had been requesting a connection for a couple of minutes before that. It was probably at about that time that I first "woke up" the computer this morning and dialed in to my ISP. I don't know which of these "state changes" might relate to the exit-and-reset stopping."
"It looks rather like the bogus error-and-reset is associated with unsuccessful attempts of the manager to connect with the net, i.e. when I'm not dialed in. It doesn't always happen with a connection attempt, but it only appears to be happening when an attempt is made, and at that precise time. The "defer for 1 minute" behaviour seems to be what was causing that 63-second reset."
"Hmm, no. For me, the problem recurred repeatedly all the time that I wasn't dialled in, roughly once per minute (since the manager repeated its connection attempt with a 1-minute delay). The problem stopped as soon as the connection was established. I don't recall seeing any occurrences of the problem between when I connected and when I logged off."
"Okay, I've been doing this with 5.8.15 for the last week and a half or so. Having to enable/disable the network connectivity manually is indeed an annoyance, but I haven't had that error occur once in that time. It looks like you may have found the culprit."
Migrated-From: http://boinc.berkeley.edu/trac/ticket/113