Network connection and No heartbeat message #113

romw · 2015-02-03T20:51:41Z

Reported by KSMarksPsych on 25 Oct 37304634 05:46 UTC
As reported at

http://boinc.berkeley.edu/dev/forum_thread.php?id=1561

The behavior appears with 5.8.x, but not with 5.4.11.

"According to the message log, the repeated exit-and-reset stopped when SETI@Home started asking to connect to the net to fetch work and report results. Einstein@Home had been requesting a connection for a couple of minutes before that. It was probably at about that time that I first "woke up" the computer this morning and dialed in to my ISP. I don't know which of these "state changes" might relate to the exit-and-reset stopping."

"It looks rather like the bogus error-and-reset is associated with unsuccessful attempts of the manager to connect with the net, i.e. when I'm not dialed in. It doesn't always happen with a connection attempt, but it only appears to be happening when an attempt is made, and at that precise time. The "defer for 1 minute" behaviour seems to be what was causing that 63-second reset."

"Hmm, no. For me, the problem recurred repeatedly all the time that I wasn't dialled in, roughly once per minute (since the manager repeated its connection attempt with a 1-minute delay). The problem stopped as soon as the connection was established. I don't recall seeing any occurrences of the problem between when I connected and when I logged off."

"Okay, I've been doing this with 5.8.15 for the last week and a half or so. Having to enable/disable the network connectivity manually is indeed an annoyance, but I haven't had that error occur once in that time. It looks like you may have found the culprit."

Migrated-From: http://boinc.berkeley.edu/trac/ticket/113

romw · 2015-02-04T07:48:32Z

Commented by MikeMarsUK on 11 Oct 37304787 07:06 UTC

This was Nickolas's analysis of the bug:

There is a network-related problem that can cause this.

BOINC recently switched to synchronous DNS resolution, in an attempt to work around a DNS cache bug. That means the core client can't do anything while it's waiting for the DNS server to respond; it's essentially hung until it gets a reply. If the DNS server is not replying, for example because your internet connection has problems, it takes a relatively long time (say 30 seconds) for the lookup to finally give up. During this period, the science app can't communicate with the core client (as the core client is hung, it can't reply). It may quit with the error "No heartbeat from core client for 30 seconds, exiting".

When the core client finally gets either a reply from the DNS server or a timeout, and can do other things again, it notices that the science applications have suddenly disappeared. So it gives the error "Task [name] exited with zero status but no 'finished' file. If this happens repeatedly you may need to reset the project." That's the part where the clueless user follows instructions, resets the project, and makes the project lose a climate model, all because of a slow or non-working Internet connection!
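The chain described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not BOINC's actual code: the function name `simulate`, the one-second tick, and the single-loop client model are all hypothetical. Only the 30-second heartbeat limit comes from the quoted error message.

```python
HEARTBEAT_TIMEOUT = 30  # seconds, taken from the quoted error message


def simulate(dns_stall_seconds, total_seconds=60):
    """Model a single-threaded client that sends no heartbeats while blocked in DNS.

    Hypothetical model: the client is blocked for the first `dns_stall_seconds`
    ticks; on every other tick it runs its loop and sends a heartbeat.
    """
    last_heartbeat = 0
    for now in range(1, total_seconds + 1):
        client_blocked = now <= dns_stall_seconds
        if not client_blocked:
            last_heartbeat = now  # client loop ran, heartbeat delivered
        silence = now - last_heartbeat
        if silence > HEARTBEAT_TIMEOUT:
            # The science app gives up and exits on its own.
            return f"app exits at t={now}s: no heartbeat for {silence}s"
    return "app still running"


print(simulate(10))  # short stall: the app survives
print(simulate(45))  # stall longer than the heartbeat limit: the app exits
```

In this model a stall shorter than the heartbeat limit is harmless, while anything longer kills the app before the client wakes up, which would match the "exited with zero status but no 'finished' file" message the client prints afterwards.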

Another problem this DNS behaviour causes is an unresponsive manager. BOINC Manager has always used blocking I/O for GUI RPCs. That means the BOINC Manager can't do anything while it's waiting for the core client to respond; it's essentially hung until it gets a reply. If the core client is hung waiting for DNS, it can't respond to the manager, so the manager can't respond to mouse clicks. The result is a completely unresponsive GUI, all because of a slow or non-working Internet connection!
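For comparison, here is a rough sketch of what a bounded wait looks like, as opposed to blocking I/O. It is not the manager's real RPC code: `rpc_with_timeout` and the `b"ping"` request are hypothetical, and a local `socket.socketpair()` stands in for the manager-to-client connection.

```python
import socket


def rpc_with_timeout(sock, request, timeout_s):
    """Send a request and wait at most timeout_s for a reply.

    Returns the reply bytes, or None if the peer stayed silent, so the
    caller (for example a GUI event loop) gets control back either way.
    """
    sock.settimeout(timeout_s)
    try:
        sock.sendall(request)
        return sock.recv(4096)
    except socket.timeout:
        return None


# A connected pair of sockets: one end plays the manager, the other a client
# that never answers (as if it were stuck in a DNS lookup).
manager_end, client_end = socket.socketpair()
print(rpc_with_timeout(manager_end, b"ping", 0.1))  # None: no reply, but no hang
```

With a plain blocking `recv` and no timeout, the same call would wait for as long as the client stays silent.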

Summary: A chain of nasty events. To solve everything I point out in this message, a lot of fixes would be needed.

romw · 2015-02-04T07:48:53Z

Commented by MikeMarsUK on 7 Feb 37304800 14:40 UTC
Replying to MikeMarsUK:

For 'Nickolas' please read Nicolas.

I feel this is potentially a big bug for the following categories of users:

Anyone with an iffy connection, since it'll cause the Boinc manager to lock up
Anyone with dial-up, since if Boinc makes a DNS query during the establishment or disestablishment of their connection, it could freeze
Anyone running a long-duration work unit, since it makes their work unit vulnerable to transient network issues

Not much of an issue for ADSL users running small work units, since it doesn't matter if a few get trashed sometimes.

romw · 2015-02-04T07:49:14Z

Commented by MikeMarsUK on 9 Sep 37311051 12:26 UTC
ARGH!!! Until just now this problem has been purely theoretical for me, but I've just lost three climate models after a network issue (firewall crashed taking down comms, including localhost traffic, for 12 hours). There was approx 2 hours between the firewall crashing and the climate models crashing.

Boinc manager: 5.8.16
Project: Seasonal Attribution Project
PC: AMD X2 4600, 2GB RAM
OS: XP SP2
Network: ADSL via router
ZoneAlarm security suite: 7.0.337.000
Work units: 
  http:**attribution.cpdn.org*result.php?resultid=61997
  http:**attribution.cpdn.org*result.php?resultid=61960
  http:**attribution.cpdn.org*result.php?resultid=62318

'''ARGH!!! again, Trac thinks the above links and links in the logs are spam. Can project sites be whitelisted? To fix, replace * with /'''
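For anyone reading the mangled links and logs in this comment, the substitution can be undone mechanically. A minimal sketch (the helper name `unmangle` is made up here, and it assumes the text contains no other asterisks):

```python
def unmangle(text):
    """Undo the '*' for '/' substitution used to get past Trac's spam filter."""
    return text.replace("*", "/")


print(unmangle("http:**attribution.cpdn.org*result.php?resultid=61997"))
# http://attribution.cpdn.org/result.php?resultid=61997
```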

vsmon.exe (ZoneAlarm's virus checker) crashed at 8:17UTC, while the PC was unattended, and as a result took down ZoneAlarm itself, blocking comms to the seasonal attribution project servers as well as localhost traffic. Came home 12 hours later, and all 3 of my SAP models on the PC were killed.

SAP doesn't upload its climate as it goes, so that's a loss of 300 CPU hours as far as the project goes.

vsmon has crashed a few times in the past (once every few months), but it's always been with older versions of the boinc manager (5.4.x) rather than 5.8.x, and it's never affected any work units before.

Approx time that VSMon crashed would have been 9:17 GMT in the log (8:17 UTC), so it was just over two hours before the models crashed.

2007-04-23 11:28:17 [Seasonal Attribution Project](CPDN) [Process for hadam3h_n_167s1_006b_006b_1_1 exited
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project](task_debug]) Task hadam3h_n_167s1_006b_006b_1_1 exited with zero status but no 'finished' file
2007-04-23 11:28:17 [Seasonal Attribution Project](CPDN) If this happens repeatedly you may need to reset the project.
2007-04-23 11:28:17 [Seasonal Attribution Project](CPDN) [task_state=UNINITIALIZED for hadam3h_n_167s1_006b_006b_1_1 from handle_exited_app
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project](task_debug]) [Process for hadam3h_a_111s4_2000_2000_1_1 exited
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project](task_debug]) Task hadam3h_a_111s4_2000_2000_1_1 exited with zero status but no 'finished' file
2007-04-23 11:28:17 [Seasonal Attribution Project](CPDN) If this happens repeatedly you may need to reset the project.
2007-04-23 11:28:17 [Seasonal Attribution Project](CPDN) [task_state=UNINITIALIZED for hadam3h_a_111s4_2000_2000_1_1 from handle_exited_app
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project](task_debug]) [task_state=EXECUTING for hadam3h_n_167s1_006b_006b_1_1 from start
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project](task_debug]) Restarting task hadam3h_n_167s1_006b_006b_1_1 using hadam3 version 407
2007-04-23 11:28:17 [Seasonal Attribution Project](CPDN) [task_state=EXECUTING for hadam3h_a_111s4_2000_2000_1_1 from start
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project](task_debug]) Restarting task hadam3h_a_111s4_2000_2000_1_1 using hadam3 version 407
2007-04-23 11:28:17 [Seasonal Attribution Project](CPDN) Scheduler request failed: a timeout was reached
2007-04-23 11:28:17 [Seasonal Attribution Project](CPDN) Deferring communication for 1 min 0 sec
2007-04-23 11:28:17 [Seasonal Attribution Project](CPDN) Reason: scheduler request failed
2007-04-23 11:28:18 [Seasonal Attribution Project](CPDN) [Process for hadam3h_n_167s1_006b_006b_1_1 exited
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project](task_debug]) [task_state=EXITED for hadam3h_n_167s1_006b_006b_1_1 from handle_exited_app
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project](task_debug]) Deferring communication for 1 min 0 sec
2007-04-23 11:28:18 [Seasonal Attribution Project](CPDN) Reason: Unrecoverable error for result hadam3h_n_167s1_006b_006b_1_1 ( - exit code -1073741502 (0xc0000142))
2007-04-23 11:28:18 [Seasonal Attribution Project](CPDN) [result state=COMPUTE_ERROR for hadam3h_n_167s1_006b_006b_1_1 from CS::report_result_error
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project](task_debug]) [Process for hadam3h_n_167s1_006b_006b_1_1 exited
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project](task_debug]) [exit code -1073741502 (0xc0000142): 
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project](task_debug]) [Process for hadam3h_a_111s4_2000_2000_1_1 exited
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project](task_debug]) [task_state=EXITED for hadam3h_a_111s4_2000_2000_1_1 from handle_exited_app
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project](task_debug]) [result state=COMPUTE_ERROR for hadam3h_a_111s4_2000_2000_1_1 from CS::report_result_error
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project](task_debug]) [Process for hadam3h_a_111s4_2000_2000_1_1 exited
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project](task_debug]) [exit code -1073741502 (0xc0000142): 
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project](task_debug]) Computation for task hadam3h_n_167s1_006b_006b_1_1 finished
2007-04-23 11:28:18 [Seasonal Attribution Project](CPDN) Output file hadam3h_n_167s1_006b_006b_1_1_1.zip for task hadam3h_n_167s1_006b_006b_1_1 absent
2007-04-23 11:28:18 [Seasonal Attribution Project](CPDN) Output file hadam3h_n_167s1_006b_006b_1_1_2.zip for task hadam3h_n_167s1_006b_006b_1_1 absent
2007-04-23 11:28:18 [Seasonal Attribution Project](CPDN) Output file hadam3h_n_167s1_006b_006b_1_1_3.zip for task hadam3h_n_167s1_006b_006b_1_1 absent
2007-04-23 11:28:18 [Seasonal Attribution Project](CPDN) Output file hadam3h_n_167s1_006b_006b_1_1_4.zip for task hadam3h_n_167s1_006b_006b_1_1 absent
2007-04-23 11:28:18 [Seasonal Attribution Project](CPDN) Output file hadam3h_n_167s1_006b_006b_1_1_5.zip for task hadam3h_n_167s1_006b_006b_1_1 absent
2007-04-23 11:28:18 [Seasonal Attribution Project](CPDN) [result state=COMPUTE_ERROR for hadam3h_n_167s1_006b_006b_1_1 from CS::app_finished
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project](task_debug]) Starting hadam3h_n_014s1_003b_003b_2_1
2007-04-23 11:28:18 [Seasonal Attribution Project](CPDN) [task_state=EXECUTING for hadam3h_n_014s1_003b_003b_2_1 from start
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project](task_debug]) Starting task hadam3h_n_014s1_003b_003b_2_1 using hadam3 version 407
2007-04-23 11:28:19 [Seasonal Attribution Project](CPDN) [Process for hadam3h_n_014s1_003b_003b_2_1 exited
2007-04-23 11:28:19 [CPDN Seasonal Attribution Project](task_debug]) [task_state=EXITED for hadam3h_n_014s1_003b_003b_2_1 from handle_exited_app
2007-04-23 11:28:19 [CPDN Seasonal Attribution Project](task_debug]) Reason: Unrecoverable error for result hadam3h_n_014s1_003b_003b_2_1 ( - exit code -1073741502 (0xc0000142))
2007-04-23 11:28:19 [Seasonal Attribution Project](CPDN) [result state=COMPUTE_ERROR for hadam3h_n_014s1_003b_003b_2_1 from CS::report_result_error
2007-04-23 11:28:19 [CPDN Seasonal Attribution Project](task_debug]) [Process for hadam3h_n_014s1_003b_003b_2_1 exited
2007-04-23 11:28:19 [CPDN Seasonal Attribution Project](task_debug]) [exit code -1073741502 (0xc0000142): 
2007-04-23 11:28:19 [CPDN Seasonal Attribution Project](task_debug]) Computation for task hadam3h_a_111s4_2000_2000_1_1 finished
2007-04-23 11:28:19 [Seasonal Attribution Project](CPDN) Output file hadam3h_a_111s4_2000_2000_1_1_1.zip for task hadam3h_a_111s4_2000_2000_1_1 absent
2007-04-23 11:28:19 [Seasonal Attribution Project](CPDN) Output file hadam3h_a_111s4_2000_2000_1_1_2.zip for task hadam3h_a_111s4_2000_2000_1_1 absent
2007-04-23 11:28:19 [Seasonal Attribution Project](CPDN) Output file hadam3h_a_111s4_2000_2000_1_1_3.zip for task hadam3h_a_111s4_2000_2000_1_1 absent
2007-04-23 11:28:19 [Seasonal Attribution Project](CPDN) Output file hadam3h_a_111s4_2000_2000_1_1_4.zip for task hadam3h_a_111s4_2000_2000_1_1 absent
2007-04-23 11:28:19 [Seasonal Attribution Project](CPDN) Output file hadam3h_a_111s4_2000_2000_1_1_5.zip for task hadam3h_a_111s4_2000_2000_1_1 absent
2007-04-23 11:28:19 [Seasonal Attribution Project](CPDN) [result state=COMPUTE_ERROR for hadam3h_a_111s4_2000_2000_1_1 from CS::app_finished
2007-04-23 11:28:20 [CPDN Seasonal Attribution Project](task_debug]) Computation for task hadam3h_n_014s1_003b_003b_2_1 finished
2007-04-23 11:28:20 [Seasonal Attribution Project](CPDN) Output file hadam3h_n_014s1_003b_003b_2_1_1.zip for task hadam3h_n_014s1_003b_003b_2_1 absent
2007-04-23 11:28:20 [Seasonal Attribution Project](CPDN) Output file hadam3h_n_014s1_003b_003b_2_1_2.zip for task hadam3h_n_014s1_003b_003b_2_1 absent
2007-04-23 11:28:20 [Seasonal Attribution Project](CPDN) Output file hadam3h_n_014s1_003b_003b_2_1_3.zip for task hadam3h_n_014s1_003b_003b_2_1 absent
2007-04-23 11:28:20 [Seasonal Attribution Project](CPDN) Output file hadam3h_n_014s1_003b_003b_2_1_4.zip for task hadam3h_n_014s1_003b_003b_2_1 absent
2007-04-23 11:28:20 [Seasonal Attribution Project](CPDN) Output file hadam3h_n_014s1_003b_003b_2_1_5.zip for task hadam3h_n_014s1_003b_003b_2_1 absent
2007-04-23 11:28:20 [Seasonal Attribution Project](CPDN) [result state=COMPUTE_ERROR for hadam3h_n_014s1_003b_003b_2_1 from CS::app_finished


_____________________________________________________________________________





23/04/2007 20:48:26||Starting BOINC client version 5.8.16 for windows_intelx86
23/04/2007 20:48:26||log flags: task, file_xfer, sched_ops, task_debug, sched_op_debug, checkpoint_debug
23/04/2007 20:48:26||Libraries: libcurl/7.16.0 OpenSSL/0.9.8a zlib/1.2.3
23/04/2007 20:48:26||Data directory: E:\Program Files\BOINC
23/04/2007 20:48:27||[task_debug](task_debug]) result state=FILES_UPLOADED for hadam3h_n_167s1_006b_006b_1_1 from RESULT::parse_state
23/04/2007 20:48:27||[result state=FILES_UPLOADED for hadam3h_a_111s4_2000_2000_1_1 from RESULT::parse_state
23/04/2007 20:48:27||[task_debug](task_debug]) result state=FILES_UPLOADED for hadam3h_n_014s1_003b_003b_2_1 from RESULT::parse_state
23/04/2007 20:48:27||Processor: 2 AuthenticAMD AMD Athlon(tm) 64 X2 Dual Core Processor 4600+ [Family 15 Model 43 Stepping 1](x86) [tsc pae nx sse sse2 3dnow mmx](fpu)
23/04/2007 20:48:27||Memory: 2.00 GB physical, 3.85 GB virtual
23/04/2007 20:48:27||Disk: 71.56 GB total, 41.05 GB free
23/04/2007 20:48:28|CPDN Seasonal Attribution Project|URL: http:**attribution.cpdn.org*; Computer ID: 17357; location: home; project prefs: default
23/04/2007 20:48:28|BBC Climate Change Experiment|URL: http:**bbc.cpdn.org*; Computer ID: 102303; location: home; project prefs: default
23/04/2007 20:48:28|rosetta@home|URL: http:**boinc.bakerlab.org*rosetta*; Computer ID: 138077; location: home; project prefs: default
23/04/2007 20:48:28|Climateprediction.net Beta|URL: http:**climateapps1.oucs.ox.ac.uk*beta*; Computer ID: 118; location: home; project prefs: default
23/04/2007 20:48:28|climateprediction.net|URL: http:**climateprediction.net*; Computer ID: 525881; location: home; project prefs: default
23/04/2007 20:48:28|Zivis|URL: http:**zivis.bifi.unizar.es*; Computer ID: 1086; location: (none); project prefs: default

'''Above mangled to try to get round Trac's refusal to include logs in the post'''

23/04/2007 20:48:28||General prefs: from climateprediction.net (last modified 2007-04-21 12:37:49)
23/04/2007 20:48:28||Host location: home
23/04/2007 20:48:28||General prefs: no separate prefs for home; using your defaults
23/04/2007 20:48:28||[SCHEDULER_OP::init_op_project(): starting op for http://attribution.cpdn.org/
23/04/2007 20:48:31|CPDN Seasonal Attribution Project|Sending scheduler request: Requested by user
23/04/2007 20:48:31|CPDN Seasonal Attribution Project|Requesting 17280000 seconds of new work, and reporting 3 completed tasks
23/04/2007 20:48:33|CPDN Seasonal Attribution Project|Scheduler RPC succeeded [server version 505](sched_op_debug])
23/04/2007 20:48:33||[handle_scheduler_reply(): got ack for result hadam3h_n_167s1_006b_006b_1_1
23/04/2007 20:48:33||[sched_op_debug](sched_op_debug]) handle_scheduler_reply(): got ack for result hadam3h_a_111s4_2000_2000_1_1
23/04/2007 20:48:33||[handle_scheduler_reply(): got ack for result hadam3h_n_014s1_003b_003b_2_1
23/04/2007 20:48:33|CPDN Seasonal Attribution Project|Deferring communication for 1 min 0 sec
23/04/2007 20:48:33|CPDN Seasonal Attribution Project|Reason: no work from project
23/04/2007 20:49:25||Suspending computation - user request
23/04/2007 20:50:03||Resuming computation
23/04/2007 20:50:03|BBC Climate Change Experiment|Starting hadcm3ohf_cd6l_00977177_1
23/04/2007 20:50:07|BBC Climate Change Experiment|[task_debug](sched_op_debug]) task_state=EXECUTING for hadcm3ohf_cd6l_00977177_1 from start
23/04/2007 20:50:07|BBC Climate Change Experiment|Starting task hadcm3ohf_cd6l_00977177_1 using hadcm3 version 515
23/04/2007 20:50:07||[SCHEDULER_OP::init_op_project(): starting op for http:**attribution.cpdn.org*

'''mangled'''

23/04/2007 20:50:07|CPDN Seasonal Attribution Project|Sending scheduler request: To fetch work
23/04/2007 20:50:07|CPDN Seasonal Attribution Project|Requesting 4320000 seconds of new work
23/04/2007 20:50:12|CPDN Seasonal Attribution Project|Scheduler RPC succeeded [server version 505](sched_op_debug])
23/04/2007 20:50:12|CPDN Seasonal Attribution Project|Deferring communication for 1 min 56 sec
23/04/2007 20:50:12|CPDN Seasonal Attribution Project|Reason: no work from project
23/04/2007 20:52:12||[SCHEDULER_OP::init_op_project(): starting op for http:**attribution.cpdn.org*

'''mangled'''

23/04/2007 20:52:13|CPDN Seasonal Attribution Project|Sending scheduler request: To fetch work
23/04/2007 20:52:13|CPDN Seasonal Attribution Project|Requesting 4320000 seconds of new work
23/04/2007 20:52:18|CPDN Seasonal Attribution Project|Scheduler RPC succeeded [server version 505](sched_op_debug])
23/04/2007 20:52:18|CPDN Seasonal Attribution Project|Deferring communication for 6 min 1 sec
23/04/2007 20:52:18|CPDN Seasonal Attribution Project|Reason: no work from project
23/04/2007 20:58:23||[SCHEDULER_OP::init_op_project(): starting op for http:**attribution.cpdn.org*

'''mangled'''

23/04/2007 20:58:23|CPDN Seasonal Attribution Project|Sending scheduler request: To fetch work
23/04/2007 20:58:23|CPDN Seasonal Attribution Project|Requesting 4320000 seconds of new work
23/04/2007 20:58:28|CPDN Seasonal Attribution Project|Scheduler RPC succeeded [server version 505](sched_op_debug])
23/04/2007 20:58:28|CPDN Seasonal Attribution Project|Deferring communication for 16 min 10 sec
23/04/2007 20:58:28|CPDN Seasonal Attribution Project|Reason: no work from project
23/04/2007 21:03:32|BBC Climate Change Experiment|[task_debug] result hadcm3ohf_cd6l_00977177_1 checkpointed

romw · 2015-02-04T07:49:34Z

Commented by MikeMarsUK on 1 Apr 37313295 23:06 UTC
Found a very similar log at the following location:

http://www.climateprediction.net/board/viewtopic.php?p=62746#62746

This one shows both a malariacontrol and a climate model crashing simultaneously with an 0xc0000142 exception. The log also shows 'communications deferred 1 minute' type messages, although the user (Haraldo) is not aware of any network issues.

Boinc manager version is 5.8.15.

0xc0000142 is said to be a problem with initialising applications, in particular not being able to load DLLs.
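The two forms of the exit code in the logs, -1073741502 and 0xc0000142, are the same 32-bit value read as signed and as unsigned. A quick check (the helper name `to_unsigned32` is just for illustration):

```python
def to_unsigned32(exit_code):
    """Reinterpret a signed 32-bit exit code as the unsigned value Windows documents."""
    return exit_code & 0xFFFFFFFF


print(hex(to_unsigned32(-1073741502)))  # 0xc0000142
```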

I'm guessing that the DNS issue causes the science application(s) to continually drop out and restart. If there is some sort of resource leak due to the crash then this would only happen a finite number of times, followed by all non-suspended work units on the PC crashing in quick succession?

romw · 2015-02-04T07:49:55Z

Commented by davea on 26 Apr 37395530 16:53 UTC
Possibly fixed in 5.9.12. Please reopen if not. -- David

romw · 2015-02-04T07:50:15Z

Commented by Bunsen on 1 Aug 37406900 19:06 UTC
As someone who isn't participating in the development process but has had some of the problems resulting from this bug... should I try this "MAY BE UNSTABLE - USE ONLY FOR TESTING" version, or wait for release of a later "recommended" version?

romw · 2015-02-04T07:50:36Z

Commented by MikeMarsUK on 20 May 37460494 14:13 UTC

Best thing to do is to use 5.4.16

romw · 2015-02-04T07:50:56Z

Commented by MikeMarsUK on 6 Feb 37460512 12:00 UTC

See also:

[ (don't know the version)

http://boinc.berkeley.edu/trac/ticket/171 (duplicate of 113)

romw · 2015-02-04T07:51:17Z

Commented by Contact on 7 Dec 37461348 15:33 UTC
Reopened. Possibly related to #282

2nd part of problem as defined by Nicolas:

BOINC Manager hangs and doesn't respond to keyboard/mouse if the core client doesn't reply; which could be caused by many reasons, including a hanged client, or a network problem (if the manager is connected to a remote client).

Technical details for the second problem: this is because the manager uses blocking I/O to communicate with the client.

romw · 2015-02-04T07:51:37Z

Commented by MikeMarsUK on 1 Nov 37463458 15:33 UTC
See also Trac/286 (v. similar to /282).

romw · 2015-02-04T07:51:58Z

Commented by Bunsen on 14 Mar 37845263 04:53 UTC
The problem appears to still be in 5.10.28 . I'm still seeing the "Exit 0 status no finished file" message. I've reverted again to 5.4.11 .

romw · 2015-02-04T07:52:18Z

Commented by Pepo on 24 Nov 37846904 06:13 UTC
Replying to Bunsen:

The problem appears to still be in 5.10.28 . I'm still seeing the "Exit 0 status no finished file" message.
Not necessarily! "Exit 0 status with no finished file" can have many causes. The problem described in #113 (and related tickets) is tied to DNS problems.

(I was suffering from this issue for nearly a year, often (but not only) when moving a running notebook between various networks, but not anymore since.... this summer? (northern hemisphere))

romw · 2015-02-04T07:52:39Z

Commented by Bunsen on 5 Oct 37848594 21:46 UTC
Replying to Pepo:

Replying to Bunsen:

The problem appears to still be in 5.10.28 . I'm still seeing the "Exit 0 status no finished file" message.
Not necessarily! "Exit 0 status with no finished file" can have many causes. The problem described in #113 (and related tickets) is tied to DNS problems.

I'm a chemist, not a software engineer. (Insert a ''Star Trek'' joke here if you like.) I don't understand a lot about network operations and such-like, but I'm a reasonably competent observer. I noted a recurring problem with my projects resetting every minute or so while they weren't able to communicate with the servers, and was told that the probable cause was this DNS thing. My response is: "If you say so... can you fix it?" It doesn't ever happen with 5.4.11; it's happened with all the later versions that I've tried.

And now I'm reporting that I'm seeing the same behaviour in 5.10.28: the manager gets into a state such that the projects reset every minute, and make no progress as a result. I'm hoping that the observation might be useful in diagnosing the problem... and of course, perhaps to nudge the problem back into someone's attention. I'd prefer to be using a current version of the manager, but for people like me who don't have a highly-reliable, always-on network connection, this problem is a serious disincentive to doing so.

romw · 2015-02-04T07:53:00Z

Commented by MikeMarsUK on 30 Mar 37884982 12:53 UTC

As far as I can remember, 5.4.11 had asynchronous DNS, whereas the later versions went to synchronous DNS. That might explain why 5.4.11 works, but the more recent versions fail.

Is there any chance of reverting to async DNS? I feel this problem is more severe than the stale-DNS-cache issue which prompted the move to synchronous DNS in the first place.

Perhaps the timeout on the sync DNS query should be shortened to a few seconds, with it falling back to an async call after timeout.
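To make the suggestion concrete, here is a rough sketch of "wait briefly, then carry on asynchronously". It is not BOINC code and not a proposed patch: `resolve_with_fallback`, the thread pool, and the 2-second default are all assumptions for illustration, and `socket.gethostbyname` merely stands in for whatever resolver the client uses.

```python
import concurrent.futures
import socket
import time

# Hypothetical worker pool that runs lookups off the main loop.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def resolve_with_fallback(host, wait_s=2.0, resolver=socket.gethostbyname):
    """Wait up to wait_s for a lookup, then hand back the still-pending future.

    Returns (address, None) if the lookup finished in time, or
    (None, future) if it did not; the caller can poll the future later
    while continuing to service heartbeats and GUI requests.
    Resolver errors (for example socket.gaierror) propagate to the caller.
    """
    future = _pool.submit(resolver, host)
    try:
        return future.result(timeout=wait_s), None
    except concurrent.futures.TimeoutError:
        return None, future


def slow_resolver(host):
    """Stand-in for a DNS server that is slow to answer."""
    time.sleep(0.3)
    return "192.0.2.1"  # documentation-only address, not a real lookup result


address, pending = resolve_with_fallback("example.invalid", wait_s=0.05, resolver=slow_resolver)
print(address, pending is not None)  # None True: the caller was not held up
```

The main loop is only ever blocked for `wait_s`, so a dead DNS server costs a couple of seconds per attempt rather than the full resolver timeout.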

romw · 2015-02-04T07:53:20Z

Commented by Dingo on 9 Apr 37905756 09:46 UTC
Replying to KSMarksPsych:

As reported at

I am also having this problem over at ABC:

If you look at computer 27978 You will see the work units 5342497 and 5342483 they are both in error with "No heartbeat from core client for 31 sec - exiting".

These two work units were running on a Linux PC with BOINC version 5.10.21 for Linux 64 bit when the network went down here at home. When it went down BOINC was frozen, probably trying to connect, and the four cores (q6600) were not crunching. I rebooted the PC and when it restarted the ABC work units above were both in error with the no heartbeat error. You can see when my network stops as there are between two and four work units together that are in error.

This is happening on a few machines that are all linux 64 bit.

BOINC freezes up and stops processing work whenever the network shuts down. I have the Activity set to "always available" as it usually is. I have 12 PCs here at home and I do not want to have to manage the network activity manually by "suspending activity" and starting it when it is up again.

I am returning work units to multiple projects that are in error due to this problem.

romw · 2015-02-04T07:53:41Z

Commented by Wang Solutions on 4 May 37916886 00:26 UTC
Replying to Dingo:
I have 10 PCs on a network, 3 are 64-bit Linux running BOINC 5.8.17 and the rest are 32-bit Windows running BOINC 5.10.13. In the past week I have had several hundred work units error out with this problem. Furthermore, when the problem develops (usually on one of the Linux machines first, particularly if running ABC at the time) this then brings the entire network down, freezing BOINC on all PCs and requiring me to shut all PCs down and bring them back up one by one. It has taken quite a lot of debugging to find the source of the problem, including having replaced all network components in case of a hardware issue. But it has now been traced back to the BOINC DNS issue. Just 30 seconds without a viable internet connection is enough to start the cascade.

romw · 2015-02-04T07:54:01Z

Commented by MikeMarsUK on 20 Jan 37926280 19:33 UTC
Replying to Wang Solutions:

Perhaps as a temporary work-around you could install a reliable DNS server on your local network (rather than using your ISP's DNS?). Personally I'm using a router which has DNS included.

ChristianBeer · 2017-04-05T15:01:46Z

There are no reports of this happening within the last months.

romw assigned davidpanderson Feb 3, 2015

romw added this to the Undetermined milestone Feb 3, 2015

romw added C: Client - Daemon P: Minor T: Defect labels Feb 3, 2015

romw mentioned this issue Feb 3, 2015

Iffy network connection causes Boinc 5.8.* to freeze solid #171

Closed

ChristianBeer unassigned davidpanderson Apr 5, 2017

ChristianBeer closed this as completed Apr 5, 2017
