Improved heartbeat controller to engine monitoring for long running tasks by chapmanb · Pull Request #3290 · ipython/ipython

chapmanb · 2013-05-08T01:34:58Z

I'm using IPython parallel for long-running tasks (hours to days) on larger clusters (100+ engines) and have been losing connectivity with engines. The ipcontroller log intermittently reports a block of 5 to 10 `registration::unregister_engine` messages and removes those engines.

Digging into the problem, the current behavior appears to be to ping engines every 3 seconds and unregister an engine if it fails to respond to two pings in a row. The first failed ping puts the engine `on_probation` and the second triggers unregistration.
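For illustration, the two-strike rule described above can be modeled in a few lines. This is a toy sketch, not IPython's actual heart monitor code: the function name `two_strike_monitor` and its arguments are hypothetical, and it assumes that any reply clears an engine's probation.

```python
def two_strike_monitor(responses_per_beat, hearts):
    """Toy model of the two-strike rule (hypothetical, not IPython's code).

    responses_per_beat: list of sets, the engines that answered each ping.
    Returns a list of (beat_index, engine) pairs marking unregistrations.
    """
    hearts = set(hearts)
    on_probation = set()
    unregistered = []
    for beat, responded in enumerate(responses_per_beat):
        missed = hearts - set(responded)
        # A miss by an engine already on probation is its second in a row.
        failures = on_probation & missed
        for engine in sorted(failures):
            unregistered.append((beat, engine))
        hearts -= failures
        # First-time misses go on probation; engines that replied are cleared.
        on_probation = missed - failures
    return unregistered


# Engine "b" replies once, then misses two pings in a row, as a busy engine
# might if it stalls for slightly longer than two ping periods.
events = two_strike_monitor([{"a", "b"}, {"a"}, {"a"}, {"a"}], hearts={"a", "b"})
print(events)
```

With a 3-second ping period, a stall of only a few seconds is enough to lose the engine under this rule, which is the behavior the description reports.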

For longer-running engines this is too restrictive, and it is at odds with the new `EngineFactory.max_heartbeat_misses`, which defaults to 50 misses.

This pull request adds a configuration variable, `HeartMonitor.max_heartmonitor_misses`, which lets ipcontrollers specify how many consecutive misses to allow before unregistering an engine. The default allows an engine to go uncontacted for 1 minute before it is unregistered.
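A minimal sketch of the idea, counting consecutive misses per engine against a configurable limit. The class `MissCounter` and its method names are hypothetical and are not the code in this pull request; in particular, whether the real check uses "more than" or "at least" the limit is an assumption here.

```python
class MissCounter:
    """Toy consecutive-miss tracker (hypothetical, not the PR's code)."""

    def __init__(self, max_misses):
        self.max_misses = max_misses
        self.misses = {}  # engine -> current run of consecutive missed pings

    def beat(self, hearts, responded):
        """Record one ping round; return engines that should be unregistered."""
        failed = []
        for engine in sorted(hearts):
            if engine in responded:
                # Any reply resets the run of misses for that engine.
                self.misses.pop(engine, None)
                continue
            self.misses[engine] = self.misses.get(engine, 0) + 1
            # Assumption: unregister once the run exceeds the allowed misses.
            if self.misses[engine] > self.max_misses:
                failed.append(engine)
                del self.misses[engine]
        return failed


counter = MissCounter(max_misses=3)
results = [counter.beat({"e0"}, responded=set()) for _ in range(4)]
print(results)
```

Because a single reply resets the count, an engine that is slow but alive is kept as long as it answers at least once within the allowed run of misses.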

I'm happy to adjust or include different defaults if desired. Thanks much for all your work on IPython.
Brad

minrk · 2013-05-08T01:37:00Z

Excellent, I've been meaning to do this for ages.

minrk · 2013-05-08T01:38:00Z

change 'shutting down' to 'unregistering'

and 20 seems awfully high - maybe 10?

chapmanb · 2013-05-08T01:46:18Z

Great points, thank you. I updated the documentation and default.
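As a rough check on the numbers in this thread, the time an engine can stay silent before being unregistered is approximately the ping period multiplied by the allowed misses. The helper name below is hypothetical; the 3-second period comes from the description, and the values 20 and 10 come from the review discussion. The commented configuration line assumes the option is set in a profile's `ipcontroller_config.py`.

```python
def unregister_window_seconds(period_ms, max_misses):
    """Approximate silent time tolerated before an engine is unregistered."""
    return period_ms * max_misses / 1000.0


# 3-second pings with 20 allowed misses: the "1 minute" in the description.
print(unregister_window_seconds(3000, 20))
# 3-second pings with the revised default of 10 allowed misses.
print(unregister_window_seconds(3000, 10))

# Assumed usage in ipcontroller_config.py to allow a longer window:
# c = get_config()
# c.HeartMonitor.max_heartmonitor_misses = 20
```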

…elp handle long running engines. Requires ipython/ipython#3290 for improved large/long running cluster handling

minrk · 2013-05-09T17:33:06Z

Excellent, thanks!

chapmanb added 3 commits May 6, 2013 15:56

d8c2d4f: Provide configuration hook to specify allowable heartmonitor misses from controller checking engine connectivity

9819283: Update heartmonitor miss default to be in line with new engine heartbeat monitoring. Shuts down engines if controller fails to ping them for 1 consecutive minute

e3040cd: Improve documentation of default to indicate consecutive and supply logging of misses

minrk reviewed May 8, 2013

536686b: Lower default misses to 10 and improve documentation of option

minrk merged commit 2b0be41 into ipython:master May 9, 2013

minrk mentioned this pull request Jul 6, 2013

ipcontroller purging some engines during connect #2887

Closed

minrk merged 4 commits into ipython:master from chapmanb:master
