Improved heartbeat controller to engine monitoring for long running tasks #3290
Merged
…rom controller checking engine connectivity
…eat monitoring. Shuts down engines if controller fails to ping them for 1 consecutive minute
Member
Excellent, I've been meaning to do this for ages.
Member
change 'shutting down' to 'unregistering'
Member
and 20 seems awfully high - maybe 10?
Contributor
Author
Great points, thank you. I updated the documentation and default.
chapmanb
added a commit
to roryk/ipython-cluster-helper
that referenced
this pull request
May 8, 2013
…elp handle long running engines. Requires ipython/ipython#3290 for improved large/long running cluster handling
Member
Excellent, thanks!
minrk
added a commit
that referenced
this pull request
May 9, 2013
Improved heartbeat controller to engine monitoring for long running tasks

I'm using IPython parallel for long running tasks (hours to days) on larger clusters (100+ engines) and have been having trouble losing connectivity with engines. The ipcontroller log would intermittently report a block of 5 to 10 `registration::unregister_engine` messages and remove those engines.

In digging into the problem, it appears as if the current behavior is to ping engines every 3 seconds and kill an engine if it fails to respond to pings twice. The first failed ping results in an engine being put `on_probation` and the second triggers unregistration.

For longer running engines this is too restrictive, and is at odds with the new `EngineFactory.max_heartbeat_misses`, which defaults to 50 misses.

This pull request provides a configuration variable, `HeartMonitor.max_heartmonitor_misses`, which allows ipcontrollers to specify how many consecutive misses to allow before unregistering an engine. It defaults to failing to contact an engine for 1 minute.

I'm happy to adjust or include different defaults if desired. Thanks much for all your work on IPython.

Brad
mattvonrocketstein
pushed a commit
to mattvonrocketstein/ipython
that referenced
this pull request
Nov 3, 2014
Improved heartbeat controller to engine monitoring for long running tasks
I'm using IPython parallel for long running tasks (hours to days) on larger clusters (100+ engines) and have been having trouble losing connectivity with engines. The ipcontroller log would intermittently report a block of 5 to 10 `registration::unregister_engine` messages and remove those engines.

In digging into the problem, it appears as if the current behavior is to ping engines every 3 seconds and kill an engine if it fails to respond to pings twice. The first failed ping results in an engine being put `on_probation` and the second triggers unregistration.

For longer running engines this is too restrictive, and is at odds with the new `EngineFactory.max_heartbeat_misses`, which defaults to 50 misses.

This pull request provides a configuration variable, `HeartMonitor.max_heartmonitor_misses`, which allows ipcontrollers to specify how many consecutive misses to allow before unregistering an engine. It defaults to failing to contact an engine for 1 minute.

I'm happy to adjust or include different defaults if desired. Thanks much for all your work on IPython.

Brad
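
For readers who want to see the consecutive-miss rule in isolation, here is a minimal sketch of the behavior described above. It is not the IPython implementation: `ToyHeartMonitor`, its methods, and the engine names are hypothetical and exist only for illustration. It assumes the 3-second ping interval mentioned in the description, under which 20 consecutive misses corresponds to 60 seconds of silence and 10 misses to 30 seconds.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHeartMonitor:
    """Hypothetical stand-in for a controller-side monitor (not IPython's class)."""

    period_s: float = 3.0  # ping interval taken from the description above
    max_misses: int = 20   # illustrative value, not necessarily the shipped default
    misses: dict = field(default_factory=dict)    # engine id -> consecutive misses
    registered: set = field(default_factory=set)  # engines currently registered

    def register(self, engine_id):
        self.registered.add(engine_id)
        self.misses[engine_id] = 0

    def beat(self, responded):
        """Process one ping round; `responded` is the set of engines that answered.

        Returns the list of engines unregistered in this round.
        """
        dropped = []
        for engine_id in sorted(self.registered):
            if engine_id in responded:
                self.misses[engine_id] = 0  # any reply resets the streak
                continue
            self.misses[engine_id] += 1
            if self.misses[engine_id] >= self.max_misses:
                dropped.append(engine_id)
        for engine_id in dropped:
            self.registered.discard(engine_id)
            del self.misses[engine_id]
        return dropped

    def tolerated_silence_s(self):
        """Seconds an engine can stay silent before it is unregistered."""
        return self.period_s * self.max_misses


monitor = ToyHeartMonitor(period_s=3.0, max_misses=20)
monitor.register("engine-0")
monitor.register("engine-1")
print(monitor.tolerated_silence_s())
```

The point of counting consecutive misses, rather than total misses, is that an engine busy with a long task can skip several pings and still recover its standing with a single reply.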