Improved heartbeat controller to engine monitoring for long running tasks #3290

Merged
minrk merged 4 commits into ipython:master from chapmanb:master on May 9, 2013

Conversation

Contributor

chapmanb commented May 8, 2013

I'm using IPython parallel for long running tasks (hours to days) on larger clusters (100+ engines) and have been losing connectivity with engines. The ipcontroller log intermittently reports a block of 5 to 10 `registration::unregister_engine` messages and removes those engines.

Digging into the problem, the current behavior appears to be to ping engines every 3 seconds and drop an engine if it fails to respond to two pings in a row. The first failed ping puts the engine `on_probation` and the second triggers unregistration.
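The two-strike behavior described above can be modeled with a few lines of bookkeeping. This is only a toy sketch: the class name `TwoStrikeMonitor` and its attributes are hypothetical and are not taken from IPython's code; only the probation-then-unregister rule comes from the description.

```python
class TwoStrikeMonitor:
    """Toy model of the described ping bookkeeping (not IPython's code)."""

    def __init__(self, engines):
        self.hearts = set(engines)   # engines currently registered
        self.on_probation = set()    # engines that missed the last ping
        self.unregistered = []       # engines dropped so far, in order

    def beat(self, responders):
        """Process one ping round; `responders` are the engines that answered."""
        missed = self.hearts - set(responders)
        # Second consecutive miss: the engine was already on probation.
        for engine in sorted(missed & self.on_probation):
            self.hearts.discard(engine)
            self.unregistered.append(engine)
        # First miss puts an engine on probation; answering clears it.
        self.on_probation = missed & self.hearts


monitor = TwoStrikeMonitor(["e0", "e1"])
monitor.beat(["e0"])   # e1 misses once and goes on probation
monitor.beat(["e0"])   # e1 misses again and is unregistered
```

With a 3 second ping period, an engine that is too busy to answer two pings in a row is dropped after only a few seconds of silence under this model.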

For longer running engines this is too restrictive, and it is at odds with the new `EngineFactory.max_heartbeat_misses`, which defaults to 50 misses.

This pull request adds a configuration variable, `HeartMonitor.max_heartmonitor_misses`, which lets ipcontrollers specify how many consecutive misses to allow before unregistering an engine. By default, an engine is unregistered once the controller has failed to contact it for 1 minute.
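To make the relationship between the miss threshold and the tolerated silence concrete, here is a minimal sketch of a consecutive-miss counter. The class `MissCounter`, its method names, and the `>=` comparison are assumptions for illustration, not the code in this pull request; only the 3 second period and the idea of a configurable maximum come from the description. Under these assumptions, 20 misses at a 3 second period gives 60 seconds, which lines up with the 1 minute mentioned above.

```python
from collections import defaultdict


class MissCounter:
    """Toy consecutive-miss tracker; names are illustrative, not IPython's."""

    def __init__(self, max_misses, period_ms=3000):
        self.max_misses = max_misses
        self.period_ms = period_ms
        self.misses = defaultdict(int)  # engine id -> consecutive misses

    @property
    def tolerated_seconds(self):
        """Silence an engine can accumulate before it is dropped."""
        return self.max_misses * self.period_ms / 1000.0

    def record(self, engine, responded):
        """Record one ping result; return True when the engine should be unregistered."""
        if responded:
            self.misses[engine] = 0  # any reply resets the streak
            return False
        self.misses[engine] += 1
        return self.misses[engine] >= self.max_misses


counter = MissCounter(max_misses=20)
print(counter.tolerated_seconds)
```

Because any reply resets the streak, an engine that answers even occasionally during a long task is never dropped in this model.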

I'm happy to adjust or include different defaults if desired. Thanks much for all your work on IPython.
Brad

chapmanb added 3 commits May 6, 2013 15:56
…eat monitoring. Shuts down engines if controller fails to ping them for 1 consecutive minute
Member

minrk commented May 8, 2013

Excellent, I've been meaning to do this for ages.

Member

change 'shutting down' to 'unregistering'

Member

and 20 seems awfully high - maybe 10?

Contributor Author

chapmanb commented May 8, 2013

Great points, thank you. I updated the documentation and default.

chapmanb added a commit to roryk/ipython-cluster-helper that referenced this pull request May 8, 2013
…elp handle long running engines. Requires ipython/ipython#3290 for improved large/long running cluster handling
Member

minrk commented May 9, 2013

Excellent, thanks!

minrk added a commit that referenced this pull request May 9, 2013
Improved heartbeat controller to engine monitoring for long running tasks

minrk merged commit 2b0be41 into ipython:master May 9, 2013
mattvonrocketstein pushed a commit to mattvonrocketstein/ipython that referenced this pull request Nov 3, 2014
Improved heartbeat controller to engine monitoring for long running tasks
