Fix: lrmd: cancel currently pending STONITH op if stonithd connection is lost #730

marcan · 2015-06-11T08:25:45Z

This avoids things like nodes getting killed due to hanging stop operations if a start operation caused stonithd to crash.

I saw and debugged this issue on pacemaker 1.1.10+git20130802-1ubuntu2.3 (Ubuntu Trusty). Although the issue manifests itself there due to another bug that is already fixed, it's a bug in its own right, hence this PR. I don't know of any way to trigger this behavior on the current version of pacemaker short of manually killing stonithd in the middle of an op, but it could theoretically happen if stonithd dies for whatever reason.

This is the sequence of events that triggers the bug, conditional on the (long fixed) bug mentioned in PR #334:

1. lrmd gets a start action, and it becomes rsc->active
2. stonithd times out on the action (e.g. stonith device is broken/hanging)
3. stonithd gets SIGTERMed due to the aforementioned bug
4. lrmd notices and attempts to clean up the pending ops, but misses rsc->active, thus the start op never completes
5. start times out in crmd, it attempts to recover the resource
6. lrmd gets a stop action, and it gets put into rsc->pending_ops
7. lrmd never runs the action since rsc->active is still non-NULL
8. stop times out in crmd and the host gets STONITHed due to a failed stop (even though a stonith resource stop is basically a no-op!)

… is lost

The currently pending op is moved from rsc->pending_ops to rsc->active (if it is asynchronous). Therefore, that also needs to be cleaned up if the stonithd connection fails. Otherwise, the resource gets stuck forever on an op that will never complete.

Example, interacting with the (long fixed) bug mentioned in pull/334:

1. lrmd gets a start action, becomes rsc->active
2. stonithd times out on the action
3. stonithd gets SIGTERMed due to the aforementioned bug
4. lrmd notices and attempts to clean up the pending ops, but misses rsc->active
5. start times out in crmd, attempts to recover
6. lrmd gets a stop action, gets put into rsc->pending_ops
7. lrmd never runs the action since rsc->active is non-NULL
8. stop times out in crmd and the host gets STONITHed due to a failed stop (even though a stonith resource stop is basically a no-op!)
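The stuck state described above can be reproduced with a small model. This is only a sketch under stated assumptions, not the actual lrmd code: `struct rsc`, `struct op`, `rsc_queue`, `rsc_dispatch`, `rsc_connection_lost` and the `also_fail_active` flag are hypothetical names invented for illustration. Only the `active` and `pending_ops` fields mirror names used in this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for lrmd's per-resource state. */
enum op_status { OP_QUEUED, OP_RUNNING, OP_FAILED };

struct op {
    const char *action;      /* "start", "stop", ... */
    enum op_status status;
    struct op *next;         /* next op in the pending queue */
};

struct rsc {
    struct op *active;       /* in-flight asynchronous op, or NULL */
    struct op *pending_ops;  /* ops waiting for their turn */
};

/* Append an op to the tail of the pending queue. */
static void rsc_queue(struct rsc *r, struct op *o)
{
    struct op **tail = &r->pending_ops;

    while (*tail != NULL) {
        tail = &(*tail)->next;
    }
    o->status = OP_QUEUED;
    o->next = NULL;
    *tail = o;
}

/* Move the next queued op to active, but only if nothing is in flight.
 * Returns true if an op was started. */
static bool rsc_dispatch(struct rsc *r)
{
    assert(r != NULL);
    if (r->active != NULL || r->pending_ops == NULL) {
        return false;
    }
    r->active = r->pending_ops;
    r->pending_ops = r->active->next;
    r->active->next = NULL;
    r->active->status = OP_RUNNING;
    return true;
}

/* Model of the cleanup run when the fencing daemon connection is lost.
 * also_fail_active == false models the old behaviour (queue only);
 * also_fail_active == true models the behaviour this PR describes. */
static void rsc_connection_lost(struct rsc *r, bool also_fail_active)
{
    while (r->pending_ops != NULL) {
        struct op *o = r->pending_ops;

        r->pending_ops = o->next;
        o->next = NULL;
        o->status = OP_FAILED;
    }
    if (also_fail_active && r->active != NULL) {
        r->active->status = OP_FAILED;
        r->active = NULL;    /* unblocks rsc_dispatch() for later ops */
    }
}
```

In this model, queueing and dispatching a start, calling `rsc_connection_lost(&r, false)`, and then queueing a stop leaves `rsc_dispatch` returning false forever, which corresponds to steps 4 to 7. With `also_fail_active` set to true the start is failed immediately and the stop can be dispatched.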

beekhof · 2015-06-12T01:21:46Z

@davidvossel fyi

beekhof · 2015-06-12T01:22:15Z

Cancelling in-flight ops seems to be a theme this week.

beekhof added a commit that referenced this pull request Jun 12, 2015

Merge pull request #730 from marcan/master

5839e67

Fix: lrmd: cancel currently pending STONITH op if stonithd connection is lost

beekhof merged commit 5839e67 into ClusterLabs:master Jun 12, 2015
