Fix: lrmd: cancel currently pending STONITH op if stonithd connection is lost #730

Merged (1 commit) on Jun 12, 2015

marcan commented Jun 11, 2015

This avoids things like nodes getting killed due to hanging stop operations if a start operation caused stonithd to crash.

I saw and debugged this issue on pacemaker 1.1.10+git20130802-1ubuntu2.3 (Ubuntu Trusty). Although the issue manifests itself there due to another bug that is already fixed, it's a bug in its own right, hence this PR. I don't know of any way to trigger this behavior on the current version of pacemaker short of manually killing stonithd in the middle of an op, but it could theoretically happen if stonithd dies for whatever reason.

This is the sequence of events that triggers the bug, conditional on the (long fixed) bug mentioned in PR #334:

  1. lrmd gets a start action, and it becomes rsc->active
  2. stonithd times out on the action (e.g. stonith device is broken/hanging)
  3. stonithd gets SIGTERMed due to the aforementioned bug
  4. lrmd notices and attempts to clean up the pending ops, but misses rsc->active, thus the start op never completes
  5. start times out in crmd, it attempts to recover the resource
  6. lrmd gets a stop action, and it gets put into rsc->pending_ops
  7. lrmd never runs the action since rsc->active is still non-NULL
  8. stop times out in crmd and the host gets STONITHed due to a failed stop
    (even though a stonith resource stop is basically a no-op!)
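
As a rough illustration of the sequence above, here is a minimal, self-contained C sketch. It is not pacemaker's lrmd code: `struct rsc`, `struct op`, and every function name below are hypothetical stand-ins, and only the `active` / `pending_ops` fields mirror the names used in this PR. The `also_fail_active` flag models the behavior with and without the fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model; these are NOT pacemaker's real types. */
enum op_state { OP_QUEUED, OP_RUNNING, OP_FAILED };

struct op {
    const char *action;
    enum op_state state;
    struct op *next;         /* next op in the pending queue */
};

struct rsc {
    struct op *active;       /* in-flight asynchronous op, if any */
    struct op *pending_ops;  /* ops queued behind the active one */
};

/* Append an op to the tail of the pending queue. */
static void rsc_queue(struct rsc *r, struct op *o)
{
    struct op **tail = &r->pending_ops;
    while (*tail != NULL) {
        tail = &(*tail)->next;
    }
    o->state = OP_QUEUED;
    o->next = NULL;
    *tail = o;
}

/* Start the next queued op, but only if nothing is in flight. */
static void rsc_dispatch(struct rsc *r)
{
    if (r->active != NULL || r->pending_ops == NULL) {
        return;
    }
    r->active = r->pending_ops;
    r->pending_ops = r->active->next;
    r->active->next = NULL;
    r->active->state = OP_RUNNING;
}

/* Cleanup when the stonithd connection is lost. Without the fix
 * (also_fail_active == false) only the queue is drained and the
 * in-flight op is left dangling in r->active. */
static void rsc_connection_lost(struct rsc *r, bool also_fail_active)
{
    while (r->pending_ops != NULL) {
        struct op *o = r->pending_ops;
        r->pending_ops = o->next;
        o->next = NULL;
        o->state = OP_FAILED;
    }
    if (also_fail_active && r->active != NULL) {
        r->active->state = OP_FAILED;
        r->active = NULL;
    }
}

/* Replays steps 1-7 and reports whether the stop op ever gets to run. */
static bool stop_runs_after_connection_loss(bool also_fail_active)
{
    struct rsc r = { NULL, NULL };
    struct op start = { "start", OP_QUEUED, NULL };
    struct op stop = { "stop", OP_QUEUED, NULL };

    rsc_queue(&r, &start);
    rsc_dispatch(&r);                           /* step 1: start is active */
    assert(r.active == &start);

    rsc_connection_lost(&r, also_fail_active);  /* steps 3-4 */

    rsc_queue(&r, &stop);                       /* step 6 */
    rsc_dispatch(&r);                           /* step 7 */
    return stop.state == OP_RUNNING;
}
```

In this model, `stop_runs_after_connection_loss(false)` returns `false` because the stale `active` pointer blocks dispatch, while `stop_runs_after_connection_loss(true)` returns `true` once the in-flight op is failed and cleared along with the queue.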

… is lost

The currently pending op is moved from rsc->pending_ops to rsc->active
(if it is asynchronous). Therefore, that also needs to be cleaned up if
the stonithd connection fails. Otherwise, the resource gets stuck forever
on an op that will never complete.

Example, interacting with the (long fixed) bug mentioned in pull/334:
1. lrmd gets a start action, becomes rsc->active
2. stonithd times out on the action
3. stonithd gets SIGTERMed due to the aforementioned bug
4. lrmd notices and attempts to clean up the pending ops, but misses rsc->active
5. start times out in crmd, attempts to recover
6. lrmd gets a stop action, gets put into rsc->pending_ops
7. lrmd never runs the action since rsc->active is non-NULL
8. stop times out in crmd and the host gets STONITHed due to a failed stop
   (even though a stonith resource stop is basically a no-op!)

beekhof commented Jun 12, 2015

@davidvossel fyi


beekhof commented Jun 12, 2015

Cancelling in-flight ops seems to be a theme this week.

beekhof added a commit that referenced this pull request Jun 12, 2015
Fix: lrmd: cancel currently pending STONITH op if stonithd connection is lost
beekhof merged commit 5839e67 into ClusterLabs:master on Jun 12, 2015