Doc: Pacemaker Explained: document per-operation fail counts
This replaces the "Moving Resources Due to Failure" section with a new
"Handling Resource Failure" section that can go into more detail about
failure handling in general.
kgaillot committed Mar 16, 2017
1 parent 8323616 commit 4f7420e
Showing 4 changed files with 125 additions and 59 deletions.
4 changes: 2 additions & 2 deletions doc/Pacemaker_Explained/en-US/Ap-Upgrade.txt
@@ -350,7 +350,7 @@ cluster running, and any validation errors are often more informative.

==== New ====

* Failure timeouts. See <<s-failure-migration>>
* Failure timeouts. See <<s-failure-handling>>
* New section for resource and operation defaults. See <<s-resource-defaults>> and <<s-operation-defaults>>
* Tool for making offline configuration changes. See <<s-config-sandboxes>>
* +Rules, instance_attributes, meta_attributes+ and sets of operations can be defined once and referenced in multiple places. See <<s-reusing-config-elements>>
@@ -371,7 +371,7 @@ cluster running, and any validation errors are often more informative.
* The +stonith-enabled+ option now defaults to true.
* The cluster will refuse to start resources if +stonith-enabled+ is true (or unset) and no STONITH resources have been defined
* The attributes of colocation and ordering constraints were renamed for clarity. See <<s-resource-ordering>> and <<s-resource-colocation>>
* +resource-failure-stickiness+ has been replaced by +migration-threshold+. See <<s-failure-migration>>
* +resource-failure-stickiness+ has been replaced by +migration-threshold+. See <<s-failure-handling>>
* The parameters for command-line tools have been made consistent
* Switched to 'RelaxNG' schema validation and 'libxml2' parser
** id fields are now XML IDs which have the following limitations:
176 changes: 121 additions & 55 deletions doc/Pacemaker_Explained/en-US/Ch-Advanced-Options.txt
@@ -118,6 +118,127 @@ example, to specify an operation that would run on the first Monday of
<op id="my-weekly-action" name="custom-action" interval="P7D" interval-origin="2009-W01-1"/>
=====

[[s-failure-handling]]
== Handling Resource Failure ==

By default, Pacemaker will attempt to recover failed resources by restarting
them. However, failure recovery is highly configurable.

=== Failure Counts ===

Pacemaker tracks resource failures for each combination of node, resource, and
operation (start, stop, monitor, etc.).

You can query the fail count for a particular node, resource, and/or operation
using the `crm_failcount` command. For example, to see how many times the
10-second monitor for +myrsc+ has failed on +node1+, run:

----
# crm_failcount --query -r myrsc -N node1 -n monitor -I 10s
----

If you omit the node, `crm_failcount` will use the local node. If you omit the
operation and interval, `crm_failcount` will display the sum of the fail counts
for all operations on the resource.
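
For example, a query like the following would show the total of all operation
fail counts for +myrsc+ on the local node:

----
# crm_failcount --query -r myrsc
----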

You can use `crm_resource --cleanup` or `crm_failcount --delete` to clear
fail counts. For example, to clear the above monitor failures, run:

----
# crm_resource --cleanup -r myrsc -N node1 -n monitor -I 10s
----

If you omit the resource, `crm_resource --cleanup` will clear failures for all
resources. If you omit the node, it will clear failures on all nodes. If you
omit the operation and interval, it will clear the failures for all operations
on the resource.
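
For example, a command like the following would clear all failures for
+myrsc+, for all operations, on all nodes:

----
# crm_resource --cleanup -r myrsc
----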

[NOTE]
====
Even when cleaning up only a single operation, all failed operations will
disappear from the status display, in order to trigger a re-check of the
resource's current status.
====

Higher-level tools may provide other commands for querying and clearing
fail counts.

The `crm_mon` tool shows the current cluster status, including any failed
operations. To see the current fail counts for any failed resources, call
`crm_mon` with the `--failcounts` option. This shows the fail counts per
resource (that is, the sum of any operation fail counts for the resource).
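
For example, the following would print the cluster status once, including
per-resource fail counts, and then exit (assuming a reasonably recent
`crm_mon`):

----
# crm_mon --failcounts --one-shot
----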

=== Failure Response ===

Normally, if a running resource fails, Pacemaker will try to stop it and start
it again. Pacemaker will choose the best location to start it each time, which
may be the same node that it failed on.

However, if a resource fails repeatedly, there may be an underlying problem on
that node, and you might want to try a different node in such a case.
Pacemaker lets you express this preference via the
+migration-threshold+ resource meta-attribute.
footnote:[
The naming of this option was perhaps unfortunate as it is easily
confused with live migration, the process of moving a resource from
one node to another without stopping it. Xen virtual guests are the
most common example of resources that can be migrated in this manner.
]

If you define +migration-threshold=pass:[<replaceable>N</replaceable>]+ for a
resource, it will be banned from the original node after 'N' failures.
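
For example, a command along the following lines could be used to set a
migration threshold of 3 for +myrsc+ (exact syntax may vary between
`crm_resource` versions):

----
# crm_resource --resource myrsc --meta --set-parameter migration-threshold --parameter-value 3
----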

[NOTE]
====
The +migration-threshold+ is per 'resource', even though fail counts are
tracked per 'operation'. The operation fail counts are added together
to compare against the +migration-threshold+. For example, if a resource's
start operation has failed once and its monitor operation has failed twice,
its combined fail count for this purpose is 3.
====

By default, fail counts remain until manually cleared by an administrator
using `crm_resource --cleanup` or `crm_failcount --delete` (hopefully after
first fixing the failure's cause). It is possible to have fail counts expire
automatically by setting the +failure-timeout+ resource meta-attribute.

[IMPORTANT]
====
A successful operation does not clear past failures. If a recurring monitor
operation fails once, succeeds many times, then fails again days later, its
fail count is 2. Fail counts are cleared only by manual intervention or
failure timeout.
====

For example, a setting of +migration-threshold=2+ and +failure-timeout=60s+
would cause the resource to move to a new node after 2 failures, and
allow it to move back (depending on stickiness and constraint scores) after one
minute.
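
In CIB XML, such a configuration might look something like this (the resource
definition and ids here are purely illustrative):

----
<primitive id="myrsc" class="ocf" provider="pacemaker" type="Dummy">
  <meta_attributes id="myrsc-meta_attributes">
    <nvpair id="myrsc-migration-threshold"
            name="migration-threshold" value="2"/>
    <nvpair id="myrsc-failure-timeout"
            name="failure-timeout" value="60s"/>
  </meta_attributes>
</primitive>
----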

[NOTE]
====
+failure-timeout+ is measured since the most recent failure. That is, older
failures do not individually time out and lower the fail count. Instead, all
failures are timed out simultaneously (and the fail count is reset to 0) if
there is no new failure for the timeout period.
====

There are two exceptions to the migration threshold concept:
when a resource either fails to start or fails to stop.

If the cluster property +start-failure-is-fatal+ is set to +true+ (which is the
default), start failures cause the fail count to be set to +INFINITY+ and thus
always cause the resource to move immediately.
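
If you prefer start failures to count toward the per-resource threshold like
other failures, a command such as the following (an illustrative sketch) could
set the property to +false+:

----
# crm_attribute --type crm_config --name start-failure-is-fatal --update false
----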

Stop failures are slightly different and crucial. If a resource fails
to stop and STONITH is enabled, then the cluster will fence the node
in order to be able to start the resource elsewhere. If STONITH is
not enabled, then the cluster has no way to continue and will not try
to start the resource elsewhere, but will try to stop it again after
the failure timeout.

[IMPORTANT]
Please read <<s-rules-recheck>> to understand how timeouts work
before configuring a +failure-timeout+.

== Moving Resources ==
indexterm:[Moving,Resources]
indexterm:[Resource,Moving]
@@ -236,61 +357,6 @@ positive and negative constraints. E.g.

which has the same long-term consequences as discussed earlier.

[[s-failure-migration]]
=== Moving Resources Due to Failure ===

Normally, if a running resource fails, pacemaker will try to start
it again on the same node. However if a resource fails repeatedly,
it is possible that there is an underlying problem on that node, and you
might desire trying a different node in such a case.

indexterm:[migration-threshold]
indexterm:[failure-timeout]
indexterm:[start-failure-is-fatal]

Pacemaker allows you to set your preference via the +migration-threshold+
resource option.
footnote:[
The naming of this option was perhaps unfortunate as it is easily
confused with live migration, the process of moving a resource from
one node to another without stopping it. Xen virtual guests are the
most common example of resources that can be migrated in this manner.
]

Simply define +migration-threshold=pass:[<replaceable>N</replaceable>]+ for a resource and it will
migrate to a new node after 'N' failures. There is no threshold defined
by default. To determine the resource's current failure status and
limits, run `crm_mon --failcounts`.

By default, once the threshold has been reached, the troublesome node will no
longer be allowed to run the failed resource until the administrator
manually resets the resource's failcount using `crm_failcount` (after
hopefully first fixing the failure's cause). Alternatively, it is possible
to expire them by setting the +failure-timeout+ option for the resource.

For example, a setting of +migration-threshold=2+ and +failure-timeout=60s+
would cause the resource to move to a new node after 2 failures, and
allow it to move back (depending on stickiness and constraint scores) after one
minute.

There are two exceptions to the migration threshold concept:
when a resource either fails to start or fails to stop.

If the cluster property +start-failure-is-fatal+ is set to +true+ (which is the
default), start failures cause the failcount to be set to +INFINITY+ and thus
always cause the resource to move immediately.

Stop failures are slightly different and crucial. If a resource fails
to stop and STONITH is enabled, then the cluster will fence the node
in order to be able to start the resource elsewhere. If STONITH is
not enabled, then the cluster has no way to continue and will not try
to start the resource elsewhere, but will try to stop it again after
the failure timeout.

[IMPORTANT]
Please read <<s-rules-recheck>> to understand how timeouts work
before configuring a +failure-timeout+.

=== Moving Resources Due to Connectivity Changes ===

You can configure the cluster to move resources when external connectivity is
2 changes: 1 addition & 1 deletion doc/Pacemaker_Explained/en-US/Ch-Options.txt
@@ -189,7 +189,7 @@ indexterm:[Cluster,Option,start-failure-is-fatal]
Should a failure to start a resource on a particular node prevent further start
attempts on that node? If FALSE, the cluster will decide whether the same
node is still eligible based on the resource's current failure count
and +migration-threshold+ (see <<s-failure-migration>>).
and +migration-threshold+ (see <<s-failure-handling>>).

| enable-startup-probes | TRUE |
indexterm:[enable-startup-probes,Cluster Option]
2 changes: 1 addition & 1 deletion doc/Pacemaker_Explained/en-US/Ch-Resources.txt
@@ -665,7 +665,7 @@ indexterm:[Action,Property,on-fail]
|FALSE
|If +true+, the intention to perform the operation is recorded so that
GUIs and CLI tools can indicate that an operation is in progress.
This is best set as an 'operation default' (see next section).
This is best set as an _operation default_ (see next section).
Allowed values: +true+, +false+.
indexterm:[enabled,Action Property]
indexterm:[Action,Property,enabled]
