Doc: Pacemaker Explained: document per-operation fail counts
This replaces the "Moving Resources Due to Failure" section with a new
"Handling Resource Failure" section that can go into more detail about
failure handling in general.
kgaillot committed Mar 16, 2017
1 parent 8323616 commit 4f7420e
Showing 4 changed files with 125 additions and 59 deletions.
4 changes: 2 additions & 2 deletions doc/Pacemaker_Explained/en-US/Ap-Upgrade.txt
@@ -350,7 +350,7 @@ cluster running, and any validation errors are often more informative.

==== New ====

* Failure timeouts. See <<s-failure-migration>>
* Failure timeouts. See <<s-failure-handling>>
* New section for resource and operation defaults. See <<s-resource-defaults>> and <<s-operation-defaults>>
* Tool for making offline configuration changes. See <<s-config-sandboxes>>
* +Rules, instance_attributes, meta_attributes+ and sets of operations can be defined once and referenced in multiple places. See <<s-reusing-config-elements>>
@@ -371,7 +371,7 @@ cluster running, and any validation errors are often more informative.
* The +stonith-enabled+ option now defaults to true.
* The cluster will refuse to start resources if +stonith-enabled+ is true (or unset) and no STONITH resources have been defined
* The attributes of colocation and ordering constraints were renamed for clarity. See <<s-resource-ordering>> and <<s-resource-colocation>>
* +resource-failure-stickiness+ has been replaced by +migration-threshold+. See <<s-failure-migration>>
* +resource-failure-stickiness+ has been replaced by +migration-threshold+. See <<s-failure-handling>>
* The parameters for command-line tools have been made consistent
* Switched to 'RelaxNG' schema validation and 'libxml2' parser
** id fields are now XML IDs which have the following limitations:
176 changes: 121 additions & 55 deletions doc/Pacemaker_Explained/en-US/Ch-Advanced-Options.txt
@@ -118,6 +118,127 @@ example, to specify an operation that would run on the first Monday of
<op id="my-weekly-action" name="custom-action" interval="P7D" interval-origin="2009-W01-1"/>
=====

[[s-failure-handling]]
== Handling Resource Failure ==

By default, Pacemaker will attempt to recover failed resources by restarting
them. However, failure recovery is highly configurable.

=== Failure Counts ===

Pacemaker tracks resource failures for each combination of node, resource, and
operation (start, stop, monitor, etc.).

You can query the fail count for a particular node, resource, and/or operation
using the `crm_failcount` command. For example, to see how many times the
10-second monitor for +myrsc+ has failed on +node1+, run:

----
# crm_failcount --query -r myrsc -N node1 -n monitor -I 10s
----

If you omit the node, `crm_failcount` will use the local node. If you omit the
operation and interval, `crm_failcount` will display the sum of the fail counts
for all operations on the resource.
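
For example, a query like the following would show the total of all operation
fail counts for +myrsc+ on the local node:

----
# crm_failcount --query -r myrsc
----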

You can use `crm_resource --cleanup` or `crm_failcount --delete` to clear
fail counts. For example, to clear the above monitor failures, run:

----
# crm_resource --cleanup -r myrsc -N node1 -n monitor -I 10s
----

If you omit the resource, `crm_resource --cleanup` will clear failures for all
resources. If you omit the node, it will clear failures on all nodes. If you
omit the operation and interval, it will clear the failures for all operations
on the resource.
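
For example, a command like the following would clear all failures for
+myrsc+, for all operations, on all nodes:

----
# crm_resource --cleanup -r myrsc
----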

[NOTE]
====
Even when cleaning up only a single operation, all failed operations will
disappear from the status display, in order to trigger a re-check of the
resource's current status.
====

Higher-level tools may provide other commands for querying and clearing
fail counts.

The `crm_mon` tool shows the current cluster status, including any failed
operations. To see the current fail counts for any failed resources, call
`crm_mon` with the `--failcounts` option. This shows the fail counts per
resource (that is, the sum of any operation fail counts for the resource).
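
For example, the following would print the cluster status once, including
per-resource fail counts, and then exit (assuming a reasonably recent
`crm_mon`):

----
# crm_mon --failcounts --one-shot
----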

=== Failure Response ===

Normally, if a running resource fails, Pacemaker will try to stop it and start
it again. Pacemaker will choose the best location to start it each time, which
may be the same node that it failed on.

However, if a resource fails repeatedly, there may be an underlying problem on
that node, and you might want to try a different node in such a case.
Pacemaker lets you express this preference via the
+migration-threshold+ resource meta-attribute.
footnote:[
The naming of this option was perhaps unfortunate as it is easily
confused with live migration, the process of moving a resource from
one node to another without stopping it. Xen virtual guests are the
most common example of resources that can be migrated in this manner.
]

If you define +migration-threshold=pass:[<replaceable>N</replaceable>]+ for a
resource, it will be banned from the original node after 'N' failures.
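
For example, a command along the following lines could be used to set a
migration threshold of 3 for +myrsc+ (exact syntax may vary between
`crm_resource` versions):

----
# crm_resource --resource myrsc --meta --set-parameter migration-threshold --parameter-value 3
----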

[NOTE]
====
The +migration-threshold+ is per 'resource', even though fail counts are
tracked per 'operation'. The operation fail counts are added together
to compare against the +migration-threshold+. For example, if a resource's
start operation has failed once and its monitor operation has failed twice,
its combined fail count for this purpose is 3.
====

By default, fail counts remain until manually cleared by an administrator
using `crm_resource --cleanup` or `crm_failcount --delete` (hopefully after
first fixing the failure's cause). It is possible to have fail counts expire
automatically by setting the +failure-timeout+ resource meta-attribute.

[IMPORTANT]
====
A successful operation does not clear past failures. If a recurring monitor
operation fails once, succeeds many times, then fails again days later, its
fail count is 2. Fail counts are cleared only by manual intervention or
failure timeout.
====

For example, a setting of +migration-threshold=2+ and +failure-timeout=60s+
would cause the resource to move to a new node after 2 failures, and
allow it to move back (depending on stickiness and constraint scores) after one
minute.
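
In CIB XML, such a configuration might look something like this (the resource
definition and ids here are purely illustrative):

----
<primitive id="myrsc" class="ocf" provider="pacemaker" type="Dummy">
  <meta_attributes id="myrsc-meta_attributes">
    <nvpair id="myrsc-migration-threshold"
            name="migration-threshold" value="2"/>
    <nvpair id="myrsc-failure-timeout"
            name="failure-timeout" value="60s"/>
  </meta_attributes>
</primitive>
----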

[NOTE]
====
+failure-timeout+ is measured since the most recent failure. That is, older
failures do not individually time out and lower the fail count. Instead, all
failures are timed out simultaneously (and the fail count is reset to 0) if
there is no new failure for the timeout period.
====

There are two exceptions to the migration threshold concept:
when a resource either fails to start or fails to stop.

If the cluster property +start-failure-is-fatal+ is set to +true+ (which is the
default), start failures cause the fail count to be set to +INFINITY+ and thus
always cause the resource to move immediately.
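
If you prefer start failures to count toward the per-resource threshold like
other failures, a command such as the following (an illustrative sketch) could
set the property to +false+:

----
# crm_attribute --type crm_config --name start-failure-is-fatal --update false
----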

Stop failures are slightly different and crucial. If a resource fails
to stop and STONITH is enabled, then the cluster will fence the node
in order to be able to start the resource elsewhere. If STONITH is
not enabled, then the cluster has no way to continue and will not try
to start the resource elsewhere, but will try to stop it again after
the failure timeout.

[IMPORTANT]
Please read <<s-rules-recheck>> to understand how timeouts work
before configuring a +failure-timeout+.

== Moving Resources ==
indexterm:[Moving,Resources]
indexterm:[Resource,Moving]
@@ -236,61 +357,6 @@ positive and negative constraints. E.g.

which has the same long-term consequences as discussed earlier.

[[s-failure-migration]]
=== Moving Resources Due to Failure ===

Normally, if a running resource fails, pacemaker will try to start
it again on the same node. However if a resource fails repeatedly,
it is possible that there is an underlying problem on that node, and you
might desire trying a different node in such a case.

indexterm:[migration-threshold]
indexterm:[failure-timeout]
indexterm:[start-failure-is-fatal]

Pacemaker allows you to set your preference via the +migration-threshold+
resource option.
footnote:[
The naming of this option was perhaps unfortunate as it is easily
confused with live migration, the process of moving a resource from
one node to another without stopping it. Xen virtual guests are the
most common example of resources that can be migrated in this manner.
]

Simply define +migration-threshold=pass:[<replaceable>N</replaceable>]+ for a resource and it will
migrate to a new node after 'N' failures. There is no threshold defined
by default. To determine the resource's current failure status and
limits, run `crm_mon --failcounts`.

By default, once the threshold has been reached, the troublesome node will no
longer be allowed to run the failed resource until the administrator
manually resets the resource's failcount using `crm_failcount` (after
hopefully first fixing the failure's cause). Alternatively, it is possible
to expire them by setting the +failure-timeout+ option for the resource.

For example, a setting of +migration-threshold=2+ and +failure-timeout=60s+
would cause the resource to move to a new node after 2 failures, and
allow it to move back (depending on stickiness and constraint scores) after one
minute.

There are two exceptions to the migration threshold concept:
when a resource either fails to start or fails to stop.

If the cluster property +start-failure-is-fatal+ is set to +true+ (which is the
default), start failures cause the failcount to be set to +INFINITY+ and thus
always cause the resource to move immediately.

Stop failures are slightly different and crucial. If a resource fails
to stop and STONITH is enabled, then the cluster will fence the node
in order to be able to start the resource elsewhere. If STONITH is
not enabled, then the cluster has no way to continue and will not try
to start the resource elsewhere, but will try to stop it again after
the failure timeout.

[IMPORTANT]
Please read <<s-rules-recheck>> to understand how timeouts work
before configuring a +failure-timeout+.

=== Moving Resources Due to Connectivity Changes ===

You can configure the cluster to move resources when external connectivity is
2 changes: 1 addition & 1 deletion doc/Pacemaker_Explained/en-US/Ch-Options.txt
@@ -189,7 +189,7 @@ indexterm:[Cluster,Option,start-failure-is-fatal]
Should a failure to start a resource on a particular node prevent further start
attempts on that node? If FALSE, the cluster will decide whether the same
node is still eligible based on the resource's current failure count
and +migration-threshold+ (see <<s-failure-migration>>).
and +migration-threshold+ (see <<s-failure-handling>>).

| enable-startup-probes | TRUE |
indexterm:[enable-startup-probes,Cluster Option]
2 changes: 1 addition & 1 deletion doc/Pacemaker_Explained/en-US/Ch-Resources.txt
@@ -665,7 +665,7 @@ indexterm:[Action,Property,on-fail]
|FALSE
|If +true+, the intention to perform the operation is recorded so that
GUIs and CLI tools can indicate that an operation is in progress.
This is best set as an 'operation default' (see next section).
This is best set as an _operation default_ (see next section).
Allowed values: +true+, +false+.
indexterm:[enabled,Action Property]
indexterm:[Action,Property,enabled]
