Commit
Merge pull request #742 from kgaillot/stonith
Minor stonith fixes and refactoring
davidvossel committed Jul 1, 2015
2 parents 1edcf61 + 5966ddb commit c4d6be4
Showing 5 changed files with 429 additions and 81 deletions.
3 changes: 2 additions & 1 deletion cts/CTStests.py
@@ -1314,7 +1314,8 @@ def __call__(self, node):
self.debug("Shooting %s aka. %s" % (rsc.clone_id, rsc.id))

pats = []
pats.append("pengine.*: warning: Processing failed op %s for %s on" % (self.action, self.rid))
pats.append(r"pengine.*: warning: Processing failed op %s for (%s|%s) on" % (self.action,
rsc.id, rsc.clone_id))

if rsc.managed():
pats.append(self.templates["Pat:RscOpOK"] % (self.rid, "stop_0"))
145 changes: 145 additions & 0 deletions fencing/README.md
@@ -0,0 +1,145 @@
# Directory contents

* `admin.c`, `stonith_admin.8`: `stonith_admin` command-line tool and its man
page
* `commands.c`, `internal.h`, `main.c`, `remote.c`, `stonithd.7`: stonithd and
its man page
* `fence_dummy`, `fence_legacy`, `fence_legacy.8`, `fence_pcmk`,
`fence_pcmk.8`: Pacemaker-supplied fence agents and their man pages
* `regression.py(.in)`: regression tests for `stonithd`
* `standalone_config.c`, `standalone_config.h`: abandoned project
* `test.c`: `stonith-test` command-line tool

# How fencing requests are handled

## Bird's eye view

In the broadest terms, stonith works like this:

1. The initiator (an external program such as `stonith_admin`, or the cluster
itself via the `crmd`) asks the local `stonithd`, "Hey, can you fence this
node?"
1. The local `stonithd` asks all the `stonithd` instances in the cluster (including
itself), "Hey, what fencing devices do you have access to that can fence
this node?"
1. Each `stonithd` in the cluster replies with a list of available devices that
it knows about.
1. Once the original `stonithd` gets all the replies, it asks the most
appropriate `stonithd` peer to actually carry out the fencing. It may send
out more than one such request if the target node must be fenced with
multiple devices.
1. The chosen `stonithd` peer(s) call the appropriate fencing resource agent(s) to
   do the fencing, then reply to the original `stonithd` with the result.
1. The original `stonithd` broadcasts the result to all `stonithd` instances.
1. Each `stonithd` sends the result to each of its local clients (including, at
some point, the initiator).
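
The numbered steps above can be condensed into a toy simulation. This is an illustrative sketch only, not Pacemaker code: the `Stonithd` class, the `fence()` helper, and the device dictionaries are invented names, and the real implementation exchanges XML messages over the cluster messaging layer rather than calling Python methods.

```python
class Stonithd:
    """Toy stand-in for one stonithd daemon and the devices it can reach."""
    def __init__(self, name, devices):
        self.name = name
        self.devices = devices

    def query(self, target):
        # Steps 2-3: reply with the devices that can fence the target
        return [d for d in self.devices if target in d["targets"]]

def fence(peers, target):
    # Steps 1-3: broadcast the query and collect every peer's reply
    replies = {p.name: p.query(target) for p in peers}
    # Step 4: ask the most appropriate peer (here simply the first capable one)
    for name, devices in replies.items():
        if devices:
            # Step 5: that peer runs the fence agent; assume it succeeds
            # Steps 6-7: the result is broadcast and handed to local clients
            return {"executioner": name, "device": devices[0]["id"], "rc": 0}
    return {"rc": 1}  # no peer has a capable device: the request fails

peers = [
    Stonithd("node1", []),
    Stonithd("node2", [{"id": "ipmi-node3", "targets": {"node3"}}]),
]
print(fence(peers, "node3"))
```

Note how fencing fails outright only when no peer anywhere reports a capable device.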

## Detailed view

### Initiating a fencing request

A fencing request can be initiated by the cluster or externally, using the
libfencing API.

* The cluster always initiates fencing via `crmd/te_actions.c:te_fence_node()`
(which calls the `fence()` API). This occurs when a graph synapse contains a
`CRM_OP_FENCE` XML operation.
* The main external clients are `stonith_admin` and `stonith-test`.

Highlights of the fencing API:
* `stonith_api_new()` creates and returns a new `stonith_t` object, whose
`cmds` member has methods for connect, disconnect, fence, etc.
* The `fence()` method creates and sends a `STONITH_OP_FENCE` XML request with
  the desired action and target node. Callers do not have to choose, or even
  have any knowledge of, particular fencing devices.

### Fencing queries

As of this writing, the function calls for a stonith request go something like this:

The local `stonithd` receives the client's request via an IPC or messaging
layer callback, which calls
* `stonith_command()`, which (for requests) calls
* `handle_request()`, which (for `STONITH_OP_FENCE` from a client) calls
* `initiate_remote_stonith_op()`, which creates a `STONITH_OP_QUERY` XML
request with the target, desired action, timeout, etc., then broadcasts
the operation to the cluster group (i.e. all `stonithd` instances) and
starts a timer. The query is broadcast because (1) location constraints
might prevent the local node from accessing the stonith device directly,
and (2) even if the local node does have direct access, another node
might be preferred to carry out the fencing.

Each `stonithd` receives the original `stonithd`'s `STONITH_OP_QUERY` broadcast
request via IPC or messaging layer callback, which calls:
* `stonith_command()`, which (for requests) calls
* `handle_request()`, which (for `STONITH_OP_QUERY` from a peer) calls
* `stonith_query()`, which calls
* `get_capable_devices()` with `stonith_query_capable_device_db()` to add
device information to an XML reply and send it. (A message is
considered a reply if it contains `T_STONITH_REPLY`, which is only set
by `stonithd` peers, not clients.)

The original `stonithd` receives all peers' `STONITH_OP_QUERY` replies via IPC
or messaging layer callback, which calls:
* `stonith_command()`, which (for replies) calls
* `handle_reply()` which (for `STONITH_OP_QUERY`) calls
* `process_remote_stonith_query()`, which allocates a new query result
structure, parses device information into it, and adds it to the operation
object. It increments the number of replies received for this operation
and compares it against the expected number of replies (i.e. the number
of active peers); if this is the last expected reply, it calls
* `call_remote_stonith()`, which calculates the timeout and sends
`STONITH_OP_FENCE` request(s) to carry out the fencing. If the target
node has a fencing "topology" (which allows specifications such as
"this node can be fenced either with device A, or devices B and C in
combination"), it will choose the device(s) and send out as many
requests as needed. For each chosen device, it also chooses a peer; a
peer is preferred if it has "verified" access to the desired device,
meaning that the device is "running" on it and thus a monitor
operation is ensuring reachability.
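
The reply bookkeeping in `process_remote_stonith_query()` and the peer preference in `call_remote_stonith()` can be sketched as follows. This is a hedged approximation with invented field names (`replies`, `replies_expected`, `query_results`), not stonithd's real data structures.

```python
def record_query_reply(op, peer, devices):
    """Store one peer's query reply; True means it was the last one expected."""
    op["query_results"][peer] = devices
    op["replies"] += 1
    return op["replies"] >= op["replies_expected"]

def choose_peer(op, device_id):
    """Prefer a peer with "verified" access to the device, i.e. one where the
    device is running and a monitor operation has confirmed reachability."""
    capable = [(peer, devs[device_id])
               for peer, devs in op["query_results"].items()
               if device_id in devs]
    verified = [peer for peer, dev in capable if dev.get("verified")]
    if verified:
        return verified[0]
    return capable[0][0] if capable else None

op = {"replies": 0, "replies_expected": 2, "query_results": {}}
assert not record_query_reply(op, "node1", {"psu1": {"verified": False}})
assert record_query_reply(op, "node2", {"psu1": {"verified": True}})
print(choose_peer(op, "psu1"))  # node2 wins: it has verified access
```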

### Fencing operations

Each `STONITH_OP_FENCE` request goes something like this as of this writing:

The chosen peer `stonithd` receives the `STONITH_OP_FENCE` request via IPC or
messaging layer callback, which calls:
* `stonith_command()`, which (for requests) calls
* `handle_request()`, which (for `STONITH_OP_FENCE` from a peer) calls
* `stonith_fence()`, which calls
* `schedule_stonith_command()` (using supplied device if
`F_STONITH_DEVICE` was set, otherwise the highest-priority capable
device obtained via `get_capable_devices()` with
`stonith_fence_get_devices_cb()`), which adds the operation to the
device's pending operations list and triggers processing.

The chosen peer `stonithd`'s mainloop is triggered and calls
* `stonith_device_dispatch()`, which calls
* `stonith_device_execute()`, which pops off the next item from the device's
pending operations list. If acting as the (internally implemented) watchdog
agent, it panics the node, otherwise it calls
* `stonith_action_create()` and `stonith_action_execute_async()` to call the fencing agent.

The chosen peer `stonithd`'s mainloop is triggered again once the fencing agent returns, and calls
* `stonith_action_async_done()`, which adds the results to an action object, then calls its
* done callback (`st_child_done()`), which calls `schedule_stonith_command()`
for a new device if there are further required actions to execute or if the
original action failed, then builds and sends an XML reply to the original
`stonithd` (via `stonith_send_async_reply()`), then checks whether any
pending actions are the same as the one just executed and merges them if so.
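
The merge step at the end of `st_child_done()` can be illustrated with a small sketch. Assumed here, not taken from the source: operations are plain dictionaries and "the same" means same action and target (the real code compares more state).

```python
def merge_duplicates(finished, pending, result):
    """Give pending duplicates of the finished action its result; return
    (merged, still_pending) so duplicates never hit the device again."""
    merged, rest = [], []
    for op in pending:
        same = (op["action"] == finished["action"]
                and op["target"] == finished["target"])
        (merged if same else rest).append(op)
    for op in merged:
        op["result"] = result
    return merged, rest

pending = [{"action": "off", "target": "node3"},
           {"action": "off", "target": "node4"}]
merged, rest = merge_duplicates({"action": "off", "target": "node3"},
                                pending, 0)
print(len(merged), len(rest))  # 1 1: one duplicate merged, one still pending
```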

### Fencing replies

The original `stonithd` receives the `STONITH_OP_FENCE` reply via IPC or
messaging layer callback, which calls:
* `stonith_command()`, which (for replies) calls
* `handle_reply()`, which calls
* `process_remote_stonith_exec()`, which calls either
`call_remote_stonith()` (to retry a failed operation, or to try the next
device in a topology if appropriate, issuing a new
`STONITH_OP_FENCE` request and proceeding as before) or `remote_op_done()`
(if the operation has definitively failed or succeeded).
* `remote_op_done()` broadcasts the result to all peers.

Finally, all peers receive the broadcast result and call
* `remote_op_done()`, which sends the result to all local clients.
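
The retry-or-finish decision in `process_remote_stonith_exec()` reduces to a few lines of control flow. This sketch uses invented names (`untried_devices`, string return values) purely for illustration.

```python
def handle_fence_reply(op, rc):
    """rc == 0 means the fence agent reported success."""
    if rc == 0:
        return "done:success"           # remote_op_done() broadcasts success
    if op["untried_devices"]:
        op["device"] = op["untried_devices"].pop(0)
        return "retry:" + op["device"]  # new STONITH_OP_FENCE request
    return "done:failed"                # remote_op_done() broadcasts failure

op = {"device": "A", "untried_devices": ["B"]}
print(handle_fence_reply(op, 1))  # retry:B
print(handle_fence_reply(op, 0))  # done:success
```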
135 changes: 108 additions & 27 deletions fencing/commands.c
@@ -53,15 +53,24 @@ GHashTable *topology = NULL;
GList *cmd_list = NULL;

struct device_search_s {
/* target of fence action */
char *host;
/* requested fence action */
char *action;
/* timeout to use if a device is queried dynamically for possible targets */
int per_device_timeout;
/* number of registered fencing devices at time of request */
int replies_needed;
/* number of device replies received so far */
int replies_received;
/* whether the target is eligible to perform requested action (or off) */
bool allow_suicide;

/* private data to pass to search callback function */
void *user_data;
/* function to call when all replies have been received */
void (*callback) (GList * devices, void *user_data);
/* devices capable of performing requested action (or off if remapping) */
GListPtr capable;
};

@@ -173,6 +182,17 @@ get_action_timeout(stonith_device_t * device, const char *action, int default_ti
char buffer[64] = { 0, };
const char *value = NULL;

/* If "reboot" was requested but the device does not support it,
* we will remap to "off", so check timeout for "off" instead
*/
if (safe_str_eq(action, "reboot")
&& is_not_set(device->flags, st_device_supports_reboot)) {
crm_trace("%s doesn't support reboot, using timeout for off instead",
device->id);
action = "off";
}

/* If the device config specified an action-specific timeout, use it */
snprintf(buffer, sizeof(buffer) - 1, "pcmk_%s_timeout", action);
value = g_hash_table_lookup(device->params, buffer);
if (value) {
@@ -1241,6 +1261,38 @@ search_devices_record_result(struct device_search_s *search, const char *device,
}
}

/*
* \internal
* \brief Check whether the local host is allowed to execute a fencing action
*
* \param[in] device Fence device to check
* \param[in] action Fence action to check
* \param[in] target Hostname of fence target
* \param[in] allow_suicide Whether self-fencing is allowed for this operation
*
* \return TRUE if local host is allowed to execute action, FALSE otherwise
*/
static gboolean
localhost_is_eligible(const stonith_device_t *device, const char *action,
const char *target, gboolean allow_suicide)
{
gboolean localhost_is_target = safe_str_eq(target, stonith_our_uname);

if (device && action && device->on_target_actions
&& strstr(device->on_target_actions, action)) {
if (!localhost_is_target) {
crm_trace("%s operation with %s can only be executed for localhost not %s",
action, device->id, target);
return FALSE;
}

} else if (localhost_is_target && !allow_suicide) {
crm_trace("%s operation does not support self-fencing", action);
return FALSE;
}
return TRUE;
}

static void
can_fence_host_with_device(stonith_device_t * dev, struct device_search_s *search)
{
@@ -1258,19 +1310,11 @@ can_fence_host_with_device(stonith_device_t * dev, struct device_search_s *searc
goto search_report_results;
}

if (dev->on_target_actions &&
search->action &&
strstr(dev->on_target_actions, search->action)) {
/* this device can only execute this action on the target node */

if(safe_str_neq(host, stonith_our_uname)) {
crm_trace("%s operation with %s can only be executed for localhost not %s",
search->action, dev->id, host);
goto search_report_results;
}

} else if(safe_str_eq(host, stonith_our_uname) && search->allow_suicide == FALSE) {
crm_trace("%s operation does not support self-fencing", search->action);
/* Short-circuit the query if the local host is not allowed to perform the
* desired action.
*/
if (!localhost_is_eligible(dev, search->action, host,
search->allow_suicide)) {
goto search_report_results;
}

@@ -1423,6 +1467,43 @@ struct st_query_data {
int call_options;
};

/*
* \internal
* \brief Add action-specific attributes to query reply XML
*
* \param[in,out] xml XML to add attributes to
* \param[in] action Fence action
* \param[in] device Fence device
*/
static void
add_action_specific_attributes(xmlNode *xml, const char *action,
stonith_device_t *device)
{
int action_specific_timeout;
int delay_max;

CRM_CHECK(xml && action && device, return);

if (is_action_required(action, device)) {
crm_trace("Action %s is required on %s", action, device->id);
crm_xml_add_int(xml, F_STONITH_DEVICE_REQUIRED, 1);
}

action_specific_timeout = get_action_timeout(device, action, 0);
if (action_specific_timeout) {
crm_trace("Action %s has timeout %dms on %s",
action, action_specific_timeout, device->id);
crm_xml_add_int(xml, F_STONITH_ACTION_TIMEOUT, action_specific_timeout);
}

delay_max = get_action_delay_max(device, action);
if (delay_max > 0) {
crm_trace("Action %s has maximum random delay %dms on %s",
action, delay_max, device->id);
crm_xml_add_int(xml, F_STONITH_DELAY_MAX, delay_max / 1000);
}
}

static void
stonith_query_capable_device_cb(GList * devices, void *user_data)
{
@@ -1432,13 +1513,12 @@ stonith_query_capable_device_cb(GList * devices, void *user_data)
xmlNode *list = NULL;
GListPtr lpc = NULL;

/* Pack the results into data */
/* Pack the results into XML */
list = create_xml_node(NULL, __FUNCTION__);
crm_xml_add(list, F_STONITH_TARGET, query->target);
for (lpc = devices; lpc != NULL; lpc = lpc->next) {
stonith_device_t *device = g_hash_table_lookup(device_list, lpc->data);
int action_specific_timeout;
int delay_max;
const char *action = query->action;

if (!device) {
/* It is possible the device got unregistered while
@@ -1448,24 +1528,25 @@ stonith_query_capable_device_cb(GList * devices, void *user_data)

available_devices++;

action_specific_timeout = get_action_timeout(device, query->action, 0);
dev = create_xml_node(list, F_STONITH_DEVICE);
crm_xml_add(dev, XML_ATTR_ID, device->id);
crm_xml_add(dev, "namespace", device->namespace);
crm_xml_add(dev, "agent", device->agent);
crm_xml_add_int(dev, F_STONITH_DEVICE_VERIFIED, device->verified);
if (is_action_required(query->action, device)) {
crm_xml_add_int(dev, F_STONITH_DEVICE_REQUIRED, 1);
}
if (action_specific_timeout) {
crm_xml_add_int(dev, F_STONITH_ACTION_TIMEOUT, action_specific_timeout);
}

delay_max = get_action_delay_max(device, query->action);
if (delay_max > 0) {
crm_xml_add_int(dev, F_STONITH_DELAY_MAX, delay_max / 1000);
/* If the originating stonithd wants to reboot the node, and we have a
* capable device that doesn't support "reboot", remap to "off" instead.
*/
if (is_not_set(device->flags, st_device_supports_reboot)
&& safe_str_eq(query->action, "reboot")) {
crm_trace("%s doesn't support reboot, using values for off instead",
device->id);
action = "off";
}

/* Add action-specific values if available */
add_action_specific_attributes(dev, action, device);

if (query->target == NULL) {
xmlNode *attrs = create_xml_node(dev, XML_TAG_ATTRS);

@@ -1481,7 +1562,7 @@ stonith_query_capable_device_cb(GList * devices, void *user_data)
}

if (list != NULL) {
crm_trace("Attaching query list output");
crm_log_xml_trace(list, "Add query results");
add_message_xml(query->reply, F_STONITH_CALLDATA, list);
}
stonith_send_reply(query->reply, query->call_options, query->remote_peer, query->client_id);
14 changes: 14 additions & 0 deletions fencing/internal.h
@@ -129,6 +129,20 @@ typedef struct remote_fencing_op_s {

} remote_fencing_op_t;

/*
* Complex fencing requirements are specified via fencing topologies.
* A topology consists of levels; each level is a list of fencing devices.
* Topologies are stored in a hash table by node name. When a node needs to be
* fenced, if it has an entry in the topology table, the levels are tried
* sequentially, and the devices in each level are tried sequentially.
* Fencing is considered successful as soon as any level succeeds;
* a level is considered successful if all its devices succeed.
* Essentially, all devices at a given level are "and-ed" and the
* levels are "or-ed".
*
* This structure is used for the topology table entries.
* Topology levels start from 1, so levels[0] is unused and always NULL.
*/
typedef struct stonith_topology_s {
char *node;
GListPtr levels[ST_LEVEL_MAX];
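
The and/or semantics described in the comment above can be captured in a few lines of Python. This is a sketch, not the C implementation, assuming topology levels are numbered from 1 (so slot 0 is skipped):

```python
def fence_by_topology(levels, try_device):
    """levels: list of device-id lists, index 0 unused (None).
    try_device: callable returning True if fencing with that device worked."""
    for level in levels[1:]:
        # a level succeeds only if every device in it succeeds ("and")
        if level and all(try_device(dev) for dev in level):
            return True  # any successful level finishes the job ("or")
    return False

# Device A is broken; B and C work, so level 1 fails but level 2
# ("devices B and C in combination") succeeds.
working = {"B", "C"}
print(fence_by_topology([None, ["A"], ["B", "C"]], lambda d: d in working))
```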
