Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Fencing: Prevent sending SIGTERM to stonithd when stointh operation timed out #334

Closed
wants to merge 1 commit into from
Closed

Conversation

inouekazu
Copy link
Contributor

[ problem point ]
When stonith operation is executed and

UID        PID  PPID  PGID   SID  C STIME TTY          TIME CMD
root     18143     1 18143 18143  2 16:05 ?        00:00:02 corosync
root     18149     1 18149 18149  0 16:05 ?        00:00:00 /usr/sbin/pacemakerd -f
root     18152 18149 18152 18152  0 16:05 ?        00:00:00 /usr/libexec/pacemaker/stonithd
root     18153 18149 18153 18153  0 16:05 ?        00:00:00 /usr/libexec/pacemaker/lrmd
root     18372 18152 18152 18152  1 16:06 ?        00:00:00 /usr/bin/perl /usr/sbin/fence_legacy
root     18373 18372 18152 18152  0 16:06 ?        00:00:00 stonith -t external/libvirt.NG1 -E -T reset dev1
root     18378 18373 18152 18152  0 16:06 ?        00:00:00 /bin/sh /usr/lib64/stonith/plugins/external/libvirt.NG1 reset dev1
root     18382 18378 18152 18152  0 16:06 ?        00:00:00 sleep 135
991      18151 18149 18151 18151  0 16:05 ?        00:00:00 /usr/libexec/pacemaker/cib
991      18154 18149 18154 18154  0 16:05 ?        00:00:00 /usr/libexec/pacemaker/attrd
991      18155 18149 18155 18155  0 16:05 ?        00:00:00 /usr/libexec/pacemaker/pengine
991      18156 18149 18156 18156  0 16:05 ?        00:00:00 /usr/libexec/pacemaker/crmd

it timed out, SIGTERM is sent also to stonithd. As a result, stonithd does unnecessary restart.

Aug  1 16:07:25 dev2 stonith-ng[18152]:     info: st_child_term: Child 18372 timed out, sending SIGTERM
Aug  1 16:07:25 dev2 stonith-ng[18152]:     info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
Aug  1 16:07:25 dev2 stonith-ng[18152]:     info: stonith_shutdown: Terminating with  2 clients
Aug  1 16:07:25 dev2 cib[18151]:     info: crm_client_destroy: Destroying 0 events
Aug  1 16:07:25 dev2 stonith-ng[18152]:     info: cib_connection_destroy: Connection to the CIB closed.
Aug  1 16:07:25 dev2 lrmd[18153]:    error: crm_ipc_read: Connection to stonith-ng failed
Aug  1 16:07:25 dev2 lrmd[18153]:    error: mainloop_gio_callback: Connection to stonith-ng[0xfed890] closed (I/O condition=17)
Aug  1 16:07:25 dev2 lrmd[18153]:    error: stonith_connection_destroy_cb: LRMD lost STONITH connection
Aug  1 16:07:25 dev2 lrmd[18153]:    error: stonith_connection_failed: STONITH connection failed, finalizing 2 pending operations.
Aug  1 16:07:25 dev2 crmd[18156]:    error: process_lrm_event: LRM operation f1-1_monitor_3600000 (call=31, status=4, cib-update=45, confirmed=false) Error
Aug  1 16:07:25 dev2 crmd[18156]:    error: process_lrm_event: LRM operation f1-2_monitor_3600000 (call=37, status=4, cib-update=46, confirmed=false) Error
Aug  1 16:07:25 dev2 crmd[18156]:     info: abort_transition_graph: process_graph_event:520 - Triggered transition abort (complete=0, node=dev2, tag=lrm_rsc_op, id=f1-1_last_failure_0, magic=4:1;18:2:0:20d90530-b60e-47f4-9180-465e3d41de54, cib=0.8.25) : Old event
Aug  1 16:07:25 dev2 crmd[18156]:  warning: update_failcount: Updating failcount for f1-1 on dev2 after failed monitor: rc=1 (update=value++, time=1375340845)
Aug  1 16:07:25 dev2 attrd[18154]:     info: find_hash_entry: Creating hash entry for fail-count-f1-1
Aug  1 16:07:25 dev2 attrd[18154]:     info: attrd_local_callback: Expanded fail-count-f1-1=value++ to 1
Aug  1 16:07:25 dev2 attrd[18154]:   notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-f1-1 (1)
Aug  1 16:07:25 dev2 crmd[18156]:     info: process_graph_event: Detected action (2.18) f1-1_monitor_3600000.31=unknown error: failed
Aug  1 16:07:25 dev2 stonith-ng[18152]:     info: crm_client_destroy: Destroying 0 events
Aug  1 16:07:25 dev2 cib[18151]:     info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/45, version=0.8.25)
Aug  1 16:07:25 dev2 crmd[18156]:     info: abort_transition_graph: process_graph_event:520 - Triggered transition abort (complete=0, node=dev2, tag=lrm_rsc_op, id=f1-2_last_failure_0, magic=4:1;20:2:0:20d90530-b60e-47f4-9180-465e3d41de54, cib=0.8.26) : Old event
Aug  1 16:07:25 dev2 crmd[18156]:  warning: update_failcount: Updating failcount for f1-2 on dev2 after failed monitor: rc=1 (update=value++, time=1375340845)
Aug  1 16:07:25 dev2 crmd[18156]:     info: process_graph_event: Detected action (2.20) f1-2_monitor_3600000.37=unknown error: failed
Aug  1 16:07:25 dev2 crmd[18156]:    error: crm_ipc_read: Connection to stonith-ng failed
Aug  1 16:07:25 dev2 stonith-ng[18152]:     info: crm_client_destroy: Destroying 0 events
Aug  1 16:07:25 dev2 cib[18151]:     info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/46, version=0.8.26)
Aug  1 16:07:25 dev2 crmd[18156]:    error: mainloop_gio_callback: Connection to stonith-ng[0x14181d0] closed (I/O condition=17)
Aug  1 16:07:25 dev2 stonith-ng[18152]:     info: qb_ipcs_us_withdraw: withdrawing server sockets
Aug  1 16:07:25 dev2 crmd[18156]:     crit: tengine_stonith_connection_destroy: Fencing daemon connection failed
Aug  1 16:07:25 dev2 cib[18151]:     info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='3232261523']//transient_attributes//nvpair[@name='fail-count-f1-1']: No such device or address (rc=-6, origin=local/attrd/9, version=0.8.26)
Aug  1 16:07:25 dev2 crmd[18156]:   notice: fail_incompletable_stonith: Failing action 27 (27): STONITHd terminated
Aug  1 16:07:25 dev2 cib[18151]:     info: cib_process_request: Completed cib_query operation for section /cib: OK (rc=0, origin=local/attrd/10, version=0.8.26)
Aug  1 16:07:25 dev2 attrd[18154]:   notice: attrd_perform_update: Sent update 11: fail-count-f1-1=1
Aug  1 16:07:25 dev2 attrd[18154]:     info: find_hash_entry: Creating hash entry for last-failure-f1-1
Aug  1 16:07:25 dev2 attrd[18154]:   notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-f1-1 (1375340845)
Aug  1 16:07:25 dev2 crmd[18156]:  warning: fail_incompletable_stonith: STONITHd failure resulted in un-runnable actions
Aug  1 16:07:25 dev2 stonith-ng[18152]:     info: main: Done
Aug  1 16:07:25 dev2 stonith-ng[18152]:     info: crm_xml_cleanup: Cleaning up memory from libxml2
Aug  1 16:07:25 dev2 crmd[18156]:     info: abort_transition_graph: fail_incompletable_stonith:74 - Triggered transition abort (complete=0, node=, tag=crm_event, id=27, magic=NA) : Stonith failure
Aug  1 16:07:25 dev2 crmd[18156]:     info: abort_transition_graph: te_update_diff:172 - Triggered transition abort (complete=0, node=dev2, tag=nvpair, id=status-3232261523-fail-count-f1-1, name=fail-count-f1-1, value=1, magic=NA, cib=0.8.27) : Transient attribute: update
Aug  1 16:07:25 dev2 crmd[18156]:   notice: run_graph: Transition 4 (Complete=1, Pending=0, Fired=0, Skipped=9, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-1.bz2): Stopped
Aug  1 16:07:25 dev2 cib[18151]:     info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/11, version=0.8.27)
Aug  1 16:07:25 dev2 cib[18151]:     info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='3232261523']//transient_attributes//nvpair[@name='last-failure-f1-1']: No such device or address (rc=-6, origin=local/attrd/12, version=0.8.27)
Aug  1 16:07:25 dev2 pacemakerd[18149]:     info: pcmk_child_exit: Child process stonith-ng (18152) exited: OK (0)
Aug  1 16:07:25 dev2 cib[18151]:     info: cib_process_request: Completed cib_query operation for section /cib: OK (rc=0, origin=local/attrd/13, version=0.8.27)
Aug  1 16:07:25 dev2 pacemakerd[18149]:   notice: pcmk_process_exit: Respawning failed child process: stonith-ng
Aug  1 16:07:25 dev2 pacemakerd[18149]:     info: start_child: Forked child 18398 for process stonith-ng
Aug  1 16:07:25 dev2 attrd[18154]:   notice: attrd_perform_update: Sent update 14: last-failure-f1-1=1375340845
Aug  1 16:07:25 dev2 attrd[18154]:     info: find_hash_entry: Creating hash entry for fail-count-f1-2
Aug  1 16:07:25 dev2 attrd[18154]:     info: attrd_local_callback: Expanded fail-count-f1-2=value++ to 1
Aug  1 16:07:25 dev2 attrd[18154]:   notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-f1-2 (1)
Aug  1 16:07:25 dev2 cib[18151]:     info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/14, version=0.8.28)
Aug  1 16:07:25 dev2 cib[18151]:     info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='3232261523']//transient_attributes//nvpair[@name='fail-count-f1-2']: No such device or address (rc=-6, origin=local/attrd/15, version=0.8.28)
Aug  1 16:07:25 dev2 cib[18151]:     info: cib_process_request: Completed cib_query operation for section /cib: OK (rc=0, origin=local/attrd/16, version=0.8.28)
Aug  1 16:07:25 dev2 attrd[18154]:   notice: attrd_perform_update: Sent update 17: fail-count-f1-2=1
Aug  1 16:07:25 dev2 attrd[18154]:     info: find_hash_entry: Creating hash entry for last-failure-f1-2
Aug  1 16:07:25 dev2 attrd[18154]:   notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-f1-2 (1375340845)
Aug  1 16:07:25 dev2 cib[18151]:     info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/17, version=0.8.29)
Aug  1 16:07:25 dev2 cib[18151]:     info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='3232261523']//transient_attributes//nvpair[@name='last-failure-f1-2']: No such device or address (rc=-6, origin=local/attrd/18, version=0.8.29)
Aug  1 16:07:25 dev2 cib[18151]:     info: cib_process_request: Completed cib_query operation for section /cib: OK (rc=0, origin=local/attrd/19, version=0.8.29)
Aug  1 16:07:25 dev2 attrd[18154]:   notice: attrd_perform_update: Sent update 20: last-failure-f1-2=1375340845
Aug  1 16:07:25 dev2 cib[18151]:     info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/20, version=0.8.30)
Aug  1 16:07:25 dev2 stonith-ng[18398]:   notice: crm_add_logfile: Additional logging available in /var/log/ha-debug
Aug  1 16:07:25 dev2 stonith-ng[18398]:    debug: crm_update_callsites: Enabling callsites based on priority=7, files=(null), functions=(null), formats=(null), tags=(null)
Aug  1 16:07:25 dev2 stonith-ng[18398]:     info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/root

If this commit is applied, PGID will become like below.

UID        PID  PPID  PGID   SID  C STIME TTY          TIME CMD
root     28347     1 28347 28347  2 17:33 ?        00:00:01 corosync
root     28353     1 28353 28353  0 17:33 ?        00:00:00 /usr/sbin/pacemakerd -f
root     28356 28353 28356 28356  0 17:33 ?        00:00:00 /usr/libexec/pacemaker/stonithd
root     28357 28353 28357 28357  0 17:33 ?        00:00:00 /usr/libexec/pacemaker/lrmd
root     28585 28356 28585 28356  1 17:34 ?        00:00:00 /usr/bin/perl /usr/sbin/fence_legacy
root     28586 28585 28585 28356  0 17:34 ?        00:00:00 stonith -t external/libvirt.NG1 -E -T reset dev1
root     28591 28586 28585 28356  0 17:34 ?        00:00:00 /bin/sh /usr/lib64/stonith/plugins/external/libvirt.NG1 reset dev1
root     28595 28591 28585 28356  0 17:34 ?        00:00:00 sleep 135
991      28355 28353 28355 28355  0 17:33 ?        00:00:00 /usr/libexec/pacemaker/cib
991      28358 28353 28358 28358  0 17:33 ?        00:00:00 /usr/libexec/pacemaker/attrd
991      28359 28353 28359 28359  0 17:33 ?        00:00:00 /usr/libexec/pacemaker/pengine
991      28360 28353 28360 28360  0 17:33 ?        00:00:00 /usr/libexec/pacemaker/crmd

@davidvossel
Copy link
Contributor

I'm trying to get this straight... stonithd is being killed because fence_pcmk ends up killing its process group.

perhaps we should be separating the agents from the stonithd process group after the fork similar to how we do in the lrmd.

https://github.com/ClusterLabs/pacemaker/blob/master/lib/services/services_linux.c#L355

@beekhof
Copy link
Member

beekhof commented Aug 1, 2013

The changes to fence_pcmk don't affect stonithd, but could avoid it bringing down fenced.
For fence_legacy, beekhof@33fb54b seems more appropriate given the other changes in st_client.c

@beekhof
Copy link
Member

beekhof commented Aug 1, 2013

I split the commit up and changed the commit messages to help illustrate the problems each element solves.
The fence_pcmk change is also different because it is invoked by fenced, not stonithd.

  • Andrew Beekhof (17 minutes ago) 33fb54b: Fix: fencing: Put all fencing agent processes into their own process group (origin/master, origin/HEAD)
  • Kazunori INOUE (16 hours ago) 346a80a: Feature: Fencing: On timeout, clean up the agent's entire process group
  • Kazunori INOUE (16 hours ago) 6502e0b: Refactor: fence_legacy: stonithd now handles killing the process group for us
  • Andrew Beekhof (6 minutes ago) e6e572f: Fix: fence_pcmk: Do not propagate SIGTERM back to fenced (HEAD, master)

@beekhof beekhof closed this Aug 1, 2013
}

setpgrp($$,0);
setpgrp(0,0);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, these two lines are functionally equivalent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants