cluster-servant: only report online if quorate or never had quorum #27
Conversation
Currently, the sbd-cluster servant does nothing for a corosync/cman cluster each time notify_timer_cb fires. So this servant can never cause fencing after connecting. Instead, query the QUORUM service and only report 'online' status when quorate or when quorum has never been attained. This allows sbd to be used for fencing upon loss of quorum when pacemaker is not running.
Guess we could set the quorum-state via the callback as well.
Okay. I originally thought checking quorum state would replace the need to do cpg polling. But I think you are suggesting that cpg polling is worthwhile in addition. Would it be best for that to belong in a separate servant? Perhaps it could send out cpg messages and check that it is receiving messages from at least one node that is not itself (except in a 2-node cluster)?
I think opening up an additional servant is just too much hassle and has an impact on how to configure the whole thing. It should be possible to handle all of that in the cluster-servant.
Sorry for that ;-)
I checked this PR taken from the author repository |
@Splarv: As already stated above, I think we wouldn't like this synchronous implementation. And thinking it over again, I'm not sure whether suicide on loss of quorum without pacemaker running would make sense. On the contrary, it would rather be undesired behavior during cluster shutdown, for example.
On 17 Apr 2019, at 12:06, wenningerk ***@***.***> wrote:
And thinking over it again I'm not sure if suicide in case of quorum-loss without pacemaker running would make sense. Contrary it would rather be undesired behavior in cluster-shutdown e.g.
I get your point.
That said, I do see a reason for the cpg-message check: to be sure that corosync messaging is working. We can then assume it is working for pacemaker as well, so that what we get from there either really reflects the state of the partial cluster, or we get nothing and thus self-fence. And there is no need for different handling of two-node clusters or anything.
Yes, what is needed is a proof of liveness, maybe a ping or something similar to corosync and pacemaker. Does cpg have anything like that?
Well, cpg sends you back your own message once it is distributed over the cluster.
For checking liveness of the corosync-daemon we don't necessarily have to send messages over the network. Using something simple like cpg_local_get already sufficiently verifies the daemon connection.
Yep, I already found that. The pull request is almost ready (for corosync).
diff --git a/configure.ac b/configure.ac
index fac26a8..2c57287 100644
--- a/configure.ac
+++ b/configure.ac
@@ -66,6 +66,7 @@ AC_CHECK_LIB(pe_rules, test_rule, , missing="yes")
AC_CHECK_LIB(crmcluster, crm_peer_init, , missing="yes")
AC_CHECK_LIB(uuid, uuid_unparse, , missing="yes")
AC_CHECK_LIB(cmap, cmap_initialize, , HAVE_cmap=0)
+AC_CHECK_LIB(cpg, cpg_local_get, , missing="yes")
dnl pacemaker >= 1.1.8
AC_CHECK_HEADERS(pacemaker/crm/cluster.h)
diff --git a/src/sbd-cluster.c b/src/sbd-cluster.c
index 541212f..7d947bb 100644
--- a/src/sbd-cluster.c
+++ b/src/sbd-cluster.c
@@ -264,8 +264,15 @@ notify_timer_cb(gpointer data)
#if HAVE_DECL_PCMK_CLUSTER_CMAN
case pcmk_cluster_cman:
#endif
- /* TODO - Make a CPG call and only call notify_parent() when we get a reply */
- notify_parent();
+            { /* "ping" corosync: verify the daemon connection is alive */
+                unsigned int nodeid = 0;
+                cs_error_t rc = cpg_local_get(cluster.cpg_handle, &nodeid);
+
+                if (rc == CS_OK && nodeid == cluster.nodeid) {
+                    notify_parent();
+                } else {
+                    cl_log(LOG_WARNING,
+                           "cpg_local_get failed: rc=%d, nodeid=%u, cluster.nodeid=%u",
+                           rc, nodeid, cluster.nodeid);
+                }
+            }
But I have two questions. First, do I need some conditional-compilation directives around my code, for instance `#ifdef SUPPORT_COROSYNC` or something else?
Second, corosync sometimes becomes laggy. This patch only makes sense if I set `-I 30` (timeout_io), or maybe `-I 60` (with a 2x reserve). When I stop corosync on one node, corosync on the other node freezes for 28 s, and without `-I 30` that node gets watchdogged as well. So the question is: how should the default timeout_io be set, and to what value?
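One way to make such a larger I/O timeout persistent is via sbd's sysconfig file. This is a sketch under assumptions: the file path and the `SBD_OPTS` variable depend on the distribution's sbd packaging, and whether `-I 30` is the right value depends on the corosync freeze behavior discussed above.

```shell
# /etc/sysconfig/sbd (or /etc/default/sbd on Debian-style systems)
# Pass a larger I/O timeout to sbd so that a temporary corosync freeze
# (~28 s in the scenario above) does not watchdog the surviving node.
SBD_OPTS="-I 30"
```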
… On 18 Apr 2019, at 15:43, wenningerk ***@***.***> wrote:
For checking liveness of the corosync-daemon we don't necessarily have to send messages over the network. Using something simple like cpg_local_get already sufficiently verifies the daemon connection.
#76
Check out my PR via the link above - it should have all the conditionals.
Yep, your patch looks better than mine. I do one more health check, that the returned nodeid equals cluster.nodeid, but maybe this is not necessary.
Well, maybe the issue that the healthy node is also watchdogged can be solved by other config options, not by `sbd -I 30`. But the problem is that this happens with the default timeout options, so users with the default timeout config will get unexpected results. :) Or maybe the reason is that I'm using too old a corosync/pacemaker, and the problem won't exist in a future version.
… On 18 Apr 2019, at 16:30, wenningerk ***@***.***> wrote:
check out my PR via the link above - should have all the conditionals
I guess you have other issues with corosync that should be observed separately.
If you have that much lag, you will see membership drops and loss of quorum as well, which will subsequently lead to self-fencing during operation.
I don't see these issues in my setup.
We just want to check the connection, so extra checks of what is returned as nodeid are not necessary.
Answered here
#76
… On 18 Apr 2019, at 17:01, wenningerk ***@***.***> wrote:
We just want to check the connection so extra checks of what is returned as nodeid are not necessary.
I've simplified my patch a bit, since 0 isn't guaranteed to be an invalid cpg_handle anyway.
I'm using pacemaker 2.0.1 and corosync 2.99.3, but honestly the behavior you are describing shouldn't have been there before either.
@jjd25:
Can one of the admins verify this patch? |