Bug fixes #1409
Conversation
Compare: 5b7ff74 to 590a754
0b68905 aborted the transition on quorum loss, but quorum can also be acquired without triggering a new transition, if corosync gives quorum without a node joining (e.g. forced via corosync-cmapctl, or perhaps via heuristics). This aborts the transition when quorum is gained, but only after a 5-second delay, if the transition has not been aborted in that time. This avoids an unnecessary abort in the vast majority of cases where an abort is already done, and it allows some time for all nodes to connect when quorum is gained, rather than immediately fencing remaining unseen nodes.
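The delayed-abort idea described in the commit message can be sketched as follows. This is an illustrative sketch, not Pacemaker's actual implementation: `struct transition` and `delayed_abort()` are hypothetical names standing in for the controller's real transition state and timer callback.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the controller's transition state. */
struct transition {
    bool aborted;
};

/* Timer callback fired 5s after quorum gain: abort the transition only
 * if nothing else has aborted it in the meantime. Returns true if this
 * call performed the abort. */
static bool delayed_abort(struct transition *t)
{
    if (t->aborted) {
        /* Common case: another event already aborted the transition,
         * so no extra abort is needed. */
        return false;
    }
    /* Rare case: quorum was granted without any membership event,
     * so force a new transition now. */
    t->aborted = true;
    return true;
}
```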
…terval

It already did when a resource was not specified. Also update the help text to clarify cleanup vs. refresh.
```c
 * nodes are joining around the same time, so the one that brings us
 * to quorum doesn't cause all the remaining ones to be fenced.
 */
abort_after_delay(INFINITY, tg_restart, "Quorum gained", 5000);
```
If a joining peer makes the cluster acquire quorum at the corosync level while its sbd has `SBD_DELAY_START` enabled, which is usually longer than 5s and postpones the start of pacemaker there, unnecessary startup fencing targeting the peer is always triggered: 5s after quorum has been acquired, before the node gets a chance to join at the pacemaker level.
I'm trying to think of some potential solutions, such as:

1. Let pacemaker-controld recognize the `SBD_DELAY_START` setting and use a value greater than that as the delay parameter of `abort_after_delay()` here, so that the abort of the transition waits for the peer to join CPG. But that's not necessarily straightforward, since `SBD_DELAY_START` can also be set simply to `yes`, which makes it adopt the value of `msgwait` in disk mode, or otherwise `2 * watchdog_timeout`...

2. Postpone the start of corosync as well by adding the following dependency in sbd.service:

   ```
   Before=corosync.service
   ```

   But that wouldn't work for the case of diskless SBD, since sbd apparently requires the connection to corosync to report a successful start.

3. Ask users to set this for corosync.service:

   ```
   ExecStartPre=/bin/sleep <time>
   ```

   Here `<time>` corresponds to the `SBD_DELAY_START` configured in `/etc/sysconfig/sbd`. This would be a burden for users, since they would have to pay attention to it and keep the relevant settings synchronized.

4. We could perhaps make acquiring of "quorum" require the peer to show up at the CPG level as well, besides at corosync's quorum level. Or we could specifically address only the case where `wait_for_all` is enabled in corosync, and make acquiring of "quorum" require all the peers to show up at the CPG level.
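For the first idea above, the mapping from `SBD_DELAY_START` to a delay value might look roughly like this. A hedged sketch only: `sbd_delay_start_ms()` is a hypothetical helper, and `default_ms` stands in for `msgwait` (disk mode) or `2 * watchdog_timeout` (diskless), which the controller would have to obtain separately.

```c
#include <stdlib.h>
#include <strings.h>

/* Hypothetical helper: translate an SBD_DELAY_START value into a delay
 * in milliseconds. "no"/unset disables the delay; "yes" adopts
 * default_ms (msgwait in disk mode, otherwise 2 * watchdog_timeout);
 * a numeric value is taken as seconds. */
static long sbd_delay_start_ms(const char *value, long default_ms)
{
    if ((value == NULL) || (strcasecmp(value, "no") == 0)) {
        return 0;
    }
    if (strcasecmp(value, "yes") == 0) {
        return default_ms;
    }
    return strtol(value, NULL, 10) * 1000;
}
```

The controller could then pass something like the maximum of 5000 and this value as the delay argument of `abort_after_delay()`.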
So far I cannot think of any better ideas. What do you think, @kgaillot and @wenningerk?
Keep in mind that even if we don't abort the transition here, some unrelated event could abort the transition, and we'd still fence the node.
Why are we fencing the node to begin with, if it's in the cluster membership? I wouldn't expect CPG membership to be required.
> Keep in mind that even if we don't abort the transition here, some unrelated event could abort the transition, and we'd still fence the node.

Indeed. It's just that the situation easily occurs with this predictable transition abort.

> Why are we fencing the node to begin with, if it's in the cluster membership? I wouldn't expect CPG membership to be required.

Well, as far as I can see, it's related to whether the uname of the pending node is known yet when a node_state entry is created for it, and how the scheduler considers the status of the node in that situation...
Please take a look if this makes sense when you get a chance: #3031. Thanks.
That makes much more sense, thanks
To make things even more complicated to consider: when we're running corosync in two-node mode (probably the same as `wait_for_all` for most cases), we're already ignoring quorum and would rather go by availability of the peer via CPG. At least I guess this is also relevant for startup... I have to check how exactly I did that...
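For reference, two-node mode is enabled in corosync.conf like this; per votequorum(5), `two_node: 1` implicitly enables `wait_for_all` unless it is explicitly overridden:

```
quorum {
    provider: corosync_votequorum
    two_node: 1
    # wait_for_all: 1 is implied by two_node unless set explicitly
}
```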