Fix: libpacemaker: Don't shuffle anonymous clone instances unnecessarily #2313
Conversation
Regression tests pass, including the new ones I created. I've got a few uncertainties though:

Not directly related to this patch: I also noticed that

EDIT: I did do some testing with promotable clones of groups, and they behaved as expected even before the fix. Can't promise that would be true in all situations.
Force-pushed from 7899f69 to f4282d8.
The reason Jenkins failed is that there are now too many tests in
Force-pushed from 24d67c0 to 7b0737e.
The scope of this grew beyond what I had planned. The "Fix: libpacemaker: Don't shuffle anonymous clone instances" commit (currently 7b0737e) is my main goal. The "Enclose `dup_file_path()` in SUPPORT_NAGIOS guard" commit (currently 82ac086) addresses a Jenkins build failure on Fedora 32. The huge "Build, Test: Scheduler: Split up cts/scheduler directory" commit (currently 4becd5d) addresses an "Argument list too long" issue that caused a bunch of builds to fail, due to the increased number of tests.

I'm not at all tied to this approach. If you'd rather fix the argument issue in a different way, I'm all ears :)
Everything through splitting up the scheduler directory looks good, let's do those in a separate PR to get those fixes out quickly and be able to focus this one better. Don't bother with the output formatting change, that will be taken care of at release time.
Force-pushed from 7b0737e to 2f657b8.
I opened #2314 for those first ones. For the time being, I kept those commits in this PR too so that I don't have to update the changes to the test outputs. I can remove them later. I removed the output formatting change and re-pushed. |
Force-pushed from 2f657b8 to 7dfb35d.
```diff
 ovn-dbs-bundle-2 (ocf:ovn:ovndb-servers): Slave controller-1
-ip-172.17.1.87   (ocf:heartbeat:IPaddr2): Started controller-0
+ip-172.17.1.87   (ocf:heartbeat:IPaddr2): Stopped
```
To save you some potential confusion during review: The ip-172.17.1.87 resource is now stopped because it's banned from node 2.
Hmm, I'm not sure that's acceptable. Selecting a node for the promotion of ovn-dbs-bundle should take into account colocation dependencies' preferences. Previously that was implicit by the ordering of the clone instances since the first N nodes are chosen for promotion. I'm not sure what the resolution is, maybe we need to re-sort the nodes before setting instance roles.
Good catch. I thought the behavior would have been the same on the existing scheduler if the ban constraint were for `controller-0` instead, which is why this didn't bother me. But I'm testing now and that seems not to be the case.

The fact that `ovn-dbs-bundle-0` is going to `controller-0` and `ovn-dbs-bundle-1` is going to `controller-2` now is good and we want to keep that. It doesn't make sense to move instance 1. As you said, maybe there's something we can do in `promotion_order()`.
There may be missing constraints that didn't get passed down from the bundle to the promotable clone within it. I ran the below at the beginning of `pcmk__set_instance_roles()`. My understanding is that `rsc_cons_lhs` is the list of constraints where some other resource depends on this resource... and not the list of constraints for which this resource is on the left-hand side. I bet I'm not the only one who gets a headache from the lh/rh names.
```
(gdb) p rsc->id
$1 = 0x13e02e0 "ovn-dbs-bundle-master"
(gdb) p rsc->parent
$2 = (pe_resource_t *) 0x13d3530
(gdb) p rsc->parent->id
$3 = 0x13d3510 "ovn-dbs-bundle"
(gdb) p rsc->parent->rsc_cons
$4 = 0x0
(gdb) p rsc->parent->rsc_cons_lhs
$5 = 0x15340e0 = {0x153ead0}
(gdb) p rsc->rsc_cons
$6 = 0x0
(gdb) p rsc->rsc_cons_lhs
$7 = 0x0
```
Edit: Checked, and yeah the ip-with-bundle colocation is attached to the bundle but not the promotable clone within it. Not sure if it should be attached to the clone or not, or whether this is relevant.
Okay, scratch the constraints. The direct cause of this unwanted behavior isn't that the colocation constraints aren't being honored. The direct cause seems to be that `ovndb_servers:0` is in Stopped state when we begin. This causes a few issues.

First, `lookup_promotion_score()` is unable to get the promotion score for `ovndb_servers:0`.

```
(pe_node_attribute_calculated) trace: ovndb_servers:0: Not looking for master-ovndb_servers:0 on the container host: ovn-dbs-bundle-podman-0 is inactive
```
That's because `pe_node_attribute_calculated()` checks the first node in the container's `running_on` list. If we modify it as follows:

```c
//if (node->details->remote_rsc->container->running_on != NULL) {
//    pe_node_t *host = node->details->remote_rsc->container->running_on->data;
if (node->details->remote_rsc->container->allocated_to != NULL) {
    pe_node_t *host = node->details->remote_rsc->container->allocated_to;
```
Then we're able to get the promotion score for `ovndb_servers:0`. (Whether the above will have any unwanted side effects, I don't know yet. We could also check `allocated_to` and then `running_on`, or vice versa.)

```
(pe_node_attribute_calculated) trace: ovndb_servers:0: Looking for master-ovndb_servers:0 on the allocated container host controller-0
(promotion_score) trace: promotion score for ovndb_servers:0 on ovn-dbs-bundle-0 = <null>
(pe_node_attribute_calculated) trace: ovndb_servers:0: Looking for master-ovndb_servers on the allocated container host controller-0
(promotion_score) trace: stripped promotion score for ovndb_servers on ovn-dbs-bundle-0 = 5
```
So now we have the promotion score. Our next issue is that `sort_promotable_instance()` sorts based on the current role.

```c
role1 = resource1->fns->state(resource1, TRUE);
role2 = resource2->fns->state(resource2, TRUE);

if (role1 > role2) {
    crm_trace("%s %c %s (role)", resource1->id, '<', resource2->id);
    return -1;

} else if (role1 < role2) {
    crm_trace("%s %c %s (role)", resource1->id, '>', resource2->id);
    return 1;
}
```
If we modify that to sort by next role (again, haven't checked whether that has any unwanted side effects), then we encounter a third issue: `ovndb_servers:0` is inactive, so it loses in the sort order to the active instances.

On a positive note: If the "sort by next role instead of current role" change is in place, then on the second transition, the correct node gets promoted and the IP starts. It's apparently able to do this because at that point, `ovndb_servers:0` is already in started state.
```
# crm_simulate -S -x cts/scheduler/xml/cancel-behind-moving-remote.xml -O /tmp/next.xml >/dev/null
# crm_simulate -R -x /tmp/next.xml
...
Transition Summary:
  * Start      rabbitmq-bundle-1   (       controller-0 )   due to unrunnable rabbitmq-bundle-podman-1 start (blocked)
  * Start      rabbitmq:1          (  rabbitmq-bundle-1 )   due to unrunnable rabbitmq-bundle-podman-1 start (blocked)
  * Promote    ovndb_servers:0     ( Slave -> Master ovn-dbs-bundle-0 )
  * Demote     ovndb_servers:1     ( Master -> Slave ovn-dbs-bundle-1 )
  * Start      ip-172.17.1.87      (       controller-0 )
```
So colocation constraints are being accounted for.
To summarize, I feel like it would be fine to check the `allocated_to` node in `pe_node_attribute_calculated()` as a simple fix for looking up the promotion score. But the remaining issue is "how do we sort the promotable instances so that a Stopped instance has a fair chance to be promoted on the first transition, without making the sort criteria problematic for other situations?"

P.S. The reason this worked before the fix is that the stopped instance (`ovndb_servers:0`) was allocated to a node that didn't need to be promoted. Now, the stopped instance is allocated to `controller-0`, which is the node that needs to be promoted so that the IP can start. The instance on `controller-2` gets promoted instead, and the IP can't start there.
> My understanding is that `rsc_cons_lhs` is the list of constraints where some other resource depends on this resource... and not the list of constraints for which this resource is on the left-hand side. I bet I'm not the only one who gets a headache from the lh/rh names.
I know right! We can't change struct member names until we're ready to break backward API compatibility, but I have it on my to-do list for when that happens: s/rsc_cons/this_with/ and s/rsc_cons_lhs/with_this/
> First, `lookup_promotion_score()` is unable to get the promotion score for `ovndb_servers:0`.
>
> ```
> (pe_node_attribute_calculated) trace: ovndb_servers:0: Not looking for master-ovndb_servers:0 on the container host: ovn-dbs-bundle-podman-0 is inactive
> ```
>
> That's because `pe_node_attribute_calculated()` checks the first node in the container's `running_on` list. If we modify it as follows:
>
> ```c
> //if (node->details->remote_rsc->container->running_on != NULL) {
> //    pe_node_t *host = node->details->remote_rsc->container->running_on->data;
> if (node->details->remote_rsc->container->allocated_to != NULL) {
>     pe_node_t *host = node->details->remote_rsc->container->allocated_to;
> ```
>
> Then we're able to get the promotion score for `ovndb_servers:0`. (Whether the above will have any unwanted side effects, I don't know yet. We could also check `allocated_to` and then `running_on`, or vice versa.)
pe_node_attribute_calculated() is different from pe_node_attribute_raw() only for resources in a bundle where container-attribute-target has been set to "host" (indicating that the value for the bundle instance's host should be used rather than any value set for the bundle instance node itself).
Besides promotion_score(), calculated attributes are used for location rules where the rule score is taken from a node attribute, for example to say a containerized resource has to run on a host with a node attribute indicating it has access to necessary shared storage that is exported into the container.
Location rules feed into the allocation process, so we don't have ->allocated_to at that point.
Offhand, it sounds reasonable to use ->allocated_to if it's non-NULL and ->running_on otherwise -- it should work in practice for these two purposes. Alternatively we could pass a bool to pe_node_attribute_calculated() indicating which one we want. Hopefully existing regression tests are sufficient to rule out side effects.
> Offhand, it sounds reasonable to use ->allocated_to if it's non-NULL and ->running_on otherwise -- it should work in practice for these two purposes.
It's worth pointing out that pcmk__set_instance_roles() calls promotion_score() with the ->allocated_to node (via ->location(..., FALSE)), so it's reasonable to assume the host's ->allocated_to is the right choice for that context.
> Our next issue is that `sort_promotable_instance()` sorts based on the current role.
>
> ```c
> role1 = resource1->fns->state(resource1, TRUE);
> role2 = resource2->fns->state(resource2, TRUE);
>
> if (role1 > role2) {
>     crm_trace("%s %c %s (role)", resource1->id, '<', resource2->id);
>     return -1;
>
> } else if (role1 < role2) {
>     crm_trace("%s %c %s (role)", resource1->id, '>', resource2->id);
>     return 1;
> }
> ```
>
> If we modify that to sort by next role (again, haven't checked whether that has any unwanted side effects),
The point of promotion_order() is to decide which instances to promote, so at this point none of the children will have a next_role of promoted. Using the current role allows currently promoted instances to be favored to keep their role. Also, preferring already-started instances over to-be-started instances probably has some value in terms of getting a promoted instance more quickly.

Before comparing roles, sort_promotable_instance() compares rsc->sort_index (via sort_rsc_index()). It feels like we need either a new criterion between those two, or to adjust ->sort_index in promotion_order().
> The point of promotion_order() is to decide which instances to promote, so at this point none of the children will have a next_role of promoted. Using the current role allows currently promoted instances to be favored to keep their role. Also, preferring already-started instances over to-be-started instances probably has some value in terms of getting a promoted instance more quickly.
All of that reasoning makes sense.
> Before comparing roles, sort_promotable_instance() compares rsc->sort_index (via sort_rsc_index()). It feels like we need either a new criterion between those two, or to adjust ->sort_index in promotion_order().
Agreed. And I wasn't looking closely enough last night. This move that I mentioned on the second transition when the role comparison uses next role:

```
 * Promote    ovndb_servers:0     ( Slave -> Master ovn-dbs-bundle-0 )
```

happens not because colocation constraints are being honored, but rather because the preferences are all equal and "0" sorts before "2" by default.
I think (for now I only have time to skim) that `promotion_order()` already provides the logic we need:

- https://github.com/ClusterLabs/pacemaker/blob/master/lib/pacemaker/pcmk_sched_promotable.c#L340-L362
- https://github.com/ClusterLabs/pacemaker/blob/master/lib/pacemaker/pcmk_sched_promotable.c#L390-L400

If that all works properly already, then the issue is that `ovn-dbs-bundle` has the constraints but `ovn-dbs-bundle-master` does not (as noted earlier). We'd need to pull the colocations down to the clone from the bundle. This might be a pretty simple thing to do; I'm just not familiar with it yet. We'd also need to determine whether it needs to be done earlier (in `pcmk__bundle_allocate()`), only here (in the promotion ordering), or somewhere else. I would think in allocate.
I had to take a detour and get a better grasp on how bundles and colocation work, before I could add what's most likely going to be a simple fix. I've got a version that appears to work now. It needs some cleanup. Might have something by tomorrow.
Force-pushed from 1379563 to 7938312.
Excellent analysis. The problem arises because:

1&2 are more or less a necessity, and 3&4 work together to ensure that the priorities defined by scores are respected. I don't think we can swap allocations after 3 without risking all sorts of breakage.

This is a half-baked idea at this point, but I'm wondering if we can reorder `rsc->children` for some steps. Currently the order of `rsc->children` always matches the numeric order of instances, but I'm not sure that's important to anything -- maybe we could sort `rsc->children` at some step, and then use the first N children rather than the first N instance numbers. In other words, currently `stateful:0` is allocated to node2 because it is the first member of `rsc->children` and node2 has the highest node score, but what if we could sort `rsc->children` so that `stateful:2` is first, so that it gets allocated to node2 instead. The questions are how the sort should be done, and whether changing the order will affect anything else.
Pardon the length:
Yep, although I do want to emphasize that this issue doesn't occur only with promotable clones. It's just easier to reproduce with promotable clones. I'm not sure about the conditions to reproduce it with non-promotable clones, aside from the

So any solution that's based in

I'm concerned about that also. Everything that's been tested works fine, which is a good sign but not conclusive -- e.g., if nesting of resources or a lot of interwoven constraints end up producing unexpected behavior after a swap. With that being said, only anonymous clone instances are under discussion, so it shouldn't make any difference which instance ultimately goes to which node, as long as each node has the same number of instances allocated to it both before and after all the swapping. The swap takes place after allocation is finished, and we're not messing with anything that hasn't been allocated. We only consider swapping two instances if both of them have already been allocated, and

Swapping

My line of thinking is that since these instances are anonymous, if there is any risk in swapping their allocations after step 3, we can avoid that risk by swapping some set of their member variables. We're currently swapping only

With all of that said, I was originally interested in doing this in

It seems impossible to proactively allocate instances to their current nodes, if it can't already be done with the existing pre-allocation logic -- at least not without a major change. We can only do it retroactively, after all the instances have been allocated and we know which nodes are going to get how many instances. We don't know in advance whether an instance's current node is going to be allocated an instance when all is said and done.

Hmm, it's worth considering. It could be simpler. I'm not sure right away exactly how we would do that (what sort criteria get us the order we need, etc.), whether it could solve the whole problem, or (like you said) what side effects it would have either. I also wonder if there could be any negative interactions from that if there are fewer instances allocated post-transition than pre-transition. Maybe that part could be solved by ensuring all the allocated instances are contiguous at the front of the before sorting further.

But that ordering is effectively negated by the time we reach the situation that's described in this issue (i.e., if pre-allocation can't give the instances to their current nodes). And for non-promotable clones, I think we're finished after

Even for the promotable case, where we still have
Edit to the below: Nah, not looking super promising due to side effects. Scrap that.

Another half-baked idea: Fool Pacemaker into thinking that an anonymous instance that's been allocated to a node is already running on that node. That way it doesn't schedule a move to place it there. This would require manipulating

Take the instance off the running list of a node where it's currently running (and take the node off the instance's

I don't know if this would even work or not, and IMO it's harder to reason about than swapping the allocations. But perhaps it would end up with fewer variables needing reassignment.

Either way, if we swap anything at all, then we ought to take care about nested children (e.g., a clone where each instance is a group, whose children in turn are the natives).

I re-pushed with a slightly modified
Force-pushed from 6158a6a to f10da99.
I don't know how I forgot about `sort_clone_instance()`; that's the idea I had in mind, so it would be changing the criteria there. Whether that's feasible or not, I don't know. My gut feeling is it has to be possible to handle this at the sort and/or distribute steps when allocating, so the right instance number gets allocated to the right node. Even if we have to track new info in the instances' `pe_resource_t` (like we already do with `clone_name`).
I want it to be possible, and it very well may be. It just seems like the correct decision about which instance should go where depends on the final result of the allocation (which nodes get how many instances). For example, in the

It seems that determining the ultimately correct sort order in advance would require doing a dry-run of most of the allocation process. Which might work, but seems like a non-trivial amount of extra work for the scheduler.

I'll see what I can figure out. There may be a way to handle some or all of this via sort order, and it will come to me later. It might involve tracking new info like you mentioned, and/or using a different sort scheme for pre-allocation versus the main allocation. In the meantime I'm doing a little bit of refactoring of
BTW what is
That might not be out of the question. We do something similar when taking constraint dependencies' preferences into account (the "Rolling back" stuff). It would only be extra work in the rare cases where the first allocation is suboptimal.
I did #2327 while investigating. I hadn't finished what I wanted to do with it, but if you want it, grab it. Makes tracing less unpleasant.
`->count` is only used for clones, and is the number of instances of the particular clone being allocated (it's reset in `distribute_children()`). Notice `count` is not in `->details`, which means it's not shared between all uses of the node object. When we keep node lists (`data_set->nodes`, `rsc->allowed_nodes`, etc.), everything in `->details` is shared across all lists, and only the outer `pe_node_t` object is unique. The outer members can be different for each resource and are used during allocation.
One day we'll break backward compatibility so we can make bigger changes in the API. It would make more sense to have the `->details` object be the node object (`data_set->nodes` could probably use just these), and call the current node object a node allocation object or something like that.
Force-pushed from d2f7bb4 to 999549e.
The latest push contains a different approach. It doesn't modify the sort order, but it does eliminate the swapping. I considered various ways to modify the sort order to achieve what we want, and all of it seemed more complicated than this current approach.

Basically, during pre-allocation, if an instance prefers a different node compared to its current node, we make that node unavailable and try pre-allocation to the instance's current node again. We keep a counter of how many instances we need to reserve for nodes with higher weights. Every time an instance wants to go to a node other than its current node, we increment the

The

We have to do some bookkeeping of

I also added three refactor commits that helped make

I just noticed this conflicts with your WIP #2327, so one of our PRs would have to be updated. It wouldn't be hard to check

Also note that a block from
Force-pushed from 999549e to dbd1f7f.
Latest push is simply a rebase onto current master, to incorporate #2333.
FYI, I just merged a commit that changed crm_simulate's output formatting, so the scheduler regression test outputs here will need updating.
Force-pushed from 8d8b6f2 to cfab4f9.
I've pushed with some changes that address problems and sources of confusion that arose along the way. Let me know if you want some of these to go into a separate PR for faster merging.
I'm still thinking about the best way to let negative colocations influence the promotion order. Remember the case that started all this: an IPaddr2 is colocated with a bundle's promoted instance and also banned from a particular host (

However, the bundle shouldn't be prevented from promoting on

Through all this, we don't want to overwrite the

On a related note, we also have to determine how the negative dependent colocations should weigh against the positive colocations -- e.g., whether they should be given the same precedence or not.

Anyway, this should all be easier to reason about now with the refactors and with the positive colocations fixed.
Force-pushed from 0d0fd6d to 67e2748.
First thoughts, need to look more in-depth later
I'll wait on a full review until you're done with whatever you're planning (no rush).
This commit adds a new pcmk__rsc_node_e bitfield enum containing values for allocated, current, and pending. This indicates the criterion used to look up a resource's location (e.g., where is it now vs. where is it allocated?). After a compatibility break, native_location() could use these flags instead of an int. That would require making this enum public. Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Use the pcmk__rsc_node_e enum as a flags argument to the pe_node_attribute_calculated() function. Pass pcmk__rsc_node_current as the flags argument for existing calls, as this mimics the existing behavior. Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
lookup_promotion_score() should get a container's promotion score from the host to which it's allocated (if it's been allocated), rather than the host on which it's running. pe_node_attribute_calculated() now accepts a flags argument using the pcmk__rsc_node_e enum to specify whether the value should come from the allocated_to host or the first running_on host. Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Maximum allowed length of a stringified score is strlen("-INFINITY"). We can use this instead of hard-coded integer array sizes with score2char_stack(). Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Currently, bundle colocations aren't considered when determining promotion order. Bundle colocations apply to the bundle and its containers; the clone wrapper is not aware of them. This commit adds four tests for positive bundle colocations:

- bundle (promoted) with ip, where ip has an INFINITY location score
- bundle (promoted) with ip, where ip has a positive non-INFINITY location score
- ip with bundle (promoted), where ip has an INFINITY location score
- ip with bundle (promoted), where ip has a positive non-INFINITY location score

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Currently, bundle colocations aren't considered when determining promotion order. Bundle colocations apply to the bundle and its containers; the clone wrapper is not aware of them. This commit grabs the bundle's colocations into a working table. Then it iterates over the clone's children, finds each child's bundle node, and transfers node weights from the bundle node's host to the clone wrapper's copy of the bundle node object. This came up incidentally during an investigation of how bundles are processed when scheduling promotions. Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Currently, when a promoted instance of an anonymous clone is stopped and about to be recovered, the instances get shuffled. Instance 0, which is running in non-promoted state on another node, gets moved to the to-be-started-and-promoted node. Then instance 1 starts on the non-promoted node. This causes an unnecessary restart on the non-promoted node. This will be fixed in an upcoming commit. This commit adds six tests:

- clone-anon-recover (correct)
- clone-group-anon-recover (correct)
- promotable-anon-recover-promoted (incorrect)
- promotable-anon-recover-non-promoted (correct)
- promotable-group-anon-recover-promoted (questionable) - correct in that it doesn't shuffle instances, but maybe incorrect in that it promotes a different node after one monitor failure on the promoted node
- promotable-group-anon-recover-non-promoted (correct)

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Currently, there are some circumstances under which running anonymous clones may be shuffled around the cluster. The exact requirements to reproduce the issue are unclear. In the case of this test CIB, the issue disappears if:

- the colocation constraint between the Filesystem and clvm resources is removed, or
- certain INFINITY location constraints for the Filesystem resource are removed.

This will be fixed in an upcoming commit. Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Currently, anonymous clone instances may be shuffled under certain conditions, causing an unnecessary resource downtime when an instance is moved away from its current running node. For example, this can happen when a stopped promotable instance is scheduled to promote and the stickiness is lower than the promotion score (see the promotable-anon-recover-promoted test). Instance 0 gets allocated first and goes to the node that will be promoted, causing it to relocate if it's already running somewhere else. There are also some other corner cases that can trigger shuffling, like the one in the clone-anon-no-shuffle-constraints test. The fix is to allocate an instance to its current node during pre-allocation if that node is going to receive an instance at all. Previously, if instance:0 was running on node1 and got pre-allocated to node2 due to node2 having a higher weight, we backed out and immediately gave up on pre-allocating instance:0. Now, if instance:0 is running on node1 and gets pre-allocated to node2, we increment the "reserved" counter (to ensure we don't allocate the max number of instances without node2 getting one), and we make node2 unavailable. If allocated + reserved < max, we try pre-allocating instance:0 again with node2 out of the picture. This commit also updates several tests that contain unnecessary instance moves, and it updates scores files that changed due to the fix. Resolves: RHBZ#1931023 Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Thanks, it's been pretty busy for me lately :) New push to rebase on current master and update according to feedback:
I'm hoping to get a draft of the negative colocation stuff for promotion order taken care of by end of week. That's the last piece of the puzzle. I believe we need to consider negative preference scores of resources that depend on the promotable clone, without those dependent scores downright preventing a primary resource from promoting on a particular node.
Force-pushed from 67e2748 to ca476b6.
If you want to separate the first 5 commits, I can merge them first. I need to look at the rest more closely
```c
{
    const char *source;
    const pe_resource_t *container = NULL;
    const pe_node_t *host = NULL;
    const char *lookup_type = NULL;
```
Put some default in case no flags are set
```c
    host = container->allocated_to;
    lookup_type = "allocated";

} else if (pcmk_is_set(flags, pcmk__rsc_node_current)
```
The else if implies the flags can't be used together -- probably we want to have one take precedence, but use the other if nothing was found (when both flags are set)
```diff
@@ -37,6 +37,9 @@ extern bool pcmk__is_daemon;
 // Number of elements in a statically defined array
 #define PCMK__NELEM(a) ((int) (sizeof(a)/sizeof(a[0])) )
 
+/* Maximum length of a stringified score value */
+#define PCMK__SCORE_MAX_LEN (PCMK__NELEM(CRM_MINUS_INFINITY_S) - 1)
```
You can use sizeof instead of PCMK__NELEM -- sizeof a literal string is the string length + 1 (i.e. the memory required to hold it)
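Concretely, the suggestion is that the macro can apply `sizeof` to the string constant directly. A minimal sketch, with `CRM_MINUS_INFINITY_S` stubbed in for illustration (in Pacemaker it is the real stringified minimum score):

```c
/* Stub for illustration only. */
#define CRM_MINUS_INFINITY_S "-INFINITY"

/* sizeof applied to a string literal (an array of char) yields the
 * storage size, i.e. the string length plus the terminating NUL, so
 * no element-count helper such as PCMK__NELEM is needed. */

/* Maximum length of a stringified score value (NUL excluded) */
#define PCMK__SCORE_MAX_LEN (sizeof(CRM_MINUS_INFINITY_S) - 1)
```

Note this only works while the operand is a true array (a literal or `char[]`); once it decays to a `char *`, `sizeof` gives the pointer size instead.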
Looking at one more commit
if (bundled) {
    /* All explicit constraints are applied to the bundle, rather
     * than to the clone wrapper as with a regular promotable clone.
I notice Pacemaker Explained doesn't document how bundles should be used in constraints. I'm not sure offhand, but I think it might be possible to refer to the primitive inside a bundle in constraints, although it would be a bad idea (as with clones). We should be fine to ignore the possibility, though we should document the preferred usage at some point ...
I believe ordering constraints work correctly with bundles because they will be interpreted as referring to the pseudo-operations at the bundle level that the implicit resources are already ordered relative to.
Location preferences are handled in stage2() by the bundle's ->rsc_location(), which will recursively call ->rsc_location() for its implicit containers, IPs, and child using the bundle's location constraints.
I haven't checked how ticket constraints work with bundles. Hopefully they do ;)
For colocations involving bundles and clones, ->rsc_colocation_lh() is an assertion commented "Never called -- Instead we add the colocation constraints to the child and call from there". However I don't see anywhere that happens.

For primitives, ->allocate() calls ->rsc_colocation_lh() for each colocation constraint where the primitive is the dependent resource. For clones, it calls ->merge_weights() instead, which updates the clone's own ->allowed_nodes, but I don't think it directly does anything with children -- it just plays into distribute_children() called after that.

For bundles, ->allocate() will call the child clone's ->allocate() after calling distribute_children() and allocating the individual clone instances via the replica children.
So, I'm wondering if the root of the problem is that the bundle's ->allocate() should be doing more of what the clone's does before calling distribute_children(). In particular the clone child's ->allocate() will run through the clone child's colocation constraints, but I'm not sure any exist; the bundle ->allocate() maybe should run through its own colocation constraints. Alternatively, maybe we just need to copy the bundle's colocation constraints to the clone child. Then perhaps this resource switching stuff wouldn't be needed.
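One of the alternatives floated here, copying the bundle's colocation constraints to the clone child so the child's own allocation pass picks them up, might look roughly like this. The structs are minimal stand-ins (real Pacemaker resources keep constraints in GLists), and every name below is illustrative:

```c
#include <stdlib.h>
#include <stddef.h>

typedef struct colocation {
    struct resource *dependent; /* resource being placed */
    struct resource *primary;   /* resource it is placed relative to */
    int score;
    struct colocation *next;
} colocation_t;

typedef struct resource {
    colocation_t *rsc_cons;  /* colocations where this rsc is dependent */
    struct resource *child;  /* bundle's child clone (may be NULL) */
} resource_t;

/* Give the child clone a copy of each of the bundle's colocation
 * constraints, retargeted so the child is the dependent resource.
 * The child's allocation then applies the same placement preferences
 * the bundle was given, with no resource switching at colocation time. */
static void
copy_colocations_to_child(resource_t *bundle)
{
    if (bundle->child == NULL) {
        return;
    }
    for (colocation_t *c = bundle->rsc_cons; c != NULL; c = c->next) {
        colocation_t *copy = malloc(sizeof(*copy));

        *copy = *c;
        copy->dependent = bundle->child;
        copy->next = bundle->child->rsc_cons;
        bundle->child->rsc_cons = copy;
    }
}

/* Demo: a bundle with two colocations; returns how many copies the
 * child ends up with (the copies are leaked -- fine for a demo). */
static size_t
demo_copy(void)
{
    resource_t child = { NULL, NULL };
    colocation_t c1 = { NULL, NULL, 100, NULL };
    colocation_t c0 = { NULL, NULL, -100, &c1 };
    resource_t bundle = { &c0, &child };
    size_t n = 0;

    copy_colocations_to_child(&bundle);
    for (colocation_t *c = child.rsc_cons; c != NULL; c = c->next) {
        if (c->dependent == &child) {
            n++;
        }
    }
    return n;
}
```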
FYI, I just started the 2.1.0 release cycle. If you get back to this, feel free to keep it against master or resubmit against 2.1 as desired. EDIT: 2.1.0 has been released, so keep this against master.
Fix: libpacemaker: Don't shuffle anonymous clone instances
Currently, anonymous clone instances may be shuffled under certain
conditions, causing unnecessary resource downtime when an instance is
moved away from its current running node.
For example, this can happen when a stopped promotable instance is
scheduled to promote and the stickiness is lower than the promotion
score (see the promotable-anon-recover-promoted test). Instance 0 gets
allocated first and goes to the node that will be promoted, causing it
to relocate if it's already running somewhere else.
There are also some other corner cases that can trigger shuffling, like
the one in the clone-anon-no-shuffle-constraints test.
My solution is to wait until all instance allocations are finished, and
then to swap allocations to keep as many instances as possible on their
current running nodes.
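The post-allocation pass described above can be sketched as follows. This is an illustrative model with plain strings standing in for `pe_node_t`; the real patch works on the scheduler's data structures:

```c
#include <string.h>
#include <stddef.h>

typedef struct {
    const char *current;  /* node the instance runs on (NULL if stopped) */
    const char *assigned; /* node chosen by allocation */
} instance_t;

/* Once every instance has a node assigned, swap assignments between
 * instances so that as many as possible stay where they already run. */
static void
unshuffle(instance_t *inst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if ((inst[i].current == NULL)
            || ((inst[i].assigned != NULL)
                && (strcmp(inst[i].current, inst[i].assigned) == 0))) {
            continue; /* stopped, or already staying put */
        }
        /* Find an instance that was handed this one's current node,
         * and trade assignments so this instance stays in place. */
        for (size_t j = 0; j < n; j++) {
            if ((j != i) && (inst[j].assigned != NULL)
                && (strcmp(inst[j].assigned, inst[i].current) == 0)) {
                const char *tmp = inst[i].assigned;

                inst[i].assigned = inst[j].assigned;
                inst[j].assigned = tmp;
                break;
            }
        }
    }
}

/* Demo of the motivating scenario: instance 0 runs on node2 but was
 * allocated to node1 (the promotion target), while the stopped
 * instance 1 was allocated to node2.  After the swap, instance 0
 * stays on node2 and instance 1 starts on node1. */
static const char *
demo_assigned_after(size_t idx)
{
    static instance_t demo[2];

    demo[0] = (instance_t) { "node2", "node1" };
    demo[1] = (instance_t) { NULL,    "node2" };
    unshuffle(demo, 2);
    return demo[idx].assigned;
}
```

The key property is that swapping never changes *which* nodes host instances, only *which instance* lands on each node, so the allocation scores already computed remain valid.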
This commit also updates several tests that contain unnecessary instance
moves.
I'm opening this up as a draft first for feedback. Details in the
comments.
Resolves: RHBZ#1931023