
Fix: libpacemaker: Don't shuffle anonymous clone instances unnecessarily #2313

Closed
wants to merge 10 commits

Conversation


@nrwahl2 nrwahl2 commented Feb 26, 2021

Fix: libpacemaker: Don't shuffle anonymous clone instances

Currently, anonymous clone instances may be shuffled under certain
conditions, causing an unnecessary resource downtime when an instance is
moved away from its current running node.

For example, this can happen when a stopped promotable instance is
scheduled to promote and the stickiness is lower than the promotion
score (see the promotable-anon-recover-promoted test). Instance 0 gets
allocated first and goes to the node that will be promoted, causing it
to relocate if it's already running somewhere else.

There are also some other corner cases that can trigger shuffling, like
the one in the clone-anon-no-shuffle-constraints test.

My solution is to wait until all instance allocations are finished, and
then to swap allocations to keep as many instances as possible on their
current running nodes.

This commit also updates several tests that contain unnecessary instance
moves.

I'm opening this up as a draft first for feedback. Details in the
comments.

Resolves: RHBZ#1931023


nrwahl2 commented Feb 26, 2021

@kgaillot

Regression tests pass, including the new ones I created (but note that a
couple of existing ones with buggy behavior had to be updated). This
patch does what it needs to do in the cases I've tested.

I've got a few uncertainties though:

  • Do we need to swap anything else in swap_allocations()? I listed
    some attributes that may need to be swapped but that didn't cause any
    problems in testing. I'm particularly concerned about rsc_cons_lhs,
    rsc_cons, known_on, and (for clones of groups) children. Can you think
    of any cases for which any of these would need to be swapped?
  • Is rsc->allowed_nodes guaranteed to contain the same list of nodes
    (not identical pe_node_t objects but a list whose members have the same
    details)? For example, if instance:0 has node1->node2->node3 as its
    allowed_nodes list, must instance:1 have node1->node2->node3? I assumed
    the answer is "yes," and that simplified the swap implementation. I
    swapped allowed_nodes to swap all the weights in one fell swoop, in
    addition to anything else that might be needed from the objects in those
    lists.
  • Let me know if I missed any opportunities to use existing helper functions.
  • Do any special considerations for bundles come to mind? I only
    tested bundles insofar as they're covered by the existing regression
    tests.
  • Same question but for clones of groups.

Not directly related to this patch: I also noticed that
native_assign_node() increments count while native_deallocate() doesn't
decrement it. Wonder if that's caused any problems that we haven't
identified.

EDIT: I did do some testing with promotable clones of groups, and they behaved as expected even before the fix. Can't promise that would be true in all situations.


nrwahl2 commented Feb 26, 2021

The reason Jenkins failed is that there are now too many tests in cts/scheduler, and make distdir-am fails with "Argument list too long". I'm not sure how to approach that. I guess the make targets would require some tuning.

@nrwahl2 nrwahl2 force-pushed the nrwahl2-rhbz1931023 branch 4 times, most recently from 24d67c0 to 7b0737e Compare March 1, 2021 05:10

nrwahl2 commented Mar 1, 2021

The scope of this grew beyond what I had planned. The "Fix: libpacemaker: Don't shuffle anonymous clone instances" commit (currently 7b0737e) is my main goal.

The "Enclose dup_file_path() in SUPPORT_NAGIOS guard" commit (currently 82ac086) addresses a Jenkins build failure on Fedora 32.

The huge "Build, Test: Scheduler: Split up cts/scheduler directory" commit (currently 4becd5d) addresses an "Argument list too long" issue that caused a bunch of builds to fail, due to the increased number of tests in the cts/scheduler directory after I added the new ones. Most of the changes are just from moving files to new directories.

I'm not at all tied to this approach. If you'd rather fix the argument issue in a different way, I'm all ears :)


kgaillot commented Mar 1, 2021

Everything through splitting up the scheduler directory looks good, let's do those in a separate PR to get those fixes out quickly and be able to focus this one better.

Don't bother with the output formatting change, that will be taken care of at release time.


nrwahl2 commented Mar 1, 2021

I opened #2314 for those first ones. For the time being, I kept those commits in this PR too so that I don't have to update the changes to the test outputs. I can remove them later.

I removed the output formatting change and re-pushed.

ovn-dbs-bundle-2 (ocf:ovn:ovndb-servers): Slave controller-1
ip-172.17.1.87 (ocf:heartbeat:IPaddr2): Started controller-0
ip-172.17.1.87 (ocf:heartbeat:IPaddr2): Stopped
nrwahl2 (Contributor Author):

To save you some potential confusion during review: The ip-172.17.1.87 resource is now stopped because it's banned from node 2.

Contributor:

Hmm, I'm not sure that's acceptable. Selecting a node for the promotion of ovn-dbs-bundle should take into account colocation dependencies' preferences. Previously that was implicit in the ordering of the clone instances, since the first N nodes are chosen for promotion. I'm not sure what the resolution is; maybe we need to re-sort the nodes before setting instance roles.

@nrwahl2 nrwahl2 Mar 29, 2021

Good catch. I thought the behavior would have been the same on the existing scheduler if the ban constraint were for controller-0 instead, which is why this didn't bother me. But I'm testing now and that seems not to be the case.

The fact that ovn-dbs-bundle-0 is going to controller-0 and ovn-dbs-bundle-1 is going to controller-2 now is good and we want to keep that. It doesn't make sense to move instance 1. As you said, maybe there's something we can do in promotion_order().

@nrwahl2 nrwahl2 Mar 30, 2021

There may be missing constraints that didn't get passed down from the bundle to the promotable clone within it. I ran the below at the beginning of pcmk__set_instance_roles(). My understanding is that rsc_cons_lhs is the list of constraints where some other resource depends on this resource... and not the list of constraints for which this resource is on the left-hand side. I bet I'm not the only one who gets a headache from the lh/rh names.

(gdb) p rsc->id
$1 = 0x13e02e0 "ovn-dbs-bundle-master"
(gdb) p rsc->parent
$2 = (pe_resource_t *) 0x13d3530
(gdb) p rsc->parent->id
$3 = 0x13d3510 "ovn-dbs-bundle"
(gdb) p rsc->parent->rsc_cons
$4 = 0x0
(gdb) p rsc->parent->rsc_cons_lhs
$5 = 0x15340e0 = {0x153ead0}
(gdb) p rsc->rsc_cons
$6 = 0x0
(gdb) p rsc->rsc_cons_lhs
$7 = 0x0

Edit: Checked, and yeah, the ip-with-bundle colocation is attached to the bundle but not the promotable clone within it. Not sure if it should be attached to the clone or not, or whether this is relevant.

@nrwahl2 nrwahl2 Mar 30, 2021

Okay, scratch the constraints. The direct cause of this unwanted behavior isn't that the colocation constraints aren't being honored. The direct cause seems to be that ovndb_servers:0 is in Stopped state when we begin. This causes a few issues.

First, lookup_promotion_score() is unable to get the promotion score for ovndb_servers:0.

(pe_node_attribute_calculated)  trace: ovndb_servers:0: Not looking for master-ovndb_servers:0 on the container host: ovn-dbs-bundle-podman-0 is inactive

That's because pe_node_attribute_calculated() checks the first node in the container's running_on list. If we modify it as follows:

    //if (node->details->remote_rsc->container->running_on != NULL) {
        //pe_node_t *host = node->details->remote_rsc->container->running_on->data;
    if (node->details->remote_rsc->container->allocated_to != NULL) {
        pe_node_t *host = node->details->remote_rsc->container->allocated_to;

Then we're able to get the promotion score for ovndb_servers:0. (Whether the above will have any unwanted side effects, I don't know yet. We could also check allocated_to and then running_on, or vice versa.)

(pe_node_attribute_calculated)  trace: ovndb_servers:0: Looking for master-ovndb_servers:0 on the allocated container host controller-0
(promotion_score)       trace: promotion score for ovndb_servers:0 on ovn-dbs-bundle-0 = <null>
(pe_node_attribute_calculated)  trace: ovndb_servers:0: Looking for master-ovndb_servers on the allocated container host controller-0
(promotion_score)       trace: stripped promotion score for ovndb_servers on ovn-dbs-bundle-0 = 5

So now we have the promotion score. Our next issue is that sort_promotable_instance() sorts based on the current role.

    role1 = resource1->fns->state(resource1, TRUE);
    role2 = resource2->fns->state(resource2, TRUE);

    if (role1 > role2) {
        crm_trace("%s %c %s (role)", resource1->id, '<', resource2->id);
        return -1;

    } else if (role1 < role2) {
        crm_trace("%s %c %s (role)", resource1->id, '>', resource2->id);
        return 1;
    }

If we modify that to sort by next role (again, haven't checked whether that has any unwanted side effects), then we encounter a third issue: ovndb_servers:0 is inactive, so it loses in the sort order to the active instances.


On a positive note: If the "sort by next role instead of current role" change is in place, then on the second transition, the correct node gets promoted and the IP starts. It's apparently able to do this because at that point, ovndb_servers:0 is already in started state.

# crm_simulate -S -x cts/scheduler/xml/cancel-behind-moving-remote.xml -O /tmp/next.xml >/dev/null
# crm_simulate -R -x /tmp/next.xml
...
Transition Summary:
 * Start      rabbitmq-bundle-1     (                     controller-0 )   due to unrunnable rabbitmq-bundle-podman-1 start (blocked)
 * Start      rabbitmq:1            (                rabbitmq-bundle-1 )   due to unrunnable rabbitmq-bundle-podman-1 start (blocked)
 * Promote    ovndb_servers:0       ( Slave -> Master ovn-dbs-bundle-0 )  
 * Demote     ovndb_servers:1       ( Master -> Slave ovn-dbs-bundle-1 )  
 * Start      ip-172.17.1.87        (                     controller-0 )

So colocation constraints are being accounted for.


To summarize, I feel like it would be fine to check the allocated_to node in pe_node_attribute_calculated() as a simple fix for looking up the promotion score. But the remaining issue is "how do we sort the promotable instances so that a Stopped instance has a fair chance to be promoted on the first transition, without making the sort criteria problematic for other situations?"

P.S. The reason this worked before the fix is that the stopped instance (ovndb_servers:0) was allocated to a node that didn't need to be promoted. Now, the stopped instance is allocated to controller-0, which is the node that needs to be promoted so that the IP can start. The instance on controller-2 gets promoted instead, and the IP can't start there.

@kgaillot kgaillot Mar 30, 2021

My understanding is that rsc_cons_lhs is the list of constraints where some other resource depends on this resource... and not the list of constraints for which this resource is on the left-hand side. I bet I'm not the only one who gets a headache from the lh/rh names.

I know right! We can't change struct member names until we're ready to break backward API compatibility, but I have it on my to-do list for when that happens: s/rsc_cons/this_with/ and s/rsc_cons_lhs/with_this/

First, lookup_promotion_score() is unable to get the promotion score for ovndb_servers:0.

(pe_node_attribute_calculated)  trace: ovndb_servers:0: Not looking for master-ovndb_servers:0 on the container host: ovn-dbs-bundle-podman-0 is inactive

That's because pe_node_attribute_calculated() checks the first node in the container's running_on list. If we modify it as follows:

    //if (node->details->remote_rsc->container->running_on != NULL) {
        //pe_node_t *host = node->details->remote_rsc->container->running_on->data;
    if (node->details->remote_rsc->container->allocated_to != NULL) {
        pe_node_t *host = node->details->remote_rsc->container->allocated_to;

Then we're able to get the promotion score for ovndb_servers:0. (Whether the above will have any unwanted side effects, I don't know yet. We could also check allocated_to and then running_on, or vice versa.)

pe_node_attribute_calculated() is different from pe_node_attribute_raw() only for resources in a bundle where container-attribute-target has been set to "host" (indicating that the value for the bundle instance's host should be used rather than any value set for the bundle instance node itself).

Besides promotion_score(), calculated attributes are used for location rules where the rule score is taken from a node attribute, for example to say a containerized resource has to run on a host with a node attribute indicating it has access to necessary shared storage that is exported into the container.

Location rules feed into the allocation process, so we don't have ->allocated_to at that point.

Offhand, it sounds reasonable to use ->allocated_to if it's non-NULL and ->running_on otherwise -- it should work in practice for these two purposes. Alternatively we could pass a bool to pe_node_attribute_calculated() indicating which one we want. Hopefully existing regression tests are sufficient to rule out side effects.

Contributor:

Offhand, it sounds reasonable to use ->allocated_to if it's non-NULL and ->running_on otherwise -- it should work in practice for these two purposes.

It's worth pointing out that pcmk__set_instance_roles() calls promotion_score() with the ->allocated_to node (via ->location(..., FALSE)), so it's reasonable to assume the host's ->allocated_to is the right choice for that context.

Contributor:

Our next issue is that sort_promotable_instance() sorts based on the current role.

    role1 = resource1->fns->state(resource1, TRUE);
    role2 = resource2->fns->state(resource2, TRUE);

    if (role1 > role2) {
        crm_trace("%s %c %s (role)", resource1->id, '<', resource2->id);
        return -1;

    } else if (role1 < role2) {
        crm_trace("%s %c %s (role)", resource1->id, '>', resource2->id);
        return 1;
    }

If we modify that to sort by next role (again, haven't checked whether that has any unwanted side effects),

The point of promotion_order() is to decide which instances to promote, so at this point none of the children will have a next_role of promoted. Using the current role allows current promoted instances to be favored to keep their role. Also, preferring already-started instances over to-be-started instances probably has some value in terms of getting a promoted instance more quickly.

Before comparing roles, sort_promotable_instance() compares rsc->sort_index (via sort_rsc_index()). It feels like we need either a new criterion between those two, or to adjust ->sort_index in promotion_order().

@nrwahl2 nrwahl2 Mar 31, 2021

The point of promotion_order() is to decide which instances to promote, so at this point none of the children will have a next_role of promoted. Using the current role allows current promoted instances to be favored to keep their role. Also, preferring already-started instances over to-be-started instances probably has some value in terms of getting a promoted instance more quickly.

All of that reasoning makes sense.

Before comparing roles, sort_promotable_instance() compares rsc->sort_index (via sort_rsc_index()). It feels like we need either a new criterion between those two, or to adjust ->sort_index in promotion_order().

Agreed. And I wasn't looking closely enough last night. This move that I mentioned on the second transition when the role comparison uses next role:

 * Promote    ovndb_servers:0       ( Slave -> Master ovn-dbs-bundle-0 )  

happens not because colocation constraints are being honored, but rather because the preferences are all equal and "0" sorts before "2" by default.

I think (for now I only have time to skim) that promotion_order() already provides the logic we need.

If that all works properly already, then the issue is that ovn-dbs-bundle has the constraints but ovn-dbs-bundle-master does not (as noted earlier). We'd need to pull the colocations down to the clone from the bundle. This might be a pretty simple thing to do. I'm just not familiar with it yet. Would also need to determine whether it needs to be done earlier (in pcmk__bundle_allocate()); only here (in the promotion ordering); or somewhere else. I would think in allocate.

nrwahl2 (Contributor Author):

I had to take a detour and get a better grasp on how bundles and colocation work, before I could add what's most likely going to be a simple fix. I've got a version that appears to work now. It needs some cleanup. Might have something by tomorrow.

@nrwahl2 nrwahl2 force-pushed the nrwahl2-rhbz1931023 branch 2 times, most recently from 1379563 to 7938312 Compare March 3, 2021 02:12
@kgaillot

Excellent analysis. The problem arises because:

  1. A clone's rsc->children are numbered sequentially when they are created (by clone_unpack()).
  2. Anonymous clone instances that are active (or inactive with a pending action) are assigned one of the rsc->children instances (and thus an instance number) and their current node according to the order the node history is encountered in the status section (by unpack_lrm_resource() -> unpack_find_resource() -> find_anonymous_clone()).
  3. Anonymous clone instances (and thus instance numbers) are allocated to their next node according to node scores (lowest instance number to highest node score) (by pcmk__clone_allocate() -> distribute_children() -> allocate_instance()).
  4. The first N instances of a promotable clone are chosen for promotion (by pcmk__set_instance_roles()).

1&2 are more or less a necessity, and 3&4 work together to ensure that the priorities defined by scores are respected.

I don't think we can swap allocations after 3 without risking all sorts of breakage.

This is a half-baked idea at this point, but I'm wondering if we can reorder rsc->children for some steps. Currently the order of rsc->children always matches the numeric order of instances, but I'm not sure that's important to anything -- maybe we could sort rsc->children at some step, and then use the first N children rather than the first N instance numbers. In other words, currently stateful:0 is allocated to node2 because it is the first member of rsc->children and node2 has the highest node score, but what if we could sort rsc->children so that stateful:2 is first, so that gets allocated to node2 instead. The questions are how should the sort be done, and will changing the order affect anything else.


nrwahl2 commented Mar 21, 2021

Pardon the length:

The first N instances of a promotable clone are chosen for promotion (by pcmk__set_instance_roles()).

Yep, although I do want to emphasize that this issue doesn't occur only with promotable clones. It's just easier to reproduce with promotable clones. I'm not sure about the conditions to reproduce it with non-promotable clones, aside from the clone-anon-no-shuffle-constraints test that I provided.

So any solution that's based in pcmk_sched_promotable.c won't solve the entire problem.


I don't think we can swap allocations after 3 without risking all sorts of breakage.

I'm concerned about that also. Everything that's been tested works fine, which is a good sign but not conclusive -- e.g., nesting of resources or a lot of interwoven constraints could end up producing unexpected behavior after a swap.

With that being said, only anonymous clone instances are under discussion, so it shouldn't make any difference which instance ultimately goes to which node, as long as each node has the same number of instances allocated to it both before and after all the swapping. The swap takes place after allocation is finished, and we're not messing with anything that hasn't been allocated. We only consider swapping two instances if both of them have already been allocated, and rsc1 has been allocated to rsc2's current node (or vice versa).

Swapping allowed_nodes swaps the node weights for each instance, and I would expect that each allocated instance must have a copy of the same set of nodes in its allowed_nodes member. If not, then we could iterate over allowed_nodes for each instance and swap if the nodes match -- but that's more complicated than swapping the whole allowed_nodes object. Swapping the node weights ought to keep the scores in alignment with the node assignments, when we call assign_node() next, which wipes out the old allocation from both the instance and the node and then sets the new one. To your point ("3&4 work together to ensure that the priorities defined by scores are respected"), swapping the node weights is what prevents step 4 from breaking (in pcmk__set_instance_roles()). Before I swapped node weights, the wrong node was getting the promoted instance.

My line of thinking is that since these instances are anonymous, if there is any risk in swapping their allocations after step 3, we can avoid that risk by swapping some set of their member variables. We're currently swapping only allowed_nodes, but I hope that any other issues could be remedied by swapping other objects too. Still, this is the main thing that worries me -- are we not swapping some member that we should be swapping? Conversely, it may be dangerous to swap members that don't clearly need to be swapped.

With all of that said, I was originally interested in doing this in allocate_instance() so that no swapping is required later. I can take another look at that if you think it's worthwhile and possible. As far as I could tell it wasn't.

It seems impossible to proactively allocate instances to their current nodes, if it can't already be done with the existing pre-allocation logic -- at least not without a major change. We can only do it retroactively, after all the instances have been allocated and we know which nodes are going to get how many instances. We don't know in advance whether an instance's current node is going to be allocated an instance when all is said and done.


This is a half-baked idea at this point, but I'm wondering if we can reorder rsc->children for some steps. ...

Hmm, it's worth considering. It could be simpler. I'm not sure right away exactly how we would do that (what sort criteria get us the order we need, etc.), whether it could solve the whole problem, or (like you said) what side effects it would have either.

I also wonder if there could be any negative interactions from that if there are fewer instances allocated post-transition than pre-transition. Maybe that part could be solved by ensuring all the allocated instances are contiguous at the front of the list before sorting further.

sort_clone_instance(), which gets called before distribute_children(), already tries to do something like this:

    /* allocation order:
     *  - active instances
     *  - instances running on nodes with the least copies
     *  - active instances on nodes that can't support them or are to be fenced
     *  - failed instances
     *  - inactive instances
     */

But that ordering is effectively negated by the time we reach the situation that's described in this issue (i.e., if pre-allocation can't give the instances to their current nodes). And for non-promotable clones, I think we're finished after distribute_children(). So I'm not sure we can fix the non-promotable case by re-ordering children.

Even for the promotable case, where we still have pcmk__set_instance_roles() to work with, I'm concerned that we would ultimately face a "node weight" issue again despite the re-ordering, unless we swap the node weights like I'm doing in swap_allocations().


nrwahl2 commented Mar 21, 2021

Edit to the below: Nah, not looking super promising due to side effects. Scrap that.

Another half-baked idea: Fool Pacemaker into thinking that an anonymous instance that's been allocated to a node is already running on that node. That way it doesn't schedule a move to place it there. This would require manipulating rsc->running_on and node->details->running_rsc.

Take the instance off the running list of a node where it's currently running (and take the node off the instance's running_on), and push it onto the running list of the node it's been allocated to. The trick would be ensuring that every node ends up with the same number of instances "running" there both before and after the swaps.

I don't know if this would even work or not, and IMO it's harder to reason about than swapping the allocations. But perhaps it would end up with fewer variables needing reassignment.


Either way, if we swap anything at all, then we ought to take care with nested children (e.g., a clone where each instance is a group, whose children in turn are the natives). I re-pushed with a slightly modified swap_allocations() function that crawls through the tree of children if any exist, and swaps the allowed_nodes for each pair of children.

@nrwahl2 nrwahl2 force-pushed the nrwahl2-rhbz1931023 branch 2 times, most recently from 6158a6a to f10da99 Compare March 22, 2021 01:11
@kgaillot

I don't know how I forgot about sort_clone_instance(); that's the idea I had in mind, so it would be changing the criteria there. Whether that's feasible or not I don't know.

My gut feeling is it has to be possible to handle this at the sort and/or distribute steps when allocating, so the right instance number gets allocated to the right node. Even if we have to track new info in the instances' pe_resource_t (like we already do with clone_name).


nrwahl2 commented Mar 23, 2021

My gut feeling is it has to be possible to handle this at the sort and/or distribute steps when allocating, so the right instance number gets allocated to the right node. Even if we have to track new info in the instances' pe_resource_t (like we already do with clone_name).

I want it to be possible, and it very well may be. It just seems like the correct decision about which instance should go where depends on the final result of the allocation (which nodes get how many instances).

For example, in the promotable-anon-recover-promoted test, I don't think we know that stateful:2 should be the instance to go to node2 (where stateful is currently stopped) until we've determined that node1 and node3-rem are each going to receive one instance. Then we know that stateful:0 and stateful:1, which are already running on those two nodes, should stay put. We shouldn't send stateful:0 to node2 based on promotion score. But in the absence of that knowledge, stateful:0 will go to node2 because it gets allocated first and node2 has the highest score.

It seems that determining the ultimately correct sort order in advance would require doing a dry-run of most of the allocation process. Which might work but seems like a non-trivial amount of extra work for the scheduler.

I'll see what I can figure out. There may be a way to handle some or all of this via sort order, and it will come to me later. It might involve tracking new info like you mentioned, and/or using a different sort scheme for pre-allocation versus the main allocation. In the meantime I'm doing a little bit of refactoring of sort_clone_instance for readability, for my own benefit. I'm also not totally convinced that it's 100% adhering to the allocation order that it advertises in the comments...


nrwahl2 commented Mar 23, 2021

BTW what is node->count meant to keep track of? It seems to be used similarly to node->details->num_resources, but not quite.

@kgaillot

It seems that determining the ultimately correct sort order in advance would require doing a dry-run of most of the allocation process. Which might work but seems like a non-trivial amount of extra work for the scheduler.

That might not be out of the question. We do something similar when taking constraint dependencies' preferences into account (the "Rolling back" stuff). It would only be extra work in the rare cases where the first allocation is suboptimal.

I'll see what I can figure out. There may be a way to handle some or all of this via sort order, and it will come to me later. It might involve tracking new info like you mentioned, and/or using a different sort scheme for pre-allocation versus the main allocation. In the meantime I'm doing a little bit of refactoring of sort_clone_instance for readability, for my own benefit. I'm also not totally convinced that it's 100% adhering to the allocation order that it advertises in the comments...

I did #2327 while investigating. I hadn't finished what I wanted to do with it but if you want it, grab it. Makes tracing less unpleasant.

@kgaillot

BTW what is node->count meant to keep track of? It seems to be used similarly to node->details->num_resources, but not quite.

->count is only used for clones, and is the number of instances of the particular clone being allocated (it's reset in distribute_children()). Notice count is not in ->details, which means it's not shared between all uses of the node object. When we keep node lists (data_set->nodes, rsc->allowed_nodes, etc.), everything in ->details is shared across all lists, and only the outer pe_node_t object is unique. The outer members can be different for each resource and are used during allocation.


One day we'll break backward compatibility so we can make bigger changes in the API. It would make more sense to have the ->details object be the node object (data_set->nodes could probably use just these), and call the current node object a node allocation object or something like that.

@nrwahl2 nrwahl2 force-pushed the nrwahl2-rhbz1931023 branch 2 times, most recently from d2f7bb4 to 999549e on March 29, 2021 08:57
@nrwahl2
Contributor Author

nrwahl2 commented Mar 29, 2021

The latest push contains a different approach. It doesn't modify the sort order but it does eliminate the swapping. I considered various ways to modify the sort order to achieve what we want, and all of it seemed more complicated than this current approach.

Basically, during pre-allocation, if an instance prefers a different node compared to its current node, we make that node unavailable and try pre-allocation to the instance's current node again. We keep a counter of how many instances we need to reserve for nodes with higher weights. Every time an instance wants to go to a node other than its current node, we increment the reserved counter.

The reserved counter gets reset for each instance that we pre-allocate. We stop trying to pre-allocate an instance to its current node if allocated plus reserved reaches max -- allocating that instance to its current node is of too low a priority.

We have to do some bookkeeping of allowed_nodes along the way, since we have to modify an instance's allowed_nodes to tell the allocator which nodes should be unavailable. I try to explain in the comments as needed.

I also added three refactor commits that made sort_clone_instance() more readable while I was familiarizing myself with it.


I just noticed this conflicts with your WIP #2327, so one of our PRs would have to be updated. It wouldn't be hard to check rsc->allocated_to after an allocation instead of getting the chosen node from the return value.

Also note that a block from allocate_instance() has been moved to distribute_children as part of my current approach, so that we have direct access to allocated and max.

@nrwahl2
Contributor Author

nrwahl2 commented Mar 29, 2021

Latest push is simply a rebase onto current master, to incorporate #2333.

@kgaillot
Contributor

kgaillot commented Apr 5, 2021

FYI, I just merged a commit that changed crm_simulate's output formatting, so the scheduler regression test outputs here will need updating

@nrwahl2
Contributor Author

nrwahl2 commented Apr 6, 2021

I've pushed with some changes that address problems and sources of confusion that arose along the way. Let me know if you want some of these to go into a separate PR for faster merging.

  • Rebased on current master and updated regression test outputs.
  • Removed the extraneous, whitespace-only regression test output changes that were accidentally added on the last push.
  • There's a new pcmk__location_e bitfield enum. This is currently private and contains values for allocated and running. The idea is to make it public at a compatibility break and use it in native_rsc_location() with the addition of a flag for pending.
  • pe_node_attribute_calculated() accepts a pcmk__location_e enum to determine whether it looks up the container's allocated host or current host. promotion_score() gets called once in apply_master_prefs() (pre-allocation) and once in pcmk__set_instance_roles(), so lookup_promotion_score() checks the provisional state to determine which location enum to use.
  • The apply_master_prefs() function has been renamed to pcmk__apply_promotion_prefs(). It's internal API.
  • promotion_order() is broken into helper functions for readability. This is especially important considering other changes that needed to be introduced, which add more ordering code.
  • Fixed an issue where positive bundle colocations (in both directions) are currently ignored when determining promotion order.

I'm still thinking about the best way to let negative colocations influence the promotion order. Remember the case that started all this: an IPaddr2 is colocated with a bundle's promoted instance and also banned from a particular host (controller-2). We have to tell the scheduler to prefer not to promote the bundle on controller-2. (I haven't checked whether the same problem affects regular promotable clones right now.)

However, the bundle shouldn't be prevented from promoting on controller-2, since the IPaddr2 is the dependent resource. It should only be discouraged from promoting there. It should be allowed to start on controller-2 if there is no better node with respect to dependent colocated resources' preferences.

Through all this, we don't want to overwrite the sort_index, because it holds the "final" preference based on promotion score, promoted role, and positive colocations. I think we want to be able to fall back on that. So we may need to store an additional piece of data related to the negative colocations, or use the node weights directly for an initial check and then fall back on sort_index if the weights are equal (or something similar).

On a related note, we also have to determine how the negative dependent colocations should weigh against the positive colocations -- e.g., whether they should be given the same precedence or not.

Anyway, this should all be easier to reason about now with the refactors and with the positive colocations fixed.

@nrwahl2 nrwahl2 force-pushed the nrwahl2-rhbz1931023 branch 4 times, most recently from 0d0fd6d to 67e2748 on April 6, 2021 09:17
Contributor

@kgaillot kgaillot left a comment


First thoughts, need to look more in-depth later

Review comments (since resolved or marked outdated) on:
  • include/crm/pengine/internal.h
  • include/pcmki/pcmki_sched_allocate.h
  • lib/pacemaker/pcmk_sched_promotable.c
@kgaillot
Contributor

I'll wait on a full review until you're done with whatever you're planning (no rush)

This commit adds a new pcmk__rsc_node_e bitfield enum containing values
for allocated, current, and pending. This indicates the criterion used
to look up a resource's location (e.g., where is it now vs. where is it
allocated?).

After a compatibility break, native_location() could use these flags
instead of an int. That would require making this enum public.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Use the pcmk__rsc_node_e enum as a flags argument to the
pe_node_attribute_calculated() function. Pass pcmk__rsc_node_current as
the flags argument for existing calls, as this mimics the existing
behavior.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
lookup_promotion_score() should get a container's promotion score from
the host to which it's allocated (if it's been allocated), rather than
the host on which it's running.

pe_node_attribute_calculated() now accepts a flags argument using the
pcmk__rsc_node_e enum to specify whether the value should come from the
allocated_to host or the first running_on host.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
The maximum allowed length of a stringified score is
strlen("-INFINITY"). We can use this constant instead of hard-coded
integer array sizes when calling score2char_stack().

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Currently, bundle colocations aren't considered when determining
promotion order. Bundle colocations apply to the bundle and its
containers; the clone wrapper is not aware of them.

This commit adds four tests for positive bundle colocations:
  - bundle (promoted) with ip, where ip has an INFINITY location score
  - bundle (promoted) with ip, where ip has a positive non-INFINITY
    location score
  - ip with bundle (promoted), where ip has an INFINITY location score
  - ip with bundle (promoted), where ip has a positive non-INFINITY
    location score

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Currently, bundle colocations aren't considered when determining
promotion order. Bundle colocations apply to the bundle and its
containers; the clone wrapper is not aware of them.

This commit grabs the bundle's colocations into a working table. Then it
iterates over the clone's children, finds each child's bundle node, and
transfers node weights from the bundle node's host to the clone
wrapper's copy of the bundle node object.

This came up incidentally during an investigation of how bundles are
processed when scheduling promotions.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Currently, when a promoted instance of an anonymous clone is stopped and
about to be recovered, the instances get shuffled. Instance 0, which is
running in non-promoted state on another node, gets moved to the
to-be-started-and-promoted node. Then instance 1 starts on the
non-promoted node. This causes an unnecessary restart on the
non-promoted node.

This will be fixed in an upcoming commit.

This commit adds six tests:
  - clone-anon-recover (correct)
  - clone-group-anon-recover (correct)
  - promotable-anon-recover-promoted (incorrect)
  - promotable-anon-recover-non-promoted (correct)
  - promotable-group-anon-recover-promoted (questionable)
    - correct in that it doesn't shuffle instances, but maybe incorrect
      in that it promotes a different node after one monitor failure on
      the promoted node
  - promotable-group-anon-recover-non-promoted (correct)

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Currently, there are some circumstances under which running anonymous
clones may be shuffled around the cluster. The exact requirements to
reproduce the issue are unclear.

In the case of this test CIB, the issue disappears if:
  - the colocation constraint between the Filesystem and clvm resources
    is removed, or
  - certain INFINITY location constraints for the Filesystem resource
    are removed.

This will be fixed in an upcoming commit.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Currently, anonymous clone instances may be shuffled under certain
conditions, causing an unnecessary resource downtime when an instance is
moved away from its current running node.

For example, this can happen when a stopped promotable instance is
scheduled to promote and the stickiness is lower than the promotion
score (see the promotable-anon-recover-promoted test). Instance 0 gets
allocated first and goes to the node that will be promoted, causing it
to relocate if it's already running somewhere else.

There are also some other corner cases that can trigger shuffling, like
the one in the clone-anon-no-shuffle-constraints test.

The fix is to allocate an instance to its current node during
pre-allocation if that node is going to receive an instance at all.

Previously, if instance:0 was running on node1 and got pre-allocated to
node2 due to node2 having a higher weight, we backed out and immediately
gave up on pre-allocating instance:0.

Now, if instance:0 is running on node1 and gets pre-allocated to node2,
we increment the "reserved" counter (to ensure we don't allocate the max
number of instances without node2 getting one), and we make node2
unavailable. If allocated + reserved < max, we try pre-allocating
instance:0 again with node2 out of the picture.

This commit also updates several tests that contain unnecessary instance
moves, and it updates scores files that changed due to the fix.

Resolves: RHBZ#1931023

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
@nrwahl2
Contributor Author

nrwahl2 commented Apr 21, 2021

Thanks, it's been pretty busy for me lately :)

New push to rebase on current master and update according to feedback:

  • Incorporated all the master -> promoted changes.
  • Changed pcmk__location_e to pcmk__rsc_node_e and updated the value names accordingly. Broke the addition of the enum into its own commit, and broke the use of the enum in pe_node_attribute_calculated into its own commit.
  • Moved score array declaration outside the for loop in apply_sort_indices_to_wrapper_weights().
  • Introduced PCMK__SCORE_MAX_LEN (defined as 9, via "-INFINITY") and used it in place of arbitrary-length arrays for score2char_stack(). I wasn't sure whether to put this in crm.h (with the INFINITY macros), util.h (with score2char_stack()), or internal.h. A case could be made for either of the public headers, but I put it in internal.h for now because we don't need it externally. I'm happy to move it though.
  • Made pointer and integer variables const where appropriate in newly added functions.
  • Fixed the work_nodes scheme in apply_promoted_dependent_colocations() and apply_promoted_primary_colocations() (I hope ;) ), by instead making and restoring from a backup table in the case of bundles.

I'm hoping to get a draft of the negative colocation stuff for promotion order taken care of by end of week. That's the last piece of the puzzle. I believe we need to consider negative preference scores of resources that depend on the promotable clone, without those dependent scores downright preventing a primary resource from promoting on a particular node.

Contributor

@kgaillot kgaillot left a comment


If you want to separate the first 5 commits, I can merge them first. I need to look at the rest more closely

{
    const char *source;
    const pe_resource_t *container = NULL;
    const pe_node_t *host = NULL;
    const char *lookup_type = NULL;
Contributor


Put some default in case no flags are set

host = container->allocated_to;
lookup_type = "allocated";

} else if (pcmk_is_set(flags, pcmk__rsc_node_current)
Contributor


The else if implies the flags can't be used together -- probably we want to have one take precedence, but use the other if nothing was found (when both flags are set)

@@ -37,6 +37,9 @@ extern bool pcmk__is_daemon;
// Number of elements in a statically defined array
#define PCMK__NELEM(a) ((int) (sizeof(a)/sizeof(a[0])) )

/* Maximum length of a stringified score value */
#define PCMK__SCORE_MAX_LEN (PCMK__NELEM(CRM_MINUS_INFINITY_S) - 1)
Contributor


You can use sizeof instead of PCMK__NELEM -- sizeof a literal string is the string length + 1 (i.e. the memory required to hold it)

Contributor

@kgaillot kgaillot left a comment


Looking at one more commit


if (bundled) {
/* All explicit constraints are applied to the bundle, rather
* than to the clone wrapper as with a regular promotable clone.
Contributor


I notice Pacemaker Explained doesn't document how bundles should be used in constraints. I'm not sure offhand, but I think it might be possible to refer to the primitive inside a bundle in constraints, although it would be a bad idea (as with clones). We should be fine to ignore the possibility, though we should document the preferred usage at some point ...

I believe ordering constraints work correctly with bundles because they will be interpreted as referring to the pseudo-operations at the bundle level that the implicit resources are already ordered relative to.

Location preferences are handled in stage2() by the bundle's ->rsc_location(), which will recursively call ->rsc_location() for its implicit containers, IPs, and child using the bundle's location constraints.

I haven't checked how ticket constraints work with bundles. Hopefully they do ;)

For colocations involving bundles and clones, ->rsc_colocation_lh() is an assertion commented "Never called -- Instead we add the colocation constraints to the child and call from there". However I don't see anywhere that happens. For primitives, ->allocate() calls ->rsc_colocation_lh() for each colocation constraint where the primitive is the dependent resource. For clones, it calls ->merge_weights() instead, which updates the clone's own ->allowed_nodes, but I don't think it directly does anything with children -- it just plays into distribute_children() called after that. For bundles, ->allocate() will call the child clone's ->allocate() after calling distribute_children() and allocating the individual clone instances via the replica children.

So, I'm wondering if the root of the problem is that the bundle's ->allocate() should be doing more of what the clone's does before calling distribute_children(). In particular the clone child's ->allocate() will run through the clone child's colocation constraints, but I'm not sure any exist; the bundle ->allocate() maybe should run through its own colocation constraints. Alternatively, maybe we just need to copy the bundle's colocation constraints to the clone child. Then perhaps this resource switching stuff wouldn't be needed.

@kgaillot
Contributor

kgaillot commented Apr 30, 2021

FYI, I just started the 2.1.0 release cycle. If you get back to this, feel free to keep it against master or resubmit against 2.1 as desired.

EDIT: 2.1.0 has been released, so keep this against master.
