Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add op_defaults and rsc_defaults expressions #2045

Merged
merged 17 commits into from
May 20, 2020

Conversation

clumens
Copy link
Contributor

@clumens clumens commented Apr 27, 2020

This is an extremely early version of this patch set. I've done very minimal testing, including not really any on a live cluster. I just wanted to start getting comments to see if I am on the right track.

With the following XML blurb:

      <primitive class="stonith" id="Fencing" type="fence_xvm">
        <instance_attributes id="Fencing-instance_attributes">
          <nvpair id="Fencing-instance_attributes-ip_family" name="ip_family" value="ipv4"/>
        </instance_attributes>
        <operations>
          <op id="Fencing-monitor-interval-60s" interval="60s" name="monitor"/>
        </operations>
      </primitive>
      <primitive class="ocf" id="dummy" provider="pacemaker" type="Dummy">
        <instance_attributes id="dummy-instance_attributes">
          <nvpair id="dummy-instance_attributes-op_sleep" name="op_sleep" value="6"/>
        </instance_attributes>
        <operations>
          <op id="dummy-migrate_from-interval-0s" interval="0s" name="migrate_from" timeout="20s"/>
          <op id="dummy-migrate_to-interval-0s" interval="0s" name="migrate_to" timeout="20s"/>
          <op id="dummy-monitor-interval-60s" interval="60s" name="monitor" on-fail="stop"/>
          <op id="dummy-reload-interval-0s" interval="0s" name="reload" timeout="20s"/>
          <op id="dummy-start-interval-0s" interval="0s" name="start" timeout="20s"/>
          <op id="dummy-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
        </operations>
      </primitive>
    </resources>
    <constraints/>
    <tags/>
    <op_defaults>
      <meta_attributes id="op_defaults-options">
        <nvpair id="op_defaults-options-timeout" name="timeout" value="5s"/>
      </meta_attributes>
      <meta_attributes id="dummy-options">
        <rule id="dummy-options-timeout-rule" score="INFINITY">
          <rsc_expression id="dummy-rule-Dummy" class="ocf" provider="pacemaker" type="Dummy"/>
        </rule>
        <nvpair id="dummy-options-timeout" name="timeout" value="7s"/>
      </meta_attributes>
    </op_defaults>

And running that through crm_simulate with enough V args, I am seeing things like this:

(parse_op_key)  trace:   Action: monitor
(parse_op_key)  trace:   Resource: Fencing
(make_pairs)    trace: Checking for attributes
(pe_eval_expr)  trace: Testing rule dummy-options-timeout-rule
(pe_eval_rsc_defaults_expr)     trace: Testing rsc_defaults expression: dummy-rule-Dummy
(pe_eval_rsc_defaults_expr)     trace: Class doesn't match: ocf != stonith
(pe_eval_subexpr)       trace: Expression dummy-rule-Dummy failed on all nodes
(pe_eval_expr)  trace: Expression dummy-options-timeout-rule/dummy-rule-Dummy failed
(unpack_attr_set)       trace: Adding attributes from op_defaults-options
(populate_hash)         trace: Setting attribute: timeout = 5s

Showing that the rule does not apply for a fencing device, and:

(parse_op_key)  trace:   Action: monitor
(parse_op_key)  trace:   Resource: dummy
(make_pairs)    trace: Checking for attributes
(pe_eval_expr)  trace: Testing rule dummy-options-timeout-rule
(pe_eval_rsc_defaults_expr)     trace: Testing rsc_defaults expression: dummy-rule-Dummy
(pe_eval_subexpr)       trace: Expression dummy-rule-Dummy passed on all nodes
(pe_eval_expr)  trace: Rule dummy-options-timeout-rule passed
(unpack_attr_set)       trace: Adding attributes from dummy-options
(populate_hash)         trace: Setting attribute: timeout = 7s
(unpack_attr_set)       trace: Adding attributes from op_defaults-options

Showing that it does apply correctly for the dummy resource.

Copy link
Contributor

@kgaillot kgaillot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a solid approach.

include/crm/pengine/rules_internal.h Outdated Show resolved Hide resolved
include/crm/pengine/rules_internal.h Show resolved Hide resolved
include/crm/pengine/rules.h Show resolved Hide resolved
include/crm/pengine/internal.h Outdated Show resolved Hide resolved
include/crm/pengine/rules.h Show resolved Hide resolved
lib/pengine/utils.c Show resolved Hide resolved
see upgrade-2.10.xsl
- cibtr:table for="cluster-properties"
-->
<define name="cluster_property_set.nvpair.name-value-unsupported">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jnpkrn , I forget whether we need to keep these "unsupported" blocks once the relevant file moves beyond 3.0 -- were they only needed for the 3.0 schema specifically, or do we need to keep them for all 3.x?

@kgaillot
Copy link
Contributor

Something went wrong with the monitor interval parsing, see the scheduler tests. Not sure where the issue is.

@clumens
Copy link
Contributor Author

clumens commented Apr 30, 2020

Something went wrong with the monitor interval parsing, see the scheduler tests. Not sure where the issue is.

Looks like it's in e993890. That patch made sense to me at the time, but not right now.

@kgaillot
Copy link
Contributor

Looks like it's in e993890. That patch made sense to me at the time, but not right now.

It makes sense to me too. unpack_operation() is only called by custom_action(), and custom_action() always has the key. I can't think of any situation where the action key has an interval different from the action meta-attributes, but that's the only thing I can see that would be different. I'd be curious if you traced the interval parsed from the key vs the one parsed from the XML to see when it's different.

@clumens
Copy link
Contributor Author

clumens commented Apr 30, 2020

I'd be happy to hear other ideas about how I can test this further.

@kgaillot
Copy link
Contributor

We do need to add a scheduler regression test. Come up with a CIB XML that contains several test cases, e.g. something using rsc_defaults, something using op_defaults, something using both, and something using neither. Then we have to get the results to show up in the output graph which means we'll need to trigger some action. Generally I find the easiest way is to set target-role=Stopped on an active resource, then the scheduler will want to stop it.

To add a scheduler test:

  1. Put the CIB XML at cts/scheduler/${TEST_NAME}.xml
  2. Run "cts/cts-scheduler --update --run $TEST_NAME" and verify the created files are as expected
  3. Edit cts/cts-scheduler.in to add test name and description to tests

@clumens
Copy link
Contributor Author

clumens commented May 1, 2020

This is starting to look pretty good.

Then we have to get the results to show up in the output graph which means we'll need to trigger some action. Generally I find the easiest way is to set target-role=Stopped on an active resource, then the scheduler will want to stop it.

Just to be clear, I should mark every resource with target-role=Stopped?

@kgaillot
Copy link
Contributor

kgaillot commented May 1, 2020

This is starting to look pretty good.

Then we have to get the results to show up in the output graph which means we'll need to trigger some action. Generally I find the easiest way is to set target-role=Stopped on an active resource, then the scheduler will want to stop it.

Just to be clear, I should mark every resource with target-role=Stopped?

More specifically whatever resources you want to trigger a stop action for. The operation timeout etc. values won't show up anywhere in the output unless there's an action scheduled that needs them. With an action scheduled, you can verify in the .exp file that the values are what you want.

@clumens clumens force-pushed the rhbz1628701 branch 3 times, most recently from 11fe44b to 82f8460 Compare May 11, 2020 15:10
@clumens clumens changed the title WIP: Add op_defaults and rsc_defaults expressions Add op_defaults and rsc_defaults expressions May 11, 2020
These new functions all take the same input arguments - an xmlNodePtr
and a pe_rule_eval_data_t.  This latter type holds all the parameters
that could possibly be useful for evaluating some rule.  Most functions
will only need a few items out of this structure.

Then, implement pe_test_*_expression in terms of these new functions.
cts/scheduler/op-rsc-defaults.xml Outdated Show resolved Hide resolved
cts/scheduler/op-rsc-defaults.xml Outdated Show resolved Hide resolved
cts/scheduler/op-rsc-defaults.xml Outdated Show resolved Hide resolved
<action_set>
<rsc_op id="10" operation="monitor" operation_key="uses-rsc_defaults-fencing_monitor_60000" on_node="cluster01" on_node_uuid="1">
<primitive id="uses-rsc_defaults-fencing" class="stonith" type="fence_xvm"/>
<attributes CRM_meta_interval="60000" CRM_meta_name="monitor" CRM_meta_on_node="cluster01" CRM_meta_on_node_uuid="1" CRM_meta_timeout="10000" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI the 10s timeout here is coming from op_defaults-monitor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yep, I see that. Is it reasonable to keep it this way in the test? Perhaps I should change op_defaults-monitor to something else so it doesn't get applied all over the place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we want to test a bunch of scenarios. I.e. we should have some action in the graph using the default-default timeout, another using a general op_defaults timeout, another using an op_expression, etc.

Testing rsc_defaults is trickier if none of them appears in the graph. The way to do that would be to set something like target-role or is-managed in the XML so the simulation shows something that needs to be done (e.g. setting target-role=Stopped for a started resource will show a start action, setting is-managed=false for a resource will show "unmanaged" in the status output, etc.)

cts/scheduler/op-rsc-defaults.xml Outdated Show resolved Hide resolved
cts/scheduler/op-rsc-defaults.exp Outdated Show resolved Hide resolved
cts/scheduler/op-rsc-defaults.xml Outdated Show resolved Hide resolved
lib/pengine/rules.c Show resolved Hide resolved
@clumens
Copy link
Contributor Author

clumens commented May 12, 2020

To make this PR maximally confusing, let's start over on the test cases. I think I was trying to do too much in one file which was just making things more difficult. I'm pushing a new test case that only handles op_defaults. Everything here appears to be working except for op-monitor-interval-defaults. I'm not sure whether that's due to an incorrect test, incorrect testing, or a bug in the code.

Copy link
Contributor

@kgaillot kgaillot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW one question we left implicit is what the precedence should be when two rules match. First match, last match, or most specific match could all make sense.

We also need to document this in Pacemaker Explained in the Rules chapter

<op_expression id="op-ping-default-expr" name="monitor"/>
</rule>
<nvpair id="op-ping-monitor-timeout" name="timeout" value="7s"/>
</meta_attributes>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All monitor actions (other than 10s-interval, which is covered below) should have a 7s timeout (note this applies to all resources, not just ping resources)

<op_defaults>
<meta_attributes id="op-defaults">
<nvpair id="op-defaults-timeout" name="timeout" value="5s"/>
</meta_attributes>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just translating into English, all actions not matched by something below should have a 5s timeout

<rsc_expression id="op-dummy-default-expr" class="ocf" provider="pacemaker" type="Dummy"/>
</rule>
<nvpair id="op-dummy-timeout" name="timeout" value="6s"/>
</meta_attributes>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All ocf:pacemaker:Dummy actions (other than monitor, which is covered below) should have a 6s timeout

<op_expression id="op-monitor-interval-default-expr" name="monitor" interval="10s"/>
</rule>
<nvpair id="op-monitor-interval-timeout" name="timeout" value="8s"/>
</meta_attributes>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 10-second monitors should have an 8s timeout

cts/scheduler/op-defaults.exp Show resolved Hide resolved
@clumens clumens force-pushed the rhbz1628701 branch 2 times, most recently from 221aab8 to 9965b78 Compare May 13, 2020 19:23
@clumens
Copy link
Contributor Author

clumens commented May 13, 2020

Updated with additional tests and documentation. The one thing I have not tested is what happens when two rules apply to the same resource. I believe the last one will take effect (and I've updated the documentation to that effect), but I need to add a test for it. That's coming next.

Copy link
Contributor

@kgaillot kgaillot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just had one question

doc/Pacemaker_Explained/en-US/Ch-Rules.txt Show resolved Hide resolved
doc/Pacemaker_Explained/en-US/Ch-Rules.txt Show resolved Hide resolved
doc/Pacemaker_Explained/en-US/Ch-Rules.txt Outdated Show resolved Hide resolved
doc/Pacemaker_Explained/en-US/Ch-Rules.txt Show resolved Hide resolved
lib/pengine/rules.c Outdated Show resolved Hide resolved
clumens added 15 commits May 18, 2020 10:09
These are like all the other rule evaluating functions, but they do not
have any wrappers for the older style API.
The core functions of pe_evaluate_rules, pe_test_rule, and
pe_test_expression have been turned into new, similarly named functions
that take a pe_rule_eval_data_t as an argument.  The old ones still
exist as wrappers around the new ones.
This is just to get rid of a couple extra arguments to some internal
functions and make them look like the external functions.
It should now take a pe_rule_eval_data_t instead of various separate
arguments.  This will allow passing further data that needs to be tested
against in the future (such as rsc_defaults and op_defaults).  It's also
convenient to make versions of pe_unpack_nvpairs and
pe_unpack_versioned_attributes that take the same arguments.

Then, adapt callers of pe__unpack_dataset_nvpairs to pass the new
argument.
Only show the "Setting attribute:" text when it comes time to actually
set the attribute.  Also show the value being set.  This makes it
clearer that an attribute is actually being set, not just that the
function is processing something.
@kgaillot
Copy link
Contributor

Woohoo!

@kgaillot kgaillot merged commit 81d4b39 into ClusterLabs:master May 20, 2020
@clumens clumens deleted the rhbz1628701 branch May 21, 2020 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants