Feature: crmd: Support "actions-limit" - the number of jobs that the TE is allowed to execute in parallel on a node #360

Closed
wants to merge 1 commit into from

Conversation

gao-yan (Member) commented Sep 26, 2013

Except "batch-limit", which is a cluster-wide concurrency limit,
there are two per-node concurrency limits can be configured now:

  • actions-limit:
    The number of jobs that the TE is allowed to execute in parallel on a
    node. If set to zero or unset, defaults to twice the number of CPU
    cores on the DC, with a minimum of 4. -1 means unlimited.
  • migration-limit:
    The number of migration (migrate_to/migrate_from) jobs that the TE is
    allowed to execute in parallel on a node. If set to zero or unset,
    defaults to half the number of CPU cores on the DC. -1 means unlimited.
  1. They can be configured as a global property in crm_config.
  2. They can be configured as a per-node attribute, which overrides
    the global property (see the sketch below).
  3. In practice, depending on the configuration, either limit can be
    reached first.
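
For illustration, here is a minimal C sketch of the precedence and value
semantics described above. It is not the patch itself; the function and
variable names are hypothetical, and the computed default is passed in
rather than derived here.

```c
/* Hypothetical sketch of the lookup order and value semantics:
 *   - a per-node attribute overrides the crm_config property
 *   - 0 or unset falls back to a computed default
 *   - -1 means unlimited
 */
#include <stdlib.h>

#define LIMIT_UNLIMITED (-1)

static int
effective_limit(const char *node_attr,    /* per-node attribute value, or NULL */
                const char *cluster_prop, /* crm_config property value, or NULL */
                int computed_default)     /* e.g. max(4, 2 * DC cores) */
{
    const char *raw = node_attr ? node_attr : cluster_prop;
    long value = raw ? strtol(raw, NULL, 10) : 0;

    if (value == -1) {
        return LIMIT_UNLIMITED;  /* explicitly unlimited */
    }
    if (value <= 0) {
        return computed_default; /* zero or unset: choose a sane default */
    }
    return (int) value;
}
```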

l-mb (Contributor) commented Sep 26, 2013

Looks good. A few minor suggestions on the wording of the limits are in-line. Thank you!

gao-yan (Member, Author) commented Sep 26, 2013

Good suggestions. Changed, thanks!

beekhof (Member) commented Oct 2, 2013

Can we put this on hold for a bit?
I'd like to see if we can auto-tune this (David and I have been kicking around some ideas) instead of adding a new option.

actions-limit=0 <--- wouldn't that mean the node can't run any actions by default?

gao-yan (Member, Author) commented Oct 2, 2013

I left "-1" for unlimited, which could still be needed by some users. In practice, no user would want to block the cluster transition by setting it to zero, I think. So I chose zero for the default, which means to choose an appropriate default limit for user.

Personally, I don't think adding a new option would be a problem. We'd just need to make sure the default behavior is sane.
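
To make "an appropriate default limit" concrete, here is a small sketch of
the computed defaults described in the proposal. The function names are
hypothetical, and the floor of 1 for migrations is an assumption, not part
of the patch.

```c
/* Illustrative only; names are hypothetical.
 *   actions-limit default:   twice the DC's CPU cores, but at least 4
 *   migration-limit default: half the DC's CPU cores
 */
static int
default_actions_limit(int dc_cores)
{
    int limit = 2 * dc_cores;
    return (limit < 4) ? 4 : limit;
}

static int
default_migration_limit(int dc_cores)
{
    int limit = dc_cores / 2;
    return (limit < 1) ? 1 : limit; /* assumption: never drop below 1 */
}
```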

l-mb (Contributor) commented Oct 2, 2013

I do not believe auto-tuning this is feasible for all scenarios. We can try to choose a sane default, yes (some factor times the number of cores), but nodes have varying capabilities, and different resources have different load impacts. Users have a legitimate need to tune this as required for their configuration.

I like choosing good defaults and auto-tuning what we can, but that doesn't mean hiding tunables is always a good idea. We should not err from the "let's make everything configurable" to the "oh no, a tunable" philosophy ;-) ("What power users hate about GNOME/Apple, chapter 1")

l-mb (Contributor) commented Oct 2, 2013

(And, for what it is worth, this is an upgrade regression for some users who can't switch, or who hit node overload situations after an update. So for us, it's fairly high priority.)

beekhof (Member) commented Oct 3, 2013

On 03/10/2013, at 4:09 AM, Lars Marowsky-Brée notifications@github.com wrote:

> I do not believe auto-tuning this is feasible for all scenarios. We can try to choose a sane default, yes (some factor times the number of cores), but nodes have varying capabilities, and different resources have different load impacts. Users have a legitimate need to tune this as required for their configuration.
>
> I like choosing good defaults and auto-tuning what we can, but that doesn't mean hiding tunables is always a good idea. We should not err from the "let's make everything configurable" to the "oh no, a tunable" philosophy ;-) ("What power users hate about GNOME/Apple, chapter 1")

Yep, fair call.

I need to figure out how to start a few hundred LXC containers, so it happens to be an area I'll be looking at closely RealSoonNow.

Just give me a couple of days to see what comes out of my auto tuning ideas.
I'm not saying we don't need any options, but maybe we need some different ones or for them to work in a different way.

beekhof (Member) commented Oct 3, 2013

On 02/10/2013, at 10:50 PM, "Gao,Yan" notifications@github.com wrote:

I left "-1" for unlimited, which could still be needed by some users. In practice, no user would want to block the cluster transition by setting it to zero, I think. So I chose zero for the default,

But doesn't that make it a bad default? Everyone getting the value no one would want?

> which means an appropriate limit is chosen for the user.

> Personally, I don't think adding a new option would be a problem. We'd just need to make sure the default behavior is sane.



gao-yan (Member, Author) commented Oct 3, 2013

Actually, the default value is not fixed; it depends on the number of CPU cores on the DC. The issue is how we should express the "default_value" field in pe_opts[]. Shouldn't we still indicate it with a special integer?
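
One way to read the question: keep a fixed string in the option table and
treat it as a sentinel. A hypothetical, simplified entry (not the real
pe_opts[] layout) might look like this:

```c
/* Hypothetical, simplified option entry; the real pe_opts[] fields differ.
 * The string "0" acts as a sentinel meaning "compute the default at runtime". */
struct option_entry {
    const char *name;
    const char *type;
    const char *default_value;
    const char *description;
};

static const struct option_entry actions_limit_opt = {
    "actions-limit",
    "integer",
    "0", /* sentinel: replaced by e.g. max(4, 2 * DC cores) when evaluated */
    "Number of jobs the TE may execute in parallel on a node (-1 = unlimited)",
};
```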

l-mb (Contributor) commented Oct 3, 2013

Twice the number of cores is not an unreasonable default. I think most users won't have to touch it further, so I hope it's not the default no one wants ;-)

Ultimately, users may ask for per-resource-type limits. We could add class/provider/type then. This might help with tons of VMs while still starting everything else at higher concurrency. But we didn't want to make this too complex in v1, preferring to wait for actual requests.

beekhof (Member) commented Oct 28, 2013

I think most are reasonably happy with the throttling code now. Closing.

beekhof closed this Oct 28, 2013