crush: update tunable docs. change default profile to jewel. #7964

Merged
merged 6 commits into from Mar 14, 2016
148 changes: 99 additions & 49 deletions doc/rados/operations/crush-map.rst
@@ -947,33 +947,34 @@ The following example removes the ``rack12`` bucket from the hierarchy::
Tunables
========

.. versionadded:: 0.48

There are several magic numbers that were used in the original CRUSH
implementation that have proven to be poor choices. To support
the transition away from them, newer versions of CRUSH (starting with
the v0.48 argonaut series) allow the values to be adjusted or tuned.

Clusters running recent Ceph releases support using the tunable values
in the CRUSH maps. However, older clients and daemons will not correctly interact
with clusters using the "tuned" CRUSH maps. To detect this situation,
there are now feature bits ``CRUSH_TUNABLES`` (value 0x40000) and ``CRUSH_TUNABLES2`` to
reflect support for tunables.

If the OSDMap currently used by the ``ceph-mon`` or ``ceph-osd``
daemon has non-legacy values, it will require the ``CRUSH_TUNABLES`` or ``CRUSH_TUNABLES2``
feature bits from clients and daemons who connect to it. This means
that old clients will not be able to connect.
Over time, we have made (and continue to make) improvements to the
CRUSH algorithm used to calculate the placement of data. In order to
support the change in behavior, we have introduced a series of tunable
options that control whether the legacy or improved variation of the
algorithm is used.

In order to use newer tunables, both clients and servers must support
the new version of CRUSH. For this reason, we have created
"profiles" that are named after the Ceph version in which they were
introduced. For example, the "firefly" tunables are first supported
in the firefly release, and will not work with older (e.g., dumpling)
clients. Once a given set of tunables is changed from the legacy
default behavior, the ``ceph-mon`` and ``ceph-osd`` daemons will prevent
older clients that do not support the new CRUSH features from connecting
to the cluster.
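
The gating described here can be pictured as a bitmask check. The following is a minimal sketch, not the monitor's actual code: the only value taken from the text is ``0x40000`` for ``CRUSH_TUNABLES``; the helper names ``peer_supports`` and ``required_features`` are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Feature bit value quoted in the documentation text for CRUSH_TUNABLES.
constexpr std::uint64_t FEATURE_CRUSH_TUNABLES = 0x40000;

// Hypothetical helper: true when the peer advertises every required bit.
constexpr bool peer_supports(std::uint64_t peer_features,
                             std::uint64_t required) {
  return (peer_features & required) == required;
}

// Hypothetical helper: the bits a cluster would demand from peers, given
// whether its CRUSH map uses non-legacy tunables. A real cluster tracks
// several feature bits; this sketch models only one.
constexpr std::uint64_t required_features(bool has_nondefault_tunables) {
  return has_nondefault_tunables ? FEATURE_CRUSH_TUNABLES : 0;
}
```

With legacy tunables nothing is required, so any client connects; once the tunables change, a client that lacks the bit is refused.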

argonaut (legacy)
-----------------

At some future point in time, newly created clusters will have
improved default values for the tunables. This is a matter of waiting
until the support has been present in the Linux kernel clients long
enough to make this a painless transition for most users.
The legacy CRUSH behavior used by argonaut and older releases works
fine for most clusters, provided there are not too many OSDs that have
been marked out.

Impact of Legacy Values
-----------------------
bobtail
-------

The legacy values result in several misbehaviors:
The bobtail tunable profile (CRUSH_TUNABLES and CRUSH_TUNABLES2
features) fixes a few key misbehaviors:

* For hierarchies with a small number of devices in the leaf buckets,
some PGs map to fewer than the desired number of replicas. This
@@ -987,8 +988,7 @@ The legacy values result in several misbehaviors:
* When some OSDs are marked out, the data tends to get redistributed
to nearby OSDs instead of across the entire hierarchy.

CRUSH_TUNABLES
--------------
The new tunables are:

* ``choose_local_tries``: Number of local retries. Legacy value is
2, optimal value is 0.
@@ -1001,57 +1001,107 @@ CRUSH_TUNABLES
50 is more appropriate for typical clusters. For extremely large
clusters, a larger value might be necessary.

CRUSH_TUNABLES2
---------------

* ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt
will retry, or only try once and allow the original placement to
retry. Legacy default is 0, optimal value is 1.

CRUSH_TUNABLES3
---------------
Migration impact:

* Moving from argonaut to bobtail tunables triggers a moderate amount
of data movement. Use caution on a cluster that is already
populated with data.

firefly
-------

The firefly tunable profile (CRUSH_TUNABLES3 feature) fixes a problem
with the ``chooseleaf`` CRUSH rule behavior that tends to result in PG
mappings with too few results when too many OSDs have been marked out.

The new tunable is:

* ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will
start with a non-zero value of r, based on how many attempts the
parent has already made. Legacy default is 0, but with this value
CRUSH is sometimes unable to find a mapping. The optimal value (in
terms of computational cost and correctness) is 1.

For existing clusters that have lots of existing data, changing
Migration impact:

* For existing clusters that have lots of existing data, changing
from 0 to 1 will cause a lot of data to move; a value of 4 or 5
will allow CRUSH to find a valid mapping but will make less data
move.

CRUSH_V4
--------
straw_calc_version tunable
--------------------------

* Bucket type ``straw2``: The new ``straw2`` bucket type fixes
several limitations in the original ``straw`` bucket.
Specifically, the old ``straw`` buckets would change some mappings
that should not have changed when a weight was adjusted, while
``straw2`` achieves the original goal of only changing mappings to
or from the bucket item whose weight has changed.
There were some problems with the internal weights calculated and
stored in the CRUSH map for ``straw`` buckets. Specifically, when
there were items with a CRUSH weight of 0, or when there was both a mix
of different weights and some duplicated weights, CRUSH would distribute
data incorrectly (i.e., not in proportion to the weights).

Changing an existing bucket from ``straw`` to ``straw2`` is
possible but will result in a reasonably small amount of data
movement, depending on how much the bucket item weights vary from
each other. When the weights are all the same no data will move,
and when item weights vary significantly there will be more
movement.
The new tunable is:

CRUSH_TUNABLES5
---------------
* ``straw_calc_version``: A value of 0 preserves the old, broken
internal weight calculation; a value of 1 fixes the behavior.

Migration impact:

* Moving to straw_calc_version 1 and then adjusting a straw bucket
(by adding, removing, or reweighting an item, or by using the
reweight-all command) can trigger a small to moderate amount of
data movement *if* the cluster has hit one of the problematic
conditions.

hammer
------

The hammer tunable profile (CRUSH_V4 feature) does not affect the
mapping of existing CRUSH maps simply by changing the profile. However:

* There is a new bucket type (``straw2``) supported. The new
``straw2`` bucket type fixes several limitations in the original
``straw`` bucket. Specifically, the old ``straw`` buckets would
change some mappings that should not have changed when a weight was
adjusted, while ``straw2`` achieves the original goal of only
changing mappings to or from the bucket item whose weight has
changed.

* ``straw2`` is the default for any newly created buckets.

Migration impact:

* Changing a bucket type from ``straw`` to ``straw2`` will result in
a reasonably small amount of data movement, depending on how much
the bucket item weights vary from each other. When the weights are
all the same no data will move, and when item weights vary
significantly there will be more movement.
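
The property claimed for ``straw2`` (only mappings to or from the reweighted item change) can be illustrated with a simplified weighted draw. This is a sketch under stated assumptions, not Ceph's implementation: it uses floating-point ``ln(u) / weight`` draws and a stand-in hash (``mix64``), and the function name ``straw2_like_choose`` is hypothetical. Ceph's real code uses its own hash and fixed-point arithmetic.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in 64-bit mixer (splitmix64 finalizer); not the hash Ceph uses.
inline std::uint64_t mix64(std::uint64_t x) {
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// Pick the item with the largest draw ln(u) / weight, where u in (0, 1]
// is derived from a hash of (key, item index). Each item's draw depends
// only on its own weight, so reweighting one item cannot reorder the others.
// Items with weight <= 0 are skipped. Returns -1 if no item is eligible.
inline int straw2_like_choose(std::uint64_t key,
                              const std::vector<double>& weights) {
  int best = -1;
  double best_draw = 0.0;
  for (std::size_t i = 0; i < weights.size(); ++i) {
    if (weights[i] <= 0.0) continue;
    std::uint64_t h = mix64(key ^ mix64(i + 1));
    // Map the top 53 bits to (0, 1]; 9007199254740992 is 2^53.
    double u = (static_cast<double>(h >> 11) + 1.0) / 9007199254740992.0;
    double draw = std::log(u) / weights[i];
    if (best < 0 || draw > best_draw) {
      best = static_cast<int>(i);
      best_draw = draw;
    }
  }
  return best;
}
```

In this sketch, raising one item's weight can only raise that item's draw, so any key whose choice changes must have moved to that item; when all weights stay equal, nothing moves.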

jewel
-----

The jewel tunable profile (CRUSH_TUNABLES5 feature) improves the
overall behavior of CRUSH such that significantly fewer mappings
change when an OSD is marked out of the cluster.

The new tunable is:

* ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will
use a better value for an inner loop that greatly reduces the number
of mapping changes when an OSD is marked out. The legacy value is 0,
while the new value of 1 uses the new approach.

Changing this value on an existing cluster will result in a very
Migration impact:

* Changing this value on an existing cluster will result in a very
large amount of data movement as almost every PG mapping is likely
to change.
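
The profiles described in the sections above can be summarized as a small lookup table. The sketch below is illustrative only: the struct ``TunableProfile`` and the function ``profile_for`` are hypothetical names, and it covers just the tunables whose values appear in this page (it omits tunables whose values are not shown here).

```cpp
#include <cassert>
#include <string>

// Subset of CRUSH tunables whose per-profile values are stated on this page.
struct TunableProfile {
  int choose_local_tries;       // legacy 2, optimal 0 (changed in bobtail)
  int chooseleaf_descend_once;  // legacy 0, 1 from bobtail on
  int chooseleaf_vary_r;        // legacy 0, 1 from firefly on
  int chooseleaf_stable;        // legacy 0, 1 in jewel
  bool straw2_allowed;          // straw2 buckets are available from hammer on
};

// Hypothetical lookup; unknown names fall back to the argonaut (legacy) values.
inline TunableProfile profile_for(const std::string& name) {
  if (name == "bobtail") return {0, 1, 0, 0, false};
  if (name == "firefly") return {0, 1, 1, 0, false};
  if (name == "hammer")  return {0, 1, 1, 0, true};
  if (name == "jewel")   return {0, 1, 1, 1, true};
  return {2, 0, 0, 0, false};
}
```

Reading the table row by row shows that hammer changes no tunable value relative to firefly; it only permits the new bucket type, which matches the statement that switching to the hammer profile alone does not move data.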




Which client versions support CRUSH_TUNABLES
--------------------------------------------

4 changes: 3 additions & 1 deletion src/common/config_opts.h
@@ -265,7 +265,9 @@ OPTION(mon_globalid_prealloc, OPT_U32, 10000) // how many globalids to preallo
OPTION(mon_osd_report_timeout, OPT_INT, 900) // grace period before declaring unresponsive OSDs dead
OPTION(mon_force_standby_active, OPT_BOOL, true) // should mons force standby-replay mds to be active
OPTION(mon_warn_on_old_mons, OPT_BOOL, true) // should mons set health to WARN if part of quorum is old?
OPTION(mon_warn_on_legacy_crush_tunables, OPT_BOOL, true) // warn if crush tunables are not optimal
OPTION(mon_warn_on_legacy_crush_tunables, OPT_BOOL, true) // warn if crush tunables are too old (older than mon_crush_min_required_version)
OPTION(mon_crush_min_required_version, OPT_STR, "firefly")
OPTION(mon_warn_on_crush_straw_calc_version_zero, OPT_BOOL, true) // warn if crush straw_calc_version==0
OPTION(mon_warn_on_osd_down_out_interval_zero, OPT_BOOL, true) // warn if 'mon_osd_down_out_interval == 0'
OPTION(mon_warn_on_cache_pools_without_hit_sets, OPT_BOOL, true)
OPTION(mon_min_osdmap_epochs, OPT_INT, 500)
3 changes: 3 additions & 0 deletions src/crush/CrushWrapper.cc
@@ -1568,6 +1568,9 @@ void CrushWrapper::dump_tunables(Formatter *f) const
f->dump_int("optimal_tunables", (int)has_optimal_tunables());
f->dump_int("legacy_tunables", (int)has_legacy_tunables());

// be helpful about minimum version required
f->dump_string("minimum_required_version", get_min_required_version());

f->dump_int("require_feature_tunables", (int)has_nondefault_tunables());
f->dump_int("require_feature_tunables2", (int)has_nondefault_tunables2());
f->dump_int("has_v2_rules", (int)has_v2_rules());
20 changes: 14 additions & 6 deletions src/crush/CrushWrapper.h
@@ -164,7 +164,7 @@ class CrushWrapper {
crush->straw_calc_version = 1;
}
void set_tunables_default() {
set_tunables_bobtail();
set_tunables_firefly();
crush->straw_calc_version = 1;
}

@@ -232,7 +232,6 @@ class CrushWrapper {
crush->chooseleaf_descend_once == 0 &&
crush->chooseleaf_vary_r == 0 &&
crush->chooseleaf_stable == 0 &&
crush->straw_calc_version == 0 &&
crush->allowed_bucket_algs == CRUSH_LEGACY_ALLOWED_BUCKET_ALGS;
}
bool has_bobtail_tunables() const {
@@ -243,7 +242,6 @@ class CrushWrapper {
crush->chooseleaf_descend_once == 1 &&
crush->chooseleaf_vary_r == 0 &&
crush->chooseleaf_stable == 0 &&
crush->straw_calc_version == 0 &&
crush->allowed_bucket_algs == CRUSH_LEGACY_ALLOWED_BUCKET_ALGS;
}
bool has_firefly_tunables() const {
@@ -254,7 +252,6 @@ class CrushWrapper {
crush->chooseleaf_descend_once == 1 &&
crush->chooseleaf_vary_r == 1 &&
crush->chooseleaf_stable == 0 &&
crush->straw_calc_version == 0 &&
crush->allowed_bucket_algs == CRUSH_LEGACY_ALLOWED_BUCKET_ALGS;
}
bool has_hammer_tunables() const {
@@ -265,7 +262,6 @@ class CrushWrapper {
crush->chooseleaf_descend_once == 1 &&
crush->chooseleaf_vary_r == 1 &&
crush->chooseleaf_stable == 0 &&
crush->straw_calc_version == 1 &&
crush->allowed_bucket_algs == ((1 << CRUSH_BUCKET_UNIFORM) |
(1 << CRUSH_BUCKET_LIST) |
(1 << CRUSH_BUCKET_STRAW) |
@@ -279,7 +275,6 @@ class CrushWrapper {
crush->chooseleaf_descend_once == 1 &&
crush->chooseleaf_vary_r == 1 &&
crush->chooseleaf_stable == 1 &&
crush->straw_calc_version == 1 &&
crush->allowed_bucket_algs == ((1 << CRUSH_BUCKET_UNIFORM) |
(1 << CRUSH_BUCKET_LIST) |
(1 << CRUSH_BUCKET_STRAW) |
@@ -321,6 +316,19 @@ class CrushWrapper {
bool is_v3_rule(unsigned ruleid) const;
bool is_v5_rule(unsigned ruleid) const;

string get_min_required_version() const {
if (has_v5_rules() || has_nondefault_tunables5())
return "jewel";
else if (has_v4_buckets())
return "hammer";
else if (has_nondefault_tunables3())
return "firefly";
else if (has_nondefault_tunables2() || has_nondefault_tunables())
return "bobtail";
else
return "argonaut";
}

// default bucket types
unsigned get_default_bucket_alg() const {
// in order of preference
17 changes: 15 additions & 2 deletions src/mon/OSDMonitor.cc
@@ -2836,9 +2836,22 @@ void OSDMonitor::get_health(list<pair<health_status_t,string> >& summary,

// old crush tunables?
if (g_conf->mon_warn_on_legacy_crush_tunables) {
if (osdmap.crush->has_legacy_tunables()) {
string min = osdmap.crush->get_min_required_version();
if (min < g_conf->mon_crush_min_required_version) {
ostringstream ss;
ss << "crush map has legacy tunables";
ss << "crush map has legacy tunables (require " << min
<< ", min is " << g_conf->mon_crush_min_required_version << ")";
summary.push_back(make_pair(HEALTH_WARN, ss.str()));
if (detail) {
ss << "; see http://ceph.com/docs/master/rados/operations/crush-map/#tunables";
detail->push_back(make_pair(HEALTH_WARN, ss.str()));
}
}
}
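
A note on the ``min < g_conf->mon_crush_min_required_version`` comparison in the hunk above: it compares two strings, so it relies on the release names sorting alphabetically in release order (argonaut, bobtail, firefly, hammer, jewel). The standalone sketch below reproduces just that comparison; the function name ``tunables_too_old`` is hypothetical and is not part of the patch.

```cpp
#include <cassert>
#include <string>

// Reproduces the string comparison used by the health check. std::string's
// operator< is lexicographic, which matches release order here only because
// the profile names happen to be alphabetical.
inline bool tunables_too_old(const std::string& map_min_version,
                             const std::string& configured_min) {
  return map_min_version < configured_min;
}
```

With the default of ``"firefly"``, maps that require only argonaut or bobtail trigger the warning, while firefly, hammer, and jewel do not. A value that is not a release name would still compare without error, so this approach assumes the configured string is one of the known profile names.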
if (g_conf->mon_warn_on_crush_straw_calc_version_zero) {
if (osdmap.crush->get_straw_calc_version() == 0) {
ostringstream ss;
ss << "crush map has straw_calc_version=0";
summary.push_back(make_pair(HEALTH_WARN, ss.str()));
if (detail) {
ss << "; see http://ceph.com/docs/master/rados/operations/crush-map/#tunables";