Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd: have clients resend ops on pg split #13235

Merged
merged 35 commits into from Feb 15, 2017

Conversation

liewegas
Copy link
Member

@liewegas liewegas commented Feb 2, 2017

  • adds require_luminous_osd OSDMap flag
  • adds luminous server feature and a resend_on_split feature (same value)
  • adds a placeholder M feature
  • adds generic bits forcing upgrades to stop at L, enforcement for require_luminous_osd flag
  • fixes last_split_epoch tracking on OSD (well enough for our purposes)
  • renames last_force_op_resend to _preluminous, and adds a new one for post-
  • makes objecter resend on split to osds with the resend-on-split feature
  • makes osd discard all ops prior to split
  • fixes backoff to operate on a per-spg_t basis
  • renames backoffs to make more sense
  • fixes fragile objecter code around tiering and pg_num

Copy link
Member

@jdurgin jdurgin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should add an upgrade test that sets require_luminous_osds too

@@ -1965,6 +1965,15 @@ bool OSDMonitor::preprocess_boot(MonOpRequestRef op)
}
}

// make sure upgrades stop at luminous
if ((m->osd_features & CEPH_FEATURE_SERVER_M) &&
!osdmap.test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at the places we use the jewel and kraken flags/features, there are a couple more that need luminous-flag/feature handling:

  • create_initial()
  • encode_pending()
  • preprocess_boot() - second check needed to prevent pre-luminous OSDs booting when the flag is set

@@ -6480,6 +6480,20 @@ bool OSDMonitor::prepare_command_impl(MonOpRequestRef op,
ss << "not all up OSDs have CEPH_FEATURE_SERVER_KRAKEN feature";
err = -EPERM;
}
} else if (key == "require_luminous_osds") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to add to MonCommands.h too

if (!(features & CEPH_FEATURE_NEW_OSDOP_ENCODING)) {
// this was the first post-hammer thing we added; if it's missing, encode
// like hammer.
v = 21;
}
if ((features &
(CEPH_FEATURE_RESEND_ON_SPLIT|CEPH_FEATURE_SERVER_KRAKEN)) !=
(CEPH_FEATURE_RESEND_ON_SPLIT|CEPH_FEATURE_SERVER_KRAKEN)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bit redundant since FEATURE_RESEND_ON_SPLIT aka FEATURE_SERVER_LUMINOUS implies FEATURE_SERVER_KRAKEN

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, should have been JEWEL. this is needed because we are recycling feature bits and older code will have these for other reasons. we really need to introduce a set of helpers to manage this mess...

@@ -2735,7 +2748,7 @@ int Objecter::_calc_target(op_target_t *t, bool any_change)
pg_num,
t->sort_bitwise,
sort_bitwise,
pg_t(prev_seed, pgid.pool(), pgid.preferred()))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove pgid.preferred()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

support was removed like 8 years ago; should really scrub it from the code.

@@ -2811,13 +2824,19 @@ int Objecter::_calc_target(op_target_t *t, bool any_change)
if (need_resend) {
return RECALC_OP_TARGET_NEED_RESEND;
}
if (con &&
con->has_features(CEPH_FEATURE_RESEND_ON_SPLIT |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't RESEND_ON_SPLIT imply SERVER_JEWEL?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, need to check for both bc we recycled the feature bit

@liewegas liewegas force-pushed the wip-pg-split-interval branch 2 times, most recently from c5aa4f5 to 5190531 Compare February 3, 2017 21:46
@liewegas
Copy link
Member Author

liewegas commented Feb 4, 2017

http://pulpito.ceph.com/sage-2017-02-03_23:05:46-rados-wip-pg-split-interval---basic-smithi/

the one (non-ntp) failure looks like the mon failing to resend a pg_create message. this code is about to be rewritten for luminous so i'm not sure it's that important.

@liewegas
Copy link
Member Author

liewegas commented Feb 4, 2017

rebased (minor conflict with the ENXIO change)

Signed-off-by: Sage Weil <sage@redhat.com>
Rename the current last_force_op_resend for legacy clients, and add a new
one that only applies to new clients that have the new
CEPH_FEATURE_OSD_NEW_INTERVAL_ON_SPLIT feature.

Signed-off-by: Sage Weil <sage@redhat.com>
If the client has the new feature bit, use the new field; if they have the
older feature bit, use the old field.

Note that there is no change to the Objecter: last_force_op_resend is
still the "current" field that it should pay attention to.

Signed-off-by: Sage Weil <sage@redhat.com>
Pre-luminous clients do not understand that a split PG forms a new
interval.  Make them resend ops to work around this.

Signed-off-by: Sage Weil <sage@redhat.com>
@gregsfortytwo
Copy link
Member

I reviewed from "osdc/Objecter: populate both actual pgid and full has in MOSDOp" on in the faster dispatch PR and only had pretty minor comments. I'd missed some of the other bits you were ripping out though so I will take a look at this as well.

@liewegas liewegas force-pushed the wip-pg-split-interval branch 4 times, most recently from d7c7fa8 to 6cc8c75 Compare February 14, 2017 00:36
…ON_SPLIT

Signed-off-by: Sage Weil <sage@redhat.com>
New clients will resend.

Old clients will see a last_force_op_resend (now named
last_force_op_resend_preluminous in latest code) and resend.

We know this because we require that the monitors upgrade to luminous
before the OSDs, and the new mon code sets this field on split.

Signed-off-by: Sage Weil <sage@redhat.com>
Note that it is only (currently) important that this value be accurate
on the current OSD since we only use this value (currently) to discard
ops sent before the split.  If we are getting the history from a different
OSD in the cluster that doesn't have an up to date value it doesn't matter
because that implies a primary change and also a client resend.

Signed-off-by: Sage Weil <sage@redhat.com>
Drop unneeded snapid_t snapid and object_locator_t, which just duplicate
hobject_t fields.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
New clients need the actual pgid as well as the full hash (as part of the
target hobj).  Old clients only use the full hash value.  We need to pass
both to MOSDOp so it can encode based on the target features.

Signed-off-by: Sage Weil <sage@redhat.com>
Define common get_spg() and get_map_epoch() methods.

Signed-off-by: Sage Weil <sage@redhat.com>
Removed 7c414c5 (pre-bobtail).

Signed-off-by: Sage Weil <sage@redhat.com>
This prevents random messages from falling into and OpRequest and
dispatch_op().

Signed-off-by: Sage Weil <sage@redhat.com>
Clients are now expected to resend on split, and there is already an
interval change.

Signed-off-by: Sage Weil <sage@redhat.com>
Things like ObjectContext and lock state that are internal to the OSD
do not need to be in osd_types and shared with other parts of the code
base.

Notably, this fixes the problem with OpRequest needing things from
osd_types.h (osd_reqid_t for starters).  Others to follow.

Signed-off-by: Sage Weil <sage@redhat.com>
This is unused and not terribly useful.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Ensure that the ps value is < the pool pg_num.

Signed-off-by: Sage Weil <sage@redhat.com>
pg_read is only used for PG listing and hit_set_{list,get}; these
operations can't and shouldn't consider the tiering overlay.

This makes the _calc_target behavior with the explicit pgid make sense;
otherwise, what would it mean to try to read pg x.1 from pool x and get
redirected to pg y.1 in pool y?

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
We use pi for pg_num and other values below; we need to update accordingly
if we follow the overlay.

Signed-off-by: Sage Weil <sage@redhat.com>
All callers now pass in an explicit pgid, including pg listing.  Since
we resend ops on split, there is not need to do any translation here,
even for the jewel and kraken osds that can handle a full hash value.

Signed-off-by: Sage Weil <sage@redhat.com>
Any time we are asked to calculate the target we should apply the
pool tiering parameters.  The previous logic of only doing so when the
target hadn't been calculated didn't make a whole lot of sense, and broke
our update of *pi that is needed to get the correct pg_num for the target
pool.  This didn't really matter for old clusters that take the raw pg,
but for luminous and beyond we need the exact spg_t which requires a
correct pg_num.

Signed-off-by: Sage Weil <sage@redhat.com>
and make it an MOSDFastDispatchOp.

Signed-off-by: Sage Weil <sage@redhat.com>
A backoff [range] is defined only within a specific spg_t; it does not
pass anything to children on split, or to another primary.

Signed-off-by: Sage Weil <sage@redhat.com>
Switch backoffs to be owned by a specific spg_t.  Instead of wonky split
logic, just clear them.  This is mostly just for convenience; we could
conceivably only clear the range belonging to children (just to stay
tidy--we'll never get a request in that range) but why bother.

The full pg backoffs are still defined by the range for the pg, although
it's a bit redundant--we could just as easily do [min,max).  This way we
get readable hobject ranges in the messages that go by without having to
map to/from pgids.

Add Session::add_backoff() helper to keep Session internals out of PG.h.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
We may return a raw pointer that is about to get deallocated by
clear_backoffs().  Fix by returning a reference, preventing the free.

Signed-off-by: Sage Weil <sage@redhat.com>
In OSD::ms_handle_reset, we clear session->con before removing any
backoffs.  That means we have to check if con has been cleared after any
call to have_backoff, lest we race with ms_handle_reset and it removes the
backoffs but we don't realize our client session is disconnected.

Introduce a helper to do both these checks in a safe way, simplifying
callers while we're at it.

Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas
Copy link
Member Author

http://pulpito.ceph.com/sage-2017-02-14_06:55:13-rados-wip-pg-split-interval---basic-smithi/

failures unrelated:

  • misordered op bc pg log too short
  • tcmalloc vs dlopen deadlock
  • live obc after flush--saw this on master this morning too
  • clock sync

@liewegas
Copy link
Member Author

retest this please

@gregsfortytwo
Copy link
Member

How did you check that the pg log was too short? That's not generally an issue in the nightlies AFAIK and is exactly the kind of bug we'd see in this kind of PR...

@liewegas
Copy link
Member Author

liewegas commented Feb 14, 2017 via email

Copy link
Member

@gregsfortytwo gregsfortytwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. This looks good to me, just a couple things.

case MSG_OSD_PG_INFO:
case MSG_OSD_PG_TRIM:
case MSG_OSD_BACKFILL_RESERVE:
case MSG_OSD_RECOVERY_RESERVE:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, does that mean every random message we've received has been grabbing an OSD lock and going through this process?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, an unrecognized message asserts in dispatch_op later; it just seemed better to enumerate them here (and not assert on a stray message type) than to assert in dispatch_op. (This after i spent several minutes trying to figure out which messages took this path.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's better torn iao fast, we'll just have to remember to add them to this enumeration whenever we make new message types (which IIRC is why it wasn't done before). Too bad we don't have a good way to switch on the new intermediate classes.

class MOSDFastDispatchOp : public Message {
public:
virtual epoch_t get_map_epoch() const = 0;
virtual spg_t get_spg() const = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This subclassing should also let us remove the templated replica_op_required_epoch() and handle_replica_op() functions in OSD.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the fast dispatch patch does exactly that

@@ -20,6 +20,7 @@

#include "common/hobject.h"
#include "osd/osd_types.h"
#include "osd/osd_internal_types.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a pretty obtuse way to get osd_internal_types where it needs to go. Maybe it's included here by happenstance but shouldn't we place it in the right classes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was just the first (and apparently only) place I hit a build error and added it as there weren't any further errors after that.

@gregsfortytwo
Copy link
Member

Reviewed-by: Greg Farnum gfarnum@redhat.com

@jdurgin, I think this has covered all your concerns. Are we waiting on anything else?

@liewegas liewegas merged commit eb491a1 into ceph:master Feb 15, 2017
@liewegas liewegas deleted the wip-pg-split-interval branch February 15, 2017 22:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants