New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
osd: have clients resend ops on pg split #13235
Conversation
liewegas
commented
Feb 2, 2017
•
edited
edited
- adds require_luminous_osd OSDMap flag
- adds luminous server feature and a resend_on_split feature (same value)
- adds a placeholder M feature
- adds generic bits forcing upgrades to stop at L, enforcement for require_luminous_osd flag
- fixes last_split_epoch tracking on OSD (well enough for our purposes)
- renames last_force_op_resend to _preluminous, and adds a new one for post-
- makes objecter resend on split to osds with the resend-on-split feature
- makes osd discard all ops prior to split
- fixes backoff to operate on a per-spg_t basis
- renames backoffs to make more sense
- fixes fragile objecter code around tiering and pg_num
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should add an upgrade test that sets require_luminous_osds too
src/mon/OSDMonitor.cc
Outdated
@@ -1965,6 +1965,15 @@ bool OSDMonitor::preprocess_boot(MonOpRequestRef op) | |||
} | |||
} | |||
|
|||
// make sure upgrades stop at luminous | |||
if ((m->osd_features & CEPH_FEATURE_SERVER_M) && | |||
!osdmap.test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking at the places we use the jewel and kraken flags/features, there are a couple more that need luminous-flag/feature handling:
- create_initial()
- encode_pending()
- preprocess_boot() - second check needed to prevent pre-luminous OSDs booting when the flag is set
src/mon/OSDMonitor.cc
Outdated
@@ -6480,6 +6480,20 @@ bool OSDMonitor::prepare_command_impl(MonOpRequestRef op, | |||
ss << "not all up OSDs have CEPH_FEATURE_SERVER_KRAKEN feature"; | |||
err = -EPERM; | |||
} | |||
} else if (key == "require_luminous_osds") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to add to MonCommands.h too
src/osd/osd_types.cc
Outdated
if (!(features & CEPH_FEATURE_NEW_OSDOP_ENCODING)) { | ||
// this was the first post-hammer thing we added; if it's missing, encode | ||
// like hammer. | ||
v = 21; | ||
} | ||
if ((features & | ||
(CEPH_FEATURE_RESEND_ON_SPLIT|CEPH_FEATURE_SERVER_KRAKEN)) != | ||
(CEPH_FEATURE_RESEND_ON_SPLIT|CEPH_FEATURE_SERVER_KRAKEN)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a bit redundant since FEATURE_RESEND_ON_SPLIT aka FEATURE_SERVER_LUMINOUS implies FEATURE_SERVER_KRAKEN
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, should have been JEWEL. this is needed because we are recycling feature bits and older code will have these for other reasons. we really need to introduce a set of helpers to manage this mess...
@@ -2735,7 +2748,7 @@ int Objecter::_calc_target(op_target_t *t, bool any_change) | |||
pg_num, | |||
t->sort_bitwise, | |||
sort_bitwise, | |||
pg_t(prev_seed, pgid.pool(), pgid.preferred()))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why remove pgid.preferred()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
support was removed like 8 years ago; should really scrub it from the code.
src/osdc/Objecter.cc
Outdated
@@ -2811,13 +2824,19 @@ int Objecter::_calc_target(op_target_t *t, bool any_change) | |||
if (need_resend) { | |||
return RECALC_OP_TARGET_NEED_RESEND; | |||
} | |||
if (con && | |||
con->has_features(CEPH_FEATURE_RESEND_ON_SPLIT | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't RESEND_ON_SPLIT imply SERVER_JEWEL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, need to check for both bc we recycled the feature bit
c5aa4f5
to
5190531
Compare
http://pulpito.ceph.com/sage-2017-02-03_23:05:46-rados-wip-pg-split-interval---basic-smithi/ the one (non-ntp) failure looks like the mon failing to resend a pg_create message. this code is about to be rewritten for luminous so i'm not sure it's that important. |
2305e2e
to
1663021
Compare
rebased (minor conflict with the ENXIO change) |
2437dff
to
76c6597
Compare
Signed-off-by: Sage Weil <sage@redhat.com>
Rename the current last_force_op_resend for legacy clients, and add a new one that only applies to new clients that have the new CEPH_FEATURE_OSD_NEW_INTERVAL_ON_SPLIT feature. Signed-off-by: Sage Weil <sage@redhat.com>
If the client has the new feature bit, use the new field; if they have the older feature bit, use the old field. Note that there is no change to the Objecter: last_force_op_resend is still the "current" field that it should pay attention to. Signed-off-by: Sage Weil <sage@redhat.com>
Pre-luminous clients do not understand that a split PG forms a new interval. Make them resend ops to work around this. Signed-off-by: Sage Weil <sage@redhat.com>
76c6597
to
d856e43
Compare
I reviewed from "osdc/Objecter: populate both actual pgid and full has in MOSDOp" on in the faster dispatch PR and only had pretty minor comments. I'd missed some of the other bits you were ripping out though so I will take a look at this as well. |
d7c7fa8
to
6cc8c75
Compare
…ON_SPLIT Signed-off-by: Sage Weil <sage@redhat.com>
New clients will resend. Old clients will see a last_force_op_resend (now named last_force_op_resend_preluminous in latest code) and resend. We know this because we require that the monitors upgrade to luminous before the OSDs, and the new mon code sets this field on split. Signed-off-by: Sage Weil <sage@redhat.com>
Note that it is only (currently) important that this value be accurate on the current OSD since we only use this value (currently) to discard ops sent before the split. If we are getting the history from a different OSD in the cluster that doesn't have an up to date value it doesn't matter because that implies a primary change and also a client resend. Signed-off-by: Sage Weil <sage@redhat.com>
Drop unneeded snapid_t snapid and object_locator_t, which just duplicate hobject_t fields. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
New clients need the actual pgid as well as the full hash (as part of the target hobj). Old clients only use the full hash value. We need to pass both to MOSDOp so it can encode based on the target features. Signed-off-by: Sage Weil <sage@redhat.com>
Define common get_spg() and get_map_epoch() methods. Signed-off-by: Sage Weil <sage@redhat.com>
Removed 7c414c5 (pre-bobtail). Signed-off-by: Sage Weil <sage@redhat.com>
This prevents random messages from falling into and OpRequest and dispatch_op(). Signed-off-by: Sage Weil <sage@redhat.com>
Clients are now expected to resend on split, and there is already an interval change. Signed-off-by: Sage Weil <sage@redhat.com>
Things like ObjectContext and lock state that are internal to the OSD do not need to be in osd_types and shared with other parts of the code base. Notably, this fixes the problem with OpRequest needing things from osd_types.h (osd_reqid_t for starters). Others to follow. Signed-off-by: Sage Weil <sage@redhat.com>
This is unused and not terribly useful. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Ensure that the ps value is < the pool pg_num. Signed-off-by: Sage Weil <sage@redhat.com>
pg_read is only used for PG listing and hit_set_{list,get}; these operations can't and shouldn't consider the tiering overlay. This makes the _calc_target behavior with the explicit pgid make sense; otherwise, what would it mean to try to read pg x.1 from pool x and get redirected to pg y.1 in pool y? Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
We use pi for pg_num and other values below; we need to update accordingly if we follow the overlay. Signed-off-by: Sage Weil <sage@redhat.com>
All callers now pass in an explicit pgid, including pg listing. Since we resend ops on split, there is not need to do any translation here, even for the jewel and kraken osds that can handle a full hash value. Signed-off-by: Sage Weil <sage@redhat.com>
Any time we are asked to calculate the target we should apply the pool tiering parameters. The previous logic of only doing so when the target hadn't been calculated didn't make a whole lot of sense, and broke our update of *pi that is needed to get the correct pg_num for the target pool. This didn't really matter for old clusters that take the raw pg, but for luminous and beyond we need the exact spg_t which requires a correct pg_num. Signed-off-by: Sage Weil <sage@redhat.com>
and make it an MOSDFastDispatchOp. Signed-off-by: Sage Weil <sage@redhat.com>
A backoff [range] is defined only within a specific spg_t; it does not pass anything to children on split, or to another primary. Signed-off-by: Sage Weil <sage@redhat.com>
Switch backoffs to be owned by a specific spg_t. Instead of wonky split logic, just clear them. This is mostly just for convenience; we could conceivably only clear the range belonging to children (just to stay tidy--we'll never get a request in that range) but why bother. The full pg backoffs are still defined by the range for the pg, although it's a bit redundant--we could just as easily do [min,max). This way we get readable hobject ranges in the messages that go by without having to map to/from pgids. Add Session::add_backoff() helper to keep Session internals out of PG.h. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
We may return a raw pointer that is about to get deallocated by clear_backoffs(). Fix by returning a reference, preventing the free. Signed-off-by: Sage Weil <sage@redhat.com>
In OSD::ms_handle_reset, we clear session->con before removing any backoffs. That means we have to check if con has been cleared after any call to have_backoff, lest we race with ms_handle_reset and it removes the backoffs but we don't realize our client session is disconnected. Introduce a helper to do both these checks in a safe way, simplifying callers while we're at it. Signed-off-by: Sage Weil <sage@redhat.com>
6cc8c75
to
b729e62
Compare
http://pulpito.ceph.com/sage-2017-02-14_06:55:13-rados-wip-pg-split-interval---basic-smithi/ failures unrelated:
|
retest this please |
How did you check that the pg log was too short? That's not generally an issue in the nightlies AFAIK and is exactly the kind of bug we'd see in this kind of PR... |
the first instance of the op completed some 800 ops before the
out-of-order dup reply was sent, and the pg log was set a something
like 400.
fwiw i fixed 2 instances of out-of-order bugs from backoff races yesterday
(see last 2 commits), but the bugs were actually in master. i scheduled
another run on this branch to see if anything else shakes out
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay. This looks good to me, just a couple things.
case MSG_OSD_PG_INFO: | ||
case MSG_OSD_PG_TRIM: | ||
case MSG_OSD_BACKFILL_RESERVE: | ||
case MSG_OSD_RECOVERY_RESERVE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, does that mean every random message we've received has been grabbing an OSD lock and going through this process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, an unrecognized message asserts in dispatch_op later; it just seemed better to enumerate them here (and not assert on a stray message type) than to assert in dispatch_op. (This after i spent several minutes trying to figure out which messages took this path.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's better torn iao fast, we'll just have to remember to add them to this enumeration whenever we make new message types (which IIRC is why it wasn't done before). Too bad we don't have a good way to switch on the new intermediate classes.
class MOSDFastDispatchOp : public Message { | ||
public: | ||
virtual epoch_t get_map_epoch() const = 0; | ||
virtual spg_t get_spg() const = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This subclassing should also let us remove the templated replica_op_required_epoch() and handle_replica_op() functions in OSD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the fast dispatch patch does exactly that
@@ -20,6 +20,7 @@ | |||
|
|||
#include "common/hobject.h" | |||
#include "osd/osd_types.h" | |||
#include "osd/osd_internal_types.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a pretty obtuse way to get osd_internal_types where it needs to go. Maybe it's included here by happenstance but shouldn't we place it in the right classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was just the first (and apparently only) place I hit a build error and added it as there weren't any further errors after that.
Reviewed-by: Greg Farnum gfarnum@redhat.com @jdurgin, I think this has covered all your concerns. Are we waiting on anything else? |