osd,osdc: pg and osd-based backoff #12342
Conversation
Didn't look at most of this, but I think we need docs describing the invariants and rules before this goes any farther — I'm sure something can be divined based on the code but we really need specific, explainable rules for how this interacts with stuff like flushing (at the user API level) and throttles (for the internal developer) or it'll be completely unsupportable.
@byo please test/review
@liewegas: Started a review and test of this code; most of the stuff is way beyond my current understanding of the ceph codebase, so I did a rather brief reading there.
I had to do one patch to unbreak make_debs.sh:
diff --git a/src/test/mon/test_mon_workloadgen.cc b/src/test/mon/test_mon_workloadgen.cc
index c1b26ef..1930689 100644
--- a/src/test/mon/test_mon_workloadgen.cc
+++ b/src/test/mon/test_mon_workloadgen.cc
@@ -32,6 +32,7 @@
#include "osd/osd_types.h"
#include "osd/OSD.h"
+#include "osd/Session.h"
#include "osdc/Objecter.h"
#include "mon/MonClient.h"
#include "msg/Dispatcher.h"
@@ -900,7 +901,7 @@ class OSDStub : public TestStub
bool ms_handle_reset(Connection *con) {
dout(1) << __func__ << dendl;
- OSD::Session *session = (OSD::Session *)con->get_priv();
+ Session *session = (Session *)con->get_priv();
if (!session)
return false;
session->put();
I will deploy this in my test environment now, let's see how it works.
doc/dev/rados-client-protocol.rst
Outdated
Ordinarily the OSD will simply queue any requests it can't immeidately
process in memory until such time as it can. This can become
problematic because the OSD limits the total amount of RAM consuemd by
Typo "consuemd" => "consumed"
doc/dev/rados-client-protocol.rst
Outdated
Ordinarily the OSD will simply queue any requests it can't immeidately
process in memory until such time as it can. This can become
problematic because the OSD limits the total amount of RAM consuemd by
incoming messages: if the threshold is reached, new messages will not
Shouldn't we also mention the limit on the number of simultaneously processed messages (independent of their size)?
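For context, both limits the reviewer is alluding to are configurable on the OSD. A sketch of the relevant ceph.conf options (the values shown here are illustrative, not a recommendation):

```ini
[osd]
# Cap on the total RAM consumed by in-flight client messages (bytes).
osd client message size cap = 524288000
# Cap on the number of in-flight client messages, independent of size.
osd client message cap = 100
```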
src/messages/MOSDBackoff.h
Outdated
/* | ||
* Ceph - scalable distributed file system | ||
* | ||
* Copyright (C) 2004-2006 Sage Weil <sage@newdream.net> |
I guess this should be updated ;)
doc/dev/rados-client-protocol.rst
Outdated
Either way, the request ultimately targets a PG. The client sends the
request to the primary for the assocated PG.

Each request is assigned a unique tid.
Does this have to increase with each request? It looks like the code does require this: https://github.com/ceph/ceph/pull/12342/files#diff-dfb9ddca0a3ee32b266623e8fa489626R2408
@liewegas: I'm getting a client crash when running rados bench after applying this patch. Currently trying to find a test scenario to easily reproduce it.
The crash I'm getting is pretty awkward. What I'm doing is: take 3 OSDs down, all from different racks (different failure domains) so that there are blocked objects, then start … Most of the time I'm getting stack traces like:
I've also seen this:
Worth noting: I was running rados bench from the mon node. This mon is running in a docker container (old style, lxc-based) on Ubuntu 14.04. I was using some older packages first, but the same happened once I fully upgraded to the newest 14.04 packages. I also tested it on a totally separate VM, started with a clean, up-to-date 14.04, and installed the same packages I compiled and used on the cluster — guess what: it works flawlessly, and I couldn't reproduce the crash. This VM will definitely have much higher network latency when talking to the cluster; maybe that's the reason? On the other hand, I also tried to test this version using fio but was unable to run it at all:
http://pulpito.ceph.com/sage-2017-02-02_14:58:21-rados:thrash-wip-backoff---basic-smithi is looking good! I made a small change and need to send it through one more time, but I don't expect problems.
http://pulpito.ceph.com/sage-2017-02-05_18:49:00-rados:thrash-wip-backoff---basic-smithi/ is pretty clean (failures related to the lab cluster). The exception is one assert from a live obc in on_flushed, which is either a new regression in master or related to backoff... :/
Nevermind.. that failure is an unrelated bug (which is already fixed). This is ready for final review and (hopefully) merge!
Great to see this, we'll try to put it into our testing machinery soon |
That would be great, thanks!
src/osd/OSD.cc
Outdated
}
spg_t pgid;
if (!osdmap->get_primary_shard(_pgid, &pgid)) {
  // missing pool or acting set empty -- drop
Is this a normal situation? If the backoff spans across multiple PGs, this could break the loop in the middle before ops to all PGs are enqueued.
good point, fixing
src/osd/OSD.cc
Outdated
// map hobject range to PG(s)
bool queued = false;
hobject_t pos = m->begin;
while (true) {
This could be rephrased as do...while
@liewegas I did test it in our infrastructure; so far it looks really good, works as expected, and shows no issues under load. I briefly checked the code and left a few minor comments (a few old ones also still apply).
@byo thanks, updated! This will go into master shortly and will definitely be part of luminous. |
You got all the corner cases for resetting backoffs that I could think of. RefCountedObject priv object ownership is confusingly asymmetric, but not worth changing now.
This lets us avoid an rbtree lookup. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
No reason to waste CPU recalculating a hash value! Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Link Backoff into Session, PG. Tear them down on session reset. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Issue at top of do_request. Release on activation or peering interval change. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Do these midway down do_op. Reorder the scrub waitlist after the degraded and unreadable waitlists. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
…ding backoffs Signed-off-by: Sage Weil <sage@redhat.com>
If a PG is blocked (down, incomplete) or an object is blocked (unfound), the OSD
can send a MOSDBackoff message to the client to ask it to stop sending requests
for that PG or object. Later, when the OSD recovers (PG peers or object is found)
it can send an unblock message, at which point the client will resend all blocked
messages.
The primary benefit is that the OSD doesn't need to keep blocked messages around in
memory (the client will resend them later). This prevents a blocked PG or object
from accumulating blocked requests until they exhaust the OSD's memory limit for
client request messages.