
osd,osdc: pg and osd-based backoff #12342

Merged
merged 14 commits into from Feb 11, 2017

Conversation

liewegas
Member

@liewegas liewegas commented Dec 6, 2016

If a PG is blocked (down, incomplete) or an object is blocked (unfound), the OSD
can send a MOSDBackoff message to the client to ask it to stop sending requests
for that PG or object. Later, when the OSD recovers (the PG peers or the object is
found), it can send an unblock message, at which point the client resends all
blocked messages.

The primary benefit is that the OSD doesn't need to keep blocked messages around
in memory (the client will resend them later). This prevents a blocked PG or
object from accumulating blocked requests until they exhaust the OSD's memory
limit for client request messages.

@gregsfortytwo
Member

Didn't look at most of this, but I think we need docs describing the invariants and rules before this goes any farther — I'm sure something can be divined based on the code but we really need specific, explainable rules for how this interacts with stuff like flushing (at the user API level) and throttles (for the internal developer) or it'll be completely unsupportable.

@liewegas
Member Author

liewegas commented Dec 9, 2016

@byo please test/review

@tchaikov tchaikov self-assigned this Dec 9, 2016

@byo byo left a comment


@liewegas: Started a review and test of this code; most of it is way beyond my current understanding of the ceph codebase, so I only gave those parts a rather brief reading.

I had to do one patch to unbreak make_debs.sh:

diff --git a/src/test/mon/test_mon_workloadgen.cc b/src/test/mon/test_mon_workloadgen.cc
index c1b26ef..1930689 100644
--- a/src/test/mon/test_mon_workloadgen.cc
+++ b/src/test/mon/test_mon_workloadgen.cc
@@ -32,6 +32,7 @@
 
 #include "osd/osd_types.h"
 #include "osd/OSD.h"
+#include "osd/Session.h"
 #include "osdc/Objecter.h"
 #include "mon/MonClient.h"
 #include "msg/Dispatcher.h"
@@ -900,7 +901,7 @@ class OSDStub : public TestStub
 
   bool ms_handle_reset(Connection *con) {
     dout(1) << __func__ << dendl;
-    OSD::Session *session = (OSD::Session *)con->get_priv();
+    Session *session = (Session *)con->get_priv();
     if (!session)
       return false;
     session->put();

I will deploy this in my test environment now, let's see how it works.


Ordinarily the OSD will simply queue any requests it can't immeidately
process in memory until such time as it can. This can become
problematic because the OSD limits the total amount of RAM consuemd by

Typo "consuemd" => "consumed"

Ordinarily the OSD will simply queue any requests it can't immeidately
process in memory until such time as it can. This can become
problematic because the OSD limits the total amount of RAM consuemd by
incoming messages: if the threshold is reached, new messages will not

Shouldn't we also mention the limit on the number of simultaneously processed messages (independent of their size)?

/*
* Ceph - scalable distributed file system
*
* Copyright (C) 2004-2006 Sage Weil <sage@newdream.net>

I guess this should be updated ;)

Either way, the request ultimately targets a PG. The client sends the
request to the primary for the associated PG.

Each request is assigned a unique tid.

Does this have to increase with each request? It looks like the code does require this: https://github.com/ceph/ceph/pull/12342/files#diff-dfb9ddca0a3ee32b266623e8fa489626R2408

@byo

byo commented Dec 13, 2016

@liewegas: I'm getting a client crash when running rados bench after applying this patch. Currently trying to find a test scenario that easily reproduces it.

@byo

byo commented Dec 14, 2016

The crash I'm getting is pretty awkward. What I'm doing is: take 3 OSDs down, all from different racks (different failure domains) so that there are blocked objects, then start rados bench -p rbd -t 1200 1200 write, wait for it to initialize, and while it's still running, start bringing the OSDs back up.

Most of the time I'm getting stack traces like:

(gdb) bt
#0 0x00007fffef5ca38a in Objecter::_check_op_pool_dne (this=this@entry=0x5555558c24b0,
op=op@entry=0x555555918240, sl=...)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/osdc/Objecter.cc:1512
#1 0x00007fffef5d049e in Objecter::_scan_requests (this=this@entry=0x5555558c24b0,
s=s@entry=0x5555558e6d10, force_resend=force_resend@entry=false, cluster_full=cluster_full@entry=false,
pool_full_map=pool_full_map@entry=0x7fffe0d5de50, need_resend=..., need_resend_linger=...,
need_resend_command=..., sul=...)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/osdc/Objecter.cc:1091
#2 0x00007fffef5dcb81 in Objecter::handle_osd_map (this=this@entry=0x5555558c24b0,
m=m@entry=0x7fffdc0131b0)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/osdc/Objecter.cc:1270
#3 0x00007fffef5de957 in Objecter::ms_dispatch (this=0x5555558c24b0, m=0x7fffdc0131b0)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/osdc/Objecter.cc:1010
#4 0x00007fffef81a8db in ms_deliver_dispatch (m=0x7fffdc0131b0, this=0x555555835900)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/msg/Messenger.h:593
#5 DispatchQueue::entry (this=0x555555835a50)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/msg/DispatchQueue.cc:197
#6 0x00007fffef6b971d in DispatchQueue::DispatchThread::entry (this=<optimized out>)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/msg/DispatchQueue.h:103
#7 0x00007fffe68db182 in start_thread (arg=0x7fffe0d60700) at pthread_create.c:312
#8 0x00007fffe5be400d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

I've also seen this:

(gdb) bt
#0 0x00007fffe5b20c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fffe5b24028 in __GI_abort () at abort.c:89
#2 0x00007fffef62ec85 in ceph::__ceph_assert_fail (assertion=<optimized out>, file=<optimized out>, line=2110,
func=0x7fffef8a9e30 <Objecter::tick()::__PRETTY_FUNCTION__> "void Objecter::tick()") at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/common/assert.cc:78
#3 0x00007fffef5d214d in Objecter::tick (this=0x5555558c2650) at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/osdc/Objecter.cc:2110
#4 0x00007fffef5a7831 in operator() (this=0x5555558cc6f0) at /usr/include/c++/4.8/functional:2471
#5 ceph::timer_detail::timer<ceph::time_detail::mono_clock>::timer_thread (this=0x5555558c2850)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/common/ceph_timer.h:137
#6 0x00007fffe647ca60 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007fffe68db184 in start_thread (arg=0x7fffe155d700) at pthread_create.c:312
#8 0x00007fffe5be437d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Worth noting: I was running rados bench from the mon node. This mon runs in a docker container (old style, lxc-based) on ubuntu 14.04. I was using some older packages at first, but the same thing happened once I fully upgraded to the newest 14.04 packages.

I also tested it on a totally separate VM, starting with a clean, up-to-date 14.04 install and the packages I compiled and used on the cluster, and guess what: it works flawlessly; I couldn't reproduce the crash. This VM will definitely have much higher network latency when talking to the cluster, maybe that's the reason?

On the other hand, I also tried to test this version using fio, but was unable to run it; it failed at:

/tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/common/ceph_crypto.cc: 77: FAILED assert(crypto_context != __null)
I was running fio 2.1.11; I haven't yet checked the newest version.

@liewegas liewegas force-pushed the wip-backoff branch 4 times, most recently from 4f5133c to 4b8b66d Compare December 22, 2016 03:51
@liewegas liewegas force-pushed the wip-backoff branch 3 times, most recently from 0a274cc to 0f468d3 Compare December 28, 2016 18:45
@liewegas liewegas force-pushed the wip-backoff branch 2 times, most recently from f41c4fc to 5912fa3 Compare February 2, 2017 19:29
@liewegas liewegas changed the title DNM: osd,osdc: pg and osd-based backoff osd,osdc: pg and osd-based backoff Feb 2, 2017
@liewegas liewegas requested a review from jdurgin February 2, 2017 19:53
@liewegas
Member Author

liewegas commented Feb 2, 2017

http://pulpito.ceph.com/sage-2017-02-02_14:58:21-rados:thrash-wip-backoff---basic-smithi is looking good! I made a small change and need to send it through one more time, but I don't expect problems.

@liewegas
Member Author

liewegas commented Feb 6, 2017

http://pulpito.ceph.com/sage-2017-02-05_18:49:00-rados:thrash-wip-backoff---basic-smithi/ is pretty clean (the failures are related to the lab cluster). The exception is one assert from a live obc in on_flushed, which is either a new regression in master or related to backoff... :/

@liewegas
Member Author

liewegas commented Feb 6, 2017

Never mind, that failure is an unrelated bug (already fixed). This is ready for final review and (hopefully) merge!

@byo

byo commented Feb 6, 2017

Great to see this; we'll try to put it into our testing machinery soon.

@liewegas
Member Author

liewegas commented Feb 6, 2017 via email

src/osd/OSD.cc Outdated
}
spg_t pgid;
if (!osdmap->get_primary_shard(_pgid, &pgid)) {
// missing pool or acting set empty -- drop

@byo byo Feb 9, 2017


Is this a normal situation? If the backoff spans multiple PGs, this could break out of the loop in the middle, before ops for all PGs are enqueued.

Member Author


good point, fixing

src/osd/OSD.cc Outdated
// map hobject range to PG(s)
bool queued = false;
hobject_t pos = m->begin;
while (true) {

This could be rephrased as do...while

@byo

byo commented Feb 9, 2017

@liewegas I did test it in our infrastructure; so far it looks really good: it works as expected, with no issues under load. I briefly checked the code and left a few minor comments (a few old ones also still apply).
I hope this ends up in master soon; is it going to be part of luminous?

@liewegas
Member Author

liewegas commented Feb 9, 2017

@byo thanks, updated! This will go into master shortly and will definitely be part of luminous.

Member

@jdurgin jdurgin left a comment


You got all the corner cases for resetting backoffs that I could think of. RefCountedObject priv object ownership is confusingly asymmetric, but not worth changing now.

This lets us avoid an rbtree lookup.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
No reason to waste CPU recalculating a hash value!

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Link Backoff into Session, PG.  Tear them down on session
reset.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Issue at top of do_request.  Release on activation or peering
interval change.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Do these midway down do_op.  Reorder the scrub waitlist after the
degraded and unreadable waitlists.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
…ding backoffs

Signed-off-by: Sage Weil <sage@redhat.com>