
osd,osdc: pg and osd-based backoff #12342

Merged
merged 14 commits into from Feb 11, 2017

Conversation

liewegas
Member

@liewegas liewegas commented Dec 6, 2016

If a PG is blocked (down, incomplete) or an object is blocked (unfound), the OSD
can send a MOSDBackoff message to the client to ask it to stop sending requests
for that PG or object. Later, when the OSD recovers (the PG peers or the object is
found), it can send an unblock message, at which point the client resends all
blocked messages.

The primary benefit is that the OSD doesn't need to keep blocked messages around
in memory (the client will resend them later). This prevents a blocked PG or
object from accumulating blocked requests until they exhaust the OSD's memory
limit for client request messages.

@gregsfortytwo
Member

Didn't look at most of this, but I think we need docs describing the invariants and rules before this goes any farther — I'm sure something can be divined based on the code but we really need specific, explainable rules for how this interacts with stuff like flushing (at the user API level) and throttles (for the internal developer) or it'll be completely unsupportable.

@liewegas
Member Author

liewegas commented Dec 9, 2016

@byo please test/review

@tchaikov tchaikov self-assigned this Dec 9, 2016

@byo byo left a comment


@liewegas: Started a review and test of this code; most of it is way beyond my current understanding of the ceph codebase, so I only gave those parts a rather brief reading.

I had to do one patch to unbreak make_debs.sh:

diff --git a/src/test/mon/test_mon_workloadgen.cc b/src/test/mon/test_mon_workloadgen.cc
index c1b26ef..1930689 100644
--- a/src/test/mon/test_mon_workloadgen.cc
+++ b/src/test/mon/test_mon_workloadgen.cc
@@ -32,6 +32,7 @@
 
 #include "osd/osd_types.h"
 #include "osd/OSD.h"
+#include "osd/Session.h"
 #include "osdc/Objecter.h"
 #include "mon/MonClient.h"
 #include "msg/Dispatcher.h"
@@ -900,7 +901,7 @@ class OSDStub : public TestStub
 
   bool ms_handle_reset(Connection *con) {
     dout(1) << __func__ << dendl;
-    OSD::Session *session = (OSD::Session *)con->get_priv();
+    Session *session = (Session *)con->get_priv();
     if (!session)
       return false;
     session->put();

I will deploy this in my test environment now, let's see how it works.


Ordinarily the OSD will simply queue any requests it can't immeidately
process in memory until such time as it can. This can become
problematic because the OSD limits the total amount of RAM consuemd by

Typo "consuemd" => "consumed"

Ordinarily the OSD will simply queue any requests it can't immeidately
process in memory until such time as it can. This can become
problematic because the OSD limits the total amount of RAM consuemd by
incoming messages: if the threshold is reached, new messages will not

Shouldn't we also mention the limit on the number of simultaneously processed messages (independent of their size)?

/*
* Ceph - scalable distributed file system
*
* Copyright (C) 2004-2006 Sage Weil <sage@newdream.net>

I guess this should be updated ;)

Either way, the request ultimately targets a PG. The client sends the
request to the primary for the associated PG.

Each request is assigned a unique tid.

Does this have to increase with each request? It looks like the code does require this: https://github.com/ceph/ceph/pull/12342/files#diff-dfb9ddca0a3ee32b266623e8fa489626R2408

@byo

byo commented Dec 13, 2016

@liewegas: I'm getting a client crash when running rados bench after applying this patch. Currently trying to find a test scenario that easily reproduces it.

@byo

byo commented Dec 14, 2016

The crash I'm getting is pretty awkward. What I'm doing is: take 3 OSDs down, all from different racks (different failure domains) so that there are blocked objects, then start rados bench -p rbd -t 1200 1200 write, wait for it to initialize, and while it's still running, start bringing the OSDs back up.

Most of the time I'm getting stack traces like:

(gdb) bt
#0 0x00007fffef5ca38a in Objecter::_check_op_pool_dne (this=this@entry=0x5555558c24b0,
op=op@entry=0x555555918240, sl=...)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/osdc/Objecter.cc:1512
#1 0x00007fffef5d049e in Objecter::_scan_requests (this=this@entry=0x5555558c24b0,
s=s@entry=0x5555558e6d10, force_resend=force_resend@entry=false, cluster_full=cluster_full@entry=false,
pool_full_map=pool_full_map@entry=0x7fffe0d5de50, need_resend=..., need_resend_linger=...,
need_resend_command=..., sul=...)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/osdc/Objecter.cc:1091
#2 0x00007fffef5dcb81 in Objecter::handle_osd_map (this=this@entry=0x5555558c24b0,
m=m@entry=0x7fffdc0131b0)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/osdc/Objecter.cc:1270
#3 0x00007fffef5de957 in Objecter::ms_dispatch (this=0x5555558c24b0, m=0x7fffdc0131b0)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/osdc/Objecter.cc:1010
#4 0x00007fffef81a8db in ms_deliver_dispatch (m=0x7fffdc0131b0, this=0x555555835900)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/msg/Messenger.h:593
#5 DispatchQueue::entry (this=0x555555835a50)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/msg/DispatchQueue.cc:197
#6 0x00007fffef6b971d in DispatchQueue::DispatchThread::entry (this=<optimized out>)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/msg/DispatchQueue.h:103
#7 0x00007fffe68db182 in start_thread (arg=0x7fffe0d60700) at pthread_create.c:312
#8 0x00007fffe5be400d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

I've also seen this:

(gdb) bt
#0 0x00007fffe5b20c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fffe5b24028 in __GI_abort () at abort.c:89
#2 0x00007fffef62ec85 in ceph::__ceph_assert_fail (assertion=<optimized out>, file=<optimized out>, line=2110,
func=0x7fffef8a9e30 <Objecter::tick()::__PRETTY_FUNCTION__> "void Objecter::tick()") at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/common/assert.cc:78
#3 0x00007fffef5d214d in Objecter::tick (this=0x5555558c2650) at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/osdc/Objecter.cc:2110
#4 0x00007fffef5a7831 in operator() (this=0x5555558cc6f0) at /usr/include/c++/4.8/functional:2471
#5 ceph::timer_detail::timer<ceph::time_detail::mono_clock>::timer_thread (this=0x5555558c2850)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/common/ceph_timer.h:137
#6 0x00007fffe647ca60 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007fffe68db184 in start_thread (arg=0x7fffe155d700) at pthread_create.c:312
#8 0x00007fffe5be437d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Worth noting: I was running rados bench from the mon node. This mon runs in a docker container (old style, lxc-based) on ubuntu 14.04. I was using some older packages at first, but the same thing happened once I fully upgraded to the newest 14.04 packages.

I also tested it on a totally separate VM, starting with a clean, up-to-date 14.04 install and the packages I compiled and used on the cluster, and guess what: it works flawlessly; I couldn't reproduce the crash. This VM will definitely have much higher network latency when talking to the cluster, maybe that's the reason?

On the other hand, I also tried to test this version using fio, but was unable to run it; it failed at:

/tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/common/ceph_crypto.cc: 77: FAILED assert(crypto_context != __null)
I was running fio 2.1.11; I haven't yet checked the newest version.

@liewegas liewegas force-pushed the wip-backoff branch 4 times, most recently from 4f5133c to 4b8b66d Compare December 22, 2016 03:51
@liewegas liewegas force-pushed the wip-backoff branch 3 times, most recently from 0a274cc to 0f468d3 Compare December 28, 2016 18:45
@liewegas liewegas force-pushed the wip-backoff branch 2 times, most recently from f41c4fc to 5912fa3 Compare February 2, 2017 19:29
@liewegas liewegas changed the title DNM: osd,osdc: pg and osd-based backoff osd,osdc: pg and osd-based backoff Feb 2, 2017
@liewegas liewegas requested a review from jdurgin February 2, 2017 19:53
@liewegas
Member Author

liewegas commented Feb 2, 2017

http://pulpito.ceph.com/sage-2017-02-02_14:58:21-rados:thrash-wip-backoff---basic-smithi is looking good! I made a small change and need to send it through one more time, but I don't expect problems.

@liewegas
Member Author

liewegas commented Feb 6, 2017

http://pulpito.ceph.com/sage-2017-02-05_18:49:00-rados:thrash-wip-backoff---basic-smithi/ is pretty clean (the failures are related to the lab cluster). The exception is one assert from a live obc in on_flushed, which is either a new regression in master or related to backoff... :/

@liewegas
Member Author

liewegas commented Feb 6, 2017

Never mind, that failure is an unrelated bug (already fixed). This is ready for final review and (hopefully) merge!

@byo

byo commented Feb 6, 2017

Great to see this; we'll try to put it into our testing machinery soon.

@liewegas
Member Author

liewegas commented Feb 6, 2017 via email

src/osd/OSD.cc Outdated
}
spg_t pgid;
if (!osdmap->get_primary_shard(_pgid, &pgid)) {
// missing pool or acting set empty -- drop

@byo byo Feb 9, 2017


Is this a normal situation? If the backoff spans multiple PGs, this could break out of the loop in the middle, before ops for all PGs are enqueued.

Member Author


good point, fixing

src/osd/OSD.cc Outdated
// map hobject range to PG(s)
bool queued = false;
hobject_t pos = m->begin;
while (true) {

This could be rephrased as do...while

@byo

byo commented Feb 9, 2017

@liewegas I did test it in our infrastructure; so far it looks really good: it works as expected, with no issues under load. I briefly checked the code and left a few minor comments (a few old ones also still apply).
I hope this ends up in master soon; is it going to be part of luminous?

@liewegas
Member Author

liewegas commented Feb 9, 2017

@byo thanks, updated! This will go into master shortly and will definitely be part of luminous.

Member

@jdurgin jdurgin left a comment


You got all the corner cases for resetting backoffs that I could think of. RefCountedObject priv object ownership is confusingly asymmetric, but not worth changing now.

This lets us avoid an rbtree lookup.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
No reason to waste CPU recalculating a hash value!

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Link Backoff into Session, PG.  Tear them down on session
reset.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Issue at top of do_request.  Release on activation or peering
interval change.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Do these midway down do_op.  Reorder the scrub waitlist after the
degraded and unreadable waitlists.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
…ding backoffs

Signed-off-by: Sage Weil <sage@redhat.com>