msg: mark daemons down on RST + ECONNREFUSED #8558

branch-predictor · 2016-04-12T14:01:52Z

When a daemon goes down (because it was killed, or committed suicide on a failed assert)
on an otherwise healthy machine, the TCP stack takes care of its connections
and sends RST (connection reset) packets to all daemons that were connected
to the downed daemon. OSDs already handle that and attempt to reconnect, with
the grace timer starting to count. When the grace timer runs out and the daemons
still cannot reconnect, the OSD in question is marked down.

This changeset adds an additional handler (handle_refused()) to the dispatchers,
and code that detects when a connection attempt fails with the ECONNREFUSED error
(connection refused). That error is a clear indication that the host is alive but
the daemon isn't, so daemons can instantly mark the other side as undoubtedly
down without waiting for the grace timer.

This changeset also adds more info to connections, so figuring out which OSD
actually failed is a bit easier.

In its current state, only OSDs take advantage of the handle_refused() facility,
but I don't see why other daemons shouldn't.

Signed-off-by: Piotr Dałek <git@predictor.org.pl>
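A minimal sketch of the idea, assuming a simplified dispatcher: `PeerEvent`, `classify_connect_error()`, `ToyDispatcher` and `on_connect_failed()` are hypothetical names for illustration, not Ceph's messenger API. Only ECONNREFUSED triggers the new refused callback; any other connect() failure falls back to the existing reconnect-and-grace-timer path.

```cpp
#include <cassert>
#include <cerrno>
#include <string>

// Hypothetical outcome of a failed outgoing connection attempt.
enum class PeerEvent {
  Refused,  // host answered, but nothing listens on the port: daemon is down
  Retry     // anything else: keep reconnecting, let the grace timer decide
};

// Classify the errno captured right after a failed connect().
// Only ECONNREFUSED is treated as proof that the peer daemon is gone;
// ETIMEDOUT, EHOSTUNREACH, ENETUNREACH etc. may just be network trouble.
inline PeerEvent classify_connect_error(int err) {
  return err == ECONNREFUSED ? PeerEvent::Refused : PeerEvent::Retry;
}

// Hypothetical dispatcher exposing only the new callback.
struct ToyDispatcher {
  int refused_calls = 0;
  std::string last_refused_peer;

  // Called only when the peer host actively refused the connection.
  void ms_handle_refused(const std::string& peer) {
    ++refused_calls;
    last_refused_peer = peer;
  }
};

// Returns true if the failure was reported as "refused"; otherwise the
// caller is expected to back off and retry, as it did before.
inline bool on_connect_failed(ToyDispatcher& d, const std::string& peer, int err) {
  if (classify_connect_error(err) != PeerEvent::Refused)
    return false;
  d.ms_handle_refused(peer);
  return true;
}
```

Keeping the classification in one small function makes it easy to widen or narrow the set of errors treated as "definitely down" later, without touching the dispatch code.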

branch-predictor · 2016-04-12T14:04:20Z

This is more of a PoC that doesn't take the Async messenger into consideration (yet). Comments are welcome.

gregsfortytwo · 2016-04-12T14:35:07Z

src/msg/simple/Pipe.cc

+    ldout(msgr->cct, 10) << "connection refused!" << dendl;
+    if (connection_state->clear_pipe(this))
+      msgr->dispatch_queue.queue_refused(connection_state.get());
+  }


I haven't done a careful review, but the idea of this patch looks good.

This part isn't really safe though. We can get errors which don't directly call fault() or meander a bit before ending up here. We'll want some kind of shutdown error flag we can set at the failure site and check here or elsewhere.

On Tue, Apr 12, 2016 at 07:35:43AM -0700, Gregory Farnum wrote:

> I haven't done a careful review, but the idea of this patch looks good.
>
> This part isn't really safe though. We can get errors which don't directly call fault() or meander a bit before ending up here. We'll want some kind of shutdown error flag we can set at the failure site and check here or elsewhere.

I'll look into this, but I'm quite sure that ECONNREFUSED here would only
occur after an attempt to connect() to a port that isn't listening.

yuyuyu101 · 2016-04-12T14:54:21Z

Looks like an interesting idea; "refused" is a good signal to base the behavior on.

branch-predictor · 2016-04-13T12:24:11Z

Rebased and moved the connection error check to right after connect(); that fixes some test failures. Thanks @gregsfortytwo!
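Greg's concern above, and the fix of checking the error right after connect, both come down to capturing errno at the failure site. A minimal POSIX sketch, assuming a hypothetical `ToyPipe` class (not the real `Pipe` from src/msg/simple), that stores errno immediately after a failed connect() so later code can ask whether the failure was a refusal:

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cerrno>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical, heavily simplified stand-in for a messenger connection.
// errno is captured at the failure site, right after connect(), because
// later calls (close(), logging, ...) may overwrite errno before the
// fault path gets to look at it.
class ToyPipe {
 public:
  int stored_errno = 0;

  ToyPipe() = default;
  ToyPipe(const ToyPipe&) = delete;
  ToyPipe& operator=(const ToyPipe&) = delete;
  ~ToyPipe() { if (sd_ >= 0) ::close(sd_); }

  // Blocking connect; on failure, remembers the reason in stored_errno.
  bool connect_to(const sockaddr_in& addr) {
    if (sd_ >= 0) { ::close(sd_); sd_ = -1; }
    sd_ = ::socket(AF_INET, SOCK_STREAM, 0);
    if (sd_ < 0) {
      stored_errno = errno;
      return false;
    }
    if (::connect(sd_, reinterpret_cast<const sockaddr*>(&addr), sizeof(addr)) < 0) {
      stored_errno = errno;  // capture first, before anything else runs
      ::close(sd_);          // close() is allowed to change errno
      sd_ = -1;
      return false;
    }
    stored_errno = 0;
    return true;
  }

  // Checked later instead of trusting whatever errno holds by then.
  bool refused() const { return stored_errno == ECONNREFUSED; }

 private:
  int sd_ = -1;
};
```

Binding a socket to a loopback port without calling listen() on it is a simple way to provoke the refused case locally: on Linux, connecting to it fails with ECONNREFUSED.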

liewegas · 2016-04-22T16:27:52Z

src/osd/OSD.cc

+                                                osdmap->get_inst(id),
+                                                cct->_conf->osd_heartbeat_grace + 1,
+                                                osdmap->get_epoch()
+                                                ));


Let's update the message properly, something like:

diff --git a/src/messages/MOSDFailure.h b/src/messages/MOSDFailure.h
index a1032e6..0ff1342 100644
--- a/src/messages/MOSDFailure.h
+++ b/src/messages/MOSDFailure.h
@@ -24,22 +24,34 @@ class MOSDFailure : public PaxosServiceMessage {
   static const int HEAD_VERSION = 3;

  public:
+  enum {
+    FLAG_FAILED = 1,     // if set, failure; if not, recovery
+    FLAG_IMMEDIATE = 2,  // known failure, not a timeout
+  };
+
   uuid_d fsid;
   entity_inst_t target_osd;
-  __u8 is_failed;
+  __u8 flags;
   epoch_t       epoch;
   int32_t failed_for;  // known to be failed since at least this long

   MOSDFailure() : PaxosServiceMessage(MSG_OSD_FAILURE, 0, HEAD_VERSION) { }
   MOSDFailure(const uuid_d &fs, const entity_inst_t& f, int duration, epoch_t e)
     : PaxosServiceMessage(MSG_OSD_FAILURE, e, HEAD_VERSION),
-      fsid(fs), target_osd(f), is_failed(true), epoch(e), failed_for(duration) { }
+      fsid(fs), target_osd(f),
+      flags(FLAG_FAILED),
+      epoch(e), failed_for(duration) { }
 private:
   ~MOSDFailure() {}

 public:
   entity_inst_t get_target() { return target_osd; }
-  bool if_osd_failed() { return is_failed; }
+  bool if_osd_failed() {
+    return flags & FLAG_FAILED;
+  }
+  bool is_immediate() {
+    return flags & FLAG_IMMEDIATE;
+  }
   epoch_t get_epoch() { return epoch; }

   void decode_payload() {
@@ -49,9 +61,9 @@ public:
     ::decode(target_osd, p);
     ::decode(epoch, p);
     if (header.version >= 2)
-      ::decode(is_failed, p);
+      ::decode(flags, p);
     else
-      is_failed = true;
+      flags = FLAG_FAILED;
     if (header.version >= 3)
       ::decode(failed_for, p);
     else
@@ -62,14 +74,15 @@ public:
     ::encode(fsid, payload);
     ::encode(target_osd, payload);
     ::encode(epoch, payload);
-    ::encode(is_failed, payload);
+    ::encode(flags, payload);
     ::encode(failed_for, payload);
   }

   const char *get_type_name() const { return "osd_failure"; }
   void print(ostream& out) const {
     out << "osd_failure("
-        << (is_failed ? "failed " : "recovered ")
+        << (if_osd_failed() ? "failed " : "recovered ")
+        << (is_immediate() ? "immediate " : "timeout ")
         << target_osd << " for " << failed_for << "sec e" << epoch
         << " v" << version << ")";
   }

then we can use this for other stuff too, like a watchdog.
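One detail worth spelling out in the proposed diff: replacing `__u8 is_failed` with `__u8 flags` in the same encoding slot keeps old payloads readable, because an old `is_failed = true` byte (value 1) decodes as `FLAG_FAILED`. A minimal sketch, assuming a hypothetical `ToyFailure` struct that encodes only the flags byte into a `std::vector<uint8_t>` (the real message also carries fsid, target_osd, epoch and failed_for):

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical, minimal model of the flags field proposed above.
struct ToyFailure {
  enum : uint8_t {
    FLAG_FAILED = 1,     // if set, failure; if not, recovery
    FLAG_IMMEDIATE = 2,  // known failure, not a timeout
  };
  uint8_t flags = FLAG_FAILED;

  bool if_osd_failed() const { return flags & FLAG_FAILED; }
  bool is_immediate() const { return flags & FLAG_IMMEDIATE; }

  void encode(std::vector<uint8_t>& out) const { out.push_back(flags); }

  // Version 1 senders had no such byte and only ever reported failures.
  // Later senders that still wrote is_failed (0 or 1) map cleanly onto
  // FLAG_FAILED, because FLAG_FAILED == 1 == (uint8_t)true.
  void decode(const std::vector<uint8_t>& in, unsigned version) {
    if (version >= 2) {
      if (in.empty())
        throw std::runtime_error("truncated payload");
      flags = in[0];
    } else {
      flags = FLAG_FAILED;
    }
  }
};
```

Because the old boolean and the new bitmask share the low bit, no new encoding version is needed just to carry the extra flag.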

On Fri, Apr 22, 2016 at 09:28:13AM -0700, Sage Weil wrote:

> Let's update teh message properly.. something like
>
> [..]
>  public:
> +  enum {
> +    FLAG_FAILED = 1,     // if set, failure; if not, recovery
> +    FLAG_IMMEDIATE = 2,  // known failure, not a timeout
> +  };
> [..]
> then we can use this for other stuff too, like a watchdog.

And introduce other daemon failure modes, too.

Piotr Dałek

liewegas · 2016-04-22T16:29:49Z

Looking at this again, I think this approach makes a lot of sense, and should be safe, too. Worst case, we mark an osd down quickly instead of slowly, which is not much of a problem.

Can you break this apart into separate patches? e.g.,

add the new msgr callback
implement it in msg/simple
update failure message
update mon to use immediate flag
update osd to implement hook

branch-predictor · 2016-05-05T21:22:54Z

@liewegas I have broken it into a bunch of separate commits, added a new failure flag to the MOSDFailure class and a new option to disable the new behavior.

yuyuyu101 · 2016-05-11T07:45:48Z

Back from the ML; can we continue reviewing this?

yuyuyu101 · 2016-05-11T08:00:53Z

src/msg/simple/Pipe.cc

-	     << ", " << cpp_strerror(rc) << dendl;
+	     << ", " << cpp_strerror(stored_errno) << dendl;
+    if (stored_errno == ECONNREFUSED) {
+      ldout(msgr->cct, 10) << "connection refused!" << dendl;


I'd like to set the level to 0.

I would go with 1 or 2.

liewegas · 2016-05-11T14:16:12Z

I like this better!

Please make the second patch only add the callbacks, but leave them empty; don't do anything fancy in the OSD yet.
Then update the failure message.
Then, in the final patch, add the OSD config option that will send a fast fail message from OSD::ms_refused.
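The final step (an OSD config option gating a fast-fail report from the refused callback) might look roughly like this sketch. `ToyOSD`, `ToyOSDMap` and `ToyFailureReport` are hypothetical stand-ins; the boolean member is only modelled after the "osd fast fail on connection refused" option named in the commit messages, and the up-check mirrors the "*and* it is known to be up" condition from the OSD commit.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-ins; the real OSD uses OSDMap, MOSDFailure and the
// monitor client. Only the decision logic is modelled here.
struct ToyOSDMap {
  std::map<int, bool> up;  // osd id -> currently marked up?
  bool is_up(int id) const {
    auto it = up.find(id);
    return it != up.end() && it->second;
  }
};

struct ToyFailureReport {
  int target;
  bool immediate;  // would correspond to FLAG_IMMEDIATE
};

struct ToyOSD {
  bool fast_fail_on_connection_refused = true;  // assumed config toggle
  ToyOSDMap osdmap;
  std::vector<ToyFailureReport> sent;  // reports "sent" to the monitor

  // Called by the messenger when connecting to `peer` fails with
  // ECONNREFUSED. Returns true if an immediate failure report was queued.
  bool ms_handle_refused(int peer) {
    if (!fast_fail_on_connection_refused)
      return false;  // old behaviour: wait for the heartbeat grace
    if (peer < 0 || !osdmap.is_up(peer))
      return false;  // not an OSD, or already marked down
    sent.push_back({peer, true});
    return true;
  }
};
```

Skipping peers that are already down avoids flooding the monitor with redundant reports for an OSD the map already considers failed.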

yuyuyu101 · 2016-05-11T14:26:55Z

If you could add async msgr support, that would be good; it should only take two lines of code.

branch-predictor · 2016-05-22T13:48:39Z

@yuyuyu101 Done, please take a look.
@liewegas rebased and split again.

yuyuyu101 · 2016-05-25T10:05:02Z

@branch-predictor I think we should set the peer OSD id in Session instead of in Connection; that would make more sense.

yuyuyu101 · 2016-05-25T10:08:42Z

Another question: do we really need the immediate flag? If the connection is refused, we could just send a normal MOSDFailure message. If an OSD is down then, just like with heartbeats, multiple OSDs will send MOSDFailure from their handle_refused function, so it shouldn't add much latency, should it?

My concern is that the immediate flag could cause OSD flapping in unknown environments, e.g. with message delays.

branch-predictor · 2016-05-25T10:20:54Z

On Wed, May 25, 2016 at 03:09:05AM -0700, Haomai Wang wrote:

> Another question, do we really need immediate flag? If connection refused, we only send a normal MOSDFailure message. If osd down, just like heartbeat, multi osds will send MOSDFailure in handle_refuse function. So it won't occur too much latency?
>
> Because I concern the immediate flag will cause osd flipping in unknown env..... Like message delay?

We set FLAG_IMMEDIATE only in ms_handle_refused, as a signal that we're
certain the OSD won't come back up soon. In other cases (regular heartbeat
timeout) we don't set it, and it's up to the mons to decide whether the OSD
should be marked down or not. As for flapping, we discussed it on the ceph-devel ML
and came to the conclusion that it requires either a broken firewall or a broken network
configuration to cause this, and these are more serious issues that should
be resolved first before worrying about OSDs flapping (either way, flapping
OSDs could be good for getting someone's attention).

Piotr Dałek
branch@predictor.org.pl
http://blog.predictor.org.pl

yuyuyu101 · 2016-08-21T05:54:14Z

I feel a little nervous about passing the OSD id via the caller code. What if we use OSDMap::identify_osd(const entity_addr_t& addr) to get the OSD id in ms_handle_refused()? @liewegas

yuyuyu101 · 2016-08-21T05:57:01Z

Hmm, I don't think we have a way to let a server-side connection get the peer id at the messenger level, so I think identify_osd() via entity_addr_t is a reasonable way.

liewegas · 2016-08-21T23:23:43Z

@yuyuyu101 Yeah, I was a bit nervous about that too. That means we would drop 6841eff.

branch-predictor · 2016-08-27T22:33:46Z

identify_osd() alone is not enough, as it doesn't take heartbeat (hb) addresses into account. I just added identify_osd_on_all_channels() to work around that without affecting any existing users.
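The refused callback only knows the peer address, so it has to map that address back to an OSD id, and heartbeat connections use their own addresses. A minimal sketch with hypothetical `ToyOSDAddrs` and `ToyAddrMap` types, using "ip:port" strings instead of `entity_addr_t`; `identify_osd()` here is assumed to check only the public and cluster addresses, while `identify_osd_on_all_channels()` also checks the two heartbeat addresses, as described in the comment above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-OSD address record; the real OSDMap stores
// entity_addr_t values, modelled here as "ip:port" strings.
struct ToyOSDAddrs {
  std::string public_addr, cluster_addr, hb_back_addr, hb_front_addr;
};

struct ToyAddrMap {
  std::vector<ToyOSDAddrs> osds;  // index == osd id

  // Assumed to match only client-facing and replication addresses.
  int identify_osd(const std::string& a) const {
    for (std::size_t i = 0; i < osds.size(); ++i)
      if (osds[i].public_addr == a || osds[i].cluster_addr == a)
        return static_cast<int>(i);
    return -1;
  }

  // Also matches heartbeat addresses, so a refused heartbeat connection
  // can still be attributed to the right OSD.
  int identify_osd_on_all_channels(const std::string& a) const {
    for (std::size_t i = 0; i < osds.size(); ++i) {
      const ToyOSDAddrs& o = osds[i];
      if (o.public_addr == a || o.cluster_addr == a ||
          o.hb_back_addr == a || o.hb_front_addr == a)
        return static_cast<int>(i);
    }
    return -1;
  }
};
```

Adding a separate lookup, rather than widening identify_osd(), leaves the behaviour of existing callers unchanged.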

yuriw · 2016-09-12T17:30:12Z

@branch-predictor I am trying to rebuild with this PR and got conflicts:

[yuriw@smithi018 build]$ git pull https://github.com/branch-predictor/ceph.git bp-mark-down-on-perm-rst
From https://github.com/branch-predictor/ceph
 * branch            bp-mark-down-on-perm-rst -> FETCH_HEAD
Auto-merging src/test/msgr/test_msgr.cc
CONFLICT (modify/delete): src/test/Makefile.am deleted in HEAD and modified in c8a9ce3. Version c8a9ce3 of src/test/Makefile.am left in tree.
Auto-merging src/osdc/Objecter.cc
Auto-merging src/osd/OSD.cc
Auto-merging src/msg/DispatchQueue.h
Auto-merging src/common/config_opts.h
Auto-merging src/client/Client.h
Auto-merging src/client/Client.cc
Automatic merge failed; fix conflicts and then commit the result.

Please fix and assign the "needs-qa" tag.


Added an implementation of ms_handle_refused in the OSD code, so it sends an MOSDFailure message when the peer connection fails with ECONNREFUSED *and* the peer is known to be up, plus a new option "osd fast fail on connection refused" which enables or disables the new behavior. Signed-off-by: Piotr Dałek <git@predictor.org.pl>

…lure Change the "is_failed" field to "flags" and use it to distinguish between a timeout and an immediate, known OSD failure. Then use that in the OSD and MON, and make sure "min_reporters" doesn't affect known failures by bypassing the failure heuristic code. Signed-off-by: Piotr Dałek <git@predictor.org.pl>
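The monitor-side change described in this commit message (known failures going around the min_reporters heuristic) can be pictured with a minimal sketch. `ToyFailureTracker`, `report()` and the two-reporter default are hypothetical; the real monitor logic is considerably more involved and is reduced here to a distinct-reporter count plus a single `grace_expired` boolean.

```cpp
#include <cassert>
#include <set>

// Hypothetical model of the monitor-side decision only. min_reporters and
// grace_expired stand in for the real, more elaborate failure heuristics.
struct ToyFailureTracker {
  unsigned min_reporters = 2;
  std::set<int> reporters;  // distinct OSDs that reported the target

  // Returns true if the target OSD should be marked down now.
  bool report(int reporter, bool immediate, bool grace_expired) {
    if (immediate)
      return true;  // known failure (e.g. ECONNREFUSED): bypass heuristics
    reporters.insert(reporter);
    return grace_expired && reporters.size() >= min_reporters;
  }
};
```

This also shows the trade-off raised in review: a single immediate report is enough to mark an OSD down, so a misbehaving network that produces spurious refusals would surface as flapping rather than being smoothed over by the reporter threshold.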




branch-predictor · 2016-09-13T18:04:02Z

@yuriw This is because #11007 was merged. Rebased; it should be OK now (only the Makefile.am part of the commit was removed).
Unfortunately I don't have tagging rights, so I can't add the needs-qa tag.

yuriw · 2016-09-13T18:13:54Z

@branch-predictor thx
ack @liewegas

yuriw · 2016-09-17T15:48:38Z

http://pulpito.front.sepia.ceph.com:80/yuriw-2016-09-14_16:05:47-rados-wip-yuri-testing_2016_09_14-distro-basic-smithi/

http://pulpito.ceph.com/yuriw-2016-09-15_15:38:24-rados-wip-yuri-testing_2016_09_14-distro-basic-smithi/

gregsfortytwo reviewed Apr 12, 2016

branch-predictor force-pushed the bp-mark-down-on-perm-rst branch 2 times, most recently from d49e353 to 35573de April 13, 2016 12:22

liewegas added pending-discussion feature core labels Apr 22, 2016

liewegas reviewed Apr 22, 2016

branch-predictor force-pushed the bp-mark-down-on-perm-rst branch 2 times, most recently from b2765f4 to d699678 May 5, 2016 20:24

yuyuyu101 reviewed May 11, 2016

liewegas removed the pending-discussion label May 11, 2016

liewegas changed the title ~~[RFC] [DNM] msg/simple: mark daemons down on RST + ECONNREFUSED~~ msg/simple: mark daemons down on RST + ECONNREFUSED May 11, 2016

branch-predictor force-pushed the bp-mark-down-on-perm-rst branch 4 times, most recently from f25f438 to 470cf9c May 22, 2016 13:39

branch-predictor changed the title ~~msg/simple: mark daemons down on RST + ECONNREFUSED~~ msg: mark daemons down on RST + ECONNREFUSED May 22, 2016

branch-predictor force-pushed the bp-mark-down-on-perm-rst branch 4 times, most recently from 412debc to 4873351 August 20, 2016 22:01

branch-predictor force-pushed the bp-mark-down-on-perm-rst branch 3 times, most recently from 086cd67 to c8a9ce3 August 27, 2016 22:05

liewegas added the needs-qa label Aug 29, 2016

yuriw added the wip-yuri-testing label Sep 2, 2016

yuriw removed needs-qa wip-yuri-testing labels Sep 12, 2016

branch-predictor added 6 commits September 13, 2016 19:57

msg/simple: add ms_handle_refused callback

d58d7d3

Added new callback (ms_handle_refused) to dispatchers. It is called once connection attempt fails with ECONNREFUSED. Also added dummy ms_handle_refused handlers across codebase. Signed-off-by: Piotr Dałek <git@predictor.org.pl>

msg/async: implement ECONNREFUSED detection

5083742

This commit adds code that detects ECONNREFUSED and dispatches appropriate event further in Async messenger. Signed-off-by: Piotr Dałek <git@predictor.org.pl>

test/osd: add test for fast mark down functionality

3ed8169

Test checks both async and simple messenger and also checks whether disabling "osd fast fail on connection refused" option restores old behavior. Signed-off-by: Piotr Dałek <git@predictor.org.pl>

osdc: remove MOSDFailure include

675a8de

Doesn't seem to be necessary anymore. Signed-off-by: Piotr Dałek <git@predictor.org.pl>

branch-predictor force-pushed the bp-mark-down-on-perm-rst branch from c8a9ce3 to 675a8de September 13, 2016 17:58

liewegas added the needs-qa label Sep 13, 2016

yuriw added the wip-yuri-testing label Sep 13, 2016

yuriw merged commit a033dc6 into ceph:master Sep 17, 2016

branch-predictor deleted the bp-mark-down-on-perm-rst branch January 24, 2018 11:32
