Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

msg: mark daemons down on RST + ECONNREFUSED #8558

Merged
merged 6 commits into from Sep 17, 2016

Conversation

branch-predictor
Copy link
Contributor

@branch-predictor branch-predictor commented Apr 12, 2016

When a daemon goes down (because it is killed or suicided because of assert)
on otherwise healthy machine, TCP stack will take care of its connections
and send RST (connection reset) packets to all daemons that were connected
to downed daemon. OSDs already handle that and attempt to reconnect, with
grace timer starting to count. When the grace timer runs out and daemons
still cannot reconnect, OSD in question is marked down.
This changeset adds additional handler (handle_refused()) to the dispatchers
and code that detects when connection attempt fails with ECONNREFUSED error
(connection refused) which is a clear indication that host is alive, but
daemon isn't, so daemons can instantly mark the other side as undoubtly
downed without the need for grace timer.
This changeset also adds more info to connections so figuring out which OSD
actually failed is a bit easier.
In current state, only OSDs take advantage of handle_refused() facility,
but I don't see why other daemons shouldn't.

Signed-off-by: Piotr Dałek git@predictor.org.pl

@branch-predictor
Copy link
Contributor Author

This is more of PoC that doesn't take Async messenger into consideration (yet). Comments are welcome.

ldout(msgr->cct, 10) << "connection refused!" << dendl;
if (connection_state->clear_pipe(this))
msgr->dispatch_queue.queue_refused(connection_state.get());
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't done a careful review, but the idea of this patch looks good.

This part isn't really safe though. We can get errors which don't directly call fault() or meander a bit before ending up here. We'll want some kind of shutdown error flag we can set at the failure site and check here or elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Tue, Apr 12, 2016 at 07:35:43AM -0700, Gregory Farnum wrote:

I haven't done a careful review, but the idea of this patch looks good.

This part isn't really safe though. We can get errors which don't directly call fault() or meander a bit before ending up here. We'll want some kind of shutdown error flag we can set at the failure site and check here or elsewhere.

I'll look into this, but I'm quite sure that ECONNREFUSED here would only
occur after attempt to connect() to port that isn't listening.

@yuyuyu101
Copy link
Member

looks a interesting idea, refuse is a good semantic to decide behavior

@branch-predictor branch-predictor force-pushed the bp-mark-down-on-perm-rst branch 2 times, most recently from d49e353 to 35573de Compare April 13, 2016 12:22
@branch-predictor
Copy link
Contributor Author

Rebased and moved connection error checking to right after connect, that fixes some test failures. Thanks @gregsfortytwo!

osdmap->get_inst(id),
cct->_conf->osd_heartbeat_grace + 1,
osdmap->get_epoch()
));
Copy link
Member

@liewegas liewegas Apr 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's update teh message properly.. something like

diff --git a/src/messages/MOSDFailure.h b/src/messages/MOSDFailure.h
index a1032e6..0ff1342 100644
--- a/src/messages/MOSDFailure.h
+++ b/src/messages/MOSDFailure.h
@@ -24,22 +24,34 @@ class MOSDFailure : public PaxosServiceMessage {
   static const int HEAD_VERSION = 3;
 
  public:
+  enum {
+    FLAG_FAILED = 1,     // if set, failure; if not, recovery
+    FLAG_IMMEDIATE = 2,  // known failure, not a timeout
+  };
+  
   uuid_d fsid;
   entity_inst_t target_osd;
-  __u8 is_failed;
+  __u8 flags;
   epoch_t       epoch;
   int32_t failed_for;  // known to be failed since at least this long
 
   MOSDFailure() : PaxosServiceMessage(MSG_OSD_FAILURE, 0, HEAD_VERSION) { }
   MOSDFailure(const uuid_d &fs, const entity_inst_t& f, int duration, epoch_t e)
     : PaxosServiceMessage(MSG_OSD_FAILURE, e, HEAD_VERSION),
-      fsid(fs), target_osd(f), is_failed(true), epoch(e), failed_for(duration) { }
+      fsid(fs), target_osd(f),
+      flags(FLAG_FAILED),
+      epoch(e), failed_for(duration) { }
 private:
   ~MOSDFailure() {}
 
 public: 
   entity_inst_t get_target() { return target_osd; }
-  bool if_osd_failed() { return is_failed; }
+  bool if_osd_failed() {
+    return flags & FLAG_FAILED;
+  }
+  bool is_immediate() {
+    return flags & FLAG_IMMEDIATE;
+  }
   epoch_t get_epoch() { return epoch; }
 
   void decode_payload() {
@@ -49,9 +61,9 @@ public:
     ::decode(target_osd, p);
     ::decode(epoch, p);
     if (header.version >= 2)
-      ::decode(is_failed, p);
+      ::decode(flags, p);
     else
-      is_failed = true;
+      flags = FLAG_FAILED;
     if (header.version >= 3)
       ::decode(failed_for, p);
     else
@@ -62,14 +74,15 @@ public:
     ::encode(fsid, payload);
     ::encode(target_osd, payload);
     ::encode(epoch, payload);
-    ::encode(is_failed, payload);
+    ::encode(flags, payload);
     ::encode(failed_for, payload);
   }
 
   const char *get_type_name() const { return "osd_failure"; }
   void print(ostream& out) const {
     out << "osd_failure("
-       << (is_failed ? "failed " : "recovered ")
+       << (if_osd_failed() ? "failed " : "recovered ")
+       << (is_immediate() ? "immediate " : "timeout ")
        << target_osd << " for " << failed_for << "sec e" << epoch
        << " v" << version << ")";
   }

then we can use this for other stuff too, like a watchdog.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Fri, Apr 22, 2016 at 09:28:13AM -0700, Sage Weil wrote:

Let's update teh message properly.. something like

[..]
public:

  • enum {
  • FLAG_FAILED = 1, // if set, failure; if not, recovery
  • FLAG_IMMEDIATE = 2, // known failure, not a timeout
  • };
  • [..]
    then we can use this for other stuff too, like a watchdog.

And introduce other daemon failure modes, too.

Piotr Dałek

@liewegas
Copy link
Member

Looking at this again, I think this approach makes a lot of sense, and should be safe, too. Worst case, we mark an osd down quickly instead of slowly, which is not much of a problem.

Can you break this apart into separate patches? e.g.,

  • add the new msgr callback
  • implement it in msg/simple
  • update failure message
  • update mon to use immediate flag
  • update osd to implement hook

@branch-predictor branch-predictor force-pushed the bp-mark-down-on-perm-rst branch 2 times, most recently from b2765f4 to d699678 Compare May 5, 2016 20:24
@branch-predictor
Copy link
Contributor Author

@liewegas I have broken it into a bunch of separate commits, added new failure flag to MOSDFailure class and new option to disable new behaivior.

@yuyuyu101
Copy link
Member

back from ML, we can continue review this?

<< ", " << cpp_strerror(rc) << dendl;
<< ", " << cpp_strerror(stored_errno) << dendl;
if (stored_errno == ECONNREFUSED) {
ldout(msgr->cct, 10) << "connection refused!" << dendl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to set level to 0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would go with 1 or 2

@liewegas
Copy link
Member

I like this better!

  1. Please make the second patch only add the callbacks, but leave them empty.. don't do anytihng fancy in OSD yet.
  2. then update the failure message
  3. then in the final patch add the osd config option that will send a fast fail message from OSD::ms_refused.

@liewegas liewegas changed the title [RFC] [DNM] msg/simple: mark daemons down on RST + ECONNREFUSED msg/simple: mark daemons down on RST + ECONNREFUSED May 11, 2016
@yuyuyu101
Copy link
Member

if you could add async msgr support, it would be good since it should be only two line code

@branch-predictor branch-predictor force-pushed the bp-mark-down-on-perm-rst branch 4 times, most recently from f25f438 to 470cf9c Compare May 22, 2016 13:39
@branch-predictor
Copy link
Contributor Author

@yuyuyu101 Done, please take a look.
@liewegas rebased and split again.

@branch-predictor branch-predictor changed the title msg/simple: mark daemons down on RST + ECONNREFUSED msg: mark daemons down on RST + ECONNREFUSED May 22, 2016
@yuyuyu101
Copy link
Member

@branch-predictor I think we set peer osd id in Session instead of in Connection, it would be make more sense.

@yuyuyu101
Copy link
Member

Another question, do we really need immediate flag? If connection refused, we only send a normal MOSDFailure message. If osd down, just like heartbeat, multi osds will send MOSDFailure in handle_refuse function. So it won't occur too much latency?

Because I concern the immediate flag will cause osd flipping in unknown env..... Like message delay?

@branch-predictor
Copy link
Contributor Author

On Wed, May 25, 2016 at 03:09:05AM -0700, Haomai Wang wrote:

Another question, do we really need immediate flag? If connection refused, we only send a normal MOSDFailure message. If osd down, just like heartbeat, multi osds will send MOSDFailure in handle_refuse function. So it won't occur too much latency?

Because I concern the immediate flag will cause osd flipping in unknown env..... Like message delay?

We set FLAG_IMMEDIATE only in ms_handle_refused as a signal that we're
certain that OSD won't get up soon. In other cases (regular hertbeat
timeout) we don't set it and it's up to mons to decide whether they should
be marked as down or not. As for flapping, we discussed it on ceph-devel ml
and came to conclusion that it requires either broken firewall or network
configuration to cause this, and these are more serious issues that should
be resolved first before worrying about OSDs flapping (either way, flapping
OSDs could be good for getting someone's attention).

Piotr Dałek
branch@predictor.org.pl
http://blog.predictor.org.pl

@branch-predictor branch-predictor force-pushed the bp-mark-down-on-perm-rst branch 4 times, most recently from 412debc to 4873351 Compare August 20, 2016 22:01
@yuyuyu101
Copy link
Member

I felt a little nervous to pass osd id via caller code. what if use OSDMap::identify_osd(const entity_addr_t& addr) to get osd id in ms_handle_refused()? @liewegas

@yuyuyu101
Copy link
Member

hmm, I think we don't have a way let server side connection get peer id in messenger level. so I think identitfy_osd via entity_addr_t is a reasonable way

@liewegas
Copy link
Member

@yuyuyu101 Yeah, I was a bit nervous about that too. That means we would drop 6841eff.

@branch-predictor branch-predictor force-pushed the bp-mark-down-on-perm-rst branch 3 times, most recently from 086cd67 to c8a9ce3 Compare August 27, 2016 22:05
@branch-predictor
Copy link
Contributor Author

identify_osd() alone is not enough as it doesn't take hb addrs into account. I just added identify_osd_on_all_channels() to work around it, not influencing any existing users.

@yuriw
Copy link
Contributor

yuriw commented Sep 12, 2016

@branch-predictor I am trying to rebuild with this PR and got conflicts:

[yuriw@smithi018 build]$ git pull https://github.com/branch-predictor/ceph.git bp-mark-down-on-perm-rst
From https://github.com/branch-predictor/ceph

  • branch bp-mark-down-on-perm-rst -> FETCH_HEAD
    Auto-merging src/test/msgr/test_msgr.cc
    CONFLICT (modify/delete): src/test/Makefile.am deleted in HEAD and modified in c8a9ce3. Version c8a9ce3 of src/test/Makefile.am left in tree.
    Auto-merging src/osdc/Objecter.cc
    Auto-merging src/osd/OSD.cc
    Auto-merging src/msg/DispatchQueue.h
    Auto-merging src/common/config_opts.h
    Auto-merging src/client/Client.h
    Auto-merging src/client/Client.cc
    Automatic merge failed; fix conflicts and then commit the result.

pls fix and assign "needs-qa" tag

Added new callback (ms_handle_refused) to dispatchers. It is called
once connection attempt fails with ECONNREFUSED.
Also added dummy ms_handle_refused handlers across codebase.

Signed-off-by: Piotr Dałek <git@predictor.org.pl>
Added implementation of ms_handle_refused in OSD code, so it sends
MOSDFailure message in case the peer connection fails with ECONNREFUSED
*and* it is known to be up and new option "osd fast fail on connection
refused" which enables or disables new behavior.

Signed-off-by: Piotr Dałek <git@predictor.org.pl>
…lure

Change "is_failed" field to "flags" and use it to distinguish between timeout
and immediate, known OSD failure. Then use that in OSD and MON, and make sure
"min_reporters" don't affect known failures by actually going around failure
heuristic code.

Signed-off-by: Piotr Dałek <git@predictor.org.pl>
This commit adds code that detects ECONNREFUSED and dispatches appropriate
event further in Async messenger.

Signed-off-by: Piotr Dałek <git@predictor.org.pl>
Test checks both async and simple messenger and also checks whether
disabling "osd fast fail on connection refused" option restores old
behavior.

Signed-off-by: Piotr Dałek <git@predictor.org.pl>
Doesn't seem to be necessary anymore.

Signed-off-by: Piotr Dałek <git@predictor.org.pl>
@branch-predictor
Copy link
Contributor Author

@yuriw This is because of #11007 being merged. Rebased - should be OK now (only the Makefile.am part of commit was removed).
Unfortunately I don't have tagging rights so I can't put needs-qa tag.

@yuriw
Copy link
Contributor

yuriw commented Sep 13, 2016

@branch-predictor thx
ack @liewegas

@yuriw yuriw merged commit a033dc6 into ceph:master Sep 17, 2016
@branch-predictor branch-predictor deleted the bp-mark-down-on-perm-rst branch January 24, 2018 11:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants