mon/OSDMonitor: fix process osd failure #12938

LiumxNL · 2017-01-16T08:24:54Z

No description provided.

LiumxNL · 2017-01-16T08:28:29Z

@tchaikov some cleanup, pls review, thanks!

tchaikov · 2017-01-16T10:59:22Z

src/mon/OSDMonitor.cc

-        if (ls.front())
-          mon->no_reply(ls.front());
-	ls.pop_front();
+      MonOpRequestRef record_op = fi.cancel_report(reporter);


if the osd is finally marked failed, these reporters may not receive the latest
update right away,

could you elaborate on this a little bit?

even though one reporter has canceled its failure report, more osds may report this osd as failed afterwards. suppose the mon then has enough reporters to mark this osd down: process_failures() calls take_all_failures(), which collects every reporter's report_message, and the mon sends the latest osdmap only to reporters whose report_message is still recorded. so if we drop the report_message of all the other reporters when one of them cancels its report, those reporters may not receive the update right away, although they will eventually learn of it when a peer shares its osdmap.

note to myself and posterity:

if osd.42 is reported as "failed" by its peer, and later the peer somehow receives a ping_reply from osd.42, the peer sends a failure message to the monitor to cancel the previous failure report. as a side effect, when the monitor handles this message, the report messages from all the other OSD peers are also erased from the failure_info. but their osd ids and failed_since are kept in the failure_info, and they are still taken into consideration in OSDMonitor::check_failure() even though the report messages are reset.

if osd.42 is finally marked down, the osds that did not revert their failure reports will not be updated with the latest osdmap. but we could update them.

the related change was introduced by ad12b0d, so we don't leak the failure messages before canceling the report. but failure_report and failure_reporter_t are able to take care of their life cycles just fine without the fix.
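
a minimal sketch of the behaviour discussed above, assuming simplified stand-in types. `Op`, `Reporter`, `FailureInfo` and `demo_remaining_messages` are hypothetical names for illustration and are not the actual Ceph classes. the point it shows: cancelling one reporter releases only that reporter's message, so the remaining reporters can still be answered once the osd is marked down.

```cpp
#include <cstddef>
#include <initializer_list>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a queued report message held by the monitor.
using Op = std::shared_ptr<std::string>;

struct Reporter {
  int failed_since;  // simplified timestamp, in seconds
  Op op;             // report message kept so the reporter can be answered later
};

struct FailureInfo {
  std::map<int, Reporter> reporters;  // keyed by the reporting osd id

  // Record (or refresh) a report; returns the message it replaced, if any.
  Op add_report(int who, int failed_since, Op op) {
    Op old;
    auto it = reporters.find(who);
    if (it != reporters.end()) {
      old = it->second.op;
    }
    reporters[who] = Reporter{failed_since, std::move(op)};
    return old;
  }

  // Cancel only `who`'s report; the other reporters keep their messages.
  Op cancel_report(int who) {
    auto it = reporters.find(who);
    if (it == reporters.end()) {
      return nullptr;
    }
    Op op = it->second.op;
    reporters.erase(it);
    return op;
  }

  // Hand over every stored message, e.g. once the osd is marked down.
  std::vector<Op> take_report_messages() {
    std::vector<Op> out;
    for (auto& kv : reporters) {
      if (kv.second.op) {
        out.push_back(std::move(kv.second.op));
      }
    }
    return out;
  }
};

// Usage: osd.1, osd.2 and osd.3 report the same target, then osd.1 cancels.
inline std::size_t demo_remaining_messages() {
  FailureInfo fi;
  for (int who : {1, 2, 3}) {
    fi.add_report(who, 100, std::make_shared<std::string>("report"));
  }
  fi.cancel_report(1);  // releases only osd.1's message
  return fi.take_report_messages().size();  // osd.2 and osd.3 remain
}
```

in this model a single map lookup replaces any traversal of all reporters when a report is cancelled.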

tchaikov · 2017-01-16T11:02:05Z

src/mon/OSDMonitor.cc

+    // calculate failure time
+    utime_t now = ceph_clock_now(g_ceph_context);
+    utime_t failed_since =
+    m->get_recv_stamp() - utime_t(m->failed_for, 0);


wrong indent.
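
for reference, a small sketch of the arithmetic in the hunk above, assuming a simplified message type. `FailureMsg` and its fields are hypothetical stand-ins, and plain integer seconds replace `utime_t`. it computes the failure time as receive stamp minus `failed_for`, and only when the message actually reports a failure rather than cancelling one.

```cpp
#include <optional>

// Hypothetical, simplified view of a failure message.
struct FailureMsg {
  bool is_failure;  // true: "osd failed"; false: cancel an earlier report
  long recv_stamp;  // when the monitor received the message, in seconds
  long failed_for;  // how long the reporter has seen the osd as failed
};

// Compute failed_since only for real failure reports; a cancel needs none.
inline std::optional<long> failed_since(const FailureMsg& m) {
  if (!m.is_failure) {
    return std::nullopt;
  }
  return m.recv_stamp - m.failed_for;
}
```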

tchaikov · 2017-01-18T08:29:42Z

src/mon/OSDMonitor.cc

-        if (ls.front())
-          mon->no_reply(ls.front());
-	ls.pop_front();
+      MonOpRequestRef record_op = fi.cancel_report(reporter);


@LiumxNL

nit, could you s/record_op/report_op/? then it's good to qa run.

tchaikov · 2017-01-18T08:31:42Z

modulo the nit, lgtm. just needs to s/record_op/report_op/ before the merge, i think.

…sonable: if the osd is finally marked failed, these reporters may not receive the latest update right away. besides, traversing all the reporters is inefficient.

Signed-off-by: Mingxin Liu <mingxin@xsky.com>

LiumxNL · 2017-01-18T11:17:36Z

@tchaikov updated.

tchaikov · 2017-01-23T12:08:28Z

failed test tracked by http://tracker.ceph.com/issues/18583

tchaikov reviewed Jan 16, 2017

tchaikov requested changes Jan 16, 2017

tchaikov added the mon label Jan 16, 2017

tchaikov self-assigned this Jan 16, 2017

LiumxNL force-pushed the fix-process-osd-failure branch 2 times, most recently from 2479eb2 to a0958bc January 16, 2017 15:28

liewegas changed the title ~~OSDMonitor: fix process osd failure~~ mon/OSDMonitor: fix process osd failure Jan 17, 2017

tchaikov reviewed Jan 18, 2017

tchaikov approved these changes Jan 18, 2017

tchaikov added the needs-qa label Jan 18, 2017

LiumxNL added 2 commits January 18, 2017 19:16

OSDMonitor: calculate failure time only when osd reported failed

0ec21a5

Signed-off-by: Mingxin Liu <mingxin@xsky.com>

LiumxNL force-pushed the fix-process-osd-failure branch from a0958bc to 0ec21a5 January 18, 2017 11:16

tchaikov added the wip-kefu-testing label Jan 22, 2017

tchaikov merged commit 1e8ca9b into ceph:master Jan 23, 2017
