
mon: no delay for single message MSG_ALIVE and MSG_PGTEMP #12107

Merged
merged 1 commit into ceph:master on Jan 16, 2017

Conversation

@mslovy (Contributor) commented Nov 21, 2016

Currently, the monitor may wait paxos_propose_interval before propose_pending. However, we don't want this long wait for MSG_ALIVE and MSG_PGTEMP messages, since it blocks the peering process, so all I/O can block for a long time when adding or removing an OSD in the cluster.
Thus, wait only paxos_min_wait (default = 0.05) for those messages. We can even set paxos_min_wait = 0 to make propose_pending happen immediately if we want.

Signed-off-by: yaoning yaoning@unitedstack.com
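
A minimal standalone sketch of the idea described above (hypothetical helper name and placeholder message constants; not the actual patch):

// Sketch only: pick the propose delay by message type.  The two config
// values mirror ceph's paxos_min_wait and paxos_propose_interval options.
enum { MSG_ALIVE = 1, MSG_PGTEMP = 2 };  // placeholder values for the sketch

double proposal_delay(int msg_type,
                      double paxos_min_wait,          // default 0.05s
                      double paxos_propose_interval)  // normal batching delay
{
  switch (msg_type) {
  case MSG_ALIVE:   // up_thru updates block peering
  case MSG_PGTEMP:  // pg_temp updates block peering
    return paxos_min_wait;          // near-immediate; 0 would mean no wait
  default:
    return paxos_propose_interval;  // keep the normal batching delay
  }
}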

@tchaikov self-assigned this Nov 21, 2016
@gregsfortytwo (Member) left a comment

While it would be nice to push these through immediately, ALIVE and PGTEMP are exactly the message types we need to throttle. These are generated very quickly during thrash storms, and if we generated a new OSDMap for each one, you'd be much more likely to destroy your cluster.

Unless @athanatos or @liewegas have other thoughts, I don't think this is something we can do. Maybe it would work if there were a bunch of extra tracking logic so it only does this for n pg_temp updates within a given time period, so you could let one OSD change go through quickly but throttle once the cluster is in trouble?

@mslovy (Contributor, Author) commented Nov 22, 2016

Yeah, paxos_min_wait (default 0.05) is itself a throttle policy: it ensures there can be at most 20 OSDMap changes per second. So the question we need to answer is whether the monitor can finish at least 20 paxos updates per second. Does anyone here think leveldb or rocksdb cannot finish 20 updates per second?

Also, during thrash storms I think most OSDs remain in an unavailable state, so the purpose is still to make the whole cluster state picture clear as fast as possible, provided the monitor has enough capacity.

Furthermore, I think it is possible to add a throttle counter that tracks the OSDMap updates within one second and is reset in tick(). That way we control the update throughput exactly, and do not wait at all when the throttle allows the request to go on.
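
A rough sketch of that counter idea (hypothetical names, not ceph code):

// Allow at most max_updates_per_sec fast proposals; tick() resets the
// counter once per second, mirroring the monitor's periodic tick.
struct MapUpdateThrottle {
  int count = 0;
  int max_updates_per_sec = 20;

  bool try_propose() {           // called when a fast proposal is requested
    if (count >= max_updates_per_sec)
      return false;              // over budget: fall back to the long wait
    ++count;
    return true;                 // within budget: propose without delay
  }
  void tick() { count = 0; }     // reset the budget every second
};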

@gregsfortytwo (Member)

The question isn't how much the monitor's leveldb can push through; it's how many cluster state updates the OSDs can handle. They need to do a fixed amount of fairly expensive processing on every map update, and spamming out 20 updates every second is guaranteed to put most clusters into a death spiral.

@athanatos (Contributor)

Yeah, OSDMaps are costly to process on the OSD side. That's the main concern.

@mslovy (Contributor, Author) commented Nov 23, 2016

OK, I think the main concern is that it is quite costly for a PG to catch up to the latest map, right? handle_osd_map in class OSD itself is not that expensive.

Here, generally, it seems there are two cases (depending on whether acting/up is affected):

  1. If yes, the state machine has to break out of its current state and enter Reset, as in the flapping-OSD case you mentioned. If OSDMaps are generated too fast, each PG handles every OSDMap update event sequentially, which makes things worse. I think the impact scales with the number of PGs affected during flapping and the number of PGs on each OSD.
  2. If no, things go much better, e.g. updating up_thru according to MSG_ALIVE. The OSDMap does not affect or break any PG's state machine, and everything proceeds as usual.

So can we immediately raise a propose for MSG_ALIVE, but put some further constraint on MSG_PGTEMP?

The goals of the improvement would be:

  1. if the change is predictable, e.g. we just started one OSD, we want MSG_PGTEMP and MSG_ALIVE acked as fast as possible.
  2. if too much flapping occurs, throttle the OSDMap updates to make sure the cluster keeps working fine.

so we can use an enum like

enum propose_wait_t {
  NO_WAIT,   // propose immediately
  MIN_WAIT,  // wait paxos_min_wait
  WAIT       // wait the full paxos_propose_interval
};

to indicate what should be done.

The delay would then be determined by the throttle strategy, the message type, and so on, right?
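
For illustration, a sketch of how the propose_wait_t enum above could map to an actual delay (hypothetical helper, not part of the PR):

double delay_for(propose_wait_t w,
                 double paxos_min_wait, double paxos_propose_interval)
{
  switch (w) {
  case NO_WAIT:  return 0.0;                     // propose right away
  case MIN_WAIT: return paxos_min_wait;          // short batching window
  case WAIT:     return paxos_propose_interval;  // full throttling delay
  }
  return paxos_propose_interval;  // unreachable; silences compiler warnings
}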

I will reconsider and propose a new strategy later.

@ghost added the core label Nov 23, 2016
@athanatos (Contributor)

Mmm, we still probably want to combine a bunch of up_thru messages into one map if possible.

@mslovy changed the title from "mon: delay for a short period (paxos_min_wait) for message MSG_ALIVE and MSG_PGTEMP" to "mon: no delay for single message MSG_ALIVE and MSG_PGTEMP" Nov 25, 2016
@mslovy (Contributor, Author) commented Nov 25, 2016

Yeah, I agree that it is better to combine a bunch of up_thru messages.

Actually, what we want is:

  1. when starting a single OSD, we want the peering process to finish immediately, without waiting
  2. after backfilling, pg_temp is cleared and the PG is activated again; we want this process to finish immediately too

So what about the current, updated PR?

@mslovy force-pushed the wip-monmsg-minwait branch 3 times, most recently from 0e0fb57 to de2ceb6 on November 25, 2016 at 10:32
@liewegas self-assigned this Nov 26, 2016
last_no_wait_time = now;
return true;
}
}
Member commented on the diff:

This logic doubles the number of OSDMaps we'd otherwise generate: we'll do two every paxos_propose_interval. You'd need something at least a little more sophisticated. Maybe it would work if you renamed last_no_wait_time to last_attempted_nowait_time, set it every time you run this check, and only allowed it to succeed every two paxos_propose_intervals? Or just only let it go through if at least a paxos_propose_interval has passed since the previous OSDMap was generated?
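
A sketch of the first variant suggested above (hypothetical struct; not the actual patch):

// Record every attempt, and only allow the no-wait path once per two
// propose intervals, so the fast path cannot raise the map-generation rate.
struct NowaitGate {
  double last_attempted_nowait_time = 0;

  bool check(double now, double paxos_propose_interval) {
    bool ok = (now - last_attempted_nowait_time) > 2 * paxos_propose_interval;
    last_attempted_nowait_time = now;  // updated on every check, pass or fail
    return ok;
  }
};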

@gregsfortytwo assigned mslovy and unassigned tchaikov and liewegas Nov 28, 2016
@liewegas (Member)

What about something like this: in general, we don't want to propose more often than paxos_propose_interval, unless it is truly a high-priority event (like a command from the CLI) that gets no delay. But if we get an up_thru message and haven't proposed a map in a while (> propose interval), then we could propose immediately without exceeding our rate limit. We still delay the minimum amount to hopefully batch a bit, but it's the min_interval value, so not long. If another up_thru comes after that, it has to wait a full interval before proposing again (because we just published a map).

I think this boils down to noting the time when the last round finished and, for one of these expedited messages, setting the delay such that we propose no sooner than the previous round plus the normal interval, or now plus the min interval, whichever is later.
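
As a sketch, the timing rule described here (hypothetical free function, not ceph source):

#include <algorithm>

// Propose no sooner than max(last_round_end + propose_interval,
//                            now + min_wait); return the delay to schedule.
double expedited_delay(double now,
                       double last_round_end,  // when the last round finished
                       double paxos_propose_interval,
                       double paxos_min_wait)
{
  double earliest = std::max(last_round_end + paxos_propose_interval,
                             now + paxos_min_wait);
  return earliest - now;
}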

@mslovy (Contributor, Author) commented Jan 11, 2017

Hi all, based on the comments, what about this refined version?

@mslovy force-pushed the wip-monmsg-minwait branch 3 times, most recently from ba6e44b to 9736050 on January 11, 2017 at 05:49
@@ -131,6 +131,9 @@ class OSDMonitor : public PaxosService {
bool check_failure(utime_t now, int target_osd, failure_info_t& fi);
void force_failure(utime_t now, int target_osd);

// the time of last msg(MSG_ALIVE and MSG_PGTEMP) proposed without delay
utime_t last_attempped_minwait_time;
Member commented on the diff:

"attempted", not "attempped"?

Contributor (Author) replied:

sorry, fixed

dout(15) << " propose as fast as possible for up_thru/pg_temp" << dendl;

utime_t now = ceph_clock_now();
if (now - last_attempped_minwait_time > g_conf->paxos_propose_interval) {
Member commented on the diff:

This needs to look at the last map generated without using minwait as well -- it might be more recent!

Contributor (Author) replied:

do you mean this?

@mslovy force-pushed the wip-monmsg-minwait branch 2 times, most recently from ac4040c to 91ad31e on January 11, 2017 at 15:15
@gregsfortytwo (Member) left a comment
This looks fine to me now. Testing required.

delay and batch proposals as before if lots of messages arrive at the same time

if too many MSG_ALIVE and MSG_PGTEMP messages arrive, wait the long period as before

Signed-off-by: yaoning <yaoning@unitedstack.com>
// want to merge OSDMap changes as much as possible
if ((pending_inc.new_primary_temp.size() == 1
|| pending_inc.new_up_thru.size() == 1)
&& pending_inc.new_state.size() < 2) {
Contributor commented on the diff:

@mslovy why are we checking new_state.size() here? Just to minimize the impact of flapping OSDs, and only let a single OSD change pass through? Am I right?

Contributor (Author) replied:

Yeah, you are right. If lots of OSDs are flapping and updating their state, predictably a lot of primary_temp and up_thru requests will arrive within a short period. In that case, we do not want the first primary_temp or up_thru applied individually; we would rather wait for the following requests to arrive and update them all at once.
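
A commented restatement of the quoted condition (a sketch of its intent, not the exact diff):

// Only take the fast path for a lone pg_temp/up_thru change while no wider
// state churn (e.g. flapping OSDs) is pending, so bursts still batch into
// one map.
bool single_quiet_change(size_t new_primary_temp_count,
                         size_t new_up_thru_count,
                         size_t new_state_count)
{
  return (new_primary_temp_count == 1 || new_up_thru_count == 1)
      && new_state_count < 2;  // >= 2 state changes suggests a flapping storm
}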

@tchaikov assigned tchaikov and unassigned mslovy Jan 14, 2017
@yuriw merged commit f1618fe into ceph:master Jan 16, 2017
@xiaoxichen (Contributor) commented May 23, 2018

This change doesn't look right; it causes twice as many proposals as we targeted (limited by paxos_propose_interval).

Imagine we have a sequence of pg_temp/up_thru messages during a large recovery.

now = T
The 1st up_thru/pg_temp goes through the fast path and triggers a propose at T + paxos_min_wait, with last_attempted_minwait_time = T.

now = T + paxos_min_wait
The [2, K] up_thru messages fail the now - last_attempted_minwait_time > g_conf->paxos_propose_interval check and go through PaxosService::should_propose, which schedules the propose at about T + paxos_propose_interval.

now = T + paxos_propose_interval + paxos_min_wait
The (K+1)-th up_thru/pg_temp arrives; both now - last_attempted_minwait_time > g_conf->paxos_propose_interval and now - paxos->get_last_commit_time() > g_conf->paxos_min_wait are satisfied, so we trigger another propose at now + paxos_min_wait, i.e. T + paxos_propose_interval + 2 * paxos_min_wait.

Clearly we make TWO proposals in each paxos_propose_interval, which almost took down our cluster when we reweighted a bunch of OSDs.

It looks to me like PaxosService::should_propose already implements exactly what @liewegas suggested in a previous comment, so this commit is not necessary.

I think this boils down to noting the time when the last round finished and, for one of these expedited messages, setting the delay such that we propose no sooner than the previous round plus the normal interval, or now plus the min interval, whichever is later.
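
For reference, a paraphrase (not the verbatim ceph source) of the damping that PaxosService::should_propose applies:

// If a full interval has passed since the last commit, wait only
// paxos_min_wait; otherwise wait out the remainder of the interval.
double should_propose_delay(double now,
                            double last_commit_time,
                            double paxos_propose_interval,
                            double paxos_min_wait)
{
  if (now - last_commit_time > paxos_propose_interval)
    return paxos_min_wait;
  return (last_commit_time + paxos_propose_interval) - now;
}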

@xiaoxichen (Contributor)

#22243
