mon: no delay for single message MSG_ALIVE and MSG_PGTEMP #12107
Conversation
While it would be nice to push these through immediately, ALIVE and PGTEMP are exactly the message types we need to throttle. These are generated very quickly during thrash storms, and if we generated a new OSDMap for each one, you'd be much more likely to destroy your cluster.
Unless @athanatos or @liewegas have other thoughts, I don't think this is something we can do. Maybe it could work if there were a bunch of extra tracking logic so it only does this for n pgtemp updates within a given time period; then you could let one OSD change go through quickly but throttle once the cluster is in trouble?
Yeah, paxos_min_wait (default: 0.05) is one of the throttle policies; it ensures that at most 20 OSDMap changes can happen per second. So the thing we want to answer is whether the monitor can finish at least 20 paxos updates in one second. Does anyone here think leveldb or rocksdb cannot finish 20 updates per second? Also, during thrash storms I think most OSDs remain in an unavailable state, so the purpose is still to make the whole cluster state picture clear as fast as possible if the monitor has enough capacity. Furthermore, I think it is possible to add a throttle counter to track OSDMap updates within one second and reset it in tick(). In this way we control the update throughput exactly, and we don't wait at all when the throttle allows the request to go on.
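For illustration, the counter idea could look roughly like this (a self-contained sketch, not actual Ceph code; all names here are hypothetical, and tick() stands in for the monitor's periodic tick):

```cpp
// Hypothetical per-second proposal budget, reset from the monitor's tick().
struct ProposalThrottle {
  int proposed_this_second = 0;
  int max_per_sec = 20;  // mirrors 1 / paxos_min_wait with the 0.05 default

  // called once per second from tick(): reset the budget
  void tick() { proposed_this_second = 0; }

  // returns true if an OSDMap proposal may go out with no extra delay
  bool try_propose_nowait() {
    if (proposed_this_second >= max_per_sec)
      return false;  // budget spent: fall back to the normal delayed batching
    ++proposed_this_second;
    return true;
  }
};
```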
The question isn't about what the monitor's leveldb can push through; it's about how many cluster state updates the OSDs can handle. They need to do a fixed amount of fairly expensive processing on every map update, and spamming out 20 updates every second is guaranteed to put most clusters into a death spiral.
Yeah, OSDMaps are costly to process on the OSD side. That's the main concern.
OK, I think the main concern is that it is quite costly for a PG to catch up to the latest map, right? handle_osd_map in class OSD itself is not that expensive. Generally, it seems there are two cases (depending on whether acting/up is affected):
So can we immediately raise a proposal for MSG_ALIVE, but apply some further constraint on MSG_PGTEMP? The goal of the improvement would be:
So we can use the throttle strategy, the message type, and other factors to determine the delay time, right? I will reconsider and propose a new strategy later.
Mmm, we still probably want to combine a bunch of up_thru messages into 1 map if possible.
(force-pushed from 21764e7 to 4c91d42)
Yeah, I agree that it is better to combine a bunch of up_thru messages. Actually, what we want is:
So what about the current updated PR?
(force-pushed from 0e0fb57 to de2ceb6)
    last_no_wait_time = now;
    return true;
  }
}
This logic doubles the number of OSDMaps we'd otherwise generate — we'll do two every paxos_propose_interval. You'd need something at least a little more sophisticated. Maybe it would work if you renamed last_no_wait_time -> last_attempted_nowait_time, set it every time you run this check, and only allow it to succeed every two paxos_propose_intervals? Or just only let it go through if at least a paxos_propose_interval has passed since the previous OSDMap was generated?
What about something like this: in general, we don't want to propose more often than paxos_propose_interval, unless it is truly a high-priority event (like a command from the CLI) that has no delay. But if we get an up_thru message and haven't proposed a map in a while (> propose interval), then we could propose immediately without exceeding our rate limit. We still delay the min amount to hopefully batch a bit, but it's the min_interval value, so not long. If another up_thru comes after that, it has to wait a full interval before proposing again (because we just published a map). I think this boils down to noting the time when the last round finished, and for one of these expedited messages, setting the delay such that we propose no sooner than the previous round + the normal interval or now + min interval, whichever is later.
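For illustration, the timing rule described above might look roughly like this (a self-contained sketch with hypothetical names; plain doubles in seconds stand in for the monitor's utime_t clock):

```cpp
#include <algorithm>

// Sketch: an expedited up_thru/pg_temp message proposes no sooner than
// (end of previous round + paxos_propose_interval) and no sooner than
// (now + paxos_min_wait), whichever is later.
double expedited_propose_delay(double now, double last_round_end,
                               double propose_interval, double min_wait) {
  double earliest = std::max(last_round_end + propose_interval,
                             now + min_wait);
  return earliest - now;  // how long this expedited message must wait
}
```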
(force-pushed from de2ceb6 to 710dca8)
Hi all, based on the comments, what about this refined version?
(force-pushed from ba6e44b to 9736050)
@@ -131,6 +131,9 @@ class OSDMonitor : public PaxosService {
  bool check_failure(utime_t now, int target_osd, failure_info_t& fi);
  void force_failure(utime_t now, int target_osd);

  // the time of last msg(MSG_ALIVE and MSG_PGTEMP) proposed without delay
  utime_t last_attempped_minwait_time;
"attempted", not "attempped"?
sry, fixed
dout(15) << " propose as fast as possible for up_thru/pg_temp" << dendl;

utime_t now = ceph_clock_now();
if (now - last_attempped_minwait_time > g_conf->paxos_propose_interval) {
This needs to look at the last map generated without using minwait as well -- it might be more recent!
do you mean this?
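For illustration only, a self-contained sketch of one way such a combined check could work; all names are hypothetical, and the actual change lives in the force-pushed commit below, which this sketch does not reproduce:

```cpp
#include <algorithm>

// The expedited path should measure from whichever proposal happened most
// recently: the last minwait-expedited one OR the last ordinary map
// generation. Times are plain seconds, standing in for utime_t.
bool can_expedite(double now,
                  double last_minwait_propose_time,
                  double last_map_generate_time,
                  double propose_interval) {
  double last = std::max(last_minwait_propose_time, last_map_generate_time);
  return (now - last) > propose_interval;  // a full interval since ANY map
}
```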
(force-pushed from ac4040c to 91ad31e)
This looks fine to me now. Testing required.
delay and batch proposals as before if lots of messages arrive at the same time; if too many messages are MSG_ALIVE and MSG_PGTEMP, wait a long period as before

Signed-off-by: yaoning <yaoning@unitedstack.com>
(force-pushed from 91ad31e to ca65210)
// want to merge OSDMap changes as much as possible
if ((pending_inc.new_primary_temp.size() == 1
     || pending_inc.new_up_thru.size() == 1)
    && pending_inc.new_state.size() < 2) {
@mslovy why are we checking new_state.size() here? Just to minimize the impact of flapping OSDs, and to only let a single OSD change pass through? Am I right?
Yeah, you are right. If lots of OSDs are flapping and updating their state, we can predict that lots of primary_temp and up_thru requests will arrive in the following short period. At that point we do not want the first primary_temp or up_thru proposed individually; instead we expect to wait for the following requests to arrive and update them all at once.
This change doesn't look right; it causes twice as many proposals as we targeted (limited by paxos_propose_interval). Imagine we have a sequence of pg_temp/up_thru messages during a large recovery:

now = T
now = T + paxos_min_wait
now = T + paxos_propose_interval + paxos_min_wait

Clearly we made TWO proposals in each paxos_propose_interval, which would almost take down our cluster when we reweight a bunch of OSDs. It looks to me like
Currently, the monitor may wait paxos_propose_interval before propose_pending.
However, we don't want this long wait for the messages MSG_ALIVE and MSG_PGTEMP, since it blocks the peering process, so all IO can block for a long time when adding or removing an OSD in the cluster.
Thus, wait paxos_min_wait (default = 0.05) for those messages. Even so, we can set paxos_min_wait = 0 to make propose_pending fire immediately if we want.
Signed-off-by: yaoning yaoning@unitedstack.com