mds: more deterministic timing on frag split/join #12022
Conversation
Build finished.
A few questions.
// Hit the dir, in case the resulting fragments are beyond the split
// size
mds->balancer->hit_dir(mdr->get_mds_stamp(), dir, META_POP_IRD);
This doesn't seem quite right — a "hit" is a statement of hotness. If you want to check for split size, we should probably expose that a little more directly?
Yes -- I've separated out the part I really want to call into MDBalancer::maybe_fragment
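(For context, a minimal sketch of what such a separation could look like. The name MDBalancer::maybe_fragment comes from the comment above, but the signature and body here are assumptions rather than the PR's actual code.)

```cpp
// Sketch only: hit_dir() keeps doing the popularity accounting, while the
// split/merge decision moves into a helper that other call sites can use
// without recording a spurious temperature hit.
void MDBalancer::maybe_fragment(CDir *dir)
{
  if (dir->should_split()) {
    queue_split(dir);   // over mds_bal_split_size
  } else if (dir->should_merge()) {
    queue_merge(dir);   // under mds_bal_merge_size
  }
}
```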
// request instead of waiting for the timer.
// Count null dentries in effective size to take account of insertions
// that haven't committed yet.
auto effective_size = dir->get_frag_size() + dir->get_num_head_null();
Can't we get null dentries from actually deleting things, or doing lookups on incomplete CDirs?
Yes, this was hacky. I've added CDir::should_split_fast to do this properly.
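(A rough sketch of the kind of check that name suggests, based on the intent described above; the PR's actual CDir::should_split_fast may differ in detail.)

```cpp
// Sketch only.  Decide whether this fragment is already so far past the
// split threshold that it should be split immediately instead of waiting
// for the next fragmenting interval.  Null dentries are not counted
// blindly: only dentries whose projected linkage is non-null (real entries,
// including uncommitted insertions) contribute, so nulls left behind by
// unlinks or lookups on an incomplete dir don't trigger a split.
bool CDir::should_split_fast() const
{
  int fast_limit = g_conf->mds_bal_split_size * g_conf->mds_bal_fragment_fast_factor;

  // Cheap upper bound: even counting every null dentry we're under the limit.
  if (get_frag_size() + get_num_head_null() <= fast_limit)
    return false;

  // Otherwise count carefully.
  int effective_size = 0;
  for (const auto &p : items) {
    CDentry *dn = p.second;
    if (!dn->get_projected_linkage()->is_null())
      effective_size++;
  }
  return effective_size > fast_limit;
}
```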
if (!split_queue.empty()) {
dout(10) << "do_fragmenting " << split_queue.size() << " dirs marked for possible splitting" << dendl;
CDir *split_dir = mds->mdcache->get_dirfrag(frag);
// FIXME maybe instead of this check we should ensure
FIXME.
I'm generally a fan of proactive cleanups because they keep the invariants simpler, but I'm not sure which is the better option here.
Yep. I've gone with leaving things in split_pending even if they're no longer in the cache, because that list really just exists to avoid double-queueing the contexts that do the delayed split, and trying to cancel these contexts would be much harder than just checking if the dir is gone when we get called back.
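(Sketched below for illustration; the member and helper names are assumptions based on the comment above, and do_split stands in for whatever the queued context actually calls.)

```cpp
// Sketch only: split_pending exists purely to avoid queueing two delayed
// split contexts for the same dirfrag.  Stale entries are fine, because the
// callback re-checks the cache when it finally runs.
void MDBalancer::queue_split(CDir *dir)
{
  const dirfrag_t frag = dir->dirfrag();
  if (split_pending.count(frag))
    return;                       // a delayed split is already queued
  split_pending.insert(frag);

  // Capture the dirfrag_t, not the CDir*: if the dir is trimmed from the
  // cache before the timer fires, the callback just fails the lookup and
  // drops the split, so nothing ever needs to be cancelled.
  mds->timer.add_event_after(
    g_conf->mds_bal_fragment_interval,
    new FunctionContext([this, frag](int r) { do_split(frag); }));
}
```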
return;
}
dout(10) << __func__ << " splitting " << *split_dir << dendl;
if (!split_dir || !split_dir->is_auth()) {
The first half is redundant with the above. Should probably combine the cases if that's the route to go down.
Cleaned this up to have two separate (non-redundant) cases with different log messages
// Count null dentries in effective size to take account of insertions
// that haven't committed yet.
auto effective_size = dir->get_frag_size() + dir->get_num_head_null();
if (effective_size > g_conf->mds_bal_split_size * 1.5) {
You're setting a hard-coded hard limit of 1.5 times the desired split? Shouldn't it be configurable? ;)
True -- added a setting and lots of fresh new documentation. I'm open to opinions about what I named it though!
Tested by ceph/ceph-qa-suite#1275
@gregsfortytwo this is ready for another review when you have a sec
jenkins test this please (eio)
Looks good!
void MDBalancer::queue_merge(CDir *dir)
{
merge_queue.insert(dir->dirfrag());
split_pending.erase(frag);
Minor quibble: you can move this erase into the if statement above.
Done
dout(10) << "drop split on " << frag << " because not in cache" << dendl; | ||
return; | ||
} | ||
if (!split_dir->is_auth()) { |
This check looks redundant with what's already done in MDCache::can_fragment.
True. Updated.
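(So the remaining guard in the callback boils down to just the cache lookup, roughly as below. This is a sketch rather than the PR's final code, with auth and freeze validation left to MDCache::can_fragment inside the split path.)

```cpp
// Sketch only: handle a queued split firing after the dir may have changed.
// If the dirfrag fell out of the cache while the split was pending, log and
// forget it; auth and freezing are re-validated by MDCache::can_fragment.
void MDBalancer::do_split(dirfrag_t frag)
{
  split_pending.erase(frag);

  CDir *split_dir = mds->mdcache->get_dirfrag(frag);
  if (!split_dir) {
    dout(10) << "drop split on " << frag << " because not in cache" << dendl;
    return;
  }

  dout(10) << __func__ << " splitting " << *split_dir << dendl;
  mds->mdcache->split_dir(split_dir, g_conf->mds_bal_split_bits);
}
```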
break;
}
dout(10) << " all sibs under " << sibfg << " " << sibs << " should merge" << dendl;
fg = fg.parent();
It might also be desirable to eliminate some of these redundant checks with MDCache::merge_dir + MDCache::can_fragment, although I don't think it's as simple as just removing this loop?
are rejected with ENOSPC.
:Type: 32-bit Integer
:Default: ``100000``
Missing docs for mds_bal_fragment_fast_limit.
It's in there, just awkward to grep because of interchangeable underscores vs spaces in config names
a power of two number of new fragments. The number of new
fragments is given by two to the power ``mds_bal_split_bits``, i.e.
if ``mds_bal_split_bits`` is 2, then four new fragments will be
created. The default setting is 3, i.e. splits create 8 new fragments.
Not a change we need in this PR, but that seems like an awful lot. I suspect these values were "tuned" for Sage's HPC use-case testing and we probably want to bring it down to just 2?
New changes look good, but a few more comments.
OPTION(mds_bal_interval, OPT_INT, 10) // seconds
OPTION(mds_bal_fragment_interval, OPT_INT, 5) // seconds
OPTION(mds_bal_fragment_size_max, OPT_INT, 10000*10) // order of magnitude higher than split size
OPTION(mds_bal_fragment_fast_limit, OPT_FLOAT, 1.5) // multiple of size_max that triggers immediate split
How about mds_bal_fragment_fast_factor instead, since it's a multiplier and not an absolute value?
Yeah. Changed.
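(For reference, the renamed option and its use would read roughly as follows; a sketch, since the final comment text and call site may differ.)

```cpp
// Sketch: the option is a multiplier on mds_bal_split_size rather than an
// absolute dentry count, hence "factor" rather than "limit".
OPTION(mds_bal_fragment_fast_factor, OPT_FLOAT, 1.5)

// ...so the immediate-split check becomes, in effect:
//   effective_size > g_conf->mds_bal_split_size * g_conf->mds_bal_fragment_fast_factor
```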
@@ -3285,6 +3285,8 @@ void Server::handle_client_openc(MDRequestRef& mdr)
}

journal_and_reply(mdr, in, dn, le, fin);

mds->balancer->hit_dir(mdr->get_mds_stamp(), dir, META_POP_IWR);
Surely we already had a hit_dir() somewhere in the openc pipeline?
So maybe you want another maybe_fragment call here, but not hit_dir?
Yep, good point. Changed.
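(In other words, the openc hunk quoted above would presumably end up with something like the call below in place of the second hit_dir(); the exact argument list is an assumption.)

```cpp
  journal_and_reply(mdr, in, dn, le, fin);

  // Sketch: check split thresholds for this dir without recording another
  // popularity hit (openc already hit the dir earlier in the request).
  mds->balancer->maybe_fragment(dir);
```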
Signed-off-by: John Spray <john.spray@redhat.com>
Signed-off-by: John Spray <john.spray@redhat.com>
Increment these when we have finished splitting or merging, not partway through the process. This makes testing more deterministic: once I've seen the counter increment, I'm sure that the children no longer have STATE_FRAGMENTING set as a result of their parent's split-in-progress. Signed-off-by: John Spray <john.spray@redhat.com>
The allows_dirfrags() test was in the wrong place, causing in some cases a root fragment to be passed into queue_merge. Signed-off-by: John Spray <john.spray@redhat.com>
Signed-off-by: John Spray <john.spray@redhat.com>
This was just dropping its argument and calling through to show_subtrees. Signed-off-by: John Spray <john.spray@redhat.com>
...by using timer instead of tick(). Fixes: http://tracker.ceph.com/issues/17853 Signed-off-by: John Spray <john.spray@redhat.com>
In _fragment_finish we would like to try splitting the new frags without applying a spurious hit to their temperatures. Same for the start of openc where we would like to do an early check for splitting without hitting the dir twice. Signed-off-by: John Spray <john.spray@redhat.com>
Check it during the initial request, not just on completion, so that when doing lots of creates we get a chance to split the directory before it zooms past the size threshold. Signed-off-by: John Spray <john.spray@redhat.com>
In case insertions have occurred during the split that would immediately take the new fragments over the split threshold. Signed-off-by: John Spray <john.spray@redhat.com>
...and update the config ref. Includes the new mds_bal_fragment_fast_factor setting. Signed-off-by: John Spray <john.spray@redhat.com>
Updated @gregsfortytwo @batrick