
mds: more deterministic timing on frag split/join #12022

Merged: 11 commits into ceph:master on Nov 28, 2016

Conversation

@jcsp (Contributor) commented Nov 16, 2016

No description provided.

@jcsp jcsp added the cephfs Ceph File System label Nov 16, 2016
@ceph-jenkins (Collaborator):

Build finished.

@gregsfortytwo (Member) left a comment

A few questions.


// Hit the dir, in case the resulting fragments are beyond the split
// size
mds->balancer->hit_dir(mdr->get_mds_stamp(), dir, META_POP_IRD);
gregsfortytwo:

This doesn't seem quite right — a "hit" is a statement of hotness. If you want to check for split size, we should probably expose that a little more directly?

jcsp (author):

Yes -- I've separated out the part I really want to call into MDBalancer::maybe_fragment.
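
For context, a minimal sketch of that factoring: hit_dir keeps updating the temperature and delegates the size check to maybe_fragment, which callers can now invoke on its own. The helper names should_split and queue_split are assumptions here, not quotes from the merged diff:

void MDBalancer::hit_dir(utime_t now, CDir *dir, int type)
{
  // ... update the dir's popularity/temperature counters as before ...

  // A hit still implies a size check, but the check itself no longer
  // implies a hit.
  maybe_fragment(dir, true);
}

void MDBalancer::maybe_fragment(CDir *dir, bool hot)
{
  // Pure size check, no temperature update: queue a delayed split
  // once the fragment outgrows the configured split size.
  if (dir->is_auth() && dir->should_split())
    queue_split(dir);
}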

// request instead of waiting for the timer.
// Count null dentries in effective size to take account of insertions
// that haven't committed yet.
auto effective_size = dir->get_frag_size() + dir->get_num_head_null();
gregsfortytwo:

Can't we get null dentries from actually deleting things, or doing lookups on incomplete CDirs?

jcsp (author):

Yes, this was hacky. I've added CDir::should_split_fast to do this properly.
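
A sketch of what such a check might look like: cheap bounds first, and an exact count only in the ambiguous band. The option name uses the mds_bal_fragment_fast_factor spelling settled on later in this review, and the projected-linkage test is an assumption about how nulls from deletions and incomplete-dir lookups get excluded:

bool CDir::should_split_fast() const
{
  const int64_t fast_limit = g_conf->mds_bal_split_size *
                             g_conf->mds_bal_fragment_fast_factor;

  // Cheap upper bound: even if every null dentry were a real entry,
  // we would still be under the limit.
  if ((int64_t)(get_frag_size() + get_num_head_null()) <= fast_limit)
    return false;

  // Cheap lower bound: committed entries alone already exceed it.
  if ((int64_t)get_frag_size() > fast_limit)
    return true;

  // Otherwise count exactly: a dentry whose projected linkage is
  // non-null is either a committed entry or an uncommitted insertion;
  // nulls left behind by deletions or by lookups on incomplete dirs
  // project to null and are skipped.
  int64_t effective_size = 0;
  for (const auto &p : items) {
    const CDentry *dn = p.second;
    if (!dn->get_projected_linkage()->is_null())
      effective_size++;
  }
  return effective_size > fast_limit;
}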

if (!split_queue.empty()) {
dout(10) << "do_fragmenting " << split_queue.size() << " dirs marked for possible splitting" << dendl;
CDir *split_dir = mds->mdcache->get_dirfrag(frag);
// FIXME maybe instead of this check we should ensure
gregsfortytwo:

FIXME.
I'm generally a fan of proactive cleanups because it makes the invariants simpler, but I'm not sure which is the better option here.

jcsp (author):

Yep. I've gone with leaving things in split_pending even if they're no longer in the cache, because that list really just exists to avoid double-queueing the contexts that do the delayed split, and trying to cancel these contexts would be much harder than just checking if the dir is gone when we get called back.
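
A sketch of that shape, assuming split_pending is a set<dirfrag_t> and using FunctionContext for the delayed callback (the merged code may differ in detail):

void MDBalancer::queue_split(const CDir *dir)
{
  const dirfrag_t frag = dir->dirfrag();
  if (!split_pending.insert(frag).second)
    return;  // already queued: never double-queue the delayed context

  mds->timer.add_event_after(g_conf->mds_bal_fragment_interval,
      new FunctionContext([this, frag](int r) {
        // Entries stay in split_pending even if the dir is dropped
        // from cache; tolerating that when the timer fires is much
        // simpler than cancelling the context when the dir goes away.
        split_pending.erase(frag);
        CDir *split_dir = mds->mdcache->get_dirfrag(frag);
        if (!split_dir) {
          dout(10) << "drop split on " << frag
                   << " because not in cache" << dendl;
          return;
        }
        mds->mdcache->split_dir(split_dir, g_conf->mds_bal_split_bits);
      }));
}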

return;
}
dout(10) << __func__ << " splitting " << *split_dir << dendl;
if (!split_dir || !split_dir->is_auth()) {
gregsfortytwo:

The first half is redundant with the above. Should probably combine the cases if that's the route to go down.

jcsp (author):

Cleaned this up to have two separate (non-redundant) cases with different log messages.

// Count null dentries in effective size to take account of insertions
// that haven't committed yet.
auto effective_size = dir->get_frag_size() + dir->get_num_head_null();
if (effective_size > g_conf->mds_bal_split_size * 1.5) {
gregsfortytwo:

You're setting a hard-coded hard limit of 1.5 times the desired split? Shouldn't it be configurable? ;)

jcsp (author):

True -- added a setting and lots of fresh new documentation. I'm open to opinions about what I named it though!
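
In the style of the existing config reference entries, the new setting's documentation presumably reads something like this (a sketch using the mds_bal_fragment_fast_factor name settled on below, not the merged text):

``mds bal fragment fast factor``

:Description: The ratio by which a fragment may exceed the split size
              before a split is executed immediately, instead of waiting
              for the fragmentation interval.
:Type: Float
:Default: ``1.5``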

@jcsp (author) commented Nov 17, 2016

Tested by ceph/ceph-qa-suite#1275

@jcsp (author) commented Nov 17, 2016

@gregsfortytwo this is ready for another review when you have a sec

@ghost commented Nov 18, 2016

jenkins test this please (eio)

@batrick (Member) left a comment

Looks good!

void MDBalancer::queue_merge(CDir *dir)
{
merge_queue.insert(dir->dirfrag());
split_pending.erase(frag);
batrick:

Minor quibble: you can move this erase into the if statement above.

jcsp (author):

Done
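
i.e. roughly (a sketch with an assumed already-queued guard, since the quoted hunk doesn't show the surrounding if):

void MDBalancer::queue_merge(CDir *dir)
{
  const dirfrag_t frag = dir->dirfrag();
  if (merge_queue.count(frag) == 0) {
    merge_queue.insert(frag);
    // Only relevant when a merge is actually queued: a frag that is
    // about to merge no longer needs its pending split.
    split_pending.erase(frag);
  }
}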

dout(10) << "drop split on " << frag << " because not in cache" << dendl;
return;
}
if (!split_dir->is_auth()) {
batrick:

This check looks redundant with what's already done in MDCache::can_fragment.

jcsp (author):

True. Updated.

break;
}
dout(10) << " all sibs under " << sibfg << " " << sibs << " should merge" << dendl;
fg = fg.parent();
batrick:

It might also be desirable to eliminate some of these redundant checks with MDCache::merge_dir + MDCache::can_fragment, although I don't think it's as simple as just removing this loop?

are rejected with ENOSPC.
:Type: 32-bit Integer
:Default: ``100000``

batrick:

Missing docs for mds_bal_fragment_fast_limit.

jcsp (author):

It's in there, just awkward to grep because of interchangeable underscores vs. spaces in config names.

a power of two number of new fragments. The number of new
fragments is given by two to the power ``mds_bal_split_bits``, i.e.
if ``mds_bal_split_bits`` is 2, then four new fragments will be
created. The default setting is 3, i.e. splits create 8 new fragments.
batrick:

Not a change we need in this PR, but that seems like an awful lot. I suspect these values were "tuned" for Sage's HPC use-case testing and we probably want to bring it down to just 2?

@gregsfortytwo (Member) left a comment

New changes look good, but a few more comments.

OPTION(mds_bal_interval, OPT_INT, 10) // seconds
OPTION(mds_bal_fragment_interval, OPT_INT, 5) // seconds
OPTION(mds_bal_fragment_size_max, OPT_INT, 10000*10) // order of magnitude higher than split size
OPTION(mds_bal_fragment_fast_limit, OPT_FLOAT, 1.5) // multiple of size_max that triggers immediate split
gregsfortytwo:

How about mds_bal_fragment_fast_factor instead, since it's a multiplier and not an absolute value?

jcsp (author):

Yeah. Changed.
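
So the option line presumably ends up as follows, with only the name changed (a sketch, not the merged diff):

OPTION(mds_bal_fragment_fast_factor, OPT_FLOAT, 1.5) // multiple of size_max that triggers immediate split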

@@ -3285,6 +3285,8 @@ void Server::handle_client_openc(MDRequestRef& mdr)
}

journal_and_reply(mdr, in, dn, le, fin);

mds->balancer->hit_dir(mdr->get_mds_stamp(), dir, META_POP_IWR);
gregsfortytwo:

Surely we already had a hit_dir() somewhere in the openc pipeline?
So maybe you want another maybe_fragment call here, but not hit_dir?

jcsp (author):

Yep, good point. Changed.
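
i.e. roughly, reusing the maybe_fragment split from earlier in this review (a sketch, not the merged hunk):

journal_and_reply(mdr, in, dn, le, fin);

// openc already applies a hit_dir() earlier in the pipeline, so only
// re-check the split condition here rather than heating the dir twice.
mds->balancer->maybe_fragment(dir, false);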

John Spray added 10 commits November 24, 2016 10:22

Signed-off-by: John Spray <john.spray@redhat.com>

Signed-off-by: John Spray <john.spray@redhat.com>

Increment these when we have finished splitting
or merging, not partway through the process.

This makes testing more deterministic: once I've seen
the counter increment, I'm sure that the children
no longer have STATE_FRAGMENTING set as a result of
their parent's split-in-progress.

Signed-off-by: John Spray <john.spray@redhat.com>

The allows_dirfrags() test was in the wrong place, causing
a root fragment to be passed into queue_merge in some cases.

Signed-off-by: John Spray <john.spray@redhat.com>

Signed-off-by: John Spray <john.spray@redhat.com>

This was just dropping its argument and
calling through to show_subtrees.

Signed-off-by: John Spray <john.spray@redhat.com>

...by using the timer instead of tick()

Fixes: http://tracker.ceph.com/issues/17853
Signed-off-by: John Spray <john.spray@redhat.com>

In _fragment_finish we would like to try splitting
the new frags without applying a spurious hit to their
temperatures. Same for the start of openc where we would
like to do an early check for splitting without hitting
the dir twice.

Signed-off-by: John Spray <john.spray@redhat.com>

Check it during the initial request, not just
on completion, so that when doing lots of creates
we get a chance to split the directory before
it zooms past the size threshold.

Signed-off-by: John Spray <john.spray@redhat.com>

In case insertions have occurred during the split that would
immediately take the new fragments over the split threshold.

Signed-off-by: John Spray <john.spray@redhat.com>

...and update the config ref.

Includes the new mds_bal_fragment_fast_factor setting.

Signed-off-by: John Spray <john.spray@redhat.com>
@jcsp (author) commented Nov 24, 2016

Updated @gregsfortytwo @batrick

@jcsp jcsp assigned batrick and gregsfortytwo and unassigned jcsp Nov 24, 2016
@batrick (Member) commented Nov 28, 2016

Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>

@batrick batrick removed their assignment Nov 28, 2016
@gregsfortytwo gregsfortytwo merged commit 0e60bf8 into ceph:master Nov 28, 2016