mds: more deterministic timing on frag split/join #12022
Conversation
Build finished.
A few questions.
// Hit the dir, in case the resulting fragments are beyond the split
// size
mds->balancer->hit_dir(mdr->get_mds_stamp(), dir, META_POP_IRD);
This doesn't seem quite right — a "hit" is a statement of hotness. If you want to check for split size, we should probably expose that a little more directly?
Yes -- I've separated out the part I really want to call into MDBalancer::maybe_fragment
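(For context, a minimal sketch of what such a separation could look like. The name MDBalancer::maybe_fragment comes from the comment above, but the signature and body here are assumptions rather than the PR's actual code.)

```cpp
// Sketch only: hit_dir() keeps doing the popularity accounting, while the
// split/merge decision moves into a helper that other call sites can use
// without recording a spurious temperature hit.
void MDBalancer::maybe_fragment(CDir *dir)
{
  if (dir->should_split()) {
    queue_split(dir);   // over mds_bal_split_size
  } else if (dir->should_merge()) {
    queue_merge(dir);   // under mds_bal_merge_size
  }
}
```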
// request instead of waiting for the timer.
// Count null dentries in effective size to take account of insertions
// that haven't committed yet.
auto effective_size = dir->get_frag_size() + dir->get_num_head_null();
Can't we get null dentries from actually deleting things, or doing lookups on incomplete CDirs?
Yes, this was hacky. I've added CDir::should_split_fast to do this properly.
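(A rough sketch of the kind of check that name suggests, based on the intent described above; the PR's actual CDir::should_split_fast may differ in detail.)

```cpp
// Sketch only.  Decide whether this fragment is already so far past the
// split threshold that it should be split immediately instead of waiting
// for the next fragmenting interval.  Null dentries are not counted
// blindly: only dentries whose projected linkage is non-null (real entries,
// including uncommitted insertions) contribute, so nulls left behind by
// unlinks or lookups on an incomplete dir don't trigger a split.
bool CDir::should_split_fast() const
{
  int fast_limit = g_conf->mds_bal_split_size * g_conf->mds_bal_fragment_fast_factor;

  // Cheap upper bound: even counting every null dentry we're under the limit.
  if (get_frag_size() + get_num_head_null() <= fast_limit)
    return false;

  // Otherwise count carefully.
  int effective_size = 0;
  for (const auto &p : items) {
    CDentry *dn = p.second;
    if (!dn->get_projected_linkage()->is_null())
      effective_size++;
  }
  return effective_size > fast_limit;
}
```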
if (!split_queue.empty()) {
dout(10) << "do_fragmenting " << split_queue.size() << " dirs marked for possible splitting" << dendl;
CDir *split_dir = mds->mdcache->get_dirfrag(frag);
// FIXME maybe instead of this check we should ensure
FIXME.
I'm generally a fan of proactive cleanups because they keep the invariants simpler, but I'm not sure which is the better option here.
Yep. I've gone with leaving things in split_pending even if they're no longer in the cache, because that list really just exists to avoid double-queueing the contexts that do the delayed split, and trying to cancel these contexts would be much harder than just checking if the dir is gone when we get called back.
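(Sketched below for illustration; the member and helper names are assumptions based on the comment above, and do_split stands in for whatever the queued context actually calls.)

```cpp
// Sketch only: split_pending exists purely to avoid queueing two delayed
// split contexts for the same dirfrag.  Stale entries are fine, because the
// callback re-checks the cache when it finally runs.
void MDBalancer::queue_split(CDir *dir)
{
  const dirfrag_t frag = dir->dirfrag();
  if (split_pending.count(frag))
    return;                       // a delayed split is already queued
  split_pending.insert(frag);

  // Capture the dirfrag_t, not the CDir*: if the dir is trimmed from the
  // cache before the timer fires, the callback just fails the lookup and
  // drops the split, so nothing ever needs to be cancelled.
  mds->timer.add_event_after(
    g_conf->mds_bal_fragment_interval,
    new FunctionContext([this, frag](int r) { do_split(frag); }));
}
```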
return;
}
dout(10) << __func__ << " splitting " << *split_dir << dendl;
if (!split_dir || !split_dir->is_auth()) {
The first half is redundant with the above. Should probably combine the cases if that's the route to go down.
Cleaned this up to have two separate (non-redundant) cases with different log messages
// Count null dentries in effective size to take account of insertions
// that haven't committed yet.
auto effective_size = dir->get_frag_size() + dir->get_num_head_null();
if (effective_size > g_conf->mds_bal_split_size * 1.5) {
You're setting a hard-coded hard limit of 1.5 times the desired split? Shouldn't it be configurable? ;)
True -- added a setting and lots of fresh new documentation. I'm open to opinions about what I named it though!
Tested by ceph/ceph-qa-suite#1275
@gregsfortytwo this is ready for another review when you have a sec
jenkins test this please (eio)
Looks good!
void MDBalancer::queue_merge(CDir *dir)
{
merge_queue.insert(dir->dirfrag());
split_pending.erase(frag);
Minor quibble: you can move this erase into the if statement above.
Done
dout(10) << "drop split on " << frag << " because not in cache" << dendl; | ||
return; | ||
} | ||
if (!split_dir->is_auth()) { |
This check looks redundant with what's already done in MDCache::can_fragment.
True. Updated.
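(So the remaining guard in the callback boils down to just the cache lookup, roughly as below. This is a sketch rather than the PR's final code, with auth and freeze validation left to MDCache::can_fragment inside the split path.)

```cpp
// Sketch only: handle a queued split firing after the dir may have changed.
// If the dirfrag fell out of the cache while the split was pending, log and
// forget it; auth and freezing are re-validated by MDCache::can_fragment.
void MDBalancer::do_split(dirfrag_t frag)
{
  split_pending.erase(frag);

  CDir *split_dir = mds->mdcache->get_dirfrag(frag);
  if (!split_dir) {
    dout(10) << "drop split on " << frag << " because not in cache" << dendl;
    return;
  }

  dout(10) << __func__ << " splitting " << *split_dir << dendl;
  mds->mdcache->split_dir(split_dir, g_conf->mds_bal_split_bits);
}
```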
break;
}
dout(10) << " all sibs under " << sibfg << " " << sibs << " should merge" << dendl;
fg = fg.parent();
It might also be desirable to eliminate some of these redundant checks with MDCache::merge_dir + MDCache::can_fragment, although I don't think it's as simple as just removing this loop?
are rejected with ENOSPC.
:Type: 32-bit Integer
:Default: ``100000``
Missing docs for mds_bal_fragment_fast_limit.
It's in there, just awkward to grep because of interchangeable underscores vs spaces in config names
a power of two number of new fragments. The number of new
fragments is given by two to the power ``mds_bal_split_bits``, i.e.
if ``mds_bal_split_bits`` is 2, then four new fragments will be
created. The default setting is 3, i.e. splits create 8 new fragments.
Not a change we need in this PR, but that seems like an awful lot. I suspect these values were "tuned" for Sage's HPC use-case testing and we probably want to bring it down to just 2?
New changes look good, but a few more comments.
OPTION(mds_bal_interval, OPT_INT, 10) // seconds
OPTION(mds_bal_fragment_interval, OPT_INT, 5) // seconds
OPTION(mds_bal_fragment_size_max, OPT_INT, 10000*10) // order of magnitude higher than split size
OPTION(mds_bal_fragment_fast_limit, OPT_FLOAT, 1.5) // multiple of size_max that triggers immediate split
How about mds_bal_fragment_fast_factor instead, since it's a multiplier and not an absolute value?
Yeah. Changed.
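(For reference, the renamed option and its use would read roughly as follows; a sketch, since the final comment text and call site may differ.)

```cpp
// Sketch: the option is a multiplier on mds_bal_split_size rather than an
// absolute dentry count, hence "factor" rather than "limit".
OPTION(mds_bal_fragment_fast_factor, OPT_FLOAT, 1.5)

// ...so the immediate-split check becomes, in effect:
//   effective_size > g_conf->mds_bal_split_size * g_conf->mds_bal_fragment_fast_factor
```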
@@ -3285,6 +3285,8 @@ void Server::handle_client_openc(MDRequestRef& mdr)
}

journal_and_reply(mdr, in, dn, le, fin);

mds->balancer->hit_dir(mdr->get_mds_stamp(), dir, META_POP_IWR);
Surely we already had a hit_dir() somewhere in the openc pipeline?
So maybe you want another maybe_fragment call here, but not hit_dir?
Yep, good point. Changed.
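(In other words, the openc hunk quoted above would presumably end up with something like the call below in place of the second hit_dir(); the exact argument list is an assumption.)

```cpp
  journal_and_reply(mdr, in, dn, le, fin);

  // Sketch: check split thresholds for this dir without recording another
  // popularity hit (openc already hit the dir earlier in the request).
  mds->balancer->maybe_fragment(dir);
```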
Signed-off-by: John Spray <john.spray@redhat.com>
Signed-off-by: John Spray <john.spray@redhat.com>
Increment these when we have finished splitting or merging, not partway through the process. This makes testing more deterministic: once I've seen the counter increment, I'm sure that the children no longer have STATE_FRAGMENTING set as a result of their parent's split-in-progress. Signed-off-by: John Spray <john.spray@redhat.com>
The allows_dirfrags() test was in the wrong place, causing in some cases a root fragment to be passed into queue_merge. Signed-off-by: John Spray <john.spray@redhat.com>
Signed-off-by: John Spray <john.spray@redhat.com>
This was just dropping its argument and calling through to show_subtrees. Signed-off-by: John Spray <john.spray@redhat.com>
...by using timer instead of tick(). Fixes: http://tracker.ceph.com/issues/17853 Signed-off-by: John Spray <john.spray@redhat.com>
In _fragment_finish we would like to try splitting the new frags without applying a spurious hit to their temperatures. Same for the start of openc where we would like to do an early check for splitting without hitting the dir twice. Signed-off-by: John Spray <john.spray@redhat.com>
Check it during the initial request, not just on completion, so that when doing lots of creates we get a chance to split the directory before it zooms past the size threshold. Signed-off-by: John Spray <john.spray@redhat.com>
In case insertions have occurred during the split that would immediately take the new fragments over the split threshold. Signed-off-by: John Spray <john.spray@redhat.com>
...and update the config ref. Includes the new mds_bal_fragment_fast_factor setting. Signed-off-by: John Spray <john.spray@redhat.com>
Updated @gregsfortytwo @batrick