Fix long stalls when calling ceph_fsync() #11710
Conversation
Mostly passing fs sepia run here: http://pulpito.ceph.com/jlayton-2016-10-31_20:35:33-fs-wip-jlayton-fsync---basic-mira/
Note too that if this protocol change looks OK, then we'll almost certainly want to propagate it to the kernel client as well.
One thing that may make sense is to turn this bool into an (advisory) flags field, or an enum that describes how the flush should occur. We could keep one bit for a "sync" flag and one for "more data coming". The latter could hint to the MDS that it should delay a log flush until it receives an update from that client with the flag cleared. That could help the MDS batch up updates from a single client more effectively.
👍 on switching to a flags field for check_caps
```diff
@@ -117,7 +120,8 @@ class MClientCaps : public Message {
     time_warp_seq(0),
     osd_epoch_barrier(0),
     oldest_flush_tid(0),
-    caller_uid(0), caller_gid(0) {
+    caller_uid(0), caller_gid(0),
+    sync(false) {
```
Need to set `HEAD_VERSION = 10` above, and make the decoding of the new field conditional on that.
Jewel didn't have btime or change_attr. Those fields would both be new as of kraken.
So to be clear, that means that we do need to worry about backward compatibility vs. development builds since jewel? If not, then we can still call it struct version 9 and it should still work. If we care about sane behavior vs. earlier development builds, though, then we probably will need to do as you suggest.
```diff
@@ -263,6 +269,7 @@ class MClientCaps : public Message {
     if (header.version >= 9) {
       ::decode(btime, p);
       ::decode(change_attr, p);
+      ::decode(sync, p);
```
`if (header.version >= 10)`
Force-pushed from 2606db0 to d90f5cf.
I'd like to go through this as well; self-assigning for that.
Force-pushed from d90f5cf to 55972e5.
OK, I think my latest push should address all of the comments. @liewegas, care to re-review? I'd also welcome review from @jcsp and/or @gregsfortytwo.
```diff
@@ -2706,7 +2708,7 @@ void Locker::handle_client_caps(MClientCaps *m)
   }

   // filter wanted based on what we could ever give out (given auth/replica status)
-  bool need_flush = false;
+  bool need_flush = m->flags & CLIENT_CAPS_SYNC;
```
To make it clearer that we're always respecting this flag, maybe the `need_flush` assignments below should be changed into `need_flush |=`?
Yeah, that makes sense. Hmm... does `|=` work with bools? What might be better is to just check whether `need_flush` is already set before calling `_need_flush_mdlog`. I'll push a SQUASH patch on top.
Force-pushed from 55972e5 to fba8b39.
We still need to sync out metadata on size and mtime changes, and we definitely do not want to delay syncing just because we only have write caps that are dirty. Remove the mask, and call check_caps whenever any caps are dirty. Signed-off-by: Jeff Layton <jlayton@redhat.com>
Currently, when "is_delay" is true, we send the caps immediately. When it's false, we delay them. This seems backward to me and is highly confusing. Sage says: "It's definitely confusing. the call site that motivated the naming is the one from tick(), that passes true ... for the delayed flush that forces them to be sent but lots of other callers set it to true to avoid delaying. It could use a rename/cleanup." Let's change the name to "no_delay" to indicate that the caller wants to send caps immediately instead of delaying them. Signed-off-by: Jeff Layton <jlayton@redhat.com>
...and encode/decode it appropriately. The idea of this field is to be advisory, to allow the MDS to handle things differently if it so chooses. We deliberately do _not_ offer any firm policy here. We start with a flag that tells the MDS when an application may end up blocking on the result of this cap request. A new "sync" arg is added to send_cap, and we set the new flag in the cap request based on its value. For now, the callers all set the sync boolean to false everywhere to preserve the existing behavior. Signed-off-by: Jeff Layton <jlayton@redhat.com>
Force-pushed from fba8b39 to ed49f5f.
@jcsp: I think the SQUASH patch I just added should address your concern. Untested as of yet, but it should do the right thing. I'll spin up a test run with it once the build is done.
```diff
-    check_caps(in, true);
+    if (p.end())
+      flags |= CHECK_CAPS_SYNCHRONOUS;
+    check_caps(in, flags);
```
It looks like we're forcing the caps to disk immediately on any invocation of flush_caps(). Can we rename the function to indicate that behavior?
Since we obviously call the per-file/per-session flush_caps() in lots of places where we don't want to force it instantly to disk.
To be clear, this is not the per-file/per-session flush_caps; this is the flush_caps function with no argument that flushes all caps back to the MDS for unmount or syncfs. Can we rename this one to distinguish it from the other? Absolutely. I'm build-testing a patch for that now.
Build worked, so I went ahead and pushed the patch onto the pile.
This (or at least a sufficiently recent version for me) passed tests here: http://pulpito.ceph.com/jspray-2016-11-10_21:38:43-fs-wip-jcsp-testing-20161108-distro-basic-mira/ @jtlayton I'm happy for this to merge as soon as it's squashed.
If the client has set the sync flag in a cap update, then it is indicating that it's waiting on the reply. Ensure that we flush the journal in that case. Signed-off-by: Jeff Layton <jlayton@redhat.com>
In a later patch, we'll want to have the client set the sync flag in the cap flush, to hint to the MDS that it should process it immediately. We could add a second bool, but let's instead do what the kernel client does which is to have a flags field. With that, the existing no_delay bool becomes CHECK_CAPS_NODELAY. We'll add other flags in subsequent patches. Signed-off-by: Jeff Layton <jlayton@redhat.com>
Ensure that the client will request an immediate journal flush from the MDS when we'll end up waiting on the flush response. This patch should fix the fsync codepath, but we may need something similar for syncfs. Signed-off-by: Jeff Layton <jlayton@redhat.com>
…eature flags The kernel client lags the userland code a bit, and feature support for addr2 is not quite ready. Still, we want to allow the client to set the new flags field in a cap request before then so it can get better fsync performance. When we go to update the cap fields, grab the features from the peer, and verify that the appropriate flags are set before we apply updates to the btime and change_attr. Also, just have the function return early if dirty is 0, since it's a no-op in that case, and turn the comment above the function into an assertion. Signed-off-by: Jeff Layton <jlayton@redhat.com>
Ensure that we ask the MDS to flush the journal on the last cap flush from sync_fs and umount codepaths. Signed-off-by: Jeff Layton <jlayton@redhat.com>
Per Greg's recommendation, change the name of this function to better indicate what it does now that we always request a journal flush on the last cap flush. Also, add a comment above the function to better explain why we do this. Signed-off-by: Jeff Layton <jlayton@redhat.com>
Force-pushed from ff9f277 to a374166.
Squashed. @jcsp you may merge when ready.
This PR should fix tracker bug 17563.
The userland client fsync codepath is quite slow in some (rather common) workloads. I have a libcephfs test program that creates a file, writes 16k to it, calls ceph_fsync on it and then ceph_close. This is done in a loop 256 times.
Without this patchset, when I run this program against a vstart cluster under `time`:
With this patchset:
Roughly a 100x speedup. I first noticed this when testing with ganesha backed by Ceph, so this should help that workload as well.
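The original test program isn't attached to the PR, but a reconstruction of the described loop (create, write 16k, `ceph_fsync`, `ceph_close`, 256 times) might look like this. The file naming and error handling are my own; it needs libcephfs and a running cluster, e.g. `g++ fsync_bench.cc -lcephfs`:

```cpp
// Hedged sketch of the benchmark described above; assumes a mounted
// vstart cluster reachable via the default ceph.conf.
#include <cephfs/libcephfs.h>
#include <cstdio>
#include <cstring>
#include <fcntl.h>

int main() {
  struct ceph_mount_info *cmount;
  if (ceph_create(&cmount, NULL) < 0 ||
      ceph_conf_read_file(cmount, NULL) < 0 ||
      ceph_mount(cmount, "/") < 0) {
    fprintf(stderr, "mount failed\n");
    return 1;
  }

  char buf[16 * 1024];
  memset(buf, 'x', sizeof(buf));

  for (int i = 0; i < 256; i++) {
    char path[64];
    snprintf(path, sizeof(path), "/fsync-bench-%d", i);  // name is illustrative
    int fd = ceph_open(cmount, path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) break;
    ceph_write(cmount, fd, buf, sizeof(buf), 0);  // write 16k
    ceph_fsync(cmount, fd, 0);                    // the slow path this PR fixes
    ceph_close(cmount, fd);
  }

  ceph_unmount(cmount);
  ceph_release(cmount);
  return 0;
}
```

Run it under `time` with and without the patchset to reproduce the comparison.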