Fix long stalls when calling ceph_fsync() #11710
Conversation
Mostly passing fs sepia run here: http://pulpito.ceph.com/jlayton-2016-10-31_20:35:33-fs-wip-jlayton-fsync---basic-mira/
Note too that if this protocol change looks OK, then we'll almost certainly want to propagate it to the kernel client as well.
One thing that may make sense is to turn this bool into an (advisory) flags field, or an enum that describes how the flush should occur. We could keep one bit for a "sync" flag and one for "more data coming". The latter could hint to the MDS that it should delay a log flush until it receives an update from that client with the flag cleared. That could help the MDS batch up updates from a single client more effectively.
👍 on switching to a flags field for check_caps
```diff
@@ -117,7 +120,8 @@ class MClientCaps : public Message {
     time_warp_seq(0),
     osd_epoch_barrier(0),
     oldest_flush_tid(0),
-    caller_uid(0), caller_gid(0) {
+    caller_uid(0), caller_gid(0),
+    sync(false) {
```
Need to set `HEAD_VERSION = 10` above, and make the decoding of the new field conditional on that.
Jewel didn't have btime or change_attr. Those fields would both be new as of kraken.
So to be clear, that means that we do need to worry about backward compatibility vs. development builds since jewel? If not, then we can still call it struct version 9 and it should still work. If we care about sane behavior vs. earlier development builds, though, then we probably will need to do as you suggest.
```diff
@@ -263,6 +269,7 @@ class MClientCaps : public Message {
     if (header.version >= 9) {
       ::decode(btime, p);
       ::decode(change_attr, p);
+      ::decode(sync, p);
```
`if (header.version >= 10)`
Force-pushed from 2606db0 to d90f5cf.
I'd like to go through this as well; self-assigning for that.
Force-pushed from d90f5cf to 55972e5.
OK, I think my latest push should address all of the comments. @liewegas, care to re-review? I'd also welcome review from @jcsp and/or @gregsfortytwo.
```diff
@@ -2706,7 +2708,7 @@ void Locker::handle_client_caps(MClientCaps *m)
   }

   // filter wanted based on what we could ever give out (given auth/replica status)
-  bool need_flush = false;
+  bool need_flush = m->flags & CLIENT_CAPS_SYNC;
```
To make it clearer that we're always respecting this flag, maybe the `need_flush` assignments below should be changed into `need_flush |=`?
Yeah, that makes sense. Hmm... does `|=` work with bools? What might be better is to just check whether `need_flush` is already set before calling `_need_flush_mdlog`. I'll push a SQUASH patch on top.
Force-pushed from 55972e5 to fba8b39.
We still need to sync out metadata on size and mtime changes, and we definitely do not want to delay syncing just because we only have write caps that are dirty. Remove the mask, and call check_caps whenever any caps are dirty. Signed-off-by: Jeff Layton <jlayton@redhat.com>
Currently, when "is_delay" is true, we send the caps immediately. When it's false, we delay them. This seems backward to me and is highly confusing. Sage says: "It's definitely confusing. the call site that motivated the naming is the one from tick(), that passes true ... for the delayed flush that forces them to be sent but lots of other callers set it to true to avoid delaying. It could use a rename/cleanup." Let's change the name to "no_delay" to indicate that the caller wants to send caps immediately instead of delaying them. Signed-off-by: Jeff Layton <jlayton@redhat.com>
...and encode/decode it appropriately. The idea of this field is to be advisory, to allow the MDS to handle things differently if it so chooses. We deliberately do _not_ offer any firm policy here. We start with a flag that tells the MDS when an application may end up blocking on the result of this cap request. A new "sync" arg is added to send_cap, and we set the new flag in the cap request based on its value. For now, the callers all set the sync boolean to false everywhere to preserve the existing behavior. Signed-off-by: Jeff Layton <jlayton@redhat.com>
Force-pushed from fba8b39 to ed49f5f.
@jcsp: I think the SQUASH patch I just added should address your concern. Untested as of yet, but it should do the right thing. I'll spin up a test run with it once the build is done.
```diff
-    check_caps(in, true);
+    if (p.end())
+      flags |= CHECK_CAPS_SYNCHRONOUS;
+    check_caps(in, flags);
```
It looks like we're forcing the caps to disk immediately on any invocation of flush_caps(). Can we rename the function to indicate that behavior?
Since we obviously call the per-file/per-session flush_caps() in lots of places where we don't want to force it instantly to disk.
To be clear, this is not the per-file/per-session flush_caps; this is the flush_caps function with no argument that flushes all caps back to the MDS for unmount or syncfs. Can we rename this one to distinguish it from the other? Absolutely. I'm build-testing a patch for that now.
Build worked, so I went ahead and pushed the patch onto the pile.
This (or at least a sufficiently recent version for me) passed tests here: http://pulpito.ceph.com/jspray-2016-11-10_21:38:43-fs-wip-jcsp-testing-20161108-distro-basic-mira/ @jtlayton I'm happy for this to merge as soon as it's squashed.
If the client has set the sync flag in a cap update, then it is indicating that it's waiting on the reply. Ensure that we flush the journal in that case. Signed-off-by: Jeff Layton <jlayton@redhat.com>
In a later patch, we'll want to have the client set the sync flag in the cap flush, to hint to the MDS that it should process it immediately. We could add a second bool, but let's instead do what the kernel client does which is to have a flags field. With that, the existing no_delay bool becomes CHECK_CAPS_NODELAY. We'll add other flags in subsequent patches. Signed-off-by: Jeff Layton <jlayton@redhat.com>
Ensure that the client will request an immediate journal flush from the MDS when we'll end up waiting on the flush response. This patch should fix the fsync codepath, but we may need something similar for syncfs. Signed-off-by: Jeff Layton <jlayton@redhat.com>
…eature flags The kernel client lags the userland code a bit, and feature support for addr2 is not quite ready. Still, we want to allow the client to set the new flags field in a cap request before then so it can get better fsync performance. When we go to update the cap fields, grab the features from the peer, and verify that the appropriate flags are set before we apply updates to the btime and change_attr. Also, just have the function return early if dirty is 0, since it's a no-op in that case, and turn the comment above the function into an assertion. Signed-off-by: Jeff Layton <jlayton@redhat.com>
Ensure that we ask the MDS to flush the journal on the last cap flush from sync_fs and umount codepaths. Signed-off-by: Jeff Layton <jlayton@redhat.com>
Per Greg's recommendation, change the name of this function to better indicate what it does now that we always request a journal flush on the last cap flush. Also, add a comment above the function to better explain why we do this. Signed-off-by: Jeff Layton <jlayton@redhat.com>
Force-pushed from ff9f277 to a374166.
Squashed. @jcsp you may merge when ready.
This PR should fix tracker bug 17563.
The userland client fsync codepath is quite slow in some (rather common) workloads. I have a libcephfs test program that creates a file, writes 16k to it, calls ceph_fsync on it and then ceph_close. This is done in a loop 256 times.
Without this patchset, when I run this program against a vstart cluster under `time`:
With this patchset:
Roughly a 100x speedup. I first noticed this when testing with ganesha backed by Ceph, so this should help that workload as well.
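The original test program isn't attached to the PR, but a reconstruction of the described loop (create, write 16k, `ceph_fsync`, `ceph_close`, 256 times) might look like this. The file naming and error handling are my own; it needs libcephfs and a running cluster, e.g. `g++ fsync_bench.cc -lcephfs`:

```cpp
// Hedged sketch of the benchmark described above; assumes a mounted
// vstart cluster reachable via the default ceph.conf.
#include <cephfs/libcephfs.h>
#include <cstdio>
#include <cstring>
#include <fcntl.h>

int main() {
  struct ceph_mount_info *cmount;
  if (ceph_create(&cmount, NULL) < 0 ||
      ceph_conf_read_file(cmount, NULL) < 0 ||
      ceph_mount(cmount, "/") < 0) {
    fprintf(stderr, "mount failed\n");
    return 1;
  }

  char buf[16 * 1024];
  memset(buf, 'x', sizeof(buf));

  for (int i = 0; i < 256; i++) {
    char path[64];
    snprintf(path, sizeof(path), "/fsync-bench-%d", i);  // name is illustrative
    int fd = ceph_open(cmount, path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) break;
    ceph_write(cmount, fd, buf, sizeof(buf), 0);  // write 16k
    ceph_fsync(cmount, fd, 0);                    // the slow path this PR fixes
    ceph_close(cmount, fd);
  }

  ceph_unmount(cmount);
  ceph_release(cmount);
  return 0;
}
```

Run it under `time` with and without the patchset to reproduce the comparison.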