Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix long stalls when calling ceph_fsync() #11710

Merged
merged 9 commits into from Nov 11, 2016
Merged

Fix long stalls when calling ceph_fsync() #11710

merged 9 commits into from Nov 11, 2016

Conversation

jtlayton
Copy link
Contributor

@jtlayton jtlayton commented Oct 31, 2016

This PR should fix tracker bug 17563

The userland client fsync codepath is quite slow in some (rather common) workloads. I have a libcephfs test program that creates a file, writes 16k to it, calls ceph_fsync on it and then ceph_close. This is done in a loop 256 times.

Without this patchset when I run this program vs a vstart cluster under "time":

real    21m19.619s
user    0m0.431s
sys     0m0.364s

With this patchset:

real    0m11.491s
user    0m0.247s
sys     0m0.127s

Roughly 100x speedup. I first noticed this when testing with ganesha, backed by ceph, so this should help its workload as well.

@jtlayton jtlayton added cephfs Ceph File System performance labels Oct 31, 2016
@jtlayton
Copy link
Contributor Author

jtlayton commented Nov 1, 2016

@jtlayton
Copy link
Contributor Author

jtlayton commented Nov 1, 2016

Note too that if this protocol change looks ok, then we'll almost certainly want to propagate this change to the kernel client as well.

@jtlayton
Copy link
Contributor Author

jtlayton commented Nov 3, 2016

One thing that may make sense is to turn this bool into an (advisory) flags field or an enum that describes how the flush should occur. We could keep one bit for a "sync" flag, and one for "more data coming".

The latter could hint to the MDS that it should delay a log flush until it receives one from a client with that flag cleared. That could help the mds batch up updates from a single client more effectively.

@liewegas liewegas added this to the kraken milestone Nov 4, 2016
Copy link
Member

@liewegas liewegas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on switching to a flags field for check_caps

@@ -117,7 +120,8 @@ class MClientCaps : public Message {
time_warp_seq(0),
osd_epoch_barrier(0),
oldest_flush_tid(0),
caller_uid(0), caller_gid(0) {
caller_uid(0), caller_gid(0),
sync(false) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to set HEAD_VERSION = 10 above, and make the decoding of the new field conditional on that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jewel didn't have btime or change_attr. Those fields would both be new as of kraken.

So to be clear, that means that we do need to worry about backward compatability vs. development builds since jewel? If not, then we can still call it struct version 9 and it should still work. If we care about sane behavior vs. earlier development builds though, then we probably will need to do as you suggest.

@@ -263,6 +269,7 @@ class MClientCaps : public Message {
if (header.version >= 9) {
::decode(btime, p);
::decode(change_attr, p);
::decode(sync, p);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (header.version >= 10)

@jtlayton jtlayton force-pushed the wip-jlayton-fsync branch 2 times, most recently from 2606db0 to d90f5cf Compare November 7, 2016 17:14
@gregsfortytwo gregsfortytwo self-assigned this Nov 8, 2016
@gregsfortytwo
Copy link
Member

I'd like to go through this as well, self-assigning for that.

@jtlayton
Copy link
Contributor Author

jtlayton commented Nov 8, 2016

Ok, I think my latest PR should address all of the comments. @liewegas care to re-review? I'd also welcome review from @jcsp and/or @gregsfortytwo .

@@ -2706,7 +2708,7 @@ void Locker::handle_client_caps(MClientCaps *m)
}

// filter wanted based on what we could ever give out (given auth/replica status)
bool need_flush = false;
bool need_flush = m->flags & CLIENT_CAPS_SYNC;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it clearer that we're always respecting this flag, maybe the need_flush assignments below should be changed into need_flush |=?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes sense. Hmm...does |= work with bools? What might be better is to just check whether need_flush is already set before calling _need_flush_mdlog. I'll push a SQUASH patch on top.

We still need to sync out metadata on size and mtime changes, and we
definitely do not want to delay syncing just because we only have
write caps that are dirty. Remove the mask and check_caps whenever
any caps are dirty.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Currently, when "is_delay" is true, we send the caps immediately. When
it's false, we delay them. This seems backward to me and is highly
confusing.

Sage says:

"It's definitely confusing.  the call site that motivated the naming is
 the one from tick(), that passes true ... for the delayed flush that
 forces them to be sent but lots of other callers set it to true to avoid
 delaying. It could use a rename/cleanup."

Let's change the name to "no_delay" to indicate that the caller wants
to send caps immediately instead of delaying them.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
...and encode/decode it appropriately.

The idea of this field is to be advisory, to allow the MDS to handle
things differently if it so chooses. We deliberately do _not_ offer
any firm policy here.

We start with a flag that tells the MDS when an application may end up
blocking on the result of this cap request.

A new "sync" arg is added to send_cap, and we set the new flag in the
cap request based on its value. For now, the callers all set the sync
boolean to false everywhere to preserve the existing behavior.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
@jtlayton
Copy link
Contributor Author

jtlayton commented Nov 9, 2016

@jcsp : I think the SQUASH patch I just added should address your concern. Untested as of yet but it should do the right thing. I'll spin up a test run with it once the build is done.

check_caps(in, true);
if (p.end())
flags |= CHECK_CAPS_SYNCHRONOUS;
check_caps(in, flags);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we're here forcing the caps immediately to disk on any invocation of flush_caps(). Can we rename the function to indicate that behavior?
Since we obviously call the per-file/per-session flush_caps() in lots of places where we don't want to force it instantly to disk.

Copy link
Contributor Author

@jtlayton jtlayton Nov 10, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, this is not the per-file/per-session flush_caps, -- this is the flush_caps function with no argument that flushes all caps back to the mds for unmount or syncfs. Can we rename this one to distinguish it from the other? Absolutely. I'm build-testing a patch for that now.

Build worked, so I went ahead and pushed the patch onto the pile.

@gregsfortytwo gregsfortytwo removed their assignment Nov 10, 2016
@jcsp
Copy link
Contributor

jcsp commented Nov 11, 2016

This (or at least a sufficiently recent version for me) passed tests here: http://pulpito.ceph.com/jspray-2016-11-10_21:38:43-fs-wip-jcsp-testing-20161108-distro-basic-mira/

@jtlayton I'm happy for this to merge as soon as it's squashed

If the client has set the sync flag in a cap update, then it
is indicating that it's waiting on the reply. Ensure that we flush
the journal in that case.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
In a later patch, we'll want to have the client set the sync flag in
the cap flush, to hint to the MDS that it should process it immediately.

We could add a second bool, but let's instead do what the kernel client
does which is to have a flags field. With that, the existing no_delay
bool becomes CHECK_CAPS_NODELAY.

We'll add other flags in subsequent patches.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Ensure that the client will request an immediate journal flush from the
MDS when we'll end up waiting on the flush response. This patch should
fix the fsync codepath, but we may need something similar for syncfs.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
…eature flags

The kernel client lags the userland code a bit, and feature support for
addr2 is not quite ready. Still, we want to allow the client to set the
new flags field in a cap request before then so it can get better fsync
performance.

When we go to update the cap fields, grab the features from the peer,
and verify that the appropriate flags are set before we apply updates
to the btime and change_attr.

Also, just have the function return early if dirty is 0, since it's
a no-op in that case, and turn the comment above the function into
an assertion.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Ensure that we ask the MDS to flush the journal on the last cap flush
from sync_fs and umount codepaths.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Per Greg's recommendation, change the name of this function to better
indicate what it does now that we always request a journal flush on
the last cap flush.

Also, add a comment above the function to better explain why we do this.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
@jtlayton
Copy link
Contributor Author

Squashed. @jcsp you may merge when ready.

@jcsp jcsp dismissed liewegas’s stale review November 11, 2016 12:14

Code was switched to flags field

@jcsp jcsp merged commit 6b2dc4a into master Nov 11, 2016
@jcsp jcsp deleted the wip-jlayton-fsync branch November 11, 2016 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cephfs Ceph File System performance
Projects
None yet
4 participants