New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
client: buffer the truncate if we have the Fx caps #43286
Conversation
you can't buffer "truncating file to smaller size" |
This won't buffer in that case, only buffer the truncate when the new size is larger. |
Is this due to the reason that the mds has special handling for shrinking truncates? AFAICS, shrinking truncate additionally notifies clients (with updated caps) and initiates file truncate (via objecter). For extending file truncates, the cap update to clients is probably deferred? |
src/client/Client.cc
Outdated
in->cap_dirtier_gid = perms.gid(); | ||
in->mark_caps_dirty(CEPH_CAP_FILE_EXCL); | ||
mask &= ~(CEPH_SETATTR_SIZE); | ||
mask |= CEPH_SETATTR_MTIME; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could avoid some code duplication when updating inode (citme, uid,..) with what's done in the block below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I can fix this.
For this, I think Zheng means that for the current code only the IMO if we will buffer the truncate with smaller size we must adapt the MDS side code at the same time in this PR, which will make the code more complicated, seems doesn't make much sense AFAIK. |
Agreed. Let's just stick with not buffering truncates to smaller sizes. It wouldn't give us much benefit anyway. |
Sure. I wasn't proposing that we start buffering shrinking truncates, I was interested in the reason as to why its handled differently. The client should adhere to the MDS handling of an operation. |
Just to add, truncates in cephfs are really complex. It would be nice to have all of the cases carefully documented for posterity. |
src/client/Client.cc
Outdated
req->inode_drop |= CEPH_CAP_AUTH_SHARED; | ||
ldout(cct,10) << "changing mode to " << stx->stx_mode << dendl; | ||
if (!in->caps_issued_mask(CEPH_CAP_AUTH_SHARED) || | ||
stx->stx_mode != in->mode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's pretty confusing having the logic for each CEPH_SETATTR_* value split into two different stanzas -- one that occurs before "force request:" and one that occurs after. It sort of makes sense in the existing code -- we check to see if we can buffer the attribute change and if we can't we then build a request and send it off.
This patch changes that though and it's harder to follow the logic around whether the change will be buffered.
Why not do the testing/handling for As caps and the mode change in the CEPH_SETATTR_MODE section above this point (around line #7524)?
Ditto for the other CEPH_SETATTR_* changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's pretty confusing having the logic for each CEPH_SETATTR_* value split into two different stanzas -- one that occurs before "force request:" and one that occurs after. It sort of makes sense in the existing code -- we check to see if we can buffer the attribute change and if we can't we then build a request and send it off.
Yeah, this sounds good.
This patch changes that though and it's harder to follow the logic around whether the change will be buffered.
Why not do the testing/handling for As caps and the mode change in the CEPH_SETATTR_MODE section above this point (around line #7524)?
Ditto for the other CEPH_SETATTR_* changes.
Sure, will improve it.
bca4269
to
6437ed0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of this patch. I think the existing organization of this function (where you decide whether to buffer any of the changes and then allocate a new req and fire it off if so), is basically correct. This seems like a step backward from that.
43caa12
to
3a8e870
Compare
Nice work @lxbsz ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise looks like good changes. Well done!
in->mark_caps_dirty(CEPH_CAP_FILE_EXCL); | ||
mask &= ~CEPH_SETATTR_MTIME; | ||
} else if (in->caps_issued_mask(CEPH_CAP_FILE_WR) && | ||
utime_t(stx->stx_mtime) > in->mtime) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one concerns me. Would this not break a reader + writer where the reader is polling for mtime updates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this sounds a problem. Anyway in this case we must update the mtime
to the MDS by either setattr request or cap update request.
To be simple I will remove this code currently. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some general confusion around what caps cover these attributes. The kernel wants Fs to read the mtime in a getattr request:
if (want & (STATX_ATIME|STATX_MTIME|STATX_CTIME|STATX_SIZE|
STATX_BLOCKS))
mask |= CEPH_CAP_FILE_SHARED;
...but that looks wrong, in hindsight. Can a client hold Fw caps while another has Fs? I suspect that it can. If so, then the kclient's getattr path is not working as it should.
Again -- we really need a canonical treatise on what caps cover what inode metadata (particularly for FILE caps). Maybe we should badger @liewegas, @gregsfortytwo and @ukernel for such a thing?
My guess though is that this has never been clearly defined throughout cephfs's existence. We'll probably need to write up a "spec" for this ourselves, and then fix the userland and kernel code to conform to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some general confusion around what caps cover these attributes. The kernel wants Fs to read the mtime in a getattr request:
if (want & (STATX_ATIME|STATX_MTIME|STATX_CTIME|STATX_SIZE| STATX_BLOCKS)) mask |= CEPH_CAP_FILE_SHARED;
...but that looks wrong, in hindsight. Can a client hold Fw caps while another has Fs? I suspect that it can. If so, then the kclient's getattr path is not working as it should.
You should be able to check this by grokking filelock in mds/locks.c, which...okay, when I started writing this I didn't but I think I remember how it works now:
You cannot issue both Fs and Fw to distinct clients — the CEPH_CAP_GWR can be issued only in LOCK_MIX or LOCK_EXCL(usive) states, and MIX doesn't allow CEPH_CAP_GSHARED. (EXCL does, but you can only be in that state with a single client on the file.)
Again -- we really need a canonical treatise on what caps cover what inode metadata (particularly for FILE caps). Maybe we should badger @liewegas, @gregsfortytwo and @ukernel for such a thing?
My guess though is that this has never been clearly defined throughout cephfs's existence. We'll probably need to write up a "spec" for this ourselves, and then fix the userland and kernel code to conform to it.
There are a few fuzzy bits — I seem to recall stuff like the dentry leases have timestamps that are provided on basic listings, and those don't get updated because it was wildly too expensive, but a full stat will refresh mtime properly — but I think this is pretty well defined. If you want to write out specific questions or inode metadata fields and their covering bits we can correct and help fill it in, but you've pretty well got it in ceph.git/doc/cephfs/capabilities.rst and being able to read locks.c will help you go a long way when trying to understand the conflicts and interactions you need to worry about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some general confusion around what caps cover these attributes. The kernel wants Fs to read the mtime in a getattr request:
if (want & (STATX_ATIME|STATX_MTIME|STATX_CTIME|STATX_SIZE| STATX_BLOCKS)) mask |= CEPH_CAP_FILE_SHARED;
...but that looks wrong, in hindsight. Can a client hold Fw caps while another has Fs? I suspect that it can. If so, then the kclient's getattr path is not working as it should.
You should be able to check this by grokking filelock in mds/locks.c, which...okay, when I started writing this I didn't but I think I remember how it works now: You cannot issue both Fs and Fw to distinct clients — the CEPH_CAP_GWR can be issued only in LOCK_MIX or LOCK_EXCL(usive) states, and MIX doesn't allow CEPH_CAP_GSHARED. (EXCL does, but you can only be in that state with a single client on the file.)
Yeah, double checked this. The Fs
and Fw
are not allowed to distinct clients at the same time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some general confusion around what caps cover these attributes. The kernel wants Fs to read the mtime in a getattr request:
if (want & (STATX_ATIME|STATX_MTIME|STATX_CTIME|STATX_SIZE| STATX_BLOCKS)) mask |= CEPH_CAP_FILE_SHARED;
...but that looks wrong, in hindsight. Can a client hold Fw caps while another has Fs? I suspect that it can. If so, then the kclient's getattr path is not working as it should.
Thanks @jtlayton, you just remind me that when polling for the mtime it should get the Fs
caps not the Fr
, and the other clients shouldn't have the Fw
caps, so this should be safe. I will add this back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@batrick @vshankar Please take a look.
As Jeff mentioned and I double checked, the Fs
and Fw
caps are not allowed to distinct clients at the same time.
If one reader is polling the mtime
updates it should use the statx()
with AT_STATX_FORCE_SYNC
. From the posix doc, without the AT_STATX_FORCE_SYNC
, it won't make sure that the mtime
will be the newest, just return what the local cache has.
Once the AT_STATX_FORCE_SYNC
was set, client will always sync the attributes from the MDS server by requiring the Fs
caps. Then in MDS it will rdlock the FILE lock, that means all the Fw
caps must be revoked. Then the writer client will flush it.
More detail about the Fs
caps, please see my another PR #43722.
3a8e870
to
22833db
Compare
1289883
to
ca57e9e
Compare
Rebased onto the master. |
jenkins test make check arm64 |
jenkins retest this please |
jenkins test api |
ca57e9e
to
9831685
Compare
Rebased it. |
9831685
to
2bb030a
Compare
Rebased it to address the check failures. |
jenkins test make check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I would have liked to avoid code duplication, but that's NBD and can be done later.
If the As caps is issued and the gid/uid/mode/btime don't change, then it's safe to skip sending out a setattr request to MDS. Signed-off-by: Xiubo Li <xiubli@redhat.com>
If the Fx caps is issued and new size is larger we can try to buffer the truncate, and just ignore in case the new size equals to current size. Signed-off-by: Xiubo Li <xiubli@redhat.com>
When the Fw caps is issued and increasing the atime/mtime, we can buffer the atime/mtime. And when the Fs caps is issued and they are not changing we can just ignore them. In case when recovering the atime/mtime maybe decreased, no matter the Fsw caps are issued, always send the request immediatelly. Signed-off-by: Xiubo Li <xiubli@redhat.com>
If the caller perms is different with the current dirtier's, we can still cache the attributes locally if possible, but will mark the ctime needs to be updated forcely and the request will implicitly flush the dirty caps for us. Signed-off-by: Xiubo Li <xiubli@redhat.com>
2bb030a
to
569770a
Compare
@lxbsz Is this ready for review/testing? |
Yeah. |
jenkins test api |
Comments addressed.
Fixes: https://tracker.ceph.com/issues/16739
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox