client: buffer the truncate if we have the Fx caps #43286

Merged
merged 4 commits from the improve_setattr branch on Jan 13, 2022

Conversation

lxbsz
Member

@lxbsz lxbsz commented Sep 23, 2021

Fixes: https://tracker.ceph.com/issues/16739

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


@github-actions github-actions bot added the cephfs Ceph File System label Sep 23, 2021
@lxbsz lxbsz requested review from a team, jtlayton and batrick September 23, 2021 11:37
src/client/Client.cc (outdated review thread)
@ukernel
Contributor

ukernel commented Sep 26, 2021

you can't buffer "truncating file to smaller size"

@lxbsz
Member Author

lxbsz commented Sep 30, 2021

you can't buffer "truncating file to smaller size"

This won't buffer in that case, only buffer the truncate when the new size is larger.

@vshankar
Contributor

you can't buffer "truncating file to smaller size"

Is this because the MDS has special handling for shrinking truncates? AFAICS, a shrinking truncate additionally notifies clients (with updated caps) and initiates the file truncate (via the objecter). For extending truncates, the cap update to clients is probably deferred?

        in->cap_dirtier_gid = perms.gid();
        in->mark_caps_dirty(CEPH_CAP_FILE_EXCL);
        mask &= ~(CEPH_SETATTR_SIZE);
        mask |= CEPH_SETATTR_MTIME;
Contributor

We could avoid some code duplication when updating the inode (ctime, uid, ...) with what's done in the block below.

Member Author

Makes sense, I can fix this.

@lxbsz
Member Author

lxbsz commented Oct 8, 2021

you can't buffer "truncating file to smaller size"

Is this because the MDS has special handling for shrinking truncates? AFAICS, a shrinking truncate additionally notifies clients (with updated caps) and initiates the file truncate (via the objecter). For extending truncates, the cap update to clients is probably deferred?

For this, I think Zheng means that in the current code only handle_client_setattr() does the object truncation. If we buffer a truncate to a smaller size, the client will defer the truncate to the cap-update path, and on the MDS side the cap-update path won't do the object truncation work.

IMO, if we want to buffer truncates to a smaller size we must also adapt the MDS-side code in this same PR, which would make the code more complicated; that doesn't seem worth it AFAICS.

@jtlayton
Contributor

jtlayton commented Oct 8, 2021

IMO, if we want to buffer truncates to a smaller size we must also adapt the MDS-side code in this same PR, which would make the code more complicated; that doesn't seem worth it AFAICS.

Agreed. Let's just stick with not buffering truncates to smaller sizes. It wouldn't give us much benefit anyway.

@vshankar
Contributor

vshankar commented Oct 8, 2021

For this, I think Zheng means that in the current code only handle_client_setattr() does the object truncation. If we buffer a truncate to a smaller size, the client will defer the truncate to the cap-update path, and on the MDS side the cap-update path won't do the object truncation work.

IMO, if we want to buffer truncates to a smaller size we must also adapt the MDS-side code in this same PR, which would make the code more complicated; that doesn't seem worth it AFAICS.

Sure. I wasn't proposing that we start buffering shrinking truncates; I was interested in why it's handled differently. The client should adhere to the MDS handling of an operation.

@jtlayton
Contributor

jtlayton commented Oct 8, 2021

Just to add, truncates in cephfs are really complex. It would be nice to have all of the cases carefully documented for posterity.

        req->inode_drop |= CEPH_CAP_AUTH_SHARED;
        ldout(cct,10) << "changing mode to " << stx->stx_mode << dendl;
        if (!in->caps_issued_mask(CEPH_CAP_AUTH_SHARED) ||
            stx->stx_mode != in->mode) {
Contributor

It's pretty confusing having the logic for each CEPH_SETATTR_* value split into two different stanzas -- one that occurs before "force request:" and one that occurs after. It sort of makes sense in the existing code -- we check to see if we can buffer the attribute change and if we can't we then build a request and send it off.

This patch changes that though and it's harder to follow the logic around whether the change will be buffered.

Why not do the testing/handling for As caps and the mode change in the CEPH_SETATTR_MODE section above this point (around line #7524)?

Ditto for the other CEPH_SETATTR_* changes.

Member Author

It's pretty confusing having the logic for each CEPH_SETATTR_* value split into two different stanzas -- one that occurs before "force request:" and one that occurs after. It sort of makes sense in the existing code -- we check to see if we can buffer the attribute change and if we can't we then build a request and send it off.

Yeah, this sounds good.

This patch changes that though and it's harder to follow the logic around whether the change will be buffered.

Why not do the testing/handling for As caps and the mode change in the CEPH_SETATTR_MODE section above this point (around line #7524)?

Ditto for the other CEPH_SETATTR_* changes.

Sure, will improve it.

@lxbsz lxbsz force-pushed the improve_setattr branch 2 times, most recently from bca4269 to 6437ed0, on October 11, 2021 11:49
@lxbsz lxbsz requested a review from jtlayton October 11, 2021 11:50
Contributor

@jtlayton jtlayton left a comment


I'm not a fan of this patch. I think the existing organization of this function (where you decide whether to buffer any of the changes and then allocate a new req and fire it off if so), is basically correct. This seems like a step backward from that.

src/client/Client.cc (review thread)
@lxbsz lxbsz force-pushed the improve_setattr branch 2 times, most recently from 43caa12 to 3a8e870, on October 11, 2021 14:52
@jtlayton jtlayton self-requested a review October 11, 2021 14:59
@jtlayton
Contributor

Nice work @lxbsz !

@lxbsz lxbsz added the needs-qa label Oct 14, 2021
Member

@batrick batrick left a comment


Otherwise looks like good changes. Well done!

        in->mark_caps_dirty(CEPH_CAP_FILE_EXCL);
        mask &= ~CEPH_SETATTR_MTIME;
        } else if (in->caps_issued_mask(CEPH_CAP_FILE_WR) &&
                   utime_t(stx->stx_mtime) > in->mtime) {
Member

This one concerns me. Would this not break a reader + writer where the reader is polling for mtime updates?

Member Author

Yeah, this sounds like a problem. Either way, in this case we must push the mtime update to the MDS via either a setattr request or a cap-update request.

To keep it simple I will remove this code for now. Thanks.

Contributor

There is some general confusion around what caps cover these attributes. The kernel wants Fs to read the mtime in a getattr request:

        if (want & (STATX_ATIME|STATX_MTIME|STATX_CTIME|STATX_SIZE|
                    STATX_BLOCKS))
                mask |= CEPH_CAP_FILE_SHARED;

...but that looks wrong, in hindsight. Can a client hold Fw caps while another has Fs? I suspect that it can. If so, then the kclient's getattr path is not working as it should.

Again -- we really need a canonical treatise on what caps cover what inode metadata (particularly for FILE caps). Maybe we should badger @liewegas, @gregsfortytwo and @ukernel for such a thing?

My guess though is that this has never been clearly defined throughout cephfs's existence. We'll probably need to write up a "spec" for this ourselves, and then fix the userland and kernel code to conform to it.

Member

There is some general confusion around what caps cover these attributes. The kernel wants Fs to read the mtime in a getattr request:

        if (want & (STATX_ATIME|STATX_MTIME|STATX_CTIME|STATX_SIZE|
                    STATX_BLOCKS))
                mask |= CEPH_CAP_FILE_SHARED;

...but that looks wrong, in hindsight. Can a client hold Fw caps while another has Fs? I suspect that it can. If so, then the kclient's getattr path is not working as it should.

You should be able to check this by grokking filelock in mds/locks.c, which...okay, when I started writing this I didn't but I think I remember how it works now:
You cannot issue both Fs and Fw to distinct clients — the CEPH_CAP_GWR can be issued only in LOCK_MIX or LOCK_EXCL(usive) states, and MIX doesn't allow CEPH_CAP_GSHARED. (EXCL does, but you can only be in that state with a single client on the file.)

Again -- we really need a canonical treatise on what caps cover what inode metadata (particularly for FILE caps). Maybe we should badger @liewegas, @gregsfortytwo and @ukernel for such a thing?

My guess though is that this has never been clearly defined throughout cephfs's existence. We'll probably need to write up a "spec" for this ourselves, and then fix the userland and kernel code to conform to it.

There are a few fuzzy bits — I seem to recall stuff like the dentry leases have timestamps that are provided on basic listings, and those don't get updated because it was wildly too expensive, but a full stat will refresh mtime properly — but I think this is pretty well defined. If you want to write out specific questions or inode metadata fields and their covering bits we can correct and help fill it in, but you've pretty well got it in ceph.git/doc/cephfs/capabilities.rst and being able to read locks.c will help you go a long way when trying to understand the conflicts and interactions you need to worry about.

Member Author

There is some general confusion around what caps cover these attributes. The kernel wants Fs to read the mtime in a getattr request:

        if (want & (STATX_ATIME|STATX_MTIME|STATX_CTIME|STATX_SIZE|
                    STATX_BLOCKS))
                mask |= CEPH_CAP_FILE_SHARED;

...but that looks wrong, in hindsight. Can a client hold Fw caps while another has Fs? I suspect that it can. If so, then the kclient's getattr path is not working as it should.

You should be able to check this by grokking filelock in mds/locks.c, which...okay, when I started writing this I didn't but I think I remember how it works now: You cannot issue both Fs and Fw to distinct clients — the CEPH_CAP_GWR can be issued only in LOCK_MIX or LOCK_EXCL(usive) states, and MIX doesn't allow CEPH_CAP_GSHARED. (EXCL does, but you can only be in that state with a single client on the file.)

Yeah, I double-checked this. Fs and Fw are not allowed to be issued to distinct clients at the same time.

Member Author

There is some general confusion around what caps cover these attributes. The kernel wants Fs to read the mtime in a getattr request:

        if (want & (STATX_ATIME|STATX_MTIME|STATX_CTIME|STATX_SIZE|
                    STATX_BLOCKS))
                mask |= CEPH_CAP_FILE_SHARED;

...but that looks wrong, in hindsight. Can a client hold Fw caps while another has Fs? I suspect that it can. If so, then the kclient's getattr path is not working as it should.

Thanks @jtlayton, you just reminded me: when polling for the mtime the client should acquire the Fs caps, not Fr, and then no other client can hold the Fw caps, so this should be safe. I will add this back.

Member Author

@lxbsz lxbsz Nov 4, 2021

@batrick @vshankar Please take a look.

As Jeff mentioned and I double-checked, the Fs and Fw caps are not allowed to be issued to distinct clients at the same time.

If a reader is polling for mtime updates it should use statx() with AT_STATX_FORCE_SYNC. Per the statx(2) documentation, without AT_STATX_FORCE_SYNC there is no guarantee that the mtime is the newest; the call may just return what the local cache has.

With AT_STATX_FORCE_SYNC set, the client will always sync the attributes from the MDS by acquiring the Fs caps. The MDS will then rdlock the FILE lock, which means all Fw caps must be revoked, so the writer client will flush its buffered mtime.

For more detail about the Fs caps, please see my other PR #43722.

@lxbsz
Member Author

lxbsz commented Nov 18, 2021

Rebased onto the master.

@lxbsz
Member Author

lxbsz commented Nov 18, 2021

jenkins test make check arm64

@lxbsz
Member Author

lxbsz commented Nov 18, 2021

jenkins retest this please

@vshankar
Contributor

jenkins test api

@lxbsz
Member Author

lxbsz commented Nov 30, 2021

Rebased it.

@lxbsz
Member Author

lxbsz commented Dec 1, 2021

Rebased it to address the check failures.

@lxbsz
Member Author

lxbsz commented Dec 1, 2021

jenkins test make check

Contributor

@vshankar vshankar left a comment

LGTM.

I would have liked to avoid code duplication, but that's NBD and can be done later.

If the As caps are issued and the gid/uid/mode/btime don't change,
then it's safe to skip sending a setattr request to the MDS.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
If the Fx caps are issued and the new size is larger, we can try to
buffer the truncate, and just ignore it when the new size equals the
current size.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
When the Fw caps are issued and the atime/mtime are increasing, we can
buffer the atime/mtime. And when the Fs caps are issued and they are
not changing, we can just ignore them.

During recovery the atime/mtime may decrease; in that case, no matter
which Fsw caps are issued, always send the request immediately.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
If the caller's perms differ from the current dirtier's, we can still
cache the attributes locally if possible, but we mark the ctime as
needing a forced update, and the request will implicitly flush the
dirty caps for us.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
@vshankar
Contributor

vshankar commented Jan 6, 2022

@lxbsz Is this ready for review/testing?

@lxbsz
Member Author

lxbsz commented Jan 6, 2022

@lxbsz Is this ready for review/testing?

Yeah.

@vshankar
Contributor

vshankar commented Jan 6, 2022

jenkins test api

Labels: cephfs (Ceph File System), needs-review
6 participants