Jewel mds: order directories by hash and fix simultaneous readdir races #9655

Merged
gregsfortytwo merged 12 commits into ceph:jewel from wip-jewel-15508 on Jun 13, 2016

Member

@gregsfortytwo gregsfortytwo commented Jun 12, 2016

http://tracker.ceph.com/issues/16251

Ordering directories by hash makes readdir stable across directory fragmentation, and allows
an easier fix for racing readdirs on the client side.

Don't distinguish the leftmost frag from other frags. Always use 2 as
the first entry's offset.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
(cherry picked from commit 6572c2a)

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

Signed-off-by: Yan, Zheng <zyan@redhat.com>
(cherry picked from commit c41ceb9)

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

The current code saves the readdir result into MetaRequest, then updates
dir_result_t according to MetaRequest. I can't see any reason why
we need to do this.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
(cherry picked from commit db5d60d)

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

This gives us a stable ordering of dentries. (Previously, the ordering of
dentries changed after a directory was fragmented.)

Signed-off-by: Yan, Zheng <zyan@redhat.com>
(cherry picked from commit f483224)

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

so that we can introduce new flags for readdir reply.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
(cherry picked from commit 92cfbdf)

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

Client::seekdir doesn't reset dirp->at_cache_name for a forward seek
within the same frag. So the dentry with name == at_cache_name may not be
the one prior to the readdir position.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
(cherry picked from commit 0e32115)

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

This is preparation for using hash value as dentry 'offset'

Signed-off-by: Yan, Zheng <zyan@redhat.com>
(cherry picked from commit bd6546e)

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

If the MDS sorts dentries in a dirfrag in hash order, we use the hash value to
compose the dentry offset. The dentry offset is:

  (0xff << 52) | ((24 bits hash) << 28) |
  (the nth entry has hash collision)

This offset is stable across directory fragmentation.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
(cherry picked from commit 680766e)

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

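The offset layout described in the commit message above can be sketched in a few lines. This is only an illustration of the bit packing: the constant and function names are hypothetical and are not taken from the Ceph client source, and the 28-bit width of the low field is inferred from the shift in the formula.

```python
# Illustrative sketch only: names are hypothetical, not Ceph's.
HASH_BITS = 24                           # width of the hash field
SHIFT = 28                               # low bits hold the collision index
HASH_FLAG = 0xFF << (SHIFT + HASH_BITS)  # 0xff << 52, marks a hash-based offset
HASH_MASK = (1 << HASH_BITS) - 1
INDEX_MASK = (1 << SHIFT) - 1


def make_offset(hash24, nth):
    """Pack a 24-bit hash and the index among colliding entries."""
    if not 0 <= hash24 <= HASH_MASK:
        raise ValueError("hash must fit in 24 bits")
    if not 0 <= nth <= INDEX_MASK:
        raise ValueError("collision index must fit in 28 bits")
    return HASH_FLAG | (hash24 << SHIFT) | nth


def split_offset(offset):
    """Recover (hash, collision index) from a packed offset."""
    return (offset >> SHIFT) & HASH_MASK, offset & INDEX_MASK


def is_hash_offset(offset):
    """True if the top marker byte is set, i.e. the offset is hash based."""
    return (offset & HASH_FLAG) == HASH_FLAG
```

Because the hash occupies the higher bits, comparing two packed offsets compares the hashes first and the collision index second, so sorting by offset reproduces hash order.
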
Now the ordering of dentries is stable across directory fragmentation.
There is no need to reset the readdir offset if the directory gets fragmented
in the middle of a readdir.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
(cherry picked from commit 98a01af)

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

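A toy model may help show why hash ordering removes the need to reset the offset. The sketch below makes several assumptions: `toy_hash` uses `zlib.crc32` as a stand-in (Ceph uses its own dentry hash), fragments are modeled as plain hash ranges, and all function names are hypothetical.

```python
import zlib


def toy_hash(name):
    # Stand-in 24-bit hash; an assumption, not the hash the MDS uses.
    return zlib.crc32(name.encode()) & 0xFFFFFF


def list_frag(names, lo, hi):
    """Entries whose hash falls in [lo, hi), sorted by (hash, name)."""
    return sorted((toy_hash(n), n) for n in names if lo <= toy_hash(n) < hi)


def readdir(names, boundaries, after=None):
    """Walk the fragments in hash order, optionally resuming after a key."""
    out = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        for key in list_frag(names, lo, hi):
            if after is None or key > after:
                out.append(key)
    return out


names = [f"file{i}" for i in range(20)]
one_frag = [0, 1 << 24]                        # unfragmented directory
four_way = [0, 1 << 22, 1 << 23, 1 << 24]      # same directory after a split

whole = readdir(names, one_frag)
# Read 7 entries, let the directory fragment, then resume from the last key.
first_part = whole[:7]
rest = readdir(names, four_way, after=first_part[-1])
```

In this model `first_part + rest` equals `whole`: the listing is the same however the hash space is cut into fragments, so a reader can continue from its last position after a split.
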
Signed-off-by: Yan, Zheng <zyan@redhat.com>
(cherry picked from commit 9b17d14)

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

We close Inode::dir when it's empty. Once the dir is closed, we lose
track of {release,ordered}_count. This causes the directory to be wrongly
marked as complete. (The dir is trimmed to empty in the middle of a readdir.)

Signed-off-by: Yan, Zheng <zyan@redhat.com>
(cherry picked from commit 235fcf6)

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

The current readdir code uses a list to track the order of the dentries
in readdir replies. When handling a readdir reply, it pushes the
resulting dentries to the back of the directory's dentry_list. After
the readdir finishes, the dentry_list reflects how the MDS sorts dentries.

This method is racy when there are simultaneous readdirs. The fix
is to use a vector instead of a list to track how dentries are sorted in
their parent directory. As long as shared_gen doesn't change, each
dentry is at a fixed position in the vector. So concurrent readdirs
do not affect each other.

Fixes: http://tracker.ceph.com/issues/15508
Signed-off-by: Yan, Zheng <zyan@redhat.com>
(cherry picked from commit 9d297c5)

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

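The race described in the commit message above can be illustrated with a simplified model. Everything here is hypothetical: the class names are invented, the list behaviour is modeled as "move each returned dentry to the back", and the position carried by each reply is an assumption standing in for whatever index the client derives while shared_gen is unchanged.

```python
class ListOrderCache:
    """Old approach (simplified): each reply moves its dentries to the back."""

    def __init__(self):
        self.order = []

    def handle_reply(self, start, names):
        for name in names:
            if name in self.order:
                self.order.remove(name)
            self.order.append(name)


class VectorOrderCache:
    """New approach (simplified): each dentry has a fixed slot in a vector."""

    def __init__(self):
        self.order = []

    def handle_reply(self, start, names):
        for index, name in enumerate(names, start=start):
            if index >= len(self.order):
                # Grow the vector so the slot exists, then fill it.
                self.order.extend([None] * (index + 1 - len(self.order)))
            self.order[index] = name


def replay(cache, replies):
    for start, names in replies:
        cache.handle_reply(start, names)
    return cache.order


first_chunk = (0, ["a", "b"])
second_chunk = (2, ["c", "d"])
# Reader A completes its readdir, then reader B starts a new one.
replies = [first_chunk, second_chunk, first_chunk]

list_order = replay(ListOrderCache(), replies)
vector_order = replay(VectorOrderCache(), replies)
```

In this model the list ends up as `c, d, a, b` while reader B is still in progress, so it no longer matches the MDS order that reader A established. The vector stays `a, b, c, d`, because B only rewrites slots that already hold the same dentries.
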
@gregsfortytwo gregsfortytwo added cephfs Ceph File System bug-fix labels Jun 12, 2016
@gregsfortytwo gregsfortytwo added this to the jewel milestone Jun 12, 2016
@gregsfortytwo gregsfortytwo changed the title Jewel mds: order directories by hash and fix simultaneous readdir races DNM Jewel mds: order directories by hash and fix simultaneous readdir races Jun 12, 2016
@gregsfortytwo
Member Author

DNM until testing is done (shortly).

@smithfarm
Contributor

Changelog:

  • added backport tracker link to PR description

@gregsfortytwo gregsfortytwo changed the title DNM Jewel mds: order directories by hash and fix simultaneous readdir races Jewel mds: order directories by hash and fix simultaneous readdir races Jun 13, 2016
@gregsfortytwo
Member Author

http://pulpito.ceph.com/gregf-2016-06-12_19:51:59-kcephfs-greg-fs-jewel-testing---basic-mira/

SELinux failures, and missing support for pool namespace vxattrs (just an old kernel, I think).

@gregsfortytwo
Member Author

http://pulpito.ceph.com/gregf-2016-06-12_16:01:34-fs-greg-fs-jewel-testing---basic-mira/ had some OSD issues but is otherwise good.

@gregsfortytwo gregsfortytwo merged commit f902309 into ceph:jewel Jun 13, 2016
@gregsfortytwo gregsfortytwo deleted the wip-jewel-15508 branch June 13, 2016 08:22
Labels
bug-fix cephfs Ceph File System
3 participants