BTRFS/NFSD: provide more unique inode number for btrfs export
BTRFS does not provide unique inode numbers across a filesystem.
It only provides unique inode numbers within a subvolume and
uses synthetic device numbers for different subvolumes to ensure
uniqueness for device+inode.

nfsd cannot use these varying synthetic device numbers.  If nfsd were to
synthesise different stable filesystem ids to give to the client, that
would cause subvolumes to appear in the mount table on the client, even
though they don't appear in the mount table on the server.  Also, NFSv3
doesn't support changing the filesystem id without a new explicit mount
on the client (this is partially supported in practice, but violates the
protocol specification and has problems in some edge cases).

So currently, the roots of all subvolumes in a filesystem report the
same inode number to NFS clients, and tools like 'find' notice that
a directory has the same identity as an ancestor and so refuse to
enter that directory.
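
The loop check that trips on these duplicate identities can be sketched as follows. This is only an illustration of the idea, not the actual implementation in 'find'; the names dir_id and looks_like_loop are hypothetical, and 256 is used because that is the inode number of a btrfs subvolume root.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Identity of a directory as a client-side tool sees it via stat(). */
struct dir_id {
	uint64_t dev;
	uint64_t ino;
};

/*
 * Return true if 'candidate' has the same (dev, ino) identity as any
 * directory already on the path from the walk root (its ancestors).
 */
static bool looks_like_loop(const struct dir_id *ancestors, size_t depth,
			    struct dir_id candidate)
{
	assert(ancestors != NULL || depth == 0);
	for (size_t i = 0; i < depth; i++) {
		if (ancestors[i].dev == candidate.dev &&
		    ancestors[i].ino == candidate.ino)
			return true;
	}
	return false;
}
```

Over NFS every subvolume root arrives with the same dev and the same inode number, so a nested subvolume root is indistinguishable from its ancestor and the walker skips it.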

This patch allows btrfs (or any filesystem) to provide a 64bit number
that can be xored with the inode number to make the number more unique.
Rather than the client being certain to see duplicates, with this patch
it is possible but extremely rare.

The number that btrfs provides is a swab64() version of the subvolume
identifier.  This has most entropy in the high bits (the low bits of the
subvolume identifier), while the inode has most entropy in the low bits.
The result will always be unique within a subvolume, and will almost
always be unique across the filesystem.
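
A minimal userspace sketch of this mixing, assuming a hand-written byte swap in place of the kernel's swab64(). The names swab64_sketch, mix_ino and mix_ino_demo are hypothetical and the subvolume ids are made-up examples.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's swab64(): reverse the byte order. */
static uint64_t swab64_sketch(uint64_t x)
{
	uint64_t r = 0;

	for (int i = 0; i < 8; i++) {
		r = (r << 8) | (x & 0xff);
		x >>= 8;
	}
	return r;
}

/* XOR the inode number with the byte-swapped subvolume identifier. */
static uint64_t mix_ino(uint64_t subvol_id, uint64_t ino)
{
	return ino ^ swab64_sketch(subvol_id);
}

/* Illustrates the "usually unique, collisions possible" claim. */
static void mix_ino_demo(void)
{
	/* Same inode number in two subvolumes: the results differ. */
	assert(mix_ino(257, 256) != mix_ino(258, 256));
	/* A collision needs an inode number with bits set above 2^56. */
	assert(mix_ino(257, 256) ==
	       mix_ino(258, 256 ^ 0x0300000000000000ULL));
}
```

Within one subvolume the XOR value is constant, so distinct inode numbers stay distinct. Across subvolumes a collision requires one inode number to reach into the high bytes, which is why duplicates are possible but should be rare.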

If an upgrade of the NFS server caused all inode numbers in an exported
BTRFS filesystem to appear to the client to change, the client might not
handle this well.  The Linux client will cause any open files to become
'stale'.  If the mount point changed inode number, the whole mount would
become inaccessible.

To avoid this, an unused byte in the filehandle (fh_auth_type) has been
repurposed as "fh_options".  (The use of #defines makes fh_flags a
problematic choice).  The new behaviour of uniquifying inode numbers is
only activated when the NFSD_FH_OPTION_INO_UNIQUIFY bit in this byte is set.
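
As a rough userspace model of the repurposed byte: the struct below mirrors the four descriptor bytes of the first filehandle word described in the uapi header, and the mask check mirrors the one added to nfsd_set_fh_dentry(). The struct and function names are hypothetical; only the bit value 1 comes from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
	FH_OPTION_INO_UNIQUIFY = 1,	/* mirrors NFSD_FH_OPTION_INO_UNIQUIFY */
	FH_OPTION_ALL = 1,		/* mask of every option currently known */
};

/* First four-byte word of a new-style filehandle (illustrative model). */
struct fh_first_word {
	uint8_t version;	/* == 1 */
	uint8_t options;	/* previously auth_type, which had to be 0 */
	uint8_t fsid_type;
	uint8_t fileid_type;
};

/* A handle carrying an option bit the server does not know is rejected. */
static bool fh_options_valid(const struct fh_first_word *w)
{
	return (w->options & ~FH_OPTION_ALL) == 0;
}

/* Inode uniquifying applies only when the handle carries the bit. */
static bool fh_wants_uniquify(const struct fh_first_word *w)
{
	assert(fh_options_valid(w));
	return (w->options & FH_OPTION_INO_UNIQUIFY) != 0;
}
```

Handles issued before the change have this byte set to 0, so they remain valid and keep the old inode numbers.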

NFSD will only set this bit in filehandles it reports if the filehandle
of the parent (provided by the client) contains the bit, or if
 - the filehandle for the parent is not provided or is for a different
   export and
 - the filehandle refers to a BTRFS filesystem.
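
These rules can be sketched as a small pure function. It is a simplified model of the logic added to fh_compose(), with hypothetical parameter names standing in for the ref_fh, export and superblock checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FH_OPTION_INO_UNIQUIFY 1	/* mirrors NFSD_FH_OPTION_INO_UNIQUIFY */

/*
 * Choose the options byte for a filehandle being composed.
 *  have_parent:    the client supplied a parent (reference) filehandle
 *  same_export:    that parent belongs to the same export
 *  parent_options: options byte taken from the parent filehandle
 *  is_btrfs:       the export lives on a BTRFS filesystem
 */
static uint8_t choose_fh_options(bool have_parent, bool same_export,
				 uint8_t parent_options, bool is_btrfs)
{
	/* Within an export, children inherit whatever the parent carries. */
	if (have_parent && same_export)
		return parent_options;
	/* Fresh mount or export crossing: decide from the filesystem type. */
	return is_btrfs ? FH_OPTION_INO_UNIQUIFY : 0;
}

static void choose_fh_options_demo(void)
{
	/* A mount made before the upgrade keeps handing out option-free handles. */
	assert(choose_fh_options(true, true, 0, true) == 0);
	/* A new mount of a BTRFS export gets the bit. */
	assert(choose_fh_options(false, false, 0, true) == FH_OPTION_INO_UNIQUIFY);
}
```

Inheriting from the parent is what keeps inode numbers stable for clients that mounted before the server was upgraded.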

Thus if you have a BTRFS filesystem originally mounted from a server
without this patch, the flag will never be set and the current behaviour
will continue.  Only once you re-mount the filesystem (or the filesystem
is re-auto-mounted) will the inode numbers change.  When that happens,
it is likely that the filesystem st_dev number seen on the client will
change anyway.

Signed-off-by: NeilBrown <neilb@suse.de>
neilbrown authored and intel-lab-lkp committed Aug 23, 2021
1 parent 2734d6c commit e99ff00
Showing 8 changed files with 87 additions and 12 deletions.
4 changes: 4 additions & 0 deletions fs/btrfs/inode.c
@@ -9195,6 +9195,10 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
 	generic_fillattr(&init_user_ns, inode, stat);
 	stat->dev = BTRFS_I(inode)->root->anon_dev;

+	if (BTRFS_I(inode)->root->root_key.objectid != BTRFS_FS_TREE_OBJECTID)
+		stat->ino_uniquifier =
+			swab64(BTRFS_I(inode)->root->root_key.objectid);
+
 	spin_lock(&BTRFS_I(inode)->lock);
 	delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes;
 	inode_bytes = inode_get_bytes(inode);
15 changes: 14 additions & 1 deletion fs/nfsd/nfs3xdr.c
@@ -340,6 +340,7 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
 {
 	struct user_namespace *userns = nfsd_user_namespace(rqstp);
 	__be32 *p;
+	u64 ino;
 	u64 fsid;

 	p = xdr_reserve_space(xdr, XDR_UNIT * 21);
@@ -377,7 +378,8 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
 	p = xdr_encode_hyper(p, fsid);

 	/* fileid */
-	p = xdr_encode_hyper(p, stat->ino);
+	ino = nfsd_uniquify_ino(fhp, stat);
+	p = xdr_encode_hyper(p, ino);

 	p = encode_nfstime3(p, &stat->atime);
 	p = encode_nfstime3(p, &stat->mtime);
@@ -1151,6 +1153,17 @@ svcxdr_encode_entry3_common(struct nfsd3_readdirres *resp, const char *name,
 	if (xdr_stream_encode_item_present(xdr) < 0)
 		return false;
 	/* fileid */
+	if (!resp->dir_have_uniquifier) {
+		struct kstat stat;
+		if (fh_getattr(&resp->fh, &stat) == nfs_ok)
+			resp->dir_ino_uniquifier =
+				nfsd_ino_uniquifier(&resp->fh, &stat);
+		else
+			resp->dir_ino_uniquifier = 0;
+		resp->dir_have_uniquifier = true;
+	}
+	if (resp->dir_ino_uniquifier != ino)
+		ino ^= resp->dir_ino_uniquifier;
 	if (xdr_stream_encode_u64(xdr, ino) < 0)
 		return false;
 	/* name */
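The readdir hunk above looks up the directory's uniquifier once and reuses it for every entry. A standalone sketch of that lazy caching, assuming a caller-supplied lookup callback in place of fh_getattr(); dir_state, entry_fileid and example_lookup are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-READDIR state, modelled on the two fields added to nfsd3_readdirres. */
struct dir_state {
	uint64_t uniquifier;
	bool have_uniquifier;
};

/* Example lookup: counts its calls and returns a fixed uniquifier. */
static uint64_t example_lookup(void *ctx)
{
	int *calls = ctx;

	(*calls)++;
	return 0x0101000000000000ULL;
}

/* Compute the fileid for one entry, fetching the uniquifier only once. */
static uint64_t entry_fileid(struct dir_state *d, uint64_t ino,
			     uint64_t (*lookup)(void *), void *ctx)
{
	if (!d->have_uniquifier) {
		d->uniquifier = lookup(ctx);
		d->have_uniquifier = true;
	}
	/* Skip the XOR when it would turn the inode number into zero. */
	if (d->uniquifier != ino)
		ino ^= d->uniquifier;
	return ino;
}

static void entry_fileid_demo(void)
{
	int calls = 0;
	struct dir_state d = { 0, false };

	assert(entry_fileid(&d, 256, example_lookup, &calls) ==
	       0x0101000000000100ULL);
	assert(entry_fileid(&d, 257, example_lookup, &calls) ==
	       0x0101000000000101ULL);
	assert(calls == 1);	/* the lookup ran for the first entry only */
}
```

The cached value is the directory's own, which matches the note in include/linux/stat.h that a directory's ino_uniquifier can reasonably be applied to the inode numbers reported by the filldir callback.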
7 changes: 4 additions & 3 deletions fs/nfsd/nfs4xdr.c
@@ -3114,10 +3114,11 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 				fhp->fh_handle.fh_size);
 	}
 	if (bmval0 & FATTR4_WORD0_FILEID) {
+		u64 ino = nfsd_uniquify_ino(fhp, &stat);
 		p = xdr_reserve_space(xdr, 8);
 		if (!p)
 			goto out_resource;
-		p = xdr_encode_hyper(p, stat.ino);
+		p = xdr_encode_hyper(p, ino);
 	}
 	if (bmval0 & FATTR4_WORD0_FILES_AVAIL) {
 		p = xdr_reserve_space(xdr, 8);
@@ -3274,7 +3275,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,

 		p = xdr_reserve_space(xdr, 8);
 		if (!p)
-			goto out_resource;
+			goto out_resource;
 		/*
 		 * Get parent's attributes if not ignoring crossmount
 		 * and this is the root of a cross-mounted filesystem.
@@ -3284,7 +3285,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 			err = get_parent_attributes(exp, &parent_stat);
 			if (err)
 				goto out_nfserr;
-			ino = parent_stat.ino;
+			ino = nfsd_uniquify_ino(fhp, &parent_stat);
 		}
 		p = xdr_encode_hyper(p, ino);
 	}
13 changes: 11 additions & 2 deletions fs/nfsd/nfsfh.c
@@ -172,7 +172,7 @@ static __be32 nfsd_set_fh_dentry(struct svc_rqst *rqstp, struct svc_fh *fhp)

 		if (--data_left < 0)
 			return error;
-		if (fh->fh_auth_type != 0)
+		if ((fh->fh_options & ~NFSD_FH_OPTION_ALL) != 0)
 			return error;
 		len = key_len(fh->fh_fsid_type) / 4;
 		if (len == 0)
@@ -569,6 +569,7 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,

 	struct inode * inode = d_inode(dentry);
 	dev_t ex_dev = exp_sb(exp)->s_dev;
+	u8 options = 0;

 	dprintk("nfsd: fh_compose(exp %02x:%02x/%ld %pd2, ino=%ld)\n",
 		MAJOR(ex_dev), MINOR(ex_dev),
@@ -585,6 +586,14 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
 	/* If we have a ref_fh, then copy the fh_no_wcc setting from it. */
 	fhp->fh_no_wcc = ref_fh ? ref_fh->fh_no_wcc : false;

+	if (ref_fh && ref_fh->fh_export == exp) {
+		options = ref_fh->fh_handle.fh_options;
+	} else {
+		/* Set options as needed */
+		if (exp->ex_path.mnt->mnt_sb->s_magic == BTRFS_SUPER_MAGIC)
+			options |= NFSD_FH_OPTION_INO_UNIQUIFY;
+	}
+
 	if (ref_fh == fhp)
 		fh_put(ref_fh);

@@ -615,7 +624,7 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
 	} else {
 		fhp->fh_handle.fh_size =
 			key_len(fhp->fh_handle.fh_fsid_type) + 4;
-		fhp->fh_handle.fh_auth_type = 0;
+		fhp->fh_handle.fh_options = options;

 		mk_fsid(fhp->fh_handle.fh_fsid_type,
 			fhp->fh_handle.fh_fsid,
22 changes: 22 additions & 0 deletions fs/nfsd/nfsfh.h
@@ -84,6 +84,28 @@ enum fsid_source {
 };
 extern enum fsid_source fsid_source(const struct svc_fh *fhp);

+enum nfsd_fh_options {
+	NFSD_FH_OPTION_INO_UNIQUIFY = 1, /* BTRFS only */
+
+	NFSD_FH_OPTION_ALL = 1
+};
+
+static inline u64 nfsd_ino_uniquifier(const struct svc_fh *fhp,
+				      const struct kstat *stat)
+{
+	if (fhp->fh_handle.fh_options & NFSD_FH_OPTION_INO_UNIQUIFY)
+		return stat->ino_uniquifier;
+	return 0;
+}
+
+static inline u64 nfsd_uniquify_ino(const struct svc_fh *fhp,
+				    const struct kstat *stat)
+{
+	u64 u = nfsd_ino_uniquifier(fhp, stat);
+	if (u != stat->ino)
+		return stat->ino ^ u;
+	return stat->ino;
+}
+
 /*
  * This might look a little large to "inline" but in all calls except
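The two helpers above can be exercised outside the kernel with a small model. This sketch assumes simplified stand-ins for struct kstat and the filehandle options byte; kstat_sketch and the function names are hypothetical, and the reading of the `u != ino` test as a guard against a zero result is an inference from the code, not something the commit states.

```c
#include <assert.h>
#include <stdint.h>

#define FH_OPTION_INO_UNIQUIFY 1	/* mirrors NFSD_FH_OPTION_INO_UNIQUIFY */

/* Only the two kstat fields that matter here. */
struct kstat_sketch {
	uint64_t ino;
	uint64_t ino_uniquifier;
};

/* The uniquifier counts only when the filehandle carries the option bit. */
static uint64_t ino_uniquifier(uint8_t fh_options,
			       const struct kstat_sketch *st)
{
	if (fh_options & FH_OPTION_INO_UNIQUIFY)
		return st->ino_uniquifier;
	return 0;
}

/* XOR in the uniquifier unless that would produce zero. */
static uint64_t uniquify_ino(uint8_t fh_options,
			     const struct kstat_sketch *st)
{
	uint64_t u = ino_uniquifier(fh_options, st);

	if (u != st->ino)
		return st->ino ^ u;
	return st->ino;
}

static void uniquify_ino_demo(void)
{
	struct kstat_sketch st = { 256, 0x0101000000000000ULL };

	/* Old-style handle: the inode number is reported unchanged. */
	assert(uniquify_ino(0, &st) == 256);
	/* Handle with the option bit: the uniquifier is mixed in. */
	assert(uniquify_ino(FH_OPTION_INO_UNIQUIFY, &st) ==
	       0x0101000000000100ULL);
}
```

Because the XOR value is 0 whenever the bit is clear, callers such as svcxdr_encode_fattr3() can call the helper unconditionally.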
2 changes: 2 additions & 0 deletions fs/nfsd/xdr3.h
@@ -179,6 +179,8 @@ struct nfsd3_readdirres {
 	struct xdr_buf dirlist;
 	struct svc_fh scratch;
 	struct readdir_cd common;
+	u64 dir_ino_uniquifier;
+	bool dir_have_uniquifier;
 	unsigned int cookie_offset;
 	struct svc_rqst * rqstp;

18 changes: 18 additions & 0 deletions include/linux/stat.h
@@ -46,6 +46,24 @@ struct kstat {
 	struct timespec64 btime; /* File creation time */
 	u64 blocks;
 	u64 mnt_id;
+	/*
+	 * BTRFS does not provide unique inode numbers within a filesystem,
+	 * depending on a synthetic 'dev' to provide uniqueness.
+	 * NFSd cannot make use of this 'dev' number so clients often see
+	 * duplicate inode numbers.
+	 * For BTRFS, 'ino' is unlikely to use the high bits until the filesystem
+	 * has created a great many inodes.
+	 * It puts another number in ino_uniquifier which:
+	 * - has most entropy in the high bits
+	 * - is different precisely when 'dev' is different
+	 * - is stable across unmount/remount
+	 * NFSd can xor this with 'ino' to get a substantially more unique
+	 * number for reporting to the client.
+	 * The ino_uniquifier for a directory can reasonably be applied
+	 * to inode numbers reported by the readdir filldir callback.
+	 * It is NOT currently exported to user-space.
+	 */
+	u64 ino_uniquifier;
 };

 #endif
18 changes: 12 additions & 6 deletions include/uapi/linux/nfsd/nfsfh.h
@@ -38,11 +38,17 @@ struct nfs_fhbase_old {
  * The file handle starts with a sequence of four-byte words.
  * The first word contains a version number (1) and three descriptor bytes
  * that tell how the remaining 3 variable length fields should be handled.
- * These three bytes are auth_type, fsid_type and fileid_type.
+ * These three bytes are options, fsid_type and fileid_type.
  *
  * All four-byte values are in host-byte-order.
  *
- * The auth_type field is deprecated and must be set to 0.
+ * The options field (previously auth_type) can be used when nfsd behaviour
+ * needs to change in a non-compatible way, usually for some specific
+ * filesystem. Options should only be set in filehandles for filesystems which
+ * need them.
+ * Current values:
+ * 1 - BTRFS only. Cause stat->ino_uniquifier to be used to improve inode
+ * number uniqueness.
  *
  * The fsid_type identifies how the filesystem (or export point) is
  * encoded.
@@ -67,7 +73,7 @@ struct nfs_fhbase_new {
 	union {
 		struct {
 			__u8 fb_version_aux; /* == 1, even => nfs_fhbase_old */
-			__u8 fb_auth_type_aux;
+			__u8 fb_options_aux;
 			__u8 fb_fsid_type_aux;
 			__u8 fb_fileid_type_aux;
 			__u32 fb_auth[1];
@@ -76,7 +82,7 @@ struct nfs_fhbase_new {
 		};
 		struct {
 			__u8 fb_version; /* == 1, even => nfs_fhbase_old */
-			__u8 fb_auth_type;
+			__u8 fb_options;
 			__u8 fb_fsid_type;
 			__u8 fb_fileid_type;
 			__u32 fb_auth_flex[]; /* flexible-array member */
@@ -106,11 +112,11 @@ struct knfsd_fh {

 #define fh_version fh_base.fh_new.fb_version
 #define fh_fsid_type fh_base.fh_new.fb_fsid_type
-#define fh_auth_type fh_base.fh_new.fb_auth_type
+#define fh_options fh_base.fh_new.fb_options
 #define fh_fileid_type fh_base.fh_new.fb_fileid_type
 #define fh_fsid fh_base.fh_new.fb_auth_flex

 /* Do not use, provided for userspace compatiblity. */
-#define fh_auth fh_base.fh_new.fb_auth
+#define fh_auth fh_base.fh_new.fb_options

 #endif /* _UAPI_LINUX_NFSD_FH_H */
