Boris-Burkov/b…
Commits on Apr 8, 2021
-
btrfs: verity metadata orphan items
If we don't finish creating fsverity metadata for a file, or fail to clean up already created metadata after a failure, we could leak the verity items. To address this issue, we use the orphan mechanism. When we start enabling verity on a file, we also add an orphan item for that inode. When we are finished, we delete the orphan. However, if we are interrupted midway, the orphan will be present at mount and we can clean up the half-formed verity state. There is a possible race with a normal unlink operation: if unlink and verity run on the same file in parallel, it is possible for verity to succeed and delete the still legitimate orphan added by unlink. Then, if we are interrupted and mount in that state, we will never clean up the inode properly. This is also possible for a file created with O_TMPFILE. Check nlink==0 before deleting to avoid this race. A final thing to note is that this is a resurrection of using orphans to signal orphaned metadata that isn't the inode itself. This makes the comment discussing deprecating that concept a bit messy in full context. Signed-off-by: Boris Burkov <boris@bur.io>
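The orphan lifecycle described above can be sketched as a small userspace model. This is only an illustration of the nlink check, not the btrfs code: `struct model_inode` and the `model_verity_*` helpers are hypothetical names invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of an inode; not the kernel's struct inode. */
struct model_inode {
    unsigned int nlink;   /* link count; 0 after unlink or for O_TMPFILE */
    bool has_orphan;      /* an orphan item exists for this inode */
    bool verity_enabled;  /* verity metadata is complete */
};

/* Start enabling verity: record an orphan so that an interrupted run
 * can be detected and cleaned up at the next mount. */
static void model_verity_begin(struct model_inode *inode)
{
    inode->has_orphan = true;
}

/* Finish enabling verity. The orphan is dropped only while the inode is
 * still linked; with nlink == 0 the orphan also protects the pending
 * inode deletion, so it has to stay. */
static void model_verity_finish(struct model_inode *inode)
{
    assert(inode->has_orphan); /* begin must have run first */
    inode->verity_enabled = true;
    if (inode->nlink != 0)
        inode->has_orphan = false;
}
```

In this model a linked file ends up with no orphan after a successful enable, while an unlinked or O_TMPFILE file keeps its orphan so that the inode is still cleaned up later.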
-
btrfs: fallback to buffered io for verity files
Reading the contents with direct IO would circumvent verity checks, so fall back to buffered reads. For what it's worth, this is how ext4 handles it as well. Signed-off-by: Boris Burkov <boris@bur.io>
-
btrfs: check verity for reads of inline extents and holes
The majority of reads receive a verity check after the bio is complete as the page is marked uptodate. However, there is a class of reads which are handled with btrfs logic in readpage, rather than by submitting a bio. Specifically, these are inline extents, preallocated extents, and holes. Tweak readpage so that if it is going to mark such a page uptodate, it first checks verity on it. Now if a verity-enabled file has corruption in this class of EXTENT_DATA items, it will be detected at read time. There is one annoying edge case that requires checking for start < last_byte: if userspace reads to the end of a file with page-aligned size and then tries to keep reading (as cat does), the buffered read code will try to read the page past the end of the file, and expects it to be filled with 0s and marked uptodate. That bogus page is not part of the data hashed by verity, so we have to ignore it. Signed-off-by: Boris Burkov <boris@bur.io>
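The start < last_byte edge case can be shown with a minimal sketch. It assumes that last_byte is the file size and uses a hypothetical helper name and a fixed 4K page size; the real readpage logic has more inputs than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u /* assumed page size for the example */

/* Hypothetical helper: decide whether a page that readpage fills in itself
 * (inline extent, preallocated extent or hole) needs a verity check before
 * it is marked uptodate. last_byte is assumed to be the file size. */
static bool model_page_needs_verity(bool verity_enabled,
                                    uint64_t page_start,
                                    uint64_t last_byte)
{
    assert(page_start % MODEL_PAGE_SIZE == 0); /* pages start on a boundary */

    /* A page that starts at or past EOF is not covered by the Merkle tree. */
    return verity_enabled && page_start < last_byte;
}
```

For a file of exactly two pages, the page at offset 4096 is checked, while the zero-filled page at offset 8192 that a reader like cat triggers is skipped.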
-
btrfs: initial fsverity support
Add support for fsverity in btrfs. To support the generic interface in fs/verity, we add two new item types in the fs tree for inodes with verity enabled. One stores the per-file verity descriptor and the other stores the Merkle tree data itself. Verity checking is done at the end of IOs to ensure each page is checked before it is marked uptodate. Verity relies on PageChecked for the Merkle tree data itself to avoid re-walking up shared paths in the tree. For this reason, we need to cache the Merkle tree data. Since the file is immutable after verity is turned on, we can cache it at an index past EOF. Use the new inode compat_flags to store verity on the inode item, so that we can enable verity on a file, then roll back to an older kernel and still mount the file system and read the file. Since we can't safely write the file anymore without ruining the invariants of the Merkle tree, we mark a ro_compat flag on the file system when a file has verity enabled. Signed-off-by: Chris Mason <clm@fb.com>
-
btrfs: add compat_flags to btrfs_inode_item
The tree checker currently rejects unrecognized flags when it reads btrfs_inode_item. Practically, this means that adding a new flag makes the change backwards incompatible if the flag is ever set on a file. Take up one of the 4 reserved u64 fields in the btrfs_inode_item as a new "compat_flags". These flags are zero on inode creation in btrfs and mkfs and are ignored by an older kernel, so it should be safe to use them in this way. Signed-off-by: Boris Burkov <boris@bur.io>
Commits on Apr 6, 2021
-
btrfs: add and use readahead_batch_length
Implement readahead_batch_length() to determine the number of bytes in the current batch of readahead pages and use it in btrfs. Also use the readahead_pos to get the offset. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
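As a rough illustration of what such a helper computes, the sketch below models a readahead batch in userspace. The struct layout, field names and fixed 4K page size are assumptions made for the example and do not reproduce the kernel's readahead_control.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u /* assumed page size for the example */

/* Hypothetical stand-in for the readahead state. */
struct model_readahead {
    uint64_t pos;             /* byte offset of the current batch */
    unsigned int batch_count; /* pages in the current batch */
};

/* Number of bytes covered by the current batch of pages. */
static size_t model_batch_length(const struct model_readahead *ra)
{
    assert(ra != NULL);
    return (size_t)ra->batch_count * MODEL_PAGE_SIZE;
}
```

With pos giving the offset and this length giving the size, a caller gets the byte range of the batch without walking the individual pages.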
-
btrfs: move forward declarations to the beginning of extent_io.h
There are two forward declarations deep in extent_io.h. Move them to the beginning and remove the duplicate one. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: fix race between transaction aborts and fsyncs leading to use-after-free
There is a race between a task aborting a transaction during a commit, a task doing an fsync and the transaction kthread, which leads to a use-after-free of the log root tree. When this happens, it results in a stack trace like the following:

  [99678.547335] BTRFS info (device dm-0): forced readonly
  [99678.547340] BTRFS warning (device dm-0): Skipping commit of aborted transaction.
  [99678.547341] BTRFS: error (device dm-0) in cleanup_transaction:1958: errno=-5 IO failure
  [99678.547373] BTRFS warning (device dm-0): lost page write due to IO error on /dev/mapper/error-test (-5)
  [99678.547533] BTRFS warning (device dm-0): Skipping commit of aborted transaction.
  [99678.548743] BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0xa4e8 len 4096 err no 10
  [99678.549188] BTRFS error (device dm-0): error writing primary super block to device 1
  [99678.551100] BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e000 len 4096 err no 10
  [99678.551149] BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e008 len 4096 err no 10
  [99678.551205] BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e010 len 4096 err no 10
  [99678.551401] BTRFS: error (device dm-0) in write_all_supers:4110: errno=-5 IO failure (1 errors while writing supers)
  [99678.565169] BTRFS: error (device dm-0) in btrfs_sync_log:3308: errno=-5 IO failure
  [99678.566132] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b68: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  [99678.567526] CPU: 2 PID: 2458471 Comm: fsstress Not tainted 5.12.0-rc5-btrfs-next-84 #1
  [99678.568531] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [99678.569980] RIP: 0010:__mutex_lock+0x139/0xa40
  [99678.570556] Code: c0 74 19 (...)
  [99678.573752] RSP: 0018:ffff9f18830d7b00 EFLAGS: 00010202
  [99678.574723] RAX: 6b6b6b6b6b6b6b68 RBX: 0000000000000001 RCX: 0000000000000002
  [99678.576027] RDX: ffffffffb9c54d13 RSI: 0000000000000000 RDI: 0000000000000000
  [99678.577314] RBP: ffff9f18830d7bc0 R08: 0000000000000000 R09: 0000000000000000
  [99678.578601] R10: ffff9f18830d7be0 R11: 0000000000000001 R12: ffff8c6cd199c040
  [99678.579890] R13: ffff8c6c95821358 R14: 00000000fffffffb R15: ffff8c6cbcf01358
  [99678.581282] FS: 00007fa9140c2b80(0000) GS:ffff8c6fac600000(0000) knlGS:0000000000000000
  [99678.582818] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [99678.583771] CR2: 00007fa913d52000 CR3: 000000013d2b4003 CR4: 0000000000370ee0
  [99678.584600] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [99678.585425] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [99678.586247] Call Trace:
  [99678.586542] ? __btrfs_handle_fs_error+0xde/0x146 [btrfs]
  [99678.587260] ? btrfs_sync_log+0x7c1/0xf20 [btrfs]
  [99678.587930] ? btrfs_sync_log+0x7c1/0xf20 [btrfs]
  [99678.588573] btrfs_sync_log+0x7c1/0xf20 [btrfs]
  [99678.589222] btrfs_sync_file+0x40c/0x580 [btrfs]
  [99678.589947] do_fsync+0x38/0x70
  [99678.590514] __x64_sys_fsync+0x10/0x20
  [99678.591196] do_syscall_64+0x33/0x80
  [99678.591829] entry_SYSCALL_64_after_hwframe+0x44/0xae
  [99678.592744] RIP: 0033:0x7fa9142a55c3
  [99678.593403] Code: 8b 15 09 (...)
  [99678.596777] RSP: 002b:00007fff26278d48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
  [99678.598143] RAX: ffffffffffffffda RBX: 0000563c83cb4560 RCX: 00007fa9142a55c3
  [99678.599450] RDX: 00007fff26278cb0 RSI: 00007fff26278cb0 RDI: 0000000000000005
  [99678.600770] RBP: 0000000000000005 R08: 0000000000000001 R09: 00007fff26278d5c
  [99678.602067] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000340
  [99678.603380] R13: 00007fff26278de0 R14: 00007fff26278d96 R15: 0000563c83ca57c0
  [99678.604714] Modules linked in: btrfs dm_zero dm_snapshot dm_thin_pool (...)
  [99678.616646] ---[ end trace ee2f1b19327d791d ]---

The steps that lead to this crash are the following:

1) We are at transaction N;
2) We have two tasks with a transaction handle attached to transaction N. Task A and Task B. Task B is doing an fsync;
3) Task B is at btrfs_sync_log(), and has saved fs_info->log_root_tree into a local variable named 'log_root_tree' at the top of btrfs_sync_log(). Task B is about to call write_all_supers(), but before that...
4) Task A calls btrfs_commit_transaction(), and after it sets the transaction state to TRANS_STATE_COMMIT_START, an error happens before it waits for the transaction's 'num_writers' counter to reach a value of 1 (no one else attached to the transaction), so it jumps to the label "cleanup_transaction";
5) Task A then calls cleanup_transaction(), where it aborts the transaction, setting BTRFS_FS_STATE_TRANS_ABORTED on fs_info->fs_state, setting the ->aborted field of the transaction and the handle to an errno value and also setting BTRFS_FS_STATE_ERROR on fs_info->fs_state. After that, at cleanup_transaction(), it deletes the transaction from the list of transactions (fs_info->trans_list), sets the transaction to the state TRANS_STATE_COMMIT_DOING and then waits for the number of writers to go down to 1, as it's currently 2 (1 for task A and 1 for task B);
6) The transaction kthread is running and sees that BTRFS_FS_STATE_ERROR is set in fs_info->fs_state, so it calls btrfs_cleanup_transaction(). There it sees the list fs_info->trans_list is empty, and then proceeds into calling btrfs_drop_all_logs(), which frees the log root tree with a call to btrfs_free_log_root_tree();
7) Task B calls write_all_supers() and, shortly after, under the label 'out_wake_log_root', it dereferences the pointer stored in 'log_root_tree', which was already freed in the previous step by the transaction kthread. This results in a use-after-free leading to a crash.

Fix this by deleting the transaction from the list of transactions at cleanup_transaction() only after setting the transaction state to TRANS_STATE_COMMIT_DOING and waiting for all existing tasks that are attached to the transaction to release their transaction handles. This makes the transaction kthread wait for all the tasks attached to the transaction to be done with the transaction before dropping the log roots and doing other cleanups.

Fixes: ef67963 ("btrfs: drop logs when we've aborted a transaction") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: do more graceful error/warning for 32bit kernel
Due to the pagecache limit of 32bit systems, btrfs can't access metadata at or beyond (ULONG_MAX + 1) << PAGE_SHIFT. This is 16T for 4K page size and 256T for 64K page size. Unlike other filesystems, btrfs uses an internally mapped u64 address space for all of its metadata, which makes this trickier than elsewhere. Users can have a fs which doesn't have metadata beyond the boundary at mount time, but a later balance can cause btrfs to create metadata beyond the boundary. Modifying the MM layer is unrealistic just for such a minor use case. To address this problem, this patch introduces the following checks:
- Mount time rejection: reject any fs which has a metadata chunk at or beyond the boundary.
- Mount time early warning: if there is any metadata chunk beyond 5/8 of the boundary, do an early warning and hope the end user will see it.
- Runtime extent buffer rejection: if we're going to allocate an extent buffer at or beyond the boundary, reject the request with -EOVERFLOW. This is definitely going to cause problems like transaction aborts, but we have no better way.
- Runtime extent buffer early warning: if an extent buffer beyond 5/8 of the max file size is allocated, do an early warning.
The above error/warning messages are only output once for each fs to reduce dmesg flood. Reported-by: Erik Jensen <erikjensen@rkjnsn.net> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
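The boundary arithmetic can be checked with a short userspace sketch. It models a 32-bit unsigned long with explicit 64-bit math; the function and enum names are hypothetical, and the classification only mirrors the "at or beyond the boundary" and "beyond 5/8" rules stated above.

```c
#include <assert.h>
#include <stdint.h>

/* (ULONG_MAX + 1) << PAGE_SHIFT for a 32-bit unsigned long, computed in
 * 64-bit arithmetic so that the result does not overflow. */
static uint64_t model_metadata_boundary(unsigned int page_shift)
{
    assert(page_shift <= 16); /* 4K (12) up to 64K (16) pages */
    return (UINT64_C(1) << 32) << page_shift;
}

/* Early warning threshold: 5/8 of the boundary. */
static uint64_t model_warn_threshold(unsigned int page_shift)
{
    return model_metadata_boundary(page_shift) / 8 * 5;
}

enum model_verdict { MODEL_OK, MODEL_WARN, MODEL_REJECT };

/* Hypothetical classification of a metadata address. */
static enum model_verdict model_check_bytenr(uint64_t bytenr,
                                             unsigned int page_shift)
{
    if (bytenr >= model_metadata_boundary(page_shift))
        return MODEL_REJECT;
    if (bytenr > model_warn_threshold(page_shift))
        return MODEL_WARN;
    return MODEL_OK;
}
```

For 4K pages the boundary is 2^44 bytes (16T) and the warning threshold is 10T; for 64K pages the boundary is 2^48 bytes (256T).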
-
btrfs: zoned: automatically reclaim zones
When a file gets deleted on a zoned file system, the space freed is not returned back into the block group's free space, but is migrated to zone_unusable. As this zone_unusable space is behind the current write pointer, it is not possible to use it for new allocations. In the current implementation a zone is reset once all of the block group's space is accounted as zone unusable. This behaviour can lead to premature ENOSPC errors on a busy file system. Instead of only reclaiming the zone once it is completely unusable, kick off a reclaim job once the amount of unusable bytes exceeds a user-configurable threshold between 51% and 100%. It can be set per mounted filesystem via the sysfs tunable bg_reclaim_threshold, which is set to 75% by default. Similar to reclaiming unused block groups, these dirty block groups are added to a to_reclaim list and then, on a transaction commit, the reclaim process is triggered, but only after we have deleted unused block groups, which will free space for the relocation process. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
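The threshold test can be sketched as follows. The helper name and its parameters are hypothetical, and the strict "exceeds" comparison follows the wording of the message rather than the actual kernel code, which may compare differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: should a block group be queued for reclaim?
 * threshold_pct models the bg_reclaim_threshold tunable (75 by default). */
static bool model_should_reclaim(uint64_t zone_unusable,
                                 uint64_t bg_length,
                                 unsigned int threshold_pct)
{
    assert(threshold_pct <= 100);

    /* Assumes bg_length * 100 fits in 64 bits, which holds for any
     * realistic block group size. */
    return zone_unusable > bg_length * threshold_pct / 100;
}
```

With the default of 75, a block group of 1000 units is queued once more than 750 units are unusable, well before the whole zone would have been reset under the old behaviour.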
-
btrfs: rename delete_unused_bgs_mutex
As a preparation for another user, rename the unused_bgs_mutex into reclaim_bgs_lock. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: check return value of btrfs_commit_transaction in relocation
There are a few places where we don't check the return value of btrfs_commit_transaction in relocation.c. Thankfully all these places have straightforward error handling, so simply change all of the sites at once. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: do proper error handling in merge_reloc_roots
We have a BUG_ON() if we get an error back from btrfs_get_fs_root(). This honestly should never fail, as at this point we have a solid coordination of fs root to reloc root, and these roots will all be in memory. But in the name of killing BUG_ON()'s, remove these and handle the error condition properly, ASSERT()'ing for developers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: handle extent corruption with select_one_root properly
In corruption cases we could have paths from a block up to no root at all, and thus we'll BUG_ON(!root) in select_one_root. Handle this by adding an ASSERT() for developers, and returning an error for normal users. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
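The "ASSERT() for developers, error for normal users" pattern used throughout this series can be sketched in userspace. The MODEL_ASSERT macro, the MODEL_DEBUG switch, the struct and the -ENOENT return value are all placeholders chosen for the example; they are not the kernel's definitions or the error code the patch returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a debug-only assertion: active only when
 * MODEL_DEBUG is defined, compiled out otherwise. */
#ifdef MODEL_DEBUG
#define MODEL_ASSERT(cond) assert(cond)
#else
#define MODEL_ASSERT(cond) ((void)0)
#endif

struct model_root {
    int id;
};

/* Instead of crashing on a missing root, trip an assertion in debug
 * builds and return an error to the caller in normal builds. */
static int model_select_root(struct model_root *candidate,
                             struct model_root **out)
{
    MODEL_ASSERT(candidate != NULL);
    if (candidate == NULL)
        return -ENOENT; /* placeholder error code */

    *out = candidate;
    return 0;
}
```

A developer build stops at the assertion and gets a precise location, while a production build propagates the error so the caller can fail the operation cleanly.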
-
btrfs: cleanup error handling in prepare_to_merge
This probably can't happen even with a corrupt file system, because we would have failed much earlier on than here. However there's no reason we can't just check and bail out as appropriate, so do that and convert the correctness BUG_ON() to an ASSERT(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: do not panic in __add_reloc_root
If we have a duplicate entry for a reloc root then we could have fs corruption that resulted in a double allocation. Since this shouldn't happen unless there is corruption, add an ASSERT(ret != -EEXIST) to all of the callers of __add_reloc_root() to catch any logic mistakes for developers, otherwise normal error handling will happen for normal users. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: handle __add_reloc_root failures in btrfs_recover_relocation
We can already handle errors appropriately from this function, so deal with an error coming from __add_reloc_root appropriately. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: do proper error handling in create_reloc_inode
We already handle some errors in this function, and the callers do the correct error handling, so clean up the rest of the function to do the appropriate error handling. There's a little extra work that needs to be done here, as we create the inode item before we create the orphan item. We could potentially add the orphan item, but if we failed to create the inode item we would have to abort the transaction. Instead add a helper to delete the inode item we created in the case that we're unable to look up the inode (this would likely be caused by an ENOMEM), which if it succeeds means we can avoid a transaction abort in this particular error case. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: remove the extent item sanity checks in relocate_block_group
These checks are all taken care of for us by the tree checker code:
- the flags don't change or are updated consistently
- the v0 extent item format is invalid and caught in many other places too
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: tree-checker: check for BTRFS_BLOCK_FLAG_FULL_BACKREF being set improperly
We need to validate that a data extent item does not have the FULL_BACKREF flag set on its flags. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
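The validation amounts to a bit test on the extent item flags. The sketch below uses MODEL_ prefixed constants; their bit positions are assumed to mirror the btrfs on-disk flag values, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the on-disk extent flag bits; treat as illustrative. */
#define MODEL_EXTENT_FLAG_DATA        (1ULL << 0)
#define MODEL_EXTENT_FLAG_TREE_BLOCK  (1ULL << 1)
#define MODEL_BLOCK_FLAG_FULL_BACKREF (1ULL << 8)

static_assert((MODEL_EXTENT_FLAG_DATA & MODEL_BLOCK_FLAG_FULL_BACKREF) == 0,
              "flag bits must be distinct");

/* FULL_BACKREF only makes sense on tree blocks, so a data extent item
 * carrying it is treated as corrupted. */
static bool model_extent_flags_valid(uint64_t flags)
{
    if ((flags & MODEL_EXTENT_FLAG_DATA) &&
        (flags & MODEL_BLOCK_FLAG_FULL_BACKREF))
        return false;
    return true;
}
```

Catching this at tree-checker time means that later code, such as relocation, no longer needs its own sanity checks for the combination.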
-
btrfs: handle extent reference errors in do_relocation
We can already deal with errors appropriately from do_relocation, simply handle any errors that come from changing the refs at this point cleanly. We have to abort the transaction if we fail here as we've modified metadata at this point. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: handle errors in reference count manipulation in replace_path
If any of the reference count manipulation stuff fails in replace_path we need to abort the transaction, as we've modified the blocks already. We can simply break at this point and everything will be cleaned up. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: handle btrfs_search_slot failure in replace_path
The search can fail for various reasons. In case of errors there's no cleanup to be done, so we can pass the error to the caller, adjusting for the case where the key is not found and search slot returns 1. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
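The return-value convention can be modelled with a toy search over a sorted array. The `model_search_slot` function only imitates the "0 found, 1 not found, negative errno on failure" convention described above; both helpers and the -ENOENT mapping are assumptions for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy search over a sorted array: 0 if key is found, 1 if it is not
 * (slot then holds the insertion position), negative errno on failure. */
static int model_search_slot(const int *keys, size_t n, int key, size_t *slot)
{
    if (keys == NULL || slot == NULL)
        return -EINVAL;

    for (size_t i = 0; i < n; i++) {
        if (keys[i] >= key) {
            *slot = i;
            return keys[i] == key ? 0 : 1;
        }
    }
    *slot = n;
    return 1;
}

/* Caller that needs an exact match: real errors pass through untouched,
 * and the positive "not found" result becomes an error as well. */
static int model_find_exact(const int *keys, size_t n, int key, size_t *slot)
{
    int ret = model_search_slot(keys, n, key, slot);

    if (ret > 0)
        ret = -ENOENT;
    if (ret == 0)
        assert(keys[*slot] == key);
    return ret;
}
```

Without the adjustment, a caller that only tests for ret < 0 would treat "not found" as success and continue with a slot that does not hold the expected key.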
-
btrfs: handle btrfs_cow_block errors in replace_path
If we error out COWing the root node when doing a replace_path then we simply unlock and free the buffer and return the error. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: convert logic BUG_ON()'s in replace_path to ASSERT()'s
A few BUG_ON()'s in replace_path are purely to keep us from making logical mistakes, so replace them with ASSERT()'s. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: do proper error handling in btrfs_update_reloc_root
We call btrfs_update_root in btrfs_update_reloc_root, which can fail for all sorts of reasons, including IO errors. Instead of panicking the box, let's return the error, now that all callers properly handle those errors. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>