Josef-Bacik/bt…
Commits on Feb 20, 2022
-
btrfs: pass btrfs_fs_info to btrfs_recover_relocation
We don't need a root here, we just need the btrfs_fs_info, we can just get the specific roots we need from fs_info. Signed-off-by: Josef Bacik <josef@toxicpanda.com>
-
btrfs: use btrfs_fs_info for deleting snapshots and cleaner
We're passing a root around here, but we only really need the fs_info, so fix up btrfs_clean_one_deleted_snapshot() to take an fs_info instead, and then fix up all the callers appropriately. Signed-off-by: Josef Bacik <josef@toxicpanda.com>
-
btrfs: do not start relocation until in progress drops are done
We hit a bug with a recovering relocation on mount for one of our file systems in production. I reproduced this locally by injecting errors into snapshot delete with balance running at the same time. This presented as an error while looking up an extent item ------------[ cut here ]------------ WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680 CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ torvalds#8 RIP: 0010:lookup_inline_extent_backref+0x647/0x680 RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000 RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001 R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000 R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000 FS: 0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0 Call Trace: <TASK> insert_inline_extent_backref+0x46/0xd0 __btrfs_inc_extent_ref.isra.0+0x5f/0x200 ? btrfs_merge_delayed_refs+0x164/0x190 __btrfs_run_delayed_refs+0x561/0xfa0 ? btrfs_search_slot+0x7b4/0xb30 ? btrfs_update_root+0x1a9/0x2c0 btrfs_run_delayed_refs+0x73/0x1f0 ? btrfs_update_root+0x1a9/0x2c0 btrfs_commit_transaction+0x50/0xa50 ? btrfs_update_reloc_root+0x122/0x220 prepare_to_merge+0x29f/0x320 relocate_block_group+0x2b8/0x550 btrfs_relocate_block_group+0x1a6/0x350 btrfs_relocate_chunk+0x27/0xe0 btrfs_balance+0x777/0xe60 balance_kthread+0x35/0x50 ? btrfs_balance+0xe60/0xe60 kthread+0x16b/0x190 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 </TASK> ---[ end trace 7ebc95131709d2b0 ]--- Normally snapshot deletion and relocation are excluded from running at the same time by the fs_info->cleaner_mutex. However if we had a pending balance waiting to get the ->cleaner_mutex, and a snapshot deletion was running, and then the box crashed, we would come up in a state where we have a half deleted snapshot. Again, in the normal case the snapshot deletion needs to complete before relocation can start, but in this case relocation could very well start before the snapshot deletion completes, as we simply add the root to the dead roots list and wait for the next time the cleaner runs to clean up the snapshot. Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that had a pending drop_progress key. If they do then we know we were in the middle of the drop operation and set a flag on the fs_info. Then balance can wait until this flag is cleared to start up again. If there are DEAD_ROOT's that don't have a drop_progress set then we're safe to start balance right away as we'll be properly protected by the cleaner_mutex. Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Commits on Feb 15, 2022
-
-
btrfs: use scrub_simple_mirror() to handle RAID56 data stripe scrub
Although RAID56 has complex repair mechanism, which involves reading the whole full stripe, but for current data stripe scrub, it's in fact no different than SINGLE/RAID1. The point here is, for data stripe we just check the csum for each extent we hit. Only for csum mismatch case, our repair path divides. So we can still reuse scrub_simple_mirror() for RAID56 data stripes. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: introduce dedicated helper to scrub simple-stripe based range
The new entrance will iterate through each data stripe which belongs to the target device. And since inside each data stripe, RAID0 is just SINGLE, while RAID10 is just RAID1, we can reuse scrub_simple_mirror() to do the scrub properly. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: introduce dedicated helper to scrub simple-mirror based range
The new helper, scrub_simple_mirror(), will scrub all extents inside a range which only has simple mirror based duplication. This covers every range of SINGLE/DUP/RAID1/RAID1C*, and inside each data stripe for RAID0/RAID10. Currently we will use this function to scrub SINGLE/DUP/RAID1/RAID1C* profiles. As one can see, the new entrance for those simple-mirror based profiles can be small enough (with comments, just reach 100 lines). This function will be the basis for the incoming scrub refactor. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: introduce a helper to locate an extent item
The new helper, find_first_extent_item(), will locate an extent item (either EXTENT_ITEM or METADATA_ITEM) which covers the any byte of the search range. This helper will later be used to refactor scrub code. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: expand subpage support to any PAGE_SIZE > 4K
With the recent change in metadata handling, we can handle metadata in the following cases: - nodesize < PAGE_SIZE and sectorsize < PAGE_SIZE Go subpage routine for both metadata and data. - nodesize < PAGE_SIZE and sectorsize >= PAGE_SIZE Invalid case for now. As we require nodesize >= sectorsize. - nodesize >= PAGE_SIZE and sectorsize < PAGE_SIZE Go subpage routine for data, but regular page routine for metadata. - nodesize >= PAGE_SIZE and sectorsize >= PAGE_SIZE Go regular page routine for both metadata and data. Now we can handle any sectorsize < PAGE_SIZE, plus the existing sectorsize == PAGE_SIZE support. But here we introduce an artificial limit, any PAGE_SIZE > 4K case, we will only support 4K and PAGE_SIZE as sector size. The idea here is to reduce the test combinations, and push 4K as the default standard in the future. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine
The reason why we only support 64K page size for subpage is, for 64K page size we can ensure no matter what the nodesize is, we can fit it into one page. When other page size comes, especially like 16K, the limitation is a bit blockage. To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the non-subpage routine. By this, we can allow 4K sectorsize on 16K page size. Although this introduces another smaller limitation, the metadata can not cross page boundary, which is already met by most recent mkfs. Another small improvement is, we can avoid the overhead for metadata if nodesize >= PAGE_SIZE. For 4K sector size and 64K page size/node size, or 4K sector size and 16K page size/node size, we don't need to allocate extra memory for the metadata pages. Please note that, this patch will not yet enable other page size support yet. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: use dummy extent buffer for super block sys chunk array read
In function btrfs_read_sys_array(), we allocate a real extent buffer using btrfs_find_create_tree_block(). Such extent buffer will be even cached into buffer_radix tree, and using btree inode address space. However we only use such extent buffer to enable the accessors, thus we don't even need to bother using real extent buffer, a dummy one is what we really need. And for dummy extent buffer, we no longer need to do any special handling for the first page, as subpage helper is already doing it properly. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: zoned: mark relocation as writing
There is a hung_task issue with running generic/068 on an SMR device. The hang occurs while a process is trying to thaw the filesystem. The process is trying to take sb->s_umount to thaw the FS. The lock is held by fsstress, which calls btrfs_sync_fs() and is waiting for an ordered extent to finish. However, as the FS is frozen, the ordered extent never finish. Having an ordered extent while the FS is frozen is the root cause of the hang. The ordered extent is initiated from btrfs_relocate_chunk() which is called from btrfs_reclaim_bgs_work(). This commit add sb_*_write() around btrfs_relocate_chunk() call site. For the usual "btrfs balance" command, we already call it with mnt_want_file() in btrfs_ioctl_balance(). Additionally, add an ASSERT in btrfs_relocate_chunk() to check it is properly called. Fixes: 18bb8bb ("btrfs: zoned: automatically reclaim zones") CC: stable@vger.kernel.org # 5.13+ Link: naota#56 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
fs: add asserting functions for sb_start_{write,pagefault,intwrite}
Add an assert function sb_assert_write_started() to check if sb_start_write() is properly called. It is used in the next commit. Also, add the assert functions for sb_start_pagefault() and sb_start_intwrite(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: do not clean up repair bio if submit fails
The submit helper will always run bio_endio() on the bio if it fails to submit, so cleaning up the bio just leads to a variety of UAF and NULL pointer deref bugs because we race with the endio function that is cleaning up the bio. Instead just return STS_OK as the repair function has to continue to process the rest of the pages, and the endio for the repair bio will do the appropriate cleanup for the page that it was given. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: do not try to repair bio that has no mirror set
If we fail to submit a bio for whatever reason, we may not have setup a mirror_num for that bio. This means we shouldn't try to do the repair workflow, if we do we'll hit an BUG_ON(!failrec->this_mirror) in clean_io_failure. Instead simply skip the repair workflow if we have no mirror set, and add an assert to btrfs_check_repairable() to make it easier to catch what is happening in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: do not double complete bio on errors during compressed reads
I hit some weird panics while fixing up the error handling from btrfs_lookup_bio_sums(). Turns out the compression path will complete the bio we use if we set up any of the compression bios and then return an error, and then btrfs_submit_data_bio() will also call bio_endio() on the bio. Fix this by making btrfs_submit_compressed_read() responsible for calling bio_endio() on the bio if there are any errors. Currently it was only doing it if we created the compression bios, otherwise it was depending on btrfs_submit_data_bio() to do the right thing. This creates the above problem, so fix up btrfs_submit_compressed_read() to always call bio_endio() in case of an error, and then simply return from btrfs_submit_data_bio() if we had to call btrfs_submit_compressed_read(). Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: track compressed bio errors as blk_status_t
Right now we just have a binary "errors" flag, so any error we get on the compressed bio's gets translated to EIO. This isn't necessarily a bad thing, but if we get an ENOMEM it may be nice to know that's what happened instead of an EIO. Track our errors as a blk_status_t, and do the appropriate setting of the errors accordingly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: remove the bio argument from finish_compressed_bio_read
This bio is usually one of the compressed bio's, and we don't actually need it in this function, so remove the argument and stop passing it around. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: check correct bio in finish_compressed_bio_read
Commit c09abff ("btrfs: cloned bios must not be iterated by bio_for_each_segment_all") added ASSERT()'s to make sure we weren't calling bio_for_each_segment_all() on a RAID5/6 bio. However it was checking the bio that the compression code passed in, not the cb->orig_bio that we actually iterate over, so adjust this ASSERT() to check the correct bio. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: handle csum lookup errors properly on reads
Currently any error we get while trying to lookup csums during reads shows up as a missing csum, and then on the read completion side we spit out an error saying there was a csum mismatch and we increase the device corruption count. However we could have gotten an EIO from the lookup. We could also be inside of a memory constrained container and gotten a ENOMEM while trying to do the read. In either case we don't want to make this look like a file system corruption problem, we want to make it look like the actual error it is. Capture any negative value, convert it to the appropriate blk_sts_t, free the csum array if we have one and bail. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: make search_csum_tree return 0 if we get -EFBIG
We can either fail to find a csum entry at all and return -ENOENT, or we can find a range that is close, but return -EFBIG. In essence these both mean the same thing when we are doing a lookup for a csum in an existing range, we didn't find a csum. We want to treat both of these errors the same way, complain loudly that there wasn't a csum. This currently happens anyway because we do count = search_csum_tree(); if (count <= 0) { // reloc and error handling } however it forces us to incorrectly treat EIO or ENOMEM errors as on disk corruption. Fix this by returning 0 if we get either -ENOENT or -EFBIG from btrfs_lookup_csum() so we can do proper error handling. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> -
btrfs: add BTRFS_IOC_ENCODED_WRITE
The implementation resembles direct I/O: we have to flush any ordered extents, invalidate the page cache, and do the io tree/delalloc/extent map/ordered extent dance. From there, we can reuse the compression code with a minor modification to distinguish the write from writeback. This also creates inline extents when possible. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: add BTRFS_IOC_ENCODED_READ ioctl
There are 4 main cases: 1. Inline extents: we copy the data straight out of the extent buffer. 2. Hole/preallocated extents: we fill in zeroes. 3. Regular, uncompressed extents: we read the sectors we need directly from disk. 4. Regular, compressed extents: we read the entire compressed extent from disk and indicate what subset of the decompressed extent is in the file. This initial implementation simplifies a few things that can be improved in the future: - We hold the inode lock during the operation. - Cases 1, 3, and 4 allocate temporary memory to read into before copying out to userspace. - We don't do read repair, because it turns out that read repair is currently broken for compressed data. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: add definitions and documentation for encoded I/O ioctls
In order to allow sending and receiving compressed data without decompressing it, we need an interface to write pre-compressed data directly to the filesystem and the matching interface to read compressed data without decompressing it. This adds the definitions for ioctls to do that and detailed explanations of how to use them. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: optionally extend i_size in cow_file_range_inline()
Currently, an inline extent is always created after i_size is extended from btrfs_dirty_pages(). However, for encoded writes, we only want to update i_size after we successfully created the inline extent. Add an update_i_size parameter to cow_file_range_inline() and insert_inline_extent() and pass in the size of the extent rather than determining it from i_size. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comment ] Signed-off-by: David Sterba <dsterba@suse.com>