Commits on Feb 20, 2022

  1. btrfs: pass btrfs_fs_info to btrfs_recover_relocation

    We don't need a root here, we just need the btrfs_fs_info; we can
    get the specific roots we need from fs_info.
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    josefbacik authored and intel-lab-lkp committed Feb 20, 2022
  2. btrfs: use btrfs_fs_info for deleting snapshots and cleaner

    We're passing a root around here, but we only really need the fs_info,
    so fix up btrfs_clean_one_deleted_snapshot() to take an fs_info instead,
    and then fix up all the callers appropriately.
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    josefbacik authored and intel-lab-lkp committed Feb 20, 2022
  3. btrfs: do not start relocation until in progress drops are done

    We hit a bug with a recovering relocation on mount for one of our file
    systems in production.  I reproduced this locally by injecting errors
    into snapshot delete with balance running at the same time.  This
    presented as an error while looking up an extent item
    
    ------------[ cut here ]------------
    WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680
    CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ torvalds#8
    RIP: 0010:lookup_inline_extent_backref+0x647/0x680
    RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202
    RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000
    RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001
    R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000
    R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000
    FS:  0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0
    Call Trace:
     <TASK>
     insert_inline_extent_backref+0x46/0xd0
     __btrfs_inc_extent_ref.isra.0+0x5f/0x200
     ? btrfs_merge_delayed_refs+0x164/0x190
     __btrfs_run_delayed_refs+0x561/0xfa0
     ? btrfs_search_slot+0x7b4/0xb30
     ? btrfs_update_root+0x1a9/0x2c0
     btrfs_run_delayed_refs+0x73/0x1f0
     ? btrfs_update_root+0x1a9/0x2c0
     btrfs_commit_transaction+0x50/0xa50
     ? btrfs_update_reloc_root+0x122/0x220
     prepare_to_merge+0x29f/0x320
     relocate_block_group+0x2b8/0x550
     btrfs_relocate_block_group+0x1a6/0x350
     btrfs_relocate_chunk+0x27/0xe0
     btrfs_balance+0x777/0xe60
     balance_kthread+0x35/0x50
     ? btrfs_balance+0xe60/0xe60
     kthread+0x16b/0x190
     ? set_kthread_struct+0x40/0x40
     ret_from_fork+0x22/0x30
     </TASK>
    ---[ end trace 7ebc95131709d2b0 ]---
    
    Normally snapshot deletion and relocation are excluded from running at
    the same time by the fs_info->cleaner_mutex.  However, if a pending
    balance was waiting to take the ->cleaner_mutex while a snapshot
    deletion was running, and then the box crashed, we would come back up
    in a state where we have a half-deleted snapshot.
    
    Again, in the normal case the snapshot deletion needs to complete before
    relocation can start, but in this case relocation could very well start
    before the snapshot deletion completes, as we simply add the root to the
    dead roots list and wait for the next time the cleaner runs to clean up
    the snapshot.
    
    Fix this by checking at mount time whether any dead roots have a
    pending drop_progress key.  If they do, we know we were in the middle
    of the drop operation, so set a flag on the fs_info.  Balance can then
    wait until this flag is cleared to start up again.
    
    If there are dead roots that don't have a drop_progress set, then
    we're safe to start balance right away, as we'll be properly protected
    by the cleaner_mutex.
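As a rough illustration of the mount-time check described above, here is a tiny standalone model; the names (`unfinished_drops`, `mark_unfinished_drops()`) are illustrative stand-ins, not the kernel's actual identifiers:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel structures. */
struct dead_root {
	bool has_drop_progress;	/* a drop_progress key was found on disk */
};

struct fs_state {
	bool unfinished_drops;	/* balance must wait while this is set */
};

/*
 * At mount, scan the dead roots: any root with a saved drop_progress key
 * means a snapshot delete was interrupted mid-drop.
 */
static void mark_unfinished_drops(struct fs_state *fs,
				  const struct dead_root *roots, int nr)
{
	fs->unfinished_drops = false;
	for (int i = 0; i < nr; i++) {
		if (roots[i].has_drop_progress) {
			fs->unfinished_drops = true;
			return;
		}
	}
}

/* Relocation may only start once the cleaner has finished those drops. */
static bool can_start_relocation(const struct fs_state *fs)
{
	return !fs->unfinished_drops;
}
```

Once the cleaner finishes the interrupted drops, the flag is cleared and any waiting balance can proceed.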
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    josefbacik authored and intel-lab-lkp committed Feb 20, 2022

Commits on Feb 15, 2022

  1. btrfs: use scrub_simple_mirror() to handle RAID56 data stripe scrub

    Although RAID56 has a complex repair mechanism, which involves
    reading the whole full stripe, the current data stripe scrub is in
    fact no different from SINGLE/RAID1.
    
    The point here is that for data stripes we just check the csum for
    each extent we hit; only in the csum mismatch case do our repair
    paths diverge.
    
    So we can still reuse scrub_simple_mirror() for RAID56 data stripes.
    
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Feb 15, 2022
  2. btrfs: introduce dedicated helper to scrub simple-stripe based range

    The new entry point will iterate through each data stripe which
    belongs to the target device.
    
    And since inside each data stripe RAID0 is just SINGLE, while RAID10
    is just RAID1, we can reuse scrub_simple_mirror() to do the scrub
    properly.
    
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Feb 15, 2022
  3. btrfs: introduce dedicated helper to scrub simple-mirror based range

    The new helper, scrub_simple_mirror(), will scrub all extents inside
    a range which only has simple-mirror-based duplication.
    
    This covers every range of SINGLE/DUP/RAID1/RAID1C*, and inside each
    data stripe for RAID0/RAID10.
    
    Currently we will use this function to scrub SINGLE/DUP/RAID1/RAID1C*
    profiles.
    As one can see, the new entry point for those simple-mirror-based
    profiles is quite small (with comments, it just reaches 100 lines).
    
    This function will be the basis for the incoming scrub refactor.
    
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Feb 15, 2022
  4. btrfs: introduce a helper to locate an extent item

    The new helper, find_first_extent_item(), will locate an extent item
    (either EXTENT_ITEM or METADATA_ITEM) which covers any byte of the
    search range.
    
    This helper will later be used to refactor scrub code.
    
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Feb 15, 2022
  5. btrfs: expand subpage support to any PAGE_SIZE > 4K

    With the recent change in metadata handling, we can handle metadata in
    the following cases:
    
    - nodesize < PAGE_SIZE and sectorsize < PAGE_SIZE
      Go subpage routine for both metadata and data.
    
    - nodesize < PAGE_SIZE and sectorsize >= PAGE_SIZE
      Invalid case for now, as we require nodesize >= sectorsize.
    
    - nodesize >= PAGE_SIZE and sectorsize < PAGE_SIZE
      Go subpage routine for data, but regular page routine for metadata.
    
    - nodesize >= PAGE_SIZE and sectorsize >= PAGE_SIZE
      Go regular page routine for both metadata and data.
    
    Now we can handle any sectorsize < PAGE_SIZE, plus the existing
    sectorsize == PAGE_SIZE support.
    
    But here we introduce an artificial limit: for any PAGE_SIZE > 4K
    case, we will only support 4K and PAGE_SIZE as the sector size.
    
    The idea here is to reduce the test combinations, and push 4K as the
    default standard in the future.
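That limit can be sketched as a small standalone check; the helper name is hypothetical (the kernel performs an equivalent validation of the superblock's sector size):

```c
#include <assert.h>
#include <stdbool.h>

#define SZ_4K 4096u

/*
 * Hypothetical helper: on 4K pages only 4K sectors are supported; on
 * any larger page size, only 4K and the page size itself are accepted.
 */
static bool supported_sectorsize(unsigned int page_size,
				 unsigned int sectorsize)
{
	if (page_size == SZ_4K)
		return sectorsize == SZ_4K;
	return sectorsize == SZ_4K || sectorsize == page_size;
}
```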
    
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Feb 15, 2022
  6. btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine

    The reason we only support 64K page size for subpage is that with a
    64K page size we can ensure that, no matter what the nodesize is, it
    fits into one page.
    
    When other page sizes come along, especially 16K, this limitation
    becomes a blocker.
    
    To remove this limitation, we allow the nodesize >= PAGE_SIZE case to
    go through the non-subpage routine.
    With this, we can allow 4K sectorsize on 16K page size.
    
    Although this introduces another, smaller limitation (metadata can
    not cross a page boundary), it is already met by most recent mkfs.
    
    Another small improvement is that we can avoid the subpage overhead
    for metadata if nodesize >= PAGE_SIZE.
    For 4K sector size and 64K page size/node size, or 4K sector size and
    16K page size/node size, we don't need to allocate extra memory for
    the metadata pages.
    
    Please note that this patch does not yet enable other page size
    support.
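The routing decision above reduces to a one-line predicate; the helper name is illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: when nodesize >= PAGE_SIZE a tree block occupies
 * whole pages, so metadata can reuse the regular (non-subpage) routine.
 */
static bool metadata_needs_subpage(unsigned int nodesize,
				   unsigned int page_size)
{
	return nodesize < page_size;
}
```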
    
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Feb 15, 2022
  7. btrfs: use dummy extent buffer for super block sys chunk array read

    In function btrfs_read_sys_array(), we allocate a real extent buffer
    using btrfs_find_create_tree_block().
    
    Such an extent buffer will even get cached in the buffer_radix tree,
    using the btree inode address space.
    
    However we only use such an extent buffer to enable the accessors,
    thus we don't even need to bother with a real extent buffer; a dummy
    one is what we really need.
    
    And for a dummy extent buffer, we no longer need to do any special
    handling for the first page, as the subpage helper is already doing
    it properly.
    
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Feb 15, 2022
  8. btrfs: zoned: mark relocation as writing

    There is a hung_task issue when running generic/068 on an SMR
    device.  The hang occurs while a process is trying to take
    sb->s_umount to thaw the filesystem.  The lock is held by fsstress,
    which calls btrfs_sync_fs() and is waiting for an ordered extent to
    finish.  However, as the FS is frozen, the ordered extent never
    finishes.
    
    Having an ordered extent while the FS is frozen is the root cause of
    the hang. The ordered extent is initiated from btrfs_relocate_chunk()
    which is called from btrfs_reclaim_bgs_work().
    
    This commit adds sb_*_write() around the btrfs_relocate_chunk() call
    site.  For the usual "btrfs balance" command, we already call it with
    mnt_want_write_file() in btrfs_ioctl_balance().
    
    Additionally, add an ASSERT in btrfs_relocate_chunk() to check it is
    properly called.
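A toy model of why taking the write reference prevents the hang; all names here are illustrative stand-ins (the real sb_start_write() blocks until the filesystem is thawed, while this model returns false so the behavior is easy to assert):

```c
#include <assert.h>
#include <stdbool.h>

struct sb_model {
	bool frozen;
	int writers;
};

static bool sb_start_write_model(struct sb_model *sb)
{
	if (sb->frozen)
		return false;	/* the kernel would block here instead */
	sb->writers++;
	return true;
}

static void sb_end_write_model(struct sb_model *sb)
{
	sb->writers--;
}

/*
 * Reclaim-side relocation takes a write reference around the chunk
 * move, mirroring the fix around the btrfs_relocate_chunk() call site,
 * so no ordered extent can be created on a frozen filesystem.
 */
static bool relocate_chunk_model(struct sb_model *sb)
{
	if (!sb_start_write_model(sb))
		return false;
	/* ... chunk relocation, which may create ordered extents ... */
	sb_end_write_model(sb);
	return true;
}
```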
    
    Fixes: 18bb8bb ("btrfs: zoned: automatically reclaim zones")
    CC: stable@vger.kernel.org # 5.13+
    Link: naota#56
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    naota authored and kdave committed Feb 15, 2022
  9. fs: add asserting functions for sb_start_{write,pagefault,intwrite}

    Add an assert function sb_assert_write_started() to check if
    sb_start_write() is properly called. It is used in the next commit.
    
    Also, add the assert functions for sb_start_pagefault() and
    sb_start_intwrite().
    
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    naota authored and kdave committed Feb 15, 2022
  10. btrfs: do not clean up repair bio if submit fails

    The submit helper will always run bio_endio() on the bio if it fails to
    submit, so cleaning up the bio just leads to a variety of UAF and NULL
    pointer deref bugs because we race with the endio function that is
    cleaning up the bio.  Instead just return BLK_STS_OK as the repair function
    has to continue to process the rest of the pages, and the endio for the
    repair bio will do the appropriate cleanup for the page that it was
    given.
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Feb 15, 2022
  11. btrfs: do not try to repair bio that has no mirror set

    If we fail to submit a bio for whatever reason, we may not have set
    up a mirror_num for that bio.  This means we shouldn't try the repair
    workflow; if we do, we'll hit a BUG_ON(!failrec->this_mirror) in
    clean_io_failure.  Instead simply skip the repair workflow if we have
    no mirror set, and add an assert to btrfs_check_repairable() to make
    it easier to catch what is happening in the future.
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Feb 15, 2022
  12. btrfs: do not double complete bio on errors during compressed reads

    I hit some weird panics while fixing up the error handling from
    btrfs_lookup_bio_sums().  Turns out the compression path will complete
    the bio we use if we set up any of the compression bios and then return
    an error, and then btrfs_submit_data_bio() will also call bio_endio() on
    the bio.
    
    Fix this by making btrfs_submit_compressed_read() responsible for
    calling bio_endio() on the bio if there are any errors.  Currently it
    was only doing it if we created the compression bios, otherwise it was
    depending on btrfs_submit_data_bio() to do the right thing.  This
    creates the above problem, so fix up btrfs_submit_compressed_read() to
    always call bio_endio() in case of an error, and then simply return from
    btrfs_submit_data_bio() if we had to call
    btrfs_submit_compressed_read().
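The single-completion invariant described above can be modeled in a few lines; every name here is an illustrative stand-in, not the kernel API:

```c
#include <assert.h>
#include <stdbool.h>

/* Count completions so a double bio_endio() is detectable. */
struct bio_model {
	int endio_calls;
};

static void bio_endio_model(struct bio_model *bio)
{
	bio->endio_calls++;
}

/*
 * Returns nonzero on error; on error it has already ended the bio,
 * which is the responsibility the commit gives to
 * btrfs_submit_compressed_read().
 */
static int submit_compressed_read_model(struct bio_model *bio, bool fail)
{
	if (fail) {
		bio_endio_model(bio);
		return -1;
	}
	return 0;
}

/* The caller must not complete the bio again after an error return. */
static void submit_data_bio_model(struct bio_model *bio, bool fail)
{
	if (submit_compressed_read_model(bio, fail))
		return;	/* already completed exactly once */
}
```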
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Feb 15, 2022
  13. btrfs: track compressed bio errors as blk_status_t

    Right now we just have a binary "errors" flag, so any error we get on
    the compressed bios gets translated to EIO.  This isn't necessarily a
    bad thing, but if we get an ENOMEM it may be nice to know that's what
    happened instead of an EIO.  Track our errors as a blk_status_t, and
    set the error appropriately.
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Feb 15, 2022
  14. btrfs: remove the bio argument from finish_compressed_bio_read

    This bio is usually one of the compressed bios, and we don't actually
    need it in this function, so remove the argument and stop passing it
    around.
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Feb 15, 2022
  15. btrfs: check correct bio in finish_compressed_bio_read

    Commit c09abff ("btrfs: cloned bios must not be iterated by
    bio_for_each_segment_all") added ASSERT()'s to make sure we weren't
    calling bio_for_each_segment_all() on a RAID5/6 bio.  However it was
    checking the bio that the compression code passed in, not the
    cb->orig_bio that we actually iterate over, so adjust this ASSERT() to
    check the correct bio.
    
    Reviewed-by: Boris Burkov <boris@bur.io>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Feb 15, 2022
  16. btrfs: handle csum lookup errors properly on reads

    Currently any error we get while trying to lookup csums during reads
    shows up as a missing csum, and then on the read completion side we spit
    out an error saying there was a csum mismatch and we increase the device
    corruption count.
    
    However we could have gotten an EIO from the lookup.  We could also
    be inside of a memory-constrained container and get an ENOMEM while
    trying to do the read.  In either case we don't want to make this
    look like a file system corruption problem; we want to make it look
    like the actual error it is.  Capture any negative value, convert it
    to the appropriate blk_status_t, free the csum array if we have one,
    and bail.
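The errno-to-status mapping can be sketched standalone; the status names mirror the kernel's blk_status_t values, but the helper itself is hypothetical:

```c
#include <assert.h>
#include <errno.h>

/* Simplified stand-ins for the kernel's blk_status_t values. */
enum blk_status { BLK_STS_OK = 0, BLK_STS_IOERR, BLK_STS_RESOURCE };

/*
 * Map a negative csum-lookup return value to a distinct status instead
 * of letting it masquerade as a csum mismatch / device corruption.
 */
static enum blk_status csum_lookup_status(int ret)
{
	if (ret >= 0)
		return BLK_STS_OK;	/* lookup succeeded */
	if (ret == -ENOMEM)
		return BLK_STS_RESOURCE;	/* memory pressure, not corruption */
	return BLK_STS_IOERR;	/* e.g. -EIO from the lookup itself */
}
```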
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Feb 15, 2022
  17. btrfs: make search_csum_tree return 0 if we get -EFBIG

    We can either fail to find a csum entry at all and return -ENOENT, or we
    can find a range that is close, but return -EFBIG.  In essence these
    both mean the same thing when we are doing a lookup for a csum in an
    existing range, we didn't find a csum.  We want to treat both of these
    errors the same way, complain loudly that there wasn't a csum.  This
    currently happens anyway because we do
    
    	count = search_csum_tree();
    	if (count <= 0) {
    		// reloc and error handling
    	}
    
    however it forces us to incorrectly treat EIO or ENOMEM errors as
    on-disk corruption.  Fix this by returning 0 if we get either -ENOENT
    or -EFBIG from btrfs_lookup_csum() so we can do proper error handling.
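The fix amounts to a small filter on btrfs_lookup_csum()'s return value, sketched here standalone (helper name hypothetical):

```c
#include <assert.h>
#include <errno.h>

/*
 * Both -ENOENT ("no entry at all") and -EFBIG ("closest range is too
 * far away") mean the same thing for this lookup: no csum covers the
 * range.  Return 0 for those so only real errors propagate.
 */
static int filter_csum_lookup_ret(int ret)
{
	if (ret == -ENOENT || ret == -EFBIG)
		return 0;
	return ret;
}
```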
    
    Reviewed-by: Boris Burkov <boris@bur.io>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Feb 15, 2022
  18. btrfs: add BTRFS_IOC_ENCODED_WRITE

    The implementation resembles direct I/O: we have to flush any ordered
    extents, invalidate the page cache, and do the io tree/delalloc/extent
    map/ordered extent dance. From there, we can reuse the compression code
    with a minor modification to distinguish the write from writeback. This
    also creates inline extents when possible.
    
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    osandov authored and kdave committed Feb 15, 2022
  19. btrfs: add BTRFS_IOC_ENCODED_READ ioctl

    There are 4 main cases:
    
    1. Inline extents: we copy the data straight out of the extent buffer.
    2. Hole/preallocated extents: we fill in zeroes.
    3. Regular, uncompressed extents: we read the sectors we need directly
       from disk.
    4. Regular, compressed extents: we read the entire compressed extent
       from disk and indicate what subset of the decompressed extent is in
       the file.
    
    This initial implementation simplifies a few things that can be improved
    in the future:
    
    - We hold the inode lock during the operation.
    - Cases 1, 3, and 4 allocate temporary memory to read into before
      copying out to userspace.
    - We don't do read repair, because it turns out that read repair is
      currently broken for compressed data.
    
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    osandov authored and kdave committed Feb 15, 2022
  20. btrfs: add definitions and documentation for encoded I/O ioctls

    In order to allow sending and receiving compressed data without
    decompressing it, we need an interface to write pre-compressed data
    directly to the filesystem and the matching interface to read compressed
    data without decompressing it. This adds the definitions for ioctls to
    do that and detailed explanations of how to use them.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    osandov authored and kdave committed Feb 15, 2022
  21. btrfs: optionally extend i_size in cow_file_range_inline()

    Currently, an inline extent is always created after i_size is extended
    from btrfs_dirty_pages(). However, for encoded writes, we only want to
    update i_size after we successfully created the inline extent. Add an
    update_i_size parameter to cow_file_range_inline() and
    insert_inline_extent() and pass in the size of the extent rather than
    determining it from i_size.
    
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    [ reformat comment ]
    Signed-off-by: David Sterba <dsterba@suse.com>
    osandov authored and kdave committed Feb 15, 2022