
Commits on Feb 6, 2020

  1. btrfs: print message when tree-log replay starts

    There's no logged information about tree-log replay although this is
    something that points to a previous unclean unmount. Other filesystems
    report that as well.
    
    Suggested-by: Chris Murphy <lists@colorremedies.com>
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: David Sterba <dsterba@suse.com>
    kdave authored and 0day robot committed Feb 6, 2020

Commits on Feb 2, 2020

  1. btrfs: do not zero f_bavail if we have available space

    There was some logic added a while ago to clear out f_bavail in statfs()
    if we did not have enough free metadata space to satisfy our global
    reserve.  This was incorrect at the time, however it didn't really pose a
    problem for normal file systems because we would often allocate chunks
    if we got this low on free metadata space, and thus wouldn't really hit
    this case unless we were actually full.
    
    Fast forward to today and now we are much better about not allocating
    metadata chunks all of the time.  Couple this with d792b0f ("btrfs:
    always reserve our entire size for the global reserve"), which means
    we'll easily have a larger global reserve than our free space, and we
    are now more likely to trip over this while still having plenty of space.
    
    Fix this by skipping this logic if the global rsv's space_info is not
    full.  space_info->full is 0 unless we've attempted to allocate a chunk
    for that space_info and that has failed.  If this happens then the space
    for the global reserve is definitely sacred and we need to report
    f_bavail == 0, but before then we can just use our calculated f_bavail.
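The fixed check described above can be sketched in plain C. This is a minimal model, not the kernel code; the struct fields and the `statfs_bavail()` helper are illustrative stand-ins:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Simplified stand-ins for the kernel structures (illustrative fields). */
struct space_info {
	int full;   /* set once a chunk allocation for this space_info failed */
};

struct block_rsv {
	u64 size;
	struct space_info *space_info;
};

/*
 * Sketch of the fixed statfs() logic: only report f_bavail == 0 when free
 * metadata space cannot satisfy the global reserve AND the reserve's
 * space_info is full (i.e. allocating another chunk already failed).
 */
static u64 statfs_bavail(u64 computed_bavail, u64 free_meta,
			 const struct block_rsv *global_rsv)
{
	if (free_meta < global_rsv->size && global_rsv->space_info->full)
		return 0;            /* reserve space is sacred */
	return computed_bavail;      /* otherwise trust the calculated value */
}
```

Before the fix, the `space_info->full` condition was effectively absent, so a large reserve alone was enough to zero the reported available space.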
    
    Reported-by: Martin Steigerwald <martin@lichtvoll.de>
    Fixes: ca8a51b ("btrfs: statfs: report zero available if metadata are exhausted")
    CC: stable@vger.kernel.org # 4.5+
    Reviewed-by: Qu Wenruo <wqu@suse.com>
    Tested-By: Martin Steigerwald <martin@lichtvoll.de>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Feb 2, 2020

Commits on Jan 31, 2020

  1. Btrfs: send, fix emission of invalid clone operations within the same file
    
    When doing an incremental send and a file has extents shared with itself
    at different file offsets, it's possible for send to emit clone operations
    that will fail at the destination because the source range goes beyond the
    file's current size. This happens when the file size has increased in the
    send snapshot, there is a hole between the shared extents and both shared
    extents are at file offsets which are greater than the file's size in the
    parent snapshot.
    
    Example:
    
      $ mkfs.btrfs -f /dev/sdb
      $ mount /dev/sdb /mnt/sdb
    
      $ xfs_io -f -c "pwrite -S 0xf1 0 64K" /mnt/sdb/foobar
      $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
      $ btrfs send -f /tmp/1.snap /mnt/sdb/base
    
      # Create a 320K extent at file offset 512K.
      $ xfs_io -c "pwrite -S 0xab 512K 64K" /mnt/sdb/foobar
      $ xfs_io -c "pwrite -S 0xcd 576K 64K" /mnt/sdb/foobar
      $ xfs_io -c "pwrite -S 0xef 640K 64K" /mnt/sdb/foobar
      $ xfs_io -c "pwrite -S 0x64 704K 64K" /mnt/sdb/foobar
      $ xfs_io -c "pwrite -S 0x73 768K 64K" /mnt/sdb/foobar
    
      # Clone part of that 320K extent into a lower file offset (192K).
      # This file offset is greater than the file's size in the parent
      # snapshot (64K). Also the clone range is a bit behind the offset of
      # the 320K extent so that we leave a hole between the shared extents.
      $ xfs_io -c "reflink /mnt/sdb/foobar 448K 192K 192K" /mnt/sdb/foobar
    
      $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
      $ btrfs send -p /mnt/sdb/base -f /tmp/2.snap /mnt/sdb/incr
    
      $ mkfs.btrfs -f /dev/sdc
      $ mount /dev/sdc /mnt/sdc
    
      $ btrfs receive -f /tmp/1.snap /mnt/sdc
      $ btrfs receive -f /tmp/2.snap /mnt/sdc
      ERROR: failed to clone extents to foobar: Invalid argument
    
    The problem is that after processing the extent at file offset 256K, which
    refers to the first 128K of the 320K extent created by the buffered write
    operations, we have 'cur_inode_next_write_offset' set to 384K, which
    corresponds to the end offset of the partially shared extent (256K + 128K)
    and to the current file size in the receiver. Then when we process the
    extent at offset 512K, we do extent backreference iteration to figure out
    if we can clone the extent from some other inode or from the same inode,
    and we consider the extent at offset 256K of the same inode as a valid
    source for a clone operation. This is not correct because at that point
    the current file size in the receiver is 384K, which corresponds to the
    end of the last processed extent (at file offset 256K). Using a clone
    source range from 256K to 256K + 320K is therefore invalid, as it goes
    past the current size of the file (384K), and makes the receiver get an
    -EINVAL error when attempting the clone operation.
    
    So fix this by excluding clone sources that have a range that goes beyond
    the current file size in the receiver when iterating extent backreferences.
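Conceptually the filter amounts to one bounds check when considering a backreference as a clone source. A minimal sketch; `clone_source_is_valid()` is a hypothetical helper, and the numbers in the test come from the example above:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/*
 * Hypothetical helper: a clone source range is only usable if it does not
 * extend past the file size the receiver currently has for this inode.
 */
static int clone_source_is_valid(u64 src_offset, u64 src_len,
				 u64 receiver_file_size)
{
	return src_offset + src_len <= receiver_file_size;
}
```

With the example's numbers, a source range starting at 256K with length 320K against a receiver file size of 384K is rejected, which is exactly the clone that used to make the receiver fail with -EINVAL.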
    
    A test case for fstests follows soon.
    
    Fixes: 11f2069 ("Btrfs: send, allow clone operations within the same file")
    CC: stable@vger.kernel.org # 5.5+
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed Jan 31, 2020
  2. btrfs: do not do delalloc reservation under page lock

    We ran into a deadlock in production with the fixup worker.  The stack
    traces were as follows:
    
    Thread responsible for the writeout, waiting on the page lock
    
      [<0>] io_schedule+0x12/0x40
      [<0>] __lock_page+0x109/0x1e0
      [<0>] extent_write_cache_pages+0x206/0x360
      [<0>] extent_writepages+0x40/0x60
      [<0>] do_writepages+0x31/0xb0
      [<0>] __writeback_single_inode+0x3d/0x350
      [<0>] writeback_sb_inodes+0x19d/0x3c0
      [<0>] __writeback_inodes_wb+0x5d/0xb0
      [<0>] wb_writeback+0x231/0x2c0
      [<0>] wb_workfn+0x308/0x3c0
      [<0>] process_one_work+0x1e0/0x390
      [<0>] worker_thread+0x2b/0x3c0
      [<0>] kthread+0x113/0x130
      [<0>] ret_from_fork+0x35/0x40
      [<0>] 0xffffffffffffffff
    
    Thread of the fixup worker who is holding the page lock
    
      [<0>] start_delalloc_inodes+0x241/0x2d0
      [<0>] btrfs_start_delalloc_roots+0x179/0x230
      [<0>] btrfs_alloc_data_chunk_ondemand+0x11b/0x2e0
      [<0>] btrfs_check_data_free_space+0x53/0xa0
      [<0>] btrfs_delalloc_reserve_space+0x20/0x70
      [<0>] btrfs_writepage_fixup_worker+0x1fc/0x2a0
      [<0>] normal_work_helper+0x11c/0x360
      [<0>] process_one_work+0x1e0/0x390
      [<0>] worker_thread+0x2b/0x3c0
      [<0>] kthread+0x113/0x130
      [<0>] ret_from_fork+0x35/0x40
      [<0>] 0xffffffffffffffff
    
    Thankfully the stars have to align just right to hit this.  First you
    have to end up in the fixup worker, which is tricky by itself (my
    reproducer does DIO reads into a MMAP'ed region, so not a common
    operation).  Then you have to have less than a page size of free data
    space and 0 unallocated space so you go down the "commit the transaction
    to free up pinned space" path.  This was accomplished by a random
    balance that was running on the host.  Then you get this deadlock.
    
    I'm still in the process of trying to force the deadlock to happen on
    demand, but I've hit other issues.  I can still trigger the fixup worker
    path itself so this patch has been tested in that regard, so the normal
    case is fine.
    
    Fixes: 87826df ("btrfs: delalloc for page dirtied out-of-band in fixup worker")
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jan 31, 2020
  3. btrfs: drop the -EBUSY case in __extent_writepage_io

    Now that we only return 0 or -EAGAIN from btrfs_writepage_cow_fixup, we
    do not need this -EBUSY case.
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jan 31, 2020
  4. Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker

    For COW, btrfs expects dirty pages to have been through a few setup
    steps.  This includes reserving space for the new block allocations and
    marking the range in the state tree for delayed allocation.
    
    A few places outside btrfs will dirty pages directly, especially when unmapping
    mmap'd pages.  In order for these to properly go through COW, we run them
    through a fixup worker to wait for stable pages, and do the delalloc prep.
    
    87826df added a window where the dirty pages were cleaned, but pending
    more action from the fixup worker.  We clear_page_dirty_for_io() before
    we call into writepage, so the page is no longer dirty.  The commit
    changed it so now we leave the page clean between unlocking it here and
    the fixup worker starting at some point in the future.
    
    During this window, page migration can jump in and relocate the page.  Once our
    fixup work actually starts, it finds page->mapping is NULL and we end up
    freeing the page without ever writing it.
    
    This leads to crc errors and other exciting problems, since it screws up the
    whole state machine for waiting for ordered extents.  The fix here is to keep
    the page dirty while we're waiting for the fixup worker to get to work.
    This is accomplished by returning -EAGAIN from btrfs_writepage_cow_fixup
    if we queued the page up for fixup, which will cause the writepage
    function to redirty the page.
    
    Because we now expect the page to be dirty once it gets to the fixup
    worker we must adjust the error cases to call clear_page_dirty_for_io()
    on the page.  That is the bulk of the patch, but it is not the fix, the
    fix is the -EAGAIN from btrfs_writepage_cow_fixup.  We cannot separate
    these two changes out because the error conditions change with the new
    expectations.
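The control flow can be modeled in a few lines. This is an illustrative model only; `page_state`, `cow_fixup()` and `writepage()` are simplified stand-ins for the kernel code paths named above:

```c
#include <assert.h>
#include <errno.h>

struct page_state {
	int dirty;
	int queued_for_fixup;
};

/* Model of btrfs_writepage_cow_fixup(): if the page needs delalloc prep,
 * queue it for the fixup worker and return -EAGAIN. */
static int cow_fixup(struct page_state *page, int needs_fixup)
{
	if (!needs_fixup)
		return 0;
	page->queued_for_fixup = 1;
	return -EAGAIN;
}

/* Model of the writepage path: the page is cleaned for IO up front, and
 * the -EAGAIN return is what redirties it so page migration cannot steal
 * it while it waits for the fixup worker. */
static void writepage(struct page_state *page, int needs_fixup)
{
	page->dirty = 0;                          /* clear_page_dirty_for_io() */
	if (cow_fixup(page, needs_fixup) == -EAGAIN)
		page->dirty = 1;                  /* keep it dirty for the worker */
}
```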
    
    Signed-off-by: Chris Mason <clm@fb.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    masoncl authored and kdave committed Jan 31, 2020
  5. btrfs: take overcommit into account in inc_block_group_ro

    inc_block_group_ro does a calculation to see if we would have enough
    room left over after marking this block group as read only, in order to
    decide whether it's ok to do so.
    
    The problem is this calculation _only_ works for data, where our used is
    always less than our total.  For metadata we will overcommit, so this
    will almost always fail for metadata.
    
    Fix this by exporting btrfs_can_overcommit, and then see if we have
    enough space to remove the remaining free space in the block group we
    are trying to mark read only.  If we do then we can mark this block
    group as read only.
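The shape of the fixed check is roughly the following. This is a sketch with invented numbers: the real btrfs_can_overcommit() factors in the raid profile and unallocated device space, which is elided here, and the struct names are stand-ins:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

struct sinfo { u64 total; u64 used; u64 ro; };
struct bg { u64 length; u64 used; };

/* Placeholder for btrfs_can_overcommit(): here we simply pretend metadata
 * may overcommit by 50% for the sake of the example. */
static int can_overcommit(const struct sinfo *s, u64 bytes)
{
	return s->used + bytes <= s->total + s->total / 2;
}

/* Sketch of the fixed inc_block_group_ro(): for data the plain room check
 * works, but for metadata (where used routinely exceeds total due to
 * overcommit) we also accept the block group if the overcommit check for
 * its remaining free space passes. */
static int inc_block_group_ro_check(struct sinfo *s, const struct bg *b)
{
	u64 free_in_bg = b->length - b->used;

	if (s->used + s->ro + free_in_bg <= s->total ||
	    can_overcommit(s, free_in_bg)) {
		s->ro += b->length;
		return 1;
	}
	return 0;
}
```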
    
    Reviewed-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jan 31, 2020
  6. btrfs: fix force usage in inc_block_group_ro

    For some reason we've been translating the do_chunk_alloc flag that goes
    into btrfs_inc_block_group_ro into force in inc_block_group_ro, but
    these are two different things.
    
    force for inc_block_group_ro is used when we are forcing the block group
    read only no matter what, for example when the underlying chunk is
    marked read only.  We need to not do the space check here as this block
    group needs to be read only.
    
    btrfs_inc_block_group_ro() has a do_chunk_alloc flag that indicates that
    we need to pre-allocate a chunk before marking the block group read
    only.  This has nothing to do with forcing, and in fact we _always_ want
    to do the space check in this case, so unconditionally pass false for
    force in this case.
    
    Then fixup inc_block_group_ro to honor force as it's expected and
    documented to do.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jan 31, 2020
  7. btrfs: Correctly handle empty trees in find_first_clear_extent_bit

    Raviu reported that running his regular fs_trim segfaulted with the
    following backtrace:
    
    [  237.525947] assertion failed: prev, in ../fs/btrfs/extent_io.c:1595
    [  237.525984] ------------[ cut here ]------------
    [  237.525985] kernel BUG at ../fs/btrfs/ctree.h:3117!
    [  237.525992] invalid opcode: 0000 [#1] SMP PTI
    [  237.525998] CPU: 4 PID: 4423 Comm: fstrim Tainted: G     U     OE     5.4.14-8-vanilla #1
    [  237.526001] Hardware name: ASUSTeK COMPUTER INC.
    [  237.526044] RIP: 0010:assfail.constprop.58+0x18/0x1a [btrfs]
    [  237.526079] Call Trace:
    [  237.526120]  find_first_clear_extent_bit+0x13d/0x150 [btrfs]
    [  237.526148]  btrfs_trim_fs+0x211/0x3f0 [btrfs]
    [  237.526184]  btrfs_ioctl_fitrim+0x103/0x170 [btrfs]
    [  237.526219]  btrfs_ioctl+0x129a/0x2ed0 [btrfs]
    [  237.526227]  ? filemap_map_pages+0x190/0x3d0
    [  237.526232]  ? do_filp_open+0xaf/0x110
    [  237.526238]  ? _copy_to_user+0x22/0x30
    [  237.526242]  ? cp_new_stat+0x150/0x180
    [  237.526247]  ? do_vfs_ioctl+0xa4/0x640
    [  237.526278]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
    [  237.526283]  do_vfs_ioctl+0xa4/0x640
    [  237.526288]  ? __do_sys_newfstat+0x3c/0x60
    [  237.526292]  ksys_ioctl+0x70/0x80
    [  237.526297]  __x64_sys_ioctl+0x16/0x20
    [  237.526303]  do_syscall_64+0x5a/0x1c0
    [  237.526310]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
    
    That was due to btrfs_fs_device::alloc_tree being empty. Initially I
    thought this wasn't possible and as a precaution had put the assert in
    find_first_clear_extent_bit. Turns out this is indeed possible and could
    happen when a file system with a SINGLE data/metadata profile has a 2nd
    device added. Until balance is run or a new chunk is allocated on this
    device it will be completely empty.
    
    In this case find_first_clear_extent_bit should return the full range
    [0, -1ULL] and let the caller handle it, i.e. for trim the end will be
    capped at the size of the actual device.
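The intended contract can be sketched like this (a simplified model: the rbtree walk is elided, and `tree_is_empty` stands in for the device's allocation tree holding no records):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* If the extent io tree holds no records at all, report the whole keyspace
 * instead of asserting; otherwise the normal tree walk (elided) runs. */
static void find_first_clear_range(int tree_is_empty, u64 *start, u64 *end)
{
	if (tree_is_empty) {
		*start = 0;
		*end = (u64)-1;    /* -1ULL */
	}
	/* ... rbtree walk for the non-empty case elided ... */
}

/* How a trim caller is expected to cap the returned end at the device size. */
static u64 cap_trim_end(u64 end, u64 dev_size)
{
	return end < dev_size ? end : dev_size;
}
```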
    
    Link: https://lore.kernel.org/linux-btrfs/izW2WNyvy1dEDweBICizKnd2KDwDiDyY2EYQr4YCwk7pkuIpthx-JRn65MPBde00ND6V0_Lh8mW0kZwzDiLDv25pUYWxkskWNJnVP0kgdMA=@protonmail.com/
    Fixes: 45bfcfc ("btrfs: Implement find_first_clear_extent_bit")
    CC: stable@vger.kernel.org # 5.2+
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Nikolay Borisov authored and kdave committed Jan 31, 2020
  8. btrfs: flush write bio if we loop in extent_write_cache_pages

    There exists a deadlock with range_cyclic that has existed forever.  If
    we loop around with a bio already built we could deadlock with a writer
    who has the page locked that we're attempting to write but is waiting on
    a page in our bio to be written out.  The task traces are as follows
    
      PID: 1329874  TASK: ffff889ebcdf3800  CPU: 33  COMMAND: "kworker/u113:5"
       #0 [ffffc900297bb658] __schedule at ffffffff81a4c33f
       #1 [ffffc900297bb6e0] schedule at ffffffff81a4c6e3
       #2 [ffffc900297bb6f8] io_schedule at ffffffff81a4ca42
       #3 [ffffc900297bb708] __lock_page at ffffffff811f145b
       #4 [ffffc900297bb798] __process_pages_contig at ffffffff814bc502
       #5 [ffffc900297bb8c8] lock_delalloc_pages at ffffffff814bc684
       #6 [ffffc900297bb900] find_lock_delalloc_range at ffffffff814be9ff
       #7 [ffffc900297bb9a0] writepage_delalloc at ffffffff814bebd0
       #8 [ffffc900297bba18] __extent_writepage at ffffffff814bfbf2
       #9 [ffffc900297bba98] extent_write_cache_pages at ffffffff814bffbd
    
      PID: 2167901  TASK: ffff889dc6a59c00  CPU: 14  COMMAND:
      "aio-dio-invalid"
       #0 [ffffc9003b50bb18] __schedule at ffffffff81a4c33f
       #1 [ffffc9003b50bba0] schedule at ffffffff81a4c6e3
       #2 [ffffc9003b50bbb8] io_schedule at ffffffff81a4ca42
       #3 [ffffc9003b50bbc8] wait_on_page_bit at ffffffff811f24d6
       #4 [ffffc9003b50bc60] prepare_pages at ffffffff814b05a7
       #5 [ffffc9003b50bcd8] btrfs_buffered_write at ffffffff814b1359
       #6 [ffffc9003b50bdb0] btrfs_file_write_iter at ffffffff814b5933
       #7 [ffffc9003b50be38] new_sync_write at ffffffff8128f6a8
       #8 [ffffc9003b50bec8] vfs_write at ffffffff81292b9d
       #9 [ffffc9003b50bf00] ksys_pwrite64 at ffffffff81293032
    
    I used drgn to find the respective pages we were stuck on
    
    page_entry.page 0xffffea00fbfc7500 index 8148 bit 15 pid 2167901
    page_entry.page 0xffffea00f9bb7400 index 7680 bit 0 pid 1329874
    
    As you can see the kworker is waiting for bit 0 (PG_locked) on index
    7680, and aio-dio-invalid is waiting for bit 15 (PG_writeback) on index
    8148.  aio-dio-invalid has 7680, and the kworker epd looks like the
    following
    
      crash> struct extent_page_data ffffc900297bbbb0
      struct extent_page_data {
        bio = 0xffff889f747ed830,
        tree = 0xffff889eed6ba448,
        extent_locked = 0,
        sync_io = 0
      }
    
    Probably worth mentioning as well that it waits for writeback of the
    page to complete while holding a lock on it (at prepare_pages()).
    
    Using drgn I walked the bio pages looking for page
    0xffffea00fbfc7500 which is the one we're waiting for writeback on
    
      bio = Object(prog, 'struct bio', address=0xffff889f747ed830)
      for i in range(0, bio.bi_vcnt.value_()):
          bv = bio.bi_io_vec[i]
          if bv.bv_page.value_() == 0xffffea00fbfc7500:
              print("FOUND IT")
    
    which validated what I suspected.
    
    The fix for this is simple, flush the epd before we loop back around to
    the beginning of the file during writeout.
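The fix reduces to one rule in the wrap-around path, which can be modeled in a few lines (a condensed model; the page scan itself is elided and the struct is a stand-in for extent_page_data):

```c
#include <assert.h>

/* Condensed model of the extent_write_cache_pages() range_cyclic fix:
 * if we reach the end of the file with pages still held in a partially
 * built bio, submit (flush) that bio before wrapping around to index 0.
 * Holding pages in the bio while re-taking page locks from the start of
 * the file is what allowed the deadlock with a concurrent writer. */
struct epd {
	int pages_in_bio;   /* pages collected but not yet submitted */
	int submitted;      /* pages whose writeback has been started */
};

static void flush_write_bio(struct epd *epd)
{
	epd->submitted += epd->pages_in_bio;
	epd->pages_in_bio = 0;
}

static void range_cyclic_wrap(struct epd *epd)
{
	/* The fix: never carry a built bio across the wrap-around. */
	flush_write_bio(epd);
}
```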
    
    Fixes: b293f02 ("Btrfs: Add writepages support")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jan 31, 2020
  9. Btrfs: fix race between adding and putting tree mod seq elements and nodes
    
    There is a race between adding and removing elements to the tree mod log
    list and rbtree that can lead to use-after-free problems.
    
    Consider the following example that explains how/why the problem happens:
    
    1) Task A has mod log element with sequence number 200. It currently is
       the only element in the mod log list;
    
    2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to
       access the tree mod log. When it enters the function, it initializes
       'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock'
       before checking if there are other elements in the mod seq list.
       Since the list is empty, 'min_seq' remains set to (u64)-1. Then it
       unlocks the lock 'tree_mod_seq_lock';
    
    3) Before task A acquires the lock 'tree_mod_log_lock', task B adds
       itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a
       sequence number of 201;
    
    4) Some other task, name it task C, modifies a btree and because there
       are elements in the mod seq list, it adds a tree mod elem to the tree
       mod log rbtree. That node added to the mod log rbtree is assigned
       a sequence number of 202;
    
    5) Task B, which is doing fiemap and resolving indirect back references,
       calls get_old_root(), with 'time_seq' == 201, which in turn
       calls tree_mod_log_search() - the search returns the mod log node
       from the rbtree with sequence number 202, created by task C;
    
    6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating
       the mod log rbtree and finds the node with sequence number 202. Since
       202 is less than the previously computed 'min_seq', (u64)-1, it
       removes the node and frees it;
    
    7) Task B still has a pointer to the node with sequence number 202, and
       it dereferences the pointer itself and through the call to
       __tree_mod_log_rewind(), resulting in a use-after-free problem.
    
    This issue can be triggered sporadically with the test case generic/561
    from fstests, and it happens more frequently with a higher number of
    duperemove processes. When it happens to me, it either freezes the VM or
    it produces a trace like the following before crashing:
    
      [ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
      [ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1
      [ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
      [ 1245.321287] RIP: 0010:rb_next+0x16/0x50
      [ 1245.321307] Code: ....
      [ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202
      [ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b
      [ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80
      [ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000
      [ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038
      [ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8
      [ 1245.321539] FS:  00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000
      [ 1245.321591] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0
      [ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1245.321706] Call Trace:
      [ 1245.321798]  __tree_mod_log_rewind+0xbf/0x280 [btrfs]
      [ 1245.321841]  btrfs_search_old_slot+0x105/0xd00 [btrfs]
      [ 1245.321877]  resolve_indirect_refs+0x1eb/0xc60 [btrfs]
      [ 1245.321912]  find_parent_nodes+0x3dc/0x11b0 [btrfs]
      [ 1245.321947]  btrfs_check_shared+0x115/0x1c0 [btrfs]
      [ 1245.321980]  ? extent_fiemap+0x59d/0x6d0 [btrfs]
      [ 1245.322029]  extent_fiemap+0x59d/0x6d0 [btrfs]
      [ 1245.322066]  do_vfs_ioctl+0x45a/0x750
      [ 1245.322081]  ksys_ioctl+0x70/0x80
      [ 1245.322092]  ? trace_hardirqs_off_thunk+0x1a/0x1c
      [ 1245.322113]  __x64_sys_ioctl+0x16/0x20
      [ 1245.322126]  do_syscall_64+0x5c/0x280
      [ 1245.322139]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [ 1245.322155] RIP: 0033:0x7fdee3942dd7
      [ 1245.322177] Code: ....
      [ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      [ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7
      [ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004
      [ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44
      [ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48
      [ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50
      [ 1245.322423] Modules linked in: ....
      [ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]---
    
    Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum
    sequence number and iterates the rbtree while holding the lock
    'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock'
    lock, since it is now redundant.
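The essence of the fix is that the minimum-sequence computation and the rbtree pruning happen atomically under one lock. A single-threaded sketch using arrays instead of the kernel's list and rbtree; in the kernel everything in the function body runs with 'tree_mod_log_lock' held for write:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* elems: sequence numbers of still-registered users of the mod log;
 * nodes: sequence numbers of the mod log records; pruned in place. */
static u64 prune_mod_log(const u64 *elems, int nelems, u64 *nodes, int *nnodes)
{
	/* In the real fix, everything below runs with 'tree_mod_log_lock'
	 * held for write, so a reader holding a node pointer can never race
	 * with the removal and free of that node. */
	u64 min_seq = (u64)-1;
	int i, out = 0;

	for (i = 0; i < nelems; i++)
		if (elems[i] < min_seq)
			min_seq = elems[i];

	for (i = 0; i < *nnodes; i++)
		if (nodes[i] >= min_seq)
			nodes[out++] = nodes[i];  /* still needed by a reader */
	*nnodes = out;
	return min_seq;
}
```

In the scenario above, task B's element (sequence 201) is still registered when the pruning runs, so the node with sequence 202 survives instead of being freed under task B's feet.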
    
    Fixes: bd989ba ("Btrfs: add tree modification log functions")
    Fixes: 097b8a7 ("Btrfs: join tree mod log code with the code holding back delayed refs")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed Jan 31, 2020

Commits on Jan 23, 2020

  1. btrfs: free block groups after free'ing fs trees

    Sometimes when running generic/475 we would trip the
    WARN_ON(cache->reserved) check when free'ing the block groups on umount.
    This is because sometimes we don't commit the transaction because of IO
    errors and thus do not cleanup the tree logs until at umount time.
    
    These blocks are still reserved until they are cleaned up, but they
    aren't cleaned up until _after_ we do the free block groups work.  Fix
    this by moving the free after free'ing the fs roots, that way all of the
    tree logs are cleaned up and we have a properly cleaned fs.  A bunch of
    loops of generic/475 confirmed this fixes the problem.
    
    CC: stable@vger.kernel.org # 4.9+
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jan 23, 2020
  2. btrfs: Fix split-brain handling when changing FSID to metadata uuid

    Current code doesn't correctly handle the situation which arises when
    a file system that has the METADATA_UUID_INCOMPAT flag set has its FSID
    changed to the one in metadata uuid. This causes the incompat flag to
    disappear.
    
    In case of a power failure we could end up in a situation where part of
    the disks in a multi-disk filesystem are correctly reverted to
    METADATA_UUID_INCOMPAT flag unset state, while others have
    METADATA_UUID_INCOMPAT set and CHANGING_FSID_V2_IN_PROGRESS.
    
    This patch corrects the behavior required to handle the case where a
    disk of the second type is scanned first, creating the necessary
    btrfs_fs_devices. Subsequently, when a disk which has already completed
    the transition is scanned it should overwrite the data in
    btrfs_fs_devices.
    
    Reported-by: Su Yue <Damenly_Su@gmx.com>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Nikolay Borisov authored and kdave committed Jan 23, 2020
  3. btrfs: Handle another split brain scenario with metadata uuid feature

    There is one more case which isn't handled by the original metadata
    uuid work. Namely, when a filesystem has the METADATA_UUID incompat bit
    and
    the user decides to change the FSID to the original one e.g. have
    metadata_uuid and fsid match. In case of power failure while this
    operation is in progress we could end up in a situation where some of
    the disks have the incompat bit removed and the other half have both
    METADATA_UUID_INCOMPAT and FSID_CHANGING_IN_PROGRESS flags.
    
    This patch handles the case where a disk that has successfully changed
    its FSID such that it equals METADATA_UUID is scanned first.
    Subsequently when a disk with both
    METADATA_UUID_INCOMPAT/FSID_CHANGING_IN_PROGRESS flags is scanned
    find_fsid_changed won't be able to find an appropriate btrfs_fs_devices.
    This is done by extending find_fsid_changed to correctly find
    btrfs_fs_devices whose metadata_uuid/fsid are the same and they match
    the metadata_uuid of the currently scanned device.
    
    Fixes: cc5de4e ("btrfs: Handle final split-brain possibility during fsid change")
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Reported-by: Su Yue <Damenly_Su@gmx.com>
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Nikolay Borisov authored and kdave committed Jan 23, 2020
  4. btrfs: Factor out metadata_uuid code from find_fsid.

    find_fsid became rather hairy with the introduction of the metadata uuid
    changing feature. Alleviate this by factoring out the metadata uuid
    specific code into a dedicated function which deals with finding the
    correct fsid for a device with changed uuid.
    
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Su Yue <Damenly_Su@gmx.com>
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Damenly authored and kdave committed Jan 23, 2020
  5. btrfs: Call find_fsid from find_fsid_inprogress

    Since find_fsid_inprogress should also handle the case in which an fs
    didn't change its FSID, make it call find_fsid directly. This makes the
    code in device_list_add simpler by eliminating a conditional call of
    find_fsid. No functional changes.
    
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Su Yue <Damenly_Su@gmx.com>
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Damenly authored and kdave committed Jan 23, 2020
  6. Btrfs: fix infinite loop during fsync after rename operations

    Recently fsstress (from fstests) sporadically started to trigger an
    infinite loop during fsync operations. This turned out to be because
    support for the rename exchange and whiteout operations was added to
    fsstress in fstests. These operations, unlike any others in fsstress,
    cause file names to be reused, hence triggering this issue. However it's
    not necessary to use rename exchange and rename whiteout operations to
    trigger this issue; simple rename operations and file creations are
    enough.
    
    The issue boils down to when we are logging inodes that conflict (that
    had the name of any inode we need to log during the fsync operation), we
    keep logging them even if they were already logged before, and after
    that we check if there's any other inode that conflicts with them and
    then add it again to the list of inodes to log. Skipping already logged
    inodes fixes the issue.
    
    Consider the following example:
    
      $ mkfs.btrfs -f /dev/sdb
      $ mount /dev/sdb /mnt
    
      $ mkdir /mnt/testdir                           # inode 257
    
      $ touch /mnt/testdir/zz                        # inode 258
      $ ln /mnt/testdir/zz /mnt/testdir/zz_link
    
      $ touch /mnt/testdir/a                         # inode 259
    
      $ sync
    
      # The following 3 renames achieve the same result as a rename exchange
      # operation (<rename_exchange> /mnt/testdir/zz_link to /mnt/testdir/a).
    
      $ mv /mnt/testdir/a /mnt/testdir/tmp
      $ mv /mnt/testdir/zz_link /mnt/testdir/a
      $ mv /mnt/testdir/tmp /mnt/testdir/zz_link
    
      # The following rename and file creation give the same result as a
      # rename whiteout operation (<rename_whiteout> zz to a2).
    
      $ mv /mnt/testdir/zz /mnt/testdir/a2
      $ touch /mnt/testdir/zz                        # inode 260
    
      $ xfs_io -c fsync /mnt/testdir/zz
        --> results in the infinite loop
    
    The following steps happen:
    
    1) When logging inode 260, we find that its reference named "zz" was
       used by inode 258 in the previous transaction (through the commit
       root), so inode 258 is added to the list of conflicting inodes that
       need to be logged;
    
    2) After logging inode 258, we find that its reference named "a" was
       used by inode 259 in the previous transaction, and therefore we add
       inode 259 to the list of conflicting inodes to be logged;
    
    3) After logging inode 259, we find that its reference named "zz_link"
       was used by inode 258 in the previous transaction - we add inode
       258 to the list of conflicting inodes to log again, even though
       we had already logged it at step 2. After logging it again, we
       find again that inode 259 conflicts with it, we add inode 259 to
       the list again, and so on - we end up repeating all the previous
       steps.
    
    So fix this by skipping logging of conflicting inodes that were already
    logged.
    
    Fixes: 6b5fc43 ("Btrfs: fix fsync after succession of renames of different files")
    CC: stable@vger.kernel.org # 5.1+
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed Jan 23, 2020
  7. btrfs: set trans->dirty in btrfs_commit_transaction

    If we abort a transaction we have the following sequence
    
    if (!trans->dirty && list_empty(&trans->new_bgs))
    	return;
    WRITE_ONCE(trans->transaction->aborted, err);
    
    The idea being if we didn't modify anything with our trans handle then
    we don't really need to abort the whole transaction, maybe the other
    trans handles are fine and we can carry on.
    
    However in the case of create_snapshot we add a pending_snapshot object
    to our transaction and then commit the transaction.  We don't actually
    modify anything.  sync() behaves the same way, attach to an existing
    transaction and commit it.  This means that if we hit an IO error
    in the right places we could abort the committing transaction with
    our trans->dirty not set, and thus never set transaction->aborted.
    
    This is a problem because in the create_snapshot() case we depend on
    pending->error being set to something, or btrfs_commit_transaction
    returning an error.
    
    If we are not the trans handle that gets to commit the transaction, and
    we're waiting on the commit to happen we get our return value from
    cur_trans->aborted.  If this was not set to anything because sync() hit
    an error in the transaction commit before it could modify anything then
    cur_trans->aborted would be 0.  Thus we'd return 0 from
    btrfs_commit_transaction() in create_snapshot.
    
    This is a problem because we then try to do things with
    pending_snapshot->snap, which will be NULL because we didn't create the
    snapshot, and then we'll get a NULL pointer dereference like the
    following
    
    "BUG: kernel NULL pointer dereference, address: 00000000000001f0"
    RIP: 0010:btrfs_orphan_cleanup+0x2d/0x330
    Call Trace:
     ? btrfs_mksubvol.isra.31+0x3f2/0x510
     btrfs_mksubvol.isra.31+0x4bc/0x510
     ? __sb_start_write+0xfa/0x200
     ? mnt_want_write_file+0x24/0x50
     btrfs_ioctl_snap_create_transid+0x16c/0x1a0
     btrfs_ioctl_snap_create_v2+0x11e/0x1a0
     btrfs_ioctl+0x1534/0x2c10
     ? free_debug_processing+0x262/0x2a3
     do_vfs_ioctl+0xa6/0x6b0
     ? do_sys_open+0x188/0x220
     ? syscall_trace_enter+0x1f8/0x330
     ksys_ioctl+0x60/0x90
     __x64_sys_ioctl+0x16/0x20
     do_syscall_64+0x4a/0x1b0
    
    In order to fix this we need to make sure anybody who calls
    commit_transaction has trans->dirty set, so that the
    trans->transaction->aborted value is set properly and any waiters
    know that bad things happened.
    
    This was found while I was running generic/475 with my modified
    fsstress, it reproduced within a few runs.  I ran with this patch all
    night and didn't see the problem again.
    
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jan 23, 2020
  8. btrfs: drop log root for dropped roots

    If we fsync on a subvolume and create a log root for that volume, and
    then later delete that subvolume we'll never clean up its log root.  Fix
    this by making switch_commit_roots free the log for any dropped roots we
    encounter.  The extra churn is because we need a btrfs_trans_handle, not
    the btrfs_transaction.
    
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jan 23, 2020
  9. btrfs: sysfs, add devid/dev_state kobject and device attributes

    New sysfs attributes that track the filesystem status of devices, stored
    in the per-filesystem directory in /sys/fs/btrfs/FSID/devinfo . There's
    a directory for each device, with name corresponding to the numerical
    device id.
    
      in_fs_metadata    - device is in the list of fs metadata
      missing           - device is missing (no device node or block device)
      replace_target    - device is target of replace
      writeable         - writes from fs are allowed
    
    These attributes reflect the state of device::dev_state and are
    created at mount time.
    
    Sample output:
      $ pwd
       /sys/fs/btrfs/6e1961f1-5918-4ecc-a22f-948897b409f7/devinfo/1/
      $ ls
        in_fs_metadata  missing  replace_target  writeable
      $ cat missing
        0
    
    The output of these attributes is 0 or 1: 0 indicates unset and 1
    indicates set.  These attributes are readonly.
    
    It is observed that the device delete thread and sysfs read thread will
    not race because the delete thread calls sysfs kobject_put() which in
    turn waits for existing sysfs read to complete.
    
    Note for device replace devid swap:
    
    During the replace, the target device temporarily assumes devid 0
    before taking over the devid of the source device.

    In btrfs_dev_replace_finishing() we remove the source device's
    sysfs entries using btrfs_sysfs_remove_devices_attr(), and after
    that we call kobject_rename() to update the devid in sysfs.  This
    patch adds the btrfs_sysfs_update_devid() helper function to update
    the device id and calls it there.
    
    Signed-off-by: Anand Jain <anand.jain@oracle.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    [ update changelog ]
    Signed-off-by: David Sterba <dsterba@suse.com>
    asj authored and kdave committed Jan 23, 2020
  10. btrfs: Refactor btrfs_rmap_block to improve readability

    Move variables to appropriate scope. Remove last BUG_ON in the function
    and rework error handling accordingly. Make the duplicate detection code
    more straightforward. Use the in_range macro. And give variables
    more descriptive names by explicitly distinguishing between IO
    stripe size (the size recorded in the chunk item) and data stripe
    size (the size of an actual stripe, constituting a logical
    chunk/block group).
    
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Nikolay Borisov authored and kdave committed Jan 23, 2020
  11. btrfs: Add self-tests for btrfs_rmap_block

    Add RAID1 and single testcases to verify that data stripes are excluded
    from super block locations and that the address mapping is valid.
    
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    [ update changelog ]
    Signed-off-by: David Sterba <dsterba@suse.com>
    Nikolay Borisov authored and kdave committed Jan 23, 2020
  12. btrfs: selftests: Add support for dummy devices

    Add basic infrastructure to create and link dummy btrfs_devices. This
    will be used in the pending btrfs_rmap_block test which deals with
    the block groups.
    
    Calling btrfs_alloc_dummy_device will link the newly created device to
    the passed fs_info and the test framework will free them once the test
    is finished.
    
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Nikolay Borisov authored and kdave committed Jan 23, 2020
  13. btrfs: Move and unexport btrfs_rmap_block

    It's used only during initial block group reading to map physical
    address of super block to a list of logical ones. Make it private to
    block-group.c, add proper kernel doc and ensure it's exported only for
    tests.
    
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Nikolay Borisov authored and kdave committed Jan 23, 2020
  14. btrfs: separate definition of assertion failure handlers

    There's a report where objtool detects unreachable instructions, eg.:
    
      fs/btrfs/ctree.o: warning: objtool: btrfs_search_slot()+0x2d4: unreachable instruction
    
    This seems to be a false positive due to compiler version. The cause is
    in the ASSERT macro implementation that does the conditional check as
    IS_DEFINED(CONFIG_BTRFS_ASSERT) and not an #ifdef.
    
    To avoid that, use the ifdefs directly.
    
    There are still 2 reports that aren't fixed:
    
      fs/btrfs/extent_io.o: warning: objtool: __set_extent_bit()+0x71f: unreachable instruction
      fs/btrfs/relocation.o: warning: objtool: find_data_references()+0x4e0: unreachable instruction
    
    Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
    Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
    Reported-by: Randy Dunlap <rdunlap@infradead.org>
    Signed-off-by: David Sterba <dsterba@suse.com>
    kdave committed Jan 23, 2020

Commits on Jan 20, 2020

  1. btrfs: device stats, log when stats are zeroed

    We had a report indicating that some read errors aren't reported by
    the device stats in userland. It is important to have the errors
    reported in the device stats, as userland scripts might depend on
    them to take reasonable corrective actions. But to debug such
    issues we need to be really sure that the request to reset the
    device stats did not come from userland itself. So log an info
    message when a device stats reset happens.
    
    For example:
     BTRFS info (device sdc): device stats zeroed by btrfs(9223)
    
    Reported-by: philip@philip-seeger.de
    Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Anand Jain <anand.jain@oracle.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    asj authored and kdave committed Jan 20, 2020
  2. btrfs: fix improper setting of scanned for range cyclic write cache pages
    
    We noticed that we were having regular CG OOM kills in cases where
    there were still enough dirty pages to avoid OOM'ing.  It turned
    out there's a corner case in btrfs's handling of range_cyclic where
    files that were being redirtied were not getting fully written out
    because of how we do range_cyclic writeback.
    
    We unconditionally were setting scanned = 1; the first time we found any
    pages in the inode.  This isn't actually what we want, we want it to be
    set if we've scanned the entire file.  For range_cyclic we could be
    starting in the middle or towards the end of the file, so we could write
    one page and then not write any of the other dirty pages in the file
    because we set scanned = 1.
    
    Fix this by not setting scanned = 1 if we find pages.  The rules for
    setting scanned should be
    
    1) !range_cyclic.  In this case we have a specified range to write out.
    2) range_cyclic && index == 0.  In this case we've started at the
       beginning and there is no need to loop around a second time.
    3) range_cyclic && we started at index > 0 and we've reached the end of
       the file without satisfying our nr_to_write.
    
    This patch fixes both of our writepages implementations to make
    sure these rules hold true.  This fixed our overzealous CG OOMs in
    production.
    
    Fixes: d1310b2 ("Btrfs: Split the extent_map code into two parts")
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    [ add comment ]
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jan 20, 2020
  3. btrfs: safely advance counter when looking up bio csums

    Dan's smatch tool reports
    
      fs/btrfs/file-item.c:295 btrfs_lookup_bio_sums()
      warn: should this be 'count == -1'
    
    which points to the while (count--) loop. With count == 0 the check
    itself could decrement it to -1. There's a WARN_ON a few lines below
    that has never been seen in practice though.
    
    It turns out that the value of page_bytes_left matches the count (by
    sectorsize multiples). The loop never reaches the state where count
    would go to -1, because page_bytes_left == 0 is found first and this
    breaks out.
    
    For clarity, use only a plain check on count (and only for positive
    values), decrementing safely inside the loop. Any other discrepancy
    after the whole bio list processing should be reported by the
    existing WARN_ON_ONCE as well.
    
    Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    kdave committed Jan 20, 2020
  4. btrfs: remove unused member btrfs_device::work

    This is a leftover from the recently removed bio scheduling
    framework.
    
    Fixes: ba8a9d0 ("Btrfs: delete the entire async bio submission framework")
    Reviewed-by: Anand Jain <anand.jain@oracle.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    kdave committed Jan 20, 2020
  5. btrfs: remove unnecessary wrapper get_alloc_profile

    btrfs_get_alloc_profile() is a simple wrapper over get_alloc_profile().
    The only difference is btrfs_get_alloc_profile() is visible to other
    functions in btrfs while get_alloc_profile() is static and thus only
    visible to functions in block-group.c.
    
    Let's just fold get_alloc_profile() into btrfs_get_alloc_profile() to
    get rid of the unnecessary second function.
    
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: Anand Jain <anand.jain@oracle.com>
    Signed-off-by: Johannes Thumshirn <jth@kernel.org>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    morbidrsa authored and kdave committed Jan 20, 2020
  6. btrfs: add correction to handle -1 edge case in async discard

    From Dave's testing described below, it's possible to drive a file
    system to have bogus values of discardable_extents and _bytes.  As
    btrfs_discard_calc_delay() is the only user of discardable_extents, we
    can correct here for any negative discardable_extents/discardable_bytes.
    
    The problem is not reliably reproducible. The workload that created
    it was based on the linux git tree: switching between release tags,
    then deleting everything, followed by a full rebalance. At that
    state the value of discardable_bytes was 16K and discardable_extents
    was -1, while the expected values are 0 and 0.
    
    Repeating the workload again did not correct the bogus values so the
    offset seems to be stable once it happens.
    
    Reported-by: David Sterba <dsterba@suse.com>
    Signed-off-by: Dennis Zhou <dennis@kernel.org>
    Signed-off-by: David Sterba <dsterba@suse.com>
    dennisszhou authored and kdave committed Jan 20, 2020
  7. btrfs: ensure removal of discardable_* in free_bitmap()

    Most callers of free_bitmap() only call it if bitmap_info->bytes is 0.
    However, there are certain cases where we may free the free space cache
    via __btrfs_remove_free_space_cache(). This exposes a path where
    free_bitmap() is called regardless. This may result in a bad accounting
    situation for discardable_bytes and discardable_extents. So, remove the
    stats and call btrfs_discard_update_discardable().
    
    Signed-off-by: Dennis Zhou <dennis@kernel.org>
    Signed-off-by: David Sterba <dsterba@suse.com>
    dennisszhou authored and kdave committed Jan 20, 2020
  8. btrfs: make smaller extents more likely to go into bitmaps

    It's less than ideal for small extents to eat into our extent budget, so
    force extents <= 32KB into the bitmaps save for the first handful.
    
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Dennis Zhou <dennis@kernel.org>
    Signed-off-by: David Sterba <dsterba@suse.com>
    dennisszhou authored and kdave committed Jan 20, 2020