Boris-Burkov/b…

Commits on Jun 17, 2020

  1. btrfs: fix fatal extent_buffer readahead vs releasepage race

    Under somewhat convoluted conditions, it is possible to attempt to
    release an extent_buffer that is under io, which triggers a BUG_ON in
    btrfs_release_extent_buffer_pages.
    
    This relies on a few different factors. First, extent_buffer reads done
    as readahead for searching use WAIT_NONE, so they free the local extent
    buffer reference while the io is outstanding. Even so, they should still
    be protected by TREE_REF. However, if the system is doing significant
    reclaim, and simultaneously heavily accessing the extent_buffers, it is
    possible for releasepage to race with two concurrent readahead attempts
    in a way that leaves TREE_REF unset when the readahead extent buffer is
    released.
    
    Essentially, if two tasks race to allocate a new extent_buffer, but the
    winner who attempts the first io is rebuffed by a page being locked
    (likely by the reclaim itself) then the loser will still go ahead with
    issuing the readahead. The loser's call to find_extent_buffer must also
    race with the reclaim task reading the extent_buffer's refcount as 1 in
    a way that allows the reclaim to re-clear the TREE_REF checked by
    find_extent_buffer.
    
    The following represents an example execution demonstrating the race:
    
                CPU0                                                         CPU1                                           CPU2
    reada_for_search                                            reada_for_search
      readahead_tree_block                                        readahead_tree_block
        find_create_tree_block                                      find_create_tree_block
          alloc_extent_buffer                                         alloc_extent_buffer
                                                                      find_extent_buffer // not found
                                                                      allocates eb
                                                                      lock pages
                                                                      associate pages to eb
                                                                      insert eb into radix tree
                                                                      set TREE_REF, refs == 2
                                                                      unlock pages
                                                                  read_extent_buffer_pages // WAIT_NONE
                                                                    not uptodate (brand new eb)
                                                                                                                lock_page
                                                                    if !trylock_page
                                                                      goto unlock_exit // not an error
                                                                  free_extent_buffer
                                                                    release_extent_buffer
                                                                      atomic_dec_and_test refs to 1
            find_extent_buffer // found
                                                                                                                try_release_extent_buffer
                                                                                                                  take refs_lock
                                                                                                                  reads refs == 1; no io
              atomic_inc_not_zero refs to 2
              mark_buffer_accessed
                check_buffer_tree_ref
                  // not STALE, won't take refs_lock
                  refs == 2; TREE_REF set // no action
        read_extent_buffer_pages // WAIT_NONE
                                                                                                                  clear TREE_REF
                                                                                                                  release_extent_buffer
                                                                                                                    atomic_dec_and_test refs to 1
                                                                                                                    unlock_page
          still not uptodate (CPU1 read failed on trylock_page)
          locks pages
          set io_pages > 0
          submit io
          return
        release_extent_buffer
          dec refs to 0
          delete from radix tree
          btrfs_release_extent_buffer_pages
            BUG_ON(io_pages > 0)!!!
    
    We observe this at a very low rate in production and were also able to
    reproduce it in a test environment by introducing some spurious delays
    and by introducing probabilistic trylock_page failures.
    
    To fix it, we apply check_buffer_tree_ref at a point where TREE_REF could
    not possibly be unset by a competing task: after io_pages has been
    incremented. There is no race in write_one_eb, that we know of, but for
    consistency, apply it there too. All the codepaths that clear TREE_REF
    check for io, so they would not be able to clear it after this point.
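
As a rough illustration, the interleaving in the trace above can be replayed in a single-threaded toy model. This is only a sketch: `toy_eb`, `toy_check_tree_ref` and `toy_replay` are hypothetical names, not kernel code, and locking is reduced to "CPU2 acts on a value it read earlier".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an extent_buffer; only the fields named in the
 * commit message are modelled. */
struct toy_eb {
	int refs;
	int io_pages;
	bool tree_ref;
};

/* Models the tree-ref check: take the tree reference if it is missing. */
static void toy_check_tree_ref(struct toy_eb *eb)
{
	if (!eb->tree_ref) {
		eb->tree_ref = true;
		eb->refs++;
	}
}

/* Replays the CPU0/CPU1/CPU2 steps in trace order. Returns true when the
 * buffer would be freed with io still pending, i.e. the BUG_ON case. */
bool toy_replay(bool apply_fix)
{
	/* CPU1: new eb inserted with TREE_REF set, refs == 2 */
	struct toy_eb eb = { .refs = 2, .io_pages = 0, .tree_ref = true };

	/* CPU1: trylock_page fails, free_extent_buffer drops refs to 1 */
	eb.refs--;

	/* CPU2: try_release_extent_buffer reads refs == 1 and no io */
	bool reclaim_decided = (eb.refs == 1 && eb.io_pages == 0);

	/* CPU0: find_extent_buffer takes a reference; TREE_REF is still set,
	 * so the tree-ref check is a no-op at this point */
	eb.refs++;
	toy_check_tree_ref(&eb);

	/* CPU2: acts on its earlier read, clears TREE_REF and drops a ref */
	if (reclaim_decided) {
		eb.tree_ref = false;
		eb.refs--;
	}

	/* CPU0: read_extent_buffer_pages marks io in flight */
	eb.io_pages = 1;
	if (apply_fix)
		toy_check_tree_ref(&eb); /* the fix: re-check after io_pages is set */

	/* CPU0: free_extent_buffer drops the local reference */
	eb.refs--;
	assert(eb.refs >= 0);

	return eb.refs == 0 && eb.io_pages > 0;
}
```

Calling `toy_replay(false)` follows the buggy ordering and ends with refs at 0 while io_pages is non-zero; `toy_replay(true)` re-takes the tree reference after io_pages is set, so the final drop leaves refs at 1.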
    
    Signed-off-by: Boris Burkov <boris@bur.io>
    Boris Burkov authored and 0day robot committed Jun 17, 2020

Commits on May 28, 2020

  1. btrfs: fix space_info bytes_may_use underflow during space cache writeout
    
    We always preallocate a data extent for writing a free space cache, which
    causes writeback to always try the nocow path first, since the free space
    inode has the prealloc bit set in its flags.
    
    However, if the block group that contains the data extent for the space
    cache has been turned to RO mode, due to a running scrub or balance for
    example, we have to fall back to the cow path. In that case, once a new data
    extent is allocated we end up calling btrfs_add_reserved_bytes(), which
    decrements the counter named bytes_may_use from the data space_info object
    with the expectation that this counter was previously incremented with the
    same amount (the size of the data extent).
    
    However when we started writeout of the space cache at cache_save_setup(),
    we incremented the value of the bytes_may_use counter through a call to
    btrfs_check_data_free_space() and then decremented it through a call to
    btrfs_prealloc_file_range_trans() immediately after. So when starting the
    writeback if we fallback to cow mode we have to increment the counter
    bytes_may_use of the data space_info again to compensate for the extent
    allocation done by the cow path.
    
    When this issue happens we are incorrectly decrementing the bytes_may_use
    counter and when its current value is smaller than the amount we try to
    subtract we end up with the following warning:
    
     ------------[ cut here ]------------
     WARNING: CPU: 3 PID: 657 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
     Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
     CPU: 3 PID: 657 Comm: kworker/u8:7 Tainted: G        W         5.6.0-rc7-btrfs-next-58 #5
     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
     Workqueue: writeback wb_workfn (flush-btrfs-1591)
     RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
     Code: ff ff 48 (...)
     RSP: 0000:ffffa41608f13660 EFLAGS: 00010287
     RAX: 0000000000001000 RBX: ffff9615b93ae400 RCX: 0000000000000000
     RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9615b96ab410
     RBP: fffffffffffee000 R08: 0000000000000001 R09: 0000000000000000
     R10: ffff961585e62a40 R11: 0000000000000000 R12: ffff9615b96ab400
     R13: ffff9615a1a2a000 R14: 0000000000012000 R15: ffff9615b93ae400
     FS:  0000000000000000(0000) GS:ffff9615bb200000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 000055cbbc2ae178 CR3: 0000000115794006 CR4: 00000000003606e0
     DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
     Call Trace:
      find_free_extent+0x4a0/0x16c0 [btrfs]
      btrfs_reserve_extent+0x91/0x180 [btrfs]
      cow_file_range+0x12d/0x490 [btrfs]
      btrfs_run_delalloc_range+0x9f/0x6d0 [btrfs]
      ? find_lock_delalloc_range+0x221/0x250 [btrfs]
      writepage_delalloc+0xe8/0x150 [btrfs]
      __extent_writepage+0xe8/0x4c0 [btrfs]
      extent_write_cache_pages+0x237/0x530 [btrfs]
      extent_writepages+0x44/0xa0 [btrfs]
      do_writepages+0x23/0x80
      __writeback_single_inode+0x59/0x700
      writeback_sb_inodes+0x267/0x5f0
      __writeback_inodes_wb+0x87/0xe0
      wb_writeback+0x382/0x590
      ? wb_workfn+0x4a2/0x6c0
      wb_workfn+0x4a2/0x6c0
      process_one_work+0x26d/0x6a0
      worker_thread+0x4f/0x3e0
      ? process_one_work+0x6a0/0x6a0
      kthread+0x103/0x140
      ? kthread_create_worker_on_cpu+0x70/0x70
      ret_from_fork+0x3a/0x50
     irq event stamp: 0
     hardirqs last  enabled at (0): [<0000000000000000>] 0x0
     hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
     softirqs last  enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
     softirqs last disabled at (0): [<0000000000000000>] 0x0
     ---[ end trace bd7c03622e0b0a52 ]---
     ------------[ cut here ]------------
    
    So fix this by incrementing the bytes_may_use counter of the data
    space_info when we fall back to the cow path. If the cow path is successful
    the counter is decremented after extent allocation (by
    btrfs_add_reserved_bytes()); if it fails it ends up being decremented as
    well when clearing the delalloc range (extent_clear_unlock_delalloc()).
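
The counter arithmetic described above can be sketched with a toy model. The names `toy_space_info`, `toy_may_use_sub` and `toy_space_cache_writeout` are hypothetical, and clamping the counter to zero on underflow is an assumption of this sketch rather than something stated in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the data space_info counter. */
struct toy_space_info {
	uint64_t bytes_may_use;
	bool warned; /* set when a decrement would underflow */
};

static void toy_may_use_sub(struct toy_space_info *s, uint64_t bytes)
{
	if (s->bytes_may_use < bytes) {
		s->warned = true;     /* stands in for the WARNING in the trace */
		s->bytes_may_use = 0; /* assumption: clamp instead of wrapping */
		return;
	}
	s->bytes_may_use -= bytes;
}

/* Returns true if the writeout sequence would hit the underflow warning. */
bool toy_space_cache_writeout(uint64_t extent_size, bool apply_fix)
{
	struct toy_space_info s = { 0, false };

	/* cache_save_setup(): reserve, then preallocation releases it again */
	s.bytes_may_use += extent_size;
	toy_may_use_sub(&s, extent_size);

	/* writeback falls back to the cow path; the fix re-adds the bytes */
	if (apply_fix)
		s.bytes_may_use += extent_size;

	/* btrfs_add_reserved_bytes() after the extent is allocated */
	toy_may_use_sub(&s, extent_size);

	assert(s.bytes_may_use == 0);
	return s.warned;
}
```

With a 4096-byte extent, the unfixed sequence subtracts from an empty counter and sets the warning flag, while the fixed sequence balances every decrement with an earlier increment.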
    
    This could be triggered sporadically by the test case btrfs/061 from
    fstests.
    
    Fixes: 82d5902 ("Btrfs: Support reading/writing on disk free ino cache")
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed May 28, 2020
  2. btrfs: fix space_info bytes_may_use underflow after nocow buffered write

    When doing a buffered write we always try to reserve data space for it,
    even when the file has the NOCOW bit set or the write falls into a file
    range covered by a prealloc extent. This is done both because it is
    expensive to check if we can do a nocow write (checking if an extent is
    shared through reflinks or if there's a hole in the range for example),
    and because when writeback starts we might actually need to fallback to
    COW mode (for example the block group containing the target extents was
    turned into RO mode due to a scrub or balance).
    
    When we are unable to reserve data space we check if we can do a nocow
    write, and if we can, we proceed with dirtying the pages and setting up
    the range for delalloc. In this case the bytes_may_use counter of the
    data space_info object is not incremented, unlike in the case where we
    are able to reserve data space (done through btrfs_check_data_free_space()
    which calls btrfs_alloc_data_chunk_ondemand()).
    
    Later, when running delalloc, we attempt to start writeback in nocow mode
    but we might fall back to cow mode, for example because in the meantime
    a block group was turned into RO mode by a scrub or relocation. The cow
    path after successfully allocating an extent ends up calling
    btrfs_add_reserved_bytes(), which expects the bytes_may_use counter of
    the data space_info object to have been incremented before - but we did
    not do it when the buffered write started, since there was not enough
    available data space. So btrfs_add_reserved_bytes() ends up decrementing
    the bytes_may_use counter anyway, and when the counter's current value
    is smaller than the size of the allocated extent we get a stack trace
    like the following:
    
     ------------[ cut here ]------------
     WARNING: CPU: 0 PID: 20138 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
     Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
     CPU: 0 PID: 20138 Comm: kworker/u8:15 Not tainted 5.6.0-rc7-btrfs-next-58 #5
     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
     Workqueue: writeback wb_workfn (flush-btrfs-1754)
     RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
     Code: ff ff 48 (...)
     RSP: 0018:ffffbda18a4b3568 EFLAGS: 00010287
     RAX: 0000000000000000 RBX: ffff9ca076f5d800 RCX: 0000000000000000
     RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9ca068470410
     RBP: fffffffffffff000 R08: 0000000000000001 R09: 0000000000000000
     R10: ffff9ca079d58040 R11: 0000000000000000 R12: ffff9ca068470400
     R13: ffff9ca0408b2000 R14: 0000000000001000 R15: ffff9ca076f5d800
     FS:  0000000000000000(0000) GS:ffff9ca07a600000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 00005605dbfe7048 CR3: 0000000138570006 CR4: 00000000003606f0
     DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
     Call Trace:
      find_free_extent+0x4a0/0x16c0 [btrfs]
      btrfs_reserve_extent+0x91/0x180 [btrfs]
      cow_file_range+0x12d/0x490 [btrfs]
      run_delalloc_nocow+0x341/0xa40 [btrfs]
      btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs]
      ? find_lock_delalloc_range+0x221/0x250 [btrfs]
      writepage_delalloc+0xe8/0x150 [btrfs]
      __extent_writepage+0xe8/0x4c0 [btrfs]
      extent_write_cache_pages+0x237/0x530 [btrfs]
      ? btrfs_wq_submit_bio+0x9f/0xc0 [btrfs]
      extent_writepages+0x44/0xa0 [btrfs]
      do_writepages+0x23/0x80
      __writeback_single_inode+0x59/0x700
      writeback_sb_inodes+0x267/0x5f0
      __writeback_inodes_wb+0x87/0xe0
      wb_writeback+0x382/0x590
      ? wb_workfn+0x4a2/0x6c0
      wb_workfn+0x4a2/0x6c0
      process_one_work+0x26d/0x6a0
      worker_thread+0x4f/0x3e0
      ? process_one_work+0x6a0/0x6a0
      kthread+0x103/0x140
      ? kthread_create_worker_on_cpu+0x70/0x70
      ret_from_fork+0x3a/0x50
     irq event stamp: 0
     hardirqs last  enabled at (0): [<0000000000000000>] 0x0
     hardirqs last disabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020
     softirqs last  enabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020
     softirqs last disabled at (0): [<0000000000000000>] 0x0
     ---[ end trace f9f6ef8ec4cd8ec9 ]---
    
    So to fix this, when falling back into cow mode check if space was not
    reserved, by testing for the bit EXTENT_NORESERVE in the respective file
    range, and if it was not reserved, increment the bytes_may_use counter for
    the data space_info object. Also clear the EXTENT_NORESERVE bit from the
    range, so that if the cow path fails it decrements the bytes_may_use counter
    when clearing the delalloc range (through the btrfs_clear_delalloc_extent()
    callback).
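
A minimal sketch of this flow, under the assumption that a set EXTENT_NORESERVE bit means no data space was reserved for the range. `toy_state` and `toy_nocow_write_then_cow` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_state {
	uint64_t bytes_may_use;
	bool noreserve; /* models EXTENT_NORESERVE on the file range */
	bool warned;    /* set when the counter would underflow */
};

/* Returns true if falling back to cow would trigger the underflow warning. */
bool toy_nocow_write_then_cow(uint64_t len, bool reserve_ok, bool apply_fix)
{
	struct toy_state st = { 0, false, false };

	assert(len > 0);

	/* buffered write: reserve data space, or proceed as nocow without it */
	if (reserve_ok)
		st.bytes_may_use += len;
	else
		st.noreserve = true;

	/* writeback falls back to cow; the fix accounts for the missing
	 * reservation and clears the bit */
	if (apply_fix && st.noreserve) {
		st.bytes_may_use += len;
		st.noreserve = false;
	}

	/* btrfs_add_reserved_bytes() expects the bytes to be accounted */
	if (st.bytes_may_use < len)
		st.warned = true;
	else
		st.bytes_may_use -= len;

	return st.warned;
}
```

Only the combination of a failed reservation and no fix produces the warning; when the reservation succeeded in the first place, the fix changes nothing.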
    
    Fixes: 7ee9e44 ("Btrfs: check if we can nocow if we don't have data space")
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed May 28, 2020
  3. btrfs: fix wrong file range cleanup after an error filling delalloc range

    If an error happens while running delalloc in COW mode for a range, we can
    end up calling extent_clear_unlock_delalloc() for a range that goes beyond
    our range's end offset by 1 byte, which affects 1 extra page. This results
    in clearing bits and doing page operations (such as a page unlock) outside
    our target range.
    
    Fix that by calling extent_clear_unlock_delalloc() with an inclusive end
    offset, instead of an exclusive end offset, at cow_file_range().
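
To see why an exclusive end offset touches one extra page, here is a small sketch. It assumes 4KiB pages and a page-aligned, non-empty range; `TOY_PAGE_SHIFT`, `toy_pages_in_range` and `toy_pages_cleaned` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12 /* assumption: 4KiB pages */

/* Number of pages touched by the inclusive byte range [start, end]. */
static uint64_t toy_pages_in_range(uint64_t start, uint64_t end)
{
	return (end >> TOY_PAGE_SHIFT) - (start >> TOY_PAGE_SHIFT) + 1;
}

/* Pages affected when a cleanup routine that expects an inclusive end is
 * handed either start + len - 1 (inclusive) or start + len (exclusive). */
uint64_t toy_pages_cleaned(uint64_t start, uint64_t len, int use_inclusive_end)
{
	assert(len > 0);
	uint64_t end = use_inclusive_end ? start + len - 1 : start + len;
	return toy_pages_in_range(start, end);
}
```

For a single 4096-byte page starting at offset 0, the inclusive end 4095 stays in page 0, while the exclusive end 4096 is the first byte of page 1, so two pages are affected.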
    
    Fixes: a315e68 ("Btrfs: fix invalid attempt to free reserved space on failure to cow range")
    CC: stable@vger.kernel.org # 4.14+
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed May 28, 2020
  4. btrfs: remove redundant local variable in read_block_for_search

    The local 'b' variable is only used to directly read values from the passed
    extent buffer. So eliminate it and directly use the input parameter.
    Furthermore this shrinks the size of the following functions:
    
    ./scripts/bloat-o-meter ctree.orig fs/btrfs/ctree.o
    add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-73 (-73)
    Function                                     old     new   delta
    read_block_for_search.isra                   876     871      -5
    push_node_left                              1112    1044     -68
    Total: Before=50348, After=50275, chg -0.14%
    
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Nikolay Borisov authored and kdave committed May 28, 2020
  5. btrfs: open code key_search

    This function wraps the optimisation implemented by d7396f0
    ("Btrfs: optimize key searches in btrfs_search_slot"). However, this
    optimisation is really used in only one place: btrfs_search_slot.
    
    Just open code the optimisation and also add a comment explaining how it
    works, since it's not clear just by looking at the code. The key point
    here is that it depends on an internal invariant that the btrfs btree
    provides, namely that intermediate pointers always contain the key at
    slot 0 of the child node. So in the case of an exact match we can safely
    assume that the given key will always be in slot 0 on lower levels.
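
The invariant can be illustrated on a tiny two-level tree with integer keys. This is a sketch only: `toy_node`, `toy_bin_search`, `toy_search` and the static example tree are hypothetical, and real btrfs keys and nodes are considerably richer.

```c
#include <assert.h>
#include <stddef.h>

struct toy_node {
	int nkeys;
	int keys[4];
	const struct toy_node *child[4]; /* all NULL in a leaf */
};

/* Returns 0 and the matching slot on an exact match, otherwise 1 and the
 * slot where the key would be inserted. */
static int toy_bin_search(const struct toy_node *n, int key, int *slot)
{
	int lo = 0, hi = n->nkeys;

	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;

		if (n->keys[mid] < key) {
			lo = mid + 1;
		} else if (n->keys[mid] > key) {
			hi = mid;
		} else {
			*slot = mid;
			return 0;
		}
	}
	*slot = lo;
	return 1;
}

/* Descends from root to a leaf. Once an exact match is seen at some level,
 * lower levels skip the binary search and use slot 0 directly. */
int toy_search(const struct toy_node *root, int key, int *leaf_slot,
	       int *searches)
{
	const struct toy_node *n = root;
	int prev_cmp = -1, slot = 0, ret = 1;

	assert(root != NULL);
	*searches = 0;
	while (n) {
		if (prev_cmp == 0) {
			slot = 0;
			ret = 0;
		} else {
			ret = toy_bin_search(n, key, &slot);
			prev_cmp = ret;
			(*searches)++;
		}
		if (!n->child[0])
			break; /* reached a leaf */
		if (ret && slot > 0)
			slot--; /* follow the pointer covering the key */
		n = n->child[slot];
	}
	*leaf_slot = slot;
	return ret;
}

/* Example tree: each root key equals the first key of the child below it. */
static const struct toy_node toy_leaf0 = { 2, { 10, 20 }, { NULL } };
static const struct toy_node toy_leaf1 = { 2, { 30, 40 }, { NULL } };
static const struct toy_node toy_root = { 2, { 10, 30 },
					  { &toy_leaf0, &toy_leaf1 } };
```

Searching `toy_root` for 30 matches exactly in the root, so the leaf lookup is skipped and slot 0 is used; searching for 40 finds no exact match in the root and needs a second binary search in the leaf.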
    
    Furthermore this results in a reduction of btrfs_search_slot's size:
    
    ./scripts/bloat-o-meter ctree.orig fs/btrfs/ctree.o
    add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-75 (-75)
    Function                                     old     new   delta
    btrfs_search_slot                           2783    2708     -75
    Total: Before=50423, After=50348, chg -0.15%
    
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Nikolay Borisov authored and kdave committed May 28, 2020
  6. btrfs: split btrfs_direct_IO to read and write part

    The read and write versions don't have anything in common except for the
    call to iomap_dio_rw.  So split this function, and merge each half into
    its only caller.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Christoph Hellwig authored and kdave committed May 28, 2020
  7. btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK

    Since we now perform direct reads using i_rwsem, we can remove this
    inode flag used to co-ordinate unlocked reads.
    
    The truncate call takes i_rwsem. This means it is correctly synchronized
    with concurrent direct reads.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Johannes Thumshirn <jth@kernel.org>
    Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    goldwynr authored and kdave committed May 28, 2020
  8. fs: remove dio_end_io()

    Since we removed the last user of dio_end_io(), remove the helper
    function dio_end_io().
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    goldwynr authored and kdave committed May 28, 2020
  9. btrfs: switch to iomap_dio_rw() for dio

    Switch from __blockdev_direct_IO() to iomap_dio_rw().
    Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
    as iomap_begin() for iomap direct I/O functions. This function
    allocates and locks all the blocks required for the I/O.
    btrfs_submit_direct() is used as the submit_io() hook for direct I/O
    ops.
    
    Since we need direct I/O reads to go through iomap_dio_rw(), we change
    file_operations.read_iter() to a btrfs_file_read_iter() which calls
    btrfs_direct_IO() for direct reads and falls back to
    generic_file_buffered_read() for incomplete reads and buffered reads.
    
    We don't need address_space.direct_IO() anymore so set it to noop.
    Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
    capable of direct I/O reads from a hole, so we don't need to return
    -ENOENT.
    
    BTRFS direct I/O is now done under i_rwsem, shared in case of reads and
    exclusive in case of writes. This guards against simultaneous truncates.
    
    Use iomap->iomap_end() to check for failed or incomplete direct I/O:
     - for writes, call __endio_write_update_ordered()
     - for reads, unlock extents
    
    btrfs_dio_data is now hooked in iomap->private and not
    current->journal_info. It carries the reservation variable and the
    amount of data submitted, so we can calculate the amount of data to call
    __endio_write_update_ordered in case of an error.
    
    This patch removes the last use of struct buffer_head from btrfs.
    
    Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    goldwynr authored and kdave committed May 28, 2020

Commits on May 25, 2020

  1. iomap: remove lockdep_assert_held()

    Filesystems such as btrfs can perform direct I/O without holding the
    inode->i_rwsem in some cases, such as writing within i_size. So,
    remove the lockdep_assert_held() check in iomap_dio_rw().
    
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    goldwynr authored and kdave committed May 25, 2020
  2. iomap: add a filesystem hook for direct I/O bio submission

    This helps filesystems to perform tasks on the bio while submitting for
    I/O. This could be post-write operations such as data CRC or data
    replication for fs-handled RAID.
    
    Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    goldwynr authored and kdave committed May 25, 2020
  3. fs: export generic_file_buffered_read()

    Export generic_file_buffered_read() to be used to supplement incomplete
    direct reads.
    
    Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    goldwynr authored and kdave committed May 25, 2020
  4. btrfs: turn space cache writeout failure messages into debug messages

    Since commit 1afb648 ("btrfs: use standard debug config option to
    enable free-space-cache debug prints"), we started to log error messages
    that were never logged before since there was no DEBUG macro defined
    anywhere. This started to make test case btrfs/187 fail very often,
    as it greps for any btrfs error messages in dmesg/syslog and fails if
    any is found:
    
    (...)
    btrfs/186 1s ...  2s
    btrfs/187       - output mismatch (see .../results//btrfs/187.out.bad)
        --- tests/btrfs/187.out     2019-05-17 12:48:32.537340749 +0100
        +++ /home/fdmanana/git/hub/xfstests/results//btrfs/187.out.bad ...
        @@ -1,3 +1,8 @@
         QA output created by 187
         Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/snap1'
         Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/snap2'
        +[268364.139958] BTRFS error (device sdc): failed to write free space cache for block group 30408704
        +[268380.156503] BTRFS error (device sdc): failed to write free space cache for block group 30408704
        +[268380.161703] BTRFS error (device sdc): failed to write free space cache for block group 30408704
        +[268380.253180] BTRFS error (device sdc): failed to write free space cache for block group 30408704
        ...
        (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/187.out ...
    btrfs/188 4s ...  2s
    (...)
    
    The space cache write failures happen due to ENOSPC when attempting to
    update the free space cache items in the root tree. This happens because
    when starting or joining a transaction we don't know how many block
    groups we will end up changing (due to extent allocation or release) and
    therefore never reserve space for updating free space cache items.
    More often than not, the free space cache writeout succeeds since the
    metadata space info is not yet full nor very close to being full, but
    when it is, the space cache writeout fails with ENOSPC.
    
    Occasional failures to write space caches are not considered critical
    since they can be rebuilt when mounting the filesystem or the next
    attempt to write a free space cache in the next transaction commit might
    succeed, so we used to hide those error messages with a preprocessor
    check for the existence of the DEBUG macro that was never enabled
    anywhere.
    
    A few other generic test cases also trigger the error messages due to
    ENOSPC failure when writing free space caches as well, however they don't
    fail since they don't grep dmesg/syslog for any btrfs specific error
    messages.
    
    So change the messages from 'error' level to 'debug' level, as it doesn't
    make much sense to have error messages triggered only if the debug macro
    is enabled and, more importantly, the error is neither serious nor highly
    unexpected.
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed May 25, 2020
  5. btrfs: include error on messages about failure to write space/inode caches
    
    Currently the error messages logged when we fail to write a free space
    cache or an inode cache are not very useful as they don't mention what
    the error was. So include the error number in the messages.
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed May 25, 2020
  6. btrfs: remove useless 'fail_unlock' label from btrfs_csum_file_blocks()

    The label 'fail_unlock' is pointless: all it does is jump to the label
    'out', so just remove it.
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed May 25, 2020
  7. btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums
    
    We are currently treating any non-zero return value from btrfs_next_leaf()
    the same way, by going to the code that inserts a new checksum item in the
    tree. However if btrfs_next_leaf() returns an error (a value < 0), we
    should just stop and return the error, and not behave as if nothing has
    happened, since in that case we do not have a way to know if there is a
    next leaf or we are currently at the last leaf already.
    
    So fix that by returning the error from btrfs_next_leaf().
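
The difference in return value handling can be sketched as follows. The enum and the two functions are hypothetical names; the sketch assumes the usual convention where 0 means a next leaf was found, a positive value means there is no next leaf, and a negative value is an error.

```c
#include <assert.h>
#include <errno.h>

enum toy_action {
	TOY_RETURN_ERROR,      /* stop and propagate the error */
	TOY_INSERT_NEW_ITEM,   /* go insert a new checksum item */
	TOY_INSPECT_NEXT_LEAF  /* look at the first item of the next leaf */
};

/* Old behaviour: every non-zero value is treated as "no next leaf". */
enum toy_action toy_old_handling(int next_leaf_ret)
{
	return next_leaf_ret != 0 ? TOY_INSERT_NEW_ITEM : TOY_INSPECT_NEXT_LEAF;
}

/* Fixed behaviour: negative values are errors and are returned as such. */
enum toy_action toy_fixed_handling(int next_leaf_ret)
{
	if (next_leaf_ret < 0)
		return TOY_RETURN_ERROR;
	if (next_leaf_ret > 0)
		return TOY_INSERT_NEW_ITEM;
	return TOY_INSPECT_NEXT_LEAF;
}
```

With `-EIO` as input, the old handling proceeds to insertion as if the tree simply had no further leaf, while the fixed handling stops and reports the error.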
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed May 25, 2020
  8. btrfs: make checksum item extension more efficient

    When we want to add checksums into the checksums tree, or a log tree, we
    try whenever possible to extend existing checksum items, as this helps
    reduce the amount of metadata space used, since adding a new item uses extra
    metadata space for a btrfs_item structure (25 bytes).
    
    However we have two inefficiencies in the current approach:
    
    1) After finding a checksum item that covers a range with an end offset
       that matches the start offset of the checksum range we want to insert,
       we release the search path populated by btrfs_lookup_csum() and then
       do another COW search on the tree with the goal of getting additional
       space for at least one checksum. Doing this path release and then
       searching again is a waste of time because very often the leaf already
       has enough free space for at least one more checksum;
    
    2) After the COW search that guarantees we get free space in the leaf for
       at least one more checksum, we end up not doing the extension of the
       previous checksum item, and fall back to insertion of a new checksum
       item, if the leaf doesn't have an amount of free space larger than the
       space required for 2 checksums plus one btrfs_item structure - this is
       pointless for two reasons:
    
       a) We want to extend an existing item, so we don't need to account for
          a btrfs_item structure (25 bytes);
    
       b) We made the COW search with an insertion size for 1 single checksum,
          so if the leaf ends up with a free space amount smaller than 2
          checksums plus the size of a btrfs_item structure, we give up on the
          extension of the existing item and jump to the 'insert' label, where
          we end up releasing the path and then doing yet another search to
          insert a new checksum item for a single checksum.
    
    Fix these inefficiencies by doing the following:
    
    - For case 1), before releasing the path just check if the leaf already
      has enough space for at least 1 more checksum, and if it does, jump
      directly to the item extension code, without releasing our current path,
      which was already COWed by btrfs_lookup_csum();
    
    - For case 2), fix the logic so that for item extension we require only
      that the leaf has enough free space for 1 checksum, and not a minimum
      of 2 checksums plus space for a btrfs_item structure.
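
The two free-space conditions can be compared in a short sketch. The function names and `TOY_ITEM_HEADER_SIZE` are hypothetical; the 25-byte header size comes from the text above, and the exact comparison at the boundary of the old check is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_ITEM_HEADER_SIZE 25u /* size of a btrfs_item, per the text above */

/* Old rule: extend only if the leaf has more free space than two checksums
 * plus one item header (boundary handling is an assumption). */
bool toy_old_can_extend(uint32_t leaf_free, uint32_t csum_size)
{
	assert(csum_size > 0);
	return leaf_free > 2 * csum_size + TOY_ITEM_HEADER_SIZE;
}

/* New rule: extending an existing item needs room for one checksum only. */
bool toy_new_can_extend(uint32_t leaf_free, uint32_t csum_size)
{
	assert(csum_size > 0);
	return leaf_free >= csum_size;
}
```

With 4-byte checksums, a leaf with 20 free bytes is rejected by the old rule (it wants more than 33) but accepted by the new one, which avoids the extra search and the new 25-byte item header.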
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed May 25, 2020
  9. btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents
    
    When we have extents shared amongst different inodes in the same subvolume,
    if we fsync them in parallel we can end up with checksum items in the log
    tree that represent ranges which overlap.
    
    For example, consider we have inodes A and B, both sharing an extent that
    covers the logical range from X to X + 64KiB:
    
    1) Task A starts an fsync on inode A;
    
    2) Task B starts an fsync on inode B;
    
    3) Task A calls btrfs_csum_file_blocks(), and the first search in the
       log tree, through btrfs_lookup_csum(), returns -EFBIG because it
       finds an existing checksum item that covers the range from X - 64KiB
       to X;
    
    4) Task A checks that the checksum item has not reached the maximum
       possible size (MAX_CSUM_ITEMS) and then releases the search path
       before it does another path search for insertion (through a direct
       call to btrfs_search_slot());
    
    5) As soon as task A releases the path and before it does the search
       for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG
       too, because there is an existing checksum item that has an end
       offset that matches the start offset (X) of the checksum range we want
       to log;
    
    6) Task B releases the path;
    
    7) Task A does the path search for insertion (through btrfs_search_slot())
       and then verifies that the checksum item that ends at offset X still
       exists and extends its size to insert the checksums for the range from
       X to X + 64KiB;
    
    8) Task A releases the path and returns from btrfs_csum_file_blocks(),
       having inserted the checksums into an existing checksum item that got
       its size extended. At this point we have one checksum item in the log
       tree that covers the logical range from X - 64KiB to X + 64KiB;
    
    9) Task B now does a search for insertion using btrfs_search_slot() too,
       but it finds that the previous checksum item no longer ends at the
       offset X, it now ends at offset X + 64KiB, so it leaves that item
       untouched.
    
       Then it releases the path and calls btrfs_insert_empty_item()
       that inserts a checksum item with a key offset corresponding to X and
       a size for inserting a single checksum (4 bytes in case of crc32c).
       Subsequent iterations end up extending this new checksum item so that
       it contains the checksums for the range from X to X + 64KiB.
    
       So after task B returns from btrfs_csum_file_blocks() we end up with
       two checksum items in the log tree that have overlapping ranges, one
       for the range from X - 64KiB to X + 64KiB, and another for the range
       from X to X + 64KiB.
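
    The overlap produced by the steps above can be checked with a small
    model. This is a sketch under stated assumptions, not the tree checker
    itself: struct csum_item and both helpers are hypothetical names, the
    checksum size is 4 bytes (crc32c) and the sector size is assumed to be
    4096 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of a checksum item: key offset plus item size. */
struct csum_item {
	uint64_t start;     /* logical address of the first covered byte */
	uint32_t item_size; /* bytes of checksum data stored in the item */
};

/* First byte past the logical range covered by the item. */
static inline uint64_t csum_item_end(const struct csum_item *it,
				     uint32_t csum_size, uint32_t sectorsize)
{
	assert(csum_size > 0 && it->item_size % csum_size == 0);
	return it->start + (uint64_t)(it->item_size / csum_size) * sectorsize;
}

/* True if 'prev' (the lower key offset) reaches past the start of 'next'. */
static inline bool csum_items_overlap(const struct csum_item *prev,
				      const struct csum_item *next,
				      uint32_t csum_size, uint32_t sectorsize)
{
	return csum_item_end(prev, csum_size, sectorsize) > next->start;
}
```

    Before step 7 the first item ends exactly at X, so an item starting at
    X does not overlap it. After task A extends it to X + 64KiB, the item
    task B inserts at X overlaps.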
    
    Having checksum items that represent ranges which overlap, regardless of
    being in the log tree or in the checksums tree, can lead to problems where
    checksums for a file range end up not being found. This type of problem
    has happened a few times in the past and the following commits fixed them
    and explain in detail why having checksum items with overlapping ranges is
    problematic:
    
      27b9a81 "Btrfs: fix csum tree corruption, duplicate and outdated checksums"
      b84b839 "Btrfs: fix file read corruption after extent cloning and fsync"
      40e046a "Btrfs: fix missing data checksums after replaying a log tree"
    
    Since this specific instance of the problem can only happen when logging
    inodes, because it is the only case where concurrent attempts to insert
    checksums for the same range can happen, fix the issue by using an extent
    io tree as a range lock to serialize checksum insertion during inode
    logging.
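
    The idea of a range lock can be sketched with a toy, single-threaded
    model. It is not the kernel's extent io tree API: every name below is
    hypothetical, the table is a fixed-size array, and a failed attempt
    returns false where a real range lock would make the caller wait.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_HELD 8

/* One held range, covering [start, end). */
struct held_range {
	uint64_t start;
	uint64_t end;
	bool used;
};

struct range_lock {
	struct held_range held[MAX_HELD];
};

static inline bool ranges_overlap(uint64_t s1, uint64_t e1,
				  uint64_t s2, uint64_t e2)
{
	return s1 < e2 && s2 < e1;
}

/*
 * Try to take [start, end). Returns false if an overlapping range is
 * already held or the table is full.
 */
static inline bool range_trylock(struct range_lock *rl,
				 uint64_t start, uint64_t end)
{
	int free_slot = -1;

	assert(start < end);
	for (int i = 0; i < MAX_HELD; i++) {
		if (!rl->held[i].used) {
			if (free_slot < 0)
				free_slot = i;
			continue;
		}
		if (ranges_overlap(start, end,
				   rl->held[i].start, rl->held[i].end))
			return false;
	}
	if (free_slot < 0)
		return false;
	rl->held[free_slot] = (struct held_range){ start, end, true };
	return true;
}

/* Release a range previously taken with exactly the same bounds. */
static inline void range_unlock(struct range_lock *rl,
				uint64_t start, uint64_t end)
{
	for (int i = 0; i < MAX_HELD; i++) {
		if (rl->held[i].used && rl->held[i].start == start &&
		    rl->held[i].end == end) {
			rl->held[i].used = false;
			return;
		}
	}
}
```

    In the scenario above, task B's attempt to take X to X + 64KiB would
    not succeed until task A had finished inserting its checksums and
    released the range, so B's lookup would see the extended item.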
    
    This issue could often be reproduced by the test case generic/457 from
    fstests. When it happens it produces the following trace:
    
     BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item
     BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610
     BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884
          item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4
          item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4
          item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4
          item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4
          item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4
          item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4
     (...)
     BTRFS error (device dm-0): block=30625792 write time tree block corruption detected
     ------------[ cut here ]------------
     WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs]
     Modules linked in: btrfs dm_thin_pool ...
     CPU: 1 PID: 15884 Comm: fsx Tainted: G        W         5.6.0-rc7-btrfs-next-58 #1
     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
     RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs]
     Code: c7 c7 ...
     RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296
     RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000
     RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001
     RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001
     R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0
     R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000
     FS:  00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0
     DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
     Call Trace:
      btree_submit_bio_hook+0x67/0xc0 [btrfs]
      submit_one_bio+0x31/0x50 [btrfs]
      btree_write_cache_pages+0x2db/0x4b0 [btrfs]
      ? __filemap_fdatawrite_range+0xb1/0x110
      do_writepages+0x23/0x80
      __filemap_fdatawrite_range+0xd2/0x110
      btrfs_write_marked_extents+0x15e/0x180 [btrfs]
      btrfs_sync_log+0x206/0x10a0 [btrfs]
      ? kmem_cache_free+0x315/0x3b0
      ? btrfs_log_inode+0x1e8/0xf90 [btrfs]
      ? __mutex_unlock_slowpath+0x45/0x2a0
      ? lockref_put_or_lock+0x9/0x30
      ? dput+0x2d/0x580
      ? dput+0xb5/0x580
      ? btrfs_sync_file+0x464/0x4d0 [btrfs]
      btrfs_sync_file+0x464/0x4d0 [btrfs]
      do_fsync+0x38/0x60
      __x64_sys_fsync+0x10/0x20
      do_syscall_64+0x5c/0x280
      entry_SYSCALL_64_after_hwframe+0x49/0xbe
     RIP: 0033:0x7fb41953a6d0
     Code: 48 3d ...
     RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
     RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0
     RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003
     RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009
     R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060
     R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420
     irq event stamp: 0
     hardirqs last  enabled at (0): [<0000000000000000>] 0x0
     hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
     softirqs last  enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
     softirqs last disabled at (0): [<0000000000000000>] 0x0
     ---[ end trace d543fc76f5ad7fd8 ]---
    
    In that trace the tree checker detected the overlapping checksum items at
    the time when we triggered writeback for the log tree when syncing the
    log.
    
    Another trace that can happen is due to BUG_ON() when deleting checksum
    items while logging an inode:
    
     BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584)
     BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610
     BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473
      item 0 key (257 1 0) itemoff 16123 itemsize 160
              inode generation 7 size 262144 mode 100600
      item 1 key (257 12 256) itemoff 16103 itemsize 20
      item 2 key (257 108 0) itemoff 16050 itemsize 53
              extent data disk bytenr 13631488 nr 4096
              extent data offset 0 nr 131072 ram 131072
     (...)
     ------------[ cut here ]------------
     kernel BUG at fs/btrfs/ctree.c:3153!
     invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
     CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1
     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
     RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs]
     Code: 0f b6 ...
     RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282
     RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000
     RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001
     RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001
     R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08
     R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace
     FS:  00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0
     DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
     Call Trace:
      btrfs_del_csums+0x2f4/0x540 [btrfs]
      copy_items+0x4b5/0x560 [btrfs]
      btrfs_log_inode+0x910/0xf90 [btrfs]
      btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
      ? dget_parent+0x5/0x370
      btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
      btrfs_sync_file+0x42b/0x4d0 [btrfs]
      __x64_sys_msync+0x199/0x200
      do_syscall_64+0x5c/0x280
      entry_SYSCALL_64_after_hwframe+0x49/0xbe
     RIP: 0033:0x7fe586c65760
     Code: 00 f7 ...
     RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
     RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760
     RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000
     RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61
     R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1
     R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420
     Modules linked in: dm_log_writes ...
     ---[ end trace c92a7f447a8515f5 ]---
    
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed May 25, 2020
  10. btrfs: unexport btrfs_compress_set_level()

    btrfs_compress_set_level() can be a static function in the file
    compression.c.
    
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Anand Jain <anand.jain@oracle.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    asj authored and kdave committed May 25, 2020
  11. btrfs: simplify iget helpers

    The inode lookup starting at btrfs_iget takes the full location key,
    while only the objectid is used to match the inode, because the lookup
    happens inside the given root thus the inode number is unique.
    The entire location key is properly set up in btrfs_init_locked_inode.
    
    Simplify the helpers and pass only inode number, renaming it to 'ino'
    instead of 'objectid'. This allows removing the temporary key variables,
    saving some stack space.
    
    Signed-off-by: David Sterba <dsterba@suse.com>
    kdave committed May 25, 2020
  12. btrfs: open code read_fs_root

    After the update to btrfs_get_fs_root, read_fs_root has become a trivial
    wrapper that can be open coded.
    
    Signed-off-by: David Sterba <dsterba@suse.com>
    kdave committed May 25, 2020
  13. btrfs: simplify root lookup by id

    The main function to look up a root by its id, btrfs_get_fs_root, takes the
    whole key, while only using the objectid. The value of offset is preset
    to (u64)-1 but not actually used until btrfs_find_root that does the
    actual search.
    
    Switch btrfs_get_fs_root to use only objectid and remove all local
    variables that existed just for the lookup. The actual key for search is
    set up in btrfs_get_fs_root, reusing another key variable.
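
    A minimal sketch of building the search key from just the objectid is
    shown below. The struct and helper names are hypothetical and the
    non-zero precondition is only part of this sketch. The key type value
    132 (BTRFS_ROOT_ITEM_KEY) is not stated in this message and is filled
    in here as an assumption, while the (u64)-1 offset is the preset value
    mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value of BTRFS_ROOT_ITEM_KEY; not stated in the commit text. */
#define ROOT_ITEM_KEY 132

/* Illustrative stand-in for a btrfs key. */
struct key_model {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Callers pass only the root id; the rest of the key is filled in here. */
static inline struct key_model root_lookup_key(uint64_t objectid)
{
	struct key_model key = { objectid, ROOT_ITEM_KEY, UINT64_MAX };

	assert(objectid != 0);
	return key;
}
```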
    
    Signed-off-by: David Sterba <dsterba@suse.com>
    kdave committed May 25, 2020
  14. btrfs: reloc: clear DEAD_RELOC_TREE bit for orphan roots to prevent runaway balance
    
    [BUG]
    There are several reports of runaway balance, where balance floods the
    log with "found X extents" and the X never changes.
    
    [CAUSE]
    Commit d2311e6 ("btrfs: relocation: Delay reloc tree deletion after
    merge_reloc_roots") introduced BTRFS_ROOT_DEAD_RELOC_TREE bit to
    indicate that one subvolume has finished its tree blocks swap with its
    reloc tree.
    
    However if balance is canceled or hits ENOSPC halfway, we didn't clear
    the BTRFS_ROOT_DEAD_RELOC_TREE bit, leaving that bit hanging forever
    until unmount.
    
    Any subvolume root with that bit set would cause the backref cache to
    skip its tree blocks, as if it had finished its tree block swap.  This
    would cause all tree blocks of that root to be ignored by balance,
    leading to runaway balance.
    
    [FIX]
    Fix the problem by also clearing the BTRFS_ROOT_DEAD_RELOC_TREE bit for
    the original subvolume of orphan reloc root.
    
    Add an umount check for the stale bit still set.
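
    The life cycle of the bit can be modelled as below. This is a sketch
    with hypothetical names: the bit number is illustrative and does not
    claim to match the kernel's value, and the three helpers only mirror
    the set, test and clear steps described in this message.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit number; the kernel's actual value is not assumed. */
enum { ROOT_DEAD_RELOC_TREE = 0 };

struct root_state {
	unsigned long state;
};

/* Set once the subvolume has finished its tree block swap. */
static inline void mark_swap_finished(struct root_state *r)
{
	assert(r);
	r->state |= 1UL << ROOT_DEAD_RELOC_TREE;
}

/* The backref cache ignores tree blocks of roots that have the bit set. */
static inline bool backref_cache_skips(const struct root_state *r)
{
	assert(r);
	return (r->state & (1UL << ROOT_DEAD_RELOC_TREE)) != 0;
}

/* The fix: clear the stale bit when cleaning up an orphan reloc root. */
static inline void cleanup_orphan_reloc_root(struct root_state *subvol)
{
	assert(subvol);
	subvol->state &= ~(1UL << ROOT_DEAD_RELOC_TREE);
}
```

    Without the clearing step, a later balance would keep skipping the
    subvolume, which matches the "found X extents" loop described above.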
    
    Fixes: d2311e6 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed May 25, 2020
  15. btrfs: reloc: fix reloc root leak and NULL pointer dereference

    [BUG]
    When balance is canceled, there is a pretty high chance that unmounting
    the fs can lead to a NULL pointer dereference:
    
      BTRFS warning (device dm-3): page private not zero on page 223158272
      ...
      BTRFS warning (device dm-3): page private not zero on page 223162368
      BTRFS error (device dm-3): leaked root 18446744073709551608-304 refcount 1
      BUG: kernel NULL pointer dereference, address: 0000000000000168
      #PF: supervisor read access in kernel mode
      #PF: error_code(0x0000) - not-present page
      PGD 0 P4D 0
      Oops: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 2 PID: 5793 Comm: umount Tainted: G           O      5.7.0-rc5-custom+ #53
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      RIP: 0010:__lock_acquire+0x5dc/0x24c0
      Call Trace:
       lock_acquire+0xab/0x390
       _raw_spin_lock+0x39/0x80
       btrfs_release_extent_buffer_pages+0xd7/0x200 [btrfs]
       release_extent_buffer+0xb2/0x170 [btrfs]
       free_extent_buffer+0x66/0xb0 [btrfs]
       btrfs_put_root+0x8e/0x130 [btrfs]
       btrfs_check_leaked_roots.cold+0x5/0x5d [btrfs]
       btrfs_free_fs_info+0xe5/0x120 [btrfs]
       btrfs_kill_super+0x1f/0x30 [btrfs]
       deactivate_locked_super+0x3b/0x80
       deactivate_super+0x3e/0x50
       cleanup_mnt+0x109/0x160
       __cleanup_mnt+0x12/0x20
       task_work_run+0x67/0xa0
       exit_to_usermode_loop+0xc5/0xd0
       syscall_return_slowpath+0x205/0x360
       do_syscall_64+0x6e/0xb0
       entry_SYSCALL_64_after_hwframe+0x49/0xb3
      RIP: 0033:0x7fd028ef740b
    
    [CAUSE]
    When balance is canceled, all reloc roots are marked as orphan, and
    orphan reloc roots are going to be cleaned up.
    
    However for orphan reloc roots and merged reloc roots, their lifespans
    are quite different:
    
    	Merged reloc roots	|	Orphan reloc roots by cancel
    --------------------------------------------------------------------
    create_reloc_root()		| create_reloc_root()
    |- refs == 1			| |- refs == 1
    				|
    btrfs_grab_root(reloc_root);	| btrfs_grab_root(reloc_root);
    |- refs == 2			| |- refs == 2
    				|
    root->reloc_root = reloc_root;	| root->reloc_root = reloc_root;
    		>>> No difference so far <<<
    				|
    prepare_to_merge()		| prepare_to_merge()
    |- btrfs_set_root_refs(item, 1);| |- if (!err) (err == -EINTR)
    				|
    merge_reloc_roots()		| merge_reloc_roots()
    |- merge_reloc_root()		| |- Doing nothing to put reloc root
       |- insert_dirty_subvol()	| |- refs == 2
          |- __del_reloc_root()	|
             |- btrfs_put_root()	|
                |- refs == 1	|
    		>>> Now orphan reloc roots still have refs 2 <<<
    				|
    clean_dirty_subvols()		| clean_dirty_subvols()
    |- btrfs_drop_snapshot()	| |- btrfs_drop_snapshot()
       |- reloc_root get freed	|    |- reloc_root still has refs 2
    				|	related ebs get freed, but
    				|	reloc_root still recorded in
    				|	allocated_roots
    btrfs_check_leaked_roots()	| btrfs_check_leaked_roots()
    |- No leaked roots		| |- Leaked reloc_roots detected
    				| |- btrfs_put_root()
    				|    |- free_extent_buffer(root->node);
    				|       |- eb already freed, caused NULL
    				|	   pointer dereference
    
    [FIX]
    The fix is to clear fs_root->reloc_root and put it at
    merge_reloc_roots() time, so that we won't leak reloc roots.
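
    The reference counting in the table above can be replayed with a toy
    model. All names are hypothetical, and the model assumes that the final
    cleanup drops exactly one reference, which is consistent with the
    "refcount 1" in the leak message but simplifies what the kernel does.

```c
#include <assert.h>
#include <stdbool.h>

struct reloc_root_model {
	int refs;
	bool freed;
};

static inline void root_grab(struct reloc_root_model *r)
{
	assert(!r->freed);
	r->refs++;
}

static inline void root_put(struct reloc_root_model *r)
{
	assert(r->refs > 0);
	if (--r->refs == 0)
		r->freed = true;
}

/* Returns true if the reloc root is still referenced after cleanup. */
static inline bool reloc_root_leaks(bool canceled, bool with_fix)
{
	struct reloc_root_model r = { 1, false }; /* create: refs == 1 */

	root_grab(&r); /* root->reloc_root = reloc_root: refs == 2 */

	/*
	 * Merge time: the merged path puts one reference. The canceled
	 * path only does so once the fix is applied.
	 */
	if (!canceled || with_fix)
		root_put(&r);

	/* Cleanup of dirty subvolumes: assumed to drop one reference. */
	root_put(&r);

	return !r.freed;
}
```

    Only the canceled path without the fix ends with a reference still
    held, which is the leaked root reported at unmount.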
    
    Fixes: d2311e6 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
    CC: stable@vger.kernel.org # 5.1+
    Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed May 25, 2020
  16. btrfs: reduce lock contention when creating snapshot

    When creating a snapshot, ordered extents need to be flushed and this
    can take a long time.
    
    In create_snapshot there are two locks held when this happens:
    
      1. Destination directory inode lock
      2. Global subvolume semaphore
    
    This will unnecessarily block other operations like subvolume destroy,
    create, or setflag until the snapshot is created.
    
    We can fix that by moving the flush outside the locked section as this
    does not depend on the aforementioned locks.  The code factors out the
    snapshot related work from create_snapshot to btrfs_mksnapshot.
    
    __btrfs_ioctl_snap_create
      btrfs_mksubvol
        create_subvol
      btrfs_mksnapshot
        <flush>
        btrfs_mksubvol
          create_snapshot
    
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Robbie Ko <robbieko@synology.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Robbie Ko authored and kdave committed May 25, 2020
  17. btrfs: don't set SHAREABLE flag for data reloc tree

    The SHAREABLE flag is set for subvolumes because users can create
    snapshots of subvolumes, thus sharing their tree blocks.
    
    But data reloc tree is not exposed to user space, as it's only an
    internal tree for data relocation, thus it doesn't need the full path
    replacement handling at all.
    
    This patch will make data reloc tree a non-shareable tree, and add
    btrfs_fs_info::data_reloc_root for data reloc tree, so relocation code
    can grab it from fs_info directly.
    
    This would slightly improve tree relocation, as now data reloc tree
    can go through regular COW routine to get relocated, without bothering
    the complex tree reloc tree routine.
    
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed May 25, 2020
  18. btrfs: inode: cleanup the log-tree exceptions in btrfs_truncate_inode_items()
    
    There are a lot of root owner checks in btrfs_truncate_inode_items()
    like:
    
    	if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) ||
    	    root == fs_info->tree_root)
    
    But only these trees can have INODE_ITEMs:
    
    - tree root (for v1 space cache)
    - subvolume trees
    - tree reloc trees
    - data reloc tree
    - log trees
    
    Since subvolume/tree reloc/data reloc trees all have the SHAREABLE bit,
    and we're checking the tree root manually, the above check just excludes
    log trees.
    
    This patch replaces two such checks with a simpler one:
    
    	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
    
    This also merges the btrfs_drop_extent_cache() and lock_extent_bits()
    calls into the same if branch.
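
    The equivalence of the two checks over the five kinds of trees listed
    above can be verified with a small model. The struct and helpers are
    hypothetical. The objectid values are assumptions taken from the btrfs
    headers rather than from this message (log tree = -6, tree reloc = -8,
    data reloc = -9, tree root = 1, first subvolume = 5), and the flags
    reflect the state described here, where data reloc trees are SHAREABLE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TREE_LOG_OBJECTID ((uint64_t)-6)

struct inode_root {
	uint64_t objectid;
	bool shareable;    /* models the BTRFS_ROOT_SHAREABLE state bit */
	bool is_tree_root; /* models root == fs_info->tree_root */
};

static const struct inode_root inode_roots[] = {
	{ 1,            false, true  }, /* tree root */
	{ 5,            true,  false }, /* subvolume tree */
	{ (uint64_t)-8, true,  false }, /* tree reloc tree */
	{ (uint64_t)-9, true,  false }, /* data reloc tree */
	{ (uint64_t)-6, false, false }, /* log tree */
};

static inline bool old_check(const struct inode_root *r)
{
	assert(r);
	return r->shareable || r->is_tree_root;
}

static inline bool new_check(const struct inode_root *r)
{
	assert(r);
	return r->objectid != TREE_LOG_OBJECTID;
}

/* True if both checks give the same answer for every tree kind. */
static inline bool checks_agree(void)
{
	for (size_t i = 0; i < sizeof(inode_roots) / sizeof(inode_roots[0]); i++)
		if (old_check(&inode_roots[i]) != new_check(&inode_roots[i]))
			return false;
	return true;
}
```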
    
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed May 25, 2020
  19. btrfs: rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE

    The name BTRFS_ROOT_REF_COWS is not very clear about the meaning.
    
    In fact, that bit can only be set for these trees:
    
    - Subvolume roots
    - Data reloc root
    - Reloc roots for above roots
    
    All other trees won't get this bit set.  So it follows that roots with
    this bit set can have tree blocks shared with other trees, either by
    snapshots or by reloc roots (a special snapshot created by relocation).
    
    This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to
    make it easier to understand, and update all comments mentioning
    "reference counted" to follow the rename.
    
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed May 25, 2020
  20. btrfs: drop stale reference to volume_mutex

    Commit dccdb07 ("btrfs: kill btrfs_fs_info::volume_mutex") removed
    the last use of the volume_mutex, forgetting to update the comment.
    
    Signed-off-by: Anand Jain <anand.jain@oracle.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    asj authored and kdave committed May 25, 2020
  21. btrfs: update documentation of set/get helpers

    Signed-off-by: David Sterba <dsterba@suse.com>
    kdave committed May 25, 2020