Johannes-Thums…

Commits on Apr 28, 2020

  1. btrfs: rename btrfs_parse_device_options back to btrfs_parse_early_options
    
    As btrfs_parse_device_options() now doesn't only parse the -o device mount
    option but -o auth_key as well, it makes sense to rename it back to
    btrfs_parse_early_options().
    
    This reverts commit fa59f27.
    
    Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
    Johannes Thumshirn authored and 0day robot committed Apr 28, 2020
  2. btrfs: add authentication support

    Add authentication support for a BTRFS file-system.
    
    This works, because in BTRFS every meta-data block as well as every
    data-block has its own checksum. For meta-data the checksum is in the
    meta-data node itself. For data blocks, the checksums are stored in the
    checksum tree.
    
    When replacing the checksum algorithm with a keyed hash, like HMAC(SHA256),
    a key is needed to mount a verified file-system. This key also needs to be
    used at file-system creation time.
    
    We have to use a keyed hash scheme, in contrast to doing a normal
    cryptographic hash, to guarantee integrity of the file system, as a
    potential attacker could just replay file-system operations and the
    changes would go unnoticed.
    
    Having a keyed hash only on the topmost Node of a tree or even just in the
    super-block and using cryptographic hashes on the normal meta-data nodes
    and checksum tree entries doesn't work either, as the BTRFS B-Tree's Nodes
    do not include the checksums of their respective child nodes, but only the
    block pointers and offsets where to find them on disk.
    
    Also note, we do not need an incompat R/O flag for this, because if an old
    kernel tries to mount an authenticated file-system it will fail the
    initial checksum type verification and thus refuse to mount.
    
    The key has to be supplied by the kernel's keyring and the method of
    getting the key securely into the kernel is not the subject of this patch.
    
    Example usage:
    Create a file-system with authentication key 0123456
    mkfs.btrfs --csum hmac-sha256 --auth-key 0123456 /dev/disk
    
    Add the key to the kernel's keyring as keyid 'btrfs:foo'
    keyctl add logon btrfs:foo 0123456 @U
    
    Mount the fs using the 'btrfs:foo' key
    mount -t btrfs -o auth_key=btrfs:foo /dev/disk /mnt/point
    
    Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
    Johannes Thumshirn authored and 0day robot committed Apr 28, 2020
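
The argument in "btrfs: add authentication support" for a keyed hash can be illustrated with a small userspace sketch. Everything here is hypothetical: `toy_block`, `toy_write` and `toy_verify` are illustration names, and FNV-1a with a prepended key is only a stand-in for HMAC(SHA256). It is not cryptographically secure and is not what the patch implements. The point is only that a writer without the key cannot produce a checksum that verifies.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical block: data plus a checksum stored next to it. */
struct toy_block {
	char data[32];
	uint64_t csum;
};

/* One FNV-1a pass over a buffer. NOT cryptographic, illustration only. */
static uint64_t fnv1a_update(uint64_t h, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}

/* key == NULL models a plain hash; otherwise the key is mixed in first. */
static uint64_t toy_csum(const char *key, const struct toy_block *b)
{
	uint64_t h = 0xcbf29ce484222325ULL;

	if (key)
		h = fnv1a_update(h, key, strlen(key));
	return fnv1a_update(h, b->data, sizeof(b->data));
}

/* Store new contents with a checksum computed from the writer's key. */
static void toy_write(struct toy_block *b, const char *key, const char *contents)
{
	size_t n = strlen(contents);

	if (n > sizeof(b->data))
		n = sizeof(b->data);
	memset(b->data, 0, sizeof(b->data));
	memcpy(b->data, contents, n);
	b->csum = toy_csum(key, b);
}

/* Recompute the checksum with the verifier's key and compare. */
static bool toy_verify(const struct toy_block *b, const char *key)
{
	return b->csum == toy_csum(key, b);
}
```

With `key == NULL` an attacker who rewrites the block can also rewrite a matching checksum, so `toy_verify()` still succeeds. With a key, a block written under a different key fails verification.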

Commits on Apr 23, 2020

  1. btrfs: remove useless check for copy_items() return value

    At btrfs_log_prealloc_extents() we are checking if copy_items() returns a
    value greater than 0. That used to happen in the past to signal the caller
    that the path given to it was released and reused for other searches, but
    as of commit 0e56315 ("Btrfs: fix missing hole after hole punching
    and fsync when using NO_HOLES"), the copy_items() function does not have
    that behaviour anymore and always returns 0 or a negative value. So just
    remove that check at btrfs_log_prealloc_extents(), which the previously
    mentioned commit forgot to remove.
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed Apr 23, 2020
  2. btrfs: fix partial loss of prealloc extent past i_size after fsync

    When we have an inode with a prealloc extent that starts at an offset
    lower than the i_size and there is another prealloc extent that starts at
    an offset beyond i_size, we can end up losing part of the first prealloc
    extent (the part that starts at i_size) and have an implicit hole if we
    fsync the file and then have a power failure.
    
    Consider the following example with comments explaining how and why it
    happens.
    
      $ mkfs.btrfs -f /dev/sdb
      $ mount /dev/sdb /mnt
    
      # Create our test file with 2 consecutive prealloc extents, each with a
      # size of 128Kb, and covering the range from 0 to 256Kb, with a file
      # size of 0.
      $ xfs_io -f -c "falloc -k 0 128K" /mnt/foo
      $ xfs_io -c "falloc -k 128K 128K" /mnt/foo
    
      # Fsync the file to record both extents in the log tree.
      $ xfs_io -c "fsync" /mnt/foo
    
      # Now do a redundant extent allocation for the range from 0 to 64Kb.
      # This will merely increase the file size from 0 to 64Kb. Instead we
      # could also do a truncate to set the file size to 64Kb.
      $ xfs_io -c "falloc 0 64K" /mnt/foo
    
      # Fsync the file, so we update the inode item in the log tree with the
      # new file size (64Kb). This also ends up setting the number of bytes
      # for the first prealloc extent to 64Kb. This is done by the truncation
      # at btrfs_log_prealloc_extents().
      # This means that if a power failure happens after this, a write into
      # the file range 64Kb to 128Kb will not use the prealloc extent and
      # will result in allocation of a new extent.
      $ xfs_io -c "fsync" /mnt/foo
    
      # Now set the file size to 256K with a truncate and then fsync the file.
      # Since no changes happened to the extents, the fsync only updates the
      # i_size in the inode item at the log tree. This results in an implicit
      # hole for the file range from 64Kb to 128Kb, something which fsck will
      # complain about when not using the NO_HOLES feature if we replay the log
      # after a power failure.
      $ xfs_io -c "truncate 256K" -c "fsync" /mnt/foo
    
    So instead of always truncating the log to the inode's current i_size at
    btrfs_log_prealloc_extents(), check first if there's a prealloc extent
    that starts at an offset lower than the i_size and with a length that
    crosses the i_size - if there is one, just make sure we truncate to a
    size that corresponds to the end offset of that prealloc extent, so
    that we don't lose the part of that extent that starts at i_size if a
    power failure happens.
    
    A test case for fstests follows soon.
    
    Fixes: 31d11b8 ("Btrfs: fix duplicate extents after fsync of file with prealloc extents")
    CC: stable@vger.kernel.org # 4.14+
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed Apr 23, 2020
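
The truncation rule described in "btrfs: fix partial loss of prealloc extent past i_size after fsync" can be sketched as a small pure function. This is a userspace model, not the kernel code: `prealloc_extent` and `log_truncate_offset` are hypothetical names, and the real logic in btrfs_log_prealloc_extents() walks log tree items rather than an array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one prealloc extent of a file. */
struct prealloc_extent {
	uint64_t start;	/* file offset where the extent begins */
	uint64_t len;	/* length in bytes */
};

/*
 * Offset to truncate the logged extents to. Normally this is i_size, but if
 * a prealloc extent starts below i_size and crosses it, use the end of that
 * extent so the part beyond i_size is not lost.
 */
static uint64_t log_truncate_offset(const struct prealloc_extent *ext,
				    size_t n, uint64_t i_size)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t end = ext[i].start + ext[i].len;

		if (ext[i].start < i_size && end > i_size)
			return end;
	}
	return i_size;
}
```

For the scenario in the commit message (two 128K prealloc extents, i_size of 64K) the function returns 128K instead of 64K, which keeps the 64K to 128K part of the first extent.
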
  3. btrfs: fix transaction leak in btrfs_recover_relocation

    btrfs_recover_relocation() invokes btrfs_join_transaction(), which joins
    a btrfs_trans_handle object into transactions and returns a reference of
    it with increased refcount to "trans".
    
    When btrfs_recover_relocation() returns, "trans" becomes invalid, so the
    refcount should be decreased to keep refcount balanced.
    
    The reference counting issue happens in one exception handling path of
    btrfs_recover_relocation(). When read_fs_root() failed, the refcnt
    increased by btrfs_join_transaction() is not decreased, causing a refcnt
    leak.
    
    Fix this issue by calling btrfs_end_transaction() on this error path
    when read_fs_root() failed.
    
    Fixes: 79787ea ("btrfs: replace many BUG_ONs with proper error handling")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
    Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    sherlly authored and kdave committed Apr 23, 2020

Commits on Apr 22, 2020

  1. btrfs: fix block group leak when removing fails

    btrfs_remove_block_group() invokes btrfs_lookup_block_group(), which
    returns a local reference of the block group that contains the given
    bytenr to "block_group" with increased refcount.
    
    When btrfs_remove_block_group() returns, "block_group" becomes invalid,
    so the refcount should be decreased to keep refcount balanced.
    
    The reference counting issue happens in several exception handling paths
    of btrfs_remove_block_group(). When one of those errors occurs, for
    example when btrfs_alloc_path() returns NULL, the function forgets to
    drop the reference taken by btrfs_lookup_block_group(), causing a refcnt
    leak.
    
    Fix this issue by jumping to "out_put_group" label and calling
    btrfs_put_block_group() when those error scenarios occur.
    
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
    Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    sherlly authored and kdave committed Apr 22, 2020
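
The fix pattern in "btrfs: fix block group leak when removing fails" (jump to a label that drops the reference) looks roughly like this in a userspace model. All `toy_` names are hypothetical, and the `simulate_enomem` flag merely stands in for btrfs_alloc_path() returning NULL.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for a block group. */
struct toy_group {
	int refs;
};

/* Lookup hands out a reference that the caller must drop again. */
static struct toy_group *toy_lookup_group(struct toy_group *bg)
{
	bg->refs++;
	return bg;
}

static void toy_put_group(struct toy_group *bg)
{
	bg->refs--;
}

static int toy_remove_group(struct toy_group *bg, int simulate_enomem)
{
	struct toy_group *block_group = toy_lookup_group(bg);
	char path_storage[16];
	void *path;
	int ret = 0;

	/* Models an allocation that may fail. */
	path = simulate_enomem ? NULL : path_storage;
	if (!path) {
		ret = -ENOMEM;
		goto out_put_group;	/* do not return with the ref held */
	}

	/* ... the actual removal work would go here ... */

out_put_group:
	toy_put_group(block_group);
	return ret;
}
```

Both the success path and the error path leave `refs` at the value it had before the call, which is the balance the commit restores.
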
  2. btrfs: simplify direct I/O read repair

    Direct I/O read repair was originally implemented in commit 8b110e3
    ("Btrfs: implement repair function when direct read fails"). This
    implementation is unnecessarily complicated. There is major code
    duplication between __btrfs_subio_endio_read() (checks checksums and
    handles I/O errors for files with checksums),
    __btrfs_correct_data_nocsum() (handles I/O errors for files without
    checksums), btrfs_retry_endio() (checks checksums and handles I/O errors
    for retries of files with checksums), and btrfs_retry_endio_nocsum()
    (handles I/O errors for retries of files without checksum). If it sounds
    like these should be one function, that's because they should.
    Additionally, these functions are very hard to follow due to their
    excessive use of goto.
    
    This commit replaces the original implementation. After the previous
    commit getting rid of orig_bio, we can reuse the same endio callback for
    repair I/O and the original I/O; we just need to track the file offset
    and original iterator in the repair bio. We can also unify the handling
    of files with and without checksums and simplify the control flow. We
    also no longer have to wait for each repair I/O to complete one by one.
    
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    osandov authored and kdave committed Apr 22, 2020
  3. btrfs: get rid of one layer of bios in direct I/O

    In the worst case, there are _4_ layers of bios in the Btrfs direct I/O
    path:
    
    1. The bio created by the generic direct I/O code (dio_bio).
    2. A clone of dio_bio we create in btrfs_submit_direct() to represent
       the entire direct I/O range (orig_bio).
    3. A partial clone of orig_bio limited to the size of a RAID stripe that
       we create in btrfs_submit_direct_hook().
    4. Clones of each of those split bios for each RAID stripe that we
       create in btrfs_map_bio().
    
    As of the previous commit, the second layer (orig_bio) is no longer
    needed for anything: we can split dio_bio instead, and complete dio_bio
    directly when all of the cloned bios complete. This lets us clean up a
    bunch of cruft, including dip->subio_endio and dip->errors (we can use
    dio_bio->bi_status instead). It also enables the next big cleanup of
    direct I/O read repair.
    
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    osandov authored and kdave committed Apr 22, 2020
  4. btrfs: put direct I/O checksums in btrfs_dio_private instead of bio

    The next commit will get rid of btrfs_dio_private->orig_bio. The only
    thing we really need it for is containing all of the checksums, but we
    can easily put the checksum array in btrfs_dio_private and have the
    submitted bios reference the array. We can also look the checksums up
    while we're setting up instead of the current awkward logic that looks
    them up for orig_bio when the first split bio is submitted.
    
    (Interestingly, btrfs_dio_private did contain the
    checksums before commit 23ea8e5 ("Btrfs: load checksum data once
    when submitting a direct read io"), but it didn't look them up up
    front.)
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    osandov authored and kdave committed Apr 22, 2020
  5. btrfs: convert btrfs_dio_private->pending_bios to refcount_t

    This is really a reference count now, so convert it to refcount_t and
    rename it to refs.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    osandov authored and kdave committed Apr 22, 2020
  6. btrfs: remove unused btrfs_dio_private::private

    We haven't used this since commit 9be3395 ("Btrfs: use a btrfs
    bioset instead of abusing bio internals").
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    osandov authored and kdave committed Apr 22, 2020
  7. btrfs: make btrfs_check_repairable() static

    Since its introduction in commit 2fe6303 ("Btrfs: split
    bio_readpage_error into several functions"), btrfs_check_repairable()
    has only been used from extent_io.c where it is defined.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    osandov authored and kdave committed Apr 22, 2020
  8. btrfs: rename __readpage_endio_check to check_data_csum

    __readpage_endio_check() is also used from the direct I/O read code, so
    give it a more descriptive name.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    osandov authored and kdave committed Apr 22, 2020
  9. btrfs: clarify btrfs_lookup_bio_sums documentation

    Fix a couple of issues in the btrfs_lookup_bio_sums documentation:
    
    * The bio doesn't need to be a btrfs_io_bio if dst was provided. Move
      the declaration in the code to make that clear, too.
    * dst must be large enough to hold nblocks * csum_size, not just
      csum_size.
    
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    osandov authored and kdave committed Apr 22, 2020
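
The second point of "btrfs: clarify btrfs_lookup_bio_sums documentation" is plain arithmetic: one checksum per block. A minimal sketch follows, with a hypothetical helper name and the assumption that the I/O size is a multiple of the sector size.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bytes the destination buffer needs for a read of io_bytes:
 * nblocks * csum_size, not just csum_size.
 */
static size_t csum_dst_bytes(uint64_t io_bytes, uint32_t sectorsize,
			     uint32_t csum_size)
{
	uint64_t nblocks = io_bytes / sectorsize;

	return (size_t)(nblocks * csum_size);
}
```

For example, a 64KiB read with 4KiB sectors and 4-byte checksums needs 16 * 4 = 64 bytes.
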
  10. btrfs: don't do repair validation for checksum errors

    The purpose of the validation step is to distinguish between good and
    bad sectors in a failed multi-sector read. If a multi-sector read
    succeeded but some of those sectors had checksum errors, we don't need
    to validate anything; we know the sectors with bad checksums need to be
    repaired.
    
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    osandov authored and kdave committed Apr 22, 2020
  11. btrfs: look at full bi_io_vec for repair decision

    Read repair does two things: it finds a good copy of data to return to
    the reader, and it corrects the bad copy on disk. If a read of multiple
    sectors has an I/O error, repair does an extra "validation" step that
    issues a separate read for each sector. This allows us to find the exact
    failing sectors and only rewrite those.
    
    This heuristic is implemented in
    bio_readpage_error()/btrfs_check_repairable() as:
    
    	failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
    	if (failed_bio_pages > 1)
    		do validation
    
    However, at this point, bi_iter may have already been advanced. This
    means that we'll skip the validation step and rewrite the entire failed
    read.
    
    Fix it by getting the actual size from the biovec (which we can do
    because this is only called for non-cloned bios, although that will
    change in a later commit).
    
    Fixes: 8a2ee44 ("btrfs: look at bi_size for repair decisions")
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    osandov authored and kdave committed Apr 22, 2020
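
The problem described in "btrfs: look at full bi_io_vec for repair decision" can be modelled in userspace. The structures below are hypothetical simplifications of a bio (no cloning, no offsets), and `TOY_PAGE_SHIFT` assumes 4KiB pages. The sketch only shows why the iterator's remaining size undercounts once it has been advanced, while the sum over the vector table does not.

```c
#include <stddef.h>

#define TOY_PAGE_SHIFT 12	/* assumes 4KiB pages */

struct toy_bvec {
	unsigned int len;	/* bytes in this segment */
};

struct toy_bio {
	struct toy_bvec vecs[4];
	size_t nr_vecs;
	unsigned int iter_size;	/* bytes not yet consumed, like bi_iter.bi_size */
};

/* Consuming data shrinks the iterator but leaves the table untouched. */
static void toy_bio_advance(struct toy_bio *bio, unsigned int bytes)
{
	bio->iter_size -= bytes;
}

/* The old heuristic: depends on how far the iterator has advanced. */
static unsigned int pages_from_iter(const struct toy_bio *bio)
{
	return bio->iter_size >> TOY_PAGE_SHIFT;
}

/* The fixed approach: sum the full vector table of a non-cloned bio. */
static unsigned int pages_from_bvecs(const struct toy_bio *bio)
{
	unsigned int total = 0;

	for (size_t i = 0; i < bio->nr_vecs; i++)
		total += bio->vecs[i].len;
	return total >> TOY_PAGE_SHIFT;
}
```

After a two-page bio has been fully advanced, `pages_from_iter()` reports 0 and the validation step would be skipped, while `pages_from_bvecs()` still reports 2.
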
  12. btrfs: fix double __endio_write_update_ordered in direct I/O

    In btrfs_submit_direct(), if we fail to allocate the btrfs_dio_private,
    we complete the ordered extent range. However, we don't mark that the
    range doesn't need to be cleaned up from btrfs_direct_IO() until later.
    Therefore, if we fail to allocate the btrfs_dio_private, we complete the
    ordered extent range twice. We could fix this by updating
    unsubmitted_oe_range earlier, but it's cleaner to reorganize the code so
    that creating the btrfs_dio_private and submitting the bios are
    separate, and once the btrfs_dio_private is created, cleanup always
    happens through the btrfs_dio_private.
    
    The logic around unsubmitted_oe_range_end and unsubmitted_oe_range_start
    is really subtle. We have the following:
    
      1. btrfs_direct_IO sets those two to the same value.
    
      2. When we call __blockdev_direct_IO, cleanup won't happen unless
         btrfs_get_blocks_direct->btrfs_get_blocks_direct_write is called to
         modify unsubmitted_oe_range_start so that start < end.
    
      3. We come into btrfs_submit_direct - if dip allocation fails we'd
         return with oe_range_end now modified so cleanup will happen.
    
      4. If we manage to allocate the dip we reset the unsubmitted range
         members to be equal so that cleanup happens from
         btrfs_endio_direct_write.
    
    This 4-step logic is not really obvious, especially given it's scattered
    across 3 functions.
    
    Fixes: f28a492 ("Btrfs: fix leaking of ordered extents after direct IO write error")
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    [ add range start/end logic explanation from Nikolay ]
    Signed-off-by: David Sterba <dsterba@suse.com>
    osandov authored and kdave committed Apr 22, 2020
  13. btrfs: fix error handling when submitting direct I/O bio

    In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID
    stripe or chunk, we submit orig_bio without cloning it. In this case, we
    don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we
    decrement pending_bios to -1, and we never complete orig_bio. Fix it by
    initializing pending_bios to 1 instead of incrementing later.
    
    Fixing this exposes another bug: we put orig_bio prematurely and then
    put it again from end_io. Fix it by not putting orig_bio.
    
    After this change, pending_bios is really more of a reference count, but
    I'll leave that cleanup separate to keep the fix small.
    
    Fixes: e65e153 ("btrfs: fix panic caused by direct IO")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    osandov authored and kdave committed Apr 22, 2020
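
The idea of starting the counter at 1, as in "btrfs: fix error handling when submitting direct I/O bio", is a common reference-counting pattern: the submitter holds its own reference until it has finished submitting. The sketch below is a generic userspace model of that pattern, not the btrfs code. `toy_dio`, `toy_dio_put` and `toy_submit` are hypothetical, and successful bios are assumed to complete immediately.

```c
#include <errno.h>

struct toy_dio {
	int refs;		/* starts at 1: the submitter's reference */
	int completions;	/* times the original I/O was completed */
	int error;
};

/* Drop one reference; the last put completes the original I/O. */
static void toy_dio_put(struct toy_dio *dip)
{
	if (--dip->refs == 0)
		dip->completions++;
}

/*
 * Submit nr_bios bios. The bio with index fail_at (or none if -1) fails at
 * submission time.
 */
static void toy_submit(struct toy_dio *dip, int nr_bios, int fail_at)
{
	dip->refs = 1;
	dip->completions = 0;
	dip->error = 0;

	for (int i = 0; i < nr_bios; i++) {
		dip->refs++;			/* reference owned by bio i */
		if (i == fail_at) {
			dip->error = -EIO;
			toy_dio_put(dip);	/* bio never went out */
			break;
		}
		toy_dio_put(dip);		/* models the bio's end_io */
	}
	toy_dio_put(dip);			/* drop the submitter's reference */
}
```

Because the submitter's reference is only dropped at the end, the count cannot reach zero (or go negative) while submission is still in progress, and the original I/O is completed exactly once even when the only bio fails.
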
  14. block: add bio_for_each_bvec_all()

    An upcoming Btrfs fix needs to know the original size of a non-cloned
    bio. Rather than accessing the bvec table directly, let's add a
    bio_for_each_bvec_all() accessor.
    
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    osandov authored and kdave committed Apr 22, 2020
  15. btrfs: drop logs when we've aborted a transaction

    Dave reported a problem where we were panicking with generic/475 with
    misc-5.7.  This is because we were doing IO after we had stopped all of
    the worker threads, because we do the log tree cleanup on roots at drop
    time.  Cleaning up the log tree will always need to do reads if we
    happened to have evicted the blocks from memory.
    
    Because of this simply add a helper to btrfs_cleanup_transaction() that
    will go through and drop all of the log roots.  This gets run before we
    do the close_ctree() work, and thus we are allowed to do any reads that
    we would need.  I ran this through many iterations of generic/475 with
    constrained memory and I did not see the issue.
    
      general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
      CPU: 2 PID: 12359 Comm: umount Tainted: G        W 5.6.0-rc7-btrfs-next-58 #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
      RIP: 0010:btrfs_queue_work+0x33/0x1c0 [btrfs]
      RSP: 0018:ffff9cfb015937d8 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff8eb5e339ed80 RCX: 0000000000000000
      RDX: 0000000000000001 RSI: ffff8eb5eb33b770 RDI: ffff8eb5e37a0460
      RBP: ffff8eb5eb33b770 R08: 000000000000020c R09: ffffffff9fc09ac0
      R10: 0000000000000007 R11: 0000000000000000 R12: 6b6b6b6b6b6b6b6b
      R13: ffff9cfb00229040 R14: 0000000000000008 R15: ffff8eb5d3868000
      FS:  00007f167ea022c0(0000) GS:ffff8eb5fae00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f167e5e0cb1 CR3: 0000000138c18004 CR4: 00000000003606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       btrfs_end_bio+0x81/0x130 [btrfs]
       __split_and_process_bio+0xaf/0x4e0 [dm_mod]
       ? percpu_counter_add_batch+0xa3/0x120
       dm_process_bio+0x98/0x290 [dm_mod]
       ? generic_make_request+0xfb/0x410
       dm_make_request+0x4d/0x120 [dm_mod]
       ? generic_make_request+0xfb/0x410
       generic_make_request+0x12a/0x410
       ? submit_bio+0x38/0x160
       submit_bio+0x38/0x160
       ? percpu_counter_add_batch+0xa3/0x120
       btrfs_map_bio+0x289/0x570 [btrfs]
       ? kmem_cache_alloc+0x24d/0x300
       btree_submit_bio_hook+0x79/0xc0 [btrfs]
       submit_one_bio+0x31/0x50 [btrfs]
       read_extent_buffer_pages+0x2fe/0x450 [btrfs]
       btree_read_extent_buffer_pages+0x7e/0x170 [btrfs]
       walk_down_log_tree+0x343/0x690 [btrfs]
       ? walk_log_tree+0x3d/0x380 [btrfs]
       walk_log_tree+0xf7/0x380 [btrfs]
       ? plist_requeue+0xf0/0xf0
       ? delete_node+0x4b/0x230
       free_log_tree+0x4c/0x130 [btrfs]
       ? wait_log_commit+0x140/0x140 [btrfs]
       btrfs_free_log+0x17/0x30 [btrfs]
       btrfs_drop_and_free_fs_root+0xb0/0xd0 [btrfs]
       btrfs_free_fs_roots+0x10c/0x190 [btrfs]
       ? do_raw_spin_unlock+0x49/0xc0
       ? _raw_spin_unlock+0x29/0x40
       ? release_extent_buffer+0x121/0x170 [btrfs]
       close_ctree+0x289/0x2e6 [btrfs]
       generic_shutdown_super+0x6c/0x110
       kill_anon_super+0xe/0x30
       btrfs_kill_super+0x12/0x20 [btrfs]
       deactivate_locked_super+0x3a/0x70
    
    Reported-by: David Sterba <dsterba@suse.com>
    Fixes: 8c38938 ("btrfs: move the root freeing stuff into btrfs_put_root")
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Apr 22, 2020
  16. btrfs: discard: Use the correct style for SPDX License Identifier

    This patch corrects the SPDX License Identifier style in a header file
    related to Btrfs File System support.  For C header files
    Documentation/process/license-rules.rst mandates C-like comments
    (as opposed to C source files where C++ style should be used).
    
    Changes made by using a script provided by Joe Perches here:
    https://lkml.org/lkml/2019/2/7/46.
    
    Suggested-by: Joe Perches <joe@perches.com>
    Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    nishadkamdar authored and kdave committed Apr 22, 2020
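
The style rule cited in "btrfs: discard: Use the correct style for SPDX License Identifier" can be expressed as a small check. This is a hypothetical helper, not the script referenced in the commit, and it simplifies the rule to "files ending in .h use a C comment, everything else uses a C++ comment".

```c
#include <stdbool.h>
#include <string.h>

static bool has_suffix(const char *s, const char *suffix)
{
	size_t n = strlen(s), m = strlen(suffix);

	return n >= m && strcmp(s + n - m, suffix) == 0;
}

/*
 * True if first_line starts with the SPDX tag in the comment style expected
 * for the file type: C-style for headers, C++-style for C source files.
 */
static bool spdx_style_ok(const char *filename, const char *first_line)
{
	const char *want = has_suffix(filename, ".h")
		? "/* SPDX-License-Identifier: "
		: "// SPDX-License-Identifier: ";

	return strncmp(first_line, want, strlen(want)) == 0;
}
```

A header that starts with `// SPDX-License-Identifier: GPL-2.0` is rejected by this check, which is the kind of line the commit rewrites.
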
  17. btrfs: simplify error handling of clean_pinned_extents()

    At clean_pinned_extents(), whether we end up returning success or failure,
    we pretty much have to do the same things:
    
    1) unlock unused_bg_unpin_mutex
    2) decrement reference count on the previous transaction
    
    We also call btrfs_dec_block_group_ro() in case of failure, but that is
    better done in its caller, btrfs_delete_unused_bgs(), since it's the
    caller that calls inc_block_group_ro(), so it should be responsible for
    the decrement operation, as it is in case any of the other functions it
    calls fail.
    
    So move the call to btrfs_dec_block_group_ro() from clean_pinned_extents()
    into  btrfs_delete_unused_bgs() and unify the error and success return
    paths for clean_pinned_extents(), reducing duplicated code and making it
    simpler.
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed Apr 22, 2020
  18. btrfs: fix memory leak of transaction when deleting unused block group

    When cleaning pinned extents right before deleting an unused block group,
    we check if there's still a previous transaction running and if so we
    increment its reference count before using it for cleaning pinned ranges
    in its pinned extents iotree. However we ended up never decrementing the
    reference count after using the transaction, resulting in a memory leak.
    
    Fix it by decrementing the reference count.
    
    Fixes: fe119a6 ("btrfs: switch to per-transaction pinned extents")
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed Apr 22, 2020
  19. btrfs: remove the redundant parameter level in btrfs_bin_search()

    All callers pass the eb::level so we can read it directly inside
    btrfs_bin_search and key_search.
    
    This is inspired by the work of Marek in U-Boot.
    
    CC: Marek Behun <marek.behun@nic.cz>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Apr 22, 2020
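
The cleanup in "btrfs: remove the redundant parameter level in btrfs_bin_search()" amounts to reading the level from the buffer instead of passing it in. The following userspace sketch is hypothetical throughout: `toy_eb` stores entries as plain 64-bit words, with an assumed stride of three words per leaf entry and two per node entry, which is not the real on-disk layout. The return convention (0 on an exact match, 1 with the insertion slot otherwise) is also an assumption of this model.

```c
#include <stdint.h>

/* Hypothetical extent buffer: the level is part of the buffer itself. */
struct toy_eb {
	int level;		/* 0 = leaf, > 0 = internal node */
	int nritems;
	uint64_t words[24];	/* entries; the key is each entry's first word */
};

/* Entry size depends on the level, which is why the search needs it. */
static int toy_entry_stride(const struct toy_eb *eb)
{
	return eb->level == 0 ? 3 : 2;
}

/*
 * Binary search without a separate level argument. Returns 0 and sets *slot
 * on an exact match; returns 1 and sets *slot to the insertion position
 * otherwise.
 */
static int toy_bin_search(const struct toy_eb *eb, uint64_t key, int *slot)
{
	int stride = toy_entry_stride(eb);
	int low = 0, high = eb->nritems;

	while (low < high) {
		int mid = low + (high - low) / 2;
		uint64_t cur = eb->words[mid * stride];

		if (cur < key) {
			low = mid + 1;
		} else if (cur > key) {
			high = mid;
		} else {
			*slot = mid;
			return 0;
		}
	}
	*slot = low;
	return 1;
}
```

Since every caller would have passed `eb->level` anyway, dropping the parameter removes a way for the two values to disagree.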

Commits on Apr 20, 2020

  1. btrfs: discard: Use the correct style for SPDX License Identifier

    This patch corrects the SPDX License Identifier style in header file
    related to Btrfs File System support.  For C header files
    Documentation/process/license-rules.rst mandates C-like comments
    (opposed to C source files where C++ style should be used).
    
    Changes made by using a script provided by Joe Perches here:
    https://lkml.org/lkml/2019/2/7/46.
    
    Suggested-by: Joe Perches <joe@perches.com>
    Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    nishadkamdar authored and kdave committed Apr 20, 2020
  2. btrfs: make btrfs_read_disk_super return struct btrfs_disk_super

    Instead of returning both the page and the super block structure, make
    btrfs_read_disk_super just return a pointer to struct btrfs_disk_super.
    As a result the function signature is simplified. Also,
    read_cache_page_gfp can never return NULL so check its return value only
    for IS_ERR.
    
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Nikolay Borisov authored and kdave committed Apr 20, 2020
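
Returning a single pointer that is either valid or an encoded error, as "btrfs: make btrfs_read_disk_super return struct btrfs_disk_super" does, relies on the kernel's ERR_PTR()/IS_ERR() idiom. The sketch below re-creates that idiom in userspace under hypothetical `toy_` names. It assumes that the top 4095 values of the address space are never valid pointers, and the `simulate_io_error` flag stands in for a failed read.

```c
#include <errno.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *toy_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Decode the errno value again. */
static inline long toy_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error pointers live in the last TOY_MAX_ERRNO addresses. */
static inline int toy_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_super {
	uint64_t magic;
};

static struct toy_super toy_sb = { 0x4D5F53665248425FULL };

/* One return value carries both the result and the error. */
static struct toy_super *toy_read_disk_super(int simulate_io_error)
{
	if (simulate_io_error)
		return toy_err_ptr(-EIO);
	return &toy_sb;
}
```

A caller then only needs `if (toy_is_err(sb)) return toy_ptr_err(sb);`. No separate NULL check is required when the function never returns NULL, which mirrors the commit's remark about read_cache_page_gfp.
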
  3. btrfs: use list_for_each_entry_safe in free_reloc_roots

    The function always works on a local copy of the reloc root list, which
    cannot be modified outside of it so using list_for_each_entry is fine.
    Additionally the macro handles empty lists so drop list_empty checks of
    callers. No semantic changes.
    
    Reviewed-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Nikolay Borisov authored and kdave committed Apr 20, 2020
  4. btrfs: don't force read-only after error in drop snapshot

    Deleting a subvolume on a full filesystem leads to ENOSPC followed by a
    forced read-only. This is not a transaction abort and the filesystem is
    otherwise ok, so the error should be just propagated to the callers.
    
    This is caused by an unnecessary call to btrfs_handle_fs_error for all
    errors, except EAGAIN. This does not make sense as the standard
    transaction abort mechanism is in btrfs_drop_snapshot so all relevant
    failures are handled.
    
    Originally in commit cb1b69f ("Btrfs: forced readonly when
    btrfs_drop_snapshot() fails") there was no return value at all, so the
    btrfs_std_error made some sense but once the error handling and
    propagation has been implemented we don't need it anymore.
    
    Signed-off-by: David Sterba <dsterba@suse.com>
    kdave committed Apr 20, 2020
  5. btrfs: remove pointless assertion on reclaim_size counter

    The reclaim_size counter of a space_info object is unsigned, so its value
    can never be negative. It's pointless to have an assertion that checks
    its value is >= 0, therefore remove it.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed Apr 20, 2020
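
Why the assertion in "btrfs: remove pointless assertion on reclaim_size counter" could never fire is easy to show: an unsigned counter wraps around instead of becoming negative. The helpers below are hypothetical. The second one shows a check that could detect an underflow, which is an illustration only and not something the commit adds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Subtracting too much wraps to a huge value; the result is still >= 0. */
static uint64_t toy_sub_reclaim(uint64_t reclaim_size, uint64_t bytes)
{
	return reclaim_size - bytes;
}

/* A meaningful check has to compare the operands before subtracting. */
static bool toy_sub_would_underflow(uint64_t reclaim_size, uint64_t bytes)
{
	return bytes > reclaim_size;
}
```

Here `toy_sub_reclaim(0, 1)` yields `UINT64_MAX`, so a test of the form `value >= 0` is true by construction.
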
  6. btrfs: tree-checker: remove duplicate definition of 'inode_item_err'

    Remove the duplicate definition of 'inode_item_err' in the file
    tree-checker.c that got there by accident in c23c77b ("btrfs:
    tree-checker: Refactor inode key check into seperate function").
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Zheng Wei <wei.zheng@vivo.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Zheng Wei authored and kdave committed Apr 20, 2020
  7. btrfs: force chunk allocation if our global rsv is larger than metadata

    Nikolay noticed a bunch of test failures with my global rsv steal
    patches.  At first he thought they were introduced by them, but they've
    been failing for a while with 64k nodes.
    
    The problem is with 64k nodes we have a global reserve that calculates
    out to 13MiB on a freshly made file system, which only has 8MiB of
    metadata space.  Because of changes I previously made we no longer
    account for the global reserve in the overcommit logic, which means we
    correctly allow overcommit to happen even though we are already
    overcommitted.
    
    However in some corner cases, for example btrfs/170, we will allocate
    the entire file system up with data chunks before we have enough space
    pressure to allocate a metadata chunk.  Then once the fs is full we
    ENOSPC out because we cannot overcommit and the global reserve is taking
    up all of the available space.
    
    The most ideal way to deal with this is to change our space reservation
    stuff to take into account the height of the trees that we're
    modifying, so that our global reserve calculation does not end up so
    obscenely large.
    
    However that is a huge undertaking.  Instead fix this by forcing a chunk
    allocation if the global reserve is larger than the total metadata
    space.  This gives us essentially the same behavior that happened
    before, we get a chunk allocated and these tests can pass.
    
    This is meant to be a stop-gap measure until we can tackle the "tree
    height only" project.
    
    Fixes: 0096420 ("btrfs: do not account global reserve in can_overcommit")
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Apr 20, 2020
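
The stop-gap condition from "btrfs: force chunk allocation if our global rsv is larger than metadata" reduces to a single comparison. The helper name below is hypothetical, and the real decision in the kernel involves more state than these two numbers.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_MIB (1024ULL * 1024ULL)

/* Force a metadata chunk allocation when the reserve cannot possibly fit. */
static bool toy_force_metadata_chunk(uint64_t global_rsv_size,
				     uint64_t metadata_total)
{
	return global_rsv_size > metadata_total;
}
```

With the figures from the commit message, a 13MiB global reserve against 8MiB of metadata space, the condition is true and a chunk would be allocated.
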
  8. btrfs: run btrfs_try_granting_tickets if a priority ticket fails

    With normal tickets we could have a large reservation at the front of
    the list that is unable to be satisfied, but a smaller ticket later on
    that can be satisfied.  The way we handle this is to run
    btrfs_try_granting_tickets() in maybe_fail_all_tickets().
    
    However no such protection exists for priority tickets.  Fix this by
    handling it in handle_reserve_ticket().  If we've returned after
    attempting to flush space in a priority related way, we'll still be on
    the priority list and need to be removed.
    
    We rely on the flushing to free up space and wake the ticket, but if
    there is not enough space to reclaim _but_ there's enough space in the
    space_info to handle subsequent reservations then we would have gotten
    an ENOSPC erroneously.
    
    Address this by catching where we are still on the list, meaning we were
    a priority ticket, and removing ourselves and then running
    btrfs_try_granting_tickets().  This will handle this particular corner
    case.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Apr 20, 2020
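
The corner case in "btrfs: run btrfs_try_granting_tickets if a priority ticket fails" can be modelled with a simple queue. This is a userspace sketch with hypothetical names. It assumes that granting walks the list from the head and stops at the first ticket that does not fit, so a large ticket at the front blocks smaller ones behind it until it is removed.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_space {
	uint64_t free_bytes;
	uint64_t tickets[8];	/* pending reservation sizes, head at index 0 */
	size_t nr;
};

static void toy_remove_head(struct toy_space *s)
{
	for (size_t i = 1; i < s->nr; i++)
		s->tickets[i - 1] = s->tickets[i];
	s->nr--;
}

/* Grant tickets from the head while they fit. Returns the number granted. */
static size_t toy_try_granting(struct toy_space *s)
{
	size_t granted = 0;

	while (s->nr > 0 && s->tickets[0] <= s->free_bytes) {
		s->free_bytes -= s->tickets[0];
		toy_remove_head(s);
		granted++;
	}
	return granted;
}

/*
 * The head ticket gives up (its waiter would see ENOSPC). Remove it, then
 * give the tickets queued behind it another chance.
 */
static size_t toy_fail_head_and_regrant(struct toy_space *s)
{
	if (s->nr == 0)
		return 0;
	toy_remove_head(s);
	return toy_try_granting(s);
}
```

With 20 bytes free and tickets of 100 and 10 bytes, `toy_try_granting()` grants nothing. After the 100-byte ticket fails and is removed, the 10-byte ticket is granted instead of also failing.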