
Commits on Jul 15, 2020

  1. btrfs: Switch seed device to list api

    While this patch touches a bunch of files the conversion is
    straightforward. Instead of using the implicit linked list anchored at
    btrfs_fs_devices::seed the code is switched to using
    list_for_each_entry. Previous patches in the series already factored
    out code that processed both main and seed devices, so in those cases
    the factored-out functions are called on the main fs_devices and then
    on every seed device inside list_for_each_entry.

    Using the list API also allows us to simplify deletion from the seed
    device list, performed in btrfs_rm_device and
    btrfs_rm_dev_replace_free_srcdev, by substituting a while() loop with
    a simple list_del_init.
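
    A minimal sketch of the resulting pattern, assuming the new list head
    is named seed_list (process_devices() is a hypothetical stand-in for
    the factored-out helpers):

        struct btrfs_fs_devices *seed_devs;

        process_devices(fs_devices);            /* main devices first */
        list_for_each_entry(seed_devs, &fs_devices->seed_list, seed_list)
                process_devices(seed_devs);     /* then each seed */

        /* removal shrinks from a while() walk to a single call */
        list_del_init(&seed_devs->seed_list);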
    
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Nikolay Borisov authored and 0day robot committed Jul 15, 2020
  2. btrfs: Simplify setting/clearing fs_info to btrfs_fs_devices

    It makes no sense to have sysfs-related routines be responsible for
    properly initialising the fs_info pointer of struct btrfs_fs_devices.
    Instead this can be streamlined by making it the responsibility of
    btrfs_init_devices_late to initialize it. That function already
    initializes fs_info of every individual device in btrfs_fs_devices.

    As far as clearing it is concerned, it makes sense to move it to
    close_fs_devices. That function is only called when struct
    btrfs_fs_devices is no longer in use - either for holding seeds or
    main devices for a mounted filesystem.
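
    Roughly, the init side takes this shape (a sketch, not the exact
    patch; the matching clear in close_fs_devices is a single
    fs_devices->fs_info = NULL):

        void btrfs_init_devices_late(struct btrfs_fs_info *fs_info)
        {
                struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
                struct btrfs_device *device;

                fs_devices->fs_info = fs_info;  /* moved from sysfs code */
                list_for_each_entry(device, &fs_devices->devices, dev_list)
                        device->fs_info = fs_info;
        }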
    
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Nikolay Borisov authored and 0day robot committed Jul 15, 2020
  3. btrfs: Make close_fs_devices return void

    The return value of this function conveys absolutely no information.
    All callers already check the state of fs_devices->opened to decide
    how to proceed. So convert the function to return void. While at it
    make btrfs_close_devices also return void.
    
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Nikolay Borisov authored and 0day robot committed Jul 15, 2020
  4. btrfs: Factor out loop logic from btrfs_free_extra_devids

    This prepares the code for switching seed devices to a proper list.
    
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Nikolay Borisov authored and 0day robot committed Jul 15, 2020
  5. btrfs: Factor out reada loop in __reada_start_machine

    This is in preparation for moving fs_devices to proper lists.
    
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Nikolay Borisov authored and 0day robot committed Jul 15, 2020

Commits on Jul 9, 2020

  1. btrfs: fixups of BTRFS_I in iomap-dio code

    Signed-off-by: David Sterba <dsterba@suse.com>
    kdave committed Jul 9, 2020
  2. btrfs: switch to iomap_dio_rw() for dio

    Switch from __blockdev_direct_IO() to iomap_dio_rw().
    Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
    as iomap_begin() for iomap direct I/O functions. This function
    allocates and locks all the blocks required for the I/O.
    btrfs_submit_direct() is used as the submit_io() hook for direct I/O
    ops.
    
    Since we need direct I/O reads to go through iomap_dio_rw(), we change
    file_operations.read_iter() to a btrfs_file_read_iter() which calls
    btrfs_direct_IO() for direct reads and falls back to
    generic_file_buffered_read() for incomplete reads and buffered reads.
    
    We don't need address_space.direct_IO() anymore so set it to noop.
    Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
    capable of direct I/O reads from a hole, so we don't need to return
    -ENOENT.
    
    BTRFS direct I/O is now done under i_rwsem, shared in case of reads and
    exclusive in case of writes. This guards against simultaneous truncates.
    
    Use iomap->iomap_end() to check for failed or incomplete direct I/O:
     - for writes, call __endio_write_update_ordered()
     - for reads, unlock extents
    
    btrfs_dio_data is now hooked in iomap->private and not
    current->journal_info. It carries the reservation variable and the
    amount of data submitted, so we can calculate the amount of data to call
    __endio_write_update_ordered in case of an error.
    
    This patch removes last use of struct buffer_head from btrfs.
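
    The read side ends up shaped roughly like this (a sketch of the flow
    described above, not the verbatim patch):

        static ssize_t btrfs_file_read_iter(struct kiocb *iocb,
                                            struct iov_iter *to)
        {
                ssize_t ret = 0;

                if (iocb->ki_flags & IOCB_DIRECT) {
                        ret = btrfs_direct_IO(iocb, to);
                        if (ret < 0)
                                return ret;
                }
                /* buffered reads, and finishing short direct reads */
                return generic_file_buffered_read(iocb, to, ret);
        }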
    
    Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    goldwynr authored and kdave committed Jul 9, 2020
  3. iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
    
    For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to
    indicate that if page invalidation fails, control should be returned
    to the filesystem so it may fall back to buffered mode.
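
    A hedged sketch of a caller using the flag (the ops structure names
    and the -ENOTBLK "fall back to buffered" convention are assumptions,
    not taken from this patch):

        ret = iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops,
                           &btrfs_dio_ops,
                           IOMAP_DIO_RWF_NO_STALE_PAGECACHE);
        if (ret == -ENOTBLK) {
                /* invalidation failed, redo the write buffered */
        }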
    
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    goldwynr authored and kdave committed Jul 9, 2020
  4. iomap: Convert wait_for_completion to flags

    Convert the wait_for_completion boolean to flags so that more flags
    can be passed to iomap_dio_rw().
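
    Call sites change from a boolean to a flags word, roughly as below
    (IOMAP_DIO_RWF_SYNCIO is an assumed flag name for the old boolean
    behavior):

        /* before: boolean wait_for_completion */
        ret = iomap_dio_rw(iocb, iter, ops, dops, is_sync_kiocb(iocb));

        /* after: a flags word leaves room for more behavior */
        ret = iomap_dio_rw(iocb, iter, ops, dops,
                           is_sync_kiocb(iocb) ? IOMAP_DIO_RWF_SYNCIO : 0);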
    
    Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    goldwynr authored and kdave committed Jul 9, 2020
  5. btrfs: wire up iter_file_splice_write

    btrfs implements the iter_write op and thus can use the more efficient
    iov_iter based splice implementation.  For now falling back to the less
    efficient default is pretty harmless, but I have a pending series that
    removes the default, and thus would cause btrfs to not support splice
    at all.
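
    The wiring itself is a one-line addition to btrfs' file_operations
    (sketch, surrounding entries abbreviated):

        const struct file_operations btrfs_file_operations = {
                ...
                .write_iter     = btrfs_file_write_iter,
                .splice_read    = generic_file_splice_read,
                .splice_write   = iter_file_splice_write,
        };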
    
    Reported-by: Andy Lavr <andy.lavr@gmail.com>
    Tested-by: Andy Lavr <andy.lavr@gmail.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Christoph Hellwig authored and kdave committed Jul 9, 2020
  6. btrfs: qgroup: remove the ASYNC_COMMIT mechanism in favor of qgroup reserve retry-after-EDQUOT
    
    commit a514d63 ("btrfs: qgroup: Commit transaction in advance to
    reduce early EDQUOT") tries to reduce the early EDQUOT problems by
    checking the free qgroup space against a threshold and trying to wake
    up the commit kthread to free some space.

    The problem with that mechanism is that it can only free qgroup
    per-trans metadata space; it can't do anything about data or prealloc
    qgroup space.

    Now that we have the ability to flush qgroup space and implement
    retry-after-EDQUOT behavior, that mechanism is completely superseded,
    so this patch removes it in favor of retry-after-EDQUOT.
    
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Jul 9, 2020
  7. btrfs: qgroup: try to flush qgroup space when we get -EDQUOT

    [PROBLEM]
    There are known problems related to how btrfs handles qgroup reserved
    space. One of the most obvious cases is the test case btrfs/153, which
    does fallocate, then writes into the preallocated range.
    
      btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
          --- tests/btrfs/153.out     2019-10-22 15:18:14.068965341 +0800
          +++ xfstests-dev/results//btrfs/153.out.bad      2020-07-01 20:24:40.730000089 +0800
          @@ -1,2 +1,5 @@
           QA output created by 153
          +pwrite: Disk quota exceeded
          +/mnt/scratch/testfile2: Disk quota exceeded
          +/mnt/scratch/testfile2: Disk quota exceeded
           Silence is golden
          ...
          (Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad'  to see the entire diff)
    
    [CAUSE]
    Since commit c6887cd ("Btrfs: don't do nocow check unless we have to"),
    we always reserve space whether the write is COW or not.

    Such a behavior change is mostly for performance, and reverting it is
    not a good idea anyway.

    For a preallocated extent we have already reserved qgroup data space,
    and since we also reserve data space for qgroup at buffered write time,
    it takes twice the space to write into the preallocated range.

    This leads to the -EDQUOT in the buffered write routine.

    And we can't follow the same solution: unlike the data/meta space
    check, qgroup reserved space is shared between data and meta.
    The EDQUOT can happen at the metadata reservation, so doing a NODATACOW
    check after a qgroup reservation failure is not a solution.
    
    [FIX]
    To solve the problem, we don't return -EDQUOT directly. Instead, every
    time we get -EDQUOT, we try to flush qgroup space by:

    - Flushing all inodes of the root
      NODATACOW writes will free the qgroup space reserved at
      run_delalloc_range(). However we don't have the infrastructure to
      flush only NODATACOW inodes, so here we flush all inodes anyway.

    - Waiting for ordered extents
      This converts the preallocated metadata space into per-trans
      metadata, which can be freed by a later transaction commit.

    - Committing the transaction
      This frees all per-trans metadata space.

    Also we don't want flushes to race with each other, so we introduce a
    per-root mutex: if a qgroup flush is already running, we wait for it
    to finish instead of starting another one.
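
    In rough pseudocode the retry shape is (hypothetical helper and mutex
    names, heavily abbreviated):

        ret = qgroup_reserve(...);
        if (ret == -EDQUOT) {
                mutex_lock(&root->qgroup_flush_mutex);
                btrfs_start_delalloc_snapshot(root);  /* flush all inodes */
                btrfs_wait_ordered_extents(...);      /* wait ordered */
                btrfs_commit_transaction(...);        /* free per-trans */
                mutex_unlock(&root->qgroup_flush_mutex);
                ret = qgroup_reserve(...);            /* single retry */
        }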
    
    Fixes: c6887cd ("Btrfs: don't do nocow check unless we have to")
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Jul 9, 2020
  8. btrfs: qgroup: allow to unreserve range without releasing other ranges

    [PROBLEM]
    Before this patch, when btrfs_qgroup_reserve_data() fails, we free all
    reserved space of the changeset.
    
    For example:
    	ret = btrfs_qgroup_reserve_data(inode, changeset, 0, SZ_1M);
    	ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_1M, SZ_1M);
    	ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_2M, SZ_1M);
    
    If the last btrfs_qgroup_reserve_data() failed, it will release the
    entire [0, 3M) range.
    
    This behavior is kind of OK for now, as when we hit -EDQUOT, we normally
    go error handling and need to release all reserved ranges anyway.
    
    But this also means the following call is not possible:
    
    	ret = btrfs_qgroup_reserve_data();
    	if (ret == -EDQUOT) {
    		/* Do something to free some qgroup space */
    		ret = btrfs_qgroup_reserve_data();
    	}
    
    That is because if the first btrfs_qgroup_reserve_data() fails, it
    frees all reserved qgroup space.
    
    [CAUSE]
    This is because we release all reserved ranges when
    btrfs_qgroup_reserve_data() fails.
    
    [FIX]
    This patch implements a new function, qgroup_unreserve_range(), which
    iterates through the ulist nodes, finds any nodes inside the failed
    range, removes the EXTENT_QGROUP_RESERVED bits from the io_tree, and
    decreases extent_changeset::bytes_changed, so that we can revert to
    the previous state.
    
    This allows later patches to retry btrfs_qgroup_reserve_data() if EDQUOT
    happens.
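
    A sketch of the revert loop (abbreviated; the real function must also
    handle nodes that only partially overlap the failed range):

        struct ulist_node *unode;
        struct ulist_iterator uiter;

        ULIST_ITER_INIT(&uiter);
        while ((unode = ulist_next(&reserved->range_changed, &uiter))) {
                /* unode->val is the range start, unode->aux the length */
                if (/* no overlap with [start, start + len) */)
                        continue;
                clear_extent_bit(io_tree, ..., EXTENT_QGROUP_RESERVED, ...);
                reserved->bytes_changed -= freed;
        }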
    
    Suggested-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Jul 9, 2020
  9. btrfs: fix double put of block group with nocow

    While debugging a patch that I wrote I was hitting use-after-free panics
    when accessing block groups on unmount.  This turned out to be because
    in the nocow case, if we bail out of doing the nocow for whatever reason
    we need to call btrfs_dec_nocow_writers() if we called the inc.  This
    puts our block group, but a few error cases do
    
    if (nocow) {
        btrfs_dec_nocow_writers();
        goto error;
    }
    
    unfortunately, error is
    
    error:
    	if (nocow)
    		btrfs_dec_nocow_writers();
    
    so we get a double put on our block group.  Fix this by dropping the
    error cases calling of btrfs_dec_nocow_writers(), as it's handled at the
    error label now.
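
    With the fix the error cases just jump to the label (sketch;
    some_failure is a placeholder), so the decrement happens exactly once:

        if (some_failure)
                goto error;     /* no btrfs_dec_nocow_writers() here */
        ...
    error:
        if (nocow)
                btrfs_dec_nocow_writers();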
    
    Fixes: 762bf09 ("btrfs: improve error handling in run_delalloc_nocow")
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 9, 2020
  10. btrfs: convert block group refcount to refcount_t

    We have refcount_t now with the associated library to handle refcounts,
    which gives us extra debugging around reference count mistakes that may
    be made.  For example it'll warn on any transition from 0->1 or 0->-1,
    which is handy for noticing cases where we've messed up reference
    counting.  Convert the block group ref counting from an atomic_t to
    refcount_t and use the appropriate helpers.
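
    The conversion follows the standard atomic_t to refcount_t mapping,
    e.g. (sketch; cache is the local block group variable):

        -       atomic_t                refs;
        +       refcount_t              refs;

        -       atomic_set(&cache->refs, 1);
        +       refcount_set(&cache->refs, 1);
        -       atomic_inc(&cache->refs);
        +       refcount_inc(&cache->refs);
        -       if (atomic_dec_and_test(&cache->refs))
        +       if (refcount_dec_and_test(&cache->refs))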
    
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 9, 2020
  11. btrfs: pass checksum type via BTRFS_IOC_FS_INFO ioctl

    With the recent addition of filesystem checksum types other than CRC32c,
    it is no longer hard-coded which checksum type a btrfs filesystem uses.

    Up to now there is no good way to read the filesystem checksum, apart
    from reading the filesystem UUID and then querying sysfs for the
    checksum type.

    Add new csum_type and csum_size fields to the BTRFS_IOC_FS_INFO ioctl
    command, which is usually used to query filesystem features. Also add
    a flags member indicating that the kernel responded with set csum_type
    and csum_size fields.

    For compatibility reasons, only return the csum_type and csum_size if
    the BTRFS_FS_INFO_FLAG_CSUM_INFO flag was passed to the kernel. Also
    clear any unknown flags so we don't pass false positives to user space
    newer than the kernel.
    
    To simplify further additions to the ioctl, also switch the padding to a
    u8 array. Pahole was used to verify the result of this switch:
    
    pahole -C btrfs_ioctl_fs_info_args fs/btrfs/btrfs.ko
    struct btrfs_ioctl_fs_info_args {
            __u64                      max_id;               /*     0     8 */
            __u64                      num_devices;          /*     8     8 */
            __u8                       fsid[16];             /*    16    16 */
            __u32                      nodesize;             /*    32     4 */
            __u32                      sectorsize;           /*    36     4 */
            __u32                      clone_alignment;      /*    40     4 */
            __u32                      flags;                /*    44     4 */
            __u16                      csum_type;            /*    48     2 */
            __u16                      csum_size;            /*    50     2 */
            __u8                       reserved[972];        /*    52   972 */
    
            /* size: 1024, cachelines: 16, members: 10 */
    };
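
    From user space the opt-in handshake looks roughly like this
    (illustrative only, error handling omitted):

        struct btrfs_ioctl_fs_info_args args = { 0 };

        args.flags = BTRFS_FS_INFO_FLAG_CSUM_INFO;
        if (ioctl(fd, BTRFS_IOC_FS_INFO, &args) == 0 &&
            (args.flags & BTRFS_FS_INFO_FLAG_CSUM_INFO))
                printf("csum type %u, csum size %u\n",
                       args.csum_type, args.csum_size);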
    
    Fixes: 3951e7f ("btrfs: add xxhash64 to checksumming algorithms")
    Fixes: 3831bf0 ("btrfs: add sha256 to checksumming algorithm")
    CC: stable@vger.kernel.org # 5.5+
    Reviewed-by: Anand Jain <anand.jain@oracle.com>
    Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Johannes Thumshirn authored and kdave committed Jul 9, 2020
  12. btrfs: add a comment explaining the data flush steps

    The data flushing steps are not obvious to people other than myself and
    Chris.  Write a giant comment explaining the reasoning behind each flush
    step for data as well as why it is in that particular order.
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 9, 2020
  13. btrfs: do async reclaim for data reservations

    Now that we have the data ticketing stuff in place, move normal data
    reservations to use an async reclaim helper to satisfy tickets.  Before
    we could have multiple tasks race in and both allocate chunks, resulting
    in more data chunks than we would necessarily need.  Serializing these
    allocations and making a single thread responsible for flushing will
    only allocate chunks as needed, as well as cut down on transaction
    commits and other flush related activities.
    
    Priority reservations will still work as they have before, simply
    trying to allocate a chunk until they can make their reservation.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 9, 2020
  14. btrfs: flush delayed refs when trying to reserve data space

    We can end up with freed extents in the delayed refs, and thus
    may_commit_transaction() may not think we have enough pinned space to
    commit the transaction, so we'll ENOSPC early.  Handle this by running
    the delayed refs in order to make sure pinned is up to date before we
    try to commit the transaction.
    
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 9, 2020
  15. btrfs: run delayed iputs before committing the transaction for data

    Before we were waiting on iputs after we committed the transaction, but
    this doesn't really make much sense.  We want to reclaim any space we
    may have in order to be more likely to commit the transaction, due to
    pinned space being added by running the delayed iputs.  Fix this by
    making delayed iputs run before committing the transaction.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 9, 2020
  16. btrfs: don't force commit if we are data

    We used to unconditionally commit the transaction at least 2 times and
    then on the 3rd try check against pinned space to make sure committing
    the transaction was worth the effort.  This is overkill, we know nobody
    is going to steal our reservation, and if we can't make our reservation
    with the pinned amount simply bail out.
    
    This also cleans up the passing of bytes_needed to
    may_commit_transaction, as that was the thing we added into place in
    order to accomplish this behavior.  We no longer need it so remove that
    mess.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 9, 2020
  17. btrfs: drop the commit_cycles stuff for data reservations

    This was an old wart left over from how we previously did data
    reservations.  Before we could have people race in and take a
    reservation while we were flushing space, so we needed to make sure we
    looped a few times before giving up.  Now that we're using the ticketing
    infrastructure we don't have to worry about this and can drop the logic
    altogether.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 9, 2020
  18. btrfs: use the same helper for data and metadata reservations

    Now that data reservations follow the same pattern as metadata
    reservations we can simply rename __reserve_metadata_bytes to
    __reserve_bytes and use that helper for data reservations.
    
    Things to keep in mind: btrfs_can_overcommit() returns 0 for data,
    because we can never overcommit.  We also will never pass in FLUSH_ALL
    for data, so we'll simply be added to the priority list and go straight
    into handle_reserve_ticket.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 9, 2020
  19. btrfs: serialize data reservations if we are flushing

    Nikolay reported a problem where generic/371 would fail sometimes with a
    slow drive.  The gist of the test is that we fallocate a file in
    parallel with a pwrite of a different file.  These two files combined
    are smaller than the file system, but sometimes the pwrite would ENOSPC.
    
    A fair bit of investigation uncovered the fact that the fallocate
    workload was racing in and grabbing the free space that the pwrite
    workload was trying to free up so it could make its own reservation.
    After a few loops of this eventually the pwrite workload would error out
    with an ENOSPC.
    
    We've had the same problem with metadata as well, and we serialized all
    metadata allocations to satisfy this problem.  This wasn't usually a
    problem with data because data reservations are more straightforward,
    but obviously could still happen.
    
    Fix this by not allowing reservations to occur if there are any pending
    tickets waiting to be satisfied on the space info.
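
    Conceptually the gate is a pending-tickets check on the space_info
    (sketch; the surrounding flow is abbreviated):

        if (!list_empty(&space_info->tickets) ||
            !list_empty(&space_info->priority_tickets)) {
                /* someone is already waiting: queue a ticket instead
                 * of racing in and taking their reservation */
        }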
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 9, 2020