Omar-Sandoval/…

Commits on Oct 19, 2021

  1. btrfs: fix deadlock when defragging transparent huge pages

    Attempting to defragment a Btrfs file containing a transparent huge page
    immediately deadlocks with the following stack trace:
    
      #0  context_switch (kernel/sched/core.c:4940:2)
      #1  __schedule (kernel/sched/core.c:6287:8)
      #2  schedule (kernel/sched/core.c:6366:3)
      #3  io_schedule (kernel/sched/core.c:8389:2)
      #4  wait_on_page_bit_common (mm/filemap.c:1356:4)
      #5  __lock_page (mm/filemap.c:1648:2)
      #6  lock_page (./include/linux/pagemap.h:625:3)
      #7  pagecache_get_page (mm/filemap.c:1910:4)
      #8  find_or_create_page (./include/linux/pagemap.h:420:9)
      #9  defrag_prepare_one_page (fs/btrfs/ioctl.c:1068:9)
      #10 defrag_one_range (fs/btrfs/ioctl.c:1326:14)
      #11 defrag_one_cluster (fs/btrfs/ioctl.c:1421:9)
      #12 btrfs_defrag_file (fs/btrfs/ioctl.c:1523:9)
      #13 btrfs_ioctl_defrag (fs/btrfs/ioctl.c:3117:9)
      #14 btrfs_ioctl (fs/btrfs/ioctl.c:4872:10)
      #15 vfs_ioctl (fs/ioctl.c:51:10)
      #16 __do_sys_ioctl (fs/ioctl.c:874:11)
      #17 __se_sys_ioctl (fs/ioctl.c:860:1)
      #18 __x64_sys_ioctl (fs/ioctl.c:860:1)
      #19 do_syscall_x64 (arch/x86/entry/common.c:50:14)
      #20 do_syscall_64 (arch/x86/entry/common.c:80:7)
      #21 entry_SYSCALL_64+0x7c/0x15b (arch/x86/entry/entry_64.S:113)
    
    A huge page is represented by a compound page, which consists of a
    struct page for each PAGE_SIZE page within the huge page. The first
    struct page is the "head page", and the remaining are "tail pages".
    
    Defragmentation attempts to lock each page in the range. However,
    lock_page() on a tail page actually locks the corresponding head page.
    So, if defragmentation tries to lock more than one struct page in a
    compound page, it tries to lock the same head page twice and deadlocks
    with itself.
    
    Ideally, we should be able to defragment transparent huge pages.
    However, THP for filesystems is currently read-only, so a lot of code is
    not ready to use huge pages for I/O. For now, let's just return
    ETXTBSY.
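
    The self-deadlock and the bail-out can be modelled in user space. This is
    only a sketch: struct toy_page and every toy_* function are hypothetical
    stand-ins, not the kernel's struct page or the code touched by this
    commit, and the lock is a try-lock so the second attempt fails instead of
    sleeping forever.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; not the kernel definition. */
struct toy_page {
	struct toy_page *head;	/* points to itself for a head or ordinary page */
	bool compound;		/* true for every page of a huge page */
	bool locked;		/* only meaningful on the head page */
};

/* Tail pages resolve to their head page before the lock bit is touched. */
static struct toy_page *toy_compound_head(struct toy_page *page)
{
	return page->head;
}

/*
 * Non-blocking lock attempt. The real lock_page() would sleep here, which
 * is what turns a second attempt by the same task into a self-deadlock.
 */
static bool toy_trylock_page(struct toy_page *page)
{
	struct toy_page *head = toy_compound_head(page);

	if (head->locked)
		return false;
	head->locked = true;
	return true;
}

/* Assumed shape of the workaround: refuse huge pages instead of locking. */
static int toy_defrag_prepare_page(struct toy_page *page)
{
	if (page->compound)
		return -ETXTBSY;
	return toy_trylock_page(page) ? 0 : -EAGAIN;
}
```

    Locking two different tail pages of the same toy huge page fails on the
    second call because both resolve to one head page, while
    toy_defrag_prepare_page() returns -ETXTBSY before any lock is taken.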
    
    This can be reproduced with the following on a kernel with
    CONFIG_READ_ONLY_THP_FOR_FS=y:
    
      $ cat create_thp_file.c
      #include <fcntl.h>
      #include <stdbool.h>
      #include <stdio.h>
      #include <stdint.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <sys/mman.h>
    
      static const char zeroes[1024 * 1024];
      static const size_t FILE_SIZE = 2 * 1024 * 1024;
    
      int main(int argc, char **argv)
      {
              if (argc != 2) {
                      fprintf(stderr, "usage: %s PATH\n", argv[0]);
                      return EXIT_FAILURE;
              }
              int fd = creat(argv[1], 0777);
              if (fd == -1) {
                      perror("creat");
                      return EXIT_FAILURE;
              }
              size_t written = 0;
              while (written < FILE_SIZE) {
                      ssize_t ret = write(fd, zeroes,
                                          sizeof(zeroes) < FILE_SIZE - written ?
                                          sizeof(zeroes) : FILE_SIZE - written);
                      if (ret < 0) {
                              perror("write");
                              return EXIT_FAILURE;
                      }
                      written += ret;
              }
              close(fd);
              fd = open(argv[1], O_RDONLY);
              if (fd == -1) {
                      perror("open");
                      return EXIT_FAILURE;
              }
    
              /*
               * Reserve some address space so that we can align the file mapping to
               * the huge page size.
               */
              void *placeholder_map = mmap(NULL, FILE_SIZE * 2, PROT_NONE,
                                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (placeholder_map == MAP_FAILED) {
                      perror("mmap (placeholder)");
                      return EXIT_FAILURE;
              }
    
              void *aligned_address =
                      (void *)(((uintptr_t)placeholder_map + FILE_SIZE - 1) & ~(FILE_SIZE - 1));
    
              void *map = mmap(aligned_address, FILE_SIZE, PROT_READ | PROT_EXEC,
                               MAP_SHARED | MAP_FIXED, fd, 0);
              if (map == MAP_FAILED) {
                      perror("mmap");
                      return EXIT_FAILURE;
              }
              if (madvise(map, FILE_SIZE, MADV_HUGEPAGE) < 0) {
                      perror("madvise");
                      return EXIT_FAILURE;
              }
    
              char *line = NULL;
              size_t line_capacity = 0;
              FILE *smaps_file = fopen("/proc/self/smaps", "r");
              if (!smaps_file) {
                      perror("fopen");
                      return EXIT_FAILURE;
              }
              for (;;) {
                      for (size_t off = 0; off < FILE_SIZE; off += 4096)
                              ((volatile char *)map)[off];
    
                      ssize_t ret;
                      bool this_mapping = false;
                      while ((ret = getline(&line, &line_capacity, smaps_file)) > 0) {
                              unsigned long start, end, huge;
                              if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
                                      this_mapping = (start <= (uintptr_t)map &&
                                                      (uintptr_t)map < end);
                              } else if (this_mapping &&
                                         sscanf(line, "FilePmdMapped: %lu", &huge) == 1 &&
                                         huge > 0) {
                                      return EXIT_SUCCESS;
                              }
                      }
    
                      sleep(6);
                      rewind(smaps_file);
                      fflush(smaps_file);
              }
      }
      $ ./create_thp_file huge
      $ btrfs fi defrag -czstd ./huge
    
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    osandov authored and intel-lab-lkp committed Oct 19, 2021

Commits on Oct 14, 2021

  1. Merge branch 'for-next-next-v5.15-20211014' into for-next-20211014

    # Conflicts:
    #	fs/btrfs/tree-log.c
    kdave committed Oct 14, 2021
  2. btrfs: zoned: use greedy gc for auto reclaim

    Currently auto reclaim of unusable zones reclaims the block-groups in the
    order they have been added to the reclaim list.
    
    Change this to a greedy algorithm by sorting the list so we have the
    block-groups with the least amount of valid bytes reclaimed first.
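
    A minimal user-space sketch of the greedy ordering, assuming a trimmed
    struct with only a valid-bytes counter. struct toy_block_group and both
    toy_* functions are hypothetical; the sketch sorts an array with qsort()
    for brevity and makes no claim about how the kernel stores the list.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down block group: only what the sketch needs. */
struct toy_block_group {
	uint64_t start;	/* logical start address, used here as an identifier */
	uint64_t used;	/* bytes of still-valid data */
};

/* Order block groups by ascending valid bytes. */
static int toy_reclaim_cmp(const void *a, const void *b)
{
	const struct toy_block_group *bg1 = a;
	const struct toy_block_group *bg2 = b;

	if (bg1->used < bg2->used)
		return -1;
	return bg1->used > bg2->used;
}

/* After this call, bgs[0] is the cheapest block group to reclaim. */
static void toy_sort_reclaim_list(struct toy_block_group *bgs, size_t n)
{
	qsort(bgs, n, sizeof(*bgs), toy_reclaim_cmp);
}
```

    Reclaiming the emptiest block groups first means the least data has to be
    relocated per zone freed.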
    
    Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    morbidrsa authored and kdave committed Oct 14, 2021
  3. btrfs: clear BTRFS_DEV_STATE_MISSING bit in btrfs_close_one_device

    bug: kdave/btrfs-progs#389
    
    The previous patch did not fix the bug correctly:
    https://lore.kernel.org/linux-btrfs/1632330390-29793-1-git-send-email-zhanglikernel@gmail.com
    so this is a new one.
    
    The cause of the error appears to be that fs_devices->missing_devices is
    decremented without clearing the BTRFS_DEV_STATE_MISSING bit in
    device->dev_state. Every umount calls close_ctree(), which eventually
    calls btrfs_close_one_device() to close the device. That function only
    decrements fs_devices->missing_devices and does not clear the device's
    BTRFS_DEV_STATE_MISSING bit, so the counter decreases on every umount.
    Worse, this causes an integer wraparound: once
    fs_devices->missing_devices reaches 0, the next decrement wraps it.
    
    I added a debug print to read_one_dev() (not part of this patch) to show
    the value of fs_devices->missing_devices.
    
    [root@zllke test]# truncate -s 10g test1
    [root@zllke test]# truncate -s 10g test2
    [root@zllke test]# losetup /dev/loop1 test1
    [root@zllke test]# losetup /dev/loop2 test2
    [root@zllke test]# mkfs.btrfs -draid1 -mraid1 /dev/loop1 /dev/loop2 -f
    [root@zllke test]# losetup -d /dev/loop2
    [root@zllke test]# mount -o degraded /dev/loop1 /mnt/1
    [root@zllke test]# umount /mnt/1
    [root@zllke test]# mount -o degraded /dev/loop1 /mnt/1
    [root@zllke test]# umount /mnt/1
    [root@zllke test]# mount -o degraded /dev/loop1 /mnt/1
    [root@zllke test]# umount /mnt/1
    [root@zllke test]# dmesg
    [  168.728888] loop1: detected capacity change from 0 to 20971520
    [  168.751227] BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 1 transid 21 /dev/loop1 scanned by systemd-udevd (2311)
    [  169.179102] loop2: detected capacity change from 0 to 20971520
    [  169.198307] BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 2 transid 17 /dev/loop2 scanned by systemd-udevd (2313)
    [  190.696579] BTRFS info (device loop1): flagging fs with big metadata feature
    [  190.699445] BTRFS info (device loop1): allowing degraded mounts
    [  190.701819] BTRFS info (device loop1): using free space tree
    [  190.704126] BTRFS info (device loop1): has skinny extents
    [  190.708890] BTRFS info (device loop1):  before clear_missing.00000000f706684d /dev/loop1 0
    [  190.711958] BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
    [  190.715370] BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
    [  209.075744] BTRFS info (device loop1): flagging fs with big metadata feature
    [  209.079106] BTRFS info (device loop1): allowing degraded mounts
    [  209.082042] BTRFS info (device loop1): using free space tree
    [  209.084791] BTRFS info (device loop1): has skinny extents
    [  209.089172] BTRFS info (device loop1):  before clear_missing.00000000f706684d /dev/loop1 0
    [  209.093074] BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
    [  209.096848] BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 0
    [  218.778031] BTRFS info (device loop1): flagging fs with big metadata feature
    [  218.781504] BTRFS info (device loop1): allowing degraded mounts
    [  218.784319] BTRFS info (device loop1): using free space tree
    [  218.786902] BTRFS info (device loop1): has skinny extents
    [  218.791190] BTRFS info (device loop1):  before clear_missing.00000000f706684d /dev/loop1 18446744073709551615
    [  218.795532] BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
    [  218.799320] BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 18446744073709551615
    
    If fs_devices->missing_devices is 0, the next decrement wraps it to 18446744073709551615.
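
    The wraparound can be reproduced with a small user-space model. This is a
    sketch under assumptions: the toy_* names are hypothetical, a plain bool
    stands in for the BTRFS_DEV_STATE_MISSING bit, and toy_mount_scan() only
    assumes that a mount counts a device as missing when the flag is not
    already set.

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_fs_devices {
	uint64_t missing_devices;	/* unsigned, so 0 - 1 wraps around */
};

struct toy_device {
	bool missing;			/* stands in for the state bit */
};

/* Assumed mount behaviour: count a missing device only once per flag. */
static void toy_mount_scan(struct toy_fs_devices *fs, struct toy_device *dev)
{
	if (!dev->missing) {
		dev->missing = true;
		fs->missing_devices++;
	}
}

/* Behaviour described above: the counter drops, the flag stays set. */
static void toy_close_device_buggy(struct toy_fs_devices *fs,
				   struct toy_device *dev)
{
	if (dev->missing)
		fs->missing_devices--;
}

/* Sketch of the fix: clear the flag together with the counter update. */
static void toy_close_device_fixed(struct toy_fs_devices *fs,
				   struct toy_device *dev)
{
	if (dev->missing) {
		dev->missing = false;
		fs->missing_devices--;
	}
}
```

    With the buggy close, two mount/umount cycles leave the counter at
    UINT64_MAX; with the fixed close it returns to 0 after every cycle.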
    
    After applying this patch, fs_devices->missing_devices appears to stay correct:
    
    [root@zllke test]# truncate -s 10g test1
    [root@zllke test]# truncate -s 10g test2
    [root@zllke test]# losetup /dev/loop1 test1
    [root@zllke test]# losetup /dev/loop2 test2
    [root@zllke test]# mkfs.btrfs -draid1 -mraid1 /dev/loop1 /dev/loop2 -f
    [root@zllke test]# losetup -d /dev/loop2
    [root@zllke test]# mount -o degraded /dev/loop1 /mnt/1
    [root@zllke test]# umount /mnt/1
    [root@zllke test]# mount -o degraded /dev/loop1 /mnt/1
    [root@zllke test]# umount /mnt/1
    [root@zllke test]# mount -o degraded /dev/loop1 /mnt/1
    [root@zllke test]# umount /mnt/1
    [root@zllke test]# dmesg
    [   80.647739] loop1: detected capacity change from 0 to 20971520
    [   81.268113] loop2: detected capacity change from 0 to 20971520
    [   90.694332] BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 1 transid 5 /dev/loop1 scanned by mkfs.btrfs (1863)
    [   90.705180] BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 2 transid 5 /dev/loop2 scanned by mkfs.btrfs (1863)
    [  104.935735] BTRFS info (device loop1): flagging fs with big metadata feature
    [  104.939020] BTRFS info (device loop1): allowing degraded mounts
    [  104.941637] BTRFS info (device loop1): disk space caching is enabled
    [  104.944442] BTRFS info (device loop1): has skinny extents
    [  104.948848] BTRFS info (device loop1):  before clear_missing.00000000975bd577 /dev/loop1 0
    [  104.952365] BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
    [  104.956220] BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
    [  104.960602] BTRFS info (device loop1): checking UUID tree
    [  157.888711] BTRFS info (device loop1): flagging fs with big metadata feature
    [  157.892915] BTRFS info (device loop1): allowing degraded mounts
    [  157.896333] BTRFS info (device loop1): disk space caching is enabled
    [  157.899244] BTRFS info (device loop1): has skinny extents
    [  157.905068] BTRFS info (device loop1):  before clear_missing.00000000975bd577 /dev/loop1 0
    [  157.908981] BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
    [  157.913540] BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
    [  161.057615] BTRFS info (device loop1): flagging fs with big metadata feature
    [  161.060874] BTRFS info (device loop1): allowing degraded mounts
    [  161.063422] BTRFS info (device loop1): disk space caching is enabled
    [  161.066179] BTRFS info (device loop1): has skinny extents
    [  161.069997] BTRFS info (device loop1):  before clear_missing.00000000975bd577 /dev/loop1 0
    [  161.073328] BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
    [  161.077084] BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
    
    Signed-off-by: Li Zhang <zhanglikernel@gmail.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    zhanglikernel authored and kdave committed Oct 14, 2021
  4. btrfs: index free space entries on size

    Currently we index free space on offset only, because usually we have a
    hint from the allocator that we want to honor for locality reasons.
    However if we fail to use this hint we have to go back to a brute force
    search through the free space entries to find a large enough extent.
    
    With sufficiently fragmented free space this becomes quite expensive, as
    we have to linearly search all of the free space entries to find if we
    have a part that's long enough.
    
    To fix this add a cached rb tree to index based on free space entry
    bytes.  This will allow us to quickly look up the largest chunk in the
    free space tree for this block group, and stop searching once we've
    found an entry that is too small to satisfy our allocation.  We simply
    choose to use this tree if we're searching from the beginning of the
    block group, as we know we do not care about locality at that point.
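
    The early-exit search can be sketched in user space. This is not the
    patch's cached rb tree: the sketch assumes an array sorted by descending
    size as the "index", and struct toy_free_space and the toy_* functions
    are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>

struct toy_free_space {
	uint64_t offset;	/* start of the free extent */
	uint64_t bytes;		/* length of the free extent */
};

/* Descending by size, so the largest entry comes first. */
static int toy_bytes_cmp(const void *a, const void *b)
{
	const struct toy_free_space *e1 = a;
	const struct toy_free_space *e2 = b;

	if (e1->bytes > e2->bytes)
		return -1;
	return e1->bytes < e2->bytes;
}

static void toy_build_size_index(struct toy_free_space *entries, size_t n)
{
	qsort(entries, n, sizeof(*entries), toy_bytes_cmp);
}

/*
 * Walk the size index from the largest entry down. Returns the position of
 * the first entry that can hold `bytes` at an `align`-aligned offset, or -1.
 * `align` must be non-zero.
 */
static long toy_find_by_size(const struct toy_free_space *index, size_t n,
			     uint64_t bytes, uint64_t align)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t start, pad;

		if (index[i].bytes < bytes)
			break;	/* every later entry is smaller still */
		start = (index[i].offset + align - 1) / align * align;
		pad = start - index[i].offset;
		if (pad <= index[i].bytes && index[i].bytes - pad >= bytes)
			return (long)i;
	}
	return -1;
}
```

    The break is the point of the size index: a request larger than the
    biggest entry is rejected after one comparison instead of a scan of every
    free space entry.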
    
    I wrote an allocator test that creates a 10TiB ram backed null block
    device and then fallocates random files until the file system is full.
    I then go through and delete all of the odd files.  Then I spawn 8
    threads that fallocate 64MiB files (1/2 our extent size cap) until the
    file system is full again.  I use bcc's funclatency to measure the
    latency of find_free_extent.  The baseline results are
    
         nsecs               : count     distribution
             0 -> 1          : 0        |                                        |
             2 -> 3          : 0        |                                        |
             4 -> 7          : 0        |                                        |
             8 -> 15         : 0        |                                        |
            16 -> 31         : 0        |                                        |
            32 -> 63         : 0        |                                        |
            64 -> 127        : 0        |                                        |
           128 -> 255        : 0        |                                        |
           256 -> 511        : 10356    |****                                    |
           512 -> 1023       : 58242    |*************************               |
          1024 -> 2047       : 74418    |********************************        |
          2048 -> 4095       : 90393    |****************************************|
          4096 -> 8191       : 79119    |***********************************     |
          8192 -> 16383      : 35614    |***************                         |
         16384 -> 32767      : 13418    |*****                                   |
         32768 -> 65535      : 12811    |*****                                   |
         65536 -> 131071     : 17090    |*******                                 |
        131072 -> 262143     : 26465    |***********                             |
        262144 -> 524287     : 40179    |*****************                       |
        524288 -> 1048575    : 55469    |************************                |
       1048576 -> 2097151    : 48807    |*********************                   |
       2097152 -> 4194303    : 26744    |***********                             |
       4194304 -> 8388607    : 35351    |***************                         |
       8388608 -> 16777215   : 13918    |******                                  |
      16777216 -> 33554431   : 21       |                                        |
    
    avg = 908079 nsecs, total: 580889071441 nsecs, count: 639690
    
    And the patch results are
    
         nsecs               : count     distribution
             0 -> 1          : 0        |                                        |
             2 -> 3          : 0        |                                        |
             4 -> 7          : 0        |                                        |
             8 -> 15         : 0        |                                        |
            16 -> 31         : 0        |                                        |
            32 -> 63         : 0        |                                        |
            64 -> 127        : 0        |                                        |
           128 -> 255        : 0        |                                        |
           256 -> 511        : 6883     |**                                      |
           512 -> 1023       : 54346    |*********************                   |
          1024 -> 2047       : 79170    |********************************        |
          2048 -> 4095       : 98890    |****************************************|
          4096 -> 8191       : 81911    |*********************************       |
          8192 -> 16383      : 27075    |**********                              |
         16384 -> 32767      : 14668    |*****                                   |
         32768 -> 65535      : 13251    |*****                                   |
         65536 -> 131071     : 15340    |******                                  |
        131072 -> 262143     : 26715    |**********                              |
        262144 -> 524287     : 43274    |*****************                       |
        524288 -> 1048575    : 53870    |*********************                   |
       1048576 -> 2097151    : 55368    |**********************                  |
       2097152 -> 4194303    : 41036    |****************                        |
       4194304 -> 8388607    : 24927    |**********                              |
       8388608 -> 16777215   : 33       |                                        |
      16777216 -> 33554431   : 9        |                                        |
    
    avg = 623599 nsecs, total: 397259314759 nsecs, count: 637042
    
    There's a little variation in the number of calls because of timing
    of the threads with metadata requirements, but the avg, total, and
    count are relatively consistent between runs (usually within 2-5% of
    each other).  As you can see here we have around a 30% decrease in
    average latency with a 30% decrease in overall time spent in
    find_free_extent.
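
    The percentages can be checked against the two funclatency summaries
    above. The helper below is a hypothetical convenience, not part of the
    patch or the test.

```c
/* Relative decrease from `before` to `after`, in percent. */
static double percent_decrease(double before, double after)
{
	return (before - after) / before * 100.0;
}
```

    percent_decrease(908079, 623599) comes to roughly 31% for the average
    latency, and percent_decrease(580889071441.0, 397259314759.0) to roughly
    32% for the total time.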
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Oct 14, 2021
  5. btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls

    For device removal and replace we call btrfs_find_device_by_devspec,
    which if we give it a device path and nothing else will call
    btrfs_get_dev_args_from_path, which opens the block device and reads the
    super block and then looks up our device based on that.
    
    However at this point we're holding the sb write "lock", so reading the
    block device pulls in the dependency of ->open_mutex, which produces the
    following lockdep splat
    
    ======================================================
    WARNING: possible circular locking dependency detected
    5.14.0-rc2+ #405 Not tainted
    ------------------------------------------------------
    losetup/11576 is trying to acquire lock:
    ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
    
    but task is already holding lock:
    ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
    
    which lock already depends on the new lock.
    
    the existing dependency chain (in reverse order) is:
    
    -> #4 (&lo->lo_mutex){+.+.}-{3:3}:
           __mutex_lock+0x7d/0x750
           lo_open+0x28/0x60 [loop]
           blkdev_get_whole+0x25/0xf0
           blkdev_get_by_dev.part.0+0x168/0x3c0
           blkdev_open+0xd2/0xe0
           do_dentry_open+0x161/0x390
           path_openat+0x3cc/0xa20
           do_filp_open+0x96/0x120
           do_sys_openat2+0x7b/0x130
           __x64_sys_openat+0x46/0x70
           do_syscall_64+0x38/0x90
           entry_SYSCALL_64_after_hwframe+0x44/0xae
    
    -> #3 (&disk->open_mutex){+.+.}-{3:3}:
           __mutex_lock+0x7d/0x750
           blkdev_get_by_dev.part.0+0x56/0x3c0
           blkdev_get_by_path+0x98/0xa0
           btrfs_get_bdev_and_sb+0x1b/0xb0
           btrfs_find_device_by_devspec+0x12b/0x1c0
           btrfs_rm_device+0x127/0x610
           btrfs_ioctl+0x2a31/0x2e70
           __x64_sys_ioctl+0x80/0xb0
           do_syscall_64+0x38/0x90
           entry_SYSCALL_64_after_hwframe+0x44/0xae
    
    -> #2 (sb_writers#12){.+.+}-{0:0}:
           lo_write_bvec+0xc2/0x240 [loop]
           loop_process_work+0x238/0xd00 [loop]
           process_one_work+0x26b/0x560
           worker_thread+0x55/0x3c0
           kthread+0x140/0x160
           ret_from_fork+0x1f/0x30
    
    -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
           process_one_work+0x245/0x560
           worker_thread+0x55/0x3c0
           kthread+0x140/0x160
           ret_from_fork+0x1f/0x30
    
    -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
           __lock_acquire+0x10ea/0x1d90
           lock_acquire+0xb5/0x2b0
           flush_workqueue+0x91/0x5e0
           drain_workqueue+0xa0/0x110
           destroy_workqueue+0x36/0x250
           __loop_clr_fd+0x9a/0x660 [loop]
           block_ioctl+0x3f/0x50
           __x64_sys_ioctl+0x80/0xb0
           do_syscall_64+0x38/0x90
           entry_SYSCALL_64_after_hwframe+0x44/0xae
    
    other info that might help us debug this:
    
    Chain exists of:
      (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
    
     Possible unsafe locking scenario:
    
           CPU0                    CPU1
           ----                    ----
      lock(&lo->lo_mutex);
                                   lock(&disk->open_mutex);
                                   lock(&lo->lo_mutex);
      lock((wq_completion)loop0);
    
     *** DEADLOCK ***
    
    1 lock held by losetup/11576:
     #0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
    
    stack backtrace:
    CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
    Call Trace:
     dump_stack_lvl+0x57/0x72
     check_noncircular+0xcf/0xf0
     ? stack_trace_save+0x3b/0x50
     __lock_acquire+0x10ea/0x1d90
     lock_acquire+0xb5/0x2b0
     ? flush_workqueue+0x67/0x5e0
     ? lockdep_init_map_type+0x47/0x220
     flush_workqueue+0x91/0x5e0
     ? flush_workqueue+0x67/0x5e0
     ? verify_cpu+0xf0/0x100
     drain_workqueue+0xa0/0x110
     destroy_workqueue+0x36/0x250
     __loop_clr_fd+0x9a/0x660 [loop]
     ? blkdev_ioctl+0x8d/0x2a0
     block_ioctl+0x3f/0x50
     __x64_sys_ioctl+0x80/0xb0
     do_syscall_64+0x38/0x90
     entry_SYSCALL_64_after_hwframe+0x44/0xae
    RIP: 0033:0x7f31b02404cb
    
    Instead what we want to do is populate our device lookup args before we
    grab any locks, and then pass these args into btrfs_rm_device().  From
    there we can find the device and do the appropriate removal.
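
    The ordering can be sketched as two phases. Everything here is a
    user-space assumption: struct toy_fs, struct toy_lookup_args and the
    toy_* functions are hypothetical, a bool stands in for the sb write
    "lock", and the super block read is faked by parsing the trailing digit
    of the path.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical lookup key, filled in before any lock is taken. */
struct toy_lookup_args {
	uint64_t devid;
};

struct toy_fs {
	bool write_locked;	/* stands in for the sb write "lock" */
	uint64_t devids[4];
	size_t nr_devices;
};

/*
 * Phase 1: resolve a path to lookup args. The real step opens the block
 * device, so this sketch refuses to run while the lock is held.
 */
static int toy_get_dev_args_from_path(const struct toy_fs *fs,
				      const char *path,
				      struct toy_lookup_args *args)
{
	size_t len = strlen(path);

	if (fs->write_locked)
		return -EBUSY;	/* would recreate the lock dependency */
	if (len == 0 || path[len - 1] < '0' || path[len - 1] > '9')
		return -EINVAL;
	args->devid = (uint64_t)(path[len - 1] - '0');
	return 0;
}

/* Phase 2: pure in-memory lookup and removal, done under the lock. */
static int toy_rm_device(struct toy_fs *fs, const struct toy_lookup_args *args)
{
	int ret = -ENOENT;

	fs->write_locked = true;
	for (size_t i = 0; i < fs->nr_devices; i++) {
		if (fs->devids[i] == args->devid) {
			fs->devids[i] = fs->devids[--fs->nr_devices];
			ret = 0;
			break;
		}
	}
	fs->write_locked = false;
	return ret;
}
```

    Because phase 2 never opens a block device, it cannot pull ->open_mutex
    in underneath the lock it holds.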
    
    Suggested-by: Anand Jain <anand.jain@oracle.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Oct 14, 2021
  6. btrfs: add a btrfs_get_dev_args_from_path helper

    We are going to want to populate our device lookup args outside of any
    locks and then do the actual device lookup later, so add a helper to do
    this work and make btrfs_find_device_by_devspec() use this helper for
    now.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Anand Jain <anand.jain@oracle.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Oct 14, 2021
  7. btrfs: handle device lookup with btrfs_dev_lookup_args

    We have a lot of device lookup functions that all do something slightly
    different.  Clean this up by adding a struct to hold the different
    lookup criteria, and then pass this around to btrfs_find_device() so it
    can do the proper matching based on the lookup criteria.
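
    One way to picture a criteria struct with a single matcher is the sketch
    below. The field names, the "0 means any devid" and "NULL means any UUID"
    conventions, and all toy_* identifiers are assumptions made for
    illustration, not the layout introduced by this commit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_UUID_SIZE 16

struct toy_device {
	uint64_t devid;
	uint8_t uuid[TOY_UUID_SIZE];
	bool missing;
};

/* Hypothetical criteria struct; unset fields are ignored by the matcher. */
struct toy_dev_lookup_args {
	uint64_t devid;		/* 0 means "any devid" in this sketch */
	const uint8_t *uuid;	/* NULL means "any UUID" */
	bool missing;		/* true: only match devices flagged missing */
};

static bool toy_dev_matches(const struct toy_dev_lookup_args *args,
			    const struct toy_device *dev)
{
	if (args->missing && !dev->missing)
		return false;
	if (args->devid && args->devid != dev->devid)
		return false;
	if (args->uuid && memcmp(args->uuid, dev->uuid, TOY_UUID_SIZE) != 0)
		return false;
	return true;
}

/* Single lookup entry point: every caller only varies the criteria. */
static const struct toy_device *
toy_find_device(const struct toy_device *devs, size_t n,
		const struct toy_dev_lookup_args *args)
{
	for (size_t i = 0; i < n; i++) {
		if (toy_dev_matches(args, &devs[i]))
			return &devs[i];
	}
	return NULL;
}
```

    A caller that wants "the missing device with devid 2" and one that wants
    "whatever has this UUID" then share the same function and differ only in
    which fields they fill in.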
    
    Reviewed-by: Anand Jain <anand.jain@oracle.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Oct 14, 2021
  8. btrfs: do not call close_fs_devices in btrfs_rm_device

    There's a subtle case where if we're removing the seed device from a
    file system we need to free its private copy of the fs_devices.  However
    we do not need to call close_fs_devices(), because at this point there
    are no devices left to close as we've closed the last one.  The only
    thing that close_fs_devices() does is decrement ->opened, which should
    be 1.  We want to avoid calling close_fs_devices() here because it has a
    lockdep_assert_held(&uuid_mutex), and we are going to stop holding the
    uuid_mutex in this path.
    
    So simply decrement the  ->opened counter like we should, and then clean
    up like normal.  Also add a comment explaining what we're doing here as
    I initially removed this code erroneously.
    
    Reviewed-by: Anand Jain <anand.jain@oracle.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Oct 14, 2021
  9. btrfs: add comments for device counts in struct btrfs_fs_devices

    A bug was checking the wrong device count before deleting the struct
    btrfs_fs_devices in btrfs_rm_device(). To avoid future confusion and for
    easy reference, add a comment about the various device counts that we
    have in struct btrfs_fs_devices.
    
    Signed-off-by: Anand Jain <anand.jain@oracle.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    asj authored and kdave committed Oct 14, 2021
  10. btrfs: use num_device to check for the last surviving seed device

    For both sprout and seed fsids,
     btrfs_fs_devices::num_devices provides device count including missing
     btrfs_fs_devices::open_devices provides device count excluding missing
    
    We create a dummy struct btrfs_device for the missing device, so
    num_devices != open_devices when there is a missing device.
    
    In btrfs_rm_device() we wrongly check for %cur_devices->open_devices
    before freeing the seed fs_devices. Instead we should check for
    %cur_devices->num_devices.
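
    The difference between the two counters can be sketched as follows. The
    struct is a hypothetical reduction to the two fields named above, and
    both predicates are illustrative rather than the kernel's code.

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_fs_devices {
	uint64_t num_devices;	/* all devices, including missing ones */
	uint64_t open_devices;	/* opened devices only, excluding missing */
};

/* The check described as wrong: blind to a missing device still tracked. */
static bool toy_seed_is_empty_wrong(const struct toy_fs_devices *devs)
{
	return devs->open_devices == 0;
}

/* The proposed check: the seed is empty only when nothing is tracked. */
static bool toy_seed_is_empty(const struct toy_fs_devices *devs)
{
	return devs->num_devices == 0;
}
```

    With one missing device left (num_devices == 1, open_devices == 0) the
    first predicate reports the seed fs_devices as empty while the second
    does not.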
    
    Signed-off-by: Anand Jain <anand.jain@oracle.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    asj authored and kdave committed Oct 14, 2021
  11. btrfs: update device path inode time instead of bd_inode

    Christoph pointed out that I'm updating bdev->bd_inode for the device
    time when we remove block devices from a btrfs file system, however this
    isn't actually exposed to anything.  The inode we want to update is the
    one that's associated with the path to the device, usually on devtmpfs,
    so that blkid notices the difference.
    
    We still don't want to do the blkdev_open, so use kern_path() to get the
    path to the given device and do the update time on that inode.
    
    Fixes: 8f96a5b ("btrfs: update the bdev time directly when closing")
    Reported-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Oct 14, 2021
  12. btrfs: remove btrfs_bio::logical member

    The member btrfs_bio::logical is only initialized by two call sites:
    
    - btrfs_repair_one_sector()
      No corresponding site to utilize it.
    
    - btrfs_submit_direct()
      The corresponding site to utilize it is btrfs_check_read_dio_bio().
    
    However for btrfs_check_read_dio_bio(), we can grab the file_offset from
    btrfs_dio_private::file_offset directly.
    
    Thus it turns out we don't really need that btrfs_bio::logical member at
    all.
    
    For btrfs_bio, the logical bytenr can be fetched from its
    bio->bi_iter.bi_sector directly.
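
    As a sketch of that conversion, assuming 512-byte sectors: the helper
    name and TOY_SECTOR_SHIFT are hypothetical, chosen to mirror the
    kernel's SECTOR_SHIFT of 9.

```c
#include <stdint.h>

#define TOY_SECTOR_SHIFT 9	/* 512-byte sectors */

/* Convert a sector number such as bi_iter.bi_sector to a byte address. */
static uint64_t toy_sector_to_bytenr(uint64_t bi_sector)
{
	return bi_sector << TOY_SECTOR_SHIFT;
}
```

    For example, sector 8 corresponds to byte 4096, so no separate copy of
    the logical bytenr has to be stored.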
    
    So let's just remove the member to save 8 bytes for structure btrfs_bio.
    
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Oct 14, 2021
  13. btrfs: rename btrfs_dio_private::logical_offset to file_offset

    The naming of "logical_offset" can be confused with logical bytenr of
    the dio range.
    
    In fact it's file offset, and the naming "file_offset" is already widely
    used in all other sites.
    
    Just do the rename to avoid confusion.
    
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Oct 14, 2021
  14. btrfs: use bvec_kmap_local in btrfs_csum_one_bio

    Using local kmaps slightly reduces the chances of stray writes, and
    the bvec interface cleans up the code a little bit.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Christoph Hellwig authored and kdave committed Oct 14, 2021
  15. btrfs: reduce btrfs_update_block_group alloc argument to bool

    btrfs_update_block_group() accounts for the number of bytes allocated or
    freed. Argument @alloc specifies whether the call is for alloc or free.
    Convert the argument @alloc type from int to bool.
    
    Reviewed-by: Su Yue <l@damenly.su>
    Signed-off-by: Anand Jain <anand.jain@oracle.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    asj authored and kdave committed Oct 14, 2021
  16. btrfs: make btrfs_ref::real_root optional

    Now that real_root is only used in the ref-verify core, gate it behind
    the CONFIG_BTRFS_FS_REF_VERIFY ifdef. This shrinks the size of pending
    delayed refs by 8 bytes per ref, of which we can have many at any one
    time depending on the intensity of the workload. Also change the
    comment about the member, as it no longer deals with qgroups.
    
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    lorddoskias authored and kdave committed Oct 14, 2021
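
    The size saving from gating a member behind a config ifdef can be
    sketched as follows. The struct layouts and the MOCK_REF_VERIFY macro
    are hypothetical simplifications (they do not mirror the real
    struct btrfs_ref); only the 8-byte real_root member is taken from the
    message above.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout with the member always present (simplified, hypothetical). */
struct mock_ref_always {
	uint64_t bytenr;
	uint64_t len;
	uint64_t parent;
	uint64_t real_root;
};

/* Layout with the member compiled in only for ref-verify builds. */
struct mock_ref {
	uint64_t bytenr;
	uint64_t len;
	uint64_t parent;
#ifdef MOCK_REF_VERIFY /* stands in for CONFIG_BTRFS_FS_REF_VERIFY */
	uint64_t real_root;
#endif
};

/* Bytes saved per ref when MOCK_REF_VERIFY is not defined. */
static size_t mock_ref_bytes_saved(void)
{
	return sizeof(struct mock_ref_always) - sizeof(struct mock_ref);
}
```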
  17. btrfs: pull up qgroup checks from delayed-ref core to init time

    Instead of checking whether qgroup processing for a delayed ref has to
    happen in the core of delayed ref, simply pull the check up to init
    time of the respective delayed ref structures. This eliminates the
    final use of real_root in the delayed-ref core, paving the way to
    making this member optional.
    
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    lorddoskias authored and kdave committed Oct 14, 2021
  18. btrfs: add additional parameters to btrfs_init_tree_ref/btrfs_init_da…

    …ta_ref
    
    In order to make 'real_root' used only in ref-verify, it's required to
    have the necessary context to perform the same checks that this member
    is used for. So add 'mod_root', which will contain the root on behalf
    of which a delayed ref was created, and a 'skip_qgroup' parameter,
    which will contain a callsite-specific override of skip_qgroup.
    
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    lorddoskias authored and kdave committed Oct 14, 2021
  19. btrfs: rely on owning_root field in btrfs_add_delayed_tree_ref to det…

    …ect CHUNK_ROOT
    
    The real_root field is going to be used only by the ref-verify tool, so
    limit its use outside of it. Blocks belonging to the chunk root will
    always have it as an owner, so the check is equivalent.
    
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    lorddoskias authored and kdave committed Oct 14, 2021
  20. btrfs: rename root fields in delayed refs structs

    Both data and metadata delayed ref structures have fields named
    root/ref_root respectively. Those are somewhat cryptic and don't really
    convey the real meaning. In fact those roots are really the original
    owners of the respective block (i.e. in case of a snapshot, a data
    delayed ref will contain the original root that owns the given block).
    Rename those fields accordingly and adjust comments.
    
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    lorddoskias authored and kdave committed Oct 14, 2021
  21. btrfs: fix abort logic in btrfs_replace_file_extents

    Error injection testing uncovered a case where we'd end up with a
    corrupt file system with a missing extent in the middle of a file.  This
    occurs because the if statement to decide if we should abort is wrong.
    
    The only way we would abort in this case is if we got a ret !=
    -EOPNOTSUPP and we called from the file clone code.  However the
    prealloc code uses this path too.  Instead we need to abort if there is
    an error, and the only error we _don't_ abort on is -EOPNOTSUPP and only
    if we came from the clone file code.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Oct 14, 2021
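
    The before/after abort condition described above can be written out as
    two predicates. This is a sketch of the logic as the message states it,
    not the actual kernel code; the function names and the from_clone flag
    are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

/* Old logic as described: abort only for clone callers, and only when
 * the error is something other than -EOPNOTSUPP. */
static bool old_should_abort(int ret, bool from_clone)
{
	return ret != 0 && ret != -EOPNOTSUPP && from_clone;
}

/* Fixed logic: abort on any error, except -EOPNOTSUPP coming from the
 * clone path. */
static bool new_should_abort(int ret, bool from_clone)
{
	if (ret == 0)
		return false;
	return !(ret == -EOPNOTSUPP && from_clone);
}
```

    With these definitions, an -EIO from the prealloc path (from_clone ==
    false) is not aborted by the old predicate but is aborted by the new
    one, which is the case the error injection run hit.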
  22. btrfs: do not infinite loop in data reclaim if we aborted

    Error injection stressing uncovered a busy loop in our data reclaim
    loop.  There are two cases here: one where we loop creating block
    groups until space_info->full is set, and one in the main loop where we
    skip erroring out any tickets if space_info->full == 0.  Unfortunately,
    if we aborted the transaction then we will never allocate chunks or
    reclaim any space and thus never get ->full set, and you'll see stack
    traces like this:
    
      watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/u4:4:139]
      CPU: 0 PID: 139 Comm: kworker/u4:4 Tainted: G        W         5.13.0-rc1+ #328
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
      Workqueue: events_unbound btrfs_async_reclaim_data_space
      RIP: 0010:btrfs_join_transaction+0x12/0x20
      RSP: 0018:ffffb2b780b77de0 EFLAGS: 00000246
      RAX: ffffb2b781863d58 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000801 RSI: ffff987952b57400 RDI: ffff987940aa3000
      RBP: ffff987954d55000 R08: 0000000000000001 R09: ffff98795539e8f0
      R10: 000000000000000f R11: 000000000000000f R12: ffffffffffffffff
      R13: ffff987952b574c8 R14: ffff987952b57400 R15: 0000000000000008
      FS:  0000000000000000(0000) GS:ffff9879bbc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f0703da4000 CR3: 0000000113398004 CR4: 0000000000370ef0
      Call Trace:
       flush_space+0x4a8/0x660
       btrfs_async_reclaim_data_space+0x55/0x130
       process_one_work+0x1e9/0x380
       worker_thread+0x53/0x3e0
       ? process_one_work+0x380/0x380
       kthread+0x118/0x140
       ? __kthread_bind_mask+0x60/0x60
       ret_from_fork+0x1f/0x30
    
    Fix this by checking to see if we have a btrfs fs error in either of the
    reclaim loops, and if so fail the tickets and bail.  In addition to
    this, fix maybe_fail_all_tickets() to not try to grant tickets if we've
    aborted, simply fail everything.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Oct 14, 2021
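
    The shape of the fix can be illustrated with a toy model of the
    "allocate until full" loop. Everything here is hypothetical (the
    mock_space_info fields, the budget counter and the helper names); it
    only shows why the loop needs an explicit error check once allocation
    can no longer make progress.

```c
#include <stdbool.h>

struct mock_space_info {
	bool full;     /* set once no more chunks can be allocated */
	bool fs_error; /* stands in for the fs error state after an abort */
	int tickets;   /* pending reservation tickets */
	int failed;    /* tickets that were failed */
};

/* After an abort nothing is allocated, so ->full is never set. */
static void mock_try_alloc_chunk(struct mock_space_info *si, int *budget)
{
	if (si->fs_error)
		return;
	if (--(*budget) <= 0)
		si->full = true;
}

static void mock_fail_all_tickets(struct mock_space_info *si)
{
	si->failed += si->tickets;
	si->tickets = 0;
}

/* Returns the number of allocation attempts made. */
static int mock_reclaim(struct mock_space_info *si, int budget)
{
	int iterations = 0;

	while (!si->full) {
		/* Without this check the loop spins forever on an aborted fs. */
		if (si->fs_error) {
			mock_fail_all_tickets(si);
			return iterations;
		}
		mock_try_alloc_chunk(si, &budget);
		iterations++;
	}
	return iterations;
}
```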
  23. btrfs: add a BTRFS_FS_ERROR helper

    We have a few flags that are inconsistently used to describe the fs in
    different states of failure.  As of 5963ffc ("btrfs: always abort
    the transaction if we abort a trans handle") we will always set
    BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED
    and ERROR to see if things have gone wrong.  Add a helper to check
    BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to
    use the helper.
    
    The TRANS_ABORTED bit check was added in af72273 ("Btrfs: clean up
    resources during umount after trans is aborted") but is not actually
    specific.
    
    Reviewed-by: Anand Jain <anand.jain@oracle.com>
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Oct 14, 2021
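
    A single-bit state helper of the kind described can be sketched in
    userspace C as below. The MOCK_* names, the bit number and the
    mock_test_bit() helper are assumptions for illustration; the kernel
    helper operates on fs_info->fs_state with the kernel's own bitops.

```c
#include <stdbool.h>

enum {
	MOCK_FS_STATE_ERROR = 0, /* hypothetical bit number */
};

struct mock_fs_info {
	unsigned long fs_state; /* bitmask of state flags */
};

/* Minimal, non-atomic stand-in for the kernel's test_bit(). */
static bool mock_test_bit(int nr, const unsigned long *addr)
{
	return (*addr >> nr) & 1UL;
}

/* One place to ask "has this filesystem hit an error?". */
#define MOCK_FS_ERROR(fs_info) \
	mock_test_bit(MOCK_FS_STATE_ERROR, &(fs_info)->fs_state)

/* An abort always sets the error bit, so callers check only that bit. */
static void mock_abort(struct mock_fs_info *fs_info)
{
	fs_info->fs_state |= 1UL << MOCK_FS_STATE_ERROR;
}
```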
  24. btrfs: change error handling for btrfs_delete_*_in_log

    Currently we will abort the transaction if we get a random error (like
    -EIO) while trying to remove the directory entries from the root log
    during rename.
    
    However since these are simply log tree related errors, we can mark the
    trans as needing a full commit.  Then if the error was truly
    catastrophic we'll hit it during the normal commit and abort as
    appropriate.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Oct 14, 2021
  25. btrfs: change handle_fs_error in recover_log_trees to aborts

    During inspection of the return path for replay I noticed that we don't
    actually abort the transaction if we get a failure during replay.  This
    isn't a problem necessarily, as we properly return the error and will
    fail to mount.  However we still leave this dangling transaction that
    could conceivably be committed without thinking there was an error.
    
    We were using btrfs_handle_fs_error() here, but that pre-dates the
    transaction abort code.  Simply replace the btrfs_handle_fs_error()
    calls with transaction aborts, so we still know where exactly things
    went wrong, and add a few in some other un-handled error cases.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Oct 14, 2021
  26. btrfs: check for error when looking up inode during dir entry replay

    At replay_one_name(), we are treating any error from btrfs_lookup_inode()
    as if the inode does not exist. Fix this by checking for an error and
    returning it to the caller.
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed Oct 14, 2021
  27. btrfs: unify lookup return value when dir entry is missing

    btrfs_lookup_dir_index_item() and btrfs_lookup_dir_item() look up dir
    entries and both are used during log replay or when updating a log tree
    during an unlink.
    
    However, when the dir item does not exist, btrfs_lookup_dir_item()
    returns NULL while btrfs_lookup_dir_index_item() returns
    ERR_PTR(-ENOENT), and if the dir item exists but there is no matching
    entry for a given name or index, both return NULL. This makes the call
    sites during log replay more verbose than necessary and makes it easy
    to miss this slight difference. Since we don't need to distinguish
    between those two cases, make btrfs_lookup_dir_index_item() always
    return NULL when there is no matching directory entry - either because
    there isn't any dir entry or because there is one but it does not match
    the given name and index.
    
    Also rename the argument 'objectid' of btrfs_lookup_dir_index_item() to
    'index' since it is supposed to match an index number, and the name
    'objectid' is not very good because it can easily be confused with an
    inode number (like the inode number a dir entry points to).
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed Oct 14, 2021
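
    The unified convention (NULL for "no matching entry", an error pointer
    only for real failures) can be sketched with userspace re-creations of
    the kernel's error-pointer helpers. All mock_* names are hypothetical,
    and the lookup is a toy array search rather than a btree lookup.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_MAX_ERRNO 4095

/* Userspace equivalents of ERR_PTR(), PTR_ERR() and IS_ERR(). */
static void *mock_err_ptr(long error)
{
	return (void *)error;
}

static long mock_ptr_err(const void *ptr)
{
	return (long)ptr;
}

static bool mock_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MOCK_MAX_ERRNO;
}

struct mock_dir_item {
	int index;
};

/* NULL: no matching entry. Error pointer: the lookup itself failed. */
static struct mock_dir_item *mock_lookup(struct mock_dir_item *items,
					 size_t n, int index, long io_error)
{
	if (io_error)
		return mock_err_ptr(io_error);
	for (size_t i = 0; i < n; i++)
		if (items[i].index == index)
			return &items[i];
	return NULL;
}

/* Caller: 1 if found, 0 if missing, negative errno on failure. */
static int mock_replay(struct mock_dir_item *items, size_t n, int index,
		       long io_error)
{
	struct mock_dir_item *di = mock_lookup(items, n, index, io_error);

	if (mock_is_err(di))
		return (int)mock_ptr_err(di);
	return di != NULL;
}
```

    With one meaning for NULL, the caller needs only two checks: one for
    an error pointer and one for a missing entry.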