
Commits on Jul 21, 2020

  1. btrfs: allow more subvol= option

    When more than one subvol= option is passed, btrfs tries to mount
    each subvolume in turn until the first one succeeds. Up to 5 subvol=
    options can be passed.
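
    A minimal sketch of the retry loop described above; the helper name and
    surrounding variables are illustrative only, not the actual patch:

        /* try each subvol= value in the order given, stop at the first
         * one that mounts successfully */
        for (i = 0; i < nr_subvol_opts; i++) {
                root = mount_subvol(subvol_names[i], subvol_objectid, flags, data);
                if (!IS_ERR(root))
                        break;
        }
        return root;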
    
    Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
    kreijack authored and 0day robot committed Jul 21, 2020

Commits on Jul 20, 2020

  1. btrfs: Simplify setting/clearing fs_info to btrfs_fs_devices

    It makes no sense to have sysfs-related routines be responsible for
    properly initialising the fs_info pointer of struct btrfs_fs_devices.
    Instead this can be streamlined by making it the responsibility of
    btrfs_init_devices_late to initialize it. That function already
    initializes fs_info of every individual device in btrfs_fs_devices.

    As far as clearing it is concerned, it makes sense to move it to
    close_fs_devices. That function is only called when struct
    btrfs_fs_devices is no longer in use - either for holding seeds or main
    devices for a mounted filesystem.
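
    Roughly, the intended shape is the following (a sketch only, following
    the existing volumes.c list and field names):

        /* btrfs_init_devices_late(): set fs_info on the fs_devices and on
         * every device hanging off of it */
        fs_devices->fs_info = fs_info;
        list_for_each_entry(device, &fs_devices->devices, dev_list)
                device->fs_info = fs_info;

        /* close_fs_devices(): the structure is going out of use, so drop
         * the back pointer */
        fs_devices->fs_info = NULL;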
    
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Nikolay Borisov authored and kdave committed Jul 20, 2020
  2. btrfs: Make close_fs_devices return void

    The return value of this function conveys absolutely no information.
    All callers already check the state of fs_devices->opened to decide
    how to proceed. So convert the function to return void. While at it
    make btrfs_close_devices return void as well.
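
    The conversion itself is just a signature change, roughly:

        -static int close_fs_devices(struct btrfs_fs_devices *fs_devices)
        +static void close_fs_devices(struct btrfs_fs_devices *fs_devices)

        -int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
        +void btrfs_close_devices(struct btrfs_fs_devices *fs_devices)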
    
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Nikolay Borisov authored and kdave committed Jul 20, 2020
  3. btrfs: Factor out reada loop in __reada_start_machine

    This is in preparation for moving fs_devices to proper lists.
    
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Nikolay Borisov authored and kdave committed Jul 20, 2020
  4. btrfs: fix lockdep splat from btrfs_dump_space_info

    When running with -o enospc_debug you can get the following splat if one
    of the dump_space_info calls trips:
    
    ======================================================
    WARNING: possible circular locking dependency detected
    5.8.0-rc5+ #20 Tainted: G           OE
    ------------------------------------------------------
    dd/563090 is trying to acquire lock:
    ffff9e7dbf4f1e18 (&ctl->tree_lock){+.+.}-{2:2}, at: btrfs_dump_free_space+0x2b/0xa0 [btrfs]
    
    but task is already holding lock:
    ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs]
    
    which lock already depends on the new lock.
    
    the existing dependency chain (in reverse order) is:
    
    -> #3 (&cache->lock){+.+.}-{2:2}:
           _raw_spin_lock+0x25/0x30
           btrfs_add_reserved_bytes+0x3c/0x3c0 [btrfs]
           find_free_extent+0x7ef/0x13b0 [btrfs]
           btrfs_reserve_extent+0x9b/0x180 [btrfs]
           btrfs_alloc_tree_block+0xc1/0x340 [btrfs]
           alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
           __btrfs_cow_block+0x122/0x530 [btrfs]
           btrfs_cow_block+0x106/0x210 [btrfs]
           commit_cowonly_roots+0x55/0x300 [btrfs]
           btrfs_commit_transaction+0x4ed/0xac0 [btrfs]
           sync_filesystem+0x74/0x90
           generic_shutdown_super+0x22/0x100
           kill_anon_super+0x14/0x30
           btrfs_kill_super+0x12/0x20 [btrfs]
           deactivate_locked_super+0x36/0x70
           cleanup_mnt+0x104/0x160
           task_work_run+0x5f/0x90
           __prepare_exit_to_usermode+0x1bd/0x1c0
           do_syscall_64+0x5e/0xb0
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
    -> #2 (&space_info->lock){+.+.}-{2:2}:
           _raw_spin_lock+0x25/0x30
           btrfs_block_rsv_release+0x1a6/0x3f0 [btrfs]
           btrfs_inode_rsv_release+0x4f/0x170 [btrfs]
           btrfs_clear_delalloc_extent+0x155/0x480 [btrfs]
           clear_state_bit+0x81/0x1a0 [btrfs]
           __clear_extent_bit+0x25c/0x5d0 [btrfs]
           clear_extent_bit+0x15/0x20 [btrfs]
           btrfs_invalidatepage+0x2b7/0x3c0 [btrfs]
           truncate_cleanup_page+0x47/0xe0
           truncate_inode_pages_range+0x238/0x840
           truncate_pagecache+0x44/0x60
           btrfs_setattr+0x202/0x5e0 [btrfs]
           notify_change+0x33b/0x490
           do_truncate+0x76/0xd0
           path_openat+0x687/0xa10
           do_filp_open+0x91/0x100
           do_sys_openat2+0x215/0x2d0
           do_sys_open+0x44/0x80
           do_syscall_64+0x52/0xb0
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
    -> #1 (&tree->lock#2){+.+.}-{2:2}:
           _raw_spin_lock+0x25/0x30
           find_first_extent_bit+0x32/0x150 [btrfs]
           write_pinned_extent_entries.isra.0+0xc5/0x100 [btrfs]
           __btrfs_write_out_cache+0x172/0x480 [btrfs]
           btrfs_write_out_cache+0x7a/0xf0 [btrfs]
           btrfs_write_dirty_block_groups+0x286/0x3b0 [btrfs]
           commit_cowonly_roots+0x245/0x300 [btrfs]
           btrfs_commit_transaction+0x4ed/0xac0 [btrfs]
           close_ctree+0xf9/0x2f5 [btrfs]
           generic_shutdown_super+0x6c/0x100
           kill_anon_super+0x14/0x30
           btrfs_kill_super+0x12/0x20 [btrfs]
           deactivate_locked_super+0x36/0x70
           cleanup_mnt+0x104/0x160
           task_work_run+0x5f/0x90
           __prepare_exit_to_usermode+0x1bd/0x1c0
           do_syscall_64+0x5e/0xb0
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
    -> #0 (&ctl->tree_lock){+.+.}-{2:2}:
           __lock_acquire+0x1240/0x2460
           lock_acquire+0xab/0x360
           _raw_spin_lock+0x25/0x30
           btrfs_dump_free_space+0x2b/0xa0 [btrfs]
           btrfs_dump_space_info+0xf4/0x120 [btrfs]
           btrfs_reserve_extent+0x176/0x180 [btrfs]
           __btrfs_prealloc_file_range+0x145/0x550 [btrfs]
           cache_save_setup+0x28d/0x3b0 [btrfs]
           btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs]
           btrfs_commit_transaction+0xcc/0xac0 [btrfs]
           btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs]
           btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
           btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs]
           btrfs_file_write_iter+0x3cf/0x610 [btrfs]
           new_sync_write+0x11e/0x1b0
           vfs_write+0x1c9/0x200
           ksys_write+0x68/0xe0
           do_syscall_64+0x52/0xb0
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
    other info that might help us debug this:
    
    Chain exists of:
      &ctl->tree_lock --> &space_info->lock --> &cache->lock
    
     Possible unsafe locking scenario:
    
           CPU0                    CPU1
           ----                    ----
      lock(&cache->lock);
                                   lock(&space_info->lock);
                                   lock(&cache->lock);
      lock(&ctl->tree_lock);
    
     *** DEADLOCK ***
    
    6 locks held by dd/563090:
     #0: ffff9e7e21d18448 (sb_writers#14){.+.+}-{0:0}, at: vfs_write+0x195/0x200
     #1: ffff9e7dd0410ed8 (&sb->s_type->i_mutex_key#19){++++}-{3:3}, at: btrfs_file_write_iter+0x86/0x610 [btrfs]
     #2: ffff9e7e21d18638 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40b/0x5b0 [btrfs]
     #3: ffff9e7e1f05d688 (&cur_trans->cache_write_mutex){+.+.}-{3:3}, at: btrfs_start_dirty_block_groups+0x158/0x4f0 [btrfs]
     #4: ffff9e7e2284ddb8 (&space_info->groups_sem){++++}-{3:3}, at: btrfs_dump_space_info+0x69/0x120 [btrfs]
     #5: ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs]
    
    stack backtrace:
    CPU: 3 PID: 563090 Comm: dd Tainted: G           OE     5.8.0-rc5+ #20
    Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
    Call Trace:
     dump_stack+0x96/0xd0
     check_noncircular+0x162/0x180
     __lock_acquire+0x1240/0x2460
     ? wake_up_klogd.part.0+0x30/0x40
     lock_acquire+0xab/0x360
     ? btrfs_dump_free_space+0x2b/0xa0 [btrfs]
     _raw_spin_lock+0x25/0x30
     ? btrfs_dump_free_space+0x2b/0xa0 [btrfs]
     btrfs_dump_free_space+0x2b/0xa0 [btrfs]
     btrfs_dump_space_info+0xf4/0x120 [btrfs]
     btrfs_reserve_extent+0x176/0x180 [btrfs]
     __btrfs_prealloc_file_range+0x145/0x550 [btrfs]
     ? btrfs_qgroup_reserve_data+0x1d/0x60 [btrfs]
     cache_save_setup+0x28d/0x3b0 [btrfs]
     btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs]
     btrfs_commit_transaction+0xcc/0xac0 [btrfs]
     ? start_transaction+0xe0/0x5b0 [btrfs]
     btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs]
     btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
     btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs]
     ? ktime_get_coarse_real_ts64+0xa8/0xd0
     ? trace_hardirqs_on+0x1c/0xe0
     btrfs_file_write_iter+0x3cf/0x610 [btrfs]
     new_sync_write+0x11e/0x1b0
     vfs_write+0x1c9/0x200
     ksys_write+0x68/0xe0
     do_syscall_64+0x52/0xb0
     entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
    This is because we're holding the block_group->lock while trying to dump
    the free space cache.  However we only need that lock to read the values
    for the printk, so move the free space cache dumping outside of the
    block group lock.
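
    In outline, the values for the message are still read under cache->lock,
    but the free space dump happens after the unlock (a sketch, not the
    exact diff; the message text is abbreviated):

        list_for_each_entry(cache, &info->block_groups[index], list) {
                spin_lock(&cache->lock);
                /* read the counters for the printk while holding the lock */
                btrfs_info(fs_info, "block group %llu has %llu bytes, %llu used",
                           cache->start, cache->length, cache->used);
                spin_unlock(&cache->lock);
                /* dump the free space cache without cache->lock held */
                btrfs_dump_free_space(cache, bytes);
        }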
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  5. btrfs: move the chunk_mutex in btrfs_read_chunk_tree

    We are currently getting this lockdep splat
    
    ======================================================
    WARNING: possible circular locking dependency detected
    5.8.0-rc5+ #20 Tainted: G            E
    ------------------------------------------------------
    mount/678048 is trying to acquire lock:
    ffff9b769f15b6e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: clone_fs_devices+0x4d/0x170 [btrfs]
    
    but task is already holding lock:
    ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
    
    which lock already depends on the new lock.
    
    the existing dependency chain (in reverse order) is:
    
    -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
           __mutex_lock+0x8b/0x8f0
           btrfs_init_new_device+0x2d2/0x1240 [btrfs]
           btrfs_ioctl+0x1de/0x2d20 [btrfs]
           ksys_ioctl+0x87/0xc0
           __x64_sys_ioctl+0x16/0x20
           do_syscall_64+0x52/0xb0
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
    -> #0 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
           __lock_acquire+0x1240/0x2460
           lock_acquire+0xab/0x360
           __mutex_lock+0x8b/0x8f0
           clone_fs_devices+0x4d/0x170 [btrfs]
           btrfs_read_chunk_tree+0x330/0x800 [btrfs]
           open_ctree+0xb7c/0x18ce [btrfs]
           btrfs_mount_root.cold+0x13/0xfa [btrfs]
           legacy_get_tree+0x30/0x50
           vfs_get_tree+0x28/0xc0
           fc_mount+0xe/0x40
           vfs_kern_mount.part.0+0x71/0x90
           btrfs_mount+0x13b/0x3e0 [btrfs]
           legacy_get_tree+0x30/0x50
           vfs_get_tree+0x28/0xc0
           do_mount+0x7de/0xb30
           __x64_sys_mount+0x8e/0xd0
           do_syscall_64+0x52/0xb0
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
    other info that might help us debug this:
    
     Possible unsafe locking scenario:
    
           CPU0                    CPU1
           ----                    ----
      lock(&fs_info->chunk_mutex);
                                   lock(&fs_devs->device_list_mutex);
                                   lock(&fs_info->chunk_mutex);
      lock(&fs_devs->device_list_mutex);
    
     *** DEADLOCK ***
    
    3 locks held by mount/678048:
     #0: ffff9b75ff5fb0e0 (&type->s_umount_key#63/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
     #1: ffffffffc0c2fbc8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x54/0x800 [btrfs]
     #2: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
    
    stack backtrace:
    CPU: 2 PID: 678048 Comm: mount Tainted: G            E     5.8.0-rc5+ #20
    Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
    Call Trace:
     dump_stack+0x96/0xd0
     check_noncircular+0x162/0x180
     __lock_acquire+0x1240/0x2460
     ? asm_sysvec_apic_timer_interrupt+0x12/0x20
     lock_acquire+0xab/0x360
     ? clone_fs_devices+0x4d/0x170 [btrfs]
     __mutex_lock+0x8b/0x8f0
     ? clone_fs_devices+0x4d/0x170 [btrfs]
     ? rcu_read_lock_sched_held+0x52/0x60
     ? cpumask_next+0x16/0x20
     ? module_assert_mutex_or_preempt+0x14/0x40
     ? __module_address+0x28/0xf0
     ? clone_fs_devices+0x4d/0x170 [btrfs]
     ? static_obj+0x4f/0x60
     ? lockdep_init_map_waits+0x43/0x200
     ? clone_fs_devices+0x4d/0x170 [btrfs]
     clone_fs_devices+0x4d/0x170 [btrfs]
     btrfs_read_chunk_tree+0x330/0x800 [btrfs]
     open_ctree+0xb7c/0x18ce [btrfs]
     ? super_setup_bdi_name+0x79/0xd0
     btrfs_mount_root.cold+0x13/0xfa [btrfs]
     ? vfs_parse_fs_string+0x84/0xb0
     ? rcu_read_lock_sched_held+0x52/0x60
     ? kfree+0x2b5/0x310
     legacy_get_tree+0x30/0x50
     vfs_get_tree+0x28/0xc0
     fc_mount+0xe/0x40
     vfs_kern_mount.part.0+0x71/0x90
     btrfs_mount+0x13b/0x3e0 [btrfs]
     ? cred_has_capability+0x7c/0x120
     ? rcu_read_lock_sched_held+0x52/0x60
     ? legacy_get_tree+0x30/0x50
     legacy_get_tree+0x30/0x50
     vfs_get_tree+0x28/0xc0
     do_mount+0x7de/0xb30
     ? memdup_user+0x4e/0x90
     __x64_sys_mount+0x8e/0xd0
     do_syscall_64+0x52/0xb0
     entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
    This is because btrfs_read_chunk_tree() can come upon DEV_EXTENT items
    and then read the device, which takes the device_list_mutex.  The
    device_list_mutex needs to be taken before the chunk_mutex, so this is a
    problem.  We only really need the chunk mutex around adding the chunk,
    so move the mutex around read_one_chunk.
    
    An argument could be made that we don't even need the chunk_mutex here
    as it's during mount, and we are protected by various other locks.
    However we already have special rules for ->device_list_mutex, and I'd
    rather not have another special case for ->chunk_mutex.
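
    The change amounts to narrowing the lock scope inside
    btrfs_read_chunk_tree(), roughly:

        if (found_key.type == BTRFS_CHUNK_ITEM_KEY) {
                struct btrfs_chunk *chunk;

                chunk = btrfs_item_ptr(leaf, slot, struct btrfs_chunk);
                /* only hold chunk_mutex while the chunk itself is added */
                mutex_lock(&fs_info->chunk_mutex);
                ret = read_one_chunk(&found_key, leaf, chunk);
                mutex_unlock(&fs_info->chunk_mutex);
                if (ret)
                        goto error;
        }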
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  6. btrfs: fix lockdep splat in open_fs_devices

    There has long existed a lockdep splat because we open our bdevs under
    the ->device_list_mutex at mount time, which acquires the bd_mutex.
    Usually this goes unnoticed, but if you use loopback devices at all,
    suddenly the bd_mutex comes with a whole host of other dependencies,
    which results in the splat when you mount a btrfs file system.
    
    ======================================================
    WARNING: possible circular locking dependency detected
    5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
    ------------------------------------------------------
    systemd-journal/509 is trying to acquire lock:
    ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
    
    but task is already holding lock:
    ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
    
    which lock already depends on the new lock.
    
    the existing dependency chain (in reverse order) is:
    
     -> #6 (sb_pagefaults){.+.+}-{0:0}:
           __sb_start_write+0x13e/0x220
           btrfs_page_mkwrite+0x59/0x560 [btrfs]
           do_page_mkwrite+0x4f/0x130
           do_wp_page+0x3b0/0x4f0
           handle_mm_fault+0xf47/0x1850
           do_user_addr_fault+0x1fc/0x4b0
           exc_page_fault+0x88/0x300
           asm_exc_page_fault+0x1e/0x30
    
     -> #5 (&mm->mmap_lock#2){++++}-{3:3}:
           __might_fault+0x60/0x80
           _copy_from_user+0x20/0xb0
           get_sg_io_hdr+0x9a/0xb0
           scsi_cmd_ioctl+0x1ea/0x2f0
           cdrom_ioctl+0x3c/0x12b4
           sr_block_ioctl+0xa4/0xd0
           block_ioctl+0x3f/0x50
           ksys_ioctl+0x82/0xc0
           __x64_sys_ioctl+0x16/0x20
           do_syscall_64+0x52/0xb0
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
     -> #4 (&cd->lock){+.+.}-{3:3}:
           __mutex_lock+0x7b/0x820
           sr_block_open+0xa2/0x180
           __blkdev_get+0xdd/0x550
           blkdev_get+0x38/0x150
           do_dentry_open+0x16b/0x3e0
           path_openat+0x3c9/0xa00
           do_filp_open+0x75/0x100
           do_sys_openat2+0x8a/0x140
           __x64_sys_openat+0x46/0x70
           do_syscall_64+0x52/0xb0
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
     -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
           __mutex_lock+0x7b/0x820
           __blkdev_get+0x6a/0x550
           blkdev_get+0x85/0x150
           blkdev_get_by_path+0x2c/0x70
           btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
           open_fs_devices+0x88/0x240 [btrfs]
           btrfs_open_devices+0x92/0xa0 [btrfs]
           btrfs_mount_root+0x250/0x490 [btrfs]
           legacy_get_tree+0x30/0x50
           vfs_get_tree+0x28/0xc0
           vfs_kern_mount.part.0+0x71/0xb0
           btrfs_mount+0x119/0x380 [btrfs]
           legacy_get_tree+0x30/0x50
           vfs_get_tree+0x28/0xc0
           do_mount+0x8c6/0xca0
           __x64_sys_mount+0x8e/0xd0
           do_syscall_64+0x52/0xb0
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
     -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
           __mutex_lock+0x7b/0x820
           btrfs_run_dev_stats+0x36/0x420 [btrfs]
           commit_cowonly_roots+0x91/0x2d0 [btrfs]
           btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
           btrfs_sync_file+0x38a/0x480 [btrfs]
           __x64_sys_fdatasync+0x47/0x80
           do_syscall_64+0x52/0xb0
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
     -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
           __mutex_lock+0x7b/0x820
           btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
           btrfs_sync_file+0x38a/0x480 [btrfs]
           __x64_sys_fdatasync+0x47/0x80
           do_syscall_64+0x52/0xb0
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
     -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
           __lock_acquire+0x1241/0x20c0
           lock_acquire+0xb0/0x400
           __mutex_lock+0x7b/0x820
           btrfs_record_root_in_trans+0x44/0x70 [btrfs]
           start_transaction+0xd2/0x500 [btrfs]
           btrfs_dirty_inode+0x44/0xd0 [btrfs]
           file_update_time+0xc6/0x120
           btrfs_page_mkwrite+0xda/0x560 [btrfs]
           do_page_mkwrite+0x4f/0x130
           do_wp_page+0x3b0/0x4f0
           handle_mm_fault+0xf47/0x1850
           do_user_addr_fault+0x1fc/0x4b0
           exc_page_fault+0x88/0x300
           asm_exc_page_fault+0x1e/0x30
    
    other info that might help us debug this:
    
    Chain exists of:
      &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
    
    Possible unsafe locking scenario:
    
         CPU0                    CPU1
         ----                    ----
     lock(sb_pagefaults);
                                 lock(&mm->mmap_lock#2);
                                 lock(sb_pagefaults);
     lock(&fs_info->reloc_mutex);
    
     *** DEADLOCK ***
    
    3 locks held by systemd-journal/509:
     #0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
     #1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
     #2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
    
    stack backtrace:
    CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
    Call Trace:
     dump_stack+0x92/0xc8
     check_noncircular+0x134/0x150
     __lock_acquire+0x1241/0x20c0
     lock_acquire+0xb0/0x400
     ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
     ? lock_acquire+0xb0/0x400
     ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
     __mutex_lock+0x7b/0x820
     ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
     ? kvm_sched_clock_read+0x14/0x30
     ? sched_clock+0x5/0x10
     ? sched_clock_cpu+0xc/0xb0
     btrfs_record_root_in_trans+0x44/0x70 [btrfs]
     start_transaction+0xd2/0x500 [btrfs]
     btrfs_dirty_inode+0x44/0xd0 [btrfs]
     file_update_time+0xc6/0x120
     btrfs_page_mkwrite+0xda/0x560 [btrfs]
     ? sched_clock+0x5/0x10
     do_page_mkwrite+0x4f/0x130
     do_wp_page+0x3b0/0x4f0
     handle_mm_fault+0xf47/0x1850
     do_user_addr_fault+0x1fc/0x4b0
     exc_page_fault+0x88/0x300
     ? asm_exc_page_fault+0x8/0x30
     asm_exc_page_fault+0x1e/0x30
    RIP: 0033:0x7fa3972fdbfe
    Code: Bad RIP value.
    
    Fix this by not holding the ->device_list_mutex at this point.  The
    device_list_mutex exists to protect us from modifying the device list
    while the file system is running.

    However it can also be modified by doing a scan on a device.  But that
    action is specifically protected by the uuid_mutex, which we are holding
    here.  We cannot race with opening at this point because we have the
    ->s_umount semaphore held during the mount.  Not having the
    ->device_list_mutex here is perfectly safe as we're not going to change
    the devices at this point.
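
    In effect open_fs_devices() now walks the list relying on the uuid_mutex
    alone; a sketch of the idea (not the exact diff):

        /* the caller holds the uuid_mutex; the device_list_mutex is no
         * longer taken around the bdev opens */
        lockdep_assert_held(&uuid_mutex);

        list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list) {
                /* opens the block device, which takes bd_mutex internally */
                btrfs_open_one_device(fs_devices, device, flags, holder);
        }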
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  7. btrfs: add a comment explaining the data flush steps

    The data flushing steps are not obvious to people other than myself and
    Chris.  Write a giant comment explaining the reasoning behind each flush
    step for data as well as why it is in that particular order.
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  8. btrfs: do async reclaim for data reservations

    Now that we have the data ticketing stuff in place, move normal data
    reservations to use an async reclaim helper to satisfy tickets.  Before,
    we could have multiple tasks race in and each allocate chunks, resulting
    in more data chunks than we would necessarily need.  Serializing these
    allocations and making a single thread responsible for flushing will
    only allocate chunks as needed, as well as cut down on transaction
    commits and other flush related activities.
    
    Priority reservations will still work as they have before, simply
    trying to allocate a chunk until they can make their reservation.
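
    Conceptually the reservation path becomes "queue a ticket, kick the
    worker".  A sketch; the async_data_reclaim_work name is the work item
    assumed to be introduced by this patch:

        /* data reservation that is allowed to flush: get in line and let
         * a single worker do the flushing for everyone */
        list_add_tail(&ticket.list, &space_info->tickets);
        queue_work(system_unbound_wq, &fs_info->async_data_reclaim_work);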
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  9. btrfs: flush delayed refs when trying to reserve data space

    We can end up with freed extents sitting in the delayed refs, and thus
    may_commit_transaction() may not think we have enough pinned space to
    commit the transaction, so we'll return ENOSPC early.  Handle this by
    running the delayed refs in order to make sure pinned is up to date
    before we try to commit the transaction.
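
    The data flushing path gains a step along these lines (a sketch of one
    case in the flush handling; nr is the ref count computed for the ticket):

        /* make sure the pinned byte counters reflect extents freed in the
         * delayed refs before may_commit_transaction() checks them */
        trans = btrfs_join_transaction(root);
        if (IS_ERR(trans))
                break;
        btrfs_run_delayed_refs(trans, nr);
        btrfs_end_transaction(trans);
        break;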
    
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  10. btrfs: run delayed iputs before committing the transaction for data

    Before, we were waiting on iputs after we committed the transaction, but
    this doesn't really make much sense.  We want to reclaim any space we
    may have before the commit so that it is more likely to satisfy our
    reservation, since running the delayed iputs adds pinned space.  Fix
    this by making the delayed iputs run before committing the transaction.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  11. btrfs: don't force commit if we are data

    We used to unconditionally commit the transaction at least 2 times and
    then on the 3rd try check against pinned space to make sure committing
    the transaction was worth the effort.  This is overkill; we know nobody
    is going to steal our reservation, and if we can't make our reservation
    with the pinned amount we should simply bail out.

    This also cleans up the passing of bytes_needed to
    may_commit_transaction, as that was the thing we added in order to
    accomplish this behavior.  We no longer need it, so remove that mess.
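
    A sketch of the resulting check in may_commit_transaction(); the field
    names follow the existing space_info counters and the exact condition in
    the patch may differ:

        /* if pinned bytes cannot cover the reservation, a commit will not
         * help, so bail out instead of committing blindly */
        if (percpu_counter_compare(&space_info->total_bytes_pinned,
                                   bytes_needed) < 0)
                return -ENOSPC;
        return btrfs_commit_transaction(trans);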
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  12. btrfs: drop the commit_cycles stuff for data reservations

    This was an old wart left over from how we previously did data
    reservations.  Before we could have people race in and take a
    reservation while we were flushing space, so we needed to make sure we
    looped a few times before giving up.  Now that we're using the ticketing
    infrastructure we don't have to worry about this and can drop the logic
    altogether.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  13. btrfs: use the same helper for data and metadata reservations

    Now that data reservations follow the same pattern as metadata
    reservations we can simply rename __reserve_metadata_bytes to
    __reserve_bytes and use that helper for data reservations.
    
    Things to keep in mind: btrfs_can_overcommit() returns 0 for data,
    because we can never overcommit.  We also will never pass in FLUSH_ALL
    for data, so we'll simply be added to the priority list and go straight
    into handle_reserve_ticket.
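
    The rename and the data call site, roughly (the flush value shown is
    illustrative of the new data flush types, not confirmed here):

        -static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
        +static int __reserve_bytes(struct btrfs_fs_info *fs_info,
                                    struct btrfs_space_info *space_info,
                                    u64 orig_bytes,
                                    enum btrfs_reserve_flush_enum flush)

        /* data now goes through the same helper */
        ret = __reserve_bytes(fs_info, data_sinfo, bytes, BTRFS_RESERVE_FLUSH_DATA);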
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  14. btrfs: serialize data reservations if we are flushing

    Nikolay reported a problem where generic/371 would fail sometimes with a
    slow drive.  The gist of the test is that we fallocate a file in
    parallel with a pwrite of a different file.  These two files combined
    are smaller than the file system, but sometimes the pwrite would ENOSPC.
    
    A fair bit of investigation uncovered the fact that the fallocate
    workload was racing in and grabbing the free space that the pwrite
    workload was trying to free up so it could make its own reservation.
    After a few loops of this eventually the pwrite workload would error out
    with an ENOSPC.
    
    We've had the same problem with metadata as well, and we serialized all
    metadata allocations to address it.  This wasn't usually a problem with
    data because data reservations are more straightforward, but obviously
    it could still happen.
    
    Fix this by not allowing reservations to occur if there are any pending
    tickets waiting to be satisfied on the space info.
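
    The core of the fix is a check like the following before taking the
    space directly (a sketch, not the exact diff):

        /* if others are already waiting on tickets, do not jump the queue
         * and steal the space they are flushing for */
        pending_tickets = !list_empty(&space_info->tickets) ||
                          !list_empty(&space_info->priority_tickets);
        if (!pending_tickets &&
            used + bytes <= space_info->total_bytes) {
                btrfs_space_info_update_bytes_may_use(fs_info, space_info, bytes);
                ret = 0;
        }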
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  15. btrfs: use ticketing for data space reservations

    Now that we have all the infrastructure in place, use the ticketing
    infrastructure to make data allocations.  This still maintains the exact
    same flushing behavior, but now we're using tickets to get our
    reservations satisfied.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  16. btrfs: add btrfs_reserve_data_bytes and use it

    Create a new function btrfs_reserve_data_bytes() in order to handle data
    reservations.  This uses the new flush types and flush states to handle
    making data reservations.
    
    This patch specifically does not change any functionality, and is
    purposefully not cleaned up in order to make bisection easier for the
    future patches.  The new helper is identical to the old helper in how it
    handles data reservations.  We first try to force a chunk allocation,
    and then we run through the flush states all at once and in the same
    order that they were done with the old helper.
    
    Subsequent patches will clean this up and change the behavior of the
    flushing, and it is important to keep those changes separate so we can
    easily bisect down to the patch that caused the regression, rather than
    the patch that made us start using the new infrastructure.
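
    A rough outline of the new helper; the signature is assumed from the
    description and the flush state machinery, not confirmed here:

        /* 1) force chunk allocations until that no longer helps,
         * 2) then run the flush states in the same order as the old code,
         * 3) finally give up with -ENOSPC */
        int btrfs_reserve_data_bytes(struct btrfs_fs_info *fs_info, u64 bytes,
                                     enum btrfs_reserve_flush_enum flush);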
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  17. btrfs: add the data transaction commit logic into may_commit_transaction

    Data space flushing currently unconditionally commits the transaction
    twice in a row, and only the last time does it check whether there are
    enough pinned extents to satisfy its reservation before deciding to
    commit the transaction for the 3rd and final time.
    
    Encode this logic into may_commit_transaction().  In the next patch we
    will pass in U64_MAX for bytes_needed the first two times, and the final
    time we will pass in the actual bytes we need so the normal logic will
    apply.
    
    This patch exists solely to keep the logical changes I will make to the
    flushing state machine separate, in order to make it easier to bisect
    any performance related regressions.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  18. btrfs: add flushing states for handling data reservations

    Currently the way we do data reservations is by seeing if we have enough
    space in our space_info.  If we do not and we're a normal inode we'll:
    1) Attempt to force a chunk allocation until we can't anymore.
    2) If that fails we'll flush delalloc, then commit the transaction, then
       run the delayed iputs.
    
    If we are a free space inode we're only allowed to force a chunk
    allocation.  In order to use the normal flushing mechanism we need to
    encode this into a flush state array for normal inodes.  Since both will
    start with allocating chunks until the space info is full there is no
    need to add this as a flush state, this will be handled specially.
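
    A sketch of the state array for normal data inodes; the entries mirror
    the behaviour described above and reuse the existing metadata state
    names, though the exact list in the patch may differ:

        static const enum btrfs_flush_state data_flush_states[] = {
                FLUSH_DELALLOC_WAIT,
                COMMIT_TRANS,
                RUN_DELAYED_IPUTS,
        };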
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  19. btrfs: check tickets after waiting on ordered extents

    Right now if the space is freed up after the ordered extents complete
    (which is likely, since the reservations are held until they complete),
    we would do extra delalloc flushing before we'd notice that we didn't
    have any more tickets.  Fix this by moving the tickets check after our
    wait_ordered_extents check.
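
    Sketched against shrink_delalloc(), the re-check moves to after the
    wait (not the exact diff):

        btrfs_wait_ordered_roots(fs_info, items, 0, (u64)-1);

        /* ordered extent completion is what releases the reserved space,
         * so only now is it worth re-checking the ticket lists */
        spin_lock(&space_info->lock);
        if (list_empty(&space_info->tickets) &&
            list_empty(&space_info->priority_tickets)) {
                spin_unlock(&space_info->lock);
                break;
        }
        spin_unlock(&space_info->lock);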
    
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  20. btrfs: use btrfs_start_delalloc_roots in shrink_delalloc

    The original iteration of flushing had us flushing delalloc and then
    checking to see if we could make our reservation, thus we were very
    careful about how many pages we would flush at once.
    
    But now that everything is async and we satisfy tickets as the space
    becomes available, we don't have to keep track of any of this; simply
    try to flush the number of dirty inodes we may have in order to reclaim
    space to make our reservation.  This cleans up our delalloc flushing
    significantly.
    
    The async_pages stuff is dropped because btrfs_start_delalloc_roots()
    handles the case that we generate async extents for us, so we no longer
    require this extra logic.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  21. btrfs: use the btrfs_space_info_free_bytes_may_use helper for delalloc

    We are going to use the ticket infrastructure for data, so use the
    btrfs_space_info_free_bytes_may_use() helper in
    btrfs_free_reserved_data_space_noquota() so we get the
    try_granting_tickets call when we free our reservation.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  22. btrfs: call btrfs_try_granting_tickets when reserving space

    If we have compression on we could free up more space than we reserved,
    and thus be able to make a space reservation.  Add the call for this
    scenario.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  23. btrfs: call btrfs_try_granting_tickets when unpinning anything

    When unpinning we were only calling btrfs_try_granting_tickets() if
    global_rsv->space_info == space_info, which is problematic because we
    use ticketing for SYSTEM chunks, and want to use it for DATA as well.
    Fix this by moving this call outside of that if statement.
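
    The shape of the change (a sketch):

        spin_lock(&space_info->lock);
        if (global_rsv->space_info == space_info) {
                /* refill the global block reserve, as before */
        }
        /* now done unconditionally, not only inside the branch above */
        btrfs_try_granting_tickets(fs_info, space_info);
        spin_unlock(&space_info->lock);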
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  24. btrfs: call btrfs_try_granting_tickets when freeing reserved bytes

    We were missing a call to btrfs_try_granting_tickets in
    btrfs_free_reserved_bytes, so add it to handle the case where we're able
    to satisfy an allocation because we've freed a pending reservation.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  25. btrfs: make ALLOC_CHUNK use the space info flags

    We have traditionally used flush_space() to flush metadata space, so
    we've been unconditionally using btrfs_metadata_alloc_profile() for our
    profile to allocate a chunk.  However if we're going to use this for
    data we need to use btrfs_get_alloc_profile() on the space_info we pass
    in.
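
    The change in the ALLOC_CHUNK handling of flush_space(), roughly:

        -       ret = btrfs_chunk_alloc(trans,
        -                       btrfs_metadata_alloc_profile(fs_info),
        +       ret = btrfs_chunk_alloc(trans,
        +                       btrfs_get_alloc_profile(fs_info, space_info->flags),
                                CHUNK_ALLOC_NO_FORCE);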
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
  26. btrfs: make shrink_delalloc take space_info as an arg

    Currently shrink_delalloc just looks up the metadata space info, but
    this won't work if we're trying to reclaim space for data chunks.  We
    get the right space_info we want passed into flush_space, so simply pass
    that along to shrink_delalloc.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Tested-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020