Anand-Jain/btr…
Commits on Dec 8, 2021
-
btrfs: harden identification of the stale device
Identifying and removing the stale device from the fs_uuids list is done by the function btrfs_free_stale_devices(). btrfs_free_stale_devices() in turn depends on the function device_path_matched() to check if the device repeats in more than one btrfs_device structure. The matching of the device happens by its path, the device path. However, when dm mapper is in use, the dm device paths are nothing but a link to the actual block device, which leads to the device_path_matched() failing to match. Fix this by matching the dev_t as provided by lookup_bdev() instead of plain strcmp() the device paths. Reported-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com>
Commits on Nov 16, 2021
-
Fixup merge-to-merge conflict in lzo_compress_pages
Signed-off-by: David Sterba <dsterba@suse.com>
-
Merge branch 'for-next-current-v5.15-20211116' into for-next-20211116
# Conflicts: # fs/btrfs/lzo.c
-
-
btrfs: allow device add if balance is paused
Currently paused balance precludes adding a device since they are both considered exclusive ops and we can have at most 1 running at a time. This is problematic in case a filesystem encounters an ENOSPC situation while balance is running, in this case the only thing the user can do is mount the fs with "skip_balance" which pauses balance and delete some data to free up space for balance. However, it should be possible to add a new device when balance is paused. Fix this by allowing device add to proceed when balance is paused. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: make device add compatible with paused balance in btrfs_exclop…
…_start_try_lock This is needed to enable device add to work in cases when a file system has been mounted with 'skip_balance' mount option. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: introduce BTRFS_EXCLOP_BALANCE_PAUSED exclusive state
Current set of exclusive operation states is not sufficient to handle all practical use cases. In particular there is a need to be able to add a device to a filesystem that have paused balance. Currently there is no way to distinguish between a running and a paused balance. Fix this by introducing BTRFS_EXCLOP_BALANCE_PAUSED which is going to be set in 2 occasions: 1. When a filesystem is mounted with skip_balance and there is an unfinished balance it will now be into BALANCE_PAUSED instead of simply BALANCE state. 2. When a running balance is paused. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: change root to fs_info for btrfs_reserve_metadata_bytes
We used to need the root for btrfs_reserve_metadata_bytes to check the orphan cleanup state, but we no longer need that, we simply need the fs_info. Change btrfs_reserve_metadata_bytes() to use the fs_info, and change both btrfs_block_rsv_refill() and btrfs_block_rsv_add() to do the same as they simply call btrfs_reserve_metadata_bytes() and then manipulate the block_rsv that is being used. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: get rid of root->orphan_cleanup_state
Now that we don't care about the stage of the orphan_cleanup_state, simply replace it with a bit on ->state to make sure we don't call the orphan cleanup every time we wander into this root. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: remove global rsv stealing logic for orphan cleanup
This is very old code before we were stealing from the global reserve during evict. We have proper ways to steal from the global reserve while we're evicting, so rip out this code as it's no longer necessary. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: make BTRFS_RESERVE_FLUSH_EVICT use the global rsv stealing code
I forgot to convert this over when I introduced the global reserve stealing code to the space flushing code. Evict was simply trying to make its reservation and then if it failed it would steal from the global rsv, which is racey because it's outside of the normal ticketing code. Fix this by setting ticket->steal if we are BTRFS_RESERVE_FLUSH_EVICT, and then make the priority flushing path do the steal for us. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: check ticket->steal in steal_from_global_block_rsv
We're going to use this helper in the priority flushing loop, move this check into the helper to simplify the logic. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: check for priority ticket granting before flushing
Since we're dropping locks before we enter the priority flushing loops we could have had our ticket granted before we got the space_info->lock. So add this check to avoid doing some extra flushing in the priority flushing cases. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: handle priority ticket failures in their respective helpers
Currently the error case for the priority tickets is handled where we deal with all of the tickets, priority and non-priority. This is ok in general, but it makes for some awkward locking. We take and drop the space_info->lock back to back because of these different types of tickets. Rework the code to handle priority ticket failures in their respective helpers. This allows us to be less wonky with our space_info->lock usage, and means that the main handler simply has to check ticket->error, as the ticket is guaranteed to be off any list and completely handled by the time it exits one of the handlers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: deprecate BTRFS_IOC_BALANCE ioctl
The v2 balance ioctl has been introduced more than 9 years ago. Users of the old v1 ioctl should have long been migrated to it. It's time we deprecate it and eventually remove it. The only known user is in btrfs-progs that tries v1 as a fallback in case v2 is not supported. This is not necessary anymore. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: make 1-bit bit-fields of scrub_page unsigned int
The bitfields have_csum and io_error are currently signed which is not recommended as the representation is an implementation defined behaviour. Fix this by making the bit-fields unsigned ints. Fixes: 2c36395 ("btrfs: scrub: remove the anonymous structure from scrub_page") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: check-integrity: fix a warning on write caching disabled disk
When a disk has write caching disabled, we skip submission of a bio with flush and sync requests before writing the superblock, since it's not needed. However when the integrity checker is enabled, this results in reports that there are metadata blocks referred by a superblock that were not properly flushed. So don't skip the bio submission only when the integrity checker is enabled for the sake of simplicity, since this is a debug tool and not meant for use in non-debug builds. fstests/btrfs/220 trigger a check-integrity warning like the following when CONFIG_BTRFS_FS_CHECK_INTEGRITY=y and the disk with WCE=0. btrfs: attempt to write superblock which references block M @5242880 (sdb2/5242880/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)! ------------[ cut here ]------------ WARNING: CPU: 28 PID: 843680 at fs/btrfs/check-integrity.c:2196 btrfsic_process_written_superblock+0x22a/0x2a0 [btrfs] CPU: 28 PID: 843680 Comm: umount Not tainted 5.15.0-0.rc5.39.el8.x86_64 #1 Hardware name: Dell Inc. Precision T7610/0NK70N, BIOS A18 09/11/2019 RIP: 0010:btrfsic_process_written_superblock+0x22a/0x2a0 [btrfs] RSP: 0018:ffffb642afb47940 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000 RDX: 00000000ffffffff RSI: ffff8b722fc97d00 RDI: ffff8b722fc97d00 RBP: ffff8b5601c00000 R08: 0000000000000000 R09: c0000000ffff7fff R10: 0000000000000001 R11: ffffb642afb476f8 R12: ffffffffffffffff R13: ffffb642afb47974 R14: ffff8b5499254c00 R15: 0000000000000003 FS: 00007f00a06d4080(0000) GS:ffff8b722fc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fff5cff5ff0 CR3: 00000001c0c2a006 CR4: 00000000001706e0 Call Trace: btrfsic_process_written_block+0x2f7/0x850 [btrfs] __btrfsic_submit_bio.part.19+0x310/0x330 [btrfs] ? bio_associate_blkg_from_css+0xa4/0x2c0 btrfsic_submit_bio+0x18/0x30 [btrfs] write_dev_supers+0x81/0x2a0 [btrfs] ? find_get_pages_range_tag+0x219/0x280 ? pagevec_lookup_range_tag+0x24/0x30 ? __filemap_fdatawait_range+0x6d/0xf0 ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e ? find_first_extent_bit+0x9b/0x160 [btrfs] ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e write_all_supers+0x1b3/0xa70 [btrfs] ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e btrfs_commit_transaction+0x59d/0xac0 [btrfs] close_ctree+0x11d/0x339 [btrfs] generic_shutdown_super+0x71/0x110 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0xb8/0x140 task_work_run+0x6d/0xb0 exit_to_user_mode_prepare+0x1f0/0x200 syscall_exit_to_user_mode+0x12/0x30 do_syscall_64+0x46/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f009f711dfb RSP: 002b:00007fff5cff7928 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 000055b68c6c9970 RCX: 00007f009f711dfb RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055b68c6c9b50 RBP: 0000000000000000 R08: 000055b68c6ca900 R09: 00007f009f795580 R10: 0000000000000000 R11: 0000000000000246 R12: 000055b68c6c9b50 R13: 00007f00a04bf184 R14: 0000000000000000 R15: 00000000ffffffff ---[ end trace 2c4b82abcef9eec4 ]--- S-65536(sdb2/65536/1) --> M-1064960(sdb2/1064960/1) Reviewed-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Wang Yugui <wangyugui@e16-tech.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: silence lockdep when reading chunk tree during mount
Often some test cases like btrfs/161 trigger lockdep splats that complain about possible unsafe lock scenario due to the fact that during mount, when reading the chunk tree we end up calling blkdev_get_by_path() while holding a read lock on a leaf of the chunk tree. That produces a lockdep splat like the following: [ 3653.683975] ====================================================== [ 3653.685148] WARNING: possible circular locking dependency detected [ 3653.686301] 5.15.0-rc7-btrfs-next-103 #1 Not tainted [ 3653.687239] ------------------------------------------------------ [ 3653.688400] mount/447465 is trying to acquire lock: [ 3653.689320] ffff8c6b0c76e528 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.691054] but task is already holding lock: [ 3653.692155] ffff8c6b0a9f39e0 (btrfs-chunk-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 3653.693978] which lock already depends on the new lock. [ 3653.695510] the existing dependency chain (in reverse order) is: [ 3653.696915] -> #3 (btrfs-chunk-00){++++}-{3:3}: [ 3653.698053] down_read_nested+0x4b/0x140 [ 3653.698893] __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 3653.699988] btrfs_read_lock_root_node+0x31/0x40 [btrfs] [ 3653.701205] btrfs_search_slot+0x537/0xc00 [btrfs] [ 3653.702234] btrfs_insert_empty_items+0x32/0x70 [btrfs] [ 3653.703332] btrfs_init_new_device+0x563/0x15b0 [btrfs] [ 3653.704439] btrfs_ioctl+0x2110/0x3530 [btrfs] [ 3653.705405] __x64_sys_ioctl+0x83/0xb0 [ 3653.706215] do_syscall_64+0x3b/0xc0 [ 3653.706990] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 3653.708040] -> #2 (sb_internal#2){.+.+}-{0:0}: [ 3653.708994] lock_release+0x13d/0x4a0 [ 3653.709533] up_write+0x18/0x160 [ 3653.710017] btrfs_sync_file+0x3f3/0x5b0 [btrfs] [ 3653.710699] __loop_update_dio+0xbd/0x170 [loop] [ 3653.711360] lo_ioctl+0x3b1/0x8a0 [loop] [ 3653.711929] block_ioctl+0x48/0x50 [ 3653.712442] __x64_sys_ioctl+0x83/0xb0 [ 3653.712991] do_syscall_64+0x3b/0xc0 [ 3653.713519] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 3653.714233] -> #1 (&lo->lo_mutex){+.+.}-{3:3}: [ 3653.715026] __mutex_lock+0x92/0x900 [ 3653.715648] lo_open+0x28/0x60 [loop] [ 3653.716275] blkdev_get_whole+0x28/0x90 [ 3653.716867] blkdev_get_by_dev.part.0+0x142/0x320 [ 3653.717537] blkdev_open+0x5e/0xa0 [ 3653.718043] do_dentry_open+0x163/0x390 [ 3653.718604] path_openat+0x3f0/0xa80 [ 3653.719128] do_filp_open+0xa9/0x150 [ 3653.719652] do_sys_openat2+0x97/0x160 [ 3653.720197] __x64_sys_openat+0x54/0x90 [ 3653.720766] do_syscall_64+0x3b/0xc0 [ 3653.721285] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 3653.721986] -> #0 (&disk->open_mutex){+.+.}-{3:3}: [ 3653.722775] __lock_acquire+0x130e/0x2210 [ 3653.723348] lock_acquire+0xd7/0x310 [ 3653.723867] __mutex_lock+0x92/0x900 [ 3653.724394] blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.725041] blkdev_get_by_path+0xb8/0xd0 [ 3653.725614] btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs] [ 3653.726332] open_fs_devices+0xd7/0x2c0 [btrfs] [ 3653.726999] btrfs_read_chunk_tree+0x3ad/0x870 [btrfs] [ 3653.727739] open_ctree+0xb8e/0x17bf [btrfs] [ 3653.728384] btrfs_mount_root.cold+0x12/0xde [btrfs] [ 3653.729130] legacy_get_tree+0x30/0x50 [ 3653.729676] vfs_get_tree+0x28/0xc0 [ 3653.730192] vfs_kern_mount.part.0+0x71/0xb0 [ 3653.730800] btrfs_mount+0x11d/0x3a0 [btrfs] [ 3653.731427] legacy_get_tree+0x30/0x50 [ 3653.731970] vfs_get_tree+0x28/0xc0 [ 3653.732486] path_mount+0x2d4/0xbe0 [ 3653.732997] __x64_sys_mount+0x103/0x140 [ 3653.733560] do_syscall_64+0x3b/0xc0 [ 3653.734080] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 3653.734782] other info that might help us debug this: [ 3653.735784] Chain exists of: &disk->open_mutex --> sb_internal#2 --> btrfs-chunk-00 [ 3653.737123] Possible unsafe locking scenario: [ 3653.737865] CPU0 CPU1 [ 3653.738435] ---- ---- [ 3653.739007] lock(btrfs-chunk-00); [ 3653.739449] lock(sb_internal#2); [ 3653.740193] lock(btrfs-chunk-00); [ 3653.740955] lock(&disk->open_mutex); [ 3653.741431] *** DEADLOCK *** [ 3653.742176] 3 locks held by mount/447465: [ 3653.742739] #0: ffff8c6acf85c0e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xd5/0x3b0 [ 3653.744114] #1: ffffffffc0b28f70 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x59/0x870 [btrfs] [ 3653.745563] #2: ffff8c6b0a9f39e0 (btrfs-chunk-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs] [ 3653.747066] stack backtrace: [ 3653.747723] CPU: 4 PID: 447465 Comm: mount Not tainted 5.15.0-rc7-btrfs-next-103 #1 [ 3653.748873] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 3653.750592] Call Trace: [ 3653.750967] dump_stack_lvl+0x57/0x72 [ 3653.751526] check_noncircular+0xf3/0x110 [ 3653.752136] ? stack_trace_save+0x4b/0x70 [ 3653.752748] __lock_acquire+0x130e/0x2210 [ 3653.753356] lock_acquire+0xd7/0x310 [ 3653.753898] ? blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.754596] ? lock_is_held_type+0xe8/0x140 [ 3653.755125] ? blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.755729] ? blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.756338] __mutex_lock+0x92/0x900 [ 3653.756794] ? blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.757400] ? do_raw_spin_unlock+0x4b/0xa0 [ 3653.757930] ? _raw_spin_unlock+0x29/0x40 [ 3653.758437] ? bd_prepare_to_claim+0x129/0x150 [ 3653.758999] ? trace_module_get+0x2b/0xd0 [ 3653.759508] ? try_module_get.part.0+0x50/0x80 [ 3653.760072] blkdev_get_by_dev.part.0+0xe7/0x320 [ 3653.760661] ? devcgroup_check_permission+0xc1/0x1f0 [ 3653.761288] blkdev_get_by_path+0xb8/0xd0 [ 3653.761797] btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs] [ 3653.762454] open_fs_devices+0xd7/0x2c0 [btrfs] [ 3653.763055] ? clone_fs_devices+0x8f/0x170 [btrfs] [ 3653.763689] btrfs_read_chunk_tree+0x3ad/0x870 [btrfs] [ 3653.764370] ? kvm_sched_clock_read+0x14/0x40 [ 3653.764922] open_ctree+0xb8e/0x17bf [btrfs] [ 3653.765493] ? super_setup_bdi_name+0x79/0xd0 [ 3653.766043] btrfs_mount_root.cold+0x12/0xde [btrfs] [ 3653.766780] ? rcu_read_lock_sched_held+0x3f/0x80 [ 3653.767488] ? kfree+0x1f2/0x3c0 [ 3653.767979] legacy_get_tree+0x30/0x50 [ 3653.768548] vfs_get_tree+0x28/0xc0 [ 3653.769076] vfs_kern_mount.part.0+0x71/0xb0 [ 3653.769718] btrfs_mount+0x11d/0x3a0 [btrfs] [ 3653.770381] ? rcu_read_lock_sched_held+0x3f/0x80 [ 3653.771086] ? kfree+0x1f2/0x3c0 [ 3653.771574] legacy_get_tree+0x30/0x50 [ 3653.772136] vfs_get_tree+0x28/0xc0 [ 3653.772673] path_mount+0x2d4/0xbe0 [ 3653.773201] __x64_sys_mount+0x103/0x140 [ 3653.773793] do_syscall_64+0x3b/0xc0 [ 3653.774333] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 3653.775094] RIP: 0033:0x7f648bc45aaa This happens because through btrfs_read_chunk_tree(), which is called only during mount, ends up acquiring the mutex open_mutex of a block device while holding a read lock on a leaf of the chunk tree while other paths need to acquire other locks before locking extent buffers of the chunk tree. Since at mount time when we call btrfs_read_chunk_tree() we know that we don't have other tasks running in parallel and modifying the chunk tree, we can simply skip locking of chunk tree extent buffers. So do that and move the assertion that checks the fs is not yet mounted to the top block of btrfs_read_chunk_tree(), with a comment before doing it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> -
btrfs: fix memory ordering between normal and ordered work functions
Ordered work functions aren't guaranteed to be handled by the same thread which executed the normal work functions. The only way execution between normal/ordered functions is synchronized is via the WORK_DONE_BIT, unfortunately the used bitops don't guarantee any ordering whatsoever. This manifested as seemingly inexplicable crashes on ARM64, where async_chunk::inode is seen as non-null in async_cow_submit which causes submit_compressed_extents to be called and crash occurs because async_chunk::inode suddenly became NULL. The call trace was similar to: pc : submit_compressed_extents+0x38/0x3d0 lr : async_cow_submit+0x50/0xd0 sp : ffff800015d4bc20 <registers omitted for brevity> Call trace: submit_compressed_extents+0x38/0x3d0 async_cow_submit+0x50/0xd0 run_ordered_work+0xc8/0x280 btrfs_work_helper+0x98/0x250 process_one_work+0x1f0/0x4ac worker_thread+0x188/0x504 kthread+0x110/0x114 ret_from_fork+0x10/0x18 Fix this by adding respective barrier calls which ensure that all accesses preceding setting of WORK_DONE_BIT are strictly ordered before setting the flag. At the same time add a read barrier after reading of WORK_DONE_BIT in run_ordered_work which ensures all subsequent loads would be strictly ordered after reading the bit. This in turn ensures are all accesses before WORK_DONE_BIT are going to be strictly ordered before any access that can occur in ordered_func. Reported-by: Chris Murphy <lists@colorremedies.com> Fixes: 08a9ff3 ("btrfs: Added btrfs_workqueue_struct implemented ordered execution based on kernel workqueue") CC: stable@vger.kernel.org # 4.4+ Link: https://bugzilla.redhat.com/show_bug.cgi?id=2011928 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Chris Murphy <chris@colorremedies.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> -
btrfs: fix a out-of-bound access in copy_compressed_data_to_page()
[BUG] The following script can cause btrfs to crash: $ mount -o compress-force=lzo $DEV /mnt $ dd if=/dev/urandom of=/mnt/foo bs=4k count=1 $ sync The call trace looks like this: general protection fault, probably for non-canonical address 0xe04b37fccce3b000: 0000 [#1] PREEMPT SMP NOPTI CPU: 5 PID: 164 Comm: kworker/u20:3 Not tainted 5.15.0-rc7-custom+ #4 Workqueue: btrfs-delalloc btrfs_work_helper [btrfs] RIP: 0010:__memcpy+0x12/0x20 Call Trace: lzo_compress_pages+0x236/0x540 [btrfs] btrfs_compress_pages+0xaa/0xf0 [btrfs] compress_file_range+0x431/0x8e0 [btrfs] async_cow_start+0x12/0x30 [btrfs] btrfs_work_helper+0xf6/0x3e0 [btrfs] process_one_work+0x294/0x5d0 worker_thread+0x55/0x3c0 kthread+0x140/0x170 ret_from_fork+0x22/0x30 ---[ end trace 63c3c0f131e61982 ]--- [CAUSE] In lzo_compress_pages(), parameter @out_pages is not only an output parameter (for the number of compressed pages), but also an input parameter, as the upper limit of compressed pages we can utilize. In commit d408880 ("btrfs: subpage: make lzo_compress_pages() compatible"), the refactoring doesn't take @out_pages as an input, thus completely ignoring the limit. And for compress-force case, we could hit incompressible data that compressed size would go beyond the page limit, and cause the above crash. [FIX] Save @out_pages as @max_nr_page, and pass it to lzo_compress_pages(), and check if we're beyond the limit before accessing the pages. Note: this also fixes crash on 32bit architectures that was suspected to be caused by merge of btrfs patches to 5.16-rc1. Reported in https://lore.kernel.org/all/20211104115001.GU20319@twin.jikos.cz/ . Reported-by: Omar Sandoval <osandov@fb.com> Fixes: d408880 ("btrfs: subpage: make lzo_compress_pages() compatible") Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: fix a out-of-boundary access for copy_compressed_data_to_page()
[BUG] The following script can cause btrfs to crash: mount -o compress-force=lzo $DEV /mnt dd if=/dev/urandom of=/mnt/foo bs=4k count=1 sync The calltrace looks like this: general protection fault, probably for non-canonical address 0xe04b37fccce3b000: 0000 [#1] PREEMPT SMP NOPTI CPU: 5 PID: 164 Comm: kworker/u20:3 Not tainted 5.15.0-rc7-custom+ #4 Workqueue: btrfs-delalloc btrfs_work_helper [btrfs] RIP: 0010:__memcpy+0x12/0x20 Call Trace: lzo_compress_pages+0x236/0x540 [btrfs] btrfs_compress_pages+0xaa/0xf0 [btrfs] compress_file_range+0x431/0x8e0 [btrfs] async_cow_start+0x12/0x30 [btrfs] btrfs_work_helper+0xf6/0x3e0 [btrfs] process_one_work+0x294/0x5d0 worker_thread+0x55/0x3c0 kthread+0x140/0x170 ret_from_fork+0x22/0x30 ---[ end trace 63c3c0f131e61982 ]--- [CAUSE] In lzo_compress_pages(), parameter @out_pages is not only an output parameter (for the number of compressed pages), but also an input parameter, as the upper limit of compressed pages we can utilize. In commit d408880 ("btrfs: subpage: make lzo_compress_pages() compatible"), the refactor doesn't take @out_pages as an input, thus completely ignoring the limit. And for compress-force case, we could hit incompressible data that compressed size would go beyond the page limit, and cause above crash. [FIX] Save @out_pages as @max_nr_page, and pass it to lzo_compress_pages(), and check if we're beyond the limit before accessing the pages. Reported-by: Omar Sandoval <osandov@fb.com> Fixes: d408880 ("btrfs: subpage: make lzo_compress_pages() compatible") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
Commits on Nov 15, 2021
-
btrfs: cache reported zone during mount
When mounting a device, we are reporting the zones twice: once for checking the zone attributes in btrfs_get_dev_zone_info and once for loading block groups' zone info in btrfs_load_block_group_zone_info(). With a lot of block groups, that leads to a lot of REPORT ZONE commands and slows down the mount process. This patch introduces a zone info cache in struct btrfs_zoned_device_info. The cache is populated while in btrfs_get_dev_zone_info() and used for btrfs_load_block_group_zone_info() to reduce the number of REPORT ZONE commands. The zone cache is then released after loading the block groups, as it will not be much effective during the run time. Benchmark: Mount an HDD with 57,007 block groups Before patch: 171.368 seconds After patch: 64.064 seconds While it still takes a minute due to the slowness of loading all the block groups, the patch reduces the mount time by 1/3. Link: https://lore.kernel.org/linux-btrfs/CAHQ7scUiLtcTqZOMMY5kbWUBOhGRwKo6J6wYPT5WY+C=cD49nQ@mail.gmail.com/ Fixes: 5b31646 ("btrfs: get zone information of zoned block devices") CC: stable@vger.kernel.org Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: remove unused parameter fs_devices from btrfs_init_workqueues
Since commit ba8a9d0 ("Btrfs: delete the entire async bio submission framework") removed submit workqueues, the parameter fs_devices is not used anymore. Remove it, no functional changes. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Su Yue <l@damenly.su> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: deprecate BTRFS_IOC_BALANCE ioctl
The v2 balance ioctl has been introduced more than 9 years ago. Users of the old v1 ioctl should have long been migrated to it. It's time we deprecate it and eventually remove it. The only known user is in btrfs-progs that tries v1 as a fallback in case v2 is not supported. This is not necessary anymore. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: make 1-bit bit-fields of scrub_page unsigned int
The bitfields have_csum and io_error are currently signed which is not recommended as the representation is an implementation defined behaviour. Fix this by making the bit-fields unsigned ints. Fixes: 2c36395 ("btrfs: scrub: remove the anonymous structure from scrub_page") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: reduce the scope of the tree log mutex during transaction commit
In the transaction commit path we are acquiring the tree log mutex too early and we have a stale comment because: 1) It mentions a function named btrfs_commit_tree_roots(), which does not exists anymore, it was the old name of commit_cowonly_roots(), renamed a very long time ago by commit 5d4f98a ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)")); 2) It mentions that we need to acquire the tree log mutex at that point to ensure we have no running log writers. That is not correct anymore, for many years at least, since we are guaranteed that we do not have any log writers at that point simply because we have set the state of the transaction to TRANS_STATE_COMMIT_DOING and have waited for all writers to complete - meaning no one can log until we change the state of the transaction to TRANS_STATE_UNBLOCKED. Any attempts to join the transaction or start a new one will block until we do that state transition; 3) The comment mentions a "trans mutex" which doesn't exists since 2011, commit a4abeea ("Btrfs: kill trans_mutex") removed it; 4) The current use of the tree log mutex is to ensure proper serialization of super block writes - if someone started a new transaction and uses it for logging, it will wait for the previous transaction to write its super block before writing the super block when attempting to sync the log. So acquire the tree log mutex only when it's absolutely needed, before setting the transaction state to TRANS_STATE_UNBLOCKED, fix and move the stale comment, add some assertions and new comments where appropriate. Also, this has no effect on concurrency or performance, since the new start of the critical section is still when the transaction is in the state TRANS_STATE_COMMIT_DOING. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: consolidate device_list_mutex in prepare_sprout to its parent
btrfs_prepare_sprout() splices seed devices into its own struct fs_devices, so that its parent function btrfs_init_new_device() can add the new sprout device to fs_info->fs_devices. Both btrfs_prepare_sprout() and btrfs_init_new_device() need device_list_mutex. But they are holding it separately, thus create a small race window. Close it and hold device_list_mutex across both functions btrfs_init_new_device() and btrfs_prepare_sprout(). Split btrfs_prepare_sprout() into btrfs_init_sprout() and btrfs_setup_sprout(). This split is essential because device_list_mutex must not be held for allocations in btrfs_init_sprout() but must be held for btrfs_setup_sprout(). So now a common device_list_mutex can be used between btrfs_init_new_device() and btrfs_setup_sprout(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: switch seeding_dev in init_new_device to bool
Declare int seeding_dev as a bool. Also, move its declaration a line below to adjust packing. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: send: remove unused type parameter to iterate_inode_ref_t
Again, I don't think this was ever used since iterate_dir_item() is only used for xattrs. No functional change. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: send: remove unused found_type parameter to lookup_dir_item_in…
…ode() As far as I can tell, this was never used. No functional change. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
btrfs: rename btrfs_item_end_nr to btrfs_item_data_end
The name btrfs_item_end_nr() is a bit of a misnomer, as it's actually the offset of the end of the data the item points to. In fact all of the helpers that we use btrfs_item_end_nr() use data in their name, like BTRFS_LEAF_DATA_SIZE() and leaf_data(). Rename to btrfs_item_data_end() to make it clear what this helper is giving us. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>