Qu-Wenruo/Btrf…

Commits on Mar 30, 2016

  1. btrfs: dedupe: Preparation for compress-dedupe co-work

    For dedupe to work with compression, new members recording compression
    algorithm and on-disk extent length are needed.
    
    Add them for later compress-dedupe co-work.
    
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Qu Wenruo authored and fengguang committed Mar 30, 2016
  2. btrfs: dedupe: Add support for adding hash for on-disk backend

    The on-disk backend can now add hashes.
    
    Since all needed on-disk backend functions have been added, also allow
    the on-disk backend to be used, by changing DEDUPE_BACKEND_COUNT from 1
    (in-memory only) to 2 (in-memory + on-disk).
    
    Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Qu Wenruo authored and fengguang committed Mar 30, 2016
  3. btrfs: dedupe: Add support to delete hash for on-disk backend

    The on-disk backend can now delete hashes.
    
    Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Qu Wenruo authored and fengguang committed Mar 30, 2016
  4. btrfs: dedupe: Add support for on-disk hash search

    The on-disk backend should now be able to search for hashes.
    
    Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Qu Wenruo authored and fengguang committed Mar 30, 2016
  5. btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info

    Since we will introduce a new on-disk based dedupe method, introduce new
    interfaces to resume previous dedupe setup.
    
    And since we introduce a new tree for status, also add disable handler
    for it.
    
    Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Qu Wenruo authored and fengguang committed Mar 30, 2016
  6. btrfs: dedupe: Add basic tree structure for on-disk dedupe method

    Introduce a new tree, the dedupe tree, to record on-disk dedupe hashes.
    It serves as persistent hash storage instead of the in-memory-only
    implementation.
    
    Unlike Liu Bo's implementation, this version does not use a hack for
    bytenr -> hash search, but adds a new type, DEDUP_BYTENR_ITEM, for that
    search case, just like the in-memory backend.
    
    Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Qu Wenruo authored and fengguang committed Mar 30, 2016
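
The two lookup directions named in the commit above (hash to extent, and bytenr back to hash) can be pictured as two indexes that are kept in step. The following is a minimal Python sketch of that idea only; it is not the kernel's on-disk item layout, and `DedupeIndex` and its method names are hypothetical.

```python
class DedupeIndex:
    """Toy model: one index keyed by hash, one keyed by bytenr."""

    def __init__(self):
        self.by_hash = {}    # hash -> bytenr
        self.by_bytenr = {}  # bytenr -> hash (stands in for a bytenr item)

    def add(self, digest, bytenr):
        # Refuse duplicates on either key so both indexes stay consistent.
        if digest in self.by_hash or bytenr in self.by_bytenr:
            return False
        self.by_hash[digest] = bytenr
        self.by_bytenr[bytenr] = digest
        return True

    def search(self, digest):
        # Returns the bytenr recorded for this hash, or None on a miss.
        return self.by_hash.get(digest)

    def delete_by_bytenr(self, bytenr):
        # The bytenr index lets us find the hash to drop without a scan.
        digest = self.by_bytenr.pop(bytenr, None)
        if digest is None:
            return False
        del self.by_hash[digest]
        return True
```

Keeping the reverse index is what makes deletion by extent address cheap: without it, removing the hash for a freed extent would need a scan of every stored hash.
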
  7. btrfs: dedupe: add per-file online dedupe control

    Introduce inode_need_dedupe() to implement per-file online dedupe control.
    
    Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
    wangxiaoguang authored and fengguang committed Mar 30, 2016
  8. btrfs: dedupe: add a property handler for online dedupe

    We use btrfs extended attribute "btrfs.dedupe" to record per-file online
    dedupe status, so add a dedupe property handler.
    
    Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
    wangxiaoguang authored and fengguang committed Mar 30, 2016
  9. btrfs: dedupe: add an inode nodedupe flag

    Introduce the BTRFS_INODE_NODEDUP flag, so that online data
    deduplication can be explicitly disabled for specified files.
    
    Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
    wangxiaoguang authored and fengguang committed Mar 30, 2016
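
The per-file control introduced by the three commits above comes down to a yes/no decision per inode. A rough Python sketch of such a decision follows. It assumes the check combines a filesystem-wide switch with the per-inode flag; the bit value and the function signature are hypothetical and are not taken from the patches.

```python
INODE_NODEDUP = 1 << 0  # hypothetical bit value, for illustration only


def inode_need_dedupe(fs_dedupe_enabled, inode_flags):
    """Return True if writes to this inode should go through dedupe."""
    if not fs_dedupe_enabled:
        # Dedupe is off for the whole filesystem.
        return False
    # The per-file flag opts a single file out.
    return not (inode_flags & INODE_NODEDUP)
```
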
  10. btrfs: dedupe: Add ioctl for inband deduplication

    Add ioctl interface for inband deduplication, which includes:
    1) enable
    2) disable
    3) status
    
    And a pseudo RO compat flag, to indicate that btrfs now supports inband
    dedupe.
    However, no on-disk format change is added; it's just a pseudo RO
    compat flag.
    
    All these ioctl interfaces are stateless, which means the caller doesn't
    need to care about the previous dedupe state before calling them, only
    about the final desired state.
    
    For example, if the user wants to enable dedupe with a specified block
    size and limit, just fill the ioctl structure and call the enable ioctl.
    There is no need to check whether dedupe is already running.
    
    These ioctls also handle re-configuration and disabling.
    
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
    wangxiaoguang authored and fengguang committed Mar 30, 2016
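
The "stateless" behaviour described above means each call sets the final desired state regardless of what came before. A small Python sketch of that contract follows; `DedupeControl` and `DedupeState` are hypothetical names, and this models only the semantics, not the real ioctl structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DedupeState:
    enabled: bool = False
    block_size: Optional[int] = None
    limit: Optional[int] = None


class DedupeControl:
    def __init__(self):
        self.state = DedupeState()

    def enable(self, block_size, limit):
        # Same result whether dedupe was off, on with these settings,
        # or on with different settings (re-configure).
        self.state = DedupeState(True, block_size, limit)
        return self.state

    def disable(self):
        # Disabling an already disabled setup is not an error.
        self.state = DedupeState()
        return self.state

    def status(self):
        return self.state
```

A caller can therefore issue `enable(...)` repeatedly with the parameters it wants and never needs a "is it already running?" query first.
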
  11. btrfs: dedupe: Inband in-memory only de-duplication implement

    Core implementation for inband de-duplication.
    It reuses the async_cow_start() facility to calculate the dedupe hash,
    and uses the dedupe hash to do inband de-duplication at extent level.
    
    The workflow is as follows:
    1) Run delalloc range for an inode
    2) Calculate hash for the delalloc range in units of dedupe_bs
    3) For the hash match (duplicated) case, just increase the source
       extent ref and insert the file extent.
       For the hash mismatch case, go through the normal cow_file_range()
       fallback, and add the hash into dedupe_tree.
       Compression for the hash miss case is not supported yet.
    
    The current implementation stores all dedupe hashes in an in-memory
    rb-tree, with LRU behavior to enforce the limit.
    
    Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Qu Wenruo authored and fengguang committed Mar 30, 2016
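
The workflow in the commit above (hash each dedupe_bs-sized block, share the extent on a hit, record the hash on a miss, drop the least recently used hash when over the limit) can be sketched in user-space Python. This is only an illustration under simplifying assumptions: an `OrderedDict` stands in for the rb-tree plus LRU list, extents are plain integers, and the class and attribute names are hypothetical.

```python
import hashlib
from collections import OrderedDict


class InMemoryDedupe:
    def __init__(self, block_size, limit):
        self.block_size = block_size
        self.limit = limit           # maximum number of hashes kept
        self.hashes = OrderedDict()  # digest -> extent id, oldest first
        self.refs = {}               # extent id -> reference count
        self.next_extent = 0

    def write(self, data):
        """Return the extent ids backing each block of `data`."""
        extents = []
        for off in range(0, len(data), self.block_size):
            block = data[off:off + self.block_size]
            digest = hashlib.sha256(block).digest()
            extent = self.hashes.get(digest)
            if extent is not None:
                # Hash hit: share the existing extent, bump its ref.
                self.refs[extent] += 1
                self.hashes.move_to_end(digest)
            else:
                # Hash miss: allocate a new extent and remember its hash.
                extent = self.next_extent
                self.next_extent += 1
                self.refs[extent] = 1
                self.hashes[digest] = extent
                if len(self.hashes) > self.limit:
                    # Over the limit: forget the least recently used hash.
                    self.hashes.popitem(last=False)
            extents.append(extent)
        return extents
```

Evicting a hash does not free the extent it pointed to; it only means a later identical block will no longer be recognised as a duplicate. Note that this sketch also hashes a short trailing block, which a real implementation may prefer to skip.
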
  12. btrfs: ordered-extent: Add support for dedupe

    Add ordered-extent support for dedupe.
    
    Note, current ordered-extent support only supports non-compressed source
    extent.
    Support for compressed source extent will be added later.
    
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
    wangxiaoguang authored and fengguang committed Mar 30, 2016
  13. btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface

    Unlike the dedupe method, which can be in-memory or on-disk, only the
    SHA256 hash method is supported so far, so implement the
    btrfs_dedupe_calc_hash() interface using SHA256.
    
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
    wangxiaoguang authored and fengguang committed Mar 30, 2016
  14. btrfs: dedupe: Introduce function to search for an existing hash

    Introduce static function inmem_search() to handle the job for in-memory
    hash tree.
    
    The trick is, we must ensure the delayed ref head is not being run at
    the time we search for the hash.
    
    With inmem_search(), we can implement the btrfs_dedupe_search()
    interface.
    
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
    wangxiaoguang authored and fengguang committed Mar 30, 2016
  15. btrfs: delayed-ref: Add support for increasing data ref under spinlock

    For in-band dedupe, btrfs needs to increase data ref with delayed_ref
    locked, so add a new function btrfs_add_delayed_data_ref_lock() to
    increase extent ref with delayed_refs already locked.
    
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Qu Wenruo authored and fengguang committed Mar 30, 2016
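
The pattern in the commit above, a second entry point for callers that already hold the lock, is common wherever a lookup and an update must happen under one critical section. A hedged Python sketch with a `threading.Lock` follows; the class and method names are hypothetical and only mirror the idea, not the kernel function's signature.

```python
import threading


class DelayedRefs:
    def __init__(self):
        self.lock = threading.Lock()
        self.refs = {}  # bytenr -> pending reference count increase

    def add_data_ref_locked(self, bytenr):
        # Caller must already hold self.lock. The assert only verifies
        # that the lock is held by someone, not by the current thread.
        assert self.lock.locked()
        self.refs[bytenr] = self.refs.get(bytenr, 0) + 1

    def add_data_ref(self, bytenr):
        # Ordinary entry point: take the lock, then do the work.
        with self.lock:
            self.add_data_ref_locked(bytenr)

    def search_and_ref(self, index, digest):
        # Lookup and ref increase happen in one critical section, so
        # the extent cannot be processed between the two steps.
        with self.lock:
            bytenr = index.get(digest)
            if bytenr is not None:
                self.add_data_ref_locked(bytenr)
            return bytenr
```

Calling `add_data_ref()` from inside `search_and_ref()` instead would deadlock, since `threading.Lock` is not re-entrant; that is the reason for having the `_locked` variant.
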
  16. btrfs: dedupe: Introduce function to remove hash from in-memory tree

    Introduce static function inmem_del() to remove hash from in-memory
    dedupe tree.
    And implement btrfs_dedupe_del() and btrfs_dedup_destroy() interfaces.
    
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
    wangxiaoguang authored and fengguang committed Mar 30, 2016
  17. btrfs: dedupe: Introduce function to add hash into in-memory tree

    Introduce static function inmem_add() to add hash into in-memory tree.
    And now we can implement the btrfs_dedupe_add() interface.
    
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
    wangxiaoguang authored and fengguang committed Mar 30, 2016
  18. btrfs: dedupe: Introduce function to initialize dedupe info

    Add generic function to initialize dedupe info.
    
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
    wangxiaoguang authored and fengguang committed Mar 30, 2016
  19. btrfs: dedupe: Introduce dedupe framework and its header

    Introduce the header for the btrfs online (write time) de-duplication
    framework.
    
    The new de-duplication framework is going to support 2 different dedupe
    methods and 1 dedupe hash.
    
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
    wangxiaoguang authored and fengguang committed Mar 30, 2016

Commits on Mar 14, 2016

  1. btrfs: Fix misspellings in comments.

    Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adambuchbinder authored and kdave committed Mar 14, 2016
  2. btrfs: Print Warning only if ENOSPC_DEBUG is enabled

    Don't print a warning for ENOSPC errors unless ENOSPC_DEBUG is enabled.
    Use btrfs_debug if it is enabled.
    
    Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
    [ preserve the WARN_ON ]
    Signed-off-by: David Sterba <dsterba@suse.com>
    Ashish Samant authored and kdave committed Mar 14, 2016

Commits on Mar 11, 2016

  1. btrfs: scrub: silence an uninitialized variable warning

    It's basically harmless if "ref_level" isn't initialized since it's only
    used for an error message, but it causes a static checker warning.
    
    Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    error27 authored and kdave committed Mar 11, 2016
  2. btrfs: move btrfs_compression_type to compression.h

    So that it's better organized.
    
    Signed-off-by: Anand Jain <anand.jain@oracle.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    asj authored and kdave committed Mar 11, 2016
  3. btrfs: rename btrfs_print_info to btrfs_print_mod_info

    So that it indicates what it does.
    
    Signed-off-by: Anand Jain <anand.jain@oracle.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    asj authored and kdave committed Mar 11, 2016
  4. Btrfs: Show a warning message if one of objectid reaches its highest value

    It's better to show a warning message for the exceptional case
    that one of the objectids (in most cases, an inode number) reaches its
    highest value. For example, if inode cache is off and this event
    happens, we can't create any file even if there are not so many files.
    This message eases detecting such a problem.
    
    Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Satoru Takeuchi authored and kdave committed Mar 11, 2016
  5. Documentation: btrfs: remove usage specific information

    The document in the kernel sources is yet another place where the
    documentation would need to be updated, while it is not the primary
    source. We actively maintain the wiki pages.
    
    Signed-off-by: David Sterba <dsterba@suse.com>
    kdave committed Mar 11, 2016
  6. btrfs: use kbasename in btrfsic_mount

    This is more readable.
    
    Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Villemoes authored and kdave committed Mar 11, 2016

Commits on Mar 1, 2016

  1. Btrfs: do not collect ordered extents when logging that inode exists

    When logging that an inode exists, for example as part of a directory
    fsync operation, we were collecting any ordered extents for the inode but
    we ended up doing nothing with them except tagging them as processed, by
    setting the flag BTRFS_ORDERED_LOGGED on them, which prevented a
    subsequent fsync of that inode (using the LOG_INODE_ALL mode) from
    collecting and processing them. This created a time window where a second
    fsync against the inode, using the fast path, ended up not logging the
    checksums for the new extents but it logged the extents since they were
    part of the list of modified extents. This happened because the ordered
    extents were not collected and checksums were not yet added to the csum
    tree - the ordered extents have not gone through btrfs_finish_ordered_io()
    yet (which is where we add them to the csum tree by calling
    inode.c:add_pending_csums()).
    
    So fix this by not collecting an inode's ordered extents if we are logging
    it with the LOG_INODE_EXISTS mode.
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    fdmanana authored and masoncl committed Mar 1, 2016
  2. Btrfs: fix race when checking if we can skip fsync'ing an inode

    If we're about to do a fast fsync for an inode and btrfs_inode_in_log()
    returns false, it's possible that we had an ordered extent in progress
    (btrfs_finish_ordered_io() not run yet) when we noticed that the inode's
    last_trans field was not greater than the id of the last committed
    transaction, but shortly after, before we checked if there were any
    ongoing ordered extents, the ordered extent had just completed and
    removed itself from the inode's ordered tree, in which case we end up not
    logging the inode, losing some data if a power failure or crash happens
    after the fsync handler returns and before the transaction is committed.
    
    Fix this by checking first if there are any ongoing ordered extents
    before comparing the inode's last_trans with the id of the last committed
    transaction - when it completes, an ordered extent always updates the
    inode's last_trans before it removes itself from the inode's ordered
    tree (at btrfs_finish_ordered_io()).
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    fdmanana authored and masoncl committed Mar 1, 2016
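
The race fixed above depends purely on the order of two checks relative to the completion of an ordered extent. The Python sketch below replays that interleaving deterministically. It is a simplified model, not the kernel code: the `between` callback stands for "the ordered extent completes right here", and all names are hypothetical.

```python
class Inode:
    def __init__(self, last_trans):
        self.last_trans = last_trans
        self.ordered = {"oe"}  # one ordered extent still in flight

    def finish_ordered_io(self, transid):
        # Completion updates last_trans first, then leaves the tree.
        self.last_trans = transid
        self.ordered.clear()


def can_skip_fsync_buggy(inode, last_committed, between):
    stale = inode.last_trans <= last_committed
    between()  # window in which the ordered extent may complete
    no_ordered = not inode.ordered
    return stale and no_ordered


def can_skip_fsync_fixed(inode, last_committed, between):
    no_ordered = not inode.ordered
    between()  # same window, but now the order of checks is safe
    stale = inode.last_trans <= last_committed
    return no_ordered and stale
```

With the old order, a completion landing in the window makes both conditions look true and the fsync is skipped. With the new order, either the ordered extent is still seen, or its update of `last_trans` is seen, so the inode is logged in both cases.
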
  3. Btrfs: fix listxattrs not listing all xattrs packed in the same item

    In the listxattrs handler, we were not listing all the xattrs that are
    packed in the same btree item, which happens when multiple xattrs have
    names that produce the same checksum value when hashed with crc32c.
    
    Fix this by processing them all.
    
    The following test case for xfstests reproduces the issue:
    
      seq=`basename $0`
      seqres=$RESULT_DIR/$seq
      echo "QA output created by $seq"
      tmp=/tmp/$$
      status=1	# failure is the default!
      trap "_cleanup; exit \$status" 0 1 2 3 15
    
      _cleanup()
      {
          cd /
          rm -f $tmp.*
      }
    
      # get standard environment, filters and checks
      . ./common/rc
      . ./common/filter
      . ./common/attr
    
      # real QA test starts here
      _supported_fs generic
      _supported_os Linux
      _require_scratch
      _require_attrs
    
      rm -f $seqres.full
    
      _scratch_mkfs >>$seqres.full 2>&1
      _scratch_mount
    
      # Create our test file with a few xattrs. The first 3 xattrs have a name
      # that when given as input to a crc32c function result in the same checksum.
      # This made btrfs list only one of the xattrs through listxattrs system call
      # (because it packs xattrs with the same name checksum into the same btree
      # item).
      touch $SCRATCH_MNT/testfile
      $SETFATTR_PROG -n user.foobar -v 123 $SCRATCH_MNT/testfile
      $SETFATTR_PROG -n user.WvG1c1Td -v qwerty $SCRATCH_MNT/testfile
      $SETFATTR_PROG -n user.J3__T_Km3dVsW_ -v hello $SCRATCH_MNT/testfile
      $SETFATTR_PROG -n user.something -v pizza $SCRATCH_MNT/testfile
      $SETFATTR_PROG -n user.ping -v pong $SCRATCH_MNT/testfile
    
      # Now call getfattr with --dump, which calls the listxattrs system call.
      # It should list all the xattrs we have set before.
      $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/testfile | _filter_scratch
    
      status=0
      exit
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    fdmanana authored and masoncl committed Mar 1, 2016
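
The bug described above is easy to reproduce in miniature: when several names share one hash key they end up in the same item, and a lister that reads only the first entry of each item drops the rest. The Python sketch below uses a deliberately weak stand-in hash (name length) so that collisions are obvious; btrfs itself uses crc32c, and every function name here is hypothetical.

```python
def toy_hash(name):
    # Stand-in hash that collides for names of equal length.
    # This is NOT crc32c; it only makes collisions easy to see.
    return len(name)


def pack(xattrs):
    """Group (name, value) pairs into items keyed by name hash."""
    items = {}
    for name, value in xattrs:
        items.setdefault(toy_hash(name), []).append((name, value))
    return items


def list_names_buggy(items):
    # Reads only the first entry packed in each item.
    return [entries[0][0] for entries in items.values()]


def list_names_fixed(items):
    # Walks every entry packed in each item.
    return [name for entries in items.values() for name, _ in entries]
```

For example, packing `user.ab`, `user.cd` and `user.xyz` yields two items; the buggy lister reports two names and the fixed one reports all three.
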
  4. Btrfs: fix deadlock between direct IO reads and buffered writes

    While running a test with a mix of buffered IO and direct IO against
    the same files I hit a deadlock reported by the following trace:
    
    [11642.140352] INFO: task kworker/u32:3:15282 blocked for more than 120 seconds.
    [11642.142452]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
    [11642.143982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [11642.146332] kworker/u32:3   D ffff880230ef7988 [11642.147737] systemd-journald[571]: Sent WATCHDOG=1 notification.
    [11642.149771]     0 15282      2 0x00000000
    [11642.151205] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
    [11642.154074]  ffff880230ef7988 0000000000000246 0000000000014ec0 ffff88023ec94ec0
    [11642.156722]  ffff880233fe8f80 ffff880230ef8000 ffff88023ec94ec0 7fffffffffffffff
    [11642.159205]  0000000000000002 ffffffff8147b7f9 ffff880230ef79a0 ffffffff8147b541
    [11642.161403] Call Trace:
    [11642.162129]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
    [11642.163396]  [<ffffffff8147b541>] schedule+0x82/0x9a
    [11642.164871]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
    [11642.167020]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
    [11642.167931]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
    [11642.182320]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
    [11642.183762]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
    [11642.185308]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
    [11642.186782]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
    [11642.188217]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
    [11642.189626]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
    [11642.190803]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
    [11642.192158]  [<ffffffff8111829f>] __lock_page+0x66/0x68
    [11642.193379]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
    [11642.194831]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
    [11642.197068]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
    [11642.199188]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
    [11642.200723]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
    [11642.202465]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
    [11642.203836]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
    [11642.205624]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
    [11642.207057]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
    [11642.208529]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
    [11642.210375]  [<ffffffffa0462613>] ? btrfs_scrubparity_helper+0x140/0x33a [btrfs]
    [11642.212132]  [<ffffffffa044f974>] btrfs_run_ordered_extent_work+0x25/0x34 [btrfs]
    [11642.213837]  [<ffffffffa046262f>] btrfs_scrubparity_helper+0x15c/0x33a [btrfs]
    [11642.215457]  [<ffffffffa046293b>] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
    [11642.217095]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
    [11642.218324]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
    [11642.219466]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
    [11642.220801]  [<ffffffff8106a500>] kthread+0xd4/0xdc
    [11642.222032]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
    [11642.223190]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
    [11642.224394]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
    [11642.226295] 2 locks held by kworker/u32:3/15282:
    [11642.227273]  #0:  ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
    [11642.229412]  #1:  ((&work->normal_work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
    [11642.231414] INFO: task kworker/u32:8:15289 blocked for more than 120 seconds.
    [11642.232872]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
    [11642.234109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [11642.235776] kworker/u32:8   D ffff88020de5f848     0 15289      2 0x00000000
    [11642.237412] Workqueue: writeback wb_workfn (flush-btrfs-481)
    [11642.238670]  ffff88020de5f848 0000000000000246 0000000000014ec0 ffff88023ed54ec0
    [11642.240475]  ffff88021b1ece40 ffff88020de60000 ffff88023ed54ec0 7fffffffffffffff
    [11642.242154]  0000000000000002 ffffffff8147b7f9 ffff88020de5f860 ffffffff8147b541
    [11642.243715] Call Trace:
    [11642.244390]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
    [11642.245432]  [<ffffffff8147b541>] schedule+0x82/0x9a
    [11642.246392]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
    [11642.247479]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
    [11642.248551]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
    [11642.249968]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
    [11642.251043]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
    [11642.252202]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
    [11642.253210]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
    [11642.254307]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
    [11642.256118]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
    [11642.257131]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
    [11642.258200]  [<ffffffff8111829f>] __lock_page+0x66/0x68
    [11642.259168]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
    [11642.260516]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
    [11642.261841]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
    [11642.263531]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
    [11642.264747]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
    [11642.266148]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
    [11642.267264]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
    [11642.268280]  [<ffffffff81192a2b>] __writeback_single_inode+0xda/0x5ba
    [11642.269407]  [<ffffffff811939f0>] writeback_sb_inodes+0x27b/0x43d
    [11642.270476]  [<ffffffff81193c28>] __writeback_inodes_wb+0x76/0xae
    [11642.271547]  [<ffffffff81193ea6>] wb_writeback+0x19e/0x41c
    [11642.272588]  [<ffffffff81194821>] wb_workfn+0x201/0x341
    [11642.273523]  [<ffffffff81194821>] ? wb_workfn+0x201/0x341
    [11642.274479]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
    [11642.275497]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
    [11642.276518]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
    [11642.277520]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
    [11642.278517]  [<ffffffff8106a500>] kthread+0xd4/0xdc
    [11642.279371]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
    [11642.280468]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
    [11642.281607]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
    [11642.282604] 3 locks held by kworker/u32:8/15289:
    [11642.283423]  #0:  ("writeback"){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
    [11642.285629]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
    [11642.287538]  #2:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff81171217>] trylock_super+0x1b/0x4b
    [11642.289423] INFO: task fdm-stress:26848 blocked for more than 120 seconds.
    [11642.290547]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
    [11642.291453] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [11642.292864] fdm-stress      D ffff88022c107c20     0 26848  26591 0x00000000
    [11642.294118]  ffff88022c107c20 000000038108affa 0000000000014ec0 ffff88023ed54ec0
    [11642.295602]  ffff88013ab1ca40 ffff88022c108000 ffff8800b2fc19d0 00000000000e0fff
    [11642.297098]  ffff8800b2fc19b0 ffff88022c107c88 ffff88022c107c38 ffffffff8147b541
    [11642.298433] Call Trace:
    [11642.298896]  [<ffffffff8147b541>] schedule+0x82/0x9a
    [11642.299738]  [<ffffffffa045225d>] lock_extent_bits+0xfe/0x1a3 [btrfs]
    [11642.300833]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
    [11642.301943]  [<ffffffffa0447516>] lock_and_cleanup_extent_if_need+0x68/0x18e [btrfs]
    [11642.303270]  [<ffffffffa04485ba>] __btrfs_buffered_write+0x238/0x4c1 [btrfs]
    [11642.304552]  [<ffffffffa044b50a>] ? btrfs_file_write_iter+0x17c/0x408 [btrfs]
    [11642.305782]  [<ffffffffa044b682>] btrfs_file_write_iter+0x2f4/0x408 [btrfs]
    [11642.306878]  [<ffffffff8116e298>] __vfs_write+0x7c/0xa5
    [11642.307729]  [<ffffffff8116e7d1>] vfs_write+0x9d/0xe8
    [11642.308602]  [<ffffffff8116efbb>] SyS_write+0x50/0x7e
    [11642.309410]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
    [11642.310403] 3 locks held by fdm-stress/26848:
    [11642.311108]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
    [11642.312578]  #1:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
    [11642.314170]  #2:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa044b401>] btrfs_file_write_iter+0x73/0x408 [btrfs]
    [11642.316796] INFO: task fdm-stress:26849 blocked for more than 120 seconds.
    [11642.317842]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
    [11642.318691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [11642.319959] fdm-stress      D ffff8801964ffa68     0 26849  26591 0x00000000
    [11642.321312]  ffff8801964ffa68 00ff8801e9975f80 0000000000014ec0 ffff88023ed94ec0
    [11642.322555]  ffff8800b00b4840 ffff880196500000 ffff8801e9975f20 0000000000000002
    [11642.323715]  ffff8801e9975f18 ffff8800b00b4840 ffff8801964ffa80 ffffffff8147b541
    [11642.325096] Call Trace:
    [11642.325532]  [<ffffffff8147b541>] schedule+0x82/0x9a
    [11642.326303]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
    [11642.327180]  [<ffffffff8108ae40>] ? mark_held_locks+0x5e/0x74
    [11642.328114]  [<ffffffff8147f30e>] ? _raw_spin_unlock_irq+0x2c/0x4a
    [11642.329051]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
    [11642.330053]  [<ffffffff8147bceb>] __wait_for_common+0x109/0x147
    [11642.330952]  [<ffffffff8147bceb>] ? __wait_for_common+0x109/0x147
    [11642.331869]  [<ffffffff8147e7bb>] ? usleep_range+0x4a/0x4a
    [11642.332925]  [<ffffffff81074075>] ? wake_up_q+0x47/0x47
    [11642.333736]  [<ffffffff8147bd4d>] wait_for_completion+0x24/0x26
    [11642.334672]  [<ffffffffa044f5ce>] btrfs_wait_ordered_extents+0x1c8/0x217 [btrfs]
    [11642.335858]  [<ffffffffa0465b5a>] btrfs_mksubvol+0x224/0x45d [btrfs]
    [11642.336854]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
    [11642.337820]  [<ffffffffa0465edb>] btrfs_ioctl_snap_create_transid+0x148/0x17a [btrfs]
    [11642.339026]  [<ffffffffa046603b>] btrfs_ioctl_snap_create_v2+0xc7/0x110 [btrfs]
    [11642.340214]  [<ffffffffa0468582>] btrfs_ioctl+0x590/0x27bd [btrfs]
    [11642.341123]  [<ffffffff8147dc00>] ? mutex_unlock+0xe/0x10
    [11642.341934]  [<ffffffffa00fa6e9>] ? ext4_file_write_iter+0x2a3/0x36f [ext4]
    [11642.342936]  [<ffffffff8108895d>] ? __lock_is_held+0x3c/0x57
    [11642.343772]  [<ffffffff81186a1d>] ? rcu_read_unlock+0x3e/0x5d
    [11642.344673]  [<ffffffff8117dc95>] do_vfs_ioctl+0x458/0x4dc
    [11642.346024]  [<ffffffff81186bbe>] ? __fget_light+0x62/0x71
    [11642.346873]  [<ffffffff8117dd70>] SyS_ioctl+0x57/0x79
    [11642.347720]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
    [11642.350222] 4 locks held by fdm-stress/26849:
    [11642.350898]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
    [11642.352375]  #1:  (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffffa0465981>] btrfs_mksubvol+0x4b/0x45d [btrfs]
    [11642.354072]  #2:  (&fs_info->subvol_sem){++++..}, at: [<ffffffffa0465a2a>] btrfs_mksubvol+0xf4/0x45d [btrfs]
    [11642.355647]  #3:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
    [11642.357516] INFO: task fdm-stress:26850 blocked for more than 120 seconds.
    [11642.358508]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
    [11642.359376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [11642.368625] fdm-stress      D ffff88021f167688     0 26850  26591 0x00000000
    [11642.369716]  ffff88021f167688 0000000000000001 0000000000014ec0 ffff88023edd4ec0
    [11642.370950]  ffff880128a98680 ffff88021f168000 ffff88023edd4ec0 7fffffffffffffff
    [11642.372210]  0000000000000002 ffffffff8147b7f9 ffff88021f1676a0 ffffffff8147b541
    [11642.373430] Call Trace:
    [11642.373853]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
    [11642.374623]  [<ffffffff8147b541>] schedule+0x82/0x9a
    [11642.375948]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
    [11642.376862]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
    [11642.377637]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
    [11642.378610]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
    [11642.379457]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
    [11642.380366]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
    [11642.381353]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
    [11642.382255]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
    [11642.383162]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
    [11642.383945]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
    [11642.384875]  [<ffffffff8111829f>] __lock_page+0x66/0x68
    [11642.385749]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
    [11642.386721]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
    [11642.387596]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
    [11642.389030]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
    [11642.389973]  [<ffffffff810a25ad>] ? rcu_read_lock_sched_held+0x61/0x69
    [11642.390939]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
    [11642.392271]  [<ffffffffa0451c32>] ? __clear_extent_bit+0x26e/0x2c0 [btrfs]
    [11642.393305]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
    [11642.394239]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
    [11642.395045]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
    [11642.395991]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
    [11642.397144]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
    [11642.398392]  [<ffffffffa0452094>] ? clear_extent_bit+0x17/0x19 [btrfs]
    [11642.399363]  [<ffffffffa0445945>] btrfs_get_blocks_direct+0x12b/0x61c [btrfs]
    [11642.400445]  [<ffffffff8119f7a1>] ? dio_bio_add_page+0x3d/0x54
    [11642.401309]  [<ffffffff8119fa93>] ? submit_page_section+0x7b/0x111
    [11642.402213]  [<ffffffff811a0258>] do_blockdev_direct_IO+0x685/0xc24
    [11642.403139]  [<ffffffffa044581a>] ? btrfs_page_exists_in_range+0x1a1/0x1a1 [btrfs]
    [11642.404360]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
    [11642.406187]  [<ffffffff811a0828>] __blockdev_direct_IO+0x31/0x33
    [11642.407070]  [<ffffffff811a0828>] ? __blockdev_direct_IO+0x31/0x33
    [11642.407990]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
    [11642.409192]  [<ffffffffa043b4ca>] btrfs_direct_IO+0x1c7/0x27e [btrfs]
    [11642.410146]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
    [11642.411291]  [<ffffffff81119a2c>] generic_file_read_iter+0x89/0x4e1
    [11642.412263]  [<ffffffff8108ac05>] ? mark_lock+0x24/0x201
    [11642.413057]  [<ffffffff8116e1f8>] __vfs_read+0x79/0x9d
    [11642.413897]  [<ffffffff8116e6f1>] vfs_read+0x8f/0xd2
    [11642.414708]  [<ffffffff8116ef3d>] SyS_read+0x50/0x7e
    [11642.415573]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
    [11642.416572] 1 lock held by fdm-stress/26850:
    [11642.417345]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
    [11642.418703] INFO: task fdm-stress:26851 blocked for more than 120 seconds.
    [11642.419698]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
    [11642.420612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [11642.421807] fdm-stress      D ffff880196483d28     0 26851  26591 0x00000000
    [11642.422878]  ffff880196483d28 00ff8801c8f60740 0000000000014ec0 ffff88023ed94ec0
    [11642.424149]  ffff8801c8f60740 ffff880196484000 0000000000000246 ffff8801c8f60740
    [11642.425374]  ffff8801bb711840 ffff8801bb711878 ffff880196483d40 ffffffff8147b541
    [11642.426591] Call Trace:
    [11642.427013]  [<ffffffff8147b541>] schedule+0x82/0x9a
    [11642.427856]  [<ffffffff8147b6d5>] schedule_preempt_disabled+0x18/0x24
    [11642.428852]  [<ffffffff8147c23a>] mutex_lock_nested+0x1d7/0x3b4
    [11642.429743]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
    [11642.430911]  [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
    [11642.432102]  [<ffffffffa044f674>] ? btrfs_wait_ordered_roots+0x57/0x191 [btrfs]
    [11642.433259]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
    [11642.434431]  [<ffffffffa044f6ea>] btrfs_wait_ordered_roots+0xcd/0x191 [btrfs]
    [11642.436079]  [<ffffffffa0410cab>] btrfs_sync_fs+0xe0/0x1ad [btrfs]
    [11642.437009]  [<ffffffff81197900>] ? SyS_tee+0x23c/0x23c
    [11642.437860]  [<ffffffff81197920>] sync_fs_one_sb+0x20/0x22
    [11642.438723]  [<ffffffff81171435>] iterate_supers+0x75/0xc2
    [11642.439597]  [<ffffffff81197d00>] sys_sync+0x52/0x80
    [11642.440454]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
    [11642.441533] 3 locks held by fdm-stress/26851:
    [11642.442370]  #0:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff8117141f>] iterate_supers+0x5f/0xc2
    [11642.444043]  #1:  (&fs_info->ordered_operations_mutex){+.+...}, at: [<ffffffffa044f661>] btrfs_wait_ordered_roots+0x44/0x191 [btrfs]
    [11642.446010]  #2:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
    
    This happened because under specific timings the path for direct IO reads
    can deadlock with concurrent buffered writes. The diagram below shows how
    this happens for an example file that has the following layout:
    
         [  extent A  ]  [  extent B  ]  [ ....
         0K              4K              8K
    
         CPU 1                                               CPU 2                             CPU 3
    
    DIO read against range
     [0K, 8K[ starts
    
    btrfs_direct_IO()
      --> calls btrfs_get_blocks_direct()
          which finds the extent map for the
          extent A and leaves the range
          [0K, 4K[ locked in the inode's
          io tree
    
                                                       buffered write against
                                                       range [4K, 8K[ starts
    
                                                       __btrfs_buffered_write()
                                                         --> dirties page at 4K
    
                                                                                         a user space
                                                                                         task calls sync
                                                                                         for e.g or
                                                                                         writepages() is
                                                                                         invoked by mm
    
                                                                                         writepages()
                                                                                           run_delalloc_range()
                                                                                             cow_file_range()
                                                                                               --> ordered extent X
                                                                                                   for the buffered
                                                                                                   write is created
                                                                                                   and
                                                                                                   writeback starts
    
      --> calls btrfs_get_blocks_direct()
          again, without submitting first
          a bio for reading extent A, and
          finds the extent map for extent B
    
      --> calls lock_extent_direct()
    
          --> locks range [4K, 8K[
          --> finds ordered extent X
              covering range [4K, 8K[
          --> unlocks range [4K, 8K[
    
                                                      buffered write against
                                                      range [0K, 8K[ starts
    
                                                      __btrfs_buffered_write()
                                                        prepare_pages()
                                                          --> locks pages with
                                                              offsets 0 and 4K
                                                        lock_and_cleanup_extent_if_need()
                                                          --> blocks attempting to
                                                              lock range [0K, 8K[ in
                                                              the inode's io tree,
                                                              because the range [0, 4K[
                                                              is already locked by the
                                                              direct IO task at CPU 1
    
          --> calls
              btrfs_start_ordered_extent(oe X)
    
              btrfs_start_ordered_extent(oe X)
    
                --> At this point writeback for ordered
                    extent X has not finished yet
    
                filemap_fdatawrite_range()
                  btrfs_writepages()
                    extent_writepages()
                      extent_write_cache_pages()
                        --> finds page with offset 0
                            with the writeback tag
                            (and not dirty)
                        --> tries to lock it
                             --> deadlock, task at CPU 2
                                 has the page locked and
                                 is blocked on the io range
                                 [0, 4K[ that was locked
                                 earlier by this task
    
    So fix this by falling back to a buffered read in the direct IO read path
    when an ordered extent for a buffered write is found.
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    fdmanana authored and masoncl committed Mar 1, 2016
  5. Btrfs: fix extent_same allowing destination offset beyond i_size

    When using the same file as the source and destination for a dedup
    (extent_same ioctl) operation we were allowing it to dedup to a
    destination offset beyond the file's size, which doesn't make sense and
    it's not allowed for the case where the source and destination files are
    not the same file. This made the deduplication operation successful only
    when the source range corresponded to a hole, a prealloc extent or an
    extent with all bytes having a value of 0x00. This was also leaving a
    file hole (between i_size and destination offset) without the
    corresponding file extent items, which can be reproduced with the
    following steps for example:
    
      $ mkfs.btrfs -f /dev/sdi
      $ mount /dev/sdi /mnt/sdi
    
      $ xfs_io -f -c "pwrite -S 0xab 304457 404990" /mnt/sdi/foobar
      wrote 404990/404990 bytes at offset 304457
      395 KiB, 99 ops; 0.0000 sec (31.150 MiB/sec and 7984.5149 ops/sec)
    
      $ /git/hub/duperemove/btrfs-extent-same 24576 /mnt/sdi/foobar 28672 /mnt/sdi/foobar 929792
      Deduping 2 total files
      (28672, 24576): /mnt/sdi/foobar
      (929792, 24576): /mnt/sdi/foobar
      1 files asked to be deduped
      i: 0, status: 0, bytes_deduped: 24576
      24576 total bytes deduped in this operation
    
      $ umount /mnt/sdi
      $ btrfsck /dev/sdi
      Checking filesystem on /dev/sdi
      UUID: 98c528aa-0833-427d-9403-b98032ffbf9d
      checking extents
      checking free space cache
      checking fs roots
      root 5 inode 257 errors 100, file extent discount
      Found file extent holes:
              start: 712704, len: 217088
      found 540673 bytes used err is 1
      total csum bytes: 400
      total tree bytes: 131072
      total fs tree bytes: 32768
      total extent tree bytes: 16384
      btree space waste bytes: 123675
      file data blocks allocated: 671744
        referenced 671744
      btrfs-progs v4.2.3
    
    So fix this by not allowing the destination to go beyond the file's size,
    just as we do for the case where the source and destination files are not
    the same.
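    
    The hole reported by btrfsck can be reconstructed from the numbers in
    the reproducer above. The sketch below is only an illustration: it
    assumes a 4096 byte sector size, and round_up and dedupe_range_in_file
    are hypothetical helper names, not kernel functions. The range check is
    a simplified stand-in for what the ioctl does, so the exact boundary
    handling in the kernel may differ.

```shell
#!/bin/sh
# Numbers taken from the reproducer in this commit message.
write_off=304457    # pwrite offset
write_len=404990    # pwrite length
dst_off=929792      # dedupe destination offset
dedupe_len=24576    # dedupe length
sectorsize=4096     # assumption: 4 KiB sector size

# Round $1 up to the next multiple of $2.
round_up() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

# Hypothetical helper: succeeds only if [offset, offset + length) lies
# within the file. $1 = offset, $2 = length, $3 = i_size.
dedupe_range_in_file() {
    [ $(( $1 + $2 )) -le "$3" ]
}

i_size=$(( write_off + write_len ))
hole_start=$(round_up "$i_size" "$sectorsize")
hole_len=$(( dst_off - hole_start ))

echo "i_size=$i_size hole_start=$hole_start hole_len=$hole_len"

if dedupe_range_in_file "$dst_off" "$dedupe_len" "$i_size"; then
    echo "destination range accepted"
else
    echo "destination range rejected"
fi
```

    The computed hole (start 712704, length 217088) matches the "Found file
    extent holes" line in the btrfsck output, and with a check of this kind
    the destination range from the reproducer is rejected.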
    
    A test for xfstests follows.
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    fdmanana authored and masoncl committed Mar 1, 2016
  6. Btrfs: fix file loss on log replay after renaming a file and fsync

    We have two cases where we end up deleting a file at log replay time
    when we should not. For this to happen the file must have been renamed
    and a directory inode must have been fsynced/logged.
    
    Two examples that exercise these two cases are listed below.
    
      Case 1)
    
      $ mkfs.btrfs -f /dev/sdb
      $ mount /dev/sdb /mnt
      $ mkdir -p /mnt/a/b
      $ mkdir /mnt/c
      $ touch /mnt/a/b/foo
      $ sync
      $ mv /mnt/a/b/foo /mnt/c/
      # Create file bar just to make sure the fsync on directory a/ does
      # something and it's not a no-op.
      $ touch /mnt/a/bar
      $ xfs_io -c "fsync" /mnt/a
      < power fail / crash >
    
      The next time the filesystem is mounted, the log replay procedure
      deletes file foo.
    
      Case 2)
    
      $ mkfs.btrfs -f /dev/sdb
      $ mount /dev/sdb /mnt
      $ mkdir /mnt/a
      $ mkdir /mnt/b
      $ mkdir /mnt/c
      $ touch /mnt/a/foo
      $ ln /mnt/a/foo /mnt/b/foo_link
      $ touch /mnt/b/bar
      $ sync
      $ unlink /mnt/b/foo_link
      $ mv /mnt/b/bar /mnt/c/
      $ xfs_io -c "fsync" /mnt/a/foo
      < power fail / crash >
    
      The next time the filesystem is mounted, the log replay procedure
      deletes file bar.
    
    The files are deleted because, when we log inodes other than the fsync
    target inode, we ignore their last_unlink_trans value and leave the log
    without enough information to later replay the rename operations. So we
    need to look at the last_unlink_trans values and fall back to a
    transaction commit if they are greater than the id of the last committed
    transaction.
    
    So fix this by looking at the last_unlink_trans values and falling back
    to transaction commits when needed. Also, when logging other inodes (for
    case 1 we logged descendants of the fsync target inode while for case 2
    we logged ancestors) we need to care about concurrent tasks updating
    the last_unlink_trans of inodes we are logging (which was already an
    existing problem in check_parent_dirs_for_sync()). Since we cannot
    acquire their inode mutex (vfs' struct inode ->i_mutex), as that causes
    deadlocks with other concurrent operations that acquire the i_mutex of
    2 inodes (other fsyncs or renames for example), we need to serialize on
    the log_mutex of the inode we are logging. A task setting a new value for
    an inode's last_unlink_trans must acquire the inode's log_mutex and it
    must do this update before doing the actual unlink operation (which is
    already the case except when deleting a snapshot). Conversely, the task
    logging the inode must first log the inode and then check the inode's
    last_unlink_trans value while holding its log_mutex. If the value is
    not greater than the id of the last committed transaction, it logged a
    safe state of the inode's items. If the value is greater than that id,
    the inode state it has logged might not be safe (the concurrent task
    might have just updated last_unlink_trans but not yet done the unlink
    operation) and therefore a transaction commit must be done.
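    
    The decision rule described above can be sketched as a simple
    comparison. This is an illustration only: needs_full_commit is a
    hypothetical name, the transaction ids are made up, and the sketch
    follows the "greater than the id of the last committed transaction"
    wording used earlier in this message. It does not model the log_mutex
    serialization at all.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when a full transaction
# commit is needed instead of relying on the log.
# $1 = the inode's last_unlink_trans
# $2 = id of the last committed transaction
needs_full_commit() {
    [ "$1" -gt "$2" ]
}

# Made-up transaction ids, for illustration only.
last_committed=7
for last_unlink_trans in 6 7 8; do
    if needs_full_commit "$last_unlink_trans" "$last_committed"; then
        echo "last_unlink_trans=$last_unlink_trans: commit the transaction"
    else
        echo "last_unlink_trans=$last_unlink_trans: logged state is safe"
    fi
done
```

    Only the value 8, which is newer than the last committed transaction,
    forces the fallback to a transaction commit.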
    
    Test cases for xfstests follow in separate patches.
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    fdmanana authored and masoncl committed Mar 1, 2016
  7. Btrfs: fix unreplayable log after snapshot delete + parent dir fsync

    If we delete a snapshot, fsync its parent directory and crash/power fail
    before the next transaction commit, on the next mount when we attempt to
    replay the log tree of the root containing the parent directory we will
    fail and prevent the filesystem from mounting, which is solvable by wiping
    out the log trees with the btrfs-zero-log tool but very inconvenient as
    we will lose any data and metadata fsynced before the parent directory
    was fsynced.
    
    For example:
    
      $ mkfs.btrfs -f /dev/sdc
      $ mount /dev/sdc /mnt
      $ mkdir /mnt/testdir
      $ btrfs subvolume snapshot /mnt /mnt/testdir/snap
      $ btrfs subvolume delete /mnt/testdir/snap
      $ xfs_io -c "fsync" /mnt/testdir
      < crash / power failure and reboot >
      $ mount /dev/sdc /mnt
      mount: mount(2) failed: No such file or directory
    
    And in dmesg/syslog we get the following message and trace:
    
    [192066.361162] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
    [192066.363010] ------------[ cut here ]------------
    [192066.365268] WARNING: CPU: 4 PID: 5130 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x17a/0x354 [btrfs]()
    [192066.367250] BTRFS: Transaction aborted (error -2)
    [192066.368401] Modules linked in: btrfs dm_flakey dm_mod ppdev sha256_generic xor raid6_pq hmac drbg ansi_cprng aesni_intel acpi_cpufreq tpm_tis aes_x86_64 tpm ablk_helper evdev cryptd sg parport_pc i2c_piix4 psmouse lrw parport i2c_core pcspkr gf128mul processor serio_raw glue_helper button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
    [192066.377154] CPU: 4 PID: 5130 Comm: mount Tainted: G        W       4.4.0-rc6-btrfs-next-20+ #1
    [192066.378875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
    [192066.380889]  0000000000000000 ffff880143923670 ffffffff81257570 ffff8801439236b8
    [192066.382561]  ffff8801439236a8 ffffffff8104ec07 ffffffffa039dc2c 00000000fffffffe
    [192066.384191]  ffff8801ed31d000 ffff8801b9fc9c88 ffff8801086875e0 ffff880143923710
    [192066.385827] Call Trace:
    [192066.386373]  [<ffffffff81257570>] dump_stack+0x4e/0x79
    [192066.387387]  [<ffffffff8104ec07>] warn_slowpath_common+0x99/0xb2
    [192066.388429]  [<ffffffffa039dc2c>] ? __btrfs_unlink_inode+0x17a/0x354 [btrfs]
    [192066.389236]  [<ffffffff8104ec68>] warn_slowpath_fmt+0x48/0x50
    [192066.389884]  [<ffffffffa039dc2c>] __btrfs_unlink_inode+0x17a/0x354 [btrfs]
    [192066.390621]  [<ffffffff81184b55>] ? iput+0xb0/0x266
    [192066.391200]  [<ffffffffa039ea25>] btrfs_unlink_inode+0x1c/0x3d [btrfs]
    [192066.391930]  [<ffffffffa03ca623>] check_item_in_log+0x1fe/0x29b [btrfs]
    [192066.392715]  [<ffffffffa03ca827>] replay_dir_deletes+0x167/0x1cf [btrfs]
    [192066.393510]  [<ffffffffa03cccc7>] replay_one_buffer+0x417/0x570 [btrfs]
    [192066.394241]  [<ffffffffa03ca164>] walk_up_log_tree+0x10e/0x1dc [btrfs]
    [192066.394958]  [<ffffffffa03cac72>] walk_log_tree+0xa5/0x190 [btrfs]
    [192066.395628]  [<ffffffffa03ce8b8>] btrfs_recover_log_trees+0x239/0x32c [btrfs]
    [192066.396790]  [<ffffffffa03cc8b0>] ? replay_one_extent+0x50a/0x50a [btrfs]
    [192066.397891]  [<ffffffffa0394041>] open_ctree+0x1d8b/0x2167 [btrfs]
    [192066.398897]  [<ffffffffa03706e1>] btrfs_mount+0x5ef/0x729 [btrfs]
    [192066.399823]  [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf
    [192066.400739]  [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3
    [192066.401700]  [<ffffffff811714b9>] mount_fs+0x67/0x131
    [192066.402482]  [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde
    [192066.403930]  [<ffffffffa03702bd>] btrfs_mount+0x1cb/0x729 [btrfs]
    [192066.404831]  [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf
    [192066.405726]  [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3
    [192066.406621]  [<ffffffff811714b9>] mount_fs+0x67/0x131
    [192066.407401]  [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde
    [192066.408247]  [<ffffffff8118ae36>] do_mount+0x893/0x9d2
    [192066.409047]  [<ffffffff8113009b>] ? strndup_user+0x3f/0x8c
    [192066.409842]  [<ffffffff8118b187>] SyS_mount+0x75/0xa1
    [192066.410621]  [<ffffffff8147e517>] entry_SYSCALL_64_fastpath+0x12/0x6b
    [192066.411572] ---[ end trace 2de42126c1e0a0f0 ]---
    [192066.412344] BTRFS: error (device dm-0) in __btrfs_unlink_inode:3986: errno=-2 No such entry
    [192066.413748] BTRFS: error (device dm-0) in btrfs_replay_log:2464: errno=-2 No such entry (Failed to recover log tree)
    [192066.415458] BTRFS error (device dm-0): cleaner transaction attach returned -30
    [192066.444613] BTRFS: open_ctree failed
    
    This happens because when we are replaying the log and processing the
    directory entry pointing to the snapshot in the subvolume tree, we treat
    its btrfs_dir_item item as having a location with a key type matching
    BTRFS_INODE_ITEM_KEY, which is wrong because the type matches
    BTRFS_ROOT_ITEM_KEY and therefore must be processed differently, as the
    object id refers to a root number and not to an inode in the root
    containing the parent directory.
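    
    The distinction between the two key types can be illustrated as
    follows. The numeric values are those of BTRFS_INODE_ITEM_KEY and
    BTRFS_ROOT_ITEM_KEY in the btrfs on-disk format. dir_item_target is a
    hypothetical helper name used only for this sketch and does not exist
    in the kernel.

```shell
#!/bin/sh
# Key type values from the btrfs on-disk format.
BTRFS_INODE_ITEM_KEY=1
BTRFS_ROOT_ITEM_KEY=132

# Hypothetical helper: describe what the location key of a directory
# entry points to, given its key type in $1.
dir_item_target() {
    case "$1" in
        "$BTRFS_INODE_ITEM_KEY")
            echo "inode in the same root" ;;
        "$BTRFS_ROOT_ITEM_KEY")
            echo "root of a subvolume or snapshot" ;;
        *)
            echo "unexpected key type $1" ;;
    esac
}

dir_item_target "$BTRFS_INODE_ITEM_KEY"
dir_item_target "$BTRFS_ROOT_ITEM_KEY"
```

    An entry of the second kind cannot be unlinked as if its object id were
    an inode number in the parent directory's root, which is the mistake
    the log replay code was making.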
    
    So fix this by triggering a transaction commit if an fsync against the
    parent directory is requested after deleting a snapshot. This is the
    simplest approach for a rare use case. An alternative that avoids the
    transaction commit would require more code to explicitly delete the
    snapshot at log replay time (factoring out common code from ioctl.c:
    btrfs_ioctl_snap_destroy()), special care at fsync time to remove the
    log tree of the snapshot's root from the log root of the root of tree
    roots, amongst other steps.
    
    A test case for xfstests that triggers the issue follows.
    
      seq=`basename $0`
      seqres=$RESULT_DIR/$seq
      echo "QA output created by $seq"
      tmp=/tmp/$$
      status=1	# failure is the default!
      trap "_cleanup; exit \$status" 0 1 2 3 15
    
      _cleanup()
      {
          _cleanup_flakey
          cd /
          rm -f $tmp.*
      }
    
      # get standard environment, filters and checks
      . ./common/rc
      . ./common/filter
      . ./common/dmflakey
    
      # real QA test starts here
      _need_to_be_root
      _supported_fs btrfs
      _supported_os Linux
      _require_scratch
      _require_dm_target flakey
      _require_metadata_journaling $SCRATCH_DEV
    
      rm -f $seqres.full
    
      _scratch_mkfs >>$seqres.full 2>&1
      _init_flakey
      _mount_flakey
    
      # Create a snapshot at the root of our filesystem (mount point path), delete it,
      # fsync the mount point path, crash and mount to replay the log. This should
      # succeed and after the filesystem is mounted the snapshot should not be visible
      # anymore.
      _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
      _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
      $XFS_IO_PROG -c "fsync" $SCRATCH_MNT
      _flakey_drop_and_remount
      [ -e $SCRATCH_MNT/snap1 ] && \
          echo "Snapshot snap1 still exists after log replay"
    
      # Similar scenario as above, but this time the snapshot is created inside a
      # directory and not directly under the root (mount point path).
      mkdir $SCRATCH_MNT/testdir
      _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/testdir/snap2
      _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap2
      $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir
      _flakey_drop_and_remount
      [ -e $SCRATCH_MNT/testdir/snap2 ] && \
          echo "Snapshot snap2 still exists after log replay"
    
      _unmount_flakey
    
      echo "Silence is golden"
      status=0
      exit
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Tested-by: Liu Bo <bo.li.liu@oracle.com>
    Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    fdmanana authored and masoncl committed Mar 1, 2016
  8. Merge tag 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.6
    
    Btrfs patchsets for 4.6
    masoncl committed Mar 1, 2016