Liu-Bo/Btrfs-d…

Commits on Dec 4, 2015

  1. Btrfs: disable online scrub repair on ro cases

    This disables the repair process on read-only (ro) mounts, as it can
    make the system unresponsive on the ASSERT() in repair_io_failure().
    
    This can happen when scrub is running and a hardware error pops up;
    we should fall back to a ro mount gracefully instead of becoming unresponsive.
    
    Reported-by: Codebird <codebird@birds-are-nice.me>
    Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    Liu Bo authored and fengguang committed Dec 4, 2015

Commits on Aug 21, 2015

  1. btrfs: fix compile when block cgroups are not enabled

    bio->bi_css and bio->bi_ioc don't exist when block cgroups are not on.
    This adds an ifdef around them.  It's not perfect, but our
    use of bi_ioc is being removed in the 4.3 merge window.
    
    The bi_css usage really should go into bio_clone, but I want to make
    sure that doesn't introduce problems for other bio_clone use cases.
    
    Signed-off-by: Chris Mason <clm@fb.com>
    masoncl committed Aug 21, 2015

Commits on Aug 19, 2015

  1. Btrfs: fix file read corruption after extent cloning and fsync

    If we partially clone one extent of a file into a lower offset of the
    file, fsync the file, power fail and then mount the fs to trigger log
    replay, we can get multiple checksum items in the csum tree that overlap
    each other and result in checksum lookup failures later. Those failures
    can make file data read requests assume a checksum value of 0, but they
    will not return an error (-EIO for example) to userspace exactly because
    the expected checksum value 0 is a special value that makes the read bio
    endio callback return success and set all the bytes of the corresponding
    page with the value 0x01 (at fs/btrfs/inode.c:__readpage_endio_check()).
    From a userspace perspective this is equivalent to file corruption
    because we are not returning what was written to the file.
    
    Details about how this can happen, and why, are included inline in the
    following reproducer test case for fstests and the comment added to
    tree-log.c.
    
      seq=`basename $0`
      seqres=$RESULT_DIR/$seq
      echo "QA output created by $seq"
      tmp=/tmp/$$
      status=1	# failure is the default!
      trap "_cleanup; exit \$status" 0 1 2 3 15
    
      _cleanup()
      {
          _cleanup_flakey
          rm -f $tmp.*
      }
    
      # get standard environment, filters and checks
      . ./common/rc
      . ./common/filter
      . ./common/dmflakey
    
      # real QA test starts here
      _need_to_be_root
      _supported_fs btrfs
      _supported_os Linux
      _require_scratch
      _require_dm_flakey
      _require_cloner
      _require_metadata_journaling $SCRATCH_DEV
    
      rm -f $seqres.full
    
      _scratch_mkfs >>$seqres.full 2>&1
      _init_flakey
      _mount_flakey
    
      # Create our test file with a single 100K extent starting at file
      # offset 800K. We fsync the file here to make sure the fsync log tree
      # gets a single csum item that covers the whole 100K extent, which
      # causes the second fsync, done after the cloning operation below, to
      # not leave in the log tree two csum items covering two sub-ranges
      # ([0, 20K[ and [20K, 100K[) of our extent.
      $XFS_IO_PROG -f -c "pwrite -S 0xaa 800K 100K"  \
                      -c "fsync"                     \
                       $SCRATCH_MNT/foo | _filter_xfs_io
    
      # Now clone part of our extent into file offset 400K. This adds a file
      # extent item to our inode's metadata that points to the 100K extent
      # we created before, using a data offset of 20K and a data length of
      # 20K, so that it refers to the sub-range [20K, 40K[ of our original
      # extent.
      $CLONER_PROG -s $((800 * 1024 + 20 * 1024)) -d $((400 * 1024)) \
          -l $((20 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo
    
      # Now fsync our file to make sure the extent cloning is durably
      # persisted. This fsync will not add a second csum item to the log
      # tree containing the checksums for the blocks in the sub-range
      # [20K, 40K[ of our extent, because there was already a csum item in
      # the log tree covering the whole extent, added by the first fsync
      # we did before.
      $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
    
      echo "File digest before power failure:"
      md5sum $SCRATCH_MNT/foo | _filter_scratch
    
      # Silently drop all writes and unmount to simulate a crash/power
      # failure.
      _load_flakey_table $FLAKEY_DROP_WRITES
      _unmount_flakey
    
      # Allow writes again, mount to trigger log replay and validate file
      # contents.
      # The fsync log replay first processes the file extent item
      # corresponding to the file offset 400K (the one which refers to the
      # [20K, 40K[ sub-range of our 100K extent) and then processes the file
      # extent item for file offset 800K. It used to happen that when
      # processing the latter, it erroneously left in the csum tree 2 csum
      # items that overlapped each other, 1 for the sub-range [20K, 40K[ and
      # 1 for the whole range of our extent. This introduced a problem where
      # subsequent lookups for the checksums of blocks within the range
      # [40K, 100K[ of our extent would not find anything because lookups in
      # the csum tree ended up looking only at the smaller csum item, the
      # one covering the subrange [20K, 40K[. This made read requests assume
      # an expected checksum with a value of 0 for those blocks, which caused
      # checksum verification failure when the read operations finished.
      # However those checksum failures did not result in read requests
      # returning an error to user space (-EIO, for example) because the
      # expected checksum value had the special value 0, and in that case
      # btrfs sets all bytes of the corresponding pages to the value 0x01
      # and produces the following warning in dmesg/syslog:
      #
      #  "BTRFS warning (device dm-0): csum failed ino 257 off 917504 csum\
      #   1322675045 expected csum 0"
      #
      _load_flakey_table $FLAKEY_ALLOW_WRITES
      _mount_flakey
    
      echo "File digest after log replay:"
      # Must match the digest it had after cloning the extent and
      # before the power failure happened.
      md5sum $SCRATCH_MNT/foo | _filter_scratch
    
      _unmount_flakey
    
      status=0
      exit
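
The lookup failure described in the test's comments can be modelled in user space. The following is a minimal sketch, not the kernel's csum tree code: `struct toy_csum_item` and `toy_lookup()` are hypothetical names, offsets are relative to the start of the 100K extent, and the model assumes a lookup only considers the item with the greatest start offset that is not past the block being looked up.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a csum item: covers [start, start + len). */
struct toy_csum_item {
    uint64_t start;
    uint64_t len;
};

/*
 * Items are sorted by start offset. Like a search for a predecessor key,
 * only the last item whose start is <= offset is examined (assumption of
 * this model). Returns NULL when that item does not cover the offset.
 */
static const struct toy_csum_item *
toy_lookup(const struct toy_csum_item *items, size_t n, uint64_t offset)
{
    const struct toy_csum_item *found = NULL;

    for (size_t i = 0; i < n && items[i].start <= offset; i++)
        found = &items[i];

    if (found && offset < found->start + found->len)
        return found;
    return NULL;
}
```

With the two overlapping items { 0, 100K } and { 20K, 20K }, a lookup at 50K stops at the smaller item, which ends at 40K, and finds nothing. With the single item { 0, 100K } the same lookup succeeds.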
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    fdmanana authored and masoncl committed Aug 19, 2015
  2. Btrfs: check if previous transaction aborted to avoid fs corruption

    While we are committing a transaction, it's possible the previous one is
    still finishing its commit and therefore we wait for it to finish first.
    However we were not checking if that previous transaction ended up getting
    aborted after we waited for it to commit, so we ended up committing the
    current transaction which can lead to fs corruption because the new
    superblock can point to trees that have had one or more nodes/leafs that
    were never durably persisted.
    The following sequence diagram exemplifies how this is possible:
    
              CPU 0                                                        CPU 1
    
      transaction N starts
    
      (...)
    
      btrfs_commit_transaction(N)
    
        cur_trans->state = TRANS_STATE_COMMIT_START;
        (...)
        cur_trans->state = TRANS_STATE_COMMIT_DOING;
        (...)
    
        cur_trans->state = TRANS_STATE_UNBLOCKED;
        root->fs_info->running_transaction = NULL;
    
                                                                  btrfs_start_transaction()
                                                                     --> starts transaction N + 1
    
        btrfs_write_and_wait_transaction(trans, root);
          --> starts writing all new or COWed ebs created
              at transaction N
    
                                                                  creates some new ebs, COWs some
                                                                  existing ebs but doesn't COW or
                                                                  delete eb X
    
                                                                  btrfs_commit_transaction(N + 1)
                                                                    (...)
                                                                    cur_trans->state = TRANS_STATE_COMMIT_START;
                                                                    (...)
                                                                    wait_for_commit(root, prev_trans);
                                                                      --> prev_trans == transaction N
    
        btrfs_write_and_wait_transaction() continues
        writing ebs
           --> fails writing eb X, we abort transaction N
               and set bit BTRFS_FS_STATE_ERROR on
               fs_info->fs_state, so no new transactions
               can start after setting that bit
    
           cleanup_transaction()
             btrfs_cleanup_one_transaction()
               wakes up task at CPU 1
    
                                                                    continues, doesn't abort because
                                                                    cur_trans->aborted (transaction N + 1)
                                                                    is zero, and no checks for bit
                                                                    BTRFS_FS_STATE_ERROR in fs_info->fs_state
                                                                    are made
    
                                                                    btrfs_write_and_wait_transaction(trans, root);
                                                                      --> succeeds, no errors during writeback
    
                                                                    write_ctree_super(trans, root, 0);
                                                                      --> succeeds
                                                                      --> we have now a superblock that points us
                                                                          to some root that uses eb X, which was
                                                                          never written to disk
    
    In this scenario future attempts to read eb X from disk result in an
    error message like "parent transid verify failed on X wanted Y found Z".
    
    So fix this by aborting the current transaction if after waiting for the
    previous transaction we verify that it was aborted.
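
The shape of the fix can be sketched in user space. This is a simplified model under stated assumptions, not the kernel code: `struct toy_transaction` and `toy_commit()` are hypothetical, and the wait for the previous transaction is reduced to a comment.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a transaction. */
struct toy_transaction {
    int aborted;     /* 0 if healthy, negative errno (such as -EIO) if aborted */
    bool committed;  /* set once the superblock would have been written */
};

/*
 * Model of the commit path after the fix: once the wait for the previous
 * transaction is over, refuse to commit if that transaction was aborted.
 */
static int toy_commit(struct toy_transaction *cur,
                      const struct toy_transaction *prev)
{
    /* (the real code would first block here until prev has finished) */
    if (prev && prev->aborted) {
        cur->aborted = prev->aborted;  /* propagate the failure */
        return prev->aborted;
    }

    cur->committed = true;
    return 0;
}
```

In the diagram above, transaction N + 1 would take the early return, so no superblock pointing at the unwritten eb X is ever written.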
    
    Cc: stable@vger.kernel.org
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: Josef Bacik <jbacik@fb.com>
    Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    fdmanana authored and masoncl committed Aug 19, 2015
  3. btrfs: use __GFP_NOFAIL in alloc_btrfs_bio

    alloc_btrfs_bio relies on GFP_NOFS allocation when committing the
    transaction but this allocation context is rather weak wrt. reclaim
    capabilities. The page allocator currently tries hard to not fail these
    allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER) but it can
    still fail if the _current_ process is the OOM killer victim. Moreover
    there is an attempt to move away from the default no-fail behavior and
    allow these allocations to fail more eagerly. This would lead to:
    
    [   37.928625] kernel BUG at fs/btrfs/extent_io.c:4045
    
    which is clearly undesirable and the nofail behavior should be explicit
    if the allocation failure cannot be tolerated.
    
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    Michal Hocko authored and masoncl committed Aug 19, 2015
  4. btrfs: Prevent from early transaction abort

    Btrfs relies on GFP_NOFS allocation when committing the transaction but
    this allocation context is rather weak wrt. reclaim capabilities. The
    page allocator currently tries hard to not fail these allocations if
    they are small (<=PAGE_ALLOC_COSTLY_ORDER) so this is not a problem
    currently but there is an attempt to move away from the default no-fail
    behavior and allow these allocations to fail more eagerly. And this would
    lead to a premature transaction abort as follows:
    
    [   55.328093] Call Trace:
    [   55.328890]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
    [   55.330518]  [<ffffffff8108fa28>] ? console_unlock+0x334/0x363
    [   55.332738]  [<ffffffff8110873e>] __alloc_pages_nodemask+0x81d/0x8d4
    [   55.334910]  [<ffffffff81100752>] pagecache_get_page+0x10e/0x20c
    [   55.336844]  [<ffffffffa007d916>] alloc_extent_buffer+0xd0/0x350 [btrfs]
    [   55.338973]  [<ffffffffa0059d8c>] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
    [   55.341329]  [<ffffffffa004f728>] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
    [   55.343566]  [<ffffffffa003fa34>] split_leaf+0x1e4/0x6a6 [btrfs]
    [   55.345577]  [<ffffffffa0040567>] btrfs_search_slot+0x671/0x831 [btrfs]
    [   55.347679]  [<ffffffff810682d7>] ? get_parent_ip+0xe/0x3e
    [   55.349434]  [<ffffffffa0041cb2>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
    [   55.351681]  [<ffffffffa004ecfb>] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
    [   55.353979]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
    [   55.356212]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
    [   55.358378]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
    [   55.360626]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
    [   55.362894]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
    [   55.365221]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
    [   55.367273]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
    [   55.369047]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
    [   55.370654]  [<ffffffff81186869>] do_fsync+0x34/0x4e
    [   55.372246]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
    [   55.373851]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
    [   55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
    [   55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
    [   55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
    [   55.384280] ------------[ cut here ]------------
    [   55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
    [...]
    [   55.384337] Call Trace:
    [   55.384353]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
    [   55.384357]  [<ffffffff8107f717>] ? down_trylock+0x2d/0x37
    [   55.384359]  [<ffffffff81046977>] warn_slowpath_common+0xa1/0xbb
    [   55.384398]  [<ffffffffa00a1d6b>] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
    [   55.384400]  [<ffffffff81046a34>] warn_slowpath_null+0x1a/0x1c
    [   55.384423]  [<ffffffffa00a1d6b>] btrfs_select_ref_head+0xd9/0xfe [btrfs]
    [   55.384446]  [<ffffffffa004e5f7>] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
    [   55.384455]  [<ffffffffa004e600>] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
    [   55.384476]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
    [   55.384499]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
    [   55.384521]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
    [   55.384543]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
    [   55.384565]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
    [   55.384588]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
    [   55.384591]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
    [   55.384592]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
    [   55.384593]  [<ffffffff81186869>] do_fsync+0x34/0x4e
    [   55.384594]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
    [   55.384595]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
    [...]
    [   55.384608] ---[ end trace c29799da1d4dd621 ]---
    [   55.437323] BTRFS info (device hdb1): forced readonly
    [   55.438815] BTRFS info (device hdb1): delayed_refs has NO entry
    
    Fix this by being explicit about the no-fail behavior of this allocation
    path and use __GFP_NOFAIL.
    
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    Michal Hocko authored and masoncl committed Aug 19, 2015
  5. btrfs: Remove unused arguments in tree-log.c

    The following arguments are not used in tree-log.c:
     insert_one_name(): path, type
     wait_log_commit(): trans
     wait_for_writer(): trans
    
    This patch removes them.
    
    Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    zhaoleidd authored and masoncl committed Aug 19, 2015
  6. btrfs: Remove useless condition in start_log_trans()

    Dan Carpenter <dan.carpenter@oracle.com> reported a smatch warning
    for start_log_trans():
     fs/btrfs/tree-log.c:178 start_log_trans()
     warn: we tested 'root->log_root' before and it was 'false'
    
     fs/btrfs/tree-log.c
     147          if (root->log_root) {
     We test "root->log_root" here.
     ...
    
    Reason:
     The condition at:
     fs/btrfs/tree-log.c:178: if (!root->log_root) {
     is not necessary after commit: 7237f18
    
     It caused a smatch warning, but no functional error.
    
    Fix:
     Deleting the above condition would silence smatch,
     but a better way is to clean up start_log_trans()
     to remove duplicated code and make the code more readable.
    
    Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    zhaoleidd authored and masoncl committed Aug 19, 2015

Commits on Aug 9, 2015

  1. Btrfs: add support for blkio controllers

    This attaches accounting information to bios as we submit them so the
    new blkio controllers can throttle on btrfs filesystems.
    
    Not much is required, we're just associating bios with blkcgs during clone,
    calling wbc_init_bio()/wbc_account_io() during writepages submission,
    and attaching the bios to the current context during direct IO.
    
    Finally if we are splitting bios during btrfs_map_bio, this attaches
    accounting information to the split.
    
    The end result is able to throttle nicely on single disk filesystems.  A
    little more work is required for multi-device filesystems.
    
    Signed-off-by: Chris Mason <clm@fb.com>
    masoncl committed Aug 9, 2015
  2. Btrfs: remove unused mutex from struct 'btrfs_fs_info'

    The code using the 'ordered_extent_flush_mutex' mutex was removed by the
    commit below.
     - 8d875f9
       btrfs: disable strict file flushes for renames and truncates
    But the mutex still lives in struct 'btrfs_fs_info'.
    
    So, this patch removes the mutex from struct 'btrfs_fs_info' and its
    initialization code.
    
    Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    Byongho Lee authored and masoncl committed Aug 9, 2015
  3. Btrfs: fix parity scrub of RAID 5/6 with missing device

    When testing the previous patch, Zhao Lei reported a similar bug when
    attempting to scrub a degraded RAID 5/6 filesystem with a missing
    device, leading to NULL pointer dereferences from the RAID 5/6 parity
    scrubbing code.
    
    The first cause was the same as in the previous patch: attempting to
    call bio_add_page() on a missing block device. To fix this,
    scrub_extent_for_parity() can just mark the sectors on the missing
    device as errors instead of attempting to read from it.
    
    Additionally, the code uses scrub_remap_extent() to map the extent of
    the corresponding data stripe, but the extent wasn't already mapped. If
    scrub_remap_extent() finds a missing block device, it doesn't initialize
    extent_dev, so we're left with a NULL struct btrfs_device. The solution
    is to use btrfs_map_block() directly.
    
    Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    osandov authored and masoncl committed Aug 9, 2015
  4. Btrfs: fix device replace of a missing RAID 5/6 device

    The original implementation of device replace on RAID 5/6 seems to have
    missed support for replacing a missing device. When this is attempted,
    we end up calling bio_add_page() on a bio with a NULL ->bi_bdev, which
    crashes when we try to dereference it. This happens because
    btrfs_map_block() has no choice but to return us the missing device
    because RAID 5/6 don't have any alternate mirrors to read from, and a
    missing device has a NULL bdev.
    
    The idea implemented here is to handle the missing device case
    separately, which better only happen when we're replacing a missing RAID
    5/6 device. We use the new BTRFS_RBIO_REBUILD_MISSING operation to
    reconstruct the data from parity, check it with
    scrub_recheck_block_checksum(), and write it out with
    scrub_write_block_to_dev_replace().
    
    Reported-by: Philip <bugzilla@philip-seeger.de>
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=96141
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    osandov authored and masoncl committed Aug 9, 2015
  5. Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation

    The current RAID 5/6 recovery code isn't quite prepared to handle
    missing devices. In particular, it expects a bio that we previously
    attempted to use in the read path, meaning that it has valid pages
    allocated. However, missing devices have a NULL blkdev, and we can't
    call bio_add_page() on a bio with a NULL blkdev. We could do manual
    manipulation of bio->bi_io_vec, but that's pretty gross. So instead, add
    a separate path that allows us to manually add pages to the rbio.
    
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    osandov authored and masoncl committed Aug 9, 2015
  6. Btrfs: count devices correctly in readahead during RAID 5/6 replace

    Commit 5fbc7c5 ("Btrfs: fix unfinished readahead thread for raid5/6
    degraded mounting") fixed a problem where we would skip a missing device
    when we shouldn't have because there are no other mirrors to read from
    in RAID 5/6. After commit 2c8cdd6 ("Btrfs, replace: write dirty
    pages into the replace target device"), the fix doesn't work when we're
    doing a missing device replace on RAID 5/6 because the replace device is
    counted as a mirror so we're tricked into thinking we can safely skip
    the missing device. The fix is to count only the real stripes and decide
    based on that.
    
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    osandov authored and masoncl committed Aug 9, 2015
  7. Btrfs: remove misleading handling of missing device scrub

    scrub_submit() claims that it can handle a bio with a NULL block device,
    but this is misleading, as calling bio_add_page() on a bio with a NULL
    ->bi_bdev would've already crashed. Delete this, as we're about to
    properly handle a missing block device.
    
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    osandov authored and masoncl committed Aug 9, 2015
  8. btrfs: fix clone / extent-same deadlocks

    Clone and extent same lock their source and target inodes in opposite order.
    In addition to this, the range locking in clone doesn't take ordering into
    account. Fix this by having clone use the same locking helpers as
    btrfs-extent-same.
    
    In addition, I do a small cleanup of the locking helpers, removing a case
    (both inodes being the same) which was poorly accounted for and never
    actually used by the callers.
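
The deadlock goes away once every path takes the two locks in the same order. A minimal user-space sketch of that idea with POSIX mutexes follows. `struct toy_inode` and the `toy_*` helpers are hypothetical, and ordering by address is an assumption chosen for illustration, not a statement about which ordering the btrfs helpers use. As in the commit, the two inodes must be distinct.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode with its own lock. */
struct toy_inode {
    pthread_mutex_t lock;
};

/* Pick which of two distinct inodes is locked first: lower address wins. */
static struct toy_inode *toy_first(struct toy_inode *a, struct toy_inode *b)
{
    return (uintptr_t)a < (uintptr_t)b ? a : b;
}

/* Lock both inodes in one global order, whatever the argument order. */
static void toy_double_lock(struct toy_inode *a, struct toy_inode *b)
{
    struct toy_inode *first = toy_first(a, b);
    struct toy_inode *second = (first == a) ? b : a;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
}

/* Unlock order does not matter for deadlock avoidance. */
static void toy_double_unlock(struct toy_inode *a, struct toy_inode *b)
{
    pthread_mutex_unlock(&a->lock);
    pthread_mutex_unlock(&b->lock);
}
```

Because `toy_double_lock(&x, &y)` and `toy_double_lock(&y, &x)` acquire the locks in the same sequence, two tasks calling it with swapped arguments cannot each end up holding one lock while waiting for the other.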
    
    Signed-off-by: Mark Fasheh <mfasheh@suse.de>
    Reviewed-by: David Sterba <dsterba@suse.cz>
    Signed-off-by: Chris Mason <clm@fb.com>
    Mark Fasheh authored and masoncl committed Aug 9, 2015
  9. Btrfs: fix defrag to merge tail file extent

    The file layout is
    
    [extent 1]...[extent n][4k extent][HOLE][extent x]
    
    Extents 1~n and the 4k extent can be merged during defrag, and the total
    number of bytes to defrag is larger than our defrag threshold (256k).
    The 4k extent at the tail is left unmerged because we only check
    whether its next extent can be merged (the next one is a hole, so the
    check fails), so the layout can end up as
    
    [new extent][4k extent][HOLE][extent x]
     (1~n)
    
    To fix it, besides looking at the next extent, this also looks at the
    previous one by checking @defrag_end, which is set to 0 when we
    decide to stop merging contiguous extents; otherwise, we can merge
    the previous one with our extent.
    
    Also, this makes btrfs behave consistently with xfs and ext4.
    
    Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    Liu Bo authored and masoncl committed Aug 9, 2015
  10. Btrfs: fix warning in backref walking

    When we do backref walking, we first search the queued delayed refs
    and then the on-disk backrefs, but we parse shared references
    differently: for delayed refs we also add 'ref->root', while for on-disk
    backrefs we don't. This can prevent us from merging refs indexed
    by the same bytenr and cause find_parent_nodes() to throw a warning at
    'WARN_ON(ref->count < 0)'. This happens, for example, when we have a
    shared data extent with 'ref_cnt=1' and a delayed shared data ref with
    a BTRFS_DROP_DELAYED_REF.
    
    For shared references, whether delayed or on-disk, ref->root is not
    used at all; it's ref->parent that really matters, so this handles
    delayed refs the same way as on-disk refs.
    
    Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    Liu Bo authored and masoncl committed Aug 9, 2015
  11. btrfs: Add WARN_ON() for double lock in btrfs_tree_lock()

    When a task tries to double lock an extent buffer, there is no
    lockdep warning about it because the lock may be in the "blocking_lock"
    state, which makes this hard to debug.
    
    This patch adds a WARN_ON() for the above condition. It cannot report
    all deadlock cases (such as lock dependencies between tasks), but it
    at least helps with some.
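
The check can be illustrated with a toy lock that remembers its owner. This is a sketch under assumptions, not the kernel implementation: `struct toy_extent_buffer`, its `lock_owner` field, the integer task ids (which must be non-zero) and the result codes are all hypothetical.

```c
#include <stdio.h>

enum toy_lock_result {
    TOY_LOCKED,       /* lock acquired */
    TOY_WOULD_BLOCK,  /* held by another task: a real lock would wait */
    TOY_DOUBLE_LOCK   /* held by the caller itself: it would wait forever */
};

/* Hypothetical stand-in for an extent buffer lock. */
struct toy_extent_buffer {
    int lock_owner;   /* id of the task holding the lock, 0 when unlocked */
};

static enum toy_lock_result
toy_tree_lock(struct toy_extent_buffer *eb, int task)
{
    /* The added check: warn when the caller already owns this lock. */
    if (eb->lock_owner == task) {
        fprintf(stderr,
                "WARNING: task %d locks the same extent buffer twice\n",
                task);
        return TOY_DOUBLE_LOCK;
    }

    if (eb->lock_owner != 0)
        return TOY_WOULD_BLOCK;

    eb->lock_owner = task;
    return TOY_LOCKED;
}
```

The owner comparison only catches a task deadlocking against itself. A cycle between two different tasks just looks like ordinary contention here, which matches the limitation stated in the commit message.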
    
    Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    zhaoleidd authored and masoncl committed Aug 9, 2015
  12. btrfs: Remove root argument in extent_data_ref_count()

    Because it is never used.
    
    Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    zhaoleidd authored and masoncl committed Aug 9, 2015
  13. btrfs: Fix wrong comment of btrfs_alloc_tree_block()

    This wrong comment was copied from another (now obsolete) function
    at the very beginning; this patch fixes it.
    
    Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    zhaoleidd authored and masoncl committed Aug 9, 2015
  14. btrfs: abort transaction on btrfs_reloc_cow_block()

    When btrfs_reloc_cow_block() fails in __btrfs_cow_block(), the current
    code just returns an error value to the caller, but leaves the newly
    created extent buffer in existence and locked.
    
    Subsequent code (in relocation) then tries to lock that eb again,
    causing a deadlock without any dmesg output.
    (the eb lock uses wait_event(), so there is no lockdep message)
    
    It is hard to do recovery work in __btrfs_cow_block() at this error
    point, but we can abort the transaction to avoid the deadlock and to
    avoid operating on an unstable state.
    
    It also helps developers find the failing place quickly.
    (better than a frozen fs without any dmesg, as before the patch)
    
    Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    zhaoleidd authored and masoncl committed Aug 9, 2015
  15. btrfs: Remove unnecessary variants in relocation.c

    These arguments are not used in their functions; remove them to clean
    up the code and reduce kernel stack usage.
    
    Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    zhaoleidd authored and masoncl committed Aug 9, 2015
  16. btrfs: Cleanup: Remove chunk_objectid argument from btrfs_relocate_chunk()

    Remove the chunk_objectid argument from btrfs_relocate_chunk() because
    it is not necessary; this also lets us clean up the code in the caller
    that prepared its value.
    
    Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    zhaoleidd authored and masoncl committed Aug 9, 2015
  17. btrfs: Cleanup: Remove objectid's init-value in create_reloc_inode()

    objectid's init-value is not used in any case, remove it.
    
    Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    zhaoleidd authored and masoncl committed Aug 9, 2015
  18. btrfs: Error handle for get_ref_objectid_v0() in relocate_block_group()

    We need error checking for get_ref_objectid_v0() in
    relocate_block_group() to avoid unpredictable results, especially
    from accessing an uninitialized value (when the function fails) after
    this line.
    
    Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    zhaoleidd authored and masoncl committed Aug 9, 2015
  19. btrfs: Fix data checksum error cause by replace with io-load.

    xfstests btrfs/070 sometimes fails.
    On my test machine, its failure rate is about 30%.
    In another VM (VMware), its failure rate is about 50%.
    
    Reason:
      btrfs/070 runs replace and defrag simultaneously with fsstress;
      after these operations, scrub finds checksum errors.
    
      Actually, this has nothing to do with the defrag operation; replace
      with fsstress alone can trigger the bug.
    
      Debugging showed that new data written to the target device can be
      overwritten with old data from the source device by the replace code.
      To avoid this, we can set the target block group read-only during
      the replace, so that new data requested by other operations is not
      written to the same place the replace code is working on.
    
      Before patch(4.1-rc3):
        30% failed in 100 xfstests.
      After patch:
        0% failed in 300 xfstests.
    
    It also happened in btrfs/071 as it's another scrub with IO load tests.
    
    Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    zhaoleidd authored and masoncl committed Aug 9, 2015
  20. btrfs: use scrub_pause_on/off() to reduce code in scrub_enumerate_chunks()

    Using the newly introduced scrub_pause_on/off() makes this code block
    cleaner and more readable.
    
    Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    zhaoleidd authored and masoncl committed Aug 9, 2015
  21. btrfs: Separate scrub_blocked_if_needed() to scrub_pause_on/off()

    This reduces existing duplicated code that is similar to
    scrub_blocked_if_needed() but cannot call it because it differs
    slightly.
    It is also used by my next patch, which has the same need.
    
    Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    zhaoleidd authored and masoncl committed Aug 9, 2015
  22. btrfs: Use ref_cnt for set_block_group_ro()

    More than one code path calls set_block_group_ro() and restores rw on
    failure.
    
    The old code uses a bool bit to save the block group's ro state, which
    cannot support the parallel case (confirmed to exist in my debug log).
    
    This patch uses a ref count to store the ro state, and renames
    set_block_group_ro/set_block_group_rw
    to
    inc_block_group_ro/dec_block_group_ro.
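
Why a counter is needed where a flag is not enough can be shown with a small model. This is a sketch under assumptions: `struct toy_block_group` and the `toy_*` functions are hypothetical and only mirror the naming of the renamed helpers, not their real signatures or side effects.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a block group's read-only state. */
struct toy_block_group {
    unsigned int ro;  /* number of callers currently requiring read-only */
};

static void toy_inc_block_group_ro(struct toy_block_group *bg)
{
    bg->ro++;
}

/* Returns true only when this call dropped the last read-only holder. */
static bool toy_dec_block_group_ro(struct toy_block_group *bg)
{
    if (bg->ro == 0)
        return false;  /* unbalanced call: nothing to drop */
    bg->ro--;
    return bg->ro == 0;
}

static bool toy_block_group_is_ro(const struct toy_block_group *bg)
{
    return bg->ro > 0;
}
```

With a plain bool, the first of two parallel users to finish would flip the group back to rw while the second still depends on it being ro. With the counter, the group stays read-only until the last holder calls the decrement.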
    
    Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    zhaoleidd authored and masoncl committed Aug 9, 2015
  23. btrfs: Bypass unrelated items before accessing its contents in scrub

    When we access extent_root in scrub_stripe() and
    scrub_raid56_parity(), we need to skip unrelated tree items first,
    before using their contents in other conditions.
    
    It is not a bug fix; it only puts the code in a logical order.
    
    Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    zhaoleidd authored and masoncl committed Aug 9, 2015
  24. btrfs: Load only necessary csums into list in scrub

    We need not load the csums of the whole stripe in scrub because the
    stripe is trimmed before use; that is to say, the data we really need
    csums for is between [extent_logical, extent_len).
    
    This patch changes scrub_stripe() to use the above segment for
    btrfs_lookup_csums_range().
    
    Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    zhaoleidd authored and masoncl committed Aug 9, 2015
  25. btrfs: Fix calculate typo caused by ambiguous meaning of logic_end

    For example, in scrub_raid56_parity(), the following lines are used
    to judge whether all data has been processed:
     place1: if (key.objectid > logic_end) ...
     place2: if (logic_start >= logic_end) ...
     ...
     (place2 is a typo, it should be ">"; it was copied from another
      place, where logic_end's meaning is different, long story...)
    
    We could fix the above typo directly, but the root cause is the
    ambiguous meaning of logic_end in scrub raid56 parity.
    
    In other places, XXX_end points to data which is not included,
    and we need to process the segment [XXX_start, XXX_end).
    
    But for scrub raid56 parity, logic_end points to the last data that
    needs to be processed, which introduced many "+ 1" and "- 1" in the
    code, as below:
     length = sparity->logic_end - sparity->logic_start + 1
     logic_end - logic_start + 1
     stripe_logical + increment - 1
    
    This patch changes logic_end's meaning to match the usual convention
    in the raid56 parity functions and data struct, along with the above bugfix.
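
The two conventions for an "end" value can be put side by side. The helpers below are hypothetical and written only to illustrate the arithmetic; they are not taken from the scrub code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inclusive convention: 'last' is the final byte that must be processed. */
static uint64_t toy_len_inclusive(uint64_t start, uint64_t last)
{
    return last - start + 1;
}

static bool toy_done_inclusive(uint64_t pos, uint64_t last)
{
    return pos > last;
}

/* Half-open convention: 'end' is the first byte that is NOT processed. */
static uint64_t toy_len_half_open(uint64_t start, uint64_t end)
{
    return end - start;
}

static bool toy_done_half_open(uint64_t pos, uint64_t end)
{
    return pos >= end;
}
```

Both `toy_len_inclusive(4096, 8191)` and `toy_len_half_open(4096, 8192)` describe the same 4096 bytes. Applying the half-open test `pos >= end` to an inclusive end value stops one position early, which is the kind of mismatch the commit describes for place2.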
    
    Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    zhaoleidd authored and masoncl committed Aug 9, 2015
  26. btrfs: Free checksum list on scrub_extent() fail

    When scrub_extent() fails, we need to free the previously created
    checksum list.
    
    Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
    Signed-off-by: Chris Mason <clm@fb.com>
    zhaoleidd authored and masoncl committed Aug 9, 2015