Jan-Kara/ext4-…

Commits on Jun 16, 2021

  1. ext4: Improve scalability of ext4 orphan file handling

    Even though the length of the critical section when adding / removing
    orphaned inodes was significantly reduced by using orphan file, the
    contention of lock protecting orphan file still appears high in profiles
    for truncate / unlink intensive workloads with high number of threads.
    
    This patch makes handling of orphan file completely lockless. Also to
    reduce conflicts between CPUs, different CPUs start searching for an empty
    slot in the orphan file in different blocks.
    
    Performance comparison of locked orphan file handling, lockless orphan
    file handling, and completely disabled orphan inode handling
    from 80 CPU Xeon Server with 526 GB of RAM, filesystem located on
    SAS SSD disk, average of 5 runs:
    
    stress-orphan (microbenchmark truncating files byte-by-byte from N
    processes in parallel)
    
    Threads Time            Time            Time
            Orphan locked   Orphan lockless No orphan
      1       0.945600       0.939400        0.891200
      2       1.331800       1.246600        1.174400
      4       1.995000       1.780600        1.713200
      8       6.424200       4.900000        4.106000
     16      14.937600       8.516400        8.138000
     32      33.038200      24.565600       24.002200
     64      60.823600      39.844600       38.440200
    128     122.941400      70.950400       69.315000
    
    So we can see that with lockless orphan file handling, addition /
    deletion of orphaned inodes dropped almost completely out of the picture
    even for a microbenchmark stressing it.
    
    For reaim creat_clo workload on ramdisk there are also noticeable gains
    (average of 5 runs):
    
    Clients         Vanilla (ops/s)        Patched (ops/s)
    creat_clo-1     14705.88 (   0.00%)    14354.07 *  -2.39%*
    creat_clo-3     27108.43 (   0.00%)    28301.89 (   4.40%)
    creat_clo-5     37406.48 (   0.00%)    45180.73 *  20.78%*
    creat_clo-7     41338.58 (   0.00%)    54687.50 *  32.29%*
    creat_clo-9     45226.13 (   0.00%)    62937.07 *  39.16%*
    creat_clo-11    44000.00 (   0.00%)    65088.76 *  47.93%*
    creat_clo-13    36516.85 (   0.00%)    68661.97 *  88.03%*
    creat_clo-15    30864.20 (   0.00%)    69551.78 * 125.35%*
    creat_clo-17    27478.45 (   0.00%)    67729.08 * 146.48%*
    creat_clo-19    25000.00 (   0.00%)    61621.62 * 146.49%*
    creat_clo-21    18772.35 (   0.00%)    63829.79 * 240.02%*
    creat_clo-23    16698.94 (   0.00%)    61938.96 * 270.92%*
    creat_clo-25    14973.05 (   0.00%)    56947.61 * 280.33%*
    creat_clo-27    16436.69 (   0.00%)    65008.03 * 295.51%*
    creat_clo-29    13949.01 (   0.00%)    69047.62 * 395.00%*
    creat_clo-31    14283.52 (   0.00%)    67982.45 * 375.95%*
    
    Signed-off-by: Jan Kara <jack@suse.cz>
    jankara authored and intel-lab-lkp committed Jun 16, 2021
  2. ext4: Speedup ext4 orphan inode handling

    Ext4 orphan inode handling is a bottleneck for workloads which heavily
    truncate / unlink small files since it contends on the global
    s_orphan_mutex lock (and generally it's difficult to improve scalability
    of the ondisk linked list of orphaned inodes).
    
    This patch implements a new way of handling orphan inodes. Instead of
    linking an orphaned inode into a linked list, we store its inode number in
    a new special file which we call "orphan file". Currently we still
    protect the orphan file with a spinlock for simplicity but even in this
    setting we can substantially reduce the length of the critical section
    and thus speed up some workloads.
    
    Note that the change is backwards compatible when the filesystem is
    clean - the existence of the orphan file is a compat feature, we set
    another ro-compat feature indicating orphan file needs scanning for
    orphaned inodes when mounting filesystem read-write. This ro-compat
    feature gets cleared on unmount / remount read-only.
    
    Some performance data from 80 CPU Xeon Server with 512 GB of RAM,
    filesystem located on SSD, average of 5 runs:
    
    stress-orphan (microbenchmark truncating files byte-by-byte from N
    processes in parallel)
    
    Threads Time            Time
            Vanilla         Patched
      1       1.057200        0.945600
      2       1.680400        1.331800
      4       2.547000        1.995000
      8       7.049400        6.424200
     16      14.827800       14.937600
     32      40.948200       33.038200
     64      87.787400       60.823600
    128     206.504000      122.941400
    
    So we can see significant wins across the board.
    
    Signed-off-by: Jan Kara <jack@suse.cz>
    jankara authored and intel-lab-lkp committed Jun 16, 2021
  3. ext4: Move orphan inode handling into a separate file

    Move functions for handling orphan inodes into a new file
    fs/ext4/orphan.c to have them in one place and somewhat reduce size of
    other files. No code changes.
    
    Signed-off-by: Jan Kara <jack@suse.cz>
    jankara authored and intel-lab-lkp committed Jun 16, 2021
  4. ext4: Support for checksumming from journal triggers

    The JBD2 layer supports triggers which are called when the journaling
    layer moves a buffer to a certain state. We can use the frozen trigger,
    which gets called when buffer data is frozen and about to be written out
    to the journal, to compute block checksums for some buffer types (as
    ocfs2 does). This avoids unnecessary repeated recomputation of the
    checksum (at the cost of larger window where memory corruption won't be
    caught by checksumming) and is even necessary when there are
    unsynchronized updaters of the checksummed data.
    
    So add an argument to ext4_journal_get_write_access() and
    ext4_journal_get_create_access() which describes buffer type so that
    triggers can be set accordingly. This patch is mostly only a change of
    prototype of the above mentioned functions and a few small helpers. Real
    checksumming will come later.
    
    Signed-off-by: Jan Kara <jack@suse.cz>
    jankara authored and intel-lab-lkp committed Jun 16, 2021
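
The lockless slot search described in "ext4: Improve scalability of ext4 orphan file handling" above can be illustrated with a small userspace sketch. This is not the patch's code: the array layout, the sizes, and the names `orphan_add` and `orphan_del` are hypothetical, and C11 atomics stand in for whatever primitives the kernel implementation actually uses. The sketch only shows the two ideas from the commit message: claim an empty slot without taking a lock, and start the search in a block chosen per CPU.

```c
#include <stdatomic.h>
#include <stdint.h>

#define SLOTS_PER_BLOCK 4   /* hypothetical; a real block holds many more */
#define NUM_BLOCKS      3   /* hypothetical */

/* Each slot holds one inode number; 0 means "empty". */
static _Atomic uint32_t orphan_slots[NUM_BLOCKS][SLOTS_PER_BLOCK];

/*
 * Record `ino` in an empty slot without taking a lock. The search starts
 * in a block derived from the CPU number, so different CPUs tend to work
 * on different blocks. Returns the flat slot index, or -1 if all slots
 * are in use.
 */
static int orphan_add(unsigned int cpu, uint32_t ino)
{
    unsigned int start = cpu % NUM_BLOCKS;

    for (unsigned int i = 0; i < NUM_BLOCKS; i++) {
        unsigned int blk = (start + i) % NUM_BLOCKS;

        for (unsigned int s = 0; s < SLOTS_PER_BLOCK; s++) {
            uint32_t expected = 0;

            /* Claim the slot only if it is still empty. */
            if (atomic_compare_exchange_strong(&orphan_slots[blk][s],
                                               &expected, ino))
                return (int)(blk * SLOTS_PER_BLOCK + s);
        }
    }
    return -1;
}

/* Release a slot previously returned by orphan_add(). */
static void orphan_del(int slot)
{
    atomic_store(&orphan_slots[slot / SLOTS_PER_BLOCK][slot % SLOTS_PER_BLOCK],
                 0);
}
```

With this layout, a caller on CPU 0 and a caller on CPU 1 begin in blocks 0 and 1 respectively, so they contend on the same slot only once a block fills up and the search wraps into a neighbouring block.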

Commits on Jun 6, 2021

  1. ext4: update journal documentation

    Add a section about journal checkpointing, including information about
    the ioctl EXT4_IOC_CHECKPOINT which can be used to trigger a journal
    checkpoint from userspace.
    
    Also, update the journal allocation information to reflect that up to
    10240000 blocks are used for the journal and that the journal is not
    necessarily contiguous.
    
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    
    Changes in v5:
    - clarify behavior of DRY_RUN flag
    Link: https://lore.kernel.org/r/20210518151327.130198-3-leah.rumancik@gmail.com
    
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    lrumancik authored and tytso committed Jun 6, 2021
  2. ext4: add ioctl EXT4_IOC_CHECKPOINT

    ioctl EXT4_IOC_CHECKPOINT checkpoints and flushes the journal. This
    includes forcing all the transactions to the log, checkpointing the
    transactions, and flushing the log to disk. This ioctl takes u32 "flags"
    as an argument. Three flags are supported. EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
    can be used to verify input to the ioctl. It returns error if there is any
    invalid input, otherwise it returns success without performing
    any checkpointing. The other two flags, EXT4_IOC_CHECKPOINT_FLAG_DISCARD
    and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT, can be used to issue requests to
    discard or zeroout the journal log blocks, respectively. At this
    point, EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT is primarily added to enable
    testing of this codepath on devices that don't support discard.
    EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
    cannot both be set.
    
    Systems that wish to achieve content deletion SLO can set up a daemon
    that calls this ioctl at a regular interval such that it matches with the
    SLO requirement. Thus, with this patch, the ext4_dir_entry2 wipeout
    patch[1], and the Ext4 "-o discard" mount option set, Ext4 can now
    guarantee that all file contents, file metadata, and filenames will not
    be accessible through the filesystem and will have had discard or
    zeroout requests issued for corresponding device blocks.
    
    The __jbd2_journal_erase function could also be used to discard or
    zero-fill the journal during journal load after recovery. This would
    provide a potential solution to a journal replay bug reported earlier this
    year[2]. After a successful journal recovery, e2fsck can call this ioctl to
    discard the journal as well.
    
    [1] https://lore.kernel.org/linux-ext4/YIHknqxngB1sUdie@mit.edu/
    [2] https://lore.kernel.org/linux-ext4/YDZoaacIYStFQT8g@mit.edu/
    
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    
    Changes in v4:
    - update commit description
    - update error codes
    - update code formatting
    - add flag EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
    - add flag EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
    
    Changes in v5:
    - update error checking
    - make DRY_RUN include checks on input
    - added info about DRY_RUN in commit
    - added explicit conversion from ioctl flags to jbd2 flags
    Link: https://lore.kernel.org/r/20210518151327.130198-2-leah.rumancik@gmail.com
    
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    lrumancik authored and tytso committed Jun 6, 2021
  3. ext4: add discard/zeroout flags to journal flush

    Add a flags argument to jbd2_journal_flush to enable discarding or
    zero-filling the journal blocks while flushing the journal.
    
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    
    Changes in v4:
    - restructured code division between patches
    - changed jbd2_journal_flush flags arg from bool to unsigned long long
    
    Changes in v5:
    - changed jbd2_journal_flush flags to unsigned int
    - changed name of jbd2_journal_flush flags from JBD2_ERASE* to
    JBD2_JOURNAL_FLUSH*
    - cleaned up loop in jbd2_journal_erase which finds contiguous regions
    - updated flag checking
    Link: https://lore.kernel.org/r/20210518151327.130198-1-leah.rumancik@gmail.com
    
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    lrumancik authored and tytso committed Jun 6, 2021
  4. ext4: Only advertise encrypted_casefold when encryption and unicode are enabled

    Encrypted casefolding is only supported when both encryption and
    casefolding are enabled in the config.
    
    Fixes: 471fbbe ("ext4: handle casefolding with encryption")
    Cc: stable@vger.kernel.org # 5.13+
    Signed-off-by: Daniel Rosenberg <drosen@google.com>
    Link: https://lore.kernel.org/r/20210603094849.314342-1-drosen@google.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Daniel Rosenberg authored and tytso committed Jun 6, 2021
  5. ext4: fix no-key deletion for encrypt+casefold

    commit 471fbbe ("ext4: handle casefolding with encryption") is
    missing a few checks for the encryption key which are needed to
    support deleting encrypted casefolded files when the key is not
    present.
    
    This bug made it impossible to delete encrypted+casefolded directories
    without the encryption key, due to errors like:
    
        W         : EXT4-fs warning (device vdc): __ext4fs_dirhash:270: inode #49202: comm Binder:378_4: Siphash requires key
    
    Repro steps in kvm-xfstests test appliance:
          mkfs.ext4 -F -E encoding=utf8 -O encrypt /dev/vdc
          mount /vdc
          mkdir /vdc/dir
          chattr +F /vdc/dir
          keyid=$(head -c 64 /dev/zero | xfs_io -c add_enckey /vdc | awk '{print $NF}')
          xfs_io -c "set_encpolicy $keyid" /vdc/dir
          for i in `seq 1 100`; do
              mkdir /vdc/dir/$i
          done
          xfs_io -c "rm_enckey $keyid" /vdc
          rm -rf /vdc/dir # fails with the bug
    
    Fixes: 471fbbe ("ext4: handle casefolding with encryption")
    Signed-off-by: Daniel Rosenberg <drosen@google.com>
    Link: https://lore.kernel.org/r/20210522004132.2142563-1-drosen@google.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Daniel Rosenberg authored and tytso committed Jun 6, 2021
  6. ext4: fix memory leak in ext4_fill_super

    Buffer head references must be released before calling kill_bdev();
    otherwise the buffer head (and its page referenced by b_data) will not
    be freed by kill_bdev, and subsequently that bh will be leaked.
    
    If blocksizes differ, sb_set_blocksize() will kill current buffers and
    page cache by using kill_bdev(), and the superblock will then be reread
    using the correct blocksize. sb_set_blocksize() didn't fully free the
    superblock page and buffer head; being busy, they were not freed and
    instead leaked.
    
    This can easily be reproduced by calling an infinite loop of:
    
      systemctl start <ext4_on_lvm>.mount, and
      systemctl stop <ext4_on_lvm>.mount
    
    ... since systemd creates a cgroup for each slice which it mounts, and
    the bh leak gets amplified by a dying memory cgroup that also never
    gets freed, and memory consumption is much more easily noticed.
    
    Fixes: ce40733 ("ext4: Check for return value from sb_set_blocksize")
    Fixes: ac27a0e ("ext4: initial copy of files from ext3")
    Link: https://lore.kernel.org/r/20210521075533.95732-1-amakhalov@vmware.com
    Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Cc: stable@kernel.org
    YustasSwamp authored and tytso committed Jun 6, 2021
  7. ext4: fix fast commit alignment issues

    Fast commit recovery data on disk may not be aligned. So, when the
    recovery code reads it, this patch makes sure that fast commit info
    found on-disk is first memcpy-ed into an aligned variable before
    accessing it. As a consequence of it, we also remove some macros that
    could result in unaligned accesses.
    
    Cc: stable@kernel.org
    Fixes: 8016e29 ("ext4: fast commit recovery path")
    Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
    Link: https://lore.kernel.org/r/20210519215920.2037527-1-harshads@google.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    harshadjs authored and tytso committed Jun 6, 2021
  8. ext4: fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed

    We got the following BUG_ON when running fsstress with IO fault injection:
    [130747.323114] kernel BUG at fs/ext4/extents_status.c:762!
    [130747.323117] Internal error: Oops - BUG: 0 [#1] SMP
    ......
    [130747.334329] Call trace:
    [130747.334553]  ext4_es_cache_extent+0x150/0x168 [ext4]
    [130747.334975]  ext4_cache_extents+0x64/0xe8 [ext4]
    [130747.335368]  ext4_find_extent+0x300/0x330 [ext4]
    [130747.335759]  ext4_ext_map_blocks+0x74/0x1178 [ext4]
    [130747.336179]  ext4_map_blocks+0x2f4/0x5f0 [ext4]
    [130747.336567]  ext4_mpage_readpages+0x4a8/0x7a8 [ext4]
    [130747.336995]  ext4_readpage+0x54/0x100 [ext4]
    [130747.337359]  generic_file_buffered_read+0x410/0xae8
    [130747.337767]  generic_file_read_iter+0x114/0x190
    [130747.338152]  ext4_file_read_iter+0x5c/0x140 [ext4]
    [130747.338556]  __vfs_read+0x11c/0x188
    [130747.338851]  vfs_read+0x94/0x150
    [130747.339110]  ksys_read+0x74/0xf0
    
    This patch's modification is according to Jan Kara's suggestion in:
    https://patchwork.ozlabs.org/project/linux-ext4/patch/20210428085158.3728201-1-yebin10@huawei.com/
    "I see. Now I understand your patch. Honestly, seeing how fragile is trying
    to fix extent tree after split has failed in the middle, I would probably
    go even further and make sure we fix the tree properly in case of ENOSPC
    and EDQUOT (those are easily user triggerable).  Anything else indicates a
    HW problem or fs corruption so I'd rather leave the extent tree as is and
    don't try to fix it (which also means we will not create overlapping
    extents)."
    
    Cc: stable@kernel.org
    Signed-off-by: Ye Bin <yebin10@huawei.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/20210506141042.3298679-1-yebin10@huawei.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Ye Bin authored and tytso committed Jun 6, 2021
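
The flag rules stated in "ext4: add ioctl EXT4_IOC_CHECKPOINT" above can be summarised in a short sketch. Everything here is an assumption made for illustration: the `CKPT_*` names and their numeric values, the `-EINVAL` return code, and the `do_flush` stand-in are hypothetical and are not taken from the kernel headers. The sketch only encodes what the commit message says: invalid input is rejected, DISCARD and ZEROOUT cannot both be set, and DRY_RUN validates the input without checkpointing.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag values; consult the kernel headers for the real ones. */
#define CKPT_FLAG_DISCARD 0x1u
#define CKPT_FLAG_ZEROOUT 0x2u
#define CKPT_FLAG_DRY_RUN 0x4u
#define CKPT_FLAG_VALID \
    (CKPT_FLAG_DISCARD | CKPT_FLAG_ZEROOUT | CKPT_FLAG_DRY_RUN)

/* Hypothetical stand-in for the real journal flush; it only counts calls. */
static int flush_calls;

static int do_flush(uint32_t flags)
{
    (void)flags;
    flush_calls++;
    return 0;
}

/* Validate `flags` as the commit message describes, then flush. */
static int checkpoint(uint32_t flags)
{
    /* Reject bits that are not defined. */
    if (flags & ~CKPT_FLAG_VALID)
        return -EINVAL;

    /* DISCARD and ZEROOUT are mutually exclusive. */
    if ((flags & CKPT_FLAG_DISCARD) && (flags & CKPT_FLAG_ZEROOUT))
        return -EINVAL;

    /* A dry run stops after input validation. */
    if (flags & CKPT_FLAG_DRY_RUN)
        return 0;

    return do_flush(flags);
}
```

A caller can therefore probe a flag combination with the dry-run bit set first, and issue the real request only if the probe succeeds.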

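The technique named in "ext4: fix fast commit alignment issues" above, copying possibly unaligned on-disk data into an aligned variable before reading its fields, might look like the sketch below. The `tlv_hdr` layout, the little-endian encoding, and the function names are assumptions for illustration, not the on-disk fast commit format.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical tag/length header; not the kernel's definition. */
struct tlv_hdr {
    uint16_t tag;
    uint16_t len;
};

/* Interpret the bytes of `v` as little-endian, whatever the host order. */
static uint16_t le16_to_host(uint16_t v)
{
    const uint8_t *b = (const uint8_t *)&v;

    return (uint16_t)(b[0] | (b[1] << 8));
}

/*
 * Read a header from `p`, which may sit at any byte offset. Casting `p`
 * to `struct tlv_hdr *` and dereferencing it could be an unaligned
 * access; memcpy into a local, properly aligned variable avoids that.
 */
static struct tlv_hdr read_tlv_hdr(const uint8_t *p)
{
    struct tlv_hdr hdr;

    memcpy(&hdr, p, sizeof(hdr));
    hdr.tag = le16_to_host(hdr.tag);
    hdr.len = le16_to_host(hdr.len);
    return hdr;
}
```

Returning the header by value keeps every later field access on an aligned local copy, which is the property the commit message asks for.
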
Commits on Jun 3, 2021

  1. ext4: fix accessing uninit percpu counter variable with fast_commit

    When running generic/527 with fast_commit configuration, the following
    issue is seen on Power.  With fast_commit, during ext4_fc_replay()
    (which can be called from ext4_fill_super()), if inode eviction
    happens then it can access an uninitialized percpu counter variable.
    
    This patch adds the check before accessing the counters in
    ext4_free_inode() path.
    
    [  321.165371] run fstests generic/527 at 2021-04-29 08:38:43
    [  323.027786] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: block_validity. Quota mode: none.
    [  323.618772] BUG: Unable to handle kernel data access on read at 0x1fbd80000
    [  323.619767] Faulting instruction address: 0xc000000000bae78c
    cpu 0x1: Vector: 300 (Data Access) at [c000000010706ef0]
        pc: c000000000bae78c: percpu_counter_add_batch+0x3c/0x100
        lr: c0000000006d0bb0: ext4_free_inode+0x780/0xb90
        pid   = 5593, comm = mount
    	ext4_free_inode+0x780/0xb90
    	ext4_evict_inode+0xa8c/0xc60
    	evict+0xfc/0x1e0
    	ext4_fc_replay+0xc50/0x20f0
    	do_one_pass+0xfe0/0x1350
    	jbd2_journal_recover+0x184/0x2e0
    	jbd2_journal_load+0x1c0/0x4a0
    	ext4_fill_super+0x2458/0x4200
    	mount_bdev+0x1dc/0x290
    	ext4_mount+0x28/0x40
    	legacy_get_tree+0x4c/0xa0
    	vfs_get_tree+0x4c/0x120
    	path_mount+0xcf8/0xd70
    	do_mount+0x80/0xd0
    	sys_mount+0x3fc/0x490
    	system_call_exception+0x384/0x3d0
    	system_call_common+0xec/0x278
    
    Cc: stable@kernel.org
    Fixes: 8016e29 ("ext4: fast commit recovery path")
    Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
    Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
    Link: https://lore.kernel.org/r/6cceb9a75c54bef8fa9696c1b08c8df5ff6169e2.1619692410.git.riteshh@linux.ibm.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    riteshharjani authored and tytso committed Jun 3, 2021

Commits on May 21, 2021

  1. ext4: fix memory leak in ext4_mb_init_backend on error path.

    Fix a memory leak discovered by syzbot when a file system is corrupted
    with an illegally large s_log_groups_per_flex.
    
    Reported-by: syzbot+aa12d6106ea4ca1b6aae@syzkaller.appspotmail.com
    Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
    Cc: stable@kernel.org
    Link: https://lore.kernel.org/r/20210412073837.1686-1-phil@philpotter.co.uk
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    PhilPotter authored and tytso committed May 21, 2021

Commits on Apr 22, 2021

  1. ext4: wipe ext4_dir_entry2 upon file deletion

    Upon file deletion, zero out all fields in ext4_dir_entry2 besides rec_len.
    In case sensitive data is stored in filenames, this ensures no potentially
    sensitive data is left in the directory entry upon deletion. Also, wipe
    these fields upon moving a directory entry during the conversion to an
    htree and when splitting htree nodes.
    
    The data wiped may still exist in the journal, but there are future
    commits planned to address this.
    
    Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
    Link: https://lore.kernel.org/r/20210422180834.2242353-1-leah.rumancik@gmail.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    lrumancik authored and tytso committed Apr 22, 2021
  2. ext4: Fix occasional generic/418 failure

    Eric has noticed that after pagecache read rework, generic/418 is
    occasionally failing for ext4 when blocksize < pagesize. In fact, the
    pagecache rework just made a hard-to-hit race in ext4 more likely. The
    problem is that since ext4 conversion of direct IO writes to iomap
    framework (commit 378f32b), we update inode size after direct IO
    write only after invalidating page cache. Thus if a buffered read sneaks
    in at an unfortunate moment like:
    
    CPU1 - write at offset 1k                       CPU2 - read from offset 0
    iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
                                                    ext4_readpage();
    ext4_handle_inode_extension()
    
    the read will zero out tail of the page as it still sees smaller inode
    size and thus page cache becomes inconsistent with on-disk contents with
    all the consequences.
    
    Fix the problem by moving inode size update into end_io handler which
    gets called before the page cache is invalidated.
    
    Reported-and-tested-by: Eric Whitney <enwlinux@gmail.com>
    Fixes: 378f32b ("ext4: introduce direct I/O write using iomap infrastructure")
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara <jack@suse.cz>
    Acked-by: Dave Chinner <dchinner@redhat.com>
    Link: https://lore.kernel.org/r/20210415155417.4734-1-jack@suse.cz
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    jankara authored and tytso committed Apr 22, 2021
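
The wipe described in "ext4: wipe ext4_dir_entry2 upon file deletion" above, zeroing every field except rec_len, can be sketched as follows. The `demo_dirent` structure is a simplified, hypothetical stand-in rather than the kernel's definition, and `wipe_dirent` is an illustrative name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified, hypothetical directory entry; not the kernel's definition. */
struct demo_dirent {
    uint32_t inode;
    uint16_t rec_len;     /* total size of this record in bytes */
    uint8_t  name_len;
    uint8_t  file_type;
    char     name[255];   /* only the part inside rec_len is in use */
};

/*
 * Zero every field of the record except rec_len, which must survive so
 * that directory walking can still step over the dead entry.
 */
static void wipe_dirent(struct demo_dirent *de)
{
    size_t tail = offsetof(struct demo_dirent, name_len);

    if (de->rec_len < tail)
        return;   /* malformed record; nothing safe to wipe */

    de->inode = 0;
    /* Clear name_len, file_type and the name bytes inside the record. */
    memset((unsigned char *)de + tail, 0, de->rec_len - tail);
}
```

Bounding the memset by rec_len keeps the wipe inside the record, so a neighbouring entry in the same directory block is left untouched.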

Commits on Apr 18, 2021

  1. fs: fix reporting supported extra file attributes for statx()

    statx(2) notes that any attribute that is not indicated as supported
    by stx_attributes_mask has no usable value.  Commits 801e523
    ("fs: move generic stat response attr handling to vfs_getattr_nosec")
    and 712b269 ("fs/stat: Define DAX statx attribute") set
    STATX_ATTR_AUTOMOUNT and STATX_ATTR_DAX, respectively, without setting
    stx_attributes_mask, which can cause xfstests generic/532 to fail.
    
    Fix this in the same way as commit 1b9598c ("xfs: fix reporting
    supported extra file attributes for statx()")
    
    Fixes: 801e523 ("fs: move generic stat response attr handling to vfs_getattr_nosec")
    Fixes: 712b269 ("fs/stat: Define DAX statx attribute")
    Cc: stable@kernel.org
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    tytso committed Apr 18, 2021
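
The rule quoted from statx(2) in the commit above, that an attribute has no usable value unless stx_attributes_mask marks it as supported, suggests how a consumer should read the two fields. The helper below is a hypothetical sketch: `query_attr` and the `attr_state` enum are invented names, and the arguments are plain integers rather than a real `struct statx`.

```c
#include <stdint.h>

enum attr_state {
    ATTR_UNSUPPORTED = -1,  /* filesystem did not report this attribute */
    ATTR_CLEAR       = 0,
    ATTR_SET         = 1
};

/*
 * Interpret one attribute bit. `attributes` and `attributes_mask`
 * correspond to the stx_attributes and stx_attributes_mask fields.
 */
static enum attr_state query_attr(uint64_t attributes,
                                  uint64_t attributes_mask,
                                  uint64_t bit)
{
    /* Without the mask bit, the attribute bit carries no information. */
    if (!(attributes_mask & bit))
        return ATTR_UNSUPPORTED;

    return (attributes & bit) ? ATTR_SET : ATTR_CLEAR;
}
```

Under this reading, a filesystem that sets an attribute bit without the matching mask bit looks to the caller as if it never reported the attribute, which is the inconsistency the commit addresses.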

Commits on Apr 13, 2021

  1. ext4: allow the dax flag to be set and cleared on inline directories

    This is needed to allow generic/607 to pass for file systems with the
    inline_data feature enabled, and it allows the use of file systems
    where the directories use inline_data, while the files are accessed
    via DAX.
    
    Cc: stable@kernel.org
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    tytso committed Apr 13, 2021

Commits on Apr 10, 2021

  1. ext4: fix debug format string warning

    Using no_printk() for jbd_debug() revealed two warnings:
    
    fs/jbd2/recovery.c: In function 'fc_do_one_pass':
    fs/jbd2/recovery.c:256:30: error: format '%d' expects a matching 'int' argument [-Werror=format=]
      256 |                 jbd_debug(3, "Processing fast commit blk with seq %d");
          |                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    fs/ext4/fast_commit.c: In function 'ext4_fc_replay_add_range':
    fs/ext4/fast_commit.c:1732:30: error: format '%d' expects argument of type 'int', but argument 2 has type 'long unsigned int' [-Werror=format=]
     1732 |                 jbd_debug(1, "Converting from %d to %d %lld",
    
    The first one was added incorrectly, and was also missing a few newlines
    in debug output, and the second one happened when the type of an
    argument changed.
    
    Reported-by: kernel test robot <lkp@intel.com>
    Fixes: d556435 ("jbd2: avoid -Wempty-body warnings")
    Fixes: 6db0746 ("ext4: use BIT() macro for BH_** state bits")
    Fixes: 5b849b5 ("jbd2: fast commit recovery path")
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Link: https://lore.kernel.org/r/20210409201211.1866633-1-arnd@kernel.org
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    arndb authored and tytso committed Apr 10, 2021
  2. ext4: fix trailing whitespace

    Made suggested modifications from checkpatch in reference to ERROR:
     trailing whitespace
    
    Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
    Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
    Link: https://lore.kernel.org/r/20210409042035.15516-1-jack.qiu@huawei.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Jack Qiu authored and tytso committed Apr 10, 2021
  3. ext4: fix various seppling typos

    Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
    Link: https://lore.kernel.org/r/cover.1616840203.git.unixbhaskar@gmail.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    unixbhaskar authored and tytso committed Apr 10, 2021
  4. ext4: fix error return code in ext4_fc_perform_commit()

    In the "if not ext4_fc_add_tlv" branch, an error return code is missing.
    
    Cc: stable@kernel.org
    Fixes: aa75f4d ("ext4: main fast-commit commit path")
    Reported-by: Hulk Robot <hulkci@huawei.com>
    Signed-off-by: Xu Yihang <xuyihang@huawei.com>
    Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
    Link: https://lore.kernel.org/r/20210408070033.123047-1-xuyihang@huawei.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Xu Yihang authored and tytso committed Apr 10, 2021
  5. ext4: annotate data race in jbd2_journal_dirty_metadata()

    Assertion checks in jbd2_journal_dirty_metadata() are known to be racy
    but we don't want to be grabbing locks just for them.  We thus recheck
    them under b_state_lock only if it looks like they would fail. Annotate
    the checks with data_race().
    
    Cc: stable@kernel.org
    Reported-by: Hao Sun <sunhao.th@gmail.com>
    Signed-off-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/20210406161804.20150-2-jack@suse.cz
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    jankara authored and tytso committed Apr 10, 2021
  6. ext4: annotate data race in start_this_handle()

    Access to journal->j_running_transaction is not protected by appropriate
    lock and thus is racy. We are well aware of that and the code handles
    the race properly. Just add a comment and data_race() annotation.
    
    Cc: stable@kernel.org
    Reported-by: syzbot+30774a6acf6a2cf6d535@syzkaller.appspotmail.com
    Signed-off-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/20210406161804.20150-1-jack@suse.cz
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    jankara authored and tytso committed Apr 10, 2021
  7. ext4: fix ext4_error_err save negative errno into superblock

    Fix write_mmp_block() so that it returns -EIO instead of 1, so that
    the correct error gets saved into the superblock.
    
    Cc: stable@kernel.org
    Fixes: 54d3adb ("ext4: save all error info in save_error_info() and drop ext4_set_errno()")
    Reported-by: Liu Zhi Qiang <liuzhiqiang26@huawei.com>
    Signed-off-by: Ye Bin <yebin10@huawei.com>
    Reviewed-by: Andreas Dilger <adilger@dilger.ca>
    Link: https://lore.kernel.org/r/20210406025331.148343-1-yebin10@huawei.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Ye Bin authored and tytso committed Apr 10, 2021
  8. ext4: fix error code in ext4_commit_super

    We should set the error code when the argument check in ext4_commit_super fails.
    Found in code review.
    Fixes: c4be0c1 ("filesystem freeze: add error handling of write_super_lockfs/unlockfs").
    
    Cc: stable@kernel.org
    Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
    Reviewed-by: Andreas Dilger <adilger@dilger.ca>
    Link: https://lore.kernel.org/r/20210402101631.561-1-changfengnan@vivo.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Fengnan Chang authored and tytso committed Apr 10, 2021
  9. ext4: always panic when errors=panic is specified

    Before commit 014c9ca ("ext4: make ext4_abort() use
    __ext4_error()"), the following series of commands would trigger a
    panic:
    
    1. mount /dev/sda -o ro,errors=panic test
    2. mount /dev/sda -o remount,abort test
    
    After commit 014c9ca, remounting a file system using the test
    mount option "abort" will no longer trigger a panic.  This commit will
    restore the behaviour immediately before commit 014c9ca.
    (However, note that the Linux kernel's behavior has not been
    consistent; some previous kernel versions, including 5.4 and 4.19
    similarly did not panic after using the mount option "abort".)
    
    This also makes a change to long-standing behaviour; namely, the
    following series of commands will now cause a panic, when previously
    they did not:
    
    1. mount /dev/sda -o ro,errors=panic test
    2. echo test > /sys/fs/ext4/sda/trigger_fs_error
    
    However, this makes ext4's behaviour much more consistent, so this is
    a good thing.
    
    Cc: stable@kernel.org
    Fixes: 014c9ca ("ext4: make ext4_abort() use __ext4_error()")
    Signed-off-by: Ye Bin <yebin10@huawei.com>
    Link: https://lore.kernel.org/r/20210401081903.3421208-1-yebin10@huawei.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Ye Bin authored and tytso committed Apr 10, 2021
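
The warnings quoted in "ext4: fix debug format string warning" above appear once the compiler can check arguments against the format string. A minimal sketch of that mechanism, assuming GCC or Clang and using a hypothetical `debug_format` helper in place of the kernel's debug macros:

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * The format attribute (a GCC/Clang extension) asks the compiler to check
 * the variadic arguments against `fmt`, as it does for printf itself.
 */
__attribute__((format(printf, 3, 4)))
static int debug_format(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    return n;
}
```

With this attribute and -Wformat enabled, a call such as `debug_format(buf, sizeof(buf), "seq %d")` is flagged for the missing argument, and passing an `unsigned long` to `%d` is flagged for the type mismatch; `%lu` is the matching conversion for that type.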

Commits on Apr 9, 2021

  1. ext4: delete redundant uptodate check for buffer

    The buffer uptodate state is already checked in set_buffer_uptodate(), so
    there is no need to call buffer_uptodate() before calling
    set_buffer_uptodate(). Delete the redundant check.
    
    Cc: "Theodore Ts'o" <tytso@mit.edu>
    Cc: Andreas Dilger <adilger.kernel@dilger.ca>
    Signed-off-by: Yang Guo <guoyang2@huawei.com>
    Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
    Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
    Link: https://lore.kernel.org/r/1617260610-29770-1-git-send-email-zhangshaokun@hisilicon.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Yang Guo authored and tytso committed Apr 9, 2021
  2. ext4: do not set SB_ACTIVE in ext4_orphan_cleanup()

    When CONFIG_QUOTA is enabled, if we fail to mount the filesystem due
    to some error that happens after ext4_orphan_cleanup(), it will end up
    triggering a use-after-free of the super_block. The problem is that
    ext4_orphan_cleanup() sets the SB_ACTIVE flag if CONFIG_QUOTA is
    enabled. After we clean up the truncated inodes, the last iput() puts
    them onto the LRU list, and these inodes' pages may well be dirty and
    get written back by the writeback thread, which can race with
    freeing the super_block in the error path of mount_bdev().
    
    Checking why SB_ACTIVE is set in ext4_orphan_cleanup() shows that it
    was used to ensure the quota file is updated properly, but evicting the
    inode and trashing its data immediately in the last iput() does not
    affect the quota file, so setting the SB_ACTIVE flag seems not to be
    required[1]. Fix this issue by just removing the SB_ACTIVE setting.
    
    [1] https://lore.kernel.org/linux-ext4/99cce8ca-e4a0-7301-840f-2ace67c551f3@huawei.com/T/#m04990cfbc4f44592421736b504afcc346b2a7c00
    
    Cc: stable@kernel.org
    Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
    Tested-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/20210331033138.918975-1-yi.zhang@huawei.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    zhangyi089 authored and tytso committed Apr 9, 2021
  3. ext4: make prefetch_block_bitmaps default

    Block bitmap prefetching is needed for these allocator optimization
    data structures to get populated and provide better group scanning
    order. So, turn it on by default. prefetch_block_bitmaps mount option
    is now marked as removed and a new option no_prefetch_block_bitmaps is
    added to disable block bitmap prefetching.
    
    Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
    Link: https://lore.kernel.org/r/20210401172129.189766-8-harshadshirwadkar@gmail.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    harshadjs authored and tytso committed Apr 9, 2021
  4. ext4: add proc files to monitor new structures

    This patch adds a new file "mb_structs_summary" which allows us to see
    the summary of the new allocator structures added in this
    series. Here is sample output of the file:
    
    optimize_scan: 1
    max_free_order_lists:
            list_order_0_groups: 0
            list_order_1_groups: 0
            list_order_2_groups: 0
            list_order_3_groups: 0
            list_order_4_groups: 0
            list_order_5_groups: 0
            list_order_6_groups: 0
            list_order_7_groups: 0
            list_order_8_groups: 0
            list_order_9_groups: 0
            list_order_10_groups: 0
            list_order_11_groups: 0
            list_order_12_groups: 0
            list_order_13_groups: 40
    fragment_size_tree:
            tree_min: 16384
            tree_max: 32768
            tree_nodes: 40
    
    Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
    Reviewed-by: Andreas Dilger <adilger@dilger.ca>
    Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
    Link: https://lore.kernel.org/r/20210401172129.189766-7-harshadshirwadkar@gmail.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    harshadjs authored and tytso committed Apr 9, 2021
  5. ext4: improve cr 0 / cr 1 group scanning

    Instead of traversing through groups linearly, scan groups in specific
    orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
    largest free order >= the order of the request. So, with this patch,
    we maintain lists for each possible order and insert each group into a
    list based on the largest free order in its buddy bitmap. During cr 0
    allocation, we traverse these lists in the increasing order of largest
    free orders. This allows us to find a group with the best available cr
    0 match in constant time. If nothing can be found, we fall back to cr 1
    immediately.
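
The cr 0 lookup described above can be sketched in user-space C. This is a minimal illustration and not the kernel code: `struct order_lists`, `cr0_pick_group()`, `NUM_ORDERS` and `NO_GROUP` are hypothetical names, and each per-order list is reduced to the number of its first group.

```c
#include <assert.h>

#define NUM_ORDERS 14   /* assumed: orders 0..13, as with 4 KiB blocks */
#define NO_GROUP   (-1) /* marks an empty per-order list */

/* Hypothetical stand-in for the per-order lists: first_group[o] is the
 * first group whose largest free order is o, or NO_GROUP if none. */
struct order_lists {
    int first_group[NUM_ORDERS];
};

/* Walk the lists in increasing order, starting at the request order,
 * and return the first group found. The loop is bounded by NUM_ORDERS,
 * so its cost does not depend on the number of groups. */
static int cr0_pick_group(const struct order_lists *ol, int req_order)
{
    assert(req_order >= 0 && req_order < NUM_ORDERS);
    for (int order = req_order; order < NUM_ORDERS; order++) {
        if (ol->first_group[order] != NO_GROUP)
            return ol->first_group[order];
    }
    return NO_GROUP; /* nothing suitable: the caller would fall back to cr 1 */
}
```

A caller would treat `NO_GROUP` as the signal to move on to the cr 1 search.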
    
    At cr 1, the story is slightly different. We want to traverse in the
    order of increasing average fragment size. For cr 1, we maintain an rb
    tree of groupinfos which is sorted by average fragment size. Instead
    of traversing linearly, at cr 1, we traverse in the order of increasing
    average fragment size, starting at the most optimal group. This brings
    down cr 1 search complexity to O(log(number of groups)).
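
The logarithmic starting-point search at cr 1 can be illustrated as follows. This sketch swaps the rb tree for a sorted array with binary search, which gives the same ordering and the same logarithmic lookup but is not what the patch implements. It also assumes that the "most optimal group" is the one with the smallest average fragment size that is at least the request length. `struct group_frag` and `cr1_start_index()` are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical per-group record, kept sorted by avg_frag_size. */
struct group_frag {
    unsigned int group;         /* group number */
    unsigned int avg_frag_size; /* average free fragment size, in blocks */
};

/* Return the index of the first entry whose average fragment size is
 * >= req_len in an array sorted by ascending avg_frag_size, or n if no
 * entry qualifies. Scanning then continues from that index upward,
 * i.e. in order of increasing average fragment size. */
static size_t cr1_start_index(const struct group_frag *g, size_t n,
                              unsigned int req_len)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (g[mid].avg_frag_size < req_len)
            lo = mid + 1; /* too fragmented: look further right */
        else
            hi = mid;     /* candidate: keep looking for an earlier one */
    }
    return lo;
}
```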
    
    For cr >= 2, we just perform the linear search as before. Also, in
    case of lock contention, we intermittently fall back to linear search
    even in the cr 0 and cr 1 cases. This allows the allocation path to
    make progress even under high contention.
    
    There is an opportunity to optimize cr 2 as well, because at cr 2 we
    only consider groups whose bb_free counter (number of free blocks) is
    greater than the requested extent size. That is left as future work.
    
    All the changes introduced in this patch are protected under a new
    mount option "mb_optimize_scan".
    
    With this patchset, the following experiment was performed:
    
    Created a highly fragmented disk of size 65TB. The disk had no
    contiguous 2M regions. The following command was run 3 times in a
    row:
    
    time dd if=/dev/urandom of=file bs=2M count=10
    
    Here are the results with and without cr 0/1 optimizations introduced
    in this patch:
    
    |---------+------------------------------+---------------------------|
    |         | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
    |---------+------------------------------+---------------------------|
    | 1st run | 5m1.871s                     | 2m47.642s                 |
    | 2nd run | 2m28.390s                    | 0m0.611s                  |
    | 3rd run | 2m26.530s                    | 0m1.255s                  |
    |---------+------------------------------+---------------------------|
    
    Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
    Reviewed-by: Andreas Dilger <adilger@dilger.ca>
    Link: https://lore.kernel.org/r/20210401172129.189766-6-harshadshirwadkar@gmail.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    harshadjs authored and tytso committed Apr 9, 2021
  6. ext4: add MB_NUM_ORDERS macro

    A few arrays in mballoc.c use the total number of valid orders as
    their size. Currently, this value is set as "sb->s_blocksize_bits +
    2". This makes code harder to read. So, instead add a new macro
    MB_NUM_ORDERS(sb) to make the code more readable.
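
A minimal sketch of such a macro, built only from the expression quoted above. `struct super_block_sketch` is a hypothetical stand-in for the kernel's `struct super_block` that carries just the one field used here.

```c
/* Hypothetical stand-in for struct super_block: only the field the
 * macro reads is modelled. */
struct super_block_sketch {
    unsigned char s_blocksize_bits; /* log2 of the block size */
};

/* Total number of valid buddy orders, i.e. s_blocksize_bits + 2. */
#define MB_NUM_ORDERS(sb) ((sb)->s_blocksize_bits + 2)
```

With 4 KiB blocks (`s_blocksize_bits` of 12) this evaluates to 14, which covers orders 0 through 13.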
    
    Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
    Reviewed-by: Andreas Dilger <adilger@dilger.ca>
    Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
    Link: https://lore.kernel.org/r/20210401172129.189766-5-harshadshirwadkar@gmail.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    harshadjs authored and tytso committed Apr 9, 2021
  7. ext4: add mballoc stats proc file

    Add new stats for measuring the performance of mballoc. This patch is
    forked from Artem Blagodarenko's work that can be found here:
    
    https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel8/ext4-simple-blockalloc.patch
    
    This patch reorganizes the stats by cr level. This is what the output
    looks like:
    
    mballoc:
    	reqs: 0
    	success: 0
    	groups_scanned: 0
    	cr0_stats:
    		hits: 0
    		groups_considered: 0
    		useless_loops: 0
    		bad_suggestions: 0
    	cr1_stats:
    		hits: 0
    		groups_considered: 0
    		useless_loops: 0
    		bad_suggestions: 0
    	cr2_stats:
    		hits: 0
    		groups_considered: 0
    		useless_loops: 0
    	cr3_stats:
    		hits: 0
    		groups_considered: 0
    		useless_loops: 0
    	extents_scanned: 0
    		goal_hits: 0
    		2^n_hits: 0
    		breaks: 0
    		lost: 0
    	buddies_generated: 0/40
    	buddies_time_used: 0
    	preallocated: 0
    	discarded: 0
    
    Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
    Reviewed-by: Andreas Dilger <adilger@dilger.ca>
    Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
    Link: https://lore.kernel.org/r/20210401172129.189766-4-harshadshirwadkar@gmail.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    harshadjs authored and tytso committed Apr 9, 2021
  8. ext4: add ability to return parsed options from parse_options

    Before this patch, the function parse_options() returned the
    journal_devnum and journal_ioprio variables to the caller. This patch
    generalizes that interface to allow parse_options() to return any
    parsed options to the caller. In this patch series, it is used to
    capture the value of the "mb_optimize_scan=%u" mount option.
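
The general shape of that interface change can be sketched in user-space C: the parser fills one structure instead of taking a separate out-parameter per value. The names `struct parsed_options_sketch` and `parse_options_sketch()` are hypothetical, and the string handling uses the C library in place of the kernel's option-matching helpers.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for everything the parser hands back. */
struct parsed_options_sketch {
    unsigned long journal_devnum;
    unsigned int journal_ioprio;
    int mb_optimize_scan; /* -1 means the option was not given */
};

/* Parse a comma-separated option string into *out.
 * Returns 1 on success, 0 if the string is too long for the buffer. */
static int parse_options_sketch(const char *options,
                                struct parsed_options_sketch *out)
{
    char buf[256];
    unsigned int val;

    out->journal_devnum = 0;
    out->journal_ioprio = 0;
    out->mb_optimize_scan = -1;

    if (strlen(options) >= sizeof(buf))
        return 0;
    strcpy(buf, options); /* strtok() modifies its input, so work on a copy */

    for (char *tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
        if (sscanf(tok, "mb_optimize_scan=%u", &val) == 1)
            out->mb_optimize_scan = (int)val;
        /* every other option is ignored in this sketch */
    }
    return 1;
}
```

Adding another returned option then means adding a field to the structure, with no change to the function signature.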
    
    Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
    Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
    Link: https://lore.kernel.org/r/20210401172129.189766-3-harshadshirwadkar@gmail.com
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    harshadjs authored and tytso committed Apr 9, 2021