Qu-Wenruo/Intr…

Commits on Jan 10, 2020

  1. btrfs: statfs: Use pre-calculated per-profile available space

    Although btrfs_calc_avail_data_space() is trying to do an estimation
    on how many data chunks it can allocate, the estimation is far from
    perfect:
    
    - Metadata over-commit is not considered at all
    - Chunk allocation doesn't take RAID5/6 into consideration
    
    This patch will change btrfs_calc_avail_data_space() to use
    pre-calculated per-profile available space.
    
    This provides the following benefits:
    - Accurate unallocated data space estimation, including RAID5/6
      It's as accurate as chunk allocator, and can handle RAID5/6.
    
    Although it still can't handle metadata over-commit that accurately, we
    still have fallback method for over-commit, by using factor based
    estimation.
    
    The good news is that over-commit can only happen when we have enough
    unallocated space, so even though we may not report a byte-accurate
    result when the fs is empty, the result gets more accurate as
    unallocated space shrinks.
    
    So metadata over-commit shouldn't cause too many problems.
    
    Since we're keeping the old lock-free design, statfs should not experience
    any extra delay.
    
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    adam900710 authored and 0day robot committed Jan 10, 2020
  2. btrfs: space-info: Use per-profile available space in can_overcommit()

    For the following disk layout, can_overcommit() can cause false
    confidence in available space:
    
      devid 1 unallocated:	1T
      devid 2 unallocated:	10T
      metadata type:	RAID1
    
    This is because can_overcommit() simply divides the unallocated space
    by a profile factor to calculate the allocatable metadata chunk size.
    
    can_overcommit() believes we still have 5.5T for metadata chunks, while
    the truth is, we only have 1T available for metadata chunks.
    This can lead to ENOSPC at run_delalloc_range() and cause transaction
    abort.
    
    Since factor based calculation can't distinguish RAID1/RAID10 and DUP at
    all, we need proper chunk-allocator level awareness to do such estimation.
    
    Thankfully, we have per-profile available space already calculated, just
    use that facility to avoid such false confidence.
    
    Reported-by: Marc Lehmann <schmorp@schmorp.de>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    adam900710 authored and 0day robot committed Jan 10, 2020
  3. btrfs: Introduce per-profile available space facility

    [PROBLEM]
    There are some locations in btrfs requiring accurate estimation on how
    many new bytes can be allocated on unallocated space.
    
    We have two types of estimation:
    - Factor based calculation
      Just use all unallocated space, divide by the profile factor
      One obvious user is can_overcommit().
    
    - Chunk allocator like calculation
      This will emulate the chunk allocator behavior, to get a proper
      estimation.
      The only user is btrfs_calc_avail_data_space(), utilized by
      btrfs_statfs().
      The problem is that this function is not general-purpose enough: it
      can't handle things like RAID5/6.
    
    Current factor based calculation can't handle the following case:
      devid 1 unallocated:	1T
      devid 2 unallocated:	10T
      metadata type:	RAID1
    
    If using factor, we can use (1T + 10T) / 2 = 5.5T free space for
    metadata.
    But in fact we can only get 1T free space, as we're limited by the
    smallest device for RAID1.
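
    The arithmetic above can be checked with a small user-space sketch.
    The helper names and the GiB units are assumptions made for this
    illustration; they are not btrfs functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Factor-based estimate: sum all unallocated space and divide by the
 * profile factor (2 for RAID1). Hypothetical helper, not kernel code.
 */
static uint64_t factor_based_avail(const uint64_t *unalloc, size_t ndevs,
                                   uint64_t factor)
{
    uint64_t total = 0;

    assert(factor > 0);
    for (size_t i = 0; i < ndevs; i++)
        total += unalloc[i];
    return total / factor;
}

/*
 * What a two-device RAID1 layout can actually provide: every chunk
 * needs one stripe on each device, so the smaller device is the limit.
 */
static uint64_t raid1_two_dev_avail(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}
```

    With unallocated sizes of 1024 GiB and 10240 GiB, factor_based_avail()
    with a factor of 2 gives 5632 GiB (5.5T), while raid1_two_dev_avail()
    gives 1024 GiB (1T).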
    
    [SOLUTION]
    This patch will introduce per-profile available space calculation,
    which can give an estimation based on chunk-allocator-like behavior.
    
    The difference between it and chunk allocator is mostly on rounding and
    [0, 1M) reserved space handling, which shouldn't cause practical impact.
    
    The newly introduced per-profile available space calculation will
    calculate available space for each type, using chunk-allocator like
    calculation.
    
    With that facility, for above device layout we get the full available
    space array:
      RAID10:	0  (not enough devices)
      RAID1:	1T
      RAID1C3:	0  (not enough devices)
      RAID1C4:	0  (not enough devices)
      DUP:		5.5T
      RAID0:	2T
      SINGLE:	11T
      RAID5:	1T
      RAID6:	0  (not enough devices)
    
    Or for a more complex example:
      devid 1 unallocated:	1T
      devid 2 unallocated:	1T
      devid 3 unallocated:	10T
    
    We will get an array of:
      RAID10:	0  (not enough devices)
      RAID1:	2T
      RAID1C3:	1T
      RAID1C4:	0  (not enough devices)
      DUP:		6T
      RAID0:	3T
      SINGLE:	12T
      RAID5:	2T
      RAID6:	0  (not enough devices)
    
    For each profile, we do a chunk-allocator-level calculation.
    The pseudo code looks like:
    
      clear_virtual_used_space_of_all_rw_devices();
      do {
      	/*
      	 * The same as chunk allocator, despite used space,
      	 * we also take virtual used space into consideration.
      	 */
      	sort_device_with_virtual_free_space();
    
      	/*
      	 * Unlike chunk allocator, we don't need to bother with hole/stripe
      	 * size, so we use the smallest device to make sure we can
      	 * allocate as many stripes as the regular chunk allocator
      	 */
      	stripe_size = device_with_smallest_free->avail_space;
      	stripe_size = min(stripe_size, to_alloc / ndevs);
    
      	/*
      	 * Allocate a virtual chunk, allocated virtual chunk will
      	 * increase virtual used space, allow next iteration to
      	 * properly emulate chunk allocator behavior.
      	 */
      	ret = alloc_virtual_chunk(stripe_size, &allocated_size);
      	if (ret == 0)
      		avail += allocated_size;
      } while (ret == 0);
    
    As we always sort the devices by free space and size each stripe by the
    selected device with the least free space, the device with the most
    space will be the first to be utilized, just like chunk allocator.
    For above 1T + 10T device, we will allocate a 1T virtual chunk
    in the first iteration, then run out of devices in the next iteration.
    
    Thus only get 1T free space for RAID1 type, just like what chunk
    allocator would do.
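
    The loop can be emulated in user space for mirrored profiles. This is
    a simplified sketch, not the kernel implementation: mirrored_avail(),
    cmp_free_desc() and MAX_DEVS are hypothetical names, sizes are in GiB,
    and the to_alloc / ndevs cap, rounding and the [0, 1M) reserved range
    are all left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_DEVS 16	/* arbitrary limit for this sketch */

/* qsort() comparator: largest virtual free space first. */
static int cmp_free_desc(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;

    return (x < y) - (x > y);
}

/*
 * Emulate virtual chunk allocation for a profile that keeps `ncopies`
 * copies on `ncopies` different devices (2 for RAID1, 3 for RAID1C3).
 */
static uint64_t mirrored_avail(const uint64_t *unalloc, size_t ndevs,
                               size_t ncopies)
{
    uint64_t vfree[MAX_DEVS];	/* virtual free space per device */
    uint64_t avail = 0;

    assert(ncopies >= 1 && ndevs <= MAX_DEVS);
    if (ndevs < ncopies)
        return 0;		/* not enough devices */
    memcpy(vfree, unalloc, ndevs * sizeof(*vfree));

    for (;;) {
        /* Re-sort so earlier virtual chunks are taken into account. */
        qsort(vfree, ndevs, sizeof(*vfree), cmp_free_desc);

        /* Stripe size is limited by the smallest selected device. */
        uint64_t stripe = vfree[ncopies - 1];
        if (stripe == 0)
            break;		/* ran out of usable devices */

        /* "Allocate" the virtual chunk on the selected devices. */
        for (size_t i = 0; i < ncopies; i++)
            vfree[i] -= stripe;
        avail += stripe;	/* mirrored: one stripe of usable space */
    }
    return avail;
}
```

    For {1024, 10240} and two copies this returns 1024 (1T), and for
    {1024, 1024, 10240} it returns 2048 (2T) with two copies and 1024 (1T)
    with three, matching the RAID1 and RAID1C3 entries in the arrays above.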
    
    The patch will update the per-profile available space at the following
    times:
    - Mount time
    - Chunk allocation
    - Chunk removal
    - Device grow
    - Device shrink
    
    All of these are protected by chunk_mutex, and we only iterate over
    in-memory structures with no extra IO triggered, so the performance
    impact should be pretty small.
    
    For the extra error handling, the principle is to keep the old behavior.
    That is to say, if the old error handler would just return an error,
    then we follow it.
    If the old error handler chooses to abort the transaction, then we
    follow it too.
    So no new failure mode is introduced.
    
    Suggested-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    adam900710 authored and 0day robot committed Jan 10, 2020
  4. btrfs: Reset device size when btrfs_update_device() failed in btrfs_g…

    …row_device()
    
    When btrfs_update_device() failed due to ENOMEM, we didn't reset the
    device size back to its original value, leaving the in-memory device
    size larger than the original.
    
    If the memory pressure is later resolved and the fs is committed, the
    super block total size gets updated while the device item does not,
    which would cause a mount failure due to size mismatch.
    
    So revert the device size and super size to their original values when
    btrfs_update_device() fails, just like what we did in shrink_device().
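
    The rollback pattern can be sketched in user-space C. Everything here
    is hypothetical and only illustrates the idea: struct dev_sizes,
    grow_device() and the two stub update functions stand in for the real
    btrfs structures and btrfs_update_device().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical in-memory sizes that must stay consistent. */
struct dev_sizes {
    uint64_t device_total;
    uint64_t super_total;
};

/* Stand-in for the on-disk item update; returns 0 or a negative errno. */
typedef int (*update_fn)(const struct dev_sizes *);

static int update_ok(const struct dev_sizes *s)
{
    (void)s;
    return 0;
}

static int update_enomem(const struct dev_sizes *s)
{
    (void)s;
    return -ENOMEM;
}

static int grow_device(struct dev_sizes *s, uint64_t new_size,
                       update_fn update)
{
    const struct dev_sizes old = *s;	/* remember the original sizes */
    int ret;

    assert(new_size >= s->device_total);
    s->super_total += new_size - s->device_total;
    s->device_total = new_size;

    ret = update(s);
    if (ret < 0)
        *s = old;	/* revert both sizes so memory matches disk */
    return ret;
}
```

    Calling grow_device() with update_enomem leaves both sizes unchanged,
    while calling it with update_ok keeps the new sizes.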
    
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    adam900710 authored and 0day robot committed Jan 10, 2020

Commits on Jan 5, 2020

  1. Linux 5.5-rc5

    torvalds committed Jan 5, 2020
  2. Merge tag 'riscv/for-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/…

    …kernel/git/riscv/linux
    
    Pull RISC-V fixes from Paul Walmsley:
     "Several fixes for RISC-V:
    
       - Fix function graph trace support
    
       - Prefix the CSR IRQ_* macro names with "RV_", to avoid collisions
         with macros elsewhere in the Linux kernel tree named "IRQ_TIMER"
    
       - Use __pa_symbol() when computing the physical address of a kernel
         symbol, rather than __pa()
    
       - Mark the RISC-V port as supporting GCOV
    
      One DT addition:
    
       - Describe the L2 cache controller in the FU540 DT file
    
      One documentation update:
    
       - Add patch acceptance guideline documentation"
    
    * tag 'riscv/for-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
      Documentation: riscv: add patch acceptance guidelines
      riscv: prefix IRQ_ macro names with an RV_ namespace
      clocksource: riscv: add notrace to riscv_sched_clock
      riscv: ftrace: correct the condition logic in function graph tracer
      riscv: dts: Add DT support for SiFive L2 cache controller
      riscv: gcov: enable gcov for RISC-V
      riscv: mm: use __pa_symbol for kernel symbols
    torvalds committed Jan 5, 2020
  3. Documentation: riscv: add patch acceptance guidelines

    Formalize, in kernel documentation, the patch acceptance policy for
    arch/riscv.  In summary, it states that as maintainers, we plan to
    only accept patches for new modules or extensions that have been
    frozen or ratified by the RISC-V Foundation.
    
    We've been following these guidelines for the past few months.  In the
    meantime, we've received quite a bit of feedback that it would be
    helpful to have these guidelines formally documented.
    
    Based on a suggestion from Matthew Wilcox, we also add a link to this
    file to Documentation/process/index.rst, to make this document easier
    to find.  The format of this document has also been changed to align
    to the format outlined in the maintainer entry profiles, in accordance
    with comments from Jon Corbet and Dan Williams.
    
    Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
    Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Albert Ou <aou@eecs.berkeley.edu>
    Cc: Krste Asanovic <krste@berkeley.edu>
    Cc: Andrew Waterman <waterman@eecs.berkeley.edu>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    paul-walmsley-sifive committed Jan 5, 2020
  4. riscv: prefix IRQ_ macro names with an RV_ namespace

    "IRQ_TIMER", used in the arch/riscv CSR header file, is a sufficiently
    generic macro name that it's used by several source files across the
    Linux code base.  Some of these other files ultimately include the
    arch/riscv CSR include file, causing collisions.  Fix by prefixing the
    RISC-V csr.h IRQ_ macro names with an RV_ prefix.
    
    Fixes: a4c3733 ("riscv: abstract out CSR names for supervisor vs machine mode")
    Reported-by: Olof Johansson <olof@lixom.net>
    Acked-by: Olof Johansson <olof@lixom.net>
    Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
    paul-walmsley-sifive committed Jan 5, 2020
  5. clocksource: riscv: add notrace to riscv_sched_clock

    When enabling ftrace graph tracer, it gets the tracing clock in
    ftrace_push_return_trace().  Eventually, it invokes riscv_sched_clock()
    to get the clock value.  If riscv_sched_clock() isn't marked with
    'notrace', it will call ftrace_push_return_trace() and cause infinite
    loop.
    
    The resulting failure is as follows:
    
    command: echo function_graph >current_tracer
    [   46.176787] Unable to handle kernel paging request at virtual address ffffffe04fb38c48
    [   46.177309] Oops [#1]
    [   46.177478] Modules linked in:
    [   46.177770] CPU: 0 PID: 256 Comm: $d Not tainted 5.5.0-rc1 #47
    [   46.177981] epc: ffffffe00035e59a ra : ffffffe00035e57e sp : ffffffe03a7569b0
    [   46.178216]  gp : ffffffe000d29b90 tp : ffffffe03a756180 t0 : ffffffe03a756968
    [   46.178430]  t1 : ffffffe00087f408 t2 : ffffffe03a7569a0 s0 : ffffffe03a7569f0
    [   46.178643]  s1 : ffffffe00087f408 a0 : 0000000ac054cda4 a1 : 000000000087f411
    [   46.178856]  a2 : 0000000ac054cda4 a3 : 0000000000373ca0 a4 : ffffffe04fb38c48
    [   46.179099]  a5 : 00000000153e22a8 a6 : 00000000005522ff a7 : 0000000000000005
    [   46.179338]  s2 : ffffffe03a756a90 s3 : ffffffe00032811c s4 : ffffffe03a756a58
    [   46.179570]  s5 : ffffffe000d29fe0 s6 : 0000000000000001 s7 : 0000000000000003
    [   46.179809]  s8 : 0000000000000003 s9 : 0000000000000002 s10: 0000000000000004
    [   46.180053]  s11: 0000000000000000 t3 : 0000003fc815749c t4 : 00000000000efc90
    [   46.180293]  t5 : ffffffe000d29658 t6 : 0000000000040000
    [   46.180482] status: 0000000000000100 badaddr: ffffffe04fb38c48 cause: 000000000000000f
    
    Signed-off-by: Zong Li <zong.li@sifive.com>
    Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    [paul.walmsley@sifive.com: cleaned up patch description]
    Fixes: 92e0d14 ("clocksource/drivers/riscv_timer: Provide the sched_clock")
    Cc: stable@vger.kernel.org
    Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
    zongbox authored and paul-walmsley-sifive committed Jan 5, 2020
  6. Merge branch 'akpm' (patches from Andrew)

    Merge misc fixes from Andrew Morton:
     "17 fixes"
    
    * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
      hexagon: define ioremap_uc
      ocfs2: fix the crash due to call ocfs2_get_dlm_debug once less
      ocfs2: call journal flush to mark journal as empty after journal recovery when mount
      mm/hugetlb: defer freeing of huge pages if in non-task context
      mm/gup: fix memory leak in __gup_benchmark_ioctl
      mm/oom: fix pgtables units mismatch in Killed process message
      fs/posix_acl.c: fix kernel-doc warnings
      hexagon: work around compiler crash
      hexagon: parenthesize registers in asm predicates
      fs/namespace.c: make to_mnt_ns() static
      fs/nsfs.c: include headers for missing declarations
      fs/direct-io.c: include fs/internal.h for missing prototype
      mm: move_pages: return valid node id in status if the page is already on the target node
      memcg: account security cred as well to kmemcg
      kcov: fix struct layout for kcov_remote_arg
      mm/zsmalloc.c: fix the migrated zspage statistics.
      mm/memory_hotplug: shrink zones when offlining memory
    torvalds committed Jan 5, 2020
  7. Merge tag 'apparmor-pr-2020-01-04' of git://git.kernel.org/pub/scm/li…

    …nux/kernel/git/jj/linux-apparmor
    
    Pull apparmor fixes from John Johansen:
    
     - performance regression: only get a label reference if the fast path
       check fails
    
     - fix aa_xattrs_match() may sleep while holding a RCU lock
    
     - fix bind mounts aborting with -ENOMEM
    
    * tag 'apparmor-pr-2020-01-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor:
      apparmor: fix aa_xattrs_match() may sleep while holding a RCU lock
      apparmor: only get a label reference if the fast path check fails
      apparmor: fix bind mounts aborting with -ENOMEM
    torvalds committed Jan 5, 2020

Commits on Jan 4, 2020

  1. apparmor: fix aa_xattrs_match() may sleep while holding a RCU lock

    aa_xattrs_match() is unfortunately calling vfs_getxattr_alloc() from a
    context protected by an rcu_read_lock. This cannot be done as
    vfs_getxattr_alloc() may sleep regardless of the gfp_t value being
    passed to it.
    
    Fix this by breaking the rcu_read_lock on the policy search when the
    xattr match feature is requested and restarting the search if a policy
    change occurs.
    
    Fixes: 8e51f90 ("apparmor: Add support for attaching profiles via xattr, presence and value")
    Reported-by: Jia-Ju Bai <baijiaju1990@gmail.com>
    Reported-by: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: John Johansen <john.johansen@canonical.com>
    John Johansen committed Jan 4, 2020
  2. Merge tag 'mips_fixes_5.5_1' of git://git.kernel.org/pub/scm/linux/ke…

    …rnel/git/mips/linux
    
    Pull MIPS fixes from Paul Burton:
     "A collection of MIPS fixes:
    
       - Fill the struct cacheinfo shared_cpu_map field with sensible
         values, notably avoiding issues with perf which was unhappy in the
         absence of these values.
    
       - A boot fix for Loongson 2E & 2F machines which was fallout from
         some refactoring performed this cycle.
    
       - A Kconfig dependency fix for the Loongson CPU HWMon driver.
    
       - A couple of VDSO fixes, ensuring gettimeofday() behaves
         appropriately for kernel configurations that don't include support
         for a clocksource the VDSO can use & fixing the calling convention
         for the n32 & n64 VDSOs which would previously clobber the $gp/$28
         register.
    
       - A build fix for vmlinuz compressed images which were
         inappropriately building with -fsanitize-coverage despite not being
         part of the kernel proper, then failing to link due to the missing
         __sanitizer_cov_trace_pc() function.
    
       - A couple of eBPF JIT fixes, including disabling it for MIPS32 due
         to a large number of issues with the code generated there &
         reflecting ISA dependencies in Kconfig to enforce that systems
         which don't support the JIT must include the interpreter"
    
    * tag 'mips_fixes_5.5_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
      MIPS: Avoid VDSO ABI breakage due to global register variable
      MIPS: BPF: eBPF JIT: check for MIPS ISA compliance in Kconfig
      MIPS: BPF: Disable MIPS32 eBPF JIT
      MIPS: Prevent link failure with kcov instrumentation
      MIPS: Kconfig: Use correct form for 'depends on'
      mips: Fix gettimeofday() in the vdso library
      MIPS: Fix boot on Fuloong2 systems
      mips: cacheinfo: report shared CPU map
    torvalds committed Jan 4, 2020
  3. hexagon: define ioremap_uc

    Similar to commit 38e45d8 ("sparc64: implement ioremap_uc") define
    ioremap_uc for hexagon to avoid errors from
    -Wimplicit-function-declaration.
    
    Link: http://lkml.kernel.org/r/20191209222956.239798-2-ndesaulniers@google.com
    Link: ClangBuiltLinux#797
    Fixes: e537654 ("lib: devres: add a helper function for ioremap_uc")
    Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
    Suggested-by: Nathan Chancellor <natechancellor@gmail.com>
    Acked-by: Brian Cain <bcain@codeaurora.org>
    Cc: Lee Jones <lee.jones@linaro.org>
    Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
    Cc: Tuowen Zhao <ztuowen@gmail.com>
    Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
    Cc: Luis Chamberlain <mcgrof@kernel.org>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Alexios Zavras <alexios.zavras@intel.com>
    Cc: Allison Randal <allison@lohutok.net>
    Cc: Will Deacon <will@kernel.org>
    Cc: Richard Fontana <rfontana@redhat.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    nickdesaulniers authored and torvalds committed Jan 4, 2020
  4. ocfs2: fix the crash due to call ocfs2_get_dlm_debug once less

    Because ocfs2_get_dlm_debug() is called one time too few here, the
    ocfs2 file system will crash the system, usually after the file system
    is unmounted.
    
    The crash is caused by generic memory corruption, so the backtraces
    are not always the same. For example,
    
        ocfs2: Unmounting device (253,16) on (node 172167785)
        general protection fault: 0000 [#1] SMP PTI
        CPU: 3 PID: 14107 Comm: fence_legacy Kdump:
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
        RIP: 0010:__kmalloc+0xa5/0x2a0
        Code: 00 00 4d 8b 07 65 4d 8b
        RSP: 0018:ffffaa1fc094bbe8 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: d310a8800d7a3faf RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000dc0 RDI: ffff96e68fc036c0
        RBP: d310a8800d7a3faf R08: ffff96e6ffdb10a0 R09: 00000000752e7079
        R10: 000000000001c513 R11: 0000000004091041 R12: 0000000000000dc0
        R13: 0000000000000039 R14: ffff96e68fc036c0 R15: ffff96e68fc036c0
        FS:  00007f699dfba540(0000) GS:ffff96e6ffd80000(0000) knlGS:00000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000055f3a9d9b768 CR3: 000000002cd1c000 CR4: 00000000000006e0
        Call Trace:
         ext4_htree_store_dirent+0x35/0x100 [ext4]
         htree_dirblock_to_tree+0xea/0x290 [ext4]
         ext4_htree_fill_tree+0x1c1/0x2d0 [ext4]
         ext4_readdir+0x67c/0x9d0 [ext4]
         iterate_dir+0x8d/0x1a0
         __x64_sys_getdents+0xab/0x130
         do_syscall_64+0x60/0x1f0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7f699d33a9fb
    
    This regression problem was introduced by commit e581595 ("ocfs: no
    need to check return value of debugfs_create functions").
    
    Link: http://lkml.kernel.org/r/20191225061501.13587-1-ghe@suse.com
    Fixes: e581595 ("ocfs: no need to check return value of debugfs_create functions")
    Signed-off-by: Gang He <ghe@suse.com>
    Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    Cc: Mark Fasheh <mark@fasheh.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Junxiao Bi <junxiao.bi@oracle.com>
    Cc: Changwei Ge <gechangwei@live.cn>
    Cc: Gang He <ghe@suse.com>
    Cc: Jun Piao <piaojun@huawei.com>
    Cc: <stable@vger.kernel.org>	[5.3+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    ganghe authored and torvalds committed Jan 4, 2020
  5. ocfs2: call journal flush to mark journal as empty after journal reco…

    …very when mount
    
    If the journal is dirty at mount time, it will be replayed, but the
    jbd2 sb log tail cannot be updated to mark a new start because
    journal->j_flag has already been set with JBD2_ABORT in
    journal_init_common.
    
    When a new transaction is committed, it will be recorded in block 1
    first (journal->j_tail is set to 1 in journal_reset).  If an emergency
    restart happens again before the journal super block is updated, the
    newly recorded transaction will not be replayed in the next mount.
    
    The following steps describe this procedure in detail.
    1. mount and touch some files
    2. these transactions are committed to journal area but not checkpointed
    3. emergency restart
    4. mount again and its journals are replayed
    5. journal super block's first s_start is 1, but its s_seq is not updated
    6. touch a new file and its trans is committed but not checkpointed
    7. emergency restart again
    8. mount and journal is dirty, but trans committed in 6 will not be
    replayed.
    
    This exception happens easily when this lun is used by only one node.
    If it is used by multiple nodes, another node will replay its journal
    and its journal super block will be updated after recovery, like what
    this patch does.
    
    ocfs2_recover_node->ocfs2_replay_journal.
    
    The following jbd2 journal can be generated by touching a new file after
    journal is replayed, and seq 15 is the first valid commit, but first seq
    is 13 in journal super block.
    
    logdump:
      Block 0: Journal Superblock
      Seq: 0   Type: 4 (JBD2_SUPERBLOCK_V2)
      Blocksize: 4096   Total Blocks: 32768   First Block: 1
      First Commit ID: 13   Start Log Blknum: 1
      Error: 0
      Feature Compat: 0
      Feature Incompat: 2 block64
      Feature RO compat: 0
      Journal UUID: 4ED3822C54294467A4F8E87D2BA4BC36
      FS Share Cnt: 1   Dynamic Superblk Blknum: 0
      Per Txn Block Limit    Journal: 0    Data: 0
    
      Block 1: Journal Commit Block
      Seq: 14   Type: 2 (JBD2_COMMIT_BLOCK)
    
      Block 2: Journal Descriptor
      Seq: 15   Type: 1 (JBD2_DESCRIPTOR_BLOCK)
      No. Blocknum        Flags
       0. 587             none
      UUID: 00000000000000000000000000000000
       1. 8257792         JBD2_FLAG_SAME_UUID
       2. 619             JBD2_FLAG_SAME_UUID
       3. 24772864        JBD2_FLAG_SAME_UUID
       4. 8257802         JBD2_FLAG_SAME_UUID
       5. 513             JBD2_FLAG_SAME_UUID JBD2_FLAG_LAST_TAG
      ...
      Block 7: Inode
      Inode: 8257802   Mode: 0640   Generation: 57157641 (0x3682809)
      FS Generation: 2839773110 (0xa9437fb6)
      CRC32: 00000000   ECC: 0000
      Type: Regular   Attr: 0x0   Flags: Valid
      Dynamic Features: (0x1) InlineData
      User: 0 (root)   Group: 0 (root)   Size: 7
      Links: 1   Clusters: 0
      ctime: 0x5de5d870 0x11104c61 -- Tue Dec  3 11:37:20.286280801 2019
      atime: 0x5de5d870 0x113181a1 -- Tue Dec  3 11:37:20.288457121 2019
      mtime: 0x5de5d870 0x11104c61 -- Tue Dec  3 11:37:20.286280801 2019
      dtime: 0x0 -- Thu Jan  1 08:00:00 1970
      ...
      Block 9: Journal Commit Block
      Seq: 15   Type: 2 (JBD2_COMMIT_BLOCK)
    
    The following is the journal recovery log from recovering the above
    jbd2 journal on the next mount.
    
    syslog:
      ocfs2: File system on device (252,1) was not unmounted cleanly, recovering it.
      fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 0
      fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 1
      fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 2
      fs/jbd2/recovery.c:(jbd2_journal_recover, 278): JBD2: recovery, exit status 0, recovered transactions 13 to 13
    
    Because the first commit seq 13 recorded in the journal super block is
    not consistent with the value recorded in block 1 (seq 14), journal
    recovery terminates before seq 15 even though it is an unbroken commit.
    Inode 8257802 is a new file and it will be lost.
    
    Link: http://lkml.kernel.org/r/20191217020140.2197-1-li.kai4@h3c.com
    Signed-off-by: Kai Li <li.kai4@h3c.com>
    Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    Reviewed-by: Changwei Ge <gechangwei@live.cn>
    Cc: Mark Fasheh <mark@fasheh.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Junxiao Bi <junxiao.bi@oracle.com>
    Cc: Gang He <ghe@suse.com>
    Cc: Jun Piao <piaojun@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Kai Li authored and torvalds committed Jan 4, 2020
  6. mm/hugetlb: defer freeing of huge pages if in non-task context

    The following lockdep splat was observed when a certain hugetlbfs test
    was run:
    
      ================================
      WARNING: inconsistent lock state
      4.18.0-159.el8.x86_64+debug #1 Tainted: G        W --------- -  -
      --------------------------------
      inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
      swapper/30/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
      ffffffff9acdc038 (hugetlb_lock){+.?.}, at: free_huge_page+0x36f/0xaa0
      {SOFTIRQ-ON-W} state was registered at:
        lock_acquire+0x14f/0x3b0
        _raw_spin_lock+0x30/0x70
        __nr_hugepages_store_common+0x11b/0xb30
        hugetlb_sysctl_handler_common+0x209/0x2d0
        proc_sys_call_handler+0x37f/0x450
        vfs_write+0x157/0x460
        ksys_write+0xb8/0x170
        do_syscall_64+0xa5/0x4d0
        entry_SYSCALL_64_after_hwframe+0x6a/0xdf
      irq event stamp: 691296
      hardirqs last  enabled at (691296): [<ffffffff99bb034b>] _raw_spin_unlock_irqrestore+0x4b/0x60
      hardirqs last disabled at (691295): [<ffffffff99bb0ad2>] _raw_spin_lock_irqsave+0x22/0x81
      softirqs last  enabled at (691284): [<ffffffff97ff0c63>] irq_enter+0xc3/0xe0
      softirqs last disabled at (691285): [<ffffffff97ff0ebe>] irq_exit+0x23e/0x2b0
    
      other info that might help us debug this:
       Possible unsafe locking scenario:
    
             CPU0
             ----
        lock(hugetlb_lock);
        <Interrupt>
          lock(hugetlb_lock);
    
       *** DEADLOCK ***
          :
      Call Trace:
       <IRQ>
       __lock_acquire+0x146b/0x48c0
       lock_acquire+0x14f/0x3b0
       _raw_spin_lock+0x30/0x70
       free_huge_page+0x36f/0xaa0
       bio_check_pages_dirty+0x2fc/0x5c0
       clone_endio+0x17f/0x670 [dm_mod]
       blk_update_request+0x276/0xe50
       scsi_end_request+0x7b/0x6a0
       scsi_io_completion+0x1c6/0x1570
       blk_done_softirq+0x22e/0x350
       __do_softirq+0x23d/0xad8
       irq_exit+0x23e/0x2b0
       do_IRQ+0x11a/0x200
       common_interrupt+0xf/0xf
       </IRQ>
    
    Both the hugetlb_lock and the subpool lock can be acquired in
    free_huge_page().  One way to solve the problem is to make both locks
    irq-safe.  However, Mike Kravetz had learned that the hugetlb_lock is
    held for a linear scan of ALL hugetlb pages during a cgroup reparenting
    operation.  So it is just too long to have irq disabled unless we can
    break hugetlb_lock down into finer-grained locks with shorter lock hold
    times.
    
    Another alternative is to defer the freeing to a workqueue job.  This
    patch implements the deferred freeing by adding a free_hpage_workfn()
    work function to do the actual freeing.  The free_huge_page() call in a
    non-task context saves the page to be freed in the hpage_freelist linked
    list in a lockless manner using the llist APIs.
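
    The deferral pattern can be sketched in user space. This is only an
    analogy under stated assumptions: C11 atomics stand in for the kernel
    llist APIs, free() stands in for the real huge page release, and
    struct page_node, defer_free(), free_worker() and release_page() are
    hypothetical names. Scheduling of the worker is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct page_node {
    struct page_node *next;
};

/* Head of the lock-free list of objects waiting to be freed. */
static _Atomic(struct page_node *) free_list = NULL;

/* Push one node without taking a lock (compare-and-swap loop). */
static void defer_free(struct page_node *node)
{
    struct page_node *old = atomic_load(&free_list);

    assert(node != NULL);
    do {
        node->next = old;	/* old is refreshed when the CAS fails */
    } while (!atomic_compare_exchange_weak(&free_list, &old, node));
}

/*
 * Worker: detach the whole list in one atomic step, then free each
 * entry in a context where that is safe. Returns the number freed.
 */
static size_t free_worker(void)
{
    struct page_node *node = atomic_exchange(&free_list, NULL);
    size_t freed = 0;

    while (node) {
        struct page_node *next = node->next;

        free(node);
        freed++;
        node = next;
    }
    return freed;
}

/* Free immediately in task context, defer otherwise. */
static void release_page(struct page_node *node, bool in_task)
{
    if (in_task)
        free(node);
    else
        defer_free(node);
}
```

    Two calls to release_page() with in_task set to false followed by one
    call to free_worker() free both nodes; a second free_worker() call
    finds an empty list.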
    
    The generic workqueue is used to process the work, but a dedicated
    workqueue can be used instead if it is desirable to have the huge page
    freed ASAP.
    
    Thanks to Kirill Tkhai <ktkhai@virtuozzo.com> for suggesting the use of
    llist APIs which simplify the code.
    
    Link: http://lkml.kernel.org/r/20191217170331.30893-1-longman@redhat.com
    Signed-off-by: Waiman Long <longman@redhat.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: Davidlohr Bueso <dbueso@suse.de>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Andi Kleen <ak@linux.intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Waiman Long authored and torvalds committed Jan 4, 2020
  7. mm/gup: fix memory leak in __gup_benchmark_ioctl

    In the implementation of __gup_benchmark_ioctl() the allocated pages
    should be released before returning in case of an invalid cmd.  Release
    pages via kvfree().
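
    The shape of the fix can be sketched in user-space C. The names here
    (run_benchmark(), the BENCH_* commands) are made up for illustration,
    and calloc()/free() stand in for the kernel allocation and kvfree().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical command values for this sketch. */
enum bench_cmd { BENCH_FAST = 1, BENCH_LONGTERM = 2 };

static int run_benchmark(int cmd, size_t nr_pages)
{
    void **pages;
    int ret = 0;

    assert(nr_pages > 0);
    pages = calloc(nr_pages, sizeof(*pages));
    if (!pages)
        return -ENOMEM;

    switch (cmd) {
    case BENCH_FAST:
    case BENCH_LONGTERM:
        /* ... the actual benchmark work would go here ... */
        break;
    default:
        /* Record the error instead of returning with pages allocated. */
        ret = -EINVAL;
        break;
    }

    free(pages);	/* single cleanup point for every path */
    return ret;
}
```

    An unknown command now returns -EINVAL after the array has been
    released, instead of returning early and leaking it.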
    
    [akpm@linux-foundation.org: rework code flow, return -EINVAL rather than -1]
    Link: http://lkml.kernel.org/r/20191211174653.4102-1-navid.emamdoost@gmail.com
    Fixes: 714a3a1 ("mm/gup_benchmark.c: add additional pinning methods")
    Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Reviewed-by: Ira Weiny <ira.weiny@intel.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Cc: Keith Busch <keith.busch@intel.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Navidem authored and torvalds committed Jan 4, 2020
  8. mm/oom: fix pgtables units mismatch in Killed process message

    pr_err() expects kB, but mm_pgtables_bytes() returns the number of bytes.
    As everything else is printed in kB, I chose to fix the value rather than
    the string.
    
    Before:
    
    [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
    ...
    [   1878]  1000  1878   217253   151144  1269760        0             0 python
    ...
    Out of memory: Killed process 1878 (python) total-vm:869012kB, anon-rss:604572kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:1269760kB oom_score_adj:0
    
    After:
    
    [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
    ...
    [   1436]  1000  1436   217253   151890  1294336        0             0 python
    ...
    Out of memory: Killed process 1436 (python) total-vm:869012kB, anon-rss:607516kB, file-rss:44kB, shmem-rss:0kB, UID:1000 pgtables:1264kB oom_score_adj:0
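
    The figures above can be checked by hand: the table reports page-table
    memory in bytes, and dividing by 1024 yields the kB value in the
    "After" message.  bytes_to_kb() below is a hypothetical helper for
    illustration, not a kernel function.

```c
/* Convert a byte count to kB (1 kB = 1024 bytes), rounding down. */
unsigned long bytes_to_kb(unsigned long bytes)
{
	return bytes >> 10;
}
```

    With the "After" value, bytes_to_kb(1294336) gives 1264, matching
    pgtables:1264kB.  The "Before" value 1269760 corresponds to 1240 kB,
    not the 1269760kB that was printed.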
    
    Link: http://lkml.kernel.org/r/20191211202830.1600-1-idryomov@gmail.com
    Fixes: 70cb6d2 ("mm/oom: add oom_score_adj and pgtables to Killed process message")
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Acked-by: David Rientjes <rientjes@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Edward Chron <echron@arista.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    idryomov authored and torvalds committed Jan 4, 2020
  9. fs/posix_acl.c: fix kernel-doc warnings

    Fix kernel-doc warnings in fs/posix_acl.c.
    Also fix one typo (setgit -> setgid).
    
      fs/posix_acl.c:647: warning: Function parameter or member 'inode' not described in 'posix_acl_update_mode'
      fs/posix_acl.c:647: warning: Function parameter or member 'mode_p' not described in 'posix_acl_update_mode'
      fs/posix_acl.c:647: warning: Function parameter or member 'acl' not described in 'posix_acl_update_mode'
    
    Link: http://lkml.kernel.org/r/29b0dc46-1f28-a4e5-b1d0-ba2b65629779@infradead.org
    Fixes: 0739310 ("posix_acl: Clear SGID bit when setting file permissions")
    
    Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
    Acked-by: Andreas Gruenbacher <agruenba@redhat.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Andreas Gruenbacher <agruenba@redhat.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Randy Dunlap authored and torvalds committed Jan 4, 2020
  10. hexagon: work around compiler crash

    Clang cannot translate the string "r30" into a valid register yet.
    
    Link: ClangBuiltLinux#755
    Link: http://lkml.kernel.org/r/20191028155722.23419-1-ndesaulniers@google.com
    Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
    Suggested-by: Sid Manning <sidneym@quicinc.com>
    Reviewed-by: Brian Cain <bcain@codeaurora.org>
    Cc: Allison Randal <allison@lohutok.net>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Richard Fontana <rfontana@redhat.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    nickdesaulniers authored and torvalds committed Jan 4, 2020
  11. hexagon: parenthesize registers in asm predicates

    Hexagon requires that register predicates in assembly be parenthesized.
    
    Link: ClangBuiltLinux#754
    Link: http://lkml.kernel.org/r/20191209222956.239798-3-ndesaulniers@google.com
    Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
    Suggested-by: Sid Manning <sidneym@codeaurora.org>
    Acked-by: Brian Cain <bcain@codeaurora.org>
    Cc: Lee Jones <lee.jones@linaro.org>
    Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
    Cc: Tuowen Zhao <ztuowen@gmail.com>
    Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
    Cc: Luis Chamberlain <mcgrof@kernel.org>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Alexios Zavras <alexios.zavras@intel.com>
    Cc: Allison Randal <allison@lohutok.net>
    Cc: Will Deacon <will@kernel.org>
    Cc: Richard Fontana <rfontana@redhat.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    nickdesaulniers authored and torvalds committed Jan 4, 2020
  12. fs/namespace.c: make to_mnt_ns() static

    Make to_mnt_ns() static to address the following 'sparse' warning:
    
        fs/namespace.c:1731:22: warning: symbol 'to_mnt_ns' was not declared. Should it be static?
    
    Link: http://lkml.kernel.org/r/20191209234830.156260-1-ebiggers@kernel.org
    Signed-off-by: Eric Biggers <ebiggers@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    ebiggers authored and torvalds committed Jan 4, 2020
  13. fs/nsfs.c: include headers for missing declarations

    Include linux/proc_fs.h and fs/internal.h to address the following
    'sparse' warnings:
    
        fs/nsfs.c:41:32: warning: symbol 'ns_dentry_operations' was not declared. Should it be static?
        fs/nsfs.c:145:5: warning: symbol 'open_related_ns' was not declared. Should it be static?
    
    Link: http://lkml.kernel.org/r/20191209234822.156179-1-ebiggers@kernel.org
    Signed-off-by: Eric Biggers <ebiggers@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    ebiggers authored and torvalds committed Jan 4, 2020
  14. fs/direct-io.c: include fs/internal.h for missing prototype

    Include fs/internal.h to address the following 'sparse' warning:
    
        fs/direct-io.c:591:5: warning: symbol 'sb_init_dio_done_wq' was not declared. Should it be static?
    
    Link: http://lkml.kernel.org/r/20191209234544.128302-1-ebiggers@kernel.org
    Signed-off-by: Eric Biggers <ebiggers@google.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    ebiggers authored and torvalds committed Jan 4, 2020
  15. mm: move_pages: return valid node id in status if the page is already…

    … on the target node
    
    Felix Abecassis reports that move_pages() returns random status values
    if the pages are already on the target node, as shown by this test program:
    
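      /* Headers the test program needs; link with -lnuma. */
      #define _GNU_SOURCE
      #include <numaif.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/mman.h>
      #include <unistd.h>

      int main(void)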
      {
    	const long node_id = 1;
    	const long page_size = sysconf(_SC_PAGESIZE);
    	const int64_t num_pages = 8;
    
    	unsigned long nodemask =  1 << node_id;
    	long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask));
    	if (ret < 0)
    		return (EXIT_FAILURE);
    
    	void **pages = malloc(sizeof(void*) * num_pages);
    	for (int i = 0; i < num_pages; ++i) {
    		pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ,
    				MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS,
    				-1, 0);
    		if (pages[i] == MAP_FAILED)
    			return (EXIT_FAILURE);
    	}
    
    	ret = set_mempolicy(MPOL_DEFAULT, NULL, 0);
    	if (ret < 0)
    		return (EXIT_FAILURE);
    
    	int *nodes = malloc(sizeof(int) * num_pages);
    	int *status = malloc(sizeof(int) * num_pages);
    	for (int i = 0; i < num_pages; ++i) {
    		nodes[i] = node_id;
    		status[i] = 0xd0; /* simulate garbage values */
    	}
    
    	ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
    	printf("move_pages: %ld\n", ret);
    	for (int i = 0; i < num_pages; ++i)
    		printf("status[%d] = %d\n", i, status[i]);
      }
    
    Then running the program would return nonsense status values:
    
      $ ./move_pages_bug
      move_pages: 0
      status[0] = 208
      status[1] = 208
      status[2] = 208
      status[3] = 208
      status[4] = 208
      status[5] = 208
      status[6] = 208
      status[7] = 208
    
    This is because the status is not set if the page is already on the
    target node, but move_pages() should return valid status as long as it
    succeeds.  The valid status may be errno or node id.
    
    We can't simply initialize the status array to zero, since the pages may
    not be on node 0.  Fix it by updating status with the id of the node that
    the page is already on.
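
    The intended behaviour can be modelled in a few lines.  This is a
    hypothetical userspace model, not the kernel's do_pages_move():
    cur_node[], target[] and fill_status() are invented for illustration,
    and the migration branch simply assumes success.

```c
#include <stddef.h>

/*
 * Hypothetical model: cur_node[i] is the node page i currently lives on,
 * target[i] is the node requested by the caller.
 */
void fill_status(const int *cur_node, const int *target, int *status,
		 size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (cur_node[i] == target[i]) {
			/* Already in place: report the node, never leave garbage. */
			status[i] = cur_node[i];
		} else {
			/*
			 * Assumption: migration succeeds.  Real code would store
			 * the resulting node id or a negative errno here.
			 */
			status[i] = target[i];
		}
	}
}
```

    Every entry of status[] is written on every path, so the caller never
    sees the values it passed in.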
    
    Link: http://lkml.kernel.org/r/1575584353-125392-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: a49bd4d ("mm, numa: rework do_pages_move")
    Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
    Reported-by: Felix Abecassis <fabecassis@nvidia.com>
    Tested-by: Felix Abecassis <fabecassis@nvidia.com>
    Suggested-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Acked-by: Christoph Lameter <cl@linux.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: <stable@vger.kernel.org>	[4.17+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    yang-shi authored and torvalds committed Jan 4, 2020
  16. memcg: account security cred as well to kmemcg

    The cred_jar kmem_cache is already memcg accounted in the current kernel
    but cred->security is not.  Account cred->security to kmemcg.
    
    Recently we saw high root slab usage in production and, on further
    inspection, found a buggy application leaking processes.  Although that
    application was contained within its memcg, we observed much more
    system memory overhead, a couple of GiB, during that period.  This
    overhead can adversely impact the isolation on the system.
    
    One source of high overhead we found was cred->security objects, which
    have a lifetime of at least the life of the process which allocated
    them.
    
    Link: http://lkml.kernel.org/r/20191205223721.40034-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Chris Down <chris@chrisdown.name>
    Reviewed-by: Roman Gushchin <guro@fb.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    shakeelb authored and torvalds committed Jan 4, 2020
  17. kcov: fix struct layout for kcov_remote_arg

    Make the layout of kcov_remote_arg the same for 32-bit and 64-bit code.
    This makes it more convenient to write userspace apps that can be
    compiled into 32-bit or 64-bit binaries and still work with the same
    64-bit kernel.
    
    Also use proper __u32 types in uapi headers instead of unsigned ints.
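
    A sketch of the general technique, assuming the mismatch comes from
    64-bit members being only 4-byte aligned inside structs on some 32-bit
    ABIs such as i386.  The struct below is hypothetical and is not the
    real kcov_remote_arg definition; it uses fixed-width types plus an
    explicit 8-byte alignment so both builds insert the same padding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical argument struct shared between 32-bit and 64-bit builds. */
struct remote_arg_example {
	uint32_t mode;
	uint32_t size;
	uint32_t count;
	/* Explicit alignment: 4 bytes of padding precede this on every ABI. */
	_Alignas(8) uint64_t handle;
};

/* Compile-time checks that the layout is what both sides expect. */
_Static_assert(offsetof(struct remote_arg_example, handle) == 16,
	       "handle must sit at offset 16");
_Static_assert(sizeof(struct remote_arg_example) == 24,
	       "struct must be 24 bytes");
```

    Without the alignment specifier, a 32-bit build under such an ABI could
    place handle at offset 12 and disagree with the 64-bit kernel.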
    
    Link: http://lkml.kernel.org/r/9e91020876029cfefc9211ff747685eba9536426.1575638983.git.andreyknvl@google.com
    Fixes: eec028c ("kcov: remote coverage support")
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Alan Stern <stern@rowland.harvard.edu>
    Cc: Felipe Balbi <balbi@kernel.org>
    Cc: Chunfeng Yun <chunfeng.yun@mediatek.com>
    Cc: "Jacky . Cao @ sony . com" <Jacky.Cao@sony.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Marco Elver <elver@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    xairy authored and torvalds committed Jan 4, 2020
  18. mm/zsmalloc.c: fix the migrated zspage statistics.

    When a zspage is migrated to another zone, the zone page state should be
    updated as well; otherwise the NR_ZSPAGE counts for each zone are wrong,
    which is visible in practice in /proc/zoneinfo.
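
    The accounting rule can be sketched as follows.  struct zone_example
    and account_migration() are hypothetical stand-ins for the kernel's
    per-zone page state, used only to show the decrement/increment pair.

```c
/* Hypothetical per-zone counter, standing in for the zone page state. */
struct zone_example {
	long nr_zspages;
};

/* Move one page's accounting from the old zone to the new one. */
void account_migration(struct zone_example *from, struct zone_example *to)
{
	if (from == to)
		return;		/* same zone: the counters are already right */
	from->nr_zspages--;
	to->nr_zspages++;
}
```

    The total across zones is unchanged; only the per-zone split moves with
    the page.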
    
    Link: http://lkml.kernel.org/r/1575434841-48009-1-git-send-email-chanho.min@lge.com
    Fixes: 91537fe ("mm: add NR_ZSMALLOC to vmstat")
    Signed-off-by: Chanho Min <chanho.min@lge.com>
    Signed-off-by: Jinsuk Choi <jjinsuk.choi@lge.com>
    Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
    Acked-by: Minchan Kim <minchan@kernel.org>
    Cc: <stable@vger.kernel.org>        [4.9+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Chanho Min authored and torvalds committed Jan 4, 2020
  19. mm/memory_hotplug: shrink zones when offlining memory

    We currently try to shrink a single zone when removing memory.  We use
    the zone of the first page of the memory we are removing.  If that
    memmap was never initialized (e.g., memory was never onlined), we will
    read garbage and can trigger kernel BUGs (due to a stale pointer):
    
        BUG: unable to handle page fault for address: 000000000000353d
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0002) - not-present page
        PGD 0 P4D 0
        Oops: 0002 [#1] SMP PTI
        CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
        Workqueue: kacpi_hotplug acpi_hotplug_work_fn
        RIP: 0010:clear_zone_contiguous+0x5/0x10
        Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
        RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
        RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
        RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
        R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
        FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         __remove_pages+0x4b/0x640
         arch_remove_memory+0x63/0x8d
         try_remove_memory+0xdb/0x130
         __remove_memory+0xa/0x11
         acpi_memory_device_remove+0x70/0x100
         acpi_bus_trim+0x55/0x90
         acpi_device_hotplug+0x227/0x3a0
         acpi_hotplug_work_fn+0x1a/0x30
         process_one_work+0x221/0x550
         worker_thread+0x50/0x3b0
         kthread+0x105/0x140
         ret_from_fork+0x3a/0x50
        Modules linked in:
        CR2: 000000000000353d
    
    Instead, shrink the zones when offlining memory or when onlining failed.
    Introduce and use remove_pfn_range_from_zone() for that.  We now
    properly shrink the zones, even if we have DIMMs whereby
    
     - Some memory blocks fall into no zone (never onlined)
    
     - Some memory blocks fall into multiple zones (offlined+re-onlined)
    
     - Multiple memory blocks that fall into different zones
    
    Drop the zone parameter (with a potential dubious value) from
    __remove_pages() and __remove_section().
    
    Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
    Fixes: f1dd2cd ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e]
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Logan Gunthorpe <logang@deltatee.com>
    Cc: <stable@vger.kernel.org>	[5.0+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    davidhildenbrand authored and torvalds committed Jan 4, 2020
  20. Merge tag 'dmaengine-fix-5.5-rc5' of git://git.infradead.org/users/vk…

    …oul/slave-dma
    
    Pull dmaengine fixes from Vinod Koul:
     "A bunch of fixes for:
    
       - uninitialized dma_slave_caps access
    
       - virt-dma use after free in vchan_complete()
    
       - driver fixes for ioat, k3dma and jz4780"
    
    * tag 'dmaengine-fix-5.5-rc5' of git://git.infradead.org/users/vkoul/slave-dma:
      ioat: ioat_alloc_ring() failure handling.
      dmaengine: virt-dma: Fix access after free in vchan_complete()
      dmaengine: k3dma: Avoid null pointer traversal
      dmaengine: dma-jz4780: Also break descriptor chains on JZ4725B
      dmaengine: Fix access to uninitialized dma_slave_caps
    torvalds committed Jan 4, 2020
  21. Merge tag 'media/v5.5-3' of git://git.kernel.org/pub/scm/linux/kernel…

    …/git/mchehab/linux-media
    
    Pull media fixes from Mauro Carvalho Chehab:
    
     - some fixes at CEC core to comply with HDMI 2.0 specs and fix some
       border cases
    
     - a fix at the transmission logic of the pulse8-cec driver
    
     - one alignment fix on a data struct at ipu3 when built with 32 bits
    
    * tag 'media/v5.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
      media: intel-ipu3: Align struct ipu3_uapi_awb_fr_config_s to 32 bytes
      media: pulse8-cec: fix lost cec_transmit_attempt_done() call
      media: cec: check 'transmit_in_progress', not 'transmitting'
      media: cec: avoid decrementing transmit_queue_sz if it is 0
      media: cec: CEC 2.0-only bcast messages were ignored
    torvalds committed Jan 4, 2020

Commits on Jan 3, 2020

  1. Merge tag 'for-5.5-rc4-tag' of git://git.kernel.org/pub/scm/linux/ker…

    …nel/git/kdave/linux
    
    Pull btrfs fixes from David Sterba:
     "A few fixes for btrfs:
    
       - blkcg accounting problem with compression that could stall writes
    
       - setting up blkcg bio for compression crashes due to NULL bdev
         pointer
    
       - fix possible infinite loop in writeback for nocow files (here
         possible means almost impossible, 13 things that need to happen to
         trigger it)"
    
    * tag 'for-5.5-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
      Btrfs: fix infinite loop during nocow writeback due to race
      btrfs: fix compressed write bio blkcg attribution
      btrfs: punt all bios created in btrfs_submit_compressed_write()
    torvalds committed Jan 3, 2020
  2. Merge tag 'block-5.5-20200103' of git://git.kernel.dk/linux-block

    Pull block fixes from Jens Axboe:
     "Three fixes in here:
    
       - Fix for a missing split on default memory boundary mask (4G) (Ming)
    
       - Fix for multi-page read bio truncate (Ming)
    
       - Fix for null_blk zone close request handling (Damien)"
    
    * tag 'block-5.5-20200103' of git://git.kernel.dk/linux-block:
      null_blk: Fix REQ_OP_ZONE_CLOSE handling
      block: fix splitting segments on boundary masks
      block: add bio_truncate to fix guard_bio_eod
    torvalds committed Jan 3, 2020
  3. Merge tag 'kbuild-fixes-v5.5-2' of git://git.kernel.org/pub/scm/linux…

    …/kernel/git/masahiroy/linux-kbuild
    
    Pull Kbuild fixes from Masahiro Yamada:
    
     - fix build error in usr/gen_initramfs_list.sh
    
     - fix libelf-dev dependency in deb-pkg build
    
    * tag 'kbuild-fixes-v5.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
      kbuild/deb-pkg: annotate libelf-dev dependency as :native
      gen_initramfs_list.sh: fix 'bad variable name' error
    torvalds committed Jan 3, 2020