Commits on Oct 21, 2021

  1. uapi: futex: Add a futex syscall

    This commit adds two futex syscall wrappers that are exposed to
    userspace.
    
    Neither the kernel nor glibc currently exposes a futex wrapper, so
    userspace is left performing raw syscalls. This has mostly been because
    the overloading of one of the arguments makes it impossible to provide a
    single type safe function.
    
    Until recently the single syscall has worked fine. With the introduction
    of a 64-bit time_t futex call on 32-bit architectures, this has become
    more complex. The logic of handling the two possible futex syscalls is
    complex and often implemented incorrectly.
    
    This patch adds two futex syscall functions that correctly handle the
    time_t complexity for userspace.
    
    This idea is based on previous discussions: https://lkml.org/lkml/2021/9/21/143
    
    Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
    alistair23 authored and intel-lab-lkp committed Oct 21, 2021
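
The fallback logic described in the message above can be sketched as follows. This is a minimal illustration, not the interface the patch adds: `futex_wait_sketch` and `struct sketch_timespec32` are hypothetical names, and the sketch assumes that `SYS_futex_time64` is only defined on 32-bit architectures.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#if defined(SYS_futex_time64)
#include <linux/time_types.h>

/* Layout of the legacy 32-bit timespec; a local, hypothetical name. */
struct sketch_timespec32 {
    int32_t tv_sec;
    int32_t tv_nsec;
};
#endif

/* Hypothetical wrapper: wait on *uaddr while it still equals 'expected'. */
static long futex_wait_sketch(uint32_t *uaddr, uint32_t expected,
                              const struct timespec *timeout)
{
#if defined(SYS_futex_time64)
    /* 32-bit architecture: prefer the 64-bit time_t syscall. */
    struct __kernel_timespec ts64, *ts64p = NULL;
    long ret;

    if (timeout) {
        ts64.tv_sec = timeout->tv_sec;
        ts64.tv_nsec = timeout->tv_nsec;
        ts64p = &ts64;
    }
    ret = syscall(SYS_futex_time64, uaddr, FUTEX_WAIT, expected,
                  ts64p, NULL, 0);
    if (ret != -1 || errno != ENOSYS)
        return ret;
#if defined(SYS_futex)
    {
        /* Older kernel: fall back to the 32-bit time_t syscall. */
        struct sketch_timespec32 ts32, *ts32p = NULL;

        if (timeout) {
            if (timeout->tv_sec > INT32_MAX) {
                errno = EOVERFLOW;   /* timeout does not fit in 32 bits */
                return -1;
            }
            ts32.tv_sec = (int32_t)timeout->tv_sec;
            ts32.tv_nsec = (int32_t)timeout->tv_nsec;
            ts32p = &ts32;
        }
        return syscall(SYS_futex, uaddr, FUTEX_WAIT, expected,
                       ts32p, NULL, 0);
    }
#else
    return ret;
#endif
#else
    /* 64-bit architecture: the native timespec already has 64-bit time_t. */
    return syscall(SYS_futex, uaddr, FUTEX_WAIT, expected,
                   timeout, NULL, 0);
#endif
}
```

A caller would check for `-1` and inspect `errno`, as with any raw `syscall()` use.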

Commits on Oct 20, 2021

  1. Merge tag 'ceph-for-5.15-rc7' of git://github.com/ceph/ceph-client

    Pull ceph fixes from Ilya Dryomov:
     "Two important filesystem fixes, marked for stable.
    
      The blocklisted superblocks issue was particularly annoying because
      for inexperienced users it essentially exacted a reboot to establish a
      new functional mount in that scenario"
    
    * tag 'ceph-for-5.15-rc7' of git://github.com/ceph/ceph-client:
      ceph: fix handling of "meta" errors
      ceph: skip existing superblocks that are blocklisted or shut down when mounting
    torvalds committed Oct 20, 2021
  2. Merge tag 'dma-mapping-5.15-2' of git://git.infradead.org/users/hch/d…

    …ma-mapping
    
    Pull dma-mapping fixes from Christoph Hellwig:
    
     - fix more dma-debug fallout (Gerald Schaefer, Hamza Mahfooz)
    
     - fix a kerneldoc warning (Logan Gunthorpe)
    
    * tag 'dma-mapping-5.15-2' of git://git.infradead.org/users/hch/dma-mapping:
      dma-debug: teach add_dma_entry() about DMA_ATTR_SKIP_CPU_SYNC
      dma-debug: fix sg checks in debug_dma_map_sg()
      dma-mapping: fix the kerneldoc for dma_map_sgtable()
    torvalds committed Oct 20, 2021
  3. Merge tag 'sound-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kern…

    …el/git/tiwai/sound
    
    Pull sound fixes from Takashi Iwai:
     "Again it became bigger than wished, unfortunately, as this contains
      quite a few ASoC fixes that came up a bit late. It also includes yet
      more HD- and USB-audio quirks: I decided to merge them now, as those
      are for stable, and we'll need them sooner or later.
    
      Although the volumes are a bit high, all changes are device-specific
      (and reasonably small) fixes, so it should be safe for the late rc"
    
    * tag 'sound-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
      ALSA: usb-audio: Fix microphone sound on Jieli webcam.
      ALSA: hda/realtek: Fixes HP Spectre x360 15-eb1xxx speakers
      ALSA: usb-audio: Provide quirk for Sennheiser GSP670 Headset
      ALSA: hda/realtek: Add quirk for Clevo PC50HS
      ALSA: usb-audio: add Schiit Hel device to quirk table
      ASoC: wm8960: Fix clock configuration on slave mode
      ASoC: cs42l42: Ensure 0dB full scale volume is used for headsets
      ASoC: soc-core: fix null-ptr-deref in snd_soc_del_component_unlocked()
      ASoC: codec: wcd938x: Add irq config support
      ASoC: DAPM: Fix missing kctl change notifications
      ASoC: Intel: bytcht_es8316: Utilize dev_err_probe() to avoid log saturation
      ASoC: Intel: bytcht_es8316: Switch to use gpiod_get_optional()
      ASoC: Intel: bytcht_es8316: Use temporary variable for struct device
      ASoC: Intel: bytcht_es8316: Get platform data via dev_get_platdata()
      ASoC: wcd938x: Fix jack detection issue
      ASoC: nau8824: Fix headphone vs headset, button-press detection no longer working
      ASoC: cs4341: Add SPI device ID table
      ASoC: pcm179x: Add missing entries SPI to device ID table
      ASoC: fsl_xcvr: Fix channel swap issue with ARC
      ASoC: pcm512x: Mend accesses to the I2S_1 and I2S_2 registers
    torvalds committed Oct 20, 2021
  4. Merge tag 'audit-pr-20211019' of git://git.kernel.org/pub/scm/linux/k…

    …ernel/git/pcmoore/audit
    
    Pull audit fix from Paul Moore:
     "One small audit patch to add a pointer NULL check"
    
    * tag 'audit-pr-20211019' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
      audit: fix possible null-pointer dereference in audit_filter_rules
    torvalds committed Oct 20, 2021
  5. Merge tag 'trace-v5.15-rc5' of git://git.kernel.org/pub/scm/linux/ker…

    …nel/git/rostedt/linux-trace
    
    Pull tracing fix from Steven Rostedt:
     "Recursion fix for tracing.
    
      While cleaning up some of the tracing recursion protection logic, I
      discovered a scenario that the current design would miss, and would
      allow an infinite recursion. Removing an optimization trick that
      opened the hole fixes the issue and cleans up the code as well"
    
    * tag 'trace-v5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
      tracing: Have all levels of checks prevent recursion
    torvalds committed Oct 20, 2021
  6. Merge tag 'nios2_fixes_for_v5.15_part2' of git://git.kernel.org/pub/s…

    …cm/linux/kernel/git/dinguyen/linux
    
    Pull nios2 fix from Dinh Nguyen:
    
     - Renamed CTL_STATUS to CTL_FSTATUS to fix a redefined warning
    
    * tag 'nios2_fixes_for_v5.15_part2' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux:
      NIOS2: irqflags: rename a redefined register name
    torvalds committed Oct 20, 2021
  7. Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

    Pull kvm fixes from Paolo Bonzini:
     "Tools:
       - kvm_stat: do not show halt_wait_ns since it is not a cumulative statistic
    
      x86:
       - clean ups and fixes for bus lock vmexit and lazy allocation of rmaps
       - two fixes for SEV-ES (one more coming as soon as I get reviews)
       - fix for static_key underflow
    
      ARM:
       - Properly refcount pages used as a concatenated stage-2 PGD
       - Fix missing unlock when detecting the use of MTE+VM_SHARED"
    
    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
      KVM: SEV-ES: reduce ghcb_sa_len to 32 bits
      KVM: VMX: Remove redundant handling of bus lock vmexit
      KVM: kvm_stat: do not show halt_wait_ns
      KVM: x86: WARN if APIC HW/SW disable static keys are non-zero on unload
      Revert "KVM: x86: Open code necessary bits of kvm_lapic_set_base() at vCPU RESET"
      KVM: SEV-ES: Set guest_state_protected after VMSA update
      KVM: X86: fix lazy allocation of rmaps
      KVM: SEV-ES: fix length of string I/O
      KVM: arm64: Release mmap_lock when using VM_SHARED with MTE
      KVM: arm64: Report corrupted refcount at EL2
      KVM: arm64: Fix host stage-2 PGD refcount
      KVM: s390: Function documentation fixes
    torvalds committed Oct 20, 2021

Commits on Oct 19, 2021

  1. Merge branch 'akpm' (patches from Andrew)

    Merge misc fixes from Andrew Morton:
     "19 patches.
    
      Subsystems affected by this patch series: mm (userfaultfd, migration,
      memblock, mempolicy, slub, secretmem, and thp), ocfs2, binfmt, vfs,
      and misc"
    
    * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
      mailmap: add Andrej Shadura
      mm/thp: decrease nr_thps in file's mapping on THP split
      mm/secretmem: fix NULL page->mapping dereference in page_is_secretmem()
      vfs: check fd has read access in kernel_read_file_from_fd()
      elfcore: correct reference to CONFIG_UML
      mm, slub: fix incorrect memcg slab count for bulk free
      mm, slub: fix potential use-after-free in slab_debugfs_fops
      mm, slub: fix potential memoryleak in kmem_cache_open()
      mm, slub: fix mismatch between reconstructed freelist depth and cnt
      mm, slub: fix two bugs in slab_debug_trace_open()
      mm/mempolicy: do not allow illegal MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind()
      memblock: check memory total_size
      ocfs2: mount fails with buffer overflow in strlen
      ocfs2: fix data corruption after conversion from inline format
      mm/migrate: fix CPUHP state to update node demotion order
      mm/migrate: add CPU hotplug to demotion #ifdef
      mm/migrate: optimize hotplug-time demotion order updates
      userfaultfd: fix a race between writeprotect and exit_mmap()
      mm/userfaultfd: selftests: fix memory corruption with thp enabled
    torvalds committed Oct 19, 2021
  2. ceph: fix handling of "meta" errors

    Currently, we check the wb_err too early for directories, before all of
    the unsafe child requests have been waited on. In order to fix that we
    need to check the mapping->wb_err later nearer to the end of ceph_fsync.
    
    We also have an overly-complex method for tracking errors after
    blocklisting. The errors recorded in cleanup_session_requests go to a
    completely separate field in the inode, but we end up reporting them the
    same way we would for any other error (in fsync).
    
    There's no real benefit to tracking these errors in two different
    places, since the only reporting mechanism for them is in fsync, and
    we'd need to advance them both every time.
    
    Given that, we can just remove i_meta_err, and convert the places that
    used it to instead just use mapping->wb_err instead. That also fixes
    the original problem by ensuring that we do a check_and_advance of the
    wb_err at the end of the fsync op.
    
    Cc: stable@vger.kernel.org
    URL: https://tracker.ceph.com/issues/52864
    Reported-by: Patrick Donnelly <pdonnell@redhat.com>
    Signed-off-by: Jeff Layton <jlayton@kernel.org>
    Reviewed-by: Xiubo Li <xiubli@redhat.com>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    jtlayton authored and idryomov committed Oct 19, 2021
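
Why the check has to happen at the end of the fsync operation can be shown with a simplified userspace model. This is only a sketch: `struct err_seq` and its helpers are hypothetical stand-ins for a field such as mapping->wb_err, and they do not reproduce the kernel's actual error-sequence implementation.

```c
#include <errno.h>

/* Simplified stand-in for an error-sequence field (hypothetical). */
struct err_seq {
    unsigned int seq;   /* bumped each time an error is recorded */
    int err;            /* most recent error, as a negative errno */
};

/* Record a new error. */
static void err_seq_set(struct err_seq *e, int err)
{
    e->err = err;
    e->seq++;
}

/* Snapshot the current position so that later errors can be detected. */
static unsigned int err_seq_sample(const struct err_seq *e)
{
    return e->seq;
}

/* Report an error recorded since '*since', then move '*since' forward. */
static int err_seq_check_and_advance(const struct err_seq *e,
                                     unsigned int *since)
{
    if (*since == e->seq)
        return 0;
    *since = e->seq;
    return e->err;
}

/*
 * Hypothetical fsync-like flow. 'late_err' models an error that is
 * recorded while waiting on outstanding requests.
 */
static int fsync_sketch(struct err_seq *e, unsigned int *since, int late_err)
{
    /* ... wait for outstanding requests; failures are recorded here ... */
    if (late_err)
        err_seq_set(e, late_err);

    /* Check only after all waiting is done, so late errors are seen. */
    return err_seq_check_and_advance(e, since);
}
```

A check placed before the waiting step would return 0 and leave the late error to be reported by some later, unrelated call.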
  3. ceph: skip existing superblocks that are blocklisted or shut down whe…

    …n mounting
    
    Currently when mounting, we may end up finding an existing superblock
    that corresponds to a blocklisted MDS client. This means that the new
    mount ends up being unusable.
    
    If we've found an existing superblock with a client that is already
    blocklisted, and the client is not configured to recover on its own,
    fail the match. Ditto if the superblock has been forcibly unmounted.
    
    While we're in here, also rename "other" to the more conventional "fsc".
    
    Cc: stable@vger.kernel.org
    URL: https://bugzilla.redhat.com/show_bug.cgi?id=1901499
    Signed-off-by: Jeff Layton <jlayton@kernel.org>
    Reviewed-by: Xiubo Li <xiubli@redhat.com>
    Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    jtlayton authored and idryomov committed Oct 19, 2021
  4. mailmap: add Andrej Shadura

    Add a mapping for my old work email for BelDisplayTech to the personal
    email, and make sure the Collabora email has the correct spelling of the
    first name.
    
    Link: https://lkml.kernel.org/r/20210917091016.30232-1-andrew.shadura@collabora.co.uk
    Signed-off-by: Andrej Shadura <andrew.shadura@collabora.co.uk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    andrewshadura authored and torvalds committed Oct 19, 2021
  5. mm/thp: decrease nr_thps in file's mapping on THP split

    Decrease nr_thps counter in file's mapping to ensure that the page cache
    won't be dropped excessively on file write access if page has been
    already split.
    
    I tested this by running a big binary, which the kernel remaps with
    THPs, and then forcing a THP split with
    /sys/kernel/debug/split_huge_pages.  On any further open of that binary
    with O_RDWR or O_WRONLY, the kernel drops its page cache because of the
    non-zero nr_thps counter.
    
    Link: https://lkml.kernel.org/r/20211012120237.2600-1-m.szyprowski@samsung.com
    Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Fixes: 09d91cd ("mm,thp: avoid writes to file with THP in pagecache")
    Fixes: 06d3eff ("mm/thp: fix node page state in split_huge_page_to_list()")
    Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: <sfoon.kim@samsung.com>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: William Kucharski <william.kucharski@oracle.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    mszyprow authored and torvalds committed Oct 19, 2021
  6. mm/secretmem: fix NULL page->mapping dereference in page_is_secretmem()

    Check for a NULL page->mapping before dereferencing the mapping in
    page_is_secretmem(), as the page's mapping can be nullified while gup()
    is running, e.g.  by reclaim or truncation.
    
      BUG: kernel NULL pointer dereference, address: 0000000000000068
      #PF: supervisor read access in kernel mode
      #PF: error_code(0x0000) - not-present page
      PGD 0 P4D 0
      Oops: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 6 PID: 4173897 Comm: CPU 3/KVM Tainted: G        W
      RIP: 0010:internal_get_user_pages_fast+0x621/0x9d0
      Code: <48> 81 7a 68 80 08 04 bc 0f 85 21 ff ff 8 89 c7 be
      RSP: 0018:ffffaa90087679b0 EFLAGS: 00010046
      RAX: ffffe3f37905b900 RBX: 00007f2dd561e000 RCX: ffffe3f37905b934
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffe3f37905b900
      ...
      CR2: 0000000000000068 CR3: 00000004c5898003 CR4: 00000000001726e0
      Call Trace:
       get_user_pages_fast_only+0x13/0x20
       hva_to_pfn+0xa9/0x3e0
       try_async_pf+0xa1/0x270
       direct_page_fault+0x113/0xad0
       kvm_mmu_page_fault+0x69/0x680
       vmx_handle_exit+0xe1/0x5d0
       kvm_arch_vcpu_ioctl_run+0xd81/0x1c70
       kvm_vcpu_ioctl+0x267/0x670
       __x64_sys_ioctl+0x83/0xa0
       do_syscall_64+0x56/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
    
    Link: https://lkml.kernel.org/r/20211007231502.3552715-1-seanjc@google.com
    Fixes: 1507f51 ("mm: introduce memfd_secret system call to create "secret" memory areas")
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Reported-by: Darrick J. Wong <djwong@kernel.org>
    Reported-by: Stephen <stephenackerman16@gmail.com>
    Tested-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    sean-jc authored and torvalds committed Oct 19, 2021
  7. vfs: check fd has read access in kernel_read_file_from_fd()

    If we open a file without read access and then pass the fd to a syscall
    whose implementation calls kernel_read_file_from_fd(), we get a warning
    from __kernel_read():
    
            if (WARN_ON_ONCE(!(file->f_mode & FMODE_READ)))
    
    This currently affects both finit_module() and kexec_file_load(), but it
    could affect other syscalls in the future.
    
    Link: https://lkml.kernel.org/r/20211007220110.600005-1-willy@infradead.org
    Fixes: b844f0e ("vfs: define kernel_copy_file_from_fd()")
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reported-by: Hao Sun <sunhao.th@gmail.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Mimi Zohar <zohar@linux.ibm.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Matthew Wilcox (Oracle) authored and torvalds committed Oct 19, 2021
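
A userspace analogue of this check is sketched below: verify the access mode of a descriptor before trying to read from it. The helper names are hypothetical, and returning EBADF for a write-only descriptor is this sketch's own choice.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if fd was opened with read access, 0 if not, -1 on error. */
static int fd_has_read_access(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    int mode;

    if (flags == -1)
        return -1;               /* errno set by fcntl() */
    mode = flags & O_ACCMODE;
    return mode == O_RDONLY || mode == O_RDWR;
}

/* Hypothetical reader that rejects write-only descriptors up front. */
static ssize_t read_from_fd_checked(int fd, void *buf, size_t len)
{
    int ok = fd_has_read_access(fd);

    if (ok < 0)
        return -1;
    if (!ok) {
        errno = EBADF;           /* sketch's choice of error code */
        return -1;
    }
    return read(fd, buf, len);
}
```

Doing the check at the entry point means callers get a plain error return instead of tripping a warning deeper in the read path.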
  8. elfcore: correct reference to CONFIG_UML

    Commit 6e7b64b ("elfcore: fix building with clang") introduces
    special handling for two architectures, ia64 and User Mode Linux.
    However, the wrong name, i.e., CONFIG_UM, for the intended Kconfig
    symbol for User-Mode Linux was used.
    
    Although the directory for User Mode Linux is ./arch/um; the Kconfig
    symbol for this architecture is called CONFIG_UML.
    
    Luckily, ./scripts/checkkconfigsymbols.py warns on non-existing configs:
    
      UM
      Referencing files: include/linux/elfcore.h
      Similar symbols: UML, NUMA
    
    Correct the name of the config to the intended one.
    
    [akpm@linux-foundation.org: fix um/x86_64, per Catalin]
      Link: https://lkml.kernel.org/r/20211006181119.2851441-1-catalin.marinas@arm.com
      Link: https://lkml.kernel.org/r/YV6pejGzLy5ppEpt@arm.com
    
    Link: https://lkml.kernel.org/r/20211006082209.417-1-lukas.bulwahn@gmail.com
    Fixes: 6e7b64b ("elfcore: fix building with clang")
    Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Nathan Chancellor <nathan@kernel.org>
    Cc: Nick Desaulniers <ndesaulniers@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Barret Rhoden <brho@google.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    bulwahn authored and torvalds committed Oct 19, 2021
  9. mm, slub: fix incorrect memcg slab count for bulk free

    kmem_cache_free_bulk() will call memcg_slab_free_hook() for all objects
    when doing bulk free.  So we shouldn't call memcg_slab_free_hook() again
    for bulk free to avoid incorrect memcg slab count.
    
    Link: https://lkml.kernel.org/r/20210916123920.48704-6-linmiaohe@huawei.com
    Fixes: d1b2cf6 ("mm: memcg/slab: uncharge during kmem_cache_free_bulk()")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Bharata B Rao <bharata@linux.ibm.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    MiaoheLin authored and torvalds committed Oct 19, 2021
  10. mm, slub: fix potential use-after-free in slab_debugfs_fops

    When sysfs_slab_add() fails, we shouldn't call debugfs_slab_add() for s
    because s will be freed soon.  slab_debugfs_fops would use s later,
    leading to a use-after-free.
    
    Link: https://lkml.kernel.org/r/20210916123920.48704-5-linmiaohe@huawei.com
    Fixes: 64dd684 ("mm: slub: move sysfs slab alloc/free interfaces to debugfs")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Bharata B Rao <bharata@linux.ibm.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    MiaoheLin authored and torvalds committed Oct 19, 2021
  11. mm, slub: fix potential memoryleak in kmem_cache_open()

    In error path, the random_seq of slub cache might be leaked.  Fix this
    by using __kmem_cache_release() to release all the relevant resources.
    
    Link: https://lkml.kernel.org/r/20210916123920.48704-4-linmiaohe@huawei.com
    Fixes: 210e7a4 ("mm: SLUB freelist randomization")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Bharata B Rao <bharata@linux.ibm.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    MiaoheLin authored and torvalds committed Oct 19, 2021
  12. mm, slub: fix mismatch between reconstructed freelist depth and cnt

    If object's reuse is delayed, it will be excluded from the reconstructed
    freelist.  But we forgot to adjust the cnt accordingly.  So there will
    be a mismatch between reconstructed freelist depth and cnt.  This will
    lead to free_debug_processing() complaining about freelist count or an
    incorrect slub inuse count.
    
    Link: https://lkml.kernel.org/r/20210916123920.48704-3-linmiaohe@huawei.com
    Fixes: c389539 ("kasan, slub: fix handling of kasan_slab_free hook")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Bharata B Rao <bharata@linux.ibm.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    MiaoheLin authored and torvalds committed Oct 19, 2021
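
The bookkeeping problem can be illustrated with a small linked-list sketch. It is not the slub code: `struct obj`, its `delayed` flag, and both functions are hypothetical and only show how the count has to follow the rebuilt list.

```c
#include <stddef.h>

/* Hypothetical object with an intrusive freelist link. */
struct obj {
    struct obj *next;
    int delayed;        /* nonzero: reuse is delayed, keep off the list */
};

/*
 * Rebuild a freelist from 'head', skipping delayed objects.
 * '*cnt' comes in as the number of objects passed in and goes out as
 * the number actually linked, so it always matches the list depth.
 */
static struct obj *rebuild_freelist(struct obj *head, int *cnt)
{
    struct obj *new_head = NULL, *new_tail = NULL;

    while (head) {
        struct obj *next = head->next;

        if (head->delayed) {
            (*cnt)--;            /* the adjustment that was missing */
        } else {
            head->next = NULL;
            if (new_tail)
                new_tail->next = head;
            else
                new_head = head;
            new_tail = head;
        }
        head = next;
    }
    return new_head;
}

/* Count the objects reachable from 'head'. */
static int freelist_depth(const struct obj *head)
{
    int depth = 0;

    for (; head; head = head->next)
        depth++;
    return depth;
}
```

Without the decrement, a caller that trusts the count would see three objects while only two are linked.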
  13. mm, slub: fix two bugs in slab_debug_trace_open()

    Patch series "Fixups for slub".
    
    This series contains various bug fixes for slub.  We fix a memory leak,
    use-after-free, NULL pointer dereferencing and so on in slub.  More
    details can be found in the respective changelogs.
    
    This patch (of 5):
    
    It's possible that __seq_open_private() will return NULL, so we should
    check its result before use to avoid dereferencing a NULL pointer.  In
    the error paths, we also forgot to release the private buffer via
    seq_release_private(), so memory leaks in these paths.
    
    Link: https://lkml.kernel.org/r/20210916123920.48704-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20210916123920.48704-2-linmiaohe@huawei.com
    Fixes: 64dd684 ("mm: slub: move sysfs slab alloc/free interfaces to debugfs")
    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Bharata B Rao <bharata@linux.ibm.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    MiaoheLin authored and torvalds committed Oct 19, 2021
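
Both fixes follow a common pattern, sketched here in userspace C with malloc-style allocation. The names `trace_state`, `open_trace_sketch` and `close_trace_sketch` are hypothetical, and `fail_table` exists only to force the error path; the sketch assumes `n` is greater than zero.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical private state attached to an open file. */
struct trace_state {
    int *table;
};

/* Open: allocate private state, then a table of 'n' entries (n > 0). */
static int open_trace_sketch(struct trace_state **out, size_t n,
                             int fail_table)
{
    struct trace_state *st = calloc(1, sizeof(*st));

    if (!st)
        return -ENOMEM;          /* check the result before using it */

    st->table = fail_table ? NULL : calloc(n, sizeof(*st->table));
    if (!st->table) {
        free(st);                /* release private state on the error path */
        return -ENOMEM;
    }
    *out = st;
    return 0;
}

/* Close: release everything acquired by a successful open. */
static void close_trace_sketch(struct trace_state *st)
{
    free(st->table);
    free(st);
}
```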
  14. mm/mempolicy: do not allow illegal MPOL_F_NUMA_BALANCING | MPOL_LOCAL…

    … in mbind()
    
    syzbot reported access to uninitialized memory in mbind() [1].
    
    Issue came with commit bda420b ("numa balancing: migrate on fault
    among multiple bound nodes")
    
    This commit added a new bit in MPOL_MODE_FLAGS, but only checked for
    the valid combination (MPOL_F_NUMA_BALANCING can only be used with
    MPOL_BIND) in do_set_mempolicy().
    
    This patch moves the check into sanitize_mpol_flags() so that it is
    also used by mbind().
    
      [1]
      BUG: KMSAN: uninit-value in __mpol_equal+0x567/0x590 mm/mempolicy.c:2260
       __mpol_equal+0x567/0x590 mm/mempolicy.c:2260
       mpol_equal include/linux/mempolicy.h:105 [inline]
       vma_merge+0x4a1/0x1e60 mm/mmap.c:1190
       mbind_range+0xcc8/0x1e80 mm/mempolicy.c:811
       do_mbind+0xf42/0x15f0 mm/mempolicy.c:1333
       kernel_mbind mm/mempolicy.c:1483 [inline]
       __do_sys_mbind mm/mempolicy.c:1490 [inline]
       __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486
       __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486
       do_syscall_x64 arch/x86/entry/common.c:51 [inline]
       do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
       entry_SYSCALL_64_after_hwframe+0x44/0xae
    
      Uninit was created at:
       slab_alloc_node mm/slub.c:3221 [inline]
       slab_alloc mm/slub.c:3230 [inline]
       kmem_cache_alloc+0x751/0xff0 mm/slub.c:3235
       mpol_new mm/mempolicy.c:293 [inline]
       do_mbind+0x912/0x15f0 mm/mempolicy.c:1289
       kernel_mbind mm/mempolicy.c:1483 [inline]
       __do_sys_mbind mm/mempolicy.c:1490 [inline]
       __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486
       __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486
       do_syscall_x64 arch/x86/entry/common.c:51 [inline]
       do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      =====================================================
      Kernel panic - not syncing: panic_on_kmsan set ...
      CPU: 0 PID: 15049 Comm: syz-executor.0 Tainted: G    B             5.15.0-rc2-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x1ff/0x28e lib/dump_stack.c:106
       dump_stack+0x25/0x28 lib/dump_stack.c:113
       panic+0x44f/0xdeb kernel/panic.c:232
       kmsan_report+0x2ee/0x300 mm/kmsan/report.c:186
       __msan_warning+0xd7/0x150 mm/kmsan/instrumentation.c:208
       __mpol_equal+0x567/0x590 mm/mempolicy.c:2260
       mpol_equal include/linux/mempolicy.h:105 [inline]
       vma_merge+0x4a1/0x1e60 mm/mmap.c:1190
       mbind_range+0xcc8/0x1e80 mm/mempolicy.c:811
       do_mbind+0xf42/0x15f0 mm/mempolicy.c:1333
       kernel_mbind mm/mempolicy.c:1483 [inline]
       __do_sys_mbind mm/mempolicy.c:1490 [inline]
       __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486
       __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486
       do_syscall_x64 arch/x86/entry/common.c:51 [inline]
       do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
       entry_SYSCALL_64_after_hwframe+0x44/0xae
    
    Link: https://lkml.kernel.org/r/20211001215630.810592-1-eric.dumazet@gmail.com
    Fixes: bda420b ("numa balancing: migrate on fault among multiple bound nodes")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    neebe000 authored and torvalds committed Oct 19, 2021
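
The idea of validating the flag combination in one shared helper, so that every entry point gets the same check, might look like this. All `SK_` constants and `_sketch` functions are hypothetical placeholders with values chosen for illustration; this is not the kernel's code.

```c
#include <errno.h>

/* Hypothetical mode values and flag bit, for illustration only. */
enum {
    SK_MPOL_DEFAULT,
    SK_MPOL_PREFERRED,
    SK_MPOL_BIND,
    SK_MPOL_INTERLEAVE,
    SK_MPOL_LOCAL
};
#define SK_MPOL_F_NUMA_BALANCING (1u << 13)
#define SK_MPOL_MODE_FLAGS       SK_MPOL_F_NUMA_BALANCING

/* Shared helper: split mode from flags and reject invalid combinations. */
static int sanitize_flags_sketch(unsigned int *mode, unsigned int *flags)
{
    *flags = *mode & SK_MPOL_MODE_FLAGS;
    *mode &= ~SK_MPOL_MODE_FLAGS;

    if (*mode > SK_MPOL_LOCAL)
        return -EINVAL;
    /* The balancing flag is only valid together with the bind mode. */
    if ((*flags & SK_MPOL_F_NUMA_BALANCING) && *mode != SK_MPOL_BIND)
        return -EINVAL;
    return 0;
}

/* Both entry points go through the same helper. */
static int set_mempolicy_sketch(unsigned int mode)
{
    unsigned int flags;

    return sanitize_flags_sketch(&mode, &flags);
}

static int mbind_sketch(unsigned int mode)
{
    unsigned int flags;

    return sanitize_flags_sketch(&mode, &flags);
}
```

Because the check lives in the helper, a new entry point that calls it cannot forget the combination rule.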
  15. memblock: check memory total_size

    mem=[X][G|M] is broken on the ARM64 platform: there are cases where
    type.cnt is 1 but total_size is not 0, because the regions have been
    merged into one.  Checking only 'cnt' is therefore not enough;
    total_size should be used, otherwise the 'mem=[X][G|M]' boot argument
    no longer works.
    
    Link: https://lkml.kernel.org/r/20210930024437.32598-1-peng.fan@oss.nxp.com
    Fixes: e888fa7 ("memblock: Check memory add/cap ordering")
    Signed-off-by: Peng Fan <peng.fan@nxp.com>
    Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Geert Uytterhoeven <geert+renesas@glider.be>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    MrVan authored and torvalds committed Oct 19, 2021
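
The difference between the two checks can be sketched with a toy region list. The structures and function names are hypothetical, and the sketch assumes that an empty list still holds one zero-sized placeholder entry, so a count of 1 is ambiguous.

```c
#include <stddef.h>

/* Hypothetical region list; field names loosely follow the message. */
struct sk_region {
    unsigned long long base;
    unsigned long long size;
};

struct sk_region_list {
    unsigned long cnt;              /* number of entries in 'regions' */
    unsigned long long total_size;  /* sum of all region sizes */
    struct sk_region regions[4];
};

/* Fragile check: treats a single (possibly merged) region as "no memory". */
static int populated_by_cnt(const struct sk_region_list *l)
{
    return l->cnt > 1;
}

/* Robust check: any non-zero total size means memory has been added. */
static int populated_by_total_size(const struct sk_region_list *l)
{
    return l->total_size != 0;
}
```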
  16. ocfs2: mount fails with buffer overflow in strlen

    Starting with kernel 5.11 built with CONFIG_FORTIFY_SOURCE, mounting an
    ocfs2 filesystem with either o2cb or pcmk cluster stack fails with the
    trace below.  The problem seems to be that strings for cluster stack and
    cluster name are not guaranteed to be null terminated in the disk
    representation, while strlcpy assumes that the source string is always
    null terminated.  This causes a read outside of the source string
    triggering the buffer overflow detection.
    
      detected buffer overflow in strlen
      ------------[ cut here ]------------
      kernel BUG at lib/string.c:1149!
      invalid opcode: 0000 [#1] SMP PTI
      CPU: 1 PID: 910 Comm: mount.ocfs2 Not tainted 5.14.0-1-amd64 #1
        Debian 5.14.6-2
      RIP: 0010:fortify_panic+0xf/0x11
      ...
      Call Trace:
       ocfs2_initialize_super.isra.0.cold+0xc/0x18 [ocfs2]
       ocfs2_fill_super+0x359/0x19b0 [ocfs2]
       mount_bdev+0x185/0x1b0
       legacy_get_tree+0x27/0x40
       vfs_get_tree+0x25/0xb0
       path_mount+0x454/0xa20
       __x64_sys_mount+0x103/0x140
       do_syscall_64+0x3b/0xc0
       entry_SYSCALL_64_after_hwframe+0x44/0xae
    
    Link: https://lkml.kernel.org/r/20210929180654.32460-1-vvidic@valentin-vidic.from.hr
    Signed-off-by: Valentin Vidic <vvidic@valentin-vidic.from.hr>
    Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    Cc: Mark Fasheh <mark@fasheh.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Junxiao Bi <junxiao.bi@oracle.com>
    Cc: Changwei Ge <gechangwei@live.cn>
    Cc: Gang He <ghe@suse.com>
    Cc: Jun Piao <piaojun@huawei.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    vvidic authored and torvalds committed Oct 19, 2021
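
One way to copy a fixed-width field that may lack a terminating NUL is to bound the length search by the field size. The following is a sketch of that general technique, not the change made by this commit; `copy_disk_field` is a hypothetical name.

```c
#include <string.h>

/*
 * Copy a fixed-width field that is not guaranteed to be NUL-terminated
 * into 'dst', always terminating the result. Returns the copied length.
 */
static size_t copy_disk_field(char *dst, size_t dst_size,
                              const char *field, size_t field_size)
{
    const char *nul;
    size_t len;

    if (dst_size == 0)
        return 0;

    /* Never look past the end of the field while searching for NUL. */
    nul = memchr(field, '\0', field_size);
    len = nul ? (size_t)(nul - field) : field_size;

    if (len >= dst_size)
        len = dst_size - 1;      /* truncate to fit the destination */
    memcpy(dst, field, len);
    dst[len] = '\0';
    return len;
}
```

A function that calls strlen() on the source, by contrast, keeps reading until it finds a NUL byte, which for a completely filled field lies outside the field.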
  17. ocfs2: fix data corruption after conversion from inline format

    Commit 6dbf7bb ("fs: Don't invalidate page buffers in
    block_write_full_page()") uncovered a latent bug in ocfs2 conversion
    from inline inode format to a normal inode format.
    
    The code in ocfs2_convert_inline_data_to_extents() attempts to zero out
    the whole cluster allocated for file data by grabbing, zeroing, and
    dirtying all pages covering this cluster.  However these pages are
    beyond i_size, thus writeback code generally ignores these dirty pages
    and no blocks were ever actually zeroed on the disk.
    
    This oversight was fixed by commit 693c241 ("ocfs2: No need to zero
    pages past i_size.") for the standard ocfs2 write path, but the inline
    conversion path was apparently forgotten; that commit's log also
    explains why the zeroing is not actually needed.
    
    After commit 6dbf7bb, things became worse as writeback code stopped
    invalidating buffers on pages beyond i_size, and thus these pages end up
    with the PageDirty bit clear but with the buffers attached to them
    still dirty.  So when a file is converted from inline format, then
    writeback triggers, and then the file is grown so that these pages
    become valid, the invalid dirtiness state is preserved:
    mark_buffer_dirty() does nothing on these pages (the buffers are already
    dirty), but the page is never written back because it is clean.  So data
    written to these pages is lost once the pages are reclaimed.
    
    A simple reproducer for the problem is:
    
      xfs_io -f -c "pwrite 0 2000" -c "pwrite 2000 2000" -c "fsync" \
        -c "pwrite 4000 2000" ocfs2_file
    
    After unmounting and mounting the fs again, you can observe that the end
    of 'ocfs2_file' has lost its contents.
    
    Fix the problem by not doing the pointless zeroing during conversion
    from inline format similarly as in the standard write path.
    
    [akpm@linux-foundation.org: fix whitespace, per Joseph]
    
    Link: https://lkml.kernel.org/r/20210930095405.21433-1-jack@suse.cz
    Fixes: 6dbf7bb ("fs: Don't invalidate page buffers in block_write_full_page()")
    Signed-off-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    Acked-by: Gang He <ghe@suse.com>
    Cc: Mark Fasheh <mark@fasheh.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Junxiao Bi <junxiao.bi@oracle.com>
    Cc: Changwei Ge <gechangwei@live.cn>
    Cc: Jun Piao <piaojun@huawei.com>
    Cc: "Markov, Andrey" <Markov.Andrey@Dell.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    jankara authored and torvalds committed Oct 19, 2021
  18. mm/migrate: fix CPUHP state to update node demotion order

    The node demotion order needs to be updated during CPU hotplug, because
    whether a NUMA node has CPUs may influence the demotion order.  The
    update function should be called during CPU online/offline after the
    node_states[N_CPU] has been updated.  That is done in
    CPUHP_AP_ONLINE_DYN during CPU online and in CPUHP_MM_VMSTAT_DEAD during
    CPU offline.  But in commit 884a6e5 ("mm/migrate: update node
    demotion order on hotplug events"), the function to update node demotion
    order is called in CPUHP_AP_ONLINE_DYN during CPU online/offline.  This
    doesn't satisfy the order requirement.
    
    For example, with 4 CPUs (P0, P1, P2, P3) in 2 sockets (P0, P1 in S0
    and P2, P3 in S1), the demotion order is
    
     - S0 -> NUMA_NO_NODE
     - S1 -> NUMA_NO_NODE
    
    After P2 and P3 are offlined, because S1 has no CPU now, the demotion
    order should have been changed to
    
     - S0 -> S1
     - S1 -> NUMA_NO_NODE
    
    but it isn't changed, because the order updating callback for CPU
    hotplug doesn't see the new nodemask.  After that, if P1 is offlined,
    the demotion order is changed to the expected order as above.
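
    The ordering requirement can be illustrated with a toy model.  The
    sketch below is Python, not kernel code: build_demotion_order(),
    offline_last_cpu() and the simplified rule "a node with CPUs demotes to
    the first CPU-less node" are assumptions made purely for illustration.
    It only shows why an update callback that runs before the CPU nodemask
    is changed computes a stale order.

```python
NUMA_NO_NODE = None  # stand-in for the kernel constant


def build_demotion_order(nodes, nodes_with_cpu):
    """Toy rule: each node with CPUs demotes to the first CPU-less node."""
    cpuless = [n for n in nodes if n not in nodes_with_cpu]
    order = {}
    for n in nodes:
        if n in nodes_with_cpu and cpuless:
            order[n] = cpuless[0]
        else:
            order[n] = NUMA_NO_NODE
    return order


def offline_last_cpu(nodes, nodes_with_cpu, node, update_after_mask):
    """Take the last CPU of `node` offline and recompute the order.

    update_after_mask=False models the bug: the update callback runs
    before the nodemask has been changed, so it sees the old state.
    """
    if not update_after_mask:
        order = build_demotion_order(nodes, nodes_with_cpu)  # stale view
        nodes_with_cpu.discard(node)
    else:
        nodes_with_cpu.discard(node)
        order = build_demotion_order(nodes, nodes_with_cpu)  # fresh view
    return order


nodes = ["S0", "S1"]
buggy = offline_last_cpu(nodes, {"S0", "S1"}, "S1", update_after_mask=False)
fixed = offline_last_cpu(nodes, {"S0", "S1"}, "S1", update_after_mask=True)
print(buggy)
print(fixed)
```

    In this model the early callback leaves both nodes without a demotion
    target, while the late callback yields S0 -> S1.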
    
    So in this patch, we added CPUHP_AP_MM_DEMOTION_ONLINE and
    CPUHP_MM_DEMOTION_DEAD to be called after CPUHP_AP_ONLINE_DYN and
    CPUHP_MM_VMSTAT_DEAD during CPU online and offline, and register the
    update function on them.
    
    Link: https://lkml.kernel.org/r/20210929060351.7293-1-ying.huang@intel.com
    Fixes: 884a6e5 ("mm/migrate: update node demotion order on hotplug events")
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Keith Busch <kbusch@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    yhuang-intel authored and torvalds committed Oct 19, 2021
  19. mm/migrate: add CPU hotplug to demotion #ifdef

    Once upon a time, the node demotion updates were driven solely by memory
    hotplug events.  But now, there are handlers for both CPU and memory
    hotplug.
    
    However, the #ifdef around the code checks only memory hotplug.  A
    system that has HOTPLUG_CPU=y but MEMORY_HOTPLUG=n would miss CPU
    hotplug events.
    
    Update the #ifdef around the common code.  Add memory and CPU-specific
    #ifdefs for their handlers.  These memory/CPU #ifdefs avoid unused
    function warnings when their Kconfig option is off.
    
    [arnd@arndb.de: rework hotplug_memory_notifier() stub]
      Link: https://lkml.kernel.org/r/20211013144029.2154629-1-arnd@kernel.org
    
    Link: https://lkml.kernel.org/r/20210924161255.E5FE8F7E@davehans-spike.ostc.intel.com
    Fixes: 884a6e5 ("mm/migrate: update node demotion order on hotplug events")
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Yang Shi <yang.shi@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    hansendc authored and torvalds committed Oct 19, 2021
  20. mm/migrate: optimize hotplug-time demotion order updates

    Patch series "mm/migrate: 5.15 fixes for automatic demotion", v2.
    
    This contains two fixes for the "automatic demotion" code which was
    merged into 5.15:
    
     * Fix memory hotplug performance regression by suppressing
       any real action on irrelevant hotplug events.
    
     * Ensure CPU hotplug handler is registered when memory hotplug
       is disabled.
    
    This patch (of 2):
    
    == tl;dr ==
    
    Automatic demotion opted for a simple, lazy approach to handling hotplug
    events.  This noticeably slows down memory hotplug[1].  Optimize away
    updates to the demotion order when memory hotplug events should have no
    effect.
    
    This has no effect on CPU hotplug.  There is no known problem on the CPU
    side and any work there will be in a separate series.
    
    == Background ==
    
    Automatic demotion is a memory migration strategy to ensure that new
    allocations have room in faster memory tiers on tiered memory systems.
    The kernel maintains an array (node_demotion[]) to drive these
    migrations.
    
    The node_demotion[] path is calculated by starting at nodes with CPUs
    and then "walking" to nodes with memory.  Only hotplug events which
    online or offline a node with memory (N_ONLINE) or CPUs (N_CPU) will
    actually affect the migration order.
    
    == Problem ==
    
    However, the current code is lazy.  It completely regenerates the
    migration order on *any* CPU or memory hotplug event.  The logic was
    that these events are extremely rare and that the overhead from
    indiscriminate order regeneration is minimal.
    
    Part of the update logic involves a synchronize_rcu(), which is a pretty
    big hammer.  Its overhead was large enough to be detected by some 0day
    tests that watch memory hotplug performance[1].
    
    == Solution ==
    
    Add a new helper (node_demotion_topo_changed()) which can differentiate
    between superfluous and impactful hotplug events.  Skip the expensive
    update operation for superfluous events.
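
    The idea can be sketched in a few lines.  This is a toy Python model
    and not the kernel implementation: the DemotionTracker class, its
    method names and the choice to compare snapshots of the node sets are
    all assumptions made for illustration.  The counter stands in for the
    expensive rebuild.

```python
class DemotionTracker:
    """Toy model: rebuild the order only when the node sets change."""

    def __init__(self, mem_nodes, cpu_nodes):
        self.prev = (frozenset(mem_nodes), frozenset(cpu_nodes))
        self.rebuilds = 0

    def topo_changed(self, mem_nodes, cpu_nodes):
        # Hypothetical stand-in for the helper named in the text.
        current = (frozenset(mem_nodes), frozenset(cpu_nodes))
        changed = current != self.prev
        self.prev = current
        return changed

    def on_hotplug(self, mem_nodes, cpu_nodes):
        if not self.topo_changed(mem_nodes, cpu_nodes):
            return False      # superfluous event: skip the update
        self.rebuilds += 1    # stands in for the expensive rebuild
        return True


tracker = DemotionTracker(mem_nodes={0, 1}, cpu_nodes={0})
# More memory is added to node 1, which already had memory: no change.
first = tracker.on_hotplug(mem_nodes={0, 1}, cpu_nodes={0})
# Node 2 gains memory for the first time: the order must be rebuilt.
second = tracker.on_hotplug(mem_nodes={0, 1, 2}, cpu_nodes={0})
print(first, second, tracker.rebuilds)
```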
    
    == Aside: Locking ==
    
    It took me a few moments to declare the locking to be safe enough for
    node_demotion_topo_changed() to work.  It all hinges on the memory
    hotplug lock:
    
    During memory hotplug events, 'mem_hotplug_lock' is held for write.
    This ensures that two memory hotplug events can not be called
    simultaneously.
    
    CPU hotplug has a similar lock (cpuhp_state_mutex) which also provides
    mutual exclusion between CPU hotplug events.  In addition, the demotion
    code acquires and holds the mem_hotplug_lock for read during its CPU
    hotplug handlers.  This provides mutual exclusion between the demotion
    memory hotplug callbacks and the CPU hotplug callbacks.
    
    This effectively allows the migration target generation code to be
    treated as if it were single-threaded.
    
    1. https://lore.kernel.org/all/20210905135932.GE15026@xsang-OptiPlex-9020/
    
    Link: https://lkml.kernel.org/r/20210924161251.093CCD06@davehans-spike.ostc.intel.com
    Link: https://lkml.kernel.org/r/20210924161253.D7673E31@davehans-spike.ostc.intel.com
    Fixes: 884a6e5 ("mm/migrate: update node demotion order on hotplug events")
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Reported-by: kernel test robot <oliver.sang@intel.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Yang Shi <yang.shi@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    hansendc authored and torvalds committed Oct 19, 2021
  21. userfaultfd: fix a race between writeprotect and exit_mmap()

    A race is possible when a process exits, its VMAs are removed by
    exit_mmap() and at the same time userfaultfd_writeprotect() is called.
    
    The race was detected by KASAN on a development kernel, but it appears
    to be possible on vanilla kernels as well.
    
    Use mmget_not_zero() to prevent the race as done in other userfaultfd
    operations.
    
    Link: https://lkml.kernel.org/r/20210921200247.25749-1-namit@vmware.com
    Fixes: 63b2d41 ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl")
    Signed-off-by: Nadav Amit <namit@vmware.com>
    Tested-by: Li  Wang <liwang@redhat.com>
    Reviewed-by: Peter Xu <peterx@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Nadav Amit authored and torvalds committed Oct 19, 2021
  22. mm/userfaultfd: selftests: fix memory corruption with thp enabled

    In RHEL's gating selftests we've encountered memory corruption in the
    uffd event test even with upstream kernel:
    
            # ./userfaultfd anon 128 4
            nr_pages: 32768, nr_pages_per_cpu: 32768
            bounces: 3, mode: rnd racing read, userfaults: 6240 missing (6240) 14729 wp (14729)
            bounces: 2, mode: racing read, userfaults: 1444 missing (1444) 28877 wp (28877)
            bounces: 1, mode: rnd read, userfaults: 6055 missing (6055) 14699 wp (14699)
            bounces: 0, mode: read, userfaults: 82 missing (82) 25196 wp (25196)
            testing uffd-wp with pagemap (pgsize=4096): done
            testing uffd-wp with pagemap (pgsize=2097152): done
            testing events (fork, remap, remove): ERROR: nr 32427 memory corruption 0 1 (errno=0, line=963)
            ERROR: faulting process failed (errno=0, line=1117)
    
    It can be easily reproduced when global thp is enabled, which is the
    default for RHEL.
    
    It's also known as a side effect of commit 0db282b ("selftest: use
    mmap instead of posix_memalign to allocate memory", 2021-07-23), which
    is imho right itself on using mmap() to make sure the addresses will be
    untagged even on arm.
    
    The problem is, for each test we allocate buffers using two
    allocate_area() calls.  We assumed these two buffers won't affect each
    other, however they could, because mmap() could have found that the two
    buffers are near each other and have the same VMA flags, so they got
    merged into one VMA.
    
    It won't be a big problem if thp is not enabled, but when thp is
    aggressively enabled it means when initializing the src buffer it could
    accidentally set up part of the dest buffer too when there's a shared THP
    that overlaps the two regions.  Then some of the dest buffer won't be
    able to be trapped by userfaultfd missing mode, then it'll cause memory
    corruption as described.
    
    To fix it, do release_pages() after initializing the src buffer.
    
    Since the previous two release_pages() calls are after
    uffd_test_ctx_clear() which will unmap all the buffers anyway (which is
    stronger than release pages; as unmap() also tears down pgtables), drop
    them as they shouldn't really be doing anything useful.
    
    We can mark the Fixes tag upon 0db282b as it's reported to only
    happen there, however the real "Fixes" IMHO should be 8ba6e86, as
    before that commit we'll always do explicit release_pages() before
    registration of uffd, and 8ba6e86 changed that logic by adding
    extra unmap/map and we didn't release the pages at the right place.
    Meanwhile I don't have a solid clue anyway on whether posix_memalign()
    could always avoid triggering this bug, hence it's safer to attach this
    fix to commit 8ba6e86.
    
    Link: https://lkml.kernel.org/r/20210923232512.210092-1-peterx@redhat.com
    Fixes: 8ba6e86 ("userfaultfd/selftests: reinitialize test context in each test")
    Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1994931
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reported-by: Li Wang <liwan@redhat.com>
    Tested-by: Li Wang <liwang@redhat.com>
    Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    xzpeter authored and torvalds committed Oct 19, 2021
  23. ALSA: usb-audio: Fix microphone sound on Jieli webcam.

    When a Jieli Technology USB Webcam is connected, the video part works
    well, but the mic sound is sped up. In dmesg there are messages
    about rates that differ from the runtime rates, warnings about volume
    resolution and, lastly, the log is filled every 5 seconds with
    retire_capture_urb error messages.
    
    The mic works only when the ep packet size is set to wMaxPacketSize (normal
    sound and no more retire_capture_urb error messages). Skipping reading the
    sample rate fixes the messages about different rates, and forcing a volume
    resolution fixes the warnings about volume range. I have arbitrarily chosen
    the value (16): I read in a comment that there should be no more than 255
    levels, so 4096 (max volume) / 16 = 256 steps (0-255).
    
    Signed-off-by: Marco Giunta <giun7a@gmail.com>
    Link: https://lore.kernel.org/r/20211018162552.12082-1-giun7a@gmail.com
    Signed-off-by: Takashi Iwai <tiwai@suse.de>
    gocram authored and tiwai committed Oct 19, 2021

Commits on Oct 18, 2021

  1. audit: fix possible null-pointer dereference in audit_filter_rules

    Fix possible null-pointer dereference in audit_filter_rules.
    
    audit_filter_rules() error: we previously assumed 'ctx' could be null
    
    Cc: stable@vger.kernel.org
    Fixes: bf36123 ("audit: add saddr_fam filter field")
    Reported-by: kernel test robot <lkp@intel.com>
    Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
    Signed-off-by: Paul Moore <paul@paul-moore.com>
    gscui authored and pcmoore committed Oct 18, 2021
  2. tracing: Have all levels of checks prevent recursion

    While writing an email explaining the "bit = 0" logic for a discussion on
    making ftrace_test_recursion_trylock() disable preemption, I discovered a
    path that makes the "not do the logic if bit is zero" unsafe.
    
    The recursion logic is done in hot paths like the function tracer. Thus,
    any code executed causes noticeable overhead. Thus, tricks are done to try
    to limit the amount of code executed. This included the recursion testing
    logic.
    
    Having recursion testing is important, as there are many paths that can
    end up in an infinite recursion cycle when tracing every function in the
    kernel. Thus protection is needed to prevent that from happening.
    
    Because it is OK to recurse due to different running context levels (e.g.
    an interrupt preempts a trace, and then a trace occurs in the interrupt
    handler), a set of bits is used to know which context one is in (normal,
    softirq, irq and NMI). If a recursion occurs in the same level, it is
    prevented*.
    
    Then there are infrastructure levels of recursion as well. When more than
    one callback is attached to the same function to trace, it calls a loop
    function to iterate over all the callbacks. Both the callbacks and the
    loop function have recursion protection. The callbacks use the
    "ftrace_test_recursion_trylock()" which has a "function" set of context
    bits to test, and the loop function calls the internal
    trace_test_and_set_recursion() directly, with an "internal" set of bits.
    
    If an architecture does not implement all the features supported by ftrace
    then the callbacks are never called directly, and the loop function is
    called instead, which will implement the features of ftrace.
    
    Since both the loop function and the callbacks do recursion protection, it
    seemed unnecessary to do it in both locations. Thus, a trick was made
    to have the internal set of recursion bits at a more significant bit
    location than the function bits. Then, if any of the higher bits were set,
    the logic of the function bits could be skipped, as any new recursion
    would first have to go through the loop function.
    
    This is true for architectures that do not support all the ftrace
    features, because all functions being traced must first go through the
    loop function before going to the callbacks. But this is not true for
    architectures that support all the ftrace features. That's because the
    loop function could be called due to two callbacks attached to the same
    function, but then a recursion function inside the callback could be
    called that does not share any other callback, and it will be called
    directly.
    
    i.e.
    
     traced_function_1: [ more than one callback tracing it ]
       call loop_func
    
     loop_func:
       trace_recursion set internal bit
       call callback
    
     callback:
       trace_recursion [ skipped because internal bit is set, return 0 ]
       call traced_function_2
    
     traced_function_2: [ only traced by above callback ]
       call callback
    
     callback:
       trace_recursion [ skipped because internal bit is set, return 0 ]
       call traced_function_2
    
     [ wash, rinse, repeat, BOOM! out of shampoo! ]
    
    Thus, the "bit == 0 skip" trick is not safe, unless the loop function is
    called for all functions.
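
    The call chain above can be replayed with a toy model.  The sketch
    below is Python, not kernel code: the two booleans stand in for the
    "internal" and "function" bit sets of a single context, MAX_DEPTH is an
    artificial cap standing in for the stack overflow, and the one allowed
    transition layer of recursion is ignored.  All names are assumptions
    made for illustration.

```python
MAX_DEPTH = 50  # artificial cap standing in for a stack overflow


def run(skip_when_internal_set):
    """Return how many times the callback body ran for one traced call."""
    state = {"internal": False, "function": False}
    calls = 0

    def callback(depth):
        nonlocal calls
        took_bit = False
        if not (skip_when_internal_set and state["internal"]):
            if state["function"]:
                return            # recursion detected, bail out
            state["function"] = True
            took_bit = True
        # else: the shortcut skips the function-level check entirely
        calls += 1
        if depth < MAX_DEPTH:
            traced_function_2(depth + 1)
        if took_bit:
            state["function"] = False

    def traced_function_2(depth):
        callback(depth)           # single callback: called directly

    def loop_func():
        state["internal"] = True
        callback(0)
        state["internal"] = False

    loop_func()                   # traced_function_1 has several callbacks
    return calls


with_trick = run(skip_when_internal_set=True)
without_trick = run(skip_when_internal_set=False)
print(with_trick, without_trick)
```

    With the shortcut, the callback keeps re-entering until the cap is hit;
    without it, the second entry is rejected.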
    
    Since we want to encourage architectures to implement all ftrace features,
    having them slow down due to this extra logic may encourage the
    maintainers to update to the latest ftrace features. And because this
    logic is only safe for them, remove it completely.
    
     [*] There is one layer of recursion that is allowed, and that is to allow
         for the transition between interrupt context (normal -> softirq ->
         irq -> NMI), because a trace may occur before the context update is
         visible to the trace recursion logic.
    
    Link: https://lore.kernel.org/all/609b565a-ed6e-a1da-f025-166691b5d994@linux.alibaba.com/
    Link: https://lkml.kernel.org/r/20211018154412.09fcad3c@gandalf.local.home
    
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: "James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>
    Cc: Helge Deller <deller@gmx.de>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Paul Walmsley <paul.walmsley@sifive.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Albert Ou <aou@eecs.berkeley.edu>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Josh Poimboeuf <jpoimboe@redhat.com>
    Cc: Jiri Kosina <jikos@kernel.org>
    Cc: Miroslav Benes <mbenes@suse.cz>
    Cc: Joe Lawrence <joe.lawrence@redhat.com>
    Cc: Colin Ian King <colin.king@canonical.com>
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Jisheng Zhang <jszhang@kernel.org>
    Cc: 王贇 <yun.wang@linux.alibaba.com>
    Cc: Guo Ren <guoren@kernel.org>
    Cc: stable@vger.kernel.org
    Fixes: edc15ca ("tracing: Avoid unnecessary multiple recursion checks")
    Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    rostedt committed Oct 18, 2021
  3. KVM: SEV-ES: reduce ghcb_sa_len to 32 bits

    The size of the GHCB scratch area is limited to 16 KiB (GHCB_SCRATCH_AREA_LIMIT),
    so there is no need for it to be a u64.  This fixes a build error on 32-bit
    systems:
    
    i686-linux-gnu-ld: arch/x86/kvm/svm/sev.o: in function `sev_es_string_io':
    sev.c:(.text+0x110f): undefined reference to `__udivdi3'
    
    Cc: stable@vger.kernel.org
    Fixes: 019057b ("KVM: SEV-ES: fix length of string I/O")
    Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    bonzini committed Oct 18, 2021
  4. KVM: VMX: Remove redundant handling of bus lock vmexit

    Hardware may or may not set exit_reason.bus_lock_detected on BUS_LOCK
    VM-Exits. Dealing with KVM_RUN_X86_BUS_LOCK in handle_bus_lock_vmexit
    could be redundant when exit_reason.basic is EXIT_REASON_BUS_LOCK.
    
    We can remove redundant handling of bus lock vmexit. Unconditionally set
    exit_reason.bus_lock_detected in handle_bus_lock_vmexit(), and deal with
    KVM_RUN_X86_BUS_LOCK only in vmx_handle_exit().
    
    Signed-off-by: Hao Xiang <hao.xiang@linux.alibaba.com>
    Message-Id: <1634299161-30101-1-git-send-email-hao.xiang@linux.alibaba.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    Hao Xiang authored and bonzini committed Oct 18, 2021