Mina-Almasry/m…

Commits on Nov 12, 2021

  1. mm, shmem, selftests: add tmpfs memcg= mount option tests

    Signed-off-by: Mina Almasry <almasrymina@google.com>
    
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: riel@surriel.com
    Cc: linux-mm@kvack.org
    Cc: linux-fsdevel@vger.kernel.org
    Cc: cgroups@vger.kernel.org
    Mina Almasry authored and intel-lab-lkp committed Nov 12, 2021
  2. mm, shmem: add tmpfs memcg= option documentation

    Signed-off-by: Mina Almasry <almasrymina@google.com>
    
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: riel@surriel.com
    Cc: linux-mm@kvack.org
    Cc: linux-fsdevel@vger.kernel.org
    Cc: cgroups@vger.kernel.org
    Mina Almasry authored and intel-lab-lkp committed Nov 12, 2021
  3. mm/oom: handle remote ooms

    On remote ooms (OOMs due to remote charging), the oom-killer will attempt
    to find a task to kill in the memcg under oom.  If the oom-killer is
    unable to find one, it should simply return ENOMEM to the allocating
    process.
    
    If we're in the pagefault path and we're unable to return ENOMEM to the
    allocating process, we instead kill the allocating process.
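
The decision described above can be sketched as a small pure function. This is a userspace illustration only: the enum and function names are hypothetical and do not come from the patch, which works inside the kernel's OOM and memcg code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes; these names are illustrative, not kernel identifiers. */
enum remote_oom_action {
	REMOTE_OOM_KILL_VICTIM,    /* a task inside the memcg under OOM was found */
	REMOTE_OOM_RETURN_ENOMEM,  /* no victim: fail the allocation with ENOMEM */
	REMOTE_OOM_KILL_ALLOCATOR, /* no victim and no way to report ENOMEM */
};

/*
 * Pick the action for a remote OOM.
 * victim_found: the oom-killer found a killable task in the memcg under OOM.
 * in_pagefault: the charge happened in the pagefault path, where ENOMEM
 *               cannot be returned to the allocating process.
 */
enum remote_oom_action remote_oom_decide(bool victim_found, bool in_pagefault)
{
	if (victim_found)
		return REMOTE_OOM_KILL_VICTIM;
	if (in_pagefault)
		return REMOTE_OOM_KILL_ALLOCATOR;
	return REMOTE_OOM_RETURN_ENOMEM;
}
```

Killing the allocating process is the last resort in this sketch: it is chosen only when there is neither a victim in the memcg nor a way to report the failure.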
    
    Signed-off-by: Mina Almasry <almasrymina@google.com>
    
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: riel@surriel.com
    Cc: linux-mm@kvack.org
    Cc: linux-fsdevel@vger.kernel.org
    Cc: cgroups@vger.kernel.org
    Mina Almasry authored and intel-lab-lkp committed Nov 12, 2021
  4. mm/shmem: support deterministic charging of tmpfs

    Add memcg= option to shmem mount.
    
    Users can specify this option at mount time and all data page charges
    will be charged to the memcg supplied. Processes are only allowed to
    direct tmpfs charges to a cgroup that they themselves can enter and
    allocate memory in.
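
A minimal sketch of how a program might pass the option through mount(2). It assumes the option value is a path to the target cgroup directory, which is not stated in this commit message; the helper names are hypothetical. The mount call fails on kernels that do not carry this patch series, or when the caller lacks mount privileges.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Build the tmpfs option string "memcg=<cgroup path>".
 * ASSUMPTION: the option takes a cgroup directory path as its value.
 * Returns 0 on success, -1 on an empty path or a too-small buffer.
 */
int build_memcg_opt(char *buf, size_t len, const char *cgroup_path)
{
	int n;

	if (cgroup_path == NULL || strlen(cgroup_path) == 0)
		return -1;
	n = snprintf(buf, len, "memcg=%s", cgroup_path);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Mount a tmpfs at `target` whose data pages are charged to `cgroup_path`. */
int mount_tmpfs_charged_to(const char *target, const char *cgroup_path)
{
	char opts[512];

	if (build_memcg_opt(opts, sizeof(opts), cgroup_path) != 0)
		return -1;
	/* Returns -1 with errno set if the kernel rejects the option. */
	return mount("tmpfs", target, "tmpfs", 0, opts);
}
```

For example, `mount_tmpfs_charged_to("/mnt/tmpfs", "/sys/fs/cgroup/test")` would request a tmpfs charged to a cgroup at that (illustrative) path.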
    
    Signed-off-by: Mina Almasry <almasrymina@google.com>
    
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: riel@surriel.com
    Cc: linux-mm@kvack.org
    Cc: linux-fsdevel@vger.kernel.org
    Cc: cgroups@vger.kernel.org
    Mina Almasry authored and intel-lab-lkp committed Nov 12, 2021

Commits on Oct 28, 2021

  1. pci: test for unexpectedly disabled bridges

    The all-ones value is not just a "device didn't exist" case, it's also
    potentially a quite valid value, so not restoring it would be wrong.
    
    What *would* be interesting is to hear where the bad values came from in
    the first place.  It sounds like the device state is saved after the PCI
    bus controller in front of the device has been crapped on, resulting in the
    PCI config cycles never reaching the device at all.
    
    Something along this patch (together with suspend/resume debugging output)
    might help pinpoint it.  But it really sounds like something totally
    brokenly turned off the PCI bridge (some ACPI shutdown crud?  I wouldn't be
    entirely surprised)
    
    Cc: Greg KH <greg@kroah.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    torvalds authored and hnaz committed Oct 28, 2021
  2. kernel/fork.c: export kernel_thread() to modules

    mutex-subsystem-synchro-test-module.patch needs this
    
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    akpm00 authored and hnaz committed Oct 28, 2021
  3. mutex subsystem, synchro-test module

    The attached patch adds a module for testing and benchmarking mutexes,
    semaphores and R/W semaphores.
    
    Using it is simple:
    
    	insmod synchro-test.ko <args>
    
    It will exit with error ENOANO after running the tests and printing the
    results to the kernel console log.
    
    The available arguments are:
    
     (*) mx=N
    
    	Start up to N mutex thrashing threads, where N is at most 20. All will
    	try and thrash the same mutex.
    
     (*) sm=N
    
    	Start up to N counting semaphore thrashing threads, where N is at most
    	20. All will try and thrash the same semaphore.
    
     (*) ism=M
    
    	Initialise the counting semaphore with M, where M is any positive
    	integer. The default is 4.
    
     (*) rd=N
     (*) wr=O
     (*) dg=P
    
    	Start up to N reader thrashing threads, O writer thrashing threads and
    	P downgrader thrashing threads, where N, O and P are at most 20
    	apiece. All will try and thrash the same read/write semaphore.
    
     (*) elapse=N
    
    	Run the tests for N seconds. The default is 5.
    
     (*) load=N
    
    	Each thread delays for N uS whilst holding the lock. The default is 0.
    
     (*) interval=N
    
    	Each thread delays for N uS whilst not holding the lock. The default
    	is 0.
    
     (*) do_sched=1
    
    	Each thread will call schedule if required after each iteration.
    
     (*) v=1
    
    	Print more verbose information, including a thread iteration
    	distribution list.
    
    The module should be enabled by setting CONFIG_DEBUG_SYNCHRO_TEST to "m".
    
    [randy.dunlap@oracle.com: fix build errors, add <sched.h> header file]
    [akpm@linux-foundation.org: remove smp_lock.h inclusion]
    [viro@ZenIV.linux.org.uk: kill daemonize() calls]
    [rdunlap@xenotime.net: fix printk format warnings]
    [walken@google.com: add spinlock test]
    [walken@google.com: document default load and interval values]
    Signed-off-by: David Howells <dhowells@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Adrian Bunk <bunk@stusta.de>
    Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
    Signed-off-by: Michel Lespinasse <walken@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    dhowells authored and hnaz committed Oct 28, 2021
  4. Releasing resources with children

    What does it mean to release a resource with children?  Should the children
    become children of the released resource's parent?  Should they be released
    too?  Should we fail the release?
    
    I bet we have no callers who expect this right now, but with
    insert_resource() we may get some.  At the point where someone hits this
    BUG we can figure out what semantics we want.
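
One of the three options raised above, failing the release, can be sketched in userspace. The `struct res` type and both functions are simplified, hypothetical stand-ins for the kernel's resource tree. The sketch returns `-EBUSY` where the commit text suggests the kernel would hit a BUG.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for the parent/sibling/child links of a resource tree. */
struct res {
	struct res *parent, *sibling, *child;
};

/* Link `r` as the first child of `parent`. */
void res_add_child(struct res *parent, struct res *r)
{
	r->parent = parent;
	r->sibling = parent->child;
	parent->child = r;
}

/*
 * Unlink `r` from its parent.
 * Returns -EBUSY if `r` still has children, -EINVAL if `r` is not linked
 * into a tree, and 0 on success.
 */
int res_release(struct res *r)
{
	struct res **p;

	if (r->child)
		return -EBUSY;
	if (!r->parent)
		return -EINVAL;
	/* Walk the parent's child list and splice `r` out. */
	for (p = &r->parent->child; *p; p = &(*p)->sibling) {
		if (*p == r) {
			*p = r->sibling;
			r->parent = NULL;
			r->sibling = NULL;
			return 0;
		}
	}
	return -EINVAL;
}
```

With this policy a caller has to release children before their parent, which makes any code relying on implicit release of a whole subtree visible immediately.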
    
    Signed-off-by: Matthew Wilcox <willy@parisc-linux.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Matthew Wilcox authored and hnaz committed Oct 28, 2021
  5. Make sure nobody's leaking resources

    Currently, releasing a resource also releases all of its children.  That
    made sense when request_resource was the main method of dividing up the
    memory map.  With the increased use of insert_resource, it seems to me that
    we should instead reparent the newly orphaned resources.  Before we do
    that, let's make sure that nobody's actually relying on the current
    semantics.
    
    Signed-off-by: Matthew Wilcox <matthew@wil.cx>
    Cc: Greg KH <greg@kroah.com>
    Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
    Cc: Dominik Brodowski <linux@dominikbrodowski.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Matthew Wilcox authored and hnaz committed Oct 28, 2021
  6. kasan: add kasan mode messages when kasan init

    There are multiple kasan modes.  It makes sense that we add some messages
    to know which kasan mode is active when booting up.  see [1].
    
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=212195 [1]
    Link: https://lkml.kernel.org/r/20211020094850.4113-1-Kuan-Ying.Lee@mediatek.com
    Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Matthias Brugger <matthias.bgg@gmail.com>
    Cc: Chinwen Chang <chinwen.chang@mediatek.com>
    Cc: Yee Lee <yee.lee@mediatek.com>
    Cc: Nicholas Tang <nicholas.tang@mediatek.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Kuan-Ying Lee authored and hnaz committed Oct 28, 2021
  7. mm: unexport {,un}lock_page_memcg

    These are only used in built-in core mm code.
    
    Link: https://lkml.kernel.org/r/20210820095815.445392-3-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Christoph Hellwig authored and hnaz committed Oct 28, 2021
  8. mm: unexport folio_memcg_{,un}lock

    Patch series "unexport memcg locking helpers".
    
    Neither the old page-based nor the new folio-based memcg locking helpers
    are used in modular code at all, so drop the exports.
    
    
    This patch (of 2):
    
    folio_memcg_{,un}lock are only used in built-in core mm code.
    
    Link: https://lkml.kernel.org/r/20210820095815.445392-1-hch@lst.de
    Link: https://lkml.kernel.org/r/20210820095815.445392-2-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Christoph Hellwig authored and hnaz committed Oct 28, 2021
  9. mm/migrate.c: remove MIGRATE_PFN_LOCKED

    MIGRATE_PFN_LOCKED is used to indicate to migrate_vma_prepare() that a
    source page was already locked during migrate_vma_collect().  If it wasn't
    then a second attempt is made to lock the page.  However if the first
    attempt failed it's unlikely a second attempt will succeed, and the retry
    adds complexity.  So clean this up by removing the retry and
    MIGRATE_PFN_LOCKED flag.
    
    Destination pages are also meant to have the MIGRATE_PFN_LOCKED flag set,
    but nothing actually checks that.
    
    Link: https://lkml.kernel.org/r/20211025041608.289017-1-apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Alistair Popple authored and hnaz committed Oct 28, 2021
  10. mm: migrate: simplify the file-backed pages validation when migrating…

    … its mapping
    
    Patch series "Some cleanup for page migration", v3.
    
    This patchset does some cleanups and improvements for page migration.
    
    
    This patch (of 4):
    
    There is no need to validate the file-backed page's refcount before trying
    to freeze the page's expected refcount, instead we can rely on the
    folio_ref_freeze() to validate if the page has the expected refcount
    before migrating its mapping.
    
    Moreover we are always under the page lock when migrating the page
    mapping, which means nowhere else can remove it from the page cache, so we
    can remove the xas_load() validation under the i_pages lock.
    
    Link: https://lkml.kernel.org/r/cover.1629447552.git.baolin.wang@linux.alibaba.com
    Link: https://lkml.kernel.org/r/df4c129fd8e86a95dbc55f4663d77441cc0d3bd1.1629447552.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Baolin Wang authored and hnaz committed Oct 28, 2021
  11. mm: allow only SLUB on PREEMPT_RT

    Memory allocators may disable interrupts or preemption as part of the
    allocation and freeing process.  For PREEMPT_RT it is important that these
    sections remain deterministic and short and therefore don't depend on the
    size of the memory to allocate/ free or the inner state of the algorithm.
    
    Until v3.12-RT the SLAB allocator was an option but involved several
    changes to meet all the requirements.  The SLUB design fits better with
    PREEMPT_RT model and so the SLAB patches were dropped in the 3.12-RT
    patchset.  Comparing the two allocators, SLUB outperformed SLAB in both
    throughput (time needed to allocate and free memory) and the maximal
    latency of the system measured with cyclictest during hackbench.
    
    SLOB was never evaluated since it was unlikely that it performs better
    than SLAB.  During a quick test, the kernel crashed with SLOB enabled
    during boot.
    
    Disable SLAB and SLOB on PREEMPT_RT.
    
    [bigeasy@linutronix.de: commit description]
    Link: https://lkml.kernel.org/r/20211015210336.gen3tib33ig5q2md@linutronix.de
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Ingo Molnar authored and hnaz committed Oct 28, 2021
  12. lib/stackdepot: allow optional init and stack_table allocation by kvm…

    …alloc() - fixup3
    
    Due to cd06ab2 ("drm/locking: add backtrace for locking contended
    locks without backoff") landing recently to -next adding a new stack depot
    user in drivers/gpu/drm/drm_modeset_lock.c we need to add an appropriate
    call to stack_depot_init() there as well.
    
    Link: https://lkml.kernel.org/r/2a692365-cfa1-64f2-34e0-8aa5674dce5e@suse.cz
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jani Nikula <jani.nikula@intel.com>
    Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
    Cc: Marco Elver <elver@google.com>
    Cc: Vijayanand Jitta <vjitta@codeaurora.org>
    Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
    Cc: Maxime Ripard <mripard@kernel.org>
    Cc: Thomas Zimmermann <tzimmermann@suse.de>
    Cc: David Airlie <airlied@linux.ie>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Oliver Glitta <glittao@gmail.com>
    Cc: Imran Khan <imran.f.khan@oracle.com>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    tehcaster authored and hnaz committed Oct 28, 2021
  13. lib/stackdepot: allow optional init and stack_table allocation by kvm…

    …alloc() - fixup
    
    On FLATMEM, we call page_ext_init_flatmem_late() just before
    kmem_cache_init() which means stack_depot_init() (called by page owner
    init) will not properly recognize that it should use kvmalloc() and not
    memblock_alloc().  memblock_alloc() will also not issue a warning and
    return a block of memory that can be invalid and cause a kernel page fault
    when saving stacks, as reported by the kernel test robot [1].
    
    Fix this by moving page_ext_init_flatmem_late() below kmem_cache_init() so
    that slab_is_available() is true during stack_depot_init().  SPARSEMEM
    doesn't have this issue, as it doesn't do page_ext_init_flatmem_late(),
    but a different page_ext_init() even later in the boot process.
    
    Thanks to Mike Rapoport for pointing out the FLATMEM init ordering issue.
    
    While at it, also actually resolve a checkpatch warning in stack_depot_init()
    from DRM CI, which was supposed to be in the original patch already.
    
    [1] https://lore.kernel.org/all/20211014085450.GC18719@xsang-OptiPlex-9020/
    
    Link: https://lkml.kernel.org/r/6abd9213-19a9-6d58-cedc-2414386d2d81@suse.cz
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Reported-by: kernel test robot <oliver.sang@intel.com>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    tehcaster authored and hnaz committed Oct 28, 2021
  14. lib/stackdepot: fix spelling mistake and grammar in pr_err message

    There is a spelling mistake in the word "allocation", so fix this and
    re-phrase the message to make it easier to read.
    
    Link: https://lkml.kernel.org/r/20211015104159.11282-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King <colin.king@canonical.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Colin Ian King authored and hnaz committed Oct 28, 2021
  15. lib/stackdepot: allow optional init and stack_table allocation by kvm…

    …alloc()
    
    Currently, enabling CONFIG_STACKDEPOT means its stack_table will be
    allocated from memblock, even if stack depot ends up not actually used. 
    The default size of stack_table is 4MB on 32-bit, 8MB on 64-bit.
    
    This is fine for use-cases such as KASAN which is also a config option and
    has overhead on its own.  But it's an issue for functionality that has to
    be actually enabled on boot (page_owner) or depends on hardware (GPU
    drivers) and thus the memory might be wasted.  This was raised as an issue
    [1] when attempting to add stackdepot support for SLUB's debug object
    tracking functionality.  It's common to build kernels with
    CONFIG_SLUB_DEBUG and enable slub_debug on boot only when needed, or
    create only specific kmem caches with debugging for testing purposes.
    
    It would thus be more efficient if stackdepot's table was allocated only
    when actually going to be used.  This patch thus makes the allocation (and
    whole stack_depot_init() call) optional:
    
    - Add a CONFIG_STACKDEPOT_ALWAYS_INIT flag to keep using the current
      well-defined point of allocation as part of mem_init(). Make CONFIG_KASAN
      select this flag.
    - Other users have to call stack_depot_init() as part of their own init when
      it's determined that stack depot will actually be used. This may depend on
      both config and runtime conditions. Convert current users which are
      page_owner and several in the DRM subsystem. Same will be done for SLUB
      later.
    - Because the init might now be called after the boot-time memblock allocation
      has given all memory to the buddy allocator, change stack_depot_init() to
      allocate stack_table with kvmalloc() when memblock is no longer available.
      Also handle allocation failure by disabling stackdepot (could have
      theoretically happened even with memblock allocation previously), and don't
      unnecessarily align the memblock allocation to its own size anymore.
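
The init flow in the list above can be modelled in userspace. This is only a sketch: `early_alloc()` and `late_alloc()` are hypothetical stand-ins for memblock_alloc() and kvmalloc(), `late_allocator_up` models slab_is_available(), and the table size is illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TABLE_ENTRIES 1024      /* illustrative; the real table is far larger */

void **stack_table;             /* NULL until the first successful init */
bool depot_disabled;            /* set when the table could not be allocated */
bool late_allocator_up;         /* models slab_is_available() */
int early_allocs, late_allocs;  /* counters so callers can see which path ran */

/* Stand-ins for memblock_alloc() and kvmalloc(); both map to calloc() here. */
void *early_alloc(size_t n) { early_allocs++; return calloc(1, n); }
void *late_alloc(size_t n)  { late_allocs++;  return calloc(1, n); }

/* Idempotent init: returns 0 when the table is usable, -1 when disabled. */
int depot_init(void)
{
	size_t size = TABLE_ENTRIES * sizeof(void *);

	if (stack_table)
		return 0;               /* already initialised by another user */
	if (depot_disabled)
		return -1;              /* an earlier attempt failed */
	/* Choose the allocator that is valid at this point of "boot". */
	stack_table = late_allocator_up ? late_alloc(size) : early_alloc(size);
	if (!stack_table) {
		depot_disabled = true;  /* fail soft: disable instead of crashing */
		return -1;
	}
	return 0;
}
```

Because the function is idempotent, every user can call it from its own init path without coordinating with other users.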
    
    [1] https://lore.kernel.org/all/CAMuHMdW=eoVzM1Re5FVoEN87nKfiLmM2+Ah7eNu2KXEhCvbZyA@mail.gmail.com/
    
    Link: https://lkml.kernel.org/r/20211013073005.11351-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Dmitry Vyukov <dvyukov@google.com>
    Reviewed-by: Marco Elver <elver@google.com> # stackdepot
    Cc: Marco Elver <elver@google.com>
    Cc: Vijayanand Jitta <vjitta@codeaurora.org>
    Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
    Cc: Maxime Ripard <mripard@kernel.org>
    Cc: Thomas Zimmermann <tzimmermann@suse.de>
    Cc: David Airlie <airlied@linux.ie>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Oliver Glitta <glittao@gmail.com>
    Cc: Imran Khan <imran.f.khan@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    tehcaster authored and hnaz committed Oct 28, 2021
  16. mm-filemap-check-if-thp-has-hwpoisoned-subpage-for-pmd-page-fault-vs-…

    …folios
    
    fix
    mm-filemap-check-if-thp-has-hwpoisoned-subpage-for-pmd-page-fault.patch
    for folio tree PAGEFLAG_FALSE() change.
    
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    akpm00 authored and hnaz committed Oct 28, 2021
  17. restore-acct_reclaim_writeback-for-folio

    Make Mel's "mm/vmscan: throttle reclaim and compaction when too many pages
    are isolated" work for folio changes.
    
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    akpm00 authored and hnaz committed Oct 28, 2021
  18. linux-next: build failure after merge of the btrfs tree

    Hi all,
    
    [I am not sure why this error only popped up after I merged Andrew's
    patch set ...]
    
    After merging the btrfs tree, today's linux-next build (x86_64
    allmodconfig) failed like this:
    
    In file included from include/linux/string.h:253,
                     from include/linux/bitmap.h:11,
                     from include/linux/cpumask.h:12,
                     from arch/x86/include/asm/cpumask.h:5,
                     from arch/x86/include/asm/msr.h:11,
                     from arch/x86/include/asm/processor.h:22,
                     from arch/x86/include/asm/cpufeature.h:5,
                     from arch/x86/include/asm/thread_info.h:53,
                     from include/linux/thread_info.h:60,
                     from arch/x86/include/asm/preempt.h:7,
                     from include/linux/preempt.h:78,
                     from include/linux/spinlock.h:55,
                     from include/linux/wait.h:9,
                     from include/linux/mempool.h:8,
                     from include/linux/bio.h:8,
                     from fs/btrfs/ioctl.c:7:
    In function 'memcpy',
        inlined from '_btrfs_ioctl_send' at fs/btrfs/ioctl.c:4846:3:
    include/linux/fortify-string.h:219:4: error: call to '__write_overflow' declared with attribute error: detected write beyond size of object (1st parameter)
      219 |    __write_overflow();
          |    ^~~~~~~~~~~~~~~~~~
    
    Caused by commit
    
      c8d9cdf ("btrfs: send: prepare for v2 protocol")
    
    This changes the "reserved" field of struct btrfs_ioctl_send_args from 4
    u64s to 3, but the above memcpy is copying the "reserved" field from a
    struct btrfs_ioctl_send_args_32 (4 u64s) into it.
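
The size mismatch can be reproduced in miniature. The two structs below are hypothetical, trimmed-down layouts that keep only the "reserved" arrays. Bounding the copy by the smaller field is shown purely as an illustration of the problem, not as the fix that was applied to btrfs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layouts; only the "reserved" arrays matter for this sketch. */
struct send_args_native { uint64_t reserved[3]; };  /* after the change: 3 u64s */
struct send_args_compat { uint64_t reserved[4]; };  /* compat struct: still 4 u64s */

/*
 * Copy no more than the destination field can hold and return the number of
 * bytes copied. Using sizeof(src->reserved) as the length instead would write
 * 32 bytes into a 24-byte field, which is what the fortified memcpy rejects.
 */
size_t copy_reserved(struct send_args_native *dst,
		     const struct send_args_compat *src)
{
	size_t n = sizeof(dst->reserved) < sizeof(src->reserved)
		 ? sizeof(dst->reserved) : sizeof(src->reserved);

	memcpy(dst->reserved, src->reserved, n);
	return n;
}
```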
    
    All I could really do at this point was mark BTRFS_FS as BROKEN
    (TEST_KMOD selects BTRFS_FS):
    
    From: Stephen Rothwell <sfr@canb.auug.org.au>
    Date: Wed, 27 Oct 2021 20:53:24 +1100
    Subject: [PATCH] make btrfs as BROKEN due to an inconsistent API change
    
    Link: https://lkml.kernel.org/r/20211027210924.22ef5881@canb.auug.org.au
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: David Sterba <dsterba@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    sfrothwell authored and hnaz committed Oct 28, 2021
  19. linux-next-rejects-fix

    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    akpm00 authored and hnaz committed Oct 28, 2021
  20. linux-next-rejects

    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    akpm00 authored and hnaz committed Oct 28, 2021
  21. linux-next

    GIT fb79395
    
    
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    akpm00 authored and hnaz committed Oct 28, 2021
  22. revert-acct_reclaim_writeback-for-next

    Take away this change from Mel's "mm/vmscan: throttle reclaim and
    compaction when too many pages are isolated" so that linux-next.patch
    applies more easily.
    
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    akpm00 authored and hnaz committed Oct 28, 2021
  23. shm: extend forced shm destroy to support objects from several IPC nses

    Currently, the exit_shm function is not designed to work properly when
    task->sysvshm.shm_clist holds shm objects from different IPC namespaces.
    
    This is a real pain when sysctl kernel.shm_rmid_forced = 1, because it
    leads to use-after-free (reproducer exists).
    
    This patch attempts to fix the problem by extending the exit_shm
    mechanism to handle shm destruction from several IPC ns'es.
    
    To achieve that we do several things:
    
    1. add namespace (non-refcounted) pointer to the struct shmid_kernel
    
    2. during new shm object creation (newseg()/shmget syscall) we
       initialize this pointer with the current task's IPC ns
    
    3. exit_shm() is fully reworked so that it traverses all shp's in
       task->sysvshm.shm_clist and gets the IPC namespace not from the current
       task as before but from the shp object itself, then calls
       shm_destroy(shp, ns).
    
    Note.  We need to be really careful here, because as was said before
    (1), our pointer to the IPC ns is not refcounted.  To be on the safe side
    we use a special helper, get_ipc_ns_not_zero(), which takes an IPC ns
    reference only if the IPC ns is not in the "state of destruction".
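
The "take a reference only if the count is not already zero" pattern behind such a helper can be sketched with C11 atomics. This is a userspace model with hypothetical names (`struct ns_model`, `ns_get_not_zero()`, `ns_put()`), not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of a refcounted namespace; 0 means teardown has begun. */
struct ns_model {
	atomic_int count;
};

/* Take a reference only if the count is still non-zero. */
bool ns_get_not_zero(struct ns_model *ns)
{
	int old = atomic_load(&ns->count);

	while (old != 0) {
		/* On failure, `old` is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&ns->count, &old, old + 1))
			return true;
	}
	return false;   /* object is being destroyed; caller must not touch it */
}

/* Drop a reference taken with ns_get_not_zero(). */
void ns_put(struct ns_model *ns)
{
	atomic_fetch_sub(&ns->count, 1);
}
```

The compare-and-exchange loop matters here: a plain increment could bring a count back from zero after destruction has already started.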
    
    Q/A
    
    Q: Why can we access shp->ns memory using a non-refcounted pointer?
    A: Because the shp object lifetime is always shorter than the IPC
       namespace lifetime, so if we get the shp object from
       task->sysvshm.shm_clist while holding task_lock(task), nobody can
       steal our namespace.
    
    Q: Does this patch change the semantics of the unshare/setns/clone syscalls?
    A: No.  It just fixes a non-covered case where a process may leave an IPC
       namespace without getting its task->sysvshm.shm_clist list cleaned up.
    
    Link: https://lkml.kernel.org/r/20211027224348.611025-3-alexander.mikhalitsyn@virtuozzo.com
    Fixes: ab602f7 ("shm: make exit_shm work proportional to task activity")
    Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
    Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
    Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Greg KH <gregkh@linuxfoundation.org>
    Cc: Andrei Vagin <avagin@gmail.com>
    Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
    Cc: Vasily Averin <vvs@virtuozzo.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    mihalicyn authored and hnaz committed Oct 28, 2021
  24. ipc: WARN if trying to remove ipc object which is absent

    Patch series "shm: shm_rmid_forced feature fixes".
    
    Some time ago I hit a kernel crash after a CRIU restore procedure.
    Fortunately, it was a CRIU restore, so I had the dump files, could repeat
    the restore many times, and the crash reproduced easily.  After some
    investigation I constructed a minimal reproducer.  It turned out to be a
    use-after-free that happens only if sysctl kernel.shm_rmid_forced = 1.
    
    The key to the problem is that the exit_shm() function does not handle
    destruction of shp objects when task->sysvshm.shm_clist contains items
    from different IPC namespaces.  In most cases this list will contain only
    items from one IPC namespace.
    
    Why may this list contain objects from different namespaces?  The
    exit_shm() function was designed to always clean up this list when a
    process leaves an IPC namespace.  But we made a mistake a long time ago
    and did not add an exit_shm() call to the setns() syscall procedures.  The
    first idea was just to add this call to the setns() syscall, but that
    obviously changes the semantics of setns(), which is a userspace-visible
    change.  So, I gave up on this idea.
    
    The first real attempt to address the issue was just to omit the forced
    destroy if we meet a shp object that is not from the current task's IPC
    namespace [1].  But that was not the best idea, because
    task->sysvshm.shm_clist was protected by the rwsem which belongs to the
    current task's IPC namespace.  It means that list corruption may occur.
    
    The second approach is to extend exit_shm() to properly handle shp's from
    different IPC namespaces [2].  This is a really non-trivial thing; I put
    a lot of effort into it but did not believe that it was possible to make
    it fully safe, clean and clear.
    
    Thanks to the efforts of Manfred Spraul, a working and elegant solution
    was designed.  Thanks a lot, Manfred!
    
    Eric also suggested a way to address the issue in ("[RFC][PATCH] shm: In
    shm_exit destroy all created and never attached segments").  Eric's idea
    was to maintain a list of shm_clists, one per IPC namespace, using
    lock-less lists.  But there are some extra memory consumption-related
    concerns.
    
    An alternative solution which I suggested was implemented in ("shm:
    reset shm_clist on setns but omit forced shm destroy").  The idea is
    pretty simple: we add an exit_shm() call to setns() but DO NOT destroy shm
    segments even if sysctl kernel.shm_rmid_forced = 1; we just clean up the
    task->sysvshm.shm_clist list.  This changes the semantics of the setns()
    syscall a little bit, but in comparison to the "naive" solution where we
    just add exit_shm() without any special exclusions, this looks like a
    safer option.
    
    [1] https://lkml.org/lkml/2021/7/6/1108
    [2] https://lkml.org/lkml/2021/7/14/736
    
    
    This patch (of 2):
    
    Let's produce a warning if we are trying to remove a non-existent IPC
    object from the IPC namespace kht/idr structures.
    
    This allows us to catch possible bugs where the ipc_rmid() function is
    called with inconsistent struct ipc_ids*, struct kern_ipc_perm* arguments.
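    
    As a rough userspace illustration of the idea (not the kernel code), the
    sketch below models an ID table as a plain dict and warns when a removal
    targets an object that is not present.  The name remove_ipc_object() and
    the dict-based table are hypothetical stand-ins for ipc_rmid() and the
    kht/idr structures.

```python
import warnings


def remove_ipc_object(ids, ipc_id):
    """Remove ipc_id from the table; warn if it was never there.

    `ids` is a hypothetical dict standing in for the per-namespace
    ID table; the kernel uses kht/idr structures instead.
    """
    obj = ids.pop(ipc_id, None)
    if obj is None:
        # Removing an absent object hints at mismatched arguments,
        # e.g. an ID table taken from the wrong namespace.
        warnings.warn(f"IPC object {ipc_id} is not in this ID table")
    return obj
```

    A warning here does not fix anything by itself; it only makes the
    inconsistent call visible so that the caller can be tracked down.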
    
    Link: https://lkml.kernel.org/r/20211027224348.611025-1-alexander.mikhalitsyn@virtuozzo.com
    Link: https://lkml.kernel.org/r/20211027224348.611025-2-alexander.mikhalitsyn@virtuozzo.com
    Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
    Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
    Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Greg KH <gregkh@linuxfoundation.org>
    Cc: Andrei Vagin <avagin@gmail.com>
    Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
    Cc: Vasily Averin <vvs@virtuozzo.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    mihalicyn authored and hnaz committed Oct 28, 2021
  25. ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL

    Compilation of ipc/ipc_sysctl.c is controlled by
    obj-$(CONFIG_SYSVIPC_SYSCTL)
    [see ipc/Makefile]
    
    And CONFIG_SYSVIPC_SYSCTL depends on SYSCTL
    [see init/Kconfig]
    
    And SYSCTL is selected by PROC_SYSCTL.
    [see fs/proc/Kconfig]
    
    Thus: #ifndef CONFIG_PROC_SYSCTL in ipc/ipc_sysctl.c is impossible, so the
    fallback can be removed.
    
    Link: https://lkml.kernel.org/r/20210918145337.3369-1-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
    Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Acked-by: Davidlohr Bueso <dbueso@suse.de>
    Cc: Manfred Spraul <manfred@colorfullife.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    manfred-colorfu authored and hnaz committed Oct 28, 2021
  26. ipc-check-checkpoint_restore_ns_capable-to-modify-c-r-proc-files-fix

    ipc/ipc_sysctl.c needs capability.h for checkpoint_restore_ns_capable()
    
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Manfred Spraul <manfred@colorfullife.com>
    Cc: Michal Clapinski <mclapinski@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    akpm00 authored and hnaz committed Oct 28, 2021
  27. ipc: check checkpoint_restore_ns_capable() to modify C/R proc files

    This commit removes the requirement to be root to modify sem_next_id,
    msg_next_id and shm_next_id and checks checkpoint_restore_ns_capable
    instead.
    
    Since those files are specific to the IPC namespace, there is no reason
    they should require root privileges.  This is similar to ns_last_pid,
    which also only checks checkpoint_restore_ns_capable.
    
    Link: https://lkml.kernel.org/r/20210916163717.3179496-1-mclapinski@google.com
    Signed-off-by: Michal Clapinski <mclapinski@google.com>
    Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
    Reviewed-by: Manfred Spraul <manfred@colorfullife.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    mclapinski authored and hnaz committed Oct 28, 2021
  28. selftests/kselftest/runner/run_one(): Allow running non-executable files

    When running a test program, 'run_one()' checks whether the program has
    execute permission and fails if it doesn't.  However, it's easy to
    mistakenly lose the permission, as some common tools like 'diff' don't
    support permission changes well[1].  By comparison, mistakes in the test
    program's path should be rare, as those are explicitly listed in
    'TEST_PROGS'.  Therefore, it might make more sense to resolve the
    situation on our own and run the program.
    
    For this reason, this commit makes the test program runner function still
    print the warning message but, in this case, try to parse the program's
    interpreter and explicitly run the program with it.
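    
    The fallback described above can be sketched as follows.  This is a
    minimal Python illustration, not the shell code of the kselftest runner:
    the helper names interpreter_of() and run_one() are chosen here for
    illustration, and splitting the '#!' line with shlex is a simplifying
    assumption about how the interpreter line is parsed.

```python
import os
import shlex
import subprocess
import sys
import tempfile


def interpreter_of(path):
    """Return the command named on a script's '#!' line, or None."""
    with open(path, "rb") as f:
        first_line = f.readline().decode("utf-8", errors="replace").strip()
    if not first_line.startswith("#!"):
        return None
    return shlex.split(first_line[2:]) or None


def run_one(path):
    """Run a test program; fall back to its interpreter if not executable."""
    path = os.path.abspath(path)
    if os.access(path, os.X_OK):
        cmd = [path]
    else:
        # Still warn, as the commit message describes, then try the fallback.
        print(f"# Warning: file {path} is not executable", file=sys.stderr)
        interpreter = interpreter_of(path)
        if interpreter is None:
            return None  # no interpreter line, nothing sensible to run
        cmd = interpreter + [path]
    return subprocess.run(cmd, capture_output=True, text=True)
```

    A binary without a '#!' line cannot be rescued this way, which is why
    the sketch returns None in that case instead of guessing.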
    
    [1] https://lore.kernel.org/mm-commits/YRJisBs9AunccCD4@kroah.com/
    
    Link: https://lkml.kernel.org/r/20210810164534.25902-1-sj38.park@gmail.com
    Signed-off-by: SeongJae Park <sjpark@amazon.de>
    Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Shuah Khan <shuah@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    sj-aws authored and hnaz committed Oct 28, 2021
  29. virtio-mem: disallow mapping virtio-mem memory via /dev/mem

    We don't want user space to be able to map virtio-mem device memory
    directly (e.g., via /dev/mem) in order to have guarantees that in a sane
    setup we'll never accidentally access unplugged memory within the
    device-managed region of a virtio-mem device, just as required by the
    virtio-spec.
    
    As soon as the virtio-mem driver is loaded, the device region is visible
    in /proc/iomem via the parent device region.  From that point on user
    space is aware of the device region and we want to disallow mapping
    anything inside that region (where we will dynamically (un)plug memory)
    until the driver has been unloaded cleanly and e.g., another driver might
    take over.
    
    By creating our parent IORESOURCE_SYSTEM_RAM resource with
    IORESOURCE_EXCLUSIVE, we will disallow any /dev/mem access to our device
    region until the driver was unloaded cleanly and removed the parent
    region.  This will work even though only some memory blocks are actually
    currently added to Linux and appear as busy in the resource tree.
    
    So access to the region from user space is only possible
    a) if we don't load the virtio-mem driver.
    b) after unloading the virtio-mem driver cleanly.
    
    Don't build virtio-mem if access to /dev/mem cannot be restricted -- if
    we have CONFIG_DEVMEM=y but CONFIG_STRICT_DEVMEM is not set.
    
    Link: https://lkml.kernel.org/r/20210920142856.17758-4-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Acked-by: Michael S. Tsirkin <mst@redhat.com>
    Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Hanjun Guo <guohanjun@huawei.com>
    Cc: Jason Wang <jasowang@redhat.com>
    Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    davidhildenbrand authored and hnaz committed Oct 28, 2021
  30. kernel/resource: disallow access to exclusive system RAM regions

    virtio-mem dynamically exposes memory inside a device memory region as
    system RAM to Linux, coordinating with the hypervisor which parts are
    actually "plugged" and consequently usable/accessible.  On the one hand,
    the virtio-mem driver adds/removes whole memory blocks, creating/removing
    busy IORESOURCE_SYSTEM_RAM resources; on the other hand, it logically
    (un)plugs memory inside added memory blocks, dynamically either exposing
    them to the buddy or hiding them from the buddy and marking them
    PG_offline.
    
    In contrast to physical devices, like a DIMM, the virtio-mem driver is
    required to actually make use of any of the device-provided memory,
    because it performs the handshake with the hypervisor.  virtio-mem memory
    cannot simply be accessed via /dev/mem without a driver.
    
    There is no safe way to:
    a) Access plugged memory blocks via /dev/mem, as they might contain
       unplugged holes or might get silently unplugged by the virtio-mem
       driver and consequently turned inaccessible.
    b) Access unplugged memory blocks via /dev/mem because the virtio-mem
       driver is required to make them actually accessible first.
    
    The virtio-spec states that unplugged memory blocks MUST NOT be written,
    and only selected unplugged memory blocks MAY be read.  We want to make
    sure this is the case in sane environments -- where the virtio-mem driver
    was loaded.
    
    We want to make sure that in a sane environment, nobody "accidentally"
    accesses unplugged memory inside the device-managed region.  For example,
    a user might spot a memory region in /proc/iomem and try accessing it
    through /dev/mem using gdb or dumping it via something else.  By the time
    the mmap() happens, the memory might already have been removed by the
    virtio-mem driver silently: the mmap() would succeed and user space might
    accidentally access unplugged memory.
    
    So once the driver was loaded and detected the device along with the
    device-managed region, we just want to disallow any access via /dev/mem to
    it.
    
    In an ideal world, we would mark the whole region as busy ("owned by a
    driver") and exclude it; however, that would be wrong, as we don't really
    have actual system RAM at these ranges added to Linux ("busy system RAM").
    Instead, we want to mark such ranges as "not actual busy system RAM but
    still soft-reserved and prepared by a driver for future use."
    
    Let's teach iomem_is_exclusive() to reject access to any range with
    "IORESOURCE_SYSTEM_RAM | IORESOURCE_EXCLUSIVE", even if not busy and even
    if "iomem=relaxed" is set.  Introduce EXCLUSIVE_SYSTEM_RAM to make it
    easier for applicable drivers to depend on this setting in their Kconfig.
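    
    The rule above can be modelled as a small predicate.  This is a
    simplified sketch, not the kernel function: the flag values are
    placeholders rather than the real IORESOURCE_* bit values, the name
    rejects_devmem_access() is hypothetical, and the handling of all other
    ranges (rejected only when busy, exclusive and with strict checks on) is
    an assumption about the pre-existing behaviour.

```python
from enum import IntFlag


class Flag(IntFlag):
    # Placeholder bits for illustration; not the kernel's real values.
    SYSTEM_RAM = 1
    BUSY = 2
    EXCLUSIVE = 4


EXCLUSIVE_SYSTEM_RAM = Flag.SYSTEM_RAM | Flag.EXCLUSIVE


def rejects_devmem_access(flags, strict_checks=True):
    """Return True if a range with these flags must not be mapped."""
    # Exclusive system RAM is always rejected, even if the range is not
    # busy and even if strict checks are off ("iomem=relaxed").
    if (flags & EXCLUSIVE_SYSTEM_RAM) == EXCLUSIVE_SYSTEM_RAM:
        return True
    # Assumed behaviour for every other range.
    if not strict_checks:
        return False
    return bool(flags & Flag.BUSY) and bool(flags & Flag.EXCLUSIVE)
```

    The first branch is the new part: it does not consult the busy flag or
    the strictness setting at all.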
    
    For now, there are no applicable ranges and we'll modify virtio-mem next
    to properly set IORESOURCE_EXCLUSIVE on the parent resource container it
    creates to contain all actual busy system RAM added via
    add_memory_driver_managed().
    
    Link: https://lkml.kernel.org/r/20210920142856.17758-3-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Hanjun Guo <guohanjun@huawei.com>
    Cc: Jason Wang <jasowang@redhat.com>
    Cc: "Michael S. Tsirkin" <mst@redhat.com>
    Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    davidhildenbrand authored and hnaz committed Oct 28, 2021
  31. kernel/resource: clean up and optimize iomem_is_exclusive()

    Patch series "virtio-mem: disallow mapping virtio-mem memory via /dev/mem", v5.
    
    Let's add the basic infrastructure to exclude some physical memory regions
    marked as "IORESOURCE_SYSTEM_RAM" completely from /dev/mem access, even
    though they are not marked IORESOURCE_BUSY and even though "iomem=relaxed"
    is set.  Reuse IORESOURCE_EXCLUSIVE for that purpose instead of adding
    new flags to express something similar to "soft-busy" or "not busy yet,
    but already prepared by a driver and not to be mapped by user space".
    
    Use it for virtio-mem, to disallow mapping any virtio-mem memory via
    /dev/mem to user space after the virtio-mem driver was loaded.
    
    
    This patch (of 3):
    
    We end up traversing subtrees of ranges we are not interested in; let's
    optimize this case, skipping such subtrees, cleaning up the function a
    bit.
    
    For example, in the following configuration (/proc/iomem):
    
    00000000-00000fff : Reserved
    00001000-00057fff : System RAM
    00058000-00058fff : Reserved
    00059000-0009cfff : System RAM
    0009d000-000fffff : Reserved
       000a0000-000bffff : PCI Bus 0000:00
       000c0000-000c3fff : PCI Bus 0000:00
       000c4000-000c7fff : PCI Bus 0000:00
       000c8000-000cbfff : PCI Bus 0000:00
       000cc000-000cffff : PCI Bus 0000:00
       000d0000-000d3fff : PCI Bus 0000:00
       000d4000-000d7fff : PCI Bus 0000:00
       000d8000-000dbfff : PCI Bus 0000:00
       000dc000-000dffff : PCI Bus 0000:00
       000e0000-000e3fff : PCI Bus 0000:00
       000e4000-000e7fff : PCI Bus 0000:00
       000e8000-000ebfff : PCI Bus 0000:00
       000ec000-000effff : PCI Bus 0000:00
       000f0000-000fffff : PCI Bus 0000:00
         000f0000-000fffff : System ROM
    00100000-3fffffff : System RAM
    40000000-403fffff : Reserved
       40000000-403fffff : pnp 00:00
    40400000-80a79fff : System RAM
    ...
    
    We don't have to look at any children of "0009d000-000fffff : Reserved" if
    we can just skip these 15 items directly because the parent range is not
    of interest.
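    
    The effect of skipping subtrees can be illustrated with a small model of
    the resource tree.  This is a sketch under assumptions: the Resource
    class and find_overlapping() are hypothetical names, ranges are
    inclusive, and children are assumed to lie entirely within their parent,
    as in the /proc/iomem listing above.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    start: int
    end: int  # inclusive
    name: str
    children: list = field(default_factory=list)


def find_overlapping(resources, start, end):
    """Return (matches, examined) for the inclusive range [start, end].

    A resource that does not overlap the range is skipped together with
    all of its children, so they are never examined.
    """
    matches, examined = [], 0
    for res in resources:
        examined += 1
        if res.end < start or res.start > end:
            continue  # parent is not of interest: skip the whole subtree
        matches.append(res)
        sub_matches, sub_examined = find_overlapping(res.children, start, end)
        matches += sub_matches
        examined += sub_examined
    return matches, examined
```

    With a tree shaped like the example, a lookup that falls into the
    "System RAM" range after the "Reserved" range examines only the two
    top-level entries instead of every child of the reserved range.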
    
    Link: https://lkml.kernel.org/r/20210920142856.17758-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20210920142856.17758-2-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "Michael S. Tsirkin" <mst@redhat.com>
    Cc: Jason Wang <jasowang@redhat.com>
    Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
    Cc: Hanjun Guo <guohanjun@huawei.com>
    Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    davidhildenbrand authored and hnaz committed Oct 28, 2021