
Commits on Nov 12, 2021

  1. device-dax: compound devmap support

    Use the newly added compound devmap facility, which maps the assigned dax
    ranges as compound pages at a page size of @align.

    dax devices are created with a fixed @align (huge page size), which is
    also enforced at mmap() of the device. Faults consequently happen at the
    @align specified at creation time, and that does not change throughout
    the dax device's lifetime. MCEs unmap a whole dax huge page, and splits
    occur at the configured page size as well.
    
    Performance measured by gup_test improves considerably for
    unpin_user_pages() and altmap with NVDIMMs:
    
    $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
    (pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
    [altmap]
    (pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get: ~127ms put:~71ms
    
     $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
    (pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
    [altmap with -m 127004]
    (pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms
    
    .. as well as unpin_user_page_range_dirty_lock() being just as effective
    as THP/hugetlb[0] pages.
    
    [0] https://lore.kernel.org/linux-mm/20210212130843.13865-5-joao.m.martins@oracle.com/
    
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    jpemartins authored and intel-lab-lkp committed Nov 12, 2021
  2. device-dax: ensure dev_dax->pgmap is valid for dynamic devices

    Right now, only static dax regions have a valid @pgmap pointer in their
    struct dev_dax. Dynamic dax devices, however, do not.

    In preparation for device-dax compound devmap support, make sure that the
    dev_dax pgmap field is set after it has been allocated and initialized.

    For dynamic dax devices the @pgmap is allocated at probe() and managed by
    devm (in contrast to static dax regions, where a pgmap is provided and
    the dax core kfrees it). So, in addition to ensuring a valid @pgmap,
    clear the pgmap when the dynamic dax device is released, to avoid the
    same pgmap ranges being re-requested across multiple region device
    reconfigurations.
    
    Add a static_dev_dax() and use that helper in dev_dax_probe() to ensure
    the initialization differences between dynamic and static regions are
    more explicit. While at it, consolidate the ranges initialization when we
    allocate the @pgmap for the dynamic dax region case.
    
    Suggested-by: Dan Williams <dan.j.williams@intel.com>
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    jpemartins authored and intel-lab-lkp committed Nov 12, 2021
  3. device-dax: use struct_size()

    Use the struct_size() helper for the size of a struct with a flexible
    array member at the end, rather than calculating it manually.
    
    Suggested-by: Dan Williams <dan.j.williams@intel.com>
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    jpemartins authored and intel-lab-lkp committed Nov 12, 2021
  4. device-dax: use ALIGN() for determining pgoff

    Rather than calculating @pgoff manually, use ALIGN().
    
    Suggested-by: Dan Williams <dan.j.williams@intel.com>
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    jpemartins authored and intel-lab-lkp committed Nov 12, 2021
  5. mm/memremap: add ZONE_DEVICE support for compound pages

    Add a new @vmemmap_shift property for struct dev_pagemap which specifies that a
    devmap is composed of a set of compound pages of order @vmemmap_shift, instead of
    base pages. When a compound page devmap is requested, all but the first
    page are initialised as tail pages instead of order-0 pages.
    
    For certain ZONE_DEVICE users like device-dax which have a fixed page size,
    this creates an opportunity to optimize GUP and GUP-fast walkers, treating
    it the same way as THP or hugetlb pages.
    
    Additionally, commit 7118fc2 ("hugetlb: address ref count racing in
    prep_compound_gigantic_page") removed set_page_count() because setting
    the page ref count to zero was redundant. devmap pages, though, don't
    come from the page allocator, and only the head page refcount is used
    for compound pages; hence initialize the tail page refcount to zero.
    
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    jpemartins authored and intel-lab-lkp committed Nov 12, 2021
  6. mm/page_alloc: refactor memmap_init_zone_device() page init

    Move struct page init to a helper function __init_zone_device_page().
    
    This is in preparation for sharing the storage for compound page
    metadata.
    
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    jpemartins authored and intel-lab-lkp committed Nov 12, 2021
  7. mm/page_alloc: split prep_compound_page into head and tail subparts

    Split the utility function prep_compound_page() into head and tail
    counterparts, and use them accordingly.
    
    This is in preparation for sharing the storage for compound page
    metadata.
    
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    jpemartins authored and intel-lab-lkp committed Nov 12, 2021
  8. memory-failure: fetch compound_head after pgmap_pfn_valid()

    memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
    dax_lock_page()).  For a devmap with compound pages, fetch the
    compound_head in case a tail page memory failure is being handled.

    Currently this is a nop, but with the advent of compound pages in
    dev_pagemap it allows memory_failure_dev_pagemap() to keep working.
    
    Reported-by: Jane Chu <jane.chu@oracle.com>
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    jpemartins authored and intel-lab-lkp committed Nov 12, 2021

Commits on Oct 28, 2021

  1. pci: test for unexpectedly disabled bridges

    The all-ones value is not just a "device didn't exist" case, it's also
    potentially a quite valid value, so not restoring it would be wrong.
    
    What *would* be interesting is to hear where the bad values came from in
    the first place.  It sounds like the device state is saved after the PCI
    bus controller in front of the device has been crapped on, resulting in the
    PCI config cycles never reaching the device at all.
    
    Something along the lines of this patch (together with suspend/resume
    debugging output) might help pinpoint it.  But it really sounds like
    something totally brokenly turned off the PCI bridge (some ACPI shutdown
    crud?  I wouldn't be entirely surprised)
    
    Cc: Greg KH <greg@kroah.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    torvalds authored and hnaz committed Oct 28, 2021
  2. kernel/fork.c: export kernel_thread() to modules

    mutex-subsystem-synchro-test-module.patch needs this
    
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    akpm00 authored and hnaz committed Oct 28, 2021
  3. mutex subsystem, synchro-test module

    The attached patch adds a module for testing and benchmarking mutexes,
    semaphores and R/W semaphores.
    
    Using it is simple:
    
    	insmod synchro-test.ko <args>
    
    It will exit with error ENOANO after running the tests and printing the
    results to the kernel console log.
    
    The available arguments are:
    
     (*) mx=N
    
    	Start up to N mutex thrashing threads, where N is at most 20. All will
    	try and thrash the same mutex.
    
     (*) sm=N
    
    	Start up to N counting semaphore thrashing threads, where N is at most
    	20. All will try and thrash the same semaphore.
    
     (*) ism=M
    
    	Initialise the counting semaphore with M, where M is any positive
    	integer. The default is 4.
    
     (*) rd=N
     (*) wr=O
     (*) dg=P
    
    	Start up to N reader thrashing threads, O writer thrashing threads and
    	P downgrader thrashing threads, where N, O and P are at most 20
    	apiece. All will try and thrash the same read/write semaphore.
    
     (*) elapse=N
    
    	Run the tests for N seconds. The default is 5.
    
     (*) load=N
    
    	Each thread delays for N µs whilst holding the lock. The default is 0.
    
     (*) interval=N
    
    	Each thread delays for N µs whilst not holding the lock. The default
    	is 0.
    
     (*) do_sched=1
    
    	Each thread will call schedule if required after each iteration.
    
     (*) v=1
    
    	Print more verbose information, including a thread iteration
    	distribution list.
    
    The module is enabled by setting CONFIG_DEBUG_SYNCHRO_TEST to "m".
    
    [randy.dunlap@oracle.com: fix build errors, add <sched.h> header file]
    [akpm@linux-foundation.org: remove smp_lock.h inclusion]
    [viro@ZenIV.linux.org.uk: kill daemonize() calls]
    [rdunlap@xenotime.net: fix printk format warnings]
    [walken@google.com: add spinlock test]
    [walken@google.com: document default load and interval values]
    Signed-off-by: David Howells <dhowells@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Adrian Bunk <bunk@stusta.de>
    Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
    Signed-off-by: Michel Lespinasse <walken@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    dhowells authored and hnaz committed Oct 28, 2021
  4. Releasing resources with children

    What does it mean to release a resource with children?  Should the children
    become children of the released resource's parent?  Should they be released
    too?  Should we fail the release?
    
    I bet we have no callers who expect this right now, but with
    insert_resource() we may get some.  At the point where someone hits this
    BUG we can figure out what semantics we want.
    
    Signed-off-by: Matthew Wilcox <willy@parisc-linux.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Matthew Wilcox authored and hnaz committed Oct 28, 2021
  5. Make sure nobody's leaking resources

    Currently, releasing a resource also releases all of its children.  That
    made sense when request_resource was the main method of dividing up the
    memory map.  With the increased use of insert_resource, it seems to me that
    we should instead reparent the newly orphaned resources.  Before we do
    that, let's make sure that nobody's actually relying on the current
    semantics.
    
    Signed-off-by: Matthew Wilcox <matthew@wil.cx>
    Cc: Greg KH <greg@kroah.com>
    Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
    Cc: Dominik Brodowski <linux@dominikbrodowski.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Matthew Wilcox authored and hnaz committed Oct 28, 2021
  6. kasan: add kasan mode messages when kasan init

    There are multiple kasan modes.  It makes sense to print a message during
    boot indicating which kasan mode is active; see [1].
    
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=212195 [1]
    Link: https://lkml.kernel.org/r/20211020094850.4113-1-Kuan-Ying.Lee@mediatek.com
    Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Matthias Brugger <matthias.bgg@gmail.com>
    Cc: Chinwen Chang <chinwen.chang@mediatek.com>
    Cc: Yee Lee <yee.lee@mediatek.com>
    Cc: Nicholas Tang <nicholas.tang@mediatek.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Kuan-Ying Lee authored and hnaz committed Oct 28, 2021
  7. mm: unexport {,un}lock_page_memcg

    These are only used in built-in core mm code.
    
    Link: https://lkml.kernel.org/r/20210820095815.445392-3-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Christoph Hellwig authored and hnaz committed Oct 28, 2021
  8. mm: unexport folio_memcg_{,un}lock

    Patch series "unexport memcg locking helpers".
    
    Neither the old page-based nor the new folio-based memcg locking helpers
    are used in modular code at all, so drop the exports.
    
    
    This patch (of 2):
    
    folio_memcg_{,un}lock are only used in built-in core mm code.
    
    Link: https://lkml.kernel.org/r/20210820095815.445392-1-hch@lst.de
    Link: https://lkml.kernel.org/r/20210820095815.445392-2-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Christoph Hellwig authored and hnaz committed Oct 28, 2021
  9. mm/migrate.c: remove MIGRATE_PFN_LOCKED

    MIGRATE_PFN_LOCKED is used to indicate to migrate_vma_prepare() that a
    source page was already locked during migrate_vma_collect().  If it
    wasn't, then a second attempt is made to lock the page.  However, if the
    first attempt failed it's unlikely a second attempt will succeed, and
    the retry adds complexity.  So clean this up by removing the retry and
    the MIGRATE_PFN_LOCKED flag.
    
    Destination pages are also meant to have the MIGRATE_PFN_LOCKED flag set,
    but nothing actually checks that.
    
    Link: https://lkml.kernel.org/r/20211025041608.289017-1-apopple@nvidia.com
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Ben Skeggs <bskeggs@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Alistair Popple authored and hnaz committed Oct 28, 2021
  10. mm: migrate: simplify the file-backed pages validation when migrating its mapping

    
    Patch series "Some cleanup for page migration", v3.
    
    This patchset does some cleanups and improvements for page migration.
    
    
    This patch (of 4):
    
    There is no need to validate the file-backed page's refcount before trying
    to freeze the page's expected refcount, instead we can rely on the
    folio_ref_freeze() to validate if the page has the expected refcount
    before migrating its mapping.
    
    Moreover we are always under the page lock when migrating the page
    mapping, which means nowhere else can remove it from the page cache, so we
    can remove the xas_load() validation under the i_pages lock.
    
    Link: https://lkml.kernel.org/r/cover.1629447552.git.baolin.wang@linux.alibaba.com
    Link: https://lkml.kernel.org/r/df4c129fd8e86a95dbc55f4663d77441cc0d3bd1.1629447552.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Baolin Wang authored and hnaz committed Oct 28, 2021
  11. mm: allow only SLUB on PREEMPT_RT

    Memory allocators may disable interrupts or preemption as part of the
    allocation and freeing process.  For PREEMPT_RT it is important that
    these sections remain deterministic and short, and therefore don't
    depend on the size of the memory to allocate/free or the inner state of
    the algorithm.
    
    Until v3.12-RT the SLAB allocator was an option but involved several
    changes to meet all the requirements.  The SLUB design fits better with
    PREEMPT_RT model and so the SLAB patches were dropped in the 3.12-RT
    patchset.  Comparing the two allocators, SLUB outperformed SLAB in both
    throughput (time needed to allocate and free memory) and the maximal
    latency of the system measured with cyclictest during hackbench.

    SLOB was never evaluated since it was unlikely that it performs better
    than SLAB.  During a quick test, the kernel crashed with SLOB enabled
    during boot.
    
    Disable SLAB and SLOB on PREEMPT_RT.
    
    [bigeasy@linutronix.de: commit description]
    Link: https://lkml.kernel.org/r/20211015210336.gen3tib33ig5q2md@linutronix.de
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Ingo Molnar authored and hnaz committed Oct 28, 2021
  12. lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup3

    
    Due to cd06ab2 ("drm/locking: add backtrace for locking contended
    locks without backoff") recently landing in -next and adding a new stack
    depot user in drivers/gpu/drm/drm_modeset_lock.c, we need to add an
    appropriate call to stack_depot_init() there as well.
    
    Link: https://lkml.kernel.org/r/2a692365-cfa1-64f2-34e0-8aa5674dce5e@suse.cz
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Jani Nikula <jani.nikula@intel.com>
    Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
    Cc: Marco Elver <elver@google.com>
    Cc: Vijayanand Jitta <vjitta@codeaurora.org>
    Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
    Cc: Maxime Ripard <mripard@kernel.org>
    Cc: Thomas Zimmermann <tzimmermann@suse.de>
    Cc: David Airlie <airlied@linux.ie>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Oliver Glitta <glittao@gmail.com>
    Cc: Imran Khan <imran.f.khan@oracle.com>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    tehcaster authored and hnaz committed Oct 28, 2021
  13. lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup

    
    On FLATMEM, we call page_ext_init_flatmem_late() just before
    kmem_cache_init(), which means stack_depot_init() (called by page owner
    init) will not properly recognize that it should use kvmalloc() rather
    than memblock_alloc().  memblock_alloc() will also not issue a warning,
    and will return a block of memory that can be invalid and cause a kernel
    page fault when saving stacks, as reported by the kernel test robot [1].
    
    Fix this by moving page_ext_init_flatmem_late() below kmem_cache_init() so
    that slab_is_available() is true during stack_depot_init().  SPARSEMEM
    doesn't have this issue, as it doesn't do page_ext_init_flatmem_late(),
    but a different page_ext_init() even later in the boot process.
    
    Thanks to Mike Rapoport for pointing out the FLATMEM init ordering issue.
    
    While at it, also actually resolve a checkpatch warning in stack_depot_init()
    from DRM CI, which was supposed to be in the original patch already.
    
    [1] https://lore.kernel.org/all/20211014085450.GC18719@xsang-OptiPlex-9020/
    
    Link: https://lkml.kernel.org/r/6abd9213-19a9-6d58-cedc-2414386d2d81@suse.cz
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Reported-by: kernel test robot <oliver.sang@intel.com>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    tehcaster authored and hnaz committed Oct 28, 2021
  14. lib/stackdepot: fix spelling mistake and grammar in pr_err message

    There is a spelling mistake in the word "allocation", so fix this and
    re-phrase the message to make it easier to read.
    
    Link: https://lkml.kernel.org/r/20211015104159.11282-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King <colin.king@canonical.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Colin Ian King authored and hnaz committed Oct 28, 2021
  15. lib/stackdepot: allow optional init and stack_table allocation by kvmalloc()

    
    Currently, enabling CONFIG_STACKDEPOT means its stack_table will be
    allocated from memblock, even if stack depot ends up not actually being
    used.  The default size of stack_table is 4MB on 32-bit, 8MB on 64-bit.
    
    This is fine for use-cases such as KASAN which is also a config option and
    has overhead on its own.  But it's an issue for functionality that has to
    be actually enabled on boot (page_owner) or depends on hardware (GPU
    drivers) and thus the memory might be wasted.  This was raised as an issue
    [1] when attempting to add stackdepot support for SLUB's debug object
    tracking functionality.  It's common to build kernels with
    CONFIG_SLUB_DEBUG and enable slub_debug on boot only when needed, or
    create only specific kmem caches with debugging for testing purposes.
    
    It would thus be more efficient if stackdepot's table was allocated only
    when actually going to be used.  This patch thus makes the allocation (and
    whole stack_depot_init() call) optional:
    
    - Add a CONFIG_STACKDEPOT_ALWAYS_INIT flag to keep using the current
      well-defined point of allocation as part of mem_init(). Make CONFIG_KASAN
      select this flag.
    - Other users have to call stack_depot_init() as part of their own init when
      it's determined that stack depot will actually be used. This may depend on
      both config and runtime conditions. Convert current users which are
      page_owner and several in the DRM subsystem. Same will be done for SLUB
      later.
    - Because the init might now be called after the boot-time memblock allocation
      has given all memory to the buddy allocator, change stack_depot_init() to
      allocate stack_table with kvmalloc() when memblock is no longer available.
      Also handle allocation failure by disabling stackdepot (could have
      theoretically happened even with memblock allocation previously), and don't
      unnecessarily align the memblock allocation to its own size anymore.
    
    [1] https://lore.kernel.org/all/CAMuHMdW=eoVzM1Re5FVoEN87nKfiLmM2+Ah7eNu2KXEhCvbZyA@mail.gmail.com/
    
    Link: https://lkml.kernel.org/r/20211013073005.11351-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Dmitry Vyukov <dvyukov@google.com>
    Reviewed-by: Marco Elver <elver@google.com> # stackdepot
    Cc: Marco Elver <elver@google.com>
    Cc: Vijayanand Jitta <vjitta@codeaurora.org>
    Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
    Cc: Maxime Ripard <mripard@kernel.org>
    Cc: Thomas Zimmermann <tzimmermann@suse.de>
    Cc: David Airlie <airlied@linux.ie>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Konovalov <andreyknvl@gmail.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Oliver Glitta <glittao@gmail.com>
    Cc: Imran Khan <imran.f.khan@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    tehcaster authored and hnaz committed Oct 28, 2021
  16. mm-filemap-check-if-thp-has-hwpoisoned-subpage-for-pmd-page-fault-vs-folios

    Fix mm-filemap-check-if-thp-has-hwpoisoned-subpage-for-pmd-page-fault.patch
    for the folio tree PAGEFLAG_FALSE() change.
    
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    akpm00 authored and hnaz committed Oct 28, 2021
  17. restore-acct_reclaim_writeback-for-folio

    Make Mel's "mm/vmscan: throttle reclaim and compaction when too may pages
    are isolated" work for folio changes.
    
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    akpm00 authored and hnaz committed Oct 28, 2021
  18. linux-next: build failure after merge of the btrfs tree

    
    Hi all,
    
    [I am not sure why this error only popped up after I merged Andrew's
    patch set ...]
    
    After merging the btrfs tree, today's linux-next build (x86_64
    allmodconfig) failed like this:
    
    In file included from include/linux/string.h:253,
                     from include/linux/bitmap.h:11,
                     from include/linux/cpumask.h:12,
                     from arch/x86/include/asm/cpumask.h:5,
                     from arch/x86/include/asm/msr.h:11,
                     from arch/x86/include/asm/processor.h:22,
                     from arch/x86/include/asm/cpufeature.h:5,
                     from arch/x86/include/asm/thread_info.h:53,
                     from include/linux/thread_info.h:60,
                     from arch/x86/include/asm/preempt.h:7,
                     from include/linux/preempt.h:78,
                     from include/linux/spinlock.h:55,
                     from include/linux/wait.h:9,
                     from include/linux/mempool.h:8,
                     from include/linux/bio.h:8,
                     from fs/btrfs/ioctl.c:7:
    In function 'memcpy',
        inlined from '_btrfs_ioctl_send' at fs/btrfs/ioctl.c:4846:3:
    include/linux/fortify-string.h:219:4: error: call to '__write_overflow'
    declared with attribute error: detected write beyond size of object
    (1st parameter)
      219 |    __write_overflow();
          |    ^~~~~~~~~~~~~~~~~~
    
    Caused by commit
    
      c8d9cdf ("btrfs: send: prepare for v2 protocol")
    
    This changes the "reserved" field of struct btrfs_ioctl_send_args from
    4 u64s to 3, but the above memcpy is copying the "reserved" field from a
    struct btrfs_ioctl_send_args_32 (4 u64s) into it.
    
    All I could really do at this point was mark BTRFS_FS as BROKEN
    (TEST_KMOD selects BTRFS_FS):
    
    From: Stephen Rothwell <sfr@canb.auug.org.au>
    Date: Wed, 27 Oct 2021 20:53:24 +1100
    Subject: [PATCH] make btrfs as BROKEN due to an inconsistent API change
    
    Link: https://lkml.kernel.org/r/20211027210924.22ef5881@canb.auug.org.au
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: David Sterba <dsterba@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    sfrothwell authored and hnaz committed Oct 28, 2021
  19. linux-next-rejects-fix

    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    akpm00 authored and hnaz committed Oct 28, 2021
  20. linux-next-rejects

    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    akpm00 authored and hnaz committed Oct 28, 2021
  21. linux-next

    GIT fb79395
    
    
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    akpm00 authored and hnaz committed Oct 28, 2021
  22. revert-acct_reclaim_writeback-for-next

    Take away this change from Mel's "mm/vmscan: throttle reclaim and
    compaction when too may pages are isolated" so that linux-next.patch
    applies more easily.
    
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    akpm00 authored and hnaz committed Oct 28, 2021
  23. shm: extend forced shm destroy to support objects from several IPC nses

    Currently, the exit_shm() function is not designed to work properly
    when task->sysvshm.shm_clist holds shm objects from different IPC
    namespaces.

    This is a real pain when sysctl kernel.shm_rmid_forced = 1, because it
    leads to use-after-free (a reproducer exists).

    This patch attempts to fix the problem by extending the exit_shm
    mechanism to handle destruction of shm objects from several IPC
    namespaces.
    
    To achieve that we do several things:
    
    1. add namespace (non-refcounted) pointer to the struct shmid_kernel
    
    2. during new shm object creation (newseg()/shmget syscall) we
       initialize this pointer by current task IPC ns
    
    3. exit_shm() fully reworked such that it traverses over all shp's in
       task->sysvshm.shm_clist and gets IPC namespace not from current task as
       it was before but from shp's object itself, then call shm_destroy(shp,
       ns).
    
    Note: we need to be really careful here because, as was said in (1),
    our pointer to the IPC ns is non-refcounted.  To be on the safe side we
    use the special helper get_ipc_ns_not_zero(), which takes an IPC ns
    reference only if the IPC ns is not already in the "state of
    destruction".

    Q/A

    Q: Why can we access shp->ns memory using a non-refcounted pointer?
    A: Because the shp object's lifetime is always shorter than the IPC
       namespace's lifetime, so if we get the shp object from
       task->sysvshm.shm_clist while holding task_lock(task), nobody can
       steal our namespace.

    Q: Does this patch change the semantics of the unshare/setns/clone
       syscalls?
    A: No.  It just fixes the previously uncovered case where a process may
       leave an IPC namespace without getting its task->sysvshm.shm_clist
       list cleaned up.
    
    Link: https://lkml.kernel.org/r/20211027224348.611025-3-alexander.mikhalitsyn@virtuozzo.com
    Fixes: ab602f7 ("shm: make exit_shm work proportional to task activity")
    Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
    Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
    Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Greg KH <gregkh@linuxfoundation.org>
    Cc: Andrei Vagin <avagin@gmail.com>
    Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
    Cc: Vasily Averin <vvs@virtuozzo.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    mihalicyn authored and hnaz committed Oct 28, 2021
  24. ipc: WARN if trying to remove ipc object which is absent

    Patch series "shm: shm_rmid_forced feature fixes".
    
     Some time ago I hit a kernel crash after a CRIU restore procedure;
     fortunately, since it was a CRIU restore, I had the dump files, could
     run the restore many times, and the crash reproduced easily.  After
     some investigation I constructed a minimal reproducer.  It turned out
     to be a use-after-free that happens only if sysctl
     kernel.shm_rmid_forced = 1.
    
     The crux of the problem is that the exit_shm() function does not
     handle shp object destruction when task->sysvshm.shm_clist contains
     items from different IPC namespaces.  In most cases this list will
     contain only items from one IPC namespace.
    
     Why may this list contain objects from different namespaces?  The
     exit_shm() function was designed to clean up this list whenever a
     process leaves its IPC namespace.  But we made a mistake a long time
     ago: the exit_shm() call was never added to the setns() syscall path.
     The first idea was simply to add this call to setns(), but that
     obviously changes the semantics of the setns() syscall and is a
     userspace-visible change.  So I gave up on that idea.
    
     The first real attempt to address the issue was simply to skip the
     forced destroy when we encounter an shp object that does not belong to
     the current task's IPC namespace [1].  But that was not the best idea,
     because task->sysvshm.shm_clist is protected by an rwsem that belongs
     to the current task's IPC namespace, which means list corruption may
     occur.
    
     The second approach was to extend exit_shm() to properly handle shp
     objects from different IPC namespaces [2].  This is a really
     non-trivial thing; I put a lot of effort into it, but did not believe
     it was possible to make it fully safe, clean and clear.
    
     Thanks to the efforts of Manfred Spraul, an elegant solution was
     designed.  Thanks a lot, Manfred!
    
     Eric also suggested a way to address the issue in ("[RFC][PATCH] shm:
     In shm_exit destroy all created and never attached segments").  Eric's
     idea was to maintain one list of shm_clists per IPC namespace, using
     lock-less lists.  But there were some concerns about the extra memory
     consumption.
    
     The alternative solution I suggested was implemented in ("shm: reset
     shm_clist on setns but omit forced shm destroy").  The idea is pretty
     simple: we add an exit_shm() call to setns(), but do NOT destroy shm
     segments even if sysctl kernel.shm_rmid_forced = 1; we just clean up
     the task->sysvshm.shm_clist list.  This changes the semantics of the
     setns() syscall a little, but compared to the "naive" solution of
     adding exit_shm() without any special exclusions, it looks like the
     safer option.
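     The chosen behavior can be modeled in plain userspace C.  The task
     and segment layout below is invented for illustration (the real
     kernel uses list_head linkage inside struct shmid_kernel):

```c
#include <stddef.h>

/* Illustrative model: on setns() the task forgets its shm_clist entries,
 * but the segments themselves are NOT destroyed, even when
 * kernel.shm_rmid_forced = 1 -- they stay in their IPC namespace. */
struct seg_model {
	int destroyed;
	struct seg_model *next;	/* task-side clist linkage */
};

struct task_model {
	struct seg_model *shm_clist;
};

static void setns_clear_clist(struct task_model *tsk)
{
	struct seg_model *seg = tsk->shm_clist;

	while (seg) {
		struct seg_model *next = seg->next;

		seg->next = NULL;	/* unlink from the task... */
		/* ...but deliberately no shm_destroy() here */
		seg = next;
	}
	tsk->shm_clist = NULL;
}

/* Two segments on the list: after setns the list is empty, yet both
 * segments still exist, undestroyed. */
static int demo(void)
{
	struct seg_model b = { 0, NULL };
	struct seg_model a = { 0, &b };
	struct task_model tsk = { &a };

	setns_clear_clist(&tsk);
	return tsk.shm_clist == NULL && !a.destroyed && !b.destroyed;
}
```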
    
    [1] https://lkml.org/lkml/2021/7/6/1108
    [2] https://lkml.org/lkml/2021/7/14/736
    
    
    This patch (of 2):
    
     Let's produce a warning if we try to remove a non-existing IPC object
     from the IPC namespace kht/idr structures.
     
     This allows us to catch possible bugs where ipc_rmid() is called with
     inconsistent struct ipc_ids * / struct kern_ipc_perm * arguments.
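     The idea can be sketched with a toy id table in userspace C.  The
     table, helper names and -1 error value are illustrative; the kernel
     operates on its idr and flags the inconsistency with WARN():

```c
#include <stdio.h>
#include <stddef.h>

#define NIDS 8
static void *ids[NIDS];	/* toy stand-in for the namespace's idr */

/* Remove obj at slot id; warn and fail if it is not actually there,
 * mirroring the patch's WARN() on inconsistent ipc_rmid() arguments. */
static int model_ipc_rmid(int id, void *obj)
{
	if (id < 0 || id >= NIDS || ids[id] != obj) {
		fprintf(stderr, "WARN: removing absent ipc object id=%d\n", id);
		return -1;
	}
	ids[id] = NULL;
	return 0;
}

static int demo(void)
{
	static int obj;

	ids[3] = &obj;
	return model_ipc_rmid(3, &obj) == 0	/* present: removed */
	    && model_ipc_rmid(3, &obj) == -1;	/* already gone: warns */
}
```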
    
    Link: https://lkml.kernel.org/r/20211027224348.611025-1-alexander.mikhalitsyn@virtuozzo.com
    Link: https://lkml.kernel.org/r/20211027224348.611025-2-alexander.mikhalitsyn@virtuozzo.com
    Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
    Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
    Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Greg KH <gregkh@linuxfoundation.org>
    Cc: Andrei Vagin <avagin@gmail.com>
    Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
    Cc: Vasily Averin <vvs@virtuozzo.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    mihalicyn authored and hnaz committed Oct 28, 2021
  25. ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL

    Compilation of ipc/ipc_sysctl.c is controlled by
    obj-$(CONFIG_SYSVIPC_SYSCTL)
    [see ipc/Makefile]
    
    And CONFIG_SYSVIPC_SYSCTL depends on SYSCTL
    [see init/Kconfig]
    
     And SYSCTL is selected by PROC_SYSCTL.
    [see fs/proc/Kconfig]
    
     Thus #ifndef CONFIG_PROC_SYSCTL in ipc/ipc_sysctl.c is impossible, and
     the fallback can be removed.
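     The dependency chain above, paraphrased as Kconfig/Makefile fragments
     (simplified, not the verbatim upstream entries):

```
# ipc/Makefile
obj-$(CONFIG_SYSVIPC_SYSCTL) += ipc_sysctl.o

# init/Kconfig (simplified)
config SYSVIPC_SYSCTL
	bool
	depends on SYSCTL
	default y

# fs/proc/Kconfig (simplified)
config PROC_SYSCTL
	bool "Sysctl support"
	select SYSCTL
```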
    
    Link: https://lkml.kernel.org/r/20210918145337.3369-1-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
    Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Acked-by: Davidlohr Bueso <dbueso@suse.de>
    Cc: Manfred Spraul <manfred@colorfullife.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    manfred-colorfu authored and hnaz committed Oct 28, 2021
  26. ipc-check-checkpoint_restore_ns_capable-to-modify-c-r-proc-files-fix

    ipc/ipc_sysctl.c needs capability.h for checkpoint_restore_ns_capable()
    
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Manfred Spraul <manfred@colorfullife.com>
    Cc: Michal Clapinski <mclapinski@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    akpm00 authored and hnaz committed Oct 28, 2021
  27. ipc: check checkpoint_restore_ns_capable() to modify C/R proc files

     This commit removes the requirement to be root in order to modify
     sem_next_id, msg_next_id and shm_next_id, and checks
     checkpoint_restore_ns_capable() instead.
    
     Since those files are specific to the IPC namespace, there is no reason
     they should require root privileges.  This is similar to ns_last_pid,
     which also only checks checkpoint_restore_ns_capable().
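     The permission change can be modeled in userspace C.  The bitmask,
     helper names and error value below are illustrative stand-ins, not
     the kernel's cred machinery:

```c
#include <stdbool.h>

#define MODEL_CAP_CHECKPOINT_RESTORE (1u << 0)
#define MODEL_EPERM 1

/* Illustrative credentials: just a capability bitmask. */
struct cred_model {
	unsigned int caps;
};

static bool model_cr_ns_capable(const struct cred_model *cred)
{
	return cred->caps & MODEL_CAP_CHECKPOINT_RESTORE;
}

/* New rule for writing *_next_id: the dedicated capability suffices;
 * full root is no longer required. */
static int model_write_next_id(const struct cred_model *cred)
{
	if (!model_cr_ns_capable(cred))
		return -MODEL_EPERM;
	return 0;	/* write permitted */
}

static int demo(void)
{
	struct cred_model criu = { MODEL_CAP_CHECKPOINT_RESTORE };
	struct cred_model plain = { 0 };

	return model_write_next_id(&criu) == 0
	    && model_write_next_id(&plain) == -MODEL_EPERM;
}
```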
    
    Link: https://lkml.kernel.org/r/20210916163717.3179496-1-mclapinski@google.com
    Signed-off-by: Michal Clapinski <mclapinski@google.com>
    Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
    Reviewed-by: Manfred Spraul <manfred@colorfullife.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    mclapinski authored and hnaz committed Oct 28, 2021