Muchun-Song/mm…
Commits on Jan 21, 2022
-
mm: remove range parameter from follow_invalidate_pte()
The only user (DAX) of range parameter of follow_invalidate_pte() is gone, it safe to remove the range paramter and make it static to simlify the code. Signed-off-by: Muchun Song <songmuchun@bytedance.com>
-
dax: fix missing writeprotect the pte entry
Currently dax_mapping_entry_mkclean() fails to clean and write protect the pte entry within a DAX PMD entry during an *sync operation. This can result in data loss in the following sequence: 1) process A mmap write to DAX PMD, dirtying PMD radix tree entry and making the pmd entry dirty and writeable. 2) process B mmap with the @offset (e.g. 4K) and @Length (e.g. 4K) write to the same file, dirtying PMD radix tree entry (already done in 1)) and making the pte entry dirty and writeable. 3) fsync, flushing out PMD data and cleaning the radix tree entry. We currently fail to mark the pte entry as clean and write protected since the vma of process B is not covered in dax_entry_mkclean(). 4) process B writes to the pte. These don't cause any page faults since the pte entry is dirty and writeable. The radix tree entry remains clean. 5) fsync, which fails to flush the dirty PMD data because the radix tree entry was clean. 6) crash - dirty data that should have been fsync'd as part of 5) could still have been in the processor cache, and is lost. Reuse some infrastructure of page_mkclean_one() to let DAX can handle similar case to fix this issue. Fixes: 4b4bb46 ("dax: clear dirty entry tags on cache flush") Signed-off-by: Muchun Song <songmuchun@bytedance.com> -
mm: page_vma_mapped: support checking if a pfn is mapped into a vma
page_vma_mapped_walk() is supposed to check if a page is mapped into a vma. However, not all page frames (e.g. PFN_DEV) have a associated struct page with it. There is going to be some duplicate codes similar with this function if someone want to check if a pfn (without a struct page) is mapped into a vma. So add support for checking if a pfn is mapped into a vma. In the next patch, the dax will use this new feature. Signed-off-by: Muchun Song <songmuchun@bytedance.com>
-
dax: fix cache flush on PMD-mapped pages
The flush_cache_page() only remove a PAGE_SIZE sized range from the cache. However, it does not cover the full pages in a THP except a head page. Replace it with flush_cache_range() to fix this issue. Fixes: f729c8c ("dax: wrprotect pmd_t in dax_mapping_entry_mkclean") Signed-off-by: Muchun Song <songmuchun@bytedance.com>
-
mm: rmap: fix cache flush on THP pages
The flush_cache_page() only remove a PAGE_SIZE sized range from the cache. However, it does not cover the full pages in a THP except a head page. Replace it with flush_cache_range() to fix this issue. At least, no problems were found due to this. Maybe because the architectures that have virtual indexed caches is less. Fixes: f27176c ("mm: convert page_mkclean_one() to use page_vma_mapped_walk()") Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Commits on Oct 28, 2021
-
pci: test for unexpectedly disabled bridges
The all-ones value is not just a "device didn't exist" case, it's also potentially a quite valid value, so not restoring it would be wrong. What *would* be interesting is to hear where the bad values came from in the first place. It sounds like the device state is saved after the PCI bus controller in front of the device has been crapped on, resulting in the PCI config cycles never reaching the device at all. Something along this patch (together with suspend/resume debugging output) migth help pinpoint it. But it really sounds like something totally brokenly turned off the PCI bridge (some ACPI shutdown crud? I wouldn't be entirely surprised) Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
kernel/fork.c: export kernel_thread() to modules
mutex-subsystem-synchro-test-module.patch needs this Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
mutex subsystem, synchro-test module
The attached patch adds a module for testing and benchmarking mutexes, semaphores and R/W semaphores. Using it is simple: insmod synchro-test.ko <args> It will exit with error ENOANO after running the tests and printing the results to the kernel console log. The available arguments are: (*) mx=N Start up to N mutex thrashing threads, where N is at most 20. All will try and thrash the same mutex. (*) sm=N Start up to N counting semaphore thrashing threads, where N is at most 20. All will try and thrash the same semaphore. (*) ism=M Initialise the counting semaphore with M, where M is any positive integer greater than zero. The default is 4. (*) rd=N (*) wr=O (*) dg=P Start up to N reader thrashing threads, O writer thrashing threads and P downgrader thrashing threads, where N, O and P are at most 20 apiece. All will try and thrash the same read/write semaphore. (*) elapse=N Run the tests for N seconds. The default is 5. (*) load=N Each thread delays for N uS whilst holding the lock. The dfault is 0. (*) interval=N Each thread delays for N uS whilst not holding the lock. The default is 0. (*) do_sched=1 Each thread will call schedule if required after each iteration. (*) v=1 Print more verbose information, including a thread iteration distribution list. The module should be enabled by turning on CONFIG_DEBUG_SYNCHRO_TEST to "m". [randy.dunlap@oracle.com: fix build errors, add <sched.h> header file] [akpm@linux-foundation.org: remove smp_lock.h inclusion] [viro@ZenIV.linux.org.uk: kill daemonize() calls] [rdunlap@xenotime.net: fix printk format warrnings] [walken@google.com: add spinlock test] [walken@google.com: document default load and interval values] Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Releasing resources with children
What does it mean to release a resource with children? Should the children become children of the released resource's parent? Should they be released too? Should we fail the release? I bet we have no callers who expect this right now, but with insert_resource() we may get some. At the point where someone hits this BUG we can figure out what semantics we want. Signed-off-by: Matthew Wilcox <willy@parisc-linux.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Make sure nobody's leaking resources
Currently, releasing a resource also releases all of its children. That made sense when request_resource was the main method of dividing up the memory map. With the increased use of insert_resource, it seems to me that we should instead reparent the newly orphaned resources. Before we do that, let's make sure that nobody's actually relying on the current semantics. Signed-off-by: Matthew Wilcox <matthew@wil.cx> Cc: Greg KH <greg@kroah.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
kasan: add kasan mode messages when kasan init
There are multiple kasan modes. It makes sense that we add some messages to know which kasan mode is active when booting up. see [1]. Link: https://bugzilla.kernel.org/show_bug.cgi?id=212195 [1] Link: https://lkml.kernel.org/r/20211020094850.4113-1-Kuan-Ying.Lee@mediatek.com Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Yee Lee <yee.lee@mediatek.com> Cc: Nicholas Tang <nicholas.tang@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
mm: unexport {,un}lock_page_memcg
These are only used in built-in core mm code. Link: https://lkml.kernel.org/r/20210820095815.445392-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
mm: unexport folio_memcg_{,un}lock
Patch series "unexport memcg locking helpers". Neither the old page-based nor the new folio-based memcg locking helpers are used in modular code at all, so drop the exports. This patch (of 2): folio_memcg_{,un}lock are only used in built-in core mm code. Link: https://lkml.kernel.org/r/20210820095815.445392-1-hch@lst.de Link: https://lkml.kernel.org/r/20210820095815.445392-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> -
mm/migrate.c: remove MIGRATE_PFN_LOCKED
MIGRATE_PFN_LOCKED is used to indicate to migrate_vma_prepare() that a source page was already locked during migrate_vma_collect(). If it wasn't then the a second attempt is made to lock the page. However if the first attempt failed it's unlikely a second attempt will succeed, and the retry adds complexity. So clean this up by removing the retry and MIGRATE_PFN_LOCKED flag. Destination pages are also meant to have the MIGRATE_PFN_LOCKED flag set, but nothing actually checks that. Link: https://lkml.kernel.org/r/20211025041608.289017-1-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ben Skeggs <bskeggs@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
mm: migrate: simplify the file-backed pages validation when migrating…
… its mapping Patch series "Some cleanup for page migration", v3. This patchset does some cleanups and improvements for page migration. This patch (of 4): There is no need to validate the file-backed page's refcount before trying to freeze the page's expected refcount, instead we can rely on the folio_ref_freeze() to validate if the page has the expected refcount before migrating its mapping. Moreover we are always under the page lock when migrating the page mapping, which means nowhere else can remove it from the page cache, so we can remove the xas_load() validation under the i_pages lock. Link: https://lkml.kernel.org/r/cover.1629447552.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/df4c129fd8e86a95dbc55f4663d77441cc0d3bd1.1629447552.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
mm: allow only SLUB on PREEMPT_RT
Memory allocators may disable interrupts or preemption as part of the allocation and freeing process. For PREEMPT_RT it is important that these sections remain deterministic and short and therefore don't depend on the size of the memory to allocate/ free or the inner state of the algorithm. Until v3.12-RT the SLAB allocator was an option but involved several changes to meet all the requirements. The SLUB design fits better with PREEMPT_RT model and so the SLAB patches were dropped in the 3.12-RT patchset. Comparing the two allocator, SLUB outperformed SLAB in both throughput (time needed to allocate and free memory) and the maximal latency of the system measured with cyclictest during hackbench. SLOB was never evaluated since it was unlikely that it preforms better than SLAB. During a quick test, the kernel crashed with SLOB enabled during boot. Disable SLAB and SLOB on PREEMPT_RT. [bigeasy@linutronix.de: commit description] Link: https://lkml.kernel.org/r/20211015210336.gen3tib33ig5q2md@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
lib/stackdepot: allow optional init and stack_table allocation by kvm…
…alloc() - fixup3 Due to cd06ab2 ("drm/locking: add backtrace for locking contended locks without backoff") landing recently to -next adding a new stack depot user in drivers/gpu/drm/drm_modeset_lock.c we need to add an appropriate call to stack_depot_init() there as well. Link: https://lkml.kernel.org/r/2a692365-cfa1-64f2-34e0-8aa5674dce5e@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Cc: Marco Elver <elver@google.com> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Oliver Glitta <glittao@gmail.com> Cc: Imran Khan <imran.f.khan@oracle.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
lib/stackdepot: allow optional init and stack_table allocation by kvm…
…alloc() - fixup On FLATMEM, we call page_ext_init_flatmem_late() just before kmem_cache_init() which means stack_depot_init() (called by page owner init) will not recognize properly it should use kvmalloc() and not memblock_alloc(). memblock_alloc() will also not issue a warning and return a block memory that can be invalid and cause kernel page fault when saving stacks, as reported by the kernel test robot [1]. Fix this by moving page_ext_init_flatmem_late() below kmem_cache_init() so that slab_is_available() is true during stack_depot_init(). SPARSEMEM doesn't have this issue, as it doesn't do page_ext_init_flatmem_late(), but a different page_ext_init() even later in the boot process. Thanks to Mike Rapoport for pointing out the FLATMEM init ordering issue. While at it, also actually resolve a checkpatch warning in stack_depot_init() from DRM CI, which was supposed to be in the original patch already. [1] https://lore.kernel.org/all/20211014085450.GC18719@xsang-OptiPlex-9020/ Link: https://lkml.kernel.org/r/6abd9213-19a9-6d58-cedc-2414386d2d81@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: kernel test robot <oliver.sang@intel.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
lib/stackdepot: fix spelling mistake and grammar in pr_err message
There is a spelling mistake of the work allocation so fix this and re-phrase the message to make it easier to read. Link: https://lkml.kernel.org/r/20211015104159.11282-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
lib/stackdepot: allow optional init and stack_table allocation by kvm…
…alloc() Currently, enabling CONFIG_STACKDEPOT means its stack_table will be allocated from memblock, even if stack depot ends up not actually used. The default size of stack_table is 4MB on 32-bit, 8MB on 64-bit. This is fine for use-cases such as KASAN which is also a config option and has overhead on its own. But it's an issue for functionality that has to be actually enabled on boot (page_owner) or depends on hardware (GPU drivers) and thus the memory might be wasted. This was raised as an issue [1] when attempting to add stackdepot support for SLUB's debug object tracking functionality. It's common to build kernels with CONFIG_SLUB_DEBUG and enable slub_debug on boot only when needed, or create only specific kmem caches with debugging for testing purposes. It would thus be more efficient if stackdepot's table was allocated only when actually going to be used. This patch thus makes the allocation (and whole stack_depot_init() call) optional: - Add a CONFIG_STACKDEPOT_ALWAYS_INIT flag to keep using the current well-defined point of allocation as part of mem_init(). Make CONFIG_KASAN select this flag. - Other users have to call stack_depot_init() as part of their own init when it's determined that stack depot will actually be used. This may depend on both config and runtime conditions. Convert current users which are page_owner and several in the DRM subsystem. Same will be done for SLUB later. - Because the init might now be called after the boot-time memblock allocation has given all memory to the buddy allocator, change stack_depot_init() to allocate stack_table with kvmalloc() when memblock is no longer available. Also handle allocation failure by disabling stackdepot (could have theoretically happened even with memblock allocation previously), and don't unnecessarily align the memblock allocation to its own size anymore. [1] https://lore.kernel.org/all/CAMuHMdW=eoVzM1Re5FVoEN87nKfiLmM2+Ah7eNu2KXEhCvbZyA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20211013073005.11351-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Marco Elver <elver@google.com> # stackdepot Cc: Marco Elver <elver@google.com> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Oliver Glitta <glittao@gmail.com> Cc: Imran Khan <imran.f.khan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
mm-filemap-check-if-thp-has-hwpoisoned-subpage-for-pmd-page-fault-vs-…
…folios fix mm-filemap-check-if-thp-has-hwpoisoned-subpage-for-pmd-page-fault.patch for folio tree PAGEFLAG_FALSE() change. Cc: Yang Shi <shy828301@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
restore-acct_reclaim_writeback-for-folio
Make Mel's "mm/vmscan: throttle reclaim and compaction when too may pages are isolated" work for folio changes. Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
linux-next: build failure after merge of the btrfs tree
--Sig_/WBXj/Gqj3ng9sdQkmCFxe+v Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable Hi all, [I am not sure why this error only popped up after I merged Andrew's patch set ...] After merging the btrfs tree, today's linux-next build (x86_64 allmodconfig) failed like this: In file included from include/linux/string.h:253, from include/linux/bitmap.h:11, from include/linux/cpumask.h:12, from arch/x86/include/asm/cpumask.h:5, from arch/x86/include/asm/msr.h:11, from arch/x86/include/asm/processor.h:22, from arch/x86/include/asm/cpufeature.h:5, from arch/x86/include/asm/thread_info.h:53, from include/linux/thread_info.h:60, from arch/x86/include/asm/preempt.h:7, from include/linux/preempt.h:78, from include/linux/spinlock.h:55, from include/linux/wait.h:9, from include/linux/mempool.h:8, from include/linux/bio.h:8, from fs/btrfs/ioctl.c:7: In function 'memcpy', inlined from '_btrfs_ioctl_send' at fs/btrfs/ioctl.c:4846:3: include/linux/fortify-string.h:219:4: error: call to '__write_overflow' dec= lared with attribute error: detected write beyond size of object (1st param= eter) 219 | __write_overflow(); | ^~~~~~~~~~~~~~~~~~ Caused by commit c8d9cdf ("btrfs: send: prepare for v2 protocol") This changes the "reserved" field of struct btrfs_ioctl_send_args from 4 u6= 4's to 3, but the above memcpy is copying the "reserved" filed from a struc= t btrfs_ioctl_send_args_32 (4 u64s) into it. All I could really do at this point was mark BTRFS_FS as BROKEN (TEST_KMOD selects BTRFS_FS): From: Stephen Rothwell <sfr@canb.auug.org.au> Date: Wed, 27 Oct 2021 20:53:24 +1100 Subject: [PATCH] make btrfs as BROKEN due to an inconsistent API change Link: https://lkml.kernel.org/r/20211027210924.22ef5881@canb.auug.org.au Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> -
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
GIT fb79395 Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
revert-acct_reclaim_writeback-for-next
take away this change from Mel's "mm/vmscan: throttle reclaim and compaction when too may pages are isolated" so that linux-next.patch applies more easily. Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
shm: extend forced shm destroy to support objects from several IPC nses
Currently, exit_shm function not designed to work properly when task->sysvshm.shm_clist holds shm objects from different IPC namespaces. This is a real pain when sysctl kernel.shm_rmid_forced = 1, because it leads to use-after-free (reproducer exists). That particular patch is attempt to fix the problem by extending exit_shm mechanism to handle shm's destroy from several IPC ns'es. To achieve that we do several things: 1. add namespace (non-refcounted) pointer to the struct shmid_kernel 2. during new shm object creation (newseg()/shmget syscall) we initialize this pointer by current task IPC ns 3. exit_shm() fully reworked such that it traverses over all shp's in task->sysvshm.shm_clist and gets IPC namespace not from current task as it was before but from shp's object itself, then call shm_destroy(shp, ns). Note. We need to be really careful here, because as it was said before (1), our pointer to IPC ns non-refcnt'ed. To be on the safe side we using special helper get_ipc_ns_not_zero() which allows to get IPC ns refcounter only if IPC ns not in the "state of destruction". Q/A Q: Why we can access shp->ns memory using non-refcounted pointer? A: Because shp object lifetime is always shorther than IPC namespace lifetime, so, if we get shp object from the task->sysvshm.shm_clist while holding task_lock(task) nobody can steal our namespace. Q: Does this patch change semantics of unshare/setns/clone syscalls? A: Not. It's just fixes non-covered case when process may leave IPC namespace without getting task->sysvshm.shm_clist list cleaned up. Link: https://lkml.kernel.org/r/20211027224348.611025-3-alexander.mikhalitsyn@virtuozzo.com Fixes: ab602f7 ("shm: make exit_shm work proportional to task activity") Co-developed-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Andrei Vagin <avagin@gmail.com> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
ipc: WARN if trying to remove ipc object which is absent
Patch series "shm: shm_rmid_forced feature fixes". Some time ago I met kernel crash after CRIU restore procedure, fortunately, it was CRIU restore, so, I had dump files and could do restore many times and crash reproduced easily. After some investigation I've constructed the minimal reproducer. It was found that it's use-after-free and it happens only if sysctl kernel.shm_rmid_forced = 1. The key of the problem is that the exit_shm() function not handles shp's object destroy when task->sysvshm.shm_clist contains items from different IPC namespaces. In most cases this list will contain only items from one IPC namespace. Why this list may contain object from different namespaces? Function exit_shm() designed to clean up this list always when process leaves IPC namespace. But we made a mistake a long time ago and not add exit_shm() call into setns() syscall procedures. 1st second idea was just to add this call to setns() syscall but it's obviously changes semantics of setns() syscall and that's userspace-visible change. So, I gave up this idea. First real attempt to address the issue was just to omit forced destroy if we meet shp object not from current task IPC namespace [1]. But that was not the best idea because task->sysvshm.shm_clist was protected by rwsem which belongs to current task IPC namespace. It means that list corruption may occur. Second approach is just extend exit_shm() to properly handle shp's from different IPC namespaces [2]. This is really non-trivial thing, I've put a lot of effort into that but not believed that it's possible to make it fully safe, clean and clear. Thanks to the efforts of Manfred Spraul working an elegant solution was designed. Thanks a lot, Manfred! Eric also suggested the way to address the issue in ("[RFC][PATCH] shm: In shm_exit destroy all created and never attached segments") Eric's idea was to maintain a list of shm_clists one per IPC namespace, use lock-less lists. But there is some extra memory consumption-related concerns. Alternative solution which was suggested by me was implemented in ("shm: reset shm_clist on setns but omit forced shm destroy") Idea is pretty simple, we add exit_shm() syscall to setns() but DO NOT destroy shm segments even if sysctl kernel.shm_rmid_forced = 1, we just clean up the task->sysvshm.shm_clist list. This chages semantics of setns() syscall a little bit but in comparision to "naive" solution when we just add exit_shm() without any special exclusions this looks like a safer option. [1] https://lkml.org/lkml/2021/7/6/1108 [2] https://lkml.org/lkml/2021/7/14/736 This patch (of 2): Let's produce a warning if we trying to remove non-existing IPC object from IPC namespace kht/idr structures. This allows to catch possible bugs when ipc_rmid() function was called with inconsistent struct ipc_ids*, struct kern_ipc_perm* arguments. Link: https://lkml.kernel.org/r/20211027224348.611025-1-alexander.mikhalitsyn@virtuozzo.com Link: https://lkml.kernel.org/r/20211027224348.611025-2-alexander.mikhalitsyn@virtuozzo.com Co-developed-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Andrei Vagin <avagin@gmail.com> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> -
ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL
Compilation of ipc/ipc_sysctl.c is controlled by obj-$(CONFIG_SYSVIPC_SYSCTL) [see ipc/Makefile] And CONFIG_SYSVIPC_SYSCTL depends on SYSCTL [see init/Kconfig] An SYSCTL is selected by PROC_SYSCTL. [see fs/proc/Kconfig] Thus: #ifndef CONFIG_PROC_SYSCTL in ipc/ipc_sysctl.c is impossible, the fallback can be removed. Link: https://lkml.kernel.org/r/20210918145337.3369-1-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
ipc-check-checkpoint_restore_ns_capable-to-modify-c-r-proc-files-fix
ipc/ipc_sysctl.c needs capability.h for checkpoint_restore_ns_capable() Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Michal Clapinski <mclapinski@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
ipc: check checkpoint_restore_ns_capable() to modify C/R proc files
This commit removes the requirement to be root to modify sem_next_id, msg_next_id and shm_next_id and checks checkpoint_restore_ns_capable instead. Since those files are specific to the IPC namespace, there is no reason they should require root privileges. This is similar to ns_last_pid, which also only checks checkpoint_restore_ns_capable. Link: https://lkml.kernel.org/r/20210916163717.3179496-1-mclapinski@google.com Signed-off-by: Michal Clapinski <mclapinski@google.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Manfred Spraul <manfred@colorfullife.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
selftests/kselftest/runner/run_one(): Allow running non-executable files
When running a test program, 'run_one()' checks if the program has the execution permission and fails if it doesn't. However, it's easy to mistakenly lose the permissions, as some common tools like 'diff' don't support the permission change well[1]. Compared to that, making mistakes in the test program's path would only rare, as those are explicitly listed in 'TEST_PROGS'. Therefore, it might make more sense to resolve the situation on our own and run the program. For this reason, this commit makes the test program runner function still print the warning message but to try parsing the interpreter of the program and to explicitly run it with the interpreter, in this case. [1] https://lore.kernel.org/mm-commits/YRJisBs9AunccCD4@kroah.com/ Link: https://lkml.kernel.org/r/20210810164534.25902-1-sj38.park@gmail.com Signed-off-by: SeongJae Park <sjpark@amazon.de> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
virtio-mem: disallow mapping virtio-mem memory via /dev/mem
We don't want user space to be able to map virtio-mem device memory directly (e.g., via /dev/mem) in order to have guarantees that in a sane setup we'll never accidentially access unplugged memory within the device-managed region of a virtio-mem device, just as required by the virtio-spec. As soon as the virtio-mem driver is loaded, the device region is visible in /proc/iomem via the parent device region. From that point on user space is aware of the device region and we want to disallow mapping anything inside that region (where we will dynamically (un)plug memory) until the driver has been unloaded cleanly and e.g., another driver might take over. By creating our parent IORESOURCE_SYSTEM_RAM resource with IORESOURCE_EXCLUSIVE, we will disallow any /dev/mem access to our device region until the driver was unloaded cleanly and removed the parent region. This will work even though only some memory blocks are actually currently added to Linux and appear as busy in the resource tree. So access to the region from user space is only possible a) if we don't load the virtio-mem driver. b) after unloading the virtio-mem driver cleanly. Don't build virtio-mem if access to /dev/mem cannot be restricticted -- if we have CONFIG_DEVMEM=y but CONFIG_STRICT_DEVMEM is not set. Link: https://lkml.kernel.org/r/20210920142856.17758-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Jason Wang <jasowang@redhat.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
kernel/resource: disallow access to exclusive system RAM regions
virtio-mem dynamically exposes memory inside a device memory region as system RAM to Linux, coordinating with the hypervisor which parts are actually "plugged" and consequently usable/accessible. On the one hand, the virtio-mem driver adds/removes whole memory blocks, creating/removing busy IORESOURCE_SYSTEM_RAM resources, on the other hand, it logically (un)plugs memory inside added memory blocks, dynamically either exposing them to the buddy or hiding them from the buddy and marking them PG_offline. In contrast to physical devices, like a DIMM, the virtio-mem driver is required to actually make use of any of the device-provided memory, because it performs the handshake with the hypervisor. virtio-mem memory cannot simply be access via /dev/mem without a driver. There is no safe way to: a) Access plugged memory blocks via /dev/mem, as they might contain unplugged holes or might get silently unplugged by the virtio-mem driver and consequently turned inaccessible. b) Access unplugged memory blocks via /dev/mem because the virtio-mem driver is required to make them actually accessible first. The virtio-spec states that unplugged memory blocks MUST NOT be written, and only selected unplugged memory blocks MAY be read. We want to make sure, this is the case in sane environments -- where the virtio-mem driver was loaded. We want to make sure that in a sane environment, nobody "accidentially" accesses unplugged memory inside the device managed region. For example, a user might spot a memory region in /proc/iomem and try accessing it via /dev/mem via gdb or dumping it via something else. By the time the mmap() happens, the memory might already have been removed by the virtio-mem driver silently: the mmap() would succeeed and user space might accidentially access unplugged memory. So once the driver was loaded and detected the device along the device-managed region, we just want to disallow any access via /dev/mem to it. In an ideal world, we would mark the whole region as busy ("owned by a driver") and exclude it; however, that would be wrong, as we don't really have actual system RAM at these ranges added to Linux ("busy system RAM"). Instead, we want to mark such ranges as "not actual busy system RAM but still soft-reserved and prepared by a driver for future use." Let's teach iomem_is_exclusive() to reject access to any range with "IORESOURCE_SYSTEM_RAM | IORESOURCE_EXCLUSIVE", even if not busy and even if "iomem=relaxed" is set. Introduce EXCLUSIVE_SYSTEM_RAM to make it easier for applicable drivers to depend on this setting in their Kconfig. For now, there are no applicable ranges and we'll modify virtio-mem next to properly set IORESOURCE_EXCLUSIVE on the parent resource container it creates to contain all actual busy system RAM added via add_memory_driver_managed(). Link: https://lkml.kernel.org/r/20210920142856.17758-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Jason Wang <jasowang@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>