Rafael-Aquini/…

Commits on Mar 19, 2021

  1. mm/slab_common: provide "slab_merge" option for !IS_ENABLED(CONFIG_SLAB_MERGE_DEFAULT) builds
    
    This is a minor addition to the allocator setup options to provide
    a simple way to re-enable cache merging on demand for builds
    that by default run with CONFIG_SLAB_MERGE_DEFAULT not set.
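
    A minimal user-space sketch of the idea, assuming a flag whose default
    comes from the build configuration and two boot-style tokens that
    override it.  "slab_nomerge" is the existing option and "slab_merge" is
    the one this commit describes; MERGE_DEFAULT and parse_cmdline() are
    hypothetical stand-ins for illustration, not the kernel's code.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for IS_ENABLED(CONFIG_SLAB_MERGE_DEFAULT);
 * 0 models a build where merging is off by default. */
#define MERGE_DEFAULT 0

/* Hypothetical helper: scan a space-separated command line and return
 * whether cache merging ends up disabled. The last matching token wins. */
bool parse_cmdline(const char *cmdline)
{
	bool nomerge = !MERGE_DEFAULT;
	char buf[256];
	char *tok;

	/* strtok() modifies its input, so work on a bounded copy. */
	strncpy(buf, cmdline, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';

	for (tok = strtok(buf, " "); tok; tok = strtok(NULL, " ")) {
		if (strcmp(tok, "slab_merge") == 0)
			nomerge = false;
		else if (strcmp(tok, "slab_nomerge") == 0)
			nomerge = true;
	}
	return nomerge;
}
```

    With MERGE_DEFAULT set to 0, a command line without either token leaves
    merging disabled, and adding "slab_merge" turns it back on.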
    
    Signed-off-by: Rafael Aquini <aquini@redhat.com>
    aquini authored and intel-lab-lkp committed Mar 19, 2021

Commits on Aug 1, 2020

  1. pci: test for unexpectedly disabled bridges

    The all-ones value is not just a "device didn't exist" case, it's also
    potentially a quite valid value, so not restoring it would be wrong.
    
    What *would* be interesting is to hear where the bad values came from in
    the first place.  It sounds like the device state is saved after the PCI
    bus controller in front of the device has been crapped on, resulting in the
    PCI config cycles never reaching the device at all.
    
    Something along the lines of this patch (together with suspend/resume
    debugging output) might help pinpoint it.  But it really sounds like something totally
    brokenly turned off the PCI bridge (some ACPI shutdown crud?  I wouldn't be
    entirely surprised)
    
    Cc: Greg KH <greg@kroah.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    torvalds authored and hnaz committed Aug 1, 2020
  2. kernel/fork.c: export kernel_thread() to modules

    mutex-subsystem-synchro-test-module.patch needs this
    
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    akpm00 authored and hnaz committed Aug 1, 2020
  3. mutex subsystem, synchro-test module

    The attached patch adds a module for testing and benchmarking mutexes,
    semaphores and R/W semaphores.
    
    Using it is simple:
    
    	insmod synchro-test.ko <args>
    
    It will exit with error ENOANO after running the tests and printing the
    results to the kernel console log.
    
    The available arguments are:
    
     (*) mx=N
    
    	Start up to N mutex thrashing threads, where N is at most 20. All will
    	try and thrash the same mutex.
    
     (*) sm=N
    
    	Start up to N counting semaphore thrashing threads, where N is at most
    	20. All will try and thrash the same semaphore.
    
     (*) ism=M
    
    	Initialise the counting semaphore with M, where M is any positive
    	integer. The default is 4.
    
     (*) rd=N
     (*) wr=O
     (*) dg=P
    
    	Start up to N reader thrashing threads, O writer thrashing threads and
    	P downgrader thrashing threads, where N, O and P are at most 20
    	apiece. All will try and thrash the same read/write semaphore.
    
     (*) elapse=N
    
    	Run the tests for N seconds. The default is 5.
    
     (*) load=N
    
    	Each thread delays for N uS whilst holding the lock. The default is 0.
    
     (*) interval=N
    
    	Each thread delays for N uS whilst not holding the lock. The default
    	is 0.
    
     (*) do_sched=1
    
    	Each thread will call schedule if required after each iteration.
    
     (*) v=1
    
    	Print more verbose information, including a thread iteration
    	distribution list.
    
    The module should be enabled by setting CONFIG_DEBUG_SYNCHRO_TEST to "m".
    
    [randy.dunlap@oracle.com: fix build errors, add <sched.h> header file]
    [akpm@linux-foundation.org: remove smp_lock.h inclusion]
    [viro@ZenIV.linux.org.uk: kill daemonize() calls]
    [rdunlap@xenotime.net: fix printk format warnings]
    [walken@google.com: add spinlock test]
    [walken@google.com: document default load and interval values]
    Signed-off-by: David Howells <dhowells@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Adrian Bunk <bunk@stusta.de>
    Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
    Signed-off-by: Michel Lespinasse <walken@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    dhowells authored and hnaz committed Aug 1, 2020
  4. Releasing resources with children

    What does it mean to release a resource with children?  Should the children
    become children of the released resource's parent?  Should they be released
    too?  Should we fail the release?
    
    I bet we have no callers who expect this right now, but with
    insert_resource() we may get some.  At the point where someone hits this
    BUG we can figure out what semantics we want.
    
    Signed-off-by: Matthew Wilcox <willy@parisc-linux.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Matthew Wilcox authored and hnaz committed Aug 1, 2020
  5. Make sure nobody's leaking resources

    Currently, releasing a resource also releases all of its children.  That
    made sense when request_resource was the main method of dividing up the
    memory map.  With the increased use of insert_resource, it seems to me that
    we should instead reparent the newly orphaned resources.  Before we do
    that, let's make sure that nobody's actually relying on the current
    semantics.
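
    A minimal sketch of what "reparent the newly orphaned resources" could
    look like, using a simplified user-space tree.  struct res and both
    helpers are hypothetical; the kernel's struct resource has more fields
    and its own locking, none of which is modelled here.

```c
#include <stddef.h>

/* Simplified stand-in for the tree links of the kernel's struct resource. */
struct res {
	const char *name;
	struct res *parent, *sibling, *child;
};

/* Link r as the first child of parent. */
void res_add(struct res *parent, struct res *r)
{
	r->parent = parent;
	r->sibling = parent->child;
	parent->child = r;
}

/*
 * Hypothetical release that reparents: unlink r from its parent and
 * splice r's children into the parent's child list in r's place.
 * Assumes r is currently linked under a non-NULL parent.
 */
void res_release_reparent(struct res *r)
{
	struct res **link = &r->parent->child;
	struct res *c, *last = NULL;

	/* Find the pointer that currently points at r. */
	while (*link != r)
		link = &(*link)->sibling;

	/* Hand every child over to r's parent and remember the last one. */
	for (c = r->child; c; c = c->sibling) {
		c->parent = r->parent;
		last = c;
	}

	if (last) {
		last->sibling = r->sibling;
		*link = r->child;
	} else {
		*link = r->sibling;
	}

	r->parent = r->sibling = r->child = NULL;
}
```

    The alternative semantics described above (releasing the children too)
    would drop the loop and simply unlink r together with its subtree.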
    
    Signed-off-by: Matthew Wilcox <matthew@wil.cx>
    Cc: Greg KH <greg@kroah.com>
    Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
    Cc: Dominik Brodowski <linux@dominikbrodowski.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Matthew Wilcox authored and hnaz committed Aug 1, 2020
  6. virtio: pci: constify ioreadX() iomem argument (as in generic implementation)
    
    The ioreadX() helpers have an inconsistent interface.  On some architectures
    the void *__iomem address argument is a pointer to const, on some not.
    
    Implementations of ioreadX() do not modify the memory under the address so
    they can be converted to a "const" version for const-safety and
    consistency among architectures.
    
    Link: http://lkml.kernel.org/r/20200709072837.5869-5-krzk@kernel.org
    Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
    Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Cc: Allen Hubbe <allenbh@gmail.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Dave Jiang <dave.jiang@intel.com>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Helge Deller <deller@gmx.de>
    Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
    Cc: Jason Wang <jasowang@redhat.com>
    Cc: Jon Mason <jdmason@kudzu.us>
    Cc: Kalle Valo <kvalo@codeaurora.org>
    Cc: Matt Turner <mattst88@gmail.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: "Michael S. Tsirkin" <mst@redhat.com>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Richard Henderson <rth@twiddle.net>
    Cc: Rich Felker <dalias@libc.org>
    Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    krzk authored and hnaz committed Aug 1, 2020
  7. ntb: intel: constify ioreadX() iomem argument (as in generic implementation)
    
    The ioreadX() helpers have an inconsistent interface.  On some architectures
    the void *__iomem address argument is a pointer to const, on some not.
    
    Implementations of ioreadX() do not modify the memory under the address so
    they can be converted to a "const" version for const-safety and
    consistency among architectures.
    
    Link: http://lkml.kernel.org/r/20200709072837.5869-4-krzk@kernel.org
    Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
    Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Acked-by: Dave Jiang <dave.jiang@intel.com>
    Cc: Allen Hubbe <allenbh@gmail.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Helge Deller <deller@gmx.de>
    Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
    Cc: Jason Wang <jasowang@redhat.com>
    Cc: Jon Mason <jdmason@kudzu.us>
    Cc: Kalle Valo <kvalo@codeaurora.org>
    Cc: Matt Turner <mattst88@gmail.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: "Michael S. Tsirkin" <mst@redhat.com>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Richard Henderson <rth@twiddle.net>
    Cc: Rich Felker <dalias@libc.org>
    Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    krzk authored and hnaz committed Aug 1, 2020
  8. rtl818x: constify ioreadX() iomem argument (as in generic implementation)
    
    The ioreadX() helpers have an inconsistent interface.  On some architectures
    the void *__iomem address argument is a pointer to const, on some not.
    
    Implementations of ioreadX() do not modify the memory under the address so
    they can be converted to a "const" version for const-safety and
    consistency among architectures.
    
    Link: http://lkml.kernel.org/r/20200709072837.5869-3-krzk@kernel.org
    Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
    Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Acked-by: Kalle Valo <kvalo@codeaurora.org>
    Cc: Allen Hubbe <allenbh@gmail.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Dave Jiang <dave.jiang@intel.com>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Helge Deller <deller@gmx.de>
    Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
    Cc: Jason Wang <jasowang@redhat.com>
    Cc: Jon Mason <jdmason@kudzu.us>
    Cc: Matt Turner <mattst88@gmail.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: "Michael S. Tsirkin" <mst@redhat.com>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Richard Henderson <rth@twiddle.net>
    Cc: Rich Felker <dalias@libc.org>
    Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    krzk authored and hnaz committed Aug 1, 2020
  9. sh: clk: fix assignment from incompatible pointer type for ioreadX()

    The ioreadX() helpers now accept a pointer to const memory, so the
    declaration of the read function needs updating.
    
    This fixes build errors like:
    
        drivers/sh/clk/cpg.c: In function `sh_clk_mstp_enable':
        drivers/sh/clk/cpg.c:49:9: error: assignment from incompatible pointer type [-Werror=incompatible-pointer-types]
            read = ioread8;
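
    A user-space sketch of why the declaration has to change, assuming
    helpers that take a pointer to const.  fake_ioread8(), fake_ioread32(),
    read_fn and pick_reader() are hypothetical names for illustration and
    are not the code in drivers/sh/clk/cpg.c.

```c
#include <stdint.h>

/* Stand-ins for the constified helpers: they only read, so the pointee
 * can be const. */
unsigned int fake_ioread8(const void *addr)
{
	return *(const volatile uint8_t *)addr;
}

unsigned int fake_ioread32(const void *addr)
{
	return *(const volatile uint32_t *)addr;
}

/* The function pointer's parameter type must match the helpers exactly.
 * Declaring it as `unsigned int (*)(void *)` would make the returns
 * below fail with -Werror=incompatible-pointer-types. */
typedef unsigned int (*read_fn)(const void *);

read_fn pick_reader(int width)
{
	return width == 8 ? fake_ioread8 : fake_ioread32;
}
```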
    
    Link: http://lkml.kernel.org/r/20200723082017.24053-1-krzk@kernel.org
    Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
    Cc: Geert Uytterhoeven <geert+renesas@glider.be>
    Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
    Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
    Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
    Cc: Rich Felker <dalias@libc.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    krzk authored and hnaz committed Aug 1, 2020
  10. iomap: constify ioreadX() iomem argument (as in generic implementation)

    Patch series "iomap: Constify ioreadX() iomem argument", v3.
    
    The ioread8/16/32() and other helpers have an inconsistent interface among
    the architectures: some take the address as const, some do not.
    
    It seems there is nothing really stopping all of them from taking a pointer
    to const.
    
    
    This patch (of 4):
    
    The ioreadX() and ioreadX_rep() helpers have an inconsistent interface.  On
    some architectures the void *__iomem address argument is a pointer to const,
    on some not.
    
    Implementations of ioreadX() do not modify the memory under the address so
    they can be converted to a "const" version for const-safety and
    consistency among architectures.
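
    The const-safety benefit can be sketched in user space, assuming a
    read helper whose parameter is a pointer to const.  fake_ioread16() and
    read_status() are hypothetical names; the __iomem annotation has no
    user-space equivalent and is left out.

```c
#include <stdint.h>

/* Stand-in for a constified read helper: it never writes through addr. */
unsigned int fake_ioread16(const volatile void *addr)
{
	return *(const volatile uint16_t *)addr;
}

/* A caller that only holds a pointer to const can call the helper
 * without a cast. With a non-const parameter this call would warn with
 * -Wdiscarded-qualifiers. */
unsigned int read_status(const uint16_t *regs)
{
	return fake_ioread16(&regs[1]);
}
```

    Callers holding a non-const pointer are unaffected, because a pointer
    to non-const converts implicitly to a pointer to const.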
    
    Link: http://lkml.kernel.org/r/20200709072837.5869-1-krzk@kernel.org
    Link: http://lkml.kernel.org/r/20200709072837.5869-2-krzk@kernel.org
    Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
    Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Reviewed-by: Arnd Bergmann <arnd@arndb.de>
    Cc: Richard Henderson <rth@twiddle.net>
    Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
    Cc: Matt Turner <mattst88@gmail.com>
    Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
    Cc: Helge Deller <deller@gmx.de>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
    Cc: Rich Felker <dalias@libc.org>
    Cc: Kalle Valo <kvalo@codeaurora.org>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: Dave Jiang <dave.jiang@intel.com>
    Cc: Jon Mason <jdmason@kudzu.us>
    Cc: Allen Hubbe <allenbh@gmail.com>
    Cc: "Michael S. Tsirkin" <mst@redhat.com>
    Cc: Jason Wang <jasowang@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    krzk authored and hnaz committed Aug 1, 2020
  11. sh: use generic strncpy()

    Current SH gets the warning below at strncpy():
    
    In file included from ${LINUX}/arch/sh/include/asm/string.h:3,
                     from ${LINUX}/include/linux/string.h:20,
                     from ${LINUX}/include/linux/bitmap.h:9,
                     from ${LINUX}/include/linux/nodemask.h:95,
                     from ${LINUX}/include/linux/mmzone.h:17,
                     from ${LINUX}/include/linux/gfp.h:6,
                     from ${LINUX}/include/linux/slab.h:15,
                     from ${LINUX}/linux/drivers/mmc/host/vub300.c:38:
    ${LINUX}/drivers/mmc/host/vub300.c: In function 'new_system_port_status':
    ${LINUX}/arch/sh/include/asm/string_32.h:51:42: warning: array subscript\
      80 is above array bounds of 'char[26]' [-Warray-bounds]
       : "0" (__dest), "1" (__src), "r" (__src+__n)
                                         ~~~~~^~~~
    
    In general, strncpy() should behave as follows:
    
    	char dest[10];
    	char *src = "12345";
    
    	strncpy(dest, src, 10);
    	// dest = {'1', '2', '3', '4', '5',
    	//         '\0','\0','\0','\0','\0'}
    
    But the current SH strncpy() has 2 issues.
    The 1st is that it will access memory out of bounds (= src + 10).
    The 2nd is that it needs a big fixup, and maintaining the __asm__
    code is difficult.
    
    To solve these issues, this patch simply uses the generic strncpy()
    instead of the architecture-specific one.
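
    For reference, a portable byte-at-a-time strncpy() can be sketched as
    follows.  It is written in the style of a generic C implementation and
    named generic_strncpy() here to avoid clashing with libc; treat it as an
    illustration of the expected behaviour rather than the exact code the
    patch switches to.

```c
#include <stddef.h>

/*
 * Copy at most count bytes from src to dest. Once the terminating NUL
 * of src has been copied, src stops advancing, so the loop never reads
 * past the end of src and the rest of dest is filled with NUL bytes.
 */
char *generic_strncpy(char *dest, const char *src, size_t count)
{
	char *tmp = dest;

	while (count) {
		if ((*tmp = *src) != 0)
			src++;
		tmp++;
		count--;
	}
	return dest;
}
```

    Because src is only advanced while non-NUL bytes are being copied,
    nothing like src + 10 is ever computed or dereferenced for a 5-character
    source string.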
    
    Link: https://marc.info/?l=linux-renesas-soc&m=157664657013309
    Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
    Cc: Alan Modra <amodra@gmail.com>
    Cc: Bin Meng <bin.meng@windriver.com>
    Cc: Chen Zhou <chenzhou10@huawei.com>
    Cc: Geert Uytterhoeven <geert+renesas@glider.be>
    Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
    Cc: Krzysztof Kozlowski <krzk@kernel.org>
    Cc: Rich Felker <dalias@libc.org>
    Cc: Romain Naour <romain.naour@gmail.com>
    Cc: Sam Ravnborg <sam@ravnborg.org>
    Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    morimoto authored and hnaz committed Aug 1, 2020
  12. sh: clkfwk: remove r8/r16/r32

    SH gets the warning below:
    
    ${LINUX}/drivers/sh/clk/cpg.c: In function 'r8':
    ${LINUX}/drivers/sh/clk/cpg.c:41:17: warning: passing argument 1 of 'ioread8'
     discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
      return ioread8(addr);
                     ^~~~
    In file included from ${LINUX}/arch/sh/include/asm/io.h:21,
                     from ${LINUX}/include/linux/io.h:13,
                     from ${LINUX}/drivers/sh/clk/cpg.c:14:
    ${LINUX}/include/asm-generic/iomap.h:29:29: note: expected 'void *' but
    argument is of type 'const void *'
     extern unsigned int ioread8(void __iomem *);
                                 ^~~~~~~~~~~~~~
    
    We don't need "const" for r8/r16/r32.  And we don't need r8/r16/r32
    themselves.  This patch cleans these up.
    
    X-MARC-Message: https://marc.info/?l=linux-renesas-soc&m=157852973916903
    Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
    Cc: Alan Modra <amodra@gmail.com>
    Cc: Bin Meng <bin.meng@windriver.com>
    Cc: Chen Zhou <chenzhou10@huawei.com>
    Cc: Geert Uytterhoeven <geert+renesas@glider.be>
    Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
    Cc: Krzysztof Kozlowski <krzk@kernel.org>
    Cc: Rich Felker <dalias@libc.org>
    Cc: Romain Naour <romain.naour@gmail.com>
    Cc: Sam Ravnborg <sam@ravnborg.org>
    Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    morimoto authored and hnaz committed Aug 1, 2020
  13. include/asm-generic/vmlinux.lds.h: align ro_after_init

    Since the patch [1], building the kernel using a toolchain built with
    binutils 2.33.1 prevents booting an sh4 system under Qemu.  Apply the patch
    provided by Alan Modra [2] that fixes the alignment of rodata.
    
    [1] https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=ebd2263ba9a9124d93bbc0ece63d7e0fae89b40e
    [2] https://www.sourceware.org/ml/binutils/2019-12/msg00112.html
    
    Link: https://marc.info/?l=linux-sh&m=158429470221261
    Signed-off-by: Romain Naour <romain.naour@gmail.com>
    Cc: Alan Modra <amodra@gmail.com>
    Cc: Bin Meng <bin.meng@windriver.com>
    Cc: Chen Zhou <chenzhou10@huawei.com>
    Cc: Geert Uytterhoeven <geert+renesas@glider.be>
    Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
    Cc: Krzysztof Kozlowski <krzk@kernel.org>
    Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
    Cc: Rich Felker <dalias@libc.org>
    Cc: Sam Ravnborg <sam@ravnborg.org>
    Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    RomainNaour authored and hnaz committed Aug 1, 2020
  14. mm: annotate a data race in page_zonenum()

     BUG: KCSAN: data-race in page_cpupid_xchg_last / put_page
    
     write (marked) to 0xfffffc0d48ec1a00 of 8 bytes by task 91442 on cpu 3:
      page_cpupid_xchg_last+0x51/0x80
      page_cpupid_xchg_last at mm/mmzone.c:109 (discriminator 11)
      wp_page_reuse+0x3e/0xc0
      wp_page_reuse at mm/memory.c:2453
      do_wp_page+0x472/0x7b0
      do_wp_page at mm/memory.c:2798
      __handle_mm_fault+0xcb0/0xd00
      handle_pte_fault at mm/memory.c:4049
      (inlined by) __handle_mm_fault at mm/memory.c:4163
      handle_mm_fault+0xfc/0x2f0
      handle_mm_fault at mm/memory.c:4200
      do_page_fault+0x263/0x6f9
      do_user_addr_fault at arch/x86/mm/fault.c:1465
      (inlined by) do_page_fault at arch/x86/mm/fault.c:1539
      page_fault+0x34/0x40
    
     read to 0xfffffc0d48ec1a00 of 8 bytes by task 94817 on cpu 69:
      put_page+0x15a/0x1f0
      page_zonenum at include/linux/mm.h:923
      (inlined by) is_zone_device_page at include/linux/mm.h:929
      (inlined by) page_is_devmap_managed at include/linux/mm.h:948
      (inlined by) put_page at include/linux/mm.h:1023
      wp_page_copy+0x571/0x930
      wp_page_copy at mm/memory.c:2615
      do_wp_page+0x107/0x7b0
      __handle_mm_fault+0xcb0/0xd00
      handle_mm_fault+0xfc/0x2f0
      do_page_fault+0x263/0x6f9
      page_fault+0x34/0x40
    
     Reported by Kernel Concurrency Sanitizer on:
     CPU: 69 PID: 94817 Comm: systemd-udevd Tainted: G        W  O L 5.5.0-next-20200204+ #6
     Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
    
    A page never changes its zone number. The zone number happens to be
    stored in the same word as other bits which are modified, but the zone
    number bits will never be modified by any other write, so the code can
    accept a reload of the zone bits after an intervening write and it
    doesn't need to use READ_ONCE(). Thus, annotate this data race using
    ASSERT_EXCLUSIVE_BITS() to also assert that there are no concurrent
    writes to it.
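
    The argument can be illustrated with a user-space sketch, assuming a
    hypothetical layout in which the top 2 bits of a 64-bit flags word hold
    the zone number.  ZONE_SHIFT, ZONE_MASK, struct fake_page and
    fake_page_zonenum() are made up for the example, and
    ASSERT_EXCLUSIVE_BITS() is stubbed out as a no-op because KCSAN is not
    available outside the kernel.

```c
#include <stdint.h>

/* Hypothetical layout: the kernel's real shift and width depend on the
 * configuration. */
#define ZONE_SHIFT 62
#define ZONE_MASK  UINT64_C(3)

/* No-op stand-in. Under KCSAN the real ASSERT_EXCLUSIVE_BITS(var, mask)
 * checks that no concurrent writer changes the masked bits. */
#define ASSERT_EXCLUSIVE_BITS(var, mask) ((void)0)

struct fake_page {
	uint64_t flags;
};

/* Extract the zone bits. Writes to the other bits of flags cannot change
 * the result, which is why a plain read is acceptable here. */
unsigned int fake_page_zonenum(const struct fake_page *page)
{
	ASSERT_EXCLUSIVE_BITS(page->flags, ZONE_MASK << ZONE_SHIFT);
	return (unsigned int)((page->flags >> ZONE_SHIFT) & ZONE_MASK);
}
```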
    
    Link: http://lkml.kernel.org/r/1581619089-14472-1-git-send-email-cai@lca.pw
    Signed-off-by: Qian Cai <cai@lca.pw>
    Suggested-by: Marco Elver <elver@google.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Qian Cai authored and hnaz committed Aug 1, 2020
  15. mm/swap.c: annotate data races for lru_rotate_pvecs

    A read of lru_add_pvec->nr could be interrupted by a write to the same
    variable.  The write has local interrupts disabled, but the plain reads
    result in data races.  However, it is unlikely that compilers could do much
    damage here, given that lru_add_pvec->nr is an "unsigned char" and there is
    an existing compiler barrier.  Thus, annotate the reads using the
    data_race() macro.  The data races were reported by KCSAN:
    
     BUG: KCSAN: data-race in lru_add_drain_cpu / rotate_reclaimable_page
    
     write to 0xffff9291ebcb8a40 of 1 bytes by interrupt on cpu 23:
      rotate_reclaimable_page+0x2df/0x490
      pagevec_add at include/linux/pagevec.h:81
      (inlined by) rotate_reclaimable_page at mm/swap.c:259
      end_page_writeback+0x1b5/0x2b0
      end_swap_bio_write+0x1d0/0x280
      bio_endio+0x297/0x560
      dec_pending+0x218/0x430 [dm_mod]
      clone_endio+0xe4/0x2c0 [dm_mod]
      bio_endio+0x297/0x560
      blk_update_request+0x201/0x920
      scsi_end_request+0x6b/0x4a0
      scsi_io_completion+0xb7/0x7e0
      scsi_finish_command+0x1ed/0x2a0
      scsi_softirq_done+0x1c9/0x1d0
      blk_done_softirq+0x181/0x1d0
      __do_softirq+0xd9/0x57c
      irq_exit+0xa2/0xc0
      do_IRQ+0x8b/0x190
      ret_from_intr+0x0/0x42
      delay_tsc+0x46/0x80
      __const_udelay+0x3c/0x40
      __udelay+0x10/0x20
      kcsan_setup_watchpoint+0x202/0x3a0
      __tsan_read1+0xc2/0x100
      lru_add_drain_cpu+0xb8/0x3f0
      lru_add_drain+0x25/0x40
      shrink_active_list+0xe1/0xc80
      shrink_lruvec+0x766/0xb70
      shrink_node+0x2d6/0xca0
      do_try_to_free_pages+0x1f7/0x9a0
      try_to_free_pages+0x252/0x5b0
      __alloc_pages_slowpath+0x458/0x1290
      __alloc_pages_nodemask+0x3bb/0x450
      alloc_pages_vma+0x8a/0x2c0
      do_anonymous_page+0x16e/0x6f0
      __handle_mm_fault+0xcd5/0xd40
      handle_mm_fault+0xfc/0x2f0
      do_page_fault+0x263/0x6f9
      page_fault+0x34/0x40
    
     read to 0xffff9291ebcb8a40 of 1 bytes by task 37761 on cpu 23:
      lru_add_drain_cpu+0xb8/0x3f0
      lru_add_drain_cpu at mm/swap.c:602
      lru_add_drain+0x25/0x40
      shrink_active_list+0xe1/0xc80
      shrink_lruvec+0x766/0xb70
      shrink_node+0x2d6/0xca0
      do_try_to_free_pages+0x1f7/0x9a0
      try_to_free_pages+0x252/0x5b0
      __alloc_pages_slowpath+0x458/0x1290
      __alloc_pages_nodemask+0x3bb/0x450
      alloc_pages_vma+0x8a/0x2c0
      do_anonymous_page+0x16e/0x6f0
      __handle_mm_fault+0xcd5/0xd40
      handle_mm_fault+0xfc/0x2f0
      do_page_fault+0x263/0x6f9
      page_fault+0x34/0x40
    
     2 locks held by oom02/37761:
      #0: ffff9281e5928808 (&mm->mmap_sem#2){++++}, at: do_page_fault
      #1: ffffffffb3ade380 (fs_reclaim){+.+.}, at: fs_reclaim_acquire.part
     irq event stamp: 1949217
     trace_hardirqs_on_thunk+0x1a/0x1c
     __do_softirq+0x2e7/0x57c
     __do_softirq+0x34c/0x57c
     irq_exit+0xa2/0xc0
    
     Reported by Kernel Concurrency Sanitizer on:
     CPU: 23 PID: 37761 Comm: oom02 Not tainted 5.6.0-rc3-next-20200226+ #6
     Hardware name: HP ProLiant BL660c Gen9, BIOS I38 10/17/2018
    
    Link: http://lkml.kernel.org/r/20200228044018.1263-1-cai@lca.pw
    Signed-off-by: Qian Cai <cai@lca.pw>
    Acked-by: Marco Elver <elver@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Qian Cai authored and hnaz committed Aug 1, 2020
  16. mm/rmap: annotate a data race at tlb_flush_batched

    mm->tlb_flush_batched could be accessed concurrently as noticed by
    KCSAN,
    
     BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one
    
     write to 0xffff93f754880bd0 of 1 bytes by task 822 on cpu 6:
      try_to_unmap_one+0x59a/0x1ab0
      set_tlb_ubc_flush_pending at mm/rmap.c:635
      (inlined by) try_to_unmap_one at mm/rmap.c:1538
      rmap_walk_anon+0x296/0x650
      rmap_walk+0xdf/0x100
      try_to_unmap+0x18a/0x2f0
      shrink_page_list+0xef6/0x2870
      shrink_inactive_list+0x316/0x880
      shrink_lruvec+0x8dc/0x1380
      shrink_node+0x317/0xd80
      balance_pgdat+0x652/0xd90
      kswapd+0x396/0x8d0
      kthread+0x1e0/0x200
      ret_from_fork+0x27/0x50
    
     read to 0xffff93f754880bd0 of 1 bytes by task 6364 on cpu 4:
      flush_tlb_batched_pending+0x29/0x90
      flush_tlb_batched_pending at mm/rmap.c:682
      change_p4d_range+0x5dd/0x1030
      change_pte_range at mm/mprotect.c:44
      (inlined by) change_pmd_range at mm/mprotect.c:212
      (inlined by) change_pud_range at mm/mprotect.c:240
      (inlined by) change_p4d_range at mm/mprotect.c:260
      change_protection+0x222/0x310
      change_prot_numa+0x3e/0x60
      task_numa_work+0x219/0x350
      task_work_run+0xed/0x140
      prepare_exit_to_usermode+0x2cc/0x2e0
      ret_from_intr+0x32/0x42
    
     Reported by Kernel Concurrency Sanitizer on:
     CPU: 4 PID: 6364 Comm: mtest01 Tainted: G        W    L 5.5.0-next-20200210+ #5
     Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
    
    flush_tlb_batched_pending() runs under the PTL but the write does not.
    However, mm->tlb_flush_batched is only a bool, so the value is unlikely
    to be shattered.  Thus, mark it as an intentional data race by using the
    data_race() macro.
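
    A user-space sketch of the annotation, assuming data_race() simply
    evaluates its argument.  The stub below only mirrors the macro's shape;
    struct fake_mm and fake_flush_pending() are hypothetical, and the actual
    TLB flush is left out.

```c
#include <stdbool.h>

/* No-op stand-in: the kernel's data_race() evaluates the expression and
 * tells KCSAN that any race on it is intentional. */
#define data_race(expr) (expr)

struct fake_mm {
	bool tlb_flush_batched;
};

/* Models the reader side: report whether a batched flush was pending and
 * clear the flag, as the real function does after flushing. */
bool fake_flush_pending(struct fake_mm *mm)
{
	if (data_race(mm->tlb_flush_batched)) {
		mm->tlb_flush_batched = false;
		return true;
	}
	return false;
}
```

    The annotation changes nothing about the generated code for the read;
    its purpose is to document the race and silence the report.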
    
    Link: http://lkml.kernel.org/r/1581450783-8262-1-git-send-email-cai@lca.pw
    Signed-off-by: Qian Cai <cai@lca.pw>
    Cc: Marco Elver <elver@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Qian Cai authored and hnaz committed Aug 1, 2020
  17. mm/mempool: fix a data race in mempool_free()

    mempool_t pool.curr_nr could be accessed concurrently as noticed by
    KCSAN,
    
     BUG: KCSAN: data-race in mempool_free / remove_element
    
     write to 0xffffffffa937638c of 4 bytes by task 6359 on cpu 113:
      remove_element+0x4a/0x1c0
      remove_element at mm/mempool.c:132
      mempool_alloc+0x102/0x210
      (inlined by) mempool_alloc at mm/mempool.c:399
      bio_alloc_bioset+0x106/0x2c0
      get_swap_bio+0x49/0x230
      __swap_writepage+0x680/0xc30
      swap_writepage+0x9c/0xf0
      pageout+0x33e/0xae0
      shrink_page_list+0x1f57/0x2870
      shrink_inactive_list+0x316/0x880
      shrink_lruvec+0x8dc/0x1380
      shrink_node+0x317/0xd80
      do_try_to_free_pages+0x1f7/0xa10
      try_to_free_pages+0x26c/0x5e0
      __alloc_pages_slowpath+0x458/0x1290
      <snip>
    
     read to 0xffffffffa937638c of 4 bytes by interrupt on cpu 64:
      mempool_free+0x3e/0x150
      mempool_free at mm/mempool.c:492
      bio_free+0x192/0x280
      bio_put+0x91/0xd0
      end_swap_bio_write+0x1d8/0x280
      bio_endio+0x2c2/0x5b0
      dec_pending+0x22b/0x440 [dm_mod]
      clone_endio+0xe4/0x2c0 [dm_mod]
      bio_endio+0x2c2/0x5b0
      blk_update_request+0x217/0x940
      scsi_end_request+0x6b/0x4d0
      scsi_io_completion+0xb7/0x7e0
      scsi_finish_command+0x223/0x310
      scsi_softirq_done+0x1d5/0x210
      blk_mq_complete_request+0x224/0x250
      scsi_mq_done+0xc2/0x250
      pqi_raid_io_complete+0x5a/0x70 [smartpqi]
      pqi_irq_handler+0x150/0x1410 [smartpqi]
      __handle_irq_event_percpu+0x90/0x540
      handle_irq_event_percpu+0x49/0xd0
      handle_irq_event+0x85/0xca
      handle_edge_irq+0x13f/0x3e0
      do_IRQ+0x86/0x190
      <snip>
    
    The write is under pool->lock but the read is lockless.  Even though
    commit 5b99054 ("mempool: fix and document synchronization and memory
    barrier usage") introduced the smp_wmb() and smp_rmb() pair to improve
    the situation, it is inadequate to protect the read from data races
    which could lead to a logic bug, so fix it by adding READ_ONCE() for
    the read.
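
    A user-space approximation of the fix, assuming a READ_ONCE() built from
    a volatile cast and the GNU __typeof__ extension.  The kernel's macro is
    more elaborate; struct fake_pool and pool_needs_refill() are
    hypothetical and model only the lockless pre-check, not the locked
    recheck that follows it.

```c
/* Approximation of READ_ONCE(): the volatile access asks the compiler to
 * perform a single load instead of re-reading or splitting it. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct fake_pool {
	int curr_nr;	/* elements currently held in reserve */
	int min_nr;	/* elements the pool tries to keep in reserve */
};

/* Lockless pre-check: returns nonzero if the pool looks under-filled, in
 * which case the caller should take the lock and check again. */
int pool_needs_refill(struct fake_pool *pool)
{
	return READ_ONCE(pool->curr_nr) < pool->min_nr;
}
```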
    
    Link: http://lkml.kernel.org/r/1581446384-2131-1-git-send-email-cai@lca.pw
    Signed-off-by: Qian Cai <cai@lca.pw>
    Cc: Marco Elver <elver@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Qian Cai authored and hnaz committed Aug 1, 2020
  18. mm/list_lru: fix a data race in list_lru_count_one

    struct list_lru_one l.nr_items could be accessed concurrently as noticed
    by KCSAN,
    
     BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move
    
     write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
      list_lru_isolate_move+0xf9/0x130
      list_lru_isolate_move at mm/list_lru.c:180
      inode_lru_isolate+0x12b/0x2a0
      __list_lru_walk_one+0x122/0x3d0
      list_lru_walk_one+0x75/0xa0
      prune_icache_sb+0x8b/0xc0
      super_cache_scan+0x1b8/0x250
      do_shrink_slab+0x256/0x6d0
      shrink_slab+0x41b/0x4a0
      shrink_node+0x35c/0xd80
      balance_pgdat+0x652/0xd90
      kswapd+0x396/0x8d0
      kthread+0x1e0/0x200
      ret_from_fork+0x27/0x50
    
     read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
      list_lru_count_one+0x116/0x2f0
      list_lru_count_one at mm/list_lru.c:193
      super_cache_count+0xe8/0x170
      do_shrink_slab+0x95/0x6d0
      shrink_slab+0x41b/0x4a0
      shrink_node+0x35c/0xd80
      do_try_to_free_pages+0x1f7/0xa10
      try_to_free_pages+0x26c/0x5e0
      __alloc_pages_slowpath+0x458/0x1290
      __alloc_pages_nodemask+0x3bb/0x450
      alloc_pages_vma+0x8a/0x2c0
      do_anonymous_page+0x170/0x700
      __handle_mm_fault+0xc9f/0xd00
      handle_mm_fault+0xfc/0x2f0
      do_page_fault+0x263/0x6f9
      page_fault+0x34/0x40
    
     Reported by Kernel Concurrency Sanitizer on:
     CPU: 56 PID: 6345 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #4
     Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
    
    A shattered l.nr_items could affect the shrinker behaviour due to a data
    race. Fix it by adding READ_ONCE() for the read. Since the writes are
    aligned and up to word-size, assume those are safe from data races to
    avoid readability issues of writing WRITE_ONCE(var, var + val).
    
    Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
    Signed-off-by: Qian Cai <cai@lca.pw>
    Cc: Marco Elver <elver@google.com>
    Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Qian Cai authored and hnaz committed Aug 1, 2020
  19. mm/memcontrol: fix a data race in scan count

    struct mem_cgroup_per_node mz.lru_zone_size[zone_idx][lru] could be
    accessed concurrently as noticed by KCSAN,
    
     BUG: KCSAN: data-race in lruvec_lru_size / mem_cgroup_update_lru_size
    
     write to 0xffff9c804ca285f8 of 8 bytes by task 50951 on cpu 12:
      mem_cgroup_update_lru_size+0x11c/0x1d0
      mem_cgroup_update_lru_size at mm/memcontrol.c:1266
      isolate_lru_pages+0x6a9/0xf30
      shrink_active_list+0x123/0xcc0
      shrink_lruvec+0x8fd/0x1380
      shrink_node+0x317/0xd80
      do_try_to_free_pages+0x1f7/0xa10
      try_to_free_pages+0x26c/0x5e0
      __alloc_pages_slowpath+0x458/0x1290
      __alloc_pages_nodemask+0x3bb/0x450
      alloc_pages_vma+0x8a/0x2c0
      do_anonymous_page+0x170/0x700
      __handle_mm_fault+0xc9f/0xd00
      handle_mm_fault+0xfc/0x2f0
      do_page_fault+0x263/0x6f9
      page_fault+0x34/0x40
    
     read to 0xffff9c804ca285f8 of 8 bytes by task 50964 on cpu 95:
      lruvec_lru_size+0xbb/0x270
      mem_cgroup_get_zone_lru_size at include/linux/memcontrol.h:536
      (inlined by) lruvec_lru_size at mm/vmscan.c:326
      shrink_lruvec+0x1d0/0x1380
      shrink_node+0x317/0xd80
      do_try_to_free_pages+0x1f7/0xa10
      try_to_free_pages+0x26c/0x5e0
      __alloc_pages_slowpath+0x458/0x1290
      __alloc_pages_nodemask+0x3bb/0x450
      alloc_pages_current+0xa6/0x120
      alloc_slab_page+0x3b1/0x540
      allocate_slab+0x70/0x660
      new_slab+0x46/0x70
      ___slab_alloc+0x4ad/0x7d0
      __slab_alloc+0x43/0x70
      kmem_cache_alloc+0x2c3/0x420
      getname_flags+0x4c/0x230
      getname+0x22/0x30
      do_sys_openat2+0x205/0x3b0
      do_sys_open+0x9a/0xf0
      __x64_sys_openat+0x62/0x80
      do_syscall_64+0x91/0xb47
      entry_SYSCALL_64_after_hwframe+0x49/0xbe
    
     Reported by Kernel Concurrency Sanitizer on:
     CPU: 95 PID: 50964 Comm: cc1 Tainted: G        W  O L    5.5.0-next-20200204+ #6
     Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
    
    The write is under lru_lock, but the read is lockless.  The scan count
    is used to determine how aggressively the anon and file LRU lists should
    be scanned.  Load tearing could make that heuristic inefficient, so fix
    it by adding READ_ONCE() for the read.
    
    Link: http://lkml.kernel.org/r/20200206034945.2481-1-cai@lca.pw
    Signed-off-by: Qian Cai <cai@lca.pw>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Qian Cai authored and hnaz committed Aug 1, 2020
  20. mm/page_counter: fix various data races at memsw

    Since commit 3e32cb2 ("mm: memcontrol: lockless page counters"),
    memcg->memsw->watermark and memcg->memsw->failcnt can be accessed
    concurrently, as reported by KCSAN,
    
     BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge
    
     read to 0xffff8fb18c4cd190 of 8 bytes by task 1081 on cpu 59:
      page_counter_try_charge+0x4d/0x150 mm/page_counter.c:138
      try_charge+0x131/0xd50 mm/memcontrol.c:2405
      __memcg_kmem_charge_memcg+0x58/0x140
      __memcg_kmem_charge+0xcc/0x280
      __alloc_pages_nodemask+0x1e1/0x450
      alloc_pages_current+0xa6/0x120
      pte_alloc_one+0x17/0xd0
      __pte_alloc+0x3a/0x1f0
      copy_p4d_range+0xc36/0x1990
      copy_page_range+0x21d/0x360
      dup_mmap+0x5f5/0x7a0
      dup_mm+0xa2/0x240
      copy_process+0x1b3f/0x3460
      _do_fork+0xaa/0xa20
      __x64_sys_clone+0x13b/0x170
      do_syscall_64+0x91/0xb47
      entry_SYSCALL_64_after_hwframe+0x49/0xbe
    
     write to 0xffff8fb18c4cd190 of 8 bytes by task 1153 on cpu 120:
      page_counter_try_charge+0x5b/0x150 mm/page_counter.c:139
      try_charge+0x131/0xd50 mm/memcontrol.c:2405
      mem_cgroup_try_charge+0x159/0x460
      mem_cgroup_try_charge_delay+0x3d/0xa0
      wp_page_copy+0x14d/0x930
      do_wp_page+0x107/0x7b0
      __handle_mm_fault+0xce6/0xd40
      handle_mm_fault+0xfc/0x2f0
      do_page_fault+0x263/0x6f9
      page_fault+0x34/0x40
    
     BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge
    
     write to 0xffff88809bbf2158 of 8 bytes by task 11782 on cpu 0:
      page_counter_try_charge+0x100/0x170 mm/page_counter.c:129
      try_charge+0x185/0xbf0 mm/memcontrol.c:2405
      __memcg_kmem_charge_memcg+0x4a/0xe0 mm/memcontrol.c:2837
      __memcg_kmem_charge+0xcf/0x1b0 mm/memcontrol.c:2877
      __alloc_pages_nodemask+0x26c/0x310 mm/page_alloc.c:4780
    
     read to 0xffff88809bbf2158 of 8 bytes by task 11814 on cpu 1:
      page_counter_try_charge+0xef/0x170 mm/page_counter.c:129
      try_charge+0x185/0xbf0 mm/memcontrol.c:2405
      __memcg_kmem_charge_memcg+0x4a/0xe0 mm/memcontrol.c:2837
      __memcg_kmem_charge+0xcf/0x1b0 mm/memcontrol.c:2877
      __alloc_pages_nodemask+0x26c/0x310 mm/page_alloc.c:4780
    
    Since watermark could be compared against or set to garbage due to a data
    race, which would change the code logic, fix it by adding a pair of
    READ_ONCE() and WRITE_ONCE() in those places.
    
    The "failcnt" counter is tolerant of some degree of inaccuracy and is only
    used to report stats, so a data race on it will not be harmful.  Thus,
    mark it as an intentional data race using the data_race() macro.
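    
    A minimal userspace sketch of the two annotations described above.  The
    READ_ONCE(), WRITE_ONCE() and data_race() macros here are simplified
    stand-ins for the kernel ones (an assumption made so that the example
    builds outside the kernel with GCC or Clang), and struct counter and
    counter_try_charge() are hypothetical names, not the real page_counter
    code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified userspace stand-ins for the kernel macros (assumption). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define data_race(expr)    (expr)   /* the kernel version also informs KCSAN */

/* Hypothetical counter, loosely modelled on the fields named above. */
struct counter {
    atomic_long usage;      /* updated atomically */
    long max;               /* charge limit */
    long watermark;         /* highest usage seen; racy by design */
    unsigned long failcnt;  /* statistics only */
};

/* Try to charge nr_pages; returns false if the limit would be exceeded. */
static bool counter_try_charge(struct counter *c, long nr_pages)
{
    long new_usage = atomic_fetch_add(&c->usage, nr_pages) + nr_pages;

    if (new_usage > c->max) {
        atomic_fetch_sub(&c->usage, nr_pages);
        /* Only reported as a statistic, so a lost increment is tolerable. */
        data_race(c->failcnt++);
        return false;
    }

    /* The value feeds a comparison, so read and write it exactly once. */
    if (new_usage > READ_ONCE(c->watermark))
        WRITE_ONCE(c->watermark, new_usage);
    return true;
}
```

    The split mirrors the reasoning in the message: a value that drives a
    decision gets READ_ONCE()/WRITE_ONCE(), while a statistics-only counter
    is merely marked with data_race().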
    
    Link: http://lkml.kernel.org/r/1581519682-23594-1-git-send-email-cai@lca.pw
    Fixes: 3e32cb2 ("mm: memcontrol: lockless page counters")
    Signed-off-by: Qian Cai <cai@lca.pw>
    Reported-by: syzbot+f36cfe60b1006a94f9dc@syzkaller.appspotmail.com
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: Marco Elver <elver@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Qian Cai authored and hnaz committed Aug 1, 2020
  21. mm-swapfile-fix-and-annotate-various-data-races-v2

    add a missing annotation for si->flags in memory.c
    
    Link: http://lkml.kernel.org/r/1581612647-5958-1-git-send-email-cai@lca.pw
    Signed-off-by: Qian Cai <cai@lca.pw>
    Cc: Marco Elver <elver@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Qian Cai authored and hnaz committed Aug 1, 2020
  22. mm/swapfile: fix and annotate various data races

    swap_info_struct si.highest_bit, si.swap_map[offset] and si.flags could
    each be accessed concurrently, as noticed by KCSAN,
    
    === si.highest_bit ===
    
     write to 0xffff8d5abccdc4d4 of 4 bytes by task 5353 on cpu 24:
      swap_range_alloc+0x81/0x130
      swap_range_alloc at mm/swapfile.c:681
      scan_swap_map_slots+0x371/0xb90
      get_swap_pages+0x39d/0x5c0
      get_swap_page+0xf2/0x524
      add_to_swap+0xe4/0x1c0
      shrink_page_list+0x1795/0x2870
      shrink_inactive_list+0x316/0x880
      shrink_lruvec+0x8dc/0x1380
      shrink_node+0x317/0xd80
      do_try_to_free_pages+0x1f7/0xa10
      try_to_free_pages+0x26c/0x5e0
      __alloc_pages_slowpath+0x458/0x1290
    
     read to 0xffff8d5abccdc4d4 of 4 bytes by task 6672 on cpu 70:
      scan_swap_map_slots+0x4a6/0xb90
      scan_swap_map_slots at mm/swapfile.c:892
      get_swap_pages+0x39d/0x5c0
      get_swap_page+0xf2/0x524
      add_to_swap+0xe4/0x1c0
      shrink_page_list+0x1795/0x2870
      shrink_inactive_list+0x316/0x880
      shrink_lruvec+0x8dc/0x1380
      shrink_node+0x317/0xd80
      do_try_to_free_pages+0x1f7/0xa10
      try_to_free_pages+0x26c/0x5e0
      __alloc_pages_slowpath+0x458/0x1290
    
     Reported by Kernel Concurrency Sanitizer on:
     CPU: 70 PID: 6672 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #3
     Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
    
    === si.swap_map[offset] ===
    
     write to 0xffffbc370c29a64c of 1 bytes by task 6856 on cpu 86:
      __swap_entry_free_locked+0x8c/0x100
      __swap_entry_free_locked at mm/swapfile.c:1209 (discriminator 4)
      __swap_entry_free.constprop.20+0x69/0xb0
      free_swap_and_cache+0x53/0xa0
      unmap_page_range+0x7f8/0x1d70
      unmap_single_vma+0xcd/0x170
      unmap_vmas+0x18b/0x220
      exit_mmap+0xee/0x220
      mmput+0x10e/0x270
      do_exit+0x59b/0xf40
      do_group_exit+0x8b/0x180
    
     read to 0xffffbc370c29a64c of 1 bytes by task 6855 on cpu 20:
      _swap_info_get+0x81/0xa0
      _swap_info_get at mm/swapfile.c:1140
      free_swap_and_cache+0x40/0xa0
      unmap_page_range+0x7f8/0x1d70
      unmap_single_vma+0xcd/0x170
      unmap_vmas+0x18b/0x220
      exit_mmap+0xee/0x220
      mmput+0x10e/0x270
      do_exit+0x59b/0xf40
      do_group_exit+0x8b/0x180
    
    === si.flags ===
    
     write to 0xffff956c8fc6c400 of 8 bytes by task 6087 on cpu 23:
      scan_swap_map_slots+0x6fe/0xb50
      scan_swap_map_slots at mm/swapfile.c:887
      get_swap_pages+0x39d/0x5c0
      get_swap_page+0x377/0x524
      add_to_swap+0xe4/0x1c0
      shrink_page_list+0x1795/0x2870
      shrink_inactive_list+0x316/0x880
      shrink_lruvec+0x8dc/0x1380
      shrink_node+0x317/0xd80
      do_try_to_free_pages+0x1f7/0xa10
      try_to_free_pages+0x26c/0x5e0
      __alloc_pages_slowpath+0x458/0x1290
    
     read to 0xffff956c8fc6c400 of 8 bytes by task 6207 on cpu 63:
      _swap_info_get+0x41/0xa0
      __swap_info_get at mm/swapfile.c:1114
      put_swap_page+0x84/0x490
      __remove_mapping+0x384/0x5f0
      shrink_page_list+0xff1/0x2870
      shrink_inactive_list+0x316/0x880
      shrink_lruvec+0x8dc/0x1380
      shrink_node+0x317/0xd80
      do_try_to_free_pages+0x1f7/0xa10
      try_to_free_pages+0x26c/0x5e0
      __alloc_pages_slowpath+0x458/0x1290
    
    The writes are under si->lock but the reads are not.  For si.highest_bit
    and si.swap_map[offset], a data race could trigger logic bugs, so fix
    them by using WRITE_ONCE() for the writes and READ_ONCE() for the reads.
    The exception is the isolated reads that only compare against zero,
    where a data race would cause no harm; annotate those as intentional
    data races using the data_race() macro.
    
    For si.flags, the readers are only interested in a single bit, so a data
    race there would cause no issue.
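    
    The pattern can be sketched in userspace as follows.  This is only an
    illustration under stated assumptions: the macros are simplified
    stand-ins for the kernel ones, struct swap_area and the three helpers
    are hypothetical, and the lock that the writer would hold is left out.

```c
/* Simplified userspace stand-ins for the kernel macros (assumption). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define data_race(expr)    (expr)   /* the kernel version also informs KCSAN */

/* Hypothetical, heavily reduced model of a swap area. */
struct swap_area {
    unsigned int highest_bit;   /* highest slot index that may be free */
    unsigned char swap_map[8];  /* per-slot use counts */
};

/* Writer side: the caller is assumed to hold the (omitted) per-area lock. */
static void swap_area_set_highest(struct swap_area *si, unsigned int bit)
{
    WRITE_ONCE(si->highest_bit, bit);
}

/* Lockless reader whose result drives later logic: use READ_ONCE(). */
static int swap_area_may_scan(struct swap_area *si, unsigned int offset)
{
    return offset <= READ_ONCE(si->highest_bit);
}

/* Isolated comparison against zero: a stale value is harmless here. */
static int swap_slot_unused(struct swap_area *si, unsigned int offset)
{
    return data_race(si->swap_map[offset] == 0);
}
```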
    
    Link: http://lkml.kernel.org/r/1581095163-12198-1-git-send-email-cai@lca.pw
    Signed-off-by: Qian Cai <cai@lca.pw>
    Cc: Marco Elver <elver@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Qian Cai authored and hnaz committed Aug 1, 2020
  23. mm/filemap.c: fix a data race in filemap_fault()

    struct file_ra_state ra.mmap_miss could be accessed concurrently during
    page faults as noticed by KCSAN,
    
     BUG: KCSAN: data-race in filemap_fault / filemap_map_pages
    
     write to 0xffff9b1700a2c1b4 of 4 bytes by task 3292 on cpu 30:
      filemap_fault+0x920/0xfc0
      do_sync_mmap_readahead at mm/filemap.c:2384
      (inlined by) filemap_fault at mm/filemap.c:2486
      __xfs_filemap_fault+0x112/0x3e0 [xfs]
      xfs_filemap_fault+0x74/0x90 [xfs]
      __do_fault+0x9e/0x220
      do_fault+0x4a0/0x920
      __handle_mm_fault+0xc69/0xd00
      handle_mm_fault+0xfc/0x2f0
      do_page_fault+0x263/0x6f9
      page_fault+0x34/0x40
    
     read to 0xffff9b1700a2c1b4 of 4 bytes by task 3313 on cpu 32:
      filemap_map_pages+0xc2e/0xd80
      filemap_map_pages at mm/filemap.c:2625
      do_fault+0x3da/0x920
      __handle_mm_fault+0xc69/0xd00
      handle_mm_fault+0xfc/0x2f0
      do_page_fault+0x263/0x6f9
      page_fault+0x34/0x40
    
     Reported by Kernel Concurrency Sanitizer on:
     CPU: 32 PID: 3313 Comm: systemd-udevd Tainted: G        W    L 5.5.0-next-20200210+ #1
     Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
    
    ra.mmap_miss contributes to the readahead decisions, so a data race
    could be undesirable.  Both the read and the write are only under the
    non-exclusive mmap_sem, so two concurrent writers could even underflow
    the counter.  Fix the underflow by working on a local variable before
    committing a final store to ra.mmap_miss, given that a small inaccuracy
    of the counter should be acceptable.
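    
    A rough userspace sketch of the "local variable, then one final store"
    idea.  The macros are simplified stand-ins for the kernel ones, and
    struct ra_state, MISS_CAP, ra_note_miss() and ra_note_hit() are
    hypothetical names chosen for illustration.

```c
/* Simplified userspace stand-ins for the kernel macros (assumption). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

#define MISS_CAP 1000u  /* hypothetical upper bound for the counter */

struct ra_state {
    unsigned int mmap_miss;  /* shared, updated without an exclusive lock */
};

/* Record a miss: snapshot once, check the snapshot, store once. */
static void ra_note_miss(struct ra_state *ra)
{
    unsigned int miss = READ_ONCE(ra->mmap_miss);

    if (miss < MISS_CAP)
        WRITE_ONCE(ra->mmap_miss, miss + 1);
}

/* Record a hit: the stored value always derives from the checked snapshot. */
static void ra_note_hit(struct ra_state *ra)
{
    unsigned int miss = READ_ONCE(ra->mmap_miss);

    if (miss)
        WRITE_ONCE(ra->mmap_miss, miss - 1);
}
```

    With a plain "if (ra->mmap_miss) ra->mmap_miss--;", the field may be
    re-read between the check and the decrement, so a concurrent writer
    could take it to zero in between and the decrement would wrap.  Here an
    update can still be lost, but the counter cannot underflow.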
    
    Link: http://lkml.kernel.org/r/20200211030134.1847-1-cai@lca.pw
    Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
    Signed-off-by: Qian Cai <cai@lca.pw>
    Tested-by: Qian Cai <cai@lca.pw>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Marco Elver <elver@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    kiryl authored and hnaz committed Aug 1, 2020
  24. mm/swap_state: mark various intentional data races

    swap_cache_info.* could be accessed concurrently as noticed by
    KCSAN,
    
     BUG: KCSAN: data-race in lookup_swap_cache / lookup_swap_cache
    
     write to 0xffffffff85517318 of 8 bytes by task 94138 on cpu 101:
      lookup_swap_cache+0x12e/0x460
      lookup_swap_cache at mm/swap_state.c:322
      do_swap_page+0x112/0xeb0
      __handle_mm_fault+0xc7a/0xd00
      handle_mm_fault+0xfc/0x2f0
      do_page_fault+0x263/0x6f9
      page_fault+0x34/0x40
    
     read to 0xffffffff85517318 of 8 bytes by task 91655 on cpu 100:
      lookup_swap_cache+0x117/0x460
      lookup_swap_cache at mm/swap_state.c:322
      shmem_swapin_page+0xc7/0x9e0
      shmem_getpage_gfp+0x2ca/0x16c0
      shmem_fault+0xef/0x3c0
      __do_fault+0x9e/0x220
      do_fault+0x4a0/0x920
      __handle_mm_fault+0xc69/0xd00
      handle_mm_fault+0xfc/0x2f0
      do_page_fault+0x263/0x6f9
      page_fault+0x34/0x40
    
     Reported by Kernel Concurrency Sanitizer on:
     CPU: 100 PID: 91655 Comm: systemd-journal Tainted: G        W  O L 5.5.0-next-20200204+ #6
     Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
    
     write to 0xffffffff8d717308 of 8 bytes by task 11365 on cpu 87:
       __delete_from_swap_cache+0x681/0x8b0
       __delete_from_swap_cache at mm/swap_state.c:178
    
     read to 0xffffffff8d717308 of 8 bytes by task 11275 on cpu 53:
       __delete_from_swap_cache+0x66e/0x8b0
       __delete_from_swap_cache at mm/swap_state.c:178
    
    Both the read and the write are lockless.  Since swap_cache_info.* are
    only used to print out counter information, it is harmless even if any
    of them misses a few increments due to data races, so just mark them as
    intentional data races using the data_race() macro.
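    
    As an illustration, statistics macros of this shape can be written as
    single expressions, which also makes a do {} while (0) wrapper
    unnecessary.  This is a userspace sketch: data_race() is a simplified
    stand-in for the kernel macro and record_lookup() is a hypothetical
    caller.

```c
/* Simplified userspace stand-in for the kernel macro (assumption). */
#define data_race(expr) (expr)  /* the kernel version also informs KCSAN */

/* Counters that are only ever printed, never used for decisions. */
static struct {
    unsigned long add_total;
    unsigned long del_total;
    unsigned long find_success;
    unsigned long find_total;
} swap_cache_info;

/* Single-expression macros: no do {} while (0) wrapper is needed. */
#define INC_CACHE_INFO(x)     data_race(swap_cache_info.x++)
#define ADD_CACHE_INFO(x, nr) data_race(swap_cache_info.x += (nr))

/* Hypothetical caller: account for one cache lookup. */
static void record_lookup(int found)
{
    INC_CACHE_INFO(find_total);
    if (found)
        INC_CACHE_INFO(find_success);
}
```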
    
    While at it, fix a checkpatch.pl warning,
    
    WARNING: Single statement macros should not use a do {} while (0) loop
    
    Link: http://lkml.kernel.org/r/20200207003715.1578-1-cai@lca.pw
    Signed-off-by: Qian Cai <cai@lca.pw>
    Cc: Marco Elver <elver@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Qian Cai authored and hnaz committed Aug 1, 2020
  25. mm-page_io-mark-various-intentional-data-races-v2

    add a missing annotation
    
    Link: http://lkml.kernel.org/r/1581612585-5812-1-git-send-email-cai@lca.pw
    Signed-off-by: Qian Cai <cai@lca.pw>
    Cc: Marco Elver <elver@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Qian Cai authored and hnaz committed Aug 1, 2020
  26. mm/page_io: mark various intentional data races

    struct swap_info_struct si.flags could be accessed concurrently as noticed
    by KCSAN,
    
     BUG: KCSAN: data-race in scan_swap_map_slots / swap_readpage
    
     write to 0xffff9c77b80ac400 of 8 bytes by task 91325 on cpu 16:
      scan_swap_map_slots+0x6fe/0xb50
      scan_swap_map_slots at mm/swapfile.c:887
      get_swap_pages+0x39d/0x5c0
      get_swap_page+0x377/0x524
      add_to_swap+0xe4/0x1c0
      shrink_page_list+0x1740/0x2820
      shrink_inactive_list+0x316/0x8b0
      shrink_lruvec+0x8dc/0x1380
      shrink_node+0x317/0xd80
      do_try_to_free_pages+0x1f7/0xa10
      try_to_free_pages+0x26c/0x5e0
      __alloc_pages_slowpath+0x458/0x1290
      __alloc_pages_nodemask+0x3bb/0x450
      alloc_pages_vma+0x8a/0x2c0
      do_anonymous_page+0x170/0x700
      __handle_mm_fault+0xc9f/0xd00
      handle_mm_fault+0xfc/0x2f0
      do_page_fault+0x263/0x6f9
      page_fault+0x34/0x40
    
     read to 0xffff9c77b80ac400 of 8 bytes by task 5422 on cpu 7:
      swap_readpage+0x204/0x6a0
      swap_readpage at mm/page_io.c:380
      read_swap_cache_async+0xa2/0xb0
      swapin_readahead+0x6a0/0x890
      do_swap_page+0x465/0xeb0
      __handle_mm_fault+0xc7a/0xd00
      handle_mm_fault+0xfc/0x2f0
      do_page_fault+0x263/0x6f9
      page_fault+0x34/0x40
    
     Reported by Kernel Concurrency Sanitizer on:
     CPU: 7 PID: 5422 Comm: gmain Tainted: G        W  O L 5.5.0-next-20200204+ #6
     Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
    
    Other reads,
    
     read to 0xffff91ea33eac400 of 8 bytes by task 11276 on cpu 120:
      __swap_writepage+0x140/0xc20
      __swap_writepage at mm/page_io.c:289
    
     read to 0xffff91ea33eac400 of 8 bytes by task 11264 on cpu 16:
      swap_set_page_dirty+0x44/0x1f4
      swap_set_page_dirty at mm/page_io.c:442
    
    The write is under &si->lock, but the reads are lockless.  Since the
    reads only check for a specific bit in the flag, it is harmless even if
    load tearing happens.  Thus, just mark them as intentional data races
    using the data_race() macro.
    
    Link: http://lkml.kernel.org/r/20200207003601.1526-1-cai@lca.pw
    Signed-off-by: Qian Cai <cai@lca.pw>
    Cc: Marco Elver <elver@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Qian Cai authored and hnaz committed Aug 1, 2020
  27. mm/frontswap: mark various intentional data races

    There are a few information counters that are intentionally not protected
    against increment races, so just annotate them using the data_race()
    macro.
    
     BUG: KCSAN: data-race in __frontswap_store / __frontswap_store
    
     write to 0xffffffff8b7174d8 of 8 bytes by task 6396 on cpu 103:
      __frontswap_store+0x2d0/0x344
      inc_frontswap_failed_stores at mm/frontswap.c:70
      (inlined by) __frontswap_store at mm/frontswap.c:280
      swap_writepage+0x83/0xf0
      pageout+0x33e/0xae0
      shrink_page_list+0x1f57/0x2870
      shrink_inactive_list+0x316/0x880
      shrink_lruvec+0x8dc/0x1380
      shrink_node+0x317/0xd80
      do_try_to_free_pages+0x1f7/0xa10
      try_to_free_pages+0x26c/0x5e0
      __alloc_pages_slowpath+0x458/0x1290
      __alloc_pages_nodemask+0x3bb/0x450
      alloc_pages_vma+0x8a/0x2c0
      do_anonymous_page+0x170/0x700
      __handle_mm_fault+0xc9f/0xd00
      handle_mm_fault+0xfc/0x2f0
      do_page_fault+0x263/0x6f9
      page_fault+0x34/0x40
    
     read to 0xffffffff8b7174d8 of 8 bytes by task 6405 on cpu 47:
      __frontswap_store+0x2b9/0x344
      inc_frontswap_failed_stores at mm/frontswap.c:70
      (inlined by) __frontswap_store at mm/frontswap.c:280
      swap_writepage+0x83/0xf0
      pageout+0x33e/0xae0
      shrink_page_list+0x1f57/0x2870
      shrink_inactive_list+0x316/0x880
      shrink_lruvec+0x8dc/0x1380
      shrink_node+0x317/0xd80
      do_try_to_free_pages+0x1f7/0xa10
      try_to_free_pages+0x26c/0x5e0
      __alloc_pages_slowpath+0x458/0x1290
      __alloc_pages_nodemask+0x3bb/0x450
      alloc_pages_vma+0x8a/0x2c0
      do_anonymous_page+0x170/0x700
      __handle_mm_fault+0xc9f/0xd00
      handle_mm_fault+0xfc/0x2f0
      do_page_fault+0x263/0x6f9
      page_fault+0x34/0x40
    
    Link: http://lkml.kernel.org/r/1581114499-5042-1-git-send-email-cai@lca.pw
    Signed-off-by: Qian Cai <cai@lca.pw>
    Cc: Marco Elver <elver@google.com>
    Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Qian Cai authored and hnaz committed Aug 1, 2020
  28. mm/kmemleak: silence KCSAN splats in checksum

    Even if KCSAN is disabled for kmemleak, update_checksum() could still call
    crc32() (which is outside of kmemleak.c) to dereference object->pointer. 
    Thus, the value of object->pointer could be accessed concurrently as
    noticed by KCSAN,
    
     BUG: KCSAN: data-race in crc32_le_base / do_raw_spin_lock
    
     write to 0xffffb0ea683a7d50 of 4 bytes by task 23575 on cpu 12:
      do_raw_spin_lock+0x114/0x200
      debug_spin_lock_after at kernel/locking/spinlock_debug.c:91
      (inlined by) do_raw_spin_lock at kernel/locking/spinlock_debug.c:115
      _raw_spin_lock+0x40/0x50
      __handle_mm_fault+0xa9e/0xd00
      handle_mm_fault+0xfc/0x2f0
      do_page_fault+0x263/0x6f9
      page_fault+0x34/0x40
    
     read to 0xffffb0ea683a7d50 of 4 bytes by task 839 on cpu 60:
      crc32_le_base+0x67/0x350
      crc32_le_base+0x67/0x350:
      crc32_body at lib/crc32.c:106
      (inlined by) crc32_le_generic at lib/crc32.c:179
      (inlined by) crc32_le at lib/crc32.c:197
      kmemleak_scan+0x528/0xd90
      update_checksum at mm/kmemleak.c:1172
      (inlined by) kmemleak_scan at mm/kmemleak.c:1497
      kmemleak_scan_thread+0xcc/0xfa
      kthread+0x1e0/0x200
      ret_from_fork+0x27/0x50
    
    If a shattered value was returned due to a data race, it will be corrected
    in the next scan.  Thus, let KCSAN ignore all reads in the region to
    silence KCSAN in case the write side is non-atomic.
    
    Link: http://lkml.kernel.org/r/20200317182754.2180-1-cai@lca.pw
    Signed-off-by: Qian Cai <cai@lca.pw>
    Suggested-by: Marco Elver <elver@google.com>
    Acked-by: Marco Elver <elver@google.com>
    Acked-by: Catalin Marinas <catalin.marinas@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Qian Cai authored and hnaz committed Aug 1, 2020
  29. s390: fix build error for sys_call_table_emu

    Build error on s390:
    	arch/s390/kernel/entry.o: in function `sys_call_table_emu':
    	>> (.rodata+0x1288): undefined reference to `__s390_'
    
    In commit ("All arch: remove system call sys_sysctl")
     148  common	fdatasync		sys_fdatasync			sys_fdatasync
    -149  common	_sysctl			sys_sysctl			compat_sys_sysctl
    +149  common	_sysctl			sys_ni_syscall
     150  common	mlock			sys_mlock			sys_mlock
    
    After the patch is integrated, there is a format error in the generated
    arch/s390/include/generated/asm/syscall_table.h:
    	SYSCALL(sys_fdatasync, sys_fdatasync)
    	SYSCALL(sys_ni_syscall,) /* cause build error */
    	SYSCALL(sys_mlock,sys_mlock)
    
    According to the guidance of Heiko Carstens, use "-" to fill the empty
    system call.  Similarly, modify
    tools/perf/arch/s390/entry/syscalls/syscall.tbl.
    
    Link: http://lkml.kernel.org/r/20200618141426.16884-1-nixiaoming@huawei.com
    Fixes: ("All arch: remove system call sys_sysctl")
    Fixes: https://lore.kernel.org/linuxppc-dev/20200616030734.87257-1-nixiaoming@huawei.com/
    Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
    Reported-by: kernel test robot <lkp@intel.com>
    Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    nixiaoming authored and hnaz committed Aug 1, 2020
  30. all arch: remove system call sys_sysctl

    Since commit 61a47c1 ("sysctl: Remove the sysctl system call"),
    sys_sysctl is actually unavailable: any input can only return an error.
    
    We have been warning people about using the sysctl system call for years
    and believe there are no more users.  Even if there are users of this
    interface, if they have not complained or fixed their code by now they
    probably are not going to, so there is no point in warning them any
    longer.
    
    So completely remove sys_sysctl on all architectures.
    
    Link: http://lkml.kernel.org/r/20200616030734.87257-1-nixiaoming@huawei.com
    Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
    Acked-by: Will Deacon <will@kernel.org>		[arm/arm64]
    Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Aleksa Sarai <cyphar@cyphar.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Bin Meng <bin.meng@windriver.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Brian Gerst <brgerst@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: chenzefeng <chenzefeng2@huawei.com>
    Cc: Christian Borntraeger <borntraeger@de.ibm.com>
    Cc: Christian Brauner <christian@brauner.io>
    Cc: Chris Zankel <chris@zankel.net>
    Cc: David Howells <dhowells@redhat.com>
    Cc: David S. Miller <davem@davemloft.net>
    Cc: Diego Elio Pettenò <flameeyes@flameeyes.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Dominik Brodowski <linux@dominikbrodowski.net>
    Cc: Fenghua Yu <fenghua.yu@intel.com>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
    Cc: Helge Deller <deller@gmx.de>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Iurii Zaikin <yzaikin@google.com>
    Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
    Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Cc: Kars de Jong <jongk@linux-m68k.org>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Krzysztof Kozlowski <krzk@kernel.org>
    Cc: Luis Chamberlain <mcgrof@kernel.org>
    Cc: Marco Elver <elver@google.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Martin K. Petersen <martin.petersen@oracle.com>
    Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
    Cc: Matt Turner <mattst88@gmail.com>
    Cc: Max Filippov <jcmvbkbc@gmail.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Michal Simek <monstr@monstr.eu>
    Cc: Miklos Szeredi <mszeredi@redhat.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
    Cc: Nick Piggin <npiggin@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Olof Johansson <olof@lixom.net>
    Cc: Paul Burton <paulburton@kernel.org>
    Cc: "Paul E. McKenney" <paulmck@kernel.org>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Randy Dunlap <rdunlap@infradead.org>
    Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
    Cc: Richard Henderson <rth@twiddle.net>
    Cc: Rich Felker <dalias@libc.org>
    Cc: Russell King <linux@armlinux.org.uk>
    Cc: Sami Tolvanen <samitolvanen@google.com>
    Cc: Sargun Dhillon <sargun@sargun.me>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Cc: Sven Schnelle <svens@stackframe.org>
    Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
    Cc: Zhou Yanjie <zhouyanjie@wanyeetech.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    nixiaoming authored and hnaz committed Aug 1, 2020
  31. mm/madvise: check fatal signal pending of target process

    Bail out to prevent unnecessary CPU overhead if target process has pending
    fatal signal during (MADV_COLD|MADV_PAGEOUT) operation.
    
    Link: http://lkml.kernel.org/r/20200302193630.68771-4-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-5-minchan@kernel.org
    Signed-off-by: Minchan Kim <minchan@kernel.org>
    Reviewed-by: Suren Baghdasaryan <surenb@google.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: David Rientjes <rientjes@google.com>
    Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
    Cc: Brian Geffon <bgeffon@google.com>
    Cc: Christian Brauner <christian@brauner.io>
    Cc: Daniel Colascione <dancol@google.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Joel Fernandes <joel@joelfernandes.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: John Dias <joaodias@google.com>
    Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Oleksandr Natalenko <oleksandr@redhat.com>
    Cc: Sandeep Patil <sspatil@google.com>
    Cc: SeongJae Park <sj38.park@gmail.com>
    Cc: SeongJae Park <sjpark@amazon.de>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Sonny Rao <sonnyrao@google.com>
    Cc: Tim Murray <timmurray@google.com>
    Cc: Christian Brauner <christian.brauner@ubuntu.com>
    Cc: <linux-man@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    minchank authored and hnaz committed Aug 1, 2020
  32. mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinti…

    …ng-api-fix
    
    fix arm64 whoops
    
    Cc: Minchan Kim <minchan@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    akpm00 authored and hnaz committed Aug 1, 2020
  33. mm/madvise: introduce process_madvise() syscall: an external memory h…

    …inting API
    
    There is a use case where System Management Software (SMS) wants to give
    a memory hint like MADV_[COLD|PAGEOUT] to other processes; in the case
    of Android, it is the ActivityManagerService.
    
    The information required to make the reclaim decision is not known to
    the app.  Instead, it is known to the centralized userspace
    daemon(ActivityManagerService), and that daemon must be able to
    initiate reclaim on its own without any app involvement.
    
    To solve the issue, this patch introduces a new syscall,
    process_madvise(2).  It uses the pidfd of an external process to give
    the hint.  It also supports a vector of address ranges, because an
    Android app has thousands of vmas due to zygote, so it is a waste of CPU
    and power to call the syscall one by one for each vma.  (Testing a
    2000-vma syscall vs. a 1-vector syscall showed a 15% performance
    improvement.  I think the gain would be bigger in real practice because
    the test ran in a very cache-friendly environment.)
    
    Another potential use case for the vector range is to amortize the cost
    of TLB shootdowns for multiple ranges when using MADV_DONTNEED; this
    could benefit users like TCP receive zerocopy and malloc implementations.
    In the future, we could find more use cases for other advice values, so
    let's make this part of the API now, while we are introducing a new
    syscall.  With that, an existing madvise(2) user could replace it with
    process_madvise(2) on their own pid if they want batched address range
    support.
    
    Since it could affect another process's address range, only a privileged
    process (PTRACE_MODE_ATTACH_FSCREDS), or one that otherwise has the
    right to ptrace the target (e.g., by having the same UID), can use it
    successfully.  The flag argument is reserved for future use if we need
    to extend the API.
    
    I think supporting every hint that madvise supports, or will support, in
    process_madvise is rather risky.  We are not sure that all hints make
    sense when issued from an external process, and the implementation of a
    hint may rely on the caller being in the current context, so it could be
    error-prone.  Thus, I have limited the hints to MADV_[COLD|PAGEOUT] in
    this patch.
    
    If someone wants to add other hints, we can hear the use case and review
    it for each hint.  That is safer for maintenance than introducing a
    buggy syscall that is hard to fix later.
    
    So finally, the API is as follows,
    
          ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                    unsigned long vlen, int advice, unsigned int flags);
    
        DESCRIPTION
          The process_madvise() system call is used to give advice or directions
          to the kernel about address ranges of an external process as well as
          of the local process. It applies the advice to the address ranges of
          the process described by iovec and vlen. The goal of such advice is to
          improve system or application performance.
    
          The pidfd argument is a PID file descriptor that selects the target
          process (see pidfd_open(2) for further information).
    
          The pointer iovec points to an array of iovec structures, defined in
          <sys/uio.h> as:
    
            struct iovec {
                void *iov_base;         /* starting address */
                size_t iov_len;         /* number of bytes to be advised */
            };
    
          Each iovec element describes an address range beginning at iov_base
          and spanning iov_len bytes.
    
          The vlen represents the number of elements in iovec.
    
          The advice is indicated in the advice argument, which is one of the
          following at this moment if the target process specified by pidfd is
          external.
    
            MADV_COLD
            MADV_PAGEOUT
    
          Permission to provide a hint to an external process is governed by a
          ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
    
          process_madvise() supports every advice value that madvise(2) does if
          the target process is in the same thread group as the calling process,
          so users can use process_madvise(2) to extend existing madvise(2)
          usage with support for vectored address ranges.
    
        RETURN VALUE
          On success, process_madvise() returns the number of bytes advised.
          This return value may be less than the total number of requested
          bytes if an error occurred. The caller should check the return value
          to determine whether the advice was applied only partially.
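    
    One way a caller might act on the return value rule above is sketched
    below.  The helper names (total_requested(), classify_result()) and the
    enum are hypothetical and only illustrate comparing the returned byte
    count with the sum of the iov_len fields; they are not part of the
    proposed API.
    
    ```c
    #define _GNU_SOURCE
    #include <assert.h>
    #include <stddef.h>
    #include <sys/types.h>
    #include <sys/uio.h>
    
    /* Hypothetical classification of a process_madvise() return value. */
    enum advise_outcome { ADVISE_FAILED, ADVISE_PARTIAL, ADVISE_COMPLETE };
    
    /* Sum of the bytes described by the first vlen elements of iov. */
    static size_t total_requested(const struct iovec *iov, unsigned long vlen)
    {
        size_t total = 0;
        for (unsigned long i = 0; i < vlen; i++)
            total += iov[i].iov_len;
        return total;
    }
    
    /* ret is the value returned by the syscall for the same iov and vlen. */
    static enum advise_outcome classify_result(ssize_t ret,
                                               const struct iovec *iov,
                                               unsigned long vlen)
    {
        assert(iov != NULL || vlen == 0);
    
        if (ret < 0)
            return ADVISE_FAILED;      /* nothing advised, errno is set */
        if ((size_t)ret < total_requested(iov, vlen))
            return ADVISE_PARTIAL;     /* stopped early, ret bytes advised */
        return ADVISE_COMPLETE;
    }
    ```
    
    On ADVISE_PARTIAL a caller could skip the first ret bytes of the
    vector and decide whether to retry or give up on the remainder.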
    
    FAQ:
    
    Q.1 - Why does any external entity have better knowledge?
    
    Quote from Sandeep
    
    "For Android, every application (including the special SystemServer)
    are forked from Zygote.  The reason of course is to share as many
    libraries and classes between the two as possible to benefit from the
    preloading during boot.
    
    After applications start, (almost) all of the APIs end up calling into
    this SystemServer process over IPC (binder) and back to the
    application.
    
    In a fully running system, the SystemServer monitors every single
    process periodically to calculate their PSS / RSS and also decides
    which process is "important" to the user for interactivity.
    
    So, because of how these processes start _and_ the fact that the
    SystemServer is looping to monitor each process, it does tend to *know*
    which address range of the application is not used / useful.
    
    Besides, we can never rely on applications to clean things up
    themselves.  We've had the "hey app1, the system is low on memory,
    please trim your memory usage down" notifications for a long time[1].
    They rely on applications honoring the broadcasts and very few do.
    
    So, if we want to avoid the inevitable killing of the application and
    restarting it, some way to be able to tell the OS about unimportant
    memory in these applications will be useful.
    
    - ssp
    
    Q.2 - How do we handle the race (i.e., object validation) between an
    external process giving a hint and the target process receiving that
    hint?
    
    process_madvise operates on the target process's address space as it
    exists at the instant that process_madvise is called.  If the target
    process can run between the time the calling process inspects the
    target process's address space and the time that process_madvise is
    actually called, process_madvise may operate on memory regions that
    the calling process does not expect.  It's the
    responsibility of the process calling process_madvise to close this
    race condition.  For example, the calling process can suspend the
    target process with ptrace, SIGSTOP, or the freezer cgroup so that it
    doesn't have an opportunity to change its own address space before
    process_madvise is called.  Another option is to operate on memory
    regions that the caller knows a priori will be unchanged in the target
    process.  Yet another option is to accept the race for certain
    process_madvise calls after reasoning that mistargeting will do no
    harm.  The suggested API itself does not provide synchronization.  The
    same applies to other APIs such as move_pages and process_vm_writev.
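    
    The SIGSTOP option mentioned above could look roughly like the sketch
    below.  run_while_stopped() and count_call() are hypothetical names;
    count_call() merely stands in for the code that would inspect the
    target and call process_madvise.  Note the assumption spelled out in
    the comments: kill() only queues the signal, so the stop can be
    confirmed with waitpid() only when the target is a child of the caller.
    
    ```c
    #define _GNU_SOURCE
    #include <assert.h>
    #include <errno.h>
    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    
    /* Stand-in for the real work (inspect the target, then advise it). */
    static int count_call(void *arg)
    {
        ++*(int *)arg;
        return 0;
    }
    
    /* Hypothetical helper: run fn(arg) while pid is stopped, then resume it.
     * Returns fn's result, or -1 if the target could not be stopped. */
    static int run_while_stopped(pid_t pid, int (*fn)(void *), void *arg)
    {
        assert(fn != NULL);
    
        if (kill(pid, SIGSTOP) < 0)
            return -1;                 /* e.g. ESRCH or EPERM */
    
        /* kill() returns once the signal is queued, not once the target has
         * stopped. For a child we can wait for the stop to be reported. */
        int status;
        pid_t w = waitpid(pid, &status, WUNTRACED);
        if (w == pid && !WIFSTOPPED(status)) {
            errno = ESRCH;             /* the child exited instead of stopping */
            return -1;
        }
        /* If waitpid() failed (ECHILD: not our child) the stop is unconfirmed
         * and a real caller would need another way to verify it. */
    
        int ret = fn(arg);
    
        int saved = errno;
        kill(pid, SIGCONT);            /* always let the target run again */
        errno = saved;
        return ret;
    }
    ```
    
    The freezer cgroup or ptrace would follow the same stop, act, resume
    shape with a different stop mechanism.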
    
    The race isn't really a problem though.  Why is it so wrong to require
    that callers do their own synchronization in some manner?  Nobody
    objects to write(2) merely because it's possible for two processes to
    open the same file and clobber each other's writes --- instead, we tell
    people to use flock or something.  Think about mmap.  It never
    guarantees newly allocated address space is still valid when the user
    tries to access it because other threads could unmap the memory right
    before.  That's where we need synchronization through another API or
    through userspace design.  It shouldn't be part of the API itself.  If
    someone needs synchronization finer-grained than process level, two
    ideas were suggested: cookie[2] and anon-fd[3].  Both could be
    implemented using the last, reserved argument of the API, but I don't
    think that's necessary right now since we already have ways to prevent
    the race, so I don't want to add complexity with a more fine-grained
    synchronization model.
    
    To keep the API extensible, the last argument is reserved so we could
    support this in the future if someone really needs it.
    
    Q.3 - Why doesn't ptrace work?
    
    Injecting an madvise in the target process using ptrace would not work
    for us because such an injected madvise would have to be executed by the
    target process, which means that process would have to be runnable, and
    that creates the risk of the abovementioned race and of hinting the wrong
    VMA.  Furthermore, we want to act on the hint in the caller's context,
    not the callee's, because the callee is usually restricted by
    cpuset/cgroups or even frozen, so it can't act by itself quickly enough,
    which causes more thrashing/kills.  It also doesn't work if the target
    process is already being ptraced (e.g., by strace, a debugger, or
    minidump) because a process can have at most one ptracer.
    
    [1] https://developer.android.com/topic/performance/memory"
    
    [2] process_getinfo for getting the cookie, which is updated whenever
        the VMA layout of the process address space changes - Daniel Colascione -
        https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
    
    [3] anonymous fd which is used for the object(i.e., address range)
        validation - Michal Hocko -
        https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
    
    [minchan@kernel.org: fix process_madvise build break for arm64]
      Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
    [minchan@kernel.org: fix build error for mips of process_madvise]
      Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
    [akpm@linux-foundation.org: fix patch ordering issue]
    Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
    Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
    Signed-off-by: Minchan Kim <minchan@kernel.org>
    Reviewed-by: Suren Baghdasaryan <surenb@google.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: David Rientjes <rientjes@google.com>
    Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
    Cc: Brian Geffon <bgeffon@google.com>
    Cc: Christian Brauner <christian@brauner.io>
    Cc: Daniel Colascione <dancol@google.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Joel Fernandes <joel@joelfernandes.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: John Dias <joaodias@google.com>
    Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Oleksandr Natalenko <oleksandr@redhat.com>
    Cc: Sandeep Patil <sspatil@google.com>
    Cc: SeongJae Park <sj38.park@gmail.com>
    Cc: SeongJae Park <sjpark@amazon.de>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Sonny Rao <sonnyrao@google.com>
    Cc: Tim Murray <timmurray@google.com>
    Cc: Christian Brauner <christian.brauner@ubuntu.com>
    Cc: <linux-man@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    minchank authored and hnaz committed Aug 1, 2020
  34. pid: move pidfd_get_pid() to pid.c

    The process_madvise syscall needs the pidfd_get_pid() function to
    translate a pidfd to a pid, so this patch moves the function to
    kernel/pid.c.
    
    Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-3-minchan@kernel.org
    Signed-off-by: Minchan Kim <minchan@kernel.org>
    Reviewed-by: Suren Baghdasaryan <surenb@google.com>
    Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
    Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: David Rientjes <rientjes@google.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Jann Horn <jannh@google.com>
    Cc: Brian Geffon <bgeffon@google.com>
    Cc: Daniel Colascione <dancol@google.com>
    Cc: Joel Fernandes <joel@joelfernandes.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: John Dias <joaodias@google.com>
    Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Oleksandr Natalenko <oleksandr@redhat.com>
    Cc: Sandeep Patil <sspatil@google.com>
    Cc: SeongJae Park <sj38.park@gmail.com>
    Cc: SeongJae Park <sjpark@amazon.de>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Sonny Rao <sonnyrao@google.com>
    Cc: Tim Murray <timmurray@google.com>
    Cc: Christian Brauner <christian.brauner@ubuntu.com>
    Cc: <linux-man@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    minchank authored and hnaz committed Aug 1, 2020