NeilBrown/Repa…
Commits on Dec 16, 2021
-
NFS: swap-out must always use STABLE writes.
The commit handling code is not safe against memory-pressure deadlocks when writing to swap. In particular, nfs_commitdata_alloc() blocks indefinitely waiting for memory, and this can consume all available workqueue threads. Swap-out most likely uses STABLE writes anyway, as COND_STABLE indicates that a stable write should be used if the write fits in a single request, and it normally does. However, if we ever swap with a small wsize, or gather unusually large numbers of pages for a single write, this might change. For safety, make it explicit in the code that direct writes used for swap must always use FLUSH_STABLE. Signed-off-by: NeilBrown <neilb@suse.de>
-
NFSv4: keep state manager thread active if swap is enabled
If we are swapping over NFSv4, we may not be able to allocate memory to start the state-manager thread at the time when we need it. So keep it always running when swap is enabled, and just signal it to start. This requires updating and testing the cl_swapper count on the root rpc_clnt after following all ->cl_parent links. Signed-off-by: NeilBrown <neilb@suse.de>
-
SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
rpc tasks can be marked as RPC_TASK_SWAPPER. This causes GFP_MEMALLOC to be used for some allocations. This is needed in some cases, but it is currently provided in some places where it is not needed and missing in some where it is. Currently *all* tasks associated with a rpc_client on which swap is enabled get the flag and hence some GFP_MEMALLOC support. GFP_MEMALLOC is provided for ->buf_alloc() but only swap-writes need it. However xdr_alloc_bvec does not get GFP_MEMALLOC - though it often does need it. xdr_alloc_bvec is called while the XPRT_LOCK is held. If this blocks, then it blocks all other queued tasks. So this allocation needs GFP_MEMALLOC for *all* requests, not just writes, when the xprt is used for any swap writes. Similarly, if the transport is not connected, that will block all requests including swap writes, so memory allocations should get GFP_MEMALLOC if swap writes are possible. So with this patch: 1/ we ONLY set RPC_TASK_SWAPPER for swap writes. 2/ __rpc_execute() sets PF_MEMALLOC while handling any task with RPC_TASK_SWAPPER set, or when handling any task that holds the XPRT_LOCKED lock on an xprt used for swap. This removes the need for the RPC_IS_SWAPPER() test in ->buf_alloc handlers. 3/ xprt_prepare_transmit() sets PF_MEMALLOC after locking any task to a swapper xprt. __rpc_execute() will clear it. 4/ PF_MEMALLOC is set for all the connect workers. Signed-off-by: NeilBrown <neilb@suse.de>
-
NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
NFS_RPC_SWAPFLAGS is only used for READ requests. It sets RPC_TASK_SWAPPER which gives some memory-allocation priority to requests. This is not needed for swap READ - though it is for writes where it is set via a different mechanism. RPC_TASK_ROOTCREDS causes the 'machine' credential to be used. This is not needed as the root credential is saved when the swap file is opened, and this is used for all IO. So NFS_RPC_SWAPFLAGS isn't needed, and as it is the only user of RPC_TASK_ROOTCREDS, that isn't needed either. Remove both. Signed-off-by: NeilBrown <neilb@suse.de>
-
SUNRPC: remove scheduling boost for "SWAPPER" tasks.
Currently, tasks marked as "swapper" tasks get put to the front of non-priority rpc_queues, and are sorted earlier than non-swapper tasks on the transport's ->xmit_queue. This is pointless as currently *all* tasks for a mount that has swap enabled on *any* file are marked as "swapper" tasks. So the net result is that the non-priority rpc_queues are reverse-ordered (LIFO). This scheduling boost is not necessary to avoid deadlocks, and hurts fairness, so remove it. If there were a need to expedite some requests, the tk_priority mechanism is a more appropriate tool. Signed-off-by: NeilBrown <neilb@suse.de>
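The reverse-ordering effect described above can be reproduced with a toy queue. This is only a sketch: `toy_queue`, `toy_enqueue` and `toy_dequeue` are hypothetical names and the array-backed queue is an assumption for illustration, not the kernel's rpc_wait_queue. When every task is "boosted" to the head, the queue drains in LIFO order.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 16

/* Hypothetical fixed-size queue of task ids (illustration only). */
struct toy_queue {
    int ids[QUEUE_CAP];
    size_t len;
};

/* Add a task: boosted tasks go to the head, all others to the tail. */
static void toy_enqueue(struct toy_queue *q, int id, int boosted)
{
    assert(q->len < QUEUE_CAP);
    if (boosted) {
        /* Shift existing entries right to make room at the head. */
        for (size_t i = q->len; i > 0; i--)
            q->ids[i] = q->ids[i - 1];
        q->ids[0] = id;
    } else {
        q->ids[q->len] = id;
    }
    q->len++;
}

/* Remove and return the task at the head of the queue. */
static int toy_dequeue(struct toy_queue *q)
{
    assert(q->len > 0);
    int id = q->ids[0];
    for (size_t i = 1; i < q->len; i++)
        q->ids[i - 1] = q->ids[i];
    q->len--;
    return id;
}
```

Enqueuing tasks 1, 2, 3 with the boost set on all of them and then dequeuing yields 3, 2, 1; with the boost cleared on all of them the order is 1, 2, 3.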
-
SUNRPC/xprt: async tasks mustn't block waiting for memory
When memory is short, new worker threads cannot be created and we depend on the minimum one rpciod thread to be able to handle everything. So it must not block waiting for memory. xprt_dynamic_alloc_slot can block indefinitely. This can tie up all workqueue threads and NFS can deadlock. So when called from a workqueue, set __GFP_NORETRY. The rdma alloc_slot already does not block. However it sets the error to -EAGAIN suggesting this will trigger a sleep. It does not. As we can see in call_reserveresult(), only -ENOMEM causes a sleep. -EAGAIN causes immediate retry. Signed-off-by: NeilBrown <neilb@suse.de>
-
SUNRPC/auth: async tasks mustn't block waiting for memory
When memory is short, new worker threads cannot be created and we depend on the minimum one rpciod thread to be able to handle everything. So it must not block waiting for memory. mempools are particularly a problem as memory can only be released back to the mempool by an async rpc task running. If all available workqueue threads are waiting on the mempool, no thread is available to return anything. lookup_cred() can block on a mempool or kmalloc - and this can cause deadlocks. So add a new RPCAUTH_LOOKUP flag for async lookups and don't block on memory. If the -ENOMEM gets back to call_refreshresult(), wait a short while and try again. HZ>>4 is chosen as it is used elsewhere for -ENOMEM retries. Signed-off-by: NeilBrown <neilb@suse.de>
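For a sense of scale, HZ>>4 is HZ/16 jiffies, roughly a sixteenth of a second whatever the tick rate. The sketch below works that out; the helper names are hypothetical and the tick rate is passed as a parameter rather than taken from a kernel configuration.

```c
#include <assert.h>

/* Retry delay in jiffies for a given tick rate: hz >> 4 is hz / 16. */
static unsigned int retry_delay_jiffies(unsigned int hz)
{
    assert(hz > 0);
    return hz >> 4;
}

/* The same delay in whole milliseconds (one jiffy lasts 1000 / hz ms). */
static unsigned int retry_delay_ms(unsigned int hz)
{
    return retry_delay_jiffies(hz) * 1000u / hz;
}
```

With a tick rate of 1000 this gives 62 jiffies (62 ms); with 250 it gives 15 jiffies (60 ms); with 100 it gives 6 jiffies (60 ms).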
-
SUNRPC/call_alloc: async tasks mustn't block waiting for memory
When memory is short, new worker threads cannot be created and we depend on the minimum one rpciod thread to be able to handle everything. So it must not block waiting for memory. mempools are particularly a problem as memory can only be released back to the mempool by an async rpc task running. If all available workqueue threads are waiting on the mempool, no thread is available to return anything. rpc_malloc() can block, and this might cause deadlocks. So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if blocking is acceptable. Signed-off-by: NeilBrown <neilb@suse.de>
-
NFS: swap IO handling is slightly different for O_DIRECT IO
1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding possible deadlocks with "fs_reclaim". These deadlocks could, I believe, eventuate if a buffered read on the swapfile was attempted. We don't need coherence with the page cache for a swap file, and buffered writes are forbidden anyway. There is no other need for i_rwsem during direct IO. So never take it for swap_rw(). 2/ generic_write_checks() explicitly forbids writes to swap, and performs checks that are not needed for swap. So bypass it for swap_rw(). Signed-off-by: NeilBrown <neilb@suse.de>
-
NFS: rename nfs_direct_IO and use as ->swap_rw
nfs_direct_IO() exists to support swap IO, but hasn't worked for a while. We now need a ->swap_rw function which behaves slightly differently, returning zero for success rather than a byte count. So modify nfs_direct_IO accordingly, rename it, and use it as the ->swap_rw function. Note: it still won't work - that will be fixed in later patches. Signed-off-by: NeilBrown <neilb@suse.de>
-
MM: Add AS_CAN_DIO mapping flag
Currently various places test if direct IO is possible on a file by checking for the existence of the direct_IO address space operation. This is a poor choice, as the direct_IO operation may not be used - it is only used if the generic_file_*_iter functions are called for direct IO and some filesystems - particularly NFS - don't do this. Instead, introduce a new mapping flag: AS_CAN_DIO and change the various places to check this (avoiding a pointer dereference). unlock_new_inode() will set this flag if ->direct_IO is present, so filesystems do not need to be changed. NFS *is* changed, to set the flag explicitly and discard the direct_IO entry in the address_space_operations for files. Signed-off-by: NeilBrown <neilb@suse.de>
-
MM: submit multipage write for SWP_FS_OPS swap-space
swap_writepage() is given one page at a time, but may be called repeatedly in succession. For block-device swapspace, the blk_plug functionality allows the multiple pages to be combined together at lower layers. That cannot be used for SWP_FS_OPS as blk_plug may not exist - it is only active when CONFIG_BLOCK=y. Consequently all swap writes over NFS are single page writes. With this patch we pass a pointer-to-pointer via the wbc, where swap_writepage can store state between calls - much like the pointer passed explicitly to swap_readpage. After calling swap_writepage() some number of times, the state will be passed to swap_write_unplug() which can submit the combined request. Signed-off-by: NeilBrown <neilb@suse.de>
-
MM: submit multipage reads for SWP_FS_OPS swap-space
swap_readpage() is given one page at a time, but may be called repeatedly in succession. For block-device swapspace, the blk_plug functionality allows the multiple pages to be combined together at lower layers. That cannot be used for SWP_FS_OPS as blk_plug may not exist - it is only active when CONFIG_BLOCK=y. Consequently all swap reads over NFS are single page reads. With this patch we pass in a pointer-to-pointer where swap_readpage can store state between calls - much like the effect of blk_plug. After calling swap_readpage() some number of times, the state will be passed to swap_read_unplug() which can submit the combined request. Some callers currently call blk_finish_plug() *before* the final call to swap_readpage(), so the last page cannot be included. This patch moves blk_finish_plug() to after the last call, and calls swap_read_unplug() there too. Signed-off-by: NeilBrown <neilb@suse.de>
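The pointer-to-pointer batching pattern can be sketched in user space as follows. Everything here is an assumption for illustration: `toy_batch`, `toy_readpage` and `toy_read_unplug` are hypothetical stand-ins, pages are plain integers, and "submitting" only bumps a counter instead of issuing IO.

```c
#include <assert.h>
#include <stdlib.h>

#define BATCH_MAX 8

/* Hypothetical per-caller state kept behind the pointer-to-pointer. */
struct toy_batch {
    int pages[BATCH_MAX];
    int count;
};

/* Combined requests submitted so far, and the size of the latest one. */
static int toy_submitted;
static int toy_last_size;

/* Submit whatever has been gathered and reset *plug to NULL. */
static void toy_read_unplug(struct toy_batch **plug)
{
    if (*plug == NULL)
        return;
    toy_submitted++;
    toy_last_size = (*plug)->count;
    free(*plug);
    *plug = NULL;
}

/* Queue one page; with plug == NULL the page is submitted on its own. */
static int toy_readpage(int page, struct toy_batch **plug)
{
    if (plug == NULL) {
        toy_submitted++;
        toy_last_size = 1;
        return 0;
    }
    if (*plug != NULL && (*plug)->count == BATCH_MAX)
        toy_read_unplug(plug);      /* batch is full: flush it first */
    if (*plug == NULL) {
        *plug = calloc(1, sizeof(**plug));
        if (*plug == NULL)
            return -1;              /* allocation failed */
    }
    assert((*plug)->count < BATCH_MAX);
    (*plug)->pages[(*plug)->count++] = page;
    return 0;
}
```

A caller keeps a `struct toy_batch *plug = NULL`, passes `&plug` to each `toy_readpage()` call, and calls `toy_read_unplug(&plug)` only after the last page has been queued, so that the final page is part of the combined request.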
-
MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space
If swap-out is using filesystem operations (SWP_FS_OPS), then it is not safe to enter the FS for reclaim. So only down-grade the requirement for swap pages to __GFP_IO after checking that SWP_FS_OPS are not being used. Signed-off-by: NeilBrown <neilb@suse.de>
-
MM: perform async writes to SWP_FS_OPS swap-space
Writes to SWP_FS_OPS swapspace are currently synchronous. To make them async we need to allocate the kiocb struct, which may block, but won't block for as long as waiting for the write to complete would. Signed-off-by: NeilBrown <neilb@suse.de>
-
MM: use ->swap_rw for reads from SWP_FS_OPS swap-space
To submit an async read with ->swap_rw() we need to allocate a structure to hold the kiocb and other details. swap_readpage() cannot handle transient failure, so create a mempool to provide the structures. Signed-off-by: NeilBrown <neilb@suse.de>
-
MM: create new mm/swap.h header file.
Many functions declared in include/linux/swap.h are only used within mm/. Create a new "mm/swap.h" and move some of these declarations there. Remove the redundant 'extern' from the function declarations. Signed-off-by: NeilBrown <neilb@suse.de>
-
Structural cleanup for filesystem-based swap
Linux primarily uses IO to block devices for swap, but can send the IO requests to a filesystem. This has only ever worked for NFS, and that hasn't worked for a while due to a lack of testing. This seems like a good time for some tidy-up before restoring swap-over-NFS functionality. This patch: - updates the documentation (both copies!) for swap_activate which is woefully out-of-date - introduces a new address_space operation "swap_rw" for swap IO. The code currently used ->readpage for reads and ->direct_IO for writes. The former imposes a limit of one-page-at-a-time, the latter means that direct writes and swap writes are encouraged to use the same path. While similar, swap can often be simpler as it can assume that no allocation is needed, and coherence with the page cache is irrelevant. - moves the responsibility for setting SWP_FS_OPS to ->swap_activate() and also requires it to always call add_swap_extent(). This makes it much easier to find filesystems that require SWP_FS_OPS. - drops the call to the filesystem for ->set_page_dirty(). These pages do not belong to the filesystem, and it has no interest in the dirty status. Writeout is switched to ->swap_rw, but read-in is not as that requires too much change for this patch. Both cifs and nfs set SWP_FS_OPS but neither provides a swap_rw, so both will now fail to activate swap. cifs never really tried to provide swap support as ->direct_IO always returns an error. NFS will be fixed up with following patches. Signed-off-by: NeilBrown <neilb@suse.de>
-
cifs: ignore resource_id while getting fscache super cookie
We have a cyclic dependency between fscache super cookie and root inode cookie. The super cookie relies on tcon->resource_id, which gets populated from the root inode number. However, fetching the root inode initializes inode cookie as a child of super cookie, which is yet to be populated. resource_id is only used as auxdata to check the validity of super cookie. We can completely avoid setting resource_id to remove the circular dependency. Since vol creation time and vol serial numbers are used for auxdata, we should be fine. Additionally, there will be auxiliary data check for each inode cookie as well. Fixes: 5bf91ef ("cifs: wait for tcon resource_id before getting fscache super") CC: David Howells <dhowells@redhat.com> Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
-
cifs: sanitize multiple delimiters in prepath
mount.cifs can pass a device with multiple delimiters in it. This will cause rename(2) to fail with ENOENT. BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=2031200 Fixes: 24e0a1e ("cifs: switch to new mount api") Cc: stable@vger.kernel.org # 5.11+ Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Thiago Rafael Becker <trbecker@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
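As an illustration of what sanitizing such a path involves, the sketch below collapses runs of a delimiter in place. It is a minimal user-space example under stated assumptions: `toy_collapse_delims` is a hypothetical helper, not the function added by the patch, and it only merges repeated delimiters without touching leading or trailing ones.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Collapse every run of `delim` in `path` to a single delimiter, in place.
 * Returns the new length of the string.
 */
static size_t toy_collapse_delims(char *path, char delim)
{
    assert(path != NULL);
    if (strchr(path, delim) == NULL)
        return strlen(path);        /* no delimiter: nothing to collapse */

    char *out = path;
    char prev = '\0';
    for (const char *in = path; *in != '\0'; in++) {
        if (*in == delim && prev == delim)
            continue;               /* skip a repeated delimiter */
        *out++ = *in;
        prev = *in;
    }
    *out = '\0';
    return (size_t)(out - path);
}
```

Given the buffer "dir1//dir2///file" and the delimiter '/', the buffer becomes "dir1/dir2/file" and the function returns 14.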
Commits on Dec 12, 2021
-
Merge tag 'usb-5.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel…
…/git/gregkh/usb Pull USB fixes from Greg KH: "Here are some small USB fixes for 5.16-rc5. They include: - gadget driver fixes for reported issues - xhci fixes for reported problems. - config endpoint parsing fixes for where we got bitfields wrong Most of these have been in linux-next, the remaining few were not, but got lots of local testing in my systems and in some cloud testing infrastructures" * tag 'usb-5.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: usb: core: config: using bit mask instead of individual bits usb: core: config: fix validation of wMaxPacketValue entries USB: gadget: zero allocate endpoint 0 buffers USB: gadget: detect too-big endpoint 0 requests xhci: avoid race between disable slot command and host runtime suspend xhci: Remove CONFIG_USB_DEFAULT_PERSIST to prevent xHCI from runtime suspending Revert "usb: dwc3: dwc3-qcom: Enable tx-fifo-resize property by default"
-
Merge tag 'char-misc-5.16-rc5' of git://git.kernel.org/pub/scm/linux/…
…kernel/git/gregkh/char-misc Pull char/misc driver fixes from Greg KH: "Here are a bunch of small char/misc and other driver subsystem fixes. Included in here are: - iio driver fixes for reported problems - phy driver fixes for a number of reported problems - mhi resume bugfix for broken hardware - nvmem driver fix - rtsx driver fix for irq issues - fastrpc packet parsing fix All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-5.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (33 commits) bus: mhi: core: Add support for forced PM resume iio: trigger: stm32-timer: fix MODULE_ALIAS misc: rtsx: Avoid mangling IRQ during runtime PM nvmem: eeprom: at25: fix FRAM byte_len misc: fastrpc: fix improper packet size calculation MAINTAINERS: add maintainer for Qualcomm FastRPC driver bus: mhi: pci_generic: Fix device recovery failed issue iio: adc: stm32: fix null pointer on defer_probe error phy: HiSilicon: Fix copy and paste bug in error handling dt-bindings: phy: zynqmp-psgtr: fix USB phy name phy: ti: omap-usb2: Fix the kernel-doc style phy: qualcomm: ipq806x-usb: Fix kernel-doc style iio: at91-sama5d2: Fix incorrect sign extension iio: adc: axp20x_adc: fix charging current reporting on AXP22x iio: gyro: adxrs290: fix data signedness phy: ti: tusb1210: Fix the kernel-doc warn phy: qualcomm: usb-hsic: Fix the kernel-doc warn phy: qualcomm: qmp: Add missing struct documentation phy: mvebu-cp110-utmi: Fix kernel-doc warns iio: ad7768-1: Call iio_trigger_notify_done() on error ...
-
Merge tag 'timers-urgent-2021-12-12' of git://git.kernel.org/pub/scm/…
…linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "Two fixes for clock chip drivers: - A regression fix for the Designware APB timer. A recent change to the error checking code transformed the error condition wrongly so it turned into a fail-if-good condition. - Fix a clang build failure of the ARM architected timer driver" * tag 'timers-urgent-2021-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource/drivers/arm_arch_timer: Force inlining of erratum_set_next_event_generic() clocksource/drivers/dw_apb_timer_of: Fix probe failure
-
Merge tag 'irq-urgent-2021-12-12' of git://git.kernel.org/pub/scm/lin…
…ux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A set of interrupt chip driver fixes: - Fix the multi vector MSI allocation on Armada 370XP - Do interrupt acknowledgement correctly in the aspeed-scu driver - Make the IPR register offset correct in the NVIC driver - Make redistribution table flushing correct by issuing a SYNC command to ensure that the invalidation command has been executed - Plug a device tree node reference leak in the bcm7120-l2 driver - Trivial fixes in the MIPS GIC and the Apple AIC drivers" * tag 'irq-urgent-2021-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/irq-bcm7120-l2: Add put_device() after of_find_device_by_node() irqchip/irq-gic-v3-its.c: Force synchronisation when issuing INVALL irqchip/apple-aic: Mark aic_init_smp() as __init irqchip: nvic: Fix offset for Interrupt Priority Offsets irqchip/mips-gic: Use bitfield helpers irqchip/aspeed-scu: Replace update_bits with write_bits. irqchip/armada-370-xp: Fix support for Multi-MSI interrupts irqchip/armada-370-xp: Fix return value of armada_370_xp_msi_alloc()
-
Merge tag 'sched-urgent-2021-12-12' of git://git.kernel.org/pub/scm/l…
…inux/kernel/git/tip/tip Pull scheduler fix from Thomas Gleixner: "A single fix for the x86 scheduler topology: Using cluster topology on hybrid CPUs, e.g. Alder Lake, biases the scheduler towards the ATOM cluster as that has more total capacity. Use selection based on CPU priority instead" * tag 'sched-urgent-2021-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched,x86: Don't use cluster topology for x86 hybrid CPUs
-
Merge tag 'csky-for-linus-5.16-rc5' of git://github.com/c-sky/csky-linux
Pull csky from Guo Ren: "Only one fix for csky: fix fpu config macro" * tag 'csky-for-linus-5.16-rc5' of git://github.com/c-sky/csky-linux: csky: fix typo of fpu config macro
-
usb: core: config: using bit mask instead of individual bits
Using standard USB_EP_MAXP_MULT_MASK instead of individual bits for extracting multiple-transactions bits from wMaxPacketSize value. Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Pavel Hofman <pavel.hofman@ivitera.com> Link: https://lore.kernel.org/r/20211210085219.16796-2-pavel.hofman@ivitera.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
usb: core: config: fix validation of wMaxPacketValue entries
The checks performed by commit aed9d65 ("USB: validate wMaxPacketValue entries in endpoint descriptors") require that initial value of the maxp variable contains both maximum packet size bits (10..0) and multiple-transactions bits (12..11). However, the existing code assigns only the maximum packet size bits. This patch assigns all bits of wMaxPacketSize to the variable. Fixes: aed9d65 ("USB: validate wMaxPacketValue entries in endpoint descriptors") Cc: stable <stable@vger.kernel.org> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Pavel Hofman <pavel.hofman@ivitera.com> Link: https://lore.kernel.org/r/20211210085219.16796-1-pavel.hofman@ivitera.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
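The bit layout named in the message (packet size in bits 10..0, multiple-transactions field in bits 12..11) can be sketched as below. The `TOY_` macros and helper functions are hypothetical local definitions written for this example, not the kernel's own macros; they only show why keeping just the low eleven bits loses the field the validation needs.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative masks for the layout described above (hypothetical names). */
#define TOY_MAXP_SIZE_MASK   0x07ffu                        /* bits 10..0  */
#define TOY_MAXP_MULT_SHIFT  11
#define TOY_MAXP_MULT_MASK   (0x3u << TOY_MAXP_MULT_SHIFT)  /* bits 12..11 */

/* Build a wMaxPacketSize value from its two fields. */
static uint16_t toy_maxp_make(unsigned int size, unsigned int mult)
{
    assert(size <= TOY_MAXP_SIZE_MASK);
    assert(mult <= 3u);
    return (uint16_t)(size | (mult << TOY_MAXP_MULT_SHIFT));
}

/* Extract the maximum packet size field. */
static unsigned int toy_maxp_size(uint16_t w)
{
    return w & TOY_MAXP_SIZE_MASK;
}

/* Extract the multiple-transactions field using the mask and shift. */
static unsigned int toy_maxp_mult(uint16_t w)
{
    return (w & TOY_MAXP_MULT_MASK) >> TOY_MAXP_MULT_SHIFT;
}
```

For example, `toy_maxp_make(1024, 2)` is 0x1400: `toy_maxp_size()` returns 1024 and `toy_maxp_mult()` returns 2, whereas a variable holding only `w & TOY_MAXP_SIZE_MASK` would carry 0x0400 and a multiple-transactions field of zero.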
-
USB: gadget: zero allocate endpoint 0 buffers
Under some conditions, USB gadget devices can show allocated buffer contents to a host. Fix this up by zero-allocating them so that any extra data will all just be zeros. Reported-by: Szymon Heidrich <szymon.heidrich@gmail.com> Tested-by: Szymon Heidrich <szymon.heidrich@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
USB: gadget: detect too-big endpoint 0 requests
Sometimes USB hosts can ask for buffers that are too large from endpoint 0, which should not be allowed. If this happens for OUT requests, stall the endpoint, but for IN requests, trim the request size to the endpoint buffer size. Co-developed-by: Szymon Heidrich <szymon.heidrich@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/g…
…it/jejb/scsi Pull SCSI fixes from James Bottomley: "Four fixes, all in drivers. Three are small and obvious, the qedi one is a bit larger but also pretty obvious" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: qla2xxx: Format log strings only if needed scsi: scsi_debug: Fix buffer size of REPORT ZONES command scsi: qedi: Fix cmd_cleanup_cmpl counter mismatch issue scsi: pm80xx: Do not call scsi_remove_host() in pm8001_alloc()
-
Merge tag 'xfs-5.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/x…
…fs-linux Pull xfs fix from Darrick Wong: "This fixes a race between a readonly remount process and other processes that hold a file IOLOCK on files that previously experienced copy on write, that could result in severe filesystem corruption if the filesystem is then remounted rw. I think this is fairly rare (since the only reliable reproducer I have that fits the second criterion is the experimental xfs_scrub program), but the race is clear, so we still need to fix this. Summary: - Fix a data corruption vector that can result from the ro remount process failing to clear all speculative preallocations from files and the rw remount process not noticing the incomplete cleanup" * tag 'xfs-5.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: remove all COW fork extents when remounting readonly
-
Merge branch 'for-5.16-fixes' of git://git.kernel.org/pub/scm/linux/k…
…ernel/git/dennis/percpu Pull percpu fixes from Dennis Zhou: "This contains a fix for SMP && !MMU archs for percpu which has been tested by arm and sh. It seems in the past they have gotten away with it due to mapping of vm functions to km functions, but this fell apart a few releases ago and was just reported recently. The other is just a minor dependency clean up. I think queued up right now by Andrew is a fix in percpu that papers over what seems to be a bug in hotplug for a special situation with memoryless nodes. Michal Hocko is digging into it further" * 'for-5.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu: percpu_ref: Replace kernel.h with the necessary inclusions percpu: km: ensure it is used with NOMMU (either UP or SMP)
Commits on Dec 11, 2021
-
Merge tag 'perf-tools-fixes-for-v5.16-2021-12-11' of git://git.kernel…
….org/pub/scm/linux/kernel/git/acme/linux Pull perf tools fixes from Arnaldo Carvalho de Melo: - Prevent out-of-bounds access to per sample registers. - Fix NULL vs IS_ERR_OR_NULL() checking on the python binding. - Intel PT fixes, half of those are one-liners: - Fix some PGE (packet generation enable/control flow packets) usage. - Fix sync state when a PSB (synchronization) packet is found. - Fix intel_pt_fup_event() assumptions about setting state type. - Fix state setting when receiving overflow (OVF) packet. - Fix next 'err' value, walking trace. - Fix missing 'instruction' events with 'q' option. - Fix error timestamp setting on the decoder error path. * tag 'perf-tools-fixes-for-v5.16-2021-12-11' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: perf python: Fix NULL vs IS_ERR_OR_NULL() checking perf intel-pt: Fix error timestamp setting on the decoder error path perf intel-pt: Fix missing 'instruction' events with 'q' option perf intel-pt: Fix next 'err' value, walking trace perf intel-pt: Fix state setting when receiving overflow (OVF) packet perf intel-pt: Fix intel_pt_fup_event() assumptions about setting state type perf intel-pt: Fix sync state when a PSB (synchronization) packet is found perf intel-pt: Fix some PGE (packet generation enable/control flow packets) usage perf tools: Prevent out-of-bounds access to registers