Commits on Dec 10, 2021
-
blktrace: switch trace spinlock to a raw spinlock
TRACE_EVENT disables preemption before calling the callback. Because of that, blktrace triggers the following bug under PREEMPT_RT:

BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:35
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 119, name: kworker/u2:2
5 locks held by kworker/u2:2/119:
 #0: ffff8c2e4a88f538 ((wq_completion)xfs-cil/dm-0){+.+.}-{0:0}, at: process_one_work+0x200/0x450
 #1: ffffab3840ac7e68 ((work_completion)(&cil->xc_push_work)){+.+.}-{0:0}, at: process_one_work+0x200/0x450
 #2: ffff8c2e4a887128 (&cil->xc_ctx_lock){++++}-{3:3}, at: xlog_cil_push_work+0xb7/0x670 [xfs]
 #3: ffffffffa6a63780 (rcu_read_lock){....}-{1:2}, at: blk_add_trace_bio+0x0/0x1f0
 #4: ffffffffa6610620 (running_trace_lock){+.+.}-{2:2}, at: __blk_add_trace+0x3ef/0x480
Preemption disabled at:
[<ffffffffa4d35c05>] migrate_enable+0x45/0x140
CPU: 0 PID: 119 Comm: kworker/u2:2 Kdump: loaded Not tainted 5.14.0-25.rt21.25.light.el9.x86_64+debug #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Workqueue: xfs-cil/dm-0 xlog_cil_push_work [xfs]
Call Trace:
 ? migrate_enable+0x45/0x140
 dump_stack_lvl+0x57/0x7d
 ___might_sleep.cold+0xe3/0xf7
 rt_spin_lock+0x3a/0xd0
 ? __blk_add_trace+0x3ef/0x480
 __blk_add_trace+0x3ef/0x480
 blk_add_trace_bio+0x18d/0x1f0
 trace_block_bio_queue+0xb5/0x150
 submit_bio_checks+0x1f0/0x520
 ? sched_clock_cpu+0xb/0x100
 submit_bio_noacct+0x30/0x1d0
 ? bio_associate_blkg+0x66/0x190
 xlog_cil_push_work+0x1b6/0x670 [xfs]
 ? register_lock_class+0x43/0x4f0
 ? xfs_swap_extents+0x5f0/0x5f0 [xfs]
 process_one_work+0x275/0x450
 ? process_one_work+0x200/0x450
 worker_thread+0x55/0x3c0
 ? process_one_work+0x450/0x450
 kthread+0x188/0x1a0
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30

To avoid this bug, switch the trace lock to a raw spinlock.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
-
block: Avoid sleeping function called from invalid context bug
This was caught during QA test:

BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:942
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 243401, name: sed
INFO: lockdep is turned off.
Preemption disabled at:
[<ffffffff89b26268>] blk_cgroup_bio_start+0x28/0xd0
CPU: 2 PID: 243401 Comm: sed Kdump: loaded Not tainted 4.18.0-353.rt7.138.el8.x86_64+debug #1
Hardware name: HP ProLiant DL380 Gen9, BIOS P89 05/06/2015
Call Trace:
 dump_stack+0x5c/0x80
 ___might_sleep.cold.89+0xf5/0x109
 rt_spin_lock+0x3e/0xd0
 ? __blk_add_trace+0x428/0x4b0
 __blk_add_trace+0x428/0x4b0
 blk_add_trace_bio+0x16e/0x1c0
 generic_make_request_checks+0x7e8/0x8c0
 generic_make_request+0x3c/0x420
 ? membarrier_private_expedited+0xd0/0x2b0
 ? lock_release+0x1ca/0x450
 ? submit_bio+0x3c/0x160
 ? _raw_spin_unlock_irqrestore+0x3c/0x80
 submit_bio+0x3c/0x160
 ? rt_mutex_futex_unlock+0x66/0xa0
 iomap_submit_ioend.isra.36+0x4a/0x70
 xfs_vm_writepages+0x65/0x90 [xfs]
 do_writepages+0x41/0xe0
 ? rt_mutex_futex_unlock+0x66/0xa0
 __filemap_fdatawrite_range+0xce/0x110
 xfs_release+0x11c/0x160 [xfs]
 __fput+0xd5/0x270
 task_work_run+0xa1/0xd0
 exit_to_usermode_loop+0x14d/0x160
 do_syscall_64+0x23b/0x240
 entry_SYSCALL_64_after_hwframe+0x6a/0xdf

Replace the get/put_cpu() calls with get/put_cpu_light() to avoid this bug.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
-
Merge branch 'for-5.17/drivers' into for-next
* for-5.17/drivers: pktdvd: stop using bdi congestion framework.
-
pktdvd: stop using bdi congestion framework.
The bdi congestion framework isn't widely used and should be deprecated. pktdvd uses it to track congestion, but this can be done entirely internally to pktdvd, so it doesn't need the framework. So introduce a "congested" flag. When waiting for bio_queue_size to drop, set this flag and use a var_waitqueue() to wait on it. When bio_queue_size does drop and the flag is set, clear the flag and call wake_up_var(). We don't use a wait_var_event macro for the waiting because we need to set the flag and drop the spinlock before calling schedule(); while that is possible with __wait_var_event(), the result is not easy to read. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/163910843527.9928.857338663717630212@noble.neil.brown.name Signed-off-by: Jens Axboe <axboe@kernel.dk>
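The set-flag-then-wait / clear-flag-then-wake dance described above can be modeled in a few lines. This is a single-threaded userspace sketch, not the driver's code: the struct, field names, and thresholds are illustrative, and the kernel's actual sleep/wake pair is __wait_var_event()/wake_up_var(), which here is reduced to a wake-up counter.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the pktdvd "congested" flag.  In the kernel the writer
 * sleeps with __wait_var_event() while congested is set, and the
 * completion path clears the flag and calls wake_up_var(); here we just
 * flip the flag and count the wake-ups.
 */
struct pkt_queue {
	int bio_queue_size;
	int congestion_off;	/* resume below this depth (illustrative) */
	int congestion_on;	/* throttle above this depth (illustrative) */
	bool congested;
	int wakeups;		/* stand-in for wake_up_var() calls */
};

static void pkt_bio_queued(struct pkt_queue *q)
{
	q->bio_queue_size++;
	if (q->bio_queue_size > q->congestion_on)
		q->congested = true;	/* submitter would now wait on the flag */
}

static void pkt_bio_done(struct pkt_queue *q)
{
	q->bio_queue_size--;
	if (q->congested && q->bio_queue_size <= q->congestion_off) {
		q->congested = false;
		q->wakeups++;		/* kernel: wake_up_var(&q->congested) */
	}
}
```

The point of the commit is exactly this locality: both transitions of the flag happen inside pktdvd's own locking, so no shared bdi state is needed.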
Commits on Dec 8, 2021
-
Merge branch 'for-5.17/io_uring' into for-next
* for-5.17/io_uring: io_uring: batch completion in prior_task_list
-
io_uring: batch completion in prior_task_list
In previous patches, we have already gathered some tw, with io_req_task_complete() as the callback, in prior_task_list; complete them in batch while we cannot grab the uring lock. In this way, we batch the req_complete_post path. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20211208052125.351587-1-haoxu@linux.alibaba.com Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commits on Dec 7, 2021
-
Merge branch 'for-5.17/io_uring' into for-next
* for-5.17/io_uring: io_uring: split io_req_complete_post() and add a helper io_uring: add helper for task work execution code io_uring: add a priority tw list for irq completion work io-wq: add helper to merge two wq_lists
-
io_uring: split io_req_complete_post() and add a helper
Split io_req_complete_post(), this is a prep for the next patch. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20211207093951.247840-5-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
io_uring: add helper for task work execution code
Add a helper for task work execution code. We will use it later. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20211207093951.247840-4-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
io_uring: add a priority tw list for irq completion work
Now we have a lot of task_work users; some just complete a req and generate a cqe. Put such work on a new tw list which has a higher priority, so that it can be handled quickly, reducing avg req latency and letting users issue the next round of sqes earlier. An explanatory case:

origin timeline:
    submit_sqe-->irq-->add completion task_work
    -->run heavy work0~n-->run completion task_work
now timeline:
    submit_sqe-->irq-->add completion task_work
    -->run completion task_work-->run heavy work0~n

Limitation: this optimization only applies when the submission and reaping processes run in different threads. Otherwise we have to submit new sqes after returning to userspace anyway, and then the order of TWs doesn't matter.

Tested this patch (and the following ones) by manually replacing __io_queue_sqe() in io_queue_sqe() with io_req_task_queue() to construct 'heavy' task works. Then tested with fio:

ioengine=io_uring
sqpoll=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=600
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1

Tried various iodepth. The peak IOPS for this patch is 710K, while the old one is 665K. For avg latency (usec), the difference shows as iodepth grows:

depth     new       old
1         7.05      7.10
2         8.47      8.60
4         10.42     10.42
8         13.78     13.22
16        27.41     24.33
32        49.40     53.08
64        102.53    103.36
128       196.98    205.61
256       372.99    414.88
512       747.23    791.30
1024      1472.59   1538.72
2048      3153.49   3329.01
4096      6387.86   6682.54
8192      12150.25  12774.14
16384     23085.58  26044.71

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20211207093951.247840-3-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
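The reordered timeline above boils down to draining one list before the other. A minimal userspace model of the two task_work lists (names and the array-backed queues are illustrative; io_uring keeps real wq_lists, not arrays):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Toy model of the priority tw list: completion works queued on the
 * priority list always run before "heavy" works on the normal list,
 * regardless of arrival order.
 */
enum { TW_MAX = 8 };

struct tw_queues {
	char prio[TW_MAX]; int nprio;	/* e.g. io_req_task_complete works */
	char norm[TW_MAX]; int nnorm;	/* heavier task works */
};

static void tw_add(struct tw_queues *q, char tag, bool priority)
{
	if (priority)
		q->prio[q->nprio++] = tag;
	else
		q->norm[q->nnorm++] = tag;
}

static void tw_run(struct tw_queues *q, char *order)
{
	int n = 0;

	for (int i = 0; i < q->nprio; i++)
		order[n++] = q->prio[i];	/* priority list drains first */
	for (int i = 0; i < q->nnorm; i++)
		order[n++] = q->norm[i];
	order[n] = '\0';
	q->nprio = q->nnorm = 0;
}
```

Even though the heavy works arrive first, a completion work added later still runs first, which is exactly the latency win the commit measures.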
io-wq: add helper to merge two wq_lists
Add a helper to merge two wq_lists; it will be useful in the next patches. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20211207093951.247840-2-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
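A wq_list is a singly linked list with cached first/last pointers, so merging two of them is an O(1) tail splice. This is a userspace sketch of the idea only; the real io-wq helper (wq_list_merge()) has a different signature and node type:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal wq_list lookalike: singly linked, with a cached tail. */
struct wq_node {
	struct wq_node *next;
};

struct wq_list {
	struct wq_node *first;
	struct wq_node *last;
};

/* Splice src onto the tail of dst in O(1), leaving src empty. */
static void wq_list_splice_tail(struct wq_list *dst, struct wq_list *src)
{
	if (!src->first)
		return;			/* nothing to merge */
	if (!dst->first) {
		*dst = *src;		/* dst was empty: take src whole */
	} else {
		dst->last->next = src->first;
		dst->last = src->last;
	}
	src->first = src->last = NULL;	/* src is now empty */
}

static int wq_list_len(const struct wq_list *l)
{
	int n = 0;

	for (struct wq_node *p = l->first; p; p = p->next)
		n++;
	return n;
}
```

The cached `last` pointer is what makes the splice constant-time; without it, merging would require walking dst to its tail.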
Commits on Dec 6, 2021
-
Merge branch 'for-5.17/block' into for-next
* for-5.17/block: blk-mq: Optimise blk_mq_queue_tag_busy_iter() for shared tags blk-mq: Delete busy_iter_fn blk-mq: Drop busy_iter_fn blk_mq_hw_ctx argument
-
blk-mq: Optimise blk_mq_queue_tag_busy_iter() for shared tags
Kashyap reports high CPU usage in blk_mq_queue_tag_busy_iter() and its callees, using a megaraid SAS RAID card, since moving to shared tags [0]. Previously, when shared tags were a shared sbitmap, this function was suboptimal since we would iterate through all tags for all hctx's, yet only ever match up to tagset-depth number of rqs. Since the change to shared tags, things are even less efficient if we have parallel callers of blk_mq_queue_tag_busy_iter(): in bt_iter() -> blk_mq_find_and_get_req() there is more contention on accessing each request ref and tags->lock, since they are now shared among all HW queues. Optimise by having separate calls to bt_for_each() for when we're using shared tags. In this case, no longer pass a hctx, as it is no longer relevant, and teach bt_iter() about this. Ming suggested something along the lines of this change, apart from a different implementation. [0] https://lore.kernel.org/linux-block/e4e92abbe9d52bcba6b8cc6c91c442cc@mail.gmail.com/ Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reported-and-tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Fixes: e155b0c ("blk-mq: Use shared tags for shared sbitmap support") Link: https://lore.kernel.org/r/1638794990-137490-4-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
blk-mq: Delete busy_iter_fn
Typedefs busy_iter_fn and busy_tag_iter_fn are now identical, so delete busy_iter_fn to reduce duplication. It would be nicer to delete busy_tag_iter_fn, as the name busy_iter_fn is less specific. However, busy_tag_iter_fn is used in many different parts of the tree, unlike busy_iter_fn, which is only used in block/, so take the straightforward path now and rename treewide later. Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Link: https://lore.kernel.org/r/1638794990-137490-3-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
blk-mq: Drop busy_iter_fn blk_mq_hw_ctx argument
The only user of the busy_iter_fn blk_mq_hw_ctx argument is blk_mq_rq_inflight(), which uses the hctx to find the associated request queue to match against the request. However, this same check is already done in the caller, bt_iter(), so drop it. With that change there are no more users of the busy_iter_fn blk_mq_hw_ctx argument, so drop the argument. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Link: https://lore.kernel.org/r/1638794990-137490-2-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
mm: convert to using atomic-ref
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
block: convert to using atomic-ref
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
io_uring: convert to using atomic-ref
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
atomic-ref: add basic infrastructure for atomic refs based on atomic_t
Make the atomic_t reference counting from commit f958d7b generic and available for other users. Signed-off-by: Jens Axboe <axboe@kernel.dk>
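The pattern being made generic here is reference counting on a plain atomic with explicit dead-state checks instead of refcount_t saturation. A userspace C11 sketch of that idea (a rough analogue, not the kernel's atomic-ref API; the kernel version from commit f958d7b also warns on overflow, which is omitted here):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of atomic_t-based refcounting: zero or negative means dead. */
typedef struct {
	atomic_int refs;
} aref_t;

static inline void aref_init(aref_t *r, int n)
{
	atomic_init(&r->refs, n);
}

/* Take a reference only if the object is still alive. */
static inline bool aref_inc_not_zero(aref_t *r)
{
	int old = atomic_load(&r->refs);

	do {
		if (old <= 0)
			return false;	/* already dead: don't resurrect */
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));
	return true;
}

/* Drop a reference; returns true if this was the last one. */
static inline bool aref_put_and_test(aref_t *r)
{
	return atomic_fetch_sub(&r->refs, 1) == 1;
}
```

The hot path (put) is a single fetch_sub with no branches on saturation state, which is where the claimed win over refcount_t comes from.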
-
Merge branch 'for-5.17/block' into for-next
* for-5.17/block: blk-mq: don't use plug->mq_list->q directly in blk_mq_run_dispatch_ops() blk-mq: don't run might_sleep() if the operation needn't blocking
-
blk-mq: don't use plug->mq_list->q directly in blk_mq_run_dispatch_ops()
blk_mq_run_dispatch_ops() is defined as a macro, and plug->mq_list may change while 'dispatch_ops' runs, so add a local variable to hold the request queue. Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com> Fixes: 4cafe86 ("blk-mq: run dispatch lock once in case of issuing from list") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
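The underlying hazard is macro-argument re-evaluation: an argument like plug->mq_list->q is expanded textually at every use, so if the ops block mutates plug->mq_list, later uses inside the macro see a different queue. Toy userspace macros illustrating the bug and the fix (not the kernel macros themselves):

```c
#include <assert.h>

static int seen[2];
static int nseen;

static void use_queue(int q)
{
	seen[nseen++] = q;
}

/* Buggy shape: (q) is re-expanded, so the ops block can change it. */
#define RUN_OPS_REEVAL(q, ops) do {	\
	use_queue(q);			\
	ops;				\
	use_queue(q);	/* may now be a different queue */	\
} while (0)

/* Fixed shape: evaluate (q) once into a local, like the added variable. */
#define RUN_OPS_LOCAL(q, ops) do {	\
	int q__ = (q);			\
	use_queue(q__);			\
	ops;				\
	use_queue(q__);	/* still the original queue */		\
} while (0)
```

With the local, both uses inside the macro are guaranteed to refer to the same request queue even if dispatch_ops reshuffles the plug list.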
-
blk-mq: don't run might_sleep() if the operation needn't blocking
The operation protected via blk_mq_run_dispatch_ops() in blk_mq_run_hw_queue won't sleep, so don't run might_sleep() for it. Reported-and-tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commits on Dec 5, 2021
-
Merge branch 'for-5.17/io_uring' into for-next
* for-5.17/io_uring: io_uring: reuse io_req_task_complete for timeouts io_uring: tweak iopoll CQE_SKIP event counting io_uring: simplify selected buf handling io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
-
io_uring: reuse io_req_task_complete for timeouts
With kbuf unification io_req_task_complete() is now a generic function, use it for timeout's tw completions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7142fa3cbaf3a4140d59bcba45cbe168cf40fac2.1638714983.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
io_uring: tweak iopoll CQE_SKIP event counting
When iopolling, the userspace specifies the minimum number of "events" it expects. Previously, we had one CQE per request, so the definition of an "event" was unequivocal, but that's no longer the case with REQ_F_CQE_SKIP. Currently it counts the number of completed requests; replace that with the number of posted CQEs. This allows users of the "one CQE per link" scheme to wait for all N links in a single syscall, which is not possible without the patch and requires extra context switches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d5a965c4d2249827392037bbd0186f87fea49c55.1638714983.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
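The counting change can be stated in one small function. This is a model only: the flag value and function name are illustrative, not io_uring's internals.

```c
#include <assert.h>

#define REQ_F_CQE_SKIP 0x1u	/* illustrative flag bit */

/*
 * With CQE_SKIP, a completed request may post no CQE, so iopoll should
 * count posted CQEs (what userspace waits for), not completed requests.
 */
static int iopoll_posted_cqes(const unsigned int *req_flags, int nr_done)
{
	int posted = 0;

	for (int i = 0; i < nr_done; i++)
		if (!(req_flags[i] & REQ_F_CQE_SKIP))
			posted++;	/* only these satisfy min_events */
	return posted;
}
```

Under the "one CQE per link" scheme, all but the last request in a link carry CQE_SKIP, so counting posted CQEs is what lets a single wait cover N whole links.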
-
io_uring: simplify selected buf handling
As selected buffers are now stored in a separate field in a request, get rid of rw/recv specific helpers and simplify the code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bd4a866d8d91b044f748c40efff9e4eacd07536e.1638714983.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
Move them up to avoid explicit declaration. We will use them in later patches. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3631243d6fc4a79bbba0cd62597fc8cd5be95924.1638714983.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commits on Dec 3, 2021
-
Merge branch 'for-5.17/block' into for-next
* for-5.17/block: blk-mq: run dispatch lock once in case of issuing from list blk-mq: pass request queue to blk_mq_run_dispatch_ops blk-mq: move srcu from blk_mq_hw_ctx to request_queue blk-mq: remove hctx_lock and hctx_unlock block: switch to atomic_t for request references block: move direct_IO into our own read_iter handler mm: move filemap_range_needs_writeback() into header
-
blk-mq: run dispatch lock once in case of issuing from list
It isn't necessary to call blk_mq_run_dispatch_ops() for each request issued directly; it is enough to do it once when issuing from the whole list. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20211203131534.3668411-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
blk-mq: pass request queue to blk_mq_run_dispatch_ops
We have switched to allocate srcu into request queue, so it is fine to pass request queue to blk_mq_run_dispatch_ops(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20211203131534.3668411-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
blk-mq: move srcu from blk_mq_hw_ctx to request_queue
In case of BLK_MQ_F_BLOCKING, per-hctx srcu is used to protect the dispatch critical area. However, this srcu instance sits at the end of hctx and often occupies a standalone, often cold, cacheline. Inside srcu_read_lock() and srcu_read_unlock(), the write is always done on an indirect percpu variable allocated from the heap rather than embedded, and srcu->srcu_idx is only read in srcu_read_lock(), so it doesn't matter whether the srcu structure lives in the hctx or the request queue. Switch to per-request-queue srcu for protecting dispatch; this simplifies quiesce a lot, not to mention that quiesce is always done request-queue wide anyway. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20211203131534.3668411-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
blk-mq: remove hctx_lock and hctx_unlock
Remove hctx_lock and hctx_unlock, and add a blk_mq_run_dispatch_ops() helper to run the code block defined in dispatch_ops with the rcu/srcu read lock held. Compared with hctx_lock()/hctx_unlock():
1) two branches are reduced to one, so we only need to check (hctx->flags & BLK_MQ_F_BLOCKING) once when running one dispatch_ops
2) srcu_idx needn't be touched in the non-blocking case
3) might_sleep_if() can be moved into the blocking branch
Also put the added blk_mq_run_dispatch_ops() in a private header, so that the following patch can use it outside of blk-mq.c. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20211203131534.3668411-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
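The shape of such a run-a-block-under-the-right-lock macro can be sketched in userspace. This is an analogue only: counters stand in for the rcu/srcu read-side sections, and the macro body is not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

static int rcu_sections;	/* stand-in for rcu_read_lock() sections */
static int srcu_sections;	/* stand-in for srcu_read_lock() sections */

/*
 * One branch on the blocking flag picks the protection, then the
 * dispatch_ops block runs inside it -- mirroring point 1) above: the
 * BLK_MQ_F_BLOCKING-style check happens exactly once per invocation.
 */
#define run_dispatch_ops(blocking, dispatch_ops) do {		\
	if (!(blocking)) {					\
		rcu_sections++;		/* rcu_read_lock()    */\
		dispatch_ops;					\
					/* rcu_read_unlock()  */\
	} else {						\
		srcu_sections++;	/* srcu_read_lock()   */\
		dispatch_ops;					\
					/* srcu_read_unlock() */\
	}							\
} while (0)
```

Passing the code block as a macro argument is what lets a single helper replace the paired hctx_lock()/hctx_unlock() calls at every dispatch site.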
-
block: switch to atomic_t for request references
refcount_t is not as expensive as it used to be, but it's still more expensive than the io_uring method of using atomic_t and just checking for potential over/underflow. This borrows that same implementation, which in turn is based on the mm implementation from Linus. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
block: move direct_IO into our own read_iter handler
Don't call into generic_file_read_iter() if we know it's O_DIRECT, just set it up ourselves and call our own handler. This avoids an indirect call for O_DIRECT. Fall back to filemap_read() if we fail. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
mm: move filemap_range_needs_writeback() into header
No functional changes in this patch, just in preparation for efficiently calling this light function from the block O_DIRECT handling. Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>