Christoph-Hell…
Commits on Feb 16, 2022
-
block: skip the fsync_bdev call in del_gendisk for surprise removals
For surprise removals that have already marked the disk dead, there is no point in calling fsync_bdev as all I/O will fail anyway, so skip it. Signed-off-by: Christoph Hellwig <hch@lst.de>
-
block: fix surprise removal for drivers calling blk_set_queue_dying
Various block drivers call blk_set_queue_dying to mark a disk as dead due to surprise removal events, but since commit 8e141f9 that doesn't work given that the GD_DEAD flag needs to be set to stop I/O. Replace the driver calls to blk_set_queue_dying with a new (and properly documented) blk_mark_disk_dead API, and fold blk_set_queue_dying into the only remaining caller. Fixes: 8e141f9 ("block: drain file system I/O on del_gendisk") Reported-by: Markus Blöchl <markus.bloechl@ipetronik.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Commits on Feb 15, 2022
-
Merge branch 'for-5.18/drivers' into for-next
* for-5.18/drivers: loop: allow user to set the queue depth loop: remove extra variable in lo_req_flush loop: remove extra variable in lo_fallocate() loop: use sysfs_emit() in the sysfs xxx show()
-
loop: allow user to set the queue depth
Instead of hardcoding queue depth allow user to set the hw queue depth using module parameter. Set default value to 128 to retain the existing behavior. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20220215213310.7264-5-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
loop: remove extra variable in lo_req_flush
The local variable file is used to pass it to the vfs_fsync(). We can get away with using lo->lo_backing_file instead of storing in a local variable which is not used anywhere else. No functional change in this patch. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20220215213310.7264-4-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
loop: remove extra variable in lo_fallocate()
The local variable q is used to pass it to the blk_queue_discard(). We can get away with using lo->lo_queue instead of storing in a local variable which is not used anywhere else. No functional change in this patch. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20220215213310.7264-3-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
loop: use sysfs_emit() in the sysfs xxx show()
sprintf does not know the PAGE_SIZE maximum of the temporary buffer used for outputting sysfs content and it's possible to overrun the PAGE_SIZE buffer length. Use a generic sysfs_emit function that knows the size of the temporary buffer and ensures that no overrun is done for offset attribute in loop_attr_[offset|sizelimit|autoclear|partscan|dio]_show() callbacks. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20220215213310.7264-2-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Merge branch 'for-5.18/io_uring' into for-next
* for-5.18/io_uring: io-uring: Make tracepoints consistent. io-uring: add __fill_cqe function
-
io-uring: Make tracepoints consistent.
This makes the io-uring tracepoints consistent. Where it makes sense the tracepoints start with the following four fields: - context (ring) - request - user_data - opcode. Signed-off-by: Stefan Roesch <shr@fb.com> Link: https://lore.kernel.org/r/20220214180430.70572-3-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
io-uring: add __fill_cqe function
This introduces the __fill_cqe function. This is necessary to correctly issue the io_uring_complete tracepoint. Signed-off-by: Stefan Roesch <shr@fb.com> Link: https://lore.kernel.org/r/20220214180430.70572-2-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Merge branch 'for-5.18/block' into for-next
* for-5.18/block: blk-cgroup: set blkg iostat after percpu stat aggregation
-
blk-cgroup: set blkg iostat after percpu stat aggregation
Don't need to do blkg_iostat_set for top blkg iostat on each CPU, so move it after percpu stat aggregation. Fixes: ef45fe4 ("blk-cgroup: show global disk stats in root cgroup io.stat") Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220213085902.88884-1-zhouchengming@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Merge branch 'for-5.18/block' into for-next
* for-5.18/block: blk-lib: don't check bdev_get_queue() NULL check
-
blk-lib: don't check bdev_get_queue() NULL check
Based on the comment present in the bdev_get_queue() bdev->bd_queue can never be NULL. Remove the NULL check for the local variable q that is set from bdev_get_queue() for discard, write_same, and write_zeroes. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220215115247.11717-2-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Merge branch 'for-5.18/drivers' into for-next
* for-5.18/drivers: null_blk: fix return value from null_add_dev()
-
Merge branch 'for-5.18/block' into for-next
* for-5.18/block: block: remove biodoc.rst
-
null_blk: fix return value from null_add_dev()
The function nullb_device_power_store() returns -ENOMEM when null_add_dev() fails. null_add_dev() can fail with return value other than -ENOMEM such as -EINVAL when Zoned Block Device option is used, see : nullb_device_power_store() null_add_dev() null_init_zoned_dev() return -EINVAL; When trying to load the module having -ENOMEM value returned on the command line creates confusion when pleanty of memory is free on the machine. Instead of hardcoding -ENOMEM return the value of null_add_dev() function. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220215115951.15945-1-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
This document is completely out of date and extremely misleading. In general the existing kerneldoc comment serve as a much better documentation of the still existing functionality, while the history blurbs are pretty much irrelevant today. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220215081047.3693582-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commits on Feb 11, 2022
-
Merge branch 'for-5.18/drivers' into for-next
* for-5.18/drivers: loop: clean up grammar in warning message
-
loop: clean up grammar in warning message
The phrase "has still" should be "still has" to clean up the grammar. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20220208114656.61629-1-colin.i.king@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Merge branch 'for-5.18/block' into for-next
* for-5.18/block: docs: block: biodoc.rst: Drop the obsolete and incorrect content
-
docs: block: biodoc.rst: Drop the obsolete and incorrect content
Since commit 7eaceac ("block: remove per-queue plugging"), kernel has removed blk_run_address_space(), blk_unplug() and sync_buffer(), and moved to on-stack plugging. The document has been obsolete for years. Given that there is no obvious counterparts in the new mechinism to replace old APIs, this patch drops the content directly. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Link: https://lore.kernel.org/r/20220207074931.20067-1-song.bao.hua@hisilicon.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Merge branch 'for-5.18/block' into for-next
* for-5.18/block: block: partition include/linux/blk-cgroup.h block: move initialization of q->blkg_list into blkcg_init_queue block: remove THROTL_IOPS_MAX
-
block: partition include/linux/blk-cgroup.h
Partition include/linux/blk-cgroup.h into two parts: one is public part, the other is block layer private part. Suggested by Christoph Hellwig. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220211101149.2368042-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
block: move initialization of q->blkg_list into blkcg_init_queue
q->blkg_list is only used by blkcg code, so move it into blkcg_init_queue. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220211101149.2368042-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
No one uses THROTL_IOPS_MAX any more, so remove it. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220211101149.2368042-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Merge branch 'for-5.18/block' into for-next
* for-5.18/block: block: introduce block_rq_error tracepoint
-
block: introduce block_rq_error tracepoint
Currently, rasdaemon uses the existing tracepoint block_rq_complete and filters out non-error cases in order to capture block disk errors. But there are a few problems with this approach: 1. Even kernel trace filter could do the filtering work, there is still some overhead after we enable this tracepoint. 2. The filter is merely based on errno, which does not align with kernel logic to check the errors for print_req_error(). 3. block_rq_complete only provides dev major and minor to identify the block device, it is not convenient to use in user-space. So introduce a new tracepoint block_rq_error just for the error case. With this patch, rasdaemon could switch to block_rq_error. Since the new tracepoint has the similar implementation with block_rq_complete, so move the existing code from TRACE_EVENT block_rq_complete() into new event class block_rq_completion(). Then add event for block_rq_complete and block_rq_err respectively from the newly created event class per the suggestion from Chaitanya Kulkarni. Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220210225222.260069-1-shy828301@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Merge branch 'for-5.18/io_uring' into for-next
* for-5.18/io_uring: io-wq: use IO_WQ_ACCT_NR rather than hardcoded number io-wq: reduce acct->lock crossing functions lock/unlock io-wq: decouple work_list protection from the big wqe->lock
-
io-wq: use IO_WQ_ACCT_NR rather than hardcoded number
It's better to use the defined enum stuff not the hardcoded number to define array. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20220206095241.121485-4-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
io-wq: reduce acct->lock crossing functions lock/unlock
reduce acct->lock lock and unlock in different functions to make the code clearer. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20220206095241.121485-3-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
io-wq: decouple work_list protection from the big wqe->lock
wqe->lock is abused, it now protects acct->work_list, hash stuff, nr_workers, wqe->free_list and so on. Lets first get the work_list out of the wqe-lock mess by introduce a specific lock for work list. This is the first step to solve the huge contension between work insertion and work consumption. good thing: - split locking for bound and unbound work list - reduce contension between work_list visit and (worker's)free_list. For the hash stuff, since there won't be a work with same file in both bound and unbound work list, thus they won't visit same hash entry. it works well to use the new lock to protect hash stuff. Results: set max_unbound_worker = 4, test with echo-server: nice -n -15 ./io_uring_echo_server -p 8081 -f -n 1000 -l 16 (-n connection, -l workload) before this patch: Samples: 2M of event 'cycles:ppp', Event count (approx.): 1239982111074 Overhead Command Shared Object Symbol 28.59% iou-wrk-10021 [kernel.vmlinux] [k] native_queued_spin_lock_slowpath 8.89% io_uring_echo_s [kernel.vmlinux] [k] native_queued_spin_lock_slowpath 6.20% iou-wrk-10021 [kernel.vmlinux] [k] _raw_spin_lock 2.45% io_uring_echo_s [kernel.vmlinux] [k] io_prep_async_work 2.36% iou-wrk-10021 [kernel.vmlinux] [k] _raw_spin_lock_irqsave 2.29% iou-wrk-10021 [kernel.vmlinux] [k] io_worker_handle_work 1.29% io_uring_echo_s [kernel.vmlinux] [k] io_wqe_enqueue 1.06% iou-wrk-10021 [kernel.vmlinux] [k] io_wqe_worker 1.06% io_uring_echo_s [kernel.vmlinux] [k] _raw_spin_lock 1.03% iou-wrk-10021 [kernel.vmlinux] [k] __schedule 0.99% iou-wrk-10021 [kernel.vmlinux] [k] tcp_sendmsg_locked with this patch: Samples: 1M of event 'cycles:ppp', Event count (approx.): 708446691943 Overhead Command Shared Object Symbol 16.86% iou-wrk-10893 [kernel.vmlinux] [k] native_queued_spin_lock_slowpat 9.10% iou-wrk-10893 [kernel.vmlinux] [k] _raw_spin_lock 4.53% io_uring_echo_s [kernel.vmlinux] [k] native_queued_spin_lock_slowpat 2.87% iou-wrk-10893 [kernel.vmlinux] [k] io_worker_handle_work 2.57% iou-wrk-10893 [kernel.vmlinux] [k] _raw_spin_lock_irqsave 2.56% io_uring_echo_s [kernel.vmlinux] [k] io_prep_async_work 1.82% io_uring_echo_s [kernel.vmlinux] [k] _raw_spin_lock 1.33% iou-wrk-10893 [kernel.vmlinux] [k] io_wqe_worker 1.26% io_uring_echo_s [kernel.vmlinux] [k] try_to_wake_up spin_lock failure from 25.59% + 8.89% = 34.48% to 16.86% + 4.53% = 21.39% TPS is similar, while cpu usage is from almost 400% to 350% Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20220206095241.121485-2-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commits on Feb 8, 2022
-
Merge branch 'for-5.18/drivers' into for-next
* for-5.18/drivers: block/rnbd: Remove a useless mutex
-
block/rnbd: Remove a useless mutex
According to lib/idr.c, The IDA handles its own locking. It is safe to call any of the IDA functions without synchronisation in your code. so the 'ida_lock' mutex can just be removed. It is here only to protect some ida_simple_get()/ida_simple_remove() calls. While at it, switch to ida_alloc_XXX()/ida_free() instead to ida_simple_get()/ida_simple_remove(). The latter is deprecated and more verbose. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Acked-by: Jack Wang <jinpu.wang@ionos.com> Link: https://lore.kernel.org/r/7f9eccd8b1fce1bac45ac9b01a78cf72f54c0a61.1644266862.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Merge branch 'for-5.18/io_uring' into for-next
* for-5.18/io_uring: io_uring: Fix use of uninitialized ret in io_eventfd_register()