
Commits on Feb 16, 2022

  1. block: skip the fsync_bdev call in del_gendisk for surprise removals

    For surprise removals that have already marked the disk dead, there is
    no point in calling fsync_bdev as all I/O will fail anyway, so skip it.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Christoph Hellwig authored and intel-lab-lkp committed Feb 16, 2022
  2. block: fix surprise removal for drivers calling blk_set_queue_dying

    Various block drivers call blk_set_queue_dying to mark a disk as dead due
    to surprise removal events, but since commit 8e141f9 that doesn't
    work given that the GD_DEAD flag needs to be set to stop I/O.
    
    Replace the driver calls to blk_set_queue_dying with a new (and properly
    documented) blk_mark_disk_dead API, and fold blk_set_queue_dying into the
    only remaining caller.
    
    Fixes: 8e141f9 ("block: drain file system I/O on del_gendisk")
    Reported-by: Markus Blöchl <markus.bloechl@ipetronik.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Christoph Hellwig authored and intel-lab-lkp committed Feb 16, 2022
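
The two surprise-removal commits above can be pictured with a small user-space model. This is only a sketch under stated assumptions: `toy_disk`, `toy_mark_disk_dead`, `toy_submit_io`, and `toy_del_disk` are hypothetical names invented for illustration, not kernel APIs, and the `-EIO` return value is an illustrative choice. The real blk_mark_disk_dead and del_gendisk do considerably more.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a disk; not the kernel's gendisk. */
struct toy_disk {
    bool dead;         /* models the GD_DEAD flag */
    bool queue_dying;  /* models the queue being marked as dying */
    int fsync_calls;   /* counts how often the toy "fsync" step ran */
};

/* Models the idea behind blk_mark_disk_dead: set the flag that stops
 * I/O and mark the queue as dying in a single call. */
static void toy_mark_disk_dead(struct toy_disk *d)
{
    d->dead = true;
    d->queue_dying = true;
}

/* Once the disk is dead, every submission fails. */
static int toy_submit_io(const struct toy_disk *d)
{
    return d->dead ? -EIO : 0;
}

/* Models the del_gendisk change: skip the sync step when the disk is
 * already dead, because all I/O would fail anyway. */
static void toy_del_disk(struct toy_disk *d)
{
    if (!d->dead)
        d->fsync_calls++;
    toy_mark_disk_dead(d);
}
```

In this model, a driver that calls `toy_mark_disk_dead()` on surprise removal sees later `toy_submit_io()` calls fail, and a following `toy_del_disk()` leaves `fsync_calls` unchanged.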

Commits on Feb 15, 2022

  1. Merge branch 'for-5.18/drivers' into for-next

    * for-5.18/drivers:
      loop: allow user to set the queue depth
      loop: remove extra variable in lo_req_flush
      loop: remove extra variable in lo_fallocate()
      loop: use sysfs_emit() in the sysfs xxx show()
    axboe committed Feb 15, 2022
  2. loop: allow user to set the queue depth

    Instead of hardcoding the queue depth, allow the user to set the hw queue
    depth using a module parameter. Set the default value to 128 to retain the
    existing behavior.
    
    Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
    Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
    Link: https://lore.kernel.org/r/20220215213310.7264-5-kch@nvidia.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Chaitanya Kulkarni authored and axboe committed Feb 15, 2022
  3. loop: remove extra variable in lo_req_flush

    The local variable file is only used to pass the backing file to
    vfs_fsync(). We can use lo->lo_backing_file directly instead of storing
    it in a local variable which is not used anywhere else.
    
    No functional change in this patch.
    
    Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
    Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
    Link: https://lore.kernel.org/r/20220215213310.7264-4-kch@nvidia.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Chaitanya Kulkarni authored and axboe committed Feb 15, 2022
  4. loop: remove extra variable in lo_fallocate()

    The local variable q is only used to pass the queue to
    blk_queue_discard(). We can use lo->lo_queue directly instead of storing
    it in a local variable which is not used anywhere else.
    
    No functional change in this patch.
    
    Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
    Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
    Link: https://lore.kernel.org/r/20220215213310.7264-3-kch@nvidia.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Chaitanya Kulkarni authored and axboe committed Feb 15, 2022
  5. loop: use sysfs_emit() in the sysfs xxx show()

    sprintf() does not know the PAGE_SIZE maximum of the temporary buffer
    used for outputting sysfs content, so it is possible to overrun the
    PAGE_SIZE buffer length.
    
    Use the generic sysfs_emit() function, which knows the size of the
    temporary buffer and ensures that no overrun occurs, in the
    loop_attr_[offset|sizelimit|autoclear|partscan|dio]_show() callbacks.
    
    Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
    Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
    Link: https://lore.kernel.org/r/20220215213310.7264-2-kch@nvidia.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Chaitanya Kulkarni authored and axboe committed Feb 15, 2022
  6. Merge branch 'for-5.18/io_uring' into for-next

    * for-5.18/io_uring:
      io-uring: Make tracepoints consistent.
      io-uring: add __fill_cqe function
    axboe committed Feb 15, 2022
  7. io-uring: Make tracepoints consistent.

    This makes the io-uring tracepoints consistent. Where it makes sense
    the tracepoints start with the following four fields:
    - context (ring)
    - request
    - user_data
    - opcode.
    
    Signed-off-by: Stefan Roesch <shr@fb.com>
    Link: https://lore.kernel.org/r/20220214180430.70572-3-shr@fb.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    shrfb authored and axboe committed Feb 15, 2022
  8. io-uring: add __fill_cqe function

    This introduces the __fill_cqe function. This is necessary
    to correctly issue the io_uring_complete tracepoint.
    
    Signed-off-by: Stefan Roesch <shr@fb.com>
    Link: https://lore.kernel.org/r/20220214180430.70572-2-shr@fb.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    shrfb authored and axboe committed Feb 15, 2022
  9. Merge branch 'for-5.18/block' into for-next

    * for-5.18/block:
      blk-cgroup: set blkg iostat after percpu stat aggregation
    axboe committed Feb 15, 2022
  10. blk-cgroup: set blkg iostat after percpu stat aggregation

    There is no need to call blkg_iostat_set for the top blkg iostat on
    each CPU, so move it after the percpu stat aggregation.
    
    Fixes: ef45fe4 ("blk-cgroup: show global disk stats in root cgroup io.stat")
    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20220213085902.88884-1-zhouchengming@bytedance.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    chengmingzhou authored and axboe committed Feb 15, 2022
  11. Merge branch 'for-5.18/block' into for-next

    * for-5.18/block:
      blk-lib: don't check bdev_get_queue() NULL check
    axboe committed Feb 15, 2022
  12. blk-lib: don't check bdev_get_queue() NULL check

    Based on the comment in bdev_get_queue(), bdev->bd_queue can never be
    NULL. Remove the NULL check for the local variable q that is set from
    bdev_get_queue() for discard, write_same, and write_zeroes.
    
    Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20220215115247.11717-2-kch@nvidia.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Chaitanya Kulkarni authored and axboe committed Feb 15, 2022
  13. Merge branch 'for-5.18/drivers' into for-next

    * for-5.18/drivers:
      null_blk: fix return value from null_add_dev()
    axboe committed Feb 15, 2022
  14. Merge branch 'for-5.18/block' into for-next

    * for-5.18/block:
      block: remove biodoc.rst
    axboe committed Feb 15, 2022
  15. null_blk: fix return value from null_add_dev()

    The function nullb_device_power_store() returns -ENOMEM when
    null_add_dev() fails. null_add_dev() can fail with a return value
    other than -ENOMEM, such as -EINVAL when the Zoned Block Device option
    is used, see:
    
    nullb_device_power_store()
     null_add_dev()
      null_init_zoned_dev()
    	return -EINVAL;
    
    When trying to load the module, having -ENOMEM returned on the
    command line creates confusion when plenty of memory is free on the
    machine.
    
    Instead of hardcoding -ENOMEM, return the value of the null_add_dev()
    function.
    
    Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20220215115951.15945-1-kch@nvidia.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Chaitanya Kulkarni authored and axboe committed Feb 15, 2022
  16. block: remove biodoc.rst

    This document is completely out of date and extremely misleading. In
    general the existing kerneldoc comments serve as much better
    documentation of the still existing functionality, while the history
    blurbs are pretty much irrelevant today.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20220215081047.3693582-1-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Feb 15, 2022
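
The sysfs_emit() conversion in the loop commit above rests on one idea: the formatting helper itself knows the buffer limit, so callers cannot overrun it. A minimal user-space sketch of that idea follows. `toy_emit` and `TOY_PAGE_SIZE` are hypothetical names, and the 4096-byte size is an assumption made for illustration. This is not the kernel's implementation of sysfs_emit().

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Assumed buffer size for this sketch; stands in for PAGE_SIZE. */
#define TOY_PAGE_SIZE 4096

/* Format into a buffer that is assumed to be TOY_PAGE_SIZE bytes long.
 * Unlike sprintf(), the output is truncated at the buffer limit.
 * Returns the number of characters actually stored, excluding the
 * terminating NUL. */
static int toy_emit(char *buf, const char *fmt, ...)
{
    va_list args;
    int len;

    assert(buf != NULL);

    va_start(args, fmt);
    len = vsnprintf(buf, TOY_PAGE_SIZE, fmt, args);
    va_end(args);

    if (len < 0)
        return 0;                   /* encoding error: nothing usable */
    if (len >= TOY_PAGE_SIZE)
        len = TOY_PAGE_SIZE - 1;    /* output was truncated */
    return len;
}
```

A show-style callback in this model would simply `return toy_emit(buf, "%llu\n", value);` and leave the bound to the helper instead of trusting every format string to stay short.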

Commits on Feb 11, 2022

  1. Merge branch 'for-5.18/drivers' into for-next

    * for-5.18/drivers:
      loop: clean up grammar in warning message
    axboe committed Feb 11, 2022
  2. loop: clean up grammar in warning message

    The phrase "has still" should be "still has" to clean up the grammar.
    
    Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
    Link: https://lore.kernel.org/r/20220208114656.61629-1-colin.i.king@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    ColinIanKing authored and axboe committed Feb 11, 2022
  3. Merge branch 'for-5.18/block' into for-next

    * for-5.18/block:
      docs: block: biodoc.rst: Drop the obsolete and incorrect content
    axboe committed Feb 11, 2022
  4. docs: block: biodoc.rst: Drop the obsolete and incorrect content

    Since commit 7eaceac ("block: remove per-queue plugging"), the kernel
    has removed blk_run_address_space(), blk_unplug() and sync_buffer(),
    and moved to on-stack plugging. The document has been obsolete for
    years.
    Given that there are no obvious counterparts in the new mechanism to
    replace the old APIs, this patch drops the content directly.
    
    Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
    Link: https://lore.kernel.org/r/20220207074931.20067-1-song.bao.hua@hisilicon.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    21cnbao authored and axboe committed Feb 11, 2022
  5. Merge branch 'for-5.18/block' into for-next

    * for-5.18/block:
      block: partition include/linux/blk-cgroup.h
      block: move initialization of q->blkg_list into blkcg_init_queue
      block: remove THROTL_IOPS_MAX
    axboe committed Feb 11, 2022
  6. block: partition include/linux/blk-cgroup.h

    Partition include/linux/blk-cgroup.h into two parts: one is the public
    part, the other is the block layer private part.
    
    Suggested by Christoph Hellwig.
    
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20220211101149.2368042-4-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Ming Lei authored and axboe committed Feb 11, 2022
  7. block: move initialization of q->blkg_list into blkcg_init_queue

    q->blkg_list is only used by blkcg code, so move it into
    blkcg_init_queue.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Bart Van Assche <bvanassche@acm.org>
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20220211101149.2368042-3-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Ming Lei authored and axboe committed Feb 11, 2022
  8. block: remove THROTL_IOPS_MAX

    No one uses THROTL_IOPS_MAX any more, so remove it.
    
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20220211101149.2368042-2-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Ming Lei authored and axboe committed Feb 11, 2022
  9. Merge branch 'for-5.18/block' into for-next

    * for-5.18/block:
      block: introduce block_rq_error tracepoint
    axboe committed Feb 11, 2022
  10. block: introduce block_rq_error tracepoint

    Currently, rasdaemon uses the existing tracepoint block_rq_complete
    and filters out non-error cases in order to capture block disk errors.
    
    But there are a few problems with this approach:
    
    1. Even though the kernel trace filter could do the filtering work,
       there is still some overhead after we enable this tracepoint.
    
    2. The filter is merely based on errno, which does not align with the
       kernel logic that checks the errors for print_req_error().
    
    3. block_rq_complete only provides the dev major and minor to identify
       the block device, which is not convenient to use in user-space.
    
    So introduce a new tracepoint block_rq_error just for the error case.
    With this patch, rasdaemon could switch to block_rq_error.
    
    Since the new tracepoint has a similar implementation to
    block_rq_complete, move the existing code from TRACE_EVENT
    block_rq_complete() into a new event class block_rq_completion(). Then
    add events for block_rq_complete and block_rq_error respectively from the
    newly created event class, per the suggestion from Chaitanya Kulkarni.
    
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Christoph Hellwig <hch@infradead.org>
    Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
    Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
    Signed-off-by: Yang Shi <shy828301@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20220210225222.260069-1-shy828301@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    yang-shi authored and axboe committed Feb 11, 2022
  11. Merge branch 'for-5.18/io_uring' into for-next

    * for-5.18/io_uring:
      io-wq: use IO_WQ_ACCT_NR rather than hardcoded number
      io-wq: reduce acct->lock crossing functions lock/unlock
      io-wq: decouple work_list protection from the big wqe->lock
    axboe committed Feb 11, 2022
  12. io-wq: use IO_WQ_ACCT_NR rather than hardcoded number

    It's better to use the defined enum value rather than a hardcoded
    number to define the array.
    
    Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
    Link: https://lore.kernel.org/r/20220206095241.121485-4-haoxu@linux.alibaba.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Hao Xu authored and axboe committed Feb 11, 2022
  13. io-wq: reduce acct->lock crossing functions lock/unlock

    Reduce the cases where acct->lock is locked in one function and
    unlocked in another, to make the code clearer.
    
    Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
    Link: https://lore.kernel.org/r/20220206095241.121485-3-haoxu@linux.alibaba.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Hao Xu authored and axboe committed Feb 11, 2022
  14. io-wq: decouple work_list protection from the big wqe->lock

    wqe->lock is abused: it now protects acct->work_list, hash stuff,
    nr_workers, wqe->free_list and so on. Let's first get the work_list out
    of the wqe->lock mess by introducing a specific lock for the work list.
    This is the first step to solve the huge contention between work
    insertion and work consumption.
    Good things:
      - split locking for bound and unbound work list
      - reduce contention between work_list visit and (worker's) free_list.
    
    For the hash stuff, since there won't be a work with the same file in
    both the bound and unbound work lists, they won't visit the same hash
    entry. It works well to use the new lock to protect the hash stuff.
    
    Results:
    set max_unbound_worker = 4, test with echo-server:
    nice -n -15 ./io_uring_echo_server -p 8081 -f -n 1000 -l 16
    (-n connection, -l workload)
    before this patch:
    Samples: 2M of event 'cycles:ppp', Event count (approx.): 1239982111074
    Overhead  Command          Shared Object         Symbol
      28.59%  iou-wrk-10021    [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
       8.89%  io_uring_echo_s  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
       6.20%  iou-wrk-10021    [kernel.vmlinux]      [k] _raw_spin_lock
       2.45%  io_uring_echo_s  [kernel.vmlinux]      [k] io_prep_async_work
       2.36%  iou-wrk-10021    [kernel.vmlinux]      [k] _raw_spin_lock_irqsave
       2.29%  iou-wrk-10021    [kernel.vmlinux]      [k] io_worker_handle_work
       1.29%  io_uring_echo_s  [kernel.vmlinux]      [k] io_wqe_enqueue
       1.06%  iou-wrk-10021    [kernel.vmlinux]      [k] io_wqe_worker
       1.06%  io_uring_echo_s  [kernel.vmlinux]      [k] _raw_spin_lock
       1.03%  iou-wrk-10021    [kernel.vmlinux]      [k] __schedule
       0.99%  iou-wrk-10021    [kernel.vmlinux]      [k] tcp_sendmsg_locked
    
    with this patch:
    Samples: 1M of event 'cycles:ppp', Event count (approx.): 708446691943
    Overhead  Command          Shared Object         Symbol
      16.86%  iou-wrk-10893    [kernel.vmlinux]      [k] native_queued_spin_lock_slowpat
       9.10%  iou-wrk-10893    [kernel.vmlinux]      [k] _raw_spin_lock
       4.53%  io_uring_echo_s  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpat
       2.87%  iou-wrk-10893    [kernel.vmlinux]      [k] io_worker_handle_work
       2.57%  iou-wrk-10893    [kernel.vmlinux]      [k] _raw_spin_lock_irqsave
       2.56%  io_uring_echo_s  [kernel.vmlinux]      [k] io_prep_async_work
       1.82%  io_uring_echo_s  [kernel.vmlinux]      [k] _raw_spin_lock
       1.33%  iou-wrk-10893    [kernel.vmlinux]      [k] io_wqe_worker
       1.26%  io_uring_echo_s  [kernel.vmlinux]      [k] try_to_wake_up
    
    spin_lock failure from 25.59% + 8.89% =  34.48% to 16.86% + 4.53% = 21.39%
    TPS is similar, while CPU usage drops from almost 400% to 350%
    
    Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
    Link: https://lore.kernel.org/r/20220206095241.121485-2-haoxu@linux.alibaba.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Hao Xu authored and axboe committed Feb 11, 2022
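
The io-wq commits above combine two ideas: size the per-account array with the enum's count value instead of a literal, and give each account its own lock so that bound and unbound work do not contend on one big lock. A minimal user-space sketch of both, assuming POSIX mutexes in place of kernel spinlocks, is shown below. All `toy_` and `TOY_` names are hypothetical and only loosely mirror the kernel structures. A pending counter stands in for the real work list.

```c
#include <pthread.h>

/* Account types; the last enumerator is the count and sizes the array. */
enum {
    TOY_ACCT_BOUND,
    TOY_ACCT_UNBOUND,
    TOY_ACCT_NR,
};

/* Each account carries its own lock, protecting only its own state. */
struct toy_acct {
    pthread_mutex_t lock;
    unsigned int nr_pending;   /* stand-in for the account's work list */
};

struct toy_wqe {
    struct toy_acct acct[TOY_ACCT_NR];   /* not acct[2] */
};

static void toy_wqe_init(struct toy_wqe *wqe)
{
    for (int i = 0; i < TOY_ACCT_NR; i++) {
        pthread_mutex_init(&wqe->acct[i].lock, NULL);
        wqe->acct[i].nr_pending = 0;
    }
}

/* Insertion takes only the lock of the account it touches. */
static void toy_enqueue(struct toy_wqe *wqe, int type)
{
    struct toy_acct *acct = &wqe->acct[type];

    pthread_mutex_lock(&acct->lock);
    acct->nr_pending++;
    pthread_mutex_unlock(&acct->lock);
}

/* Consumption likewise; returns 1 if an item was taken, 0 if empty. */
static int toy_dequeue(struct toy_wqe *wqe, int type)
{
    struct toy_acct *acct = &wqe->acct[type];
    int taken = 0;

    pthread_mutex_lock(&acct->lock);
    if (acct->nr_pending > 0) {
        acct->nr_pending--;
        taken = 1;
    }
    pthread_mutex_unlock(&acct->lock);
    return taken;
}
```

With this layout, a thread enqueueing bound work and a worker dequeueing unbound work never take the same lock, which is the kind of contention reduction the perf numbers above are measuring.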

Commits on Feb 8, 2022

  1. Merge branch 'for-5.18/drivers' into for-next

    * for-5.18/drivers:
      block/rnbd: Remove a useless mutex
    axboe committed Feb 8, 2022
  2. block/rnbd: Remove a useless mutex

    According to lib/idr.c,
       The IDA handles its own locking.  It is safe to call any of the IDA
       functions without synchronisation in your code.
    
    so the 'ida_lock' mutex can just be removed.
    It is here only to protect some ida_simple_get()/ida_simple_remove() calls.
    
    While at it, switch to ida_alloc_XXX()/ida_free() instead of
    ida_simple_get()/ida_simple_remove().
    The latter is deprecated and more verbose.
    
    Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
    Acked-by: Jack Wang <jinpu.wang@ionos.com>
    Link: https://lore.kernel.org/r/7f9eccd8b1fce1bac45ac9b01a78cf72f54c0a61.1644266862.git.christophe.jaillet@wanadoo.fr
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    tititiou36 authored and axboe committed Feb 8, 2022
  3. Merge branch 'for-5.18/io_uring' into for-next

    * for-5.18/io_uring:
      io_uring: Fix use of uninitialized ret in io_eventfd_register()
    axboe committed Feb 8, 2022
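
The rnbd commit above relies on the ID allocator doing its own locking, which makes a caller-side mutex redundant. A toy user-space allocator shows the shape of that guarantee. `toy_ida`, `toy_ida_alloc`, `toy_ida_free`, and the 64-ID limit are hypothetical and chosen for brevity. The kernel's IDA is implemented differently.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

#define TOY_IDA_MAX 64   /* assumed capacity: one bit per ID */

/* The lock lives inside the allocator, so callers need none of their own. */
struct toy_ida {
    pthread_mutex_t lock;
    uint64_t used;       /* bit i set means ID i is allocated */
};

#define TOY_IDA_INIT { PTHREAD_MUTEX_INITIALIZER, 0 }

/* Returns the smallest free ID, or -ENOSPC when all IDs are taken. */
static int toy_ida_alloc(struct toy_ida *ida)
{
    int id = -ENOSPC;

    pthread_mutex_lock(&ida->lock);
    for (int i = 0; i < TOY_IDA_MAX; i++) {
        if (!(ida->used & (UINT64_C(1) << i))) {
            ida->used |= UINT64_C(1) << i;
            id = i;
            break;
        }
    }
    pthread_mutex_unlock(&ida->lock);
    return id;
}

/* Releases an ID; out-of-range values are ignored. */
static void toy_ida_free(struct toy_ida *ida, int id)
{
    if (id < 0 || id >= TOY_IDA_MAX)
        return;

    pthread_mutex_lock(&ida->lock);
    ida->used &= ~(UINT64_C(1) << id);
    pthread_mutex_unlock(&ida->lock);
}
```

Callers in this model invoke `toy_ida_alloc()` and `toy_ida_free()` directly from any thread. Wrapping those calls in an extra mutex, as the removed 'ida_lock' did, would add nothing.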