Commits on Oct 19, 2021

  1. block: re-flow blk_mq_rq_ctx_init()

    Now that we have flags passed in, we can do a final re-arrange of the
    flow of blk_mq_rq_ctx_init() so we're always writing the request in the
    order in which it is laid out.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe authored and intel-lab-lkp committed Oct 19, 2021
  2. block: prefetch request to be initialized

    Now we have the tags available in __blk_mq_alloc_requests_batch(), we
    can start fetching the first request cacheline before calling into the
    request initialization.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe authored and intel-lab-lkp committed Oct 19, 2021
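    The pattern the commit above describes — prefetching the request's first
    cacheline once the tag is known, before initialization touches it — can be
    sketched in userspace with GCC's `__builtin_prefetch` (the kernel's
    `prefetch()` analogue). All names here are illustrative stand-ins, not the
    real block-layer structs.

```c
#include <stddef.h>

/* Hypothetical request layout, a stand-in for struct request. */
struct req {
    unsigned long flags;
    int tag;
    void *payload;
};

/* Once the tag (an array index) is known, start pulling the request's
 * first cacheline toward the CPU before the init code writes to it. */
static struct req *fetch_and_prep(struct req *pool, unsigned int tag)
{
    struct req *rq = &pool[tag];
    /* rw=1: prefetch for write; locality=3: keep in all cache levels. */
    __builtin_prefetch(rq, 1, 3);
    return rq;
}

static void init_req(struct req *rq, int tag)
{
    rq->flags = 0;
    rq->tag = tag;
    rq->payload = NULL;
}
```

    The prefetch is only a hint; the code is correct with or without it, which
    is what makes the pattern safe to apply in a batch-allocation loop.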
  3. block: pass in blk_mq_tags to blk_mq_rq_ctx_init()

    Instead of getting this from data for every invocation of request
    initialization, pass it in as an argument.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe authored and intel-lab-lkp committed Oct 19, 2021
  4. block: add rq_flags to struct blk_mq_alloc_data

    There's a hole here we can use, and it's faster to set this earlier
    rather than need to check q->elevator multiple times.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe authored and intel-lab-lkp committed Oct 19, 2021
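    The "hole" the commit above refers to is alignment padding. As a hedged
    illustration (this is not the real struct blk_mq_alloc_data layout): on
    LP64, a 4-byte field following a pointer leaves a 4-byte padding hole
    before the next pointer, and a new 4-byte field can occupy it without
    growing the struct.

```c
#include <stddef.h>

/* Illustrative only. On LP64: pointer (8) + unsigned int (4) leaves
 * 4 bytes of padding before the next 8-byte-aligned pointer. */
struct alloc_data_before {
    void *q;            /* offset 0  */
    unsigned int flags; /* offset 8, then a 4-byte hole */
    void *ctx;          /* offset 16 */
};

struct alloc_data_after {
    void *q;
    unsigned int flags;
    unsigned int rq_flags; /* fills the hole: struct size unchanged */
    void *ctx;
};

_Static_assert(sizeof(struct alloc_data_before) ==
               sizeof(struct alloc_data_after),
               "rq_flags fits in the existing padding hole");
```

    Tools like `pahole` report these holes directly, which is how such
    free slots are typically found in kernel structs.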
  5. Merge branch 'for-5.16/bdev-size' into for-next

    * for-5.16/bdev-size:
      partitions/ibm: use bdev_nr_sectors instead of open coding it
      partitions/efi: use bdev_nr_bytes instead of open coding it
      block/ioctl: use bdev_nr_sectors and bdev_nr_bytes
    axboe committed Oct 19, 2021
  6. partitions/ibm: use bdev_nr_sectors instead of open coding it

    Use the proper helper to read the block device size and switch various
    places to pass the size in terms of sectors which is more practical.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211019062024.2171074-4-hch@lst.de
    [axboe: fix comment typo]
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 19, 2021
  7. partitions/efi: use bdev_nr_bytes instead of open coding it

    Use the proper helper to read the block device size.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211019062024.2171074-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 19, 2021
  8. block/ioctl: use bdev_nr_sectors and bdev_nr_bytes

    Use the proper helper to read the block device size.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211019062024.2171074-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 19, 2021
  9. Merge branch 'for-5.16/block' into for-next

    * for-5.16/block:
      blk-wbt: prevent NULL pointer dereference in wb_timer_fn
    axboe committed Oct 19, 2021
  10. blk-wbt: prevent NULL pointer dereference in wb_timer_fn

    The timer callback used to evaluate if the latency is exceeded can be
    executed after the corresponding disk has been released, causing the
    following NULL pointer dereference:
    
    [ 119.987108] BUG: kernel NULL pointer dereference, address: 0000000000000098
    [ 119.987617] #PF: supervisor read access in kernel mode
    [ 119.987971] #PF: error_code(0x0000) - not-present page
    [ 119.988325] PGD 7c4a4067 P4D 7c4a4067 PUD 7bf63067 PMD 0
    [ 119.988697] Oops: 0000 [#1] SMP NOPTI
    [ 119.988959] CPU: 1 PID: 9353 Comm: cloud-init Not tainted 5.15-rc5+arighi #rc5+arighi
    [ 119.989520] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
    [ 119.990055] RIP: 0010:wb_timer_fn+0x44/0x3c0
    [ 119.990376] Code: 41 8b 9c 24 98 00 00 00 41 8b 94 24 b8 00 00 00 41 8b 84 24 d8 00 00 00 4d 8b 74 24 28 01 d3 01 c3 49 8b 44 24 60 48 8b 40 78 <4c> 8b b8 98 00 00 00 4d 85 f6 0f 84 c4 00 00 00 49 83 7c 24 30 00
    [ 119.991578] RSP: 0000:ffffb5f580957da8 EFLAGS: 00010246
    [ 119.991937] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
    [ 119.992412] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88f476d7f780
    [ 119.992895] RBP: ffffb5f580957dd0 R08: 0000000000000000 R09: 0000000000000000
    [ 119.993371] R10: 0000000000000004 R11: 0000000000000002 R12: ffff88f476c84500
    [ 119.993847] R13: ffff88f4434390c0 R14: 0000000000000000 R15: ffff88f4bdc98c00
    [ 119.994323] FS: 00007fb90bcd9c00(0000) GS:ffff88f4bdc80000(0000) knlGS:0000000000000000
    [ 119.994952] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 119.995380] CR2: 0000000000000098 CR3: 000000007c0d6000 CR4: 00000000000006e0
    [ 119.995906] Call Trace:
    [ 119.996130] ? blk_stat_free_callback_rcu+0x30/0x30
    [ 119.996505] blk_stat_timer_fn+0x138/0x140
    [ 119.996830] call_timer_fn+0x2b/0x100
    [ 119.997136] __run_timers.part.0+0x1d1/0x240
    [ 119.997470] ? kvm_clock_get_cycles+0x11/0x20
    [ 119.997826] ? ktime_get+0x3e/0xa0
    [ 119.998110] ? native_apic_msr_write+0x2c/0x30
    [ 119.998456] ? lapic_next_event+0x20/0x30
    [ 119.998779] ? clockevents_program_event+0x94/0xf0
    [ 119.999150] run_timer_softirq+0x2a/0x50
    [ 119.999465] __do_softirq+0xcb/0x26f
    [ 119.999764] irq_exit_rcu+0x8c/0xb0
    [ 120.000057] sysvec_apic_timer_interrupt+0x43/0x90
    [ 120.000429] ? asm_sysvec_apic_timer_interrupt+0xa/0x20
    [ 120.000836] asm_sysvec_apic_timer_interrupt+0x12/0x20
    
    In this case simply return from the timer callback (no action
    required) to prevent the NULL pointer dereference.
    
    BugLink: https://bugs.launchpad.net/bugs/1947557
    Link: https://lore.kernel.org/linux-mm/YWRNVTk9N8K0RMst@arighi-desktop/
    Fixes: 34dbad5 ("blk-stat: convert to callback-based statistics reporting")
    Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
    Link: https://lore.kernel.org/r/YW6N2qXpBU3oc50q@arighi-desktop
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Andrea Righi authored and axboe committed Oct 19, 2021
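    The shape of the fix above — a periodic callback that guards against its
    backing object having already been torn down and simply returns — can be
    sketched as follows. Names are illustrative, not the actual blk-wbt
    structures.

```c
/* Stand-ins for the timer callback's backing state. */
struct rq_depth_like { int scale; };

struct wbt_like {
    struct rq_depth_like *rqd; /* NULL once the disk has been released */
    int timer_fired;
};

/* Sketch of the guarded callback: no action required if the backing
 * object is gone, so just return instead of dereferencing NULL. */
static void wb_timer_like(struct wbt_like *w)
{
    if (!w->rqd)
        return;
    w->timer_fired++;
    w->rqd->scale++;
}
```

    In the real driver the teardown ordering still matters; the guard only
    makes the callback safe to run in the race window.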
  11. Merge branch 'for-5.16/drivers' into for-next

    * for-5.16/drivers:
      block: ataflop: fix breakage introduced at blk-mq refactoring
    axboe committed Oct 19, 2021
  12. block: ataflop: fix breakage introduced at blk-mq refactoring

    Refactoring of the Atari floppy driver when converting to blk-mq
    has broken the state machine in not-so-subtle ways:
    
    finish_fdc() must be called when operations on the floppy device
    have completed. This is crucial in order to release the ST-DMA
    lock, which protects against concurrent access to the ST-DMA
    controller by other drivers (some DMA related, most just related
    to device register access - broken beyond compare, I know).
    
    When rewriting the driver's old do_request() function, the fact
    that finish_fdc() was called only when all queued requests had
    completed appears to have been overlooked. Instead, the new
    request function calls finish_fdc() immediately after the last
    request has been queued. finish_fdc() executes a dummy seek after
    most requests, and this overwrites the state machine's interrupt
    handler that was set up to wait for completion of the read/write
    request just prior. To make matters worse, finish_fdc() is called
    before device interrupts are re-enabled, making certain that the
    read/write interrupt is missed.
    
    Shifting the finish_fdc() call into the read/write request
    completion handler ensures the driver waits for the request to
    actually complete. With a queue depth of 2, we won't see long
    request sequences, so calling finish_fdc() unconditionally just
    adds a little overhead for the dummy seeks, and keeps the code
    simple.
    
    While we're at it, kill ataflop_commit_rqs() which does nothing
    but run finish_fdc() unconditionally, again likely wiping out an
    in-flight request.
    
    Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
    Fixes: 6ec3938 ("ataflop: convert to blk-mq")
    CC: linux-block@vger.kernel.org
    CC: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
    Link: https://lore.kernel.org/r/20211019061321.26425-1-schmitzmic@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Michael Schmitz authored and axboe committed Oct 19, 2021
  13. Merge branch 'for-5.16/block' into for-next

    * for-5.16/block:
      block: align blkdev_dio inlined bio to a cacheline
      block: move blk_mq_tag_to_rq() inline
      block: get rid of plug list sorting
      block: return whether or not to unplug through boolean
      block: don't call blk_status_to_errno in blk_update_request
      block: move bdev_read_only() into the header
    axboe committed Oct 19, 2021
  14. block: align blkdev_dio inlined bio to a cacheline

    We get all sorts of unreliable and funky results since the bio is
    designed to be cacheline aligned, which it is not when inlined like
    this.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 19, 2021
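    Embedding a structure inline in a container does not preserve any
    alignment the structure expects; an explicit attribute on the member
    restores it. A hedged sketch of the pattern (stand-in types, not the
    real struct blkdev_dio):

```c
#include <stdint.h>
#include <stddef.h>

#define CACHELINE 64

/* A sub-object that wants to start on a cacheline boundary. */
struct inner { uint64_t data[8]; }; /* 64 bytes */

struct container_aligned {
    int refs;
    /* Without the attribute, 'member' would start at offset 8. */
    struct inner member __attribute__((aligned(CACHELINE)));
};

_Static_assert(offsetof(struct container_aligned, member) % CACHELINE == 0,
               "inlined member starts on a cacheline boundary");
```

    The cost is padding between `refs` and `member`; the benefit is that the
    hot sub-object never straddles two cachelines.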
  15. block: move blk_mq_tag_to_rq() inline

    This is in the fast path of driver issue or completion, and it's a single
    array index operation. Move it inline to avoid a function call for it.
    
    This does mean making struct blk_mq_tags block layer public, but there's
    not really much in there.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 19, 2021
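    The lookup the commit above moves inline is just an array index, which is
    why a function call is pure overhead. A minimal userspace sketch, with
    illustrative names rather than the real struct blk_mq_tags:

```c
/* The tag is an index into a per-tagset request array. */
struct my_request { int tag; };

struct my_tags {
    unsigned int nr_tags;
    struct my_request **static_rqs; /* tag -> request map */
};

/* static inline in a header: the compiler folds this into a single
 * bounds check and load at every call site, with no call overhead. */
static inline struct my_request *my_tag_to_rq(struct my_tags *tags,
                                              unsigned int tag)
{
    return tag < tags->nr_tags ? tags->static_rqs[tag] : 0;
}
```

    Making this inline does require exposing the struct definition to
    callers, which matches the trade-off the commit message notes.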
  16. block: get rid of plug list sorting

    Even if we have multiple queues in the plug list, the chance that they
    are heavily interspersed is minimal. Don't bother spending CPU cycles
    sorting the list.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 19, 2021
  17. block: return whether or not to unplug through boolean

    Instead of returning the same queue request through a request pointer,
    use a boolean to accomplish the same.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 19, 2021
  18. block: don't call blk_status_to_errno in blk_update_request

    We only need to call it to resolve the blk_status_t -> errno mapping for
    tracing, so move the conversion into the tracepoints that are not called
    at all when tracing isn't enabled.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 19, 2021
  19. block: move bdev_read_only() into the header

    This is called for every write in the fast path, move it inline next
    to get_disk_ro() which is called internally.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 19, 2021
  20. Merge branch 'for-5.16/bdev-size' into for-next

    * for-5.16/bdev-size: (31 commits)
      block: cache inode size in bdev
      udf: use sb_bdev_nr_blocks
      reiserfs: use sb_bdev_nr_blocks
      ntfs: use sb_bdev_nr_blocks
      jfs: use sb_bdev_nr_blocks
      ext4: use sb_bdev_nr_blocks
      block: add a sb_bdev_nr_blocks helper
      block: use bdev_nr_bytes instead of open coding it in blkdev_fallocate
      squashfs: use bdev_nr_bytes instead of open coding it
      reiserfs: use bdev_nr_bytes instead of open coding it
      pstore/blk: use bdev_nr_bytes instead of open coding it
      ntfs3: use bdev_nr_bytes instead of open coding it
      nilfs2: use bdev_nr_bytes instead of open coding it
      nfs/blocklayout: use bdev_nr_bytes instead of open coding it
      jfs: use bdev_nr_bytes instead of open coding it
      hfsplus: use bdev_nr_sectors instead of open coding it
      hfs: use bdev_nr_sectors instead of open coding it
      fat: use bdev_nr_sectors instead of open coding it
      cramfs: use bdev_nr_bytes instead of open coding it
      btrfs: use bdev_nr_bytes instead of open coding it
      ...
    axboe committed Oct 19, 2021
  21. Merge branch 'for-5.16/io_uring' into for-next

    * for-5.16/io_uring: (82 commits)
      io_uring: inform block layer of how many requests we are submitting
      io_uring: simplify io_file_supports_nowait()
      io_uring: combine REQ_F_NOWAIT_{READ,WRITE} flags
      io_uring: arm poll for non-nowait files
      fs/io_uring: Prioritise checking faster conditions first in io_write
      io_uring: clean io_prep_rw()
      io_uring: optimise fixed rw rsrc node setting
      io_uring: return iovec from __io_import_iovec
      io_uring: optimise io_import_iovec fixed path
      io_uring: kill io_wq_current_is_worker() in iopoll
      io_uring: optimise req->ctx reloads
      io_uring: rearrange io_read()/write()
      io_uring: clean up io_import_iovec
      io_uring: optimise io_import_iovec nonblock passing
      io_uring: optimise read/write iov state storing
      io_uring: encapsulate rw state
      io_uring: optimise rw completion handlers
      io_uring: prioritise read success path over fails
      io_uring: consistent typing for issue_flags
      io_uring: optimise rsrc referencing
      ...
    axboe committed Oct 19, 2021
  22. Merge branch 'for-5.16/block' into for-next

    * for-5.16/block:
      block: fix too broad elevator check in blk_mq_free_request()
    axboe committed Oct 19, 2021
  23. io_uring: inform block layer of how many requests we are submitting

    The block layer can use this knowledge to make smarter decisions on
    how to handle the request, if it knows that N more may be coming. Switch
    to using blk_start_plug_nr_ios() to pass in that information.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 19, 2021
  24. io_uring: simplify io_file_supports_nowait()

    Make sure that REQ_F_SUPPORT_NOWAIT is always set in io_prep_rw(), so
    we can stop caring about setting it down the line, simplifying
    io_file_supports_nowait().
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/60c8f1f5e2cb45e00f4897b2cec10c5b3669da91.1634425438.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 19, 2021
  25. io_uring: combine REQ_F_NOWAIT_{READ,WRITE} flags

    Merge REQ_F_NOWAIT_READ and REQ_F_NOWAIT_WRITE into one flag, i.e.
    REQ_F_SUPPORT_NOWAIT. This gets rid of the dependence on CONFIG_64BIT
    and also simplifies the code.
    
    One thing to consider is the case where we don't have ->{read,write}_iter
    and go through loop_rw_iter(). Just fail it with -EAGAIN if we expect
    nowait behaviour but are not sure whether the file supports it.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/f832a20e5186c2e79c6519280c238f559a1d2bbc.1634425438.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 19, 2021
  26. io_uring: arm poll for non-nowait files

    Don't check if we can do nowait before arming apoll; there are several
    reasons for that. First, we don't care much about files that don't
    support nowait. Second, it may be useful -- we don't want to take
    extra workers away from io-wq when the request can go async. Even if
    it goes through io-wq eventually, it makes a difference in the number
    of workers actually used. And last, it's needed to clean up nowait
    handling in future commits.
    
    [kernel test robot: fix unused-var]
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/9d06f3cb2c8b686d970269a87986f154edb83043.1634425438.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 19, 2021
  27. fs/io_uring: Prioritise checking faster conditions first in io_write

    This commit reorders the conditions in a branch in io_write. The
    reorder checks 'ret2 == -EAGAIN' first, as checking
    '(req->ctx->flags & IORING_SETUP_IOPOLL)' will likely be more
    expensive due to 2x memory dereferences.
    
    Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
    Link: https://lore.kernel.org/r/20211017013229.4124279-1-goldstein.w.n@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    goldsteinn authored and axboe committed Oct 19, 2021
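    Because `&&` evaluates left to right and short-circuits, placing the cheap
    register compare first means the pointer chases are skipped entirely on
    the common path. A hedged sketch with instrumented stand-in types (not
    the real io_uring structs):

```c
/* Illustrative stand-ins for req->ctx->flags. */
struct ctx_like { unsigned int flags; };
struct req_like { struct ctx_like *ctx; };

#define SETUP_IOPOLL 0x1
#define EAGAIN_VAL   11

static int count_deref; /* counts the modelled 2x memory dereference */

static unsigned int load_flags(struct req_like *req)
{
    count_deref++;
    return req->ctx->flags;
}

/* Cheap compare first: flags are only loaded when ret2 == -EAGAIN. */
static int needs_retry(struct req_like *req, long ret2)
{
    return ret2 == -EAGAIN_VAL &&
           !(load_flags(req) & SETUP_IOPOLL);
}
```

    The counter makes the saving observable: on the common non-EAGAIN path
    the dereference never happens.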
  28. io_uring: clean io_prep_rw()

    We already store req->file in a variable in io_prep_rw(), just use it
    instead of a couple of remaining references to kiocb->ki_filp.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/2f5889fc7ab670daefd5ccaedd99416d8355f0ad.1634314022.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 19, 2021
  29. io_uring: optimise fixed rw rsrc node setting

    Move the fixed rw io_req_set_rsrc_node() from rw prep into
    io_import_fixed(); if we're using fixed buffers it will always be
    called during submission, as we save the state in advance.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/68c06f66d5aa9661f1e4b88d08c52d23528297ec.1634314022.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 19, 2021
  30. io_uring: return iovec from __io_import_iovec

    We pass iovec** into __io_import_iovec(), which is expected to keep,
    initialise, and modify it accordingly. That's expensive; instead,
    return the iovec directly from __io_import_iovec(), encoding errors
    with ERR_PTR if needed.
    
    io_import_iovec() keeps the old interface, but it's inline and so
    everything is optimised nicely.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/6230e9769982f03a8f86fa58df24666088c44d3e.1634314022.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 19, 2021
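    The ERR_PTR convention the commit above relies on encodes small negative
    errnos in the pointer value itself, so one return value carries both the
    result and the error. A userspace sketch of the idea (mimicking the
    kernel's include/linux/err.h helpers; `import_iovec_like` is purely
    hypothetical):

```c
#include <stdlib.h>

#define MAX_ERRNO 4095 /* errnos live in the top page of address space */

static inline void *err_ptr(long error)  { return (void *)error; }
static inline long  ptr_err(const void *p) { return (long)p; }

static inline int is_err(const void *p)
{
    return (unsigned long)p >= (unsigned long)-MAX_ERRNO;
}

/* Hypothetical import in the new style: the pointer is the result,
 * and errors come back encoded rather than via an out-parameter. */
static void *import_iovec_like(int want_fail)
{
    if (want_fail)
        return err_ptr(-22); /* -EINVAL */
    return malloc(16);       /* stand-in for the imported iovec */
}
```

    The win is that callers no longer pass an iovec** that the callee must
    keep coherent; a single returned pointer is checked with is_err().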
  31. io_uring: optimise io_import_iovec fixed path

    Delay loading req->rw.{addr,len} in io_import_iovec() until they're
    really needed, removing extra loads on the fixed path, which doesn't
    use them.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/3cc48dd0c4f1a37c4ce9aab5784281a2d83ad8be.1634314022.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 19, 2021
  32. io_uring: kill io_wq_current_is_worker() in iopoll

    Don't decide about locking based on io_wq_current_is_worker(); it's
    not consistent with the rest of the code and is expensive. Use
    issue_flags instead.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/7546d5a58efa4360173541c6fe02ee6b8c7b4ea7.1634314022.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 19, 2021
  33. io_uring: optimise req->ctx reloads

    Don't load req->ctx in advance; it takes an extra register and the
    field stays valid even after the opcode handlers. This also optimises
    out the req->ctx load in io_iopoll_req_issued() once it's inlined.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/1e45ff671c44be0eb904f2e448a211734893fa0b.1634314022.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 19, 2021
  34. io_uring: rearrange io_read()/write()

    Combine the force_nonblock branches (already optimised by the
    compiler), flip branches so the hottest/most common path comes first,
    e.g. the non on-stack iov setup, and add extra likely/unlikely
    annotations for error paths.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/2c2536c5896d70994de76e387ea09a0402173a3f.1634144845.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 19, 2021
  35. io_uring: clean up io_import_iovec

    Make io_import_iovec() take struct io_rw_state instead of an iter
    pointer. First, it takes care of initialising the iovec pointer, which
    can otherwise be forgotten. Even better, we can avoid initialising it
    when it's not needed, e.g. in the case of IORING_OP_READ_FIXED or
    IORING_OP_READ. Also hide the saving of iter_state inside it by
    splitting out an inline helper, avoiding extra ifs.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/b1bbc213a95e5272d4da5867bb977d9acb6f2109.1634144845.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 19, 2021