Commits

Michael-Wei/dm…

Commits on Aug 12, 2021

  1. dm crypt: log aead integrity violations to audit subsystem

    Since the dm-crypt target can be stacked on dm-integrity targets to
    provide authenticated encryption, integrity violations are detected
    here during AEAD computation. We use the dm-audit submodule to
    signal those events to user space, too.
    
    The construction and destruction of crypt device mappings are also
    logged as audit events.
    
    Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
    quitschbo authored and intel-lab-lkp committed Aug 12, 2021
    2f59ff1
  2. dm integrity: log audit events for dm-integrity target

    dm-integrity signals integrity violations by returning I/O errors
    to user space. To identify integrity violations by a controlling
    instance, the kernel audit subsystem can be used to emit audit
    events to user space. We use the new dm-audit submodule, which allows
    audit events to be emitted on relevant I/O errors.
    
    The construction and destruction of integrity device mappings are
    also relevant for auditing a system. Thus, those events are also
    logged as audit events.
    
    Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
    quitschbo authored and intel-lab-lkp committed Aug 12, 2021
    6fbfa05
  3. dm: introduce audit event module for device mapper

    To be able to send auditing events to user space, we introduce
    a generic dm-audit module. It provides helper functions to emit
    audit events through the kernel audit subsystem. We claim the
    AUDIT_DM type=1336 out of the audit event messages range in the
    corresponding userspace api in 'include/uapi/linux/audit.h' for
    those events.
    
    Subsequent commits to device mapper targets will actually make
    use of this to emit those events in relevant cases.
    
    Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
    quitschbo authored and intel-lab-lkp committed Aug 12, 2021
    a409dd3

Commits on Aug 10, 2021

  1. dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()

    On many-core systems using dm-crypt, heavy spinlock contention in
    percpu_counter_compare() can be observed when the dmcrypt page allocation
    limit for a given device is reached or close to being reached. This is due
    to percpu_counter_compare() taking a spinlock to compute an exact
    result on potentially many CPUs at the same time.
    
    Switch to non-exact comparison of allocated and allowed pages by using
    the value returned by percpu_counter_read_positive().
    
    This may over/under estimate the actual number of allocated pages by at
    most (batch-1) * num_online_cpus() (assuming my understanding of the
    percpu_counter logic is proper).
    
    Currently, batch is bounded by 32. The system on which this issue was
    first observed has 256 CPUs and 512G of RAM. With a 4k page size, this
    change may over/under estimate by 31MB. With ~10G (2%) allowed for dmcrypt
    allocations, this seems an acceptable error. Certainly preferred over
    running into the spinlock contention.
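
The error bound quoted above can be checked with a few lines of arithmetic. The sketch below is a plain userspace calculation, not kernel code; the function name `max_counter_error_bytes` is made up for illustration, and the inputs (a batch of 32, 256 CPUs, 4 KiB pages) are the figures given in the message.

```python
def max_counter_error_bytes(batch: int, num_cpus: int, page_size: int) -> int:
    """Worst-case drift of an approximate per-CPU counter read, in bytes.

    Assumes, as the commit message does, that each CPU may hold up to
    (batch - 1) pages that are not yet folded into the shared count.
    """
    return (batch - 1) * num_cpus * page_size


error = max_counter_error_bytes(batch=32, num_cpus=256, page_size=4096)
error_mib = error / (1024 * 1024)
print(f"{error_mib:.0f} MiB")  # 31 * 256 * 4096 bytes = 31 MiB
```

Measured against the roughly 10G allowance mentioned in the message, a 31 MiB drift is on the order of 0.3% of the limit.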
    
    This behavior was separately/artificially reproduced on an EC2 c5.24xlarge
    instance system with 96 CPUs and 192GB RAM as follows, but can be
    provoked on systems with fewer available CPUs.
    
     * Disable swap
     * Tune vm settings to promote regular writeback
         $ echo 50 > /proc/sys/vm/dirty_expire_centisecs
         $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
         $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
    
     * Create 8 dmcrypt devices based on files on a tmpfs
     * Create and mount an ext4 filesystem on each crypt device
     * Run stress-ng --hdd 8 within one of the above filesystems
    
    Total %system usage shown via sysstat goes to ~35%, write throughput on the
    underlying loop device is ~2GB/s. perf profiling an individual kworker
    kcryptd thread shows the following in the profile, indicating it hits
    heavy spinlock contention in percpu_counter_compare():
    
        99.98%     0.00%  kworker/u193:46  [kernel.kallsyms]  [k] ret_from_fork
                |
                ---ret_from_fork
                   kthread
                   worker_thread
                   |
                    --99.92%--process_one_work
                              |
                              |--80.52%--kcryptd_crypt
                              |          |
                              |          |--62.58%--mempool_alloc
                              |          |          |
                              |          |           --62.24%--crypt_page_alloc
                              |          |                     |
                              |          |                      --61.51%--__percpu_counter_compare
                              |          |                                |
                              |          |                                 --61.34%--__percpu_counter_sum
                              |          |                                           |
                              |          |                                           |--58.68%--_raw_spin_lock_irqsave
                              |          |                                           |          |
                              |          |                                           |           --58.30%--native_queued_spin_lock_slowpath
                              |          |                                           |
                              |          |                                            --0.69%--cpumask_next
                              |          |                                                      |
                              |          |                                                       --0.51%--_find_next_bit
                              |          |
                              |          |--10.61%--crypt_convert
                              |          |          |
                              |          |          |--6.05%--xts_crypt
                              ...
    
    After applying this change, %system usage is lowered to ~7% and
    write throughput on the loopback interface increases to 2.7GB/s.
    The profile shows mempool_alloc() as ~8% rather than ~62% in the
    profile and not hitting the percpu_counter() spinlock anymore.
    
        |--8.15%--mempool_alloc
        |          |
        |          |--3.93%--crypt_page_alloc
        |          |          |
        |          |           --3.75%--__alloc_pages
        |          |                     |
        |          |                      --3.62%--get_page_from_freelist
        |          |                                |
        |          |                                 --3.22%--rmqueue_bulk
        |          |                                           |
        |          |                                            --2.59%--_raw_spin_lock
        |                                                      |
        |          |                                                       --2.57%--native_queued_spin_lock_slowpath
        |          |
        |           --3.05%--_raw_spin_lock_irqsave
        |                     |
        |                      --2.49%--native_queued_spin_lock_slowpath
    
    Suggested-by: DJ Gregor <dj@corelight.com>
    Signed-off-by: Arne Welzel <arne.welzel@corelight.com>
    Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
    Fixes: 5059353 ("dm crypt: limit the number of allocated pages")
    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    awelzel authored and snitm committed Aug 10, 2021
    5a2a338
  2. dm: add documentation for IMA measurement support

    To interpret various DM target measurement data in IMA logs,
    a separate documentation page is needed under
    Documentation/admin-guide/device-mapper.
    
    Add documentation to help system administrators and attestation
    client/server component owners to interpret the measurement
    data generated by various DM targets, on various device/table state
    changes.
    
    Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Tushar Sugandhi authored and snitm committed Aug 10, 2021
    00d4399
  3. dm: update target status functions to support IMA measurement

    For device mapper targets to take advantage of IMA's measurement
    capabilities, the status functions for the individual targets need to be
    updated to handle the status_type_t case for value STATUSTYPE_IMA.
    
    Update status functions for the following target types, to log their
    respective attributes to be measured using IMA.
     01. cache
     02. crypt
     03. integrity
     04. linear
     05. mirror
     06. multipath
     07. raid
     08. snapshot
     09. striped
     10. verity
    
    For the rest of the targets, handle the STATUSTYPE_IMA case by setting the
    measurement buffer to NULL.
    
    For IMA to measure the data on a given system, the IMA policy on the
    system needs to be updated to have the following line, and the system
    needs to be restarted for the measurements to take effect.
    
    /etc/ima/ima-policy
     measure func=CRITICAL_DATA label=device-mapper template=ima-buf
    
    The measurements will be reflected in the IMA logs, which are located at:
    
    /sys/kernel/security/integrity/ima/ascii_runtime_measurements
    /sys/kernel/security/integrity/ima/binary_runtime_measurements
    
    These IMA logs can later be consumed by various attestation clients
    running on the system, which can send them to external services for attesting
    the system.
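
An attestation client reading the ASCII log might split each entry as sketched below. This is a hypothetical userspace helper, not part of the patch: it assumes the six whitespace-separated fields commonly seen for the `ima-buf` template (PCR, template hash, template name, data hash, event name, hex-encoded buffer), and the sample line, including the `dm_table_load` event name and the payload, is invented for illustration.

```python
from typing import NamedTuple


class ImaBufEntry(NamedTuple):
    pcr: int
    template_hash: str
    template: str
    data_hash: str
    event_name: str
    payload: str


def parse_ima_buf_line(line: str) -> ImaBufEntry:
    """Split one ascii_runtime_measurements line recorded with ima-buf."""
    pcr, template_hash, template, data_hash, event_name, hex_buf = line.split()
    if template != "ima-buf":
        raise ValueError(f"not an ima-buf entry: {template}")
    # The measured buffer is stored as hex; decode it back to text.
    payload = bytes.fromhex(hex_buf).decode("ascii", errors="replace")
    return ImaBufEntry(int(pcr), template_hash, template, data_hash,
                       event_name, payload)


# Hypothetical sample: the hashes are zero placeholders, not real measurements.
sample = ("10 " + "0" * 40 + " ima-buf sha256:" + "0" * 64
          + " dm_table_load " + b"name=test;".hex())
entry = parse_ima_buf_line(sample)
print(entry.event_name, entry.payload)
```

Lines recorded with other templates carry a different field layout, so the helper rejects them instead of guessing.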
    
    The DM target data measured by IMA subsystem can alternatively
    be queried from userspace by setting DM_IMA_MEASUREMENT_FLAG with
    DM_TABLE_STATUS_CMD.
    
    Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Tushar Sugandhi authored and snitm committed Aug 10, 2021
    8ec4566
  4. dm ima: measure data on device rename

    A given block device is identified by its name and UUID.  However, both
    these parameters can be renamed.  For an external attestation service to
    correctly attest a given device, it needs to keep track of these rename
    events.
    
    Update the device data with the new values for IMA measurements.  Measure
    both old and new device name/UUID parameters in the same IMA measurement
    event, so that the old and the new values can be connected later.
    
    Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Tushar Sugandhi authored and snitm committed Aug 10, 2021
    7d1d1df
  5. dm ima: measure data on table clear

    For a given block device, an inactive table slot contains the parameters
    to configure the device with.  The inactive table can be cleared
    multiple times, accidentally or maliciously, which may impact the
    functionality of the device, and compromise the system.  Therefore it is
    important to measure and log the event when a table is cleared.
    
    Measure device parameters, and table hashes when the inactive table slot
    is cleared.
    
    Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Tushar Sugandhi authored and snitm committed Aug 10, 2021
    99169b9
  6. dm ima: measure data on device remove

    Presence of an active block-device, configured with expected parameters,
    is important for an external attestation service to determine if a system
    meets the attestation requirements.  Therefore it is important for DM to
    measure the device remove events.
    
    Measure device parameters and table hashes when the device is removed,
    using either remove or remove_all.
    
    Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Tushar Sugandhi authored and snitm committed Aug 10, 2021
    84010e5
  7. dm ima: measure data on device resume

    A given block device can load a table multiple times, with different
    input parameters, before eventually resuming it.  Further, a device may
    be suspended and then resumed.  The device may never resume after a
    table-load.  Because of the above valid scenarios for a given device,
    it is important to measure and log the device resume event using IMA.
    
    Also, if the table is large, measuring it in clear-text each time the
    device changes state, will unnecessarily increase the size of IMA log.
    Since the table clear-text is already measured during table-load event,
    measuring the hash during resume should be sufficient to validate the
    table contents.
    
    Measure the device parameters, and hash of the active table, when the
    device is resumed.
    
    Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Tushar Sugandhi authored and snitm committed Aug 10, 2021
    8eb6fab
  8. dm ima: measure data on table load

    DM configures a block device with various target specific attributes
    passed to it as a table.  DM loads the table, and calls each target’s
    respective constructors with the attributes as input parameters.
    Some of these attributes are critical to ensure the device meets
    a certain security bar.  Thus, IMA should measure these attributes to
    ensure they are not tampered with during the lifetime of the device,
    so that external services can have high confidence in the
    configuration of the block devices on a given system.
    
    Some devices may have large tables.  And a given device may change its
    state (table-load, suspend, resume, rename, remove, table-clear etc.)
    many times.  Measuring these attributes each time when the device
    changes its state will significantly increase the size of the IMA logs.
    Further, once configured, these attributes are not expected to change
    unless a new table is loaded, or a device is removed and recreated.
    Therefore the clear-text of the attributes should only be measured
    during table load, and the hash of the active/inactive table should be
    measured for the remaining device state changes.
    
    Export IMA function ima_measure_critical_data() to allow measurement
    of DM device parameters, as well as target specific attributes, during
    table load.  Compute the hash of the inactive table and store it for
    measurements during future state change.  If a load is called multiple
    times, update the inactive table hash with the hash of the latest
    populated table, so that the correct inactive table hash is measured
    when the device transitions to different states like resume, remove,
    rename, etc.
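
The scheme described above (clear text on table load, hash only on later state changes) can be modelled in a few lines. This is a toy userspace sketch rather than the kernel implementation: the class `MeasuredDevice`, its event labels, the sample table strings, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


class MeasuredDevice:
    """Toy model: log table clear text on load, only its hash afterwards."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.inactive_hash = None
        self.active_hash = None
        self.log = []

    def table_load(self, table: str) -> None:
        # A repeated load replaces the stored inactive-table hash.
        self.inactive_hash = hashlib.sha256(table.encode()).hexdigest()
        self.log.append(("table_load", self.name, table))

    def resume(self) -> None:
        # The inactive table, if any, becomes the active one.
        if self.inactive_hash is not None:
            self.active_hash = self.inactive_hash
            self.inactive_hash = None
        # Only the hash is logged here, never the table text again.
        self.log.append(("device_resume", self.name, self.active_hash))


dev = MeasuredDevice("test")
dev.table_load("0 8 linear 8:0 0")
dev.table_load("0 16 linear 8:0 0")  # second load wins
dev.resume()
```

After the two loads and the resume, the last log entry carries the hash of the second table, which a verifier can match against the clear text logged at load time.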
    
    Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
    Signed-off-by: Colin Ian King <colin.king@canonical.com> # leak fix
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Tushar Sugandhi authored and snitm committed Aug 10, 2021
    91ccbba
  9. dm writecache: add event counters

    Add 10 counters for various events (hit, miss, etc) and export them in
    the status line (accessed from userspace with "dmsetup status"). Also
    add a message "clear_stats" that resets these counters.
    
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Mikulas Patocka authored and snitm committed Aug 10, 2021
    e3a35d0
  10. dm writecache: report invalid return from writecache_map helpers

    If some "writecache_map_*" function returns invalid state, it is a bug.
    So, we should report it and not fail silently.
    
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Mikulas Patocka authored and snitm committed Aug 10, 2021
    df699cc
  11. dm writecache: further writecache_map() cleanup

    Factor out writecache_map_flush() and writecache_map_discard() from
    writecache_map(). Also eliminate the various goto labels in
    writecache_map().
    
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    snitm committed Aug 10, 2021
    15cb6f3
  12. dm writecache: factor out writecache_map_remap_origin()

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    snitm committed Aug 10, 2021
    4d020b3
  13. dm writecache: split up writecache_map() to improve code readability

    writecache_map() has grown too large and can be confusing to read given
    all the goto statements.
    
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    snitm committed Aug 10, 2021
    cdd4d78
  14. writeback: make the laptop_mode prototypes available unconditionally

    Fix the !CONFIG_BLOCK build after the recent cleanup.
    
    Fixes: 5ed964f ("mm: hide laptop_mode_wb_timer entirely behind the BDI API")
    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Aug 10, 2021
    99d26de

Commits on Aug 9, 2021

  1. block: return ELEVATOR_DISCARD_MERGE if possible

    When merging a bio into a request, if both are discard I/O and the queue
    supports multi-range discard, we need to return ELEVATOR_DISCARD_MERGE
    because both block core and related drivers (nvme, virtio-blk) don't
    handle mixed discard I/O merges (traditional I/O merges together with
    discard merges) well.
    
    Fix the issue by returning ELEVATOR_DISCARD_MERGE in this situation,
    so both blk-mq and drivers just need to handle multi-range discard.
    
    Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Fixes: 2705dfb ("block: fix discard request merge")
    Link: https://lore.kernel.org/r/20210729034226.1591070-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Ming Lei authored and axboe committed Aug 9, 2021
    866663b
  2. block: remove the bd_bdi in struct block_device

    Just retrieve the bdi from the disk.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210809141744.1203023-6-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Aug 9, 2021
    a11d7fc
  3. block: move the bdi from the request_queue to the gendisk

    The backing device information only makes sense for file system I/O,
    and thus belongs in the gendisk and not the lower level request_queue
    structure.  Move it there.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210809141744.1203023-5-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Aug 9, 2021
    edb0872
  4. block: add a queue_has_disk helper

    Add a helper to check if a gendisk is associated with a request_queue.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210809141744.1203023-4-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Aug 9, 2021
    1008162
  5. block: pass a gendisk to blk_queue_update_readahead

    .. and rename the function to disk_update_readahead.  This is in
    preparation for moving the BDI from the request_queue to the gendisk.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210809141744.1203023-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Aug 9, 2021
    471aa70
  6. mm: hide laptop_mode_wb_timer entirely behind the BDI API

    Don't leak the details of the timer into the block layer, instead
    initialize the timer in bdi_alloc and delete it in bdi_unregister.
    Note that this means the timer is initialized (but not armed) for
    non-block queues as well now.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210809141744.1203023-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Aug 9, 2021
    5ed964f
  7. block: remove support for delayed queue registrations

    Now that device mapper has been changed to register the disk once
    it is fully ready all this code is unused.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Mike Snitzer <snitzer@redhat.com>
    Link: https://lore.kernel.org/r/20210804094147.459763-9-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Aug 9, 2021
    d1254a8
  8. dm: delay registering the gendisk

    device mapper is currently the only outlier that tries to call
    register_disk after add_disk, leading to fairly inconsistent state
    of these block layer data structures.  Instead change device-mapper
    to just register the gendisk later now that the holder mechanism
    can cope with that.
    
    Note that this introduces a user visible change: the dm kobject is
    now only visible after the initial table has been loaded.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Mike Snitzer <snitzer@redhat.com>
    Link: https://lore.kernel.org/r/20210804094147.459763-8-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Aug 9, 2021
    89f871a
  9. dm: move setting md->type into dm_setup_md_queue

    Move setting md->type from both callers into dm_setup_md_queue.
    This ensures that md->type is only set to a valid value after the queue
    has been fully set up, something we'll rely on in future changes.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Mike Snitzer <snitzer@redhat.com>
    Link: https://lore.kernel.org/r/20210804094147.459763-7-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Aug 9, 2021
    ba30585
  10. dm: cleanup cleanup_mapped_device

    md->queue is now always set when md->disk is set, so simplify the
    conditionals a bit.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Mike Snitzer <snitzer@redhat.com>
    Link: https://lore.kernel.org/r/20210804094147.459763-6-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Aug 9, 2021
    74a2b6e
  11. block: support delayed holder registration

    device mapper needs to register holders before it is ready to do I/O.
    Currently it does so by registering the disk early, which can leave
    the disk and queue in a weird half state where the queue is registered
    with the disk, except for sysfs and the elevator.  And this state has
    been a bit problematic before, and will get more so when sorting out
    the responsibilities between the queue and the disk.
    
    Support registering holders on an initialized but not registered disk
    instead by delaying the sysfs registration until the disk is registered.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Mike Snitzer <snitzer@redhat.com>
    Link: https://lore.kernel.org/r/20210804094147.459763-5-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Aug 9, 2021
    d626338
  12. block: look up holders by bdev

    Invert the way the holder relations are tracked.  This very
    slightly reduces the memory overhead for partitioned devices.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20210804094147.459763-4-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Aug 9, 2021
    0dbcfe2
  13. block: remove the extra kobject reference in bd_link_disk_holder

    Since commit 0d02129 ("block: merge struct block_device and struct
    hd_struct") there is no way for the bdev to go away as long as there is
    a holder, so remove the extra references.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Mike Snitzer <snitzer@redhat.com>
    Link: https://lore.kernel.org/r/20210804094147.459763-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Aug 9, 2021
    fbd9a39
  14. block: make the block holder code optional

    Move the block holder code into a separate file as it is not in any way
    related to the other block_dev.c code, and add a new selectable config
    option for it so that we don't have to build it without any remapped
    drivers selected.
    
    The Kconfig symbol contains a _DEPRECATED suffix to match the comments
    added in commit 49731ba
    ("block: restore multiple bd_link_disk_holder() support").
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Mike Snitzer <snitzer@redhat.com>
    Link: https://lore.kernel.org/r/20210804094147.459763-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Aug 9, 2021
    c66fd01

Commits on Aug 5, 2021

  1. loop: Select I/O scheduler 'none' from inside add_disk()

    We noticed that the user interface of Android devices becomes very slow
    under memory pressure. This is because Android uses the zram driver on top
    of the loop driver for swapping, because under memory pressure the swap
    code alternates reads and writes quickly, because mq-deadline is the
    default scheduler for loop devices and because mq-deadline delays writes by
    five seconds for such a workload with default settings. Fix this by making
    the kernel select I/O scheduler 'none' from inside add_disk() for loop
    devices. This default can be overridden at any time from user space,
    e.g. via a udev rule. This approach has an advantage compared to changing
    the I/O scheduler from userspace from 'mq-deadline' to 'none', namely
    that synchronize_rcu() does not get called.
    
    This patch changes the default I/O scheduler for loop devices from
    'mq-deadline' to 'none'.
    
    Additionally, this patch reduces the Android boot time on my test setup
    by 0.5 seconds compared to configuring the loop I/O scheduler from user
    space.
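
To confirm which scheduler a loop device ended up with, user space can read the queue's `scheduler` attribute in sysfs (for example `/sys/block/loop0/queue/scheduler`, where `loop0` is just an example device), in which the active entry is shown in square brackets. The helper below is a small illustrative parser for that string; its name is made up and it is not part of the patch.

```python
def active_scheduler(sysfs_text: str) -> str:
    """Return the bracketed entry from a queue/scheduler sysfs string."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active scheduler marked")


print(active_scheduler("[none] mq-deadline"))  # none
```

Writing a scheduler name to the same attribute, by hand or from a udev rule, is how the default described above can be overridden.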
    
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Ming Lei <ming.lei@redhat.com>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: Martijn Coenen <maco@android.com>
    Cc: Jaegeuk Kim <jaegeuk@kernel.org>
    Signed-off-by: Bart Van Assche <bvanassche@acm.org>
    Link: https://lore.kernel.org/r/20210805174200.3250718-3-bvanassche@acm.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    bvanassche authored and axboe committed Aug 5, 2021
    2112f5c
  2. blk-mq: Introduce the BLK_MQ_F_NO_SCHED_BY_DEFAULT flag

    elevator_get_default() uses the following algorithm to select an I/O
    scheduler from inside add_disk():
    - In case of a single hardware queue or if sharing hardware queues across
      multiple request queues (BLK_MQ_F_TAG_HCTX_SHARED), use mq-deadline.
    - Otherwise, use 'none'.
    
    This is a good choice for most but not for all block drivers. Make it
    possible to override the selection of mq-deadline with a new flag,
    namely BLK_MQ_F_NO_SCHED_BY_DEFAULT.
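
The selection rule, including the new override, can be summarised in a short model. This is a userspace sketch of the decision described above, not the kernel function itself; `default_scheduler` and its parameter names are invented for illustration, with `None` standing in for 'none'.

```python
from typing import Optional


def default_scheduler(nr_hw_queues: int, shared_hw_queues: bool,
                      no_sched_by_default: bool = False) -> Optional[str]:
    """Return the default scheduler name, or None for 'none'."""
    # A driver that sets the new flag opts out of mq-deadline entirely.
    if no_sched_by_default:
        return None
    # A single hardware queue, or hardware queues shared across request queues.
    if nr_hw_queues == 1 or shared_hw_queues:
        return "mq-deadline"
    return None


print(default_scheduler(1, False))                            # mq-deadline
print(default_scheduler(1, False, no_sched_by_default=True))  # None
```

Checking the flag first keeps the existing two-branch rule untouched for every driver that does not set it.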
    
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Ming Lei <ming.lei@redhat.com>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: Martijn Coenen <maco@android.com>
    Cc: Jaegeuk Kim <jaegeuk@kernel.org>
    Signed-off-by: Bart Van Assche <bvanassche@acm.org>
    Link: https://lore.kernel.org/r/20210805174200.3250718-2-bvanassche@acm.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    bvanassche authored and axboe committed Aug 5, 2021
    90b7198

Commits on Aug 2, 2021

  1. block: remove blk-mq-sysfs dead code

    In block/blk-mq-sysfs.c, struct blk_mq_ctx_sysfs_entry is not used to
    define any attribute since the "mq" sysfs directory contains only
    sub-directories (no attribute files). As a result, blk_mq_sysfs_show(),
    blk_mq_sysfs_store(), and struct sysfs_ops blk_mq_sysfs_ops are all
    unused and unnecessary. Remove all this unused code.
    
    Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
    Link: https://lore.kernel.org/r/20210713081837.524422-1-damien.lemoal@wdc.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    damien-lemoal authored and axboe committed Aug 2, 2021
    2bc1f6e
  2. loop: raise media_change event

    Make the loop device raise a DISK_MEDIA_CHANGE event on attach or detach.
    
    	# udevadm monitor -up |grep -e DISK_MEDIA_CHANGE -e DEVNAME &
    
    	# losetup -f zero
    	[    7.454235] loop0: detected capacity change from 0 to 16384
    	DISK_MEDIA_CHANGE=1
    	DEVNAME=/dev/loop0
    	DEVNAME=/dev/loop0
    	DEVNAME=/dev/loop0
    
    	# losetup -f zero
    	[   10.205245] loop1: detected capacity change from 0 to 16384
    	DISK_MEDIA_CHANGE=1
    	DEVNAME=/dev/loop1
    	DEVNAME=/dev/loop1
    	DEVNAME=/dev/loop1
    
    	# losetup -f zero2
    	[   13.532368] loop2: detected capacity change from 0 to 40960
    	DISK_MEDIA_CHANGE=1
    	DEVNAME=/dev/loop2
    	DEVNAME=/dev/loop2
    
    	# losetup -D
    	DEVNAME=/dev/loop1
    	DISK_MEDIA_CHANGE=1
    	DEVNAME=/dev/loop1
    	DEVNAME=/dev/loop2
    	DISK_MEDIA_CHANGE=1
    	DEVNAME=/dev/loop2
    	DEVNAME=/dev/loop0
    	DISK_MEDIA_CHANGE=1
    	DEVNAME=/dev/loop0
    
    Signed-off-by: Matteo Croce <mcroce@microsoft.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Tested-by: Luca Boccassi <bluca@debian.org>
    Link: https://lore.kernel.org/r/20210712230530.29323-7-mcroce@linux.microsoft.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    teknoraver authored and axboe committed Aug 2, 2021
    9f65c48