Maciej-Fijalko…

Commits on Aug 14, 2021

  1. ice: make use of ice_for_each_* macros

    Go through the code base and use ice_for_each_* macros.  While at it,
    introduce ice_for_each_xdp_txq() macro that can be used for looping over
    xdp_rings array.
    
    Commit is not introducing any new functionality.
    
    Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
    mfijalko authored and intel-lab-lkp committed Aug 14, 2021
  2. ice: introduce XDP_TX fallback path

    Under rare circumstances there might be a situation where a requirement
    of having XDP Tx queue per CPU could not be fulfilled and some of the Tx
    resources have to be shared between CPUs. This yields a need for placing
    accesses to xdp_ring inside a critical section protected by spinlock.
    These accesses happen to be in the hot path, so let's introduce the
    static branch that will be triggered from the control plane when driver
    could not provide Tx queue dedicated for XDP on each CPU.
    
    Currently, the design that has been picked is to allow any number of XDP
    Tx queues that is at least half of a count of CPUs that platform has.
    For lower number driver will bail out with a response to user that there
    were not enough Tx resources that would allow configuring XDP. The
    sharing of rings is signalled via static branch enablement which in turn
    indicates that lock for xdp_ring accesses needs to be taken in hot path.
    
    Approach based on static branch has no impact on performance of a
    non-fallback path. One thing that is needed to be mentioned is a fact
    that the static branch will act as a global driver switch, meaning that
    if one PF got out of Tx resources, then other PFs that ice driver is
    servicing will suffer. However, given the fact that HW that ice driver
    is handling has 1024 Tx queues per each PF, this is currently an
    unlikely scenario.
    
    Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
    mfijalko authored and intel-lab-lkp committed Aug 14, 2021
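
The queue-count rule described above can be sketched as a toy Python model. This is not driver code: `plan_xdp_txqs`, `ring_for_cpu`, `XdpTxqPlan` and `NotEnoughTxResources` are hypothetical names, the modulo mapping of CPUs onto rings is an assumption, and the `shared` flag only stands in for the static branch that tells the hot path to take the lock.

```python
from dataclasses import dataclass


class NotEnoughTxResources(Exception):
    """Fewer than half of the CPUs could get an XDP Tx queue."""


@dataclass
class XdpTxqPlan:
    num_xdp_txq: int  # XDP Tx queues actually configured
    shared: bool      # True: rings are shared, hot path must take a lock


def plan_xdp_txqs(num_cpus: int, avail_txq: int) -> XdpTxqPlan:
    """Decide how many XDP Tx queues to configure (toy model)."""
    if avail_txq >= num_cpus:
        # Normal case: one dedicated queue per CPU, no locking needed.
        return XdpTxqPlan(num_xdp_txq=num_cpus, shared=False)
    if 2 * avail_txq >= num_cpus:
        # Fallback: at least half of the CPU count, so rings get shared.
        return XdpTxqPlan(num_xdp_txq=avail_txq, shared=True)
    raise NotEnoughTxResources(
        f"{avail_txq} Tx queues for {num_cpus} CPUs is below the minimum")


def ring_for_cpu(cpu: int, plan: XdpTxqPlan) -> int:
    # Assumption: CPUs are folded onto the available rings with a modulo.
    return cpu % plan.num_xdp_txq
```

With 8 CPUs and 4 free queues the plan is shared and CPUs 1 and 5 land on the same ring, which is exactly the case where the lock is required.
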
  3. ice: optimize XDP_TX workloads

    Optimize Tx descriptor cleaning for XDP. Current approach doesn't
    really scale and chokes when multiple flows are handled.
    
    Introduce two ring fields, @next_dd and @next_rs, that will keep track of
    the descriptor that should be looked at when the need for cleaning arises
    and the descriptor that should have the RS bit set, respectively.
    
    Note that at this point the threshold is a constant (32), but it is
    something that we could make configurable.
    
    First thing is to get away from setting RS bit on each descriptor. Let's
    do this only once NTU is higher than the current @next_rs value. In
    such case, grab the tx_desc[next_rs], set the RS bit in descriptor and
    advance the @next_rs by 32.
    
    Second thing is to clean the Tx ring only when there are fewer than 32
    free entries. For that case, look up the tx_desc[next_dd] for a DD bit.
    This bit is written back by HW to let the driver know that xmit was
    successful. It will happen only for those descriptors that had RS bit
    set. Clean only 32 descriptors and advance @next_dd by 32.
    
    Actual cleaning routine is moved from ice_napi_poll() down to the
    ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any
    SKBs in there that would rely on interrupts for the cleaning. Nice side
    effect is that for rare case of Tx fallback path (that next patch is
    going to introduce) we don't have to trigger the SW irq to clean the
    ring.
    
    With those two concepts, ring is kept at being almost full, but it is
    guaranteed that driver will be able to produce Tx descriptors.
    
    This approach seems to work out well even though the Tx descriptors are
    produced in one-by-one manner. Test was conducted with the ice HW
    bombarded with packets from HW generator, configured to generate 30
    flows.
    
    Xdp2 sample yields the following results:
    <snip>
    proto 17:   79973066 pkt/s
    proto 17:   80018911 pkt/s
    proto 17:   80004654 pkt/s
    proto 17:   79992395 pkt/s
    proto 17:   79975162 pkt/s
    proto 17:   79955054 pkt/s
    proto 17:   79869168 pkt/s
    proto 17:   79823947 pkt/s
    proto 17:   79636971 pkt/s
    </snip>
    
    As that sample reports the Rx'ed frames, let's look at sar output.
    It says that what we Rx'ed we do actually Tx, no noticeable drops.
    Average:   IFACE      rxpck/s      txpck/s      rxkB/s      txkB/s  rxcmp/s  txcmp/s  rxmcst/s  %ifutil
    Average:  ens4f1  79842324.00  79842310.40  4678261.17  4678260.38     0.00     0.00      0.00    38.32
    
    with tx_busy staying calm.
    
    When compared to a state before:
    Average:   IFACE      rxpck/s      txpck/s      rxkB/s      txkB/s  rxcmp/s  txcmp/s  rxmcst/s  %ifutil
    Average:  ens4f1  90919711.60  42233822.60  5327326.85  2474638.04     0.00     0.00      0.00    43.64
    
    it can be observed that txpck/s almost doubled (from roughly 42.2 to
    79.8 Mpps), meaning that the performance improved by around 90%. The
    gap came from drops in the driver: previously the tx_busy stat was
    bumped at a 7 Mpps rate.
    
    Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
    mfijalko authored and intel-lab-lkp committed Aug 14, 2021
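
The @next_rs / @next_dd bookkeeping described above can be sketched as a toy Python model. It is a simplification under stated assumptions, not the driver routine: `ToyXdpRing` and `hw_complete()` are hypothetical, the ring size is assumed to be a multiple of the threshold, and "hardware" completes everything queued in one step.

```python
THRESH = 32  # the constant threshold named in the commit message


class ToyXdpRing:
    """Toy model of an XDP Tx ring with batched RS/DD handling."""

    def __init__(self, count: int = 128):
        if count % THRESH or count < 2 * THRESH:
            raise ValueError("count must be a multiple of THRESH, >= 2 * THRESH")
        self.count = count
        self.rs = [False] * count   # RS bit requested by the driver
        self.dd = [False] * count   # DD bit written back by the "hardware"
        self.ntu = 0                # next_to_use
        self.ntc = 0                # next_to_clean
        self.next_rs = THRESH - 1   # next descriptor that gets the RS bit
        self.next_dd = THRESH - 1   # next descriptor checked for the DD bit
        self.tx_busy = 0

    def unused(self) -> int:
        # One slot stays empty so a full ring differs from an empty one.
        return (self.ntc - self.ntu - 1) % self.count

    def clean(self) -> None:
        # Reclaim one batch, but only if hardware reported completion.
        if not self.dd[self.next_dd]:
            return
        self.dd[self.next_dd] = False
        self.ntc = (self.ntc + THRESH) % self.count
        self.next_dd = (self.next_dd + THRESH) % self.count

    def xmit(self) -> bool:
        if self.unused() < THRESH:
            self.clean()
        if self.unused() == 0:
            self.tx_busy += 1
            return False
        filled = self.ntu
        self.ntu = (self.ntu + 1) % self.count
        if filled == self.next_rs:
            # NTU has just moved past next_rs: request one write-back.
            self.rs[filled] = True
            self.next_rs = (self.next_rs + THRESH) % self.count
        return True

    def hw_complete(self) -> None:
        # Pretend the NIC sent everything queued so far.
        for i, wanted in enumerate(self.rs):
            if wanted:
                self.rs[i] = False
                self.dd[i] = True
```

In this model only one descriptor in 32 asks for a write-back, and cleaning is attempted from the transmit path itself once fewer than 32 entries are free, so no interrupt is needed to reclaim descriptors.
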
  4. ice: propagate xdp_ring onto rx_ring

    With rings being split, it is now convenient to introduce a pointer to
    XDP ring within the Rx ring. For XDP_TX workloads this means that
    xdp_rings array access will be skipped, which was executed per each
    processed frame.
    
    Also, read the XDP prog once per NAPI and if prog is present, set up the
    local xdp_ring pointer. Reading prog a single time was discussed in [1]
    with some concern raised by Toke around dispatcher handling and having
    the need for going through the RCU grace period in the ndo_bpf driver
    callback, but ice currently is tearing down NAPI instances regardless of
    the prog presence on VSI.
    
    Although the pointer to XDP ring introduced to Rx ring makes things a
    lot slimmer/simpler, I still feel that single prog read per NAPI
    lifetime is beneficial.
    
    Further patch that will introduce the fallback path will also get a
    profit from that as xdp_ring pointer will be set during the XDP rings
    setup.
    
    [1]: https://lore.kernel.org/bpf/87k0oseo6e.fsf@toke.dk/
    
    Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
    mfijalko authored and intel-lab-lkp committed Aug 14, 2021
  5. ice: do not create xdp_frame on XDP_TX

    xdp_frame is not needed for XDP_TX data path in ice driver case.
    For this data path cleaning of sent descriptor will not happen anywhere
    outside of the driver, which means that carrying the information about
    the underlying memory model via xdp_frame will not be used. Therefore,
    this conversion can be simply dropped, which would relieve CPU a bit.
    
    Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
    mfijalko authored and intel-lab-lkp committed Aug 14, 2021
  6. ice: unify xdp_rings accesses

    There has been a long lasting issue of improper xdp_rings indexing for
    XDP_TX and XDP_REDIRECT actions. Given that currently rx_ring->q_index
    is mixed with smp_processor_id(), there could be a situation where Tx
    descriptors are produced onto XDP Tx ring, but tail is never bumped,
    for example when a particular queue id is pinned to a non-matching IRQ line.
    
    Address this problem by ignoring the user ring count setting and always
    initialize the xdp_rings array to be of num_possible_cpus() size. Then,
    always use the smp_processor_id() as an index to xdp_rings array. This
    provides serialization as at given time only a single softirq can run on
    a particular CPU.
    
    Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
    mfijalko authored and intel-lab-lkp committed Aug 14, 2021
  7. ice: split ice_ring onto Tx/Rx separate structs

    While it was convenient to have a generic ring structure that served
    both Tx and Rx sides, next commits are going to introduce several
    Tx-specific fields, so in order to avoid hurting the Rx side, let's
    pull out the Tx ring onto new ice_tx_ring and ice_rx_ring structs.
    
    Rx ring could be handled by the old ice_ring which would reduce the code
    churn within this patch, but this would make things asymmetric.
    
    Make the union out of the ring container within ice_q_vector so that it
    is possible to iterate over newly introduced ice_tx_ring.
    
    Remove the @size as it's only accessed from control path and it can be
    calculated pretty easily.
    
    Change definitions of ice_update_ring_stats and
    ice_fetch_u64_stats_per_ring so that they are ring agnostic and can be
    used for both Rx and Tx rings.
    
    Sizes of Rx and Tx ring structs are 256 and 192 bytes, respectively. In
    Rx ring xdp_rxq_info occupies its own cacheline, so it's the major
    difference now.
    
    Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
    mfijalko authored and intel-lab-lkp committed Aug 14, 2021
  8. ice: move ice_container_type onto ice_ring_container

    Currently ice_container_type is scoped only for ice_ethtool.c. Next
    commit that will split the ice_ring struct onto Rx/Tx specific ring
    structs is going to also modify the type of linked list of rings that is
    within ice_ring_container. Therefore, the functions that are taking the
    ice_ring_container as an input argument will need to be aware of a ring
    type that will be looked up.
    
    Embed ice_container_type within ice_ring_container and initialize it
    properly when allocating the q_vectors.
    
    Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
    mfijalko authored and intel-lab-lkp committed Aug 14, 2021
  9. ice: remove ring_active from ice_ring

    This field is dead and driver is not making any use of it. Simply remove
    it.
    
    Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
    mfijalko authored and intel-lab-lkp committed Aug 14, 2021

Commits on Aug 6, 2021

  1. netfilter: nfnetlink_hook: translate inet ingress to netdev

    The NFPROTO_INET pseudofamily is not exposed through this new netlink
    interface. The netlink dump either shows NFPROTO_IPV4 or NFPROTO_IPV6
    for NFPROTO_INET prerouting/input/forward/output/postrouting hooks.
    The NFNLA_CHAIN_FAMILY attribute provides the family chain, which
    specifies if this hook applies to inet traffic only (either IPv4 or
    IPv6).
    
    Translate the inet/ingress hook to netdev/ingress to fully hide the
    NFPROTO_INET implementation details.
    
    Fixes: e2cf17d ("netfilter: add new hook nfnl subsystem")
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    ummakynes committed Aug 6, 2021
  2. netfilter: conntrack: remove offload_pickup sysctl again

    These two sysctls were added because the hardcoded defaults (2 minutes
    for TCP, 30 seconds for UDP) turned out to be too low for some setups.
    
    They appeared in 5.14-rc1 so it should be fine to remove them again.
    
    Marcelo convinced me that there should be no difference between a flow
    that was offloaded vs. a flow that was not wrt. timeout handling.
    Thus the default is changed to those for TCP established and UDP stream,
    5 days and 120 seconds, respectively.
    
    Marcelo also suggested accounting for the timeout value used for the
    offloading; this avoids an increase beyond the value in the conntrack
    sysctl and also instantly expires the conntrack entry with altered sysctls.
    
    Example:
       nf_conntrack_udp_timeout_stream=60
       nf_flowtable_udp_timeout=60
    
    This will remove offloaded udp flows after one minute, rather than two.
    
    An earlier version of this patch also cleared the ASSURED bit to
    allow nf_conntrack to evict the entry via early_drop (i.e., table full).
    However, it looks like we can safely assume that connection timed out
    via HW is still in established state, so this isn't needed.
    
    Quoting Oz:
     [..] the hardware sends all packets with a set FIN flags to sw.
     [..] Connections that are aged in hardware are expected to be in the
     established state.
    
    In case it turns out that back-to-sw-path transition can occur for
    'dodgy' connections too (e.g., one side disappeared while software-path
    would have been in RETRANS timeout), we can adjust this later.
    
    Cc: Oz Shlomo <ozsh@nvidia.com>
    Cc: Paul Blakey <paulb@nvidia.com>
    Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    Florian Westphal authored and ummakynes committed Aug 6, 2021
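
The timeout accounting described above can be sketched as follows. This is one reading of the commit message, not the kernel code: the function names are hypothetical, and the assumption is that the flowtable timeout is subtracted from the conntrack timeout, clamped at zero, when an idle flow returns from the offload path.

```python
def remaining_ct_timeout(ct_timeout_s: int, flowtable_timeout_s: int) -> int:
    """Conntrack timeout applied when an idle flow leaves the flowtable."""
    # The time already spent idle in the flowtable is accounted for, so the
    # total never grows beyond the conntrack sysctl value.
    return max(0, ct_timeout_s - flowtable_timeout_s)


def idle_lifetime(ct_timeout_s: int, flowtable_timeout_s: int) -> int:
    """Total idle time before the entry is gone, with accounting."""
    return flowtable_timeout_s + remaining_ct_timeout(
        ct_timeout_s, flowtable_timeout_s)


def idle_lifetime_unaccounted(ct_timeout_s: int, flowtable_timeout_s: int) -> int:
    """Same, if the full conntrack timeout were re-armed after offload."""
    return flowtable_timeout_s + ct_timeout_s
```

With the sysctl values from the example (60 and 60), the entry expires as soon as it leaves the flowtable, one minute in total; re-arming the full conntrack timeout would have given two.
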
  3. netfilter: nfnetlink_hook: Use same family as request message

    Use the same family as the request message, for consistency. The
    netlink payload provides sufficient information to describe the hook
    object, including the family.
    
    This makes it easier for userspace to correlate the hooks that are
    visited by the packets for a certain family.
    
    Fixes: e2cf17d ("netfilter: add new hook nfnl subsystem")
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    ummakynes committed Aug 6, 2021
  4. netfilter: nfnetlink_hook: use the sequence number of the request mes…

    …sage
    
    The sequence number makes it possible to correlate the netlink reply
    message (as part of the dump) with the original request message.
    
    The cb->seq field is internally used to detect an interference (update)
    of the hook list during the netlink dump, do not use it as sequence
    number in the netlink dump header.
    
    Fixes: e2cf17d ("netfilter: add new hook nfnl subsystem")
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    ummakynes committed Aug 6, 2021
  5. netfilter: nfnetlink_hook: missing chain family

    The family is relevant for pseudo-families like NFPROTO_INET
    otherwise the user needs to rely on the hook function name to
    differentiate it from NFPROTO_IPV4 and NFPROTO_IPV6 names.
    
    Add nfnl_hook_chain_desc_attributes instead of using the existing
    NFTA_CHAIN_* attributes, since these do not provide a family number.
    
    Fixes: e2cf17d ("netfilter: add new hook nfnl subsystem")
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    ummakynes committed Aug 6, 2021
  6. netfilter: nfnetlink_hook: strip off module name from hookfn

    NFNLA_HOOK_FUNCTION_NAME should include the hook function name only,
    the module name is already provided by NFNLA_HOOK_MODULE_NAME.
    
    Fixes: e2cf17d ("netfilter: add new hook nfnl subsystem")
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    ummakynes committed Aug 6, 2021
  7. netfilter: conntrack: collect all entries in one cycle

    Michal Kubecek reports that conntrack gc is responsible for frequent
    wakeups (every 125ms) on idle systems.
    
    On busy systems, timed out entries are evicted during lookup.
    The gc worker is only needed to remove entries after system becomes idle
    after a busy period.
    
    To resolve this, always scan the entire table.
    If the scan is taking too long, reschedule so other work_structs can run
    and resume from next bucket.
    
    After a completed scan, wait for 2 minutes before the next cycle.
    Heuristics for faster re-schedule are removed.
    
    GC_SCAN_INTERVAL could be exposed as a sysctl in the future to allow
    tuning this as-needed or even turn the gc worker off.
    
    Reported-by: Michal Kubecek <mkubecek@suse.cz>
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    Florian Westphal authored and ummakynes committed Aug 6, 2021
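
The scan-and-resume behaviour described above can be sketched as a toy Python model. It is not the kernel worker: `gc_step` is a hypothetical name, buckets hold plain expiry timestamps, and a per-run bucket budget stands in for the "scan is taking too long" check, which is an assumption.

```python
GC_SCAN_INTERVAL_S = 120  # wait after a completed scan (2 minutes)


def gc_step(table, start_bucket, now, budget):
    """Scan buckets from start_bucket and evict expired entries.

    table:  list of buckets, each a list of expiry timestamps
    budget: buckets allowed per run (stand-in for a time limit)
    Returns (next_bucket, delay_s): where to resume and how long to wait.
    """
    i = start_bucket
    scanned = 0
    while i < len(table):
        if scanned >= budget:
            # Took too long: let other work run, resume from bucket i.
            return i, 0
        table[i] = [expiry for expiry in table[i] if expiry > now]
        i += 1
        scanned += 1
    # Whole table covered: start over after the long interval.
    return 0, GC_SCAN_INTERVAL_S
```

A run that exhausts its budget returns a zero delay and the bucket to resume from; only a completed pass returns the two-minute wait.
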

Commits on Aug 4, 2021

  1. netfilter: nf_conntrack_bridge: Fix memory leak when error

    kfree_skb_list() should be called when err is not equal to zero
    in nf_br_ip_fragment().
    
    v2: keep this aligned with IPv6.
    v3: modify iter.frag_list to iter.frag.
    
    Fixes: 3c171f4 ("netfilter: bridge: add connection tracking system")
    Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    Yajun Deng authored and ummakynes committed Aug 4, 2021
  2. netfilter: ipset: Limit the maximal range of consecutive elements to …

    …add/delete
    
    The range size of consecutive elements was not limited. Thus one could
    define a huge range which may result in soft lockup errors due to the
    long execution time. Now the range size is limited to 2^20 entries.
    
    Reported-by: Brad Spengler <spender@grsecurity.net>
    Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    Jozsef Kadlecsik authored and ummakynes committed Aug 4, 2021
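
The limit described above can be sketched as a range-size check. This is an illustrative Python model, not the ipset code: `check_range` is a hypothetical helper, the constant name is chosen for illustration, and only IPv4 ranges are modelled.

```python
import ipaddress

IPSET_MAX_RANGE = 1 << 20  # 2^20 entries, the limit named in the commit


def check_range(first: str, last: str) -> int:
    """Return the number of addresses in [first, last] or reject the range."""
    lo = int(ipaddress.IPv4Address(first))
    hi = int(ipaddress.IPv4Address(last))
    if lo > hi:
        lo, hi = hi, lo
    size = hi - lo + 1
    if size > IPSET_MAX_RANGE:
        # Refuse up front instead of looping over a huge range.
        raise ValueError(f"range of {size} elements exceeds {IPSET_MAX_RANGE}")
    return size
```

Rejecting the request before iterating is what bounds the execution time of a single add or delete.
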

Commits on Jul 30, 2021

  1. Merge tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel…

    …/git/netdev/net
    
    Pull networking fixes from Jakub Kicinski:
     "Networking fixes for 5.14-rc4, including fixes from bpf, can, WiFi
      (mac80211) and netfilter trees.
    
      Current release - regressions:
    
       - mac80211: fix starting aggregation sessions on mesh interfaces
    
      Current release - new code bugs:
    
       - sctp: send pmtu probe only if packet loss in Search Complete state
    
       - bnxt_en: add missing periodic PHC overflow check
    
       - devlink: fix phys_port_name of virtual port and merge error
    
       - hns3: change the method of obtaining default ptp cycle
    
       - can: mcba_usb_start(): add missing urb->transfer_dma initialization
    
      Previous releases - regressions:
    
       - set true network header for ECN decapsulation
    
       - mlx5e: RX, avoid possible data corruption w/ relaxed ordering and
         LRO
    
       - phy: re-add check for PHY_BRCM_DIS_TXCRXC_NOENRGY on the BCM54811
         PHY
    
       - sctp: fix return value check in __sctp_rcv_asconf_lookup
    
      Previous releases - always broken:
    
       - bpf:
           - more spectre corner case fixes, introduce a BPF nospec
             instruction for mitigating Spectre v4
           - fix OOB read when printing XDP link fdinfo
           - sockmap: fix cleanup related races
    
       - mac80211: fix enabling 4-address mode on a sta vif after assoc
    
       - can:
           - raw: raw_setsockopt(): fix raw_rcv panic for sock UAF
           - j1939: j1939_session_deactivate(): clarify lifetime of session
             object, avoid UAF
           - fix number of identical memory leaks in USB drivers
    
       - tipc:
           - do not blindly write skb_shinfo frags when doing decryption
           - fix sleeping in tipc accept routine"
    
    * tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
      gve: Update MAINTAINERS list
      can: esd_usb2: fix memory leak
      can: ems_usb: fix memory leak
      can: usb_8dev: fix memory leak
      can: mcba_usb_start(): add missing urb->transfer_dma initialization
      can: hi311x: fix a signedness bug in hi3110_cmd()
      MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver
      bpf: Fix leakage due to insufficient speculative store bypass mitigation
      bpf: Introduce BPF nospec instruction for mitigating Spectre v4
      sis900: Fix missing pci_disable_device() in probe and remove
      net: let flow have same hash in two directions
      nfc: nfcsim: fix use after free during module unload
      tulip: windbond-840: Fix missing pci_disable_device() in probe and remove
      sctp: fix return value check in __sctp_rcv_asconf_lookup
      nfc: s3fwrn5: fix undefined parameter values in dev_err()
      net/mlx5: Fix mlx5_vport_tbl_attr chain from u16 to u32
      net/mlx5e: Fix nullptr in mlx5e_hairpin_get_mdev()
      net/mlx5: Unload device upon firmware fatal error
      net/mlx5e: Fix page allocation failure for ptp-RQ over SF
      net/mlx5e: Fix page allocation failure for trap-RQ over SF
      ...
    torvalds committed Jul 30, 2021
  2. Merge tag 'acpi-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kerne…

    …l/git/rafael/linux-pm
    
    Pull ACPI fixes from Rafael Wysocki:
     "These revert a recent IRQ resources handling modification that turned
      out to be problematic, fix suspend-to-idle handling on AMD platforms
      to take upcoming systems into account properly and fix the retrieval
      of the DPTF attributes of the PCH FIVR.
    
      Specifics:
    
       - Revert recent change of the ACPI IRQ resources handling that
         attempted to improve the ACPI IRQ override selection logic, but
         introduced serious regressions on some systems (Hui Wang).
    
       - Fix up quirks for AMD platforms in the suspend-to-idle support code
         so as to take upcoming systems using uPEP HID AMDI007 into account
         as appropriate (Mario Limonciello).
    
       - Fix the code retrieving DPTF attributes of the PCH FIVR so that it
         agrees on the return data type with the ACPI control method
         evaluated for this purpose (Srinivas Pandruvada)"
    
    * tag 'acpi-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
      ACPI: DPTF: Fix reading of attributes
      Revert "ACPI: resources: Add checks for ACPI IRQ override"
      ACPI: PM: Add support for upcoming AMD uPEP HID AMDI007
    torvalds committed Jul 30, 2021
  3. pipe: make pipe writes always wake up readers

    Since commit 1b6b26a ("pipe: fix and clarify pipe write wakeup
    logic") we have sanitized the pipe write logic, and would only try to
    wake up readers if they needed it.
    
    In particular, if the pipe already had data in it before the write,
    there was no point in trying to wake up a reader, since any existing
    readers must have been aware of the pre-existing data already.  Doing
    extraneous wakeups will only cause potential thundering herd problems.
    
    However, it turns out that some Android libraries have misused the EPOLL
    interface, and expected "edge triggered" to be "any new write will
    trigger it".  Even if there was no edge in sight.
    
    Quoting Sandeep Patil:
     "The commit 1b6b26a ('pipe: fix and clarify pipe write wakeup
      logic') changed pipe write logic to wakeup readers only if the pipe
      was empty at the time of write. However, there are libraries that
      relied upon the older behavior for notification scheme similar to
      what's described in [1]
    
      One such library 'realm-core'[2] is used by numerous Android
      applications. The library uses a similar notification mechanism as GNU
      Make but it never drains the pipe until it is full. When Android moved
      to v5.10 kernel, all applications using this library stopped working.
    
      The library has since been fixed[3] but it will be a while before all
      applications incorporate the updated library"
    
    Our regression rule for the kernel is that if applications break from
    new behavior, it's a regression, even if it was because the application
    did something patently wrong.  Also note the original report [4] by
    Michael Kerrisk about a test for this epoll behavior - but at that point
    we didn't know of any actual broken use case.
    
    So add the extraneous wakeup, to approximate the old behavior.
    
    [ I say "approximate", because the exact old behavior was to do a wakeup
      not for each write(), but for each pipe buffer chunk that was filled
      in. The behavior introduced by this change is not that - this is just
      "every write will cause a wakeup, whether necessary or not", which
      seems to be sufficient for the broken library use. ]
    
    It's worth noting that this adds the extraneous wakeup only for the
    write side, while the read side still considers the "edge" to be purely
    about reading enough from the pipe to allow further writes.
    
    See commit f467a6a ("pipe: fix and clarify pipe read wakeup logic")
    for the pipe read case, which remains that "only wake up if the pipe was
    full, and we read something from it".
    
    Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1]
    Link: https://github.com/realm/realm-core [2]
    Link: realm/realm-core#4666 [3]
    Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4]
    Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
    Reported-by: Sandeep Patil <sspatil@android.com>
    Cc: Michael Kerrisk <mtk.manpages@gmail.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    torvalds committed Jul 30, 2021
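
The difference between the two write-side wakeup policies can be sketched as a toy Python model. It ignores pipe capacity, blocking, per-buffer chunk accounting and epoll itself; `ToyPipe` and its `always_wake` switch are hypothetical and only count how many wakeups a reader that never drains the pipe would see.

```python
class ToyPipe:
    """Counts reader wakeups under two write-side policies."""

    def __init__(self, always_wake: bool):
        self.always_wake = always_wake  # True: policy after this commit
        self.buffered = 0
        self.wakeups = 0

    def write(self, nbytes: int = 1) -> None:
        was_empty = self.buffered == 0
        self.buffered += nbytes
        # Strict policy: wake readers only on the empty -> non-empty edge.
        # Compat policy: every write wakes readers, needed or not.
        if self.always_wake or was_empty:
            self.wakeups += 1

    def read_all(self) -> int:
        n = self.buffered
        self.buffered = 0
        return n
```

A reader that treats every wakeup as a notification and does not drain the pipe sees one event for three writes under the strict policy and three under the compat one, which is the behavior the broken library relied on.
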
  4. Merge branches 'acpi-resources' and 'acpi-dptf'

    * acpi-resources:
      Revert "ACPI: resources: Add checks for ACPI IRQ override"
    
    * acpi-dptf:
      ACPI: DPTF: Fix reading of attributes
    rafaeljw committed Jul 30, 2021
  5. Merge tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block

    Pull block fixes from Jens Axboe:
    
     - gendisk freeing fix (Christoph)
    
     - blk-iocost wake ordering fix (Tejun)
    
     - tag allocation error handling fix (John)
    
     - loop locking fix. While this isn't the prettiest fix in the world,
       nobody has any good alternatives for 5.14. Something to likely
       revisit for 5.15. (Tetsuo)
    
    * tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
      block: delay freeing the gendisk
      blk-iocost: fix operation ordering in iocg_wake_fn()
      blk-mq-sched: Fix blk_mq_sched_alloc_tags() error handling
      loop: reintroduce global lock for safe loop_validate_file() traversal
    torvalds committed Jul 30, 2021
  6. Merge tag 'io_uring-5.14-2021-07-30' of git://git.kernel.dk/linux-block

    Pull io_uring fixes from Jens Axboe:
    
     - A fix for block backed reissue (me)
    
     - Reissue context hardening (me)
    
     - Async link locking fix (Pavel)
    
    * tag 'io_uring-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
      io_uring: fix poll requests leaking second poll entries
      io_uring: don't block level reissue off completion path
      io_uring: always reissue from task_work context
      io_uring: fix race in unified task_work running
      io_uring: fix io_prep_async_link locking
    torvalds committed Jul 30, 2021
  7. Merge tag 'libata-5.14-2021-07-30' of git://git.kernel.dk/linux-block

    Pull libata fixlets from Jens Axboe:
    
     - A fix for PIO highmem (Christoph)
    
     - Kill HAVE_IDE as it's now unused (Lukas)
    
    * tag 'libata-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
      arch: Kconfig: clean up obsolete use of HAVE_IDE
      libata: fix ata_pio_sector for CONFIG_HIGHMEM
    torvalds committed Jul 30, 2021
  8. Merge tag 'for-5.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/ke…

    …rnel/git/kdave/linux
    
    Pull btrfs fixes from David Sterba:
    
     - fix -Warray-bounds warning, to help external patchset to make it
       default treewide
    
     - fix writeable device accounting (syzbot report)
    
     - fix fsync and log replay after a rename and inode eviction
    
     - fix potentially lost error code when submitting multiple bios for
       compressed range
    
    * tag 'for-5.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
      btrfs: calculate number of eb pages properly in csum_tree_block
      btrfs: fix rw device counting in __btrfs_free_extra_devids
      btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction
      btrfs: mark compressed range uptodate only if all bio succeed
    torvalds committed Jul 30, 2021
  9. Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel…

    …/git/hid/hid
    
    Pull HID fixes from Jiri Kosina:
    
     - resume timing fix for intel-ish driver (Ye Xiang)
    
     - fix for using incorrect MMIO register in amd_sfh driver (Dylan
       MacKenzie)
    
     - Cintiq 24HDT / 27QHDT regression fix and touch processing fix for
       Wacom driver (Jason Gerecke)
    
     - device removal bugfix for ft260 driver (Michael Zaidman)
    
     - other small assorted fixes
    
    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
      HID: ft260: fix device removal due to USB disconnect
      HID: wacom: Skip processing of touches with negative slot values
      HID: wacom: Re-enable touch by default for Cintiq 24HDT / 27QHDT
      HID: Kconfig: Fix spelling mistake "Uninterruptable" -> "Uninterruptible"
      HID: apple: Add support for Keychron K1 wireless keyboard
      HID: fix typo in Kconfig
      HID: ft260: fix format type warning in ft260_word_show()
      HID: amd_sfh: Use correct MMIO register for DMA address
      HID: asus: Remove check for same LED brightness on set
      HID: intel-ish-hid: use async resume function
    torvalds committed Jul 30, 2021
  10. Merge branch 'akpm' (patches from Andrew)

    Merge misc fixes from Andrew Morton:
     "7 patches.
    
      Subsystems affected by this patch series: lib, ocfs2, and mm (slub,
      migration, and memcg)"
    
    * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
      mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook()
      slub: fix unreclaimable slab stat for bulk free
      mm/migrate: fix NR_ISOLATED corruption on 64-bit
      mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code
      ocfs2: issue zeroout to EOF blocks
      ocfs2: fix zero out valid data
      lib/test_string.c: move string selftest in the Runtime Testing menu
    torvalds committed Jul 30, 2021
  11. Merge tag 'linux-can-fixes-for-5.14-20210730' of git://git.kernel.org…

    …/pub/scm/linux/kernel/git/mkl/linux-can
    
    Marc Kleine-Budde says:
    
    ====================
    pull-request: can 2021-07-30
    
    The first patch is by me and adds Yasushi SHOJI as a reviewer for the
    Microchip CAN BUS Analyzer Tool driver.
    
    Dan Carpenter's patch fixes a signedness bug in the hi311x driver.
    
    Pavel Skripkin provides 4 patches, the first targets the mcba_usb
    driver by adding the missing urb->transfer_dma initialization, which
    was broken in a previous commit. The last 3 patches fix a memory leak
    in the usb_8dev, ems_usb and esd_usb2 driver.
    
    * tag 'linux-can-fixes-for-5.14-20210730' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
      can: esd_usb2: fix memory leak
      can: ems_usb: fix memory leak
      can: usb_8dev: fix memory leak
      can: mcba_usb_start(): add missing urb->transfer_dma initialization
      can: hi311x: fix a signedness bug in hi3110_cmd()
      MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver
    ====================
    
    Link: https://lore.kernel.org/r/20210730070526.1699867-1-mkl@pengutronix.de
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Jakub Kicinski committed Jul 30, 2021
  12. mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook()

    When I use kfree_rcu() to free a large chunk of memory allocated by
    kmalloc_node(), the following dump occurs.
    
      BUG: kernel NULL pointer dereference, address: 0000000000000020
      [...]
      Oops: 0000 [#1] SMP
      [...]
      Workqueue: events kfree_rcu_work
      RIP: 0010:__obj_to_index include/linux/slub_def.h:182 [inline]
      RIP: 0010:obj_to_index include/linux/slub_def.h:191 [inline]
      RIP: 0010:memcg_slab_free_hook+0x120/0x260 mm/slab.h:363
      [...]
      Call Trace:
        kmem_cache_free_bulk+0x58/0x630 mm/slub.c:3293
        kfree_bulk include/linux/slab.h:413 [inline]
        kfree_rcu_work+0x1ab/0x200 kernel/rcu/tree.c:3300
        process_one_work+0x207/0x530 kernel/workqueue.c:2276
        worker_thread+0x320/0x610 kernel/workqueue.c:2422
        kthread+0x13d/0x160 kernel/kthread.c:313
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
    
    When kmalloc_node() is asked for a large amount of memory, a page is
    allocated rather than a slab object, so when that memory is freed via
    kfree_rcu(), it should not be processed by memcg_slab_free_hook(),
    because memcg_slab_free_hook() is used for slab objects.
    
    Use page_objcgs_check() instead of page_objcgs() in
    memcg_slab_free_hook() to fix this bug.
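    
    The difference between an unchecked and a checked accessor can be
    sketched in user space.  This is only a simplified model, not the kernel
    code: struct mock_page, the MOCK_DATA_* bits, mock_objcgs() and
    mock_objcgs_check() are hypothetical names, and the assumption is that a
    tag bit in the per-page data marks whether an object-cgroup vector is
    attached.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag bits kept in the low bits of the per-page data word. */
#define MOCK_DATA_OBJCGS ((uintptr_t)1) /* value points at an objcg vector */
#define MOCK_DATA_MASK   ((uintptr_t)3) /* all tag bits */

/* Hypothetical stand-in for a page descriptor. */
struct mock_page {
	uintptr_t memcg_data;
};

/* Unchecked accessor: trusts the caller that this is a slab page. */
static void **mock_objcgs(const struct mock_page *page)
{
	assert(page != NULL);
	return (void **)(page->memcg_data & ~MOCK_DATA_MASK);
}

/* Checked accessor: returns NULL unless the vector tag is present. */
static void **mock_objcgs_check(const struct mock_page *page)
{
	assert(page != NULL);
	if (!page->memcg_data || !(page->memcg_data & MOCK_DATA_OBJCGS))
		return NULL;
	return (void **)(page->memcg_data & ~MOCK_DATA_MASK);
}
```

    With the checked variant, a caller can test for NULL and skip the
    slab-only accounting for a page that did not come from a slab cache.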
    
    Link: https://lkml.kernel.org/r/20210728145655.274476-1-wanghai38@huawei.com
    Fixes: 270c6a7 ("mm: memcontrol/slab: Use helpers to access slab page's memcg_data")
    Signed-off-by: Wang Hai <wanghai38@huawei.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Roman Gushchin <guro@fb.com>
    Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Wang Hai authored and torvalds committed Jul 30, 2021
  13. slub: fix unreclaimable slab stat for bulk free

    SLUB uses the page allocator for higher-order allocations and updates
    the unreclaimable slab stat for such allocations.  At the moment, the
    bulk free path for SLUB does not share code with the normal free path
    for this type of allocation and misses the stat update.  So, fix the
    stat update by using common code.  The user-visible impact of the bug is
    a potentially inconsistent unreclaimable slab stat visible through
    meminfo and vmstat.
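    
    The class of bug described above can be illustrated with a small
    user-space model.  Everything here is hypothetical (the model_* names
    and the byte counter are not kernel symbols); it only shows why routing
    the bulk path through the same helper as the normal free path keeps the
    stat consistent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the unreclaimable slab stat. */
static long model_unreclaimable_bytes;

/* Hypothetical large allocation that remembers its accounted size. */
struct model_large {
	size_t size;
};

static struct model_large *model_large_alloc(size_t size)
{
	struct model_large *obj = malloc(sizeof(*obj));

	if (!obj)
		return NULL;
	obj->size = size;
	model_unreclaimable_bytes += (long)size; /* account on allocation */
	return obj;
}

/* Common free helper: the single place that reverses the accounting. */
static void model_large_free(struct model_large *obj)
{
	assert(obj != NULL);
	model_unreclaimable_bytes -= (long)obj->size;
	free(obj);
}

/* Bulk free that bypasses the helper, so the stat is never decremented. */
static void model_bulk_free_buggy(struct model_large **objs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		free(objs[i]);
}

/* Bulk free that reuses the common helper, keeping the stat in sync. */
static void model_bulk_free_fixed(struct model_large **objs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		model_large_free(objs[i]);
}
```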
    
    Link: https://lkml.kernel.org/r/20210728155354.3440560-1-shakeelb@google.com
    Fixes: 6a486c0 ("mm, sl[ou]b: improve memory accounting")
    Signed-off-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Roman Gushchin <guro@fb.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    shakeelb authored and torvalds committed Jul 30, 2021
  14. mm/migrate: fix NR_ISOLATED corruption on 64-bit

    Similar to commit 2da9f63 ("mm/vmscan: fix NR_ISOLATED_FILE
    corruption on 64-bit"), avoid using unsigned int for nr_pages.  With
    the unsigned int type, the large unsigned int converts to a large
    positive signed long.
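    
    The conversion can be reproduced in plain C.  The sketch below assumes
    the page count is negated before being passed as a long delta to a
    counter update; apply_delta(), release_buggy() and release_fixed() are
    hypothetical helpers, not kernel functions.

```c
#include <assert.h>

/* Hypothetical stand-in for a stat update that takes a signed long delta. */
static long apply_delta(long counter, long delta)
{
	return counter + delta;
}

/*
 * Problem pattern: -nr_pages is evaluated in unsigned int, so it wraps to
 * a large unsigned value.  Widening that to a 64-bit long keeps it
 * positive, and the counter grows instead of shrinking.
 */
static long release_buggy(long counter, unsigned int nr_pages)
{
	return apply_delta(counter, -nr_pages);
}

/* With a signed type the negation yields a genuinely negative delta. */
static long release_fixed(long counter, int nr_pages)
{
	assert(nr_pages >= 0);
	return apply_delta(counter, -nr_pages);
}
```

    On a platform with a 64-bit long, release_buggy(1, 1) adds 4294967295
    to the counter, while release_fixed(1, 1) brings it back to 0.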
    
    Symptoms include CMA allocations hanging forever due to
    alloc_contig_range->...->isolate_migratepages_block waiting forever in
    "while (unlikely(too_many_isolated(pgdat)))".
    
    Link: https://lkml.kernel.org/r/20210728042531.359409-1-aneesh.kumar@linux.ibm.com
    Fixes: c5fc5c3 ("mm: migrate: account THP NUMA migration counters correctly")
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Reported-by: Michael Ellerman <mpe@ellerman.id.au>
    Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    kvaneesh authored and torvalds committed Jul 30, 2021
  15. mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code
    
    Dan Carpenter reports:
    
        The patch 2d146aa: "mm: memcontrol: switch to rstat" from Apr
        29, 2021, leads to the following static checker warning:
    
    	    kernel/cgroup/rstat.c:200 cgroup_rstat_flush()
    	    warn: sleeping in atomic context
    
        mm/memcontrol.c
          3572  static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
          3573  {
          3574          unsigned long val;
          3575
          3576          if (mem_cgroup_is_root(memcg)) {
          3577                  cgroup_rstat_flush(memcg->css.cgroup);
    			    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    
        This is from static analysis and potentially a false positive.  The
        problem is that mem_cgroup_usage() is called from __mem_cgroup_threshold()
        which holds an rcu_read_lock().  And the cgroup_rstat_flush() function
        can sleep.
    
          3578                  val = memcg_page_state(memcg, NR_FILE_PAGES) +
          3579                          memcg_page_state(memcg, NR_ANON_MAPPED);
          3580                  if (swap)
          3581                          val += memcg_page_state(memcg, MEMCG_SWAP);
          3582          } else {
          3583                  if (!swap)
          3584                          val = page_counter_read(&memcg->memory);
          3585                  else
          3586                          val = page_counter_read(&memcg->memsw);
          3587          }
          3588          return val;
          3589  }
    
    __mem_cgroup_threshold() indeed holds the rcu lock.  In addition, the
    thresholding code is invoked during stat changes, and those contexts
    have irqs disabled as well.  If the lock breaking occurs inside the
    flush function, it will result in a sleep from an atomic context.
    
    Use the irqsafe flushing variant in mem_cgroup_usage() to fix this.
    
    Link: https://lkml.kernel.org/r/20210726150019.251820-1-hannes@cmpxchg.org
    Fixes: 2d146aa ("mm: memcontrol: switch to rstat")
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
    Acked-by: Chris Down <chris@chrisdown.name>
    Reviewed-by: Rik van Riel <riel@surriel.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    hnaz authored and torvalds committed Jul 30, 2021
  16. ocfs2: issue zeroout to EOF blocks

    When punching holes in EOF blocks, fallocate used a buffered write to
    zero the EOF blocks in the last cluster.  But since ->writepage ignores
    EOF pages, those zeros are never flushed.
    
    This "looks" ok, as commit 6bba447 ("ocfs2: fix data corruption by
    fallocate") zeroes the EOF blocks when extending the file size, but it
    isn't.  The problem is with those EOF pages: before writeback, the pages
    had the DIRTY flag set and all buffer_heads in them also had the DIRTY
    flag set.  When writeback was run by write_cache_pages(), the DIRTY flag
    on the page was cleared, but the DIRTY flag on the buffer_head was not.
    
    When the next write hit those EOF pages, the buffer_head already had
    the DIRTY flag set, so the page was not marked DIRTY again.  That made
    writeback ignore them forever, which causes data corruption.  Even a
    direct I/O write can't work, because it fails when trying to drop the
    page cache before direct I/O: it finds that the buffer_head for those
    pages still has the DIRTY flag set and falls back to buffered I/O mode.
    
    To summarize the issue: as writeback ignores EOF pages, once any EOF
    page is generated, any write to it only goes to the page cache.  It is
    never flushed to disk, even when the file size is extended and the page
    is no longer an EOF page.  The fix is to avoid zeroing EOF blocks with a
    buffered write.
    
    The following code snippet from qemu-img could trigger the corruption.
    
      656   open("6b3711ae-3306-4bdd-823c-cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11
      ...
      660   fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2275868672, 327680 <unfinished ...>
      660   fallocate(11, 0, 2275868672, 327680) = 0
      658   pwrite64(11, "
    
    Link: https://lkml.kernel.org/r/20210722054923.24389-2-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
    Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    Cc: Mark Fasheh <mark@fasheh.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Changwei Ge <gechangwei@live.cn>
    Cc: Gang He <ghe@suse.com>
    Cc: Jun Piao <piaojun@huawei.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    biger410 authored and torvalds committed Jul 30, 2021
  17. ocfs2: fix zero out valid data

    If the append-dio feature is enabled, direct-io writes and fallocate
    can run in parallel to extend the file size.  fallocate used
    "orig_isize" to record i_size before taking "ip_alloc_sem", so by the
    time ocfs2_zeroout_partial_cluster() zeroes out the EOF blocks, i_size
    may already have been extended by ocfs2_dio_end_io_write(), which causes
    valid data to be zeroed out.
    
    Link: https://lkml.kernel.org/r/20210722054923.24389-1-junxiao.bi@oracle.com
    Fixes: 6bba447 ("ocfs2: fix data corruption by fallocate")
    Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
    Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    Cc: Changwei Ge <gechangwei@live.cn>
    Cc: Gang He <ghe@suse.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Jun Piao <piaojun@huawei.com>
    Cc: Mark Fasheh <mark@fasheh.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    biger410 authored and torvalds committed Jul 30, 2021