Branch: linux-vyos-4.1…
Commits on Aug 17, 2018
  1. sch_cake: Make gso-splitting configurable from userspace

    Dave Taht authored and Lochnair committed Jul 27, 2018
    This patch restores cake's deployed behavior at line rate to always
    split gso, and makes gso splitting configurable from userspace.
    
    running cake unlimited (unshaped) at 1gigE, local traffic:
    
    no-split-gso bql limit: 131966
    split-gso bql limit:   ~42392-45420
    
    On this 4 stream test splitting gso apart results in halving the
    observed interpacket latency at no loss in throughput.
    
    Summary of tcp_nup test run 'gso-split' (at 2018-07-26 16:03:51.824728):
    
                                avg       median          # data pts
     Ping (ms) ICMP :         0.83         0.81 ms              341
     TCP upload avg :       235.43       235.39 Mbits/s         301
     TCP upload sum :       941.71       941.56 Mbits/s         301
     TCP upload::1  :       235.45       235.43 Mbits/s         271
     TCP upload::2  :       235.45       235.41 Mbits/s         289
     TCP upload::3  :       235.40       235.40 Mbits/s         288
     TCP upload::4  :       235.41       235.40 Mbits/s         291
    
    versus
    
    Summary of tcp_nup test run 'no-split-gso' (at 2018-07-26 16:37:23.563960):
    
                               avg       median          # data pts
     Ping (ms) ICMP :         1.67         1.73 ms              348
     TCP upload avg :       234.56       235.37 Mbits/s         301
     TCP upload sum :       938.24       941.49 Mbits/s         301
     TCP upload::1  :       234.55       235.38 Mbits/s         285
     TCP upload::2  :       234.57       235.37 Mbits/s         286
     TCP upload::3  :       234.58       235.37 Mbits/s         274
     TCP upload::4  :       234.54       235.42 Mbits/s         288
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
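As a rough cross-check of the BQL figures quoted in this commit, the time needed to drain a full BQL limit at line rate can be computed directly. This is a back-of-the-envelope sketch: it assumes a 1 Gbit/s line rate, ignores Ethernet framing overhead, and the function name is illustrative rather than taken from the patch.

```python
def drain_time_ms(bql_limit_bytes: int, rate_bps: float = 1e9) -> float:
    """Time in milliseconds to transmit bql_limit_bytes at rate_bps."""
    return bql_limit_bytes * 8 / rate_bps * 1e3


# BQL limits reported in the commit message above.
no_split = drain_time_ms(131966)
split_low = drain_time_ms(42392)
split_high = drain_time_ms(45420)

print(f"no-split-gso: {no_split:.2f} ms of queued data")
print(f"split-gso:    {split_low:.2f} to {split_high:.2f} ms of queued data")
```

Under these assumptions the smaller BQL limit corresponds to well under half the queued time, which points in the same direction as the measured drop in ping latency.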
  2. sch_cake: Fix tin order when set through skb->priority

    tohojo authored and Lochnair committed Jul 16, 2018
    In diffserv mode, CAKE stores tins in a different order internally than
    the logical order exposed to userspace. The order remapping was missing
    in the handling of 'tc filter' priority mappings through skb->priority,
    resulting in bulk and best effort mappings being reversed relative to
    how they are displayed.
    
    Fix this by adding the missing mapping when reading skb->priority.
    
    Fixes: 83f8fd6 ("sch_cake: Add DiffServ handling")
    Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
    Signed-off-by: David S. Miller <davem@davemloft.net>
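The effect of the missing remapping can be illustrated with a small lookup table. This is only a sketch: the internal tin layout shown here (best effort at index 0, bulk at index 1) is an assumption chosen to match the behaviour described in the commit message, and the function names are hypothetical.

```python
# Assumed internal layout for the three-tier mode (illustration only).
INTERNAL_NAMES = ["Best Effort", "Bulk", "Voice"]

# Logical (displayed) position -> internal tin index.
# Userspace sees the order Bulk, Best Effort, Voice.
TIN_ORDER = [1, 0, 2]


def tin_from_priority_minor(minor: int) -> int:
    """Fixed behaviour: remap a 1-based logical tin number to the internal index."""
    return TIN_ORDER[minor - 1]


def buggy_tin_from_priority_minor(minor: int) -> int:
    """Pre-fix behaviour: use the logical number directly as the internal index."""
    return minor - 1


for minor in (1, 2, 3):
    fixed = INTERNAL_NAMES[tin_from_priority_minor(minor)]
    buggy = INTERNAL_NAMES[buggy_tin_from_priority_minor(minor)]
    print(f"minor {minor}: fixed -> {fixed:11s} buggy -> {buggy}")
```

With the remapping in place, minor number 1 selects the tin displayed first (Bulk); without it, bulk and best effort are swapped.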
  3. sch_cake: Conditionally split GSO segments

    tohojo authored and Lochnair committed Jul 6, 2018
    At lower bandwidths, the transmission time of a single GSO segment can add
    an unacceptable amount of latency due to HOL blocking. Furthermore, with a
    software shaper, any tuning mechanism employed by the kernel to control the
    maximum size of GSO segments is thrown off by the artificial limit on
    bandwidth. For this reason, we split GSO segments into their individual
    packets iff the shaper is active and configured to a bandwidth <= 1 Gbps.
    
    Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
    Signed-off-by: David S. Miller <davem@davemloft.net>
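To see why an unsplit GSO segment hurts at low bandwidths, compare serialization delays. The sketch below assumes a 64 KiB GSO super-packet and a 1500-byte MTU as representative sizes; `should_split` simply restates the condition from the commit message and is not the kernel function.

```python
GSO_BYTES = 65536  # assumed worst-case GSO super-packet (64 KiB)
MTU_BYTES = 1500   # assumed MTU-sized packet


def serialization_delay_ms(size_bytes: int, rate_bps: float) -> float:
    """Milliseconds needed to put size_bytes on the wire at rate_bps."""
    return size_bytes * 8 / rate_bps * 1e3


def should_split(shaper_active: bool, rate_bps: float) -> bool:
    """Split iff the shaper is active and configured to <= 1 Gbps."""
    return shaper_active and rate_bps <= 1e9


for rate in (10e6, 100e6, 1e9):
    gso = serialization_delay_ms(GSO_BYTES, rate)
    mtu = serialization_delay_ms(MTU_BYTES, rate)
    print(f"{rate / 1e6:6.0f} Mbit/s: GSO {gso:8.3f} ms, MTU packet {mtu:6.3f} ms")
```

At 10 Mbit/s a single unsplit segment of this size blocks the head of the queue for roughly 52 ms, against about 1.2 ms for one MTU-sized packet.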
  4. sch_cake: Add overhead compensation support to the rate shaper

    tohojo authored and Lochnair committed Jul 6, 2018
    This commit adds configurable overhead compensation support to the rate
    shaper. With this feature, userspace can configure the actual bottleneck
    link overhead and encapsulation mode used, which will be used by the shaper
    to calculate the precise duration of each packet on the wire.
    
    This feature is needed because CAKE is often deployed one or two hops
    upstream of the actual bottleneck (which can be, e.g., inside a DSL or
    cable modem). In this case, the link layer characteristics and overhead
    reported by the kernel do not match the actual bottleneck. Being able to
    set the actual values in use makes it possible to configure the shaper rate
    much closer to the actual bottleneck rate (our experience shows it is
    possible to get within 0.1% of the actual physical bottleneck rate), thus
    keeping latency low without sacrificing bandwidth.
    
    The overhead compensation has three tunables: A fixed per-packet overhead
    size (which, if set, will be accounted from the IP packet header), a
    minimum packet size (MPU) and a framing mode supporting either ATM or PTM
    framing. We include a set of common keywords in TC to help users configure
    the right parameters. If no overhead value is set, the value reported by
    the kernel is used.
    
    Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
    Signed-off-by: David S. Miller <davem@davemloft.net>
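The three tunables can be combined into a single size calculation. The following is a minimal sketch, not the code from the patch: the ATM branch uses the usual 48-byte payload in a 53-byte cell, the PTM branch uses one extra byte per started 64 bytes (64b/65b encoding), and the function and parameter names are illustrative.

```python
def adjusted_len(ip_len: int, overhead: int = 0, mpu: int = 0,
                 framing: str = "none") -> int:
    """Estimate the on-the-wire size of a packet for shaping purposes."""
    length = ip_len + overhead       # fixed per-packet overhead
    length = max(length, mpu)        # enforce the minimum packet size
    if framing == "atm":
        cells = -(-length // 48)     # ceiling division: 48 payload bytes per cell
        length = cells * 53          # each cell occupies 53 bytes on the wire
    elif framing == "ptm":
        length += -(-length // 64)   # one extra byte per started 64 bytes
    return length


# Ethernet-style settings: overhead 38, mpu 84, no ATM/PTM framing.
print(adjusted_len(1500, overhead=38, mpu=84))    # full-size packet
print(adjusted_len(28, overhead=38, mpu=84))      # tiny packet, padded to mpu
print(adjusted_len(1500, overhead=10, framing="atm"))
print(adjusted_len(1500, overhead=10, framing="ptm"))
```

With `overhead 38 mpu 84`, a 28-byte and a 1500-byte network-layer packet come out as 84 and 1538 bytes respectively.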
  5. sch_cake: Add DiffServ handling

    tohojo authored and Lochnair committed Jul 6, 2018
    This adds support for DiffServ-based priority queueing to CAKE. If the
    shaper is in use, each priority tier gets its own virtual clock, which
    limits that tier's rate to a fraction of the overall shaped rate, to
    discourage trying to game the priority mechanism.
    
    CAKE defaults to a simple, three-tier mode that interprets most code points
    as "best effort", but places CS1 traffic into a low-priority "bulk" tier
    which is assigned 1/16 of the total rate, and a few code points indicating
    latency-sensitive or control traffic (specifically TOS4, VA, EF, CS6, CS7)
    into a "latency sensitive" high-priority tier, which is assigned 1/4 rate.
    The other supported DiffServ modes are a 4-tier mode matching the 802.11e
    precedence rules, as well as two 8-tier modes, one of which implements
    strict precedence of the eight priority levels.
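For the default three-tier mode, the per-tier rate limits follow directly from the fractions given above. This is a small sketch; treating the best-effort tier as limited only by the full shaped rate is an assumption, and the function name is illustrative.

```python
def diffserv3_thresholds(rate_bps: float) -> dict:
    """Per-tier rate thresholds for the three-tier mode, in bits per second."""
    return {
        "Bulk": rate_bps / 16,    # low-priority tier: 1/16 of the total rate
        "Best Effort": rate_bps,  # assumed: limited only by the full rate
        "Voice": rate_bps / 4,    # latency-sensitive tier: 1/4 of the total rate
    }


for tier, thresh in diffserv3_thresholds(1e9).items():
    print(f"{tier:12s} {thresh / 1e6:8.1f} Mbit/s")
```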
    
    This commit also adds an optional DiffServ 'wash' mode, which will zero out
    the DSCP fields of any packet passing through CAKE. While this can
    technically be done with other mechanisms in the kernel, having the feature
    available in CAKE significantly decreases configuration complexity; and the
    implementation cost is low on top of the other DiffServ-handling code.
    
    Filters and applications can set the skb->priority field to override the
    DSCP-based classification into tiers. If TC_H_MAJ(skb->priority) matches
    CAKE's qdisc handle, the minor number will be interpreted as a priority
    tier if it is less than or equal to the number of configured priority
    tiers.
    
    Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
    Signed-off-by: David S. Miller <davem@davemloft.net>
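The skb->priority override described in this commit can be sketched as follows. The masks mirror the kernel's TC_H_MAJ and TC_H_MIN macros (upper and lower 16 bits of a handle). Treating tier numbers as 1-based, so that a minor number of 0 never overrides, is an assumption made for this example, and `select_tier` is a hypothetical helper.

```python
TC_H_MAJ_MASK = 0xFFFF0000
TC_H_MIN_MASK = 0x0000FFFF


def tc_h_maj(handle: int) -> int:
    """Major part of a TC handle (upper 16 bits, left in place)."""
    return handle & TC_H_MAJ_MASK


def tc_h_min(handle: int) -> int:
    """Minor part of a TC handle (lower 16 bits)."""
    return handle & TC_H_MIN_MASK


def select_tier(priority: int, qdisc_handle: int, num_tiers: int,
                dscp_tier: int) -> int:
    """Return the tier chosen via skb->priority, else the DSCP-based tier."""
    minor = tc_h_min(priority)
    if tc_h_maj(priority) == qdisc_handle and 1 <= minor <= num_tiers:
        return minor
    return dscp_tier


HANDLE = 0x80170000  # qdisc handle "8017:"
print(select_tier(0x80170001, HANDLE, num_tiers=3, dscp_tier=2))  # override
print(select_tier(0x80170004, HANDLE, num_tiers=3, dscp_tier=2))  # out of range
print(select_tier(0x00010001, HANDLE, num_tiers=3, dscp_tier=2))  # other handle
```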
  6. sch_cake: Add NAT awareness to packet classifier

    tohojo authored and Lochnair committed Jul 6, 2018
    When CAKE is deployed on a gateway that also performs NAT (which is a
    common deployment mode), the host fairness mechanism cannot distinguish
    internal hosts from each other, and so fails to work correctly.
    
    To fix this, we add an optional NAT awareness mode, which will query the
    kernel conntrack mechanism to obtain the pre-NAT addresses for each packet
    and use that in the flow and host hashing.
    
    When the shaper is enabled and the host is already performing NAT, the cost
    of this lookup is negligible. However, in unlimited mode with no NAT being
    performed, there is a significant CPU cost at higher bandwidths. For this
    reason, the feature is turned off by default.
    
    Cc: netfilter-devel@vger.kernel.org
    Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
    Signed-off-by: David S. Miller <davem@davemloft.net>
  7. sch_cake: Add optional ACK filter

    tohojo authored and Lochnair committed Jul 6, 2018
    The ACK filter is an optional feature of CAKE which is designed to improve
    performance on links with very asymmetrical rate limits. On such links
    (which are unfortunately quite prevalent, especially for DSL and cable
    subscribers), the downstream throughput can be limited by the number of
    ACKs capable of being transmitted in the *upstream* direction.
    
    Filtering ACKs can, in general, have adverse effects on TCP performance
    because it interferes with ACK clocking (especially in slow start), and it
    reduces the flow's resiliency to ACKs being dropped further along the path.
    To alleviate these drawbacks, the ACK filter in CAKE tries its best to
    always keep enough ACKs queued to ensure forward progress in the TCP flow
    being filtered. It does this by only filtering redundant ACKs. In its
    default 'conservative' mode, the filter will always keep at least two
    redundant ACKs in the queue, while in 'aggressive' mode, it will filter
    down to a single ACK.
    
    The ACK filter works by inspecting the per-flow queue on every packet
    enqueue. Starting at the head of the queue, the filter looks for another
    eligible packet to drop (so the ACK being dropped is always closer to the
    head of the queue than the packet being enqueued). An ACK is eligible only
    if it ACKs *fewer* bytes than the new packet being enqueued, including any
    SACK options. This prevents duplicate ACKs from being filtered, to avoid
    interfering with retransmission logic. In addition, we check TCP header
    options and only drop those that are known to not interfere with sender
    state. In particular, packets with unknown option codes are never dropped.
    
    In aggressive mode, an eligible packet is always dropped, while in
    conservative mode, at least two ACKs are kept in the queue. Only pure ACKs
    (with no data segments) are considered eligible for dropping, but when an
    ACK with data segments is enqueued, this can cause another pure ACK to
    become eligible for dropping.
    
    The approach described above ensures that this ACK filter avoids most of
    the drawbacks of a naive filtering mechanism that only keeps flow state but
    does not inspect the queue. This is the rationale for including the ACK
    filter in CAKE itself rather than as a separate module (as a TC filter, for
    instance).
    
    Our performance evaluation has shown that on a 30/1 Mbps link with a
    bidirectional traffic test (RRUL), turning on the ACK filter on the
    upstream link improves downstream throughput by ~20% (both modes) and
    upstream throughput by ~12% in conservative mode and ~40% in aggressive
    mode, at the cost of ~5ms of inter-flow latency due to the increased
    congestion.
    
    In *really* pathological cases, the effect can be a lot more; for instance,
    the ACK filter increases the achievable downstream throughput on a link
    with 100 Kbps in the upstream direction by an order of magnitude (from ~2.5
    Mbps to ~25 Mbps).
    
    Finally, even though we consider the ACK filter to be safer than most, we
    do not recommend turning it on everywhere: on more symmetrical link
    bandwidths the effect is negligible at best.
    
    Cc: Yuchung Cheng <ycheng@google.com>
    Cc: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
    Signed-off-by: David S. Miller <davem@davemloft.net>
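The queue-inspection idea can be sketched in a heavily simplified form. This is not the kernel implementation: it ignores SACK blocks and TCP options entirely, models each packet by its cumulative ACK number only, and reads "conservative" as leaving at least two ACKs in the queue after enqueue (one older ACK plus the new one), which is just one possible interpretation of the description above. All names are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    ack_seq: int           # cumulative ACK number carried by the packet
    pure_ack: bool = True  # False if the packet also carries data


def enqueue_with_ack_filter(queue: List[Packet], new: Packet,
                            aggressive: bool = False) -> Optional[Packet]:
    """Append `new` to the per-flow queue; return the ACK dropped, if any."""
    # Only pure ACKs that acknowledge *fewer* bytes than the new packet are
    # eligible, so duplicate ACKs (equal ack_seq) are never filtered.
    eligible = [i for i, p in enumerate(queue)
                if p.pure_ack and p.ack_seq < new.ack_seq]
    min_older_acks = 1 if aggressive else 2  # assumed thresholds
    dropped = None
    if len(eligible) >= min_older_acks:
        dropped = queue.pop(eligible[0])     # the one closest to the head
    queue.append(new)
    return dropped


q = [Packet(100), Packet(200)]
print(enqueue_with_ack_filter(q, Packet(300)))  # drops the ACK for 100
print([p.ack_seq for p in q])
```

In this model, aggressive mode filters a flow down to the single newest ACK, while conservative mode always leaves an older ACK queued ahead of it.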
  8. sch_cake: Add ingress mode

    tohojo authored and Lochnair committed Jul 6, 2018
    The ingress mode is meant to be enabled when CAKE runs downlink of the
    actual bottleneck (such as on an IFB device). The mode changes the shaper
    to also account dropped packets to the shaped rate, as these have already
    traversed the bottleneck.
    
    Enabling ingress mode will also tune the AQM to always keep at least two
    packets queued *for each flow*. This is done by scaling the minimum queue
    occupancy level that will disable the AQM by the number of active bulk
    flows. The rationale for this is that retransmits are more expensive in
    ingress mode, since dropped packets have to traverse the bottleneck again
    when they are retransmitted; thus, being more lenient and keeping a minimum
    number of packets queued will improve throughput in cases where the number
    of active flows is so large that they saturate the bottleneck even at
    their minimum window size.
    
    This commit also adds a separate switch to enable ingress mode rate
    autoscaling. If enabled, the autoscaling code will observe the actual
    traffic rate and adjust the shaper rate to match it. This can help avoid
    latency increases in the case where the actual bottleneck rate decreases
    below the shaped rate. The scaling filters out spikes by an EWMA filter.
    
    Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
    Signed-off-by: David S. Miller <davem@davemloft.net>
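The spike-filtering property of an EWMA can be shown with a few lines. This is a generic sketch of an exponentially weighted moving average; the smoothing weight (1/8 here) and the sample values are assumptions for illustration, not the constants used by the autoscaling code.

```python
def ewma(prev: float, sample: float, shift: int = 3) -> float:
    """Move the estimate 1/2**shift of the way from prev toward sample."""
    return prev + (sample - prev) / (1 << shift)


estimate = 100.0  # Mbit/s, hypothetical starting rate estimate
for sample in (100.0, 500.0, 100.0, 100.0):
    estimate = ewma(estimate, sample)
    print(f"sample {sample:6.1f} -> estimate {estimate:7.3f}")
```

A single 500 Mbit/s outlier moves the estimate only to 150 Mbit/s, and it decays back toward 100 Mbit/s on the following samples.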
  9. sched: Add Common Applications Kept Enhanced (cake) qdisc

    tohojo authored and Lochnair committed Jul 6, 2018
    sch_cake targets the home router use case and is intended to squeeze the
    most bandwidth and latency out of even the slowest ISP links and routers,
    while presenting an API simple enough that even an ISP can configure it.
    
    Example of use on a cable ISP uplink:
    
    tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter
    
    To shape a cable download link (ifb and tc-mirred setup elided)
    
    tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash
    
    CAKE is filled with:
    
    * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
      derived Flow Queuing system, which autoconfigures based on the bandwidth.
    * A novel "triple-isolate" mode (the default) which balances per-host
      and per-flow FQ even through NAT.
    * A deficit-based shaper that can also be used in an unlimited mode.
    * 8 way set associative hashing to reduce flow collisions to a minimum.
    * A reasonable interpretation of various diffserv latency/loss tradeoffs.
    * Support for zeroing diffserv markings for entering and exiting traffic.
    * Support for interacting well with Docsis 3.0 shaper framing.
    * Extensive support for DSL framing types.
    * Support for ack filtering.
    * Extensive statistics for measuring loss, ECN markings, and latency
      variation.
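The 8-way set-associative hashing mentioned in the list above can be illustrated with a toy flow table. This is a sketch of the general technique only: the number of sets, the use of Python's `hash`, and the policy of falling back to the first way of a full set are assumptions, and the class is hypothetical.

```python
WAYS = 8
SETS = 128  # assumed table geometry: 128 sets of 8 queues


class FlowTable:
    def __init__(self) -> None:
        self.tags = [None] * (WAYS * SETS)  # flow key stored per queue
        self.collisions = 0

    def lookup(self, flow_key) -> int:
        """Return the queue index for flow_key, allocating a way if needed."""
        base = (hash(flow_key) % SETS) * WAYS
        # 1. The flow already owns a queue in its set.
        for i in range(base, base + WAYS):
            if self.tags[i] == flow_key:
                return i
        # 2. Otherwise claim the first unused way in the set.
        for i in range(base, base + WAYS):
            if self.tags[i] is None:
                self.tags[i] = flow_key
                return i
        # 3. All ways are taken: the flow must share a queue (a collision).
        self.collisions += 1
        return base


table = FlowTable()
# Integer keys that all hash into set 0; a direct-mapped table would
# already collide on the second one.
queues = [table.lookup(k * SETS) for k in range(WAYS)]
print(queues, table.collisions)
print(table.lookup(WAYS * SETS), table.collisions)
```

Eight flows that land in the same set still get distinct queues; only a ninth flow in that set is forced to share.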
    
    A paper describing the design of CAKE is available at
    https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE
    International Symposium on Local and Metropolitan Area Networks (LANMAN).
    
    This patch adds the base shaper and packet scheduler, while subsequent
    commits add the optional (configurable) features. The full userspace API
    and most data structures are included in this commit, but options not
    understood in the base version will be ignored.
    
    Various versions have been baking as an out-of-tree build for
    kernel versions going back to 3.10, as the embedded router world has been
    running a few years behind mainline Linux. A stable version has been
    generally available on lede-17.01 and later.
    
    sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
    in the sqm-scripts, with sane defaults and vastly simpler configuration.
    
    CAKE's principal author is Jonathan Morton, with contributions from
    Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
    Ryan Mounce, Tony Ambardar, Dean Scarff, Nils Andreas Svee, Dave Täht,
    and Loganaden Velvindron.
    
    Testing from Pete Heist, Georgios Amanakis, and the many other members of
    the cake@lists.bufferbloat.net mailing list.
    
    tc -s qdisc show dev eth2
     qdisc cake 8017: root refcnt 2 bandwidth 1Gbit diffserv3 triple-isolate split-gso rtt 100.0ms noatm overhead 38 mpu 84
     Sent 51504294511 bytes 37724591 pkt (dropped 6, overlimits 64958695 requeues 12)
      backlog 0b 0p requeues 12
      memory used: 1053008b of 15140Kb
      capacity estimate: 970Mbit
      min/max network layer size:           28 /    1500
      min/max overhead-adjusted size:       84 /    1538
      average network hdr offset:           14
                        Bulk  Best Effort        Voice
       thresh      62500Kbit        1Gbit      250Mbit
       target          5.0ms        5.0ms        5.0ms
       interval      100.0ms      100.0ms      100.0ms
       pk_delay          5us          5us          6us
       av_delay          3us          2us          2us
       sp_delay          2us          1us          1us
       backlog            0b           0b           0b
       pkts          3164050     25030267      9530280
       bytes      3227519915  35396974782  12879808898
       way_inds            0            8            0
       way_miss           21          366           25
       way_cols            0            0            0
       drops               5            0            1
       marks               0            0            0
       ack_drop            0            0            0
       sp_flows            1            3            0
       bk_flows            0            1            1
       un_flows            0            0            0
       max_len         68130        68130        68130
    
    Tested-by: Pete Heist <peteheist@gmail.com>
    Tested-by: Georgios Amanakis <gamanakis@gmail.com>
    Signed-off-by: Dave Taht <dave.taht@gmail.com>
    Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
    Signed-off-by: David S. Miller <davem@davemloft.net>
Commits on Aug 13, 2018
  1. Merge tag 'v4.14.62' into linux-vyos-4.14.y

    c-po committed Aug 13, 2018
    This is the 4.14.62 stable release
    
    * tag 'v4.14.62': (3380 commits)
      Linux 4.14.62
      jfs: Fix inconsistency between memory allocation and ea_buf->max_size
      xfs: don't call xfs_da_shrink_inode with NULL bp
      xfs: validate cached inodes are free when allocated
      xfs: catch inode allocation state mismatch corruption
      intel_idle: Graceful probe failure when MWAIT is disabled
      nvmet-fc: fix target sgl list on large transfers
      nvme-pci: Fix queue double allocations
      nvme-pci: allocate device queues storage space at probe
      Btrfs: fix file data corruption after cloning a range and fsync
      i2c: imx: Fix reinit_completion() use
      ring_buffer: tracing: Inherit the tracing setting to next ring buffer
      ACPI / PCI: Bail early in acpi_pci_add_bus() if there is no ACPI handle
      ext4: fix false negatives *and* false positives in ext4_check_descriptors()
      netlink: Don't shift on 64 for ngroups
      nohz: Fix missing tick reprogram when interrupting an inline softirq
      nohz: Fix local_timer_softirq_pending()
      genirq: Make force irq threading setup more robust
      scsi: qla2xxx: Return error when TMF returns
      scsi: qla2xxx: Fix ISP recovery on unload
      ...
Commits on Aug 9, 2018
  1. Linux 4.14.62

    gregkh committed Aug 9, 2018
  2. jfs: Fix inconsistency between memory allocation and ea_buf->max_size

    shankarapailoor authored and gregkh committed Jun 5, 2018
    commit 92d3413 upstream.
    
    The code is assuming the buffer is max_size length, but we weren't
    allocating enough space for it.
    
    Signed-off-by: Shankara Pailoor <shankarapailoor@gmail.com>
    Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
    Cc: Guenter Roeck <linux@roeck-us.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. xfs: don't call xfs_da_shrink_inode with NULL bp

    sandeen authored and gregkh committed Jun 8, 2018
    commit bb3d48d upstream.
    
    xfs_attr3_leaf_create may have errored out before instantiating a buffer,
    for example if the blkno is out of range.  In that case there is no work
    to do to remove it, and in fact xfs_da_shrink_inode will lead to an oops
    if we try.
    
    This also seems to fix a flaw where the original error from
    xfs_attr3_leaf_create gets overwritten in the cleanup case, and it
    removes a pointless assignment to bp which isn't used after this.
    
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199969
    Reported-by: Xu, Wen <wen.xu@gatech.edu>
    Tested-by: Xu, Wen <wen.xu@gatech.edu>
    Signed-off-by: Eric Sandeen <sandeen@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Cc: Eduardo Valentin <eduval@amazon.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  4. xfs: validate cached inodes are free when allocated

    Dave Chinner authored and gregkh committed Apr 18, 2018
    commit afca6c5 upstream.
    
    A recent fuzzed filesystem image caused random dcache corruption
    when the reproducer was run. This often showed up as panics in
    lookup_slow() on a null inode->i_ops pointer when doing pathwalks.
    
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
    ....
    Call Trace:
     lookup_slow+0x44/0x60
     walk_component+0x3dd/0x9f0
     link_path_walk+0x4a7/0x830
     path_lookupat+0xc1/0x470
     filename_lookup+0x129/0x270
     user_path_at_empty+0x36/0x40
     path_listxattr+0x98/0x110
     SyS_listxattr+0x13/0x20
     do_syscall_64+0xf5/0x280
     entry_SYSCALL_64_after_hwframe+0x42/0xb7
    
    but had many different failure modes including deadlocks trying to
    lock the inode that was just allocated or KASAN reports of
    use-after-free violations.
    
    The cause of the problem was a corrupt INOBT on a v4 fs where the
    root inode was marked as free in the inobt record. Hence when we
    allocated an inode, it chose the root inode to allocate, found it in
    the cache and re-initialised it.
    
    We recently fixed a similar inode allocation issue, caused by an inobt
    record corruption problem, in xfs_iget_cache_miss() in commit
    ee45700 ("xfs: catch inode allocation state mismatch
    corruption"). This change adds similar checks to the cache-hit path
    to catch it, and turns the reproducer into a corruption shutdown
    situation.
    
    Reported-by: Wen Xu <wen.xu@gatech.edu>
    Signed-Off-By: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    [darrick: fix typos in comment]
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Cc: Eduardo Valentin <eduval@amazon.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. xfs: catch inode allocation state mismatch corruption

    Dave Chinner authored and gregkh committed Mar 23, 2018
    commit ee45700 upstream.
    
    We recently came across a V4 filesystem causing memory corruption
    due to a newly allocated inode being setup twice and being added to
    the superblock inode list twice. From code inspection, the only way
    this could happen is if a newly allocated inode was not marked as
    free on disk (i.e. di_mode wasn't zero).
    
    Running the metadump on an upstream debug kernel fails during inode
    allocation like so:
    
    XFS: Assertion failed: ip->i_d.di_nblocks == 0, file: fs/xfs/xfs_inode.c, line: 838
     ------------[ cut here ]------------
    kernel BUG at fs/xfs/xfs_message.c:114!
    invalid opcode: 0000 [#1] PREEMPT SMP
    CPU: 11 PID: 3496 Comm: mkdir Not tainted 4.16.0-rc5-dgc #442
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    RIP: 0010:assfail+0x28/0x30
    RSP: 0018:ffffc9000236fc80 EFLAGS: 00010202
    RAX: 00000000ffffffea RBX: 0000000000004000 RCX: 0000000000000000
    RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff8227211b
    RBP: ffffc9000236fce8 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000bec R11: f000000000000000 R12: ffffc9000236fd30
    R13: ffff8805c76bab80 R14: ffff8805c77ac800 R15: ffff88083fb12e10
    FS:  00007fac8cbff040(0000) GS:ffff88083fd00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fffa6783ff8 CR3: 00000005c6e2b003 CR4: 00000000000606e0
    Call Trace:
     xfs_ialloc+0x383/0x570
     xfs_dir_ialloc+0x6a/0x2a0
     xfs_create+0x412/0x670
     xfs_generic_create+0x1f7/0x2c0
     ? capable_wrt_inode_uidgid+0x3f/0x50
     vfs_mkdir+0xfb/0x1b0
     SyS_mkdir+0xcf/0xf0
     do_syscall_64+0x73/0x1a0
     entry_SYSCALL_64_after_hwframe+0x42/0xb7
    
    Extracting the inode number we crashed on from an event trace and
    looking at it with xfs_db:
    
    xfs_db> inode 184452204
    xfs_db> p
    core.magic = 0x494e
    core.mode = 0100644
    core.version = 2
    core.format = 2 (extents)
    core.nlinkv2 = 1
    core.onlink = 0
    .....
    
    Confirms that it is not a free inode on disk. xfs_repair
    also trips over this inode:
    
    .....
    zero length extent (off = 0, fsbno = 0) in ino 184452204
    correcting nextents for inode 184452204
    bad attribute fork in inode 184452204, would clear attr fork
    bad nblocks 1 for inode 184452204, would reset to 0
    bad anextents 1 for inode 184452204, would reset to 0
    imap claims in-use inode 184452204 is free, would correct imap
    would have cleared inode 184452204
    .....
    disconnected inode 184452204, would move to lost+found
    
    And so we have a situation where the directory structure and the
    inobt think the inode is free, but the inode on disk thinks it is
    still in use. Where this corruption came from is not possible to
    diagnose, but we can detect it and prevent the kernel from oopsing
    on lookup. The reproducer now results in:
    
    $ sudo mkdir /mnt/scratch/{0,1,2,3,4,5}{0,1,2,3,4,5}
    mkdir: cannot create directory ‘/mnt/scratch/00’: File exists
    mkdir: cannot create directory ‘/mnt/scratch/01’: File exists
    mkdir: cannot create directory ‘/mnt/scratch/03’: Structure needs cleaning
    mkdir: cannot create directory ‘/mnt/scratch/04’: Input/output error
    mkdir: cannot create directory ‘/mnt/scratch/05’: Input/output error
    ....
    
    And this corruption shutdown:
    
    [   54.843517] XFS (loop0): Corruption detected! Free inode 0xafe846c not marked free on disk
    [   54.845885] XFS (loop0): Internal error xfs_trans_cancel at line 1023 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x425/0x670
    [   54.848994] CPU: 10 PID: 3541 Comm: mkdir Not tainted 4.16.0-rc5-dgc #443
    [   54.850753] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    [   54.852859] Call Trace:
    [   54.853531]  dump_stack+0x85/0xc5
    [   54.854385]  xfs_trans_cancel+0x197/0x1c0
    [   54.855421]  xfs_create+0x425/0x670
    [   54.856314]  xfs_generic_create+0x1f7/0x2c0
    [   54.857390]  ? capable_wrt_inode_uidgid+0x3f/0x50
    [   54.858586]  vfs_mkdir+0xfb/0x1b0
    [   54.859458]  SyS_mkdir+0xcf/0xf0
    [   54.860254]  do_syscall_64+0x73/0x1a0
    [   54.861193]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
    [   54.862492] RIP: 0033:0x7fb73bddf547
    [   54.863358] RSP: 002b:00007ffdaa553338 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
    [   54.865133] RAX: ffffffffffffffda RBX: 00007ffdaa55449a RCX: 00007fb73bddf547
    [   54.866766] RDX: 0000000000000001 RSI: 00000000000001ff RDI: 00007ffdaa55449a
    [   54.868432] RBP: 00007ffdaa55449a R08: 00000000000001ff R09: 00005623a8670dd0
    [   54.870110] R10: 00007fb73be72d5b R11: 0000000000000246 R12: 00000000000001ff
    [   54.871752] R13: 00007ffdaa5534b0 R14: 0000000000000000 R15: 00007ffdaa553500
    [   54.873429] XFS (loop0): xfs_do_force_shutdown(0x8) called from line 1024 of file fs/xfs/xfs_trans.c.  Return address = ffffffff814cd050
    [   54.882790] XFS (loop0): Corruption of in-memory data detected.  Shutting down filesystem
    [   54.884597] XFS (loop0): Please umount the filesystem and rectify the problem(s)
    
    Note that this crash is only possible on v4 filesystems or v5
    filesystems mounted with the ikeep mount option. For all other V5
    filesystems, this problem cannot occur because we don't read inodes
    we are allocating from disk - we simply overwrite them with the new
    inode information.
    
    Signed-Off-By: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
    Tested-by: Carlos Maiolino <cmaiolino@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Cc: Eduardo Valentin <eduval@amazon.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
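The essence of the check added by this commit can be sketched outside the kernel. This is an illustration only: the exception type and function name are hypothetical, and the real code inspects more state than the mode field; the sketch relies solely on the statement above that a free inode must have a zero di_mode on disk.

```python
class FilesystemCorrupted(Exception):
    """Stand-in for the kernel's corruption error path (hypothetical)."""


def check_inode_is_free(ino: int, di_mode: int) -> None:
    """Refuse to allocate an inode that the on-disk record says is in use."""
    if di_mode != 0:
        raise FilesystemCorrupted(
            f"Free inode {ino:#x} not marked free on disk")


# Inode number and mode taken from the xfs_db output above.
try:
    check_inode_is_free(184452204, 0o100644)
except FilesystemCorrupted as err:
    print(f"Corruption detected! {err}")
```

Detecting the mismatch at allocation time turns a later memory-corruption crash into an immediate, reportable error.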
  6. intel_idle: Graceful probe failure when MWAIT is disabled

    lenb authored and gregkh committed Nov 9, 2017
    commit a4c4475 upstream.
    
    When MWAIT is disabled, intel_idle refuses to probe.
    But it may mislead the user by blaming this on the model number:
    
    intel_idle: does not run on family 6 model 79
    
    So defer the check for MWAIT until after the model# white-list check succeeds,
    and if the MWAIT check fails, tell the user how to fix it:
    
    intel_idle: Please enable MWAIT in BIOS SETUP
    
    Signed-off-by: Len Brown <len.brown@intel.com>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Cc: Eduardo Valentin <eduval@amazon.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  7. nvmet-fc: fix target sgl list on large transfers

    jsmart-gh authored and gregkh committed Jul 16, 2018
    commit d082dc1 upstream.
    
    The existing code to carve up the sg list expected one sg element per page,
    which can be badly wrong when an IOMMU remaps multiple memory pages to
    fewer bus addresses. Hitting this error required a large I/O payload
    (greater than 256k) and a system that maps on a per-page basis. It's
    possible that large I/Os could get by fine if the system condensed the
    sgl list into the first 64 elements.
    
    This patch corrects the sg list handling by specifically walking the
    sg list element by element and attempting to divide the transfer up
    on a per-sg element boundary. While doing so, it still tries to keep
    sequences under 256k, but will exceed that rule if a single sg element
    is larger than 256k.
    
    Fixes: 48fa362 ("nvmet-fc: simplify sg list handling")
    Cc: <stable@vger.kernel.org> # 4.14
    Signed-off-by: James Smart <james.smart@broadcom.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
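The element-by-element splitting described in this commit can be sketched as a greedy walk over sg element lengths. This models only the sizing rule (keep sequences under 256k unless a single element is larger); the function name and the list-of-lengths representation are assumptions, not the driver's data structures.

```python
MAX_SEQ_BYTES = 256 * 1024


def split_into_sequences(sg_lengths):
    """Group sg element lengths into sequences on element boundaries."""
    sequences, current, current_bytes = [], [], 0
    for length in sg_lengths:
        # Start a new sequence if adding this element would exceed the cap.
        # An element larger than the cap still gets a sequence of its own.
        if current and current_bytes + length > MAX_SEQ_BYTES:
            sequences.append(current)
            current, current_bytes = [], 0
        current.append(length)
        current_bytes += length
    if current:
        sequences.append(current)
    return sequences


K = 1024
print(split_into_sequences([128 * K, 128 * K, 4 * K]))
print(split_into_sequences([512 * K, 4 * K]))
```

Elements are never cut in the middle, so a sequence exceeds 256k only when one element alone is that large.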
  8. nvme-pci: Fix queue double allocations

    Keith Busch authored and gregkh committed Jan 23, 2018
    commit 62314e4 upstream.
    
    The queue count says the highest queue that's been allocated, so don't
    reallocate a queue lower than that.
    
    Fixes: 147b27e ("nvme-pci: allocate device queues storage space at probe")
    Signed-off-by: Keith Busch <keith.busch@intel.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
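The guard described in this commit amounts to skipping allocation for any queue ID below the current queue count. The toy model below illustrates that idea only; the class, its fields, and the allocation counter are hypothetical and do not mirror the driver's structures.

```python
class Dev:
    """Toy device that tracks how many queues have been allocated."""

    def __init__(self) -> None:
        self.queue_count = 0  # one past the highest queue ID allocated so far
        self.allocations = 0  # how many real allocations were performed

    def alloc_queue(self, qid: int) -> None:
        if self.queue_count > qid:
            return            # already allocated, do not allocate again
        self.allocations += 1
        self.queue_count += 1


dev = Dev()
for qid in range(3):          # initial setup: queues 0..2
    dev.alloc_queue(qid)
for qid in range(4):          # after a reset: queues 0..3 requested again
    dev.alloc_queue(qid)
print(dev.allocations)        # only queue 3 caused a new allocation
```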
  9. nvme-pci: allocate device queues storage space at probe

    sagigrimberg authored and gregkh committed Jan 14, 2018
    commit 147b27e upstream.
    
    Setting 'nvmeq' in nvme_init_request() can race, because .init_request
    is called while switching the I/O scheduler, which may happen while the
    NVMe device is being reset and its nvme queues are being freed and
    re-created. There is no synchronization between the two paths.
    
    This patch changes the nvmeq allocation to occur at probe time so
    there is no way we can dereference it at init_request.
    
    [   93.268391] kernel BUG at drivers/nvme/host/pci.c:408!
    [   93.274146] invalid opcode: 0000 [#1] SMP
    [   93.278618] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss
    nfsv4 dns_resolver nfs lockd grace fscache sunrpc ipmi_ssif vfat fat
    intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel
    kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt
    intel_cstate ipmi_si iTCO_vendor_support intel_uncore mxm_wmi mei_me
    ipmi_devintf intel_rapl_perf pcspkr sg ipmi_msghandler lpc_ich dcdbas mei
    shpchp acpi_power_meter wmi dm_multipath ip_tables xfs libcrc32c sd_mod
    mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
    fb_sys_fops ttm drm ahci libahci nvme libata crc32c_intel nvme_core tg3
    megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
    [   93.349071] CPU: 5 PID: 1842 Comm: sh Not tainted 4.15.0-rc2.ming+ #4
    [   93.356256] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
    [   93.364801] task: 00000000fb8abf2a task.stack: 0000000028bd82d1
    [   93.371408] RIP: 0010:nvme_init_request+0x36/0x40 [nvme]
    [   93.377333] RSP: 0018:ffffc90002537ca8 EFLAGS: 00010246
    [   93.383161] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000008
    [   93.391122] RDX: 0000000000000000 RSI: ffff880276ae0000 RDI: ffff88047bae9008
    [   93.399084] RBP: ffff88047bae9008 R08: ffff88047bae9008 R09: 0000000009dabc00
    [   93.407045] R10: 0000000000000004 R11: 000000000000299c R12: ffff880186bc1f00
    [   93.415007] R13: ffff880276ae0000 R14: 0000000000000000 R15: 0000000000000071
    [   93.422969] FS:  00007f33cf288740(0000) GS:ffff88047ba80000(0000) knlGS:0000000000000000
    [   93.431996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [   93.438407] CR2: 00007f33cf28e000 CR3: 000000047e5bb006 CR4: 00000000001606e0
    [   93.446368] Call Trace:
    [   93.449103]  blk_mq_alloc_rqs+0x231/0x2a0
    [   93.453579]  blk_mq_sched_alloc_tags.isra.8+0x42/0x80
    [   93.459214]  blk_mq_init_sched+0x7e/0x140
    [   93.463687]  elevator_switch+0x5a/0x1f0
    [   93.467966]  ? elevator_get.isra.17+0x52/0xc0
    [   93.472826]  elv_iosched_store+0xde/0x150
    [   93.477299]  queue_attr_store+0x4e/0x90
    [   93.481580]  kernfs_fop_write+0xfa/0x180
    [   93.485958]  __vfs_write+0x33/0x170
    [   93.489851]  ? __inode_security_revalidate+0x4c/0x60
    [   93.495390]  ? selinux_file_permission+0xda/0x130
    [   93.500641]  ? _cond_resched+0x15/0x30
    [   93.504815]  vfs_write+0xad/0x1a0
    [   93.508512]  SyS_write+0x52/0xc0
    [   93.512113]  do_syscall_64+0x61/0x1a0
    [   93.516199]  entry_SYSCALL64_slow_path+0x25/0x25
    [   93.521351] RIP: 0033:0x7f33ce96aab0
    [   93.525337] RSP: 002b:00007ffe57570238 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    [   93.533785] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f33ce96aab0
    [   93.541746] RDX: 0000000000000006 RSI: 00007f33cf28e000 RDI: 0000000000000001
    [   93.549707] RBP: 00007f33cf28e000 R08: 000000000000000a R09: 00007f33cf288740
    [   93.557669] R10: 00007f33cf288740 R11: 0000000000000246 R12: 00007f33cec42400
    [   93.565630] R13: 0000000000000006 R14: 0000000000000001 R15: 0000000000000000
    [   93.573592] Code: 4c 8d 40 08 4c 39 c7 74 16 48 8b 00 48 8b 04 08 48 85 c0
    74 16 48 89 86 78 01 00 00 31 c0 c3 8d 4a 01 48 63 c9 48 c1 e1 03 eb de <0f>
    0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 85 f6 53 48 89
    [   93.594676] RIP: nvme_init_request+0x36/0x40 [nvme] RSP: ffffc90002537ca8
    [   93.602273] ---[ end trace 810dde3993e5f14e ]---
    
    Reported-by: Yi Zhang <yi.zhang@redhat.com>
    Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  10. Btrfs: fix file data corruption after cloning a range and fsync

    fdmanana authored and gregkh committed Jul 12, 2018
    commit bd3599a upstream.
    
    When we clone a range into a file we can end up dropping existing
    extent maps (or trimming them) and replacing them with new ones if the
    range to be cloned overlaps with a range in the destination inode.
    When that happens we add the new extent maps to the list of modified
    extents in the inode's extent map tree, so that a "fast" fsync (the flag
    BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
    and log corresponding extent items. However, at the end of range cloning
    operation we do truncate all the pages in the affected range (in order to
    ensure future reads will not get stale data). Sometimes this truncation
    will release the corresponding extent maps besides the pages from the page
    cache. If this happens, then a "fast" fsync operation will miss logging
    some extent items, because it relies exclusively on the extent maps being
    present in the inode's extent tree, leading to data loss/corruption if
    the fsync ends up using the same transaction used by the clone operation
    (that transaction was not committed in the meanwhile). An extent map is
    released through the callback btrfs_invalidatepage(), which gets called by
    truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
    latter ends up calling try_release_extent_mapping() which will release the
    extent map if some conditions are met, like the file size being greater
    than 16Mb, gfp flags allowing blocking, the range not being locked (which
    is the case during the clone operation) and the extent map not being
    flagged as pinned (also the case for cloning).
    
    The following example, turned into a test for fstests, reproduces the
    issue:
    
      $ mkfs.btrfs -f /dev/sdb
      $ mount /dev/sdb /mnt
    
      $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
      $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar
    
      $ xfs_io -c "fsync" /mnt/bar
      # reflink destination offset corresponds to the size of file bar,
      # 2728Kb minus 4Kb.
      $ xfs_io -c "reflink /mnt/foo 0 2724K 15908K" /mnt/bar
      $ xfs_io -c "fsync" /mnt/bar
    
      $ md5sum /mnt/bar
      95a95813a8c2abc9aa75a6c2914a077e  /mnt/bar
    
      <power fail>
    
      $ mount /dev/sdb /mnt
      $ md5sum /mnt/bar
      207fd8d0b161be8a84b945f0df8d5f8d  /mnt/bar
      # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
      # power failure
    
    In the above example, the destination offset of the clone operation
    corresponds to the size of the "bar" file minus 4Kb. So during the clone
    operation, the extent map covering the range from 2572Kb to 2728Kb gets
    trimmed so that it ends at offset 2724Kb, and a new extent map covering
    the range from 2724Kb to 11724Kb is created. So at the end of the clone
    operation when we ask to truncate the pages in the range from 2724Kb to
    2724Kb + 15908Kb, the page invalidation callback ends up removing the new
    extent map (through try_release_extent_mapping()) when the page at offset
    2724Kb is passed to that callback.
    
    Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
    map is removed at try_release_extent_mapping(), forcing the next fsync to
    search for modified extents in the fs/subvolume tree instead of relying on
    the presence of extent maps in memory. This way we can continue doing a
    "fast" fsync if the destination range of a clone operation does not
    overlap with an existing range or if any of the criteria necessary to
    remove an extent map at try_release_extent_mapping() is not met (file
    size not bigger than 16Mb or gfp flags do not allow blocking).
    
    CC: stable@vger.kernel.org # 3.16+
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  11. i2c: imx: Fix reinit_completion() use

    Esben Haabendal authored and gregkh committed Jul 9, 2018
    commit 9f9e3e0 upstream.
    
    Make sure to call reinit_completion() before DMA is started to avoid a
    race condition where reinit_completion() is called after complete() and
    before wait_for_completion_timeout().
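
    The ordering problem can be modelled single-threaded in user space. This
    is a sketch, not the i2c-imx driver code: the struct and functions below
    only borrow the kernel names, start_dma() is a hypothetical DMA that
    finishes immediately, and wait_would_succeed() is a non-blocking stand-in
    for wait_for_completion_timeout().

```c
#include <stdbool.h>

/* Toy completion: counts how many times complete() was called. */
struct completion {
    unsigned int done;
};

static void reinit_completion(struct completion *c) { c->done = 0; }
static void complete(struct completion *c) { c->done++; }

/* Returns true if a completion was already signalled, false for "timeout". */
static bool wait_would_succeed(struct completion *c)
{
    if (c->done) {
        c->done--;
        return true;
    }
    return false;
}

/* Hypothetical DMA whose completion callback fires before the caller resumes. */
static void start_dma(struct completion *c) { complete(c); }

/* Racy order: the signal from complete() is wiped by the late reinit. */
static bool transfer_buggy(struct completion *c)
{
    start_dma(c);
    reinit_completion(c);
    return wait_would_succeed(c);
}

/* Fixed order: reinitialise first, so complete() can no longer be lost. */
static bool transfer_fixed(struct completion *c)
{
    reinit_completion(c);
    start_dma(c);
    return wait_would_succeed(c);
}
```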
    
    Signed-off-by: Esben Haabendal <eha@deif.com>
    Fixes: ce1a788 ("i2c: imx: add DMA support for freescale i2c driver")
    Reviewed-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
    Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
    Cc: stable@kernel.org
    Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  12. ring_buffer: tracing: Inherit the tracing setting to next ring buffer

    mhiramat authored and gregkh committed Jul 13, 2018
    commit 73c8d89 upstream.
    
    Maintain the tracing on/off setting of the ring_buffer when switching
    to the trace buffer snapshot.
    
    Taking a snapshot is done by swapping the backup ring buffer
    (max_tr_buffer). But since the tracing on/off setting is defined
    by the ring buffer, when swapping it, the tracing on/off setting
    can also be changed. This causes a strange result like below:
    
      /sys/kernel/debug/tracing # cat tracing_on
      1
      /sys/kernel/debug/tracing # echo 0 > tracing_on
      /sys/kernel/debug/tracing # cat tracing_on
      0
      /sys/kernel/debug/tracing # echo 1 > snapshot
      /sys/kernel/debug/tracing # cat tracing_on
      1
      /sys/kernel/debug/tracing # echo 1 > snapshot
      /sys/kernel/debug/tracing # cat tracing_on
      0
    
    We don't touch tracing_on, but snapshot changes the tracing_on
    setting each time. This is an anomaly, because the user doesn't know
    that each "ring_buffer" stores its own tracing-enable state and
    the snapshot is done by swapping ring buffers.
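
    The effect can be reproduced with a small user-space model. The types and
    function names below are hypothetical; the only assumption taken from the
    description is that the on/off flag lives in each buffer and that a
    snapshot swaps the two buffers.

```c
#include <stdbool.h>

/* Toy ring buffer: only the per-buffer tracing on/off state matters here. */
struct ring_buffer {
    bool tracing_on;
};

struct tracer {
    struct ring_buffer *live;     /* buffer currently being traced into */
    struct ring_buffer *snapshot; /* backup buffer used for snapshots */
};

static void swap_buffers(struct tracer *t)
{
    struct ring_buffer *tmp = t->live;

    t->live = t->snapshot;
    t->snapshot = tmp;
}

/* Before the fix: the new live buffer brings its own, unrelated setting. */
static void snapshot_buggy(struct tracer *t)
{
    swap_buffers(t);
}

/* After the fix: the next buffer inherits the current setting before the swap. */
static void snapshot_fixed(struct tracer *t)
{
    t->snapshot->tracing_on = t->live->tracing_on;
    swap_buffers(t);
}
```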
    
    Link: http://lkml.kernel.org/r/153149929558.11274.11730609978254724394.stgit@devbox
    
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
    Cc: Hiraku Toyooka <hiraku.toyooka@cybertrust.co.jp>
    Cc: stable@vger.kernel.org
    Fixes: debdd57 ("tracing: Make a snapshot feature available from userspace")
    Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
    [ Updated commit log and comment in the code ]
    Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  13. ACPI / PCI: Bail early in acpi_pci_add_bus() if there is no ACPI handle

    Vitaly Kuznetsov authored and gregkh committed Sep 14, 2017
    commit a0040c0 upstream.
    
    Hyper-V instances support PCI pass-through which is implemented through PV
    pci-hyperv driver. When a device is passed through, a new root PCI bus is
    created in the guest. The bus sits on top of VMBus and has no associated
    information in ACPI. acpi_pci_add_bus() in this case proceeds all the way
    to acpi_evaluate_dsm(), which reports
    
      ACPI: \: failed to evaluate _DSM (0x1001)
    
    While acpi_pci_slot_enumerate() and acpiphp_enumerate_slots() are protected
    against ACPI_HANDLE() being NULL and do nothing, acpi_evaluate_dsm() is not
    and gives us the error. It seems the correct fix is to not do anything in
    acpi_pci_add_bus() in such cases.
    
    Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Cc: Sinan Kaya <okaya@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  14. ext4: fix false negatives *and* false positives in ext4_check_descrip…

    tytso authored and gregkh committed Jul 8, 2018
    …tors()
    
    commit 44de022 upstream.
    
    ext4_check_descriptors() was getting called before s_gdb_count was
    initialized.  So for file systems w/o the meta_bg feature, allocation
    bitmaps could overlap the block group descriptors and ext4 wouldn't
    notice.
    
    For file systems with the meta_bg feature enabled, there was a
    fencepost error which would cause the ext4_check_descriptors() to
    incorrectly believe that the block allocation bitmap overlaps with the
    block group descriptor blocks, and it would reject the mount.
    
    Fix both of these problems.
    
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Cc: stable@vger.kernel.org
    Signed-off-by: Benjamin Gilbert <bgilbert@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  15. netlink: Don't shift on 64 for ngroups

    0x7f454c46 authored and gregkh committed Aug 5, 2018
    commit 91874ec upstream.
    
    It's legal to have 64 groups for netlink_sock.
    
    As user-supplied nladdr->nl_groups is __u32, it's possible to subscribe
    only to first 32 groups.
    
    The correctness of the userspace-supplied .bind() parameter is checked
    by applying a mask made from an ngroups shift. This broke Android, which
    has 64 groups, because the shift for the mask resulted in an overflow.
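
    A hedged sketch of a mask computation that avoids the problem. It uses a
    fixed 64-bit type so the behaviour does not depend on the width of long;
    clamp_groups() is a hypothetical helper, not the kernel function.

```c
#include <stdint.h>

/*
 * Keep only the group bits below ngroups. Shifting a 64-bit value by 64 is
 * undefined behaviour in C, so the mask is only built when ngroups < 64;
 * with 64 groups every bit is already valid and nothing is masked.
 */
static uint64_t clamp_groups(uint64_t groups, unsigned int ngroups)
{
    if (ngroups == 0)
        return 0;
    if (ngroups < 64)
        groups &= (UINT64_C(1) << ngroups) - 1;
    return groups;
}
```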
    
    Fixes: 61f4b23 ("netlink: Don't shift with UB on nlk->ngroups")
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Herbert Xu <herbert@gondor.apana.org.au>
    Cc: Steffen Klassert <steffen.klassert@secunet.com>
    Cc: netdev@vger.kernel.org
    Cc: stable@vger.kernel.org
    Reported-and-Tested-by: Nathan Chancellor <natechancellor@gmail.com>
    Signed-off-by: Dmitry Safonov <dima@arista.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  16. nohz: Fix missing tick reprogram when interrupting an inline softirq

    Frederic Weisbecker authored and gregkh committed Aug 3, 2018
    commit 0a0e082 upstream.
    
    The full nohz tick is reprogrammed in irq_exit() only if the exit is not in
    a nesting interrupt. This stands as an optimization: whether a hardirq or a
    softirq is interrupted, the tick is going to be reprogrammed when necessary
    at the end of the inner interrupt, with even potential new updates on the
    timer queue.
    
    When soft interrupts are interrupted, it's assumed that they are executing
    on the tail of an interrupt return. In that case tick_nohz_irq_exit() is
    called after softirq processing to take care of the tick reprogramming.
    
    But the assumption is wrong: softirqs can be processed inline as well, ie:
    outside of an interrupt, like in a call to local_bh_enable() or from
    ksoftirqd.
    
    Inline softirqs don't reprogram the tick once they are done, as opposed to
    interrupt tail softirq processing. So if a tick interrupts an inline
    softirq processing, the next timer will neither be reprogrammed from the
    interrupting tick's irq_exit() nor after the interrupted softirq
    processing. This situation may leave the tick unprogrammed while timers are
    armed.
    
    To fix this, simply keep reprogramming the tick even if a softirq has been
    interrupted. That can be optimized further, but for now correctness is more
    important.
    
    Note that new timers enqueued in nohz_full mode after a softirq gets
    interrupted will still be handled just fine through self-IPIs triggered by
    the timer code.
    
    Reported-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Tested-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
    Cc: stable@vger.kernel.org # 4.14+
    Link: https://lkml.kernel.org/r/1533303094-15855-1-git-send-email-frederic@kernel.org
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  17. nohz: Fix local_timer_softirq_pending()

    anna-marialx authored and gregkh committed Jul 31, 2018
    commit 80d20d3 upstream.
    
    local_timer_softirq_pending() checks whether the timer softirq is
    pending with: local_softirq_pending() & TIMER_SOFTIRQ.
    
    This is wrong because TIMER_SOFTIRQ is the softirq number and not a
    bitmask. So the test checks for the wrong bit.
    
    Use BIT(TIMER_SOFTIRQ) instead.
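
    The difference between the two tests can be shown in plain C. The enum
    values and the BIT() macro below mirror the kernel definitions as an
    assumption; the two helper functions are hypothetical and take the
    pending mask as a parameter instead of reading per-CPU state.

```c
#include <stdbool.h>

#define BIT(nr) (1UL << (nr))

/* Softirq numbers (indices, not masks); only the first few are listed. */
enum {
    HI_SOFTIRQ = 0,
    TIMER_SOFTIRQ = 1,
    NET_TX_SOFTIRQ = 2,
};

/* Buggy: ANDs with the number 1, which is really the HI_SOFTIRQ bit. */
static bool timer_pending_buggy(unsigned long pending)
{
    return (pending & TIMER_SOFTIRQ) != 0;
}

/* Fixed: converts the softirq number to its bit before testing. */
static bool timer_pending_fixed(unsigned long pending)
{
    return (pending & BIT(TIMER_SOFTIRQ)) != 0;
}
```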
    
    Fixes: 5d62c18 ("nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick()")
    Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
    Acked-by: Frederic Weisbecker <frederic@kernel.org>
    Cc: bigeasy@linutronix.de
    Cc: peterz@infradead.org
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20180731161358.29472-1-anna-maria@linutronix.de
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  18. genirq: Make force irq threading setup more robust

    Thomas Gleixner authored and gregkh committed Aug 3, 2018
    commit d1f0301 upstream.
    
    The support of force threading interrupts which are set up with both a
    primary and a threaded handler broke the setup of regular requested
    threaded interrupts (primary handler == NULL).
    
    The reason is that it does not check whether the primary handler is set to
    the default handler which wakes the handler thread. Instead it replaces the
    thread handler with the primary handler as it would do with force threaded
    interrupts which have been requested via request_irq(). So both the primary
    and the thread handler become the same, which then triggers the WARN_ON
    that the thread handler tries to wake up an unconfigured secondary thread.
    
    Fortunately this only happens when the driver omits the IRQF_ONESHOT flag
    when requesting the threaded interrupt, which is normally caught by the
    sanity checks when force irq threading is disabled.
    
    Fix it by skipping the force threading setup when a regular threaded
    interrupt is requested. As a consequence the interrupt request which lacks
    the IRQF_ONESHOT flag is rejected correctly instead of silently breaking
    it.
    
    Fixes: 2a1d3ab ("genirq: Handle force threading of irqs with primary and thread handler")
    Reported-by: Kurt Kanzenbach <kurt.kanzenbach@linutronix.de>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Tested-by: Kurt Kanzenbach <kurt.kanzenbach@linutronix.de>
    Cc: stable@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  19. scsi: qla2xxx: Return error when TMF returns

    Anil Gurumurthy authored and gregkh committed Jul 18, 2018
    commit b4146c4 upstream.
    
    Propagate the task management completion status properly to avoid
    unnecessary waits for commands to complete.
    
    Fixes: faef62d ("[SCSI] qla2xxx: Fix Task Management command asynchronous handling")
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Anil Gurumurthy <anil.gurumurthy@cavium.com>
    Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  20. scsi: qla2xxx: Fix ISP recovery on unload

    Quinn Tran authored and gregkh committed Jul 18, 2018
    commit b08abbd upstream.
    
    During the unload process, the chip can encounter a problem where a FW
    dump would be captured. For this case, the full reset sequence, which
    would bring the chip back to full operational state, will be skipped.
    
    Fixes: e315cd2 ("[SCSI] qla2xxx: Code changes for qla data structure refactoring")
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
    Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  21. scsi: qla2xxx: Fix NPIV deletion by calling wait_for_sess_deletion

    Quinn Tran authored and gregkh committed Jul 18, 2018
    commit efa93f4 upstream.
    
    Add wait for session deletion to finish before freeing an NPIV scsi host.
    
    Fixes: 726b854 ("qla2xxx: Add framework for async fabric discovery")
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
    Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  22. scsi: qla2xxx: Fix unintialized List head crash

    Quinn Tran authored and gregkh committed Jul 18, 2018
    commit e3dde08 upstream.
    
    When the IOCB queue is full, or when system memory is low and the driver
    receives a large RSCN storm, a stale sp pointer can stay on gpnid_list,
    resulting in a page_fault.
    
    This patch fixes this issue by initializing the sp->elem list head and
    removing sp->elem before memory is freed.
    
    The following stack trace is seen:
    
     9 [ffff987b37d1bc60] page_fault at ffffffffad516768 [exception RIP: qla24xx_async_gpnid+496]
    10 [ffff987b37d1bd10] qla24xx_async_gpnid at ffffffffc039866d [qla2xxx]
    11 [ffff987b37d1bd80] qla2x00_do_work at ffffffffc036169c [qla2xxx]
    12 [ffff987b37d1be38] qla2x00_do_dpc_all_vps at ffffffffc03adfed [qla2xxx]
    13 [ffff987b37d1be78] qla2x00_do_dpc at ffffffffc036458a [qla2xxx]
    14 [ffff987b37d1bec8] kthread at ffffffffacebae31
    
    Fixes: 2d73ac6 ("scsi: qla2xxx: Serialize GPNID for multiple RSCN")
    Cc: <stable@vger.kernel.org> # v4.17+
    Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
    Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commits on Aug 6, 2018
  1. Linux 4.14.61

    gregkh committed Aug 6, 2018
  2. scsi: sg: fix minor memory leak in error path

    abattersby authored and gregkh committed Jul 12, 2018
    commit c170e5a upstream.
    
    Fix a minor memory leak when there is an error opening a /dev/sg device.
    
    Fixes: cc833ac ("sg: O_EXCL and other lock handling")
    Cc: <stable@vger.kernel.org>
    Reviewed-by: Ewan D. Milne <emilne@redhat.com>
    Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
    Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. drm/vc4: Reset ->{x, y}_scaling[1] when dealing with uniplanar formats

    Boris Brezillon authored and gregkh committed Jul 24, 2018
    commit a6a0091 upstream.
    
    This is needed to ensure ->is_unity is correct when the plane was
    previously configured to output a multi-planar format with scaling
    enabled, and is then being reconfigured to output a uniplanar format.
    
    Fixes: fc04023 ("drm/vc4: Add support for YUV planes.")
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
    Reviewed-by: Eric Anholt <eric@anholt.net>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180724133601.32114-1-boris.brezillon@bootlin.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>