
Commits on Jun 16, 2021

  1. netfilter: nft_osf: check for TCP packet before further processing

    The osf expression only supports TCP packets. Add an upfront sanity
    check to skip packet parsing if this is not a TCP packet.
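
    A minimal userspace sketch of such an upfront check, not the kernel
    code itself. struct pkt_info_sketch and osf_should_parse() are
    hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <netinet/in.h>   /* IPPROTO_TCP, IPPROTO_UDP */

/* Hypothetical stand-in for the per-packet metadata an expression sees. */
struct pkt_info_sketch {
	uint8_t l4proto;   /* transport protocol number */
};

/* Return true only for TCP, so callers can bail out before any parsing. */
static bool osf_should_parse(const struct pkt_info_sketch *pkt)
{
	assert(pkt != NULL);
	return pkt->l4proto == IPPROTO_TCP;
}
```

    A caller would return early, without touching the transport header,
    whenever osf_should_parse() is false.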
    
    Fixes: b96af92 ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf")
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    ummakynes authored and intel-lab-lkp committed Jun 16, 2021
  2. netfilter: nft_exthdr: check for IPv6 packet before further processing

    ipv6_find_hdr() does not validate that this is an IPv6 packet. Add a
    sanity check before calling ipv6_find_hdr() to make sure an IPv6 packet
    is passed for parsing.
    
    Fixes: 9651851 ("netfilter: add nftables")
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    ummakynes authored and intel-lab-lkp committed Jun 16, 2021

Commits on Jun 9, 2021

  1. netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-local

    The ip6tables rpfilter match has an extra check to skip packets with
    "::" source address.
    
    Extend this to the ipv6 fib expression. Otherwise ipv6 duplicate address
    detection packets will fail the rpf route check -- lookup returns -ENETUNREACH.
    
    While at it, extend the prerouting check to also cover the ingress hook.
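
    The address test described above can be approximated in userspace as
    follows. This is a sketch under stated assumptions: it uses the POSIX
    IN6_IS_ADDR_* macros rather than the kernel's own helpers, it treats
    both unicast and multicast link-local destinations as "link-local",
    and fib6_skip_rpf_sketch() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <netinet/in.h>   /* struct in6_addr, IN6_IS_ADDR_* macros */

/*
 * Return true when a packet goes from the unspecified address ("::")
 * to a link-local destination, which is the shape of a DAD probe.
 */
static bool fib6_skip_rpf_sketch(const struct in6_addr *saddr,
				 const struct in6_addr *daddr)
{
	assert(saddr != NULL && daddr != NULL);

	if (!IN6_IS_ADDR_UNSPECIFIED(saddr))
		return false;

	return IN6_IS_ADDR_LINKLOCAL(daddr) || IN6_IS_ADDR_MC_LINKLOCAL(daddr);
}
```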
    
    Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1543
    Fixes: f6d0cbc ("netfilter: nf_tables: add fib expression")
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    Florian Westphal authored and ummakynes committed Jun 9, 2021
  2. selftests: netfilter: add fib test case

    There is a bug report on netfilter.org bugzilla pointing to fib
    expression dropping ipv6 DAD packets.
    
    Add a test case that demonstrates this problem.
    
    Next patch excludes icmpv6 packets coming from any to linklocal.
    
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    Florian Westphal authored and ummakynes committed Jun 9, 2021
  3. netfilter: nf_tables: initialize set before expression setup

    nft_set_elem_expr_alloc() needs an initialized set if the expression sets
    the NFT_EXPR_GC flag. Move the initialization of the set fields before
    expression setup.
    
    [4512935.019450] ==================================================================
    [4512935.019456] BUG: KASAN: null-ptr-deref in nft_set_elem_expr_alloc+0x84/0xd0 [nf_tables]
    [4512935.019487] Read of size 8 at addr 0000000000000070 by task nft/23532
    [4512935.019494] CPU: 1 PID: 23532 Comm: nft Not tainted 5.12.0-rc4+ #48
    [...]
    [4512935.019502] Call Trace:
    [4512935.019505]  dump_stack+0x89/0xb4
    [4512935.019512]  ? nft_set_elem_expr_alloc+0x84/0xd0 [nf_tables]
    [4512935.019536]  ? nft_set_elem_expr_alloc+0x84/0xd0 [nf_tables]
    [4512935.019560]  kasan_report.cold.12+0x5f/0xd8
    [4512935.019566]  ? nft_set_elem_expr_alloc+0x84/0xd0 [nf_tables]
    [4512935.019590]  nft_set_elem_expr_alloc+0x84/0xd0 [nf_tables]
    [4512935.019615]  nf_tables_newset+0xc7f/0x1460 [nf_tables]
    
    Reported-by: syzbot+ce96ca2b1d0b37c6422d@syzkaller.appspotmail.com
    Fixes: 6503842 ("netfilter: nf_tables: allow to specify stateful expression in set definition")
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    ummakynes committed Jun 9, 2021
  4. net: lantiq: disable interrupt before scheduling NAPI

    This patch fixes TX hangs with threaded NAPI enabled. The scheduled
    NAPI seems to be executed in parallel with the interrupt handler on a
    second thread, so ltq_dma_disable_irq() is sometimes executed after
    xrx200_tx_housekeeping(). The symptom is that TX interrupts are
    disabled in the DMA controller. As a result, TX hangs after a few
    seconds of an iperf test. Scheduling NAPI after disabling interrupts
    fixes this issue.
    
    Tested on Lantiq xRX200 (BT Home Hub 5A).
    
    Fixes: 9423361 ("net: lantiq: Disable IRQs only if NAPI gets scheduled ")
    Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
    Acked-by: Hauke Mehrtens <hauke@hauke-m.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    abajk authored and davem330 committed Jun 9, 2021

Commits on Jun 8, 2021

  1. net: ena: fix DMA mapping function issues in XDP

    This patch fixes several bugs found when (DMA/LLQ) mapping a packet for
    transmission. The mapping procedure makes the transmitted packet
    accessible by the device.
    When using LLQ, this requires copying the packet's header to push header
    (which would be passed to LLQ) and creating DMA mapping for the payload
    (if the packet doesn't fit the maximum push length).
    When not using LLQ, we map the whole packet with DMA.
    
    The following bugs are fixed in the code:
        1. Add support for non-LLQ machines:
           The ena_xdp_tx_map_frame() function assumed that LLQ is
           supported, and never mapped the whole packet using DMA. On some
           instances, which don't support LLQ, this causes loss of traffic.
    
        2. Wrong DMA buffer length passed to device:
           When using LLQ, the first 'tx_max_header_size' bytes of the
           packet would be copied to push header. The rest of the packet
           would be copied to a DMA'd buffer.
    
        3. Freeing the XDP buffer twice in case of a mapping error:
           In case a buffer DMA mapping fails, the function uses
           xdp_return_frame_rx_napi() to free the RX buffer and returns from
           the function with an error. XDP frames that fail to xmit get
           freed by the kernel and so there is no need for this call.
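
    The split between push header and DMA-mapped payload described above
    can be sketched as plain length arithmetic. This is an illustration,
    not the driver code: struct tx_split_sketch, split_frame() and the
    max_push_len parameter are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* How many bytes go where for one transmitted frame. */
struct tx_split_sketch {
	size_t push_len;   /* bytes copied into the push header (LLQ only) */
	size_t dma_len;    /* bytes that still need a DMA mapping */
};

static struct tx_split_sketch split_frame(size_t frame_len,
					  size_t max_push_len,
					  int llq_enabled)
{
	/* Without LLQ the whole frame is DMA-mapped. */
	struct tx_split_sketch s = { 0, frame_len };

	if (llq_enabled) {
		/* With LLQ, copy at most max_push_len bytes, map the rest. */
		s.push_len = frame_len < max_push_len ? frame_len : max_push_len;
		s.dma_len = frame_len - s.push_len;
	}

	/* Every byte is accounted for exactly once. */
	assert(s.push_len + s.dma_len == frame_len);
	return s;
}
```

    Note that a frame shorter than the maximum push length needs no DMA
    mapping at all in the LLQ case, since dma_len comes out as zero.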
    
    Fixes: 548c494 ("net: ena: Implement XDP_TX action")
    Signed-off-by: Shay Agroskin <shayagr@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    ShayAgros authored and davem330 committed Jun 8, 2021
  2. net: dsa: felix: re-enable TX flow control in ocelot_port_flush()

    Because flow control is set up statically in ocelot_init_port(), and not
    in phylink_mac_link_up(), what happens is that after the blamed commit,
    the flow control remains disabled after the port flushing procedure.
    
    Fixes: eb4733d ("net: dsa: felix: implement port flushing on .phylink_mac_link_down")
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    vladimiroltean authored and davem330 committed Jun 8, 2021
  3. net: rds: fix memory leak in rds_recvmsg

    Syzbot reported a memory leak in rds. The problem
    was a refcount that was never dropped on the error path.
    
    int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
    		int msg_flags)
    {
    ...
    
    	if (!rds_next_incoming(rs, &inc)) {
    		...
    	}
    
    After this "if", the refcount of inc has been incremented, and
    
    	if (rds_cmsg_recv(inc, msg, rs)) {
    		ret = -EFAULT;
    		goto out;
    	}
    ...
    out:
    	return ret;
    }
    
    if rds_cmsg_recv() fails, the refcount is never decremented.
    The ftrace log makes this easy to see: rds_inc_addref() has
    no matching rds_inc_put() in rds_recvmsg() after
    rds_cmsg_recv() returns:
    
     1)               |  rds_recvmsg() {
     1)   3.721 us    |    rds_inc_addref();
     1)   3.853 us    |    rds_message_inc_copy_to_user();
     1) + 10.395 us   |    rds_cmsg_recv();
     1) + 34.260 us   |  }
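
    The general shape of the remedy, an error path that still drops the
    reference it took, can be sketched in isolation. This is not the rds
    code: struct inc_sketch, inc_addref(), inc_put() and recv_sketch()
    are hypothetical names, and copy_fails simulates the failing call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for an incoming message. */
struct inc_sketch {
	int refcount;
};

static void inc_addref(struct inc_sketch *inc)
{
	inc->refcount++;
}

static void inc_put(struct inc_sketch *inc)
{
	assert(inc->refcount > 0);
	inc->refcount--;
}

/* copy_fails simulates a failure while copying control messages. */
static int recv_sketch(struct inc_sketch *inc, int copy_fails)
{
	int ret = 0;

	assert(inc != NULL);
	inc_addref(inc);              /* reference held for this call */

	if (copy_fails) {
		ret = -EFAULT;
		goto out_put;         /* the error path goes through the put */
	}

	/* ... deliver the message to the caller ... */

out_put:
	inc_put(inc);                 /* balanced on success and on error */
	return ret;
}
```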
    
    Fixes: bdbe6fb ("RDS: recv.c")
    Reported-and-tested-by: syzbot+5134cdf021c4ed5aaa5f@syzkaller.appspotmail.com
    Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
    Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
    Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    pskrgag authored and davem330 committed Jun 8, 2021
  4. Merge tag 'batadv-net-pullrequest-20210608' of git://git.open-mesh.org/linux-merge
    
    Simon Wunderlich says:
    
    ====================
    Here is a batman-adv bugfix:
    
     - Avoid WARN_ON timing related checks, by Sven Eckelmann
    ====================
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    davem330 committed Jun 8, 2021
  5. vrf: fix maximum MTU

    My initial goal was to fix the default MTU, which is set to 65536, i.e. above
    the maximum defined in the driver: 65535 (ETH_MAX_MTU).
    
    In fact, it seems more consistent, wrt min_mtu, to set the max_mtu to
    IP6_MAX_MTU (65535 + sizeof(struct ipv6hdr)) and use it by default.
    
    Let's also, for consistency, set the mtu in vrf_setup(). This function
    calls ether_setup(), which sets the mtu to 1500. Thus, the whole mtu config
    is done in the same function.
    
    Before the patch:
    $ ip link add blue type vrf table 1234
    $ ip link list blue
    9: blue: <NOARP,MASTER> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
        link/ether fa:f5:27:70:24:2a brd ff:ff:ff:ff:ff:ff
    $ ip link set dev blue mtu 65535
    $ ip link set dev blue mtu 65536
    Error: mtu greater than device maximum.
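
    The arithmetic behind the new maximum can be checked in userspace.
    This sketch assumes that struct ip6_hdr from <netinet/ip6.h> has the
    same size as the kernel's struct ipv6hdr (the fixed 40-byte IPv6
    header). The SKETCH_* macros and mtu_is_valid() are hypothetical
    names.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/ip6.h>   /* struct ip6_hdr, userspace view of the IPv6 header */

/* The old limit named in the commit message. */
#define SKETCH_ETH_MAX_MTU 65535u
/* 65535 + sizeof(IPv6 header), as described in the commit message. */
#define SKETCH_IP6_MAX_MTU (65535u + (unsigned int)sizeof(struct ip6_hdr))

/* Range check of the kind a device maximum implies. */
static int mtu_is_valid(unsigned int mtu, unsigned int max_mtu)
{
	assert(max_mtu > 0);
	return mtu <= max_mtu;
}
```

    With a 40-byte header the new maximum is 65575, so the former default
    of 65536 is rejected under the old limit and accepted under the new
    one.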
    
    Fixes: 5055376 ("net: vrf: Fix ping failed when vrf mtu is set to 0")
    CC: Miaohe Lin <linmiaohe@huawei.com>
    Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
    Reviewed-by: David Ahern <dsahern@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    NicolasDichtel authored and davem330 committed Jun 8, 2021
  6. net: appletalk: fix the usage of preposition

    The preposition "for" should be changed to preposition "of".
    
    Signed-off-by: gushengxian <gushengxian@yulong.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    gushengxian authored and davem330 committed Jun 8, 2021
  7. net: ipv4: Remove unneeded BUG() function

    When nla_parse_nested_deprecated() fails, there is no need to
    BUG() here; returning -EINVAL is sufficient.
    
    Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Zheng Yongjun authored and davem330 committed Jun 8, 2021
  8. net: ipv4: fix memory leak in netlbl_cipsov4_add_std

    Reported by syzkaller:
    BUG: memory leak
    unreferenced object 0xffff888105df7000 (size 64):
    comm "syz-executor842", pid 360, jiffies 4294824824 (age 22.546s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [<00000000e67ed558>] kmalloc include/linux/slab.h:590 [inline]
    [<00000000e67ed558>] kzalloc include/linux/slab.h:720 [inline]
    [<00000000e67ed558>] netlbl_cipsov4_add_std net/netlabel/netlabel_cipso_v4.c:145 [inline]
    [<00000000e67ed558>] netlbl_cipsov4_add+0x390/0x2340 net/netlabel/netlabel_cipso_v4.c:416
    [<0000000006040154>] genl_family_rcv_msg_doit.isra.0+0x20e/0x320 net/netlink/genetlink.c:739
    [<00000000204d7a1c>] genl_family_rcv_msg net/netlink/genetlink.c:783 [inline]
    [<00000000204d7a1c>] genl_rcv_msg+0x2bf/0x4f0 net/netlink/genetlink.c:800
    [<00000000c0d6a995>] netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2504
    [<00000000d78b9d2c>] genl_rcv+0x24/0x40 net/netlink/genetlink.c:811
    [<000000009733081b>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
    [<000000009733081b>] netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1340
    [<00000000d5fd43b8>] netlink_sendmsg+0x789/0xc70 net/netlink/af_netlink.c:1929
    [<000000000a2d1e40>] sock_sendmsg_nosec net/socket.c:654 [inline]
    [<000000000a2d1e40>] sock_sendmsg+0x139/0x170 net/socket.c:674
    [<00000000321d1969>] ____sys_sendmsg+0x658/0x7d0 net/socket.c:2350
    [<00000000964e16bc>] ___sys_sendmsg+0xf8/0x170 net/socket.c:2404
    [<000000001615e288>] __sys_sendmsg+0xd3/0x190 net/socket.c:2433
    [<000000004ee8b6a5>] do_syscall_64+0x37/0x90 arch/x86/entry/common.c:47
    [<00000000171c7cee>] entry_SYSCALL_64_after_hwframe+0x44/0xae
    
    The memory that doi_def->map.std points to is allocated in
    netlbl_cipsov4_add_std, but it is never freed. It should be
    freed in cipso_v4_doi_free, which frees the cipso DOI resources.
    
    Fixes: 96cb8e3 ("[NetLabel]: CIPSOv4 and Unlabeled packet integration")
    Reported-by: Hulk Robot <hulkci@huawei.com>
    Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
    Acked-by: Paul Moore <paul@paul-moore.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    sunnanyong authored and davem330 committed Jun 8, 2021

Commits on Jun 7, 2021

  1. neighbour: allow NUD_NOARP entries to be forced GCed

    IFF_POINTOPOINT interfaces use NUD_NOARP entries for IPv6. It's possible to
    fill up the neighbour table with enough entries that it will overflow for
    valid connections after that.
    
    This behaviour is more prevalent after commit 5895631 ("neighbor:
    Improve garbage collection") is applied, as it prevents removal of
    entries that are not NUD_FAILED, unless they are more than 5s old.
    
    Fixes: 5895631 (neighbor: Improve garbage collection)
    Reported-by: Kasper Dupont <kasperd@gjkwv.06.feb.2021.kasperd.net>
    Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
    Signed-off-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    David Ahern authored and davem330 committed Jun 7, 2021
  2. revert "net: kcm: fix memory leak in kcm_sendmsg"

    In commit c47cc30 ("net: kcm: fix memory leak in kcm_sendmsg")
    I misunderstood the root cause of the memory leak and came up with
    a completely broken fix.
    
    So, simply revert this commit to avoid the GPF reported by
    syzbot.
    
    I'm so sorry for this situation.
    
    Fixes: c47cc30 ("net: kcm: fix memory leak in kcm_sendmsg")
    Reported-by: syzbot+65badd5e74ec62cb67dc@syzkaller.appspotmail.com
    Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    pskrgag authored and davem330 committed Jun 7, 2021
  3. Merge branch 'mlxsw-fixes'

    Ido Schimmel says:
    
    ====================
    mlxsw: Thermal and qdisc fixes
    
    Patches #1-#2 fix wrong validation of burst size in qdisc code and a
    user triggerable WARN_ON().
    
    Patch #3 fixes a regression in thermal monitoring of transceiver modules
    and gearboxes.
    ====================
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    davem330 committed Jun 7, 2021
  4. mlxsw: core: Set thermal zone polling delay argument to real value at init
    
    Thermal polling delay argument for modules and gearboxes thermal zones
    used to be initialized with zero value, while actual delay was used to
    be set by mlxsw_thermal_set_mode() by thermal operation callback
    set_mode(). After operations set_mode()/get_mode() have been removed by
    cited commits, modules and gearboxes thermal zones always have polling
    time set to zero and do not perform temperature monitoring.
    
    Set non-zero "polling_delay" in thermal_zone_device_register() routine,
    thus, the relevant thermal zones will perform thermal monitoring.
    
    Cc: Andrzej Pietrasiewicz <andrzej.p@collabora.com>
    Fixes: 5d7bd8a ("thermal: Simplify or eliminate unnecessary set_mode() methods")
    Fixes: 1ee1482 ("thermal: remove get_mode() operation of drivers")
    Signed-off-by: Mykola Kostenok <c_mykolak@nvidia.com>
    Acked-by: Vadim Pasternak <vadimp@nvidia.com>
    Reviewed-by: Jiri Pirko <jiri@nvidia.com>
    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    odmyko authored and davem330 committed Jun 7, 2021
  5. mlxsw: spectrum_qdisc: Pass handle, not band number to find_class()

    In mlxsw Qdisc offload, find_class() is an operation that yields a qdisc
    offload descriptor given a parental qdisc descriptor and a class handle. In
    __mlxsw_sp_qdisc_ets_graft() however, a band number is passed to that
    function instead of a handle. This can trigger a WARN_ON
    with the following splat:
    
     WARNING: CPU: 3 PID: 808 at drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c:1356 __mlxsw_sp_qdisc_ets_graft+0x115/0x130 [mlxsw_spectrum]
     [...]
     Call Trace:
      mlxsw_sp_setup_tc_prio+0xe3/0x100 [mlxsw_spectrum]
      qdisc_offload_graft_helper+0x35/0xa0
      prio_graft+0x176/0x290 [sch_prio]
      qdisc_graft+0xb3/0x540
      tc_modify_qdisc+0x56a/0x8a0
      rtnetlink_rcv_msg+0x12c/0x370
      netlink_rcv_skb+0x49/0xf0
      netlink_unicast+0x1f6/0x2b0
      netlink_sendmsg+0x1fb/0x410
      ____sys_sendmsg+0x1f3/0x220
      ___sys_sendmsg+0x70/0xb0
      __sys_sendmsg+0x54/0xa0
      do_syscall_64+0x3a/0x70
      entry_SYSCALL_64_after_hwframe+0x44/0xae
    
    Since the parent handle is not passed with the offload information, compute
    it from the band number and qdisc handle.
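
    A sketch of how a class handle can be derived from a qdisc handle and
    a band number. It assumes that class minors are one-based (band 0
    maps to minor 1) and that the major number lives in the upper 16 bits
    of the handle, mirroring the usual tc handle layout. The SKETCH_*
    masks and class_handle_from_band() are hypothetical names, not the
    driver's.

```c
#include <assert.h>
#include <stdint.h>

/* tc handles are major:minor, 16 bits each (assumed layout). */
#define SKETCH_TC_H_MAJ_MASK 0xFFFF0000u
#define SKETCH_TC_H_MIN_MASK 0x0000FFFFu

/* Combine the qdisc's major number with a one-based class minor. */
static uint32_t class_handle_from_band(uint32_t qdisc_handle, unsigned int band)
{
	/* band + 1 must still fit into the 16-bit minor. */
	assert(band < SKETCH_TC_H_MIN_MASK);

	return (qdisc_handle & SKETCH_TC_H_MAJ_MASK) |
	       ((band + 1) & SKETCH_TC_H_MIN_MASK);
}
```

    For a qdisc with handle 1: (0x00010000), band 0 yields class 1:1 and
    band 2 yields class 1:3.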
    
    Fixes: 28052e618b04 ("mlxsw: spectrum_qdisc: Track children per qdisc")
    Reported-by: Maksym Yaremchuk <maksymy@nvidia.com>
    Signed-off-by: Petr Machata <petrm@nvidia.com>
    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    pmachata authored and davem330 committed Jun 7, 2021
  6. mlxsw: reg: Spectrum-3: Enforce lowest max-shaper burst size of 11

    A max-shaper is the HW component responsible for delaying egress traffic
    above a configured transmission rate. Burst size is the amount of traffic
    that is allowed to pass without accounting. The burst size value needs to
    be such that it can be expressed as 2^BS * 512 bits, where BS lies in a
    certain ASIC-dependent range. mlxsw enforces that this holds before
    attempting to configure the shaper.
    
    The assumption for Spectrum-3 was that the lower limit of BS would be 5,
    like for Spectrum-1. But as of now, the limit is still 11. Therefore fix
    the driver accordingly, so that incorrect values are rejected early with a
    proper message.
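
    The constraint on the burst size can be sketched as a small
    validator. This is an illustration only: burst_size_to_bs() is a
    hypothetical name, and the upper bound is a caller-supplied parameter
    because the commit message only states the lower limits (5 and 11).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Succeeds, and stores BS, only when burst_bits == 2^BS * 512 and
 * lowest_bs <= BS <= highest_bs.
 */
static bool burst_size_to_bs(uint64_t burst_bits, unsigned int lowest_bs,
			     unsigned int highest_bs, unsigned int *bs_out)
{
	uint64_t units;
	unsigned int bs = 0;

	assert(bs_out != NULL);

	if (burst_bits == 0 || burst_bits % 512 != 0)
		return false;           /* not a multiple of 512 bits */

	units = burst_bits / 512;
	if (units & (units - 1))
		return false;           /* not a power of two */

	while (units > 1) {             /* integer log2 */
		units >>= 1;
		bs++;
	}

	if (bs < lowest_bs || bs > highest_bs)
		return false;           /* outside the ASIC's range */

	*bs_out = bs;
	return true;
}
```

    With a lower limit of 11, the smallest accepted burst is
    2^11 * 512 = 1048576 bits. A burst of 2^5 * 512 = 16384 bits passes a
    lower limit of 5 but is rejected with a lower limit of 11.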
    
    Fixes: 23effa2 ("mlxsw: reg: Add max_shaper_bs to QoS ETS Element Configuration")
    Reported-by: Maksym Yaremchuk <maksymy@nvidia.com>
    Signed-off-by: Petr Machata <petrm@nvidia.com>
    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    pmachata authored and davem330 committed Jun 7, 2021
  7. ethtool: Fix NULL pointer dereference during module EEPROM dump

    When get_module_eeprom_by_page() is not implemented by the driver, NULL
    pointer dereference can occur [1].
    
    Fix by testing if get_module_eeprom_by_page() is implemented instead of
    get_module_info().
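
    The principle, testing the callback that is about to be invoked
    rather than a different one, can be sketched as follows. struct
    module_ops_sketch, eeprom_by_page_sketch() and the fake_* callbacks
    are hypothetical names. The -EOPNOTSUPP return value is an assumption
    made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical driver ops table; either callback may be left NULL. */
struct module_ops_sketch {
	int (*get_module_info)(void);
	int (*get_module_eeprom_by_page)(int page);
};

static int eeprom_by_page_sketch(const struct module_ops_sketch *ops, int page)
{
	assert(ops != NULL);

	/* Test the callback that is actually called below. */
	if (!ops->get_module_eeprom_by_page)
		return -EOPNOTSUPP;

	return ops->get_module_eeprom_by_page(page);
}

/* Fake driver callbacks, used only to exercise the sketch. */
static int fake_info(void)
{
	return 0;
}

static int fake_read_page(int page)
{
	return page;
}
```

    A driver that implements only get_module_info() now gets an error
    code back instead of a call through a NULL pointer.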
    
    [1]
     BUG: kernel NULL pointer dereference, address: 0000000000000000
     [...]
     CPU: 0 PID: 251 Comm: ethtool Not tainted 5.13.0-rc3-custom-00940-g3822d0670c9d #989
     Call Trace:
      eeprom_prepare_data+0x101/0x2d0
      ethnl_default_doit+0xc2/0x290
      genl_family_rcv_msg_doit+0xdc/0x140
      genl_rcv_msg+0xd7/0x1d0
      netlink_rcv_skb+0x49/0xf0
      genl_rcv+0x1f/0x30
      netlink_unicast+0x1f6/0x2c0
      netlink_sendmsg+0x1f9/0x400
      __sys_sendto+0xe1/0x130
      __x64_sys_sendto+0x1b/0x20
      do_syscall_64+0x3a/0x70
      entry_SYSCALL_64_after_hwframe+0x44/0xae
    
    Fixes: c97a31f ("ethtool: wire in generic SFP module access")
    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Acked-by: Moshe Shemesh <moshe@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    idosch authored and davem330 committed Jun 7, 2021

Commits on Jun 4, 2021

  1. cxgb4: avoid link re-train during TC-MQPRIO configuration

    When configuring TC-MQPRIO offload, only turn off netdev carrier and
    don't bring physical link down in hardware. Otherwise, when the
    physical link is brought up again after configuration, it gets
    re-trained and stalls ongoing traffic.
    
    Also, when firmware is no longer accessible or crashed, avoid sending
    FLOWC and waiting for reply that will never come.
    
    Fix following hung_task_timeout_secs trace seen in these cases.
    
    INFO: task tc:20807 blocked for more than 122 seconds.
          Tainted: G S                5.13.0-rc3+ #122
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    task:tc   state:D stack:14768 pid:20807 ppid: 19366 flags:0x00000000
    Call Trace:
     __schedule+0x27b/0x6a0
     schedule+0x37/0xa0
     schedule_preempt_disabled+0x5/0x10
     __mutex_lock.isra.14+0x2a0/0x4a0
     ? netlink_lookup+0x120/0x1a0
     ? rtnl_fill_ifinfo+0x10f0/0x10f0
     __netlink_dump_start+0x70/0x250
     rtnetlink_rcv_msg+0x28b/0x380
     ? rtnl_fill_ifinfo+0x10f0/0x10f0
     ? rtnl_calcit.isra.42+0x120/0x120
     netlink_rcv_skb+0x4b/0xf0
     netlink_unicast+0x1a0/0x280
     netlink_sendmsg+0x216/0x440
     sock_sendmsg+0x56/0x60
     __sys_sendto+0xe9/0x150
     ? handle_mm_fault+0x6d/0x1b0
     ? do_user_addr_fault+0x1c5/0x620
     __x64_sys_sendto+0x1f/0x30
     do_syscall_64+0x3c/0x80
     entry_SYSCALL_64_after_hwframe+0x44/0xae
    RIP: 0033:0x7f7f73218321
    RSP: 002b:00007ffd19626208 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
    RAX: ffffffffffffffda RBX: 000055b7c0a8b240 RCX: 00007f7f73218321
    RDX: 0000000000000028 RSI: 00007ffd19626210 RDI: 0000000000000003
    RBP: 000055b7c08680ff R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 000055b7c085f5f6
    R13: 000055b7c085f60a R14: 00007ffd19636470 R15: 00007ffd196262a0
    
    Fixes: b1396c2 ("cxgb4: parse and configure TC-MQPRIO offload")
    Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    chelsiocudbg authored and davem330 committed Jun 4, 2021
  2. sch_htb: fix refcount leak in htb_parent_to_leaf_offload

    The commit ae81feb ("sch_htb: fix null pointer dereference
    on a null new_q") fixes a NULL pointer dereference bug, but it
    is not correct.
    
    htb_graft_helper properly handles the case when new_q is NULL, so
    skipping this call, as the previous patch does, creates an
    inconsistency: dev_queue->qdisc will still point to the old qdisc,
    but cl->parent->leaf.q will point to the new one (which will be
    noop_qdisc, because new_q was NULL). The code is based on an
    assumption that these two pointers are the same, so it can lead to
    refcount leaks.
    
    The correct fix is to add a NULL pointer check to protect
    qdisc_refcount_inc inside htb_parent_to_leaf_offload.
    
    Fixes: ae81feb ("sch_htb: fix null pointer dereference on a null new_q")
    Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
    Suggested-by: Maxim Mikityanskiy <maximmi@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    wyjwang authored and davem330 committed Jun 4, 2021
  3. Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
    
    Tony Nguyen says:
    
    ====================
    Intel Wired LAN Driver Updates 2021-06-04
    
    This series contains updates to virtchnl header file and ice driver.
    
    Brett fixes VF being unable to request a different number of queues than
    allocated and adds clearing of VF_MBX_ATQLEN register for VF reset.
    
    Haiyue handles error of rebuilding VF VSI during reset.
    
    Paul fixes reporting of autoneg to use the PHY capabilities.
    
    Dave allows LLDP packets without priority of TC_PRIO_CONTROL to be
    transmitted.
    
    Geert Uytterhoeven adds explicit padding to virtchnl_proto_hdrs
    structure in the virtchnl header file.
    ====================
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    davem330 committed Jun 4, 2021
  4. Merge branch 'wireguard-fixes'

    Jason A. Donenfeld says:
    
    ====================
    wireguard fixes for 5.13-rc5
    
    Here are bug fixes to WireGuard for 5.13-rc5:
    
    1-2,6) These are small, trivial tweaks to our test harness.
    
    3) Linus thinks -O3 is still dangerous to enable. The code gen wasn't so
       much different with -O2 either.
    
    4) We were accidentally calling synchronize_rcu instead of
       synchronize_net while holding the rtnl_lock, resulting in some rather
       large stalls that hit production machines.
    
    5) Peer allocation was wasting literally hundreds of megabytes on real
       world deployments, due to oddly sized large objects not fitting
       nicely into a kmalloc slab.
    
    7-9) We move from an insanely expensive O(n) algorithm to a fast O(1)
         algorithm, and clean up a massive memory leak in the process, in
         which allowed ips churn would leave dangling nodes hanging around
         without cleanup until the interface was removed. The O(1) algorithm
         eliminates packet stalls and high latency issues, in addition to
         bringing operations that took as much as 10 minutes down to less
         than a second.
    ====================
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    davem330 committed Jun 4, 2021
  5. wireguard: allowedips: free empty intermediate nodes when removing single node
    
    When removing single nodes, it's possible that that node's parent is an
    empty intermediate node, in which case, it too should be removed.
    Otherwise the trie fills up and never is fully emptied, leading to
    gradual memory leaks over time for tries that are modified often. There
    was originally code to do this, but it was removed during refactoring in
    2016 and never reworked. Now that we have proper parent pointers from
    the previous commits, we can implement this properly.
    
    In order to reduce branching and expensive comparisons, we want to keep
    the double pointer for parent assignment (which lets us easily chain up
    to the root), but we still need to actually get the parent's base
    address. So encode the bit number into the last two bits of the pointer,
    and pack and unpack it as needed. This is a little bit clumsy but is the
    fastest and least memory-wasteful of the compromises. Note that we align
    the root struct here to a minimum of 4, because it's embedded into a
    larger struct, and we're relying on having the bottom two bits for our
    flag, which would only be 16-bit aligned on m68k.
    
    The existing macro-based helpers were a bit unwieldy for adding the bit
    packing to, so this commit replaces them with safer and clearer ordinary
    functions.
    
    We add a test to the randomized/fuzzer part of the selftests, to free
    the randomized tries by-peer, refuzz it, and repeat, until it's supposed
    to be empty, and then see if that actually resulted in the whole
    thing being emptied. That combined with kmemcheck should hopefully make
    sure this commit is doing what it should. Along the way this resulted in
    various other cleanups of the tests and fixes for recent graphviz.
    
    Fixes: e7096c1 ("net: WireGuard secure network tunnel")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    zx2c4 authored and davem330 committed Jun 4, 2021
  6. wireguard: allowedips: allocate nodes in kmem_cache

    The previous commit moved from O(n) to O(1) for removal, but in the
    process introduced an additional pointer member to a struct that
    increased the size from 60 to 68 bytes, putting nodes in the 128-byte
    slab. With deployed systems having as many as 2 million nodes, this
    represents a significant doubling in memory usage (128 MiB -> 256 MiB).
    Fix this by using our own kmem_cache, that's sized exactly right. This
    also makes wireguard's memory usage more transparent in tools like
    slabtop and /proc/slabinfo.
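
    The size effect can be reproduced with a simplified model. This
    sketch assumes pure power-of-two allocation buckets, as the commit
    message describes, and is not a model of the real slab allocator.
    round_up_pow2() and total_bytes() are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power-of-two bucket that can hold an object of this size. */
static size_t round_up_pow2(size_t size)
{
	size_t bucket = 1;

	assert(size > 0);
	while (bucket < size)
		bucket <<= 1;
	return bucket;
}

/* Memory consumed by count objects of per_object bytes each. */
static size_t total_bytes(size_t per_object, size_t count)
{
	return per_object * count;
}
```

    Under this model a 60-byte node occupies 64 bytes and a 68-byte node
    occupies 128 bytes, so 2^21 (about 2 million) nodes grow from 128 MiB
    to 256 MiB. Allocating exactly 68 bytes per node would need 136 MiB.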
    
    Fixes: e7096c1 ("net: WireGuard secure network tunnel")
    Suggested-by: Arnd Bergmann <arnd@arndb.de>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Cc: stable@vger.kernel.org
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    zx2c4 authored and davem330 committed Jun 4, 2021
  7. wireguard: allowedips: remove nodes in O(1)

    Previously, deleting peers would require traversing the entire trie in
    order to rebalance nodes and safely free them. This meant that removing
    1000 peers from a trie with a half million nodes would take an extremely
    long time, during which we're holding the rtnl lock. Large-scale users
    were reporting 200ms latencies added to the networking stack as a whole
    every time their userspace software would queue up significant removals.
    That's a serious situation.
    
    This commit fixes that by maintaining a double pointer to the parent's
    bit pointer for each node, and then using the already existing node list
    belonging to each peer to go directly to the node, fix up its pointers,
    and free it with RCU. This means removal is O(1) instead of O(n), and we
    don't use gobs of stack.
    
    The removal algorithm has the same downside as the code that it fixes:
    it won't collapse needlessly long runs of fillers.  We can enhance that
    in the future if it ever becomes a problem. This commit documents that
    limitation with a TODO comment in code, a small but meaningful
    improvement over the prior situation.
    
    Currently the biggest flaw, which the next commit addresses, is that
    this increases the node size on 64-bit machines from 60 bytes to
    68 bytes. 60 rounds up to 64, but 68 rounds up to 128. So we wind up
    using twice as much memory per node, because of power-of-two
    allocations, which is a big bummer. We'll need to figure something out
    there.
    
    Fixes: e7096c1 ("net: WireGuard secure network tunnel")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    zx2c4 authored and davem330 committed Jun 4, 2021
  8. wireguard: allowedips: initialize list head in selftest

    The randomized trie tests weren't initializing the dummy peer list head,
    resulting in a NULL pointer dereference when used. Fix this by
    initializing it in the randomized trie test, just like we do for the
    static unit test.
    
    While we're at it, all of the other strings like this have the word
    "self-test", so add it to the missing place here.
    
    Fixes: e7096c1 ("net: WireGuard secure network tunnel")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    zx2c4 authored and davem330 committed Jun 4, 2021
  9. wireguard: peer: allocate in kmem_cache

    With deployments having upwards of 600k peers now, this somewhat heavy
    structure could benefit from more fine-grained allocations.
    Specifically, instead of using a 2048-byte slab for a 1544-byte object,
    we can now use 1544-byte objects directly, thus saving almost 25%
    per-peer, or with 600k peers, that's a savings of 303 MiB. This also
    makes wireguard's memory usage more transparent in tools like slabtop
    and /proc/slabinfo.
    
    Fixes: 8b5553a ("wireguard: queueing: get rid of per-peer ring buffers")
    Suggested-by: Arnd Bergmann <arnd@arndb.de>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Cc: stable@vger.kernel.org
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    zx2c4 authored and davem330 committed Jun 4, 2021
  10. wireguard: use synchronize_net rather than synchronize_rcu

    Many of the synchronization points are sometimes called under the rtnl
    lock, which means we should use synchronize_net rather than
    synchronize_rcu. Under the hood, synchronize_net uses the expedited
    flavor of the RCU grace-period wait when rtnl is held, so as not to
    stall other concurrent changes.
    
    This fixes some very, very long delays when removing multiple peers at
    once, which would cause some operations to take several minutes.
    
    Fixes: e7096c1 ("net: WireGuard secure network tunnel")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    zx2c4 authored and davem330 committed Jun 4, 2021
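
The decision described in this commit can be modeled in a few lines of userspace C. This is only a sketch of the selection logic: `model_rtnl_held` and `model_synchronize_net` are hypothetical names invented here, and the real kernel function performs the actual RCU wait rather than returning a value.

```c
#include <stdbool.h>

/* Which kind of grace-period wait the model selects. */
enum grace_wait {
	WAIT_NORMAL,    /* ordinary wait, used when rtnl is not held */
	WAIT_EXPEDITED  /* faster wait, used while rtnl is held */
};

/* Hypothetical stand-in for "is the rtnl lock currently held?". */
static bool model_rtnl_held;

/* Pick the expedited wait under rtnl so other rtnl users are not stalled. */
static enum grace_wait model_synchronize_net(void)
{
	return model_rtnl_held ? WAIT_EXPEDITED : WAIT_NORMAL;
}
```

Because the choice is made at the call site from the lock state, callers that are only sometimes under rtnl get the cheaper normal wait the rest of the time.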
  11. wireguard: do not use -O3

    Apparently, various versions of gcc have O3-related miscompiles. Looking
    at the difference between -O2 and -O3 for gcc 11 doesn't indicate
    miscompiles, but the difference also doesn't seem so significant for
    performance that it's worth risking.
    
    Link: https://lore.kernel.org/lkml/CAHk-=wjuoGyxDhAF8SsrTkN0-YfCx7E6jUN3ikC_tn2AKWTTsA@mail.gmail.com/
    Link: https://lore.kernel.org/lkml/CAHmME9otB5Wwxp7H8bR_i2uH2esEMvoBMC8uEXBMH9p0q1s6Bw@mail.gmail.com/
    Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
    Fixes: e7096c1 ("net: WireGuard secure network tunnel")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    zx2c4 authored and davem330 committed Jun 4, 2021
  12. wireguard: selftests: make sure rp_filter is disabled on vethc

    Some distros may enable strict rp_filter by default, which will prevent
    vethc from receiving the packets with an unrouteable reverse path address.
    
    Reported-by: Hangbin Liu <liuhangbin@gmail.com>
    Fixes: e7096c1 ("net: WireGuard secure network tunnel")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    zx2c4 authored and davem330 committed Jun 4, 2021
  13. wireguard: selftests: remove old conntrack kconfig value

    On recent kernels, this config symbol is no longer used.
    
    Reported-by: Rui Salvaterra <rsalvaterra@gmail.com>
    Fixes: e7096c1 ("net: WireGuard secure network tunnel")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    zx2c4 authored and davem330 committed Jun 4, 2021
  14. virtchnl: Add missing padding to virtchnl_proto_hdrs

    On m68k (Coldfire M547x):
    
          CC      drivers/net/ethernet/intel/i40e/i40e_main.o
        In file included from drivers/net/ethernet/intel/i40e/i40e_prototype.h:9,
                         from drivers/net/ethernet/intel/i40e/i40e.h:41,
                         from drivers/net/ethernet/intel/i40e/i40e_main.c:12:
        include/linux/avf/virtchnl.h:153:36: warning: division by zero [-Wdiv-by-zero]
          153 |  { virtchnl_static_assert_##X = (n)/((sizeof(struct X) == (n)) ? 1 : 0) }
              |                                    ^
        include/linux/avf/virtchnl.h:844:1: note: in expansion of macro ‘VIRTCHNL_CHECK_STRUCT_LEN’
          844 | VIRTCHNL_CHECK_STRUCT_LEN(2312, virtchnl_proto_hdrs);
              | ^~~~~~~~~~~~~~~~~~~~~~~~~
        include/linux/avf/virtchnl.h:844:33: error: enumerator value for ‘virtchnl_static_assert_virtchnl_proto_hdrs’ is not an integer constant
          844 | VIRTCHNL_CHECK_STRUCT_LEN(2312, virtchnl_proto_hdrs);
              |                                 ^~~~~~~~~~~~~~~~~~~
    
    On m68k, integers are aligned on addresses that are multiples of two,
    not four, bytes.  Hence the size of a structure containing integers may
    not be divisible by 4.
    
    Fix this by adding explicit padding.
    
    Fixes: 1f7ea1c ("ice: Enable FDIR Configure for AVF")
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
    Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    geertu authored and anguy11 committed Jun 4, 2021
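
The size check and the padding fix in this commit can be sketched in standalone C. The macro below reuses the division-by-zero trick visible in the compiler output above, but `CHECK_STRUCT_LEN` and `struct demo_hdrs` are hypothetical names for this example, and the field layout is an assumption, not the real `virtchnl_proto_hdrs` definition.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Compile-time size check: if sizeof(struct X) differs from n, the divisor
 * is 0, the enumerator is not an integer constant, and the build fails.
 */
#define CHECK_STRUCT_LEN(n, X) enum static_assert_enum_##X \
	{ static_assert_##X = (n) / ((sizeof(struct X) == (n)) ? 1 : 0) }

/* Hypothetical structure with a one-byte field ahead of a 32-bit integer. */
struct demo_hdrs {
	uint8_t tunnel_level;
	uint8_t pad[3];     /* explicit padding keeps count at offset 4 */
	int32_t count;
	uint8_t payload[8];
};

/* Holds whether integers are aligned to 2 or to 4 bytes. */
CHECK_STRUCT_LEN(16, demo_hdrs);
```

Without `pad[3]`, a compiler that aligns 32-bit integers to 4 bytes would insert three hidden bytes and give a 16-byte structure, while one that aligns them to 2 bytes would insert only one and give 14 bytes, so the same check would fail on the second target. Spelling the padding out makes the layout identical on both.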