Eric-Dumazet/n…

Commits on Aug 10, 2021

  1. net: igmp: fix data-race in igmp_ifc_timer_expire()

    Fix the data-race reported by syzbot [1]
    The issue is that igmp_ifc_timer_expire() can update in_dev->mr_ifc_count
    while another change has just occurred in another context.
    
    in_dev->mr_ifc_count is only 8bit wide, so the race had little
    consequences.
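
    One common way to close a race of this shape is a compare-and-swap
    loop: the timer only decrements the counter if the value it read is
    still current. The block below is a userspace sketch with C11 atomics,
    not the kernel patch; mr_ifc_count here is a hypothetical stand-in for
    in_dev->mr_ifc_count, and ifc_event()/ifc_timer_expire() are
    illustrative names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the 8-bit in_dev->mr_ifc_count field. */
static _Atomic uint8_t mr_ifc_count;

/* Writer side: another context resets the counter. */
static void ifc_event(uint8_t robustness)
{
	atomic_store(&mr_ifc_count, robustness);
}

/*
 * Timer side: decrement only if the value we read is still current.
 * Returns true when the counter was non-zero, i.e. the timer would be
 * re-armed.
 */
static bool ifc_timer_expire(void)
{
	uint8_t old = atomic_load(&mr_ifc_count);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&mr_ifc_count, &old,
						 (uint8_t)(old - 1)))
			return true;
	}
	return false;
}
```

    If a concurrent ifc_event() changes the counter between the load and
    the exchange, the exchange fails and the loop retries with the new
    value instead of overwriting it.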
    
    [1]
    BUG: KCSAN: data-race in igmp_ifc_event / igmp_ifc_timer_expire
    
    write to 0xffff8881051e3062 of 1 bytes by task 12547 on cpu 0:
     igmp_ifc_event+0x1d5/0x290 net/ipv4/igmp.c:821
     igmp_group_added+0x462/0x490 net/ipv4/igmp.c:1356
     ____ip_mc_inc_group+0x3ff/0x500 net/ipv4/igmp.c:1461
     __ip_mc_join_group+0x24d/0x2c0 net/ipv4/igmp.c:2199
     ip_mc_join_group_ssm+0x20/0x30 net/ipv4/igmp.c:2218
     do_ip_setsockopt net/ipv4/ip_sockglue.c:1285 [inline]
     ip_setsockopt+0x1827/0x2a80 net/ipv4/ip_sockglue.c:1423
     tcp_setsockopt+0x8c/0xa0 net/ipv4/tcp.c:3657
     sock_common_setsockopt+0x5d/0x70 net/core/sock.c:3362
     __sys_setsockopt+0x18f/0x200 net/socket.c:2159
     __do_sys_setsockopt net/socket.c:2170 [inline]
     __se_sys_setsockopt net/socket.c:2167 [inline]
     __x64_sys_setsockopt+0x62/0x70 net/socket.c:2167
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae
    
    read to 0xffff8881051e3062 of 1 bytes by interrupt on cpu 1:
     igmp_ifc_timer_expire+0x706/0xa30 net/ipv4/igmp.c:808
     call_timer_fn+0x2e/0x1d0 kernel/time/timer.c:1419
     expire_timers+0x135/0x250 kernel/time/timer.c:1464
     __run_timers+0x358/0x420 kernel/time/timer.c:1732
     run_timer_softirq+0x19/0x30 kernel/time/timer.c:1745
     __do_softirq+0x12c/0x26e kernel/softirq.c:558
     invoke_softirq kernel/softirq.c:432 [inline]
     __irq_exit_rcu+0x9a/0xb0 kernel/softirq.c:636
     sysvec_apic_timer_interrupt+0x69/0x80 arch/x86/kernel/apic/apic.c:1100
     asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
     console_unlock+0x8e8/0xb30 kernel/printk/printk.c:2646
     vprintk_emit+0x125/0x3d0 kernel/printk/printk.c:2174
     vprintk_default+0x22/0x30 kernel/printk/printk.c:2185
     vprintk+0x15a/0x170 kernel/printk/printk_safe.c:392
     printk+0x62/0x87 kernel/printk/printk.c:2216
     selinux_netlink_send+0x399/0x400 security/selinux/hooks.c:6041
     security_netlink_send+0x42/0x90 security/security.c:2070
     netlink_sendmsg+0x59e/0x7c0 net/netlink/af_netlink.c:1919
     sock_sendmsg_nosec net/socket.c:703 [inline]
     sock_sendmsg net/socket.c:723 [inline]
     ____sys_sendmsg+0x360/0x4d0 net/socket.c:2392
     ___sys_sendmsg net/socket.c:2446 [inline]
     __sys_sendmsg+0x1ed/0x270 net/socket.c:2475
     __do_sys_sendmsg net/socket.c:2484 [inline]
     __se_sys_sendmsg net/socket.c:2482 [inline]
     __x64_sys_sendmsg+0x42/0x50 net/socket.c:2482
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae
    
    value changed: 0x01 -> 0x02
    
    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 12539 Comm: syz-executor.1 Not tainted 5.14.0-rc4-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    
    Fixes: 1da177e ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    neebe000 authored and intel-lab-lkp committed Aug 10, 2021

Commits on Aug 9, 2021

  1. bareudp: Fix invalid read beyond skb's linear data

    Data beyond the UDP header might not be part of the skb's linear data.
    Use skb_copy_bits() instead of direct access to skb->data+X, so that
    we read the correct bytes even on a fragmented skb.
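
    To illustrate why a copy helper is needed, here is a userspace sketch
    of a packet split into a linear head and one fragment. struct pkt and
    pkt_copy_bits() are hypothetical names for this example; they only
    mimic the idea behind skb_copy_bits() and are not the kernel API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical two-part packet: a linear head plus one fragment. */
struct pkt {
	const unsigned char *head;
	size_t head_len;
	const unsigned char *frag;
	size_t frag_len;
};

/*
 * Copy 'len' bytes starting at 'offset', crossing the head/fragment
 * boundary if needed. Returns 0 on success, -1 if the range is not
 * fully inside the packet.
 */
static int pkt_copy_bits(const struct pkt *p, size_t offset,
			 void *to, size_t len)
{
	size_t total = p->head_len + p->frag_len;
	unsigned char *dst = to;
	size_t n;

	if (offset > total || len > total - offset)
		return -1;

	if (offset < p->head_len) {
		/* Part (or all) of the range lives in the linear head. */
		n = p->head_len - offset;
		if (n > len)
			n = len;
		memcpy(dst, p->head + offset, n);
		dst += n;
		len -= n;
		offset += n;
	}
	if (len)
		memcpy(dst, p->frag + (offset - p->head_len), len);
	return 0;
}
```

    Reading head + offset directly would run past the linear area as soon
    as the requested bytes sit in the fragment.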
    
    Fixes: 4b5f672 ("net: Special handling for IP & MPLS.")
    Signed-off-by: Guillaume Nault <gnault@redhat.com>
    Link: https://lore.kernel.org/r/7741c46545c6ef02e70c80a9b32814b22d9616b3.1628264975.git.gnault@redhat.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Guillaume Nault authored and Jakub Kicinski committed Aug 9, 2021
  2. net: openvswitch: fix kernel-doc warnings in flow.c

    Repair kernel-doc notation in a few places to make it conform to
    the expected format.
    
    Fixes the following kernel-doc warnings:
    
    flow.c:296: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
     * Parse vlan tag from vlan header.
    flow.c:296: warning: missing initial short description on line:
     * Parse vlan tag from vlan header.
    flow.c:537: warning: No description found for return value of 'key_extract_l3l4'
    flow.c:769: warning: No description found for return value of 'key_extract'
    
    Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
    Cc: Pravin B Shelar <pshelar@ovn.org>
    Cc: dev@openvswitch.org
    Link: https://lore.kernel.org/r/20210808190834.23362-1-rdunlap@infradead.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    rddunlap authored and Jakub Kicinski committed Aug 9, 2021
  3. psample: Add a fwd declaration for skbuff

    Without this there is a warning if source files include psample.h
    before skbuff.h or don't include skbuff.h at all.
    
    Fixes: 6ae0a62 ("net: Introduce psample, a new genetlink channel for packet sampling")
    Signed-off-by: Roi Dayan <roid@nvidia.com>
    Link: https://lore.kernel.org/r/20210808065242.1522535-1-roid@nvidia.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    roidayan authored and Jakub Kicinski committed Aug 9, 2021
  4. net: sched: act_mirred: Reset ct info when mirror/redirect skb

    When mirroring/redirecting an skb to a different port, the ct info should be
    reset for reclassification. Otherwise the packets will match unexpected rules.
    For example, with the following topology and commands:
    
        -----------
                  |
           veth0 -+-------
                  |
           veth1 -+-------
                  |
       ------------
    
     tc qdisc add dev veth0 clsact
     # The same with "action mirred egress mirror dev veth1" or "action mirred ingress redirect dev veth1"
     tc filter add dev veth0 egress chain 1 protocol ip flower ct_state +trk action mirred ingress mirror dev veth1
     tc filter add dev veth0 egress chain 0 protocol ip flower ct_state -inv action ct commit action goto chain 1
     tc qdisc add dev veth1 clsact
     tc filter add dev veth1 ingress chain 0 protocol ip flower ct_state +trk action drop
    
     ping <remote ip via veth0> &
     tc -s filter show dev veth1 ingress
    
    With command 'tc -s filter show', we can find the pkts were dropped on
    veth1.
    
    Fixes: b57dc7c ("net/sched: Introduce action ct")
    Signed-off-by: Roi Dayan <roid@nvidia.com>
    Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    liuhangbin authored and davem330 committed Aug 9, 2021
  5. Merge branch 'smc-fixes'

    Guvenc Gulce says:
    
    ====================
    net/smc: fixes 2021-08-09
    
    please apply the following patch series for smc to netdev's net tree.
    One patch fixes invalid connection counting for links and the other
    one fixes an access to an already cleared link.
    ====================
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    davem330 committed Aug 9, 2021
  6. net/smc: Correct smc link connection counter in case of smc client

    SMC clients may be assigned to a different link after the initial
    connection between two peers was established. In such a case,
    the connection counter was not correctly set.
    
    Update the connection counter correctly when a smc client connection
    is assigned to a different smc link.
    
    Fixes: 07d5158 ("net/smc: Add connection counters for links")
    Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
    Tested-by: Karsten Graul <kgraul@linux.ibm.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    guvenc authored and davem330 committed Aug 9, 2021
  7. net/smc: fix wait on already cleared link

    There can be a race between the waiters for a tx work request buffer
    and the link down processing that finally clears the link. Although
    all waiters are woken up before the link is cleared there might be
    waiters which did not yet get back control and are still waiting.
    This results in an access to a cleared wait queue head.
    
    Fix this by introducing atomic reference counting around the wait calls,
    and wait with the link clear processing until all waiters have finished.
    Move the work request layer related calls into smc_wr.c and set the
    link state to INACTIVE before calling smcr_link_clear() in
    smc_llc_srv_add_link().
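
    The waiter accounting described above can be sketched in userspace
    with C11 atomics. This is only an illustration of the idea, not the
    SMC code: struct demo_link and every function name below are
    hypothetical, and the real patch blocks until the count drops rather
    than polling a predicate.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical link with a waiter reference count. */
struct demo_link {
	atomic_int tx_waiters;
	atomic_bool active;
};

/* A waiter takes a reference first, then checks the link state. */
static bool waiter_enter(struct demo_link *l)
{
	atomic_fetch_add(&l->tx_waiters, 1);
	if (!atomic_load(&l->active)) {
		atomic_fetch_sub(&l->tx_waiters, 1);
		return false;
	}
	return true;
}

/* Drop the reference once the waiter has regained control. */
static void waiter_exit(struct demo_link *l)
{
	atomic_fetch_sub(&l->tx_waiters, 1);
}

/* Link-down path: mark the link inactive before clearing it. */
static void link_set_inactive(struct demo_link *l)
{
	atomic_store(&l->active, false);
}

/* Clearing is only safe once no waiter still holds a reference. */
static bool link_may_clear(struct demo_link *l)
{
	return !atomic_load(&l->active) &&
	       atomic_load(&l->tx_waiters) == 0;
}
```

    Taking the reference before checking the state means a waiter that
    raced with link-down is still counted until it backs out.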
    
    Fixes: 15e1b99 ("net/smc: no WR buffer wait for terminating link group")
    Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
    Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    karstengr authored and davem330 committed Aug 9, 2021
  8. net: ethernet: ti: cpsw: fix min eth packet size for non-switch use-cases

    The CPSW switchdev driver inherited the fix from commit 9421c90 ("net:
    ethernet: ti: cpsw: fix min eth packet size"), which changes the min TX packet
    size to 64 bytes (VLAN_ETH_ZLEN, excluding ETH_FCS). It was done to fix a HW
    packet drop issue when packets are sent from the Host to a port with PVID and
    un-tagging enabled. Unfortunately this breaks some other non-switch
    specific use-cases, like:
    - [1] CPSW port as DSA CPU port with DSA-tag applied at the end of the
    packet
    - [2] Some industrial protocols, which expect a min TX packet size of 60 bytes
    (excluding FCS).
    
    Fix it by configuring min TX packet size depending on driver mode
     - 60Bytes (ETH_ZLEN) for multi mac (dual-mac) mode
     - 64Bytes (VLAN_ETH_ZLEN) for switch mode
    and update it during driver mode change and annotate with
    READ_ONCE()/WRITE_ONCE() as it can be read by napi while writing.
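
    A minimal sketch of the mode-dependent minimum follows. The two size
    constants match the kernel's ETH_ZLEN and VLAN_ETH_ZLEN values; the
    enum and function names are hypothetical, and the READ_ONCE() /
    WRITE_ONCE() annotations of the real driver are left out.

```c
#include <stddef.h>

#define ETH_ZLEN	60	/* min frame size, excluding FCS */
#define VLAN_ETH_ZLEN	64	/* min frame size with VLAN tag, excluding FCS */

/* Hypothetical driver modes for this example. */
enum demo_cpsw_mode { DEMO_MODE_DUAL_MAC, DEMO_MODE_SWITCH };

/* Minimum TX packet size for the given mode. */
static size_t tx_packet_min(enum demo_cpsw_mode mode)
{
	return mode == DEMO_MODE_SWITCH ? VLAN_ETH_ZLEN : ETH_ZLEN;
}

/* Length a frame would be padded to before transmission. */
static size_t tx_padded_len(size_t len, enum demo_cpsw_mode mode)
{
	size_t min = tx_packet_min(mode);

	return len < min ? min : len;
}
```

    With this split, a 42-byte frame is padded to 60 bytes in dual-mac
    mode and to 64 bytes in switch mode.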
    
    [1] https://lore.kernel.org/netdev/20210531124051.GA15218@cephalopod/
    [2] https://e2e.ti.com/support/arm/sitara_arm/f/791/t/701669
    
    Cc: stable@vger.kernel.org
    Fixes: ed3525e ("net: ethernet: ti: introduce cpsw switchdev based driver part 1 - dual-emac")
    Reported-by: Ben Hutchings <ben.hutchings@essensium.com>
    Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    grygoriyS authored and davem330 committed Aug 9, 2021
  9. page_pool: mask the page->signature before the checking

    As mentioned in commit c07aea3 ("mm: add a signature in
    struct page"):
    "The page->signature field is aliased to page->lru.next and
    page->compound_head."
    
    And as the comment in page_is_pfmemalloc():
    "lru.next has bit 1 set if the page is allocated from the
    pfmemalloc reserves. Callers may simply overwrite it if they
    do not need to preserve that information."
    
    The page->signature is OR'ed with PP_SIGNATURE when a page is
    allocated in page pool, see __page_pool_alloc_pages_slow(),
    and page->signature is compared directly with PP_SIGNATURE in
    page_pool_return_skb_page(), which might cause a resource leak
    for a page from page pool if bit 1 of lru.next is set
    for a pfmemalloc page. What happens here is that the original
    page->signature is OR'ed with PP_SIGNATURE after the allocation
    in order to preserve any existing bits (such as bit 1, used
    to indicate a pfmemalloc page), so when those bits are present,
    those pages are not considered to be from page pool and the DMA
    mapping of those pages will be left stale.
    
    As bit 0 is for page->compound_head, mask both bits 0 and 1 before
    the checking in page_pool_return_skb_page(). And we will return
    those pfmemalloc pages back to the page allocator after cleaning
    up the DMA mapping.
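
    The effect of the mask can be shown with plain integers. In this
    sketch PP_SIGNATURE_DEMO is a made-up value chosen for illustration
    (it is not the kernel's PP_SIGNATURE), and the helper names are
    hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PP_SIGNATURE_DEMO	0x40UL	/* hypothetical signature value */
#define PFMEMALLOC_BIT		0x2UL	/* bit 1: page from pfmemalloc reserves */
#define LOW_BITS_MASK		0x3UL	/* bit 0 (compound_head) and bit 1 */

/* Allocation side: OR the signature in, preserving existing bits. */
static uintptr_t mark_page(uintptr_t sig)
{
	return sig | PP_SIGNATURE_DEMO;
}

/* Direct comparison: fails as soon as a low bit is set. */
static bool is_pp_unmasked(uintptr_t sig)
{
	return sig == PP_SIGNATURE_DEMO;
}

/* Masked comparison: ignores bits 0 and 1 before comparing. */
static bool is_pp_masked(uintptr_t sig)
{
	return (sig & ~LOW_BITS_MASK) == PP_SIGNATURE_DEMO;
}
```

    A page marked while bit 1 is set fails the direct comparison but
    passes the masked one, which is the case the commit describes.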
    
    Fixes: 6a5bcd8 ("page_pool: Allow drivers to hint on SKB recycling")
    Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
    Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Yunsheng Lin authored and davem330 committed Aug 9, 2021
  10. dccp: add do-while-0 stubs for dccp_pr_debug macros

    GCC complains about empty macros in an 'if' statement, so convert
    them to 'do {} while (0)' macros.
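
    The pattern looks roughly like this. demo_pr_debug() and DEMO_DEBUG
    are hypothetical names standing in for the dccp macros and their
    config switch; the variadic form relies on the GNU ##__VA_ARGS__
    extension.

```c
#include <stdio.h>

#ifdef DEMO_DEBUG
#define demo_pr_debug(fmt, ...) fprintf(stderr, fmt, ##__VA_ARGS__)
#else
/* Expands to a real statement, so the 'if' below still has a body. */
#define demo_pr_debug(fmt, ...) do { } while (0)
#endif

/* Returns 1 when there was no error, 0 otherwise. */
static int classify(int err)
{
	int handled = 0;

	if (err)
		demo_pr_debug("returned err=%d\n", err);
	else
		handled = 1;
	return handled;
}
```

    With an empty macro the 'if' branch would reduce to a lone semicolon,
    which is what -Wempty-body warns about.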
    
    Fixes these build warnings:
    
    net/dccp/output.c: In function 'dccp_xmit_packet':
    ../net/dccp/output.c:283:71: warning: suggest braces around empty body in an 'if' statement [-Wempty-body]
      283 |                 dccp_pr_debug("transmit_skb() returned err=%d\n", err);
    net/dccp/ackvec.c: In function 'dccp_ackvec_update_old':
    ../net/dccp/ackvec.c:163:80: warning: suggest braces around empty body in an 'else' statement [-Wempty-body]
      163 |                                               (unsigned long long)seqno, state);
    
    Fixes: dc841e3 ("dccp: Extend CCID packet dequeueing interface")
    Fixes: 3802408 ("dccp ccid-2: Update code for the Ack Vector input/registration routine")
    Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
    Cc: dccp@vger.kernel.org
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    rddunlap authored and davem330 committed Aug 9, 2021

Commits on Aug 8, 2021

  1. ppp: Fix generating ppp unit id when ifname is not specified

    When a new ppp interface is registered via the PPPIOCNEWUNIT ioctl, the kernel
    has to choose the interface name, as this ioctl API does not support specifying it.
    
    In this case the kernel registers the new interface with the name "ppp<id>",
    where <id> is the ppp unit id, which can be obtained via the PPPIOCGUNIT ioctl.
    This also applies when a new ppp interface is registered via rtnl without
    supplying IFLA_IFNAME.
    
    The PPPIOCNEWUNIT ioctl allows the caller to specify its own ppp unit id, which
    the kernel assigns to the ppp interface if this id is not already used by
    another ppp interface.
    
    If the user does not specify a ppp unit id, the kernel chooses the first free
    ppp unit id. This also applies when a ppp interface is created via the
    rtnl method, as it does not provide a way to specify a ppp unit id.
    
    If some network interface (it does not have to be ppp) has the name "ppp<id>"
    with this first free ppp id, the PPPIOCNEWUNIT ioctl or rtnl call fails.
    
    Registering a new ppp interface is then no longer possible until the interface
    that holds the conflicting name is renamed, or unless the rtnl method is used
    with a custom interface name in IFLA_IFNAME.
    
    As the list of allocated / used ppp unit ids cannot be retrieved from the
    kernel by userspace, userspace has no idea what happened nor which interface
    is causing the conflict.
    
    So change the algorithm by which the ppp unit id is generated: choose the first
    number which is used neither as a ppp unit id nor in the name of some network
    interface matching the pattern "ppp<id>".
    
    This issue can be reproduced with the following pppd call when there is no
    ppp interface registered and also no interface with name pattern "ppp<id>":
    
        pppd ifname ppp1 +ipv6 noip noauth nolock local nodetach pty "pppd +ipv6 noip noauth nolock local nodetach notty"
    
    Or by creating one ppp interface (which gets assigned ppp unit id 0),
    renaming it to "ppp1" and then trying to create a new ppp interface (which
    will always fail, as the next free ppp unit id is 1 but a network interface
    named "ppp1" exists).
    
    This patch fixes the issue described above by generating new ppp unit
    ids until one that does not conflict with any network interface name is found.
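
    The selection rule can be sketched in userspace as follows. The
    arrays stand in for the kernel's unit allocator and interface name
    table, and ppp_pick_unit() with its helpers are hypothetical names
    for this example only.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if 'unit' is already allocated as a ppp unit id. */
static bool unit_in_use(int unit, const int *units, int n)
{
	for (int i = 0; i < n; i++)
		if (units[i] == unit)
			return true;
	return false;
}

/* True if some network interface already carries 'name'. */
static bool name_in_use(const char *name, const char *const *names, int n)
{
	for (int i = 0; i < n; i++)
		if (strcmp(names[i], name) == 0)
			return true;
	return false;
}

/*
 * Return the first unit id in [0, max_unit] that is neither allocated
 * nor shadowed by an interface named "ppp<id>", or -1 if none is free.
 */
static int ppp_pick_unit(const int *units, int nunits,
			 const char *const *names, int nnames, int max_unit)
{
	char name[32];

	for (int unit = 0; unit <= max_unit; unit++) {
		if (unit_in_use(unit, units, nunits))
			continue;
		snprintf(name, sizeof(name), "ppp%d", unit);
		if (name_in_use(name, names, nnames))
			continue;
		return unit;
	}
	return -1;
}
```

    In the second reproducer above (unit 0 allocated, its interface
    renamed to "ppp1"), this rule skips both 0 and 1 and settles on 2.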
    
    Signed-off-by: Pali Rohár <pali@kernel.org>
    Cc: stable@vger.kernel.org
    Signed-off-by: David S. Miller <davem@davemloft.net>
    pali authored and davem330 committed Aug 8, 2021
  2. ppp: Fix generating ifname when empty IFLA_IFNAME is specified

    IFLA_IFNAME is a nul-terminated string, which means that the IFLA_IFNAME buffer
    can be larger than the length of the string it contains.
    
    Function __rtnl_newlink() generates its own ifname if either IFLA_IFNAME
    was not specified at all or userspace passed an empty nul-terminated string.
    
    It is expected that if userspace does not specify an ifname for a new ppp
    netdev, the kernel generates one in the format "ppp<id>", where id matches the
    ppp unit id which can later be obtained by the PPPIOCGUNIT ioctl.
    
    It works this way if IFLA_IFNAME is not specified at all, but it
    does not work when IFLA_IFNAME is specified with an empty string.
    
    So fix this logic for an empty IFLA_IFNAME in the ppp_nl_newlink() function as
    well, and correctly generate the ifname based on the ppp unit identifier if
    userspace did not provide a preferred ifname.
    
    Without this patch, when IFLA_IFNAME was specified with an empty string, the
    kernel created a new ppp interface named "ppp<id>" but id did not
    match the ppp unit id returned by the PPPIOCGUNIT ioctl. In this case id was
    some number generated by the __rtnl_newlink() function.
    
    Signed-off-by: Pali Rohár <pali@kernel.org>
    Fixes: bb8082f ("ppp: build ifname using unit identifier for rtnl based devices")
    Signed-off-by: David S. Miller <davem@davemloft.net>
    pali authored and davem330 committed Aug 8, 2021
  3. Merge branch 'bnxt_en-ptp-fixes'

    Michael Chan says:
    
    ====================
    bnxt_en: PTP fixes
    
    This series includes 2 fixes for the PTP feature.  Update to the new
    firmware interface so that the driver can pass the PTP sequence number
    header offset of TX packets to the firmware.  This is needed for all
    PTP packet types (v1, v2, with or without VLAN) to work.  The 2nd
    fix is to use a different register window to read the PHC to avoid
    conflict with an older Broadcom tool.
    ====================
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    davem330 committed Aug 8, 2021
  4. bnxt_en: Use register window 6 instead of 5 to read the PHC

    Some older Broadcom debug tools use window 5 and may conflict, so switch
    to use window 6 instead.
    
    Fixes: 118612d ("bnxt_en: Add PTP clock APIs, ioctls, and ethtool methods")
    Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
    Signed-off-by: Michael Chan <michael.chan@broadcom.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Michael Chan authored and davem330 committed Aug 8, 2021
  5. bnxt_en: Update firmware call to retrieve TX PTP timestamp

    New firmware interface requires the PTP sequence ID header offset to
    be passed to the firmware to properly find the matching timestamp
    for all protocols.
    
    Fixes: 83bb623 ("bnxt_en: Transmit and retrieve packet timestamps")
    Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
    Signed-off-by: Michael Chan <michael.chan@broadcom.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Michael Chan authored and davem330 committed Aug 8, 2021
  6. bnxt_en: Update firmware interface to 1.10.2.52

    The key change is the firmware call to retrieve the PTP TX timestamp.
    The header offset for the PTP sequence number field is now added.
    
    Signed-off-by: Michael Chan <michael.chan@broadcom.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Michael Chan authored and davem330 committed Aug 8, 2021
  7. once: Fix panic when module unload

    DO_ONCE
    DEFINE_STATIC_KEY_TRUE(___once_key);
    __do_once_done
      once_disable_jump(once_key);
        INIT_WORK(&w->work, once_deferred);
        struct once_work *w;
        w->key = key;
        schedule_work(&w->work);                     module unload
                                                       //*the key is destroyed*
    process_one_work
      once_deferred
        BUG_ON(!static_key_enabled(work->key));
           static_key_count((struct static_key *)x)    //*access key, crash*
    
    When a module uses the DO_ONCE mechanism, it can crash due to the
    concurrency problem above; it can be reproduced with the link [1].
    
    Fix it by taking and releasing a module reference in the once work processing.
    
    [1] https://lore.kernel.org/netdev/eaa6c371-465e-57eb-6be9-f4b16b9d7cbf@huawei.com/
    
    Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
    Cc: Daniel Borkmann <daniel@iogearbox.net>
    Cc: David S. Miller <davem@davemloft.net>
    Cc: Eric Dumazet <edumazet@google.com>
    Reported-by: Minmin chen <chenmingmin@huawei.com>
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Kefeng Wang authored and davem330 committed Aug 8, 2021
  8. ptp: Fix possible memory leak caused by invalid cast

    Fixes possible leak of PTP virtual clocks.
    
    The number of PTP virtual clocks to be unregistered is passed as
    'u32', but the function that unregisters the devices handles it as
    'u8'.
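
    The narrowing is easy to reproduce in isolation. Both functions below
    are hypothetical; they simply report how many clocks the callee
    believes it was asked to unregister.

```c
#include <stdint.h>

/* Callee that takes the count as an 8-bit value: high bits are lost. */
static uint32_t unregister_count_u8(uint8_t n)
{
	return n;
}

/* Callee that takes the count at its full 32-bit width. */
static uint32_t unregister_count_u32(uint32_t n)
{
	return n;
}
```

    Any count above 255 is reduced modulo 256 by the 8-bit parameter, so
    the clocks beyond the truncated count would never be unregistered.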
    
    Fixes: 73f3706 ("ptp: support ptp physical/virtual clocks conversion")
    Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    vcgomes authored and davem330 committed Aug 8, 2021
  9. net: phy: micrel: Fix link detection on ksz87xx switch"

    Commit a5e63c7 "net: phy: micrel: Fix detection of ksz87xx
    switch" broke link detection on the external ports of the KSZ8795.
    
    The previously unused phy_driver structure for these devices specifies
    config_aneg and read_status functions that appear to be designed for a
    fixed link and do not work with the embedded PHYs in the KSZ8795.
    
    Delete the use of these functions in favour of the generic PHY
    implementations which were used previously.
    
    Fixes: a5e63c7 ("net: phy: micrel: Fix detection of ksz87xx switch")
    Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    bwh-mind authored and davem330 committed Aug 8, 2021

Commits on Aug 7, 2021

  1. net: wwan: mhi_wwan_ctrl: Fix possible deadlock

    Lockdep detected possible interrupt unsafe locking scenario:
    
            CPU0                    CPU1
            ----                    ----
       lock(&mhiwwan->rx_lock);
                                   local_irq_disable();
                                   lock(&mhi_cntrl->pm_lock);
                                   lock(&mhiwwan->rx_lock);
       <Interrupt>
         lock(&mhi_cntrl->pm_lock);
    
      *** DEADLOCK ***
    
    To prevent this we need to disable the soft-interrupts when taking
    the rx_lock.
    
    Cc: stable@vger.kernel.org
    Fixes: fa588eb ("net: Add Qcom WWAN control driver")
    Reported-by: Thomas Perrot <thomas.perrot@bootlin.com>
    Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
    Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Loic Poulain authored and davem330 committed Aug 7, 2021
  2. net: dsa: qca: ar9331: make proper initial port defaults

    Make sure that all external ports are actually isolated from each other,
    so no packets are leaked.
    
    Fixes: ec6698c ("net: dsa: add support for Atheros AR9331 built-in switch")
    Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    olerem authored and davem330 committed Aug 7, 2021
  3. Merge branch 'r8169-RTL8106e'

    Hayes Wang says:
    
    ====================
    r8169: adjust the setting for RTL8106e
    
    These patches are used to avoid the delay of the link-up interrupt when
    enabling ASPM for RTL8106e. Patch #1 enables ASPM if
    it is possible, and patch #2 modifies the entrance latencies
    of L0 and L1.
    ====================
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    davem330 committed Aug 7, 2021
  4. r8169: change the L0/L1 entrance latencies for RTL8106e

    The original L0 and L1 entrance latencies of RTL8106e are 4us, and
    they delay the link-up interrupt when ASPM is enabled. Change
    the L0 entrance latency to 7us and the L1 entrance latency to 32us
    to avoid the issue.
    
    Tested-by: Koba Ko <koba.ko@canonical.com>
    Signed-off-by: Hayes Wang <hayeswang@realtek.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    hayesorz authored and davem330 committed Aug 7, 2021
  5. Revert "r8169: avoid link-up interrupt issue on RTL8106e if user enables ASPM"

    This reverts commit 1ee8856.
    
    This is used to re-enable ASPM on RTL8106e, if it is possible.
    
    Signed-off-by: Hayes Wang <hayeswang@realtek.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    hayesorz authored and davem330 committed Aug 7, 2021
  6. Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

    Daniel Borkmann says:
    
    ====================
    pull-request: bpf 2021-08-07
    
    The following pull-request contains BPF updates for your *net* tree.
    
    We've added 4 non-merge commits during the last 9 day(s) which contain
    a total of 4 files changed, 8 insertions(+), 7 deletions(-).
    
    The main changes are:
    
    1) Fix integer overflow in htab's lookup + delete batch op, from Tatsuhiko Yasumatsu.
    
    2) Fix invalid fd 0 close in libbpf if BTF parsing failed, from Daniel Xu.
    
    3) Fix libbpf feature probe for BPF_PROG_TYPE_CGROUP_SOCKOPT, from Robin Gögge.
    
    4) Fix minor libbpf doc warning regarding code-block language, from Randy Dunlap.
    ====================
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    davem330 committed Aug 7, 2021

Commits on Aug 6, 2021

  1. bpf: Fix integer overflow involving bucket_size

    In __htab_map_lookup_and_delete_batch(), hash buckets are iterated
    over to count the number of elements in each bucket (bucket_size).
    If bucket_size is large enough, the multiplication to calculate
    kvmalloc() size could overflow, resulting in out-of-bounds write
    as reported by KASAN:
    
      [...]
      [  104.986052] BUG: KASAN: vmalloc-out-of-bounds in __htab_map_lookup_and_delete_batch+0x5ce/0xb60
      [  104.986489] Write of size 4194224 at addr ffffc9010503be70 by task crash/112
      [  104.986889]
      [  104.987193] CPU: 0 PID: 112 Comm: crash Not tainted 5.14.0-rc4 #13
      [  104.987552] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
      [  104.988104] Call Trace:
      [  104.988410]  dump_stack_lvl+0x34/0x44
      [  104.988706]  print_address_description.constprop.0+0x21/0x140
      [  104.988991]  ? __htab_map_lookup_and_delete_batch+0x5ce/0xb60
      [  104.989327]  ? __htab_map_lookup_and_delete_batch+0x5ce/0xb60
      [  104.989622]  kasan_report.cold+0x7f/0x11b
      [  104.989881]  ? __htab_map_lookup_and_delete_batch+0x5ce/0xb60
      [  104.990239]  kasan_check_range+0x17c/0x1e0
      [  104.990467]  memcpy+0x39/0x60
      [  104.990670]  __htab_map_lookup_and_delete_batch+0x5ce/0xb60
      [  104.990982]  ? __wake_up_common+0x4d/0x230
      [  104.991256]  ? htab_of_map_free+0x130/0x130
      [  104.991541]  bpf_map_do_batch+0x1fb/0x220
      [...]
    
    In hashtable, if the elements' keys have the same jhash() value, the
    elements will be put into the same bucket. By putting a lot of elements
    into a single bucket, the value of bucket_size can be increased to
    trigger the integer overflow.
    
    Triggering the overflow is possible for both callers with CAP_SYS_ADMIN
    and callers without CAP_SYS_ADMIN.
    
    It will be trivial for a caller with CAP_SYS_ADMIN to intentionally
    reach this overflow by enabling BPF_F_ZERO_SEED. As this flag will set
    the random seed passed to jhash() to 0, it will be easy for the caller
    to prepare keys which will be hashed into the same value, and thus put
    all the elements into the same bucket.
    
    If the caller does not have CAP_SYS_ADMIN, BPF_F_ZERO_SEED cannot be
    used. However, it is still technically possible to trigger the
    overflow by guessing the random seed value passed to jhash() (32bit)
    and repeating the attempt to trigger the overflow. In this case,
    the probability of triggering the overflow is low and an attack would
    take a very long time.
    
    Fix the integer overflow by calling kvmalloc_array() instead of
    kvmalloc() to allocate memory.
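
    The difference between the two allocation styles can be sketched in
    userspace. alloc_array_checked() below is a hypothetical helper that
    imitates the overflow check an array allocator performs, and
    bytes_u32() shows how an unchecked 32-bit product wraps; neither is
    kernel code.

```c
#include <stdint.h>
#include <stdlib.h>

/* Unchecked size computation in 32-bit arithmetic: wraps silently. */
static uint32_t bytes_u32(uint32_t bucket_size, uint32_t elem_size)
{
	return bucket_size * elem_size;
}

/*
 * Allocate 'n' elements of 'size' bytes each, refusing the request
 * when n * size would not fit in size_t.
 */
static void *alloc_array_checked(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;
	return malloc(n * size);
}
```

    A wrapped product yields a buffer far smaller than the data later
    copied into it, whereas the checked form fails the allocation up
    front.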
    
    Fixes: 0579963 ("bpf: Add batch ops to all htab bpf map")
    Signed-off-by: Tatsuhiko Yasumatsu <th.yasumatsu@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20210806150419.109658-1-th.yasumatsu@gmail.com
    ty60 authored and borkmann committed Aug 6, 2021
  2. libbpf, doc: Eliminate warnings in libbpf_naming_convention

    Use "code-block: none" instead of "c" for non-C-language code blocks.
    Removes these warnings:
    
      lnx-514-rc4/Documentation/bpf/libbpf/libbpf_naming_convention.rst:111: WARNING: Could not lex literal_block as "c". Highlighting skipped.
      lnx-514-rc4/Documentation/bpf/libbpf/libbpf_naming_convention.rst:124: WARNING: Could not lex literal_block as "c". Highlighting skipped.
    
    Fixes: f42cfb4 ("bpf: Add documentation for libbpf including API autogen")
    Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20210802015037.787-1-rdunlap@infradead.org
    rddunlap authored and borkmann committed Aug 6, 2021
  3. libbpf: Do not close un-owned FD 0 on errors

    Before this patch, btf_new() was liable to close an arbitrary FD 0 if
    BTF parsing failed. This was because:
    
    * btf->fd was initialized to 0 through the calloc()
    * btf__free() (in the `done` label) closed any FDs >= 0
    * btf->fd is left at 0 if parsing fails
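
    The three points above can be reproduced with a small userspace
    sketch. struct demo_btf and the demo_* functions are hypothetical,
    demo_close() only counts calls instead of closing anything, and
    initialising fd to -1 is an assumed way to address the problem, not
    necessarily what the patch does.

```c
#include <stdlib.h>

/* Hypothetical object that may own a file descriptor. */
struct demo_btf {
	int fd;
};

/* Counts how many descriptors the free path tried to close. */
static int demo_close_calls;

static void demo_close(int fd)
{
	(void)fd;
	demo_close_calls++;
}

static struct demo_btf *demo_btf_alloc(void)
{
	struct demo_btf *btf = calloc(1, sizeof(*btf));

	/* calloc() leaves fd == 0, which is a valid descriptor number. */
	if (btf)
		btf->fd = -1;
	return btf;
}

static void demo_btf_free(struct demo_btf *btf)
{
	if (!btf)
		return;
	if (btf->fd >= 0)
		demo_close(btf->fd);
	free(btf);
}
```

    Without the explicit -1, freeing an object whose parsing failed would
    pass fd 0 to the close path even though the object never owned it.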
    
    This issue was discovered on a system using libbpf v0.3 (without
    BTF_KIND_FLOAT support) but with a kernel that had BTF_KIND_FLOAT types
    in BTF. Thus, parsing fails.
    
    While this patch technically doesn't fix any issues b/c upstream libbpf
    has BTF_KIND_FLOAT support, it'll help prevent issues in the future if
    more BTF types are added. It also allows the fix to be backported to
    older libbpf's.
    
    Fixes: 3289959 ("libbpf: Support BTF loading and raw data output in both endianness")
    Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/5969bb991adedb03c6ae93e051fd2a00d293cf25.1627513670.git.dxu@dxuuu.xyz
    danobi authored and borkmann committed Aug 6, 2021
  4. libbpf: Fix probe for BPF_PROG_TYPE_CGROUP_SOCKOPT

    This patch fixes the probe for BPF_PROG_TYPE_CGROUP_SOCKOPT,
    so the probe reports accurate results when used by e.g.
    bpftool.
    
    Fixes: 4cdbfb5 ("libbpf: support sockopt hooks")
    Signed-off-by: Robin Gögge <r.goegge@gmail.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Quentin Monnet <quentin@isovalent.com>
    Link: https://lore.kernel.org/bpf/20210728225825.2357586-1-r.goegge@gmail.com
    rgo3 authored and borkmann committed Aug 6, 2021
  5. Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

    Pablo Neira Ayuso says:
    
    ====================
    Netfilter fixes for net
    
    The following patchset contains Netfilter fixes for net:
    
    1) Restrict range element expansion in ipset to avoid soft lockup,
       from Jozsef Kadlecsik.
    
    2) Memleak in error path for nf_conntrack_bridge for IPv4 packets,
       from Yajun Deng.
    
    3) Simplify conntrack garbage collection strategy to avoid frequent
       wake-ups, from Florian Westphal.
    
    4) Fix NFNLA_HOOK_FUNCTION_NAME string, do not include module name.
    
    5) Missing chain family netlink attribute in chain description
       in nfnetlink_hook.
    
    6) Incorrect sequence number on nfnetlink_hook dumps.
    
    7) Use netlink request family in reply message for consistency.
    
    8) Remove offload_pickup sysctl, use conntrack for established state
       instead, from Florian Westphal.
    
    9) Translate NFPROTO_INET/ingress to NFPROTO_NETDEV/ingress, since
       NFPROTO_INET is not exposed through nfnetlink_hook.
    
    * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
      netfilter: nfnetlink_hook: translate inet ingress to netdev
      netfilter: conntrack: remove offload_pickup sysctl again
      netfilter: nfnetlink_hook: Use same family as request message
      netfilter: nfnetlink_hook: use the sequence number of the request message
      netfilter: nfnetlink_hook: missing chain family
      netfilter: nfnetlink_hook: strip off module name from hookfn
      netfilter: conntrack: collect all entries in one cycle
      netfilter: nf_conntrack_bridge: Fix memory leak when error
      netfilter: ipset: Limit the maximal range of consecutive elements to add/delete
    ====================
    
    Link: https://lore.kernel.org/r/20210806151149.6356-1-pablo@netfilter.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Jakub Kicinski committed Aug 6, 2021
  6. netfilter: nfnetlink_hook: translate inet ingress to netdev

    The NFPROTO_INET pseudofamily is not exposed through this new netlink
    interface. The netlink dump either shows NFPROTO_IPV4 or NFPROTO_IPV6
    for NFPROTO_INET prerouting/input/forward/output/postrouting hooks.
    The NFNLA_CHAIN_FAMILY attribute provides the family chain, which
    specifies if this hook applies to inet traffic only (either IPv4 or
    IPv6).
    
    Translate the inet/ingress hook to netdev/ingress to fully hide the
    NFPROTO_INET implementation details.
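
The translation described here amounts to a small mapping on the (family, hook) pair. The following is a minimal sketch, assuming hypothetical `DEMO_*` constants and a hypothetical `struct demo_hook`; the numeric values are illustrative and the function is not the kernel's implementation.

```c
/* Hypothetical family and hook identifiers; values are illustrative only. */
enum demo_family {
	DEMO_NFPROTO_INET = 1,
	DEMO_NFPROTO_NETDEV = 5,
};

enum demo_hooknum {
	DEMO_NF_NETDEV_INGRESS = 0,
	DEMO_NF_INET_INGRESS = 5,
};

struct demo_hook {
	int family;
	int hooknum;
};

/* Report an inet/ingress hook as netdev/ingress so that the inet
 * pseudofamily never shows up in the dump; everything else is
 * passed through unchanged. */
struct demo_hook demo_translate_hook(struct demo_hook h)
{
	if (h.family == DEMO_NFPROTO_INET && h.hooknum == DEMO_NF_INET_INGRESS) {
		h.family = DEMO_NFPROTO_NETDEV;
		h.hooknum = DEMO_NF_NETDEV_INGRESS;
	}
	return h;
}
```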
    
    Fixes: e2cf17d ("netfilter: add new hook nfnl subsystem")
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    ummakynes committed Aug 6, 2021
  7. netfilter: conntrack: remove offload_pickup sysctl again

    These two sysctls were added because the hardcoded defaults (2 minutes
    for TCP, 30 seconds for UDP) turned out to be too low for some setups.
    
    They appeared in 5.14-rc1 so it should be fine to remove them again.
    
    Marcelo convinced me that there should be no difference between a flow
    that was offloaded and a flow that was not with respect to timeout
    handling. Thus the defaults are changed to those for TCP established
    and UDP stream, 5 days and 120 seconds, respectively.
    
    Marcelo also suggested accounting for the timeout value used for the
    offloading. This avoids an increase beyond the value in the conntrack
    sysctl and will also instantly expire the conntrack entry with altered
    sysctls.
    
    Example:
       nf_conntrack_udp_timeout_stream=60
       nf_flowtable_udp_timeout=60
    
    This will remove offloaded udp flows after one minute, rather than two.
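
The arithmetic behind this example can be sketched as follows. This is a simplified model based only on the description above, not the kernel code: it assumes the time already covered by the flowtable timeout is subtracted from the conntrack timeout and the result is clamped at zero. The `demo_*` function names are hypothetical.

```c
/* Remaining conntrack timeout (in seconds) once a flow leaves the offload
 * path. Simplified model: the flowtable timeout is subtracted from the
 * conntrack timeout, clamped at zero. */
unsigned int demo_remaining_ct_timeout(unsigned int ct_timeout,
				       unsigned int flowtable_timeout)
{
	if (flowtable_timeout >= ct_timeout)
		return 0;
	return ct_timeout - flowtable_timeout;
}

/* Upper bound on how long an idle offloaded flow survives in this model:
 * the flowtable timeout plus whatever conntrack timeout is left over. */
unsigned int demo_total_idle_lifetime(unsigned int ct_timeout,
				      unsigned int flowtable_timeout)
{
	return flowtable_timeout +
	       demo_remaining_ct_timeout(ct_timeout, flowtable_timeout);
}
```

With both values set to 60, the remaining conntrack timeout is 0 and the total idle lifetime is 60 seconds. Without the subtraction the two timeouts would add up to 120 seconds, which is the "two minutes" the message refers to.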
    
    An earlier version of this patch also cleared the ASSURED bit to
    allow nf_conntrack to evict the entry via early_drop (i.e., table full).
    However, it looks like we can safely assume that a connection timed out
    via HW is still in the established state, so this isn't needed.
    
    Quoting Oz:
     [..] the hardware sends all packets with a set FIN flags to sw.
     [..] Connections that are aged in hardware are expected to be in the
     established state.
    
    In case it turns out that back-to-sw-path transition can occur for
    'dodgy' connections too (e.g., one side disappeared while software-path
    would have been in RETRANS timeout), we can adjust this later.
    
    Cc: Oz Shlomo <ozsh@nvidia.com>
    Cc: Paul Blakey <paulb@nvidia.com>
    Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    Florian Westphal authored and ummakynes committed Aug 6, 2021
  8. netfilter: nfnetlink_hook: Use same family as request message

    Use the same family as the request message, for consistency. The
    netlink payload provides sufficient information to describe the hook
    object, including the family.
    
    This makes it easier for userspace to correlate the hooks that are
    visited by the packets of a certain family.
    
    Fixes: e2cf17d ("netfilter: add new hook nfnl subsystem")
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    ummakynes committed Aug 6, 2021
  9. netfilter: nfnetlink_hook: use the sequence number of the request message
    
    The sequence number allows the netlink reply message (as part of the
    dump) to be correlated with the original request message.
    
    The cb->seq field is used internally to detect interference (an update)
    of the hook list during the netlink dump. Do not use it as the sequence
    number in the netlink dump header.
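
The distinction between the two sequence values can be illustrated with a pared-down model. The structs below are hypothetical stand-ins that keep only the fields relevant here, and `demo_fill_reply_header` is an illustrative helper, not a kernel function.

```c
#include <stdint.h>

/* Hypothetical stand-in for a netlink message header. */
struct demo_nlmsghdr {
	uint32_t nlmsg_seq;
};

/* Hypothetical stand-in for the dump callback state. */
struct demo_dump_cb {
	const struct demo_nlmsghdr *nlh; /* request that started the dump */
	uint32_t seq;                    /* internal counter for detecting
					  * hook list updates mid-dump */
};

/* Fill the reply header with the request's sequence number so userspace
 * can match the reply to its request. The internal cb->seq counter is
 * deliberately not used here. */
void demo_fill_reply_header(struct demo_nlmsghdr *reply,
			    const struct demo_dump_cb *cb)
{
	reply->nlmsg_seq = cb->nlh->nlmsg_seq;
}
```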
    
    Fixes: e2cf17d ("netfilter: add new hook nfnl subsystem")
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    ummakynes committed Aug 6, 2021