Commits on Nov 17, 2021

  1. net: no longer stop all TX queues in dev_watchdog()

    There is no reason to stop all TX queues from dev_watchdog().
    
    Not only does this stop feeding the NIC, it also migrates all qdiscs
    to be serviced on the cpu calling netif_tx_unlock(), causing
    a potential latency artifact.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    neebe000 authored and intel-lab-lkp committed Nov 17, 2021
  2. net: do not inline netif_tx_lock()/netif_tx_unlock()

    These are not in the fast path, so there is no point in inlining them.
    
    Also provide netif_freeze_queues()/netif_unfreeze_queues()
    so that we can use them from dev_watchdog() in the following patch.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    neebe000 authored and intel-lab-lkp committed Nov 17, 2021
  3. net: annotate accesses to queue->trans_start

    In following patches, dev_watchdog() will no longer stop all queues.
    It will read queue->trans_start locklessly.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    neebe000 authored and intel-lab-lkp committed Nov 17, 2021
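The lockless read this commit describes relies on READ_ONCE()/WRITE_ONCE() annotations. A minimal userspace sketch of the idea follows; the macros are simplified stand-ins for the kernel's (which handle more cases), and the struct is illustrative:

```c
/* Simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE().
 * Making the access volatile forbids the compiler from tearing,
 * fusing, or re-reading it, which is what a lockless reader such
 * as dev_watchdog() relies on. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct txq { unsigned long trans_start; };

/* writer side: the xmit path updates the timestamp */
void txq_stamp(struct txq *q, unsigned long now)
{
    WRITE_ONCE(q->trans_start, now);
}

/* reader side: the watchdog samples it without the queue lock */
unsigned long txq_sample(const struct txq *q)
{
    return READ_ONCE(q->trans_start);
}
```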
  4. net: use an atomic_long_t for queue->trans_timeout

    tx_timeout_show() assumed dev_watchdog() would stop all
    the queues, to fetch queue->trans_timeout under protection
    of the queue->_xmit_lock.
    
    As we want to no longer disrupt transmits, we use an
    atomic_long_t instead.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: david decotigny <david.decotigny@google.com>
    neebe000 authored and intel-lab-lkp committed Nov 17, 2021
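A userspace sketch of the atomic counter approach, using C11 atomics in place of the kernel's atomic_long_t (struct and function names here are hypothetical):

```c
#include <stdatomic.h>

/* queue->trans_timeout becomes an atomic counter, so a reader such as
 * tx_timeout_show() no longer needs the queue frozen under _xmit_lock. */
struct txq_sketch {
    atomic_long trans_timeout;
};

/* watchdog side: count one more timeout, lock-free */
void txq_timeout_inc(struct txq_sketch *q)
{
    atomic_fetch_add_explicit(&q->trans_timeout, 1, memory_order_relaxed);
}

/* sysfs side: sample the counter without any queue lock */
long txq_timeout_read(struct txq_sketch *q)
{
    return atomic_load_explicit(&q->trans_timeout, memory_order_relaxed);
}
```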

Commits on Nov 16, 2021

  1. Merge branch 'inuse-cleanups'

    Eric Dumazet says:
    
    ====================
    net: prot_inuse and sock_inuse cleanups
    
    Small series cleaning and optimizing sock_prot_inuse_add()
    and sock_inuse_add().
    ====================
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    davem330 committed Nov 16, 2021
  2. net: drop nopreempt requirement on sock_prot_inuse_add()

    This requirement is distracting: many callers had to take care
    of it by themselves, so let's make this simpler, even if on x86
    this adds more code than really needed.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  3. net: merge net->core.prot_inuse and net->core.sock_inuse

    net->core.sock_inuse is a per-cpu variable (int),
    while net->core.prot_inuse is another per-cpu variable
    of 64 integers.
    
    The per-cpu allocator tends to place them in very different places.
    
    Grouping them together makes sense, since it potentially makes
    updates faster when they hit the same cache line.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  4. net: make sock_inuse_add() available

    MPTCP hard-codes it; let us provide this helper instead.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  5. net: inline sock_prot_inuse_add()

    sock_prot_inuse_add() is very small, we can inline it.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  6. Merge branch 'gro-out-of-core-files'

    Eric Dumazet says:
    
    ====================
    gro: get out of core files
    
    Move GRO related content into net/core/gro.c
    and include/net/gro.h.
    
    This reduces GRO scope to where it is really needed,
    and shrinks overly large files (include/linux/netdevice.h
    and net/core/dev.c)
    ====================
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    davem330 committed Nov 16, 2021
  7. net: gro: populate net/core/gro.c

    Move gro code and data from net/core/dev.c to net/core/gro.c
    to ease maintenance.
    
    gro_normal_list() and gro_normal_one() are inlined
    because they are called from both files.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  8. net: gro: move skb_gro_receive into net/core/gro.c

    net/core/gro.c will contain all core gro functions,
    to shrink net/core/skbuff.c and net/core/dev.c
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  9. net: gro: move skb_gro_receive_list to udp_offload.c

    This helper is used only once; no need to keep it in the large net/core/skbuff.c
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  10. net: move gro definitions to include/net/gro.h

    include/linux/netdevice.h became too big; move GRO definitions
    into include/net/gro.h
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  11. Merge branch 'tcp-optimizations'

    Eric Dumazet says:
    
    ====================
    tcp: optimizations for linux-5.17
    
    Mostly small improvements in this series.
    
    The notable change is in "defer skb freeing after
    socket lock is released" in recvmsg() (and RX zerocopy)
    
     The idea is to try to leave skb freeing to the BH handler,
    whenever possible, or at least perform the freeing
    outside of the socket lock section, for much improved
    performance. This idea can probably be extended
    to other protocols.
    
     Tests on a 100Gbit NIC
     Max throughput for one TCP_STREAM flow, over 10 runs.
    
     MTU : 1500  (1428 bytes of TCP payload per MSS)
     Before: 55 Gbit
     After:  66 Gbit
    
     MTU : 4096+ (4096 bytes of TCP payload, plus TCP/IPv6 headers)
     Before: 82 Gbit
     After:  95 Gbit
    ====================
    
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    davem330 committed Nov 16, 2021
  12. net: move early demux fields close to sk_refcnt

    sk_rx_dst/sk_rx_dst_ifindex/sk_rx_dst_cookie are read in early demux,
    and currently spans two cache lines.
    
    Moving them close to sk_refcnt makes more sense, as only one cache
    line is needed.
    
    New layout for this hot cache line is :
    
    struct sock {
    	struct sock_common         __sk_common;          /*     0  0x88 */
    	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
    	struct dst_entry *         sk_rx_dst;            /*  0x88   0x8 */
    	int                        sk_rx_dst_ifindex;    /*  0x90   0x4 */
    	u32                        sk_rx_dst_cookie;     /*  0x94   0x4 */
    	socket_lock_t              sk_lock;              /*  0x98  0x20 */
    	atomic_t                   sk_drops;             /*  0xb8   0x4 */
    	int                        sk_rcvlowat;          /*  0xbc   0x4 */
    	/* --- cacheline 3 boundary (192 bytes) --- */
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
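The layout above can be checked with offsetof(). A toy stand-in (the 0x88-byte array replaces struct sock_common; field names follow the commit, the rest of struct sock is omitted) shows the whole early-demux group landing in one 64-byte cache line:

```c
#include <stddef.h>

/* Simplified stand-in for the new struct sock layout: the hot
 * early-demux fields are grouped right after the 0x88-byte common
 * part, so an early-demux lookup touches a single cache line. */
struct sock_sketch {
    char         __sk_common[0x88];   /* stand-in for sock_common */
    void        *sk_rx_dst;           /* 0x88 */
    int          sk_rx_dst_ifindex;
    unsigned int sk_rx_dst_cookie;
};

/* first and last byte of the early-demux group share cacheline 2 */
int same_cacheline(void)
{
    size_t first = offsetof(struct sock_sketch, sk_rx_dst);
    size_t last  = offsetof(struct sock_sketch, sk_rx_dst_cookie)
                   + sizeof(unsigned int) - 1;
    return (first / 64) == (last / 64);
}
```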
  13. tcp: do not call tcp_cleanup_rbuf() if we have a backlog

    Under pressure, tcp recvmsg() has logic to process the socket backlog,
    but calls tcp_cleanup_rbuf() right before.
    
    Avoiding sending ACK right before processing new segments makes
    a lot of sense, as this decreases the number of ACK packets,
    with no impact on effective ACK clocking.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  14. tcp: check local var (timeo) before socket fields in one test

    Testing timeo before sk_err/sk_state/sk_shutdown makes more sense.
    
    Modern applications use non-blocking IO, while a socket is terminated
    only once during its lifetime.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  15. tcp: defer skb freeing after socket lock is released

    tcp recvmsg() (or rx zerocopy) spends a fair amount of time
    freeing skbs after their payload has been consumed.
    
    A typical ~64KB GRO packet has to release ~45 page
    references, eventually going to page allocator
    for each of them.
    
    Currently, this freeing is performed while socket lock
    is held, meaning that there is a high chance that
    BH handler has to queue incoming packets to tcp socket backlog.
    
    This can cause additional latencies, because the user
    thread has to process the backlog at release_sock() time,
    and while doing so, additional frames can be added
    by BH handler.
    
    This patch adds logic to defer these frees after socket
    lock is released, or directly from BH handler if possible.
    
    Being able to free these skbs from BH handler helps a lot,
    because this avoids the usual alloc/free asymmetry,
    when BH handler and user thread do not run on same cpu or
    NUMA node.
    
    One cpu can now be fully utilized for the kernel->user copy,
    and another cpu is handling BH processing and skb/page
    allocs/frees (assuming RFS is not forcing use of a single CPU)
    
    Tested:
     100Gbit NIC
     Max throughput for one TCP_STREAM flow, over 10 runs
    
    MTU : 1500
    Before: 55 Gbit
    After:  66 Gbit
    
    MTU : 4096+(headers)
    Before: 82 Gbit
    After:  95 Gbit
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
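The deferral logic can be illustrated with a toy singly-linked list. This is only a sketch of the idea; all names are hypothetical, and in the kernel the drain is tied to releasing the socket lock (or handed to the BH handler when possible):

```c
#include <stdlib.h>

struct buf {
    struct buf *next;
    /* payload omitted */
};

struct sk_sketch {
    struct buf *defer_list;   /* buffers waiting to be freed */
};

/* called with the "socket lock" held: O(1), no allocator work */
void defer_free(struct sk_sketch *sk, struct buf *b)
{
    b->next = sk->defer_list;
    sk->defer_list = b;
}

/* called after the lock is released: drain and free everything,
 * returning how many buffers were released */
int flush_deferred(struct sk_sketch *sk)
{
    int n = 0;
    struct buf *b = sk->defer_list;

    sk->defer_list = NULL;
    while (b) {
        struct buf *next = b->next;
        free(b);
        b = next;
        n++;
    }
    return n;
}
```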
  16. tcp: avoid indirect calls to sock_rfree

    TCP uses sk_eat_skb() when skbs can be removed from the receive queue.
    However, the call to skb_orphan() from __kfree_skb() incurs
    an indirect call to sock_rfree(), which is more expensive than
    a direct call, especially with CONFIG_RETPOLINE=y.
    
    Add tcp_eat_recv_skb() function to make the call before
    __kfree_skb().
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  17. tcp: tp->urg_data is unlikely to be set

    Use some unlikely() hints in the fast path.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
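unlikely() boils down to __builtin_expect(); a minimal sketch of the hint, with a made-up stand-in for the urgent-data fast path:

```c
/* Simplified unlikely(): tells the compiler the branch is rarely
 * taken, so the cold urgent-data path is laid out off the hot path. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* toy fast path: divert to the slow path only when urgent data is
 * pending, which almost never happens */
int recv_one_byte(int urg_data, int byte)
{
    if (unlikely(urg_data))
        return -1;   /* would fall into the urgent-data handling */
    return byte;
}
```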
  18. tcp: annotate races around tp->urg_data

    tcp_poll() and tcp_ioctl() are reading tp->urg_data without socket lock
    owned.
    
    Also, it is faster to first check tp->urg_data in tcp_poll(),
    then tp->urg_seq == tp->copied_seq, because tp->urg_seq is
    located in a different/cold cache line.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  19. tcp: annotate data-races on tp->segs_in and tp->data_segs_in

    tcp_segs_in() can be called from BH while the socket spinlock
    is held but the socket is owned by user context, which can be
    concurrently reading these fields from tcp_get_info()
    
    Found by code inspection, no need to backport this patch
    to older kernels.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  20. tcp: add RETPOLINE mitigation to sk_backlog_rcv

    Use INDIRECT_CALL_INET() to avoid an indirect call
    when/if CONFIG_RETPOLINE=y
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
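INDIRECT_CALL_INET() follows the pattern below: compare the function pointer against a known target and turn the common case into a direct, predictable call. This is a simplified single-target sketch with made-up receive functions:

```c
/* Simplified single-target version of the kernel's INDIRECT_CALL_*()
 * macros: if the pointer matches the expected target, emit a direct
 * call (cheap under retpolines); otherwise fall back to the
 * indirect call. */
#define INDIRECT_CALL_1(fptr, known, ...) \
    ((fptr) == (known) ? (known)(__VA_ARGS__) : (fptr)(__VA_ARGS__))

static int tcp_rcv_sketch(int x) { return x + 1; }
static int udp_rcv_sketch(int x) { return x + 2; }

/* dispatcher: the TCP case becomes a direct call */
int dispatch(int (*rcv)(int), int x)
{
    return INDIRECT_CALL_1(rcv, tcp_rcv_sketch, x);
}
```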
  21. tcp: small optimization in tcp recvmsg()

    When reading large chunks of data, incoming packets might
    be added to the backlog from BH.
    
    tcp recvmsg() detects the backlog queue is not empty, and uses
    a release_sock()/lock_sock() pair to process this backlog.
    
    We now have __sk_flush_backlog() to perform this
    a bit faster.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  22. net: cache align tcp_memory_allocated, tcp_sockets_allocated

    tcp_memory_allocated and tcp_sockets_allocated often share
    a common cache line, source of false sharing.
    
    Also take care of udp_memory_allocated and mptcp_sockets_allocated.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
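The fix amounts to giving each hot counter its own cache line, as the kernel's alignment attributes do. A userspace sketch assuming a 64-byte line size (variable names are illustrative):

```c
#include <stdint.h>

#define CACHELINE 64
#define cacheline_aligned __attribute__((aligned(CACHELINE)))

/* each counter starts on its own 64-byte line, so writers on
 * different CPUs no longer false-share */
long tcp_memory_allocated_sketch  cacheline_aligned;
long tcp_sockets_allocated_sketch cacheline_aligned;

/* with the attribute, the two counters cannot share a line */
int distinct_lines(void)
{
    return ((uintptr_t)&tcp_memory_allocated_sketch / CACHELINE) !=
           ((uintptr_t)&tcp_sockets_allocated_sketch / CACHELINE);
}
```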
  23. net: forward_alloc_get depends on CONFIG_MPTCP

    (struct proto)->sk_forward_alloc is currently only used by MPTCP.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  24. net: shrink struct sock by 8 bytes

    Move sk_bind_phc next to sk_peer_lock to fill a hole.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  25. ipv6: shrink struct ipcm6_cookie

    gso_size can be moved after tclass, to use an existing hole.
    (8 bytes saved on 64bit arches)
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
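The hole-filling trick can be seen with two toy layouts (fields are illustrative; on 64-bit arches, moving the 4-byte field next to another 4-byte field saves the 8 bytes the commit mentions):

```c
/* "before": each 4-byte field sits between 8-byte-aligned members,
 * so the compiler inserts 4 bytes of padding after each of them */
struct before {
    void *ptr_a;     /* 8 bytes */
    int   tclass;    /* 4 bytes + 4-byte hole */
    void *ptr_b;     /* 8 bytes */
    int   gso_size;  /* 4 bytes + 4-byte tail pad */
};

/* "after": gso_size moved into the former hole after tclass */
struct after {
    void *ptr_a;
    int   tclass;
    int   gso_size;
    void *ptr_b;
};

/* on LP64, 32 - 24 = 8 bytes saved */
int bytes_saved(void)
{
    return (int)(sizeof(struct before) - sizeof(struct after));
}
```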
  26. net: remove sk_route_nocaps

    Instead of using a full netdev_features_t, we can use a single bit,
    as sk_route_nocaps is only used to remove NETIF_F_GSO_MASK from
    sk->sk_route_caps.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  27. net: remove sk_route_forced_caps

    We were only using one bit, and we can replace it with sk_is_tcp().
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  28. net: use sk_is_tcp() in more places

    Move sk_is_tcp() to include/net/sock.h and use it where we can.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  29. tcp: small optimization in tcp_v6_send_check()

    For TCP flows, inet6_sk(sk)->saddr has the same value
    as sk->sk_v6_rcv_saddr.
    
    Using sk->sk_v6_rcv_saddr increases data locality.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  30. tcp: remove dead code in __tcp_v6_send_check()

    For some reason, I forgot to change __tcp_v6_send_check() at
    the same time I removed (ip_summed == CHECKSUM_PARTIAL) check
    in __tcp_v4_send_check()
    
    Fixes: 98be9b1 ("tcp: remove dead code after CHECKSUM_PARTIAL adoption")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021
  31. tcp: minor optimization in tcp_add_backlog()

    If a packet is going to be coalesced, sk_sndbuf/sk_rcvbuf values
    are not used. Defer their access to the point we need them.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    neebe000 authored and davem330 committed Nov 16, 2021