Jussi-Maki/XDP…

Commits on Jun 16, 2021

  1. selftests/bpf: Add tests for XDP bonding

    Add a test suite to test XDP bonding implementation
    over a pair of veth devices.
    
    Signed-off-by: Jussi Maki <joamaki@gmail.com>
    joamaki authored and intel-lab-lkp committed Jun 16, 2021
  2. net: bonding: Use per-cpu rr_tx_counter

    The round-robin rr_tx_counter was shared across CPUs, leading
    to significant cache thrashing at high packet rates. This patch
    switches the round-robin mechanism to a per-CPU counter for
    deciding the destination device.
    
    In a 100Gbit, 64-byte packet test this reduces the CPU load from
    50% to 10% on the test system.
    
    Signed-off-by: Jussi Maki <joamaki@gmail.com>
    joamaki authored and intel-lab-lkp committed Jun 16, 2021
  3. net: bonding: Add XDP support to the bonding driver

    XDP is implemented in the bonding driver by transparently delegating
    the XDP program loading, removal and xmit operations to the bonding
    slave devices. The overall goal of this work is that XDP programs
    can be attached to a bond device *without* any further changes (or
    awareness) necessary to the program itself, meaning the same XDP
    program can be attached to a native device but also a bonding device.
    
    When attached to a bond, the semantics of XDP_TX are equivalent
    to those of a tc/BPF program attached to the bond: the packet is
    transmitted out of the bond itself using one of the bond's
    configured xmit methods to select a slave device (rather than
    XDP_TX on the slave itself). Handling of XDP_TX to transmit
    using the configured bonding mechanism is therefore implemented by
    rewriting the BPF program return value in bpf_prog_run_xdp. To avoid
    a performance impact, this check is guarded by a static key, which is
    incremented when an XDP program is loaded onto a bond device. This
    approach was chosen to avoid changes to drivers implementing XDP. If
    the slave device does not match the receive device, then XDP_REDIRECT
    is transparently used to perform the redirection so that the network
    driver releases the packet from its RX ring. The bonding driver's
    hashing functions have been refactored to allow reuse with
    xdp_buff's and avoid code duplication.
    
    The motivation for this change is to enable the use of bonding (and
    802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
    XDP, and to transparently support bond devices for projects that use
    XDP, given that most modern NICs are dual-port adapters. An alternative
    to this approach would be to implement 802.3ad in user space and do the
    bonding load-balancing in the XDP program itself, but this is a rather
    cumbersome endeavor in terms of slave device management (e.g. watching
    netlink) and requires separate programs for the native vs. bond cases
    in the orchestrator. A native in-kernel implementation overcomes these
    issues and provides more flexibility.
    
    Below are benchmark results obtained on two machines with 100Gbit
    Intel E810 (ice) NICs, with a 32-core 3970X on the sending machine and
    a 16-core 3950X on the receiving machine. 64-byte packets were sent with
    pktgen-dpdk at full rate. Two issues [2, 3] were identified with the
    ice driver, so the tests were performed with iommu=off and patch [2]
    applied. Additionally, the bonding round-robin algorithm was modified
    to use per-CPU tx counters, as the shared rr_tx_counter caused high CPU
    load (50% vs 10%) and a high rate of cache misses (see patch 2/3). The
    statistics were collected using "sar -n dev -u 1 10".
    
     -----------------------|  CPU  |--| rxpck/s |--| txpck/s |----
     without patch (1 dev):
       XDP_DROP:              3.15%      48.6Mpps
       XDP_TX:                3.12%      18.3Mpps     18.3Mpps
       XDP_DROP (RSS):        9.47%      116.5Mpps
       XDP_TX (RSS):          9.67%      25.3Mpps     24.2Mpps
     -----------------------
     with patch, bond (1 dev):
       XDP_DROP:              3.14%      46.7Mpps
       XDP_TX:                3.15%      13.9Mpps     13.9Mpps
       XDP_DROP (RSS):        10.33%     117.2Mpps
       XDP_TX (RSS):          10.64%     25.1Mpps     24.0Mpps
     -----------------------
     with patch, bond (2 devs):
       XDP_DROP:              6.27%      92.7Mpps
       XDP_TX:                6.26%      17.6Mpps     17.5Mpps
       XDP_DROP (RSS):       11.38%      117.2Mpps
       XDP_TX (RSS):         14.30%      28.7Mpps     27.4Mpps
     --------------------------------------------------------------
    
    RSS: Receive Side Scaling, i.e. the packets were sent to a range of
    destination IPs.
    
    [1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
    [2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
    [3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/
    
    Signed-off-by: Jussi Maki <joamaki@gmail.com>
    joamaki authored and intel-lab-lkp committed Jun 16, 2021

Commits on Jun 15, 2021

  1. Merge branch 'bpf-sock-migration'

    Kuniyuki Iwashima says:
    
    ====================
    The SO_REUSEPORT option allows sockets to listen on the same port and to
    accept connections evenly. However, there is a defect in the current
    implementation [1]. When a SYN packet is received, the connection is tied
    to a listening socket. Accordingly, when the listener is closed, in-flight
    requests during the three-way handshake and child sockets in the accept
    queue are dropped even if other listeners on the same port could accept
    such connections.
    
    This situation can happen when various server management tools restart
    server processes (such as nginx). For instance, when we change nginx
    configurations and restart it, it spins up new workers that respect the
    new configuration and closes all listeners on the old workers, so the
    in-flight ACKs of the 3WHS are answered with RST.
    
    To avoid such a situation, users have to know deeply how the kernel handles
    SYN packets and implement connection draining by eBPF [2]:
    
      1. Stop routing SYN packets to the listener by eBPF.
      2. Wait for all timers to expire to complete requests.
      3. Accept connections until EAGAIN, then close the listener.
    
      or
    
      1. Start counting SYN packets and accept syscalls using the eBPF map.
      2. Stop routing SYN packets.
      3. Accept connections up to the count, then close the listener.
    
    Either way, we cannot close a listener immediately. However, ideally,
    the application need not drain the not yet accepted sockets because 3WHS
    and tying a connection to a listener are just the kernel behaviour. The
    root cause is within the kernel, so the issue should be addressed in kernel
    space and should not be visible to user space. This patchset fixes it so
    that users need not take care of kernel implementation and connection
    draining. With this patchset, the kernel redistributes requests and
    connections from a listener to the others in the same reuseport group
    at/after close or shutdown syscalls.
    
    Although some software does connection draining, there are still merits in
    migration. For some security reasons, such as replacing TLS certificates,
    we may want to apply new settings as soon as possible and/or we may not be
    able to wait for connection draining. The sockets in the accept queue have
    not started application sessions yet. So, if we do not drain such sockets,
    they can be handled by the newer listeners and could have a longer
    lifetime. It is difficult to drain all connections in every case, but we
    can decrease such aborted connections by migration. In that sense,
    migration is always better than draining.
    
    Moreover, auto-migration simplifies user space logic and also works well in
    a case where we cannot modify and build a server program to implement the
    workaround.
    
    Note that the source and destination listeners MUST have the same settings
    at the socket API level; otherwise, applications may face inconsistency and
    cause errors. In such a case, we have to use the eBPF program to select a
    specific listener or to cancel migration.
    
    Special thanks to Martin KaFai Lau for bouncing ideas and exchanging code
    snippets along the way.
    
    Link:
     [1] The SO_REUSEPORT socket option
     https://lwn.net/Articles/542629/
    
     [2] Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
     https://lore.kernel.org/netdev/1458828813.10868.65.camel@edumazet-glaptop3.roam.corp.google.com/
    
    Changelog:
     v8:
      * Make reuse const in reuseport_sock_index()
      * Don't use __reuseport_add_sock() in reuseport_alloc()
      * Change the arg of the second memcpy() in reuseport_grow()
      * Fix coding style to use goto in reuseport_alloc()
      * Keep sk_refcnt uninitialized in inet_reqsk_clone()
      * Initialize ireq_opt and ipv6_opt separately in reqsk_migrate_reset()
    
      [ This series does not include a stats patch suggested by Yuchung Cheng
        not to drop Acked-by/Reviewed-by tags and save reviewer's time. I will
        post the patch as a follow up after this series is merged. ]
    
     v7:
     https://lore.kernel.org/bpf/20210521182104.18273-1-kuniyu@amazon.co.jp/
      * Prevent attaching/detaching a bpf prog via shutdowned socket
      * Fix typo in commit messages
      * Split selftest into subtests
    
     v6:
     https://lore.kernel.org/bpf/20210517002258.75019-1-kuniyu@amazon.co.jp/
      * Change description in ip-sysctl.rst
      * Test IPPROTO_TCP before reading tfo_listener
      * Move reqsk_clone() to inet_connection_sock.c and rename to
        inet_reqsk_clone()
      * Pass req->rsk_listener to inet_csk_reqsk_queue_drop() and
        reqsk_queue_removed() in the migration path of receiving ACK
      * s/ARG_PTR_TO_SOCKET/PTR_TO_SOCKET/ in sk_reuseport_is_valid_access()
      * In selftest, use atomic ops to increment global vars, drop ACK by XDP,
        enable force fastopen, use "skel->bss" instead of "skel->data"
    
     v5:
     https://lore.kernel.org/bpf/20210510034433.52818-1-kuniyu@amazon.co.jp/
      * Move initialization of sk_node from 6th to 5th patch
      * Initialize sk_refcnt in reqsk_clone()
      * Modify some definitions in reqsk_timer_handler()
      * Validate in which path/state migration happens in selftest
    
     v4:
     https://lore.kernel.org/bpf/20210427034623.46528-1-kuniyu@amazon.co.jp/
      * Make some functions and variables 'static' in selftest
      * Remove 'scalability' from the cover letter
    
     v3:
     https://lore.kernel.org/bpf/20210420154140.80034-1-kuniyu@amazon.co.jp/
      * Add sysctl back for reuseport_grow()
      * Add helper functions to manage socks[]
      * Separate migration related logic into functions: reuseport_resurrect(),
        reuseport_stop_listen_sock(), reuseport_migrate_sock()
      * Clone request_sock to be migrated
      * Migrate request one by one
      * Pass child socket to eBPF prog
    
     v2:
     https://lore.kernel.org/netdev/20201207132456.65472-1-kuniyu@amazon.co.jp/
      * Do not save closed sockets in socks[]
      * Revert 607904c
      * Extract inet_csk_reqsk_queue_migrate() into a single patch
      * Change the spin_lock order to avoid lockdep warning
      * Add static to __reuseport_select_sock
      * Use refcount_inc_not_zero() in reuseport_select_migrated_sock()
      * Set the default attach type in bpf_prog_load_check_attach()
      * Define new proto of BPF_FUNC_get_socket_cookie
      * Fix test to be compiled successfully
      * Update commit messages
    
     v1:
     https://lore.kernel.org/netdev/20201201144418.35045-1-kuniyu@amazon.co.jp/
      * Remove the sysctl option
      * Enable migration if eBPF program is not attached
      * Add expected_attach_type to check if eBPF program can migrate sockets
      * Add a field to tell migration type to eBPF program
      * Support BPF_FUNC_get_socket_cookie to get the cookie of sk
      * Allocate an empty skb if skb is NULL
      * Pass req_to_sk(req)->sk_hash because listener's hash is zero
      * Update commit messages and cover letter
    
     RFC:
     https://lore.kernel.org/netdev/20201117094023.3685-1-kuniyu@amazon.co.jp/
    ====================
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    borkmann committed Jun 15, 2021
  2. bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.

    This patch adds a test for BPF_SK_REUSEPORT_SELECT_OR_MIGRATE and
    removes 'static' from settimeo() in network_helpers.c.
    
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/bpf/20210612123224.12525-12-kuniyu@amazon.co.jp
    q2ven authored and borkmann committed Jun 15, 2021
  3. libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT.

    This commit introduces a new section (sk_reuseport/migrate) and sets
    the corresponding expected_attach_type for each of the two sections of
    BPF_PROG_TYPE_SK_REUSEPORT programs.
    
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/bpf/20210612123224.12525-11-kuniyu@amazon.co.jp
    q2ven authored and borkmann committed Jun 15, 2021
  4. bpf: Support socket migration by eBPF.

    This patch introduces a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT
    to check if the attached eBPF program is capable of migrating sockets. When
    the eBPF program is attached, we run it for socket migration if the
    expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE or
    net.ipv4.tcp_migrate_req is enabled.
    
    Currently, the expected_attach_type is not enforced for the
    BPF_PROG_TYPE_SK_REUSEPORT type of program. Thus, this commit follows the
    earlier idea in the commit aac3fc3 ("bpf: Post-hooks for sys_bind") to
    fix up the zero expected_attach_type in bpf_prog_load_fixup_attach_type().
    
    Moreover, this patch adds a new field (migrating_sk) to sk_reuseport_md
    to select a new listener based on the child socket. migrating_sk varies
    depending on whether it is migrating a request in the accept queue or
    one during the 3WHS:
    
      - accept_queue : sock (ESTABLISHED/SYN_RECV)
      - 3WHS         : request_sock (NEW_SYN_RECV)
    
    In the eBPF program, we can select a new listener by
    BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning
    SK_DROP. This feature is useful when listeners have different settings at
    the socket API level or when we want to free resources as soon as possible.
    
      - SK_PASS with selected_sk: select it as the new listener
      - SK_PASS with selected_sk NULL: fall back to the random selection
      - SK_DROP: cancel the migration
    
    There is a noteworthy point. We select a listening socket in three
    places, but we do not have a struct skb when closing a listener or
    retransmitting a SYN+ACK. On the other hand, some helper functions do
    not expect skb to be NULL (e.g. skb_header_pointer() in
    BPF_FUNC_skb_load_bytes(), skb_tail_pointer() in
    BPF_FUNC_skb_load_bytes_relative()). So we temporarily allocate an
    empty skb before running the eBPF program.
    
    Suggested-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/
    Link: https://lore.kernel.org/netdev/20201203042402.6cskdlit5f3mw4ru@kafai-mbp.dhcp.thefacebook.com/
    Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/
    Link: https://lore.kernel.org/bpf/20210612123224.12525-10-kuniyu@amazon.co.jp
    q2ven authored and borkmann committed Jun 15, 2021
  5. bpf: Support BPF_FUNC_get_socket_cookie() for BPF_PROG_TYPE_SK_REUSEPORT.
    
    We will call sock_reuseport.prog for socket migration in the next commit,
    so the eBPF program has to know which listener is closing to select a new
    listener.
    
    We can currently get a unique ID for each listener in user space by
    calling bpf_map_lookup_elem() on a BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map.
    
    This patch makes the pointer of sk available in sk_reuseport_md so that we
    can get the ID by BPF_FUNC_get_socket_cookie() in the eBPF program.
    
    Suggested-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/netdev/20201119001154.kapwihc2plp4f7zc@kafai-mbp.dhcp.thefacebook.com/
    Link: https://lore.kernel.org/bpf/20210612123224.12525-9-kuniyu@amazon.co.jp
    q2ven authored and borkmann committed Jun 15, 2021
  6. tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK.

    This patch also changes the code to call reuseport_migrate_sock() and
    inet_reqsk_clone(), but unlike the other cases, we do not call
    inet_reqsk_clone() right after reuseport_migrate_sock().
    
    Currently, in the receive path for TCP_NEW_SYN_RECV sockets, its listener
    has three kinds of refcnt:
    
      (A) for listener itself
      (B) carried by request_sock
      (C) sock_hold() in tcp_v[46]_rcv()
    
    While processing the req, (A) may disappear by close(listener). Also, (B)
    can disappear by accept(listener) once we put the req into the accept
    queue. So, we have to hold another refcnt (C) for the listener to prevent
    use-after-free.
    
    For socket migration, we call reuseport_migrate_sock() to select a listener
    with (A) and to increment the new listener's refcnt in tcp_v[46]_rcv().
    This refcnt corresponds to (C) and is cleaned up later in tcp_v[46]_rcv().
    Thus we have to take another refcnt (B) for the newly cloned request_sock.
    
    In inet_csk_complete_hashdance(), we hold the count (B), clone the req, and
    try to put the new req into the accept queue. By migrating req after
    winning the "own_req" race, we can avoid such a worst situation:
    
      CPU 1 looks up req1
      CPU 2 looks up req1, unhashes it, then CPU 1 loses the race
      CPU 3 looks up req2, unhashes it, then CPU 2 loses the race
      ...
    
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/bpf/20210612123224.12525-8-kuniyu@amazon.co.jp
    q2ven authored and borkmann committed Jun 15, 2021
  7. tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs.

    As with the preceding patch, this patch changes reqsk_timer_handler() to
    call reuseport_migrate_sock() and inet_reqsk_clone() to migrate in-flight
    requests at retransmitting SYN+ACKs. If we can select a new listener and
    clone the request, we resume setting the SYN+ACK timer for the new req. If
    we can set the timer, we call inet_ehash_insert() to unhash the old req and
    put the new req into ehash.
    
    The noteworthy point here is that by unhashing the old req, another CPU
    processing it may lose the "own_req" race in tcp_v[46]_syn_recv_sock() and
    drop the final ACK packet. However, the new timer will recover this
    situation.
    
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/bpf/20210612123224.12525-7-kuniyu@amazon.co.jp
    q2ven authored and borkmann committed Jun 15, 2021
  8. tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.

    When we call close() or shutdown() for a listening socket, each child
    socket in its accept queue is freed at inet_csk_listen_stop(). If we can get a
    new listener by reuseport_migrate_sock() and clone the request by
    inet_reqsk_clone(), we try to add it into the new listener's accept queue
    by inet_csk_reqsk_queue_add(). If it fails, we have to call __reqsk_free()
    to call sock_put() for its listener and free the cloned request.
    
    After putting the full socket into ehash, tcp_v[46]_syn_recv_sock() sets
    ireq_opt/pktopts in struct inet_request_sock to NULL, but ipv6_opt can be
    non-NULL. So, we have to set the old request's ipv6_opt to NULL to avoid
    a double free.
    
    Note that we do not update req->rsk_listener and instead clone the req to
    migrate because another path may reference the original request. If we
    protected it by RCU, we would need to add rcu_read_lock() in many places.
    
    Suggested-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/
    Link: https://lore.kernel.org/bpf/20210612123224.12525-6-kuniyu@amazon.co.jp
    q2ven authored and borkmann committed Jun 15, 2021
  9. tcp: Add reuseport_migrate_sock() to select a new listener.

    reuseport_migrate_sock() does the same check done in
    reuseport_listen_stop_sock(). If the reuseport group is capable of
    migration, reuseport_migrate_sock() selects a new listener by the child
    socket hash and increments the listener's sk_refcnt beforehand. Thus, if we
    fail in the migration, we have to decrement it later.
    
    We will support migration by eBPF in the later commits.
    
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/bpf/20210612123224.12525-5-kuniyu@amazon.co.jp
    q2ven authored and borkmann committed Jun 15, 2021
  10. tcp: Keep TCP_CLOSE sockets in the reuseport group.

    When we close a listening socket, to migrate its connections to another
    listener in the same reuseport group, we have to handle two kinds of child
    sockets. One is that a listening socket has a reference to, and the other
    is not.
    
    The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the
    accept queue of their listening socket. So we can pop them out and push
    them into another listener's queue at close() or shutdown() syscalls. On
    the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the
    three-way handshake and not in the accept queue. Thus, we cannot access
    such sockets at close() or shutdown() syscalls. Accordingly, we have to
    migrate immature sockets after their listening socket has been closed.
    
    Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV
    sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At
    that time, if we could select a new listener from the same reuseport group,
    no connection would be aborted. However, we cannot do that because
    reuseport_detach_sock() sets sk_reuseport_cb to NULL and forbids access
    to the reuseport group from closed sockets.
    
    This patch allows TCP_CLOSE sockets to remain in the reuseport group and
    access it while any child socket references them. The point is that
    reuseport_detach_sock() was called twice from inet_unhash() and
    sk_destruct(). This patch replaces the first reuseport_detach_sock() with
    reuseport_stop_listen_sock(), which checks if the reuseport group is
    capable of migration. If capable, it decrements num_socks, moves the socket
    backwards in socks[] and increments num_closed_socks. When all connections
    are migrated, sk_destruct() calls reuseport_detach_sock() to remove the
    socket from socks[], decrement num_closed_socks, and set NULL to
    sk_reuseport_cb.
    
    With this change, closed or shut-down sockets can keep sk_reuseport_cb.
    Consequently, calling listen() after shutdown() can cause EADDRINUSE or
    EBUSY in inet_csk_bind_conflict() or reuseport_add_sock(), which expect
    such sockets not to have a reuseport group. Therefore, this patch also
    loosens those validation rules so that a socket can listen again if it
    has a reuseport group with num_closed_socks greater than 0.
    
    When such sockets listen again, we handle them in reuseport_resurrect(). If
    there is an existing reuseport group (reuseport_add_sock() path), we move
    the socket from the old group to the new one and free the old one if
    necessary. If there is no existing group (reuseport_alloc() path), we
    allocate a new reuseport group, detach sk from the old one, and free it if
    necessary, not to break the current shutdown behaviour:
    
      - we cannot carry over the eBPF prog of shut-down sockets
      - we cannot attach/detach an eBPF prog to/from listening sockets via
        shut-down sockets
    
    Note that when the number of sockets gets over U16_MAX, we try to detach a
    closed socket randomly to make room for the new listening socket in
    reuseport_grow().
    
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/bpf/20210612123224.12525-4-kuniyu@amazon.co.jp
    q2ven authored and borkmann committed Jun 15, 2021
  11. tcp: Add num_closed_socks to struct sock_reuseport.

    As noted in the following commit, a closed listener has to hold the
    reference to the reuseport group for socket migration. This patch adds a
    field (num_closed_socks) to struct sock_reuseport to manage closed sockets
    within the same reuseport group. Moreover, this and the following commits
    introduce some helper functions to split socks[] into two sections and keep
    TCP_LISTEN and TCP_CLOSE sockets in each section. Like a double-ended
    queue, we will place TCP_LISTEN sockets from the front and TCP_CLOSE
    sockets from the end.
    
      TCP_LISTEN---------->       <-------TCP_CLOSE
      +---+---+  ---  +---+  ---  +---+  ---  +---+
      | 0 | 1 |  ...  | i |  ...  | j |  ...  | k |
      +---+---+  ---  +---+  ---  +---+  ---  +---+
    
      i = num_socks - 1
      j = max_socks - num_closed_socks
      k = max_socks - 1
    
    This patch also extends reuseport_add_sock() and reuseport_grow() to
    support num_closed_socks.
    
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/bpf/20210612123224.12525-3-kuniyu@amazon.co.jp
    q2ven authored and borkmann committed Jun 15, 2021
  12. net: Introduce net.ipv4.tcp_migrate_req.

    This commit adds a new sysctl option: net.ipv4.tcp_migrate_req. If this
    option is enabled or an eBPF program is attached, we will be able to
    migrate child sockets from one listener to another in the same reuseport
    group after close() or shutdown() syscalls.
    
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/bpf/20210612123224.12525-2-kuniyu@amazon.co.jp
    q2ven authored and borkmann committed Jun 15, 2021
  13. libbpf: Set NLM_F_EXCL when creating qdisc

    This got lost during refactoring across versions. We always use
    NLM_F_EXCL when creating a TC object, so reflect what the function
    says and set the flag.
    
    Fixes: 715c5ce ("libbpf: Add low level TC-BPF management API")
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20210612023502.1283837-3-memxor@gmail.com
    kkdwivedi authored and borkmann committed Jun 15, 2021
  14. libbpf: Remove unneeded check for flags during tc detach

    Coverity complained about this being unreachable code. It is right
    because we already enforce flags to be unset, so a check validating
    the flag value is redundant.
    
    Fixes: 715c5ce ("libbpf: Add low level TC-BPF management API")
    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20210612023502.1283837-2-memxor@gmail.com
    kkdwivedi authored and borkmann committed Jun 15, 2021

Commits on Jun 11, 2021

  1. tools/bpftool: Fix error return code in do_batch()

    Return a negative error code from the error-handling path instead
    of 0, as done elsewhere in this function.
    
    Fixes: 668da74 ("tools: bpftool: add support for quotations ...")
    Reported-by: Hulk Robot <hulkci@huawei.com>
    Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Reviewed-by: Quentin Monnet <quentin@isovalent.com>
    Link: https://lore.kernel.org/bpf/20210609115916.2186872-1-chengzhihao1@huawei.com
    Zhihao Cheng authored and anakryiko committed Jun 11, 2021
  2. libbpf: Simplify the return expression of bpf_object__init_maps function

    There is no need for special treatment of the 'ret == 0' case.
    This patch simplifies the return expression.
    
    Signed-off-by: Wang Hai <wanghai38@huawei.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20210609115651.3392580-1-wanghai38@huawei.com
    Wang Hai authored and anakryiko committed Jun 11, 2021

Commits on Jun 8, 2021

  1. selftests, bpf: Make docs tests fail more reliably

    Previously, if rst2man caught errors, then these would be ignored and
    the output file would be written anyway. This would allow developers to
    introduce regressions in the docs comments in the BPF headers.
    
    Additionally, even if you instruct rst2man to fail out, it will still
    write out to the destination target file, so if you ran the tests twice
    in a row it would always pass. Use a temporary file for the initial run
    to ensure that if rst2man fails out under "--strict" mode, subsequent
    runs will not automatically pass.
    
    Tested via ./tools/testing/selftests/bpf/test_doc_build.sh
    
    Signed-off-by: Joe Stringer <joe@cilium.io>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Quentin Monnet <quentin@isovalent.com>
    Link: https://lore.kernel.org/bpf/20210608015756.340385-1-joe@cilium.io
    joestringer authored and borkmann committed Jun 8, 2021
  2. libbpf: Fix pr_warn type warnings on 32bit

    The printed value is a ptrdiff_t and is formatted with %ld. This works
    on 64-bit but produces a warning on 32-bit. Fix the format specifier
    to %td.
    
    Fixes: 6723474 ("libbpf: Generate loader program out of BPF ELF file.")
    Signed-off-by: Michal Suchanek <msuchanek@suse.de>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20210604112448.32297-1-msuchanek@suse.de
    hramrach authored and borkmann committed Jun 8, 2021
  3. tools/bpftool: Fix cross-build

    When the bootstrap and final bpftool have different architectures, we
    need to build two distinct disasm.o objects. Add a recipe for the
    bootstrap disasm.o.
    
    After commit d510296 ("bpftool: Use syscall/loader program in
    "prog load" and "gen skeleton" command.") cross-building bpftool didn't
    work anymore, because the bootstrap bpftool was linked using objects
    from different architectures:
    
      $ make O=/tmp/bpftool ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -C tools/bpf/bpftool/ V=1
      [...]
      aarch64-linux-gnu-gcc ... -c -MMD -o /tmp/bpftool/disasm.o /home/z/src/linux/kernel/bpf/disasm.c
      gcc ... -c -MMD -o /tmp/bpftool//bootstrap/main.o main.c
      gcc ... -o /tmp/bpftool//bootstrap/bpftool /tmp/bpftool//bootstrap/main.o ... /tmp/bpftool/disasm.o
      /usr/bin/ld: /tmp/bpftool/disasm.o: Relocations in generic ELF (EM: 183)
      /usr/bin/ld: /tmp/bpftool/disasm.o: Relocations in generic ELF (EM: 183)
      /usr/bin/ld: /tmp/bpftool/disasm.o: Relocations in generic ELF (EM: 183)
      /usr/bin/ld: /tmp/bpftool/disasm.o: error adding symbols: file in wrong format
      collect2: error: ld returned 1 exit status
      [...]
    
    The final bpftool was built for e.g. arm64, while the bootstrap bpftool,
    executed on the host, was built for x86. The problem here was that disasm.o
    linked into the bootstrap bpftool was arm64 rather than x86. With the fix
    we build two disasm.o, one for the target bpftool in arm64, and one for
    the bootstrap bpftool in x86.
    
    Fixes: d510296 ("bpftool: Use syscall/loader program in "prog load" and "gen skeleton" command.")
    Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20210603170515.1854642-1-jean-philippe@linaro.org
    jpbrucker authored and borkmann committed Jun 8, 2021

Commits on Jun 3, 2021

  1. selftests/bpf: Add xdp_redirect_multi into .gitignore

    When the xdp_redirect_multi test binary was added recently, it wasn't
    added to .gitignore. Fix that.
    
    Fixes: d232924 ("selftests/bpf: Add xdp_redirect_multi test")
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20210603004026.2698513-5-andrii@kernel.org
    anakryiko authored and borkmann committed Jun 3, 2021
  2. libbpf: Install skel_internal.h header used from light skeletons

    Light skeleton code assumes the skel_internal.h header is installed
    system-wide by the libbpf package. Make sure it is actually installed.
    
    Fixes: 6723474 ("libbpf: Generate loader program out of BPF ELF file.")
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20210603004026.2698513-4-andrii@kernel.org
    anakryiko authored and borkmann committed Jun 3, 2021
  3. libbpf: Refactor header installation portions of Makefile

    As we gradually get more headers that have to be installed, it's quite
    annoying to copy/paste long $(call) commands. So extract that logic and do
    a simple $(foreach) over the list of headers.
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20210603004026.2698513-3-andrii@kernel.org
    anakryiko authored and borkmann committed Jun 3, 2021
  4. libbpf: Move few APIs from 0.4 to 0.5 version

    The official libbpf 0.4 release doesn't include three APIs that were
    tentatively put into the 0.4 section. Fix libbpf.map and move these three APIs:
    
      - bpf_map__initial_value;
      - bpf_map_lookup_and_delete_elem_flags;
      - bpf_object__gen_loader.
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20210603004026.2698513-2-andrii@kernel.org
    anakryiko authored and borkmann committed Jun 3, 2021

Commits on Jun 1, 2021

  1. bpf, tnums: Provably sound, faster, and more precise algorithm for tnum_mul
    
    This patch introduces a new algorithm for multiplication of tristate
    numbers (tnums) that is provably sound. It is faster and more precise when
    compared to the existing method.
    
    Like the existing method, this new algorithm follows the long
    multiplication algorithm. The idea is to generate partial products by
    multiplying each bit in the multiplier (tnum a) with the multiplicand
    (tnum b), and adding the partial products after appropriately bit-shifting
    them. The new algorithm, however, uses just a single loop over the bits of
    the multiplier (tnum a) and accumulates only the uncertain components of
    the multiplicand (tnum b) into a mask-only tnum. The following paper
    explains the algorithm in more detail: https://arxiv.org/abs/2105.05398.
    
    A natural way to construct the tnum product is by performing a tnum
    addition on all the partial products. This algorithm presents another
    method of doing this: decompose each partial product into two tnums,
    consisting of the values and the masks separately. The mask-sum is
    accumulated within the loop in acc_m. The value-sum tnum is generated
    using a.value * b.value. The tnum constructed by tnum addition of the
    value-sum and the mask-sum contains all possible summations of concrete
    values drawn from the partial product tnums pairwise. We prove this result
    in the paper.
    
    Our evaluations show that the new algorithm is overall more precise
    (producing tnums with fewer uncertain components) than the existing method.
    As an illustrative example, consider the input tnums A and B. The numbers
    in parentheses correspond to (value;mask).
    
      A                = 000000x1 (1;2)
      B                = 0010011x (38;1)
      A * B (existing) = xxxxxxxx (0;255)
      A * B (new)      = 0x1xxxxx (32;95)
    
    Importantly, we present a proof of soundness of the new algorithm in the
    aforementioned paper. Additionally, we show that this new algorithm is
    empirically faster than the existing method.
    
    Co-developed-by: Matan Shachnai <m.shachnai@rutgers.edu>
    Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
    Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
    Signed-off-by: Matan Shachnai <m.shachnai@rutgers.edu>
    Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
    Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
    Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@rutgers.edu>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
    Link: https://arxiv.org/abs/2105.05398
    Link: https://lore.kernel.org/bpf/20210531020157.7386-1-harishankar.vishwanathan@rutgers.edu
    harishankarv authored and borkmann committed Jun 1, 2021

Commits on May 28, 2021

  1. bpf, devmap: Remove drops variable from bq_xmit_all()

    As Colin pointed out, the first drops assignment after declaration is
    overwritten by the second drops assignment before being used, which
    makes it useless.
    
    Since the drops variable is used only once, just remove it and use
    "cnt - sent" in trace_xdp_devmap_xmit().
    
    Fixes: cb261b5 ("bpf: Run devmap xdp_prog on flush instead of bulk enqueue")
    Reported-by: Colin Ian King <colin.king@canonical.com>
    Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20210528024356.24333-1-liuhangbin@gmail.com
    liuhangbin authored and borkmann committed May 28, 2021
  2. bpf, docs: Add llvm_reloc.rst to explain llvm bpf relocations

    LLVM upstream commit https://reviews.llvm.org/D102712 made some changes
    to bpf relocations to make them llvm linker lld friendly. The scope of
    existing relocations R_BPF_64_{64,32} is narrowed and new relocations
    R_BPF_64_{ABS32,ABS64,NODYLD32} are introduced.
    
    Let us add some documentation about llvm bpf relocations so people can
    understand how to resolve them properly in their respective tools.
    
    Signed-off-by: Yonghong Song <yhs@fb.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20210526152457.335210-1-yhs@fb.com
    yonghong-song authored and borkmann committed May 28, 2021

Commits on May 26, 2021

  1. libbpf: Move BPF_SEQ_PRINTF and BPF_SNPRINTF to bpf_helpers.h

    These macros are convenient wrappers around the bpf_seq_printf and
    bpf_snprintf helpers. They are currently provided by bpf_tracing.h which
    targets low level tracing primitives. bpf_helpers.h is a better fit.
    
    The __bpf_narg and __bpf_apply macros are needed in both files and are
    provided twice. __bpf_empty isn't used anywhere and is removed from
    bpf_tracing.h.
    
    Reported-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Florent Revest <revest@chromium.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20210526164643.2881368-1-revest@chromium.org
    Florent Revest authored and anakryiko committed May 26, 2021
  2. Merge branch 'bpf-xdp-bcast'

    Hangbin Liu says:
    
    ====================
    This patchset is a new implementation for XDP multicast support based
    on my previous 2 maps implementation[1]. The reason is that Daniel thinks
    the exclude map implementation is missing proper bond support in XDP
    context. And there is a plan to add native XDP bonding support. Adding
    an exclude map in the helper also increases the complexity of the
    verifier and has performance drawbacks.
    
    The new implementation just adds two new flags, BPF_F_BROADCAST and
    BPF_F_EXCLUDE_INGRESS, to extend xdp_redirect_map for broadcast support.
    
    With BPF_F_BROADCAST the packet is broadcast to all the interfaces in
    the map. With BPF_F_EXCLUDE_INGRESS the ingress interface is excluded
    when broadcasting.
    
    The v11 patch link is here [2].
    
      [1] https://lore.kernel.org/bpf/20210223125809.1376577-1-liuhangbin@gmail.com
      [2] https://lore.kernel.org/bpf/20210513070447.1878448-1-liuhangbin@gmail.com
    
    v12: As Daniel pointed out:
      a) defined as const u64 for flag_mask and action_mask in
         __bpf_xdp_redirect_map()
      b) remove BPF_F_ACTION_MASK in uapi header
      c) remove EXPORT_SYMBOL_GPL for xdpf_clone()
    
    v11:
      a) Use unlikely() when checking if this is for broadcast redirecting.
      b) Fix a tracepoint NULL pointer issue Jesper found
      c) Remove BPF_F_REDIR_MASK and just use OR flags to make it clearer
         to the reader which flags we are using
      d) Add the performance numbers with multiple veth interfaces to the
         patch 01 description.
      e) Remove some sleeps to reduce the testing time in patch 04. Restructure
         the test and make clear which flags we are testing.
    
    v10: use READ/WRITE_ONCE when read/write map instead of xchg()
    v9: Update patch 01 commit description
    v8: use hlist_for_each_entry_rcu() when looping over the devmap hash objects
    v7: No need to free xdpf in dev_map_enqueue_clone() if xdpf_clone failed.
    v6: Fix a skb leak in the error path for generic XDP
    v5: Just walk the map directly to get interfaces, as get_next_key() of the
        devmap hash may restart looping from the first key if a device gets
        removed. After the update, performance improved 10% compared with v4.
    v4: Fix flags never cleared issue in patch 02. Update selftest to cover this.
    v3: Rebase the code based on latest bpf-next
    v2: fix flag renaming issue in patch 02
    ====================
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    borkmann committed May 26, 2021
  3. selftests/bpf: Add xdp_redirect_multi test

    Add a bpf selftest for the new helper xdp_redirect_map_multi(). In this
    test there are 3 forward groups and 1 exclude group. The test redirects
    each interface's packets to all the interfaces in the forward group, and
    excludes the interface in the exclude map.
    
    Two maps (DEVMAP, DEVMAP_HASH) and two xdp modes (generic, driver) are
    tested. The XDP egress program is also tested by setting the pkt src MAC
    to the egress interface's MAC address.
    
    More test details can be found in the test script. Here is
    the test result.
    ]# time ./test_xdp_redirect_multi.sh
    Pass: xdpgeneric arp(F_BROADCAST) ns1-1
    Pass: xdpgeneric arp(F_BROADCAST) ns1-2
    Pass: xdpgeneric arp(F_BROADCAST) ns1-3
    Pass: xdpgeneric IPv4 (F_BROADCAST|F_EXCLUDE_INGRESS) ns1-1
    Pass: xdpgeneric IPv4 (F_BROADCAST|F_EXCLUDE_INGRESS) ns1-2
    Pass: xdpgeneric IPv4 (F_BROADCAST|F_EXCLUDE_INGRESS) ns1-3
    Pass: xdpgeneric IPv6 (no flags) ns1-1
    Pass: xdpgeneric IPv6 (no flags) ns1-2
    Pass: xdpdrv arp(F_BROADCAST) ns1-1
    Pass: xdpdrv arp(F_BROADCAST) ns1-2
    Pass: xdpdrv arp(F_BROADCAST) ns1-3
    Pass: xdpdrv IPv4 (F_BROADCAST|F_EXCLUDE_INGRESS) ns1-1
    Pass: xdpdrv IPv4 (F_BROADCAST|F_EXCLUDE_INGRESS) ns1-2
    Pass: xdpdrv IPv4 (F_BROADCAST|F_EXCLUDE_INGRESS) ns1-3
    Pass: xdpdrv IPv6 (no flags) ns1-1
    Pass: xdpdrv IPv6 (no flags) ns1-2
    Pass: xdpegress mac ns1-2
    Pass: xdpegress mac ns1-3
    Summary: PASS 18, FAIL 0
    
    real    1m18.321s
    user    0m0.123s
    sys     0m0.350s
    
    Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/bpf/20210519090747.1655268-5-liuhangbin@gmail.com
    liuhangbin authored and borkmann committed May 26, 2021
  4. sample/bpf: Add xdp_redirect_map_multi for redirect_map broadcast test

    This is a sample for xdp redirect broadcast. In the sample we can forward
    all packets between the given interfaces. There is also an option -X that
    enables a second xdp_prog on the egress interface.
    
    Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/bpf/20210519090747.1655268-4-liuhangbin@gmail.com
    liuhangbin authored and borkmann committed May 26, 2021
  5. xdp: Extend xdp_redirect_map with broadcast support

    This patch adds two flags BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS to
    extend xdp_redirect_map for broadcast support.
    
    With BPF_F_BROADCAST the packet is broadcast to all the interfaces in
    the map. With BPF_F_EXCLUDE_INGRESS the ingress interface is excluded
    when broadcasting.
    
    When getting the devices in the dev hash map via dev_map_hash_get_next_key(),
    there is a possibility that we fall back to the first key when a device is
    removed, which would duplicate packets on some interfaces. So just walk all
    the buckets to avoid this issue. For the dev array map, we also walk the
    whole map to find valid interfaces.
    
    Function bpf_clear_redirect_map() was removed in
    commit ee75aef ("bpf, xdp: Restructure redirect actions").
    Add it back as we need to use ri->map again.
    
    With test topology:
      +-------------------+             +-------------------+
      | Host A (i40e 10G) |  ---------- | eno1(i40e 10G)    |
      +-------------------+             |                   |
                                        |   Host B          |
      +-------------------+             |                   |
      | Host C (i40e 10G) |  ---------- | eno2(i40e 10G)    |
      +-------------------+             |                   |
                                        |          +------+ |
                                        | veth0 -- | Peer | |
                                        | veth1 -- |      | |
                                        | veth2 -- |  NS  | |
                                        |          +------+ |
                                        +-------------------+
    
    On Host A:
     # pktgen/pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -s 64
    
    On Host B (Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 128G memory):
    Use xdp_redirect_map and xdp_redirect_map_multi in samples/bpf for testing.
    All the veth peers in the NS have an XDP_DROP program loaded. The
    forward_map max_entries in xdp_redirect_map_multi is modified to 4.
    
    Testing the performance impact on the regular xdp_redirect path with and
    without patch (to check impact of additional check for broadcast mode):
    
    Version          | Test                                | Generic | Native
    5.12 rc4         | redirect_map        i40e->i40e      |    2.0M |  9.7M
    5.12 rc4         | redirect_map        i40e->veth      |    1.7M | 11.8M
    5.12 rc4 + patch | redirect_map        i40e->i40e      |    2.0M |  9.6M
    5.12 rc4 + patch | redirect_map        i40e->veth      |    1.7M | 11.7M
    
    Testing the performance when cloning packets with the redirect_map_multi
    test, using a redirect map size of 4, filled with 1-3 devices:
    
    5.12 rc4 + patch | redirect_map multi  i40e->veth (x1) |    1.7M | 11.4M
    5.12 rc4 + patch | redirect_map multi  i40e->veth (x2) |    1.1M |  4.3M
    5.12 rc4 + patch | redirect_map multi  i40e->veth (x3) |    0.8M |  2.6M
    
    Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Link: https://lore.kernel.org/bpf/20210519090747.1655268-3-liuhangbin@gmail.com
    liuhangbin authored and borkmann committed May 26, 2021
  6. bpf: Run devmap xdp_prog on flush instead of bulk enqueue

    This changes the devmap XDP program support to run the program when the
    bulk queue is flushed instead of before the frame is enqueued. This has
    a couple of benefits:
    
    - It "sorts" the packets by destination devmap entry, and then runs the
      same BPF program on all the packets in sequence. This ensures that we
      keep the XDP program and destination device properties hot in I-cache.
    
    - It makes the multicast implementation simpler because it can just
      enqueue packets using bq_enqueue() without having to deal with the
      devmap program at all.
    
    The drawback is that if the devmap program drops the packet, the enqueue
    step is redundant. However, arguably this is mostly visible in a
    micro-benchmark, and with more mixed traffic the I-cache benefit should
    win out. The performance impact of just this patch is as follows:
    
    Using two 10Gb i40e NICs, redirecting one to the other, or into a veth
    interface whose peer does XDP_DROP. With xdp_redirect_map in samples/bpf,
    send pkts via the pktgen cmd:
    ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64
    
    There is about +/- 0.1M deviation in native testing; performance improved
    for the base case, but drops back somewhat with an xdp devmap prog attached.
    
    Version          | Test                           | Generic | Native | Native + 2nd xdp_prog
    5.12 rc4         | xdp_redirect_map   i40e->i40e  |    1.9M |   9.6M |  8.4M
    5.12 rc4         | xdp_redirect_map   i40e->veth  |    1.7M |  11.7M |  9.8M
    5.12 rc4 + patch | xdp_redirect_map   i40e->i40e  |    1.9M |   9.8M |  8.0M
    5.12 rc4 + patch | xdp_redirect_map   i40e->veth  |    1.7M |  12.0M |  9.4M
    
    When bq_xmit_all() is called from bq_enqueue(), another packet will
    always be enqueued immediately after, so clearing dev_rx, xdp_prog and
    flush_node in bq_xmit_all() is redundant. Move the clear to __dev_flush(),
    and only check them once in bq_enqueue() since they are all modified
    together.
    
    This change also has the side effect of extending the lifetime of the
    RCU-protected xdp_prog that lives inside the devmap entries: Instead of
    just living for the duration of the XDP program invocation, the
    reference now lives all the way until the bq is flushed. This is safe
    because the bq flush happens at the end of the NAPI poll loop, so
    everything happens between a local_bh_disable()/local_bh_enable() pair.
    However, this is by no means obvious from looking at the call sites; in
    particular, some drivers have an additional rcu_read_lock() around only
    the XDP program invocation, which only confuses matters further.
    Cleaning this up will be done in a separate patch series.
    
    Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20210519090747.1655268-2-liuhangbin@gmail.com
    netoptimizer authored and borkmann committed May 26, 2021
Older