issue: Implement doorbell batching for the new API #164

Open
wants to merge 169 commits into
base: vNext

Conversation

pasis
Member

@pasis pasis commented Jun 8, 2024

Description

The new API requires the user to call an explicit flush. We can implement doorbell batching and close an unfinished batch on flush. This guarantees that doorbells and zcopy completions make progress even if the user stops sending data.

What

Implement doorbell batching for the new API.

Why?

Up to 4% performance improvement depending on scenario.

Change type

What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Code style update
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Tests
  • Other

Check list

  • Code follows the style de facto guidelines of this project
  • Comments have been inserted in hard to understand places
  • Documentation has been updated (if necessary)
  • Test has been added (if possible)

AlexanderGrissik and others added 30 commits January 14, 2024 16:02
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
At most one element of this vector is ever used.
Once the rfs constructor completes, there must be exactly one attach_flow_data element in the ring_simple case.
For ring_tap, this element remains null.

Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Alex Briskin <abriskin@nvidia.com>
Set errno to ETIMEDOUT and return -1 from recv when a socket has timed out, instead of returning 0 with errno 0.
For instance, on a TCP keep-alive timeout.

Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
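
A minimal sketch of how a caller might tell the cases apart after the recv change described above. `RecvOutcome` and `classify_recv` are hypothetical names used only for illustration; they are not part of XLIO.

```cpp
#include <cerrno>
#include <sys/types.h>

// Hypothetical helper (not an XLIO API): classify the result of recv().
enum class RecvOutcome { Data, PeerClosed, TimedOut, WouldBlock, Error };

// rc is the return value of recv(); err is errno captured right after the call.
inline RecvOutcome classify_recv(ssize_t rc, int err)
{
    if (rc > 0) {
        return RecvOutcome::Data;
    }
    if (rc == 0) {
        // 0 now means only an orderly shutdown by the peer, not a timeout.
        return RecvOutcome::PeerClosed;
    }
    if (err == ETIMEDOUT) {
        // For example, a TCP keep-alive timeout.
        return RecvOutcome::TimedOut;
    }
    if (err == EAGAIN || err == EWOULDBLOCK) {
        return RecvOutcome::WouldBlock;
    }
    return RecvOutcome::Error;
}
```

With the old behavior (return 0, errno 0), a timeout would have been indistinguishable from `PeerClosed` in a helper like this.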
The idea is to scan all rpm/deb packages for personal emails.
We should not be releasing packages with such emails.
The scan is done on both the metadata info and the changelog
of a specific package.

Issue: HPCINFRA-919
Signed-off-by: Daniel Pressler <danielpr@nvidia.com>
pasis and others added 25 commits April 1, 2024 00:09
poll_group takes an additional reference to each of its rings, but it
doesn't release it once the group is destroyed. This leads to two issues:

1. Extra resources are held if the user destroys a polling group before
   the application terminates.
2. Polling is not possible for a destroyed group. Therefore, if there
   are uncompleted WQEs in the SQ, the respective sockets won't report
   TX completions and cannot be fully terminated. The ring needs to be
   destroyed to flush all the completions.

Release all the native rings explicitly in the poll_group destructor to
resolve the above issues.

Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
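
The fix in the commit above can be pictured with a simplified reference-counting sketch. `Ring` and `PollGroup` here are hypothetical stand-ins, not the XLIO classes; the only point is that every reference taken when a ring joins the group is dropped in the destructor.

```cpp
#include <vector>

// Hypothetical stand-in for a ring with a plain reference counter.
struct Ring {
    int refs = 0;
};

// Hypothetical stand-in for poll_group.
class PollGroup {
public:
    // The group takes an additional reference to each ring it polls.
    void add_ring(Ring *ring)
    {
        ++ring->refs;
        m_rings.push_back(ring);
    }

    // Release every reference taken in add_ring(), so that the rings can
    // be destroyed (and their completions flushed) once the group is gone.
    ~PollGroup()
    {
        for (Ring *ring : m_rings) {
            --ring->refs;
        }
    }

private:
    std::vector<Ring *> m_rings;
};
```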
When an RX packet event happens, XLIO passes ownership of the buffer to
the user, who later releases it explicitly. However, XLIO frees the
buffer unconditionally just after emitting the event.

Fix this and free buffers only if the user doesn't provide the RX event
callback.

Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
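
A sketch of the ownership rule from the commit above, under simplifying assumptions: `Buffer`, `RxCallback` and `deliver_rx` are illustrative names, and "freeing" is modeled as setting a flag.

```cpp
#include <functional>

// Hypothetical buffer; 'freed' models returning it to the pool.
struct Buffer {
    bool freed = false;
};

using RxCallback = std::function<void(Buffer *)>;

// Emit an RX event. If the user registered a callback, ownership moves to
// the user, who releases the buffer later. Otherwise the library frees it.
inline void deliver_rx(Buffer *buf, const RxCallback &on_rx)
{
    if (on_rx) {
        on_rx(buf); // The user now owns buf; do not free it here.
    } else {
        buf->freed = true; // Nobody took ownership; reclaim immediately.
    }
}
```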
reclaim_recv_single_buffer() accumulates buffers in a list. In the
performance oriented API we want to reuse hot buffers immediately, so
reclaim_recv_buffers() implementation is more suitable.

Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
The memory callback provides hugepage size of the underlying pages.
Replace hardcoded 0 with real hugepage size.

Keep the page size in the xlio_allocator object. This field is relevant
only for the hugepage allocation method and is 0 in all other cases.

Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
The XLIO Socket API must guarantee that XLIO_SOCKET_EVENT_TERMINATED is
not followed by any other events. Therefore, all the TX completion
events must be delivered by that moment.

Do a polling iteration before calling the socket destructor to increase
the chance that all the relevant WQEs are completed. This mechanism needs
to be improved in the future.

Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
xlio_init_ex() changes some default parameters. However, a global object
can trigger the safe_mce_sys() constructor at startup. Therefore, we need
to re-read the environment variables to guarantee that the changed
parameters take effect.

Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
Avoid using connect() with sock fd interface, because fd_collection
doesn't keep xlio_socket_t objects.

Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
xlio_socket_t objects aren't connected to the fd_collection anymore.
Therefore, all the methods must be called from the sockinfo_tcp objects
directly.

Also, xlio_socket_fd() is not relevant anymore and can be removed.

Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
A socket could be erased from the std::list of TCP sockets while the
list was being iterated.
Fixed by advancing the iterator before the erase.

Signed-off-by: Iftah Levi <iftahl@nvidia.com>
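
The pattern named in the commit above (advance the iterator, then erase the previous position) looks roughly like this. It is a generic sketch over `std::list<int>`; `erase_matching` is a hypothetical name and the real code iterates over TCP socket objects.

```cpp
#include <cstddef>
#include <list>

// Erase every element for which should_erase() returns true.
// Returns the number of erased elements.
template <typename Pred>
std::size_t erase_matching(std::list<int> &sockets, Pred should_erase)
{
    std::size_t erased = 0;
    for (auto it = sockets.begin(); it != sockets.end();) {
        // Step to the next node first: std::list::erase() invalidates only
        // the iterator of the erased element, so 'it' stays valid.
        auto current = it++;
        if (should_erase(*current)) {
            sockets.erase(current);
            ++erased;
        }
    }
    return erased;
}
```

An equivalent form is `it = sockets.erase(it)`, since `std::list::erase` returns the iterator following the removed element.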
rdma-core limits the number of UARs per context to 16 by default. After
creating 16 QPs, XLIO receives duplicate blueflame registers for each
subsequent QP. As a result, the blueflame doorbell method can write WQEs
concurrently without serialization, which leads to data corruption.

BlueFlame can hurt throughput, since the copy to the blueflame
register is expensive. It can improve latency in some low-latency
scenarios; however, XLIO targets high traffic/PPS rates.
Removing the blueflame method also slightly improves performance in some
scenarios.

BlueFlame can be brought back in the future to improve low-latency
scenarios; however, it will need some rework to avoid the data
corruption.

Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
The inline WQE branch is not likely in most throughput scenarios.

Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
Avoid calling register_socket_timer_event when a socket is already registered (TIME-WAIT).
Although this causes no functional issue, it posts events to the internal thread at too high a rate.
This leads to lock contention inside the internal thread and degrades HTTP CPS performance.

Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
Signed-off-by: Gal Noam <gnoam@nvidia.com>
UTLS uses tcp_tx_express() for non-blocking sockets. However, this TX
method doesn't support XLIO_RX_POLL_ON_TX_TCP. Additional RX polling
improves scenarios such as web servers.

Insert RX polling into the UTLS TX path to resolve the performance
degradation.

Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
In heavy CPS scenarios a socket may go to the TIME-WAIT state and be reused before the first TCP timer registration is performed by the internal thread.
1. Setting timer_registered=true while posting the event prevents a second attempt to post the event.
2. Adding a sanity check in add_new_timer that verifies that the socket is not already in the timer map.

Signed-off-by: Alexander Grissik <agrissik@nvidia.com>
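
The two guards listed in the commit above can be sketched as follows. `Socket` and `TimerRegistry` are hypothetical stand-ins; only the names `timer_registered` and `add_new_timer` come from the commit message, and the threading between the socket and the internal thread is left out.

```cpp
#include <cstddef>
#include <unordered_set>

// Hypothetical stand-in for a TCP socket.
struct Socket {
    bool timer_registered = false;
};

// Hypothetical stand-in for the timer bookkeeping.
class TimerRegistry {
public:
    // Guard 1: set the flag when the event is posted, not when it is
    // handled, so a reused socket cannot post a second event.
    bool post_registration(Socket &sock)
    {
        if (sock.timer_registered) {
            return false;
        }
        sock.timer_registered = true;
        ++m_posted_events;
        return true;
    }

    // Guard 2: sanity check that the socket is not already in the map.
    bool add_new_timer(Socket *sock) { return m_timers.insert(sock).second; }

    std::size_t posted_events() const { return m_posted_events; }
    std::size_t timers() const { return m_timers.size(); }

private:
    std::unordered_set<Socket *> m_timers;
    std::size_t m_posted_events = 0;
};
```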
Added a new env parameter - XLIO_MAX_TSO_SIZE.
It allows the user to control the maximum size of TSO,
instead of always using the maximum supported by HW.
The default size is 256KB (the maximum on current HW).
Values higher than the HW capabilities won't be taken into account.

Signed-off-by: Iftah Levi <iftahl@nvidia.com>
Signed-off-by: Gal Noam <gnoam@nvidia.com>
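
One way to read the rule above is that the effective TSO size never exceeds the HW limit. The sketch below assumes that interpretation; `effective_max_tso` and `kDefaultMaxTsoSize` are hypothetical names, and how XLIO actually parses the parameter is not shown in this PR page.

```cpp
#include <algorithm>
#include <cstdint>

// Default from the commit message: 256KB.
constexpr std::uint32_t kDefaultMaxTsoSize = 256U * 1024U;

// Assumption: a requested value above the HW capability has no effect,
// so the effective size is capped by what the HW reports.
inline std::uint32_t effective_max_tso(std::uint32_t requested, std::uint32_t hw_cap)
{
    return std::min(requested, hw_cap);
}
```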
PBUF_NONE was used mistakenly instead of PBUF_DESC_NONE.

Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
The field doesn't have to be initialized if we copy the data. This is an
extra operation, so move it to the else branch.

Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
The inline part of fill_wqe() is overcomplicated. Hide it in a separate
method so that refactoring can be isolated.

Also, don't check the multiple scatter-gather case against the inline
criteria. This scenario is unlikely because the TCP layer copies
non-zcopy data into a single buffer until it's full.

Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
The URGENT flag requests a TX completion for the respective WQE. This is
required for zerocopy interfaces where the user cannot specify the last
send operation explicitly. Otherwise, TX completion batching can lead
to a deadlock if the user stops sending data and waits for the
completions.

The CALLBACK flag requests that a callback be called once the buffer is
released by XLIO.

Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
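
To illustrate why an URGENT-style flag avoids the deadlock described above, here is a simplified model of TX completion batching. `FLAG_URGENT`, `FLAG_CALLBACK` and `CompletionBatcher` are hypothetical names; the real flag values and the WQE posting path are not shown in this PR page.

```cpp
#include <cstdint>

// Hypothetical flag bits, for illustration only.
enum : std::uint32_t {
    FLAG_URGENT = 1U << 0,   // Request a TX completion for this WQE.
    FLAG_CALLBACK = 1U << 1, // Call a callback once the buffer is released.
};

// Models completion batching: normally only every Nth WQE asks the HW
// for a completion.
class CompletionBatcher {
public:
    explicit CompletionBatcher(unsigned batch_size) : m_batch_size(batch_size) {}

    // Returns true if the WQE being posted must request a completion.
    bool post(std::uint32_t flags)
    {
        ++m_pending;
        if ((flags & FLAG_URGENT) != 0 || m_pending >= m_batch_size) {
            m_pending = 0; // This completion covers all pending WQEs.
            return true;
        }
        return false;
    }

private:
    unsigned m_batch_size;
    unsigned m_pending = 0;
};
```

Without the urgent path, a sender that posts fewer than `batch_size` WQEs and then waits for completions would wait forever in this model.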
Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
This enables doorbell batching similar to TX completion batching. The
XLIO extra socket API provides explicit flush functionality. Such a
flush operation closes the WQE accumulation session and rings the
doorbell. This guarantees doorbells and zcopy completions even if the
user stops sending data (as long as the user calls the final flush).

Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
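
The accumulate-then-flush idea from the commit above can be sketched with a toy send queue. `SendQueue`, `post_wqe` and `flush` are hypothetical names; the doorbell is modeled as a counter rather than a write to a device register.

```cpp
#include <cstddef>

// Toy model of a send queue with doorbell batching.
class SendQueue {
public:
    // Write a WQE into the queue without notifying the HW yet.
    void post_wqe()
    {
        ++m_posted;
        ++m_unflushed;
    }

    // Close the accumulation session: ring the doorbell once for all
    // WQEs posted since the previous flush. Returns true if it rang.
    bool flush()
    {
        if (m_unflushed == 0) {
            return false; // Nothing new for the HW to see.
        }
        m_unflushed = 0;
        ++m_doorbells;
        return true;
    }

    std::size_t posted() const { return m_posted; }
    std::size_t doorbells() const { return m_doorbells; }

private:
    std::size_t m_posted = 0;
    std::size_t m_unflushed = 0;
    std::size_t m_doorbells = 0;
};
```

In this model three sends followed by one flush cost a single doorbell, and progress depends on the user calling the final flush, which matches the condition stated in the commit message.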
Signed-off-by: Dmytro Podgornyi <dmytrop@nvidia.com>
@pasis pasis added enhancement New feature or request draft Not to review yet labels Jun 8, 2024
@galnoam
Collaborator

galnoam commented Jun 10, 2024

bot:retest

Labels
draft Not to review yet enhancement New feature or request

8 participants