New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SASL GSSAPI & GSS-SPNEGO performance issue #4506
Comments
My gut feeling tells me that the issue could be related to global lock. I see a log of activity related to replication and index management in the trace log. ids_sasl_check_bind() is protected by the big lock |
I doubt it's the lock you mention here - more likely it's related to #3032 The change mentioned here has to add a global krb lock to resolve an issue/race with freeing memory, and it was noted at the time that significant performance issues may arise from this change. Certainly this would also have interesting interleaving effects with the conn lock. The conn lock however, is correctly placed and is per-connection. It was also discussed that the "long term" solution would be to move to GSSAPI as the api rather than krb directly, but that is an extensive, invasive and complex piece of work. |
See commit ff9387b |
As a work around you may find that reducing your worker thread count to match the number of available CPU's will increase performance due to a reduction in lock contention. |
Thanks @Firstyear! I have done my best to isolate the system, reduce any interference, and avoid lock contention from external sources.
|
389-ds-base/ldap/servers/slapd/ldaputil.c Lines 1157 to 1159 in 0f38410
|
Add log trace for sasl_server_start() and sasl_server_step() to investigate performance issues with SASL bind. sasl_errstring() perform a simple and fast switch case mapping from error code to const string. See: 389ds#4506 Signed-off-by: Christian Heimes <cheimes@redhat.com>
I just realized that the problem may also be on the client side. SASL GSSAPI authentication takes multiple roundtrips. The client may be slow reading or responding to SASL responses. I'm also seeing the same issue with python-ldap, which is just a wrapper for OpenSSL and Cyrus SASL client libs. We need to keep in mind that it might be the client side or an issue in SASL library on either server or client side. |
I noticed that the performance of SASL bind operation has a peculiar distribution. It's not random at all. Slow bind operations are exactly a multiple of 250ms slower. This cannot be a coincidence. It looks like a
2000 SASL GSSAPI bind over LDAPI. I only measured the actual bind operation with a perf counter:
|
It's the wakeup timer! I changed the wakeup period from 250 to 25 and the performance of SASL bind operations changed dramatically. The fastest connections are a bit slower and slower operations are now quantized with multiple of 25ms. Is it possible that some SASL calls are either blocked by or executed in the main thread?
|
Riggghhhttttt interesting, that would indicate to me that there is a problem with sasl/krb from indicating that the socket is ready for work. |
I have tried a different patch that only modifies the timout of 389-ds-base/ldap/servers/slapd/daemon.c Lines 989 to 1012 in 0f38410
|
This is a very nice finding. I have to admit I am a bit puzzled. reducing pr_timeout to 0.025s means that if no operation (like sasl_bind) comes in within 0.025s (on an established connection), it will timeout/recompute the fd set/poll again. So the gain is to take into account faster new incoming connections. 3ms 329 4ms 1004 5ms 52 6ms 1 253ms 182 ... ~1380 (out of 2000) connections were established on the first poll and their sasl_bind processed. Is it possible to test a scenario with 2000 established connection, then 2000 sasl bind, to confirm if pr_timeout is impacting or not the response time ? At the moment, I have the feeling that reducing the timeout improve the response time of new incoming connections (and established connections are idle (after sasl_bind)). The benefit of response time will have the cost of the recomputation of the established connections. |
I choose 25ms for testing only. My hack is not a proper solution. You can find my trivial test script at https://gist.github.com/tiran/4b56faf5a8b14b9828ef8ba2ad9292ba |
i think the problem here is that if there is no new work/delay on the POLL_FN and we wait up to pr_timeout, then we are delaying the call to handle_listeners to accept new connections. Then in handle_new_connection when we call to accept_and_configure we again, call PR_Accept with pr_timeout set to the slapd_wakeup_timer. So POLL_FN is not polling the listeners, only connections in the CT. So that then cascades to
Which certainly would align to @tiran's observations that binds are delayed sporadicly, because a bind generally is a new connection, and since it would complete "rapidly" then the poll for new work would hit timeout because there is no longer term connections active in the conntable. The solution is probably not trivial, but we could consider breaking out each fd listener for accept to a unique thread and changing their timeout to A slightly more "palatable" intermediate fix is rather than per-thread for each listener, to break out accept to a single thread and have it iterate on just accepts. IE move handle_listeners to a unique thread in a @tbordaz Thoughts? |
This is exactly what puzzled me. IMHO listeners are polled. ct->fd (polled array) is filled with n_tcps, s_tcps, i_unix, then with established & !gettingber connections. |
@tiran, this portion of code is specific to GSSAPI auth on outgoing connection (replication agreements, chaining). It is not related to the reported concern (response time of incoming GSSAPI auth). |
Bug Description: While investigating 4506 it was noticed that it was possible to exceed the capacity of the connection table fd array if you had many listeners and a large number of connections. The number of connections required and in the correct state to cause this is in the thousands and would be infeasible in reality, but it is still worth defending from this. Fix Description: Add the correct bound on the while loop setting up the fd for polling. relates: 389ds#4506 Author: William Brown <william@blackhats.net.au> Review by: ???
Bug Description: Previously we accepted connections and selected for new work in the same event loop. This could cause connection table polling to delay accepts, and accepts to delay connection activity from being ready. Fix Description: This seperates those functions allowing accept to occur in parallel to our normal work. fixes: 389ds#4506 Author: William Brown <william@blackhats.net.au> Review by: ???
Issue 4506 - RFE - connection accept thread Bug Description: Previously we accepted connections and selected for new work in the same event loop. This could cause connection table polling to delay accepts, and accepts to delay connection activity from being ready. Fix Description: This seperates those functions allowing accept to occur in parallel to our normal work. fixes: #4506 Author: William Brown <william@blackhats.net.au> Review by: @mreynolds389 @progier389 (Thanks!)
We have a major problem. This broke LDAPI --> the installer, and CLI tools. Installer and CLI tools now hang indefinitely. Please fix this soon, or we will have to revert it. |
@mreynolds389 Please revert it, I can't reproduce this so it may take me some time to investigate and solve. I will re-submit once I can reproduce and have a fix. Are there more details about the hang? |
I took a stack trace: Here are the only two threads of interest during the hang:
I'll try and do some more testing tonight if I get time... |
What test produced the issue? |
Running dscreate :-) Then I tried dsconf and it also just hung. Appears to be LDAPI specific... I rebuilding the server now to confirm the behavior. I should have confirmation soon... |
dscreate hangs when it tries to bind as root:
Same thing for dsconf. The server is not detecting/accepting the LDAPI connection |
Bug Description: during review it was requested that a piece of code be changed which seemed quite innocent. The code was moved but the logic around the code wasn't considered causing the fd array for the accept thread to be allocated with a size of zero, causing the values to be lost. Fix Description: Move the allocation to the correct location. fixes: 389ds#4506 Author: William Brown <william@blackhats.net.au> Review by: ???
Bug Description: during review it was requested that a piece of code be changed which seemed quite innocent. The code was moved but the logic around the code wasn't considered causing the fd array for the accept thread to be allocated with a size of zero, causing the values to be lost. Fix Description: Move the allocation to the correct location. fixes: #4506 Author: William Brown <william@blackhats.net.au> Review by: @mreynolds389 @droideck
Issue 4506 - RFE - connection accept thread Bug Description: Previously we accepted connections and selected for new work in the same event loop. This could cause connection table polling to delay accepts, and accepts to delay connection activity from being ready. Fix Description: This seperates those functions allowing accept to occur in parallel to our normal work. fixes: #4506 Author: William Brown <william@blackhats.net.au> Review by: @mreynolds389 @progier389 (Thanks!)
Bug Description: during review it was requested that a piece of code be changed which seemed quite innocent. The code was moved but the logic around the code wasn't considered causing the fd array for the accept thread to be allocated with a size of zero, causing the values to be lost. Fix Description: Move the allocation to the correct location. fixes: #4506 Author: William Brown <william@blackhats.net.au> Review by: @mreynolds389 @droideck
Description: Converted all SLAPI_LOG_TRACE logging to Connection logging (SLAPI_LOG_CONNS). sasl_errstring() perform a simple and fast switch case mapping from error code to const string. relates : 389ds#4506 Signed-off-by: Christian Heimes <cheimes@redhat.com> Reviewed by: mreynolds
Description: Converted all SLAPI_LOG_TRACE logging to Connection logging (SLAPI_LOG_CONNS). sasl_errstring() perform a simple and fast switch case mapping from error code to const string. relates : #4506 Signed-off-by: Christian Heimes <cheimes@redhat.com> Reviewed by: mreynolds
Description: Converted all SLAPI_LOG_TRACE logging to Connection logging (SLAPI_LOG_CONNS). sasl_errstring() perform a simple and fast switch case mapping from error code to const string. relates : #4506 Signed-off-by: Christian Heimes <cheimes@redhat.com> Reviewed by: mreynolds
Description: Converted all SLAPI_LOG_TRACE logging to Connection logging (SLAPI_LOG_CONNS). sasl_errstring() perform a simple and fast switch case mapping from error code to const string. relates : #4506 Signed-off-by: Christian Heimes <cheimes@redhat.com> Reviewed by: mreynolds
Issue 4506 - RFE - connection accept thread Bug Description: Previously we accepted connections and selected for new work in the same event loop. This could cause connection table polling to delay accepts, and accepts to delay connection activity from being ready. Fix Description: This seperates those functions allowing accept to occur in parallel to our normal work. fixes: 389ds#4506 Author: William Brown <william@blackhats.net.au> Review by: @mreynolds389 @progier389 (Thanks!)
Bug Description: during review it was requested that a piece of code be changed which seemed quite innocent. The code was moved but the logic around the code wasn't considered causing the fd array for the accept thread to be allocated with a size of zero, causing the values to be lost. Fix Description: Move the allocation to the correct location. fixes: 389ds#4506 Author: William Brown <william@blackhats.net.au> Review by: @mreynolds389 @droideck
Issue 4506 - Temporary fix for io issues (#4516) Bug Description: Previously we accepted connections and selected for new work in the same event loop. This could cause connection table polling to delay accepts, and accepts to delay connection activity from being ready. Fix Description: This seperates those functions allowing accept to occur in parallel to our normal work. fixes: #4506 Author: William Brown <william@blackhats.net.au> Review by: @mreynolds389 @progier389 (Thanks!) Issue 4506 - BUG - fix oob alloc for fds (#4555) Bug Description: during review it was requested that a piece of code be changed which seemed quite innocent. The code was moved but the logic around the code wasn't considered causing the fd array for the accept thread to be allocated with a size of zero, causing the values to be lost. Fix Description: Move the allocation to the correct location. fixes: #4506 Author: William Brown <william@blackhats.net.au> Review by: @mreynolds389 @droideck Co-authored-by: Firstyear <william@blackhats.net.au>
Issue Description
Today I noticed again that LDAP SASL bind with GSSAPI and GSS-SPNEGO are sometimes slow. The issue is similar to https://pagure.io/freeipa/issue/6656. In about half the cases SASL bind is faster. In other cases the operation is more than a magnitude slower. I'm seeing the same performance spikes with GSS-SPNEGO. In fast case GSS-SPNEGO is a bit faster than GSSAPI. In slow cases it degrades like GSSAPI.
Package Version and Platform:
Steps to Reproduce
Steps to reproduce the behavior:
kinit -kt /etc/named.keytab DNS/$(hostname)
time ldapwhoami -H ldapi://%2Frun%2Fslapd-IPA-TEST.socket -Y GSSAPI
multiple timesExpected results
Command finishes in around 20ms consistently.
Actual result
Wall clock time for the command is all over the board. Sometimes the command finishes in less than 20ms, sometimes it's around 250 and every know and then it takes more than 500ms.
Additional context
For testing I have stop all IPA services except DS, disabled and removed all IPA plugins that may affect 389-DS performance, and restarted 389-DS. It did not affect performance.
fast bind
slow bind (delay between
op=1
andop=2
)At one point I also enabled tracing to trace function calls into SASL stack. There is a long delay between the first and second
ids_sasl_listmech
call. You can find a full log at https://cheimes.fedorapeople.org/sasl.logThe text was updated successfully, but these errors were encountered: