
Issue 6284 - BUG - freelist ordering causes high wtime #6285


Merged
merged 1 commit into 389ds:main from 20240802-conntable-interleaving
Aug 4, 2024

Conversation

Firstyear
Contributor

Bug Description: Multithread listeners were added to support multiple threads performing IO for connections, to reduce wtime. However, when the server is first started, if the number of clients is less than the size of a single listener's conntable, this causes extremely high wtimes.

This is because the initial connection freelist construction adds every connection in order of listener, so new connections always fill the first listener thread; only once that listener is full does it spill over to the next.

If the conntable size is small, this isn't noticeable, but as an example, if your conntable is 16384 elements with 4 listeners, then the first 4096 connections will always go to the first listener.

Fix Description: When the freelist is constructed, rather than iterating over each listener then each slot, we iterate over each slot then each listener. This interleaves the initial freelist between all listeners, so that even with a small number of connections the work is spread fairly without lasering a single listener.
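
To make the iteration-order change concrete, here is a minimal, self-contained sketch of the two construction orders. The names, sizes, and layout are illustrative only, not the actual 389-ds-base code; it assumes the conntable is split evenly across the listeners.

```c
#include <stdio.h>

#define CT_SIZE       16  /* total conntable slots (illustrative) */
#define NUM_LISTENERS  4  /* listener threads (illustrative) */

int main(void)
{
    int freelist[CT_SIZE];
    int per_listener = CT_SIZE / NUM_LISTENERS;
    int idx;

    /* Old order: listener-major. The first `per_listener` freelist
     * entries all belong to listener 0, so early connections pile up
     * on a single listener thread. */
    idx = 0;
    for (int l = 0; l < NUM_LISTENERS; l++) {
        for (int s = 0; s < per_listener; s++) {
            freelist[idx++] = l * per_listener + s;
        }
    }

    /* New order: slot-major. Consecutive freelist entries alternate
     * between listeners, so the first k connections land on k
     * different listeners. */
    idx = 0;
    for (int s = 0; s < per_listener; s++) {
        for (int l = 0; l < NUM_LISTENERS; l++) {
            freelist[idx++] = l * per_listener + s;
        }
    }

    /* With the interleaved order, the first four entries map to
     * listeners 0, 1, 2, 3 rather than 0, 0, 0, 0. */
    for (int i = 0; i < CT_SIZE; i++) {
        printf("freelist[%2d] -> slot %2d (listener %d)\n",
               i, freelist[i], freelist[i] / per_listener);
    }
    return 0;
}
```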

A test supplied by a user shows a significant drop in wtime and an increase in operation throughput with this change.

fixes: #6284

Author: William Brown <william@blackhats.net.au>

Review by: ???

@Firstyear Firstyear requested a review from jchapma August 2, 2024 00:32
Contributor

@tbordaz tbordaz left a comment


Nice finding!! That explains why we were not able to measure an improvement in throughput. The initial clients were all hitting the same listener.

The patch looks good

A question though: my understanding is that wtime starts when the listener creates a pblock and adds it to the work queue. Since the problem is that the first listener is overloaded and slow, it should only add the pblock more slowly. I understand it impacts throughput, but I don't see how it impacts the time a specific pblock spends in the workqueue?

@Firstyear
Contributor Author

> Nice finding!! That explains why we were not able to measure an improvement in throughput. The initial clients were all hitting the same listener.
>
> The patch looks good
>
> A question though: my understanding is that wtime starts when the listener creates a pblock and adds it to the work queue. Since the problem is that the first listener is overloaded and slow, it should only add the pblock more slowly. I understand it impacts throughput, but I don't see how it impacts the time a specific pblock spends in the workqueue?

handle_pr_read_ready() is what we poll with the multiple listeners. This iterates over its list of connections and calls connection_activity(). If activity is found, the connection is put on the work_q.

Then connection_threadmain picks up the operation and, if it's the first iteration of the loop, initialises the operation. This is where wtime starts.

connection_threadmain then attempts connection_read_operation. If the PDU is incomplete, the connection needs more data, so it's unlocked, removed from the work_q, and returned to be polled.

Then, when more data is available, handle_pr_read_ready() runs again. connection_threadmain picks up the connection again and, since the connection is already initialised, the pblock remains. Now connection_read_operation completes the PDU and can proceed. wtime now ends.

This is how wtime is affected: if there is a single listener we have to wait longer for the second handle_pr_read_ready(), as we have a longer head of connections in front that need to be handled first.

By having multiple listeners polling in parallel, in the case of incomplete PDUs we return to poll the connection sooner, reducing the wtime.
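
A toy model of why the re-poll wait shrinks with interleaving. The numbers are hypothetical, chosen only to illustrate the scaling; this is not how 389-ds measures anything.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical load: connections queued ahead of ours on the
     * poll loop, and a nominal per-connection handling cost. */
    const int    head_of_line  = 4000;
    const int    listeners     = 4;
    const double cost_per_conn = 5.0; /* microseconds, illustrative */

    /* Listener-major freelist: every active connection sits on the
     * same listener, so a re-poll waits behind the whole head. */
    double single = head_of_line * cost_per_conn;

    /* Interleaved freelist: the head is split across all listeners
     * polling in parallel. */
    double spread = (head_of_line / listeners) * cost_per_conn;

    printf("re-poll delay, one hot listener: %.0f us\n", single);
    printf("re-poll delay, interleaved:      %.0f us\n", spread);
    return 0;
}
```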

@jchapma
Contributor

jchapma commented Aug 4, 2024

Why is it that when someone explains something it seems so obvious? Great catch.

The intention was that when a connection becomes active, connection_table_get_list() will find the least-used list to assign the connection to, balancing the connections across the multiple lists.

I am very interested in doing a throughput comparison to measure the benefit of your fix.
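
For illustration, a least-used selection of the kind described above might look like the following sketch. The struct and function names here are hypothetical; the real connection_table_get_list() in 389-ds-base may differ.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical per-listener sub-table: only the in-use count matters
 * for choosing a target list. */
struct conn_sublist {
    int in_use;
};

/* Return the index of the sub-list with the fewest active
 * connections, scanning all of them. */
static size_t least_used_list(const struct conn_sublist *lists, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (lists[i].in_use < lists[best].in_use) {
            best = i;
        }
    }
    return best;
}

int main(void)
{
    struct conn_sublist lists[4] = { {7}, {2}, {5}, {3} };
    printf("assign next connection to list %zu\n",
           least_used_list(lists, 4));
    return 0;
}
```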

@Firstyear Firstyear merged commit e7b0e20 into 389ds:main Aug 4, 2024
190 of 195 checks passed
@Firstyear
Contributor Author

No problem, happy to have found it. Depending on our customers testing and feedback there may be more coming to improve the problem :)

I'm planning to backport this to 2.5 at least if that's no issue. Our customer is on 2.2, so we may need to backport to that too in case we aren't able to update to 2.5.

@Firstyear Firstyear deleted the 20240802-conntable-interleaving branch August 4, 2024 22:58
@jchapma
Contributor

jchapma commented Aug 5, 2024

@Firstyear
I can't see any issue with backporting to all relevant branches, which if I remember correctly is 2.2 upwards.

@tbordaz
Contributor

tbordaz commented Aug 5, 2024

> Nice finding!! That explains why we were not able to measure an improvement in throughput. The initial clients were all hitting the same listener.
> The patch looks good
> A question though: my understanding is that wtime starts when the listener creates a pblock and adds it to the work queue. Since the problem is that the first listener is overloaded and slow, it should only add the pblock more slowly. I understand it impacts throughput, but I don't see how it impacts the time a specific pblock spends in the workqueue?
>
> handle_pr_read_ready() is what we poll with the multiple listeners. This iterates over its list of connections and calls connection_activity(). If activity is found, the connection is put on the work_q.
>
> Then connection_threadmain picks up the operation and, if it's the first iteration of the loop, initialises the operation. This is where wtime starts.
>
> connection_threadmain then attempts connection_read_operation. If the PDU is incomplete, the connection needs more data, so it's unlocked, removed from the work_q, and returned to be polled.
>
> Then, when more data is available, handle_pr_read_ready() runs again. connection_threadmain picks up the connection again and, since the connection is already initialised, the pblock remains. Now connection_read_operation completes the PDU and can proceed. wtime now ends.
>
> This is how wtime is affected: if there is a single listener we have to wait longer for the second handle_pr_read_ready(), as we have a longer head of connections in front that need to be handled first.
>
> By having multiple listeners polling in parallel, in the case of incomplete PDUs we return to poll the connection sooner, reducing the wtime.

@Firstyear, thank you sooo much for this detailed explanation. So my understanding is:

  • At server startup this bug overloaded the primary listener by putting the free connections in its bucket
  • initial clients all hit the same listener, negating the benefit of the multi-listener design and keeping the initial throughput as if there were a single listener
  • if requests are large or just incomplete, the listener has to do several poll/read passes on the same connection; the overloaded/slow listener is responsible for the high wtime

So high wtimes are only seen when clients send large/incomplete requests. Am I correct?

I am a bit surprised about the last bullet. I thought that it was the worker thread that polls/reads the full request. This may have changed recently (@jchapma, do you know?).

@Firstyear
Contributor Author

> I am a bit surprised about the last bullet. I thought that it was the worker thread that polls/reads the full request. This may have changed recently (@jchapma, do you know?).

Only in turbo-mode, and only a subset of threads are eligible for this.

Firstyear added 4 commits referencing this pull request Aug 6, 2024, each repeating the commit message above (Review by: @tbordaz and @jchapma (Thanks!)).
@Firstyear
Contributor Author

   4d7218b75..146c6fde6  389-ds-base-2.2 -> 389-ds-base-2.2
   41add0d65..085b48957  389-ds-base-2.3 -> 389-ds-base-2.3
   84d6ae0f4..3427e2a14  389-ds-base-2.4 -> 389-ds-base-2.4
   161888969..9c5a0ce32  389-ds-base-2.5 -> 389-ds-base-2.5
   f029897e5..3ccf62745  389-ds-base-3.0 -> 389-ds-base-3.0

@tbordaz
Contributor

tbordaz commented Aug 6, 2024

> I am a bit surprised about the last bullet. I thought that it was the worker thread that polls/reads the full request. This may have changed recently (@jchapma, do you know?).
>
> Only in turbo-mode, and only a subset of threads are eligible for this.

Okay, that makes sense.
I think it may lead to a significant support issue. So far we have been using wtime as a sign of worker starvation (i.e. time spent in the workq before getting a free worker), which we can address by increasing the number of workers. But because a worker can put an incomplete request back in the workq, an LDAP client can create fake high wtime. So high wtime may also measure a lazy client or large client requests, and in that last case an admin should not increase the number of workers.

IMHO wtime should stick to its initial purpose: detecting worker starvation.

@Firstyear
Contributor Author

@jchapma and @tbordaz - our customer has now confirmed with these changes (and from reading their test numbers more closely) that 389-ds on like-for-like hardware is twice as fast at connection handling and remains stable at higher connection counts than openldap. :)

@Firstyear
Contributor Author

@tbordaz Related to your wtime comment, it was invaluable here to find the IO stall time, so we should have a new counter for IOwait as well.

@tbordaz
Contributor

tbordaz commented Aug 19, 2024

> @tbordaz Related to your wtime comment, it was invaluable here to find the IO stall time, so we should have a new counter for IOwait as well.

Although I see the benefit for wtime (it will focus wtime on the time to get a free worker), I am unsure about adding a new 'iotime' counter in the access log. If admin/support sees a high 'iotime' (related to a big request or a lazy client), can they tune the server to protect it? Should we log a warning about those large iotimes?

@progier389
Contributor

Although we cannot do anything on the server to improve iotime, IMHO it is still interesting to compute it:

  • it is simple/cheap to compute
  • it drastically decreases the troubleshooting time in case of a client problem
  • it is probably also helpful for some replication timing issues

@Firstyear
Contributor Author

iotime on a single connection yields very little, but across a wide number of connections it can reveal a trend such as IO worker starvation, a slow network, or something else. It gives us a metric we can use to measure improvements to connection handling. And it means wtime is the true wait time in the queue for a worker, rather than the current io + wait that it is today.

As a contrast, to help explain why this is super useful, consider this log from a separate project.

Aug 20 10:34:37  INFO     ldap-request [ 1.00ms | 2.66% / 100.00% ]
Aug 20 10:34:37  INFO     ┕━ handle_ldaprequest [ 977µs | 1.86% / 97.34% ]
Aug 20 10:34:37  DEBUG       ┕━ do_search [ 958µs | 37.59% / 95.48% ]
Aug 20 10:34:37  DEBUG          ┝━ proxy_read [ 54.6µs | 4.78% / 5.44% ]
Aug 20 10:34:37  DEBUG          │  ┕━ read_acquire_ticket [ 6.64µs | 0.66% ]
Aug 20 10:34:37  DEBUG          ┝━ validate_ldap_session [ 138µs | 1.93% / 13.80% ]
Aug 20 10:34:37  DEBUG          │  ┕━ internal_search_uuid [ 119µs | 2.71% / 11.87% ]
Aug 20 10:34:37  DEBUG          │     ┕━ search [ 91.8µs | 2.68% / 9.15% ]
Aug 20 10:34:37  DEBUG          │        ┝━ be::search [ 35.1µs | 3.50% ]
Aug 20 10:34:37  DEBUG          │        ┕━ access::filter_entries [ 29.8µs | 2.97% ]
Aug 20 10:34:37  DEBUG          ┝━ search_ext [ 388µs | 1.35% / 38.64% ]
Aug 20 10:34:37  DEBUG          │  ┝━ search [ 253µs | 7.00% / 25.23% ]
Aug 20 10:34:37  DEBUG          │  │  ┝━ be::search [ 40.1µs | 4.00% ]
Aug 20 10:34:37  DEBUG          │  │  ┕━ access::filter_entries [ 143µs | 14.23% ]
Aug 20 10:34:37  DEBUG          │  ┕━ access::search_filter_entry_attributes [ 121µs | 12.06% ]
Aug 20 10:34:37  INFO           ┕━ i [info]: LDAP Search Success -> number of entries | event_tag_id: 3 | nentries: 2

Having really good, segmented timings of different server stages like this is invaluable. You can see precisely where time was consumed in the operation. The request as a whole took 1.00ms; we can see it took about 5% of that time to acquire a ticket (database lock), the search itself took 91.8µs, and access controls dominated with 2.97% + 14.23% of our time consumed.

So we can really easily see here that if we were to improve things we should look at session validation or access control execution.

This is why having timers across things like the iowait time, queue wait time, optime etc is so valuable.

@tbordaz
Contributor

tbordaz commented Aug 20, 2024

My question regarding 'iotime' was confusing. I agree we need precise counters and should stick 'wtime' to the time waiting for a worker. I agree that 'iotime' is valuable (to identify connection handling issues, for example), but my question was: should we change the access log format, adding iotime, or just log a warning (error log) when it goes over a limit?

Regarding the metrics, they are very valuable, thanks for sharing them. Not sure what validate_ldap_session is; is it a search/access to the base entry?

@tbordaz
Contributor

tbordaz commented Aug 20, 2024

JFR: the invalid wtime can be reproduced by ADDing a large group (5M members). A large iotime (accounted in wtime) comes from reading many PDUs to get the full request. Actually all workers were free, so it should be iotime=67 and wtime=0.

[20/Aug/2024:15:35:54.284898594 +0200] conn=1 op=1 ADD dn="cn=big_group_5M-1,dc=example,dc=com"
[20/Aug/2024:15:36:15.194910115 +0200] conn=1 op=1 RESULT err=0 tag=105 nentries=0 wtime=67.167284747 optime=20.910026595 etime=88.077306754
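
(Note the accounting in this log line: etime is essentially wtime + optime, i.e. 67.167284747 + 20.910026595 ≈ 88.077306754, so nearly all of the elapsed time is attributed to the queue/IO wait rather than the operation itself.)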

@Firstyear
Contributor Author

> Regarding the metrics, they are very valuable, thanks for sharing them. Not sure what validate_ldap_session is; is it a search/access to the base entry?

It's an example, don't stress too much about it. It's from another project, not 389-ds :)

The point was that having good timings can help identify issues much more accurately.


Successfully merging this pull request may close these issues.

high wtimes on high connection load.