Issue 6284 - BUG - freelist ordering causes high wtime #6285
Conversation
Bug Description: Multi-threaded listeners were added to support having multiple threads able to perform IO for connections, to reduce wtime. However, when the server is first started, if the number of clients is less than the size of a single listener's conntable, this causes extremely high wtimes. This is because the initial connection freelist construction adds every connection in order of *listener*, so new connections always fill the first listener thread, and only once that listener is full does it spill over to the next. If the conntable size is small this isn't noticeable, but as an example, if your conntable is 16384 elements with 4 listeners, then the first 4096 connections will always go to the first listener.

Fix Description: When the freelist is constructed, rather than iterating over each listener *then* each slot, we iterate over each slot then each listener. This interleaves the initial freelist between all listeners so that even with a small number of connections the work is spread fairly without lasering a single listener.

A test supplied by a user shows a significant drop in wtime and an increase in operation throughput with this change.

fixes: 389ds#6284

Author: William Brown <william@blackhats.net.au>

Review by: ???
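For illustration, a minimal C sketch of the interleaving idea. The names here (build_freelist_interleaved, Connection, the flat table layout) are assumptions for the example, not the actual 389-ds conntable code; it only shows the slot-major ordering that replaces the listener-major one.

```c
#include <stddef.h>

/* Hypothetical sketch, not the real conntable code: build the freelist
 * slot-major instead of listener-major, so that consecutive free entries
 * belong to different listener threads. */
typedef struct connection Connection;

static void
build_freelist_interleaved(Connection **table, Connection **freelist,
                           size_t num_listeners, size_t slots_per_listener)
{
    size_t idx = 0;

    /* Old ordering: for each listener, for each slot -> L0S0, L0S1, ..., L1S0, ...
     * New ordering: for each slot, for each listener -> L0S0, L1S0, ..., L0S1, ... */
    for (size_t slot = 0; slot < slots_per_listener; slot++) {
        for (size_t listener = 0; listener < num_listeners; listener++) {
            freelist[idx++] = table[listener * slots_per_listener + slot];
        }
    }
}
```

With this ordering, the first `num_listeners` connections each land on a different listener instead of all landing on the first one.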
Nice finding!! That explains why we were not able to measure an improvement in throughput. The initial clients were all hitting the same listener.
The patch looks good.
A question though: my understanding is that wtime starts when the listener creates a pblock and adds it to the work queue. The problem being that the first listener is overloaded and slow, it should only add the pblock more slowly. I understand it impacts throughput, but I don't see how it impacts the time a specific pblock spends in the work queue?
handle_pr_read_ready() is what we poll with the multiple listeners. It iterates over its list of connections and calls connection_activity(). If activity is found, the connection is put on the work_q. Then connection_threadmain picks up the operation and, if it's the first iteration of the loop, initialises the operation. This is where wtime starts.

connection_threadmain then attempts connection_read_operation. If the PDU is incomplete then the connection needs more data, so it's unlocked, removed from the work_q, and returned to be polled. Then handle_pr_read_ready() runs again when more data is available. connection_threadmain picks up the connection again and, since the connection is already initialised, the pblock remains. Now connection_read_operation completes the PDU and can proceed. wtime now ends.

This is how wtime is affected: with a single listener we have to wait longer for the second handle_pr_read_ready(), because there is a longer head of connections in front that need to be handled first. By having multiple listeners polling in parallel, in the case of incomplete PDUs we return to poll the connection sooner, reducing the wtime.
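To make that flow easier to follow, here is a heavily simplified sketch of the worker side in C. The types and helpers (work_item_t, work_q_dequeue, read_full_pdu, return_to_listener, ...) are placeholders standing in for the real listener/worker plumbing, not 389-ds functions; the sketch only marks where wtime starts and ends.

```c
#include <stdbool.h>
#include <time.h>

/* Placeholder types; the real 389-ds structures are far richer. */
typedef struct { struct timespec wtime_start, wtime_end; } op_t;
typedef struct conn conn_t;
typedef struct { conn_t *conn; op_t *op; } work_item_t;

/* Placeholder helpers standing in for the real listener/worker plumbing. */
work_item_t *work_q_dequeue(void);
op_t *operation_new(void);
bool read_full_pdu(conn_t *c, op_t *op);     /* false => PDU incomplete */
void return_to_listener(conn_t *c);
void process_operation(conn_t *c, op_t *op);

static void
worker_thread_main(void)
{
    for (;;) {
        work_item_t *wi = work_q_dequeue();          /* queued by a listener */

        if (wi->op == NULL) {
            wi->op = operation_new();                /* first pass: wtime starts */
            clock_gettime(CLOCK_MONOTONIC, &wi->op->wtime_start);
        }

        if (!read_full_pdu(wi->conn, wi->op)) {
            /* PDU incomplete: hand the connection back to its listener to be
             * polled again. With a single overloaded listener this re-poll
             * queues behind many other connections, inflating wtime. */
            return_to_listener(wi->conn);
            continue;
        }

        clock_gettime(CLOCK_MONOTONIC, &wi->op->wtime_end);  /* full PDU: wtime ends */
        process_operation(wi->conn, wi->op);
    }
}
```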
Why is it that when someone explains something it seems so obvious? Great catch. The intention was that when a connection becomes active, connection_table_get_list() will find the least used list to assign the connection to, balancing the connections across the multiple lists. I am very interested in doing a throughput comparison to measure the benefit of your fix.
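(For readers following along, the "least used list" selection amounts to something like the sketch below; this is an illustrative stand-in, not the actual connection_table_get_list() implementation.)

```c
#include <stddef.h>

/* Illustrative sketch only: pick the sub-table whose listener currently
 * has the fewest active connections, so active connections balance
 * across the lists over time. */
static size_t
pick_least_used_list(const size_t *active_per_list, size_t num_lists)
{
    size_t best = 0;

    for (size_t i = 1; i < num_lists; i++) {
        if (active_per_list[i] < active_per_list[best]) {
            best = i;
        }
    }
    return best;
}
```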
No problem, happy to have found it. Depending on our customer's testing and feedback there may be more coming to improve the problem :) I'm planning to backport this to 2.5 at least, if that's no issue. Our customer is on 2.2, so we may need to backport to that too in case we aren't able to update to 2.5.
@Firstyear, thank you sooo much for these detailed explanations. So my understanding is:
So high wtime is only seen when clients send large/incomplete requests. Am I correct? I am a bit surprised about the last bullet. I thought that it was the worker thread that polls/reads the full request. This may have changed recently (@jchapma, do you know?).
Only in turbo-mode, and only a subset of threads are eligible for this.
Bug Description: Multi-threaded listeners were added to support having multiple threads able to perform IO for connections, to reduce wtime. However, when the server is first started, if the number of clients is less than the size of a single listener's conntable, this causes extremely high wtimes. This is because the initial connection freelist construction adds every connection in order of *listener*, so new connections always fill the first listener thread, and only once that listener is full does it spill over to the next. If the conntable size is small this isn't noticeable, but as an example, if your conntable is 16384 elements with 4 listeners, then the first 4096 connections will always go to the first listener.

Fix Description: When the freelist is constructed, rather than iterating over each listener *then* each slot, we iterate over each slot then each listener. This interleaves the initial freelist between all listeners so that even with a small number of connections the work is spread fairly without lasering a single listener.

A test supplied by a user shows a significant drop in wtime and an increase in operation throughput with this change.

fixes: #6284

Author: William Brown <william@blackhats.net.au>

Review by: @tbordaz and @jchapma (Thanks!)
Okay. That makes sense. IMHO wtime should stick to its initial purpose: detecting worker starvation.
@tbordaz Related to your wtime comment: it was invaluable here for finding the IO stall time, so we should have a new counter for IOwait as well.
Although I see the benefit for wtime (it will focus wtime on the time to get a free worker), I am unsure about adding a new 'iotime' counter to the access log. If admin/support sees a high 'iotime' (related to a big request or a lazy client), can they tune the server to protect it? Should we log a warning about those large iotime values?
Although we cannot do anything to improve iotime on the server, IMHO it is still interesting to compute it:
iotime on a single connection yields very little, but across a wide number of connections it can reveal a trend such as ioworker starvation, a slow network, or something else. It gives us a metric we can use to measure improvements to connection handling. And it means wtime becomes the true wait time in queue for a worker, rather than the current io + wait that it is today.

As a contrast, to help explain why this is super useful, consider this log from a separate project. Having really good, segmented timings of the different server stages like this is invaluable. You can see precisely where time was consumed in the operation. The request as a whole took 1.00ms; we can see it took 5% of that time to acquire a ticket (database lock), the search itself took 91.8us, and access controls dominated with 2.97% + 14.23% of our time consumed. So we can really easily see here that if we were to improve things we should look at session validation or access control execution. This is why having timers across things like the iowait time, queue wait time, optime etc. is so valuable.
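For the sake of illustration, here is a self-contained sketch of how such segmented timers could be taken with a monotonic clock. The names and segment boundaries (t_accept, t_pdu_complete, etc.) are assumptions for the example, not the 389-ds implementation.

```c
#include <stdio.h>
#include <time.h>

/* Elapsed seconds between two monotonic timestamps. */
static double
elapsed_s(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) + (double)(b.tv_nsec - a.tv_nsec) / 1e9;
}

int
main(void)
{
    struct timespec t_accept, t_pdu_complete, t_worker_start, t_result;

    clock_gettime(CLOCK_MONOTONIC, &t_accept);        /* request starts arriving */
    /* ... read the request PDUs ... */
    clock_gettime(CLOCK_MONOTONIC, &t_pdu_complete);  /* full request read */
    /* ... wait in the work queue ... */
    clock_gettime(CLOCK_MONOTONIC, &t_worker_start);  /* a worker picks it up */
    /* ... execute the operation ... */
    clock_gettime(CLOCK_MONOTONIC, &t_result);        /* result sent */

    printf("iotime=%.9f wtime=%.9f optime=%.9f etime=%.9f\n",
           elapsed_s(t_accept, t_pdu_complete),       /* time reading the request */
           elapsed_s(t_pdu_complete, t_worker_start), /* time waiting for a worker */
           elapsed_s(t_worker_start, t_result),       /* time executing the op */
           elapsed_s(t_accept, t_result));            /* end-to-end */
    return 0;
}
```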
My question regarding 'iotime' was confusing. I agree we need precise counters and that 'wtime' should stick to the time waiting for a worker. I agree that 'iotime' is valuable (to identify connection handling issues, for example), but my question was: should we change the access log format, adding iotime, or just log a warning (error log) when it goes over a limit? Regarding the metrics, they are very valuable, thanks for sharing them. Not sure what validate_ldap_session is; is it a search/access to the base entry?
JFR: an invalid wtime can be reproduced by ADDing a large group (5M members). A large iotime (accounted in wtime) is due to reading many PDUs to get the total request. Actually all workers were free, so it should be iotime=67 and wtime=0.

[20/Aug/2024:15:35:54.284898594 +0200] conn=1 op=1 ADD dn="cn=big_group_5M-1,dc=example,dc=com"
[20/Aug/2024:15:36:15.194910115 +0200] conn=1 op=1 RESULT err=0 tag=105 nentries=0 wtime=67.167284747 optime=20.910026595 etime=88.077306754
It's an example, don't stress too much about it. It's from another project, not 389-ds :) The point was that having good timings can help identify issues much more accurately.