Raise idlistscanlimit by default #2435
Comments
Comment from firstyear (@Firstyear) at 2017-09-11 04:07:06 Here is an example:
So despite being over 40,000 entries, using the indexes is still faster. In fact, it looks like the idlistscanlimit is an attempt to optimise the "(&(uid=x)(objectClass=y))" case to improve this search, but the recent changes to filter_optimise in #2431 solve this much more completely. I think we should remove the idlistscanlimit, as no other database (e.g. PostgreSQL) has such an odd concept of bypassing indexes. We can gain more from improving filter optimisation than from trying to hand-tune these numbers for specific cases.
Comment from firstyear (@Firstyear) at 2017-09-11 04:07:07 Metadata Update from @Firstyear:
Comment from firstyear (@Firstyear) at 2017-09-11 04:24:28 Following this, this would actually be a case for raising the filter test threshold, because that would allow better shortcutting of genuinely small sets that want to avoid large indexes. I.e. if we set the test threshold to, say, 32, then if the set is larger than this it's probably better to actually use the index. We could do some testing of the numbers here to be sure, but 32 - 64 could be good guesses.
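The shortcut described above can be illustrated with a minimal sketch. This is not 389-ds-base code: the function name, the toy ID sets, and the threshold value of 32 (a guess from the comment, not a measured constant) are all hypothetical.

```python
# Hypothetical sketch of the "filter test threshold" shortcut: while
# intersecting the candidate IDLs of an AND filter, stop early once the
# running candidate set is small enough to filter-test directly.

FILTER_TEST_THRESHOLD = 32  # illustrative guess from the comment above

def and_candidates(idls):
    """Walk AND-term IDLs; shortcut out once the running set is small enough."""
    result = None
    for idl in idls:
        result = idl if result is None else (result & idl)
        if len(result) <= FILTER_TEST_THRESHOLD:
            break  # cheap to filter-test these few entries directly now
    return result

tiny = set(range(8))        # a genuinely small candidate set
huge = set(range(50_000))   # a large index we would rather not touch

print(len(and_candidates([tiny, huge])))  # 8: the huge idl was never intersected
```

The point is that the shortcut only pays off when the first set really is small; if it is larger than the threshold, intersecting with the index is the better path.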
Comment from firstyear (@Firstyear) at 2017-09-11 04:46:26 Another point is that to make filter optimisation easier, perhaps we should change their structure from a linked list to something easier to sort / re-arrange?
Comment from firstyear (@Firstyear) at 2017-09-12 01:55:51
Comment from firstyear (@Firstyear) at 2017-09-12 01:57:20 Metadata Update from @Firstyear:
Comment from mreynolds (@mreynolds389) at 2017-09-14 17:35:25 Metadata Update from @mreynolds389:
Comment from firstyear (@Firstyear) at 2017-10-03 06:18:10 Relies on #2431
Comment from firstyear (@Firstyear) at 2017-10-03 06:18:12 Metadata Update from @Firstyear:
Comment from lkrispen (@elkris) at 2017-10-04 11:02:22 I don't think we should change the default. There are some scenarios where it might be better to always return an idl. It is always a decision about what is more costly: to handle and manage a very large idlist (which also involves a lot of database accesses) or to handle all entries (which might already be in the entry cache). In general, I think for changes like this we should have a set of reliable performance benchmarks first.
Comment from firstyear (@Firstyear) at 2017-10-05 00:56:47 @elkris I have been performing a number of tests with the filter optimisation patch enabled, and in all cases the idlistscanlimit of 999999999 exceeded the performance of the current default. I will prepare a set of numbers for you. As much as I love and respect GSS, they do NOT know how to suggest a proper setting here. This is a deeply confusing setting that relies on intimate knowledge of our internal search procedure that most users don't have. I agree that we should remove the scan limit completely, but I would like to do this first to give us a clean backout strategy if it does fail on us.
Comment from lkrispen (@elkris) at 2017-11-17 09:39:17 waiting for the set of numbers
Comment from firstyear (@Firstyear) at 2017-11-20 10:31:21 Yep! I need to make the numbers for you! I've been quite busy as you might know, and I want to work on this with @vashirov to make our performance test framework. If you are going to do something, may as well do it correctly.
I don't have any official numbers, but I also have not seen a performance hit when using a very large value for idscanlimit. I do recall with DSEE that it was worse, but in 389 I did not see the same behavior. I'm not sure about LMDB, but increasing this with bdb has more pros than cons.
@mreynolds389 This change is predicated on query optimisation existing, and I ran into waaayyyy too many issues with VLV to get it to "stick" in the past.
@mreynolds389 @progier389 Can we review this again? We have the optimiser code now, and the IDL scan limit does nothing for us but actively harm performance of large queries. We should increase this value, but better yet, we should remove it entirely.
I would love to drop it entirely, but I know others on the team are not so sure. Need to prove that it won't hurt anything...
My gut feeling is that @Firstyear is right and that that limit is obsolete (probably since the idl_new index format was added), but we have an easy way to mitigate the risks by doing the change in two phases:
Sounds like a reasonable plan to me :) @progier389 It's not because of the idl_new format, but the query optimiser. The summary is: if we have "(&(objectClass=person)(uid=william))", then we want to hit the "filter test threshold" as fast as possible to minimise work. The idlscanlimit does this by loading the idl for objectClass=person, then discarding it if it is too long, then "hoping" that whatever comes next is going to be more precise. The optimiser does this by swapping the terms to "(&(uid=william)(objectClass=person))", meaning the idl for uid=william is loaded, is seen to be length 1, and the shortcut out is taken and the filter test applied.
The idlscanlimit has a single pathological bad case, which is where you are scanning large sets, for example "(&(objectClass=person)(country=australia))". In the idlscanlimit case, if both idls are > scanlimit, this becomes fully unindexed and we filter test the whole DB. By raising the idlscanlimit we avoid this: we still end up loading two large idls, but that is a fraction of the cost compared to a full table scan. Since the optimiser now does the work to ensure queries are in "shortcut likely" arrangements, the idlscanlimit now just causes a degradation, not an improvement.
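The term-swapping the comment describes can be sketched in miniature. This is an illustrative model, not 389-ds-base code: the function names, the toy index, and the size-estimation strategy are all assumptions for the sake of the example.

```python
# Hypothetical sketch of optimiser-style term reordering: sort the terms
# of an AND filter so the most selective (smallest IDL) is loaded first,
# making the filter-test shortcut as likely as possible.

def estimate_size(term, index):
    """Estimated number of IDs the index holds for this term."""
    return len(index.get(term, []))

def optimise_and(terms, index):
    """Reorder conjunction terms by ascending estimated IDL size."""
    return sorted(terms, key=lambda t: estimate_size(t, index))

# Toy index: uid=william matches 1 entry, objectClass=person matches 40,000.
index = {
    "uid=william": [17],
    "objectClass=person": list(range(1, 40_001)),
}

terms = ["objectClass=person", "uid=william"]
print(optimise_and(terms, index))  # uid=william is now evaluated first
```

With the selective term first, the length-1 idl is loaded immediately and the filter-test shortcut fires without ever touching the large objectClass index.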
Bug Description: The IDL scan limit existed as a naive attempt at query optimisation to force longer IDLs to be skipped in favour of short IDLs. However, since we now have a true query optimiser that can handle this correctly, the IDL scan limit is redundant and not functional. The only condition the IDL scan limit can now impact is making queries slower by rejecting the use of longer IDLs when shortcut conditions are not possible. Fix Description: Raise the IDL Scan Limit to INT_MAX to observe the results. In the future we can remove the code entirely. fixes: 389ds#2435 Author: William Brown <william@blackhats.net.au> Review by: ???
Bug Description: The IDL scan limit existed as a naive attempt at query optimisation to force longer IDLs to be skipped in favour of short IDLs. However, since we now have a true query optimiser that can handle this correctly, the IDL scan limit is redundant and not functional. The only condition the IDL scan limit can now impact is making queries slower by rejecting the use of longer IDLs when shortcut conditions are not possible. Fix Description: Raise the IDL Scan Limit to INT_MAX to observe the results. In the future we can remove the code entirely. fixes: #2435 Author: William Brown <william@blackhats.net.au> Review by: @progier389 @mreynolds389
Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/49376
Issue Description
With https://pagure.io/389-ds-base/issue/49372 we try to avoid hitting large IDLs by fixing the filters.
There are some cases left we may access IDLs and in those cases, we probably want to return an IDL anyway.
One case is if we hit the idlistscanlimit in an OR query. In this case we return ALLIDS to the parent filter, which likely in turn raises this back to its parent. This means that we do a very expensive id2entry scan. Had we returned an IDL, even if the IDL had say 100,000 items, that's still better than doing a full id2entry scan on a db with 200,000 entries.
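The cost difference in this OR case can be made concrete with a small sketch. This is not 389-ds-base code: the function, the sentinel, the scan-limit value, and the entry counts are illustrative assumptions mirroring the 100,000 / 200,000 figures above.

```python
# Hypothetical sketch: why returning a large-but-real IDL from an OR term
# beats collapsing to ALLIDS. Numbers mirror the example in the issue text.

DB_SIZE = 200_000
ALLIDS = None  # sentinel meaning "unindexed: filter-test every entry in the db"

def union_idls(idls, scan_limit):
    """Union child IDLs; with a scan limit, any oversized child forces ALLIDS."""
    result = set()
    for idl in idls:
        if scan_limit is not None and len(idl) > scan_limit:
            return ALLIDS  # the whole OR becomes unindexed
        result |= idl
    return result

big = set(range(100_000))              # one OR branch matches 100,000 entries
small = set(range(100_000, 100_050))   # the other matches 50

with_limit = union_idls([big, small], scan_limit=4000)
without_limit = union_idls([big, small], scan_limit=None)

cost_with = DB_SIZE if with_limit is ALLIDS else len(with_limit)
cost_without = len(without_limit)
print(cost_with, cost_without)  # 200000 vs 100050 entries to filter-test
```

Even though 100,050 is a large candidate set, it is still half the work of the full id2entry scan the scan limit forces.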
Another case is broad AND queries. If we do a subtree search of objectClass=group under, say, ou=Groups, and this returns ALLIDS, we have just wasted a lot of time falling back to a full db search when we could have loaded the IDL and intersected it with parentid.
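A tiny sketch of the intersection the paragraph describes. The data and function are hypothetical; the real server works with on-disk index IDLs, not Python sets.

```python
# Hypothetical sketch: intersecting an attribute IDL with a parentid IDL
# instead of falling back to a full database scan. Toy data only.

def intersect_idls(a, b):
    """Candidate set for an AND of two indexed terms."""
    return a & b

groups = set(range(0, 5000))       # IDs matching objectClass=group
under_ou = set(range(4000, 4200))  # IDs whose parentid places them under ou=Groups

candidates = intersect_idls(groups, under_ou)
print(len(candidates))  # 200 entries to filter-test, not the whole db
```

Loading a 5,000-entry IDL and intersecting it down to 200 candidates is far cheaper than the ALLIDS path, which would filter-test every entry in the database.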
I think that we should examine the current idlistscanlimit and either raise it, or consider removing it altogether.