Performance - include/exclude terms on high cardinality fields #7526
@markharwood Yes, using include/exclude is slow since it iterates over all terms and, for each term, checks whether it matches the provided includes and doesn't match the provided excludes. The global ordinal of each accepted term is then saved in a bitset. The idea here was to accept being potentially slower (which really is the case for high cardinality fields) rather than increase the transient memory footprint of a search request with a terms aggregation. I think we can optimize this logic in certain scenarios. For example, if an include without a regex expression is used, or a simple include prefix is used, we can iterate over the includes and look up their global ordinals instead of iterating over all possible terms and checking whether they match the defined include terms. I think the same can be done for exclude terms without regexes and prefixes. Would this help in your tests? And if include/exclude terms are used with a regex, maybe we should then fall back to the map execution hint, similar to what happens when a terms aggregation with a script is executed.
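The difference between the two strategies can be sketched in Python (the data and names here are hypothetical; Elasticsearch's real implementation operates on Lucene global ordinals, this only illustrates the asymptotic difference):

```python
# Hypothetical terms dictionary: unique field values mapped to global ordinals.
terms_dict = {f"term{i:06d}": i for i in range(100_000)}
include = {"term000042", "term000777"}

# Current approach: iterate over EVERY term in the index and test it against
# the include set -- cost grows with the number of unique terms.
accepted_eager = {ordinal for term, ordinal in terms_dict.items() if term in include}

# Proposed approach for non-regex includes: look up only the included terms'
# global ordinals -- cost grows with the (usually tiny) number of include terms.
accepted_lookup = {terms_dict[t] for t in include if t in terms_dict}

# Both produce the same bitset of accepted ordinals.
assert accepted_eager == accepted_lookup
```

The second form is what makes the later reported drop from tens of seconds to milliseconds plausible: the work no longer scales with field cardinality.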
Thanks for the comment, @martijnvg. My use case is looking at the interactions of a pre-defined set of entities, e.g. all the actors who have appeared in "mafia" movies, or all the people in Enron who participated in emails referencing "LJM". This is likely to be a common form of analysis as it provides us with the raw data required to conduct graph analytics (centrality measures, key influencers, initiators etc.)
Each leaf bucket is then an edge in our graph, with a count of the number of interactions between a pair of actors and the potential for further child aggs (date histograms summarising a relationship over time etc.). This is a little clumsy, and we could create a special "graph" agg for this use case, as it would overcome these concerns.
I'm not sure what the other use cases are that require these include clauses, but this feels like a broad one deserving of its own agg. For the general case I like your suggestion of taking the terms in my include set and looking up just their global ordinals, as an alternative to looking up ALL terms. Would you suggest we always adopt this approach for non-regex IncludeExcludes?
Yes, this will improve non-regex includes/excludes a lot and this should be a trivial change.
I made the change - what was a 74-second lookup now takes only 153 milliseconds on my dataset.
Blocked pending a review for the required non-regex support in #7529 |
For simpler cases where exact terms are passed in include/exclude clauses (a feature added in #7529), performance is radically improved. However, it is acknowledged that performance is still bad for a pure regex-based filter on high cardinality fields (we deliberately chose a slow response over the possibility of running out of RAM). @jpountz has suggested we could look at using some of Lucene's automaton features to efficiently filter TermsEnums. For that reason I have labeled this issue as "high hanging fruit" to recognise the remaining work that may need doing to make regex-based filters faster.
We talked about intersecting the field data terms dictionary with automatons to speed up the generation of the bit set of matching terms. This would be a breaking change in 2.0 as we would switch from Java's regular expressions to Lucene's which have a slightly different syntax. Another change might be required: in some cases evaluating whether a term matches the regular expression at search time can be faster than doing it ahead of time like we do today. |
…exps. Today we check every regular expression eagerly against every possible term. This can be very slow if you have lots of unique terms, and can even be the bottleneck if your query is selective. This commit switches to Lucene regular expressions instead of Java's (not exactly the same syntax, yet most existing regular expressions should keep working) and uses the same logic as RegExpQuery to intersect the regular expression with the terms dictionary. I wrote a quick benchmark (in the PR) to make sure it made things faster: the same request that took 750ms on master now takes 74ms with this change. Close elastic#7526
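Lucene's actual implementation compiles the regex to an automaton and uses it to drive the TermsEnum, skipping whole ranges of the sorted terms dictionary. A much-simplified Python sketch of the same idea (hypothetical data; here only the regex's literal prefix is used to seek, whereas Lucene can skip at every state transition):

```python
import bisect
import re

def matching_terms(sorted_terms, pattern, literal_prefix):
    """Instead of testing the regex against every term, seek into the
    sorted terms dictionary using the pattern's literal prefix, then
    verify the regex only within that narrow range."""
    lo = bisect.bisect_left(sorted_terms, literal_prefix)
    # First position whose term can no longer share the prefix.
    hi = bisect.bisect_left(sorted_terms, literal_prefix + "\uffff")
    rx = re.compile(pattern)
    return [t for t in sorted_terms[lo:hi] if rx.fullmatch(t)]

# 100,000 unique terms, but only the "user000.." range is ever inspected.
terms = sorted(f"user{i:05d}" for i in range(100_000))
hits = matching_terms(terms, r"user000\d\d", "user000")
assert len(hits) == 100  # user00000 .. user00099
```

The eager approach would have run the regex 100,000 times; the seek-based approach runs it 100 times, which mirrors the 750ms-to-74ms improvement reported above.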
I have noticed that if you add an "include" or "exclude" section as part of a terms or significant_terms aggregation on a high-cardinality field the performance can degrade quite badly.
The root cause is that the IncludeExclude.acceptedGlobalOrdinals() method enumerates all terms in the index eagerly, rather than lazily enumerating only those in the result set. For a high cardinality field this can take a very long time (tens of seconds in my test). As the method name suggests, this code is run when global ordinals are used, so a work-around is for the client to add an execution hint to their agg definition to avoid the use of global ordinals. In my tests this reduced a query that took 30 seconds to sub-second, but obviously there may be memory concerns relating to not using global ordinals.
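For reference, the work-around looks like the following request fragment (the field name and agg name here are illustrative, not from the original report):

```json
{
  "aggs": {
    "actors": {
      "terms": {
        "field": "actor",
        "include": "mafia.*",
        "execution_hint": "map"
      }
    }
  }
}
```

The `"execution_hint": "map"` setting tells the terms aggregation to collect values directly rather than via global ordinals, which sidesteps the eager enumeration at the cost of higher per-request memory.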
This issue is created to discuss the ways in which we could automatically "do the right thing" without users needing to provide execution hints or incurring other costs.