Performance - include/exclude terms on high cardinality fields #7526
@markharwood Yes, using include/exclude is slow since it iterates over all terms and, for each term, checks whether it matches the provided includes and doesn't match the provided excludes. The global ordinal of each accepted term is then saved in a bitset. The idea here was to accept being potentially slower (which really is the case for high cardinality fields) rather than increase the transient memory footprint of a search request with a terms aggregation. I think we can optimize this logic in certain scenarios. For example, if an include without a regex expression is used, or a simple include prefix is used, we can iterate over the includes and look up their global ordinals instead of iterating over all possible terms and checking whether they match the defined include terms. I think the same can be done for exclude terms without regexes and prefixes. Would this help in your tests? And if include/exclude terms are used with a regex, maybe we should then fall back to the map execution hint, similar to what happens when a terms aggregation with a script is executed.
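The difference between the two strategies can be sketched in Python (the data and names here are hypothetical; Elasticsearch's real implementation operates on Lucene global ordinals, this only illustrates the asymptotic difference):

```python
# Hypothetical terms dictionary: unique field values mapped to global ordinals.
terms_dict = {f"term{i:06d}": i for i in range(100_000)}
include = {"term000042", "term000777"}

# Current approach: iterate over EVERY term in the index and test it against
# the include set -- cost grows with the number of unique terms.
accepted_eager = {ordinal for term, ordinal in terms_dict.items() if term in include}

# Proposed approach for non-regex includes: look up only the included terms'
# global ordinals -- cost grows with the (usually tiny) number of include terms.
accepted_lookup = {terms_dict[t] for t in include if t in terms_dict}

# Both produce the same bitset of accepted ordinals.
assert accepted_eager == accepted_lookup
```

The second form is what makes the later reported drop from tens of seconds to milliseconds plausible: the work no longer scales with field cardinality.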
Thanks for the comment, @martijnvg. My use case is looking at the interactions of a pre-defined set of entities, e.g. all the actors who have appeared in "mafia" movies, or all the people in Enron who participated in emails referencing "LJM". This is likely to be a common form of analysis as it provides us with the raw data required to conduct graph analytics (centrality measures, key influencers, initiators etc.)
Each leaf bucket is then an edge in our graph, with a count of the number of interactions between a pair of actors and the potential for further child aggs (date histograms summarising a relationship over time etc.). This is a little clumsy, and we could create a special "graph" agg for this use case, as it would overcome these concerns.
I'm not sure what the other use cases are that require these include clauses, but this feels like a broad one deserving of its own agg. For the general case I like your suggestion of taking the terms in my include set and looking up just their global ordinals, as an alternative to looking up ALL terms. Would you suggest we always adopt this approach for non-regex IncludeExcludes?
Yes, this will improve non-regex includes/excludes a lot and this should be a trivial change.
I made the change - what was a 74-second lookup now takes only 153 milliseconds on my dataset.
Blocked pending a review for the required non-regex support in #7529 |
For simpler cases where exact terms are passed in include/exclude clauses (a feature added in #7529), performance is radically improved. However, it is acknowledged that performance is still bad for a pure regex-based filter on high cardinality fields (we deliberately chose a slow response over the possibility of running out of RAM). @jpountz has suggested we could look at using some of Lucene's automaton features to efficiently filter TermsEnums. For that reason I have labeled this issue as "high hanging fruit" to recognise the remaining work that may need doing to make regex-based filters faster.
We talked about intersecting the field data terms dictionary with automatons to speed up the generation of the bit set of matching terms. This would be a breaking change in 2.0 as we would switch from Java's regular expressions to Lucene's which have a slightly different syntax. Another change might be required: in some cases evaluating whether a term matches the regular expression at search time can be faster than doing it ahead of time like we do today. |
…exps. Today we check every regular expression eagerly against every possible term. This can be very slow if you have lots of unique terms, and can even be the bottleneck if your query is selective. This commit switches to Lucene regular expressions instead of Java's (not exactly the same syntax, yet most existing regular expressions should keep working) and uses the same logic as RegExpQuery to intersect the regular expression with the terms dictionary. I wrote a quick benchmark (in the PR) to make sure it made things faster: the same request that took 750ms on master now takes 74ms with this change. Close elastic#7526
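Lucene's actual implementation compiles the regex to an automaton and uses it to drive the TermsEnum, skipping whole ranges of the sorted terms dictionary. A much-simplified Python sketch of the same idea (hypothetical data; here only the regex's literal prefix is used to seek, whereas Lucene can skip at every state transition):

```python
import bisect
import re

def matching_terms(sorted_terms, pattern, literal_prefix):
    """Instead of testing the regex against every term, seek into the
    sorted terms dictionary using the pattern's literal prefix, then
    verify the regex only within that narrow range."""
    lo = bisect.bisect_left(sorted_terms, literal_prefix)
    # First position whose term can no longer share the prefix.
    hi = bisect.bisect_left(sorted_terms, literal_prefix + "\uffff")
    rx = re.compile(pattern)
    return [t for t in sorted_terms[lo:hi] if rx.fullmatch(t)]

# 100,000 unique terms, but only the "user000.." range is ever inspected.
terms = sorted(f"user{i:05d}" for i in range(100_000))
hits = matching_terms(terms, r"user000\d\d", "user000")
assert len(hits) == 100  # user00000 .. user00099
```

The eager approach would have run the regex 100,000 times; the seek-based approach runs it 100 times, which mirrors the 750ms-to-74ms improvement reported above.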
I have noticed that if you add an "include" or "exclude" section as part of a terms or significant_terms aggregation on a high-cardinality field the performance can degrade quite badly.
The root cause is that the IncludeExclude.acceptedGlobalOrdinals() method enumerates all terms in the index eagerly, rather than lazily enumerating only those in the result set. For a high cardinality field this can take a very long time (tens of seconds in my test). As the method name suggests, this code is run when global ordinals are used, so a work-around is for the client to add an execution hint to their agg definition to avoid the use of global ordinals. In my tests this reduced a query that took 30 seconds to sub-second, but obviously there may be memory concerns relating to not using global ordinals.
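For reference, the work-around looks like the following request fragment (the field name and agg name here are illustrative, not from the original report):

```json
{
  "aggs": {
    "actors": {
      "terms": {
        "field": "actor",
        "include": "mafia.*",
        "execution_hint": "map"
      }
    }
  }
}
```

The `"execution_hint": "map"` setting tells the terms aggregation to collect values directly rather than via global ordinals, which sidesteps the eager enumeration at the cost of higher per-request memory.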
This issue is created to discuss the ways in which we could automatically "do the right thing" without users needing to provide execution hints or incurring other costs.