Speed up include/exclude in terms aggregations with regexps, using Lucene regular expressions #10418
Conversation
Note for self: @rashidkpc asked to create an issue in Kibana if this change makes it in, so that Kibana can be changed too.
@rmuir @mikemccand @s1monw I am not familiar with the Automaton/RegExp/CompiledAutomaton/ByteRunAutomaton API, so if one of you could check that I did not misuse these APIs in this change, that would be great!
@@ -89,12 +93,20 @@ private void addReject(long val) {
      * @param excludeValues The terms to be excluded
      *                      (may only be {@code null} if one of the other arguments is none-null.
      */
-    public IncludeExclude(Pattern include, Pattern exclude, Set<BytesRef> includeValues, Set<BytesRef> excludeValues) {
+    public IncludeExclude(RegExp include, RegExp exclude, Set<BytesRef> includeValues, Set<BytesRef> excludeValues) {
         assert includeValues != null || include != null ||
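For context on the API question above, here is a minimal sketch of how the Lucene automaton classes fit together, replacing `java.util.regex.Pattern` matching with a DFA walk over a term's UTF-8 bytes (the pattern string is only an example, not taken from the PR):

```java
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.ByteRunAutomaton;
import org.apache.lucene.util.automaton.RegExp;

public class LuceneRegexpSketch {
    public static void main(String[] args) {
        // Compile once: the regexp becomes a deterministic automaton over
        // UTF-8 bytes, so testing a term is a linear scan of its bytes
        // rather than a java.util.regex Matcher invocation.
        Automaton automaton = new RegExp("foo.*").toAutomaton();
        ByteRunAutomaton matcher = new ByteRunAutomaton(automaton);

        BytesRef term = new BytesRef("foobar");
        System.out.println(matcher.run(term.bytes, term.offset, term.length)); // true
    }
}
```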
What is this includeValues/excludeValues? Should it just build the same matcher too (maybe require SortedSet)?
Another option for these is to reverse-lookup their ordinals against the dv instance, and you just exclude those ordinals. This should make this option much faster, as it wouldn't need to traverse the Terms at all and do contains(BytesRef).
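A rough sketch of that idea (hedged: `dv` and `acceptedOrds` stand in for the doc values instance and the bitset the aggregator already holds; none of these names are from the PR):

```java
import java.io.IOException;
import java.util.Set;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.LongBitSet;

public class OrdinalExcludeSketch {
    // Instead of walking the whole terms enum and testing contains(BytesRef),
    // resolve each excluded value straight to its ordinal and clear that bit.
    static void excludeByOrdinal(SortedSetDocValues dv, Set<BytesRef> excludeValues,
                                 LongBitSet acceptedOrds) throws IOException {
        for (BytesRef value : excludeValues) {
            long ord = dv.lookupTerm(value); // negative when the value is not indexed
            if (ord >= 0) {
                acceptedOrds.clear(ord);
            }
        }
    }
}
```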
I added some comments. In general I am a little confused as to what is going on in all cases. Doing this kind of filtering seems costly, since it will intersect against the entire termsenum, potentially dereferencing many global ordinals to byte[] to do the matching, in order to finally get the bitset. Can we use a more efficient bitset in some cases? We will be populating it in ordinal order... Do we cache these bitsets anywhere in case the same filtering is repeated over and over?
I thought this over more and brainstormed with @jpountz. I think we should always build one 'automaton' based on includes, excludes, includeValues, excludeValues, whatever. This can be Operations.minus(includes, excludes) basically. Then we can always simply use terms.intersect to make the bitset and not enumerate terms doing ord->BytesRef resolution. Making it completely optimal for this case is interesting; it's different from the techniques we would use for scoring documents, depending on the regex (for dense cases, prefix cases, …). But intersect() is probably much better already than what we do today.
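A minimal sketch of that plan, with plenty of assumptions: the patterns are placeholders, `terms` is assumed to be a doc-values-backed Terms whose enum supports ord(), and DEFAULT_MAX_DETERMINIZED_STATES is the Lucene 5.x constant name:

```java
import java.io.IOException;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.LongBitSet;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.CompiledAutomaton;
import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;

public class IntersectSketch {
    // Fold include and exclude into a single automaton, then let the terms
    // dictionary drive the enumeration: only matching terms are visited,
    // and each one hands us its ordinal for the bitset.
    static LongBitSet acceptedOrds(Terms terms, long valueCount) throws IOException {
        Automaton include = new RegExp("foo.*").toAutomaton();  // example patterns
        Automaton exclude = new RegExp("foobar").toAutomaton();
        Automaton merged = Operations.minus(include, exclude,
                Operations.DEFAULT_MAX_DETERMINIZED_STATES);

        LongBitSet accepted = new LongBitSet(valueCount);
        TermsEnum matches = new CompiledAutomaton(merged).getTermsEnum(terms);
        while (matches.next() != null) {
            accepted.set(matches.ord()); // ords arrive in increasing order
        }
        return accepted;
    }
}
```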
Force-pushed from a7dea5f to 3c067fd
I pushed a new commit in order to fold exclusions into the automaton. It is still far from great (we should better reuse what we build when several aggregators share the same include/exclude rules, for instance), but it is better than what we have today?
    private Set<BytesRef> includeValues;
    private Set<BytesRef> excludeValues;
    private final boolean hasRegexTest;

    public static class StringFilter {
When is this used? Only in the non-global ordinals case?
Yes, only for scripts.
looks great to me. I added comments, but only asking for code comments.
Force-pushed from 3c067fd to aecd9ac
…ude_exclude Aggregations: Speed up include/exclude in terms aggregations with regexps. Close #10418
@jpountz I am wondering if it would be possible to support both Lucene and Java regexps. On the one hand I greatly appreciate the performance improvements achieved by using Lucene regexps, but on the other hand I need the power and flexibility of Java regexps. In a perfect world we would get both speed AND flexibility, but in the meantime it would be great if the user could decide which regexp engine to use. Background information: it is currently not possible to include/exclude terms on a case-insensitive basis. I can think of some possible workarounds, but they are not as simple and straightforward as this implementation:
Just index your content correctly. If you want case-insensitive, that means indexing with LowerCaseFilter. It's a search engine!
I don't know how this would help me in this particular issue. I already have a multi-field mapping to preserve the original field value, so it is already possible to build the aggregation over the lowercase field. But in this case the returned field values would also be lowercase, and I am interested in the exact original value. I know that the decision to only use Lucene's regexps was made carefully for various reasons. But I am not sure that goal is reached in the end if users have to implement more complex mappings or client-side logic to achieve the same result. As I said, this use case could be implemented perfectly fine just using the snippet above with Elasticsearch 1.7.
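One possible client-side workaround in the meantime (just a sketch, not a feature of either engine): expand each letter of the pattern into a two-case character class before sending it, since Lucene regexps have no case-insensitivity flag here:

```java
public class CaseInsensitiveRegexpSketch {
    // Turns "foo.*" into "[fF][oO][oO].*" so a case-sensitive Lucene regexp
    // matches both cases. Naive: it would also rewrite letters inside existing
    // bracket expressions, so only use it on patterns without them.
    static String caseInsensitive(String regexp) {
        StringBuilder sb = new StringBuilder();
        for (char c : regexp.toCharArray()) {
            if (Character.isLetter(c)) {
                sb.append('[')
                  .append(Character.toLowerCase(c))
                  .append(Character.toUpperCase(c))
                  .append(']');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(caseInsensitive("foo.*")); // [fF][oO][oO].*
    }
}
```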
Today we check every regular expression eagerly against every possible term.
This can be very slow if you have lots of unique terms, and can even become
the bottleneck if your query is selective.
This commit switches to Lucene regular expressions instead of Java (not
exactly the same syntax, but most existing regular expressions should keep
working) and uses the same logic as RegExpQuery to intersect the regular
expression with the terms dictionary. I wrote a quick benchmark (in the PR)
to make sure it made things faster: the same request that took 750ms on
master now takes 74ms with this change.
Close #7526
Close #9848