
Speed up include/exclude in terms aggregations with regexps, using Lucene regular expressions #10418

Merged

Conversation

jpountz
Contributor

@jpountz jpountz commented Apr 3, 2015

Today we check every regular expression eagerly against every possible term.
This can be very slow if you have lots of unique terms, and can even become the
bottleneck if your query is selective.

This commit switches to Lucene regular expressions instead of Java ones (not
exactly the same syntax, but most existing regular expressions should keep
working) and uses the same logic as RegExpQuery to intersect the regular
expression with the terms dictionary. I wrote a quick benchmark (in the PR) to
make sure it made things faster: the same request that took 750ms on master now
takes 74ms with this change.

Close #7526
Close #9848

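The eager per-term matching the PR replaces can be sketched in plain Java (the class and helper names here are illustrative, not from the PR): every unique term in the field is tested against a compiled `Pattern`, so the cost is proportional to the number of unique terms no matter how selective the query is.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustrative sketch of the *old* behaviour this PR removes: a Java Pattern
// is checked eagerly against every unique term in the field.
public class EagerIncludeDemo {
    // Returns the terms accepted by the include pattern, one regex match per term.
    static List<String> filter(List<String> uniqueTerms, String includeRegex) {
        Pattern include = Pattern.compile(includeRegex);
        return uniqueTerms.stream()
                .filter(t -> include.matcher(t).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> terms = List.of("apple", "apricot", "banana", "cherry");
        System.out.println(filter(terms, "ap.*")); // [apple, apricot]
    }
}
```

The commit instead compiles the expression to an automaton and intersects it with the terms dictionary (the same logic as RegExpQuery), which can skip whole ranges of non-matching terms rather than testing each one.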
@jpountz
Contributor Author

jpountz commented Apr 3, 2015

Note to self: @rashidkpc asked to create an issue in Kibana if this change makes it in, so that Kibana can be changed too.

@jpountz
Contributor Author

jpountz commented Apr 3, 2015

@rmuir @mikemccand @s1monw I am not familiar with the Automaton/RegExp/CompiledAutomaton/ByteRunAutomaton API so if one of you could check that I did not misuse these APIs in this change, that would be great!

@@ -89,12 +93,20 @@ private void addReject(long val) {
      * @param excludeValues The terms to be excluded
      *   (may only be {@code null} if one of the other arguments is non-null).
      */
-    public IncludeExclude(Pattern include, Pattern exclude, Set<BytesRef> includeValues, Set<BytesRef> excludeValues) {
+    public IncludeExclude(RegExp include, RegExp exclude, Set<BytesRef> includeValues, Set<BytesRef> excludeValues) {
         assert includeValues != null || include != null ||
Contributor

What are these includeValues/excludeValues? Should it just build the same matcher for them too (maybe requiring a SortedSet)?

Contributor

Another option for these is to reverse-lookup their ordinals against the doc-values instance and just exclude those ordinals. That should make this option much faster, since it wouldn't need to traverse the Terms at all or do contains(BytesRef) checks.

@rmuir
Contributor

rmuir commented Apr 3, 2015

I added some comments. In general I am a little confused as to what is going on in all cases.

Doing this kind of filtering seems costly, since it will intersect against the entire termsenum, potentially dereferencing many global ordinals to byte[] to do the matching, in order to finally get the bitset.

Can we use a more efficient bitset in some cases? We will be populating it in ordinal order...

Do we cache these bitsets anywhere in case the same filtering is repeated over and over?

@rmuir
Contributor

rmuir commented Apr 3, 2015

I thought this over more and brainstormed with @jpountz. I think we should always build one 'automaton' based on includes, excludes, includeValues, excludeValues, whatever. This can basically be Operations.minus(includes, excludes).

Then, we can always simply use terms.intersect to build the bitset rather than enumerating terms and doing ord-to-BytesRef resolution. Making it completely optimal for this case is interesting; it's different from the techniques we would use for scoring documents, depending on the regex.

For dense cases, prefix cases (ab*), or similar regexps, it could be a win to be tricky. If someone does ab*, we could just do two reverse lookups (ab and ac) and fill the bitset range in between, instead of reading all the ordinals and deref'ing them to terms in ab*.

But intersect() is probably much better already than what we do today.
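The two-reverse-lookups idea above can be sketched with a sorted term dictionary in plain Java. This is only a conceptual stand-in: Lucene's terms dictionary and ordinal APIs are different, and a `TreeSet` here just plays the role of terms in ordinal order.

```java
import java.util.List;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

// Sketch of the "two reverse lookups" trick for a prefix pattern like ab*:
// rather than matching every term, find the sorted range [ab, ac) and take
// everything inside it. A TreeSet stands in for the ordinal-sorted terms dict.
public class PrefixRangeDemo {
    static SortedSet<String> prefixRange(NavigableSet<String> termDict, String prefix) {
        // Upper bound: increment the last character ("ab" -> "ac"). This sketch
        // ignores the edge case where the last character is Character.MAX_VALUE.
        String upper = prefix.substring(0, prefix.length() - 1)
                + (char) (prefix.charAt(prefix.length() - 1) + 1);
        return termDict.subSet(prefix, true, upper, false); // [prefix, upper)
    }

    public static void main(String[] args) {
        NavigableSet<String> dict = new TreeSet<>(List.of(
                "aa", "ab", "abc", "abd", "ac", "b"));
        System.out.println(prefixRange(dict, "ab")); // [ab, abc, abd]
    }
}
```

The range view is computed from the sorted structure alone, without testing each term against the pattern, which is the win rmuir describes for dense prefix-shaped regexps.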

@jpountz force-pushed the enhancement/speed_up_aggs_include_exclude branch from a7dea5f to 3c067fd on April 8, 2015
@jpountz
Contributor Author

jpountz commented Apr 8, 2015

I pushed a new commit that folds the exclusions into the automaton. It is still far from great (for instance, we should reuse what we build when several aggregators use the same include/exclude rules), but better than what we have today?
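Folding the exclusions in means a term is accepted iff the include pattern matches it and the exclude pattern does not, which is exactly the language of `Operations.minus(include, exclude)` in automaton terms. A set-semantics sketch in plain Java (Lucene does this on automata, not per term):

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Set-semantics sketch of Operations.minus(include, exclude): a term is kept
// iff it matches the include pattern and does not match the exclude pattern.
// Lucene performs this subtraction once on the automata; this demo applies
// the equivalent predicate per term only to illustrate the accepted language.
public class IncludeMinusExcludeDemo {
    static List<String> accepted(List<String> terms, String include, String exclude) {
        Pattern inc = Pattern.compile(include);
        Pattern exc = Pattern.compile(exclude);
        return terms.stream()
                .filter(t -> inc.matcher(t).matches() && !exc.matcher(t).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(accepted(List.of("apple", "apricot", "avocado"),
                "a.*", "ap.*")); // [avocado]
    }
}
```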

     private Set<BytesRef> includeValues;
     private Set<BytesRef> excludeValues;
     private final boolean hasRegexTest;
+    public static class StringFilter {
Contributor

When is this used? Only in the non-global ordinals case?

Contributor Author

Yes, only for scripts.

@rmuir
Contributor

rmuir commented Apr 8, 2015

Looks great to me. I added comments, but they only ask for code comments.

@jpountz force-pushed the enhancement/speed_up_aggs_include_exclude branch from 3c067fd to aecd9ac on April 9, 2015
jpountz added a commit that referenced this pull request on Apr 9, 2015:

Aggregations: Speed up include/exclude in terms aggregations with regexps.

Close #10418
@jpountz jpountz merged commit e25db22 into elastic:master Apr 9, 2015
@jpountz jpountz deleted the enhancement/speed_up_aggs_include_exclude branch April 9, 2015 10:16
@clintongormley clintongormley changed the title Aggregations: Speed up include/exclude in terms aggregations with regexps. Speed up include/exclude in terms aggregations with regexps, using Lucene regular expressions Jun 6, 2015
@dawi

dawi commented Dec 24, 2015

@jpountz I am wondering if it would be possible to support both Lucene and Java regexps.

On the one hand I greatly appreciate the performance improvements achieved by using Lucene regexps, but on the other hand I need the power and flexibility of Java regexps.

In a perfect world we would get both speed AND flexibility, but in the meantime it would be great if the user could decide which regexp engine to use.

Background information:

It is currently not possible to include/exclude terms on a case-insensitive basis.

I can think of some possible workarounds, but they are not as simple and straightforward as this implementation:

.addAggregation(
    terms("fieldAgg")
        .field(field + ".original")
        .include(regex, Pattern.CASE_INSENSITIVE) // supported by elasticsearch 1.7
        .size(maxResults)
        .order(Terms.Order.count(false))
);
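One partial workaround with the Lucene-regexp-only API is to mechanically expand a literal pattern into a case-insensitive character-class form. The sketch below is illustrative only: it handles plain literals (letters pass through as two-character classes), does not escape regex metacharacters, and is not a general replacement for `Pattern.CASE_INSENSITIVE`.

```java
// Sketch of a workaround: expand a *literal* term into a case-insensitive
// character-class regexp, e.g. "foo" -> "[fF][oO][oO]". Regex metacharacters
// are not escaped here, so this only works for plain literal strings.
public class CaseInsensitiveRegexDemo {
    static String expand(String literal) {
        StringBuilder sb = new StringBuilder();
        for (char c : literal.toCharArray()) {
            if (Character.isLetter(c)) {
                sb.append('[')
                  .append(Character.toLowerCase(c))
                  .append(Character.toUpperCase(c))
                  .append(']');
            } else {
                sb.append(c); // digits, dots, etc. are copied through as-is
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("foo")); // [fF][oO][oO]
    }
}
```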

@rmuir
Contributor

rmuir commented Dec 24, 2015

Just index your content correctly. If you want case-insensitive matching, that means indexing with a lowercase filter. It's a search engine!

@dawi

dawi commented Dec 24, 2015

I don't know how this would help me with this particular issue. I already have a multi-field mapping to preserve the original field value, so it is already possible to build the aggregation over the lowercase field. But in that case the returned field values would also be lowercase, and I am interested in the exact original value.

I know that the decision to only use Lucene's regexps was made carefully, for various reasons.

But I am not sure that goal is reached in the end if users have to implement more complex mappings or client-side logic to achieve the same result. As I said, this use case could be implemented perfectly well using just the snippet above with Elasticsearch 1.7.
