
Speed up include/exclude in terms aggregations with regexps, using Lucene regular expressions #10418

Merged

Conversation

jpountz
Contributor

@jpountz jpountz commented Apr 3, 2015

Today we check every regular expression eagerly against every possible term.
This can be very slow if you have lots of unique terms, and can even become the
bottleneck if your query is selective.

This commit switches to Lucene regular expressions instead of Java ones (not
exactly the same syntax, but most existing regular expressions should keep
working) and uses the same logic as RegExpQuery to intersect the regular
expression with the terms dictionary. I wrote a quick benchmark (in the PR) to
make sure it made things faster: the same request that took 750ms on master now
takes 74ms with this change.

Close #7526
Close #9848

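The eager per-term matching the PR replaces can be sketched in plain Java (the class and helper names here are illustrative, not from the PR): every unique term in the field is tested against a compiled `Pattern`, so the cost is proportional to the number of unique terms no matter how selective the query is.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustrative sketch of the *old* behaviour this PR removes: a Java Pattern
// is checked eagerly against every unique term in the field.
public class EagerIncludeDemo {
    // Returns the terms accepted by the include pattern, one regex match per term.
    static List<String> filter(List<String> uniqueTerms, String includeRegex) {
        Pattern include = Pattern.compile(includeRegex);
        return uniqueTerms.stream()
                .filter(t -> include.matcher(t).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> terms = List.of("apple", "apricot", "banana", "cherry");
        System.out.println(filter(terms, "ap.*")); // [apple, apricot]
    }
}
```

The commit instead compiles the expression to an automaton and intersects it with the terms dictionary (the same logic as RegExpQuery), which can skip whole ranges of non-matching terms rather than testing each one.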
@jpountz
Contributor Author

jpountz commented Apr 3, 2015

Note to self: @rashidkpc asked to create an issue in Kibana if this change makes it in, so that Kibana can be changed too.

@jpountz
Contributor Author

jpountz commented Apr 3, 2015

@rmuir @mikemccand @s1monw I am not familiar with the Automaton/RegExp/CompiledAutomaton/ByteRunAutomaton API so if one of you could check that I did not misuse these APIs in this change, that would be great!

@@ -89,12 +93,20 @@ private void addReject(long val) {
      * @param excludeValues The terms to be excluded
      *   (may only be {@code null} if one of the other arguments is non-null).
      */
-    public IncludeExclude(Pattern include, Pattern exclude, Set<BytesRef> includeValues, Set<BytesRef> excludeValues) {
+    public IncludeExclude(RegExp include, RegExp exclude, Set<BytesRef> includeValues, Set<BytesRef> excludeValues) {
         assert includeValues != null || include != null ||
Contributor

What are these includeValues/excludeValues? Should it just build the same matcher for them too (maybe requiring a SortedSet)?

Contributor

Another option for these is to reverse-lookup their ordinals against the doc-values instance and just exclude those ordinals. That should make this option much faster, since it wouldn't need to traverse the Terms at all or do contains(BytesRef) checks.

@rmuir
Contributor

rmuir commented Apr 3, 2015

I added some comments. In general I am a little confused as to what is going on in all cases.

Doing this kind of filtering seems costly, since it will intersect against the entire termsenum, potentially dereferencing many global ordinals to byte[] to do the matching, in order to finally get the bitset.

Can we use a more efficient bitset in some cases? We will be populating it in ordinal order...

Do we cache these bitsets anywhere in case the same filtering is repeated over and over?

@rmuir
Contributor

rmuir commented Apr 3, 2015

I thought this over more and brainstormed with @jpountz. I think we should always build one 'automaton' based on includes, excludes, includeValues, excludeValues, whatever. This can basically be Operations.minus(includes, excludes).

Then, we can always simply use terms.intersect to build the bitset rather than enumerating terms and doing ord-to-BytesRef resolution. Making it completely optimal for this case is interesting; it's different from the techniques we would use for scoring documents, depending on the regex.

For dense cases, prefix cases (ab*), or similar regexps, it could be a win to be tricky. If someone does ab*, we could just do two reverse lookups (ab and ac) and fill the bitset range in between, instead of reading all the ordinals and deref'ing them to terms in ab*.

But intersect() is probably much better already than what we do today.
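The two-reverse-lookups idea above can be sketched with a sorted term dictionary in plain Java. This is only a conceptual stand-in: Lucene's terms dictionary and ordinal APIs are different, and a `TreeSet` here just plays the role of terms in ordinal order.

```java
import java.util.List;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

// Sketch of the "two reverse lookups" trick for a prefix pattern like ab*:
// rather than matching every term, find the sorted range [ab, ac) and take
// everything inside it. A TreeSet stands in for the ordinal-sorted terms dict.
public class PrefixRangeDemo {
    static SortedSet<String> prefixRange(NavigableSet<String> termDict, String prefix) {
        // Upper bound: increment the last character ("ab" -> "ac"). This sketch
        // ignores the edge case where the last character is Character.MAX_VALUE.
        String upper = prefix.substring(0, prefix.length() - 1)
                + (char) (prefix.charAt(prefix.length() - 1) + 1);
        return termDict.subSet(prefix, true, upper, false); // [prefix, upper)
    }

    public static void main(String[] args) {
        NavigableSet<String> dict = new TreeSet<>(List.of(
                "aa", "ab", "abc", "abd", "ac", "b"));
        System.out.println(prefixRange(dict, "ab")); // [ab, abc, abd]
    }
}
```

The range view is computed from the sorted structure alone, without testing each term against the pattern, which is the win rmuir describes for dense prefix-shaped regexps.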

@jpountz force-pushed the enhancement/speed_up_aggs_include_exclude branch from a7dea5f to 3c067fd on April 8, 2015
@jpountz
Contributor Author

jpountz commented Apr 8, 2015

I pushed a new commit that folds the exclusions into the automaton. It is still far from great (for instance, we should reuse what we build when several aggregators use the same include/exclude rules), but better than what we have today?
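Folding the exclusions in means a term is accepted iff the include pattern matches it and the exclude pattern does not, which is exactly the language of `Operations.minus(include, exclude)` in automaton terms. A set-semantics sketch in plain Java (Lucene does this on automata, not per term):

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Set-semantics sketch of Operations.minus(include, exclude): a term is kept
// iff it matches the include pattern and does not match the exclude pattern.
// Lucene performs this subtraction once on the automata; this demo applies
// the equivalent predicate per term only to illustrate the accepted language.
public class IncludeMinusExcludeDemo {
    static List<String> accepted(List<String> terms, String include, String exclude) {
        Pattern inc = Pattern.compile(include);
        Pattern exc = Pattern.compile(exclude);
        return terms.stream()
                .filter(t -> inc.matcher(t).matches() && !exc.matcher(t).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(accepted(List.of("apple", "apricot", "avocado"),
                "a.*", "ap.*")); // [avocado]
    }
}
```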

     private Set<BytesRef> includeValues;
     private Set<BytesRef> excludeValues;
     private final boolean hasRegexTest;
+    public static class StringFilter {
Contributor

When is this used? Only in the non-global ordinals case?

Contributor Author

Yes, only for scripts.

@rmuir
Contributor

rmuir commented Apr 8, 2015

Looks great to me. I added comments, but they only ask for code comments.

@jpountz force-pushed the enhancement/speed_up_aggs_include_exclude branch from 3c067fd to aecd9ac on April 9, 2015
jpountz added a commit that referenced this pull request on Apr 9, 2015:

Aggregations: Speed up include/exclude in terms aggregations with regexps.

Close #10418
@jpountz jpountz merged commit e25db22 into elastic:master Apr 9, 2015
@jpountz jpountz deleted the enhancement/speed_up_aggs_include_exclude branch April 9, 2015 10:16
@clintongormley clintongormley changed the title Aggregations: Speed up include/exclude in terms aggregations with regexps. Speed up include/exclude in terms aggregations with regexps, using Lucene regular expressions Jun 6, 2015
@dawi

dawi commented Dec 24, 2015

@jpountz I am wondering if it would be possible to support both Lucene and Java regexps.

On the one hand I greatly appreciate the performance improvements achieved by using Lucene regexps, but on the other hand I need the power and flexibility of Java regexps.

In a perfect world we would get both speed AND flexibility, but in the meantime it would be great if the user could decide which regexp engine to use.

Background information:

It is currently not possible to include/exclude terms on a case-insensitive basis.

I can think of some possible workarounds, but they are not as simple and straightforward as this implementation:

.addAggregation(
    terms("fieldAgg")
        .field(field + ".original")
        .include(regex, Pattern.CASE_INSENSITIVE) // supported by elasticsearch 1.7
        .size(maxResults)
        .order(Terms.Order.count(false))
);
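One partial workaround with the Lucene-regexp-only API is to mechanically expand a literal pattern into a case-insensitive character-class form. The sketch below is illustrative only: it handles plain literals (letters pass through as two-character classes), does not escape regex metacharacters, and is not a general replacement for `Pattern.CASE_INSENSITIVE`.

```java
// Sketch of a workaround: expand a *literal* term into a case-insensitive
// character-class regexp, e.g. "foo" -> "[fF][oO][oO]". Regex metacharacters
// are not escaped here, so this only works for plain literal strings.
public class CaseInsensitiveRegexDemo {
    static String expand(String literal) {
        StringBuilder sb = new StringBuilder();
        for (char c : literal.toCharArray()) {
            if (Character.isLetter(c)) {
                sb.append('[')
                  .append(Character.toLowerCase(c))
                  .append(Character.toUpperCase(c))
                  .append(']');
            } else {
                sb.append(c); // digits, dots, etc. are copied through as-is
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("foo")); // [fF][oO][oO]
    }
}
```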

@rmuir
Contributor

rmuir commented Dec 24, 2015

Just index your content correctly. If you want case-insensitive matching, that means indexing with a lowercase filter. It's a search engine!

@dawi

dawi commented Dec 24, 2015

I don't know how this would help me with this particular issue. I already have a multi-field mapping to preserve the original field value, so it is already possible to build the aggregation over the lowercase field. But in that case the returned field values would also be lowercase, and I am interested in the exact original value.

I know that the decision to only use Lucene's regexps was made carefully, for various reasons.

But I am not sure that goal is reached in the end if users have to implement more complex mappings or client-side logic to achieve the same result. As I said, this use case could be implemented perfectly well using just the snippet above with Elasticsearch 1.7.
