New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Freq terms enum #5597
Freq terms enum #5597
Conversation
A frequency caching terms enum, that also allows to be configured with an optional filter. To be used by both significant terms and phrase suggester. This change extracts the frequency caching into the same code, and allow in the future to add a filter to control/customize the background frequencies
…ing TTF and DF stats and FreqTermsEnum now only handles the responsibility of caching TTF and DF stats. Renamed "buildTermsEnum" method in SignificantTermsAggFactory to "prepareBackground" and made it return number of docs in (potentially filtered) background set because required context for signif terms algo. Added missing support to SignificantTermsAggFactory for creating a filtered TermsEnum when only one child aggregator. Added experimental code to SignificantTermsParser to define a filter for subsetting the background index but after testing left commented out with notes for future work in another PR Added test in SignificantTermsTests for a nested aggregation in order to exercise the FreqTermsEnum caching logic.
LGTM, I would add the ability to provide the filter to significant terms agg, with a note about how to properly use it, I think it will end up being very beneficial to users. |
} | ||
found = true; | ||
if (anEnum.bits == null) { | ||
if (needDocFreqs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't get why we have all these branches. For the docFreq
we can just calculate is all the time. For the TTF we can just do what lucene does internally:
if (totalTermFreq == -1 || leafTotalTermFreq == -1) {
totalTermFreq = -1;
continue;
}
then we don't need the hasMissingTotalTermFreq
alltogether
Base class always calculates DocFreq now - removed boolean flag that made this optional. Simplified logic handling where different codecs used some with TTF some without Option to control totalTermFreq calculation now controlled using existing DocsEnum#flag setting rather than new boolean. Changed single DocsEnum iteration-heavy loop to a choice of 2 loops with appropriate inner logic depending on mode. Tidied resource tearDown in FreqTermsEnumTests
} | ||
|
||
current = found ? text : null; | ||
if(found){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are some minor formatting glitches here like whitespaces - can you reformat the files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we need this - we could assign the NOT_FOUND
in the beginning of this method and then assign currentDocFreq
as well as currentTotalTermFreq
each time we update their local corresponding variables ie. as the last statement in the for (Holder anEnum : enums)
loop. Same is true for the text
and that way we can just trash everything below the for loop and return found
I also think it's ok to just use -1
as NOT_FOUND
I did another review round! Looks good though! |
LGTM +1 so squash and push |
A frequency caching terms enum, that also allows to be configured with an optional filter. To be used by both significant terms and phrase suggester. This change extracts the frequency caching into the same code, and allow in the future to add a filter to control/customize the background frequencies Closes #5597
Based on the work started in #5489
Two new utility classes to support analysis of term frequencies used by significant_terms agg and phrase suggester. The FilterableTermsEnum base class provides access to stats with an optional filter (this will be of use in examining significant_terms against a background other than the entire index).
The FreqTermsEnum subclass adds caching support for scenarios where there are likely to be repeated calls for the same term's counts.