Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Freq terms enum #5597

Closed
wants to merge 5 commits into from
Closed

Freq terms enum #5597

wants to merge 5 commits into from

Conversation

markharwood
Copy link
Contributor

Based on the work started in #5489
Two new utility classes to support analysis of term frequencies used by significant_terms agg and phrase suggester. The FilterableTermsEnum base class provides access to stats with an optional filter (this will be of use in examining significant_terms against a background other than the entire index).
The FreqTermsEnum subclass adds caching support for scenarios where there are likely to be repeated calls for the same term's counts.

kimchy and others added 3 commits March 28, 2014 16:26
A frequency caching terms enum, that also allows to be configured with an optional filter. To be used by both significant terms and phrase suggester.
This change extracts the frequency caching into the same code, and allow in the future to add a filter to control/customize the background frequencies
…ing TTF and DF stats and FreqTermsEnum now only handles the responsibility of caching TTF and DF stats.

Renamed "buildTermsEnum" method in SignificantTermsAggFactory to "prepareBackground" and made it return number of docs in (potentially filtered) background set because required context for signif terms algo.
Added missing support to SignificantTermsAggFactory for creating a filtered TermsEnum when only one child aggregator.
Added experimental code to SignificantTermsParser to define a filter for subsetting the background index but after testing left commented out with notes for future work in another PR
Added test in SignificantTermsTests for a nested aggregation in order to exercise the FreqTermsEnum caching logic.
@kimchy
Copy link
Member

kimchy commented Mar 31, 2014

LGTM, I would add the ability to provide the filter to significant terms agg, with a note about how to properly use it, I think it will end up being very beneficial to users.

}
found = true;
if (anEnum.bits == null) {
if (needDocFreqs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't get why we have all these branches. For the docFreq we can just calculate is all the time. For the TTF we can just do what lucene does internally:

if (totalTermFreq == -1 || leafTotalTermFreq == -1) {
  totalTermFreq = -1;
  continue;
}

then we don't need the hasMissingTotalTermFreq alltogether

Base class always calculates DocFreq now - removed boolean flag that made this optional.
Simplified logic handling where different codecs used some with TTF some without
Option to control totalTermFreq calculation now controlled using existing DocsEnum#flag setting rather than new boolean.
Changed single DocsEnum iteration-heavy loop to a choice of 2 loops with appropriate inner logic depending on mode.
Tidied resource tearDown in FreqTermsEnumTests
}

current = found ? text : null;
if(found){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are some minor formatting glitches here like whitespaces - can you reformat the files?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need this - we could assign the NOT_FOUND in the beginning of this method and then assign currentDocFreq as well as currentTotalTermFreq each time we update their local corresponding variables ie. as the last statement in the for (Holder anEnum : enums) loop. Same is true for the text and that way we can just trash everything below the for loop and return found I also think it's ok to just use -1 as NOT_FOUND

@s1monw
Copy link
Contributor

s1monw commented Apr 2, 2014

I did another review round! Looks good though!

@s1monw
Copy link
Contributor

s1monw commented Apr 2, 2014

LGTM +1 so squash and push

@kimchy kimchy closed this in 63290a9 Apr 2, 2014
kimchy added a commit that referenced this pull request Apr 2, 2014
A frequency caching terms enum, that also allows to be configured with an optional filter. To be used by both significant terms and phrase suggester.
This change extracts the frequency caching into the same code, and allow in the future to add a filter to control/customize the background frequencies
Closes #5597
@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories >enhancement and removed >feature labels Jun 6, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories v1.2.0 v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants