New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Significant_terms agg: added option for a backgroundFilter #5944
Conversation
…background context for analysis of term frequencies
// term may be missing from the background, so for the purposes of this calculation | ||
// we assume a value of 1 for our calculations which avoids returning an "infinity" result | ||
supersetFreq = 1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use additional smoothing instead in order not to have this particular case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "0" is the only special case that needs treatment as it is the absence of any evidence and without this adjustment the score returned is infinity. Anything > 0 is at least "some evidence" and therefore something I don't think we should tamper with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was more thinking about something like Laplace smoothing, that we already use eg. in the phrase suggester http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-suggesters-phrase.html#_smoothing_models
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can probably be done in a different issue though.
The patch looks good. Can we also have a test that makes sure that this agg finds the right terms when a background filter is set? |
"tags" : { | ||
"significant_terms" : { | ||
"field" : "tag", | ||
"backgroundFilter": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all our examples include the _
notation, can we change to background_filter
Added test for background filter tuning out selected words
|
||
if (BACKGROUND_FILTER.match(currentFieldName)) { | ||
filter = context.queryParserService().parseInnerFilter(parser).filter(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a else
to fail the parsing if an object with a different key than BACKGROUND_FILTER
is found?
@markharwood I just did another review and left a couple of minor comments. Other than that it looks good to me, it's nice that the documentation makes clear that although convenient this option can be very slow. |
…er checking and Junit asserts
LGTM |
… background context for analysis of term frequencies Closes #5944
Allows for a narrowed background context in analysis of term frequencies.