Remove XPostingsHighlighter in favour of Lucene's PostingsHighlighter #11077

javanna · 2015-05-09T12:43:21Z

Our own fork of the lucene PostingsHighlighter is very hard to maintain and doesn't give us any added value at this point. In particular, it was introduced to support the require_field_match option and discrete per value highlighting, used in case one wants to highlight the whole content of a field, but get back one snippet per value. These two features won't make it into lucene the way I implemented them as they slow things down and shouldn't have been supported from day one on our end probably.

One other customization we had was support for a wider range of queries via custom rewrite etc. (yet another way to slow things down), which got added to lucene and works much much better than what we used to do (instead of or rewrite, terms are pulled out of the automata for multi term queries).

Removing our fork means the following in terms of features:

dropped support for require_field_match: the postings highlighter will only highlight fields that were queried
some custom es queries won't be supported anymore, meaning they won't be highlighted. The only one I found up until now is the phrase_prefix. Postings highlighter rewrites against an empty reader to avoid slow operations (like the ones that we were performing with the fork that we are removing here), thus the prefix will not be expanded to any term. What the postings highlighter does instead is pulling the automata out of multi term queries, but this is not supported at the moment with our MultiPhrasePrefixQuery.

Closes #10625

Our own fork of the lucene PostingsHighlighter is not easy to maintain and doesn't give us any added value at this point. In particular, it was introduced to support the require_field_match option and discrete per value highlighting, used in case one wants to highlight the whole content of a field, but get back one snippet per value. These two features won't make it into lucene as they slow things down and shouldn't have been supported from day one on our end probably. One other customization we had was support for a wider range of queries via custom rewrite etc. (yet another way to slow things down), which got added to lucene and works much much better than what we used to do (instead of or rewrite, term s are pulled out of the automata for multi term queries). Removing our fork means the following in terms of features: - dropped support for require_field_match: the postings highlighter will only highlight fields that were queried - the output is different compared to other highlighters in case `fragment_size` is set to 0: one single snippet is returned in case a field has multiple values, rather than one highlighted snipper per value Closes elastic#10625 Closes elastic#11077

kimchy · 2015-05-09T13:03:28Z

this looks great Luca!, agreed that this fork is a challenge to maintain. Out of the features we removed, the one that users ask for (regardless of highlighting impl) is to be able to take multi value fields into account and highlight each one, its a big usability aspect. I would check if its possible to try and add it to Lucene posting highlighter itself in the future.

javanna · 2015-05-09T13:21:26Z

the one that users ask for (regardless of highlighting impl) is to be able to take multi value fields into account and highlight each one, its a big usability aspect

I may have made it sound worse than it actually is. The postings highlighter is aware of multiple values, at least the way we use it, as we use a specific paragraph separator between values which the break iterator can detect. That doesn't work only when wanting to highlight the whole content of a field (setting fragments_size to 0), as in that case we use a whole break iterator. We could split the resulting snippet ourselves based on where we find the paragraph separator, but it seems a bit of a hack since the original text might contain that same character.

Our own fork of the lucene PostingsHighlighter is not easy to maintain and doesn't give us any added value at this point. In particular, it was introduced to support the require_field_match option and discrete per value highlighting, used in case one wants to highlight the whole content of a field, but get back one snippet per value. These two features won't make it into lucene as they slow things down and shouldn't have been supported from day one on our end probably. One other customization we had was support for a wider range of queries via custom rewrite etc. (yet another way to slow things down), which got added to lucene and works much much better than what we used to do (instead of or rewrite, term s are pulled out of the automata for multi term queries). Removing our fork means the following in terms of features: - dropped support for require_field_match: the postings highlighter will only highlight fields that were queried - the output is different compared to other highlighters in case `fragment_size` is set to 0: one single snippet is returned in case a field has multiple values, rather than one highlighted snipper per value Closes elastic#10625 Closes elastic#11077

javanna · 2015-05-09T20:43:21Z

One other idea to address the "highlight the whole content, value per value" usecase could be to write a break iterator to use instead of the WholeBreakIterator, which breaks only when it encounters the paragraph separator that we use between values...that said I'm not 100% comfortable with this as the field content can always contain that same character...

nik9000 · 2015-05-10T00:19:27Z

One other idea to address the "highlight the whole content, value per value" usecase could be to write a break iterator to replace the whole break iterator, which breaks only when it encounters the paragraph separator which we use between values...that said I'm not 100% comfortable with this as the field content can always contain that same character...

For what its worth I think the experimental highlighter supports this use case at roughly the same performance you'd get out of the postings highlighter. Its been a while since I looked at the other highlighters so I'm not 100% sure what the use case is - I'm just going from memory here.

jpountz · 2015-05-12T09:50:52Z

LGTM, I think this is a great change. Let's make sure there is a Lucene ticket for the multi-value highlighting issue and reference it from here?

javanna · 2015-05-12T10:03:00Z

@rmuir can you have a look too please?

rmuir · 2015-05-12T10:04:57Z

paragraph separator is only a "recommended" thing to use, because the default java breakiterator will split on it already out of box (for typical highlighting use cases). If you want to do something atypical, use a different character that can't be in the data. Use U+0000 if you want.

/** 
 * Returns the logical separator between values for multi-valued fields.
 * The default value is a space character, which means passages can span across values,
 * but a subclass can override, for example with {@code U+2029 PARAGRAPH SEPARATOR (PS)}
 * if each value holds a discrete passage for highlighting.
 */
protected char getMultiValuedSeparator(String field) {
  return ' ';
}

In this patch (which is much simpler, thanks!), this method is actually only called by overridden code, not by lucene at all. So really @jpountz, there isn't anything for lucene "to fix" here. Just don't use 2029, use something else for whatever the strange use case is. Or don't call the method at all and do something different :)

javanna · 2015-05-12T10:14:41Z

this method is actually only called by overridden code, not by lucene at all

right @rmuir that is because we need to load the content from _source. I tried to mimic what the lucene code does by using the same method though, so the behaviour (apart from loading from source) should be the same (apart from bugs that I may have added to it :) )...

don't call the method at all and do something different

Correct me if I'm wrong, I don't think we can do whatever we want as offsets etc. need to match the value loaded from stored fields. loadFieldValues really needs to return the whole content of a field (all values) and from then on we lose the distinction between the different values. I will look into using a different separator between values (that can't be in the data), which should solve this.

rmuir · 2015-05-12T10:19:41Z

Correct if I'm wrong, I don't think we can do whatever we want as offsets etc. need to match the value loaded from stored fields. loadFieldValues really needs to return the whole content of a field (all values) and from then on we lose the distinction between the different values. I will look into using a different separator between values (that can't be in the data), which should solve this.

right, because Analyzer.getOffsetGap() is 1 by default, it should be one character. I recommend a control character like U+0000 or INFORMATION SEPARATOR X.

If we were programming in C, we would be using NUL terminated strings implicitly and without hesitation!

…or highlighting one field value at a time

javanna · 2015-05-13T08:38:16Z

I pushed another commit, highlighting one field value at a time works well now, the output is the same as before. Added a CustomSeparatorBreakIterator that can probably be contributed back to lucene as well, which we use to break on U+0000 when number_of_fragments is set to 0 (knowing that the same separator is used between different values).

Updated the migrate docs, I removed the above limitation and added another one that I found around highlighting match query with type set to phrase_prefix, which is not supported anymore.

@rmuir can you have a look? The break iterator tests will be a deja vu for you I believe :)

rmuir · 2015-05-13T15:18:02Z

This looks great, I like the strategy for the breakiterator. That will be a nice one to put in lucene!

javanna · 2015-05-15T19:03:25Z

This PR is marked breaking as it removes support for the require_field_match option when using the postings highlighter. This option is not supported by the lucene postings highlighter and the way it was supported in elasticsearch made the postings highlighter much slower and harder to maintain, so we decided to stick with the lucene implementation.

andreip · 2016-01-07T08:03:35Z

do you guys think this change generated the issue I'm describing here #15560 ? @javanna

javanna added >enhancement >breaking v2.0.0-beta1 review :Search/Highlighting How a query matched a document labels May 9, 2015

javanna force-pushed the enhancement/remove_xpostings_hl branch from dfb242a to 07ac01a Compare May 9, 2015 12:44

javanna force-pushed the enhancement/remove_xpostings_hl branch from 07ac01a to a2487b6 Compare May 9, 2015 13:40

javanna mentioned this pull request May 9, 2015

require_field_match now defaults to true #11067

Closed

javanna added 2 commits May 13, 2015 10:24

Add break iterator that breaks on custom separator, restore support f…

3a7c939

…or highlighting one field value at a time

update migrate_2.0

88151f5

updated docs

94eb74a

javanna closed this in 46c521f May 15, 2015

kevinkluge removed the review label May 15, 2015

javanna added review and removed review labels May 15, 2015

clintongormley changed the title ~~Highlighting: nuke XPostingsHighlighter~~ Remove XPostingsHighlighter in favour of Lucene's PostingsHighlighter Jun 6, 2015

clintongormley mentioned this pull request Dec 14, 2015

Possible regression in highlighting - highlight_query ignored if is different than highlight field #15407

Closed

andreip mentioned this pull request Dec 19, 2015

Highlight on more_like_this query not working in 2.0+, it was in 1.7.2 #15560

Closed

javanna mentioned this pull request Feb 23, 2016

2.2 posting highlighter does not work when highlighting fields different than searched #16778

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove XPostingsHighlighter in favour of Lucene's PostingsHighlighter #11077

Remove XPostingsHighlighter in favour of Lucene's PostingsHighlighter #11077

javanna commented May 9, 2015

kimchy commented May 9, 2015

javanna commented May 9, 2015

javanna commented May 9, 2015

nik9000 commented May 10, 2015

jpountz commented May 12, 2015

javanna commented May 12, 2015

rmuir commented May 12, 2015

javanna commented May 12, 2015

rmuir commented May 12, 2015

javanna commented May 13, 2015

rmuir commented May 13, 2015

javanna commented May 15, 2015

andreip commented Jan 7, 2016

Remove XPostingsHighlighter in favour of Lucene's PostingsHighlighter #11077

Remove XPostingsHighlighter in favour of Lucene's PostingsHighlighter #11077

Conversation

javanna commented May 9, 2015

kimchy commented May 9, 2015

javanna commented May 9, 2015

javanna commented May 9, 2015

nik9000 commented May 10, 2015

jpountz commented May 12, 2015

javanna commented May 12, 2015

rmuir commented May 12, 2015

javanna commented May 12, 2015

rmuir commented May 12, 2015

javanna commented May 13, 2015

rmuir commented May 13, 2015

javanna commented May 15, 2015

andreip commented Jan 7, 2016