Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove XPostingsHighlighter in favour of Lucene's PostingsHighlighter #11077

Closed

Conversation

javanna
Copy link
Member

@javanna javanna commented May 9, 2015

Our own fork of the lucene PostingsHighlighter is very hard to maintain and doesn't give us any added value at this point. In particular, it was introduced to support the require_field_match option and discrete per value highlighting, used in case one wants to highlight the whole content of a field, but get back one snippet per value. These two features won't make it into lucene the way I implemented them as they slow things down and shouldn't have been supported from day one on our end probably.

One other customization we had was support for a wider range of queries via custom rewrite etc. (yet another way to slow things down), which got added to lucene and works much much better than what we used to do (instead of or rewrite, terms are pulled out of the automata for multi term queries).

Removing our fork means the following in terms of features:

  • dropped support for require_field_match: the postings highlighter will only highlight fields that were queried
  • some custom es queries won't be supported anymore, meaning they won't be highlighted. The only one I found up until now is the phrase_prefix. Postings highlighter rewrites against an empty reader to avoid slow operations (like the ones that we were performing with the fork that we are removing here), thus the prefix will not be expanded to any term. What the postings highlighter does instead is pulling the automata out of multi term queries, but this is not supported at the moment with our MultiPhrasePrefixQuery.

Closes #10625

@javanna javanna force-pushed the enhancement/remove_xpostings_hl branch from dfb242a to 07ac01a Compare May 9, 2015 12:44
javanna added a commit to javanna/elasticsearch that referenced this pull request May 9, 2015
Our own fork of the lucene PostingsHighlighter is not easy to maintain and doesn't give us any added value at this point. In particular, it was introduced to support the require_field_match option and discrete per value highlighting, used in case one wants to highlight the whole content of a field, but get back one snippet per value. These two features won't
 make it into lucene as they slow things down and shouldn't have been supported from day one on our end probably.

One other customization we had was support for a wider range of queries via custom rewrite etc. (yet another way to slow
 things down), which got added to lucene and works much much better than what we used to do (instead of or rewrite, term
s are pulled out of the automata for multi term queries).

Removing our fork means the following in terms of features:
- dropped support for require_field_match: the postings highlighter will only highlight fields that were queried
- the output is different compared to other highlighters in case `fragment_size` is set to 0: one single snippet is returned in case a field has multiple values, rather than one highlighted snipper per value

Closes elastic#10625
Closes elastic#11077
@kimchy
Copy link
Member

kimchy commented May 9, 2015

this looks great Luca!, agreed that this fork is a challenge to maintain. Out of the features we removed, the one that users ask for (regardless of highlighting impl) is to be able to take multi value fields into account and highlight each one, its a big usability aspect. I would check if its possible to try and add it to Lucene posting highlighter itself in the future.

@javanna
Copy link
Member Author

javanna commented May 9, 2015

the one that users ask for (regardless of highlighting impl) is to be able to take multi value fields into account and highlight each one, its a big usability aspect

I may have made it sound worse than it actually is. The postings highlighter is aware of multiple values, at least the way we use it, as we use a specific paragraph separator between values which the break iterator can detect. That doesn't work only when wanting to highlight the whole content of a field (setting fragments_size to 0), as in that case we use a whole break iterator. We could split the resulting snippet ourselves based on where we find the paragraph separator, but it seems a bit of a hack since the original text might contain that same character.

Our own fork of the lucene PostingsHighlighter is not easy to maintain and doesn't give us any added value at this point. In particular, it was introduced to support the require_field_match option and discrete per value highlighting, used in case one wants to highlight the whole content of a field, but get back one snippet per value. These two features won't
 make it into lucene as they slow things down and shouldn't have been supported from day one on our end probably.

One other customization we had was support for a wider range of queries via custom rewrite etc. (yet another way to slow
 things down), which got added to lucene and works much much better than what we used to do (instead of or rewrite, term
s are pulled out of the automata for multi term queries).

Removing our fork means the following in terms of features:
- dropped support for require_field_match: the postings highlighter will only highlight fields that were queried
- the output is different compared to other highlighters in case `fragment_size` is set to 0: one single snippet is returned in case a field has multiple values, rather than one highlighted snipper per value

Closes elastic#10625
Closes elastic#11077
@javanna
Copy link
Member Author

javanna commented May 9, 2015

One other idea to address the "highlight the whole content, value per value" usecase could be to write a break iterator to use instead of the WholeBreakIterator, which breaks only when it encounters the paragraph separator that we use between values...that said I'm not 100% comfortable with this as the field content can always contain that same character...

@nik9000
Copy link
Member

nik9000 commented May 10, 2015

One other idea to address the "highlight the whole content, value per value" usecase could be to write a break iterator to replace the whole break iterator, which breaks only when it encounters the paragraph separator which we use between values...that said I'm not 100% comfortable with this as the field content can always contain that same character...

For what its worth I think the experimental highlighter supports this use case at roughly the same performance you'd get out of the postings highlighter. Its been a while since I looked at the other highlighters so I'm not 100% sure what the use case is - I'm just going from memory here.

@jpountz
Copy link
Contributor

jpountz commented May 12, 2015

LGTM, I think this is a great change. Let's make sure there is a Lucene ticket for the multi-value highlighting issue and reference it from here?

@javanna
Copy link
Member Author

javanna commented May 12, 2015

@rmuir can you have a look too please?

@rmuir
Copy link
Contributor

rmuir commented May 12, 2015

paragraph separator is only a "recommended" thing to use, because the default java breakiterator will split on it already out of box (for typical highlighting use cases). If you want to do something atypical, use a different character that can't be in the data. Use U+0000 if you want.

/** 
 * Returns the logical separator between values for multi-valued fields.
 * The default value is a space character, which means passages can span across values,
 * but a subclass can override, for example with {@code U+2029 PARAGRAPH SEPARATOR (PS)}
 * if each value holds a discrete passage for highlighting.
 */
protected char getMultiValuedSeparator(String field) {
  return ' ';
}

In this patch (which is much simpler, thanks!), this method is actually only called by overridden code, not by lucene at all. So really @jpountz, there isn't anything for lucene "to fix" here. Just don't use 2029, use something else for whatever the strange use case is. Or don't call the method at all and do something different :)

@javanna
Copy link
Member Author

javanna commented May 12, 2015

this method is actually only called by overridden code, not by lucene at all

right @rmuir that is because we need to load the content from _source. I tried to mimic what the lucene code does by using the same method though, so the behaviour (apart from loading from source) should be the same (apart from bugs that I may have added to it :) )...

don't call the method at all and do something different

Correct me if I'm wrong, I don't think we can do whatever we want as offsets etc. need to match the value loaded from stored fields. loadFieldValues really needs to return the whole content of a field (all values) and from then on we lose the distinction between the different values. I will look into using a different separator between values (that can't be in the data), which should solve this.

@rmuir
Copy link
Contributor

rmuir commented May 12, 2015

Correct if I'm wrong, I don't think we can do whatever we want as offsets etc. need to match the value loaded from stored fields. loadFieldValues really needs to return the whole content of a field (all values) and from then on we lose the distinction between the different values. I will look into using a different separator between values (that can't be in the data), which should solve this.

right, because Analyzer.getOffsetGap() is 1 by default, it should be one character. I recommend a control character like U+0000 or INFORMATION SEPARATOR X.

If we were programming in C, we would be using NUL terminated strings implicitly and without hesitation!

@javanna
Copy link
Member Author

javanna commented May 13, 2015

I pushed another commit, highlighting one field value at a time works well now, the output is the same as before. Added a CustomSeparatorBreakIterator that can probably be contributed back to lucene as well, which we use to break on U+0000 when number_of_fragments is set to 0 (knowing that the same separator is used between different values).

Updated the migrate docs, I removed the above limitation and added another one that I found around highlighting match query with type set to phrase_prefix, which is not supported anymore.

@rmuir can you have a look? The break iterator tests will be a deja vu for you I believe :)

@rmuir
Copy link
Contributor

rmuir commented May 13, 2015

This looks great, I like the strategy for the breakiterator. That will be a nice one to put in lucene!

@javanna javanna closed this in 46c521f May 15, 2015
@kevinkluge kevinkluge removed the review label May 15, 2015
@javanna javanna added review and removed review labels May 15, 2015
@javanna
Copy link
Member Author

javanna commented May 15, 2015

This PR is marked breaking as it removes support for the require_field_match option when using the postings highlighter. This option is not supported by the lucene postings highlighter and the way it was supported in elasticsearch made the postings highlighter much slower and harder to maintain, so we decided to stick with the lucene implementation.

@clintongormley clintongormley changed the title Highlighting: nuke XPostingsHighlighter Remove XPostingsHighlighter in favour of Lucene's PostingsHighlighter Jun 6, 2015
@andreip
Copy link

andreip commented Jan 7, 2016

do you guys think this change generated the issue I'm describing here #15560 ? @javanna

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove XPostingsHighlighter
7 participants