Remove XPostingsHighlighter in favour of Lucene's PostingsHighlighter #11077
Our own fork of the Lucene PostingsHighlighter is not easy to maintain and doesn't give us any added value at this point. In particular, it was introduced to support the require_field_match option and discrete per-value highlighting, used when one wants to highlight the whole content of a field but get back one snippet per value. These two features won't make it into Lucene as they slow things down, and we probably shouldn't have supported them from day one. One other customization we had was support for a wider range of queries via custom rewrite etc. (yet another way to slow things down), which got added to Lucene and works much better than what we used to do (instead of a rewrite, terms are pulled out of the automata for multi-term queries).

Removing our fork means the following in terms of features:
- dropped support for require_field_match: the postings highlighter will only highlight fields that were queried
- the output is different compared to other highlighters when `fragment_size` is set to 0: one single snippet is returned when a field has multiple values, rather than one highlighted snippet per value

Closes elastic#10625
Closes elastic#11077
This looks great Luca! Agreed that this fork is a challenge to maintain. Out of the features we removed, the one that users ask for (regardless of highlighting impl) is to be able to take multi-value fields into account and highlight each value; it's a big usability aspect. I would check if it's possible to add it to the Lucene postings highlighter itself in the future.
I may have made it sound worse than it actually is. The postings highlighter is aware of multiple values, at least the way we use it, as we use a specific paragraph separator between values which the break iterator can detect. The only case where that doesn't work is when highlighting the whole content of a field (setting |
One other idea to address the "highlight the whole content, value per value" use case could be to write a break iterator to use instead of the |
For what it's worth, I think the experimental highlighter supports this use case at roughly the same performance you'd get out of the postings highlighter. It's been a while since I looked at the other highlighters so I'm not 100% sure what the use case is; I'm just going from memory here.
LGTM, I think this is a great change. Let's make sure there is a Lucene ticket for the multi-value highlighting issue and reference it from here?
@rmuir can you have a look too please?
The paragraph separator is only a "recommended" thing to use, because the default Java BreakIterator already splits on it out of the box (for typical highlighting use cases). If you want to do something atypical, use a different character that can't be in the data. Use U+0000 if you want.
In this patch (which is much simpler, thanks!), this method is actually only called by overridden code, not by Lucene at all. So really @jpountz, there isn't anything for Lucene "to fix" here. Just don't use U+2029, use something else for whatever the strange use case is. Or don't call the method at all and do something different :)
right @rmuir that is because we need to load the content from
Correct me if I'm wrong, but I don't think we can do whatever we want, as offsets etc. need to match the value loaded from stored fields.
Right, because Analyzer.getOffsetGap() is 1 by default, it should be one character. I recommend a control character like U+0000 or INFORMATION SEPARATOR X. If we were programming in C, we would be using NUL terminated strings implicitly and without hesitation!
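To make the offset argument concrete, here is a minimal sketch in plain Java (no Lucene dependency). It assumes that values are joined with exactly one separator character and that the offset gap between values is 1, as stated above. The class and method names (`MultiValueOffsets`, `join`, `startOffsets`) are hypothetical and are not part of this PR.

```java
import java.util.Arrays;
import java.util.List;

public class MultiValueOffsets {
    // Assumption: a single control char that cannot occur in the field data.
    static final char SEPARATOR = '\u0000';

    /** Joins the values, placing exactly one separator char between each pair. */
    static String join(List<String> values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                sb.append(SEPARATOR);
            }
            sb.append(values.get(i));
        }
        return sb.toString();
    }

    /** Start offset of each value, assuming an offset gap of 1 between values. */
    static int[] startOffsets(List<String> values) {
        int[] starts = new int[values.size()];
        int offset = 0;
        for (int i = 0; i < values.size(); i++) {
            starts[i] = offset;
            // +1 accounts for the gap, which the one-char separator fills.
            offset += values.get(i).length() + 1;
        }
        return starts;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("first value", "second one");
        String joined = join(values);
        int[] starts = startOffsets(values);
        for (int i = 0; i < values.size(); i++) {
            // Each value is found in the joined text at its computed offset.
            System.out.println(joined.startsWith(values.get(i), starts[i]));
        }
    }
}
```

With a separator longer than one character, the computed offsets would drift away from the positions in the joined text, which is the mismatch the comments above warn about.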
…or highlighting one field value at a time
I pushed another commit, highlighting one field value at a time works well now, the output is the same as before. Added a Updated the migrate docs, I removed the above limitation and added another one that I found around highlighting match query with type set to @rmuir can you have a look? The break iterator tests will be a deja vu for you I believe :) |
This looks great, I like the strategy for the breakiterator. That will be a nice one to put in lucene!
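As an illustration of the idea discussed here, the following is a rough sketch of a `java.text.BreakIterator` that places a boundary right after each occurrence of a single separator character, so each field value becomes one passage. It is not the code from this PR: the class name `SeparatorBreakIterator` is hypothetical, and the choice of separator is an assumption left to the caller.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

public final class SeparatorBreakIterator extends BreakIterator {
    private final char separator;
    private CharacterIterator text = new StringCharacterIterator("");
    private int current;

    public SeparatorBreakIterator(char separator) {
        this.separator = separator;
    }

    @Override public int current() { return current; }

    @Override public int first() { return current = text.getBeginIndex(); }

    @Override public int last() { return current = text.getEndIndex(); }

    @Override public int next() {
        int end = text.getEndIndex();
        if (current == end) {
            return DONE;
        }
        // The next boundary sits right after the next separator, or at the end.
        for (int i = current; i < end; i++) {
            if (text.setIndex(i) == separator) {
                return current = i + 1;
            }
        }
        return current = end;
    }

    @Override public int previous() {
        int begin = text.getBeginIndex();
        if (current == begin) {
            return DONE;
        }
        // Walk backwards to the closest position preceded by a separator.
        for (int i = current - 1; i > begin; i--) {
            if (text.setIndex(i - 1) == separator) {
                return current = i;
            }
        }
        return current = begin;
    }

    @Override public int next(int n) {
        int result = current;
        for (int i = 0; i < Math.abs(n) && result != DONE; i++) {
            result = n > 0 ? next() : previous();
        }
        return result;
    }

    @Override public int following(int offset) {
        if (offset < text.getBeginIndex() || offset > text.getEndIndex()) {
            throw new IllegalArgumentException("offset out of bounds");
        }
        current = offset;
        return next(); // first boundary strictly after offset, or DONE at the end
    }

    @Override public CharacterIterator getText() { return text; }

    @Override public void setText(CharacterIterator newText) {
        text = newText;
        current = newText.getBeginIndex();
    }

    public static void main(String[] args) {
        String joined = "one\u0000two three\u0000four";
        BreakIterator it = new SeparatorBreakIterator('\u0000');
        it.setText(joined);
        int start = it.first();
        for (int end = it.next(); end != DONE; start = end, end = it.next()) {
            // Each segment holds one value plus its trailing separator, if any.
            System.out.println(joined.substring(start, end).replace("\u0000", ""));
        }
    }
}
```

Because boundaries depend only on the separator, sentence punctuation inside a value never splits it, which matches the "one snippet per value" behaviour the thread is after.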
This PR is marked breaking as it removes support for the |
Our own fork of the Lucene PostingsHighlighter is very hard to maintain and doesn't give us any added value at this point. In particular, it was introduced to support the `require_field_match` option and discrete per-value highlighting, used when one wants to highlight the whole content of a field but get back one snippet per value. These two features won't make it into Lucene the way I implemented them as they slow things down, and we probably shouldn't have supported them from day one. One other customization we had was support for a wider range of queries via custom rewrite etc. (yet another way to slow things down), which got added to Lucene and works much better than what we used to do (instead of a rewrite, terms are pulled out of the automata for multi-term queries).
Removing our fork means the following in terms of features:
`phrase_prefix`. The postings highlighter rewrites against an empty reader to avoid slow operations (like the ones that we were performing with the fork that we are removing here), thus the prefix will not be expanded to any term. What the postings highlighter does instead is pull the automata out of multi-term queries, but this is not supported at the moment with our `MultiPhrasePrefixQuery`.
Closes #10625