Numeric fields can't make use of Lucene ranking behaviours #10628
I think we should just build a TermQuery manually with the help of NumericUtils?
How does that look from the user's perspective in the DSL?
Maybe we don't need to change the DSL at all and can just fix our number mappers to generate a term query instead of a range on a single term. This way term queries against numbers would always be aware of the IDF?
Nice.
If you don't use range queries then you can just disable trie encoding by setting precision_step=MAX_VALUE. Other than that, numeric terms use a fixed-length binary encoding which is more space-efficient than strings for large numbers and less space-efficient for small integers such as "1". But net/net I don't think this would be a big difference.
OK.
Would there be BWC concerns, i.e. unexpected changes in relevance ranking for existing queries?
Field mappers know the version of the index for which they have been created so we could handle bw compat if we want. But I don't think we should worry about it?
OK, running tests on a PR now.
I have broken the problem into 2 pieces now.
The key remaining question on this issue is if users expect to have relevance ranking of numerics on or off by default? Arguments for no ranking by default:
Arguments for ranking numerics by default:
I would say the argument is simpler: queries rank, filters do not. Since this is a numeric query, it should rank.
They are just asking for trouble. Lucene makes no guarantees that scoring will not change.
That's the problem - they don't always in Elasticsearch. While it is cut-and-dried in Lucene land, rightly or wrongly at some stage we decided in Elasticsearch that changing the query DSL ranking behaviour based on the field type being queried was usually helpful. While changing to the purity of the Lucene model may upset some user expectations, it seems that we never documented the current behaviour with numerics - in fact the TermQuery docs claim the functionality is exactly that of Lucene's TermQuery. So we either document the existing "helpful" behaviour and provide new DSL options to override its defaults for odd cases, or switch to the pure Query/Filter model you propose and typically require users to be smarter about ranking behaviours, e.g. use of ConstantScoreQuery.
The consensus from a quick poll is in favour of retaining the existing default of queries not ranking on numeric fields. This means this issue is now about adding new options in the Query DSL for allowing user overrides which explicitly enable Lucene scoring for numerics.
Why do we need new options in the query DSL, users who don't want scoring can just wrap in |
I think so too. I think it should be IDF-based, but norms and frequencies should remain omitted. This way the score is the same for every document that matches the term; it just reflects the popularity of the term. This is completely consistent with querying other non-full text fields, like a if someone wants to omit the IDF, we have |
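As a rough illustration of the IDF-only idea in the comment above, here is a toy sketch in Python. It is not Lucene code: the formula `1 + ln(N / (df + 1))` is modelled on the classic TF-IDF similarity, and the corpus, document IDs, and function names are all hypothetical.

```python
import math

# Hypothetical corpus: doc ID -> list of numeric product IDs in one field.
docs = {
    "u1": [7, 42],
    "u2": [7],
    "u3": [7, 99],
    "u4": [7, 42, 99, 100],
}

def idf(term, docs):
    """Classic-style IDF, 1 + ln(N / (df + 1)): rarer terms weigh more."""
    doc_freq = sum(1 for terms in docs.values() if term in terms)
    return 1.0 + math.log(len(docs) / (doc_freq + 1))

def idf_only_scores(term, docs):
    """Score each matching doc by IDF alone (no norms, no term frequency)."""
    weight = idf(term, docs)
    return {doc_id: weight for doc_id, terms in docs.items() if term in terms}

common = idf_only_scores(7, docs)    # term 7 appears in all four docs
rare = idf_only_scores(100, docs)    # term 100 appears in one doc only
```

Every document matching a given term gets the same score, regardless of how long its list is, while a match on the rare term outweighs a match on the common one.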
To elaborate a bit more on my previous comment, I don't have a strong opinion about whether these queries should score or not by default (I like the scoring one a bit more but could go the other way if there is consensus) but I'm against adding a new option to the query DSL for that. To me there are only two things we can do: either make numeric term queries score and require users to wrap in |
I'm in favour of continuing with the constant-scoring default for numerics (that's arguably what most users expect). For the power user (OK, me) both IDF and norms are things I would like to be ranking factors for the reasons I outlined in the opening description.
@brwe raises another interesting option - tackling the issue via mapping definitions.
I don't know if that many users have scoring expectations when it comes to numerics. I don't have data to back me but I was assuming numerics were mostly useful for filtering (and sorting) in which case scoring does not matter? I don't like the mapping option better than the query DSL option. In my opinion there are already too many options and settings in general, and even if most of them look harmless, they are hindering our ability to move forward. So I would really prefer not to add new options, especially given that we already have a way to turn scoring off using |
I agree we already have this ability and shouldn't add an alternative way to turn off scoring. It is very difficult to remove such options later; we should aggressively guard against adding new options and find ways to do things without further complicating the API (query, mappings, etc).
Just in case there's any confusion - we were talking about adding a way to turn on scoring. We have 3 options:
I understand you're not happy with 2) and I'm nervous about 3) - at some stage we obviously felt that deliberately disabling ranking on numerics was a generally useful thing for most users, and we've had no complaints, so a change may be upsetting.
Spoke with @clintongormley and we agreed to go with option 3) in my previous comment - queries on numeric fields will be changed to use Lucene ranking by default.
At the moment there is no way of using scoring with numerics, but I'd be against adding new parameters to enable this. I agree with @jpountz that it should be done within the existing framework. Currently, the constant-scoring of numeric queries is not documented (and indeed surprised me early on when I was experimenting with Elasticsearch). With the move to Lucene 5, queries = filters + scoring. I'd be happy for making that true for numeric term queries as well. We already tell people that queries on structured data which shouldn't affect scoring should be specified as filters, so I think that the majority of people will see absolutely no change in behaviour. For the minority who do see a change, the fix is simple: use constant_score or the |
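For readers who want to keep constant scoring after such a change, a minimal sketch of the `constant_score` wrapping mentioned above follows, built as a Python dict for clarity. The field name `likes`, the value, and the helper function are hypothetical, and the exact request shape should be checked against the documentation for the Elasticsearch version in use.

```python
import json

def constant_score_term(field, value):
    """Build a request body that matches a term without relevance scoring.

    Wraps a term filter in a constant_score query, so every matching
    document receives the same score.
    """
    return {
        "query": {
            "constant_score": {
                "filter": {"term": {field: value}}
            }
        }
    }

# Hypothetical field and value, for illustration only.
body = constant_score_term("likes", 1234)
print(json.dumps(body, indent=2))
```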
Why you need this
I have a use case where I want the normal rules of Lucene relevance ranking to apply to numeric fields, the rules being
The use case might be a product recommendation engine where each doc represents a user and the field of interest is a list of numeric product IDs that the user has liked (this is the format used in the MovieLens dataset to represent users and the movies they like).
My query might be to list the movies I like and find the users who share most of my interests. Lucene relevance ranking should be doing all the heavy-lifting here. The above rules would apply as follows:
Example failure
In this gist we can see a simple example where 2 users have their list of "likes" encoded as both strings and numerics. One user has a short list of likes and the other a long list. When we search, we would hope to rank the user with the short list of likes that reflect our movie choices higher than the user with many choices. The example queries show how all of the string-based queries use norms effectively and how none of the numeric-based queries make use of this.
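The expected effect can be sketched with a toy model (this is not the gist and not Lucene's scoring code). The `1 / sqrt(field length)` norm is a simplified stand-in for the length normalisation in the classic TF-IDF similarity; the user names, movie IDs, and `score` helper are hypothetical.

```python
import math

# Hypothetical data: each user is a doc whose field lists liked movie IDs.
users = {
    "short_list_user": [1, 2, 3],
    "long_list_user": list(range(1, 101)),  # likes 100 movies, including 1-3
}

def score(likes, query_ids, use_norms):
    """Toy score: one point per matching ID, optionally scaled by a length norm."""
    matches = sum(1 for movie_id in query_ids if movie_id in likes)
    norm = 1.0 / math.sqrt(len(likes)) if use_norms else 1.0
    return matches * norm

query = [1, 2, 3]
with_norms = {name: score(likes, query, True) for name, likes in users.items()}
without_norms = {name: score(likes, query, False) for name, likes in users.items()}
```

With norms, the short-list user ranks above the long-list user; without them, both users tie, which is the behaviour the numeric queries show.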
Why it fails
All of the query parsers resolve the various different ways of expressing a numeric query down to a call to IntegerFieldMapper.termQuery(..) which produces a NumericRangeQuery with none of the usual Lucene ranking behaviour.
Solutions
I can think of 4 options:
Option 1 may not be awful as this is a bit of an edge case and we may want to just add some docs around this approach.
Options 2 and 3 may warrant further investigation.
Option 4 doesn't feel right for performance and backward compatibility reasons.