Numeric fields can't make use of Lucene ranking behaviours #10628

Closed
markharwood opened this issue Apr 16, 2015 · 22 comments
Labels
:Search/Search Search-related issues that do not fall into other categories v2.0.0-beta1

Comments

@markharwood
Contributor

Why you need this

I have a use case where I want the normal rules of Lucene relevance ranking to apply to numeric fields, the rules being

  1. TF - repetition of a term in a doc
  2. IDF - scarcity of the term in the index
  3. norms - normalizing for the length of the field.

The use case might be a product recommendation engine where each doc represents a user and the field of interest is a list of numeric product IDs that the user has liked (this is the format used in the MovieLens dataset to represent users and the movies they like).
My query might be to list the movies I like and find the users who share most of my interests. Lucene relevance ranking should be doing all the heavy-lifting here. The above rules would apply as follows:

  1. TF - would favour users who had repeatedly watched one of my choices
  2. IDF - would ensure any common movie IDs (e.g. most people have watched Shawshank Redemption) would be scored less highly.
  3. norms - would favour those users who have a concentrated list of movies over those with encyclopaedic lists of favourites.
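
To make the three factors concrete, here is a minimal Python sketch modelled on a simplified form of Lucene's classic TF-IDF similarity (it leaves out coord, queryNorm, boosts and the lossy byte encoding of norms). The movie IDs, document frequencies and index size are invented purely for illustration.

```python
import math

def tf(freq):
    # Term frequency factor: repetition of a term in a doc.
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    # Inverse document frequency: rarer terms weigh more.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def length_norm(num_terms):
    # Field-length normalization: shorter fields weigh more.
    return 1.0 / math.sqrt(num_terms)

def score(query_terms, doc_terms, doc_freqs, num_docs):
    # Simplified sum over matching terms of tf * idf^2 * norm.
    total = 0.0
    for term in query_terms:
        freq = doc_terms.count(term)
        if freq == 0:
            continue
        total += (tf(freq) * idf(doc_freqs[term], num_docs) ** 2
                  * length_norm(len(doc_terms)))
    return total

# Hypothetical data: movie 1 is very popular, movie 2 is rare.
NUM_DOCS = 1000
DOC_FREQS = {1: 900, 2: 10}
focused_user = [1, 2, 3]            # short list of likes
broad_user = list(range(1, 101))    # encyclopaedic list of likes
my_likes = [1, 2]

focused_score = score(my_likes, focused_user, DOC_FREQS, NUM_DOCS)
broad_score = score(my_likes, broad_user, DOC_FREQS, NUM_DOCS)
print(focused_score > broad_score)
```

Both users match the same two movies, so in this sketch the only thing separating them is the length normalization, and the rare movie contributes more to each score than the popular one.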

Example failure

In this gist we can see a simple example where 2 users have their list of "likes" encoded as both strings and numerics. One user has a short list of likes and the other a long list. When we search we would hope to rank the user with the short list of likes that reflect our movie choices higher than the user with many choices. The example queries show how all of the string-based queries use norms effectively and how none of the numeric-based queries make use of this.

Why it fails

All of the query parsers resolve the various different ways of expressing a numeric query down to a call to IntegerFieldMapper.termQuery(..) which produces a NumericRangeQuery with none of the usual Lucene ranking behaviour.

Solutions

I can think of 4 options:

  1. Do nothing - users wanting this ranking behaviour would have to encode numerics as strings
  2. Add a new query clause to the DSL e.g. RankingNumericTerms
  3. Add a "ranking_please" flag to existing query clauses e.g. Terms
  4. Introduce ranking scores to all existing numeric queries

Option 1 may not be awful, as this is a bit of an edge case, and we may want to just add some docs around this approach.
Options 2 and 3 may warrant further investigation.
Option 4 doesn't feel right for performance and backward compatibility reasons.

@jpountz
Contributor

jpountz commented Apr 16, 2015

I think we should just build a TermQuery manually with the help of NumericUtils?

@markharwood
Contributor Author

How does that look from the user's perspective in the DSL?

@jpountz
Contributor

jpountz commented Apr 16, 2015

Maybe we don't need to change the DSL at all and can just fix our number mappers to generate a term query instead of a range on a single term. This way term queries against numbers would always be aware of the IDF?

@markharwood
Contributor Author

Nice.
Is there a behind-the-scenes overhead to representing IDs as integers in the index?
I'm thinking of all the trie-range encoding stuff that helps support efficient range queries. If we know something is an ID and therefore not subject to range queries, maybe there are gains to be had there too?

@jpountz
Contributor

jpountz commented Apr 16, 2015

If you don't use range queries then you can just disable trie encoding by setting precision_step=MAX_VALUE. Other than that numeric terms use a fixed-length binary encoding which is more space-efficient than string for large numbers and less space-efficient for small integers such as "1". But net/net I don't think this would be a big difference.
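
A rough Python sketch of the trade-off described above. It assumes that trie encoding indexes one term per shift (0, precision_step, 2 * precision_step, ...) below the value's bit width, and that a full-precision 32-bit integer term occupies 6 bytes; the 6-byte figure is an assumption of this sketch and not something stated in this thread.

```python
import math

INT_BITS = 32
MAX_VALUE = 2**31 - 1  # stands in for Integer.MAX_VALUE

# Assumption: size in bytes of one full-precision 32-bit numeric term.
ASSUMED_INT_TERM_BYTES = 6

def terms_per_value(value_bits, precision_step):
    # One indexed term per shift 0, step, 2*step, ... below value_bits.
    return math.ceil(value_bits / precision_step)

def string_term_bytes(value):
    # Size of the same number indexed as a decimal string term.
    return len(str(value).encode("utf-8"))

for step in (4, 8, MAX_VALUE):
    print(step, terms_per_value(INT_BITS, step))

for value in (1, 123456, MAX_VALUE):
    print(value, string_term_bytes(value), ASSUMED_INT_TERM_BYTES)
```

With the step set to MAX_VALUE only a single term per value is indexed, which is the "disable trie encoding" case. Under the 6-byte assumption, the string form is smaller for a value like 1 and larger for a ten-digit value.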

@markharwood
Contributor Author

OK.

> Maybe we don't need to change the DSL at all and can just fix our number mappers to generate a term query instead of a range on a single term.

Would there be BWC concerns, i.e. unexpected changes in relevance ranking for existing queries?

@jpountz
Contributor

jpountz commented Apr 16, 2015

Field mappers know the version of the index for which they have been created so we could handle bw compat if we want. But I don't think we should worry about it?

@markharwood
Contributor Author

OK running tests on a PR now..

@markharwood
Contributor Author

I have broken the problem into 2 pieces now.

  1. Single value numeric queries shouldn't be handled by NumericRangeQuery #10646 Moving 2.x, 1.x and 1.5 away from NumericRangeQuery and onto ConstantScoreQuery+TermQuery
  2. This issue which is about how and when we can request relevance ranking on numerics removing the ConstantScoreQuery wrapper on the TermQuery added by 1).

The key remaining question on this issue is whether users expect relevance ranking of numerics to be on or off by default.

Arguments for no ranking by default:

  1. Most users don't want, need or expect ranking on numeric fields and adding it may break the ranking on some queries (one customer is known to maintain a test suite that fails if the scores for any existing queries change between system versions)
  2. Ranges don't rank the individual numeric values, so logically why should requests for an individual numeric?

Arguments for ranking numerics by default:

  1. There is an established means of turning off ranking in the DSL (ConstantScoreQuery wrapper)
  2. If we don't rank numerics by default we'd need to change the DSL or mapping to add an option allowing users to explicitly turn on ranking

@rjernst
Member

rjernst commented Apr 17, 2015

I would say the argument is simpler: queries rank, filters do not. Since this is a numeric query, it should rank.

> one customer is known to maintain a test suite that fails if the scores for any existing queries change between system versions

They are just asking for trouble. Lucene makes no guarantees scoring will not change.

@markharwood
Contributor Author

> queries rank, filters do not

That's the problem - they don't always in Elasticsearch. While it is cut-and-dried in Lucene land, rightly or wrongly at some stage we decided in Elasticsearch that changing the query DSL ranking behaviour based on the field type being queried was usually helpful.

While changing to the purity of the Lucene model may upset some user expectations it seems that we never documented the current behaviour with numerics - in fact TermQuery docs claim the functionality is exactly that of Lucene's TermQuery.

So we either document the existing "helpful" behaviour and provide new DSL options to override its defaults for odd cases, or switch to the pure Query/Filter model you propose and typically require users to be smarter about ranking behaviours, e.g. use of ConstantScoreQuery.

@markharwood
Contributor Author

The consensus from a quick poll is in favour of retaining the existing default of queries not ranking on numeric fields. This means this issue is now about adding new options in the Query DSL for allowing user overrides which explicitly enable Lucene scoring for numerics.

@jpountz
Contributor

jpountz commented Apr 21, 2015

Why do we need new options in the query DSL? Users who don't want scoring can just wrap in constant_score.

@rmuir
Contributor

rmuir commented Apr 21, 2015

I think so too. I think it should be IDF-based, but norms and frequencies should remain omitted. This way the score is the same for every document that matches the term; it just reflects the popularity of the term.

This is completely consistent with querying other non-full text fields, like a not_analyzed string.

If someone wants to omit the IDF, we have constant_score for that, just like with anything else.
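
A small Python sketch of what "IDF-based, with norms and frequencies omitted" would mean for a single-term query. The formulas follow a simplified form of Lucene's classic similarity, and the documents and statistics are made up for illustration.

```python
import math

def idf(doc_freq, num_docs):
    # Rarer terms get a higher weight.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def term_score(term, doc_terms, doc_freq, num_docs, omit_tf_and_norms):
    freq = doc_terms.count(term)
    if freq == 0:
        return 0.0
    weight = idf(doc_freq, num_docs) ** 2
    if omit_tf_and_norms:
        # Frequencies and norms omitted: only term popularity matters.
        return weight
    # Full ranking: repetition and field length also contribute.
    return math.sqrt(freq) * weight / math.sqrt(len(doc_terms))

# Hypothetical documents: both contain the value 7.
doc_a = [7, 7, 7]                      # repeats the term, short field
doc_b = [7] + list(range(100, 150))    # single occurrence, long field

idf_only = [term_score(7, d, 2, 1000, True) for d in (doc_a, doc_b)]
full = [term_score(7, d, 2, 1000, False) for d in (doc_a, doc_b)]
print(idf_only, full)
```

In the IDF-only mode both documents receive the same score for the term, whereas the full formula separates them by repetition and field length.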

@jpountz
Contributor

jpountz commented Apr 21, 2015

To elaborate a bit more on my previous comment, I don't have a strong opinion about whether these queries should score or not by default (I like the scoring one a bit more but could go the other way if there is consensus) but I'm against adding a new option to the query DSL for that. To me there are only two things we can do: either make numeric term queries score and require users to wrap in constant_score if they don't want scoring, or make numeric term queries never score and require users to index numbers as strings to have IDF contributions to the score.

@markharwood
Contributor Author

I'm in favour of continuing with the constant-scoring default for numerics (that's arguably what most users expect).

For the power user (OK, me) both IDF and norms are things I would like to be ranking factors for the reasons I outlined in the opening description.
The encode-numbers-as-strings solution feels hacky but it avoids complicating the DSL and is probably only required for edge cases like the one I outlined.
I'm happy to close this issue if we feel that's the favoured approach - at least we had the debate and uncovered some stuff along the way.

@markharwood
Contributor Author

@brwe raises another interesting option - tackling the issue via mapping definitions.
A single flag held against a numeric field could be used to determine the scoring/non-scoring query policy used by default.

@jpountz
Contributor

jpountz commented Apr 21, 2015

> I'm in favour of continuing with the constant-scoring default for numerics (that's arguably what most users expect).

I don't know if that many users have scoring expectations when it comes to numerics. I don't have data to back me up, but I was assuming numerics were mostly useful for filtering (and sorting), in which case scoring does not matter?

I don't like the mapping option better than the query DSL option. In my opinion there are already too many options and settings in general, and even if most of them look harmless, they are hindering our ability to move forward. So I would really prefer not to add new options, especially given that we already have a way to turn scoring off using constant_score.

@rjernst
Member

rjernst commented Apr 21, 2015

I agree we already have this ability and shouldn't add an alternative way to turn off scoring. It is very difficult to remove such options later; we should aggressively guard against adding new options and find ways to do things without further complicating the API (query, mappings, etc).

@markharwood
Contributor Author

> and shouldn't add an alternative way to turn off scoring.

Just in case there's any confusion - we were talking about adding a way to turn on scoring.

We have 3 options:

  1. Do nothing - meaning we stick with a model where you can't relevance rank on numeric fields.
  2. Stick with current model of no-ranking on numeric fields but add an option (query DSL or mapping) to enable scoring.
  3. Change the default query behaviour so numeric queries are ranked (there are existing ways to disable scoring in the DSL)

I understand you're not happy with 2) and I'm nervous about 3 - at some stage we obviously felt that deliberately disabling ranking on numerics was a generally useful thing for most users and we've had no complaints so a change may be upsetting.
Given that the need for ranking on numerics is rare, 1) might be the safest option because, as a client-side workaround, numerics can always be provided as strings.
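
For reference, the client-side workaround could look roughly like the Python sketch below. The field name `liked_movie_ids` and the sample IDs are hypothetical, and the sketch assumes the field is mapped as a string type so that ordinary term scoring applies.

```python
import json

FIELD = "liked_movie_ids"  # hypothetical field name, mapped as a string

def stringify_ids(ids):
    # Send numeric IDs as strings so they are indexed as plain terms.
    return [str(i) for i in ids]

def build_user_doc(user_id, liked_ids):
    # Document body to index for one user.
    return {"user_id": user_id, FIELD: stringify_ids(liked_ids)}

def build_likes_query(my_liked_ids):
    # One scoring term clause per liked movie, combined with "should".
    clauses = [{"term": {FIELD: value}} for value in stringify_ids(my_liked_ids)]
    return {"query": {"bool": {"should": clauses}}}

user_doc = build_user_doc(42, [318, 858, 50])
likes_query = build_likes_query([318, 50])
print(json.dumps(user_doc))
print(json.dumps(likes_query))
```

The cost of this approach is that the client has to remember to convert values on both the indexing and the query side.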

@markharwood
Contributor Author

Spoke with @clintongormley and we agreed to go with option 3) in my previous comment - queries on numeric fields will be changed to use Lucene ranking by default.
This change will involve:

  • removing the ConstantScoreQuery wrapper around the TermQuery objects produced in NumberFieldMapper.
  • Adding a note to migration docs to document this change in default ranking behaviour.

@clintongormley

At the moment there is no way of using scoring with numerics, but I'd be against adding new parameters to enable this. I agree with @jpountz that it should be done within the existing framework.

Currently, the constant-scoring of numeric queries is not documented (and indeed surprised me early on when I was experimenting with Elasticsearch). With the move to Lucene 5, queries = filters + scoring. I'd be happy to make that true for numeric term queries as well.

We already tell people that queries on structured data which shouldn't affect scoring should be specified as filters, so I think that the majority of people will see absolutely no change in behaviour. For the minority who do see a change, the fix is simple: use constant_score or the filter clause in the bool query.
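
As an illustration of that fix, the Python sketch below builds the three request bodies involved: a plain term query on a numeric field (which would be scored after the change), the same term wrapped in constant_score, and the same term in the filter clause of a bool query. The field name and value are hypothetical.

```python
import json

FIELD = "liked_movie_ids"  # hypothetical numeric field

def scored_term(field, value):
    # Plain term query: contributes to the relevance score.
    return {"query": {"term": {field: value}}}

def constant_score_term(field, value):
    # Same match, wrapped so every hit gets the same score.
    return {"query": {"constant_score": {"filter": {"term": {field: value}}}}}

def bool_filter_term(field, value):
    # Same match in the non-scoring filter clause of a bool query.
    return {"query": {"bool": {"filter": [{"term": {field: value}}]}}}

bodies = {
    "scored": scored_term(FIELD, 318),
    "constant_score": constant_score_term(FIELD, 318),
    "bool_filter": bool_filter_term(FIELD, 318),
}
for name, body in bodies.items():
    print(name, json.dumps(body))
```

All three match the same documents; only the first lets the term influence ranking.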
