Numeric fields can't make use of Lucene ranking behaviours #10628

markharwood · 2015-04-16T11:07:00Z

Why you need this

I have a use case where I want the normal rules of Lucene relevance ranking to apply to numeric fields, the rules being

TF - repetition of a term in a doc
IDF - scarcity of the term in the index
norms - normalizing for the length of the field.

The use case might be a product recommendation engine where each doc represents a user and the field of interest is a list of numeric product IDs that the user has liked (this is the format used in the MovieLens dataset to represent users and the movies they like).
My query might be to list the movies I like and find the users who share most of my interests. Lucene relevance ranking should be doing all the heavy-lifting here. The above rules would apply as follows:

TF - would favour users who had repeatedly watched one of my choices
IDF - would ensure any common movie IDs (e.g. most people have watched Shawshank Redemption) would be scored less highly.
norms - would favour users who have a concentrated list of movies over those with encyclopaedic lists of favourites.
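
To make the three factors concrete, here is a minimal sketch loosely modelled on Lucene's classic TF-IDF similarity. It is not the actual Lucene code: query normalisation, coord, boosts and the lossy encoding of norms are left out, and the user lists and movie IDs are made up for illustration.

```python
import math


def idf(doc_freq, num_docs):
    """Rarer terms (low doc_freq) get a larger weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def score(likes, movie_id, all_likes):
    """Simplified single-term score for one user's list of liked movie IDs."""
    freq = likes.count(movie_id)
    if freq == 0:
        return 0.0
    doc_freq = sum(1 for other in all_likes if movie_id in other)
    tf = math.sqrt(freq)                 # TF: repetition of the ID in this list
    norm = 1.0 / math.sqrt(len(likes))   # norms: shorter lists score higher
    # idf is squared because it contributes to both the query and field weights
    return tf * idf(doc_freq, len(all_likes)) ** 2 * norm


# Hypothetical users; each list holds the movie IDs that user likes.
all_likes = [
    [318, 50],                      # short, concentrated list
    [318, 50, 1, 2, 3, 4, 5, 6],    # long, encyclopaedic list
    [318, 7],
    [318, 318, 9, 10],              # liked 318 twice
]

for likes in all_likes:
    print(likes, round(score(likes, 50, all_likes), 3))
```

In this toy data, ID 318 appears in every list and so carries a low IDF, while the short list outranks the long one for ID 50 purely because of the norm.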

Example failure

In this gist we can see a simple example where 2 users have their list of "likes" encoded as both strings and numerics. One user has a short list of likes and the other a long list. When we search, we would hope to rank the user whose short list of likes reflects our movie choices higher than the user with many choices. The example queries show that all of the string-based queries use norms effectively and that none of the numeric-based queries do.

Why it fails

All of the query parsers resolve the various different ways of expressing a numeric query down to a call to IntegerFieldMapper.termQuery(..) which produces a NumericRangeQuery with none of the usual Lucene ranking behaviour.

Solutions

I can think of 4 options:

1. Do nothing - users wanting this ranking behaviour would have to encode numerics as strings
2. Add a new query clause to the DSL e.g. RankingNumericTerms
3. Add a "ranking_please" flag to existing query clauses e.g. Terms
4. Introduce ranking scores to all existing numeric queries

Option 1 may not be awful as this is a bit of an edge case and we may want to just add some docs around this approach.
Options 2 and 3 may warrant further investigation
Option 4 doesn't feel right for performance and backward compatibility reasons.
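
As a rough illustration of the "encode numerics as strings" workaround from option 1, the sketch below builds a document body that carries the same IDs in both forms. The field names `likes_num` and `likes_str` are hypothetical, and it assumes the string field is analyzed by something that splits on whitespace so that each ID becomes its own term.

```python
def likes_as_text(movie_ids):
    """Join numeric IDs into one whitespace-separated string of terms."""
    return " ".join(str(movie_id) for movie_id in movie_ids)


def build_user_doc(user_id, movie_ids):
    """Build a document body holding the likes as numbers and as text.

    Field names are made up for this example; only the string field
    would be expected to pick up TF, IDF and norms at query time.
    """
    return {
        "user_id": user_id,
        "likes_num": list(movie_ids),
        "likes_str": likes_as_text(movie_ids),
    }


print(build_user_doc("u1", [318, 50, 50]))
```

Repeated IDs are kept on purpose, since term frequency is one of the signals the string encoding is meant to preserve.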

jpountz · 2015-04-16T11:20:16Z

I think we should just build a TermQuery manually with the help of NumericUtils?

markharwood · 2015-04-16T11:34:04Z

How does that look from the user's perspective in the DSL?

jpountz · 2015-04-16T11:39:16Z

Maybe we don't need to change the DSL at all and can just fix our number mappers to generate a term query instead of a range on a single term. This way term queries against numbers would always be aware of the IDF?

markharwood · 2015-04-16T11:41:43Z

Nice.
Is there a behind-the-scenes overhead to representing IDs as integers in the index?
I'm thinking of all the trie-range encoding stuff that helps support efficient range queries. If we know something is an ID and therefore not subject to range queries, maybe there are gains to be had there too?

jpountz · 2015-04-16T11:53:23Z

If you don't use range queries then you can just disable trie encoding by setting precision_step=MAX_VALUE. Other than that numeric terms use a fixed-length binary encoding which is more space-efficient than string for large numbers and less space-efficient for small integers such as "1". But net/net I don't think this would be a big difference.
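
A minimal sketch of why a very large precision step disables trie encoding: each value is indexed once per shift, stepping by the precision step until the value's bit width is reached. This mimics the idea only; the real byte-level term encoding is different, the sketch assumes non-negative values, and the step of 8 used below is just an example, not a claimed default.

```python
MAX_VALUE = 2**31 - 1  # stand-in for Java's Integer.MAX_VALUE


def trie_terms(value, bits=32, precision_step=8):
    """Return (shift, prefix) pairs, one per term indexed for `value`."""
    terms = []
    shift = 0
    while shift < bits:
        # Each term keeps only the high-order bits above `shift`.
        terms.append((shift, value >> shift))
        shift += precision_step
    return terms


print(len(trie_terms(1234, precision_step=8)))          # several terms per value
print(len(trie_terms(1234, precision_step=MAX_VALUE)))  # a single full-precision term
```

With the step larger than the bit width, only the shift-0 term survives, which is the single exact term that a plain term query needs.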

markharwood · 2015-04-16T13:20:52Z

OK.

> Maybe we don't need to change the DSL at all and can just fix our number mappers to generate a term query instead of a range on a single term.

Would there be BWC concerns, i.e. unexpected changes in relevance ranking for existing queries?

jpountz · 2015-04-16T13:28:02Z

Field mappers know the version of the index for which they have been created so we could handle bw compat if we want. But I don't think we should worry about it?

markharwood · 2015-04-16T14:25:23Z

OK running tests on a PR now..

markharwood · 2015-04-17T16:22:50Z

I have broken the problem into 2 pieces now.

1. Single value numeric queries shouldn't be handled by NumericRangeQuery #10646: moving 2.x, 1.x and 1.5 away from NumericRangeQuery and onto ConstantScoreQuery+TermQuery.
2. This issue, which is about how and when we can request relevance ranking on numerics by removing the ConstantScoreQuery wrapper on the TermQuery added by 1).

The key remaining question on this issue is whether users expect relevance ranking of numerics to be on or off by default.

Arguments for no ranking by default:

Most users don't want, need or expect ranking on numeric fields, and adding it may break the ranking on some queries (one customer is known to maintain a test suite that fails if the scores for any existing queries change between system versions)
Ranges don't rank the individual numeric values, so logically why should requests for an individual numeric?

Arguments for ranking numerics by default:

There is an established means of turning off ranking in the DSL (ConstantScoreQuery wrapper)
If we don't rank numerics by default we'd need to change the DSL or mapping to add an option allowing users to explicitly turn on ranking

rjernst · 2015-04-17T18:48:23Z

I would say the argument is simpler: queries rank, filters do not. Since this is a numeric query, it should rank.

> one customer is known to maintain a test suite that fails if the scores for any existing queries change between system versions

They are just asking for trouble. Lucene makes no guarantees scoring will not change.

markharwood · 2015-04-17T22:10:23Z

> queries rank, filters do not

That's the problem - they don't always in elasticsearch. While it is cut and dried in Lucene land, rightly or wrongly at some stage we decided in elasticsearch that changing the query DSL ranking behaviour based on the field type being queried was usually helpful.

While changing to the purity of the Lucene model may upset some user expectations it seems that we never documented the current behaviour with numerics - in fact TermQuery docs claim the functionality is exactly that of Lucene's TermQuery.

So we either document the existing "helpful" behaviour and provide new DSL options to override its defaults for odd cases, or switch to the pure Query/Filter model you propose and typically require users to be smarter about ranking behaviours, e.g. use of ConstantScoreQuery.

markharwood · 2015-04-21T15:15:24Z

The consensus from a quick poll is in favour of retaining the existing default of queries not ranking on numeric fields. This means this issue is now about adding new options in the Query DSL for allowing user overrides which explicitly enable Lucene scoring for numerics.

jpountz · 2015-04-21T15:27:24Z

Why do we need new options in the query DSL? Users who don't want scoring can just wrap in constant_score.

rmuir · 2015-04-21T15:42:03Z

I think so too. I think it should be IDF-based, but norms and frequencies should remain omitted. This way the score is the same for every document that matches the term; it just reflects the popularity of the term.

This is completely consistent with querying other non-full text fields, like a not_analyzed string.

If someone wants to omit the IDF, we have constant_score for that, just like with anything else.
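
A small sketch of the IDF-only behaviour described in this comment, with norms and frequencies omitted. The formula is a simplified stand-in for Lucene's classic IDF, and the documents and IDs are invented for the example.

```python
import math


def idf_only_score(doc_freq, num_docs):
    """Score depends only on how rare the term is across the index."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def search(movie_id, docs):
    """Return {doc_name: score} for every document containing movie_id."""
    doc_freq = sum(1 for likes in docs.values() if movie_id in likes)
    return {
        name: idf_only_score(doc_freq, len(docs))
        for name, likes in docs.items()
        if movie_id in likes
    }


# Hypothetical documents with different lengths and repetition.
docs = {
    "a": [318, 50],
    "b": [318, 318, 1, 2, 3, 4],
    "c": [318],
}

print(search(318, docs))  # one shared score, despite differing lengths and repeats
print(search(50, docs))   # rarer ID, so a higher score
```

Every match for a given ID gets the same score, so list length and repetition no longer matter, while a rare ID still outweighs a popular one when several clauses are combined.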

jpountz · 2015-04-21T15:46:44Z

To elaborate a bit more on my previous comment, I don't have a strong opinion about whether these queries should score or not by default (I like the scoring one a bit more but could go the other way if there is consensus) but I'm against adding a new option to the query DSL for that. To me there are only two things we can do: either make numeric term queries score and require users to wrap in constant_score if they don't want scoring, or make numeric term queries never score and require users to index as strings to have IDF contributions to the score.

markharwood · 2015-04-21T15:56:05Z

I'm in favour of continuing with the constant-scoring default for numerics (that's arguably what most users expect).

For the power user (OK, me) both IDF and norms are things I would like to be ranking factors for the reasons I outlined in the opening description.
The encode-numbers-as-strings solution feels hacky but it avoids complicating the DSL and is probably only required for edge cases like the one I outlined.
I'm happy to close this issue if we feel that's the favoured approach - at least we had the debate and uncovered some stuff along the way.

markharwood · 2015-04-21T17:35:10Z

@brwe raises another interesting option - tackling the issue via mapping definitions.
A single flag held against a numeric field could be used to determine the scoring/non-scoring query policy used by default.

jpountz · 2015-04-21T20:22:58Z

> I'm in favour of continuing with the constant-scoring default for numerics (that's arguably what most users expect).

I don't know if that many users have scoring expectations when it comes to numerics. I don't have data to back me but I was assuming numerics were mostly useful for filtering (and sorting) in which case scoring does not matter?

I don't like the mapping option better than the query DSL option. In my opinion there are already too many options and settings in general, and even if most of them look harmless, they are hindering our ability to move forward. So I would really prefer not to add new options, especially given that we already have a way to turn scoring off using constant_score.

rjernst · 2015-04-21T20:36:16Z

I agree we already have this ability and shouldn't add an alternative way to turn off scoring. It is very difficult to remove such options later; we should aggressively guard from adding new options and find ways to do things without further complicating the api (query, mappings, etc).

markharwood · 2015-04-21T21:52:35Z

> and shouldn't add an alternative way to turn off scoring.

Just in case there's any confusion - we were talking about adding a way to turn on scoring.

We have 3 options:

1. Do nothing - meaning we stick with a model where you can't relevance rank on numeric fields.
2. Stick with current model of no-ranking on numeric fields but add an option (query DSL or mapping) to enable scoring.
3. Change the default query behaviour so numeric queries are ranked (there are existing ways to disable scoring in the DSL)

I understand you're not happy with 2) and I'm nervous about 3) - at some stage we obviously felt that deliberately disabling ranking on numerics was a generally useful thing for most users, and we've had no complaints, so a change may be upsetting.
Given the need for ranking on numerics is rare, 1) might be the safest option because, as a client-side workaround, numerics can always be provided as strings.

markharwood · 2015-04-24T10:10:07Z

Spoke with @clintongormley and we agreed to go with option 3) in my previous comment - queries on numeric fields will be changed to use Lucene ranking by default.
This change will involve:

Removing the ConstantScoreQuery wrapper around the TermQuery objects produced in NumberFieldMapper.
Adding a note to migration docs to document this change in default ranking behaviour.

clintongormley · 2015-04-24T10:11:41Z

At the moment there is no way of using scoring with numerics, but I'd be against adding new parameters to enable this. I agree with @jpountz that it should be done within the existing framework.

Currently, the constant-scoring of numeric queries is not documented (and indeed surprised me early on when I was experimenting with Elasticsearch). With the move to Lucene 5, queries = filters + scoring. I'd be happy to make that true for numeric term queries as well.

We already tell people that queries on structured data which shouldn't affect scoring should be specified as filters, so I think that the majority of people will see absolutely no change in behaviour. For the minority who do see a change, the fix is simple: use constant_score or the filter clause in the bool query.
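
For anyone who sees a change, the two fixes mentioned above might look roughly like the request bodies built below. This is a sketch only: the field name `likes` and the ID 318 are placeholders, and the bodies are assembled as plain Python dicts rather than through any client library.

```python
import json


def term_query(field, value):
    """A plain term query; on numeric fields this is what would now be scored."""
    return {"term": {field: value}}


def constant_score(query):
    """Wrap a query so that every match gets the same score."""
    return {"constant_score": {"filter": query}}


def bool_filter(query):
    """Place a query in the non-scoring filter clause of a bool query."""
    return {"bool": {"filter": [query]}}


likes_318 = term_query("likes", 318)  # hypothetical field and value

for body in (likes_318, constant_score(likes_318), bool_filter(likes_318)):
    print(json.dumps({"query": body}))
```

The first body takes part in relevance ranking; the other two keep matching identical but remove the term's contribution to the score.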

markharwood added the :Query DSL label Apr 16, 2015

kevinkluge assigned markharwood Apr 16, 2015

kevinkluge added the in progress label Apr 16, 2015

This was referenced Apr 16, 2015

Query scoring change for single-value queries on numeric fields #10631

Closed

Single value numeric queries shouldn't be handled by NumericRangeQuery #10646

Closed

markharwood added discuss v2.0.0-beta1 and removed discuss labels Apr 24, 2015

markharwood mentioned this issue Apr 24, 2015

Enable Lucene ranking behaviour for numeric term queries #10790

Closed

markharwood closed this as completed in 1b8b993 Apr 27, 2015

kevinkluge removed the in progress label Apr 27, 2015

jpountz mentioned this issue Jun 15, 2016

term query on long type field on version 1.6 and on 2.3 plays completely different #18888

Closed

clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Query DSL labels Feb 14, 2018
