Query scoring change for single-value queries on numeric fields #10631

markharwood · 2015-04-16T15:10:20Z

Change Long and Integer fields to use a TermQuery rather than a non-scoring NumericRangeQuery when querying single values. Allows for relevance-ranked retrieval on non-text fields with example use case being recommendation engine finding users with similar tastes in movies where users' liked movie ids are held as arrays of numbers.

Closes #10628

…ing TermQuery rather than a non-scoring NumericRangeQuery. Closes #10628

mikemccand · 2015-04-16T17:48:33Z

Maybe Lucene should have a NumericTermQuery?

rmuir · 2015-04-16T17:49:42Z

For the legacy numeric fields? won't this be confusing for autoprefix? what would this query do? "Analyze" the input number? why not just make NumericTokenStream work as a regular analyzer. I can think of many better ways to do it.

rmuir · 2015-04-16T17:50:24Z

But mark's patch is fine (my complaints are about numeric ranges in lucene). mark's patch changes this to use a single term instead of a one-term range.

rmuir · 2015-04-16T18:00:26Z

looks good to me. This pull request may have performance benefits as well. I do not think a one-term numericrangequery is as efficient (it likely pulls all the bits for matching docs for that single term eagerly instead of lazily for intersection, etc).

rmuir · 2015-04-16T18:04:33Z

I wonder if there is some reason Integer and Long are singled out. Should we do it for all numeric types/dates/ip/etc?

rjernst · 2015-04-16T20:14:46Z

src/main/java/org/elasticsearch/index/mapper/core/IntegerFieldMapper.java

-        int iValue = parseValue(value);
-        return NumericRangeQuery.newIntRange(names.indexName(), precisionStep,
-                iValue, iValue, true, true);
+        return new TermQuery(new Term(names.indexName(), indexedValueForSearch(value)));


Could we move this up to NumberFieldMapper (replacing the range query there now) to get the benefit with the other types @rmuir mentions?

markharwood · 2015-04-17T08:56:28Z

@clintongormley what do you think of the impact this change might have on existing queries?
For example, we steer people to use the "match" query because it is simple and "does the right thing" - this change will mean that those users who match on numeric fields will have changes to the ranking behaviour.
Ordinarily users might expect numeric queries to not rank at all and be accustomed to that. My use case for the recommendation scenario requires ranking and is perhaps an edge case that deserves special extensions to the DSL?

So is our default behaviour ranked-matching-on-everything or ranked-on-text-only?

… using the BytesRef returned by subclasses in their existing indexedValueForSearch() method. All subclass impls of termQuery method (which used to create NumericRangeQuerys) are now removed.

rmuir · 2015-04-17T11:16:55Z

IMO termquery does the right thing because it acts like a normal term that appears once in matching documents. Numerics already omit norms and frequencies so the score for numerics will be a constant for all documents: so with the default sim and querynorm, will users even care?

Anyway, if for some reason we want to change what this exact constant value is, we should wrap the termquery in ConstantScoreQuery (the user can do that too, always!), instead of using a less efficient RangeQuery...

brwe · 2015-04-17T11:26:39Z

What about idf? Using something like idf for scoring assumes that relevance of a term can actually be estimated by its doc freq and I thought this is mainly a good heuristic for fields containing natural language. But is it also true for most numeric fields? I worry that it is not so and that this change would break things for users. Anybody knows what the original motivation for using range queries for numerics?

rmuir · 2015-04-17T11:27:35Z

This is crazy. I will make my own PR that changes this to a constantscorequery(termquery()) then and you guys can debate the finer aspects of scoring.

But the performance here today is shit, the "range query of one" is completely stupid.

markharwood · 2015-04-17T11:37:21Z

@rmuir that's a one line change in my existing PR if you want to take that as the basis.
There's some Junit tests that check for the existance of TermQuery as outputs from the parser that would need adjusting

rmuir · 2015-04-17T11:40:03Z

Or we can commit your PR with a constantscorequery as a step and discuss the scoring as a followup? I just want these "terms" to be able to use skipdata when participating in a boolean :)

markharwood · 2015-04-17T11:45:13Z

Would we want to backport the constantscorequery change to other branches?

rmuir · 2015-04-17T11:50:06Z

Absolutely! its not breaking at all, unless you consider making queries faster breaking :)

markharwood · 2015-04-17T11:52:54Z

:) So that might be a good reason to separate this PR into 2 stages - removing range queries for the benefit of multiple branches and adding the new/changed scoring capabilities into master

rmuir · 2015-04-17T11:55:12Z

+1

markharwood · 2015-04-17T15:55:05Z

The guts of this change (moving from NumericRangeQuery to CSQ wrapped TermQuery) are now in #10648
I'll close this PR and open a rebased one once the above PR is landed and we have decided how to allow a plain TermQuery to be used to get scoring for numerics. Let's continue the discussion on how to achieve that over on the issue #10628

Query scoring - for single-value queries on numeric fields use a scor…

c2e7904

…ing TermQuery rather than a non-scoring NumericRangeQuery. Closes #10628

markharwood added >enhancement v2.0.0-beta1 :Search/Search Search-related issues that do not fall into other categories labels Apr 16, 2015

kevinkluge added the in progress label Apr 16, 2015

markharwood added the review label Apr 16, 2015

rjernst reviewed Apr 16, 2015
View reviewed changes

NumberFieldMapper termQuery impl is now final and creates a TermQuery…

ec150b9

… using the BytesRef returned by subclasses in their existing indexedValueForSearch() method. All subclass impls of termQuery method (which used to create NumericRangeQuerys) are now removed.

markharwood closed this Apr 17, 2015

kevinkluge removed in progress labels Apr 17, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Query scoring change for single-value queries on numeric fields #10631

Query scoring change for single-value queries on numeric fields #10631

markharwood commented Apr 16, 2015

mikemccand commented Apr 16, 2015

rmuir commented Apr 16, 2015

rmuir commented Apr 16, 2015

rmuir commented Apr 16, 2015

rmuir commented Apr 16, 2015

rjernst Apr 16, 2015

markharwood commented Apr 17, 2015

rmuir commented Apr 17, 2015

brwe commented Apr 17, 2015

rmuir commented Apr 17, 2015

markharwood commented Apr 17, 2015

rmuir commented Apr 17, 2015

markharwood commented Apr 17, 2015

rmuir commented Apr 17, 2015

markharwood commented Apr 17, 2015

rmuir commented Apr 17, 2015

markharwood commented Apr 17, 2015

Query scoring change for single-value queries on numeric fields #10631

Query scoring change for single-value queries on numeric fields #10631

Conversation

markharwood commented Apr 16, 2015

mikemccand commented Apr 16, 2015

rmuir commented Apr 16, 2015

rmuir commented Apr 16, 2015

rmuir commented Apr 16, 2015

rmuir commented Apr 16, 2015

rjernst Apr 16, 2015

Choose a reason for hiding this comment

markharwood commented Apr 17, 2015

rmuir commented Apr 17, 2015

brwe commented Apr 17, 2015

rmuir commented Apr 17, 2015

markharwood commented Apr 17, 2015

rmuir commented Apr 17, 2015

markharwood commented Apr 17, 2015

rmuir commented Apr 17, 2015

markharwood commented Apr 17, 2015

rmuir commented Apr 17, 2015

markharwood commented Apr 17, 2015