New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Query scoring change for single-value queries on numeric fields #10631
Conversation
…ing TermQuery rather than a non-scoring NumericRangeQuery. Closes #10628
Maybe Lucene should have a NumericTermQuery? |
For the legacy numeric fields? won't this be confusing for autoprefix? what would this query do? "Analyze" the input number? why not just make NumericTokenStream work as a regular analyzer. I can think of many better ways to do it. |
But mark's patch is fine (my complaints are about numeric ranges in lucene). mark's patch changes this to use a single term instead of a one-term range. |
looks good to me. This pull request may have performance benefits as well. I do not think a one-term numericrangequery is as efficient (it likely pulls all the bits for matching docs for that single term eagerly instead of lazily for intersection, etc). |
I wonder if there is some reason Integer and Long are singled out. Should we do it for all numeric types/dates/ip/etc? |
int iValue = parseValue(value); | ||
return NumericRangeQuery.newIntRange(names.indexName(), precisionStep, | ||
iValue, iValue, true, true); | ||
return new TermQuery(new Term(names.indexName(), indexedValueForSearch(value))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we move this up to NumberFieldMapper
(replacing the range query there now) to get the benefit with the other types @rmuir mentions?
@clintongormley what do you think of the impact this change might have on existing queries? So is our default behaviour ranked-matching-on-everything or ranked-on-text-only? |
… using the BytesRef returned by subclasses in their existing indexedValueForSearch() method. All subclass impls of termQuery method (which used to create NumericRangeQuerys) are now removed.
IMO termquery does the right thing because it acts like a normal term that appears once in matching documents. Numerics already omit norms and frequencies so the score for numerics will be a constant for all documents: so with the default sim and querynorm, will users even care? Anyway, if for some reason we want to change what this exact constant value is, we should wrap the termquery in ConstantScoreQuery (the user can do that too, always!), instead of using a less efficient RangeQuery... |
What about idf? Using something like idf for scoring assumes that relevance of a term can actually be estimated by its doc freq and I thought this is mainly a good heuristic for fields containing natural language. But is it also true for most numeric fields? I worry that it is not so and that this change would break things for users. Anybody knows what the original motivation for using range queries for numerics? |
This is crazy. I will make my own PR that changes this to a constantscorequery(termquery()) then and you guys can debate the finer aspects of scoring. But the performance here today is shit, the "range query of one" is completely stupid. |
@rmuir that's a one line change in my existing PR if you want to take that as the basis. |
Or we can commit your PR with a constantscorequery as a step and discuss the scoring as a followup? I just want these "terms" to be able to use skipdata when participating in a boolean :) |
Would we want to backport the constantscorequery change to other branches? |
Absolutely! its not breaking at all, unless you consider making queries faster breaking :) |
:) So that might be a good reason to separate this PR into 2 stages - removing range queries for the benefit of multiple branches and adding the new/changed scoring capabilities into master |
+1 |
The guts of this change (moving from NumericRangeQuery to CSQ wrapped TermQuery) are now in #10648 |
Change Long and Integer fields to use a TermQuery rather than a non-scoring NumericRangeQuery when querying single values. Allows for relevance-ranked retrieval on non-text fields with example use case being recommendation engine finding users with similar tastes in movies where users' liked movie ids are held as arrays of numbers.
Closes #10628