Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query scoring change for single-value queries on numeric fields #10631

Closed
wants to merge 2 commits into from
Closed

Query scoring change for single-value queries on numeric fields #10631

wants to merge 2 commits into from

Conversation

markharwood
Copy link
Contributor

Change Long and Integer fields to use a TermQuery rather than a non-scoring NumericRangeQuery when querying single values. Allows for relevance-ranked retrieval on non-text fields with example use case being recommendation engine finding users with similar tastes in movies where users' liked movie ids are held as arrays of numbers.

Closes #10628

…ing TermQuery rather than a non-scoring NumericRangeQuery.

Closes #10628
@markharwood markharwood added >enhancement v2.0.0-beta1 :Search/Search Search-related issues that do not fall into other categories labels Apr 16, 2015
@mikemccand
Copy link
Contributor

Maybe Lucene should have a NumericTermQuery?

@rmuir
Copy link
Contributor

rmuir commented Apr 16, 2015

For the legacy numeric fields? won't this be confusing for autoprefix? what would this query do? "Analyze" the input number? why not just make NumericTokenStream work as a regular analyzer. I can think of many better ways to do it.

@rmuir
Copy link
Contributor

rmuir commented Apr 16, 2015

But mark's patch is fine (my complaints are about numeric ranges in lucene). mark's patch changes this to use a single term instead of a one-term range.

@rmuir
Copy link
Contributor

rmuir commented Apr 16, 2015

looks good to me. This pull request may have performance benefits as well. I do not think a one-term numericrangequery is as efficient (it likely pulls all the bits for matching docs for that single term eagerly instead of lazily for intersection, etc).

@rmuir
Copy link
Contributor

rmuir commented Apr 16, 2015

I wonder if there is some reason Integer and Long are singled out. Should we do it for all numeric types/dates/ip/etc?

int iValue = parseValue(value);
return NumericRangeQuery.newIntRange(names.indexName(), precisionStep,
iValue, iValue, true, true);
return new TermQuery(new Term(names.indexName(), indexedValueForSearch(value)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move this up to NumberFieldMapper (replacing the range query there now) to get the benefit with the other types @rmuir mentions?

@markharwood
Copy link
Contributor Author

@clintongormley what do you think of the impact this change might have on existing queries?
For example, we steer people to use the "match" query because it is simple and "does the right thing" - this change will mean that those users who match on numeric fields will have changes to the ranking behaviour.
Ordinarily users might expect numeric queries to not rank at all and be accustomed to that. My use case for the recommendation scenario requires ranking and is perhaps an edge case that deserves special extensions to the DSL?

So is our default behaviour ranked-matching-on-everything or ranked-on-text-only?

… using the BytesRef returned by subclasses in their existing indexedValueForSearch() method.

All subclass impls of termQuery method (which used to create NumericRangeQuerys) are now removed.
@rmuir
Copy link
Contributor

rmuir commented Apr 17, 2015

IMO termquery does the right thing because it acts like a normal term that appears once in matching documents. Numerics already omit norms and frequencies so the score for numerics will be a constant for all documents: so with the default sim and querynorm, will users even care?

Anyway, if for some reason we want to change what this exact constant value is, we should wrap the termquery in ConstantScoreQuery (the user can do that too, always!), instead of using a less efficient RangeQuery...

@brwe
Copy link
Contributor

brwe commented Apr 17, 2015

What about idf? Using something like idf for scoring assumes that relevance of a term can actually be estimated by its doc freq and I thought this is mainly a good heuristic for fields containing natural language. But is it also true for most numeric fields? I worry that it is not so and that this change would break things for users. Anybody knows what the original motivation for using range queries for numerics?

@rmuir
Copy link
Contributor

rmuir commented Apr 17, 2015

This is crazy. I will make my own PR that changes this to a constantscorequery(termquery()) then and you guys can debate the finer aspects of scoring.

But the performance here today is shit, the "range query of one" is completely stupid.

@markharwood
Copy link
Contributor Author

@rmuir that's a one line change in my existing PR if you want to take that as the basis.
There's some Junit tests that check for the existance of TermQuery as outputs from the parser that would need adjusting

@rmuir
Copy link
Contributor

rmuir commented Apr 17, 2015

Or we can commit your PR with a constantscorequery as a step and discuss the scoring as a followup? I just want these "terms" to be able to use skipdata when participating in a boolean :)

@markharwood
Copy link
Contributor Author

Would we want to backport the constantscorequery change to other branches?

@rmuir
Copy link
Contributor

rmuir commented Apr 17, 2015

Absolutely! its not breaking at all, unless you consider making queries faster breaking :)

@markharwood
Copy link
Contributor Author

:) So that might be a good reason to separate this PR into 2 stages - removing range queries for the benefit of multiple branches and adding the new/changed scoring capabilities into master

@rmuir
Copy link
Contributor

rmuir commented Apr 17, 2015

+1

@markharwood
Copy link
Contributor Author

The guts of this change (moving from NumericRangeQuery to CSQ wrapped TermQuery) are now in #10648
I'll close this PR and open a rebased one once the above PR is landed and we have decided how to allow a plain TermQuery to be used to get scoring for numerics. Let's continue the discussion on how to achieve that over on the issue #10628

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Numeric fields can't make use of Lucene ranking behaviours
6 participants