Random score order changes on doc updates #6907
The order docs are scored in (and thus the order in which the RNG generates random scores) is based on the order the documents appear in Lucene segments. Merges will also change this by creating new segments. So even adding or deleting docs can eventually cause the order of existing docs to change with the same seed. I think we could fix this by having another parameter, |
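The position-dependence described above can be sketched as follows. This is a hypothetical stand-in for the current behavior, not Elasticsearch code: `order_based_scores` and the doc IDs are invented for illustration.

```python
import random

def order_based_scores(doc_ids, seed):
    # Draw one random score per document in iteration order, as the
    # current implementation effectively does. Each score is tied to a
    # document's *position* in the segment, not its identity.
    rng = random.Random(seed)
    return {doc: rng.random() for doc in doc_ids}

before = order_based_scores(["d1", "d2", "d3"], seed=42)
# After a merge rewrites segments, the same docs may be iterated in a
# different order; with the same seed, "d1" now receives a different score.
after = order_based_scores(["d3", "d1", "d2"], seed=42)
```

Same seed, same documents, but the scores follow the positions, so any reordering reshuffles results.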
Yes, actually ANY other solution than the current one would make more sense, as long as the scores/positions stay the same per seed and per doc/field. Is there any workaround one could use as of now? Preferably one that is robust and performant enough. |
@rjernst While you're working on this, I think that For instance, imagine you are filtering on "features":
Documents can have a score of 1, 2, 3, 4 or 5. It would be nice to use randomization here to randomize all docs that have a score of 2. In other words, docs with an original score of 2 should end up with: You could do this by specifying a
|
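The tiered randomization idea above can be sketched as follows. This is a hypothetical illustration, not the proposed API: `tie_broken_score`, the MD5 stand-in hash, and the doc IDs are all invented for the example.

```python
import hashlib

def tie_broken_score(base_score, doc_id, seed):
    # Deterministic per-doc fraction in [0, 1), derived from (seed, id).
    # MD5 stands in here for whatever hash the engine would use.
    h = hashlib.md5(f"{seed}:{doc_id}".encode()).digest()
    frac = int.from_bytes(h[:4], "big") / 2**32
    # Adding a fraction < 1 shuffles ties without crossing score tiers.
    return base_score + frac

docs = [("a", 2), ("b", 2), ("c", 3), ("d", 2)]
ranked = sorted(docs, key=lambda d: tie_broken_score(d[1], d[0], seed=123),
                reverse=True)
# "c" (score 3) always ranks first; the score-2 docs are shuffled
# deterministically for a given seed.
```

Because the fraction is strictly less than 1, a doc with original score 2 can never outrank a doc with score 3, which matches the constraint described above.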
And while you're at it, probably worth supporting a |
I've done an initial implementation using _uid, and it is about 3x slower than the current random score. I discussed this with @rmuir a bunch, and after thinking about it more, it seems this will always be a problem, even when using something "consistent" from a document (whether _uid or a unique numeric field), since segments can always be moved around as adds/updates/deletes happen. @heyarny Have you tried using random score with a scroll? This should allow a consistent view, given the same seed for each request (although it is possible there is an edge case; I'm working on adding a test). @clintongormley I think the return values for random score are just wrong right now. Java's Random.nextFloat() returns a value in [0.0, 1.0). We should fix random score to do exactly this, and the user can scale it to their range however they want. |
@rjernst Is the performance drop-off as bad for a numeric field instead of Scrolling isn't really useful here - the user would have to pull, e.g., 10 pages of results and cache them somewhere. Scrolling doesn't let you click forward and then back (although the pages could be cached, I suppose). Also, keeping lots of scrolls open uses a lot of extra file handles. Re the return value, 0.0 - 1.0 may be OK, but if you want to change that range, it is very difficult to do so. Supporting it directly in |
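Assuming scores in [0.0, 1.0), rescaling to a custom range on the client (or in a script) is a one-liner; `scale` is a hypothetical helper for illustration, not part of any API:

```python
def scale(r, lo, hi):
    # Map a random value r in [0.0, 1.0) into [lo, hi).
    # r would be the value produced by random_score.
    return lo + r * (hi - lo)

# e.g. map a 0-1 random score into a 1-5 range
mid = scale(0.5, 1.0, 5.0)  # -> 3.0
```

Whether this belongs client-side or as min/max parameters on the function itself is exactly the trade-off being discussed here.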
@clintongormley You are right. I think it is still pretty fast? Sorting 1 million random docs takes on average 13ms with _uid, instead of 4ms as it is now. Using a numeric field is about the same (5ms for 1 million docs). I think the |
@rjernst great, and re the |
Big +1 on making this function return a score between 0 and 1. |
@rjernst Is it possible that the explain on random score also does not work as intended? Example query:

{
"explain": true,
"size": 1,
"_source": [
"age"
],
"query": {
"function_score": {
"boost_mode": "replace",
"score_mode": "sum",
"boost": 1,
"functions": [
{
"random_score": {
"seed": 123
}
}
]
}
}
}

Result:

{
"took": 87,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"hits": {
"total": 1758039,
"max_score": 16777215,
"hits": [
{
"_shard": 1,
"_node": "EC5O22oVR-aBhjrlgjESqg",
"_index": "index_profile",
"_type": "type_profile",
"_id": "697639",
"_score": 16777215,
"_source": {
"age": 27
},
"_explanation": {
"value": 0,
"description": "function score, product of:",
"details": [
{
"value": 0,
"description": "Math.min of",
"details": [
{
"value": 0,
"description": "function score, score mode [sum]",
"details": [
{
"value": 0,
"description": "function score, product of:",
"details": [
{
"value": 1,
"description": "match filter: *:*"
},
{
"value": 0,
"description": "random score function (seed: -1443338702012022662)",
"details": [
{
"value": 1,
"description": "ConstantScore(*:*), product of:",
"details": [
{
"value": 1,
"description": "boost"
},
{
"value": 1,
"description": "queryNorm"
}
]
}
]
}
]
}
]
},
{
"value": 3.4028235e+38,
"description": "maxBoost"
}
]
},
{
"value": 1,
"description": "queryBoost"
}
]
}
}
]
}
} |
I agree what @rjernst proposes is a much better solution. Even if we lose some perf here, I think we can bring it back by improving _uid in the future. +1 to fix this |
@stha What version of elasticsearch are you running on? There are existing tests to check explain correctly contains the original seed. |
@rjernst It's ES Edit: I can reproduce the |
…urn values in range [0.0, 1.0] RandomScoreFunction previously relied on the order the documents were iterated in from Lucene. This caused changes in ordering, with the same seed, if documents moved to different segments. With this change, a murmur32 hash of the _uid for each document is used as the "random" value. Also, the hash is adjusted so as to only return values between 0.0 and 1.0 to enable easier manipulation to fit into users' scoring models. closes elastic#6907
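The approach in the commit above can be sketched roughly like this. This is a hedged illustration, not the actual implementation: `random_score` is a hypothetical function, and `zlib.crc32` stands in for the murmur32 hash the real change uses.

```python
import zlib

def random_score(uid, seed):
    # Hash the document's _uid together with the seed, then normalize
    # the 32-bit result into [0.0, 1.0). The score depends only on
    # (_uid, seed), not on segment order.
    h = zlib.crc32(uid.encode(), seed & 0xFFFFFFFF)
    return (h & 0xFFFFFFFF) / 2**32

# Merges and re-orderings no longer change a doc's score:
s1 = random_score("type_profile#697639", seed=123)
s2 = random_score("type_profile#697639", seed=123)
# s1 == s2, and 0.0 <= s1 < 1.0
```

Since the hash is a pure function of the _uid and the seed, the same doc always gets the same score for a given seed, which is what makes pagination over a randomly scored list stable.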
@rjernst Thanks! Are you going to fix this in |
Since the PR changes existing behavior, it will only be added to 1.4. |
…urn values in range [0.0, 1.0] RandomScoreFunction previously relied on the order the documents were iterated in from Lucene. This caused changes in ordering, with the same seed, if documents moved to different segments. With this change, a murmur32 hash of the _uid for each document is used as the "random" value. Also, the hash is adjusted so as to only return values between 0.0 and 1.0 to enable easier manipulation to fit into users' scoring models. closes #6907, #7446
Not sure how the random score is calculated, but it seems like doc positions change on updates.
Our docs are continuously updated by cron jobs or user actions, so pagination is not really possible on randomly scored lists.
What exactly is Elasticsearch using to generate the random score besides the provided seed?
Why not just take the doc UID and the seed to calculate the score?
Btw. there was a post a while ago on your forums:
https://groups.google.com/forum/#!topic/elasticsearch/QOP3kSK5qR0