
Random score order changes on doc updates #6907

Closed
heyarny opened this issue Jul 17, 2014 · 16 comments
heyarny commented Jul 17, 2014

Not sure how the random score is calculated, but it seems like the docs' positions change on updates.
Our docs are continuously updated by cron jobs or user actions, so pagination is not really possible on randomly scored lists.

What exactly is elasticsearch using to generate the random score, besides the provided seed?
Why not just take the doc UID and the seed to calculate the score?

Btw. there was a post a while ago on your forums:
https://groups.google.com/forum/#!topic/elasticsearch/QOP3kSK5qR0


rjernst commented Jul 18, 2014

The order docs are scored in (and thus the order in which the RNG generates random scores) is based on the order the documents appear in Lucene segments. Merges will also change this by creating new segments, so even adding or deleting docs can eventually change the order of existing docs under the same seed.

I think we could fix this by adding another parameter, field, which the RNG can use per document to generate the score. Instead of relying on iteration order, the RNG would use that per-document value as the randomness (while still incorporating the seed, so a different order can be achieved without reindexing documents). If the field is numeric, it would use the value of the field; for anything else, it would use a hash of the byte values. Thoughts?
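The proposal above can be sketched in a few lines. This is a minimal Python illustration, not Elasticsearch code: `zlib.crc32` stands in for whatever hash the real implementation would use, and `random_score` is a hypothetical helper name. The point is that the score depends only on the per-document value and the seed, never on iteration order.

```python
import zlib

def random_score(doc_value: str, seed: int) -> float:
    """Deterministic 'random' score in [0.0, 1.0), derived only from the
    per-document value and the seed, never from segment order."""
    # crc32 is a stand-in hash; the seed is used as the starting value,
    # so a different seed yields a different ordering without reindexing.
    h = zlib.crc32(doc_value.encode("utf-8"), seed & 0xFFFFFFFF)
    return h / 2**32

# Same doc + same seed -> same score, regardless of iteration order.
docs = ["doc-1", "doc-2", "doc-3"]
scores_a = {d: random_score(d, seed=42) for d in docs}
scores_b = {d: random_score(d, seed=42) for d in reversed(docs)}
assert scores_a == scores_b
# A different seed produces a different (but again stable) assignment.
assert {d: random_score(d, seed=7) for d in docs} != scores_a
```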

@rjernst rjernst self-assigned this Jul 18, 2014

heyarny commented Jul 18, 2014

Yes, actually ANY solution other than the current one would make more sense, as long as the scores/positions stay the same per seed and per doc/field.

Is there any workaround one could use as of now? Preferably one which is robust and performant enough.

@clintongormley

@rjernst While you're working on this, I think that random_score needs a max parameter. Currently it gives a large number like 8207847.5. It is impossible to combine this with other function_score functions to just introduce a little randomness to the results.

For instance, imagine you are filtering on "features":

GET /_search
{
  "query": {
    "function_score": {
      "filter": { 
        "term": { "city": "Barcelona" }
      },
      "functions": [
        {
          "filter": { "term": { "features": "wifi" }}, 
          "boost_factor": 1
        },
        {
          "filter": { "term": { "features": "garden" }}, 
          "boost_factor": 1
        },
        {
          "filter": { "term": { "features": "pool" }}, 
          "boost_factor": 2 
        }
      ],
      "score_mode": "sum"
    }
  }
}

Documents can have a score of 1, 2, 3, 4 or 5. It would be nice to use randomization here to randomize all docs that have a score of 2. In other words, docs with an original score of 2 should end up with: 2.0 <= score <= 3.0. Currently, the result from random_score would completely swamp the other functions.

You could do this by specifying a max of 0.99:

GET /_search
{
  "query": {
    "function_score": {
      "filter": { 
        "term": { "city": "Barcelona" }
      },
      "functions": [
        {
          "filter": { "term": { "features": "wifi" }}, 
          "boost_factor": 1
        },
        {
          "filter": { "term": { "features": "garden" }}, 
          "boost_factor": 1
        },
        {
          "filter": { "term": { "features": "pool" }}, 
          "boost_factor": 2 
        },
        {
          "random_score": {
            "seed": "foo",
            "max":  0.99
          }
        }
      ],
      "score_mode": "sum"
    }
  }
}

@clintongormley

And while you're at it, it's probably worth supporting a min parameter as well...


rjernst commented Jul 21, 2014

I've done an initial implementation using _uid, and it is about 3x slower than the current random score. I discussed this with @rmuir at length, and after thinking about it more, it seems this will always be a problem, even when using something "consistent" from a document (whether _uid or a unique numeric field), since segments can always be moved around as adds/updates/deletes happen.

@heyarny Have you tried using random score with a scroll? This should allow a consistent view, given the same seed for each request (although it is possible there is an edge case; I'm working on adding a test).

@clintongormley I think the return values for random score are just wrong right now. Java's Random.nextFloat() returns a value in [0.0, 1.0). We should fix random score to do exactly this, and the user can scale it to their range however they want.
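Scaling a [0.0, 1.0) value into any other range is then trivial on the user's side. A minimal sketch in plain Python, with `random.Random` standing in for random_score and `scale` a hypothetical helper:

```python
import random

def scale(r: float, lo: float, hi: float) -> float:
    """Map a uniform value r in [0.0, 1.0) into [lo, hi)."""
    return lo + r * (hi - lo)

rng = random.Random(123)      # fixed seed, analogous to random_score's seed
r = rng.random()              # uniform in [0.0, 1.0)
bump = scale(r, 0.0, 0.99)    # e.g. the 0..0.99 tie-breaker range suggested above
assert 0.0 <= bump < 0.99
```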

@clintongormley

@rjernst Is the performance drop-off as bad for a numeric field instead of _uid? Also, is 3x slower significant? I mean, if it is 3x very fast, it's still fast.

Scrolling isn't really useful here: the user would have to pull, e.g., 10 pages of results and cache them somewhere. Scrolling doesn't let you click forward and then back (although the pages could be cached, I suppose). Also, keeping lots of scrolls open uses a lot of extra file handles.

Re the return value, 0.0 - 1.0 may be OK, but if you want a different range, it is very awkward to get there. Supporting it directly in random_score would be a whole lot cleaner.


rjernst commented Jul 31, 2014

@clintongormley You are right, it is still pretty fast: sorting 1 million random docs takes on average 13ms with _uid, instead of 4ms as it is now. Using a numeric field is about the same (5ms for 1 million docs).

I think the weight parameter in #6955 will work if you want a value outside of 0.0-1.0?

@clintongormley

@rjernst Great, and re the weight parameter: yes, that is a much better solution.


jpountz commented Aug 4, 2014

Big +1 on making this function return a score between 0 and 1.


stha commented Aug 15, 2014

@rjernst Is it possible that the explain output for random score also does not work as intended?

Example Query:

{
  "explain": true,
  "size": 1,
  "_source": [
    "age"
  ],
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "score_mode": "sum",
      "boost": 1,
      "functions": [
        {
          "random_score": {
            "seed": 123
          }
        }
      ]
    }
  }
}

Result:

{
   "took": 87,
   "timed_out": false,
   "_shards": {
      "total": 2,
      "successful": 2,
      "failed": 0
   },
   "hits": {
      "total": 1758039,
      "max_score": 16777215,
      "hits": [
         {
            "_shard": 1,
            "_node": "EC5O22oVR-aBhjrlgjESqg",
            "_index": "index_profile",
            "_type": "type_profile",
            "_id": "697639",
            "_score": 16777215,
            "_source": {
               "age": 27
            },
            "_explanation": {
               "value": 0,
               "description": "function score, product of:",
               "details": [
                  {
                     "value": 0,
                     "description": "Math.min of",
                     "details": [
                        {
                           "value": 0,
                           "description": "function score, score mode [sum]",
                           "details": [
                              {
                                 "value": 0,
                                 "description": "function score, product of:",
                                 "details": [
                                    {
                                       "value": 1,
                                       "description": "match filter: *:*"
                                    },
                                    {
                                       "value": 0,
                                       "description": "random score function (seed: -1443338702012022662)",
                                       "details": [
                                          {
                                             "value": 1,
                                             "description": "ConstantScore(*:*), product of:",
                                             "details": [
                                                {
                                                   "value": 1,
                                                   "description": "boost"
                                                },
                                                {
                                                   "value": 1,
                                                   "description": "queryNorm"
                                                }
                                             ]
                                          }
                                       ]
                                    }
                                 ]
                              }
                           ]
                        },
                        {
                           "value": 3.4028235e+38,
                           "description": "maxBoost"
                        }
                     ]
                  },
                  {
                     "value": 1,
                     "description": "queryBoost"
                  }
               ]
            }
         }
      ]
   }
}


s1monw commented Aug 19, 2014

I agree that what @rjernst proposes is a much better solution. Even if we lose some performance here, I think we can win it back by improving _uid handling in the future. +1 to fix this.


rjernst commented Aug 19, 2014

@stha What version of elasticsearch are you running? There are existing tests that check that explain correctly reports the original seed.


stha commented Aug 20, 2014

@rjernst It's ES 1.2.1.

Edit: I can reproduce the explain issue as well with ES 1.3.2.

rjernst added a commit to rjernst/elasticsearch that referenced this issue Aug 25, 2014
…urn values in range [0.0, 1.0]

RandomScoreFunction previously relied on the order the documents were
iterated in from Lucene. This caused changes in ordering, with the same
seed, if documents moved to different segments. With this change, a
murmur32 hash of the _uid for each document is used as the "random"
value. Also, the hash is adjusted so as to only return values between
0.0 and 1.0 to enable easier manipulation to fit into users' scoring
models.

closes elastic#6907
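The commit's approach can be illustrated with a standalone Python sketch of 32-bit MurmurHash3. This is a reimplementation for illustration only, not the Lucene/Elasticsearch code, and the exact seed-mixing details of the real implementation may differ; `random_score` is a hypothetical helper name.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Standard 32-bit MurmurHash3 (x86 variant)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data)
    # Body: process 4-byte little-endian blocks.
    for i in range(0, n - n % 4, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Tail: the remaining 1-3 bytes, if any.
    tail = data[n - n % 4:]
    k = 0
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization mix.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

def random_score(uid: str, seed: int) -> float:
    """Deterministic per-document score in [0.0, 1.0): it depends only on
    the _uid and the seed, so segment merges cannot reorder results."""
    return murmur3_32(uid.encode("utf-8"), seed & 0xFFFFFFFF) / 2**32
```

Because the hash is a pure function of (_uid, seed), the same seed always reproduces the same ordering, and changing the seed reshuffles it without reindexing.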

rjernst commented Aug 25, 2014

@stha I found the issue, and I've added a fix and test.

I have a PR open now to fix the consistency, original seed reporting, and range of values produced.
See #7446


stha commented Aug 26, 2014

@rjernst Thanks! Are you going to fix this in 1.2 and 1.3 too?


rjernst commented Aug 26, 2014

Since the PR changes existing behavior, it will only be added to 1.4.

rjernst added a commit that referenced this issue Aug 27, 2014
rjernst added a commit that referenced this issue Sep 8, 2014