
Random score order changes on doc updates #6907

Closed
heyarny opened this issue Jul 17, 2014 · 16 comments
heyarny commented Jul 17, 2014

Not sure how the random score is calculated, but it seems like the docs' positions change on updates.
Our docs are continuously updated by cron jobs or user actions, so pagination is not really possible on randomly scored lists.

What exactly is elasticsearch using to generate the random score, besides the provided seed?
Why not just take the doc UID and the seed to calculate the score?

Btw. there was a post a while ago on your forums:
https://groups.google.com/forum/#!topic/elasticsearch/QOP3kSK5qR0


rjernst commented Jul 18, 2014

The order docs are scored in (and thus the order in which the RNG generates random scores) is based on the order the documents appear in Lucene segments. Merges will also change this by creating new segments, so even adding or deleting docs can eventually change the order of existing docs under the same seed.

I think we could fix this by adding another parameter, field, which the RNG can use per document to generate the score. Instead of relying on iteration order, the RNG would use that per-document value as the randomness (while still incorporating the seed, so a different order can be achieved without reindexing documents). If the field is numeric, it would use the value of the field; for anything else, it would use a hash of the byte values. Thoughts?
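The proposal above can be sketched in a few lines. This is a minimal Python illustration, not Elasticsearch code: `zlib.crc32` stands in for whatever hash the real implementation would use, and `random_score` is a hypothetical helper name. The point is that the score depends only on the per-document value and the seed, never on iteration order.

```python
import zlib

def random_score(doc_value: str, seed: int) -> float:
    """Deterministic 'random' score in [0.0, 1.0), derived only from the
    per-document value and the seed, never from segment order."""
    # crc32 is a stand-in hash; the seed is used as the starting value,
    # so a different seed yields a different ordering without reindexing.
    h = zlib.crc32(doc_value.encode("utf-8"), seed & 0xFFFFFFFF)
    return h / 2**32

# Same doc + same seed -> same score, regardless of iteration order.
docs = ["doc-1", "doc-2", "doc-3"]
scores_a = {d: random_score(d, seed=42) for d in docs}
scores_b = {d: random_score(d, seed=42) for d in reversed(docs)}
assert scores_a == scores_b
# A different seed produces a different (but again stable) assignment.
assert {d: random_score(d, seed=7) for d in docs} != scores_a
```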

@rjernst rjernst self-assigned this Jul 18, 2014

heyarny commented Jul 18, 2014

Yes, actually ANY solution other than the current one would make more sense, as long as the scores/positions stay the same per seed and per doc/field.

Is there any workaround one could use as of now? Preferably one which is robust and performant enough.

@clintongormley

@rjernst While you're working on this, I think that random_score needs a max parameter. Currently it gives a large number like 8207847.5. It is impossible to combine this with other function_score functions to just introduce a little randomness to the results.

For instance, imagine you are filtering on "features":

GET /_search
{
  "query": {
    "function_score": {
      "filter": { 
        "term": { "city": "Barcelona" }
      },
      "functions": [
        {
          "filter": { "term": { "features": "wifi" }}, 
          "boost_factor": 1
        },
        {
          "filter": { "term": { "features": "garden" }}, 
          "boost_factor": 1
        },
        {
          "filter": { "term": { "features": "pool" }}, 
          "boost_factor": 2 
        }
      ],
      "score_mode": "sum"
    }
  }
}

Documents can have a score of 1, 2, 3, 4 or 5. It would be nice to use randomization here to randomize all docs that have a score of 2. In other words, docs with an original score of 2 should end up with: 2.0 <= score <= 3.0. Currently, the result from random_score would completely swamp the other functions.

You could do this by specifying a max of 0.99:

GET /_search
{
  "query": {
    "function_score": {
      "filter": { 
        "term": { "city": "Barcelona" }
      },
      "functions": [
        {
          "filter": { "term": { "features": "wifi" }}, 
          "boost_factor": 1
        },
        {
          "filter": { "term": { "features": "garden" }}, 
          "boost_factor": 1
        },
        {
          "filter": { "term": { "features": "pool" }}, 
          "boost_factor": 2 
        },
        {
          "random_score": {
            "seed": "foo",
            "max":  0.99
          }
        }
      ],
      "score_mode": "sum"
    }
  }
}

@clintongormley

And while you're at it, it's probably worth supporting a min parameter as well...


rjernst commented Jul 21, 2014

I've done an initial implementation using _uid, and it is about 3x slower than the current random score. I discussed this with @rmuir at length, and after thinking about it more, it seems this will always be a problem, even when using something "consistent" from a document (whether _uid or a unique numeric field), since segments can always be moved around as adds/updates/deletes happen.

@heyarny Have you tried using random score with a scroll? This should allow a consistent view, given the same seed for each request (although it is possible there is an edge case; I'm working on adding a test).

@clintongormley I think the return values for random score are just wrong right now. Java's Random.nextFloat() returns a value in [0.0, 1.0). We should fix random score to do exactly this, and the user can scale it to their range however they want.
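Scaling a [0.0, 1.0) value into any other range is then trivial on the user's side. A minimal sketch in plain Python, with `random.Random` standing in for random_score and `scale` a hypothetical helper:

```python
import random

def scale(r: float, lo: float, hi: float) -> float:
    """Map a uniform value r in [0.0, 1.0) into [lo, hi)."""
    return lo + r * (hi - lo)

rng = random.Random(123)      # fixed seed, analogous to random_score's seed
r = rng.random()              # uniform in [0.0, 1.0)
bump = scale(r, 0.0, 0.99)    # e.g. the 0..0.99 tie-breaker range suggested above
assert 0.0 <= bump < 0.99
```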

@clintongormley

@rjernst Is the performance drop-off as bad for a numeric field instead of _uid? Also, is 3x slower significant? I mean, if it is 3x very fast, it's still fast.

Scrolling isn't really useful here: the user would have to pull, e.g., 10 pages of results and cache them somewhere. Scrolling doesn't let you click forward and then back (although the pages could be cached, I suppose). Also, keeping lots of scrolls open uses a lot of extra file handles.

Re the return value, 0.0 - 1.0 may be OK, but if you want a different range, it is very awkward to get there. Supporting it directly in random_score would be a whole lot cleaner.


rjernst commented Jul 31, 2014

@clintongormley You are right, it is still pretty fast: sorting 1 million random docs takes on average 13ms with _uid, instead of 4ms as it is now. Using a numeric field is about the same (5ms for 1 million docs).

I think the weight parameter in #6955 will work if you want a value outside of 0.0-1.0?

@clintongormley

@rjernst Great, and re the weight parameter: yes, that is a much better solution.


jpountz commented Aug 4, 2014

Big +1 on making this function return a score between 0 and 1.


stha commented Aug 15, 2014

@rjernst Is it possible that the explain output for random score also does not work as intended?

Example Query:

{
  "explain": true,
  "size": 1,
  "_source": [
    "age"
  ],
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "score_mode": "sum",
      "boost": 1,
      "functions": [
        {
          "random_score": {
            "seed": 123
          }
        }
      ]
    }
  }
}

Result:

{
   "took": 87,
   "timed_out": false,
   "_shards": {
      "total": 2,
      "successful": 2,
      "failed": 0
   },
   "hits": {
      "total": 1758039,
      "max_score": 16777215,
      "hits": [
         {
            "_shard": 1,
            "_node": "EC5O22oVR-aBhjrlgjESqg",
            "_index": "index_profile",
            "_type": "type_profile",
            "_id": "697639",
            "_score": 16777215,
            "_source": {
               "age": 27
            },
            "_explanation": {
               "value": 0,
               "description": "function score, product of:",
               "details": [
                  {
                     "value": 0,
                     "description": "Math.min of",
                     "details": [
                        {
                           "value": 0,
                           "description": "function score, score mode [sum]",
                           "details": [
                              {
                                 "value": 0,
                                 "description": "function score, product of:",
                                 "details": [
                                    {
                                       "value": 1,
                                       "description": "match filter: *:*"
                                    },
                                    {
                                       "value": 0,
                                       "description": "random score function (seed: -1443338702012022662)",
                                       "details": [
                                          {
                                             "value": 1,
                                             "description": "ConstantScore(*:*), product of:",
                                             "details": [
                                                {
                                                   "value": 1,
                                                   "description": "boost"
                                                },
                                                {
                                                   "value": 1,
                                                   "description": "queryNorm"
                                                }
                                             ]
                                          }
                                       ]
                                    }
                                 ]
                              }
                           ]
                        },
                        {
                           "value": 3.4028235e+38,
                           "description": "maxBoost"
                        }
                     ]
                  },
                  {
                     "value": 1,
                     "description": "queryBoost"
                  }
               ]
            }
         }
      ]
   }
}


s1monw commented Aug 19, 2014

I agree that what @rjernst proposes is a much better solution. Even if we lose some performance here, I think we can win it back by improving _uid handling in the future. +1 to fix this.


rjernst commented Aug 19, 2014

@stha What version of elasticsearch are you running? There are existing tests that check that explain correctly reports the original seed.


stha commented Aug 20, 2014

@rjernst It's ES 1.2.1.

Edit: I can reproduce the explain issue as well with ES 1.3.2.

rjernst added a commit to rjernst/elasticsearch that referenced this issue Aug 25, 2014
…urn values in range [0.0, 1.0]

RandomScoreFunction previously relied on the order the documents were
iterated in from Lucene. This caused changes in ordering, with the same
seed, if documents moved to different segments. With this change, a
murmur32 hash of the _uid for each document is used as the "random"
value. Also, the hash is adjusted so as to only return values between
0.0 and 1.0 to enable easier manipulation to fit into users' scoring
models.

closes elastic#6907
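The commit's approach can be illustrated with a standalone Python sketch of 32-bit MurmurHash3. This is a reimplementation for illustration only, not the Lucene/Elasticsearch code, and the exact seed-mixing details of the real implementation may differ; `random_score` is a hypothetical helper name.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Standard 32-bit MurmurHash3 (x86 variant)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data)
    # Body: process 4-byte little-endian blocks.
    for i in range(0, n - n % 4, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Tail: the remaining 1-3 bytes, if any.
    tail = data[n - n % 4:]
    k = 0
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization mix.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

def random_score(uid: str, seed: int) -> float:
    """Deterministic per-document score in [0.0, 1.0): it depends only on
    the _uid and the seed, so segment merges cannot reorder results."""
    return murmur3_32(uid.encode("utf-8"), seed & 0xFFFFFFFF) / 2**32
```

Because the hash is a pure function of (_uid, seed), the same seed always reproduces the same ordering, and changing the seed reshuffles it without reindexing.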

rjernst commented Aug 25, 2014

@stha I found the issue, and I've added a fix and test.

I have a PR open now to fix the consistency, original seed reporting, and range of values produced.
See #7446


stha commented Aug 26, 2014

@rjernst Thanks! Are you going to fix this in 1.2 and 1.3 too?


rjernst commented Aug 26, 2014

Since the PR changes existing behavior, it will only be added to 1.4.

rjernst added a commit that referenced this issue Aug 27, 2014
rjernst added a commit that referenced this issue Sep 8, 2014