Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing questionable PNRG behavior #5613

Closed
wants to merge 1 commit into from

Conversation

pickypg
Copy link
Member

@pickypg pickypg commented Mar 31, 2014

Closes #5454 and #5578

This strengthens and simplifies the PNRG used by random_score by more closely mirroring the Random.nextFloat() method, rather than a mix of that, nextInt and nextDouble. The docBase or docId are no longer used as they were biasing the result (particularly if it was 0, which consistently made it the highest scoring result in tests), which partially defeats the purpose of random scoring.

@uboness
Copy link
Contributor

uboness commented Apr 1, 2014

well, the reason why we added the doc id is to have better support for consistent pagination, so it guarantees consistent random number generation on a consistent point of view over the indexes.

@pickypg
Copy link
Member Author

pickypg commented Apr 1, 2014

But, if you are allowing multiple indexes to invoke it, then isn't it going
to inherently be inconsistent because you will possibly not pick up where
you left off, thereby restarting further down the seed?

Personally, I think that having only the seed gives it a more consistent
flow, and the introducing the docBase + docId just adds the potential for
a strong bias in the case of 0, which must be substituted somehow, and
incidental bias thereafter, generally increasing as the sum grows as it
increases the likelyhood that the nextFloat's former xor operation (rand ^ seed) will have a much larger or much smaller number given a seed with
similar bits. I definitely noticed this behavior while adding the unit
tests, which led me to my current "fix".

On Tue, Apr 1, 2014 at 6:51 AM, uboness notifications@github.com wrote:

well, the reason why we added the doc id is to have better support for
consistent pagination, so it guarantees consistent random number generation
on a consistent point of view over the indexes.

Reply to this email directly or view it on GitHubhttps://github.com//pull/5613#issuecomment-39192401
.

@pickypg
Copy link
Member Author

pickypg commented Apr 1, 2014

Thinking it through some more, the docId does provide something if two indexes invoke it at the same time since there is no synchronization of the seed update (thus returning the same value for them both without it). However, as long as they are separate invocations, then it's still a random result to both (ironically even if it repeats for one of them) without it, and I am still not convinced that the bias of the docId is superior.

Of course, I am not an expert on random number theory, so take that with a grain of salt.

@uboness
Copy link
Contributor

uboness commented Apr 1, 2014

yes, the idea initially was the be consistent with pagination across indices (where the starts at 0 for each index). But thinking about it more, indeed there are too many "if"s here... we can remove the id and stick to the normal PRNG implementation, and who ever wants consistent pagination can just use seed+scroll.

btw, the seed is already biased here as the original seed that the user sends (or the default current time millis) is merged with the shard id (which is a must so each shard will generate different random number per doc)

I'll work on your PR, thx!

@pickypg
Copy link
Member Author

pickypg commented Apr 1, 2014

Works for me. The shard id modification of the seed makes perfect sense because otherwise there will be "unexpected" collisions within the merged results.

@s1monw
Copy link
Contributor

s1monw commented Apr 2, 2014

@uboness I labeled this issue and assigned it to you - please feel free to re-label

@uboness
Copy link
Contributor

uboness commented Apr 2, 2014

thx @s1monw

@jpountz jpountz removed the v1.1.1 label Apr 14, 2014
@pickypg pickypg closed this Apr 28, 2014
@pickypg pickypg deleted the feature/prng-no-doc-id-5578 branch April 28, 2014 17:46
@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Query DSL labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories v1.2.0 v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Illogical xor operation in RandomScoreFunction
5 participants