Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Function Score: Refactor RandomScoreFunction to be consistent, and return values in range [0.0, 1.0] #7446

Closed
wants to merge 5 commits into from

Conversation

rjernst
Copy link
Member

@rjernst rjernst commented Aug 25, 2014

RandomScoreFunction previously relied on the order the documents were
iterated in from Lucene. This caused changes in ordering, with the same
seed, if documents moved to different segments. With this change, a
murmur32 hash of the _uid for each document is used as the "random"
value. Also, the hash is adjusted so as to only return values between
0.0 and 1.0 to enable easier manipulation to fit into users' scoring
models.

closes #6907

…urn values in rang [0.0, 1.0]

RandomScoreFunction previously relied on the order the documents were
iterated in from Lucene. This caused changes in ordering, with the same
seed, if documents moved to different segments. With this change, a
murmur32 hash of the _uid for each document is used as the "random"
value. Also, the hash is adjusted so as to only return values between
0.0 and 1.0 to enable easier manipulation to fit into users' scoring
models.

closes elastic#6907
@@ -160,7 +160,7 @@

# Path to log files:
#
#path.logs: /path/to/logs
path.logs: logs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh! Yes, will remove.

@rjernst
Copy link
Member Author

rjernst commented Aug 25, 2014

Sorry about that @jpountz I have removed those extraneous changes.

@@ -28,7 +28,7 @@
*/
public class RandomScoreFunctionBuilder implements ScoreFunctionBuilder {

private Long seed = null;
private Integer seed = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a particular reason (which I fail to see obviously :-) why this is an object and not a regular int?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is to be able to do if (seed != null) in the toXContent method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, I rest my case.. not sure if the noargs constructor makes sense, if that one was gone, the seed would never be empty

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just went with what was there. I think having the option to not supply the seed (ie you don'g care about reproducing, you just want some randomness) is a good option to keep.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this was changed from a long to an int? In 1.4 I can no longer use this function because the seed I was using matches data that is only 64 bits.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harmsk This was due to using a 32 bit hash, however it was fixed so longs (as well as strings) work later in #8311

@jpountz
Copy link
Contributor

jpountz commented Aug 26, 2014

The diff looks good to me but I'm wondering that we should enable doc values by default on _uid before merging such a change?

@jpountz jpountz removed the review label Aug 26, 2014
@rjernst
Copy link
Member Author

rjernst commented Aug 26, 2014

@jpountz That's the assumption that I was waiting on this fix for all this time. However, this still brings improvements that I believe are important to get into master (2 bug fixes and an simplification of return values). The caveat of course is the cost of pulling _uid into field data, but I think that is just something to be improved upon in the future, rather than continuing with the current broken behavior.

@jpountz
Copy link
Contributor

jpountz commented Aug 26, 2014

Then let's add some documentation to make clear that this feature relies on fielddata of the _uid field?

@rjernst
Copy link
Member Author

rjernst commented Aug 27, 2014

@jpountz I modified the docs for random_score a little to mention uid. Let me know if that is what you had in mind, or if you wanted something more.

@rjernst rjernst added the review label Aug 27, 2014
@jpountz
Copy link
Contributor

jpountz commented Aug 27, 2014

This is what I had in mind. Maybe also put a warning that it will load field data for the _uid field (which can be very memory-intensive given that all values are unique). Otherwise LGTM (feel free to push without asking for further review from my end).

@jpountz jpountz removed the review label Aug 27, 2014
rjernst added a commit that referenced this pull request Aug 27, 2014
…urn values in rang [0.0, 1.0]

RandomScoreFunction previously relied on the order the documents were
iterated in from Lucene. This caused changes in ordering, with the same
seed, if documents moved to different segments. With this change, a
murmur32 hash of the _uid for each document is used as the "random"
value. Also, the hash is adjusted so as to only return values between
0.0 and 1.0 to enable easier manipulation to fit into users' scoring
models.

closes #6907, #7446
rjernst added a commit that referenced this pull request Aug 27, 2014
…urn values in rang [0.0, 1.0]

RandomScoreFunction previously relied on the order the documents were
iterated in from Lucene. This caused changes in ordering, with the same
seed, if documents moved to different segments. With this change, a
murmur32 hash of the _uid for each document is used as the "random"
value. Also, the hash is adjusted so as to only return values between
0.0 and 1.0 to enable easier manipulation to fit into users' scoring
models.

closes #6907, #7446
@rjernst rjernst closed this Aug 27, 2014
rjernst added a commit that referenced this pull request Sep 8, 2014
…urn values in rang [0.0, 1.0]

RandomScoreFunction previously relied on the order the documents were
iterated in from Lucene. This caused changes in ordering, with the same
seed, if documents moved to different segments. With this change, a
murmur32 hash of the _uid for each document is used as the "random"
value. Also, the hash is adjusted so as to only return values between
0.0 and 1.0 to enable easier manipulation to fit into users' scoring
models.

closes #6907, #7446
@clintongormley clintongormley changed the title FunctionScore: Refactor RandomScoreFunction to be consistent, and return values in range [0.0, 1.0] Function Score: Refactor RandomScoreFunction to be consistent, and return values in range [0.0, 1.0] Sep 11, 2014
@rjernst rjernst deleted the pr/6907 branch January 21, 2015 23:22
@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Query DSL labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>breaking >bug :Search/Search Search-related issues that do not fall into other categories v1.4.0.Beta1 v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Random score order changes on doc updates
5 participants