Switch to murmurhash3 to route documents to shards. #7954
Conversation
// (as there could be with DJB hash with 33 shards and incremental numeric ids)

// nocommit: should we force the use of the type? it is tempting to avoid some worst-cases in the lots-of-types
// case but it currently breaks parent/child given that in that case only the parent id is passed as a routing key
Perhaps in the case of parent/child, for child docs we could use the type of the direct parent?
I'd just use the Id really
ok, I'll remove the nocommit
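To make the decision above concrete: with the nocommit removed, only the id (or an explicit routing value, when one is given) feeds the hash. A minimal sketch of that shape, with illustrative names and a placeholder hash (`String.hashCode` stands in for the routing hash; it is not the actual Elasticsearch code):

```java
import java.util.Objects;

// Illustrative sketch only: the effective routing value is the explicit
// routing parameter when present, otherwise the document id, and the shard
// is the hash of that value modulo the shard count.
public class RoutingSketch {
    static int shardId(String id, String routing, int numberOfShards) {
        String effectiveRouting = (routing != null) ? routing : Objects.requireNonNull(id);
        // Placeholder hash; the PR swaps djb2 for murmur3 at this point.
        int hash = effectiveRouting.hashCode();
        return Math.floorMod(hash, numberOfShards);
    }

    public static void main(String[] args) {
        System.out.println(shardId("doc-42", null, 5));      // routed by id
        System.out.println(shardId("doc-42", "user-7", 5));  // explicit routing wins
    }
}
```

Note that when an explicit routing value is supplied, the id no longer influences shard placement, which is why parent/child works: children are routed by the parent id.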
left some comments
- do:
    index:
      index: test_1
      type:  test
      id:    2
this will make our tests go wild for REST compat, so be prepared
hmm, maybe not if we go with a 2.x-only change, since REST tests only run against the current major version; and if not, we should fix it.
I was thinking about this a bit more and I don't think we can do this safely in
@s1monw Good call, this would indeed fail. I will make this change 2.0-only.
I pushed a new commit that makes this change only happen as of 2.0 and makes the hash function an index setting for indices that have been created before 2.0. For newer indices the hash function is hard-coded to Murmur3.
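The backward-compatibility rule described here can be sketched in a few lines. This is an illustration only, with made-up names (`HashFunction`, `hashFunctionFor`); the real code reads the index's creation version from its settings:

```java
// Illustrative sketch only: indices created before 2.0 keep honouring their
// (possibly customized) legacy hash setting so existing documents still route
// to the same shards; indices created on 2.0+ are hard-coded to murmur3.
public class HashChoiceSketch {
    enum HashFunction { DJB2, MURMUR3 }

    static HashFunction hashFunctionFor(boolean createdBefore20, HashFunction configured) {
        if (createdBefore20) {
            // Pre-2.0 index: preserve the configured hash for compatibility.
            return configured;
        }
        // 2.0+ index: no choice, always murmur3.
        return HashFunction.MURMUR3;
    }

    public static void main(String[] args) {
        System.out.println(hashFunctionFor(true, HashFunction.DJB2));   // legacy setting wins
        System.out.println(hashFunctionFor(false, HashFunction.DJB2));  // forced to MURMUR3
    }
}
```

The key design point is that changing the routing hash of an existing index would silently orphan every already-indexed document, so the choice has to be frozen at index-creation time.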
Force-pushed from 7252d46 to 422cc46
I added a note about this change to the migration guide.
-    private int shardId(ClusterState clusterState, String index, String type, @Nullable String id, @Nullable String routing) {
+    private int shardId(ClusterState clusterState, String index, String type, String id, @Nullable String routing) {
         final IndexMetaData indexMetaData = indexMetaData(clusterState, index);
         final Version createdVersion = Version.indexCreated(indexMetaData.getSettings());
I think we should make the version a real part of the IndexMetaData and add IndexMetaData#getVersionCreated()?
- index creation version as a first-class citizen of index metadata
- 32-bits murmur3 instead of 128 (we only took 32 bits of it anyway)
- more tests
Force-pushed from 422cc46 to 38a384d
@s1monw I pushed a new iteration. There is some noise in the tests because making the creation version a first-class citizen broke some tests that didn't set the version. I also fixed more tests that relied on implementation details of the routing function.
@@ -20,8 +20,6 @@
 package org.elasticsearch.cluster.routing.operation;
I wonder if we should drop this module entirely and make it hardcoded wherever it's used? Or at least not changeable? I don't think anybody should change this module or have their own operation routing.
@kimchy what do you think?
Should it be delayed to another PR? I am slightly concerned that this PR has already been awaiting merge for a long time.
 * Elasticsearch 2.0 deprecated custom routing hash functions. So what we do here is that for old indices, we
 * move this old & deprecated node setting to an index setting so that we can keep things backward compatible.
 */
private void pre20Upgrade() throws Exception {
on a different note - I wonder if we want to port a slightly modified version of this to 1.x so that folks coming from 1.5 can remove their custom hash before the 2.0 upgrade. I also think we should deprecate the setting in 1.x already.
I will create a new PR for 1.x once this one is in?
left some minor comments, otherwise LGTM
@jpountz I tried this branch to see if I could get better bulk distribution to battle an ingestion scaling issue. It helped a little, but not as much as I was looking for. In the course of testing, 30 shards on 30 nodes worked great. I tried 31, 33, and 37 shards, and when I did, after a bit (like 6gb later, not immediately) the index would go red with a single shard unassigned. The node that was assigned that shard had this in its log:
There's nothing wrong with the directory on disk: it has files and responds to filesystem commands. I'm not sure if the branch is still a work in progress, but this surprised me a bit. Just posting it here in case it's really a bug. Thanks!
@drewr this is the exception you would get if you ran out of file descriptors; could you check your system limits?
Interesting. I had never seen that particular symptom of too many open files... I was in fact upping the limit, but I found a bug with the way Ubuntu handles
Given that this PR has been out for a long time, I plan to merge it soon (if there are no objections). Then we can have another issue/PR if we think that PlainOperationRouting should be hardcoded.
We currently use the djb2 hash function in order to compute the shard a
document should go to. Unfortunately this hash function is not very
sophisticated and you can sometimes hit adversarial cases, such as numeric ids
on 33 shards.
Murmur3 generates hashes with a better distribution, which should avoid the
adversarial cases.
Here are some examples of how 100000 incremental ids are distributed to shards
using either djb2 or murmur3.
5 shards:
Murmur3: [19933, 19964, 19940, 20030, 20133]
DJB: [20000, 20000, 20000, 20000, 20000]
3 shards:
Murmur3: [33185, 33347, 33468]
DJB: [30100, 30000, 39900]
33 shards:
Murmur3: [2999, 3096, 2930, 2986, 3070, 3093, 3023, 3052, 3112, 2940, 3036, 2985, 3031, 3048, 3127, 2961, 2901, 3105, 3041, 3130, 3013, 3035, 3031, 3019, 3008, 3022, 3111, 3086, 3016, 2996, 3075, 2945, 2977]
DJB: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 900, 900, 900, 900, 1000, 1000, 10000, 10000, 10000, 10000, 9100, 9100, 9100, 9100, 9000, 9000, 0, 0, 0, 0, 0, 0]
Even if djb2 looks ideal in some cases (5 shards), the fact that the
distribution of its hashes has some patterns can raise issues with some shard
counts (e.g. 3, or even worse, 33).
Some tests have been modified because they relied on implementation details of
the routing hash function.
This change only affects indices that are created on or after elasticsearch 2.0.
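The 33-shard pathology described above is easy to reproduce. djb2 multiplies the running hash by 33 at each step, so modulo 33 every term except the last character vanishes, and incremental numeric ids (whose last character is a digit) can only ever land on 10 of the 33 shards. The sketch below keeps the hash in a long for clarity (the real DjbHashFunction truncates to an int, which blurs the pattern slightly, as the less regular DJB numbers above show), so it is an illustration of the underlying effect rather than the exact Elasticsearch code:

```java
import java.util.Arrays;

// Reproduce the djb2 routing pathology on 33 shards: because djb2's
// multiplier is 33, hash % 33 depends only on the id's last character,
// so ids ending in a digit ('0'..'9') hit only 10 distinct shards.
public class Djb2ShardDemo {
    static long djb2(String s) {
        long hash = 5381;
        for (int i = 0; i < s.length(); i++) {
            hash = hash * 33 + s.charAt(i); // the classic djb2 recurrence
        }
        return hash;
    }

    public static void main(String[] args) {
        final int shards = 33;
        int[] counts = new int[shards];
        for (int id = 0; id < 100000; id++) {
            // hash is positive here (5-digit ids fit comfortably in a long)
            counts[(int) (djb2(Integer.toString(id)) % shards)]++;
        }
        System.out.println(Arrays.toString(counts));
        long empty = Arrays.stream(counts).filter(c -> c == 0).count();
        System.out.println(empty + " of " + shards + " shards received no documents");
        // prints: 23 of 33 shards received no documents
    }
}
```

With 100000 incremental ids, shards 15 through 24 (the residues of '0'..'9' mod 33) each receive exactly 10000 documents and the other 23 shards receive none, which is exactly the kind of pattern murmur3 avoids.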