
Switch to murmurhash3 to route documents to shards. #7954

Closed
wants to merge 4 commits

Conversation

jpountz (Contributor) commented Oct 2, 2014

We currently use the djb2 hash function to compute the shard a document should
go to. Unfortunately, this hash function is not very sophisticated, and you can
sometimes hit adversarial cases, such as incremental numeric ids on 33 shards.

Murmur3 generates hashes with a better distribution, which should avoid the
adversarial cases.

Here are some examples of how 100000 incremental ids are distributed to shards
using either djb2 or murmur3.

5 shards:
Murmur3: [19933, 19964, 19940, 20030, 20133]
DJB:     [20000, 20000, 20000, 20000, 20000]

3 shards:
Murmur3: [33185, 33347, 33468]
DJB:     [30100, 30000, 39900]

33 shards:
Murmur3: [2999, 3096, 2930, 2986, 3070, 3093, 3023, 3052, 3112, 2940, 3036, 2985, 3031, 3048, 3127, 2961, 2901, 3105, 3041, 3130, 3013, 3035, 3031, 3019, 3008, 3022, 3111, 3086, 3016, 2996, 3075, 2945, 2977]
DJB:     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 900, 900, 900, 900, 1000, 1000, 10000, 10000, 10000, 10000, 9100, 9100, 9100, 9100, 9000, 9000, 0, 0, 0, 0, 0, 0]

Even if djb2 looks ideal in some cases (5 shards), the fact that the
distribution of its hashes follows patterns can raise issues with some shard
counts (e.g. 3, or even worse 33).
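The skew is easy to reproduce outside Elasticsearch. Below is a self-contained sketch that assigns 100000 incremental string ids to 33 shards using plain djb2 and the public MurmurHash3 x86 32-bit reference algorithm over the ids' ASCII bytes. This is illustrative code, not Elasticsearch's actual DjbHashFunction/Murmur3HashFunction classes:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.ToIntFunction;

public class RoutingDistribution {

    // Classic djb2: h = h * 33 + byte, starting from 5381.
    static int djb2(byte[] data) {
        int h = 5381;
        for (byte b : data) {
            h = 33 * h + b;
        }
        return h;
    }

    // Public-domain MurmurHash3 x86 32-bit reference algorithm.
    static int murmur3_32(byte[] data, int seed) {
        int h = seed, len = data.length, i = 0;
        while (i + 4 <= len) {
            int k = (data[i] & 0xff) | ((data[i + 1] & 0xff) << 8)
                  | ((data[i + 2] & 0xff) << 16) | ((data[i + 3] & 0xff) << 24);
            k *= 0xcc9e2d51; k = Integer.rotateLeft(k, 15); k *= 0x1b873593;
            h ^= k; h = Integer.rotateLeft(h, 13); h = h * 5 + 0xe6546b64;
            i += 4;
        }
        int k = 0;
        switch (len & 3) {               // tail: intentional fall-through
            case 3: k ^= (data[i + 2] & 0xff) << 16;
            case 2: k ^= (data[i + 1] & 0xff) << 8;
            case 1: k ^= data[i] & 0xff;
                    k *= 0xcc9e2d51; k = Integer.rotateLeft(k, 15); k *= 0x1b873593;
                    h ^= k;
        }
        h ^= len;                        // finalization mix
        h ^= h >>> 16; h *= 0x85ebca6b; h ^= h >>> 13; h *= 0xc2b2ae35; h ^= h >>> 16;
        return h;
    }

    // Count how many of `docs` incremental ids land on each of `shards` shards.
    static int[] distribute(int docs, int shards, ToIntFunction<byte[]> hash) {
        int[] counts = new int[shards];
        for (int id = 0; id < docs; id++) {
            byte[] bytes = Integer.toString(id).getBytes(StandardCharsets.US_ASCII);
            counts[Math.floorMod(hash.applyAsInt(bytes), shards)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("DJB:     " + Arrays.toString(distribute(100000, 33, RoutingDistribution::djb2)));
        System.out.println("Murmur3: " + Arrays.toString(distribute(100000, 33, b -> murmur3_32(b, 0))));
    }
}
```

Running this shows the same shape as the numbers above: djb2 leaves a number of the 33 shards completely empty, while murmur3 fills every shard roughly evenly.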

Some tests have been modified because they relied on implementation details of
the routing hash function.

This change only affects indices that are created on or after Elasticsearch 2.0.

// (as there could be with DJB hash with 33 shards and incremental numeric ids)

// nocommit: should we force the use of the type? it is tempting to avoid some worst-cases in the lots-of-types
// case but it currently breaks parent/child given that in that case only the parent id is passed as a routing key
Member: Perhaps in case of p/c, for child docs we use the type of the direct parent?

Contributor: I'd just use the id, really.

jpountz (Author): OK, I'll remove the nocommit.

s1monw (Contributor) commented Oct 2, 2014

Left some comments.

  - do:
      index:
        index: test_1
        type:  test
        id:    2
Contributor: This will make our tests go wild for REST compat, so be prepared.

Contributor: Hmm, maybe not if we go with a 2.x-only change, since REST tests only run against the major version; and if not, we should fix it.

s1monw (Contributor) commented Oct 3, 2014

I was thinking about this a bit more, and I don't think we can do this safely in 1.5. Even if you have checks for indices that are created with 1.5, you might have node clients in the cluster that are from a previous version, and then your routing is messed up, i.e. /_get calls won't work correctly. I also think we should then remove the ability to configure the hash function as much as possible.

jpountz (Author) commented Oct 3, 2014

@s1monw Good call, this would indeed fail. I will make this change 2.0-only.

jpountz (Author) commented Oct 8, 2014

I pushed a new commit that makes this change only happen as of 2.0 and makes the hash function an index setting for indices that have been created before 2.0. For newer indices the hash function is hard-coded to Murmur3.

jpountz (Author) commented Oct 12, 2014

I added a note about this change to the migration guide.

-    private int shardId(ClusterState clusterState, String index, String type, @Nullable String id, @Nullable String routing) {
+    private int shardId(ClusterState clusterState, String index, String type, String id, @Nullable String routing) {
         final IndexMetaData indexMetaData = indexMetaData(clusterState, index);
         final Version createdVersion = Version.indexCreated(indexMetaData.getSettings());
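The gist of the version gate introduced here, always use murmur3 for indices created on or after 2.0, keep the legacy (possibly configured) function for older indices, can be sketched as follows. All names in this sketch are hypothetical simplifications for illustration, not the actual Elasticsearch classes:

```java
public class RoutingHashSelection {

    interface HashFunction {
        int hash(String routing);
    }

    // Legacy djb2 function, as used for pre-2.0 indices.
    static final HashFunction DJB = s -> {
        int h = 5381;
        for (int i = 0; i < s.length(); i++) {
            h = 33 * h + s.charAt(i);
        }
        return h;
    };

    // Placeholder mixing standing in for a real murmur3 implementation.
    static final HashFunction MURMUR3 = s -> s.hashCode() * 0x9e3779b1;

    // Indices created on or after major version 2 are hard-wired to murmur3;
    // older indices keep whatever function they were created with (falling
    // back to djb2 when nothing was configured).
    static HashFunction select(int createdMajorVersion, HashFunction legacyConfigured) {
        if (createdMajorVersion >= 2) {
            return MURMUR3;
        }
        return legacyConfigured != null ? legacyConfigured : DJB;
    }
}
```

The important property is that the choice is a function of the index creation version stored in the index metadata, so existing indices keep routing documents exactly as before the upgrade.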
Contributor: I think we should make the version a real part of the IndexMetaData and add IndexMetaData#getVersionCreated()?

 - index creation version as a first-class citizen of index metadata
 - 32-bits murmur3 instead of 128 (we only took 32 bits of it anyway)
 - more tests
jpountz (Author) commented Oct 17, 2014

@s1monw I pushed a new iteration that

  - adds a test that upgrades old indices (with both default routing and custom routing)
  - switches to the 32-bit version of murmur3 (we didn't really need 128)
  - makes the version and hash function first-class citizens of the index metadata
  - logs a warning if the deprecated settings are still set in elasticsearch.yml

There is some noise in the tests because making the creation version a first-class citizen broke some tests that didn't set the version. I also fixed more tests that relied on implementation details of the routing function.

@@ -20,8 +20,6 @@
package org.elasticsearch.cluster.routing.operation;

Contributor: I wonder if we should drop this module entirely and make it hardcoded wherever it's used? Or at least not changeable? I don't think anybody should change this module or have its own operation routing.

Contributor: @kimchy what do you think?

jpountz (Author): Should it be delayed to another PR? I am slightly concerned that this PR has already been awaiting merge for a long time.

* Elasticsearch 2.0 deprecated custom routing hash functions. So what we do here is that for old indices, we
* move this old & deprecated node setting to an index setting so that we can keep things backward compatible.
*/
private void pre20Upgrade() throws Exception {
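The upgrade idea, copying the deprecated node-level hash setting onto each pre-2.0 index so routing stays stable after the setting disappears, can be sketched roughly as below. The setting keys and data structures here are hypothetical simplifications, not the actual Elasticsearch settings infrastructure:

```java
import java.util.HashMap;
import java.util.Map;

public class Pre20Upgrade {

    // Hypothetical setting keys, loosely modeled on the PR's intent.
    static final String NODE_HASH_SETTING  = "cluster.routing.operation.hash.type";
    static final String INDEX_HASH_SETTING = "index.legacy.routing.hash.type";

    /**
     * For every index created before 2.0 that does not yet record a hash
     * function, copy the (deprecated) node-level value into the index
     * settings so the index keeps hashing documents the same way.
     */
    static void upgrade(Map<String, String> nodeSettings,
                        Map<String, Map<String, String>> indices) {
        String nodeHash = nodeSettings.getOrDefault(NODE_HASH_SETTING, "djb2");
        for (Map<String, String> indexSettings : indices.values()) {
            int createdMajor = Integer.parseInt(
                indexSettings.getOrDefault("index.version.created.major", "1"));
            if (createdMajor < 2) {
                indexSettings.putIfAbsent(INDEX_HASH_SETTING, nodeHash);
            }
        }
    }
}
```

Indices created on 2.0 or later are left untouched, since for them the hash function is hard-coded to murmur3 and no per-index setting is needed.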
Contributor: On a different note, I wonder if we want to port a slightly modified version of this to 1.x so that folks coming from 1.5 can remove their custom hash before the 2.0 upgrade. I also think we should deprecate the setting in 1.x already.

jpountz (Author): I will create a new PR for 1.x once this one is in?

s1monw (Contributor) commented Oct 17, 2014

Left some minor comments; otherwise LGTM.

s1monw removed the review label Oct 17, 2014
drewr (Contributor) commented Oct 24, 2014

@jpountz Tried this branch to see if I could get better bulk distribution to battle an ingestion scaling issue. It helped a little, but not as much as I was looking for.

In the course of testing, 30 shards on 30 nodes worked great. I tried 31, 33, and 37 shards, and when I did, after a bit (like, 6 GB later, not immediately) the index would go red with a single shard unassigned. The log on the node that was assigned that shard had this:

[2014-10-24 21:28:48,949][WARN ][index.engine.internal    ] [ops27-data04-A] [foo][31] failed engine [merge exception]
org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: directory '/d/es/data/ops27-data04-A/org.elasticsearch.test.ops27.data/nodes/0/indices/foo/31/index' exists and is a directory, but cannot be listed: list() returned null
        at org.elasticsearch.index.merge.scheduler.ConcurrentMergeSchedulerProvider$CustomConcurrentMergeScheduler.handleMergeException(ConcurrentMergeSchedulerProvider.java:133)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)

There's nothing wrong with the directory on disk. It has files and responds to filesystem commands.

I'm not sure if the branch is still a work-in-progress, but this surprised me a bit. Just posting it here in case it's really a bug. Thanks!

jpountz (Author) commented Oct 26, 2014

@drewr This is the exception you would get if you ran out of file descriptors; could you check your system limits?
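For reference, a quick way to inspect the per-process open-file limits from a shell on Linux. The exact defaults and how persistently to raise them vary by distribution:

```shell
# Soft limit on open file descriptors for the current shell and its children
ulimit -n

# Hard limit: the ceiling an unprivileged process may raise the soft limit to
ulimit -Hn

# Raise the soft limit for this shell session (must not exceed the hard limit),
# e.g.:  ulimit -n 65535
```

A merge that fails with "list() returned null" on a healthy directory is consistent with the JVM being unable to open the directory because the process is out of descriptors.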

drewr (Contributor) commented Oct 29, 2014

Interesting. I had never seen that particular symptom of too many open files...

I was in fact upping the limit, but I found a bug in the way Ubuntu handles ulimit -n, so those instances may in fact have had a low limit. However, with the 1.x branch of ES I hadn't had this issue. That could be a test dissimilarity, though. Will try again.

jpountz (Author) commented Nov 3, 2014

Given that this PR has been out for a long time, I plan to merge it soon (if there are no objections). Then we can have another issue/PR if we think that PlainOperationRouting should be hardcoded.

jpountz closed this in 9ea25df Nov 4, 2014
clintongormley added a commit to elastic/elasticsearch-perl that referenced this pull request Nov 5, 2014
clintongormley added the :Core/Infra/Core (Core issues without another label) label Jun 6, 2015
jpountz deleted the fix/routing_hash branch Nov 12, 2015