
Switch to murmurhash3 to route documents to shards. #7954

Closed
wants to merge 4 commits

Conversation

jpountz (Contributor) commented Oct 2, 2014

We currently use the djb2 hash function to compute the shard a document should
go to. Unfortunately, this hash function is not very sophisticated, and you can
sometimes hit adversarial cases, such as incremental numeric ids on 33 shards.

Murmur3 generates hashes with a better distribution, which should avoid the
adversarial cases.

Here are some examples of how 100000 incremental ids are distributed to shards
using either djb2 or murmur3.

5 shards:
Murmur3: [19933, 19964, 19940, 20030, 20133]
DJB:     [20000, 20000, 20000, 20000, 20000]

3 shards:
Murmur3: [33185, 33347, 33468]
DJB:     [30100, 30000, 39900]

33 shards:
Murmur3: [2999, 3096, 2930, 2986, 3070, 3093, 3023, 3052, 3112, 2940, 3036, 2985, 3031, 3048, 3127, 2961, 2901, 3105, 3041, 3130, 3013, 3035, 3031, 3019, 3008, 3022, 3111, 3086, 3016, 2996, 3075, 2945, 2977]
DJB:     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 900, 900, 900, 900, 1000, 1000, 10000, 10000, 10000, 10000, 9100, 9100, 9100, 9100, 9000, 9000, 0, 0, 0, 0, 0, 0]

Even if djb2 looks ideal in some cases (5 shards), the fact that the
distribution of its hashes follows patterns can raise issues with some shard
counts (e.g. 3, or even worse 33).
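The skew is easy to reproduce outside Elasticsearch. Below is a self-contained sketch that assigns 100000 incremental string ids to 33 shards using plain djb2 and the public MurmurHash3 x86 32-bit reference algorithm over the ids' ASCII bytes. This is illustrative code, not Elasticsearch's actual DjbHashFunction/Murmur3HashFunction classes:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.ToIntFunction;

public class RoutingDistribution {

    // Classic djb2: h = h * 33 + byte, starting from 5381.
    static int djb2(byte[] data) {
        int h = 5381;
        for (byte b : data) {
            h = 33 * h + b;
        }
        return h;
    }

    // Public-domain MurmurHash3 x86 32-bit reference algorithm.
    static int murmur3_32(byte[] data, int seed) {
        int h = seed, len = data.length, i = 0;
        while (i + 4 <= len) {
            int k = (data[i] & 0xff) | ((data[i + 1] & 0xff) << 8)
                  | ((data[i + 2] & 0xff) << 16) | ((data[i + 3] & 0xff) << 24);
            k *= 0xcc9e2d51; k = Integer.rotateLeft(k, 15); k *= 0x1b873593;
            h ^= k; h = Integer.rotateLeft(h, 13); h = h * 5 + 0xe6546b64;
            i += 4;
        }
        int k = 0;
        switch (len & 3) {               // tail: intentional fall-through
            case 3: k ^= (data[i + 2] & 0xff) << 16;
            case 2: k ^= (data[i + 1] & 0xff) << 8;
            case 1: k ^= data[i] & 0xff;
                    k *= 0xcc9e2d51; k = Integer.rotateLeft(k, 15); k *= 0x1b873593;
                    h ^= k;
        }
        h ^= len;                        // finalization mix
        h ^= h >>> 16; h *= 0x85ebca6b; h ^= h >>> 13; h *= 0xc2b2ae35; h ^= h >>> 16;
        return h;
    }

    // Count how many of `docs` incremental ids land on each of `shards` shards.
    static int[] distribute(int docs, int shards, ToIntFunction<byte[]> hash) {
        int[] counts = new int[shards];
        for (int id = 0; id < docs; id++) {
            byte[] bytes = Integer.toString(id).getBytes(StandardCharsets.US_ASCII);
            counts[Math.floorMod(hash.applyAsInt(bytes), shards)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("DJB:     " + Arrays.toString(distribute(100000, 33, RoutingDistribution::djb2)));
        System.out.println("Murmur3: " + Arrays.toString(distribute(100000, 33, b -> murmur3_32(b, 0))));
    }
}
```

Running this shows the same shape as the numbers above: djb2 leaves a number of the 33 shards completely empty, while murmur3 fills every shard roughly evenly.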

Some tests have been modified because they relied on implementation details of
the routing hash function.

This change only affects indices that are created on or after Elasticsearch 2.0.

// (as there could be with DJB hash with 33 shards and incremental numeric ids)

// nocommit: should we force the use of the type? it is tempting to avoid some worst-cases in the lots-of-types
// case but it currently breaks parent/child given that in that case only the parent id is passed as a routing key
Member: Perhaps in case of p/c, for child docs we use the type of the direct parent?

Contributor: I'd just use the id, really.

jpountz (Author): OK, I'll remove the nocommit.

s1monw (Contributor) commented Oct 2, 2014

Left some comments.

  - do:
      index:
        index: test_1
        type:  test
        id:    2
Contributor: This will make our tests go wild for REST compat, so be prepared.

Contributor: Hmm, maybe not if we go with a 2.x-only change, since REST tests only run against the major version; and if not, we should fix it.

s1monw (Contributor) commented Oct 3, 2014

I was thinking about this a bit more, and I don't think we can do this safely in 1.5. Even if you have checks for indices that are created with 1.5, you might have node clients in the cluster that are from a previous version, and then your routing is messed up, i.e. /_get calls won't work correctly. I also think we should then remove the ability to configure the hash function as much as possible.

jpountz (Author) commented Oct 3, 2014

@s1monw Good call, this would indeed fail. I will make this change 2.0-only.

jpountz (Author) commented Oct 8, 2014

I pushed a new commit that makes this change only happen as of 2.0 and makes the hash function an index setting for indices that have been created before 2.0. For newer indices the hash function is hard-coded to Murmur3.

jpountz (Author) commented Oct 12, 2014

I added a note about this change to the migration guide.

-    private int shardId(ClusterState clusterState, String index, String type, @Nullable String id, @Nullable String routing) {
+    private int shardId(ClusterState clusterState, String index, String type, String id, @Nullable String routing) {
         final IndexMetaData indexMetaData = indexMetaData(clusterState, index);
         final Version createdVersion = Version.indexCreated(indexMetaData.getSettings());
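The gist of the version gate introduced here, always use murmur3 for indices created on or after 2.0, keep the legacy (possibly configured) function for older indices, can be sketched as follows. All names in this sketch are hypothetical simplifications for illustration, not the actual Elasticsearch classes:

```java
public class RoutingHashSelection {

    interface HashFunction {
        int hash(String routing);
    }

    // Legacy djb2 function, as used for pre-2.0 indices.
    static final HashFunction DJB = s -> {
        int h = 5381;
        for (int i = 0; i < s.length(); i++) {
            h = 33 * h + s.charAt(i);
        }
        return h;
    };

    // Placeholder mixing standing in for a real murmur3 implementation.
    static final HashFunction MURMUR3 = s -> s.hashCode() * 0x9e3779b1;

    // Indices created on or after major version 2 are hard-wired to murmur3;
    // older indices keep whatever function they were created with (falling
    // back to djb2 when nothing was configured).
    static HashFunction select(int createdMajorVersion, HashFunction legacyConfigured) {
        if (createdMajorVersion >= 2) {
            return MURMUR3;
        }
        return legacyConfigured != null ? legacyConfigured : DJB;
    }
}
```

The important property is that the choice is a function of the index creation version stored in the index metadata, so existing indices keep routing documents exactly as before the upgrade.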
Contributor: I think we should make the version a real part of the IndexMetaData and add IndexMetaData#getVersionCreated()?

 - index creation version as a first-class citizen of index metadata
 - 32-bits murmur3 instead of 128 (we only took 32 bits of it anyway)
 - more tests
jpountz (Author) commented Oct 17, 2014

@s1monw I pushed a new iteration that

  - adds a test that upgrades old indices (with both default routing and custom routing)
  - switches to the 32-bit version of murmur3 (we didn't really need 128)
  - makes the version and hash function first-class citizens of the index metadata
  - logs a warning if the deprecated settings are still set in elasticsearch.yml

There is some noise in the tests because making the creation version a first-class citizen broke some tests that didn't set the version. I also fixed more tests that relied on implementation details of the routing function.

@@ -20,8 +20,6 @@
package org.elasticsearch.cluster.routing.operation;

Contributor: I wonder if we should drop this module entirely and make it hardcoded wherever it's used? Or at least not changeable? I don't think anybody should change this module or have its own operation routing.

Contributor: @kimchy what do you think?

jpountz (Author): Should it be delayed to another PR? I am slightly concerned that this PR has already been awaiting merge for a long time.

* Elasticsearch 2.0 deprecated custom routing hash functions. So what we do here is that for old indices, we
* move this old & deprecated node setting to an index setting so that we can keep things backward compatible.
*/
private void pre20Upgrade() throws Exception {
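The upgrade idea, copying the deprecated node-level hash setting onto each pre-2.0 index so routing stays stable after the setting disappears, can be sketched roughly as below. The setting keys and data structures here are hypothetical simplifications, not the actual Elasticsearch settings infrastructure:

```java
import java.util.HashMap;
import java.util.Map;

public class Pre20Upgrade {

    // Hypothetical setting keys, loosely modeled on the PR's intent.
    static final String NODE_HASH_SETTING  = "cluster.routing.operation.hash.type";
    static final String INDEX_HASH_SETTING = "index.legacy.routing.hash.type";

    /**
     * For every index created before 2.0 that does not yet record a hash
     * function, copy the (deprecated) node-level value into the index
     * settings so the index keeps hashing documents the same way.
     */
    static void upgrade(Map<String, String> nodeSettings,
                        Map<String, Map<String, String>> indices) {
        String nodeHash = nodeSettings.getOrDefault(NODE_HASH_SETTING, "djb2");
        for (Map<String, String> indexSettings : indices.values()) {
            int createdMajor = Integer.parseInt(
                indexSettings.getOrDefault("index.version.created.major", "1"));
            if (createdMajor < 2) {
                indexSettings.putIfAbsent(INDEX_HASH_SETTING, nodeHash);
            }
        }
    }
}
```

Indices created on 2.0 or later are left untouched, since for them the hash function is hard-coded to murmur3 and no per-index setting is needed.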
Contributor: On a different note, I wonder if we want to port a slightly modified version of this to 1.x so that folks coming from 1.5 can remove their custom hash before the 2.0 upgrade. I also think we should deprecate the setting in 1.x already.

jpountz (Author): I will create a new PR for 1.x once this one is in?

s1monw (Contributor) commented Oct 17, 2014

Left some minor comments; otherwise LGTM.

s1monw removed the review label Oct 17, 2014
drewr (Contributor) commented Oct 24, 2014

@jpountz Tried this branch to see if I could get better bulk distribution to battle an ingestion scaling issue. It helped a little, but not as much as I was looking for.

In the course of testing, 30 shards on 30 nodes worked great. I tried 31, 33, and 37 shards, and when I did, after a bit (like, 6 GB later, not immediately) the index would go red with a single shard unassigned. The log on the node that was assigned that shard had this:

[2014-10-24 21:28:48,949][WARN ][index.engine.internal    ] [ops27-data04-A] [foo][31] failed engine [merge exception]
org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: directory '/d/es/data/ops27-data04-A/org.elasticsearch.test.ops27.data/nodes/0/indices/foo/31/index' exists and is a directory, but cannot be listed: list() returned null
        at org.elasticsearch.index.merge.scheduler.ConcurrentMergeSchedulerProvider$CustomConcurrentMergeScheduler.handleMergeException(ConcurrentMergeSchedulerProvider.java:133)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)

There's nothing wrong with the directory on disk. It has files and responds to filesystem commands.

I'm not sure if the branch is still a work-in-progress, but this surprised me a bit. Just posting it here in case it's really a bug. Thanks!

jpountz (Author) commented Oct 26, 2014

@drewr This is the exception you would get if you ran out of file descriptors; could you check your system limits?
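For reference, a quick way to inspect the per-process open-file limits from a shell on Linux. The exact defaults and how persistently to raise them vary by distribution:

```shell
# Soft limit on open file descriptors for the current shell and its children
ulimit -n

# Hard limit: the ceiling an unprivileged process may raise the soft limit to
ulimit -Hn

# Raise the soft limit for this shell session (must not exceed the hard limit),
# e.g.:  ulimit -n 65535
```

A merge that fails with "list() returned null" on a healthy directory is consistent with the JVM being unable to open the directory because the process is out of descriptors.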

drewr (Contributor) commented Oct 29, 2014

Interesting. I had never seen that particular symptom of too many open files...

I was in fact upping the limit, but I found a bug in the way Ubuntu handles ulimit -n, so those instances may in fact have had a low limit. However, with the 1.x branch of ES I hadn't had this issue. That could be a test dissimilarity, though. Will try again.

jpountz (Author) commented Nov 3, 2014

Given that this PR has been out for a long time, I plan to merge it soon (if there are no objections). Then we can have another issue/PR if we think that PlainOperationRouting should be hardcoded.

jpountz closed this in 9ea25df Nov 4, 2014
clintongormley added a commit to elastic/elasticsearch-perl that referenced this pull request Nov 5, 2014
clintongormley added the :Core/Infra/Core (Core issues without another label) label Jun 6, 2015
jpountz deleted the fix/routing_hash branch Nov 12, 2015