Shard Started messages should be matched using an exact match #11999

bleskes · 2015-07-02T16:51:40Z

When a node sends a shard started message to the master, the master goes through the routing table looking for the shard to start. At the moment we validate the indexUUID, the node the shard is assigned to, and the fact that the shard is initializing. This check goes wrong if a relocating replica shard finishes recovery just as the source node leaves the cluster. In that case the master cancels the recovery and will likely assign a new initializing replica to the same target node. The message from the cancelled relocation recovery can then wrongly activate the new replica.

Also, the logic for deciding whether an incoming shard started message will be applied was split between ShardStateAction and the AllocationService.
This commit does the following:

1) Let ShardStateAction only filter basic stuff like index existence and indexUUID.
2) Move the trickier shard started matching logic to the AllocationService and make it stricter.
3) Unify ShardStateAction filtering logic for both shard started and shard failed.
4) Add unit tests for all of the above.
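
To make the race concrete, here is a minimal, self-contained sketch of the difference between the looser check described above and an exact match. It is not the actual Elasticsearch code: `ShardStartedMatchSketch`, `Routing`, `looseMatch`, and `exactMatch` are hypothetical names, and `Routing` is a heavily simplified stand-in for a shard routing entry, with field names chosen for illustration only.

```java
import java.util.Objects;

// Hypothetical sketch; names and fields are assumptions, not the real Elasticsearch classes.
public class ShardStartedMatchSketch {

    enum State { INITIALIZING, STARTED, RELOCATING }

    // Simplified stand-in for a routing table entry.
    static final class Routing {
        final String indexUUID;
        final int shardId;
        final String currentNodeId;
        final String relocatingNodeId; // null when the shard is not part of a relocation
        final boolean primary;
        final State state;

        Routing(String indexUUID, int shardId, String currentNodeId,
                String relocatingNodeId, boolean primary, State state) {
            this.indexUUID = indexUUID;
            this.shardId = shardId;
            this.currentNodeId = currentNodeId;
            this.relocatingNodeId = relocatingNodeId;
            this.primary = primary;
            this.state = state;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Routing)) return false;
            Routing r = (Routing) o;
            return shardId == r.shardId
                && primary == r.primary
                && state == r.state
                && indexUUID.equals(r.indexUUID)
                && currentNodeId.equals(r.currentNodeId)
                && Objects.equals(relocatingNodeId, r.relocatingNodeId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(indexUUID, shardId, currentNodeId, relocatingNodeId, primary, state);
        }
    }

    // Looser check: index UUID, assigned node, and INITIALIZING state only.
    static boolean looseMatch(Routing inTable, Routing fromMessage) {
        return inTable.indexUUID.equals(fromMessage.indexUUID)
            && inTable.shardId == fromMessage.shardId
            && inTable.currentNodeId.equals(fromMessage.currentNodeId)
            && inTable.state == State.INITIALIZING;
    }

    // Stricter check: the entry must be initializing and equal in every field.
    static boolean exactMatch(Routing inTable, Routing fromMessage) {
        return inTable.state == State.INITIALIZING && inTable.equals(fromMessage);
    }

    public static void main(String[] args) {
        // Message from the cancelled relocation: replica initializing on node2, recovering from node1.
        Routing stale = new Routing("uuid", 0, "node2", "node1", false, State.INITIALIZING);
        // After node1 left, the master assigned a fresh replica to the same node2.
        Routing fresh = new Routing("uuid", 0, "node2", null, false, State.INITIALIZING);

        System.out.println("loose match: " + looseMatch(fresh, stale)); // true: stale message would start the new replica
        System.out.println("exact match: " + exactMatch(fresh, stale)); // false: relocation source differs
    }
}
```

In this sketch the stale message and the fresh routing entry differ only in the relocation source, which the looser check never inspects. Comparing whole entries rejects the stale message without listing the distinguishing fields one by one.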

For an example test failure see: http://build-us-00.elastic.co/job/es_core_16_centos/388/

…atch When a node sends a shard started message to the master, the master goes through the routing table looking for the shard to start. At the moment we validate the indexUUID, the node the shard is assigned to and the fact that the shard is initializing. This check goes wrong if a relocating replica shard finishes recovery just at the moment the source node leaves the cluster. In this case the master will cancel the recovery and will likely assign a new initializing replica to the same target node. In this case the message from the relocation recovery can activate the new replica wrongfully. This commit changes the test to use ShardRouting.equals and adds unit tests for both shard started and shard failed. For an example test failure see: http://build-us-00.elastic.co/job/es_core_16_centos/388/

bleskes · 2015-07-02T16:52:02Z

@kimchy can you take a look?

kimchy · 2015-07-03T15:32:50Z

LGTM

kimchy · 2015-07-03T15:33:53Z

@bleskes should we backport to 1.7 as well here?

bleskes · 2015-07-06T09:35:14Z

@kimchy I think that makes sense. This is a bug fix and it's good to keep this logic as similar as we can (given that the other work around shard allocation goes into 1.x). I will give @s1monw some time to voice his opinion ...

bleskes · 2015-07-10T06:12:35Z

I pushed this to master. If I don't hear any objection by tomorrow, I'll back port to 1.7 as well...

bleskes · 2015-07-12T07:08:05Z

pushed this to 1.7 as well..

bleskes added 2 commits July 2, 2015 16:02

moved shard started filtering logic to AllocationService and made it …

0f26949

…stricter

bleskes added >bug v2.0.0-beta1 review labels Jul 2, 2015

bleskes closed this in 28090b3 Jul 10, 2015

kevinkluge removed the review label Jul 10, 2015

bleskes added resiliency v1.7.0 labels Jul 12, 2015

lcawl added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Allocation labels Feb 13, 2018
