Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shard Started messages should be matched using an exact match #11999

Closed
wants to merge 2 commits into from

Conversation

bleskes
Copy link
Contributor

@bleskes bleskes commented Jul 2, 2015

When a node sends a shard started message to the master, the master goes through the routing table looking for the shard to start. At the moment we validate the indexUUID, the node the shard is assigned to and the fact that the shard is initializing. This check goes wrong if a relocating replica shard finishes recovery just at the moment the source node leaves the cluster. In this case the master will cancel the recovery and will likely assign a new initializing replica to the same target node. In this case the message from the relocation recovery can activate the new replica wrongfully.

Also, the logic for decided whether an incoming shard started message will be applied was split between ShardStateAction and the AllocationService.
This commit does the following:

  1. Let ShardStateAction only filter basic stuff like index existence and indexUUID.
  2. Move the trickier shard started matching logic to the AllocationService and make it stricter
  3. Unify ShardStateAction filtering logic for both shard started and shard failed.
  4. Add unit tests for all of the above.

For an example test failure see: http://build-us-00.elastic.co/job/es_core_16_centos/388/

…atch

When a node sends a shard started message to the master, the master goes through the routing table looking for the shard to start. At the moment we validate the indexUUID, the node the shard is assigned to and the fact that the shard is initializing. This check goes wrong if a relocating replica shard finishes recovery just at the moment the source node  leaves the cluster. In this case the master will cancel the recovery and will likely assign a new initializing replica to the same target node. In this case the message from the relocation recovery can activate the new replica wrongfully.

This commit changes the test to use ShardRouting.equals and adds unit tests for both shard started and shard failed.

For an example test failure see: http://build-us-00.elastic.co/job/es_core_16_centos/388/
@bleskes
Copy link
Contributor Author

bleskes commented Jul 2, 2015

@kimchy can you take a look?

@kimchy
Copy link
Member

kimchy commented Jul 3, 2015

LGTM

@kimchy
Copy link
Member

kimchy commented Jul 3, 2015

@bleskes should we backport to 1.7 as well here?

@bleskes
Copy link
Contributor Author

bleskes commented Jul 6, 2015

@kimchy I think that makes sense. This is a bug fix and it's good to keep this logic similar as much as we can (give the other work around shard allocation goes into 1.x). I will give @s1monw some time to voice his opinion ...

@bleskes bleskes closed this in 28090b3 Jul 10, 2015
@kevinkluge kevinkluge removed the review label Jul 10, 2015
@bleskes
Copy link
Contributor Author

bleskes commented Jul 10, 2015

I pushed this to master. If I don't hear any objection by tomorrow, I'll back port to 1.7 as well...

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Jul 12, 2015
…atch

When a node sends a shard started message to the master, the master goes through the routing table looking for the shard to start. At the moment we validate the indexUUID, the node the shard is assigned to and the fact that the shard is initializing. This check goes wrong if a relocating replica shard finishes recovery just at the moment the source node leaves the cluster. In this case the master will cancel the recovery and will likely assign a new initializing replica to the same target node. In this case the message from the relocation recovery can activate the new replica wrongfully.

Also, the logic for decided whether an incoming shard started message will be applied was split between ShardStateAction and the AllocationService.
This commit does the following:
1) Let ShardStateAction only filter basic stuff like index existence and indexUUID.
2) Move the trickier shard started matching logic to the AllocationService and make it stricter
3) Unify ShardStateAction filtering logic for both shard started and shard failed.
4) Add unit tests for all of the above.

For an example test failure see: http://build-us-00.elastic.co/job/es_core_16_centos/388/

Closes elastic#11999
@bleskes
Copy link
Contributor Author

bleskes commented Jul 12, 2015

pushed this to 1.7 as well..

@lcawl lcawl added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Allocation labels Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. resiliency v1.7.0 v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants