New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Shard Started messages should be matched using an exact match #11999
Closed
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…atch When a node sends a shard started message to the master, the master goes through the routing table looking for the shard to start. At the moment we validate the indexUUID, the node the shard is assigned to and the fact that the shard is initializing. This check goes wrong if a relocating replica shard finishes recovery just at the moment the source node leaves the cluster. In this case the master will cancel the recovery and will likely assign a new initializing replica to the same target node. In this case the message from the relocation recovery can activate the new replica wrongfully. This commit changes the test to use ShardRouting.equals and adds unit tests for both shard started and shard failed. For an example test failure see: http://build-us-00.elastic.co/job/es_core_16_centos/388/
@kimchy can you take a look? |
LGTM |
@bleskes should we backport to 1.7 as well here? |
I pushed this to master. If I don't hear any objection by tomorrow, I'll back port to 1.7 as well... |
bleskes
added a commit
to bleskes/elasticsearch
that referenced
this pull request
Jul 12, 2015
…atch When a node sends a shard started message to the master, the master goes through the routing table looking for the shard to start. At the moment we validate the indexUUID, the node the shard is assigned to and the fact that the shard is initializing. This check goes wrong if a relocating replica shard finishes recovery just at the moment the source node leaves the cluster. In this case the master will cancel the recovery and will likely assign a new initializing replica to the same target node. In this case the message from the relocation recovery can activate the new replica wrongfully. Also, the logic for decided whether an incoming shard started message will be applied was split between ShardStateAction and the AllocationService. This commit does the following: 1) Let ShardStateAction only filter basic stuff like index existence and indexUUID. 2) Move the trickier shard started matching logic to the AllocationService and make it stricter 3) Unify ShardStateAction filtering logic for both shard started and shard failed. 4) Add unit tests for all of the above. For an example test failure see: http://build-us-00.elastic.co/job/es_core_16_centos/388/ Closes elastic#11999
pushed this to 1.7 as well.. |
lcawl
added
:Distributed/Distributed
A catch all label for anything in the Distributed Area. If you aren't sure, use this one.
and removed
:Allocation
labels
Feb 13, 2018
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
>bug
:Distributed/Distributed
A catch all label for anything in the Distributed Area. If you aren't sure, use this one.
resiliency
v1.7.0
v2.0.0-beta1
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
When a node sends a shard started message to the master, the master goes through the routing table looking for the shard to start. At the moment we validate the indexUUID, the node the shard is assigned to and the fact that the shard is initializing. This check goes wrong if a relocating replica shard finishes recovery just at the moment the source node leaves the cluster. In this case the master will cancel the recovery and will likely assign a new initializing replica to the same target node. In this case the message from the relocation recovery can activate the new replica wrongfully.
Also, the logic for decided whether an incoming shard started message will be applied was split between ShardStateAction and the AllocationService.
This commit does the following:
For an example test failure see: http://build-us-00.elastic.co/job/es_core_16_centos/388/