New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Before deleting shard verify that another node holds an active shard instance #6692
Before deleting shard verify that another node holds an active shard instance #6692
Conversation
} | ||
|
||
if (!shardsToDelete.isEmpty()) { | ||
deleteShardIfExistElseWhere(clusterService.state(), shardsToDelete); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should the state we get from the event, to make sure it's consistent with the test we just made.
@bleskes Good points, I updated the PR. |
} | ||
ShardActiveResponseHandler responseHandler = new ShardActiveResponseHandler(shardId, requests.size()); | ||
for (Tuple<DiscoveryNode, ShardActiveRequest> request : requests) { | ||
transportService.submitRequest(request.v1(), ACTION_SHARD_EXISTS, request.v2(), responseHandler); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to check on the node version, and only call it on nodes that are version 1.3 and above, otherwise they won't have this API
@kimchy good point, I updated the PR. |
List<Tuple<DiscoveryNode, ShardActiveRequest>> requests = new ArrayList<>(); | ||
IndexShardRoutingTable shardRoutingTable = state.routingTable().index(shardId.getIndex()).shard(shardId.id()); | ||
for (ShardRouting shardRouting : shardRoutingTable) { | ||
DiscoveryNode node = state.getNodes().get(shardRouting.currentNodeId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we protect if this is null here, and not delete?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that make sense.
thinking about it a bit more, i think that the logic when all nodes responded should only continue if the current cluster state is the same as the cluster state that we had during the clusterChangeEvent initiating the check on active on all nodes. If not, it will not do anything, but by definition another cluster state has happened, and will trigger the active check for deletion anyhow |
++ on what kimchy said. |
} | ||
|
||
private void allNodesResponded() { | ||
if (activeCopies.get() == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the logic here should be only if all nodes responded with the shards active we should continue with the deletion process..., same check we do on the cluster state if a shard can be deleted
public void onFailure(String source, Throwable t) { | ||
} | ||
}); | ||
waitNoPendingTasksOnAll(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we know when the active shard request have been completed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have no mechanism for that to check... since it isn't part of the cluster event processing. The waitForShardDeletion() waits long enough for the active shard to have processed?
return !shardDirectory(server, index, shard).exists(); | ||
@Test | ||
public void testShardCanBeDeleted_noShardStarted() throws Exception { | ||
ClusterState.Builder clusterState = ClusterState.builder(new ClusterName("test")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe randomize the number of shards and their states? (unassigned/initializing/closed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think randomizing the number of shards here, wouldn't have any impact because it is never used. This test checks already for every non stated state, so that is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean random number of replicas with random combination of non-active states
left a minor comment, LGTM |
@bleskes Applied the latest feedback. |
return !shardDirectory(server, index, shard).exists(); | ||
@Test | ||
public void testShardCanBeDeleted_noShardStarted() throws Exception { | ||
int numShards = randomInt(7); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be 1+ randomInt(), no?
IndexShardRoutingTable.Builder routingTable = new IndexShardRoutingTable.Builder(new ShardId("test", 1), false); | ||
int localShardId = randomInt(numShards - 1); | ||
for (int i = 0; i < numShards; i++) { | ||
String nodeId = i == localShardId ? localNode.getId() : randomBoolean() ? "abc" : "xyz"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can check also randomly on a shard that relocates to the local node
LGTM - life a couple of minor improvement suggestions. Having these unit tests is awesome. |
… node in the cluster actually holds an active shard copy. Closes elastic#6692
… node in the cluster actually holds an active shard copy. Closes #6692
Before removing shard physically from disk verify that another node in the cluster actually holds an active shard instance.