Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Before deleting shard verify that another node holds an active shard instance #6692

Merged

Conversation

martijnvg
Copy link
Member

Before removing shard physically from disk verify that another node in the cluster actually holds an active shard instance.

@martijnvg martijnvg changed the title Before deleting shard verify that another node holds an active shard instance Store: Before deleting shard verify that another node holds an active shard instance Jul 2, 2014
}

if (!shardsToDelete.isEmpty()) {
deleteShardIfExistElseWhere(clusterService.state(), shardsToDelete);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should the state we get from the event, to make sure it's consistent with the test we just made.

@martijnvg
Copy link
Member Author

@bleskes Good points, I updated the PR.

}
ShardActiveResponseHandler responseHandler = new ShardActiveResponseHandler(shardId, requests.size());
for (Tuple<DiscoveryNode, ShardActiveRequest> request : requests) {
transportService.submitRequest(request.v1(), ACTION_SHARD_EXISTS, request.v2(), responseHandler);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to check on the node version, and only call it on nodes that are version 1.3 and above, otherwise they won't have this API

@martijnvg
Copy link
Member Author

@kimchy good point, I updated the PR.

List<Tuple<DiscoveryNode, ShardActiveRequest>> requests = new ArrayList<>();
IndexShardRoutingTable shardRoutingTable = state.routingTable().index(shardId.getIndex()).shard(shardId.id());
for (ShardRouting shardRouting : shardRoutingTable) {
DiscoveryNode node = state.getNodes().get(shardRouting.currentNodeId());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we protect if this is null here, and not delete?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that make sense.

@kimchy
Copy link
Member

kimchy commented Jul 4, 2014

thinking about it a bit more, i think that the logic when all nodes responded should only continue if the current cluster state is the same as the cluster state that we had during the clusterChangeEvent initiating the check on active on all nodes. If not, it will not do anything, but by definition another cluster state has happened, and will trigger the active check for deletion anyhow

@bleskes
Copy link
Contributor

bleskes commented Jul 4, 2014

++ on what kimchy said.

@martijnvg
Copy link
Member Author

@kimchy @bleskes I've updated the PR with the suggested cluster state check and added unit tests.

}

private void allNodesResponded() {
if (activeCopies.get() == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the logic here should be only if all nodes responded with the shards active we should continue with the deletion process..., same check we do on the cluster state if a shard can be deleted

public void onFailure(String source, Throwable t) {
}
});
waitNoPendingTasksOnAll();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we know when the active shard request have been completed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have no mechanism for that to check... since it isn't part of the cluster event processing. The waitForShardDeletion() waits long enough for the active shard to have processed?

return !shardDirectory(server, index, shard).exists();
@Test
public void testShardCanBeDeleted_noShardStarted() throws Exception {
ClusterState.Builder clusterState = ClusterState.builder(new ClusterName("test"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe randomize the number of shards and their states? (unassigned/initializing/closed)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think randomizing the number of shards here, wouldn't have any impact because it is never used. This test checks already for every non stated state, so that is ok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean random number of replicas with random combination of non-active states

@kimchy
Copy link
Member

kimchy commented Jul 7, 2014

left a minor comment, LGTM

@martijnvg
Copy link
Member Author

@bleskes Applied the latest feedback.

return !shardDirectory(server, index, shard).exists();
@Test
public void testShardCanBeDeleted_noShardStarted() throws Exception {
int numShards = randomInt(7);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be 1+ randomInt(), no?

IndexShardRoutingTable.Builder routingTable = new IndexShardRoutingTable.Builder(new ShardId("test", 1), false);
int localShardId = randomInt(numShards - 1);
for (int i = 0; i < numShards; i++) {
String nodeId = i == localShardId ? localNode.getId() : randomBoolean() ? "abc" : "xyz";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can check also randomly on a shard that relocates to the local node

@bleskes
Copy link
Contributor

bleskes commented Jul 9, 2014

LGTM - life a couple of minor improvement suggestions. Having these unit tests is awesome.

… node in the cluster actually holds an active shard copy.

Closes elastic#6692
martijnvg added a commit that referenced this pull request Jul 9, 2014
… node in the cluster actually holds an active shard copy.

Closes #6692
@martijnvg martijnvg merged commit 9abb7c4 into elastic:master Jul 9, 2014
@clintongormley clintongormley changed the title Store: Before deleting shard verify that another node holds an active shard instance Resiliency: Before deleting shard verify that another node holds an active shard instance Jul 16, 2014
@martijnvg martijnvg deleted the improvements/shard-exists-elsewhere branch May 18, 2015 23:31
@clintongormley clintongormley added the :Distributed/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. label Jun 7, 2015
@clintongormley clintongormley changed the title Resiliency: Before deleting shard verify that another node holds an active shard instance Before deleting shard verify that another node holds an active shard instance Jun 7, 2015
@clintongormley clintongormley added :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Distributed/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. labels Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement resiliency v1.3.0 v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants