Before deleting shard verify that another node holds an active shard instance #6692

martijnvg · 2014-07-02T19:13:11Z

Before physically removing a shard from disk, verify that another node in the cluster actually holds an active shard instance.

bleskes · 2014-07-03T13:12:55Z

src/main/java/org/elasticsearch/indices/store/IndicesStore.java

+        }
+
+        if (!shardsToDelete.isEmpty()) {
+            deleteShardIfExistElseWhere(clusterService.state(), shardsToDelete);


I think we should use the state we get from the event, to make sure it's consistent with the check we just made.

martijnvg · 2014-07-03T15:07:55Z

@bleskes Good points, I updated the PR.

kimchy · 2014-07-03T15:10:24Z

src/main/java/org/elasticsearch/indices/store/IndicesStore.java

+            }
+            ShardActiveResponseHandler responseHandler = new ShardActiveResponseHandler(shardId, requests.size());
+            for (Tuple<DiscoveryNode, ShardActiveRequest> request : requests) {
+                transportService.submitRequest(request.v1(), ACTION_SHARD_EXISTS, request.v2(), responseHandler);


We need to check the node version and only call it on nodes that are version 1.3 and above; otherwise they won't have this API.
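A minimal sketch of the version gate suggested above. The `Node` class and `nodesSupportingShardActive` are hypothetical stand-ins written for illustration, not the real `DiscoveryNode` or `Version` classes, and the sketch does not decide what should happen to a shard whose copies live on older nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class VersionGateSketch {
    /** Hypothetical stand-in for a cluster node; not the real DiscoveryNode class. */
    static final class Node {
        final String id;
        final int major;
        final int minor;

        Node(String id, int major, int minor) {
            this.id = id;
            this.major = major;
            this.minor = minor;
        }

        /** True if this node runs the given version or a later one. */
        boolean onOrAfter(int otherMajor, int otherMinor) {
            return major > otherMajor || (major == otherMajor && minor >= otherMinor);
        }
    }

    /** Keeps only the nodes new enough (1.3+) to understand the shard-active request. */
    static List<Node> nodesSupportingShardActive(List<Node> candidates) {
        List<Node> supported = new ArrayList<>();
        for (Node node : candidates) {
            if (node.onOrAfter(1, 3)) {
                supported.add(node);
            }
        }
        return supported;
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(new Node("a", 1, 2), new Node("b", 1, 3), new Node("c", 2, 0));
        for (Node node : nodesSupportingShardActive(nodes)) {
            System.out.println(node.id);
        }
    }
}
```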

martijnvg · 2014-07-03T15:31:44Z

@kimchy good point, I updated the PR.

kimchy · 2014-07-04T10:15:23Z

src/main/java/org/elasticsearch/indices/store/IndicesStore.java

+            List<Tuple<DiscoveryNode, ShardActiveRequest>> requests = new ArrayList<>();
+            IndexShardRoutingTable shardRoutingTable = state.routingTable().index(shardId.getIndex()).shard(shardId.id());
+            for (ShardRouting shardRouting : shardRoutingTable) {
+                DiscoveryNode node = state.getNodes().get(shardRouting.currentNodeId());


Can we guard against this being null here, and not delete in that case?

Yes, that makes sense.
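One way the null guard could look, as a sketch only. Nodes are modelled here as a plain `Map` from node id to address, and `resolveTargets` is a hypothetical helper; the real code works with `DiscoveryNode` and `ShardRouting`. An empty result means "a shard copy points at a node we cannot resolve, so skip the deletion".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class NullNodeGuardSketch {
    /**
     * Resolves every shard copy's node id to a node address.
     * Returns Optional.empty() as soon as one id cannot be resolved,
     * which the caller should treat as "do not delete".
     */
    static Optional<List<String>> resolveTargets(Map<String, String> nodesById, List<String> currentNodeIds) {
        List<String> targets = new ArrayList<>();
        for (String nodeId : currentNodeIds) {
            // An unassigned copy has no node id; an unknown id has no map entry.
            String node = nodeId == null ? null : nodesById.get(nodeId);
            if (node == null) {
                return Optional.empty();
            }
            targets.add(node);
        }
        return Optional.of(targets);
    }

    public static void main(String[] args) {
        Map<String, String> nodesById = new HashMap<>();
        nodesById.put("n1", "10.0.0.1");
        System.out.println(resolveTargets(nodesById, Arrays.asList("n1", "gone")).isPresent());
    }
}
```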

kimchy · 2014-07-04T15:33:44Z

Thinking about it a bit more, I think the logic that runs once all nodes have responded should only continue if the current cluster state is the same as the cluster state we had during the clusterChangeEvent that initiated the active check on all nodes. If it isn't, we do nothing; by definition another cluster state change has happened, and that will trigger the active check for deletion anyhow.
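The stale-state guard described above might be sketched as follows. Comparing a numeric state version is an assumption made for this example (the comment only says the state must be "the same"), and `StaleStateGuardSketch` with its fields is a hypothetical class, not code from the PR.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

final class StaleStateGuardSketch {
    private final long versionAtEvent;         // state version seen when the check started
    private final LongSupplier currentVersion; // reads the latest known state version
    private final Runnable deleteShard;        // the deletion to run if nothing changed

    StaleStateGuardSketch(long versionAtEvent, LongSupplier currentVersion, Runnable deleteShard) {
        this.versionAtEvent = versionAtEvent;
        this.currentVersion = currentVersion;
        this.deleteShard = deleteShard;
    }

    /** Called once every node has answered. Returns true if the deletion ran. */
    boolean onAllNodesResponded() {
        if (currentVersion.getAsLong() != versionAtEvent) {
            // A newer state arrived in the meantime; its own event will redo the check.
            return false;
        }
        deleteShard.run();
        return true;
    }

    public static void main(String[] args) {
        AtomicLong clusterVersion = new AtomicLong(7);
        StaleStateGuardSketch guard =
            new StaleStateGuardSketch(7, clusterVersion::get, () -> System.out.println("deleting shard"));
        clusterVersion.incrementAndGet(); // simulate a state change while requests were in flight
        System.out.println(guard.onAllNodesResponded());
    }
}
```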

bleskes · 2014-07-04T16:31:05Z

++ on what kimchy said.

martijnvg · 2014-07-07T09:24:56Z

@kimchy @bleskes I've updated the PR with the suggested cluster state check and added unit tests.

kimchy · 2014-07-07T09:57:22Z

src/main/java/org/elasticsearch/indices/store/IndicesStore.java

+        }
+
+        private void allNodesResponded() {
+            if (activeCopies.get() == 0) {


The logic here should be: only if all nodes responded that the shard is active should we continue with the deletion process. It's the same check we do on the cluster state to decide whether a shard can be deleted.
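A sketch of the condition requested above: count the responses, and proceed only when every queried node reported an active copy. `ShardActiveTally` is a hypothetical, simplified stand-in for the response handler in the diff; transport details are omitted, and failures are simply counted as "not active".

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ShardActiveTally {
    private final int expectedCopies;
    private final AtomicInteger awaitingResponses;
    private final AtomicInteger activeCopies = new AtomicInteger();
    private final Runnable deleteShard;

    ShardActiveTally(int expectedCopies, Runnable deleteShard) {
        this.expectedCopies = expectedCopies;
        this.awaitingResponses = new AtomicInteger(expectedCopies);
        this.deleteShard = deleteShard;
    }

    /** Records one node's answer. */
    void onResponse(boolean shardActive) {
        if (shardActive) {
            activeCopies.incrementAndGet();
        }
        countDown();
    }

    /** A failed request counts as a response, but never as an active copy. */
    void onFailure() {
        countDown();
    }

    private void countDown() {
        if (awaitingResponses.decrementAndGet() == 0) {
            allNodesResponded();
        }
    }

    private void allNodesResponded() {
        // Proceed only if every node confirmed an active copy, not merely "at least one".
        if (activeCopies.get() != expectedCopies) {
            return;
        }
        deleteShard.run();
    }

    public static void main(String[] args) {
        ShardActiveTally tally = new ShardActiveTally(2, () -> System.out.println("deleting shard"));
        tally.onResponse(true);
        tally.onResponse(true);
    }
}
```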

bleskes · 2014-07-07T10:59:20Z

src/test/java/org/elasticsearch/indices/store/IndicesStoreIntegrationTests.java

+            public void onFailure(String source, Throwable t) {
+            }
+        });
+        waitNoPendingTasksOnAll();


How do we know when the active shard requests have been completed?

We have no mechanism to check that, since it isn't part of the cluster event processing. Doesn't waitForShardDeletion() wait long enough for the active shard requests to have been processed?

bleskes · 2014-07-07T11:05:59Z

src/test/java/org/elasticsearch/indices/store/IndicesStoreTests.java

-                return !shardDirectory(server, index, shard).exists();
+    @Test
+    public void testShardCanBeDeleted_noShardStarted() throws Exception {
+        ClusterState.Builder clusterState = ClusterState.builder(new ClusterName("test"));


maybe randomize the number of shards and their states? (unassigned/initializing/closed)

I think randomizing the number of shards here wouldn't have any impact, because it is never used. This test already checks every non-started state, so that is OK.

I mean a random number of replicas with a random combination of non-active states.
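The randomization suggested above could look roughly like this. Everything here is a simplified assumption for illustration: the `State` enum, the `shardCanBeDeleted` stand-in, and the use of `java.util.Random` instead of the test framework's helpers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class NonActiveStatesSketch {
    /** Hypothetical, simplified routing states; not the real Elasticsearch enum. */
    enum State { UNASSIGNED, INITIALIZING, STARTED }

    /** Simplified stand-in for the check under test: deletable only if every copy is started. */
    static boolean shardCanBeDeleted(List<State> copies) {
        if (copies.isEmpty()) {
            return false;
        }
        for (State state : copies) {
            if (state != State.STARTED) {
                return false;
            }
        }
        return true;
    }

    /** One primary plus 0..4 replicas, each in a randomly chosen non-active state. */
    static List<State> randomNonActiveCopies(Random random) {
        int numCopies = 1 + random.nextInt(5);
        List<State> copies = new ArrayList<>();
        for (int i = 0; i < numCopies; i++) {
            copies.add(random.nextBoolean() ? State.UNASSIGNED : State.INITIALIZING);
        }
        return copies;
    }

    public static void main(String[] args) {
        List<State> copies = randomNonActiveCopies(new Random());
        System.out.println(copies + " deletable: " + shardCanBeDeleted(copies));
    }
}
```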

kimchy · 2014-07-07T11:21:49Z

left a minor comment, LGTM

martijnvg · 2014-07-07T14:28:35Z

@bleskes Applied the latest feedback.

bleskes · 2014-07-08T08:55:35Z

src/test/java/org/elasticsearch/indices/store/IndicesStoreTests.java

-                return !shardDirectory(server, index, shard).exists();
+    @Test
+    public void testShardCanBeDeleted_noShardStarted() throws Exception {
+        int numShards = randomInt(7);


should be 1+ randomInt(), no?
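To illustrate the off-by-one concern: assuming the test helper `randomInt(max)` returns a value in the inclusive range 0 to max, `randomInt(7)` can yield zero shards, and the later `randomInt(numShards - 1)` would then be asked for a negative bound. The sketch below uses `java.util.Random` as a stand-in for that helper; the method names are illustrative only.

```java
import java.util.Random;

final class RandomShardCountSketch {
    /** Stand-in for the test helper, assumed to return a value in [0, max]. */
    static int randomInt(Random random, int max) {
        return random.nextInt(max + 1);
    }

    /** Shifting by one guarantees at least one shard, giving a value in [1, 8]. */
    static int randomNumShards(Random random) {
        return 1 + randomInt(random, 7);
    }

    public static void main(String[] args) {
        Random random = new Random();
        int numShards = randomNumShards(random);
        // Safe: numShards - 1 is never negative.
        int localShardId = randomInt(random, numShards - 1);
        System.out.println(numShards + " shards, local shard " + localShardId);
    }
}
```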

bleskes · 2014-07-09T09:50:34Z

src/test/java/org/elasticsearch/indices/store/IndicesStoreTests.java

+        IndexShardRoutingTable.Builder routingTable = new IndexShardRoutingTable.Builder(new ShardId("test", 1), false);
+        int localShardId = randomInt(numShards - 1);
+        for (int i = 0; i < numShards; i++) {
+            String nodeId = i == localShardId ? localNode.getId() : randomBoolean() ? "abc" : "xyz";


I think we can also randomly check a shard that is relocating to the local node.

bleskes · 2014-07-09T09:53:08Z

LGTM - left a couple of minor improvement suggestions. Having these unit tests is awesome.

martijnvg changed the title ~~Before deleting shard verify that another node holds an active shard instance~~ Store: Before deleting shard verify that another node holds an active shard instance Jul 2, 2014

jpountz assigned martijnvg Jul 2, 2014

bleskes reviewed Jul 3, 2014

kimchy reviewed Jul 3, 2014

kimchy reviewed Jul 4, 2014

kimchy reviewed Jul 7, 2014

kimchy added the resiliency label Jul 7, 2014

bleskes reviewed Jul 7, 2014

bleskes reviewed Jul 8, 2014

martijnvg added v1.3.0 labels Jul 8, 2014

bleskes reviewed Jul 9, 2014

Store: Before removing shard physically from disk verify that another…

9abb7c4

… node in the cluster actually holds an active shard copy. Closes elastic#6692

martijnvg added a commit that referenced this pull request Jul 9, 2014

Store: Before removing shard physically from disk verify that another…

f9ec4e1

… node in the cluster actually holds an active shard copy. Closes #6692

martijnvg merged commit 9abb7c4 into elastic:master Jul 9, 2014

clintongormley changed the title ~~Store: Before deleting shard verify that another node holds an active shard instance~~ Resiliency: Before deleting shard verify that another node holds an active shard instance Jul 16, 2014

clintongormley added the enhancement label Jul 16, 2014

s1monw mentioned this pull request Aug 21, 2014

Internal: Upgrade caused shard data to stay on nodes #7386

Closed

martijnvg deleted the improvements/shard-exists-elsewhere branch May 18, 2015 23:31

clintongormley added the :Distributed/Store label Jun 7, 2015

clintongormley changed the title ~~Resiliency: Before deleting shard verify that another node holds an active shard instance~~ Before deleting shard verify that another node holds an active shard instance Jun 7, 2015

clintongormley added :Distributed/Engine and removed :Distributed/Store labels Feb 13, 2018
