Remove IndexCloseListener & Store.OnCloseListener #9009
Conversation
elastic#8436 introduced shard-level locks in order to prevent directory reuse during fast deletion and creation of indices. As part of that change, close listeners were introduced to delete the folders once all outstanding references were released. The new change has created race conditions causing shard folders not to be deleted (causing test failures due to leftover corruption markers). This commit removes the listeners in favor of a simple timeout-based solution to be used until a better listener-based solution is ready (elastic#8608).
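A minimal sketch of the timeout-based approach this commit describes, using a plain `ReentrantLock` in place of the real shard-lock infrastructure (all names here are hypothetical, not the actual Elasticsearch API): instead of waiting on a close listener, the deleter tries to acquire the shard lock for a bounded time and deletes the folder only if the lock is obtained.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of a timeout-based delete: acquire the shard lock
// within a bounded wait, delete under the lock, otherwise give up.
public class ShardDeleteSketch {
    private final ReentrantLock shardLock = new ReentrantLock();

    public boolean deleteShardFolder(long timeout, TimeUnit unit) {
        try {
            if (shardLock.tryLock(timeout, unit)) {
                try {
                    // delete the shard directory here; the lock guarantees
                    // no one else is using it at this moment
                    return true;
                } finally {
                    shardLock.unlock();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // timeout elapsed: the folder is left behind (the known downside
        // this thread discusses)
        return false;
    }

    public static void main(String[] args) {
        ShardDeleteSketch sketch = new ShardDeleteSketch();
        System.out.println(sketch.deleteShardFolder(30, TimeUnit.SECONDS));
    }
}
```

The trade-off visible in the sketch is exactly what the reviewers debate below: when `tryLock` times out, nothing retries the deletion.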
@@ -159,8 +164,12 @@ public GatewayMetaState(Settings settings, ThreadPool threadPool, NodeEnvironmen

this.autoImportDangled = AutoImportDangledState.fromString(settings.get(GATEWAY_AUTO_IMPORT_DANGLED, settings.get(GATEWAY_LOCAL_AUTO_IMPORT_DANGLED, AutoImportDangledState.YES.toString())));
this.danglingTimeout = settings.getAsTime(GATEWAY_DANGLING_TIMEOUT, settings.getAsTime(GATEWAY_LOCAL_DANGLING_TIMEOUT, TimeValue.timeValueHours(2)));
this.deleteTimeout = settings.getAsTime(GATEWAY_DELETE_TIMEOUT, TimeValue.timeValueSeconds(30));
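The diff reads each setting with a fallback chain: the new `GATEWAY_*` key first, then the legacy `GATEWAY_LOCAL_*` key, and finally a hard-coded default. A minimal sketch of that chaining pattern (plain Java with a `Map`, not the real `Settings` class; key names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of default-chained settings lookup: each get() call
// falls back to the value produced by the next lookup in the chain.
public class SettingsFallbackSketch {
    private final Map<String, String> settings = new HashMap<>();

    public void put(String key, String value) {
        settings.put(key, value);
    }

    public String getAsTime(String key, String defaultValue) {
        return settings.getOrDefault(key, defaultValue);
    }

    public static void main(String[] args) {
        SettingsFallbackSketch s = new SettingsFallbackSketch();
        // neither key is set, so the chained lookup ends at the default
        String danglingTimeout = s.getAsTime("gateway.dangling_timeout",
                s.getAsTime("gateway.local.dangling_timeout", "2h"));
        System.out.println(danglingTimeout);
    }
}
```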
How did this arrive at 30 seconds?
That's a good question. The delete-index request has an ack timeout of 30s by default, and ideally you would wait longer than that so the delete is not acked (we could use a different mechanism to make sure it's not acked, but this change is meant to be small and temporary). On the other hand, we don't want to delay cluster state processing beyond the 30s the master waits, to avoid having multiple cluster states in flight. I agree this is not ideal at all, but I felt it was good enough as a temporary measure until a proper solution is in place.
LGTM
elastic#8436 introduced shard-level locks in order to prevent directory reuse during fast deletion and creation of indices. As part of that change, close listeners were introduced to delete the folders once all outstanding references were released. The new change has created race conditions causing shard folders not to be deleted (causing test failures due to leftover corruption markers). This commit removes the listeners in favour of a simple timeout-based solution to be used until a better listener-based solution is ready (elastic#8608). Closes elastic#9009
This solution is not fixing the problem. Actually it makes it even worse IMO. Previously a test was failing since the shard was still in use, which is the right thing btw. Now we wait for a timeout and then do nothing if we can't lock the shard. I really don't like this solution at all, since it re-introduces a buggy mechanism that we fixed with ref-counting and deleting files under locks. Now we have a race condition where shard data will never be deleted if we run into timeouts. To be honest, I don't understand why we went this way, or rather why we removed all the infrastructure addressing the race that #8608 tried to fix.
FWIW, there was a misunderstanding on my end about why this was removed and what it was waiting for. The last comment might have been too rough tone-wise, which was mainly due to miscommunication. This was basically a temporary workaround until #8608 was fixed, which I didn't realize, although it's actually clearly written in the description. Sorry about that :)
Example test failure can be found at: http://build-us-00.elasticsearch.org/job/es_core_master_metal/6120/