
Remove IndexCloseListener & Store.OnCloseListener #9009

Closed
wants to merge 1 commit

Conversation

bleskes
Contributor

@bleskes commented Dec 19, 2014

#8436 introduced shard-level locks in order to prevent directory reuse during fast deletion & creation of indices. As part of the change, close listeners were introduced to delete the folders once all outstanding references were released. The new change has created race conditions causing shard folders not to be deleted (causing test failures due to left-over corruption markers). This commit removes the listeners in favour of a simple timeout-based solution to be used until a better listener-based solution is ready (#8608).
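For illustration only, here is a minimal plain-Java sketch of the timeout-based approach described above. The class, method, and lock names are made up; the actual change relies on Elasticsearch's internal shard locks rather than a ReentrantLock.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.stream.Stream;

class TimeoutBasedShardDelete {

    // Stand-in for the per-shard lock that open references hold.
    private final ReentrantLock shardLock = new ReentrantLock();

    // Waits up to timeoutMillis for the shard lock; if acquired, deletes the
    // shard directory recursively. Returns false (and leaves the directory in
    // place) if the shard is still referenced when the timeout expires.
    boolean deleteShardDirectory(Path shardPath, long timeoutMillis) throws InterruptedException, IOException {
        if (!shardLock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return false;
        }
        try (Stream<Path> paths = Files.walk(shardPath)) {
            // Delete children before parents.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } finally {
            shardLock.unlock();
        }
        return true;
    }
}

The listener-based variant (#8608) would instead register a callback that deletes the directory once the last outstanding reference releases the lock; the sketch above trades that for a bounded wait.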

Example test failure can be found at: http://build-us-00.elasticsearch.org/job/es_core_master_metal/6120/

@@ -159,8 +164,12 @@ public GatewayMetaState(Settings settings, ThreadPool threadPool, NodeEnvironmen

this.autoImportDangled = AutoImportDangledState.fromString(settings.get(GATEWAY_AUTO_IMPORT_DANGLED, settings.get(GATEWAY_LOCAL_AUTO_IMPORT_DANGLED, AutoImportDangledState.YES.toString())));
this.danglingTimeout = settings.getAsTime(GATEWAY_DANGLING_TIMEOUT, settings.getAsTime(GATEWAY_LOCAL_DANGLING_TIMEOUT, TimeValue.timeValueHours(2)));
this.deleteTimeout = settings.getAsTime(GATEWAY_DELETE_TIMEOUT, TimeValue.timeValueSeconds(30));
Member

How did this arrive at 30 seconds?

Contributor Author

That's a good question. The delete-index request has an ack timeout of 30s (default), and ideally you would like to wait longer than that so the delete is not acked (we could use a different mechanism to make sure it's not acked, but this change is meant to be small and temporary). On the other hand, we shouldn't delay cluster state processing beyond the 30s the master waits, to avoid having multiple cluster states in flight. I agree this is not ideal at all, but I felt it was good enough as a temporary measure until a proper solution is in place.
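To make the trade-off concrete, a hypothetical sketch of the two timeouts being balanced; the names and values are assumptions drawn from this comment and the diff above, not from Elasticsearch code.

class DeleteTimeoutChoice {
    public static void main(String[] args) {
        // Default ack timeout of a delete-index request (30s), as mentioned above.
        final long deleteIndexAckTimeoutMillis = 30_000;
        // GATEWAY_DELETE_TIMEOUT default picked in this change: how long a node
        // waits for the shard lock before giving up on deleting the shard folder.
        final long shardDeleteTimeoutMillis = 30_000;
        // Waiting less than the ack timeout risks acking the delete while shard
        // files are still on disk; waiting much longer holds up cluster state
        // processing past the ~30s the master waits for acks.
        System.out.println("compromise: both timeouts default to "
                + shardDeleteTimeoutMillis + " ms and " + deleteIndexAckTimeoutMillis + " ms");
    }
}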

@dakrone
Member

dakrone commented Dec 19, 2014

LGTM

@bleskes closed this in 4d699bd Dec 19, 2014
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Dec 19, 2014
elastic#8436 introduced shard-level locks in order to prevent directory reuse during fast deletion & creation of indices. As part of the change, close listeners were introduced to delete the folders once all outstanding references were released. The new change has created race conditions causing shard folders not to be deleted (causing test failures due to left-over corruption markers). This commit removes the listeners in favour of a simple timeout-based solution to be used until a better listener-based solution is ready (elastic#8608).

Closes elastic#9009
@bleskes deleted the delete_shard_listeners branch December 19, 2014 22:33
@s1monw
Contributor

s1monw commented Dec 29, 2014

This solution is not fixing the problem. Actually, it makes it even worse IMO. Previously a test was failing since the shard was still in use, which is the right thing btw. Now we wait for a timeout and then do nothing if we can't lock the shard. I really don't like this solution at all since it re-introduces a buggy mechanism that we fixed with ref-counting and deleting files under locks. Now we have a race condition where shard data will never be deleted if we run into timeouts. I don't understand, to be honest, why we went this way, or rather why we removed all the infrastructure to fix the race that #8608 tried to fix.

@s1monw
Contributor

s1monw commented Jan 6, 2015

FWIW, there was a misunderstanding on my end about why this was removed and what it was waiting for. The last comment might have been too rough tone-wise, which was mainly due to miscommunication. This was basically a temporary workaround until #8608 is fixed, which I didn't realize even though it's actually clearly written in the description. Sorry about that :)

@clintongormley changed the title from "Internal: remove IndexCloseListener & Store.OnCloseListener" to "Remove IndexCloseListener & Store.OnCloseListener" Jun 7, 2015