Remove IndexCloseListener & Store.OnCloseListener #9009
Conversation
elastic#8436 introduced shard-level locks in order to prevent directory reuse during fast deletion and creation of indices. As part of that change, close listeners were introduced to delete the folders once all outstanding references were released. The new change has created race conditions causing shard folders not to be deleted (causing test failures due to leftover corruption markers). This commit removes the listeners in favor of a simple timeout-based solution to be used until a better listener-based solution is ready (elastic#8608).
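A minimal sketch of the timeout-based approach this commit describes, using a plain `ReentrantLock` in place of the real shard-lock infrastructure (all names here are hypothetical, not the actual Elasticsearch API): instead of waiting on a close listener, the deleter tries to acquire the shard lock for a bounded time and deletes the folder only if the lock is obtained.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of a timeout-based delete: acquire the shard lock
// within a bounded wait, delete under the lock, otherwise give up.
public class ShardDeleteSketch {
    private final ReentrantLock shardLock = new ReentrantLock();

    public boolean deleteShardFolder(long timeout, TimeUnit unit) {
        try {
            if (shardLock.tryLock(timeout, unit)) {
                try {
                    // delete the shard directory here; the lock guarantees
                    // no one else is using it at this moment
                    return true;
                } finally {
                    shardLock.unlock();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // timeout elapsed: the folder is left behind (the known downside
        // this thread discusses)
        return false;
    }

    public static void main(String[] args) {
        ShardDeleteSketch sketch = new ShardDeleteSketch();
        System.out.println(sketch.deleteShardFolder(30, TimeUnit.SECONDS));
    }
}
```

The trade-off visible in the sketch is exactly what the reviewers debate below: when `tryLock` times out, nothing retries the deletion.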
@@ -159,8 +164,12 @@ public GatewayMetaState(Settings settings, ThreadPool threadPool, NodeEnvironmen

this.autoImportDangled = AutoImportDangledState.fromString(settings.get(GATEWAY_AUTO_IMPORT_DANGLED, settings.get(GATEWAY_LOCAL_AUTO_IMPORT_DANGLED, AutoImportDangledState.YES.toString())));
this.danglingTimeout = settings.getAsTime(GATEWAY_DANGLING_TIMEOUT, settings.getAsTime(GATEWAY_LOCAL_DANGLING_TIMEOUT, TimeValue.timeValueHours(2)));
this.deleteTimeout = settings.getAsTime(GATEWAY_DELETE_TIMEOUT, TimeValue.timeValueSeconds(30));
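The diff reads each setting with a fallback chain: the new `GATEWAY_*` key first, then the legacy `GATEWAY_LOCAL_*` key, and finally a hard-coded default. A minimal sketch of that chaining pattern (plain Java with a `Map`, not the real `Settings` class; key names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of default-chained settings lookup: each get() call
// falls back to the value produced by the next lookup in the chain.
public class SettingsFallbackSketch {
    private final Map<String, String> settings = new HashMap<>();

    public void put(String key, String value) {
        settings.put(key, value);
    }

    public String getAsTime(String key, String defaultValue) {
        return settings.getOrDefault(key, defaultValue);
    }

    public static void main(String[] args) {
        SettingsFallbackSketch s = new SettingsFallbackSketch();
        // neither key is set, so the chained lookup ends at the default
        String danglingTimeout = s.getAsTime("gateway.dangling_timeout",
                s.getAsTime("gateway.local.dangling_timeout", "2h"));
        System.out.println(danglingTimeout);
    }
}
```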
How did this arrive at 30 seconds?
That's a good question. The delete-index request has an ack timeout of 30s by default, and ideally you would wait longer than that so the delete is not acked (we could use a different mechanism to make sure it's not acked, but this change is meant to be small and temporary). On the other hand, we don't want to delay cluster state processing beyond the 30s the master waits, to avoid having multiple cluster states in flight. I agree this is not ideal at all, but I felt it was good enough as a temporary measure until a proper solution is in place.
LGTM
elastic#8436 introduced shard-level locks in order to prevent directory reuse during fast deletion and creation of indices. As part of that change, close listeners were introduced to delete the folders once all outstanding references were released. The new change has created race conditions causing shard folders not to be deleted (causing test failures due to leftover corruption markers). This commit removes the listeners in favour of a simple timeout-based solution to be used until a better listener-based solution is ready (elastic#8608). Closes elastic#9009
This solution is not fixing the problem. Actually it makes it even worse IMO. Previously a test was failing since the shard was still in use, which is the right thing btw. Now we wait for a timeout and then do nothing if we can't lock the shard. I really don't like this solution at all, since it re-introduces a buggy mechanism that we fixed with ref-counting and deleting files under locks. Now we have a race condition where shard data will never be deleted if we run into timeouts. To be honest, I don't understand why we went this way, or rather why we removed all the infrastructure addressing the race that #8608 tried to fix.
FWIW, there was a misunderstanding on my end about why this was removed and what it was waiting for. The last comment might have been too rough tone-wise, which was mainly due to miscommunication. This was basically a temporary workaround until #8608 was fixed, which I didn't realize, although it's actually clearly written in the description. Sorry about that :)
Example test failure can be found at: http://build-us-00.elasticsearch.org/job/es_core_master_metal/6120/