
Delete shard content under lock #9083

Merged

1 commit merged into elastic:master on Jan 6, 2015


s1monw (Contributor) commented Dec 29, 2014

Once we delete an index on a node, we close all of its resources and then need to delete the contents of all its shards from disk. Today this deletion happens under a lock (the shard lock), which must be acquired before any operation on a shard's data path. We try to delete all of the index metadata once we have acquired all of the shard locks, but this operation can run into a timeout, which leaves the index on disk. Further, if the timeout is reached, all shard data is left on disk as well.

This commit removes a shard's data just before its shard lock is released, as the last operation on a shard that belongs to a deleted index.

Supersedes #8608 and relates to #9009.
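One way to picture the change: the deletion moves into the lock's release path, so the shard's files are removed by the code that already holds the lock, instead of by a separate step that must first acquire every shard lock and can time out. The following is a minimal sketch under assumptions, not the Elasticsearch implementation: a `ReentrantLock` stands in for the real shard lock, and `SketchShardLock` and `markIndexDeleted` are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/**
 * Hypothetical stand-in for a shard lock (NOT the Elasticsearch ShardLock class).
 * While it is held, no other code may touch the shard's data path.
 */
class SketchShardLock implements AutoCloseable {
    private final ReentrantLock lock = new ReentrantLock();
    private final Path shardPath;
    private volatile boolean indexDeleted = false;

    SketchShardLock(Path shardPath) {
        this.shardPath = shardPath;
        lock.lock(); // held for the lifetime of this object
    }

    /** Records that the owning index was deleted; the files are removed on release. */
    void markIndexDeleted() {
        indexDeleted = true;
    }

    boolean isHeldByCurrentThread() {
        return lock.isHeldByCurrentThread();
    }

    @Override
    public void close() throws IOException {
        try {
            if (indexDeleted && Files.exists(shardPath)) {
                // Last operation while the lock is still held: nobody else can be using these files.
                List<Path> entries;
                try (Stream<Path> walk = Files.walk(shardPath)) {
                    // Reverse lexicographic order visits children before their parent directory.
                    entries = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
                }
                for (Path p : entries) {
                    Files.delete(p);
                }
            }
        } finally {
            lock.unlock(); // released only after the deletion attempt has finished
        }
    }
}
```

Used with try-with-resources, a caller would call `markIndexDeleted()` inside the block and let `close()` do the rest; because there is no second lock acquisition, there is no timeout that could leave the data behind.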

}
logger.trace("deleted shard {} directory, paths: [{}]", shardId, paths);
assert FileSystemUtils.exists(paths) == false;
IOUtils.rm(paths);
Review comment (Member):

Isn't this redundant, since the exists check just returned false?

rjernst (Member) commented Dec 30, 2014

LGTM, but my review was cursory as I am not very familiar with this code.

} catch (Throwable e) {
logger.debug("[{}] failed to close translog", e, shardId);
// ignore
logger.warn("[{}] failed to close store on shard deletion", e, shardId);
Review comment (Member):

Closing the store doesn't necessarily mean the shard is being deleted. I tested this, and this code path can also be reached when the index is closed, so I think the message should be "failed to close store on shard closing".

Review comment (Member):

or "failed to close store on shard removal"
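The reviewer's point can be illustrated with a sketch in which the store is closed on every removal path, while the files are deleted only when the index itself was deleted; a warning written in that shared path therefore should not mention deletion. Everything here is hypothetical: `RemovalReason`, `ShardRemoval` and the list used in place of a logger are illustration-only names, and the real code catches `Throwable` rather than `IOException`.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reasons a shard may be removed from a node. */
enum RemovalReason { INDEX_CLOSED, INDEX_DELETED }

/** Hypothetical shard-removal path; a List stands in for a real logger. */
class ShardRemoval {
    final List<String> warnings = new ArrayList<>();
    boolean dataDeleted = false;

    void removeShard(Closeable store, RemovalReason reason) {
        try {
            store.close();
        } catch (IOException e) {
            // Runs for every removal reason, so the message must not claim a deletion.
            warnings.add("failed to close store on shard removal: " + e.getMessage());
        }
        if (reason == RemovalReason.INDEX_DELETED) {
            // Placeholder: a real implementation would delete the shard's files here, under its lock.
            dataDeleted = true;
        }
    }
}
```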

dakrone (Member) commented Jan 2, 2015

Left some comments. Also, with this change the CloseableComponent class can be removed entirely.

s1monw (Contributor Author) commented Jan 2, 2015

> Left some comments. Also, with this change the CloseableComponent class can be removed entirely.

Yeah, I removed it now :) I had actually missed that!

s1monw (Contributor Author) commented Jan 2, 2015

@dakrone I pushed new commits.

dakrone (Member) commented Jan 2, 2015

LGTM

* @throws IOException if an IOException occurs
*/
public void deleteShardDirectorySafe(ShardLock lock, @IndexSettings Settings indexSettings) throws IOException {
assert indexSettings != ImmutableSettings.EMPTY;
Review comment (Contributor):

Can we check, and throw an exception if the lock is not acquired? Maybe also rename the method to deleteShardDirectoryUnderLock, not sure. We should also document that the shard is expected to be locked.
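A sketch of what the reviewer asks for: a method named as suggested that refuses to run unless the caller already holds the lock, and whose Javadoc states that precondition. This is an illustration under assumptions, not the merged code: it takes a `ReentrantLock` (whose `isHeldByCurrentThread()` makes the check easy) and a single `Path`, whereas the real method receives a `ShardLock` and index settings.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ShardDirectories {
    /**
     * Deletes the shard directory. The caller must already hold {@code shardLock};
     * this method never acquires it. (Hypothetical signature for illustration.)
     *
     * @throws IllegalStateException if the current thread does not hold the lock
     * @throws IOException if a file cannot be deleted
     */
    static void deleteShardDirectoryUnderLock(ReentrantLock shardLock, Path shardPath) throws IOException {
        if (!shardLock.isHeldByCurrentThread()) {
            throw new IllegalStateException("shard lock not held for " + shardPath);
        }
        if (Files.notExists(shardPath)) {
            return; // nothing left to delete
        }
        List<Path> entries;
        try (Stream<Path> walk = Files.walk(shardPath)) {
            // Reverse lexicographic order visits children before their parent directory.
            entries = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : entries) {
            Files.delete(p);
        }
    }
}
```

An explicit exception is used rather than a Java `assert`, because assertions are only checked when the JVM runs with `-ea`.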

bleskes (Contributor) commented Jan 5, 2015

I like it. Much simpler. I think we can remove the timeout on delete that was introduced temporarily in #9009, but if we do so, we need to try to delete all the shard folders that are not locked, instead of trying to acquire all locks and then deleting them all together (or not at all). This is needed for shards that were just relocated away (and their locks and in-memory registration released) but not yet deleted from disk.
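bleskes' suggestion can be sketched as a single pass that tries each shard lock without waiting: shards whose lock is free are deleted right away under that lock, and shards that are still locked are skipped instead of holding up the whole index deletion behind a timeout. This is a minimal sketch under assumptions: shard ids are plain `Integer`s, each shard lock is a `ReentrantLock`, and `deleteUnlockedShards` and its `deleteShard` callback are hypothetical names, not Elasticsearch code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

class IndexFolderCleaner {
    /**
     * Tries each shard lock once, without waiting. Shards whose lock is free are
     * deleted while the lock is held; locked shards are skipped and returned so a
     * later pass (or the lock holder) can clean them up.
     */
    static List<Integer> deleteUnlockedShards(Map<Integer, ReentrantLock> shardLocks,
                                              Consumer<Integer> deleteShard) {
        List<Integer> skipped = new ArrayList<>();
        for (Map.Entry<Integer, ReentrantLock> e : shardLocks.entrySet()) {
            ReentrantLock lock = e.getValue();
            if (lock.tryLock()) { // non-blocking: no timeout involved
                try {
                    deleteShard.accept(e.getKey());
                } finally {
                    lock.unlock();
                }
            } else {
                skipped.add(e.getKey()); // still in use elsewhere
            }
        }
        return skipped;
    }
}
```

`ReentrantLock.tryLock()` also succeeds when the calling thread already holds the lock, so in this sketch a "locked" shard means one held by another thread.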

s1monw (Contributor Author) commented Jan 6, 2015

I applied changes addressing your comments. Can you take another look?

> I like it. Much simpler. I think we can remove the timeout on delete that was introduced temporarily in #9009, but if we do so, we need to try to delete all the shard folders that are not locked, instead of trying to acquire all locks and then deleting them all together (or not at all). This is needed for shards that were just relocated away (and their locks and in-memory registration released) but not yet deleted from disk.

can we do this in a different change?

bleskes (Contributor) commented Jan 6, 2015

LGTM. thx.

> can we do this in a different change?

I'll do it. OK.

s1monw merged commit 7ec8973 into elastic:master on Jan 6, 2015
clintongormley added the :Core/Infra/Core label and removed the review label on Mar 19, 2015
clintongormley changed the title from "[CORE] Delete shard content under lock" to "Delete shard content under lock" on Jun 6, 2015
5 participants