Locking in NodeEnvironment is completely broken #13264
Comments
So we need to clarify the naming here: we actually delete under lock, but it's not the IW lock, it's a shard lock that we maintain internally per Node.
public void deleteShardDirectorySafe(ShardId shardId, @IndexSettings Settings indexSettings) throws IOException {
    // This is to ensure someone doesn't use Settings.EMPTY
    assert indexSettings != Settings.EMPTY;
    final Path[] paths = availableShardPaths(shardId);
    logger.trace("deleting shard {} directory, paths: [{}]", shardId, paths);
    try (ShardLock lock = shardLock(shardId)) { // <==== here we lock and keep the lock - it's JVM internal
        deleteShardDirectoryUnderLock(lock, indexSettings);
    }
} the
I am not sure if the Javadoc happened, but it needs clarification.
+1 to make it a real check where it makes sense...
+1 to beef it up.
Yes, I remember this, but now is our chance to fix it, so locking is as good as we can make it. It seems we are broken because the lock files are placed underneath what is deleted, but that is something fixable. It's 2.0, there is no constraint about back compat here, so I think it's time to fix it correctly. Additionally, we spent lots of time and added lots of paranoia in Lucene to actually help with shitty behavior from shared filesystems, so it would be nice if it stands a chance. As for the shard lock, I have no idea what that is. How is it better than a filesystem lock? It's definitely got a shitload of abstractions, but I can't tell if it's anything more than an in-process RWL.
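To make the distinction in the question above concrete, here is a minimal sketch of what a purely in-process shard lock amounts to. It is not the NodeEnvironment code: the class `InProcessShardLocks`, its `Handle` type and the string shard key are all hypothetical names made up for illustration. The point it shows is that such a lock only excludes other threads in the same JVM and does nothing against a second process working on the same directory.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a JVM-internal per-shard lock (not the real ES classes).
class InProcessShardLocks {
    // Hypothetical handle; closing it releases the lock, so it works with try-with-resources.
    interface Handle extends AutoCloseable {
        @Override
        void close();
    }

    // One single-permit semaphore per shard key, shared by all threads in this JVM.
    private final Map<String, Semaphore> locks = new ConcurrentHashMap<>();

    // Returns a handle if the lock was free, or null if another holder in this JVM has it.
    Handle tryLock(String shardKey) {
        Semaphore semaphore = locks.computeIfAbsent(shardKey, k -> new Semaphore(1));
        if (!semaphore.tryAcquire()) {
            return null;
        }
        // Guard against a double close handing out a second permit.
        AtomicBoolean released = new AtomicBoolean(false);
        return () -> {
            if (released.compareAndSet(false, true)) {
                semaphore.release();
            }
        };
    }
}
```

Nothing here touches the filesystem, which is why a lock like this cannot replace a filesystem lock when a second node process (or a shared filesystem) is in play.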
I think we should make this straightforward and add a
Yeah, I think something along those lines, though is 'never deleted' a problem for people that have tons and tons of shards cycling through? Accumulating a bunch of 0-byte files sounds dangerous, and eventually the directory is gonna crap its pants. Deleting an NIOFS lock file is especially tricky and we just don't do it ever in Lucene (we leave the lock file around). I don't know how to fix that without adding a "master" lock file that always stays around and is acquired around each individual lock acquire/release+delete.
Yeah, we won't have a way around that, I guess. I think what we can do is to have an
Why do that? Just have global.lock. It's only needed around the actual acquire and release+delete. It's not gonna cause a concurrency issue. Doing this in a more fine-grained way makes zero sense.
Having an index-level lock makes sense here anyway, since we also have index metadata we want to protect from concurrent modifications. All I was saying here is that we might be able to get away with not locking the global lock as long as we are in the context of an index.
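The global.lock idea from the comments above could look roughly like the following sketch. It uses plain java.nio file locks rather than Lucene's lock factories, and every name in it (`GlobalLockSketch`, `Held`, `acquire`, `releaseAndDelete`) is hypothetical. The assumptions are: global.lock is never deleted, it is held only for the brief acquire and release+delete steps, and there is one `GlobalLockSketch` instance per lock directory in a JVM.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a permanent global.lock guards acquire and release+delete
// of individual lock files, so the individual files can be removed safely.
class GlobalLockSketch {
    // Hypothetical holder for an acquired per-shard lock.
    static final class Held {
        final Path file;
        final FileChannel channel;
        final FileLock lock;

        Held(Path file, FileChannel channel, FileLock lock) {
            this.file = file;
            this.channel = channel;
            this.lock = lock;
        }
    }

    private final Path globalLockFile; // stays around forever, never deleted
    // Tracked in-process because closing a second channel on a file can drop
    // the OS-level lock this JVM already holds on it (see the FileLock docs).
    private final Set<Path> heldInThisJvm = new HashSet<>();

    GlobalLockSketch(Path lockDir) {
        this.globalLockFile = lockDir.resolve("global.lock");
    }

    private static FileChannel open(Path file) throws IOException {
        return FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    // synchronized: serializes threads in this JVM; global.lock serializes processes.
    synchronized Held acquire(Path shardLockFile) throws IOException {
        Path key = shardLockFile.toAbsolutePath().normalize();
        if (heldInThisJvm.contains(key)) {
            throw new IOException("lock already held in this JVM: " + key);
        }
        try (FileChannel global = open(globalLockFile); FileLock guard = global.lock()) {
            FileChannel channel = open(key);
            FileLock lock = null;
            try {
                lock = channel.tryLock();
            } finally {
                if (lock == null) {
                    channel.close(); // do not leak the channel on failure
                }
            }
            if (lock == null) {
                throw new IOException("lock held by another process: " + key);
            }
            heldInThisJvm.add(key);
            return new Held(key, channel, lock);
        }
    }

    synchronized void releaseAndDelete(Held held) throws IOException {
        try (FileChannel global = open(globalLockFile); FileLock guard = global.lock()) {
            held.lock.release();
            held.channel.close();
            // Safe only because nobody can be mid-acquire while global.lock is held.
            Files.deleteIfExists(held.file);
        } finally {
            heldInThisJvm.remove(held.file);
        }
    }
}
```

Because the global lock is taken only around these two short steps, it should not become a contention point, which is the argument made above for not going more fine-grained.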
After the Lucene 5.3 upgrade, I looked at how ES uses Lucene's filesystem locking. Most places are OK, obtaining a lock and doing stuff in a try/finally. However, NodeEnvironment is a totally different story. Can we fix the use of locking here?

deleteShardDirectorySafe is anything but safe. It calls deleteShardDirectoryUnderLock, which doesn't actually delete under a lock either!!!! It calls this bogus method: acquireFSLockForPaths, which acquires then releases locks. Why? Why? Why? assertEnvIsLocked is only called under assert. Why? Look at findAllIndices: it's about to do something really expensive, so why can't the call to ensureValid be a real check? assertEnvIsLocked has a bunch of leniency. Why in the hell would it return true when closed or when there are no locks at all? That's broken.

After this stuff is fixed, any places here doing heavy operations (e.g. N filesystem operations) should seriously consider calling ensureValid on any locks that are supposed to be held. It means you do N+1 operations or whatever, but man, if what we are doing is not important, then why are we using fs locks?
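As a rough illustration of what is being asked for, the sketch below holds a filesystem lock for the whole delete and does a real, non-assert validity check before the N filesystem operations. It is not the NodeEnvironment fix: `DeleteUnderLockSketch` and its methods are hypothetical, it uses plain java.nio instead of Lucene's Lock, and `FileLock.isValid()` is only a stand-in for the stricter checks a call like ensureValid is meant to perform. It also assumes the lock file lives outside the directory being deleted.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: acquire, keep holding while deleting, release afterwards.
class DeleteUnderLockSketch {
    // A real check that throws, instead of an assert that is skipped without -ea.
    static void ensureHeld(FileLock lock) throws IOException {
        if (lock == null || !lock.isValid()) {
            throw new IOException("lock is not held (closed, released, or never acquired)");
        }
    }

    // Deletes shardDir recursively; the caller must hold the lock for the whole call.
    static void deleteUnderHeldLock(FileLock lock, Path shardDir) throws IOException {
        ensureHeld(lock); // the "+1" operation before the N deletes
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(shardDir)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : paths) {
            Files.delete(path);
        }
    }

    // lockFile is assumed to sit outside shardDir, so deleting never removes the lock.
    static void deleteShardDirectory(Path lockFile, Path shardDir) throws IOException {
        try (FileChannel channel = FileChannel.open(
                lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            // tryLock returns null if another process holds the lock.
            FileLock lock = channel.tryLock();
            if (lock == null) {
                throw new IOException("lock held by another process: " + lockFile);
            }
            try {
                deleteUnderHeldLock(lock, shardDir);
            } finally {
                lock.release(); // released only after the delete has finished
            }
        }
    }
}
```

The contrast with the acquire-then-release pattern criticized above is the try/finally: the lock stays held until the last file is gone, and a missing or invalid lock is an error rather than something that passes.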