
Ensure close is called under lock in the case of an engine failure #5800

Merged
merged 1 commit into elastic:master from close_enging_on_close on Apr 16, 2014

Conversation

@s1monw (Contributor) commented Apr 14, 2014

Until today we closed the engine without acquiring the write lock,
since most callers were still holding a read lock. This commit removes
the code that holds on to the read lock when failing the engine, which
means we can simply call #close().
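To make the locking change concrete, here is a minimal, hypothetical sketch of the pattern the commit enables. This is not the actual InternalEngine code; all names except rwl (which appears in the diff below) are assumptions:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hedged sketch, not the actual InternalEngine code: once callers no longer
// hold on to the read lock across the failure path, failEngine can take the
// write lock itself and simply call #close().
class EngineSketch implements AutoCloseable {
    private final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();
    private volatile boolean closed;

    void failEngine(Throwable failure) {
        // legal now: no caller is still inside the read lock when we get here
        rwl.writeLock().lock();
        try {
            close();              // close is guaranteed to run under the lock
        } finally {
            rwl.writeLock().unlock();
        }
    }

    @Override
    public void close() {
        closed = true;            // the real engine would release the IndexWriter, translog, etc.
    }
}
```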

@s1monw self-assigned this Apr 14, 2014
return true;
}

boolean assertLockIsHelp() {
Contributor:

Typo... should be `assertLockIsHeld`

Contributor Author:

LOL yeah :)
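For reference, a minimal sketch of what the renamed helper could look like, assuming the engine guards its state with the ReentrantReadWriteLock rwl seen elsewhere in this diff; the body is an assumption, not the actual code:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

class LockAssertSketch {
    private final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();

    boolean assertLockIsHeld() {
        // true if the calling thread holds either the read or the write lock
        boolean held = rwl.getReadHoldCount() > 0 || rwl.isWriteLockedByCurrentThread();
        assert held : "neither the read nor the write lock is held by this thread";
        return held;
    }
}
```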

@bleskes (Contributor) commented Apr 15, 2014

I like it! It makes things cleaner. Left some comments.

throw new CreateFailedEngineException(shardId, create, e);
} finally {
rwl.readLock().unlock();
} catch (OutOfMemoryError | IllegalStateException | IOException t) {
Member:

should we just catch Throwable here to be safe?

Member:

also applies to other places in the code below, if we decide to do it.

Contributor Author:

Well, this is what the code used to do. I think we are fine here as it is, to be honest.
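To illustrate the trade-off being discussed, here is a hedged sketch; the exception list mirrors the snippet above, while the surrounding names are assumptions:

```java
import java.io.IOException;

class CatchScopeSketch {
    interface IndexOp { void run() throws IOException; }

    void create(IndexOp op) {
        try {
            op.run();
        } catch (OutOfMemoryError | IllegalStateException | IOException t) {
            // what the code used to do: only this fixed list of types fails the engine
            failEngine(t);
            throw new RuntimeException("create failed", t);
        }
        // the broader alternative raised in review would be catch (Throwable t),
        // which cannot miss an unexpected error type, but also traps things like
        // AssertionError that you may prefer to let propagate untouched
    }

    private void failEngine(Throwable t) { /* close the engine under the write lock */ }
}
```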

@s1monw (Contributor, Author) commented Apr 15, 2014

@bleskes @kimchy thanks guys, I commented and pushed a new commit.

optimizeMutex.set(false);
}

}
// wait for the merges outside of the read lock
if (optimize.waitForMerge()) {
Contributor:

This is an access to the indexWriter without a lock. I think this can lead to an NPE if the shard is closed. I realize it's not part of the change, but I think we should deal with it.

Contributor Author:

I will use a local variable for the indexWriter here...
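A hedged sketch of that fix, assuming a volatile indexWriter field that close() nulls out; the field and method names besides the Lucene API are assumptions:

```java
import org.apache.lucene.index.IndexWriter;

class LocalWriterSketch {
    private volatile IndexWriter indexWriter;   // set to null by close()

    void waitForMergesIfRequested(boolean waitForMerge) {
        // capture the field once: a concurrent close() may null this.indexWriter,
        // but the local reference cannot disappear between the check and the call
        final IndexWriter writer = this.indexWriter;
        if (waitForMerge && writer != null) {
            writer.waitForMerges();   // Lucene 4.x API, as in the diff above
        }
    }
}
```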

@s1monw (Contributor, Author) commented Apr 16, 2014

@bleskes thanks for the review - I pushed another commit

  // wait for the merges outside of the read lock
  if (optimize.waitForMerge()) {
-     indexWriter.waitForMerges();
+     writer.waitForMerges();
Contributor:

I think writer can be null if optimizeMutex is true when the method begins. It seems we have this recurrent pattern of calling ensureOpen and then getting the indexWriter to do something. Perhaps we can change ensureOpen to either throw an exception or to return a (guaranteed) non-null writer. Then this can become ensureOpen().waitForMerges().
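A hedged sketch of that suggestion; the exception type and all names are assumptions, not the actual elasticsearch API:

```java
import org.apache.lucene.index.IndexWriter;

class EnsureOpenSketch {
    private volatile IndexWriter indexWriter;   // set to null by close()

    private IndexWriter ensureOpen() {
        final IndexWriter writer = this.indexWriter;
        if (writer == null) {
            // the real code would throw the engine's own closed exception
            throw new IllegalStateException("engine is closed");
        }
        return writer;   // guaranteed non-null, so callers can chain calls
    }

    void optimize(boolean waitForMerge) {
        if (waitForMerge) {
            ensureOpen().waitForMerges();   // the pattern suggested above
        }
    }
}
```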

@bleskes (Contributor) commented Apr 16, 2014

Thx Simon. Looking good. Left one last comment. I'm +1 on this otherwise.

@s1monw (Contributor, Author) commented Apr 16, 2014

I fixed your last suggestion! Thanks for all the reviews @bleskes. I think it's ready; if you don't object I'd like to rebase and push it.

@bleskes (Contributor) commented Apr 16, 2014

thx. ++1 :)

@s1monw merged commit be14968 into elastic:master Apr 16, 2014
@s1monw deleted the close_enging_on_close branch April 16, 2014 13:36
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Apr 17, 2014
When a replication operation (index/delete/update) fails to be executed properly, we fail the replica and allow the master to allocate a new copy of it. At the moment, the node hosting the primary shard is responsible for notifying the master of a failed replica. However, if the replica shard is initializing (`POST_RECOVERY` state), we have a race condition between the failed shard message and moving the shard into the `STARTED` state. If the latter happens first, the master will fail to resolve the failed shard message.

This PR builds on elastic#5800 and fails the engine of the replica shard if a replication operation fails. This protects us against the above, as the shard will reject the `STARTED` command from the master. It also makes us more resilient to other race conditions in this area.
bleskes added a commit that referenced this pull request Apr 18, 2014
When a replication operation (index/delete/update) fails to be executed properly, we fail the replica and allow the master to allocate a new copy of it. At the moment, the node hosting the primary shard is responsible for notifying the master of a failed replica. However, if the replica shard is initializing (`POST_RECOVERY` state), we have a race condition between the failed shard message and moving the shard into the `STARTED` state. If the latter happens first, the master will fail to resolve the failed shard message.

This commit builds on #5800 and fails the engine of the replica shard if a replication operation fails. This protects us against the above, as the shard will reject the `STARTED` command from the master. It also makes us more resilient to other race conditions in this area.

Closes #5847
bleskes added a commit that referenced this pull request Apr 18, 2014 (same commit message as above)
@clintongormley added the :Distributed/Distributed and :Distributed/Engine labels and removed the :Engine and :Distributed/Distributed labels on Feb 13, 2018
Labels: :Distributed/Engine, >enhancement, v1.2.0, v2.0.0-beta1
4 participants