Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock problems when API flush and finish recovery happens concurrently #9648

Merged
merged 1 commit into from Feb 11, 2015

Conversation

s1monw
Copy link
Contributor

@s1monw s1monw commented Feb 11, 2015

Unfortunately the lock order is important in the current flush code. We have to acquire the readlock fist otherwise
if we are flushing at the end of the recovery while holding the write lock we can deadlock if:

  • Thread 1: flushes via API and gets the flush lock but blocks on the readlock since Thread 2 has the writeLock
  • Thread 2: flushes at the end of the recovery holding the writeLock and blocks on the flushLock owned by Thread 2

This commit acquires the read lock first which would be done further down anyway for the time of the flush.
As a sideeffect we can now safely flush on calling close() while holding the writeLock.

@s1monw
Copy link
Contributor Author

s1monw commented Feb 11, 2015

NOTE: released code is not affected since we added the flush at the end of recovery only in 1.5

}
if (waitIfOngoing == false) {
if (flushLock.tryLock() == false) {
flushing.decrementAndGet();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to throw an exception here, just as we do above with the currentFlushing check.

@bleskes
Copy link
Contributor

bleskes commented Feb 11, 2015

Left some small comments o.w. looks good.

@s1monw
Copy link
Contributor Author

s1monw commented Feb 11, 2015

@bleskes I simplified the exception logic a bit and removed the flush counter.

if (flushLock.tryLock() == false) {
// if we can't get the lock right away we block if needed otherwise barf
if (waitIfOngoing) {
flushLock.lock();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we want a trace message here...

@bleskes
Copy link
Contributor

bleskes commented Feb 11, 2015

LGTM. Left one minor comment.

@s1monw
Copy link
Contributor Author

s1monw commented Feb 11, 2015

@bleskes added some more traces there :)

…pens concurrently

Unfortunately the lock order is important in the current flush codehe. We have to acquire the readlock fist otherwise
if we are flushing at the end of the recovery while holding the write lock we can deadlock if:
 * Thread 1: flushes via API and gets the flush lock but blocks on the readlock since Thread 2 has the writeLock
 * Thread 2: flushes at the end of the recovery holding the writeLock and blocks on the flushLock owned by Thread 2

This commit acquires the read lock first which would be done further down anyway for the time of the flush.
As a sideeffect we can now safely flush on calling close() while holding the writeLock.
@s1monw s1monw merged commit 0b0cd1c into elastic:master Feb 11, 2015
bleskes added a commit that referenced this pull request Mar 2, 2015
…nish recovery happens concurrently

Issue #9648 fixes a potential deadlock between two concurrent flushes - one at the end of recovery and one through the API or background flush. This back ports the logic to 1.4 . It is slightly more contrived as we still use the write lock in the flush code.

If we feel we have some concerns about this approach we can also move the recovery flush to happen on a generic thread.

Closes #9942
@clintongormley clintongormley added :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. and removed review labels Mar 19, 2015
@clintongormley clintongormley changed the title [ENGINE] Fix deadlock problems when API flush and finish recovery happens concurrently Fix deadlock problems when API flush and finish recovery happens concurrently Jun 8, 2015
mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
… and finish recovery happens concurrently

Issue elastic#9648 fixes a potential deadlock between two concurrent flushes - one at the end of recovery and one through the API or background flush. This back ports the logic to 1.4 . It is slightly more contrived as we still use the write lock in the flush code.

If we feel we have some concerns about this approach we can also move the recovery flush to happen on a generic thread.

Closes elastic#9942
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v1.5.0 v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants