
Ensure close is called under lock in the case of an engine failure #5800

Merged
merged 1 commit into elastic:master from close_enging_on_close on Apr 16, 2014

Conversation

@s1monw (Contributor) commented Apr 14, 2014

Until today we closed the engine without acquiring the write lock,
since most callers were still holding a read lock. This commit removes
the code that holds on to the read lock when failing the engine, which
means we can simply call #close().
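To make the locking change concrete, here is a minimal, hypothetical sketch of the pattern the commit enables. This is not the actual InternalEngine code; all names except rwl (which appears in the diff below) are assumptions:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hedged sketch, not the actual InternalEngine code: once callers no longer
// hold on to the read lock across the failure path, failEngine can take the
// write lock itself and simply call #close().
class EngineSketch implements AutoCloseable {
    private final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();
    private volatile boolean closed;

    void failEngine(Throwable failure) {
        // legal now: no caller is still inside the read lock when we get here
        rwl.writeLock().lock();
        try {
            close();              // close is guaranteed to run under the lock
        } finally {
            rwl.writeLock().unlock();
        }
    }

    @Override
    public void close() {
        closed = true;            // the real engine would release the IndexWriter, translog, etc.
    }
}
```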

@s1monw self-assigned this Apr 14, 2014
return true;
}

boolean assertLockIsHelp() {
Contributor:

Typo... should be `assertLockIsHeld`

Contributor Author:

LOL yeah :)
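For reference, a minimal sketch of what the renamed helper could look like, assuming the engine guards its state with the ReentrantReadWriteLock rwl seen elsewhere in this diff; the body is an assumption, not the actual code:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

class LockAssertSketch {
    private final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();

    boolean assertLockIsHeld() {
        // true if the calling thread holds either the read or the write lock
        boolean held = rwl.getReadHoldCount() > 0 || rwl.isWriteLockedByCurrentThread();
        assert held : "neither the read nor the write lock is held by this thread";
        return held;
    }
}
```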

@bleskes (Contributor) commented Apr 15, 2014

I like it! It makes things cleaner. Left some comments.

throw new CreateFailedEngineException(shardId, create, e);
} finally {
rwl.readLock().unlock();
} catch (OutOfMemoryError | IllegalStateException | IOException t) {
Member:

should we just catch Throwable here to be safe?

Member:

also applies to other places in the code below, if we decide to do it.

Contributor Author:

Well, this is what the code used to do. I think we are fine here as it is, to be honest.
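To illustrate the trade-off being discussed, here is a hedged sketch; the exception list mirrors the snippet above, while the surrounding names are assumptions:

```java
import java.io.IOException;

class CatchScopeSketch {
    interface IndexOp { void run() throws IOException; }

    void create(IndexOp op) {
        try {
            op.run();
        } catch (OutOfMemoryError | IllegalStateException | IOException t) {
            // what the code used to do: only this fixed list of types fails the engine
            failEngine(t);
            throw new RuntimeException("create failed", t);
        }
        // the broader alternative raised in review would be catch (Throwable t),
        // which cannot miss an unexpected error type, but also traps things like
        // AssertionError that you may prefer to let propagate untouched
    }

    private void failEngine(Throwable t) { /* close the engine under the write lock */ }
}
```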

@s1monw (Contributor, Author) commented Apr 15, 2014

@bleskes @kimchy thanks guys, I commented and pushed a new commit.

optimizeMutex.set(false);
}

}
// wait for the merges outside of the read lock
if (optimize.waitForMerge()) {
Contributor:

This is an access to the indexWriter without a lock. I think this can lead to an NPE if the shard is closed. I realize it's not part of the change, but I think we should deal with it.

Contributor Author:

I will use a local variable for the indexWriter here...
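A hedged sketch of that fix, assuming a volatile indexWriter field that close() nulls out; the field and method names besides the Lucene API are assumptions:

```java
import org.apache.lucene.index.IndexWriter;

class LocalWriterSketch {
    private volatile IndexWriter indexWriter;   // set to null by close()

    void waitForMergesIfRequested(boolean waitForMerge) {
        // capture the field once: a concurrent close() may null this.indexWriter,
        // but the local reference cannot disappear between the check and the call
        final IndexWriter writer = this.indexWriter;
        if (waitForMerge && writer != null) {
            writer.waitForMerges();   // Lucene 4.x API, as in the diff above
        }
    }
}
```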

@s1monw (Contributor, Author) commented Apr 16, 2014

@bleskes thanks for the review - I pushed another commit

  // wait for the merges outside of the read lock
  if (optimize.waitForMerge()) {
-     indexWriter.waitForMerges();
+     writer.waitForMerges();
Contributor:

I think writer can be null if optimizeMutex is true when the method begins. It seems we have this recurrent pattern of calling ensureOpen and then getting the indexWriter to do something. Perhaps we can change ensureOpen to either throw an exception or to return a (guaranteed) non-null writer. Then this can become ensureOpen().waitForMerges().
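A hedged sketch of that suggestion; the exception type and all names are assumptions, not the actual elasticsearch API:

```java
import org.apache.lucene.index.IndexWriter;

class EnsureOpenSketch {
    private volatile IndexWriter indexWriter;   // set to null by close()

    private IndexWriter ensureOpen() {
        final IndexWriter writer = this.indexWriter;
        if (writer == null) {
            // the real code would throw the engine's own closed exception
            throw new IllegalStateException("engine is closed");
        }
        return writer;   // guaranteed non-null, so callers can chain calls
    }

    void optimize(boolean waitForMerge) {
        if (waitForMerge) {
            ensureOpen().waitForMerges();   // the pattern suggested above
        }
    }
}
```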

@bleskes (Contributor) commented Apr 16, 2014

Thx Simon. Looking good. Left one last comment. I'm +1 on this otherwise.

@s1monw (Contributor, Author) commented Apr 16, 2014

I fixed your last suggestion! Thanks for all the reviews @bleskes. I think it's ready; if you don't object I'd like to rebase and push it.

@bleskes (Contributor) commented Apr 16, 2014

thx. ++1 :)

@s1monw merged commit be14968 into elastic:master Apr 16, 2014
@s1monw deleted the close_enging_on_close branch April 16, 2014 13:36
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Apr 17, 2014
When a replication operation (index/delete/update) fails to be executed properly, we fail the replica and allow the master to allocate a new copy of it. At the moment, the node hosting the primary shard is responsible for notifying the master of a failed replica. However, if the replica shard is initializing (`POST_RECOVERY` state), we have a race condition between the failed shard message and moving the shard into the `STARTED` state. If the latter happens first, the master will fail to resolve the failed shard message.

This PR builds on elastic#5800 and fails the engine of the replica shard if a replication operation fails. This protects us against the above, as the shard will reject the `STARTED` command from the master. It also makes us more resilient to other race conditions in this area.
bleskes added a commit that referenced this pull request Apr 18, 2014
When a replication operation (index/delete/update) fails to be executed properly, we fail the replica and allow the master to allocate a new copy of it. At the moment, the node hosting the primary shard is responsible for notifying the master of a failed replica. However, if the replica shard is initializing (`POST_RECOVERY` state), we have a race condition between the failed shard message and moving the shard into the `STARTED` state. If the latter happens first, the master will fail to resolve the failed shard message.

This commit builds on #5800 and fails the engine of the replica shard if a replication operation fails. This protects us against the above, as the shard will reject the `STARTED` command from the master. It also makes us more resilient to other race conditions in this area.

Closes #5847
bleskes added a commit that referenced this pull request Apr 18, 2014 (same commit message as above)
@clintongormley added the :Distributed/Distributed and :Distributed/Engine labels and removed the :Engine and :Distributed/Distributed labels on Feb 13, 2018
Labels: :Distributed/Engine, >enhancement, v1.2.0, v2.0.0-beta1
4 participants