
Raise node disconnected even if the transport is stopped #5918

Closed
wants to merge 2 commits

Conversation

@kimchy (Member) commented Apr 23, 2014

During the stop process we raise network disconnect events, so it is valid to raise them while we are in stop mode; in fact, we should not miss any events in such a case.
Typically this is not a problem, since it happens during the normal JVM shutdown process, but when running a reused cluster within the JVM (as in our test infra with the shared cluster), we should properly raise those node disconnects.
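For illustration, the behavior described above can be sketched roughly as follows. This is a minimal sketch with invented names (`TransportServiceSketch`, `Listener`, `raiseNodeDisconnected`), not the actual Elasticsearch transport classes; the point is only that disconnect listeners keep firing once the service has entered the stopped state:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch -- names do not match the real Elasticsearch code.
class TransportServiceSketch {
    interface Listener {
        void onNodeDisconnected(String nodeId);
    }

    enum State { STARTED, STOPPED }

    private volatile State state = State.STARTED;
    private final List<Listener> listeners = new CopyOnWriteArrayList<>();

    void addListener(Listener l) {
        listeners.add(l);
    }

    void stop() {
        state = State.STOPPED;
    }

    // The fix amounts to delivering disconnect events regardless of
    // lifecycle state: a shared/reused in-JVM cluster must still observe
    // disconnects that happen while the transport is stopping.
    void raiseNodeDisconnected(String nodeId) {
        for (Listener l : listeners) {
            l.onNodeDisconnected(nodeId);
        }
    }
}
```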

closes elastic#5918
@kimchy kimchy added the review label Apr 23, 2014
@s1monw (Contributor) commented Apr 23, 2014

I worked on this a bit and had a test as well as a fix for it here: s1monw@19af66b. I didn't work further on it since we spoke, but the fix you have didn't fix my test. Maybe you can investigate based on that? Feel free to take my test as well.

@s1monw (Contributor) commented Apr 26, 2014

One symptom of this is a test failure like the following:

REPRODUCE WITH  : mvn test -Dtests.seed=5F9B9E72D2C3CE1F -Dtests.class=org.elasticsearch.cluster.MinimumMasterNodesTests -Dtests.prefix=tests -Dfile.encoding=UTF-8 -Duser.timezone=Etc/UTC -Dtests.method="multipleNodesShutdownNonMasterNodes" -Des.logger.level=DEBUG -Des.node.mode=network -Dtests.security.manager=true -Dtests.nightly=true -Dtests.heap.size=512m -Dtests.jvm.argline="-server -XX:+UseConcMarkSweepGC -XX:-UseCompressedOops"
  1> Throwable:
  1> java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still open files: {_0.cfe=1, _0.si=1}
  1>     __randomizedtesting.SeedInfo.seed([5F9B9E72D2C3CE1F:D9159CAB6B73CDBB]:0)
  1>     org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:646)
  1>     org.elasticsearch.test.store.MockDirectoryHelper$ElasticsearchMockDirectoryWrapper.close(MockDirectoryHelper.java:140)
  1>     org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAllFilesClosed(ElasticsearchAssertions.java:506)
  1>     org.elasticsearch.test.ImmutableTestCluster.assertAfterTest(ImmutableTestCluster.java:79)
  1>     org.elasticsearch.test.ElasticsearchIntegrationTest.afterInternal(ElasticsearchIntegrationTest.java:443)
  1>     org.elasticsearch.test.ElasticsearchIntegrationTest.after(ElasticsearchIntegrationTest.java:1257)
  1>     [...sun.*, com.carrotsearch.randomizedtesting.*, java.lang.reflect.*]
  1>     org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50)
  1>     org.apache.lucene.util.TestRuleFieldCacheSanity$1.evaluate(TestRuleFieldCacheSanity.java:51)
  1>     org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46)
  1>     [...com.carrotsearch.randomizedtesting.*]
  1>     org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:49)
  1>     org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70)
  1>     org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
  1>     [...com.carrotsearch.randomizedtesting.*]
  1>     org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46)
  1>     org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:42)
  1>     [...com.carrotsearch.randomizedtesting.*]
  1>     org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:43)
  1>     org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
  1>     org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70)
  1>     org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:55)
  1>     [...com.carrotsearch.randomizedtesting.*]
  1>     java.lang.Thread.run(Thread.java:745)
  1> Caused by: java.lang.RuntimeException: unclosed IndexInput: _0.cfe
  1>     org.apache.lucene.store.MockDirectoryWrapper.addFileHandle(MockDirectoryWrapper.java:534)
  1>     org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:578)
  1>     org.elasticsearch.index.store.Store.openInputRaw(Store.java:318)
  1>     org.elasticsearch.indices.recovery.RecoverySource$1$1.run(RecoverySource.java:185)
  1>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  1>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  1>     java.lang.Thread.run(Thread.java:745)

just for the record...

@kimchy (Member, Author) commented Apr 27, 2014

I was testing on the TestCluster class, and a cluster creates a connection to itself, so my fix was working, but not fully, as you showed. We need to make sure we also clear it on stop, even if for some reason node disconnects were not raised. I pushed your test and fix to the branch as well.

One additional thing: I am wondering if we should wait for an acceptable time for all the callbacks to be raised before we exit stop, or the node disconnect callback (when we are in stop mode)? This would ensure the service has, hopefully, properly stopped, and let us log if it didn't stop within the timeout.

Btw, we do that anyhow in our ThreadPool class, so it might be enough there.
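The "wait for an acceptable time" idea could look roughly like this. This is a hypothetical sketch (invented class and method names, not the actual ThreadPool or transport code): after signalling stop, drain the pending disconnect callbacks and report if they don't finish within the timeout:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of waiting for pending disconnect callbacks on stop.
class GracefulStopSketch {
    private final ExecutorService callbackExecutor = Executors.newSingleThreadExecutor();

    // Disconnect callbacks are dispatched asynchronously on the executor.
    void raiseNodeDisconnected(Runnable callback) {
        callbackExecutor.execute(callback);
    }

    // Returns true if all pending callbacks completed within the timeout.
    boolean stop(long timeoutMillis) {
        callbackExecutor.shutdown(); // stop accepting new callbacks
        try {
            boolean drained = callbackExecutor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
            if (!drained) {
                // the real service would log a warning here
                System.err.println("disconnect callbacks did not complete within " + timeoutMillis + "ms");
            }
            return drained;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The key design point is that `ExecutorService.awaitTermination` gives exactly the bounded "grace period" semantics discussed here: it blocks until the queued work drains or the timeout elapses, whichever comes first.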

@s1monw (Contributor) commented Apr 27, 2014

Yeah, I think it will be OK in the threadpool case, IMO. But I wonder if we should have more tests trigger this stuff. I need to think about it, but I'd like to assert that the grace period is enough in the thread pools?!

@kimchy (Member, Author) commented Apr 27, 2014

@s1monw Aye! It would be great to assert in our test infra that the ThreadPool always exits within the timeout; if it doesn't, that's something that needs fixing.

So are we good with this change then, letting the thread pool handle the graceful shutdown period?
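Such a test-infra assertion might be sketched like this. The helper name `assertTerminatesWithin` is invented for illustration and is not part of the actual Elasticsearch test framework:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical after-test assertion: the pool must exit within the grace period.
class ThreadPoolShutdownAssertion {
    static void assertTerminatesWithin(ExecutorService pool, long timeoutMillis) {
        pool.shutdown();
        try {
            if (!pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new AssertionError("thread pool still running after " + timeoutMillis + "ms");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new AssertionError("interrupted while waiting for shutdown", e);
        }
    }
}
```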

@kimchy (Member, Author) commented Apr 28, 2014

I will pull this in, and we can open a separate issue for the thread pool assertion.

@kimchy kimchy closed this in dedddf3 Apr 28, 2014
kimchy added a commit that referenced this pull request Apr 28, 2014
@kimchy kimchy deleted the better_node_disconnect branch April 28, 2014 08:58
@jpountz jpountz removed the review label Apr 30, 2014
@clintongormley clintongormley added the :Distributed Indexing/Distributed label and removed the :Cluster label Feb 13, 2018
Labels
:Distributed Indexing/Distributed >enhancement v1.2.0 v2.0.0-beta1

4 participants