
Raise node disconnected even if the transport is stopped #5918

Closed
wants to merge 2 commits

Conversation

@kimchy (Member) commented Apr 23, 2014

During the stop process we raise network disconnect events, so it is valid to raise them while we are in stop mode; in fact, we should not miss any events in such a case.
Typically this is not a problem, since it happens during the normal JVM shutdown process, but when running a reused cluster within the JVM (as in our test infra with the shared cluster), we should properly raise those node disconnects.
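For illustration, the behavior described above can be sketched roughly as follows. This is a minimal sketch with invented names (`TransportServiceSketch`, `Listener`, `raiseNodeDisconnected`), not the actual Elasticsearch transport classes; the point is only that disconnect listeners keep firing once the service has entered the stopped state:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch -- names do not match the real Elasticsearch code.
class TransportServiceSketch {
    interface Listener {
        void onNodeDisconnected(String nodeId);
    }

    enum State { STARTED, STOPPED }

    private volatile State state = State.STARTED;
    private final List<Listener> listeners = new CopyOnWriteArrayList<>();

    void addListener(Listener l) {
        listeners.add(l);
    }

    void stop() {
        state = State.STOPPED;
    }

    // The fix amounts to delivering disconnect events regardless of
    // lifecycle state: a shared/reused in-JVM cluster must still observe
    // disconnects that happen while the transport is stopping.
    void raiseNodeDisconnected(String nodeId) {
        for (Listener l : listeners) {
            l.onNodeDisconnected(nodeId);
        }
    }
}
```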

closes elastic#5918
@kimchy kimchy added the review label Apr 23, 2014
@s1monw (Contributor) commented Apr 23, 2014

I worked on this a bit and had a test as well as a fix for it here: s1monw@19af66b. I didn't work further on it since we spoke, but the fix you have didn't fix my test. Maybe you can investigate based on that? Feel free to take my test as well.

@s1monw (Contributor) commented Apr 26, 2014

One symptom of this is a test failure like the following:

REPRODUCE WITH  : mvn test -Dtests.seed=5F9B9E72D2C3CE1F -Dtests.class=org.elasticsearch.cluster.MinimumMasterNodesTests -Dtests.prefix=tests -Dfile.encoding=UTF-8 -Duser.timezone=Etc/UTC -Dtests.method="multipleNodesShutdownNonMasterNodes" -Des.logger.level=DEBUG -Des.node.mode=network -Dtests.security.manager=true -Dtests.nightly=true -Dtests.heap.size=512m -Dtests.jvm.argline="-server -XX:+UseConcMarkSweepGC -XX:-UseCompressedOops"
  1> Throwable:
  1> java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still open files: {_0.cfe=1, _0.si=1}
  1>     __randomizedtesting.SeedInfo.seed([5F9B9E72D2C3CE1F:D9159CAB6B73CDBB]:0)
  1>     org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:646)
  1>     org.elasticsearch.test.store.MockDirectoryHelper$ElasticsearchMockDirectoryWrapper.close(MockDirectoryHelper.java:140)
  1>     org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAllFilesClosed(ElasticsearchAssertions.java:506)
  1>     org.elasticsearch.test.ImmutableTestCluster.assertAfterTest(ImmutableTestCluster.java:79)
  1>     org.elasticsearch.test.ElasticsearchIntegrationTest.afterInternal(ElasticsearchIntegrationTest.java:443)
  1>     org.elasticsearch.test.ElasticsearchIntegrationTest.after(ElasticsearchIntegrationTest.java:1257)
  1>     [...sun.*, com.carrotsearch.randomizedtesting.*, java.lang.reflect.*]
  1>     org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50)
  1>     org.apache.lucene.util.TestRuleFieldCacheSanity$1.evaluate(TestRuleFieldCacheSanity.java:51)
  1>     org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46)
  1>     [...com.carrotsearch.randomizedtesting.*]
  1>     org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:49)
  1>     org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70)
  1>     org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
  1>     [...com.carrotsearch.randomizedtesting.*]
  1>     org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46)
  1>     org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:42)
  1>     [...com.carrotsearch.randomizedtesting.*]
  1>     org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:43)
  1>     org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
  1>     org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70)
  1>     org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:55)
  1>     [...com.carrotsearch.randomizedtesting.*]
  1>     java.lang.Thread.run(Thread.java:745)
  1> Caused by: java.lang.RuntimeException: unclosed IndexInput: _0.cfe
  1>     org.apache.lucene.store.MockDirectoryWrapper.addFileHandle(MockDirectoryWrapper.java:534)
  1>     org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:578)
  1>     org.elasticsearch.index.store.Store.openInputRaw(Store.java:318)
  1>     org.elasticsearch.indices.recovery.RecoverySource$1$1.run(RecoverySource.java:185)
  1>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  1>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  1>     java.lang.Thread.run(Thread.java:745)

just for the record...

@kimchy (Member, Author) commented Apr 27, 2014

I was testing on the TestCluster class, and a cluster creates a connection to itself, so my fix was working, but not fully, as you showed. We need to make sure we also clear it on stop, even if for some reason node disconnects were not raised. I pushed your test and fix to the branch as well.

One additional thing: I am wondering if we should wait for an acceptable time for all the callbacks to be raised before we exit stop, or the node disconnect callback (when we are in stop mode)? This would ensure the service has, hopefully, properly stopped, and let us log if it didn't stop within the timeout.

Btw, we do that anyhow in our ThreadPool class, so it might be enough there.
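The "wait for an acceptable time" idea could look roughly like this. This is a hypothetical sketch (invented class and method names, not the actual ThreadPool or transport code): after signalling stop, drain the pending disconnect callbacks and report if they don't finish within the timeout:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of waiting for pending disconnect callbacks on stop.
class GracefulStopSketch {
    private final ExecutorService callbackExecutor = Executors.newSingleThreadExecutor();

    // Disconnect callbacks are dispatched asynchronously on the executor.
    void raiseNodeDisconnected(Runnable callback) {
        callbackExecutor.execute(callback);
    }

    // Returns true if all pending callbacks completed within the timeout.
    boolean stop(long timeoutMillis) {
        callbackExecutor.shutdown(); // stop accepting new callbacks
        try {
            boolean drained = callbackExecutor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
            if (!drained) {
                // the real service would log a warning here
                System.err.println("disconnect callbacks did not complete within " + timeoutMillis + "ms");
            }
            return drained;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The key design point is that `ExecutorService.awaitTermination` gives exactly the bounded "grace period" semantics discussed here: it blocks until the queued work drains or the timeout elapses, whichever comes first.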

@s1monw (Contributor) commented Apr 27, 2014

Yeah, I think it will be OK in the threadpool case, IMO. But I wonder if we should have more tests trigger this stuff. I need to think about it, but I'd like to assert that the grace period is enough in the thread pools?!

@kimchy (Member, Author) commented Apr 27, 2014

@s1monw Aye! It would be great to assert in our test infra that the ThreadPool always exits within the timeout; if it doesn't, that's something that needs fixing.

So are we good with this change then, letting the thread pool handle the graceful shutdown period?
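Such a test-infra assertion might be sketched like this. The helper name `assertTerminatesWithin` is invented for illustration and is not part of the actual Elasticsearch test framework:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical after-test assertion: the pool must exit within the grace period.
class ThreadPoolShutdownAssertion {
    static void assertTerminatesWithin(ExecutorService pool, long timeoutMillis) {
        pool.shutdown();
        try {
            if (!pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new AssertionError("thread pool still running after " + timeoutMillis + "ms");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new AssertionError("interrupted while waiting for shutdown", e);
        }
    }
}
```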

@kimchy (Member, Author) commented Apr 28, 2014

I will pull this in, and we can open a separate issue for the thread pool assertion.

@kimchy kimchy closed this in dedddf3 Apr 28, 2014
kimchy added a commit that referenced this pull request Apr 28, 2014
@kimchy kimchy deleted the better_node_disconnect branch April 28, 2014 08:58
@jpountz jpountz removed the review label Apr 30, 2014
@clintongormley clintongormley added the :Distributed Indexing/Distributed label and removed the :Cluster label Feb 13, 2018
Labels
:Distributed Indexing/Distributed >enhancement v1.2.0 v2.0.0-beta1

4 participants