Improve the lifecycle management of the join control thread in zen discovery. #8327
Conversation
        }
    });
} catch (Exception e) {
    sendPingsHandler.close();
we want to rethrow here, right?
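A minimal sketch of rethrowing here, assuming the try/catch shape of the snippet above (sendPingsHandler is the only name taken from the diff; the wrapping is illustrative, not the PR's final code):

} catch (Exception e) {
    sendPingsHandler.close();      // release ping resources before propagating
    throw new RuntimeException(e); // rethrow (wrapped) instead of swallowing the failure
}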
Looking good. Left two comments.
@bleskes I updated the PR.
/**
 * Wraps the specified exception in a runtime exception if required and then rethrows it.
 *
 * Usable for assertions too because of the boolean return value.
I think it's only usable in assertions :) Can you give a usage example? Maybe call it reThrowIfNotNull and allow passing null values into it? That might make it more useful.
@bleskes I applied the feedback and also added better error handling for multicast ping.
 *
 * Also usable for assertions, because of the boolean return value.
 */
public static boolean reThrowIfNotNull(Throwable e) {
Since this is renamed, we need to allow for null as a value of e?
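For reference, a sketch of a helper with the discussed shape (null-tolerant, boolean return so it can sit inside an assert); this is an illustration, not the PR's final code:

public static boolean reThrowIfNotNull(Throwable e) {
    if (e == null) {
        return true; // tolerate null so the call can be used inside an assert
    }
    if (e instanceof RuntimeException) {
        throw (RuntimeException) e;
    }
    if (e instanceof Error) {
        throw (Error) e;
    }
    throw new RuntimeException(e); // wrap checked exceptions before rethrowing
}

Because it returns a boolean, it can be used as assert reThrowIfNotNull(lastException); which only executes when assertions are enabled.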
@martijnvg Looking good. Left two last little comments.
@bleskes Thanks, I updated the PR to address your comments.
@@ -1281,7 +1281,7 @@ public ClusterState stopRunningThreadAndRejoin(ClusterState clusterState, String
         return rejoin(clusterState, reason);
     }

-    /** starts a new joining thread if there is no currently active one */
+    /** starts a new joining thread if there is no currently active one and join thread controlling is started */
"join thread controlling is started" <-- love it :)
Left one minor logging comment. LGTM!
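A sketch of the guard the new javadoc describes; the field names (controlStarted, currentJoinThread) and the innerJoinCluster call are assumptions about the join-thread control, not the PR's actual fields:

// uses java.util.concurrent.atomic; names here are illustrative only
private final AtomicBoolean controlStarted = new AtomicBoolean(false);
private final AtomicReference<Thread> currentJoinThread = new AtomicReference<>();

/** starts a new joining thread if there is no currently active one and join thread controlling is started */
public void startNewThreadIfNotRunning() {
    if (controlStarted.get() == false) {
        return; // node is stopped or stopping; never spawn a join thread now
    }
    Thread thread = new Thread(this::innerJoinCluster, "zen-disco-join");
    if (currentJoinThread.compareAndSet(null, thread)) {
        thread.start(); // at most one join thread may be active at a time
    }
}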
…d in zen discovery. Also added:
* Better exception handling in UnicastZenPing#ping and MulticastZenPing#ping.
* In the join thread that runs the innerJoinCluster loop, remember the last known exception and throw it when assertions are enabled. We loop until the inner join has completed, and if exceptions are thrown continuously we should fail the test, because the exception shouldn't occur in production (at least not too often).
Applied feedback 3
Closes elastic#8327
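The "remember the last known exception" idea from the commit message could look roughly like this (joinCompleted and the loop shape are hypothetical; only innerJoinCluster and the assertion trick come from the message above):

Throwable lastKnownException = null;
while (joinCompleted() == false) { // hypothetical completion check
    try {
        innerJoinCluster();
    } catch (Exception e) {
        lastKnownException = e;    // remember the failure and keep looping
    }
}
// with -ea (tests) this rethrows a lingering failure; in production it is a no-op
assert reThrowIfNotNull(lastKnownException);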
Force-pushed from 8a27d38 to 4ddb057.
Pushed, thanks @bleskes!
When a node stops, we cancel any ongoing join process. With #8327, we improved this logic and wait for it to complete before shutting down the node. In our tests we typically shut down an entire cluster at once, which makes it very likely for nodes to be joining while shutting down. This introduces a race condition where the joinThread.interrupt can happen before the thread starts waiting on pings, which causes the shutdown logic to be slow. This commit improves matters by repeatedly trying to stop the thread with smaller waits. Another side effect of the change is that we are now more likely to ping ourselves while shutting down, which results in an ugly warn-level log. We now log all remote exceptions during pings at the debug level. Closes #8359
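The "repeatedly trying to stop the thread in smaller waits" fix could be sketched as follows (the method name and the 100ms wait are assumptions, not the commit's actual code):

void stopJoinThread(Thread joinThread) throws InterruptedException {
    while (joinThread.isAlive()) {
        joinThread.interrupt(); // re-interrupt in case the first interrupt raced the thread's startup
        joinThread.join(100);   // short wait, then re-check, instead of one long blocking join
    }
}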
When a node stops, we cancel any ongoing join process. With #8327, we improved this logic and wait for it to complete before shutting down the node. However, the joining thread is part of a thread pool and will not stop until the thread pool is shut down. Another issue raised by the unneeded wait is that when we shut down, we may ping ourselves, which results in an ugly warn-level log. We now log all remote exceptions during pings at the debug level. Closes #8359
This PR also includes: