Observe cluster state on health request #8350
Conversation
@bleskes here you go :)
    }
    if (System.currentTimeMillis() > endTime) {
The first line of the method can be removed now - the endTime variable is not used.
long endTime = System.currentTimeMillis() + request.timeout().millis();
I moved this over to be entirely non-blocking now...
Force-pushed from c25bef8 to b2c8288
@bleskes I updated the PR as we discussed
@@ -241,15 +240,15 @@ public void onTimeout(TimeValue timeout) {
        }
    }

-    public interface Listener {
+    public static abstract class Listener {
I don't think there is an added value to have this as an abstract class anymore, right?
oh I will move back
pushed another round
@@ -90,14 +86,15 @@ public ClusterState execute(ClusterState currentState) {

    @Override
    public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
-        latch.countDown();
+        TimeValue newTimeout = TimeValue.timeValueMillis(Math.max(0, endTime - System.currentTimeMillis()));
I think a time out of 0 is problematic in the InternalClusterService logic. We should protect for it, either in executeHealth or in the ClusterStateObserver
hmm but can't you set this via the API already?
I double checked and the scheduler can take a value of 0 and the listener removal logic is also OK where postAdded and onTimeout are called in rapid succession. I still think the check for 0 here is clearer. Thanks for adding it.
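The guard being discussed can be illustrated with a small self-contained sketch (plain Java stand-ins, not Elasticsearch's actual `TimeValue`/`ClusterStateObserver` classes): clamp the remaining time to zero, and treat a result of zero as "deadline already passed" so the caller fails fast instead of scheduling a timeout task.

```java
public class TimeoutClamp {

    /**
     * Returns the millis remaining until endTimeMillis, clamped to 0.
     * A result of 0 means the deadline has already passed and the
     * caller should bail out rather than register another observer.
     */
    static long remainingMillis(long endTimeMillis, long nowMillis) {
        return Math.max(0, endTimeMillis - nowMillis);
    }

    public static void main(String[] args) {
        long now = 1000L;
        System.out.println(remainingMillis(1500L, now)); // 500 -> keep waiting
        System.out.println(remainingMillis(800L, now));  // 0   -> fail fast
    }
}
```

Even if the underlying scheduler tolerates a zero delay, the explicit check makes the intent obvious at the call site, which matches the reviewer's point about clarity.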
pushed a check for 0
Agreed. LGTM - thanks for cleaning this up.
Today we use busy waiting and sampling when we execute HealthRequests on the master. This is tricky since we might sample a not yet fully applied cluster state and make decisions based on the partial cluster state. This can lead to ugly problems, since requests might be routed to nodes where shards are already marked as relocated, but in the actual cluster state they are still started. While this window is very small, it can lead to ugly test failures. This commit moves the health request over to a listener pattern that gets the actual applied cluster state. Closes elastic#8350
Force-pushed from df835d8 to cc8e8e6
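As a rough illustration of the listener pattern the commit message describes (simplified stand-ins, not the real `ClusterStateObserver` API): instead of polling for a cluster state that may be only partially applied, the health logic registers a listener that is invoked with a state only after it has actually been applied.

```java
import java.util.ArrayList;
import java.util.List;

public class ListenerPatternSketch {

    // Hypothetical listener: invoked with a fully applied state, or on timeout.
    interface StateListener {
        void onNewState(String appliedState);
        void onTimeout();
    }

    // Hypothetical observer: holds listeners until the next applied state.
    static class StateObserver {
        private final List<StateListener> listeners = new ArrayList<>();

        void waitForNextChange(StateListener listener) {
            listeners.add(listener);
        }

        // Called only once a new state has been fully applied,
        // so listeners never see a partial state.
        void publish(String appliedState) {
            for (StateListener l : listeners) {
                l.onNewState(appliedState);
            }
            listeners.clear();
        }
    }

    public static void main(String[] args) {
        StateObserver observer = new StateObserver();
        StringBuilder seen = new StringBuilder();
        observer.waitForNextChange(new StateListener() {
            @Override public void onNewState(String s) { seen.append(s); }
            @Override public void onTimeout() { seen.append("timeout"); }
        });
        observer.publish("green");
        System.out.println(seen); // prints "green"
    }
}
```

The key design point is that `publish` runs only after the state is applied, which removes the sampling race the busy-waiting approach suffered from.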