Cluster Health: Add wait time for pending task and recovery percentage #11393

spinscale · 2015-05-28T13:25:59Z

In order to get a quick overview by simply checking the cluster state
and its corresponding cat API, the following two attributes have been added
to the cluster health response:

* pending_task_time_in_queue: how long the first task in the queue
has been waiting
* recovery percent: the percentage of shards that are in the
initializing state

This makes the cluster health API a handy way to check when a fully restarted
cluster is back up and running.

In addition, a small serialization fix has been added, which removes the version
checks for this branch in the ClusterHealthResponse.

Closes #10805

spinscale · 2015-05-28T13:28:19Z

The biggest question here is whether the shardRouting.initialized() call is sufficient to tell that a shard is being recovered. If not, this cannot be implemented as part of cluster health as a master-only operation, because we would need to get shard information and the current RecoveryState.

spinscale · 2015-06-01T08:55:15Z

@clintongormley care to take a look and check whether this matches your expectations, even though no new cat API has been added?

clintongormley · 2015-06-02T17:50:54Z

LGTM

bleskes · 2015-06-03T06:45:42Z

src/main/java/org/elasticsearch/action/admin/cluster/health/ClusterHealthResponse.java

+                int initializingShardCount = 0;
+                int totalShardCount = 0;
+                for (ShardRouting shardRouting : shardRoutings) {
+                    if (shardRouting.initializing()) initializingShardCount++;


This should count shardRouting.active() == false no?

not following here, why?

bleskes · 2015-06-03T06:58:07Z

Hey @spinscale, left some comments about naming and implementation. I think the comments also address the safe use of shardRouting.initialized(). I think we should make it a master-only API, with no reaching out to the individual nodes. If we want to have an aggregation of the _recovery info, we should add it as a header there imho.

spinscale · 2015-06-03T12:54:59Z

@bleskes so, you are proposing two important changes here, which I want to understand before applying them:

* pending_task_waiting should become the longest waiting time in the task list. This means that when the cluster continuously accepts new pending tasks with higher priority, we do not see a change here. However, as one can also see the number of pending tasks, and that number changes, this might be OK. What worries me about the approach of scanning all tasks is that the speed of the cluster health response is dependent on the number of tasks in the queue. Slowing down this API doesn't sound like a good idea.
* recovery_percent should become active_percent. Sounds good. The current metric is pretty useless, as the number of concurrent recoveries per node (2 by default iirc) and the total number of shards are going to be static. Something I hadn't thought about yet is how unassigned shards come into play here. But I don't think it is a big problem that the active percentage is 50% when you have a single node and one replica for your indices.

bleskes · 2015-06-08T13:24:33Z

What worries me about the approach of scanning all tasks is that the speed of the cluster health response is dependent on the number of tasks in the queue.

That was my concern as well when thinking about this, but I decided it's worth it. I think it's more powerful this way. It's only an array scan, so we're still talking super fast, even if it's 10K ops.

But I don't think it is a big problem that the active percentage is 50% when you have a single node and one replica for your indices.

Yeah, I think it communicates clearly how bad "YELLOW" is.
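
For readers following the active percentage discussion, here is a minimal sketch of the calculation, not the code merged in this PR. ShardState and ActiveShardsPercentSketch are hypothetical stand-ins for the real routing classes, and treating "started or relocating" as active, as well as reporting 100% for an empty routing table, are assumptions made for illustration.

```java
import java.util.List;

// Hypothetical stand-in for the routing state of one shard copy (not the real ShardRouting).
enum ShardState { UNASSIGNED, INITIALIZING, STARTED, RELOCATING }

final class ActiveShardsPercentSketch {

    // Assumption: a shard copy counts as active once it is started or relocating.
    static boolean isActive(ShardState state) {
        return state == ShardState.STARTED || state == ShardState.RELOCATING;
    }

    // Percentage of shard copies that are active.
    // Assumption: an empty routing table is reported as 100%.
    static double activeShardsPercent(List<ShardState> shards) {
        if (shards.isEmpty()) {
            return 100.0;
        }
        int active = 0;
        for (ShardState state : shards) {
            if (isActive(state)) {
                active++;
            }
        }
        return 100.0 * active / shards.size();
    }
}
```

With a single node and one replica, one primary is STARTED and its replica is UNASSIGNED, so the sketch yields 50%, which is the "YELLOW" case discussed above. Counting only INITIALIZING shards instead would miss the unassigned copy entirely.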

spinscale · 2015-06-15T14:43:26Z

Updated the PR with all the naming refactoring and the scanning of all tasks (also introduced a HasCreationDate interface to make it simpler to read the creation date).

bleskes · 2015-06-16T13:56:06Z

core/src/main/java/org/elasticsearch/action/admin/cluster/health/ClusterHealthResponse.java

+        } else {
+            List<ShardRouting> shardRoutings = clusterState.getRoutingTable().allShards();
+            // shortcut on no shards
+            if (shardRoutings.size() == 0) {


this should be covered by the green case, no?

bleskes · 2015-06-16T14:09:38Z

I like this! One thing I would love to see is getting rid of org.elasticsearch.cluster.service.InternalClusterService.TimedPrioritizedRunnable; the time aspect should be folded into the executor.

spinscale · 2015-06-17T12:44:54Z

@bleskes thx for the comments, folded them in, also removed the TimedPrioritizedRunnable

bleskes · 2015-06-19T08:35:27Z

.../src/main/java/org/elasticsearch/common/util/concurrent/PrioritizedEsThreadPoolExecutor.java

+            }
+        }
+
+        return TimeValue.timeValueNanos(System.nanoTime() - oldestCreationDateInNanos);


Nitpicky: can we capture the initial long oldestCreationDateInNanos = System.nanoTime(); and use this as the "now"? I'm just worried that the queue can become empty (after the initial check) and we will still get a non-zero value.

valid. fixed
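
The suggestion above can be sketched as follows. This is an illustration only, not the code in PrioritizedEsThreadPoolExecutor: MaxTaskWaitSketch and maxWaitNanos are hypothetical names, and the queue is modelled as a plain collection of creation timestamps taken from System.nanoTime().

```java
import java.util.Collection;

final class MaxTaskWaitSketch {

    // Returns the longest wait, in nanoseconds, among the queued tasks.
    // creationNanos holds the System.nanoTime() value recorded when each task was queued.
    static long maxWaitNanos(long nowNanos, Collection<Long> creationNanos) {
        // Start from the captured "now", so an empty queue yields exactly 0.
        long oldest = nowNanos;
        for (long created : creationNanos) {
            // Compare by difference, as recommended for System.nanoTime() values.
            if (created - oldest < 0) {
                oldest = created;
            }
        }
        return nowNanos - oldest;
    }
}
```

A caller would capture the clock once, for example `MaxTaskWaitSketch.maxWaitNanos(System.nanoTime(), creationTimes)`. Because the same captured value serves as both the initial "oldest" and the "now", a queue that drains during the scan cannot produce a spurious non-zero wait.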

bleskes · 2015-06-19T08:39:24Z

Went through it and it looks good! I miss a non-cat REST test. If you agree with the comments, feel free to push this; no need for another round.

I also wonder if we should bite the bullet and add active_primary_shards_percent (since we have active_primary_shards )

spinscale · 2015-06-22T12:05:30Z

I'll add another test and check the renaming as well. will push upon running tests. Thanks for reviewing!

In order to get a quick overview by simply checking the cluster state and its corresponding cat API, the following two attributes have been added to the cluster health response:

* task max waiting time, the time value of the first task of the queue and how long it has been waiting
* active shards percent: The percentage of the number of shards that are in initializing state

This makes the cluster health API handy to check, when a fully restarted cluster is back up and running.

Closes elastic#10805

spinscale · 2015-06-22T13:07:51Z

closed by 88f8d58

spinscale added >feature review labels May 28, 2015

clintongormley added the :Data Management/Stats Statistics tracking and retrieval APIs label May 28, 2015

bleskes reviewed Jun 3, 2015

spinscale force-pushed the 1505-recovery-progress-cluster-health-issue-10805 branch from d6c104e to d3a3651 Compare June 15, 2015 14:40

bleskes reviewed Jun 16, 2015

spinscale force-pushed the 1505-recovery-progress-cluster-health-issue-10805 branch from d3a3651 to 61981d8 Compare June 17, 2015 12:43

bleskes reviewed Jun 19, 2015

spinscale force-pushed the 1505-recovery-progress-cluster-health-issue-10805 branch 3 times, most recently from 1669d42 to c85557e Compare June 22, 2015 13:02

spinscale added the v2.0.0-beta1 label Jun 22, 2015

spinscale closed this Jun 22, 2015

kevinkluge removed the review label Jun 22, 2015

kunisen mentioned this pull request Jun 3, 2019

[Feature Request] Add task running progress (% complete) to the task management API #42786

Open
