Cluster Health: Add wait time for pending task and recovery percentage #11393
biggest question here is, if the
@clintongormley care to take a look, if this matches your expectation, even though no new cat API has been added?
LGTM
int initializingShardCount = 0;
int totalShardCount = 0;
for (ShardRouting shardRouting : shardRoutings) {
    if (shardRouting.initializing()) initializingShardCount++;
This should count shardRouting.active() == false
no?
not following here, why?
Hey @spinscale, left some comments about naming and implementation. I think the comments also address the safe use of
@bleskes so, you are proposing two important changes here, which I want to understand before applying them
That was my concern as well when thinking about this, but I decided it's worth it. I think it's more powerful this way. It's only an array scan, so we're still talking super fast, even if it's 10K ops.
Yeah, I think it communicates clearly how bad "YELLOW" is.
d6c104e
to
d3a3651
upgraded the PR with all the naming refactoring and scanning all the tasks (also introduced a
} else {
    List<ShardRouting> shardRoutings = clusterState.getRoutingTable().allShards();
    // shortcut on no shards
    if (shardRoutings.size() == 0) {
this should be covered by the green case, no?
I like this! One thing I would love to see is getting rid of org.elasticsearch.cluster.service.InternalClusterService.TimedPrioritizedRunnable; the time aspect should be folded into the executor.
d3a3651
to
61981d8
@bleskes thx for the comments, folded them in, also removed the
    }
}

return TimeValue.timeValueNanos(System.nanoTime() - oldestCreationDateInNanos);
Nitpicky: can we capture the initial value, long oldestCreationDateInNanos = System.nanoTime();, and use this as the "now"? Just worried that the queue can be empty (after the initial check) and we will still get a non-zero value.
valid. fixed
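To make the point of this thread concrete, here is a minimal sketch of the "capture now once" idea. It is not the code from the PR: `MaxTaskWaitTime`, `QueuedTask`, and `maxWaitNanos` are hypothetical names, and the queue is modelled as a plain list of creation timestamps. The single `now` value serves both as the starting point of the scan and as the reference time, so an empty queue yields exactly 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Elasticsearch implementation.
class MaxTaskWaitTime {

    /** Stand-in for a queued task; only its creation timestamp matters here. */
    static final class QueuedTask {
        final long creationTimeNanos;

        QueuedTask(long creationTimeNanos) {
            this.creationTimeNanos = creationTimeNanos;
        }
    }

    /**
     * Returns how long the oldest task has been waiting, in nanoseconds.
     * "now" is captured once by the caller and reused, so an empty queue
     * (or one holding only tasks created after "now") returns exactly 0.
     */
    static long maxWaitNanos(List<QueuedTask> queue, long now) {
        long oldest = now;
        for (QueuedTask task : queue) {
            // Compare via subtraction, as recommended for System.nanoTime() values.
            if (task.creationTimeNanos - oldest < 0) {
                oldest = task.creationTimeNanos;
            }
        }
        return now - oldest;
    }

    /** Convenience overload that captures "now" a single time. */
    static long maxWaitNanos(List<QueuedTask> queue) {
        return maxWaitNanos(queue, System.nanoTime());
    }

    public static void main(String[] args) {
        List<QueuedTask> queue = new ArrayList<>();
        System.out.println("empty queue wait: " + maxWaitNanos(queue));
        queue.add(new QueuedTask(System.nanoTime()));
        System.out.println("one task wait (ns): " + maxWaitNanos(queue));
    }
}
```

Passing `now` as a parameter is a choice made for this sketch only; it keeps the scan deterministic and easy to test.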
Went through it and it looks good! I miss a non-cat REST test. If you agree with the comments, feel free to push this, no need for another round. I also wonder if we should bite the bullet and add active_primary_shards_percent (since we have active_primary_shards).
I'll add another test and check the renaming as well. Will push upon running tests. Thanks for reviewing!
1669d42
to
c85557e
In order to get a quick overview by simply checking the cluster state and its corresponding cat API, the following two attributes have been added to the cluster health response:
* task max waiting time: the time value of the first task of the queue and how long it has been waiting
* active shards percent: the percentage of the number of shards that are in initializing state
This makes the cluster health API handy to check when a fully restarted cluster is back up and running.
Closes elastic#10805
closed by 88f8d58
In order to get a quick overview by simply checking the cluster state
and its corresponding cat API, the following two attributes have been added
to the cluster health response:
* task max waiting time: the time value of the first task of the
  queue and how long it has been waiting
* active shards percent: the percentage of the number of shards that are in
  initializing state
This makes the cluster health API handy to check when a fully restarted
cluster is back up and running.
In addition, a small serialization fix has been added, which removes version
checks for this branch in the ClusterHealthResponse.
Closes #10805
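
For readers following the review discussion about counting active rather than initializing shards, the percentage could be computed roughly as below. This is a sketch under assumptions, not the merged code: `ActiveShardsPercent`, `ShardState`, and `activeShardsPercent` are hypothetical names, a shard is assumed to count as active when it is started or relocating, and an empty shard list is assumed to report 100 percent.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the Elasticsearch implementation.
class ActiveShardsPercent {

    /** Simplified shard states used only for this illustration. */
    enum ShardState { UNASSIGNED, INITIALIZING, STARTED, RELOCATING }

    /** Assumption: started and relocating shards both count as active. */
    static boolean isActive(ShardState state) {
        return state == ShardState.STARTED || state == ShardState.RELOCATING;
    }

    /**
     * Percentage of active shards among all shards, in the range 0..100.
     * Assumption: with no shards at all there is nothing left to recover,
     * so the result is 100.
     */
    static double activeShardsPercent(List<ShardState> shards) {
        if (shards.isEmpty()) {
            return 100.0;
        }
        int active = 0;
        for (ShardState state : shards) {
            if (isActive(state)) {
                active++;
            }
        }
        return 100.0 * active / shards.size();
    }

    public static void main(String[] args) {
        List<ShardState> shards = Arrays.asList(
            ShardState.STARTED, ShardState.STARTED,
            ShardState.RELOCATING, ShardState.INITIALIZING);
        System.out.println("active shards percent: " + activeShardsPercent(shards));
    }
}
```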