New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add logging of slow cluster state tasks #10907
Add logging of slow cluster state tasks #10907
Conversation
@imotov can we log the times not as ms but as |
@@ -523,6 +538,12 @@ public void run() { | |||
} | |||
} | |||
|
|||
private void warnAboutSlowTaskIfNeeded(long executionTime, String source) { | |||
if (executionTime >= slowTaskLoggingThreshold.getMillis()) { | |||
logger.warn("cluster state update task [{}] took too long: {}ms", source, executionTime); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we log what the limit here? took X above the warn threshold of Y? Also, I wonder if this should be warn compared to info, don't have strong feelings, but it might be ok that it takes more than 10 seconds sometimes, depending on the cluster state task (like the initial gateway reach to discover which shards are around)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kimchy it's 30 seconds by default and can be changed dynamically. So we can either 1) increase the default value to something that we feel comfortable with or 2) make it logged on info level. Not sure if the second option makes a lot of sense since we are already logging timing on the debug level. So, I thought the whole idea to log long running tasks was to "warn" users about a potential issue.
@kimchy I pushed a new version that logs time as |
logger.debug("processing [{}]: no change in cluster_state", source); | ||
TimeValue executionTime = TimeValue.timeValueMillis(Math.max(0, System.currentTimeMillis() - startTime)); | ||
logger.debug("processing [{}]: took {} no change in cluster_state", source, executionTime); | ||
warnAboutSlowTaskIfNeeded(executionTime, source); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the other block, we count the time after calling onAllNodesAck and clusterStateProcessed, maybe we should do the same here?
lets another small comment, otherwise LGTM |
3fdcba3
to
c165afb
Compare
@imotov could you add a description of the changes to this PR please? |
@clintongormley updated the description in the first comment. |
Closes #10874
Logs how long each cluster state update task took on the debug level. If a task took longer than 30sec (dynamically configurable using
cluster.service.slow_task_logging_threshold
) this task is logged on the WARN level.