It is possible for Watcher to get stuck while starting if attempts to read the .watches or .triggered_watches indices fail. If this occurs, the node must be restarted; no amount of stopping and restarting Watcher, and no other API call, will get it unstuck.
When this occurs, there will be a stack trace similar to this:
org.elasticsearch.ElasticsearchTimeoutException: java.util.concurrent.TimeoutException: Timeout waiting for task.
at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:79) ~[elasticsearch-6.8.1.jar:6.8.1]
at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:54) ~[elasticsearch-6.8.1.jar:6.8.1]
at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:49) ~[elasticsearch-6.8.1.jar:6.8.1]
at org.elasticsearch.xpack.watcher.execution.TriggeredWatchStore.findTriggeredWatches(TriggeredWatchStore.java:141) ~[x-pack-watcher-6.8.1.jar:6.8.1]
at org.elasticsearch.xpack.watcher.WatcherService.reloadInner(WatcherService.java:229) ~[x-pack-watcher-6.8.1.jar:6.8.1]
at org.elasticsearch.xpack.watcher.WatcherService.lambda$start$2(WatcherService.java:203) ~[x-pack-watcher-6.8.1.jar:6.8.1]
at org.elasticsearch.xpack.watcher.WatcherService$1.doRun(WatcherService.java:399) [x-pack-watcher-6.8.1.jar:6.8.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) [elasticsearch-6.8.1.jar:6.8.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.8.1.jar:6.8.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:235) ~[elasticsearch-6.8.1.jar:6.8.1]
at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:69) ~[elasticsearch-6.8.1.jar:6.8.1]
at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:77) ~[elasticsearch-6.8.1.jar:6.8.1]
... 11 more
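To identify which nodes are affected, the Watcher stats API reports a per-node watcher_state; stuck nodes sit at "starting" indefinitely (on 6.x the endpoint is GET _xpack/watcher/stats):

# Each entry in the "stats" array reports a node's Watcher state;
# affected nodes show "watcher_state": "starting" and never progress
GET _watcher/stats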
This happens due to a bug in WatcherLifeCycleService in Watcher's startup routine. Watcher can only move from STARTING to STARTED via callbacks defined in this method:

elasticsearch/x-pack/plugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/WatcherLifeCycleService.java
Line 86 in 57c473c

The problem occurs when this line in start() is hit due to an exception being thrown while reading from the .watches or .triggered_watches indices:

elasticsearch/x-pack/plugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/WatcherService.java
Line 207 in c7ef318

At this point, the callback which sets the state to STARTED is lost, and no branch in WatcherLifeCycleService#clusterChanged handles the STARTING state. There is no way to exit this state without restarting the node.
Ran into this same issue but found an alternate way to remediate it without restarting the cluster.
The first step was rerouting the shards of the .watches and .triggered_watches indices to nodes where Watcher had started successfully:
# Stop Watcher first as a precaution
POST _watcher/_stop

POST _cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": ".watches",
        "shard": 0,
        "from_node": "<node with Watcher stuck starting>",
        "to_node": "<node with Watcher started>"
      }
    },
    {
      "move": {
        "index": ".triggered_watches",
        "shard": 0,
        "from_node": "<node with Watcher stuck starting>",
        "to_node": "<node with Watcher started>"
      }
    }
    ... repeated for every node with Watcher stuck in the starting phase
  ]
}

# Now restart Watcher
POST _watcher/_start
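At each of these steps, shard placement can be confirmed with the cat shards API, which shows which node holds each copy of the two indices:

# Verify the .watches and .triggered_watches shards have moved off the stuck nodes
GET _cat/shards/.watches,.triggered_watches?v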
After that, force Watcher to re-assign watches by turning replica auto-expansion off and back on:
PUT .watches/_settings
{
  "index.auto_expand_replicas": "false"
}

# where x is the number of Watcher nodes, or "all"
PUT .watches/_settings
{
  "index.auto_expand_replicas": "0-x"
}
When I ran into this, there were 16 nodes stuck in the "starting" state for Watcher, and following these steps resolved it for all of them without needing a full restart of any node.