"IMMEDIATE" tasks getting queued up in pending tasks #8860

Closed
ppf2 opened this issue Dec 10, 2014 · 5 comments

Comments

@ppf2
Member

ppf2 commented Dec 10, 2014

Have a situation where the cluster had to be restarted. Upon restarting, there was a ton of recovery activity (at times we observed >100 EMERGENCY tasks in pending_tasks). As a result, attempts to update the cluster (e.g. to increase the concurrency setting for recovery) started failing with ProcessClusterEventTimeoutException errors.

Request:

{"transient":{"cluster.routing.allocation.node_concurrent_recoveries": 10}}

Response:

HTTP/1.1 503 Service Unavailable

{"error":"RemoteTransportException[[Rush][inet[/IP:9300]][cluster/settings/update]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (cluster_update_settings) within 30s]; ","status":503}

https://github.com/elasticsearch/elasticsearch/blob/1816951b6b0320e7a011436c7c7519ec2bfabc6e/src/main/java/org/elasticsearch/common/Priority.java#L45 seems to indicate that IMMEDIATE tasks are of higher priority than EMERGENCY tasks, but for some reason these cluster settings update calls are still being queued up.

Partial pending_tasks output:

{ 
"insert_order" : 2125, 
"priority" : "IMMEDIATE", 
"source" : "cluster_update_settings", 
"executing" : false, 
"time_in_queue_millis" : 5908, 
"time_in_queue" : "5.9s" 
}

{ 
"insert_order" : 1949, 
"priority" : "URGENT", 
"source" : "shard-started ([index20141116][8], node[nodeID], [P], s[INITIALIZING]), reason [master [Rush][nodeID][node][inet[/172.16.0.6:9300]]{http=false, data=false, master=true} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started]", 
"executing" : true, 
"time_in_queue_millis" : 2463692, 
"time_in_queue" : "41m" 
}

Is the cluster too busy to even go and re-prioritize its running tasks? Or is it because once an EMERGENCY task starts to run, even if an IMMEDIATE task comes in, it will still have to wait till these running EMERGENCY tasks have completed? In other words, if the IMMEDIATE task comes in at the same time as an EMERGENCY task, it will get prioritized higher, but it will not suspend any running EMERGENCY task to allow the IMMEDIATE task to run first?

@bleskes
Contributor

bleskes commented Dec 10, 2014

Hey @ppf2, I assume by EMERGENCY you mean URGENT. Answering based on that assumption...

Or is it because once an EMERGENCY task starts to run, even if an IMMEDIATE task comes in, it will still have to wait till these running EMERGENCY tasks have completed?

Yes, this is correct. Cluster state tasks are executed by a single thread to make sure things stay consistent. While the queue is re-prioritized with every task insertion, you have to wait for the thread to finish its current task and pick up the next highest-priority task. The assumption is that no single task should take long (although you may have many tasks queued up). The question is what task was taking so long in your case? You can run hot threads to see what the cluster state update thread is doing. Another thing to try is to increase the timeout of the cluster settings update call to 1m and see if that helps get it in (the master waits for up to 30s for the nodes to respond when publishing the cluster state, so this might be it as well).
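
For example, something along these lines (a sketch; I'm assuming master_timeout is the relevant knob on the settings call, and using the standard hot threads endpoint - double-check both against your version's docs):

GET /_nodes/hot_threads

PUT /_cluster/settings?master_timeout=1m
{"transient":{"cluster.routing.allocation.node_concurrent_recoveries": 10}}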

PS - even if the API call times out, the task is still queued up and will be executed when the thread is free. So the setting will be applied, albeit after the call returns.
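
For example, to see whether it eventually got applied, you can watch the queue drain and then read the settings back (standard pending tasks and cluster settings endpoints):

GET /_cluster/pending_tasks
GET /_cluster/settings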

@ppf2
Member Author

ppf2 commented Dec 10, 2014

Thanks, will get hot threads. And yes, I meant urgent tasks, not emergency :D

@ghost

ghost commented Dec 10, 2014

We were having a similar issue with IMMEDIATE tasks queuing up. See issue #8804.

@bleskes
Contributor

bleskes commented Dec 10, 2014

@miccon yeah, you have the same issue - a single cluster state update task takes way too long (which was fixed in your case by #8803, with disabling the include relocations setting as a workaround)
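
For reference, the workaround would look something like this (a sketch; I believe the setting is cluster.routing.allocation.disk.include_relocations, but check the docs for your version):

PUT /_cluster/settings
{"transient":{"cluster.routing.allocation.disk.include_relocations": false}}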

@clintongormley

This has probably been resolved by async shard store fetching. Please reopen if you still see this on recent versions.
