This seems to be different than the issues I've had in the past and any that I've seen mentioned here. Using st2 2.3.2
More logs are available here:
https://gist.github.com/JohnWelborn/6ad29b1630726a22307018824a827e7b
A few times per day I am seeing a mistral subworkow stay running in st2. The examples seem to match up to the mistral worker timing out and getting killed. The mistral api is running as:
/opt/stackstorm/mistral/bin/python /opt/stackstorm/mistral/bin/gunicorn --log-file /var/log/mistral/mistral-api.log -b 127.0.0.1:8989 -w 8 mistral.api.wsgi --graceful-timeout 10
This looks like the default 30 second gunicorn timeout.
mistral-api.log
[2017-09-05 08:56:59 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:10716)
[2017-09-05 08:56:59 +0000] [10716] [INFO] Worker exiting (pid: 10716)
[2017-09-05 08:56:59 +0000] [23101] [INFO] Booting worker with pid: 23101
[2017-09-05 08:57:22 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:25455)
[2017-09-05 08:57:22 +0000] [25455] [INFO] Worker exiting (pid: 25455)
[2017-09-05 08:57:22 +0000] [23191] [INFO] Booting worker with pid: 23191
[2017-09-05 08:57:31 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:25593)
[2017-09-05 08:57:31 +0000] [25593] [INFO] Worker exiting (pid: 25593)
[2017-09-05 08:57:31 +0000] [23207] [INFO] Booting worker with pid: 23207
[2017-09-05 08:57:34 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:12316)
[2017-09-05 08:57:34 +0000] [12316] [INFO] Worker exiting (pid: 12316)
[2017-09-05 08:57:34 +0000] [23217] [INFO] Booting worker with pid: 23217
[2017-09-05 08:58:24 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:27539)
[2017-09-05 08:58:24 +0000] [27539] [INFO] Worker exiting (pid: 27539)
[2017-09-05 08:58:25 +0000] [23343] [INFO] Booting worker with pid: 23343
[2017-09-05 09:36:04 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:23217)
[2017-09-05 09:36:04 +0000] [23217] [INFO] Worker exiting (pid: 23217)
[2017-09-05 09:36:04 +0000] [3829] [INFO] Booting worker with pid: 3829
[2017-09-05 18:33:09 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:2068)
[2017-09-05 18:33:09 +0000] [2068] [INFO] Worker exiting (pid: 2068)
[2017-09-05 18:33:09 +0000] [6840] [INFO] Booting worker with pid: 6840
[2017-09-05 18:33:31 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:23343)
[2017-09-05 18:33:31 +0000] [23343] [INFO] Worker exiting (pid: 23343)
[2017-09-05 18:33:32 +0000] [6903] [INFO] Booting worker with pid: 6903
Based on the logs in the gist including "WorkflowExecution not found" I think the mistral worker was busy doing its execution_expiration_policy at the 8:56:59 example. But the worker timeout happens much more than that. It may still be good for me to change the interval from 120 to 30:
[execution_expiration_policy]
evaluation_interval = 120
older_than = 2880
I have 8 mistral-server (which write to the same log file) and 8 mistral-api (each with 8 workers) instances running to keep up with demand.
I could use some guidance in where I should be looking to debug this. Do the logs indicate why the api is getting a worker timeout? Is it safe/advisable to increase the gunicorn timeout from 30 seconds? Should/Can I use -k gevent?
Thanks
This seems to be different than the issues I've had in the past and any that I've seen mentioned here. Using st2 2.3.2
More logs are available here:
https://gist.github.com/JohnWelborn/6ad29b1630726a22307018824a827e7b
A few times per day I am seeing a mistral subworkow stay running in st2. The examples seem to match up to the mistral worker timing out and getting killed. The mistral api is running as:
/opt/stackstorm/mistral/bin/python /opt/stackstorm/mistral/bin/gunicorn --log-file /var/log/mistral/mistral-api.log -b 127.0.0.1:8989 -w 8 mistral.api.wsgi --graceful-timeout 10This looks like the default 30 second gunicorn timeout.
mistral-api.log
[2017-09-05 08:56:59 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:10716)
[2017-09-05 08:56:59 +0000] [10716] [INFO] Worker exiting (pid: 10716)
[2017-09-05 08:56:59 +0000] [23101] [INFO] Booting worker with pid: 23101
[2017-09-05 08:57:22 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:25455)
[2017-09-05 08:57:22 +0000] [25455] [INFO] Worker exiting (pid: 25455)
[2017-09-05 08:57:22 +0000] [23191] [INFO] Booting worker with pid: 23191
[2017-09-05 08:57:31 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:25593)
[2017-09-05 08:57:31 +0000] [25593] [INFO] Worker exiting (pid: 25593)
[2017-09-05 08:57:31 +0000] [23207] [INFO] Booting worker with pid: 23207
[2017-09-05 08:57:34 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:12316)
[2017-09-05 08:57:34 +0000] [12316] [INFO] Worker exiting (pid: 12316)
[2017-09-05 08:57:34 +0000] [23217] [INFO] Booting worker with pid: 23217
[2017-09-05 08:58:24 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:27539)
[2017-09-05 08:58:24 +0000] [27539] [INFO] Worker exiting (pid: 27539)
[2017-09-05 08:58:25 +0000] [23343] [INFO] Booting worker with pid: 23343
[2017-09-05 09:36:04 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:23217)
[2017-09-05 09:36:04 +0000] [23217] [INFO] Worker exiting (pid: 23217)
[2017-09-05 09:36:04 +0000] [3829] [INFO] Booting worker with pid: 3829
[2017-09-05 18:33:09 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:2068)
[2017-09-05 18:33:09 +0000] [2068] [INFO] Worker exiting (pid: 2068)
[2017-09-05 18:33:09 +0000] [6840] [INFO] Booting worker with pid: 6840
[2017-09-05 18:33:31 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:23343)
[2017-09-05 18:33:31 +0000] [23343] [INFO] Worker exiting (pid: 23343)
[2017-09-05 18:33:32 +0000] [6903] [INFO] Booting worker with pid: 6903
Based on the logs in the gist including "WorkflowExecution not found" I think the mistral worker was busy doing its execution_expiration_policy at the 8:56:59 example. But the worker timeout happens much more than that. It may still be good for me to change the interval from 120 to 30:
[execution_expiration_policy]
evaluation_interval = 120
older_than = 2880
I have 8 mistral-server (which write to the same log file) and 8 mistral-api (each with 8 workers) instances running to keep up with demand.
I could use some guidance in where I should be looking to debug this. Do the logs indicate why the api is getting a worker timeout? Is it safe/advisable to increase the gunicorn timeout from 30 seconds? Should/Can I use
-k gevent?Thanks