Mistral API server worker timeout

This seems to be different than the issues I've had in the past and any that I've seen mentioned here. Using st2 2.3.2

More logs are available here:
https://gist.github.com/JohnWelborn/6ad29b1630726a22307018824a827e7b

A few times per day I am seeing a mistral subworkow stay running in st2.  The examples seem to match up to the mistral worker timing out and getting killed. The mistral api is running as:

`/opt/stackstorm/mistral/bin/python /opt/stackstorm/mistral/bin/gunicorn --log-file /var/log/mistral/mistral-api.log -b 127.0.0.1:8989 -w 8 mistral.api.wsgi --graceful-timeout 10`

This looks like the default 30 second gunicorn timeout.

**mistral-api.log**
[2017-09-05 08:56:59 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:10716)
[2017-09-05 08:56:59 +0000] [10716] [INFO] Worker exiting (pid: 10716)
[2017-09-05 08:56:59 +0000] [23101] [INFO] Booting worker with pid: 23101
[2017-09-05 08:57:22 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:25455)
[2017-09-05 08:57:22 +0000] [25455] [INFO] Worker exiting (pid: 25455)
[2017-09-05 08:57:22 +0000] [23191] [INFO] Booting worker with pid: 23191
[2017-09-05 08:57:31 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:25593)
[2017-09-05 08:57:31 +0000] [25593] [INFO] Worker exiting (pid: 25593)
[2017-09-05 08:57:31 +0000] [23207] [INFO] Booting worker with pid: 23207
[2017-09-05 08:57:34 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:12316)
[2017-09-05 08:57:34 +0000] [12316] [INFO] Worker exiting (pid: 12316)
[2017-09-05 08:57:34 +0000] [23217] [INFO] Booting worker with pid: 23217
[2017-09-05 08:58:24 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:27539)
[2017-09-05 08:58:24 +0000] [27539] [INFO] Worker exiting (pid: 27539)
[2017-09-05 08:58:25 +0000] [23343] [INFO] Booting worker with pid: 23343
[2017-09-05 09:36:04 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:23217)
[2017-09-05 09:36:04 +0000] [23217] [INFO] Worker exiting (pid: 23217)
[2017-09-05 09:36:04 +0000] [3829] [INFO] Booting worker with pid: 3829
[2017-09-05 18:33:09 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:2068)
[2017-09-05 18:33:09 +0000] [2068] [INFO] Worker exiting (pid: 2068)
[2017-09-05 18:33:09 +0000] [6840] [INFO] Booting worker with pid: 6840
[2017-09-05 18:33:31 +0000] [686] [CRITICAL] WORKER TIMEOUT (pid:23343)
[2017-09-05 18:33:31 +0000] [23343] [INFO] Worker exiting (pid: 23343)
[2017-09-05 18:33:32 +0000] [6903] [INFO] Booting worker with pid: 6903

Based on the logs in the gist including "WorkflowExecution not found" I think the mistral worker was busy doing its execution_expiration_policy at the 8:56:59 example. But the worker timeout happens much more than that. It may still be good for me to change the interval from 120 to 30:

[execution_expiration_policy]
evaluation_interval = 120
older_than = 2880

I have 8 mistral-server (which write to the same log file) and 8 mistral-api (each with 8 workers) instances running to keep up with demand. 

I could use some guidance in where I should be looking to debug this. Do the logs indicate why the api is getting a worker timeout? Is it safe/advisable to increase the gunicorn timeout from 30 seconds? Should/Can I use `-k gevent`?

Thanks



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Mistral API server worker timeout #3717

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Mistral API server worker timeout #3717

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions