[APIServer][Feature] Add configurable worker health check timeout via FD_WORKER_ALIVE_TIMEOUT #5865
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Pull request overview
This PR adds a configurable timeout for worker process health checks to address scenarios where worker computations exceed the hardcoded 30-second default. The timeout is now controlled via the FD_WORKER_ALIVE_TIMEOUT environment variable.
Key Changes:
- Introduces the `FD_WORKER_ALIVE_TIMEOUT` environment variable with a default of 30 seconds
- Updates health check calls in serving_chat.py and serving_completion.py to use the configurable timeout
- Adds documentation for the new environment variable in both English and Chinese
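The change can be pictured with a minimal sketch, assuming the timeout is read from the environment as an integer number of seconds and then handed to the health check. The helper names `get_worker_alive_timeout` and `is_worker_healthy` are hypothetical stand-ins for illustration; they are not FastDeploy's actual `envs` accessor or `check_health()` implementation.

```python
import os
import time
from typing import Optional

ENV_NAME = "FD_WORKER_ALIVE_TIMEOUT"
DEFAULT_TIMEOUT_S = 30  # default stated in the PR


def get_worker_alive_timeout() -> int:
    """Hypothetical accessor: read the timeout in seconds, falling back to 30."""
    return int(os.getenv(ENV_NAME, str(DEFAULT_TIMEOUT_S)))


def is_worker_healthy(last_heartbeat: float, timeout_s: float,
                      now: Optional[float] = None) -> bool:
    """Hypothetical stand-in for a health check: the worker counts as healthy
    if its last heartbeat is more recent than timeout_s seconds ago."""
    current = time.time() if now is None else now
    return (current - last_heartbeat) < timeout_s


if __name__ == "__main__":
    # A worker that last reported 45 seconds ago fails a 30 s check
    # but passes once the timeout is raised to 120 s.
    os.environ[ENV_NAME] = "120"
    print(is_worker_healthy(0.0, 30, now=45.0))
    print(is_worker_healthy(0.0, get_worker_alive_timeout(), now=45.0))
```

Reading the variable at call time, rather than caching it at import, means the value in the process environment is the one each check sees.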
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| fastdeploy/envs.py | Adds FD_WORKER_ALIVE_TIMEOUT environment variable definition with default value of 30 seconds |
| fastdeploy/entrypoints/openai/serving_chat.py | Imports envs module and updates two check_health calls to use envs.FD_WORKER_ALIVE_TIMEOUT |
| fastdeploy/entrypoints/openai/serving_completion.py | Imports envs module and updates two check_health calls to use envs.FD_WORKER_ALIVE_TIMEOUT |
| docs/usage/environment_variables.md | Adds English documentation for the new FD_WORKER_ALIVE_TIMEOUT environment variable |
| docs/zh/usage/environment_variables.md | Adds Chinese documentation for the new FD_WORKER_ALIVE_TIMEOUT environment variable |
| tests/test_worker_alive_timeout.py | Adds basic unit tests to verify environment variable retrieval with default and custom values |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@copilot I don't think the unit test you committed, tests/test_worker_alive_timeout.py, serves any purpose; just remove it. I have no other concerns.
…k timeout via FD_WORKER_ALIVE_TIMEOUT(#5865) (#5867)
* Initial plan
* Cherry-pick PR #5865: Add configurable worker health check timeout via FD_WORKER_ALIVE_TIMEOUT
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Motivation
Worker processes executing computations beyond 30 seconds triggered "worker not healthy" errors due to the hardcoded timeout in `check_health()` calls within `serving_chat.py` and `serving_completion.py`.
Modifications
Environment Variable
- Added `FD_WORKER_ALIVE_TIMEOUT` to `fastdeploy/envs.py` (default: 30 seconds)
Service Layer Updates
- Updated `check_health()` invocations in `serving_chat.py` (lines 270, 580) and `serving_completion.py` (lines 288, 458) to use the configurable timeout parameter
- Imported the `fastdeploy.envs` module
Documentation
- `docs/usage/environment_variables.md` (English)
- `docs/zh/usage/environment_variables.md` (Chinese)
Usage or Command
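As a hedged illustration of usage (a sketch, not the PR's own command): the variable has to be present in the environment of the serving process. The snippet below passes it to a child process through `subprocess`; the child is a stand-in one-liner that only echoes the variable, and the helper names `env_with_timeout` and `run_child` are hypothetical. In practice the real service launch command would take the child's place.

```python
import os
import subprocess
import sys


def env_with_timeout(seconds: int) -> dict:
    """Copy the current environment and set FD_WORKER_ALIVE_TIMEOUT (seconds)."""
    env = dict(os.environ)
    env["FD_WORKER_ALIVE_TIMEOUT"] = str(seconds)
    return env


def run_child(seconds: int) -> str:
    """Start a stand-in child process and return the value it sees."""
    code = 'import os; print(os.environ["FD_WORKER_ALIVE_TIMEOUT"])'
    result = subprocess.run(
        [sys.executable, "-c", code],  # placeholder for the real launch command
        env=env_with_timeout(seconds),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Raise the health check timeout to 120 seconds for the child process.
    print(run_child(120))
```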
Accuracy Tests
N/A - No changes to model outputs or computation logic.
Checklist
- Run `pre-commit` before commit.
- If submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Note: This PR requires cherry-pick to the `release/2.4` branch per issue requirements.