
[APIServer][Feature] Add configurable worker health check timeout via FD_WORKER_ALIVE_TIMEOUT #5865

Merged
Jiang-Jia-Jun merged 7 commits into develop from copilot/configure-worker-timeout-parameter
Jan 5, 2026

Conversation

Contributor

Copilot AI commented Jan 4, 2026

Motivation

Worker processes whose computations ran longer than 30 seconds triggered "worker not healthy" errors, because the check_health() calls in serving_chat.py and serving_completion.py used a hardcoded 30-second timeout.

Modifications

Environment Variable

  • Added FD_WORKER_ALIVE_TIMEOUT to fastdeploy/envs.py (default: 30 seconds)
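
As a rough illustration of what such a definition could look like, here is a minimal sketch. It assumes a lazily evaluated registry of environment variables; the names environment_variables and get_env are hypothetical and are not taken from fastdeploy/envs.py, whose actual layout is not shown in this PR description.

```python
import os
from typing import Any, Callable, Dict

# Hypothetical registry: each entry is a zero-argument callable, so the
# environment is read at access time rather than at import time.
environment_variables: Dict[str, Callable[[], Any]] = {
    # Seconds the service layer tolerates without a sign of life from the
    # worker before reporting "worker not healthy". Default: 30.
    "FD_WORKER_ALIVE_TIMEOUT": lambda: int(os.getenv("FD_WORKER_ALIVE_TIMEOUT", "30")),
}


def get_env(name: str) -> Any:
    """Return the current value of a registered variable (hypothetical helper)."""
    if name not in environment_variables:
        raise AttributeError(f"unknown environment variable: {name}")
    return environment_variables[name]()
```

With this shape, a value exported before the server starts (for example FD_WORKER_ALIVE_TIMEOUT=120) is returned as an integer, and the default of 30 applies when the variable is unset.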

Service Layer Updates

  • Modified 4 check_health() invocations in serving_chat.py (lines 270, 580) and serving_completion.py (lines 288, 458) to pass the configurable timeout instead of relying on the 30-second default
  • Added import for fastdeploy.envs module
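
The effect of the call-site change can be sketched as follows. This is an assumption-laden illustration, not the project's code: FakeEngineClient, its last_heartbeat field, and the timeout and now parameters are hypothetical stand-ins, since the real check_health() signature is not shown in this PR description.

```python
import os
import time

# Assumption: the configured timeout is read from the environment, default 30 s.
WORKER_ALIVE_TIMEOUT = int(os.getenv("FD_WORKER_ALIVE_TIMEOUT", "30"))


class FakeEngineClient:
    """Hypothetical stand-in for the object that exposes check_health()."""

    def __init__(self, last_heartbeat):
        # Timestamp (seconds) of the most recent sign of life from the worker.
        self.last_heartbeat = last_heartbeat

    def check_health(self, timeout=30.0, now=None):
        """Return (healthy, message); unhealthy if the heartbeat is older than timeout."""
        now = time.time() if now is None else now
        if now - self.last_heartbeat > timeout:
            return False, "worker not healthy"
        return True, ""


def is_worker_alive(client, now=None):
    # Before the change: client.check_health() with the implicit 30 s default.
    # After the change: the configured value is passed explicitly.
    healthy, _ = client.check_health(timeout=WORKER_ALIVE_TIMEOUT, now=now)
    return healthy
```

In this sketch, a worker that has been busy for 45 seconds fails the check with a 30-second threshold and passes it with a 120-second threshold, which is the scenario the PR targets.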

Documentation

  • Updated docs/usage/environment_variables.md (English)
  • Updated docs/zh/usage/environment_variables.md (Chinese)

Usage or Command

# Default 30s timeout
python -m fastdeploy.entrypoints.openai.serving_chat

# Custom timeout for longer computations
FD_WORKER_ALIVE_TIMEOUT=120 python -m fastdeploy.entrypoints.openai.serving_chat

Accuracy Tests

N/A - No changes to model outputs or computation logic.

Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
    • No unit tests added as the change is a simple configuration parameter with straightforward integration. The implementation is validated through code inspection and manual testing.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Note: This PR requires cherry-pick to release/2.4 branch per issue requirements.

Original prompt

This section details the original issue you should resolve

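<issue_title>Change the timeout used for worker process liveness checks in openai/serving_chat and openai/serving_completion</issue_title>
<issue_description>Currently, openai/serving_chat.py and openai/serving_completion.py call the check_healthy function to determine whether the worker process is alive. The function's default timeout is 30 seconds, and the existing code calls it with that default, so the timeout is not configurable. As a result, when a worker process computes for more than 30 seconds, the service layer raises a "worker not healthy" error.

I want this timeout to be configurable: in serving_chat.py and serving_completion.py, set it through envs.FD_WORKER_ALIVE_TIMEOUT (the default remains 30s). Submit code that implements this requirement.

Notes

  • You need to modify fastdeploy/envs.py to add the environment variable, including an English comment describing it
  • You need to update the Chinese and English documentation under docs; there is a dedicated document on environment variables, and it needs a description of the new variable
  • After this PR is submitted, it also needs to be cherry-picked to the release/2.4 branch</issue_description>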

Comments on the Issue (you are @copilot in this section)



@CLAassistant

CLAassistant commented Jan 4, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ Jiang-Jia-Jun
❌ Copilot

@paddle-bot

paddle-bot bot commented Jan 4, 2026

Thanks for your contribution!

@paddle-bot paddle-bot bot added the contributor External developers label Jan 4, 2026
Copilot AI and others added 2 commits January 4, 2026 12:56
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
@Jiang-Jia-Jun Jiang-Jia-Jun marked this pull request as ready for review January 4, 2026 13:00
Copilot AI review requested due to automatic review settings January 4, 2026 13:00
Copilot AI changed the title from "[WIP] Add configurable worker process timeout for health checks" to "[APIServer][Feature] Add configurable worker health check timeout via FD_WORKER_ALIVE_TIMEOUT" Jan 4, 2026
Copilot AI requested a review from Jiang-Jia-Jun January 4, 2026 13:01
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a configurable timeout for worker process health checks to address scenarios where worker computations exceed the hardcoded 30-second default. The timeout is now controlled via the FD_WORKER_ALIVE_TIMEOUT environment variable.

Key Changes:

  • Introduces FD_WORKER_ALIVE_TIMEOUT environment variable with a default of 30 seconds
  • Updates health check calls in serving_chat.py and serving_completion.py to use the configurable timeout
  • Adds documentation for the new environment variable in both English and Chinese

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Summary per file:

  • fastdeploy/envs.py: Adds FD_WORKER_ALIVE_TIMEOUT environment variable definition with default value of 30 seconds
  • fastdeploy/entrypoints/openai/serving_chat.py: Imports envs module and updates two check_health calls to use envs.FD_WORKER_ALIVE_TIMEOUT
  • fastdeploy/entrypoints/openai/serving_completion.py: Imports envs module and updates two check_health calls to use envs.FD_WORKER_ALIVE_TIMEOUT
  • docs/usage/environment_variables.md: Adds English documentation for the new FD_WORKER_ALIVE_TIMEOUT environment variable
  • docs/zh/usage/environment_variables.md: Adds Chinese documentation for the new FD_WORKER_ALIVE_TIMEOUT environment variable
  • tests/test_worker_alive_timeout.py: Adds basic unit tests to verify environment variable retrieval with default and custom values
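
For readers curious what a test of "environment variable retrieval with default and custom values" might look like, here is a minimal sketch. The contents of tests/test_worker_alive_timeout.py are not shown on this page, so this is a guess at the general shape; read_worker_alive_timeout is a hypothetical helper standing in for envs.FD_WORKER_ALIVE_TIMEOUT.

```python
import os
import unittest
from unittest import mock


def read_worker_alive_timeout():
    """Hypothetical helper standing in for envs.FD_WORKER_ALIVE_TIMEOUT."""
    return int(os.getenv("FD_WORKER_ALIVE_TIMEOUT", "30"))


class WorkerAliveTimeoutTest(unittest.TestCase):
    def test_default_value(self):
        # patch.dict restores os.environ on exit, including keys removed inside.
        with mock.patch.dict(os.environ, {}, clear=False):
            os.environ.pop("FD_WORKER_ALIVE_TIMEOUT", None)
            self.assertEqual(read_worker_alive_timeout(), 30)

    def test_custom_value(self):
        with mock.patch.dict(os.environ, {"FD_WORKER_ALIVE_TIMEOUT": "120"}):
            self.assertEqual(read_worker_alive_timeout(), 120)
```

A test of this kind exercises only the environment lookup, not the health check path itself.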

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
@Jiang-Jia-Jun
Collaborator

@copilot I don't think the unit test you submitted, tests/test_worker_alive_timeout.py, serves much purpose; just remove it. I have no other concerns.

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 4, 2026

@copilot I don't think the unit test you submitted, tests/test_worker_alive_timeout.py, serves much purpose; just remove it. I have no other concerns.

Removed the test file in commit 297cb5a.

Copilot AI requested a review from Jiang-Jia-Jun January 4, 2026 13:12
Copilot AI added a commit that referenced this pull request Jan 4, 2026
…a FD_WORKER_ALIVE_TIMEOUT

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Jiang-Jia-Jun added a commit that referenced this pull request Jan 5, 2026
…k timeout via FD_WORKER_ALIVE_TIMEOUT(#5865) (#5867)

* Initial plan

* Cherry-pick PR #5865: Add configurable worker health check timeout via FD_WORKER_ALIVE_TIMEOUT

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 7d5282e into develop Jan 5, 2026
14 of 20 checks passed
@Jiang-Jia-Jun Jiang-Jia-Jun deleted the copilot/configure-worker-timeout-parameter branch January 5, 2026 01:47

Labels

contributor External developers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Change the timeout used for worker process liveness checks in openai/serving_chat and openai/serving_completion

3 participants