Improve health endpoint #4615
Merged
Pull request overview
This PR enhances the OpenAI server /health endpoint by adding an engine health monitor that actively probes backend liveness and detects scheduler stalls via a new monotonic scheduler_tick metric surfaced from both TurboMind (C++/pybind) and PyTorch backends.
Changes:
- Add `scheduler_tick` to schedule metrics across TurboMind (C++ + Python binding) and PyTorch scheduler metrics.
- Introduce `EngineHealthMonitor` + `AsyncEngine.health_probe()` and wire `/health` to return structured JSON with 200/503 based on engine status.
- Add lightweight backend-specific `get_health_status()` implementations (TurboMind, PyTorch, mp engines) for the health probe.
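As a rough illustration of the monitor-plus-endpoint wiring listed above, the sketch below runs a background task that periodically probes an engine, caches the latest snapshot, and maps it to an HTTP status code. The class `SketchHealthMonitor`, the `probe` and `interval` parameters, the `'healthy'`/`'sleeping'` status strings, and the choice to report unhealthy before the first probe completes are all assumptions made for illustration; the `EngineHealthMonitor` added by this PR may be structured differently.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional, Tuple


class SketchHealthMonitor:
    """Hypothetical stand-in for EngineHealthMonitor (not the PR's code)."""

    def __init__(self, probe: Callable[[], Awaitable[Dict]], interval: float = 1.0):
        self._probe = probe        # async callable returning a status dict
        self._interval = interval  # seconds between probes
        # Assumption: report unhealthy until the first probe has completed.
        self._snapshot: Dict = {'status': 'unhealthy', 'message': 'No probe has completed yet.'}
        self._task: Optional[asyncio.Task] = None

    async def _run(self) -> None:
        while True:
            try:
                self._snapshot = await self._probe()
            except Exception as e:  # a failing probe counts as unhealthy
                self._snapshot = {'status': 'unhealthy', 'message': f'Probe failed: {e}'}
            await asyncio.sleep(self._interval)

    def start(self) -> None:
        self._task = asyncio.create_task(self._run(), name='SketchHealthMonitor')

    async def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
            self._task = None

    def response(self) -> Tuple[int, Dict]:
        """Return (HTTP status code, JSON body) from the cached snapshot."""
        code = 503 if self._snapshot.get('status') == 'unhealthy' else 200
        return code, dict(self._snapshot)
```

Serving the cached snapshot keeps the endpoint handler cheap: a `/health` request never waits on the backend, it only reads what the background task last recorded.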
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/turbomind/utils/metrics.h | Adds scheduler_tick field to TurboMind schedule metrics and prints it in the stream operator. |
| src/turbomind/python/bind.cpp | Exposes scheduler_tick to Python via pybind ScheduleMetrics. |
| src/turbomind/engine/engine.cc | Tracks scheduler_tick and adjusts schedule-metrics update/get logic (also initializes metrics after seq manager creation). |
| lmdeploy/turbomind/turbomind.py | Propagates scheduler_tick into Python ScheduleMetrics and adds TurboMind get_health_status(). |
| lmdeploy/serve/openai/api_server.py | Switches /health to JSON output backed by EngineHealthMonitor; wires monitor into FastAPI lifespan. |
| lmdeploy/serve/managers/session_manager.py | Adds num_dispatched to track checked-out request handles for stall detection logic. |
| lmdeploy/serve/core/health.py | New EngineHealthMonitor background task that periodically probes engine health. |
| lmdeploy/serve/core/async_engine.py | Adds bounded, non-overlapping health probing + scheduler progress validation. |
| lmdeploy/serve/core/__init__.py | Exports EngineHealthMonitor. |
| lmdeploy/pytorch/paging/scheduler.py | Adds scheduler_tick and includes it in schedule metrics. |
| lmdeploy/pytorch/engine/mp_engine/zmq_engine.py | Adds health status check for ZMQ process liveness before probing. |
| lmdeploy/pytorch/engine/mp_engine/base.py | Adds get_health_status() RPC wrapper. |
| lmdeploy/pytorch/engine/mp_engine/base_worker.py | Adds RPC-exposed get_health_status() implementation. |
| lmdeploy/pytorch/engine/engine.py | Adds PyTorch engine get_health_status() checking request/main loop task liveness. |
| lmdeploy/pytorch/engine/engine_loop.py | Increments scheduler tick on each main-loop iteration. |
| lmdeploy/pytorch/engine/base.py | Adds get_health_status() to the engine base interface. |
| lmdeploy/messages.py | Adds scheduler_tick field to the Python ScheduleMetrics dataclass. |
Comment on lines +611 to +617

    @staticmethod
    def _health_check_tasks(tasks):
        done_tasks = []
        for task in list(tasks):
            if task.done():
                done_tasks.append(task.get_name())
        return len(done_tasks) == 0, done_tasks
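To make the behavior of this helper concrete, here is a small self-contained demo. It copies the same logic into a free function and runs it against one task that stays pending and one that crashes. The task names `MainLoop` and `RequestLoop` and the `demo` coroutine are hypothetical, chosen only for illustration.

```python
import asyncio


def health_check_tasks(tasks):
    """Same logic as the reviewed helper, as a free function for the demo."""
    done_tasks = [task.get_name() for task in list(tasks) if task.done()]
    return len(done_tasks) == 0, done_tasks


async def demo():
    async def forever():
        await asyncio.Event().wait()  # never set, so this task stays pending

    async def crashes():
        raise RuntimeError('loop died')

    alive = asyncio.create_task(forever(), name='MainLoop')
    dead = asyncio.create_task(crashes(), name='RequestLoop')
    await asyncio.sleep(0)  # yield once so both tasks run their first step
    result = health_check_tasks([alive, dead])
    dead.exception()        # retrieve the exception to avoid an unretrieved-exception warning
    alive.cancel()
    return result


if __name__ == '__main__':
    print(asyncio.run(demo()))
```

Note that the check treats any finished task as a failure, including one that returned normally. That fits long-running loop tasks, which are expected never to complete while the engine is alive.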
Comment on lines +305 to +320

    self._health_probe_task = asyncio.create_task(self.engine.get_health_status(), name='EngineHealthProbe')
    try:
        backend_status = await asyncio.wait_for(asyncio.shield(self._health_probe_task), timeout=timeout)
    except asyncio.TimeoutError:
        return self._make_health_result(
            status='unhealthy',
            message=f'Backend health probe timed out after {timeout:.1f}s.',
        )
    except Exception as e:
        self._health_probe_task = None
        return self._make_health_result(
            status='unhealthy',
            message=f'Backend health probe failed: {e}',
        )

    self._health_probe_task = None
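The combination of `asyncio.shield` and `asyncio.wait_for` in this snippet bounds how long the caller waits without cancelling the backend call itself. The sketch below shows one way that pattern can also prevent overlapping probes: a timed-out probe task is kept and reused by the next call. The `ProbeRunner` class, its `started` counter, and the "reuse the in-flight task" check are assumptions for illustration; the quoted lines do not show how the PR guards against overlap.

```python
import asyncio


class ProbeRunner:
    """Hypothetical sketch of a bounded, non-overlapping probe (not the PR's code)."""

    def __init__(self, backend_probe):
        self._backend_probe = backend_probe  # async callable that may be slow or hang
        self._task = None
        self.started = 0                     # number of backend probes launched

    async def probe(self, timeout: float) -> str:
        if self._task is None:
            # Launch a new backend probe only if none is in flight.
            self._task = asyncio.create_task(self._backend_probe())
            self.started += 1
        try:
            # shield() keeps the inner task alive when wait_for() times out,
            # so the slow backend call is not cancelled mid-flight.
            result = await asyncio.wait_for(asyncio.shield(self._task), timeout=timeout)
        except asyncio.TimeoutError:
            # Keep self._task so the next call reuses it instead of stacking probes.
            return 'unhealthy: timed out'
        except Exception as e:
            self._task = None
            return f'unhealthy: {e}'
        self._task = None
        return result
```

Without `shield`, `wait_for` would cancel the backend call on timeout, which may be undesirable if the call crosses a process boundary or holds backend state.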
      total_blocks=tm_metrics.total_blocks,
      active_blocks=tm_metrics.active_blocks,
    - free_blocks=tm_metrics.free_blocks)
    + free_blocks=tm_metrics.free_blocks,
grimoire reviewed May 25, 2026
grimoire approved these changes May 26, 2026
lzhangzz approved these changes May 27, 2026
Improve the API server `/health` endpoint so it reflects inference engine health instead of only reporting that the HTTP server is alive.

This change adds backend health probing for both PyTorch and TurboMind engines. The API server now runs a background `EngineHealthMonitor`, caches the latest health snapshot, and returns `503` when the inference backend is unhealthy while keeping `200` for healthy or sleeping engines.

The health probe uses a bounded, non-overlapping backend check and validates scheduler progress with a backend-owned monotonic `scheduler_tick`. This allows `/health` to detect cases where requests have been dispatched but the backend scheduler stops making progress. Idle periods are handled separately so the backend is not marked unhealthy simply because there is no active work.

Both engines expose `scheduler_tick` through schedule metrics, which is updated on every inference iteration, so health probing sees current sequence/block state. Besides `scheduler_tick`, PyTorch engine health status now also checks engine loop/task liveness.
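The stall check described above can be sketched roughly as follows: compare the current `scheduler_tick` against the last one seen, and only treat a frozen tick as a problem when requests are outstanding. The `ProgressTracker` class, the `stall_timeout` parameter and its default, and the exact idle handling are assumptions for illustration, not the logic in `async_engine.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProgressTracker:
    """Hypothetical stall detector; names and thresholds are assumptions."""
    stall_timeout: float = 30.0  # seconds without a tick change before flagging a stall
    last_tick: Optional[int] = None
    last_progress_time: float = 0.0

    def check(self, scheduler_tick: int, num_dispatched: int, now: float) -> bool:
        """Return True if the scheduler looks healthy at time `now`."""
        if self.last_tick is None or scheduler_tick != self.last_tick:
            # First sample, or the tick advanced: progress was made.
            self.last_tick = scheduler_tick
            self.last_progress_time = now
            return True
        if num_dispatched == 0:
            # Idle: no work is outstanding, so a frozen tick is acceptable.
            # Reset the clock so idle time never counts toward a stall.
            self.last_progress_time = now
            return True
        # Work is outstanding but the tick has not moved.
        return (now - self.last_progress_time) < self.stall_timeout
```

For example, with `stall_timeout=10.0`, a tracker that sees tick `1` while idle at `now=100.0` and then sees the same tick with two dispatched requests reports healthy at `now=105.0` and unhealthy at `now=111.0`. Passing `now` explicitly keeps the sketch deterministic; a real caller would likely supply `time.monotonic()`.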