[skyrl-train][inference] Inference Server Refactor (1/N)#899
CharlieFRuan merged 28 commits into NovaSky-AI:main from
Conversation
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Code Review
This pull request introduces a significant and well-designed refactoring for inference serving. The new architecture, with its clear separation of concerns into a router, server group, actor pool, and a server actor protocol, is robust and extensible. The implementation of the vLLM server actor and the comprehensive test suite, including complex scenarios like weight synchronization, are particularly impressive. I have a few suggestions to improve the router's lifecycle management and fix a test case.
/gemini review
Code Review
This pull request introduces a significant refactor of the inference server architecture, moving towards a more decoupled and flexible design. Key components include a new InferenceRouter for load balancing and control plane fan-out, ServerGroup for managing server actors with placement groups, and a ServerActorProtocol to enable interchangeable backends like vLLM and SGLang. The changes also include a robust weight synchronization mechanism between trainers and inference servers. New utility functions for environment variables and port management are added, along with comprehensive unit and GPU integration tests. The overall structure is well-thought-out, enhancing maintainability and scalability for RL workloads.
CharlieFRuan left a comment
Thank you so much for the PR and pushing the refactor of the rollout stack. Learned a lot when reviewing, and cannot wait to see it land!
> Routing behavior:
> - Data plane (generation requests): Routes to ONE server.
>   - If X-Session-ID header present: consistent hash to same backend
The current skyrl-train/skyrl_train/inference_engines/inference_engine_client_http_endpoint.py allows users, in their agent harness, to use the HTTP endpoint that SkyRL spins up instead of, say, an OpenAI endpoint.
With this refactoring, the user will use http://0.0.0.0:8080/v1/chat/completions for the router: router = InferenceRouter(server_urls, host="0.0.0.0", port=8080).
My question is, for the sticky routing, how will the user add the header? Would it be simpler if the session ID is in the body?
Currently for terminal bench, it will have a session_id field in the request:
- TBench side will forward this kwarg in the request body
- Then on SkyRL, upon receiving this request, pops the session_id here: SkyRL/skyrl-train/skyrl_train/inference_engines/inference_engine_client.py (lines 367 to 377 in b9a6307)
So we need to make it compatible with industry standards. If you look at sgl-router, vllm-router, etc., the dominant way to pass in the session ID is through a request header, X-Session-ID. I don't know what the benefit of this is over keeping it in the body, but I think it's more of a standard for decoupling middleware / routing ops from the request body.
session_id is not part of the chat or completion API standard, btw. So I'd say we need to change the agent harness / generator to use the header instead.
> if you look at sgl router, vllm-router, etc. the dominant way to pass in the session id is through passing a request header that has X-Session-ID
Thanks for the info, didn't know that before! Let's keep it as X-Session-ID for now. We just need to remember to properly tell users somewhere how to do sticky routing. Made an issue here: #939
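To make the sticky-routing discussion concrete, here is a minimal sketch (an assumption for illustration, not the PR's actual implementation) of how a router could map an X-Session-ID value to one backend via a stable hash:

```python
import hashlib


def pick_backend(session_id: str, server_urls: list[str]) -> str:
    """Hypothetical helper: hash the X-Session-ID value so repeated
    requests for the same session always land on the same backend."""
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
    return server_urls[int(digest, 16) % len(server_urls)]


urls = ["http://host-a:8000", "http://host-b:8000", "http://host-c:8000"]
# Same session ID -> same server, every time.
assert pick_backend("tbench-session-7", urls) == pick_backend("tbench-session-7", urls)
```

A plain modulo over a cryptographic hash is enough to show the idea; a production router would typically use consistent hashing so that adding or removing a backend remaps only a fraction of sessions.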
skyrl-train/tests/gpu/gpu_ci/inference_servers/test_inference_server_group.py
CharlieFRuan left a comment
Thanks for addressing the comments! Last 3 nits!
> "/wake_up",
> "/reset_prefix_cache",
> "/collective_rpc",
> "/health",
What is the semantic of _proxy_to_all() for /health and /is_paused? If one of the engines returns false, does the router return false altogether? Could we add a unit test for /health when one of the engines is not healthy?
I feel like the semantics of /health and /is_paused (GET methods) are different from the other control-plane APIs (POST) for _proxy_to_all().
Good question.
/is_paused --> the router returns a mapping from server_urls to the is_paused state of each engine (same as any other control-plane op).
/health --> I think we should actually make it not a control-plane endpoint. Basically, if at least one healthy engine exists, it should return true. By making /health go through the routing, if it returns true you are guaranteed that at least one engine is healthy. The router should be smart enough not to route to unhealthy engines (that's part of the router's responsibility, which we can add to our simple one later if we want).
I removed /health from the control-plane endpoints because of the above.
sounds good, thanks!
> routers returns a mapping from server_urls to is_paused state of each engine (same as any other control plane ops)
Got it. Would be good to document this behavior somewhere! We can do that in another PR
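The semantics discussed in this thread can be sketched as follows. This is a hedged illustration, not the merged code: proxy_to_all, fetch, and router_health are hypothetical stand-ins for the router's internals.

```python
import asyncio


async def proxy_to_all(server_urls, fetch):
    """Fan a control-plane call (e.g. /is_paused) out to every server
    concurrently and return a mapping from server URL to its response."""
    results = await asyncio.gather(*(fetch(url) for url in server_urls))
    return dict(zip(server_urls, results))


def router_health(engine_health: dict) -> bool:
    """Proposed /health semantic: healthy if at least one engine is."""
    return any(engine_health.values())


async def main():
    async def fake_is_paused(url):
        # Stub for an HTTP GET; pretend the engine on port 8001 is paused.
        return url.endswith(":8001")

    mapping = await proxy_to_all(["http://a:8000", "http://b:8001"], fake_is_paused)
    print(mapping)  # {'http://a:8000': False, 'http://b:8001': True}


asyncio.run(main())
```

Keeping /health out of _proxy_to_all() and instead answering it from the routing layer gives the aggregate "any engine alive" behavior, while /is_paused stays a per-engine fan-out like the other control-plane ops.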
skyrl-train/tests/gpu/gpu_ci/inference_servers/test_inference_server_group.py
@kouroshHakha is attempting to deploy a commit to the Tyler's projects Team on Vercel. A member of the Team first needs to authorize it.
Why

We're building HTTP-based inference serving for RL workloads. This enables:

What

Key Design Decisions
- ServerActorProtocol / ServerGroup
- Control-plane ops (/pause, /resume, /init_weight_transfer) hit all servers

Review Guide
- protocols.py (30 lines) - the interface contract
- router.py - proxy routing + fan-out behavior
- server_group.py + vllm_server_actor.py - Ray actor lifecycle
- test_weight_sync.py - trainer→inference weight sync flow

Testing

# GPU CI (requires 4 GPUs)
uv run pytest tests/gpu/gpu_ci/test_inference_server_group.py -v
uv run pytest tests/gpu/gpu_ci/test_weight_sync.py -v

Next
- RemoteInferenceClient - typed client for trainer to call inference servers