
[skyrl-train][inference] Inference Server Refactor (1/N)#899

Merged
CharlieFRuan merged 28 commits into NovaSky-AI:main from kouroshHakha:kh/inference-1
Jan 25, 2026

Conversation

@kouroshHakha
Collaborator

Why

We're building HTTP-based inference serving for RL workloads. This enables:

  • Decoupled training ↔ inference via standard HTTP (no Ray object refs)
  • Flexible backends (vLLM now, SGLang later via protocol)
  • Dynamic weight sync between trainer and inference servers

What

skyrl_train/inference_servers/
├── router.py             # HTTP proxy: load balancing + control plane fan-out
├── server_group.py       # Manages N vLLM actors on a placement group
├── vllm_server_actor.py  # Ray actor wrapping vLLM OpenAI server
├── protocols.py          # ServerActorProtocol interface
├── vllm_worker.py        # Worker extension for NCCL weight sync
└── common.py             # Utilities (get_open_port, get_node_ip)

Key Design Decisions

Decision               | Rationale
-----------------------|-------------------------------------------------------
ServerActorProtocol    | Swap vLLM for SGLang without changing ServerGroup
Router as HTTP proxy   | Compatible with any HTTP client; no vLLM-specific deps
Control plane fan-out  | /pause, /resume, /init_weight_transfer hit all servers
Session-aware hashing  | Sticky routing for multi-turn conversations
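
The ServerActorProtocol decision above could look roughly like the sketch below. This is a hypothetical illustration, not the actual protocols.py contract; the method names (get_server_url, pause, resume) and the FakeVLLMServerActor class are assumptions for demonstration.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ServerActorProtocol(Protocol):
    """Hypothetical interface a backend server actor must satisfy.

    Method names here are illustrative assumptions, not the actual
    protocols.py contract.
    """

    def get_server_url(self) -> str:
        """Return the base HTTP URL this server listens on."""
        ...

    async def pause(self) -> None:
        """Stop accepting new generation requests."""
        ...

    async def resume(self) -> None:
        """Resume accepting generation requests."""
        ...


class FakeVLLMServerActor:
    """Minimal stand-in showing structural conformance (no Ray, no vLLM)."""

    def get_server_url(self) -> str:
        return "http://127.0.0.1:8000"

    async def pause(self) -> None:
        pass

    async def resume(self) -> None:
        pass
```

Because ServerGroup would only depend on the protocol, an SGLang-backed actor could satisfy the same structural interface without any change to ServerGroup.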

Review Guide

  1. Start here: protocols.py (30 lines) - the interface contract
  2. Core logic: router.py - proxy routing + fan-out behavior
  3. Glue code: server_group.py + vllm_server_actor.py - Ray actor lifecycle
  4. Tests: test_weight_sync.py - trainer→inference weight sync flow

Testing

# GPU CI (requires 4 GPUs)
uv run pytest tests/gpu/gpu_ci/test_inference_server_group.py -v
uv run pytest tests/gpu/gpu_ci/test_weight_sync.py -v

Next

RemoteInferenceClient - typed client for trainer to call inference servers

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha marked this pull request as ready for review January 20, 2026 03:26
Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant and well-designed refactoring for inference serving. The new architecture, with its clear separation of concerns into a router, server group, actor pool, and a server actor protocol, is robust and extensible. The implementation of the vLLM server actor and the comprehensive test suite, including complex scenarios like weight synchronization, are particularly impressive. I have a few suggestions to improve the router's lifecycle management and fix a test case.

@kouroshHakha
Collaborator Author

/gemini review

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant refactor of the inference server architecture, moving towards a more decoupled and flexible design. Key components include a new InferenceRouter for load balancing and control plane fan-out, ServerGroup for managing server actors with placement groups, and a ServerActorProtocol to enable interchangeable backends like vLLM and SGLang. The changes also include a robust weight synchronization mechanism between trainers and inference servers. New utility functions for environment variables and port management are added, along with comprehensive unit and GPU integration tests. The overall structure is well-thought-out, enhancing maintainability and scalability for RL workloads.

@CharlieFRuan self-assigned this Jan 20, 2026
Member

@CharlieFRuan left a comment

Thank you so much for the PR and pushing the refactor of the rollout stack. Learned a lot when reviewing, and cannot wait to see it land!


Routing behavior:
- Data plane (generation requests): Routes to ONE server.
- If X-Session-ID header present: consistent hash to same backend
Member

The current skyrl-train/skyrl_train/inference_engines/inference_engine_client_http_endpoint.py allows users to point their agent harness at the HTTP endpoint that SkyRL spins up instead of, say, an OpenAI endpoint.

With this refactoring, the user will use http://0.0.0.0:8080/v1/chat/completions for the router router = InferenceRouter(server_urls, host="0.0.0.0", port=8080).

My question is: for sticky routing, how will the user add the header? Would it be simpler if the session ID were in the body?

Currently for terminal bench, it will have a session_id field in the request:

  • TBench side will forward this kwarg in the request body
  • Then on SkyRL, upon receiving this request, pops the session_id here:

    async def chat_completion(self, request_payload: Dict[str, Any]) -> Dict[str, Any]:
        session_id = request_payload["json"].pop("session_id", None)
        if session_id is None:
            # if session_id is not provided, we'll use a random engine
            engine_idx = random.randint(0, len(self.engines) - 1)
        else:
            assert isinstance(session_id, (str, int)), "Session ID must be an integer or string for `/chat/completions`"
            engine_idx = hash_with_sha256(str(session_id)) % len(self.engines)
        # Always use the retry loop which also issues the first request inside
        return await self._chat_completion_with_retry(engine_idx, request_payload)

Collaborator Author

so we need to make it compatible with industry standards. If you look at sgl-router, vllm-router, etc., the dominant way to pass in the session id is through a request header, X-Session-ID. I don't know what the benefit of this is over keeping it in the body, but I think it's more of a standard for decoupling middleware / routing ops from the request body.
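
The header-based sticky routing described here can be sketched as follows. This is a hypothetical illustration of the sha256-mod-N scheme quoted above, not the actual router.py implementation; pick_backend is an illustrative name.

```python
import hashlib
import random
from typing import List, Optional


def pick_backend(server_urls: List[str], session_id: Optional[str]) -> str:
    """Route a request: the same X-Session-ID value always maps to the
    same backend, giving sticky routing for multi-turn conversations.

    Falls back to a random backend when no session header is present.
    Illustrative sketch only; not the actual router.py API.
    """
    if session_id is None:
        return random.choice(server_urls)
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return server_urls[int(digest, 16) % len(server_urls)]
```

Since the hash depends only on the session ID and the number of servers, every request carrying the same X-Session-ID lands on the same backend as long as the server list is stable.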

Collaborator Author

@kouroshHakha Jan 24, 2026

session_id is not part of the chat or completion API standard, btw. So I'd say we need to change the agent harness / generator to use the header instead.

Member

if you look at sgl router, vllm-router, etc. the dominant way to pass in the session id is through passing a request header that has X-Session-ID

Thanks for the info, didn't know before! Let's keep it as X-Session-ID for now! We just need to remember properly telling users about how to do sticky routing somewhere. Made an issue here: #939
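
To answer the question of how a user adds the header in their agent harness: with the requests library it might look like the sketch below. The router URL and payload are placeholders taken from the discussion above; "episode-42" is a hypothetical session ID.

```python
import requests

# Hypothetical router address from the discussion above; payload is a placeholder.
url = "http://0.0.0.0:8080/v1/chat/completions"
payload = {"model": "my-model", "messages": [{"role": "user", "content": "hi"}]}

# Build (but don't send) the request so the sticky-routing header is visible.
# Sending would be: requests.post(url, json=payload, headers={...})
req = requests.Request(
    "POST", url, json=payload, headers={"X-Session-ID": "episode-42"}
).prepare()
```

Any HTTP client that can set a request header works the same way, which is the point of keeping the session ID out of the body.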


Member

@CharlieFRuan left a comment

Thanks for addressing the comments! Last 3 nits!

"/wake_up",
"/reset_prefix_cache",
"/collective_rpc",
"/health",
Member

What is the semantics of _proxy_to_all() for /health and /is_paused? If one of the engines returns false, does the router return false altogether? Could we add a unit test for /health when one of the engines is not healthy?

I feel like the semantics of /health and /is_paused (GET methods) are different from the other control plane APIs (POST) for _proxy_to_all().

Collaborator Author

Good question.

/is_paused --> the router returns a mapping from server_urls to the is_paused state of each engine (same as any other control plane op).

/health --> I think we should actually make it not a control plane endpoint. Basically, if at least one healthy engine exists, it should return true. By making /health go through the routing, if it returns true you are guaranteed that at least one engine is healthy. The router should be smart enough to not route to unhealthy engines (that's part of the router's responsibility that we can add to our simple one later if we want).

I removed /health from the control plane endpoints because of the above.
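
The fan-out semantics described above could be sketched like this. It is a hypothetical illustration, not the actual _proxy_to_all in router.py; the fetch callable stands in for an HTTP GET to one backend.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def proxy_to_all(
    server_urls: List[str],
    fetch: Callable[[str], Awaitable[bool]],
) -> Dict[str, bool]:
    """Fan a control-plane query out to every server concurrently.

    Returns a mapping from server URL to that server's response,
    e.g. {url: is_paused} -- the aggregate shape described above.
    `fetch` is a stand-in for an HTTP call to one backend.
    """
    results = await asyncio.gather(*(fetch(url) for url in server_urls))
    return dict(zip(server_urls, results))


async def any_healthy(
    server_urls: List[str],
    fetch: Callable[[str], Awaitable[bool]],
) -> bool:
    """Proposed /health semantics: true if at least one engine is healthy."""
    statuses = await proxy_to_all(server_urls, fetch)
    return any(statuses.values())
```

This makes the distinction concrete: /is_paused surfaces the full per-server mapping, while /health collapses the results into a single boolean.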

Member

sounds good, thanks!

the router returns a mapping from server_urls to the is_paused state of each engine (same as any other control plane op)

Got it. Would be good to document this behavior somewhere! We can do that in another PR

Member

On /health, I made an issue here: #958

@vercel

vercel bot commented Jan 25, 2026

@kouroshHakha is attempting to deploy a commit to the Tyler's projects Team on Vercel.

A member of the Team first needs to authorize it.

Member

@CharlieFRuan left a comment

LGTM! Thank you!

@CharlieFRuan CharlieFRuan merged commit 95c0a54 into NovaSky-AI:main Jan 25, 2026
3 of 5 checks passed