
[skyrl-train][inference] Inference Server Refactor (1/N)#899

Merged
CharlieFRuan merged 28 commits into NovaSky-AI:main from kouroshHakha:kh/inference-1
Jan 25, 2026

Conversation

@kouroshHakha
Collaborator

Why

We're building HTTP-based inference serving for RL workloads. This enables:

  • Decoupled training ↔ inference via standard HTTP (no Ray object refs)
  • Flexible backends (vLLM now, SGLang later via protocol)
  • Dynamic weight sync between trainer and inference servers

What

skyrl_train/inference_servers/
├── router.py             # HTTP proxy: load balancing + control plane fan-out
├── server_group.py       # Manages N vLLM actors on a placement group
├── vllm_server_actor.py  # Ray actor wrapping vLLM OpenAI server
├── protocols.py          # ServerActorProtocol interface
├── vllm_worker.py        # Worker extension for NCCL weight sync
└── common.py             # Utilities (get_open_port, get_node_ip)

Key Design Decisions

Decision               | Rationale
-----------------------|-------------------------------------------------------
ServerActorProtocol    | Swap vLLM for SGLang without changing ServerGroup
Router as HTTP proxy   | Compatible with any HTTP client; no vLLM-specific deps
Control plane fan-out  | /pause, /resume, /init_weight_transfer hit all servers
Session-aware hashing  | Sticky routing for multi-turn conversations
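
The ServerActorProtocol decision above could look roughly like the sketch below. This is a hypothetical illustration, not the actual protocols.py contract; the method names (get_server_url, pause, resume) and the FakeVLLMServerActor class are assumptions for demonstration.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ServerActorProtocol(Protocol):
    """Hypothetical interface a backend server actor must satisfy.

    Method names here are illustrative assumptions, not the actual
    protocols.py contract.
    """

    def get_server_url(self) -> str:
        """Return the base HTTP URL this server listens on."""
        ...

    async def pause(self) -> None:
        """Stop accepting new generation requests."""
        ...

    async def resume(self) -> None:
        """Resume accepting generation requests."""
        ...


class FakeVLLMServerActor:
    """Minimal stand-in showing structural conformance (no Ray, no vLLM)."""

    def get_server_url(self) -> str:
        return "http://127.0.0.1:8000"

    async def pause(self) -> None:
        pass

    async def resume(self) -> None:
        pass
```

Because ServerGroup would only depend on the protocol, an SGLang-backed actor could satisfy the same structural interface without any change to ServerGroup.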

Review Guide

  1. Start here: protocols.py (30 lines) - the interface contract
  2. Core logic: router.py - proxy routing + fan-out behavior
  3. Glue code: server_group.py + vllm_server_actor.py - Ray actor lifecycle
  4. Tests: test_weight_sync.py - trainer→inference weight sync flow

Testing

# GPU CI (requires 4 GPUs)
uv run pytest tests/gpu/gpu_ci/test_inference_server_group.py -v
uv run pytest tests/gpu/gpu_ci/test_weight_sync.py -v

Next

RemoteInferenceClient - typed client for trainer to call inference servers

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha marked this pull request as ready for review January 20, 2026 03:26
Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant and well-designed refactoring for inference serving. The new architecture, with its clear separation of concerns into a router, server group, actor pool, and a server actor protocol, is robust and extensible. The implementation of the vLLM server actor and the comprehensive test suite, including complex scenarios like weight synchronization, are particularly impressive. I have a few suggestions to improve the router's lifecycle management and fix a test case.

@kouroshHakha
Collaborator Author

/gemini review

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant refactor of the inference server architecture, moving towards a more decoupled and flexible design. Key components include a new InferenceRouter for load balancing and control plane fan-out, ServerGroup for managing server actors with placement groups, and a ServerActorProtocol to enable interchangeable backends like vLLM and SGLang. The changes also include a robust weight synchronization mechanism between trainers and inference servers. New utility functions for environment variables and port management are added, along with comprehensive unit and GPU integration tests. The overall structure is well-thought-out, enhancing maintainability and scalability for RL workloads.

@CharlieFRuan self-assigned this Jan 20, 2026
Member

@CharlieFRuan left a comment

Thank you so much for the PR and pushing the refactor of the rollout stack. Learned a lot when reviewing, and cannot wait to see it land!


Routing behavior:
- Data plane (generation requests): Routes to ONE server.
- If X-Session-ID header present: consistent hash to same backend
Member

The current skyrl-train/skyrl_train/inference_engines/inference_engine_client_http_endpoint.py allows users to point their agent harness at the HTTP endpoint that SkyRL spins up instead of, say, an OpenAI endpoint.

With this refactoring, the user will use http://0.0.0.0:8080/v1/chat/completions for the router router = InferenceRouter(server_urls, host="0.0.0.0", port=8080).

My question is: for sticky routing, how will the user add the header? Would it be simpler if the session ID were in the body?

Currently for terminal bench, it will have a session_id field in the request:

  • TBench side will forward this kwarg in the request body
  • Then on SkyRL, upon receiving this request, pops the session_id here:

    async def chat_completion(self, request_payload: Dict[str, Any]) -> Dict[str, Any]:
        session_id = request_payload["json"].pop("session_id", None)
        if session_id is None:
            # if session_id is not provided, we'll use a random engine
            engine_idx = random.randint(0, len(self.engines) - 1)
        else:
            assert isinstance(session_id, (str, int)), "Session ID must be an integer or string for `/chat/completions`"
            engine_idx = hash_with_sha256(str(session_id)) % len(self.engines)
        # Always use the retry loop which also issues the first request inside
        return await self._chat_completion_with_retry(engine_idx, request_payload)

Collaborator Author

so we need to make it compatible with industry standards. If you look at sgl-router, vllm-router, etc., the dominant way to pass in the session id is through a request header, X-Session-ID. I don't know what the benefit of this is over keeping it in the body, but I think it's more of a standard for decoupling middleware / routing ops from the request body.
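
The header-based sticky routing described here can be sketched as follows. This is a hypothetical illustration of the sha256-mod-N scheme quoted above, not the actual router.py implementation; pick_backend is an illustrative name.

```python
import hashlib
import random
from typing import List, Optional


def pick_backend(server_urls: List[str], session_id: Optional[str]) -> str:
    """Route a request: the same X-Session-ID value always maps to the
    same backend, giving sticky routing for multi-turn conversations.

    Falls back to a random backend when no session header is present.
    Illustrative sketch only; not the actual router.py API.
    """
    if session_id is None:
        return random.choice(server_urls)
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return server_urls[int(digest, 16) % len(server_urls)]
```

Since the hash depends only on the session ID and the number of servers, every request carrying the same X-Session-ID lands on the same backend as long as the server list is stable.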

Collaborator Author

@kouroshHakha Jan 24, 2026

session_id is not part of the chat or completion API standard, btw. So I'd say we need to change the agent harness / generator to use the header instead.

Member

if you look at sgl router, vllm-router, etc. the dominant way to pass in the session id is through passing a request header that has X-Session-ID

Thanks for the info, didn't know before! Let's keep it as X-Session-ID for now! We just need to remember properly telling users about how to do sticky routing somewhere. Made an issue here: #939
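
To answer the question of how a user adds the header in their agent harness: with the requests library it might look like the sketch below. The router URL and payload are placeholders taken from the discussion above; "episode-42" is a hypothetical session ID.

```python
import requests

# Hypothetical router address from the discussion above; payload is a placeholder.
url = "http://0.0.0.0:8080/v1/chat/completions"
payload = {"model": "my-model", "messages": [{"role": "user", "content": "hi"}]}

# Build (but don't send) the request so the sticky-routing header is visible.
# Sending would be: requests.post(url, json=payload, headers={...})
req = requests.Request(
    "POST", url, json=payload, headers={"X-Session-ID": "episode-42"}
).prepare()
```

Any HTTP client that can set a request header works the same way, which is the point of keeping the session ID out of the body.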


Member

@CharlieFRuan left a comment

Thanks for addressing the comments! Last 3 nits!

"/wake_up",
"/reset_prefix_cache",
"/collective_rpc",
"/health",
Member

What is the semantics of _proxy_to_all() for /health and /is_paused? If one of the engines returns false, does the router return false altogether? Could we add a unit test for /health when one of the engines is not healthy?

I feel like the semantics of /health and /is_paused (GET methods) are different from the other control plane APIs (POST) for _proxy_to_all().

Collaborator Author

Good question.

/is_paused --> the router returns a mapping from server_urls to the is_paused state of each engine (same as any other control plane op).

/health --> I think we should actually make it not a control plane endpoint. Basically, if at least one healthy engine exists, it should return true. By making /health go through the routing, if it returns true you are guaranteed that at least one engine is healthy. The router should be smart enough to not route to unhealthy engines (that's part of the router's responsibility that we can add to our simple one later if we want).

I removed /health from the control plane endpoints because of the above.
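
The fan-out semantics described above could be sketched like this. It is a hypothetical illustration, not the actual _proxy_to_all in router.py; the fetch callable stands in for an HTTP GET to one backend.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def proxy_to_all(
    server_urls: List[str],
    fetch: Callable[[str], Awaitable[bool]],
) -> Dict[str, bool]:
    """Fan a control-plane query out to every server concurrently.

    Returns a mapping from server URL to that server's response,
    e.g. {url: is_paused} -- the aggregate shape described above.
    `fetch` is a stand-in for an HTTP call to one backend.
    """
    results = await asyncio.gather(*(fetch(url) for url in server_urls))
    return dict(zip(server_urls, results))


async def any_healthy(
    server_urls: List[str],
    fetch: Callable[[str], Awaitable[bool]],
) -> bool:
    """Proposed /health semantics: true if at least one engine is healthy."""
    statuses = await proxy_to_all(server_urls, fetch)
    return any(statuses.values())
```

This makes the distinction concrete: /is_paused surfaces the full per-server mapping, while /health collapses the results into a single boolean.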

Member

sounds good, thanks!

the router returns a mapping from server_urls to the is_paused state of each engine (same as any other control plane op)

Got it. Would be good to document this behavior somewhere! We can do that in another PR

Member

On /health, I made an issue here: #958

@vercel

vercel bot commented Jan 25, 2026

@kouroshHakha is attempting to deploy a commit to the Tyler's projects Team on Vercel.

A member of the Team first needs to authorize it.

Member

@CharlieFRuan left a comment

LGTM! Thank you!

@CharlieFRuan CharlieFRuan merged commit 95c0a54 into NovaSky-AI:main Jan 25, 2026
3 of 5 checks passed