[tinker] Abort/retry recovery for forwarded multi-tenant LoRA samples#1669
erictang000 wants to merge 2 commits
Adds three control-plane endpoints on the vLLM server actor and a submission-gate middleware so out-of-process callers (e.g. the Tinker ``SkyRLTrainInferenceForwardingClient``) can observe the same pause window as in-process ``RemoteInferenceClient.sample_with_retry`` callers.

* ``POST /skyrl/v1/abort_lora_requests``: now also clears a per-LoRA ``asyncio.Event`` in ``app.state.paused_loras`` before the ``engine.abort()`` fan-out.
* ``POST /skyrl/v1/resume_lora_requests``: sets the event.
* ``POST /skyrl/v1/wait_lora_unpaused``: long-polls until the event is set or the caller-supplied timeout elapses.
* A pure-ASGI middleware blocks fresh ``/v1/completions`` and ``/v1/chat/completions`` requests for a paused LoRA until resume, closing the race between ``abort_lora_requests`` and ``load_lora_adapter`` where a new request could observe torn adapter weights.

``RemoteInferenceClient.resume_generation`` now also POSTs the new ``/skyrl/v1/resume_lora_requests`` so the server-side flag matches the in-process gate. Existing ``test_pause_lora.py`` callers don't observe a return-value contract change.

All endpoints carry the same TRANSIENT marker as the existing abort endpoint: delete the stack when vLLM ships native per-LoRA pause.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
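The per-LoRA gate described above can be sketched with plain ``asyncio``. This is an illustrative, framework-free model of the semantics (set == unpaused, cleared == paused), not the PR's actual server code; the ``PauseGate`` class and method names are hypothetical stand-ins for the three endpoints.

```python
import asyncio


class PauseGate:
    """One asyncio.Event per LoRA; set == unpaused, cleared == paused."""

    def __init__(self):
        self._events: dict[str, asyncio.Event] = {}

    def _event(self, lora: str) -> asyncio.Event:
        ev = self._events.get(lora)
        if ev is None:
            ev = asyncio.Event()
            ev.set()  # LoRAs start out unpaused
            self._events[lora] = ev
        return ev

    def abort(self, lora: str) -> None:
        # Models /skyrl/v1/abort_lora_requests: clear before the engine.abort() fan-out.
        self._event(lora).clear()

    def resume(self, lora: str) -> None:
        # Models /skyrl/v1/resume_lora_requests.
        self._event(lora).set()

    async def wait_unpaused(self, lora: str, timeout: float) -> bool:
        # Models /skyrl/v1/wait_lora_unpaused: True if unpaused before the timeout.
        try:
            await asyncio.wait_for(self._event(lora).wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False


async def demo() -> list:
    gate = PauseGate()
    gate.abort("meow")
    first = await gate.wait_unpaused("meow", timeout=0.05)   # times out: still paused
    gate.resume("meow")
    second = await gate.wait_unpaused("meow", timeout=0.05)  # returns immediately
    return [first, second]


print(asyncio.run(demo()))  # → [False, True]
```

The submission-gate middleware would simply await the same event before forwarding a fresh ``/v1/completions`` request, which is what closes the torn-weights race.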
Builds on the server-side pause-state endpoints in the parent PR to make
``SkyRLTrainInferenceForwardingClient`` participate in per-LoRA pause
cycles. Previously, an in-flight Tinker ``asample`` for a LoRA being
weight-synced by the trainer would:
1. Receive ``finish_reason="abort"`` from vLLM and silently map it to
``stop_reason="length"`` in the forwarding client (truncating result),
or
2. Race ``load_lora_adapter`` if submitted during the pause window.
Changes:
* ``EngineStateDB.inference_server_urls``: new JSON column carrying the
vLLM workers' direct URLs. vllm-router is data-plane-only (verified at
remote_inference_client.py:12-30) and does not forward control-plane
endpoints, so the forwarding client needs at least one worker URL to
reach ``/skyrl/v1/wait_lora_unpaused``.
* ``SkyRLTrainBackend._publish_inference_state`` and
``TinkerEngine._write_inference_state_to_db`` now take
``(proxy_url, server_urls)`` and persist both. ``Callable`` signature
on the publisher updated accordingly.
* ``SkyRLTrainInferenceForwardingClient.call_and_store_result`` now
mirrors ``RemoteInferenceClient.sample_with_retry`` for the per-LoRA
``num_samples=1`` path: dispatch ``/v1/completions``, and on
``finish_reason="abort"`` accumulate partial tokens, long-poll
``/skyrl/v1/wait_lora_unpaused``, then resubmit with
``prompt + accumulated`` and remaining ``max_tokens``. Loops until a
non-abort finish. The base-model and ``num_samples > 1`` paths stay
single-shot — an abort there now surfaces as ``FutureDB.status=FAILED``
rather than silently corrupting the result.
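The accumulate/wait/resubmit loop above can be sketched in isolation. The ``sample_fn`` and ``wait_unpaused_fn`` callables below stand in for the HTTP calls to ``/v1/completions`` and ``/skyrl/v1/wait_lora_unpaused``; all names here are illustrative, not the actual forwarding-client API.

```python
from typing import Callable, List, Tuple


def sample_with_abort_retry(
    prompt: List[int],
    max_tokens: int,
    sample_fn: Callable[[List[int], int], Tuple[List[int], str]],
    wait_unpaused_fn: Callable[[], None],
) -> List[int]:
    """Accumulate partial tokens across aborts, resubmitting with the remaining budget."""
    accumulated: List[int] = []
    remaining = max_tokens
    while remaining > 0:
        # Resubmissions extend the prompt with everything generated so far.
        tokens, finish_reason = sample_fn(prompt + accumulated, remaining)
        accumulated.extend(tokens)
        remaining = max_tokens - len(accumulated)
        if finish_reason != "abort":
            return accumulated
        # Aborted mid-generation: block until the LoRA is resumed, then loop.
        wait_unpaused_fn()
    return accumulated  # budget exhausted across retries


# Demo: the first call aborts after two tokens, the retry completes.
_calls = []
def _fake_sample(prompt, budget):
    _calls.append((list(prompt), budget))
    return ([1, 2], "abort") if len(_calls) == 1 else ([3, 4], "stop")

print(sample_with_abort_retry([9], 8, _fake_sample, lambda: None))  # → [1, 2, 3, 4]
```

Note how the second dispatch sees prompt ``[9, 1, 2]`` with budget ``6``: the decrement is what makes ``max_tokens`` act as a total budget rather than a per-attempt limit.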
Tests: ``tests/tinker/skyrl_train/test_pause_async_sample.py`` mirrors
the four tests in ``test_pause_lora.py`` but drives the forwarding client
directly:
1. Pausing LoRA A doesn't affect forwarding samples for LoRA B.
2. In-flight forwarding samples for paused LoRA A recover from abort
and contain the right LoRA content after retry.
3. A single forwarding sample spans a real Meow→Woof weight swap;
merged tokens contain both adapters' signatures in order.
4. A FRESH sample submitted while a LoRA is paused (no in-flight to
abort) blocks at the server-side submission gate until resume,
closing the torn-weights race that the gate is there to prevent.
All new code carries the same TRANSIENT marker as ``sample_with_retry``:
delete the abort/retry plumbing when vLLM ships native per-LoRA pause.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code Review
This pull request implements a multi-tenant, LoRA-aware pause/resume mechanism across the inference server and the Tinker forwarding client. Key changes include a submission-gate middleware that blocks requests for paused LoRAs, new control-plane endpoints for resuming and waiting on LoRA pause state, and an abort-aware retry loop in ``SkyRLTrainInferenceForwardingClient`` that uses long-polling. Feedback highlights the need for safer indexing when accessing server URLs and completion choices, as well as validation of ``max_tokens`` to avoid type errors during the retry process.
```python
if row.inference_server_urls:
    server_url = row.inference_server_urls[0]
```

Accessing the first element of ``inference_server_urls`` without checking if the list is non-empty will raise an ``IndexError`` if the backend published an empty list of worker URLs. While unlikely in normal operation, it's safer to handle this case.

Suggested change:

```python
if row.inference_server_urls and len(row.inference_server_urls) > 0:
    server_url = row.inference_server_urls[0]
```
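For reference, an empty list is already falsy in Python, so the truthiness guard alone covers both ``None`` and ``[]``; a minimal illustration (``first_server_url`` is a hypothetical helper, not code from this PR):

```python
def first_server_url(urls):
    """Return the first worker URL, or None when none were published."""
    if urls:  # falsy for both None and [], so no IndexError is possible
        return urls[0]
    return None


print(first_server_url(["http://worker-0:8000"]))  # → http://worker-0:8000
print(first_server_url([]))                        # → None
```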
```python
wait_url = f"{server_url}/skyrl/v1/wait_lora_unpaused" if server_url else None
retry_eligible = base_model is None and num_samples == 1 and wait_url is not None

original_max_tokens = sp.max_tokens
```
The retry loop logic relies on original_max_tokens being a positive integer to bound the generation and decrement the budget correctly. If sp.max_tokens is None or non-integer, line 186 will raise a TypeError. Consider adding validation similar to RemoteInferenceClient.sample_with_retry to ensure max_tokens is a positive integer when retry_eligible is true.
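A minimal sketch of the kind of guard the comment asks for, assuming the budget arithmetic needs a positive integer; the ``validate_retry_budget`` name and error message are illustrative, not the actual SkyRL-train API:

```python
def validate_retry_budget(max_tokens) -> int:
    """Reject None, bools, floats, and non-positive values before enabling retry."""
    # bool is a subclass of int, so exclude it explicitly.
    if not isinstance(max_tokens, int) or isinstance(max_tokens, bool) or max_tokens <= 0:
        raise ValueError(
            f"abort/retry requires a positive integer max_tokens, got {max_tokens!r}"
        )
    return max_tokens


print(validate_retry_budget(256))  # → 256
```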
```python
return self._build_sample_output(choices, abort_recovery=False)

# num_samples == 1, LoRA — abort-aware retry path.
choice = choices[0]
```
Summary
Builds on #1666 to make ``SkyRLTrainInferenceForwardingClient`` recover from per-LoRA aborts the same way the in-process ``RemoteInferenceClient.sample_with_retry`` (#1657) does. Closes the fully-async multi-tenant Tinker path: a Tinker SDK ``asample`` for a LoRA adapter whose weights are being synced now transparently waits for the sync to finish instead of failing or silently returning truncated tokens.

This PR is stacked on #1666 — please review #1666 first.
What was broken
For a LoRA adapter being weight-synced via ``RemoteInferenceClient.pause_generation(lora_name=X)``, in-flight Tinker ``asample`` requests for X (routed through ``SkyRLTrainInferenceForwardingClient`` in non-colocated Megatron/FSDP) were not handled correctly:

1. vLLM returns ``finish_reason="abort"`` in the ``/v1/completions`` response. The forwarding client's ``finish_reason → stop_reason`` map (``skyrl_train_inference_forwarding.py:169`` before this PR) lumps any non-``stop``/``stop_token`` reason into ``"length"``. Result: the SDK consumer sees a truncated, half-finished completion with no indication that it was aborted.
2. A fresh request submitted during the pause window could race ``load_lora_adapter`` and observe torn adapter weights.

How
* ``EngineStateDB`` schema: new ``inference_server_urls`` (JSON list) column. The vllm-router forwards data-plane endpoints only (``remote_inference_client.py:12-30`` is explicit about this), so the forwarding client needs direct worker URLs to call ``/skyrl/v1/wait_lora_unpaused``.
* Publisher signature: ``set_inference_state_publisher`` and the engine-side writer now take ``(proxy_url, server_urls)``; ``SkyRLTrainBackend._publish_inference_state`` is updated to pass ``server_setup.server_urls``.
* Forwarding client retry loop: ``call_and_store_result`` for the per-LoRA, ``num_samples=1`` path mirrors ``RemoteInferenceClient.sample_with_retry``:
  * Dispatch ``/v1/completions``. On ``finish_reason="abort"``, append the partial ``token_ids`` and ``logprobs`` to local accumulators.
  * Long-poll ``/skyrl/v1/wait_lora_unpaused`` (new endpoint in [multi-tenant-lora] Expose per-LoRA pause state over HTTP #1666) on a worker URL; loop while ``{paused: true}`` returns.
  * Resubmit with ``prompt + accumulated`` and ``max_tokens = original - len(accumulated)``. Loops until a non-abort finish or the budget is exhausted.
  * Trim ``prompt_logprobs`` back to ``len(original_prompt)`` to match the in-process implementation.

Base-model sampling and ``num_samples > 1`` keep the single-shot path. An abort there raises, so the consumer sees ``FutureDB.status=FAILED`` instead of silent corruption.

Test plan
New file ``tests/tinker/skyrl_train/test_pause_async_sample.py`` mirrors the four tests in ``tests/backends/skyrl_train/gpu/gpu_ci/inference_servers/test_pause_lora.py`` (#1657) but drives the forwarding client directly rather than ``RemoteInferenceClient``:

* ``test_pause_lora_does_not_affect_other_lora_via_forwarding`` — while LoRA A is paused, forwarding-client samples for LoRA B complete promptly.
* ``test_forwarding_client_recovers_from_abort`` — 4 concurrent meow + 4 concurrent woof forwarding samples; pause meow mid-flight, await woof first (cross-LoRA isolation), then resume and verify meow tasks complete with ``"meow"`` content.
* ``test_forwarding_client_observes_mid_sample_weight_swap`` — one forwarding sample for ``lora-target`` spans a Meow→Woof weight swap; merged output contains both adapter signatures with "meow" before "woof".
* ``test_fresh_request_during_pause_blocks_until_resume`` — verifies the submission-gate middleware from [multi-tenant-lora] Expose per-LoRA pause state over HTTP #1666: a sample submitted after pause (with no in-flight to abort) blocks server-side until resume, then completes normally.

``ruff`` + ``black`` clean.

Existing non-GPU Tinker suites (``test_db``, ``test_api_validation``, ``test_api``) pass with the publisher signature change.

The four GPU integration tests require 1 H100/A100/B200 and the Qwen3-0.6B Meow/Woof LoRA fixtures (auto-downloaded via ``snapshot_download``). Same shape as ``test_pause_lora.py``.

🤖 Generated with Claude Code