fix(dispatch): gate slot forwards during swap window with structured 503 #385
Merged
Chat completions hitting a slot mid-swap (STARTING/WARMING) returned a
raw 502 (port not yet bound) or 503 from llama-server's still-loading
gate — no Retry-After hint, no progress context for the dashboard.
Add a SlotLoading typed error raised by Dispatcher.forward before the
HTTP forward when the slot is not in {READY, SERVING, IDLE}. The error
carries retry_after_s + a progress block (phase, requested_model,
upstream), and the error middleware promotes retry_after_s to a real
Retry-After HTTP header on any 503 envelope so OpenAI-compatible SDKs
back off cleanly.
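The gate described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the SlotState enum, the standalone check_slot_ready_for_dispatch function, and the default of 15 seconds are assumptions made for the example (the PR names Dispatcher._check_slot_ready_for_dispatch, and 15 is simply the value observed during verification).

```python
from enum import Enum


class SlotState(Enum):
    # Only the states named in the description; the real enum likely has more.
    STARTING = "starting"
    WARMING = "warming"
    READY = "ready"
    SERVING = "serving"
    IDLE = "idle"


# States in which a forward is allowed.
DISPATCHABLE = frozenset({SlotState.READY, SlotState.SERVING, SlotState.IDLE})


class SlotLoading(Exception):
    """Typed error raised instead of forwarding to a slot that is still loading."""

    status = 503
    code = "slot.loading"

    def __init__(self, phase, requested_model, upstream, retry_after_s=15):
        super().__init__(f"slot {upstream!r} is {phase}")
        # Hypothetical layout: retry hint plus a progress block for the dashboard.
        self.details = {
            "retry_after_s": retry_after_s,
            "progress": {
                "phase": phase,
                "requested_model": requested_model,
                "upstream": upstream,
            },
        }


def check_slot_ready_for_dispatch(state, requested_model, upstream):
    """Raise SlotLoading before any HTTP forward if the slot is not dispatchable."""
    if state not in DISPATCHABLE:
        raise SlotLoading(state.value, requested_model, upstream)
```

Checking an allow-list of ready states, rather than a deny-list of loading states, means any state added later is gated by default.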
Verified end-to-end on LXC 105: chat requests during a primary slot
swap return 503 with code=slot.loading + Retry-After: 15 instead of
the previous raw 502/503. Once primary reaches READY, chats resume
returning 200 unchanged.
Test coverage: 6 parametrized non-ready states + 3 ready states +
remote-upstream skip case added to test_serving_integration.py.
Audio + chat-route tests in test_v1_audio.py pin the in-memory slot
state to READY since they stub the HTTP transport rather than starting
a real slot — without that they'd trip the new gate.
Related: #377 (chat dispatcher always routes to legacy primary slot
regardless of body.model) — orthogonal architectural fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thinmintdev added a commit that referenced this pull request on May 28, 2026:
…ry → 200) (#392)

Chat completions returned 502 dispatch.upstream_unavailable when the upstream port was dead but hal0's in-memory state still said the slot was READY/SERVING/IDLE, the typical signature of a Lemonade idle/OOM eviction that hal0's reconciler hadn't yet observed. The swap-window gate added in #385 only fires on known-non-ready states; the state-drift window bypasses it entirely and lands on a dead :8001.

Production trace from the report:

    chat-completions 502:
    {"error":{"code":"dispatch.upstream_unavailable",
      "details":{"upstream":"primary",
        "target":"http://127.0.0.1:8001/v1/chat/completions",
        "error":"ConnectError"}}}

Fix, in two pieces:

• src/hal0/slots/manager.py: new SlotManager.recover_evicted_slot() that drives a full unload + load cycle. Unload forces lemond to drop its loaded[] entry (a bare /v1/load returns success without re-spawning when loaded[] still claims the model on a dead PID); load then re-spawns the child llama-server. Unload failure is logged but doesn't bail, since load() may still succeed.
• src/hal0/dispatcher/router.py: _forward_direct and _forward_streaming now retry once after a recoverable transport error from a slot upstream. Recoverable set = ConnectError (TCP refused before request) + RemoteProtocolError (peer dropped mid-request, the eviction-race signature). Read/write timeouts are intentionally excluded; those usually mean overload, not death. A single attempt-1 retry caps the loop; recovery itself raising falls back to the original UpstreamUnavailable envelope so the error trail stays useful.

Test plan
=========

Six new dispatcher tests in tests/dispatcher/test_serving_integration.py:

- test_forward_recovers_from_silent_eviction_and_retries: the happy path (ConnectError, recover, retry, 200).
- test_forward_gives_up_when_retry_after_recovery_still_fails: no infinite loop when lemond is genuinely down.
- test_forward_remote_connect_error_does_not_attempt_recovery: remote providers (OpenRouter, etc.) skip the slot-recover branch.
- test_forward_streaming_recovers_from_silent_eviction: streaming requests get the same recovery on the stream-open path.
- test_forward_recovers_from_remote_protocol_error: peer-dropped mid-request triggers recovery too.
- test_forward_gives_up_when_recover_evicted_slot_itself_raises: recover-time failure surfaces the original UpstreamUnavailable.

215 dispatcher + slots tests pass; ruff check + format clean.

LXC verification
================

Recovery branch confirmed to fire end-to-end on hal0 LXC with the orphan-process trigger:

    dispatch.upstream_dead_attempting_recover slot=primary
    slot.recover_evicted_dispatched
    lemonade.provider.unload
    lemonade.provider.load
    dispatch.upstream_recovered

Known limitation: if lemond re-spawns the child on a port other than the one hal0's slot config expects, the retry's target_url is stale and still hits the dead port. That's a separate port-drift issue in the Lemonade integration (LemonadeProvider.load doesn't pass a port hint and hal0 never reads backend_url back from /v1/health) and is out of scope for this fix.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
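The retry-once policy described in that commit could look roughly like the sketch below. It is synchronous and uses stand-in exception classes so that it runs on its own; the function name forward_with_recovery, the send and recover callables, and the is_local_slot flag are hypothetical and are not taken from router.py.

```python
# Stand-ins for the transport and dispatcher error types named in the commit message.
class ConnectError(Exception): ...
class RemoteProtocolError(Exception): ...
class UpstreamUnavailable(Exception): ...


# Timeouts are deliberately absent: they usually signal overload, not a dead upstream.
RECOVERABLE = (ConnectError, RemoteProtocolError)


def forward_with_recovery(send, recover, is_local_slot=True):
    """Call send(); on a recoverable transport error, recover once and retry once."""
    try:
        return send()
    except RECOVERABLE as first:
        if not is_local_slot:
            # Remote providers skip the slot-recover branch entirely.
            raise UpstreamUnavailable(type(first).__name__) from first
        try:
            recover()  # e.g. a full unload + load cycle for the evicted slot
        except Exception:
            # Recovery failed: surface the original transport error.
            raise UpstreamUnavailable(type(first).__name__) from first
        try:
            return send()  # the single retry; no loop
        except RECOVERABLE as second:
            raise UpstreamUnavailable(type(second).__name__) from second
```

Because the retry is written out as one extra call rather than a loop, a genuinely dead upstream costs at most two forwards and one recovery before the caller sees the error.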
Summary
Chat completions hitting a slot mid-swap (STARTING/WARMING) used to return a raw 502 (port not yet bound) or 503 from llama-server's still-loading gate, with no Retry-After and no progress context for the dashboard.
This PR adds a swap-window gate inside Dispatcher.forward: if the target slot isn't in {READY, SERVING, IDLE}, it raises a typed SlotLoading error instead of forwarding to a port that may not be bound or a model that may not be loaded. The error carries retry_after_s plus a progress block (phase, requested_model, upstream), and the error middleware now promotes retry_after_s to a real Retry-After HTTP header on any 503 envelope so OpenAI-style SDKs back off cleanly.
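The header promotion can be illustrated with a small pure function. This is a sketch under assumptions: the function name promote_retry_after and the plain-dict header representation are hypothetical, and the envelope layout ({"error": {"code": ..., "details": {...}}}) is inferred from the error traces quoted on this page rather than from error_codes.py.

```python
import math


def promote_retry_after(status, envelope, headers):
    """Return a copy of headers, adding Retry-After for 503 envelopes that carry a hint."""
    out = dict(headers)
    if status != 503:
        return out
    details = envelope.get("error", {}).get("details") or {}
    retry = details.get("retry_after_s")
    # Accept only finite, non-negative numbers; bool is excluded on purpose.
    if (
        isinstance(retry, (int, float))
        and not isinstance(retry, bool)
        and math.isfinite(retry)
        and retry >= 0
    ):
        # Retry-After delay-seconds must be a whole number, so round up.
        out["Retry-After"] = str(math.ceil(retry))
    return out
```

Rounding up rather than down keeps a client from retrying slightly before the hinted time.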
Production trace that prompted this (2026-05-28, primary swap):
After this PR, the 502/503 window returns:
Why this works for both audiences
- OpenAI-compatible SDKs see 503 + Retry-After: 15 and retry with exponential back-off without help.
- The dashboard can render details.progress instead of a generic error toast.
- All /v1/* routes that go through the dispatcher (audio, embeddings, rerank, chat) get the gate for free: it lives at Dispatcher.forward, not in a route handler.
Files
- src/hal0/dispatcher/router.py: new SlotLoading(DispatchError) + Dispatcher._check_slot_ready_for_dispatch + _build_loading_response
- src/hal0/api/middleware/error_codes.py: promote details.retry_after_s to Retry-After header on 503 envelopes
- tests/dispatcher/test_serving_integration.py: 6 parametrized non-ready states + 3 ready states + remote-skip case
- tests/api/test_v1_audio.py: pin in-memory slot state to READY in the seed helpers (audio tests stub HTTP transport rather than starting a real slot)
Verification
- Chat request during a primary slot swap → 503 with code=slot.loading + Retry-After: 15 ✅
- … (state=starting) chat call → same envelope, phase=starting ✅
- ruff check clean, ruff format --check clean.
Related
- #377: chat dispatcher always routes to the legacy primary slot regardless of the body model field. Orthogonal architectural issue, deferred. The swap-window gate doesn't touch routing; it only refuses to forward when the target slot is loading.
Test plan
- pytest tests/dispatcher/ tests/api/
- ruff check + ruff format --check