fix(dispatch): gate slot forwards during swap window with structured 503 by thinmintdev · Pull Request #385 · Hal0ai/hal0

thinmintdev · 2026-05-28T08:56:10Z

Summary

Chat completions hitting a slot mid-swap (STARTING / WARMING) used
to return a raw 502 (port not yet bound) or 503 from
llama-server's still-loading gate — no Retry-After, no progress
context for the dashboard.

This PR adds a swap-window gate inside Dispatcher.forward: if the
target slot isn't in {READY, SERVING, IDLE}, raise a typed
SlotLoading error instead of forwarding to a port that may not be
bound or a model that may not be loaded. The error carries
retry_after_s plus a progress block (phase, requested_model,
upstream), and the error middleware now promotes retry_after_s to
a real Retry-After HTTP header on any 503 envelope so OpenAI-style
SDKs back off cleanly.
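The gate described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in router.py: the `SlotState` enum, the `check_slot_ready` helper name, the 15-second default, and reusing the slot name as `upstream` are all hypothetical. Only `SlotLoading`, the ready set {READY, SERVING, IDLE}, and the detail field names come from this description. The real error subclasses DispatchError, and the real state set is larger than the three non-ready states shown here.

```python
from enum import Enum


class SlotState(str, Enum):
    # Hypothetical enum; only the state names are taken from the PR text.
    OFFLINE = "offline"
    STARTING = "starting"
    WARMING = "warming"
    READY = "ready"
    SERVING = "serving"
    IDLE = "idle"


# States in which forwarding to the slot's port is considered safe.
READY_STATES = frozenset({SlotState.READY, SlotState.SERVING, SlotState.IDLE})


class SlotLoading(Exception):
    """Typed error raised instead of forwarding to a slot that is not ready."""

    code = "slot.loading"
    status = 503

    def __init__(self, slot, state, requested_model, retry_after_s=15):
        super().__init__(f"slot '{slot}' is {state.value}, not ready to serve")
        self.details = {
            "slot": slot,
            "state": state.value,
            "retry_after_s": retry_after_s,
            "progress": {
                "phase": state.value,
                "requested_model": requested_model,
                "upstream": slot,  # assumption: upstream name equals slot name
            },
        }


def check_slot_ready(slot, state, requested_model):
    """Raise SlotLoading unless the slot is in one of the ready states."""
    if state not in READY_STATES:
        raise SlotLoading(slot, state, requested_model)
```

Because the check runs before any HTTP call is made, a caller never reaches a port that may not be bound; the exception's `details` dict is what the error envelope would serialize.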

Production trace that prompted this (2026-05-28, primary swap):

02:50:26  POST /v1/chat/completions  502  dispatch.upstream_unavailable  (port 8001 dead)
02:51:24  POST /v1/chat/completions  503  (raw llama-server "still loading")
02:51:31  POST /v1/chat/completions  503  (same)
02:51:35  primary llama-server: listening on :8001
02:54:27  POST /v1/chat/completions  200  (back to normal)

After this PR, the 502/503 window returns:

HTTP/1.1 503 Service Unavailable
retry-after: 15
content-type: application/json

{"error":{"code":"slot.loading",
          "message":"slot 'primary' is starting — not ready to serve",
          "details":{"slot":"primary","state":"starting","retry_after_s":15,
                     "progress":{"phase":"starting",
                                 "requested_model":"qwen3-coder-...",
                                 "upstream":"primary"}}}}

Why this works for both audiences

OpenAI SDKs see a textbook 503 with Retry-After: 15, so they back off and retry on their own with no hal0-specific handling.
The hal0 dashboard can render a "model loading…" chip from details.progress instead of a generic error toast.
All /v1/* routes that go through the dispatcher (audio, embeddings, rerank, chat) get the gate for free — it lives at Dispatcher.forward, not in a route handler.
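For a hand-rolled client or a dashboard, reading the envelope might look like the sketch below. The `loading_hint` function and its chip text format are hypothetical; the field names (`code`, `details.retry_after_s`, `details.progress`) are taken from the example response above. It prefers the Retry-After header and falls back to the body field.

```python
import json


def loading_hint(status, headers, body):
    """Return (retry_delay_s, chip_text) for a slot.loading 503, else None."""
    if status != 503:
        return None
    try:
        error = json.loads(body)["error"]
        if error["code"] != "slot.loading":
            return None
        details = error.get("details") or {}
        progress = details.get("progress") or {}
    except (ValueError, KeyError, TypeError, AttributeError):
        # Not JSON, or not shaped like the structured envelope.
        return None

    # Header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    for raw in (lowered.get("retry-after"), details.get("retry_after_s")):
        try:
            delay = float(raw)
            break
        except (TypeError, ValueError):
            continue
    else:
        delay = 0.0

    model = progress.get("requested_model", "model")
    phase = progress.get("phase", "unknown")
    return delay, f"{model} loading ({phase})"
```

Any other 503, such as a raw upstream error without the structured body, returns None, so the caller can fall back to its generic error path.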

Files

src/hal0/dispatcher/router.py — new SlotLoading(DispatchError) + Dispatcher._check_slot_ready_for_dispatch + _build_loading_response
src/hal0/api/middleware/error_codes.py — promote details.retry_after_s to Retry-After header on 503 envelopes
tests/dispatcher/test_serving_integration.py — 6 parametrized non-ready states + 3 ready states + remote-skip case
tests/api/test_v1_audio.py — pin in-memory slot state to READY in the seed helpers (audio tests stub HTTP transport rather than starting a real slot)
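The header promotion listed for error_codes.py could be approximated by a pure function like the one below. This is a sketch only: `promote_retry_after` is a hypothetical name, and the real middleware presumably operates on framework response objects rather than plain dicts. Rounding up to whole seconds is an assumption made here because the delay form of Retry-After is an integer number of seconds.

```python
import math


def promote_retry_after(status_code, envelope, headers):
    """Return a copy of headers with Retry-After set for 503 envelopes.

    The header is added only when the response is a 503 and the envelope
    carries a non-negative numeric details.retry_after_s.
    """
    out = dict(headers)
    if status_code != 503:
        return out
    details = (envelope.get("error") or {}).get("details") or {}
    retry_after = details.get("retry_after_s")
    is_number = isinstance(retry_after, (int, float)) and not isinstance(retry_after, bool)
    if is_number and retry_after >= 0:
        # Retry-After's delay form is whole seconds, so round up.
        out["retry-after"] = str(math.ceil(retry_after))
    return out
```

Keying on the status code and the envelope field, rather than on the error code, matches the "any 503 envelope" wording: other 503 errors that carry retry_after_s would get the header too.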

Verification

Unit tests: 1928/1928 pass, 3 skipped (pre-existing).
LXC smoke (live):
- OFFLINE slot + chat call → 503 slot.loading + Retry-After: 15 ✅
- Mid-swap (state=starting) chat call → same envelope, phase=starting ✅
- Post-swap chat call → 200 OK unchanged ✅
CI gates: ruff check clean, ruff format --check clean.

Related

#377 (Chat dispatcher ignores body model field, always routes to primary slot): the chat dispatcher always routes to the legacy primary slot regardless of the body model field. This is an orthogonal architectural issue and is deferred. The swap-window gate doesn't touch routing; it only refuses to forward when the target slot is loading.

Test plan

pytest tests/dispatcher/ tests/api/
ruff check + ruff format --check
LXC smoke: trigger primary swap, fire chat during load window, assert 503 + Retry-After + structured body
LXC smoke: post-swap chat returns 200 unchanged

Chat completions hitting a slot mid-swap (STARTING/WARMING) returned a raw 502 (port not yet bound) or 503 from llama-server's still-loading gate, with no Retry-After hint and no progress context for the dashboard.

Add a SlotLoading typed error raised by Dispatcher.forward before the HTTP forward when the slot is not in {READY, SERVING, IDLE}. The error carries retry_after_s + a progress block (phase, requested_model, upstream), and the error middleware promotes retry_after_s to a real Retry-After HTTP header on any 503 envelope so OpenAI-compatible SDKs back off cleanly.

Verified end-to-end on LXC 105: chat requests during a primary slot swap return 503 with code=slot.loading + Retry-After: 15 instead of the previous raw 502/503. Once primary reaches READY, chats resume returning 200 unchanged.

Test coverage: 6 parametrized non-ready states + 3 ready states + remote-upstream skip case added to test_serving_integration.py. Audio + chat-route tests in test_v1_audio.py pin the in-memory slot state to READY since they stub the HTTP transport rather than starting a real slot; without that they'd trip the new gate.

Related: #377 (chat dispatcher always routes to legacy primary slot regardless of body.model), an orthogonal architectural fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ry → 200) (#392)

Chat completions returned 502 dispatch.upstream_unavailable when the upstream port was dead but hal0's in-memory state still said the slot was READY/SERVING/IDLE, the typical signature of a Lemonade idle/OOM eviction that hal0's reconciler hadn't yet observed. The swap-window gate added in #385 only fires on known-non-ready states; the state-drift window bypasses it entirely and lands on a dead :8001.

Production trace from the report:

chat-completions 502:
{"error":{"code":"dispatch.upstream_unavailable",
          "details":{"upstream":"primary",
                     "target":"http://127.0.0.1:8001/v1/chat/completions",
                     "error":"ConnectError"}}}

Fix, in two pieces:

• src/hal0/slots/manager.py: new SlotManager.recover_evicted_slot() that drives a full unload + load cycle. Unload forces lemond to drop its loaded[] entry (a bare /v1/load returns success without re-spawning when loaded[] still claims the model on a dead PID); load then re-spawns the child llama-server. Unload failure is logged but doesn't bail, since load() may still succeed.
• src/hal0/dispatcher/router.py: _forward_direct and _forward_streaming now retry once after a recoverable transport error from a slot upstream. Recoverable set = ConnectError (TCP refused before request) + RemoteProtocolError (peer dropped mid-request, the eviction-race signature). Read/write timeouts intentionally excluded, since those usually mean overload, not death. A single attempt-1 retry caps the loop; recovery itself raising falls back to the original UpstreamUnavailable envelope so the error trail stays useful.

Test plan
=========

Six new dispatcher tests in tests/dispatcher/test_serving_integration.py:

- test_forward_recovers_from_silent_eviction_and_retries: the happy path (ConnectError, recover, retry, 200).
- test_forward_gives_up_when_retry_after_recovery_still_fails: no infinite loop when lemond is genuinely down.
- test_forward_remote_connect_error_does_not_attempt_recovery: remote providers (OpenRouter, etc.) skip the slot-recover branch.
- test_forward_streaming_recovers_from_silent_eviction: streaming requests get the same recovery on the stream-open path.
- test_forward_recovers_from_remote_protocol_error: peer-dropped mid-request triggers recovery too.
- test_forward_gives_up_when_recover_evicted_slot_itself_raises: recover-time failure surfaces the original UpstreamUnavailable.

215 dispatcher + slots tests pass; ruff check + format clean.

LXC verification
================

Recovery branch confirmed to fire end-to-end on hal0 LXC with the orphan-process trigger:

dispatch.upstream_dead_attempting_recover slot=primary
slot.recover_evicted_dispatched
lemonade.provider.unload
lemonade.provider.load
dispatch.upstream_recovered

Known limitation: if lemond re-spawns the child on a port other than the one hal0's slot config expects, the retry's target_url is stale and still hits the dead port. That's a separate port-drift issue in the Lemonade integration (LemonadeProvider.load doesn't pass a port hint and hal0 never reads backend_url back from /v1/health) and is out of scope for this fix.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
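The retry-once behaviour described in the #392 message can be sketched as below. Everything here is a simplified stand-in: the exception classes are local placeholders named after the errors in the message (not imported from any HTTP library), `forward_with_recovery` and its parameters are hypothetical, and the sketch is synchronous where the real forward paths may not be. Timeouts simply propagate unchanged in this sketch.

```python
class ConnectError(Exception):
    """Stand-in: TCP connection refused before the request was sent."""


class RemoteProtocolError(Exception):
    """Stand-in: peer dropped the connection mid-request."""


class ReadTimeout(Exception):
    """Stand-in: deliberately not in the recoverable set."""


class UpstreamUnavailable(Exception):
    """Stand-in for the 502 dispatch.upstream_unavailable envelope."""


RECOVERABLE = (ConnectError, RemoteProtocolError)


def forward_with_recovery(send, recover, is_slot_upstream=True):
    """Call send(); on a recoverable error from a slot, recover and retry once."""
    try:
        return send()
    except RECOVERABLE as first_error:
        if not is_slot_upstream:
            # Remote providers skip the slot-recover branch entirely.
            raise UpstreamUnavailable(type(first_error).__name__) from first_error
        try:
            recover()  # e.g. an unload + load cycle on the slot
        except Exception:
            # Recovery failed: surface the original transport error.
            raise UpstreamUnavailable(type(first_error).__name__) from first_error
        try:
            return send()  # the single retry; no loop
        except RECOVERABLE as second_error:
            raise UpstreamUnavailable(type(second_error).__name__) from second_error
```

Keeping the retry as straight-line code rather than a loop makes the "at most one retry" bound visible in the control flow itself.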

thinmintdev merged commit 1d4c0e0 into main May 28, 2026
4 checks passed

thinmintdev deleted the fix/swap-window-503 branch May 28, 2026 09:03

This was referenced May 28, 2026

fix(dispatch): recover slot after silent Lemonade eviction #392

Merged

PRD: hal0 ↔ Lemonade slot-management end-state (typed adapter, reactive reconciliation) #402

Open
