fix(dispatch): recover slot after silent Lemonade eviction #392
Merged
…ry → 200)

Chat completions returned 502 dispatch.upstream_unavailable when the upstream port was dead but hal0's in-memory state still said the slot was READY/SERVING/IDLE. This is the typical signature of a Lemonade idle/OOM eviction that hal0's reconciler hadn't yet observed. The swap-window gate added in #385 only fires on known-non-ready states; the state-drift window bypasses it entirely and lands on a dead :8001.

Production trace from the report:

    chat-completions 502: {"error":{"code":"dispatch.upstream_unavailable",
      "details":{"upstream":"primary",
      "target":"http://127.0.0.1:8001/v1/chat/completions",
      "error":"ConnectError"}}}

Fix, in two pieces:

• src/hal0/slots/manager.py: new SlotManager.recover_evicted_slot() that drives a full unload + load cycle. Unload forces lemond to drop its loaded[] entry (a bare /v1/load returns success without re-spawning when loaded[] still claims the model on a dead PID); load then re-spawns the child llama-server. Unload failure is logged but doesn't bail, because load() may still succeed.
• src/hal0/dispatcher/router.py: _forward_direct and _forward_streaming now retry once after a recoverable transport error from a slot upstream. Recoverable set = ConnectError (TCP refused before request) + RemoteProtocolError (peer dropped mid-request, the eviction-race signature). Read/write timeouts are intentionally excluded; those usually mean overload, not death. A single attempt-1 retry caps the loop; if recovery itself raises, the dispatcher falls back to the original UpstreamUnavailable envelope so the error trail stays useful.

Test plan
=========
Six new dispatcher tests in tests/dispatcher/test_serving_integration.py:
- test_forward_recovers_from_silent_eviction_and_retries: the happy path (ConnectError, recover, retry, 200).
- test_forward_gives_up_when_retry_after_recovery_still_fails: no infinite loop when lemond is genuinely down.
- test_forward_remote_connect_error_does_not_attempt_recovery: remote providers (OpenRouter, etc.) skip the slot-recover branch.
- test_forward_streaming_recovers_from_silent_eviction: streaming requests get the same recovery on the stream-open path.
- test_forward_recovers_from_remote_protocol_error: peer-dropped mid-request triggers recovery too.
- test_forward_gives_up_when_recover_evicted_slot_itself_raises: recover-time failure surfaces the original UpstreamUnavailable.

215 dispatcher + slots tests pass; ruff check + format clean.

LXC verification
================
Recovery branch confirmed to fire end-to-end on hal0 LXC with the orphan-process trigger:

    dispatch.upstream_dead_attempting_recover slot=primary
    slot.recover_evicted_dispatched
    lemonade.provider.unload
    lemonade.provider.load
    dispatch.upstream_recovered

Known limitation: if lemond re-spawns the child on a port other than the one hal0's slot config expects, the retry's target_url is stale and still hits the dead port. That's a separate port-drift issue in the Lemonade integration (LemonadeProvider.load doesn't pass a port hint and hal0 never reads backend_url back from /v1/health) and is out of scope for this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thinmintdev added a commit that referenced this pull request on May 29, 2026
End-of-stream cut for v0.3. Bundles MCP-completion, memory-map redesign, Settings → Updates fix (#386), silent-eviction dispatcher recovery (#392), ADR-0020 OpenRouter callback skeleton (#409), persona spending-cap primitive (#411), δ-harness Hermes coverage (#410), and the docs/internal pin + dashboard-v3 walkthrough (#389/#390). After this tag, active scope rolls to v0.4 (install-mode reconciliation + UI polish + fully-implemented Agents/UI/Install bootstrapped) and v0.5 (MCP admin + memory wiring across UI and agents). CHANGELOG merged from two coexisting Unreleased blocks into a single [v0.3.2-alpha.1] section; added missing entries for #392 (dispatcher), #387 (async-job polling contract), and the docs PRs #389/#390. pyproject 0.3.1-alpha.1 → 0.3.2-alpha.1. uv.lock resynced (was stuck at 0.3.0a1 from prior drift). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Chats returned 502 dispatch.upstream_unavailable when Lemonade silently evicted a slot's child (idle/OOM) but hal0's in-memory state still said the slot was READY/SERVING/IDLE. The swap-window gate from #385 only catches known non-ready states; the state-drift window bypasses it entirely and lands on a dead port.
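User-reported trace:

    chat-completions 502: {"error":{"code":"dispatch.upstream_unavailable",
      "details":{"upstream":"primary",
      "target":"http://127.0.0.1:8001/v1/chat/completions",
      "error":"ConnectError"}}}
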
Fix
Two pieces:
- src/hal0/slots/manager.py: new SlotManager.recover_evicted_slot() drives a full unload + load cycle. Unload forces lemond to drop its loaded[] entry (a bare /v1/load returns success without re-spawning when loaded[] still claims a model on a dead PID); load then re-spawns the child. Unload failure is logged but doesn't bail, because load may still succeed.
- src/hal0/dispatcher/router.py: _forward_direct and _forward_streaming retry once after a recoverable transport error from a slot upstream. Recoverable set:
  - httpx.ConnectError: TCP refused before request
  - httpx.RemoteProtocolError: peer dropped mid-request (the eviction-race signature)

Read/write timeouts are intentionally excluded; those usually mean overload, not death. Attempt-1 cap prevents infinite loops; if recovery itself raises, the dispatcher falls back to the original UpstreamUnavailable envelope so the error trail stays useful.
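
The recovery flow described in this section can be sketched roughly as follows. This is a minimal illustration, not the code in router.py or manager.py: the exception classes are local stand-ins for the httpx ones so the snippet runs without dependencies, and `forward_with_recovery`, the `provider` object with `unload`/`load` coroutines, and the `is_slot_upstream` flag are hypothetical names. Non-recoverable errors such as timeouts simply propagate here; how the real dispatcher wraps them is not shown in this PR.

```python
import asyncio
import logging

log = logging.getLogger("recovery_sketch")


# Local stand-ins for httpx.ConnectError / RemoteProtocolError / ReadTimeout
# (assumption: the real code catches the httpx classes).
class ConnectError(Exception):
    pass


class RemoteProtocolError(Exception):
    pass


class ReadTimeout(Exception):
    pass


# Timeouts are deliberately absent: they usually mean overload, not death.
RECOVERABLE = (ConnectError, RemoteProtocolError)


class UpstreamUnavailable(Exception):
    """Simplified stand-in for the dispatcher's error envelope."""

    def __init__(self, upstream, error):
        super().__init__(f"{upstream}: {error}")
        self.upstream = upstream
        self.error = error


async def recover_evicted_slot(provider, slot):
    """Unload, then load. An unload failure is logged but does not abort."""
    try:
        await provider.unload(slot)
    except Exception:
        log.warning("unload failed for %s; attempting load anyway", slot)
    await provider.load(slot)


async def forward_with_recovery(send, provider, slot, is_slot_upstream=True):
    """Call send(); on a recoverable error from a slot upstream, recover and retry once."""
    try:
        return await send()
    except RECOVERABLE as exc:
        original = UpstreamUnavailable(slot, type(exc).__name__)
        if not is_slot_upstream:
            # Remote providers skip the slot-recover branch entirely.
            raise original from exc
        try:
            await recover_evicted_slot(provider, slot)
        except Exception:
            # Recovery failed: surface the original error, not the recovery one.
            raise original from exc
    # Exactly one retry; a second failure is final, so there is no loop.
    try:
        return await send()
    except RECOVERABLE as exc:
        raise UpstreamUnavailable(slot, type(exc).__name__) from exc
```

Writing the retry as a second explicit call rather than a loop makes the attempt cap visible in the control flow: there is no counter that could be reset or miscounted.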
Why not change the dispatch target instead?
Routing the data plane through lemond's /api/v1/chat/completions (port 13305) was considered but rejected: both upstream Lemonade and hal0/lemonade/client.py:5-6 explicitly call out the control-plane / data-plane split as a perf design decision. This fix preserves that boundary and adds lazy recovery at the existing seam.
Test coverage
Six new dispatcher tests in tests/dispatcher/test_serving_integration.py:
- test_forward_recovers_from_silent_eviction_and_retries: happy path
- test_forward_gives_up_when_retry_after_recovery_still_fails: no infinite loop
- test_forward_remote_connect_error_does_not_attempt_recovery: remote upstreams skip recovery
- test_forward_streaming_recovers_from_silent_eviction: streaming gets same treatment
- test_forward_recovers_from_remote_protocol_error: peer-dropped variant
- test_forward_gives_up_when_recover_evicted_slot_itself_raises: recover failure surfaces original error

215 dispatcher + slots tests pass; ruff check + format clean.
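
For readers unfamiliar with how tests of this shape are usually built, here is one possible sketch of the happy-path and no-infinite-loop cases using scripted test doubles. It is not the content of test_serving_integration.py: `ScriptedUpstream`, `RecordingRecoverer`, `forward_once_with_recovery`, and the shortened test names are all hypothetical, and `ConnectError` is a local stand-in for httpx.ConnectError. The tests drive coroutines with `asyncio.run` so the sketch needs no async test plugin.

```python
import asyncio


class ConnectError(Exception):
    """Local stand-in for httpx.ConnectError (assumption)."""


class ScriptedUpstream:
    """Replays a fixed script: exceptions are raised, other values returned."""

    def __init__(self, script):
        self._script = list(script)
        self.calls = 0

    async def __call__(self):
        self.calls += 1
        step = self._script.pop(0)
        if isinstance(step, Exception):
            raise step
        return step


class RecordingRecoverer:
    """Counts how often recovery was requested."""

    def __init__(self):
        self.calls = 0

    async def __call__(self):
        self.calls += 1


async def forward_once_with_recovery(send, recover):
    """Tiny stand-in for the dispatcher path under test: one retry, no loop."""
    try:
        return await send()
    except ConnectError:
        await recover()
        return await send()


def test_recovers_and_retries():
    upstream = ScriptedUpstream([ConnectError("refused"), 200])
    recover = RecordingRecoverer()
    status = asyncio.run(forward_once_with_recovery(upstream, recover))
    assert status == 200
    assert (upstream.calls, recover.calls) == (2, 1)


def test_gives_up_when_retry_still_fails():
    # A third scripted step exists but must never be consumed.
    upstream = ScriptedUpstream(
        [ConnectError("refused"), ConnectError("still refused"), 200]
    )
    recover = RecordingRecoverer()
    try:
        asyncio.run(forward_once_with_recovery(upstream, recover))
    except ConnectError:
        pass
    else:
        raise AssertionError("expected the second ConnectError to surface")
    assert (upstream.calls, recover.calls) == (2, 1)
```

Asserting on call counts, not just on the final result, is what distinguishes "retried once" from "retried until it worked".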
LXC verification
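Recovery branch confirmed end-to-end on hal0 LXC under the orphan-process trigger:

    dispatch.upstream_dead_attempting_recover slot=primary
    slot.recover_evicted_dispatched
    lemonade.provider.unload
    lemonade.provider.load
    dispatch.upstream_recovered
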
Known limitation (out of scope)
If lemond re-spawns the child on a port other than the one hal0's slot config expects, the retry's target_url is stale and still hits the dead port. Root cause: LemonadeProvider.load doesn't pass a port hint to /v1/load, and hal0 never reads backend_url back from /v1/health. Separate issue; logged for follow-up.
Related
full eviction lifecycle: gate catches known-not-ready, this PR
catches state-drift.
Test plan
- pytest tests/dispatcher/ tests/slots/
- ruff check + ruff format --check
- dispatch.upstream_dead_attempting_recover rate over the next week to verify it catches real silent evictions

🤖 Generated with Claude Code