fix(dispatch): recover slot after silent Lemonade eviction #392

Merged
thinmintdev merged 1 commit into main from fix/dispatch-recover-evicted-slot
May 28, 2026

@thinmintdev
Contributor

Summary

Chats returned 502 dispatch.upstream_unavailable when Lemonade
silently evicted a slot's child (idle/OOM) but hal0's in-memory state
still said the slot was READY/SERVING/IDLE. The swap-window gate from
#385 only catches known non-ready states; the state-drift window
bypasses it entirely and lands on a dead port.

User-reported trace:

chat-completions 502: {"error":{"code":"dispatch.upstream_unavailable",
  "details":{"upstream":"primary",
             "target":"http://127.0.0.1:8001/v1/chat/completions",
             "error":"ConnectError"}}}

Fix

Two pieces:

  • src/hal0/slots/manager.py — new SlotManager.recover_evicted_slot()
    drives a full unload + load cycle. Unload forces lemond to drop its
    loaded[] entry (a bare /v1/load returns success without
    re-spawning when loaded[] still claims a model on a dead PID); load
    then re-spawns the child. Unload failure is logged but doesn't bail —
    load may still succeed.

  • src/hal0/dispatcher/router.py: _forward_direct and
    _forward_streaming retry once after a recoverable transport
    error from a slot upstream. Recoverable set:

    • httpx.ConnectError — TCP refused before request
    • httpx.RemoteProtocolError — peer dropped mid-request (the
      eviction-race signature)

    Read/write timeouts are intentionally excluded because they usually
    mean overload, not death. The retry is capped at one attempt, which
    prevents infinite loops; if recovery itself raises, the dispatcher
    falls back to the original UpstreamUnavailable envelope so the error
    trail stays useful.

Why not change the dispatch target instead?

Routing the data plane through lemond's /api/v1/chat/completions
(port 13305) was considered but rejected — both upstream Lemonade and
hal0/lemonade/client.py:5-6 explicitly call out the control-plane /
data-plane split as a perf design decision. This fix preserves that
boundary and adds lazy recovery at the existing seam.

Test coverage

Six new dispatcher tests in tests/dispatcher/test_serving_integration.py:

  • test_forward_recovers_from_silent_eviction_and_retries — happy path
  • test_forward_gives_up_when_retry_after_recovery_still_fails — no infinite loop
  • test_forward_remote_connect_error_does_not_attempt_recovery — remote upstreams skip recovery
  • test_forward_streaming_recovers_from_silent_eviction — streaming gets same treatment
  • test_forward_recovers_from_remote_protocol_error — peer-dropped variant
  • test_forward_gives_up_when_recover_evicted_slot_itself_raises — recover failure surfaces original error

215 dispatcher + slots tests pass; ruff check + format clean.

LXC verification

Recovery branch confirmed end-to-end on hal0 LXC under the
orphan-process trigger:

dispatch.upstream_dead_attempting_recover  slot=primary
slot.recover_evicted_dispatched
lemonade.provider.unload
lemonade.provider.load
dispatch.upstream_recovered
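
The event order in this trace matches the unload-then-load cycle described under Fix. A minimal sketch of that cycle, assuming a provider object with async `unload` and `load` methods, is shown here. The free function, the `SlotProvider` protocol, the `RecordingProvider` fake, and the logger name are all hypothetical; the real `SlotManager.recover_evicted_slot()` signature is not shown in this PR.

```python
import asyncio
import logging
from typing import Protocol

log = logging.getLogger("hal0.sketch")  # hypothetical logger name


class SlotProvider(Protocol):
    """Assumed shape of the provider; not the real LemonadeProvider API."""

    async def unload(self, slot: str) -> None: ...

    async def load(self, slot: str) -> None: ...


async def recover_evicted_slot(provider: SlotProvider, slot: str) -> None:
    """Force a full unload + load cycle for a slot whose child has died."""
    try:
        # Unload first so the server drops its stale "loaded" entry.
        await provider.unload(slot)
    except Exception:
        # Log and continue: load may still succeed after a failed unload.
        log.warning("unload failed for slot %s; trying load anyway",
                    slot, exc_info=True)
    # A load failure is not swallowed; the caller decides what to report.
    await provider.load(slot)


class RecordingProvider:
    """Test double that records the order of calls."""

    def __init__(self, unload_fails: bool = False) -> None:
        self.calls = []
        self.unload_fails = unload_fails

    async def unload(self, slot: str) -> None:
        self.calls.append("unload")
        if self.unload_fails:
            raise RuntimeError("unload rejected")

    async def load(self, slot: str) -> None:
        self.calls.append("load")


if __name__ == "__main__":
    provider = RecordingProvider(unload_fails=True)
    asyncio.run(recover_evicted_slot(provider, "primary"))
    print(provider.calls)
```

Load runs whether or not unload succeeded, which mirrors the "logged but doesn't bail" behaviour; only a failing load propagates to the caller.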

Known limitation (out of scope)

If lemond re-spawns the child on a port other than the one hal0's slot
config expects, the retry's target_url is stale and still hits the
dead port. Root cause: LemonadeProvider.load doesn't pass a port
hint to /v1/load, and hal0 never reads backend_url back from
/v1/health. Separate issue; logged for follow-up.

Related

Test plan

  • pytest tests/dispatcher/ tests/slots/
  • ruff check + ruff format --check
  • LXC smoke: trace recovery branch firing under orphan-process repro
  • Production validation: monitor dispatch.upstream_dead_attempting_recover rate over the next week to verify it catches real silent evictions

🤖 Generated with Claude Code

…ry → 200)

Chat completions returned 502 dispatch.upstream_unavailable when the
upstream port was dead but hal0's in-memory state still said the slot
was READY/SERVING/IDLE — the typical signature of a Lemonade idle/OOM
eviction that hal0's reconciler hadn't yet observed.  The swap-window
gate added in #385 only fires on known-non-ready states; the state-drift
window bypasses it entirely and lands on a dead :8001.

Production trace from the report:

  chat-completions 502: {"error":{"code":"dispatch.upstream_unavailable",
    "details":{"upstream":"primary",
               "target":"http://127.0.0.1:8001/v1/chat/completions",
               "error":"ConnectError"}}}

Fix — two pieces:

  • src/hal0/slots/manager.py — new SlotManager.recover_evicted_slot()
    that drives a full unload + load cycle.  Unload forces lemond to
    drop its loaded[] entry (a bare /v1/load returns success without
    re-spawning when loaded[] still claims the model on a dead PID);
    load then re-spawns the child llama-server.  Unload failure is
    logged but doesn't bail — load() may still succeed.

  • src/hal0/dispatcher/router.py — _forward_direct and
    _forward_streaming now retry once after a recoverable transport
    error from a slot upstream.  Recoverable set = ConnectError (TCP
    refused before request) + RemoteProtocolError (peer dropped
    mid-request, the eviction-race signature).  Read/write timeouts
    are intentionally excluded because they usually mean overload,
    not death.  The retry is capped at one attempt; if recovery
    itself raises, the dispatcher falls back to the original
    UpstreamUnavailable envelope so the error trail stays useful.

Test plan
=========
Six new dispatcher tests in tests/dispatcher/test_serving_integration.py:
  - test_forward_recovers_from_silent_eviction_and_retries — the
    happy path (ConnectError, recover, retry, 200).
  - test_forward_gives_up_when_retry_after_recovery_still_fails — no
    infinite loop when lemond is genuinely down.
  - test_forward_remote_connect_error_does_not_attempt_recovery —
    remote providers (OpenRouter, etc.) skip the slot-recover branch.
  - test_forward_streaming_recovers_from_silent_eviction — streaming
    requests get the same recovery on the stream-open path.
  - test_forward_recovers_from_remote_protocol_error — peer-dropped
    mid-request triggers recovery too.
  - test_forward_gives_up_when_recover_evicted_slot_itself_raises —
    recover-time failure surfaces the original UpstreamUnavailable.

215 dispatcher + slots tests pass; ruff check + format clean.

LXC verification
================
Recovery branch confirmed to fire end-to-end on hal0 LXC with the
orphan-process trigger:

  dispatch.upstream_dead_attempting_recover  slot=primary
  slot.recover_evicted_dispatched
  lemonade.provider.unload
  lemonade.provider.load
  dispatch.upstream_recovered

Known limitation: if lemond re-spawns the child on a port other than
the one hal0's slot config expects, the retry's target_url is stale and
still hits the dead port.  That's a separate port-drift issue in the
Lemonade integration (LemonadeProvider.load doesn't pass a port hint
and hal0 never reads backend_url back from /v1/health) — out of scope
for this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thinmintdev thinmintdev merged commit b3db236 into main May 28, 2026
4 checks passed
@thinmintdev thinmintdev deleted the fix/dispatch-recover-evicted-slot branch May 28, 2026 22:34
@thinmintdev thinmintdev mentioned this pull request May 29, 2026
4 tasks
thinmintdev added a commit that referenced this pull request May 29, 2026
End-of-stream cut for v0.3. Bundles MCP-completion, memory-map redesign,
Settings → Updates fix (#386), silent-eviction dispatcher recovery (#392),
ADR-0020 OpenRouter callback skeleton (#409), persona spending-cap
primitive (#411), δ-harness Hermes coverage (#410), and the docs/internal
pin + dashboard-v3 walkthrough (#389/#390).

After this tag, active scope rolls to v0.4 (install-mode reconciliation
+ UI polish + fully-implemented Agents/UI/Install bootstrapped) and v0.5
(MCP admin + memory wiring across UI and agents).

CHANGELOG merged from two coexisting Unreleased blocks into a single
[v0.3.2-alpha.1] section; added missing entries for #392 (dispatcher),
#387 (async-job polling contract), and the docs PRs #389/#390.

pyproject 0.3.1-alpha.1 → 0.3.2-alpha.1. uv.lock resynced (was stuck at
0.3.0a1 from prior drift).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>