PRD: hal0 ↔ Lemonade slot-management end-state (typed adapter, reactive reconciliation) #402

@thinmintdev

Problem Statement

Users of hal0 keep hitting chat-completions failures (502 dispatch.upstream_unavailable, 503 slot.loading at wrong moments, mysterious latency spikes) because hal0 and the embedded Lemonade daemon hold overlapping state machines with eventually-consistent reconciliation. The dispatcher reads from hal0's local copy of "is this slot alive?" while the truth lives in Lemonade. Whenever the two views diverge — silent eviction, port re-allocation after a respawn, mid-request peer-drop, configuration split-brain — the user (or their agent) gets a failure that shouldn't exist.

We have shipped 3 patches in 7 days against symptoms of this single root cause:

This pattern will keep producing bugs until the architectural shape changes.

Solution

Introduce a single typed boundary (LemondAdapter) that is the only module in hal0 which talks to Lemonade. Drive hal0's slot state from Lemonade events (WebSocket /logs/stream + log-line parsing) instead of polling. Compute dispatcher routing from the adapter at dispatch time so port drift becomes impossible by construction. Move idle-eviction policy from Lemonade to hal0 so every eviction is initiated by us, not sprung on us.

The user sees: zero dispatch.upstream_unavailable for transient eviction events; only structured 503 slot.loading + Retry-After envelopes (which their OpenAI-compatible SDKs already handle); predictable first-call latency after agent session start.

Seven phases (0–6), each independently shippable. Phase 0 is a 40-LOC hotfix that ships this week. Phases 1–3 are the core architectural change (~3 weeks). Phases 4–6 are platform polish (~3–5 weeks).
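
The dispatch-time routing described in this section can be sketched as follows. This is a minimal illustration, not the dispatcher's actual code: `plan_dispatch`, `Forward`, `SlotLoading`, and the 2-second Retry-After value are hypothetical names and numbers; only `route_for` and `is_route_alive` come from the adapter interface in this PRD.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional, Protocol, Union


@dataclass(frozen=True)
class RouteInfo:
    port: int
    pid: int


class Adapter(Protocol):
    """The two adapter calls the dispatcher would need (per this PRD)."""

    async def route_for(self, slot_name: str) -> Optional[RouteInfo]: ...
    async def is_route_alive(self, slot_name: str) -> bool: ...


@dataclass(frozen=True)
class Forward:
    """Hypothetical decision: forward straight to the child server."""
    url: str


@dataclass(frozen=True)
class SlotLoading:
    """Hypothetical decision: answer with a structured 503."""
    status: int = 503
    code: str = "slot.loading"
    retry_after_s: int = 2  # assumed value, not taken from hal0

    def headers(self) -> dict:
        return {"Retry-After": str(self.retry_after_s)}


async def plan_dispatch(adapter: Adapter, slot_name: str) -> Union[Forward, SlotLoading]:
    # The route is looked up per request, so a port change after a respawn
    # is picked up without any startup-time value going stale.
    route = await adapter.route_for(slot_name)
    if route is None or not await adapter.is_route_alive(slot_name):
        return SlotLoading()
    return Forward(url=f"http://127.0.0.1:{route.port}")
```

The point of the shape is that there are only two outcomes on the request path: forward to a freshly computed URL, or return 503 with Retry-After. No branch attempts recovery in-band.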

User Stories

  1. As a hal0 dashboard user, I want chats to succeed even when Lemonade has silently evicted the model, so that I don't see "ConnectError" 502s in the chat panel.
  2. As an OpenAI-compatible SDK user (Python openai library, curl, agent harness), I want all transient backend issues to return 503 with Retry-After, so that my SDK's built-in retry logic handles the back-off cleanly.
  3. As an OpenAI-compatible SDK user, I never want to receive a 502 dispatch.upstream_unavailable for a slot that hal0 can reach with a single internal re-load, so that I don't have to implement custom error handling for hal0-specific failure modes.
  4. As an agent provider (Hermes, pi-coder, external MCP agent), I want predictable first-call latency after my session starts, so that warm-load happens before my first inference call rather than turning into a 30-second surprise.
  5. As an agent provider, I want multi-modality chains (embed → chat → rerank) to succeed even if one slot was idle-evicted between calls, so that I don't have to manually pre-warm every slot before chaining.
  6. As an agent author, I want streaming responses to never be cut mid-stream by an eviction, so that my agent doesn't have to handle "stream truncated, retry from scratch" semantics.
  7. As an agent author, I want idempotent retry semantics on the inference path, so that retrying a failed dispatch never causes a double-execution of side effects (tool calls, etc.).
  8. As an operator (the user running hal0 on their LXC), I want lemonade version upgrades to require touching at most one hal0 file, so that I'm not chasing schema deltas across LemonadeClient, LemonadeProvider, flm_trio, npu_swap_status, the idle driver, and the metrics shim individually.
  9. As an operator, I want a Grafana-style dashboard of adapter SLOs (time-to-route-stable, eviction-recovery-time, /v1/load p95/p99, WS reconnection rate), so that I can detect upstream lemonade regressions before users do.
  10. As an operator, I want dispatch.upstream_dead_attempting_recover to be a near-zero metric in production, so that I have evidence the architecture isn't compensating for state drift.
  11. As an operator, I want hal0 to own when slots get evicted (not Lemonade's internal policy), so that I can tune idle eviction to my workload (long-form coding sessions vs. short Q&A) without changing lemond config.
  12. As an operator, I want eviction policy decisions to be observable in the journal (slot.evicted_by_idle_policy with reason), so that I can diagnose "why did my slot unload?" without reading lemond logs.
  13. As a hal0 contributor, I want the LemondAdapter to be the ONLY module that imports httpx for talking to lemond, so that adding a new failure mode (timeout, retry policy, circuit breaker) requires one file change, not seven.
  14. As a hal0 contributor, I want all lemonade-related types (SlotEvent, RouteInfo, LoadedModel) to live in src/hal0/lemonade/ and be re-exported through a stable module API, so that I never have to grep for which file owns a given lemonade concept.
  15. As a hal0 contributor, I want the dispatcher's _recover_evicted_slot recovery branch to be deleted in phase 3, so that the "two systems disagree" failure mode is impossible by construction rather than papered-over by retry logic.
  16. As a hal0 contributor reading the v0.3 dispatcher today, I want to find one explicit document explaining the control-plane / data-plane split (ADR-0008 + this PRD's resulting ADR), so that I don't reverse-engineer the design from log lines and PR descriptions.
  17. As a future-me debugging a state drift, I want the SlotManager's state for lemonade-backed slots to be derived (state = compute_from_events(history)) rather than authored (state = self._transition(...)), so that there is exactly one place to look when state is wrong.
  18. As the AFK agent picking up phase 2, I want a working /logs/stream regex test suite + captured corpus before I commit, so that I don't ship a parser that breaks on the next lemonade release.
  19. As the AFK agent, I want each phase to ship behind no feature flag (the adapter introduction is a pure refactor; subsequent phases preserve external behavior until the recovery branch is removed in phase 3), so that backing out a phase is git revert, not "untangle a flag from 5 codepaths."
  20. As a Strix Halo home-AI user (the primary hal0 audience), I want hal0 to feel reliable enough that I run it as my primary inference endpoint for daily coding work, so that I'm not falling back to a cloud provider every time I hit a 502.
  21. As a Strix Halo home-AI user, I want the dashboard's slot indicator dots to reflect ground truth (lemonade's view) within ~100ms of a state change, so that I can trust the UI when deciding whether to send another request.
  22. As a Strix Halo home-AI user opening a new session, I want primary/embed/rerank to be warm-loaded in parallel within the same boot window, so that my first chat doesn't wait for serial cold-loads.
  23. As a hal0-memory or hal0-admin MCP client, I want the embedding slot to be reliably available for memory.add / memory.search calls, so that my agent's memory operations don't intermittently fail due to slot state confusion.
  24. As a hal0-memory MCP user with the rerank slot wired (PR #365, feat(memory): pin embedding model + wire rerank slot into memory_search), I want the rerank slot's lifecycle to follow the same warm/idle policy as primary, so that memory_search isn't slower than primary chat for no reason.
  25. As a developer running the δ-tier harness (make harness), I want slot lifecycle tests to validate the new adapter-driven model, so that the harness catches regressions before they reach production.

Implementation Decisions

Modules to be built

  • LemondAdapter (new, deep module): src/hal0/lemonade/adapter.py. Single entry point for all hal0 → lemonade interactions. Owns: httpx client to :13305, WebSocket subscription to /logs/stream, TTL cache for /v1/health, event fan-out to subscribers. Replaces direct LemonadeClient usage scattered across flm_trio.py, npu_swap_status.py, the idle driver, the metrics shim. Designed to be the only file that changes when lemonade ships a breaking schema update.
  • SlotEvent types (new): src/hal0/lemonade/events.py. Typed dataclasses for ModelLoaded(name, port, pid), ModelEvicted(name, reason), ProcessCrashed(name, pid), LoadFailed(name, reason), PortChanged(name, old_port, new_port). These are the contract between the adapter and its subscribers.
  • LogStreamSubscriber (new, deep module): src/hal0/lemonade/log_stream.py. Owns the WebSocket connection to /logs/stream, reconnection-with-backoff, and the regex parser that converts log lines into SlotEvents. Falls back to polling when the WebSocket is unavailable.
  • RouteCache (new) — TTL-cached view of /v1/health.all_models_loaded[] mapping model_name → (port, pid, alive). Invalidated by SlotEvents.
  • PortValidator (new, phase 3) — lightweight TCP probe with its own TTL cache. The dispatcher's pre-dispatch alive check.
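
As an illustration of the RouteCache bullet, here is a minimal sketch of a TTL cache that is refreshed lazily and invalidated by events. The constructor arguments (`fetch`, `ttl_s`, `clock`) and the 2-second default are assumptions made for the example; `fetch` stands in for whatever reads /v1/health.all_models_loaded[].

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class RouteInfo:
    port: int
    pid: int
    alive: bool


class RouteCache:
    """TTL-cached model_name -> RouteInfo map (illustrative names only)."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, RouteInfo]],  # hypothetical /v1/health reader
        ttl_s: float = 2.0,                          # assumed TTL
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._routes: Dict[str, RouteInfo] = {}
        self._expires_at = float("-inf")  # forces a fetch on first use

    def get(self, model_name: str) -> Optional[RouteInfo]:
        # Refresh only when the cached view has expired or was invalidated.
        if self._clock() >= self._expires_at:
            self._routes = self._fetch()
            self._expires_at = self._clock() + self._ttl_s
        return self._routes.get(model_name)

    def invalidate(self) -> None:
        """Intended to be called on any SlotEvent; the next get() refetches."""
        self._expires_at = float("-inf")
```

Injecting the clock keeps expiry testable without sleeping, which matters if the cache is covered by the unit tests listed later in this PRD.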

Modules to be modified

  • Dispatcher (router.py): computes Upstream.url at dispatch time from adapter.route_for(slot) rather than reading a static value baked at startup. The recovery branch from PR #392 (fix(dispatch): recover slot after silent Lemonade eviction) is removed in phase 3 in favor of pre-dispatch port validation. The swap-window gate from PR #385 (fix(dispatch): gate slot forwards during swap window with structured 503) stays.
  • SlotManager (manager.py) — for lemonade-backed slots, becomes a coordinator that subscribes to SlotEvents instead of polling. The recover_evicted_slot() helper is removed in phase 3. Idle policy (when to evict) moves here in phase 4.
  • LemonadeProvider (providers/lemonade.py) — routes through LemondAdapter. In phase 0, immediately adds port-read-back-from-/v1/health after load so Upstream.url is freshly known.
  • UpstreamRegistry: Upstream.url becomes mutable (or, equivalently, computed by a callback). Required because the post-load port may differ from startup config.
  • flm_trio.py, npu_swap_status.py, lemonade/idle.py, lemonade/metrics_shim.py, lemonade/client.py — fold into adapter or route through it. lemonade/client.py likely remains as the low-level httpx wrapper that the adapter composes.

Interfaces (the deep-module contracts)

from __future__ import annotations  # lets the sketch reference types not shown here

from collections.abc import Awaitable, Callable
from dataclasses import dataclass
from typing import Literal

# RouteInfo, LemondHealth, SlotEvent, and Unsubscribe are not shown in this sketch.


# LemondAdapter: the typed boundary
class LemondAdapter:
    async def route_for(self, slot_name: str) -> RouteInfo | None: ...
    async def is_route_alive(self, slot_name: str) -> bool: ...  # phase 3
    async def ensure_loaded(self, slot_name: str, model_id: str) -> None: ...
    async def evict(self, slot_name: str) -> None: ...
    def subscribe(self, handler: Callable[[SlotEvent], Awaitable[None]]) -> Unsubscribe: ...
    async def health(self) -> LemondHealth: ...
    async def aclose(self) -> None: ...


# SlotEvent: the contract between adapter and subscribers
@dataclass
class ModelLoaded:
    name: str
    port: int
    pid: int


@dataclass
class ModelEvicted:
    name: str
    reason: Literal["idle", "oom", "manual", "load_failure_cascade", "unknown"]

# (etc.; full list in `src/hal0/lemonade/events.py`)
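
One way the subscribe() contract above could be backed is a small fan-out helper. The sketch below is an assumption about shape, not the planned implementation: `EventFanOut`, `publish`, and `last_errors` are hypothetical names, and only two event types are reproduced to keep it short.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List, Union


@dataclass
class ModelLoaded:
    name: str
    port: int
    pid: int


@dataclass
class ModelEvicted:
    name: str
    reason: str


SlotEvent = Union[ModelLoaded, ModelEvicted]
Handler = Callable[[SlotEvent], Awaitable[None]]
Unsubscribe = Callable[[], None]


class EventFanOut:
    """Hypothetical helper that an adapter could delegate subscribe() to."""

    def __init__(self) -> None:
        self._handlers: List[Handler] = []
        self.last_errors: List[BaseException] = []

    def subscribe(self, handler: Handler) -> Unsubscribe:
        self._handlers.append(handler)

        def unsubscribe() -> None:
            # Idempotent: calling it twice is harmless.
            if handler in self._handlers:
                self._handlers.remove(handler)

        return unsubscribe

    async def publish(self, event: SlotEvent) -> None:
        # Snapshot the list so a handler may unsubscribe during delivery.
        results = await asyncio.gather(
            *(handler(event) for handler in list(self._handlers)),
            return_exceptions=True,
        )
        # A failing subscriber does not prevent delivery to the others;
        # its exception is recorded instead of propagating.
        self.last_errors = [r for r in results if isinstance(r, BaseException)]
```

Returning the unsubscribe callable from subscribe() keeps subscription lifetime in the caller's hands, which suits components like SlotManager that are created and torn down in tests.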

Architectural decisions

  • Control-plane / data-plane split preserved. The dispatcher still forwards inference traffic directly to the child llama-server (e.g. :8001), not via lemond. This was the design call in lemonade/client.py:5-6 and the alternative (route data plane through lemond :13305) is explicitly rejected: lemond is not designed to be a streaming reverse proxy under load.
  • Lemond is the source of truth for slot lifecycle. hal0 derives. The state.json files become cache/restore hints, not authoritative state.
  • Events are reactive, polling is the safety net. WebSocket /logs/stream drives state in real time (phase 2); a 30s /v1/health poll runs as the heartbeat that catches missed events.
  • In-band recovery is removed by construction. Phase 3 deletes the dispatcher's recovery branch. If the adapter's pre-dispatch validation says "alive," the request goes. If "not alive," SlotLoading 503 fires. There is no in-band recovery on the request path.
  • Idle eviction policy lives in hal0, not lemond. Phase 4 has hal0 calling /v1/unload proactively. This removes the "lemonade silently evicted" class entirely because every eviction is initiated by us.
  • Phase 0 ships independently. The 40-LOC port-discovery patch lands as v0.3.1 hotfix this week. It kills the port-drift bug without any architectural change.
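
The "polling is the safety net" decision implies some reconciliation step that turns a /v1/health poll into the same events the WebSocket would have produced. A minimal sketch, assuming both views can be reduced to a model_name → (port, pid) map; the `reconcile` function and that reduced shape are hypothetical, while the event names come from the SlotEvent list above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class ModelLoaded:
    name: str
    port: int
    pid: int


@dataclass(frozen=True)
class ModelEvicted:
    name: str
    reason: str


@dataclass(frozen=True)
class PortChanged:
    name: str
    old_port: int
    new_port: int


SlotEvent = Union[ModelLoaded, ModelEvicted, PortChanged]
Routes = Dict[str, Tuple[int, int]]  # model_name -> (port, pid); assumed shape


def reconcile(known: Routes, polled: Routes) -> List[SlotEvent]:
    """Emit the events that were evidently missed between two views."""
    events: List[SlotEvent] = []
    # Models hal0 believes are loaded but the poll no longer reports.
    for name in sorted(known.keys() - polled.keys()):
        events.append(ModelEvicted(name, "unknown"))
    for name in sorted(polled):
        port, pid = polled[name]
        if name not in known:
            events.append(ModelLoaded(name, port, pid))
        elif known[name][0] != port:
            events.append(PortChanged(name, known[name][0], port))
    return events
```

Because the heartbeat emits ordinary SlotEvents, subscribers need no separate code path for "state learned by polling" versus "state learned from the stream".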

Phasing (each phase is independently shippable and reversible)

  • Phase 0 (1 PR, this week, targets v0.3.1): fix(lemonade): port discovery after load. _await_ready reads backend_url and updates slot.port + Upstream.url.
  • Phase 1 (2–3 PRs, 1 week): Introduce LemondAdapter as a pass-through; consolidate scattered httpx callers; add TTL cache.
  • Phase 2 (2 PRs, 1–2 weeks): Implement /logs/stream subscription + log-line regex parser. SlotManager subscribes to events; polling becomes safety net.
  • Phase 3 (1 PR, 3 days): Pre-dispatch port validation via adapter. Delete the recovery branch from PR #392 (fix(dispatch): recover slot after silent Lemonade eviction).
  • Phase 4 (2 PRs, 1 week): hal0-side idle eviction with proactive /v1/unload. Warm-load on agent session start.
  • Phase 5 (1 PR, 3 days): SlotManager derives UX state from event history for lemonade-backed slots.
  • Phase 6 (1 PR + monitoring): Adapter SLO metrics + dashboard.
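
The Phase 0 read-back amounts to extracting a port from the backend_url that /v1/health reports after a load. The sketch below assumes a payload shaped like `{"all_models_loaded": [{"model_name": ..., "backend_url": ...}]}`; the `model_name` key and the nesting are assumptions for illustration and should be checked against a captured lemond response. `port_for_model` is a hypothetical helper name.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlsplit


def port_for_model(health: Dict[str, Any], model_name: str) -> Optional[int]:
    """Return the port lemond reports for a loaded model, if any.

    Assumed payload shape (not verified against lemond):
    {"all_models_loaded": [{"model_name": "...", "backend_url": "http://host:port/..."}]}
    """
    for entry in health.get("all_models_loaded", []):
        if entry.get("model_name") == model_name:
            # urlsplit(...).port is None when the URL carries no explicit port.
            return urlsplit(entry.get("backend_url", "")).port
    return None
```

A caller such as _await_ready would then compare the result with slot.port and update Upstream.url only when the two differ.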

API contracts (external)

  • /v1/chat/completions, /v1/embeddings, /v1/rerank, etc.: unchanged. Same OpenAI-compat envelope. Same 503 slot.loading + Retry-After semantics from PR #385 (fix(dispatch): gate slot forwards during swap window with structured 503).
  • /api/slots — unchanged response shape. The state field still surfaces PULLING/STARTING/WARMING/READY/SERVING/IDLE/OFFLINE/ERROR, just computed differently underneath.
  • New /api/v1/adapter/events SSE stream (phase 6) — exposes the SlotEvent stream to dashboard clients so the slot indicator dots react in real time.
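
For the phase 6 SSE stream, each SlotEvent would need to be serialized into a Server-Sent Events frame. A minimal sketch follows; using the dataclass name as the SSE `event:` field and compact JSON as the payload are assumptions, since this PRD does not fix the wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ModelLoaded:
    name: str
    port: int
    pid: int


def to_sse_frame(event) -> str:
    """Serialize one dataclass event as a single SSE frame.

    An SSE frame is a set of "field: value" lines ended by a blank line.
    """
    payload = json.dumps(asdict(event), separators=(",", ":"))
    return f"event: {type(event).__name__}\ndata: {payload}\n\n"
```

Naming the SSE event after the variant lets a dashboard client register one listener per event type instead of switching on a field inside the payload.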

Schema changes

  • state.json semantics change in phase 5: becomes a 1-hour-TTL snapshot for fast dashboard startup, not authoritative. Older state.json files are forward-compatible (the adapter will reconcile against lemond on startup).
  • No database schema changes.
  • No breaking changes to slot TOML config.
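
The state.json change can be read as "use the snapshot only if it is fresh, otherwise wait for reconciliation". A minimal sketch, assuming the snapshot carries a `saved_at` epoch-seconds field; that field name and `load_snapshot_hint` are hypothetical, and files without the field are treated as stale rather than rejected.

```python
import json
import time
from typing import Any, Dict, Optional

SNAPSHOT_TTL_S = 3600  # the 1-hour TTL named in the schema change


def load_snapshot_hint(raw: str, now: Optional[float] = None) -> Optional[Dict[str, Any]]:
    """Return the snapshot if usable as a startup hint, else None."""
    try:
        snapshot = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(snapshot, dict):
        return None
    saved_at = snapshot.get("saved_at")  # hypothetical field name
    if not isinstance(saved_at, (int, float)):
        # Older files without a timestamp: ignore and reconcile from lemond.
        return None
    current = time.time() if now is None else now
    age = current - saved_at
    return snapshot if 0 <= age <= SNAPSHOT_TTL_S else None
```

Returning None in every doubtful case keeps the snapshot a pure hint: the worst outcome is a slower dashboard start, never a wrong state.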

Testing Decisions

What makes a good test

  • Test external behavior, not implementation. Assert on the contract surface: "given a slot evicted by lemond, the next /v1/chat/completions returns 200 within N seconds" — not "verify _serving_enter was called twice."
  • Captured /v1/health fixtures. Lemond's response shape is the contract surface; check in real captured responses as test fixtures so phase 1's refactor is provably behavior-preserving.
  • Log-corpus tests for phase 2. A captured corpus of real /logs/stream lines per recent lemond release (10.5, 10.6, 10.7-rc), tagged with expected SlotEvent outputs. The regex parser is locked to this corpus; new lemond releases require updating the corpus.
  • Property tests for the state-derivation logic in phase 5. Given a random sequence of SlotEvents, the derived UX state matches a reference implementation. Catches edge cases in event ordering.
  • No patching of client methods in unit tests. Exercise the real httpx client through httpx.MockTransport with realistic responses (the pattern already established in tests/dispatcher/test_forward.py and tests/dispatcher/test_serving_integration.py).
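
The log-corpus idea can be sketched as a table of (line, expected event) pairs run through the parser. Everything concrete below is a placeholder: the two log lines and both regexes are invented for illustration and are not real lemond output; the real patterns have to come from the captured corpus. `parse_line` and `run_corpus` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass(frozen=True)
class ModelLoaded:
    name: str
    port: int
    pid: int


@dataclass(frozen=True)
class ModelEvicted:
    name: str
    reason: str


SlotEvent = Union[ModelLoaded, ModelEvicted]

# PLACEHOLDER patterns, not real lemond log formats.
_LOADED = re.compile(r"model loaded: (?P<name>\S+) port=(?P<port>\d+) pid=(?P<pid>\d+)")
_EVICTED = re.compile(r"model unloaded: (?P<name>\S+) reason=(?P<reason>\w+)")


def parse_line(line: str) -> Optional[SlotEvent]:
    """Map one log line to a SlotEvent, or None for irrelevant lines."""
    m = _LOADED.search(line)
    if m:
        return ModelLoaded(m["name"], int(m["port"]), int(m["pid"]))
    m = _EVICTED.search(line)
    if m:
        return ModelEvicted(m["name"], m["reason"])
    return None


# PLACEHOLDER corpus; a real one would be captured per lemond release.
CORPUS: List[Tuple[str, Optional[SlotEvent]]] = [
    ("[info] model loaded: qwen port=8001 pid=4242", ModelLoaded("qwen", 8001, 4242)),
    ("[info] model unloaded: qwen reason=idle", ModelEvicted("qwen", "idle")),
    ("[debug] heartbeat ok", None),
]


def run_corpus(corpus: List[Tuple[str, Optional[SlotEvent]]]) -> List[str]:
    """Return the lines whose parsed event differs from the tagged one."""
    return [line for line, expected in corpus if parse_line(line) != expected]
```

Tagging irrelevant lines with None is deliberate: the corpus then also guards against the parser emitting events from lines it should ignore.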

Modules to be tested

  • LemondAdapter: unit tests against httpx.MockTransport + captured fixtures. Test that route_for() returns fresh port info after a ModelLoaded event invalidates the cache.
  • LogStreamSubscriber: regex parser tested against captured log corpus (one test per SlotEvent variant). Reconnection logic tested with a mock WS server.
  • PortValidator: tested with a real ephemeral TCP listener (bind, validate alive, close, validate dead).
  • Dispatcher: existing tests/dispatcher/test_serving_integration.py extended with adapter-event-driven scenarios. In phase 3, the recovery-branch tests added by PR #392 (fix(dispatch): recover slot after silent Lemonade eviction) are deleted (the branch they cover no longer exists).
  • SlotManager: existing tests/slots/ extended with event-subscription tests. The recover_evicted_slot() tests are deleted in phase 3.
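
The PortValidator test described above (bind, validate alive, close, validate dead) can be sketched with the standard library alone. `is_port_alive` and the 0.2-second timeout are assumptions for the example, and the TTL cache that the PRD gives the validator is omitted here.

```python
import socket
from typing import Tuple


def is_port_alive(host: str, port: int, timeout_s: float = 0.2) -> bool:
    """Lightweight TCP probe: True if a connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def probe_ephemeral_listener() -> Tuple[bool, bool]:
    """Bind an ephemeral listener, probe it, close it, probe again."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        listener.listen(1)
        port = listener.getsockname()[1]
        alive_while_listening = is_port_alive("127.0.0.1", port)
    finally:
        listener.close()
    alive_after_close = is_port_alive("127.0.0.1", port)
    return alive_while_listening, alive_after_close
```

Using a real listener instead of a patched socket keeps the test honest about what the probe observes on the loopback interface.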

Prior art in the codebase

  • tests/dispatcher/test_forward.py: httpx.MockTransport pattern for dispatcher unit tests.
  • tests/dispatcher/test_serving_integration.py: _RecordingSlotManager pattern for stand-in SlotManager + asserting on lifecycle events.
  • tests/providers/test_lemonade.py: captured fixture pattern for LemonadeClient responses.
  • tests/harness/ (δ-tier) — full lifecycle harness for installer → CLI → slot → uninstall. Adapter integration validated here at the system level.

Test additions per phase

  • Phase 0: 1 unit test asserting slot.port is updated to match /v1/health.backend_url after load.
  • Phase 1: refactor preserves behavior — existing 215 dispatcher+slots tests stay green; adapter gets ~20 new unit tests against MockTransport.
  • Phase 2: ~30 log-corpus regex tests + ~10 reconnection-behavior tests.
  • Phase 3: pre-dispatch validator gets ~10 tests; the 6 recovery-branch tests from PR #392 (fix(dispatch): recover slot after silent Lemonade eviction) are deleted.
  • Phase 4: idle-policy tests (~15) + warm-load-on-session-start integration test.
  • Phase 5: ~30 property tests on state derivation.
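
The phase 5 property tests compare a derived state against a reference implementation over random event sequences. The sketch below shows the harness shape with a deliberately toy model: the event kinds are reduced to strings, the mapping to READY/OFFLINE/ERROR is an assumption, and a seeded `random.Random` stands in for a property-testing library. The real derivation is likely history-dependent, which is exactly where such tests earn their keep.

```python
import random
from typing import List, Sequence

EVENT_KINDS = ("loaded", "evicted", "crashed", "load_failed")  # simplified stand-ins


def derive_state(history: Sequence[str]) -> str:
    """Fold the event history into a UX state (toy transition rules)."""
    state = "OFFLINE"
    for kind in history:
        if kind == "loaded":
            state = "READY"
        elif kind == "evicted":
            state = "OFFLINE"
        else:  # crashed or load_failed
            state = "ERROR"
    return state


def reference_state(history: Sequence[str]) -> str:
    """Independent reference: in this toy model only the last event matters."""
    if not history:
        return "OFFLINE"
    return {"loaded": "READY", "evicted": "OFFLINE"}.get(history[-1], "ERROR")


def count_mismatches(trials: int = 500, seed: int = 0) -> int:
    """Run random histories through both implementations and count disagreements."""
    rng = random.Random(seed)  # seeded so a failure is reproducible
    mismatches = 0
    for _ in range(trials):
        history: List[str] = [rng.choice(EVENT_KINDS) for _ in range(rng.randint(0, 20))]
        if derive_state(history) != reference_state(history):
            mismatches += 1
    return mismatches
```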

Out of Scope

  • Replacing Lemonade entirely. This was considered (go back to hal0-managed systemd llama-servers) and rejected — we'd lose FLM/NPU support, Kokoro voice, sdcpp image-gen, and we just finished migrating TO lemonade in v0.2.
  • Routing the data plane through lemond. Considered and rejected: lemond is not designed to be a streaming reverse proxy under load. This is explicitly documented in lemonade/client.py:5-6.
  • A native model-event WebSocket API from upstream lemonade. Would let us skip phase 2's log-parsing. Not on lemonade's roadmap as of 2026-05; revisit annually. If it ships, phase 2's log-parsing is replaced with a thin WS adapter; the rest of the plan is unchanged.
  • Changing the inference-path protocol. The dispatcher still speaks OpenAI-compat. Streaming still uses SSE. The new architecture is invisible to clients.
  • Multi-host lemonade. All of this assumes single-host lemonade running on 127.0.0.1. Multi-host would be a separate ADR.
  • Cross-slot transactional guarantees. If an agent chain needs primary + embed + rerank to all be live simultaneously, the warm-load helper in phase 4 makes a best-effort but doesn't provide atomic guarantees. That's a v0.5 concern.

Further Notes

Timing: v0.3.1 vs v0.4 vs separate platform track

The user explicitly flagged this question as undecided. Three options:

  1. Phase 0 as v0.3.1 hotfix only. Ship phase 0 this week, defer phases 1+ to v0.5. Lowest risk. Only fixes today's bug class.
  2. Phases 0–3 bundled as v0.4 "Reactive Lemonade Integration" theme. ~3 weeks. v0.4 gets a coherent platform-reliability narrative. Right call if "platform reliability" resonates with the Strix Halo audience.
  3. Separate platform track. Phase 0 → v0.3.1; phases 1–3 → v0.4.x point releases over the cycle. Decouples user-visible from infrastructural. Risk: platform work always loses to features in practice, so phases 4–6 may never happen.

Recommended hybrid:

| Phase | Release | Theme |
| --- | --- | --- |
| Phase 0 | v0.3.1 hotfix this week | Bug fix |
| Phases 1–3 | v0.4 milestone (~3 weeks) | "Reactive Lemonade Integration" |
| Phases 4–6 | v0.4.x point releases | Platform polish |

The trap to watch: if v0.4 is already promised feature-heavy (more agents, MCP, dashboard), phases 1–3 should slip to v0.5 — don't let infrastructure block features.

Spike questions before phase 2 starts

  1. Does lemond's /logs/stream emit reliable load/unload events? 1-day spike to capture corpus and assess regex tractability.
  2. WebSocket reconnection semantics — undocumented in lemonade today per memory hal0_lemonade_ws_protocol. Spike to characterize.
  3. Upstream lemonade roadmap — is a native event API coming in 3 months? If yes, skip phase 2's log-parsing and wait.

Cross-references

  • PR #392 (fix(dispatch): recover slot after silent Lemonade eviction): silent-eviction recovery (today's fix). The _recover_evicted_slot helper added here is the proof-of-concept for phase 3's pre-dispatch validation.
  • PR #385 (fix(dispatch): gate slot forwards during swap window with structured 503): swap-window gate (complementary; stays in end-state).
  • ADR-0008 — Lemonade integration boundary. This PRD extends it with the adapter pattern; a follow-up ADR-0015 (or similar) should record the architectural shift.
  • Memory entries that informed this plan: hal0_dispatcher_silent_eviction_recovery, hal0_lemonade_port_drift, hal0_lemonade_gotchas, hal0_lemonade_internals, hal0_lemonade_ws_protocol, hal0_lemonade_threads_deadlock, hal0_lemonade_unload_gpu_cleanup_hang, hal0_lemonade_whisper_runpath_bug, hal0_lemonade_flm_npu_install, hal0_lemonade_ctx_size_lives_in_config_json, hal0_lemonade_rocm_device_perms, hal0_lemonade_hf_cache_gotchas, hal0_slot_backend_change_state_drift, hal0_hsa_gfx_override_stale_after_rocm_bundle_upgrade.
  • Architectural doc (local): /home/halo/Development/Projects/hal0/Developer Docs/hal0-lemonade-slot-management-end-state.md.

Success metrics

  • dispatch.upstream_dead_attempting_recover → 0 in production (phase 3 deletes the branch; its absence is also the win).
  • dispatch.upstream_unavailable (502 rate) < 0.1% of inference calls.
  • Time-to-route-stable after eviction: today "tens of seconds" → target "under one second."
  • Lemonade version upgrades affect exactly one hal0 file (the adapter).
  • Agent providers (Hermes, pi-coder, external MCP) report zero "primary unreachable" in normal operation.
