PRD: hal0 ↔ Lemonade slot-management end-state (typed adapter, reactive reconciliation)

## Problem Statement

Users of hal0 keep hitting chat-completions failures (`502 dispatch.upstream_unavailable`, `503 slot.loading` at wrong moments, mysterious latency spikes) because hal0 and the embedded Lemonade daemon hold overlapping state machines with eventually-consistent reconciliation. The dispatcher reads from hal0's local copy of "is this slot alive?" while the truth lives in Lemonade. Whenever the two views diverge — silent eviction, port re-allocation after a respawn, mid-request peer-drop, configuration split-brain — the user (or their agent) gets a failure that shouldn't exist.

In the last 7 days we have shipped or queued 3 patches against symptoms of this single root cause:
- PR #385 — swap-window gate (catches *known* not-ready states)
- PR #392 — silent-eviction recovery branch (catches *one* state-drift mode)
- The pending port-discovery fix (catches port-drift after orphan-process respawn)

This pattern will keep producing bugs until the architectural shape changes.

## Solution

Introduce a single typed boundary (`LemondAdapter`) that is the only module in hal0 which talks to Lemonade. Drive hal0's slot state from Lemonade events (WebSocket `/logs/stream` + log-line parsing) instead of polling. Compute dispatcher routing from the adapter at dispatch time so port drift becomes impossible by construction. Move idle-eviction policy from Lemonade to hal0 so every eviction is initiated by hal0 rather than discovered after the fact.

The user sees: zero `dispatch.upstream_unavailable` for transient eviction events; only structured `503 slot.loading` + `Retry-After` envelopes (which their OpenAI-compatible SDKs already handle); predictable first-call latency after agent session start.

Seven phases (0–6), each independently shippable. Phase 0 is a 40-LOC hotfix that ships this week. Phases 1–3 are the core architectural change (~3 weeks). Phases 4–6 are platform polish (~3–5 weeks).

## User Stories

1. As a hal0 dashboard user, I want chats to succeed even when Lemonade has silently evicted the model, so that I don't see "ConnectError" 502s in the chat panel.
2. As an OpenAI-compatible SDK user (Python `openai` library, curl, agent harness), I want all transient backend issues to return `503` with `Retry-After`, so that my SDK's built-in retry logic handles the back-off cleanly.
3. As an OpenAI-compatible SDK user, I never want to receive a `502 dispatch.upstream_unavailable` for a slot that hal0 can reach with a single internal re-load, so that I don't have to implement custom error handling for hal0-specific failure modes.
4. As an agent provider (Hermes, pi-coder, external MCP agent), I want predictable first-call latency after my session starts, so that warm-load happens before my first inference call rather than turning into a 30-second surprise.
5. As an agent provider, I want multi-modality chains (embed → chat → rerank) to succeed even if one slot was idle-evicted between calls, so that I don't have to manually pre-warm every slot before chaining.
6. As an agent author, I want streaming responses to never be cut mid-stream by an eviction, so that my agent doesn't have to handle "stream truncated, retry from scratch" semantics.
7. As an agent author, I want idempotent retry semantics on the inference path, so that retrying a failed dispatch never causes a double-execution of side effects (tool calls, etc.).
8. As an operator (the user running hal0 on their LXC), I want lemonade version upgrades to require touching at most one hal0 file, so that I'm not chasing schema deltas across `LemonadeClient`, `LemonadeProvider`, `flm_trio`, `npu_swap_status`, the idle driver, and the metrics shim individually.
9. As an operator, I want a Grafana-style dashboard of adapter SLOs (time-to-route-stable, eviction-recovery-time, /v1/load p95/p99, WS reconnection rate), so that I can detect upstream lemonade regressions before users do.
10. As an operator, I want `dispatch.upstream_dead_attempting_recover` to be a near-zero metric in production, so that I have evidence the architecture isn't compensating for state drift.
11. As an operator, I want hal0 to own *when* slots get evicted (not Lemonade's internal policy), so that I can tune idle eviction to my workload (long-form coding sessions vs. short Q&A) without changing lemond config.
12. As an operator, I want eviction policy decisions to be observable in the journal (`slot.evicted_by_idle_policy` with reason), so that I can diagnose "why did my slot unload?" without reading lemond logs.
13. As a hal0 contributor, I want the `LemondAdapter` to be the ONLY module that imports `httpx` for talking to lemond, so that adding a new failure mode (timeout, retry policy, circuit breaker) requires one file change, not seven.
14. As a hal0 contributor, I want all lemonade-related types (`SlotEvent`, `RouteInfo`, `LoadedModel`) to live in `src/hal0/lemonade/` and be re-exported through a stable module API, so that I never have to grep for which file owns a given lemonade concept.
15. As a hal0 contributor, I want the dispatcher's `_recover_evicted_slot` recovery branch to be deleted in phase 3, so that the "two systems disagree" failure mode is impossible by construction rather than papered-over by retry logic.
16. As a hal0 contributor reading the v0.3 dispatcher today, I want to find one explicit document explaining the control-plane / data-plane split (ADR-0008 + this PRD's resulting ADR), so that I don't reverse-engineer the design from log lines and PR descriptions.
17. As a future-me debugging a state drift, I want the SlotManager's state for lemonade-backed slots to be derived (`state = compute_from_events(history)`) rather than authored (`state = self._transition(...)`), so that there is exactly one place to look when state is wrong.
18. As the AFK agent picking up phase 2, I want a working `/logs/stream` regex test suite + captured corpus before I commit, so that I don't ship a parser that breaks on the next lemonade release.
19. As the AFK agent, I want each phase to ship behind no feature flag (the adapter introduction is a pure refactor; subsequent phases preserve external behavior until the recovery branch is removed in phase 3), so that backing out a phase is `git revert`, not "untangle a flag from 5 codepaths."
20. As a Strix Halo home-AI user (the primary hal0 audience), I want hal0 to feel reliable enough that I run it as my primary inference endpoint for daily coding work, so that I'm not falling back to a cloud provider every time I hit a 502.
21. As a Strix Halo home-AI user, I want the dashboard's slot indicator dots to reflect ground truth (lemonade's view) within ~100ms of a state change, so that I can trust the UI when deciding whether to send another request.
22. As a Strix Halo home-AI user opening a new session, I want primary/embed/rerank to be warm-loaded in parallel within the same boot window, so that my first chat doesn't wait for serial cold-loads.
23. As an `hal0-memory` or `hal0-admin` MCP client, I want the embedding slot to be reliably available for memory.add / memory.search calls, so that my agent's memory operations don't intermittently fail due to slot state confusion.
24. As an `hal0-memory` MCP user with the rerank slot wired (PR #365), I want the rerank slot's lifecycle to follow the same warm/idle policy as primary, so that memory_search isn't slower than primary chat for no reason.
25. As a developer running the δ-tier harness (`make harness`), I want slot lifecycle tests to validate the new adapter-driven model, so that the harness catches regressions before they reach production.

## Implementation Decisions

### Modules to be built

- **`LemondAdapter` (new, deep module)** — `src/hal0/lemonade/adapter.py`. Single entry point for all hal0 → lemonade interactions. Owns: httpx client to `:13305`, WebSocket subscription to `/logs/stream`, TTL cache for `/v1/health`, event fan-out to subscribers. Replaces direct `LemonadeClient` usage scattered across `flm_trio.py`, `npu_swap_status.py`, the idle driver, the metrics shim. Designed to be the only file that changes when lemonade ships a breaking schema update.
- **`SlotEvent` types (new)** — `src/hal0/lemonade/events.py`. Typed dataclasses for `ModelLoaded(name, port, pid)`, `ModelEvicted(name, reason)`, `ProcessCrashed(name, pid)`, `LoadFailed(name, reason)`, `PortChanged(name, old_port, new_port)`. These are the contract between the adapter and its subscribers.
- **`LogStreamSubscriber` (new, deep module)** — `src/hal0/lemonade/log_stream.py`. Owns the WebSocket connection to `/logs/stream`, reconnection-with-backoff, and the regex parser that converts log lines into `SlotEvent`s. Falls back to polling on WS unavailable.
- **`RouteCache` (new)** — TTL-cached view of `/v1/health.all_models_loaded[]` mapping `model_name → (port, pid, alive)`. Invalidated by `SlotEvent`s.
- **`PortValidator` (new, phase 3)** — lightweight TCP probe with its own TTL cache. The dispatcher's pre-dispatch alive check.
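
A minimal sketch of the `RouteCache` idea, assuming a monotonic clock and a 2-second default TTL (neither is specified above). The `RouteInfo` fields mirror the `(port, pid, alive)` tuple; the method names `put`, `get`, and `invalidate` are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class RouteInfo:
    port: int
    pid: int
    alive: bool


class RouteCache:
    """TTL cache of model_name -> RouteInfo (hypothetical shape)."""

    def __init__(self, ttl_s: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, RouteInfo]] = {}

    def put(self, name: str, route: RouteInfo) -> None:
        # Store the route together with its absolute expiry time.
        self._entries[name] = (self._clock() + self._ttl_s, route)

    def get(self, name: str) -> Optional[RouteInfo]:
        entry = self._entries.get(name)
        if entry is None:
            return None
        expires_at, route = entry
        if self._clock() >= expires_at:
            # Expired: the caller is expected to re-read /v1/health.
            del self._entries[name]
            return None
        return route

    def invalidate(self, name: str) -> None:
        # Called when a SlotEvent arrives for this model.
        self._entries.pop(name, None)
```

Injecting the clock keeps expiry testable without sleeping; a subscriber would call `invalidate(event.name)` for each incoming `SlotEvent`.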

### Modules to be modified

- **`Dispatcher` (`router.py`)** — computes `Upstream.url` at dispatch time from `adapter.route_for(slot)` rather than reading a static value baked at startup. The recovery branch from PR #392 is removed in phase 3 in favor of pre-dispatch port validation. The #385 swap-window gate stays.
- **`SlotManager` (`manager.py`)** — for lemonade-backed slots, becomes a coordinator that subscribes to `SlotEvent`s instead of polling. The `recover_evicted_slot()` helper is removed in phase 3. Idle policy (when to evict) moves here in phase 4.
- **`LemonadeProvider` (`providers/lemonade.py`)** — routes through `LemondAdapter`. In phase 0, immediately adds port-read-back-from-`/v1/health` after load so `Upstream.url` is freshly known.
- **`UpstreamRegistry`** — `Upstream.url` becomes mutable (or, equivalently, computed by a callback). Required because the post-load port may differ from startup config.
- **`flm_trio.py`, `npu_swap_status.py`, `lemonade/idle.py`, `lemonade/metrics_shim.py`, `lemonade/client.py`** — fold into adapter or route through it. `lemonade/client.py` likely remains as the low-level httpx wrapper that the adapter composes.
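
The dispatcher change can be sketched as a small gate that resolves the upstream per request. This is an illustration under stated assumptions: `FakeAdapter`, `Forward`, `SlotLoading`, `resolve_upstream`, and the 2-second `Retry-After` value are hypothetical; only `route_for` and `is_route_alive` come from the adapter interface in this PRD.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class RouteInfo:
    port: int
    pid: int


@dataclass(frozen=True)
class Forward:
    url: str


@dataclass(frozen=True)
class SlotLoading:
    status: int
    code: str
    retry_after_s: int


class FakeAdapter:
    """Hypothetical stand-in exposing the two adapter calls the gate needs."""

    def __init__(self, routes: Dict[str, RouteInfo], alive: Dict[str, bool]) -> None:
        self._routes = routes
        self._alive = alive

    async def route_for(self, slot_name: str) -> Optional[RouteInfo]:
        return self._routes.get(slot_name)

    async def is_route_alive(self, slot_name: str) -> bool:
        return self._alive.get(slot_name, False)


async def resolve_upstream(adapter, slot_name: str,
                           retry_after_s: int = 2) -> Union[Forward, SlotLoading]:
    # Resolve on every request instead of reading a URL baked in at startup.
    route = await adapter.route_for(slot_name)
    if route is None or not await adapter.is_route_alive(slot_name):
        # No in-band recovery: the client retries after Retry-After.
        return SlotLoading(503, "slot.loading", retry_after_s)
    return Forward(f"http://127.0.0.1:{route.port}")
```

Because the URL is built from the route returned at dispatch time, a port change after a respawn is picked up on the very next request.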

### Interfaces (the deep-module contracts)

```python
# LemondAdapter — the typed boundary
class LemondAdapter:
    async def route_for(self, slot_name: str) -> RouteInfo | None: ...
    async def is_route_alive(self, slot_name: str) -> bool: ...  # phase 3
    async def ensure_loaded(self, slot_name: str, model_id: str) -> None: ...
    async def evict(self, slot_name: str) -> None: ...
    def subscribe(self, handler: Callable[[SlotEvent], Awaitable[None]]) -> Unsubscribe: ...
    async def health(self) -> LemondHealth: ...
    async def aclose(self) -> None: ...
```
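
One way the `subscribe` contract could behave is sketched here as a standalone fan-out helper. `EventFanOut` and `publish` are hypothetical names, and isolating a failing subscriber from the others is an assumption about desired behavior, not something the interface above specifies.

```python
import asyncio
from typing import Any, Awaitable, Callable, List

Handler = Callable[[Any], Awaitable[None]]
Unsubscribe = Callable[[], None]


class EventFanOut:
    def __init__(self) -> None:
        self._handlers: List[Handler] = []

    def subscribe(self, handler: Handler) -> Unsubscribe:
        self._handlers.append(handler)

        def unsubscribe() -> None:
            # Idempotent: calling it twice is harmless.
            if handler in self._handlers:
                self._handlers.remove(handler)

        return unsubscribe

    async def publish(self, event: Any) -> List[BaseException]:
        # Snapshot the list so a handler may unsubscribe during delivery.
        results = await asyncio.gather(
            *(handler(event) for handler in list(self._handlers)),
            return_exceptions=True,
        )
        # Return failures instead of raising, so one bad subscriber
        # does not prevent delivery to the rest.
        return [r for r in results if isinstance(r, BaseException)]
```

Returning the collected exceptions leaves the logging or metrics decision to the caller.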

```python
# SlotEvent — the contract between adapter and subscribers
@dataclass
class ModelLoaded:
    name: str
    port: int
    pid: int

@dataclass
class ModelEvicted:
    name: str
    reason: Literal["idle", "oom", "manual", "load_failure_cascade", "unknown"]

# (etc — full list in `src/hal0/lemonade/events.py`)
```
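
For reference, the remaining variants named under "Modules to be built" could be written out as follows. The field names come from that list; the field types, the `frozen=True` choice, the `SlotEvent` union alias, and the `new_port_of` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Literal, Optional, Union

EvictReason = Literal["idle", "oom", "manual", "load_failure_cascade", "unknown"]


@dataclass(frozen=True)
class ModelLoaded:
    name: str
    port: int
    pid: int


@dataclass(frozen=True)
class ModelEvicted:
    name: str
    reason: EvictReason


@dataclass(frozen=True)
class ProcessCrashed:
    name: str
    pid: int


@dataclass(frozen=True)
class LoadFailed:
    name: str
    reason: str  # assumption: free-form text from the log line


@dataclass(frozen=True)
class PortChanged:
    name: str
    old_port: int
    new_port: int


SlotEvent = Union[ModelLoaded, ModelEvicted, ProcessCrashed, LoadFailed, PortChanged]


def new_port_of(event: SlotEvent) -> Optional[int]:
    """Port a subscriber should route to after this event, if the event carries one."""
    if isinstance(event, ModelLoaded):
        return event.port
    if isinstance(event, PortChanged):
        return event.new_port
    return None
```

Frozen dataclasses make events hashable and safe to share across subscribers.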

### Architectural decisions

- **Control-plane / data-plane split preserved.** The dispatcher still forwards inference traffic directly to the child llama-server (e.g. `:8001`), not via lemond. This was the design call in `lemonade/client.py:5-6` and the alternative (route data plane through lemond `:13305`) is explicitly rejected: lemond is not designed to be a streaming reverse proxy under load.
- **Lemond is the source of truth for slot lifecycle.** hal0 derives. The `state.json` files become cache/restore hints, not authoritative state.
- **Events are reactive, polling is the safety net.** WebSocket `/logs/stream` drives state in real time (phase 2); a 30s `/v1/health` poll runs as the heartbeat that catches missed events.
- **In-band recovery is unnecessary by construction.** Phase 3 deletes the dispatcher's recovery branch. If the adapter's pre-dispatch validation says "alive," the request goes through. If it says "not alive," the dispatcher returns a `SlotLoading` 503. There is no in-band recovery on the request path.
- **Idle eviction policy lives in hal0, not lemond.** Phase 4 has hal0 calling `/v1/unload` proactively. This removes the "lemonade silently evicted" class entirely because every eviction is initiated by us.
- **Phase 0 ships independently.** The 40-LOC port-discovery patch lands as v0.3.1 hotfix this week. It kills the port-drift bug without any architectural change.
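
The "polling is the safety net" decision can be illustrated as a diff between hal0's event-derived view and a fresh `/v1/health` poll, producing the events that were missed. This is a sketch: `reconcile` is a hypothetical name, both views are reduced to `name -> (port, pid)` maps, and tagging a missed eviction with reason `"unknown"` is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class ModelLoaded:
    name: str
    port: int
    pid: int


@dataclass(frozen=True)
class ModelEvicted:
    name: str
    reason: str


@dataclass(frozen=True)
class PortChanged:
    name: str
    old_port: int
    new_port: int


Route = Tuple[int, int]  # (port, pid)
Event = Union[ModelLoaded, ModelEvicted, PortChanged]


def reconcile(known: Dict[str, Route], polled: Dict[str, Route]) -> List[Event]:
    """Emit synthetic events for every difference between the two views."""
    events: List[Event] = []
    # Present in our view but gone from lemond: an eviction we never heard about.
    for name in sorted(known.keys() - polled.keys()):
        events.append(ModelEvicted(name, "unknown"))
    # Present in lemond but not in our view: a load we never heard about.
    for name in sorted(polled.keys() - known.keys()):
        port, pid = polled[name]
        events.append(ModelLoaded(name, port, pid))
    # Present in both: check for port drift.
    for name in sorted(known.keys() & polled.keys()):
        old_port, new_port = known[name][0], polled[name][0]
        if old_port != new_port:
            events.append(PortChanged(name, old_port, new_port))
    return events
```

Feeding these synthetic events through the same subscriber path as WebSocket events keeps a single code path for state changes.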

### Phasing (each phase is independently shippable and reversible)

- **Phase 0 (1 PR, this week, targets v0.3.1):** `fix(lemonade): port discovery after load`. `_await_ready` reads `backend_url` and updates `slot.port` + `Upstream.url`.
- **Phase 1 (2–3 PRs, 1 week):** Introduce `LemondAdapter` as a pass-through; consolidate scattered httpx callers; add TTL cache.
- **Phase 2 (2 PRs, 1–2 weeks):** Implement `/logs/stream` subscription + log-line regex parser. SlotManager subscribes to events; polling becomes safety net.
- **Phase 3 (1 PR, 3 days):** Pre-dispatch port validation via adapter. Delete the recovery branch from PR #392.
- **Phase 4 (2 PRs, 1 week):** hal0-side idle eviction with proactive `/v1/unload`. Warm-load on agent session start.
- **Phase 5 (1 PR, 3 days):** SlotManager derives UX state from event history for lemonade-backed slots.
- **Phase 6 (1 PR + monitoring):** Adapter SLO metrics + dashboard.
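
As an illustration of Phase 0's port read-back, the helper below pulls the port out of a `/v1/health`-style payload. The `all_models_loaded` and `backend_url` names appear in this PRD; the `model_name` key, the overall payload shape, and the function name `port_from_health` are assumptions.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlsplit


def port_from_health(health: Dict[str, Any], model_name: str) -> Optional[int]:
    """Return the port lemond reports for model_name, or None if it is not loaded."""
    for entry in health.get("all_models_loaded", []):
        if entry.get("model_name") == model_name:
            # urlsplit(...).port is None when the URL is empty or carries no port.
            return urlsplit(entry.get("backend_url", "")).port
    return None
```

After a successful load, the returned value would replace `slot.port` and be used to rebuild `Upstream.url`.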

### API contracts (external)

- `/v1/chat/completions`, `/v1/embeddings`, `/v1/rerank`, etc — unchanged. Same OpenAI-compat envelope. Same `503 slot.loading` + `Retry-After` semantics from PR #385.
- `/api/slots` — unchanged response shape. The `state` field still surfaces PULLING/STARTING/WARMING/READY/SERVING/IDLE/OFFLINE/ERROR, just computed differently underneath.
- New `/api/v1/adapter/events` SSE stream (phase 6) — exposes the `SlotEvent` stream to dashboard clients so the slot indicator dots react in real time.
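
A possible wire format for the phase 6 SSE stream is sketched below. Using the dataclass name as the SSE `event:` field and compact JSON as `data:` is an assumption; the PRD does not fix the frame layout, and `to_sse_frame` is a hypothetical helper.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelLoaded:
    name: str
    port: int
    pid: int


def to_sse_frame(event) -> str:
    """Encode one event dataclass as a Server-Sent Events frame."""
    payload = json.dumps(asdict(event), separators=(",", ":"))
    # A blank line terminates the frame, per the SSE format.
    return f"event: {type(event).__name__}\ndata: {payload}\n\n"
```

A dashboard client could then register one listener per event name to update the matching slot indicator.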

### Schema changes

- `state.json` semantics change in phase 5: it becomes a 1-hour-TTL snapshot for fast dashboard startup, not authoritative state. Older `state.json` files remain usable (the adapter will reconcile against lemond on startup).
- No database schema changes.
- No breaking changes to slot TOML config.

## Testing Decisions

### What makes a good test

- **Test external behavior, not implementation.** Assert on the contract surface: "given a slot evicted by lemond, the next /v1/chat/completions returns 200 within N seconds" — not "verify `_serving_enter` was called twice."
- **Captured `/v1/health` fixtures.** Lemond's response shape is the contract surface; check in real captured responses as test fixtures so phase 1's refactor is provably behavior-preserving.
- **Log-corpus tests for phase 2.** A captured corpus of real `/logs/stream` lines per recent lemond release (10.5, 10.6, 10.7-rc), tagged with expected `SlotEvent` outputs. The regex parser is locked to this corpus; new lemond releases require updating the corpus.
- **Property tests for the state-derivation logic in phase 5.** Given a random sequence of `SlotEvent`s, the derived UX state matches a reference implementation. Catches edge cases in event ordering.
- **No hand-rolled mocks of lemond's HTTP layer in unit tests.** Use `httpx.MockTransport` with realistic responses (the pattern already established in `tests/dispatcher/test_forward.py` and `tests/dispatcher/test_serving_integration.py`).
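
The log-corpus approach could look like the sketch below. The log-line formats here are invented placeholders, not real lemond output; the actual patterns would come from the captured corpus. `parse_line`, `CORPUS`, and `corpus_failures` are hypothetical names.

```python
import re
from typing import Dict, List, NamedTuple, Optional, Tuple


class Parsed(NamedTuple):
    kind: str
    fields: Dict[str, str]


# Assumption: placeholder patterns standing in for the real lemond log formats.
_PATTERNS = [
    ("ModelLoaded",
     re.compile(r"model loaded name=(?P<name>\S+) port=(?P<port>\d+) pid=(?P<pid>\d+)")),
    ("ModelEvicted",
     re.compile(r"model evicted name=(?P<name>\S+) reason=(?P<reason>\w+)")),
]


def parse_line(line: str) -> Optional[Parsed]:
    """Return the first matching event kind with its captured fields, else None."""
    for kind, pattern in _PATTERNS:
        match = pattern.search(line)
        if match:
            return Parsed(kind, match.groupdict())
    return None


# Each corpus entry pairs a captured line with the event it must produce.
CORPUS: List[Tuple[str, Optional[Parsed]]] = [
    ("[info] model loaded name=primary port=8001 pid=4242",
     Parsed("ModelLoaded", {"name": "primary", "port": "8001", "pid": "4242"})),
    ("[info] model evicted name=primary reason=idle",
     Parsed("ModelEvicted", {"name": "primary", "reason": "idle"})),
    ("[debug] heartbeat ok", None),  # unrelated lines must not produce events
]


def corpus_failures(corpus: List[Tuple[str, Optional[Parsed]]]) -> List[str]:
    """Return the lines whose parse result differs from the tagged expectation."""
    return [line for line, expected in corpus if parse_line(line) != expected]
```

Keeping negative examples in the corpus guards against patterns that match too eagerly.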

### Modules to be tested

- `LemondAdapter`: unit tests against `httpx.MockTransport` + captured fixtures. Test that `route_for()` returns fresh port info after a `ModelLoaded` event invalidates the cache.
- `LogStreamSubscriber`: regex parser tested against captured log corpus (one test per `SlotEvent` variant). Reconnection logic tested with a mock WS server.
- `PortValidator`: tested with a real ephemeral TCP listener (bind, validate alive, close, validate dead).
- `Dispatcher`: existing `tests/dispatcher/test_serving_integration.py` extended with adapter-event-driven scenarios. In phase 3, the recovery-branch tests added by PR #392 are deleted (the branch they cover no longer exists).
- `SlotManager`: existing `tests/slots/` extended with event-subscription tests. The `recover_evicted_slot()` tests are deleted in phase 3.
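
The `PortValidator` test shape (bind, validate alive, close, validate dead) can be sketched with the standard library alone. `is_port_alive`, `bind_probe_close_probe`, and the 250 ms timeout are hypothetical, and the TTL cache mentioned for `PortValidator` is omitted here.

```python
import asyncio
import contextlib
import socket
from typing import Tuple


async def is_port_alive(host: str, port: int, timeout_s: float = 0.25) -> bool:
    """Return True if a TCP connect to host:port completes within timeout_s."""
    try:
        _reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout_s
        )
    except (OSError, asyncio.TimeoutError):
        return False
    writer.close()
    with contextlib.suppress(OSError):
        await writer.wait_closed()
    return True


async def bind_probe_close_probe() -> Tuple[bool, bool]:
    """Probe an ephemeral listener while it is open and again after it closes."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: the OS picks a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    try:
        alive = await is_port_alive("127.0.0.1", port)
    finally:
        listener.close()
    dead = await is_port_alive("127.0.0.1", port)
    return alive, dead
```

A plain listening socket is enough here because the kernel completes the TCP handshake from the backlog without the test ever calling `accept`.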

### Prior art in the codebase

- `tests/dispatcher/test_forward.py` — `httpx.MockTransport` pattern for dispatcher unit tests.
- `tests/dispatcher/test_serving_integration.py` — `_RecordingSlotManager` pattern for stand-in SlotManager + asserting on lifecycle events.
- `tests/providers/test_lemonade.py` — captured fixture pattern for `LemonadeClient` responses.
- `tests/harness/` (δ-tier) — full lifecycle harness for installer → CLI → slot → uninstall. Adapter integration validated here at the system level.

### Test additions per phase

- Phase 0: 1 unit test asserting `slot.port` is updated to match `/v1/health.backend_url` after load.
- Phase 1: refactor preserves behavior — existing 215 dispatcher+slots tests stay green; adapter gets ~20 new unit tests against MockTransport.
- Phase 2: ~30 log-corpus regex tests + ~10 reconnection-behavior tests.
- Phase 3: pre-dispatch validator gets ~10 tests; the 6 recovery-branch tests from PR #392 are deleted.
- Phase 4: idle-policy tests (~15) + warm-load-on-session-start integration test.
- Phase 5: ~30 property tests on state derivation.
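
A rough sketch of what a phase 5 property test might check, using a seeded random generator from the standard library rather than a property-testing framework. The event-to-state mapping in `step` is an assumption for illustration, and the checks are algebraic properties of the fold rather than a comparison against a reference implementation.

```python
import random
from functools import reduce
from typing import Sequence

KINDS = ["ModelLoaded", "ModelEvicted", "ProcessCrashed", "LoadFailed", "PortChanged"]


def step(state: str, kind: str) -> str:
    # Assumption: a simplified mapping from event kind to UX state.
    if kind == "ModelLoaded":
        return "READY"
    if kind == "ModelEvicted":
        return "OFFLINE"
    if kind in ("ProcessCrashed", "LoadFailed"):
        return "ERROR"
    return state  # PortChanged moves the route, not the lifecycle state


def compute_from_events(history: Sequence[str], initial: str = "OFFLINE") -> str:
    """Derive state as a pure left fold over the event history."""
    return reduce(step, history, initial)


def check_properties(trials: int = 200, seed: int = 0) -> int:
    rng = random.Random(seed)
    for _ in range(trials):
        history = [rng.choice(KINDS) for _ in range(rng.randint(0, 20))]
        cut = rng.randint(0, len(history))
        # Replaying a suffix on top of the prefix's state equals a full replay.
        prefix_state = compute_from_events(history[:cut])
        assert compute_from_events(history[cut:], prefix_state) == compute_from_events(history)
        # Appending PortChanged never alters the derived state.
        assert compute_from_events(history + ["PortChanged"]) == compute_from_events(history)
    return trials
```

The prefix/suffix property is what makes snapshotting safe: a restored snapshot plus the events since then yields the same state as replaying everything.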

## Out of Scope

- **Replacing Lemonade entirely.** This was considered (go back to hal0-managed systemd llama-servers) and rejected — we'd lose FLM/NPU support, Kokoro voice, sdcpp image-gen, and we just finished migrating TO lemonade in v0.2.
- **Routing the data plane through lemond.** Considered and rejected: lemond is not designed to be a streaming reverse proxy under load. This is explicitly documented in `lemonade/client.py:5-6`.
- **A native model-event WebSocket API from upstream lemonade.** Would let us skip phase 2's log-parsing. Not on lemonade's roadmap as of 2026-05; revisit annually. If it ships, phase 2's log-parsing is replaced with a thin WS adapter; the rest of the plan is unchanged.
- **Changing the inference-path protocol.** The dispatcher still speaks OpenAI-compat. Streaming still uses SSE. The new architecture is invisible to clients.
- **Multi-host lemonade.** All of this assumes single-host lemonade running on `127.0.0.1`. Multi-host would be a separate ADR.
- **Cross-slot transactional guarantees.** If an agent chain needs primary + embed + rerank to all be live simultaneously, the warm-load helper in phase 4 makes a best-effort attempt but doesn't provide atomic guarantees. That's a v0.5 concern.

## Further Notes

### Timing: v0.3.1 vs v0.4 vs separate platform track

The user explicitly flagged this question as undecided. Three options:

1. **Phase 0 as v0.3.1 hotfix only.** Ship phase 0 this week, defer phases 1+ to v0.5. Lowest risk. Only fixes today's bug class.
2. **Phases 0–3 bundled as v0.4 "Reactive Lemonade Integration" theme.** ~3 weeks. v0.4 gets a coherent platform-reliability narrative. Right call if "platform reliability" resonates with the Strix Halo audience.
3. **Separate platform track.** Phase 0 → v0.3.1; phases 1–3 → v0.4.x point releases over the cycle. Decouples user-visible from infrastructural. Risk: platform work always loses to features in practice, so phases 4–6 may never happen.

**Recommended hybrid:**

| Phase | Release | Theme |
|---|---|---|
| Phase 0 | **v0.3.1** hotfix this week | Bug fix |
| Phases 1–3 | **v0.4** milestone (~3 weeks) | "Reactive Lemonade Integration" |
| Phases 4–6 | **v0.4.x** point releases | Platform polish |

The trap to watch: if v0.4 is already promised feature-heavy (more agents, MCP, dashboard), phases 1–3 should slip to v0.5 — don't let infrastructure block features.

### Spike questions before phase 2 starts

1. Does lemond's `/logs/stream` emit reliable load/unload events? 1-day spike to capture corpus and assess regex tractability.
2. WebSocket reconnection semantics — undocumented in lemonade today per memory `hal0_lemonade_ws_protocol`. Spike to characterize.
3. Upstream lemonade roadmap — is a native event API coming in 3 months? If yes, skip phase 2's log-parsing and wait.

### Cross-references

- PR #392 — silent-eviction recovery (today's fix). The `_recover_evicted_slot` helper added here is the proof-of-concept for phase 3's pre-dispatch validation.
- PR #385 — swap-window gate (complementary; stays in end-state).
- ADR-0008 — Lemonade integration boundary. This PRD extends it with the adapter pattern; a follow-up ADR-0015 (or similar) should record the architectural shift.
- Memory entries that informed this plan: `hal0_dispatcher_silent_eviction_recovery`, `hal0_lemonade_port_drift`, `hal0_lemonade_gotchas`, `hal0_lemonade_internals`, `hal0_lemonade_ws_protocol`, `hal0_lemonade_threads_deadlock`, `hal0_lemonade_unload_gpu_cleanup_hang`, `hal0_lemonade_whisper_runpath_bug`, `hal0_lemonade_flm_npu_install`, `hal0_lemonade_ctx_size_lives_in_config_json`, `hal0_lemonade_rocm_device_perms`, `hal0_lemonade_hf_cache_gotchas`, `hal0_slot_backend_change_state_drift`, `hal0_hsa_gfx_override_stale_after_rocm_bundle_upgrade`.
- Architectural doc (local): `/home/halo/Development/Projects/hal0/Developer Docs/hal0-lemonade-slot-management-end-state.md`.

### Success metrics

- `dispatch.upstream_dead_attempting_recover` → 0 in production (phase 3 deletes the branch; its absence is also the win).
- `dispatch.upstream_unavailable` (502 rate) < 0.1% of inference calls.
- Time-to-route-stable after eviction: today "tens of seconds" → target "under one second."
- Lemonade version upgrades affect exactly one hal0 file (the adapter).
- Agent providers (Hermes, pi-coder, external MCP) report zero "primary unreachable" in normal operation.

