36 changes: 22 additions & 14 deletions docs/internal/adr/0006-migrate-inference-to-lemonade.md
@@ -28,17 +28,21 @@ Lemonade is the sole inference backend in v0.2. The six per-modality Provider cl
### 2. Slot abstraction preserved at user-facing layer
Each hal0 slot (`primary`, `embed`, `embed-rerank`, `stt`, `tts`, `img`) remains a named, configured serving target with a chosen model + device. Runtime layer changes: slot = 1 Lemonade-loaded model rather than 1 systemd template instance + 1 container. `SlotManager.start(slot)` calls Lemonade load semantics; slot state derives from `/v1/health.loaded[]` by model_name. `hal0-slot@.service` template retires.
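
A minimal sketch of how slot state could be derived from `/v1/health.loaded[]` by `model_name`, as described above. The helper name `slot_states`, the slot-to-model mapping, and the sample payload are hypothetical; only the `loaded[]` list and its `model_name` key come from this ADR.

```python
from typing import Any, Dict


def slot_states(slot_models: Dict[str, str], health: Dict[str, Any]) -> Dict[str, bool]:
    """Return {slot: is_loaded} by matching each slot's model against health["loaded"]."""
    loaded_names = {
        entry.get("model_name")
        for entry in health.get("loaded", [])
        if isinstance(entry, dict)
    }
    return {slot: model in loaded_names for slot, model in slot_models.items()}


# Hypothetical slot configuration and a hand-written health payload.
slots = {"primary": "hermes-4-14b", "embed": "example-embedding-model"}
health = {"loaded": [{"model_name": "hermes-4-14b"}]}
print(slot_states(slots, health))
```

Treating a missing `loaded` key as "nothing loaded" keeps the caller simple when the health response is empty or partial.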

### 3. Drive method: HTTP-first
hal0 talks to Lemonade via `/v1/load`, `/v1/unload`, `/v1/health`, `/v1/pull`, `/v1/chat/completions`, `/v1/embeddings`, `/v1/reranking`, `/v1/audio/transcriptions`, `/v1/audio/speech`, `/v1/images/generations`. `LemonadeClient` (`src/hal0/lemonade/client.py`) wraps these. CLI subprocess (`lemonade load X`) is a bootstrap fallback in early PRs until `/v1/load` schema is reverse-engineered (research handoff pending).
### 3. Drive method: HTTP only
hal0 talks to Lemonade via `/v1/load`, `/v1/unload`, `/v1/health`, `/v1/pull`, `/v1/chat/completions`, `/v1/embeddings`, `/v1/reranking`, `/v1/audio/transcriptions`, `/v1/audio/speech`, `/v1/images/generations`. `LemonadeClient` (`src/hal0/lemonade/client.py`) wraps these. Healthcheck uses unversioned `/live` (zero-work, no auth required).

**`/v1/load` schema:** only `model_name` (string) is required. Optional: `recipe`, `ctx_size`, `llamacpp_backend`, `llamacpp_args`, etc. (See memory `hal0_lemonade_v1_load_schema`.) The spike's "type must be string, but is null" error was caused by a malformed body, not a missing field: nlohmann::json throws this type error when a null value is converted to a string. The CLI fallback is NOT needed.
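
As an illustration of the schema note above, the sketch below builds a `/v1/load` request body that always carries `model_name` and drops any optional field set to `None`, so no JSON null reaches the server. The function name `build_load_payload` and the example `ctx_size` value are assumptions for illustration, not part of `LemonadeClient`.

```python
import json
from typing import Any, Dict


def build_load_payload(model_name: str, **optional: Any) -> Dict[str, Any]:
    """Build a /v1/load body: model_name is required, optional fields are passed through."""
    if not isinstance(model_name, str) or not model_name:
        raise ValueError("model_name must be a non-empty string")
    payload: Dict[str, Any] = {"model_name": model_name}
    # Omit unset options entirely instead of sending them as null.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


# Illustrative values only; recipe=None is dropped from the serialized body.
body = json.dumps(build_load_payload("hermes-4-14b", ctx_size=8192, recipe=None))
print(body)
```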

### 4. Model registration via hal0-customized `server_models.json`
At install time hal0 generates `server_models.json` from `/var/lib/hal0/registry/registry.toml` and writes it into Lemonade's resources directory. Curated catalog with explicit type metadata per entry (llm/embedding/reranking/transcription/image/tts). Runtime user adds go via `POST /v1/pull` with `user.*` namespace + type. Spike confirmed Lemonade's bundled `server_models.json` does not include hal0's curated picks (e.g. `hermes-4-14b`, `qwen3-coder-next-reap-40b-a3b`).

### 5. Process supervision: containerized lemond in systemd
`/etc/systemd/system/lemond.service` runs `podman/docker run --device=/dev/dri --device=/dev/accel/accel0` with `Restart=on-failure RestartSec=5s`. Boot-enabled (always-running). hal0-api `Wants=lemond.service` (soft dep). This combines systemd's standard supervision (Restart, journal, watchdog) with container isolation hal0 already uses.
### 5. Process supervision: AMD's embeddable tarball + bare systemd unit
`/etc/systemd/system/lemond.service` runs `lemond /opt/lemonade --port 9100` directly. Hardened with `NoNewPrivileges=yes`, `ProtectSystem=strict`, `ProtectHome=yes`, `PrivateTmp=yes`, `RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX`. `Restart=on-failure RestartSec=5s`. Boot-enabled. hal0-api `Wants=lemond.service` (soft dep). This combines systemd's standard supervision (Restart, journal, watchdog) with systemd's namespace hardening — no container layer needed.

**Revised from earlier draft** (which proposed a hal0-published container image): research surfaced that Lemonade already ships an `embeddable` cmake target producing a portable lemond+lemonade tarball. AMD's tarball is the official redistributable artifact. Building a container around it duplicates work AMD already did and reintroduces the docker-build apparmor pain hal0 has on LXC (memory `hal0_docker_build_lxc_apparmor`).

### 6. Container source: hal0-published wrapper image
`ghcr.io/hal0ai/hal0-lemond:vX.Y.Z`. Base + lemond tarball + `apt install unzip libxrt-npu2` (system deps Lemonade requires) + pre-pulled backends (`llamacpp:rocm`, `flm:npu`, `whispercpp:cpu`, `sdcpp:rocm`, `kokoro:cpu`). CI build + cosign sign. Replaces six per-modality toolbox images with one multi-backend image. (AMD does not currently publish an official docker image suitable for hal0's needs — research handoff confirms.)
### 6. Bundling: AMD's embeddable tarball, hal0 version-pins it
install.sh downloads `lemonade-embeddable-<VERSION>-ubuntu-x64.tar.gz` from `github.com/lemonade-sdk/lemonade/releases`, sha256-verifies, extracts to `/opt/lemonade`. Then `apt install -y unzip libxrt-npu2` (Lemonade's system deps). Then `lemonade backends install llamacpp:rocm flm:npu whispercpp:cpu sdcpp:rocm kokoro:cpu` at first boot to fetch backend binaries into `/opt/lemonade/bin/`. `manifest.json` carries `lemonade: { tarball_url, sha256, version }`. Bumps gated by hal0 releases.
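
The verify-before-extract step could look roughly like the following sketch. It is written in Python for illustration even though install.sh is a shell script; the function names are hypothetical, and the expected digest is assumed to be the `sha256` value carried in `manifest.json`.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large tarballs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarball(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the file's digest differs from the pinned one."""
    actual = sha256_of(path)
    if not hmac.compare_digest(actual, expected_sha256.strip().lower()):
        raise ValueError(f"sha256 mismatch for {path}: got {actual}")
```

Calling `verify_tarball` before extraction means a corrupted or swapped download fails the install instead of landing in `/opt/lemonade`.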

### 7. `SlotConfig.backend` → `SlotConfig.device`
Schema refactor. Old `backend` field mixed providers and backends (`vulkan|rocm|flm|moonshine|kokoro|cpu`). New `device` field is hardware-preference only: `gpu-rocm | gpu-vulkan | cpu | npu`. Default `gpu-rocm`. `LemonadeProvider` maps `device` to Lemonade's `recipe:backend` pair internally. `capabilities.toml` schema_version bumps to 2; auto-migration preserves user choices.
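
One way the `device` to `recipe:backend` mapping might be expressed is a small lookup table, sketched below for LLM slots only. The table contents are assumptions: `llamacpp:rocm` and `flm:npu` appear in this ADR's backend list, while the `vulkan` and `cpu` pairings for llamacpp are guesses, and `resolve_backend` is a hypothetical helper rather than the actual `LemonadeProvider` code.

```python
from typing import Dict, Optional, Tuple

# Assumed mapping for LLM slots; only llamacpp:rocm and flm:npu are named in the ADR.
DEVICE_TO_LLM_BACKEND: Dict[str, Tuple[str, str]] = {
    "gpu-rocm": ("llamacpp", "rocm"),
    "gpu-vulkan": ("llamacpp", "vulkan"),
    "cpu": ("llamacpp", "cpu"),
    "npu": ("flm", "npu"),
}
DEFAULT_DEVICE = "gpu-rocm"  # default stated in the ADR


def resolve_backend(device: Optional[str] = None) -> Tuple[str, str]:
    """Map a SlotConfig.device value to a (recipe, backend) pair, falling back to the default."""
    chosen = device or DEFAULT_DEVICE
    try:
        return DEVICE_TO_LLM_BACKEND[chosen]
    except KeyError:
        raise ValueError(f"unknown device: {chosen!r}") from None


recipe, backend = resolve_backend("npu")
print(f"{recipe}:{backend}")
```

Keeping the mapping in one table leaves the user-facing `device` enum free of provider names, which is the point of the refactor.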
@@ -49,8 +53,8 @@ v0.2 ships minimal UI patch (retarget API client to new endpoints; preserve all
### 9. Metrics shim via `/v1/stats` polling
Spike confirmed Lemonade's bundled llama-server returns 501 on `/metrics`. Backend_url scrape strategy (PR #124 path) does not survive the migration. hal0 builds a metrics aggregator (`src/hal0/lemonade/metrics.py`) that polls `/v1/stats` (last-request perf) and `/v1/health` (model state) per slot, exposes Prometheus surface for the dashboard.
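
The exposition half of such an aggregator could be sketched as follows: given the latest numeric values polled per slot, render them in Prometheus text format. The polling itself is omitted, and the `hal0_slot_` metric prefix and the `tokens_per_second` field name are assumptions, not confirmed `/v1/stats` output or the real `metrics.py`.

```python
from typing import List, Mapping


def render_prometheus(per_slot_stats: Mapping[str, Mapping[str, object]]) -> str:
    """Render {slot: {metric: value}} as Prometheus text lines, skipping non-numeric values."""
    lines: List[str] = []
    for slot in sorted(per_slot_stats):
        stats = per_slot_stats[slot]
        for key in sorted(stats):
            value = stats[key]
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            lines.append(f'hal0_slot_{key}{{slot="{slot}"}} {float(value)}')
    return "".join(line + "\n" for line in lines)


# Hypothetical last-request stats for one slot.
print(render_prometheus({"primary": {"tokens_per_second": 42.5}}), end="")
```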

### 10. Version pinning + bundling
`manifest.json` schema v2 adds `lemonade: { image, digest, version }`. Updates gated behind explicit hal0 release. install.sh pulls + cosign-verifies. Lemonade ships breaking changes weekly; pinning is non-negotiable.
### 10. Version pinning
Folded into §6. `manifest.json` schema v2 adds `lemonade: { tarball_url, sha256, version }`. Updates gated behind explicit hal0 release. install.sh sha256-verifies tarball before extraction. Lemonade ships breaking changes weekly; pinning is non-negotiable.

### 11. Rollback
Downgrade-only via existing update mechanism (`hal0 update --version v0.1.x`). v0.2 deletes old Provider code cleanly — no long-term feature-flag plumbing. Schema_version=2 → 1 downgrade preserves a `.v1.bak` of capabilities.toml on upgrade.
@@ -79,14 +83,18 @@ Each PR adds capability behind `HAL0_BACKEND=lemonade` env var while v0.1.x code
- Serialized load queue can deadlock under stuck load → hal0-side `/v1/load` timeout
- Weekly Lemonade breaking releases → pin discipline (decision §10)

## Open questions deferred to research handoff
## Resolved by research handoff 2026-05-22

Deep-dive at `docs/internal/lemonade-repo-deep-dive-2026-05-22.md`. Memories: `hal0_lemonade_v1_load_schema`, `hal0_lemonade_ws_protocol`, `hal0_lemonade_omni_pattern`, `hal0_lemonade_internals`.

- `/v1/load` actual request schema (CLI works; direct curl rejected)
- Lemonade WS protocol shape (`/logs/stream` + others) for v0.2.1 UI
- Omni recipe pattern — may inform `capabilities.toml` design
- Reserved-args extension hooks (server_models.json field set)
- `/v1/load` schema → only `model_name` required (drove §3 revision)
- WS protocol → `/logs/stream` is logs-only; no model-load-state event. v0.2.1 UI polls `/v1/health` or parses log lines
- Omni recipe → `collection.omni` is manifest of pre-registered models; `LMX-Omni-52B-Halo` is Strix-Halo-blessed
- AMD's embeddable cmake target → drove §5/§6 revision away from custom container
- Reserved-args list → hardcoded in router; not extensible via config

Research handoff: `/tmp/hal0-lemonade-research-handoff.md`. Findings land in `docs/internal/lemonade-repo-deep-dive-2026-05-22.md`.
**Still open (decide post-v0.2):**
- Omni vs hal0 capability-orchestrator interop strategy — coexist in v0.2; revisit pre-v0.3

## Related

21 changes: 11 additions & 10 deletions docs/internal/lemonade-migration-plan.md
@@ -34,7 +34,7 @@ Migration still net-positive on the surviving drivers, but the perf narrative ne
| 4 | vLLM-ROCm | Out of scope this cycle — re-evaluate post-v0.2 |
| 5 | Release vehicle | v0.2 (combined with Agents per parallel session work) |
| 6 | UI rework | Punt to v0.2.1 pending web-ui.md research |
| 7 | Bundling + pin | manifest.json schema v2 adds `lemonade: { image, digest, version }` |
| 7 | Bundling + pin | manifest.json schema v2 adds `lemonade: { tarball_url, sha256, version }` |
| 8 | Rollback | Downgrade-only via existing update mechanism; v0.2 deletes old Provider code cleanly |
| 9 | ComfyUI loss | Accepted: sdcpp covers 90%; documented in release notes |

@@ -43,10 +43,10 @@ Migration still net-positive on the surviving drivers, but the perf narrative ne
| # | Decision | Choice |
|---|---|---|
| 10 | Slot abstraction | Preserve. Each hal0 slot = 1 Lemonade-loaded model. SlotManager retargets to Lemonade load/unload. Per-slot `hal0-slot@.service` retires. |
| 11 | Drive method | HTTP-first /v1/load with reverse-engineered schema (research lands it). `lemonade` CLI subprocess as bootstrap fallback. |
| 11 | Drive method | HTTP-only. `/v1/load` schema: `{model_name}` (only required field, research-resolved). `LemonadeClient` wraps endpoints. Healthcheck = `/live`. |
| 12 | Model registration | Generate hal0-customized `server_models.json` from `registry.toml` at install. Runtime user adds via /v1/pull `user.*` namespace. |
| 13 | Process supervision | Containerized lemond in systemd. `/etc/systemd/system/lemond.service` wraps `podman/docker run` with `--device` passthrough. Boot-enabled. |
| 14 | Container source | `ghcr.io/hal0ai/hal0-lemond:vX.Y.Z` — hal0-published wrapper. Base + lemond tarball + `unzip` + `libxrt-npu2` + pre-pulled backends. CI builds + cosign-signs. |
| 13 | Process supervision | AMD embeddable tarball + bare systemd unit. `lemond.service` runs `lemond /opt/lemonade --port 9100` directly. Hardened: NoNewPrivileges, ProtectSystem=strict, ProtectHome, PrivateTmp, RestrictAddressFamilies. Boot-enabled. |
| 14 | Bundling | AMD's `lemonade-embeddable-<VERSION>-ubuntu-x64.tar.gz`. install.sh sha256-verifies, extracts to /opt/lemonade, apt-installs unzip+libxrt-npu2, runs `lemonade backends install` at first boot. **No custom hal0 image.** |
| 15 | `SlotConfig.backend` | Refactor → `SlotConfig.device` (enum `gpu-rocm \| gpu-vulkan \| cpu \| npu`). Default `gpu-rocm`. Schema_version bump + migration. |
| 16 | Metrics shim | `/v1/stats` polling (backend_url /metrics returned 501 in spike). hal0-side metrics aggregator polls `/v1/stats` + `/v1/health` per slot. |
| 17 | Idle-eviction driver | hal0-owned external. SlotManager polls `/v1/health.loaded[].last_use`, calls `POST /v1/unload` when stale per existing 300s policy. |
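
The idle-eviction check in decision 17 could be sketched as a pure function over the health payload. This assumes `last_use` is a Unix timestamp in seconds, which the plan does not state; `stale_models` is a hypothetical helper, and the caller would issue `POST /v1/unload` for each returned name.

```python
import time
from typing import Any, Dict, List, Optional

IDLE_TIMEOUT_S = 300  # existing idle policy from the plan


def stale_models(
    health: Dict[str, Any],
    now: Optional[float] = None,
    idle_timeout_s: float = IDLE_TIMEOUT_S,
) -> List[str]:
    """Return model names whose last_use is older than the idle timeout."""
    current = time.time() if now is None else now
    stale: List[str] = []
    for entry in health.get("loaded", []):
        name = entry.get("model_name")
        last_use = entry.get("last_use")  # assumed: epoch seconds
        if name is None or not isinstance(last_use, (int, float)):
            continue  # skip entries that cannot be judged
        if current - last_use > idle_timeout_s:
            stale.append(name)
    return stale
```

Passing `now` explicitly keeps the function deterministic in tests; in production the default uses the wall clock.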
@@ -99,15 +99,16 @@ ADRs 0007+ for multi-user Cognee, prior plan, are renumbered to 0008+ as needed.

**install.sh changes:**
- `apt install -y unzip libxrt-npu2` (system deps Lemonade requires)
- Pull `ghcr.io/hal0ai/hal0-lemond:vX.Y.Z` image
- Download embeddable tarball, sha256-verify, extract to /opt/lemonade
- Write `/etc/systemd/system/lemond.service`
- `systemctl enable --now lemond`
- Generate + install `server_models.json` from registry
- Stop installing per-modality docker images

**New CI:**
- `hal0-lemond` image build workflow (replaces six toolbox workflows)
- Cosign signing per release
**CI changes:**
- Retire six toolbox build workflows
- NO new hal0 image to build (AMD ships the embeddable tarball)
- Release workflow adds tarball sha256 + version pin to manifest.json

---

@@ -116,8 +117,8 @@ ADRs 0007+ for multi-user Cognee, prior plan, are renumbered to 0008+ as needed.
1. **ADR-0006 + ADR-0007 drafts** — written from this session's decisions; tightened post-research
2. **`LemonadeClient` skeleton** — HTTP client with CLI fallback, type stubs only
3. **manifest.json schema v2** — `lemonade: {...}` field, validation, `HAL0_BACKEND=lemonade` flag plumbing
4. **`hal0-lemond` image** — Dockerfile + GH workflow + cosign + publish to ghcr.io
5. **`lemond.service`** — install.sh writes systemd unit + container-run command + device passthrough
4. **install.sh tarball fetch** — download embeddable tarball, sha256-verify, extract to /opt/lemonade, apt-install unzip+libxrt-npu2
5. **`lemond.service`** — install.sh writes hardened systemd unit running `lemond` directly (no container wrapper)
6. **`server_models_gen.py`** — registry.toml → server_models.json converter + install hook
7. **`SlotConfig.device`** — schema refactor + capabilities.toml schema_version=2 migration
8. **`LemonadeProvider`** — concrete provider implementing Provider ABC, drives `LemonadeClient`