From 2390215f48f1c942b0bf43ba32ed8eb308a8a04b Mon Sep 17 00:00:00 2001
From: Alexander
Date: Fri, 22 May 2026 14:01:36 -0400
Subject: [PATCH] docs(lemonade): repo deep-dive + ADR/plan tightenings from research
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Lands the upstream-codebase research handoff that resolved several open
questions on ADR-0006, plus the ADR + migration-plan tightenings those
answers produced.

## What lands

- `docs/internal/lemonade-repo-deep-dive-2026-05-22.md` (276L) — read of
  `lemonade-sdk/lemonade@7af26f75` (HEAD of main, 2026-05-21): dev
  internals, full API surface, embeddable build, omni recipe, WS
  protocol. The "what's actually in there" companion to the spike's
  "what happens when we run it" findings.

## ADR-0006 tightenings (from deep-dive)

- §3 (Drive method): "HTTP-first with CLI fallback" → "HTTP only". The
  spike's `/v1/load` "type must be string but is null" failure was a
  malformed body (nlohmann::json[] throws on null access), not a
  missing field. CLI fallback isn't needed.
- §3 (Schema): documents that only `model_name` is required for
  `/v1/load`; everything else is optional.
- §5/§6 (Process supervision + bundling): "containerised lemond" →
  "AMD's embeddable tarball + bare systemd unit". Lemonade ships an
  `embeddable` cmake target producing a portable lemond+lemonade
  tarball — that's the official redistributable artifact. Building a
  container around it duplicates AMD's work and reintroduces the
  docker-build apparmor pain hal0 has on LXC.

## migration-plan tightenings

- Decision #7 (Bundling+pin): `{image, digest, version}` →
  `{tarball_url, sha256, version}` to match the embeddable
  distribution.
- Decision #11 (Drive method): mirrors ADR §3 — HTTP-only with the
  resolved schema, no CLI bootstrap fallback.

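For reviewers, a minimal sketch of the body-building rule implied by the `/v1/load` diagnosis: `model_name` is always a string, and unset options are left out instead of being sent as JSON null. `build_load_body` is a hypothetical helper written for illustration only; it is not the `LemonadeClient` code from PR #137, and the option names are simply passed through as given.

```python
import json
from typing import Any, Optional


def build_load_body(model_name: str, **options: Optional[Any]) -> str:
    """Serialize a /v1/load request body (hypothetical helper, illustration only).

    Options whose value is None are omitted rather than serialized as null,
    because the spike failure is attributed to a null or missing value.
    """
    if not isinstance(model_name, str) or not model_name:
        raise ValueError("model_name must be a non-empty string")
    body = {"model_name": model_name}
    # Drop unset options so that no key is ever sent as JSON null.
    body.update({key: value for key, value in options.items() if value is not None})
    return json.dumps(body)


# Example: the unset llamacpp_backend is dropped from the payload.
print(build_load_body("Qwen3-0.6B-GGUF", ctx_size=4096, llamacpp_backend=None))
```

Rejecting a non-string `model_name` on the hal0 side keeps the failure local and readable, instead of surfacing as a JSON type error from the server.
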
## Why a separate PR from #137 (the client skeleton)

PR #137 already implements what these docs describe — keeping the
docs/code split on a per-PR basis makes the migration story easier to
read in `git log`: "client conforms to ADR-0006 §3 (HTTP-only)" reads
cleanly because the ADR §3 it conforms to is the post-tightening
version, landed alongside.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../adr/0006-migrate-inference-to-lemonade.md |  36 ++-
 docs/internal/lemonade-migration-plan.md      |  21 +-
 .../lemonade-repo-deep-dive-2026-05-22.md     | 276 ++++++++++++++++++
 3 files changed, 309 insertions(+), 24 deletions(-)
 create mode 100644 docs/internal/lemonade-repo-deep-dive-2026-05-22.md

diff --git a/docs/internal/adr/0006-migrate-inference-to-lemonade.md b/docs/internal/adr/0006-migrate-inference-to-lemonade.md
index 46a9e559..f95cc466 100644
--- a/docs/internal/adr/0006-migrate-inference-to-lemonade.md
+++ b/docs/internal/adr/0006-migrate-inference-to-lemonade.md
@@ -28,17 +28,21 @@ Lemonade is the sole inference backend in v0.2. The six per-modality Provider cl
 ### 2. Slot abstraction preserved at user-facing layer
 Each hal0 slot (`primary`, `embed`, `embed-rerank`, `stt`, `tts`, `img`) remains a named, configured serving target with a chosen model + device. Runtime layer changes: slot = 1 Lemonade-loaded model rather than 1 systemd template instance + 1 container. `SlotManager.start(slot)` calls Lemonade load semantics; slot state derives from `/v1/health.loaded[]` by model_name. `hal0-slot@.service` template retires.
 
-### 3. Drive method: HTTP-first
-hal0 talks to Lemonade via `/v1/load`, `/v1/unload`, `/v1/health`, `/v1/pull`, `/v1/chat/completions`, `/v1/embeddings`, `/v1/reranking`, `/v1/audio/transcriptions`, `/v1/audio/speech`, `/v1/images/generations`. `LemonadeClient` (`src/hal0/lemonade/client.py`) wraps these. CLI subprocess (`lemonade load X`) is a bootstrap fallback in early PRs until `/v1/load` schema is reverse-engineered (research handoff pending).
+### 3. Drive method: HTTP only
+hal0 talks to Lemonade via `/v1/load`, `/v1/unload`, `/v1/health`, `/v1/pull`, `/v1/chat/completions`, `/v1/embeddings`, `/v1/reranking`, `/v1/audio/transcriptions`, `/v1/audio/speech`, `/v1/images/generations`. `LemonadeClient` (`src/hal0/lemonade/client.py`) wraps these. Healthcheck uses unversioned `/live` (zero-work, no auth required).
+
+**`/v1/load` schema:** only `model_name` (string) is required. Optional: `recipe`, `ctx_size`, `llamacpp_backend`, `llamacpp_args`, etc. (See memory `hal0_lemonade_v1_load_schema`.) The spike's "type must be string, but is null" error was a malformed body, not a missing field — `nlohmann::json[]` throws on null access. CLI fallback NOT needed.
 
 ### 4. Model registration via hal0-customized `server_models.json`
 At install time hal0 generates `server_models.json` from `/var/lib/hal0/registry/registry.toml` and writes it into Lemonade's resources directory. Curated catalog with explicit type metadata per entry (llm/embedding/reranking/transcription/image/tts). Runtime user adds go via `POST /v1/pull` with `user.*` namespace + type. Spike confirmed Lemonade's bundled `server_models.json` does not include hal0's curated picks (e.g. `hermes-4-14b`, `qwen3-coder-next-reap-40b-a3b`).
 
-### 5. Process supervision: containerized lemond in systemd
-`/etc/systemd/system/lemond.service` runs `podman/docker run --device=/dev/dri --device=/dev/accel/accel0` with `Restart=on-failure RestartSec=5s`. Boot-enabled (always-running). hal0-api `Wants=lemond.service` (soft dep). This combines systemd's standard supervision (Restart, journal, watchdog) with container isolation hal0 already uses.
+### 5. Process supervision: AMD's embeddable tarball + bare systemd unit
+`/etc/systemd/system/lemond.service` runs `lemond /opt/lemonade --port 9100` directly. Hardened with `NoNewPrivileges=yes`, `ProtectSystem=strict`, `ProtectHome=yes`, `PrivateTmp=yes`, `RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX`. `Restart=on-failure RestartSec=5s`. Boot-enabled. hal0-api `Wants=lemond.service` (soft dep). This combines systemd's standard supervision (Restart, journal, watchdog) with systemd's namespace hardening — no container layer needed.
+
+**Revised from earlier draft** (which proposed a hal0-published container image): research surfaced that Lemonade already ships an `embeddable` cmake target producing a portable lemond+lemonade tarball. AMD's tarball is the official redistributable artifact. Building a container around it duplicates work AMD already did and reintroduces the docker-build apparmor pain hal0 has on LXC (memory `hal0_docker_build_lxc_apparmor`).
 
-### 6. Container source: hal0-published wrapper image
-`ghcr.io/hal0ai/hal0-lemond:vX.Y.Z`. Base + lemond tarball + `apt install unzip libxrt-npu2` (system deps Lemonade requires) + pre-pulled backends (`llamacpp:rocm`, `flm:npu`, `whispercpp:cpu`, `sdcpp:rocm`, `kokoro:cpu`). CI build + cosign sign. Replaces six per-modality toolbox images with one multi-backend image. (AMD does not currently publish an official docker image suitable for hal0's needs — research handoff confirms.)
+### 6. Bundling: AMD's embeddable tarball, hal0 version-pins it
+install.sh downloads `lemonade-embeddable--ubuntu-x64.tar.gz` from `github.com/lemonade-sdk/lemonade/releases`, sha256-verifies, extracts to `/opt/lemonade`. Then `apt install -y unzip libxrt-npu2` (Lemonade's system deps). Then `lemonade backends install llamacpp:rocm flm:npu whispercpp:cpu sdcpp:rocm kokoro:cpu` at first boot to fetch backend binaries into `/opt/lemonade/bin/`. `manifest.json` carries `lemonade: { tarball_url, sha256, version }`. Bumps gated by hal0 releases.
 
 ### 7. `SlotConfig.backend` → `SlotConfig.device`
 Schema refactor. Old `backend` field mixed providers and backends (`vulkan|rocm|flm|moonshine|kokoro|cpu`). New `device` field is hardware-preference only: `gpu-rocm | gpu-vulkan | cpu | npu`. Default `gpu-rocm`. `LemonadeProvider` maps `device` to Lemonade's `recipe:backend` pair internally. `capabilities.toml` schema_version bumps to 2; auto-migration preserves user choices.
@@ -49,8 +53,8 @@ v0.2 ships minimal UI patch (retarget API client to new endpoints; preserve all
 ### 9. Metrics shim via `/v1/stats` polling
 Spike confirmed Lemonade's bundled llama-server returns 501 on `/metrics`. Backend_url scrape strategy (PR #124 path) does not survive the migration. hal0 builds a metrics aggregator (`src/hal0/lemonade/metrics.py`) that polls `/v1/stats` (last-request perf) and `/v1/health` (model state) per slot, exposes Prometheus surface for the dashboard.
 
-### 10. Version pinning + bundling
-`manifest.json` schema v2 adds `lemonade: { image, digest, version }`. Updates gated behind explicit hal0 release. install.sh pulls + cosign-verifies. Lemonade ships breaking changes weekly; pinning is non-negotiable.
+### 10. Version pinning
+Folded into §6. `manifest.json` schema v2 adds `lemonade: { tarball_url, sha256, version }`. Updates gated behind explicit hal0 release. install.sh sha256-verifies tarball before extraction. Lemonade ships breaking changes weekly; pinning is non-negotiable.
 
 ### 11. Rollback
 Downgrade-only via existing update mechanism (`hal0 update --version v0.1.x`). v0.2 deletes old Provider code cleanly — no long-term feature-flag plumbing. Schema_version=2 → 1 downgrade preserves a `.v1.bak` of capabilities.toml on upgrade.
@@ -79,14 +83,18 @@ Each PR adds capability behind `HAL0_BACKEND=lemonade` env var while v0.1.x code
 - Serialized load queue can deadlock under stuck load → hal0-side `/v1/load` timeout
 - Weekly Lemonade breaking releases → pin discipline (decision §10)
 
-## Open questions deferred to research handoff
+## Resolved by research handoff 2026-05-22
+
+Deep-dive at `docs/internal/lemonade-repo-deep-dive-2026-05-22.md`. Memories: `hal0_lemonade_v1_load_schema`, `hal0_lemonade_ws_protocol`, `hal0_lemonade_omni_pattern`, `hal0_lemonade_internals`.
 
-- `/v1/load` actual request schema (CLI works; direct curl rejected)
-- Lemonade WS protocol shape (`/logs/stream` + others) for v0.2.1 UI
-- Omni recipe pattern — may inform `capabilities.toml` design
-- Reserved-args extension hooks (server_models.json field set)
+- `/v1/load` schema → only `model_name` required (drove §3 revision)
+- WS protocol → `/logs/stream` is logs-only; no model-load-state event. v0.2.1 UI polls `/v1/health` or parses log lines
+- Omni recipe → `collection.omni` is manifest of pre-registered models; `LMX-Omni-52B-Halo` is Strix-Halo-blessed
+- AMD's embeddable cmake target → drove §5/§6 revision away from custom container
+- Reserved-args list → hardcoded in router; not extensible via config
 
-Research handoff: `/tmp/hal0-lemonade-research-handoff.md`. Findings land in `docs/internal/lemonade-repo-deep-dive-2026-05-22.md`.
+**Still open (decide post-v0.2):**
+- Omni vs hal0 capability-orchestrator interop strategy — coexist in v0.2; revisit pre-v0.3
 
 ## Related
 
diff --git a/docs/internal/lemonade-migration-plan.md b/docs/internal/lemonade-migration-plan.md
index 89eb4186..983883ea 100644
--- a/docs/internal/lemonade-migration-plan.md
+++ b/docs/internal/lemonade-migration-plan.md
@@ -34,7 +34,7 @@ Migration still net-positive on the surviving drivers, but the perf narrative ne
 | 4 | vLLM-ROCm | Out of scope this cycle — re-evaluate post-v0.2 |
 | 5 | Release vehicle | v0.2 (combined with Agents per parallel session work) |
 | 6 | UI rework | Punt to v0.2.1 pending web-ui.md research |
-| 7 | Bundling + pin | manifest.json schema v2 adds `lemonade: { image, digest, version }` |
+| 7 | Bundling + pin | manifest.json schema v2 adds `lemonade: { tarball_url, sha256, version }` |
 | 8 | Rollback | Downgrade-only via existing update mechanism; v0.2 deletes old Provider code cleanly |
 | 9 | ComfyUI loss | Accepted — sdpp covers 90%; release-notes documented |
 
@@ -43,10 +43,10 @@ Migration still net-positive on the surviving drivers, but the perf narrative ne
 | # | Decision | Choice |
 |---|---|---|
 | 10 | Slot abstraction | Preserve. Each hal0 slot = 1 Lemonade-loaded model. SlotManager retargets to Lemonade load/unload. Per-slot `hal0-slot@.service` retires. |
-| 11 | Drive method | HTTP-first /v1/load with reverse-engineered schema (research lands it). `lemonade` CLI subprocess as bootstrap fallback. |
+| 11 | Drive method | HTTP-only. `/v1/load` schema: `{model_name}` (only required field, research-resolved). `LemonadeClient` wraps endpoints. Healthcheck = `/live`. |
 | 12 | Model registration | Generate hal0-customized `server_models.json` from `registry.toml` at install. Runtime user adds via /v1/pull `user.*` namespace. |
-| 13 | Process supervision | Containerized lemond in systemd. `/etc/systemd/system/lemond.service` wraps `podman/docker run` with `--device` passthrough. Boot-enabled. |
-| 14 | Container source | `ghcr.io/hal0ai/hal0-lemond:vX.Y.Z` — hal0-published wrapper. Base + lemond tarball + `unzip` + `libxrt-npu2` + pre-pulled backends. CI builds + cosign-signs. |
+| 13 | Process supervision | AMD embeddable tarball + bare systemd unit. `lemond.service` runs `lemond /opt/lemonade --port 9100` directly. Hardened: NoNewPrivileges, ProtectSystem=strict, ProtectHome, PrivateTmp, RestrictAddressFamilies. Boot-enabled. |
+| 14 | Bundling | AMD's `lemonade-embeddable--ubuntu-x64.tar.gz`. install.sh sha256-verifies, extracts to /opt/lemonade, apt-installs unzip+libxrt-npu2, runs `lemonade backends install` at first boot. **No custom hal0 image.** |
 | 15 | `SlotConfig.backend` | Refactor → `SlotConfig.device` (enum `gpu-rocm \| gpu-vulkan \| cpu \| npu`). Default `gpu-rocm`. Schema_version bump + migration. |
 | 16 | Metrics shim | `/v1/stats` polling (backend_url /metrics returned 501 in spike). hal0-side metrics aggregator polls `/v1/stats` + `/v1/health` per slot. |
 | 17 | Idle-eviction driver | hal0-owned external. SlotManager polls `/v1/health.loaded[].last_use`, calls `POST /v1/unload` when stale per existing 300s policy. |
@@ -99,15 +99,16 @@ ADRs 0007+ for multi-user Cognee, prior plan, are renumbered to 0008+ as needed.
 
 **install.sh changes:**
 - `apt install -y unzip libxrt-npu2` (system deps Lemonade requires)
-- Pull `ghcr.io/hal0ai/hal0-lemond:vX.Y.Z` image
+- Download embeddable tarball, sha256-verify, extract to /opt/lemonade
 - Write `/etc/systemd/system/lemond.service`
 - `systemctl enable --now lemond`
 - Generate + install `server_models.json` from registry
 - Stop installing per-modality docker images
 
-**New CI:**
-- `hal0-lemond` image build workflow (replaces six toolbox workflows)
-- Cosign signing per release
+**CI changes:**
+- Retire six toolbox build workflows
+- NO new hal0 image to build (AMD ships the embeddable tarball)
+- Release workflow adds tarball sha256 + version pin to manifest.json
 
 ---
 
@@ -116,8 +117,8 @@ ADRs 0007+ for multi-user Cognee, prior plan, are renumbered to 0008+ as needed.
 1. **ADR-0006 + ADR-0007 drafts** — written from this session's decisions; tightened post-research
 2. **`LemonadeClient` skeleton** — HTTP client with CLI fallback, type stubs only
 3. **manifest.json schema v2** — `lemonade: {...}` field, validation, `HAL0_BACKEND=lemonade` flag plumbing
-4. **`hal0-lemond` image** — Dockerfile + GH workflow + cosign + publish to ghcr.io
-5. **`lemond.service`** — install.sh writes systemd unit + container-run command + device passthrough
+4. **install.sh tarball fetch** — download embeddable tarball, sha256-verify, extract to /opt/lemonade, apt-install unzip+libxrt-npu2
+5. **`lemond.service`** — install.sh writes hardened systemd unit running `lemond` directly (no container wrapper)
 6. **`server_models_gen.py`** — registry.toml → server_models.json converter + install hook
 7. **`SlotConfig.device`** — schema refactor + capabilities.toml schema_version=2 migration
 8. **`LemonadeProvider`** — concrete provider implementing Provider ABC, drives `LemonadeClient`
diff --git a/docs/internal/lemonade-repo-deep-dive-2026-05-22.md b/docs/internal/lemonade-repo-deep-dive-2026-05-22.md
new file mode 100644
index 00000000..c03e007c
--- /dev/null
+++ b/docs/internal/lemonade-repo-deep-dive-2026-05-22.md
@@ -0,0 +1,276 @@
+# Lemonade Repo Deep-Dive — 2026-05-22
+
+**Repo state read:** `lemonade-sdk/lemonade@7af26f75` (HEAD of `main`, 2026-05-21).
+**Scope:** dev internals, full API surface, embeddable build, omni recipe, WS protocol — all that hal0's v0.2 LemonadeProvider and v0.2.1 UI rework will sit on top of.
+**Pairs with:** `lemonade-spike-findings-2026-05-22.md`, `lemonade-migration-plan.md`, ADR-0006, ADR-0007.
+
+---
+
+## Executive summary (200 words)
+
+The spike findings stand, but the repo reveals two big surprises and three structural wins.
+
+**What hal0 inherits for free, more than expected:**
+1. Lemonade *already ships* an embeddable build target (`cmake --build --target embeddable`) that emits a portable `lemond + lemonade + resources/` tarball with byte-identical CLI semantics to the deb. ADR-0006 decision #14 — "publish our own `ghcr.io/hal0ai/hal0-lemond` wrapper container" — can be re-evaluated; the embeddable tarball already does 90% of the bundling. A thin systemd-unit wrapper on host may beat full containerization.
+2. A first-party browser UI (`src/web-app/`) is already maintained as a hard invariant to be Debian-packageable from system npm modules only. Tracks the lemond release version 1:1. hal0 v0.2.1 can either reuse `/app` directly or build against the documented WS taxonomy.
+3. The OmniRouter pattern (`collection.omni` recipe) is conceptually identical to hal0's capability slots — only it's expressed inside Lemonade's own model registry rather than as an external overlay.
+
+**What's fragile to depend on:**
+1. `/internal/*` endpoints are explicitly first-party-only, loopback-restricted, "may change without notice." Slot eviction / config writes from hal0 must avoid these despite their convenience.
+2. The "nuclear evict-all" policy is *documented intent*, not a bug.
+
+**Biggest surprise vs public docs:** `/v1/load` request schema is bare — `model_name` is the *only* mandatory field. The spike curl that returned `"type must be string, but is null"` was almost certainly hitting `request_json["model_name"]` against an absent or null key — a nlohmann unconditional access. The "type must be string" message is nlohmann's, not Lemonade's. Mystery solved.
+
+---
+
+## 1. `/v1/load` actual request schema (ground truth)
+
+**Source of truth:** `src/cpp/server/server.cpp::handle_load()` lines 3068-3183 + `src/cpp/server/recipe_options.cpp::RecipeOptions(recipe, options)` lines 132-141.
+
+### Required field
+
+```cpp
+model_name = request_json["model_name"]; // unconditional access — null/missing → nlohmann throws
+```
+
+`model_name` is the only required field. The exception thrown if it is missing or null is the same `"type must be string, but is null"` that the spike saw. **Diagnosis: the spike body had `model_name` either absent, JSON-null, or non-string.** Standard `{"model_name": "Qwen3-0.6B-GGUF"}` works.
+
+### Optional fields — all consumed via `request_json.value(key, default)` or via `RecipeOptions` constructor
+
+| Field | Type | Recipes it applies to | Notes |
+|---|---|---|---|
+| `save_options` | bool | all | If true, persists per-model options to `recipe_options.json`; default `false` |
+| `ctx_size` | int | `llamacpp`, `flm`, `ryzenai-llm` | Context window |
+| `llamacpp_backend` | string | `llamacpp` | `vulkan`, `rocm`, `metal`, `cpu` |
+| `llamacpp_args` | string | `llamacpp` | Custom args to llama-server (reserved-args list applies) |
+| `llamacpp_device` | string | `llamacpp` | Comma-separated device list (e.g. `Vulkan0`) |
+| `whispercpp_backend` | string | `whispercpp` | `npu` / `cpu` / `vulkan` |
+| `whispercpp_args` | string | `whispercpp` | Custom args |
+| `sd-cpp_backend` | string | `sd-cpp` | (note the hyphen in the key) |
+| `sdcpp_args` | string | `sd-cpp` | |
+| `steps`, `cfg_scale`, `width`, `height`, `sampling_method`, `flow_shift` | numeric | `sd-cpp` | Image gen params |
+| `flm_args` | string | `flm` | |
+| `vllm_backend`, `vllm_args` | string | `vllm` | |
+| `merge_args` | bool | all | Default `true`; if `false`, per-model `*_args` replace global instead of merging |
+
+`RecipeOptions(recipe, options)` filters incoming JSON to only the keys returned by `get_keys_for_recipe(recipe)`, so passing extra fields is harmless. Empty-string and `-1` are treated as "use default" via `is_empty_option()`.
+
+### `/v1/load` is declarative
+
+Comment in source: *"Load model with optional per-model settings (declarative: no-op if already loaded with matching options, reload only if options differ)"*. hal0's idle-unload + reload-on-options-change driver does not need to track load state itself; Lemonade no-ops correctly.
+
+### Collection load behavior (omni)
+
+If `info.recipe == "collection.omni"`, handle_load iterates `info.components`, downloads any missing ones, and loads each via `router_->load_model(component, comp_info, comp_info.recipe_options, ...)`. **Per-load options like `ctx_size` or `llamacpp_backend` are NOT forwarded to components.** Each component is loaded with its own persisted `recipe_options.json` entry. Documented in the API ref; verified in `server.cpp:3132-3137`.
+
+### Server bug class to know about
+
+`is_empty_option` treats `""`, `"auto"`, and `-1` (int) as "use default". Pass `null` for any of these fields and you trip the unconditional accessor. Always send omitted-or-typed-value, never explicit nulls.
+
+---
+
+## 2. WebSocket protocol (for v0.2.1 UI)
+
+**Source of truth:** `src/cpp/server/websocket_server.cpp` + `docs/api/lemonade.md` + `docs/api/openai.md`.
+
+### Connection routing
+
+The WS server shares a single port (`websocket_port`, OS-assigned by default; configurable via `--websocket-port` or `config.json`). Discovered via `GET /v1/health` → `websocket_port` field. Two URL paths multiplexed on that port:
+
+- `ws://host:/logs/stream` → log streaming
+- `ws://host:/realtime?model=Whisper-Tiny` → realtime audio transcription (OpenAI-compatible)
+
+### `/logs/stream` taxonomy
+
+**Client → server:**
+- `{"type":"logs.subscribe","after_seq":null|}` — replay from seq, or full backlog (≤5000 entries retained)
+
+**Server → client:**
+- `{"type":"logs.snapshot","entries":[ {seq,timestamp,severity,tag,line}, ... ]}` — initial batch (sent once)
+- `{"type":"logs.entry","entry":{seq,timestamp,severity,tag,line}}` — live entries
+- `{"type":"error","error":{message,type}}` — protocol error
+
+Severity enum: `Trace | Debug | Info | Warning | Error | Fatal`. Tags are component-level strings (`Server`, `Router`, ...).
+
+### `/realtime` taxonomy (OpenAI-compatible)
+
+Initial `session.created` on connect. Then:
+
+**Client → server:** `session.update`, `input_audio_buffer.append` (base64 PCM16 16kHz mono), `input_audio_buffer.commit`, `input_audio_buffer.clear`.
+
+**Server → client:** `session.created`, `session.updated`, `input_audio_buffer.speech_started`, `input_audio_buffer.speech_stopped`, `input_audio_buffer.committed`, `input_audio_buffer.cleared`, `conversation.item.input_audio_transcription.delta` (interim), `conversation.item.input_audio_transcription.completed` (final), `error`.
+
+VAD configurable via `session.update.session.turn_detection = {threshold, silence_duration_ms, prefix_padding_ms}` or `null` to disable.
+
+### What hal0's v0.2.1 UI should consume
+
+- `/logs/stream` is the *primary* live signal; current hal0 dashboard polling can be replaced with `logs.subscribe + after_seq` for reconnect-safe streaming. Use `seq` for dedup.
+- There is **no** WS event for "model load started/completed/failed" — those land on `/logs/stream` as `(Router)` and `(Server)` tagged Info/Error lines. v0.2.1 should either parse tagged log lines OR poll `/v1/health.all_models_loaded[].last_use` to drive a load-progress UI.
+- There is no metrics-over-WS channel. Stats are pull-only via `/v1/stats`.
+
+---
+
+## 3. Type classification + reserved args
+
+**Source:** `src/cpp/include/lemon/model_types.h::get_model_type_from_labels()`.
+
+Type assignment is **label-driven, not field-driven**. There is no `type` field in `server_models.json`. Resolution order:
+1. **Chat-indicator labels win.** Any of `vision`, `reasoning`, `tool-calling`, `tools`, `chat-transcription` → `ModelType::LLM`.
+2. Else first-match on: `embeddings`/`embedding` → EMBEDDING, `reranking` → RERANKING, `transcription` → TRANSCRIPTION, `image` → IMAGE, `tts` → TTS.
+3. Else → `ModelType::LLM` (default).
+
+**Hal0 impact:** for the spike's broken rerank/embed discovery, the cause is `extra_models_dir` GGUFs receive labels `["custom"]` only (Extra-Models-Dir-Spec.md §"Model Properties"), so type defaults to LLM and `--reranking`/`--embedding` flags are never passed. Workaround: don't rely on extra-models-dir for non-LLM modalities — register them via `/v1/pull` with `embedding: true` or `reranking: true` (the API explicitly accepts these, which adds the `embeddings` / `reranking` label).
+
+**Device** is derived from recipe via `get_device_type_from_recipe()`, NOT from labels. Recipe-to-device static map: `llamacpp` → GPU (overridable to CPU for `cpu` backend), `ryzenai-llm`/`flm` → NPU, `whispercpp` → CPU (overridable), `sd-cpp` → CPU (overridable), `kokoro` → CPU (no GPU build exists), `collection.omni` → NONE.
+
+### Reserved args (the canonical list)
+
+From `/v1/load` doc verbatim — args forbidden in `llamacpp_args`:
+
+> `-m, --port, --ctx-size, -ngl, --jinja, --mmproj, --embeddings, --reranking`
+
+Whispercpp_args forbidden: `-m, --model, --port`. The spike captured a larger superset from server logs (`--device, --gpu-layers, --n-gpu-layers, -dev, --mmproj-*, --no-mmproj-*, -mm, -mmu`) — these are *also* managed but not in the doc. Treat the spike's list as authoritative for current build.
+
+---
+
+## 4. `collection.omni` recipe + OmniRouter
+
+**Source:** `docs/dev/lemonade-omni.md` + `src/cpp/include/lemon/model_types.h:9` + `src/app/src/renderer/utils/toolDefinitions.json`.
+
+An **Omni model** is a registered model with `recipe: "collection.omni"` and a `components: [...]` array of other registered model names. It is *not* a multi-modal model file — it is a manifest that bundles several single-modality models together. Loading the collection loads each component using *its own* recipe_options entry.
+
+Tool surface (canonical, used by Lemonade's desktop app + the doc):
+
+| Tool | Endpoint | Required model label |
+|---|---|---|
+| `generate_image` | `POST /v1/images/generations` | `image` |
+| `edit_image` | `POST /v1/images/edits` | `edit` |
+| `text_to_speech` | `POST /v1/audio/speech` | `tts` |
+| `transcribe_audio` | `POST /v1/audio/transcriptions` | `transcription` |
+| `analyze_image` | `POST /v1/chat/completions` | LLM with `vision` |
+
+The "router" is just OpenAI tool-calling JSON shipped to the agent; Lemonade does not own the agent loop. Omni models are hidden from default `/v1/models` and only appear with `?show_all=true`.
+
+**hal0 mapping:**
+- hal0's `capabilities.toml` (capability → backend slot rollup) and `collection.omni` (collection of model_names) overlap conceptually but target different layers. capabilities.toml expresses "which backend serves which capability"; omni expresses "this group of registered models is a single user-facing kit."
+- For v0.2: keep `capabilities.toml` as the higher-level *UX rollup*; treat collection.omni as a sub-feature hal0 can expose to power-users (`hal0 capabilities export-as-omni`) or pass-through unchanged.
+- The omni "Halo" SKU (`LMX-Omni-52B-Halo` = Qwen3.6-35B + Flux-2-Klein-9B + Whisper-Large-v3-Turbo + kokoro-v1) is *named for Strix Halo*. hal0 inherits a curated, vendor-blessed Strix Halo bundle by adopting Lemonade.
+
+---
+
+## 5. Process supervision + nuclear evict-all
+
+**Source:** `src/cpp/Multi-Model-Spec.md` + `src/cpp/server/router.cpp`.
+
+- LRU per *type* slot. `--max-loaded-models N` (default 1) applies to each of llm/embedding/reranking/transcription/image/tts independently. `-1` = unlimited.
+- Eviction granularity: only models of the *same type* are evicted to make room. Exception: an NPU load evicts any existing NPU model regardless of type (NPU exclusivity).
+- **"Nuclear" evict-all policy is policy, not bug.** Multi-Model-Spec §"Error Handling": *"If a WrappedServer load fails (with exceptions noted below), all WrappedServers of every type are evicted, and the load is re-attempted. This 'nuclear' policy simplifies implementation while remaining effective in practice."* **Exception:** file-not-found errors are exempt. This validates ADR-0007's mitigation strategy verbatim.
+- **Serialized loading.** Only one WrappedServer loads at a time. Concurrent `/v1/load` queues indefinitely. Hal0 needs a hard timeout at its layer.
+- **Busy-protection.** A WrappedServer fulfilling an inference request cannot be evicted until it finishes (EVICTION_TIMEOUT=5s; `router.h:18`).
+- **Auto-load protection.** An inference request to an unloaded model triggers auto-load; the inference completes before the WrappedServer becomes eligible for eviction.
+
+---
+
+## 6. Build system + embeddable tarball
+
+**Source:** `docs/embeddable/`, `docs/dev/getting-started.md`, `CMakeLists.txt` (68 KB, not fully read).
+
+- Single CMake target `embeddable` produces a per-platform archive: `build/lemonade-embeddable--{ubuntu|windows|macos}-{x64|arm64}.{tar.gz|zip}`.
+- Archive contents: `lemond`, `lemonade`, `LICENSE`, `resources/server_models.json`, `resources/backend_versions.json`, `resources/defaults.json`. Optionally `resources/web-app/` if `-DBUILD_WEB_APP=ON`.
+- Runtime layout when `lemond ./` is invoked from the archive root:
+  - `./config.json` — auto-generated from `resources/defaults.json` on first launch; **safe to delete `defaults.json` after**.
+  - `./recipe_options.json` — per-model overrides.
+  - `./bin/{llamacpp,ryzenai-server,flm,sdpp,whispercpp}/{rocm,vulkan,cpu,npu}/...` — backend binaries (downloaded lazily, or pre-staged at packaging time via `lemonade backends install BACKEND:DEVICE`).
+  - `./models/models----/` — HF-standard layout.
+  - `./extra_models/` — bring-your-own GGUFs (extra.* namespace).
+- Auth: `LEMONADE_API_KEY=KEY lemond ./ --port PORT` enables bearer-token auth. Missing/wrong key → 401. This is the canonical embedding pattern — hal0 should adopt it.
+
+**ADR-0006 decision #14 challenge:** the spike findings + the embeddable tarball + the existing `.deb` package together mean hal0 has *three* viable bundle paths:
+1. Custom `ghcr.io/hal0ai/hal0-lemond` container (original ADR-0006 plan)
+2. Apt-install Lemonade's official `.deb` + a hal0 systemd unit
+3. Bundle the `embeddable` tarball + hal0 systemd unit, fully self-contained
+
+Option 3 has the smallest surface and avoids the `apparmor_parser` LXC docker-build issue (memory `hal0_docker_build_lxc_apparmor`). Worth raising in next grilling pass.
+
+---
+
+## 7. Internal endpoints (DO NOT depend on)
+
+**Source:** `docs/dev/getting-started.md` + `docs/embeddable/runtime.md`.
+
+Endpoints under `/internal/*` are:
+- Loopback-restricted (`127.0.0.1`/`::1`); non-localhost → 403.
+- Documented as *"for first-party Lemonade software only (CLI, tray app, desktop app). They are not part of the public API, may change without notice, and must not be relied upon by third-party integrations."*
+- `POST /internal/set` (server-level + deferred keys — see embeddable/runtime.md §"POST /internal/set" for full list including `port, host, log_level, global_timeout, no_broadcast, extra_models_dir, max_loaded_models, ctx_size, llamacpp_backend, llamacpp_args, sdcpp_backend, whispercpp_backend, whispercpp_args, vllm_backend, vllm_args, steps, cfg_scale, width, height, flm_args`)
+- `GET /internal/config` — full runtime config snapshot.
+- `POST /internal/shutdown` — unload all + shut down.
+- `POST /internal/cleanup-cache` — orphan HF cache cleanup.
+
+**Hal0 stance:** these are tempting but unstable. Stay on the public surface. If a runtime tuneable is only available via `/internal/set`, drive it through the `lemonade config set` CLI subprocess instead — same endpoint underneath but Lemonade owns the compat contract.
+
+---
+
+## 8. Test strategy
+
+**Source:** `docs/dev/getting-started.md#testing` + `test/` directory.
+
+Python test suite under `test/` drives the `lemonade` CLI binary (NOT the HTTP surface directly):
+
+| File | Coverage |
+|---|---|
+| `server_cli2.py` | CLI verbs: version, status, list, export, backends, pull, import, load, unload, run, launch, delete |
+| `server_endpoints.py` | HTTP: health, models, pull, load, unload, system-info, stats |
+| `server_llm.py` | Inference: chat, embeddings, reranking — parametrized by `--wrapped-server` and `--backend` |
+| `server_whisper.py` | ASR |
+| `server_sd.py` | Image gen (~2-3 min/img on CPU) |
+
+**Hal0 applicability:** `server_endpoints.py` is directly relevant as a contract test — hal0's LemonadeClient regression suite should mirror its assertions (response shapes for /v1/health, /v1/load, /v1/unload, /v1/models, /v1/stats, /v1/system-info). Worth pulling into hal0 verbatim or by reference.
+
+---
+
+## 9. Things that surprised me vs the spike
+
+1. **`lemonade backends install` writes to `./bin/` of the lemond CWD**, not a global location. The "`/opt/hal0/flm-ubuntu` doesn't exist" finding from the spike is consistent — that path was a guess. The real path on the LXC is `~/.cache/lemonade/bin/flm/...` (default lemond cache).
+2. **`vllm` is a documented recipe with full /internal/set keys.** Spike treated vLLM-ROCm as out of scope; the API support already exists. Cost of enabling it later is small.
+3. **`kokoro` has its own recipe slot** (recipe `kokoro`, device CPU only, no GPU variant ever — spike confirmed but the source makes it explicit: `get_device_type_from_recipe` hardcodes CPU). GPU-kokoro loss is *permanent in Lemonade's design*, not a transient state.
+4. **`/live` is unversioned and outside `/api/v0/`, `/api/v1/`, `/v0/`, `/v1/` prefixes.** Use this for hal0 healthcheck, not `/v1/health` — `/live` does zero work; `/v1/health` enumerates loaded models.
+5. **UDP beacon on port 13305 broadcasts hostname to RFC1918 networks** for client discovery. Disable with `--no-broadcast` if hal0 doesn't want unsolicited LAN announcements (currently a leaky tray-app pattern in a server context).
+6. **`max_models` is reported per-type by `/v1/health`** — `{"transcription":1,"embedding":1,"image":1,"llm":1,"reranking":1,"tts":1}`. hal0's dashboard already wants per-type slot counts; this surfaces them directly.
+
+---
+
+## 10. File index — which Lemonade file covers which subsystem
+
+| Subsystem | File(s) | Why care |
+|---|---|---|
+| /v1/load handler | `src/cpp/server/server.cpp:3068-3183` | Source of truth for required/optional fields |
+| Recipe option parsing | `src/cpp/server/recipe_options.cpp` | Per-recipe key whitelist, defaults, CLI flag mapping |
+| Eviction policy | `src/cpp/Multi-Model-Spec.md` + `src/cpp/server/router.cpp` | LRU + nuclear-evict-all rationale |
+| Type/device classification | `src/cpp/include/lemon/model_types.h` | Label → type, recipe → device |
+| Extra-models-dir scan | `src/cpp/Extra-Models-Dir-Spec.md` + `src/cpp/server/model_manager.cpp` | extra.* namespace, label-restrictions |
+| WebSocket protocol | `src/cpp/server/websocket_server.cpp` + `docs/api/lemonade.md` + `docs/api/openai.md` | logs.* + realtime audio taxonomy |
+| Collection.omni | `src/cpp/include/lemon/model_types.h:9-13` + `src/cpp/server/model_manager.cpp` (`validate_collection_request`) + `docs/dev/lemonade-omni.md` | Omni recipe shape + tool catalog |
+| Backend version pinning | `src/cpp/resources/backend_versions.json` | What versions of llama.cpp / whisper.cpp / FLM ship per release |
+| Server config defaults | `src/cpp/resources/defaults.json` | First-launch config.json content |
+| Embeddable build | `docs/embeddable/` + `CMakeLists.txt` (target `embeddable`) | Tarball assembly + auth pattern |
+| First-party UI | `src/app/` (Tauri desktop) + `src/web-app/` (browser, Debian-packageable) | Shared React renderer; web-app is what `/app` serves |
+| Web-UI Debian invariant | `docs/dev/web-ui.md` | Why hal0 can't just `npm install` arbitrary deps into the web-app build |
+| API specs | `docs/api/{lemonade,openai,anthropic,ollama,llamacpp}.md` | Full request/response shapes for all surfaces |
+| Internal endpoints | `docs/dev/getting-started.md#internal-endpoints` + `docs/embeddable/runtime.md` | Tempting but explicitly unstable |
+| Integration patterns |
`docs/integrations/{claude-code,open-webui,continue,...}.md` | What "free integrations" hal0 inherits | + +--- + +## 11. Recommendations to feed back into ADR-0006 / migration plan + +1. **Reconsider decision #14 (custom hal0-lemond container).** The `embeddable` tarball + a hal0 systemd unit may replace it with significantly less moving parts. Re-cost vs the `apparmor_parser`-on-LXC pain. +2. **Adopt `LEMONADE_API_KEY` from day one.** It's the canonical lockout pattern and hal0's threat model includes "other LAN apps reaching the gateway directly." +3. **Switch hal0 healthcheck from `/v1/health` to `/live`.** Cheaper, doesn't risk perturbing eviction timestamps. +4. **For v0.2.1 UI:** subscribe to `/logs/stream` with `after_seq` for the live event surface; supplement with `/v1/health` polling for `all_models_loaded[].last_use`. There is no dedicated model-state-change WS event; do not wait for one. +5. **Reserved-args list:** the spike's superset (`--device --gpu-layers --n-gpu-layers --jinja --mmproj* --no-mmproj* -dev -mm -mmu`) is current truth; the API doc's shorter list is the public-promised subset. Hal0's CLI/config validator should reject from the larger list. +6. **Type classification:** for embed/rerank models in `extra_models_dir`, the only reliable path is `/v1/pull` registration with `embedding:true`/`reranking:true`. Document this in the install.sh model-bootstrap path. +7. **Collection.omni as v0.2.1+ feature:** `hal0 capabilities export-as-omni` would let users hand off hal0-curated bundles into Lemonade's native omni picker — a one-way bridge that costs little. + +--- + +*End deep-dive. Memories written: `hal0_lemonade_internals`, `hal0_lemonade_v1_load_schema`, `hal0_lemonade_ws_protocol`, `hal0_lemonade_omni_pattern`. MEMORY.md index updated.*
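
As an illustration of recommendation 5 in §11, the reserved-args check could be sketched roughly as follows. This is a minimal sketch, not hal0's actual validator: the function name `find_reserved_args` and the constant `RESERVED_PATTERNS` are hypothetical, and reading the trailing `*` in `--mmproj*` / `--no-mmproj*` as a prefix glob is an assumption about how the spike's list is meant to be interpreted.

```python
import shlex
from fnmatch import fnmatchcase

# Reserved flags copied from the spike's superset (recommendation 5).
# ASSUMPTION: a trailing "*" means "any flag starting with this prefix".
RESERVED_PATTERNS = (
    "--device", "--gpu-layers", "--n-gpu-layers", "--jinja",
    "--mmproj*", "--no-mmproj*", "-dev", "-mm", "-mmu",
)


def find_reserved_args(llamacpp_args: str) -> list[str]:
    """Return the reserved flags found in a user-supplied argument string.

    Hypothetical helper: the caller decides whether to reject the config
    outright or only warn.
    """
    hits = []
    for token in shlex.split(llamacpp_args):
        # Normalise "--flag=value" to "--flag" before matching.
        flag = token.split("=", 1)[0]
        if not flag.startswith("-"):
            continue  # positional value, not a flag
        if any(fnmatchcase(flag, pattern) for pattern in RESERVED_PATTERNS):
            hits.append(flag)
    return hits
```

A validator built on this would reject any slot config for which `find_reserved_args(...)` returns a non-empty list and would name the offending flags in the error. Matching is case-sensitive and exact apart from the two glob patterns, so `-mm` does not accidentally swallow unrelated flags that merely start with `-mm`.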