feat(inference): inference-llm PR-2 — InferenceLlmModule ServiceModule impl (stub-backed) by joelteply · Pull Request #1391 · CambrianTech/continuum

joelteply · 2026-05-18T16:34:40Z

Summary

PR-2 of inference-llm. Wires the ServiceModule that accepts InferenceRequest commands + emits InferenceComplete + FirstTokenEmitted responses. The actual llama.cpp invoke lands in PR-3; PR-2 ships a STUB inference returning canned tokens so the seam is testable end-to-end + downstream consumers (sentinel-observer, VDD harness) can wire to it today.

What lands

InferenceLlmModule struct implementing ServiceModule
ModuleConfig: name="inference-llm", priority=High, command_prefixes=["inference/llm/"]
handle_command for "inference/llm/request":
- parses InferenceRequest JSON payload
- runs stub inference (3 canned tokens, FinishReason::Stop)
- returns InferenceResponse { complete, first_token } as JSON
Loud typed errors for unknown commands + invalid payloads (per Joel's never-swallow rule)
COMMAND_REQUEST = "inference/llm/request" constant pinned
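The routing described above can be sketched in a few lines. This is a minimal illustration, not the module's actual code: the struct fields, the `CommandError` enum, and the `parse_request` helper (which accepts a toy `request_id=<n>` payload in place of the real JSON parse) are all assumptions made to keep the example std-only. Only the names `COMMAND_REQUEST`, `InferenceResponse`, `FinishReason::Stop`, and `run_stub_inference` come from the PR text.

```rust
// Sketch only: names mirror the PR description, bodies and fields are assumptions.
const COMMAND_REQUEST: &str = "inference/llm/request";

#[derive(Debug, PartialEq)]
enum FinishReason {
    Stop,
}

#[derive(Debug, PartialEq)]
struct InferenceComplete {
    request_id: u64,
    tokens: Vec<String>,
    finish_reason: FinishReason,
}

#[derive(Debug, PartialEq)]
struct FirstTokenEmitted {
    request_id: u64,
    token: String,
}

#[derive(Debug, PartialEq)]
struct InferenceResponse {
    complete: InferenceComplete,
    first_token: FirstTokenEmitted,
}

/// Hypothetical typed error: every failure is returned, never swallowed.
#[derive(Debug, PartialEq)]
enum CommandError {
    UnknownCommand(String),
    InvalidPayload(String),
}

/// Hypothetical stand-in for the JSON parse: accepts "request_id=<u64>".
fn parse_request(payload: &str) -> Result<u64, CommandError> {
    payload
        .strip_prefix("request_id=")
        .and_then(|raw| raw.parse::<u64>().ok())
        .ok_or_else(|| CommandError::InvalidPayload(payload.to_string()))
}

/// Stub inference: three canned tokens, always finishing with Stop.
fn run_stub_inference(request_id: u64) -> InferenceResponse {
    let tokens: Vec<String> = ["stub", "canned", "tokens"]
        .iter()
        .map(|t| t.to_string())
        .collect();
    let first_token = FirstTokenEmitted {
        request_id,
        token: tokens[0].clone(),
    };
    let complete = InferenceComplete {
        request_id,
        tokens,
        finish_reason: FinishReason::Stop,
    };
    InferenceResponse { complete, first_token }
}

/// Routes the one known command to the stub; anything else is a typed error.
fn handle_command(command: &str, payload: &str) -> Result<InferenceResponse, CommandError> {
    match command {
        COMMAND_REQUEST => parse_request(payload).map(run_stub_inference),
        other => Err(CommandError::UnknownCommand(other.to_string())),
    }
}
```

Both failure modes surface as distinct `Err` variants, so a caller can tell a routing mistake from a malformed payload.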

Design choices

Stub-backed because PR-3 ships the real engine; the OUTER wire shape stays identical across the stub→real transition, so downstream consumers don't need to know which is running.
pub(super) run_stub_inference + first_token_for helpers so PR-3 can keep a "stub-vs-real produce same wire shape" regression test before swapping.
Returns InferenceResponse bundle (complete + first_token) instead of publishing two events separately. Avoids racing the two events.
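A "stub-vs-real produce same wire shape" check could look roughly like the sketch below. Everything here is illustrative: the `Engine` trait, `WhitespaceEngine` (a toy stand-in for a real engine), the struct fields, and the `has_consistent_wire_shape` predicate are hypothetical, and the signature of `first_token_for` is a guess; only the helper's name and the bundle's two members come from the PR text.

```rust
// Sketch only: trait, engines, and field layout are assumptions for illustration.
#[derive(Debug, PartialEq)]
struct InferenceComplete {
    request_id: u64,
    tokens: Vec<String>,
}

#[derive(Debug, PartialEq)]
struct FirstTokenEmitted {
    request_id: u64,
    token: String,
}

#[derive(Debug, PartialEq)]
struct InferenceResponse {
    complete: InferenceComplete,
    first_token: FirstTokenEmitted,
}

/// Derives the first-token event from the completed generation (assumed signature).
fn first_token_for(complete: &InferenceComplete) -> FirstTokenEmitted {
    FirstTokenEmitted {
        request_id: complete.request_id,
        token: complete.tokens.first().cloned().unwrap_or_default(),
    }
}

/// Builds the bundle in one place so both events always describe the same request.
fn bundle(complete: InferenceComplete) -> InferenceResponse {
    let first_token = first_token_for(&complete);
    InferenceResponse { complete, first_token }
}

/// Hypothetical seam: any engine must return the same bundle type.
trait Engine {
    fn infer(&self, request_id: u64, prompt: &str) -> InferenceResponse;
}

struct StubEngine;

impl Engine for StubEngine {
    fn infer(&self, request_id: u64, _prompt: &str) -> InferenceResponse {
        let tokens = vec!["stub".to_string(), "canned".to_string(), "tokens".to_string()];
        bundle(InferenceComplete { request_id, tokens })
    }
}

/// Toy "real" engine: tokenizes the prompt on whitespace and echoes it back.
struct WhitespaceEngine;

impl Engine for WhitespaceEngine {
    fn infer(&self, request_id: u64, prompt: &str) -> InferenceResponse {
        let tokens = prompt.split_whitespace().map(str::to_string).collect();
        bundle(InferenceComplete { request_id, tokens })
    }
}

/// Shape invariant shared by every engine: ids agree and the first-token
/// event carries the head of the completed token list.
fn has_consistent_wire_shape(response: &InferenceResponse) -> bool {
    response.first_token.request_id == response.complete.request_id
        && response.complete.tokens.first() == Some(&response.first_token.token)
}
```

Because the bundle is assembled by a single function, the two events cannot disagree about which request they describe, which is the property the bundled return is meant to protect.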

What is deliberately deferred (PR-3)

Real llama.cpp invocation (LlamaCppAdapter integration)
Tokenizer (composition_plan → prompt_tokens)
Token streaming via channels
Bus-event subscription path (artifact_subscriptions)
ResidencyFault emission on missing-page (needs working-set integration)
Runtime registration

Test plan

cargo test --lib --features metal,accelerate inference::llm_module_service — 8/8 pass:
- config_reports_name_and_command_prefix
- config_priority_is_high
- command_request_has_canonical_string_value
- handle_command_routes_request_to_stub_inference (end-to-end)
- handle_command_unknown_returns_loud_error
- handle_command_invalid_payload_returns_typed_error
- inference_response_round_trips_through_serde
- module_is_object_safe_for_dyn_service_module
No regressions across other 2940 lib tests

Stack

#1387 — inference-llm PR-1: typed event surface (MODULE-CATALOG §II)
This PR — inference-llm PR-2: ServiceModule impl (stub-backed)
NEXT — PR-3: real LlamaCppAdapter invoke + tokenizer + streaming

🤖 Generated with Claude Code

…e impl

PR-2 of inference-llm. Wires the ServiceModule that accepts InferenceRequest commands + emits InferenceComplete + FirstTokenEmitted responses. The actual llama.cpp invoke lands in PR-3; PR-2 ships a STUB inference returning canned tokens so the seam is testable end-to-end + downstream consumers (sentinel-observer, VDD harness) wire to it today.

What lands
- InferenceLlmModule struct implementing ServiceModule
- ModuleConfig: name="inference-llm", priority=High, command_prefixes=["inference/llm/"]
- handle_command for "inference/llm/request":
  - parses InferenceRequest JSON payload
  - runs stub inference (3 canned tokens, FinishReason::Stop)
  - returns InferenceResponse { complete, first_token } as JSON
- Loud typed errors for unknown commands + invalid payloads
- COMMAND_REQUEST = "inference/llm/request" constant pinned

Design choices
- Stub backed because PR-3 ships the real engine; the OUTER wire shape stays identical across stub→real transition.
- pub(super) run_stub_inference + first_token_for helpers so PR-3 can keep a "stub-vs-real produce same wire shape" regression test before swapping.
- Returns InferenceResponse bundle (complete + first_token) instead of publishing two events separately. Caller decomposes if needed.

Tests
8 new tests pin the contract: config, command constant, route to stub, loud error paths, serde round-trip, dyn dispatch. 8/8 pass. No regressions across other 2934 lib tests.

Stack
- #1387 — inference-llm PR-1: typed event surface
- THIS PR — inference-llm PR-2: ServiceModule impl (stub-backed)
- NEXT — PR-3: real LlamaCppAdapter invoke + tokenizer + streaming

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR-2's earlier clippy pass removed file-scope InferenceRequestId import because production code doesn't use it directly (only deserializes from JSON). Test module DOES use it for constructing sample requests, so cargo test --lib failed with E0433. Same pattern as the genome/blob.rs fix earlier this session. Future me: when clippy says 'unused import' but the test mod uses the type, scope to the test mod rather than deleting outright.
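The fix pattern described above, scoping the import to the test module instead of deleting it, might look like the following sketch. The module layout, the `is_valid_request_id` function, and the newtype definition are hypothetical; only the `InferenceRequestId` name comes from the commit message.

```rust
// Sketch only: module layout and function names are assumptions for illustration.
#[allow(dead_code)]
mod ids {
    /// Hypothetical newtype standing in for the real InferenceRequestId.
    #[derive(Debug, PartialEq)]
    pub struct InferenceRequestId(pub u64);
}

mod service {
    // No file-scope import of InferenceRequestId here: production code never
    // names the type, so the unused-import lint would flag it.

    /// Production path: works on the raw id only.
    pub fn is_valid_request_id(raw: u64) -> bool {
        raw != 0
    }

    #[cfg(test)]
    mod tests {
        // Import scoped to the test module, the only place the type is named.
        // Deleting it outright is what triggers E0433 under `cargo test`.
        use super::super::ids::InferenceRequestId;
        use super::is_valid_request_id;

        #[test]
        fn sample_request_id_is_valid() {
            let id = InferenceRequestId(42);
            assert!(is_valid_request_id(id.0));
        }
    }
}
```

A plain build never compiles the `tests` module, so the import stays invisible to the lint there while remaining available to `cargo test`.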

…shing helpers (#1392)

PR-3a of inference-llm. Same pattern as my genome::bus PR-4 (#1358): name the canonical ArtifactKey constants + ship the async publishing helpers + subscriber convenience. The actual real-engine integration lands in PR-3b/PR-4; PR-3a ships the bus surface so downstream observers (sentinel-observer, VDD harness, audit-recorder) can wire to it today before the engine swap.

What lands
Four canonical ArtifactKeys under inference/:
- INFERENCE_REQUEST_KEY = "inference/llm.request"
- INFERENCE_COMPLETE_KEY = "inference/llm.complete"
- FIRST_TOKEN_EMITTED_KEY = "inference/llm.first_token"
- RESIDENCY_FAULT_KEY = "inference/llm.residency_fault"
Four async publishing helpers — serialize the typed event + publish through the artifact dispatch path (#1339 + #1343):
- publish_inference_request
- publish_inference_complete
- publish_first_token_emitted
- publish_residency_fault
Three subscriber-convenience surfaces:
- subscribe_to_inference_responses(bus, name) — most observers want outcomes (complete + first_token + fault), not requests
- inference_response_selectors() — three Exact selectors
- all_inference_selectors() — four selectors including request for full-firehose consumers (audit-recorder when it covers inference)

Design choices
- Two subscriber surfaces (response-only vs full firehose) because most observers don't want every request — they want outcomes. Audit-recorder + VDD harness may want the firehose for the prod-replay chain Joel pushed at #1385.
- Request key INFERENCE_REQUEST_KEY in the publish helpers but NOT in the default observer set. Producers (persona-cognition) emit requests; observers see responses. Wiring symmetry without the noise.
- Same naming convention as genome::bus (module/surface.event) for cross-module consistency.

What is deliberately deferred (PR-3b / PR-4)
- Wiring helpers INTO InferenceLlmModule::handle_command so it auto-publishes after each call. PR-3b plumbs Arc<MessageBus> + Arc<ModuleRegistry> through the module's constructor.
- Real LLM engine (LlamaCppAdapter integration) — PR-4
- InferenceRequest artifact subscription (module subscribes to requests via bus instead of going through command bus) — needs persona-cognition to publish via bus first

Tests
7 new tests on inference::llm_module_bus:
- keys_have_canonical_string_values (pin wire strings)
- response_selectors_cover_three_keys_as_exact
- all_selectors_cover_four_keys
- publish_inference_complete_routes_to_subscribed_module (end-to-end through artifact dispatch)
- each_publish_helper_routes_to_its_own_key
- response_only_subscriber_does_not_see_requests
- full_firehose_subscriber_sees_requests_too
7/7 pass. No regressions across other 2958 lib tests.

Stack
- #1387 — inference-llm PR-1: typed event surface
- #1391 — inference-llm PR-2: ServiceModule impl (stub-backed)
- THIS PR — inference-llm PR-3a: bus keys + publishing helpers
- NEXT — PR-3b: InferenceLlmModule auto-publishes via these helpers after each handle_command call
- THEN — PR-4: real LlamaCppAdapter invoke + tokenizer + streaming

Co-authored-by: Test <test@test.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
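The split between the response-only and full-firehose selector sets can be illustrated with a small sketch. The four key strings and the two function names are taken from the commit message; the `Selector` enum, its `matches` method, and the `sees` helper are hypothetical stand-ins for whatever the real bus uses.

```rust
// Key strings come from the commit message; the Selector type is an assumption.
const INFERENCE_REQUEST_KEY: &str = "inference/llm.request";
const INFERENCE_COMPLETE_KEY: &str = "inference/llm.complete";
const FIRST_TOKEN_EMITTED_KEY: &str = "inference/llm.first_token";
const RESIDENCY_FAULT_KEY: &str = "inference/llm.residency_fault";

/// Hypothetical selector type; the real bus selector may carry more variants.
#[derive(Debug, Clone, PartialEq)]
enum Selector {
    Exact(&'static str),
}

impl Selector {
    fn matches(&self, key: &str) -> bool {
        match self {
            Selector::Exact(expected) => *expected == key,
        }
    }
}

/// Outcomes only: complete, first token, and residency fault.
fn inference_response_selectors() -> Vec<Selector> {
    vec![
        Selector::Exact(INFERENCE_COMPLETE_KEY),
        Selector::Exact(FIRST_TOKEN_EMITTED_KEY),
        Selector::Exact(RESIDENCY_FAULT_KEY),
    ]
}

/// Full firehose: the request key plus the three response keys.
fn all_inference_selectors() -> Vec<Selector> {
    let mut selectors = vec![Selector::Exact(INFERENCE_REQUEST_KEY)];
    selectors.extend(inference_response_selectors());
    selectors
}

/// Hypothetical helper: would a subscriber holding these selectors see `key`?
fn sees(selectors: &[Selector], key: &str) -> bool {
    selectors.iter().any(|selector| selector.matches(key))
}
```

Building the firehose set from the response set keeps the two lists from drifting apart if a fifth key is added later.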

…hes via bus hook (#1393)

PR-3b of inference-llm. Wires the bus helpers from PR-3a (#1392) INTO InferenceLlmModule's handle_command so every successful inference response auto-publishes InferenceComplete + FirstTokenEmitted to the trace bus. Closes the inference-llm bus loop: producer (command) → engine (stub for now) → response (CommandResult) → bus dispatch (complete + first_token) → subscriber (sentinel/VDD/audit).

What lands
- BusHook private struct: { bus: Arc<MessageBus>, registry: Arc<ModuleRegistry> }. Same shape as genome::local_manager BusHook (#1362).
- InferenceLlmModule.bus_hook: Option<BusHook> — None = bus-less PR-2 behavior; Some = auto-publish on every successful handle_command.
- with_bus(bus, registry) constructor — wires both Arcs at module construction; no in-flight switching (prevents the "bus added mid-service" race).
- handle_request body: on success, spawns publish_inference_complete and publish_first_token_emitted into the current tokio runtime via Handle::try_current. Spawn pattern (not await) avoids the DashMap borrow-across-await lifetime issue inside Send-bounded async_trait — same workaround as my genome LocalWorkingSetManager (#1362).
- spawn_publish_inference_complete + spawn_publish_first_token_emitted module-private helpers — Arcs cloned out before spawn so the &BusHook borrow doesn't outlive the spawn.

Design choices
- Publishing is best-effort observability. The authoritative response goes back through the CommandResult arm regardless of publish success — callers who need to know if a generation happened look at the Result, not the bus.
- Error paths (unknown command + invalid payload) do NOT publish. Tests pin this — bus events represent successful generations; errors are loud in the Result and silent on the bus.
- Two separate spawns (one per event) rather than one bundled publish. Lets subscribers see first_token even if the complete event hasn't dispatched yet (race-tolerant TTFT observability).

Tests
4 new bus tests (12 total):
- handle_command_with_bus_auto_publishes_complete_and_first_token — end-to-end: register subscriber, run handle_command, yield for spawn, verify both events landed with matching requestId
- handle_command_without_bus_does_not_publish — backwards-compat with PR-2 new() constructor
- handle_command_unknown_with_bus_does_not_publish — error paths silent on bus
- handle_command_invalid_payload_with_bus_does_not_publish — same invariant
12/12 pass on inference::llm_module_service. No regressions across other 2957 lib tests.

Stack
- #1387 — inference-llm PR-1: typed event surface
- #1391 — inference-llm PR-2: ServiceModule impl (stub-backed)
- #1392 — inference-llm PR-3a: bus keys + publishing helpers
- THIS PR — inference-llm PR-3b: auto-publish wiring
- NEXT — PR-4: real LlamaCppAdapter invoke + tokenizer + streaming (the stub stays in place until then; PR-4 swaps under the same external contract)

Co-authored-by: Test <test@test.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
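The optional-hook behavior (publish on success when a bus is wired, stay silent on errors and when no bus is present) can be sketched with the standard library alone. This is a simplification under stated assumptions: the real code spawns the publishes onto a tokio runtime and also carries an `Arc<ModuleRegistry>`, whereas the `MessageBus` here is a hypothetical in-memory log, `with_bus` takes only the bus, and the return type is reduced to `Result<u64, String>`.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical in-memory bus: records (key, request_id) pairs synchronously.
#[derive(Default)]
struct MessageBus {
    log: Mutex<Vec<(String, u64)>>,
}

impl MessageBus {
    fn publish(&self, key: &str, request_id: u64) {
        self.log.lock().unwrap().push((key.to_string(), request_id));
    }

    fn published(&self) -> Vec<(String, u64)> {
        self.log.lock().unwrap().clone()
    }
}

/// Simplified hook: the real struct also holds a module registry.
struct BusHook {
    bus: Arc<MessageBus>,
}

struct InferenceLlmModule {
    bus_hook: Option<BusHook>,
}

impl InferenceLlmModule {
    /// Bus-less construction: nothing is ever published.
    fn new() -> Self {
        Self { bus_hook: None }
    }

    /// The bus is fixed at construction, so it cannot appear mid-service.
    fn with_bus(bus: Arc<MessageBus>) -> Self {
        Self { bus_hook: Some(BusHook { bus }) }
    }

    fn handle_command(&self, command: &str, request_id: u64) -> Result<u64, String> {
        if command != "inference/llm/request" {
            // Error path: loud in the Result, silent on the bus.
            return Err(format!("unknown command: {}", command));
        }
        if let Some(hook) = &self.bus_hook {
            // Two separate publishes, one per event (synchronous in this sketch).
            hook.bus.publish("inference/llm.first_token", request_id);
            hook.bus.publish("inference/llm.complete", request_id);
        }
        // The authoritative answer is the return value, not the bus traffic.
        Ok(request_id)
    }
}
```

Keeping the hook as an `Option` set once in the constructor means the bus-less and bus-wired paths share one `handle_command` body.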

…n layer + new constructors) (#1395)

Bridges the substrate's typed InferenceRequest/InferenceComplete surface to the existing AIProviderAdapter trait (LlamaCppAdapter for local llama.cpp). PR-5 ships the LlamaCppAdapter Runtime wiring + the end-to-end stub-adapter test; PR-4 ships the translation logic + new constructors so PR-5 is just plumbing.

What lands
- InferenceRequest.prompt_text: Option<String> — PR-4 wire addition for adapter-based engines that tokenize internally. Backwards-compat (Option = optional on wire).
- InferenceComplete.completion_text: Option<String> — wire addition for adapter-based engines that return text not tokens.
- InferenceLlmModule.adapter: Option<Arc<dyn AIProviderAdapter>>.
- with_adapter(adapter) constructor: real-inference + no bus.
- with_bus_and_adapter(bus, registry, adapter) constructor: the full production wiring (adapter + bus publishing).
- handle_request: routes via adapter when wired + prompt_text present; refuses loud when adapter wired + no prompt_text (raw-token path not yet implemented — never silent fallback); falls back to PR-2 stub when no adapter.
- run_adapter_inference(adapter, request, prompt_text) — translates InferenceRequest → TextGenerationRequest, calls adapter, translates TextGenerationResponse → (InferenceComplete, FirstTokenEmitted).
- translate_adapter_response(request, response) — pure-function body of the response-side translation.
- translate_adapter_finish_reason(adapter_reason) — cross-enum mapping: Stop→Stop, Length→MaxTokens, ToolUse→Error{reason} (loud refusal — inference-llm doesn't model tool-use), Error→Error{reason}.

Wire-shape decisions
- max_tokens=0 in substrate's GenerationBudget translates to None on adapter's wire. Substrate convention: 0=unlimited, caller takes duration responsibility. Adapter convention: None=unlimited, 0=stop immediately. The substrate's "stop immediately" doesn't have an encoding because no caller would ask for it.
- stop_sequences: empty Vec on substrate translates to None on adapter (adapter convention: None = no caller stop sequences).
- persona_id propagates to adapter as stringified UUID for per-persona resource attribution (matches existing adapter convention from PersonaResponseGenerator).
- purpose hardcoded "inference-llm" for adapter routing diagnostics.

Sub-fix: missing TS bindings from PR-1
PR-1 (#1387) shipped the Rust types but the shared/generated/inference_llm/ directory of TS exports wasn't included in the commit (regen produced them locally; they didn't get staged). PR-4 ships all 10 TS files + the barrel index. Closes a wire-contract gap.

Tests
13 new behavioral tests (44 total in inference::llm_module + inference::llm_module_service + inference::llm_module_bus):
- translate_adapter_response_carries_text_and_usage — completion_text + tokens_generated mapping
- translate_finish_reason_covers_all_adapter_variants — cross-enum mapping pin
- with_adapter_constructor_routes_via_adapter_path — constructors compile + no-adapter regression
- 8 existing PR-2 + 4 existing PR-3b tests still pass (no regressions)
End-to-end "stub adapter via Arc<dyn AIProviderAdapter>" tests deferred to PR-5: the AIProviderAdapter trait has 8+ methods (provider_id / api_style / default_model / get_available_models / health_check / model_metadata / capabilities / initialize / shutdown / generate_text / create_embedding) and implementing all of them on a test stub here would pull in ProviderHealth + AdapterCapabilities + ApiStyle + ModelInfo + their dependencies — bigger than atomic-slice. PR-5 will wire LlamaCppAdapter directly through Runtime registration.
44/44 inference::llm_module tests pass. No regressions across other 2928 lib tests.

Stack
- #1387 — inference-llm PR-1: typed event surface
- #1391 — inference-llm PR-2: ServiceModule impl (stub-backed)
- #1392 — inference-llm PR-3a: bus keys + publishing helpers
- #1393 — inference-llm PR-3b: auto-publish wiring
- THIS PR — inference-llm PR-4: adapter integration (translation + constructors)
- NEXT — PR-5: LlamaCppAdapter Runtime wiring + end-to-end integration test through real (or test-mock) adapter

Co-authored-by: Test <test@test.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
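The cross-enum mapping and the two wire-shape conventions above are small enough to sketch as pure functions. The enum definitions, the reason strings, and the `translate_max_tokens` / `translate_stop_sequences` helper names are assumptions (the real adapter variants may carry payloads); the mapping itself (Stop→Stop, Length→MaxTokens, ToolUse and Error→Error{reason}, 0→None, empty→None) follows the commit message.

```rust
/// Hypothetical mirror of the adapter-side finish reason.
#[derive(Debug, PartialEq)]
enum AdapterFinishReason {
    Stop,
    Length,
    ToolUse,
    Error,
}

/// Hypothetical mirror of the substrate-side finish reason.
#[derive(Debug, PartialEq)]
enum FinishReason {
    Stop,
    MaxTokens,
    Error { reason: String },
}

/// Cross-enum mapping; the reason strings are illustrative only.
fn translate_adapter_finish_reason(adapter_reason: AdapterFinishReason) -> FinishReason {
    match adapter_reason {
        AdapterFinishReason::Stop => FinishReason::Stop,
        AdapterFinishReason::Length => FinishReason::MaxTokens,
        // Tool-use is not modeled here, so it is refused loudly as an error.
        AdapterFinishReason::ToolUse => FinishReason::Error {
            reason: "tool-use is not modeled by inference-llm".to_string(),
        },
        AdapterFinishReason::Error => FinishReason::Error {
            reason: "adapter reported an error".to_string(),
        },
    }
}

/// Substrate 0 means "unlimited"; the adapter spells that as None.
fn translate_max_tokens(substrate_max_tokens: u32) -> Option<u32> {
    if substrate_max_tokens == 0 {
        None
    } else {
        Some(substrate_max_tokens)
    }
}

/// An empty list means "no caller stop sequences", which the adapter spells as None.
fn translate_stop_sequences(stop_sequences: Vec<String>) -> Option<Vec<String>> {
    if stop_sequences.is_empty() {
        None
    } else {
        Some(stop_sequences)
    }
}
```

Because the `match` is exhaustive, adding a variant on the adapter side becomes a compile error here rather than a silent fallthrough.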

Wires InferenceLlmModule into the Runtime so it's callable from the cognition path via inference/llm/request commands.

What lands
- Add "inference-llm" to EXPECTED_MODULES in runtime/runtime.rs
- runtime.register(Arc::new(InferenceLlmModule::new())) in ipc/mod.rs alongside the existing InferenceModule registration

Design choices
- Constructed via the .new() (bus-less, stub-backed) constructor rather than .with_bus_and_adapter(). Reason: the with_bus_and_adapter constructor requires an AIProviderAdapter Arc, which would couple PR-5's runtime registration to a specific LlamaCppAdapter init lifecycle. The substrate's LlamaCppAdapter is owned by AIProviderModule's adapter registry with its own initialization phase; threading the adapter Arc here would either duplicate the registration or create an init-ordering dependency this slice shouldn't introduce.
- The stub-backed registration is still useful: it exposes the inference/llm/request command surface to the cognition path so downstream PRs (turn-execute that chains drain-turn-frame → response_prompt → inference/llm/request) can wire against the real command name. Bus + adapter integration is a follow-up PR that updates the construction call here.

What is NOT changed
- AIProviderModule + LlamaCppAdapter unchanged
- All InferenceLlmModule trait impl logic unchanged (PR-2/3/4 work intact)
- The stub vs real-adapter swap point stays exactly where PR-4 put it: with_bus_and_adapter constructor + run_adapter_inference function

Tests
- cargo build --features metal,accelerate --lib clean (no new test fixtures needed — the module's existing 44/44 tests cover the trait-impl correctness; this PR just plumbs construction into runtime startup)
- EXPECTED_MODULES enforcement validates at boot: if the registration is missing the runtime fails with "missing inference-llm" error
- Pre-push gate clean

Stack
- #1387 PR-1: typed event surface
- #1391 PR-2: ServiceModule impl (stub-backed)
- #1392 PR-3a: bus keys + publishing helpers
- #1393 PR-3b: auto-publish wiring
- #1395 PR-4: adapter integration (translation + new constructors)
- THIS PR — PR-5: Runtime registration
- FOLLOW-UP — adapter Arc wiring when LlamaCppAdapter init phase is integrated with Runtime startup

Co-authored-by: Test <test@test.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
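The boot-time EXPECTED_MODULES check mentioned above could be approximated as follows. This is a guess at the shape, not the runtime's code: the `verify_expected_modules` function, the contents of the list (in particular the `"inference"` entry), and the `String` error type are hypothetical; only the `EXPECTED_MODULES` name, the `"inference-llm"` entry, and the "missing inference-llm" message come from the commit text.

```rust
use std::collections::HashSet;

/// Assumed contents: only "inference-llm" is confirmed by the commit message.
const EXPECTED_MODULES: &[&str] = &["inference", "inference-llm"];

/// Hypothetical boot check: fails on the first expected module that was not registered.
fn verify_expected_modules(registered: &[&str]) -> Result<(), String> {
    let present: HashSet<&str> = registered.iter().copied().collect();
    for name in EXPECTED_MODULES {
        if !present.contains(name) {
            return Err(format!("missing {}", name));
        }
    }
    Ok(())
}
```

A check of this kind turns a forgotten `runtime.register(...)` call into a startup failure instead of a command that silently has no handler.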

Test and others added 2 commits May 18, 2026 11:20

joelteply merged commit 14b58fa into canary May 18, 2026
3 checks passed

joelteply deleted the feat/inference-llm-module-pr2 branch May 18, 2026 16:34

github-actions Bot added the size: L label May 18, 2026

joelteply mentioned this pull request May 18, 2026
- feat(inference): inference-llm PR-3a — canonical ArtifactKeys + publishing helpers #1392 (Merged, 2 tasks)
- feat(inference): inference-llm PR-3b — InferenceLlmModule auto-publishes via bus hook (pure Rust) #1393 (Merged, 2 tasks)
- feat(inference): inference-llm PR-4 — adapter integration (translation + new constructors) #1395 (Merged, 3 tasks)
- feat(inference): inference-llm PR-5 — Runtime registration #1404 (Merged, 4 tasks)

joelteply merged 2 commits into canary from feat/inference-llm-module-pr2
