feat(inference): inference-llm PR-2 — InferenceLlmModule ServiceModule impl (stub-backed) #1391
Merged
…e impl
PR-2 of inference-llm. Wires the ServiceModule that accepts
InferenceRequest commands + emits InferenceComplete +
FirstTokenEmitted responses. The actual llama.cpp invoke lands in
PR-3; PR-2 ships a STUB inference returning canned tokens so the
seam is testable end-to-end + downstream consumers
(sentinel-observer, VDD harness) wire to it today.
What lands
- InferenceLlmModule struct implementing ServiceModule
- ModuleConfig: name="inference-llm", priority=High,
command_prefixes=["inference/llm/"]
- handle_command for "inference/llm/request":
  - parses InferenceRequest JSON payload
  - runs stub inference (3 canned tokens, FinishReason::Stop)
  - returns InferenceResponse { complete, first_token } as JSON
- Loud typed errors for unknown commands + invalid payloads
- COMMAND_REQUEST = "inference/llm/request" constant pinned
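
The command seam listed above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the module's actual code: the real handler parses a JSON InferenceRequest, whereas this sketch assumes the payload is a plain decimal request id so that no serde dependency is needed. The field names, the CommandError enum, and the canned token values are all assumptions made for the example.

```rust
// Hypothetical sketch of the PR-2 command seam (std only).
// Assumption: the payload is a decimal request id instead of JSON.

pub const COMMAND_REQUEST: &str = "inference/llm/request";

#[derive(Debug, Clone, PartialEq)]
pub enum FinishReason {
    Stop,
}

#[derive(Debug, Clone, PartialEq)]
pub struct InferenceComplete {
    pub request_id: u64,
    pub tokens: Vec<u32>,
    pub finish_reason: FinishReason,
}

#[derive(Debug, Clone, PartialEq)]
pub struct FirstTokenEmitted {
    pub request_id: u64,
    pub token: u32,
}

/// Bundle returned to the caller: both events in one response.
#[derive(Debug, Clone, PartialEq)]
pub struct InferenceResponse {
    pub complete: InferenceComplete,
    pub first_token: FirstTokenEmitted,
}

/// Loud, typed errors (hypothetical variant names).
#[derive(Debug, Clone, PartialEq)]
pub enum CommandError {
    UnknownCommand(String),
    InvalidPayload(String),
    EmptyGeneration,
}

/// Stub engine: three canned tokens, always finishing with Stop.
pub fn run_stub_inference(request_id: u64) -> InferenceComplete {
    InferenceComplete {
        request_id,
        tokens: vec![1, 2, 3],
        finish_reason: FinishReason::Stop,
    }
}

/// Derives the first-token event from a finished generation, if any token exists.
pub fn first_token_for(complete: &InferenceComplete) -> Option<FirstTokenEmitted> {
    complete.tokens.first().map(|&token| FirstTokenEmitted {
        request_id: complete.request_id,
        token,
    })
}

/// Routes one command: unknown names and bad payloads fail loudly.
pub fn handle_command(command: &str, payload: &str) -> Result<InferenceResponse, CommandError> {
    if command != COMMAND_REQUEST {
        return Err(CommandError::UnknownCommand(command.to_string()));
    }
    let request_id: u64 = payload
        .trim()
        .parse()
        .map_err(|e: std::num::ParseIntError| CommandError::InvalidPayload(e.to_string()))?;
    let complete = run_stub_inference(request_id);
    let first_token = first_token_for(&complete).ok_or(CommandError::EmptyGeneration)?;
    Ok(InferenceResponse { complete, first_token })
}
```

Because the stub and the eventual real engine would both return an InferenceResponse, a regression test could compare the two shapes before the swap.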
Design choices
- Stub backed because PR-3 ships the real engine; the OUTER wire
shape stays identical across stub→real transition.
- pub(super) run_stub_inference + first_token_for helpers so PR-3
can keep a "stub-vs-real produce same wire shape" regression
test before swapping.
- Returns InferenceResponse bundle (complete + first_token) instead
of publishing two events separately. Caller decomposes if needed.
Tests
8 new tests pin the contract: config, command constant, route to
stub, loud error paths, serde round-trip, dyn dispatch. 8/8 pass.
No regressions across other 2934 lib tests.
Stack
- #1387 — inference-llm PR-1: typed event surface
- THIS PR — inference-llm PR-2: ServiceModule impl (stub-backed)
- NEXT — PR-3: real LlamaCppAdapter invoke + tokenizer + streaming
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR-2's earlier clippy pass removed file-scope InferenceRequestId import because production code doesn't use it directly (only deserializes from JSON). Test module DOES use it for constructing sample requests, so cargo test --lib failed with E0433. Same pattern as the genome/blob.rs fix earlier this session. Future me: when clippy says 'unused import' but the test mod uses the type, scope to the test mod rather than deleting outright.
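
The fix described in this commit message might look like the sketch below. The `ids` module and `describe_request` function are hypothetical stand-ins for wherever InferenceRequestId and the production code actually live; only the placement of the `use` line is the point.

```rust
// Hypothetical layout: `ids` stands in for the real home of InferenceRequestId.
pub mod ids {
    #[derive(Debug, Clone, PartialEq)]
    pub struct InferenceRequestId(pub u64);
}

// Production code never names InferenceRequestId directly, so a file-scope
// `use ids::InferenceRequestId;` here would be reported as an unused import.
pub fn describe_request(raw_id: u64) -> String {
    format!("request-{}", raw_id)
}

#[cfg(test)]
mod tests {
    // Import scoped to the test module: it is compiled only under `cargo test`,
    // so the non-test build has nothing unused and the tests still resolve the type.
    use super::describe_request;
    use super::ids::InferenceRequestId;

    #[test]
    fn sample_request_uses_the_id_type() {
        let id = InferenceRequestId(42);
        assert_eq!(describe_request(id.0), "request-42");
    }
}
```

Deleting the import outright is what produces E0433 in the test build, since the test module then refers to a type that is no longer in scope.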
joelteply added a commit that referenced this pull request May 18, 2026
…shing helpers (#1392)
PR-3a of inference-llm. Same pattern as my genome::bus PR-4 (#1358): name the canonical ArtifactKey constants + ship the async publishing helpers + subscriber convenience. The actual real-engine integration lands in PR-3b/PR-4; PR-3a ships the bus surface so downstream observers (sentinel-observer, VDD harness, audit-recorder) can wire to it today before the engine swap.
What lands
Four canonical ArtifactKeys under inference/:
- INFERENCE_REQUEST_KEY = "inference/llm.request"
- INFERENCE_COMPLETE_KEY = "inference/llm.complete"
- FIRST_TOKEN_EMITTED_KEY = "inference/llm.first_token"
- RESIDENCY_FAULT_KEY = "inference/llm.residency_fault"
Four async publishing helpers — serialize the typed event + publish through the artifact dispatch path (#1339 + #1343):
- publish_inference_request
- publish_inference_complete
- publish_first_token_emitted
- publish_residency_fault
Three subscriber-convenience surfaces:
- subscribe_to_inference_responses(bus, name) — most observers want outcomes (complete + first_token + fault), not requests
- inference_response_selectors() — three Exact selectors
- all_inference_selectors() — four selectors including request for full-firehose consumers (audit-recorder when it covers inference)
Design choices
- Two subscriber surfaces (response-only vs full firehose) because most observers don't want every request — they want outcomes. Audit-recorder + VDD harness may want the firehose for the prod-replay chain Joel pushed at #1385.
- Request key INFERENCE_REQUEST_KEY in the publish helpers but NOT in the default observer set. Producers (persona-cognition) emit requests; observers see responses. Wiring symmetry without the noise.
- Same naming convention as genome::bus (module/surface.event) for cross-module consistency.
What is deliberately deferred (PR-3b / PR-4)
- Wiring helpers INTO InferenceLlmModule::handle_command so it auto-publishes after each call. PR-3b plumbs Arc<MessageBus> + Arc<ModuleRegistry> through the module's constructor.
- Real LLM engine (LlamaCppAdapter integration) — PR-4
- InferenceRequest artifact subscription (module subscribes to requests via bus instead of going through command bus) — needs persona-cognition to publish via bus first
Tests
7 new tests on inference::llm_module_bus:
- keys_have_canonical_string_values (pin wire strings)
- response_selectors_cover_three_keys_as_exact
- all_selectors_cover_four_keys
- publish_inference_complete_routes_to_subscribed_module (end-to-end through artifact dispatch)
- each_publish_helper_routes_to_its_own_key
- response_only_subscriber_does_not_see_requests
- full_firehose_subscriber_sees_requests_too
7/7 pass. No regressions across other 2958 lib tests.
Stack
- #1387 — inference-llm PR-1: typed event surface
- #1391 — inference-llm PR-2: ServiceModule impl (stub-backed)
- THIS PR — inference-llm PR-3a: bus keys + publishing helpers
- NEXT — PR-3b: InferenceLlmModule auto-publishes via these helpers after each handle_command call
- THEN — PR-4: real LlamaCppAdapter invoke + tokenizer + streaming
Co-authored-by: Test <test@test.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
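
The two subscriber surfaces described in this commit can be pictured with a small sketch. The key strings come from the commit message; the `Selector` type and the `would_receive` helper are hypothetical simplifications, since the real bus types are not shown here.

```rust
pub const INFERENCE_REQUEST_KEY: &str = "inference/llm.request";
pub const INFERENCE_COMPLETE_KEY: &str = "inference/llm.complete";
pub const FIRST_TOKEN_EMITTED_KEY: &str = "inference/llm.first_token";
pub const RESIDENCY_FAULT_KEY: &str = "inference/llm.residency_fault";

/// Hypothetical selector type; only the Exact form is modeled here.
#[derive(Debug, Clone, PartialEq)]
pub enum Selector {
    Exact(&'static str),
}

impl Selector {
    /// An Exact selector matches only the identical key string.
    pub fn matches(&self, key: &str) -> bool {
        match self {
            Selector::Exact(expected) => *expected == key,
        }
    }
}

/// Response-only surface: outcomes (complete, first_token, fault), no requests.
pub fn inference_response_selectors() -> Vec<Selector> {
    vec![
        Selector::Exact(INFERENCE_COMPLETE_KEY),
        Selector::Exact(FIRST_TOKEN_EMITTED_KEY),
        Selector::Exact(RESIDENCY_FAULT_KEY),
    ]
}

/// Full firehose: the three response selectors plus the request key.
pub fn all_inference_selectors() -> Vec<Selector> {
    let mut selectors = vec![Selector::Exact(INFERENCE_REQUEST_KEY)];
    selectors.extend(inference_response_selectors());
    selectors
}

/// Illustrative helper: would a subscriber with these selectors see this key?
pub fn would_receive(selectors: &[Selector], key: &str) -> bool {
    selectors.iter().any(|s| s.matches(key))
}
```

Building the firehose set on top of the response set keeps the two lists from drifting apart if a fifth key is added later.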
joelteply added a commit that referenced this pull request May 18, 2026
…hes via bus hook (#1393)
PR-3b of inference-llm. Wires the bus helpers from PR-3a (#1392) INTO InferenceLlmModule's handle_command so every successful inference response auto-publishes InferenceComplete + FirstTokenEmitted to the trace bus. Closes the inference-llm bus loop: producer (command) → engine (stub for now) → response (CommandResult) → bus dispatch (complete + first_token) → subscriber (sentinel/VDD/audit).
What lands
- BusHook private struct: { bus: Arc<MessageBus>, registry: Arc<ModuleRegistry> }. Same shape as genome::local_manager BusHook (#1362).
- InferenceLlmModule.bus_hook: Option<BusHook> — None = bus-less PR-2 behavior; Some = auto-publish on every successful handle_command.
- with_bus(bus, registry) constructor — wires both Arcs at module construction; no in-flight switching (prevents the "bus added mid-service" race).
- handle_request body: on success, spawns publish_inference_complete and publish_first_token_emitted into the current tokio runtime via Handle::try_current. Spawn pattern (not await) avoids the DashMap borrow-across-await lifetime issue inside Send-bounded async_trait — same workaround as my genome LocalWorkingSetManager (#1362).
- spawn_publish_inference_complete + spawn_publish_first_token_emitted module-private helpers — Arcs cloned out before spawn so the &BusHook borrow doesn't outlive the spawn.
Design choices
- Publishing is best-effort observability. The authoritative response goes back through the CommandResult arm regardless of publish success — callers who need to know if a generation happened look at the Result, not the bus.
- Error paths (unknown command + invalid payload) do NOT publish. Tests pin this — bus events represent successful generations; errors are loud in the Result and silent on the bus.
- Two separate spawns (one per event) rather than one bundled publish. Lets subscribers see first_token even if the complete event hasn't dispatched yet (race-tolerant TTFT observability).
Tests
4 new bus tests (12 total):
- handle_command_with_bus_auto_publishes_complete_and_first_token — end-to-end: register subscriber, run handle_command, yield for spawn, verify both events landed with matching requestId
- handle_command_without_bus_does_not_publish — backwards-compat with PR-2 new() constructor
- handle_command_unknown_with_bus_does_not_publish — error paths silent on bus
- handle_command_invalid_payload_with_bus_does_not_publish — same invariant
12/12 pass on inference::llm_module_service. No regressions across other 2957 lib tests.
Stack
- #1387 — inference-llm PR-1: typed event surface
- #1391 — inference-llm PR-2: ServiceModule impl (stub-backed)
- #1392 — inference-llm PR-3a: bus keys + publishing helpers
- THIS PR — inference-llm PR-3b: auto-publish wiring
- NEXT — PR-4: real LlamaCppAdapter invoke + tokenizer + streaming (the stub stays in place until then; PR-4 swaps under the same external contract)
Co-authored-by: Test <test@test.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
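
The optional hook and the clone-then-spawn pattern above can be sketched with the standard library alone. This is an analogy under stated assumptions, not the module's code: `std::thread::spawn` stands in for the tokio spawn, an `mpsc::Sender` stands in for Arc<MessageBus> + Arc<ModuleRegistry>, and the payload is a decimal request id rather than JSON. The `BusEvent` enum and the `spawn_publish` helper are hypothetical.

```rust
use std::sync::mpsc::Sender;
use std::thread;

/// Hypothetical event type standing in for the two typed bus events.
#[derive(Debug, Clone, PartialEq)]
pub enum BusEvent {
    InferenceComplete { request_id: u64 },
    FirstTokenEmitted { request_id: u64 },
}

/// Stand-in for the private BusHook; the real one holds two Arcs.
pub struct BusHook {
    bus: Sender<BusEvent>,
}

pub struct InferenceLlmModule {
    /// None = bus-less behavior; Some = auto-publish on success.
    bus_hook: Option<BusHook>,
}

impl InferenceLlmModule {
    pub fn new() -> Self {
        Self { bus_hook: None }
    }

    /// The hook is fixed at construction; there is no way to add it later.
    pub fn with_bus(bus: Sender<BusEvent>) -> Self {
        Self { bus_hook: Some(BusHook { bus }) }
    }

    /// Returns the authoritative result; publishing is a best-effort side effect.
    pub fn handle_request(&self, payload: &str) -> Result<u64, String> {
        // Error path returns before any publish, so failures stay off the bus.
        let request_id: u64 = payload
            .trim()
            .parse()
            .map_err(|_| format!("invalid payload: {:?}", payload))?;
        if let Some(hook) = &self.bus_hook {
            // Two separate spawns, one per event.
            spawn_publish(hook, BusEvent::InferenceComplete { request_id });
            spawn_publish(hook, BusEvent::FirstTokenEmitted { request_id });
        }
        Ok(request_id)
    }
}

/// Clones the bus handle out first so the &BusHook borrow ends before the spawn.
fn spawn_publish(hook: &BusHook, event: BusEvent) {
    let bus = hook.bus.clone();
    thread::spawn(move || {
        // Best-effort: a failed send is ignored, the caller already has its Result.
        let _ = bus.send(event);
    });
}
```

Since the two events are published from independent tasks, a subscriber should not assume any ordering between them.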
Merged
joelteply added a commit that referenced this pull request May 18, 2026
…n layer + new constructors) (#1395)
Bridges the substrate's typed InferenceRequest/InferenceComplete surface to the existing AIProviderAdapter trait (LlamaCppAdapter for local llama.cpp). PR-5 ships the LlamaCppAdapter Runtime wiring + the end-to-end stub-adapter test; PR-4 ships the translation logic + new constructors so PR-5 is just plumbing.
What lands
- InferenceRequest.prompt_text: Option<String> — PR-4 wire addition for adapter-based engines that tokenize internally. Backwards-compat (Option = optional on wire).
- InferenceComplete.completion_text: Option<String> — wire addition for adapter-based engines that return text not tokens.
- InferenceLlmModule.adapter: Option<Arc<dyn AIProviderAdapter>>.
- with_adapter(adapter) constructor: real-inference + no bus.
- with_bus_and_adapter(bus, registry, adapter) constructor: the full production wiring (adapter + bus publishing).
- handle_request: routes via adapter when wired + prompt_text present; refuses loud when adapter wired + no prompt_text (raw-token path not yet implemented — never silent fallback); falls back to PR-2 stub when no adapter.
- run_adapter_inference(adapter, request, prompt_text) — translates InferenceRequest → TextGenerationRequest, calls adapter, translates TextGenerationResponse → (InferenceComplete, FirstTokenEmitted).
- translate_adapter_response(request, response) — pure-function body of the response-side translation.
- translate_adapter_finish_reason(adapter_reason) — cross-enum mapping: Stop→Stop, Length→MaxTokens, ToolUse→Error{reason} (loud refusal — inference-llm doesn't model tool-use), Error→Error{reason}.
Wire-shape decisions
- max_tokens=0 in substrate's GenerationBudget translates to None on adapter's wire. Substrate convention: 0=unlimited, caller takes duration responsibility. Adapter convention: None=unlimited, 0=stop immediately. The substrate's "stop immediately" doesn't have an encoding because no caller would ask for it.
- stop_sequences: empty Vec on substrate translates to None on adapter (adapter convention: None = no caller stop sequences).
- persona_id propagates to adapter as stringified UUID for per-persona resource attribution (matches existing adapter convention from PersonaResponseGenerator).
- purpose hardcoded "inference-llm" for adapter routing diagnostics.
Sub-fix: missing TS bindings from PR-1
PR-1 (#1387) shipped the Rust types but the shared/generated/inference_llm/ directory of TS exports wasn't included in the commit (regen produced them locally; they didn't get staged). PR-4 ships all 10 TS files + the barrel index. Closes a wire-contract gap.
Tests
13 new behavioral tests (44 total in inference::llm_module + inference::llm_module_service + inference::llm_module_bus):
- translate_adapter_response_carries_text_and_usage — completion_text + tokens_generated mapping
- translate_finish_reason_covers_all_adapter_variants — cross-enum mapping pin
- with_adapter_constructor_routes_via_adapter_path — constructors compile + no-adapter regression
- 8 existing PR-2 + 4 existing PR-3b tests still pass (no regressions)
End-to-end "stub adapter via Arc<dyn AIProviderAdapter>" tests deferred to PR-5: the AIProviderAdapter trait has 8+ methods (provider_id / api_style / default_model / get_available_models / health_check / model_metadata / capabilities / initialize / shutdown / generate_text / create_embedding) and implementing all of them on a test stub here would pull in ProviderHealth + AdapterCapabilities + ApiStyle + ModelInfo + their dependencies — bigger than atomic-slice. PR-5 will wire LlamaCppAdapter directly through Runtime registration.
44/44 inference::llm_module tests pass. No regressions across other 2928 lib tests.
Stack
- #1387 — inference-llm PR-1: typed event surface
- #1391 — inference-llm PR-2: ServiceModule impl (stub-backed)
- #1392 — inference-llm PR-3a: bus keys + publishing helpers
- #1393 — inference-llm PR-3b: auto-publish wiring
- THIS PR — inference-llm PR-4: adapter integration (translation + constructors)
- NEXT — PR-5: LlamaCppAdapter Runtime wiring + end-to-end integration test through real (or test-mock) adapter
Co-authored-by: Test <test@test.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
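
The translation rules and the routing decision in this commit can be sketched as pure functions. The mappings (Stop→Stop, Length→MaxTokens, ToolUse and Error→Error{reason}, 0→None, empty→None) come from the commit message; the enum definitions, the reason strings, the `Route` enum, and `choose_route` are assumptions made for this illustration.

```rust
/// Hypothetical mirror of the adapter-side finish reasons.
#[derive(Debug, Clone, PartialEq)]
pub enum AdapterFinishReason {
    Stop,
    Length,
    ToolUse,
    Error,
}

/// Hypothetical mirror of the substrate-side finish reasons.
#[derive(Debug, Clone, PartialEq)]
pub enum FinishReason {
    Stop,
    MaxTokens,
    Error { reason: String },
}

/// Cross-enum mapping; the reason strings are illustrative only.
pub fn translate_adapter_finish_reason(adapter_reason: AdapterFinishReason) -> FinishReason {
    match adapter_reason {
        AdapterFinishReason::Stop => FinishReason::Stop,
        AdapterFinishReason::Length => FinishReason::MaxTokens,
        // Tool use is refused loudly rather than silently mapped to Stop.
        AdapterFinishReason::ToolUse => FinishReason::Error {
            reason: "tool-use is not modeled by inference-llm".to_string(),
        },
        AdapterFinishReason::Error => FinishReason::Error {
            reason: "adapter reported an error".to_string(),
        },
    }
}

/// Substrate 0 means unlimited; the adapter spells unlimited as None.
pub fn translate_max_tokens(substrate_max_tokens: u32) -> Option<u32> {
    if substrate_max_tokens == 0 {
        None
    } else {
        Some(substrate_max_tokens)
    }
}

/// An empty Vec on the substrate becomes None on the adapter.
pub fn translate_stop_sequences(stop_sequences: Vec<String>) -> Option<Vec<String>> {
    if stop_sequences.is_empty() {
        None
    } else {
        Some(stop_sequences)
    }
}

#[derive(Debug, Clone, PartialEq)]
pub enum Route {
    Adapter,
    Stub,
}

/// Routing rule: adapter + prompt text uses the adapter, adapter without
/// prompt text is an error (never a silent fallback), no adapter uses the stub.
pub fn choose_route(adapter_wired: bool, prompt_text: Option<&str>) -> Result<Route, String> {
    match (adapter_wired, prompt_text) {
        (true, Some(_)) => Ok(Route::Adapter),
        (true, None) => Err("adapter wired but prompt_text missing".to_string()),
        (false, _) => Ok(Route::Stub),
    }
}
```

Note that a substrate value of 0 must never reach the adapter unchanged, because the adapter would read it as "stop immediately".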
joelteply added a commit that referenced this pull request May 18, 2026
Wires InferenceLlmModule into the Runtime so it's callable from the cognition path via inference/llm/request commands.
What lands
- Add "inference-llm" to EXPECTED_MODULES in runtime/runtime.rs
- runtime.register(Arc::new(InferenceLlmModule::new())) in ipc/mod.rs alongside the existing InferenceModule registration
Design choices
- Constructed via the .new() (bus-less, stub-backed) constructor rather than .with_bus_and_adapter(). Reason: the with_bus_and_adapter constructor requires an AIProviderAdapter Arc, which would couple PR-5's runtime registration to a specific LlamaCppAdapter init lifecycle. The substrate's LlamaCppAdapter is owned by AIProviderModule's adapter registry with its own initialization phase; threading the adapter Arc here would either duplicate the registration or create an init-ordering dependency this slice shouldn't introduce.
- The stub-backed registration is still useful: it exposes the inference/llm/request command surface to the cognition path so downstream PRs (turn-execute that chains drain-turn-frame → response_prompt → inference/llm/request) can wire against the real command name. Bus + adapter integration is a follow-up PR that updates the construction call here.
What is NOT changed
- AIProviderModule + LlamaCppAdapter unchanged
- All InferenceLlmModule trait impl logic unchanged (PR-2/3/4 work intact)
- The stub vs real-adapter swap point stays exactly where PR-4 put it: with_bus_and_adapter constructor + run_adapter_inference function
Tests
- cargo build --features metal,accelerate --lib clean (no new test fixtures needed — the module's existing 44/44 tests cover the trait-impl correctness; this PR just plumbs construction into runtime startup)
- EXPECTED_MODULES enforcement validates at boot: if the registration is missing the runtime fails with "missing inference-llm" error
- Pre-push gate clean
Stack
- #1387 PR-1: typed event surface
- #1391 PR-2: ServiceModule impl (stub-backed)
- #1392 PR-3a: bus keys + publishing helpers
- #1393 PR-3b: auto-publish wiring
- #1395 PR-4: adapter integration (translation + new constructors)
- THIS PR — PR-5: Runtime registration
- FOLLOW-UP — adapter Arc wiring when LlamaCppAdapter init phase is integrated with Runtime startup
Co-authored-by: Test <test@test.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
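
The boot-time EXPECTED_MODULES check mentioned above might work along these lines. This is a guess at the shape, not the runtime's actual code: the `Runtime` struct, its name-only `register` signature, the `validate` method, and the "inference" entry in the list are all hypothetical; only the "inference-llm" name and the "missing inference-llm" error text come from the commit message.

```rust
use std::collections::BTreeSet;

/// Hypothetical list; only "inference-llm" is confirmed by the commit message.
pub const EXPECTED_MODULES: &[&str] = &["inference", "inference-llm"];

/// Minimal stand-in for the runtime: tracks registered module names only.
#[derive(Default)]
pub struct Runtime {
    registered: BTreeSet<String>,
}

impl Runtime {
    /// The real register() takes an Arc'd module; a name is enough for this sketch.
    pub fn register(&mut self, name: &str) {
        self.registered.insert(name.to_string());
    }

    /// Boot-time check: fails on the first expected module that is absent.
    pub fn validate(&self) -> Result<(), String> {
        for expected in EXPECTED_MODULES {
            if !self.registered.contains(*expected) {
                return Err(format!("missing {}", expected));
            }
        }
        Ok(())
    }
}
```

A check of this kind turns a forgotten registration into a startup failure instead of an unknown-command error at the first inference request.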
Summary
PR-2 of inference-llm. Wires the ServiceModule that accepts InferenceRequest commands + emits InferenceComplete + FirstTokenEmitted responses. The actual llama.cpp invoke lands in PR-3; PR-2 ships a STUB inference returning canned tokens so the seam is testable end-to-end + downstream consumers (sentinel-observer, VDD harness) can wire to it today.
What lands
- InferenceLlmModule struct implementing ServiceModule
- ModuleConfig: name="inference-llm", priority=High, command_prefixes=["inference/llm/"]
- handle_command for "inference/llm/request":
  - returns InferenceResponse { complete, first_token } as JSON
- COMMAND_REQUEST = "inference/llm/request" constant pinned
Design choices
- pub(super) run_stub_inference + first_token_for helpers so PR-3 can keep a "stub-vs-real produce same wire shape" regression test before swapping.
- Returns InferenceResponse bundle (complete + first_token) instead of publishing two events separately. Avoids racing the two events.
What is deliberately deferred (PR-3)
- Real LLM engine (LlamaCppAdapter integration)
- … artifact_subscriptions)
- ResidencyFault emission on missing-page (needs working-set integration)
Test plan
cargo test --lib --features metal,accelerate inference::llm_module_service — 8/8 pass:
- config_reports_name_and_command_prefix
- config_priority_is_high
- command_request_has_canonical_string_value
- handle_command_routes_request_to_stub_inference (end-to-end)
- handle_command_unknown_returns_loud_error
- handle_command_invalid_payload_returns_typed_error
- inference_response_round_trips_through_serde
- module_is_object_safe_for_dyn_service_module
Stack
- … LlamaCppAdapter invoke + tokenizer + streaming
🤖 Generated with Claude Code