feat(inference): inference-llm PR-1 — typed event surface (MODULE-CATALOG §II)#1387

Merged: joelteply merged 1 commit into canary from feat/inference-llm-types-pr1 on May 18, 2026
@joelteply

Summary

PR-1 of inference-llm (MODULE-CATALOG §II). Pure typed event surface for the local-LLM generation module. PR-2 ships the InferenceLlmModule ServiceModule impl + bus wiring; PR-3 ships tokenizer + llama.cpp invoke + token stream.

Unblocked by:

  • Lane H end-to-end (codex's 11-PR governor stack) — provides resource budget + cascade pressure signals
  • Demand-aligned-recall (mine, 8 PRs) — provides RankedPool that feeds CompositionPlan
  • Working-set-manager (mine, 5 PRs) — provides per-persona resident pages

Same slice shape as my genome (#1346) and recall (#1366) PR-1s.

What lands

  • InferenceRequestId(Uuid) — typed newtype; all four events use requestId field name for correlation
  • CompositionPlan(ArtifactId) — opaque reference; composer fills full shape later
  • SamplingParams with llama.cpp-baseline defaults (0.8/0.95/40/1.1)
  • GenerationBudget { max_tokens, max_duration_ms } — both honored
  • FinishReason enum: Stop / MaxTokens / MaxDuration / StopSequence { matched } / Error { reason }
  • InferenceRequest: [InferenceRequest] subscription event
  • InferenceComplete — emission with timing + finish + tokens
  • FirstTokenEmitted — TTFT observability event (microsecond precision)
  • ResidencyFault — emission when inference would need a not-resident page; sentinel learns + upgrades tier policy

Test plan

  • cargo test --lib --features metal,accelerate inference::llm_module — 22/22 pass (13 behavioral + 9 ts-rs export_bindings)
  • No regressions across other 2883 lib tests
  • Pre-push gate clean
  • Clippy baseline bump 154→156 (canary drift + 2 doc-list fixes in my file)

Stack

🤖 Generated with Claude Code

…ALOG §II)

PR-1 of inference-llm. Pure typed event surface for the local-LLM
generation module. The module itself (composition → tokenizer →
llama.cpp invoke → token stream) lands in PR-2/PR-3; PR-1 ships
the wire so producers + consumers can build against it today.

Unblocked by my just-shipped Lane H + recall + working-set stacks.

What lands

- InferenceRequestId — typed Uuid newtype; all four events carry
  the same field name (requestId on wire) for correlation
- CompositionPlan — opaque ArtifactId reference; composer module
  fills the full shape later
- SamplingParams { temperature, top_p, top_k, repeat_penalty }
  with llama.cpp-baseline defaults (0.8 / 0.95 / 40 / 1.1)
- GenerationBudget { max_tokens, max_duration_ms } — both honored
- FinishReason enum: Stop / MaxTokens / MaxDuration / StopSequence
  { matched } / Error { reason } — typed per Joel's never-swallow
- InferenceRequest — [InferenceRequest] subscription event
- InferenceComplete — emission with completion + finish + timing
- FirstTokenEmitted — emission for TTFT observability
  (microsecond precision; sub-ms achievable on warm models)
- ResidencyFault — emission when inference would need a not-
  resident page; sentinel learns + upgrades tier policy
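
A minimal sketch of a few of the types listed above, assuming std only. The field types (f32, u32, u64), the u128 stand-in for the Uuid newtype, and the is_error helper are illustrative assumptions; only the type names, field names, variant names, and default values come from this message.

```rust
/// Stand-in for the Uuid newtype (assumption: the real type wraps a Uuid).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct InferenceRequestId(pub u128);

/// Sampling knobs; the numeric types are assumed, the names are from the PR.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SamplingParams {
    pub temperature: f32,
    pub top_p: f32,
    pub top_k: u32,
    pub repeat_penalty: f32,
}

impl Default for SamplingParams {
    /// llama.cpp-baseline defaults quoted in the PR: 0.8 / 0.95 / 40 / 1.1.
    fn default() -> Self {
        Self {
            temperature: 0.8,
            top_p: 0.95,
            top_k: 40,
            repeat_penalty: 1.1,
        }
    }
}

/// Both limits are honored; integer widths are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GenerationBudget {
    pub max_tokens: u32,
    pub max_duration_ms: u64,
}

/// Typed finish reasons, so a failure is never silently swallowed.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FinishReason {
    Stop,
    MaxTokens,
    MaxDuration,
    StopSequence { matched: String },
    Error { reason: String },
}

impl FinishReason {
    /// Hypothetical helper: true only for the failure variant.
    pub fn is_error(&self) -> bool {
        matches!(self, FinishReason::Error { .. })
    }
}
```

Callers that only care about overriding one knob can write `SamplingParams { top_k: 20, ..Default::default() }` and inherit the rest.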

Tests

13 behavioral tests + 9 ts-rs export_bindings = 22 total. 22/22 pass.
No regressions across other 2883 lib tests.

Clippy baseline bump 154→156 — drift from recent canary merges.
Fixed two doc-list warnings in this file (reworded "* 1000" math
to avoid being parsed as a markdown list item).
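
For illustration, the doc-list issue looks roughly like the sketch below. The function name and wording are hypothetical, not the actual lines from this file; the point is only that a wrapped doc line beginning with `* ` is read by Markdown as a bullet item.

```rust
// Problematic shape (illustrative): the second doc line starts with `* 1000`,
// so Markdown treats it as a list item rather than continued prose:
//
//     /// Converts milliseconds to microseconds: ms
//     /// * 1000, saturating on overflow.
//
// Reworded so that no doc line starts with `*`:

/// Converts milliseconds to microseconds (ms times 1000), saturating on overflow.
pub fn ms_to_us(ms: u64) -> u64 {
    ms.saturating_mul(1000)
}
```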

Stack

- Lane H end-to-end (codex's #1331 to #1373)
- Working-set-manager + DAR end-to-end (mine, #1346 to #1382)
- THIS PR — inference-llm PR-1: typed event surface
- NEXT — PR-2: InferenceLlmModule ServiceModule impl wired to
  the artifact dispatch
- THEN — PR-3: tokenizer + llama.cpp invoke + token stream

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@joelteply joelteply merged commit b810855 into canary May 18, 2026
3 checks passed
@joelteply joelteply deleted the feat/inference-llm-types-pr1 branch May 18, 2026 16:09
joelteply added a commit that referenced this pull request May 18, 2026
…e impl (stub-backed) (#1391)

* feat(inference): inference-llm PR-2 — InferenceLlmModule ServiceModule impl

PR-2 of inference-llm. Wires the ServiceModule that accepts
InferenceRequest commands + emits InferenceComplete +
FirstTokenEmitted responses. The actual llama.cpp invoke lands in
PR-3; PR-2 ships a STUB inference returning canned tokens so the
seam is testable end-to-end + downstream consumers
(sentinel-observer, VDD harness) wire to it today.

What lands

- InferenceLlmModule struct implementing ServiceModule
- ModuleConfig: name="inference-llm", priority=High,
  command_prefixes=["inference/llm/"]
- handle_command for "inference/llm/request":
  - parses InferenceRequest JSON payload
  - runs stub inference (3 canned tokens, FinishReason::Stop)
  - returns InferenceResponse { complete, first_token } as JSON
- Loud typed errors for unknown commands + invalid payloads
- COMMAND_REQUEST = "inference/llm/request" constant pinned
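
The routing described above can be sketched as follows, assuming std only. The payload here is a plain decimal request id standing in for the real InferenceRequest JSON, and InferenceResponse is flattened rather than bundling { complete, first_token }; CommandError and its variants are hypothetical names.

```rust
/// Command name pinned by the PR.
pub const COMMAND_REQUEST: &str = "inference/llm/request";

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FinishReason {
    Stop,
}

/// Simplified response (assumption: the real one bundles complete + first_token).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct InferenceResponse {
    pub request_id: u64,
    pub tokens: Vec<u32>,
    pub finish: FinishReason,
}

/// Hypothetical typed errors: both failure modes are loud, never swallowed.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CommandError {
    UnknownCommand(String),
    InvalidPayload(String),
}

/// Stub engine: three canned tokens, always finishing with Stop.
fn run_stub_inference(request_id: u64) -> InferenceResponse {
    InferenceResponse {
        request_id,
        tokens: vec![1, 2, 3],
        finish: FinishReason::Stop,
    }
}

/// Routes one command; anything other than COMMAND_REQUEST is rejected.
pub fn handle_command(command: &str, payload: &str) -> Result<InferenceResponse, CommandError> {
    if command != COMMAND_REQUEST {
        return Err(CommandError::UnknownCommand(command.to_string()));
    }
    // Stand-in for JSON parsing: the payload is just a decimal request id.
    let request_id = payload
        .trim()
        .parse::<u64>()
        .map_err(|e| CommandError::InvalidPayload(e.to_string()))?;
    Ok(run_stub_inference(request_id))
}
```

Because the stub and a future real engine would both return the same response type, a regression test can compare the two shapes before the swap.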

Design choices

- Stub-backed because PR-3 ships the real engine; the OUTER wire
  shape stays identical across stub→real transition.
- pub(super) run_stub_inference + first_token_for helpers so PR-3
  can keep a "stub-vs-real produce same wire shape" regression
  test before swapping.
- Returns InferenceResponse bundle (complete + first_token) instead
  of publishing two events separately. Caller decomposes if needed.

Tests

8 new tests pin the contract: config, command constant, route to
stub, loud error paths, serde round-trip, dyn dispatch. 8/8 pass.
No regressions across other 2934 lib tests.

Stack

- #1387 — inference-llm PR-1: typed event surface
- THIS PR — inference-llm PR-2: ServiceModule impl (stub-backed)
- NEXT — PR-3: real LlamaCppAdapter invoke + tokenizer + streaming

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(inference): scope InferenceRequestId import to test module

PR-2's earlier clippy pass removed file-scope InferenceRequestId
import because production code doesn't use it directly (only
deserializes from JSON). Test module DOES use it for constructing
sample requests, so cargo test --lib failed with E0433.

Same pattern as the genome/blob.rs fix earlier this session. Future
me: when clippy says 'unused import' but the test mod uses the type,
scope to the test mod rather than deleting outright.
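
A minimal sketch of the pattern, with hypothetical stand-in types (the real InferenceRequest has more fields and the id wraps a Uuid). Production code never names InferenceRequestId, so the import lives only in the test module that constructs it.

```rust
pub mod types {
    /// Stand-in for the real Uuid newtype.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub struct InferenceRequestId(pub u64);

    /// Simplified request (assumption: the real struct carries more fields).
    #[derive(Debug, Clone, PartialEq, Eq)]
    pub struct InferenceRequest {
        pub request_id: InferenceRequestId,
        pub prompt: String,
    }
}

// Production code names only `InferenceRequest`; a file-scope
// `use types::InferenceRequestId;` would be reported as unused here.
use types::InferenceRequest;

/// Hypothetical production function that never mentions the id type.
pub fn prompt_len(request: &InferenceRequest) -> usize {
    request.prompt.len()
}

#[cfg(test)]
mod request_tests {
    use super::prompt_len;
    // Import scoped to the test module, where the type is actually constructed.
    use super::types::{InferenceRequest, InferenceRequestId};

    #[test]
    fn counts_prompt_bytes() {
        let request = InferenceRequest {
            request_id: InferenceRequestId(1),
            prompt: "hello".to_string(),
        };
        assert_eq!(prompt_len(&request), 5);
    }
}
```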

---------

Co-authored-by: Test <test@test.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
joelteply added a commit that referenced this pull request May 18, 2026
…shing helpers (#1392)

PR-3a of inference-llm. Same pattern as my genome::bus PR-4
(#1358): name the canonical ArtifactKey constants + ship the async
publishing helpers + subscriber convenience. The actual real-engine
integration lands in PR-3b/PR-4; PR-3a ships the bus surface so
downstream observers (sentinel-observer, VDD harness, audit-recorder)
can wire to it today before the engine swap.

What lands

Four canonical ArtifactKeys under inference/:
- INFERENCE_REQUEST_KEY = "inference/llm.request"
- INFERENCE_COMPLETE_KEY = "inference/llm.complete"
- FIRST_TOKEN_EMITTED_KEY = "inference/llm.first_token"
- RESIDENCY_FAULT_KEY = "inference/llm.residency_fault"

Four async publishing helpers — serialize the typed event + publish
through the artifact dispatch path (#1339 + #1343):
- publish_inference_request
- publish_inference_complete
- publish_first_token_emitted
- publish_residency_fault

Three subscriber-convenience surfaces:
- subscribe_to_inference_responses(bus, name) — most observers want
  outcomes (complete + first_token + fault), not requests
- inference_response_selectors() — three Exact selectors
- all_inference_selectors() — four selectors including request for
  full-firehose consumers (audit-recorder when it covers inference)
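
The two selector sets can be sketched like this. The key strings are the ones pinned above; the Selector enum, its matches method, and any_match are hypothetical stand-ins for the real selector type, and the functions return plain Vecs for simplicity.

```rust
pub const INFERENCE_REQUEST_KEY: &str = "inference/llm.request";
pub const INFERENCE_COMPLETE_KEY: &str = "inference/llm.complete";
pub const FIRST_TOKEN_EMITTED_KEY: &str = "inference/llm.first_token";
pub const RESIDENCY_FAULT_KEY: &str = "inference/llm.residency_fault";

/// Hypothetical selector type; only the exact-match case is modeled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Selector {
    Exact(&'static str),
}

impl Selector {
    /// True when the published key equals the selector's key.
    pub fn matches(&self, key: &str) -> bool {
        match self {
            Selector::Exact(expected) => *expected == key,
        }
    }
}

/// Outcomes only: complete, first_token, residency_fault.
pub fn inference_response_selectors() -> Vec<Selector> {
    vec![
        Selector::Exact(INFERENCE_COMPLETE_KEY),
        Selector::Exact(FIRST_TOKEN_EMITTED_KEY),
        Selector::Exact(RESIDENCY_FAULT_KEY),
    ]
}

/// Full firehose: the request key plus the three response keys.
pub fn all_inference_selectors() -> Vec<Selector> {
    let mut selectors = vec![Selector::Exact(INFERENCE_REQUEST_KEY)];
    selectors.extend(inference_response_selectors());
    selectors
}

/// Hypothetical helper: would a subscriber with these selectors see `key`?
pub fn any_match(selectors: &[Selector], key: &str) -> bool {
    selectors.iter().any(|s| s.matches(key))
}
```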

Design choices

- Two subscriber surfaces (response-only vs full firehose) because
  most observers don't want every request — they want outcomes.
  Audit-recorder + VDD harness may want the firehose for the
  prod-replay chain Joel pushed at #1385.
- Request key INFERENCE_REQUEST_KEY in the publish helpers but NOT
  in the default observer set. Producers (persona-cognition) emit
  requests; observers see responses. Wiring symmetry without the
  noise.
- Same naming convention as genome::bus (module/surface.event) for
  cross-module consistency.

What is deliberately deferred (PR-3b / PR-4)

- Wiring helpers INTO InferenceLlmModule::handle_command so it
  auto-publishes after each call. PR-3b plumbs Arc<MessageBus> +
  Arc<ModuleRegistry> through the module's constructor.
- Real LLM engine (LlamaCppAdapter integration) — PR-4
- InferenceRequest artifact subscription (module subscribes to
  requests via bus instead of going through command bus) — needs
  persona-cognition to publish via bus first

Tests

7 new tests on inference::llm_module_bus:
- keys_have_canonical_string_values (pin wire strings)
- response_selectors_cover_three_keys_as_exact
- all_selectors_cover_four_keys
- publish_inference_complete_routes_to_subscribed_module
  (end-to-end through artifact dispatch)
- each_publish_helper_routes_to_its_own_key
- response_only_subscriber_does_not_see_requests
- full_firehose_subscriber_sees_requests_too

7/7 pass. No regressions across other 2958 lib tests.

Stack

- #1387 — inference-llm PR-1: typed event surface
- #1391 — inference-llm PR-2: ServiceModule impl (stub-backed)
- THIS PR — inference-llm PR-3a: bus keys + publishing helpers
- NEXT — PR-3b: InferenceLlmModule auto-publishes via these helpers
  after each handle_command call
- THEN — PR-4: real LlamaCppAdapter invoke + tokenizer + streaming

Co-authored-by: Test <test@test.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
joelteply added a commit that referenced this pull request May 18, 2026
…hes via bus hook (#1393)

PR-3b of inference-llm. Wires the bus helpers from PR-3a (#1392)
INTO InferenceLlmModule's handle_command so every successful
inference response auto-publishes InferenceComplete +
FirstTokenEmitted to the trace bus.

Closes the inference-llm bus loop: producer (command) → engine
(stub for now) → response (CommandResult) → bus dispatch
(complete + first_token) → subscriber (sentinel/VDD/audit).

What lands

- BusHook private struct: { bus: Arc<MessageBus>, registry:
  Arc<ModuleRegistry> }. Same shape as genome::local_manager
  BusHook (#1362).
- InferenceLlmModule.bus_hook: Option<BusHook> — None = bus-less
  PR-2 behavior; Some = auto-publish on every successful
  handle_command.
- with_bus(bus, registry) constructor — wires both Arcs at module
  construction; no in-flight switching (prevents the "bus added
  mid-service" race).
- handle_request body: on success, spawns publish_inference_complete
  and publish_first_token_emitted into the current tokio runtime
  via Handle::try_current. Spawn pattern (not await) avoids the
  DashMap borrow-across-await lifetime issue inside Send-bounded
  async_trait — same workaround as my genome
  LocalWorkingSetManager (#1362).
- spawn_publish_inference_complete + spawn_publish_first_token_emitted
  module-private helpers — Arcs cloned out before spawn so the
  &BusHook borrow doesn't outlive the spawn.

Design choices

- Publishing is best-effort observability. The authoritative response
  goes back through the CommandResult arm regardless of publish
  success — callers who need to know if a generation happened look
  at the Result, not the bus.
- Error paths (unknown command + invalid payload) do NOT publish.
  Tests pin this — bus events represent successful generations;
  errors are loud in the Result and silent on the bus.
- Two separate spawns (one per event) rather than one bundled
  publish. Lets subscribers see first_token even if the complete
  event hasn't dispatched yet (race-tolerant TTFT observability).
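
A simplified sketch of the publish-on-success invariant, assuming std only. The real module spawns the publishes onto the tokio runtime via Handle::try_current; here a hypothetical RecordingBus publishes synchronously so the example stays self-contained, and the hook holds a single Arc rather than bus plus registry.

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for the message bus: records (key, request_id) pairs.
#[derive(Default)]
pub struct RecordingBus {
    pub events: Mutex<Vec<(&'static str, u64)>>,
}

impl RecordingBus {
    fn publish(&self, key: &'static str, request_id: u64) {
        self.events.lock().unwrap().push((key, request_id));
    }
}

pub struct InferenceLlmModule {
    /// None = bus-less behavior; Some = auto-publish on success.
    bus_hook: Option<Arc<RecordingBus>>,
}

impl InferenceLlmModule {
    pub fn new() -> Self {
        Self { bus_hook: None }
    }

    /// The bus is wired at construction; there is no setter to add it later.
    pub fn with_bus(bus: Arc<RecordingBus>) -> Self {
        Self { bus_hook: Some(bus) }
    }

    pub fn handle_command(&self, command: &str, request_id: u64) -> Result<u64, String> {
        if command != "inference/llm/request" {
            // Error path: loud in the Result, nothing goes to the bus.
            return Err(format!("unknown command: {command}"));
        }
        if let Some(bus) = &self.bus_hook {
            // Two separate publishes, one per event.
            bus.publish("inference/llm.complete", request_id);
            bus.publish("inference/llm.first_token", request_id);
        }
        // The authoritative answer is the Result, regardless of publishing.
        Ok(request_id)
    }
}
```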

Tests

4 new bus tests (12 total):
- handle_command_with_bus_auto_publishes_complete_and_first_token
  — end-to-end: register subscriber, run handle_command, yield
  for spawn, verify both events landed with matching requestId
- handle_command_without_bus_does_not_publish — backwards-compat
  with PR-2 new() constructor
- handle_command_unknown_with_bus_does_not_publish — error paths
  silent on bus
- handle_command_invalid_payload_with_bus_does_not_publish —
  same invariant

12/12 pass on inference::llm_module_service. No regressions
across other 2957 lib tests.

Stack

- #1387 — inference-llm PR-1: typed event surface
- #1391 — inference-llm PR-2: ServiceModule impl (stub-backed)
- #1392 — inference-llm PR-3a: bus keys + publishing helpers
- THIS PR — inference-llm PR-3b: auto-publish wiring
- NEXT — PR-4: real LlamaCppAdapter invoke + tokenizer + streaming
  (the stub stays in place until then; PR-4 swaps under the same
  external contract)

Co-authored-by: Test <test@test.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
joelteply added a commit that referenced this pull request May 18, 2026
…n layer + new constructors) (#1395)

Bridges the substrate's typed InferenceRequest/InferenceComplete surface
to the existing AIProviderAdapter trait (LlamaCppAdapter for local
llama.cpp). PR-5 ships the LlamaCppAdapter Runtime wiring + the
end-to-end stub-adapter test; PR-4 ships the translation logic +
new constructors so PR-5 is just plumbing.

What lands

- InferenceRequest.prompt_text: Option<String> — PR-4 wire
  addition for adapter-based engines that tokenize internally.
  Backwards-compat (Option = optional on wire).
- InferenceComplete.completion_text: Option<String> — wire
  addition for adapter-based engines that return text not tokens.
- InferenceLlmModule.adapter: Option<Arc<dyn AIProviderAdapter>>.
- with_adapter(adapter) constructor: real-inference + no bus.
- with_bus_and_adapter(bus, registry, adapter) constructor: the
  full production wiring (adapter + bus publishing).
- handle_request: routes via adapter when wired + prompt_text
  present; refuses loud when adapter wired + no prompt_text (raw-
  token path not yet implemented — never silent fallback); falls
  back to PR-2 stub when no adapter.
- run_adapter_inference(adapter, request, prompt_text) — translates
  InferenceRequest → TextGenerationRequest, calls adapter, translates
  TextGenerationResponse → (InferenceComplete, FirstTokenEmitted).
- translate_adapter_response(request, response) — pure-function
  body of the response-side translation.
- translate_adapter_finish_reason(adapter_reason) — cross-enum
  mapping: Stop→Stop, Length→MaxTokens, ToolUse→Error{reason}
  (loud refusal — inference-llm doesn't model tool-use), Error→
  Error{reason}.
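
The routing rule and the cross-enum mapping above can be sketched as below. AdapterFinishReason, Route, choose_route, and the error strings are hypothetical; the variant-to-variant mapping and the three routing cases follow this message.

```rust
/// Hypothetical mirror of the adapter-side finish reasons named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AdapterFinishReason {
    Stop,
    Length,
    ToolUse,
    Error,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FinishReason {
    Stop,
    MaxTokens,
    MaxDuration,
    StopSequence { matched: String },
    Error { reason: String },
}

/// Stop→Stop, Length→MaxTokens, ToolUse→Error, Error→Error.
pub fn translate_adapter_finish_reason(reason: AdapterFinishReason) -> FinishReason {
    match reason {
        AdapterFinishReason::Stop => FinishReason::Stop,
        AdapterFinishReason::Length => FinishReason::MaxTokens,
        // Loud refusal: tool-use is not modeled by inference-llm.
        AdapterFinishReason::ToolUse => FinishReason::Error {
            reason: "adapter finished with tool-use, which inference-llm does not model"
                .to_string(),
        },
        AdapterFinishReason::Error => FinishReason::Error {
            reason: "adapter reported an error".to_string(),
        },
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Route {
    Adapter,
    Stub,
}

/// Adapter + prompt_text goes to the adapter; adapter without prompt_text is
/// refused loudly; no adapter falls back to the stub.
pub fn choose_route(adapter_wired: bool, prompt_text: Option<&str>) -> Result<Route, String> {
    match (adapter_wired, prompt_text) {
        (true, Some(_)) => Ok(Route::Adapter),
        (true, None) => Err(
            "adapter wired but prompt_text missing; raw-token path not implemented".to_string(),
        ),
        (false, _) => Ok(Route::Stub),
    }
}
```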

Wire-shape decisions

- max_tokens=0 in substrate's GenerationBudget translates to None
  on adapter's wire. Substrate convention: 0=unlimited, caller takes
  duration responsibility. Adapter convention: None=unlimited, 0=stop
  immediately. The substrate's "stop immediately" doesn't have an
  encoding because no caller would ask for it.
- stop_sequences: empty Vec on substrate translates to None on
  adapter (adapter convention: None = no caller stop sequences).
- persona_id propagates to adapter as stringified UUID for
  per-persona resource attribution (matches existing adapter
  convention from PersonaResponseGenerator).
- purpose hardcoded "inference-llm" for adapter routing diagnostics.
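
A minimal sketch of these request-side translations. AdapterFields and to_adapter_fields are hypothetical names, and persona_id is taken as an already-stringified UUID to keep the example free of external crates.

```rust
/// Hypothetical subset of the adapter-side request fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AdapterFields {
    pub max_tokens: Option<u32>,
    pub stop_sequences: Option<Vec<String>>,
    pub persona_id: String,
    pub purpose: &'static str,
}

/// Applies the wire-shape conventions: 0 and empty both become None.
pub fn to_adapter_fields(
    max_tokens: u32,
    stop_sequences: &[String],
    persona_id: &str,
) -> AdapterFields {
    AdapterFields {
        // Substrate: 0 = unlimited. Adapter: None = unlimited, 0 = stop now.
        max_tokens: if max_tokens == 0 { None } else { Some(max_tokens) },
        // Substrate: empty Vec. Adapter: None = no caller stop sequences.
        stop_sequences: if stop_sequences.is_empty() {
            None
        } else {
            Some(stop_sequences.to_vec())
        },
        persona_id: persona_id.to_string(),
        // Hardcoded for adapter routing diagnostics.
        purpose: "inference-llm",
    }
}
```

Mapping 0 to None rather than passing it through matters because the adapter would otherwise read 0 as "stop immediately".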

Sub-fix: missing TS bindings from PR-1

PR-1 (#1387) shipped the Rust types but the
shared/generated/inference_llm/ directory of TS exports wasn't
included in the commit (regen produced them locally; they didn't
get staged). PR-4 ships all 10 TS files + the barrel index. Closes
a wire-contract gap.

Tests

13 new behavioral tests (44 total in inference::llm_module +
inference::llm_module_service + inference::llm_module_bus):

- translate_adapter_response_carries_text_and_usage — completion_text
  + tokens_generated mapping
- translate_finish_reason_covers_all_adapter_variants — cross-enum
  mapping pin
- with_adapter_constructor_routes_via_adapter_path — constructors
  compile + no-adapter regression
- 8 existing PR-2 + 4 existing PR-3b tests still pass (no
  regressions)

End-to-end "stub adapter via Arc<dyn AIProviderAdapter>" tests
deferred to PR-5: the AIProviderAdapter trait has at least 11 methods
(provider_id / api_style / default_model / get_available_models /
health_check / model_metadata / capabilities / initialize /
shutdown / generate_text / create_embedding) and implementing
all of them on a test stub here would pull in ProviderHealth +
AdapterCapabilities + ApiStyle + ModelInfo + their dependencies
— bigger than atomic-slice. PR-5 will wire LlamaCppAdapter
directly through Runtime registration.

44/44 inference::llm_module tests pass. No regressions across
other 2928 lib tests.

Stack

- #1387 — inference-llm PR-1: typed event surface
- #1391 — inference-llm PR-2: ServiceModule impl (stub-backed)
- #1392 — inference-llm PR-3a: bus keys + publishing helpers
- #1393 — inference-llm PR-3b: auto-publish wiring
- THIS PR — inference-llm PR-4: adapter integration (translation +
  constructors)
- NEXT — PR-5: LlamaCppAdapter Runtime wiring + end-to-end
  integration test through real (or test-mock) adapter

Co-authored-by: Test <test@test.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
joelteply added a commit that referenced this pull request May 18, 2026
Wires InferenceLlmModule into the Runtime so it's callable from
the cognition path via inference/llm/request commands.

What lands

- Add "inference-llm" to EXPECTED_MODULES in runtime/runtime.rs
- runtime.register(Arc::new(InferenceLlmModule::new())) in
  ipc/mod.rs alongside the existing InferenceModule registration

Design choices

- Constructed via the .new() (bus-less, stub-backed) constructor
  rather than .with_bus_and_adapter(). Reason: the
  with_bus_and_adapter constructor requires an AIProviderAdapter
  Arc, which would couple PR-5's runtime registration to a
  specific LlamaCppAdapter init lifecycle. The substrate's
  LlamaCppAdapter is owned by AIProviderModule's adapter registry
  with its own initialization phase; threading the adapter Arc
  here would either duplicate the registration or create an
  init-ordering dependency this slice shouldn't introduce.
- The stub-backed registration is still useful: it exposes the
  inference/llm/request command surface to the cognition path so
  downstream PRs (turn-execute that chains drain-turn-frame →
  response_prompt → inference/llm/request) can wire against the
  real command name. Bus + adapter integration is a follow-up
  PR that updates the construction call here.

What is NOT changed

- AIProviderModule + LlamaCppAdapter unchanged
- All InferenceLlmModule trait impl logic unchanged (PR-2/3/4
  work intact)
- The stub vs real-adapter swap point stays exactly where PR-4
  put it: with_bus_and_adapter constructor + run_adapter_inference
  function

Tests

- cargo build --features metal,accelerate --lib clean (no new
  test fixtures needed — the module's existing 44/44 tests cover
  the trait-impl correctness; this PR just plumbs construction
  into runtime startup)
- EXPECTED_MODULES enforcement validates at boot: if the registration
  is missing the runtime fails with "missing inference-llm" error
- Pre-push gate clean
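
The boot-time check can be pictured roughly as below. This is a sketch under assumptions: the "inference" entry, the validate_registered name, and the exact error string format are hypothetical; only the EXPECTED_MODULES name and the "inference-llm" entry come from this message.

```rust
/// Assumed contents: only "inference-llm" is confirmed by this PR.
pub const EXPECTED_MODULES: &[&str] = &["inference", "inference-llm"];

/// Fails on the first expected module that was never registered.
pub fn validate_registered(registered: &[&str]) -> Result<(), String> {
    for expected in EXPECTED_MODULES {
        if !registered.contains(expected) {
            return Err(format!("missing {expected}"));
        }
    }
    Ok(())
}
```

With a check of this shape, forgetting the runtime.register call surfaces at startup rather than at the first inference/llm/request command.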

Stack

- #1387 PR-1: typed event surface
- #1391 PR-2: ServiceModule impl (stub-backed)
- #1392 PR-3a: bus keys + publishing helpers
- #1393 PR-3b: auto-publish wiring
- #1395 PR-4: adapter integration (translation + new constructors)
- THIS PR — PR-5: Runtime registration
- FOLLOW-UP — adapter Arc wiring when LlamaCppAdapter init phase
  is integrated with Runtime startup

Co-authored-by: Test <test@test.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>