Define Rust persona runtime alpha contract #1062

Merged
joelteply merged 6 commits into canary from feature/rust-persona-runtime-alpha on May 8, 2026

@joelteply
Contributor

Summary

  • Add the Rust persona runtime alpha gap document, grounded in the CBAR frame/RTOS architecture: turn frames, dependency wakeups, small opaque contracts, handle/id passing, max safe parallelism, and Rust-owned scheduling.
  • Extend the pure Rust turn planner with local inference capacity, generation waves, estimated start/finish timing, and first/all-response alpha budget evidence.
  • Regenerate ts-rs bindings for the expanded cognition planner contract.

Validation

  • cargo test -p continuum-core --features metal,accelerate cognition::turn_batch --lib
  • pre-push gate: TypeScript clean, ESLint baseline-tolerant clean, Rust compile clean with metal/accelerate, Rust tests passed with metal/accelerate

Notes

  • PR #1061 (Prioritize chat over memory synthesis) is intentionally held as insufficient for live chat; this PR is the Rust-centric contract/plan to move scheduling authority out of TS.
  • Pre-push native-arch Docker push was skipped by hook after it detected local post-generation churn; CI may still require architecture image verification.

@joelteply
Contributor Author

LGTM: this is the brainstorm we just had, landing as Rust types. A sibling review already LGTM'd in parallel; adding a distinct architectural read.

Maps to the 6 primitives we converged on tonight

  • Identity-driven dispatch → RecipeTurnBatchRequest carries the inputs that compose into a turn-key (already SHA-256 hashed in #1060, Add Rust turn batching boundary). Plus the local_inference_capacity field: the host passes the resource truth in, so the planner doesn't guess.
  • Strategy = config object → local_inference_capacity + first_response_budget_ms + all_responses_budget_ms are request-level data, not hardcoded constants. Defaults come from default_first_response_budget_ms() / default_all_responses_budget_ms(), so callers can override per profile (M-series MBA shorter budgets than RTX workstation).
  • Confidence-trajectory-as-forecast → not present in this PR (correctly — that's an inference-engine concern, not a turn-planner concern). Reserved for the inside-router work, separate layer.
  • Per-block + per-expert routing unit → not present (correctly — same scope reason).
  • Trace as evidence → the meets_first_response_budget + meets_all_responses_budget flags + estimated_first_response_ms + estimated_all_responses_ms on the plan ARE the structured-trace input to the regression test pattern I sketched earlier. The plan is the test artifact.
  • Capacity wave → generation_wave = generation_order / max_concurrent_local_generations is the cleanest possible expression of "lane-aware fairness." Wave 0 = first parallel batch, wave 1 = next batch after the first finishes, etc. Local-only; cloud personas stay at wave 0 (parallelism unbounded).
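
For readers who want the wave arithmetic spelled out, here is a minimal sketch. It assumes a zero-based generation_order and plain integer division; the estimated_start_ms helper and its per_generation_ms parameter are hypothetical and not taken from the PR.

```rust
/// Wave index for a local persona: integer division of its zero-based
/// generation order by the local concurrency cap.
/// The cap is clamped to 1 (an assumption) so a zero capacity cannot panic.
pub fn generation_wave(generation_order: usize, max_concurrent_local_generations: usize) -> usize {
    generation_order / max_concurrent_local_generations.max(1)
}

/// Hypothetical timing estimate: a wave starts after all earlier waves
/// finish, assuming every generation takes about `per_generation_ms`.
pub fn estimated_start_ms(wave: usize, per_generation_ms: u64) -> u64 {
    wave as u64 * per_generation_ms
}
```

With a cap of 2, orders 0 and 1 land in wave 0, orders 2 and 3 in wave 1, and so on.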

PR Sequence A-F is the right ladder

The sequence in the doc (A: contract, B: TS adapter obeys, C: Rust runs turn, D: model resolver, E: memory admission, F: data canonical handles) walks the brainstorm's "smallest skeleton" → "layer each primitive without breaking trace fingerprint" pattern exactly. PR A (this) is the contract-only baseline. Each subsequent PR can be tested against the SAME contract: did we break the planner's promised wave shape? Easy regression check.

Two non-blocking observations

(1) default_first_response_budget_ms = 10_000 and default_all_responses_budget_ms = 30_000. These are alpha-SLO numbers ASSERTED here. They implicitly become the regression-test threshold for everything downstream. Worth a docstring naming them as alpha-SLOs (so a future change requires explicit re-think): "10s/30s = alpha gate. Tightening these requires Rust-owned admission + GPU lane allocation; loosening is a regression."
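
A sketch of what that docstring could look like. The two function names come from the PR; the u64 return type, the AlphaBudgets struct, and its methods are assumptions made only to keep the example self-contained.

```rust
/// Alpha SLO: the first persona response must land within 10 s.
/// Tightening this requires Rust-owned admission + GPU lane allocation;
/// loosening it is a regression.
pub fn default_first_response_budget_ms() -> u64 {
    10_000
}

/// Alpha SLO: the full response set must land within 30 s.
/// Same rule: tighten only with admission + lane allocation in place.
pub fn default_all_responses_budget_ms() -> u64 {
    30_000
}

/// Hypothetical holder for the two request-level budgets.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AlphaBudgets {
    pub first_response_budget_ms: u64,
    pub all_responses_budget_ms: u64,
}

impl Default for AlphaBudgets {
    fn default() -> Self {
        Self {
            first_response_budget_ms: default_first_response_budget_ms(),
            all_responses_budget_ms: default_all_responses_budget_ms(),
        }
    }
}

impl AlphaBudgets {
    /// True when the planner's estimate fits the first-response budget.
    pub fn meets_first_response_budget(&self, estimated_first_response_ms: u64) -> bool {
        estimated_first_response_ms <= self.first_response_budget_ms
    }

    /// True when the planner's estimate fits the all-responses budget.
    pub fn meets_all_responses_budget(&self, estimated_all_responses_ms: u64) -> bool {
        estimated_all_responses_ms <= self.all_responses_budget_ms
    }
}
```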

(2) is_local_provider(&candidate.provider, &candidate.model) controls whether wave-batching applies. Cloud personas stay wave=0 (parallel). One subtle case: if a "local" persona is misclassified as cloud (or vice versa), wave allocation breaks silently. Worth a unit test that asserts the wave assignment for a known-local + known-cloud persona, sample-checking is_local_provider consistency with the model registry from #1042.
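
Roughly the shape of test I have in mind, as a sketch. The is_local_provider body here is a stand-in (the real one should agree with the model registry), the "local" and "cloud" provider names are illustrative, and counting generation order over local personas only is an assumption.

```rust
/// Hypothetical candidate shape; only provider and model are needed here.
pub struct Candidate {
    pub provider: &'static str,
    pub model: &'static str,
}

/// Stand-in classifier: the provider name "local" is purely illustrative.
pub fn is_local_provider(provider: &str, _model: &str) -> bool {
    provider == "local"
}

/// Assign waves: local personas are batched by capacity, cloud stays at 0.
pub fn assign_waves(candidates: &[Candidate], max_concurrent_local_generations: usize) -> Vec<usize> {
    let cap = max_concurrent_local_generations.max(1);
    let mut local_seen = 0usize;
    candidates
        .iter()
        .map(|c| {
            if is_local_provider(c.provider, c.model) {
                let wave = local_seen / cap;
                local_seen += 1;
                wave
            } else {
                0
            }
        })
        .collect()
}
```

A misclassified persona shows up immediately as a wrong entry in the returned vector, which is the point of pinning it in a test.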

LGTM ship.

@joelteply
Contributor Author

Durable input on PR #1062 — going deep on three of Codex's listed surfaces (scheduler/traits, handle/id boundaries, max-safe parallelism). These shape PRs B-F.

Scheduler/traits — the plan IS the trait

A read of PR B-C from the doc: TS adapter has to OBEY the plan (B); Rust runtime has to PRODUCE the same observable result executing the plan (C). The temptation will be to define a TurnExecutor trait that both implement.

Better shape: don't define an executor trait. The PLAN is the contract. PR B implements executePlan(plan): Promise<TurnResult> in TS. PR C implements cognition/run-turn in Rust that takes the SAME plan input and returns the SAME TurnResult shape (ts-rs generated, like PersonaTurnPlan today). When C ships, B's TS execution becomes a "legacy adapter" that's deprecated — the plan-shape didn't move, only the executor did.

Why this matters: if you define a Rust trait + corresponding TS trait now, you create a 4-corner versioning problem (Rust-trait-version × TS-trait-version × Rust-impl × TS-impl). Skip the trait, treat the plan as the wire boundary, the impl on each side becomes a black box. Same shape as data/list today — TS calls Rust, Rust returns rows, no shared trait between languages, just a wire contract.

Concrete suggestion for PR B: define executePlan(plan: RecipeTurnBatchPlan): TurnResult where TurnResult = { responses: PersonaResponse[], silences: PersonaSilence[], actual_first_response_ms, actual_all_responses_ms, plan_divergence?: PlanDivergence[] }. The plan_divergence field is the structured-trace primitive from tonight's brainstorm — emit a divergence record for "wave skipped," "persona deferred outside plan," "ladder stepped." That trace is the regression-test artifact for PR C: same plan input → same divergence shape (within tolerance).
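
A sketch of the Rust side of that shape, from which ts-rs bindings could later be generated. Field names follow the suggestion above; the PlanDivergence variants, their payloads, the fields of PersonaResponse / PersonaSilence, and divergence_shape are all assumptions for illustration.

```rust
use std::mem::{discriminant, Discriminant};

/// Hypothetical divergence records, one per case named above.
#[derive(Debug, Clone, PartialEq)]
pub enum PlanDivergence {
    WaveSkipped { wave: usize },
    PersonaDeferredOutsidePlan { persona_id: String },
    LadderStepped { persona_id: String },
}

#[derive(Debug, Clone, PartialEq)]
pub struct PersonaResponse {
    pub persona_id: String,
    pub text: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct PersonaSilence {
    pub persona_id: String,
    pub reason: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct TurnResult {
    pub responses: Vec<PersonaResponse>,
    pub silences: Vec<PersonaSilence>,
    pub actual_first_response_ms: u64,
    pub actual_all_responses_ms: u64,
    pub plan_divergence: Option<Vec<PlanDivergence>>,
}

/// The "shape" of a trace: which kinds of divergence occurred, in order,
/// ignoring payloads. Two executors agree when their shapes are equal.
pub fn divergence_shape(result: &TurnResult) -> Vec<Discriminant<PlanDivergence>> {
    result
        .plan_divergence
        .as_deref()
        .unwrap_or(&[])
        .iter()
        .map(|d| discriminant(d))
        .collect()
}
```

Comparing shapes instead of full records is one way to get the "within tolerance" part: timings and ids can differ while the sequence of divergence kinds stays fixed.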

Handle/id no-copy boundaries

The PR establishes two SHA-hashed keys: persona_context_key and rag_cache_key. These are the no-copy primitives — but the contract doesn't yet require callers to honor them.

Concrete suggestion: PR B should establish the cache pattern explicitly. Pseudocode:

// Bad: send full context every turn
const result = await callRust('persona/run-turn', { ...plan, full_persona_contexts })

// Good: send context handle, Rust resolves from its own cache
const missingKeys = await callRust('cognition/get-missing-context-keys', { keys: plan.persona_plans.map(p => p.persona_context_key) })
if (missingKeys.length > 0) {
  await callRust('cognition/cache-context', { contexts: missingKeys.map(k => buildContext(k)) })
}
const result = await callRust('cognition/run-turn', { plan }) // Rust looks up by key

The first call learns what's missing; the second uploads only what's missing; the third executes by handle. Same pattern as Git's pack negotiation. On a steady-state room (same persona contexts), the second call is empty — zero copy after the warm-up turn.
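
On the Rust side, the cache behind those three calls could be as small as the sketch below. ContextCache and its methods are hypothetical names, and contexts are modelled as plain strings for brevity.

```rust
use std::collections::HashMap;

/// Hypothetical Rust-side store of persona contexts, addressed by
/// persona_context_key.
#[derive(Default)]
pub struct ContextCache {
    contexts: HashMap<String, String>,
}

impl ContextCache {
    /// Call 1: report which of the requested keys are not cached yet.
    pub fn missing_keys(&self, keys: &[String]) -> Vec<String> {
        keys.iter()
            .filter(|k| !self.contexts.contains_key(*k))
            .cloned()
            .collect()
    }

    /// Call 2: store only the contexts the caller was told were missing.
    pub fn cache_contexts(&mut self, contexts: Vec<(String, String)>) {
        self.contexts.extend(contexts);
    }

    /// Call 3 (inside run-turn): resolve a context by handle, no payload
    /// crossing the boundary.
    pub fn resolve(&self, key: &str) -> Option<&str> {
        self.contexts.get(key).map(String::as_str)
    }
}
```

On a warm room, missing_keys returns an empty list and the upload step is skipped entirely.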

Same pattern for rag_cache_key: the planner declares which RAG sources will be loaded; the cache entry includes the SHA of the actual source content. Rust returns "I have these, missing those" → TS uploads only the missing ones. Joel's "AIRC carries handles, never tensors" applied to TS↔Rust IPC.

Max-safe parallelism — cloud lanes are missing

max_concurrent_local_generations gates local. But is_local_provider(...) ? wave : 0 puts ALL cloud personas at wave 0 = full parallel. With 5 cloud personas sharing one OpenAI key, that's 5 concurrent API calls into the same rate limit bucket → 429s on burst, then sticky cooldown.

Concrete suggestion: add cloud_lane_caps: HashMap<String, usize> to RecipeTurnBatchRequest. Default per-provider: openai=4, anthropic=4, google=4, deepseek=2, etc. (matches their published rate-limit heuristics). Wave assignment becomes:

let wave = if is_local_provider(...) {
    generation_order / max_concurrent_local_generations
} else {
    let lane_cap = cloud_lane_caps.get(provider).unwrap_or(&4);
    cloud_provider_order_seen[provider] / lane_cap  // per-provider counter
};

This is policy-as-data per the brainstorm. Defaults can ship in src/shared/models.json as provider.lane_cap field (extends the SSOT pattern from #1042).
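
A compilable version of the snippet above, as a sketch. The Persona struct, the function name, and carrying is_local as a precomputed flag are assumptions; the default cap of 4 mirrors the unwrap_or(&4) in the pseudocode.

```rust
use std::collections::HashMap;

/// Hypothetical persona shape; `is_local` stands in for is_local_provider.
pub struct Persona {
    pub provider: String,
    pub is_local: bool,
}

/// Assign waves with a per-provider cap on cloud lanes.
pub fn assign_waves_with_cloud_lanes(
    personas: &[Persona],
    max_concurrent_local_generations: usize,
    cloud_lane_caps: &HashMap<String, usize>,
) -> Vec<usize> {
    const DEFAULT_CLOUD_LANE_CAP: usize = 4;
    let local_cap = max_concurrent_local_generations.max(1);
    let mut local_seen = 0usize;
    // Per-provider counter of cloud personas seen so far.
    let mut cloud_seen: HashMap<&str, usize> = HashMap::new();

    personas
        .iter()
        .map(|p| {
            if p.is_local {
                let wave = local_seen / local_cap;
                local_seen += 1;
                wave
            } else {
                let lane_cap = cloud_lane_caps
                    .get(&p.provider)
                    .copied()
                    .unwrap_or(DEFAULT_CLOUD_LANE_CAP)
                    .max(1);
                let seen = cloud_seen.entry(p.provider.as_str()).or_insert(0);
                let wave = *seen / lane_cap;
                *seen += 1;
                wave
            }
        })
        .collect()
}
```

With openai capped at 4, the fifth openai persona moves to wave 1 instead of joining the burst.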

Tiny

  • default_first_response_budget_ms = 10_000 and default_all_responses_budget_ms = 30_000 are alpha-SLO numbers. Worth a docstring: "tightening these requires Rust-owned admission + GPU lane allocation; loosening is a regression."
  • The PR Sequence A-F doc section is the right operational map. Worth adding an explicit "trace shape MUST NOT change between PRs" invariant — the regression test described in tonight's brainstorm.

LGTM ship.

@joelteply
Contributor Author

Sanity-check on 74d0c633 adaptive_throughput module per Codex's review request — going by ResourceClass / lease+handle / dependency wakeup / ORM-inference-WebRTC-Bevy guard.

ResourceClass shape (11 variants)

Largely right, four observations:

  1. LocalGeneration and Gpu overlap conceptually. Local persona generation USES the GPU. Are they orthogonal lanes (Gpu for rendering, LocalGeneration for inference) or competing? If orthogonal: docstring says so. If competing: the planner has no way to know that a LocalGeneration job consumes Gpu lane bytes too. Suggest making the relationship explicit: either drop Gpu (let LocalGeneration + Render imply it) OR add a gpu_subclass: Option<{Inference|Render|Media}> field.

  2. Memory as class is ambiguous. Reads as "memory bytes" (a constraint dimension) rather than "memory consolidation work" (a job type). Suggest MemoryConsolidation or BackgroundMemory for unambiguous semantics.

  3. CloudProvider is single-class but providers differ. OpenAI rate limits ≠ Anthropic ≠ DeepSeek. With one CloudProvider class, all jobs share one budget. Burst of 5 personas all on OpenAI hits the same lane cap as 5 personas spread across providers — wrong shape. Two options: (a) per-provider classes (OpenAi, Anthropic, DeepSeek), or (b) keep CloudProvider class but put per-provider budgets in lane_budgets keyed by (ResourceClass, provider_id) rather than just ResourceClass.

  4. Background is a catch-all. Hippocampus consolidation (#1061, Prioritize chat over memory synthesis), codebase indexing, training all land here with different urgencies. Worth either splitting (e.g. BackgroundCold for cache-warm-up, BackgroundConsolidation for memory) or documenting the intended granularity.
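
To make option (b) from observation 3 concrete, here is a sketch of lane budgets keyed by (ResourceClass, provider_id). Only three of the eleven variants are shown, and LaneKey, lane_budget, and the class-wide fallback under a None provider are assumptions.

```rust
use std::collections::HashMap;

/// Subset of the real enum, enough for the example.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum ResourceClass {
    LocalGeneration,
    CloudProvider,
    Background,
}

/// Hypothetical key: a class plus an optional provider id.
pub type LaneKey = (ResourceClass, Option<String>);

/// Look up the provider-specific budget first, then fall back to the
/// class-wide budget stored under `None`.
pub fn lane_budget(
    budgets: &HashMap<LaneKey, usize>,
    class: ResourceClass,
    provider_id: Option<&str>,
) -> Option<usize> {
    let specific: LaneKey = (class, provider_id.map(str::to_owned));
    let fallback: LaneKey = (class, None);
    budgets
        .get(&specific)
        .or_else(|| budgets.get(&fallback))
        .copied()
}
```

This keeps CloudProvider as one class while letting each provider carry its own cap.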

Lease / handle semantics

artifact_key IS the handle. dependency_keys = required preconditions. ready_artifact_keys = currently-held. But this is a SNAPSHOT model, not a LEASE model. Three things missing for true lease semantics:

  • No TTL on artifact readiness. Once room:general:canonical is in ready_artifact_keys, it's ready forever in this planner's view. A canonical room handle can stale (DB schema change, room renamed). Suggest extending request: ready_artifacts: Vec<{ key: String, valid_until_ms: Option<u64> }>.
  • No revocation signal. If artifact gets evicted between plan-time and run-time, no callback to deferred jobs. The runtime caller has to detect and re-plan.
  • No holder identity. artifact_key is shared address; lease should carry holder_id so observability ('who is keeping this resident') is queryable.

The "Pipes Carry Leases" doc principle suggests these matter. Concrete suggestion: extend ThroughputJob with requires_lease: bool so the future runtime can distinguish handle-only jobs (cheap) from lease-required jobs (must pin artifact). This PR can keep snapshot model; future PR adds lease layer on top of same primitives.
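
A sketch of the TTL extension plus the requires_lease flag, staying inside the snapshot model. ReadyArtifact and artifact_is_ready are hypothetical; the ThroughputJob fields shown are only the ones discussed here, and passing now_ms in explicitly is an assumption that keeps the planner pure.

```rust
/// Readiness entry with an optional expiry, per the suggestion above.
#[derive(Debug, Clone)]
pub struct ReadyArtifact {
    pub key: String,
    /// `None` means "ready until told otherwise" (today's behaviour).
    pub valid_until_ms: Option<u64>,
}

/// Only the fields relevant to this discussion.
#[derive(Debug, Clone)]
pub struct ThroughputJob {
    pub artifact_key: String,
    pub dependency_keys: Vec<String>,
    /// True when the job needs the artifact pinned, not just addressed.
    pub requires_lease: bool,
}

/// An artifact counts as ready if it is listed and has not expired.
pub fn artifact_is_ready(ready: &[ReadyArtifact], key: &str, now_ms: u64) -> bool {
    ready.iter().any(|a| {
        a.key == key && a.valid_until_ms.map_or(true, |deadline| now_ms < deadline)
    })
}

/// Snapshot check: every dependency must be ready at `now_ms`.
pub fn dependencies_ready(job: &ThroughputJob, ready: &[ReadyArtifact], now_ms: u64) -> bool {
    job.dependency_keys
        .iter()
        .all(|k| artifact_is_ready(ready, k, now_ms))
}
```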

Dependency wakeups

dependencies_ready(job, ready_artifacts) correctly defers when any dep is missing. Good shape. Critical to note: this is snapshot planning, not actual wakeup. Caller orchestrates the wake-loop:

  1. Receive deferred_missing_dependencies list
  2. Wait until at least one missing artifact appears in subsequent ready_artifact_keys
  3. Re-call plan_adaptive_throughput with updated set

This is correct (planner stays pure), but the doc should say "waker is caller-side; this module decides admission, not when to retry." Otherwise PR B/C readers might assume the planner notifies them.
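
The caller-side loop from the three steps above, as a sketch. plan_snapshot is a simplified stand-in for plan_adaptive_throughput (dependency check only, no budgets), and Job, Deferred, Plan, and replan_on_arrivals are hypothetical names.

```rust
use std::collections::HashSet;

pub struct Job {
    pub id: String,
    pub dependency_keys: Vec<String>,
}

pub struct Deferred {
    pub job_id: String,
    pub missing_keys: Vec<String>,
}

pub struct Plan {
    pub admitted: Vec<String>,
    pub deferred_missing_dependencies: Vec<Deferred>,
}

/// Pure stand-in planner: decides admission from a snapshot, never waits.
pub fn plan_snapshot(jobs: &[Job], ready_artifact_keys: &HashSet<String>) -> Plan {
    let mut result = Plan { admitted: Vec::new(), deferred_missing_dependencies: Vec::new() };
    for job in jobs {
        let missing: Vec<String> = job
            .dependency_keys
            .iter()
            .filter(|k| !ready_artifact_keys.contains(*k))
            .cloned()
            .collect();
        if missing.is_empty() {
            result.admitted.push(job.id.clone());
        } else {
            result
                .deferred_missing_dependencies
                .push(Deferred { job_id: job.id.clone(), missing_keys: missing });
        }
    }
    result
}

/// Caller-side waker: re-plan only when an arriving artifact is one that
/// some deferred job is waiting on. Returns the last plan and replan count.
pub fn replan_on_arrivals(
    jobs: &[Job],
    ready: &mut HashSet<String>,
    arrivals: impl IntoIterator<Item = String>,
) -> (Plan, usize) {
    let mut current = plan_snapshot(jobs, ready);
    let mut replans = 0;
    for key in arrivals {
        let wakes = current
            .deferred_missing_dependencies
            .iter()
            .any(|d| d.missing_keys.contains(&key));
        ready.insert(key);
        if wakes {
            current = plan_snapshot(jobs, ready);
            replans += 1;
        }
    }
    (current, replans)
}
```

The planner never notifies anyone; the retry decision lives entirely in replan_on_arrivals.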

ORM/inference/WebRTC/Bevy executable guard

Yes, this is the right shape — ONE module, 4 different real workloads, cross-dependency chain (webrtc.frame_decoded → bevy.texture). The test asserts the chain breaks cleanly when a dep is missing. Sharp.

Two extensions worth adding:

  • 5th job class: airc-bridge — demonstrate that the substrate handles the bridge handshake we built today. airc:msg:incoming → triggers cognition:respond job dependency.
  • 6th: memory-consolidation as Background lane: demonstrates that the Hippocampus work in #1061 (Prioritize chat over memory synthesis) IS expressible as a job in this substrate, gated by Background budget. That's the validation that the substrate generalizes ALL the disparate primitives in the doc list.

Sort policy nit

compare_jobs sorts higher priority first, then NEWER created_at_ms first (LIFO at same priority). Test same_artifact_jobs_coalesce_to_latest_highest_priority_work confirms intent. Trade-off worth a docstring: LIFO favors recency (user just typed) but penalizes patient long-running work (oldest job at same priority is starved). FIFO would be left.created_at.cmp(&right.created_at). Current choice is right for chat-turn freshness, wrong for fairness-sensitive work like training. Could be config-driven later, but for alpha LIFO is correct.
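
For the docstring, the two tie-break policies side by side. This is a sketch: the Job struct and the TieBreak parameter are hypothetical, and the real compare_jobs takes no such argument today.

```rust
use std::cmp::Ordering;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Job {
    pub id: &'static str,
    pub priority: u8,
    pub created_at_ms: u64,
}

/// Hypothetical switch for the same-priority tie-break.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TieBreak {
    /// LIFO: favors recency (current alpha behaviour).
    NewestFirst,
    /// FIFO: favors fairness for long-waiting work.
    OldestFirst,
}

/// Higher priority sorts first; ties are broken by `tie_break`.
pub fn compare_jobs(left: &Job, right: &Job, tie_break: TieBreak) -> Ordering {
    right
        .priority
        .cmp(&left.priority)
        .then_with(|| match tie_break {
            TieBreak::NewestFirst => right.created_at_ms.cmp(&left.created_at_ms),
            TieBreak::OldestFirst => left.created_at_ms.cmp(&right.created_at_ms),
        })
}
```

Sorting the same three jobs under each policy only changes the order of the two equal-priority entries.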

Tiny

  • can_admit returns false silently if no budget for ResourceClass. Should at minimum log/trace once: "job at class X has no budget configured, deferring forever." Otherwise misconfiguration looks identical to over-budget pressure.
  • Cross-class coalescing intentionally NOT done (same artifact_key, different ResourceClass = both kept). Probably correct (different work even if same target artifact) but worth a docstring on coalesce_by_identity so it's not read as a bug.
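
One way to make the first point observable without adding logging to a pure planner: return a typed reason and let the caller trace it. A sketch, with the Admission enum, admission_for, and this can_admit signature all assumed for illustration.

```rust
use std::collections::HashMap;

/// Subset of the real enum, enough for the example.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum ResourceClass {
    LocalGeneration,
    CloudProvider,
    Background,
}

/// Hypothetical typed outcome: misconfiguration is distinct from pressure.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Admission {
    Admit,
    OverBudget { in_flight: usize, budget: usize },
    NoBudgetConfigured,
}

/// Decide admission for one job of `class` given current in-flight count.
pub fn admission_for(
    class: ResourceClass,
    in_flight: usize,
    lane_budgets: &HashMap<ResourceClass, usize>,
) -> Admission {
    match lane_budgets.get(&class) {
        None => Admission::NoBudgetConfigured,
        Some(&budget) if in_flight < budget => Admission::Admit,
        Some(&budget) => Admission::OverBudget { in_flight, budget },
    }
}

/// Boolean view for callers that only need yes/no.
pub fn can_admit(
    class: ResourceClass,
    in_flight: usize,
    lane_budgets: &HashMap<ResourceClass, usize>,
) -> bool {
    admission_for(class, in_flight, lane_budgets) == Admission::Admit
}
```

A caller can then trace once on NoBudgetConfigured and treat OverBudget as ordinary back-pressure.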

LGTM ship.

joelteply merged commit 264e589 into canary on May 8, 2026
3 checks passed
joelteply deleted the feature/rust-persona-runtime-alpha branch on May 8, 2026 02:25
joelteply pushed a commit that referenced this pull request May 11, 2026
Codifies the fairness bar Mac+Windows smoke surfaced post #1057-1060:
storm IS fixed (CPU stays flat) BUT first-claim-wins coordination is too
sticky (only 1 of N personas replies). This test makes that failure mode
explicit so the eventual fix has an executable green-vs-red signal.

Five typed loud-fail buckets per #1063 / #1067 pattern:
  probe_not_persisted             — chat/send returned ok but DB drop
  no_personas_replied             — total silence (storm-fix overcorrection)
  first_response_budget_exceeded  — first reply > 10s budget per #1062
  all_response_budget_exceeded    — full reply set > 30s budget per #1062
  fairness_violated               — only K of N replied where K < min

Standing-rule alignment (#1070 / #1072):
- Single attempt, no retry on failure
- Loud-fail with typed bucket — operator greps result, doesn't dig logs
- No silent fallback — reports what user-facing surface actually shows

Uses ./jtag CLI via execFile to stay decoupled from in-process JTAGClient
TS surface drift; matches the chat-probe pattern operators already use.
joelteply added a commit that referenced this pull request May 11, 2026
* test(sensory): add Position 2 alpha-contract WebRTC sensory smoke

Per #1072 sensory persona alpha contract: codifies the live sensory
loop a STANDARD PERSONA must satisfy. Resolves multimodal model via
cognition/resolve-model (Position 1 dependency), spawns LiveKitAgent,
publishes test audio question + known image as video frame, asserts
persona's TTS response + transcription mentions image content.

Six typed loud-fail buckets per #1063 / #1067 pattern:
  no_qualified_model, persona_failed_to_join, no_audio_published,
  no_transcription, vision_blind, budget_exceeded

Failing-loud test today; passes when Position 1 (resolver +
RequirementProfile::StandardPersona IPC) and Position 3 (Qwen
multimodal GPU kernels) land. Bar is the test, not the impl.

No silent CPU fallback, no degraded text-only pass, no retry on
failure (per #1070 / #1072 standing rules).

* test(persona): multi-persona response timing regression smoke

Codifies the fairness bar Mac+Windows smoke surfaced post #1057-1060:
storm IS fixed (CPU stays flat) BUT first-claim-wins coordination is too
sticky (only 1 of N personas replies). This test makes that failure mode
explicit so the eventual fix has an executable green-vs-red signal.

Five typed loud-fail buckets per #1063 / #1067 pattern:
  probe_not_persisted             — chat/send returned ok but DB drop
  no_personas_replied             — total silence (storm-fix overcorrection)
  first_response_budget_exceeded  — first reply > 10s budget per #1062
  all_response_budget_exceeded    — full reply set > 30s budget per #1062
  fairness_violated               — only K of N replied where K < min

Standing-rule alignment (#1070 / #1072):
- Single attempt, no retry on failure
- Loud-fail with typed bucket — operator greps result, doesn't dig logs
- No silent fallback — reports what user-facing surface actually shows

Uses ./jtag CLI via execFile to stay decoupled from in-process JTAGClient
TS surface drift; matches the chat-probe pattern operators already use.

---------

Co-authored-by: Test <test@test.com>