Skip to content

feat(persona): per-turn latency metrics — LatencyAggregate + ServeOutcome.turn_latency#1515

Closed
joelteply wants to merge 2 commits into
feat/persona-prime-at-spawnfrom
feat/persona-turn-latency-metrics
Closed

feat(persona): per-turn latency metrics — LatencyAggregate + ServeOutcome.turn_latency#1515
joelteply wants to merge 2 commits into
feat/persona-prime-at-spawnfrom
feat/persona-turn-latency-metrics

Conversation

@joelteply
Copy link
Copy Markdown
Contributor

Summary

Per Joel 2026-06-02 ("make sure timing and other metrics are in place"): the substrate doesn't get to claim "fast airc-bound persona" without measuring. This PR makes the per-reply cost structural.

What changed

  • LatencyAggregate { count, total_ms, min_ms, max_ms } — cheap online aggregator. O(1) record, allocation-free, saturating-add on overflow (locked by test).
  • ServeOutcome.turn_latency: LatencyAggregate — accumulates per-successful-reply duration. Excludes wait-for-next-message and pre-watermark / self-loop / RAG-only-skip cycles.
  • serve_persona_loop_inner instruments the per-reply path:
    • Instant::now captured AFTER filters, BEFORE RAG inspect
    • elapsed recorded into turn_latency only on successful say
    • tracing::info! per turn with lamport, duration, mean/min/max so the substrate's observability layer captures the metric structurally per [[observability-is-half-the-architecture]]

Doctrine fit

  • Monotonic Instant (not wall-clock) — immune to clock skew
  • One Instant per turn, no Vec growth, no heap allocs on hot path
  • Per Joel's computer-engineer mental model in [[init-once-handle-then-lease-zero-copy-refs]]: cache-friendly, branch-predictable, autovectorization-friendly

Tests (7/7 pass)

  • latency_aggregate_records_min_max_sum_count — empty + populated math; mean = total/count
  • latency_aggregate_saturates_on_overflow — locks safety property per [[every-error-is-an-opportunity-to-battle-harden]]
  • replies_to_inbound_from_other_peer (extended) — asserts turn_latency.count == 1 after one successful reply

Stacked on

PR #1514 (feat/persona-prime-at-spawn).

Closes #150. Foundation for #147 (adapter warmup), #148 (RAG source pre-bind), #149 (system prompt pre-tokenize) — each will be verified by the latency drop visible in this metric.

🤖 Generated with Claude Code

…come.turn_latency (#150)

Per Joel 2026-06-02: "make sure timing and other metrics are in place."
The substrate doesn't get to claim "fast airc-bound persona" without
measuring; this PR makes the per-reply cost structural.

Added (all in persona/service_loop.rs):

- LatencyAggregate { count, total_ms, min_ms, max_ms } — cheap online
  aggregator. O(1) record, allocation-free, saturating-add on
  overflow (locked by test). mean_ms returns Option<f64>.
- ServeOutcome.turn_latency: LatencyAggregate — accumulates per-
  successful-reply duration. Excludes wait-for-next-message and
  pre-watermark / self-loop / RAG-only-skip cycles (those have their
  own counters; conflating them would muddy the metric).
- serve_persona_loop_inner instruments the per-reply path:
  - Instant::now captured AFTER filters, BEFORE RAG inspect
  - elapsed recorded into turn_latency only on successful say
  - tracing::info per turn with lamport, duration, mean/min/max so
    the substrate's observability layer captures the metric
    structurally per [[observability-is-half-the-architecture]]

Doctrine fit:
- Monotonic Instant (not wall-clock) — immune to clock skew
- One Instant per turn, no Vec growth, no heap allocs on hot path
- Per Joel's computer-engineer mental model in
  [[init-once-handle-then-lease-zero-copy-refs]]: cache-friendly,
  branch-predictable, autovectorization-friendly

Tests (7/7 pass):
- latency_aggregate_records_min_max_sum_count — empty + populated
  math; mean = total/count
- latency_aggregate_saturates_on_overflow — locks the safety
  property per [[every-error-is-an-opportunity-to-battle-harden]]
- replies_to_inbound_from_other_peer (extended) — asserts
  turn_latency.count == 1 after one successful reply; min/max/mean
  set. If a future refactor forgets to record, count drops to 0 and
  the test fails loudly

Test plan:
- [x] cargo test --lib ... persona::service_loop:: — 7/7 pass

Closes #150. Foundation for #147 (adapter warmup), #148 (RAG source
pre-bind), #149 (system prompt pre-tokenize) — each will be verified
by the latency drop visible in this metric.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…test (caller-primes contract)

Per Joel 2026-06-02: "God I hope it's not more fallback cancer. You
tend to turn stuff into fake demos."

Two honest fixes addressing both criticisms.

## Fix 1: ONE place primes, not two (no more belt-and-suspenders)

Before: `spawn_persona_service` called `conversation.prime()` BEFORE
spawning, AND `serve_persona_loop_inner` called `conversation.prime()`
unconditionally as a "safety net." Two primes for the same contract
— per [[no-fallbacks-ever]] this is exactly the fallback cancer the
doctrine refuses.

After: `serve_persona_loop_inner` does NOT prime. Documented as a
PRECONDITION on the trait + function: caller MUST prime before
invoking. The supervisor's `spawn_persona_service` primes for
production. Direct callers (`airc_chat_demo`, tests) prime explicitly.

If a caller forgets, the first `next_message` returns the typed
`Err("called before prime()")` shipped in cb2894f — fail-loud,
never silently-warm.

Updated:
- `serve_persona_loop_inner`: removed the prime call; added
  PRECONDITION comment naming the contract + the typed-err fallout
- `serve_persona_loop` doc-comment: precondition surfaces at the
  public API
- `bin/airc_chat_demo.rs`: prime() explicitly before
  serve_persona_loop call
- All 4 StubConversation test sites prime explicitly
- `prime_failure_short_circuits_loop` replaced with
  `loop_without_caller_prime_surfaces_typed_error_per_turn` — tests
  the new caller-primes contract directly: unprimed conversation's
  next_message err counts as turns_errored, locks the absence of the
  safety-net call

## Fix 2: latency test verifies REAL elapsed time, not just plumbing

Before: `replies_to_inbound_from_other_peer` asserted
`turn_latency.count == 1` and that min/max/mean were Some. Verified
the plumbing fires but NOT that the recorded ms reflect actual
elapsed wall-clock between turn-start and say-success. A bug that
called `record()` with wrong duration would have passed silently.
Fake-demo-shaped.

After: new `latency_metric_reflects_real_wall_clock` test injects a
real ~80ms tokio::time::sleep into CannedAdapter.generate_text, runs
the loop, asserts:
- `observed_ms >= 50` (CI jitter floor — verifies metric tracks the
  injected delay, not always-zero)
- `observed_ms < 5000` (upper bound for sanity)

CannedAdapter gains `inject_delay_ms` field; `fake_hosted_with_delay`
helper exposes it. Default (`fake_hosted`) passes 0 so existing tests
are unaffected.

Test plan:
- [x] cargo test --lib ... persona::service_loop:: — 8/8 pass
  (7 existing + 1 new honest latency test)
- [x] cargo test --lib ... persona:: — 713/713 pass overall

Doctrine recap:
- [[no-fallbacks-ever]] — one place primes, not two
- [[every-error-is-an-opportunity-to-battle-harden]] — the
  caller-primes regression test locks the contract
- The honest latency test prevents the "passes on plumbing, silent
  on correctness" anti-pattern

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@joelteply
Copy link
Copy Markdown
Contributor Author

Superseded — the semantic content of this PR landed via the cognition pipeline rewires on canary.

Verified on canary (75b21d08a):

  • AircCitizen trait → src/workers/continuum-core/src/persona/airc_citizen.rs:73
  • PersonaConversation::primesrc/workers/continuum-core/src/persona/airc_persona_conversation.rs:105 + service_loop.rs:100
  • LatencyAggregate + ServeOutcome.turn_latencyservice_loop.rs:167,247
  • AIProviderAdapter::warmupai/adapter.rs:319, called at supervisor.rs:447
  • Test scaffolding primitives (scripted_conversation, scripted_adapter_factory) → persona/scripted_*.rs
  • PersonaSpawnSupervisor + BootSummarypersona/host.rs:119
  • .instrument(ctx.span()) (&ctx-pure tracing) → service_loop.rs:309
  • multi_persona_stress_baseline.rs → already on canary (via feat(persona): persona decides + responds via LLM in ONE structured call #1519)

The stack pre-dated #1519's "persona decides + responds via LLM in ONE structured call" rewire and #1539's cognition pipeline integration. Those PRs reshaped serve_persona_loop so heavily that rebasing this stack would mean replaying the architectural intent against code that already implements it. Closing as superseded; task system marks #144/#146/#147/#150/#154/#156 completed.

No work lost — every architectural insight from this stack is in the substrate today.

@joelteply joelteply deleted the branch feat/persona-prime-at-spawn June 7, 2026 15:51
@joelteply joelteply closed this Jun 7, 2026
@joelteply joelteply deleted the feat/persona-turn-latency-metrics branch June 7, 2026 15:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant