feat(persona): per-turn latency metrics — LatencyAggregate + ServeOutcome.turn_latency#1515
Closed
joelteply wants to merge 2 commits into
Closed
feat(persona): per-turn latency metrics — LatencyAggregate + ServeOutcome.turn_latency#1515joelteply wants to merge 2 commits into
joelteply wants to merge 2 commits into
Conversation
…come.turn_latency (#150) Per Joel 2026-06-02: "make sure timing and other metrics are in place." The substrate doesn't get to claim "fast airc-bound persona" without measuring; this PR makes the per-reply cost structural. Added (all in persona/service_loop.rs): - LatencyAggregate { count, total_ms, min_ms, max_ms } — cheap online aggregator. O(1) record, allocation-free, saturating-add on overflow (locked by test). mean_ms returns Option<f64>. - ServeOutcome.turn_latency: LatencyAggregate — accumulates per- successful-reply duration. Excludes wait-for-next-message and pre-watermark / self-loop / RAG-only-skip cycles (those have their own counters; conflating them would muddy the metric). - serve_persona_loop_inner instruments the per-reply path: - Instant::now captured AFTER filters, BEFORE RAG inspect - elapsed recorded into turn_latency only on successful say - tracing::info per turn with lamport, duration, mean/min/max so the substrate's observability layer captures the metric structurally per [[observability-is-half-the-architecture]] Doctrine fit: - Monotonic Instant (not wall-clock) — immune to clock skew - One Instant per turn, no Vec growth, no heap allocs on hot path - Per Joel's computer-engineer mental model in [[init-once-handle-then-lease-zero-copy-refs]]: cache-friendly, branch-predictable, autovectorization-friendly Tests (7/7 pass): - latency_aggregate_records_min_max_sum_count — empty + populated math; mean = total/count - latency_aggregate_saturates_on_overflow — locks the safety property per [[every-error-is-an-opportunity-to-battle-harden]] - replies_to_inbound_from_other_peer (extended) — asserts turn_latency.count == 1 after one successful reply; min/max/mean set. If a future refactor forgets to record, count drops to 0 and the test fails loudly Test plan: - [x] cargo test --lib ... persona::service_loop:: — 7/7 pass Closes #150. Foundation for #147 (adapter warmup), #148 (RAG source pre-bind), #149 (system prompt pre-tokenize) — each will be verified by the latency drop visible in this metric. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…test (caller-primes contract)
Per Joel 2026-06-02: "God I hope it's not more fallback cancer. You
tend to turn stuff into fake demos."
Two honest fixes addressing both criticisms.
## Fix 1: ONE place primes, not two (no more belt-and-suspenders)
Before: `spawn_persona_service` called `conversation.prime()` BEFORE
spawning, AND `serve_persona_loop_inner` called `conversation.prime()`
unconditionally as a "safety net." Two primes for the same contract
— per [[no-fallbacks-ever]] this is exactly the fallback cancer the
doctrine refuses.
After: `serve_persona_loop_inner` does NOT prime. Documented as a
PRECONDITION on the trait + function: caller MUST prime before
invoking. The supervisor's `spawn_persona_service` primes for
production. Direct callers (`airc_chat_demo`, tests) prime explicitly.
If a caller forgets, the first `next_message` returns the typed
`Err("called before prime()")` shipped in cb2894f — fail-loud,
never silently-warm.
Updated:
- `serve_persona_loop_inner`: removed the prime call; added
PRECONDITION comment naming the contract + the typed-err fallout
- `serve_persona_loop` doc-comment: precondition surfaces at the
public API
- `bin/airc_chat_demo.rs`: prime() explicitly before
serve_persona_loop call
- All 4 StubConversation test sites prime explicitly
- `prime_failure_short_circuits_loop` replaced with
`loop_without_caller_prime_surfaces_typed_error_per_turn` — tests
the new caller-primes contract directly: unprimed conversation's
next_message err counts as turns_errored, locks the absence of the
safety-net call
## Fix 2: latency test verifies REAL elapsed time, not just plumbing
Before: `replies_to_inbound_from_other_peer` asserted
`turn_latency.count == 1` and that min/max/mean were Some. Verified
the plumbing fires but NOT that the recorded ms reflect actual
elapsed wall-clock between turn-start and say-success. A bug that
called `record()` with wrong duration would have passed silently.
Fake-demo-shaped.
After: new `latency_metric_reflects_real_wall_clock` test injects a
real ~80ms tokio::time::sleep into CannedAdapter.generate_text, runs
the loop, asserts:
- `observed_ms >= 50` (CI jitter floor — verifies metric tracks the
injected delay, not always-zero)
- `observed_ms < 5000` (upper bound for sanity)
CannedAdapter gains `inject_delay_ms` field; `fake_hosted_with_delay`
helper exposes it. Default (`fake_hosted`) passes 0 so existing tests
are unaffected.
Test plan:
- [x] cargo test --lib ... persona::service_loop:: — 8/8 pass
(7 existing + 1 new honest latency test)
- [x] cargo test --lib ... persona:: — 713/713 pass overall
Doctrine recap:
- [[no-fallbacks-ever]] — one place primes, not two
- [[every-error-is-an-opportunity-to-battle-harden]] — the
caller-primes regression test locks the contract
- The honest latency test prevents the "passes on plumbing, silent
on correctness" anti-pattern
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
Author
|
Superseded — the semantic content of this PR landed via the cognition pipeline rewires on canary. Verified on
The stack pre-dated #1519's "persona decides + responds via LLM in ONE structured call" rewire and #1539's cognition pipeline integration. Those PRs reshaped No work lost — every architectural insight from this stack is in the substrate today. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Per Joel 2026-06-02 ("make sure timing and other metrics are in place"): the substrate doesn't get to claim "fast airc-bound persona" without measuring. This PR makes the per-reply cost structural.
What changed
LatencyAggregate { count, total_ms, min_ms, max_ms }— cheap online aggregator. O(1) record, allocation-free, saturating-add on overflow (locked by test).ServeOutcome.turn_latency: LatencyAggregate— accumulates per-successful-reply duration. Excludes wait-for-next-message and pre-watermark / self-loop / RAG-only-skip cycles.serve_persona_loop_innerinstruments the per-reply path:Instant::nowcaptured AFTER filters, BEFORE RAG inspectturn_latencyonly on successfulsaytracing::info!per turn with lamport, duration, mean/min/max so the substrate's observability layer captures the metric structurally per [[observability-is-half-the-architecture]]Doctrine fit
Instant(not wall-clock) — immune to clock skewInstantper turn, noVecgrowth, no heap allocs on hot pathTests (7/7 pass)
latency_aggregate_records_min_max_sum_count— empty + populated math; mean = total/countlatency_aggregate_saturates_on_overflow— locks safety property per [[every-error-is-an-opportunity-to-battle-harden]]replies_to_inbound_from_other_peer(extended) — assertsturn_latency.count == 1after one successful replyStacked on
PR #1514 (
feat/persona-prime-at-spawn).Closes #150. Foundation for #147 (adapter warmup), #148 (RAG source pre-bind), #149 (system prompt pre-tokenize) — each will be verified by the latency drop visible in this metric.
🤖 Generated with Claude Code