PR-E1b (ADR 0008 §6.5): gRPC long-session bench + server CLI #51

Merged
FluffyAIcode merged 1 commit into main from AgentMemory/v030-pr-e1b-grpc-long-session-bench-8e7f
Jun 2, 2026

@FluffyAIcode FluffyAIcode commented Jun 1, 2026

Why

PR-E1's deferred sibling. Ships everything needed to validate the two ADR 0008 §7 GA gates the deprecated HTTP shim's bench cannot answer:

| Gate | What it checks |
|---|---|
| memory bounded | agg.kv_bounded is True (KV stays inside a tight band) |
| prefill bounded | agg.prefill_bounded is True (per-turn latency is flat across the run) |

The HTTP shim's bench_long_session.py fails on prefill-bounded by architecture — every /v1/chat/completions request re-prefills the full conversation history. The session-bound gRPC contract makes prefill cost depend only on the size of each new user message, regardless of conversation length. PR-E1b measures this empirically.
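The difference between the two contracts can be illustrated with a toy token count. This is a minimal sketch, not the bench's code: `prefill_tokens_per_turn` is a hypothetical helper, and it ignores assistant replies (which would make the stateless case grow even faster).

```python
def prefill_tokens_per_turn(message_lengths, session_bound):
    """Return the number of tokens that must be prefilled on each turn.

    message_lengths: token count of each new user message, in order.
    session_bound:   True  -> only the new message is prefilled (gRPC session);
                     False -> the whole history is re-prefilled (stateless HTTP).
    """
    costs = []
    history = 0  # tokens already seen in earlier turns
    for n_new in message_lengths:
        costs.append(n_new if session_bound else history + n_new)
        history += n_new
    return costs


lengths = [20] * 5  # five turns of 20 tokens each (illustrative values)
stateless = prefill_tokens_per_turn(lengths, session_bound=False)
session = prefill_tokens_per_turn(lengths, session_bound=True)
print(stateless)  # [20, 40, 60, 80, 100]
print(session)    # [20, 20, 20, 20, 20]
```

The stateless series grows linearly with turn index, so total prefill work is quadratic in conversation length; the session-bound series stays flat.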

Mac M4 4-hour evidence — GA gate G2 PASSED ✅

Committed to main as bec3d7b: results/platform-tests/bench_session_4h_1780332893.json

=== Run config ===
  duration_s         = 14400.01
  turns              = 480           (24 × 10-min buckets, 20 turns/bucket — perfect cadence)
  errors             = 0
  abort_reason       = None
  partial            = false

=== Headline KPIs ===
  p50 latency        = 1.8286 s
  p95 latency        = 1.8529 s
  p95 - p50          = 0.024 s        (latency distribution is essentially a delta function)
  latency_drift_p50_s= 0.0093 s       (9 ms drift over 14400 s)
  bucket-p50 spread  = 0.014 s        (1.824s head → 1.838s tail)
  prefill_bounded    = True
  kv_bounded         = True           (see caveat below)

Comparison to v0.3.0-rc1 HTTP-shim 4h run

| Metric | v0.3.0-rc1 (HTTP shim) | v0.3 PR-E1b (gRPC session) | Δ |
|---|---|---|---|
| n_turns | 58 | 480 | 8.3× |
| n_errors | many (timeout loop) | 0 | |
| latency_drift_p50_s | +39.74 s | +0.0093 s | ≈4,270× improvement |

This is the empirical proof that the session-bound gRPC contract delivers the prefill-bounded property the deprecated HTTP shim could not, and is the binding evidence for ADR 0008 §7 GA gate G2.

Caveat: kv_live_bytes is always 0 in this report

The bench reports min/mean/max kv_live_bytes = 0 across all 480 turns. The verifier's KV cache is not actually empty — this is a reporting-path issue:

  • GetSessionInfo.kv_live_bytes reads Session.slab.live_kv_bytes (PR-A3b's wiring).
  • KVSlab.live_kv_bytes reflects whether anyone has called slab.append() to write K/V tensors into the slab.
  • The verifier maintains its own SinkWindowKVCache (an mlx_lm._BaseCache subclass) and never writes through to the slab.
  • The slab is therefore a "this session holds one capacity unit" placeholder, not a true byte gauge.

kv_bounded = True therefore reduces to 0 - 0 < 10% × max(0,1) = 0.10 — trivially True from a constant-zero series. Architecturally kv_bounded is a mathematical guarantee of the SinkWindowKVCache construction (capacity = (sink + window) tokens fixed at build time, here 68 tokens). 4h × 480 turns × no OOM / no crash is the indirect empirical confirmation. But the bench's headline number is degenerate.
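To make the degeneracy concrete, here is a sketch of a tolerance-band check reconstructed only from the worked expression above (spread < 10% × max(reference, 1)). The name `kv_bounded_sketch`, the choice of the series maximum as the reference value, and the handling of an empty series are assumptions; the real `_kv_bounded` may differ.

```python
def kv_bounded_sketch(kv_bytes, tolerance=0.10):
    """Return True if the KV-bytes series stays inside a relative band.

    Assumed rule: (max - min) < tolerance * max(max_value, 1).
    The floor of 1 avoids a zero-width band when every sample is 0.
    """
    if not kv_bytes:
        return False  # assumption: an empty series proves nothing
    spread = max(kv_bytes) - min(kv_bytes)
    reference = max(max(kv_bytes), 1)
    return spread < tolerance * reference


print(kv_bounded_sketch([0] * 480))     # True: 0 < 0.10, the degenerate case
print(kv_bounded_sketch([1000, 1050]))  # True: 50 < 105
print(kv_bounded_sketch([100, 200]))    # False: 100 >= 20
```

A constant-zero series passes for the same reason any constant series would, which is why the gate carries no information until the gauge reports real bytes.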

Fix queued as PR-E1c: wire GetSessionInfo.kv_live_bytes to the verifier's cache.k_seq_length × num_layers × num_kv_heads × head_dim × bytes_per_dtype × 2 (K + V). ~50-100 lines + unit tests. Independent of merging this PR.
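The byte count in that formula can be sketched as a plain function. The function name is hypothetical, and the model dimensions in the example (28 layers, 8 KV heads, head_dim 128, 2-byte dtype) are illustrative assumptions, not values read from this PR; only the 68-token capacity comes from the run above.

```python
def kv_live_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_dtype):
    """Bytes held by a K/V cache of seq_len tokens.

    The trailing factor of 2 accounts for storing both K and V.
    """
    return seq_len * num_layers * num_kv_heads * head_dim * bytes_per_dtype * 2


# 68 tokens = sink (4) + window (64); all other dims are assumed for illustration.
full_cache = kv_live_bytes(
    seq_len=68, num_layers=28, num_kv_heads=8, head_dim=128, bytes_per_dtype=2
)
print(full_cache)  # 7798784 bytes, roughly 7.4 MiB
```

Because the cache capacity is fixed, a correctly wired gauge should rise to a ceiling like this and then stay flat, which is the behaviour `kv_bounded` is meant to confirm.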

What ships

inference_engine/bench/ (new package, 100% covered)

| File | Stmts | Purpose |
|---|---|---|
| session_long_run.py | 56 | Pure aggregator: _percentile, _kv_bounded, _prefill_bounded, _latency_drift_p50_s, _bucketize_10min, aggregate_run. No numpy dep. |
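For readers unfamiliar with numpy-free percentiles, a linear-interpolated percentile can be written in a few lines. This is a sketch of the general technique, not the shipped `_percentile`; the function name and error handling are assumptions.

```python
def percentile(values, q):
    """Linear-interpolated percentile of values, with q in [0, 100]."""
    if not values:
        raise ValueError("percentile of an empty series is undefined")
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(values)
    rank = (q / 100.0) * (len(ordered) - 1)  # fractional index into ordered
    lo = int(rank)                            # floor, since rank >= 0
    hi = min(lo + 1, len(ordered) - 1)
    frac = rank - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


print(percentile([1, 2, 3, 4], 50))     # 2.5
print(percentile([1, 2, 3, 4, 5], 50))  # 3.0
print(percentile([10], 95))             # 10.0
```

Interpolating between neighbouring samples keeps p50 and p95 stable on small buckets, where a nearest-rank percentile would jump between samples.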

scripts/start_grpc_runtime_server.py (new, CLI plumbing)

Boots a real Qwen3 verifier (CPU or MLX), wires SessionStore + AppendTokensCoordinator + GenerationCoordinator, serves the gRPC RuntimeService. Pulls slab dims (num_layers, num_kv_heads, head_dim) from the verifier's HF config.

scripts/bench_agentic/bench_session_long_run.py (new, CLI plumbing)

Walks ONE gRPC session through many turns. Tokenizes only the new user message per turn. Records latency + session.info().kv_live_bytes + history length. Atomic JSON via os.replace + 10-min partial checkpoints.
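The atomic-write pattern mentioned here is commonly implemented by writing to a temporary file in the same directory and then renaming it over the target. The following is a minimal sketch of that pattern; `write_json_atomic` is a hypothetical name, and the bench script's actual implementation may differ.

```python
import json
import os
import tempfile


def write_json_atomic(path, payload):
    """Write payload as JSON so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for os.replace to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh, indent=2)
            fh.flush()
            os.fsync(fh.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic swap over any previous checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Calling this every 10 minutes with the partial report means a crash or reboot leaves either the previous checkpoint or the new one on disk, never a truncated file.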

tests/inference_engine/bench/test_session_long_run.py (new)

35 tests covering the aggregator to 100%.

scripts/review_pr_e1b_on_mac.sh (new, executable)

Default = 30-min smoke. --print-server-cmd / --print-4h-cmd subcommands print bare commands for separate-terminal long runs.

.github/workflows/ci.yaml

  • tests/inference_engine/bench/ added to the Linux pytest gate
  • --cov=inference_engine.bench added to the coverage gate

Linux verification

Linux CI gate (extended path set + new bench tests):
  730 passed, 100% coverage on 1750 stmts.

CLI scripts:
  py_compile passes on both. End-to-end runs validated on Mac M4 (see above).

Stack

PR-E1b is independent of PR-D1 (#49) and PR-E1 (#50) at the file level. Branched off main directly. Can merge in any order with the others.

Reviewer checklist

  • 4-hour Mac M4 evidence committed to main: results/platform-tests/bench_session_4h_1780332893.json (commit bec3d7b). prefill_bounded=True with 9 ms drift over 14400 s. ✅
  • 0 errors across 480 turns. ✅
  • Linux CI 100% coverage on the aggregator. ✅
  • kv_live_bytes=0 caveat acknowledged; PR-E1c queued to fix the reporting wiring.

Next PR

  • PR-E1c (queued, ~50-100 lines): wire GetSessionInfo.kv_live_bytes to the verifier's true cache byte count.
  • PR-D2: HTTP shim refactor onto SessionStore. Independent of this PR.
  • PR-E2: Self-hosted Mac M4 GitHub Actions workflow.

PR-E1's deferred sibling, per the scope split documented in PR-E1's
description. Ships everything needed to validate the two ADR 0008
§7 GA gates the deprecated HTTP shim's bench cannot answer:

  * memory bounded:  agg.kv_bounded
  * prefill bounded: agg.prefill_bounded

The HTTP shim's bench_long_session.py fails on prefill-bounded by
architecture (every /v1/chat/completions request re-prefills the
full conversation history). The session-bound gRPC contract makes
prefill cost depend only on the size of each new user message,
regardless of how long the conversation is. PR-E1b measures this
empirically.

What ships
----------

inference_engine/bench/ (new package)
  __init__.py
  session_long_run.py     [56 stmts, 100% covered]
    Pure-Python aggregation helpers split out of the CLI script so
    they can be unit-tested under the Linux 100% coverage gate.
    Exposes:
      _percentile           linear-interpolated, no numpy dep
      _kv_bounded           tolerance band on KV-bytes series
      _prefill_bounded      tail-vs-head p50 latency drift gate
      _latency_drift_p50_s  drift in seconds
      _bucketize_10min      10-minute bucket breakdown for long runs
      aggregate_run         the full report builder
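As an illustration of the 10-minute bucket breakdown, the sketch below groups per-turn samples by elapsed time and reports a count and median per bucket. The function name, the `(elapsed_s, latency_s)` input shape, and the output keys are assumptions, not the signature of `_bucketize_10min`.

```python
from statistics import median


def bucketize_10min(turns, bucket_s=600.0):
    """Group (elapsed_s, latency_s) samples into fixed-width time buckets."""
    buckets = {}
    for elapsed_s, latency_s in turns:
        index = int(elapsed_s // bucket_s)  # 0 for the first 10 minutes, etc.
        buckets.setdefault(index, []).append(latency_s)
    return [
        {"bucket": index, "n_turns": len(samples), "p50_s": median(samples)}
        for index, samples in sorted(buckets.items())
    ]


# Synthetic run at the 4-hour cadence: one turn every 30 s for 14400 s.
turns = [(i * 30.0, 1.83) for i in range(480)]
rows = bucketize_10min(turns)
print(len(rows))           # 24
print(rows[0]["n_turns"])  # 20
```

With a 30 s turn spacing each bucket holds 20 turns, so a short bucket in a real report is a quick signal of stalls or errors in that window.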

scripts/start_grpc_runtime_server.py (new, CLI)
  Boots a real Qwen3 verifier (CPU or MLX), wires it through
  SessionStore + AppendTokensCoordinator + GenerationCoordinator,
  serves the v0.3 gRPC RuntimeService. Pulls slab dims (num_layers,
  num_kv_heads, head_dim) from the verifier's HF config so
  GetSessionInfo.kv_live_bytes reports physically meaningful bytes.
  Symmetric to scripts/serve.py for the deprecated HTTP shim. CLI
  plumbing exempt from coverage by the same convention.

scripts/bench_agentic/bench_session_long_run.py (new, CLI)
  Walks ONE gRPC session through many turns, recording per-turn
  latency and session.info().kv_live_bytes. Tokenizes ONLY the
  new user message per turn (the whole point of session-bound
  runtime: prefill cost is O(new_user_message), not O(history)).
  Writes JSON via atomic os.replace + 10-min partial checkpoints
  so a host reboot mid-run doesn't lose evidence.

tests/inference_engine/bench/test_session_long_run.py (new, 35 tests)
  Covers the aggregator to 100%: _percentile (5 tests), _kv_bounded
  (5), _prefill_bounded (5), _latency_drift_p50_s (3), _bucketize_10min
  (6), aggregate_run (7) including empty / all-error / mixed /
  bounded / unbounded / custom-threshold paths.

scripts/review_pr_e1b_on_mac.sh (new, executable)
  Default invocation: 30-min smoke. Boots the gRPC server in the
  background, runs the bench against it, kills the server, prints
  the headline KPIs. Two helper subcommands print the bare server
  and 4-hour bench commands for separate-terminal manual runs:
      bash scripts/review_pr_e1b_on_mac.sh --print-server-cmd
      bash scripts/review_pr_e1b_on_mac.sh --print-4h-cmd

CI wiring
---------

.github/workflows/ci.yaml
  + tests/inference_engine/bench/ in the Linux pytest gate
  + --cov=inference_engine.bench in the coverage gate

Local verification (Linux VM, py3.12)
-------------------------------------
  Linux CI gate (extended path set + new bench tests):
    730 passed, 100% coverage on 1750 stmts (was 682 / 1660 stmts).
    +35 new tests under tests/inference_engine/bench/.

  CLI scripts:
    py_compile passes on both new scripts. Runtime validation
    happens on Mac M4 via review_pr_e1b_on_mac.sh.

Per ADR 0008 §9
----------------
This PR ships CLI plumbing + a pure-Python aggregator. The
aggregator is fully covered on Linux. The CLI scripts that drive
real model weights are platform-agnostic Python but only validated
end-to-end on Mac M4. Reviewer pushes the 30-min smoke JSON to the
PR branch; the 4-hour evidence run is committed separately when
wall-clock budget allows.

Sequence for the 4-hour evidence run (per user spec, exactly):

  1. Start gRPC server (Mac mini local, Qwen3-0.6B):
       PYTHONPATH=.:sdks/python python3 \
           scripts/start_grpc_runtime_server.py \
           --backend cpu --verifier-id Qwen/Qwen3-0.6B \
           --bind 127.0.0.1:50051 \
           --capacity 1 --sink 4 --window 64

  2. Run bench:
       PYTHONPATH=.:sdks/python python3 \
           scripts/bench_agentic/bench_session_long_run.py \
           --grpc-address 127.0.0.1:50051 \
           --tokenizer-id Qwen/Qwen3-0.6B \
           --duration-s 14400 --turn-spacing-s 30 \
           --output results/platform-tests/bench_session_4h_$(date +%s).json

  3. Commit JSON to branch.

Stack
-----
PR-E1b is independent of PR-D1 (#49) and PR-E1 (#50) at the file
level. Branched off main directly. Can merge in any order with
the others.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@FluffyAIcode FluffyAIcode marked this pull request as ready for review June 2, 2026 04:02
@FluffyAIcode FluffyAIcode merged commit 9773190 into main Jun 2, 2026
6 checks passed
@FluffyAIcode FluffyAIcode deleted the AgentMemory/v030-pr-e1b-grpc-long-session-bench-8e7f branch June 2, 2026 04:02
FluffyAIcode added a commit that referenced this pull request Jun 2, 2026
…tion workflow

Closes the loop on automated GA gating. After PR-N1..N4 retired all
verifier-protocol test doubles from the Linux gate, the integration
suite (tests/integration/) became the binding correctness gate for
runtime modules: inference_engine.session.coordinator,
inference_engine.session.generator,
inference_engine.scheduler.scheduler,
inference_engine.server.{app,engine,tokenizer,streaming}, and
kakeya.{client,session}. Until this PR, that suite ran manually
via scripts/review_pr_*_on_mac.sh; PR-E2 wires it into CI on every
PR labelled needs-mac-m4.

Three artifacts ship:

  .github/workflows/integration.yaml           +136 lines
    Self-hosted runner workflow targeting [self-hosted, macOS,
    ARM64, kakeya-mac-m4]. Triggers on PR events when the
    needs-mac-m4 label is present, plus on workflow_dispatch
    for manual re-runs. Steps:
      1. Checkout (full history).
      2. Verify host shape (chip, memory, python version).
      3. Verify Qwen/Qwen3-0.6B is in HF cache (HF_HUB_OFFLINE=1
         at test time, so no downloads in CI; a cache miss fails
         fast with a clear pre-warm command).
      4. pip install -e . + pytest dependencies (warm pip cache
         keeps this <30 s).
      5. pytest -m integration tests/integration/; expected
         runtime 60-120 s on M4 with warm cache. 90-min timeout
         is a safety margin, not the operating point.
      6. Upload JUnit XML artifact.
      7. On failure, inline the test names + first-line error
         messages into the Action log so triage doesn't require
         downloading the artifact.
    Concurrency: cancel-in-progress per PR, so a new push
    supersedes the previous run.
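Step 7 (inlining failures into the log) amounts to scanning the JUnit XML for `failure` and `error` nodes. A minimal sketch of that idea follows; the function name and output format are hypothetical, and the workflow may implement the step differently (for example in shell).

```python
import xml.etree.ElementTree as ET


def summarize_failures(junit_xml):
    """Return one 'classname::name: first error line' string per failed test."""
    root = ET.fromstring(junit_xml)
    lines = []
    for case in root.iter("testcase"):
        for kind in ("failure", "error"):
            node = case.find(kind)
            if node is None:
                continue
            # Prefer the message attribute; fall back to the element text.
            message = (node.get("message") or node.text or "").strip()
            first_line = message.splitlines()[0] if message else ""
            name = f'{case.get("classname", "")}::{case.get("name", "")}'
            lines.append(f"{name}: {first_line}")
            break  # report each test case once
    return lines


SAMPLE = """<testsuite>
  <testcase classname="tests.integration.test_a" name="test_ok"/>
  <testcase classname="tests.integration.test_a" name="test_bad">
    <failure message="AssertionError: boom">traceback here</failure>
  </testcase>
</testsuite>"""

for line in summarize_failures(SAMPLE):
    print(line)  # tests.integration.test_a::test_bad: AssertionError: boom
```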

  .github/workflows/auto-label-mac.yaml        +89 lines
    pull_request_target workflow that auto-applies (or removes)
    the needs-mac-m4 label based on which paths the PR touches.
    Trigger paths:
      inference_engine/  - runtime, scheduler, session, server
      sdks/              - Python + TypeScript SDK
      proto/             - wire contract
      tests/integration/ - the integration suite itself
      kv_cache_proposer/ - verifier + decoder
    Doc-only or CI-only PRs are NOT labelled; they skip the
    integration gate entirely, saving runner time. The label is
    automatically dropped if a subsequent push removes all
    verifier-dependent edits.
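The labelling rule reduces to a prefix match over the PR's changed paths. The sketch below expresses that decision in Python for clarity; the function name is hypothetical and the real workflow presumably evaluates this inside GitHub Actions rather than in a Python script.

```python
# Trigger paths as listed above.
TRIGGER_PREFIXES = (
    "inference_engine/",
    "sdks/",
    "proto/",
    "tests/integration/",
    "kv_cache_proposer/",
)


def needs_mac_m4(changed_paths):
    """True if any changed file lives under a verifier-dependent path."""
    return any(path.startswith(TRIGGER_PREFIXES) for path in changed_paths)


print(needs_mac_m4(["docs/ops/mac-m4-runner-setup.md"]))          # False
print(needs_mac_m4(["README.md", "inference_engine/bench/x.py"]))  # True
```

Re-evaluating the rule on every push is what lets the label drop automatically once a PR no longer touches any trigger path.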

  docs/ops/mac-m4-runner-setup.md              +137 lines
    Operator runbook for the self-hosted runner: hardware
    requirements (24 GB minimum, ~50 GB free disk), runner
    registration with the kakeya-mac-m4 label, HF cache
    pre-warm command (Qwen3-0.6B), Python toolchain setup,
    runtime expectations, cache hygiene cron, runner upgrade
    procedure, and failure triage steps.

CI workflow split rationale
---------------------------
The pre-existing .github/workflows/ci.yaml stays as the Linux gate
(verifier-independent, runs on github-hosted ubuntu-latest, fires
on every PR). PR-E2 adds integration.yaml as a SEPARATE workflow
because:
  1. Self-hosted runners are slow / few; doc-only PRs shouldn't
     touch them.
  2. The integration gate is intentionally OPT-IN by label; ci.yaml
     is non-optional.
  3. Failure semantics differ: Linux gate failure blocks merge
     unconditionally; Mac M4 gate failure surfaces a structured
     report but the merge decision is a human one until v0.3.0
     final ships.

Together the two workflows form the post-cleanup gating model:
  - Linux gate (ci.yaml):
      verifier-independent code; 100% coverage; every PR.
  - Mac M4 gate (integration.yaml):
      verifier-dependent code; binding GA gate; PRs touching
      runtime / SDK / proto / integration tests.

Stack
-----
PR-E2 is branched off main, independent of the cleanup PRs (#49,
#50, #51, #52, #53, #54, #55, #56). The workflow doesn't fail at
launch even before PR-E1 lands; it just won't find any tests
under tests/integration/ until that PR is merged. Recommended
merge order: cleanup PRs first (so the workflow has tests to
run), then PR-E2.

Per ADR 0008 §9
----------------
PR-E2 ships ONLY workflow YAML + a runbook, with no Python source
changes. No Mac M4 evidence required for this PR (the workflow
itself becomes the Mac M4 evidence machinery for ALL future PRs).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>