PR-E1b (ADR 0008 §6.5): gRPC long-session bench + server CLI by FluffyAIcode · Pull Request #51 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-06-01T16:33:52Z

Why

PR-E1's deferred sibling. Ships everything needed to validate the two ADR 0008 §7 GA gates the deprecated HTTP shim's bench cannot answer:

| Gate | What it checks |
| --- | --- |
| memory bounded | `agg.kv_bounded` is True (KV stays inside a tight band) |
| prefill bounded | `agg.prefill_bounded` is True (per-turn latency is flat across the run) |

The HTTP shim's bench_long_session.py fails on prefill-bounded by architecture — every /v1/chat/completions request re-prefills the full conversation history. The session-bound gRPC contract makes prefill cost depend only on the size of each new user message, regardless of conversation length. PR-E1b measures this empirically.
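To make the architectural difference concrete, here is a minimal sketch that counts how many tokens are prefilled on each turn under the two contracts. The function name and the fixed 40-token message size are illustrative assumptions, not code from this PR, and assistant replies are left out of the history for simplicity.

```python
def tokens_prefilled_per_turn(message_lengths, session_bound):
    """Count the tokens prefilled on each turn (illustrative model only).

    session_bound=False models a stateless request that re-sends the
    whole conversation; session_bound=True models a session that only
    receives the new user message.
    """
    prefilled = []
    history = 0
    for n in message_lengths:
        history += n
        prefilled.append(n if session_bound else history)
    return prefilled


msgs = [40] * 5  # hypothetical: five user messages of 40 tokens each
stateless = tokens_prefilled_per_turn(msgs, session_bound=False)
session = tokens_prefilled_per_turn(msgs, session_bound=True)
print(stateless)  # [40, 80, 120, 160, 200]
print(session)    # [40, 40, 40, 40, 40]
```

Under this simplified model the stateless cost grows linearly with turn count, while the session-bound cost stays flat, which is the property the prefill-bounded gate checks.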

Mac M4 4-hour evidence — GA gate G2 PASSED ✅

Committed to main as bec3d7b: results/platform-tests/bench_session_4h_1780332893.json

=== Run config ===
  duration_s         = 14400.01
  turns              = 480           (24 × 10-min buckets, 20 turns/bucket — perfect cadence)
  errors             = 0
  abort_reason       = None
  partial            = false

=== Headline KPIs ===
  p50 latency        = 1.8286 s
  p95 latency        = 1.8529 s
  p95 - p50          = 0.024 s        (latency distribution is essentially a delta function)
  latency_drift_p50_s= 0.0093 s       (9 ms drift over 14400 s)
  bucket-p50 spread  = 0.014 s        (1.824s head → 1.838s tail)
  prefill_bounded    = True
  kv_bounded         = True           (see caveat below)
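The "24 × 10-min buckets, 20 turns/bucket" figure follows directly from the 30-second turn spacing. A minimal sketch of that bucketing, assuming each turn is recorded as an `(elapsed_s, latency_s)` pair (a hypothetical record shape, not necessarily what the bench stores):

```python
def bucketize_10min(turns, bucket_s=600.0):
    """Group per-turn latencies into consecutive fixed-width time buckets.

    turns: iterable of (elapsed_s, latency_s) pairs (assumed shape).
    Returns a list of latency lists, ordered by bucket index.
    """
    buckets = {}
    for elapsed_s, latency_s in turns:
        index = int(elapsed_s // bucket_s)
        buckets.setdefault(index, []).append(latency_s)
    return [buckets[k] for k in sorted(buckets)]


# 480 turns spaced 30 s apart, with a placeholder latency of 1.83 s each.
turns = [(30.0 * i, 1.83) for i in range(480)]
buckets = bucketize_10min(turns)
print(len(buckets), {len(b) for b in buckets})  # 24 {20}
```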

Comparison to v0.3.0-rc1 HTTP-shim 4h run

| Metric | v0.3.0-rc1 (HTTP shim) | v0.3 PR-E1b (gRPC session) | Δ |
| --- | --- | --- | --- |
| n_turns | 58 | 480 | 8.3× |
| n_errors | many (timeout loop) | 0 | n/a |
| `latency_drift_p50_s` | +39.74 s | +0.0093 s | ≈4,300× smaller |

This is the empirical proof that the session-bound gRPC contract delivers the prefill-bounded property the deprecated HTTP shim could not, and is the binding evidence for ADR 0008 §7 GA gate G2.

Caveat: `kv_live_bytes` is always 0 in this report

The bench reports min/mean/max kv_live_bytes = 0 across all 480 turns. The verifier's KV cache is not actually empty — this is a reporting-path issue:

1. GetSessionInfo.kv_live_bytes reads Session.slab.live_kv_bytes (PR-A3b's wiring).
2. KVSlab.live_kv_bytes only counts K/V tensors that have been written into the slab through slab.append().
3. The verifier maintains its own SinkWindowKVCache (an mlx_lm._BaseCache subclass) and never writes through to the slab.
4. The slab is therefore a "this session holds one capacity unit" placeholder, not a true byte gauge.

kv_bounded = True therefore reduces to 0 - 0 < 10% × max(0, 1) = 0.10, which is trivially True for a constant-zero series. Architecturally, kv_bounded is a mathematical guarantee of the SinkWindowKVCache construction: capacity is fixed at build time at (sink + window) tokens, here 4 + 64 = 68 tokens. The 4-hour, 480-turn run with no OOM and no crash is indirect empirical confirmation, but the bench's headline number is degenerate.
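The degenerate case can be reproduced with a small sketch of a tolerance-band check. This is an assumption about the shape of the gate, inferred only from the inequality quoted above (spread < 10% × max(peak, 1)); the real `_kv_bounded` may normalise differently.

```python
def kv_bounded(kv_bytes, tolerance=0.10):
    """Return True if the KV-bytes series stays inside a tolerance band.

    Assumed rule: (max - min) < tolerance * max(peak, 1). The floor of 1
    avoids a zero-width band when every sample is zero.
    """
    if not kv_bytes:
        raise ValueError("kv_bytes must contain at least one sample")
    spread = max(kv_bytes) - min(kv_bytes)
    return spread < tolerance * max(max(kv_bytes), 1)


print(kv_bounded([0] * 480))           # True, but only because 0 < 0.10
print(kv_bounded([1000, 1050, 1020]))  # True: spread 50 < 105
print(kv_bounded([1000, 2000]))        # False: spread 1000 >= 200
```

The first call is the situation in this report: the check passes without saying anything about real cache growth.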

Fix queued as PR-E1c: wire GetSessionInfo.kv_live_bytes to the verifier's cache.k_seq_length × num_layers × num_kv_heads × head_dim × bytes_per_dtype × 2 (K + V). ~50-100 lines + unit tests. Independent of merging this PR.
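As a rough illustration of the byte count PR-E1c is meant to report, the sketch below evaluates the product quoted above. The function name is hypothetical, and the model dimensions in the example call are placeholder assumptions for illustration, not values read from the verifier's HF config.

```python
def kv_live_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_dtype):
    """Bytes held by K and V tensors for seq_len cached tokens.

    Implements seq_len x num_layers x num_kv_heads x head_dim
    x bytes_per_dtype x 2, where the final factor covers K plus V.
    """
    return seq_len * num_layers * num_kv_heads * head_dim * bytes_per_dtype * 2


# Assumed dims: 68 cached tokens (sink 4 + window 64), 28 layers,
# 8 KV heads, head_dim 128, 2 bytes per element (a 16-bit dtype).
print(kv_live_bytes(68, 28, 8, 128, 2))  # 7798784
```

With a fixed sink + window capacity the result is a constant once the window fills, so a correctly wired gauge should plateau rather than read zero.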

What ships

`inference_engine/bench/` (new package, 100% covered)

| File | Stmts | Purpose |
| --- | --- | --- |
| `session_long_run.py` | 56 | Pure aggregator: `_percentile`, `_kv_bounded`, `_prefill_bounded`, `_latency_drift_p50_s`, `_bucketize_10min`, `aggregate_run`. No numpy dep. |
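For readers who want a feel for what such an aggregator does, here is a minimal numpy-free sketch of a linear-interpolated percentile and a tail-vs-head p50 drift gate. The names drop the leading underscore to signal that this is not the module's code; the 25% head/tail fraction and the 0.5 s threshold are assumptions chosen for illustration.

```python
import math


def percentile(values, q):
    """Linear-interpolated percentile for q in [0, 100], without numpy."""
    if not values:
        raise ValueError("percentile of an empty series is undefined")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def latency_drift_p50_s(latencies, frac=0.25):
    """p50 of the last `frac` of turns minus p50 of the first `frac`."""
    n = max(1, int(len(latencies) * frac))
    return percentile(latencies[-n:], 50) - percentile(latencies[:n], 50)


def prefill_bounded(latencies, max_drift_s=0.5):
    """Gate: the tail p50 may exceed the head p50 by at most max_drift_s."""
    return latency_drift_p50_s(latencies) <= max_drift_s


flat = [1.8] * 40                            # flat per-turn latency
growing = [1.0 + i for i in range(40)]       # latency grows every turn
print(prefill_bounded(flat))                 # True
print(latency_drift_p50_s(growing))          # 30.0
print(prefill_bounded(growing))              # False
```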

`scripts/start_grpc_runtime_server.py` (new, CLI plumbing)

Boots a real Qwen3 verifier (CPU or MLX), wires SessionStore + AppendTokensCoordinator + GenerationCoordinator, serves the gRPC RuntimeService. Pulls slab dims (num_layers, num_kv_heads, head_dim) from the verifier's HF config.

`scripts/bench_agentic/bench_session_long_run.py` (new, CLI plumbing)

Walks ONE gRPC session through many turns. Tokenizes only the new user message per turn. Records latency + session.info().kv_live_bytes + history length. Atomic JSON via os.replace + 10-min partial checkpoints.
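The atomic-write pattern mentioned here can be sketched as follows. This is a generic illustration of writing to a temporary file and swapping it in with `os.replace`, not the script's actual code; the helper name and the payload fields are hypothetical.

```python
import json
import os
import tempfile


def write_json_atomic(path, payload):
    """Write payload as JSON so readers never observe a half-written file.

    The data goes to a temporary file in the same directory, is flushed
    to disk, and is then swapped into place with os.replace.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh, indent=2)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A periodic checkpoint then amounts to calling the helper with a report marked `"partial": True` every 10 minutes and once more with the final report, so an interrupted run still leaves the most recent complete checkpoint on disk.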

`tests/inference_engine/bench/test_session_long_run.py` (new)

35 tests covering the aggregator to 100%.

`scripts/review_pr_e1b_on_mac.sh` (new, executable)

Default = 30-min smoke. --print-server-cmd / --print-4h-cmd subcommands print bare commands for separate-terminal long runs.

`.github/workflows/ci.yaml`

tests/inference_engine/bench/ added to the Linux pytest gate
--cov=inference_engine.bench added to the coverage gate

Linux verification

Linux CI gate (extended path set + new bench tests):
  730 passed, 100% coverage on 1750 stmts.

CLI scripts:
  py_compile passes on both. End-to-end runs validated on Mac M4 (see above).

Stack

PR-E1b is independent of PR-D1 (#49) and PR-E1 (#50) at the file level. Branched off main directly. Can merge in any order with the others.

Reviewer checklist

4-hour Mac M4 evidence committed to main — results/platform-tests/bench_session_4h_1780332893.json (commit bec3d7b). prefill_bounded=True with 9 ms drift over 14400 s. ✅
0 errors across 480 turns. ✅
Linux CI 100% coverage on the aggregator. ✅
kv_live_bytes=0 caveat acknowledged; PR-E1c queued to fix the reporting wiring.

Next PR

PR-E1c (queued, ~50-100 lines): wire GetSessionInfo.kv_live_bytes to the verifier's true cache byte count.
PR-D2: HTTP shim refactor onto SessionStore. Independent of this PR.
PR-E2: Self-hosted Mac M4 GitHub Actions workflow.

PR-E1's deferred sibling, per the scope split documented in PR-E1's description. Ships everything needed to validate the two ADR 0008 §7 GA gates the deprecated HTTP shim's bench cannot answer:

  * memory bounded:  agg.kv_bounded
  * prefill bounded: agg.prefill_bounded

The HTTP shim's bench_long_session.py fails on prefill-bounded by architecture (every /v1/chat/completions request re-prefills the full conversation history). The session-bound gRPC contract makes prefill cost depend only on the size of each new user message, regardless of how long the conversation is. PR-E1b measures this empirically.

What ships
----------

inference_engine/bench/ (new package)
  __init__.py
  session_long_run.py [56 stmts, 100% covered]
    Pure-Python aggregation helpers split out of the CLI script so they can be unit-tested under the Linux 100% coverage gate. Exposes:
      _percentile            linear-interpolated, no numpy dep
      _kv_bounded            tolerance band on KV-bytes series
      _prefill_bounded       tail-vs-head p50 latency drift gate
      _latency_drift_p50_s   drift in seconds
      _bucketize_10min       10-minute bucket breakdown for long runs
      aggregate_run          the full report builder

scripts/start_grpc_runtime_server.py (new, CLI)
  Boots a real Qwen3 verifier (CPU or MLX), wires it through SessionStore + AppendTokensCoordinator + GenerationCoordinator, serves the v0.3 gRPC RuntimeService. Pulls slab dims (num_layers, num_kv_heads, head_dim) from the verifier's HF config so GetSessionInfo.kv_live_bytes reports physically meaningful bytes. Symmetric to scripts/serve.py for the deprecated HTTP shim. CLI plumbing exempt from coverage by the same convention.

scripts/bench_agentic/bench_session_long_run.py (new, CLI)
  Walks ONE gRPC session through many turns, recording per-turn latency and session.info().kv_live_bytes. Tokenizes ONLY the new user message per turn (the whole point of session-bound runtime: prefill cost is O(new_user_message), not O(history)). Writes JSON via atomic os.replace + 10-min partial checkpoints so a host reboot mid-run doesn't lose evidence.

tests/inference_engine/bench/test_session_long_run.py (new, 35 tests)
  Covers the aggregator to 100%: _percentile (5 tests), _kv_bounded (5), _prefill_bounded (5), _latency_drift_p50_s (3), _bucketize_10min (6), aggregate_run (7) including empty / all-error / mixed / bounded / unbounded / custom-threshold paths.

scripts/review_pr_e1b_on_mac.sh (new, executable)
  Default invocation: 30-min smoke. Boots the gRPC server in the background, runs the bench against it, kills the server, prints the headline KPIs. Two helper subcommands print the bare server and 4-hour bench commands for separate-terminal manual runs:
    bash scripts/review_pr_e1b_on_mac.sh --print-server-cmd
    bash scripts/review_pr_e1b_on_mac.sh --print-4h-cmd

CI wiring
---------

.github/workflows/ci.yaml
  + tests/inference_engine/bench/ in the Linux pytest gate
  + --cov=inference_engine.bench in the coverage gate

Local verification (Linux VM, py3.12)
-------------------------------------

Linux CI gate (extended path set + new bench tests):
  730 passed, 100% coverage on 1750 stmts (was 682 / 1660 stmts).
  +35 new tests under tests/inference_engine/bench/.

CLI scripts:
  py_compile passes on both new scripts. Runtime validation happens on Mac M4 via review_pr_e1b_on_mac.sh.

Per ADR 0008 §9
----------------

This PR ships CLI plumbing + a pure-Python aggregator. The aggregator is fully covered on Linux. The CLI scripts that drive real model weights are platform-agnostic Python but only validated end-to-end on Mac M4. Reviewer pushes the 30-min smoke JSON to the PR branch; the 4-hour evidence run is committed separately when wall-clock budget allows.

Sequence for the 4-hour evidence run (per user spec, exactly):

  1. Start gRPC server (Mac mini local, Qwen3-0.6B):
       PYTHONPATH=.:sdks/python python3 \
         scripts/start_grpc_runtime_server.py \
         --backend cpu --verifier-id Qwen/Qwen3-0.6B \
         --bind 127.0.0.1:50051 \
         --capacity 1 --sink 4 --window 64
  2. Run bench:
       PYTHONPATH=.:sdks/python python3 \
         scripts/bench_agentic/bench_session_long_run.py \
         --grpc-address 127.0.0.1:50051 \
         --tokenizer-id Qwen/Qwen3-0.6B \
         --duration-s 14400 --turn-spacing-s 30 \
         --output results/platform-tests/bench_session_4h_$(date +%s).json
  3. Commit JSON to branch.

Stack
-----

PR-E1b is independent of PR-D1 (#49) and PR-E1 (#50) at the file level. Branched off main directly. Can merge in any order with the others.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…tion workflow

Closes the loop on automated GA gating. After PR-N1..N4 retired all verifier-protocol test doubles from the Linux gate, the integration suite (tests/integration/) became the binding correctness gate for runtime modules: inference_engine.session.coordinator, inference_engine.session.generator, inference_engine.scheduler.scheduler, inference_engine.server.{app,engine,tokenizer,streaming}, and kakeya.{client,session}. Until this PR, that suite ran manually via scripts/review_pr_*_on_mac.sh; PR-E2 wires it into CI on every PR labelled needs-mac-m4.

Three artifacts ship:

.github/workflows/integration.yaml  +136 lines
  Self-hosted runner workflow targeting [self-hosted, macOS, ARM64, kakeya-mac-m4]. Triggers on PR events when the needs-mac-m4 label is present, plus on workflow_dispatch for manual re-runs. Steps:
    1. Checkout (full history).
    2. Verify host shape (chip, memory, python version).
    3. Verify Qwen/Qwen3-0.6B is in HF cache (HF_HUB_OFFLINE=1 at test time: no downloads in CI; cache miss fails fast with a clear pre-warm command).
    4. pip install -e . + pytest dependencies (warm pip cache keeps this <30 s).
    5. pytest -m integration tests/integration/ (expected runtime 60-120 s on M4 with warm cache; the 90-min timeout is a safety margin, not the operating point).
    6. Upload JUnit XML artifact.
    7. On failure, inline the test names + first-line error messages into the Action log so triage doesn't require downloading the artifact.
  Concurrency: cancel-in-progress per PR, so a new push supersedes the previous run.

.github/workflows/auto-label-mac.yaml  +89 lines
  pull_request_target workflow that auto-applies (or removes) the needs-mac-m4 label based on which paths the PR touches. Trigger paths:
    inference_engine/     runtime, scheduler, session, server
    sdks/                 Python + TypeScript SDK
    proto/                wire contract
    tests/integration/    the integration suite itself
    kv_cache_proposer/    verifier + decoder
  Doc-only or CI-only PRs are NOT labelled; they skip the integration gate entirely, saving runner time. The label is automatically dropped if a subsequent push removes all verifier-dependent edits.

docs/ops/mac-m4-runner-setup.md  +137 lines
  Operator runbook for the self-hosted runner: hardware requirements (24 GB minimum, ~50 GB free disk), runner registration with the kakeya-mac-m4 label, HF cache pre-warm command (Qwen3-0.6B), Python toolchain setup, runtime expectations, cache hygiene cron, runner upgrade procedure, and failure triage steps.

CI workflow split rationale
---------------------------

The pre-existing .github/workflows/ci.yaml stays as the Linux gate (verifier-independent, runs on github-hosted ubuntu-latest, fires on every PR). PR-E2 adds integration.yaml as a SEPARATE workflow because:

  1. Self-hosted runners are slow / few; doc-only PRs shouldn't touch them.
  2. The integration gate is intentionally OPT-IN by label; ci.yaml is non-optional.
  3. Failure semantics differ: Linux gate failure blocks merge unconditionally; Mac M4 gate failure surfaces a structured report but the merge decision is a human one until v0.3.0 final ships.

Together the two workflows form the post-cleanup gating model:

  - Linux gate (ci.yaml): verifier-independent code; 100% coverage; every PR.
  - Mac M4 gate (integration.yaml): verifier-dependent code; binding GA gate; PRs touching runtime / SDK / proto / integration tests.

Stack
-----

PR-E2 is branched off main, independent of the cleanup PRs (#49, #50, #51, #52, #53, #54, #55, #56). The workflow doesn't fail at launch even before PR-E1 lands; it just won't find any tests under tests/integration/ until that PR is merged. Recommended merge order: cleanup PRs first (so the workflow has tests to run), then PR-E2.

Per ADR 0008 §9
----------------

PR-E2 ships ONLY workflow YAML + a runbook; no Python source changes. No Mac M4 evidence required for this PR (the workflow itself becomes the Mac M4 evidence machinery for ALL future PRs).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

FluffyAIcode mentioned this pull request Jun 2, 2026

PR-E1c: fix kv_live_bytes reporting path #52

Merged

4 tasks

FluffyAIcode marked this pull request as ready for review June 2, 2026 04:02

FluffyAIcode merged commit 9773190 into main Jun 2, 2026
6 checks passed

FluffyAIcode deleted the AgentMemory/v030-pr-e1b-grpc-long-session-bench-8e7f branch June 2, 2026 04:02

FluffyAIcode mentioned this pull request Jun 2, 2026

PR-E2 (ADR 0008 §6.5): self-hosted Mac M4 GitHub Actions integration workflow #57

Merged

4 tasks

FluffyAIcode merged 1 commit into main from AgentMemory/v030-pr-e1b-grpc-long-session-bench-8e7f

FluffyAIcode commented Jun 1, 2026 • edited by cursor Bot
