PR-D2 (ADR 0008 Phase D): refactor HTTP shim onto SessionStore by FluffyAIcode · Pull Request #58 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-06-02T04:26:47Z

Why

Retires the Scheduler + PooledVerifier + SpeculativeEngine machinery from the HTTP shim's request path. Each /v1/chat/completions request is now a single-shot session under SessionStore: CreateSession → AppendTokens(prompt) → Generate → CloseSession. Same semantics as the gRPC RuntimeService surface; ADR 0008 §2.7 deprecation.
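The single-shot lifecycle above can be sketched as follows. This is a minimal illustration, not the shim's code: `ToySessionStore`, `toy_next_token`, and `single_shot_completion` are hypothetical stand-ins, and the real `SessionStore` and coordinator APIs may differ. The point it shows is that the session is always closed, on success or on error.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List


@dataclass
class ToySession:
    session_id: int
    history: List[int] = field(default_factory=list)


class ToySessionStore:
    """Hypothetical in-memory stand-in for SessionStore (not the real API)."""

    def __init__(self) -> None:
        self._ids = count(1)
        self.sessions: Dict[int, ToySession] = {}

    def create_session(self) -> ToySession:
        session = ToySession(next(self._ids))
        self.sessions[session.session_id] = session
        return session

    def append_tokens(self, session_id: int, tokens: List[int]) -> int:
        session = self.sessions[session_id]
        session.history.extend(tokens)
        return len(session.history)  # new history length

    def close_session(self, session_id: int) -> None:
        self.sessions.pop(session_id, None)


def toy_next_token(history: List[int]) -> int:
    # Stand-in for one autoregressive step against the verifier.
    return (sum(history) + 1) % 100


def single_shot_completion(store, prompt_tokens, max_new_tokens,
                           next_token=toy_next_token):
    # CreateSession -> AppendTokens(prompt) -> Generate -> CloseSession
    session = store.create_session()
    try:
        store.append_tokens(session.session_id, prompt_tokens)
        generated = []
        for _ in range(max_new_tokens):
            token = next_token(store.sessions[session.session_id].history)
            store.append_tokens(session.session_id, [token])
            generated.append(token)
        return generated
    finally:
        # Runs on success and on error, so no session outlives its request.
        store.close_session(session.session_id)
```

Because each request owns exactly one session, the `try`/`finally` is the only cleanup path the handler needs in this sketch.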

Three architectural changes

| # | Change | Notes |
|---|--------|-------|
| 1 | Speculative decoding is no longer applied on the HTTP path | The session-bound runtime is pure AR against the verifier; the proposer is wired into v0.4 alignment work (ADR 0004). Pre-PR-D2 the HTTP shim used SpeculativeEngine; post-PR-D2 it's roughly the same speed as transformers-vanilla AR. Migrate to gRPC for v0.3's full perf story. |
| 2 | Admission control is now `asyncio.Semaphore` | REJECT vs QUEUE policy with `queue_max_wait_s` is preserved (`queue_max_wait_s=0` means wait forever); the in-flight slab-pool bookkeeping moved into `SessionStore`. The Scheduler module + integration tests stay (used by other callers), just no longer wired to HTTP. |
| 3 | ADR 0008 §2.7 deprecation headers | Stamped onto every response by a new `_DeprecationHeadersMiddleware`: `Deprecation: true`, `Sunset: Wed, 31 Dec 2025 00:00:00 GMT`, `Link: </docs/adr/0008-...>; rel="successor-version"`. |
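A semaphore-based admission gate with REJECT and QUEUE policies might look roughly like the sketch below. The names `AdmissionPolicy`, `AdmissionRejected`, and `admit` are hypothetical, and how the shim reports a rejection to the client is not shown here; only the `queue_max_wait_s=0` "wait forever" convention comes from this PR.

```python
import asyncio
from enum import Enum


class AdmissionPolicy(Enum):  # hypothetical names, for illustration only
    REJECT = "reject"
    QUEUE = "queue"


class AdmissionRejected(Exception):
    """Raised when a request cannot be admitted."""


async def admit(sem: asyncio.Semaphore, policy: AdmissionPolicy,
                queue_max_wait_s: float = 0.0) -> None:
    if policy is AdmissionPolicy.REJECT:
        if sem.locked():
            raise AdmissionRejected("no free slot")
        # A slot is free and nothing awaited in between, so this returns at once.
        await sem.acquire()
        return
    if queue_max_wait_s == 0:
        await sem.acquire()  # 0 means wait forever
        return
    try:
        await asyncio.wait_for(sem.acquire(), timeout=queue_max_wait_s)
    except asyncio.TimeoutError:
        raise AdmissionRejected("queue wait exceeded") from None


async def demo() -> list:
    sem = asyncio.Semaphore(1)
    await admit(sem, AdmissionPolicy.REJECT)  # takes the only slot
    outcomes = []
    for policy, wait in ((AdmissionPolicy.REJECT, 0.0),
                         (AdmissionPolicy.QUEUE, 0.05)):
        try:
            await admit(sem, policy, wait)
            outcomes.append("admitted")
        except AdmissionRejected:
            outcomes.append("rejected")
    sem.release()  # the first request finishes
    await admit(sem, AdmissionPolicy.QUEUE, 0)
    outcomes.append("admitted")
    return outcomes
```

In this sketch the caller is responsible for calling `sem.release()` once the request completes, typically in a `finally` block next to the session close.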

Files

| Direction | File | Δ |
|-----------|------|---|
| Rewritten | `inference_engine/server/app.py` | +330 / −300 net; `create_app(verifier, config, *, slab_pool=None, model_id_label=None)`; new `_DeprecationHeadersMiddleware`; route handler runs SessionStore + AppendTokensCoordinator + GenerationCoordinator with sync gen wrapped in `asyncio.to_thread` for disconnect-poll responsiveness |
| Deleted | `inference_engine/scheduler/pooled_verifier.py` | −175 |
| Updated | `inference_engine/scheduler/__init__.py` | dropped `PooledVerifier` from `__all__` |
| Rewritten | `scripts/serve.py` | `_build_engine` → `_build_verifier`; banner says "DEPRECATED HTTP shim", points at gRPC entrypoint; `--block-size` / `--num-diffusion-steps` flags retained for CLI compat but documented as ignored |
| Deleted | `tests/inference_engine/scheduler/test_pooled_verifier.py` | −272; PR-N1's exemption was precisely because PR-D2 retires this module |
| Added | `tests/inference_engine/server/test_grpc_app.py` | +132, 3 tests covering grpc_app's success paths (`test_append_tokens_session_not_found_returns_not_found`, `test_append_tokens_success_returns_response`, `test_generate_yields_history_truncated_then_done`), previously covered by deleted-in-PR-N3 FakeVerifier-backed tests |
| Updated | `tests/integration/test_http_shim_real.py` | Fixture: `real_speculative_engine` → `real_speculative_engine._decoder.verifier`. Asserts: `state.engine.model_id_label` → `state.model_id_label`. |
| Updated | `.github/workflows/ci.yaml` | dropped `pooled_verifier.py` from `--include=` filter |
| Added | `scripts/review_pr_d2_on_mac.sh` | Mac M4 reviewer aid |
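The "sync gen wrapped in `asyncio.to_thread`" pattern named for `app.py` can be illustrated with a queue bridge. This is a sketch under assumptions, not the handler itself: `bridge_sync_generator`, `toy_tokens`, and the `stop` event are hypothetical, and the real route handler's disconnect polling is not reproduced.

```python
import asyncio
import threading

_DONE = object()  # sentinel marking the end of the stream


async def bridge_sync_generator(make_gen, stop: threading.Event):
    """Run a blocking generator in a worker thread and yield its items
    on the event loop, so the loop stays free for other work."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def pump() -> None:
        try:
            for item in make_gen():
                if stop.is_set():  # consumer went away: stop producing
                    break
                loop.call_soon_threadsafe(queue.put_nowait, item)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, _DONE)

    worker = asyncio.create_task(asyncio.to_thread(pump))
    try:
        while True:
            item = await queue.get()
            if item is _DONE:
                break
            yield item
    finally:
        stop.set()
        await worker  # surfaces any exception raised in the thread


def toy_tokens():
    # Stand-in for a synchronous generation coordinator.
    for tok in ("Hel", "lo", "!"):
        yield tok


async def collect() -> list:
    stop = threading.Event()
    return [tok async for tok in bridge_sync_generator(toy_tokens, stop)]
```

Setting `stop` from the async side is how a client disconnect could be signalled to the worker thread in this sketch; the generator checks it between items.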

Linux verification

PYTHONPATH=.:sdks/python coverage run -m pytest <Linux gate paths>:
  476 passed (was 473 on main; +3 net = added 3 grpc_app success-path tests).
  100% coverage on 915 stmts (was 987 on main; -72 net = deleted PooledVerifier).

Mac M4 evidence (REQUIRED for merge)

This is the single most invasive PR in the v0.3 sequence — it rewrites the deprecated HTTP shim's entire request path. The integration suite's test_http_shim_real.py is the binding gate. Reviewer runs:

bash scripts/review_pr_d2_on_mac.sh
git add results/platform-tests/pr-d2-mac-*
git commit -m "Mac M4 review evidence for PR-D2"
git push

Acceptance: all integration tests pass against real Qwen3-0.6B, including the now-rewired test_http_shim_real.py which exercises chat-completions (streaming + non-streaming), auth, /healthz, /metrics, /v1/models against the new SessionStore-driven path.

Stack

PR-D2 is branched off post-N1..N4 main. Independent of PR-E2 (#57) which adds CI workflow YAML; the two can merge in either order.

Next PR

v0.4 brings the proposer back into the session-bound path: PR-V0.4-A wires SparseLogitsProposer into a new SpeculativeAppendTokensCoordinator (or extends the existing one) so speculative decoding is restored on both gRPC and HTTP paths. The ADR 0001 / 0004 alignment training feeds into that work.

Reviewer checklist

create_app signature change is documented in PR description (was engine, config, pool=None; now verifier, config, *, slab_pool=None, model_id_label=None).
PooledVerifier is gone from the scheduler package; no callers remain anywhere in the repo.
Deprecation / Sunset headers are visible in every HTTP response (use curl -I against a running shim to verify).
Mac M4 evidence committed to this branch — pr-d2-mac-integration-tests-*.json shows all integration tests passing against real Qwen3-0.6B.
serve.py's --block-size / --num-diffusion-steps flags are retained but documented as ignored (don't silently drop them — existing scripts may pass them).
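For reviewers checking the header item above, the behaviour to expect can be sketched as a dependency-free ASGI middleware. This is not the PR's `_DeprecationHeadersMiddleware`; `DeprecationHeaders`, `toy_app`, and `call_once` are hypothetical, the framework the shim uses is not assumed, and the `Link` target is kept elided exactly as in the PR text.

```python
import asyncio

# Header values copied from the PR description; the Link path is elided there.
DEPRECATION_HEADERS = [
    (b"deprecation", b"true"),
    (b"sunset", b"Wed, 31 Dec 2025 00:00:00 GMT"),
    (b"link", b'</docs/adr/0008-...>; rel="successor-version"'),
]


class DeprecationHeaders:
    """Hypothetical pure-ASGI middleware that stamps every HTTP response."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_with_headers(message):
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                headers.extend(DEPRECATION_HEADERS)
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_headers)


async def toy_app(scope, receive, send):
    # Minimal downstream app standing in for the shim's routes.
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"text/plain")]})
    await send({"type": "http.response.body", "body": b"ok"})


async def call_once(app) -> list:
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/healthz", "headers": []}
    await app(scope, receive, send)
    return sent
```

Wrapping at the ASGI layer means error responses and streaming responses get the same headers as ordinary ones, which is what "every response" requires.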

Retires the Scheduler + PooledVerifier + SpeculativeEngine machinery from the HTTP shim's request path. Each /v1/chat/completions request is now a single-shot session under SessionStore: CreateSession → AppendTokens(prompt) → Generate → CloseSession. Same semantics as the gRPC RuntimeService surface; ADR 0008 §2.7 deprecation.

Three architectural changes
---------------------------

1. Speculative decoding is no longer applied on the HTTP path. The session-bound runtime is pure AR against the verifier; the proposer is wired into the v0.4 alignment work (ADR 0004). Pre-PR-D2 the HTTP shim used SpeculativeEngine (proposer + verifier together); post-PR-D2 it's roughly the same speed as transformers-vanilla AR. Migrate to gRPC for v0.3's full perf story.

2. Admission control is now an asyncio.Semaphore instead of a full Scheduler. REJECT vs QUEUE policy with queue_max_wait_s is preserved (queue_max_wait_s=0 means wait forever); the in-flight slab-pool bookkeeping moved into SessionStore. The Scheduler module + integration tests stay (used by other callers), but the HTTP shim no longer wires it.

3. ADR 0008 §2.7 deprecation headers are stamped onto every response by a new _DeprecationHeadersMiddleware:

     Deprecation: true
     Sunset: Wed, 31 Dec 2025 00:00:00 GMT
     Link: </docs/adr/0008-...>; rel="successor-version"

Production-side changes
-----------------------

inference_engine/server/app.py    ~rewrite, +330 / -300 net
  - create_app's signature changed: now takes (verifier, config, *, slab_pool=None, model_id_label=None) instead of (engine, config, pool=None). Caller passes the underlying SinkWindowVerifier directly.
  - Internal: builds SessionStore + AppendTokensCoordinator + GenerationCoordinator. asyncio.Semaphore for admission.
  - Route handler: tokenize → CreateSession → append → generate (sync gen run in asyncio.to_thread for disconnect-poll responsiveness) → CloseSession on success/error.
  - SSE streaming: same pattern; queue-bridged from the sync generator coordinator. HistoryTruncatedEvent is consumed silently (no OpenAI analog).
  - app.state.engine → app.state.{verifier, store, append_coord, gen_coord, model_id_label, admission_sem}.

inference_engine/scheduler/__init__.py    -1 line export
  Dropped 'PooledVerifier' from __all__.

inference_engine/scheduler/pooled_verifier.py    DELETED, -150 lines

scripts/serve.py    ~rewrite, +12 / -50 net
  - _build_engine → _build_verifier (returns SinkWindowVerifier or MLXSinkWindowVerifier).
  - main() builds the verifier and passes to create_app(verifier, config). Mirrors PR-E1b's start_grpc_runtime_server.py.
  - --block-size and --num-diffusion-steps flags retained for CLI compat but documented as ignored.
  - Banner now says 'DEPRECATED HTTP shim' and points at the gRPC entrypoint.

Tests
-----

tests/inference_engine/scheduler/test_pooled_verifier.py    DELETED, -250 lines
  PR-N1 had marked this file exempt from no-doubles cleanup precisely because PR-D2 was going to retire the module. PR-D2 delivers; the file goes with it.

tests/inference_engine/server/test_grpc_app.py    +120 lines, 3 new tests
  Coverage of grpc_app.py's success paths after the test_app_* files (which previously hit them via the FakeVerifier-backed SchedulerEngine path) were retired by PR-N3:

  test_append_tokens_session_not_found_returns_not_found
    Coordinator override raises SessionNotFoundError. Covers grpc_app.py:208 (NOT_FOUND abort branch).
  test_append_tokens_success_returns_response
    Coordinator override returns a synthetic history_length; asserts the response carries it. Covers grpc_app.py:213 (return AppendTokensResponse on success).
  test_generate_yields_history_truncated_then_done
    Generator override yields HistoryTruncated + Token + Done events; asserts the wire frames in order. Covers grpc_app.py:295-310 (HistoryTruncatedEvent yield + DoneEvent yield).

tests/integration/test_http_shim_real.py    ~30 line update
  Fixture wiring: real_speculative_engine → real_speculative_engine._decoder.verifier (since create_app's signature changed). Tests reading real_app.state.engine.model_id_label → real_app.state.model_id_label.

CI workflow
-----------

.github/workflows/ci.yaml: dropped pooled_verifier.py from the --include= filter (it no longer exists).

Linux verification
------------------

PYTHONPATH=.:sdks/python coverage run -m pytest <Linux gate paths>:
  476 passed (was 473 on main; +3 net = added 3 grpc_app success-path tests).
  100% coverage on 915 stmts (was 987 on main; -72 net = the deleted PooledVerifier module).

Mac M4 evidence (REQUIRED for merge)
------------------------------------

This is the single most invasive PR in the v0.3 sequence: it rewrites the deprecated HTTP shim's entire request path. The integration suite's test_http_shim_real.py is the binding gate. Reviewer runs:

  bash scripts/review_pr_d2_on_mac.sh
  git add results/platform-tests/pr-d2-mac-*
  git commit -m 'Mac M4 review evidence for PR-D2'
  git push

Acceptance: all integration tests pass against real Qwen3-0.6B, including the now-rewired test_http_shim_real.py which exercises chat-completions (streaming + non-streaming), auth, /healthz, /metrics, /v1/models against the new SessionStore-driven path.

Stack
-----

PR-D2 is branched off post-N1..N4 main. Independent of PR-E2 (#57) which adds CI workflow YAML; the two can merge in either order.

Next PR
-------

v0.4 brings the proposer back into the session-bound path: PR-V0.4-A wires SparseLogitsProposer into a new SpeculativeAppendTokensCoordinator (or extends the existing one) so speculative decoding is restored on both gRPC and HTTP paths. The ADR 0001/0004 alignment training feeds into that work.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

FluffyAIcode marked this pull request as ready for review June 2, 2026 04:36

FluffyAIcode merged commit 5481ffa into main Jun 2, 2026
6 checks passed

FluffyAIcode deleted the AgentMemory/v030-pr-d2-http-shim-onto-sessionstore-8e7f branch June 2, 2026 04:36


FluffyAIcode merged 1 commit into main from AgentMemory/v030-pr-d2-http-shim-onto-sessionstore-8e7f


2 participants
