Skip to content

PR-D2 (ADR 0008 Phase D): refactor HTTP shim onto SessionStore#58

Merged
FluffyAIcode merged 1 commit into
mainfrom
AgentMemory/v030-pr-d2-http-shim-onto-sessionstore-8e7f
Jun 2, 2026
Merged

PR-D2 (ADR 0008 Phase D): refactor HTTP shim onto SessionStore#58
FluffyAIcode merged 1 commit into
mainfrom
AgentMemory/v030-pr-d2-http-shim-onto-sessionstore-8e7f

Conversation

@FluffyAIcode
Copy link
Copy Markdown
Owner

Why

Retires the Scheduler + PooledVerifier + SpeculativeEngine machinery from the HTTP shim's request path. Each /v1/chat/completions request is now a single-shot session under SessionStore: CreateSessionAppendTokens(prompt)GenerateCloseSession. Same semantics as the gRPC RuntimeService surface; ADR 0008 §2.7 deprecation.

Three architectural changes

# Change Notes
1 Speculative decoding is no longer applied on the HTTP path The session-bound runtime is pure AR against the verifier; the proposer is wired into v0.4 alignment work (ADR 0004). Pre-PR-D2 the HTTP shim used SpeculativeEngine; post-PR-D2 it's roughly the same speed as transformers-vanilla AR. Migrate to gRPC for v0.3's full perf story.
2 Admission control is now asyncio.Semaphore REJECT vs QUEUE policy with queue_max_wait_s is preserved (queue_max_wait_s=0 means wait forever); the in-flight slab-pool bookkeeping moved into SessionStore. The Scheduler module + integration tests stay (used by other callers), just no longer wired to HTTP.
3 ADR 0008 §2.7 deprecation headers Stamped onto every response by a new _DeprecationHeadersMiddleware: Deprecation: true, Sunset: Wed, 31 Dec 2025 00:00:00 GMT, Link: </docs/adr/0008-...>; rel="successor-version".

Files

Direction File Δ
Rewritten inference_engine/server/app.py +330 / −300 net; create_app(verifier, config, *, slab_pool=None, model_id_label=None); new _DeprecationHeadersMiddleware; route handler runs SessionStore + AppendTokensCoordinator + GenerationCoordinator with sync gen wrapped in asyncio.to_thread for disconnect-poll responsiveness
Deleted inference_engine/scheduler/pooled_verifier.py −175
Updated inference_engine/scheduler/__init__.py dropped PooledVerifier from __all__
Rewritten scripts/serve.py _build_engine_build_verifier; banner says "DEPRECATED HTTP shim", points at gRPC entrypoint; --block-size / --num-diffusion-steps flags retained for CLI compat but documented as ignored
Deleted tests/inference_engine/scheduler/test_pooled_verifier.py −272; PR-N1's exemption was precisely because PR-D2 retires this module
Added tests/inference_engine/server/test_grpc_app.py +132, 3 tests covering grpc_app's success paths (test_append_tokens_session_not_found_returns_not_found, test_append_tokens_success_returns_response, test_generate_yields_history_truncated_then_done) — previously covered by deleted-in-PR-N3 FakeVerifier-backed tests
Updated tests/integration/test_http_shim_real.py Fixture: real_speculative_enginereal_speculative_engine._decoder.verifier. Asserts: state.engine.model_id_labelstate.model_id_label.
Updated .github/workflows/ci.yaml dropped pooled_verifier.py from --include= filter
Added scripts/review_pr_d2_on_mac.sh Mac M4 reviewer aid

Linux verification

PYTHONPATH=.:sdks/python coverage run -m pytest <Linux gate paths>:
  476 passed (was 473 on main; +3 net = added 3 grpc_app success-path tests).
  100% coverage on 915 stmts (was 987 on main; -72 net = deleted PooledVerifier).

Mac M4 evidence (REQUIRED for merge)

This is the single most invasive PR in the v0.3 sequence — it rewrites the deprecated HTTP shim's entire request path. The integration suite's test_http_shim_real.py is the binding gate. Reviewer runs:

bash scripts/review_pr_d2_on_mac.sh
git add results/platform-tests/pr-d2-mac-*
git commit -m "Mac M4 review evidence for PR-D2"
git push

Acceptance: all integration tests pass against real Qwen3-0.6B, including the now-rewired test_http_shim_real.py which exercises chat-completions (streaming + non-streaming), auth, /healthz, /metrics, /v1/models against the new SessionStore-driven path.

Stack

PR-D2 is branched off post-N1..N4 main. Independent of PR-E2 (#57) which adds CI workflow YAML; the two can merge in either order.

Next PR

v0.4 brings the proposer back into the session-bound path: PR-V0.4-A wires SparseLogitsProposer into a new SpeculativeAppendTokensCoordinator (or extends the existing one) so speculative decoding is restored on both gRPC and HTTP paths. The ADR 0001 / 0004 alignment training feeds into that work.

Reviewer checklist

  • create_app signature change is documented in PR description (was engine, config, pool=None; now verifier, config, *, slab_pool=None, model_id_label=None).
  • PooledVerifier is gone from the scheduler package; no callers remain anywhere in the repo.
  • Deprecation / Sunset headers are visible in every HTTP response (use curl -I against a running shim to verify).
  • Mac M4 evidence committed to this branch — pr-d2-mac-integration-tests-*.json shows all integration tests passing against real Qwen3-0.6B.
  • serve.py's --block-size / --num-diffusion-steps flags are retained but documented as ignored (don't silently drop them — existing scripts may pass them).
Open in Web Open in Cursor 

Retires the Scheduler + PooledVerifier + SpeculativeEngine machinery
from the HTTP shim's request path. Each /v1/chat/completions request
is now a single-shot session under SessionStore: CreateSession \u2192
AppendTokens(prompt) \u2192 Generate \u2192 CloseSession. Same semantics as
the gRPC RuntimeService surface; ADR 0008 \u00a72.7 deprecation.

Three architectural changes
---------------------------

  1. Speculative decoding is no longer applied on the HTTP path.
     The session-bound runtime is pure AR against the verifier;
     the proposer is wired into the v0.4 alignment work
     (ADR 0004). Pre-PR-D2 the HTTP shim used SpeculativeEngine
     (proposer + verifier together); post-PR-D2 it's roughly the
     same speed as transformers-vanilla AR. Migrate to gRPC for
     v0.3's full perf story.

  2. Admission control is now an asyncio.Semaphore instead of a
     full Scheduler. REJECT vs QUEUE policy with queue_max_wait_s
     is preserved (queue_max_wait_s=0 means wait forever); the
     in-flight slab-pool bookkeeping moved into SessionStore. The
     Scheduler module + integration tests stay (used by other
     callers), but the HTTP shim no longer wires it.

  3. ADR 0008 \u00a72.7 deprecation headers are stamped onto every
     response by a new _DeprecationHeadersMiddleware:
       Deprecation: true
       Sunset: Wed, 31 Dec 2025 00:00:00 GMT
       Link: </docs/adr/0008-...>; rel="successor-version"

Production-side changes
-----------------------

inference_engine/server/app.py                     ~rewrite, +330 / -300 net
  - create_app's signature changed: now takes (verifier, config,
    *, slab_pool=None, model_id_label=None) instead of
    (engine, config, pool=None). Caller passes the underlying
    SinkWindowVerifier directly.
  - Internal: builds SessionStore + AppendTokensCoordinator +
    GenerationCoordinator. asyncio.Semaphore for admission.
  - Route handler: tokenize \u2192 CreateSession \u2192 append \u2192 generate
    (sync gen run in asyncio.to_thread for disconnect-poll
    responsiveness) \u2192 CloseSession on success/error.
  - SSE streaming: same pattern; queue-bridged from the sync
    generator coordinator. HistoryTruncatedEvent is consumed
    silently (no OpenAI analog).
  - app.state.engine \u2192 app.state.{verifier, store, append_coord,
    gen_coord, model_id_label, admission_sem}.

inference_engine/scheduler/__init__.py             -1 line export
  Dropped 'PooledVerifier' from __all__.

inference_engine/scheduler/pooled_verifier.py     DELETED, -150 lines

scripts/serve.py                                   ~rewrite, +12 / -50 net
  - _build_engine \u2192 _build_verifier (returns SinkWindowVerifier
    or MLXSinkWindowVerifier).
  - main() builds the verifier and passes to create_app(verifier,
    config). Mirrors PR-E1b's start_grpc_runtime_server.py.
  - --block-size and --num-diffusion-steps flags retained for CLI
    compat but documented as ignored.
  - Banner now says 'DEPRECATED HTTP shim' and points at the
    gRPC entrypoint.

Tests
-----

tests/inference_engine/scheduler/test_pooled_verifier.py  DELETED, -250 lines
  PR-N1 had marked this file exempt from no-doubles cleanup
  precisely because PR-D2 was going to retire the module. PR-D2
  delivers; the file goes with it.

tests/inference_engine/server/test_grpc_app.py     +120 lines, 3 new tests
  Coverage of grpc_app.py's success paths after the test_app_*
  files (which previously hit them via the FakeVerifier-backed
  SchedulerEngine path) were retired by PR-N3:
    test_append_tokens_session_not_found_returns_not_found
      Coordinator override raises SessionNotFoundError. Covers
      grpc_app.py:208 (NOT_FOUND abort branch).
    test_append_tokens_success_returns_response
      Coordinator override returns a synthetic history_length;
      asserts the response carries it. Covers grpc_app.py:213
      (return AppendTokensResponse on success).
    test_generate_yields_history_truncated_then_done
      Generator override yields HistoryTruncated + Token + Done
      events; asserts the wire frames in order. Covers
      grpc_app.py:295-310 (HistoryTruncatedEvent yield + DoneEvent
      yield).

tests/integration/test_http_shim_real.py           ~30 line update
  Fixture wiring: real_speculative_engine \u2192
  real_speculative_engine._decoder.verifier (since create_app's
  signature changed). Tests reading
  real_app.state.engine.model_id_label \u2192 real_app.state.model_id_label.

CI workflow
-----------

.github/workflows/ci.yaml: dropped pooled_verifier.py from the
--include= filter (it no longer exists).

Linux verification
------------------
PYTHONPATH=.:sdks/python coverage run -m pytest <Linux gate paths>:
  476 passed (was 473 on main; +3 net = added 3 grpc_app
              success-path tests).
  100% coverage on 915 stmts (was 987 on main; -72 net = the
              deleted PooledVerifier module).

Mac M4 evidence (REQUIRED for merge)
------------------------------------
This is the single most invasive PR in the v0.3 sequence \u2014 it
rewrites the deprecated HTTP shim's entire request path. The
integration suite's test_http_shim_real.py is the binding gate.
Reviewer runs:

    bash scripts/review_pr_d2_on_mac.sh
    git add results/platform-tests/pr-d2-mac-*
    git commit -m 'Mac M4 review evidence for PR-D2'
    git push

Acceptance: all integration tests pass against real Qwen3-0.6B,
including the now-rewired test_http_shim_real.py which exercises
chat-completions (streaming + non-streaming), auth, /healthz,
/metrics, /v1/models against the new SessionStore-driven path.

Stack
-----
PR-D2 is branched off post-N1..N4 main. Independent of PR-E2 (#57)
which adds CI workflow YAML; the two can merge in either order.

Next PR
-------
v0.4 brings the proposer back into the session-bound path:
PR-V0.4-A wires SparseLogitsProposer into a new
SpeculativeAppendTokensCoordinator (or extends the existing one)
so speculative decoding is restored on both gRPC and HTTP paths.
The ADR 0001/0004 alignment training feeds into that work.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@FluffyAIcode FluffyAIcode marked this pull request as ready for review June 2, 2026 04:36
@FluffyAIcode FluffyAIcode merged commit 5481ffa into main Jun 2, 2026
6 checks passed
@FluffyAIcode FluffyAIcode deleted the AgentMemory/v030-pr-d2-http-shim-onto-sessionstore-8e7f branch June 2, 2026 04:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants