PR-D2 (ADR 0008 Phase D): refactor HTTP shim onto SessionStore#58
Merged
FluffyAIcode merged 1 commit intoJun 2, 2026
Merged
Conversation
Retires the Scheduler + PooledVerifier + SpeculativeEngine machinery
from the HTTP shim's request path. Each /v1/chat/completions request
is now a single-shot session under SessionStore: CreateSession \u2192
AppendTokens(prompt) \u2192 Generate \u2192 CloseSession. Same semantics as
the gRPC RuntimeService surface; ADR 0008 \u00a72.7 deprecation.
Three architectural changes
---------------------------
1. Speculative decoding is no longer applied on the HTTP path.
The session-bound runtime is pure AR against the verifier;
the proposer is wired into the v0.4 alignment work
(ADR 0004). Pre-PR-D2 the HTTP shim used SpeculativeEngine
(proposer + verifier together); post-PR-D2 it's roughly the
same speed as transformers-vanilla AR. Migrate to gRPC for
v0.3's full perf story.
2. Admission control is now an asyncio.Semaphore instead of a
full Scheduler. REJECT vs QUEUE policy with queue_max_wait_s
is preserved (queue_max_wait_s=0 means wait forever); the
in-flight slab-pool bookkeeping moved into SessionStore. The
Scheduler module + integration tests stay (used by other
callers), but the HTTP shim no longer wires it.
3. ADR 0008 \u00a72.7 deprecation headers are stamped onto every
response by a new _DeprecationHeadersMiddleware:
Deprecation: true
Sunset: Wed, 31 Dec 2025 00:00:00 GMT
Link: </docs/adr/0008-...>; rel="successor-version"
Production-side changes
-----------------------
inference_engine/server/app.py ~rewrite, +330 / -300 net
- create_app's signature changed: now takes (verifier, config,
*, slab_pool=None, model_id_label=None) instead of
(engine, config, pool=None). Caller passes the underlying
SinkWindowVerifier directly.
- Internal: builds SessionStore + AppendTokensCoordinator +
GenerationCoordinator. asyncio.Semaphore for admission.
- Route handler: tokenize \u2192 CreateSession \u2192 append \u2192 generate
(sync gen run in asyncio.to_thread for disconnect-poll
responsiveness) \u2192 CloseSession on success/error.
- SSE streaming: same pattern; queue-bridged from the sync
generator coordinator. HistoryTruncatedEvent is consumed
silently (no OpenAI analog).
- app.state.engine \u2192 app.state.{verifier, store, append_coord,
gen_coord, model_id_label, admission_sem}.
inference_engine/scheduler/__init__.py -1 line export
Dropped 'PooledVerifier' from __all__.
inference_engine/scheduler/pooled_verifier.py DELETED, -150 lines
scripts/serve.py ~rewrite, +12 / -50 net
- _build_engine \u2192 _build_verifier (returns SinkWindowVerifier
or MLXSinkWindowVerifier).
- main() builds the verifier and passes to create_app(verifier,
config). Mirrors PR-E1b's start_grpc_runtime_server.py.
- --block-size and --num-diffusion-steps flags retained for CLI
compat but documented as ignored.
- Banner now says 'DEPRECATED HTTP shim' and points at the
gRPC entrypoint.
Tests
-----
tests/inference_engine/scheduler/test_pooled_verifier.py DELETED, -250 lines
PR-N1 had marked this file exempt from no-doubles cleanup
precisely because PR-D2 was going to retire the module. PR-D2
delivers; the file goes with it.
tests/inference_engine/server/test_grpc_app.py +120 lines, 3 new tests
Coverage of grpc_app.py's success paths after the test_app_*
files (which previously hit them via the FakeVerifier-backed
SchedulerEngine path) were retired by PR-N3:
test_append_tokens_session_not_found_returns_not_found
Coordinator override raises SessionNotFoundError. Covers
grpc_app.py:208 (NOT_FOUND abort branch).
test_append_tokens_success_returns_response
Coordinator override returns a synthetic history_length;
asserts the response carries it. Covers grpc_app.py:213
(return AppendTokensResponse on success).
test_generate_yields_history_truncated_then_done
Generator override yields HistoryTruncated + Token + Done
events; asserts the wire frames in order. Covers
grpc_app.py:295-310 (HistoryTruncatedEvent yield + DoneEvent
yield).
tests/integration/test_http_shim_real.py ~30 line update
Fixture wiring: real_speculative_engine \u2192
real_speculative_engine._decoder.verifier (since create_app's
signature changed). Tests reading
real_app.state.engine.model_id_label \u2192 real_app.state.model_id_label.
CI workflow
-----------
.github/workflows/ci.yaml: dropped pooled_verifier.py from the
--include= filter (it no longer exists).
Linux verification
------------------
PYTHONPATH=.:sdks/python coverage run -m pytest <Linux gate paths>:
476 passed (was 473 on main; +3 net = added 3 grpc_app
success-path tests).
100% coverage on 915 stmts (was 987 on main; -72 net = the
deleted PooledVerifier module).
Mac M4 evidence (REQUIRED for merge)
------------------------------------
This is the single most invasive PR in the v0.3 sequence \u2014 it
rewrites the deprecated HTTP shim's entire request path. The
integration suite's test_http_shim_real.py is the binding gate.
Reviewer runs:
bash scripts/review_pr_d2_on_mac.sh
git add results/platform-tests/pr-d2-mac-*
git commit -m 'Mac M4 review evidence for PR-D2'
git push
Acceptance: all integration tests pass against real Qwen3-0.6B,
including the now-rewired test_http_shim_real.py which exercises
chat-completions (streaming + non-streaming), auth, /healthz,
/metrics, /v1/models against the new SessionStore-driven path.
Stack
-----
PR-D2 is branched off post-N1..N4 main. Independent of PR-E2 (#57)
which adds CI workflow YAML; the two can merge in either order.
Next PR
-------
v0.4 brings the proposer back into the session-bound path:
PR-V0.4-A wires SparseLogitsProposer into a new
SpeculativeAppendTokensCoordinator (or extends the existing one)
so speculative decoding is restored on both gRPC and HTTP paths.
The ADR 0001/0004 alignment training feeds into that work.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
Retires the Scheduler + PooledVerifier + SpeculativeEngine machinery from the HTTP shim's request path. Each
/v1/chat/completionsrequest is now a single-shot session underSessionStore:CreateSession→AppendTokens(prompt)→Generate→CloseSession. Same semantics as the gRPCRuntimeServicesurface; ADR 0008 §2.7 deprecation.Three architectural changes
asyncio.Semaphorequeue_max_wait_sis preserved (queue_max_wait_s=0means wait forever); the in-flight slab-pool bookkeeping moved intoSessionStore. The Scheduler module + integration tests stay (used by other callers), just no longer wired to HTTP._DeprecationHeadersMiddleware:Deprecation: true,Sunset: Wed, 31 Dec 2025 00:00:00 GMT,Link: </docs/adr/0008-...>; rel="successor-version".Files
inference_engine/server/app.pycreate_app(verifier, config, *, slab_pool=None, model_id_label=None); new_DeprecationHeadersMiddleware; route handler runs SessionStore + AppendTokensCoordinator + GenerationCoordinator with sync gen wrapped inasyncio.to_threadfor disconnect-poll responsivenessinference_engine/scheduler/pooled_verifier.pyinference_engine/scheduler/__init__.pyPooledVerifierfrom__all__scripts/serve.py_build_engine→_build_verifier; banner says "DEPRECATED HTTP shim", points at gRPC entrypoint;--block-size/--num-diffusion-stepsflags retained for CLI compat but documented as ignoredtests/inference_engine/scheduler/test_pooled_verifier.pytests/inference_engine/server/test_grpc_app.pytest_append_tokens_session_not_found_returns_not_found,test_append_tokens_success_returns_response,test_generate_yields_history_truncated_then_done) — previously covered by deleted-in-PR-N3 FakeVerifier-backed teststests/integration/test_http_shim_real.pyreal_speculative_engine→real_speculative_engine._decoder.verifier. Asserts:state.engine.model_id_label→state.model_id_label..github/workflows/ci.yamlpooled_verifier.pyfrom--include=filterscripts/review_pr_d2_on_mac.shLinux verification
Mac M4 evidence (REQUIRED for merge)
This is the single most invasive PR in the v0.3 sequence — it rewrites the deprecated HTTP shim's entire request path. The integration suite's
test_http_shim_real.pyis the binding gate. Reviewer runs:Acceptance: all integration tests pass against real Qwen3-0.6B, including the now-rewired
test_http_shim_real.pywhich exercises chat-completions (streaming + non-streaming), auth,/healthz,/metrics,/v1/modelsagainst the new SessionStore-driven path.Stack
PR-D2 is branched off post-N1..N4 main. Independent of PR-E2 (#57) which adds CI workflow YAML; the two can merge in either order.
Next PR
v0.4 brings the proposer back into the session-bound path:
PR-V0.4-AwiresSparseLogitsProposerinto a newSpeculativeAppendTokensCoordinator(or extends the existing one) so speculative decoding is restored on both gRPC and HTTP paths. The ADR 0001 / 0004 alignment training feeds into that work.Reviewer checklist
create_appsignature change is documented in PR description (wasengine, config, pool=None; nowverifier, config, *, slab_pool=None, model_id_label=None).curl -Iagainst a running shim to verify).pr-d2-mac-integration-tests-*.jsonshows all integration tests passing against real Qwen3-0.6B.serve.py's--block-size/--num-diffusion-stepsflags are retained but documented as ignored (don't silently drop them — existing scripts may pass them).