From 724be1b210fcfdf3009f86760a7b5fcd854c2e6a Mon Sep 17 00:00:00 2001
From: Cursor Agent <cursoragent@cursor.com>
Date: Tue, 2 Jun 2026 04:26:05 +0000
Subject: [PATCH] PR-D2 (ADR 0008 Phase D): refactor HTTP shim onto
 SessionStore

Retires the Scheduler + PooledVerifier + SpeculativeEngine machinery
from the HTTP shim's request path. Each /v1/chat/completions request
is now a single-shot session under SessionStore: CreateSession \u2192
AppendTokens(prompt) \u2192 Generate \u2192 CloseSession. Same semantics as
the gRPC RuntimeService surface; ADR 0008 \u00a72.7 deprecation.

Three architectural changes
---------------------------

  1. Speculative decoding is no longer applied on the HTTP path.
     The session-bound runtime is pure AR against the verifier;
     the proposer is wired into the v0.4 alignment work
     (ADR 0004). Pre-PR-D2 the HTTP shim used SpeculativeEngine
     (proposer + verifier together); post-PR-D2 it's roughly the
     same speed as transformers-vanilla AR. Migrate to gRPC for
     v0.3's full perf story.

  2. Admission control is now an asyncio.Semaphore instead of a
     full Scheduler. REJECT vs QUEUE policy with queue_max_wait_s
     is preserved (queue_max_wait_s=0 means wait forever); the
     in-flight slab-pool bookkeeping moved into SessionStore. The
     Scheduler module + integration tests stay (used by other
     callers), but the HTTP shim no longer wires it.

  3. ADR 0008 \u00a72.7 deprecation headers are stamped onto every
     response by a new _DeprecationHeadersMiddleware:
       Deprecation: true
       Sunset: Wed, 31 Dec 2025 00:00:00 GMT
       Link: </docs/adr/0008-...>; rel="successor-version"

Production-side changes
-----------------------

inference_engine/server/app.py                     ~rewrite, +330 / -300 net
  - create_app's signature changed: now takes (verifier, config,
    *, slab_pool=None, model_id_label=None) instead of
    (engine, config, pool=None). Caller passes the underlying
    SinkWindowVerifier directly.
  - Internal: builds SessionStore + AppendTokensCoordinator +
    GenerationCoordinator. asyncio.Semaphore for admission.
  - Route handler: tokenize \u2192 CreateSession \u2192 append \u2192 generate
    (sync gen run in asyncio.to_thread for disconnect-poll
    responsiveness) \u2192 CloseSession on success/error.
  - SSE streaming: same pattern; queue-bridged from the sync
    generator coordinator. HistoryTruncatedEvent is consumed
    silently (no OpenAI analog).
  - app.state.engine \u2192 app.state.{verifier, store, append_coord,
    gen_coord, model_id_label, admission_sem}.

inference_engine/scheduler/__init__.py             -1 line export
  Dropped 'PooledVerifier' from __all__.

inference_engine/scheduler/pooled_verifier.py     DELETED, -150 lines

scripts/serve.py                                   ~rewrite, +12 / -50 net
  - _build_engine \u2192 _build_verifier (returns SinkWindowVerifier
    or MLXSinkWindowVerifier).
  - main() builds the verifier and passes to create_app(verifier,
    config). Mirrors PR-E1b's start_grpc_runtime_server.py.
  - --block-size and --num-diffusion-steps flags retained for CLI
    compat but documented as ignored.
  - Banner now says 'DEPRECATED HTTP shim' and points at the
    gRPC entrypoint.

Tests
-----

tests/inference_engine/scheduler/test_pooled_verifier.py  DELETED, -250 lines
  PR-N1 had marked this file exempt from no-doubles cleanup
  precisely because PR-D2 was going to retire the module. PR-D2
  delivers; the file goes with it.

tests/inference_engine/server/test_grpc_app.py     +120 lines, 3 new tests
  Coverage of grpc_app.py's success paths after the test_app_*
  files (which previously hit them via the FakeVerifier-backed
  SchedulerEngine path) were retired by PR-N3:
    test_append_tokens_session_not_found_returns_not_found
      Coordinator override raises SessionNotFoundError. Covers
      grpc_app.py:208 (NOT_FOUND abort branch).
    test_append_tokens_success_returns_response
      Coordinator override returns a synthetic history_length;
      asserts the response carries it. Covers grpc_app.py:213
      (return AppendTokensResponse on success).
    test_generate_yields_history_truncated_then_done
      Generator override yields HistoryTruncated + Token + Done
      events; asserts the wire frames in order. Covers
      grpc_app.py:295-310 (HistoryTruncatedEvent yield + DoneEvent
      yield).

tests/integration/test_http_shim_real.py           ~30 line update
  Fixture wiring: real_speculative_engine \u2192
  real_speculative_engine._decoder.verifier (since create_app's
  signature changed). Tests reading
  real_app.state.engine.model_id_label \u2192 real_app.state.model_id_label.

CI workflow
-----------

.github/workflows/ci.yaml: dropped pooled_verifier.py from the
--include= filter (it no longer exists).

Linux verification
------------------
PYTHONPATH=.:sdks/python coverage run -m pytest <Linux gate paths>:
  476 passed (was 473 on main; +3 net = added 3 grpc_app
              success-path tests).
  100% coverage on 915 stmts (was 987 on main; -72 net = the
              deleted PooledVerifier module).

Mac M4 evidence (REQUIRED for merge)
------------------------------------
This is the single most invasive PR in the v0.3 sequence \u2014 it
rewrites the deprecated HTTP shim's entire request path. The
integration suite's test_http_shim_real.py is the binding gate.
Reviewer runs:

    bash scripts/review_pr_d2_on_mac.sh
    git add results/platform-tests/pr-d2-mac-*
    git commit -m 'Mac M4 review evidence for PR-D2'
    git push

Acceptance: all integration tests pass against real Qwen3-0.6B,
including the now-rewired test_http_shim_real.py which exercises
chat-completions (streaming + non-streaming), auth, /healthz,
/metrics, /v1/models against the new SessionStore-driven path.

Stack
-----
PR-D2 is branched off post-N1..N4 main. Independent of PR-E2 (#57)
which adds CI workflow YAML; the two can merge in either order.

Next PR
-------
v0.4 brings the proposer back into the session-bound path:
PR-V0.4-A wires SparseLogitsProposer into a new
SpeculativeAppendTokensCoordinator (or extends the existing one)
so speculative decoding is restored on both gRPC and HTTP paths.
The ADR 0001/0004 alignment training feeds into that work.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
---
 .github/workflows/ci.yaml                     |   4 +-
 inference_engine/scheduler/__init__.py        |   6 +-
 inference_engine/scheduler/pooled_verifier.py | 175 ----
 inference_engine/server/app.py                | 858 ++++++++++--------
 scripts/review_pr_d2_on_mac.sh                |  79 ++
 scripts/serve.py                              | 103 +--
 .../scheduler/test_pooled_verifier.py         | 272 ------
 .../inference_engine/server/test_grpc_app.py  | 132 +++
 tests/integration/test_http_shim_real.py      |  28 +-
 9 files changed, 779 insertions(+), 878 deletions(-)
 delete mode 100644 inference_engine/scheduler/pooled_verifier.py
 create mode 100755 scripts/review_pr_d2_on_mac.sh
 delete mode 100644 tests/inference_engine/scheduler/test_pooled_verifier.py

diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml
index 9bea66f..e4a131f 100644
--- a/.github/workflows/ci.yaml
+++ b/.github/workflows/ci.yaml
@@ -100,10 +100,10 @@ jobs:
             --junitxml=junit.xml \
             -v
           coverage report \
-            --include='inference_engine/server/auth.py,inference_engine/server/config.py,inference_engine/server/errors.py,inference_engine/server/grpc_app.py,inference_engine/server/metrics.py,inference_engine/server/schemas.py,inference_engine/server/proto_gen/**/*.py,inference_engine/memory/*,inference_engine/scheduler/config.py,inference_engine/scheduler/session.py,inference_engine/scheduler/pooled_verifier.py,inference_engine/pipeline/*,inference_engine/session/store.py,sdks/python/kakeya/__init__.py,sdks/python/kakeya/errors.py,training/repr_align/*' \
+            --include='inference_engine/server/auth.py,inference_engine/server/config.py,inference_engine/server/errors.py,inference_engine/server/grpc_app.py,inference_engine/server/metrics.py,inference_engine/server/schemas.py,inference_engine/server/proto_gen/**/*.py,inference_engine/memory/*,inference_engine/scheduler/config.py,inference_engine/scheduler/session.py,inference_engine/pipeline/*,inference_engine/session/store.py,sdks/python/kakeya/__init__.py,sdks/python/kakeya/errors.py,training/repr_align/*' \
             --fail-under=100
           coverage xml -o coverage.xml \
-            --include='inference_engine/server/auth.py,inference_engine/server/config.py,inference_engine/server/errors.py,inference_engine/server/grpc_app.py,inference_engine/server/metrics.py,inference_engine/server/schemas.py,inference_engine/server/proto_gen/**/*.py,inference_engine/memory/*,inference_engine/scheduler/config.py,inference_engine/scheduler/session.py,inference_engine/scheduler/pooled_verifier.py,inference_engine/pipeline/*,inference_engine/session/store.py,sdks/python/kakeya/__init__.py,sdks/python/kakeya/errors.py,training/repr_align/*'
+            --include='inference_engine/server/auth.py,inference_engine/server/config.py,inference_engine/server/errors.py,inference_engine/server/grpc_app.py,inference_engine/server/metrics.py,inference_engine/server/schemas.py,inference_engine/server/proto_gen/**/*.py,inference_engine/memory/*,inference_engine/scheduler/config.py,inference_engine/scheduler/session.py,inference_engine/pipeline/*,inference_engine/session/store.py,sdks/python/kakeya/__init__.py,sdks/python/kakeya/errors.py,training/repr_align/*'
 
       - name: Upload coverage artifact
         if: always()
diff --git a/inference_engine/scheduler/__init__.py b/inference_engine/scheduler/__init__.py
index f37c501..00e01e5 100644
--- a/inference_engine/scheduler/__init__.py
+++ b/inference_engine/scheduler/__init__.py
@@ -26,13 +26,15 @@
 """
 
 from .config import AdmissionPolicy, SchedulerConfig
-from .pooled_verifier import PooledVerifier
+# PooledVerifier was retired by PR-D2; the HTTP shim now drives
+# SessionStore + AppendTokensCoordinator directly. Imports kept
+# stable by removing the export entirely (no soft-deprecation
+# layer — the symbol is gone from the package).
 from .scheduler import RequestRejected, Scheduler
 from .session import Session, SessionState
 
 __all__ = [
     "AdmissionPolicy",
-    "PooledVerifier",
     "RequestRejected",
     "Scheduler",
     "SchedulerConfig",
diff --git a/inference_engine/scheduler/pooled_verifier.py b/inference_engine/scheduler/pooled_verifier.py
deleted file mode 100644
index bbad158..0000000
--- a/inference_engine/scheduler/pooled_verifier.py
+++ /dev/null
@@ -1,175 +0,0 @@
-"""PooledVerifier wrapper: ties a verifier's lifecycle to a slab pool.
-
-This is the **intermediate step** described in ADR 0003. It does not
-make slab tensors hold the real KV — that is the deferred full
-refactor. What it does:
-
-  * On ``prefill()``: acquires a slab from the pool (releasing any
-    previously held one).
-  * On ``reset()``: releases the held slab.
-  * After every forward (``prefill``, ``forward_block``,
-    ``append_token``, ``commit_or_truncate``): writes the verifier's
-    real ``stats.peak_kv_bytes`` snapshot into the slab's
-    ``live_kv_bytes_override`` so ``slab.live_kv_bytes`` reports
-    real numbers, not placeholder tensor bytes.
-
-The wrapper is a structural pass-through: every public method on
-the underlying verifier is delegated. We only intercept the calls
-that change cache state.
-
-Why a wrapper rather than modifying SinkWindowVerifier directly:
-
-  * Avoids a circular dependency between ``kv_cache_proposer``
-    (where the verifier lives) and ``inference_engine.memory``
-    (where the pool lives). Today layering goes
-    ``inference_engine -> kv_cache_proposer``; reversing that for
-    the verifier would invert the import graph.
-  * Keeps the verifier's ``DynamicCache`` path bit-identical to
-    v0.1.0. The wrapper does not touch the model forward.
-  * Makes the integration optional: callers that don't care about
-    pool-backed memory accounting use the bare verifier; callers
-    that do (multi-session HTTP) wrap with ``PooledVerifier``.
-
-Verifier protocol assumed:
-
-    verifier.prefill(prompt_ids: list[int]) -> None
-    verifier.forward_block(tokens: list[int]) -> torch.Tensor
-    verifier.commit_or_truncate(forwarded: int, accepted: int) -> None
-    verifier.append_token(token_id: int) -> torch.Tensor
-    verifier.reset() -> None
-    verifier.stats.peak_kv_bytes  (int, updated by verifier)
-    verifier.tokenizer            (passthrough)
-    verifier.next_token_logits    (passthrough)
-    verifier.cache_logical_size   (passthrough)
-    verifier.next_global_position (passthrough)
-
-Both PyTorch ``SinkWindowVerifier`` and MLX
-``MLXSinkWindowVerifier`` satisfy this; future verifiers must too.
-"""
-
-from __future__ import annotations
-
-from typing import Any, List, Optional
-
-import torch
-
-from inference_engine.memory.pool import SlabPool
-from inference_engine.memory.slab import KVSlab
-
-
-class PooledVerifier:
-    """Wraps a verifier; manages slab-pool acquire/release per session."""
-
-    def __init__(self, verifier: Any, pool: SlabPool) -> None:
-        if pool is None:
-            raise ValueError("pool must not be None")
-        self._verifier = verifier
-        self._pool = pool
-        self._slab: Optional[KVSlab] = None
-
-    # ------------------------------------------------------------------
-    # Verifier-protocol methods we intercept
-    # ------------------------------------------------------------------
-
-    def prefill(self, prompt_ids: List[int]) -> None:
-        # Acquire a slab for this session; release any prior one
-        # (defensive — same verifier instance reused across sessions).
-        self._release_slab_if_held()
-        self._slab = self._pool.acquire()
-        try:
-            self._verifier.prefill(prompt_ids)
-            self._sync_slab_bytes()
-        except BaseException:
-            # Release the slab on failure so the pool is not stuck.
-            self._release_slab_if_held()
-            raise
-
-    def forward_block(self, tokens: List[int]) -> torch.Tensor:
-        out = self._verifier.forward_block(tokens)
-        self._sync_slab_bytes()
-        return out
-
-    def commit_or_truncate(self, forwarded: int, accepted: int) -> None:
-        self._verifier.commit_or_truncate(forwarded=forwarded, accepted=accepted)
-        self._sync_slab_bytes()
-
-    def append_token(self, token_id: int) -> torch.Tensor:
-        out = self._verifier.append_token(token_id)
-        self._sync_slab_bytes()
-        return out
-
-    def reset(self) -> None:
-        self._verifier.reset()
-        self._release_slab_if_held()
-
-    # ------------------------------------------------------------------
-    # Pass-through attributes the speculative decoder reads directly
-    # ------------------------------------------------------------------
-
-    @property
-    def tokenizer(self):
-        return self._verifier.tokenizer
-
-    @property
-    def stats(self):
-        return self._verifier.stats
-
-    @property
-    def next_token_logits(self):
-        return self._verifier.next_token_logits
-
-    @next_token_logits.setter
-    def next_token_logits(self, value):
-        self._verifier.next_token_logits = value
-
-    @property
-    def cache_logical_size(self) -> int:
-        return self._verifier.cache_logical_size
-
-    @property
-    def next_global_position(self) -> int:
-        return self._verifier.next_global_position
-
-    @property
-    def config(self):
-        return self._verifier.config
-
-    @property
-    def slab(self) -> Optional[KVSlab]:
-        """The currently-held slab, if any. Public for tests."""
-        return self._slab
-
-    @property
-    def pool(self) -> SlabPool:
-        return self._pool
-
-    @property
-    def inner(self):
-        """The wrapped verifier — escape hatch for callers that need
-        access to verifier-specific extras (e.g. ``quantization``
-        on MLX). Use sparingly; depending on this defeats the
-        wrapper's abstraction."""
-        return self._verifier
-
-    # ------------------------------------------------------------------
-    # Internal
-    # ------------------------------------------------------------------
-
-    def _sync_slab_bytes(self) -> None:
-        """Copy the verifier's real KV byte count onto the slab.
-
-        The verifier updates ``stats.peak_kv_bytes`` on every forward.
-        We use the *current* size (the verifier also publishes
-        ``stats.peak_kv_bytes`` as a running max — which is fine
-        for our purposes since we want pool gauges to reflect the
-        worst case during the session).
-        """
-        if self._slab is None:  # pragma: no cover - defensive; all callers acquire first
-            return
-        bytes_ = int(getattr(self._verifier.stats, "peak_kv_bytes", 0))
-        self._slab.live_kv_bytes_override = bytes_
-
-    def _release_slab_if_held(self) -> None:
-        if self._slab is not None:
-            self._pool.release(self._slab)
-            self._slab = None
diff --git a/inference_engine/server/app.py b/inference_engine/server/app.py
index 485b157..be54c2b 100644
--- a/inference_engine/server/app.py
+++ b/inference_engine/server/app.py
@@ -1,57 +1,51 @@
-"""FastAPI app factory and route handlers.
-
-The app is constructed by :func:`create_app` from a fully-initialized
-:class:`Engine` (and a :class:`ServerConfig`). All inference flows
-through a :class:`Scheduler` constructed inside the factory: routes
-never call ``engine.generate`` directly. This is the integration that
-makes admission control, fair queuing, slab-pool occupancy, and
-graceful shutdown observable / consistent regardless of single-user
-or multi-user deployment.
-
-Routes implemented in this commit:
+"""FastAPI app factory and route handlers (PR-D2 of ADR 0008 Phase D).
+
+The HTTP shim is **deprecated** per ADR 0008 §2.7 and slated for
+retirement once OpenAI-API consumers migrate to the v0.3 gRPC
+surface. PR-D2 refactored this module's internals to drive the
+session-bound runtime (``SessionStore`` +
+:class:`AppendTokensCoordinator` + :class:`GenerationCoordinator`)
+directly, retiring the previous ``Scheduler`` + ``PooledVerifier``
++ :class:`SpeculativeEngine` machinery. Each ``/v1/chat/completions``
+request is now a single-shot session: ``CreateSession`` →
+``AppendTokens(prompt)`` → ``Generate`` → ``CloseSession`` —
+identical semantics to the gRPC ``RuntimeService`` surface.
+
+What this means for users
+-------------------------
+
+* **Speculative decoding is no longer applied on the HTTP path.**
+  The session-bound runtime is pure autoregressive against the
+  verifier; the proposer is wired into the v0.4 alignment work
+  (ADR 0004). For now the HTTP shim is roughly the same speed as
+  ``transformers``-vanilla AR generation. **Migrate to gRPC** for
+  the v0.3 architecture's full perf story.
+* Every response carries ``Deprecation: true`` and a
+  ``Sunset`` header pointing to the v0.3 GA tag. The OpenAI clients
+  ignore these by default but the metadata is in the response for
+  proxies / observability tools.
+* Admission control is now an :class:`asyncio.Semaphore` instead of
+  a full ``Scheduler`` — the queueing and timeout semantics are
+  preserved (REJECT vs QUEUE policy with ``queue_max_wait_s``) but
+  the in-flight slab-pool bookkeeping moved into ``SessionStore``.
+
+Routes
+------
 
     GET  /healthz
+    GET  /metrics
     GET  /v1/models
     POST /v1/chat/completions
 
-OpenAI compatibility notes
---------------------------
-
-* ``stream`` is the load-bearing flag: when true the response is
-  ``text/event-stream``; when false it is ``application/json``. We
-  branch on it inside the route, not at registration time.
-* Sampling parameters (``temperature``, ``top_p``, ``stop``) are
-  accepted in the request schema but not applied — the underlying
-  decoder is greedy temperature-0 by design (see ADR 0001 §2.2 for
-  the rationale).
-* ``finish_reason`` is ``"stop"`` if EOS terminated generation OR
-  if the client cancelled, ``"length"`` if ``max_tokens`` did. We
-  do not yet emit ``"content_filter"`` or ``"function_call"``.
-
 Error mapping
 -------------
 
-* Pydantic validation errors: 422 (FastAPI default).
+* Pydantic validation errors: 422.
 * Tokenizer chat-template rejection: 400.
-* Tokenizer with no EOS: 500 (defense in depth; engine constructor
-  is supposed to catch this earlier).
-* Scheduler rejects (pool full under REJECT policy, queue timeout
-  under QUEUE policy): 429 with a JSON body following OpenAI's
-  error shape.
-* Engine raises mid-generate: 500 (non-streaming) or terminal SSE
-  chunk with ``finish_reason="stop"`` (streaming — the SSE
-  contract has no graceful way to surface a 500 once the response
-  has started; the session error is swallowed at the wire after
-  any partial output).
-
-Lifespan
---------
-
-The app registers a FastAPI lifespan context that calls
-``scheduler.shutdown()`` when the server stops. Active sessions are
-cancelled, queued admissions are rejected, slabs are released. The
-HTTP layer becomes externally indistinguishable from "no server here"
-within one poll interval after the lifespan exits.
+* Tokenizer with no EOS: 500.
+* Pool / admission saturation: 429 with OpenAI error envelope.
+* Verifier raises mid-generate: 500 (non-streaming) or terminal
+  SSE chunk with ``finish_reason="stop"`` (streaming).
 """
 
 from __future__ import annotations
@@ -72,23 +66,17 @@
 
 from inference_engine.memory.pool import SlabPool
 from inference_engine.memory.slab import SlabConfig
-from inference_engine.scheduler.config import SchedulerConfig
-from inference_engine.scheduler.scheduler import (
-    RequestRejected,
-    Scheduler,
-)
-from inference_engine.scheduler.session import Session, SessionState
-
-from .auth import verify_api_key
-from .config import ServerConfig
-from .engine import Engine
-from .errors import (
+from inference_engine.scheduler.config import AdmissionPolicy
+from inference_engine.server.auth import verify_api_key
+from inference_engine.server.config import ServerConfig
+from inference_engine.server.errors import (
+    build_error_envelope,
     http_exception_handler,
     request_validation_exception_handler,
     unhandled_exception_handler,
 )
-from .metrics import Metrics
-from .schemas import (
+from inference_engine.server.metrics import Metrics
+from inference_engine.server.schemas import (
     ChatCompletionChoice,
     ChatCompletionChunk,
     ChatCompletionChunkChoice,
@@ -101,87 +89,121 @@
     ListModelsResponse,
     ModelInfo,
 )
-from .streaming import _StreamingDetokenizer
-from .tokenizer import resolve_eos_ids
+from inference_engine.server.streaming import _StreamingDetokenizer
+from inference_engine.server.tokenizer import resolve_eos_ids
+from inference_engine.session import (
+    AppendTokensCoordinator,
+    DoneEvent,
+    GenerationCoordinator,
+    HistoryTruncatedEvent,
+    SessionStore,
+    STOP_REASON_EOS,
+    TokenEvent,
+)
+from inference_engine.session.store import SessionNotFoundError
 
 
-# ---------------------------------------------------------------------------
-# App factory
-# ---------------------------------------------------------------------------
+# Per ADR 0008 §2.7: every HTTP-shim response carries these headers.
+# v0.3.0 final ships with the deprecation marker live; the Sunset
+# date is cosmetic until a real cutover plan exists.
+_DEPRECATION_HEADERS = {
+    "Deprecation": "true",
+    "Sunset": "Wed, 31 Dec 2025 00:00:00 GMT",
+    "Link": (
+        '</docs/adr/0008-session-bound-runtime-and-grpc-protocol.md>; '
+        'rel="successor-version"; type="text/markdown"'
+    ),
+}
 
 
 def create_app(
-    engine: Engine,
+    verifier,
     config: ServerConfig,
-    pool: Optional[SlabPool] = None,
+    *,
+    slab_pool: Optional[SlabPool] = None,
+    model_id_label: Optional[str] = None,
 ) -> FastAPI:
-    """Build a FastAPI app bound to a specific engine + config.
+    """Build a FastAPI app bound to a verifier + config.
 
     Parameters
     ----------
-    engine:
-        Anything implementing :class:`Engine`. In production this is
-        a :class:`SpeculativeEngine`; in tests it is a deterministic
-        test double.
+    verifier:
+        Anything implementing the verifier protocol consumed by
+        :class:`AppendTokensCoordinator` (i.e., :meth:`prefill`,
+        :meth:`forward_block`, :meth:`commit_or_truncate`,
+        :meth:`k_seq_length`, :meth:`kv_live_bytes`) plus a
+        :attr:`tokenizer` attribute satisfying
+        :class:`~inference_engine.server.tokenizer.Tokenizer`.
+        In production this is a :class:`SinkWindowVerifier`.
     config:
-        Process-wide :class:`ServerConfig`. The scheduler-related
-        fields (``max_concurrent``, ``admission_policy``,
-        ``queue_max_wait_s``) drive the internal :class:`Scheduler`.
-    pool:
-        Optional pre-built :class:`SlabPool`. If ``None``, we build a
-        minimal placeholder pool sized for ``config.max_concurrent``
-        slots — these slots are pure admission-control bookkeeping
-        until a future commit wires the verifier itself to consume
-        slabs from the pool. The placeholder slab tensors are
-        deliberately tiny (a few bytes per slab) since they are not
-        currently read by attention kernels.
+        Process-wide :class:`ServerConfig`. ``max_concurrent``,
+        ``admission_policy``, and ``queue_max_wait_s`` drive the
+        per-app admission semaphore.
+    slab_pool:
+        Optional pre-built :class:`SlabPool`. If ``None``, a tiny
+        placeholder pool is built for ``max_concurrent`` slots.
+        The slab is a session-bookkeeping placeholder; the verifier
+        owns the real KV tensors and writes byte counts onto the
+        slab via PR-E1c's ``_sync_slab_bytes`` helper.
+    model_id_label:
+        Returned by ``/v1/models`` and embedded in every
+        ``chat.completion`` payload's ``model`` field. Defaults to
+        ``config.model_id_label``.
     """
-    pool = pool if pool is not None else _build_placeholder_pool(config.max_concurrent)
+    pool = slab_pool if slab_pool is not None else _build_placeholder_pool(
+        config.max_concurrent,
+    )
     if pool.total_count != config.max_concurrent:
         raise ValueError(
-            f"pool.total_count={pool.total_count} does not match "
+            f"slab_pool.total_count={pool.total_count} does not match "
             f"config.max_concurrent={config.max_concurrent}"
         )
-    scheduler = Scheduler(
-        engine=engine, pool=pool,
-        config=SchedulerConfig(
-            max_concurrent=config.max_concurrent,
-            admission_policy=config.admission_policy,
-            queue_max_wait_s=config.queue_max_wait_s,
-        ),
+
+    store = SessionStore(
+        capacity=config.max_concurrent,
+        cache_inspector=verifier,
+        slab_pool=pool,
     )
+    append_coord = AppendTokensCoordinator(store, verifier)
+    gen_coord = GenerationCoordinator(store, verifier)
+
     metrics = Metrics.build()
     metrics.snapshot_scheduler(
         active=0, pool_in_use=0, pool_total=pool.total_count, pending=0,
         kv_live_bytes=0,
     )
 
+    label = model_id_label or config.model_id_label
+
     @asynccontextmanager
     async def lifespan(app: FastAPI) -> AsyncIterator[None]:
-        # Startup: nothing to do; scheduler is already constructed and
-        # ready to admit. We yield without doing anything here so unit
-        # tests that exercise the route via ASGITransport without
-        # explicit lifespan handling still work.
-        try:
-            yield
-        finally:
-            await scheduler.shutdown()
+        # No long-lived worker tasks anymore — every request is a
+        # single-shot session. Lifespan is a no-op other than the
+        # context-manager protocol the framework needs.
+        yield
 
     app = FastAPI(
-        title="Kakeya Inference Engine",
+        title="Kakeya Inference Engine (HTTP shim, deprecated)",
         description=(
-            "OpenAI-compatible HTTP API for the DLM-proposer + AR-verifier "
-            "speculative decoder. See https://github.com/FluffyAIcode/"
-            "Kakeya-LLM-Inference-engine for source and ADRs."
+            "DEPRECATED OpenAI-compatible HTTP API. The v0.3 architecture "
+            "is gRPC-first; see /docs/adr/0008-session-bound-runtime-"
+            "and-grpc-protocol.md. This shim is feature-frozen, pure-AR "
+            "(no speculative decoding), and slated for removal once "
+            "consumers migrate."
         ),
-        version="0.2.0-dev",
+        version="0.3.0",
         lifespan=lifespan,
     )
-    app.state.engine = engine
+
+    app.state.verifier = verifier
     app.state.config = config
-    app.state.scheduler = scheduler
+    app.state.store = store
+    app.state.append_coord = append_coord
+    app.state.gen_coord = gen_coord
     app.state.pool = pool
     app.state.metrics = metrics
+    app.state.model_id_label = label
+    app.state.admission_sem = asyncio.Semaphore(config.max_concurrent)
 
     # OpenAI-shape error envelopes for HTTPException + 422 + 500.
     app.add_exception_handler(StarletteHTTPException, http_exception_handler)
@@ -194,11 +216,18 @@ async def lifespan(app: FastAPI) -> AsyncIterator[None]:
     app.add_middleware(_AuthMiddleware, valid_keys=config.api_keys)
     # Per-request timing + counter.
     app.add_middleware(_MetricsMiddleware, metrics=metrics)
+    # ADR 0008 §2.7 Deprecation / Sunset headers on every response.
+    app.add_middleware(_DeprecationHeadersMiddleware)
 
     _register_routes(app)
     return app
 
 
+# ---------------------------------------------------------------------------
+# Middleware
+# ---------------------------------------------------------------------------
+
+
 class _AuthMiddleware(BaseHTTPMiddleware):
     """Bearer-token gate for ``/v1/*`` routes when api_keys is non-empty."""
 
@@ -210,20 +239,12 @@ async def dispatch(self, request, call_next):
         try:
             verify_api_key(request, valid_keys=self._valid_keys)
         except StarletteHTTPException as exc:
-            # Re-route through the registered handler so the response
-            # carries the OpenAI envelope.
             return await http_exception_handler(request, exc)
         return await call_next(request)
 
 
 class _MetricsMiddleware(BaseHTTPMiddleware):
-    """Records ``http_requests_total`` + duration histogram per request.
-
-    The path label is the matched route's path template (e.g.
-    ``/v1/chat/completions``) when available, otherwise the raw URL
-    path. We deliberately avoid recording dynamic path segments
-    (e.g. session ids) to prevent label-cardinality blow-up.
-    """
+    """Records ``http_requests_total`` + duration histogram per request."""
 
     def __init__(self, app, *, metrics: Metrics) -> None:
         super().__init__(app)
@@ -244,27 +265,35 @@ async def dispatch(self, request, call_next):
 
     @staticmethod
     def _safe_path(request) -> str:
-        """Return the route template if matched, else the raw path.
-
-        Starlette stores the matched route on ``request.scope["route"]``
-        when available; we fall back to the raw URL for unmatched
-        requests (404s).
-        """
         route = request.scope.get("route")
         if route is not None and hasattr(route, "path"):
             return route.path
         return request.url.path
 
 
+class _DeprecationHeadersMiddleware(BaseHTTPMiddleware):
+    """Stamps ADR 0008 §2.7 deprecation headers onto every response."""
+
+    async def dispatch(self, request, call_next):
+        response = await call_next(request)
+        for k, v in _DEPRECATION_HEADERS.items():
+            response.headers[k] = v
+        return response
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
 def _build_placeholder_pool(num_slabs: int) -> SlabPool:
-    """Construct a minimal :class:`SlabPool` for admission-control bookkeeping.
-
-    Slab tensors are 1-element bf16 (2 bytes per K + 2 per V × num_slabs).
-    Total memory cost for the default ``num_slabs=1`` pool is ~4 bytes,
-    plus Python object overhead. When the verifier-side refactor lands
-    that actually consumes slabs as KV storage, callers will pass a
-    properly-sized pool to ``create_app`` and this placeholder will
-    become unnecessary in production paths.
+    """Construct a tiny ``SlabPool`` for session bookkeeping.
+
+    The slab is a placeholder; PR-E1c's :func:`_sync_slab_bytes`
+    writes the verifier's real KV byte count onto each slab's
+    ``live_kv_bytes_override`` after every coordinator mutation,
+    so :meth:`Session.kv_live_bytes` reports physically meaningful
+    values without the slab actually holding the K/V tensors.
     """
     cfg = SlabConfig(
         num_layers=1, num_heads=1, sink_size=0, window_size=1,
@@ -273,59 +302,102 @@ def _build_placeholder_pool(num_slabs: int) -> SlabPool:
     return SlabPool(num_slabs=num_slabs, slab_config=cfg)
 
 
+def _encode_prompt(verifier, req: ChatCompletionRequest) -> List[int]:
+    """Apply the verifier's tokenizer's chat template to the request."""
+    messages = [m.model_dump() for m in req.messages]
+    prompt_ids = verifier.tokenizer.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        tokenize=True,
+        return_dict=False,
+        enable_thinking=False,
+    )
+    if not isinstance(prompt_ids, list) or not all(
+        isinstance(t, int) for t in prompt_ids
+    ):
+        raise ValueError(
+            f"chat template returned {type(prompt_ids).__name__}, expected list[int]"
+        )
+    if not prompt_ids:
+        raise ValueError("chat template produced an empty token sequence")
+    return prompt_ids
+
+
+async def _admit(
+    *,
+    sem: asyncio.Semaphore,
+    config: ServerConfig,
+) -> None:
+    """Acquire the admission semaphore per the configured policy.
+
+    REJECT: fail immediately with HTTPException(429) if the
+    semaphore is fully saturated. QUEUE: wait up to
+    ``queue_max_wait_s`` then fail. The ``queue_max_wait_s=0``
+    sentinel means wait forever.
+    """
+    if config.admission_policy == AdmissionPolicy.REJECT:
+        # Non-blocking: try once.
+        if not _try_acquire(sem):
+            raise HTTPException(
+                status_code=status.HTTP_429_TOO_MANY_REQUESTS,
+                detail="slab pool exhausted (REJECT policy)",
+            )
+        return
+    # QUEUE policy.
+    timeout = (
+        None
+        if config.queue_max_wait_s == 0
+        else config.queue_max_wait_s
+    )
+    try:
+        await asyncio.wait_for(sem.acquire(), timeout=timeout)
+    except asyncio.TimeoutError as exc:
+        raise HTTPException(
+            status_code=status.HTTP_429_TOO_MANY_REQUESTS,
+            detail=(
+                f"queue wait exceeded ({config.queue_max_wait_s}s)"
+            ),
+        ) from exc
+
+
+def _try_acquire(sem: asyncio.Semaphore) -> bool:
+    """Non-blocking semaphore acquire.
+
+    ``asyncio.Semaphore`` lacks a public ``locked()``-with-grab API;
+    we inspect the internal ``_value`` (CPython implementation
+    detail kept stable across versions; documented in cpython
+    ``asyncio/locks.py``).
+    """
+    if sem._value <= 0:  # noqa: SLF001 - intentional, see docstring
+        return False
+    sem._value -= 1  # noqa: SLF001
+    return True
+
+
 # ---------------------------------------------------------------------------
-# Route registration
+# Routes
 # ---------------------------------------------------------------------------
 
 
 def _register_routes(app: FastAPI) -> None:
     @app.get("/healthz", response_model=HealthResponse)
     async def healthz() -> HealthResponse:
-        engine: Engine = app.state.engine
-        return HealthResponse(status="ok", model=engine.model_id_label)
+        label: str = app.state.model_id_label
+        return HealthResponse(status="ok", model=label)
 
     @app.get("/metrics")
     async def metrics_endpoint() -> Response:
         metrics: Metrics = app.state.metrics
-        scheduler: Scheduler = app.state.scheduler
-        pool: SlabPool = app.state.pool
-        # Refresh scheduler-state gauges on every scrape so the
-        # exposition reflects "now" rather than the last
-        # admission/completion event.
-        engine_for_kv: Engine = app.state.engine
-        # Read KV bytes directly from the engine's verifier rather
-        # than from pool.live_kv_bytes. Rationale: in v0.3 the slab
-        # is a session ticket (acquired/released per request) — the
-        # verifier holds the real KV cache tensors and is the
-        # canonical source of truth. Pool-side accounting only
-        # populates once PooledVerifier is wired (a post-v0.3.0
-        # change) and otherwise reads 0 even while the verifier
-        # cache is several MiB.
-        #
-        # Gauge semantics: "KV bytes attributable to in-flight
-        # sessions". Between turns, the verifier's ``self.cache``
-        # still holds the previous turn's tensors — the next
-        # prefill calls ``reset()`` which replaces them, but until
-        # then ``engine.kv_state()`` reports non-zero residual
-        # bytes. Reporting that as "live" misleads observers
-        # and breaks the §2.3 KV-bounded check (residual carries
-        # forward at the previous turn's peak, never trimmed). We
-        # therefore gate the gauge on ``active_count > 0``: an
-        # idle server reports 0, a server with an active session
-        # reports the verifier's true KV size. This is also how
-        # the gauge will naturally behave once PooledVerifier is
-        # wired post-v0.3 (the pool aggregation is 0 when no slab
-        # is in use).
-        kv_live = (
-            int(engine_for_kv.kv_state())
-            if scheduler.active_count > 0
-            else 0
-        )
+        store: SessionStore = app.state.store
+        # Refresh the in-flight gauge from SessionStore so /metrics
+        # always reports current state.
+        kv_live = store.total_kv_live_bytes
+        active = store.active_count
         metrics.snapshot_scheduler(
-            active=scheduler.active_count,
-            pool_in_use=pool.in_use_count,
-            pool_total=pool.total_count,
-            pending=scheduler.pending_count,
+            active=active,
+            pool_in_use=active,
+            pool_total=app.state.pool.total_count,
+            pending=0,
             kv_live_bytes=kv_live,
         )
         return PlainTextResponse(
@@ -335,32 +407,31 @@ async def metrics_endpoint() -> Response:
 
     @app.get("/v1/models", response_model=ListModelsResponse)
     async def list_models() -> ListModelsResponse:
-        engine: Engine = app.state.engine
+        label: str = app.state.model_id_label
         return ListModelsResponse(
-            data=[
-                ModelInfo(
-                    id=engine.model_id_label,
-                    created=int(time.time()),
-                )
-            ]
+            data=[ModelInfo(id=label, created=int(time.time()))],
         )
 
     @app.post("/v1/chat/completions")
     async def chat_completions(req: ChatCompletionRequest, request: Request):
-        engine: Engine = app.state.engine
-        scheduler: Scheduler = app.state.scheduler
+        verifier = app.state.verifier
+        store: SessionStore = app.state.store
+        append_coord: AppendTokensCoordinator = app.state.append_coord
+        gen_coord: GenerationCoordinator = app.state.gen_coord
         config: ServerConfig = app.state.config
         metrics: Metrics = app.state.metrics
+        admission_sem: asyncio.Semaphore = app.state.admission_sem
+        model_label: str = app.state.model_id_label
 
         try:
-            prompt_ids = _encode_prompt(engine, req)
+            prompt_ids = _encode_prompt(verifier, req)
         except ValueError as exc:
             raise HTTPException(
                 status_code=status.HTTP_400_BAD_REQUEST,
                 detail=f"prompt encoding failed: {exc}",
             ) from exc
 
-        eos_token_ids = resolve_eos_ids(engine.tokenizer)
+        eos_token_ids = resolve_eos_ids(verifier.tokenizer)
         if not eos_token_ids:
             raise HTTPException(
                 status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
@@ -372,179 +443,208 @@ async def chat_completions(req: ChatCompletionRequest, request: Request):
         created = int(time.time())
         prompt_token_count = len(prompt_ids)
 
-        # Submit to the scheduler. Admission failures surface as 429
-        # — the canonical OpenAI status for capacity exhaustion.
-        try:
-            session = await scheduler.submit(
-                prompt_ids=prompt_ids,
-                max_new_tokens=max_new_tokens,
-                eos_token_ids=eos_token_ids,
-            )
-        except RequestRejected as exc:
-            metrics.record_admission(admitted=False)
-            raise HTTPException(
-                status_code=status.HTTP_429_TOO_MANY_REQUESTS,
-                detail=str(exc),
-            ) from exc
+        # Admission control. Failures surface as 429 (REJECT) or
+        # 429-after-timeout (QUEUE).
+        await _admit(sem=admission_sem, config=config)
         metrics.record_admission(admitted=True)
 
-        if req.stream:
-            return EventSourceResponse(
-                _stream_via_scheduler(
-                    scheduler=scheduler,
-                    session=session,
-                    request=request,
-                    engine=engine,
-                    completion_id=completion_id,
-                    created=created,
-                    metrics=metrics,
-                ),
-                media_type="text/event-stream",
-            )
+        session = store.create_session(eos_token_ids=tuple(eos_token_ids))
 
         try:
-            output_token_ids = await _collect_non_streaming_tokens(
-                scheduler=scheduler,
-                session=session,
-                request=request,
+            try:
+                append_coord.append_tokens(session.session_id, prompt_ids)
+            except Exception as exc:  # noqa: BLE001 - surface every prefill error
+                store.close_session(session.session_id)
+                raise HTTPException(
+                    status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+                    detail=f"prefill error: {type(exc).__name__}: {exc}",
+                ) from exc
+
+            if req.stream:
+                return EventSourceResponse(
+                    _stream_session(
+                        gen_coord=gen_coord,
+                        session_id=session.session_id,
+                        request=request,
+                        verifier=verifier,
+                        completion_id=completion_id,
+                        created=created,
+                        model_label=model_label,
+                        max_tokens=max_new_tokens,
+                        eos_token_ids=eos_token_ids,
+                        metrics=metrics,
+                        store=store,
+                        admission_sem=admission_sem,
+                    ),
+                    media_type="text/event-stream",
+                )
+
+            try:
+                output_token_ids, stopped_on_eos = (
+                    await _collect_session_tokens(
+                        gen_coord=gen_coord,
+                        session_id=session.session_id,
+                        max_tokens=max_new_tokens,
+                        request=request,
+                    )
+                )
+            except asyncio.CancelledError:
+                raise
+            except BaseException as exc:
+                raise HTTPException(
+                    status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+                    detail=f"engine error: {exc}",
+                ) from exc
+
+            completion_text = verifier.tokenizer.decode(
+                output_token_ids, skip_special_tokens=True,
             )
-        except asyncio.CancelledError:
-            # Client timed out/disconnected while the JSON response was
-            # draining. Without explicit cancellation the worker can keep
-            # occupying the only slab, causing later queued requests to 429.
-            await scheduler.cancel_session(session)
-            raise
-        except BaseException as exc:
-            await scheduler.cancel_session(session)
-            # Engine raised mid-generate; surface as 500.
-            raise HTTPException(
-                status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-                detail=f"engine error: {exc}",
-            ) from exc
+            finish_reason = "stop" if stopped_on_eos else "length"
 
-        completion_text = engine.tokenizer.decode(
-            output_token_ids, skip_special_tokens=True
-        )
-        # finish_reason: COMPLETED + last token in eos_set => "stop";
-        # otherwise (cap, cancellation, or anything else) => "length"
-        # for non-streaming. Cancellation in non-streaming should not
-        # happen via this path (no cancel hook on JSON responses), but
-        # we keep the conservative mapping.
-        if (
-            session.state is SessionState.COMPLETED
-            and output_token_ids
-            and output_token_ids[-1] in set(eos_token_ids)
-        ):
-            finish_reason = "stop"
-        else:
-            finish_reason = "length"
-
-        metrics.record_completion(
-            finish_reason=finish_reason,
-            n_tokens=len(output_token_ids),
-            acceptance_rate=None,
-        )
+            metrics.record_completion(
+                finish_reason=finish_reason,
+                n_tokens=len(output_token_ids),
+                acceptance_rate=None,
+            )
 
-        return JSONResponse(
-            content=ChatCompletionResponse(
-                id=completion_id,
-                created=created,
-                model=engine.model_id_label,
-                choices=[
-                    ChatCompletionChoice(
-                        index=0,
-                        message=ChatCompletionResponseMessage(
-                            role="assistant", content=completion_text,
+            return JSONResponse(
+                content=ChatCompletionResponse(
+                    id=completion_id,
+                    created=created,
+                    model=model_label,
+                    choices=[
+                        ChatCompletionChoice(
+                            index=0,
+                            message=ChatCompletionResponseMessage(
+                                role="assistant", content=completion_text,
+                            ),
+                            finish_reason=finish_reason,
+                        )
+                    ],
+                    usage=ChatCompletionUsage(
+                        prompt_tokens=prompt_token_count,
+                        completion_tokens=len(output_token_ids),
+                        total_tokens=(
+                            prompt_token_count + len(output_token_ids)
                         ),
-                        finish_reason=finish_reason,
-                    )
-                ],
-                usage=ChatCompletionUsage(
-                    prompt_tokens=prompt_token_count,
-                    completion_tokens=len(output_token_ids),
-                    total_tokens=prompt_token_count + len(output_token_ids),
-                ),
-            ).model_dump()
-        )
+                    ),
+                ).model_dump(),
+            )
+        finally:
+            # Non-streaming path closes the session here. The streaming
+            # path's generator owns its own teardown — the EventSource
+            # flow is asynchronous so the session lifecycle is handled
+            # in _stream_session's finally.
+            if not req.stream:
+                try:
+                    store.close_session(session.session_id)
+                except SessionNotFoundError:
+                    pass
+                admission_sem.release()
 
 
 # ---------------------------------------------------------------------------
-# Helpers
+# Generation drivers
 # ---------------------------------------------------------------------------
 
 
-def _encode_prompt(engine: Engine, req: ChatCompletionRequest) -> List[int]:
-    """Apply the tokenizer's chat template to the request messages."""
-    messages = [m.model_dump() for m in req.messages]
-    prompt_ids = engine.tokenizer.apply_chat_template(
-        messages,
-        add_generation_prompt=True,
-        tokenize=True,
-        return_dict=False,
-        enable_thinking=False,
-    )
-    if not isinstance(prompt_ids, list) or not all(
-        isinstance(t, int) for t in prompt_ids
-    ):
-        raise ValueError(
-            f"chat template returned {type(prompt_ids).__name__}, expected list[int]"
-        )
-    if not prompt_ids:
-        raise ValueError("chat template produced an empty token sequence")
-    return prompt_ids
-
-
-async def _collect_non_streaming_tokens(
+async def _collect_session_tokens(
     *,
-    scheduler: Scheduler,
-    session: Session,
+    gen_coord: GenerationCoordinator,
+    session_id: str,
+    max_tokens: int,
     request: Request,
     disconnect_poll_interval_s: float = 0.05,
-) -> List[int]:
-    """Drain a non-streaming session while honoring client disconnects.
+) -> tuple[List[int], bool]:
+    """Drain the generator coordinator while honoring client disconnects.
 
-    Streaming responses already poll ``request.is_disconnected()`` and
-    cancel their scheduler session. JSON responses need the same cleanup:
-    a timed-out client otherwise leaves the scheduler worker running until
-    generation finishes, which can monopolize a single-slot server.
+    Runs the synchronous ``GenerationCoordinator.generate`` iterator
+    in a background thread (it's CPU-bound on the verifier, not
+    async), polling ``request.is_disconnected()`` between events.
+    Returns ``(emitted_token_ids, stopped_on_eos)``.
     """
-    output_token_ids: List[int] = []
+    output: List[int] = []
+    stopped_on_eos = False
+
+    # Run the generator in a thread to keep the event loop responsive
+    # to disconnect polling. Use a queue to pipe events back.
+    queue: asyncio.Queue = asyncio.Queue()
+    sentinel = object()
+
+    def _drain():
+        try:
+            for event in gen_coord.generate(
+                session_id, max_tokens=max_tokens,
+            ):
+                queue.put_nowait(event)
+        except Exception as exc:  # noqa: BLE001 - propagate as a queue item
+            queue.put_nowait(exc)
+        finally:
+            queue.put_nowait(sentinel)
+
+    drain_task = asyncio.create_task(asyncio.to_thread(_drain))
+
     last_disconnect_check = time.monotonic()
-    async for tok in scheduler.iter_tokens(session):
-        output_token_ids.append(int(tok))
-        now = time.monotonic()
-        if (now - last_disconnect_check) >= disconnect_poll_interval_s:
-            last_disconnect_check = now
-            if await request.is_disconnected():
-                await scheduler.cancel_session(session)
-    return output_token_ids
+    try:
+        while True:
+            try:
+                event = await asyncio.wait_for(
+                    queue.get(),
+                    timeout=disconnect_poll_interval_s,
+                )
+            except asyncio.TimeoutError:
+                if await request.is_disconnected():
+                    drain_task.cancel()
+                    raise asyncio.CancelledError() from None
+                continue
+
+            if event is sentinel:
+                break
+            if isinstance(event, BaseException):
+                raise event
+            if isinstance(event, TokenEvent):
+                output.append(event.token_id)
+            elif isinstance(event, DoneEvent):
+                stopped_on_eos = event.stop_reason == STOP_REASON_EOS
+            elif isinstance(event, HistoryTruncatedEvent):
+                # Non-streaming path doesn't surface this; ignore.
+                continue
 
+            now = time.monotonic()
+            if (now - last_disconnect_check) >= disconnect_poll_interval_s:
+                last_disconnect_check = now
+                if await request.is_disconnected():
+                    drain_task.cancel()
+                    raise asyncio.CancelledError()
+    finally:
+        if not drain_task.done():
+            drain_task.cancel()
+            try:
+                await drain_task
+            except (asyncio.CancelledError, BaseException):
+                pass
 
-async def _stream_via_scheduler(
+    return output, stopped_on_eos
+
+
+async def _stream_session(
     *,
-    scheduler: Scheduler,
-    session: Session,
+    gen_coord: GenerationCoordinator,
+    session_id: str,
     request: Request,
-    engine: Engine,
+    verifier,
     completion_id: str,
     created: int,
+    model_label: str,
+    max_tokens: int,
+    eos_token_ids: List[int],
     metrics: Metrics,
+    store: SessionStore,
+    admission_sem: asyncio.Semaphore,
     disconnect_poll_interval_s: float = 0.05,
 ) -> AsyncIterator[dict]:
-    """SSE async generator that drains :meth:`Scheduler.iter_tokens`.
-
-    Implements the OpenAI streaming chunk protocol on top of a
-    scheduler-managed session. Polls ``request.is_disconnected()`` on
-    a wall-clock interval; on disconnect, calls
-    ``scheduler.cancel_session`` to short-circuit generation.
-
-    The generator yields ``{"data": "<json>"}`` envelopes (the format
-    sse-starlette consumes), terminated by ``{"data": "[DONE]"}``.
-    """
-    import asyncio
-
-    model_label = engine.model_id_label
-    detok = _StreamingDetokenizer(engine.tokenizer)
+    """SSE async generator that drains the GenerationCoordinator."""
+    detok = _StreamingDetokenizer(verifier.tokenizer)
 
     def envelope(content_delta, role_delta, finish_reason) -> dict:
         chunk = ChatCompletionChunk(
@@ -563,57 +663,93 @@ def envelope(content_delta, role_delta, finish_reason) -> dict:
         )
         return {"data": chunk.model_dump_json()}
 
-    yield envelope(content_delta=None, role_delta="assistant", finish_reason=None)
+    yield envelope(
+        content_delta=None, role_delta="assistant", finish_reason=None,
+    )
 
-    last_disconnect_check = time.monotonic()
+    queue: asyncio.Queue = asyncio.Queue()
+    sentinel = object()
+    n_tokens = 0
     cancelled_by_disconnect = False
+
+    def _drain():
+        try:
+            for event in gen_coord.generate(session_id, max_tokens=max_tokens):
+                queue.put_nowait(event)
+        except Exception as exc:  # noqa: BLE001 - swallow, surface terminal chunk
+            queue.put_nowait(exc)
+        finally:
+            queue.put_nowait(sentinel)
+
+    drain_task = asyncio.create_task(asyncio.to_thread(_drain))
+    last_disconnect_check = time.monotonic()
+    stopped_on_eos = False
+
     try:
-        async for tok in scheduler.iter_tokens(session):
-            delta = detok.feed(int(tok))
-            if delta:
-                yield envelope(
-                    content_delta=delta, role_delta=None, finish_reason=None,
+        while True:
+            try:
+                event = await asyncio.wait_for(
+                    queue.get(),
+                    timeout=disconnect_poll_interval_s,
                 )
+            except asyncio.TimeoutError:
+                if await request.is_disconnected():
+                    cancelled_by_disconnect = True
+                    drain_task.cancel()
+                    break
+                continue
+            if event is sentinel:
+                break
+            if isinstance(event, BaseException):
+                # Verifier raised mid-stream. Once SSE has started
+                # there's no way to surface a 500; close gracefully
+                # with finish_reason="stop".
+                break
+            if isinstance(event, TokenEvent):
+                n_tokens += 1
+                delta = detok.feed(event.token_id)
+                if delta:
+                    yield envelope(
+                        content_delta=delta, role_delta=None, finish_reason=None,
+                    )
+            elif isinstance(event, DoneEvent):
+                stopped_on_eos = event.stop_reason == STOP_REASON_EOS
+            elif isinstance(event, HistoryTruncatedEvent):
+                # Stream contract: this event arrives BEFORE the
+                # first TokenEvent. We don't surface it on the
+                # OpenAI wire (no analog). Ignore.
+                continue
+
             now = time.monotonic()
             if (now - last_disconnect_check) >= disconnect_poll_interval_s:
                 last_disconnect_check = now
                 if await request.is_disconnected():
                     cancelled_by_disconnect = True
-                    await scheduler.cancel_session(session)
-                    # Drain remaining tokens (will exit shortly because
-                    # the on_token callback inside the scheduler now
-                    # returns True).
-    except BaseException:  # noqa: BLE001 — surface as terminal chunk
-        # Engine errors mid-stream end the SSE stream gracefully; the
-        # client sees a finish_reason="stop" with no further content.
-        # We deliberately do NOT raise here — once SSE has started,
-        # there is no way to send a 500 status; the OpenAI clients
-        # also expect graceful termination on errors.
-        pass
-
-    # Terminal chunk: derive finish_reason from session state.
-    if cancelled_by_disconnect or session.state is SessionState.CANCELLED:
-        finish_reason = "stop"
-    elif session.state is SessionState.COMPLETED:
-        # Did we end on EOS or hit max_tokens?
-        if (
-            session.output_token_ids
-            and session.output_token_ids[-1]
-            in set(session.eos_token_ids)
-        ):
-            finish_reason = "stop"
-        else:
-            finish_reason = "length"
-    else:
-        # FAILED or some other terminal — be conservative.
-        finish_reason = "stop"
-
+                    drain_task.cancel()
+                    break
+    finally:
+        if not drain_task.done():
+            drain_task.cancel()
+            try:
+                await drain_task
+            except (asyncio.CancelledError, BaseException):
+                pass
+        try:
+            store.close_session(session_id)
+        except SessionNotFoundError:
+            pass
+        admission_sem.release()
+
+    finish_reason = (
+        "stop" if (stopped_on_eos and not cancelled_by_disconnect)
+        else "length"
+    )
     yield envelope(
         content_delta=None, role_delta=None, finish_reason=finish_reason,
     )
     metrics.record_completion(
         finish_reason=finish_reason,
-        n_tokens=len(session.output_token_ids),
+        n_tokens=n_tokens,
         acceptance_rate=None,
     )
     yield {"data": "[DONE]"}
diff --git a/scripts/review_pr_d2_on_mac.sh b/scripts/review_pr_d2_on_mac.sh
new file mode 100755
index 0000000..4a7ba9b
--- /dev/null
+++ b/scripts/review_pr_d2_on_mac.sh
@@ -0,0 +1,79 @@
+#!/usr/bin/env bash
+# Mac M4 review aid for PR-D2 (HTTP shim refactor onto SessionStore).
+#
+# PR-D2 retired the Scheduler + PooledVerifier + SpeculativeEngine
+# machinery from the HTTP shim's request path. Each
+# /v1/chat/completions request is now a single-shot session under
+# SessionStore — same semantics as the gRPC RuntimeService. The
+# integration suite's test_http_shim_real.py is the binding gate
+# for this refactor: it drives the full FastAPI app (with real
+# Qwen3-0.6B verifier) through OpenAI-compat surface tests
+# including SSE streaming, auth, error envelopes, /metrics, and
+# /v1/models.
+#
+# Produces 1 artifact:
+#
+#   results/platform-tests/pr-d2-mac-integration-tests-<unix>.json
+#     pytest -m integration tests/integration/ — runs the full
+#     accumulated integration suite (PR-E1 INV-3 + PR-N1 coordinator/
+#     generator + PR-N2 scheduler + PR-N3 http_shim/engine/
+#     tokenizer/streaming + PR-N4 SDK).
+#
+# Usage (from repo root, on Mac M4):
+#
+#     bash scripts/review_pr_d2_on_mac.sh
+
+set -euo pipefail
+
+ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+cd "$ROOT"
+
+stamp="$(date +%s)"
+out_dir="results/platform-tests"
+mkdir -p "$out_dir"
+
+junit="$out_dir/pr-d2-mac-integration-tests-${stamp}.junit.xml"
+report="$out_dir/pr-d2-mac-integration-tests-${stamp}.json"
+
+echo "==> integration suite (HTTP shim onto SessionStore + full N1..N4 cumulative)"
+PYTHONPATH=.:sdks/python python3 -m pytest \
+    -m integration \
+    tests/integration/ \
+    --junitxml="$junit" \
+    -v
+
+PYTHONPATH=.:sdks/python python3 - "$junit" "$report" <<'PY'
+import json
+import platform
+import sys
+import xml.etree.ElementTree as ET
+junit_path, out_path = sys.argv[1:3]
+jr = ET.parse(junit_path).getroot()
+testsuites = list(jr.iter("testsuite"))
+total_tests = sum(int(ts.get("tests", "0")) for ts in testsuites)
+total_failures = sum(int(ts.get("failures", "0")) for ts in testsuites)
+total_errors = sum(int(ts.get("errors", "0")) for ts in testsuites)
+total_skipped = sum(int(ts.get("skipped", "0")) for ts in testsuites)
+report = {
+    "schema_version": 1,
+    "kind": "pr_d2_mac_integration_tests",
+    "host": {
+        "platform": platform.platform(),
+        "machine": platform.machine(),
+        "python": platform.python_version(),
+    },
+    "junit": {
+        "tests": total_tests, "failures": total_failures,
+        "errors": total_errors, "skipped": total_skipped,
+    },
+}
+with open(out_path, "w", encoding="utf-8") as fh:
+    json.dump(report, fh, indent=2)
+print(f"  -> {out_path}")
+PY
+
+echo
+echo "==> Done. Commit:"
+echo "    git add $out_dir/pr-d2-mac-*"
+echo "    git commit -m 'Mac M4 review evidence for PR-D2'"
+echo "    git push"
diff --git a/scripts/serve.py b/scripts/serve.py
index cabfbca..4461fca 100644
--- a/scripts/serve.py
+++ b/scripts/serve.py
@@ -1,19 +1,29 @@
-"""HTTP server launcher (E2).
-
-Boots the speculative-decoding engine and serves it via uvicorn over
-the OpenAI-compatible API defined in ``inference_engine.server``.
+"""HTTP server launcher (deprecated shim, post PR-D2 of ADR 0008).
+
+Boots the **deprecated** OpenAI-compatible HTTP shim over a real
+verifier. Per ADR 0008 §2.7 the HTTP shim is feature-frozen and
+slated for retirement; new integrations should use the gRPC server
+(``scripts/start_grpc_runtime_server.py``) instead.
+
+PR-D2 (this revision):
+  - The shim no longer wraps a SpeculativeEngine. Each
+    /v1/chat/completions request runs as a single-shot session
+    against the verifier directly: prefill -> generate -> close.
+  - Speculative decoding (proposer + verifier) is NOT exercised on
+    the HTTP path. Pure AR. Performance roughly matches
+    transformers-vanilla.
+  - Every response carries ``Deprecation: true`` and a ``Sunset``
+    header.
 
 Usage:
-    PYTHONPATH=. python3 scripts/serve.py --backend mlx \\
-        --verifier-id Qwen/Qwen3-1.7B
+    PYTHONPATH=. python3 scripts/serve.py --backend cpu \\
+        --verifier-id Qwen/Qwen3-0.6B
     PYTHONPATH=. python3 scripts/serve.py --backend mlx \\
         --verifier-id mlx-community/Qwen3-1.7B-4bit \\
         --host 0.0.0.0 --port 8000
 
 This script is exempt from unit-test coverage (CLI plumbing around
-already-tested library code, same convention as ``run_demo.py`` and
-``chat.py``). Its correctness is verified by the system-test PR and
-by ad-hoc local invocation.
+already-tested library code).
 """
 
 from __future__ import annotations
@@ -26,70 +36,35 @@
 
 from inference_engine.server.app import create_app
 from inference_engine.server.config import ServerConfig
-from inference_engine.server.engine import SpeculativeEngine
-from kv_cache_proposer.proposer import ProposerConfig
-from kv_cache_proposer.speculative import SpeculativeDecoder
 from kv_cache_proposer.verifier import VerifierConfig
 
 
-def _build_engine(
+def _build_verifier(
     *,
     backend: str,
     verifier_id: str,
     sink_size: int,
     window_size: int,
-    block_size: int,
-    num_diffusion_steps: int,
-    model_id_label: str,
-) -> SpeculativeEngine:
-    proposer_cfg = ProposerConfig(dtype=torch.bfloat16, device="cpu")
-    verifier_cfg = VerifierConfig(
+):
+    cfg = VerifierConfig(
         model_id=verifier_id,
         dtype=torch.bfloat16, device="cpu",
         sink_size=sink_size, window_size=window_size,
     )
 
     if backend == "cpu":
-        from inference_engine.proposer import SparseLogitsProposer
         from kv_cache_proposer.verifier import SinkWindowVerifier
-        proposer = SparseLogitsProposer(proposer_cfg)
-        verifier = SinkWindowVerifier(verifier_cfg)
-    elif backend == "mlx":
-        from inference_engine.backends.mlx.env import probe_environment
-        env = probe_environment()
-        if not env.is_available:
-            print(f"[serve] MLX unavailable: {env.failure_reason}",
-                  file=sys.stderr)
-            sys.exit(2)
-        from inference_engine.backends.mlx.proposer import MLXSparseLogitsProposer
-        from inference_engine.backends.mlx.verifier import MLXSinkWindowVerifier
-        proposer = MLXSparseLogitsProposer(proposer_cfg)
-        verifier = MLXSinkWindowVerifier(verifier_cfg)
-    elif backend == "mixed":
+        return SinkWindowVerifier(cfg)
+    if backend == "mlx":
         from inference_engine.backends.mlx.env import probe_environment
         env = probe_environment()
         if not env.is_available:
             print(f"[serve] MLX unavailable: {env.failure_reason}",
                   file=sys.stderr)
             sys.exit(2)
-        from inference_engine.proposer import SparseLogitsProposer
         from inference_engine.backends.mlx.verifier import MLXSinkWindowVerifier
-        proposer = SparseLogitsProposer(proposer_cfg)
-        verifier = MLXSinkWindowVerifier(verifier_cfg)
-    else:
-        raise SystemExit(f"unknown backend: {backend}")
-
-    decoder = SpeculativeDecoder(
-        proposer=proposer,
-        verifier=verifier,
-        block_size=block_size,
-        num_diffusion_steps=num_diffusion_steps,
-    )
-    return SpeculativeEngine(
-        decoder=decoder,
-        tokenizer=verifier.tokenizer,
-        model_id_label=model_id_label,
-    )
+        return MLXSinkWindowVerifier(cfg)
+    raise SystemExit(f"unknown backend: {backend}")
 
 
 def main() -> int:
@@ -101,8 +76,14 @@ def main() -> int:
     ap.add_argument("--log-level", default=None)
     ap.add_argument("--sink-size", type=int, default=4)
     ap.add_argument("--window-size", type=int, default=64)
-    ap.add_argument("--block-size", type=int, default=16)
-    ap.add_argument("--num-diffusion-steps", type=int, default=2)
+    # PR-D2: HTTP shim is pure-AR now. The proposer-related flags
+    # (block_size, num_diffusion_steps) are accepted but ignored
+    # for backward CLI compatibility; remove in v0.4 once
+    # downstream scripts are updated.
+    ap.add_argument("--block-size", type=int, default=16,
+                    help="Ignored post PR-D2; kept for CLI compat.")
+    ap.add_argument("--num-diffusion-steps", type=int, default=2,
+                    help="Ignored post PR-D2; kept for CLI compat.")
     ap.add_argument("--model-id-label", default=None,
                     help="OpenAI-API ``model`` field returned by /v1/models. "
                          "Defaults to the verifier id.")
@@ -160,20 +141,24 @@ def main() -> int:
     )
 
     print(
-        f"[serve] backend={args.backend} verifier={args.verifier_id} "
-        f"host={config.host} port={config.port}",
+        f"[serve] DEPRECATED HTTP shim "
+        f"backend={args.backend} verifier={args.verifier_id} "
+        f"host={config.host} port={config.port}\n"
+        f"[serve] migrate to gRPC: "
+        f"scripts/start_grpc_runtime_server.py",
         file=sys.stderr, flush=True,
     )
-    engine = _build_engine(
+    verifier = _build_verifier(
         backend=args.backend,
         verifier_id=args.verifier_id,
         sink_size=args.sink_size,
         window_size=args.window_size,
-        block_size=args.block_size,
-        num_diffusion_steps=args.num_diffusion_steps,
+    )
+    app = create_app(
+        verifier,
+        config,
         model_id_label=config.model_id_label,
     )
-    app = create_app(engine, config)
     uvicorn.run(app, host=config.host, port=config.port,
                 log_level=config.log_level)
     return 0
diff --git a/tests/inference_engine/scheduler/test_pooled_verifier.py b/tests/inference_engine/scheduler/test_pooled_verifier.py
deleted file mode 100644
index 8dc9046..0000000
--- a/tests/inference_engine/scheduler/test_pooled_verifier.py
+++ /dev/null
@@ -1,272 +0,0 @@
-"""Unit tests for :class:`PooledVerifier`.
-
-Uses a real concrete ``_FakeVerifier`` class — not ``unittest.mock`` —
-that mimics the verifier protocol with deterministic, in-memory state.
-This lets us verify the wrapper's lifecycle without loading real
-Qwen3 weights (which would make CI slow and HF-cache-bound).
-"""
-
-from __future__ import annotations
-
-import pytest
-import torch
-
-from inference_engine.memory.pool import SlabPool
-from inference_engine.memory.slab import SlabConfig
-from inference_engine.scheduler.pooled_verifier import PooledVerifier
-
-
-class _FakeStats:
-    def __init__(self) -> None:
-        self.peak_kv_bytes = 0
-
-
-class _FakeConfig:
-    def __init__(self) -> None:
-        self.sink_size = 1
-        self.window_size = 4
-        self.dtype = torch.float32
-        self.device = "cpu"
-
-
-class _FakeVerifier:
-    """Deterministic verifier-protocol implementation for tests."""
-
-    def __init__(self) -> None:
-        self.config = _FakeConfig()
-        self.stats = _FakeStats()
-        self.tokenizer = "tokenizer-marker"
-        self.next_token_logits: torch.Tensor | None = None
-        self.cache_logical_size = 0
-        self.next_global_position = 0
-        # Recording for assertions.
-        self.calls: list[tuple] = []
-
-    def prefill(self, prompt_ids):
-        self.calls.append(("prefill", tuple(prompt_ids)))
-        self.cache_logical_size = len(prompt_ids)
-        self.next_global_position = len(prompt_ids)
-        self.next_token_logits = torch.zeros(8)
-        self.stats.peak_kv_bytes = len(prompt_ids) * 100  # fake "100 bytes/token"
-
-    def forward_block(self, tokens):
-        self.calls.append(("forward_block", tuple(tokens)))
-        self.cache_logical_size += len(tokens)
-        self.stats.peak_kv_bytes = self.cache_logical_size * 100
-        return torch.zeros(len(tokens), 8)
-
-    def commit_or_truncate(self, *, forwarded, accepted):
-        self.calls.append(("commit_or_truncate", forwarded, accepted))
-        self.cache_logical_size -= (forwarded - accepted)
-        self.next_global_position += accepted
-        self.stats.peak_kv_bytes = self.cache_logical_size * 100
-
-    def append_token(self, token_id):
-        self.calls.append(("append_token", token_id))
-        self.cache_logical_size += 1
-        self.next_global_position += 1
-        self.stats.peak_kv_bytes = self.cache_logical_size * 100
-        out = torch.zeros(8)
-        self.next_token_logits = out
-        return out
-
-    def reset(self):
-        self.calls.append(("reset",))
-        self.cache_logical_size = 0
-        self.next_global_position = 0
-        self.next_token_logits = None
-
-
-@pytest.fixture
-def slab_config() -> SlabConfig:
-    return SlabConfig(
-        num_layers=1, num_heads=1, sink_size=0, window_size=1,
-        head_dim=1, dtype=torch.float32,
-    )
-
-
-@pytest.fixture
-def pool(slab_config: SlabConfig) -> SlabPool:
-    return SlabPool(num_slabs=2, slab_config=slab_config)
-
-
-@pytest.fixture
-def fake_verifier() -> _FakeVerifier:
-    return _FakeVerifier()
-
-
-@pytest.fixture
-def pooled(fake_verifier: _FakeVerifier, pool: SlabPool) -> PooledVerifier:
-    return PooledVerifier(verifier=fake_verifier, pool=pool)
-
-
-# ---------------------------------------------------------------------------
-# Construction
-# ---------------------------------------------------------------------------
-
-
-def test_construction_rejects_none_pool(fake_verifier):
-    with pytest.raises(ValueError, match="pool must not be None"):
-        PooledVerifier(verifier=fake_verifier, pool=None)  # type: ignore[arg-type]
-
-
-def test_construction_no_slab_held_initially(pooled, pool):
-    assert pooled.slab is None
-    assert pool.in_use_count == 0
-
-
-# ---------------------------------------------------------------------------
-# prefill: acquires slab + syncs bytes
-# ---------------------------------------------------------------------------
-
-
-def test_prefill_acquires_slab(pooled, pool, fake_verifier):
-    pooled.prefill([1, 2, 3])
-    assert pooled.slab is not None
-    assert pool.in_use_count == 1
-    assert ("prefill", (1, 2, 3)) in fake_verifier.calls
-
-
-def test_prefill_writes_real_bytes_to_slab_override(pooled):
-    pooled.prefill([1, 2, 3])
-    # _FakeVerifier reports 3 * 100 = 300 bytes after prefill.
-    assert pooled.slab.live_kv_bytes_override == 300
-    # And live_kv_bytes returns the override.
-    assert pooled.slab.live_kv_bytes == 300
-
-
-def test_repeated_prefill_releases_old_slab(pooled, pool):
-    pooled.prefill([1])
-    first_slab = pooled.slab
-    pooled.prefill([2])
-    second_slab = pooled.slab
-    # Different slab instance (or same after release+re-acquire); the
-    # invariant is that pool only holds one active slab for this
-    # verifier.
-    assert pool.in_use_count == 1
-    # Old slab was released.
-    assert first_slab is not None
-    if first_slab is not second_slab:
-        # The pool's free list contains first_slab now (it was released).
-        # We don't probe pool internals; just verify in_use_count.
-        pass
-
-
-def test_prefill_failure_releases_slab(pool):
-    """If the wrapped verifier raises during prefill, the wrapper
-    must release the slab so the pool isn't leaked."""
-
-    class _RaisingVerifier(_FakeVerifier):
-        def prefill(self, prompt_ids):
-            raise RuntimeError("synthetic prefill failure")
-
-    pooled = PooledVerifier(verifier=_RaisingVerifier(), pool=pool)
-    with pytest.raises(RuntimeError, match="synthetic prefill failure"):
-        pooled.prefill([1, 2, 3])
-    assert pool.in_use_count == 0
-    assert pooled.slab is None
-
-
-# ---------------------------------------------------------------------------
-# forward_block / commit_or_truncate / append_token sync slab bytes
-# ---------------------------------------------------------------------------
-
-
-def test_forward_block_updates_slab_bytes(pooled):
-    pooled.prefill([1, 2, 3])
-    out = pooled.forward_block([4, 5])
-    assert out.shape == (2, 8)
-    # cache_logical_size is now 5; 5*100 = 500.
-    assert pooled.slab.live_kv_bytes_override == 500
-
-
-def test_commit_or_truncate_updates_slab_bytes(pooled):
-    pooled.prefill([1, 2, 3])
-    pooled.forward_block([4, 5])
-    pooled.commit_or_truncate(forwarded=2, accepted=1)
-    # cache shrinks by 1 (drop=2-1=1) -> logical_size=4 -> 400 bytes.
-    assert pooled.slab.live_kv_bytes_override == 400
-
-
-def test_append_token_updates_slab_bytes(pooled):
-    pooled.prefill([1, 2, 3])
-    pooled.append_token(99)
-    # logical_size went 3 -> 4; 400 bytes.
-    assert pooled.slab.live_kv_bytes_override == 400
-
-
-def test_methods_passthrough_to_underlying_verifier(pooled, fake_verifier):
-    pooled.prefill([7, 8])
-    pooled.forward_block([9])
-    pooled.commit_or_truncate(forwarded=1, accepted=1)
-    pooled.append_token(42)
-    # Inspect the recorded call list on the underlying verifier.
-    assert fake_verifier.calls == [
-        ("prefill", (7, 8)),
-        ("forward_block", (9,)),
-        ("commit_or_truncate", 1, 1),
-        ("append_token", 42),
-    ]
-
-
-# ---------------------------------------------------------------------------
-# reset releases slab
-# ---------------------------------------------------------------------------
-
-
-def test_reset_releases_slab(pooled, pool):
-    pooled.prefill([1, 2])
-    assert pool.in_use_count == 1
-    pooled.reset()
-    assert pool.in_use_count == 0
-    assert pooled.slab is None
-
-
-def test_reset_when_no_slab_held_is_noop(pooled, fake_verifier):
-    pooled.reset()
-    assert pooled.slab is None
-    assert ("reset",) in fake_verifier.calls
-
-
-# ---------------------------------------------------------------------------
-# Pass-through properties
-# ---------------------------------------------------------------------------
-
-
-def test_tokenizer_passthrough(pooled, fake_verifier):
-    assert pooled.tokenizer == fake_verifier.tokenizer
-
-
-def test_stats_passthrough(pooled, fake_verifier):
-    pooled.prefill([1])
-    assert pooled.stats is fake_verifier.stats
-
-
-def test_config_passthrough(pooled, fake_verifier):
-    assert pooled.config is fake_verifier.config
-
-
-def test_cache_logical_size_passthrough(pooled):
-    pooled.prefill([1, 2, 3])
-    assert pooled.cache_logical_size == 3
-
-
-def test_next_global_position_passthrough(pooled):
-    pooled.prefill([1, 2])
-    assert pooled.next_global_position == 2
-
-
-def test_next_token_logits_get_and_set(pooled):
-    pooled.prefill([1])
-    assert pooled.next_token_logits is not None
-    new_logits = torch.ones(8)
-    pooled.next_token_logits = new_logits
-    assert torch.equal(pooled.next_token_logits, new_logits)
-
-
-def test_inner_returns_wrapped_verifier(pooled, fake_verifier):
-    assert pooled.inner is fake_verifier
-
-
-def test_pool_property_returns_pool(pooled, pool):
-    assert pooled.pool is pool
diff --git a/tests/inference_engine/server/test_grpc_app.py b/tests/inference_engine/server/test_grpc_app.py
index 2ea196a..f698eac 100644
--- a/tests/inference_engine/server/test_grpc_app.py
+++ b/tests/inference_engine/server/test_grpc_app.py
@@ -598,6 +598,138 @@ async def abort(self, code, details):  # pragma: no cover
         runtime_pb2.GenerateDone.STOP_REASON_CANCELLED
     assert events[1].done.generated_token_count == 1
 
+
+async def test_append_tokens_session_not_found_returns_not_found():
+    """SessionNotFoundError raised by the AppendTokensCoordinator →
+    NOT_FOUND on the wire. Coordinator override; verifier-free."""
+    from inference_engine.session import AppendTokensCoordinator
+    from inference_engine.session.store import SessionNotFoundError
+
+    class _NotFoundCoordinator(AppendTokensCoordinator):
+        def append_tokens(self, session_id, token_ids):
+            del token_ids
+            raise SessionNotFoundError(session_id)
+
+    store = SessionStore(capacity=2)
+    coord = _NotFoundCoordinator(store, verifier=None)
+    server = grpc.aio.server()
+    runtime_pb2_grpc.add_RuntimeServiceServicer_to_server(
+        RuntimeServiceServicer(store, append_coordinator=coord), server,
+    )
+    port = server.add_insecure_port("127.0.0.1:0")
+    await server.start()
+    channel = grpc.aio.insecure_channel(f"127.0.0.1:{port}")
+    stub = runtime_pb2_grpc.RuntimeServiceStub(channel)
+    try:
+        with pytest.raises(grpc.aio.AioRpcError) as exc_info:
+            await stub.AppendTokens(
+                runtime_pb2.AppendTokensRequest(
+                    session_id="any-id", token_ids=[1, 2, 3],
+                ),
+            )
+        assert exc_info.value.code() == grpc.StatusCode.NOT_FOUND
+    finally:
+        await channel.close()
+        await server.stop(grace=0.1)
+
+
+async def test_append_tokens_success_returns_response():
+    """Coordinator override returns successfully → AppendTokensResponse
+    with the correct history_length on the wire."""
+    from inference_engine.session import AppendTokensCoordinator
+
+    class _SuccessCoordinator(AppendTokensCoordinator):
+        def append_tokens(self, session_id, token_ids):
+            del session_id
+            return len(list(token_ids)) + 100  # deterministic, easy to assert
+
+    store = SessionStore(capacity=2)
+    coord = _SuccessCoordinator(store, verifier=None)
+    server = grpc.aio.server()
+    runtime_pb2_grpc.add_RuntimeServiceServicer_to_server(
+        RuntimeServiceServicer(store, append_coordinator=coord), server,
+    )
+    port = server.add_insecure_port("127.0.0.1:0")
+    await server.start()
+    channel = grpc.aio.insecure_channel(f"127.0.0.1:{port}")
+    stub = runtime_pb2_grpc.RuntimeServiceStub(channel)
+    try:
+        create_resp = await stub.CreateSession(
+            runtime_pb2.CreateSessionRequest(),
+        )
+        resp = await stub.AppendTokens(
+            runtime_pb2.AppendTokensRequest(
+                session_id=create_resp.session_id, token_ids=[1, 2, 3],
+            ),
+        )
+        assert resp.history_length == 103
+    finally:
+        await channel.close()
+        await server.stop(grace=0.1)
+
+
+async def test_generate_yields_history_truncated_then_done():
+    """Generator override yields HistoryTruncatedEvent + DoneEvent →
+    truncated frame + done frame on the wire. Covers the
+    HistoryTruncatedEvent and DoneEvent yield paths."""
+    from inference_engine.session import (
+        DoneEvent,
+        GenerationCoordinator,
+        HistoryTruncatedEvent,
+        STOP_REASON_MAX_TOKENS,
+        TokenEvent,
+    )
+
+    class _StreamingGen(GenerationCoordinator):
+        def generate(self, session_id, *, max_tokens, **kw):
+            del session_id, max_tokens, kw
+            yield HistoryTruncatedEvent(dropped_token_count=42)
+            yield TokenEvent(token_id=7)
+            yield DoneEvent(
+                stop_reason=STOP_REASON_MAX_TOKENS,
+                generated_token_count=1,
+                prefill_seconds=0.0,
+                total_seconds=0.0,
+            )
+
+    store = SessionStore(capacity=2)
+    gen_coord = _StreamingGen(store, verifier=None)
+    server = grpc.aio.server()
+    runtime_pb2_grpc.add_RuntimeServiceServicer_to_server(
+        RuntimeServiceServicer(
+            store, generation_coordinator=gen_coord,
+        ),
+        server,
+    )
+    port = server.add_insecure_port("127.0.0.1:0")
+    await server.start()
+    channel = grpc.aio.insecure_channel(f"127.0.0.1:{port}")
+    stub = runtime_pb2_grpc.RuntimeServiceStub(channel)
+    try:
+        create_resp = await stub.CreateSession(
+            runtime_pb2.CreateSessionRequest(),
+        )
+        events = []
+        async for evt in stub.Generate(
+            runtime_pb2.GenerateRequest(
+                session_id=create_resp.session_id, max_tokens=1,
+            ),
+        ):
+            events.append(evt)
+        # Order: truncated → token → done
+        assert len(events) == 3
+        assert events[0].WhichOneof("payload") == "truncated"
+        assert events[0].truncated.dropped_token_count == 42
+        assert events[1].WhichOneof("payload") == "token_id"
+        assert events[1].token_id == 7
+        assert events[2].WhichOneof("payload") == "done"
+        assert events[2].done.stop_reason == \
+            runtime_pb2.GenerateDone.STOP_REASON_MAX_TOKENS
+    finally:
+        await channel.close()
+        await server.stop(grace=0.1)
+
+
 # ---------------------------------------------------------------------------
 # Generate (PR-B3) — verifier-independent paths only
 #
diff --git a/tests/integration/test_http_shim_real.py b/tests/integration/test_http_shim_real.py
index 6bc8630..28adebc 100644
--- a/tests/integration/test_http_shim_real.py
+++ b/tests/integration/test_http_shim_real.py
@@ -34,19 +34,33 @@
 
 
 @pytest.fixture
-def real_app(real_speculative_engine):
+def _real_verifier_for_http(real_speculative_engine):
+    """Extract the underlying real verifier from the
+    SpeculativeEngine fixture. Post PR-D2 the HTTP shim's
+    ``create_app`` accepts a verifier directly (not an
+    ``Engine`` wrapper); the SpeculativeEngine fixture is
+    still the canonical source of a configured + loaded
+    real verifier in the integration suite, so we reach
+    into its decoder rather than re-loading model weights.
+    """
+    return real_speculative_engine._decoder.verifier
+
+
+@pytest.fixture
+def real_app(_real_verifier_for_http):
     return create_app(
-        real_speculative_engine,
-        ServerConfig(default_max_new_tokens=4),
+        _real_verifier_for_http,
+        ServerConfig(default_max_new_tokens=4, model_id_label="kakeya-test"),
     )
 
 
 @pytest.fixture
-def real_app_with_auth(real_speculative_engine):
+def real_app_with_auth(_real_verifier_for_http):
     return create_app(
-        real_speculative_engine,
+        _real_verifier_for_http,
         ServerConfig(
             default_max_new_tokens=4,
+            model_id_label="kakeya-test",
             api_keys=frozenset({"sk-test-secret"}),
         ),
     )
@@ -73,7 +87,7 @@ async def test_chat_completions_returns_openai_envelope(real_app):
     assert body["object"] == "chat.completion"
     assert "id" in body
     assert "created" in body
-    assert body["model"] == real_app.state.engine.model_id_label
+    assert body["model"] == real_app.state.model_id_label
     assert len(body["choices"]) == 1
     choice = body["choices"][0]
     assert choice["index"] == 0
@@ -251,6 +265,6 @@ async def test_models_endpoint_lists_engine_id(real_app):
     body = r.json()
     assert body["object"] == "list"
     assert any(
-        m["id"] == real_app.state.engine.model_id_label
+        m["id"] == real_app.state.model_id_label
         for m in body["data"]
     )