PR-N2: remove DeterministicEngine + DeterministicTokenizer from scheduler tests#54
Merged
FluffyAIcode merged 2 commits intoJun 2, 2026
Conversation
…uler tests
ADR 0008 / no-test-doubles cleanup, second installment. PR-N2
retires the scheduler-side engine/tokenizer test doubles and
migrates their dispatch / admission-control / lifecycle tests to
tests/integration/ where they run against a real SpeculativeEngine
over Qwen3-0.6B.
Per the user's principle: 'fake = mock, all banned'. PR-N1 cleared
the verifier protocol mirrors; PR-N2 clears the
scheduler-conftest engine + tokenizer mirrors. PR-N3 will tackle
the HTTP shim's separate copies and engine subtypes; PR-N4 will
clean up the SDK conftest stub + final CI consolidation.
What was deleted
----------------
tests/inference_engine/scheduler/conftest.py
-197 / +62 lines net. DeterministicEngine and
DeterministicTokenizer classes (~120 lines) deleted. Their
fixtures (tokenizer, short_engine, long_engine, slow_engine,
reject_scheduler, queue_scheduler) deleted. The slab-pool
fixtures (slab_config, small_pool, single_pool) stay \u2014
they're verifier-independent and consumed by the new Linux-
side validation tests.
tests/inference_engine/scheduler/test_scheduler.py
-421 lines. 20 tests against DeterministicEngine. Migrated
selectively (see Added).
What was added
--------------
tests/integration/test_scheduler_real.py +422 lines, 12 tests
Scheduler integration tests against the real SpeculativeEngine:
- construction validation (pool size match)
- happy path: submit + iter_tokens \u2192 COMPLETED + slab
released
- admission control: REJECT (pool exhausted), QUEUE
(admit-after-completion)
- cancellation (mid-stream + idempotent-after-completion)
- engine error propagation (parametric error injector wraps
the real engine; same composition pattern PR-N1 used for
gRPC error-mapping tests)
- 3-way concurrency (all complete)
- shutdown (cancels active + rejects pending)
- active_count zeros after drain
The migrated tests use looser assertions than the originals
(real engine output varies); structural invariants are what
matters for scheduler correctness.
tests/integration/conftest.py +82 lines
- Existing pytest_collection_modifyitems hook (auto-marks
everything under tests/integration/ with @pytest.mark.integration).
- New session-scoped real_speculative_engine fixture
(Qwen3-0.6B + SparseLogitsProposer + SpeculativeDecoder +
SpeculativeEngine wrapper). Mirrors the long-standing fixture
under tests/system/conftest.py but uses 0.6B not 1.7B to
match the rest of the integration suite.
tests/inference_engine/scheduler/test_scheduler_validation.py
+102 lines, 5 tests
Pre-engine validation paths on Linux:
- construction validation (pool dim mismatch)
- submit() argument validation: empty prompt, zero
max_new_tokens, empty EOS
All run with engine=None; the validation rejects before the
scheduler enqueues a worker that would consult the engine.
scripts/review_pr_n2_on_mac.sh +88 lines
Mac M4 reviewer aid. Runs pytest -m integration
tests/integration/ and produces pr-n2-mac-integration-tests-
<unix>.json under results/platform-tests/.
What stays in place
-------------------
tests/inference_engine/scheduler/test_pooled_verifier.py
Still uses _FakeVerifier / _RaisingVerifier. PR-D2 retires
PooledVerifier entirely; cleanup before then is throwaway.
tests/inference_engine/server/conftest.py
Has its own copy of DeterministicEngine + DeterministicTokenizer
used by the HTTP shim tests. PR-N3 scope.
tests/inference_engine/server/test_app_*.py + test_engine.py +
test_tokenizer.py + their server-specific subtypes
(_RaisingEngine, _ProxyEngine, _AlwaysHoldingEngine,
_KVAwareSlowEngine, _BrokenTokenizer, _EmptyTemplateTokenizer,
_NoEosTokenizer)
PR-N3 scope.
CI workflow change
------------------
.github/workflows/ci.yaml: dropped --cov=inference_engine.scheduler
in favor of explicit per-module coverage:
- inference_engine.scheduler.config (Linux \u2713)
- inference_engine.scheduler.session (Linux \u2713)
- inference_engine.scheduler.pooled_verifier (Linux \u2713; via test_pooled_verifier.py exempt)
- inference_engine.scheduler.scheduler (integration only)
Also pre-emptively switched to the coverage-run pattern (was
`pytest --cov=` before; the GitHub-hosted runner's
torch+pytest-cov interaction surfaced as SIGSEGV during PR-N1).
Linux verification
------------------
PYTHONPATH=.:sdks/python coverage run -m pytest <Linux gate paths>:
680 passed (was 695 on main, -15 net = removed 20 scheduler
FakeEngine tests, added 5 verifier-independent
validation tests).
100% coverage on 1456 stmts (was 1660 on main; -204 net stmts
is inference_engine.scheduler.scheduler now
integration-only).
Mac M4 evidence (REQUIRED for merge)
------------------------------------
Per ADR 0008 \u00a79: this PR's runtime correctness lives in the
integration suite. Reviewer runs:
bash scripts/review_pr_n2_on_mac.sh
git add results/platform-tests/pr-n2-mac-*
git commit -m 'Mac M4 review evidence for PR-N2'
git push
Acceptance: all integration tests pass against real Qwen3-0.6B,
including PR-N1's coordinator/generator suites and the INV-3
byte-exact GA gate (PR-E1). The Mac evidence is load-bearing
because Linux CI cannot exercise the scheduler+real-engine path.
Stack
-----
PR-N2 is branched off main, independent of PR-N1 (#53). The two
touch disjoint test files; can merge in either order. Once both
land, PR-N3 cleans up the HTTP shim's doubles + subtypes.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
This was referenced Jun 2, 2026
…2-remove-engine-tokenizer-doubles-8e7f Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # .github/workflows/ci.yaml # tests/integration/conftest.py
4 tasks
FluffyAIcode
added a commit
that referenced
this pull request
Jun 2, 2026
…tion workflow
Closes the loop on automated GA gating. After PR-N1..N4 retired all
verifier-protocol test doubles from the Linux gate, the integration
suite (tests/integration/) became the binding correctness gate for
runtime modules \u2014 inference_engine.session.coordinator,
inference_engine.session.generator,
inference_engine.scheduler.scheduler,
inference_engine.server.{app,engine,tokenizer,streaming}, and
kakeya.{client,session}. Until this PR, that suite ran manually
via scripts/review_pr_*_on_mac.sh; PR-E2 wires it into CI on every
PR labelled needs-mac-m4.
Three artifacts ship:
.github/workflows/integration.yaml +136 lines
Self-hosted runner workflow targeting [self-hosted, macOS,
ARM64, kakeya-mac-m4]. Triggers on PR events when the
needs-mac-m4 label is present, plus on workflow_dispatch
for manual re-runs. Steps:
1. Checkout (full history).
2. Verify host shape (chip, memory, python version).
3. Verify Qwen/Qwen3-0.6B is in HF cache (HF_HUB_OFFLINE=1
at test time \u2014 no downloads in CI; cache miss fails
fast with a clear pre-warm command).
4. pip install -e . + pytest dependencies (warm pip cache
keeps this <30 s).
5. pytest -m integration tests/integration/ \u2014 expected
runtime 60-120 s on M4 with warm cache. 90-min timeout
is a safety margin, not the operating point.
6. Upload JUnit XML artifact.
7. On failure, inline the test names + first-line error
messages into the Action log so triage doesn't require
downloading the artifact.
Concurrency: cancel-in-progress per PR, so a new push
supersedes the previous run.
.github/workflows/auto-label-mac.yaml +89 lines
pull_request_target workflow that auto-applies (or removes)
the needs-mac-m4 label based on which paths the PR touches.
Trigger paths:
inference_engine/ \u2014 runtime, scheduler, session, server
sdks/ \u2014 Python + TypeScript SDK
proto/ \u2014 wire contract
tests/integration/ \u2014 the integration suite itself
kv_cache_proposer/ \u2014 verifier + decoder
Doc-only or CI-only PRs are NOT labelled \u2014 they skip the
integration gate entirely, saving runner time. The label is
automatically dropped if a subsequent push removes all
verifier-dependent edits.
docs/ops/mac-m4-runner-setup.md +137 lines
Operator runbook for the self-hosted runner: hardware
requirements (24 GB minimum, ~50 GB free disk), runner
registration with the kakeya-mac-m4 label, HF cache
pre-warm command (Qwen3-0.6B), Python toolchain setup,
runtime expectations, cache hygiene cron, runner upgrade
procedure, and failure triage steps.
CI workflow split rationale
---------------------------
The pre-existing .github/workflows/ci.yaml stays as the Linux gate
(verifier-independent, runs on github-hosted ubuntu-latest, fires
on every PR). PR-E2 adds integration.yaml as a SEPARATE workflow
because:
1. Self-hosted runners are slow / few; doc-only PRs shouldn't
touch them.
2. The integration gate is intentionally OPT-IN by label; ci.yaml
is non-optional.
3. Failure semantics differ: Linux gate failure blocks merge
unconditionally; Mac M4 gate failure surfaces a structured
report but the merge decision is a human one until v0.3.0
final ships.
Together the two workflows form the post-cleanup gating model:
- Linux gate (ci.yaml):
verifier-independent code; 100% coverage; every PR.
- Mac M4 gate (integration.yaml):
verifier-dependent code; binding GA gate; PRs touching
runtime / SDK / proto / integration tests.
Stack
-----
PR-E2 is branched off main, independent of the cleanup PRs (#49,
#50, #51, #52, #53, #54, #55, #56). The workflow doesn't fail at
launch even before PR-E1 lands; it just won't find any tests
under tests/integration/ until that PR is merged. Recommended
merge order: cleanup PRs first (so the workflow has tests to
run), then PR-E2.
Per ADR 0008 \u00a79
----------------
PR-E2 ships ONLY workflow YAML + a runbook \u2014 no Python source
changes. No Mac M4 evidence required for this PR (the workflow
itself becomes the Mac M4 evidence machinery for ALL future PRs).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
Second installment of the no-test-doubles cleanup (per the user's principle: fake = mock, all banned). PR-N1 cleared verifier protocol mirrors. PR-N2 clears the scheduler-conftest engine + tokenizer mirrors. PR-N3 will tackle the HTTP shim's separate copies and engine subtypes; PR-N4 will clean up the SDK conftest stub + final CI consolidation.
Scope
tests/inference_engine/scheduler/conftest.py—DeterministicEngine,DeterministicTokenizerclasses + their fixturestests/inference_engine/server/conftest.py's separate copy (PR-N3)tests/inference_engine/scheduler/test_scheduler.py— 20 tests using those doublestest_pooled_verifier.py— exempt; PR-D2 retiresPooledVerifiermoduleWhat was deleted
tests/inference_engine/scheduler/conftest.pyDeterministicEngine+DeterministicTokenizerclasses (~120 lines) + their fixtures (tokenizer,short/long/slow_engine,reject/queue_scheduler). Slab fixtures (slab_config,small_pool,single_pool) kept — verifier-independent.tests/inference_engine/scheduler/test_scheduler.pyDeterministicEngine. Migrated selectively to integration.What was added
tests/integration/test_scheduler_real.pySpeculativeEngineover Qwen3-0.6B: construction validation, happy path, REJECT/QUEUE admission, cancellation, engine error propagation (parametric injector wrapping the real engine, same pattern PR-N1 used for gRPC error-mapping), 3-way concurrency, shutdown, active-count drain. Looser assertions than the originals (real engine output varies); structural invariants are what matters.tests/integration/conftest.pyreal_speculative_enginefixture (Qwen3-0.6B +SparseLogitsProposer+SpeculativeDecoder+SpeculativeEnginewrapper). Mirrorstests/system/conftest.pybut uses 0.6B not 1.7B to match the rest of the integration suite.tests/inference_engine/scheduler/test_scheduler_validation.pymax_new_tokens, empty EOS. All run withengine=None; validation rejects before the scheduler enqueues the worker.scripts/review_pr_n2_on_mac.shCI workflow change
.github/workflows/ci.yaml: split--cov=inference_engine.schedulerinto per-module coverage:Also pre-emptively switched to
coverage run -m pytest+coverage report --include(the same pattern Mac reviewer scripts use, to dodge the torch+pytest-cov race that hit PR-N1's CI runs).Linux verification
Mac M4 evidence (REQUIRED for merge)
Per ADR 0008 §9: scheduler runtime correctness now lives in the integration suite. Reviewer runs:
Acceptance: all integration tests pass against real Qwen3-0.6B, including PR-N1's coordinator/generator suites (if PR-N1 is also on the merge target) and the INV-3 byte-exact GA gate (PR-E1).
Stack
PR-N2 is branched off
main, independent of PR-N1 (#53) at the file level. The two touch disjoint test files (session/vsscheduler/) and disjoint conftest copies. Can merge in either order.Once both land, PR-N3 cleans up the HTTP shim's doubles + subtypes (largest remaining cluster:
tests/inference_engine/server/conftest.py+test_app_*.py+test_engine.py+test_tokenizer.py).