PR-N2: remove DeterministicEngine + DeterministicTokenizer from scheduler tests by FluffyAIcode · Pull Request #54 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-06-02T02:25:17Z

Why

Second installment of the no-test-doubles cleanup (per the user's principle: fake = mock, all banned). PR-N1 cleared verifier protocol mirrors. PR-N2 clears the scheduler-conftest engine + tokenizer mirrors. PR-N3 will tackle the HTTP shim's separate copies and engine subtypes; PR-N4 will clean up the SDK conftest stub + final CI consolidation.

Scope

In scope (this PR)	Out of scope
`tests/inference_engine/scheduler/conftest.py` — `DeterministicEngine`, `DeterministicTokenizer` classes + their fixtures	`tests/inference_engine/server/conftest.py`'s separate copy (PR-N3)
`tests/inference_engine/scheduler/test_scheduler.py` — 20 tests using those doubles	HTTP shim tests + engine subtypes (PR-N3)
	`test_pooled_verifier.py` — exempt; PR-D2 retires `PooledVerifier` module

What was deleted

File	Δ	What
`tests/inference_engine/scheduler/conftest.py`	−197/+62 net	`DeterministicEngine` + `DeterministicTokenizer` classes (~120 lines) + their fixtures (`tokenizer`, `short/long/slow_engine`, `reject/queue_scheduler`). Slab fixtures (`slab_config`, `small_pool`, `single_pool`) kept — verifier-independent.
`tests/inference_engine/scheduler/test_scheduler.py`	−421	All 20 tests against `DeterministicEngine`. Migrated selectively to integration.

What was added

File	Lines	What
`tests/integration/test_scheduler_real.py`	+422, 12 tests	Scheduler runtime against real `SpeculativeEngine` over Qwen3-0.6B: construction validation, happy path, REJECT/QUEUE admission, cancellation, engine error propagation (parametric injector wrapping the real engine, same pattern PR-N1 used for gRPC error-mapping), 3-way concurrency, shutdown, active-count drain. Looser assertions than the originals (real engine output varies); structural invariants are what matters.
`tests/integration/conftest.py`	+82	New session-scoped `real_speculative_engine` fixture (Qwen3-0.6B + `SparseLogitsProposer` + `SpeculativeDecoder` + `SpeculativeEngine` wrapper). Mirrors `tests/system/conftest.py` but uses 0.6B not 1.7B to match the rest of the integration suite.
`tests/inference_engine/scheduler/test_scheduler_validation.py`	+102, 5 tests	Pre-engine validation paths on Linux: pool-dim mismatch, empty prompt, zero `max_new_tokens`, empty EOS. All run with `engine=None`; validation rejects before the scheduler enqueues the worker.
`scripts/review_pr_n2_on_mac.sh`	+88	Mac M4 reviewer aid.

CI workflow change

.github/workflows/ci.yaml: split --cov=inference_engine.scheduler into per-module coverage:

- --cov=inference_engine.scheduler
+ --cov includes:
+   inference_engine/scheduler/config.py           (Linux ✓)
+   inference_engine/scheduler/session.py          (Linux ✓)
+   inference_engine/scheduler/pooled_verifier.py  (Linux ✓ via exempt test)
+ --cov excludes (integration-only):
+   inference_engine/scheduler/scheduler.py

Also pre-emptively switched to coverage run -m pytest + coverage report --include (the same pattern Mac reviewer scripts use, to dodge the torch+pytest-cov race that hit PR-N1's CI runs).

Linux verification

PYTHONPATH=.:sdks/python coverage run -m pytest <Linux gate paths>:
  680 passed (was 695 on main; -15 net = removed 20 scheduler-engine tests,
              added 5 engine-independent validation tests).
  100% coverage on 1456 stmts (was 1660 on main; -204 = scheduler.scheduler now
              integration-only).

Mac M4 evidence (REQUIRED for merge)

Per ADR 0008 §9: scheduler runtime correctness now lives in the integration suite. Reviewer runs:

bash scripts/review_pr_n2_on_mac.sh
git add results/platform-tests/pr-n2-mac-*
git commit -m "Mac M4 review evidence for PR-N2"
git push

Acceptance: all integration tests pass against real Qwen3-0.6B, including PR-N1's coordinator/generator suites (if PR-N1 is also on the merge target) and the INV-3 byte-exact GA gate (PR-E1).

Stack

PR-N2 is branched off main, independent of PR-N1 (#53) at the file level. The two touch disjoint test files (session/ vs scheduler/) and disjoint conftest copies. Can merge in either order.

Once both land, PR-N3 cleans up the HTTP shim's doubles + subtypes (largest remaining cluster: tests/inference_engine/server/conftest.py + test_app_*.py + test_engine.py + test_tokenizer.py).

…uler tests ADR 0008 / no-test-doubles cleanup, second installment. PR-N2 retires the scheduler-side engine/tokenizer test doubles and migrates their dispatch / admission-control / lifecycle tests to tests/integration/ where they run against a real SpeculativeEngine over Qwen3-0.6B. Per the user's principle: 'fake = mock, all banned'. PR-N1 cleared the verifier protocol mirrors; PR-N2 clears the scheduler-conftest engine + tokenizer mirrors. PR-N3 will tackle the HTTP shim's separate copies and engine subtypes; PR-N4 will clean up the SDK conftest stub + final CI consolidation. What was deleted ---------------- tests/inference_engine/scheduler/conftest.py -197 / +62 lines net. DeterministicEngine and DeterministicTokenizer classes (~120 lines) deleted. Their fixtures (tokenizer, short_engine, long_engine, slow_engine, reject_scheduler, queue_scheduler) deleted. The slab-pool fixtures (slab_config, small_pool, single_pool) stay \u2014 they're verifier-independent and consumed by the new Linux- side validation tests. tests/inference_engine/scheduler/test_scheduler.py -421 lines. 20 tests against DeterministicEngine. Migrated selectively (see Added). What was added -------------- tests/integration/test_scheduler_real.py +422 lines, 12 tests Scheduler integration tests against the real SpeculativeEngine: - construction validation (pool size match) - happy path: submit + iter_tokens \u2192 COMPLETED + slab released - admission control: REJECT (pool exhausted), QUEUE (admit-after-completion) - cancellation (mid-stream + idempotent-after-completion) - engine error propagation (parametric error injector wraps the real engine; same composition pattern PR-N1 used for gRPC error-mapping tests) - 3-way concurrency (all complete) - shutdown (cancels active + rejects pending) - active_count zeros after drain The migrated tests use looser assertions than the originals (real engine output varies); structural invariants are what matters for scheduler correctness. tests/integration/conftest.py +82 lines - Existing pytest_collection_modifyitems hook (auto-marks everything under tests/integration/ with @pytest.mark.integration). - New session-scoped real_speculative_engine fixture (Qwen3-0.6B + SparseLogitsProposer + SpeculativeDecoder + SpeculativeEngine wrapper). Mirrors the long-standing fixture under tests/system/conftest.py but uses 0.6B not 1.7B to match the rest of the integration suite. tests/inference_engine/scheduler/test_scheduler_validation.py +102 lines, 5 tests Pre-engine validation paths on Linux: - construction validation (pool dim mismatch) - submit() argument validation: empty prompt, zero max_new_tokens, empty EOS All run with engine=None; the validation rejects before the scheduler enqueues a worker that would consult the engine. scripts/review_pr_n2_on_mac.sh +88 lines Mac M4 reviewer aid. Runs pytest -m integration tests/integration/ and produces pr-n2-mac-integration-tests- <unix>.json under results/platform-tests/. What stays in place ------------------- tests/inference_engine/scheduler/test_pooled_verifier.py Still uses _FakeVerifier / _RaisingVerifier. PR-D2 retires PooledVerifier entirely; cleanup before then is throwaway. tests/inference_engine/server/conftest.py Has its own copy of DeterministicEngine + DeterministicTokenizer used by the HTTP shim tests. PR-N3 scope. tests/inference_engine/server/test_app_*.py + test_engine.py + test_tokenizer.py + their server-specific subtypes (_RaisingEngine, _ProxyEngine, _AlwaysHoldingEngine, _KVAwareSlowEngine, _BrokenTokenizer, _EmptyTemplateTokenizer, _NoEosTokenizer) PR-N3 scope. CI workflow change ------------------ .github/workflows/ci.yaml: dropped --cov=inference_engine.scheduler in favor of explicit per-module coverage: - inference_engine.scheduler.config (Linux \u2713) - inference_engine.scheduler.session (Linux \u2713) - inference_engine.scheduler.pooled_verifier (Linux \u2713; via test_pooled_verifier.py exempt) - inference_engine.scheduler.scheduler (integration only) Also pre-emptively switched to the coverage-run pattern (was `pytest --cov=` before; the GitHub-hosted runner's torch+pytest-cov interaction surfaced as SIGSEGV during PR-N1). Linux verification ------------------ PYTHONPATH=.:sdks/python coverage run -m pytest <Linux gate paths>: 680 passed (was 695 on main, -15 net = removed 20 scheduler FakeEngine tests, added 5 verifier-independent validation tests). 100% coverage on 1456 stmts (was 1660 on main; -204 net stmts is inference_engine.scheduler.scheduler now integration-only). Mac M4 evidence (REQUIRED for merge) ------------------------------------ Per ADR 0008 \u00a79: this PR's runtime correctness lives in the integration suite. Reviewer runs: bash scripts/review_pr_n2_on_mac.sh git add results/platform-tests/pr-n2-mac-* git commit -m 'Mac M4 review evidence for PR-N2' git push Acceptance: all integration tests pass against real Qwen3-0.6B, including PR-N1's coordinator/generator suites and the INV-3 byte-exact GA gate (PR-E1). The Mac evidence is load-bearing because Linux CI cannot exercise the scheduler+real-engine path. Stack ----- PR-N2 is branched off main, independent of PR-N1 (#53). The two touch disjoint test files; can merge in either order. Once both land, PR-N3 cleans up the HTTP shim's doubles + subtypes. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…2-remove-engine-tokenizer-doubles-8e7f Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # .github/workflows/ci.yaml # tests/integration/conftest.py

…tion workflow Closes the loop on automated GA gating. After PR-N1..N4 retired all verifier-protocol test doubles from the Linux gate, the integration suite (tests/integration/) became the binding correctness gate for runtime modules \u2014 inference_engine.session.coordinator, inference_engine.session.generator, inference_engine.scheduler.scheduler, inference_engine.server.{app,engine,tokenizer,streaming}, and kakeya.{client,session}. Until this PR, that suite ran manually via scripts/review_pr_*_on_mac.sh; PR-E2 wires it into CI on every PR labelled needs-mac-m4. Three artifacts ship: .github/workflows/integration.yaml +136 lines Self-hosted runner workflow targeting [self-hosted, macOS, ARM64, kakeya-mac-m4]. Triggers on PR events when the needs-mac-m4 label is present, plus on workflow_dispatch for manual re-runs. Steps: 1. Checkout (full history). 2. Verify host shape (chip, memory, python version). 3. Verify Qwen/Qwen3-0.6B is in HF cache (HF_HUB_OFFLINE=1 at test time \u2014 no downloads in CI; cache miss fails fast with a clear pre-warm command). 4. pip install -e . + pytest dependencies (warm pip cache keeps this <30 s). 5. pytest -m integration tests/integration/ \u2014 expected runtime 60-120 s on M4 with warm cache. 90-min timeout is a safety margin, not the operating point. 6. Upload JUnit XML artifact. 7. On failure, inline the test names + first-line error messages into the Action log so triage doesn't require downloading the artifact. Concurrency: cancel-in-progress per PR, so a new push supersedes the previous run. .github/workflows/auto-label-mac.yaml +89 lines pull_request_target workflow that auto-applies (or removes) the needs-mac-m4 label based on which paths the PR touches. Trigger paths: inference_engine/ \u2014 runtime, scheduler, session, server sdks/ \u2014 Python + TypeScript SDK proto/ \u2014 wire contract tests/integration/ \u2014 the integration suite itself kv_cache_proposer/ \u2014 verifier + decoder Doc-only or CI-only PRs are NOT labelled \u2014 they skip the integration gate entirely, saving runner time. The label is automatically dropped if a subsequent push removes all verifier-dependent edits. docs/ops/mac-m4-runner-setup.md +137 lines Operator runbook for the self-hosted runner: hardware requirements (24 GB minimum, ~50 GB free disk), runner registration with the kakeya-mac-m4 label, HF cache pre-warm command (Qwen3-0.6B), Python toolchain setup, runtime expectations, cache hygiene cron, runner upgrade procedure, and failure triage steps. CI workflow split rationale --------------------------- The pre-existing .github/workflows/ci.yaml stays as the Linux gate (verifier-independent, runs on github-hosted ubuntu-latest, fires on every PR). PR-E2 adds integration.yaml as a SEPARATE workflow because: 1. Self-hosted runners are slow / few; doc-only PRs shouldn't touch them. 2. The integration gate is intentionally OPT-IN by label; ci.yaml is non-optional. 3. Failure semantics differ: Linux gate failure blocks merge unconditionally; Mac M4 gate failure surfaces a structured report but the merge decision is a human one until v0.3.0 final ships. Together the two workflows form the post-cleanup gating model: - Linux gate (ci.yaml): verifier-independent code; 100% coverage; every PR. - Mac M4 gate (integration.yaml): verifier-dependent code; binding GA gate; PRs touching runtime / SDK / proto / integration tests. Stack ----- PR-E2 is branched off main, independent of the cleanup PRs (#49, #50, #51, #52, #53, #54, #55, #56). The workflow doesn't fail at launch even before PR-E1 lands; it just won't find any tests under tests/integration/ until that PR is merged. Recommended merge order: cleanup PRs first (so the workflow has tests to run), then PR-E2. Per ADR 0008 \u00a79 ---------------- PR-E2 ships ONLY workflow YAML + a runbook \u2014 no Python source changes. No Mac M4 evidence required for this PR (the workflow itself becomes the Mac M4 evidence machinery for ALL future PRs). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

This was referenced Jun 2, 2026

PR-N3: remove HTTP-shim, engine, tokenizer test doubles #55

Merged

PR-N4: remove SDK conftest stub + finalize no-doubles cleanup #56

Merged

Merge remote-tracking branch 'origin/main' into AgentMemory/v030-pr-n…

584fc1e

…2-remove-engine-tokenizer-doubles-8e7f Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # .github/workflows/ci.yaml # tests/integration/conftest.py

FluffyAIcode marked this pull request as ready for review June 2, 2026 04:04

FluffyAIcode merged commit a621b20 into main Jun 2, 2026
6 checks passed

FluffyAIcode deleted the AgentMemory/v030-pr-n2-remove-engine-tokenizer-doubles-8e7f branch June 2, 2026 04:04

FluffyAIcode mentioned this pull request Jun 2, 2026

PR-E2 (ADR 0008 §6.5): self-hosted Mac M4 GitHub Actions integration workflow #57

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PR-N2: remove DeterministicEngine + DeterministicTokenizer from scheduler tests#54

PR-N2: remove DeterministicEngine + DeterministicTokenizer from scheduler tests#54
FluffyAIcode merged 2 commits into
mainfrom
AgentMemory/v030-pr-n2-remove-engine-tokenizer-doubles-8e7f

FluffyAIcode commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FluffyAIcode commented Jun 2, 2026

Why

Scope

What was deleted

What was added

CI workflow change

Linux verification

Mac M4 evidence (REQUIRED for merge)

Stack

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants