PR-N4: remove SDK conftest stub + finalize no-doubles cleanup #56

Merged: FluffyAIcode merged 2 commits into main from AgentMemory/v030-pr-n4-sdk-conftest-stub-cleanup-8e7f on Jun 2, 2026

@FluffyAIcode
Owner

Why

Final installment of the no-test-doubles cleanup. Closes the sequence PR-N1 → N2 → N3 → N4. After PR-N4 lands, NO test doubles implementing the verifier / engine / tokenizer protocols remain in the Linux test tree.

What was deleted

| File | Δ | What |
| --- | --- | --- |
| tests/sdk/python/conftest.py | −203 | _start_runtime / _stop_runtime helpers + runtime_address fixture (FakeVerifier-backed) |
| tests/sdk/python/test_client.py | −157, 13 tests | Client lifecycle against fake runtime |
| tests/sdk/python/test_session.py | −502, 33 tests | Session.append + .generate + .info + .close end-to-end against fake runtime |

Total: ~860 lines, 46 tests deleted.

What was added

| File | Lines | Tests | What |
| --- | --- | --- | --- |
| tests/integration/test_sdk_real.py | +137 | 11 | Client + Session integration against real Qwen3-0.6B-backed gRPC runtime: lifecycle, error mapping (NOT_FOUND / INVALID_ARGUMENT / SessionClosedError), session info round-trip, close idempotency |
| tests/integration/conftest.py | +180 | | pytest_collection_modifyitems hook + real_speculative_engine (Qwen3-0.6B SpeculativeEngine, session-scoped) + NEW real_grpc_runtime_address (in-process gRPC server backed by real verifier on background thread, yields host:port) |
| tests/integration/__init__.py | +0 | | placeholder |
| scripts/review_pr_n4_on_mac.sh | +93 | | Mac M4 reviewer aid running the full accumulated integration suite |
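
The real_grpc_runtime_address fixture described in the table follows a common pattern: bind a server to an ephemeral port, serve on a background thread, yield host:port, and shut down afterwards. The sketch below shows only that pattern and is not the PR's fixture. To stay runnable without a model or gRPC installed, it swaps in a plain TCP echo server from the standard library; `running_server`, `_EchoHandler`, and `echo_once` are hypothetical names. In a gRPC version, `grpc.server(...)` together with `add_insecure_port("127.0.0.1:0")`, which returns the bound port, would take the place of the TCP server.

```python
import contextlib
import socket
import socketserver
import threading


class _EchoHandler(socketserver.StreamRequestHandler):
    """Stand-in service: echoes one line back. The real fixture serves gRPC."""

    def handle(self):
        self.wfile.write(self.rfile.readline())


@contextlib.contextmanager
def running_server():
    """Serve on an ephemeral port in a background thread and yield host:port."""
    # Port 0 asks the OS for a free port, so concurrent runs do not collide.
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), _EchoHandler)
    server.daemon_threads = True
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    host, port = server.server_address
    try:
        yield f"{host}:{port}"
    finally:
        # Teardown: stop the serve loop, release the socket, join the thread.
        server.shutdown()
        server.server_close()
        thread.join(timeout=5)


def echo_once(address: str, payload: bytes) -> bytes:
    """Tiny client showing that the yielded address is connectable."""
    host, port = address.rsplit(":", 1)
    with socket.create_connection((host, int(port)), timeout=5) as conn:
        conn.sendall(payload + b"\n")
        return conn.makefile("rb").readline().rstrip(b"\n")
```

Under pytest, the same body would typically sit inside a generator function decorated with `@pytest.fixture(scope="session")`, so that the expensive server (and the model behind it) starts once per test session.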

What stays on Linux

tests/sdk/python/test_errors.py (9 tests) — pure _wrap_grpc_error mapping with synthesized grpc.RpcError objects. Verifier-independent; transport-only error-class translation.
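
To illustrate why these tests need no verifier, here is a minimal sketch of status-code-to-exception translation driven by a synthesized error object. It is not the SDK's `_wrap_grpc_error`. `KakeyaError`, `FakeStatus`, `FakeRpcError`, and the lookup table are assumptions made for illustration; only the names SessionNotFoundError and InvalidArgumentError and the NOT_FOUND / INVALID_ARGUMENT pairing come from this PR. The sketch relies on the fact that errors raised by gRPC client calls expose `code()` and `details()`, and that `grpc.StatusCode` members have a `.name`.

```python
import enum


class KakeyaError(Exception):
    """Hypothetical base class for SDK errors (assumption)."""


class SessionNotFoundError(KakeyaError):
    pass


class InvalidArgumentError(KakeyaError):
    pass


# Keyed by status-code *name* so the sketch works with grpc.StatusCode
# members as well as the stand-in enum below.
_CODE_TO_ERROR = {
    "NOT_FOUND": SessionNotFoundError,
    "INVALID_ARGUMENT": InvalidArgumentError,
}


def wrap_grpc_error(err):
    """Translate an RpcError-like object into an SDK exception instance."""
    name = err.code().name
    error_cls = _CODE_TO_ERROR.get(name, KakeyaError)  # unknown codes fall back
    wrapped = error_cls(f"{name}: {err.details()}")
    wrapped.__cause__ = err  # keep the transport error for debugging
    return wrapped


class FakeStatus(enum.Enum):
    """Stand-in for grpc.StatusCode, using the same names and numbers."""
    INVALID_ARGUMENT = 3
    NOT_FOUND = 5
    UNAVAILABLE = 14


class FakeRpcError(Exception):
    """Synthesized error exposing the two accessors the mapper reads."""

    def __init__(self, code, details):
        super().__init__(details)
        self._code, self._details = code, details

    def code(self):
        return self._code

    def details(self):
        return self._details
```

Because the mapper only reads `code()` and `details()`, a test can build the error by hand and never open a channel, which is what keeps this file on the Linux side of the boundary. SessionClosedError is left out of the table because the source names it only for the append-after-close case and does not show how it is raised.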

CI workflow change

.github/workflows/ci.yaml — dropped kakeya.client, kakeya.session from --include= filter. Linux gate now covers ONLY:

inference_engine/server/{auth, config, errors, grpc_app, metrics, schemas, proto_gen}
inference_engine/memory/*
inference_engine/scheduler/{config, session, pooled_verifier}
inference_engine/pipeline/*
inference_engine/session/store
sdks/python/kakeya/{__init__, errors}
training/repr_align/*

This is the verifier-independent boundary, frozen post PR-N4.
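
The brace shorthand above can be expanded mechanically into the comma-separated value that coverage's `--include=` option accepts. The sketch below is one way to do that. The trailing `*` appended to each module name is an assumption, and the patterns in the actual ci.yaml may be spelled differently. `expand` and `include_value` are hypothetical helper names.

```python
import re

# The Linux gate as listed in this PR description.
LINUX_GATE = [
    "inference_engine/server/{auth, config, errors, grpc_app, metrics, schemas, proto_gen}",
    "inference_engine/memory/*",
    "inference_engine/scheduler/{config, session, pooled_verifier}",
    "inference_engine/pipeline/*",
    "inference_engine/session/store",
    "sdks/python/kakeya/{__init__, errors}",
    "training/repr_align/*",
]

_BRACES = re.compile(r"^(?P<prefix>.*)\{(?P<names>[^}]*)\}$")


def expand(entry: str) -> list[str]:
    """Expand 'dir/{a, b}' into ['dir/a*', 'dir/b*']; leave globs untouched."""
    match = _BRACES.match(entry)
    if match is None:
        # Assumption: a bare module path gets a trailing '*' so it matches
        # either a module file or a package directory.
        return [entry if entry.endswith("*") else entry + "*"]
    names = [name.strip() for name in match["names"].split(",")]
    return [f"{match['prefix']}{name}*" for name in names if name]


def include_value(entries) -> str:
    """Join all expanded patterns into one comma-separated option value."""
    return ",".join(pattern for entry in entries for pattern in expand(entry))
```

Calling `include_value(LINUX_GATE)` yields 16 patterns, none of which mention kakeya/client or kakeya/session, matching the change described above.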

Final state of the no-doubles cleanup

| PR | Scope | Tests deleted | Integration tests added |
| --- | --- | --- | --- |
| #53 PR-N1 | FakeVerifier hierarchy | ~70 | test_coordinator_real.py, test_generator_real.py |
| #54 PR-N2 | DeterministicEngine/Tokenizer (scheduler) | 20 | test_scheduler_real.py |
| #55 PR-N3 | HTTP shim cluster | 88 | test_http_shim_real.py, test_engine_real.py, test_tokenizer_real.py, test_streaming_real.py |
| #56 PR-N4 (this) | SDK conftest stub | 46 | test_sdk_real.py |

After all four merge, tests/integration/ contains:

test_inv3_session_determinism_gate.py          (PR-E1)
test_coordinator_real.py                       (PR-N1)
test_generator_real.py                         (PR-N1)
test_scheduler_real.py                         (PR-N2)
test_http_shim_real.py                         (PR-N3)
test_engine_real.py                            (PR-N3)
test_tokenizer_real.py                         (PR-N3)
test_streaming_real.py                         (PR-N3)
test_sdk_real.py                               (PR-N4)

Linux verification

PYTHONPATH=.:sdks/python coverage run -m pytest <Linux gate paths>:
  649 passed (was 695 on main; -46 net = removed 46 SDK runtime tests).
  100% coverage on 999 stmts (was 1660 on main; -661 net = all
  verifier-dependent modules now integration-only).

Mac M4 evidence (REQUIRED for merge)

bash scripts/review_pr_n4_on_mac.sh runs the full accumulated integration suite (PR-E1 INV-3 + PR-N1 coordinator/generator + PR-N2 scheduler + PR-N3 http_shim/engine/tokenizer/streaming + PR-N4 SDK) against real Qwen3-0.6B and produces pr-n4-mac-integration-tests-<unix>.json evidence.

Stack & merge order note

PR-N4 is branched off main and is otherwise independent at the file level from PR-N1 (#53), PR-N2 (#54), and PR-N3 (#55). The one exception is tests/integration/conftest.py: all four PRs add to it (each contributes a fixture), so the conftest needs a small reconciliation after merging. Recommended merge order:

  1. PR-N1 (adds the marker hook)
  2. PR-N2 (adds real_speculative_engine)
  3. PR-N3 (uses real_speculative_engine)
  4. PR-N4 (this, adds real_grpc_runtime_address)

PR-N4's tests/integration/conftest.py is a superset of N1/N2's; if N4 lands first, the others need to skip their conftest creation and reuse the merged version.


cursoragent and others added 2 commits June 2, 2026 02:54
Final installment of the no-test-doubles cleanup. Closes the
sequence PR-N1 → N2 → N3 → N4. After PR-N4 lands, NO
test doubles implementing the verifier / engine / tokenizer
protocols remain in the Linux test tree.

What was deleted
----------------

  tests/sdk/python/conftest.py
    -203 lines. Contained _start_runtime / _stop_runtime helpers
    that spun up an in-process gRPC server with a FakeVerifier
    (later replaced by _MinimalVerifierStub in PR-N1's preview
    cleanup) on a background thread. The runtime_address +
    runtime_address_no_inspector fixtures are gone with it.

  tests/sdk/python/test_client.py
    -157 lines, 13 tests. Exercised Client + Session lifecycle
    against the FakeVerifier-backed runtime.

  tests/sdk/python/test_session.py
    -502 lines, 33 tests. Exercised Session.append + .generate +
    .info + .close end-to-end against the FakeVerifier-backed
    runtime.

What was added
--------------

  tests/integration/test_sdk_real.py             +137 lines, 11 tests
    SDK Client + Session integration tests against a real
    Qwen3-0.6B-backed gRPC runtime:
      - Client: create_session round-trip, eos_token_ids round-
        trip, idempotent close, address property
      - Session: append + generate yield tokens + metadata,
        info reflects history, close returns final length,
        close is locally idempotent
      - End-to-end error mapping: SessionNotFoundError on
        unknown id, InvalidArgumentError on max_tokens=0,
        SessionClosedError on append-after-close

  tests/integration/conftest.py                  +180 lines
    - pytest_collection_modifyitems hook (auto-marks everything
      under tests/integration/ with @pytest.mark.integration)
    - real_speculative_engine fixture (session-scoped, Qwen3-0.6B)
    - real_grpc_runtime_address fixture (session-scoped,
      in-process gRPC server backed by real Qwen3-0.6B verifier
      on a background thread; yields the host:port the SDK can
      connect to)
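
A collection hook of the kind described in the first bullet can be quite small. The following is a sketch under stated assumptions, not the conftest from this PR: it assumes pytest >= 7 (where each item has a `path` attribute) and uses a hypothetical helper `_is_integration`. Passing the marker name as a string to `add_marker` is supported by pytest, which keeps the sketch free of imports beyond the standard library.

```python
from pathlib import Path


def _is_integration(path) -> bool:
    """True when the file lives somewhere under a tests/integration/ directory."""
    # Prefixing "/" lets the same check cover relative and absolute paths.
    return "/tests/integration/" in "/" + Path(path).as_posix()


def pytest_collection_modifyitems(config, items):
    """Mark every collected test under tests/integration/ as 'integration'."""
    for item in items:
        if _is_integration(item.path):
            # Equivalent to decorating the test with @pytest.mark.integration.
            item.add_marker("integration")
```

With the marker applied at collection time, `pytest -m integration` selects the suite and `pytest -m "not integration"` excludes it, without decorating each test by hand. Registering the marker name in the pytest configuration avoids unknown-marker warnings.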

  tests/integration/__init__.py                  +0 lines (placeholder)

  scripts/review_pr_n4_on_mac.sh                 +93 lines
    Mac M4 reviewer aid running the full accumulated integration
    suite (PR-E1 INV-3 + PR-N1 coordinator/generator + PR-N2
    scheduler + PR-N3 http_shim/engine/tokenizer/streaming +
    PR-N4 SDK).

What stays on Linux
-------------------

  tests/sdk/python/test_errors.py               (unchanged, 9 tests)
    Pure _wrap_grpc_error mapping with synthesized
    grpc.RpcError objects. Verifier-independent;
    transport-only error-class translation. Stays on Linux.

CI workflow change
------------------

.github/workflows/ci.yaml: dropped kakeya.client and kakeya.session
from the --include= filter. Linux gate now covers ONLY:

  inference_engine/server/{auth, config, errors, grpc_app,
                            metrics, schemas, proto_gen}
  inference_engine/memory/*
  inference_engine/scheduler/{config, session, pooled_verifier}
  inference_engine/pipeline/*
  inference_engine/session/store
  sdks/python/kakeya/{__init__, errors}
  training/repr_align/*

That's the verifier-independent boundary, frozen post PR-N4.

Final state of the no-doubles cleanup
-------------------------------------

  PR-N1 (#53): retired FakeVerifier hierarchy
                (tests/inference_engine/session/test_coordinator.py,
                 test_generator.py, test_grpc_app.py FakeVerifier-using
                 sections).
  PR-N2 (#54): retired DeterministicEngine + DeterministicTokenizer
                (tests/inference_engine/scheduler/conftest.py +
                 test_scheduler.py).
  PR-N3 (#55): retired the HTTP shim cluster (server/conftest.py
                + 6 test files + their subtypes).
  PR-N4 (this): retired the SDK conftest stub.

The integration suite at tests/integration/ now contains:
  test_inv3_session_determinism_gate.py          (PR-E1)
  test_coordinator_real.py                       (PR-N1)
  test_generator_real.py                         (PR-N1)
  test_scheduler_real.py                         (PR-N2)
  test_http_shim_real.py                         (PR-N3)
  test_engine_real.py                            (PR-N3)
  test_tokenizer_real.py                         (PR-N3)
  test_streaming_real.py                         (PR-N3)
  test_sdk_real.py                               (PR-N4)

Linux verification
------------------
PYTHONPATH=.:sdks/python coverage run -m pytest <Linux gate paths>:
  649 passed (was 695 on main; -46 net = removed 46 SDK runtime
              tests, kept 9 SDK error-mapping tests).
  100% coverage on 999 stmts (was 1660 on main; -661 net stmts is
              all verifier-dependent modules now integration-only).

Mac M4 evidence (REQUIRED for merge)
------------------------------------
Per ADR 0008 §9: this PR's runtime correctness lives in the
integration suite. Reviewer runs:

    bash scripts/review_pr_n4_on_mac.sh
    git add results/platform-tests/pr-n4-mac-*
    git commit -m 'Mac M4 review evidence for PR-N4'
    git push

Stack
-----
PR-N4 is branched off main, independent of PR-N1 (#53) /
PR-N2 (#54) / PR-N3 (#55) at the file level, with one exception:
tests/integration/conftest.py. The fixture defs that N1/N2/N3/N4
contribute to it are disjoint (each adds one fixture), but the
file itself IS shared, so after merging the four contributors'
fixture defs need to be reconciled. The recommended merge order:

  1. PR-N1 (verifier doubles) — adds conftest with marker hook
  2. PR-N2 (engine/tokenizer doubles) — adds real_speculative_engine
  3. PR-N3 (HTTP shim doubles) — uses real_speculative_engine
  4. PR-N4 (this, SDK doubles) — adds real_grpc_runtime_address

If the PRs land in a different order, the integration conftest
needs a small merge to combine fixtures.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…4-sdk-conftest-stub-cleanup-8e7f

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	.github/workflows/ci.yaml
#	tests/integration/conftest.py
#	tests/sdk/python/conftest.py
@FluffyAIcode FluffyAIcode marked this pull request as ready for review June 2, 2026 04:06
@FluffyAIcode FluffyAIcode merged commit e8e8415 into main Jun 2, 2026
5 of 6 checks passed
@FluffyAIcode FluffyAIcode deleted the AgentMemory/v030-pr-n4-sdk-conftest-stub-cleanup-8e7f branch June 2, 2026 04:06
FluffyAIcode added a commit that referenced this pull request Jun 2, 2026
…tion workflow

Closes the loop on automated GA gating. After PR-N1..N4 retired all
verifier-protocol test doubles from the Linux gate, the integration
suite (tests/integration/) became the binding correctness gate for
runtime modules: inference_engine.session.coordinator,
inference_engine.session.generator,
inference_engine.scheduler.scheduler,
inference_engine.server.{app,engine,tokenizer,streaming}, and
kakeya.{client,session}. Until this PR, that suite ran manually
via scripts/review_pr_*_on_mac.sh; PR-E2 wires it into CI on every
PR labelled needs-mac-m4.

Three artifacts ship:

  .github/workflows/integration.yaml           +136 lines
    Self-hosted runner workflow targeting [self-hosted, macOS,
    ARM64, kakeya-mac-m4]. Triggers on PR events when the
    needs-mac-m4 label is present, plus on workflow_dispatch
    for manual re-runs. Steps:
      1. Checkout (full history).
      2. Verify host shape (chip, memory, python version).
      3. Verify Qwen/Qwen3-0.6B is in HF cache (HF_HUB_OFFLINE=1
         at test time, so no downloads in CI; cache miss fails
         fast with a clear pre-warm command).
      4. pip install -e . + pytest dependencies (warm pip cache
         keeps this <30 s).
      5. pytest -m integration tests/integration/ (expected
         runtime 60-120 s on M4 with warm cache; the 90-min
         timeout is a safety margin, not the operating point).
      6. Upload JUnit XML artifact.
      7. On failure, inline the test names + first-line error
         messages into the Action log so triage doesn't require
         downloading the artifact.
    Concurrency: cancel-in-progress per PR, so a new push
    supersedes the previous run.

  .github/workflows/auto-label-mac.yaml        +89 lines
    pull_request_target workflow that auto-applies (or removes)
    the needs-mac-m4 label based on which paths the PR touches.
    Trigger paths:
      inference_engine/   - runtime, scheduler, session, server
      sdks/               - Python + TypeScript SDK
      proto/              - wire contract
      tests/integration/  - the integration suite itself
      kv_cache_proposer/  - verifier + decoder
    Doc-only or CI-only PRs are NOT labelled; they skip the
    integration gate entirely, saving runner time. The label is
    automatically dropped if a subsequent push removes all
    verifier-dependent edits.

  docs/ops/mac-m4-runner-setup.md              +137 lines
    Operator runbook for the self-hosted runner: hardware
    requirements (24 GB minimum, ~50 GB free disk), runner
    registration with the kakeya-mac-m4 label, HF cache
    pre-warm command (Qwen3-0.6B), Python toolchain setup,
    runtime expectations, cache hygiene cron, runner upgrade
    procedure, and failure triage steps.
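
Step 7 of the integration workflow (inlining failed test names and first-line error messages into the log) can be approximated by a short script over the JUnit XML artifact. This is a sketch, not the workflow's actual step: `summarize_failures` is a hypothetical name, and the code assumes the standard pytest JUnit layout of testcase elements with optional failure or error children.

```python
import xml.etree.ElementTree as ET


def summarize_failures(junit_xml: str) -> list[str]:
    """Return one 'classname::name: first error line' entry per failed test."""
    root = ET.fromstring(junit_xml)
    lines = []
    for case in root.iter("testcase"):
        for kind in ("failure", "error"):
            node = case.find(kind)
            if node is None:
                continue
            # Prefer the message attribute; fall back to the element body.
            message = (node.get("message") or node.text or "").strip()
            first_line = message.splitlines()[0] if message else ""
            lines.append(f"{case.get('classname')}::{case.get('name')}: {first_line}")
            break  # report each test case once
    return lines
```

Printing the returned lines in a step that runs only on failure puts the triage summary directly in the Action log, so the artifact is needed only when the full traceback matters.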

CI workflow split rationale
---------------------------
The pre-existing .github/workflows/ci.yaml stays as the Linux gate
(verifier-independent, runs on github-hosted ubuntu-latest, fires
on every PR). PR-E2 adds integration.yaml as a SEPARATE workflow
because:
  1. Self-hosted runners are slow / few; doc-only PRs shouldn't
     touch them.
  2. The integration gate is intentionally OPT-IN by label; ci.yaml
     is non-optional.
  3. Failure semantics differ: Linux gate failure blocks merge
     unconditionally; Mac M4 gate failure surfaces a structured
     report but the merge decision is a human one until v0.3.0
     final ships.

Together the two workflows form the post-cleanup gating model:
  - Linux gate (ci.yaml):
      verifier-independent code; 100% coverage; every PR.
  - Mac M4 gate (integration.yaml):
      verifier-dependent code; binding GA gate; PRs touching
      runtime / SDK / proto / integration tests.
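
The labelling rule that routes a PR to the Mac M4 gate reduces to a prefix check over the changed file list. The sketch below uses the trigger paths listed for auto-label-mac.yaml. The function name `needs_mac_m4` and the idea of passing the changed files as a plain list are assumptions; the real workflow may compute this differently.

```python
# Trigger paths as listed for the auto-label workflow.
TRIGGER_PREFIXES = (
    "inference_engine/",
    "sdks/",
    "proto/",
    "tests/integration/",
    "kv_cache_proposer/",
)


def needs_mac_m4(changed_files) -> bool:
    """True when any changed path falls under a verifier-dependent prefix."""
    # str.startswith accepts a tuple, so one call checks every prefix.
    return any(path.startswith(TRIGGER_PREFIXES) for path in changed_files)
```

Re-evaluating this over the PR's full diff on every push gives both behaviours described for the label: it is applied when a verifier-dependent path appears and dropped once no such path remains.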

Stack
-----
PR-E2 is branched off main, independent of the cleanup PRs (#49,
#50, #51, #52, #53, #54, #55, #56). The workflow doesn't fail at
launch even before PR-E1 lands; it just won't find any tests
under tests/integration/ until that PR is merged. Recommended
merge order: cleanup PRs first (so the workflow has tests to
run), then PR-E2.

Per ADR 0008 §9
---------------
PR-E2 ships ONLY workflow YAML + a runbook, with no Python source
changes. No Mac M4 evidence required for this PR (the workflow
itself becomes the Mac M4 evidence machinery for ALL future PRs).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>