PR-E1 (ADR 0008 §6.5): integration suite + INV-3 byte-exact GA gate #50

Merged
FluffyAIcode merged 3 commits into main from AgentMemory/v030-pr-e1-integration-suite-8e7f on Jun 2, 2026

@FluffyAIcode FluffyAIcode commented Jun 1, 2026

What ships

Per ADR 0008 §6.5, PR-E1 delivers:

1. tests/integration/ directory

  • __init__.py
  • conftest.py — auto-applies @pytest.mark.integration to every test in the directory; bare pytest skips them, contributors opt in via pytest -m integration.
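
A minimal sketch of how a directory-level conftest.py could auto-apply the marker. The helper name `mark_items_under` is hypothetical, `item.path` assumes pytest 7 or later, and the actual conftest.py in this PR may look different. Because pytest can hand this hook items collected from outside the directory, the sketch filters by path before marking.

```python
from pathlib import Path


def mark_items_under(items, root, marker="integration"):
    """Add `marker` to every collected item whose file lives under `root`.

    Hypothetical helper; returns how many items were marked.
    """
    root = Path(root).resolve()
    marked = 0
    for item in items:
        item_file = Path(str(item.path)).resolve()
        if root in item_file.parents:
            # pytest's add_marker accepts a marker name given as a string.
            item.add_marker(marker)
            marked += 1
    return marked


def pytest_collection_modifyitems(config, items):
    # Only touch tests that live next to (or below) this conftest.py.
    mark_items_under(items, Path(__file__).parent)
```

With the marker applied this way, `pytest -m integration` selects the tests in this directory without each test file having to decorate itself.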

2. tests/integration/test_inv3_session_determinism_gate.py

The INV-3 byte-exact GA gate (ADR 0008 §7 G3). Drives two independent SinkWindowVerifier instances (real Qwen3-0.6B weights via fresh_verifier_factory) through identical history fed via different chunkings, asserts the resulting greedy token streams are byte-identical. Three tests:

| Test | What it covers |
| --- | --- |
| test_one_call_vs_two_calls_yield_byte_identical_tokens | Minimal: 1×10 vs 2×5 chunking, 12 tokens generated greedy. |
| test_chunking_invariance_across_three_splits | Stronger: 1×20 / 3×medium / 10×small chunkings, 8 tokens generated. Catches chunk-boundary bugs the 1-vs-2 case might miss (e.g., a chunk crossing a sink+window trim boundary). |
| test_repeated_runs_with_same_history_byte_identical | Sanity: same workload twice on same verifier == identical output. |
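
The shape of the chunking-invariance check can be sketched with a toy stand-in. Everything here is an assumption for illustration: `ToyVerifier`, `append`, `greedy`, `split_into_chunks` and `run` are hypothetical names, the arithmetic "model" replaces real Qwen3-0.6B numerics, and the real SinkWindowVerifier API is not shown in this PR description.

```python
import struct


class ToyVerifier:
    """Deterministic stand-in for a sink+window verifier (NOT the real class)."""

    def __init__(self, sink=4, window=64):
        self.sink, self.window = sink, window
        self.sink_tokens, self.recent = [], []

    def append(self, chunk):
        # Trim per token, so chunk boundaries cannot change the kept context.
        for tok in chunk:
            if len(self.sink_tokens) < self.sink:
                self.sink_tokens.append(tok)
            else:
                self.recent.append(tok)
                if len(self.recent) > self.window:
                    self.recent.pop(0)

    def greedy(self, n_tokens):
        out = []
        for _ in range(n_tokens):
            ctx = self.sink_tokens + self.recent
            # Toy "argmax": a fixed function of the kept context.
            nxt = sum((i + 1) * t for i, t in enumerate(ctx)) % 1000
            out.append(nxt)
            self.append([nxt])
        return out


def split_into_chunks(history, sizes):
    """Cut `history` into consecutive chunks with the given sizes."""
    assert sum(sizes) == len(history)
    chunks, start = [], 0
    for size in sizes:
        chunks.append(history[start:start + size])
        start += size
    return chunks


def run(history, sizes, n_tokens, sink=4, window=64):
    """Feed `history` in the given chunking, then return generated tokens as bytes."""
    verifier = ToyVerifier(sink=sink, window=window)
    for chunk in split_into_chunks(history, sizes):
        verifier.append(chunk)
    tokens = verifier.greedy(n_tokens)
    return struct.pack(f"<{len(tokens)}I", *tokens)


history = list(range(100, 110))
# 1x10 vs 2x5 chunking, 12 greedy tokens: the byte streams must match exactly.
assert run(history, [10], 12) == run(history, [5, 5], 12)
```

Comparing packed bytes rather than lists mirrors the "byte-exact" wording of the gate; with real weights the interesting failures are the ones where the kept context differs by chunking, which the toy rules out by construction.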

This replaces tests/core/test_determinism_gate.py (deleted in PR-A3 along with verifier.path_select). Per ADR 0008 §6.6, the replacement lives in tests/integration/ rather than tests/core/ because integration is where Mac-M4-only GA gates belong per §9.

3. pytest.ini

Minimal new file registering the integration marker so opt-in invocations (pytest -m integration) don't trigger PytestUnknownMarkWarning.
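
As a rough illustration of what marker registration in pytest.ini looks like, the sketch below embeds a sample file and reads the registered marker names back with the standard library. The marker description text and the helper name `registered_markers` are assumptions; the file shipped in this PR may word things differently.

```python
import configparser

# Sample pytest.ini contents; the description after the colon is hypothetical.
PYTEST_INI = """\
[pytest]
markers =
    integration: opt-in tests that need real model weights (run with -m integration)
"""


def registered_markers(ini_text):
    """Return the marker names listed under [pytest] markers in an ini string."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    lines = parser.get("pytest", "markers", fallback="").splitlines()
    # Each non-empty line has the form "name: description".
    return [line.split(":", 1)[0].strip() for line in lines if line.strip()]


print(registered_markers(PYTEST_INI))
```

Once a marker is listed there, pytest treats it as known and no longer emits PytestUnknownMarkWarning for it.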

4. scripts/review_pr_e1_on_mac.sh

Mac M4 reviewer aid. Runs pytest -m integration tests/integration/ and produces pr-e1-mac-integration-tests-<unix>.json under results/platform-tests/. Same coverage-free pattern as review_pr_b3_on_mac.sh.
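
The artifact naming can be sketched in a few lines. This is not the script itself (which is a shell script); the function names and the JSON fields are hypothetical, and only the directory and filename pattern come from the description above.

```python
import json
import time
from pathlib import Path


def evidence_path(results_root, unix_time):
    """Build results/platform-tests/pr-e1-mac-integration-tests-<unix>.json."""
    name = f"pr-e1-mac-integration-tests-{int(unix_time)}.json"
    return Path(results_root) / "platform-tests" / name


def write_evidence(results_root, passed, failed, unix_time=None):
    """Write a small JSON evidence file and return its path."""
    unix_time = time.time() if unix_time is None else unix_time
    path = evidence_path(results_root, unix_time)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Hypothetical payload; the real script's schema is not shown in this PR.
    payload = {"timestamp": int(unix_time), "passed": passed, "failed": failed}
    path.write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
    return path
```

Putting the Unix timestamp in the filename keeps repeated review runs from overwriting each other's evidence.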

Independence from PR-D1

PR-E1 was originally stacked on PR-D1 (#49), but on review the two turned out to be file-disjoint: PR-E1 only adds new files under tests/integration/, plus pytest.ini and scripts/review_pr_e1_on_mac.sh. It does not depend on PR-D1's deletions, so it was rebased directly onto main and CI triggers normally.

The two PRs can merge in either order.

Not in this PR (deferred)

  • scripts/bench_agentic/bench_session_long_run.py: §6.5 also mentions a Mac M4 long-session bench using the gRPC SDK. It is split out so PR-E1's diff stays focused on the GA gate, and will land as PR-E1b or be rolled into PR-E2.
  • PR-E2: GitHub Actions self-hosted Mac M4 runner workflow invoking pytest -m integration on every PR labelled needs-mac-m4. Until that workflow lands, the gate runs manually via scripts/review_pr_e1_on_mac.sh.

Linux verification

Linux CI gate (existing 8 test paths):
  682 passed, 100% coverage  ← UNCHANGED. tests/integration/ is not in the Linux paths.

Integration suite collection:
  3 tests collected from tests/integration/, marker auto-applied via conftest.

Mac M4 evidence (REQUIRED for merge — load-bearing for v0.3 GA)

Per ADR 0008 §9, this PR's true validation happens on Mac M4. Linux CI cannot validate INV-3 against real Qwen3 numerics. Reviewer runs:

bash scripts/review_pr_e1_on_mac.sh
git add results/platform-tests/pr-e1-mac-*
git commit -m "Mac M4 review evidence for PR-E1"
git push

…and pushes the JSON evidence to this PR branch before merge. All 3 tests must pass with byte-exact equality. Any failure here means INV-3 is broken on real numerics → BLOCKS v0.3 GA.

Next PR after merge

  • PR-E1b or fold-in: bench_session_long_run.py against the gRPC SDK.
  • PR-E2: self-hosted Mac M4 GitHub Actions workflow.
  • PR-D2 (independent): HTTP shim refactor onto SessionStore.

Stacks on PR-D1 (#49). When this PR merges, PR-D1 lands along with
it.

Per ADR 0008 §6.5, PR-E1 ships:

  1. tests/integration/: new test directory with pytest.mark.integration
     marker auto-applied by tests/integration/conftest.py. Every
     test in this directory is opt-in via 'pytest -m integration';
     a bare pytest invocation skips them.

  2. tests/integration/test_inv3_session_determinism_gate.py:
     the INV-3 byte-exact GA gate (ADR 0008 §7 G3). Drives two
     independent SinkWindowVerifier instances (real Qwen3-0.6B
     weights) through identical history fed via different
     chunkings, asserts the resulting greedy token streams are
     byte-identical. Three tests:

       test_one_call_vs_two_calls_yield_byte_identical_tokens
         Minimal gate: 1×10 vs 2×5 chunking on a 10-token history,
         12 tokens of greedy generation.

       test_chunking_invariance_across_three_splits
         Stronger version: 1×20 / 3×medium / 10×small chunkings on
         a 20-token history, 8 tokens of greedy generation. Catches
         chunk-boundary bugs the 1-vs-2 case might miss (e.g., a
         bug that only triggers when a chunk crosses a sink+window
         trim boundary).

       test_repeated_runs_with_same_history_byte_identical
         Sanity: same workload run twice on the same verifier
         produces the same output. Greedy decoding has no
         legitimate source of nondeterminism.

  3. pytest.ini: minimal new file registering the 'integration'
     marker so it doesn't trigger PytestUnknownMarkWarning. Tests
     opt in via 'pytest -m integration'.

Replaces the deleted tests/core/test_determinism_gate.py (PR-A3
removed it together with verifier.path_select; per ADR 0008 §6.6
PR-E1's replacement is in tests/integration/, not tests/core/,
because integration is where Mac-M4-only GA gates belong per §9).

NOT in this PR (deferred):

  scripts/bench_agentic/bench_session_long_run.py
    §6.5 also mentions a Mac M4 long-session bench using the gRPC
    SDK. Splitting that out as PR-E1b (or rolled into PR-E2's CI
    workflow PR) so PR-E1's diff stays focused on the GA gate.

  PR-E2: GitHub Actions self-hosted Mac M4 runner workflow that
    invokes 'pytest -m integration' on every PR labelled
    'needs-mac-m4'. Until that lands, the gate is run manually via
    scripts/review_pr_e1_on_mac.sh.

Mac M4 reviewer aid:

  scripts/review_pr_e1_on_mac.sh
    Runs 'pytest -m integration tests/integration/' and produces
    one JSON artifact (pr-e1-mac-integration-tests-<unix>.json)
    under results/platform-tests/. Same coverage-free pattern as
    review_pr_b3_on_mac.sh + commit 9d1a250 hotfixes already
    folded in.

Local verification (Linux VM, py3.12):
  Linux CI gate (existing 8 paths): 682 passed, 100% coverage,
                                    UNCHANGED. tests/integration/
                                    is not in the Linux paths.
  Integration suite collection:     3 tests collected from
                                    tests/integration/, marker
                                    auto-applied via conftest.
  Bare 'pytest' from repo root:     would still pick up
                                    tests/integration/ if
                                    discovered, but the existing
                                    project convention is explicit
                                    test paths in CI; the marker
                                    is the safety net.

Per ADR 0008 §9: this PR ships the test suite that IS the GA gate.
Linux CI gate does not exercise the integration tests (HF-cache-
bound, real-model-numerics-dependent), so PR-E1's true validation
happens on Mac M4. Reviewer pushes scripts/review_pr_e1_on_mac.sh's
JSON output to the PR branch as evidence.

Next PR after merge:
  PR-E2: GitHub Actions self-hosted Mac M4 runner workflow.
  PR-E1b (or rolled into PR-E2): bench_session_long_run.py.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@cursor cursor Bot force-pushed the AgentMemory/v030-pr-e1-integration-suite-8e7f branch from 66eb1cd to 5aa648c June 1, 2026 16:03
@cursor cursor Bot changed the base branch from AgentMemory/v030-pr-d1-remove-adr-0007-server-deadcode-8e7f to main June 1, 2026 16:03
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
The Mac smoke run reported INV-3 gate fixture scope mismatch:
session_verifier_pair was @pytest.fixture(scope="module") but
depended on fresh_verifier_factory which is function-scoped in
tests/conftest.py. Pytest forbids module-scoped fixtures from
depending on function-scoped ones and raises ScopeMismatch.

Inlined the verifier build inside session_verifier_pair so the
module scope is self-contained. No fixture dependency on the
function-scoped factory. Behavior identical: same VerifierConfig
(sink=4, window=64, bf16, CPU), same shared-pair pattern across
the 3 tests in this file.
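
The scope rule behind that ScopeMismatch can be written down as a small predicate. This is an illustrative sketch, not pytest's own code; `SCOPE_ORDER` and `can_depend_on` are hypothetical names, and the ordering follows pytest's documented scopes from narrowest to broadest.

```python
# pytest fixture scopes, narrowest first.
SCOPE_ORDER = ["function", "class", "module", "package", "session"]


def can_depend_on(requesting_scope, dependency_scope):
    """A fixture may only request fixtures whose scope is the same or broader."""
    return SCOPE_ORDER.index(dependency_scope) >= SCOPE_ORDER.index(requesting_scope)


# The original failure: a module-scoped fixture requesting a function-scoped one.
print(can_depend_on("module", "function"))
# The reverse direction is allowed.
print(can_depend_on("function", "module"))
```

Inlining the verifier build removes the narrower-scoped dependency altogether, which is why the module-scoped fixture becomes legal without changing its own scope.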

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@FluffyAIcode FluffyAIcode marked this pull request as ready for review June 2, 2026 04:02
@FluffyAIcode FluffyAIcode merged commit 6e9e9e4 into main Jun 2, 2026
6 checks passed
@FluffyAIcode FluffyAIcode deleted the AgentMemory/v030-pr-e1-integration-suite-8e7f branch June 2, 2026 04:02
FluffyAIcode added a commit that referenced this pull request Jun 2, 2026
…tion workflow

Closes the loop on automated GA gating. After PR-N1..N4 retired all
verifier-protocol test doubles from the Linux gate, the integration
suite (tests/integration/) became the binding correctness gate for
runtime modules: inference_engine.session.coordinator,
inference_engine.session.generator,
inference_engine.scheduler.scheduler,
inference_engine.server.{app,engine,tokenizer,streaming}, and
kakeya.{client,session}. Until this PR, that suite ran manually
via scripts/review_pr_*_on_mac.sh; PR-E2 wires it into CI on every
PR labelled needs-mac-m4.

Three artifacts ship:

  .github/workflows/integration.yaml           +136 lines
    Self-hosted runner workflow targeting [self-hosted, macOS,
    ARM64, kakeya-mac-m4]. Triggers on PR events when the
    needs-mac-m4 label is present, plus on workflow_dispatch
    for manual re-runs. Steps:
      1. Checkout (full history).
      2. Verify host shape (chip, memory, python version).
      3. Verify Qwen/Qwen3-0.6B is in HF cache (HF_HUB_OFFLINE=1
         at test time, so no downloads in CI; a cache miss fails
         fast with a clear pre-warm command).
      4. pip install -e . + pytest dependencies (warm pip cache
         keeps this <30 s).
      5. pytest -m integration tests/integration/: expected
         runtime 60-120 s on M4 with warm cache. 90-min timeout
         is a safety margin, not the operating point.
      6. Upload JUnit XML artifact.
      7. On failure, inline the test names + first-line error
         messages into the Action log so triage doesn't require
         downloading the artifact.
    Concurrency: cancel-in-progress per PR, so a new push
    supersedes the previous run.
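
Step 7 above (inlining failing test names and first-line errors) could look roughly like the following. This is a sketch under assumptions: the function name `failure_summaries` is hypothetical, the XML is assumed to follow the common JUnit layout that pytest's --junitxml produces, and the workflow's actual implementation is not shown here.

```python
import xml.etree.ElementTree as ET


def failure_summaries(junit_xml):
    """Return one 'classname::name: first error line' string per failed test."""
    root = ET.fromstring(junit_xml)
    lines = []
    for case in root.iter("testcase"):
        for kind in ("failure", "error"):
            node = case.find(kind)
            if node is None:
                continue
            # Prefer the short message attribute, fall back to the element text.
            message = (node.get("message") or node.text or "").strip()
            first_line = message.splitlines()[0] if message else "(no message)"
            name = f"{case.get('classname', '')}::{case.get('name', '')}"
            lines.append(f"{name}: {first_line}")
    return lines
```

Printing these lines to the job log gives a reviewer the failing test and its headline error without opening the uploaded artifact.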

  .github/workflows/auto-label-mac.yaml        +89 lines
    pull_request_target workflow that auto-applies (or removes)
    the needs-mac-m4 label based on which paths the PR touches.
    Trigger paths:
      inference_engine/  - runtime, scheduler, session, server
      sdks/              - Python + TypeScript SDK
      proto/             - wire contract
      tests/integration/ - the integration suite itself
      kv_cache_proposer/ - verifier + decoder
    Doc-only or CI-only PRs are NOT labelled; they skip the
    integration gate entirely, saving runner time. The label is
    automatically dropped if a subsequent push removes all
    verifier-dependent edits.
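
The labelling decision described above reduces to a path-prefix check. The sketch below is illustrative only: the workflow itself is YAML, and `TRIGGER_PREFIXES` and `needs_mac_m4` are hypothetical names. The prefixes are the trigger paths listed in this commit message.

```python
# Trigger paths taken from the list above.
TRIGGER_PREFIXES = (
    "inference_engine/",
    "sdks/",
    "proto/",
    "tests/integration/",
    "kv_cache_proposer/",
)


def needs_mac_m4(changed_files):
    """True if any changed file sits under one of the trigger paths."""
    return any(path.startswith(TRIGGER_PREFIXES) for path in changed_files)


print(needs_mac_m4(["docs/ops/mac-m4-runner-setup.md"]))
print(needs_mac_m4(["README.md", "proto/session.proto"]))
```

Re-evaluating the same predicate on every push is what lets the label be dropped again once a PR no longer touches any trigger path.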

  docs/ops/mac-m4-runner-setup.md              +137 lines
    Operator runbook for the self-hosted runner: hardware
    requirements (24 GB minimum, ~50 GB free disk), runner
    registration with the kakeya-mac-m4 label, HF cache
    pre-warm command (Qwen3-0.6B), Python toolchain setup,
    runtime expectations, cache hygiene cron, runner upgrade
    procedure, and failure triage steps.
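
A cache-presence check of the kind the runbook and workflow rely on might look like this. It is a sketch that assumes the standard Hugging Face hub cache layout (`models--<org>--<name>/snapshots/<revision>/` under `HF_HUB_CACHE`, or under `HF_HOME/hub`, defaulting to `~/.cache/huggingface/hub`); the function names are hypothetical and the runbook's actual pre-warm command is not reproduced here.

```python
import os
from pathlib import Path


def hub_cache_dir(env=None):
    """Resolve the hub cache directory from HF_HUB_CACHE / HF_HOME, else the default."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    hf_home = env.get("HF_HOME") or str(Path.home() / ".cache" / "huggingface")
    return Path(hf_home) / "hub"


def cached_snapshots(repo_id, cache_dir):
    """List non-empty snapshot directories for a model repo, e.g. 'Qwen/Qwen3-0.6B'."""
    snapshots = Path(cache_dir) / ("models--" + repo_id.replace("/", "--")) / "snapshots"
    if not snapshots.is_dir():
        return []
    return sorted(p for p in snapshots.iterdir() if p.is_dir() and any(p.iterdir()))


if not cached_snapshots("Qwen/Qwen3-0.6B", hub_cache_dir()):
    print("Qwen/Qwen3-0.6B not found in the local HF cache; pre-warm it first.")
```

Failing on an empty result before the tests start gives a clearer error than letting an offline model load fail partway through the suite.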

CI workflow split rationale
---------------------------
The pre-existing .github/workflows/ci.yaml stays as the Linux gate
(verifier-independent, runs on github-hosted ubuntu-latest, fires
on every PR). PR-E2 adds integration.yaml as a SEPARATE workflow
because:
  1. Self-hosted runners are slow / few; doc-only PRs shouldn't
     touch them.
  2. The integration gate is intentionally OPT-IN by label; ci.yaml
     is non-optional.
  3. Failure semantics differ: Linux gate failure blocks merge
     unconditionally; Mac M4 gate failure surfaces a structured
     report but the merge decision is a human one until v0.3.0
     final ships.

Together the two workflows form the post-cleanup gating model:
  - Linux gate (ci.yaml):
      verifier-independent code; 100% coverage; every PR.
  - Mac M4 gate (integration.yaml):
      verifier-dependent code; binding GA gate; PRs touching
      runtime / SDK / proto / integration tests.

Stack
-----
PR-E2 is branched off main, independent of the cleanup PRs (#49,
#50, #51, #52, #53, #54, #55, #56). The workflow doesn't fail at
launch even before PR-E1 lands; it just won't find any tests
under tests/integration/ until that PR is merged. Recommended
merge order: cleanup PRs first (so the workflow has tests to
run), then PR-E2.

Per ADR 0008 §9
---------------
PR-E2 ships ONLY workflow YAML + a runbook, with no Python source
changes. No Mac M4 evidence required for this PR (the workflow
itself becomes the Mac M4 evidence machinery for ALL future PRs).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>