PR-E2 (ADR 0008 §6.5): self-hosted Mac M4 GitHub Actions integration workflow #57

Merged: FluffyAIcode merged 1 commit into main from AgentMemory/v030-pr-e2-mac-m4-self-hosted-runner-8e7f on Jun 2, 2026

@FluffyAIcode
Owner

Why

Closes the loop on automated GA gating. After PR-N1..N4 retired all verifier-protocol test doubles from the Linux gate, the integration suite (tests/integration/) became the binding correctness gate for the runtime modules: inference_engine.session.coordinator, inference_engine.session.generator, inference_engine.scheduler.scheduler, inference_engine.server.{app,engine,tokenizer,streaming}, and kakeya.{client,session}. Until this PR, that suite was run by hand via scripts/review_pr_*_on_mac.sh; PR-E2 wires it into CI for every PR labelled needs-mac-m4.

What ships

| File | Lines | Purpose |
| --- | --- | --- |
| .github/workflows/integration.yaml | +136 | Self-hosted runner workflow: verify host + HF cache → pip install -e . → pytest -m integration → upload JUnit + inline failure summary |
| .github/workflows/auto-label-mac.yaml | +89 | pull_request_target workflow that auto-applies needs-mac-m4 when a PR touches inference_engine/, sdks/, proto/, tests/integration/, or kv_cache_proposer/; auto-removes it if a subsequent push drops all verifier-dependent edits |
| docs/ops/mac-m4-runner-setup.md | +137 | Operator runbook: hardware requirements, runner registration with kakeya-mac-m4 label, HF cache pre-warm command, Python toolchain setup, runtime expectations, cache hygiene cron, runner upgrade procedure, failure triage |
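
The "inline failure summary" step could be approximated as below. This is a minimal sketch, not the script the workflow actually runs (the PR does not show it): it assumes pytest's JUnit XML output, the function name `summarize_failures` is hypothetical, and the test names in `SAMPLE` are invented for illustration.

```python
import xml.etree.ElementTree as ET


def summarize_failures(junit_xml: str) -> list[str]:
    """Return one 'test id: first line of error' string per failed test case."""
    root = ET.fromstring(junit_xml)
    lines = []
    # iter() finds <testcase> whether the root is <testsuites> or <testsuite>.
    for case in root.iter("testcase"):
        for kind in ("failure", "error"):
            node = case.find(kind)
            if node is None:
                continue
            # Prefer the message attribute; fall back to the element text.
            detail = (node.get("message") or node.text or "").strip()
            first_line = detail.splitlines()[0] if detail else "(no message)"
            test_id = f"{case.get('classname', '?')}::{case.get('name', '?')}"
            lines.append(f"{test_id}: {first_line}")
            break  # report each test case at most once
    return lines


# Hypothetical report with one passing and one failing test.
SAMPLE = """<testsuites><testsuite name="pytest" tests="2" failures="1">
<testcase classname="tests.integration.test_a" name="test_ok"/>
<testcase classname="tests.integration.test_a" name="test_bad">
<failure message="AssertionError: boom">traceback text</failure>
</testcase></testsuite></testsuites>"""

if __name__ == "__main__":
    for line in summarize_failures(SAMPLE):
        print(line)
```

Printing one line per failure keeps the Action log readable, which matches the stated goal of triaging without downloading the artifact.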

Workflow details

integration.yaml

  • Runs on: [self-hosted, macOS, ARM64, kakeya-mac-m4]
  • Trigger: PR opened/synchronize/reopened/labeled + workflow_dispatch. Conditional on needs-mac-m4 label being present.
  • Pre-flight: chip/memory check + HF_HUB_OFFLINE=1 enforcement (cache miss fails fast with a clear pre-warm command).
  • Concurrency: cancel-in-progress per PR; new pushes supersede.
  • Timeout: 90 min (operating point is 2-3 min on warm cache; the timeout is a safety margin).
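
As an illustration of the cache pre-flight, the sketch below checks for a cached snapshot using only the standard library. It assumes the usual Hugging Face hub cache layout (`models--<org>--<name>/snapshots/<revision>/`) and the `HF_HUB_CACHE` / `HF_HOME` environment variables; the function names are hypothetical, and the real workflow step may check the cache differently. It only confirms that some snapshot with files exists, not that the snapshot is complete.

```python
import os
from pathlib import Path


def hub_cache_dir(env=None) -> Path:
    """Resolve the hub cache directory from HF_HUB_CACHE, HF_HOME, or the default."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    hf_home = env.get("HF_HOME") or str(Path.home() / ".cache" / "huggingface")
    return Path(hf_home) / "hub"


def model_is_cached(repo_id: str, cache_dir: Path) -> bool:
    """True if at least one snapshot directory for repo_id contains a file."""
    snapshots = cache_dir / ("models--" + repo_id.replace("/", "--")) / "snapshots"
    if not snapshots.is_dir():
        return False
    return any(any(rev.iterdir()) for rev in snapshots.iterdir() if rev.is_dir())


def preflight(repo_id: str, cache_dir: Path) -> int:
    """Return a process exit code: 0 if the model is cached, 1 otherwise."""
    if model_is_cached(repo_id, cache_dir):
        print(f"ok: {repo_id} found in {cache_dir}")
        return 0
    print(f"error: {repo_id} not in {cache_dir}; pre-warm the cache on the runner first")
    return 1
```

A workflow step could call `preflight("Qwen/Qwen3-0.6B", hub_cache_dir())` and exit with the returned code, so a cold cache fails before any test starts.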

auto-label-mac.yaml

  • Runs on: ubuntu-latest (no self-hosted overhead).
  • Triggers: every PR opened/synchronize/reopened.
  • Logic: lists changed files via the GitHub API; matches against the trigger-path list; adds/removes the label as needed.
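
The labelling decision can be sketched as a pure function over the changed-file list. This is an illustrative sketch only: the trigger paths and label name come from this PR, but the function names are hypothetical, and the real workflow obtains the file list and applies the label through the GitHub API, which is not shown here.

```python
# Trigger paths and label name as listed in this PR description.
TRIGGER_PATHS = (
    "inference_engine/",
    "sdks/",
    "proto/",
    "tests/integration/",
    "kv_cache_proposer/",
)
LABEL = "needs-mac-m4"


def needs_mac_m4(changed_files):
    """True if any changed file lives under a verifier-dependent path."""
    return any(path.startswith(TRIGGER_PATHS) for path in changed_files)


def label_action(changed_files, current_labels):
    """Return 'add', 'remove', or None (label already in the right state)."""
    wanted = needs_mac_m4(changed_files)
    present = LABEL in current_labels
    if wanted and not present:
        return "add"
    if present and not wanted:
        return "remove"
    return None
```

For example, `label_action(["README.md"], ["needs-mac-m4"])` returns `"remove"`, which corresponds to the case where a later push drops all verifier-dependent edits.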

Two-tier gating model post PR-E2

| Tier | Workflow | Coverage | Trigger |
| --- | --- | --- | --- |
| Linux gate | ci.yaml | Verifier-independent code, 100% | Every PR; non-optional |
| Mac M4 gate | integration.yaml | Verifier-dependent code (runtime + SDK + proto + integration tests) | PRs labelled needs-mac-m4 (auto-applied) |

Failure semantics differ:

  • Linux gate failure blocks merge unconditionally.
  • Mac M4 gate failure surfaces a structured report; the merge decision stays with a human reviewer until v0.3.0 final ships, at which point we'll make this workflow a required status check too.

Operator runbook highlights

docs/ops/mac-m4-runner-setup.md is the source of truth for the runner. Key one-time setup steps:

  1. Register the runner with the kakeya-mac-m4 label (in addition to the default self-hosted, macOS, ARM64).
  2. Pre-warm the HF cache:
    python3 -c "
    from transformers import AutoModelForCausalLM, AutoTokenizer
    AutoModelForCausalLM.from_pretrained('Qwen/Qwen3-0.6B')
    AutoTokenizer.from_pretrained('Qwen/Qwen3-0.6B')
    "
  3. Verify Python 3.12+ and ARM64 architecture.
  4. Run the runner as a launchd service so it survives reboots.
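
Step 3 could be checked with a few lines of standard-library Python. This is a hedged sketch: `check_host` and `check_this_host` are hypothetical names, and the sketch covers only OS, architecture, and Python version, not the chip or memory checks the workflow's pre-flight also performs.

```python
import platform
import sys


def check_host(machine, system, python_version):
    """Return a list of problems; an empty list means the host looks suitable."""
    problems = []
    if system != "Darwin":
        problems.append(f"expected macOS (Darwin), got {system}")
    if machine != "arm64":
        problems.append(f"expected arm64, got {machine}")
    major_minor = tuple(python_version[:2])
    if major_minor < (3, 12):
        problems.append("expected Python 3.12+, got %d.%d" % major_minor)
    return problems


def check_this_host():
    """Run the checks against the interpreter and machine executing this code."""
    return check_host(platform.machine(), platform.system(), sys.version_info)
```

Note that a Python interpreter running under Rosetta reports `x86_64` from `platform.machine()`, so this check would also flag a non-native toolchain on an otherwise valid M4 host.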

Stack

PR-E2 is branched off main, independent of the cleanup PRs. The workflow doesn't fail at launch even before PR-E1 lands; it just won't find any tests under tests/integration/ until that PR is merged. Recommended merge order:

  1. Cleanup PRs (so the workflow has tests to run):
       • PR-D1 (ADR 0008 Phase D): remove ADR 0007 server-side dead code #49
       • PR-E1 (ADR 0008 §6.5): integration suite + INV-3 byte-exact GA gate #50
       • PR-E1b (ADR 0008 §6.5): gRPC long-session bench + server CLI #51
       • PR-E1c: fix kv_live_bytes reporting path #52
       • PR-N1: remove verifier-protocol test doubles from Linux CI gate #53
       • PR-N2: remove DeterministicEngine + DeterministicTokenizer from scheduler tests #54
       • PR-N3: remove HTTP-shim, engine, tokenizer test doubles #55
       • PR-N4: remove SDK conftest stub + finalize no-doubles cleanup #56
  2. PR-E2 (this PR) — wires up the workflow once the integration suite is on main.
  3. After PR-E2 merges, the next PR that touches a trigger path will be labelled and auto-trigger the Mac M4 workflow.

Per ADR 0008 §9

This PR ships only workflow YAML + a runbook — no Python source changes. No Mac M4 evidence required for this PR; the workflow itself becomes the Mac M4 evidence machinery for ALL future PRs.

Reviewer checklist

  • YAML files parse cleanly (verified locally with python3 -c "import yaml; yaml.safe_load(...)").
  • auto-label-mac.yaml permission scope is pull-requests: write and trigger is pull_request_target (required for PRs from forks to gain the label).
  • integration.yaml runs-on includes kakeya-mac-m4 (operator must add this label when registering the runner per the runbook).
  • Runbook covers the cache pre-warm command + the HF_HUB_OFFLINE=1 constraint at test time.

…tion workflow

Closes the loop on automated GA gating. After PR-N1..N4 retired all
verifier-protocol test doubles from the Linux gate, the integration
suite (tests/integration/) became the binding correctness gate for
runtime modules: inference_engine.session.coordinator,
inference_engine.session.generator,
inference_engine.scheduler.scheduler,
inference_engine.server.{app,engine,tokenizer,streaming}, and
kakeya.{client,session}. Until this PR, that suite ran manually
via scripts/review_pr_*_on_mac.sh; PR-E2 wires it into CI on every
PR labelled needs-mac-m4.

Three artifacts ship:

  .github/workflows/integration.yaml           +136 lines
    Self-hosted runner workflow targeting [self-hosted, macOS,
    ARM64, kakeya-mac-m4]. Triggers on PR events when the
    needs-mac-m4 label is present, plus on workflow_dispatch
    for manual re-runs. Steps:
      1. Checkout (full history).
      2. Verify host shape (chip, memory, python version).
      3. Verify Qwen/Qwen3-0.6B is in HF cache (HF_HUB_OFFLINE=1
         at test time, so no downloads in CI; a cache miss fails
         fast with a clear pre-warm command).
      4. pip install -e . + pytest dependencies (warm pip cache
         keeps this <30 s).
      5. pytest -m integration tests/integration/; expected
         runtime 60-120 s on M4 with warm cache. 90-min timeout
         is a safety margin, not the operating point.
      6. Upload JUnit XML artifact.
      7. On failure, inline the test names + first-line error
         messages into the Action log so triage doesn't require
         downloading the artifact.
    Concurrency: cancel-in-progress per PR, so a new push
    supersedes the previous run.

  .github/workflows/auto-label-mac.yaml        +89 lines
    pull_request_target workflow that auto-applies (or removes)
    the needs-mac-m4 label based on which paths the PR touches.
    Trigger paths:
      inference_engine/  - runtime, scheduler, session, server
      sdks/              - Python + TypeScript SDK
      proto/             - wire contract
      tests/integration/ - the integration suite itself
      kv_cache_proposer/ - verifier + decoder
    Doc-only or CI-only PRs are NOT labelled; they skip the
    integration gate entirely, saving runner time. The label is
    automatically dropped if a subsequent push removes all
    verifier-dependent edits.

  docs/ops/mac-m4-runner-setup.md              +137 lines
    Operator runbook for the self-hosted runner: hardware
    requirements (24 GB minimum, ~50 GB free disk), runner
    registration with the kakeya-mac-m4 label, HF cache
    pre-warm command (Qwen3-0.6B), Python toolchain setup,
    runtime expectations, cache hygiene cron, runner upgrade
    procedure, and failure triage steps.

CI workflow split rationale
---------------------------
The pre-existing .github/workflows/ci.yaml stays as the Linux gate
(verifier-independent, runs on github-hosted ubuntu-latest, fires
on every PR). PR-E2 adds integration.yaml as a SEPARATE workflow
because:
  1. Self-hosted runners are slow / few; doc-only PRs shouldn't
     touch them.
  2. The integration gate is intentionally OPT-IN by label; ci.yaml
     is non-optional.
  3. Failure semantics differ: Linux gate failure blocks merge
     unconditionally; Mac M4 gate failure surfaces a structured
     report but the merge decision is a human one until v0.3.0
     final ships.

Together the two workflows form the post-cleanup gating model:
  - Linux gate (ci.yaml):
      verifier-independent code; 100% coverage; every PR.
  - Mac M4 gate (integration.yaml):
      verifier-dependent code; binding GA gate; PRs touching
      runtime / SDK / proto / integration tests.

Stack
-----
PR-E2 is branched off main, independent of the cleanup PRs (#49,
#50, #51, #52, #53, #54, #55, #56). The workflow doesn't fail at
launch even before PR-E1 lands; it just won't find any tests
under tests/integration/ until that PR is merged. Recommended
merge order: cleanup PRs first (so the workflow has tests to
run), then PR-E2.

Per ADR 0008 §9
---------------
PR-E2 ships ONLY workflow YAML + a runbook, with no Python source
changes. No Mac M4 evidence required for this PR (the workflow
itself becomes the Mac M4 evidence machinery for ALL future PRs).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
FluffyAIcode force-pushed the AgentMemory/v030-pr-e2-mac-m4-self-hosted-runner-8e7f branch from 60b18bb to a0397c9 (June 2, 2026 04:36)
FluffyAIcode marked this pull request as ready for review (June 2, 2026 04:40)
FluffyAIcode merged commit 7f30880 into main (Jun 2, 2026); 7 checks passed
FluffyAIcode deleted the AgentMemory/v030-pr-e2-mac-m4-self-hosted-runner-8e7f branch (June 2, 2026 04:40)