PR-E2 (ADR 0008 §6.5): self-hosted Mac M4 GitHub Actions integration workflow by FluffyAIcode · Pull Request #57 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-06-02T04:12:23Z

Why

Closes the loop on automated GA gating. After PR-N1..N4 retired all verifier-protocol test doubles from the Linux gate, the integration suite (tests/integration/) became the binding correctness gate for runtime modules — inference_engine.session.coordinator, inference_engine.session.generator, inference_engine.scheduler.scheduler, inference_engine.server.{app,engine,tokenizer,streaming}, kakeya.{client,session}. Until this PR, that suite ran manually via scripts/review_pr_*_on_mac.sh; PR-E2 wires it into CI on every PR labelled needs-mac-m4.

What ships

File	Lines	Purpose
`.github/workflows/integration.yaml`	+136	Self-hosted runner workflow: verify host + HF cache → `pip install -e .` → `pytest -m integration` → upload JUnit + inline failure summary
`.github/workflows/auto-label-mac.yaml`	+89	`pull_request_target` workflow that auto-applies `needs-mac-m4` when a PR touches `inference_engine/`, `sdks/`, `proto/`, `tests/integration/`, or `kv_cache_proposer/`; auto-removes it if a subsequent push drops all verifier-dependent edits
`docs/ops/mac-m4-runner-setup.md`	+137	Operator runbook: hardware requirements, runner registration with `kakeya-mac-m4` label, HF cache pre-warm command, Python toolchain setup, runtime expectations, cache hygiene cron, runner upgrade procedure, failure triage

Workflow details

`integration.yaml`

Runs on: [self-hosted, macOS, ARM64, kakeya-mac-m4]
Trigger: PR opened/synchronize/reopened/labeled + workflow_dispatch. Conditional on needs-mac-m4 label being present.
Pre-flight: chip/memory check + HF_HUB_OFFLINE=1 enforcement (cache miss fails fast with a clear pre-warm command).
Concurrency: cancel-in-progress per PR; new pushes supersede.
Timeout: 90 min (operating point is 2-3 min on warm cache; the timeout is a safety margin).

`auto-label-mac.yaml`

Runs on: ubuntu-latest (no self-hosted overhead).
Triggers: every PR opened/synchronize/reopened.
Logic: lists changed files via the GitHub API; matches against the trigger-path list; adds/removes the label as needed.

Two-tier gating model post PR-E2

Tier	Workflow	Coverage	Trigger
Linux gate	`ci.yaml`	Verifier-independent code, 100 %	Every PR; non-optional
Mac M4 gate	`integration.yaml`	Verifier-dependent code (runtime + SDK + proto + integration tests)	PRs labelled `needs-mac-m4` (auto-applied)

Failure semantics differ:

Linux gate failure blocks merge unconditionally.
Mac M4 gate failure surfaces a structured report; the merge decision is human until v0.3.0 final ships, at which point we'll set required-status-check on this workflow too.

Operator runbook highlights

The docs/ops/mac-m4-runner-setup.md is the source of truth for the runner. Key one-time setup:

Register the runner with the kakeya-mac-m4 label (in addition to the default self-hosted, macOS, ARM64).

Pre-warm the HF cache:

python3 -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
AutoModelForCausalLM.from_pretrained('Qwen/Qwen3-0.6B')
AutoTokenizer.from_pretrained('Qwen/Qwen3-0.6B')
"

Verify Python 3.12+ and ARM64 architecture.
Run as launchd service so it survives reboots.

Stack

PR-E2 is branched off main, independent of the cleanup PRs. The workflow doesn't fail at launch even before PR-E1 lands; it just won't find any tests under tests/integration/ until that PR is merged. Recommended merge order:

Cleanup PRs (so the workflow has tests to run): PR-D1 (ADR 0008 Phase D): remove ADR 0007 server-side dead code #49 → PR-E1 (ADR 0008 §6.5): integration suite + INV-3 byte-exact GA gate #50 → PR-E1b (ADR 0008 §6.5): gRPC long-session bench + server CLI #51 → PR-E1c: fix kv_live_bytes reporting path #52 → PR-N1: remove verifier-protocol test doubles from Linux CI gate #53 → PR-N2: remove DeterministicEngine + DeterministicTokenizer from scheduler tests #54 → PR-N3: remove HTTP-shim, engine, tokenizer test doubles #55 → PR-N4: remove SDK conftest stub + finalize no-doubles cleanup #56.
PR-E2 (this PR) — wires up the workflow once the integration suite is on main.
After PR-E2 merges, the next PR you touch will auto-trigger the Mac M4 workflow.

Per ADR 0008 §9

This PR ships only workflow YAML + a runbook — no Python source changes. No Mac M4 evidence required for this PR; the workflow itself becomes the Mac M4 evidence machinery for ALL future PRs.

Reviewer checklist

YAML files parse cleanly (verified locally with python3 -c "import yaml; yaml.safe_load(...)").
auto-label-mac.yaml permission scope is pull-requests: write and trigger is pull_request_target (required for PRs from forks to gain the label).
integration.yaml runs-on includes kakeya-mac-m4 (operator must add this label when registering the runner per the runbook).
Runbook covers the cache pre-warm command + the HF_HUB_OFFLINE=1 constraint at test time.

…tion workflow Closes the loop on automated GA gating. After PR-N1..N4 retired all verifier-protocol test doubles from the Linux gate, the integration suite (tests/integration/) became the binding correctness gate for runtime modules \u2014 inference_engine.session.coordinator, inference_engine.session.generator, inference_engine.scheduler.scheduler, inference_engine.server.{app,engine,tokenizer,streaming}, and kakeya.{client,session}. Until this PR, that suite ran manually via scripts/review_pr_*_on_mac.sh; PR-E2 wires it into CI on every PR labelled needs-mac-m4. Three artifacts ship: .github/workflows/integration.yaml +136 lines Self-hosted runner workflow targeting [self-hosted, macOS, ARM64, kakeya-mac-m4]. Triggers on PR events when the needs-mac-m4 label is present, plus on workflow_dispatch for manual re-runs. Steps: 1. Checkout (full history). 2. Verify host shape (chip, memory, python version). 3. Verify Qwen/Qwen3-0.6B is in HF cache (HF_HUB_OFFLINE=1 at test time \u2014 no downloads in CI; cache miss fails fast with a clear pre-warm command). 4. pip install -e . + pytest dependencies (warm pip cache keeps this <30 s). 5. pytest -m integration tests/integration/ \u2014 expected runtime 60-120 s on M4 with warm cache. 90-min timeout is a safety margin, not the operating point. 6. Upload JUnit XML artifact. 7. On failure, inline the test names + first-line error messages into the Action log so triage doesn't require downloading the artifact. Concurrency: cancel-in-progress per PR, so a new push supersedes the previous run. .github/workflows/auto-label-mac.yaml +89 lines pull_request_target workflow that auto-applies (or removes) the needs-mac-m4 label based on which paths the PR touches. Trigger paths: inference_engine/ \u2014 runtime, scheduler, session, server sdks/ \u2014 Python + TypeScript SDK proto/ \u2014 wire contract tests/integration/ \u2014 the integration suite itself kv_cache_proposer/ \u2014 verifier + decoder Doc-only or CI-only PRs are NOT labelled \u2014 they skip the integration gate entirely, saving runner time. The label is automatically dropped if a subsequent push removes all verifier-dependent edits. docs/ops/mac-m4-runner-setup.md +137 lines Operator runbook for the self-hosted runner: hardware requirements (24 GB minimum, ~50 GB free disk), runner registration with the kakeya-mac-m4 label, HF cache pre-warm command (Qwen3-0.6B), Python toolchain setup, runtime expectations, cache hygiene cron, runner upgrade procedure, and failure triage steps. CI workflow split rationale --------------------------- The pre-existing .github/workflows/ci.yaml stays as the Linux gate (verifier-independent, runs on github-hosted ubuntu-latest, fires on every PR). PR-E2 adds integration.yaml as a SEPARATE workflow because: 1. Self-hosted runners are slow / few; doc-only PRs shouldn't touch them. 2. The integration gate is intentionally OPT-IN by label; ci.yaml is non-optional. 3. Failure semantics differ: Linux gate failure blocks merge unconditionally; Mac M4 gate failure surfaces a structured report but the merge decision is a human one until v0.3.0 final ships. Together the two workflows form the post-cleanup gating model: - Linux gate (ci.yaml): verifier-independent code; 100% coverage; every PR. - Mac M4 gate (integration.yaml): verifier-dependent code; binding GA gate; PRs touching runtime / SDK / proto / integration tests. Stack ----- PR-E2 is branched off main, independent of the cleanup PRs (#49, #50, #51, #52, #53, #54, #55, #56). The workflow doesn't fail at launch even before PR-E1 lands; it just won't find any tests under tests/integration/ until that PR is merged. Recommended merge order: cleanup PRs first (so the workflow has tests to run), then PR-E2. Per ADR 0008 \u00a79 ---------------- PR-E2 ships ONLY workflow YAML + a runbook \u2014 no Python source changes. No Mac M4 evidence required for this PR (the workflow itself becomes the Mac M4 evidence machinery for ALL future PRs). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

FluffyAIcode mentioned this pull request Jun 2, 2026

PR-D2 (ADR 0008 Phase D): refactor HTTP shim onto SessionStore #58

Merged

5 tasks

FluffyAIcode force-pushed the AgentMemory/v030-pr-e2-mac-m4-self-hosted-runner-8e7f branch from 60b18bb to a0397c9 Compare June 2, 2026 04:36

FluffyAIcode marked this pull request as ready for review June 2, 2026 04:40

FluffyAIcode merged commit 7f30880 into main Jun 2, 2026
7 checks passed

FluffyAIcode deleted the AgentMemory/v030-pr-e2-mac-m4-self-hosted-runner-8e7f branch June 2, 2026 04:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PR-E2 (ADR 0008 §6.5): self-hosted Mac M4 GitHub Actions integration workflow#57

PR-E2 (ADR 0008 §6.5): self-hosted Mac M4 GitHub Actions integration workflow#57
FluffyAIcode merged 1 commit into
mainfrom
AgentMemory/v030-pr-e2-mac-m4-self-hosted-runner-8e7f

FluffyAIcode commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FluffyAIcode commented Jun 2, 2026

Why

What ships

Workflow details

integration.yaml

auto-label-mac.yaml

Two-tier gating model post PR-E2

Operator runbook highlights

Stack

Per ADR 0008 §9

Reviewer checklist

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`integration.yaml`

`auto-label-mac.yaml`