Summary
Add .github/workflows/deploy.yml that triggers on push to main, builds and pushes the container image to GHCR, opens an auto-merging PR against the infra repo bumping the api-proxy image digest, joins NetBird, and SSHes into the home box to run bin/deploy.sh. Auto-rolls-back on healthcheck failure and dumps diagnostics on any failure.
Context
This is the last piece of the CI/CD chain. ci.yml and claude-code-review.yml already exist. The host scripts (bin/*.sh) and the local-dev compose are tracked in their own issues.
The workflow must mirror the established infra pattern, including the preflight-secrets guard that prevents the well-known netbird up hang when NB_SETUP_KEY is empty.
Cross-repo coordination: the api-proxy deploy workflow does NOT directly deploy via the infra compose. It opens a PR against Stoganet/infra bumping the api-proxy image digest, waits for that PR to merge, then SSHes in and runs the host-side deploy.sh. This preserves the "infra deploys infra" invariant — no cross-repo repository_dispatch complexity, no shared state.
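The digest bump itself can be sketched as a small shell helper. This is an illustration under assumptions, not the infra repo's actual script: bump_digest is a hypothetical name, and it assumes the compose file pins the image on a line of the form image: <name>@sha256:<64 hex chars>. The image name is passed in as a parameter because the exact reference used in the infra compose file is not stated here.

```shell
# Sketch only: bump_digest is a hypothetical helper, and the
# "image: <name>@sha256:<hex>" line format is an assumption.
# Usage: bump_digest FILE IMAGE NEW_DIGEST   (NEW_DIGEST looks like sha256:...)
bump_digest() {
  file="$1"
  image="$2"
  digest="$3"
  tmp="$(mktemp)"
  # Rewrite only the digest that follows the given image name; other services are untouched.
  sed -E "s|(image:[[:space:]]*${image})@sha256:[0-9a-f]{64}|\1@${digest}|" "$file" > "$tmp" \
    || { rm -f "$tmp"; return 1; }
  # Copy the result back in place so the original file mode is preserved.
  cat "$tmp" > "$file"
  rm -f "$tmp"
  # Return non-zero if the expected reference is absent, so a no-op never gets committed.
  grep -qF "${image}@${digest}" "$file"
}
```

Returning a failure when the pinned line is not found keeps the job from opening a PR that changes nothing.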
Scope
- on: triggers: push to main, plus workflow_dispatch with an optional sha input for re-running older SHAs.
- concurrency: { group: deploy, cancel-in-progress: false } so a second push waits its turn instead of cancelling the running deploy. (GitHub keeps at most one pending run per group, so a newer push replaces an older queued one.)
- Three jobs, in sequence via needs:
  - build-and-push: check out the target SHA, set up buildx, log into GHCR with GITHUB_TOKEN, push tags ghcr.io/${{ github.repository }}:<sha> and :latest. Lowercase the repository name first if the owner contains capitals, since image references must be lowercase. Output image_digest (from docker/build-push-action) and target_sha.
  - bump-infra-digest: check out Stoganet/infra using an INFRA_REPO_TOKEN PAT with contents: write + pull-requests: write. Sed-replace the api-proxy image digest line in compose/docker-compose.yml, commit on a bump-api-proxy-<sha> branch, push, open a PR, and run gh pr merge --auto --squash. Then poll the PR state every 20s for up to 20min; fail if the PR is closed without merging or doesn't merge in time.
  - deploy:
    - Preflight step: assert each of NB_SETUP_KEY, DEPLOY_SSH_KEY, HOME_OVERLAY_IP, HOME_SSH_HOST_KEY is non-empty. Fail loudly, listing ALL missing names in one pass. This step exists specifically because netbird up with an empty --setup-key hangs without surfacing the cause.
    - Install NetBird (curl -fsSL https://pkgs.netbird.io/install.sh | sh).
    - sudo netbird up --setup-key "$NB_SETUP_KEY" --management-url https://netbird.stoganet.com:443. Wait for the wt0 interface (poll up to 20s).
    - Write DEPLOY_SSH_KEY to ~/.ssh/id_ed25519 (mode 600), append HOME_OVERLAY_IP HOME_SSH_HOST_KEY to ~/.ssh/known_hosts.
    - ssh deploy@${HOME_OVERLAY_IP} "/srv/api-proxy/bin/deploy.sh <sha> <digest>".
    - On failure: SSH again to determine the HEAD~1 SHA, then ssh ... /srv/api-proxy/bin/rollback.sh $PREV.
    - Always-on-failure: SSH and run /srv/api-proxy/bin/diagnostics.sh, capturing its output to the job log.
Out of scope: the host scripts themselves (separate issue), renovate.json (we use Dependabot), auto-assign.yml (intentionally not used).
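The merge wait in bump-infra-digest (poll every 20s, give up after 20min, fail fast on a closed PR) could be sketched as below. The names wait_for_merge, pr_state, and PR_STATE_CMD are hypothetical; PR_STATE_CMD is an assumed override hook that lets the loop be exercised without calling GitHub. The default path uses gh pr view with --json state, which reports OPEN, CLOSED, or MERGED.

```shell
# Sketch only: wait_for_merge, pr_state and PR_STATE_CMD are hypothetical names.
pr_state() {
  if [ -n "${PR_STATE_CMD:-}" ]; then
    # Test hook: any command that prints OPEN, CLOSED or MERGED.
    $PR_STATE_CMD "$1"
  else
    gh pr view "$1" --repo Stoganet/infra --json state --jq .state
  fi
}

# Usage: wait_for_merge PR_NUMBER [INTERVAL_SECONDS] [TIMEOUT_SECONDS]
wait_for_merge() {
  pr="$1"
  interval="${2:-20}"
  timeout="${3:-1200}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # A failed lookup yields an empty state and is treated like OPEN: keep polling.
    state="$(pr_state "$pr")"
    case "$state" in
      MERGED) echo "PR $pr merged"; return 0 ;;
      CLOSED) echo "PR $pr was closed without merging"; return 1 ;;
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "PR $pr did not merge within ${timeout}s"
  return 1
}
```

In the workflow the call would be wait_for_merge "$PR_NUMBER" with the defaults, where PR_NUMBER is whatever the PR-creation step recorded.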
Acceptance criteria
- deploy.yml parses cleanly (GitHub Actions accepts it on push without parse errors).
- The deploy step passes both <sha> and <digest> to bin/deploy.sh, so the host script can assert the infra checkout contains the expected digest.
- The rollback step runs under if: failure() && steps.deploy.conclusion == 'failure'.
- The diagnostics step runs under if: failure() and ends with || true so it can't itself mask the real error.
Notes
- Required repo secrets (must be set before the first deploy): NB_SETUP_KEY, DEPLOY_SSH_KEY, HOME_OVERLAY_IP, HOME_SSH_HOST_KEY, INFRA_REPO_TOKEN. GITHUB_TOKEN is automatic.
  - HOME_SSH_HOST_KEY value should be the output of ssh-keyscan -t ed25519 <HOME_OVERLAY_IP>, a single line. ssh-keyscan prints the host as the first field, so store only the key type and key if the workflow prepends HOME_OVERLAY_IP itself when writing known_hosts.
  - INFRA_REPO_TOKEN is a fine-grained PAT scoped to Stoganet/infra only.
- Dependency updates use Dependabot; don't add renovate.json.
- auto-assign.yml was intentionally removed (low value, noisy); don't reintroduce.
- The preflight guard is non-negotiable; the silent netbird up hang it prevents is the worst kind of CI failure (no signal, eats minutes).
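A minimal sketch of the preflight guard, assuming the four secrets are mapped into the step's environment under their own names. check_required_secrets is a hypothetical helper name, not taken from the infra repo. It collects every empty or unset name before failing, so a single run reports all of them, and it emits the message through the ::error:: workflow command so it shows up as an annotation.

```shell
# Sketch only: check_required_secrets is a hypothetical helper name.
# Usage: check_required_secrets NAME [NAME...]
check_required_secrets() {
  missing=""
  for name in "$@"; do
    # Indirect lookup: read the variable whose name is stored in $name.
    # Callers pass fixed literal names, so eval is safe here.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    # Report every missing name at once rather than stopping at the first.
    echo "::error::Missing required secrets:$missing"
    return 1
  fi
  return 0
}
```

The deploy job would call it before installing NetBird, for example check_required_secrets NB_SETUP_KEY DEPLOY_SSH_KEY HOME_OVERLAY_IP HOME_SSH_HOST_KEY || exit 1, so an empty setup key stops the job before netbird up ever runs.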