Add deploy workflow with preflight secrets guard and auto-rollback #15

@koinsaari

Summary

Add .github/workflows/deploy.yml, which triggers on push to main, builds and pushes the container image to GHCR, opens an auto-merging PR against the infra repo that bumps the api-proxy image digest, joins NetBird, and SSHes into the home box to run bin/deploy.sh. The workflow rolls back automatically on healthcheck failure and dumps diagnostics on any failure.

Context

This is the last piece of the CI/CD chain. ci.yml and claude-code-review.yml already exist. The host scripts (bin/*.sh) and the local-dev compose are tracked in their own issues.

The workflow must mirror the established infra pattern, including the preflight-secrets guard that prevents the well-known netbird up hang when NB_SETUP_KEY is empty.
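A minimal sketch of what the preflight guard could look like as a shell helper inside the step's run: block. The function name check_required_secrets is hypothetical, and the sketch assumes the secrets are exposed to the step as environment variables of the same names. It collects every empty name before failing, and uses the ::error:: workflow command so the message is surfaced as an annotation.

```sh
# check_required_secrets NAME...
# Reports every empty or unset variable in one pass, then fails.
# (Hypothetical helper; assumes secrets are mapped into env vars of the same name.)
check_required_secrets() {
  missing=""
  for name in "$@"; do
    # Indirect lookup of the variable named by $name.
    # eval is acceptable here only because the names are hard-coded by the caller.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    # One message listing ALL missing names, not just the first.
    echo "::error::Missing required secrets:$missing" >&2
    return 1
  fi
  return 0
}
```

In the workflow this would be called before anything touches NetBird, for example: check_required_secrets NB_SETUP_KEY DEPLOY_SSH_KEY HOME_OVERLAY_IP HOME_SSH_HOST_KEY || exit 1.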

Cross-repo coordination: the api-proxy deploy workflow does NOT deploy directly through the infra compose. Instead, it opens a PR against Stoganet/infra bumping the api-proxy image digest, waits for that PR to merge, then SSHes in and runs the host-side deploy.sh. This preserves the "infra deploys infra" invariant: no cross-repo repository_dispatch complexity, no shared state.
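The digest bump itself could be sketched roughly as follows. The helper name bump_image_digest is hypothetical, and the sketch assumes the compose file pins the image on a line of the form image: <image>@sha256:<hex>; the real line format in Stoganet/infra may differ. It writes through a temp file rather than sed -i, since the -i flag behaves differently between GNU and BSD sed.

```sh
# bump_image_digest FILE IMAGE NEW_DIGEST
# Rewrites "image: IMAGE@sha256:<old>" to "image: IMAGE@NEW_DIGEST" in FILE.
# (Hypothetical helper; the compose line format is an assumption.)
bump_image_digest() {
  file=$1 image=$2 digest=$3
  # Refuse anything that is not a sha256 digest so bad input cannot corrupt the file.
  case "$digest" in
    sha256:*) ;;
    *) echo "invalid digest: $digest" >&2; return 1 ;;
  esac
  # Fail if no digest-pinned line exists, so a silent no-op cannot pass as a bump.
  if ! grep -q "image: *$image@sha256:" "$file"; then
    echo "no digest-pinned line for $image in $file" >&2
    return 1
  fi
  tmp=$(mktemp) || return 1
  # "|" as the sed delimiter because image names contain "/".
  sed "s|\(image: *$image\)@sha256:[0-9a-f]*|\1@$digest|" "$file" > "$tmp" || return 1
  # Copy back with cat so the original file's permissions are preserved.
  cat "$tmp" > "$file" && rm -f "$tmp"
}
```

Failing when the line is absent is deliberate: otherwise the job would open an empty PR and the host script would later reject the deploy for a reason that is harder to trace.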

Scope

  • on: triggers: push to main, plus workflow_dispatch with optional sha input for re-running older SHAs.
  • concurrency: { group: deploy, cancel-in-progress: false } so a second push waits its turn.
  • Three jobs, run in sequence via needs:
    1. build-and-push: checkout the target SHA, set up buildx, log into GHCR with GITHUB_TOKEN, push tags ghcr.io/${{ github.repository }}:<sha> and :latest. Output image_digest (from docker/build-push-action) and target_sha.
    2. bump-infra-digest: checkout Stoganet/infra using an INFRA_REPO_TOKEN PAT with contents: write + pull-requests: write. Use sed to replace the api-proxy image digest line in compose/docker-compose.yml, commit on a bump-api-proxy-<sha> branch, push, open a PR, and run gh pr merge --auto --squash. Then poll the PR state every 20s for up to 20min; fail if the PR is closed without merging or doesn't merge in time.
    3. deploy:
      • Preflight step: assert each of NB_SETUP_KEY, DEPLOY_SSH_KEY, HOME_OVERLAY_IP, HOME_SSH_HOST_KEY is non-empty. Fail loudly listing ALL missing names in one pass. This step exists specifically because netbird up with an empty --setup-key hangs without surfacing the cause.
      • Install NetBird (curl -fsSL https://pkgs.netbird.io/install.sh | sh).
      • sudo netbird up --setup-key "$NB_SETUP_KEY" --management-url https://netbird.stoganet.com:443. Wait for wt0 interface (poll up to 20s).
      • Write DEPLOY_SSH_KEY to ~/.ssh/id_ed25519 (mode 600), append HOME_OVERLAY_IP HOME_SSH_HOST_KEY to ~/.ssh/known_hosts.
      • ssh deploy@${HOME_OVERLAY_IP} "/srv/api-proxy/bin/deploy.sh <sha> <digest>".
      • On failure: SSH again to determine the previous (HEAD~1) SHA, then run ssh ... /srv/api-proxy/bin/rollback.sh $PREV, where $PREV is that SHA.
      • On any failure: SSH in and run /srv/api-proxy/bin/diagnostics.sh, capturing its output in the job log.
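The "wait for wt0" poll in the deploy job could be sketched as a small bounded retry helper. The name wait_for is hypothetical, and probing with ip link show wt0 is an assumption about how interface presence gets checked; note that the interface existing does not by itself prove the peer is connected.

```sh
# wait_for TIMEOUT_S CMD [ARGS...]
# Re-runs CMD about once per second until it succeeds or TIMEOUT_S seconds elapse.
# (Hypothetical helper.)
wait_for() {
  timeout=$1; shift
  elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "::error::timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
}
```

After netbird up, the step would then run something like wait_for 20 ip link show wt0 || exit 1, which turns a stuck tunnel into a clear, bounded failure.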

Out of scope: the host scripts themselves (separate issue), renovate.json (we use Dependabot), auto-assign.yml (intentionally not used).

Acceptance criteria

  • deploy.yml parses cleanly (GitHub Actions accepts it on push without parse errors).
  • Preflight step lists ALL missing required secrets in one go before exiting, not just the first one.
  • The infra digest-bump branch name encodes the api-proxy SHA so re-runs don't collide.
  • The infra PR poll loop has a hard timeout and distinguishable error messages for "closed without merge" vs "timed out".
  • The deploy step passes both <sha> and <digest> to bin/deploy.sh, so the host script can assert the infra checkout contains the expected digest.
  • The rollback step uses if: failure() && steps.deploy.conclusion == 'failure'.
  • Diagnostics dump runs if: failure() and ends with || true so it can't itself mask the real error.
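For the poll-loop criterion, a rough sketch of how "closed without merge" and "timed out" could be kept distinguishable. The function names pr_state and wait_for_merge and the exit codes 2 and 3 are hypothetical choices; the sketch assumes the gh CLI is authenticated (for example through GH_TOKEN set from INFRA_REPO_TOKEN). The defaults match the 20s interval and 20min limit from the Scope section.

```sh
# pr_state PR
# Prints the PR state as reported by gh: OPEN, CLOSED, or MERGED.
pr_state() {
  gh pr view "$1" --repo Stoganet/infra --json state --jq .state
}

# wait_for_merge PR [TIMEOUT_S] [INTERVAL_S]
# Returns 0 once merged, 2 if closed without merging, 3 on timeout.
# (Hypothetical helper; the exit codes are arbitrary but distinct.)
wait_for_merge() {
  pr=$1 timeout=${2:-1200} interval=${3:-20}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # A transient gh failure is treated as "unknown" and simply retried.
    state=$(pr_state "$pr") || state="UNKNOWN"
    case "$state" in
      MERGED) echo "PR $pr merged after ${elapsed}s"; return 0 ;;
      CLOSED) echo "::error::PR $pr was closed without merging" >&2; return 2 ;;
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "::error::PR $pr did not merge within ${timeout}s" >&2
  return 3
}
```

Keeping the state lookup in its own function also makes the loop easy to exercise locally by redefining pr_state with a stub.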

Notes

  • Required repo secrets (must be set before the first deploy): NB_SETUP_KEY, DEPLOY_SSH_KEY, HOME_OVERLAY_IP, HOME_SSH_HOST_KEY, INFRA_REPO_TOKEN. GITHUB_TOKEN is automatic.
  • HOME_SSH_HOST_KEY value should be the single-line output of ssh-keyscan -t ed25519 <HOME_OVERLAY_IP>, which has the form <host> ssh-ed25519 <key> and so already begins with the host address.
  • INFRA_REPO_TOKEN is a fine-grained PAT scoped to Stoganet/infra only.
  • Dependency updates use Dependabot — don't add renovate.json.
  • auto-assign.yml was intentionally removed (low value, noisy) — don't reintroduce.
  • The preflight guard is non-negotiable; the silent netbird up hang it prevents is the worst kind of CI failure (no signal, eats minutes).
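A minimal sketch of the SSH material setup, under the assumption stated in the Notes that HOME_SSH_HOST_KEY holds a complete ssh-keyscan line (host, key type, key), so it can be appended to known_hosts as-is. The helper name write_ssh_material is hypothetical, and both secrets are assumed to be available as environment variables.

```sh
# write_ssh_material SSH_DIR
# Writes DEPLOY_SSH_KEY to SSH_DIR/id_ed25519 (mode 600) and appends
# HOME_SSH_HOST_KEY to SSH_DIR/known_hosts.
# (Hypothetical helper; assumes HOME_SSH_HOST_KEY is a full known_hosts line.)
write_ssh_material() {
  ssh_dir=$1
  mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
  # Create the key under a restrictive umask so it is never briefly world-readable.
  ( umask 077 && printf '%s\n' "$DEPLOY_SSH_KEY" > "$ssh_dir/id_ed25519" ) || return 1
  # Explicit chmod covers the case where the file already existed with looser permissions.
  chmod 600 "$ssh_dir/id_ed25519" || return 1
  printf '%s\n' "$HOME_SSH_HOST_KEY" >> "$ssh_dir/known_hosts"
}
```

In the deploy job this would be invoked as write_ssh_material "$HOME/.ssh" after the preflight step has confirmed both values are non-empty.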

Labels: area:ci (CI workflow / pipeline), enhancement (New feature or request)
