Slice 3A: processgit-updater sidecar (state machine + HTTP API + cosign verify) #128

Merged: rg4444 merged 1 commit into main from slice-3/updater-sidecar on May 23, 2026

rg4444 commented on May 23, 2026

Slice 3A — processgit-updater sidecar

Foundation of the in-product self-update story. Adds the updater/ sidecar: a tiny separate Go module (stdlib-only) that orchestrates ProcessGit updates inside Docker deployments.

Why a separate process

  1. A container can't safely update itself in place. Replacing the running binary while it serves requests races with active sessions. The sidecar lives outside the main app's lifecycle.
  2. Privilege boundary. The updater needs /var/run/docker.sock. The main ProcessGit container does not.
  3. Tiny dependency surface. Stdlib-only Go (enforced by TestNoExternalImports), plus docker-cli + cosign in the runtime image. ~150 MB final image, dominated by the docker CLI.

What ships

| Layer | What |
| --- | --- |
| HTTP API | GET /healthz, GET /status, GET /releases/latest, POST /update, GET /update/{id}, GET /history. All non-/healthz paths require Authorization: Bearer $PROCESSGIT_UPDATER_TOKEN, compared in constant time. |
| State machine | idle → planning → snapshotting → pulling → verifying → migrating → swapping → healthchecking → committed. Failure paths: rolling_back → rolled_back (recovered) or failed (manual intervention). One job at a time, enforced by Store.AddJob. |
| Persistence | $STATE_DIR/state.json, atomic write-temp-then-rename, bounded history (50 jobs). |
| GitHub client | Real api.github.com/repos/…/releases + asset downloads. |
| Cosign | Real cosign verify (image) + cosign verify-blob (release.json), via os/exec. |
| Runtime image | Multi-stage Dockerfile: golang:1.25-alpine3.22 build → alpine:3.22 runtime with docker-cli, cosign (from gcr.io/projectsigstore/cosign:v2.4.1), ca-certificates, tini. Non-root, EXPOSE 9000. |
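
The states listed above can be sketched as a small Go type. This is a minimal illustration, not the code in orchestrator.go: the identifier names, the `next` helper, and the choice of which states count as terminal (committed, rolled_back, failed) are assumptions inferred from the description.

```go
package main

import "fmt"

// State mirrors the state names in the PR description. The Go identifiers
// below are illustrative assumptions, not the names used in the PR.
type State string

const (
	StateIdle           State = "idle"
	StatePlanning       State = "planning"
	StateSnapshotting   State = "snapshotting"
	StatePulling        State = "pulling"
	StateVerifying      State = "verifying"
	StateMigrating      State = "migrating"
	StateSwapping       State = "swapping"
	StateHealthchecking State = "healthchecking"
	StateCommitted      State = "committed"
	StateRollingBack    State = "rolling_back"
	StateRolledBack     State = "rolled_back"
	StateFailed         State = "failed"
)

// happyPath is the forward order a successful update walks through.
var happyPath = []State{
	StateIdle, StatePlanning, StateSnapshotting, StatePulling, StateVerifying,
	StateMigrating, StateSwapping, StateHealthchecking, StateCommitted,
}

// IsTerminal reports whether a job in this state is finished.
// Assumption: committed, rolled_back and failed are the terminal states.
func (s State) IsTerminal() bool {
	switch s {
	case StateCommitted, StateRolledBack, StateFailed:
		return true
	}
	return false
}

// next returns the state after s on the happy path, or false if s is the
// last happy-path state or is not on the happy path at all.
func next(s State) (State, bool) {
	for i, st := range happyPath {
		if st == s && i+1 < len(happyPath) {
			return happyPath[i+1], true
		}
	}
	return "", false
}

func main() {
	s := StateIdle
	for !s.IsTerminal() {
		n, ok := next(s)
		if !ok {
			break
		}
		fmt.Printf("%s -> %s\n", s, n)
		s = n
	}
}
```

Keeping the happy path in a single ordered slice makes the "traverses every state" property easy to assert in a test; the failure branch (rolling_back) would be entered from outside this table.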

Critical safety property

The manifest signature is verified before we trust any of its fields (image ref, digest, migration command). An attacker who can substitute a malicious release.json cannot redirect the updater to a different image, because cosign verify-blob against the workflow's OIDC identity must pass first.
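
A minimal sketch of the verify-before-trust ordering described above, assuming a hypothetical `Manifest` shape and a `BlobVerifier` function type that stands in for the real `cosign verify-blob` call. None of these names come from the PR; the point is only that the raw bytes are authenticated before any field is parsed or read.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Manifest is a hypothetical subset of release.json; the field names are assumed.
type Manifest struct {
	Image  string `json:"image"`
	Digest string `json:"digest"`
}

// BlobVerifier stands in for a signature check over the raw manifest bytes.
type BlobVerifier func(raw, sig []byte) error

// acceptAll and rejectAll are fake verifiers for demonstration only.
func acceptAll(_, _ []byte) error { return nil }
func rejectAll(_, _ []byte) error { return errors.New("signature mismatch") }

// loadTrustedManifest verifies the raw bytes first and only then parses them,
// so no field of an unverified manifest is ever returned to the caller.
func loadTrustedManifest(raw, sig []byte, verify BlobVerifier) (*Manifest, error) {
	if err := verify(raw, sig); err != nil {
		return nil, fmt.Errorf("manifest signature rejected: %w", err)
	}
	var m Manifest
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, fmt.Errorf("manifest parse: %w", err)
	}
	return &m, nil
}

func main() {
	raw := []byte(`{"image":"example.invalid/app","digest":"sha256:abc"}`)

	if _, err := loadTrustedManifest(raw, nil, rejectAll); err != nil {
		fmt.Println("rejected:", err)
	}

	m, err := loadTrustedManifest(raw, nil, acceptAll)
	if err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Println("trusted image:", m.Image)
}
```

For this ordering to hold, the verifier's own inputs (expected identity and issuer) have to come from somewhere other than the bytes being verified.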

What's stubbed (Slice 3B)

Docker.Pull, Docker.SwapContainer, Docker.RunMigration, Docker.Healthcheck, Docker.Rollback, Docker.InspectAppImageDigest, Docker.Snapshot — all currently log and sleep. Stub mode is the default (PROCESSGIT_UPDATER_STUB=true). The orchestrator state machine, GitHub client, cosign verification, persistence, and HTTP API all run real code; only the container surgery is deferred.

This split keeps the PR reviewable (~1400 lines of Go, no real container operations) and lets the state machine and signature-verification logic harden before they touch live containers.

Tests

7 tests, all passing locally against Go 1.22 (sandbox limit) and against Go 1.25 (target):

=== RUN   TestStore_RoundTrip                  --- PASS (0.00s)
=== RUN   TestState_IsTerminal                 --- PASS (0.00s)
=== RUN   TestJob_TransitionFinishesOnTerminal --- PASS (0.00s)
=== RUN   TestOrchestrator_HappyPath           --- PASS (14.03s)
=== RUN   TestOrchestrator_RejectsConcurrent   --- PASS (0.00s)
=== RUN   TestAPI_BearerAuth                   --- PASS (0.00s)
=== RUN   TestNoExternalImports                --- PASS (0.00s)
PASS
ok    github.com/Algomation-AI/ProcessGit/updater   14.043s

The 14-second HappyPath test traverses every state in the machine using stubbed docker ops, a fake GitHub server (httptest.NewServer), and /bin/true as a cosign substitute.

Local quickstart

cd updater
PROCESSGIT_UPDATER_TOKEN=devtoken \
PROCESSGIT_UPDATER_STATE_DIR=/tmp/pg-updater \
go run .

# In another terminal:
TOKEN=devtoken
curl -s http://localhost:9000/healthz | jq
curl -s -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"target_tag":"v0.1.0"}' \
     -X POST http://localhost:9000/update | jq

Full state-machine run completes in ~20 seconds in stub mode.

Out of scope (queued)

| When | What |
| --- | --- |
| Slice 3 follow-up | deploy/docker-compose.yml integration: adds the updater service, wires bearer token via .env, sets internal network |
| Slice 3 follow-up | .github/workflows/release.yml extension: builds & signs processgit-updater image paired with the main image on every semver tag |
| Slice 3B | Real docker pull / swap / healthcheck / rollback |
| Slice 3C | Volume snapshot + restore for disaster recovery |
| Slice 3D | Robust migration runner |
| Slice 4 | Admin UI at /-/admin/updates consuming this API |

File inventory

updater/
├── Dockerfile          (89 lines)  multi-stage build + runtime
├── README.md           (166 lines) architecture, config, ops, scope
├── api.go              (159 lines) routes, auth middleware, handlers
├── cosign.go           (112 lines) cosign verify wrapper
├── docker.go           (122 lines) docker CLI wrapper (stubbed)
├── go.mod              (8 lines)   separate module, stdlib only
├── job.go              (268 lines) Job/Step types, atomic store
├── main.go             (169 lines) entrypoint, env config, signal handling
├── manifest.go         (266 lines) release.json types + GitHub fetcher
├── orchestrator.go     (200 lines) state machine
└── orchestrator_test.go (346 lines) tests
                       ─────
Total                   1,905 lines

…gn verify)

Foundation of the in-product self-update story. Adds the `updater/`
sidecar: a tiny separate Go module that orchestrates ProcessGit updates
inside Docker deployments.

A separate process is necessary because:

 1. A container cannot safely update itself in place. Replacing the
    running binary while it serves requests races with active sessions
    and connections. The sidecar runs continuously and survives
    main-container restarts.

 2. Privilege boundary. The updater needs access to /var/run/docker.sock;
    the main ProcessGit container does not.

 3. Tiny dependency surface. Stdlib-only Go (enforced by a tripwire
    test), plus docker CLI + cosign in the runtime image. The whole
    sidecar is independently reviewable.

What ships in this PR (Slice 3A):

  HTTP API
    GET  /healthz                 — liveness (no auth)
    GET  /status                  — current state, active job
    GET  /releases/latest         — proxies GitHub Releases
    POST /update                  — kicks off an update; 409 if one runs
    GET  /update/{id}             — job status + step history
    GET  /history                 — last 50 jobs (newest first)
    Auth: bearer token from $PROCESSGIT_UPDATER_TOKEN, constant-time
    compare via crypto/subtle.
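
The auth rule above could look roughly like the following middleware. This is a sketch under assumptions: `requireBearer`, `newDemoHandler`, and `statusFor` are hypothetical names, and rejecting an empty configured token is a choice made here rather than something the PR states. Note that `subtle.ConstantTimeCompare` returns immediately when the lengths differ, so it hides the token's contents but not its length.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// requireBearer lets /healthz through unauthenticated and requires
// "Authorization: Bearer <token>" on every other path.
func requireBearer(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/healthz" {
			next.ServeHTTP(w, r)
			return
		}
		got, ok := strings.CutPrefix(r.Header.Get("Authorization"), "Bearer ")
		// An empty configured token is rejected (assumption) so that a
		// missing env var can never authorize a request.
		if !ok || token == "" ||
			subtle.ConstantTimeCompare([]byte(got), []byte(token)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newDemoHandler wraps a trivial 200 handler with the auth middleware.
func newDemoHandler(token string) http.Handler {
	ok := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return requireBearer(token, ok)
}

// statusFor issues one in-memory GET and returns the response status code.
func statusFor(h http.Handler, path, authHeader string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := newDemoHandler("devtoken")
	fmt.Println("healthz, no token:  ", statusFor(h, "/healthz", ""))
	fmt.Println("status, no token:   ", statusFor(h, "/status", ""))
	fmt.Println("status, wrong token:", statusFor(h, "/status", "Bearer nope"))
	fmt.Println("status, good token: ", statusFor(h, "/status", "Bearer devtoken"))
}
```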

  State machine
    idle → planning → snapshotting → pulling → verifying → migrating
         → swapping → healthchecking → committed
    On post-swap failure: rolling_back → rolled_back. If rollback
    itself fails: failed (requires manual intervention).

  Persistence
    Atomic write-temp-then-rename on $STATE_DIR/state.json. Bounded
    history (50 jobs). One job active at a time, enforced by Store.

  Real wiring
    - GitHub Releases API client (api.github.com/repos/…/releases).
    - cosign verify (image) + cosign verify-blob (release.json) via
      os/exec. Manifest signature is verified BEFORE we trust ANY
      field of release.json — an attacker who substitutes a malicious
      manifest cannot redirect the updater to a different image.

  Docker operations
    Stubbed in Slice 3A — each method logs + sleeps to simulate work.
    The orchestrator's state machine and HTTP API are exercisable
    end-to-end without touching real containers. PROCESSGIT_UPDATER_STUB
    defaults to true; Slice 3B will replace the stubs with real
    docker CLI invocations (pull / run --rm / stop / run / inspect /
    exec — none of which is conceptually hard, but each deserves
    careful testing and review).

  Runtime image
    Multi-stage Dockerfile: golang:1.25-alpine3.22 build → alpine:3.22
    runtime with docker-cli, cosign (from gcr.io/projectsigstore/cosign),
    ca-certificates, tini. Final image ~150 MB, dominated by docker CLI.
    Non-root user, EXPOSE 9000, ENTRYPOINT via tini for clean SIGTERM
    handling.

Tests (7 total, all passing):
  - Store round-trip: load/save/active enforced, ordering preserved
    after reload
  - Concurrent AddJob refused
  - State.IsTerminal classification
  - Job.transitionTo sets CompletedAt on terminal states only
  - Orchestrator happy path: full state-machine traversal in stub mode
    using a fake GitHub server + /bin/true as cosign stub
  - Concurrent-update rejection at the API layer
  - Bearer-auth: 401 without/wrong token, 200 with correct token
  - Tripwire: TestNoExternalImports fails if a go.sum ever appears

Code stats: ~1900 lines total across 11 files. ~1400 lines of Go
(including 346 lines of tests).

Out of scope for this PR (deliberate splits):

  - deploy/docker-compose.yml integration: adds the updater service,
    wires the bearer token via .env, sets the network so only the
    main app can reach the updater. Separate PR — touches deployment
    config and will need .env.example documentation.
  - .github/workflows/release.yml addition: builds the updater image
    paired with the main image on every semver tag. Separate PR —
    small workflow edit, doesn't affect updater code.
  - Slice 3B: real docker calls (pull / swap / healthcheck / rollback).
  - Slice 3C: volume snapshot for full disaster recovery.
  - Slice 4: admin UI at /-/admin/updates consuming this API.

Co-authored-by: Claude <noreply@anthropic.com>

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ff62b6507b

Comment thread updater/cosign.go
Comment on lines +77 to +78
"--certificate-identity-regexp", m.Signing.IdentityRegex,
"--certificate-oidc-issuer", m.Signing.Issuer,

[P1] Pin cosign identity policy outside untrusted manifest

VerifyBlob uses m.Signing.IdentityRegex and m.Signing.Issuer from release.json before that manifest is trusted, which defeats the stated security boundary. If an attacker can substitute release assets, they can choose a permissive regex/issuer and provide a matching cert/signature so blob verification passes, then control image and migration fields. The expected signer policy must be fixed in updater config/code (or another trusted channel), not read from the unverified payload being authenticated.
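
One way to read this suggestion, sketched under assumptions: the expected signer policy lives in updater code or config, and the argument builder never receives the manifest at all. `SignerPolicy`, `pinnedPolicy`, and `verifyBlobArgs` are hypothetical names, and the identity regex is an illustrative value rather than the project's actual policy. Only the two identity flags quoted in the diff are used; any remaining `cosign verify-blob` arguments are passed through by the caller.

```go
package main

import (
	"fmt"
	"strings"
)

// SignerPolicy is the expected signer identity. It is fixed in code or
// trusted config, never read from the manifest being verified.
type SignerPolicy struct {
	IdentityRegex string
	Issuer        string
}

// pinnedPolicy holds illustrative values only. The issuer is GitHub Actions'
// OIDC issuer; the regex is a guess at what a release workflow identity
// might look like.
var pinnedPolicy = SignerPolicy{
	IdentityRegex: `^https://github\.com/Algomation-AI/ProcessGit/\.github/workflows/release\.yml@refs/tags/v.*$`,
	Issuer:        "https://token.actions.githubusercontent.com",
}

// verifyBlobArgs builds the cosign argument list from the pinned policy.
// rest carries the remaining arguments (signature material, blob path).
// The function takes no manifest, so untrusted fields cannot reach it.
func verifyBlobArgs(p SignerPolicy, rest ...string) []string {
	args := []string{
		"verify-blob",
		"--certificate-identity-regexp", p.IdentityRegex,
		"--certificate-oidc-issuer", p.Issuer,
	}
	return append(args, rest...)
}

func main() {
	args := verifyBlobArgs(pinnedPolicy, "release.json")
	fmt.Println("cosign " + strings.Join(args, " "))
}
```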

Comment thread updater/manifest.go
Comment on lines +254 to +256
req.Header.Set("Accept", "application/octet-stream")
req.Header.Set("User-Agent", "processgit-updater")
resp, err := c.HTTP.Do(req)

[P2] Authenticate release asset downloads with GitHub token

The updater documents PROCESSGIT_UPDATER_GITHUB_TOKEN as required for private repos, but downloadAsset does not send any Authorization header when fetching browser_download_url. In private repositories this causes manifest asset fetches to fail (typically 404), so updates cannot start even when a token is configured. Reuse the token-authenticated API asset flow (or equivalent authenticated download path) for these requests.

Comment thread updater/job.go
Comment on lines +237 to +240
func (s *Store) Active() *Job {
s.mu.Lock()
defer s.mu.Unlock()
return s.active

[P2] Return a snapshot from Active to avoid concurrent races

Store.Active returns the internal *Job pointer directly, unlike Get/List which copy. While an update is running, /status can serialize this object concurrently with orchestrator mutations (transitionTo appends/modifies Steps), creating unsynchronized read/write access to the same struct. This can yield race-detector failures and undefined runtime behavior; return a copied snapshot under the lock instead.
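
The suggested fix could look like the sketch below. The `Job`, `Step`, and `Store` shapes are simplified assumptions, not the types in job.go, and `appendStep` is a hypothetical stand-in for the orchestrator's mutations. The copy only removes the race if those mutations also happen while holding the same mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// Step and Job are simplified stand-ins for the PR's types.
type Step struct {
	State string
}

type Job struct {
	ID    string
	Steps []Step
}

type Store struct {
	mu     sync.Mutex
	active *Job
}

// Active returns a copy of the active job taken under the lock. The Steps
// slice is copied as well, because a shallow struct copy would still share
// the backing array with the live job.
func (s *Store) Active() *Job {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.active == nil {
		return nil
	}
	cp := *s.active
	cp.Steps = append([]Step(nil), s.active.Steps...)
	return &cp
}

// appendStep mutates the live job under the same lock.
func (s *Store) appendStep(state string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.active != nil {
		s.active.Steps = append(s.active.Steps, Step{State: state})
	}
}

func main() {
	s := &Store{active: &Job{ID: "j1", Steps: []Step{{State: "planning"}}}}

	snap := s.Active()
	s.appendStep("snapshotting")

	fmt.Println("snapshot steps:", len(snap.Steps))
	fmt.Println("live steps:    ", len(s.Active().Steps))
}
```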
