Security & resilience audit fixes (38 items) by RedBoardDev · Pull Request #2 · RedBoardDev/gh-runners-tool

RedBoardDev · 2026-05-16T19:31:38Z

Summary

Series of focused fixes from the recent audit.md, covering exactly the items listed in the brief — no scope creep, every item committed independently so the history reads as an audit log.

Sécurité (11) — I.1 tar symlink traversal, I.2 SHA-256 verification of the runner tarball, I.3 pgrep workdir guards, I.4 unix socket chmod 0600, I.5 bounded HTTP clients, I.6 credentials-file perms warning, I.7 XML-escape the launchd plist, I.8 migrate to launchctl bootstrap/bootout/kickstart, I.9 copyDir rejects non-local symlinks, I.10 graceful API shutdown, I.13 Credentials.LogValue masks the PAT.
Correctness (13) — II.2 per-version EnsureBits lock + .complete marker, II.3 truncate PAT validation bodies, II.4 stable group pointer in the controller fan-out, II.5 runner_group_id config-driven, II.6 reporter/notifier dispatch outside the health mutex, II.7 API server waits for Serve to return, II.8 race-tested Status/runChecks invariant, II.11 unexport Process.cmd, II.13 drop RunnerInstance.ID, II.14 drop unused daemonState.Groups, II.15 log SIGKILL failures, II.16 extract removeStaleDir helper, II.17 fix Discord double-close.
Architecture (4) — III.6 GitHub label regex, III.9 wipe runner log dirs on cleanup, III.10 internal/state.Paths centralizes the state filenames, III.14 narrow PID to int32.
Résilience (7) — IV.1 panic-recovering safeActor wrapper, IV.2 small circuit breaker around GitHub HTTP, IV.5 cache GC keeping the 3 most-recent runner versions, IV.7 liveness watchdog that exits to let launchd respawn, IV.8 ±20%% jitter on the backoff, IV.9 post-SIGKILL Wait timeout, IV.10 single retry on Discord 5xx.
Observabilité (1) — VII.6 runner stdout/stderr emitted as JSON envelopes with group/runner/source tags.
Doc (1) — IX.1 README repository tree matches the actual packages.

audit.md has been progressively trimmed of every addressed item — what remains is the still-open work.

Stats

38 commits, all single-item, all green on go build ./... && go vet ./... && go test -race -count=1 ./....
Tests: 316 passing (was 266 before the series).
Lint: 0 new findings; 2 pre-existing reports outside the brief remain (backoffMin naming, an unused //nolint:unconvert directive).

Test plan

CI: make ci — build + vet + fmt-check + go test -race + govulncheck.
Smoke: `./ghr login` against a sacrificial org, confirm warnings for loose credentials perms and circuit breaker behaviour on a 5xx outage.
Smoke: `./ghr start --config config.yaml` then `launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.ghr.daemon.plist` (validate the new bootstrap path).
Smoke: `ghr status` while runners spin up; confirm runner logs land in groups/<grp>/runners/<r>/<date>.json as tagged JSON envelopes.
Disk: confirm old runner caches are GC'd after a version bump (keep top 3).

Stages the in-progress GitHub App login flow (installations listing, JWT signing, interactive/non-interactive wizards) as the starting point before applying the audit fix series.

Range copies the iteration value, so the previous code passed &g (the loop variable's address) to each goroutine. Go 1.22+ scopes loop vars per iteration, masking the bug — but the pattern is fragile. Index the slice directly so each goroutine captures a stable pointer to the underlying GroupConfig.

…er (II.17) The Send method deferred resp.Body.Close on the initial response, but reassigned resp on 429 retry — the deferred close then ran against the new body while the manual close still ran against the old. Factoring the retry into postWithRateLimitRetry leaves Send with a single resp and a single deferred close.

The Groups map was always written as empty and never refreshed at runtime, providing no signal to consumers of daemon.state.json. Remove it so the on-disk shape reflects what the daemon actually knows when state is written.

Discarding the kill error silently hid permission and ESRCH cases where ghr believed it had cleaned up but had not. Log the failure so operators see why an orphan persists.

The same RemoveAll + warn-log pattern was duplicated three times in cleanupStaleRunner. Factor it into a single helper so all stale-dir removals follow the same logging contract.

The POSIX kernels we target (macOS, Linux) keep PIDs in a signed 32-bit range. Tightening the type makes that constraint explicit and keeps JSON output stable across architectures. syscall.Kill still expects int, so the call site casts at the boundary.

Previously only empty labels were rejected. GitHub requires runner labels to start alphanumeric, stay within 64 chars, and use only [A-Za-z0-9._-]. Enforce the pattern at config load with table-driven regression coverage so misconfigurations fail fast instead of being reported by the scale set API at runtime.

…III.10) daemon.pid, daemon.state.json, and ghr.sock were spelled out as string literals across cli/daemon.go, cli/state.go, cli/status.go, cli/purge.go, and api/server.go. Adding or renaming any of them meant hunting for every occurrence and the cleanupStateFiles list could drift silently. Introduce internal/state.Paths so each consumer reads the path through the same helper and All() drives bulk cleanup. Also scope the runners/cache/state .gitignore entries to the repo root so internal/state/ is tracked.

Process.Cleanup wiped the workdir but left groups/<group>/runners/<name>/ behind, accumulating one empty directory per ephemeral runner. LogManager.RemoveRunnerLogs closes the tracked writer for that runner and removes the directory. The scaler now calls it after every successful Cleanup (killRunner, Shutdown, HandleJobCompleted) so the log tree shrinks with the runner set.

The random ID was embedded in Name (groupName-<id>) and never read on its own. Removing the redundant field keeps the runner identity single-sourced and trims the struct surface area.

The exec.Cmd handle is purely an implementation detail of the ProcessManager methods; nothing outside this package ever read or mutated it. Hide it so consumers cannot reach into the process state and so the struct surface stops advertising lifecycle details.

srv.Close cut in-flight requests, and the function could return while the Serve goroutine was still about to write to errCh — a use-after- free on the listener and a lost ErrServerClosed log. Replace Close with Shutdown(ctx) bounded by a 5s timeout and drain errCh before returning. Socket cleanup now lives in a defer so both branches (context cancel and Serve error) follow the same exit path.

http.DefaultClient never times out, so any stalled GitHub response (slow-loris DNS, hung TLS handshake, ratelimited connection held open) froze ghr login and auth status indefinitely. The runner binary manager had the same issue with a bare &http.Client{}. Introduce a 30s package-level client for internal/auth and a 10m client (large enough for a 100MB tarball over a slow link) for the runner BinaryManager.

validatePAT inlined the entire GitHub response body in its error, leaking rate-limit headers and other debug noise back to the user. Reuse the existing truncateBody/drainBody helpers (moved to http.go so both validation and installation flows share the same contract) and cap the excerpt at 500 chars.

Anything that logs a *Credentials would dump the raw PAT verbatim. Implement slog.LogValuer on Credentials and GitHubAppCreds so the default slog formatting routes through MaskedPAT and never leaks the token, while still surfacing method, URL, account, and key path for debugging.

text/template performs no XML escaping, so a label or path containing '<', '>', '&', or '"' would inject arbitrary keys into the generated plist (e.g. silently overriding RunAtLoad). Register an "xml" template function backed by encoding/xml.EscapeText and apply it to every operator-controlled field. Regression test asserts an injection payload becomes inert escaped text.

net.Listen on a unix path inherits the process umask, so the daemon status socket landed at 0755 — any local user could read snapshots, PIDs, and health issues. Chmod the socket immediately after Listen, tearing down the listener and the socket file on failure so callers do not end up with an over-permissive descriptor.

… (I.1) extractTarGz previously called sanitizeTarPath on the symlink target, and if that failed it silently fell back to the *original* attacker- controlled Linkname before calling os.Symlink. A forged tarball could plant a symlink such as runner -> /etc/shadow inside the cache, which copyDir would faithfully replicate and later writes would dereference. Validate Linkname with filepath.IsLocal (Go 1.20+) and reject any absolute or escaping path before creating the symlink. Regression tests cover the absolute, relative-escape, and benign local cases.

If an upgraded ghr is asked to recopy a cache that still holds a malicious symlink (or a hand-edited cache directory), copyDir would faithfully replicate the link into the workdir, exposing the runner to writes that escape the workdir tree. Validate every symlink with filepath.IsLocal during the walk and abort on absolute or escaping targets. Regression tests cover the absolute, escape, and benign relative cases.

…(I.2) downloadAndExtract trusted whatever the network handed back and then executed run.sh as the daemon. Fetch the matching .sha256 file co-published by actions/runner releases, hash the body in-stream with crypto/sha256, and reject the install on mismatch. EnsureBits already RemoveAlls the destination on error, so a tampered payload is wiped before any cmd.Start is attempted.

… (II.2) Two groups asking for the same runner version could race: the second goroutine saw run.sh appear mid-extraction, declared the cache ready, and started a runner against a half-extracted tree. A crashed download left the same incomplete cache hanging around forever. Guard each version with a sync.Mutex (keyed via sync.Map.LoadOrStore) and gate "cached" on a final .complete sentinel written only after a successful download+extract. On startup we now also detect and wipe incomplete directories instead of trusting partial state. Regression tests cover the cached, incomplete-cache, and download-cleanup paths.

pgrep -f m.workdirBase would happily match arbitrary user processes if the configured workdir lived in a noisy directory (e.g. /tmp) — every match was SIGKILLed unconditionally. Two layers of defense: 1. config validation rejects relative workdir_base, the obvious system roots (/, /tmp, /var, ...), and paths shorter than 8 chars so substring matches stay narrow. 2. KillOrphanRunners now confirms each pgrep hit with `ps -p <pid> -o command=` and only SIGKILLs when the command line still references workdir_base/.../run.sh. Config tests cover the new rejection cases.

LoadPrivateKey already refuses overly permissive RSA keys, but the credentials file (which holds the PAT and config URL) was read silently regardless of mode. A stray chmod 644 would never surface to the operator. loadFromFile now stats the file and emits a chmod-600 hint on stderr whenever group/world bits are set. Keeps the read going (no breakage on legacy installs) but makes the misconfiguration impossible to miss.

… (I.8) launchctl load/unload/start/stop are deprecated since macOS 10.10 and emit increasingly noisy warnings on modern releases. Switch the service lifecycle to the bootstrap/bootout/kickstart trio, choosing "system" or "gui/<uid>" as the domain target based on euid so both LaunchDaemons and LaunchAgents flow through the same helper.

The controller and the purge command both passed runnerGroupID=1 unconditionally, meaning any org with non-default runner groups silently misrouted scale set lookups. Surface runner_group_id in the YAML config (defaults to 1, the well-known "Default" group), validate it >= 1, and propagate the value to controller.New + purgeScaleSets.

…he lock (II.6) runChecks held m.mu for the duration of every reporter and notifier call, so a slow Discord webhook (~1s of throttle) or a hanging Uptime Kuma push would stall Status() and the next tick. Snapshot the issues, counts, reporters, and notifier under the lock, then dispatch in a goroutine after releasing it. The thread-safe noopNotifier helper now lets tests wait on dispatched events with a timeout.

…cy (II.8) II.8 noted that runChecks rebuilds m.issues under Lock while Status reads under RLock, so callers always see a coherent snapshot. Lock this invariant in place with a -race driver that interleaves 50 runChecks calls with 50 Status reads and fails if any issue field is torn.

A panic inside controller.Run, health.Run, api.Run, or the log cleaner killed the whole daemon. KeepAlive=true in the plist would restart us, but in-memory state was lost. Wrap each long-running goroutine in safeActor: deferred recover logs the panic with the actor name and stack via slog, then returns a sentinel error so oklog/run can tear the group down cleanly. Tests cover both the panic path and a normal error pass-through.

nextBackoff doubled deterministically, so all groups that failed at the same tick (e.g. during a GitHub outage) retried in lockstep and hammered the API together. Wrap the doubling in a math/rand-driven ±20% jitter that still respects backoffMax. Tests cover the bounds, the cap behaviour, and confirm successive calls vary.

After SIGKILL the previous code did `return <-done`, which would block forever if Wait never returned (zombie, reparented to launchd, etc.). Wrap that final Wait in a 5s timeout so Stop reliably surfaces stuck processes instead of stalling the scaler shutdown loop.

….10) postWithRateLimitRetry only knew about 429; a transient 502/503 caused the notification to drop on the first attempt. Add a single retry path with a 2s default backoff for any 5xx response (test helper shortens it to 10ms). Honors context cancellation while waiting so a daemon shutdown does not have to wait through the backoff.

Bumping the runner version left every previous cache directory on disk forever, which on SSD-constrained hosts ate gigabytes per minor release. After a successful EnsureBits, scan the cache directory for .complete-marked siblings, sort by modTime, and remove anything beyond the last cacheKeepVersions (=3). Incomplete caches (no marker) are left untouched and surface via the existing II.2 cleanup.

During a GitHub outage, every login/auth-status/listener-reconnect call kept hammering the API on every tick, burning rate-limit and serving no purpose. Introduce a small package-level circuit breaker around httpClient.Do that: - counts consecutive network errors and 5xx responses, - opens after 5 consecutive failures, - stays open for 60s, then half-opens to probe with a single request. Validation success / non-5xx response closes the breaker. Tests cover the threshold, success reset, half-open behaviour, and the breakable classification.

…V.7) If the controller or health monitor deadlocks, launchd still sees the process alive and never restarts us. Add a watchdog actor that probes the daemon's own /health endpoint over the unix socket every 30s. After 3 consecutive failures it logs critical and exits with code 2, which lets launchd's KeepAlive policy bring us back fresh. Tests cover the probeOK status-code classification (2xx/3xx/4xx healthy, 5xx unhealthy) and confirm connection errors are reported.

…s (VII.6) Runner output landed in date-rotated files as raw bytes, so operators could never correlate a job log line with the daemon event that provisioned the runner. Wrap the writer returned by RunnerOutputFile in a tagged JSON emitter: each newline-terminated line becomes one {time, source, group, runner, line} object. Partial lines are buffered until the next \n; Close flushes any trailing fragment.

The previous tree omitted controller/, health/, notification/, monitoring/, api/, launchd/, and state/, and labelled internal/github as a SDK adapter that does not exist by that name. Update the layout block to reflect every internal package shipped today and annotate each line with what it owns.

- api/server.go: derive the shutdown context from WithoutCancel(ctx) so contextcheck stops flagging the previous context.Background() switch while preserving the "ctx is already done" semantics. - cli/watchdog_test.go: close the response body when present in the connection-error path (bodyclose). - logging/tagged_writer.go: drop the unused source parameter; the field is always "runner" today.

Align the const block so fmt-check passes; no behaviour change.

RedBoardDev added 30 commits May 16, 2026 19:38

chore: checkpoint github app auth refactor

0c752b1

Stages the in-progress GitHub App login flow (installations listing, JWT signing, interactive/non-interactive wizards) as the starting point before applying the audit fix series.

fix(runner/cleanup): log SIGKILL failures on orphan runners (II.15)

77f4991

Discarding the kill error silently hid permission and ESRCH cases where ghr believed it had cleaned up but had not. Log the failure so operators see why an orphan persists.

refactor(runner/cleanup): extract removeStaleDir helper (II.16)

3d43316

The same RemoveAll + warn-log pattern was duplicated three times in cleanupStaleRunner. Factor it into a single helper so all stale-dir removals follow the same logging contract.

refactor(model): drop unused RunnerInstance.ID field (II.13)

06ae09a

The random ID was embedded in Name (groupName-<id>) and never read on its own. Removing the redundant field keeps the runner identity single-sourced and trims the struct surface area.

RedBoardDev added 9 commits May 16, 2026 21:16

style: gofmt watchdog.go

15d3a2f

Align the const block so fmt-check passes; no behaviour change.

RedBoardDev merged commit b1271a1 into main May 16, 2026
8 checks passed

RedBoardDev deleted the audit-fixes branch May 16, 2026 19:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Security & resilience audit fixes (38 items)#2

Security & resilience audit fixes (38 items)#2
RedBoardDev merged 39 commits into
mainfrom
audit-fixes

RedBoardDev commented May 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

RedBoardDev commented May 16, 2026

Summary

Stats

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant