Skip to content

Security & resilience audit fixes (38 items)#2

Merged
RedBoardDev merged 39 commits into
mainfrom
audit-fixes
May 16, 2026
Merged

Security & resilience audit fixes (38 items)#2
RedBoardDev merged 39 commits into
mainfrom
audit-fixes

Conversation

@RedBoardDev
Copy link
Copy Markdown
Owner

Summary

Series of focused fixes from the recent audit.md, covering exactly the items listed in the brief — no scope creep, every item committed independently so the history reads as an audit log.

  • Sécurité (11) — I.1 tar symlink traversal, I.2 SHA-256 verification of the runner tarball, I.3 pgrep workdir guards, I.4 unix socket chmod 0600, I.5 bounded HTTP clients, I.6 credentials-file perms warning, I.7 XML-escape the launchd plist, I.8 migrate to launchctl bootstrap/bootout/kickstart, I.9 copyDir rejects non-local symlinks, I.10 graceful API shutdown, I.13 Credentials.LogValue masks the PAT.
  • Correctness (13) — II.2 per-version EnsureBits lock + .complete marker, II.3 truncate PAT validation bodies, II.4 stable group pointer in the controller fan-out, II.5 runner_group_id config-driven, II.6 reporter/notifier dispatch outside the health mutex, II.7 API server waits for Serve to return, II.8 race-tested Status/runChecks invariant, II.11 unexport Process.cmd, II.13 drop RunnerInstance.ID, II.14 drop unused daemonState.Groups, II.15 log SIGKILL failures, II.16 extract removeStaleDir helper, II.17 fix Discord double-close.
  • Architecture (4) — III.6 GitHub label regex, III.9 wipe runner log dirs on cleanup, III.10 internal/state.Paths centralizes the state filenames, III.14 narrow PID to int32.
  • Résilience (7) — IV.1 panic-recovering safeActor wrapper, IV.2 small circuit breaker around GitHub HTTP, IV.5 cache GC keeping the 3 most-recent runner versions, IV.7 liveness watchdog that exits to let launchd respawn, IV.8 ±20%% jitter on the backoff, IV.9 post-SIGKILL Wait timeout, IV.10 single retry on Discord 5xx.
  • Observabilité (1) — VII.6 runner stdout/stderr emitted as JSON envelopes with group/runner/source tags.
  • Doc (1) — IX.1 README repository tree matches the actual packages.

audit.md has been progressively trimmed of every addressed item — what remains is the still-open work.

Stats

  • 38 commits, all single-item, all green on go build ./... && go vet ./... && go test -race -count=1 ./....
  • Tests: 316 passing (was 266 before the series).
  • Lint: 0 new findings; 2 pre-existing reports outside the brief remain (backoffMin naming, an unused //nolint:unconvert directive).

Test plan

  • CI: make ci — build + vet + fmt-check + go test -race + govulncheck.
  • Smoke: `./ghr login` against a sacrificial org, confirm warnings for loose credentials perms and circuit breaker behaviour on a 5xx outage.
  • Smoke: `./ghr start --config config.yaml` then `launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.ghr.daemon.plist` (validate the new bootstrap path).
  • Smoke: `ghr status` while runners spin up; confirm runner logs land in groups/<grp>/runners/<r>/<date>.json as tagged JSON envelopes.
  • Disk: confirm old runner caches are GC'd after a version bump (keep top 3).

Stages the in-progress GitHub App login flow (installations listing,
JWT signing, interactive/non-interactive wizards) as the starting
point before applying the audit fix series.
Range copies the iteration value, so the previous code passed &g (the
loop variable's address) to each goroutine. Go 1.22+ scopes loop vars
per iteration, masking the bug — but the pattern is fragile.

Index the slice directly so each goroutine captures a stable pointer
to the underlying GroupConfig.
…er (II.17)

The Send method deferred resp.Body.Close on the initial response, but
reassigned resp on 429 retry — the deferred close then ran against the
new body while the manual close still ran against the old. Factoring
the retry into postWithRateLimitRetry leaves Send with a single resp
and a single deferred close.
The Groups map was always written as empty and never refreshed at
runtime, providing no signal to consumers of daemon.state.json.
Remove it so the on-disk shape reflects what the daemon actually
knows when state is written.
Discarding the kill error silently hid permission and ESRCH cases
where ghr believed it had cleaned up but had not. Log the failure so
operators see why an orphan persists.
The same RemoveAll + warn-log pattern was duplicated three times in
cleanupStaleRunner. Factor it into a single helper so all stale-dir
removals follow the same logging contract.
The POSIX kernels we target (macOS, Linux) keep PIDs in a signed 32-bit
range. Tightening the type makes that constraint explicit and keeps
JSON output stable across architectures. syscall.Kill still expects
int, so the call site casts at the boundary.
Previously only empty labels were rejected. GitHub requires runner
labels to start alphanumeric, stay within 64 chars, and use only
[A-Za-z0-9._-]. Enforce the pattern at config load with table-driven
regression coverage so misconfigurations fail fast instead of being
reported by the scale set API at runtime.
…III.10)

daemon.pid, daemon.state.json, and ghr.sock were spelled out as string
literals across cli/daemon.go, cli/state.go, cli/status.go,
cli/purge.go, and api/server.go. Adding or renaming any of them meant
hunting for every occurrence and the cleanupStateFiles list could
drift silently. Introduce internal/state.Paths so each consumer reads
the path through the same helper and All() drives bulk cleanup.

Also scope the runners/cache/state .gitignore entries to the repo root
so internal/state/ is tracked.
Process.Cleanup wiped the workdir but left groups/<group>/runners/<name>/
behind, accumulating one empty directory per ephemeral runner.

LogManager.RemoveRunnerLogs closes the tracked writer for that runner
and removes the directory. The scaler now calls it after every
successful Cleanup (killRunner, Shutdown, HandleJobCompleted) so the
log tree shrinks with the runner set.
The random ID was embedded in Name (groupName-<id>) and never read on
its own. Removing the redundant field keeps the runner identity
single-sourced and trims the struct surface area.
The exec.Cmd handle is purely an implementation detail of the
ProcessManager methods; nothing outside this package ever read or
mutated it. Hide it so consumers cannot reach into the process state
and so the struct surface stops advertising lifecycle details.
srv.Close cut in-flight requests, and the function could return while
the Serve goroutine was still about to write to errCh — a use-after-
free on the listener and a lost ErrServerClosed log.

Replace Close with Shutdown(ctx) bounded by a 5s timeout and drain
errCh before returning. Socket cleanup now lives in a defer so both
branches (context cancel and Serve error) follow the same exit path.
http.DefaultClient never times out, so any stalled GitHub response
(slow-loris DNS, hung TLS handshake, ratelimited connection held open)
froze ghr login and auth status indefinitely. The runner binary
manager had the same issue with a bare &http.Client{}.

Introduce a 30s package-level client for internal/auth and a 10m
client (large enough for a 100MB tarball over a slow link) for the
runner BinaryManager.
validatePAT inlined the entire GitHub response body in its error,
leaking rate-limit headers and other debug noise back to the user.
Reuse the existing truncateBody/drainBody helpers (moved to http.go
so both validation and installation flows share the same contract)
and cap the excerpt at 500 chars.
Anything that logs a *Credentials would dump the raw PAT verbatim.
Implement slog.LogValuer on Credentials and GitHubAppCreds so the
default slog formatting routes through MaskedPAT and never leaks the
token, while still surfacing method, URL, account, and key path for
debugging.
text/template performs no XML escaping, so a label or path containing
'<', '>', '&', or '"' would inject arbitrary keys into the generated
plist (e.g. silently overriding RunAtLoad). Register an "xml" template
function backed by encoding/xml.EscapeText and apply it to every
operator-controlled field. Regression test asserts an injection
payload becomes inert escaped text.
net.Listen on a unix path inherits the process umask, so the daemon
status socket landed at 0755 — any local user could read snapshots,
PIDs, and health issues. Chmod the socket immediately after Listen,
tearing down the listener and the socket file on failure so callers
do not end up with an over-permissive descriptor.
… (I.1)

extractTarGz previously called sanitizeTarPath on the symlink target,
and if that failed it silently fell back to the *original* attacker-
controlled Linkname before calling os.Symlink. A forged tarball could
plant a symlink such as runner -> /etc/shadow inside the cache, which
copyDir would faithfully replicate and later writes would dereference.

Validate Linkname with filepath.IsLocal (Go 1.20+) and reject any
absolute or escaping path before creating the symlink. Regression tests
cover the absolute, relative-escape, and benign local cases.
If an upgraded ghr is asked to recopy a cache that still holds a
malicious symlink (or a hand-edited cache directory), copyDir would
faithfully replicate the link into the workdir, exposing the runner
to writes that escape the workdir tree.

Validate every symlink with filepath.IsLocal during the walk and
abort on absolute or escaping targets. Regression tests cover the
absolute, escape, and benign relative cases.
…(I.2)

downloadAndExtract trusted whatever the network handed back and then
executed run.sh as the daemon. Fetch the matching .sha256 file
co-published by actions/runner releases, hash the body in-stream with
crypto/sha256, and reject the install on mismatch. EnsureBits already
RemoveAlls the destination on error, so a tampered payload is wiped
before any cmd.Start is attempted.
… (II.2)

Two groups asking for the same runner version could race: the second
goroutine saw run.sh appear mid-extraction, declared the cache ready,
and started a runner against a half-extracted tree. A crashed download
left the same incomplete cache hanging around forever.

Guard each version with a sync.Mutex (keyed via sync.Map.LoadOrStore)
and gate "cached" on a final .complete sentinel written only after a
successful download+extract. On startup we now also detect and wipe
incomplete directories instead of trusting partial state. Regression
tests cover the cached, incomplete-cache, and download-cleanup paths.
pgrep -f m.workdirBase would happily match arbitrary user processes if
the configured workdir lived in a noisy directory (e.g. /tmp) — every
match was SIGKILLed unconditionally.

Two layers of defense:
1. config validation rejects relative workdir_base, the obvious
   system roots (/, /tmp, /var, ...), and paths shorter than 8 chars
   so substring matches stay narrow.
2. KillOrphanRunners now confirms each pgrep hit with `ps -p <pid>
   -o command=` and only SIGKILLs when the command line still
   references workdir_base/.../run.sh.

Config tests cover the new rejection cases.
LoadPrivateKey already refuses overly permissive RSA keys, but the
credentials file (which holds the PAT and config URL) was read
silently regardless of mode. A stray chmod 644 would never surface to
the operator.

loadFromFile now stats the file and emits a chmod-600 hint on stderr
whenever group/world bits are set. Keeps the read going (no breakage
on legacy installs) but makes the misconfiguration impossible to miss.
… (I.8)

launchctl load/unload/start/stop are deprecated since macOS 10.10 and
emit increasingly noisy warnings on modern releases. Switch the
service lifecycle to the bootstrap/bootout/kickstart trio, choosing
"system" or "gui/<uid>" as the domain target based on euid so both
LaunchDaemons and LaunchAgents flow through the same helper.
The controller and the purge command both passed runnerGroupID=1
unconditionally, meaning any org with non-default runner groups
silently misrouted scale set lookups. Surface runner_group_id in the
YAML config (defaults to 1, the well-known "Default" group), validate
it >= 1, and propagate the value to controller.New + purgeScaleSets.
…he lock (II.6)

runChecks held m.mu for the duration of every reporter and notifier
call, so a slow Discord webhook (~1s of throttle) or a hanging Uptime
Kuma push would stall Status() and the next tick. Snapshot the issues,
counts, reporters, and notifier under the lock, then dispatch in a
goroutine after releasing it. The thread-safe noopNotifier helper now
lets tests wait on dispatched events with a timeout.
…cy (II.8)

II.8 noted that runChecks rebuilds m.issues under Lock while Status
reads under RLock, so callers always see a coherent snapshot. Lock
this invariant in place with a -race driver that interleaves 50
runChecks calls with 50 Status reads and fails if any issue field is
torn.
A panic inside controller.Run, health.Run, api.Run, or the log cleaner
killed the whole daemon. KeepAlive=true in the plist would restart us,
but in-memory state was lost.

Wrap each long-running goroutine in safeActor: deferred recover logs
the panic with the actor name and stack via slog, then returns a
sentinel error so oklog/run can tear the group down cleanly. Tests
cover both the panic path and a normal error pass-through.
nextBackoff doubled deterministically, so all groups that failed at
the same tick (e.g. during a GitHub outage) retried in lockstep and
hammered the API together. Wrap the doubling in a math/rand-driven
±20% jitter that still respects backoffMax. Tests cover the bounds,
the cap behaviour, and confirm successive calls vary.
After SIGKILL the previous code did `return <-done`, which would block
forever if Wait never returned (zombie, reparented to launchd, etc.).
Wrap that final Wait in a 5s timeout so Stop reliably surfaces stuck
processes instead of stalling the scaler shutdown loop.
….10)

postWithRateLimitRetry only knew about 429; a transient 502/503 caused
the notification to drop on the first attempt. Add a single retry path
with a 2s default backoff for any 5xx response (test helper shortens
it to 10ms). Honors context cancellation while waiting so a daemon
shutdown does not have to wait through the backoff.
Bumping the runner version left every previous cache directory on
disk forever, which on SSD-constrained hosts ate gigabytes per minor
release. After a successful EnsureBits, scan the cache directory for
.complete-marked siblings, sort by modTime, and remove anything beyond
the last cacheKeepVersions (=3). Incomplete caches (no marker) are
left untouched and surface via the existing II.2 cleanup.
During a GitHub outage, every login/auth-status/listener-reconnect call
kept hammering the API on every tick, burning rate-limit and serving
no purpose. Introduce a small package-level circuit breaker around
httpClient.Do that:

- counts consecutive network errors and 5xx responses,
- opens after 5 consecutive failures,
- stays open for 60s, then half-opens to probe with a single request.

Validation success / non-5xx response closes the breaker. Tests cover
the threshold, success reset, half-open behaviour, and the breakable
classification.
…V.7)

If the controller or health monitor deadlocks, launchd still sees the
process alive and never restarts us. Add a watchdog actor that probes
the daemon's own /health endpoint over the unix socket every 30s.
After 3 consecutive failures it logs critical and exits with code 2,
which lets launchd's KeepAlive policy bring us back fresh.

Tests cover the probeOK status-code classification (2xx/3xx/4xx healthy,
5xx unhealthy) and confirm connection errors are reported.
…s (VII.6)

Runner output landed in date-rotated files as raw bytes, so operators
could never correlate a job log line with the daemon event that
provisioned the runner. Wrap the writer returned by RunnerOutputFile
in a tagged JSON emitter: each newline-terminated line becomes one
{time, source, group, runner, line} object. Partial lines are
buffered until the next \n; Close flushes any trailing fragment.
The previous tree omitted controller/, health/, notification/,
monitoring/, api/, launchd/, and state/, and labelled internal/github
as a SDK adapter that does not exist by that name. Update the layout
block to reflect every internal package shipped today and annotate
each line with what it owns.
- api/server.go: derive the shutdown context from WithoutCancel(ctx)
  so contextcheck stops flagging the previous context.Background()
  switch while preserving the "ctx is already done" semantics.
- cli/watchdog_test.go: close the response body when present in the
  connection-error path (bodyclose).
- logging/tagged_writer.go: drop the unused source parameter; the
  field is always "runner" today.
Align the const block so fmt-check passes; no behaviour change.
@RedBoardDev RedBoardDev merged commit b1271a1 into main May 16, 2026
8 checks passed
@RedBoardDev RedBoardDev deleted the audit-fixes branch May 16, 2026 19:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant