fix(security): non-root USER directives for production images (#874) by AndriiPasternak31 · Pull Request #878 · Abilityai/trinity

AndriiPasternak31 · 2026-05-17T18:52:05Z

Summary

Closes #874, a CSO MEDIUM defense-in-depth gap flagged as persistent since the 2026-04-05 audit (~6 weeks).

Trinity-built production images now run as non-root:

| Service | User | UID |
| --- | --- | --- |
| backend | `trinity` | 1000 |
| scheduler | `trinity` | 1000 |
| mcp-server | `node` (built-in) | 1000 |
| frontend (prod) | `nginx` (`nginxinc/nginx-unprivileged`) | 101 |

Backend joins the host docker group via group_add: ${DOCKER_GID:-999} so UID 1000 keeps /var/run/docker.sock access on Linux. macOS Docker Desktop ignores group_add (UID translation handles it).
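
To illustrate why the GID has to be looked up per host rather than hardcoded, here is a minimal Python sketch of the lookup. It is not the code in start.sh (which, per this PR, uses getent group docker); the function name `detect_docker_gid` and its parameters are hypothetical, and the 999 fallback only mirrors the compose default quoted above.

```python
import grp

DEFAULT_DOCKER_GID = 999  # mirrors the ${DOCKER_GID:-999} compose fallback


def detect_docker_gid(group_name="docker", default=DEFAULT_DOCKER_GID, lookup=grp.getgrnam):
    """Return the GID of the host's docker group, or `default` if the group is absent.

    `lookup` is injectable so the function can be exercised without a real docker group.
    """
    try:
        return lookup(group_name).gr_gid
    except KeyError:
        # grp.getgrnam raises KeyError when the group does not exist on this host.
        return default


if __name__ == "__main__":
    # Prints a line in .env form, using whatever GID this host assigns.
    print(f"DOCKER_GID={detect_docker_gid()}")
```

On a host without a docker group the sketch falls back to 999, which is the situation where a wrong guess would surface as EACCES on the socket.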

The dev Vite frontend image (docker/frontend/Dockerfile) is intentionally exempt — it has no production attack surface.

Why it matters

Before this PR, any RCE in the FastAPI backend landed as root inside the backend container, and /var/run/docker.sock (mounted :ro) became a fleet-wide reconnaissance primitive — enumerate every agent container, owner labels, network bindings; read in-container env vars (ANTHROPIC_API_KEY, REDIS_URL with backend creds, SECRET_KEY). The socket is read-only so it's not a direct container escape, but "no USER" turned a single-point RCE into the worst kind of secondary blast radius.

What's new in this PR vs the original ticket

In addition to the four Dockerfile changes from the ticket, the branch fixes the operational gaps that came out of a /review pass on the implementation:

CI false positive — the original verify-non-root step asserted GET /api/agents returns [], but list_all_agents_fast catches every Docker exception and returns [] so the check passed even when group_add was broken. Now docker exec trinity-backend python -c "docker.from_env().ping()" runs the real round-trip.
Fresh prod install — backend bind-mounts ${TRINITY_DATA_PATH:-./trinity-data} to /data. If the host dir didn't pre-exist, Docker created it root-owned and UID 1000 couldn't create trinity.db. start.sh now pre-creates it with UID 1000.
Non-Debian Linux hosts — .env.example shipped DOCKER_GID=999 (Debian) which silently fails on RHEL/Fedora (~991) or Arch (990). .env.example now ships blank and start.sh auto-detects via getent group docker.
Upgrade path — existing deployments have root-owned trinity-data and agent-configs volumes; Docker only honours the Dockerfile chown on first volume creation. Migration procedure documented in docs/migrations/NON_ROOT_CONTAINERS_2026-05.md.
Dead capability — backend listens on 8000, no NET_BIND_SERVICE needed; removed.
Stale comments — nginx.conf and security-headers.conf referenced :80 after the switch to :8080; updated.
CSO I-01 — per-run random CI admin password replaces the prior hardcoded CiTestPassword!1 fallback.
CSO I-02 — verify-prod-frontend-uid builds the prod frontend image out-of-band (the e2e workflow boots the Vite-dev image) and asserts UID 101.
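
The CI false positive in the first item can be reproduced in isolation. The sketch below uses stand-in client classes rather than the real Docker SDK, and `list_agents_swallowing_errors` is a hypothetical imitation of the behaviour described for list_all_agents_fast, not its actual code.

```python
class BrokenDockerClient:
    """Stand-in for a client that cannot reach the socket (e.g. group_add is wrong)."""

    def ping(self):
        raise PermissionError("EACCES: /var/run/docker.sock")

    def list_containers(self):
        raise PermissionError("EACCES: /var/run/docker.sock")


class HealthyDockerClient:
    """Stand-in for a client with working socket access and no agents yet."""

    def ping(self):
        return True

    def list_containers(self):
        return []


def list_agents_swallowing_errors(client):
    """Imitates the described behaviour: any Docker error becomes an empty list."""
    try:
        return client.list_containers()
    except Exception:
        return []


def weak_check(client):
    # "Agent list is empty" also holds when Docker is unreachable.
    return list_agents_swallowing_errors(client) == []


def strong_check(client):
    # A ping either completes the round-trip or raises.
    try:
        client.ping()
        return True
    except Exception:
        return False


if __name__ == "__main__":
    for client in (HealthyDockerClient(), BrokenDockerClient()):
        print(type(client).__name__, weak_check(client), strong_check(client))
```

The weak check cannot tell the two clients apart, while the ping-based check fails only for the broken one.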

Audit details in docs/security-reports/cso-diff-2026-05-17.md.

Test plan

docker compose -f docker-compose.yml config parses cleanly
docker compose -f docker-compose.prod.yml config parses cleanly
.github/workflows/frontend-e2e.yml parses as valid YAML (15 steps, ordered)
bash -n scripts/deploy/start.sh syntax-clean
CI verify-non-root exercises the Docker socket via SDK ping (no false-positive /api/agents probe)
CI verify-prod-frontend-uid builds Dockerfile.prod and asserts UID 101
Manual smoke on a fresh Linux host: ./scripts/deploy/start.sh → admin login → create one agent (exercises containers.run)
Manual smoke on macOS Docker Desktop: same as above (group_add ignored, should still work)
Manual upgrade smoke on an existing deployment per docs/migrations/NON_ROOT_CONTAINERS_2026-05.md
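
For the upgrade smoke test, a read-only pre-check can show which paths in an existing data directory would still block UID 1000. This is a sketch under assumptions, not the procedure from the migration doc; `find_wrong_owner` is a hypothetical helper and `./trinity-data` is just the default bind-mount path named earlier.

```python
import os
import tempfile


def find_wrong_owner(root, expected_uid=1000):
    """Return paths under `root` (including `root`) not owned by `expected_uid`."""
    mismatched = []
    if os.lstat(root).st_uid != expected_uid:
        mismatched.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat reports the entry itself instead of following symlinks.
            if os.lstat(path).st_uid != expected_uid:
                mismatched.append(path)
    return mismatched


if __name__ == "__main__":
    # Demo on a throwaway directory; point it at ./trinity-data on a real host.
    demo = tempfile.mkdtemp()
    open(os.path.join(demo, "trinity.db"), "w").close()
    print(find_wrong_owner(demo, expected_uid=os.getuid()))
```

An empty result means every entry already has the expected owner; any listed path would need re-owning before the non-root backend can write to it.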

Out of scope (follow-ups)

docker/frontend/Dockerfile (dev Vite) — kept root-owned, no production attack surface.
src/frontend/Dockerfile appears orphaned (not referenced by any compose/script). Worth confirming and either deleting or documenting in a separate hygiene PR.
src/mcp-server and backend mcp-server still ship cap_add: NET_BIND_SERVICE despite binding port 8080 — pre-existing, not part of this PR.

vybe

Implementation is excellent — hardening is comprehensive, CSO I-01/I-02 follow-up fixes go beyond the original scope, and the comments make the why obvious. Two CI items to resolve before merge:

Run the new CI guards on this PR. verify-non-root and verify-prod-frontend-uid live in frontend-e2e.yml, which is ui-label-gated and was skipped on this PR. The new gates have not actually executed. Either: (a) add the ui label to this PR so they run, OR (b) move both steps to an unconditional workflow with a path filter on docker/**, docker-compose*.yml, scripts/deploy/start.sh, src/mcp-server/Dockerfile. Option (b) is the better long-term answer since this is a regression-guard for backend infrastructure, not UI.
Resolve the lint (sys.modules pollution check) failure. The failing file tests/unit/test_slot_per_slot_ttl.py is not touched by this PR — likely baseline drift on dev since the branch point. Please rebase on dev; if the failure persists, regenerate the baseline (python tests/lint_sys_modules.py --regenerate-baseline) in a separate commit on dev so this PR inherits a clean lint.

Architecture update, migration doc, and CSO report all look good. Approving once CI is green with the new guards actually running.

@vybe

Addresses @vybe review on #878. The verify-non-root and verify-prod-frontend-uid guards added in #874 lived inside frontend-e2e.yml, which is `ui`-label-gated — so backend infrastructure PRs (the exact PRs that can regress the guards) skipped them silently. Moves both steps to .github/workflows/container-security.yml with a path filter on docker/**, docker-compose*.yml, scripts/deploy/start.sh, and src/mcp-server/Dockerfile so the guards execute whenever the underlying surface changes — independent of the e2e workflow's UI gate. frontend-e2e.yml keeps the stack boot for Playwright smoke tests but no longer carries the regression guards. Architecture invariant #17 updated to point at the new workflow.
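
The effect of the path filter can be approximated as follows. This is a simplified model of glob matching (`**` spans directories, `*` stays within one path segment), not GitHub's actual matcher, and the helper names are hypothetical; only the four patterns come from the commit message above.

```python
import re

PATH_FILTERS = [
    "docker/**",
    "docker-compose*.yml",
    "scripts/deploy/start.sh",
    "src/mcp-server/Dockerfile",
]


def glob_to_regex(pattern):
    """Translate a simplified path glob into an anchored regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # may cross directory separators
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # stays inside one path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def workflow_would_trigger(changed_files, filters=PATH_FILTERS):
    """True if any changed file matches any filter pattern."""
    compiled = [glob_to_regex(p) for p in filters]
    return any(rx.match(path) for path in changed_files for rx in compiled)


if __name__ == "__main__":
    print(workflow_would_trigger(["docker/backend/Dockerfile"]))
    print(workflow_would_trigger(["src/frontend/App.vue"]))
```

Under this model a Dockerfile or compose change triggers the guards regardless of any label, while a frontend-only change does not.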

…workflow Addresses CodeQL finding flagged on PR #878 (security/code-scanning/173): the new workflow defaulted to the repository-default GITHUB_TOKEN scope, which is broader than the workflow actually uses. Pin top-level `permissions: contents: read` — the minimum needed for actions/checkout. The workflow does no PR commenting, issue updating, or security-events writes, so anything beyond `contents: read` would be unused authority.

vybe · 2026-05-22T11:56:35Z

Rebased onto current dev and pushed (da2509f9) to unblock the architecture.md regression I flagged. Resolved two merge conflicts:

docker-compose.yml: kept both env-var additions (BACKEND_AGENT_CALL_LIMIT / BACKEND_AGENT_CALL_QUEUE_TIMEOUT_S from #904 RC-1, plus your PYTHONDONTWRITEBYTECODE=1)
docs/memory/architecture.md: auto-merged cleanly; the four newer-on-dev sections (#797 retry_count, #898 mkdir endpoint, #834 Phase 1a/1b soft-delete) are preserved alongside your Invariant #17 + Container Security paragraph rewrite

Final diff against dev is now exactly the 15 files the PR advertises (+638/-28). CI re-running. Will approve once green.

Closes the CSO MEDIUM defense-in-depth gap flagged persistent since 2026-04-05: backend, scheduler, MCP server, and the production frontend ran their CMD as root. An RCE in any of them inherited root, and on the backend that meant the Docker socket bind mount turned a single RCE into fleet-wide reconnaissance.

Changes:

- docker/backend/Dockerfile, docker/scheduler/Dockerfile: new `trinity` user at UID 1000 (matched UID required; both share /data/trinity.db).
- src/mcp-server/Dockerfile: switch to the built-in `node` user (UID 1000).
- docker/frontend/Dockerfile.prod: switch to `nginxinc/nginx-unprivileged` (UID 101, binds 8080). nginx.conf + healthcheck + compose port mapping updated to match. NET_BIND_SERVICE/CHOWN/SETGID/SETUID dropped from the frontend caps (no longer needed once nginx is unprivileged).
- docker-compose{,.prod}.yml: backend joins `${DOCKER_GID:-999}` via group_add so UID 1000 retains /var/run/docker.sock access on Linux. Dead NET_BIND_SERVICE removed from backend (binds 8000, doesn't need it). PYTHONDONTWRITEBYTECODE=1 added to dev compose so uvicorn --reload stops failing on __pycache__ writes when host UID != 1000.
- scripts/deploy/start.sh: pre-creates the host bind-mount data dir with UID 1000 (the Dockerfile's chown is masked by the bind mount); auto-detects DOCKER_GID on Linux (Debian/Ubuntu=999, RHEL/Fedora=~991, Arch=990) so non-Debian hosts don't silently fail with EACCES on the socket.
- .env.example: DOCKER_GID ships blank so start.sh auto-detect kicks in. Compose still falls back to 999 if .env value is missing entirely.
- .github/workflows/frontend-e2e.yml: `verify-non-root` step asserts UID 1000 in backend/scheduler/mcp-server and exercises the Docker socket via `docker.from_env().ping()` from inside the backend (the prior `/api/agents` probe was a false positive: `list_all_agents_fast` catches every Docker exception and returns []). `verify-prod-frontend-uid` builds the prod frontend image out-of-band and asserts UID 101. Admin password is generated per-run instead of the previous hardcoded fallback (CSO I-01).
- docs/memory/architecture.md: new invariant #17 documenting the rule.
- docs/migrations/NON_ROOT_CONTAINERS_2026-05.md: upgrade procedure for existing deployments. Docker only honours the Dockerfile chown on first volume creation, so trinity-data and agent-configs volumes from prior root-running containers need to be re-owned manually.
- docs/security-reports/cso-diff-2026-05-17.md: audit report of the branch itself.

Verification (CI, fresh prod, upgrade): see the migration doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…urity The new container-security workflow calls /api/token after backend health is green, but on a fresh DB the first-time setup wizard blocks login (`setup_required`, 403) until `setup_completed=true`. Mirror the "Skip first-time setup wizard" step from frontend-e2e.yml — flip the flag directly via `docker exec trinity-backend python3 ...` so the CI sanity probe can mint a token and hit /api/agents. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vybe · 2026-05-23T10:14:33Z

Rebased on dev and resolved the cso-diff filename collision on your behalf — branch is now MERGEABLE.

What changed:

Renamed your non-root audit to docs/security-reports/cso-diff-2026-05-17-non-root.md and preserved #876's encryption-tests audit at the original cso-diff-2026-05-17.md path. Both reports are kept.
Rebased your 4 commits (5b5a604d, a672a610, d6a04861, 3b653edf) on top of current dev. No content changes to your CI or Dockerfile work.

Will re-approve and squash-merge once CI is green on 3b653edf. Heads up so you can sync your local branch to the force-pushed history cleanly.

vybe

Re-approved after rebase. All prior CR items addressed: new verify-non-root job runs unconditionally (passed 2m55s), least-privilege GITHUB_TOKEN on the new workflow, lint clean, all 6 pytest seeds + regression-diff green. Rebased the branch myself to resolve the add/add filename collision on cso-diff-2026-05-17.md — both audits preserved (yours at -non-root.md, #876's at the original path).

vybe requested changes May 18, 2026

AndriiPasternak31 force-pushed the AndriiPasternak31/issue-874 branch from 967b7e8 to 473733c Compare May 20, 2026 16:47

github-advanced-security AI found potential problems May 20, 2026

Comment thread .github/workflows/container-security.yml Fixed

vybe force-pushed the AndriiPasternak31/issue-874 branch from 25ddeb1 to da2509f Compare May 22, 2026 11:56

AndriiPasternak31 and others added 4 commits May 23, 2026 11:13

vybe force-pushed the AndriiPasternak31/issue-874 branch from da2509f to 3b653ed Compare May 23, 2026 10:14

vybe approved these changes May 23, 2026

vybe merged commit 420faea into dev May 23, 2026
14 checks passed

vybe mentioned this pull request May 23, 2026

bug: non-root prod images (#874) silently break upgrades when operators bypass start.sh #917

Closed
