fix(security): non-root USER directives for production images (#874)#878
Conversation
vybe
left a comment
There was a problem hiding this comment.
Implementation is excellent — hardening is comprehensive, CSO I-01/I-02 follow-up fixes go beyond the original scope, and the comments make the why obvious. Two CI items to resolve before merge:
-
Run the new CI guards on this PR.
verify-non-rootandverify-prod-frontend-uidlive infrontend-e2e.yml, which isui-label-gated and was skipped on this PR. The new gates have not actually executed. Either: (a) add theuilabel to this PR so they run, OR (b) move both steps to an unconditional workflow with a path filter ondocker/**,docker-compose*.yml,scripts/deploy/start.sh,src/mcp-server/Dockerfile. Option (b) is the better long-term answer since this is a regression-guard for backend infrastructure, not UI. -
Resolve the
lint (sys.modules pollution check)failure. The failing filetests/unit/test_slot_per_slot_ttl.pyis not touched by this PR — likely baseline drift ondevsince the branch point. Please rebase ondev; if the failure persists, regenerate the baseline (python tests/lint_sys_modules.py --regenerate-baseline) in a separate commit ondevso this PR inherits a clean lint.
Architecture update, migration doc, and CSO report all look good. Approving once CI is green with the new guards actually running.
Addresses @vybe review on #878. The verify-non-root and verify-prod-frontend-uid guards added in #874 lived inside frontend-e2e.yml, which is `ui`-label-gated — so backend infrastructure PRs (the exact PRs that can regress the guards) skipped them silently. Moves both steps to .github/workflows/container-security.yml with a path filter on docker/**, docker-compose*.yml, scripts/deploy/start.sh, and src/mcp-server/Dockerfile so the guards execute whenever the underlying surface changes — independent of the e2e workflow's UI gate. frontend-e2e.yml keeps the stack boot for Playwright smoke tests but no longer carries the regression guards. Architecture invariant #17 updated to point at the new workflow.
967b7e8 to
473733c
Compare
…workflow Addresses CodeQL finding flagged on PR #878 (security/code-scanning/173): the new workflow defaulted to the repository-default GITHUB_TOKEN scope, which is broader than the workflow actually uses. Pin top-level `permissions: contents: read` — the minimum needed for actions/checkout. The workflow does no PR commenting, issue updating, or security-events writes, so anything beyond `contents: read` would be unused authority.
Addresses @vybe review on #878. The verify-non-root and verify-prod-frontend-uid guards added in #874 lived inside frontend-e2e.yml, which is `ui`-label-gated — so backend infrastructure PRs (the exact PRs that can regress the guards) skipped them silently. Moves both steps to .github/workflows/container-security.yml with a path filter on docker/**, docker-compose*.yml, scripts/deploy/start.sh, and src/mcp-server/Dockerfile so the guards execute whenever the underlying surface changes — independent of the e2e workflow's UI gate. frontend-e2e.yml keeps the stack boot for Playwright smoke tests but no longer carries the regression guards. Architecture invariant #17 updated to point at the new workflow.
…workflow Addresses CodeQL finding flagged on PR #878 (security/code-scanning/173): the new workflow defaulted to the repository-default GITHUB_TOKEN scope, which is broader than the workflow actually uses. Pin top-level `permissions: contents: read` — the minimum needed for actions/checkout. The workflow does no PR commenting, issue updating, or security-events writes, so anything beyond `contents: read` would be unused authority.
25ddeb1 to
da2509f
Compare
|
Rebased onto current
Final diff against |
Closes the CSO MEDIUM defense-in-depth gap flagged persistent since
2026-04-05: backend, scheduler, MCP server, and the production frontend
ran their CMD as root. An RCE in any of them inherited root, and on the
backend that meant the Docker socket bind mount turned a single RCE into
fleet-wide reconnaissance.
Changes:
- docker/backend/Dockerfile, docker/scheduler/Dockerfile: new `trinity`
user at UID 1000 (matched UID required — both share /data/trinity.db).
- src/mcp-server/Dockerfile: switch to the built-in `node` user (UID 1000).
- docker/frontend/Dockerfile.prod: switch to `nginxinc/nginx-unprivileged`
(UID 101, binds 8080). nginx.conf + healthcheck + compose port mapping
updated to match. NET_BIND_SERVICE/CHOWN/SETGID/SETUID dropped from the
frontend caps (no longer needed once nginx is unprivileged).
- docker-compose{,.prod}.yml: backend joins `${DOCKER_GID:-999}` via
group_add so UID 1000 retains /var/run/docker.sock access on Linux.
Dead NET_BIND_SERVICE removed from backend (binds 8000, doesn't need
it). PYTHONDONTWRITEBYTECODE=1 added to dev compose so uvicorn --reload
stops failing on __pycache__ writes when host UID != 1000.
- scripts/deploy/start.sh: pre-creates the host bind-mount data dir with
UID 1000 (the Dockerfile's chown is masked by the bind mount); auto-
detects DOCKER_GID on Linux (Debian/Ubuntu=999, RHEL/Fedora=~991,
Arch=990) so non-Debian hosts don't silently fail with EACCES on the
socket.
- .env.example: DOCKER_GID ships blank so start.sh auto-detect kicks in.
Compose still falls back to 999 if .env value is missing entirely.
- .github/workflows/frontend-e2e.yml: `verify-non-root` step asserts UID
1000 in backend/scheduler/mcp-server and exercises the Docker socket
via `docker.from_env().ping()` from inside the backend (the prior
`/api/agents` probe was a false positive — `list_all_agents_fast`
catches every Docker exception and returns []). `verify-prod-frontend-uid`
builds the prod frontend image out-of-band and asserts UID 101. Admin
password is generated per-run instead of the previous hardcoded
fallback (CSO I-01).
- docs/memory/architecture.md: new invariant #17 documenting the rule.
- docs/migrations/NON_ROOT_CONTAINERS_2026-05.md: upgrade procedure for
existing deployments — Docker only honours the Dockerfile chown on
first volume creation, so trinity-data and agent-configs volumes from
prior root-running containers need to be re-owned manually.
- docs/security-reports/cso-diff-2026-05-17.md: audit report of the
branch itself.
Verification (CI, fresh prod, upgrade): see the migration doc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses @vybe review on #878. The verify-non-root and verify-prod-frontend-uid guards added in #874 lived inside frontend-e2e.yml, which is `ui`-label-gated — so backend infrastructure PRs (the exact PRs that can regress the guards) skipped them silently. Moves both steps to .github/workflows/container-security.yml with a path filter on docker/**, docker-compose*.yml, scripts/deploy/start.sh, and src/mcp-server/Dockerfile so the guards execute whenever the underlying surface changes — independent of the e2e workflow's UI gate. frontend-e2e.yml keeps the stack boot for Playwright smoke tests but no longer carries the regression guards. Architecture invariant #17 updated to point at the new workflow.
…workflow Addresses CodeQL finding flagged on PR #878 (security/code-scanning/173): the new workflow defaulted to the repository-default GITHUB_TOKEN scope, which is broader than the workflow actually uses. Pin top-level `permissions: contents: read` — the minimum needed for actions/checkout. The workflow does no PR commenting, issue updating, or security-events writes, so anything beyond `contents: read` would be unused authority.
…urity The new container-security workflow calls /api/token after backend health is green, but on a fresh DB the first-time setup wizard blocks login (`setup_required`, 403) until `setup_completed=true`. Mirror the "Skip first-time setup wizard" step from frontend-e2e.yml — flip the flag directly via `docker exec trinity-backend python3 ...` so the CI sanity probe can mint a token and hit /api/agents. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
da2509f to
3b653ed
Compare
|
Rebased on What changed:
Will re-approve and squash-merge once CI is green on |
vybe
left a comment
There was a problem hiding this comment.
Re-approved after rebase. All prior CR items addressed: new verify-non-root job runs unconditionally (passed 2m55s), least-privilege GITHUB_TOKEN on the new workflow, lint clean, all 6 pytest seeds + regression-diff green. Rebased the branch myself to resolve the add/add filename collision on cso-diff-2026-05-17.md — both audits preserved (yours at -non-root.md, #876's at the original path).
Summary
Closes #874 — CSO MEDIUM defense-in-depth gap flagged persistent since the 2026-04-05 audit (~6 weeks).
Trinity-built production images now run as non-root:
trinitytrinitynode(built-in)nginx(nginxinc/nginx-unprivileged)Backend joins the host
dockergroup viagroup_add: ${DOCKER_GID:-999}so UID 1000 keeps/var/run/docker.sockaccess on Linux. macOS Docker Desktop ignoresgroup_add(UID translation handles it).The dev Vite frontend image (
docker/frontend/Dockerfile) is intentionally exempt — it has no production attack surface.Why it matters
Before this PR, any RCE in the FastAPI backend landed as root inside the backend container, and
/var/run/docker.sock(mounted:ro) became a fleet-wide reconnaissance primitive — enumerate every agent container, owner labels, network bindings; read in-container env vars (ANTHROPIC_API_KEY,REDIS_URLwith backend creds,SECRET_KEY). The socket is read-only so it's not a direct container escape, but "no USER" turned a single-point RCE into the worst kind of secondary blast radius.What's new in this PR vs the original ticket
In addition to the four Dockerfile changes from the ticket, the branch fixes the operational gaps that came out of a
/reviewpass on the implementation:verify-non-rootstep assertedGET /api/agentsreturns[], butlist_all_agents_fastcatches every Docker exception and returns[]so the check passed even whengroup_addwas broken. Nowdocker exec trinity-backend python -c "docker.from_env().ping()"runs the real round-trip.${TRINITY_DATA_PATH:-./trinity-data}to/data. If the host dir didn't pre-exist, Docker created it root-owned and UID 1000 couldn't createtrinity.db.start.shnow pre-creates it with UID 1000..env.exampleshippedDOCKER_GID=999(Debian) which silently fails on RHEL/Fedora (~991) or Arch (990)..env.examplenow ships blank andstart.shauto-detects viagetent group docker.trinity-dataandagent-configsvolumes; Docker only honours the Dockerfile chown on first volume creation. Migration procedure documented indocs/migrations/NON_ROOT_CONTAINERS_2026-05.md.NET_BIND_SERVICEneeded; removed.:80after the switch to:8080; updated.CiTestPassword!1fallback.verify-prod-frontend-uidbuilds the prod frontend image out-of-band (the e2e workflow boots the Vite-dev image) and asserts UID 101.Audit details in
docs/security-reports/cso-diff-2026-05-17.md.Test plan
docker compose -f docker-compose.yml configparses cleanlydocker compose -f docker-compose.prod.yml configparses cleanly.github/workflows/frontend-e2e.ymlparses as valid YAML (15 steps, ordered)bash -n scripts/deploy/start.shsyntax-cleanverify-non-rootexercises the Docker socket via SDK ping (no false-positive/api/agentsprobe)verify-prod-frontend-uidbuilds Dockerfile.prod and asserts UID 101./scripts/deploy/start.sh→ admin login → create one agent (exercisescontainers.run)docs/migrations/NON_ROOT_CONTAINERS_2026-05.mdOut of scope (follow-ups)
docker/frontend/Dockerfile(dev Vite) — kept root-owned, no production attack surface.src/frontend/Dockerfileappears orphaned (not referenced by any compose/script). Worth confirming and either deleting or documenting in a separate hygiene PR.src/mcp-serverand backendmcp-serverstill shipcap_add: NET_BIND_SERVICEdespite binding port 8080 — pre-existing, not part of this PR.