
fix(security): non-root USER directives for production images (#874) #878

Merged
vybe merged 4 commits into dev from AndriiPasternak31/issue-874 on May 23, 2026

@AndriiPasternak31

Summary

Closes #874, a CSO MEDIUM defense-in-depth gap flagged as persistent since the 2026-04-05 audit (~6 weeks).

Trinity-built production images now run as non-root:

| Service | User | UID |
|---|---|---|
| backend | trinity | 1000 |
| scheduler | trinity | 1000 |
| mcp-server | node (built-in) | 1000 |
| frontend (prod) | nginx (nginxinc/nginx-unprivileged) | 101 |

Backend joins the host docker group via group_add: ${DOCKER_GID:-999} so UID 1000 keeps /var/run/docker.sock access on Linux. macOS Docker Desktop ignores group_add (UID translation handles it).
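As a rough illustration of how the group ID behind `${DOCKER_GID:-999}` might be resolved on a Linux host, the sketch below parses a `getent group docker` line (the lookup start.sh is described as using) and falls back to 999. The function names `parse_getent_group` and `resolve_docker_gid` are hypothetical and are not taken from start.sh.

```python
import subprocess
from typing import Optional

DEFAULT_DOCKER_GID = 999  # compose fallback from ${DOCKER_GID:-999}


def parse_getent_group(line: str) -> Optional[int]:
    """Extract the GID from one `name:passwd:gid:members` group entry."""
    fields = line.strip().split(":")
    if len(fields) < 3 or not fields[2].isdigit():
        return None
    return int(fields[2])


def resolve_docker_gid() -> int:
    """Ask getent for the docker group; fall back to the compose default."""
    try:
        result = subprocess.run(
            ["getent", "group", "docker"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:  # getent is not available (e.g. macOS)
        return DEFAULT_DOCKER_GID
    gid = parse_getent_group(result.stdout) if result.returncode == 0 else None
    return gid if gid is not None else DEFAULT_DOCKER_GID
```

For example, `parse_getent_group("docker:x:991:alice")` yields 991, the kind of value the PR reports for RHEL/Fedora hosts.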

The dev Vite frontend image (docker/frontend/Dockerfile) is intentionally exempt — it has no production attack surface.

Why it matters

Before this PR, any RCE in the FastAPI backend landed as root inside the backend container, and /var/run/docker.sock (mounted :ro) became a fleet-wide reconnaissance primitive: an attacker could enumerate every agent container, its owner labels, and its network bindings, and read in-container env vars (ANTHROPIC_API_KEY, REDIS_URL with backend creds, SECRET_KEY). The socket is read-only so it's not a direct container escape, but the missing USER directive let a single-point RCE grow into a fleet-wide secondary blast radius.

What's new in this PR vs the original ticket

In addition to the four Dockerfile changes from the ticket, the branch fixes the operational gaps that came out of a /review pass on the implementation:

  1. CI false positive: the original verify-non-root step asserted GET /api/agents returns [], but list_all_agents_fast catches every Docker exception and returns [], so the check passed even when group_add was broken. Now docker exec trinity-backend python -c "import docker; docker.from_env().ping()" runs the real round-trip.
  2. Fresh prod install: backend bind-mounts ${TRINITY_DATA_PATH:-./trinity-data} to /data. If the host dir didn't pre-exist, Docker created it root-owned and UID 1000 couldn't create trinity.db. start.sh now pre-creates it owned by UID 1000.
  3. Non-Debian Linux hosts: .env.example shipped DOCKER_GID=999 (Debian), which silently fails on RHEL/Fedora (~991) or Arch (990). .env.example now ships DOCKER_GID blank and start.sh auto-detects it via getent group docker.
  4. Upgrade path: existing deployments have root-owned trinity-data and agent-configs volumes; Docker only honours the Dockerfile chown on first volume creation. Migration procedure documented in docs/migrations/NON_ROOT_CONTAINERS_2026-05.md.
  5. Dead capability: backend listens on 8000, so NET_BIND_SERVICE is not needed; removed.
  6. Stale comments: nginx.conf and security-headers.conf referenced :80 after the switch to :8080; updated.
  7. CSO I-01: per-run random CI admin password replaces the prior hardcoded CiTestPassword!1 fallback.
  8. CSO I-02: verify-prod-frontend-uid builds the prod frontend image out-of-band (the e2e workflow boots the Vite-dev image) and asserts UID 101.
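Item 1 can be illustrated with a small self-contained sketch. It does not use the real Docker SDK; `BrokenClient`, `list_agents_swallowing_errors`, and `probe_socket` are hypothetical stand-ins that show why a probe which swallows every exception cannot detect a broken socket, while one that reacts to the error can.

```python
class SocketPermissionError(Exception):
    """Stand-in for the error raised when the socket is not accessible."""


class BrokenClient:
    """Hypothetical client whose every call fails, as if group_add were broken."""

    def list_containers(self):
        raise SocketPermissionError("permission denied on /var/run/docker.sock")

    def ping(self):
        raise SocketPermissionError("permission denied on /var/run/docker.sock")


def list_agents_swallowing_errors(client) -> list:
    """Mimics the described pattern: any failure becomes an empty list."""
    try:
        return client.list_containers()
    except Exception:
        return []


def probe_socket(client) -> bool:
    """Round-trip probe: True only if ping() completes without raising."""
    try:
        client.ping()
    except SocketPermissionError:
        return False
    return True


client = BrokenClient()
# The old-style assertion "returns []" holds even though nothing works.
old_check_passes = list_agents_swallowing_errors(client) == []
# The round-trip probe reports the failure.
new_check_passes = probe_socket(client)
```

With the broken client, the empty-list assertion still holds while the round-trip probe fails, which is the gap the PR describes closing.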

Audit details in docs/security-reports/cso-diff-2026-05-17.md.
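For CSO I-01 (item 7 above), a per-run credential could be produced along these lines. This is a minimal sketch using Python's standard `secrets` module; the function name and the 24-byte length are assumptions, not what the workflow actually does.

```python
import secrets


def generate_ci_admin_password(num_bytes: int = 24) -> str:
    """Return a fresh URL-safe random password; a new value on every call."""
    return secrets.token_urlsafe(num_bytes)


password = generate_ci_admin_password()
```

Because the value is drawn from the OS random source on each run, there is no shared fallback string for an attacker to guess from the repository.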

Test plan

  • docker compose -f docker-compose.yml config parses cleanly
  • docker compose -f docker-compose.prod.yml config parses cleanly
  • .github/workflows/frontend-e2e.yml parses as valid YAML (15 steps, ordered)
  • bash -n scripts/deploy/start.sh syntax-clean
  • CI verify-non-root exercises the Docker socket via SDK ping (no false-positive /api/agents probe)
  • CI verify-prod-frontend-uid builds Dockerfile.prod and asserts UID 101
  • Manual smoke on a fresh Linux host: ./scripts/deploy/start.sh → admin login → create one agent (exercises containers.run)
  • Manual smoke on macOS Docker Desktop: same as above (group_add ignored, should still work)
  • Manual upgrade smoke on an existing deployment per docs/migrations/NON_ROOT_CONTAINERS_2026-05.md
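The UID assertions in the test plan can be sketched as a small checker. This is only an illustration: the expected UIDs come from the table in the Summary, but the `check_uids` helper, the injected `get_uid` callable, and the idea of running `id -u` through `docker exec` are assumptions, not the contents of the CI step.

```python
import subprocess
from typing import Callable, Dict, List

# Expected runtime UIDs per service, from the Summary table.
EXPECTED_UIDS: Dict[str, int] = {
    "backend": 1000,
    "scheduler": 1000,
    "mcp-server": 1000,
    "frontend": 101,
}


def uid_via_docker_exec(container: str) -> int:
    """Read the UID of the container's default user (needs a running container)."""
    out = subprocess.run(
        ["docker", "exec", container, "id", "-u"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())


def check_uids(get_uid: Callable[[str], int],
               expected: Dict[str, int] = EXPECTED_UIDS) -> List[str]:
    """Return human-readable mismatches; an empty list means every UID matched."""
    problems = []
    for service, want in expected.items():
        got = get_uid(service)
        if got != want:
            problems.append(f"{service}: expected UID {want}, got {got}")
    return problems
```

Passing `uid_via_docker_exec` as `get_uid` would require mapping the service names to real container names; passing a fake callable keeps the comparison logic testable without Docker.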

Out of scope (follow-ups)

  • docker/frontend/Dockerfile (dev Vite) — kept root-owned, no production attack surface.
  • src/frontend/Dockerfile appears orphaned (not referenced by any compose/script). Worth confirming and either deleting or documenting in a separate hygiene PR.
  • src/mcp-server and backend mcp-server still ship cap_add: NET_BIND_SERVICE despite binding port 8080 — pre-existing, not part of this PR.


@vybe vybe left a comment


Implementation is excellent — hardening is comprehensive, CSO I-01/I-02 follow-up fixes go beyond the original scope, and the comments make the why obvious. Two CI items to resolve before merge:

  • Run the new CI guards on this PR. verify-non-root and verify-prod-frontend-uid live in frontend-e2e.yml, which is ui-label-gated and was skipped on this PR. The new gates have not actually executed. Either: (a) add the ui label to this PR so they run, OR (b) move both steps to an unconditional workflow with a path filter on docker/**, docker-compose*.yml, scripts/deploy/start.sh, src/mcp-server/Dockerfile. Option (b) is the better long-term answer since this is a regression-guard for backend infrastructure, not UI.

  • Resolve the lint (sys.modules pollution check) failure. The failing file tests/unit/test_slot_per_slot_ttl.py is not touched by this PR — likely baseline drift on dev since the branch point. Please rebase on dev; if the failure persists, regenerate the baseline (python tests/lint_sys_modules.py --regenerate-baseline) in a separate commit on dev so this PR inherits a clean lint.

Architecture update, migration doc, and CSO report all look good. Approving once CI is green with the new guards actually running.
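Option (b) relies on a path filter. As a rough approximation of which changed files would trigger such a workflow, the sketch below matches paths against the four patterns named above using Python's `fnmatch`. Treat it as an illustration under assumptions: GitHub Actions has its own glob rules (for example, `*` there does not cross `/`, whereas `fnmatch`'s `*` does), so results can differ at the edges. The list of changed files is made up for the example.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# Patterns quoted from the review comment.
PATH_FILTERS = [
    "docker/**",
    "docker-compose*.yml",
    "scripts/deploy/start.sh",
    "src/mcp-server/Dockerfile",
]


def triggering_files(changed: Iterable[str],
                     patterns: List[str] = PATH_FILTERS) -> List[str]:
    """Return the changed paths that match at least one filter pattern."""
    return [p for p in changed if any(fnmatchcase(p, pat) for pat in patterns)]


# Hypothetical change set for demonstration only.
changed_files = [
    "docker/backend/Dockerfile",
    "docker-compose.prod.yml",
    "README.md",
    "scripts/deploy/start.sh",
]
hits = triggering_files(changed_files)
```

A change set with no hits would skip the workflow, which is acceptable for a guard whose subject is limited to the container build and deploy surface.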

AndriiPasternak31 added a commit that referenced this pull request May 20, 2026
Addresses @vybe review on #878. The verify-non-root and
verify-prod-frontend-uid guards added in #874 lived inside
frontend-e2e.yml, which is `ui`-label-gated — so backend infrastructure
PRs (the exact PRs that can regress the guards) skipped them silently.

Moves both steps to .github/workflows/container-security.yml with a
path filter on docker/**, docker-compose*.yml, scripts/deploy/start.sh,
and src/mcp-server/Dockerfile so the guards execute whenever the
underlying surface changes — independent of the e2e workflow's UI gate.

frontend-e2e.yml keeps the stack boot for Playwright smoke tests but
no longer carries the regression guards. Architecture invariant #17
updated to point at the new workflow.
@AndriiPasternak31 AndriiPasternak31 force-pushed the AndriiPasternak31/issue-874 branch from 967b7e8 to 473733c Compare May 20, 2026 16:47
Comment thread .github/workflows/container-security.yml Fixed
AndriiPasternak31 added a commit that referenced this pull request May 20, 2026
…workflow

Addresses CodeQL finding flagged on PR #878
(security/code-scanning/173): the new workflow defaulted to the
repository-default GITHUB_TOKEN scope, which is broader than the
workflow actually uses.

Pin top-level `permissions: contents: read` — the minimum needed for
actions/checkout. The workflow does no PR commenting, issue updating,
or security-events writes, so anything beyond `contents: read` would
be unused authority.
vybe pushed a commit that referenced this pull request May 22, 2026
vybe pushed a commit that referenced this pull request May 22, 2026
…workflow

@vybe vybe force-pushed the AndriiPasternak31/issue-874 branch from 25ddeb1 to da2509f Compare May 22, 2026 11:56
@vybe

vybe commented May 22, 2026

Rebased onto current dev and pushed (da2509f9) to unblock the architecture.md regression I flagged. Resolved two merge conflicts:

Final diff against dev is now exactly the 15 files the PR advertises (+638/-28). CI re-running. Will approve once green.

AndriiPasternak31 and others added 4 commits May 23, 2026 11:13
Closes the CSO MEDIUM defense-in-depth gap flagged persistent since
2026-04-05: backend, scheduler, MCP server, and the production frontend
ran their CMD as root. An RCE in any of them inherited root, and on the
backend that meant the Docker socket bind mount turned a single RCE into
fleet-wide reconnaissance.

Changes:
- docker/backend/Dockerfile, docker/scheduler/Dockerfile: new `trinity`
  user at UID 1000 (matched UID required — both share /data/trinity.db).
- src/mcp-server/Dockerfile: switch to the built-in `node` user (UID 1000).
- docker/frontend/Dockerfile.prod: switch to `nginxinc/nginx-unprivileged`
  (UID 101, binds 8080). nginx.conf + healthcheck + compose port mapping
  updated to match. NET_BIND_SERVICE/CHOWN/SETGID/SETUID dropped from the
  frontend caps (no longer needed once nginx is unprivileged).
- docker-compose{,.prod}.yml: backend joins `${DOCKER_GID:-999}` via
  group_add so UID 1000 retains /var/run/docker.sock access on Linux.
  Dead NET_BIND_SERVICE removed from backend (binds 8000, doesn't need
  it). PYTHONDONTWRITEBYTECODE=1 added to dev compose so uvicorn --reload
  stops failing on __pycache__ writes when host UID != 1000.
- scripts/deploy/start.sh: pre-creates the host bind-mount data dir with
  UID 1000 (the Dockerfile's chown is masked by the bind mount); auto-
  detects DOCKER_GID on Linux (Debian/Ubuntu=999, RHEL/Fedora=~991,
  Arch=990) so non-Debian hosts don't silently fail with EACCES on the
  socket.
- .env.example: DOCKER_GID ships blank so start.sh auto-detect kicks in.
  Compose still falls back to 999 if .env value is missing entirely.
- .github/workflows/frontend-e2e.yml: `verify-non-root` step asserts UID
  1000 in backend/scheduler/mcp-server and exercises the Docker socket
  via `docker.from_env().ping()` from inside the backend (the prior
  `/api/agents` probe was a false positive — `list_all_agents_fast`
  catches every Docker exception and returns []). `verify-prod-frontend-uid`
  builds the prod frontend image out-of-band and asserts UID 101. Admin
  password is generated per-run instead of the previous hardcoded
  fallback (CSO I-01).
- docs/memory/architecture.md: new invariant #17 documenting the rule.
- docs/migrations/NON_ROOT_CONTAINERS_2026-05.md: upgrade procedure for
  existing deployments — Docker only honours the Dockerfile chown on
  first volume creation, so trinity-data and agent-configs volumes from
  prior root-running containers need to be re-owned manually.
- docs/security-reports/cso-diff-2026-05-17.md: audit report of the
  branch itself.

Verification (CI, fresh prod, upgrade): see the migration doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses @vybe review on #878. The verify-non-root and
verify-prod-frontend-uid guards added in #874 lived inside
frontend-e2e.yml, which is `ui`-label-gated — so backend infrastructure
PRs (the exact PRs that can regress the guards) skipped them silently.

Moves both steps to .github/workflows/container-security.yml with a
path filter on docker/**, docker-compose*.yml, scripts/deploy/start.sh,
and src/mcp-server/Dockerfile so the guards execute whenever the
underlying surface changes — independent of the e2e workflow's UI gate.

frontend-e2e.yml keeps the stack boot for Playwright smoke tests but
no longer carries the regression guards. Architecture invariant #17
updated to point at the new workflow.
…workflow

Addresses CodeQL finding flagged on PR #878
(security/code-scanning/173): the new workflow defaulted to the
repository-default GITHUB_TOKEN scope, which is broader than the
workflow actually uses.

Pin top-level `permissions: contents: read` — the minimum needed for
actions/checkout. The workflow does no PR commenting, issue updating,
or security-events writes, so anything beyond `contents: read` would
be unused authority.
…urity

The new container-security workflow calls /api/token after backend
health is green, but on a fresh DB the first-time setup wizard blocks
login (`setup_required`, 403) until `setup_completed=true`. Mirror the
"Skip first-time setup wizard" step from frontend-e2e.yml — flip the
flag directly via `docker exec trinity-backend python3 ...` so the
CI sanity probe can mint a token and hit /api/agents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vybe vybe force-pushed the AndriiPasternak31/issue-874 branch from da2509f to 3b653ed Compare May 23, 2026 10:14
@vybe

vybe commented May 23, 2026

Rebased on dev and resolved the cso-diff filename collision on your behalf — branch is now MERGEABLE.

What changed:

Will re-approve and squash-merge once CI is green on 3b653edf. Heads up: the branch was force-pushed, so fetch and reset your local branch to the remote before doing further work, and use --force-with-lease on any later push.


@vybe vybe left a comment


Re-approved after rebase. All prior CR items addressed: new verify-non-root job runs unconditionally (passed 2m55s), least-privilege GITHUB_TOKEN on the new workflow, lint clean, all 6 pytest seeds + regression-diff green. Rebased the branch myself to resolve the add/add filename collision on cso-diff-2026-05-17.md — both audits preserved (yours at -non-root.md, #876's at the original path).

@vybe vybe merged commit 420faea into dev May 23, 2026
14 checks passed
3 participants