fix(deploy): restrict workflow-runner egress to required destinations #1430

Merged
OleksandrUA merged 1 commit into staging from TECH-30-tighten-workflow-runner-egress-networkpolicy
Jun 1, 2026

@OleksandrUA

Problem

workflow-runner-block-imds only blocked link-local (169.254.0.0/16). Runner pods could still reach all of RFC1918, CGNAT and in-VPC services (RDS, Redis, other pods) over egress. A runner compromised through an SSRF vector that slips past the L7 SAFE_FETCH_ENFORCE guard could pivot to internal hosts. This is a SEV-1 postmortem follow-up.

Change

Replaced the single permissive egress rule with a union of explicit allows; everything else is denied. The runner needs exactly five destinations, and each gets its own rule:

  • DNS to kube-dns only (UDP/TCP 53)
  • keeperhub-sandbox :8787 - delegated code-step execution
  • keeperhub-executor :3080 - metrics ingest (EXECUTOR_METRICS_INGEST_URL)
  • RDS Postgres :5432 - scoped to the dedicated DB subnets (the runner writes execution state directly via DATABASE_URL)
  • public internet (all ports) with 10/8, 172.16/12, 192.168/16, 169.254/16, 100.64/10, 127/8 stripped; IPv6 keeps NAT64 reachable while blocking link-local and ULA (IMDSv6 sits in fc00::/7)

In-cluster Services are matched by podSelector rather than CIDR because the VPC CNI Network Policy Agent evaluates egress on the post-DNAT backend pod IP. Public egress is left port-unrestricted on purpose so non-443 RPC / webhook endpoints keep working; destination filtering at L7 stays the job of SAFE_FETCH_ENFORCE.
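
For reviewers who want to sanity-check the public-internet rule, here is a minimal Python sketch that models only its exclusion list (not the DNS, sandbox, executor or RDS rules). The IPv4 ranges are the ones listed above. The IPv6 ranges `fe80::/10` and `fc00::/7` are an assumption based on the "link-local and ULA" wording, and `public_rule_allows` is a hypothetical helper, not something in the manifests.

```python
import ipaddress

# IPv4 ranges stripped from the public-internet rule, per the PR description.
STRIPPED_V4 = [
    "10.0.0.0/8",
    "172.16.0.0/12",
    "192.168.0.0/16",
    "169.254.0.0/16",
    "100.64.0.0/10",
    "127.0.0.0/8",
]
# Assumption: "link-local and ULA" maps to these two IPv6 prefixes.
STRIPPED_V6 = ["fe80::/10", "fc00::/7"]

STRIPPED = [ipaddress.ip_network(cidr) for cidr in STRIPPED_V4 + STRIPPED_V6]


def public_rule_allows(dest: str) -> bool:
    """Return True if dest falls outside every stripped range.

    This models the public-internet rule in isolation; a destination
    denied here may still be allowed by one of the other rules.
    """
    addr = ipaddress.ip_address(dest)
    return not any(
        addr.version == net.version and addr in net for net in STRIPPED
    )


SAMPLES = [
    "1.1.1.1",           # public IPv4
    "169.254.169.254",   # IMDS over IPv4
    "10.1.42.7",         # arbitrary in-VPC address (hypothetical)
    "64:ff9b::101:101",  # NAT64 well-known prefix
    "fd00:ec2::254",     # IMDS over IPv6, inside fc00::/7
]

for sample in SAMPLES:
    print(sample, "allowed" if public_rule_allows(sample) else "denied")
```

Under these assumptions the public IPv4 and NAT64 samples are allowed, while both IMDS addresses and the in-VPC address are denied.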

Both files edited (staging applies on this merge, prod applies on the later staging -> prod PR); only the RDS subnet CIDRs differ between envs.

Network facts (verified live, read-only, 2026-06-01)

| | prod (techops-prod, us-east-2) | staging (techops-staging, us-east-1) |
|---|---|---|
| VPC CIDR | 10.0.0.0/16 | 10.1.0.0/16 |
| DB subnets | 10.0.100.0/24, .101.0/24, .102.0/24 | 10.1.100.0/24, .101.0/24, .102.0/24 |
| Pods | all in 10.x (no 100.64 secondary CIDR) | same shape |
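
The network facts above imply the DB subnets sit inside 10/8, so the public-internet rule cannot admit them and the dedicated RDS rule is required. A small sketch of that arithmetic follows. It assumes the shorthand `.101.0/24` and `.102.0/24` expands to the same first two octets as the first subnet in each environment; `check_env` is a hypothetical helper.

```python
import ipaddress

VPC_CIDR = {"prod": "10.0.0.0/16", "staging": "10.1.0.0/16"}

# Assumption: the abbreviated subnets share the first two octets
# of the first subnet listed for each environment.
DB_SUBNETS = {
    "prod": ["10.0.100.0/24", "10.0.101.0/24", "10.0.102.0/24"],
    "staging": ["10.1.100.0/24", "10.1.101.0/24", "10.1.102.0/24"],
}

RFC1918_TEN = ipaddress.ip_network("10.0.0.0/8")


def check_env(env: str) -> dict:
    """Summarise how the DB subnets relate to the VPC and to 10/8."""
    vpc = ipaddress.ip_network(VPC_CIDR[env])
    subnets = [ipaddress.ip_network(cidr) for cidr in DB_SUBNETS[env]]
    return {
        # Every DB subnet should be carved out of the VPC CIDR.
        "inside_vpc": all(s.subnet_of(vpc) for s in subnets),
        # Inside 10/8 means the public rule strips them.
        "stripped_from_public_rule": all(
            s.subnet_of(RFC1918_TEN) for s in subnets
        ),
        # Size of the RDS allow, in addresses.
        "db_addresses": sum(s.num_addresses for s in subnets),
    }


for env in ("prod", "staging"):
    print(env, check_env(env))
```

With three /24 subnets per environment, the RDS rule opens 768 addresses out of the 65,536 in each /16 VPC.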

Validation

  • Both manifests pass kubectl apply --dry-run=server against the live API servers (reports configured - in-place update of the existing policy, same name, no orphan).

Post-merge verification plan (staging, before promoting to prod)

  1. Run a real workflow on staging - proves DB + sandbox + DNS + public RPC still work end to end.
  2. Throwaway app=workflow-runner debug pod: confirm IMDS 169.254.169.254 and an arbitrary 10.1.x host are denied, while RDS :5432 and public 443 succeed.
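
Step 2 could be scripted along the following lines from inside the debug pod. This is a sketch under assumptions: `10.1.0.10` is a placeholder for "an arbitrary 10.1.x host", `example.com` stands in for any public HTTPS endpoint, the RDS host is read from `DATABASE_URL`, and all function names are hypothetical. A plain TCP connect cannot distinguish a policy drop from a closed port, so a "blocked" result on the in-VPC host is only meaningful if that host is known to listen on the probed port.

```python
import os
import socket
from urllib.parse import urlsplit


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refusals and unreachable networks alike.
        return False


def db_endpoint(database_url: str) -> tuple:
    """Extract (host, port) from a Postgres URL, defaulting to 5432."""
    parts = urlsplit(database_url)
    return parts.hostname, parts.port or 5432


def run_checks() -> bool:
    """Probe each destination and compare with the expected outcome."""
    db_host, db_port = db_endpoint(os.environ["DATABASE_URL"])
    checks = [
        ("IMDS", "169.254.169.254", 80, False),
        ("in-VPC host (placeholder)", "10.1.0.10", 80, False),
        ("RDS", db_host, db_port, True),
        ("public 443 (placeholder)", "example.com", 443, True),
    ]
    all_ok = True
    for name, host, port, expected in checks:
        actual = can_connect(host, port)
        status = "ok" if actual == expected else "UNEXPECTED"
        print(f"{status}: {name} {host}:{port} reachable={actual}")
        all_ok = all_ok and actual == expected
    return all_ok
```

Calling `run_checks()` in a pod labelled `app=workflow-runner` prints one line per destination and returns `False` if any result differs from the expectation in step 2.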

Risk: staging applies to all runner Jobs on merge. If a workflow needs an in-cluster endpoint not accounted for here, it would fail at run time (not deploy time) - the verification step above is what catches that before prod.

The workflow-runner-block-imds NetworkPolicy only blocked link-local
(169.254/16), leaving runner pods able to reach all of RFC1918, CGNAT and
in-VPC services (RDS, Redis, other pods) over egress. A compromised runner
(e.g. SSRF via an http-request step that slips past SAFE_FETCH_ENFORCE)
could pivot to internal hosts.

Replace the single permissive rule with a union of explicit allows and
deny-the-rest:

- DNS to kube-dns only
- keeperhub-sandbox :8787 (delegated code execution)
- keeperhub-executor :3080 (metrics ingest)
- RDS Postgres :5432, scoped to the dedicated DB subnets
- public internet (all ports) with 10/8, 172.16/12, 192.168/16, 169.254/16,
  100.64/10 and 127/8 stripped; IPv6 keeps NAT64 while blocking link-local
  and ULA

In-cluster Services are matched by podSelector because the VPC CNI policy
agent evaluates egress on the post-DNAT backend pod IP. Public egress is
left port-unrestricted so non-443 RPC/webhook endpoints keep working.

Validated both manifests with kubectl apply --dry-run=server against the
live API servers.

OleksandrUA merged commit 03fbc02 into staging Jun 1, 2026
32 checks passed
OleksandrUA deleted the TECH-30-tighten-workflow-runner-egress-networkpolicy branch June 1, 2026 15:40

github-actions Bot commented Jun 1, 2026

🧹 PR Environment Cleaned Up

The PR environment has been successfully deleted.

Deleted Resources:

  • Namespace: pr-1430
  • All Helm releases (Keeperhub, Scheduler, Event services)
  • PostgreSQL Database (including data)
  • LocalStack, Redis
  • All associated secrets and configs

All resources have been cleaned up and will no longer incur costs.

github-actions Bot commented Jun 1, 2026

ℹ️ No PR Environment to Clean Up

No PR environment was found for this PR. This is expected if:

  • The PR never had the deploy-pr-environment label
  • The environment was already cleaned up
  • The deployment never completed successfully
