fix(deploy): restrict workflow-runner egress to required destinations by OleksandrUA · Pull Request #1430 · KeeperHub/keeperhub

OleksandrUA · 2026-06-01T15:18:48Z

Problem

workflow-runner-block-imds only blocked link-local (169.254.0.0/16). Runner pods could still reach all of RFC1918, CGNAT and in-VPC services (RDS, Redis, other pods) over egress. A runner compromised through an SSRF vector that slips past the L7 SAFE_FETCH_ENFORCE guard could pivot to internal hosts. This is a SEV-1 postmortem follow-up.

Change

Replaced the single permissive egress rule with a union of explicit allows; everything else is denied. The runner needs four in-VPC destinations plus the public internet, and each gets its own rule:

DNS to kube-dns only (UDP/TCP 53)
keeperhub-sandbox :8787 - delegated code-step execution
keeperhub-executor :3080 - metrics ingest (EXECUTOR_METRICS_INGEST_URL)
RDS Postgres :5432 - scoped to the dedicated DB subnets (the runner writes execution state directly via DATABASE_URL)
public internet (all ports) with 10/8, 172.16/12, 192.168/16, 169.254/16, 100.64/10, 127/8 stripped; IPv6 keeps NAT64 reachable while blocking link-local and ULA (IMDSv6 sits in fc00::/7)
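To make the effect of the stripped ranges concrete, here is a minimal Python sketch that models only the public-internet rule as "everything minus an except list". It is not the manifest from this PR. The two IPv6 prefixes (`fe80::/10` for link-local, `fc00::/7` for ULA) and the function name are assumptions chosen for illustration.

```python
import ipaddress

# IPv4 ranges stripped from the public-internet rule, as listed above.
STRIPPED_V4 = [
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
    "169.254.0.0/16", "100.64.0.0/10", "127.0.0.0/8",
]
# Assumption: link-local and ULA are expressed as these two standard prefixes.
STRIPPED_V6 = ["fe80::/10", "fc00::/7"]

EXCEPT = [ipaddress.ip_network(cidr) for cidr in STRIPPED_V4 + STRIPPED_V6]


def public_egress_allowed(dest: str) -> bool:
    """Model the public rule: any address is allowed unless it falls in EXCEPT."""
    addr = ipaddress.ip_address(dest)
    return not any(
        addr.version == net.version and addr in net for net in EXCEPT
    )


if __name__ == "__main__":
    for dest in ["169.254.169.254", "10.1.100.7", "fd00:ec2::254",
                 "8.8.8.8", "64:ff9b::808:808"]:
        print(dest, public_egress_allowed(dest))
```

This models one rule in isolation. An address in a DB subnet is rejected here and is reachable only because the separate RDS :5432 rule allows it, which is the point of expressing the policy as a union of explicit allows.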

In-cluster Services are matched by podSelector rather than CIDR because the VPC CNI Network Policy Agent evaluates egress on the post-DNAT backend pod IP. Public egress is left port-unrestricted on purpose so non-443 RPC / webhook endpoints keep working; destination filtering at L7 stays the job of SAFE_FETCH_ENFORCE.

Both environment manifests are edited (staging applies on this merge; prod applies on the later staging -> prod PR). Only the RDS subnet CIDRs differ between environments.

Network facts (verified live, read-only, 2026-06-01)

| | prod (techops-prod, us-east-2) | staging (techops-staging, us-east-1) |
|---|---|---|
| VPC CIDR | `10.0.0.0/16` | `10.1.0.0/16` |
| DB subnets | `10.0.100.0/24`, `.101.0/24`, `.102.0/24` | `10.1.100.0/24`, `.101.0/24`, `.102.0/24` |
| Pods | all in `10.x` (no `100.64` secondary CIDR) | same shape |
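The table explains why RDS needs its own rule: the DB subnets sit inside the VPC CIDR, which sits inside the stripped 10/8 range. A small sketch that checks this with Python's `ipaddress` module follows. It assumes the shorthand `.101.0/24` and `.102.0/24` expands with the same first two octets as the first subnet in each cell; the dictionary layout and function name are illustrative only.

```python
import ipaddress

# Values from the table above. Assumption: ".101.0/24" and ".102.0/24" share
# the first two octets of the first subnet listed for that environment.
ENVS = {
    "prod": {
        "vpc": "10.0.0.0/16",
        "db": ["10.0.100.0/24", "10.0.101.0/24", "10.0.102.0/24"],
    },
    "staging": {
        "vpc": "10.1.0.0/16",
        "db": ["10.1.100.0/24", "10.1.101.0/24", "10.1.102.0/24"],
    },
}

STRIPPED_10 = ipaddress.ip_network("10.0.0.0/8")


def db_subnets_inside_stripped_range(env: dict) -> bool:
    """True if every DB subnet lies in the VPC CIDR and in the stripped 10/8."""
    vpc = ipaddress.ip_network(env["vpc"])
    subnets = [ipaddress.ip_network(cidr) for cidr in env["db"]]
    return all(s.subnet_of(vpc) and s.subnet_of(STRIPPED_10) for s in subnets)


def db_address_count(env: dict) -> int:
    """Number of addresses the RDS rule opens, versus the whole VPC."""
    return sum(ipaddress.ip_network(cidr).num_addresses for cidr in env["db"])
```

Under that assumption, the RDS rule opens 768 addresses per environment on port 5432, out of 65,536 in each /16 VPC.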

Validation

Both manifests pass kubectl apply --dry-run=server against the live API servers (reports configured - in-place update of the existing policy, same name, no orphan).

Post-merge verification plan (staging, before promoting to prod)

Run a real workflow on staging - proves DB + sandbox + DNS + public RPC still work end to end.
Throwaway app=workflow-runner debug pod: confirm IMDS 169.254.169.254 and an arbitrary 10.1.x host are denied, while RDS :5432 and public 443 succeed.
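The debug-pod check could be scripted roughly as below: a sketch using plain TCP connects from inside a pod labelled app=workflow-runner. The RDS hostname, the in-VPC host, the public host, and the ports for the deny cases are placeholders, not values from this PR.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refusals, timeouts and DNS resolution failures
        return False


# (host, port, expected_reachable). All hosts except IMDS are hypothetical.
EXPECTATIONS = [
    ("169.254.169.254", 80, False),        # IMDS must be denied
    ("10.1.0.10", 443, False),             # placeholder: arbitrary 10.1.x host
    ("rds.example.internal", 5432, True),  # placeholder: staging RDS endpoint
    ("example.com", 443, True),            # placeholder: any public 443 host
]


def run_checks(expectations, probe=can_connect):
    """Return a list of human-readable mismatches; empty means all passed."""
    failures = []
    for host, port, expected in expectations:
        actual = probe(host, port)
        if actual != expected:
            failures.append(f"{host}:{port} expected={expected} actual={actual}")
    return failures


def main() -> int:
    failures = run_checks(EXPECTATIONS)
    for line in failures:
        print("MISMATCH", line)
    return 1 if failures else 0
```

Calling `main()` inside the debug pod returns 0 when every expectation holds. A failed connect cannot distinguish a policy deny from a host that is simply not listening, so the 10.1.x target should be one known to accept connections from an unrestricted pod.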

Risk: staging applies to all runner Jobs on merge. If a workflow needs an in-cluster endpoint not accounted for here, it would fail at run time (not deploy time) - the verification step above is what catches that before prod.

The workflow-runner-block-imds NetworkPolicy only blocked link-local (169.254/16), leaving runner pods able to reach all of RFC1918, CGNAT and in-VPC services (RDS, Redis, other pods) over egress. A compromised runner (e.g. SSRF via an http-request step that slips past SAFE_FETCH_ENFORCE) could pivot to internal hosts.

Replace the single permissive rule with a union of explicit allows and deny-the-rest:

- DNS to kube-dns only
- keeperhub-sandbox :8787 (delegated code execution)
- keeperhub-executor :3080 (metrics ingest)
- RDS Postgres :5432, scoped to the dedicated DB subnets
- public internet (all ports) with 10/8, 172.16/12, 192.168/16, 169.254/16, 100.64/10 and 127/8 stripped; IPv6 keeps NAT64 while blocking link-local and ULA

In-cluster Services are matched by podSelector because the VPC CNI policy agent evaluates egress on the post-DNAT backend pod IP. Public egress is left port-unrestricted so non-443 RPC/webhook endpoints keep working.

Validated both manifests with kubectl apply --dry-run=server against the live API servers.

github-actions · 2026-06-01T15:41:19Z

🧹 PR Environment Cleaned Up

The PR environment has been successfully deleted.

Deleted Resources:

Namespace: pr-1430
All Helm releases (Keeperhub, Scheduler, Event services)
PostgreSQL Database (including data)
LocalStack, Redis
All associated secrets and configs

All resources have been cleaned up and will no longer incur costs.

github-actions · 2026-06-01T15:41:20Z

ℹ️ No PR Environment to Clean Up

No PR environment was found for this PR. This is expected if:

The PR never had the deploy-pr-environment label
The environment was already cleaned up
The deployment never completed successfully

OleksandrUA temporarily deployed to staging June 1, 2026 15:19 — with GitHub Actions Inactive

OleksandrUA merged commit 03fbc02 into staging Jun 1, 2026
32 checks passed

OleksandrUA deleted the TECH-30-tighten-workflow-runner-egress-networkpolicy branch June 1, 2026 15:40

OleksandrUA temporarily deployed to staging June 1, 2026 15:40 — with GitHub Actions Inactive
