clusterd: SIGTERM handler + in-process gRPC FQDN (Variant A) by jasonhernandez · Pull Request #36100 · MaterializeInc/materialize

jasonhernandez · 2026-04-15T05:16:34Z

Design question: where should clusterd's gRPC FQDN come from?

This is Variant A of a two-version pair (see #36872 for Variant B). Both add the SIGTERM handler that distroless clusterd needs as PID 1. They differ only in how CLUSTERD_GRPC_HOST is set once entrypoint.sh is removed by the distroless migration (#36099).

entrypoint.sh previously did, in Kubernetes only:

export CLUSTERD_GRPC_HOST=${CLUSTERD_GRPC_HOST:-$(hostname --fqdn)}

	Variant A (this PR)	Variant B (#36872)
Who resolves the FQDN	clusterd, at startup	orchestrator-kubernetes, at pod-spec build
Mechanism	in-process `getaddrinfo(AI_CANONNAME)` (`resolve_fqdn`)	downward API `MZ_POD_NAME` + env interpolation
Blocking DNS at startup	yes — `getaddrinfo` has no timeout; if CoreDNS is briefly down at pod start, clusterd's main thread blocks	no — kubelet substitutes the value
clusterd `main()`	SIGTERM handler + `resolve_fqdn` + `CLUSTERD_PROCESS`	SIGTERM handler + `CLUSTERD_PROCESS` only
Where pod identity lives	split between binary and orchestrator	with the orchestrator
Footprint	self-contained; works under any orchestrator	k8s-specific; +`EnvVar`s in the pod spec

Why Variant A

Keeps the change confined to the clusterd binary — no orchestrator changes, and it works under any orchestrator (not just k8s) that gives the pod a resolvable hostname. It directly mirrors what entrypoint.sh did (hostname --fqdn).

Known downside

resolve_fqdn calls getaddrinfo with no timeout on the main thread at startup (flagged in the code comment). If DNS is unavailable at pod start, clusterd blocks. This is the same behavior the old shell had, so it's not a regression — but Variant B avoids it entirely by injecting the value from the orchestrator, which already computes the same FQDN.

Part of the distroless migration (#36099 image, this/#36872 lifecycle, #36101 UID/GID).

Test plan

cargo check -p mz-clusterd (rustc 1.96.0)
Verify in k8s that resolve_fqdn yields the pod FQDN and gRPC host validation passes
Decide A vs B before merge — close the other

🤖 Generated with Claude Code

github-actions · 2026-04-15T05:16:42Z

Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone.

PR title guidelines

Use imperative mood: "Fix X" not "Fixed X" or "Fixes X"
Be specific: "Fix panic in catalog sync when controller restarts" not "Fix bug" or "Update catalog code"
Prefix with area if helpful: compute: , storage: , adapter: , sql:

Pre-merge checklist

The PR title is descriptive and will make sense in the git log.
This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

Distroless images have no shell, so there's no init process to forward signals. Add a direct SIGTERM handler in clusterd (using nix::signal) and environmentd so graceful shutdown works without a wrapper script. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Distroless images run as nonroot (UID 65534) instead of root. Add version-gating so orchestratord sets the correct runAsUser/runAsGroup based on the Materialize version, avoiding UID mismatches during rolling upgrades from Debian-based to distroless images. Gate versions (verified against release history, 2026-06): - balancerd: V26_18_0. Its ci/Dockerfile switched to distroless-prod-base in v26.18.0 (prod-base in v26.17.x). The original V26_19_0 was off by one and would have forced UID 999 onto v26.18.x balancerd pods that actually run as 65534. - environmentd/clusterd: V26_28_0, matching the release that ships their distroless migration (#36099). The original V26_20_0 predated the actual landing by ~8 releases (main is now 26.28-dev) and would have applied UID 65534 to v26.20-v26.27 images that still run as UID 999. NOTE: the env/clusterd gate assumes #36099 lands in the 26.28 cycle. If it slips, bump V26_28_0 to the actual release. The three distroless PRs (#36099 image, #36100 SIGTERM, #36101 this) must ship in the same release. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jasonhernandez · 2026-06-02T17:06:01Z

🔗 Distroless migration — coordination note

One of three PRs that ship together: #36099 (distroless image), #36100 (this — SIGTERM handler), #36101 (UID/GID gating).

Merge this before or with #36099: distroless runs the binary as PID 1, which silently ignores SIGTERM without an explicit handler. Safe to merge first (harmless on current images). environmentd already handles termination signals, so this change is correctly clusterd-only despite the commit subject.

Distroless containers run the binary directly as PID 1 (no tini/shell), so two things from the removed entrypoint.sh are handled here: 1. PID 1 ignores signals with a SIG_DFL disposition, so SIGTERM from Kubernetes is silently dropped. Add an explicit termination-signal handler in clusterd (environmentd already has one). The StatefulSet ordinal (CLUSTERD_PROCESS) is derived in-process from the hostname. 2. entrypoint.sh set CLUSTERD_GRPC_HOST via `hostname --fqdn`. That value fed the CTP handshake's optional `server_fqdn` check: clusterd advertised its FQDN and the controller compared it against the address it dialed (transport.rs::handshake). The check only fired when the value was set, is unrelated to gRPC despite the name, and guards only against reaching a misrouted replica. Rather than re-plumb it for distroless, remove it: drop `--grpc-host`/`CLUSTERD_GRPC_HOST` and the `server_fqdn` field from the CTP `Hello`, along with the now-unused `host_from_address` helper and the `test_handshake_fqdn_mismatch` test. test_metrics byte-count bounds loosened since the handshake shrank. This is the "rip it out" alternative to the FQDN-handling PRs (#36100 in-process resolve, #36872 orchestrator-injected). Pick one. Part of SEC-236 distroless migration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jasonhernandez · 2026-06-02T20:38:37Z

Closing in favor of the #36872 → #36876 stack.

The SIGTERM handler from this PR now lives in #36872 (minimal). The FQDN question this PR solved with an in-process resolve_fqdn is instead handled by #36876, which removes the CTP server-FQDN validation entirely — it turned out to be an optional, misnamed (it's CTP, not gRPC) check that isn't worth re-plumbing for distroless.

This was referenced Apr 15, 2026

containers: prepare environmentd and clusterd for distroless migration #35859

Closed

clusterd: remove direct libc dependency #35928

Closed

jasonhernandez force-pushed the jason/distroless-process-lifecycle branch from f7adf0b to 6aaafcb Compare April 16, 2026 19:10

jasonhernandez force-pushed the jason/distroless-process-lifecycle branch from 6aaafcb to c23ae38 Compare June 2, 2026 16:57

jasonhernandez mentioned this pull request Jun 2, 2026

containers: switch environmentd and clusterd to distroless base image #36099

Draft

3 tasks

This was referenced Jun 2, 2026

orchestratord: version-gate pod security context UID/GID for distroless #36101

Draft

clusterd: SIGTERM handler for distroless containers #36872

Draft

jasonhernandez changed the title ~~clusterd,environmentd: add SIGTERM handler for distroless containers~~ clusterd: SIGTERM handler + in-process gRPC FQDN (Variant A) Jun 2, 2026

jasonhernandez mentioned this pull request Jun 2, 2026

clusterd: remove unused CTP server-FQDN validation #36876

Draft

3 tasks

jasonhernandez closed this Jun 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

clusterd: SIGTERM handler + in-process gRPC FQDN (Variant A)#36100

clusterd: SIGTERM handler + in-process gRPC FQDN (Variant A)#36100
jasonhernandez wants to merge 1 commit into
mainfrom
jason/distroless-process-lifecycle

jasonhernandez commented Apr 15, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Apr 15, 2026

Uh oh!

jasonhernandez commented Jun 2, 2026 •

edited

Loading

Uh oh!

jasonhernandez commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jasonhernandez commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Design question: where should clusterd's gRPC FQDN come from?

Why Variant A

Known downside

Test plan

Uh oh!

github-actions Bot commented Apr 15, 2026

PR title guidelines

Pre-merge checklist

Uh oh!

jasonhernandez commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Distroless migration — coordination note

Uh oh!

jasonhernandez commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jasonhernandez commented Apr 15, 2026 •

edited

Loading

jasonhernandez commented Jun 2, 2026 •

edited

Loading