fix(server): add startup probe for gateway boot #417

Merged
drew merged 1 commit into main from 409-startup-probe-tls-logs/an
Mar 18, 2026

@drew (Collaborator) commented Mar 17, 2026

Summary

Add a Kubernetes startupProbe for the gateway StatefulSet so slow single-node boots get startup slack before liveness restarts begin.
Tighten TLS probe log handling so immediate EOF-style disconnects from TCP socket probes do not show up as misleading handshake errors.
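
To make the "startup slack" concrete, here is a rough sketch of how the startup window follows from standard Kubernetes probe timing fields. The struct and function names are hypothetical, and the numbers used with them are illustrative, not the chart's actual defaults. The estimate ignores per-attempt timeouts, so treat it as approximate.

```rust
/// Probe timing fields, named after the Kubernetes probe spec
/// (initialDelaySeconds, periodSeconds, failureThreshold).
/// Hypothetical helper type, not part of this PR.
struct ProbeTiming {
    initial_delay_seconds: u32,
    period_seconds: u32,
    failure_threshold: u32,
}

/// Approximate time the container gets to start before the kubelet
/// gives up on the startup probe and restarts it.
/// Liveness checks do not run until the startup probe has succeeded.
fn startup_budget_seconds(p: &ProbeTiming) -> u32 {
    p.initial_delay_seconds + p.period_seconds * p.failure_threshold
}
```

For example, a probe that runs every 10 seconds with a failure threshold of 30 and no initial delay gives the gateway roughly 300 seconds to boot before any liveness-driven restart can occur.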

Related Issue

Closes #409

Changes

  • add configurable startupProbe values to the OpenShell Helm chart and render the probe on the gateway StatefulSet
  • downgrade only UnexpectedEof TLS accept failures to debug-level logging while keeping other handshake failures at error
  • add unit coverage for the TLS handshake classifier and document the startup probe behavior in the gateway architecture doc
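
For reviewers skimming the logging change, the classification rule can be sketched roughly as follows. This is a minimal illustration, assuming the accept path surfaces a `std::io::Error`. The names `HandshakeLogLevel` and `classify_handshake_error` are hypothetical and may not match the code in this PR.

```rust
use std::io;

/// Log level chosen for a failed TLS accept (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HandshakeLogLevel {
    Debug,
    Error,
}

/// Treat only an immediate EOF as probe noise. A TCP socket probe
/// connects and disconnects without sending a ClientHello, which
/// appears as `UnexpectedEof`. Every other failure stays at error.
fn classify_handshake_error(err: &io::Error) -> HandshakeLogLevel {
    match err.kind() {
        io::ErrorKind::UnexpectedEof => HandshakeLogLevel::Debug,
        _ => HandshakeLogLevel::Error,
    }
}
```

Matching on the single `UnexpectedEof` kind keeps the downgrade narrow, so genuine handshake problems such as malformed records or connection resets still surface at error level.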

Testing

  • mise run pre-commit passes, except for a pre-existing SPDX check failure on scratch/scrub.sh, an untracked and ignored file in this workspace
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@drew requested a review from a team as a code owner March 17, 2026 22:58
@drew self-assigned this Mar 17, 2026
@drew force-pushed the 409-startup-probe-tls-logs/an branch 2 times, most recently from e6080a0 to 70fc831 March 17, 2026 23:17
@drew added the test:e2e label (Requires end-to-end coverage) Mar 17, 2026
@drew merged commit 13f13c2 into main Mar 18, 2026
21 checks passed
@drew deleted the 409-startup-probe-tls-logs/an branch March 18, 2026 00:08
Labels

test:e2e Requires end-to-end coverage

Development

Successfully merging this pull request may close these issues.

fix: pod CrashLoopBackOff during cluster startup due to flannel race and aggressive liveness timing

2 participants