Skip to content

P0-7: graceful shutdown readiness drain (MR-P0-7)#120

Merged
mastermanas805 merged 5 commits into
masterfrom
ship/p0-7-graceful-shutdown-readiness-2026-05-20
May 20, 2026
Merged

P0-7: graceful shutdown readiness drain (MR-P0-7)#120
mastermanas805 merged 5 commits into
masterfrom
ship/p0-7-graceful-shutdown-readiness-2026-05-20

Conversation

@mastermanas805
Copy link
Copy Markdown
Member

Diagnosis

Pre-existing state on master:

  • runServerWithGracefulShutdown helper was wired in main.go (calls app.ShutdownWithTimeout on SIGTERM, drains in-flight requests in 25s).
  • TestRunServerWithGracefulShutdown_DrainsInflight already proved the drain contract.

Gap identified: /readyz kept returning 200 right up to listener close. The kubelet's readinessProbe never observed a 503, so the Service kept routing new traffic to a pod that was about to stop accepting connections — producing the connection-reset symptom on every rolling restart. The k8s manifest also lacked terminationGracePeriodSeconds and preStop (companion PR in infra repo).

Diff Summary

File Change
internal/handlers/readyz.go New MarkDraining() + IsDraining(); Get() short-circuits to 503 + overall=failed + a single shutting_down check when draining. atomic.Bool flag.
internal/router/router.go New ShutdownHooks struct + NewWithHooks() returning (*fiber.App, ShutdownHooks) so main.go can flip Readyz.MarkDraining. Legacy New() preserved for existing tests.
main.go New readinessDrainGrace = 3 * time.Second. runServerWithGracefulShutdown now takes router.ShutdownHooks and on SIGTERM: (A) calls hooks.Readyz.MarkDraining; (B) sleeps readinessDrainGrace; (C) app.ShutdownWithTimeout(25s). Doc updated to reflect terminationGracePeriodSeconds=35.
graceful_shutdown_test.go Existing drain test updated to pass router.ShutdownHooks{}. Two new tests: TestRunServerWithGracefulShutdown_MarksReadinessDraining (asserts the flag flip on SIGTERM); TestRunServerWithGracefulShutdown_TimeoutKillsStuckRequest (asserts the 25s timeout fires non-nil and exits in bounded time).
internal/handlers/readyz_test.go TestReadyz_DrainingReturns503 (no sqlmock expectations — proves the runner is NOT consulted in drain mode); TestReadyz_DrainingIsIdempotent.

Test Output (verbatim)

=== RUN   TestRunServerWithGracefulShutdown_DrainsInflight
2026/05/20 16:29:58 INFO server.shutdown_signal_received timeout_seconds=5 readiness_drain_grace_seconds=3
2026/05/20 16:30:01 INFO server.graceful_shutdown_complete
--- PASS: TestRunServerWithGracefulShutdown_DrainsInflight (3.03s)
=== RUN   TestRunServerWithGracefulShutdown_MarksReadinessDraining
2026/05/20 16:30:01 INFO server.shutdown_signal_received timeout_seconds=3 readiness_drain_grace_seconds=3
2026/05/20 16:30:01 INFO server.readiness_marked_draining
2026/05/20 16:30:04 INFO server.graceful_shutdown_complete
--- PASS: TestRunServerWithGracefulShutdown_MarksReadinessDraining (3.13s)
=== RUN   TestRunServerWithGracefulShutdown_TimeoutKillsStuckRequest
2026/05/20 16:30:04 INFO server.shutdown_signal_received timeout_seconds=0 readiness_drain_grace_seconds=3
2026/05/20 16:30:07 ERROR server.graceful_shutdown_failed error="context deadline exceeded"
--- PASS: TestRunServerWithGracefulShutdown_TimeoutKillsStuckRequest (3.51s)
PASS
ok  	instant.dev	10.845s
=== RUN   TestReadyz_DrainingReturns503
--- PASS: TestReadyz_DrainingReturns503 (0.00s)
=== RUN   TestReadyz_DrainingIsIdempotent
--- PASS: TestReadyz_DrainingIsIdempotent (0.00s)
PASS
ok  	instant.dev/internal/handlers	1.407s

go build ./... and go vet ./... clean.

Coverage Block (CLAUDE.md rule 17)

Symptom:        connection-resets on /healthz/openapi.json across rolling restart
                (kubelet sends SIGTERM, api binary exits immediately, in-flight
                requests RST and Razorpay webhooks during the window are LOST)
Enumeration:    rg -n 'app\.Listen|ShutdownWithTimeout|MarkDraining|terminationGracePeriodSeconds|preStop' \
                  in api/ and infra/k8s/app.yaml
Sites found:    5 — main.go helper, infra app.yaml deployment, /readyz handler,
                k8s readinessProbe wiring, k8s preStop wiring
Sites touched:  5 (companion infra PR ships the manifest sites; api PR ships
                the binary-side sites)
Coverage test:  TestRunServerWithGracefulShutdown_MarksReadinessDraining fails
                if MarkDraining gets moved to AFTER ShutdownWithTimeout (the
                exact regression class). TestReadyz_DrainingReturns503 fails
                if a future Get() refactor re-enters the runner during drain.
Live verified:  pending — requires merge + Deploy run + curl loop during
                rolling restart (see CLAUDE.md rule 21).

Required Companion PR

infra repoship/p0-7-api-grace-period-2026-05-20
adds terminationGracePeriodSeconds: 35 and a preStop: sleep 5 lifecycle hook
to k8s/app.yaml. Must land together or the in-process drain logic stalls until
SIGKILL at 30s default.

Live Verify Plan (post-merge)

  1. CI Deploy green + curl https://api.instanode.dev/healthz | jq .commit_id matches HEAD.
  2. kubectl rollout restart deploy/instant-api -n instant
  3. Loop curl -w '%{http_code}\n' /healthz /openapi.json for 60s — assert
    ZERO non-2xx (no connection-resets).
  4. kubectl get events -n instant shows no 'killed before terminationGracePeriod'.

🤖 Generated with Claude Code

mastermanas805 and others added 5 commits May 20, 2026 16:24
…R-P0-7)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n (MR-P0-7)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…P0-7)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…0-7)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…MR-P0-7)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 merged commit bd97fab into master May 20, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant