fix(control-plane): implement graceful shutdown on SIGTERM/SIGINT#715
Conversation
Replace the select{} no-op with real signal handling and HTTP server
drain so SIGTERM during rolling deploys no longer kills in-flight
requests.
Changes:
- Install signal.NotifyContext for SIGINT/SIGTERM in runServer
- Replace gin Router.Run() with net/http.Server for Shutdown() support
- Call server.Stop() on signal: drains HTTP connections, stops background
goroutines (presence manager, health monitor, cleanup, OTel, etc.)
- Add configurable shutdown_timeout (default 30s) via YAML and
AGENTFIELD_SHUTDOWN_TIMEOUT env var
- Remove literal '// TODO: Implement graceful shutdown' comments
- Add nil guard for healthMonitor.Stop() to prevent panic on empty server
Closes Agent-Field#427
There was a problem hiding this comment.
Pull request overview
Implements first-class graceful shutdown for the control-plane server: replaces the previous “block forever” behavior with SIGINT/SIGTERM handling and introduces an http.Server so in-flight HTTP requests can be drained during rolling deploys.
Changes:
- Added
signal.NotifyContext-based shutdown waiting inrunServer, followed by aserver.Stop()shutdown sequence. - Replaced
gin.Engine.Run()withnet/http.Server.ListenAndServe()and addedhttp.Server.Shutdown()on stop. - Introduced configurable
agentfield.shutdown_timeout(default30s) with anAGENTFIELD_SHUTDOWN_TIMEOUTenv override.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| control-plane/internal/server/server.go | Switches HTTP startup to net/http.Server and adds HTTP shutdown behavior to Stop(). |
| control-plane/internal/config/config.go | Adds ShutdownTimeout to config, sets defaults, and supports env override. |
| control-plane/config/agentfield.yaml | Documents/configures default shutdown_timeout: 30s. |
| control-plane/cmd/agentfield-server/main.go | Adds SIGINT/SIGTERM waiting and triggers Stop() for graceful shutdown. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
# Conflicts: # control-plane/internal/config/config.go
📊 Coverage gateThresholds from
✅ Gate passedNo surface regressed past the allowed threshold and the aggregate stayed above the floor. |
📐 Patch coverage gateThreshold: 80% on lines this PR touches vs
✅ Patch gate passedEvery surface whose lines were touched by this PR has patch coverage at or above the threshold. |
- Test AGENTFIELD_SHUTDOWN_TIMEOUT env override and defaults - Test Stop() on empty server (nil safety) - Test HTTP server graceful shutdown with active listener - Test HTTP server shutdown timeout + force close - Test defaultWaitForShutdown unblocks on SIGINT (Linux/macOS only)
|
Thanks for the graceful shutdown fix. I pushed a small follow-up commit that drains HTTP first, keeps the HTTP server pointer behind a mutex, wraps the startup error, and adds the invalid I also resolved the Copilot threads after checking the changes. Local checks passed:
|
santoshkumarradha
left a comment
There was a problem hiding this comment.
Thanks for the follow-through here. I rechecked the graceful shutdown path, the config surface, and the added regression coverage. The shutdown timeout is wired through YAML and env overrides, the HTTP server now drains before process exit, and the required checks plus CLA are green. This looks good to merge.
Summary
Replace the
select{}no-op with real signal handling and HTTP server drain soSIGTERMduring rolling deploys no longer kills in-flight requests. Background goroutines (presence manager, health monitor, cleanup service, OTel tracer, etc.) are now stopped cleanly before the process exits.Changes:
signal.NotifyContextforSIGINT/SIGTERMinrunServergin Router.Run()withnet/http.ServerforShutdown()supportserver.Stop()on signal: drains HTTP connections, stops background goroutinesshutdown_timeout(default30s) via YAML andAGENTFIELD_SHUTDOWN_TIMEOUTenv var// TODO: Implement graceful shutdowncommentshealthMonitor.Stop()to prevent panic on empty serverCloses #427
Type of change
Test plan
cd control-plane && go test ./cmd/agentfield-server/ -v -run "TestLoadConfig|TestRunServer"cd control-plane && go test ./internal/config/ -count=1cd control-plane && CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build ./cmd/agentfield-servercd control-plane && CGO_ENABLED=0 GOOS=darwin GOARCH=amd64 go build ./cmd/agentfield-serverSIGINT(Ctrl+C), confirmed"Shutdown signal received, draining connections..."and clean exit with code0Test coverage
coverage-baseline.jsonin this PR only if the removal caused a legitimate regression and I called it out in the summary above.Checklist
Related issues / PRs
Fixes #427