fix(control-plane): implement graceful shutdown on SIGTERM/SIGINT #715

Merged
santoshkumarradha merged 5 commits into Agent-Field:main from 7vignesh:fix/427-graceful-shutdown on Jul 4, 2026

@7vignesh 7vignesh commented Jul 3, 2026

Contributor

Summary

Replace the select{} no-op with real signal handling and HTTP server drain so SIGTERM during rolling deploys no longer kills in-flight requests. Background goroutines (presence manager, health monitor, cleanup service, OTel tracer, etc.) are now stopped cleanly before the process exits.

Changes:

  • Install signal.NotifyContext for SIGINT/SIGTERM in runServer
  • Replace gin Router.Run() with net/http.Server for Shutdown() support
  • Call server.Stop() on signal: drains HTTP connections, stops background goroutines
  • Add configurable shutdown_timeout (default 30s) via YAML and AGENTFIELD_SHUTDOWN_TIMEOUT env var
  • Remove // TODO: Implement graceful shutdown comments
  • Add nil guard for healthMonitor.Stop() to prevent panic on empty server
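
The signal-handling and drain steps listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `serveUntilDone` and `runDemo` are hypothetical names, the handler is a placeholder, and the demo cancels its own context after a short delay so the program terminates without a real signal.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// serveUntilDone is a hypothetical helper (not the PR's runServer). It serves
// on ln until ctx is cancelled, then drains in-flight requests for at most
// shutdownTimeout before returning.
func serveUntilDone(ctx context.Context, srv *http.Server, ln net.Listener, shutdownTimeout time.Duration) error {
	serveErr := make(chan error, 1)
	go func() { serveErr <- srv.Serve(ln) }()

	select {
	case err := <-serveErr:
		// Serve returned before any shutdown was requested: a real failure.
		return fmt.Errorf("http server failed: %w", err)
	case <-ctx.Done():
	}

	// Use a fresh context: ctx is already cancelled and would abort the drain at once.
	drainCtx, cancel := context.WithTimeout(context.Background(), shutdownTimeout)
	defer cancel()
	if err := srv.Shutdown(drainCtx); err != nil {
		return fmt.Errorf("graceful shutdown: %w", err)
	}
	// After Shutdown, Serve reports http.ErrServerClosed; anything else is unexpected.
	if err := <-serveErr; !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}

// runDemo wires signal handling to serveUntilDone. demoLimit cancels the
// context on its own so the example exits without a real signal; a production
// server would pass the signal context directly.
func runDemo(demoLimit time.Duration) error {
	sigCtx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()
	ctx, cancel := context.WithTimeout(sigCtx, demoLimit)
	defer cancel()

	ln, err := net.Listen("tcp", "127.0.0.1:0") // port 0: any free port
	if err != nil {
		return err
	}
	srv := &http.Server{Handler: http.NotFoundHandler()} // placeholder handler
	return serveUntilDone(ctx, srv, ln, 30*time.Second)
}

func main() {
	if err := runDemo(500 * time.Millisecond); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("shutdown complete")
}
```

The drain uses a context derived from `context.Background()` rather than the signal context, because the signal context is already cancelled by the time `Shutdown` is called.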

Closes #427

Type of change

  • Bug fix

Test plan

  • cd control-plane && go test ./cmd/agentfield-server/ -v -run "TestLoadConfig|TestRunServer"
  • cd control-plane && go test ./internal/config/ -count=1
  • cd control-plane && CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build ./cmd/agentfield-server
  • cd control-plane && CGO_ENABLED=0 GOOS=darwin GOARCH=amd64 go build ./cmd/agentfield-server
  • Manual: started server locally, sent SIGINT (Ctrl+C), confirmed "Shutdown signal received, draining connections..." and clean exit with code 0

Test coverage

  • I ran tests for the surface(s) I changed locally.
  • New code paths are covered by tests in this PR (no bare additions).
  • If I removed code, I updated coverage-baseline.json in this PR only if the removal caused a legitimate regression and I called it out in the summary above.
  • The coverage gate check is green in CI before requesting review.

Checklist

Related issues / PRs

Fixes #427

Copilot AI review requested due to automatic review settings July 3, 2026 21:10
@7vignesh 7vignesh requested review from a team and AbirAbbas as code owners July 3, 2026 21:10

Copilot AI left a comment

Pull request overview

Implements first-class graceful shutdown for the control-plane server: replaces the previous “block forever” behavior with SIGINT/SIGTERM handling and introduces an http.Server so in-flight HTTP requests can be drained during rolling deploys.

Changes:

  • Added signal.NotifyContext-based shutdown waiting in runServer, followed by a server.Stop() shutdown sequence.
  • Replaced gin.Engine.Run() with net/http.Server.ListenAndServe() and added http.Server.Shutdown() on stop.
  • Introduced configurable agentfield.shutdown_timeout (default 30s) with an AGENTFIELD_SHUTDOWN_TIMEOUT env override.
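
One way the timeout resolution could work is sketched below. It assumes the env value is a Go duration string (such as `30s`), that the env var takes precedence over YAML, and that an invalid value falls back to the previously resolved timeout with an error; none of these details are confirmed by this page, and `resolveShutdownTimeout` is a hypothetical name.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultShutdownTimeout mirrors the 30s default stated in the PR.
const defaultShutdownTimeout = 30 * time.Second

// resolveShutdownTimeout is a hypothetical helper. Assumed precedence:
// env var, then YAML value, then the default. A zero yamlValue means "unset".
// On an invalid env value it returns the fallback timeout together with an
// error so the caller can decide whether to warn or abort.
func resolveShutdownTimeout(yamlValue time.Duration, envValue string) (time.Duration, error) {
	timeout := defaultShutdownTimeout
	if yamlValue > 0 {
		timeout = yamlValue
	}
	if envValue == "" {
		return timeout, nil
	}
	parsed, err := time.ParseDuration(envValue)
	if err != nil {
		return timeout, fmt.Errorf("invalid AGENTFIELD_SHUTDOWN_TIMEOUT %q: %w", envValue, err)
	}
	if parsed <= 0 {
		return timeout, fmt.Errorf("AGENTFIELD_SHUTDOWN_TIMEOUT must be positive, got %s", parsed)
	}
	return parsed, nil
}

func main() {
	timeout, err := resolveShutdownTimeout(0, os.Getenv("AGENTFIELD_SHUTDOWN_TIMEOUT"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "warning:", err)
	}
	fmt.Println("shutdown timeout:", timeout)
}
```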

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| control-plane/internal/server/server.go | Switches HTTP startup to net/http.Server and adds HTTP shutdown behavior to Stop(). |
| control-plane/internal/config/config.go | Adds ShutdownTimeout to config, sets defaults, and supports env override. |
| control-plane/config/agentfield.yaml | Documents/configures default shutdown_timeout: 30s. |
| control-plane/cmd/agentfield-server/main.go | Adds SIGINT/SIGTERM waiting and triggers Stop() for graceful shutdown. |

Comment thread control-plane/internal/server/server.go Outdated
Comment thread control-plane/internal/server/server.go Outdated
Comment thread control-plane/internal/server/server.go Outdated
Comment thread control-plane/internal/config/config.go
# Conflicts:
#	control-plane/internal/config/config.go

github-actions Bot commented Jul 3, 2026

Contributor

📊 Coverage gate

Thresholds from .coverage-gate.toml: per-surface ≥ 84%, aggregate ≥ 85%, max per-surface regression ≤ 1.0 pp, max aggregate regression ≤ 0.50 pp.

| Surface | Current | Baseline | Δ |
| --- | --- | --- | --- |
| control-plane | 86.90% | 87.40% | ↓ -0.50 pp 🟡 |
| sdk-go | 91.80% | 92.00% | ↓ -0.20 pp 🟢 |
| sdk-python | 93.76% | 93.73% | ↑ +0.03 pp 🟢 |
| sdk-typescript | 90.09% | 90.42% | ↓ -0.33 pp 🟢 |
| web-ui | 84.76% | 84.79% | ↓ -0.03 pp 🟡 |
| aggregate | 85.55% | 85.75% | ↓ -0.20 pp 🟡 |

✅ Gate passed

No surface regressed past the allowed threshold and the aggregate stayed above the floor.
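
As a worked check of the thresholds quoted above, the sketch below re-applies them to the reported numbers. It is a hypothetical re-implementation for illustration, not the gate's actual script; coverage is stored in hundredths of a percentage point so the comparisons are exact integer arithmetic.

```go
package main

import "fmt"

// surface holds coverage in hundredths of a percentage point
// (86.90% is stored as 8690) to avoid floating-point rounding.
type surface struct {
	name     string
	current  int
	baseline int
}

// Thresholds quoted in the bot comment, in the same unit.
const (
	minSurface       = 8400 // per-surface floor: 84%
	minAggregate     = 8500 // aggregate floor: 85%
	maxSurfaceDrop   = 100  // max per-surface regression: 1.0 pp
	maxAggregateDrop = 50   // max aggregate regression: 0.50 pp
)

// gatePasses reports whether every surface and the aggregate stay within
// the floors and the allowed regression. Hypothetical helper.
func gatePasses(surfaces []surface, aggregate surface) bool {
	for _, s := range surfaces {
		if s.current < minSurface || s.baseline-s.current > maxSurfaceDrop {
			return false
		}
	}
	return aggregate.current >= minAggregate &&
		aggregate.baseline-aggregate.current <= maxAggregateDrop
}

func main() {
	surfaces := []surface{
		{"control-plane", 8690, 8740},
		{"sdk-go", 9180, 9200},
		{"sdk-python", 9376, 9373},
		{"sdk-typescript", 9009, 9042},
		{"web-ui", 8476, 8479},
	}
	aggregate := surface{"aggregate", 8555, 8575}
	fmt.Println("gate passed:", gatePasses(surfaces, aggregate)) // prints "gate passed: true"
}
```

The largest per-surface drop is control-plane at 0.50 pp, inside the 1.0 pp allowance, and the aggregate drop of 0.20 pp is inside the 0.50 pp allowance, which is consistent with the reported pass.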


github-actions Bot commented Jul 3, 2026

Contributor

📐 Patch coverage gate

Threshold: 80% on lines this PR touches vs origin/main (from .coverage-gate.toml:thresholds.min_patch).

| Surface | Touched lines | Patch coverage | Status |
| --- | --- | --- | --- |
| control-plane | 77 | 89.00% | |
| sdk-go | 0 | ➖ | no changes |
| sdk-python | 0 | ➖ | no changes |
| sdk-typescript | 0 | ➖ | no changes |
| web-ui | 0 | ➖ | no changes |

✅ Patch gate passed

Every surface whose lines were touched by this PR has patch coverage at or above the threshold.

7vignesh and others added 2 commits July 4, 2026 02:59
- Test AGENTFIELD_SHUTDOWN_TIMEOUT env override and defaults
- Test Stop() on empty server (nil safety)
- Test HTTP server graceful shutdown with active listener
- Test HTTP server shutdown timeout + force close
- Test defaultWaitForShutdown unblocks on SIGINT (Linux/macOS only)
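
The "shutdown timeout + force close" case in the list above can be illustrated with a small sketch. It assumes the fallback is `http.Server.Close()` after `Shutdown` hits its deadline; `stopHTTP` and `demoForcedClose` are hypothetical names, and the blocking handler only simulates a request that outlives the drain window.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

// stopHTTP drains srv for up to timeout. If the deadline passes while
// requests are still in flight, it force-closes the remaining connections.
// forced reports whether the fallback was needed. Hypothetical helper.
func stopHTTP(srv *http.Server, timeout time.Duration) (forced bool, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	err = srv.Shutdown(ctx)
	if errors.Is(err, context.DeadlineExceeded) {
		return true, srv.Close()
	}
	return false, err
}

// demoForcedClose starts a server whose only request never finishes within
// the drain window, then stops it with a short timeout.
func demoForcedClose() (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	started := make(chan struct{})
	release := make(chan struct{})
	defer close(release) // unblock the handler goroutine once the demo is done

	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		close(started) // signal that a request is now in flight
		<-release      // simulate a slow request
	})}
	go srv.Serve(ln)

	go func() {
		// The client is expected to see an error when the connection is force-closed.
		if resp, err := http.Get("http://" + ln.Addr().String()); err == nil {
			resp.Body.Close()
		}
	}()

	select {
	case <-started:
	case <-time.After(2 * time.Second):
		srv.Close()
		return false, errors.New("request never reached the handler")
	}
	return stopHTTP(srv, 100*time.Millisecond)
}

func main() {
	forced, err := demoForcedClose()
	fmt.Println("forced close:", forced, "error:", err)
}
```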

santoshkumarradha commented Jul 3, 2026

Member

Thanks for the graceful shutdown fix. I pushed a small follow-up commit that drains HTTP first, keeps the HTTP server pointer behind a mutex, wraps the startup error, and adds the invalid AGENTFIELD_SHUTDOWN_TIMEOUT env test.

I also resolved the Copilot threads after checking the changes. Local checks passed:

  • go test ./internal/config/ -count=1
  • go test ./internal/server/ -count=1
  • go test -race ./internal/server/ -run "TestStopGracefulShutdown|TestStopHTTPServerShutdownTimeout" -count=1
  • go test ./cmd/agentfield-server/ -v -run "TestLoadConfig|TestRunServer" -count=1
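
The follow-up commit described in this comment (drain HTTP first, keep the server pointer behind a mutex) might look roughly like the sketch below. The `Server` type, its fields, and `demoStopOrder` are hypothetical stand-ins, not the control-plane's actual types; the `order` slice exists only to make the stop sequence visible.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"strings"
	"sync"
	"time"
)

// Server is a hypothetical, trimmed-down stand-in for the control-plane server.
type Server struct {
	mu         sync.Mutex
	httpServer *http.Server // guarded by mu: written by Start, read by Stop
	background []func()     // stop hooks for background services (hypothetical)
	order      []string     // records stop order, for the demo only
}

// Start publishes the http.Server under the mutex before serving.
func (s *Server) Start(ln net.Listener, h http.Handler) {
	srv := &http.Server{Handler: h}
	s.mu.Lock()
	s.httpServer = srv
	s.mu.Unlock()
	go srv.Serve(ln)
}

// Stop drains HTTP first so in-flight requests can still use the background
// services, then stops those services.
func (s *Server) Stop(timeout time.Duration) error {
	s.mu.Lock()
	srv := s.httpServer
	s.httpServer = nil
	s.mu.Unlock()

	var err error
	if srv != nil { // nil guard: Stop on a never-started server skips the drain
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		defer cancel()
		if shutdownErr := srv.Shutdown(ctx); shutdownErr != nil {
			err = fmt.Errorf("http drain: %w", shutdownErr)
		}
		s.order = append(s.order, "http")
	}
	for _, stop := range s.background {
		stop()
	}
	return err
}

// demoStopOrder starts and stops a server and returns the observed stop order.
func demoStopOrder() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	s := &Server{}
	s.background = []func(){func() { s.order = append(s.order, "background") }}
	s.Start(ln, http.NotFoundHandler())
	err = s.Stop(5 * time.Second)
	return strings.Join(s.order, " -> "), err
}

func main() {
	order, err := demoStopOrder()
	fmt.Println(order, err) // prints "http -> background <nil>"
}
```

Copying the pointer out under the lock and calling `Shutdown` outside it keeps the critical section short, so a slow drain does not block other callers that need the mutex.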

@santoshkumarradha santoshkumarradha left a comment

Member

Thanks for the follow-through here. I rechecked the graceful shutdown path, the config surface, and the added regression coverage. The shutdown timeout is wired through YAML and env overrides, the HTTP server now drains before process exit, and the required checks plus CLA are green. This looks good to merge.

@santoshkumarradha santoshkumarradha added this pull request to the merge queue Jul 4, 2026
Merged via the queue into Agent-Field:main with commit 6c6c195 Jul 4, 2026
19 checks passed

Development

Successfully merging this pull request may close these issues.

[Control Plane] Graceful shutdown not implemented (literal TODO)

3 participants