Skip to content

feat: graceful shutdown#205

Merged
lindesvard merged 3 commits intomainfrom
graceful-healthchecks
Oct 2, 2025
Merged

feat: graceful shutdown#205
lindesvard merged 3 commits intomainfrom
graceful-healthchecks

Conversation

@lindesvard
Copy link
Copy Markdown
Contributor

@lindesvard lindesvard commented Sep 30, 2025

Summary by CodeRabbit

  • New Features
    • Added Kubernetes-style health endpoints: /healthz/live (always reports liveness) and /healthz/ready (per-dependency readiness). Kept /healthcheck for compatibility.
  • Refactor
    • Simplified health behavior to lightweight, real-time dependency probes with clearer ready/live responses and proper error codes.
  • Chores
    • Centralized graceful shutdown for safer deployments and signal handling.
    • Reduced log noise by excluding /healthz requests from request logging.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Sep 30, 2025

Walkthrough

Refactors health checks into Kubernetes-style liveness and readiness endpoints, adds a centralized graceful shutdown utility and wiring, updates index to use new endpoints and shutdown handlers, and extends request-logging to ignore the new health path.

Changes

Cohort / File(s) Summary of changes
Health endpoints refactor
apps/api/src/controllers/healthcheck.controller.ts
Replaces legacy timing/queue checks with liveness and readiness handlers. readiness short-circuits when shutting down, performs lightweight checks (Redis, DB, query cache), and returns per-dependency flags; errors return 503. Removed healthcheckQueue and legacy timing logic.
Request logging filter
apps/api/src/hooks/request-logging.hook.ts
Adds /healthz to the ignore list so health endpoints are excluded from request logging.
API wiring and signals
apps/api/src/index.ts
Exposes /healthz/live and /healthz/ready (keeps /healthcheck for backward compatibility). Replaces manual signal handlers with centralized shutdown usage and wires SIGTERM/SIGINT/uncaughtException/unhandledRejection to graceful shutdown.
Graceful shutdown utility
apps/api/src/utils/graceful-shutdown.ts
New multi-step shutdown: setShuttingDown / isShuttingDown and shutdown(fastify, signal, exitCode?). Marks shutting down, waits grace period, closes Fastify, disconnects DB/ClickHouse, closes Bull queues, closes Redis connections, logs per-step outcomes, then exits.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Client
  participant F as Fastify
  participant HC as HealthController
  participant R as Redis
  participant D as DB
  participant Q as QueryCache

  Note over C,F: Liveness
  C->>F: GET /healthz/live
  F->>HC: liveness()
  HC-->>F: { live: true }
  F-->>C: 200 OK

  rect rgba(230,245,255,0.5)
  Note over C,F: Readiness (short-circuits if shutting down)
  C->>F: GET /healthz/ready
  F->>HC: readiness()
  HC->>HC: check isShuttingDown()
  alt Shutting down
    HC-->>F: { ready: false, reason: "shutting_down" }
    F-->>C: 503 Service Unavailable
  else Not shutting down
    par Redis check
      HC->>R: ping/info
      R-->>HC: ok/fail
    and DB check
      HC->>D: simple query/connect
      D-->>HC: ok/fail
    and Query cache check
      HC->>Q: ping/get
      Q-->>HC: ok/fail
    end
    alt Any failure
      HC-->>F: { ready: false, deps: {...} }
      F-->>C: 503 Service Unavailable
    else All pass
      HC-->>F: { ready: true, deps: {...} }
      F-->>C: 200 OK
    end
  end
  end
Loading
sequenceDiagram
  autonumber
  participant OS as OS/Node Runtime
  participant F as Fastify
  participant GS as GracefulShutdown
  participant DB as Database
  participant CH as ClickHouse
  participant BQ as BullQueues
  participant RC as RedisConns

  OS-->>F: SIGTERM/SIGINT or fatal error
  F->>GS: shutdown(fastify, signal)
  GS->>GS: setShuttingDown(true)
  Note over GS: Wait SHUTDOWN_GRACE_PERIOD_MS (~5s default)
  GS->>F: fastify.close()
  F-->>GS: closed/err (logged)
  par Close backends
    GS->>DB: disconnect()
    DB-->>GS: ok/err
    GS->>CH: disconnect()
    CH-->>GS: ok/err
    GS->>BQ: close queues (events/sessions/cron/...)
    BQ-->>GS: ok/err
    GS->>RC: close cache/pub/sub/queue if ready
    RC-->>GS: ok/err
  end
  GS-->>OS: process.exit(exitCode)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

Thump-thump, I check each burrow bright,
Live flag waving in the moonlit night.
Ready I am—unless storms call,
Then I tuck in, close every hall.
Quiet logs, soft shutdown song, carrots safe and strong. 🥕🫶

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title “feat: graceful shutdown” succinctly captures the introduction of a coordinated shutdown sequence triggered by process signals and follows conventional commit style, clearly reflecting the core objective of this pull request without extraneous detail.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch graceful-healthchecks

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
apps/api/src/utils/graceful-shutdown.ts (3)

18-26: Shutdown flag lacks thread-safety guarantees.

JavaScript is single-threaded for execution, but the boolean flag shuttingDown could be subject to race conditions if multiple shutdown signals arrive simultaneously (e.g., SIGTERM followed quickly by SIGINT). While unlikely to cause major issues in practice due to the event loop model, consider using an atomic flag or checking if shutdown is already in progress at the start of the shutdown() function.

Add an early-return guard in the shutdown function:

 export async function shutdown(fastify: FastifyInstance, signal: string) {
+  if (shuttingDown) {
+    logger.warn('Shutdown already in progress, ignoring signal', { signal });
+    return;
+  }
   logger.info('Starting graceful shutdown', { signal });
   setShuttingDown(true);

34-35: Consider making the load balancer grace period configurable.

The hard-coded 5-second wait is reasonable and aligns with typical Kubernetes preStop hook delays, but consider making it configurable via an environment variable for flexibility across different deployment environments.

Example:

-  // Step 2: Wait for load balancer to stop sending traffic (matches preStop sleep)
-  await new Promise((resolve) => setTimeout(resolve, 5000));
+  // Step 2: Wait for load balancer to stop sending traffic (matches preStop sleep)
+  const gracePeriod = Number(process.env.SHUTDOWN_GRACE_PERIOD_MS || '5000');
+  await new Promise((resolve) => setTimeout(resolve, gracePeriod));

96-97: Consider using non-zero exit code for error-triggered shutdowns.

The shutdown function always exits with code 0, even when triggered by uncaughtException or unhandledRejection. This may confuse orchestrators (Kubernetes, Docker) that rely on exit codes to determine if a restart is needed.

Pass the exit code or distinguish error signals:

-export async function shutdown(fastify: FastifyInstance, signal: string) {
+export async function shutdown(fastify: FastifyInstance, signal: string, exitCode = 0) {
   logger.info('Starting graceful shutdown', { signal });
   
   // ... shutdown steps ...
   
   logger.info('Graceful shutdown completed');
-  process.exit(0);
+  process.exit(exitCode);
 }

Then in index.ts:

-      process.on('uncaughtException', async (error) => {
+      process.on('uncaughtException', (error) => {
         logger.error('Uncaught exception', error);
-        await shutdown(fastify, 'uncaughtException');
+        shutdown(fastify, 'uncaughtException', 1);
       });
apps/api/src/controllers/healthcheck.controller.ts (1)

36-60: LGTM! Readiness probe correctly integrates shutdown awareness.

The readiness endpoint properly:

  • Short-circuits to 503 when the shutdown flag is set
  • Performs lightweight dependency checks to determine readiness
  • Returns detailed status information for debugging

One optional improvement: Consider adding timeouts to the dependency checks to prevent the readiness probe from hanging if a dependency becomes slow during shutdown or failure scenarios.

Example timeout wrapper:

const withTimeout = <T>(promise: Promise<T>, ms: number): Promise<T> => {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) => 
      setTimeout(() => reject(new Error('Timeout')), ms)
    ),
  ]);
};

// Then use:
const redisRes = await withTimeout(getRedisCache().ping(), 1000);
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5092b6a and b00ce5d.

📒 Files selected for processing (4)
  • apps/api/src/controllers/healthcheck.controller.ts (1 hunks)
  • apps/api/src/hooks/request-logging.hook.ts (1 hunks)
  • apps/api/src/index.ts (4 hunks)
  • apps/api/src/utils/graceful-shutdown.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
apps/api/src/utils/graceful-shutdown.ts (4)
packages/db/src/clickhouse/client.ts (2)
  • error (40-45)
  • ch (130-140)
packages/db/src/prisma-client.ts (1)
  • db (270-270)
packages/queue/src/queues.ts (5)
  • eventsQueue (94-104)
  • sessionsQueue (106-111)
  • cronQueue (116-121)
  • miscQueue (123-128)
  • notificationQueue (137-145)
packages/redis/redis.ts (4)
  • getRedisCache (64-70)
  • getRedisPub (82-88)
  • getRedisSub (73-79)
  • getRedisQueue (91-106)
apps/api/src/index.ts (3)
apps/api/src/controllers/healthcheck.controller.ts (3)
  • healthcheck (7-29)
  • liveness (32-34)
  • readiness (37-60)
apps/api/src/utils/logger.ts (1)
  • logger (3-3)
apps/api/src/utils/graceful-shutdown.ts (1)
  • shutdown (29-98)
apps/api/src/controllers/healthcheck.controller.ts (4)
packages/redis/redis.ts (1)
  • getRedisCache (64-70)
packages/db/src/prisma-client.ts (1)
  • db (270-270)
packages/db/src/clickhouse/client.ts (1)
  • chQuery (180-185)
apps/api/src/utils/graceful-shutdown.ts (1)
  • isShuttingDown (24-26)
🔇 Additional comments (10)
apps/api/src/hooks/request-logging.hook.ts (1)

4-4: LGTM! Correctly excludes new health endpoints from logging.

The addition of '/healthz' to the ignore list appropriately reduces log noise from frequent Kubernetes health probes while maintaining backward compatibility with the existing '/healthcheck' path.

apps/api/src/index.ts (3)

24-25: LGTM! Health endpoint imports aligned with Kubernetes patterns.

The import of liveness and readiness functions supports the new Kubernetes-style health check endpoints.


44-44: LGTM! Centralized shutdown utility imported.


177-181: LGTM! Health endpoints properly registered with backward compatibility.

The implementation correctly:

  • Maintains the existing /healthcheck endpoint for backward compatibility
  • Adds Kubernetes-standard /healthz/live for liveness probes
  • Adds /healthz/ready for readiness probes
apps/api/src/utils/graceful-shutdown.ts (3)

37-43: LGTM! Fastify server closure properly handled.

The fastify.close() call drains in-flight requests gracefully with appropriate error handling.


75-94: LGTM! Redis connection closure properly handled.

The code correctly:

  • Checks connection status before calling quit()
  • Uses quit() instead of disconnect() to gracefully wait for pending commands
  • Handles errors per step

The status check for 'ready' is appropriate—connections in other states (e.g., 'connecting', 'disconnecting', 'end') should not be quit again.


61-73: queue.close() in API only tears down Redis connections, not jobs
In the API process you’re only enqueuing work—actual job execution and graceful shutdown (waiting on active jobs) occurs in the worker process via worker.close(). Calling queue.close() here simply closes the client connections; there are no in-flight jobs to await in this service.

Likely an incorrect or invalid review comment.

apps/api/src/controllers/healthcheck.controller.ts (3)

1-1: LGTM! Import of shutdown flag for readiness check.


6-29: LGTM! Backward-compatible healthcheck with proper dependency checks.

The healthcheck function correctly validates all dependencies (Redis, DB, ClickHouse) and returns appropriate status codes with detailed responses.


31-34: LGTM! Liveness probe correctly implemented.

The liveness endpoint properly follows Kubernetes conventions by only checking if the process is alive, without performing dependency checks.

Comment thread apps/api/src/index.ts
@vercel
Copy link
Copy Markdown

vercel Bot commented Oct 2, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
openpanel-public Ready Ready Preview Comment Oct 2, 2025 8:58am

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
apps/api/src/index.ts (1)

220-229: Process event handlers don't await async functions.

While the handlers are now declared async and use await, Node.js process event handlers (SIGTERM, SIGINT, uncaughtException, unhandledRejection) are inherently synchronous and won't wait for async operations to complete. The async/await wrapper here doesn't provide the synchronization guarantee you might expect.

However, since your shutdown() utility calls process.exit() at the end of its execution, the process will remain alive until shutdown completes. The current implementation works, but the async/await wrapper is misleading about how it achieves graceful shutdown.

Consider simplifying to make the control flow clearer:

-      process.on('SIGTERM', async () => await shutdown(fastify, 'SIGTERM', 0));
-      process.on('SIGINT', async () => await shutdown(fastify, 'SIGINT', 0));
-      process.on('uncaughtException', async (error) => {
+      process.on('SIGTERM', () => shutdown(fastify, 'SIGTERM', 0));
+      process.on('SIGINT', () => shutdown(fastify, 'SIGINT', 0));
+      process.on('uncaughtException', (error) => {
         logger.error('Uncaught exception', error);
-        await shutdown(fastify, 'uncaughtException', 1);
+        shutdown(fastify, 'uncaughtException', 1);
       });
-      process.on('unhandledRejection', async (reason, promise) => {
+      process.on('unhandledRejection', (reason, promise) => {
         logger.error('Unhandled rejection', { reason, promise });
-        await shutdown(fastify, 'unhandledRejection', 1);
+        shutdown(fastify, 'unhandledRejection', 1);
       });

This makes it clear that the handlers fire-and-forget the shutdown promise, which is safe because shutdown() manages the process lifecycle internally via process.exit().

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b00ce5d and ce536ca.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (2)
  • apps/api/src/index.ts (4 hunks)
  • apps/api/src/utils/graceful-shutdown.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • apps/api/src/utils/graceful-shutdown.ts
🧰 Additional context used
🧬 Code graph analysis (1)
apps/api/src/index.ts (2)
apps/api/src/controllers/healthcheck.controller.ts (3)
  • healthcheck (7-29)
  • liveness (32-34)
  • readiness (37-60)
apps/api/src/utils/graceful-shutdown.ts (1)
  • shutdown (29-108)
🔇 Additional comments (2)
apps/api/src/index.ts (2)

24-25: LGTM! Clean imports for health endpoints and shutdown utility.

The new imports for Kubernetes-style health endpoints (liveness, readiness) and the centralized shutdown utility align well with the PR's graceful shutdown objectives.

Also applies to: 44-44


177-181: LGTM! Proper Kubernetes health endpoint implementation.

The implementation correctly:

  • Maintains backward compatibility with the existing /healthcheck endpoint
  • Follows Kubernetes conventions with /healthz prefix
  • Separates liveness (always responds) from readiness (checks dependencies and shutdown state)

@lindesvard lindesvard merged commit ca4a880 into main Oct 2, 2025
9 checks passed
@lindesvard lindesvard deleted the graceful-healthchecks branch November 7, 2025 15:18
@coderabbitai coderabbitai Bot mentioned this pull request Apr 8, 2026
@coderabbitai coderabbitai Bot mentioned this pull request Apr 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant