
fix: Simplify and remove load from /livez #39070

Draft
geekgonecrazy wants to merge 2 commits into develop from fix/simplify-readyz-livez

Conversation


@geekgonecrazy geekgonecrazy commented Feb 25, 2026

Proposed changes (including videos or screenshots)

This PR refactors the Kubernetes health probe endpoints (/livez and /readyz) to follow SRE best practices and reduce unnecessary database load.

Problem

Both the /livez and /readyz endpoints performed identical health checks: MongoDB connectivity, memory usage, and event loop lag. This had two drawbacks:

  • Unnecessary MongoDB load (twice as many ping commands)
  • The liveness and readiness probes did not follow the Kubernetes separation of concerns

Solution

Implemented the SRE standard "dumb liveness probe" pattern:

/livez (Liveness Probe)

  • Now returns 200 OK immediately without any health checks
  • Rationale: If the Node.js process is truly deadlocked, it won't respond to HTTP requests at all, so the probe will time out and Kubernetes will restart the pod automatically
  • Eliminates redundant MongoDB pings and metric calculations
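
A minimal sketch of the "dumb" liveness handler described above. This is not the code from health.ts: the `ProbeResponse` interface and the `handleLivez` name are hypothetical, chosen so the example runs without a web framework. The `{"status":"ok"}` body matches the expected output listed in the test steps.

```typescript
// Minimal response surface the handler needs. Node's http.ServerResponse
// satisfies this shape. The interface name is hypothetical.
interface ProbeResponse {
  writeHead(statusCode: number, headers: Record<string, string>): unknown;
  end(body: string): unknown;
}

// Liveness: answer immediately. No MongoDB ping, no memory or event loop
// measurements. If the process is deadlocked this function never runs,
// the probe times out, and Kubernetes restarts the pod.
function handleLivez(res: ProbeResponse): void {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ status: 'ok' }));
}

// Usage with a fake response object, to show what the probe would see.
const seen: { status?: number; body?: string } = {};
handleLivez({
  writeHead: (statusCode) => {
    seen.status = statusCode;
  },
  end: (body) => {
    seen.body = body;
  },
});
console.log(seen.status, seen.body); // prints: 200 {"status":"ok"}
```

In the PR itself the handler is registered through WebApp.rawHandlers.use('/livez', ...); the sketch only illustrates that the handler body does no dependency work.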

/readyz (Readiness Probe)

  • Continues to perform comprehensive health checks:
    • MongoDB connectivity
    • Heap memory usage (configurable threshold)
    • Event loop lag p99 (configurable threshold)
  • Determines if the application can successfully process user requests
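
The readiness decision can be sketched as a pure function over the three measurements listed above. Everything named here (`ReadinessSample`, `ReadinessThresholds`, `evaluateReadiness`, the keys inside `checks`, and the threshold values) is an assumption for illustration; only the `ok` / `unavailable` statuses and the 503 code come from the test steps in this description.

```typescript
// Measurements gathered by the readiness probe. Field names are hypothetical.
interface ReadinessSample {
  mongoOk: boolean; // result of a MongoDB ping
  heapUsedRatio: number; // heap used / heap limit, in the range 0..1
  eventLoopLagP99Ms: number; // p99 event loop delay in milliseconds
}

// Configurable limits. Names and units are assumptions.
interface ReadinessThresholds {
  maxHeapUsedRatio: number;
  maxEventLoopLagP99Ms: number;
}

interface ReadinessResult {
  httpStatus: 200 | 503;
  body: { status: 'ok' | 'unavailable'; checks: Record<string, boolean> };
}

// Ready only if every individual check passes; otherwise report 503 so the
// pod is taken out of rotation without being restarted.
function evaluateReadiness(sample: ReadinessSample, limits: ReadinessThresholds): ReadinessResult {
  const checks = {
    mongo: sample.mongoOk,
    memory: sample.heapUsedRatio <= limits.maxHeapUsedRatio,
    eventLoop: sample.eventLoopLagP99Ms <= limits.maxEventLoopLagP99Ms,
  };
  const ready = Object.values(checks).every(Boolean);
  return {
    httpStatus: ready ? 200 : 503,
    body: { status: ready ? 'ok' : 'unavailable', checks },
  };
}

// Assumed thresholds, for illustration only.
const limits: ReadinessThresholds = { maxHeapUsedRatio: 0.9, maxEventLoopLagP99Ms: 200 };

const healthy = evaluateReadiness({ mongoOk: true, heapUsedRatio: 0.4, eventLoopLagP99Ms: 12 }, limits);
const lagging = evaluateReadiness({ mongoOk: true, heapUsedRatio: 0.4, eventLoopLagP99Ms: 950 }, limits);
console.log(healthy.httpStatus, healthy.body.status);
console.log(lagging.httpStatus, lagging.body.status);
```

In a Node.js process the inputs could be read from process.memoryUsage() and from the histogram returned by perf_hooks.monitorEventLoopDelay(), whose percentile(99) value is reported in nanoseconds and would need converting to milliseconds.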

Code Changes

  • Simplified /livez endpoint to immediately return success
  • Removed unused eventLoopHistogramLiveness (only need one for readiness)
  • Renamed performHealthChecks() to performReadinessChecks() for clarity
  • Updated documentation to clarify the purpose of each endpoint
  • Applied the core principle: "Is my process dead?" (liveness) vs "Can I serve traffic?" (readiness)

Benefits

  • ✅ Roughly 50% reduction in MongoDB health check load (assuming both probes run at the same interval)
  • ✅ Follows Kubernetes and SRE community best practices
  • ✅ Clearer separation of concerns
  • ✅ Simpler, more maintainable code
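
The size of the load reduction depends on how often each probe fires. The sketch below works through the arithmetic; the replica count and probe periods are assumed values, and `Probe`, `mongoPingsPerMinute`, and `reduction` are hypothetical names. `periodSeconds` mirrors the Kubernetes probe field of the same name.

```typescript
// A probe as configured on the Kubernetes side. pingsMongo says whether the
// endpoint behind the probe issues a MongoDB ping.
interface Probe {
  periodSeconds: number;
  pingsMongo: boolean;
}

// MongoDB pings per minute generated by health probes across all pods.
function mongoPingsPerMinute(replicas: number, probes: Probe[]): number {
  const perPod = probes
    .filter((p) => p.pingsMongo)
    .reduce((sum, p) => sum + 60 / p.periodSeconds, 0);
  return replicas * perPod;
}

// Fraction of probe-driven pings removed when /livez stops pinging MongoDB.
function reduction(replicas: number, livenessPeriod: number, readinessPeriod: number): number {
  const readiness: Probe = { periodSeconds: readinessPeriod, pingsMongo: true };
  const before = mongoPingsPerMinute(replicas, [{ periodSeconds: livenessPeriod, pingsMongo: true }, readiness]);
  const after = mongoPingsPerMinute(replicas, [{ periodSeconds: livenessPeriod, pingsMongo: false }, readiness]);
  return (before - after) / before;
}

console.log(reduction(3, 10, 10)); // equal intervals: 0.5
console.log(reduction(3, 30, 10)); // liveness probed less often: 0.25
```

With equal intervals the saving is exactly half; if the liveness probe was already configured to run less often than the readiness probe, the saving is smaller.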

Issue(s)

Steps to test or reproduce

  1. Start the Rocket.Chat server
  2. Test the endpoints:
    # Liveness - should always return 200 OK immediately
    curl http://localhost:3000/livez
    # Expected: {"status":"ok"}
    
    # Readiness - performs all health checks
    curl http://localhost:3000/readyz
    # Expected: {"status":"ok","checks":{...}} if healthy
    # Expected: {"status":"unavailable","checks":{...}} with 503 if unhealthy
    
    # Legacy endpoint - unchanged
    curl http://localhost:3000/health
    # Expected: {"status":"ok"}
    

Summary by CodeRabbit

Release Notes

  • Performance
    • Optimized Kubernetes health probe endpoints to reduce database load. Liveness probes now respond instantly without performing dependency checks, streamlining probe cycles. Readiness probes continue performing comprehensive health assessments including database connectivity and system resource monitoring to ensure traffic readiness.

@geekgonecrazy geekgonecrazy requested a review from a team as a code owner February 25, 2026 22:33

dionisio-bot Bot commented Feb 25, 2026

Looks like this PR is not ready to merge, because of the following issues:

  • This PR is missing the 'stat: QA assured' label
  • This PR is missing the required milestone or project

Please fix the issues and try again

If you have any trouble, please check the PR guidelines


changeset-bot Bot commented Feb 25, 2026

🦋 Changeset detected

Latest commit: 259eceb

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 41 packages
Name Type
@rocket.chat/meteor Patch
@rocket.chat/core-typings Patch
@rocket.chat/rest-typings Patch
@rocket.chat/uikit-playground Patch
@rocket.chat/api-client Patch
@rocket.chat/apps Patch
@rocket.chat/core-services Patch
@rocket.chat/cron Patch
@rocket.chat/ddp-client Patch
@rocket.chat/fuselage-ui-kit Patch
@rocket.chat/gazzodown Patch
@rocket.chat/http-router Patch
@rocket.chat/livechat Patch
@rocket.chat/model-typings Patch
@rocket.chat/ui-avatar Patch
@rocket.chat/ui-client Patch
@rocket.chat/ui-contexts Patch
@rocket.chat/ui-voip Patch
@rocket.chat/web-ui-registration Patch
@rocket.chat/account-service Patch
@rocket.chat/authorization-service Patch
@rocket.chat/ddp-streamer Patch
@rocket.chat/omnichannel-transcript Patch
@rocket.chat/presence-service Patch
@rocket.chat/queue-worker Patch
@rocket.chat/abac Patch
@rocket.chat/federation-matrix Patch
@rocket.chat/license Patch
@rocket.chat/media-calls Patch
@rocket.chat/omnichannel-services Patch
@rocket.chat/pdf-worker Patch
@rocket.chat/presence Patch
rocketchat-services Patch
@rocket.chat/models Patch
@rocket.chat/network-broker Patch
@rocket.chat/omni-core-ee Patch
@rocket.chat/mock-providers Patch
@rocket.chat/ui-video-conf Patch
@rocket.chat/instance-status Patch
@rocket.chat/omni-core Patch
@rocket.chat/server-fetch Patch



coderabbitai Bot commented Feb 25, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1c201367-5aef-4a4d-844e-f2dc5522609a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

The changes implement Kubernetes health probe optimizations by splitting health check responsibilities: /livez now returns a simple 200 OK status without dependency checks, while /readyz retains comprehensive checks for traffic readiness. This reduces MongoDB load from frequent liveness probes.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Changeset Documentation: .changeset/optimize-health-probes-for-k8s.md | New changeset documenting Kubernetes health probe optimization with patch bump to @rocket.chat/meteor. |
| Health Check Implementation: apps/meteor/server/routes/health.ts | Modified health check endpoints: /livez simplified to always return 200 OK without dependency checks; /readyz retains dependency checks (MongoDB, memory, event loop). Consolidated event loop histogram initialization and renamed performHealthChecks to performReadinessChecks. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

type: feature

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: Simplify and remove load from /livez' is clear, specific, and directly describes the main change: simplifying the /livez endpoint and reducing its load. |


@coderabbitai coderabbitai Bot added the "type: feature" label Feb 25, 2026

@cubic-dev-ai cubic-dev-ai Bot left a comment


No issues found across 2 files


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
apps/meteor/server/routes/health.ts (1)

116-131: Trim added implementation comments to match repo style guidance.

The newly expanded endpoint rationale comments should be minimized or moved to docs to keep implementation lean.

As per coding guidelines: "Avoid code comments in the implementation".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/meteor/server/routes/health.ts` around lines 116 - 131, The added
multi-line explanatory comments above the /livez handler and before the
readiness probe are too verbose for in-code docs; reduce them to a single-line
summary or remove them and move detailed rationale to external docs.
Specifically, trim the block comment describing liveness/readiness to one
concise line (e.g., "Liveness probe: verifies process responds" and "Readiness
probe: verifies service readiness") or delete it around the
WebApp.rawHandlers.use('/livez', ...) and the readiness probe comment area so
only minimal, repo-style comments remain.


📥 Commits

Reviewing files that changed from the base of the PR and between fea32db and 6808812.

📒 Files selected for processing (2)
  • .changeset/optimize-health-probes-for-k8s.md
  • apps/meteor/server/routes/health.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: CodeQL-Build
  • GitHub Check: CodeQL-Build
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{ts,tsx,js}

📄 CodeRabbit inference engine (.cursor/rules/playwright.mdc)

**/*.{ts,tsx,js}: Write concise, technical TypeScript/JavaScript with accurate typing in Playwright tests
Avoid code comments in the implementation

Files:

  • apps/meteor/server/routes/health.ts
🧠 Learnings (4)
📓 Common learnings
Learnt from: geekgonecrazy
Repo: RocketChat/Rocket.Chat PR: 36955
File: apps/meteor/server/routes/health.ts:101-111
Timestamp: 2025-09-22T21:37:05.164Z
Learning: In Rocket.Chat's health check implementation, the /livez endpoint intentionally performs the same checks as /readyz rather than being a lightweight process check. This design ensures that pods stuck in a permanently unready state will eventually be terminated by the liveness probe, with orchestrator configuration controlling the different failure thresholds and timeouts between readiness and liveness checks.
Learnt from: ahmed-n-abdeltwab
Repo: RocketChat/Rocket.Chat PR: 38974
File: apps/meteor/app/api/server/v1/im.ts:220-221
Timestamp: 2026-02-24T19:09:09.561Z
Learning: In RocketChat/Rocket.Chat OpenAPI migration PRs for apps/meteor/app/api/server/v1 endpoints, maintainers prefer to avoid any logic changes; style-only cleanups (like removing inline comments) may be deferred to follow-ups to keep scope tight.
Learnt from: ggazzo
Repo: RocketChat/Rocket.Chat PR: 35995
File: apps/meteor/app/api/server/v1/rooms.ts:1107-1112
Timestamp: 2026-02-23T17:53:18.785Z
Learning: In Rocket.Chat PR reviews, maintain strict scope boundaries—when a PR is focused on a specific endpoint (e.g., rooms.favorite), avoid reviewing or suggesting changes to other endpoints that were incidentally refactored (e.g., rooms.invite) unless explicitly requested by maintainers.
📚 Learning: 2025-09-22T21:37:05.164Z

Applied to files:

  • .changeset/optimize-health-probes-for-k8s.md
  • apps/meteor/server/routes/health.ts
📚 Learning: 2026-02-24T19:09:09.561Z

Applied to files:

  • .changeset/optimize-health-probes-for-k8s.md
📚 Learning: 2026-02-24T19:05:56.710Z
Learnt from: ahmed-n-abdeltwab
Repo: RocketChat/Rocket.Chat PR: 0
File: :0-0
Timestamp: 2026-02-24T19:05:56.710Z
Learning: In Rocket.Chat PRs, keep feature PRs free of unrelated lockfile-only dependency bumps; prefer reverting lockfile drift or isolating such bumps into a separate "chore" commit/PR, and always use yarn install --immutable with the Yarn version pinned in package.json via Corepack.

Applied to files:

  • .changeset/optimize-health-probes-for-k8s.md
🔇 Additional comments (3)
.changeset/optimize-health-probes-for-k8s.md (1)

1-9: Changeset summary is clear and release-friendly.

This correctly communicates the probe behavior split and expected operational benefit.

apps/meteor/server/routes/health.ts (2)

41-42: Nice refactor: shared histogram + readiness-specific naming improves clarity.

The performReadinessChecks() rename and single histogram usage reduce duplication while keeping readiness checks explicit.

Also applies to: 93-100, 134-134


120-125: The current /livez endpoint design is intentional and follows standard Kubernetes probe patterns. The code explicitly separates liveness (lightweight process check) from readiness (comprehensive dependency checks), with comments documenting this design. This architecture is correct—liveness probes should be minimal to avoid cascading failures, while readiness probes handle resource/dependency validation. The review comment incorrectly assumes /livez should mirror /readyz, but they are intentionally different by design.

Likely an incorrect or invalid review comment.

@geekgonecrazy geekgonecrazy changed the title from "Fix: Simplify and remove load from /livez" to "fix: Simplify and remove load from /livez" Feb 25, 2026

codecov Bot commented Feb 25, 2026

Codecov Report

❌ Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 70.83%. Comparing base (8d73ce5) to head (259eceb).
⚠️ Report is 89 commits behind head on develop.

Additional details and impacted files


@@             Coverage Diff             @@
##           develop   #39070      +/-   ##
===========================================
+ Coverage    70.65%   70.83%   +0.18%     
===========================================
  Files         3190     3193       +3     
  Lines       112732   113257     +525     
  Branches     20431    20584     +153     
===========================================
+ Hits         79649    80225     +576     
+ Misses       31035    30981      -54     
- Partials      2048     2051       +3     
| Flag | Coverage Δ |
| --- | --- |
| unit | 71.57% <ø> (+0.29%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.


@geekgonecrazy geekgonecrazy marked this pull request as draft February 26, 2026 01:23
Comment thread apps/meteor/server/routes/health.ts Outdated

Labels

community, type: feature


3 participants