
Design doc to introduce k8s data store for health events#787

Merged
lalitadithya merged 5 commits into NVIDIA:main from yavinash007:k8s-ds-design-doc
Feb 10, 2026
Conversation

@yavinash007 (Contributor) commented Feb 2, 2026

Summary

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • Documentation
    • Added ADR describing a Kubernetes Custom Resource–backed datastore option for HealthEvent with a Kubernetes-native API and storage backing.
    • Describes a proto-first generation approach, unified spec/status handling, watch-driven ingestion, and controller-driven status updates for remediation.
    • Documents CR lifecycle (TTL, per-node/per-cluster limits), single-active-backend deployment model, versioning, phased rollout, migration coordination, and operational tooling.

copy-pr-bot bot commented Feb 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot commented Feb 2, 2026

📝 Walkthrough

Adds ADR-027: a Kubernetes-native HealthEvent CRD datastore (proto-first CRD + generated Go types). Moves HealthEventStatus into proto, introduces a pluggable CR-backed datastore alongside existing backends, describes watch-driven controllers for reconciliation/status updates, CR cleanup (TTL / limits), rollout, and migration notes.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Design Documentation**<br>`docs/designs/027-kubernetes-data-store.md` | Adds ADR-027 describing a Kubernetes-native HealthEvent CRD datastore: proto-first CRD and Go type generation, migration of HealthEventStatus into proto, CR-backed pluggable backend option (coexists with MongoDB/Postgres), watch-based controllers updating status and driving remediation, CR cleanup via TTL and per-node/cluster limits, rollout/versioning, and tooling/migration guidance. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Monitor as Health Monitor
    participant Connector as Platform Connector
    participant K8sAPI as Kubernetes API / CRD
    participant Controller as HealthEvent Controller
    participant Store as Cluster Store (CRs)

    Monitor->>Connector: Emit HealthEvent (proto spec)
    Connector->>K8sAPI: Create/Update HealthEvent CR (spec)
    K8sAPI-->>Store: Persist CR object
    Controller->>K8sAPI: Watch HealthEvent CRs (informers)
    K8sAPI-->>Controller: Deliver add/update/delete events
    Controller->>Store: Read CR spec/status
    Controller->>Controller: Reconcile desired state / remediation
    Controller->>K8sAPI: Update HealthEvent.status
    K8sAPI-->>Store: Persist status update
    Controller->>K8sAPI: Optionally delete CR (TTL / limits)
    K8sAPI-->>Store: Remove CR
```
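The walkthrough and diagram above imply a CR shape roughly like the following. Every group, version, and field name below is illustrative only; the actual schema is generated from the proto definitions described in ADR-027:

```yaml
# Hypothetical HealthEvent CR; apiVersion, labels, and field names are
# assumptions for illustration, not the schema decided in ADR-027.
apiVersion: nvsentinel.nvidia.com/v1alpha1
kind: HealthEvent
metadata:
  name: node-a100-01-gpu0-xid79
  labels:
    nvsentinel.nvidia.com/node: node-a100-01
spec:                          # written by the platform connector
  node: node-a100-01
  component: gpu
  check: xid-error
  severity: critical
  message: "XID 79 (GPU has fallen off the bus)"
status:                        # owned by the HealthEvent controller
  state: remediating           # e.g. observed -> remediating -> resolved
  lastTransitionTime: "2026-02-02T10:15:00Z"
```

Splitting monitor-written `spec` from controller-written `status` matches the unified spec/status handling the ADR describes, and lets RBAC restrict each writer to its own subresource.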

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰
I hopped through proto lines so bright,
CRDs sprouted in the night,
Watches chatter, statuses peep,
TTLs tidy up the heap,
Remediation dreams take flight 🥕

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: introducing a Kubernetes data store for health events, which is the primary focus of the added design document. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


Review comment (Contributor):
link to protoc-crd-gen repo


### CR Cleanup / Garbage Collection:
To prevent uncontrolled growth of HealthEvent CRs and reduce load on the Kubernetes API server, the Kubernetes store implementation includes:
- Optional TTL / age-based deletion – CRs older than a configured threshold can be automatically removed if still unresolved, preventing indefinite growth.
Reply (Contributor):

Thinking about it, maybe just the per-node and per-cluster limits are okay and TTL is not needed.

@ivelichkovich (Contributor) commented:

Overall looks good and well written to me, thank you!


### Mitigations

- Event deduplication and rate limiting: Only create CRs for unique unhealthy events or updates to existing events.
Reply (Contributor):

Not sure if this belongs in the doc, but this can also be mitigated by partitioning etcd; that is outside the scope of NVSentinel, though.
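The deduplication mitigation quoted above can be sketched as a pure decision function. The identity fields (`node`, `component`, `check`) and the `plan_write` helper are assumptions for illustration, not NVSentinel's actual API:

```python
import hashlib


def fingerprint(event: dict) -> str:
    """Stable identity for a health event (hypothetical fields).

    Deliberately excludes volatile fields like message or timestamp so
    that repeats of the same underlying fault map to one fingerprint.
    """
    key = f'{event["node"]}|{event["component"]}|{event["check"]}'
    return hashlib.sha256(key.encode()).hexdigest()[:16]


def plan_write(event: dict, existing: dict) -> str:
    """Decide whether an incoming event needs a CR write.

    `existing` maps fingerprint -> last stored event dict.
    Returns "create" for a new fault, "update" when details changed,
    or "skip" for an exact duplicate (no API server write at all).
    """
    fp = fingerprint(event)
    prev = existing.get(fp)
    if prev is None:
        return "create"
    if prev == event:
        return "skip"
    return "update"
```

Gating writes this way keeps bursty, repetitive monitor output from translating into bursty load on the Kubernetes API server.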

Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
@yavinash007 yavinash007 marked this pull request as ready for review February 3, 2026 01:40
@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@docs/designs/027-kubernetes-data-store.md`:
- Line 32: The doc text mistakenly names the CRD generator as "protoc-crd-gen"
while the linked repo is "protoc-gen-crd"; update the mention of the tool in the
sentence to "protoc-gen-crd" (the same spelling as the GitHub link) so the name
and URL match, and verify the surrounding sentence referencing HealthEvent spec
remains accurate after the rename.
🧹 Nitpick comments (2)
docs/designs/027-kubernetes-data-store.md (2)

40-40: Consider API versioning strategy earlier rather than deferring to v1 graduation.

While it's acceptable to iterate freely in v1alpha1, establishing basic versioning and conversion patterns early can prevent costly migrations later. Even alpha APIs benefit from thinking about field deprecation, backward compatibility, and storage version strategy upfront.


63-63: Consider adding more detail about proto migration coordination strategy.

Moving HealthEventStatus from a Go struct into proto affects multiple components (health monitors, controllers, store-client, converters). While the document correctly flags this coordination requirement, consider adding a brief migration strategy or phased approach to reduce risk—for example, maintaining compatibility shims during transition or coordinating the change across a specific set of PRs.

Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@docs/designs/027-kubernetes-data-store.md`:
- Line 83: The two statements conflict: "Historic view via kube-state-metrics"
and "Limited querying capabilities" should be reconciled by clarifying that
kube-state-metrics exposes time-series metrics for basic trend monitoring and
observability (i.e., simple historical views), whereas CRs alone do not provide
full-featured historical/event query capabilities or ad-hoc complex queries;
update the "Historic view via kube-state-metrics" paragraph to explicitly say it
enables basic trend/time-series monitoring (not ad-hoc querying) and add a short
clarifying sentence to the "Limited querying capabilities" section referencing
that distinction.
- Around line 99-104: The document currently contradicts itself by listing
hand-managed CRD structs (HealthEventSpec, HealthEventStatus,
BehaviourOverrides, Entity) and controller-gen usage as "Under Consideration"
while the Decision section already claims a proto-first Kubernetes-native
HealthEvent API; update the prose to be consistent by either changing the
Decision section to state both approaches are still under evaluation (so reword
the Decision to reflect ongoing consideration of hand-managed structs vs
proto-first), or change the Alternatives/Under Consideration paragraph to
past-tense and note that the hand-managed approach (including files under
./data-models, conversion implementation in
pkg/conversion/healthevent_conversion.go and tests in
healthevent_conversion_test.go, and Helm packaging at
./distros/kubernetes/nvsentinel/charts/kubernetes-store) was considered but
rejected in favor of the proto-first/controller-gen approach with
HealthEventStatus kept as a handwritten struct only where applicable; ensure
mentions of HealthEventStatus and controller-gen align with the final decision.
🧹 Nitpick comments (2)
docs/designs/027-kubernetes-data-store.md (2)

67-68: Clarify the behavior when CR limits are reached.

The document mentions "enforcement of configurable per-node or per-cluster CR limits" but doesn't specify what happens when these limits are reached. Will new CRs be rejected? Will oldest CRs be evicted? This operational behavior should be documented to help operators understand the implications.

Consider adding a brief statement about the enforcement mechanism, for example:

  • "When limits are reached, the oldest CRs are automatically removed to make room for new events"
  • "When limits are reached, new CR creation is rate-limited or delayed"

113-113: Add link to the referenced datastore abstraction design document.

The "NVSentinel Datastore Abstraction Design Document" is referenced multiple times (lines 9, 113) but no link is provided. Adding a link or file path would help readers access the broader context.

… of other places too.

Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@docs/designs/027-kubernetes-data-store.md`:
- Line 83: Update the ambiguous wording by changing the "Historic view via
kube-state-metrics" bullet to explicitly say it provides basic time-series trend
monitoring and observability (e.g., "Health events stored as CRs can be scraped
and exposed through Kubernetes-native monitoring tools like kube-state-metrics,
enabling basic time-series trend monitoring and observability without additional
infrastructure") and update the "Limited querying capabilities" sentence to
clarify it refers to complex ad‑hoc queries and full-featured historical
analysis that CRs alone cannot support (e.g., add "Complex ad-hoc queries and
full-featured historical analysis (beyond basic time-series metrics) are not
feasible using only CRs").
- Line 36: Rename the handwritten Go file "health_event_extentions.go" to
"health_event_extensions.go" and update all references/imports/usages to the
filename (e.g., any build tags, go:generate directives, package-level comments,
or CI scripts) so the compiler and tooling pick up the file; ensure the type
HealthEventStatus and any functions/types defined there are still
exported/accessible and run go build/test to verify no references remain to the
misspelled "extentions" name.

Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@docs/designs/027-kubernetes-data-store.md`:
- Around line 65-68: Update the "CR Cleanup / Garbage Collection" section to
clarify TTL semantics for HealthEvent CRs: specify that age-based deletion only
applies to events in a confirmed terminal/resolved state (or after an explicit
retention window following resolution), and/or require an explicit "archived"
flag before deletion; alternatively mandate archiving unresolved events to an
external store before any deletion. Mention configurable guardrails (e.g.,
require terminal state + retentionSeconds, per-node/per-cluster maximums still
respected, and an opt-in flag to enable deletion) and add a short example
sentence referencing "HealthEvent CRs" and the "CR Cleanup / Garbage Collection"
policy to make the rule and exceptions explicit.

Comment on lines +65 to +68
### CR Cleanup / Garbage Collection:
To prevent uncontrolled growth of HealthEvent CRs and reduce load on the Kubernetes API server, the Kubernetes store implementation includes:
- Optional TTL / age-based deletion – CRs older than a configured threshold can be automatically removed if still unresolved, preventing indefinite growth.
- Enforcement of configurable per-node or per-cluster CR limits – ensures the number of CRs remains manageable even during bursty failure periods or in large clusters.
Review comment (Contributor):

⚠️ Potential issue | 🟠 Major

Clarify TTL behavior to avoid dropping unresolved events.

Deleting unresolved HealthEvent CRs can break remediation workflows and reduce auditability. Please specify guardrails (e.g., only delete resolved events, or require explicit terminal state + retention window, or archive to another store before deletion).

📌 Suggested wording

```diff
-- Optional TTL / age-based deletion – CRs older than a configured threshold can be automatically removed if still unresolved, preventing indefinite growth.
+- Optional TTL / age-based deletion – CRs older than a configured threshold can be removed only after reaching a terminal (resolved/failed) state, or after archiving to another backend to preserve auditability.
```
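The guarded cleanup policy argued for in this thread (TTL and limits that never drop unresolved events) can be sketched as a pure decision function. The event fields, state names, and function signature are hypothetical, not the ADR's final design:

```python
from datetime import datetime, timedelta

# Hypothetical terminal states; only these are eligible for cleanup.
TERMINAL_STATES = {"resolved", "failed"}


def crs_to_delete(events, ttl, per_node_limit, now):
    """Pick HealthEvent CR names eligible for cleanup.

    `events` is a list of dicts with hypothetical keys:
    name, node, state, created (datetime).
    TTL applies only to terminal events; per-node limits evict the
    oldest terminal events first and never drop unresolved ones.
    """
    doomed = set()
    # TTL pass: only terminal events older than the threshold.
    for e in events:
        if e["state"] in TERMINAL_STATES and now - e["created"] > ttl:
            doomed.add(e["name"])
    # Per-node limit pass: evict oldest terminal survivors beyond the cap.
    by_node = {}
    for e in events:
        if e["name"] not in doomed:
            by_node.setdefault(e["node"], []).append(e)
    for node_events in by_node.values():
        overflow = len(node_events) - per_node_limit
        if overflow <= 0:
            continue
        for e in sorted(node_events, key=lambda e: e["created"]):
            if overflow == 0:
                break
            if e["state"] in TERMINAL_STATES:  # never evict unresolved
                doomed.add(e["name"])
                overflow -= 1
    return sorted(doomed)
```

Note that with this shape a node saturated with unresolved events can still exceed its cap; the sketch deliberately prefers exceeding the limit over breaking remediation, which is the trade-off the review comment asks the ADR to make explicit.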

@lalitadithya (Collaborator) commented Feb 5, 2026

hey @yavinash007 , is this PR ready for a final review or are you still working on changes?

@yavinash007 (Contributor, Author) commented Feb 5, 2026

> hey @yavinash007, is this PR ready for a final review or are you still working on changes?

@lalitadithya This is ready for the final review. I left the TTL comment open just to gather more comments on it, if any.

@ivelichkovich (Contributor) commented:

> hey @yavinash007, is this PR ready for a final review or are you still working on changes?

We're happy to jump on a call as well if that'll help the review process.

@lalitadithya (Collaborator) commented:

sorry for the delay in review. lgtm!

@lalitadithya (Collaborator) commented:

/ok to test d9e0769

@lalitadithya merged commit c456409 into NVIDIA:main on Feb 10, 2026
62 checks passed