
Design doc to introduce k8s data store for health events#787

Merged
lalitadithya merged 5 commits into NVIDIA:main from yavinash007:k8s-ds-design-doc
Feb 10, 2026
Conversation

@yavinash007 (Contributor) commented Feb 2, 2026

Summary

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • Documentation
    • Added ADR describing a Kubernetes Custom Resource–backed datastore option for HealthEvent with a Kubernetes-native API and storage backing.
    • Describes a proto-first generation approach, unified spec/status handling, watch-driven ingestion, and controller-driven status updates for remediation.
    • Documents CR lifecycle (TTL, per-node/per-cluster limits), single-active-backend deployment model, versioning, phased rollout, migration coordination, and operational tooling.

copy-pr-bot bot commented Feb 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot commented Feb 2, 2026

📝 Walkthrough

Adds ADR-027: a Kubernetes-native HealthEvent CRD datastore (proto-first CRD + generated Go types). Moves HealthEventStatus into proto, introduces a pluggable CR-backed datastore alongside existing backends, describes watch-driven controllers for reconciliation/status updates, CR cleanup (TTL / limits), rollout, and migration notes.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Design Documentation**<br>`docs/designs/027-kubernetes-data-store.md` | Adds ADR-027 describing a Kubernetes-native HealthEvent CRD datastore: proto-first CRD and Go type generation, migration of HealthEventStatus into proto, CR-backed pluggable backend option (coexists with MongoDB/Postgres), watch-based controllers updating status and driving remediation, CR cleanup via TTL and per-node/cluster limits, rollout/versioning, and tooling/migration guidance. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Monitor as Health Monitor
    participant Connector as Platform Connector
    participant K8sAPI as Kubernetes API / CRD
    participant Controller as HealthEvent Controller
    participant Store as Cluster Store (CRs)

    Monitor->>Connector: Emit HealthEvent (proto spec)
    Connector->>K8sAPI: Create/Update HealthEvent CR (spec)
    K8sAPI-->>Store: Persist CR object
    Controller->>K8sAPI: Watch HealthEvent CRs (informers)
    K8sAPI-->>Controller: Deliver add/update/delete events
    Controller->>Store: Read CR spec/status
    Controller->>Controller: Reconcile desired state / remediation
    Controller->>K8sAPI: Update HealthEvent.status
    K8sAPI-->>Store: Persist status update
    Controller->>K8sAPI: Optionally delete CR (TTL / limits)
    K8sAPI-->>Store: Remove CR
```
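The walkthrough and diagram above imply a CR shape roughly like the following. Every group, version, and field name below is illustrative only; the actual schema is generated from the proto definitions described in ADR-027:

```yaml
# Hypothetical HealthEvent CR; apiVersion, labels, and field names are
# assumptions for illustration, not the schema decided in ADR-027.
apiVersion: nvsentinel.nvidia.com/v1alpha1
kind: HealthEvent
metadata:
  name: node-a100-01-gpu0-xid79
  labels:
    nvsentinel.nvidia.com/node: node-a100-01
spec:                          # written by the platform connector
  node: node-a100-01
  component: gpu
  check: xid-error
  severity: critical
  message: "XID 79 (GPU has fallen off the bus)"
status:                        # owned by the HealthEvent controller
  state: remediating           # e.g. observed -> remediating -> resolved
  lastTransitionTime: "2026-02-02T10:15:00Z"
```

Splitting monitor-written `spec` from controller-written `status` matches the unified spec/status handling the ADR describes, and lets RBAC restrict each writer to its own subresource.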

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰
I hopped through proto lines so bright,
CRDs sprouted in the night,
Watches chatter, statuses peep,
TTLs tidy up the heap,
Remediation dreams take flight 🥕

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: introducing a Kubernetes data store for health events, which is the primary focus of the added design document. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


Review comment (Contributor):
link to protoc-crd-gen repo


### CR Cleanup / Garbage Collection:
To prevent uncontrolled growth of HealthEvent CRs and reduce load on the Kubernetes API server, the Kubernetes store implementation includes:
- Optional TTL / age-based deletion – CRs older than a configured threshold can be automatically removed if still unresolved, preventing indefinite growth.
Reply (Contributor):

Thinking about it, maybe just the per-node and per-cluster limits are okay and TTL is not needed.

@ivelichkovich (Contributor) commented:

Overall looks good and well written to me, thank you!


### Mitigations

- Event deduplication and rate limiting: Only create CRs for unique unhealthy events or updates to existing events.
Reply (Contributor):

Not sure if this belongs in the doc, but this can also be mitigated by partitioning etcd; that is outside the scope of NVSentinel, though.
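The deduplication mitigation quoted above can be sketched as a pure decision function. The identity fields (`node`, `component`, `check`) and the `plan_write` helper are assumptions for illustration, not NVSentinel's actual API:

```python
import hashlib


def fingerprint(event: dict) -> str:
    """Stable identity for a health event (hypothetical fields).

    Deliberately excludes volatile fields like message or timestamp so
    that repeats of the same underlying fault map to one fingerprint.
    """
    key = f'{event["node"]}|{event["component"]}|{event["check"]}'
    return hashlib.sha256(key.encode()).hexdigest()[:16]


def plan_write(event: dict, existing: dict) -> str:
    """Decide whether an incoming event needs a CR write.

    `existing` maps fingerprint -> last stored event dict.
    Returns "create" for a new fault, "update" when details changed,
    or "skip" for an exact duplicate (no API server write at all).
    """
    fp = fingerprint(event)
    prev = existing.get(fp)
    if prev is None:
        return "create"
    if prev == event:
        return "skip"
    return "update"
```

Gating writes this way keeps bursty, repetitive monitor output from translating into bursty load on the Kubernetes API server.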

Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
@yavinash007 yavinash007 marked this pull request as ready for review February 3, 2026 01:40
@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@docs/designs/027-kubernetes-data-store.md`:
- Line 32: The doc text mistakenly names the CRD generator as "protoc-crd-gen"
while the linked repo is "protoc-gen-crd"; update the mention of the tool in the
sentence to "protoc-gen-crd" (the same spelling as the GitHub link) so the name
and URL match, and verify the surrounding sentence referencing HealthEvent spec
remains accurate after the rename.
🧹 Nitpick comments (2)
docs/designs/027-kubernetes-data-store.md (2)

40-40: Consider API versioning strategy earlier rather than deferring to v1 graduation.

While it's acceptable to iterate freely in v1alpha1, establishing basic versioning and conversion patterns early can prevent costly migrations later. Even alpha APIs benefit from thinking about field deprecation, backward compatibility, and storage version strategy upfront.


63-63: Consider adding more detail about proto migration coordination strategy.

Moving HealthEventStatus from a Go struct into proto affects multiple components (health monitors, controllers, store-client, converters). While the document correctly flags this coordination requirement, consider adding a brief migration strategy or phased approach to reduce risk—for example, maintaining compatibility shims during transition or coordinating the change across a specific set of PRs.

Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@docs/designs/027-kubernetes-data-store.md`:
- Line 83: The two statements conflict: "Historic view via kube-state-metrics"
and "Limited querying capabilities" should be reconciled by clarifying that
kube-state-metrics exposes time-series metrics for basic trend monitoring and
observability (i.e., simple historical views), whereas CRs alone do not provide
full-featured historical/event query capabilities or ad-hoc complex queries;
update the "Historic view via kube-state-metrics" paragraph to explicitly say it
enables basic trend/time-series monitoring (not ad-hoc querying) and add a short
clarifying sentence to the "Limited querying capabilities" section referencing
that distinction.
- Around line 99-104: The document currently contradicts itself by listing
hand-managed CRD structs (HealthEventSpec, HealthEventStatus,
BehaviourOverrides, Entity) and controller-gen usage as "Under Consideration"
while the Decision section already claims a proto-first Kubernetes-native
HealthEvent API; update the prose to be consistent by either changing the
Decision section to state both approaches are still under evaluation (so reword
the Decision to reflect ongoing consideration of hand-managed structs vs
proto-first), or change the Alternatives/Under Consideration paragraph to
past-tense and note that the hand-managed approach (including files under
./data-models, conversion implementation in
pkg/conversion/healthevent_conversion.go and tests in
healthevent_conversion_test.go, and Helm packaging at
./distros/kubernetes/nvsentinel/charts/kubernetes-store) was considered but
rejected in favor of the proto-first/controller-gen approach with
HealthEventStatus kept as a handwritten struct only where applicable; ensure
mentions of HealthEventStatus and controller-gen align with the final decision.
🧹 Nitpick comments (2)
docs/designs/027-kubernetes-data-store.md (2)

67-68: Clarify the behavior when CR limits are reached.

The document mentions "enforcement of configurable per-node or per-cluster CR limits" but doesn't specify what happens when these limits are reached. Will new CRs be rejected? Will oldest CRs be evicted? This operational behavior should be documented to help operators understand the implications.

Consider adding a brief statement about the enforcement mechanism, for example:

  • "When limits are reached, the oldest CRs are automatically removed to make room for new events"
  • "When limits are reached, new CR creation is rate-limited or delayed"

113-113: Add link to the referenced datastore abstraction design document.

The "NVSentinel Datastore Abstraction Design Document" is referenced multiple times (lines 9, 113) but no link is provided. Adding a link or file path would help readers access the broader context.

… of other places too.

Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@docs/designs/027-kubernetes-data-store.md`:
- Line 83: Update the ambiguous wording by changing the "Historic view via
kube-state-metrics" bullet to explicitly say it provides basic time-series trend
monitoring and observability (e.g., "Health events stored as CRs can be scraped
and exposed through Kubernetes-native monitoring tools like kube-state-metrics,
enabling basic time-series trend monitoring and observability without additional
infrastructure") and update the "Limited querying capabilities" sentence to
clarify it refers to complex ad‑hoc queries and full-featured historical
analysis that CRs alone cannot support (e.g., add "Complex ad-hoc queries and
full-featured historical analysis (beyond basic time-series metrics) are not
feasible using only CRs").
- Line 36: Rename the handwritten Go file "health_event_extentions.go" to
"health_event_extensions.go" and update all references/imports/usages to the
filename (e.g., any build tags, go:generate directives, package-level comments,
or CI scripts) so the compiler and tooling pick up the file; ensure the type
HealthEventStatus and any functions/types defined there are still
exported/accessible and run go build/test to verify no references remain to the
misspelled "extentions" name.

Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@docs/designs/027-kubernetes-data-store.md`:
- Around line 65-68: Update the "CR Cleanup / Garbage Collection" section to
clarify TTL semantics for HealthEvent CRs: specify that age-based deletion only
applies to events in a confirmed terminal/resolved state (or after an explicit
retention window following resolution), and/or require an explicit "archived"
flag before deletion; alternatively mandate archiving unresolved events to an
external store before any deletion. Mention configurable guardrails (e.g.,
require terminal state + retentionSeconds, per-node/per-cluster maximums still
respected, and an opt-in flag to enable deletion) and add a short example
sentence referencing "HealthEvent CRs" and the "CR Cleanup / Garbage Collection"
policy to make the rule and exceptions explicit.

Comment on lines +65 to +68
### CR Cleanup / Garbage Collection:
To prevent uncontrolled growth of HealthEvent CRs and reduce load on the Kubernetes API server, the Kubernetes store implementation includes:
- Optional TTL / age-based deletion – CRs older than a configured threshold can be automatically removed if still unresolved, preventing indefinite growth.
- Enforcement of configurable per-node or per-cluster CR limits – ensures the number of CRs remains manageable even during bursty failure periods or in large clusters.
Review comment (Contributor):

⚠️ Potential issue | 🟠 Major

Clarify TTL behavior to avoid dropping unresolved events.

Deleting unresolved HealthEvent CRs can break remediation workflows and reduce auditability. Please specify guardrails (e.g., only delete resolved events, or require explicit terminal state + retention window, or archive to another store before deletion).

📌 Suggested wording

```diff
-- Optional TTL / age-based deletion – CRs older than a configured threshold can be automatically removed if still unresolved, preventing indefinite growth.
+- Optional TTL / age-based deletion – CRs older than a configured threshold can be removed only after reaching a terminal (resolved/failed) state, or after archiving to another backend to preserve auditability.
```
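The guarded cleanup policy argued for in this thread (TTL and limits that never drop unresolved events) can be sketched as a pure decision function. The event fields, state names, and function signature are hypothetical, not the ADR's final design:

```python
from datetime import datetime, timedelta

# Hypothetical terminal states; only these are eligible for cleanup.
TERMINAL_STATES = {"resolved", "failed"}


def crs_to_delete(events, ttl, per_node_limit, now):
    """Pick HealthEvent CR names eligible for cleanup.

    `events` is a list of dicts with hypothetical keys:
    name, node, state, created (datetime).
    TTL applies only to terminal events; per-node limits evict the
    oldest terminal events first and never drop unresolved ones.
    """
    doomed = set()
    # TTL pass: only terminal events older than the threshold.
    for e in events:
        if e["state"] in TERMINAL_STATES and now - e["created"] > ttl:
            doomed.add(e["name"])
    # Per-node limit pass: evict oldest terminal survivors beyond the cap.
    by_node = {}
    for e in events:
        if e["name"] not in doomed:
            by_node.setdefault(e["node"], []).append(e)
    for node_events in by_node.values():
        overflow = len(node_events) - per_node_limit
        if overflow <= 0:
            continue
        for e in sorted(node_events, key=lambda e: e["created"]):
            if overflow == 0:
                break
            if e["state"] in TERMINAL_STATES:  # never evict unresolved
                doomed.add(e["name"])
                overflow -= 1
    return sorted(doomed)
```

Note that with this shape a node saturated with unresolved events can still exceed its cap; the sketch deliberately prefers exceeding the limit over breaking remediation, which is the trade-off the review comment asks the ADR to make explicit.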

@lalitadithya (Collaborator) commented Feb 5, 2026

hey @yavinash007 , is this PR ready for a final review or are you still working on changes?

@yavinash007 (Contributor, Author) commented Feb 5, 2026

> hey @yavinash007, is this PR ready for a final review or are you still working on changes?

@lalitadithya This is ready for the final review. I left the TTL comment open just to gather more comments on it, if any.

@ivelichkovich (Contributor) commented:

> hey @yavinash007, is this PR ready for a final review or are you still working on changes?

We're happy to jump on a call as well if that'll help the review process.

@lalitadithya (Collaborator) commented:

sorry for the delay in review. lgtm!

@lalitadithya (Collaborator) commented:

/ok to test d9e0769

@lalitadithya merged commit c456409 into NVIDIA:main on Feb 10, 2026
62 checks passed