Design doc to introduce k8s data store for health events (#787)
lalitadithya merged 5 commits into NVIDIA:main
Conversation
📝 Walkthrough

Adds ADR-027: a Kubernetes-native HealthEvent CRD datastore (proto-first CRD + generated Go types). Moves HealthEventStatus into proto, introduces a pluggable CR-backed datastore alongside existing backends, and describes watch-driven controllers for reconciliation/status updates, CR cleanup (TTL / limits), rollout, and migration notes.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Monitor as Health Monitor
    participant Connector as Platform Connector
    participant K8sAPI as Kubernetes API / CRD
    participant Controller as HealthEvent Controller
    participant Store as Cluster Store (CRs)
    Monitor->>Connector: Emit HealthEvent (proto spec)
    Connector->>K8sAPI: Create/Update HealthEvent CR (spec)
    K8sAPI-->>Store: Persist CR object
    Controller->>K8sAPI: Watch HealthEvent CRs (informers)
    K8sAPI-->>Controller: Deliver add/update/delete events
    Controller->>Store: Read CR spec/status
    Controller->>Controller: Reconcile desired state / remediation
    Controller->>K8sAPI: Update HealthEvent.status
    K8sAPI-->>Store: Persist status update
    Controller->>K8sAPI: Optionally delete CR (TTL / limits)
    K8sAPI-->>Store: Remove CR
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Force-pushed 4cc3d89 to 0b21248
link to protoc-crd-gen repo
> ### CR Cleanup / Garbage Collection:
> To prevent uncontrolled growth of HealthEvent CRs and reduce load on the Kubernetes API server, the Kubernetes store implementation includes:
> - Optional TTL / age-based deletion – CRs older than a configured threshold can be automatically removed if still unresolved, preventing indefinite growth.
Thinking about it: maybe just the per-node and per-cluster limits are okay and TTL is not needed.
Overall looks good and well written to me, thank you!
> ### Mitigations
> - Event deduplication and rate limiting: Only create CRs for unique unhealthy events or updates to existing events.
Not sure if this belongs in the doc, but this can also be mitigated by partitioning etcd; that's outside the scope of NVSentinel, though.
Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
Force-pushed 0b21248 to 2b1f520
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@docs/designs/027-kubernetes-data-store.md`:
- Line 32: The doc text mistakenly names the CRD generator as "protoc-crd-gen"
while the linked repo is "protoc-gen-crd"; update the mention of the tool in the
sentence to "protoc-gen-crd" (the same spelling as the GitHub link) so the name
and URL match, and verify the surrounding sentence referencing HealthEvent spec
remains accurate after the rename.
🧹 Nitpick comments (2)
docs/designs/027-kubernetes-data-store.md (2)
40-40: Consider API versioning strategy earlier rather than deferring to v1 graduation. While it's acceptable to iterate freely in v1alpha1, establishing basic versioning and conversion patterns early can prevent costly migrations later. Even alpha APIs benefit from thinking about field deprecation, backward compatibility, and storage version strategy upfront.
63-63: Consider adding more detail about proto migration coordination strategy. Moving `HealthEventStatus` from a Go struct into proto affects multiple components (health monitors, controllers, store-client, converters). While the document correctly flags this coordination requirement, consider adding a brief migration strategy or phased approach to reduce risk, for example maintaining compatibility shims during transition or coordinating the change across a specific set of PRs.
Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
Force-pushed 4c68062 to 357ec7b
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docs/designs/027-kubernetes-data-store.md`:
- Line 83: The two statements conflict: "Historic view via kube-state-metrics"
and "Limited querying capabilities" should be reconciled by clarifying that
kube-state-metrics exposes time-series metrics for basic trend monitoring and
observability (i.e., simple historical views), whereas CRs alone do not provide
full-featured historical/event query capabilities or ad-hoc complex queries;
update the "Historic view via kube-state-metrics" paragraph to explicitly say it
enables basic trend/time-series monitoring (not ad-hoc querying) and add a short
clarifying sentence to the "Limited querying capabilities" section referencing
that distinction.
- Around line 99-104: The document currently contradicts itself by listing
hand-managed CRD structs (HealthEventSpec, HealthEventStatus,
BehaviourOverrides, Entity) and controller-gen usage as "Under Consideration"
while the Decision section already claims a proto-first Kubernetes-native
HealthEvent API; update the prose to be consistent by either changing the
Decision section to state both approaches are still under evaluation (so reword
the Decision to reflect ongoing consideration of hand-managed structs vs
proto-first), or change the Alternatives/Under Consideration paragraph to
past-tense and note that the hand-managed approach (including files under
./data-models, conversion implementation in
pkg/conversion/healthevent_conversion.go and tests in
healthevent_conversion_test.go, and Helm packaging at
./distros/kubernetes/nvsentinel/charts/kubernetes-store) was considered but
rejected in favor of the proto-first/controller-gen approach with
HealthEventStatus kept as a handwritten struct only where applicable; ensure
mentions of HealthEventStatus and controller-gen align with the final decision.
🧹 Nitpick comments (2)
docs/designs/027-kubernetes-data-store.md (2)
67-68: Clarify the behavior when CR limits are reached. The document mentions "enforcement of configurable per-node or per-cluster CR limits" but doesn't specify what happens when these limits are reached. Will new CRs be rejected? Will the oldest CRs be evicted? This operational behavior should be documented to help operators understand the implications.
Consider adding a brief statement about the enforcement mechanism, for example:
- "When limits are reached, the oldest CRs are automatically removed to make room for new events"
- "When limits are reached, new CR creation is rate-limited or delayed"
113-113: Add link to the referenced datastore abstraction design document. The "NVSentinel Datastore Abstraction Design Document" is referenced multiple times (lines 9, 113) but no link is provided. Adding a link or file path would help readers access the broader context.
… of other places too. Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
Force-pushed c8414c6 to 6ff7527
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docs/designs/027-kubernetes-data-store.md`:
- Line 83: Update the ambiguous wording by changing the "Historic view via
kube-state-metrics" bullet to explicitly say it provides basic time-series trend
monitoring and observability (e.g., "Health events stored as CRs can be scraped
and exposed through Kubernetes-native monitoring tools like kube-state-metrics,
enabling basic time-series trend monitoring and observability without additional
infrastructure") and update the "Limited querying capabilities" sentence to
clarify it refers to complex ad‑hoc queries and full-featured historical
analysis that CRs alone cannot support (e.g., add "Complex ad-hoc queries and
full-featured historical analysis (beyond basic time-series metrics) are not
feasible using only CRs").
- Line 36: Rename the handwritten Go file "health_event_extentions.go" to
"health_event_extensions.go" and update all references/imports/usages to the
filename (e.g., any build tags, go:generate directives, package-level comments,
or CI scripts) so the compiler and tooling pick up the file; ensure the type
HealthEventStatus and any functions/types defined there are still
exported/accessible and run go build/test to verify no references remain to the
misspelled "extentions" name.
Signed-off-by: Avinash Yeddula <avinash.yeddula@omniva.com>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@docs/designs/027-kubernetes-data-store.md`:
- Around line 65-68: Update the "CR Cleanup / Garbage Collection" section to
clarify TTL semantics for HealthEvent CRs: specify that age-based deletion only
applies to events in a confirmed terminal/resolved state (or after an explicit
retention window following resolution), and/or require an explicit "archived"
flag before deletion; alternatively mandate archiving unresolved events to an
external store before any deletion. Mention configurable guardrails (e.g.,
require terminal state + retentionSeconds, per-node/per-cluster maximums still
respected, and an opt-in flag to enable deletion) and add a short example
sentence referencing "HealthEvent CRs" and the "CR Cleanup / Garbage Collection"
policy to make the rule and exceptions explicit.
> ### CR Cleanup / Garbage Collection:
> To prevent uncontrolled growth of HealthEvent CRs and reduce load on the Kubernetes API server, the Kubernetes store implementation includes:
> - Optional TTL / age-based deletion – CRs older than a configured threshold can be automatically removed if still unresolved, preventing indefinite growth.
> - Enforcement of configurable per-node or per-cluster CR limits – ensures the number of CRs remains manageable even during bursty failure periods or in large clusters.
Clarify TTL behavior to avoid dropping unresolved events.
Deleting unresolved HealthEvent CRs can break remediation workflows and reduce auditability. Please specify guardrails (e.g., only delete resolved events, or require explicit terminal state + retention window, or archive to another store before deletion).
📌 Suggested wording

```diff
-- Optional TTL / age-based deletion – CRs older than a configured threshold can be automatically removed if still unresolved, preventing indefinite growth.
+- Optional TTL / age-based deletion – CRs older than a configured threshold can be removed only after reaching a terminal (resolved/failed) state, or after archiving to another backend to preserve auditability.
```
hey @yavinash007, is this PR ready for a final review or are you still working on changes?
@lalitadithya This is ready for the final review. I left the TTL comment open just to get more comments on it, if any.
We're happy to jump on a call as well if that'll help the review process |
sorry for the delay in review. lgtm!
/ok to test d9e0769 |