feat: add loop and retry for dcgm detection#180
Conversation
Signed-off-by: Amber Xue <ambermingxin@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughConstructors now register DCGM watches/fields when a DCGM instance object exists even if DCGM is not yet connected. A reconnectingInstance was added to remember and replay deferred registrations, run periodic reconnect attempts, and allow polling/health caches to tolerate DCGM being temporarily unavailable. Changes
Sequence Diagram(s)sequenceDiagram
participant App as Application
participant RI as ReconnectingInstance
participant CI as ConnectedInstance
participant DCGM as DCGM Library
App->>RI: AddHealthWatch / AddFieldsToWatch / AddEntityGroup
RI->>RI: Record deferred registrations (watches, fields, entities)
Note over RI: Background reconnect loop (periodic)
RI->>DCGM: Attempt Init / Connect
alt DCGM available
RI->>CI: Create connected instance
RI->>CI: Replay deferred watches, fields, entities
CI->>DCGM: Register health watches and fields/groups
RI->>RI: Swap active instance, run reconnect callbacks
else DCGM unavailable
RI->>RI: Keep deferred registrations, retry later
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/power/component.go (1)
82-83: Update the guard comment to match current behavior.Line 82 says “Only initialize if DCGM is available”, but the condition at Line 83 now checks only that the instance is non-nil. Consider rewording to avoid confusion during maintenance.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/power/component.go` around lines 82 - 83, The comment above the if check is outdated; update the guard comment to accurately describe the current condition that is being checked (c.dcgmInstance != nil) so it doesn't claim full DCGM availability when only a non-nil instance is required. Edit the comment immediately preceding the if that references c.dcgmInstance to something like "Initialize only when dcgmInstance is present (non-nil)" or "Proceed if c.dcgmInstance is non-nil" so it directly matches the behavior in component.go and the if (c.dcgmInstance != nil) guard.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache.go`:
- Around line 259-266: The code currently treats a zero group handle
(inst.GetGroupHandle() -> groupHandle.GetHandle() == 0) as a permanent error
which gets stored in lastError; instead treat this as a transient
bootstrap/reconnect condition so it doesn't surface as persistent unhealthy.
Change the zero-handle branch so it returns a transient error (e.g., a specific
sentinel like errTransientGroupNotReady or wrap the error so
IsTransientError(...) returns true) rather than a normal non-transient error;
update IsTransientError to recognize that sentinel/message if needed. Ensure the
change is applied where groupHandle.GetHandle() is checked (near
inst.GetGroupHandle()) so dcgm.HealthCheck is only called when handle != 0 and
transient errors are not persisted to lastError.
---
Nitpick comments:
In
`@third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/power/component.go`:
- Around line 82-83: The comment above the if check is outdated; update the
guard comment to accurately describe the current condition that is being checked
(c.dcgmInstance != nil) so it doesn't claim full DCGM availability when only a
non-nil instance is required. Edit the comment immediately preceding the if that
references c.dcgmInstance to something like "Initialize only when dcgmInstance
is present (non-nil)" or "Proceed if c.dcgmInstance is non-nil" so it directly
matches the behavior in component.go and the if (c.dcgmInstance != nil) guard.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 1b0aef71-ca4d-41a8-849c-d990a872825f
📒 Files selected for processing (13)
third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/clock/component.gothird_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/inforom/component.gothird_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/mem/component.gothird_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/nvlink/component.gothird_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/nvswitch/component.gothird_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/pcie/component.gothird_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/power/component.gothird_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/thermal/component.gothird_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/utilization/component.gothird_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/field_cache.gothird_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache.gothird_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.gothird_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 00843a2478
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Signed-off-by: Amber Xue <ambermingxin@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (2)
third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache_test.go (1)
46-49: Optional: strengthen the post-Poll()assertion.Right now this test only checks
err == nil. You can also assert default PASS + no incidents to lock expected fallback semantics.Proposed assertion extension
- _, _, err := hc.GetResult(dcgm.DCGM_HEALTH_WATCH_PCIE) + health, incidents, err := hc.GetResult(dcgm.DCGM_HEALTH_WATCH_PCIE) if err != nil { t.Fatalf("GetResult() should not expose transient group-not-ready as lastError, got: %v", err) } + if health != dcgm.DCGM_HEALTH_RESULT_PASS { + t.Fatalf("expected default PASS result, got: %v", health) + } + if len(incidents) != 0 { + t.Fatalf("expected no incidents for default result, got: %d", len(incidents)) + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache_test.go` around lines 46 - 49, After calling hc.GetResult(dcgm.DCGM_HEALTH_WATCH_PCIE) capture the returned result and strengthen the assertion: assert result.Status == dcgm.DCGM_HEALTH_RESULT_PASS and assert there are no incidents (e.g., len(result.Incidents) == 0) so the test verifies the default PASS fallback semantics; update the t.Fatalf messages on failure to include the actual result for easier debugging.third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache.go (1)
141-145: Consider lowering transient retry log level to avoid periodic noise.This path can run every poll interval during warmup/reconnect. Using debug (or throttling) would keep routine transient retries quieter.
Proposed tweak
- log.Logger.Infow("DCGM transient error, will retry", + log.Logger.Debugw("DCGM transient error, will retry", "component", "health_cache", "error", err)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache.go` around lines 141 - 145, The transient-error branch in health_cache (the block checking errors.Is(err, errTransientGroupNotReady) || IsTransientError(err)) is using log.Logger.Infow and should be made quieter to avoid poll-interval noise; change that log call to a debug-level call (e.g., log.Logger.Debugw) or wrap it with a rate/throttle so transient retries during warmup/reconnect are not repeatedly logged at info level, keeping the same structured fields ("component","health_cache","error", err) and leaving the return nil behavior unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In
`@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache_test.go`:
- Around line 46-49: After calling hc.GetResult(dcgm.DCGM_HEALTH_WATCH_PCIE)
capture the returned result and strengthen the assertion: assert result.Status
== dcgm.DCGM_HEALTH_RESULT_PASS and assert there are no incidents (e.g.,
len(result.Incidents) == 0) so the test verifies the default PASS fallback
semantics; update the t.Fatalf messages on failure to include the actual result
for easier debugging.
In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache.go`:
- Around line 141-145: The transient-error branch in health_cache (the block
checking errors.Is(err, errTransientGroupNotReady) || IsTransientError(err)) is
using log.Logger.Infow and should be made quieter to avoid poll-interval noise;
change that log call to a debug-level call (e.g., log.Logger.Debugw) or wrap it
with a rate/throttle so transient retries during warmup/reconnect are not
repeatedly logged at info level, keeping the same structured fields
("component","health_cache","error", err) and leaving the return nil behavior
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 380d9e6d-db9d-41f8-a95b-b442c2c5e013
📒 Files selected for processing (2)
third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache.gothird_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache_test.go
|
Need to test this before merge. Need to verify after the dcgm available, is agent recreate the DCGM group and fieldgroup. |
Code Review — Medium Issues
[Medium]
|
in current code server.go already uses |
Signed-off-by: Amber Xue <ambermingxin@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/field_cache.go (1)
90-133:⚠️ Potential issue | 🟠 MajorDon't retain a destroyed field-group handle after setup failure.
Line 118 stores
fc.fieldGroupIDbeforeWatchFieldsWithGroupEx(...)succeeds. If that watch call fails, Line 127 destroys the handle, but the cached handle stays non-zero, so future calls return early at Line 90 and never retry setup.Proposed fix
fieldGroupID, err := dcgm.FieldGroupCreate(fieldGroupName, watchedFields) if err != nil { setupErr := fmt.Errorf("failed to create DCGM field group: %w", err) // Store error so GetResult() returns it to components fc.mu.Lock() fc.lastError = setupErr fc.mu.Unlock() return setupErr } - fc.fieldGroupID = fieldGroupID updateFreqMicroseconds := int64(fc.pollInterval / time.Microsecond) maxKeepAge := fc.pollInterval.Seconds() * 2 maxKeepSamples := int32(3) err = dcgm.WatchFieldsWithGroupEx(fieldGroupID, fc.instance.GetGroupHandle(), updateFreqMicroseconds, maxKeepAge, maxKeepSamples) if err != nil { dcgm.FieldGroupDestroy(fieldGroupID) setupErr := fmt.Errorf("failed to set up DCGM field watching: %w", err) // Store error so GetResult() returns it to components fc.mu.Lock() fc.lastError = setupErr fc.mu.Unlock() return setupErr } + fc.fieldGroupID = fieldGroupID🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/field_cache.go` around lines 90 - 133, The code stores fc.fieldGroupID before WatchFieldsWithGroupEx succeeds, so if WatchFieldsWithGroupEx fails the handle is destroyed but fc.fieldGroupID remains non-zero and prevents retries; fix by only assigning fc.fieldGroupID after WatchFieldsWithGroupEx returns nil, or if you keep the current order, ensure you clear fc.fieldGroupID (set to zero) while holding fc.mu immediately after dcgm.FieldGroupDestroy(fieldGroupID) and before returning the setup error, and also ensure lastError is set under the same lock so future GetResult/retries behave correctly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/field_cache_test.go`:
- Around line 97-105: The test currently uses a non-fatal assert.NotNil for
mockInstance.callback then unconditionally calls mockInstance.callback(), which
can cause a panic; make the check fatal so the test fails cleanly if the
callback isn't registered (e.g. replace assert.NotNil(t, mockInstance.callback,
...) with a fatal assertion such as require.NotNil(t, mockInstance.callback,
"reconnect callback should be registered") or an explicit if
mockInstance.callback == nil { t.Fatalf("reconnect callback should be
registered") }) before the subsequent fc.mu.Lock()/mockInstance.callback()
invocation; keep the rest of the setup (fc.lastError, fc.lastUpdate, fc.values)
unchanged.
In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go`:
- Around line 512-535: The loop never re-enters reconnect mode because
DCGMExists() returns a static dcgmExists flag set at creation; change the logic
so reconnectLoop observes the actual runtime DCGM availability instead of that
static flag—either make DCGMExists() query the underlying live instance state
(or remove the cached dcgmExists and update it on disconnects) and then have
reconnectLoop call that dynamic check each tick, using the existing symbols
reconnectLoop, DCGMExists, reconnectNow, reconnectInterval, stopCh and the
dcgmExists field to locate and update the code paths.
- Around line 548-557: The reconnect swap is not atomic with old-instance
teardown and reconnects can race with Shutdown/reconnectNow; modify the logic
around inst.current, inst.currentMu, runReconnectCallbacks(), reconnectNow() and
Shutdown() so the swap + callback + previousInst.Shutdown() happen as a single
serialized operation and in-flight reconnects are aborted once Shutdown begins:
add a small synchronization token (e.g. a generation counter or a
reconnectInProgress flag guarded by currentMu or a new reconnectMu) that
reconnectNow() reads and compares before writing inst.current, increment that
token when starting the serialized swap, perform inst.current = connectedInst,
call inst.runReconnectCallbacks(), then call previousInst.Shutdown() while still
owning the token so no other reconnect can install a new current mid-teardown,
and have Shutdown() set a "shuttingDown" marker (or increment the token) so any
concurrent reconnectNow() detects this and aborts early; update reconnectNow(),
Shutdown(), and the swap site (where previousInst is captured) to use this
token/flag to ensure atomic swap+teardown and to abort in-flight reconnects.
---
Outside diff comments:
In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/field_cache.go`:
- Around line 90-133: The code stores fc.fieldGroupID before
WatchFieldsWithGroupEx succeeds, so if WatchFieldsWithGroupEx fails the handle
is destroyed but fc.fieldGroupID remains non-zero and prevents retries; fix by
only assigning fc.fieldGroupID after WatchFieldsWithGroupEx returns nil, or if
you keep the current order, ensure you clear fc.fieldGroupID (set to zero) while
holding fc.mu immediately after dcgm.FieldGroupDestroy(fieldGroupID) and before
returning the setup error, and also ensure lastError is set under the same lock
so future GetResult/retries behave correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: fbb47656-c90b-4548-923c-ea417dd9bb06
📒 Files selected for processing (4)
third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/field_cache.gothird_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/field_cache_test.gothird_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.gothird_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go
There was a problem hiding this comment.
🧹 Nitpick comments (1)
third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go (1)
280-287: Minor: Direct struct initialization bypasses constructor validation.The test directly instantiates
reconnectingInstance{}to avoid starting the reconnect loop, which is reasonable for this test. However, this pattern could become fragile if the struct adds required fields in the future.Consider adding a comment explaining why this is done, or creating a test-only constructor that doesn't start the goroutine.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go` around lines 280 - 287, The test constructs reconnectingInstance directly which bypasses constructor validation; either add a brief comment above this instantiation explaining that the direct struct init is intentional to avoid starting the reconnect goroutine (and must be kept in sync if fields are added), or provide and use a test-only constructor like NewTestReconnectingInstance that returns a pre-initialized reconnectingInstance without launching the reconnect loop; reference the reconnectingInstance type, the NewNoOp() call used for current, and the Shutdown() method to ensure any new constructor or comment clarifies identical field initialization (watchedFields, groupEntities, reconnectInterval, stopCh) and the reason for not starting the background goroutine.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go`:
- Around line 280-287: The test constructs reconnectingInstance directly which
bypasses constructor validation; either add a brief comment above this
instantiation explaining that the direct struct init is intentional to avoid
starting the reconnect goroutine (and must be kept in sync if fields are added),
or provide and use a test-only constructor like NewTestReconnectingInstance that
returns a pre-initialized reconnectingInstance without launching the reconnect
loop; reference the reconnectingInstance type, the NewNoOp() call used for
current, and the Shutdown() method to ensure any new constructor or comment
clarifies identical field initialization (watchedFields, groupEntities,
reconnectInterval, stopCh) and the reason for not starting the background
goroutine.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5939284a-37dd-40cd-b4bd-c5aba2c87cf0
📒 Files selected for processing (2)
third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.gothird_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go
Signed-off-by: Amber Xue <ambermingxin@nvidia.com>
f66d7fb to
66db289
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go (1)
696-704: Minor style note: defer placement.The pattern of placing
defer inst.currentMu.RUnlock()after capturing the value (line 699) is unconventional but functionally correct. The more idiomatic Go pattern places defer immediately after lock acquisition:inst.currentMu.RLock() defer inst.currentMu.RUnlock() currentInst := inst.currentThis is consistent across multiple methods in this file, so it may be intentional. Not a correctness issue.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go` around lines 696 - 704, The defer for unlocking should be placed immediately after acquiring the read lock for idiomatic clarity: in reconnectingInstance.DCGMExists(), call inst.currentMu.RLock() then defer inst.currentMu.RUnlock() and only then read currentInst := inst.current; update other similar methods in this file to follow the same pattern (look for methods on reconnectingInstance that use currentMu.RLock/RUnlock and current) to keep style consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go`:
- Around line 696-704: The defer for unlocking should be placed immediately
after acquiring the read lock for idiomatic clarity: in
reconnectingInstance.DCGMExists(), call inst.currentMu.RLock() then defer
inst.currentMu.RUnlock() and only then read currentInst := inst.current; update
other similar methods in this file to follow the same pattern (look for methods
on reconnectingInstance that use currentMu.RLock/RUnlock and current) to keep
style consistent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: bcc28f22-54a2-444f-b739-81d44c4f9473
📒 Files selected for processing (2)
third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.gothird_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go
✅ Files skipped from review due to trivial changes (1)
- third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go
Signed-off-by: Amber Xue <ambermingxin@nvidia.com>
Description
third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go.FieldValueCacheinfield_cache.goto support late availability:HealthCachepolling but reduce noisy warnings when DCGM is simply not loaded yet; reserve warning/error for real API failures.Note
The current implementation allows minimal invasive change to existing component flow, and it works for startup races and late DCGM readiness. It also aturally supports replays when DCGM reconnects (not just first startup).
Summary by CodeRabbit
Improvements
Tests